Submitted by _atswi_ t3_11aq4qo in MachineLearning
What's the best way to quantify the uncertainty of a trained LLM? I assume the entropy of the model's final probability distribution is a decent measure. Just wanted to know if the NLP community sticks to this measure, or if there's something more specific to language?
Would really appreciate recent references that may have popped up over the past few months (if any). Pointers to any cool & easy-to-integrate implementations would also be great. Thanks!
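For concreteness, this is roughly what I have in mind by "entropy of the final distribution" - a minimal sketch using HuggingFace `transformers`, with GPT-2 standing in for whatever causal LM you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just a placeholder; any causal LM from the hub should work the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Entropy of the next-token distribution at the last position.
probs = torch.softmax(logits[0, -1], dim=-1)
entropy = -(probs * torch.log(probs + 1e-12)).sum()
print(f"next-token entropy: {entropy.item():.3f} nats")
```

Averaging this per-token entropy over a generated sequence is the simplest sequence-level version I know of, which is why I'm asking whether the NLP community uses something more language-specific.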
pyepyepie t1_j9uanug wrote
In all honesty, at some point any evaluation that isn't qualitative is simply a joke. I observed this a long time ago while working on NMT and trying to base results on BLEU score - it literally meant nothing. Trying to force new metrics based on simple rules or computations will probably fail; I believe we need humans or stronger LLMs in the loop. For example, the same group of humans should rank the outputs of several different language models, not just the new one, so the comparison is consistent. Otherwise I view it as a meaningless, self-promoting paper (LLMs are not interesting enough to read about if there are no new ideas and no better performance). Entropy is fine for language models that are like "me language model me no understand world difficult hard", not for GPT-3-class models.
Edit: the semantic uncertainty approach looks interesting, but I would still rather let humans rank the results.
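For anyone curious, the gist of semantic uncertainty (as I understand it) is: sample several generations, group them into clusters that mean the same thing (e.g. via bidirectional entailment with an NLI model), and compute entropy over the clusters instead of over surface strings. A rough sketch - not the paper's actual implementation, and with the clustering and sequence log-probs assumed to be precomputed:

```python
import math
from collections import defaultdict

def semantic_entropy(seq_logprobs, cluster_ids):
    """Entropy over semantic clusters of sampled generations.

    seq_logprobs: log-probability the model assigned to each sampled sequence.
    cluster_ids:  cluster label per sequence (e.g. from an NLI-based
                  equivalence check) -- assumed to be computed elsewhere.
    """
    # Aggregate sequence probability mass within each semantic cluster.
    cluster_mass = defaultdict(float)
    for lp, cid in zip(seq_logprobs, cluster_ids):
        cluster_mass[cid] += math.exp(lp)

    # Normalize over the sampled set and take entropy over clusters.
    total = sum(cluster_mass.values())
    return -sum((m / total) * math.log(m / total) for m in cluster_mass.values())

# Toy usage: three samples, two of which land in the same semantic cluster.
print(semantic_entropy([-1.2, -1.5, -3.0], ["a", "a", "b"]))
```

Two paraphrases of the same answer then count as one outcome, which is exactly the language-specific twist plain token entropy misses - but I'd still trust human rankings over either number.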