sineiraetstudio

sineiraetstudio t1_jdymf8q wrote

Oh, RLHF absolutely has all sorts of benefits (playing with top-p only makes answers more consistent - but sometimes you want to optimize for something different than "most likely"), so it's definitely here to stay (for now?), it's just not purely positive. Ideally we'd have a RLHF version that's still well calibrated (or even better, some way to determine confidence without looking at logits that also works with chain of thought prompting).

2

sineiraetstudio t1_jdws2iv wrote

(The graph doesn't give enough information to determine whether it's actually becoming more confident in its high-confidence answers, but it sounds like a reasonable enough rationale.)

I'm not sure I understand what distinction you're trying to draw. The RLHF'd version assigns higher confidence to answers than it actually gets correct, unlike the original pre-trained version. That's literally the definition of overconfidence.

You might say that this is more "human-like", but being human-like doesn't mean that it's good. If you want only the most likely answer, you can already do this via the sampler, while on the hand calibration errors are a straight up downside as Paul Christiano explicitly mentions in the part you quoted. If you need accurate confidence scores (because you e.g. only want to act if you're certain), being well-calibrated is essential.

2

sineiraetstudio t1_jdvvvdb wrote

... that's not what's happening though? The calibration error is causing it to increase its confidence in low accuracy answer and decrease it in med-high accuracy answers, making it more likely to output wrong answers. Seems like maybe you're confusing it with using a different sampler? Something like top-p already does what you mentioned.

1