Jean-Porte t1_j6wvy2p wrote

The traditional language modeling loss (negative log-likelihood) is misaligned with human expectations. One negation radically changes the meaning of a sentence, but it doesn't radically change the log-likelihood: to the loss, it isn't any more important than a "the" or a superfluous word.

With RLHF, important words have a correspondingly important impact, and the loss is aligned directly with human preferences.
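
To make that concrete, here's a minimal sketch (assuming PyTorch and made-up token ids) of how the standard next-token loss treats every position identically, so the term for "not" gets no more weight than the term for "the":

```python
import torch
import torch.nn.functional as F

# Hypothetical logits from a language model: (sequence_length, vocab_size)
logits = torch.randn(6, 50_000)
# Hypothetical target token ids for "the movie was not very good"
targets = torch.tensor([11, 2047, 389, 407, 1234, 922])

# The NLL is just an unweighted average of per-token terms.
per_token_nll = F.cross_entropy(logits, targets, reduction="none")
loss = per_token_nll.mean()

# per_token_nll[3] (the "not" position) enters the average with exactly the
# same weight as per_token_nll[0] (the "the" position), even though flipping
# "not" reverses the meaning of the whole sentence.
```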

22

alpha-meta OP t1_j6x1r2j wrote

But isn't this only the case if you train it with that loss (negative log-likelihood) via next-word prediction, i.e., what they do during pretraining?

If you instead use the ranks (from having users rank the documents) to compute the loss, rather than using the words as labels, would that still be the case?

4

Jean-Porte t1_j6x8oyx wrote

Yes, but the LM has to take many steps to produce the text.

We need to train the LM to maximize a far-away (delayed) reward, and we need RL to do that.
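
Roughly, the idea looks like this (a simplified REINFORCE-style sketch, not the exact RLHF recipe, which typically uses PPO plus a KL penalty to a reference model; `model` and `reward_model` are hypothetical callables):

```python
import torch

def reinforce_step(model, reward_model, optimizer, prompt_ids, max_len=64):
    # One hypothetical update: sample a full answer, score it once, update the policy.
    ids = prompt_ids
    log_probs = []
    for _ in range(max_len):                       # many generation steps...
        logits = model(ids)[:, -1, :]              # next-token distribution
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        ids = torch.cat([ids, token.unsqueeze(-1)], dim=-1)

    reward = reward_model(ids)                     # ...but only one scalar reward, on the complete text
    # Policy gradient: push up the log-probability of every sampled token
    # in proportion to the terminal reward.
    loss = -(reward.detach() * torch.stack(log_probs).sum(dim=0)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```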

4

alpha-meta OP t1_j6xylk8 wrote

Could you help me understand what the far-away rewards represent in this context? Are the steps the generation of individual words? If so, do you mean words that occur early in the text? Couldn't a weighting scheme for the cross-entropy loss components be used in that case?

2

Jean-Porte t1_j6y0djg wrote

The beginning of the best possible answer might not be the best beginning on its own. It's the final outcome, the complete answer, that counts, so it makes sense to evaluate that. The reward is the feedback on the complete answer.
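
And this is where the human rankings come in: they're typically used to train a reward model that scores the complete answer, e.g. with a pairwise (Bradley-Terry-style) loss. A minimal sketch, assuming a hypothetical `reward_model` that maps a full token sequence to a scalar score:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, preferred_ids, rejected_ids):
    # The preferred complete answer should score higher than the rejected one.
    r_preferred = reward_model(preferred_ids)   # scalar score for the whole answer
    r_rejected = reward_model(rejected_ids)
    # Bradley-Terry style objective: -log sigmoid(r_preferred - r_rejected)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```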

4