alpha-meta OP t1_j6xylk8 wrote on February 2, 2023 at 6:04 PM

Reply to comment by Jean-Porte in [D] Why do LLMs like InstructGPT and LLM use RL instead of supervised learning to learn from the user-ranked examples? by alpha-meta

Could you help me understand what the far-away rewards represent in this context? Are the steps the generation of individual words? If so, do you mean the words that occur early in the text? And in that case, couldn't a weighting scheme for the cross-entropy loss components be used instead?
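
To make the weighting idea concrete, here is a rough sketch of a per-token weighted cross-entropy. The function name, the probabilities, and the weights are all made up for illustration; nothing here comes from InstructGPT itself.

```python
import math

def weighted_cross_entropy(target_probs, weights):
    """Weighted average of per-token negative log-likelihoods.

    target_probs[i] is the probability the model gave to the reference
    token at position i; weights[i] is the weight for that position.
    """
    if len(target_probs) != len(weights):
        raise ValueError("need exactly one weight per token")
    losses = [-math.log(p) for p in target_probs]
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / sum(weights)

# Hypothetical probabilities for a 4-token reference answer.
probs = [0.9, 0.5, 0.8, 0.25]

# Plain cross-entropy: every position counts equally.
uniform = weighted_cross_entropy(probs, [1.0, 1.0, 1.0, 1.0])

# Hypothetical scheme that emphasizes early positions.
early_heavy = weighted_cross_entropy(probs, [4.0, 3.0, 2.0, 1.0])

print(f"uniform: {uniform:.4f}  early-heavy: {early_heavy:.4f}")
```

Note that however the weights are chosen, this loss still needs a reference token at every position, so it scores the model against a fixed target text one token at a time.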

Jean-Porte t1_j6y0djg wrote on February 2, 2023 at 6:15 PM

The beginning of the best possible answer might not be the best beginning. It's the final outcome, the complete answer that counts, so it makes sense to evaluate that. The reward is the feedback on the complete answer.
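
A toy sketch of that first point, assuming a made-up two-step "answer" where only complete answers get a reward. The tokens, the local scores, and the reward table are all hypothetical, chosen just to show the effect.

```python
from itertools import product

# Hypothetical token choices at each of two generation steps.
FIRST = ["Sure", "Well"]
SECOND = ["yes", "no"]

# Hypothetical score for how good each first token looks on its own.
LOCAL_SCORE = {"Sure": 0.9, "Well": 0.6}

# Hypothetical reward, defined only for complete answers.
REWARD = {
    ("Sure", "yes"): 0.5,
    ("Sure", "no"): 0.4,
    ("Well", "yes"): 0.3,
    ("Well", "no"): 1.0,
}

def greedy_answer():
    """Commit to the best-looking first token, then finish as well as possible."""
    first = max(FIRST, key=LOCAL_SCORE.get)
    second = max(SECOND, key=lambda tok: REWARD[(first, tok)])
    return (first, second)

def best_answer():
    """Score every complete answer and keep the one with the highest reward."""
    return max(product(FIRST, SECOND), key=REWARD.get)

greedy = greedy_answer()
best = best_answer()
print("greedy:", greedy, REWARD[greedy])
print("best:  ", best, REWARD[best])
```

Here the greedy path starts with the better-looking token but ends with a lower reward than the best complete answer, which starts with the worse-looking one.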

alpha-meta OP t1_j6yud7x wrote on February 2, 2023 at 9:21 PM

Ah yes, I see what you mean now, thanks!