Jean-Porte t1_j6y0djg wrote
Reply to comment by alpha-meta in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
The beginning of the best possible answer might not be the best beginning. It's the final outcome, the complete answer, that counts, so it makes sense to evaluate that. The reward is the feedback on the complete answer.
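A toy sketch of that point, in Python. Everything here is made up for illustration: the `STEP_SCORES` table stands in for a model's next-token preferences, and the helper names are hypothetical. It only shows that picking the best-looking first token can lock you out of the best complete answer, which is why scoring the whole answer can make sense.

```python
# Toy next-token scores keyed by prefix; all numbers are invented.
STEP_SCORES = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy_answer():
    """Build an answer by always taking the locally best next token."""
    prefix = ()
    while prefix in STEP_SCORES:
        options = STEP_SCORES[prefix]
        prefix += (max(options, key=options.get),)
    return prefix


def answer_score(answer):
    """Score a complete answer as the product of its per-step scores."""
    score = 1.0
    for i, token in enumerate(answer):
        score *= STEP_SCORES[answer[:i]][token]
    return score


def best_complete_answer():
    """Enumerate every two-token answer and keep the best one overall."""
    candidates = [
        (first, second)
        for first in STEP_SCORES[()]
        for second in STEP_SCORES[(first,)]
    ]
    return max(candidates, key=answer_score)


greedy = greedy_answer()
best = best_complete_answer()
print("greedy:", greedy, "best complete:", best)
```

Here the greedy choice starts with "A" because it looks best at step one, but the best complete answer starts with "B". A signal computed on the finished answer can see that difference; a purely step-by-step choice cannot.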
alpha-meta OP t1_j6yud7x wrote
Ah yes, I see what you mean now, thanks!