alpha-meta OP t1_j72dxx7 wrote
Reply to comment by bigabig in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
I think it's probably the non-differentiable nature of the sampling techniques. If it were only about limited training data and making use of the reward model, you could also do weakly supervised learning with that reward model.
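To make the non-differentiability point concrete, here is a rough sketch, not anything from InstructGPT itself. Picking a token by sampling is a discrete step, so you can't backprop a reward through it directly. The score-function (REINFORCE) estimator gets around that by weighting the gradient of the log-probability by the reward. The toy setup is entirely my own assumption: a 4-token "policy" given by a logit vector, and a made-up fixed reward per token standing in for a reward model.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(logits, rewards):
    # Gradient of the expected reward J = sum_i p_i * r_i w.r.t. the logits.
    # Only computable here because the toy action space is tiny.
    p = softmax(logits)
    return p * (rewards - p @ rewards)

def reinforce_gradient(logits, rewards, n_samples, rng):
    # Score-function estimator: average of r(a) * grad log p(a) over samples.
    # No gradient flows through the sampling step itself.
    p = softmax(logits)
    actions = rng.choice(len(p), size=n_samples, p=p)
    one_hot = np.eye(len(p))[actions]
    score = one_hot - p  # grad of log softmax(logits)[a] w.r.t. logits
    return (rewards[actions][:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
logits = np.array([0.5, -0.2, 0.1, 0.0])    # hypothetical policy logits
rewards = np.array([1.0, 0.0, 2.0, -1.0])   # hypothetical reward-model scores

exact = exact_gradient(logits, rewards)
estimate = reinforce_gradient(logits, rewards, 50_000, rng)
print("exact   :", exact)
print("estimate:", estimate)
```

With enough samples the estimate should land close to the exact gradient, even though nothing was differentiated through the sampling. For real text generation the exact version is out of reach because the space of sequences is far too large, which is where policy-gradient methods come in.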