alpha-meta OP t1_j72dxx7 wrote on February 3, 2023 at 4:09 PM

Reply to comment by bigabig in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta

I think it's probably because sampling from the model is non-differentiable, so you can't backpropagate a loss through the generated text. If it were just about limited training data and making use of the reward model, you could also do weakly supervised learning with that reward model.
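
To make the non-differentiability point concrete, here is a minimal sketch, not anything taken from InstructGPT. It assumes a toy 3-token "policy" defined by a list of logits and a hypothetical `rewards` table standing in for a reward model. Drawing a token is a discrete step, so no gradient flows through it; the score-function (REINFORCE) estimator gets around this by weighting the gradient of log p(token) by the reward. For this toy case the exact gradient is available in closed form, so the two can be compared.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_grad(logits, rewards):
    # Closed-form gradient of E[r] w.r.t. each logit: p_k * (r_k - E[r]).
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def reinforce_grad(logits, rewards, n_samples, rng):
    # Monte Carlo estimate of the same gradient using only sampled tokens.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # Discrete sampling step: nothing can be backpropagated through this.
        token = rng.choices(range(len(logits)), weights=probs)[0]
        r = rewards[token]  # hypothetical reward-model score for the sample
        for k in range(len(logits)):
            # d log p(token) / d logit_k = 1[k == token] - p_k
            indicator = 1.0 if k == token else 0.0
            grad[k] += r * (indicator - probs[k])
    return [g / n_samples for g in grad]


# Toy values chosen for illustration only.
logits = [0.0, 0.5, -0.5]
rewards = [1.0, 0.0, 2.0]
rng = random.Random(0)

exact = exact_grad(logits, rewards)
estimate = reinforce_grad(logits, rewards, 50_000, rng)
print("exact:   ", [round(g, 3) for g in exact])
print("estimate:", [round(g, 3) for g in estimate])
```

The estimate only needs reward values for sampled outputs, which is why this style of update works with a black-box reward model. The price is variance: many samples are needed before the estimate settles near the exact gradient.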