wardellinthehouse t1_j6zrsvp wrote on February 3, 2023 at 1:11 AM
Reply to [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
I asked this same question: https://www.reddit.com/r/reinforcementlearning/comments/zqfw7r/why_cant_we_do_supervised_learning_in_step_3_of/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button
I believe the answer is that sampling from the policy network is a non-differentiable operation, so the reward computed on a sampled output can't be backpropagated into the policy the way a supervised loss can.
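To make that concrete, here is a rough sketch of the usual workaround, the score-function (REINFORCE) estimator, on a toy problem. Everything in it is made up for illustration: a three-token "policy" defined by `logits`, a fixed `rewards` table standing in for a reward model, and the helper names. It is not how InstructGPT is implemented (that uses PPO on a full language model). It only shows that you can estimate the gradient of the expected reward from samples without differentiating through the sampling step.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Gradient of E[reward] w.r.t. the logits, computed by summing over
    every action. Only feasible here because there are three actions."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Score-function estimate of the same gradient.

    The sampled action `a` is a discrete index, so no gradient flows
    through the draw itself. Instead each sample contributes
    reward(a) * d log p(a) / d logits, which for a softmax policy is
    reward(a) * (one_hot(a) - probs).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            grad[k] += rewards[a] * (indicator - probs[k])
    return [g / num_samples for g in grad]


# Hypothetical toy values, chosen only for illustration.
logits = [0.5, -0.2, 0.1]
rewards = [1.0, 0.0, 2.0]

rng = random.Random(0)
exact = exact_gradient(logits, rewards)
estimate = reinforce_gradient(logits, rewards, 50000, rng)
```

With enough samples `estimate` should land close to `exact`. In the real setting the action space is every possible completion, so the exact sum isn't available and a sampled estimate like this (plus variance reduction) is what's left.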