alpha-meta OP t1_j6wvgbr wrote on February 2, 2023 at 1:46 PM

Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta

Thanks for the response! I just double-checked the InstructGPT paper and you were right regarding the rankings -- they are pairwise, and I am not sure why I thought otherwise.
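For anyone else reading along, here is a rough sketch of what "pairwise" means for the reward model, as I understand the paper: the loss is the negative log-sigmoid of the difference between the reward of the preferred completion and the reward of the rejected one. The function name and the plain-float inputs are my own placeholders, not anything from the paper's code.

```python
import math

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)) for one comparison pair.

    Hypothetical helper for illustration; rewards are plain floats here,
    whereas a real reward model would output them from a network.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# The loss is small when the preferred completion scores higher,
# and large when the ordering is violated.
loss_agree = pairwise_ranking_loss(2.0, 0.0)
loss_disagree = pairwise_ranking_loss(0.0, 2.0)
```

With K ranked completions per prompt, each of the K-choose-2 pairs would contribute one such term.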

Regarding the updates on a sentence level, that makes sense. That would also be more of a discrete problem, one you probably can't backpropagate through (otherwise, you would be back to token-level updates).
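To make the discrete point concrete for myself, here is a toy sketch of a score-function (REINFORCE-style) update. This is not what InstructGPT does (they use PPO); it is just the simplest policy-gradient estimator I know. The three-token vocabulary, the `sentence_reward` function, and all hyperparameters are made up. The reward only sees the finished sampled sequence, so no gradient flows through it; the update instead weights the gradient of the log-probability of the sample by that reward.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_reward(tokens):
    # Hypothetical black-box score for the whole sequence:
    # counts how often token 1 appears. Not differentiable w.r.t. the logits.
    return float(sum(1 for t in tokens if t == 1))

def reinforce_step(logits, seq_len, lr, rng):
    probs = softmax(logits)
    # Sampling is the discrete step that blocks ordinary backpropagation.
    tokens = rng.choices(range(len(logits)), weights=probs, k=seq_len)
    reward = sentence_reward(tokens)
    # Gradient of sum_t log p(token_t) w.r.t. the logits is
    # sum_t (one_hot(token_t) - probs).
    grad = [-seq_len * p for p in probs]
    for t in tokens:
        grad[t] += 1.0
    # Ascend reward * grad(log-prob of the sampled sequence).
    return [x + lr * reward * g for x, g in zip(logits, grad)]

def train(steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # toy policy: same logits at every position
    for _ in range(steps):
        logits = reinforce_step(logits, seq_len=4, lr=0.02, rng=rng)
    return softmax(logits)

final_probs = train()
```

In this toy setup the policy should shift probability toward token 1, even though the reward was only ever computed on complete sampled sequences.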