alpha-meta OP t1_j6wvgbr wrote
Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Thanks for the response! I just double-checked the InstructGPT paper and you were right regarding the rankings -- they are pairwise, and I am not sure why I thought otherwise.
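For anyone following along, here is a minimal sketch of what a pairwise objective of this kind looks like, assuming the usual form -log(sigmoid(r_preferred - r_rejected)) on scalar reward-model scores. The function name and the example scores are made up for illustration; this is not code from the paper.

```python
import math

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_preferred - r_rejected)).

    Hypothetical helper for illustration; the inputs stand in for the scalar
    scores a reward model would assign to two completions of the same prompt.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Loss is small when the preferred completion scores higher, large otherwise.
loss_agrees = pairwise_ranking_loss(2.0, 0.0)
loss_disagrees = pairwise_ranking_loss(0.0, 2.0)
```

A ranking of K completions can be broken into all of its pairs and this loss averaged over them, which is why a full ranking and pairwise comparisons end up as the same kind of training signal.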
Regarding the updates on a sentence level, that makes sense. A sentence-level reward only exists after you have sampled discrete tokens, and you probably can't backpropagate through that sampling step (otherwise, you would be back to a token-level loss).
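To make the sentence-level point concrete, here is a toy sketch of the score-function (REINFORCE) estimator, which is one standard way around the non-differentiable sampling step. Everything in it is an assumption for illustration: the "policy" is a single set of logits reused at every position, and `sequence_reward` is a hypothetical stand-in for a reward model. Real RLHF setups use a full language model and PPO rather than plain REINFORCE.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, sequence_reward, seq_len=3, rng=None):
    """One-sample score-function gradient of expected reward w.r.t. the logits.

    Toy policy: the same categorical distribution at every position.
    """
    rng = rng or random.Random(0)
    probs = softmax(logits)
    # Sampling is the discrete step that gradients cannot flow through.
    tokens = rng.choices(range(len(logits)), weights=probs, k=seq_len)
    # One scalar reward for the whole sequence, not one per token.
    reward = sequence_reward(tokens)
    grad = [0.0] * len(logits)
    for t in tokens:
        for i in range(len(logits)):
            # d log p(t) / d logit_i = 1[i == t] - p_i
            grad[i] += reward * ((1.0 if i == t else 0.0) - probs[i])
    return tokens, reward, grad

# Hypothetical reward: number of times token 0 appears in the sequence.
tokens, reward, grad = reinforce_grad(
    [0.0, 0.0, 0.0], lambda seq: float(seq.count(0))
)
```

The gradient only touches the log-probabilities of the sampled tokens, weighted by the sequence reward, so nothing has to be differentiated through the sampling itself.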