
blimpyway t1_j742oes wrote

I guess the point of the reward model is to approximate human feedback: instead of hiring humans to actually rank, say, the 1 billion chats needed to update the LLM, you train a reward model on 1% of them and then use it to stand in for the human evaluators the other 99% of the time.
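Roughly, that pipeline looks like the sketch below (PyTorch, with random vectors standing in for the actual (prompt, response) embeddings and a made-up `RewardModel` class; just an illustration of the idea, not anyone's actual implementation):

```python
import torch
import torch.nn as nn

# Toy stand-in for an embedding of a (prompt, response) pair; in practice this
# would come from the LLM's hidden states or a separate encoder.
EMB_DIM = 16

class RewardModel(nn.Module):
    """Maps a (prompt, response) embedding to a single scalar reward."""
    def __init__(self, dim=EMB_DIM):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.scorer(x).squeeze(-1)

def preference_loss(r_chosen, r_rejected):
    # Pairwise (Bradley-Terry style) loss: push the preferred response's
    # reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# 1) Train on the small human-labelled slice (the "1%").
human_chosen = torch.randn(100, EMB_DIM)    # embeddings of preferred responses
human_rejected = torch.randn(100, EMB_DIM)  # embeddings of rejected responses

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = preference_loss(rm(human_chosen), rm(human_rejected))
    loss.backward()
    opt.step()

# 2) Use the trained reward model as a cheap proxy for human raters
#    on the vast majority of chats (the "99%").
unlabelled_chats = torch.randn(1000, EMB_DIM)
with torch.no_grad():
    proxy_rewards = rm(unlabelled_chats)  # these scores feed the RL update of the LLM
print(proxy_rewards.shape)  # torch.Size([1000])
```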
