bigabig t1_j6z3a6d wrote
I thought this was also because you don't need as much supervised training data, since you 'just' have to train the reward model in a supervised fashion?
alpha-meta OP t1_j72dxx7 wrote
I think it's probably the non-differentiable nature of the sampling techniques. If it were just about limited training data and using the reward model, you could also use weakly supervised learning with that reward model.
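To make the non-differentiability point concrete, here is a rough sketch of how a policy-gradient (score-function / REINFORCE-style) estimator gets around it. Everything in it is an illustrative assumption and not taken from any specific system: a three-token "vocabulary", a fixed `rewards` list standing in for a reward model, and the helper names. The sampled token index is discrete, so you can't backpropagate through the sampling step, but you can still estimate the gradient of the expected reward using the gradient of the log-probability of whatever was sampled.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Gradient of E[reward] w.r.t. the logits, computed in closed form.

    Only feasible here because the toy vocabulary is tiny enough to enumerate.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def score_function_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate of the same gradient using only sampled tokens.

    Uses grad log p(token) = onehot(token) - probs for a softmax policy,
    so no gradient ever has to flow through the discrete sampling step.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        token = rng.choices(range(len(probs)), weights=probs)[0]
        r = rewards[token]  # stand-in for a reward model score
        for k in range(len(logits)):
            indicator = 1.0 if k == token else 0.0
            grad[k] += r * (indicator - probs[k]) / num_samples
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.5, -0.2, 1.0]   # hypothetical policy logits
    rewards = [1.0, 0.0, 2.0]   # hypothetical per-token rewards
    print("exact:    ", exact_gradient(logits, rewards))
    print("estimated:", score_function_gradient(logits, rewards, 20000, rng))
```

With enough samples the estimate should land close to the closed-form gradient, which is the basic reason RL-style updates work even though the sampling step itself has no gradient.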