bigabig t1_j6z3a6d wrote on February 2, 2023 at 10:18 PM

I thought this was also because you don't need as much supervised training data, since you 'just' have to train the reward model in a supervised fashion?

alpha-meta OP t1_j72dxx7 wrote on February 3, 2023 at 4:09 PM

I think it's probably because the sampling step is non-differentiable: you can't backpropagate the reward through the sampled tokens to the model's parameters. If it were just about limited training data and using the reward model, you could also use weakly supervised learning with that reward model.
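
To make the non-differentiability point concrete, here is a toy sketch (my own illustration, not the actual RLHF pipeline, which as far as I know uses PPO on a full language model). It assumes a 3-token "policy" given by softmax logits and a made-up `reward` function standing in for the reward model. The sampled token is a discrete index, so there is no gradient path from the reward back to the logits. The score-function (REINFORCE) estimator gets around that by weighting the gradient of the log-probability by the reward. All names and numbers below are hypothetical.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(token):
    # Hypothetical stand-in for a learned reward model: a black box
    # that scores a sampled token. The values are made up.
    return [0.0, 1.0, 0.2][token]


def exact_gradient(logits):
    # Closed-form gradient of E[reward] w.r.t. the logits.
    # Only feasible here because the toy vocabulary has 3 tokens.
    p = softmax(logits)
    expected = sum(p[i] * reward(i) for i in range(len(p)))
    return [p[i] * (reward(i) - expected) for i in range(len(p))]


def reinforce_gradient(logits, num_samples, rng):
    # Score-function estimate: average of reward(x) * d/dz log p(x),
    # where d/dz_i log p(x) = 1[i == x] - p_i for a softmax policy.
    p = softmax(logits)
    n = len(p)
    grad = [0.0] * n
    for _ in range(num_samples):
        token = rng.choices(range(n), weights=p)[0]  # discrete, non-differentiable draw
        r = reward(token)
        for i in range(n):
            indicator = 1.0 if i == token else 0.0
            grad[i] += r * (indicator - p[i])
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0]
    print("exact    :", exact_gradient(logits))
    print("estimate :", reinforce_gradient(logits, 20000, rng))
```

The estimate only needs reward values for sampled tokens, never a derivative of the reward or of the sampling step, which is why an RL-style objective fits here. The trade-off is variance: the estimate is noisy and needs many samples (or a baseline) to be useful.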