JClub OP t1_j4uc8lc wrote

Yes, that makes sense! But can you really combine, for example, thumbs-up/down feedback with a 1-5 scale? Won't it be even harder to make both work together when training the model?

1

koolaidman123 t1_j4uuko0 wrote

ChatGPT (assuming it uses the same training as InstructGPT) doesn't use a numerical scale. Every label is a comparison between 2 (out of k) sampled outputs for a prompt, so everything is a pairwise comparison.
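In case it helps, a minimal sketch of that pairwise objective (numpy; the function name is mine): the reward model scores both outputs of a pair, and is trained to rank the human-preferred one higher via -log(sigmoid(r_chosen - r_rejected)).

```python
import math

import numpy as np

def pairwise_reward_loss(chosen_scores, rejected_scores):
    """Bradley-Terry style pairwise loss, as in InstructGPT:
    -log(sigmoid(r_chosen - r_rejected)), averaged over pairs."""
    diff = np.asarray(chosen_scores, dtype=float) - np.asarray(rejected_scores, dtype=float)
    # log1p(exp(-d)) == -log(sigmoid(d)), computed in a numerically stable way
    return float(np.mean(np.log1p(np.exp(-diff))))
```

When the chosen output already scores much higher than the rejected one, the loss is near 0; when the two scores are equal, it sits at log 2, pushing the model to separate them.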

1

JClub OP t1_j4v057p wrote

Yeah, InstructGPT is like that. How do you calculate a reward score for each output in this ranking scenario?

1

koolaidman123 t1_j4v2uyq wrote

It's just a binary pairwise comparison of which of the 2 outputs is preferred. Read the InstructGPT paper or the wandb post: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model

2

JClub OP t1_j4v5d0y wrote

Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward lies between 0 and 1!
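(A tiny sketch of that second option — numpy, function name is mine: the raw reward-model score is an unbounded scalar, and a sigmoid squashes it into (0, 1). Since sigmoid is monotonic, it doesn't change which output is ranked higher.)

```python
import numpy as np

def squash(reward_scores):
    # sigmoid maps unbounded reward-model scores into (0, 1);
    # it's monotonic, so the relative ordering of outputs is preserved
    r = np.asarray(reward_scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-r))

raw = np.array([-2.3, 0.0, 4.1])  # example raw reward-model scores
squashed = squash(raw)            # each value now lies in (0, 1)
```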

Do you think that the sigmoid is needed?

2