JClub

JClub t1_jabyi76 wrote

GPT was never trained on image data, so why is this a fair comparison? The UnifiedQA model is from 2022, so that comparison doesn't seem fair either. Why don't we have comparisons with other SOTA multimodal models, such as OFA or UniT?

1

JClub OP t1_j51h8up wrote

> Shouldn't it be the opposite ?

Yes, that makes more sense. Will change!

> How is that different from min(ratio * R, 1.2 * R) ? Does 0.8 have any influence at all ?

Maybe I did not explain properly what the clip is doing. If the ratio is 0.6, it becomes 0.8, and if it is > 1.2, it becomes 1.2.
Does that make more sense? Regarding the min operation, it's just a heuristic to choose the smaller update tbh
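
For reference, here is a minimal PyTorch sketch of the standard PPO clipped objective (the names `ratio`, `advantage`, and `eps` are mine; `advantage` plays the role of R above, and eps = 0.2 gives the 0.8/1.2 bounds):

```python
import torch

def ppo_clipped_objective(ratio: torch.Tensor,
                          advantage: torch.Tensor,
                          eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate objective."""
    # Clip the probability ratio into [1 - eps, 1 + eps] = [0.8, 1.2]:
    # a ratio of 0.6 becomes 0.8, a ratio of 1.5 becomes 1.2.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Element-wise min of the unclipped and clipped terms keeps the
    # smaller (more conservative) update.
    return torch.min(ratio * advantage, clipped_ratio * advantage).mean()
```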

2

JClub OP t1_j4xgp2x wrote

You're not the first person to ask me that question! I need to add a more detailed explanation for that :)

The reward is non-differentiable because it was produced by a reward model, and this reward model takes text as input. That text was obtained by decoding the log probabilities output by your model. The decoding process is non-differentiable, so we lose the gradient link between the LM and the reward model.

Does this make sense? Also, if the reward is given directly by a human, instead of a reward model, it's clearer that this reward is non-differentiable.

RL helps transform this non-differentiable reward into a differentiable loss :)
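
As a rough illustration (my own sketch, not from the original post), a REINFORCE-style surrogate shows the trick: the non-differentiable scalar reward just scales the differentiable log-probabilities of the sampled tokens, so the gradient reaches the LM through those log-probabilities:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """REINFORCE-style surrogate loss.

    logprobs: log-probabilities of the sampled tokens (seq_len,), still
              attached to the language model's computation graph.
    reward:   plain scalar from a reward model or a human; no gradient.
    """
    # The reward only scales the differentiable log-probabilities, so
    # backprop reaches the LM even though decoding itself has no gradient.
    return -(reward * logprobs.sum())

# Toy usage with fake log-probabilities standing in for LM outputs.
token_logprobs = torch.tensor([-0.9, -0.4, -0.1], requires_grad=True)
loss = reinforce_loss(token_logprobs, reward=0.8)
loss.backward()  # gradients flow back through logprobs
```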

5

JClub OP t1_j4uc0bg wrote

PPO's formula generally makes the gradient update smaller than in other RL algorithms. I get that the reward measures the human's preference, but that does not answer my question 🤔: what rewards work best for PPO?

1

JClub t1_j1yp8uf wrote

I understand your point. My point is that these guys are making you pay for something when the API you want could definitely be free.

Without open source they would not be able to charge you for any of this, so using open-source tools to build paid APIs just doesn't sit right with me.

1