JClub
JClub t1_jbwu3lx wrote
Reply to comment by Non-jabroni_redditor in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
More than that, GPT is unidirectional, which is really not great for a sentence embedder
JClub t1_jabyi76 wrote
Reply to comment by AiChip in [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params! by Singularian2501
GPT was never trained with image data, so why is this a fair comparison? The UnifiedQA model is from 2022, so it doesn't seem fair either. Why don't we have some comparisons with other SOTA multimodal models, such as OFA or UniT?
JClub t1_jabyhe8 wrote
Reply to [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params! by Singularian2501
GPT was never trained with image data, so why is this a fair comparison? The UnifiedQA model is from 2020, so it doesn't seem fair either. Why don't we have some comparisons with other SOTA multimodal models, such as OFA or UniT?
JClub t1_jabyh73 wrote
Reply to comment by astonzhang in [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params! by Singularian2501
GPT was never trained with image data, so why is this a fair comparison? The UnifiedQA model is from 2022, so it doesn't seem fair either. Why don't we have some comparisons with other SOTA multimodal models, such as OFA or UniT?
JClub t1_j9j5cv4 wrote
Sorry, but why do we need another package? Can't you build on top of https://github.com/huggingface/peft ?
JClub OP t1_j57rrn6 wrote
Reply to comment by Ouitos in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
Ah yes, you're right. I actually don't know why, but you can check the implementation and ask about it on GitHub
JClub OP t1_j51h8up wrote
Reply to comment by Ouitos in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
> Shouldn't it be the opposite ?
Yes, that makes more sense. Will change!
> How is that different from min(ratio * R, 1.2 * R) ? Does 0.8 have any influence at all ?
Maybe I did not explain properly what the clip is doing. If the ratio is 0.6 (below 0.8), it becomes 0.8, and if it is above 1.2, it becomes 1.2.
Does that make more sense? Regarding the min operation, it's just a heuristic to choose the smaller update tbh
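In code, the clip is roughly this (a minimal PyTorch sketch of the clipped surrogate, not the exact trl implementation; `ratio`, `advantage` and `eps` are placeholder names):

```python
import torch

def ppo_surrogate(ratio: torch.Tensor, advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # clamp pushes ratios below 1 - eps up to 0.8 and cuts ratios above 1 + eps down to 1.2
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # the element-wise min picks the smaller (more conservative) of the two candidate updates
    return torch.min(ratio * advantage, clipped_ratio * advantage).mean()
```

The PPO loss is the negative of this objective, so the optimizer still just minimizes.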
JClub OP t1_j4zejga wrote
Reply to comment by dataslacker in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
Yes, 100% agree with you. I believe the researchers have also tried pseudo-labeling or making the reward differentiable, as you say, and RL may simply be the approach that works best right now. But these are just guesses!
JClub OP t1_j4z5ciu wrote
Reply to comment by JoeHenzi in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
This package is pretty simple to use! https://github.com/lvwerra/trl
It supports decoder-only models like GPT, and support for encoder-decoder models like T5 is in the works.
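A rough usage sketch, loosely following the trl README; treat the exact names (`PPOConfig`, `PPOTrainer`, `AutoModelForCausalLMWithValueHead`, `respond_to_batch`) as assumptions since they change between versions, and double-check against the repo:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import respond_to_batch

# policy model with a value head, plus a frozen reference copy for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1), model, model_ref, tokenizer)

# one query -> one sampled response
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)

# the reward would normally come from a reward model or a human; 1.0 is a placeholder
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```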
JClub OP t1_j4z57kr wrote
Reply to comment by dataslacker in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
Yes, the reward model can rank model outputs, but it does that by giving a score to each output. You want to train with this score, not with "pseudo labeling" as you suggest. The catch is that the reward score is non-differentiable, and RL helps construct a differentiable loss from it. Does that make sense?
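For context, this is roughly how a ranking-trained reward model still ends up emitting one scalar per output, in the style of the InstructGPT pairwise comparison loss (a sketch, not from this thread; `reward_model`, `chosen_ids` and `rejected_ids` are placeholders):

```python
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model maps a batch of token ids to one scalar score per sequence
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # push the preferred output's score above the rejected one's
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

At RLHF time you just call the trained `reward_model` on a generated output and use that scalar as the reward.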
JClub OP t1_j4xgp2x wrote
Reply to comment by dataslacker in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
You're not the first person to ask me that question! I need to add a more detailed explanation for it :)
The reward is non-differentiable because it is produced by a reward model, and that reward model takes text as input. The text was obtained by decoding your model's output probabilities (sampling or argmax), and this decoding step is non-differentiable, so we lose the gradient link between the LM and the reward model.
Does this make sense? Also, if the reward is given directly by a human instead of a reward model, it's even clearer that the reward is non-differentiable.
RL helps transform this non-differentiable reward into a differentiable loss :)
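The simplest version of that trick is REINFORCE (PPO adds clipping and a KL penalty on top, but the idea is the same; tensor names here are placeholders):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    # token_logprobs: log-probabilities the LM assigned to the sampled response tokens (differentiable)
    # reward: scalar from the reward model or a human (a constant as far as autograd is concerned)
    return -(token_logprobs.sum() * reward)
```

The gradient flows through the log-probabilities, so sampled text that earned a high reward becomes more likely, even though the reward itself never receives a gradient.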
Submitted by JClub t3_10fh79i in MachineLearning
JClub OP t1_j4v5d0y wrote
Reply to comment by koolaidman123 in [D] RLHF - What type of rewards to use? by JClub
Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward lies between 0 and 1!
Do you think the sigmoid is needed?
JClub OP t1_j4v057p wrote
Reply to comment by koolaidman123 in [D] RLHF - What type of rewards to use? by JClub
Yeah, InstructGPT is like that. How do you calculate a reward score for each output in this ranking scenario?
JClub OP t1_j4uc8lc wrote
Reply to comment by buzzbuzzimafuzz in [D] RLHF - What type of rewards to use? by JClub
Yes, that makes sense! But for example, can you really combine a thumbs-up/down signal with a 1-5 scale? It will be even harder to make the two work together when training the model, right?
JClub OP t1_j4uc0bg wrote
Reply to comment by velcher in [D] RLHF - What type of rewards to use? by JClub
PPO's clipped objective keeps the gradient updates smaller than in other RL algorithms. I get that the reward measures the human's preference, but that doesn't answer my question 🤔: what kind of rewards work best for PPO?
Submitted by JClub t3_10emf7a in MachineLearning
JClub t1_j46sm7q wrote
Reply to [D] Has ML become synonymous with AI? by Valachio
If you see it on slides, it is AI. If you see it in Python, it is ML.
JClub t1_j1yp8uf wrote
Reply to comment by mac4281 in [P] I built an API that makes it easy and cheap for developers to build ML-powered apps using Stable Diffusion by TrueBlueDreamin
I understand your point. My point is that these guys are making you pay for something that could definitely be free.
Without open source they would not be able to charge you for any of this, so using open source tools to build paid APIs just doesn't sit right with me.
JClub t1_j1ujk9d wrote
Reply to comment by the_magic_gardener in [P] I built an API that makes it easy and cheap for developers to build ML-powered apps using Stable Diffusion by TrueBlueDreamin
Yes, there are some Colab notebooks that do it for you easily. This one works great: https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast-DreamBooth.ipynb
JClub t1_j1u36tt wrote
Reply to [P] I built an API that makes it easy and cheap for developers to build ML-powered apps using Stable Diffusion by TrueBlueDreamin
Another guy making money off DreamBooth training when you can do it for free on Google Colab...
JClub t1_j0fwb4d wrote
Nice project!
How do you prevent the model from hallucinating? I did not get that. Do you just hope that the model will copy from the top 10 searches you give it?
JClub t1_izjnf35 wrote
Reply to comment by tetrisdaemon in [R] What the DAAM: Interpreting Stable Diffusion and Uncovering Generation Entanglement by tetrisdaemon
Damn, then this method can only run on such hardware; the attention weights are very heavy!
JClub t1_izij5x5 wrote
Reply to [R] What the DAAM: Interpreting Stable Diffusion and Uncovering Generation Entanglement by tetrisdaemon
Hey! I'm the author of https://github.com/JoaoLages/diffusers-interpret
I have also tried to collect attention maps during the diffusion process, but the (text size, image size) matrices were too big to keep in RAM/VRAM. How did you solve that problem?
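Back-of-the-envelope, assuming an SD-v1-style UNet (the layer counts, resolutions and head count below are assumptions, not measurements):

```python
# rough size of storing every cross-attention map for one 50-step generation
text_tokens = 77                                     # CLIP context length
spatial_sizes = [64 * 64, 32 * 32, 16 * 16, 8 * 8]   # assumed latent resolutions
layers_per_size = 4                                  # assumed cross-attention layers per resolution
heads = 8                                            # assumed attention heads
steps = 50                                           # denoising steps
bytes_per_float = 4                                  # fp32

floats = sum(s * text_tokens * heads * layers_per_size for s in spatial_sizes) * steps
print(f"~{floats * bytes_per_float / 1e9:.1f} GB")   # ~2.7 GB for a single image
```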
JClub t1_jc5ys39 wrote
Reply to [D] ChatGPT without text limits. by spiritus_dei
Is there any implementation of CAM? Why is this better than the transient-global (TGlobal) attention used in LongT5?