liyanjia92 OP t1_jdjx0zs wrote

The project explores whether RLHF can help smaller models also respond naturally in a human/assistant conversation.

you can take a look at this Get Started section for more details: https://github.com/ethanyanjiali/minChatGPT#get-started

in short, SFT is supervised fine-tuning, the reward model is the one used to generate a reward given the language model's output (the action) during reinforcement learning, RLHF means using human feedback to set up that reinforcement learning, and an epoch means the model sees all the data once.
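
To make the terms a bit more concrete, here is a rough sketch of how the reward model's scalar output drives a policy update on the language model's sampled completion. This is not the actual minChatGPT code; the model/tokenizer interfaces are assumed to be HuggingFace-style, and real RLHF uses PPO with a KL penalty rather than this bare policy gradient.

```python
import torch

def rlhf_step(policy_model, reward_model, tokenizer, prompt, optimizer):
    # The policy (the SFT'ed language model) generates a completion: the RL "action".
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    completion = policy_model.generate(input_ids, max_new_tokens=64)

    # The reward model maps (prompt + completion) to a single scalar reward.
    reward = reward_model(completion)  # shape: (1,)

    # Log-probs of the generated tokens under the current policy.
    logits = policy_model(completion).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_logp = log_probs.gather(-1, completion[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Conceptual shape of one policy-gradient update: push up the likelihood
    # of completions the reward model scores highly.
    loss = -(token_logp.sum() * reward.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```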

https://web.stanford.edu/class/cs224n/ could be a good class if you are new; they have a YouTube version from 2021 (though they probably didn't cover RLHF back then).

3

liyanjia92 OP t1_jdjwfnh wrote

It may be better to submit an issue on GitHub so that I can point you to some code with context. If you are talking about my code, you need to convert the weights and load them into the GPT class before running SFT training; otherwise there might be a mismatch in the weights and it could just output random stuff.
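
For context, the rough flow is something like the sketch below; the helper names here are illustrative (check the repo's scripts for the real entry points):

```python
from transformers import GPT2LMHeadModel

def convert_hf_weights(hf_state_dict):
    """Hypothetical key-mapping from HuggingFace GPT-2 names to the local GPT class names."""
    converted = {}
    for name, tensor in hf_state_dict.items():
        converted[name.replace("transformer.", "")] = tensor  # illustrative mapping only
    return converted

hf_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
state_dict = convert_hf_weights(hf_model.state_dict())

# model = GPT(config)                # the repo's GPT class
# model.load_state_dict(state_dict)  # this must load cleanly before SFT,
#                                    # otherwise the model just outputs noise
```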

2

liyanjia92 OP t1_jdj87cv wrote

SFT takes a bit longer, probably 8-12 hours (I need to check TensorBoard to verify). The reward model is faster because it only needs 1 epoch, just a couple of hours. RLHF is the slowest because of its complexity (4 models interacting with each other); I probably need to improve the "make_experiment" part of the code, since the GPU is often idle. So it could take days just to do 1 epoch. I didn't finish tuning because even doing RLHF on only about 10K examples already outperforms the SFT model in terms of "human" preference.
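
For reference, the rollout ("make_experiment") step roughly looks like the sketch below (generic naming, not the exact minChatGPT code). The autoregressive generation in step 1 plus the three extra forward passes are where most of the wall-clock time and GPU idle time come from:

```python
import torch

@torch.no_grad()
def make_experiment(actor, ref_model, reward_model, critic, prompt_ids):
    # 1. The actor (SFT'ed policy) samples a completion -- the RL "action".
    #    Autoregressive decoding is sequential, so the GPU sits partly idle here.
    completion = actor.generate(prompt_ids, max_new_tokens=64)

    # 2. A frozen reference copy of the SFT model provides log-probs for the
    #    KL penalty that keeps the policy close to the SFT distribution.
    actor_logits = actor(completion).logits
    ref_logits = ref_model(completion).logits

    # 3. The reward model scores the full prompt + completion sequence.
    reward = reward_model(completion)

    # 4. The critic (value model) estimates values for the PPO advantages.
    values = critic(completion)

    return completion, actor_logits, ref_logits, reward, values
```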

5

liyanjia92 OP t1_jdj7h0x wrote

Thanks for trying it out! This is a good example to show the difference between the RLHF'ed GPT-2 medium and the vanilla GPT-2 medium. You can see that GPT-2 medium outputs complete garbage, while the RLHF version tends to come up with some answer for the human (although it failed).

The way I see it is that the pre-trained model encodes knowledge of the world, and RLHF is just a way to align the model with human preferences for how to interact with the world.

You might have seen this tweet before: https://twitter.com/geoffreyhinton/status/1636110447442112513?s=20

So with GPT-2 medium, what we're really doing here is parenting a dumb kid instead of a "supernaturally precocious child" like GPT-3. What interests me is that RLHF actually does help parent this dumb kid to be more socially acceptable.

In other words, if we had discovered the power of alignment and RLHF earlier, we might have seen the ChatGPT moment much sooner, when GPT-2 came out in 2019.

I'm also thinking of doing the same with LLaMA to maybe build a nanoChatGPT that could actually be useful for a real-life application. Stay tuned!

4