G_fucking_G t1_jdifa1c wrote

Very interesting.

Quick question. How long does training take? For:

  • SFT Model
  • Reward Model
  • RLHF

I saw you used one 3090Ti, so was it done in hours/days/weeks?

6

liyanjia92 OP t1_jdj87cv wrote

SFT takes a bit longer, probably 8-12 hours (I'd need to check TensorBoard to verify). The reward model is faster because it only needs 1 epoch, so just a couple of hours. RLHF is the slowest because of its complexity (4 models interacting with each other); I probably need to improve the "make_experiment" part of the code, since the GPU is also often idle there. So it could take days to do just 1 epoch. I didn't finish tuning because even with RLHF on only ~10K examples it already outperforms the SFT model in terms of "human" preference.
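For context, here is a rough sketch of what the experience-collection step in a PPO-style RLHF loop looks like, and why four models are involved. All wrapper classes and method names below are illustrative placeholders, not the actual minChatGPT code:

```python
# Sketch of the experience-collection phase in PPO-style RLHF (illustrative only).
import torch

@torch.no_grad()
def collect_experience(prompts, actor, sft_ref, reward_model, critic, kl_coef=0.1):
    # 1. The policy ("actor", initialized from the SFT model) generates responses.
    #    Autoregressive generation is hard to batch efficiently, which is one
    #    reason the GPU can sit idle during this phase.
    responses = actor.generate(prompts, max_new_tokens=128)

    # 2. Log-probs from the actor and from the frozen SFT reference model;
    #    their difference gives the KL penalty that keeps the policy close to SFT.
    logprobs = actor.log_probs(prompts, responses)
    ref_logprobs = sft_ref.log_probs(prompts, responses)

    # 3. The reward model scores each (prompt, response) pair with a scalar.
    scores = reward_model.score(prompts, responses)

    # 4. The critic (value model) estimates values for the PPO advantage computation.
    values = critic.value(prompts, responses)

    # Final reward = reward-model score minus the KL penalty toward the reference.
    rewards = scores - kl_coef * (logprobs - ref_logprobs).sum(dim=-1)
    return {"responses": responses, "logprobs": logprobs,
            "values": values, "rewards": rewards}
```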

5

Puzzleheaded_Acadia1 t1_jdjukpg wrote

I'm new to this. Can you explain what the project is about, and what the SFT model, reward model, RLHF, and an epoch are?

1

liyanjia92 OP t1_jdjx0zs wrote

The project explores whether RLHF can help smaller models also produce natural-sounding output in a human/assistant conversation.

you can take a look at this Get Started section for more details: https://github.com/ethanyanjiali/minChatGPT#get-started

In short, SFT is supervised fine-tuning. The reward model is the model used to generate a reward given the language model's output (the action) during reinforcement learning. RLHF means using human feedback to set up that reinforcement learning, and an epoch means the model has seen all the data once.
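If it helps, here is a rough sketch (illustrative names only, not the actual repo code) of how SFT, the reward model, and an epoch fit together:

```python
# Illustrative sketch: SFT step, reward-model step, and what "one epoch" means.
import torch.nn.functional as F

def sft_step(model, batch, optimizer):
    # SFT: ordinary supervised next-token prediction on demonstration data.
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def reward_model_step(reward_model, batch, optimizer):
    # Reward model: trained so the human-preferred response gets a higher
    # scalar score than the rejected one (pairwise ranking loss).
    r_chosen = reward_model(batch["chosen_ids"])
    r_rejected = reward_model(batch["rejected_ids"])
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def train_one_epoch(step_fn, model, dataloader, optimizer):
    # One epoch = one full pass in which the model sees every example once.
    for batch in dataloader:
        step_fn(model, batch, optimizer)
```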

https://web.stanford.edu/class/cs224n/ could be a good class if you are new; they have a YouTube version from 2021 (though they probably didn't cover RLHF back then).

3