visarga t1_it96451 wrote

The idea is actually from a 2021 paper by the same authors. Language models usually predict the next token when they are GPT-like, and predict randomly masked words when they are BERT-like. The authors combine both objectives and discover this has a huge impact on scaling laws; in other words, we were using the wrong mix of noise to train the model. The new solution is 2x better than before.
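Very roughly, the "mix of noise" means sampling between a next-token objective and a span-corruption objective per training example. Here is a minimal data-side sketch of that idea; the sentinel token, span length and mixing ratio are placeholders I made up, not the paper's actual mixture-of-denoisers settings:

```python
import random

# Toy illustration of mixing the two objectives described above
# (GPT-style next-token prediction vs. BERT/T5-style span corruption).
# Placeholder settings only; the real recipe uses specific span lengths,
# corruption rates and sentinel tokens not reproduced here.

MASK = "<extra_id_0>"  # placeholder sentinel, T5-style

def causal_lm_example(tokens):
    # Predict each next token from the prefix before it.
    return {"input": tokens[:-1], "target": tokens[1:]}

def span_corruption_example(tokens, span_len=3):
    # Mask a random contiguous span and ask the model to reconstruct it.
    start = random.randrange(0, max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + [MASK] + tokens[start + span_len:]
    return {"input": corrupted, "target": tokens[start:start + span_len]}

def mixed_batch(sequences, causal_prob=0.5):
    # The "mix of noise": sample one objective per sequence.
    batch = []
    for toks in sequences:
        if random.random() < causal_prob:
            batch.append(causal_lm_example(toks))
        else:
            batch.append(span_corruption_example(toks))
    return batch

if __name__ == "__main__":
    seqs = [["the", "cat", "sat", "on", "the", "mat", "today"]]
    for ex in mixed_batch(seqs):
        print(ex)
```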

This paper combines with the FLAN paper, which uses 1,800 different tasks to instruction-tune the model. The hope is that learning many tasks will teach the model to generalise to new tasks. An important trick is using chain of thought; without it there is a big drop. Both methods boost the score, and together they give the largest boost.
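To give a sense of what the chain-of-thought trick looks like at the prompt level, here are two made-up templates (the wording is mine, not taken from the FLAN collection itself), one direct and one asking the model to reason first:

```python
# Hypothetical instruction-tuning style prompts, with and without
# chain of thought. Illustrative only.

QUESTION = "A bat and a ball cost $1.10 and the bat costs $1.00 more than the ball. How much is the ball?"

direct_prompt = (
    "Answer the following question.\n"
    f"Q: {QUESTION}\n"
    "A:"
)

cot_prompt = (
    "Answer the following question. Let's think step by step.\n"
    f"Q: {QUESTION}\n"
    "A:"
)

print(direct_prompt)
print(cot_prompt)
```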

They even released the FLAN models. Google is on a roll!

I tried FLAN, and it reminds me of GPT-3 in how quickly it picks up a task. It doesn't have the vast memory of GPT-3, though. So now I have on my computer a DALL-E-like model (Stable Diffusion), a GPT-3-like model (FLAN-T5-XL), plus an amazing voice recognition system, Whisper. It's hard to believe: two years on they have shrunk GPT-3, and we have voice, image and language on a regular gaming desktop.
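In case anyone wants to try the same thing, here is a minimal sketch using the Hugging Face transformers library and the public google/flan-t5-xl checkpoint; the dtype, device and generation settings are just my guesses, adjust for your hardware:

```python
# Minimal local test of FLAN-T5-XL (about 3B parameters).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    torch_dtype=torch.float16,  # halves memory on GPU
    device_map="auto",          # needs `accelerate`; drop it to run on CPU
)

prompt = "Translate to German: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```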

14

FirstOrderCat t1_it9b2t8 wrote

>The new solution is 2x better than before.

Is it only like 2 points better? They only show a very small range (6 points) on the Y axis..

2

Spoffort t1_itbjrj0 wrote

I know what you mean; look at the x-axis, where compute is. The model is not 2 times better (your point about the y-axis), it reaches a given outcome with 2 times less compute (x-axis). If you want, I can explain it further 😄
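A toy way to state it, with made-up numbers rather than values read off the actual figure:

```python
# Illustrative numbers only: the claim is about the x-axis (compute
# needed to reach a given score), not the y-axis (score itself).
baseline_flops_to_reach_score = 1.0e25    # hypothetical
new_method_flops_to_reach_score = 0.5e25  # hypothetical

savings = baseline_flops_to_reach_score / new_method_flops_to_reach_score
print(f"Same score reached with {savings:.1f}x less compute")
```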

3

FirstOrderCat t1_itc6pne wrote

It looks like they hit a point of diminishing returns somewhere around 0.5×1e25 FLOPs.

After that the model trains much more slowly. They could continue training further and claim they "saved" another 20M TPU hours.

1