cztomsik
cztomsik t1_jbgdoar wrote
Reply to comment by currentscurrents in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
But this is likely going to take forever because of LR decay, right?
cztomsik t1_jb995yy wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
And maybe it's also related to LR decay?
Another interesting thing is random sampling: at least at the start of training, it seems to help when training causal LMs.
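(One illustrative reading of "random sampling", nothing more specific than what's said above: draw each training window from a random offset in the token stream instead of walking it in fixed contiguous chunks. The function name and shapes are made up for the sketch.)

```python
import torch

def random_windows(tokens: torch.Tensor, context_len: int, batch_size: int):
    # Each batch element is a window starting at a random offset in one long
    # 1-D token stream, rather than a fixed contiguous chunk.
    starts = torch.randint(0, tokens.numel() - context_len - 1, (batch_size,)).tolist()
    inputs = torch.stack([tokens[s : s + context_len] for s in starts])
    targets = torch.stack([tokens[s + 1 : s + context_len + 1] for s in starts])  # next-token labels
    return inputs, targets
```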
cztomsik t1_j5qoc6a wrote
Reply to comment by inquisitor49 in [D] Simple Questions Thread by AutoModerator
I think it does mess them up; the ALiBi paper seems like a better solution.
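(For context, a minimal sketch of the ALiBi idea: instead of learned positional embeddings, each attention head adds a fixed linear distance penalty to the attention logits. The slope formula below is the paper's geometric sequence for a power-of-two head count; the rest is illustrative.)

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (exact as written when n_heads is a power of two).
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    dist = pos[None, :] - pos[:, None]            # (T, T), negative below the diagonal
    slopes = alibi_slopes(n_heads)                # (n_heads,)
    # Added to attention logits before softmax: distant keys get a larger penalty,
    # so no positional embeddings are needed and longer contexts extrapolate better.
    return slopes[:, None, None] * dist[None, :, :]   # (n_heads, T, T)

# Usage (hypothetical): scores = q @ k.transpose(-1, -2) / d**0.5 + alibi_bias(h, t) + causal_mask
```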
cztomsik t1_jbgexxt wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Another interesting idea might be to start training with a smaller context length (and a bigger batch size, together with random sampling); rough sketch below.
If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. It also helps to have a lot of variance at these early stages.
So it makes some sense; BERT's MLM objective is also very similar to what people do when learning languages :)
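(Purely illustrative sketch of that curriculum idea; the stage boundaries and numbers are made up, and it assumes a random-offset sampler like the `random_windows` sketch above.)

```python
# Hypothetical curriculum: grow the context length while shrinking the batch
# size so tokens per step (context_len * batch_size) stays roughly constant.
CURRICULUM = [(128, 256), (512, 64), (2048, 16)]   # (context_len, batch_size)
STEPS_PER_STAGE = 10_000

def stage_for_step(step: int) -> tuple[int, int]:
    idx = min(step // STEPS_PER_STAGE, len(CURRICULUM) - 1)
    return CURRICULUM[idx]

# In the training loop, each step would pick its window shape from the
# curriculum and draw a batch with the random-offset sampler:
#   context_len, batch_size = stage_for_step(step)
#   x, y = random_windows(tokens, context_len, batch_size)
```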