cztomsik
cztomsik t1_jbgdoar wrote
Reply to comment by currentscurrents in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
But this is likely going to take forever because of LR decay, right?
cztomsik t1_jb995yy wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
And maybe it's also related to LR decay?
Another interesting thing is random sampling: at least at the start of training, it seems to help when training causal LMs.
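(One illustrative reading of "random sampling", nothing more specific than what's said above: draw each training window from a random offset in the token stream instead of walking it in fixed contiguous chunks. The function name and shapes are made up for the sketch.)

```python
import torch

def random_windows(tokens: torch.Tensor, context_len: int, batch_size: int):
    # Each batch element is a window starting at a random offset in one long
    # 1-D token stream, rather than a fixed contiguous chunk.
    starts = torch.randint(0, tokens.numel() - context_len - 1, (batch_size,)).tolist()
    inputs = torch.stack([tokens[s : s + context_len] for s in starts])
    targets = torch.stack([tokens[s + 1 : s + context_len + 1] for s in starts])  # next-token labels
    return inputs, targets
```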
cztomsik t1_j5qoc6a wrote
Reply to comment by inquisitor49 in [D] Simple Questions Thread by AutoModerator
I think it does mess them up; the ALiBi paper seems like a better solution.
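(For context, a minimal sketch of the ALiBi idea: instead of learned positional embeddings, each attention head adds a fixed linear distance penalty to the attention logits. The slope formula below is the paper's geometric sequence for a power-of-two head count; the rest is illustrative.)

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (exact as written when n_heads is a power of two).
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    dist = pos[None, :] - pos[:, None]            # (T, T), negative below the diagonal
    slopes = alibi_slopes(n_heads)                # (n_heads,)
    # Added to attention logits before softmax: distant keys get a larger penalty,
    # so no positional embeddings are needed and longer contexts extrapolate better.
    return slopes[:, None, None] * dist[None, :, :]   # (n_heads, T, T)

# Usage (hypothetical): scores = q @ k.transpose(-1, -2) / d**0.5 + alibi_bias(h, t) + causal_mask
```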
cztomsik t1_jbgexxt wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Another interesting idea might be to start training with a smaller context length (and a bigger batch size, together with random sampling); rough sketch below.
If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. It also helps to have a lot of variance at these early stages.
So it makes some sense; BERT's MLM objective is also very similar to what people do when learning languages :)
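(Purely illustrative sketch of that curriculum idea; the stage boundaries and numbers are made up, and it assumes a random-offset sampler like the `random_windows` sketch above.)

```python
# Hypothetical curriculum: grow the context length while shrinking the batch
# size so tokens per step (context_len * batch_size) stays roughly constant.
CURRICULUM = [(128, 256), (512, 64), (2048, 16)]   # (context_len, batch_size)
STEPS_PER_STAGE = 10_000

def stage_for_step(step: int) -> tuple[int, int]:
    idx = min(step // STEPS_PER_STAGE, len(CURRICULUM) - 1)
    return CURRICULUM[idx]

# In the training loop, each step would pick its window shape from the
# curriculum and draw a batch with the random-offset sampler:
#   context_len, batch_size = stage_for_step(step)
#   x, y = random_windows(tokens, context_len, batch_size)
```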