brain_overclocked t1_jeb3l2f wrote on March 30, 2023 at 6:50 PM

A little late to the party, but if it helps, here are a couple of playlists by 3Blue1Brown on neural networks and how they're trained (the focus is on convolutional neural networks rather than transformers, but much of the math is similar):

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

https://www.youtube.com/playlist?list=PLZHQObOWTQDMp_VZelDYjka8tnXNpXhzJ

Here is the original paper on the Transformer architecture (in it, the authors mention they had a hard time getting training to converge and suggest other approaches, which have long since been put into practice):

https://arxiv.org/abs/1706.03762

And here is the Wikipedia article on it (I'd recommend following the references):

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)#Training

Not-Banksy OP t1_jeba64r wrote on March 30, 2023 at 7:32 PM

Thanks for the links!