brain_overclocked t1_jeb3l2f wrote

A little late to the party, but if it helps, here are a couple of playlists made by 3Blue1Brown about neural networks and how they're trained (although the focus is on convolutional neural networks rather than transformers, much of the math is similar):

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

https://www.youtube.com/playlist?list=PLZHQObOWTQDMp_VZelDYjka8tnXNpXhzJ

Here is the original paper on the Transformer architecture (although in it the authors mention they had a hard time converging and suggest other approaches that have long since been put into practice):

https://arxiv.org/abs/1706.03762

And here is a wiki on it (would recommend following the references):

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)#Training
