Submitted by Not-Banksy t3_126a1dm in singularity
brain_overclocked t1_jeb3l2f wrote
A little late to the party, but if it helps, here are a couple of playlists made by 3Blue1Brown about neural networks and how they're trained (although the focus is on convolutional neural networks rather than transformers, much of the math is similar):
https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
https://www.youtube.com/playlist?list=PLZHQObOWTQDMp_VZelDYjka8tnXNpXhzJ
Here is the original paper on the Transformer architecture (although in this original paper they mention they had a hard time converging and suggest other approaches that have since been put into practice):
https://arxiv.org/abs/1706.03762
And here is a wiki on it (would recommend following the references):
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)#Training
Not-Banksy OP t1_jeba64r wrote
Thanks for the links!