Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
appenz t1_jbqsu7k wrote
Both of the answers above are correct, and if you care about the structure of the transformer (i.e. depth, layers, etc.), it gets complicated.
If you only care about how compute scales with the number of weights, most transformers scale with O(weights), and a generative transformer like GPT needs approximately 2*weights FLOPs per token for a forward pass.
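To make the rule of thumb concrete, here is a rough sketch in Python. It assumes the 2*weights figure means floating-point operations per token for a forward pass (one multiply and one add per weight), and it ignores attention-over-context costs and everything else that depends on the architecture. The function names and the example model size are made up for illustration.

```python
def approx_forward_flops_per_token(num_weights: int) -> int:
    """Rule-of-thumb estimate: about 2 FLOPs per weight per token.

    Assumption: each weight contributes one multiply and one add.
    Attention over the context and other structure-dependent costs are ignored.
    """
    return 2 * num_weights


def approx_generation_flops(num_weights: int, num_tokens: int) -> int:
    """Estimated forward-pass FLOPs to process or generate num_tokens tokens."""
    return approx_forward_flops_per_token(num_weights) * num_tokens


if __name__ == "__main__":
    # Hypothetical model size, chosen only for illustration.
    weights = 1_000_000_000
    tokens = 100
    print(f"per token:  {approx_forward_flops_per_token(weights):.3e} FLOPs")
    print(f"{tokens} tokens: {approx_generation_flops(weights, tokens):.3e} FLOPs")
```

The point is only that the estimate is linear in the weight count. Anything that depends on depth, width, or context length needs the more detailed analysis from the other answers.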