Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
Hostilis_ t1_jbqh1fm wrote
In terms of layer width, the big matrix multiplications inside a single transformer layer cost O(n^2) per token, with n the width of the largest matrix in the layer, so O(c n^2) over a sequence of length c. The layers run sequentially, so depth just multiplies the cost by d. The self-attention score and weighting steps are the part that is quadratic in context length, costing O(c^2 n). So in total per forward pass: O(d (c n^2 + c^2 n)).
There is generally not much difference between different transformer architectures in terms of the computational complexity.
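To make the scaling concrete, here's a rough back-of-the-envelope sketch in plain Python. The layer shapes (four n x n attention projections, a 4n-wide MLP) and the 2-FLOPs-per-multiply-accumulate convention are my own assumptions for illustration, not taken from any particular model:

```python
def transformer_flops(n, d, c):
    """Very rough forward-pass FLOP estimate for a decoder-only transformer.

    n: model width (hidden size), d: number of layers, c: context length.
    Assumes 4 attention projection matrices (n x n each) plus a 4n-wide MLP,
    and counts 2 FLOPs per multiply-accumulate.
    """
    # O(c * n^2): token-wise matmuls (Q/K/V/O projections + MLP)
    per_layer_matmuls = 2 * c * (4 * n * n + 8 * n * n)
    # O(c^2 * n): attention itself (Q @ K^T and attn_weights @ V)
    per_layer_attention = 2 * 2 * (c * c * n)
    return d * (per_layer_matmuls + per_layer_attention)

# Doubling the context roughly doubles the matmul term
# but quadruples the attention term:
print(transformer_flops(n=1024, d=24, c=2048))
print(transformer_flops(n=1024, d=24, c=4096))
```

Plugging in different n, d, c shows which term dominates: for short contexts the O(c n^2) matmuls dwarf attention, and only at large c does the O(c^2 n) part take over.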
multiverseportalgun t1_jbr55gh wrote
Quadratic 🤢
Hostilis_ t1_jbr5iul wrote
Yeah quadratic scaling in context length is a problem lol. Hopefully RWKV will come to the rescue.