Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
Hostilis_ t1_jbqh1fm wrote
In terms of layer width, the big matrix multiplications inside a single transformer layer cost O(n^2) per token, with n the width of the largest matrix in the layer, so O(c n^2) over a sequence of length c. The layers run sequentially, so depth just multiplies the cost by d. The self-attention score and weighting steps are the part that is quadratic in context length, costing O(c^2 n). So in total per forward pass: O(d (c n^2 + c^2 n)).
There is generally not much difference between different transformer architectures in terms of the computational complexity.
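To make the scaling concrete, here's a rough back-of-the-envelope sketch in plain Python. The layer shapes (four n x n attention projections, a 4n-wide MLP) and the 2-FLOPs-per-multiply-accumulate convention are my own assumptions for illustration, not taken from any particular model:

```python
def transformer_flops(n, d, c):
    """Very rough forward-pass FLOP estimate for a decoder-only transformer.

    n: model width (hidden size), d: number of layers, c: context length.
    Assumes 4 attention projection matrices (n x n each) plus a 4n-wide MLP,
    and counts 2 FLOPs per multiply-accumulate.
    """
    # O(c * n^2): token-wise matmuls (Q/K/V/O projections + MLP)
    per_layer_matmuls = 2 * c * (4 * n * n + 8 * n * n)
    # O(c^2 * n): attention itself (Q @ K^T and attn_weights @ V)
    per_layer_attention = 2 * 2 * (c * c * n)
    return d * (per_layer_matmuls + per_layer_attention)

# Doubling the context roughly doubles the matmul term
# but quadruples the attention term:
print(transformer_flops(n=1024, d=24, c=2048))
print(transformer_flops(n=1024, d=24, c=4096))
```

Plugging in different n, d, c shows which term dominates: for short contexts the O(c n^2) matmuls dwarf attention, and only at large c does the O(c^2 n) part take over.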
multiverseportalgun t1_jbr55gh wrote
Quadratic 🤢
Hostilis_ t1_jbr5iul wrote
Yeah quadratic scaling in context length is a problem lol. Hopefully RWKV will come to the rescue.