
Hostilis_ t1_jbqh1fm wrote

In terms of layer width, the matrix multiplications inside a single transformer layer are O(n^2) in n, the width of the largest weight matrix, and they are applied at every position, so they contribute O(c n^2) for context length c. The attention score computation is instead quadratic in context length: O(c^2 n). The architecture is sequential, so depth multiplies the per-layer cost by d. In total: O(d (c n^2 + c^2 n)), i.e. quadratic in width, quadratic in context length, and linear in depth.
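To make the scaling concrete, here is a rough FLOP-count sketch for one decoder-style layer. The constant factors (MLP expansion ratio of 4, factor 2 per multiply-add) are illustrative assumptions, not exact for any particular architecture; only the asymptotic terms matter.

```python
def layer_cost(n, c, mlp_ratio=4):
    """Rough FLOP estimate for one transformer layer.

    n: model width (hidden dimension), c: context length.
    Constants are illustrative; the point is the asymptotic terms.
    """
    proj = 4 * c * n * n               # Q, K, V, and output projections: O(c n^2)
    attn = 2 * c * c * n               # attention scores + weighted sum: O(c^2 n)
    mlp = 2 * mlp_ratio * c * n * n    # two MLP matmuls at expansion ratio 4: O(c n^2)
    return proj + attn + mlp

def total_cost(n, c, d):
    """Depth d just multiplies the per-layer cost: O(d (c n^2 + c^2 n))."""
    return d * layer_cost(n, c)

# Doubling c doubles the O(c n^2) terms but quadruples the O(c^2 n) term,
# so the attention-score cost dominates at long context lengths.
```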

There is generally little difference between transformer variants in terms of asymptotic computational complexity.
