Submitted by AutoModerator t3_122oxap in MachineLearning
Matthew2229 t1_jduyi8o wrote
Reply to comment by masterofn1 in [D] Simple Questions Thread by AutoModerator
It's a memory issue. The attention matrix is N×N for a sequence of length N, so memory scales quadratically, and we simply run out of memory for long sequences. Most of the development around transformers/attention has targeted this specific problem.
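For intuition, here's a minimal NumPy sketch (my own illustration, not anything specific to one model) of naive scaled dot-product attention. The `scores` array has shape (N, N), which is where the quadratic memory blow-up comes from:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention that materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # shape (N, N) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # shape (N, d)

# Memory of the score matrix alone (float32), per head, for a few sequence lengths:
for N in (1_024, 16_384, 131_072):
    print(f"N={N:>7}: {N * N * 4 / 2**30:.3f} GiB")
```

At N=16k the scores alone are already ~1 GiB per head per layer, and it's 64 GiB at 131k, which is why approaches that avoid materializing the full matrix (or approximate it) get so much attention.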