Haycart t1_jdtcnc5 wrote on March 27, 2023 at 1:00 AM

Reply to comment by liqui_date_me in [D] GPT4 and coding problems by enryu42

Where are they getting O(1) from? Has some new information been released regarding GPT-4's architecture?

The standard attention mechanism in a transformer decoder (e.g. GPT-1 through GPT-3) has a time complexity of O(N^2), where N is the combined input and output sequence length. Computing the output autoregressively introduces another factor of N for a total of O(N^3).

There are fast attention variants with lower time complexity, but has there been any indication that GPT-4 actually uses these? And in any case, I'm not aware of any fast attention variant that could be described as having O(1) complexity.

visarga t1_jdtypz6 wrote on March 27, 2023 at 4:15 AM

Doesn't autoregressive decoding cache the states for the previous tokens when decoding a new token?

Haycart t1_jdu7hlp wrote on March 27, 2023 at 5:55 AM

Oh, you are probably correct. So it'd be O(N^2) overall for autoregressive decoding, which still exceeds the O(n log n) that the linked post says is required for multiplication.
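
To make the two estimates in this thread concrete, here is a minimal sketch that counts query-key dot products for a single attention head in a single layer, with and without caching the keys and values of earlier tokens. The function names are hypothetical, the count assumes causal masking, and it ignores feed-forward layers and the per-dot-product cost, so it illustrates only how the work grows with N rather than the cost of any real model.

```python
def scores_without_cache(n: int) -> int:
    """Dot products when every decoding step recomputes attention for the whole prefix."""
    total = 0
    for t in range(1, n + 1):
        # Step t reprocesses all t positions; with causal masking,
        # position i attends to i keys, giving 1 + 2 + ... + t dot products.
        total += t * (t + 1) // 2
    return total


def scores_with_cache(n: int) -> int:
    """Dot products when keys and values of earlier tokens are cached."""
    total = 0
    for t in range(1, n + 1):
        # Only the newest token's query is computed; it attends to t keys.
        total += t
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, scores_without_cache(n), scores_with_cache(n))
```

The uncached count equals N(N+1)(N+2)/6, which grows as O(N^3), while the cached count equals N(N+1)/2, which grows as O(N^2). That matches the correction above: caching removes one factor of N but leaves quadratic growth.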