Haycart

Haycart t1_je6grih wrote

>The Transformer is not a universal function approximator. This is simply shown by the fact that it cannot process arbitrary long input due to the finite context limitations.

We can be more specific, then: the transformer is a universal function approximator* on the space of sequences that fit within its context. I don't think this distinction is necessarily relevant to the point I'm making, though.

*again with caveats regarding continuity etc.
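Roughly, the claim I have in mind looks like the following (a loose paraphrase of the sequence-to-sequence universal approximation results for transformers, e.g. Yun et al., "Are Transformers Universal Approximators of Sequence-to-Sequence Functions?"; the exact function class and choice of norm are glossed over here):

```latex
% Loose statement: fix a maximum context length L and a compact set K of
% input sequences of length at most L. Then for any continuous target
% function f on K and any tolerance epsilon > 0, there exist transformer
% parameters theta such that the approximation error over K is below
% epsilon (in a suitable norm -- the continuity caveats above still apply):
\forall \varepsilon > 0 \;\; \exists\, \theta : \quad
  \bigl\lVert T_\theta - f \bigr\rVert_{K} < \varepsilon
```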

>Your conclusion is not at all obvious or likely given your facts. They seem to be in hindsight given the strong performance of large models.

Guilty as charged, regarding hindsight. I won't claim to have predicted GPT-3's performance a priori. That said, my point was never that the strong performance we've observed from recent LLMs was obvious or likely--only that it shouldn't be surprising. And, in particular, it should not be surprising that a GPT model (not necessarily GPT-3 or 4) trained on a language modeling task would have the abilities we've seen. Everything we've seen falls well within the bounds of what transformers are theoretically capable of doing.

There are, of course, aspects of the current situation specifically that you can be surprised about. Maybe you're surprised that 100 billion-ish parameters is enough, or that the current volume of training data was sufficient. My argument is mostly aimed at claims along the lines of "GPT-n can't do X because transformers lack capability Y" or "GPT-n can't do X because it is only trained to model language".

Haycart t1_je4923c wrote

>Yes, ChatGPT is doing much more than querying text! It is not just a query engine on a giant corpus of text. … Duh! I do not think you should only think of ChatGPT as a query engine on a giant corpus of text. There can be a lot of value in reasoning about ChatGPT anthropomorphically or in other ways. RLHF also complicates the story, as over time it weighs responses away from the initial training data. But “query engine on a giant corpus of text” should be a non-zero part of your mental model because, without it, you cannot explain many of the things ChatGPT does.

The author seems to present a bizarre dichotomy: either you have to think of ChatGPT as a query engine, or you have to think of it in magical/mystical/anthropomorphic terms.

(They also touch on viewing ChatGPT as a function on the space of "billion dimensional" embeddings. This is closer to the mark, but it conflates the model's parameter count with the dimensionality of its latent space--GPT-3, for example, has on the order of 175 billion parameters but a hidden dimension of only about 12,000--which doesn't exactly inspire confidence in the author's level of understanding.)

Why not just think of ChatGPT as what it is--a very large transformer?

The fact that a model like ChatGPT is able to do what it does is not at all surprising, IMO, when you consider the following:

  1. Transformers (and neural networks in general) are universal approximators. A sufficiently large neural network can approximate any function to arbitrary precision (with a few minor caveats).
  2. Neural networks trained with stochastic gradient descent benefit from implicit regularization -- SGD naturally tends to seek out simple solutions that generalize well. Furthermore, larger neural networks appear to generalize better than smaller ones. (There's a toy sketch of points 1 and 2 right after this list.)
  3. The recent GPTs have been trained on a non-trivial fraction of the entire internet's text content.
  4. Text on the internet (and language data in general) arises from human beings interacting with the world--reasoning, thinking, and emoting about those interactions--and attempting to communicate the outcome of this process to one another.
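To make points 1 and 2 concrete, here's a toy sketch--nothing to do with GPT or transformers specifically, just a minimal numpy network with ad hoc hyperparameters I picked for illustration, fitting a smooth target function via plain minibatch SGD:

```python
import numpy as np

# Toy illustration of points 1 and 2: a small one-hidden-layer network,
# trained with plain minibatch SGD, approximating a smooth 1-D function.
# Hyperparameters are ad hoc; this is a sketch, not a careful experiment.
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(x)  # the "unknown" target function we want to approximate

hidden = 64
W1 = rng.normal(0, 1.0, (1, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(10_000):
    idx = rng.integers(0, len(x), size=32)   # sample a minibatch
    xb, yb = x[idx], y[idx]
    h = np.tanh(xb @ W1 + b1)                # forward pass
    pred = h @ W2 + b2
    err = pred - yb
    # backprop for mean squared error
    dW2 = h.T @ err / len(xb); db2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    dW1 = xb.T @ dh / len(xb); db1 = dh.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

final_mse = np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2)
print("final MSE:", float(final_mse))
```

The point isn't the specific network--it's that "enough capacity to represent the target" plus "SGD reliably finding a decent fit" are the same two ingredients the argument above leans on, just at a vastly larger scale.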

Is it really crazy to imagine that the simplest possible function capable of fitting a dataset as vast as ChatGPT's might resemble the function that produced it? A function that subsumes, among other things, human creativity and reasoning?

In another world, GPT-3 or 4 might have turned out to be incapable of approximating that function to any notable degree of fidelity. But even then, it wouldn't be outlandish to imagine that one of the later members of the GPT family could eventually succeed.

Haycart t1_jdu7hlp wrote

Reply to comment by visarga in [D] GPT4 and coding problems by enryu42

Oh, you are probably correct. So it'd be O(N^2) overall for autoregressive decoding. That still exceeds the O(n log n) that the linked post says is required for multiplication, though.
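For anyone following along, here's a toy op count of the two regimes (purely illustrative: the per-step costs are stand-ins and only the growth rates matter; I'm assuming the correction above is about caching keys/values, so that each new token attends to the existing prefix only once):

```python
# Compare rough operation counts for autoregressive decoding with a
# standard decoder: recomputing attention over the whole prefix at every
# step vs. reusing a KV cache. Constants are ignored; only growth matters.

def naive_decode_ops(n_tokens: int) -> int:
    """Recompute full self-attention over the length-t prefix at step t."""
    return sum(t * t for t in range(1, n_tokens + 1))  # ~N^3/3 -> O(N^3)

def cached_decode_ops(n_tokens: int) -> int:
    """With a KV cache, step t is one new query against t cached keys."""
    return sum(t for t in range(1, n_tokens + 1))      # ~N^2/2 -> O(N^2)

for n in (128, 256, 512):
    print(n, naive_decode_ops(n), cached_decode_ops(n))
```

Doubling N multiplies the cached count by roughly 4 and the naive count by roughly 8, which is exactly the O(N^2) vs. O(N^3) distinction being discussed in this thread.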

Haycart t1_jdtcnc5 wrote

Where are they getting O(1) from? Has some new information been released regarding GPT-4's architecture?

The standard attention mechanism in a transformer decoder (e.g. GPT 1-3) has a time complexity of O(N^2) w.r.t. the combined input and output sequence length. Computing the output autoregressively introduces another factor of N for a total of O(N^3).

There are fast attention variants with lower time complexity, but has there been any indication that GPT-4 actually uses these? And in any case, I'm not aware of any fast attention variant that could be described as having O(1) complexity.
