
csreid t1_irxfue3 wrote

IMO, transformers are significantly less simple and more "hand-crafted" than LSTMs.

The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute reaches a point where you can just learn it. Cross attention and all this special architecture to help a model capture intra-series information is definitely being clever compared to LSTMs (or RNNs in general), which just give the network a way to keep some information around when presented with things in series.
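To make the contrast concrete, here's a minimal sketch (PyTorch assumed; the shapes and sizes are arbitrary) of the two mechanisms: an LSTM carrying information forward step by step in a hidden state, versus attention letting every position look at every other position directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 32, 64)  # (batch, sequence length, features) -- made-up sizes

# RNN/LSTM view: information is carried forward step by step in a hidden state.
lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
out_lstm, (h, c) = lstm(x)  # h and c are the state that "keeps some information around"

# Attention view: every position attends to every other position in one shot.
q = k = v = x  # self-attention over the same sequence
out_attn = F.scaled_dot_product_attention(q, k, v)  # (batch, seq, features)
```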

4

harharveryfunny t1_irxuxr9 wrote

Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.

I think many people, myself included, were surprised by the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc.) is needed to be REALLY REALLY good at "predict next word", but it was not at all obvious that something as relatively simple as a transformer would be sufficient to learn that.
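For anyone who hasn't looked at it, "predict next word" really is just this objective (a rough sketch; vocabulary size and shapes are made up):

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(8, 128, vocab_size)      # model output: (batch, seq, vocab)
tokens = torch.randint(vocab_size, (8, 129))  # token ids, one longer than the inputs

# Each position is scored on how well it predicted the token that actually came next.
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
```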

At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.
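For a sense of how minimal it is conceptually: a single transformer block is roughly just attention plus a position-wise MLP, with residual connections and normalization. This is only a sketch (the pre-norm layout, sizes, and head count are illustrative choices, not the one true recipe):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # attention over the sequence, then a position-wise MLP, each with a residual
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        x = x + self.mlp(self.norm2(x))
        return x
```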

1