harharveryfunny t1_irvssm1 wrote

It seems transformers really have two fundamental advantages over LSTMs:

  1. By design (specifically to address the shortcomings of recurrent models), they are much more efficient to train, since the tokens of a sequence can be processed in parallel rather than step by step. Positional encoding also lets transformers represent word order explicitly, which is critical for language (see the rough sketch after this list).
  2. Transformers scale up very successfully. Per Rich Sutton's "Bitter Lesson", dumb methods that can usefully absorb more compute and data generally beat more highly engineered "smart" methods. I wouldn't argue that transformers are architecturally simpler than LSTMs, but as GPT-3 showed, they do scale very successfully - increasing performance while remaining relatively easy to train.
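
Here's a minimal sketch (plain NumPy, made-up shapes, loosely following the sinusoidal scheme from the original transformer paper) of why those two points go together: position is stamped into the embeddings, so the otherwise order-agnostic attention layers can see the whole sequence at once instead of unrolling it step by step like an LSTM.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding, in the style of the original transformer paper."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dim = np.arange(d_model)[None, :]                    # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions get sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions get cos
    return pe

# Add position information to the token embeddings; after this, every layer
# can attend over all 128 positions in parallel (one big matmul), which is
# what makes transformer training so much easier to parallelize than an
# LSTM's step-by-step recurrence.
tokens = np.random.randn(128, 512)        # (seq_len, d_model), made-up sizes
tokens = tokens + sinusoidal_positional_encoding(128, 512)
```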

The broader point of your criticism is still valid, though. Not sure whether it's fair or not, but I tend to look at DeepMind's recent matrix multiplication paper that way - they tout it as a success of "AI" and RL, when it's really not at all apparent what RL is adding here. Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

44

sambiak t1_irwzqdv wrote

> Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

I think you're underestimating the difficulty of exploring an enormous state space. The state space of this problem is larger than that of Go or chess.

Reinforcement learning specializes in finding good solutions when only a small subset of the state space can be explored. You're quite right that Monte Carlo tree search would work here, because that's exactly what they used ^ ^

> Similarly to AlphaZero, AlphaTensor uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure.

That said, you do need a good way to guide this MCTS, and a neural network is a great way to evaluate how promising a given state is. But then you've got a new problem: how do you train that neural network? And so on. It's not trivial, and frankly even the best tools have quite a few weaknesses.
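
For a sense of what "a neural network guiding MCTS" means in practice, here's a rough, single-player sketch in the AlphaZero/PUCT style. `network` and `env` are placeholders I made up for illustration, not anything from the AlphaTensor paper, and real implementations add exploration noise, batching, and sign handling for two-player games.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior            # P(s, a) from the policy head
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}            # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent, child, c_puct=1.5):
    # Exploitation term (average value) plus an exploration bonus that is
    # large for high-prior, rarely visited children.
    explore = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + explore

def run_simulation(root, state, network, env):
    """One simulation: select with PUCT, expand with the network, back up its value."""
    path, node = [root], root
    # 1. Selection: walk down the tree following the PUCT rule.
    while node.children:
        action, node = max(node.children.items(),
                           key=lambda kv: puct_score(path[-1], kv[1]))
        state = env.step(state, action)
        path.append(node)
    # 2. Expansion + evaluation: the network replaces random rollouts.
    priors, value = network(state)    # dict of action -> prior, plus a value estimate
    for action, p in priors.items():
        node.children[action] = Node(prior=p)
    # 3. Backup: propagate the value estimate up the visited path.
    for visited in path:
        visited.visit_count += 1
        visited.value_sum += value
```

In the AlphaZero recipe the network is then trained to match the search's visit counts (policy) and the eventual outcome (value), which is exactly where the "how do you train it" problem comes back in.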

But no, evolutionary algorithms would not be easier, because you still need a fitness function. Once again you can use a neural network to approximate it, and once again you run into training issues. As far as I know, evolutionary algorithms are simply worse than MCTS at the moment, until someone figures out a better way to approximate fitness functions.
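
For comparison, a bare-bones evolutionary loop looks something like this (all the function arguments are made-up placeholders). Everything hinges on the `fitness` call: if evaluating it exactly is too expensive, you're back to approximating it with a learned model and then training that model.

```python
import random

def evolutionary_search(init, mutate, fitness, generations=1000, pop_size=64, n_elite=8):
    """Minimal elitist evolutionary loop: keep the best, mutate them, repeat."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        # Every candidate gets scored by the fitness function - the expensive part.
        elites = sorted(population, key=fitness, reverse=True)[:n_elite]
        population = elites + [mutate(random.choice(elites))
                               for _ in range(pop_size - n_elite)]
    return max(population, key=fitness)
```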

19

csreid t1_irxfue3 wrote

IMO, transformers are significantly less simple and more "hand-crafted" than LSTMs.

The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute reaches a point where you can just learn it. Cross-attention and all the special architecture that helps a model capture intra-series information is definitely being clever compared to an LSTM (or RNNs in general), which just gives the network a way to keep some information around when presented with things in series.
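
To make the comparison concrete, here are toy, single-head / single-step versions of the two building blocks (NumPy, no batching, shapes assumed to line up), just to show what kind of structure each one bakes in:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: every position is mixed with every other
    # position, with mixing weights computed from learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def lstm_step(x, h, c, W, U, b):
    # One LSTM step: gates decide what to forget, what to write into the cell
    # state, and what to expose - i.e. "keep some information around".
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o, g = np.split(W @ x + U @ h + b, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c
```

Both have hand-built structure; the argument is about whether gating over time or attention over positions (plus positional encodings, multiple heads, layer norm, and the rest) counts as more "clever".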

4

harharveryfunny t1_irxuxr9 wrote

Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.

I think many people, myself included, were surprised by the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc.) is needed to be REALLY REALLY good at "predict the next word", but it was not at all obvious that something as relatively simple as a transformer would be sufficient to learn that.

At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.

1