CommunismDoesntWork t1_irwxgxk wrote on October 11, 2022 at 5:30 PM

Reply to comment by _Arsenie_Boca_ in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn

>Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them?

That's like asking if B-trees are actually better than red black trees, or if modern CPUs and their large caches just happen to lead to better performance. It doesn't matter. If one algorithm works theoretically but doesn't scale, then it might as well not work. It's the same reason no one uses fully connected networks even though they're universal function approximators.

_Arsenie_Boca_ t1_irwzk3j wrote on October 11, 2022 at 5:43 PM

The point is that you cannot confirm the superiority of an architecture (or whatever component) when you change multiple things. And yes, it does matter where an improvement comes from, it is the only scientfically sound method to improve. Otherwise we might as well try random things until we find something that works.

To come back to LSTM vs Transformers: Im not saying LSTMs are better or anything. Im just saying that if LSTMs would have received the amount of engineering attention that went into making transformers better and faster, who knows if they might be similarly successful?

_Arsenie_Boca_ t1_irx1c80 wrote on October 11, 2022 at 5:54 PM

In fact, here is a post of someone who apparently found pretty positive results about scaling up recurrent models to billions of parameters https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button

visarga t1_irzdrho wrote on October 12, 2022 at 4:14 AM

> if LSTMs would have received the amount of engineering attention that went into making transformers better and faster

There was a short period when people were trying to improve LSTMs using genetic algorithms or RL.

An Empirical Exploration of Recurrent Network Architectures (2015, Sutskever)
LSTM: A Search Space Odyssey (2015, Schmidhuber)
Neural Architecture Search with Reinforcement Learning (2016, Quoc Le)

The conclusion was that the LSTM cell is somewhat arbitrary and many other architectures work just as well, but none much better. So people stuck with classic LSTMs.

CommunismDoesntWork t1_is0wj7j wrote on October 12, 2022 at 2:28 PM

If an architecture of more scalable, then it's the superior architecture.