csreid

csreid t1_irxfue3 wrote

Imo, transformers are significantly less simple and more "hand-crafted" than LSTMs.

The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute will reach a point where you can just learn it. Cross attention and all this special architecture to help a model capture intra-series information is definitely "being clever" compared to an LSTM (or RNNs in general), which just gives the network a way to keep some information around when presented with things in series.
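To make the contrast concrete, here's a minimal PyTorch sketch (the shapes, head count, and second "context" series are just illustrative choices of mine, not from any particular paper):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 32, 64
x = torch.randn(batch, seq_len, d_model)

# LSTM: one generic recurrence -- a learned hidden state carried along the
# sequence is the only mechanism for keeping information around.
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model, batch_first=True)
lstm_out, (h_n, c_n) = lstm(x)  # h_n is the "kept around" information

# Cross attention: explicitly engineered query/key/value machinery (and in a
# full transformer, also positional encodings, multi-head splits, masking...).
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
context = torch.randn(batch, seq_len, d_model)  # hypothetical second series
attn_out, attn_weights = attn(query=x, key=context, value=context)

print(lstm_out.shape, attn_out.shape)  # both: torch.Size([8, 32, 64])
```

Same input/output shapes either way; the difference is how much structure you've baked in by hand.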

4

csreid t1_irg0y7f wrote

I kinda get where OP is coming from, though. With all the pop-sci ML stuff and big press releases for popular consumption hitting very shortly after actual publication, there's always a risk that some manager will be like "hey, I just read about Stable Diffusion on Twitter, can we use it to do this?" and then you're a deer in headlights because you weren't at the press conference where they introduced it and you have no idea what the manager is even talking about.

2