Submitted by fromnighttilldawn t3_y11a7r in MachineLearning

Critiques of an ML approach, technique, implementation, or reproducibility, or of an entire field of research, can often be just as enlightening as ML surveys, if not more so.

I think this is because they usually point out what the field is ignoring, or show that a popular practice or belief is unsound or useless.

Some famous examples are:

Troubling Trends in ML https://arxiv.org/pdf/1807.03341.pdf

ML that Matters https://arxiv.org/abs/1206.4656

On the Convergence of ADAM https://arxiv.org/abs/1904.09237

On the Information Bottleneck https://iopscience.iop.org/article/10.1088/1742-5468/ab3985

Implementation Matters in Deep Policy Gradients https://arxiv.org/abs/2005.12729 (showed that a purported algorithmic gain was actually due mainly to code-level optimizations)

Critique of Turing Award https://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html (basically a critique of citation practices in ML)

Deep Learning: A Critical Appraisal https://arxiv.org/abs/1801.00631

However, these are a little bit dated.

Does anyone have any recent critique papers of a similar flavour to the ones I've provided above? (or would you rather offer your original critique in the comments ;) )

131

Comments


_Arsenie_Boca_ t1_irvjdtn wrote

I don't have the papers on hand that investigate this, but here are two things that don't make me proud of being part of this field.

Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them? More generally, papers tend to make many changes to a system and credit the improvement to the thing they are most proud of without a fair comparison.

Non-open-source models like GPT-3 don't make their training dataset public. People evaluate performance on benchmarks, but nobody can say for sure whether the benchmark data was in the training data. ML used to be very cautious about data leakage, but this is simply ignored in most cases when it comes to those models.
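To make that concrete, here's a rough sketch (my own toy code with made-up names, not anything these labs actually run or publish) of the kind of n-gram contamination check that is only possible when the training corpus is available:

```python
# Toy contamination check: flag benchmark examples that share a long n-gram
# with the (hypothetically available) training corpus. All names are made up.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs, n: int = 13) -> set:
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def flag_contaminated(benchmark_examples, train_index, n: int = 13):
    # Any shared 13-gram is treated as evidence of leakage.
    return [ex for ex in benchmark_examples if ngrams(ex, n) & train_index]

train_index = build_train_index(["...training documents would go here..."])
print(flag_contaminated(["...benchmark questions would go here..."], train_index))
```

Without the training data, none of us can run even this crude a check on those models.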

91

harharveryfunny t1_irvssm1 wrote

It seems transformers really have two fundamental advantages over LSTMs:

  1. By design (specifically to improve on the shortcomings of recurrent models), they are much more efficient to train since samples can be presented in parallel. Also, positional encoding allows transformers to deal more accurately with positional structure, which is critical for language (see the sketch after this list).
  2. Transformers scale up very successfully. Per Rich Sutton's "Bitter Lesson", generally dumb methods that scale up in terms of ability to usefully absorb compute and data do better than more highly engineered "smart" methods. I wouldn't argue that transformers are any simpler in architecture than LSTMs, but as GPT-3 proved they do scale very successfully - increasing performance while still being relatively easy to train.
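For what it's worth, here's a minimal NumPy sketch of the sinusoidal positional encoding from the original transformer paper (my own toy version, so treat the details as illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings so every position gets a distinct signature,
# after which the whole sequence can be processed in parallel, unlike an
# LSTM's step-by-step recurrence.
embeddings = np.random.randn(128, 512)                   # toy example, no batch dim
embeddings = embeddings + sinusoidal_positional_encoding(128, 512)
```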

The context of your criticism is still valid though. Not sure whether it's fair or not, but I tend to look at DeepMind's recent matrix multiplication paper like that - they are touting it as a success of "AI" and RL, when really it's not at all apparent what RL is adding here. Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

44

sambiak t1_irwzqdv wrote

> Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

I think you're underestimating the difficulty of exploring an enormous state space. The state space of this problem is bigger than the one in go or chess.

Reinforcement Learning specializes in finding good solutions when only a small subset of the state space can be explored. You're quite right that Monte Carlo Tree Search would work here, because that's exactly what they used ^ ^

> Similarly to AlphaZero, AlphaTensor uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure.

That said, you do need a good way to guide this MCTS, and a neural network is a great solution for evaluating how good a given state is. But then you've got a new problem: how do you train this neural network? And so on. It's not trivial, and frankly even the best tools have quite some weaknesses.
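As a toy illustration of what "guiding" means here (my own simplified AlphaZero-style selection rule, not AlphaTensor's actual code), the network's value and policy outputs enter the search roughly like this:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                    # policy-network probability for the action leading here
    visits: int = 0
    value_sum: float = 0.0          # accumulated value-network estimates from simulations
    children: dict = field(default_factory=dict)

def puct_score(child: Node, parent_visits: int, c_puct: float = 1.5) -> float:
    """Exploit the learned value estimate, explore in proportion to the prior."""
    q = child.value_sum / child.visits if child.visits else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

def select_child(node: Node):
    # Pick the action whose child maximizes the PUCT score during tree descent.
    return max(node.children.items(),
               key=lambda kv: puct_score(kv[1], node.visits))
```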

But no, evolutionary algorithms would not be easier, because you still need a fitness function, and once again you can use neural networks to approximate it, but then you run into training issues once again. As far as I know, evolutionary algorithms are just worse than MCTS at the moment, until someone figures out a better way to approximate fitness functions.

19

csreid t1_irxfue3 wrote

IMO, transformers are significantly less simple and more "hand-crafted" than LSTMs.

The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute will reach a point where you can just learn it. Cross-attention and all this special architecture to help a model capture intra-series information is definitely being clever compared to LSTMs (or RNNs in general), which just give the network a way to keep some information around when presented with things in series.

4

harharveryfunny t1_irxuxr9 wrote

Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.

I think many people, myself included, were surprised by the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc, etc) is needed to be REALLY REALLY good at "predict the next word", but it was not at all obvious that something as relatively simple as a transformer was sufficient to learn that.

At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.

1

elbiot t1_irwyleo wrote

The fact that you can throw a bunch of compute at transformers is part of their superiority. Even if it's the only factor, it's really important.

26

_Arsenie_Boca_ t1_irx1ubl wrote

That's definitely a fair point (although you can do that with recurrent models as well, see the Reddit link in my other comment). Anyway, the more general point about multiple changes stands; maybe I chose a bad example.

3

nickkon1 t1_irxid6a wrote

> ML used to be very cautious about data leakage, but this is simply ignored in most cases when its about those models.

I work on economic stuff. Either I am super unlucky or the number of papers that have data leakage is incredibly high. A decent chunk of papers that try to predict some macro-economic quantity one quarter ahead don't leave a gap of one quarter between their training data and the prediction. Their backtest is awesome, the error is small, nice, a new paper! But it can't be used in production, because how can I train a model on 1 September 2022 if I need the data from 1 October to 31 December for my target value?

It is incredibly frustrating. There have been papers, master's theses, and even a dissertation that did this. I'm incredibly frustrated and have stopped trusting anything without code/data.

16

scarynut t1_irxshd1 wrote

I noticed this in a lot of YouTube stock prediction tutorials. Made me conclude that people are idiots. Shocking that this mistake makes its way into papers.

7

popcornn1 t1_is03bja wrote

Sorry, but I can't understand your comment. What do you mean by "don't leave a gap"? How do they make the forecast, then? Training data from January 2021 to December 2021 and then a forecast from October 2021 to December 2021????

1

nickkon1 t1_is09o1x wrote

A lot of papers, articles, and YouTube videos on time series have the premise:
Our data is dependent on time. Not only does new data come in regularly, it might also happen that the coefficients of our model change over time, and important features in 2020 (e.g. the number of people who are ill with covid) are less relevant now in 2022. To combat that, you retrain your model at regular intervals. Let us retrain our model daily.
That is totally fine and a sensible approach.

The key is: How far into the future do you want to predict something?


Because a lot of Medium, Towards Data Science, and plenty of other blogs do exactly that: let us try to predict the 7-day return of a stock.

To train a new model today at t_{n}, I need data from the next week. But since I can't see into the future and do not know the future 7-day return of my stock, I don't have my y variable. The same holds for time step t_{n-1} and so on, until I reach time step t_{n-prediction window}. Only there can I calculate the future 7-day return of my stock with today's information.
This means that the last data point of my training data always lags my evaluation date by 7 days.

The issue is: this becomes a problem only at your most recent data points (specifically the last #{prediction window} data points). Since you are writing a blog or publishing a paper... who cares? You don't actually use that model daily for your business anyway. But: you can still train on those points in your backtest, where you iterate through each time step t_{i}, take the last 2 years of training data up until t_{i}, and make your prediction.

Your backtest is suddenly a lot better, your error becomes smaller, BAM, 80% accuracy on a stock prediction! You beat the live-tested performance of your competition! It is a great achievement, so let us write a paper about it! But the reality is: your model is actually unusable in a live setting, and the errors you reported from your backtest are wrong. The reason is a subtle way of accidentally giving your model information about the future. Throughout the whole backtest you have retrained your model's parameters at time t_{i} with data about your target variable from t_{i+1} to t_{i+prediction_window-1}. You need a gap between your training data and your validation/test data.

Specifically in numbers (YYYY-MM-DD):
Wrong:
Training: 2020-10-10 - 2022-10-10
You try to retrain your model on 2022-10-10 and make a prediction on that date.

Correct:
Training: 2020-10-03 - 2022-10-03
You retrain your model on 2022-10-10 and make a prediction on that date. Notice that the last data point of your training data is not today, but today - #{prediction window}
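A minimal sketch of the two setups in code (toy pandas data and my own variable names, just to illustrate the gap):

```python
import pandas as pd

HORIZON = 7  # prediction window in days

# Toy daily price series and the future 7-day return used as the target
prices = pd.Series(range(100, 1100),
                   index=pd.date_range("2020-01-01", periods=1000), dtype=float)
target_7d_return = prices.shift(-HORIZON) / prices - 1

def training_window(eval_date, gap_days, lookback_days=730):
    # The last training date must lag the evaluation date by the prediction
    # horizon, otherwise the targets near the end of the window already
    # contain prices from after the evaluation date.
    end = eval_date - pd.Timedelta(days=gap_days)
    return end - pd.Timedelta(days=lookback_days), end

eval_date = pd.Timestamp("2022-10-10")

wrong_start, wrong_end = training_window(eval_date, gap_days=0)        # trains up to 2022-10-10: leaks
right_start, right_end = training_window(eval_date, gap_days=HORIZON)  # trains up to 2022-10-03: no leak

print(wrong_end.date(), right_end.date())
```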

4

CommunismDoesntWork t1_irwxgxk wrote

>Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them?

That's like asking if B-trees are actually better than red-black trees, or if modern CPUs and their large caches just happen to lead to better performance. It doesn't matter. If one algorithm works in theory but doesn't scale, then it might as well not work. It's the same reason no one uses fully connected networks, even though they're universal function approximators.

2

_Arsenie_Boca_ t1_irwzk3j wrote

The point is that you cannot confirm the superiority of an architecture (or whatever component) when you change multiple things at once. And yes, it does matter where an improvement comes from; it is the only scientifically sound way to improve. Otherwise we might as well try random things until we find something that works.

To come back to LSTMs vs transformers: I'm not saying LSTMs are better or anything. I'm just saying that if LSTMs had received the amount of engineering attention that went into making transformers better and faster, who knows whether they might have been similarly successful?

8

visarga t1_irzdrho wrote

> if LSTMs had received the amount of engineering attention that went into making transformers better and faster

There was a short period when people were trying to improve LSTMs using genetic algorithms or RL.

The conclusion was that the LSTM cell is somewhat arbitrary and many other architectures work just as well, but none much better. So people stuck with classic LSTMs.

2

SleekEagle t1_irx6j3n wrote

I think it's more about the parallelizability of Transformers than anything. For all intents and purposes that makes them better than LSTMs and any recurrent model in general imo.

2

xEdwin23x t1_irv8zfs wrote

These question the "progress", or rather the illusion of progress, in the field:

Are GANs Created Equal? A Large-Scale Study

https://arxiv.org/abs/1711.10337

Do Transformer Modifications Transfer Across Implementations and Applications?

https://arxiv.org/abs/2102.11972

21

SleekEagle t1_irx747e wrote

Is progress really a question? It seems very obvious that we have made progress in the last 5 years, and just looking at GANs seems ridiculous when Diffusion Models are sitting right there. Not trying to be a jerk, genuinely curious if anyone actually thinks that progress as a whole is not being made?

I definitely sympathize with the "incremental progress" that comes down to 0.1% better performance on some imperfect metric which occurs between big developments (GANs, transformers, diffusion models), but ignoring those papers and looking at bigger trends it seems obvious that really incredible progress has been made.

8

freezelikeastatue t1_irvgsg4 wrote

I can say this: out of all the AI and ML research papers I’ve read, the data sources folks are using, such as The Pile or Phaseshift.io (?) for Reddit data, are not particularly valid.

I've been poring over a lot of the raw data and have found so many errors that I think it would be difficult or disingenuous to say that the models created from those data sets are viable for use. Now, when you look at overall correctness, you'll find that statistically the AI and ML architecture can overcome those issues. However, when it comes to the reliability and fidelity of the data, it's either too inconsistent or wildly wrong in its assertions. Another way to say it: the validity of outcomes produced by AI and ML architectures that utilize public raw data should be questioned.

Just because you disagree, or have never heard this point of view before, doesn't mean it's wrong ...

19

pm_me_your_pay_slips t1_irw5s8a wrote

What are those errors you have observed in datasets like The Pile?

7

freezelikeastatue t1_irwan1x wrote

Grammatical errors, and mischaracterization of the context of one- to two-token words such as acronyms, slang, etc. Additionally, I think the way the raw data is structured is prohibitive of true optimization. That's more a theory of mine than anything, but I've built models from scratch and they've outperformed these models for my specific application every time.

My personal raw data is what you would call curated, but more than that, every cell was meticulously verified and validated. Additionally, there aren't stray variables or extra characters, such as spaces or underscores, that could be confused with part of the real data. I know AI has done an exceptional job at cleansing data, but it still isn't 100%. I'm still better at manually cleansing data than any software in existence, and I've used the majority of them.
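For illustration only (a toy snippet of my own, not the tooling I actually use), the kind of stray-character cleanup I'm talking about looks like:

```python
import pandas as pd

# Toy frame with the sort of stray whitespace and underscores that sneak in
df = pd.DataFrame({"term": ["  GDP ", "q_over_q_", " ML"],
                   "value": ["1.2 ", " 3.4", "5.6"]})

# Strip stray whitespace and leading/trailing underscores from the tokens,
# and coerce the numeric column explicitly instead of trusting the dump.
df["term"] = df["term"].str.strip().str.strip("_")
df["value"] = pd.to_numeric(df["value"].str.strip(), errors="coerce")

print(df)
```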

−1

shoegraze t1_irwlpoz wrote

But surely with a dataset as large as the Pile and enough weights, the model will be able to learn at least decently well how to interpret misspellings and abbreviations. If anything, wouldn't this data "issue" help improve an LLM's robustness? Not sure I see what the issue is in the context of LLMs, but to be fair I agree with you if you're trying to train a small model on a small amount of context-specific text data (but then you shouldn't be using the Pile, should you?)

9

freezelikeastatue t1_irx3brk wrote

Yeah, so this gets pretty philosophical and theoretical real quick. Also, interpretation of data is unique to every individual. I did constrain my claim to my own purposes, which, admittedly, don't require such large model sets, and I can achieve similar if not better results with a smaller, more narrowly defined model.

I also have not created a curated data set on the level of CLIP or OpenAI or OPT. I've tried scaling my data by applying a text generator to each parameter of data that I had, replicating faux variables to grow the parameter count to roughly 1/1000th of the number of parameters in GPT-3's model, but I got noise in return.

My summation is that the viability of the model is wholly dependent upon the unique properties and ensured individuality of each variable. I can say I have achieved higher benchmarks with regard to few- and zero-shot settings, the highest being 89.2% on few-shot, but it was a very specialized data set.

−1

_Arsenie_Boca_ t1_irx6bvg wrote

I guess this is part of the bitter lesson. Sacrificing some quality for quantity seems to pay off in many cases.

3

freezelikeastatue t1_irx78wt wrote

It pays off in the general sense of text generation and image generation. The errors and chaos are what make it beautiful. I'm not sure how others are using the data for more technical applications, but it seems to be working, whatever they're doing. My warning to everybody who reads this: download all the scripts and code you can for the diffusers, encoders, decoders, and models, because all that shit is going to become proprietary very soon. You must understand that while those who created the source code released it under the open licenses that make it free, they have the absolute authority to stop doing so, as we are slowly starting to see.

1

_Arsenie_Boca_ t1_irx86or wrote

I see your point, but I wouldn't view it too pessimistically. If anything, the not-so-open policy of OpenAI has led to many initiatives that aim to democratize AI. If they decide to go commercial as well, others will take their place.

5

freezelikeastatue t1_irx9anz wrote

Agreed, and I think what the civilian developer corps has done in spite of OpenAI's promise of open AI is a testament to that. But we cannot forget invention, patents, and capitalism. We're early in understanding just what this technology does, but we as individuals don't have the computational resources that capitalistic organizations do. The models that are out now and freely available are so fucking lucrative, it's not even funny. If one were so inclined, which I am, you can develop software without a single software developer. Simple code, yes, but multiple instances of simplicity compounded become quite complex. And wouldn't you agree that building a software system is best done incrementally and object-oriented?

1

_Arsenie_Boca_ t1_irxaphg wrote

While I am optimistic about the openness of AI, I am much more pessimistic regarding its capabilities. I don't believe AI could replace a team of software engineers anytime soon.

3

visarga t1_irzod3c wrote

Not a whole team, not even a whole job, but plenty of tasks can be automated. By averaging over many developers there is a cumulative impact.

But on the other hand, software has been cannibalising itself for 70 years and we're still accelerating; there's always space at the top.

2

maizeq t1_irvl2wm wrote

I really liked Ali Rahimi's "Machine learning has become alchemy" talk from NeurIPS 2017.

https://www.youtube.com/watch?v=x7psGHgatGM

16

respeckKnuckles t1_irw4idm wrote

Gary Marcus's twitter is a firehose of unwarranted pessimism, but occasionally he'll retweet or interact with a legitimate, balanced criticism.

10

Chhatrapati_Shivaji t1_irwlt82 wrote

Who is this guy, btw, and why does he seem so upset with current trends in ML? I ask since I only know his name due to the recent Twitter feud he had with LeCun.

2

respeckKnuckles t1_irwn0vj wrote

NYU professor who published a few "pop-sciency" books on AI-related stuff. Like many in his generation, he got some attention for taking a contrarian stance on what current approaches to AI can do, and decided to go extremist with it. I'm not sure he's much more than a full-time angry twitterer now.

8

minisculebarber t1_irv6fcx wrote

Woah, sadly I have nothing to contribute, but thank you so much for collecting these resources!

7

maxToTheJ t1_irw9f9l wrote

This one was discussed on Gelman’s blog

The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning

https://arxiv.org/abs/2203.06498

3

zfurman t1_iry959d wrote

I'm not sure if this is the type of criticism you were looking for, but I found the paper Concrete Problems in AI Safety fairly interesting. It points out a number of ways modern ML systems, especially RL, could be prone to accident risk: reward hacking, distributional shift, etc.

2

nibbels t1_irvvnpw wrote

These aren't all critiques, but they do discuss issues with both the field and the models.

https://arxiv.org/abs/2011.03395

https://openreview.net/forum?id=xNOVfCCvDpM

https://arxiv.org/abs/2110.09485

https://towardsdatascience.com/the-reproducibility-crisis-and-why-its-bad-for-ai-c8179b0f5d38

https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq10-what-are-most-pressing-dangers-ai

And then, of course, there are the oft-discussed topics like bias in data, the reliance on expensive equipment, and proprietary data that is closed off to researchers.

1

OdinsHammer t1_irxe655 wrote

If you're into general intelligence / superintelligence / etc. critiques as well I have some recommendations.

I just picked up Why Machines Will Never Rule the World by Jobst Landgrebe and Barry Smith. It's not a critique of concrete ML, but rather of the idea that we'll ever create general intelligence in a machine. They base it on a broad variety of fields, including linguistics, philosophy, biology, and physics. I read an interview where Jobst accuses people in the ML field of being one-eyed, thinking everything is doable based on Turing and Gödel. Being a CS guy, I found that I probably have that bias, so I have to read it.

I also found Maciej Ceglowski's talk interesting. It's a bit old, but he's an amazing presenter, and I don't think "the top of our industry", which his critique targets, has changed all that much.

1

skoetje t1_irzstbt wrote

AI and the Everything in the Whole Wide World Benchmark

There is a tendency across different subfields in AI to valorize a small collection of influential benchmarks. These benchmarks operate as stand-ins for a range of anointed common problems that are frequently framed as foundational milestones on the path towards flexible and generalizable AI systems. State-of-the-art performance on these benchmarks is widely understood as indicative of progress towards these long-term goals. In this position paper, we explore the limits of such benchmarks in order to reveal the construct validity issues in their framing as the functionally “general” broad measures of progress they are set up to be.

1