Submitted by fromnighttilldawn t3_y11a7r in MachineLearning

Critiques of an ML approach, technique, implementation, or reproducibility, or of an entire field of research, can often be just as enlightening as ML surveys, if not more so.

I think this is because they usually point out what the field is ignoring, or show that a popular practice or belief is unsound or useless.

Some famous examples are:

Troubling Trends in ML https://arxiv.org/pdf/1807.03341.pdf

ML that Matters https://arxiv.org/abs/1206.4656

On the Convergence of ADAM https://arxiv.org/abs/1904.09237

On the Information Bottleneck https://iopscience.iop.org/article/10.1088/1742-5468/ab3985

Implementation Matters in Deep Policy Gradients https://arxiv.org/abs/2005.12729 (showed that a purported algorithmic gain was actually due mainly to code-level optimizations)

Critique of Turing Award https://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html (basically a critique of citation practices in ML)

Deep Learning: A Critical Appraisal https://arxiv.org/abs/1801.00631

However, these are a little bit dated.

Does anyone have any recent critique papers of a similar flavour to the ones I've provided above? (or would you rather offer your original critique in the comments ;) )

131

Comments


_Arsenie_Boca_ t1_irvjdtn wrote

I don't have the papers on hand that investigate this, but here are two things that don't make me proud of being part of this field.

Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them? More generally, papers tend to make many changes to a system and credit the improvement to the thing they are most proud of without a fair comparison.

Non-open-source models like GPT-3 don't make their training dataset public. People evaluate performance on benchmarks, but nobody can say for sure whether the benchmark data was in the training data. ML used to be very cautious about data leakage, but this is simply ignored in most cases when it comes to those models.
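To make that concrete, here's a rough sketch (my own toy code with made-up names, not anything these labs actually run or publish) of the kind of n-gram contamination check that is only possible when the training corpus is available:

```python
# Toy contamination check: flag benchmark examples that share a long n-gram
# with the (hypothetically available) training corpus. All names are made up.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs, n: int = 13) -> set:
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def flag_contaminated(benchmark_examples, train_index, n: int = 13):
    # Any shared 13-gram is treated as evidence of leakage.
    return [ex for ex in benchmark_examples if ngrams(ex, n) & train_index]

train_index = build_train_index(["...training documents would go here..."])
print(flag_contaminated(["...benchmark questions would go here..."], train_index))
```

Without the training data, none of us can run even this crude a check on those models.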

91

harharveryfunny t1_irvssm1 wrote

It seems transformers really have two fundamental advantages over LSTMs:

  1. By design (specifically to improve on the shortcomings of recurrent models), they are much more efficient to train since samples can be presented in parallel. Also, positional encoding allows transformers to deal more accurately with positional structure, which is critical for language (see the sketch after this list).
  2. Transformers scale up very successfully. Per Rich Sutton's "Bitter Lesson", generally dumb methods that scale up in terms of ability to usefully absorb compute and data do better than more highly engineered "smart" methods. I wouldn't argue that transformers are any simpler in architecture than LSTMs, but as GPT-3 proved they do scale very successfully - increasing performance while still being relatively easy to train.
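For what it's worth, here's a minimal NumPy sketch of the sinusoidal positional encoding from the original transformer paper (my own toy version, so treat the details as illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings so every position gets a distinct signature,
# after which the whole sequence can be processed in parallel, unlike an
# LSTM's step-by-step recurrence.
embeddings = np.random.randn(128, 512)                   # toy example, no batch dim
embeddings = embeddings + sinusoidal_positional_encoding(128, 512)
```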

The context of your criticism is still valid though. Not sure whether it's fair or not, but I tend to look at DeepMind's recent matrix multiplication paper like that - they are touting it as a success of "AI" and RL, when really it's not at all apparent what RL is adding here. Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

44

sambiak t1_irwzqdv wrote

> Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

I think you're underestimating the difficulty of exploring an enormous state space. The state space of this problem is bigger than the one in go or chess.

Reinforcement Learning specializes in finding good solutions when only a small subset of the state space can be explored. You're quite right that Monte Carlo Tree Search would work here, because that's exactly what they used ^ ^

> Similarly to AlphaZero, AlphaTensor uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure.

That said, you do need a good way to guide this MCTS, and a neural network is a great solution for evaluating how good a given state is. But then you've got a new problem: how do you train this neural network? And so on. It's not trivial, and frankly even the best tools have quite some weaknesses.
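As a toy illustration of what "guiding" means here (my own simplified AlphaZero-style selection rule, not AlphaTensor's actual code), the network's value and policy outputs enter the search roughly like this:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                    # policy-network probability for the action leading here
    visits: int = 0
    value_sum: float = 0.0          # accumulated value-network estimates from simulations
    children: dict = field(default_factory=dict)

def puct_score(child: Node, parent_visits: int, c_puct: float = 1.5) -> float:
    """Exploit the learned value estimate, explore in proportion to the prior."""
    q = child.value_sum / child.visits if child.visits else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

def select_child(node: Node):
    # Pick the action whose child maximizes the PUCT score during tree descent.
    return max(node.children.items(),
               key=lambda kv: puct_score(kv[1], node.visits))
```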

But no, evolutionary algorithms would not be easier, because you still need a fitness function, and once again you can use neural networks to approximate it, but then you run into training issues once again. As far as I know, evolutionary algorithms are just worse than MCTS at the moment, until someone figures out a better way to approximate fitness functions.

19

csreid t1_irxfue3 wrote

IMO, transformers are significantly less simple and more "hand-crafted" than LSTMs.

The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute will reach a point where you can just learn it. Cross-attention and all this special architecture to help a model capture intra-series information is definitely being clever compared to LSTMs (or RNNs in general), which just give the network a way to keep some information around when presented with things in series.

4

harharveryfunny t1_irxuxr9 wrote

Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.

I think many people, myself included, were surprised by the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc, etc) is needed to be REALLY REALLY good at "predict the next word", but it was not at all obvious that something as relatively simple as a transformer was sufficient to learn that.

At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.

1

elbiot t1_irwyleo wrote

The fact that you can throw a bunch of compute at transformers is part of their superiority. Even if it's the only factor, it's really important.

26

_Arsenie_Boca_ t1_irx1ubl wrote

That's definitely a fair point (although you can do that with recurrent models as well, see the Reddit link in my other comment). Anyway, the more general point about multiple changes stands; maybe I chose a bad example.

3

nickkon1 t1_irxid6a wrote

> ML used to be very cautious about data leakage, but this is simply ignored in most cases when its about those models.

I work on economic stuff. Either I am super unlucky or the number of papers that have data leakage is incredibly high. A decent chunk of papers that try to predict some macro-economic quantity one quarter ahead don't leave a gap of one quarter between their training data and the prediction. Their backtest is awesome, the error is small, nice, a new paper! But it can't be used in production, because how can I train a model on 1 September 2022 if I need the data from 1 October to 31 December for my target value?

It is incredibly frustrating. There have been papers, master's theses, and even a dissertation that did this. I'm incredibly frustrated and have stopped trusting anything without code/data.

16

scarynut t1_irxshd1 wrote

I noticed this in a lot of YouTube stock prediction tutorials. Made me conclude that people are idiots. Shocking that this mistake makes its way into papers.

7

popcornn1 t1_is03bja wrote

Sorry, but I can't understand your comment. What do you mean by "don't leave a gap"? How do they make the forecast, then? Training data from January 2021 to December 2021 and then a forecast from October 2021 to December 2021????

1

nickkon1 t1_is09o1x wrote

A lot of papers, articles, and YouTube videos on time series have the premise:
Our data is dependent on time. Not only does new data come in regularly, it might also happen that the coefficients of our model change over time, and important features in 2020 (e.g. the number of people who are ill with covid) are less relevant now in 2022. To combat that, you retrain your model at regular intervals. Let us retrain our model daily.
That is totally fine and a sensible approach.

The key is: How far into the future do you want to predict something?


Because a lot of Medium, Towards Data Science, and plenty of other blogs do exactly that: let us try to predict the 7-day return of a stock.

To train a new model today at t_{n}, I need data from the next week. But since I can't see into the future and do not know the future 7-day return of my stock, I don't have my y variable. The same holds for time step t_{n-1} and so on, until I reach time step t_{n-prediction window}. Only there can I calculate the future 7-day return of my stock with today's information.
This means that the last data point of my training data always lags my evaluation date by 7 days.

The issue is: this becomes a problem only at your most recent data points (specifically the last #{prediction window} data points). Since you are writing a blog or publishing a paper... who cares? You don't actually use that model daily for your business anyway. But: you can still train on those points in your backtest, where you iterate through each time step t_{i}, take the last 2 years of training data up until t_{i}, and make your prediction.

Your backtest is suddenly a lot better, your error becomes smaller, BAM, 80% accuracy on a stock prediction! You beat the live-tested performance of your competition! It is a great achievement, so let us write a paper about it! But the reality is: your model is actually unusable in a live setting, and the errors you reported from your backtest are wrong. The reason is a subtle way of accidentally giving your model information about the future. Throughout the whole backtest you have retrained your model's parameters at time t_{i} with data about your target variable from t_{i+1} to t_{i+prediction_window-1}. You need a gap between your training data and your validation/test data.

Specifically in numbers (YYYY-MM-DD):
Wrong:
Training: 2020-10-10 - 2022-10-10
You try to retrain your model on 2022-10-10 and make a prediction on that date.

Correct:
Training: 2020-10-03 - 2022-10-03
You retrain your model on 2022-10-10 and make a prediction on that date. Notice that the last data point of your training data is not today, but today - #{prediction window}
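A minimal sketch of the two setups in code (toy pandas data and my own variable names, just to illustrate the gap):

```python
import pandas as pd

HORIZON = 7  # prediction window in days

# Toy daily price series and the future 7-day return used as the target
prices = pd.Series(range(100, 1100),
                   index=pd.date_range("2020-01-01", periods=1000), dtype=float)
target_7d_return = prices.shift(-HORIZON) / prices - 1

def training_window(eval_date, gap_days, lookback_days=730):
    # The last training date must lag the evaluation date by the prediction
    # horizon, otherwise the targets near the end of the window already
    # contain prices from after the evaluation date.
    end = eval_date - pd.Timedelta(days=gap_days)
    return end - pd.Timedelta(days=lookback_days), end

eval_date = pd.Timestamp("2022-10-10")

wrong_start, wrong_end = training_window(eval_date, gap_days=0)        # trains up to 2022-10-10: leaks
right_start, right_end = training_window(eval_date, gap_days=HORIZON)  # trains up to 2022-10-03: no leak

print(wrong_end.date(), right_end.date())
```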

4

CommunismDoesntWork t1_irwxgxk wrote

>Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them?

That's like asking if B-trees are actually better than red-black trees, or if modern CPUs and their large caches just happen to lead to better performance. It doesn't matter. If one algorithm works in theory but doesn't scale, then it might as well not work. It's the same reason no one uses fully connected networks, even though they're universal function approximators.

2

_Arsenie_Boca_ t1_irwzk3j wrote

The point is that you cannot confirm the superiority of an architecture (or whatever component) when you change multiple things at once. And yes, it does matter where an improvement comes from; it is the only scientifically sound way to improve. Otherwise we might as well try random things until we find something that works.

To come back to LSTMs vs transformers: I'm not saying LSTMs are better or anything. I'm just saying that if LSTMs had received the amount of engineering attention that went into making transformers better and faster, who knows whether they might have been similarly successful?

8

visarga t1_irzdrho wrote

> if LSTMs had received the amount of engineering attention that went into making transformers better and faster

There was a short period when people were trying to improve LSTMs using genetic algorithms or RL.

The conclusion was that the LSTM cell is somewhat arbitrary and many other architectures work just as well, but none much better. So people stuck with classic LSTMs.

2

SleekEagle t1_irx6j3n wrote

I think it's more about the parallelizability of Transformers than anything. For all intents and purposes that makes them better than LSTMs and any recurrent model in general imo.

2

xEdwin23x t1_irv8zfs wrote

These question the "progress", or rather the illusion of progress, in the field:

Are GANs Created Equal? A Large-Scale Study

https://arxiv.org/abs/1711.10337

Do Transformer Modifications Transfer Across Implementations and Applications?

https://arxiv.org/abs/2102.11972

21

SleekEagle t1_irx747e wrote

Is progress really a question? It seems very obvious that we have made progress in the last 5 years, and just looking at GANs seems ridiculous when Diffusion Models are sitting right there. Not trying to be a jerk, genuinely curious if anyone actually thinks that progress as a whole is not being made?

I definitely sympathize with the "incremental progress" that comes down to 0.1% better performance on some imperfect metric which occurs between big developments (GANs, transformers, diffusion models), but ignoring those papers and looking at bigger trends it seems obvious that really incredible progress has been made.

8

freezelikeastatue t1_irvgsg4 wrote

I can say this: out of all the AI and ML research papers I’ve read, the data sources folks are using, such as The Pile or Phaseshift.io (?) for Reddit data, are not particularly valid.

I've been poring over a lot of the raw data and have found so many errors that I think it would be difficult or disingenuous to say that the models created from those data sets are viable for use. Now, when you look at overall correctness, you'll find that statistically the AI and ML architecture can overcome those issues. However, when it comes to the reliability and fidelity of the data, it's either too inconsistent or wildly wrong in its assertions. Another way to say it: the validity of outcomes produced by AI and ML architectures that utilize public raw data should be questioned.

Just because you disagree, or have never heard this point of view before, doesn't mean it's wrong ...

19

pm_me_your_pay_slips t1_irw5s8a wrote

What are those errors you have observed in datasets like The Pile?

7

freezelikeastatue t1_irwan1x wrote

Grammatical errors, and mischaracterization of the context of one- to two-token words such as acronyms, slang, etc. Additionally, I think the way the raw data is structured is prohibitive of true optimization. That's more a theory of mine than anything, but I've built models from scratch and they've outperformed these models for my specific application every time.

My personal raw data is what you would call curated, but more than that, every cell was meticulously verified and validated. Additionally, there aren't stray variables or extra characters, such as spaces or underscores, that could be confused with part of the real data. I know AI has done an exceptional job at cleansing data, but it still isn't 100%. I'm still better at manually cleansing data than any software in existence, and I've used the majority of them.
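For illustration only (a toy snippet of my own, not the tooling I actually use), the kind of stray-character cleanup I'm talking about looks like:

```python
import pandas as pd

# Toy frame with the sort of stray whitespace and underscores that sneak in
df = pd.DataFrame({"term": ["  GDP ", "q_over_q_", " ML"],
                   "value": ["1.2 ", " 3.4", "5.6"]})

# Strip stray whitespace and leading/trailing underscores from the tokens,
# and coerce the numeric column explicitly instead of trusting the dump.
df["term"] = df["term"].str.strip().str.strip("_")
df["value"] = pd.to_numeric(df["value"].str.strip(), errors="coerce")

print(df)
```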

−1

shoegraze t1_irwlpoz wrote

But surely with a dataset as large as the Pile and enough weights, the model will be able to learn at least decently well how to interpret misspellings and abbreviations. If anything, wouldn't this data "issue" help improve an LLM's robustness? Not sure I see what the issue is in the context of LLMs, but to be fair I agree with you if you're trying to train a small model on a small amount of context-specific text data (but then you shouldn't be using the Pile, should you?)

9

freezelikeastatue t1_irx3brk wrote

Yeah, so this gets pretty philosophical and theoretical real quick. Also, interpretation of data is unique to every individual. I did constrain my claim to my own purposes, which, admittedly, don't require such large model sets, and I can achieve similar if not better results with a smaller, more narrowly defined model.

I also have not created a curated data set on the level of CLIP or OpenAI or OPT. I've tried scaling my data by applying a text generator to each parameter of data that I had, replicating faux variables to grow the parameter count to roughly 1/1000th of the number of parameters in GPT-3's model, but I got noise in return.

My summation is that the viability of the model is wholly dependent upon the unique properties and ensured individuality of each variable. I can say I have achieved higher benchmarks with regard to few- and zero-shot settings, the highest being 89.2% on few-shot, but it was a very specialized data set.

−1

_Arsenie_Boca_ t1_irx6bvg wrote

I guess this is part of the bitter lesson. Sacrificing some quality for quantity seems to pay off in many cases.

3

freezelikeastatue t1_irx78wt wrote

It pays off in the general sense of text generation and image generation. The errors and chaos are what make it beautiful. I'm not sure how others are using the data for more technical applications, but it seems to be working, whatever they're doing. My warning to everybody who reads this: download all the scripts and code you can for the diffusers, encoders, decoders, and models, because all that shit is going to become proprietary very soon. You must understand that while those who created the source code released it under the open licenses that make it free, they have the absolute authority to stop doing so, as we are slowly starting to see.

1

_Arsenie_Boca_ t1_irx86or wrote

I see your point, but I wouldn't view it too pessimistically. If anything, the not-so-open policy of OpenAI has led to many initiatives that aim to democratize AI. If they decide to go commercial as well, others will take their place.

5

freezelikeastatue t1_irx9anz wrote

Agreed, and I think what the civilian developer corps has done in spite of OpenAI's promise of open AI is a testament to that. But we cannot forget invention, patents, and capitalism. We're early in understanding just what this technology does, but we as individuals don't have the computational resources that capitalistic organizations do. The models that are out now and freely available are so fucking lucrative, it's not even funny. If one were so inclined, which I am, you can develop software without a single software developer. Simple code, yes, but multiple instances of simplicity compounded become quite complex. And wouldn't you agree that building a software system is best done incrementally and object-oriented?

1

_Arsenie_Boca_ t1_irxaphg wrote

While I am optimistic about the openness of AI, I am much more pessimistic regarding its capabilities. I don't believe AI could replace a team of software engineers anytime soon.

3

visarga t1_irzod3c wrote

Not a whole team, not even a whole job, but plenty of tasks can be automated. By averaging over many developers there is a cumulative impact.

But on the other hand, software has been cannibalising itself for 70 years and we're still accelerating; there's always space at the top.

2

maizeq t1_irvl2wm wrote

I really liked Ali Rahimi's "Machine learning has become alchemy" talk from NeurIPS 2017.

https://www.youtube.com/watch?v=x7psGHgatGM

16

respeckKnuckles t1_irw4idm wrote

Gary Marcus's twitter is a firehose of unwarranted pessimism, but occasionally he'll retweet or interact with a legitimate, balanced criticism.

10

Chhatrapati_Shivaji t1_irwlt82 wrote

Who is this guy, btw, and why does he seem so upset with current trends in ML? I ask since I only know his name due to the recent Twitter feud he had with LeCun.

2

respeckKnuckles t1_irwn0vj wrote

NYU professor who published a few "pop-sciency" books on AI-related stuff. Like many in his generation, he got some attention for taking a contrarian stance on what current approaches to AI can do, and decided to go extremist with it. I'm not sure he's much more than a full-time angry twitterer now.

8

minisculebarber t1_irv6fcx wrote

Woah, sadly I have nothing to contribute, but thank you so much for collecting these resources!

7

maxToTheJ t1_irw9f9l wrote

This one was discussed on Gelman’s blog

The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning

https://arxiv.org/abs/2203.06498

3

zfurman t1_iry959d wrote

I'm not sure if this is the type of criticism you were looking for, but I found the paper Concrete Problems in AI Safety fairly interesting. It points out a number of ways modern ML systems, especially RL, could be prone to accident risk: reward hacking, distributional shift, etc.

2

nibbels t1_irvvnpw wrote

These aren't all critiques, but they do discuss issues with both the field and the models.

https://arxiv.org/abs/2011.03395

https://openreview.net/forum?id=xNOVfCCvDpM

https://arxiv.org/abs/2110.09485

https://towardsdatascience.com/the-reproducibility-crisis-and-why-its-bad-for-ai-c8179b0f5d38

https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq10-what-are-most-pressing-dangers-ai

And then, of course, there are the oft-discussed topics like bias in data, the reliance on expensive equipment, and proprietary data that is closed off to researchers.

1

OdinsHammer t1_irxe655 wrote

If you're into general intelligence / superintelligence / etc. critiques as well I have some recommendations.

I just picked up Why Machines Will Never Rule the World by Jobst Landgrebe and Barry Smith. It's not a critique of concrete ML, but rather of the idea that we'll ever create general intelligence in a machine. They base it on a broad variety of fields, including linguistics, philosophy, biology, and physics. I read an interview where Jobst accuses people in the ML field of being one-eyed, thinking everything is doable based on Turing and Gödel. Being a CS guy, I found that I probably have that bias, so I have to read it.

I also found Maciej Ceglowski's talk interesting. It's a bit old, but he's an amazing presenter, and I don't think "the top of our industry", which his critique targets, has changed all that much.

1

skoetje t1_irzstbt wrote

AI and the Everything in the Whole Wide World Benchmark

There is a tendency across different subfields in AI to valorize a small collection of influential benchmarks. These benchmarks operate as stand-ins for a range of anointed common problems that are frequently framed as foundational milestones on the path towards flexible and generalizable AI systems. State-of-the-art performance on these benchmarks is widely understood as indicative of progress towards these long-term goals. In this position paper, we explore the limits of such benchmarks in order to reveal the construct validity issues in their framing as the functionally “general” broad measures of progress they are set up to be.

1