Submitted by t3_yon48p in MachineLearning

Increasingly large/deep models for Sound/Image/Language/Games are all the rage (you know what I'm talking about). This is concerning on some level:

  1. Focus shifts to the amount of data instead of its curation
  2. Training requires more (expensive) hardware, putting it out of reach for many
  3. API-ization of functionality leads to large-scale monitoring by centralized providers

Let's take OpenAI Codex / GitHub Copilot as an example. Disregarding the licensing questions for a bit, and amazing as this model is, there are some drawbacks observed when using it:

  • It can generate outdated code or API calls, especially for evolving languages
  • Known vulnerabilities have been observed in generated code, e.g. MITRE-listed weaknesses
  • No local use of the service, unless replicated and self hosted (expensive)

Now my questions are these:

Do you think there is a case to be made for smaller models fed with higher quality data? Can we substantially reduce the number of parameters if we do better with the input?
For example a Codex-like model for a single language only.

Or do you think that the pre-training of large models and then refining to task (e.g. GPT or maybe programmer -> specific language) will continue to dominate because we require the amount of parameters for the tasks at hand anyway? An AGI that we just teach "courses" if you like.

70

Comments

t1_ivf2wkb wrote

Network distillation and transfer learning are both reasonable approaches to constructing high quality "compressed" models.
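
As a rough illustration of the distillation idea mentioned above, here is a minimal sketch of a temperature-softened distillation loss in plain Python. The logits, the temperature of 2.0, and the function names are all assumptions made up for this example; a real setup would usually combine this term with the ordinary label loss inside a deep learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a single 3-class example.
teacher = [4.0, 1.0, 0.2]
good_student = [3.5, 1.2, 0.1]  # roughly agrees with the teacher
bad_student = [0.1, 3.0, 1.0]   # disagrees with the teacher

loss_same = distillation_loss(teacher, teacher)
loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_same, loss_good, loss_bad)
```

The student is pushed toward the teacher's full output distribution rather than only the hard label, which is what lets a smaller network inherit some of the larger one's behaviour.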

40

t1_ivf6xqd wrote

There is ALWAYS a need for smaller models. However, I don’t think “higher quality data” is what will effect that change. The fundamental building blocks we use need to change drastically to allow for more flexibility, higher data representation, and more robust pattern learning. High quality data is always good, but enabling higher data representation and one-shot/zero-shot learning is better.

25

t1_ivffk9p wrote

All evidence currently points to it being quite possible for even one person to make a high quality model all by themselves. It will take a great effort in high quality data curation, but I do not see anything that is out of reach. The only reason this field is perceived as requiring large data sets is that a large amount of data was used to train the base model. But what folks don't seem to understand is that the quality of the data used in training the base model was EXTREMELY poor. Bad captions, bad cropping, redundancies, mis-categorizations, and a plethora of other issues plague the training data. The base SD model could have been trained with orders of magnitude less data, if due diligence was used in data curation.

This is the case for Stable Diffusion. I would not be surprised if this was the case for other models as well.

9

t1_ivf0kjq wrote

I think you’ve got the right idea. It makes sense that big companies are developing and pushing big models. They’ve got the resources to train them. But you can often get a lot done with a much smaller, boutique model. That’s one of the next frontiers.

8

t1_ivf77hf wrote

Examples?

3

t1_ivf84sd wrote

The easy answer is compact or distilled models like EfficientNet or DistilBERT. You can also get an intuition for the process by taking a small, easy dataset (like MNIST or CIFAR) and running a big hyperparameter search over models. There will be small models which perform close to the best models.

These days nobody uses ResNet or Inception but there was a time they were the bleeding edge. Now it’s all smaller more precise stuff.

The other dimension where you can win over big models is hardcoding your priors.
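
One way to picture "hardcoding your priors" is weight sharing in a convolution, which bakes in the assumption that the same local pattern matters everywhere in an image. The comment does not name convolutions, so treat this as an illustrative assumption; the layer sizes below are hypothetical, and the sketch only counts parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution; independent of the image size."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

# Hypothetical setup: a 28x28 grayscale image (MNIST-sized) mapped to
# 16 feature maps of the same 28x28 resolution.
height, width, maps = 28, 28, 16

# No prior: every output unit gets its own weight for every input pixel.
dense = dense_params(height * width, height * width * maps)

# Locality + translation prior: one shared 3x3 kernel per feature map.
conv = conv2d_params(in_channels=1, out_channels=maps, kernel_size=3)

print(dense, conv)
```

Both layers produce an output of the same shape, but the convolution needs several orders of magnitude fewer parameters because the prior removes the need to learn each position separately.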

11

t1_ividc40 wrote

I attended a talk yesterday where the speaker alluded to something called "fundamental models" (probably "foundation models"), where the future of deep learning is centralised, with large organisations owning extremely heavy pre-trained models and providing them as a service to those wanting to fine-tune the top layers for a specific task.

3

OP t1_ivj4vad wrote

this is exactly the philosophical motivation of this post

1

t1_ivix7ir wrote

This concept is a key part of my research.

I kept seeing these massive models for time series that were frankly mediocre at best. Using better preprocessing, smaller models (1k versus 1M parameters), and curated datasets, I have matched or exceeded the results of similar work.

Smaller, curated models are the future IMO.

2

t1_ivevmim wrote

Could you provide a bit more context? Are you referring specifically to language processing and similar use cases? Or are you referring to general ML use cases?

1

OP t1_ivf36bj wrote

I am talking about ML in general, language processing was just a tangible example for the sake of this post.
Models keep one-upping each other in size and capabilities, but do you see meaningful potential for reduction of size (through configuration or new approaches) in more specific use cases?

3

t1_ivf7an0 wrote

This is where I believe the human component of AI/ML lies in the future: being able to discern use cases where simple models will work vs. where complex algorithms will add value.

If you look at how businesses approach AI/ML today, everyone wants to have a cloud-based platform that’s integrated with a massive data lake capable of running deep learning / reinforcement learning algorithms. But the reality is, the majority of business problems (specifically in non-tech businesses like retail, e-commerce, financial services, etc.) don’t require such complex things.

My heart weeps when organisations try to implement a deep learning model for a simple fraud detection use case which could well be handled by a logistic regression model using a much smaller amount of data. What’s worse, they’d probably spend millions of dollars trying to develop and operationalise the solution.
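
To make the logistic regression point concrete, here is a minimal sketch trained with plain gradient descent on synthetic data. Everything in it is an assumption for illustration: the two features (think of a standardised transaction amount and a standardised transaction count), the class sizes, and the hyperparameters are invented, and real fraud data is far messier and more imbalanced.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=1000):
    """Full-batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (fraud) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Synthetic, hypothetical data: 200 legitimate and 50 fraudulent transactions.
rng = random.Random(0)
legit = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
fraud = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(50)]
X = legit + fraud
y = [0] * len(legit) + [1] * len(fraud)

w, b = train_logistic_regression(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
print(w, b, accuracy)
```

The whole model is three numbers (two weights and a bias), trains in well under a second on a laptop, and its weights can be read directly as the direction in which each feature pushes the fraud score.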

The problem, however, is that hype merchants (read: consulting companies) make it sound like this is the only way that companies can stay competitive in the future. AI/ML conferences also don’t help, in that they almost always want to showcase an insanely complicated algorithm utilising a massive tech stack. I feel there are also very few people in the industry who advocate for simplification.

But eventually, I expect the hype to die and companies to realise that this doesn’t give them an incremental benefit in every use case.

Having said all that, for the specific examples you’ve given, like language and image processing, I also expect large/deep models to become the norm, because these models are also offered as a service, like GitHub Copilot. And it might actually be cheaper to use them directly than to develop a small-scale customised model.

7

t1_ivhzsot wrote

There are diminishing returns on data. It's difficult to get truly new data when you already have billions of data points, and it's difficult to improve a model when it's already very good.

So, like Moore's law, it'll probably slow down eventually. At that point, most significant developments will be a result of improving model efficiency rather than just making them bigger.

Not to mention, models are made more efficient all the time. Sure, DALL-E 2 is huge. But first off, it's smaller than DALL-E. And second, if you compare a model of a fixed size today to a model of the same size from just a couple of years ago, today's model will still win out by a significant margin. Heck, you can definitely train a decent ImageNet-1K model on a hobby ML PC (e.g., an RTX graphics card, or even something cheaper if you have enough days to spare on a small learning rate and batch size). And inference takes much less time/memory than training, since you can usually fix the batch size to 1 and you don't have to store a computational graph for a backward pass. A decade ago, this would have been much more difficult.
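
The training-versus-inference memory point can be sketched with back-of-the-envelope arithmetic. This is a simplified model under stated assumptions: a hypothetical fully connected network, float32 everywhere, plain SGD with no optimizer state, and no framework overhead. Real numbers will differ.

```python
BYTES_PER_FLOAT32 = 4

def param_count(layer_widths):
    """Weights and biases of a fully connected network with the given layer widths."""
    return sum(a * b + b for a, b in zip(layer_widths, layer_widths[1:]))

def training_bytes(layer_widths, batch_size):
    """Parameters + gradients + every layer's activations kept for the backward pass."""
    params = param_count(layer_widths)
    activations = batch_size * sum(layer_widths)
    return (2 * params + activations) * BYTES_PER_FLOAT32

def inference_bytes(layer_widths, batch_size=1):
    """Parameters + only the largest input/output activation pair alive at one time."""
    params = param_count(layer_widths)
    live = batch_size * max(a + b for a, b in zip(layer_widths, layer_widths[1:]))
    return (params + live) * BYTES_PER_FLOAT32

widths = [784, 512, 512, 10]  # hypothetical MLP, not a model from the thread
train = training_bytes(widths, batch_size=256)
infer = inference_bytes(widths, batch_size=1)
print(param_count(widths), train, infer)
```

In this toy case the parameters dominate, so the gap is modest; for deep convolutional or transformer models the stored activations grow with depth and batch size, which is where dropping the backward pass and using batch size 1 saves the most.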

1

t1_ivochpk wrote

I think training big models is meaningless. We don't even know how to interpret many behaviors of deep neural networks. We cannot determine whether deep learning captures cause-and-effect mechanisms.

1

t1_ivgtbgc wrote

It's not concerning. It's just that you are going to be unemployed. Not a bad thing. It's called efficiency.

−4