Submitted by t3_yx7zft in MachineLearning

Paper: https://arxiv.org/abs/2211.04325

Blog: https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

Abstract:

>We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
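
For intuition, the extrapolation amounts to comparing two growth curves, projected training dataset size versus projected data stock, and finding where they cross. A minimal sketch of that calculation (all starting values and growth rates below are made-up placeholders, not the paper's fitted estimates):

```python
# Illustrative placeholders only -- not the paper's fitted estimates.
tokens_used_2022 = 1e12        # tokens consumed by the largest training runs (assumed)
dataset_growth_rate = 0.50     # ~50% more training tokens per year (assumed)

stock_2022 = 1e14              # stock of usable language tokens on the web (assumed)
stock_growth_rate = 0.07       # stock grows roughly with content creation (assumed)

def exhaustion_year(start_year=2022, horizon=2100):
    """Return the first year in which projected dataset size exceeds the projected stock."""
    used, stock = tokens_used_2022, stock_2022
    for year in range(start_year, horizon):
        if used >= stock:
            return year
        used *= 1 + dataset_growth_rate
        stock *= 1 + stock_growth_rate
    return None

print(exhaustion_year())  # crossing point under these made-up parameters
```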

Possible solutions based on the following papers:

Retrieval mechanisms (https://arxiv.org/abs/2112.04426), EfficientZero (https://arxiv.org/abs/2111.00210) and synthetic data (https://openreview.net/forum?id=NiEtU7blzN) can be seen as possible solutions, though each of them still needs to be improved on.
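
To make the retrieval idea concrete, below is a toy nearest-neighbour lookup over a tiny corpus. It is only a sketch: real systems such as RETRO use learned dense encoders and approximate nearest-neighbour indexes, while this uses a throwaway character-count embedding.

```python
import numpy as np

# Toy corpus; real retrieval-augmented models consult a datastore of trillions of tokens.
corpus = [
    "EfficientZero reaches strong Atari scores with little data.",
    "Retrieval lets a small model consult a large external datastore.",
    "Synthetic data can augment scarce high-quality text.",
]

def embed(text: str) -> np.ndarray:
    """Trivial bag-of-characters embedding (placeholder for a learned encoder)."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

corpus_vecs = np.stack([embed(t) for t in corpus])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus chunks most similar to the query (cosine similarity)."""
    sims = corpus_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(-sims)[:k]]

# The retrieved chunk would be prepended to the model's context at generation time.
print(retrieve("how does retrieval help small models?"))
```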

https://preview.redd.it/5tji6jd60e0a1.jpg?width=1559&format=pjpg&auto=webp&s=d7b5e5dbe6836fc0a59a17281cb7e2ea20e56727

https://preview.redd.it/qgsmjod60e0a1.jpg?width=1544&format=pjpg&auto=webp&s=d949c561f4a006791fecaf56bd155265b4580389

https://preview.redd.it/0zwq9ld60e0a1.jpg?width=1200&format=pjpg&auto=webp&s=808d578f3ac19ca4556830c21646d90132687918

53

Comments

t1_iwode1v wrote

What’s wrong with self-supervision? It enables combinatorial expansion of dataset sizes if the task is specified well.
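
For example, a single sentence already yields many (masked input, target) pairs under a masking objective, which is where the combinatorial expansion comes from. A toy sketch (the masking scheme here is purely illustrative):

```python
from itertools import combinations

tokens = "the stock of high quality text is finite".split()

def masked_examples(tokens, max_masks=2):
    """Enumerate (masked input, targets) pairs from one sequence -- one sentence
    becomes many self-supervised training examples without any extra labels."""
    examples = []
    for n in range(1, max_masks + 1):
        for positions in combinations(range(len(tokens)), n):
            masked = [t if i not in positions else "[MASK]" for i, t in enumerate(tokens)]
            targets = [tokens[i] for i in positions]
            examples.append((" ".join(masked), targets))
    return examples

print(len(masked_examples(tokens)))  # 8 + C(8,2) = 36 examples from one 8-token sentence
```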

10

t1_iwnbmrx wrote

AFAIK most LLMs don't even use one epoch?

4

t1_iwo4w46 wrote

Technically aren’t you always doing at least one epoch? You’re doing at least one pass through all of your data, even if that data is less than the amount you theoretically could have used.

7

t1_iwoq0ug wrote

Not a complete one. GPT-3, I think, didn't complete its first pass-through.

12

t1_iwpi7r5 wrote

You could argue GPT-3 was trained on a subset of the available training data, no?

Not completing the first pass-through means the remaining data could just be considered not part of the training data.

7

t1_iwplk0c wrote

Semantics. It didn't see any of its data more than once, and it had more available. Not one full epoch.
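
Rough back-of-the-envelope of what "less than one epoch" means with weighted sampling; the token counts and mixture weights below are invented for illustration, not GPT-3's actual mixture:

```python
# Hypothetical training mixture -- all numbers are illustrative assumptions.
total_tokens_trained = 300e9  # tokens actually consumed during training (assumed)

sources = {
    # name: (tokens available, sampling weight)
    "web_crawl": (410e9, 0.62),
    "books":     (120e9, 0.25),
    "wikipedia": (10e9,  0.02),
    "other":     (150e9, 0.11),
}

for name, (size, weight) in sources.items():
    effective_epochs = weight * total_tokens_trained / size
    print(f"{name}: {effective_epochs:.2f} effective epochs")
# Under these made-up weights no source completes a full pass,
# which is what "less than one epoch" means in practice.
```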

9

t1_iwpltkw wrote

Sure, but in theory my little Hello World network also had more data available on the internet.

4

t1_ix96sfz wrote

Yeah, this gives you an idea of how little of the data is actually worth going through - most of it repeats structures found elsewhere in the data, and isn't very diverse. Going through huge low-curation datasets is inefficient: the data diversity just isn't there.
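
A crude way to see this on any dump: exact-hash deduplication of paragraphs. Real curation pipelines also catch near-duplicates (MinHash and similar), so this only gives a lower bound on the redundancy; the snippet below is just a sketch:

```python
import hashlib

def dedup_ratio(paragraphs):
    """Fraction of paragraphs that are exact duplicates of an earlier one."""
    seen, dupes = set(), 0
    for p in paragraphs:
        h = hashlib.sha1(p.strip().lower().encode()).hexdigest()
        if h in seen:
            dupes += 1
        else:
            seen.add(h)
    return dupes / max(len(paragraphs), 1)

docs = ["boilerplate footer", "unique analysis of results", "boilerplate footer",
        "unique analysis of results", "another original paragraph"]
print(dedup_ratio(docs))  # 0.4 -- two of the five paragraphs add nothing new
```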

1

t1_ixjuivv wrote

This can be considered good news. If all data is exhausted, people will actually be forced to research more data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend of using bigger and bigger datasets, because small(er) networks can be successfully trained without that much data.

3

t1_iwnoxf0 wrote

Have they mentioned EfficientZero?

I think the author is severely behind the current SOTA.

2

OP t1_iwq1iph wrote

https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works

A LessWrong article I found that explains how EfficientZero works.

In my opinion, the author's point is that systems like EfficientZero are much more data-efficient, and that similar ideas could also be applied to LLMs to increase their sample efficiency.
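
For anyone curious what "more data-efficient" looks like mechanically: one of EfficientZero's additions is a self-supervised consistency loss that pushes the dynamics model's predicted next latent state towards the encoding of the real next observation, so every transition provides training signal even without rewards. A very rough sketch of that idea (the networks, shapes and exact loss here are placeholders, not the actual implementation):

```python
import torch
import torch.nn.functional as F

# Placeholder networks standing in for the real encoder and dynamics model.
obs_dim, act_dim, latent_dim = 16, 4, 32
encoder   = torch.nn.Linear(obs_dim, latent_dim)              # o_t -> z_t
dynamics  = torch.nn.Linear(latent_dim + act_dim, latent_dim) # (z_t, a_t) -> predicted z_{t+1}
projector = torch.nn.Linear(latent_dim, latent_dim)

def consistency_loss(obs, action, next_obs):
    """Make the predicted next latent agree with the encoding of the real next
    observation. No reward or label is needed, so every transition adds signal."""
    z_t = encoder(obs)
    z_pred = dynamics(torch.cat([z_t, action], dim=-1))
    with torch.no_grad():                    # stop-gradient on the target branch
        z_target = encoder(next_obs)
    return -F.cosine_similarity(projector(z_pred), z_target, dim=-1).mean()

obs, action, next_obs = torch.randn(8, obs_dim), torch.randn(8, act_dim), torch.randn(8, obs_dim)
print(consistency_loss(obs, action, next_obs))
```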

To be honest, I hope this post gets enough attention that the authors of the paper can answer our questions.

3

OP t1_iwnpy8m wrote

Yes, they mention it at the end of their blog article, but I think it was only meant as an example of how better sample efficiency could be achieved, not as a SOTA comparison.

1

t1_iwp5r0a wrote

There is a lot more data that could be used in the form of private communications (for example all iMessage chats), if only the ethical and legal side could be sorted out.

2

t1_ix96ivb wrote

We already have for most languages other than English. For those languages, data efficiency is the only way to catch up.

2