Submitted by t3_yx7zft in MachineLearning

Paper: https://arxiv.org/abs/2211.04325

Blog: https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

Abstract:

>We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
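
For intuition, the extrapolation amounts to comparing two growth curves, projected training dataset size versus projected data stock, and finding where they cross. A minimal sketch of that calculation (all starting values and growth rates below are made-up placeholders, not the paper's fitted estimates):

```python
# Illustrative placeholders only -- not the paper's fitted estimates.
tokens_used_2022 = 1e12        # tokens consumed by the largest training runs (assumed)
dataset_growth_rate = 0.50     # ~50% more training tokens per year (assumed)

stock_2022 = 1e14              # stock of usable language tokens on the web (assumed)
stock_growth_rate = 0.07       # stock grows roughly with content creation (assumed)

def exhaustion_year(start_year=2022, horizon=2100):
    """Return the first year in which projected dataset size exceeds the projected stock."""
    used, stock = tokens_used_2022, stock_2022
    for year in range(start_year, horizon):
        if used >= stock:
            return year
        used *= 1 + dataset_growth_rate
        stock *= 1 + stock_growth_rate
    return None

print(exhaustion_year())  # crossing point under these made-up parameters
```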

Possible solutions based on the following papers:

Retrieval mechanisms (https://arxiv.org/abs/2112.04426), EfficientZero (https://arxiv.org/abs/2111.00210) and synthetic data (https://openreview.net/forum?id=NiEtU7blzN) can be seen as possible solutions, though each of them still needs to be improved on.
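
To make the retrieval idea concrete, below is a toy nearest-neighbour lookup over a tiny corpus. It is only a sketch: real systems such as RETRO use learned dense encoders and approximate nearest-neighbour indexes, while this uses a throwaway character-count embedding.

```python
import numpy as np

# Toy corpus; real retrieval-augmented models consult a datastore of trillions of tokens.
corpus = [
    "EfficientZero reaches strong Atari scores with little data.",
    "Retrieval lets a small model consult a large external datastore.",
    "Synthetic data can augment scarce high-quality text.",
]

def embed(text: str) -> np.ndarray:
    """Trivial bag-of-characters embedding (placeholder for a learned encoder)."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

corpus_vecs = np.stack([embed(t) for t in corpus])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus chunks most similar to the query (cosine similarity)."""
    sims = corpus_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(-sims)[:k]]

# The retrieved chunk would be prepended to the model's context at generation time.
print(retrieve("how does retrieval help small models?"))
```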

https://preview.redd.it/5tji6jd60e0a1.jpg?width=1559&format=pjpg&auto=webp&s=d7b5e5dbe6836fc0a59a17281cb7e2ea20e56727

https://preview.redd.it/qgsmjod60e0a1.jpg?width=1544&format=pjpg&auto=webp&s=d949c561f4a006791fecaf56bd155265b4580389

https://preview.redd.it/0zwq9ld60e0a1.jpg?width=1200&format=pjpg&auto=webp&s=808d578f3ac19ca4556830c21646d90132687918

53

Comments

t1_iwode1v wrote

What’s wrong with self-supervision? It enables combinatorial expansion of dataset sizes if the task is specified well.
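
For example, a single sentence already yields many (masked input, target) pairs under a masking objective, which is where the combinatorial expansion comes from. A toy sketch (the masking scheme here is purely illustrative):

```python
from itertools import combinations

tokens = "the stock of high quality text is finite".split()

def masked_examples(tokens, max_masks=2):
    """Enumerate (masked input, targets) pairs from one sequence -- one sentence
    becomes many self-supervised training examples without any extra labels."""
    examples = []
    for n in range(1, max_masks + 1):
        for positions in combinations(range(len(tokens)), n):
            masked = [t if i not in positions else "[MASK]" for i, t in enumerate(tokens)]
            targets = [tokens[i] for i in positions]
            examples.append((" ".join(masked), targets))
    return examples

print(len(masked_examples(tokens)))  # 8 + C(8,2) = 36 examples from one 8-token sentence
```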

10

t1_iwnbmrx wrote

AFAIK most LLMs don't even use one epoch?

4

t1_iwo4w46 wrote

Technically aren’t you always doing at least one epoch? You’re doing at least one pass through all of your data, even if that data is less than the amount you theoretically could have used.

7

t1_iwoq0ug wrote

Not a complete one. GPT-3, I think, didn't complete its first pass-through.

12

t1_iwpi7r5 wrote

You could argue GPT-3 was trained on a subset of the available training data, no?

Not completing the first pass-through means the remaining data could just be considered not part of the training data.

7

t1_iwplk0c wrote

Semantics. It didn't see any of its data more than once, and it had more available. Not one full epoch.
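
Rough back-of-the-envelope of what "less than one epoch" means with weighted sampling; the token counts and mixture weights below are invented for illustration, not GPT-3's actual mixture:

```python
# Hypothetical training mixture -- all numbers are illustrative assumptions.
total_tokens_trained = 300e9  # tokens actually consumed during training (assumed)

sources = {
    # name: (tokens available, sampling weight)
    "web_crawl": (410e9, 0.62),
    "books":     (120e9, 0.25),
    "wikipedia": (10e9,  0.02),
    "other":     (150e9, 0.11),
}

for name, (size, weight) in sources.items():
    effective_epochs = weight * total_tokens_trained / size
    print(f"{name}: {effective_epochs:.2f} effective epochs")
# Under these made-up weights no source completes a full pass,
# which is what "less than one epoch" means in practice.
```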

9

t1_iwpltkw wrote

Sure, but in theory my little Hello World network also had more data available on the internet.

4

t1_ix96sfz wrote

Yeah, this gives you an idea of how little of the data is actually worth going through - most of it repeats structures found elsewhere in the data, and isn't very diverse. Going through huge low-curation datasets is inefficient: the data diversity just isn't there.
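
A crude way to see this on any dump: exact-hash deduplication of paragraphs. Real curation pipelines also catch near-duplicates (MinHash and similar), so this only gives a lower bound on the redundancy; the snippet below is just a sketch:

```python
import hashlib

def dedup_ratio(paragraphs):
    """Fraction of paragraphs that are exact duplicates of an earlier one."""
    seen, dupes = set(), 0
    for p in paragraphs:
        h = hashlib.sha1(p.strip().lower().encode()).hexdigest()
        if h in seen:
            dupes += 1
        else:
            seen.add(h)
    return dupes / max(len(paragraphs), 1)

docs = ["boilerplate footer", "unique analysis of results", "boilerplate footer",
        "unique analysis of results", "another original paragraph"]
print(dedup_ratio(docs))  # 0.4 -- two of the five paragraphs add nothing new
```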

1

t1_ixjuivv wrote

This can be considered good news. If all data is exhausted, people will actually be forced to research more data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend of using bigger and bigger datasets, because small(er) networks can be successfully trained without that much data.

3

t1_iwnoxf0 wrote

Have they mentioned EfficientZero?

I think the author is severely behind the current SOTA.

2

OP t1_iwq1iph wrote

https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works

A LessWrong article I found that explains how EfficientZero works.

In my opinion, the author's point is that systems like EfficientZero are much more data-efficient, and that similar ideas could also be applied to LLMs to increase their sample efficiency.
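
For anyone curious what "more data-efficient" looks like mechanically: one of EfficientZero's additions is a self-supervised consistency loss that pushes the dynamics model's predicted next latent state towards the encoding of the real next observation, so every transition provides training signal even without rewards. A very rough sketch of that idea (the networks, shapes and exact loss here are placeholders, not the actual implementation):

```python
import torch
import torch.nn.functional as F

# Placeholder networks standing in for the real encoder and dynamics model.
obs_dim, act_dim, latent_dim = 16, 4, 32
encoder   = torch.nn.Linear(obs_dim, latent_dim)              # o_t -> z_t
dynamics  = torch.nn.Linear(latent_dim + act_dim, latent_dim) # (z_t, a_t) -> predicted z_{t+1}
projector = torch.nn.Linear(latent_dim, latent_dim)

def consistency_loss(obs, action, next_obs):
    """Make the predicted next latent agree with the encoding of the real next
    observation. No reward or label is needed, so every transition adds signal."""
    z_t = encoder(obs)
    z_pred = dynamics(torch.cat([z_t, action], dim=-1))
    with torch.no_grad():                    # stop-gradient on the target branch
        z_target = encoder(next_obs)
    return -F.cosine_similarity(projector(z_pred), z_target, dim=-1).mean()

obs, action, next_obs = torch.randn(8, obs_dim), torch.randn(8, act_dim), torch.randn(8, obs_dim)
print(consistency_loss(obs, action, next_obs))
```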

To be honest, I hope this post gets enough attention that the authors of the paper can answer our questions.

3

OP t1_iwnpy8m wrote

Yes, they mention it at the end of their blog article, but I think it was only meant as an example of how better sample efficiency could be achieved, not as a SOTA comparison.

1

t1_iwp5r0a wrote

There is a lot more data that could be used in the form of private communications (for example all iMessage chats), if only the ethical and legal side could be sorted out.

2

t1_ix96ivb wrote

We already have for most languages other than English. For those languages, data efficiency is the only way to catch up.

2