marr75 t1_iyjvtdc wrote

Just guessing here, but: overfitting.

32

Internal-Diet-514 t1_iykhg3s wrote

I think so too. I'm confused why they would need to train for 14 days; from skimming the paper, it doesn't seem like the dataset itself is that large. I bet a DL solution that was parameterized correctly for the problem would outperform the traditional statistical approaches.

19

marr75 t1_iykwulm wrote

While I agree with your general statement, my gut says a well-parameterized/regularized deep learning solution would perform as well as an ensemble of statistical approaches (without the expertise needed to select the statistical approaches), but would be harder to explain/interpret.

15

TheDrownedKraken t1_iyko6jf wrote

I’m just curious, why do you think that?

3

Internal-Diet-514 t1_iymjci2 wrote

If a model has more parameters than data points in the training set, it can quickly just learn the training set, resulting in an over-fit model. You don't always need 16+ attention heads to have the best model for a given dataset; a single self-attention layer with one head can still model more complex relationships among the inputs than something like ARIMA would.
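
For a sense of scale, here's a minimal sketch of that idea (assuming PyTorch; the embedding/window sizes are illustrative and nothing here comes from the paper), and it lands well under a thousand parameters:

```python
# Minimal single-head self-attention over a window of an already-embedded series.
# All sizes are illustrative; nothing here is taken from the paper under discussion.
import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)       # every timestep attends to every other
        return self.out(attn @ v)                  # (batch, seq_len, 1)


model = SingleHeadSelfAttention()
print(sum(p.numel() for p in model.parameters()))  # 833 parameters
```

Even at that size it weighs every timestep against every other timestep, which is the kind of interaction a fixed-lag linear model like ARIMA can't express directly.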

2

kraegarthegreat t1_iyor5g6 wrote

This is something I have found in my research. I keep seeing people making models with millions of parameters when I am able to achieve 99% of the performance with roughly 1k.

2

TropicalAudio t1_iylsprn wrote

Little need to speculate in this case: they're trying to fit giant models on a dataset that's a fraction of a megabyte, without any targeted pretraining or prior. That's like trying to prove trains are slower than running humans by having the two compete in a 100 m race from a standstill. The biggest set (monthly observations) is around 105 kB of data. If anyone is surprised that your average 10 GB+ network doesn't perform very well there, well... I suppose now you know.
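
To put rough numbers on that mismatch (a back-of-the-envelope sketch assuming float32 storage for both the data and a hypothetical 10 GB checkpoint):

```python
# Rough scale comparison: the ~105 kB monthly set vs a hypothetical 10 GB checkpoint,
# assuming everything is stored as 4-byte float32 values.
DATA_BYTES = 105 * 1024               # ~105 kB of monthly observations
CHECKPOINT_BYTES = 10 * 1024**3       # ~10 GB of serialized weights

data_values = DATA_BYTES // 4         # ~27 thousand numbers
model_params = CHECKPOINT_BYTES // 4  # ~2.7 billion parameters

print(f"{data_values:,} data values vs {model_params:,} parameters")
print(f"~{model_params // data_values:,}x more parameters than data values")
```

That's roughly 100,000 parameters for every observation in the set.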

7

marr75 t1_iymo8k3 wrote

Yeah

> Just guessing here, but

is a common US English idiom that typically means "obviously."

You're absolutely right, though. Just by comparing the training data to the training process and the serialized weights, you can see how readily this should overfit. Once your model is noticeably bigger than a dictionary of (X, y) pairs covering all of your training data, it's very hard to avoid overfitting.
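
A crude sanity check along those lines (a hypothetical helper, not anything from the paper; the names are made up) is to compare the trainable parameter count to the number of training pairs before committing to a two-week training run:

```python
# Hypothetical overfitting smell test: flag models with more trainable parameters
# than (input, target) pairs in the training set.
import torch.nn as nn


def overfit_smell_test(model: nn.Module, n_training_pairs: int) -> None:
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{n_params:,} trainable parameters vs {n_training_pairs:,} training pairs")
    if n_params > n_training_pairs:
        print("Warning: the model has enough raw capacity to memorize the training set.")
```

Something like `overfit_smell_test(model, len(train_dataset))` before kicking off training would flag exactly the situation being described here.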

I volunteer with a group that develops interest and skills in science and tech for kids from historically excluded groups. I was teaching a lab on CV last month and my best student was like, "What if I train for 20 epochs, tho? What about 30?" and the performance improved (but didn't generalize as well). He didn't understand generalization yet, so instead he looked at the improvement trend, had a lightbulb moment, and was like, "What if I train for 10,000 epochs???" I should check to see if his name is on the list of collaborators for the paper 😂

3