marr75 t1_iyjvtdc wrote

Just guessing here, but: overfitting.

32

Internal-Diet-514 t1_iykhg3s wrote

I think so too. I'm confused why they would need to train for 14 days; from skimming the paper, it doesn't seem like the dataset itself is that large. I bet a DL solution that was parameterized correctly for the problem would outperform the traditional statistical approaches.

19

marr75 t1_iykwulm wrote

While I agree with your general statement, my gut says a well-parameterized/regularized deep learning solution would perform as well as an ensemble of statistical approaches (without the expertise needed to select the statistical approaches), but would be harder to explain/interpret.

15

TheDrownedKraken t1_iyko6jf wrote

I’m just curious, why do you think that?

3

Internal-Diet-514 t1_iymjci2 wrote

If a model has more parameters than data points in the training set, it can quickly just learn the training set, resulting in an over-fit model. You don't always need 16+ attention heads to have the best model for a given dataset; a single self-attention layer with one head can still model more complex relationships among the inputs than something like ARIMA would.
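
For a sense of scale, here's a minimal sketch of that idea (assuming PyTorch; the embedding/window sizes are illustrative and nothing here comes from the paper), and it lands well under a thousand parameters:

```python
# Minimal single-head self-attention over a window of an already-embedded series.
# All sizes are illustrative; nothing here is taken from the paper under discussion.
import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)       # every timestep attends to every other
        return self.out(attn @ v)                  # (batch, seq_len, 1)


model = SingleHeadSelfAttention()
print(sum(p.numel() for p in model.parameters()))  # 833 parameters
```

Even at that size it weighs every timestep against every other timestep, which is the kind of interaction a fixed-lag linear model like ARIMA can't express directly.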

2

kraegarthegreat t1_iyor5g6 wrote

This is something I have found in my research. I keep seeing people making models with millions of parameters when I am able to achieve 99% of the performance with roughly 1k.

2

TropicalAudio t1_iylsprn wrote

Little need to speculate in this case: they're trying to fit giant models on a dataset that's a fraction of a megabyte, without any targeted pretraining or prior. That's like trying to prove trains are slower than running humans by having the two compete in a 100 m race from a standstill. The biggest set (monthly observations) is around 105 kB of data. If anyone is surprised that your average 10 GB+ network doesn't perform very well there, well... I suppose now you know.
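
To put rough numbers on that mismatch (a back-of-the-envelope sketch assuming float32 storage for both the data and a hypothetical 10 GB checkpoint):

```python
# Rough scale comparison: the ~105 kB monthly set vs a hypothetical 10 GB checkpoint,
# assuming everything is stored as 4-byte float32 values.
DATA_BYTES = 105 * 1024               # ~105 kB of monthly observations
CHECKPOINT_BYTES = 10 * 1024**3       # ~10 GB of serialized weights

data_values = DATA_BYTES // 4         # ~27 thousand numbers
model_params = CHECKPOINT_BYTES // 4  # ~2.7 billion parameters

print(f"{data_values:,} data values vs {model_params:,} parameters")
print(f"~{model_params // data_values:,}x more parameters than data values")
```

That's roughly 100,000 parameters for every observation in the set.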

7

marr75 t1_iymo8k3 wrote

Yeah

> Just guessing here, but

is a common US English idiom that typically means "obviously."

You're absolutely right, though. Just by comparing the training data to the training process and the serialized weights, you can see how readily this should overfit. Once your model is noticeably bigger than a dictionary of (X, y) pairs covering all of your training data, it's very hard to avoid overfitting.
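
A crude sanity check along those lines (a hypothetical helper, not anything from the paper; the names are made up) is to compare the trainable parameter count to the number of training pairs before committing to a two-week training run:

```python
# Hypothetical overfitting smell test: flag models with more trainable parameters
# than (input, target) pairs in the training set.
import torch.nn as nn


def overfit_smell_test(model: nn.Module, n_training_pairs: int) -> None:
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{n_params:,} trainable parameters vs {n_training_pairs:,} training pairs")
    if n_params > n_training_pairs:
        print("Warning: the model has enough raw capacity to memorize the training set.")
```

Something like `overfit_smell_test(model, len(train_dataset))` before kicking off training would flag exactly the situation being described here.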

I volunteer with a group that develops interest and skills in science and tech for kids from historically excluded groups. I was teaching a lab on CV last month and my best student was like, "What if I train for 20 epochs, tho? What about 30?" and the performance improved (but didn't generalize as well). He didn't understand generalization yet, so instead he looked at the improvement trend, had a lightbulb moment, and was like, "What if I train for 10,000 epochs???" I should check to see if his name is on the list of collaborators for the paper 😂

3