Submitted by GraciousReformer t3_118pof6 in MachineLearning
yldedly t1_j9j6gk8 wrote
>discover arbitrary functions
Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.
ewankenobi t1_j9jl91t wrote
I like your wording, did you come up with that definition yourself or is it from a paper?
yldedly t1_j9jorh1 wrote
It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.
[deleted] t1_j9jsgf6 wrote
What is "bounded interval" here?
yldedly t1_j9judc7 wrote
Any interval [a, b], where a and b are real numbers. In practice, it means the approximation will only be good in the parts of the domain where there is training data. I have a concrete example in a blog post of mine: https://deoxyribose.github.io/No-Shortcuts-to-Knowledge/
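To make that concrete in code: here's a tiny numpy MLP (the architecture, the sin target, and all hyperparameters are my own illustrative choices, not anything canonical) trained on the bounded interval [-3, 3] and then evaluated outside it, where the fit falls apart:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data only covers the bounded interval [-3, 3]
X = rng.uniform(-3, 3, (256, 1))
y = np.sin(X)

# One hidden layer with tanh activation, trained by minibatch SGD
W1 = rng.normal(0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)
lr, B = 0.05, 32
for _ in range(5000):
    i = rng.integers(0, len(X), B)          # stochastic minibatch
    h = np.tanh(X[i] @ W1 + b1)             # forward pass
    err = h @ W2 + b2 - y[i]
    gW2 = h.T @ err / B; gb2 = err.mean(0)  # backward pass (chain rule)
    gh = err @ W2.T * (1 - h ** 2)
    gW1 = X[i].T @ gh / B; gb1 = gh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def predict(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

Xin = np.linspace(-3, 3, 100).reshape(-1, 1)   # inside the training interval
Xout = np.linspace(6, 9, 100).reshape(-1, 1)   # outside it
mse_in = float(np.mean((predict(Xin) - np.sin(Xin)) ** 2))
mse_out = float(np.mean((predict(Xout) - np.sin(Xout)) ** 2))
```

Inside [-3, 3] the error is small; on [6, 9] the tanh units have saturated, the network extrapolates roughly as a constant, and the error blows up relative to the in-interval fit.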
GraciousReformer OP t1_j9jfxvh wrote
Yes, but is DL the unique mechanism? Why DL?
yldedly t1_j9jpuky wrote
There are two aspects: scalability and inductive bias. DL is scalable because compositions of differentiable functions make backpropagation fast, and because those functions are mostly matrix multiplications, GPU acceleration is effective. Combine this with stochastic gradients, and you can train on very large datasets very quickly.
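As a sketch of the "compositions of differentiable functions" point (the two-layer network here is a made-up minimal example, nothing from the thread): backprop is just the chain rule applied layer by layer, each step a matrix multiplication, and the result agrees with a brute-force finite-difference estimate of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                  # minibatch of 4 inputs
t = rng.normal(size=(4, 1))                  # targets
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 1))

def loss(W1, W2):
    h = np.tanh(x @ W1)                      # layer 1
    return 0.5 * np.sum((h @ W2 - t) ** 2)   # squared error on layer 2

# Backprop: chain rule through the composition, all matmuls
h = np.tanh(x @ W1)
err = h @ W2 - t                             # dL/dpred
gW2 = h.T @ err                              # dL/dW2
gW1 = x.T @ (err @ W2.T * (1 - h ** 2))      # dL/dW1 via tanh' = 1 - tanh^2

# Check against central finite differences, one weight at a time
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        num[i, j] = (loss(Wp, W2) - loss(Wm, W2)) / (2 * eps)

max_err = float(np.abs(num - gW1).max())
```

The analytic gradient costs a handful of matrix products regardless of how many parameters there are; the finite-difference version needs two forward passes per parameter, which is why the chain-rule formulation is what makes training at scale feasible.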
Inductive biases are what make DL effective in practice, not just in theory. The universal approximation theorem only guarantees that an architecture and weight setting exist that approximate a given function; the bias of DL towards low-dimensional smooth manifolds happens to match many real-world datasets, which means SGD will easily find a local optimum with these properties. When it doesn't (for example on tabular data, where discontinuities are common), DL performs worse than alternatives, even if with enough data it would eventually approximate a discontinuity.
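A toy illustration of the discontinuity point (I'm using ordinary least squares as a stand-in for a smooth function class, and a single axis-aligned split as the building block of tree ensembles; both choices are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 200))
y = (x > 0.2).astype(float)            # a discontinuous step target

# Smooth model: ordinary least squares fit of a line
A = np.c_[x, np.ones_like(x)]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_mse = float(np.mean((A @ coef - y) ** 2))

# Tree-style model: best single axis-aligned split,
# predicting the mean on each side
best_sse = np.inf
for t in x[:-1]:                       # keep both sides non-empty
    left, right = y[x <= t], y[x > t]
    sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    best_sse = min(best_sse, sse)
stump_mse = float(best_sse / len(x))
```

The single split places its threshold at the discontinuity and fits the step exactly, while the smooth model is left with irreducible error, which is the intuition behind trees beating nets on jumpy tabular targets.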
GraciousReformer OP t1_j9jr4i4 wrote
"for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity." True. Are there references on this issue?
yldedly t1_j9jr821 wrote
This one is pretty good: https://arxiv.org/abs/2207.08815
GraciousReformer OP t1_j9jrhjd wrote
This is a great point. Thank you. So do you mean that DL works for language models only when they get a large amount of data?
GraciousReformer OP t1_j9k1srq wrote
But then how is that different from the result that NNs work better on ImageNet?
yldedly t1_j9k3orr wrote
Not sure what you're asking. CNNs have inductive biases suited for images.
GraciousReformer OP t1_j9k4974 wrote
So it works for images but not for tabular data?
yldedly t1_j9k5n8n wrote
It depends a lot on what you mean by "works". You can get a low test error with NNs on tabular data if you have enough of it. For smaller datasets, you'll get a lower test error using tree ensembles. For low out-of-distribution error, neither will work.
[deleted] t1_j9jt2vp wrote
>the bias of DL towards low-dimensional smooth manifolds
What is this? Got all the rest but that
yldedly t1_j9jtuzy wrote
I'll link you to an old comment: https://www.reddit.com/r/MachineLearning/comments/z12zxj/comment/ix9t149/?utm_source=share&utm_medium=web2x&context=3