Submitted by GraciousReformer t3_118pof6 in MachineLearning

"Deep learning is the only thing that currently works at scale it's the only class of algorithms that is able to discover arbitrary functions in a reasonable amount of time."

https://www.youtube.com/watch?v=p-OYPRhqRCg

I know of the universal approximation theorem. But is there any mathematical formulation of this statement?

122

Comments


activatedgeek t1_j9jvj8h wrote

For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.

Flexibility ensures that many functions “can” be approximated in principle. That’s the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. This is not unique to DL: deep random forests, Fourier bases, polynomial bases, and Gaussian processes are all universal function approximators (with some extra technical details).

The part unique to DL is that its inductive biases happen to match some complex structured problems, including vision and language, and that is what makes it generalize well. Inductive bias is a loosely defined term; I can provide examples and references.

CNNs provide the inductive bias of preferring translation-equivariant functions (only approximately so, due to pooling layers; see the sketch below). https://arxiv.org/abs/1806.01261

Graph neural networks provide a relational inductive bias. https://arxiv.org/abs/1806.01261

Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity. https://arxiv.org/abs/1805.08522
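
To make the translation-equivariance point above concrete, here is a minimal sketch (PyTorch assumed; the random kernel is purely illustrative): shift the input, and the convolution output shifts by the same amount, up to edge effects.

```python
# Minimal sketch (PyTorch assumed): a convolution commutes with translation,
# up to edge effects -- shift the input, and the output shifts by the same amount.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 1, 16, 16)          # one 16x16 grayscale image
kernel = torch.randn(1, 1, 3, 3)           # a single 3x3 filter

shifted = torch.roll(image, shifts=2, dims=-1)   # translate 2 pixels right

out = F.conv2d(image, kernel, padding=1)
out_shifted = F.conv2d(shifted, kernel, padding=1)

# Away from the borders, conv(shift(x)) == shift(conv(x)).
print(torch.allclose(torch.roll(out, 2, dims=-1)[..., :, 3:-3],
                     out_shifted[..., :, 3:-3], atol=1e-5))   # True
```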

107

SodomizedPanda t1_j9jyhem wrote

And somehow, the best answer is at the bottom of the thread...

A small addition: recent research suggests that the implicit bias in DNNs that helps generalization lies not only in the structure of the network but in the learning algorithm as well (Adam, SGD, ...). https://francisbach.com/rethinking-sgd-noise/ https://francisbach.com/implicit-bias-sgd/

27

red75prime t1_j9k0i84 wrote

Does in-context learning suggest that inductive biases could also be extracted from training data?

11

hpstring t1_j9kf6te wrote

This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.

4

-vertigo-- t1_j9kih7k wrote

hmm for some reason the arxiv links are giving 403 forbidden

1

sanman t1_ja3kkkp wrote

first 2 links are the same - do you have the one for CNNs inductive bias?

1

GraciousReformer OP t1_j9ljjqu wrote

>inductive biases

Then why does DL have inductive biases and others do not?

0

activatedgeek t1_j9lu7q7 wrote

All model classes have inductive biases, e.g. random forests have the inductive bias of producing axis-aligned region splits. But clearly, that inductive bias is not good enough for image classification, because much of the information in the pixels is spatially correlated in ways that axis-aligned regions cannot capture as well as specialized neural networks under the same budget. By budget, I mean things like training time, model capacity, etc.

If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.
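
A rough sketch of the axis-aligned limitation in a toy 2-D setting (scikit-learn assumed; the diagonal-boundary dataset is made up for illustration): a linear model captures a diagonal boundary with one line, while a shallow tree can only approximate it with a staircase of axis-aligned splits.

```python
# Sketch (scikit-learn assumed): a diagonal decision boundary is one line for a
# linear model, but a tree can only approximate it with axis-aligned splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # label set by a diagonal line

X_test = rng.uniform(-1, 1, size=(2000, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # few axis-aligned splits

print("linear:", linear.score(X_test, y_test))     # ~1.0
print("tree  :", tree.score(X_test, y_test))       # noticeably lower at this depth
```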

4

currentscurrents t1_j9n3o7u wrote

Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.

1

GraciousReformer OP t1_j9lwe7i wrote

Still, why is it that DL has better inductive biases than others?

0

activatedgeek t1_j9lz6ib wrote

I literally gave an example of how (C)NNs have better inductive bias than random forests for images.

You need to ask more precise questions than just a "why".

3

GraciousReformer OP t1_j9m1o15 wrote

So it is like an ability to capture "correlations" that cannot be captured by random forests.

1

currentscurrents t1_j9n8in9 wrote

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide if an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer; but it doesn't work very well.

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images and that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
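
A minimal sketch of the "filters over the image" idea (PyTorch assumed; the kernel values are hand-set for illustration, whereas a CNN learns them from data and stacks them into the hierarchy described above):

```python
# Sketch (PyTorch assumed; kernel hand-set for illustration): a convolution
# slides a small filter over the image, so locality -- "nearby pixels are
# related" -- is built in rather than learned.
import torch
import torch.nn.functional as F

# Tiny image: left half dark, right half bright -> a single vertical edge.
image = torch.zeros(1, 1, 8, 8)
image[..., :, 4:] = 1.0

# A hand-set vertical-edge filter (Sobel-like).
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)

response = F.conv2d(image, kernel)   # no padding: valid convolution
print(response[0, 0])                # nonzero only in the columns around the edge
```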

1

GraciousReformer OP t1_j9oeeo3 wrote

What are the situations that the bias for the hierarchy is not helpful?

1

relevantmeemayhere t1_j9ij6cc wrote

Lol. The fact that we use general linear models in every scientific field, and have been for decades should tell you all you need to know about this statement.

81

adventuringraw t1_j9in5sj wrote

I mean... the statement specifically uses the phrase 'arbitrary functions'. GLMs are a great tool in the toolbox, but the function family it optimizes over is very far from 'arbitrary'.

I think the statement's mostly meaning 'find very nonlinear functions of interest when dealing with very large numbers of samples from very high dimensional sample spaces'. GLM's are used in every scientific field, but certainly not for every application. Some form of deep learning really is the only game in town still for certain kinds of problems at least.

70

relevantmeemayhere t1_j9kin48 wrote

I agree with you. I was just pointing out that to say they are the only solution is foolish, as the quote implied

This quote could have just been used without much context, so grain of salt.

2

adventuringraw t1_j9ll3fp wrote

I can see how the quote could be made slightly more accurate. In particular, tabular data in general is still better tackled with something like XGBoost instead of deep learning, so deep learning certainly hasn't turned everything into a nail for one universal hammer yet.

1

Featureless_Bug t1_j9iy4yq wrote

Haven't heard of GLMs being successfully used for NLP and CV recently. And those are about the only things that would be described as large scale in ML. The statement is completely correct - even stuff like gradient boosting does not work at scale in that sense.

26

chief167 t1_j9kt5ho wrote

We use gradient boosting at quite a big scale. Not LLM big, but still big. It's just not NLP or CV at all. It's for fraud detection in large transactional tabular datasets. And it outperforms basically all neural network, shallow or deep, approaches.

2

Featureless_Bug t1_j9kuu22 wrote

Large scale is somewhere north of 1-2 TB of data. Even if you had that much data, in most cases tabular data has such a simple structure that you wouldn't need that much of it to achieve the same performance - so I wouldn't call any kind of tabular data large scale, to be frank.

−2

relevantmeemayhere t1_j9ki2x1 wrote

Because they are useful for some problems and not others, like every algorithm? Nowhere in my statement did I say they are monolithic in their use across all subdomains of ml

The statement was that deep learning is the only thing that works at scale. It’s not lol. Deep learning struggles in a lot of situations.

0

Featureless_Bug t1_j9kvek5 wrote

Ok, name one large scale problem where GLMs are the best prediction algorithm possible.

1

relevantmeemayhere t1_j9kygtx wrote

Any problem where you want things like effect estimates lol. Or error estimates. Or models that generate joint distributions

So, literally a ton of them. Which industries don’t like things like that?

−2

VirtualHat t1_j9j2gwx wrote

Large linear models tend not to scale well to large datasets if the solution is not in the model class. Because of this lack of expressivity, linear models tend to do poorly on complex problems.

15

relevantmeemayhere t1_j9khp8m wrote

As you mentioned, this is highly dependent on the functional relationship of the data.

You wouldn’t not use domain knowledge to determine that.

Additionally, non linear models tend to have their own drawbacks. Lack of interpretability, high variability being some of them

2

GraciousReformer OP t1_j9j7iwm wrote

>Large linear models tend not to scale well to large datasets if the solution is not in the model class

Will you provide me a reference?

−8

VirtualHat t1_j9j8805 wrote

Linear models assume that the solution is of the form y=ax+b. If the solution is not of this form, then the best fit found is likely to be a poor solution.

I think Emma Brunskill's notes are quite good at explaining this. Essentially the model will underfit as it is too simple. I am making an assumption though, that a large dataset implies a more complex non-linear solution, but this is generally the case.
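
A small sketch of the underfitting point (NumPy and scikit-learn assumed; the quadratic target is made up for illustration): when the true relationship is outside the linear model class, the best linear fit underfits no matter how much data you add.

```python
# Sketch (NumPy + scikit-learn assumed): if the truth is y = x^2, the best
# linear fit underfits, while a more expressive model keeps improving.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=5000)   # quadratic, not y = ax + b

linear = LinearRegression().fit(X, y)
boosted = GradientBoostingRegressor().fit(X, y)

X_test = rng.uniform(-3, 3, size=(1000, 1))
y_test = X_test[:, 0] ** 2
print("linear R^2 :", linear.score(X_test, y_test))   # near 0: pure underfit
print("boosted R^2:", boosted.score(X_test, y_test))  # close to 1
```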

10

relevantmeemayhere t1_j9kifhu wrote

Linear models are often preferred for the reasons you mentioned. Under fitting is almost always preferred to overfitting.

1

VirtualHat t1_j9ll5i2 wrote

Yes, that's right. For many problems, a linear model is just what you want. I guess what I'm saying is that the dividing line between when a linear model is appropriate vs when you want a more expressive model is often related to how much data you have.

1

GraciousReformer OP t1_j9j8bsl wrote

Thank you. I understand the math. But I meant a real world example that "the solution is not in the model class."

−4

VirtualHat t1_j9j8uvr wrote

For example, in the Iris dataset, the class label is not a linear combination of the inputs. Therefore, if your model class is all linear models, you won't find the optimal solution or, in this case, even a good one.

If you extend the model class to include non-linear functions, then your hypothesis space now at least contains a good solution, but finding it might be a bit more tricky.

15

GraciousReformer OP t1_j9jgdmc wrote

But DL is not a linear model. Then what will be the limit of DL?

−13

terminal_object t1_j9jp51j wrote

You seem confused as to what you yourself are saying.

6

GraciousReformer OP t1_j9jppu7 wrote

"Artificial neural networks are often (demeneangly) called "glorified regressions". The main difference between ANNs and multiple / multivariate linear regression is of course, that the ANN models nonlinear relationships."

https://stats.stackexchange.com/questions/344658/what-is-the-essential-difference-between-a-neural-network-and-nonlinear-regressi

−3

PHEEEEELLLLLEEEEP t1_j9k691x wrote

Regression doesnt just mean linear regression, if that's what you're confused about

4

Acrobatic-Book t1_j9k94l4 wrote

The simplest example is the XOR problem (aka exclusive-or). This was also why multilayer perceptrons, the basis of deep learning, were actually created: a linear model cannot solve it.
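
A minimal sketch of that (scikit-learn assumed): a linear classifier cannot get XOR right, a small MLP can.

```python
# Sketch (scikit-learn assumed): XOR has no linear decision boundary,
# but a small multilayer perceptron separates it.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])   # XOR of the two inputs

linear = Perceptron(max_iter=1000).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=0).fit(X, y)

print("perceptron:", linear.score(X, y))   # at most 3/4 correct: no separating line
print("MLP       :", mlp.score(X, y))      # typically 1.0 once a hidden layer is there
```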

2

VirtualHat t1_j9lkto4 wrote

Oh wow, super weird to be downvoted just for asking for a reference. r/MachineLearning isn't what it used to be I guess, sorry about that.

1

BoiElroy t1_j9ioqcz wrote

Yeah, you should always exhaust existing classical methods first before reaching for deep learning.

13

[deleted] t1_j9ijb65 wrote

[deleted]

−5

sdmat t1_j9izc3q wrote

Not exactly but close enough?

1

Fancy-Jackfruit8578 t1_j9j5v25 wrote

Because every NN is just basically a big linear function… with a nonlinearity at the end.

−3

hpstring t1_j9jbhwv wrote

This is correct for two-layer NNs, not general NNs.

2

hpstring t1_j9jb96f wrote

Universal approximation is not enough, you need efficiency to make things work.

DL is the only class of algorithms that beats the curse of dimensionality when discovering certain (very general) classes of high-dimensional functions (something related to Barron spaces). Correct me if this is not accurate.
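
For reference, the kind of result being alluded to is, as far as I recall it (a hedged paraphrase of a Barron-type bound; see Barron 1993 for the exact hypotheses):

```latex
% Hedged paraphrase (from memory -- see Barron, 1993, for the precise
% hypotheses): if f has a Fourier representation with
%   C_f = \int_{\mathbb{R}^d} \|\omega\| \, |\hat{f}(\omega)| \, d\omega < \infty,
% then for every n there is a one-hidden-layer sigmoidal network f_n with n
% units such that, on the ball B_r and for any probability measure \mu,
\[
  \int_{B_r} \bigl( f(x) - f_n(x) \bigr)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}.
\]
% The point: the O(1/n) rate does not depend on the input dimension d, whereas
% approximation with n fixed (linear) basis functions degrades with d.
```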

54

inspired2apathy t1_j9jsbz6 wrote

Is that entirely accurate, though? There are all kinds of explicit dimensionality reduction methods. They can be combined with traditional ML models pretty easily for supervised learning. As I understand it, the unique thing DL gives us is a massive embedding that can encode/"represent" something like language or vision.

6

hpstring t1_j9jxzpm wrote

Well, traditional ML plus dimensionality reduction cannot crack, e.g., ImageNet recognition.

9

inspired2apathy t1_j9jzpqw wrote

Other models like PGMs can absolutely be applied to ImageNet, just not for SOTA accuracy.

−3

GraciousReformer OP t1_j9ji7t1 wrote

But why does DL beat the curse? Why is DL the only class?

4

hpstring t1_j9juk1f wrote

Q1: We don't know yet. Q2: Probably there are other classes but they haven't been discovered or are only at the early age of research.

13

NitroXSC t1_j9k09wt wrote

> Q2: Probably there are other classes but they haven't been discovered or are only at the early age of research.

I think there are many different classes that would work, but current DL is based in large part on matrix-vector operations, which can be implemented efficiently on current hardware.

10

randomoneusername t1_j9iyf7r wrote

I mean this has two elements in it.

DL is certainly not the only algorithm that works at scale.

26

[deleted] t1_j9jgblt wrote

[deleted]

−10

randomoneusername t1_j9jkzs7 wrote

The standalone statement you have there is very vague. Can I assume it is talking about NLP or CV projects?

On tabular data, even with nonlinear relationships, ordinary boosting and ensemble algorithms can scale and be at the top of the game.

13

bloodmummy t1_j9jzvnr wrote

It strikes me that people who tout DL as a hammer-for-all-nails have never touched tabular data in their lives. Go try a couple of Kaggle tabular competitions and you'll soon realise that DL can be very dumb, cumbersome, and data-hungry. Ensemble models, decision tree models, and even feature-engineered linear regression models still rule there and curb-stomp DL all day long (for most cases).

Tabular data is also still the type of data most-used with ML. I'm not a "DL-hater" if there is such a thing, in fact my own research is using DL only. But it isn't a magical wrench, and it won't be.

8

Mefaso t1_j9jgvoz wrote

Anything that scales sub-quadratically?

Anything "big-data"

1

GraciousReformer OP t1_j9jib2n wrote

Then why DL?

−10

suflaj t1_j9jjetb wrote

Because it requires the least amount of human intervention

Also because it subjectively sounds like magic to people who don't really understand it, so it both sells to management and to consumers.

At least it's easier for humans to cope by calling it magic than to accept that a lot of what AI can do is just stuff that is trivial and doesn't require humans to solve.

−12

chief167 t1_j9jev01 wrote

Define scale

Language models? Sure. Images? Sure. Huge amounts of transaction data to search for fraud? Xgboost all the way lol.

The no-free-lunch theorem: there is no single approach that is best for every possible problem. Jeez, I hate it when marketing takes over. You learn this principle in the first chapter of literally every data course.

20

activatedgeek t1_j9jt721 wrote

I think the no free lunch theorem is misquoted here. The NFL also assumes that all datasets from the universe of datasets are equally likely. But that is objectively false. Structure is more likely than noise.

15

chief167 t1_j9ku5mq wrote

I don't think it implies that all datasets are equally likely. I think it only implies that given all possible datasets, there is no best approach to modelling them. All possible != All are equally likely

But I don't have my book with me, and I don't trust the internet, since it seems to lead to random blog posts instead of the original paper (Wikipedia gave a 404 in the footnotes).

0

activatedgeek t1_j9lnhvv wrote

See Theorem 2 (Page 34) of The Supervised Learning No-Free-Lunch Theorems.

It states the result uniformly averaged over all "f", the input-output mapping, i.e. the function that generates the dataset (this is the noise-free case). It also provides a version uniformly averaged over all "P(f)", a distribution over the data-generating functions.

So while you could still have different data-generating distributions P(f), the result is defined over all such distributions uniformly averaged.

The NFL is sort of a worst-case result, and I think it is pretty meaningless and inconsequential for the real world.

Let me know if I have misinterpreted this!

1

ktpr t1_j9jmqq7 wrote

I feel like recently ML boosters come to this subreddit, make grand claims, and then use the ensuing discussion, time, and energy from others to correct their clickbait content at our expense.

17

yldedly t1_j9j6gk8 wrote

>discover arbitrary functions

Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.

14

ewankenobi t1_j9jl91t wrote

I like your wording, did you come up with that definition yourself or is it from a paper?

1

yldedly t1_j9jorh1 wrote

It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.

9

[deleted] t1_j9jsgf6 wrote

What is "bounded interval" here?

1

yldedly t1_j9judc7 wrote

Any interval [a; b] where a and b are numbers. In practice, it means that the approximation will be good in the parts of the domain where there is training data. I have a concrete example in a blog post of mine: https://deoxyribose.github.io/No-Shortcuts-to-Knowledge/
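
Here is also a quick self-contained sketch of the same point (NumPy and scikit-learn assumed; sin is just an illustrative target): the fit is good inside the training interval and falls apart outside it.

```python
# Sketch (NumPy + scikit-learn assumed): fit an MLP to sin(x) on the bounded
# interval [-3, 3]; interpolation inside the interval works, extrapolation doesn't.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(2000, 1))
y_train = np.sin(X_train[:, 0])

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000,
                   random_state=0).fit(X_train, y_train)

X_in = np.linspace(-3, 3, 200).reshape(-1, 1)    # inside the training interval
X_out = np.linspace(6, 12, 200).reshape(-1, 1)   # outside it

print("inside  MSE:", np.mean((mlp.predict(X_in) - np.sin(X_in[:, 0])) ** 2))
print("outside MSE:", np.mean((mlp.predict(X_out) - np.sin(X_out[:, 0])) ** 2))
# Expect the inside error to be small and the outside error to be large: the
# network typically extrapolates (piecewise-)linearly, not periodically.
```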

8

[deleted] t1_j9jukfu wrote

Interesting but that is valid for us as well. So I am not sure this is true once they learn very general things, like learning itself.

0

GraciousReformer OP t1_j9jfxvh wrote

Yes but is DL the unique mechanism? Why DL?

−10

yldedly t1_j9jpuky wrote

There are two aspects, scalability and inductive bias. DL is scalable because compositions of differentiable functions make backpropagation fast, and those functions being mostly matrix multiplications make GPU acceleration effective. Combine this with stochastic gradients, and you can train on very large datasets very quickly.
Inductive biases make DL effective in practice, not just in theory. While the universal approximation theorem guarantees that an architecture and weight-setting exist that approximate a given function, the bias of DL towards low-dimensional smooth manifolds reflects many real-world datasets, meaning that SGD will easily find a local optimum with these properties (and when it doesn't, for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity).
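
A minimal sketch of that recipe (PyTorch assumed; the data and model are toy illustrations): differentiable layers, gradients via backprop, and stochastic gradient steps on minibatches, so the cost per update does not grow with the dataset.

```python
# Minimal sketch (PyTorch assumed) of the recipe described above: a composition
# of differentiable layers, backprop, and minibatch SGD.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(10_000, 20)
y = (X.sum(dim=1, keepdim=True) > 0).float()        # toy binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(500):
    idx = torch.randint(0, X.shape[0], (128,))      # a random minibatch
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()                                  # backprop through the composition
    opt.step()                                       # one cheap stochastic update

print(float(loss))   # should have dropped well below the initial ~0.69
```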

4

GraciousReformer OP t1_j9jr4i4 wrote

"for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity." True. Is there references on this issue?

1

yldedly t1_j9jr821 wrote

This one is pretty good: https://arxiv.org/abs/2207.08815

1

GraciousReformer OP t1_j9jrhjd wrote

This is a great point. Thank you. So do you mean that DL works for language models only when it gets a large amount of data?

2

GraciousReformer OP t1_j9k1srq wrote

But then what is the difference from the result that NN works better for ImageNet?

1

yldedly t1_j9k3orr wrote

Not sure what you're asking. CNNs have inductive biases suited for images.

3

GraciousReformer OP t1_j9k4974 wrote

So it works for images but not for tabular data?

1

yldedly t1_j9k5n8n wrote

It depends a lot on what you mean by works. You can get a low test error with NNs on tabular data if you have enough of it. For smaller datasets, you'll get a lower test error using tree ensembles. For low out-of-distribution error neither will work.

3

1bir t1_j9jbi5e wrote

Apparently* decision trees are also capable of [universal function approximation](https://cstheory.stackexchange.com/a/46405).

Whether the algorithms for training them do that as well as the ones for deep NNs in practice is a separate issue.

*Haven't seen (& probably wouldn't understand) a proof.

4

GraciousReformer OP t1_j9jg13p wrote

Then why not use decision trees instead of DL?

−1

uhules t1_j9jpkun wrote

Before why, ask if. GBDTs are very widely used.

5

1bir t1_j9jnu2h wrote

>Whether the algorithms for training them do that as well as the ones for deep NNs in practice is a separate issue.

For supervised learning the big problem with decision trees (RFs, GBTs etc) seems to be representation learning

4

Brudaks t1_j9k6mo0 wrote

Because being a universal function approximator is not sufficient to be useful in practice, and IMHO is not even a particularly interesting property. We don't care whether something can approximate any function; we care whether it approximates the thing needed for a particular task, and being able to approximate it is a necessary but not sufficient condition. We care about efficiency of approximation (e.g. a single-hidden-layer network is a universal approximator only if you allow an impractical number of neurons), but even more important than how well the function can be approximated with a limited number of parameters is how well you can actually learn those parameters. This differs a lot between models: we don't care how well a model would fit the function with optimal parameters, we care how well it fits with the parameter values we can realistically identify with a bounded amount of computation.

That being said, we do use decision trees instead of DL; for some types of tasks the former outperform the latter and for other types of tasks its the other way around.

3

VirtualHat t1_j9lp6z3 wrote

There was a really good paper a few years ago that identified some biases in how DNNs learn that might explain why they work so well in practice compared to alternatives. Essentially, they are biased towards smoother solutions, which is often what is wanted.

This is still an area of active research, though. I think it's fair to say we still don't quite know why DNNs work as well as they do.

1

BoiElroy t1_j9ipbtg wrote

This is not the answer to your question, but one intuition I like about the universal approximation theorem, which I thought I'd share, is the comparison to a digital image. You use a finite set of pixels, each of which can take on a certain set of discrete values. With a 10 x 10 grid of pixels you can draw a crude approximation of a stick figure. With 1000 x 1000 you can capture a blurry but recognizable selfie. Within those finite pixels and the discrete values they can take, you can essentially capture anything you can dream of: every image in every movie ever made. Obviously there are other issues later, like whether your model's operational design domain matches the distribution of the training domain, or whether you just wasted a lot of GPU hours lol
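
For scale, a back-of-the-envelope sketch of that analogy (plain Python; 8-bit RGB assumed):

```python
# Sketch of the arithmetic behind the analogy: the number of distinct images a
# pixel grid can represent is astronomically large even at low resolution.
from math import log10

def log10_num_images(width, height, levels=256, channels=3):
    # levels^channels possibilities per pixel, raised to the number of pixels
    return width * height * channels * log10(levels)

print(log10_num_images(10, 10))        # ~722, i.e. about 10^722 possible 10x10 images
print(log10_num_images(1000, 1000))    # ~7.2 million, i.e. about 10^(7.2 million) images
```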

3

GraciousReformer OP t1_j9jh7zm wrote

Yes, a fine enough grid will approximate any digital image. But this is an approximation of an image on a grid. How does it lead to approximation by an NN?

−1

DigThatData t1_j9k17rr wrote

it's not. tree ensembles scale gloriously, as do approximations of nearest neighbors. there are certain (and growing) classes of problems for which deep learning produces seemingly magical results, but that doesn't mean it's the only path to a functional solution. It'll probably give you the best solution, but that doesn't mean it's the only way to do things.

in any event, if you want to better understand scaling properties of DL algorithms, a good place to start is the "double descent" literature.

3

JackBlemming t1_j9n4bp2 wrote

This is true. Netflix famously didn't use some complex neural net for choosing shows you'd like, precisely because it didn't scale. Neural nets are expensive, and if you can sacrifice a few percentage points to save hundreds of millions in server fees, that's probably good.

1

DigThatData t1_j9nlrii wrote

just to be clear: i'm not saying neural networks don't scale, i'm saying they're not the only class of learning algorithm that scales.

1

howtorewriteaname t1_j9kmbrv wrote

There's no mathematical formulation of that statement because there's no mathematical reasoning behind that statement. It's just an opinion (which I believe isn't true, btw).

2

30299578815310 t1_j9kziil wrote

This is just not true. As others noted, there are other algorithms which are universal approximators and run at scale. The key to the success of DNNs is unknown; one hypothesis is the lottery ticket hypothesis.


https://arxiv.org/abs/1803.03635

2

VirtualHat t1_j9lm05j wrote

It's worth noting that it wasn't until conv nets that DNNs took off. It's hard to think of a problem that a traditional vanilla MLP solves that can't also be solved with an SVM.

3

kvutxdy t1_j9l7on1 wrote

The universal approximation theorem only states that DNNs can approximate Lipschitz functions, not necessarily all functions.

2

VirtualHat t1_j9loi32 wrote

It should be all continuous functions, but I can't really think of any problems where this would limit the solution. The set of all continuous functions is a very big set!

As a side note, I think it's quite interesting that the theorem doesn't include periodic functions like sin, so I guess it's not quite all continuous functions, just continuous functions with bounded input.
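
For reference, a hedged paraphrase of the classical statement (Cybenko / Hornik style, from memory; check the originals for the exact conditions):

```latex
% Hedged paraphrase, from memory: for any continuous f on a compact set
% K \subset \mathbb{R}^d and any \varepsilon > 0, there exist n and weights
% v_i, w_i, b_i such that
\[
  \sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{n} v_i \, \sigma\!\bigl( w_i^{\top} x + b_i \bigr) \Bigr| < \varepsilon
\]
% for suitable activations \sigma (e.g. sigmoidal, or more generally any
% non-polynomial continuous activation). The compactness of K is exactly the
% "bounded input" caveat: \sin on all of \mathbb{R} is not covered, but \sin
% restricted to a compact interval is.
```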

4

pyfreak182 t1_j9slq8a wrote

It helps that the math behind back propagation (i.e. matrix multiplications) is easily parallelizable. The computations in the forward pass are independent of each other, and can be computed in parallel for different training examples. The same is true for the backward pass, which involves computing the gradients for each training batch independently.

And we have hardware accelerators like GPUs that are designed to perform large amounts of parallel computations efficiently.

The success of deep learning is just as much about implementation as it is theory.
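
A tiny sketch of that independence across examples (PyTorch assumed): processing a batch with one matrix multiplication gives the same result as processing the examples one at a time, which is exactly what maps well onto GPUs.

```python
# Sketch (PyTorch assumed): the forward pass for each example in a batch is an
# independent computation, so one big batched matmul equals many small ones.
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(32, 16)
batch = torch.randn(64, 32)

batched = layer(batch)                                   # one big matmul
one_by_one = torch.stack([layer(x) for x in batch])      # 64 small matmuls

print(torch.allclose(batched, one_by_one, atol=1e-6))    # True
```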

1

alterframe t1_ja2f7xu wrote

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems, by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.

1

elmcity2019 t1_j9jm4nq wrote

I have been an applied data scientist for 10 years. I have built over 100k models using Python, Databricks, and DataRobot. I have never seen a DL model outcompete all the other algorithms. Granted, I am largely working with structured business data, but nonetheless DL isn't really competitive.

−4

VirtualHat t1_j9lmkg1 wrote

In my experience DNNs only help with structured data (audio, video, images etc.). I once had a large (~10M datapoints) tabular dataset and found that simply taking a random 2K subset and fitting an SVM gave the best results. I think this is usually the case, but people still want DNNs for some reason. If it were a vision problem, then, of course, it'd be the other way around.

3