Submitted by GraciousReformer t3_118pof6 in MachineLearning
activatedgeek t1_j9jvj8h wrote
For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.
Flexibility ensures that many functions “can” be approximated in principle. That’s the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. This is not unique to DL. Deep Random Forests, Fourier Bases, Polynomial Bases, and Gaussian processes are all universal function approximators (with some extra technical details).
The part unique to DL is that somehow its inductive biases match some complex structured problems, including vision and language, and that is what makes it generalize well. Inductive bias is a loosely defined term. I can provide examples and references.
CNNs provide the inductive bias to prefer functions that handle translation equivariance (only approximately, because of pooling layers). https://arxiv.org/abs/1806.01261
Graph neural networks provide a relational inductive bias. https://arxiv.org/abs/1806.01261
Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity. https://arxiv.org/abs/1805.08522
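To make the CNN example concrete, here is a minimal sketch of translation equivariance in plain Python. Everything in it is an assumption for illustration: the 1D signal, the hand-picked `kernel` (not a learned filter), circular padding, and the helper names `circular_conv`, `shift`, and `max_pool`. It checks that convolving then shifting matches shifting then convolving, and that a stride-2 max pool does not preserve this for a one-position shift.

```python
def circular_conv(signal, kernel):
    """Cross-correlate a 1D signal with a kernel, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically shift a signal k positions to the right."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def max_pool(signal, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


signal = [0, 1, 3, 2, 0, 0, 5, 1]   # toy input, purely illustrative
kernel = [1, -2, 1]                 # hypothetical filter, not a learned one

# Equivariance: the order of "shift" and "convolve" does not matter.
conv_then_shift = shift(circular_conv(signal, kernel), 3)
shift_then_conv = circular_conv(shift(signal, 3), kernel)
is_equivariant = conv_then_shift == shift_then_conv

# Pooling with stride 2: shifting the input by one position changes which
# values share a window, so the pooled output is not just a shifted copy.
pooled = max_pool(signal)
pooled_after_shift = max_pool(shift(signal, 1))
pooling_preserves_values = sorted(pooled) == sorted(pooled_after_shift)

print(is_equivariant, pooling_preserves_values)
```

This is only a toy: real CNNs use 2D filters, zero padding, and learned weights, so the property holds approximately rather than exactly.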
SodomizedPanda t1_j9jyhem wrote
And somehow, the best answer is at the bottom of the thread..
A small addition: recent research suggests that the implicit bias in DNNs that helps generalization lies not only in the structure of the network but also in the learning algorithm (Adam, SGD, ...). https://francisbach.com/rethinking-sgd-noise/ https://francisbach.com/implicit-bias-sgd/
red75prime t1_j9k0i84 wrote
Does in-context learning suggest that inductive biases could also be extracted from training data?
activatedgeek t1_j9k4z4o wrote
Very much indeed. See https://arxiv.org/abs/2205.05055
activatedgeek t1_j9k58ev wrote
Not only the dataset; the Transformer architecture itself seems to be amenable to in-context learning. See https://arxiv.org/abs/2209.11895
KingRandomGuy t1_j9k363j wrote
> CNNs provide the inductive bias to prefer functions that handle translation equivariance
There are some interesting bodies of work on inductive biases in CNNs, such as "Making Convolutional Networks Shift-Invariant Again". Really interesting stuff!
hpstring t1_j9kf6te wrote
This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.
-vertigo-- t1_j9kih7k wrote
hmm for some reason the arxiv links are giving 403 forbidden
CO2mania t1_j9r6990 wrote
Save the message.
sanman t1_ja3kkkp wrote
first 2 links are the same - do you have the one for CNNs inductive bias?
GraciousReformer OP t1_j9ljjqu wrote
>inductive biases
Then why does DL have inductive biases and others do not?
activatedgeek t1_j9lu7q7 wrote
All model classes have inductive biases, e.g. random forests have the inductive bias of producing axis-aligned region splits. But clearly, that inductive bias is not good enough for image classification, because much of the information in the pixels is spatially correlated, and axis-aligned regions cannot capture that as well as specialized neural networks can under the same budget. By budget, I mean things like training time, model capacity, etc.
If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.
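A rough sketch of what "axis-aligned" means in practice, under stated assumptions: the toy grid of points, the diagonal labelling rule `x > y`, and the helpers `accuracy` and `best_stump_accuracy` are all made up for illustration and have nothing to do with images specifically. A single axis-aligned split cannot separate a diagonal boundary, while one oblique rule can.

```python
# Toy dataset: integer grid points labelled by which side of the diagonal they fall on.
points = [(x, y) for x in range(10) for y in range(10) if x != y]
labels = [x > y for x, y in points]


def accuracy(predict):
    """Fraction of points for which predict(point) matches the label."""
    correct = sum(predict(p) == label for p, label in zip(points, labels))
    return correct / len(points)


def best_stump_accuracy():
    """Try every single axis-aligned split (one feature, one threshold, either polarity)."""
    best = 0.0
    for axis in (0, 1):
        for threshold in range(10):
            for polarity in (True, False):
                acc = accuracy(lambda p: (p[axis] > threshold) == polarity)
                best = max(best, acc)
    return best


stump_acc = best_stump_accuracy()
oblique_acc = accuracy(lambda p: p[0] - p[1] > 0)  # one split on a combined feature

print(stump_acc, oblique_acc)
```

A deeper tree can approximate the diagonal with a staircase of axis-aligned splits, which is the "same budget" point: the solution is expressible, but it costs more splits and more data.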
currentscurrents t1_j9n3o7u wrote
Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.
GraciousReformer OP t1_j9lwe7i wrote
Still, why is it that DL has better inductive biases than others?
activatedgeek t1_j9lz6ib wrote
I literally gave an example of how (C)NNs have better inductive bias than random forests for images.
You need to ask more precise questions than just a "why".
GraciousReformer OP t1_j9m1o15 wrote
So it is like an ability to capture "correlations" that random forests cannot capture.
currentscurrents t1_j9n8in9 wrote
In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.
A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide if an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer, but it doesn't work very well.
On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images or that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.
A CNN also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
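A small sketch of the contrast between nested ifs on fixed pixels and one filter slid over every position. Both functions are hand-written and hypothetical (`tree_has_edge`, `sliding_has_edge`), the "image" is a 1D row of six pixels, and nothing here is learned; it only illustrates why a position-specific rule has to be repeated for every location.

```python
def tree_has_edge(pixels):
    """Hand-written 'tree': nested ifs that look at fixed pixel indices only."""
    if pixels[1] < 0.5:
        if pixels[2] >= 0.5:
            return True   # dark-to-bright step between index 1 and index 2
        return False
    return False


def sliding_has_edge(pixels):
    """One dark-to-bright test reused at every position, like a shared filter."""
    return any(pixels[i] < 0.5 <= pixels[i + 1] for i in range(len(pixels) - 1))


edge_at_2 = [0, 0, 1, 1, 1, 1]   # step where the tree expects it
edge_at_3 = [0, 0, 0, 1, 1, 1]   # same step, moved one pixel to the right

tree_results = (tree_has_edge(edge_at_2), tree_has_edge(edge_at_3))
sliding_results = (sliding_has_edge(edge_at_2), sliding_has_edge(edge_at_3))

print(tree_results, sliding_results)
```

To catch the moved edge, the tree would need another branch for that position, and another for each further position; the sliding version reuses the same test everywhere.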
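One way to see the hierarchy bias is to look at how much of the input each layer can "see". The sketch below is an assumption-laden toy: a fixed 3-wide summing filter (`box_filter`) stands in for a learned layer, with zero padding and no nonlinearity. It pushes a single bright pixel through stacked layers and counts how far its influence spreads, which by symmetry is how many input pixels feed one output unit.

```python
def box_filter(signal):
    """Sum each position with its two neighbours (zero padding at the edges)."""
    padded = [0] + signal + [0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(signal))]


def influence_width(num_layers, length=15):
    """Number of output positions touched by one input pixel after num_layers filters."""
    out = [0] * length
    out[length // 2] = 1            # a single bright pixel in the middle
    for _ in range(num_layers):
        out = box_filter(out)
    return sum(1 for value in out if value != 0)


widths = [influence_width(k) for k in (1, 2, 3)]
print(widths)
```

Each extra 3-wide layer widens the window by two pixels, so deeper layers combine information from larger regions built out of the smaller regions below them.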
GraciousReformer OP t1_j9oeeo3 wrote
In what situations is the bias towards hierarchy not helpful?