pia322 t1_iqtfrt3 wrote
I really like this question. I agree with you that a NN is an arbitrary function approximator, and it could easily implicitly learn the attention function.
I personally embrace the empiricism. We try to make theoretical justifications, but in reality, attention/transformers just happen to work better, and no one really knows why. One could argue that 95% of deep learning research follows this empirical methodology, and the "theory" is an afterthought to make the papers sound nicer.
Why is ResNet better than VGG? Or ViT better than ResNet? They're all arbitrary function approximators, so they should all be able to perform identically well. But empirically, that's not the case.
029187 OP t1_iqth0xk wrote
I'm kinda scared by the idea that we get all the way to strong AI and still don't understand why it works.
ResourceResearch t1_iqunzdz wrote
Well, at least for ResNet there is a technical reason for its success: skip connections mitigate vanishing gradients via the chain rule of differentiation.
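To make that concrete, here is a minimal sketch (a hypothetical PyTorch-style block, not taken from the ResNet paper itself): with y = x + F(x), the chain rule gives dy/dx = I + dF/dx, so the identity term keeps the gradient from collapsing even when dF/dx is tiny.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity path lets gradients reach earlier layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # By the chain rule, dy/dx = I + dF/dx, so even if dF/dx is
        # nearly zero, the gradient flowing backwards is roughly the identity.
        return x + self.f(x)

x = torch.randn(4, 64, requires_grad=True)
y = ResidualBlock(64)(x).sum()
y.backward()
print(x.grad.abs().mean())  # stays well away from zero
```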
DeepNonseNse t1_iqvgzsk wrote
But then again, that just leads to another question: why are deep(er) architectures better in the first place?
Desperate-Whereas50 t1_iqwzlgc wrote
I am not a transformer expert, so maybe this is a stupid question, but is this also true for transformer-based architectures? For example, BERT uses 12/24 transformer blocks. That doesn't sound as deep as, for example, a ResNet-256.
ResourceResearch t1_iro8zof wrote
Afaik it is not clear. In my personal experience, the number of parameters matters more than the depth, i.e. a small number of wider layers does the same job as a large number of narrower layers.
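As a rough back-of-the-envelope illustration of that point (the layer widths below are made up, not taken from any specific model): a shallow-and-wide MLP and a deep-and-narrow one can end up with essentially the same parameter count.

```python
def mlp_params(widths):
    """Parameters of a fully-connected net with the given layer widths
    (a weight matrix plus a bias vector for each consecutive pair)."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

# Illustrative configurations, both mapping 512 features to 512 features.
shallow_wide = [512] + [2048] * 2 + [512]  # 2 hidden layers of width 2048
deep_narrow  = [512] + [880] * 8 + [512]   # 8 hidden layers of width 880

print(mlp_params(shallow_wide))  # ~6.30M parameters
print(mlp_params(deep_narrow))   # ~6.33M parameters
```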
Consider this paper for empirical insights into large models: https://arxiv.org/pdf/2001.08361.pdf
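For a feel of what that paper reports (the power-law form is theirs, but the constants below are illustrative placeholders rather than the fitted values): test loss falls off roughly as a power law in the parameter count N, L(N) = (N_c / N)^alpha.

```python
# Power-law scaling of loss with parameter count N, in the spirit of
# Kaplan et al. (2020), "Scaling Laws for Neural Language Models".
# N_c and alpha_N here are illustrative placeholders, not the paper's fits.
N_c, alpha_N = 1e13, 0.08

def predicted_loss(n_params):
    return (N_c / n_params) ** alpha_N

for n in [1e6, 1e8, 1e10]:
    print(f"N={n:.0e}  predicted loss ~ {predicted_loss(n):.2f}")
```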