
AlmightySnoo t1_j3fa93g wrote

Excerpt from pages 8 and 9:

>Unfortunately, the formal infinite-width limit, n -> ∞, leads to a poor model of deep neural networks: not only is infinite width an unphysical property for a network to possess, but the resulting trained distribution also leads to a mismatch between theoretical description and practical observation for networks of more than one layer. In particular, it’s empirically known that the distribution over such trained networks does depend on the properties of the learning algorithm used to train them. Additionally, we will show in detail that such infinite-width networks cannot learn representations of their inputs: for any input x, its transformations in the hidden layers will remain unchanged from initialization, leading to random representations and thus severely restricting the class of functions that such networks are capable of learning. Since nontrivial representation learning is an empirically demonstrated essential property of multilayer networks, this really underscores the breakdown of the correspondence between theory and reality in this strict infinite-width limit.
>
>From the theoretical perspective, the problem with this limit is the washing out of the fine details at each neuron due to the consideration of an infinite number of incoming signals. In particular, such an infinite accumulation completely eliminates the subtle correlations between neurons that get amplified over the course of training for representation learning.
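The frozen-representation claim in the excerpt can be checked numerically. Here is a minimal sketch (my own toy setup, not from the book): a width-n two-layer network in the standard 1/√n parametrization, where we measure how much the hidden preactivations move after a single SGD step. As n grows, the per-neuron shift shrinks like 1/√n, so in the strict infinite-width limit the representations never leave their initialization.

```python
import numpy as np

def feature_shift(n, lr=1.0, seed=0):
    """RMS per-neuron movement of hidden preactivations after one SGD step
    on a width-n two-layer tanh network (standard 1/sqrt(n) scaling)."""
    rng = np.random.default_rng(seed)
    d = 10
    x = rng.standard_normal(d)             # a fixed input
    W = rng.standard_normal((n, d))        # hidden layer, N(0, 1) entries
    v = rng.standard_normal(n)             # readout layer
    h = W @ x / np.sqrt(d)                 # hidden preactivations
    phi = np.tanh(h)
    y = v @ phi / np.sqrt(n)               # network output
    err = y - 1.0                          # squared-loss gradient, target 1
    # dL/dW for loss 0.5 * (y - 1)^2
    dW = lr * err * np.outer(v * (1 - phi**2) / np.sqrt(n), x / np.sqrt(d))
    h_new = (W - dW) @ x / np.sqrt(d)      # preactivations after the step
    return np.sqrt(np.mean((h_new - h) ** 2))

for n in [100, 10_000, 1_000_000]:
    print(n, feature_shift(n))             # shrinks roughly like 1/sqrt(n)
```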


rikkajounin t1_j3g9eki wrote

I’m only marginally familiar with Greg’s work (I’ve skimmed some papers and listened to his talks), but I believe that both criticisms are addressed.

  1. Tensor Programs considers discrete-time (stochastic) learning algorithms stopped after T steps, in place of the continuous-time gradient flow run until convergence that is used in the standard neural tangent kernel literature. Hence I think the infinite-width limit varies depending on the algorithm, and also on the order of the minibatches.

  2. They identify infinite-width limits where representation learning happens and limits where it doesn’t. The behaviour changes depending on how the parameters of the weight distributions of the input, output, and middle layers, together with the learning rate, are scaled with width. In particular, they propose using a limit in which representations (they call them features) are maximally learned. In contrast, in the neural tangent kernel limit the representations stay fixed.
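Point 2 can be illustrated with a toy experiment (my own sketch in the spirit of the maximal-update idea, not Yang's actual Tensor Programs construction): keeping the same two-layer tanh network but changing how the output multiplier and learning rate scale with width n flips the limit from frozen features (NTK-style, feature movement shrinking like 1/√n) to features that move by Θ(1) at every width.

```python
import numpy as np

def shift(n, param, seed=0):
    """RMS per-neuron feature movement after one SGD step, under two
    width-scaling rules for the output multiplier and learning rate."""
    rng = np.random.default_rng(seed)
    d = 10
    x = rng.standard_normal(d)
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    h = W @ x / np.sqrt(d)
    phi = np.tanh(h)
    if param == "ntk":                     # output ~ 1/sqrt(n), O(1) lr
        out_mult, lr = 1 / np.sqrt(n), 1.0
    else:                                  # "mup"-style: output ~ 1/n, lr ~ n
        out_mult, lr = 1 / n, float(n)
    y = out_mult * (v @ phi)
    err = y - 1.0                          # squared loss, target 1
    dW = lr * err * np.outer(out_mult * v * (1 - phi**2), x / np.sqrt(d))
    h_new = (W - dW) @ x / np.sqrt(d)
    return np.sqrt(np.mean((h_new - h) ** 2))

for n in [100, 10_000]:
    print(n, shift(n, "ntk"), shift(n, "mup"))
```

In the "ntk" column the movement vanishes with width, while in the "mup"-style column it stays of the same order, which is the distinction between the kernel limit and the feature-learning limit described above.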
