rikkajounin t1_j3g9eki wrote
Reply to comment by AlmightySnoo in [R] Greg Yang's work on a rigorous mathematical theory for neural networks by IamTimNguyen
I’m only marginally familiar with Greg’s work (I’ve skimmed some papers and listened to his talks), but I believe that both criticisms are addressed.
- Tensor Programs consider discrete-time (stochastic) learning algorithms stopped after T steps, in place of the continuous-time gradient flow run until convergence that is used in the standard neural tangent kernel literature. Hence I think the infinite-width limit varies depending on the algorithm and also on the order of the minibatches.
- They identify infinite-width limits where representation learning happens and limits where it doesn’t. The behaviour changes depending on how the parameters of the weight distributions of the input, output, and middle layers, as well as the learning rate, are scaled with the width (a rough sketch of this parametrization is below). In particular, they propose using the limit in which representations (they call them features) are maximally learned. In contrast, in the neural tangent kernel limit the representations stay fixed.
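For reference, here is a minimal sketch of the abc-parametrization as I understand it from the Tensor Programs papers; the symbols a_ℓ, b_ℓ, c and η_n follow my recollection of Yang & Hu, "Feature Learning in Infinite-Width Neural Networks", and the precise exponent choices for each limit are in that paper:

```latex
% Sketch (my paraphrase): an L-layer network of width n has its layer-l
% weights and SGD learning rate scaled with n via exponents (a_l, b_l, c).
\[
  W^{\ell} = n^{-a_{\ell}}\, w^{\ell},
  \qquad
  w^{\ell}_{ij} \sim \mathcal{N}\!\left(0,\; n^{-2 b_{\ell}}\right),
  \qquad
  \eta_{n} = \eta\, n^{-c}.
\]
% Here \eta_n is the learning rate used at width n for the discrete-time
% SGD updates; the exponents may differ across input, middle, and output layers.
```

Different choices of (a_ℓ, b_ℓ, c) give different infinite-width limits of the same discrete-time SGD iterates; the neural tangent kernel limit (features frozen at initialization) and the maximal-update limit (maximal feature learning) sit at different points of this family.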
rikkajounin t1_isxk23o wrote
Reply to [P] Stochastic Differentiable Programming: Unbiased Automatic Differentiation for Discrete Stochastic Programs (such as particle filters, agent-based models, and more!) by ChrisRackauckas
Very interesting work!
What about doing the same with reverse-mode AD? Are there any issues in that case?
rikkajounin t1_j4umb8q wrote
Reply to [D] Are there any results on convergence guarantees when optimizing NNs? by Dartagnjan
The following work shows that with sufficiently large width (the overparameterized regime) you get convergence to a global minimum in polynomial time, at a rate that gets worse (but only polynomially) with the depth of the network:
A Convergence Theory for Deep Learning via Over-Parameterization
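Roughly, the shape of the guarantee in that paper (my paraphrase of Allen-Zhu, Li & Song; the exact polynomial exponents, constants, and data assumptions are in the paper) is:

```latex
% m = width, L = depth, n = number of training samples,
% \theta_T = parameters after T gradient-descent steps (my notation).
\[
  m \ge \mathrm{poly}(n, L)
  \quad\Longrightarrow\quad
  \mathcal{L}(\theta_T) \le \varepsilon
  \quad\mathrm{within}\quad
  T = O\!\left(\mathrm{poly}(n, L)\,\log(1/\varepsilon)\right)
  \;\mathrm{steps}.
\]
```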