Submitted by IamTimNguyen t3_105v7el in MachineLearning
cdsmith t1_j3heb7r wrote
I'm not at all up to speed on this, but I followed most of the presentation. I was left with this question, though.
Up to the latter part of the video, I had the impression that this was building a rigorous theory of what happens if you forget to train your neural network. That is, the assumption was that all the weights were independently sampled from Gaussian distributions. The "master theorem" as stated here definitely assumed that all the weights in the network were random. But then, about 2.5 hours in, they are suddenly talking about the behavior of the network under training, and as far as I can tell, there's no discussion at all of how the theorems they have painstakingly established for random weights tell you anything about learning behavior.
Did I miss something, or was this just left out of the video? They do seem to have switched by this point from covering proofs to just stating results... which is fine; the video is long enough already. But I'd love to have some intuition for how this model treats training, as opposed to inference with random weights.
IamTimNguyen OP t1_j3hj6ef wrote
Great question, and you're right, we did not cover this (alas, we could not cover everything even with 3 hours). You can unroll NN training as a sequence of gradient updates. Each gradient update adds a nonlinear function of the weights at initialization (e.g. the first update is w -> w - grad_w(L), where w is randomly initialized). Unrolling the entire training procedure yields a large composition of such nonlinear functions of the weights at initialization. The Master Theorem, from a bird's eye view, is precisely the tool for handling such a computation graph (all such unrolls are themselves tensor programs). This is how Greg's work covers NN training.
Note: This is just a cartoon picture, of course. In the unrolled computation graph, the updated weights are highly correlated (weight updates in a given layer depend on weights from all layers), and one has to analyze such a graph carefully.
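To make the unrolling concrete, here is a minimal toy sketch (mine, not from the video or Greg's code), assuming a one-hidden-layer tanh network trained by plain SGD on a squared loss; the width, learning rate, and step count are purely illustrative:

```python
# Minimal sketch of "unrolling" SGD: after a few explicit update steps, the
# trained weights and outputs are deterministic nonlinear functions of the
# *randomly initialized* weights -- the kind of computation graph (tensor
# program) the Master Theorem is designed to analyze as width -> infinity.
import numpy as np

width, lr, n_steps = 1024, 0.1, 3   # illustrative values only
rng = np.random.default_rng(0)

# Gaussian initialization (standard 1/sqrt(fan_in) scaling).
W1 = rng.normal(size=(width, 1))
W2 = rng.normal(size=(1, width)) / np.sqrt(width)

x, y = np.array([[1.0]]), np.array([[0.5]])   # one toy training example

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)      # hidden activations
    return W2 @ h, h

for _ in range(n_steps):
    out, h = forward(W1, W2, x)
    err = out - y                                   # dL/d(out) for L = 0.5*(out - y)^2
    grad_W2 = err @ h.T                             # backprop through the output layer
    grad_W1 = (W2.T @ err) * (1 - h**2) @ x.T       # backprop through tanh
    # Each update is itself a nonlinear function of the current (hence initial) weights.
    W2 = W2 - lr * grad_W2
    W1 = W1 - lr * grad_W1

print(forward(W1, W2, x)[0])   # output of the "trained" network, as a function of the initial Gaussians
```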
Update: Actually, Greg did discuss this unrolling of the computation graph for NN training. https://www.youtube.com/watch?v=1aXOXHA7Jcw&t=8540s