Submitted by IamTimNguyen t3_105v7el in MachineLearning

Greg Yang is a mathematician and AI researcher at Microsoft Research who for the past several years has done incredibly original theoretical work in the understanding of large artificial neural networks. His work currently spans the following five papers:

Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes: https://arxiv.org/abs/1910.12478
Tensor Programs II: Neural Tangent Kernel for Any Architecture: https://arxiv.org/abs/2006.14548
Tensor Programs III: Neural Matrix Laws: https://arxiv.org/abs/2009.10685
Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks: https://proceedings.mlr.press/v139/yang21c.html
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: https://arxiv.org/abs/2203.03466

In our whiteboard conversation, we get a sample of Greg's work, which goes under the name "Tensor Programs". The route chosen to compress Tensor Programs into the scope of a conversational video is to place its main concepts under the umbrella of one larger, central, and time-tested idea: that of taking a large N limit. This occurs most famously in the Law of Large Numbers and the Central Limit Theorem, which then play a fundamental role in the branch of mathematics known as Random Matrix Theory (RMT). We review this foundational material and then show how Tensor Programs (TP) generalizes this classical work, offering new proofs of RMT.

We conclude with the applications of Tensor Programs to a (rare!) rigorous theory of neural networks. This includes applications to a rigorous proof for the existence of the Neural Network Gaussian Process and Neural Tangent Kernel for a general class of architectures, the existence of infinite-width feature learning limits, and the muP parameterization enabling hyperparameter transfer from smaller to larger networks.

​

https://preview.redd.it/av3ovotcunaa1.png?width=1280&format=png&auto=webp&v=enabled&s=a7aa946741036e5c39d990f070a01a1e72202265

https://preview.redd.it/hh9q6wqdunaa1.png?width=1200&format=png&auto=webp&v=enabled&s=939a9dbba9e46a928cef5d1d7bbe75819873ca7f

Youtube: https://youtu.be/1aXOXHA7Jcw

Apple Podcasts: https://podcasts.apple.com/us/podcast/the-cartesian-cafe/id1637353704

Spotify: https://open.spotify.com/show/1X5asAByNhNr996ZsGGICG

RSS: https://feed.podbean.com/cartesiancafe/feed.xml

401

Comments

You must log in or register to comment.

IamTimNguyen OP t1_j3cybcu wrote

Part I. Introduction

00:00:00 : Biography

00:02:36 : Harvard hiatus 1: Becoming a DJ

00:07:40 : I really want to make AGI happen (back in 2012)

00:09:00 : Harvard math applicants and culture

00:17:33 : Harvard hiatus 2: Math autodidact

00:21:51 : Friendship with Shing-Tung Yau

00:24:06 : Landing a job at Microsoft Research: Two Fields Medalists are all you need

00:26:13 : Technical intro: The Big Picture

00:28:12 : Whiteboard outline

Part II. Classical Probability Theory

00:37:03 : Law of Large Numbers

00:45:23 : Tensor Programs Preview

00:47:25 : Central Limit Theorem

00:56:55 : Proof of CLT: Moment method

01:02:00 : Moment method explicit computations

Part III. Random Matrix Theory

01:12:45 : Setup

01:16:55 : Moment method for RMT

1:21:21 : Wigner semicircle law

Part IV. Tensor Programs

1:31:04 : Segue using RMT

1:44:22 : TP punchline for RMT

1:46:22 : The Master Theorem (the key result of TP)

1:55:02 : Corollary: Reproof of RMT results

1:56:52 : General definition of a tensor program

Part V. Neural Networks and Machine Learning

2:09:09 : Feed forward neural network (3 layers) example

2:19:16 : Neural network Gaussian Process

2:23:59 : Many large N limits for neural networks

2:27:24 : abc parametrizations (Note: "a" is absorbed into "c" here): variance and learning rate scalings

2:36:54 : Geometry of space of abc parametrizations

2:39:50 : Kernel regime

2:41:35 : Neural tangent kernel

2:43:40 : (No) feature learning

2:48:42 : Maximal feature learning

2:52:33 : Current problems with deep learning

2:55:01 : Hyperparameter transfer (muP)

3:00:31 : Wrap up

24

AlmightySnoo t1_j3d6wpo wrote

Haven't watched yet, but does he address criticism by e.g. The Principles of Deep Learning Theory, regarding the infinite width limit?

59

ThatInternetGuy t1_j3ejkr3 wrote

To be honest, even though I've coded in many ML code repos and I did well in college math, but this video outline looks like an alien language to me. Tangent kernel, kernel regime (is AI getting into the politics?), punchline for Matrix Theory (who's trying to get a date here?), etc.

−15

ThatInternetGuy t1_j3emjfs wrote

I was just saying that ML researchers are using terms that are way too technical to infer meaning, or using a common word such as punchline to mean something else entirely. What does punchline have anything to do with ML?

−15

ReginaldIII t1_j3epizn wrote

No need to downvote, it was an honest question not an attack. Have you studied the literature and background mathematics of this area much?

Regime is a well established term in mathematics and many other fields, and one example of a "regime" (a domain under rules or constrains) is what you are likely familiar with as a political regime.

With respect to "punchline", I'm going to assume you didn't look at the video at the timestamp listed? Here it is https://youtu.be/1aXOXHA7Jcw?t=6105 All he is saying is that, after a few minutes long tangent talking about something the "punchline" is him circling back around to the point he was trying to make.

It isn't a literal haha punchline, it's not a mathematical term, the punchline comes at the end of a joke, a joke often takes you on a journey before circling back to some type of point. He used the word to mean that here too.

Timothy Nguyen, OP of this post and the host of the video, made a light hearted chapter title within a long video based on a term that Greg Yang used on his whiteboard.

18

madrury83 t1_j3epnqt wrote

Repurposing common words to have technical meanings is a basic trope in mathematics: kernel, neuron, limit, derivative, spectrum, manifold, atlas, chart, model, group, ring, ideal, field, topology, open, closed, compact, exotic, neighborhood, domain, immerse, embed, fibre, bundle, flow, section, measure, category, scheme, torsion, ...

... and typing Natural Transformation into google shows you skinny dudes that got buff.

14

cdsmith t1_j3ev3je wrote

This is definitely a theory presentation, though it does end with some applications to hyperparameter transfer when scaling model size. But if your main experience with ML is building models and applications, I'm not surprised it looks unfamiliar.

That being said, though, give it a chance if you're interested. Some parts of the outline didn't look familiar to me either, but the video is well-made and stops to explain most of the background knowledge. And you can always gloss over the bits you don't understand.

1

trajo123 t1_j3ev6wz wrote

Can anyone ELI5? More specifically, what are the practical applications to Deep Learning problems?

6

cdsmith t1_j3evg7e wrote

Punchline is just sort of common vernacular for "here's where all the parts come together in a moment of realization". It's a metaphor to a joke, where you have all the setup, and then there's the moment when you "get it" and laugh.

4

zhumao t1_j3evw1h wrote

took a quick glance (https://arxiv.org/abs/1910.12478 and https://proceedings.mlr.press/v139/yang21c.html), a few theorems but where r the proofs? also

>This includes applications to a rigorous proof for the existence of the Neural Network Gaussian Process and Neural Tangent Kernel for a general class of architectures, the existence of infinite-width feature learning limits, and the muP parameterization enabling hyperparameter transfer from smaller to larger networks.

it is well-known that training NN is a NP-complete, also means locally optimal solution r not globally optimal in general, hence stick a pre-train sub-net into a bigger one may or may not perform better than training larger NN from scratch, proof by application/implementation r demonstrations or one-shot experiment at best, not proof, speaking from a mathematics POV

3

getsmartbsharp t1_j3exy4f wrote

Has anyone read the papers? Are they worth a read? The first one is a 73 page white paper…

15

binfin t1_j3f8n5x wrote

That the network gets some percentage accuracy on some testing set.

Editing for additional context… the poly time verifier is testing the network and verifying that it gets at least some % accuracy. This is similar to the poly time verifier for the traveling salesman problem, where you verify that a solution exists with some total travel distance. To find the shortest distance in a TSP instance, or the highest accuracy in a neural network you just need to do a binary search on your accuracy cutoff.

The complete poly-time solution for the NDTM is to test every bit combination of your parameters, each time checking that the model has some % accuracy. To find the highest % accuracy you just perform a binary search of possible accuracies (which of course grows logarithmically with the floating point precision of your accuracy). This whole process is poly time on an NDTM, and therefor the problem is contained in NP. You can further prove that finding optimal neural networks is NP complete by having the neural network be trained to solve 3-sat problems, where number of neural network parameters is a poly scale of the 3-sat N.

Because optimizing an NN (with respect to some testing set) is solvable in poly-time by an NDTM, and because you can reduce an NP-complete problem to optimizing a neural network, optimizing a neural network is NP-complete.

12

Firm-Hard-Hand t1_j3f9thl wrote

I have not much followed this paper but I did get a chance to read a paper by Ansari on Gaussian Processes which was implemented in STAN, the bayesian inference framework . GP's are incredibly flexible and they perform very well on the test data.

10

AlmightySnoo t1_j3fa93g wrote

Excerpt from pages 8 and 9:

>Unfortunately, the formal infinite-width limit, n -> ∞, leads to a poor model of deep neural networks: not only is infinite width an unphysical property for a network to possess, but the resulting trained distribution also leads to a mismatch between theoretical description and practical observation for networks of more than one layer. In particular, it’s empirically known that the distribution over such trained networks does depend on the properties of the learning algorithm used to train them. Additionally, we will show in detail that such infinite-width networks cannot learn representations of their inputs: for any input x, its transformations in the hidden layers will remain unchanged from initialization, leading to random representations and thus severely restricting the class of functions that such networks are capable of learning. Since nontrivial representation learning is an empirically demonstrated essential property of multilayer networks, this really underscores the breakdown of the correspondence between theory and reality in this strict infinite-width limit.
>
>From the theoretical perspective, the problem with this limit is the washing out
of the fine details at each neuron due to the consideration of an infinite number of incoming signals. In particular, such an infinite accumulation completely eliminates the subtle correlations between neurons that get amplified over the course of training for representation learning.

30

deepwank t1_j3faobn wrote

Thanks! I was wondering when neural networks were going to go beyond the current trial and error alchemy, and it looks like this is a big step forward for the mathematical foundations.

0

kastbort2021 t1_j3fea3l wrote

Or they could just be quite recent topics? NTK, for example, seems to have been introduced in 2018. If you're not actively reading ML research papers, you'll probably have a hard time getting exposed to those topics.

3

rikkajounin t1_j3g9eki wrote

I’m only marginally familiar with Greg’s work (skimmed some papers and listened to his talks) but i believe that both criticisms are addressed.

  1. Tensor programs consider discrete time (stochastic) learning algorithms stopped at T steps in place of continuous time gradient flow until convergence (the latter is used in standard neural tangent kernel literature), hence I think the infinite width limit varies depending on the algorithm and also the order of minibatches.

  2. They identify infinite width limits where representation learning happens and where it doesn’t. The behaviour changes by varying how to scale with width parameters of the weights distribution of the input, output, and middle layers and the learning rate. In particular they propose to use a limit where representation (they call them features) is maximally learned. In contrast in neural tangent kernel the representation stays fixed.

8

cdsmith t1_j3heb7r wrote

I'm not at all up to speed on this, but I followed most of the presentation. I was left with this question, though.

Up to the latter part of the video, I was left with the impression that this was building a rigorous theory of what happens if you forget to train your neural network. That is, the assumption was that all the weights were taken from independently sampled Gaussian distributions. The "master theorem" as stated here definitely assumed that all the weights in the network were random. But then suddenly about 2.5 hours in, they are talking about the behavior of the network under training, and as far as I can tell, there's no discussion at all of how the theorems they have painstakingly established for random weights tell you anything about learning behavior.

Did I miss something, or was this just left out of the video? They do seem to have switched by this point from covering proofs to just stating results... which is fine, the video is long enough already, but I'd love to have some intuition for how this model treats training, as opposed to inference with random weights.

3

IamTimNguyen OP t1_j3himnu wrote

Having spoken to Greg (who may or may not be chiming in), it appears that the authors of PDLT were only considering one kind of infinite width limit (as evidenced by your use of the word "the"). But Greg considers a general family of them. The NTK limit indeed has no feature learning, whereas Greg analyzes entire families, some that do have feature learning, in particular, one that has maximal feature learning. So there is no contradiction with respect to past works.

6

IamTimNguyen OP t1_j3hj6ef wrote

Great question and you're right we did not cover this (alas, we could not cover everything even with 3 hours). You can unroll NN training as a sequence of gradient updates. The gradient updates involve nonlinear additions to the set of weights at initialization (e.g. the first update is w -> w - grad_w(L), where w is randomly initialized). Unrolling the entire graph is a large composition of such nonlinear functions of the weights at initialization. The Master Theorem, from a bird's eye view, is precisely the tool to handle such a computation graph (all such unrolls are themselves tensor programs). This is how Greg's work covers NN training.

Note: This is just a cartoon picture of course. The updated weights are now highly correlated in the unrolled computation graph (weight updates in a given layer depend on weights from all layers), and one has to do a careful analysis of such a graph.

Update: Actually, Greg did discuss this unrolling of the computation graph for NN training. https://www.youtube.com/watch?v=1aXOXHA7Jcw&t=8540s

2

trajo123 t1_j3i3dy1 wrote

I don't mean applications _of Deep Learning_, I mean what are the applications of this specific theory to real life Deep Learning problems. Can this theory help a Deep Learning practitioner, or is it applied only to proving some abstract bounds on some theoretical, abstracted and simplified neural nets?

0

trajo123 t1_j3ly2r2 wrote

Sorry for the ignorant question, but are there any practical applications of this theory?

2

yolky t1_j3oa0oc wrote

The work is probably not useful for most DL practitioners (yet), but has lots of applications for deep learning research, even for people outside of deep learning theory. As an example consider the work on the Neural Tangent Kernel, which considers one infinite-width limit of neural networks. While the work itself originally was just trying to understand wide fully connected networks, its impact now in 2023 is immense.

A lot of new algorithms for things like active learning, meta learning, etc. use NTK theory as motivation for their development. You could pretty much search "a neural tangent kernel perspective on ____" on google and get a ton of results, a mixture of applied algorithms and theoretical analyses.

So this is just one example of how understanding DL theory leads to better algorithms. One part Greg Yang's work could be considered generalizing NTK theory to different infinite width limits. At the moment, there doesn't seem to be too many applications of his work, but of course the same would have been said about the NTK in 2018. His "Tensor Programs V" paper shows that one application of his work is for choosing hyperparameters for large neural networks using smaller ones as a proxy.

So TL;DR - there might not be practical applications yet, but there are potentially a lot!

4

eyeofthephysics t1_j4f2w85 wrote

>u/IamTimNguyen

Hi Tim, just to add on to your comment, Sho Yaida (one of the co-authors of PDLT) also wrote a paper on the various infinite width limits of neural nets, https://arxiv.org/abs/2210.04909. He was able to construct a family of infinite width limits and show that in some of them there is representation learning (and he also found agreement with Greg's existing work).

1