Submitted by aviisu t3_xzpc0w in MachineLearning

I'm currently learning about diffusion models and variational autoencoders. I encountered some great videos (1, 2, and more; the first one in particular is quite nice) that do an impressive job of breaking down the math and (some of) the intuition. I think I understand 50%-75% of them for now, and with some practice it will become clearer.

The common theme I encountered (not just in diffusion models) is that something is intractable. Not just one thing, but a series of intractable expressions, one after another. The paper's authors then apply a bunch of tricks, mathematical gymnastics, and assumptions, or straight-up eliminate some terms because they don't matter (which, TBH, can be justified in its own way). And then, for some reason, all the terms magically cancel each other out, leaving us two lines of nice, beautiful equations.

My question would be:

  • What kind of thought process do researchers have in mind when choosing what to do with their intractable equations? For me (someone not from a math background), it feels like looking for a needle in a haystack: there are many possible directions to go, and it takes more than one step before some terms finally cancel each other out.

  • How can they be confident that what they are doing will be fruitful in the end, and that they will arrive at a satisfying and practical result? I've heard that sometimes people just do the implementation first and work out the math later to justify the result, but I don't think that's always how it's done.

I would like to hear people share their experience with this. And if anyone has resources for learning this kind of mathematics or mental framework, please share.

(I understand that I still have a long way to go to catch up, especially on the math side. My question may make more sense to me in the future, with enough knowledge and experience.)

154

Comments


dpkingma t1_irooyas wrote

Author of VAE/diffusion model papers here.

When a paper introduces something as intractable, it typically means that it can't be computed exactly in a computationally feasible way, which motivates the use of approximations. The challenge is then to come up with an approximation that has nice properties (e.g. computationally cheap, unbiased, a bound, etc.).
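A concrete instance of that pattern, using the standard VAE objective as an illustration (the notation below is the usual textbook form, not anything specific to this comment):

```latex
% Intractable: the marginal likelihood of a latent-variable model
% requires integrating over all latent codes z.
\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz

% Tractable surrogate: the evidence lower bound (ELBO), introduced via an
% approximate posterior q_\phi(z|x). It is cheap to estimate by Monte Carlo
% and is a lower bound on the quantity we actually care about.
\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
                   - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```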

As another commenter also wrote, how an idea is presented in a paper is typically different from how the author(s) came up with it. At the start of a research project you often begin with a concrete problem and some vague intuitions about how to solve it. Then, through an iterative process, you refine your ideas. Often it turns out that what you wanted to do is impossible. Often it turns out that a solution already exists, so there's no need to publish. The process requires lots of backtracking and scrapping dead ends, and it often takes time before an idea finally 'clicks' in your mind (which is a great feeling). Especially when you start out in research, nine out of ten ideas turn out to be either already solved or (seemingly) unsolvable, which can be very frustrating. And since negative results are typically unpublishable or uninteresting, you don't read about them. The flip side is that with more experience you build a bigger mental toolbox, making it easier to spot dead ends and see opportunities. The best way to get there is to read ML books (or other media) that teach you the underlying math, and lots of practice.

154

elemintz t1_irqaz64 wrote

Thanks for taking the time Durk! Means a lot for the younger generation of researchers for whom the work by you and others can look like magic from time to time. :D

9

aviisu OP t1_irpq3vy wrote

Thanks a lot. I will keep that as a valuable lesson. 🙏

6

picardythird t1_irnu71n wrote

Let's be honest, 99.99% of the time the math is a post-hoc, hand-wavy description of the architecture that happened to work after dozens or hundreds of iterations of architecture search. End up with an autoencoder? Great, you have a latent space. Did you perform normalizations or add noise? Fantastic, you can make (reductive and unjustified) claims about probabilistic interpretations using simple distributions. Know some linear algebra? Awesome, you can take pretty much any vaguely accurate observation about the architecture and use it to support whatever interpretation you want.

There is a vanishingly small (but admittedly nonzero) amount of research being done that starts with a theoretical formulation and designs a hypothesis test around that formulation. Much, much, much more common is that researchers start with a high-level (but ultimately heuristic) systems approach to architecture selection, and then twiddle and tweak the hyperparameters until the numbers look good enough to publish. Then they might look back and conjure up some vague bullshit that involves math notation, and claim that as "theoretical justification," knowing that 99% of the time no one is going to read it, much less check their work. Bonus points if they make the notation intentionally confusing, so as to dissuade people from looking too closely. If you can mix in as many fonts, subscripts, superscripts, and set notation as possible, most people's eyes will glaze over and they will assume you're right.

It's sad, but it's the reality of the state of our field.

Edit: As for intractable equations, some people are going to answer with talk of variational inference and theoretical guarantees about lower bounds. This is nice and all, but the real answer is that "neural networks go brrr." Neural networks are used as function approximators, and except for some high level architectural descriptions, no one actually knows how or why neural networks learn the functions that they do (or even if they learn the intended functions at all). Anyone that tells you otherwise is selling you something.

Edit 2: Lol, it's always hilarious getting downvotes from people who can't stand being called out for their bad practices.

51

dpkingma t1_irojnp3 wrote

VAE author here.

The parent comment (picardythird's) is, in my experience, a false take on ML. VAEs and many other methods in ML are actually derived from first principles. Which means: start with an idea, do the math/derivations first (which can take considerable time) until you get them right, then implement later.

It's true that a lot of work in ML is different, e.g. hacking up new architectures, testing against benchmarks, and coming up with a theory later. It may be that the parent's experience is more along these lines. But that's not how VAEs (and, I think, diffusion models) were conceived, and certainly not 99.99% of the methods.

(I'll try to answer OP in a separate comment.)

37

picardythird t1_iroxxxd wrote

I'll admit that VAEs are one of the (IMO very few) methods that are motivated by first principles. The original VAE paper is actually one of my favorite papers in all of ML; it's well-structured, easy to read, and the steps leading from problem to solution are clear.

This is not at all the case for the vast majority of ML research. How many times have we seen "provably optimal" bounds on something that are then shown to, whoops, not actually be optimal a few years (or months! or weeks!) later? How many times have we seen "theoretical proofs" for why something works where it turns out that, whoops, the proofs were wrong (looking at you, Adam)? Perhaps most damningly, how many times have we seen a mathematical model or structure that seems to work well in a lab completely fail when exposed to real-world data that does not follow the extraordinarily convenient assumptions and simplifications that serve as its foundational underpinnings? This last point is extremely common in nonstandard fields such as continual learning, adversarial ML, and noisy-label learning.

Most certainly, some (perhaps even "much"... I hesitate to say "most" or "all") of the work that comes out of the top top top tier labs is motivated by theory before practice. However, outside of these extraordinary groups it is plainly evident that most methods you see in the wild are empirical, not theoretical.

12

dpkingma t1_irpaffs wrote

I agree with the (milder) claim that most methods in the wild are empirical.

With regards to Adam: I wouldn't say that this method was ad hoc. The method was motivated by the ideas in Sections 2 and 3 of the paper (update rule / initialization bias correction), which are correct. The convergence result in Section 4 should be ignored, but didn't play a role in the conception of the problem, and wasn't very relevant in practice anyway due to the convexity assumption.
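For readers who haven't seen those sections, here is a minimal sketch of the update rule with initialization bias correction; the function name and NumPy phrasing are illustrative, not the paper's notation or any particular library's implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, with bias correction for their zero initialization (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```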

11

scraper01 t1_irod27t wrote

Stable Diffusion's actual implementation doesn't make sense when compared with the paper's description.

5

master3243 t1_irq8yrx wrote

Can you provide an example?

Personally, I went through Ho et al. (2020)'s paper (equation by equation) and their code (line by line) back when it first came out and I remember that they both matched each other almost perfectly.
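As a rough illustration of how directly those equations map to code, here is a hedged sketch of the simplified DDPM training objective (predict the added noise, minimize MSE); `model` and the cumulative noise schedule `alpha_bar` are placeholder names, not the authors' actual code.

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(model, x0, alpha_bar):
    """Simplified DDPM objective: corrupt x0 at a random timestep and train
    the network to predict the noise that was added. `alpha_bar` has shape [T]."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # Gaussian noise
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))          # broadcast over image dims
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                  # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)                       # predict the noise
```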

3

astrange t1_irpk8s2 wrote

Do you have specific examples?

It's obviously true that diffusion models don't work for the reasons they were originally thought to; the cold diffusion paper shows this. Also, Stable Diffusion explainers I've seen explain it using pixel diffusion even though it's latent diffusion. And I'm not sure I understand why latent diffusion works.

1

aviisu OP t1_iro0trj wrote

Thanks for your insight. Indeed, I genuinely don't understand people who intentionally put a bunch of complex math in a paper instead of trying to guide the reader with intuition and make the paper more accessible. But then again, I've heard it gets accepted more easily that way, so...

0

master3243 t1_irq9k6g wrote

That's just how math is done in research. If you don't like that, you'll hate pure math papers even more: they state a theorem, then present the steps that establish it as a true statement.

The intuition behind how the author arrived at the line of thinking that produced the final theorem is left (justifiably) entirely in the author's scratch paper or notebooks.

Some authors do give insight into the steps they took or their general intuition, which is always nice, but it's not a requirement.

It's also worth mentioning that a lot of us like doing research but don't like writing research papers (which is only a necessity because humans lack telepathic communication), so giving out more info is an optional step in a disliked process, which explains why it's often skipped.

5

Ok_Swordfish5638 t1_irobilk wrote

First off, keep in mind that this is difficult work, and the mathematical story you're reading is one that has been cleaned up and streamlined for presentation. The blind alleys, dead ends, and confusion have been filtered out for you, giving the illusion that the author had it all figured out from the start. That's not true, and often the complicated steps that seem like magic are the result of a researcher chewing on a particular step for quite a while before finding the right way to proceed. Figuring it out often involves trying different ways of looking at a problem until you hit upon one that works.

A lot of prior experience plays into it as well. After having solved many similar problems and working through other derivations you start to get a feel for what might work in certain situations. At the same time, as you do more of this kind of work your “bag of tricks” expands, so you learn more tools that you can bring to bear on new problems. There’s not really a substitute for this other than experience and practice, similar to how an expert programmer has practiced their skills for years to reach that point.

Oftentimes you have a strong intuition about what the final result will look like qualitatively. This helps determine whether or not it's worth grinding through the math to make sure you get all the details right, and guides you in how to approach each step. It's usually not the case that you start a derivation without having some idea where it will lead you, though sometimes the initial intuition doesn't make it into the final presentation. Having a good reason to believe that the thing you're trying to derive or prove is useful is very important before starting out.

The more clearly you can state what you’re trying to do at the outset, the better off you will be. For a proof, this takes the form of very clearly stating what you’re trying to prove, while for something like the topics you’re studying currently this might take the form of having a clear idea what parts of the cost function you’re trying to simplify or improve. There’s no guarantee you’ll be successful, but you also don’t read about the unsuccessful attempts.

27

Ecclestoned t1_irpictt wrote

I am not much of a mathematician, but I have published theoretical ML papers. To answer your second question: I had no idea whether I would be able to solve the problem when I started out.

The process was a lot of trial and error and iterative refinement. I started with the simplest form of the problem I could and made every simplifying assumption I could (usually that various terms are insignificant). Once I got closer to solving the equations, I got a better idea of what form the initial problem must take to be solvable. Then, working backwards, I determined which assumptions were necessary, solved the problem, and then checked the assumptions.

11

spring_m t1_irq31mn wrote

Another important skill is being able to understand what is truly crucial to a solution versus what is merely nice to have. For example, you mention VAEs, which are used to encode the images into latent space for Stable Diffusion etc. Based on some experiments I've done, you don't actually need a VAE: a vanilla convolutional autoencoder is good enough, since you don't actually sample from the latents; the diffusion takes care of that.
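To make the comparison concrete, here is a minimal sketch of the two encoder choices being contrasted; the layer sizes and names are illustrative assumptions, not Stable Diffusion's actual autoencoder.

```python
import torch
import torch.nn as nn

class PlainEncoder(nn.Module):
    """Deterministic conv encoder: the latent is just the feature map."""
    def __init__(self, channels=3, z_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, z_channels, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # latent used directly by the diffusion model

class VAEEncoder(PlainEncoder):
    """VAE-style encoder: predicts mean and log-variance, then samples z."""
    def __init__(self, channels=3, z_channels=4):
        super().__init__(channels, 2 * z_channels)  # twice the channels: mu and logvar

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
```

The commenter's point is that, in the latent-diffusion setting, the sampling step in the second variant may not be doing much useful work, since the diffusion model itself supplies the stochasticity.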

2