benanne OP t1_j427zj0 wrote

One indirect advantage for working with very long sequences is the lack of a causality constraint, which makes it very easy to use architectures whose computation is largely decoupled from the sequence length, like Perceivers (https://arxiv.org/abs/2103.03206, https://arxiv.org/abs/2107.14795) or Recurrent Interface Networks (https://arxiv.org/abs/2212.11972). This is highly speculative, though :)

(I am aware that an autoregressive variant of the Perceiver architecture exists, Perceiver AR (https://arxiv.org/abs/2202.07765), but it is actually quite a bit less general/flexible than Perceiver IO or the original Perceiver.)
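To make "computation decoupled from the sequence length" a bit more concrete, here is a minimal toy sketch of the cross-attention read these models rely on (my own illustration, not code from any of the papers above): a small, fixed set of latents attends to the full input once, and all further processing operates on the latents only.

```python
import torch

def cross_attend(latents, inputs):
    # latents: (num_latents, d); inputs: (seq_len, d)
    # The attention matrix is (num_latents, seq_len), so reading the input
    # costs O(num_latents * seq_len); everything downstream (self-attention
    # among the latents) is independent of seq_len.
    scores = latents @ inputs.T / latents.shape[-1] ** 0.5
    weights = scores.softmax(dim=-1)
    return weights @ inputs

num_latents, seq_len, d = 256, 100_000, 64
latents = torch.randn(num_latents, d)  # learned parameters in a real model
inputs = torch.randn(seq_len, d)       # a very long, non-causal input sequence
out = cross_attend(latents, inputs)    # shape: (num_latents, d)
```

Note that this only works so cleanly because there is no causality constraint to respect: every latent is allowed to see the entire input.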


benanne OP t1_j3r3stl wrote

DiffWave and WaveGrad are two nice TTS examples (see e.g. https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/); Riffusion (https://www.riffusion.com/) is also a fun one. Advances in audio generation always tend to lag behind the visual domain a bit, because audio is just inherently more unwieldy to work with (listening to 100 samples one by one takes a lot more time and patience than glancing at a 10x10 grid of images), but I'm pretty sure the takeover is also happening there.

If you're talking about text-to-audio in the vein of current text-to-image models, I'm pretty sure that's in the pipeline :)


benanne OP t1_j3qzbcr wrote

If you were to graph the weighting that ensures the training loss corresponds to likelihood, you would find that it looks roughly like exp(-x). In other words, the importance of the noise levels decreases more or less exponentially (but not exactly!) as they increase. So if you want to train a diffusion model to maximise likelihood (which can be a valid thing to do, for example if you want to use it for lossless compression), the noise levels you sample during training should include many more low noise levels than high ones (orders of magnitude more, in fact).

Usually when we train diffusion models, we sample noise levels uniformly, or from a simple distribution, but certainly not from a distribution which puts exponentially more weight on low noise levels. Therefore, relative to the likelihood loss, the loss we tend to use puts a lot less emphasis on low noise levels, which correspond to high spatial frequencies. Section 5 of my earlier blog post is an attempt at an intuitive explanation why this correspondence between noise levels and spatial frequencies exists: https://benanne.github.io/2022/01/31/diffusion.html#scale
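As a toy numerical illustration of just how large that difference in emphasis is (my own construction; only the rough exp(-x) shape is taken from the above, and the noise level range is arbitrary):

```python
import numpy as np

sigmas = np.linspace(0.01, 10.0, 1000)       # a grid of noise levels
likelihood_w = np.exp(-sigmas)               # the rough "exp(-x)" shape of the likelihood weighting
likelihood_w /= likelihood_w.sum()           # normalised into a sampling distribution
uniform_w = np.full_like(sigmas, 1.0 / len(sigmas))

low, high = sigmas < 1.0, sigmas > 9.0
print("low vs. high emphasis, likelihood-style:", likelihood_w[low].sum() / likelihood_w[high].sum())
print("low vs. high emphasis, uniform sampling:", uniform_w[low].sum() / uniform_w[high].sum())
```

The first ratio comes out in the thousands, the second is roughly 1: uniform sampling of noise levels puts orders of magnitude less relative weight on the low noise levels (i.e. the high spatial frequencies) than likelihood training would.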

"Variational diffusion models" is another paper that focuses on optimising likelihood, which you might find more accessible: https://arxiv.org/abs/2107.00630


benanne OP t1_j3qxvaa wrote

Reply to comment by jimmymvp in [R] Diffusion language models by benanne

As I understand it, the main motivation for latent diffusion is that in perceptual domains, ~99% of the information content in the input signals has little perceptual relevance, so it does not make sense to spend a lot of model capacity on it (lossy image compression methods like JPEG are based on the same observation). Training an autoencoder first to get rid of most of this irrelevant information can greatly simplify the generative modelling problem at almost no cost to fidelity.

This idea was originally used with great success to adapt autoregressive models to perceptual domains. Autoregression in pixel space (e.g. PixelRNN, PixelCNN) or amplitude space for audio (e.g. WaveNet, SampleRNN) does work, but it doesn't scale very well. Things work much better if you first use VQ-VAE (or even better, VQGAN) to compress the input signals, and then apply autoregression in its latent space.
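As a rough sketch of that two-stage recipe (hypothetical vq_encoder and ar_model modules standing in for e.g. a frozen, pre-trained VQGAN encoder and a transformer prior; not the actual training code from any of these papers):

```python
import torch
import torch.nn.functional as F

def ar_train_step(x, vq_encoder, ar_model, optimiser):
    # Stage 1 is frozen: it only provides a compressed, discrete representation.
    with torch.no_grad():
        codes = vq_encoder(x)              # (batch, T) integer code indices
    # Stage 2: autoregression over the codes instead of over raw pixels/samples.
    logits = ar_model(codes[:, :-1])       # (batch, T-1, vocab_size)
    loss = F.cross_entropy(logits.transpose(1, 2), codes[:, 1:])
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```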

The same is true for diffusion models, though in this case there is another mechanism we can use to reduce the influence of perceptually irrelevant information: changing the relative weighting of the noise levels during training, to downweight high-frequency components. Diffusion models actually do this out of the box when compared to likelihood-based models, which is why I believe they have completely taken over generative modelling of perceptual signals (as I discuss in the blog post).

But despite the availability of this reweighting mechanism, the latent approach can still provide further efficiency benefits. Stable Diffusion is testament to this: I believe the only reason they are able to offer a model that generates high-res content on a single consumer GPU is the adversarial autoencoder they use to get rid of all the imperceptible fine-grained details first.
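In pseudocode-ish form, that latent recipe looks something like the sketch below (hypothetical encoder/denoiser/decoder modules and a simplified noise-prediction objective; not Stable Diffusion's actual training code):

```python
import torch

def latent_diffusion_step(x, encoder, denoiser, optimiser):
    # Stage 1 is a frozen autoencoder that discards imperceptible detail.
    with torch.no_grad():
        z = encoder(x)                         # e.g. (batch, c, h, w) latents
    # Stage 2: denoising diffusion in the latent space.
    sigma = torch.rand(z.shape[0], 1, 1, 1)    # noise level from a simple distribution
    noise = torch.randn_like(z)
    pred = denoiser(z + sigma * noise, sigma)  # predict the added noise
    loss = ((pred - noise) ** 2).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# Sampling runs the reverse process in latent space and decodes at the end:
# x0 = decoder(sample_latents(denoiser))      # both names hypothetical
```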

I think this synergy between adversarial models (for low-level detail) and likelihood- or diffusion-based models (for structure and content) is still underutilised. There's a little bit more discussion about this in section 6 of my blog post on typicality: https://benanne.github.io/2020/09/01/typicality.html#right-level (though this largely predates the rise of diffusion models)


benanne OP t1_j3qvx05 wrote

My blog posts are mostly shower thoughts expanded into long form, so naturally they tend to be a bit speculative. I have in fact tried a bunch of stuff in the diffusion language modelling space, which culminated in the CDCD paper (https://arxiv.org/abs/2211.15089) as well as this theoretical note on simplex diffusion (https://arxiv.org/abs/2210.14784) -- if the style of the blog post isn't your cup of tea, these might be more to your liking :)

Completely agree re: hard numbers, by the way (I spent quite a bit of time Kaggling during my PhD, see some of my earlier blog posts), but a single researcher can only do so many experiments. Part of the motivation for writing these blog posts is to draw attention to areas of research I think are interesting, and hopefully encourage some people to delve deeper into them as well! Pointing out open questions can be quite conducive to that, in my experience.
