Submitted by Agreeable-Run-9152 t3_101s5kj in MachineLearning
bloc97 t1_j2pj1c6 wrote
There are many ways to condition a diffusion model on the timestep, but concatenating it to the input is the least efficient method because:
- The first layer of your model is a convolutional layer, and applying a convolution to a "time" image that has the same value everywhere is computationally wasteful. Early conv layers exist to detect local variation in an image (e.g. texture); applying the same kernel over and over to a constant image is not efficient.
- By giving t only to the first layer, the network has to waste resources/neurons propagating that information through the rest of the network. This waste is compounded by the fact that, in a ConvNet, the time information must be carried along for every "pixel" of every convolutional feature map. Why not skip all that and give the time embedding directly to deeper layers within the network?
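A minimal PyTorch sketch of that idea: a sinusoidal timestep embedding is projected per block and added (broadcast over the spatial dimensions) inside a residual block, so every block sees t directly rather than relying on the first layer. The class and function names here are illustrative, not from any particular codebase.

```python
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Sinusoidal embedding of scalar timesteps, as in DDPM/Transformers.
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) *
                      torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TimeConditionedResBlock(nn.Module):
    # Illustrative block: the time embedding is projected to the channel
    # dimension and added to every spatial location of the feature map.
    def __init__(self, channels, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, channels)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        # Broadcast the projected embedding over H and W.
        h = h + self.t_proj(t_emb)[:, :, None, None]
        h = self.conv2(self.act(h))
        return x + h

x = torch.randn(4, 64, 32, 32)
t_emb = timestep_embedding(torch.tensor([1, 10, 100, 500]), 128)
block = TimeConditionedResBlock(64, 128)
out = block(x, t_emb)  # same shape as x
```

This is the pattern used in most diffusion U-Nets: one shared embedding is computed once per forward pass and each block learns its own projection of it.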
Agreeable-Run-9152 OP t1_j2px8mz wrote
Okay, yeah, makes sense. I am currently working in the context of FNOs. How would you do it there?
bloc97 t1_j2pz7r6 wrote
FNO? Are you referring to Fourier Neural Operator?
Agreeable-Run-9152 OP t1_j2pz999 wrote
Yep
bloc97 t1_j2q0aio wrote
I'm not too familiar with FNOs, but I guess you could start experimenting by adding the time embedding to the "DC component" of the Fourier transform; it would be at least equivalent to adding the time embedding to the entire feature map in a ResNet.
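A hedged sketch of that suggestion, assuming a standard 1D FNO spectral layer: the projected time embedding is added to the zero-frequency ("DC") mode of the feature's Fourier transform, which in physical space amounts to a constant per-channel offset over the whole domain. All names here are hypothetical, not from the FNO reference implementation.

```python
import torch
import torch.nn as nn

class TimeConditionedSpectralConv1d(nn.Module):
    # Sketch of an FNO-style spectral layer where the time embedding
    # is injected into the DC (mode-0) Fourier coefficient.
    def __init__(self, channels, modes, t_dim):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        # Learned complex multipliers for the lowest `modes` frequencies.
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))
        self.t_proj = nn.Linear(t_dim, channels)

    def forward(self, x, t_emb):
        # x: (batch, channels, n), t_emb: (batch, t_dim)
        x_ft = torch.fft.rfft(x)  # (batch, channels, n//2 + 1)
        # Add the projected time embedding to the DC component (mode 0);
        # equivalent to a constant per-channel shift in physical space.
        dc = self.t_proj(t_emb).to(torch.cfloat)
        x_ft = torch.cat([x_ft[:, :, :1] + dc[:, :, None],
                          x_ft[:, :, 1:]], dim=-1)
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))

x = torch.randn(2, 8, 16)
t_emb = torch.randn(2, 32)
layer = TimeConditionedSpectralConv1d(channels=8, modes=4, t_dim=32)
out = layer(x, t_emb)  # (2, 8, 16)
```

Nothing here is FNO-specific dogma; one could also add the embedding to several low-frequency modes, or to the pointwise (bypass) path that FNO layers usually carry alongside the spectral convolution.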
jm2342 t1_j2rixjn wrote
How do you think the brain handles that?