Submitted by Agreeable-Run-9152 t3_101s5kj in MachineLearning
bloc97 t1_j2pj1c6 wrote
There are many ways to condition a diffusion model on the timestep, but concatenating it to the input is the least efficient method because:
- The first layer of your model is a convolutional layer, and applying a convolution to a "time" image that has the same value everywhere is computationally wasteful. Early conv layers exist to detect local variation in an image (e.g. texture); applying the same kernel over and over to a constant image is not efficient.
- By giving t only to the first layer, the network has to waste resources/neurons propagating that information through the rest of the network. This waste is compounded by the fact that, in a ConvNet, the time information must be carried along for every "pixel" of every convolutional feature map. Why not skip all that and give the time embedding directly to deeper layers within the network?
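A minimal PyTorch sketch of that idea: a sinusoidal timestep embedding is projected per block and added (broadcast over the spatial dimensions) inside a residual block, so every block sees t directly rather than relying on the first layer. The class and function names here are illustrative, not from any particular codebase.

```python
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Sinusoidal embedding of scalar timesteps, as in DDPM/Transformers.
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) *
                      torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TimeConditionedResBlock(nn.Module):
    # Illustrative block: the time embedding is projected to the channel
    # dimension and added to every spatial location of the feature map.
    def __init__(self, channels, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, channels)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        # Broadcast the projected embedding over H and W.
        h = h + self.t_proj(t_emb)[:, :, None, None]
        h = self.conv2(self.act(h))
        return x + h

x = torch.randn(4, 64, 32, 32)
t_emb = timestep_embedding(torch.tensor([1, 10, 100, 500]), 128)
block = TimeConditionedResBlock(64, 128)
out = block(x, t_emb)  # same shape as x
```

This is the pattern used in most diffusion U-Nets: one shared embedding is computed once per forward pass and each block learns its own projection of it.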
Agreeable-Run-9152 OP t1_j2px8mz wrote
Okay, yeah, makes sense. I am currently working in the context of FNOs. How would you do it there?
bloc97 t1_j2pz7r6 wrote
FNO? Are you referring to Fourier Neural Operator?
Agreeable-Run-9152 OP t1_j2pz999 wrote
Yep
bloc97 t1_j2q0aio wrote
I'm not too familiar with FNOs, but I guess you could start experimenting by adding the time embedding to the "DC component" of the Fourier transform; it would be at least equivalent to adding the time embedding to the entire feature map in a ResNet.
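A hedged sketch of that suggestion, assuming a standard 1D FNO spectral layer: the projected time embedding is added to the zero-frequency ("DC") mode of the feature's Fourier transform, which in physical space amounts to a constant per-channel offset over the whole domain. All names here are hypothetical, not from the FNO reference implementation.

```python
import torch
import torch.nn as nn

class TimeConditionedSpectralConv1d(nn.Module):
    # Sketch of an FNO-style spectral layer where the time embedding
    # is injected into the DC (mode-0) Fourier coefficient.
    def __init__(self, channels, modes, t_dim):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        # Learned complex multipliers for the lowest `modes` frequencies.
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))
        self.t_proj = nn.Linear(t_dim, channels)

    def forward(self, x, t_emb):
        # x: (batch, channels, n), t_emb: (batch, t_dim)
        x_ft = torch.fft.rfft(x)  # (batch, channels, n//2 + 1)
        # Add the projected time embedding to the DC component (mode 0);
        # equivalent to a constant per-channel shift in physical space.
        dc = self.t_proj(t_emb).to(torch.cfloat)
        x_ft = torch.cat([x_ft[:, :, :1] + dc[:, :, None],
                          x_ft[:, :, 1:]], dim=-1)
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))

x = torch.randn(2, 8, 16)
t_emb = torch.randn(2, 32)
layer = TimeConditionedSpectralConv1d(channels=8, modes=4, t_dim=32)
out = layer(x, t_emb)  # (2, 8, 16)
```

Nothing here is FNO-specific dogma; one could also add the embedding to several low-frequency modes, or to the pointwise (bypass) path that FNO layers usually carry alongside the spectral convolution.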
jm2342 t1_j2rixjn wrote
How do you think the brain handles that?