Submitted by Agreeable-Run-9152 t3_101s5kj in MachineLearning
Usually, when you approximate the score s(x,t) in diffusion models, the time t is passed through an embedding network before it is added to the x components in the ResNet blocks of your model.
What is the rationale behind this? Couldn't you just concatenate x and t along the channel dimension? And if you were to use a model other than a U-Net, what would be the equivalent?
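To make the two options in the question concrete, here is a rough NumPy sketch. It assumes a sinusoidal embedding followed by a single linear projection, which is a common choice but not something the post specifies; the function names, the weights `w` and `b`, and the shapes are all hypothetical.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Map timesteps of shape (B,) to (B, dim) sin/cos features (assumed embedding)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def condition_by_addition(h, t, w, b):
    """Embed t, project it to one value per channel, and add it to h.

    h: (B, C, H, W) feature map inside a residual block
    w: (dim, C), b: (C,) hypothetical weights of the projection layer
    """
    emb = sinusoidal_embedding(t, w.shape[0])   # (B, dim)
    shift = emb @ w + b                         # (B, C)
    return h + shift[:, :, None, None]          # broadcast over H and W

def condition_by_concat(x, t):
    """Append t as one extra constant channel of the input x."""
    B, _, H, W = x.shape
    t_col = np.asarray(t, dtype=x.dtype)[:, None, None, None]
    t_map = np.broadcast_to(t_col, (B, 1, H, W))
    return np.concatenate([x, t_map], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))   # toy batch: 2 images, 3 channels
t = np.array([0.1, 0.9])            # one time value per image
w = rng.normal(size=(16, 3))        # stand-in for learned weights
b = np.zeros(3)

h_added = condition_by_addition(x, t, w, b)   # channel count unchanged
x_concat = condition_by_concat(x, t)          # one extra input channel
```

In the additive version the network sees t at every block where the shift is applied, while in the concatenation version t only enters at the input as a single constant plane.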
bloc97 t1_j2pj1c6 wrote
There are many ways to condition a diffusion model using time, but concatenating it as input is the least efficient method because: