-ZeroRelevance- t1_irvla8f wrote on October 11, 2022 at 11:31 AM

Reply to comment by kasiotuo in Generation of high fidelity videos from text using Imagen Video by Dr_Singularity

That probably comes from the temporal upscaling. As they said, the initial video is only 3fps, so to reach the final 24fps they’re basically synthesising 7 frames for each actual frame given. With that much of the video interpolated, it’s no wonder it morphs. If it began with a higher temporal resolution (initial fps), it would likely be much more coherent.
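
To make the arithmetic concrete, here is a rough sketch of where the "7 frames" figure comes from. It assumes a 24fps output, which is what 3fps plus 7 synthesised frames per real frame implies; the function name is just made up for illustration and has nothing to do with the actual Imagen Video code.

```python
def synthesised_per_base_frame(base_fps: int, target_fps: int) -> int:
    """Count the in-between frames generated for each base frame.

    Hypothetical helper for illustration only. Assumes the target frame
    rate is an exact multiple of the base frame rate.
    """
    if base_fps <= 0 or target_fps % base_fps != 0:
        raise ValueError("target_fps must be a positive multiple of base_fps")
    # Each base frame expands into (target_fps / base_fps) output frames,
    # and one of those is the base frame itself.
    return target_fps // base_fps - 1


def synthesised_fraction(base_fps: int, target_fps: int) -> float:
    """Fraction of output frames that come from temporal upscaling."""
    extra = synthesised_per_base_frame(base_fps, target_fps)
    return extra / (extra + 1)


if __name__ == "__main__":
    print(synthesised_per_base_frame(3, 24))  # 7
    print(synthesised_fraction(3, 24))        # 0.875
```

So under that assumption, 7 out of every 8 output frames are filled in by the upscaler. Starting from, say, 6fps would drop that to 3 out of every 4.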