SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, Jürgen Gall

University of Bonn, Lamarr Institute for Machine Learning and Artificial Intelligence

SyncVP is a diffusion model for synchronized multi-modal video prediction. Given an observation that contains both modalities (left) or only one modality (right), it generates future frames in multiple modalities, such as RGB and depth.

Generated future frames are shown in red, and the initial conditions in yellow.