SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

University of Bonn, Lamarr Institute for Machine Learning and Artificial Intelligence

SyncVP is a diffusion model for synchronized multi-modal video prediction. Given an observation that contains either both modalities (left) or only one (right), it generates future frames in multiple modalities, such as RGB and depth.

Video predictions

Generated future frames are shown in red, initial conditions in yellow.