SyncVP is a diffusion model for synchronized multi-modal video prediction. It generates multi-modal future frames like RGB
and depth for a given observation that can consist of both modalities (left) or only one modality (right).
Video predictions
In red the generated future frames, in yellow the initial conditions.