State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack “liveliness,” a key component for creating engaging 3D experiences. Recent video diffusion models generate realistic videos with complex motion and enable animation of 2D images; however, they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work mostly focuses on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.
We aim to animate given, captured 3D scenes using video diffusion models. Unfortunately, currently available open video diffusion models are far from consistent in their outputs. In this project, we asked ourselves how we can deal with such noisy generations and successfully "breathe life" into static 3D scenes captured with Gaussian Splatting.
Our method is built upon the following key components: We use DynamiCrafter, an image-conditioned video diffusion model, to generate guidance videos. While image conditioning enables more realistic, scene-aligned video generations, the results are not (multi-view) consistent. To address this, we propose a simple autoregressive scheme that uses the given static 3D scene and the previous guidance video to generate more 3D-consistent motion for the next viewpoint.
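To make the idea concrete, here is a minimal sketch of such an autoregressive loop. The callables `render_static`, `warp_to_view`, and `generate_video` are hypothetical placeholders (e.g., a static 3DGS render of the viewpoint, a depth-based warp into the next view, and a DynamiCrafter call), not the actual Gaussians2Life API.

```python
from typing import Callable, List, Optional, Sequence

def generate_guidance_videos(
    cameras: Sequence,
    render_static: Callable,   # camera -> static 3DGS render used as image condition
    warp_to_view: Callable,    # (video, src_cam, dst_cam) -> video warped into dst view
    generate_video: Callable,  # (cond_image, init_video) -> guidance video (e.g., DynamiCrafter)
) -> List:
    """Autoregressive guidance generation: each viewpoint is conditioned on its
    static render and initialized with the previous video warped into the new
    view, so that the generated motion stays roughly consistent across views."""
    videos: List = []
    prev = None
    for i, cam in enumerate(cameras):
        cond_image = render_static(cam)
        init_video = None if prev is None else warp_to_view(prev, cameras[i - 1], cam)
        video = generate_video(cond_image, init_video)
        videos.append(video)
        prev = video
    return videos
```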
Unfortunately, these video generations are still not perfectly multi-view consistent. With an appearance-based optimization, this noise is usually still enough to cause catastrophic failures. To address this, we propose to borrow techniques from monocular dynamic 3D reconstruction and directly lift 2D motion to 3D, using off-the-shelf 2D point tracking and depth estimation models.
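As an illustration of this lifting step, the following sketch unprojects tracked 2D points into camera space using per-frame depth and pinhole intrinsics. The array shapes and the assumption that the monocular depth is already aligned to the scene's scale are ours, not the exact implementation.

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d: np.ndarray, depths: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject 2D point tracks into camera-space 3D trajectories.

    tracks_2d: (T, N, 2) pixel positions from an off-the-shelf 2D point tracker
    depths:    (T, H, W) per-frame depth estimates (aligned to the scene's scale)
    K:         (3, 3) pinhole intrinsics
    returns:   (T, N, 3) 3D anchor-point trajectories
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.zeros((T, N, 3), dtype=np.float64)
    for t in range(T):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        # sample depth at the (rounded, clamped) tracked pixel locations
        vi = np.clip(np.rint(v).astype(int), 0, depths.shape[1] - 1)
        ui = np.clip(np.rint(u).astype(int), 0, depths.shape[2] - 1)
        z = depths[t, vi, ui]
        points_3d[t] = np.stack([(u - cx) / fx * z, (v - cy) / fy * z, z], axis=-1)
    return points_3d
```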
Using the trajectories of these lifted "anchor" points, we subsequently transfer the motion to the 3D Gaussians, employing techniques inspired by traditional geometry processing to promote smooth and, e.g., as-rigid-as-possible object motion.
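A minimal sketch of how such a transfer could look: distance-weighted k-nearest-neighbor blending of anchor displacements onto the Gaussian centers, plus an as-rigid-as-possible-style penalty that keeps pairwise distances between neighboring Gaussians stable. Both the weighting scheme and the loss are simplified stand-ins for the geometry-processing-inspired techniques mentioned above, not the exact formulation from the paper.

```python
import torch

def propagate_anchor_motion(means: torch.Tensor, anchors: torch.Tensor,
                            anchor_disp: torch.Tensor, k: int = 8,
                            sigma: float = 0.05) -> torch.Tensor:
    """Blend anchor displacements onto all Gaussian centers via k-NN weights.

    means:       (G, 3) Gaussian centers at the current frame
    anchors:     (A, 3) lifted anchor points at the current frame
    anchor_disp: (A, 3) anchor displacements to the next frame
    """
    dist = torch.cdist(means, anchors)                      # (G, A) pairwise distances
    knn_dist, knn_idx = dist.topk(k, largest=False)         # (G, k) nearest anchors
    w = torch.softmax(-knn_dist / sigma, dim=-1)            # closer anchors weigh more
    disp = (w.unsqueeze(-1) * anchor_disp[knn_idx]).sum(1)  # (G, 3) blended displacement
    return means + disp

def rigidity_loss(means_before: torch.Tensor, means_after: torch.Tensor,
                  neighbor_idx: torch.Tensor) -> torch.Tensor:
    """ARAP-style penalty: neighboring Gaussians should preserve their pairwise
    distances under the deformation (neighbor_idx: (G, k) neighbor indices)."""
    d0 = (means_before.unsqueeze(1) - means_before[neighbor_idx]).norm(dim=-1)
    d1 = (means_after.unsqueeze(1) - means_after[neighbor_idx]).norm(dim=-1)
    return ((d1 - d0) ** 2).mean()
```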
This way, we not only reduce sensitivity to noisy guidance videos but also significantly reduce optimization time, as motion is distilled much faster. By using a video diffusion model trained on a large variety of real-world videos, our method can animate a wide range of object classes and implicitly handles movement within a given scene.
Limitations: While our method is able to generate realistic movement without affecting the appearance of the scene, it is not free of limitations. For example, as scenes are only deformed and 3D Gaussians are neither added nor removed, our method cannot "fill" gaps that open up when objects move apart. Furthermore, our method is limited by the quality of the guidance videos, and thus by the state of the art in open video diffusion models. For example, the domain mismatch between the inherently static scenes typically captured with 3DGS and the dynamic scenes used to train the video diffusion model limits the quality of the generated motion. For more details, please refer to our paper.
Thomas Wimmer is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research.
@inproceedings{wimmer2025gaussianstolife,
title={Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes},
author={Wimmer, Thomas and Oechsle, Michael and Niemeyer, Michael and Tombari, Federico},
booktitle={2025 International Conference on 3D Vision (3DV)},
year={2025}
}