Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

1Technical University of Munich, 2Google
3DV 2025

Abstract

State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack “liveliness,” a key component for creating engaging 3D experiences. Recently introduced video diffusion models generate realistic videos with complex motion and enable animation of 2D images; however, they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.

Side-by-side comparisons of DreamGaussian4D and our method ("Ours") for the prompts "bear statue turns its head" and "toy bulldozer lifts up its shovel".
tl;dr: We introduce a method to animate given 3D scenes that uses pre-trained models to lift 2D motion into 3D. We propose a training-free, autoregressive method to generate more 3D-consistent video guidance across viewpoints, which we use to refine the distilled dynamics. Our method supports diverse object classes and runs on a single 24GB GPU with ~10 minutes of optimization time.

Overview

We aim to animate given, captured 3D scenes using video diffusion models. Unfortunately, current openly available video diffusion models are far from consistent in their outputs. In this project, we asked ourselves how we can deal with such noisy generations and still successfully "breathe life" into static 3D scenes that were captured using Gaussian Splatting.

Our method is built upon the following key components: We use DynamiCrafter, an image-conditioned video diffusion model, to generate guidance videos. While the image condition enables more realistic and scene-aligned video generations, the outputs are not (multi-view) consistent. To address this, we propose a simple, autoregressive method that makes use of the given static 3D scene and the previous guidance video to generate more 3D-consistent motion for the next viewpoint.

Pipeline for approximately multi-view consistent video generation using latent interpolation.
Improvement of multi-view consistency of generated videos through latent interpolation. In addition to the rendering of the dynamic scene \(f\) using the rendering function \(g\) from the current viewpoint \(g(f)_{s}\), we compute the latent embedding of the warped video output \(v_{s-1}\) of the previous optimization step (from a different viewpoint). We linearly interpolate the latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The resulting output is finally decoded to a new video output \(v_{s}\).
Without (w/o) vs. with (w/) latent interpolation: approximately multi-view consistent video generations, starting from one video generated from an anchor viewpoint. The 3D scene is kept static between all generation steps to demonstrate the effect of the proposed latent interpolation.
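To make this step concrete, below is a minimal PyTorch-style sketch of one autoregressive generation step with latent interpolation. All names (`vdm`, `encode`, `decode`, `warp_to_view`, `render_dynamic`, `render_static`, `alpha`) are illustrative placeholders for the video diffusion model, its latent encoder/decoder, a warping routine, the 3DGS renderers, and the blending weight; they are not identifiers from our code.

```python
def generate_guidance_video(vdm, encode, decode, warp_to_view,
                            render_dynamic, render_static,
                            prev_video, view_s, text_prompt, alpha=0.5):
    """One autoregressive guidance-video generation step (illustrative sketch)."""
    # Render the current dynamic scene f from viewpoint s: g(f)_s.
    rendered = render_dynamic(view_s)               # (T, 3, H, W)

    # Warp the previous viewpoint's guidance video v_{s-1} into viewpoint s.
    warped_prev = warp_to_view(prev_video, view_s)  # (T, 3, H, W)

    # Encode both videos into the VDM latent space and interpolate linearly.
    z_render = encode(rendered)
    z_prev = encode(warped_prev)
    z_init = alpha * z_render + (1.0 - alpha) * z_prev

    # The VDM is additionally conditioned on the static scene view from
    # viewpoint s (here: the first frame of the static rendering) and the text prompt.
    image_cond = render_static(view_s)[0]
    z_out = vdm(z_init, image_cond=image_cond, prompt=text_prompt)

    # Decode to the new guidance video v_s for this viewpoint.
    return decode(z_out)
```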

Unfortunately, these video generations are still not perfectly multi-view consistent. With a purely appearance-based optimization, this remaining inconsistency is usually enough to cause catastrophic failure. To address this, we make use of techniques from monocular dynamic 3D reconstruction and directly lift 2D motion to 3D, using off-the-shelf 2D point tracking and depth estimation models.

Pipeline for lifting 2D dynamics into 3D using point tracking and depth estimation.
Pipeline for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D.
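As an illustration of this lifting step, the sketch below unprojects 2D point tracks into 3D trajectories, given depth maps aligned to the scene scale and known camera parameters. The function name, signature, and tensor shapes are assumptions for this example, not our exact implementation.

```python
import torch

def lift_tracks_to_3d(tracks_2d, depth, K, cam_to_world):
    """Unproject 2D point tracks into 3D trajectories (illustrative sketch).

    tracks_2d:    (T, N, 2) pixel coordinates from an off-the-shelf tracker
    depth:        (T, H, W) per-frame depth maps, aligned to the scene scale
    K:            (3, 3) camera intrinsics
    cam_to_world: (T, 4, 4) camera-to-world poses
    """
    T, N, _ = tracks_2d.shape
    u, v = tracks_2d[..., 0], tracks_2d[..., 1]

    # Sample the depth map at each track location (nearest neighbour for brevity).
    ui = u.round().long().clamp(0, depth.shape[2] - 1)
    vi = v.round().long().clamp(0, depth.shape[1] - 1)
    z = torch.stack([depth[t, vi[t], ui[t]] for t in range(T)])   # (T, N)

    # Back-project to camera coordinates.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)], dim=-1)  # (T, N, 4)

    # Transform to world coordinates to obtain 3D anchor trajectories.
    pts_world = torch.einsum('tij,tnj->tni', cam_to_world, pts_cam)
    return pts_world[..., :3]                                     # (T, N, 3)
```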

Using the trajectories of these lifted "anchor" points, we subsequently transfer the motion to the 3D Gaussians, where we employ techniques inspired by traditional geometry processing to promote smooth and, e.g., as-rigid-as-possible object motion. One simple way to realize such a transfer is sketched below.
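The sketch blends the displacements of the k nearest anchor points for every Gaussian with inverse-distance weights; in practice, this would be combined with regularization, e.g., towards as-rigid-as-possible deformation. The k-nearest-neighbor scheme and all names here are illustrative, not our exact method.

```python
import torch

def transfer_motion_to_gaussians(gaussian_xyz, anchors_3d, k=8, eps=1e-6):
    """Propagate anchor trajectories to Gaussian centers (illustrative sketch).

    gaussian_xyz: (G, 3)    static Gaussian centers
    anchors_3d:   (T, N, 3) lifted anchor trajectories (frame 0 = rest state)
    returns:      (T, G, 3) animated Gaussian centers
    """
    rest_anchors = anchors_3d[0]                                   # (N, 3)
    offsets = anchors_3d - rest_anchors[None]                      # (T, N, 3)

    # k nearest anchors (in the rest pose) for every Gaussian.
    d = torch.cdist(gaussian_xyz, rest_anchors)                    # (G, N)
    dist, idx = d.topk(k, dim=1, largest=False)                    # (G, k)
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)                             # (G, k)

    # Blend per-anchor offsets and displace the Gaussians accordingly.
    sel = offsets[:, idx]                                          # (T, G, k, 3)
    disp = (w[None, :, :, None] * sel).sum(dim=2)                  # (T, G, 3)
    return gaussian_xyz[None] + disp
```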

This way, we not only reduce the sensitivity to noisy guidance videos, but also significantly shorten the optimization time, as motion is distilled much faster. By using a video diffusion model trained on a large variety of real-world videos, our method can animate a wide range of object classes and implicitly handles movement within a given scene.

Qualitative Results

Qualitative Result 1
"bear statue turns its head"
Qualitative Result 2
"toy bulldozer lifts up its shovel"
Qualitative Result 3
"strong wind makes the vase wobble"
Qualitative Result 4
"toy bulldozer moves forwards"

Limitations: While our method is able to generate realistic movement without affecting the appearance of the scene, it is not free of limitations. For example, as scenes are only deformed and 3D Gaussians are neither added nor removed, our method cannot "fill" gaps that open up when objects move apart. Furthermore, our method is limited by the quality of the guidance videos, and thus by the state of the art in open video diffusion models. For example, the domain mismatch between the inherently static scenes typically captured with 3DGS and the dynamic scenes used to train the video diffusion model limits the quality of the motion in the generated videos. For more details, please refer to our paper.

Acknowledgements

Thomas Wimmer is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research.

BibTeX

@inproceedings{wimmer2025gaussianstolife,
    title={Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes},
    author={Wimmer, Thomas and Oechsle, Michael and Niemeyer, Michael and Tombari, Federico},
    booktitle={2025 International Conference on 3D Vision (3DV)},
    year={2025}
}