Thomas Wimmer

Hi! I am currently pursuing a PhD through the Max Planck ETH Center for Learning Systems (CLS) and ELLIS programs. My advisors are Jan Eric Lenssen, Bernt Schiele, Christian Theobalt (MPI), and Siyu Tang (ETH).

I previously completed a double master’s degree at the Technical University of Munich and the Institut Polytechnique de Paris. During my studies, I had the chance to work with many great people, including Daniel Cremers, Maks Ovsjanikov, Peter Wonka, and Federico Tombari.

My main research interests lie at the intersection of computer vision, computer graphics, and geometry processing, focusing on (dynamic) 3D scene understanding, reconstruction, and generation, as well as visual semantics. However, I am always open to new ideas and collaborations in related fields. This website gives you an overview of my recent research and other projects.

Get in touch via email!

Curriculum Vitae

news

Oct 14, 2025	New pre-print: “AnyUp: Universal Feature Upsampling” is now available on arXiv! Super excited to share this work, in which we propose a feature-agnostic upsampling architecture that can upsample features from any vision model at any resolution without encoder-specific training. AnyUp achieves new state-of-the-art results on multiple downstream benchmarks and is the first upsampler that naturally generalizes to different feature types at inference time.
Jun 05, 2025	New pre-print: “Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels” is now available on arXiv! We show that foundational features can be refined with an adapter that is trained with pseudo-labels, which are themselves zero-shot predictions using the same foundational features. We improve the quality of pseudo-labels through 3D-aware chaining with cycle-consistency and reject wrong pairs using a spherical prototype. The approach sets a new state of the art on SPair-71k and scales to larger datasets. Accepted to ICCV 2025!
Jan 12, 2025	Our pre-print “MEt3R: Measuring Multi-View Consistency in Generated Images” is now available on arXiv! In this work, we propose a DUSt3R-based method to measure multi-view consistency which can, e.g., be used to evaluate the 3D consistency of video diffusion models. Accepted to CVPR 2025!
Nov 05, 2024	Happy to report that my latest paper, “Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes”, was accepted for publication at 3DV 2025. Thanks to my co-authors, Michael Oechsle, Michael Niemeyer, and Federico Tombari, for a great collaboration!

selected publications

AnyUp: Universal Feature Upsampling

Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen

arXiv preprint arXiv:2510.12764, 2025

We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Olaf Dünkel, Thomas Wimmer, Christian Theobalt, Christian Rupprecht, and Adam Kortylewski

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features with pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state of the art on SPair-71k with an absolute gain of over 4%, and of over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
MEt3R: Measuring Multi-View Consistency in Generated Images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

We introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable for measuring the quality of generated outputs, and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, including our open, multi-view latent diffusion model.
Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

Thomas Wimmer, Michael Oechsle, Michael Niemeyer, and Federico Tombari

In 2025 International Conference on 3D Vision (3DV), 2025

State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack “liveliness,” a key component for creating engaging 3D experiences. Recent video diffusion models generate realistic videos with complex motion and enable animations of 2D images; however, they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians-to-Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.
Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features

Thomas Wimmer, Peter Wonka, and Maks Ovsjanikov

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

With the immense growth of dataset sizes and computing resources in recent years, so-called foundation models have become popular in NLP and vision tasks. In this work, we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem, we propose, first, to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second, we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset, almost doubling the performance of the previous best methods.