AnyUp: Universal Feature Upsampling

Thomas Wimmer1,2, Prune Truong3, Marie-Julie Rakotosaona3, Michael Oechsle3,
Federico Tombari3,4, Bernt Schiele1 and Jan Eric Lenssen1

1Max Planck Institute for Informatics, 2ETH Zurich, 3Google, 4TU Munich

AnyUp can be applied to any feature from any layer of any image encoder without feature-specific retraining.

We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an upsampling architecture that is feature-agnostic at inference time to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

Introduction

Pre-trained image encoders (e.g., DINO, CLIP, SigLIP) are foundational for modern vision tasks but produce coarse feature maps, hindering pixel-level predictions. Existing learned upsamplers work only with the encoders they were trained on. We introduce AnyUp, a feature upsampler that is encoder-agnostic at inference and handles features of any type, resolution, and dimensionality. AnyUp combines a feature-agnostic layer, window-attention upsampling, and a crop-based training strategy with feature-consistency regularization, achieving state-of-the-art accuracy across diverse downstream tasks while preserving original feature semantics. Trained once, it generalizes robustly to unseen feature types and vision backbones without retraining.

Code Example. Using AnyUp is as easy as one call to torch.hub.load. Check the GitHub repository for more details.
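A minimal usage sketch is shown below; the repository path, entrypoint name, and argument order are assumptions for illustration, so please refer to the GitHub repository for the exact interface.

import torch

# Load a pretrained AnyUp upsampler via torch.hub.
# NOTE: the repository path and entrypoint name below are assumptions for
# illustration; check the GitHub repository for the exact call.
upsampler = torch.hub.load("wimmerth/anyup", "anyup")

image = torch.rand(1, 3, 448, 448)       # high-resolution RGB image
features = torch.rand(1, 384, 32, 32)    # coarse encoder features (e.g., DINOv2 ViT-S)

# Argument order is also an assumption: coarse features first, guidance image second.
with torch.no_grad():
    hr_features = upsampler(features, image)   # expected shape: (1, 384, 448, 448)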

Method

Given an RGB image $I_{hr} \in \mathbb{R}^{H \times W \times 3}$ and coarse encoder features $p \in \mathbb{R}^{h \times w \times c}$, AnyUp predicts high-resolution features $q = f(p, I_{hr}) \in \mathbb{R}^{H \times W \times c}$ suitable for pixel-level tasks. We build a lightweight window attention-based upsampler: image pixels provide queries; keys combine downsampled image features and coarse features; values are the coarse feature patches. Two changes make this architecture encoder-agnostic and more accurate.

First, a feature-agnostic layer maps features from any backbone and dimensionality to a canonical space. Each input channel is convolved with a learned filter basis; per-filter responses are softmax-normalized and averaged across channels, producing a representation invariant to the input dimensionality while capturing local structural changes needed for upsampling.
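As a concrete illustration, the following PyTorch sketch shows one plausible reading of this layer; the module name, number of basis filters, kernel size, and the choice to softmax over the filter dimension are assumptions, not the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAgnosticLayer(nn.Module):
    # Sketch of the feature-agnostic layer described above (one plausible
    # reading of the text; names and hyperparameters are assumptions).
    def __init__(self, num_filters: int = 64, kernel_size: int = 3):
        super().__init__()
        # A single learned filter basis, shared across all input channels.
        self.basis = nn.Conv2d(1, num_filters, kernel_size,
                               padding=kernel_size // 2, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, h, w) with an arbitrary channel count C.
        B, C, h, w = feats.shape
        # Convolve every channel independently by folding channels into the batch.
        responses = self.basis(feats.reshape(B * C, 1, h, w))   # (B*C, K, h, w)
        responses = responses.reshape(B, C, -1, h, w)           # (B, C, K, h, w)
        # Softmax-normalize the per-filter responses (here: over the filter
        # dimension), then average over input channels. The output size K is
        # independent of C, making the layer feature-agnostic.
        return F.softmax(responses, dim=2).mean(dim=1)          # (B, K, h, w)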

Second, local window attention restricts each query to a spatial neighborhood, preventing attention to unrelated far-away patches, improving locality and efficiency, and simplifying optimization.
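The sketch below combines the query/key/value construction described above with the local window restriction. It is a deliberately naive, un-vectorized illustration; the tensor layouts, window size, and scaling are assumptions rather than the actual implementation.

import torch
import torch.nn.functional as F

def window_attention_upsample(queries, keys, values, window=7):
    # queries: (B, H, W, d)  per-pixel queries derived from the high-res image
    # keys:    (B, h, w, d)  per-patch keys (image + coarse feature information)
    # values:  (B, h, w, c)  coarse encoder features to be redistributed
    # returns: (B, H, W, c)  upsampled features
    B, H, W, d = queries.shape
    _, h, w, c = values.shape
    r = window // 2

    # Pad the coarse grids so every query sees a full window of candidates.
    keys_p = F.pad(keys.permute(0, 3, 1, 2), (r, r, r, r)).permute(0, 2, 3, 1)
    vals_p = F.pad(values.permute(0, 3, 1, 2), (r, r, r, r)).permute(0, 2, 3, 1)

    out = torch.empty(B, H, W, c)
    for i in range(H):
        for j in range(W):
            ci, cj = i * h // H, j * w // W          # corresponding coarse location
            k_win = keys_p[:, ci:ci + window, cj:cj + window].reshape(B, -1, d)
            v_win = vals_p[:, ci:ci + window, cj:cj + window].reshape(B, -1, c)
            # Attention restricted to the local window around (ci, cj).
            attn = torch.softmax(
                (k_win @ queries[:, i, j, :, None]).squeeze(-1) / d ** 0.5, dim=-1)
            out[:, i, j] = (attn[:, None, :] @ v_win).squeeze(1)
    return out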

Method Overview. (Left) During training, features computed for randomly sampled image parts are used as a reference for the respective part of the upsampled feature map. (Right) AnyUp performs window attention-based upsampling. Input features are processed with a feature-agnostic layer.

Training uses local crops: we compute reference features for randomly sampled image parts, upsample full-image coarse features, and supervise only within the crop. This avoids expensive high-resolution encoder queries yet covers large upsampling factors. The loss combines cosine and L2 feature matching, plus self-consistency and input-consistency regularizers to preserve the source feature space and encourage locality.
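A minimal sketch of the feature-matching part of this loss is given below; the equal weighting is an assumption, and the self-consistency and input-consistency regularizers are omitted since their exact form is not spelled out here.

import torch
import torch.nn.functional as F

def feature_matching_loss(pred_crop, ref_crop):
    # pred_crop: upsampled features restricted to the sampled crop, (B, c, Hc, Wc)
    # ref_crop:  reference features computed on the crop itself,    (B, c, Hc, Wc)
    cos_term = (1.0 - F.cosine_similarity(pred_crop, ref_crop, dim=1)).mean()
    l2_term = F.mse_loss(pred_crop, ref_crop)
    # Equal weighting is an assumption; the consistency regularizers described
    # above would be added on top of these two terms.
    return cos_term + l2_term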

Results

Qualitative Results

Qualitative Results. RGB channels correspond to the first three principal components computed over all features. Previous methods result in excessive smoothing or other artifacts: see, e.g., the smoothed-out cloud features of LoftUp and the feature distribution shift for the mountains with JAFAR in the first row, as well as the oversmoothing and halo artifacts of FeatUp and the Guided Filter in the third row. AnyUp produces sharp output feature maps while preserving the input feature quality.

Comparison to Prior Art

Following previous works, we evaluate the upsampled features using linear probing for semantic segmentation, monocular depth, and surface normal estimation. As prior upsamplers are limited to specific backbones, we compare AnyUp to them on the DINOv2 (ViT-S) backbone, for which all prior methods have models available. We find that AnyUp outperforms all prior specialized methods on these tasks.
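For reference, this probing setup amounts to training a single linear layer on top of frozen upsampled features; the sketch below uses placeholder dimensions (DINOv2 ViT-S features, an arbitrary class count) and is not the exact evaluation code.

import torch
import torch.nn as nn

# Linear probe for semantic segmentation on frozen upsampled features.
# Feature dimension, class count, and resolution are placeholders.
features = torch.rand(1, 384, 448, 448)      # frozen upsampled DINOv2 ViT-S features
probe = nn.Conv2d(384, 27, kernel_size=1)    # a 1x1 conv acts as the linear probe
logits = probe(features)                     # (1, 27, 448, 448); trained with cross-entropy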

Semantic Segmentation on various datasets.

Surface Normal and Monocular Depth Estimation on NYUv2.

Quantitative Results. AnyUp outperforms all prior methods when performing linear probing on semantic segmentation, monocular depth, and surface normal estimation when using DINOv2 features. Best results per column are in bold.

Notably, AnyUp also gives the best results when evaluated with a frozen DINOv2 linear probe (trained on DINOv2 patch features), indicating that AnyUp preserves the original feature space better than prior learned upsamplers while providing better upsampling quality than heuristic methods.

Qualitative Results for Linear Probing. AnyUp produces sharper and more accurate predictions than prior methods.

Generalization to Unseen Backbones

We test encoder-agnostic upsampling by training AnyUp once on DINOv2 (ViT-S) and applying it unchanged to features from different encoders and model sizes. See the teaser video for a qualitative impression. AnyUp matches or surpasses encoder-specific upsamplers on SigLIP 2 and transfers well to DINOv3. Across sizes, the expected linear-probe trend (ViT-L $\geq$ ViT-B $\geq$ ViT-S) holds regardless of the training backbone used for AnyUp.

Generalization to other model sizes.

Generalization to other model families.

Generalization Results. AnyUp trained on DINOv2 (ViT-S) generalizes well to other model sizes (left) and families (right).

Conclusion

We introduced AnyUp, a method for feature upsampling from any resolution to any resolution, which generalizes to feature representations that it was not trained on. Key technical novelties include a feature-agnostic layer, windowed attention for upsampling, and a new training strategy, which work together to achieve state-of-the-art upsampling quality. We make our code and models publicly available at this URL.

Citation

If you find our work useful in your research, please cite it as:

@article{wimmer2025anyup,
    title={AnyUp: Universal Feature Upsampling},
    author={Wimmer, Thomas and Truong, Prune and Rakotosaona, Marie-Julie and Oechsle, Michael and Tombari, Federico and Schiele, Bernt and Lenssen, Jan Eric},
    journal={arXiv preprint arXiv:2510.xxxxx},
    year={2025}
}

Acknowledgements

Thomas Wimmer is supported through the Max Planck ETH Center for Learning Systems. Jan Eric Lenssen is supported by the German Research Foundation (DFG) - 556415750 (Emmy Noether Programme, project: Spatial Modeling and Reasoning).