Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

¹Technical University of Denmark   ²Pioneer Center for AI   ³LIX, École Polytechnique, CNRS, IP Paris
CVPRW 2025

We leverage pre-trained diffusion models for Zero-Shot Video Object Segmentation by addressing four key challenges: selecting the appropriate diffusion model, determining the optimal time step, identifying the best feature extraction layer, and designing an effective strategy for computing the affinity matrix used to match features across frames.
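
To make this concrete, the sketch below adds noise to an image at a chosen time step and reads out an intermediate U-Net feature map with a forward hook. It assumes the diffusers library and the Stable Diffusion 2.1 checkpoint purely for convenience; the paper's best-performing backbone, ADM trained on ImageNet, exposes a different interface, and the time step, layer index, and zero text embedding shown here are illustrative placeholders rather than the settings identified in the paper.

```python
# Minimal sketch: extract an intermediate U-Net feature map from a pre-trained
# image diffusion model at a chosen time step and decoder layer.
# Assumptions (not from this page): diffusers + Stable Diffusion 2.1,
# TIME_STEP and LAYER_IDX as placeholders, zeros as the text conditioning.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

TIME_STEP = 50   # hypothetical time step; the paper studies which one is optimal
LAYER_IDX = 1    # hypothetical decoder (up) block to tap

features = {}
def hook(_module, _inputs, output):
    # Store the spatial feature map produced by the chosen up block.
    features["feat"] = output

unet.up_blocks[LAYER_IDX].register_forward_hook(hook)

@torch.no_grad()
def extract_features(image):  # image: (1, 3, H, W) scaled to [-1, 1]
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    t = torch.tensor([TIME_STEP], device=device)
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), t)  # q(x_t | x_0)
    # Placeholder conditioning; in practice one would encode an empty prompt
    # with the text encoder instead of passing zeros.
    cond = torch.zeros(1, 77, unet.config.cross_attention_dim, device=device)
    unet(noisy, t, encoder_hidden_states=cond)
    return features["feat"]  # (1, C, h, w) feature map for downstream matching
```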

Abstract

This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy and achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.

Segmenting videos sequentially with diffusion features

Given a memory of \( N \) past frames and their corresponding predicted segmentation masks, we segment the query frame by first calculating the affinity matrix \( \mathcal{A} \) between the query and memory frames, and then multiplying \( \mathcal{A} \) with the past predicted segmentation masks.
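
The sketch below illustrates this propagation step under simple assumptions: per-frame feature maps of shape (C, h, w), per-frame soft masks of shape (K, h, w) for K objects, cosine-similarity affinities, and a top-k softmax normalization as is common in space-time correspondence methods. The temperature and k values are illustrative, not the settings used in the paper.

```python
# Minimal sketch of affinity-based mask propagation from memory frames to the
# query frame. `temperature` and `topk` are illustrative hyper-parameters.
import torch
import torch.nn.functional as F

def propagate_masks(query_feat, memory_feats, memory_masks,
                    temperature=0.07, topk=30):
    # query_feat:   (C, h, w)            features of the query frame
    # memory_feats: list of N (C, h, w)  features of past frames
    # memory_masks: list of N (K, h, w)  predicted soft masks of past frames
    C, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(C, -1), dim=0)                            # (C, hw)
    m = F.normalize(torch.cat([f.reshape(C, -1) for f in memory_feats], dim=1),
                    dim=0)                                                        # (C, N*hw)
    labels = torch.cat([mk.reshape(mk.shape[0], -1) for mk in memory_masks],
                       dim=1)                                                     # (K, N*hw)

    affinity = (m.t() @ q) / temperature                                          # (N*hw, hw)
    # Keep the top-k memory locations per query pixel, then normalize.
    vals, idx = affinity.topk(topk, dim=0)                                        # (topk, hw)
    weights = vals.softmax(dim=0)                                                 # (topk, hw)
    gathered = labels[:, idx.reshape(-1)].reshape(-1, topk, h * w)                # (K, topk, hw)
    query_masks = (gathered * weights.unsqueeze(0)).sum(dim=1)                    # (K, hw)
    return query_masks.reshape(-1, h, w)                                          # (K, h, w)
```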

State-of-the-art Zero-Shot Video Object Segmentation comparison

We categorize state-of-the-art methods based on whether they are pre-trained on image-level or video-level data and/or fine-tuned on object segmentation annotations. We observe:

  • ADM with our MAG-Filter, enhanced by our layer and time step findings, outperforms all methods that do not use any segmentation annotations and yields state-of-the-art results.
  • Among methods trained only on image-level data, Matcher is the only approach with higher performance than ours, but it clearly benefits from the vast SA-1B dataset with 1.1 billion segmentation masks.

Cross-attention maps with Prompt Learning

(Left) Cross-attention maps, \(\mathcal{CA}\), of Stable Diffusion 2.1 before and after our prompt learning strategy. (Right) Cross-attention maps with the optimized token from the first frame.
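
For intuition only, the standalone sketch below shows how a single token embedding can be optimized on the first frame so that a cross-attention-style map highlights the target object. The feature and embedding sizes, the random stand-in features and mask, the frozen linear projections standing in for the U-Net cross-attention weights, and the binary cross-entropy objective are all illustrative assumptions; the paper's prompt learning operates on the cross-attention layers of Stable Diffusion 2.1 itself.

```python
# Conceptual sketch of first-frame prompt learning: optimize one token
# embedding so a cross-attention-style map over the first frame matches the
# target object. All shapes, stand-in tensors, and the objective are
# illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, D, h, w = 1280, 1024, 32, 32          # feature dim, token dim, spatial size

# Stand-ins for frozen quantities: first-frame features, first-frame mask,
# and the (frozen) query/key projections of one cross-attention layer.
first_frame_feat = torch.randn(C, h, w)
first_frame_mask = (torch.rand(h, w) > 0.5).float()
to_q = torch.nn.Linear(C, D, bias=False).requires_grad_(False)
to_k = torch.nn.Linear(D, D, bias=False).requires_grad_(False)

# The only trainable parameter: a single token embedding.
token = torch.nn.Parameter(torch.randn(1, D) * 0.02)
optimizer = torch.optim.Adam([token], lr=1e-2)

feats = first_frame_feat.reshape(C, -1).t()          # (hw, C) image tokens
target = first_frame_mask.reshape(-1)                # (hw,)  object mask

for step in range(200):
    q = to_q(feats)                                  # (hw, D) image queries
    k = to_k(token)                                  # (1,  D) token key
    attn = (q @ k.t()).squeeze(-1) / D ** 0.5        # (hw,)  attention logits
    loss = F.binary_cross_entropy_with_logits(attn, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Map induced by the optimized token on the first frame.
cross_attention_map = torch.sigmoid(attn).reshape(h, w).detach()
```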

BibTeX

@article{delatolas2025studying,
  title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
  author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
  journal={arXiv preprint arXiv:2504.05468},
  year={2025}
}

Acknowledgements

Vicky Kalogeiton was supported by a Hi!Paris collaborative project. Dim Papadopoulos was supported by the DFF Sapere Aude Starting Grant "ACHILLES". We would like to thank Mehmet Onurcan Kaya and Marco Schouten for insightful discussions.