Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

¹Technical University of Denmark   ²Pioneer Center for AI   ³LIX, École Polytechnique, CNRS, IP Paris
CVPRW 2025

We leverage pre-trained diffusion models for Zero-Shot Video Object Segmentation by addressing four key challenges: selecting the appropriate diffusion model, determining the optimal time step, identifying the best feature extraction layer, and designing an effective strategy for computing the affinity matrix used to match features across frames.
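
To make this concrete, the sketch below adds noise to an image at a chosen time step and reads out an intermediate U-Net feature map with a forward hook. It assumes the diffusers library and the Stable Diffusion 2.1 checkpoint purely for convenience; the paper's best-performing backbone, ADM trained on ImageNet, exposes a different interface, and the time step, layer index, and zero text embedding shown here are illustrative placeholders rather than the settings identified in the paper.

```python
# Minimal sketch: extract an intermediate U-Net feature map from a pre-trained
# image diffusion model at a chosen time step and decoder layer.
# Assumptions (not from this page): diffusers + Stable Diffusion 2.1,
# TIME_STEP and LAYER_IDX as placeholders, zeros as the text conditioning.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

TIME_STEP = 50   # hypothetical time step; the paper studies which one is optimal
LAYER_IDX = 1    # hypothetical decoder (up) block to tap

features = {}
def hook(_module, _inputs, output):
    # Store the spatial feature map produced by the chosen up block.
    features["feat"] = output

unet.up_blocks[LAYER_IDX].register_forward_hook(hook)

@torch.no_grad()
def extract_features(image):  # image: (1, 3, H, W) scaled to [-1, 1]
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    t = torch.tensor([TIME_STEP], device=device)
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), t)  # q(x_t | x_0)
    # Placeholder conditioning; in practice one would encode an empty prompt
    # with the text encoder instead of passing zeros.
    cond = torch.zeros(1, 77, unet.config.cross_attention_dim, device=device)
    unet(noisy, t, encoder_hidden_states=cond)
    return features["feat"]  # (1, C, h, w) feature map for downstream matching
```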

Abstract

This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy and achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.

Segmenting videos sequentially with diffusion features

Given a memory of \( N \) past frames and their corresponding predicted segmentation masks, we segment the query frame by first calculating the affinity matrix \( \mathcal{A} \) between the query and memory frames, and then multiplying \( \mathcal{A} \) with the past predicted segmentation masks.
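
The sketch below illustrates this propagation step under simple assumptions: per-frame feature maps of shape (C, h, w), per-frame soft masks of shape (K, h, w) for K objects, cosine-similarity affinities, and a top-k softmax normalization as is common in space-time correspondence methods. The temperature and k values are illustrative, not the settings used in the paper.

```python
# Minimal sketch of affinity-based mask propagation from memory frames to the
# query frame. `temperature` and `topk` are illustrative hyper-parameters.
import torch
import torch.nn.functional as F

def propagate_masks(query_feat, memory_feats, memory_masks,
                    temperature=0.07, topk=30):
    # query_feat:   (C, h, w)            features of the query frame
    # memory_feats: list of N (C, h, w)  features of past frames
    # memory_masks: list of N (K, h, w)  predicted soft masks of past frames
    C, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(C, -1), dim=0)                            # (C, hw)
    m = F.normalize(torch.cat([f.reshape(C, -1) for f in memory_feats], dim=1),
                    dim=0)                                                        # (C, N*hw)
    labels = torch.cat([mk.reshape(mk.shape[0], -1) for mk in memory_masks],
                       dim=1)                                                     # (K, N*hw)

    affinity = (m.t() @ q) / temperature                                          # (N*hw, hw)
    # Keep the top-k memory locations per query pixel, then normalize.
    vals, idx = affinity.topk(topk, dim=0)                                        # (topk, hw)
    weights = vals.softmax(dim=0)                                                 # (topk, hw)
    gathered = labels[:, idx.reshape(-1)].reshape(-1, topk, h * w)                # (K, topk, hw)
    query_masks = (gathered * weights.unsqueeze(0)).sum(dim=1)                    # (K, hw)
    return query_masks.reshape(-1, h, w)                                          # (K, h, w)
```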

State-of-the-art Zero-Shot Video Object Segmentation comparison

We categorize state-of-the-art methods based on whether they are pre-trained on image-level or video-level data and/or fine-tuned on object segmentation annotations. We observe:

  • ADM with our MAG-Filter, enhanced by our layer and time step findings, outperforms all methods that do not use any segmentation annotations and yields state-of-the-art results.
  • Among methods trained only on image-level data, Matcher is the only approach with higher performance than ours, but it clearly benefits from the vast SA-1B dataset with 1.1 billion segmentation masks.

Cross-attention maps with Prompt Learning

(Left) Cross-attention maps, \(\mathcal{CA}\), of Stable Diffusion 2.1 before and after our prompt learning strategy. (Right) Cross-attention maps with the optimized token from the first frame.
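
For intuition only, the standalone sketch below shows how a single token embedding can be optimized on the first frame so that a cross-attention-style map highlights the target object. The feature and embedding sizes, the random stand-in features and mask, the frozen linear projections standing in for the U-Net cross-attention weights, and the binary cross-entropy objective are all illustrative assumptions; the paper's prompt learning operates on the cross-attention layers of Stable Diffusion 2.1 itself.

```python
# Conceptual sketch of first-frame prompt learning: optimize one token
# embedding so a cross-attention-style map over the first frame matches the
# target object. All shapes, stand-in tensors, and the objective are
# illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, D, h, w = 1280, 1024, 32, 32          # feature dim, token dim, spatial size

# Stand-ins for frozen quantities: first-frame features, first-frame mask,
# and the (frozen) query/key projections of one cross-attention layer.
first_frame_feat = torch.randn(C, h, w)
first_frame_mask = (torch.rand(h, w) > 0.5).float()
to_q = torch.nn.Linear(C, D, bias=False).requires_grad_(False)
to_k = torch.nn.Linear(D, D, bias=False).requires_grad_(False)

# The only trainable parameter: a single token embedding.
token = torch.nn.Parameter(torch.randn(1, D) * 0.02)
optimizer = torch.optim.Adam([token], lr=1e-2)

feats = first_frame_feat.reshape(C, -1).t()          # (hw, C) image tokens
target = first_frame_mask.reshape(-1)                # (hw,)  object mask

for step in range(200):
    q = to_q(feats)                                  # (hw, D) image queries
    k = to_k(token)                                  # (1,  D) token key
    attn = (q @ k.t()).squeeze(-1) / D ** 0.5        # (hw,)  attention logits
    loss = F.binary_cross_entropy_with_logits(attn, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Map induced by the optimized token on the first frame.
cross_attention_map = torch.sigmoid(attn).reshape(h, w).detach()
```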

BibTeX

@article{delatolas2025studying,
  title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
  author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
  journal={arXiv preprint arXiv:2504.05468},
  year={2025}
}

Acknowledgements

Vicky Kalogeiton was supported by a Hi!Paris collaborative project. Dim Papadopoulos was supported by the DFF Sapere Aude Starting Grant "ACHILLES". We would like to thank Mehmet Onurcan Kaya and Marco Schouten for insightful discussions.