Semantic segmentation is one of the fundamental problems in computer vision. Many learning paradigms have been proposed to alleviate the annotation burden for the semantic segmentation task, such as weakly-supervised learning, one-shot learning, and few-shot learning. However, these approaches still require collecting a large amount of labeled data for training. In this paper, we aim to free the model from the need for such training samples and propose to retarget off-the-shelf self-supervised models to the semantic segmentation task. We observe that self-supervised transformers learn representations that are well suited to evaluating patch-level affinities and semantic meanings. By leveraging the patch-level affinities, accurate segmentation masks can be obtained. Meanwhile, semantic assignments can be obtained by comparing pixel representations with precomputed prototypes. Combining the two, semantic segmentation results can be derived from the model without any finetuning. With only a single example per class, our approach achieves up to 51.4% mIoU on the challenging PASCAL VOC 2012 val set without any training with mask annotations, demonstrating the effectiveness of the approach.
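A minimal sketch of the pipeline described above, assuming patch features have already been extracted from a frozen self-supervised transformer and that class prototypes have been precomputed from the single labeled example per class. The function name, the affinity threshold, and the propagation step count are illustrative placeholders, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def segment_with_prototypes(patch_feats, prototypes, affinity_thresh=0.2, n_iters=10):
    """patch_feats: (N, D) features of the image's N patches (frozen backbone).
    prototypes:  (C, D) one precomputed prototype per class.
    Returns a per-patch class assignment (N,), refined by patch-level affinities."""
    feats = F.normalize(patch_feats, dim=-1)   # unit-norm patch features
    protos = F.normalize(prototypes, dim=-1)   # unit-norm class prototypes

    # 1) Semantic assignment: cosine similarity of each patch to each prototype.
    scores = feats @ protos.T                  # (N, C)

    # 2) Patch-level affinity: pairwise cosine similarity, sparsified and
    #    row-normalized so it acts as a transition matrix over patches.
    affinity = feats @ feats.T                 # (N, N)
    affinity = torch.where(affinity > affinity_thresh, affinity,
                           torch.zeros_like(affinity))
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-6)

    # 3) Mask refinement: propagate prototype scores along the affinity graph,
    #    smoothing assignments within coherent regions, with no finetuning.
    for _ in range(n_iters):
        scores = affinity @ scores
    return scores.argmax(dim=-1)               # (N,) class index per patch
```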