1 Dalian University of Technology 2 Beijing University of Posts and Telecommunications 3 Tianjin University
*Equal contribution. †The corresponding author.
CVPR 2025 (Highlight)
Abstract
Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution.
Highlights
Our major contributions are summarized as follows:
Experiments
Throughout the paper, we use CLIP ViT-B/16 as the visual encoder and its aligned counterpart as the textual encoder, unless otherwise specified.
Few-shot Recognition
Compared with methods using synthetic images: We compare our method to prior approaches based on synthetic images in 1-/16-shot settings, including IsSynth [18], CaFo [13], DISEF [17], and DataDream [19]. Bold: best results; underlined: second-best results. †Reproduced by us.
Compared with methods not using synthetic images: We also compare our method to prior techniques that do not employ synthetic images for K-shot tasks, including prompt-tuning, adapter-tuning, encoder-tuning, and hybrid-tuning approaches. Detailed numerical results are available in our official code repository.
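The paper's tuning methods are more elaborate, but the K-shot protocol itself can be illustrated with a toy nearest-class-mean classifier over frozen features (e.g., CLIP embeddings). This is a minimal sketch, not our actual method; the function name and the idea of pooling real and synthetic support features into one array are illustrative assumptions.

```python
import numpy as np

def nearest_mean_classifier(support_feats, support_labels, query_feats):
    """Classify query features by cosine similarity to per-class mean prototypes.

    support_feats:  (N, D) support features (K real shots per class, optionally
                    pooled with features of synthetic images of the same class)
    support_labels: (N,) integer class labels for the support set
    query_feats:    (M, D) test features
    Returns (M,) predicted class labels.
    """
    classes = np.unique(support_labels)
    # One prototype per class: the mean of its support features, unit-normalized.
    protos = np.stack([support_feats[support_labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    # Unit-normalize queries so the dot product is cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return classes[np.argmax(q @ protos.T, axis=1)]
```

With real CLIP features, one would extract `support_feats` and `query_feats` from the frozen visual encoder; here plain arrays suffice to show the mechanics.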
Zero-shot Recognition
Our method also applies to zero-shot recognition. In this setting, we fine-tune using only synthetic images, without using any real images. We compare our method to previous zero-shot recognition approaches.
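At test time, zero-shot prediction follows the standard CLIP recipe: each image embedding is matched against one text embedding per class prompt by cosine similarity. The following is a minimal numpy sketch of that matching step; the arrays stand in for the outputs of the actual visual and textual encoders.

```python
import numpy as np

def zero_shot_classify(image_feats, text_feats):
    """CLIP-style zero-shot prediction via cosine similarity.

    image_feats: (M, D) image embeddings from the visual encoder
    text_feats:  (C, D) text embeddings, one per class prompt
                 (e.g., "a photo of a {class}")
    Returns (M,) predicted class indices.
    """
    im = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    tx = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # Each image is assigned the class whose prompt embedding is closest.
    return np.argmax(im @ tx.T, axis=1)
```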
Domain Generalization
We compare our method with prior approaches on the domain generalization task, where a model is trained on 16 shots per class from ImageNet (the source) and tested on four target datasets. We specifically synthesize images for the ImageNet-S and ImageNet-R datasets, while the synthetic images for ImageNet are reused for ImageNet-V2 and ImageNet-A.
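The evaluation protocol above (one model, several held-out target sets) can be sketched as a simple loop that scores a fixed predictor on each target. This is a generic illustration of the protocol, not the paper's evaluation code; the function and dataset names are assumptions.

```python
import numpy as np

def evaluate_targets(predict, targets):
    """Score one source-trained model on each held-out target dataset.

    predict: callable mapping an (M, D) feature array to (M,) predicted labels
    targets: dict mapping a dataset name (e.g., "ImageNet-V2") to a
             (features, labels) pair
    Returns a dict mapping dataset name to top-1 accuracy.
    """
    return {name: float(np.mean(predict(x) == y)) for name, (x, y) in targets.items()}
```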
Citation
@InProceedings{ImagineFSL_CVPR25,
  author    = {Yang, Haoyuan and Li, Xiaoou and Lv, Jiaming and Cheng, Xianjun and Wang, Qilong and Li, Peihua},
  title     = {ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}