ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning

Haoyuan Yang1,*, Xiaoou Li2,*, Jiaming Lv1, Xianjun Cheng1, Qilong Wang3, Peihua Li1,†

1 Dalian University of Technology   2 Beijing University of Posts and Telecommunications   3 Tianjin University

*Equal contribution. †Corresponding author.

CVPR 2025 (Highlight)

Code: https://github.com/HaoyuanYang-2023/ImagineFSL | Paper: CVF

Abstract

Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution.

However, existing approaches simply treat synthetic images as complements to real images, rather than as standalone knowledge repositories stemming from distinct foundation models. To overcome this limitation, we frame synthetic images as an imagined base set (iBase), i.e., an independent, large-scale synthetic dataset encompassing diverse concepts. Building on this perspective, we introduce ImagineFSL, a novel CLIP adaptation methodology that pretrains on iBase and then fine-tunes for downstream few-shot tasks. We find that, compared to no pretraining, both supervised and self-supervised pretraining are beneficial, with the latter providing better performance. Based on this finding, we propose an improved self-supervised method tailored for few-shot scenarios, enhancing the transferability of representations from synthetic to real image domains. Additionally, we present a systematic and scalable pipeline that employs chain-of-thought and in-context learning techniques, harnessing foundation models to automatically generate diverse, realistic images.

Validated across eleven datasets, our methods consistently outperform state-of-the-art approaches by substantial margins.

Highlights

Our major contributions are summarized as follows:

  • We frame synthetic images as standalone knowledge repositories and present a CLIP adaptation methodology that pretrains on purely synthetic images before fine-tuning for few-shot tasks. This marks a clear departure from existing one-stage fine-tuning methods that simply treat synthetic images as complements to real images.
  • We propose an improved self-supervised learning (Self-SL) method based on DINO, specifically tailored for few-shot learning (FSL). It introduces higher-order moments for image representation and employs synthetic augmentation for effective view construction (see the pooling sketch after this list).
  • We develop a systematic and scalable pipeline for synthesizing both captions and images, enabling generation of large-scale base sets for pretraining as well as task-specific datasets. Distinct from prior work, we leverage chain-of-thought (CoT) and in-context learning (ICL) techniques for diverse, realistic image generation.
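To make the second contribution concrete, below is a minimal PyTorch sketch of second-order (covariance) pooling over ViT patch tokens. It illustrates the general idea of using higher-order moments as an image representation; the function name is ours, and the exact formulation used in ImagineFSL may differ.

import torch

def second_order_pooling(tokens: torch.Tensor) -> torch.Tensor:
    # Illustrative second-order (covariance) pooling over patch tokens.
    # tokens: (B, N, D) patch-token features from the visual encoder.
    # Returns a flattened covariance descriptor of shape (B, D * D).
    # NOTE: a generic sketch of higher-order moment pooling, not the exact
    # formulation used in ImagineFSL.
    mean = tokens.mean(dim=1, keepdim=True)                       # first-order moment
    centered = tokens - mean                                      # center the tokens
    cov = centered.transpose(1, 2) @ centered / tokens.shape[1]   # (B, D, D)
    return cov.flatten(1)                                         # (B, D*D)

# Example: 8 images, 197 ViT tokens, 512-dim features
feats = torch.randn(8, 197, 512)
print(second_order_pooling(feats).shape)  # torch.Size([8, 262144])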

Experiments

Across the paper, we use CLIP ViT-B/16 as the visual encoder and its aligned counterpart as the textual encoder, unless otherwise specified.
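For reference, here is a minimal sketch of loading this backbone with the OpenAI clip package; this snippet is only an assumption about the setup, and the loading code in the official repository may differ.

import torch
import clip  # OpenAI CLIP package; an assumed loader, the official repo may differ

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-B/16 visual encoder together with its aligned Transformer text encoder.
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()  # adaptation methods typically fine-tune only lightweight modules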

Following prior work [11, 12], we benchmark few-shot and zero-shot recognition on 11 datasets: ImageNet [55], Caltech [57], Aircraft [58], Cars [59], Food [60], Pets [61], Flowers [62], DTD [63], EuroSAT [64], SUN [65], and UCF101 [66]. We assess the domain generalization task on ImageNet-V2 [67], ImageNet-S [68], ImageNet-A [69], and ImageNet-R [70].

For meta-testing, we follow common practice: we randomly sample three All-way K-shot tasks per dataset and report the average accuracy (Avg Acc) as a percentage (%).
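The sampling-and-averaging protocol can be sketched as follows; sample_k_shot and eval_task are hypothetical helpers used only to illustrate the procedure.

import random
from collections import defaultdict

def sample_k_shot(labels, k, seed):
    # Sample K training indices per class (an All-way K-shot support set).
    # labels: list of integer class labels for the whole training split.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    support = []
    for idxs in by_class.values():
        support.extend(rng.sample(idxs, k))
    return support

def average_accuracy(eval_task, labels, k):
    # Three random tasks per dataset; accuracy averaged over the three runs.
    # eval_task(support_indices) -> accuracy in [0, 1] (hypothetical callable).
    accs = [eval_task(sample_k_shot(labels, k, seed)) for seed in (0, 1, 2)]
    return 100.0 * sum(accs) / len(accs)  # Avg Acc in %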

Few-shot Recognition

Compared with methods using synthetic images

We compare with prior methods that leverage synthetic images under 1- and 16-shot settings, including IsSynth [18], CaFo [13], DISEF [17], and DataDream [19]. Bold: best results; underlined: second-best results. Some baseline results are reproduced by us.

Compared with methods only using real images

We compare our methods with prior techniques that do not employ synthetic images for K-shot tasks, including prompt tuning, adapter tuning, encoder tuning, and hybrid tuning approaches. The detailed numerical results are available in our official code repository.



Zero-shot Recognition

Our method is also applicable to zero-shot recognition. In this setting, we fine-tune using only synthetic images, without touching any real images, and compare against previous zero-shot recognition methods.
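For illustration, here is a minimal CLIP-style zero-shot inference sketch; the class names and image path are placeholders, and in our setting the encoders would first be fine-tuned on synthetic images before this inference step.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

classnames = ["airliner", "biplane", "glider"]  # placeholder class names
tokens = clip.tokenize([f"a photo of a {c}." for c in classnames]).to(device)
image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.t()  # cosine similarities
    pred = logits.argmax(dim=-1)              # predicted class index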


Domain Generalization

We compare our methods with prior approaches on the domain generalization task. In this task, a model is trained on 16 shots per class from ImageNet (source) and tested on four target datasets. We specifically synthesize images for the ImageNet-S and ImageNet-R datasets, while the synthetic images generated for ImageNet are reused for ImageNet-V2 and ImageNet-A.
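A simplified evaluation sketch for this protocol is given below; the classifier and target-dataset loaders are hypothetical placeholders, shown only to illustrate that the source-trained model is evaluated unchanged on each target set.

import torch

@torch.no_grad()
def evaluate(classifier, loader, device="cuda"):
    # Top-1 accuracy of a frozen classifier on one target-domain loader.
    correct = total = 0
    for images, labels in loader:
        logits = classifier(images.to(device))
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total

# The classifier is trained once on 16-shot ImageNet (source) and then
# evaluated unchanged on each target split (loader names are hypothetical):
# for name, loader in [("ImageNet-V2", v2_loader), ("ImageNet-S", s_loader),
#                      ("ImageNet-A", a_loader), ("ImageNet-R", r_loader)]:
#     print(name, evaluate(classifier, loader))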

 

Citation

@InProceedings{ImagineFSL_CVPR25,
author = {Yang, Haoyuan and Li, Xiaoou and Lv, Jiaming and Cheng, Xianjun and Wang, Qilong and Li, Peihua},
title = {ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2025}
}