Abstract. Vision-Language Models seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques to adapt VLMs in the presence of a single unlabeled image. The recent literature on TTA is dominated by the paradigm of prompt tuning by Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant and hidden within it. We term this approach ZERO (TTA with “zero” temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero. Remarkably, ZERO requires a single batched forward pass through the vision encoder only and no backward passes. We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably w.r.t. the state-of-the-art while being almost 10× faster and 13× more memory friendly than standard Test-Time Prompt Tuning. Thanks to its simplicity and comparatively negligible computation, ZERO can serve as a strong baseline for future work in this field.
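The recipe in the abstract (augment, predict, keep the most confident views, marginalize at zero temperature) can be sketched in a few lines. This is an illustrative, self-contained sketch, not the repository's implementation: the function name, the `keep_frac` parameter, and its default are assumptions, and the repository operates on CLIP logits rather than plain lists.

```python
import math

def zero_predict(view_logits, keep_frac=0.1):
    """Sketch of the ZERO recipe: given per-view class logits for N
    augmentations of one image, keep the most confident views and
    marginalize their zero-temperature (one-hot) predictions."""
    def softmax(row):
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        return [e / total for e in exps]

    probs = [softmax(row) for row in view_logits]
    # Retain the fraction of views whose predictions are most confident.
    k = max(1, round(keep_frac * len(probs)))
    kept = sorted(probs, key=max, reverse=True)[:k]
    # Softmax at temperature -> 0 turns each distribution into a one-hot
    # at its argmax, so marginalizing reduces to majority voting.
    votes = [0] * len(view_logits[0])
    for p in kept:
        votes[p.index(max(p))] += 1
    return votes.index(max(votes))
```

Note how the zero-temperature limit removes any dependence on the exact probability values of the retained views: only their argmaxes matter, which is exactly majority voting.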
TLDR. DON'T FORGET ABOUT MAJORITY VOTING WHEN YOU EVALUATE YOUR TTA METHOD!
If you find this work useful, please consider citing:
@article{farina2024frustratingly,
title={Frustratingly Easy Test-Time Adaptation of Vision-Language Models},
author={Farina, Matteo and Franchi, Gianni and Iacca, Giovanni and Mancini, Massimiliano and Ricci, Elisa},
journal={arXiv preprint arXiv:2405.18330},
year={2024}
}
We provide both pip requirements and a conda environment to install the dependencies of this repository, feel free to choose the one that better suits your needs. The code was tested with python 3.11.9.
Install pip requirements:
pip install -r requirements.txt
Install with conda:
conda env create -f environment.yaml
The only model weights you need to download are MaPLe's pretrained initializations. For your convenience, we provide a script to download them automatically. Simply run:
./scripts/download_maple.sh
You should now have a weights folder with the three MaPLe ImageNet pretrainings provided by the authors (weights/maple_seed1.pth, weights/maple_seed2.pth and weights/maple_seed3.pth). Please check that everything is in place. Should you have any problems, please download the weights from this link and rename them accordingly.
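To check that everything is in place, a small hypothetical helper like the following can verify the three checkpoint paths listed above (the helper itself is not part of the repository; only the filenames are taken from the instructions):

```python
from pathlib import Path

def missing_maple_weights(folder="weights"):
    """Return the MaPLe checkpoint filenames that are not yet in place."""
    expected = [f"maple_seed{i}.pth" for i in (1, 2, 3)]
    return [name for name in expected if not (Path(folder) / name).is_file()]
```

Run from the repository root, `missing_maple_weights()` should return an empty list after a successful download.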
We strongly suggest you create a datasets folder under the root of this repository and store all datasets there.
For robustness to natural distribution shifts, we consider ImageNet-1k and 4 variants:
- ImageNet-A.
- ImageNet-v2 (we use the validation set of the MatchedFrequency version).
- ImageNet-Sketch.
- ImageNet-R.
For all datasets simply download, extract and put them in the ./datasets folder. You should have the following structure:
./datasets/
| imagenet/
| | train/
| | | # class folders
| | val/
| | | # class folders
| imagenet-a/
| | # class folders
| imagenet-r/
| | # class folders
| imagenet-sketch/
| | # class folders
| imagenetv2-matched-frequency-format-val/
| | # class folders (0 to 999)
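A quick sanity check against the tree above can save a failed run later. In this sketch, only the folder names are taken verbatim from the layout; the helper itself is illustrative and not part of the repository:

```python
from pathlib import Path

# Folder names copied from the expected ./datasets layout above.
EXPECTED_DIRS = [
    "imagenet/train",
    "imagenet/val",
    "imagenet-a",
    "imagenet-r",
    "imagenet-sketch",
    "imagenetv2-matched-frequency-format-val",
]

def missing_dataset_dirs(root="./datasets"):
    """Return the expected dataset directories that are not present."""
    return [d for d in EXPECTED_DIRS if not (Path(root) / d).is_dir()]
```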
For fine-grained classification, we adopt the same splits as Zhou et al. Please refer to this page for the installation of all datasets and the JSON files for the splits. Once everything is downloaded, please organize it as follows:
./datasets/
| caltech-101/
| | images/
| | | # class folders
| | split_zhou_Caltech101.json
| dtd/
| | images/
| | | # class folders
| | split_zhou_DescribableTextures.json
| fgvc_aircraft/
| | images/
| | | # list of images
| | # a bunch of txt files
| flower102/
| | jpg/
| | | # list of images
| | split_zhou_OxfordFlowers.json
| food101/
| | images/
| | | # class folders
| | split_zhou_Food101.json
| oxford_pets/
| | images/
| | | # list of images
| | split_zhou_OxfordPets.json
| sun397/
| | images/
| | | # lettered folders ('a', 'b', 'c', etc.)
| | split_zhou_SUN397.json
| ucf101/
| | images/
| | | # class folders
| | split_zhou_UCF101.json
| eurosat/
| | images/
| | | # class folders
| | split_zhou_EuroSAT.json
IMPORTANT. At the time of developing this work, the official Stanford Cars website was unreachable. Please download images from this Kaggle page and annotations from this Drive link. You should organize files as follows:
./datasets/
| stanford_cars/
| | images/
| | | train/
| | | | # list of images
| | | test/
| | | | # list of images
| | annots/
| | | labels.csv
| | | metadata.csv
| | | split_coop.csv
The entrypoint for this repository is run.py. Please execute python run.py --help for an overview of the arguments.
We provide different bash files in scripts to run different versions of Zero:
- zero.sh runs vanilla Zero;
- zero_rlcf.sh runs the Zero variant with a smaller CLIP-ViT-B-16 and a larger CLIP-ViT-L-14.
Note that the --templates flag activates the ensemble of textual templates (+Ensemble in Tab.1 and 2 of the article).
The --maple flag uses a MaPLe pretraining (only available with CLIP-ViT-B-16).
The repository integrates with both open_clip [a] and the official CLIP implementation from OpenAI.
You can choose which CLIP model to adapt with the --arch and the --pretrained flags of run.py. While --arch is mandatory, --pretrained is optional and defaults to the OpenAI implementation (yes, the implementation itself, not only the weights).
For the experiments with CLIP-ViT-B/16 pretrained on the 2B English Subset of LAION presented in the Appendix of the manuscript, set --arch ViT-B-16 and --pretrained laion2b_s34b_b88k. While I have not tried all possible combinations of CLIP models within open_clip, feel free to explore :)
NOTE. MaPLe weights are available only for OpenAI pretrained models.
[a] Cherti, Mehdi, et al. "Reproducible scaling laws for contrastive language-image learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
Parts of this repository are based on the TPT, RLCF, MaPLe and CoOp repositories. Huge thanks to all authors! Favicon credits here.
Please do not hesitate to file an issue or to contact me at [email protected]. I'll do my best to help!