Skip to content

bethgelab/cabs

Repository files navigation

CABS: Concept-Aware Batch Sampling

A flexible data-centric approach to train better CLIP/SigLIP models and better vision encoders for VLMs.

This is the official codebase for the paper Concept-Aware Batch Sampling Improves Language-Image Pretraining by Adhiraj Ghosh, Vishaal Udandarao*, Thao Nguyen*, Matteo Farina*, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge.

Updates

  • Feb 2026: CABS accepted at CVPR 2026. See you in Denver!
  • Feb 2026: DataConcept_128M released on HuggingFace.
  • Nov 2025: CABS is out on arXiv.

Contents

CABS Checkpoints

Will be updated with more model architectures soon.

All CABS variants are trained with a filter ratio (e.g. 0.8), indicated in the table below. We find that 0.8 is the best ratio for both Diversity Maximisation (CABS-DM) and Frequency-Maximisation (CABS-FM).

Model Architecture CABS Caption Link
CLIP ViT-B/32 CABS-DM (0.8) Alt-text 🤗
CLIP ViT-B/32 CABS-DM (0.8) Recap 🤗
CLIP ViT-B/32 CABS-FM (0.8) Alt-text 🤗
CLIP ViT-B/32 CABS-FM (0.8) Recap 🤗

Model usage can be found in model_usage.py.

Note: if you require the checkpoints as .pt files, please let us know.

DataConcept Annotation Pipeline

DataConcept pipeline

The DataConcept pipeline enriches image-text datasets with sample-level annotations in three stages:

  1. RAM++ tagging — open-set image tagging using RAM++ with a Swin-L backbone. Produces per-image tags.

  2. GroundingDINO detection — grounded object detection using GroundingDINO conditioned on the RAM++ tags. Runs at multiple image scales with weighted box fusion to produce high-quality bounding boxes, confidence scores and class labels.

  3. Qwen2-VL recaptioning — recaption images using Qwen2-VL-7B-Instruct, guided by both the original alt-text and the detected concepts. The VLM is loaded via HuggingFace Transformers, so you can easily swap in a newer or larger model.

We release DataConcept_128M, a large-scale dataset annotated with this pipeline. We would appreciate the community's help in converting more image-text datasets into DataConcept-style pretraining datasets with sample-level concept annotations and bounding boxes.

Data Format

DataConcept follows the DataComp protocol and uses the WebDataset format. Each sample is stored as a set of files sharing the same key:

<key>.jpg   # image
<key>.txt   # alt-text caption
<key>.json  # metadata (bounding boxes, classes, etc.)

Samples are grouped into sequentially numbered tar shards (00000.tar, 00001.tar, ...). Both the detection and captioning scripts accept --chunk_start and --chunk_end arguments to specify the range of tar shards to process, making it easy to parallelise annotation across multiple jobs.

Setup

Create a dedicated environment for the DataConcept pipeline:

conda create -n dataconcept python=3.10
conda activate dataconcept
pip install -r requirements_dataconcept.txt

Then install GroundingDINO (requires CUDA 11.8 and GCC <= 11):

cd dataconcept/detection
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e .

See dataconcept/detection/README.md for detailed GroundingDINO installation troubleshooting.

For recaptioning with Qwen2-VL, install FlashAttention-2 for faster inference:

pip install flash-attn --no-build-isolation

Running Bounding Box Annotation

Download the required model checkpoints:

cd dataconcept/detection

python ensemble_boxes.py \
    --load_path /path/to/tar/shards \
    --chunk_start 00000 \
    --chunk_end 00010 \
    --class_jsons data/vocabulary_descriptions.json \
    --ram_checkpoint /path/to/ram_plus_swin_large_14m.pth \
    --grounded_checkpoint /path/to/groundingdino_swinb.pth \
    --config GroundingDINO/groundingdino/config/GroundingDINO_SwinB.py \
    --features_dir /path/to/features \
    --results_dir /path/to/results

Running Recaptioning

cd dataconcept/captioning

python recap.py \
    --load_path /path/to/tar/shards \
    --chunk_start 00000 \
    --chunk_end 00010 \
    --results_dir /path/to/results

Training With CABS

The training code in src/open_clip_train/ extends open_clip with CABS. We advise you first create a virtual environment:

cd cabs
python3.12 -m venv .env
source .env/bin/activate
pip install -U pip

You can then install the package for training with:

pip install -e '.[training]'

We provide example SLURM scripts in scripts/. For instance, see scripts/cabs-dm/vitb32_alt_0.8.sh for training a ViT-B/32 with CABS-DM at a filter ratio of 0.8.

Key Parameters

  • --which-sampling: Sampling strategy. Use "filter" for CABS or "iid" for standard training.
  • --cabs-dm / --cabs-freq: Enable CABS-DM (diversity maximisation) or CABS-FM (frequency maximisation).
  • --captions: Caption source. Use "alt" for original DataComp alt-text or "recap" for DataConcept recaptions.
  • --filter-ratio: Fraction of each super-batch to keep after concept-aware filtering (e.g. 0.8).
  • --batch-size: Per-GPU super-batch size (before filtering).
  • --epochs: Number of training epochs (adjusted for filtering overhead).

Relating Batch Size, Epochs, And Filter Ratio

CABS loads a larger super-batch, filters out a fraction of samples, and passes the remaining sub-batch to the model. The filter ratio controls this relationship:

sub-batch size (per GPU) = super-batch size * (1 - filter_ratio)

Example — to pass 4096 samples to the model per step on 4 GPUs (1024 per device):

super-batch size per GPU = sub-batch / (1 - filter_ratio)
                         = 1024 / (1 - 0.8)
                         = 1024 / 0.2
                         = 5120          → --batch-size 5120

Since only (1 - filter_ratio) of each super-batch is used for training, the number of epochs must be scaled up accordingly to see the same number of effective training samples:

epochs = 1 / (1 - filter_ratio)
       = 1 / 0.2
       = 5                              → --epochs 5

See params.py for all available CLI arguments.

Citation

If you find this work useful to your research, please consider citing as:

@article{ghosh2025concept,
  title={Concept-Aware Batch Sampling Improves Language-Image Pretraining},
  author={Ghosh, Adhiraj and Udandarao, Vishaal and Nguyen, Thao and Farina, Matteo and Cherti, Mehdi and Jitsev, Jenia and Oh, Sewoong and Ricci, Elisa and Schmidt, Ludwig and Bethge, Matthias},
  journal={arXiv preprint arXiv:2511.20643},
  year={2025}
}

Acknowledgements

The training code is adapted from open_clip. The DataConcept pipeline builds on RAM++, GroundingDINO, and Qwen2-VL. We train and evaluate on data from DataComp. We thank the authors of these projects and the broader open-source community for making large-scale vision-language research accessible.

Contact

Please feel free to open an issue or email us at [email protected].

About

[CVPR'26] Official codebase for "Concept-Aware Batch Sampling Improves Language-Image Pretraining"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors