Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

🔍 Overview

This project implements AVG, a new approach for text-to-image retrieval that reformulates the task as token-to-voken generation to improve both retrieval effectiveness and efficiency. Unlike traditional methods based on cross-attention (one-tower) or shared embedding spaces (two-tower), AVG incorporates fine-grained token-level interactions while maintaining fast retrieval speed. It uses a semantically aligned image tokenizer and a hybrid generative-discriminative training objective to reduce semantic misalignment and bridge the gap to the retrieval target. Experimental results show that AVG achieves a 7.53% relative improvement in effectiveness and a 4× speedup compared to the widely used two-tower baseline, CLIP.

For more details, refer to our paper accepted to SIGIR 2025: Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation.

📦 Requirements

The code is tested on Python 3.9.18, PyTorch 1.13.1 and CUDA 11.7.

You can create a conda environment with the required dependencies using the provided environment.yml file.

conda env create -f environment.yml
conda activate avg

🧾 Data

The paper uses two datasets: COCO 2014 and Flickr30k. Download the raw images and place them, along with the captions, in the RQ-VAE/data directory. The caption files we used can be found here.
Run the following command to preprocess the data and generate the image and text features:

cd RQ-VAE
bash scripts/prepare_emb.sh

You can also use the tools/generate_psudo_query.py script to generate pseudo queries that augment the dataset. The pseudo queries we used can be found here.

📈 AVG

🔵 Tokenizer (RQ-VAE)

Step 1: Train the Tokenizer (RQ-VAE)

cd RQ-VAE
bash scripts/train_rqvae.sh

The trained model will be saved in the RQ-VAE/output directory.

Step 2: Generate Voken Codes (Discrete Image Representations)

bash scripts/generate_codes.sh

This script encodes images into discrete token sequences (voken codes), which are used in the downstream Retriever and Reranker stages.
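As a rough illustration of how an image feature becomes a short sequence of discrete codes, the sketch below applies greedy residual quantization, the core idea behind RQ-VAE, to a toy vector. The codebooks, their sizes, and the function names are made up for illustration and are not taken from this repository; the real tokenizer learns its encoder and codebooks during training.

```python
from typing import List, Sequence, Tuple


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((v - e) ** 2 for v, e in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    vec: Sequence[float],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], List[float]]:
    """Quantize vec level by level; each level encodes what the previous ones missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        codes.append(idx)
        # Subtract the chosen entry so the next level sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Hypothetical two-level codebooks over 2-D features (illustrative values only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # level 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # level 2: finer correction
]
codes, leftover = residual_quantize([1.3, 0.9], toy_codebooks)
print(codes)  # one code index per level
```

Each image is thus represented by one index per quantization level, and it is this index sequence that the later stages treat as the generation target.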

🟡 Retriever (LLM)

Step 3: Prepare Retriever Training Data

Use the previously generated voken codes to construct the training data for the retriever:

cd ..
bash scripts/prepare_retriever_dataset.sh
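
To make this step concrete, one plausible way to turn voken codes into text-to-voken training pairs is sketched below. The token format (`<a_12>`, `<b_7>`, ...), the `source`/`target` field names, and the function names are assumptions chosen for illustration; check scripts/prepare_retriever_dataset.sh for the format the repository actually produces.

```python
import string
from typing import Dict, List


def codes_to_voken_string(codes: List[int]) -> str:
    """Render codes as level-tagged tokens, e.g. [12, 7] -> '<a_12><b_7>'.

    Assumes at most 26 quantization levels (one letter per level).
    """
    return "".join(
        f"<{string.ascii_lowercase[level]}_{code}>" for level, code in enumerate(codes)
    )


def build_pairs(
    captions: Dict[str, List[str]],
    voken_codes: Dict[str, List[int]],
) -> List[Dict[str, str]]:
    """Pair every caption of an image with that image's voken string."""
    pairs = []
    for image_id, texts in captions.items():
        if image_id not in voken_codes:
            continue  # skip images that were never encoded
        target = codes_to_voken_string(voken_codes[image_id])
        for text in texts:
            pairs.append({"source": text, "target": target})
    return pairs


# Hypothetical IDs, captions, and codes.
example_pairs = build_pairs(
    {"img_001": ["a dog on a beach", "a puppy running on sand"]},
    {"img_001": [12, 7, 301, 5]},
)
print(example_pairs[0])
```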

Step 4: Train the Retriever

🔹 Stage 1: Generative Training (T5 or LLaMA)

To train a T5-based retriever:

bash scripts/train_retriever_t5.sh
# Recall metrics will be automatically recorded in the log file and wandb.
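
For reference, Recall@K in text-to-image retrieval is usually the fraction of queries whose ground-truth image appears among the top K retrieved results. The snippet below is a minimal sketch of that metric with made-up query and image IDs; it assumes one ground-truth image per query and is not the repository's evaluation code.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold image is within the first k ranked results."""
    if not ranked:
        return 0.0
    hits = sum(1 for qid, images in ranked.items() if gold[qid] in images[:k])
    return hits / len(ranked)


# Hypothetical retrieval output: query ID -> image IDs, best first.
ranked_results = {
    "q1": ["img_3", "img_1", "img_9"],
    "q2": ["img_2", "img_7", "img_4"],
}
gold_images = {"q1": "img_1", "q2": "img_4"}

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(ranked_results, gold_images, k):.2f}")
```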

To train a LLaMA-based retriever:

bash scripts/finetune_retriever_llama.sh

The LLaMA-based retriever must be evaluated separately:

bash scripts/test_retriever_llama.sh

🔹 Stage 2: Discriminative Fine-Tuning (Optional)

Load the checkpoint from Stage 1 and perform discriminative training:

bash scripts/train_retriever_t5_stage2.sh
# Note: Hyperparameters are sensitive and may require tuning.

🟢 Reranker (SEED-LLaMA)

We use the SEED-LLaMA model as the reranker. A lightweight MLP head is added to score the candidate images returned by the retriever.

Step 5: Prepare Reranker Data

First, generate retrieval results for the training and test sets:

bash scripts/prepare_reranker_dataset.sh

Then, convert the voken codes to SEED-LLaMA token IDs:

bash scripts/convert_voken_to_seed.sh

Step 6: Train the Reranker

bash scripts/finetune_reranker_llama.sh

Step 7: Evaluate the Reranker

bash scripts/test_reranker_llama.sh

🟥 Note

Make sure to update all dataset-related paths in the training and evaluation scripts according to your local directory structure.
The default hyperparameters (e.g., learning rate, batch size, number of epochs) are configured for a reference dataset. You must tune them for your own dataset.
Some scripts assume a fixed number of retrieval candidates (e.g., top-50 or top-100). Adjust top_k accordingly based on your retriever output.
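
As an illustration of the last point, a candidate list can be checked and trimmed to the expected size before it is handed to the reranker. The helper name `fit_to_top_k` and the choice to raise an error on short lists are assumptions made for this sketch, not the behavior of the provided scripts.

```python
from typing import List


def fit_to_top_k(candidates: List[str], top_k: int) -> List[str]:
    """Return exactly top_k candidates, failing loudly if the retriever returned fewer."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    if len(candidates) < top_k:
        raise ValueError(
            f"expected at least {top_k} candidates, got {len(candidates)}"
        )
    return candidates[:top_k]


# Hypothetical retriever output, best candidate first.
retrieved = [f"img_{i}" for i in range(100)]
top_50 = fit_to_top_k(retrieved, 50)
print(len(top_50), top_50[0], top_50[-1])
```

Failing early on a short list makes a mismatch between the retriever output and the reranker's expected top_k visible before training starts.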

📚 Citation

If you find this code useful, please consider citing our paper:

@inproceedings{li2025avg,
  title={Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation},
  author={Yongqi Li and Hongru Cai and Wenjie Wang and Leigang Qu and Yinwei Wei and Wenjie Li and Liqiang Nie and Tat-Seng Chua},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series={SIGIR '25},
  year={2025}
}

📄 License

This project is licensed under the CC BY-NC 4.0 License.

Part of the code (./models) is based on SEED-LLaMA. We continue to honor and adhere to its licensing terms.

📬 Contact

For inquiries, feel free to reach out to Hongru Cai at [email protected].
