This repository contains the code and data for the paper "Evaluation of Large Language Models via Coupled Token Generation" by Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, and Manuel Gomez-Rodriguez, published at AISTATS 2026.
All the experiments were performed using Python 3.11. To create a virtual environment and install the project dependencies, run the following commands:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
pip install 'autoawq==0.2.8'

The directory data contains the data used for the experiments.
The directory models/models_<model_family> contains the list of models used for each one of the following model families: llama, mistral, qwen.
The directory src contains the source code for the experiments.
The directory scripts contains bash scripts that use code under src to run the experiments.
The directory notebooks contains jupyter notebooks producing the figures appearing in the paper.
The directory figures is used for saving the figures produced by the notebooks.
The directory outputs is used for saving the outputs produced by the scripts.
Our experiments use LLMs from three model families, namely the Llama, Mistral, and Qwen families.
Llama
Llama is a "gated" model, that is, it requires accepting a license to use. You can request access at: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Once you have access, you can download any model in the Llama family.
Mistral
Mistral is also a "gated" model; one must agree to share their contact information on Hugging Face to be granted access. For example, one can request access to the Mistral-7B-Instruct-v0.1 model at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1. Upon agreement, one can freely access and download the model.
Qwen
Qwen models are directly accessible through the Hugging Face platform without any prior requirements.
To be able to download the gated models, before running the scripts you need to authenticate with your Hugging Face account by running huggingface-cli login in the terminal.
Each model will be downloaded to the models folder the first time it is called from a script.
Run python3 src/merge_tokenizers.py before running the scripts to set up the joint vocabulary for each of the model families.
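To illustrate the idea of a joint vocabulary, here is a minimal, self-contained sketch: it intersects the vocabularies of two toy tokenizers and records, for each joint token, the native token id in each model. The token sets, ids, and function name below are illustrative assumptions; the actual logic lives in src/merge_tokenizers.py.

```python
# Toy "vocabularies" (token -> id) standing in for two tokenizers of one family.
# These are made-up examples, not real model vocabularies.
vocab_a = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
vocab_b = {"<s>": 0, "the": 5, "dog": 6, "sat": 7}

def joint_vocabulary(*vocabs):
    """Tokens present in every vocabulary, plus per-model id mappings."""
    shared = set(vocabs[0])
    for v in vocabs[1:]:
        shared &= set(v)
    joint = sorted(shared)  # deterministic joint ordering
    # For each model, map joint index -> that model's native token id.
    mappings = [[v[tok] for tok in joint] for v in vocabs]
    return joint, mappings

joint, (map_a, map_b) = joint_vocabulary(vocab_a, vocab_b)
# joint holds the tokens shared by both models; map_a and map_b translate
# a joint-vocabulary index back into each model's own token ids.
```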
The final output files of the experiments are provided in the respective output directory of each dataset and model family, e.g., outputs/mmlu/llama, outputs/gsm8k/llama, outputs/human_eval/llama for the Llama family.
To reproduce the figures in the paper for a given model family (one of llama, mistral, qwen), you only need to run the benchmark_datasets.ipynb notebook.
The script benchmark_datasets.sh produces the outputs of one LLM of a given model family, using one seed, given the questions from the MMLU dataset as input prompts. To reproduce all the outputs for a given model family, run the script twice for each model and dataset (for independent and coupled generation), using the seeds, system prompts, and quantization parameters provided in the script.
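To give intuition for the difference between independent and coupled generation, the following is a conceptual sketch of coupling at a single token position via the Gumbel-max trick: both models draw their next token using the same shared noise, so identical next-token distributions yield identical tokens. This is a simplified illustration under assumed toy distributions, not the repository's exact sampling code.

```python
import math
import random

def gumbel_max_sample(probs, gumbels):
    """Sample argmax_i [log(p_i) + g_i]; sharing g_i across models couples them."""
    return max(range(len(probs)), key=lambda i: math.log(probs[i]) + gumbels[i])

vocab_size = 4
rng = random.Random(0)  # the seed (and hence the noise) shared by both models
gumbels = [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

# Toy next-token distributions over a joint vocabulary of size 4.
p_model_a = [0.1, 0.6, 0.2, 0.1]
p_model_b = [0.1, 0.6, 0.2, 0.1]  # identical here, so coupling forces agreement

tok_a = gumbel_max_sample(p_model_a, gumbels)
tok_b = gumbel_max_sample(p_model_b, gumbels)
# With shared Gumbel noise, each model still samples from its own distribution,
# but their draws are maximally correlated; independent generation would instead
# draw fresh noise per model.
```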
The final output files of the experiment are provided in the outputs/LMSYS directory. To reproduce the figures in the paper, you only need to run the lmsys.ipynb notebook.
The script lmsys.sh produces the outputs of all LLMs to the questions from the dataset in data/processed/LMSYS/questions.json, using independent and coupled generation with 10 seeds each. Run the script to reproduce the outputs, which will be saved in the outputs/LMSYS directory. The results of the pairwise comparisons of these outputs by GPT-4o-2024-11-20 are provided in the outputs/LMSYS directory.
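As a hedged sketch of how one might aggregate such pairwise comparisons into win rates, the snippet below uses a made-up record schema; the real files under outputs/LMSYS may store comparisons differently.

```python
from collections import Counter

# Hypothetical comparison records for illustration only.
comparisons = [
    {"model_a": "llama", "model_b": "qwen", "winner": "model_a"},
    {"model_a": "llama", "model_b": "qwen", "winner": "model_b"},
    {"model_a": "llama", "model_b": "qwen", "winner": "model_a"},
]

# Resolve each "winner" field to the winning model's name and count wins.
wins = Counter(c[c["winner"]] for c in comparisons)
win_rate = {m: wins[m] / len(comparisons) for m in ("llama", "qwen")}
```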
If you have questions about the code, identify potential bugs, or would like us to include additional functionality, feel free to open an issue or contact Ivi Chatzi or Eleni Straitouri.
If you use parts of the code in this repository for your own research, please consider citing:
@article{benz2025evaluation,
  title={Evaluation of Large Language Models via Coupled Token Generation},
  author={Nina Corvelo Benz and Stratis Tsirtsis and Eleni Straitouri and Ivi Chatzi and Ander Artola Velasco and Suhas Thejaswi and Manuel Gomez-Rodriguez},
  journal={arXiv preprint arXiv:2502.01754},
  year={2025}
}