🚀 Adversarial RL Framework for Training LLMs for Unit Test Generation
Dongjun Lee¹, Changho Hwang², Kimin Lee¹
¹KAIST AI, ²Microsoft Research
UTRL introduces a reinforcement learning framework that trains large language models (LLMs) to generate high-quality unit tests. Unlike traditional approaches, UTRL requires only a dataset of task instruction-code pairs to train an LLM as a unit test generator.
UTRL iterates over the following steps:
- Training the unit test generator: the unit test generator learns to generate tests that distinguish code produced by the code generator LLM from the ground-truth code solution.
- Training the code generator: the code generator learns to pass the increasingly sophisticated unit tests produced by the unit test generator.
Unit tests enable (1) reliable software development, (2) quantitative evaluation of code, and (3) test-time scaling and RLVR in the code generation domain. However, implementing reliable unit tests is labor-intensive and requires sophisticated code reasoning, and algorithms for training LLMs to generate unit tests remain underexplored compared to code generation:
- 🧠 Lack of training data with high-quality unit test annotations
- 🔍 Lack of research on algorithms for training LLMs for unit test generation
- ⚡ Lack of reliable evaluation protocols for measuring the quality of LLM-generated unit tests
To evaluate the quality of unit tests, we introduce two evaluation metrics: Best-of-N Improvement and Unit Test Fidelity.
1. Best-of-N Improvement
- Measures whether generated unit tests can identify the highest-quality code solution among candidates of varying quality.
- Process: generate N candidate solutions per programming task using an LLM → select the best one using the generated unit tests → evaluate the selected code against the ground-truth unit tests.
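The selection step above can be sketched as follows (a minimal toy version; names and data are illustrative, not the evaluation code in this repo):

```python
# Toy best-of-N selection (illustrative). Each "test" is a callable that
# returns True if a candidate solution behaves correctly on one case.
def pass_rate(solution, tests):
    return sum(t(solution) for t in tests) / len(tests)

def best_of_n(candidates, generated_tests):
    # Select the candidate with the highest pass rate on generated tests.
    return max(candidates, key=lambda c: pass_rate(c, generated_tests))

# N = 3 candidate solutions for "square a number"; only the first is correct.
candidates = [lambda x: x * x, lambda x: x + x, lambda x: x ** 3]
generated_tests = [lambda f: f(2) == 4, lambda f: f(3) == 9]
ground_truth_tests = [lambda f: f(n) == n * n for n in range(-5, 6)]

best = best_of_n(candidates, generated_tests)
# Best-of-N improvement: the selected candidate's ground-truth pass rate.
print(pass_rate(best, ground_truth_tests))  # 1.0
```

Discriminative generated tests matter here: if they failed to reject the flawed candidates, the selected solution could score poorly on the ground-truth tests.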
2. Unit Test Fidelity
- Quantifies how closely generated unit tests approximate ground-truth unit tests.
- Computed as Spearman's rank correlation between code score vectors (evaluated with generated unit tests vs. ground-truth unit tests).
- Higher correlation = better approximation of the comprehensive ground-truth unit tests.
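The metric can be sketched in pure Python (an illustrative re-implementation of Spearman's rank correlation; in practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose):

```python
# Toy Unit Test Fidelity (illustrative): Spearman correlation between
# code scores under generated tests and under ground-truth tests.
def ranks(values):
    # 1-based ranks, averaging over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Pearson correlation computed on the rank vectors.
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Scores of 4 candidate solutions under generated vs. ground-truth tests:
# the absolute scores differ, but the induced ranking is identical.
gen_scores = [0.9, 0.5, 0.7, 0.1]
gt_scores = [1.0, 0.4, 0.6, 0.0]
print(spearman(gen_scores, gt_scores))  # 1.0
```

A fidelity of 1.0 means the generated tests rank candidate solutions exactly as the ground-truth tests do, even if the raw pass rates differ.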
We evaluate the quality of generated unit tests on 945 competitive programming tasks from TACO and 511 tasks from LiveCodeBench-v2.
Key Findings:
- UTRL surpasses supervised fine-tuning: models trained with UTRL (without ground-truth unit tests) generate higher-quality unit tests than models trained with SFT on ground-truth unit tests, demonstrating that learning to detect LLM-generated code solutions is more effective than directly imitating ground-truth unit tests.
- Small models trained via UTRL outperform large closed-source models: Qwen3-4B trained with UTRL generates higher-quality unit tests than GPT-4o and GPT-4.1, achieving 14.9% accuracy compared to GPT-4o's 10.6% and GPT-4.1's 13.7% (when evaluating Qwen3-8B code generation).
- Unit tests generated by a small UTRL-trained model improve larger code generators: unit tests generated by the UTRL-trained 4B model effectively improve code generation of much larger models (32B, GPT-4o).
Please check our paper for more details.
- GPU: NVIDIA H100 80GB + CUDA 12.4 (recommended)
- OS: Linux (Ubuntu 20.04+)
# Clone the repository
git clone https://github.com/dgjun32/UTRL.git
cd UTRL
# Create conda environment
conda create -n utrl python==3.10
conda activate utrl
# install dependencies for verl
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..
# install additional dependencies
pip install -r requirements.txt
# Login to Hugging Face for model access
huggingface-cli login
# Login to Weights & Biases for experiment tracking
wandb login
We provide checkpoints fine-tuned via UTRL on instruction-code pairs from the TACO dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("dgjun32/UTRL-4B")
tokenizer = AutoTokenizer.from_pretrained("dgjun32/UTRL-4B")
python -m inference.generate_unit_tests \
--test_generation_model ${checkpoint_path} \
--target_path ${signature of the checkpoint (e.g., qwen3_4b_utrl)} \
--dataset ${dataset for evaluation (taco or livecodebench)} \
--split test
Please refer to scripts/run_inference.sh.
You may download the evaluation sets built upon TACO and LiveCodeBench-v2 to measure Best-of-N Improvement and Unit Test Fidelity.
from datasets import load_dataset
taco_dataset = load_dataset('dgjun32/UTRL_TACO_EVAL')
livecodebench_dataset = load_dataset('dgjun32/UTRL_LCB_EVAL')
For Best-of-N Improvement, run the commands below:
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ${signature of the checkpoint (e.g., qwen3_4b_utrl)} \
--solution_generation_model qwen3_4b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ${signature of the checkpoint (e.g., qwen3_4b_utrl)} \
--solution_generation_model qwen3_8b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ${signature of the checkpoint (e.g., qwen3_4b_utrl)} \
--solution_generation_model qwen3_14b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ${signature of the checkpoint (e.g., qwen3_4b_utrl)} \
--solution_generation_model gpt_4o \
--best_of_n \
--n_samples 32
For the Best-of-N Improvement induced by ground-truth unit tests (required for computing Unit Test Fidelity), run the following commands:
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ground_truth \
--solution_generation_model qwen3_4b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ground_truth \
--solution_generation_model qwen3_8b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ground_truth \
--solution_generation_model qwen3_14b \
--best_of_n \
--n_samples 32
python -m evaluation.evaluate_bon_solution --benchmark taco \
--test_generation_model ground_truth \
--solution_generation_model gpt_4o \
--best_of_n \
--n_samples 32
For Unit Test Fidelity, run the command below:
python -m evaluation.compute_ut_fidelity \
--benchmark taco \
--test_generation_model ${signature of the checkpoint (e.g., qwen3_4b_utrl)}
We will update this repository with training scripts based on verl.
If you find UTRL useful in your research, please consider citing our work:
@article{lee2025learning,
title={Learning to generate unit test via adversarial reinforcement learning},
author={Lee, Dongjun and Hwang, Changho and Lee, Kimin},
journal={arXiv preprint arXiv:2508.21107},
year={2025}
}
