Teng Pan1,2,
Yuchen Yan1,
Zixuan Wang1,2,
Ruiqing Zhang2,
Guiyang Hou1,
Wenqi Zhang1,
Weiming Lu1,
Jun Xiao1,
Yongliang Shen1,†
1Zhejiang University,
2Baidu Inc.
Preprint.
†Corresponding Author
- 2026.04.06: Our paper has been accepted at ACL 2026 Main Conference 🎉🎉🎉!
- 2026.03.18: We release our paper.
Label-free reinforcement learning for LLMs typically adopts majority voting to generate pseudo-labels, but suffers from a consensus trap—output diversity collapses during training, leading the model to confidently reinforce systematic self-consistent errors. To address this issue, we propose CoVerRL, a novel framework that unifies generator and verifier roles into a single model via multi-turn reinforcement learning, enabling their mutual bootstrapping and co-evolution without external ground-truth labels.
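To make the consensus trap concrete, here is a minimal illustrative sketch (not the paper's actual reward code) of how majority voting turns sampled rollouts into pseudo-rewards, and why reward accuracy collapses once a model confidently converges on a wrong answer:

```python
from collections import Counter

def majority_pseudo_label(answers):
    # Pseudo-label = the most frequent final answer among sampled rollouts.
    return Counter(answers).most_common(1)[0][0]

def reward_accuracy(rollouts, gold):
    # Fraction of rollouts whose pseudo-reward (agreement with the majority
    # answer) matches the true reward (agreement with the gold answer).
    pseudo = majority_pseudo_label(rollouts)
    agree = sum((a == pseudo) == (a == gold) for a in rollouts)
    return agree / len(rollouts)

# While rollouts are diverse and the majority is correct, the pseudo-reward
# tracks the true reward. Once diversity collapses onto a shared wrong
# answer, every rollout is rewarded for the error and reward accuracy drops.
```

For example, with rollouts `["4", "4", "5", "4"]` and gold answer `"4"`, the pseudo-reward agrees with the true reward for every rollout; if all rollouts collapse to `"4"` while the gold answer is `"5"`, reward accuracy falls to zero even though the model is maximally self-consistent.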
Our contributions can be summarized as follows:
- We identify the consensus trap in majority-voting-based label-free RL, where diversity collapse degrades reward accuracy as models become overconfident in systematic errors, explaining why such methods eventually stagnate.
- We propose CoVerRL, a co-evolution framework that unifies generation and verification into a multi-turn RL process, enabling mutual bootstrapping in which each capability supervises the improvement of the other without external labels.
- We validate CoVerRL across the Qwen and Llama model families, demonstrating 4-6% improvements over label-free baselines on mathematical reasoning benchmarks while producing verifiers that generalize well to held-out evaluation.
This repository is based on the verl v0.6.x branch. Please refer to verl installation for setup instructions. Additionally, install Math-Verify as the verifier: `pip install math-verify`. We also recommend installing swanlab or wandb to visualize the training dynamics: `pip install swanlab`.
Before running a script, set the model path in it:
BACKBONE="your backbone"
BACKBONE_PATH="path to your backbone"
bash recipe/cover_rl/scripts/gpu/ttrl_baseline.sh
bash recipe/cover_rl/scripts/gpu/cover_rl.sh

If you want to run on NPU, we also provide scripts in the "npu" folder; feel free to use them.
The training data is stored in verl/recipe/cover_rl/data/MATH-7500/math7500_train.parquet, and the validation data is stored in un. If you want to prepare your own dataset, refer to verl/recipe/cover_rl/data/preprocess.py.
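If you build a custom dataset, the parquet rows can be assembled along the following lines. Note that the column names below (`data_source`, `prompt`, `ability`, `extra_info`) are assumptions modeled on common verl recipes, not a guaranteed match; treat preprocess.py as the authoritative schema.

```python
import pandas as pd

def build_records(problems):
    # Hypothetical schema sketch: field names are illustrative assumptions,
    # not necessarily the exact columns preprocess.py emits.
    records = []
    for i, problem in enumerate(problems):
        records.append({
            "data_source": "math",  # dataset tag (assumed)
            "prompt": [{"role": "user", "content": problem}],  # chat-format prompt
            "ability": "math",
            "extra_info": {"index": i},
        })
    return records

def write_dataset(problems, path):
    # Serialize to parquet so verl's dataloader can consume the file.
    pd.DataFrame(build_records(problems)).to_parquet(path)
```

A quick usage example: `write_dataset(["What is 2+2?"], "my_train.parquet")` produces a one-row parquet file in this sketch's schema.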
Results are reported as Acc.@first / Acc.@final. CoVerRL consistently outperforms TTRL across all models and benchmarks, achieving average improvements of 5.7%, 5.9%, and 4.7% in Acc.@final for the three models respectively.
| Model | Method | MATH500 | AMC | AIME24 | GPQA | Average |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | Base Model | 53.5 / 53.3 | 24.6 / 24.5 | 3.8 / 3.3 | 27.5 / 27.3 | 27.4 / 27.1 |
| | TTRL | 65.1 / 65.0 | 31.1 / 30.9 | 5.2 / 5.2 | 30.9 / 30.7 | 33.1 / 33.0 |
| | CoVerRL (Ours) | 69.0 / 71.9 | 36.0 / 38.6 | 9.8 / 10.6 | 32.9 / 33.6 | 36.9 / 38.7 |
| | Δ | +3.9 / +6.9 | +4.9 / +7.7 | +4.6 / +5.4 | +2.0 / +2.9 | +3.8 / +5.7 |
| Llama-3.2-3B-Instruct | Base Model | 42.7 / 41.0 | 17.0 / 15.7 | 4.6 / 5.0 | 26.9 / 26.1 | 22.8 / 22.0 |
| | TTRL | 52.6 / 52.2 | 23.8 / 23.3 | 13.8 / 14.0 | 29.8 / 28.2 | 30.0 / 29.4 |
| | CoVerRL (Ours) | 55.9 / 59.3 | 28.3 / 32.2 | 16.3 / 16.9 | 32.3 / 32.6 | 33.2 / 35.3 |
| | Δ | +3.3 / +7.1 | +4.5 / +8.9 | +2.5 / +2.9 | +2.5 / +4.4 | +3.2 / +5.9 |
| Qwen2.5-7B | Base Model | 50.1 / 51.4 | 25.5 / 26.4 | 5.2 / 6.5 | 29.9 / 29.7 | 27.7 / 28.5 |
| | TTRL | 73.8 / 74.2 | 42.2 / 42.2 | 12.7 / 12.5 | 35.8 / 35.6 | 41.1 / 41.1 |
| | CoVerRL (Ours) | 76.8 / 79.6 | 47.6 / 49.2 | 14.6 / 17.1 | 36.2 / 37.2 | 43.8 / 45.8 |
| | Δ | +3.0 / +5.4 | +5.4 / +7.0 | +1.9 / +4.6 | +0.4 / +1.6 | +2.7 / +4.7 |
The figure below shows the training dynamics of reward/label accuracy for TTRL and CoVerRL on Qwen3-1.7B-Base. CoVerRL keeps reward accuracy at around 90% and boosts label accuracy via generator-verifier co-evolution, while TTRL suffers reward accuracy degradation and stagnant label accuracy due to the consensus trap.
If you find our work helpful, please consider citing it:
@misc{pan2026coverrlbreakingconsensustrap,
title={CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution},
      author={Teng Pan and Yuchen Yan and Zixuan Wang and Ruiqing Zhang and Guiyang Hou and Wenqi Zhang and Weiming Lu and Jun Xiao and Yongliang Shen},
year={2026},
eprint={2603.17775},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.17775},
}
The RL training stack is built on top of the excellent verl framework. Many thanks to the verl team for open-sourcing the infrastructure that this project extends.
If you have any questions, please contact us by email: [email protected]

