Teng Pan1,2,
Yuchen Yan1,
Zixuan Wang1,2,
Ruiqing Zhang2,
Guiyang Hou1,
Wenqi Zhang1,
Weiming Lu1,
Jun Xiao1,
Yongliang Shen1,†
1Zhejiang University,
2Baidu Inc.
Preprint.
†Corresponding Author
- 2026.04.06: Our paper has been accepted at ACL 2026 Main Conference 🎉🎉🎉!
- 2026.03.18: We release our paper.
Label-free reinforcement learning for LLMs typically adopts majority voting to generate pseudo-labels, but suffers from a consensus trap—output diversity collapses during training, leading the model to confidently reinforce systematic self-consistent errors. To address this issue, we propose CoVerRL, a novel framework that unifies generator and verifier roles into a single model via multi-turn reinforcement learning, enabling their mutual bootstrapping and co-evolution without external ground-truth labels.
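To make the consensus trap concrete, here is a minimal illustrative sketch (not the paper's actual reward code) of how majority voting turns sampled rollouts into pseudo-rewards, and why reward accuracy collapses once a model confidently converges on a wrong answer:

```python
from collections import Counter

def majority_pseudo_label(answers):
    # Pseudo-label = the most frequent final answer among sampled rollouts.
    return Counter(answers).most_common(1)[0][0]

def reward_accuracy(rollouts, gold):
    # Fraction of rollouts whose pseudo-reward (agreement with the majority
    # answer) matches the true reward (agreement with the gold answer).
    pseudo = majority_pseudo_label(rollouts)
    agree = sum((a == pseudo) == (a == gold) for a in rollouts)
    return agree / len(rollouts)

# While rollouts are diverse and the majority is correct, the pseudo-reward
# tracks the true reward. Once diversity collapses onto a shared wrong
# answer, every rollout is rewarded for the error and reward accuracy drops.
```

For example, with rollouts `["4", "4", "5", "4"]` and gold answer `"4"`, the pseudo-reward agrees with the true reward for every rollout; if all rollouts collapse to `"4"` while the gold answer is `"5"`, reward accuracy falls to zero even though the model is maximally self-consistent.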
Our contributions can be summarized as follows:
- We identify the consensus trap in majority-voting-based label-free RL, where diversity collapse degrades reward accuracy as models become overconfident in systematic errors, explaining why such methods eventually stagnate.
- We propose CoVerRL, a co-evolution framework that unifies generation and verification into a multi-turn RL process, enabling mutual bootstrapping in which each capability supervises the improvement of the other without external labels.
- We validate CoVerRL across the Qwen and Llama model families, demonstrating 4-6% improvements over label-free baselines on mathematical reasoning benchmarks while producing verifiers that generalize well to held-out evaluation.
This repository is based on the verl v0.6.x branch. Please refer to verl installation for setup instructions. Additionally, install Math-Verify as the verifier: `pip install math-verify`. We also recommend installing swanlab or wandb to visualize the training dynamics: `pip install swanlab`.
Before running a script, set the model path in it:
BACKBONE="your backbone"
BACKBONE_PATH="path to your backbone"
bash recipe/cover_rl/scripts/gpu/ttrl_baseline.sh
bash recipe/cover_rl/scripts/gpu/cover_rl.sh

If you want to run on NPU, we also provide scripts in the "npu" folder; feel free to use them.
The training data is stored in verl/recipe/cover_rl/data/MATH-7500/math7500_train.parquet, and the validation data is stored in un. If you want to prepare your own dataset, refer to verl/recipe/cover_rl/data/preprocess.py.
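If you build a custom dataset, the parquet rows can be assembled along the following lines. Note that the column names below (`data_source`, `prompt`, `ability`, `extra_info`) are assumptions modeled on common verl recipes, not a guaranteed match; treat preprocess.py as the authoritative schema.

```python
import pandas as pd

def build_records(problems):
    # Hypothetical schema sketch: field names are illustrative assumptions,
    # not necessarily the exact columns preprocess.py emits.
    records = []
    for i, problem in enumerate(problems):
        records.append({
            "data_source": "math",  # dataset tag (assumed)
            "prompt": [{"role": "user", "content": problem}],  # chat-format prompt
            "ability": "math",
            "extra_info": {"index": i},
        })
    return records

def write_dataset(problems, path):
    # Serialize to parquet so verl's dataloader can consume the file.
    pd.DataFrame(build_records(problems)).to_parquet(path)
```

A quick usage example: `write_dataset(["What is 2+2?"], "my_train.parquet")` produces a one-row parquet file in this sketch's schema.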
Results are reported as Acc.@first / Acc.@final. CoVerRL consistently outperforms TTRL across all models and benchmarks, achieving average improvements of 5.7%, 5.9%, and 4.7% in Acc.@final for the three models respectively.
| Model | Method | MATH500 | AMC | AIME24 | GPQA | Average |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | Base Model | 53.5 / 53.3 | 24.6 / 24.5 | 3.8 / 3.3 | 27.5 / 27.3 | 27.4 / 27.1 |
| | TTRL | 65.1 / 65.0 | 31.1 / 30.9 | 5.2 / 5.2 | 30.9 / 30.7 | 33.1 / 33.0 |
| | CoVerRL (Ours) | 69.0 / 71.9 | 36.0 / 38.6 | 9.8 / 10.6 | 32.9 / 33.6 | 36.9 / 38.7 |
| | Δ | +3.9 / +6.9 | +4.9 / +7.7 | +4.6 / +5.4 | +2.0 / +2.9 | +3.8 / +5.7 |
| Llama-3.2-3B-Instruct | Base Model | 42.7 / 41.0 | 17.0 / 15.7 | 4.6 / 5.0 | 26.9 / 26.1 | 22.8 / 22.0 |
| | TTRL | 52.6 / 52.2 | 23.8 / 23.3 | 13.8 / 14.0 | 29.8 / 28.2 | 30.0 / 29.4 |
| | CoVerRL (Ours) | 55.9 / 59.3 | 28.3 / 32.2 | 16.3 / 16.9 | 32.3 / 32.6 | 33.2 / 35.3 |
| | Δ | +3.3 / +7.1 | +4.5 / +8.9 | +2.5 / +2.9 | +2.5 / +4.4 | +3.2 / +5.9 |
| Qwen2.5-7B | Base Model | 50.1 / 51.4 | 25.5 / 26.4 | 5.2 / 6.5 | 29.9 / 29.7 | 27.7 / 28.5 |
| | TTRL | 73.8 / 74.2 | 42.2 / 42.2 | 12.7 / 12.5 | 35.8 / 35.6 | 41.1 / 41.1 |
| | CoVerRL (Ours) | 76.8 / 79.6 | 47.6 / 49.2 | 14.6 / 17.1 | 36.2 / 37.2 | 43.8 / 45.8 |
| | Δ | +3.0 / +5.4 | +5.4 / +7.0 | +1.9 / +4.6 | +0.4 / +1.6 | +2.7 / +4.7 |
The figure below shows the training dynamics of reward/label accuracy for TTRL and CoVerRL on Qwen3-1.7B-Base. CoVerRL keeps reward accuracy at around 90% and boosts label accuracy via generator-verifier co-evolution, while TTRL suffers reward accuracy degradation and stagnant label accuracy due to the consensus trap.
If you find our work helpful, please consider citing it:
@misc{pan2026coverrlbreakingconsensustrap,
title={CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution},
      author={Teng Pan and Yuchen Yan and Zixuan Wang and Ruiqing Zhang and Guiyang Hou and Wenqi Zhang and Weiming Lu and Jun Xiao and Yongliang Shen},
year={2026},
eprint={2603.17775},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.17775},
}
The RL training stack is built on top of the excellent verl framework. Many thanks to the verl team for open-sourcing the infrastructure that this project extends.
If you have any questions, please contact us by email: [email protected]

