We propose LaSeR, a lightweight and effective algorithm that simultaneously optimizes both the reasoning and self-rewarding capabilities of LLMs at minimal additional cost, by introducing a simple MSE loss into the standard RLVR objective. The optimized self-rewarding scores can serve as auxiliary reward signals at both training and test time to enhance model performance.
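As a rough illustration of the idea (not the exact implementation in this repo), the sketch below attaches such an auxiliary MSE term to an RLVR loss: the self-rewarding score is taken as the probability of a pre-specified token at the last position of each rollout and is regressed toward the 0/1 verifier reward. The names `laser_auxiliary_loss`, `reward_token_id`, `mse_coef`, and `rlvr_loss` are placeholders; the exact score parameterization and loss weighting follow the paper and the verl-based training code.

```python
import torch
import torch.nn.functional as F

def laser_auxiliary_loss(last_token_logits: torch.Tensor,
                         reward_token_id: int,
                         verifier_rewards: torch.Tensor) -> torch.Tensor:
    """Sketch of the auxiliary MSE term added on top of the RLVR objective.

    last_token_logits: [batch, vocab] logits at the final position of each rollout
    reward_token_id:   id of the pre-specified self-rewarding token (placeholder)
    verifier_rewards:  [batch] 0/1 rewards from the rule-based verifier
    """
    # Self-rewarding score: probability the model assigns to the pre-specified token.
    self_reward = torch.softmax(last_token_logits, dim=-1)[:, reward_token_id]
    # Regress the self-rewarding score toward the verifiable reward.
    return F.mse_loss(self_reward, verifier_rewards.float())

# Total objective (schematically): the standard RLVR loss (e.g., GRPO) plus the weighted MSE term.
# total_loss = rlvr_loss + mse_coef * laser_auxiliary_loss(last_token_logits,
#                                                          reward_token_id,
#                                                          verifier_rewards)
```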
- [2025.10.16] We release our paper on arXiv, along with the source code and checkpoints.
| Name | Link |
|---|---|
| Octothinker-3B-Short-LaSeR | hf model |
| Qwen2.5-7B-LaSeR | hf model |
| ORZ-7B-LaSeR | hf model |
The evaluation data is in the data/ directory. The processed training data can be downloaded from here.
Our code is mainly based on verl (v0.5.0). To prepare the environment, please follow these steps:
```bash
conda create -n verl python==3.10
conda activate verl
cd verl/
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install math-verify
```

We provide example scripts for GRPO and LaSeR training in examples/grpo_trainer/. Before running, please download the related datasets to the appropriate locations.
For experiments on Qwen2.5-7B-Base and ORZ-7B:
```bash
cd verl/
bash examples/grpo_trainer/run_qwen2_5_7b.sh
```

For experiments on OctoThinker-3B-Short-Base:
```bash
cd verl/
bash examples/grpo_trainer/run_octothinker_3b.sh
```

You can modify these scripts to adapt the training parameters and paths to your own setup. The scripts include all necessary hyper-parameters; detailed explanations can be found in verl/verl/trainer/config/actor/actor.yaml.
Make sure to set your WANDB_API_KEY if you want to use Weights & Biases logging.
Our evaluation code is in the src/ folder.
Ideally, the self-rewarding score could be computed directly by performing one additional forward pass after the model generates the <EOS> token, obtaining the prediction probability of the pre-specified self-rewarding token. However, this requires modifying the underlying sampling logic of vLLM. In the current version, we instead concatenate the pre-specified token after each fully generated solution and run a separate forward pass to obtain the self-rewarding scores. We welcome the community to contribute a PR that adapts vLLM to our method and enables more efficient self-rewarding!
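For illustration only, below is a minimal Hugging Face Transformers sketch of this concatenate-and-forward workaround; the checkpoint path and the self-rewarding token are placeholders, and the released scripts in src/ implement the actual evaluation procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; substitute one of the released LaSeR checkpoints.
model_path = "path/to/laser_checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)

def self_rewarding_score(prompt: str, solution: str, reward_token_id: int) -> float:
    """Append the pre-specified self-rewarding token after the finished solution and
    read its predicted probability from a separate forward pass.

    Assumes `solution` already ends with the model's <EOS> token text and that the
    self-rewarding token is a single token with id `reward_token_id` (placeholder).
    """
    ids = tokenizer(prompt + solution, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.tensor([[reward_token_id]])], dim=-1)
    with torch.no_grad():
        logits = model(ids).logits  # [1, seq_len, vocab]
    # The logits at the second-to-last position predict the appended self-rewarding token.
    probs = torch.softmax(logits[0, -2], dim=-1)
    return probs[reward_token_id].item()
```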
For now, users can run the following script to perform evaluation on the reasoning and self-rewarding capabilities of the target model:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 sh scripts/run_eval_math.sh
```

Our training code is mainly based on verl. Our training data is adapted from DeepMath-103K. We sincerely thank the contributors for open-sourcing their work!
If you find our work helpful, please kindly cite it as:
```bibtex
@article{yang2025laser,
  title={LaSeR: Reinforcement Learning with Last-Token Self-Rewarding},
  author={Yang, Wenkai and Liu, Weijie and Xie, Ruobing and Guo, Yiju and Wu, Lulu and Yang, Saiyong and Lin, Yankai},
  journal={arXiv preprint arXiv:2510.14943},
  year={2025}
}
```