We propose LaSeR, a lightweight and effective algorithm that simultaneously optimizes both the reasoning and self-rewarding capabilities of LLMs at minimal additional cost, by introducing a simple MSE loss into the standard RLVR objective. The optimized self-rewarding scores can serve as auxiliary reward signals at both training and test time to enhance model performance.
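As a rough illustration of the idea (not the exact implementation in this repo), the sketch below attaches such an auxiliary MSE term to an RLVR loss: the self-rewarding score is taken as the probability of a pre-specified token at the last position of each rollout and is regressed toward the 0/1 verifier reward. The names `laser_auxiliary_loss`, `reward_token_id`, `mse_coef`, and `rlvr_loss` are placeholders; the exact score parameterization and loss weighting follow the paper and the verl-based training code.

```python
import torch
import torch.nn.functional as F

def laser_auxiliary_loss(last_token_logits: torch.Tensor,
                         reward_token_id: int,
                         verifier_rewards: torch.Tensor) -> torch.Tensor:
    """Sketch of the auxiliary MSE term added on top of the RLVR objective.

    last_token_logits: [batch, vocab] logits at the final position of each rollout
    reward_token_id:   id of the pre-specified self-rewarding token (placeholder)
    verifier_rewards:  [batch] 0/1 rewards from the rule-based verifier
    """
    # Self-rewarding score: probability the model assigns to the pre-specified token.
    self_reward = torch.softmax(last_token_logits, dim=-1)[:, reward_token_id]
    # Regress the self-rewarding score toward the verifiable reward.
    return F.mse_loss(self_reward, verifier_rewards.float())

# Total objective (schematically): the standard RLVR loss (e.g., GRPO) plus the weighted MSE term.
# total_loss = rlvr_loss + mse_coef * laser_auxiliary_loss(last_token_logits,
#                                                          reward_token_id,
#                                                          verifier_rewards)
```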
- [2025.10.16] We release our paper on arXiv, along with the source code and checkpoints.
| Name | Link |
|---|---|
| Octothinker-3B-Short-LaSeR | hf model |
| Qwen2.5-7B-LaSeR | hf model |
| ORZ-7B-LaSeR | hf model |
The evaluation data is in the data/ directory. The processed training data can be downloaded from here.
Our code is mainly based on verl (v0.5.0). To prepare the environment, please follow these steps:
```bash
conda create -n verl python==3.10
conda activate verl
cd verl/
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install math-verify
```

We provide example scripts for GRPO and LaSeR training in examples/grpo_trainer/. Before running, please download the related datasets to the appropriate locations.
For experiments on Qwen2.5-7B-Base and ORZ-7B:
```bash
cd verl/
bash examples/grpo_trainer/run_qwen2_5_7b.sh
```

For experiments on OctoThinker-3B-Short-Base:
```bash
cd verl/
bash examples/grpo_trainer/run_octothinker_3b.sh
```

You can modify these scripts to adapt the training parameters and paths to your own setup. The scripts include all necessary hyper-parameters; detailed explanations can be found in verl/verl/trainer/config/actor/actor.yaml.
Make sure to set your WANDB_API_KEY if you want to use Weights & Biases logging.
Our evaluation code is in the src/ folder.
Ideally, the self-rewarding score could be computed directly by performing one additional forward pass after the model generates the <EOS> token, obtaining the prediction probability of the pre-specified self-rewarding token. However, this requires modifying the underlying sampling logic of vLLM. In the current version, we instead concatenate the pre-specified token after each fully generated solution and run a separate forward pass to obtain the self-rewarding scores. We welcome the community to contribute a PR that adapts vLLM to our method and enables more efficient self-rewarding!
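For illustration only, below is a minimal Hugging Face Transformers sketch of this concatenate-and-forward workaround; the checkpoint path and the self-rewarding token are placeholders, and the released scripts in src/ implement the actual evaluation procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; substitute one of the released LaSeR checkpoints.
model_path = "path/to/laser_checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)

def self_rewarding_score(prompt: str, solution: str, reward_token_id: int) -> float:
    """Append the pre-specified self-rewarding token after the finished solution and
    read its predicted probability from a separate forward pass.

    Assumes `solution` already ends with the model's <EOS> token text and that the
    self-rewarding token is a single token with id `reward_token_id` (placeholder).
    """
    ids = tokenizer(prompt + solution, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.tensor([[reward_token_id]])], dim=-1)
    with torch.no_grad():
        logits = model(ids).logits  # [1, seq_len, vocab]
    # The logits at the second-to-last position predict the appended self-rewarding token.
    probs = torch.softmax(logits[0, -2], dim=-1)
    return probs[reward_token_id].item()
```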
For now, users can run the following script to perform evaluation on the reasoning and self-rewarding capabilities of the target model:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 sh scripts/run_eval_math.sh
```

Our training code is mainly based on verl. Our training data is adapted from DeepMath-103K. We sincerely thank the contributors for open-sourcing their work!
If you find our work helpful, please kindly cite it as:
```bibtex
@article{yang2025laser,
  title={LaSeR: Reinforcement Learning with Last-Token Self-Rewarding},
  author={Yang, Wenkai and Liu, Weijie and Xie, Ruobing and Guo, Yiju and Wu, Lulu and Yang, Saiyong and Lin, Yankai},
  journal={arXiv preprint arXiv:2510.14943},
  year={2025}
}
```