News • Overview • Getting Started • Data • Citation
- [2026-04-16] KnowRL ranks #1 on Hugging Face Daily Papers! Check it out: Daily Paper.
- [2026-04-15] We release our paper, code, training data, KP annotations, and model checkpoints. Check it out: Paper.
Hint-based reinforcement learning (RL) addresses reward sparsity in LLM reasoning by injecting auxiliary guidance into prompts. However, existing methods suffer from hint redundancy: they inject excessive or loosely structured guidance while only a small subset of the information is actually needed. We identify three key challenges:
Figure 1. Three key challenges in hint-based RL. (a) The critical-segment effect: performance improves sharply once a short key hint segment appears, with diminishing returns beyond it. (b) Cross-hint inconsistency: longer prefixes may introduce branching or ambiguity. (c) Training-efficiency trade-off: abstraction-based hints rely on teacher models, increasing computational overhead.
KnowRL formulates hint design as a minimal sufficient guidance problem. Instead of injecting long solution prefixes or full reasoning templates, KnowRL decomposes guidance into atomic knowledge points (KPs) and identifies the minimal subset required to unlock reward learning.
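As a toy sketch of the minimal-sufficiency idea only (not KnowRL's actual selection algorithm, which is described below), one can greedily grow a KP subset under an assumed pass-rate oracle until a learning signal appears:

```python
from typing import Callable, FrozenSet, List

def greedy_minimal_subset(
    kps: List[str],
    pass_rate: Callable[[FrozenSet[str]], float],
    target: float = 0.5,
) -> FrozenSet[str]:
    """Greedily add the KP with the largest marginal pass-rate gain
    until the target pass rate is reached (illustration only)."""
    chosen: FrozenSet[str] = frozenset()
    while pass_rate(chosen) < target and len(chosen) < len(kps):
        best = max(
            (kp for kp in kps if kp not in chosen),
            key=lambda kp: pass_rate(chosen | {kp}),
        )
        chosen = chosen | {best}
    return chosen

# Toy oracle: only KP "B" unlocks the reward (critical-segment effect).
oracle = lambda s: 0.8 if "B" in s else 0.1
```

Under this toy oracle, the search returns just `{"B"}`: the single KP that produces the jump in pass rate.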
- **Minimal-sufficiency perspective on hint-based RL**: We empirically demonstrate a non-linear, jump-like performance pattern (the critical-segment effect), revealing that effective guidance depends on selective key knowledge rather than cumulative hint length.
- **Principled KP selection pipelines**: We design several KP selection strategies (S-LOO, T-LOO, CBRS, CSS) that ensure minimal, non-redundant, and interaction-compatible KP subsets. CSS achieves the best performance with ~38% fewer KPs.
- **State-of-the-art results at 1.5B scale**: KnowRL-Nemotron-1.5B achieves 74.16 average accuracy with CSS across eight competition-level math benchmarks, establishing a new SOTA.
KnowRL-Nemotron-1.5B achieves consistent improvements across all eight benchmarks under different hint settings:
| Model | Hint Setting | AIME24 | AIME25 | BRUMO25 | HMMT25 | AMC23 | CMIMC25 | MATH-500 | OlyBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | w/o KP | 59.06 | 48.33 | 60.73 | 30.63 | 90.70 | 30.08 | 92.35 | 71.70 | 60.45 |
| Nemotron-1.5B | CSS | 64.06 | 50.10 | 65.03 | 35.77 | 90.47 | 36.70 | 92.90 | 74.09 | 63.64 |
| QuestA | w/o KP | 71.56 | 62.08 | 67.50 | 40.94 | 93.44 | 41.48 | 92.95 | 72.28 | 67.78 |
| JustRL | w/o KP | 69.69 | 62.92 | 66.88 | 40.63 | 96.02 | 41.72 | 94.15 | 76.59 | 68.58 |
| KnowRL-1.5B | w/o KP | 69.79 | 64.69 | 69.48 | 41.04 | 95.55 | 44.14 | 95.70 | 80.23 | 70.08 |
| KnowRL-1.5B | CBRS | 75.52 | 65.00 | 78.33 | 45.00 | 95.78 | 49.22 | 96.45 | 82.34 | 73.46 |
| KnowRL-1.5B | CSS | 74.58 | 65.21 | 78.12 | 48.75 | 95.70 | 52.19 | 96.20 | 82.44 | 74.16 |
Even without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 (+9.63 over baseline), showing that KnowRL improves the underlying policy itself rather than relying on test-time hint injection.
We compare multiple offline KP selection strategies on Nemotron-1.5B. CSS achieves the highest accuracy with only 2.57 KPs per problem on average:
| Strategy | AIME24 | AIME25 | BRUMO25 | HMMT25 | AMC23 | CMIMC25 | MATH-500 | OlyBench | Avg. | Avg. #KP |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o KP | 58.75 | 48.44 | 61.67 | 30.10 | 90.55 | 30.08 | 92.40 | 71.70 | 60.46 | 0.00 |
| All KP | 60.90 | 49.01 | 61.11 | 32.46 | 89.67 | 32.32 | 92.22 | 70.55 | 61.03 | 5.86 |
| Random | 60.52 | 49.27 | 61.04 | 33.23 | 91.02 | 31.09 | 91.65 | 71.88 | 61.21 | 2.53 |
| Max-Score | 62.63 | 49.79 | 64.27 | 34.79 | 90.94 | 32.99 | 92.52 | 73.89 | 62.73 | 2.61 |
| S-LOO | 62.71 | 49.22 | 63.88 | 33.54 | 91.71 | 33.52 | 92.90 | 73.70 | 62.65 | 1.72 |
| T-LOO | 62.11 | 49.27 | 64.20 | 33.65 | 91.25 | 33.67 | 92.40 | 73.46 | 62.50 | 1.20 |
| CBRS | 63.02 | 49.90 | 64.17 | 34.79 | 91.56 | 33.57 | 92.65 | 73.89 | 62.94 | 2.60 |
| CSS | 64.44 | 50.57 | 65.03 | 35.77 | 91.71 | 36.70 | 92.90 | 74.11 | 63.90 | 2.57 |
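As a rough sketch of the leave-one-out idea behind the S-LOO/T-LOO rows (our reading, assuming a pass-rate oracle; the paper's exact scoring may differ): score each KP by the accuracy drop when it is removed, and keep only KPs whose removal hurts.

```python
from typing import Callable, List, Sequence

def loo_keep(
    kps: Sequence[str],
    pass_rate: Callable[[frozenset], float],
    tol: float = 0.0,
) -> List[str]:
    """Leave-one-out pruning sketch: drop a KP if removing it does not
    reduce the pass rate by more than `tol` (illustrative, not the paper's code)."""
    full = frozenset(kps)
    base = pass_rate(full)
    return [kp for kp in kps if base - pass_rate(full - {kp}) > tol]

# Toy oracle: the pass rate depends only on whether "key" is present.
oracle = lambda s: 0.9 if "key" in s else 0.2
```

Note that this per-KP view ignores inter-KP interactions, which is exactly where the pruning paradox discussed below comes from.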
CSS-selected KPs deliver larger and more consistent gains across different difficulty levels compared to full-KP injection, which can even introduce negative effects on certain subsets:
Figure 2. Difficulty-bucket analysis on test set (left) and training set (right). CSS-selected KPs deliver larger and more consistent gains across difficulty levels. Full-KP injection can introduce regressions on certain subsets.
Figure 3. Comparison of KP selection strategies (CSS vs CBRS) under the same training budget. CSS maintains a persistent advantage in training accuracy with more stable policy refinement.
KnowRL training dramatically reduces reward sparsity: the zero-correct fraction drops from 41.21% to 13.00%, while the all-correct bucket rises from 1.35% to 34.28% (+32.93pp).
Figure 4. Distribution of per-query correct counts on the training set. KnowRL training (middle) collapses the zero-correct fraction and shifts mass toward all-correct. Adding KP hints at inference (right) further concentrates correctness.
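These fractions are straightforward to recompute from rollout logs; a minimal sketch, assuming you have per-query correct counts out of a fixed number of samples:

```python
from typing import List, Tuple

def sparsity_buckets(correct_counts: List[int], n_samples: int) -> Tuple[float, float]:
    """Return (zero-correct fraction, all-correct fraction) in percent,
    given each query's number of correct rollouts out of n_samples."""
    total = len(correct_counts)
    zero = sum(c == 0 for c in correct_counts) / total * 100
    full = sum(c == n_samples for c in correct_counts) / total * 100
    return zero, full

# e.g. 4 queries with 8 rollouts each:
# sparsity_buckets([0, 8, 3, 8], 8) -> (25.0, 50.0)
```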
We discover a pruning interaction paradox: removing individual "bad" KPs improves accuracy, but removing them jointly can degrade performance due to inter-KP dependencies.
Figure 5. Left: Pruning interaction paradox under LOO-style selection: cross-hint inconsistency occurs in 40%–60% of cases. Right: Tolerance-threshold sensitivity for the delta parameter in CBRS.
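The paradox is easy to reproduce with a toy oracle in which two KPs are individually redundant but jointly necessary (a constructed example, not data from the paper):

```python
# Toy pass-rate oracle: KPs "A" and "B" back each other up, so either one
# alone can be removed safely, but removing both loses the needed fact.
def pass_rate(kps: frozenset) -> float:
    return 0.8 if ("A" in kps or "B" in kps) else 0.2

full = frozenset({"A", "B", "C"})
drop_a = pass_rate(full - {"A"})          # 0.8: removing A alone is harmless
drop_b = pass_rate(full - {"B"})          # 0.8: removing B alone is harmless
drop_both = pass_rate(full - {"A", "B"})  # 0.2: joint removal degrades
```

Any selector that prunes KPs one at a time would discard both A and B here, which is why interaction-aware strategies such as CBRS and CSS are needed.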
Figure 6. Visualization of the critical-segment effect across prefix ratios on 50 training instances. Performance typically remains flat in low-ratio regions, then exhibits a distinct jump once a key segment is included.
```
KnowRL/
├── README.md
├── setup_env.sh              # One-click environment setup script
├── environment.yml           # Conda environment configuration
├── requirements.txt          # Full pip dependencies for exact reproducibility
│
├── eval/                     # Evaluation pipeline
│   ├── data/                 # Evaluation benchmark datasets
│   │   ├── AIME24/           # AIME 2024
│   │   ├── AIME25/           # AIME 2025
│   │   ├── AMC23/            # AMC 2023
│   │   ├── BRUMO25/          # BRUMO 2025
│   │   ├── CMIMC25/          # CMIMC 2025
│   │   ├── HMMT25/           # HMMT 2025
│   │   ├── MATH_500/         # MATH-500 subset
│   │   └── Olympiad_Bench/   # Olympiad Bench
│   └── eval_scripts/         # Evaluation scripts
│       ├── task.py           # Task definitions (all supported task names)
│       ├── prompts.py        # Prompt templates (with/without KP hints)
│       ├── s1_gen_vllm.py    # Step 1: Generate responses via vLLM
│       ├── s1_gen_vllm.sh    # Step 1: Shell wrapper
│       ├── s2_vllm_serve.sh  # Step 2: Serve judge model (CompassVerifier-3B)
│       ├── rule_base_verl.py # Rule-based answer verification
│       ├── s3_rule_base_verl.sh   # Step 3: Run rule-based evaluation
│       ├── s3_model_base_verl.py  # Model-based evaluation (with judge)
│       └── eval_outputs/     # Generated evaluation results
│
├── train/                    # Training pipeline
│   ├── knowrl.sh             # Main training launch script (DAPO/GRPO via Ray)
│   ├── train_data/           # Training data (KP hints embedded in prompts)
│   │   ├── css.jsonl
│   │   └── css.parquet
│   └── val/                  # Validation data used during training
│       ├── aime24/
│       ├── aime25/
│       ├── brumo_2025/
│       └── hmmt_25_2/
│
├── utils/
│   └── ray_start.sh          # Ray cluster initialization for multi-node training
│
├── huggingface/              # Scripts and data for HuggingFace upload
│   ├── hf_upload.py
│   └── train/
│
└── verl/                     # verl framework (RL training library for LLMs)
```
- Linux (tested on Ubuntu)
- NVIDIA GPU with CUDA 12.4+
- Conda (Miniconda or Anaconda)
```
git clone <knowrl_repo_url>
cd KnowRL
```

**Option A: one-click setup script**

Note: This step only installs the Python dependencies. verl is installed separately in the next step.

```
bash setup_env.sh
conda activate knowrl
```

**Option B: conda environment file**

```
conda env create -f environment.yml
conda activate knowrl
```

Note: If pip fails to find `torch`, your pip mirror may not host PyTorch. The `environment.yml` includes `--extra-index-url` for the official PyTorch index, but if it still fails, use Option A or Option C instead.
**Option C: manual installation**

```
# 1. Create conda env
conda create -n knowrl python=3.10 -y
conda activate knowrl

# 2. Install PyTorch (CUDA 12.4)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --extra-index-url https://download.pytorch.org/whl/cu124

# 3. Install all other dependencies
pip install -r requirements.txt --extra-index-url https://pypi.org/simple/

# 4. Install flash-attn (requires compilation)
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

Then install verl:

```
cd verl
pip install -e . --extra-index-url https://pypi.org/simple/
```

Verify the installation:

```
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers {transformers.__version__}')"
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import verl; print('verl OK')"
```

| Resource | Link |
|---|---|
| KnowRL Collection | HasuerYu/knowrl |
| Training Data | HasuerYu/KnowRL-Train-Data |
| KP Annotations | HasuerYu/KnowRL-KP-Annotations |
| Model | HasuerYu/KnowRL-Nemotron-1.5B |
- `eval/data/`: Evaluation data for offline evaluation, consistent with the evaluation data on HuggingFace.
- `train/train_data/`: Training-ready data. Knowledge points are directly embedded in the prompt as a `## Hint` section (rather than listed separately). Each record contains `kp_list` (all candidate knowledge points) and `kept_kp_index` (indices of the selected knowledge points injected into the prompt). The HuggingFace version provides the complete knowledge points along with those extracted by the CSS and CBRS strategies respectively.
- `train/val/`: Validation data used during training. Prompts contain only the math problem followed by the suffix `Please reason step by step, and put your final answer within \boxed{}.`, without any knowledge point hints.
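Based on the fields described above, a training record can be consumed like this (the concrete values are hypothetical, for illustration only):

```python
import json

# Hypothetical JSONL record mirroring the documented fields.
line = json.dumps({
    "prompt": "Solve the problem ...\n\n## Hint\n- Use the AM-GM inequality.",
    "kp_list": [
        "Use the AM-GM inequality.",
        "Consider symmetry.",
        "Try small cases.",
    ],
    "kept_kp_index": [0],
})

record = json.loads(line)
# Recover the selected KPs that were injected into the ## Hint section.
kept_kps = [record["kp_list"][i] for i in record["kept_kp_index"]]
```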
We use the DAPO/GRPO algorithm via verl with Ray for distributed training.
Base model: nvidia/OpenMath-Nemotron-1.5B
Key hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 1e-6 |
| Batch size | 256 |
| Max prompt length | 8192 |
| Max response length | 32768 |
| Rollout engine | vLLM |
| Samples per prompt (train) | 8 |
| Samples per prompt (val) | 32 |
| Total epochs | 150 |
| Total training steps | 2,960 |
| Save / Eval frequency | Every 10 steps |
| Hardware | 8x NVIDIA H100 nodes (64 GPUs) |
| Training time | ~13 days |
For single-node training, you can skip this step; Ray is initialized automatically when the training script runs.
For multi-node training, start the Ray cluster before launching the training script on each node:
```
# On head node
bash utils/ray_start.sh
```

Then launch training:

```
bash train/knowrl.sh
```

Training checkpoints will be saved under `checkpoints/`. Validation and rollout data will be saved under `validation_data/` and `rollout_data/` respectively. Training metrics are logged to WandB.
We apply entropy annealing by adjusting the clip upper bound during training. After 2,590 steps, clip_high is reduced from 0.28 to 0.26, inducing a faster entropy drop and encouraging the policy to shift from exploration to exploitation.
Figure 7. Comparison with and without entropy annealing. Entropy annealing yields faster entropy reduction and consistently better validation performance, contributing +0.74 average accuracy improvement.
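The annealing schedule amounts to a step function over training steps (the values come from the text above; how it is wired into the trainer config is implementation-specific, and the exact boundary convention is an assumption):

```python
def clip_high_schedule(step: int, anneal_at: int = 2590,
                       before: float = 0.28, after: float = 0.26) -> float:
    """Clip upper bound: 0.28 until the annealing step, then 0.26.
    Whether the boundary step itself uses the old or new value is a
    convention; here the new value applies from `anneal_at` onward."""
    return before if step < anneal_at else after
```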
The evaluation pipeline consists of three steps: generation, judge model serving, and scoring.
We evaluate on 8 benchmarks (1,374 problems total): AIME24, AIME25, BRUMO25, HMMT25, AMC23, CMIMC25, MATH-500, and Olympiad-Bench.
First, check all supported task names in `eval/eval_scripts/task.py`. The available tasks include:

- Raw (no KP hints): `AIME24`, `AIME25`, `BRUMO25`, `HMMT25`, `AMC23`, `CMIMC25`, `MATH_500`, `Olympiad_Bench`
- CBRS (Consensus-Based Robust Selection): `AIME24_CBRS`, `AIME25_CBRS`, ... (append the `_CBRS` suffix)
- CSS (Constrained Subset Search): `AIME24_CSS`, `AIME25_CSS`, ... (append the `_CSS` suffix)
Then modify `eval/eval_scripts/s1_gen_vllm.sh` to set your model path and task list, and run:

```
cd eval/eval_scripts
# Edit s1_gen_vllm.sh to configure your model and tasks
bash s1_gen_vllm.sh
```

Results will be saved under `eval/eval_scripts/eval_outputs/`.
Launch CompassVerifier-3B as the judge model using vLLM:

```
export CUDA_VISIBLE_DEVICES=0
vllm serve <path_to_CompassVerifier-3B> \
    --served-model-name cv_3b \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --port 8000 \
    --host 0.0.0.0
```

Run the evaluation on the generated outputs (for purely rule-based scoring without the judge, use `s3_rule_base_verl.sh` instead):

```
bash eval/eval_scripts/s3_model_base_verl.sh
```

If you find this work helpful, please cite our paper:
```
@misc{yu2026knowrlboostingllmreasoning,
      title={KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance},
      author={Linhao Yu and Tianmeng Yang and Siyu Ding and Renren Jin and Naibin Gu and Xiangzhao Hao and Shuaiyi Nie and Deyi Xiong and Weichong Yin and Yu Sun and Hua Wu},
      year={2026},
      eprint={2604.12627},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.12627},
}
```









