Yuda Song*, Lili Chen*, Fahim Tajwar, Rémi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
*equal contribution
Official codebase for Expanding the Capabilities of Reinforcement Learning via Text Feedback.
Our codebase uses tinker, so please follow the setup from the official tinker-cookbook.
- Sign up for Tinker through the waitlist.
- Once you have access, create an API key from the console and export it as environment variable
TINKER_API_KEY. - Install tinker python client via
pip install tinker - We recommend installing
tinker-cookbookin a virtual env either withcondaoruv. For running most examples, you can install viapip install -e ..
This experiment folder contains experiment scripts for RL training with text feedback. Each script defines preset configurations for reproducible experiments.
experiments/
├── reasoning/ # Reasoning tasks (Knights & Knaves, Binary Matrix, Shortest Path)
│ ├── knights_knaves_grpo.py
│ ├── knights_knaves_sd.py
│ ├── knights_knaves_fm.py
│ ├── binary_matrix_*.py
│ └── shortest_path_*.py
└── math/ # Math tasks (DeepMath, DAPO)
├── deepmath_grpo.py
├── deepmath_sd.py
├── dapo_*.py
└── ...
| Method | Description | Key Config |
|---|---|---|
grpo |
GRPO baseline (no distillation) | distillation_mode="none" |
fm |
Feedback Modeling | distillation_mode="feedback_modeling" |
sd |
Self-Distillation | distillation_mode="rl_reweight_mask" |
sft |
SFT on correct y2 | distillation_mode="sft" |
Run an experiment with default settings:
python -m experiments.reasoning.knights_knaves_sdOverride specific parameters:
python -m experiments.reasoning.knights_knaves_sd wandb_name=my_run learning_rate=1e-5Each experiment script follows this pattern:
import asyncio
import logging
import sys
import chz
from tinker_cookbook.recipes.rltf.train_with_critique import CLIConfig, cli_main
def build_config_blueprint() -> chz.Blueprint[CLIConfig]:
"""Build config blueprint with experiment defaults."""
return chz.Blueprint(CLIConfig).apply(
{
# Model
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
# Environment
"env": "knights_knaves",
"horizon": 2,
"early_termination": True,
# Judge
"judge_type": "judge",
# Training
"group_size": 8,
"groups_per_batch": 32,
"loss_fn": "importance_sampling",
# Distillation
"distillation_mode": "rl_reweight_mask",
"rl_coef": 0.1,
"use_first_turn_baseline": True,
"gamma": 0.1,
# Logging
"wandb_project": "knights",
"wandb_name": "self-distillation",
}
)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
blueprint = build_config_blueprint()
blueprint.make_from_argv(sys.argv[1:])
cli_config = blueprint.make()
asyncio.run(cli_main(cli_config))| Environment | Description | Script prefix |
|---|---|---|
knights_knaves |
Knights and Knaves logic puzzles | knights_knaves_* |
binary_matrix |
Binary matrix distance computation | binary_matrix_* |
shortest_path |
Grid pathfinding | shortest_path_* |
deepmath |
DeepMath math dataset | deepmath_* |
dapo |
DAPO math dataset | dapo_* |
@article{song2026expanding,
title={Expanding the Capabilities of Reinforcement Learning via Text Feedback},
author={Song, Yuda and Chen, Lili and Tajwar, Fahim and Munos, Remi and Pathak, Deepak and Bagnell, J Andrew and Singh, Aarti and Zanette, Andrea},
journal={arXiv preprint arXiv:2602.02482},
year={2026}
}