jrzmnt/ask-rl
When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

IJCNN 2026 | Python 3.11 | License: MIT | HuggingFace | arXiv

Official implementation of ASK (Adaptive Safety through Knowledge), an extrinsic method that improves out-of-distribution (OOD) generalization in reinforcement learning by selectively querying a Language Model (LM) based on uncertainty estimates, without retraining the RL policy.

ASK uses Monte Carlo Dropout to measure epistemic and aleatoric uncertainty at each step. When uncertainty exceeds a threshold τ, it queries an LM for an action recommendation. In in-domain scenarios, ASK preserves PPO baseline performance. Under downward generalization (trained on 8×8, tested on 4×4–7×7), the 32B/72B models achieve up to 0.95 reward, where both PPO alone and the LM alone fail completely.
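The gating rule can be sketched as follows. This is an illustrative sketch, not the repository's API: the function and variable names are assumptions, and it assumes MC Dropout has already produced N sampled action distributions for the current state.

```python
import numpy as np

def should_query_lm(mc_probs: np.ndarray, tau: float) -> bool:
    """Decide whether to defer to the LM for this step.

    mc_probs: array of shape (N, A) -- N stochastic forward passes,
    each a probability distribution over the A actions.
    tau: uncertainty threshold (a tuned hyperparameter).
    """
    mean_probs = mc_probs.mean(axis=0)
    # Predictive entropy of the averaged distribution (total uncertainty).
    total = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return bool(total > tau)

# Toy usage: samples that agree (low uncertainty) vs. disagree (high).
np.random.seed(0)
confident = np.tile([0.97, 0.01, 0.01, 0.01], (100, 1))
uncertain = np.random.dirichlet(np.ones(4), size=100)
```

When `should_query_lm` returns False, the agent simply acts with the PPO policy, so in-domain behavior is unchanged; only high-uncertainty states pay the cost of an LM call.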


Requirements

  • Python 3.11
  • uv

Installation

uv sync
source .venv/bin/activate

Setup: Generate Evaluation Maps

The FrozenLake evaluation and test maps are not included in the repository and must be generated before running any experiment. Each map size uses 100 fixed contexts for evaluation and 100 for testing (the paper uses 300 total, equally split into train/eval/test).

python scripts/generate_maps.py

This creates tmp/frozenlake{4..8}/eval/ and tmp/frozenlake{4..8}/test/ with 100 .npy maps each.

You can also generate them manually:

from awu.envs.frozen_lake import FrozenLake

for size in [4, 5, 6, 7, 8]:
    env = FrozenLake(id="FrozenLake-v1", size=size)
    env.create_structures(100, eval=True)   # -> tmp/frozenlake{size}/eval/
    env.create_structures(100, eval=False)  # -> tmp/frozenlake{size}/test/

Running Experiments

Full pipeline

bash scripts/run_all.sh

Runs setup, PPO training, SLM-only rollout, and gated rollout in sequence.

Individual steps

# 1. Train the PPO agent
bash scripts/run_rl.sh

# 2. Run SLM-only rollout
bash scripts/run_slm.sh

# 3. Run uncertainty-gated rollout (ASK: PPO + SLM)
bash scripts/run_gated.sh

Results are saved under runs/.

Evaluation

Evaluation scripts load the pre-trained PPO model from HuggingFace (NathanGavenski/ppo-FrozenLake-v1-8x8) and run it over the fixed evaluation maps.

python eval_ppo.py        # PPO-only
python eval_ppo_slm.py    # ASK: PPO + SLM gated
python eval_slm.py        # SLM-only

Configuration

File                     Description
configs/rl/ppo.yaml      PPO training config (environment regime, timesteps)
configs/slm/small.yaml   SLM config (Qwen2.5-1.5B-Instruct)
configs/slm/medium.yaml  SLM config (larger Qwen variant)
configs/sweeps/          Sweep configs for threshold and model search

The regime field in configs/rl/ppo.yaml controls the experiment type:

experiment:
  regime: rl_only   # rl_only | slm_only | gated

Key hyperparameters from the paper:

  • PPO training: 2×10⁷ timesteps (Stable-Baselines3 defaults)
  • MC Dropout: N=100 forward passes, dropout rate 0.2
  • LMs: Qwen2.5 family (0.5B–72B), off-the-shelf from HuggingFace, no fine-tuning
  • Evaluation: 100 episodes per configuration
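The epistemic/aleatoric split mentioned above can be illustrated with the standard entropy decomposition over the N dropout samples. This is a minimal sketch under that assumption; the function name is ours, not the repository's.

```python
import numpy as np

def uncertainty_decomposition(mc_probs: np.ndarray):
    """Split predictive uncertainty into epistemic and aleatoric parts.

    mc_probs: (N, A) action distributions from N dropout-perturbed
    forward passes (the paper uses N=100, dropout rate 0.2).
    """
    eps = 1e-12
    mean_probs = mc_probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -np.sum(mean_probs * np.log(mean_probs + eps))
    # Aleatoric: average entropy of the individual predictions.
    aleatoric = -np.mean(np.sum(mc_probs * np.log(mc_probs + eps), axis=1))
    # Epistemic: the gap (mutual information between weights and action).
    epistemic = total - aleatoric
    return epistemic, aleatoric

# If every dropout sample agrees, the epistemic term vanishes and all
# remaining uncertainty is aleatoric.
samples = np.tile([0.25, 0.25, 0.25, 0.25], (100, 1))
ep, al = uncertainty_decomposition(samples)
```

Under this decomposition, epistemic uncertainty captures disagreement between dropout samples (model uncertainty, which rises out of distribution), while aleatoric uncertainty reflects the spread within each sample's own prediction.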

Project Structure

├── configs/          # YAML experiment configs
├── eval_ppo.py       # Evaluate PPO-only
├── eval_ppo_slm.py   # Evaluate ASK (PPO + SLM gated)
├── eval_slm.py       # Evaluate SLM-only
├── prompts/          # SLM prompt templates
├── scripts/
│   ├── generate_maps.py   # Generate FrozenLake maps (run once)
│   ├── run_all.sh         # Full pipeline
│   ├── run_rl.sh          # Train PPO
│   ├── run_slm.sh         # SLM-only rollout
│   ├── run_gated.sh       # ASK gated rollout
│   └── setup.sh           # Install dependencies
└── src/awu/
    ├── envs/              # FrozenLake environment
    ├── experiments/       # Training and rollout entry points
    ├── slm/               # SLM loading, prompting, and parsing
    ├── uncertainty/       # MC Dropout uncertainty estimation
    └── utils/             # Callbacks, seeding, I/O utilities

Acknowledgments

This work was partially supported by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org), and by the Kunumi Institute (https://www.kunuminst.org/), through individual grants awarded to the authors.


Citation

@inproceedings{7fc4d3b96a2c441d92209a877e111a5d,
  title     = "When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning",
  author    = "Juarez Monteiro and Nathan Gavenski and Gianlucca Zuin and Adriano Veloso",
  booktitle = "Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN)",
  year      = "2026",
  month     = jun,
  note      = "Conference date: 21-06-2026 Through 26-06-2026",
  url       = "https://attend.ieee.org/wcci-2026/",
}