Official implementation of ASK (Adaptive Safety through Knowledge), an extrinsic method that improves out-of-distribution (OOD) generalization in reinforcement learning by selectively querying a Language Model (LM) based on uncertainty estimates, without retraining the RL policy.
ASK uses Monte Carlo Dropout to measure epistemic and aleatoric uncertainty at each step. When uncertainty exceeds a threshold τ, it queries an LM for an action recommendation. In in-domain scenarios, ASK preserves PPO baseline performance. Under downward generalization (trained on 8×8, tested on 4×4–7×7), 32B/72B models achieve up to 0.95 reward, where both PPO alone and the LM alone fail completely.
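The gating decision can be sketched as follows. This is a minimal illustration, not the repository's implementation: the threshold value `TAU`, and the helper names `mc_dropout_entropy` and `select_action`, are placeholders chosen for this example.

```python
import numpy as np

TAU = 0.5  # uncertainty threshold (τ); placeholder value, not the paper's setting


def mc_dropout_entropy(action_probs: np.ndarray) -> float:
    """Predictive entropy of the mean action distribution over
    N stochastic forward passes (action_probs shape: [N, num_actions])."""
    mean_probs = action_probs.mean(axis=0)
    return float(-(mean_probs * np.log(mean_probs + 1e-12)).sum())


def select_action(action_probs: np.ndarray, lm_action: int) -> int:
    """Trust the PPO policy when uncertainty is low; otherwise defer
    to the language model's recommended action."""
    if mc_dropout_entropy(action_probs) > TAU:
        return lm_action  # high uncertainty: ask the LM
    return int(action_probs.mean(axis=0).argmax())  # low uncertainty: act greedily
```

With a near-deterministic action distribution the entropy stays below the threshold and the PPO action is used; with a uniform distribution the entropy exceeds it and the LM's recommendation is taken instead.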
- Python 3.11
- uv
```bash
uv sync
source .venv/bin/activate
```

The FrozenLake evaluation and test maps are not included in the repository and must be generated before running any experiment. Each map size uses 100 fixed contexts for evaluation and 100 for testing (the paper uses 300 total, equally split into train/eval/test).
```bash
python scripts/generate_maps.py
```

This creates `tmp/frozenlake{4..8}/eval/` and `tmp/frozenlake{4..8}/test/` with 100 `.npy` maps each.
You can also generate them manually:
```python
from awu.envs.frozen_lake import FrozenLake

for size in [4, 5, 6, 7, 8]:
    env = FrozenLake(id="FrozenLake-v1", size=size)
    env.create_structures(100, eval=True)   # -> tmp/frozenlake{size}/eval/
    env.create_structures(100, eval=False)  # -> tmp/frozenlake{size}/test/
```

```bash
bash scripts/run_all.sh
```

Runs setup, PPO training, SLM-only rollout, and gated rollout in sequence.
```bash
# 1. Train the PPO agent
bash scripts/run_rl.sh

# 2. Run SLM-only rollout
bash scripts/run_slm.sh

# 3. Run uncertainty-gated rollout (ASK: PPO + SLM)
bash scripts/run_gated.sh
```

Results are saved under `runs/`.
Evaluation scripts load the pre-trained PPO model from HuggingFace (NathanGavenski/ppo-FrozenLake-v1-8x8) and run it over the fixed evaluation maps.
```bash
python eval_ppo.py      # PPO-only
python eval_ppo_slm.py  # ASK: PPO + SLM gated
python eval_slm.py      # SLM-only
```

| File | Description |
|---|---|
| `configs/rl/ppo.yaml` | PPO training config (environment regime, timesteps) |
| `configs/slm/small.yaml` | SLM config (Qwen2.5-1.5B-Instruct) |
| `configs/slm/medium.yaml` | SLM config (larger Qwen variant) |
| `configs/sweeps/` | Sweep configs for threshold and model search |
The `regime` field in `configs/rl/ppo.yaml` controls the experiment type:

```yaml
experiment:
  regime: rl_only  # rl_only | slm_only | gated
```

Key hyperparameters from the paper:
- PPO training: 2×10⁷ timesteps (Stable-Baselines3 defaults)
- MC Dropout: N=100 forward passes, dropout rate 0.2
- LMs: Qwen2.5 family (0.5B–72B), off-the-shelf from HuggingFace, no fine-tuning
- Evaluation: 100 episodes per configuration
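MC Dropout relies on keeping dropout layers active at inference time while the rest of the network stays in eval mode. A minimal PyTorch sketch of the N-pass procedure (the model architecture and observation shape below are placeholders for illustration, not the repository's actual policy network):

```python
import torch
import torch.nn as nn


def mc_dropout_passes(model: nn.Module, obs: torch.Tensor, n: int = 100) -> torch.Tensor:
    """Run n stochastic forward passes with dropout kept active.

    Returns a tensor of shape [n, num_actions] of action probabilities,
    from which epistemic/aleatoric uncertainty can be estimated.
    """
    model.eval()  # freeze everything (e.g. batch-norm statistics) ...
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # ... except dropout, which stays stochastic
    with torch.no_grad():
        return torch.stack(
            [torch.softmax(model(obs), dim=-1) for _ in range(n)]
        )
```

Because dropout remains active, the n passes produce different action distributions for the same observation; their spread is what the gate thresholds against.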
```
├── configs/              # YAML experiment configs
├── eval_ppo.py           # Evaluate PPO-only
├── eval_ppo_slm.py       # Evaluate ASK (PPO + SLM gated)
├── eval_slm.py           # Evaluate SLM-only
├── prompts/              # SLM prompt templates
├── scripts/
│   ├── generate_maps.py  # Generate FrozenLake maps (run once)
│   ├── run_all.sh        # Full pipeline
│   ├── run_rl.sh         # Train PPO
│   ├── run_slm.sh        # SLM-only rollout
│   ├── run_gated.sh      # ASK gated rollout
│   └── setup.sh          # Install dependencies
└── src/awu/
    ├── envs/             # FrozenLake environment
    ├── experiments/      # Training and rollout entry points
    ├── slm/              # SLM loading, prompting, and parsing
    ├── uncertainty/      # MC Dropout uncertainty estimation
    └── utils/            # Callbacks, seeding, I/O utilities
```
This work was partially supported by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org), and by the Kunumi Institute (https://www.kunuminst.org/), through individual grants awarded to the authors.
```bibtex
@inproceedings{7fc4d3b96a2c441d92209a877e111a5d,
  title     = "When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning",
  author    = "Juarez Monteiro and Nathan Gavenski and Gianlucca Zuin and Adriano Veloso",
  booktitle = "Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN)",
  year      = "2026",
  month     = jun,
  note      = "Conference date: 21-06-2026 Through 26-06-2026",
  url       = "https://attend.ieee.org/wcci-2026/",
}
```