Latent Policy Steering through One-Step Flow Policies

Preprint

1Yonsei University, 2Microsoft Research

TL;DR: Robust, tuning-free offline RL via direct Q-gradient backpropagation through a one-step flow policy, eliminating both proxy latent critics and brittle α tuning.

Abstract

Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.


Limitations of Prior Works

Standard offline RL methods (e.g., TD3+BC) rely on a sensitive hyperparameter α to balance return maximization and behavioral constraints. Latent steering methods like DSRL avoid this but require a distilled latent-space critic, which can be lossy and produce inaccurate gradients.


Behavioral constraint is too sensitive

[Figure: QC-FQL sensitivity to the regularization weight α]

Large α yields overly conservative policies, while small α encourages out-of-support actions. The best α is highly sensitive to reward scale, dataset diversity, and model capacity.

Latent-space critic is not enough

[Figure: DSRL latent-space critic gradient mismatch]

Even when a distilled latent critic approximates values well, its gradient ∇_z Q_ϕ(s, z) can deviate substantially from the true action-space critic gradient, leading to suboptimal latent updates.


Latent Policy Steering (LPS)

We propose Latent Policy Steering (LPS), which addresses both limitations above. First, LPS avoids the explicit behavior-regularization trade-off by separating reward maximization and distributional constraints: a fixed generative behavior policy defines the support, while a latent actor performs value-driven steering (resolving α-sensitivity). Second, LPS eliminates proxy latent critics by directly backpropagating action-space critic gradients through a differentiable generative base policy to update the latent actor (avoiding the inaccurate latent critic).

1. Differentiable Base Policy via MeanFlow

We employ MeanFlow as a differentiable one-step generative base policy, enabling direct gradient flow from the action-space critic to the latent actor.
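
As a rough intuition for why MeanFlow admits one-step generation: a mean-velocity model u(x, r, t) predicts the average velocity over the interval [r, t], so a trained model can map noise to an action in a single step, a = z − u(z, 0, 1). The straight-line toy flow and conventions below are illustrative assumptions, not the paper's actual network or training procedure.

```python
import numpy as np

def sample_one_step(u, z):
    """One-step MeanFlow sampling: subtract the average velocity over [0, 1]."""
    return z - u(z, 0.0, 1.0)

# Toy "trained" model for the straight-line flow x_t = (1 - t) * a_star + t * z:
# the average velocity over [0, 1], evaluated at the endpoint x = z (t = 1),
# is (x_1 - x_0) / 1 = z - a_star, i.e. x - a_star at that endpoint.
a_star = np.array([0.5, -0.3])
u = lambda x, r, t: x - a_star

z = np.random.randn(2)        # noise sample
print(sample_one_step(u, z))  # recovers a_star exactly in this linear toy
```

Because the model predicts an interval-averaged velocity rather than an instantaneous one, no ODE solver unrolling is needed at inference, which is what makes end-to-end differentiation through the base policy cheap and stable.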

2. Spherical Latent Geometry

Both the base policy and the latent actor operate on the hypersphere S^{d−1} of radius √d, preventing latent-norm explosion and keeping latent queries within the region covered during training.
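
The spherical constraint can be realized by a simple radial projection; the sketch below is one straightforward way to do this, and the paper's exact parameterization may differ.

```python
import numpy as np

def project_to_sphere(z):
    """Project latent vectors onto the hypersphere S^{d-1} of radius sqrt(d).

    With this scaling, each latent coordinate has unit scale on average,
    matching the statistics of a standard Gaussian prior.
    """
    d = z.shape[-1]
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return np.sqrt(d) * z / np.maximum(norm, 1e-8)  # guard against zero norm

z = np.random.randn(3, 16)                           # batch of 16-dim latents
print(np.linalg.norm(project_to_sphere(z), axis=-1)) # each norm = sqrt(16) = 4
```

Constraining the actor's output this way bounds every latent query by construction, so no explicit norm penalty (and no associated weight to tune) is needed.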

3. Direct Latent Policy Steering

ℒ_LPS(ϕ) = −𝔼_{s∼𝒟}[ Q_θ(s, π_β(s, π_ϕ(s))) ]

Gradients from Q_θ propagate through π_β to π_ϕ: no proxy latent critic, no α tuning needed.
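
The update can be illustrated with a linear toy problem, where the chain rule through the frozen base policy is explicit. Everything below is an illustrative stand-in (a linear base policy, a quadratic critic, and no spherical constraint), not the paper's implementation.

```python
import numpy as np

# Toy LPS update: gradients of an action-space critic Q flow through a
# frozen, differentiable base policy pi_beta (here a = W @ z) into the
# latent actor's output phi. No latent-space critic is ever trained.
W = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])  # frozen linear base policy
a_star = np.array([1.0, -1.0])        # action maximizing the toy critic

def Q(a):
    """Toy action-space critic with its peak at a_star."""
    return -np.sum((a - a_star) ** 2)

phi = np.zeros(4)                     # latent actor output (state-free toy)
lr = 0.1
for _ in range(100):
    a = W @ phi
    grad_a = -2.0 * (a - a_star)      # dQ/da from the action-space critic
    grad_phi = W.T @ grad_a           # chain rule through pi_beta
    phi += lr * grad_phi              # gradient ascent on Q in latent space
print(Q(W @ phi))                     # approaches 0: the critic's optimum
```

The key point the toy makes concrete: the critic is only ever evaluated on original actions a = π_β(s, z), so its gradients are exact for the objective being optimized, with no distillation step in between.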

Key Insight: LPS structurally decouples behavioral constraints from reward maximization—the fixed MeanFlow base policy confines output to the data manifold, while the latent actor freely maximizes Q-value within that manifold. This eliminates the need for sensitive α tuning.

Experiments

We evaluate LPS across OGBench manipulation tasks in simulation and real-world robotic manipulation tasks on the DROID platform. All methods share the same Q-Chunking (QC) value-learning backbone and differ only in policy extraction strategy. We compare against: QC-FQL and QC-MFQL (action-space distillation baselines), DSRL (latent steering with latent-space critic), and CFGRL (inference-time classifier-free guidance).

Real-World (Franka)

[Figure: real-world results on the Franka platform]

Our real-world experiments use the DROID platform with a Franka Research 3 robot across four manipulation tasks: pick-and-place carrots, eggplant to bin, plug in bulb, and refill tape. We collected 50 human-teleoperated demonstrations per task. Across all tasks, LPS achieves the highest success rates and the best average performance (56.2%), outperforming both behavioral cloning baselines (Flow-BC: 31.2%, MF-BC: 28.7%) and prior latent-steering methods (DSRL: 35.0%). LPS is particularly effective on precision-critical tasks where DSRL struggles.


[Rollout videos: Pick & Place Carrots, Eggplant to Bin, Plug in Bulb, and Refill Tape, each comparing Flow-BC, MF-BC, DSRL, and LPS (Ours).]

Simulation (OGBench)

[Figure: OGBench results]

We evaluate on five state-based manipulation tasks from OGBench (cube-single, cube-double, scene-sparse, puzzle-3x3-sparse, puzzle-4x4) and pixel-based visual task variants. LPS consistently outperforms the one-step distillation baselines (QC-FQL and QC-MFQL). DSRL exhibits higher variance and performs poorly on the challenging cube-double domain, highlighting the limitations of relying on a distilled latent-space critic. CFGRL underperforms explicit policy extraction methods, suggesting that inference-time guidance alone provides weaker improvement signals than direct critic-based optimization.


Robustness to α

[Figure: sensitivity to the regularization weight α]

We sweep α on representative OGBench manipulation tasks. QC-MFQL exhibits a sharp performance peak at a specific α, with success rates dropping rapidly when α deviates from the task-specific optimum. In contrast, LPS remains stable across the entire sweep, consistent with its design goal of decoupling policy improvement from explicit behavior-regularization weights.

Computational Efficiency

[Figure: computational cost comparison]

On our real-world (Franka) setup, LPS generates actions in a single forward pass via MeanFlow, unlike multi-step diffusion policies that require dozens of denoising steps. This one-step design also enables stable end-to-end gradient backpropagation without the instability of unrolling multi-step ODE solvers, achieving strong real-world performance with substantially lower per-step compute.


Ablation Study

[Figure: merged ablation results (a–e)]

Using simulation (OGBench) tasks, we conduct comprehensive ablation studies to validate three key design choices in LPS:

  • Latent-space geometry (a–c): The spherical prior does not degrade baseline performance (a), confirming it retains sufficient expressivity. However, for LPS, it is critical: both normal and truncated-normal alternatives significantly reduce performance (b), as unconstrained latents exhibit norm growth that pushes the actor out of support, while truncated latents saturate near boundaries where gradients vanish (c).
  • One-step generative backbone (d): FM-1step-LPS (single Euler step) performs worst due to large approximation errors; FM-LPS (10-step) improves but still underperforms LPS, as backpropagation through multi-step sampling (BPTT) introduces instability. MeanFlow enables high-fidelity one-step generation with stable end-to-end gradients.
  • Noise-to-action reformulation (e): The original MeanFlow parameterization leads to unstable learning. Our reformulation—predicting denoised actions rather than raw velocity fields—consistently yields strong performance.

BibTeX

@misc{im2026latentpolicysteeringonestep,
      title={Latent Policy Steering through One-Step Flow Policies}, 
      author={Hokyun Im and Andrey Kolobov and Jianlong Fu and Youngwoon Lee},
      year={2026},
      eprint={2603.05296},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.05296}, 
}