Offline reinforcement learning (RL) allows robots to learn from pre-collected datasets without risky
exploration. Yet, offline RL's performance often hinges on a brittle trade-off between
(1) return maximization, which can push policies outside the dataset support, and
(2) behavioral constraints, which typically require sensitive hyperparameter tuning.
Latent steering offers a structural way to stay within the dataset support during RL,
but existing offline adaptations commonly approximate action values using latent-space critics
learned via indirect distillation, which can lose information and hinder convergence.
We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement
by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy
to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an
original-action-space critic to guide end-to-end latent-space optimization, while the one-step
MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust
method that works out of the box with minimal tuning. Across OGBench and real-world robotic tasks, LPS
achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong
latent steering baselines.
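
To make the gradient path described above concrete, the following is a minimal scalar sketch, not the authors' implementation. Every function in it is a hypothetical stand-in: `decoder` is a frozen tanh map that plays the role of the one-step MeanFlow policy, `critic` is a fixed quadratic rather than a learned Q-function, and the latent actor is a single scalar weight. The chain rule is written out by hand where a real system would presumably use automatic differentiation. The only point illustrated is that the action-space Q-gradient reaches the latent actor through the decoder, with no separate latent-space critic.

```python
import math

# Toy illustration only: all functions are hypothetical stand-ins,
# not the MeanFlow policy or critic used in the paper.

def decoder(state, z):
    """Frozen one-step decoder: maps (state, latent z) to an action in (-1, 1)."""
    return math.tanh(0.5 * state + z)

def decoder_grad_z(state, z):
    """d(action)/dz for the tanh decoder above."""
    a = decoder(state, z)
    return 1.0 - a * a

def critic(state, action):
    """Action-space critic Q(s, a): a fixed quadratic peaked at a = 0.3 * s."""
    return -(action - 0.3 * state) ** 2

def critic_grad_a(state, action):
    """dQ/d(action) for the quadratic critic above."""
    return -2.0 * (action - 0.3 * state)

def latent_actor(theta, state):
    """Latent actor with one scalar parameter: z = theta * s."""
    return theta * state

def mean_q(theta, states):
    """Average Q of the actions produced by actor followed by decoder."""
    return sum(critic(s, decoder(s, latent_actor(theta, s))) for s in states) / len(states)

def actor_gradient(theta, states):
    """d(mean Q)/d(theta) via the chain rule: dQ/da * da/dz * dz/dtheta."""
    total = 0.0
    for s in states:
        z = latent_actor(theta, s)
        a = decoder(s, z)
        total += critic_grad_a(s, a) * decoder_grad_z(s, z) * s
    return total / len(states)

def train(theta, states, lr=0.5, steps=200):
    """Gradient ascent on mean Q; only the latent actor parameter is updated."""
    for _ in range(steps):
        theta += lr * actor_gradient(theta, states)
    return theta

STATES = (-1.0, -0.5, 0.5, 1.0)
q_before = mean_q(0.0, STATES)
theta_star = train(0.0, STATES)
q_after = mean_q(theta_star, STATES)
```

Because the decoder stays frozen and only `theta` changes, every action the actor can reach lies in the decoder's output range. In this toy setting, that is the analogue of using the generative policy as a behavior-constrained prior.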