Platform: ROS2 Humble · Gazebo Harmonic · Ubuntu 22.04 · CUDA
A mobile manipulator that learns pick-and-place from reinforcement learning plus a scripted pre-grasp routine — no demonstrations and no motion planning in the RL loop. A differential-drive base carries a 6-DOF UR3 arm with a Robotiq 2F-85 gripper. A lightweight controller handles base approach and arm pre-positioning; TQC (Truncated Quantile Critics) then learns the manipulation stages from dense reward shaping.
| Step | Description |
|---|---|
| 1. Pre-grasp | Scripted P-controller faces robot toward object, keeps caster clear of bin wall, extends arm (shoulder=-1.7, elbow=2.0, wrist=-1.0) |
| 2. Approach | RL lowers EE to object height and closes XY distance simultaneously |
| 3. Grasp | RL positions the gripper around the pickup cube and closes fingers |
| 4. Lift | RL raises grasped object to 25cm clearance height |
| 5. Transport | RL drives base toward drop zone while holding object |
| 6. Place | RL lowers and releases object at target location |
- No motion planning — single TQC policy controls 6 arm joints + gripper + base simultaneously
- Phase-based curriculum — 5 phases with milestone bonuses (+100 to +1000) and asymmetric retreat penalties (3–4× harsher than approach reward)
- Analytical FK — UR3 DH parameters compute EE world position with zero TF latency
- Real Gazebo poses — the ros_gz dynamic_pose bridge gives ground-truth object position, no fake randomisation
- Grasp verification — object lift is checked against the real Gazebo pose for up to 30 steps before confirming success
- VecNormalize — online normalisation of all 46 obs dimensions + reward, critical for mixed-scale inputs
- Caster-aware pregrasp — front caster (r=6cm at x=+30cm from chassis) kept clear of bin wall; arm reaches over wall from spawn
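What VecNormalize contributes can be illustrated with a minimal running-statistics sketch (the `RunningNorm` class and its Welford-style update are illustrative, not the project's code): each observation dimension is normalised against an online mean/variance estimate and clipped, and the statistics can be frozen for evaluation.

```python
import numpy as np

class RunningNorm:
    """Minimal sketch of VecNormalize-style online observation
    normalisation: per-dimension running mean/variance, clipped output."""
    def __init__(self, dim, clip=10.0, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps
        self.clip = clip
        self.training = True  # set False to freeze stats for eval

    def __call__(self, obs):
        if self.training:
            # Welford-style incremental mean/variance update
            self.count += 1
            delta = obs - self.mean
            self.mean += delta / self.count
            self.var += (delta * (obs - self.mean) - self.var) / self.count
        z = (obs - self.mean) / np.sqrt(self.var + 1e-8)
        return np.clip(z, -self.clip, self.clip)
```

Mixed-scale inputs (radians, metres, rad/s) all land in roughly the same unit range, which is what makes a single shared network trainable on the 46-dim vector.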
Upgraded from SAC (SAC best reward: −328). TQC distributes return estimates across multiple quantile networks and truncates the top quantiles before Bellman updates — the pessimistic bias suppresses Q-value overestimation in contact-rich manipulation, producing more stable and consistent grasping behaviour.
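The truncation step can be sketched in a few lines (a simplified scalar version: real TQC keeps the surviving quantiles as a target distribution rather than averaging them, and `tqc_target` is an illustrative name):

```python
import numpy as np

def tqc_target(reward, gamma, next_quantiles, drop_per_net=2):
    """Sketch of TQC's truncated Bellman target: pool every critic's
    quantile estimates of the next state, sort them, drop the top
    drop_per_net * n_critics atoms (the most optimistic ones), and
    bootstrap from the truncated mean."""
    n_critics = next_quantiles.shape[0]
    pooled = np.sort(next_quantiles.reshape(-1))
    kept = pooled[: len(pooled) - drop_per_net * n_critics]
    return reward + gamma * kept.mean()
```

Because the dropped atoms are always the largest, the target is biased low, which is exactly the pessimism that counters Q-value overestimation from noisy contact dynamics.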
| Hyperparameter | Value | Rationale |
|---|---|---|
| Policy network | [512, 512, 512] | Deeper than default [256, 256] — maps complex 46-dim obs to 9-dim action |
| gradient_steps | 4 | 4 updates per env step — higher sample efficiency |
| buffer_size | 500 000 | Replay buffer for off-policy learning |
| batch_size | 512 | Large batch for stable gradients |
| gamma | 0.99 | Long-horizon discounting for multi-phase task |
| learning_starts | 1 000 | Random exploration before first update |
| VecNormalize | clip_obs=10 | Normalises obs online; eval env uses frozen stats |
| top_quantiles_to_drop | 2 | Conservative Q-targets for manipulation stability |
| Field | Dim | Notes |
|---|---|---|
| joint_positions | 6 | UR3 arm angles (rad) |
| joint_velocities | 6 | UR3 arm speeds (rad/s) |
| finger_position | 1 | 0 = open, ~0.8 = fully closed |
| ee_pos | 3 | EE XYZ in world frame via DH FK |
| obj_pos | 3 | Object XYZ from Gazebo dynamic_pose bridge |
| ee_to_obj | 3 | Direct tracking vector from EE to object |
| ee_to_target | 3 | Vector from EE to placement target |
| obj_to_target | 3 | Vector from object to placement target |
| obj_in_base | 3 | Object position expressed in base frame |
| gripper_error | 1 | Error to desired open/closed gripper state |
| object_grasped | 1 | Binary — updated by grasp verification |
| current_phase | 1 | Integer phase (1–5) |
| base_pose | 3 | Base x, y, heading θ from odometry |
| prev_action | 9 | Previous action for temporal smoothing/context |
All quantities share the world frame — no mixed-frame distance bugs.
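A sketch of how such an observation vector might be assembled (function and argument names are illustrative, not the project's code); deriving every relative vector from world-frame points is what rules out mixed-frame distance bugs:

```python
import numpy as np

def build_obs(q, qd, finger, ee, obj, target, obj_in_base,
              grip_err, grasped, phase, base_pose, prev_action):
    """Illustrative assembly of the 46-dim observation from the table:
    all relative vectors are differences of world-frame points."""
    obs = np.concatenate([
        q, qd, [finger],
        ee, obj,
        obj - ee,       # ee_to_obj
        target - ee,    # ee_to_target
        target - obj,   # obj_to_target
        obj_in_base,
        [grip_err, grasped, phase],
        base_pose, prev_action,
    ])
    assert obs.shape == (46,)
    return obs
```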
| Field | Dim | Notes |
|---|---|---|
| joint_deltas | 6 | Position delta per arm joint; max ±0.25 rad/step, P-controlled to velocity |
| gripper | 1 | >0 → close at 0.5 rad/s, <0 → open |
| base_linear | 1 | Forward speed (×0.5 m/s); zeroed in phases 1–3 |
| base_angular | 1 | Turn speed (×1.0 rad/s); zeroed in phases 1–3 |
Position-delta control (not raw velocity) gives a stable zero-action baseline — the arm holds still when the policy outputs 0.
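A minimal sketch of this action mapping (the `KP` gain and function name are assumptions, not values from the source; the ±0.25 rad/step clamp is from the action table):

```python
import numpy as np

MAX_DELTA = 0.25  # rad per step, from the action table above
KP = 4.0          # illustrative P gain (assumption)

def arm_command(action, q_current):
    """Position-delta control sketch: the six joint actions become
    position targets near the current pose, tracked by a P-controller
    in velocity space. A zero action yields a zero velocity command,
    so the arm holds still by default."""
    q_target = q_current + np.clip(action, -1.0, 1.0) * MAX_DELTA
    vel_cmd = KP * (q_target - q_current)
    return q_target, vel_cmd
```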
| Phase | Goal | Reward Signal | Transition Condition |
|---|---|---|---|
| 1 | Lower EE to grasp height + close XY to object | Δdist × 100 approach / × 300 retreat | dist_z < 4 cm AND dist_xy < 6 cm |
| 2 | Reach object and close gripper | Δdist × 80 / × 320 + proximity/touch bonuses | Gripper > 0.7 AND dist < 4 cm |
| 3 | Lift object to 25 cm | Δheight × 100 / × 200 | EE height within 5 cm of 25 cm |
| 4 | Transport to drop zone | Δdist × 50 base + arm | EE within 15 cm of target XY |
| 5 | Lower and release | Δdist × 50 | EE within 8 cm, gripper open |
Milestone bonuses: +100 (phases 1, 3, 4, 5), +1000 (grasp success at phase 2).
Phase 1 (approach)
- Δ(dist_z + dist_xy) × 100 approach | × 300 retreat (3× harsher)
- proximity bonus: +5 × (1 − dist_xy/0.15) when dist_xy < 15 cm
- z-align bonus: +4 × (1 − dist_z/0.05) when dist_z < 5 cm
- dual-close bonus: +8 flat when both xy < 8 cm AND z < 5 cm
- gripper-close penalty: −2/step if gripper closed during approach

Phase 2 (grasp)
- Δdist_to_grasp_target × 80 approach | × 320 retreat (4× harsher)
- proximity bonus: +8 × (1 − dist/0.10) when dist < 10 cm
- touch-range bonus: +15 × (1 − dist/0.04) when dist < 4 cm
- very-close bonus: +10 flat when dist < 3 cm
- XY-align bonus: +10 × (1 − xy/0.05) when xy < 5 cm
- Z-align bonus: +8 × (1 − z/0.04) when z < 4 cm
- dual-align bonus: +12 flat when xy < 4 cm AND z < 3 cm
- gripper-close bonus: +8 × gripper_pos when closing in true grasp range
- open-near-object penalty: −5 when gripper opens inside 5 cm
- wrong-close penalty: −(0.5 + dist × 5) when closing far away
- high-close penalty: −10 × z_dist when closing while vertically misaligned
- wrist orientation: −|wrist_2_angle| × 0.3 (keep gripper horizontal)

Safety / global
- out-of-bounds: −500 + terminate if base > 1.5 m from object
- EE underground: −500 + terminate (phase ∉ {1, 2, 5})
- high joint velocity: −500 + terminate if any joint vel > 10 rad/s
- misalignment penalty: −|angle_err| × 0.2 if base not facing object in phases 1–3
- action smoothness: −0.01 × Σ|joint_vels| every step
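The asymmetric Δ-distance term at the core of each phase can be sketched as follows (phase-1 gains from the table above; the function name is illustrative):

```python
def phase1_distance_reward(prev_dist, dist, k_approach=100.0, k_retreat=300.0):
    """Asymmetric distance shaping sketch: progress toward the object
    pays k_approach per metre, but the same motion away costs 3x more,
    so an oscillating back-and-forth policy is net-negative."""
    delta = prev_dist - dist  # positive when the EE got closer
    return k_approach * delta if delta > 0 else k_retreat * delta
```

With symmetric gains, jittering toward and away from the object would be reward-neutral; the 3–4× retreat multiplier makes the only profitable strategy monotone approach.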
```
Episode reset()
├── Randomise object XY ±4 cm (domain randomisation)
├── Scripted pre-grasp (P-controller, up to 300 steps):
│     · Turn base to face object
│     · Drive forward only while chassis_x ≤ 0.04 m
│       (keeps 6 cm front caster clear of bin back wall at x ≈ 0.40 m)
│     · Extend arm: pan=0, shoulder=-1.7, elbow=2.0, wrist_1=-1.0
│     · Break when EE within 30 cm XY of object
└── Hand off to RL at phase 1
```
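The scripted routine above might look roughly like this (the 0.04 m stop line is from the design; the turn gain, heading gate, forward speed, and function name are assumptions):

```python
import numpy as np

X_STOP = 0.04    # chassis x limit that keeps the caster off the bin wall
KP_TURN = 2.0    # illustrative P gain (assumption)

def pregrasp_cmd(base_pose, obj_xy):
    """Scripted pre-grasp sketch: P-control on heading error toward the
    object, with forward motion gated on both facing the object and the
    caster-safe chassis x limit."""
    x, y, theta = base_pose
    err = np.arctan2(obj_xy[1] - y, obj_xy[0] - x) - theta
    err = (err + np.pi) % (2.0 * np.pi) - np.pi   # wrap to [-pi, pi)
    angular = np.clip(KP_TURN * err, -1.0, 1.0)
    linear = 0.2 if (abs(err) < 0.15 and x <= X_STOP) else 0.0
    return linear, angular
```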
```
RL step() (~40 Hz)
├── Spin ROS node (joint_states, odom, dynamic_pose)
├── Compute DH FK → EE world position
├── Execute action (position-delta arm + gripper + base)
├── Compute phase reward + check transitions
└── Return (obs, reward, terminated, truncated)
```
FK pipeline: UR3 DH parameters compute EE analytically. A 180° yaw on base_link_inertia means FK output is flipped (x→−x, y→−y) before adding the arm mount offset, giving EE in the chassis frame and then world frame via odometry.
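A sketch of the analytical FK using the published UR3 DH parameters (classic DH convention; the 180° yaw flip and arm mount offset described above are left out, so this returns the EE in the arm base frame):

```python
import numpy as np

# Published UR3 DH parameters as (d, a, alpha); theta comes from joint_states
UR3_DH = [
    (0.1519,  0.0,      np.pi / 2),
    (0.0,    -0.24365,  0.0),
    (0.0,    -0.21325,  0.0),
    (0.11235, 0.0,      np.pi / 2),
    (0.08535, 0.0,     -np.pi / 2),
    (0.0819,  0.0,      0.0),
]

def ur3_fk(q):
    """Chain the six classic DH transforms and read the EE position
    from the translation column of the final matrix."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(q, UR3_DH):
        ct, st = np.cos(theta), np.sin(theta)
        ca, sa = np.cos(alpha), np.sin(alpha)
        T = T @ np.array([[ct, -st * ca,  st * sa, a * ct],
                          [st,  ct * ca, -ct * sa, a * st],
                          [0.0,      sa,       ca,      d],
                          [0.0,     0.0,      0.0,    1.0]])
    return T[:3, 3]
```

Because this is a handful of 4×4 multiplies per step, it avoids waiting on the TF tree entirely, which is the zero-latency property the observation pipeline relies on.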
```bash
# Clone
git clone https://github.com/darshmenon/pickplace-rl-mobile.git
cd pickplace-rl-mobile

# ROS
source /opt/ros/humble/setup.bash

# Python deps
pip install stable-baselines3 sb3-contrib gymnasium tensorboard

# Build the workspace
colcon build --packages-select pickplace_rl_mobile --symlink-install
source install/setup.bash
```

If you open a new terminal later, run:
```bash
cd /path/to/pickplace-rl-mobile
source /opt/ros/humble/setup.bash
source install/setup.bash
```

```bash
# Launch with Gazebo GUI (watch the robot)
bash src/pickplace_rl_mobile/launch/run_rl_training.sh

# Launch headless — no GUI window, ~3–4× faster FPS
bash src/pickplace_rl_mobile/launch/run_rl_training.sh --headless

# Resume from best checkpoint, GUI
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip

# Resume from best checkpoint, headless (recommended for long runs)
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip --headless

# Resume from the latest numbered checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/pickplace_model_580000_steps.zip --headless
```

This is the recommended launch path for RL work because it starts Gazebo and the trainer with the wiring used by the current checkpoints.
```bash
# Fresh run
bash src/pickplace_rl_mobile/launch/run_rl_training.sh --headless

# Resume from the best checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip --headless

# Resume from the latest numbered checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/pickplace_model_580000_steps.zip --headless
```

```bash
ros2 launch pickplace_rl_mobile gazebo.launch.py

# Headless Gazebo only
ros2 launch pickplace_rl_mobile gazebo.launch.py headless:=true
```

Use this only if Gazebo is already running in another terminal:

```bash
ros2 launch pickplace_rl_mobile rl_train.launch.py

# Resume from a saved checkpoint
ros2 launch pickplace_rl_mobile rl_train.launch.py load_model:=./rl_models/best_model.zip
```

This path is useful for the broader stack, but the RL training workflow above is the main maintained path for checkpointed learning.
```bash
ros2 launch pickplace_rl_mobile full_system.launch.py

# Full system with RL inference node
ros2 launch pickplace_rl_mobile full_system.launch.py use_rl:=true model_path:=./rl_models/best_model.zip

# Full system with Nav2
ros2 launch pickplace_rl_mobile full_system.launch.py use_nav2:=true
```

```bash
ros2 launch pickplace_rl_mobile vla_full_pipeline.launch.py

# Lighter fallback mode without the LLM or OWLv2
ros2 launch pickplace_rl_mobile vla_full_pipeline.launch.py use_llm:=false use_owlv2:=false
```

```bash
ros2 launch pickplace_rl_mobile display_launch.py
```

```bash
# Watch live rewards (log written by ros2 launch)
grep -a "ep_rew_mean\|fps" /tmp/training.log | tail -10

# TensorBoard
tensorboard --logdir ./rl_models/tensorboard
# open http://localhost:6006

# Kill everything cleanly
pkill -9 -f "gz|ros2|train_rl|parameter_bridge"

# Check EE and object pose live
ros2 topic echo /joint_states --once
ros2 topic echo /odom --once
ros2 topic echo /world/pickplace_world/dynamic_pose/info --once
```

Checkpoints save every 10 k steps to `./rl_models/`. `best_model.zip` updates automatically on eval improvement. VecNormalize stats save to `./rl_models/vecnormalize.pkl` and replay data to `./rl_models/replay_buffer.pkl`; both are reused automatically on resume when compatible.
| Artifact | Value |
|---|---|
| Latest numbered checkpoint | rl_models/pickplace_model_580000_steps.zip |
| Best eval checkpoint | rl_models/best_model.zip and rl_models/best_model/best_model.zip |
| Latest eval file | rl_models/evaluations.npz |
| Last recorded eval step | 580000 |
| Last recorded mean eval reward | about -7835.72 |
| Best recorded mean eval reward | about -776.75 |
These numbers show that training runs and checkpoints save correctly, but the policy does not yet consistently solve the full task; this README reflects that status rather than overstating convergence.
The current trainer resumes from checkpoints, restores VecNormalize stats, reloads the replay buffer when possible, and auto-detects legacy 27-dim checkpoints versus the current 46-dim observation mode. Key improvements over the older SAC setup:
| Change | Impact |
|---|---|
| SAC → TQC | Replaced the older SAC baseline with a more stable critic ensemble |
| Scripted pre-grasp | Deterministic base approach frees RL to focus on manipulation |
| Caster-aware driving | Stops chassis at x ≤ 4 cm to avoid bin wall collision |
| Arm pre-extension | Pregrasp sets shoulder/elbow/wrist to face object; EE ≤ 30 cm from target |
| Asymmetric penalties | 3–4× harsher retreat vs approach; prevents oscillating policy |
| Equal XY+Z weight (phase 1) | Was 0.5× XY; now 1.0× so agent approaches horizontally and vertically together |
| VecNormalize | Normalises mixed-scale 46-dim obs; critical for stable TQC training |
| Network [512,512,512] | Larger than default [256,256]; better function approximation |
| gradient_steps=4 | 4× the default updates per env step; faster convergence |
| Verified grasp reward +1000 | Only awarded once the real Gazebo object actually lifts |
| Grasp verification | Real-object lift verification over a 30-step window prevents reward hacking |
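The verification window described above can be sketched as a tiny state machine (the 30-step window is from the design; the 3 cm lift threshold and class name are assumptions):

```python
VERIFY_STEPS = 30   # window length from the design above
LIFT_THRESH = 0.03  # metres of real lift required (illustrative value)

class GraspVerifier:
    """Sketch of lift verification against the ground-truth Gazebo pose:
    the grasp bonus is only confirmed if the object's real z rises by a
    threshold within the window, blocking reward hacking on a gripper
    that merely looks closed around nothing."""
    def __init__(self, obj_z_at_close):
        self.z0 = obj_z_at_close
        self.steps = 0

    def update(self, obj_z):
        self.steps += 1
        if obj_z - self.z0 > LIFT_THRESH:
            return True       # verified grasp
        if self.steps >= VERIFY_STEPS:
            return False      # window expired without a real lift
        return None           # still checking
```

Tying the +1000 bonus to the simulator's ground-truth pose rather than to gripper state is what removes the "squeeze air near the cube" exploit.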
See CONCEPTS.md for deep-dives on every technique: TQC · Phase curriculum · Potential-based reward shaping · Hierarchical control (scripted + RL) · Position-delta control · DH forward kinematics · Grasp verification · Domain randomisation · VecNormalize · Replay buffer
Darsh Menon — [email protected] · GitHub: @darshmenon
