Platform: ROS2 Humble · Gazebo Harmonic · Ubuntu 22.04 · CUDA
A mobile manipulator that learns pick-and-place from reinforcement learning plus a scripted pre-grasp routine — no demonstrations and no motion planning in the RL loop. A differential-drive base carries a 6-DOF UR3 arm with a Robotiq 2F-85 gripper. A lightweight controller handles base approach and arm pre-positioning; TQC (Truncated Quantile Critics) then learns the manipulation stages from dense reward shaping.
| Step | Description |
|---|---|
| 1. Pre-grasp | Scripted P-controller faces robot toward object, keeps caster clear of bin wall, extends arm (shoulder=-1.7, elbow=2.0, wrist=-1.0) |
| 2. Approach | RL lowers EE to object height and closes XY distance simultaneously |
| 3. Grasp | RL positions the gripper around the pickup cube and closes fingers |
| 4. Lift | RL raises grasped object to 25cm clearance height |
| 5. Transport | RL drives base toward drop zone while holding object |
| 6. Place | RL lowers and releases object at target location |
- No motion planning — single TQC policy controls 6 arm joints + gripper + base simultaneously
- Phase-based curriculum — 5 phases with milestone bonuses (+100 to +1000) and asymmetric retreat penalties (3–4× harsher than approach reward)
- Analytical FK — UR3 DH parameters compute EE world position with zero TF latency
- Real Gazebo poses — the ros_gz dynamic_pose bridge gives ground-truth object position, no fake randomisation
- Grasp verification — object lift is checked against the real Gazebo pose for up to 30 steps before confirming success
- VecNormalize — online normalisation of all 46 obs dimensions + reward, critical for mixed-scale inputs
- Caster-aware pregrasp — front caster (r=6cm at x=+30cm from chassis) kept clear of bin wall; arm reaches over wall from spawn
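What VecNormalize contributes can be illustrated with a minimal running-statistics sketch (the `RunningNorm` class and its Welford-style update are illustrative, not the project's code): each observation dimension is normalised against an online mean/variance estimate and clipped, and the statistics can be frozen for evaluation.

```python
import numpy as np

class RunningNorm:
    """Minimal sketch of VecNormalize-style online observation
    normalisation: per-dimension running mean/variance, clipped output."""
    def __init__(self, dim, clip=10.0, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps
        self.clip = clip
        self.training = True  # set False to freeze stats for eval

    def __call__(self, obs):
        if self.training:
            # Welford-style incremental mean/variance update
            self.count += 1
            delta = obs - self.mean
            self.mean += delta / self.count
            self.var += (delta * (obs - self.mean) - self.var) / self.count
        z = (obs - self.mean) / np.sqrt(self.var + 1e-8)
        return np.clip(z, -self.clip, self.clip)
```

Mixed-scale inputs (radians, metres, rad/s) all land in roughly the same unit range, which is what makes a single shared network trainable on the 46-dim vector.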
Upgraded from SAC (SAC best reward: −328). TQC distributes return estimates across multiple quantile networks and truncates the top quantiles before Bellman updates — the pessimistic bias suppresses Q-value overestimation in contact-rich manipulation, producing more stable and consistent grasping behaviour.
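The truncation step can be sketched in a few lines (a simplified scalar version: real TQC keeps the surviving quantiles as a target distribution rather than averaging them, and `tqc_target` is an illustrative name):

```python
import numpy as np

def tqc_target(reward, gamma, next_quantiles, drop_per_net=2):
    """Sketch of TQC's truncated Bellman target: pool every critic's
    quantile estimates of the next state, sort them, drop the top
    drop_per_net * n_critics atoms (the most optimistic ones), and
    bootstrap from the truncated mean."""
    n_critics = next_quantiles.shape[0]
    pooled = np.sort(next_quantiles.reshape(-1))
    kept = pooled[: len(pooled) - drop_per_net * n_critics]
    return reward + gamma * kept.mean()
```

Because the dropped atoms are always the largest, the target is biased low, which is exactly the pessimism that counters Q-value overestimation from noisy contact dynamics.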
| Hyperparameter | Value | Rationale |
|---|---|---|
| Policy network | [512, 512, 512] | Deeper than default [256, 256] — maps complex 46-dim obs to 9-dim action |
| gradient_steps | 4 | 4 updates per env step — higher sample efficiency |
| buffer_size | 500 000 | Replay buffer for off-policy learning |
| batch_size | 512 | Large batch for stable gradients |
| gamma | 0.99 | Long-horizon discounting for multi-phase task |
| learning_starts | 1 000 | Random exploration before first update |
| VecNormalize | clip_obs=10 | Normalises obs online; eval env uses frozen stats |
| top_quantiles_to_drop | 2 | Conservative Q-targets for manipulation stability |
| Field | Dim | Notes |
|---|---|---|
| joint_positions | 6 | UR3 arm angles (rad) |
| joint_velocities | 6 | UR3 arm speeds (rad/s) |
| finger_position | 1 | 0 = open, ~0.8 = fully closed |
| ee_pos | 3 | EE XYZ in world frame via DH FK |
| obj_pos | 3 | Object XYZ from Gazebo dynamic_pose bridge |
| ee_to_obj | 3 | Direct tracking vector from EE to object |
| ee_to_target | 3 | Vector from EE to placement target |
| obj_to_target | 3 | Vector from object to placement target |
| obj_in_base | 3 | Object position expressed in base frame |
| gripper_error | 1 | Error to desired open/closed gripper state |
| object_grasped | 1 | Binary — updated by grasp verification |
| current_phase | 1 | Integer phase (1–5) |
| base_pose | 3 | Base x, y, heading θ from odometry |
| prev_action | 9 | Previous action for temporal smoothing/context |
All quantities share the world frame — no mixed-frame distance bugs.
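A sketch of how such an observation vector might be assembled (function and argument names are illustrative, not the project's code); deriving every relative vector from world-frame points is what rules out mixed-frame distance bugs:

```python
import numpy as np

def build_obs(q, qd, finger, ee, obj, target, obj_in_base,
              grip_err, grasped, phase, base_pose, prev_action):
    """Illustrative assembly of the 46-dim observation from the table:
    all relative vectors are differences of world-frame points."""
    obs = np.concatenate([
        q, qd, [finger],
        ee, obj,
        obj - ee,       # ee_to_obj
        target - ee,    # ee_to_target
        target - obj,   # obj_to_target
        obj_in_base,
        [grip_err, grasped, phase],
        base_pose, prev_action,
    ])
    assert obs.shape == (46,)
    return obs
```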
| Field | Dim | Notes |
|---|---|---|
| joint_deltas | 6 | Position delta per arm joint; max ±0.25 rad/step, P-controlled to velocity |
| gripper | 1 | >0 → close at 0.5 rad/s, <0 → open |
| base_linear | 1 | Forward speed (×0.5 m/s); zeroed in phases 1–3 |
| base_angular | 1 | Turn speed (×1.0 rad/s); zeroed in phases 1–3 |
Position-delta control (not raw velocity) gives a stable zero-action baseline — the arm holds still when the policy outputs 0.
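A minimal sketch of this action mapping (the `KP` gain and function name are assumptions, not values from the source; the ±0.25 rad/step clamp is from the action table):

```python
import numpy as np

MAX_DELTA = 0.25  # rad per step, from the action table above
KP = 4.0          # illustrative P gain (assumption)

def arm_command(action, q_current):
    """Position-delta control sketch: the six joint actions become
    position targets near the current pose, tracked by a P-controller
    in velocity space. A zero action yields a zero velocity command,
    so the arm holds still by default."""
    q_target = q_current + np.clip(action, -1.0, 1.0) * MAX_DELTA
    vel_cmd = KP * (q_target - q_current)
    return q_target, vel_cmd
```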
| Phase | Goal | Reward Signal | Transition Condition |
|---|---|---|---|
| 1 | Lower EE to grasp height + close XY to object | Δdist × 100 approach / × 300 retreat | dist_z < 4 cm AND dist_xy < 6 cm |
| 2 | Reach object and close gripper | Δdist × 80 / × 320 + proximity/touch bonuses | Gripper > 0.7 AND dist < 4 cm |
| 3 | Lift object to 25 cm | Δheight × 100 / × 200 | EE height within 5 cm of 25 cm |
| 4 | Transport to drop zone | Δdist × 50 base + arm | EE within 15 cm of target XY |
| 5 | Lower and release | Δdist × 50 | EE within 8 cm, gripper open |
Milestone bonuses: +100 (phases 1, 3, 4, 5), +1000 (grasp success at phase 2).
Phase 1 (approach)
- Δ(dist_z + dist_xy) × 100 approach | × 300 retreat (3× harsher)
- proximity bonus: +5 × (1 − dist_xy/0.15) when dist_xy < 15 cm
- z-align bonus: +4 × (1 − dist_z/0.05) when dist_z < 5 cm
- dual-close bonus: +8 flat when both xy < 8 cm AND z < 5 cm
- gripper-close penalty: −2/step if gripper closed during approach

Phase 2 (grasp)
- Δdist_to_grasp_target × 80 approach | × 320 retreat (4× harsher)
- proximity bonus: +8 × (1 − dist/0.10) when dist < 10 cm
- touch-range bonus: +15 × (1 − dist/0.04) when dist < 4 cm
- very-close bonus: +10 flat when dist < 3 cm
- XY-align bonus: +10 × (1 − xy/0.05) when xy < 5 cm
- Z-align bonus: +8 × (1 − z/0.04) when z < 4 cm
- dual-align bonus: +12 flat when xy < 4 cm AND z < 3 cm
- gripper-close bonus: +8 × gripper_pos when closing in true grasp range
- open-near-object penalty: −5 when gripper opens inside 5 cm
- wrong-close penalty: −(0.5 + dist × 5) when closing far away
- high-close penalty: −10 × z_dist when closing while vertically misaligned
- wrist orientation: −|wrist_2_angle| × 0.3 (keep gripper horizontal)

Safety / global
- out-of-bounds: −500 + terminate if base > 1.5 m from object
- EE underground: −500 + terminate (phase ∉ {1, 2, 5})
- high joint velocity: −500 + terminate if any joint vel > 10 rad/s
- misalignment penalty: −|angle_err| × 0.2 if base not facing object in phases 1–3
- action smoothness: −0.01 × Σ|joint_vels| every step
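The asymmetric Δ-distance term at the core of each phase can be sketched as follows (phase-1 gains from the table above; the function name is illustrative):

```python
def phase1_distance_reward(prev_dist, dist, k_approach=100.0, k_retreat=300.0):
    """Asymmetric distance shaping sketch: progress toward the object
    pays k_approach per metre, but the same motion away costs 3x more,
    so an oscillating back-and-forth policy is net-negative."""
    delta = prev_dist - dist  # positive when the EE got closer
    return k_approach * delta if delta > 0 else k_retreat * delta
```

With symmetric gains, jittering toward and away from the object would be reward-neutral; the 3–4× retreat multiplier makes the only profitable strategy monotone approach.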
```
Episode reset()
├── Randomise object XY ±4 cm (domain randomisation)
├── Scripted pre-grasp (P-controller, up to 300 steps):
│     · Turn base to face object
│     · Drive forward only while chassis_x ≤ 0.04 m
│       (keeps 6 cm front caster clear of bin back wall at x ≈ 0.40 m)
│     · Extend arm: pan=0, shoulder=-1.7, elbow=2.0, wrist_1=-1.0
│     · Break when EE within 30 cm XY of object
└── Hand off to RL at phase 1
```
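The scripted routine above might look roughly like this (the 0.04 m stop line is from the design; the turn gain, heading gate, forward speed, and function name are assumptions):

```python
import numpy as np

X_STOP = 0.04    # chassis x limit that keeps the caster off the bin wall
KP_TURN = 2.0    # illustrative P gain (assumption)

def pregrasp_cmd(base_pose, obj_xy):
    """Scripted pre-grasp sketch: P-control on heading error toward the
    object, with forward motion gated on both facing the object and the
    caster-safe chassis x limit."""
    x, y, theta = base_pose
    err = np.arctan2(obj_xy[1] - y, obj_xy[0] - x) - theta
    err = (err + np.pi) % (2.0 * np.pi) - np.pi   # wrap to [-pi, pi)
    angular = np.clip(KP_TURN * err, -1.0, 1.0)
    linear = 0.2 if (abs(err) < 0.15 and x <= X_STOP) else 0.0
    return linear, angular
```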
```
RL step() (~40 Hz)
├── Spin ROS node (joint_states, odom, dynamic_pose)
├── Compute DH FK → EE world position
├── Execute action (position-delta arm + gripper + base)
├── Compute phase reward + check transitions
└── Return (obs, reward, terminated, truncated)
```
FK pipeline: UR3 DH parameters compute EE analytically. A 180° yaw on base_link_inertia means FK output is flipped (x→−x, y→−y) before adding the arm mount offset, giving EE in the chassis frame and then world frame via odometry.
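A sketch of the analytical FK using the published UR3 DH parameters (classic DH convention; the 180° yaw flip and arm mount offset described above are left out, so this returns the EE in the arm base frame):

```python
import numpy as np

# Published UR3 DH parameters as (d, a, alpha); theta comes from joint_states
UR3_DH = [
    (0.1519,  0.0,      np.pi / 2),
    (0.0,    -0.24365,  0.0),
    (0.0,    -0.21325,  0.0),
    (0.11235, 0.0,      np.pi / 2),
    (0.08535, 0.0,     -np.pi / 2),
    (0.0819,  0.0,      0.0),
]

def ur3_fk(q):
    """Chain the six classic DH transforms and read the EE position
    from the translation column of the final matrix."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(q, UR3_DH):
        ct, st = np.cos(theta), np.sin(theta)
        ca, sa = np.cos(alpha), np.sin(alpha)
        T = T @ np.array([[ct, -st * ca,  st * sa, a * ct],
                          [st,  ct * ca, -ct * sa, a * st],
                          [0.0,      sa,       ca,      d],
                          [0.0,     0.0,      0.0,    1.0]])
    return T[:3, 3]
```

Because this is a handful of 4×4 multiplies per step, it avoids waiting on the TF tree entirely, which is the zero-latency property the observation pipeline relies on.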
```bash
# Clone
git clone https://github.com/darshmenon/pickplace-rl-mobile.git
cd pickplace-rl-mobile

# ROS
source /opt/ros/humble/setup.bash

# Python deps
pip install stable-baselines3 sb3-contrib gymnasium tensorboard

# Build the workspace
colcon build --packages-select pickplace_rl_mobile --symlink-install
source install/setup.bash
```

If you open a new terminal later, run:
```bash
cd /path/to/pickplace-rl-mobile
source /opt/ros/humble/setup.bash
source install/setup.bash
```

```bash
# Launch with Gazebo GUI (watch the robot)
bash src/pickplace_rl_mobile/launch/run_rl_training.sh

# Launch headless — no GUI window, ~3–4× faster FPS
bash src/pickplace_rl_mobile/launch/run_rl_training.sh --headless

# Resume from best checkpoint, GUI
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip

# Resume from best checkpoint, headless (recommended for long runs)
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip --headless

# Resume from the latest numbered checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/pickplace_model_580000_steps.zip --headless
```

This is the recommended launch path for RL work because it starts Gazebo and the trainer with the wiring used by the current checkpoints.
```bash
# Fresh run
bash src/pickplace_rl_mobile/launch/run_rl_training.sh --headless

# Resume from the best checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/best_model.zip --headless

# Resume from the latest numbered checkpoint
bash src/pickplace_rl_mobile/launch/run_rl_training.sh ./rl_models/pickplace_model_580000_steps.zip --headless
```

```bash
ros2 launch pickplace_rl_mobile gazebo.launch.py

# Headless Gazebo only
ros2 launch pickplace_rl_mobile gazebo.launch.py headless:=true
```

Use this only if Gazebo is already running in another terminal:

```bash
ros2 launch pickplace_rl_mobile rl_train.launch.py

# Resume from a saved checkpoint
ros2 launch pickplace_rl_mobile rl_train.launch.py load_model:=./rl_models/best_model.zip
```

This path is useful for the broader stack, but the RL training workflow above is the main maintained path for checkpointed learning.
```bash
ros2 launch pickplace_rl_mobile full_system.launch.py

# Full system with RL inference node
ros2 launch pickplace_rl_mobile full_system.launch.py use_rl:=true model_path:=./rl_models/best_model.zip

# Full system with Nav2
ros2 launch pickplace_rl_mobile full_system.launch.py use_nav2:=true
```

```bash
ros2 launch pickplace_rl_mobile vla_full_pipeline.launch.py

# Lighter fallback mode without the LLM or OWLv2
ros2 launch pickplace_rl_mobile vla_full_pipeline.launch.py use_llm:=false use_owlv2:=false
```

```bash
ros2 launch pickplace_rl_mobile display_launch.py
```

```bash
# Watch live rewards (log written by ros2 launch)
grep -a "ep_rew_mean\|fps" /tmp/training.log | tail -10

# TensorBoard
tensorboard --logdir ./rl_models/tensorboard
# open http://localhost:6006

# Kill everything cleanly
pkill -9 -f "gz|ros2|train_rl|parameter_bridge"

# Check EE and object pose live
ros2 topic echo /joint_states --once
ros2 topic echo /odom --once
ros2 topic echo /world/pickplace_world/dynamic_pose/info --once
```

Checkpoints save every 10 k steps to `./rl_models/`. `best_model.zip` updates automatically on eval improvement. VecNormalize stats save to `./rl_models/vecnormalize.pkl` and replay data to `./rl_models/replay_buffer.pkl`; both are reused automatically on resume when compatible.
| Artifact | Value |
|---|---|
| Latest numbered checkpoint | rl_models/pickplace_model_580000_steps.zip |
| Best eval checkpoint | rl_models/best_model.zip and rl_models/best_model/best_model.zip |
| Latest eval file | rl_models/evaluations.npz |
| Last recorded eval step | 580000 |
| Last recorded mean eval reward | about -7835.72 |
| Best recorded mean eval reward | about -776.75 |
These numbers show that training runs and checkpoints save correctly, but the policy does not yet consistently solve the full task; this README reflects that status rather than overstating convergence.
The current trainer resumes from checkpoints, restores VecNormalize stats, reloads the replay buffer when possible, and auto-detects legacy 27-dim checkpoints versus the current 46-dim observation mode. Key improvements over the older SAC setup:
| Change | Impact |
|---|---|
| SAC → TQC | Replaced the older SAC baseline with a more stable critic ensemble |
| Scripted pre-grasp | Deterministic base approach frees RL to focus on manipulation |
| Caster-aware driving | Stops chassis at x ≤ 4 cm to avoid bin wall collision |
| Arm pre-extension | Pregrasp sets shoulder/elbow/wrist to face object; EE ≤ 30 cm from target |
| Asymmetric penalties | 3–4× harsher retreat vs approach; prevents oscillating policy |
| Equal XY+Z weight (phase 1) | Was 0.5× XY; now 1.0× so agent approaches horizontally and vertically together |
| VecNormalize | Normalises mixed-scale 46-dim obs; critical for stable TQC training |
| Network [512,512,512] | Larger than default [256,256]; better function approximation |
| gradient_steps=4 | 4× the default updates per env step; faster convergence |
| Verified grasp reward +1000 | Only awarded once the real Gazebo object actually lifts |
| Grasp verification | Real-object lift verification over a 30-step window prevents reward hacking |
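The verification window described above can be sketched as a tiny state machine (the 30-step window is from the design; the 3 cm lift threshold and class name are assumptions):

```python
VERIFY_STEPS = 30   # window length from the design above
LIFT_THRESH = 0.03  # metres of real lift required (illustrative value)

class GraspVerifier:
    """Sketch of lift verification against the ground-truth Gazebo pose:
    the grasp bonus is only confirmed if the object's real z rises by a
    threshold within the window, blocking reward hacking on a gripper
    that merely looks closed around nothing."""
    def __init__(self, obj_z_at_close):
        self.z0 = obj_z_at_close
        self.steps = 0

    def update(self, obj_z):
        self.steps += 1
        if obj_z - self.z0 > LIFT_THRESH:
            return True       # verified grasp
        if self.steps >= VERIFY_STEPS:
            return False      # window expired without a real lift
        return None           # still checking
```

Tying the +1000 bonus to the simulator's ground-truth pose rather than to gripper state is what removes the "squeeze air near the cube" exploit.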
See CONCEPTS.md for deep-dives on every technique: TQC · Phase curriculum · Potential-based reward shaping · Hierarchical control (scripted + RL) · Position-delta control · DH forward kinematics · Grasp verification · Domain randomisation · VecNormalize · Replay buffer
Darsh Menon — [email protected] · GitHub: @darshmenon
