clawgui-rl

ClawGUI-RL: Scalable Online RL Infrastructure for GUI Agents

📚 Table of Contents

Overview
Architecture
GiGPO Algorithm
Installation
Quick Start
How to Add a New Environment
Experimental Results
Acknowledgements

📖 Overview

ClawGUI-RL is the training module of ClawGUI. It provides an open-source online RL infrastructure built specifically for GUI agents: parallel virtual environments or real physical devices, production-grade training stability, and GiGPO+PRM for step-level reward signals that go beyond what standard GRPO can offer.

Key Features:

GiGPO algorithm — Replaces standard GRPO with GiGPO+PRM for fine-grained step-level scoring via anchor-state grouping, yielding stronger policy optimization on GUI tasks.
Parallel multi-environment training — Runs dozens of Docker-based virtual Android environments simultaneously for fast, scalable data collection.
Real-device training support — Trains directly on physical Android devices (or cloud phones) using the same API as virtual environments.
Multi-model support — Out-of-the-box support for MAI-UI and GUI-Owl, with an extensible interface for other VLMs (e.g., Qwen3-VL family).
Dynamic sampling — DAPO-style filtering discards trivial samples (all-correct or all-wrong groups) before backpropagation, keeping gradients informative throughout training.
Pluggable custom context — Inject custom history, screenshots, or action spaces into the agent's observation without touching core training logic.
Environment restart & retry mechanisms — Periodic container restarts and configurable retry logic keep long training runs stable without manual intervention.
Spare server rotation — Automatically rotates across backend server URLs so a single unhealthy container never stalls training.
Episode trajectory recording & visualization — Episodes are saved to episode/; use scripts/episode_visualizer.py to replay any rollout.

🏗️ Architecture

ClawGUI-RL is built on verl with a Ray single-controller + FSDP distributed training architecture:

Ray single-controller — A single Python process orchestrates rollout workers, training workers, and environment workers. All coordination is explicit and inspectable.
FSDP training workers — Model parameters are sharded across GPUs for memory efficiency.
Hybrid vLLM engine — vLLM handles token generation during rollout; training weights are synced back via broadcast_from_vllm() after each gradient step.
Multi-turn rollout loop — Each episode is a multi-step conversation: the agent receives a screenshot, reasons, outputs an action (tap/swipe/type/terminate), and receives the next screenshot until termination or max_steps.

🧬 GiGPO Algorithm

Standard GRPO assigns a single advantage score to an entire episode, which gives no credit signal to intermediate steps. ClawGUI-RL replaces GRPO with GiGPO (Group-in-Group Policy Optimization), which estimates per-step advantages without a learned critic.

GiGPO works in three stages:

Stage 1 — Episode-level advantage. Group all rollouts by their task prompt (same UID). Within each group, normalize episode returns using the group mean and standard deviation. This is equivalent to standard GRPO and provides a coarse signal about whether the overall episode succeeded.

Stage 2 — Anchor-state grouping. Within each episode group, cluster steps that reach the same intermediate state (identified by matching screenshot hash or action prefix). Steps sharing an anchor state form a sub-group — they faced identical observations and had their futures compared under identical conditions.

Stage 3 — Step-level advantage. Within each anchor-state sub-group, apply discounted return normalization. A step that leads to a higher discounted return than its peers in the same sub-group receives a positive advantage; a step that leads to a dead end receives a negative advantage. This gives dense per-step credit without training a separate value network.

The final advantage for each token is a blend of episode-level and step-level signals, weighted by a hyperparameter λ.

PRM integration. When step_reward_judge=True, a VLM-as-judge (the Process Reward Model) evaluates each intermediate step independently: given the screenshot before and after the action, it produces a score ∈ {0, 1}. This score augments the environment reward at each step, giving the optimizer a denser training signal than the sparse binary episode-success reward alone.

🔧 Installation

conda create -n opengui-rl python=3.12 -y
conda activate opengui-rl

pip3 install vllm==0.11.0

pip3 install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

pip install datasets

pip install -e .

🚀 Quick Start

ClawGUI-RL supports two training modes: virtual environment scaling (via Docker-based MobileWorld) and real device training (via physical or cloud Android phones).

1. Virtual Environment Scaling

Step 1 — Clone OpenGUI-Server

git clone https://github.com/sugarandgugu/ClawGUI-Server

Step 2 — Set up the server

Follow the installation guide in the ClawGUI-Server repository. Key steps include:

Verify KVM support — Ensure your machine supports KVM virtualization (kvm-ok or check /dev/kvm).
Pull the Docker image — Pull the MobileWorld Android emulator image as described in the OpenGUI-Server docs.
Launch Docker containers — Start one or more containers. Each running container exposes a backend URL (e.g., http://127.0.0.1:PORT).

After this step, each container provides an API backend URL that ClawGUI-RL will use to interact with the emulated phone.

Step 3 — Register environment URLs

Fill in your container backend URLs, one per line, into:

examples/env_server/mobileworld_server.txt

The number of lines equals the number of parallel environments. More URLs = more parallel environments = faster training.

Example:

http://127.0.0.1:5000
http://127.0.0.1:5001
http://127.0.0.1:5002

Step 4 — Download training data

Download the hiyouga/geometry3k dataset to a local directory. This dataset is used by the data preprocessing pipeline as a curriculum data source.

from datasets import load_dataset
ds = load_dataset("hiyouga/geometry3k")
ds.save_to_disk("~/data/geometry3k")

If you have trouble accessing Hugging Face, use the mirror: export HF_ENDPOINT=https://hf-mirror.com before running.

Step 5 — Configure the training script

Open examples/grpo_trainer/run_mobileworld.sh and configure the parameters:

Parameter	Description
`model_path`	Path to your local model weights (e.g., MAI-UI-2B or GUI-Owl-1.5-2B)
`model_type`	Model type: `mai_ui` or `gui_owl`
`n_gpus`	Number of GPUs to use for training
`history_length`	Number of previous steps included in the agent's context
`max_steps`	Maximum steps per episode
`total_epochs`	Number of training epochs
`train_data_size`	Batch size per training step
`group_size`	Number of rollouts per sample (GRPO group size)
`adv_estimator`	Advantage estimator: `grpo` or `gigpo`
`checkpoints_path`	Local directory to save model checkpoints
`data_source_dir`	Path to the downloaded geometry3k dataset directory
`server_file`	Path to the server URL list file (default: `../env_server/mobileworld_server.txt`)
`experiment_name`	Name of the experiment (used for logging)
`step_reward_judge`	Enable PRM-based step reward judge (`True`/`False`)
`step_reward_judge_base_url`	Base URL of the step reward judge API (required if enabled)
`step_reward_judge_model_name`	Model name for the step reward judge
`env_restart_enable`	Enable periodic environment container restart (`True`/`False`)
`env_restart_every_n_steps`	Restart containers every N training steps
`env_restart_wait`	Seconds to wait after issuing a restart command

Step 6 — Install the logger

We use SwanLab as the default training logger. Install it with:

pip install swanlab

You can replace it with other loggers (e.g., wandb, tensorboard) by modifying trainer.logger in the script.

Step 7 — Launch training

Before running, make sure the following key fields in the script are correctly set:

CUDA_VISIBLE_DEVICES — Set to the GPU indices you want to use, e.g. export CUDA_VISIBLE_DEVICES=4,5,6,7
data_source_dir — Set to the path where you downloaded geometry3k, e.g. ~/data/geometry3k
n_gpus — Should match the number of GPUs in CUDA_VISIBLE_DEVICES
train_data_size and group_size — Adjust based on your GPU memory and number of environments. Note that the total number of environment URLs must be ≥ train_data_size × group_size
step_reward_judge — For standard GRPO training, set to False (no PRM required)
env_restart_enable — Can be set to False for initial testing; enable for long-running production runs

Important — Server count requirement: The number of URLs in mobileworld_server.txt must be ≥ train_batch_size × group_size. We strongly recommend registering extra spare servers beyond this minimum. Virtual containers are prone to errors (e.g., screenshot failures, task init errors, unhealthy device state). When an active server encounters an error, the system will automatically rotate to a spare server to keep training running without interruption.

Example: If train_batch_size=4 and group_size=4, you need at least 16 active servers. Registering 24 URLs gives you 8 spare servers for rotation.

bash examples/grpo_trainer/run_mobileworld.sh

Note: We have verified GRPO and GiGPO training. We also provide scripts for GSPO and DAPO, but have not run experiments with them. Contributions and test results are welcome!

GiGPO training integrates a Process Reward Model (PRM) for finer-grained step-level reward signals, generally yielding better performance than GRPO. To use it:

bash examples/gigpo_trainer/run_mobileworld.sh

2. Real Device Training

Real-device training is a notoriously challenging problem in the industry. It involves app login verification, device management, and large-scale scaling concerns. ClawGUI-RL provides a validated end-to-end pipeline for physical Android devices. We have verified training on two real phones. Users can adapt this setup for cloud phones — the interaction protocol is identical.

Step 1 — Prepare your Android device

If you have already cloned OpenGUI-Server, connect a physical Android phone to your computer via USB.

Running on a remote server? If your training runs on a remote server but you want to connect a local phone via USB, you can set up port forwarding between your local machine and the server. For detailed steps, refer to the real-device training section in the OpenGUI-Server documentation.

Step 2 — Enable developer mode & USB debugging

On your phone:

Go to Settings → About Phone and tap Build Number 10 times to unlock Developer Options.
In Developer Options, enable USB Debugging.

Step 3 — Install ADB Keyboard (optional but recommended)

ADB Keyboard allows programmatic text input via ADB, which is required for typing tasks.

adb install ADBKeyboard.apk
adb shell ime enable com.android.adbkeyboard/.AdbIME

Step 4 — Start the MobileWorld server

uv run mobile-world server

Step 5 — Test device connectivity

Use the OpenGUI-Server test script to verify that your device is recognized:

python test_true_device.py

Make sure the correct device_id is set (check with adb devices).

Step 6 — Register device backend URL

Fill the OpenGUI-Server API backend URL into:

examples/env_server/realdevice_server.txt

Step 7 — Configure training tasks

Write your training tasks (one task per line) into:

examples/env_server/realdevice_tasks.txt

Step 8 — Configure and launch training

Both GRPO and GiGPO are supported for real device training. Fill in your device_id and configure the scripts:

GRPO:

# Edit examples/grpo_trainer/run_realdevice.sh, then:
bash examples/grpo_trainer/run_realdevice.sh

GiGPO (recommended):

# Edit examples/gigpo_trainer/run_realdevice.sh, then:
bash examples/gigpo_trainer/run_realdevice.sh

Key parameters specific to real device training:

Parameter	Description
`model_path`	Path to your model weights (MAI-UI-2B or GUI-Owl)
`model_type`	Model type: `mai_ui` or `gui_owl`
`device`	ADB device ID(s). Single: `DEVICE_ID`; Multiple (comma-separated): `DEVICE_ID1,DEVICE_ID2`; Auto-detect: `auto`
`server_file`	Path to `realdevice_server.txt`
`task_file`	Path to `realdevice_tasks.txt`
`data_source_dir`	Path to the geometry3k dataset (same as MobileWorld; leave empty to use default `~/data/geometry3k`)
`step_reward_judge`	Enable PRM step-level reward (`True`/`False`)
`step_reward_judge_base_url`	Step reward judge API URL
`step_reward_judge_model_name`	Step reward judge model name
`task_eval_judge`	Enable VLM-based task completion evaluation (`True`/`False`). When the agent outputs `answer`/`terminate`, the judge scores the result: score=1 → reward=1; score=0 → reward=0
`task_eval_judge_base_url`	Task eval judge API URL
`task_eval_judge_model_name`	Task eval judge model name
`adv_estimator`	`grpo` or `gigpo`
`mode`	GiGPO normalization mode: `mean_norm` or `mean_std_norm`
`max_steps`	Maximum steps per episode (recommend 7 for real device)
`group_size`	Rollout group size per sample
`checkpoints_path`	Directory to save checkpoints

Note: Due to physical device limitations, we have not validated large-scale real-device training. We recommend cloud phones for large-scale experiments, as the interaction protocol is identical.

3. Convert Checkpoint to HuggingFace Format

After training, the saved checkpoint is in FSDP format. Use scripts/model_merger.py to convert it to a standard HuggingFace model:

python scripts/model_merger.py merge \
    --backend fsdp \
    --local_dir gigpo_maiui2b_exp1/global_step_15/actor \
    --target_dir gigpo_maiui2b_exp1/global_step_15/hf

4. Visualize Episode Trajectories

Training episodes are automatically saved to the episode/ directory. You can replay and inspect any rollout trajectory with:

python scripts/episode_visualizer.py --episode_dir episode/<your_episode>

🧩 How to Add a New Environment

Want to scale on cloud phones, or add a completely new environment? Cloud phones follow the same protocol as real-device training — the difference is only that phones are hosted remotely.

You can implement a new environment by referencing the existing implementations:

MobileWorld: agent_system/environments/env_package/mobileworld/
RealDevice: agent_system/environments/env_package/realdevice/

Each environment package implements a standard interface for:

Resetting the environment and returning the initial observation
Executing an action and returning the next observation + reward signal
Termination / episode end detection

This modular design also means you can train GUI-Spec models other than MAI-UI and GUI-Owl — including general VLMs from the Qwen3-VL series — by implementing the appropriate model adapter alongside the environment.

📈 Experimental Results

We release ClawGUI-2B, a GUI agent trained using the ClawGUI-RL framework with the GiGPO algorithm, based on MAI-UI-2B as the base model.

MobileWorld Benchmark Results

Category	Model	MobileWorld SR (GUI-Only)
Agentic Framework	Claude-4.5-Sonnet + UI-Ins-7B	47.8
	Gemini-3-Pro + UI-Ins-7B	55.6
	GPT-5 + UI-Ins-7B	54.0
End-to-End Model	GUI-Owl-7B	7.7
	GUI-Owl-32B	8.5
	UI-Venus-7B	8.5
	UI-Venus-72B	16.4
	Qwen3-VL-8B	9.4
	Qwen3-VL-32B	11.9
	Qwen3-VL-235B-A22B	12.8
	Doubao-1.5-UI-TARS	26.3
	MAI-UI-2B (baseline)	11.1
	MAI-UI-8B	19.7
Ours	ClawGUI-2B [GRPO]	14.5
	ClawGUI-2B [GiGPO]	17.1

📄 License

This project is licensed under the Apache License 2.0.

Name		Name	Last commit message	Last commit date
parent directory ..
agent_system		agent_system
assets		assets
docker		docker
examples		examples
gigpo		gigpo
recipe		recipe
scripts		scripts
verl		verl
README.md		README.md
README_zh.md		README_zh.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

ClawGUI-RL: Scalable Online RL Infrastructure for GUI Agents

📚 Table of Contents

📖 Overview

🏗️ Architecture

🧬 GiGPO Algorithm

🔧 Installation

🚀 Quick Start

1. Virtual Environment Scaling

Step 1 — Clone OpenGUI-Server

Step 2 — Set up the server

Step 3 — Register environment URLs

Step 4 — Download training data

Step 5 — Configure the training script

Step 6 — Install the logger

Step 7 — Launch training

2. Real Device Training

Step 1 — Prepare your Android device

Step 2 — Enable developer mode & USB debugging

Step 3 — Install ADB Keyboard (optional but recommended)

Step 4 — Start the MobileWorld server

Step 5 — Test device connectivity

Step 6 — Register device backend URL

Step 7 — Configure training tasks

Step 8 — Configure and launch training

3. Convert Checkpoint to HuggingFace Format

4. Visualize Episode Trajectories

🧩 How to Add a New Environment

📈 Experimental Results

MobileWorld Benchmark Results

📄 License

FilesExpand file tree

clawgui-rl

Directory actions

More options

Directory actions

More options

Latest commit

History

clawgui-rl

Folders and files

parent directory

README.md

ClawGUI-RL: Scalable Online RL Infrastructure for GUI Agents

📚 Table of Contents

📖 Overview

🏗️ Architecture

🧬 GiGPO Algorithm

🔧 Installation

🚀 Quick Start

1. Virtual Environment Scaling

Step 1 — Clone OpenGUI-Server

Step 2 — Set up the server

Step 3 — Register environment URLs

Step 4 — Download training data

Step 5 — Configure the training script

Step 6 — Install the logger

Step 7 — Launch training

2. Real Device Training

Step 1 — Prepare your Android device

Step 2 — Enable developer mode & USB debugging

Step 3 — Install ADB Keyboard (optional but recommended)

Step 4 — Start the MobileWorld server

Step 5 — Test device connectivity

Step 6 — Register device backend URL

Step 7 — Configure training tasks

Step 8 — Configure and launch training

3. Convert Checkpoint to HuggingFace Format

4. Visualize Episode Trajectories

🧩 How to Add a New Environment

📈 Experimental Results

MobileWorld Benchmark Results

📄 License