ClawGUI-Agent is the deployment module of ClawGUI. Built on OpenClaw and powered by nanobot, it provides two core capabilities: GUI phone control and one-command evaluation. For phone control, it drives a Vision-Language Model through a closed-loop "screenshot → reasoning → action" cycle to autonomously complete tasks on Android, HarmonyOS, and iOS devices — accessible from Feishu, QQ, Telegram, and 12+ other chat platforms. For evaluation, a single natural-language command triggers the full ClawGUI-Eval pipeline: environment check, multi-GPU inference, judging, and metric reporting.
- Key Features
- Architecture
- How the Agent Works
- Quick Start
- Run
- ClawGUI-Eval Evaluation
- GUI Phone Control
- Directory Structure
- License
- nanobot Integration — Remotely control phones from 12+ chat platforms including Feishu / DingTalk / Telegram / Discord / Slack / QQ — issue tasks anytime, anywhere
- GUI Phone Control — Powered by OpenClaw, AI autonomously captures screenshots, understands the screen, and performs tap/swipe/type GUI actions to complete complex tasks
- ClawGUI-Eval Integration — Built-in ClawGUI-Eval evaluation skill, launch GUI Grounding model benchmarks with natural language (environment check → multi-GPU inference → judging → metric calculation), with automatic progress monitoring and result comparison against official baselines
- Multi-Model Support — Compatible with AutoGLM, Qwen VL, UI-TARS, MAI-UI, GUI-Owl and more VLMs, connected via OpenAI-compatible API
- Personalized Memory — Automatically learns user preferences (contacts, frequently used apps, habits), with a vector-search-based persistent memory system
- Real-time Episode Recording — Each task execution (screenshots + model outputs + actions) is saved as a structured episode, enabling replay and dataset construction
- Web UI — Gradio-based web interface for device management, task execution visualization, manual takeover, memory management and more
Understanding the execution loop helps with configuration and debugging. `PhoneAgent.run()` in `phone_agent/agent.py` follows this cycle for each task:

- Screenshot — Capture the current screen via ADB (`screencap`), HDC, or XCTest, depending on the device backend.
- Memory retrieval — Query the vector memory store for relevant memories from past interactions (contacts, app knowledge, user preferences). The top-k most similar memories are appended to the system context.
- History construction — Assemble the multi-turn conversation history: each past step contributes a `(user: screenshot + instruction, assistant: reasoning + action)` pair, up to `history_length` steps back.
- VLM call — Send the prompt (system prompt + history + current screenshot + task instruction) to the configured GUI model via an OpenAI-compatible API.
- Action parsing — Extract the structured action from the model output. Different models use different output formats (`autoglm`, `uitars`, `qwenvl`, `maiui`, `guiowl` adapters in `phone_agent/model/adapters.py`).
- Coordinate normalization — Convert the model's output coordinates to absolute device pixels. AutoGLM uses `[0, 1000]` normalized coordinates; UI-TARS uses absolute pixel coordinates in `smart_resize` space; Qwen-VL uses absolute pixels; MAI-UI uses `[0, 1000]`.
- Action execution — Send the action to the device backend: tap, long-press, swipe, type, home, back, or task-complete. Each action type has a dedicated handler in `phone_agent/actions/`.
- Trace recording — If `traceEnabled=True`, append the screenshot, reasoning, and action to the episode tracer for later replay or training-data export.
- Memory update — After task completion, extract contact names, app knowledge, and user habits from the conversation and upsert them into the vector store with deduplication.
This loop runs until the model outputs a terminate or answer action, or `max_steps` is reached.
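The cycle above can be sketched in a few lines of Python. Everything here — the device/model interfaces, `parse_action`, and the toy `"tap x y"` output format — is an illustrative stand-in, not the actual `phone_agent` API:

```python
# Minimal sketch of the screenshot → reasoning → action loop described above.
# All names are hypothetical; the real PhoneAgent adds memory retrieval,
# history windowing, per-model adapters, and trace recording.

def normalize_coords(x, y, width, height, scale=1000):
    """Map AutoGLM-style [0, 1000] normalized coordinates to device pixels."""
    return round(x * width / scale), round(y * height / scale)

def parse_action(reply):
    """Toy parser: expects e.g. 'tap 500 200' or 'terminate'."""
    parts = reply.split()
    if parts[0] == "tap":
        return {"type": "tap", "x": int(parts[1]), "y": int(parts[2])}
    return {"type": parts[0]}

def run_task(device, model, instruction, max_steps=50):
    history = []
    for _ in range(max_steps):
        screenshot = device.screenshot()                     # capture screen
        reply = model.ask(instruction, screenshot, history)  # VLM call
        action = parse_action(reply)                         # structured action
        history.append((screenshot, reply))                  # multi-turn history
        if action["type"] in ("terminate", "answer"):
            return action, len(history)                      # task finished
        if action["type"] == "tap":
            px = normalize_coords(action["x"], action["y"], *device.size)
            device.tap(*px)                                  # execute on device
    return {"type": "terminate"}, len(history)               # max_steps reached
```

The control flow is the essential part: screenshot, model call, parse, execute, repeat until a terminal action.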
- Python: ≥ 3.11
- Package Manager: uv (recommended) or conda + pip
Assuming you have cloned the ClawGUI project and are in the root directory:
Using uv:

```bash
cd clawgui-agent

# Create virtual environment
uv venv .venv --python 3.12

# Activate
source .venv/bin/activate

# Install phone_agent
uv pip install -e .

# Install nanobot
uv pip install -e nanobot/
```

Using conda + pip:

```bash
cd clawgui-agent

# Create conda environment
conda create -n opengui python=3.12 -y
conda activate opengui

# Install phone_agent
pip install -e .

# Install nanobot
pip install -e nanobot/
```

Run the onboarding wizard to generate a default config:

```bash
nanobot onboard
```

Then edit `~/.nanobot/config.json`. Here is a reference configuration:
We recommend using `autoglm-phone` as the external GUI model for phone control.
```json
{
  "agents": {
    "defaults": {
      "workspace": "/path/to/ClawGUI",
      "model": "glm-5",
      "provider": "zhipu",
      "maxTokens": 8192,
      "contextWindowTokens": 131072,
      "temperature": 0.1,
      "maxToolIterations": 40
    }
  },
  "providers": {
    "zhipu": {
      "apiKey": "YOUR_ZHIPU_API_KEY",
      "apiBase": "https://open.bigmodel.cn/api/paas/v4/"
    },
    "openrouter": {
      "apiKey": "YOUR_OPENROUTER_API_KEY",
      "apiBase": "https://openrouter.ai/api/v1"
    }
  },
  "tools": {
    "gui": {
      "enable": true,
      "deviceType": "adb",
      "deviceId": null,
      "maxSteps": 50,
      "useExternalModel": true,
      "guiBaseUrl": "https://openrouter.ai/api/v1",
      "guiApiKey": "YOUR_OPENROUTER_API_KEY",
      "guiModelName": "autoglm-phone",
      "promptTemplateLang": "en",
      "promptTemplateStyle": "autoglm",
      "traceEnabled": false,
      "traceDir": "gui_trace"
    },
    "exec": {
      "enable": true,
      "timeout": 60
    }
  }
}
```

Important: `workspace` path setting. Set `workspace` to the ClawGUI project root — the directory that contains both `clawgui-agent/` and `clawgui-eval/`. The built-in evaluation skill uses this path to locate the evaluation framework. For example, if your project lives at `/home/user/ClawGUI`, set `workspace` to `"/home/user/ClawGUI"`.
| Parameter | Description |
|---|---|
| `enable` | Enable/disable the GUI phone control tool |
| `deviceType` | Device type: `adb` (Android) or `hdc` (HarmonyOS) |
| `deviceId` | Specific device ID; `null` for auto-detection |
| `maxSteps` | Maximum execution steps per task |
| `useExternalModel` | Use an external GUI-specific model (recommended: `true`) |
| `guiBaseUrl` | API endpoint for the external GUI model |
| `guiApiKey` | API key for the external GUI model |
| `guiModelName` | External GUI model name, used with `guiBaseUrl` |
| `promptTemplateLang` | Prompt language: `cn` / `en` |
| `promptTemplateStyle` | Prompt template style: `autoglm` / `uitars` / `qwenvl`, etc. |
| `traceEnabled` | Enable episode recording |
| `traceDir` | Episode save directory |
The controlled phone must be connected (e.g. via USB) to the server machine where ClawGUI-Agent is installed.
Option A: Install via package manager

macOS (recommended: brew):

```bash
brew install android-platform-tools
```

Linux:

```bash
sudo apt install android-tools-adb  # Ubuntu/Debian
```

Windows: See this blog tutorial to download and configure PATH.

Option B: Manual download

Download the official ADB platform-tools and extract it, then add it to your PATH.

macOS / Linux:

```bash
# Assuming extracted to ~/Downloads/platform-tools
export PATH=${PATH}:~/Downloads/platform-tools
```

Windows: Add the extracted directory (e.g. `C:\platform-tools`) to the system PATH environment variable.
- Enable Developer Mode: Go to Settings > About Phone > Build Number, tap rapidly ~10 times until you see "You are now a developer"
- Enable USB Debugging: Go to Settings > Developer Options > USB Debugging, toggle it on (some devices may require a restart)
- Verify connection:

```bash
adb devices
# Expected output:
# List of devices attached
# <your_device_id>    device
```

ADB Keyboard is used for text input. Download ADBKeyboard.apk and install it:

```bash
adb install ADBKeyboard.apk
adb shell ime enable com.android.adbkeyboard/.AdbIME
```

Note: This step is optional. The framework will auto-detect and prompt for installation when needed.
See the Open-AutoGLM device connection guide.
To remotely control the phone via chat platforms, enable the corresponding platform in `channels` within `config.json` and fill in credentials.
📖 Click to expand setup steps
- Step 1: Go to Feishu Open Platform, click Create App on the homepage, select Enterprise Self-Built App, fill in the app name and description, and enable the Bot capability.
- Step 2: Click Permission Management on the left sidebar, then click Enable Permissions.
- Step 3: Search for and enable the following permissions: `im:message`, `im:message.p2p_msg:readonly`, `cardkit:card:write`. If `cardkit:card:write` cannot be added, set `"streaming": false` in `channels.feishu` (see config below). The bot will still work normally; replies use regular interactive cards without token-by-token streaming.
- Step 4: Click Event & Callback on the left, click Subscription Method, and select Persistent Event Reception (requires ClawGUI-Agent to be running to establish the connection).
- Step 5: Go to Credentials & Basic Info on the left to get your `App ID` and `App Secret`.
- Step 6: Click Publish App.
- Step 7: Open Feishu, go to any group, open the group settings, click Group Bots, then Add Bot, and add the bot you just created to the group.
- Step 8: @mention the bot in the group and send a message.
- Configure in `~/.nanobot/config.json`:

```json
"feishu": {
  "enabled": true,
  "appId": "YOUR_APP_ID",
  "appSecret": "YOUR_APP_SECRET",
  "encryptKey": "",
  "verificationToken": "",
  "allowFrom": ["*"],
  "groupPolicy": "mention"
}
```

`allowFrom` set to `["*"]` allows all users; to restrict access, provide a list of user Open IDs. `groupPolicy` set to `"mention"` means the bot only responds when @mentioned in groups.
- Go to QQ Open Platform and create a bot application
- Obtain the `App ID` and `Secret`
- Configure in `~/.nanobot/config.json`:

```json
"qq": {
  "enabled": true,
  "appId": "YOUR_APP_ID",
  "secret": "YOUR_SECRET",
  "allowFrom": ["*"]
}
```

nanobot also supports Telegram, Discord, Slack, DingTalk, WeCom, WhatsApp, Email, and more — 12+ platforms in total. Set `"enabled": true` in the corresponding `channels` field and fill in credentials.
Start the nanobot gateway service:

```bash
nanobot gateway
```

Once started, you can send messages on configured chat platforms (e.g. Feishu) to control the phone:

> Open WeChat and send "I'll be late" to Zhang San

nanobot will invoke the `gui_execute` tool, looping screenshot capture → VLM reasoning → action execution until the task is completed.
ClawGUI-Agent includes a built-in ClawGUI-Eval skill that turns natural language into a complete benchmark run — from GPU environment check through multi-GPU inference, judging, and metric reporting — without writing a single script.
- workspace correctly set: `workspace` in `config.json` points to the ClawGUI root directory (see configuration above)
- ClawGUI-Eval environment installed: Follow the ClawGUI-Eval README to install dependencies and download data
- GPU available: Inference requires NVIDIA GPUs
- (Recommended) Install FlashAttention-2: `pip install flash-attn --no-build-isolation` — the framework falls back to SDPA automatically if it is not installed, but precision may be slightly lower
Simply say it in a nanobot conversation:
Benchmark qwen3vl 2b model on screenspot-pro
Run uivision and osworld-g evaluation with MAI-UI-8B
nanobot will automatically:
- Environment Check — Check GPU, CUDA, FlashAttention-2, data integrity
- Inference — Generate run scripts from templates, launch multi-GPU parallel inference in background, monitor progress in real-time
- Judging — Automatically select and run the corresponding judge script
- Metric Calculation — Automatically select and run the corresponding metric script
- Result Report — Present accuracy, sub-category breakdowns, and comparison against official baselines
| Model Type | Example HuggingFace ID |
|---|---|
| `qwen3vl` | Qwen/Qwen3-VL-2B/4B/8B-Instruct |
| `qwen25vl` | Qwen/Qwen2.5-VL-3B/7B-Instruct |
| `maiui` | Tongyi-MAI/MAI-UI-2B/8B |
| `uitars` | ByteDance-Seed/UI-TARS-1.5-7B |
| `uivenus15` | inclusionAI/UI-Venus-1.5-2B/8B |
| `guiowl15` | mPLUG/GUI-Owl-1.5-2B/4B/8B-Instruct |
| `guig2` | inclusionAI/GUI-G2-7B |
| `stepgui` | stepfun-ai/GELab-Zero-4B-preview |
| `uivenus` | inclusionAI/UI-Venus-Ground-7B |
Supported Benchmarks: ScreenSpot-Pro, ScreenSpot-V2, UIVision, MMBench-GUI, OSWorld-G, AndroidControl
The following features are part of ClawGUI-Agent's phone/device control capabilities, driven by the gui_execute tool.
You can also invoke the GUI agent directly via command line:

```bash
python main.py \
  --base-url https://open.bigmodel.cn/api/paas/v4/ \
  --model autoglm-phone \
  --apikey <YOUR_API_KEY> \
  --max-steps 100 \
  --lang cn \
  "Open QQ Music, play Justin Bieber's Baby and add it to favorites. If it is already favorited, just play it. After it starts playing, pause it, then go back and play Bieber's Love Me."
```

In addition to chat-platform control, you can use the Web UI directly:
```bash
python webui.py
```

It opens at http://localhost:7860 by default, featuring:
- Device Management: Connect/disconnect devices, view device status
- Task Execution: Enter task descriptions, watch screenshots and AI reasoning in real-time
- Manual Takeover: Switch to manual control for scenarios like CAPTCHAs
- Memory Management: View/edit/clear memory data
- Configuration Panel: Graphical model parameter settings
The framework includes a built-in personalized memory system (phone_agent/memory/). After each completed task, the agent extracts structured facts from the conversation — contact names and relationships, app-specific knowledge, user habits and preferences — and upserts them into a persistent store as JSON records with numpy vector embeddings. On subsequent tasks, the top-k most semantically similar memories are retrieved and injected into the system context, letting the agent recognize "Zhang San" as the user's colleague or know which music app the user prefers. Duplicate memories are detected and merged rather than accumulated, keeping the store lean. Multi-user isolation is supported via per-user namespaces.
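The upsert-with-dedup and top-k retrieval behaviour described above can be illustrated with a toy store. The bag-of-characters embedding below is a deliberate simplification — the real system uses learned embeddings persisted as numpy vectors — and every name here is hypothetical, not the `phone_agent/memory/` API:

```python
# Toy memory store: upsert with similarity-based dedup, then top-k retrieval.
import math

def embed(text):
    """Crude bag-of-characters embedding (illustration only)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self, dedup_threshold=0.95):
        self.records = []                  # list of (text, vector) pairs
        self.threshold = dedup_threshold

    def upsert(self, text):
        vec = embed(text)
        for i, (_, old) in enumerate(self.records):
            if cosine(vec, old) >= self.threshold:
                self.records[i] = (text, vec)   # near-duplicate: merge, keep newest
                return
        self.records.append(vec and (text, vec))

    def retrieve(self, query, k=2):
        """Return the k stored memories most similar to the query."""
        qv = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(qv, r[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

Retrieved memories would then be appended to the system context before the VLM call, as in the execution loop described earlier.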
The framework supports multiple Vision-Language Models via an adapter pattern:
| Model | `promptTemplateStyle` | Provider |
|---|---|---|
| AutoGLM-Phone-9B | `autoglm` | Zhipu AI |
| Doubao-1.5-UI-TARS | `uitars` | ByteDance |
| Qwen2.5-VL / Qwen3-VL | `qwenvl` | Alibaba Cloud |
| MAI-UI | `maiui` | Alibaba Cloud |
| GUI-Owl-7B/32B | `guiowl` | mPLUG |
All models are connected via an OpenAI-compatible API and can be deployed locally with vLLM / SGLang, or connected to cloud services such as Zhipu BigModel, Alibaba Cloud Bailian, or OpenRouter.
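The adapter pattern can be sketched as a small dispatch table: each model family gets an adapter that turns its raw text output into one common action dict. The output grammars below are simplified illustrations, not the exact formats the real adapters in `phone_agent/model/adapters.py` parse:

```python
# Illustrative adapter-pattern sketch: normalize different model output
# formats into one action dict. Formats shown are simplified assumptions.
import re

class AutoGLMAdapter:
    # AutoGLM-style tap in [0, 1000] normalized coordinate space (illustrative)
    def parse(self, text):
        m = re.search(r"tap\((\d+),\s*(\d+)\)", text)
        return {"type": "tap", "x": int(m.group(1)), "y": int(m.group(2)),
                "space": "norm1000"}

class UITARSAdapter:
    # UI-TARS-style click in absolute pixel space (illustrative)
    def parse(self, text):
        m = re.search(r"click\(start_box='\((\d+),(\d+)\)'\)", text)
        return {"type": "tap", "x": int(m.group(1)), "y": int(m.group(2)),
                "space": "pixels"}

ADAPTERS = {"autoglm": AutoGLMAdapter(), "uitars": UITARSAdapter()}

def parse_model_output(style, text):
    """Dispatch on promptTemplateStyle to the matching adapter."""
    return ADAPTERS[style].parse(text)
```

Because every adapter emits the same action dict, the rest of the loop (coordinate normalization, action execution) stays model-agnostic.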
```text
ClawGUI-Agent/
├── main.py              # CLI entry point
├── webui.py             # Gradio Web UI entry point
├── ios.py               # iOS CLI entry point
├── setup.py             # Package setup
├── requirements.txt     # Python dependencies
│
├── phone_agent/         # Core phone automation package
│   ├── agent.py         # PhoneAgent main class (screenshot→VLM→action loop)
│   ├── agent_ios.py     # IOSPhoneAgent class
│   ├── device_factory.py  # Device type factory (ADB / HDC / XCTest)
│   ├── tracer.py        # Episode execution tracer
│   ├── config/          # Configuration & prompts (8 template files)
│   ├── model/           # Model clients & adapters (5 VLM adapters)
│   ├── adb/             # Android ADB device control
│   ├── hdc/             # HarmonyOS HDC device control
│   ├── xctest/          # iOS XCTest device control
│   ├── actions/         # Action handlers (tap, swipe, type, etc.)
│   └── memory/          # Personalized memory system (vector store)
│
├── nanobot/             # nanobot subproject
│   ├── nanobot/         # nanobot core package
│   │   ├── agent/       # Agent core + GUI tool
│   │   ├── channels/    # 12+ chat platform integrations
│   │   ├── providers/   # 20+ LLM provider adapters
│   │   └── skills/      # Pluggable skills (gui-mobile, clawgui-eval)
│   ├── pyproject.toml
│   └── README.md
│
├── examples/            # Usage examples
└── scripts/             # Deployment & verification scripts
```
This project is licensed under the Apache License 2.0. The nanobot subproject is licensed under the MIT License.

