Skip to content

Tongyi-MAI/MobileWorld

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

126 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Banner

Leaderboard β€’ Website β€’ Paper β€’ Docs β€’ Issues

While maintaining the same level of rigorous, reproducible evaluation as AndroidWorld, MobileWorld offers a more challenging online mobile-use benchmark by introducing four additional features that better capture real-world agent behavior.

  • 🎯 Broad Real-World Coverage: 201 carefully curated tasks across 20 mobile applications
  • πŸ”„ Long-Horizon Tasks: Multi-step reasoning and cross-app workflows
  • πŸ‘₯ Agent-User Interaction: Novel tasks requiring dynamic human-agent collaboration
  • πŸ”§ MCP-Augmented Tasks: Support Model Context Protocol (MCP) to evaluate hybrid tool usage

Comparison to AndroidWorld
Difficulty comparison between MobileWorld and AndroidWorld

πŸ“’ Updates

  • 2026-04-15: Important Fix β€” Mattermost Session Expiry If you pulled the Docker image before this date, Mattermost task evaluations may produce false negatives due to expired authentication tokens in the emulator snapshot. Please git pull the latest codebase β€” the fix runs automatically during task initialization (no Docker image rebuild required).
  • 2026-03-20: End-to-End Frontier Model Evaluation & Real Device SupportπŸ”₯ We benchmarked five frontier models β€” Seed-2.0-Pro, Gemini 3 Pro, KIMI K2.5, Claude Sonnet 4.5, and Qwen-3.5 β€” for end-to-end mobile-use, and demonstrated real-phone execution. See our blog post for the full write-up.
    • πŸ† New SOTA: Seed-2.0-Pro leads at 63.2% GUI-Only and 61.4% User-Interaction, overtaking Seed-1.8 as the top end-to-end model.
    • πŸ“Š Expanded Leaderboard:
      • KIMI K2.5 (49.6% GUI-Only, 51.2% User-Int)
      • Qwen-3.5-397B-A17B (42.7% GUI-Only, 54.4% User-Int)
      • GUI-Owl-1.5-32B (43.9% GUI-Only, 56.1% User-Int)
      • UI-Venus-1.5-30B (17.1% GUI-Only)
    • πŸ“± Real Device Support: You can now run frontier models on physical Android phones. See the Testing on Real Devices section.
    • New Agents: gui_owl_1_5 (code), ui_venus_agent (code)
  • 2026-01-16: Expanded Model Evaluation SupportπŸ”₯ We have introduced evaluation implementations for the latest frontier models, covering both end-to-end and agentic workflows.
    • πŸš€ Leaderboard Upgrade with Multi-Dimensional Filtering: We now support focused comparisons within GUI-Only and User-Interaction categories. This allows for a more balanced assessment of core navigation capabilities, especially for models not yet optimized for MCP-hybrid tool calls or complex user dialogues.
    • πŸ† New Performance Records:
      • GUI-Only Tasks: Seed-1.8 secured the Top-1 spot for end-to-end performance with a 52.1% success rate, followed by Gemini-3-Pro (51.3%) and Claude-4.5-Sonnet (47.8%).
      • Combined GUI & User Interaction: MAI-UI-235B-A22B leads the leaderboard with a 45.4% success rate, surpassing Claude-4.5-Sonnet (43.2%) and Seed-1.8 (40.8%).
    • Supported Models:
      • End-to-End: MAI-UI, Gemini-3-Pro, Claude-4.5-sonnet, Seed-1.8, and GELab-Zero.
      • Agentic: Gemini-3-Pro, Claude-4.5-sonnet, and GPT-5.
    • Implementation Details:
    • Note on GPT-5: While supported for agentic tasks, GPT-5 is currently excluded from end-to-end evaluation as its grounding mechanisms remain unclear for standardized testing.
  • 2025-12-29: We released MAI-UI, achieving SOTA performance with a 41.7% success rate in the end-to-end models category on the MobileWorld benchmark.
  • 2025-12-23: Initial release of MobileWorld benchmark. Check out our paper and websiteπŸ”₯.
  • 2025-12-23: Docker image ghcr.io/Tongyi-MAI/mobile_world:latest available for public use.

πŸ“‹ Table of Contents


πŸ“– Overview

MobileWorld Overview

MobileWorld is a comprehensive benchmark for evaluating autonomous mobile agents in realistic scenarios. Our benchmark features a robust infrastructure and deterministic evaluation methodology:

πŸ—οΈ System Architecture

Containerized Environment
The entire evaluation environment runs in Docker-in-Docker containers, including:

  • Rooted Android Virtual Device (AVD)
  • Self-hosted application backends
  • API server for orchestration

This design eliminates external dependencies and enables consistent deployment across different host systems.

Open-Source Applications
We build stable, reproducible environments using popular open-source projects:

  • Mattermost: Enterprise communication (Slack alternative)
  • Mastodon: Social media platform (X/Twitter alternative)
  • Mall4Uni: E-commerce platform

Self-hosting provides full backend access, enabling precise control over task initialization and deterministic verification.

Snapshot-Based State Management
AVD snapshots capture complete device states, ensuring each task execution begins from identical initial conditions for reproducible results.

βœ… Task Evaluation

We implement multiple complementary verification methods for reliable assessment:

  • Textual Answer Verification: Pattern matching and string comparison for information retrieval tasks
  • Backend Database Verification: Direct database queries to validate state changes (messages, posts, etc.)
  • Local Storage Inspection: ADB-based inspection of application data (calendar events, email drafts, etc.)
  • Application Callbacks: Custom APIs capturing intermediate states for validation

πŸ’Ύ Installation

System Requirements

  • Docker with privileged container support
  • KVM (Kernel-based Virtual Machine) for Android emulator acceleration
  • Python 3.12+
  • Linux host system (or Windows with WSL2 + KVM enabled), MacOS support is in progress.

Quick Install

# Clone the repository
git clone https://github.com/Tongyi-MAI/MobileWorld.git
cd MobileWorld

# Install dependencies with uv
uv sync

Environment Configuration

Create a .env file from .env.example in the project root:

cp .env.example .env

Edit the .env file and configure the following parameters:

Required for Agent Evaluation:

  • API_KEY: Your OpenAI-compatible API key for the agent model
  • USER_AGENT_API_KEY: API key for user agent LLM (used in agent-user interactive tasks)
  • USER_AGENT_BASE_URL: Base URL for user agent API endpoint
  • USER_AGENT_MODEL: Model name for user agent (e.g., gpt-4.1)

Required for MCP-Augmented Tasks:

  • DASHSCOPE_API_KEY: DashScope API key for MCP services
  • MODELSCOPE_API_KEY: ModelScope API key for MCP services

Example .env file:

API_KEY=your_api_key_for_agent_model
DASHSCOPE_API_KEY=dashscope_api_key_for_mcp
MODELSCOPE_API_KEY=modelscope_api_key_for_mcp

USER_AGENT_API_KEY=your_user_agent_llm_api_key
USER_AGENT_BASE_URL=your_user_agent_base_url
USER_AGENT_MODEL=gpt-4.1

Note:

  • MCP API keys are only required if you plan to run MCP-augmented tasks
  • User agent settings are only required for agent-user interactive tasks
  • See MCP Setup Guide for detailed MCP server configuration

πŸš€ Quick Start

1. Check Environment & Pull Docker Image

sudo uv run mw env check

This command verifies Docker, KVM support, and prompts to pull the latest mobile_world Docker image if needed.

2. Launch Docker Containers

sudo uv run mw env run --count 5 --launch-interval 20

This launches 5 containerized Android environments with:

  • --count 5: Number of parallel containers
  • --launch-interval 20: Wait 20 seconds between container launches

3. Run Evaluation

sudo uv run mw eval \
    --agent_type qwen3vl \
    --task ALL \
    --max_round 50 \
    --model_name Qwen3-VL-235B-A22B \
    --llm_base_url [openai_compatible_url] \
    --step_wait_time 3 \
    --log_file_root traj_logs/qwen3_vl_logs \
    --enable_mcp \
    --enable_user_interaction

Flags:

  • --enable_mcp: Include MCP-augmented tasks in evaluation
  • --enable_user_interaction: Include agent-user interaction tasks. Without this flag, only GUI-only tasks are evaluated.

4. View Results

uv run mw logs view --log_dir traj_logs/qwen3_vl_logs

Opens an interactive web-based visualization at http://localhost:8760 to explore task trajectories and results.


πŸ“± Testing on Real Devices

Beyond the containerized emulator environment, MobileWorld supports running frontier models on real physical Android phones. This lets you evaluate models like Gemini, Claude, Qwen, and others as true end-to-end mobile agents.

Prerequisites

  • A physical Android phone connected via USB
  • ADB (Android Debug Bridge) installed on your local machine
  • An API key for the model you want to test

Step 1: Install ADB

Download the official ADB platform-tools and extract it.

macOS/Linux:

# Assuming extracted to ~/Downloads/platform-tools
export PATH=${PATH}:~/Downloads/platform-tools

Windows: Refer to this guide for configuration steps.

Step 2: Connect Your Phone & Enable USB Debugging

  1. Enable Developer Mode: Go to Settings > About Phone > Build Number and tap rapidly ~10 times until you see "Developer mode has been enabled."
  2. Enable USB Debugging: Go to Settings > Developer Options > USB Debugging and enable it. Some devices may require a restart.
  3. Verify the connection:
adb devices

# Expected output:
# List of devices attached
# <your_device_id>   device

Step 3: Install ADB Keyboard (Optional)

ADB Keyboard is needed for text input. Download the ADBKeyboard.apk and install it on your device:

adb install ADBKeyboard.apk
adb shell ime enable com.android.adbkeyboard/.AdbIME

Note: This step is optional β€” MobileWorld will install it automatically if not present.

Step 4: Clone MobileWorld

git clone https://github.com/Tongyi-MAI/MobileWorld.git
cd MobileWorld
uv sync

Step 5: Start the MobileWorld Server

uv run mobile-world server

This starts the backend API server that bridges the model and the device.

Step 6: Run a Task on Your Real Device

uv run mw test "set an alarm at 8:00 am" \
    --agent-type general_e2e \
    --model_name anthropic/claude-sonnet-4-5 \
    --llm_base_url https://openrouter.ai/api/v1 \
    --aw-host http://127.0.0.1:6800 \
    --api_key YOUR_API_KEY

Replace --model_name, --llm_base_url, and --api_key with the model and credentials you want to use. Any OpenAI-compatible endpoint works. The --agent-type general_e2e prompt works across most frontier models. For Seed-2.0-Pro, use --agent-type seed_agent for better performance.

Supported End-to-End Models

Model Agent Type Coordinate System Notes
Gemini 3 Pro general_e2e Relative (0–1000) Normalized coordinates
Claude Sonnet 4.5 general_e2e Absolute pixels Requires image resize to 1280Γ—720
Qwen-3.5 general_e2e Relative (0–1000)
KIMI K2.5 general_e2e Relative (0–1)
Seed-2.0-Pro seed_agent Relative (0–1000) Best with specialized seed_agent agent type

Tip: You can view the live device screen at any time with uv run mw device.


πŸ”§ Available Commands

MobileWorld provides a comprehensive CLI (mw or mobile-world) with the following commands:

Command Description
mw env check Check prerequisites (Docker, KVM) and pull latest image
mw env run Launch Docker container(s) with Android emulators
mw env list List running MobileWorld containers
mw env rm Remove/destroy containers
mw env info Get detailed info about a container
mw env restart Restart the server in a container
mw env exec Open a shell in a container
mw eval Run benchmark evaluation suite
mw test Run a single ad-hoc task for testing
mw info task Display available tasks
mw info agent Display available agents
mw info app Display available apps
mw info mcp Display available MCP tools
mw logs view Launch interactive log viewer
mw logs results Print results summary table
mw logs export Export logs as static HTML site
mw device View live Android device screen
mw server Start the backend API server

Use mw <command> --help for detailed options.


πŸ“š Documentation

For detailed documentation, see the docs/ directory:

Document Description
Development Guide Dev mode, debugging, container management workflows
MCP Setup Configure MCP servers for external tool integration
Windows Setup WSL2 and KVM setup instructions for Windows
AVD Configuration Customize and save Android Virtual Device snapshots

🎯 Benchmark Statistics

Scenario Distribution MobileWorld Statistics

πŸ“¬ Contact

For questions, issues, or collaboration inquiries:

WeChat Group QR Code

Acknowledgements

We thank Android World and Android-Lab for their open source contributions. We also thank all the open-source contributors!

Citation

If you find MobileWorld useful in your research, please cite our paper:

@misc{kong2025mobileworldbenchmarkingautonomousmobile,
      title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments}, 
      author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
      year={2025},
      eprint={2512.19432},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.19432}, 
}

⭐ Star History

If you find MobileWorld helpful, please consider giving us a star ⭐!

Star History Chart

About

Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors