- Environment Setup
- Configuration
- Usage
- Benchmark Session
- Metrics
- Adding a New Agent
- Leaderboard Submission
- Tasks
- Citation
- Contact
- 2026-02-15: MemGUI-Bench adopted by Mobile-Agent-v3.5! Congrats to the Tongyi Lab team for achieving 27.1% on Easy tasks with GUI-Owl-1.5-32B. We welcome more agents to challenge the full benchmark!
- 2026-02-09: Benchmark tasks now available on HuggingFace: lgy0404/MemGUI-Bench
- 2026-02-09: Paper released on arXiv! Check out our paper: arXiv:2602.06075
- 2026-02-03: Initial release of the MemGUI-Bench benchmark. Check out our website.
Use our pre-configured Docker image with all dependencies installed:
```bash
# Pull the image (public, no login required)
sudo docker pull \
  crpi-6p9eo5da91i2tx5v.cn-hangzhou.personal.cr.aliyuncs.com/memgui/memgui-bench:26020301

# Run the container
sudo docker run -it --privileged \
  --name memgui-bench \
  -w /root/MemGUI-Bench \
  crpi-6p9eo5da91i2tx5v.cn-hangzhou.personal.cr.aliyuncs.com/memgui/memgui-bench:26020301 \
  bash

# Inside the container, you're already in /root/MemGUI-Bench
python run.py
```

Note: The `--privileged` flag is required for Android emulator support.
The Docker image includes:
- Pre-configured Android emulator with MemGUI-AVD
- All required conda environments
- ADB and Android SDK tools
For developers who prefer local installation:
Click to expand local setup instructions
- Conda: Install from conda.io
- Android Debug Bridge (ADB): Install from Android Developer and add to PATH
- Android Studio & AVD:
  - Download and install Android Studio
  - Download the pre-configured MemGUI-AVD emulator snapshot:
    - Download: Baidu Netdisk (Code: `tfnb`) - File: `MemGUI-AVD-250704-base.zip`
  - Extract to your AVD directory:
    - Windows: `C:\Users\[Username]\.android\avd\`
    - macOS: `~/Library/Android/avd/`
    - Linux: `~/.android/avd/`
  - Launch Android Studio → Device Manager → Start MemGUI-AVD
```bash
# Clone repository with submodules
git clone --recursive https://github.com/lgy0404/MemGUI-Bench.git
cd MemGUI-Bench

# If already cloned without --recursive, init submodules manually:
# git submodule update --init --recursive

# Run setup script
./setup.sh

# Configure
cp config.yaml.example.opensource config.yaml
# Edit config.yaml with your paths
```

Edit config.yaml to match your environment:
```yaml
# Environment Mode
ENVIRONMENT_MODE: "local"  # "local" or "docker"

# Experiment Settings
AGENT_NAME: "Qwen3VL"
DATASET_PATH: "./data/memgui-tasks-all.csv"
SESSION_ID_SUFFIX: "my-experiment"

# API & Parallelism
BASE_URL: "https://api.openai.com/v1"
NUM_OF_EMULATOR: 4
MAX_EVAL_SUBPROCESS: 8

# Model API Keys
QWEN_API_KEY: "your-api-key"
QWEN_MODEL: "qwen3-vl-8b"
```

Full configuration example (for local mode):
```yaml
# Part 1: Environment Mode
ENVIRONMENT_MODE: "local"  # "local" or "docker"

# Part 2: Experiment Settings
AGENT_NAME: "Qwen3VL"
DATASET_PATH: "./data/memgui-tasks-all.csv"
SESSION_ID_SUFFIX: "my-experiment"

# Part 3: API & Parallelism
BASE_URL: "https://api.openai.com/v1"
NUM_OF_EMULATOR: 4
MAX_EVAL_SUBPROCESS: 8

# Part 4: Model API Keys
QWEN_API_KEY: "your-api-key"
QWEN_MODEL: "qwen3-vl-8b"

# Part 5: Paths (for local mode)
_MODE_PRESETS:
  environment:
    local:
      _CONDA_PATH: "/path/to/miniconda3"
      _EMULATOR_PATH: "/path/to/android-sdk/emulator/emulator"
      _ANDROID_SDK_PATH: "/path/to/android-sdk"
      _SYS_AVD_HOME: "/path/to/.android/avd"
      _SOURCE_AVD_HOME: "/path/to/.android/avd"
```

```bash
conda activate MemGUI
python run.py
```

| Argument | Default | Description |
|---|---|---|
| `--agents` | config | Agent name(s), comma-separated |
| `--mode` | full | `full` (exec+eval) / `exec` / `eval` |
| `--session_id` | config | Session identifier for results |
| `--task_id` | None | Run specific task only |
| `--max_attempts` | 3 | Max attempts per task |
| `--overwrite` | False | Overwrite existing results |
| `--no_concurrent` | False | Disable parallel evaluation |
```bash
# Full benchmark (execution + evaluation)
python run.py

# Run a specific task
python run.py --task_id 001-FindProductAndFilter

# Evaluation only (on existing trajectories)
python run.py --mode eval --session_id my-experiment

# Multiple attempts
python run.py --max_attempts 5

# Disable parallel execution
python run.py --no_concurrent
```

Each session_id creates an isolated benchmark folder in ./results/.

- The dataset is copied to `results.csv` to track progress
- Re-running the same session resumes from incomplete tasks
- Results accumulate across runs
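The resume behavior can be sketched as follows. This is an illustrative standalone snippet, not the repo's actual implementation, and the `results.csv` column names (`task_id`, `status`) are assumptions for the sketch:

```python
# Hypothetical sketch of session resumption: only tasks without a recorded
# success are scheduled again on the next run.
import csv
import io

RESULTS_CSV = """task_id,status
001-FindProductAndFilter,success
002-AddToCart,pending
003-CheckOrder,failed
"""

def incomplete_tasks(csv_text: str) -> list[str]:
    """Return task IDs that still need to be (re)run."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["task_id"] for row in reader if row["status"] != "success"]

print(incomplete_tasks(RESULTS_CSV))  # only tasks 002 and 003 are retried
```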
Click to expand output directory structure
```
results/session-{session_id}/
├── results.csv               # Aggregated execution & evaluation metrics
├── results.csv.lock          # File lock for concurrent access
├── metrics_summary.json      # Computed benchmark metrics
├── {agent_name}.json         # Leaderboard format (for submission)
├── config.yaml               # Config snapshot for reproducibility
│
└── {task_id}/
    └── {agent_name}/
        └── attempt_{n}/
            ├── log.json                  # Execution log with actions
            ├── 0.png, 1.png, ...         # Raw screenshots per step
            ├── stdout.txt, stderr.txt    # Process output logs
            ├── error.json                # Error info (if any)
            │
            ├── visualize_actions/        # Action visualization images
            │   └── step_1.png, step_2.png, ...
            │
            ├── single_actions/           # Individual action screenshots
            │   └── step_1.png, step_2.png, ...
            │
            ├── puzzle/                   # Evaluation puzzle images
            │   ├── puzzle.png
            │   ├── pre_eval_puzzle.png
            │   └── supplemental_puzzle.png (if needed)
            │
            ├── evaluation_summary.json   # Detailed evaluation results
            ├── final_decision.json       # Final evaluation decision
            ├── irr_analysis.json         # IRR evaluation results
            ├── badcase_analysis.json     # BadCase classification
            └── step_*_description.json   # Step-by-step analysis
```
The benchmark automatically computes:
| Metric | Description |
|---|---|
| Pass@K | Success rate within K attempts |
| IRR | Information Retrieval Rate (memory accuracy) |
| FRR | Failure Recovery Rate (learning from errors) |
| MTPR | Memory Task Performance Ratio |
| Step Ratio | Agent steps / Golden steps |
| Time/Step | Average execution time per step |
| Cost/Step | API cost per step (if applicable) |
Results are saved to metrics_summary.json and {agent_name}.json (leaderboard format).
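As a point of reference, Pass@K here means "at least one success within the first K attempts per task". The snippet below is an illustrative computation; the attempt-level data shape is an assumption for the sketch, not the benchmark's internal format:

```python
# Illustrative Pass@K: fraction of tasks solved within the first k attempts.
def pass_at_k(attempts_by_task: dict[str, list[bool]], k: int) -> float:
    """attempts_by_task maps task_id -> per-attempt success flags, in order."""
    solved = sum(any(a[:k]) for a in attempts_by_task.values())
    return solved / len(attempts_by_task)

attempts = {
    "001-FindProductAndFilter": [False, True, True],
    "002-AddToCart": [False, False, False],
    "003-CheckOrder": [True],
}
print(pass_at_k(attempts, 1))  # 1/3 of tasks succeed on the first attempt
print(pass_at_k(attempts, 3))  # 2/3 succeed within three attempts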
Add your agent to config.yaml:
```yaml
AGENTS:
  - NAME: "MyAgent"
    REPO_PATH: "./framework/models/MyAgent"
    ENV_NAME: "my_agent_env"
```

Create your agent class in framework/agents.py:
```python
class MyAgent(AndroidWorldAgent):
    agent_name = "MyAgent"

    def construct_command(self, task, full_task_description, output_dir, device):
        script = "run.py"
        args = f'--task "{full_task_description}" --output {output_dir} --device {device["serial"]}'
        return script, args
```

Your agent must output:

- Screenshots: `0.png`, `1.png`, ... (one per step)
- Log file: `log.json` with execution summary

The benchmark handles evaluation automatically.
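The output contract can be sketched like this. The `log.json` fields shown (`actions`, `num_steps`) are assumptions for illustration; check the repo for the exact schema, and in a real agent each `{i}.png` would be an actual device screenshot:

```python
# Minimal sketch of the required per-attempt outputs: numbered screenshots
# plus a log.json summary.
import json
import tempfile
from pathlib import Path

output_dir = Path(tempfile.mkdtemp())

steps = ["open_app", "tap_search", "type_query"]
for i, action in enumerate(steps):
    # Placeholder bytes stand in for a real PNG screenshot of step i.
    (output_dir / f"{i}.png").write_bytes(b"\x89PNG placeholder")

(output_dir / "log.json").write_text(json.dumps({
    "actions": steps,          # assumed field name
    "num_steps": len(steps),   # assumed field name
}))

print(sorted(p.name for p in output_dir.iterdir()))
# ['0.png', '1.png', '2.png', 'log.json']
```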
After running the benchmark:
Find {agent_name}.json in your session folder and fill in metadata:
```json
{
  "name": "YourAgent",
  "backbone": "GPT-4V",
  "type": "Agentic Workflow",
  "institution": "Your Institution",
  "date": "2026-02-03",
  "paperLink": "https://arxiv.org/...",
  "codeLink": "https://github.com/...",
  "hasUITree": true,
  "hasLongTermMemory": false
}
```

Submit via Pull Request to lgy0404/MemGUI-Bench → docs/data/agents/
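Before opening the PR, it's worth sanity-checking that the metadata file carries every field from the example above. This is an optional helper, not repo tooling, and the required-field list simply mirrors that example:

```python
# Check a leaderboard metadata JSON for missing fields.
import json

REQUIRED = {
    "name", "backbone", "type", "institution", "date",
    "paperLink", "codeLink", "hasUITree", "hasLongTermMemory",
}

def missing_fields(metadata_json: str) -> set[str]:
    """Return the required keys absent from the metadata document."""
    return REQUIRED - set(json.loads(metadata_json))

print(missing_fields('{"name": "YourAgent", "backbone": "GPT-4V"}'))
```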
Compress and submit via PR to lgy0404/memgui-bench-trajs:

```bash
# Compress session folder
cd results && zip -r your-agent-name.zip session-{id}

# Upload via the HuggingFace Web UI:
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" → "New Pull Request" → "Upload files"
# 3. Upload your zip file and submit the PR
```

See the submission guide for details.
| File | Tasks | Description |
|---|---|---|
| `memgui-tasks-all.csv` | 128 | Full benchmark |
| `memgui-tasks-40.csv` | 40 | Subset for quick testing |

Task Fields (click to expand)

- `task_identifier`
- `task_description`
- `task_app`
- `num_apps`
- `requires_ui_memory`
- `task_difficulty`
- `golden_steps`
```bibtex
@misc{liu2026memguibenchbenchmarkingmemorymobile,
  title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
  author={Guangyi Liu and Pengxiang Zhao and Yaozhen Liang and Qinyi Luo and Shunye Tang and Yuxiang Chai and Weifeng Lin and Han Xiao and WenHao Wang and Siheng Chen and Zhengxi Lu and Gao Wu and Hao Wang and Liang Liu and Yong Liu},
  year={2026},
  eprint={2602.06075},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2602.06075},
}
```

For questions, issues, or collaborations, please contact: [email protected]
If you find MemGUI-Bench helpful, please consider giving us a star ⭐!