MemGUI-Bench
πŸ“’ Updates

  • 2026-02-15: πŸŽ‰ MemGUI-Bench adopted by Mobile-Agent-v3.5! Congrats to the Tongyi Lab team for achieving 27.1% on Easy tasks with GUI-Owl-1.5-32B. We welcome more agents to challenge the full benchmark! πŸš€
  • 2026-02-09: πŸ—‚οΈ Benchmark tasks now available on HuggingFace: lgy0404/MemGUI-Bench
  • 2026-02-09: πŸ“„ Paper released on arXiv! Check out our paper: arXiv:2602.06075
  • 2026-02-03: Initial release of MemGUI-Bench benchmark. Check out our website.

🐳 Environment Setup

Option 1: Docker (Recommended)

Use our pre-configured Docker image with all dependencies installed:

# Pull the image (public, no login required)
sudo docker pull \
  crpi-6p9eo5da91i2tx5v.cn-hangzhou.personal.cr.aliyuncs.com/memgui/memgui-bench:26020301

# Run container
sudo docker run -it --privileged \
  --name memgui-bench \
  -w /root/MemGUI-Bench \
  crpi-6p9eo5da91i2tx5v.cn-hangzhou.personal.cr.aliyuncs.com/memgui/memgui-bench:26020301 \
  bash

# Inside container, you're already in /root/MemGUI-Bench
python run.py

Note: The --privileged flag is required for Android emulator support.

The Docker image includes:

  • Pre-configured Android emulator with MemGUI-AVD
  • All required conda environments
  • ADB and Android SDK tools

Option 2: Local Setup

For developers who prefer local installation:

Prerequisites

  1. Conda: Install from conda.io
  2. Android Debug Bridge (ADB): Install from Android Developer and add to PATH
  3. Android Studio & AVD:
    • Download and install Android Studio
    • Download the pre-configured MemGUI-AVD emulator snapshot:
      • Download: Baidu Netdisk (Code: tfnb)
      • File: MemGUI-AVD-250704-base.zip
    • Extract to your AVD directory:
      • Windows: C:\Users\[Username]\.android\avd\
      • macOS: ~/Library/Android/avd/
      • Linux: ~/.android/avd/
    • Launch Android Studio β†’ Device Manager β†’ Start MemGUI-AVD

Repository Setup

# Clone repository with submodules
git clone --recursive https://github.com/lgy0404/MemGUI-Bench.git
cd MemGUI-Bench

# If already cloned without --recursive, init submodules manually:
# git submodule update --init --recursive

# Run setup script
./setup.sh

# Configure
cp config.yaml.example.opensource config.yaml
# Edit config.yaml with your paths

βš™οΈ Configuration

Edit config.yaml to match your environment:

# Environment Mode
ENVIRONMENT_MODE: "local"  # "local" or "docker"

# Experiment Settings
AGENT_NAME: "Qwen3VL"
DATASET_PATH: "./data/memgui-tasks-all.csv"
SESSION_ID_SUFFIX: "my-experiment"

# API & Parallelism
BASE_URL: "https://api.openai.com/v1"
NUM_OF_EMULATOR: 4
MAX_EVAL_SUBPROCESS: 8

# Model API Keys
QWEN_API_KEY: "your-api-key"
QWEN_MODEL: "qwen3-vl-8b"

Full configuration example (for local mode):

# Part 1: Environment Mode
ENVIRONMENT_MODE: "local"  # "local" or "docker"

# Part 2: Experiment Settings
AGENT_NAME: "Qwen3VL"
DATASET_PATH: "./data/memgui-tasks-all.csv"
SESSION_ID_SUFFIX: "my-experiment"

# Part 3: API & Parallelism
BASE_URL: "https://api.openai.com/v1"
NUM_OF_EMULATOR: 4
MAX_EVAL_SUBPROCESS: 8

# Part 4: Model API Keys
QWEN_API_KEY: "your-api-key"
QWEN_MODEL: "qwen3-vl-8b"

# Part 5: Paths (for local mode)
_MODE_PRESETS:
  environment:
    local:
      _CONDA_PATH: "/path/to/miniconda3"
      _EMULATOR_PATH: "/path/to/android-sdk/emulator/emulator"
      _ANDROID_SDK_PATH: "/path/to/android-sdk"
      _SYS_AVD_HOME: "/path/to/.android/avd"
      _SOURCE_AVD_HOME: "/path/to/.android/avd"

πŸš€ Usage

Running the Benchmark

conda activate MemGUI
python run.py

Command-line Arguments

| Argument | Default | Description |
|---|---|---|
| --agents | config | Agent name(s), comma-separated |
| --mode | full | full (exec+eval) / exec / eval |
| --session_id | config | Session identifier for results |
| --task_id | None | Run a specific task only |
| --max_attempts | 3 | Max attempts per task |
| --overwrite | False | Overwrite existing results |
| --no_concurrent | False | Disable parallel evaluation |

Examples

# Full benchmark (execution + evaluation)
python run.py

# Run specific task
python run.py --task_id 001-FindProductAndFilter

# Evaluation only (on existing trajectories)
python run.py --mode eval --session_id my-experiment

# Multiple attempts
python run.py --max_attempts 5

# Disable parallel execution
python run.py --no_concurrent

πŸ“ Benchmark Session

Each session_id creates an isolated benchmark folder in ./results/.

  • The dataset is copied to results.csv to track progress
  • Re-running the same session resumes from incomplete tasks
  • Results accumulate across runs

Output Structure

results/session-{session_id}/
β”œβ”€β”€ results.csv                    # Aggregated execution & evaluation metrics
β”œβ”€β”€ results.csv.lock               # File lock for concurrent access
β”œβ”€β”€ metrics_summary.json           # Computed benchmark metrics
β”œβ”€β”€ {agent_name}.json              # Leaderboard format (for submission)
β”œβ”€β”€ config.yaml                    # Config snapshot for reproducibility
β”‚
└── {task_id}/
    └── {agent_name}/
        └── attempt_{n}/
            β”œβ”€β”€ log.json                    # Execution log with actions
            β”œβ”€β”€ 0.png, 1.png, ...          # Raw screenshots per step
            β”œβ”€β”€ stdout.txt, stderr.txt     # Process output logs
            β”œβ”€β”€ error.json                 # Error info (if any)
            β”‚
            β”œβ”€β”€ visualize_actions/         # Action visualization images
            β”‚   └── step_1.png, step_2.png, ...
            β”‚
            β”œβ”€β”€ single_actions/            # Individual action screenshots
            β”‚   └── step_1.png, step_2.png, ...
            β”‚
            β”œβ”€β”€ puzzle/                    # Evaluation puzzle images
            β”‚   β”œβ”€β”€ puzzle.png
            β”‚   β”œβ”€β”€ pre_eval_puzzle.png
            β”‚   └── supplemental_puzzle.png (if needed)
            β”‚
            β”œβ”€β”€ evaluation_summary.json    # Detailed evaluation results
            β”œβ”€β”€ final_decision.json        # Final evaluation decision
            β”œβ”€β”€ irr_analysis.json          # IRR evaluation results
            β”œβ”€β”€ badcase_analysis.json      # BadCase classification
            └── step_*_description.json    # Step-by-step analysis

πŸ“Š Metrics

The benchmark automatically computes:

| Metric | Description |
|---|---|
| Pass@K | Success rate within K attempts |
| IRR | Information Retrieval Rate (memory accuracy) |
| FRR | Failure Recovery Rate (learning from errors) |
| MTPR | Memory Task Performance Ratio |
| Step Ratio | Agent steps / golden steps |
| Time/Step | Average execution time per step |
| Cost/Step | API cost per step (if applicable) |

Results are saved to metrics_summary.json and {agent_name}.json (leaderboard format).
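
Pass@K can be read as the fraction of tasks solved by at least one of the first K attempts. A minimal sketch under that reading (the outcome lists are illustrative, not real benchmark results):

```python
def pass_at_k(attempts_per_task, k):
    """attempts_per_task: one list of attempt outcomes per task (True = success),
    ordered by attempt number. A task counts if any of its first k attempts passed."""
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)

# Three tasks, up to three attempts each (illustrative outcomes)
outcomes = [
    [False, True, True],    # solved on attempt 2
    [False, False, False],  # never solved
    [True],                 # solved on attempt 1
]
print(pass_at_k(outcomes, 1))  # fraction solved with a single attempt
print(pass_at_k(outcomes, 3))  # fraction solved within three attempts
```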


πŸ€– Adding a New Agent

Step 1: Add Config

Add your agent to config.yaml:

AGENTS:
  - NAME: "MyAgent"
    REPO_PATH: "./framework/models/MyAgent"
    ENV_NAME: "my_agent_env"

Step 2: Implement Agent Class

Create your agent class in framework/agents.py:

class MyAgent(AndroidWorldAgent):
    agent_name = "MyAgent"
  
    def construct_command(self, task, full_task_description, output_dir, device):
        script = "run.py"
        args = f'--task "{full_task_description}" --output {output_dir} --device {device["serial"]}'
        return script, args

Step 3: Output Format

Your agent must output:

  • Screenshots: 0.png, 1.png, ... (one per step)
  • Log file: log.json with execution summary

The benchmark handles evaluation automatically.
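
A minimal sketch of emitting that layout from an agent's side; the log.json fields shown here are illustrative, since the exact schema is not specified in this section:

```python
import json
import os
import tempfile

def write_attempt_output(output_dir, actions):
    """Write one screenshot per step plus a summary log.json.
    The log fields below are illustrative, not the benchmark's required schema."""
    os.makedirs(output_dir, exist_ok=True)
    for i, action in enumerate(actions):
        # In a real agent this would be the device screenshot captured at step i.
        with open(os.path.join(output_dir, f"{i}.png"), "wb") as f:
            f.write(b"")  # placeholder bytes for the sketch
    with open(os.path.join(output_dir, "log.json"), "w") as f:
        json.dump({"steps": len(actions), "actions": actions}, f, indent=2)

out = tempfile.mkdtemp()
write_attempt_output(out, [{"type": "tap", "x": 100, "y": 200}, {"type": "done"}])
print(sorted(os.listdir(out)))  # ['0.png', '1.png', 'log.json']
```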


πŸ“€ Leaderboard Submission

After running the benchmark:

1. Submit Results JSON (Required)

Find {agent_name}.json in your session folder and fill in metadata:

{
  "name": "YourAgent",
  "backbone": "GPT-4V",
  "type": "Agentic Workflow",
  "institution": "Your Institution",
  "date": "2026-02-03",
  "paperLink": "https://arxiv.org/...",
  "codeLink": "https://github.com/...",
  "hasUITree": true,
  "hasLongTermMemory": false
}

Submit via Pull Request to lgy0404/MemGUI-Bench β†’ docs/data/agents/
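
Before opening the PR, a quick sanity check that every metadata field is filled in can help. The field list below is taken from the sample JSON above; treat it as an assumption, since the accepted schema may evolve:

```python
import json

# Field names copied from the sample submission JSON (assumed, not authoritative).
REQUIRED_FIELDS = [
    "name", "backbone", "type", "institution", "date",
    "paperLink", "codeLink", "hasUITree", "hasLongTermMemory",
]

def missing_fields(submission_json):
    """Return required fields that are absent or left empty."""
    entry = json.loads(submission_json)
    return [f for f in REQUIRED_FIELDS if f not in entry or entry[f] in ("", None)]

sample = '{"name": "YourAgent", "backbone": "GPT-4V", "date": "2026-02-03"}'
print(missing_fields(sample))  # the six fields still to fill in
```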

2. Upload Trajectories (Optional but Recommended)

Compress and submit via PR to lgy0404/memgui-bench-trajs:

# Compress session folder
cd results && zip -r your-agent-name.zip session-{id}

# Upload via HuggingFace Web UI:
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" β†’ "New Pull Request" β†’ "Upload files"
# 3. Upload your zip file and submit the PR

See submission guide for details.


πŸ“š Tasks

Task Distribution

| File | Tasks | Description |
|---|---|---|
| memgui-tasks-all.csv | 128 | Full benchmark |
| memgui-tasks-40.csv | 40 | Subset for quick testing |
Task Fields:
  • task_identifier
  • task_description
  • task_app
  • num_apps
  • requires_ui_memory
  • task_difficulty
  • golden_steps

πŸ“ Citation

@misc{liu2026memguibenchbenchmarkingmemorymobile,
  title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
  author={Guangyi Liu and Pengxiang Zhao and Yaozhen Liang and Qinyi Luo and Shunye Tang and Yuxiang Chai and Weifeng Lin and Han Xiao and WenHao Wang and Siheng Chen and Zhengxi Lu and Gao Wu and Hao Wang and Liang Liu and Yong Liu},
  year={2026},
  eprint={2602.06075},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2602.06075},
}

πŸ“§ Contact

For questions, issues, or collaborations, please contact: [email protected]


⭐ Star History

If you find MemGUI-Bench helpful, please consider giving us a star ⭐!


About

Official code repo for the paper "MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments"
