Industry-Level Benchmark

SWE-BENCH MOBILE

Evaluating AI coding agents on real-world mobile development tasks from an industry-level iOS codebase.

50 Tasks · 449 Test Cases · 4 Agents · 9 Models

Leaderboard

Top-performing agents

#  Agent        Model       Task Pass Rate  Test Pass Rate
1  Cursor       Opus 4.5    12.0%           28.1%
2  Cursor       Sonnet 4.5  12.0%           26.7%
3  Codex        GLM 4.6     12.0%           19.6%
4  Codex        Sonnet 4.5  10.0%           28.1%
5  Claude Code  GLM 4.6     10.0%           26.7%
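
The two rates measure different things: the task pass rate is the fraction of the 50 tasks fully solved, while the test pass rate is the fraction of the 449 individual test cases passed. As a worked illustration of how the two relate, the sketch below aggregates both from per-task results. The `TaskResult` type and the aggregation function are assumptions for illustration, not the benchmark's actual harness.

```swift
// Illustrative sketch only; the real evaluation harness is not yet public.
// `TaskResult` is a hypothetical record of one benchmark task's outcome.
struct TaskResult {
    let taskPassed: Bool   // true only if all of this task's test cases passed
    let testsPassed: Int   // test cases passed for this task
    let testsTotal: Int    // test cases defined for this task
}

// Task-level rate: fraction of tasks fully solved (out of 50).
// Test-level rate: fraction of test cases passed overall (out of 449).
func passRates(_ results: [TaskResult]) -> (taskRate: Double, testRate: Double) {
    let tasksPassed = results.filter(\.taskPassed).count
    let testsPassed = results.map(\.testsPassed).reduce(0, +)
    let testsTotal = results.map(\.testsTotal).reduce(0, +)
    return (Double(tasksPassed) / Double(results.count),
            Double(testsPassed) / Double(testsTotal))
}
```

Solving 6 of the 50 tasks, for example, gives exactly the 12.0% task pass rate shown for the top three entries.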

Task Categories

50 industry-level mobile development tasks

Category               Tasks  Avg. Pass Rate
UI Components             18  12.5%
Data Management           10  15.3%
Gesture & Interaction      8   8.0%
Media & Assets             7   9.8%
Networking                 4  11.2%
Other                      3  10.5%

Task details are private. Contact us for research collaboration.

Real-World PRDs

Tasks derived from actual product requirement documents used in mobile app development.

Automated Testing

Comprehensive test suites that validate functional behavior, not just syntactic correctness.
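
As a rough illustration of what validating functionality (rather than compilation) can look like on an iOS codebase, the XCTest sketch below exercises behavior and asserts on the resulting state, so code that compiles but behaves incorrectly still fails. The `CartViewModel` type and its API are hypothetical; the actual benchmark tests are private.

```swift
import XCTest

// Hypothetical view model standing in for code an agent would modify.
final class CartViewModel {
    private(set) var items: [String] = []
    func add(_ item: String) { items.append(item) }
    func removeAll() { items.removeAll() }
}

final class CartViewModelTests: XCTestCase {
    // A functional test: drives the API and checks observable state,
    // not merely that the code builds.
    func testRemoveAllEmptiesCart() {
        let cart = CartViewModel()
        cart.add("Coffee")
        cart.add("Tea")
        cart.removeAll()
        XCTAssertTrue(cart.items.isEmpty)
    }
}
```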

Reproducible Results

Standardized evaluation pipeline ensures consistent and comparable results.
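
One common way to keep agent evaluations comparable across runs is to pin every run parameter in a single versioned record; the sketch below shows what such a configuration might contain. The field names are assumptions for illustration, not the benchmark's actual schema.

```swift
import Foundation

// Hypothetical run configuration: pinning the agent, model, task set, and
// exact repository revision is one way a standardized pipeline keeps
// results consistent and comparable across runs.
struct EvaluationRun: Codable {
    let agent: String            // e.g. "Cursor", "Codex", "Claude Code"
    let model: String            // e.g. "Opus 4.5", "Sonnet 4.5", "GLM 4.6"
    let benchmarkVersion: String // version of the task/test suite
    let repoCommit: String       // codebase revision the agent works against
    let taskIDs: [String]        // fixed set of the 50 tasks
    let timeoutSeconds: Int      // per-task time budget
}
```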

Interested in SWE-Bench Mobile?

Contact us for research collaboration or to discuss evaluating your AI coding agent.

Learn More

We are currently preparing the repo for public release. Please follow our project updates.