Industry-Level Benchmark
SWE-BENCH MOBILE
Evaluating AI coding agents on real-world mobile development tasks from an industry-level iOS codebase.
50 Tasks · 449 Test Cases · 4 Agents · 9 Models
Leaderboard
Top performing agents
| Rank | Agent | Model | Resolved | Test Pass Rate |
|---|---|---|---|---|
| 1 | | Opus 4.5 | 12.0% | 28.1% |
| 2 | | Sonnet 4.5 | 12.0% | 26.7% |
| 3 | | GLM 4.6 | 12.0% | 19.6% |
| 4 | | Sonnet 4.5 | 10.0% | 28.1% |
| 5 | | GLM 4.6 | 10.0% | 26.7% |
Task Categories
50 industry-level mobile development tasks
| Category | Tasks | Avg Pass Rate |
|---|---|---|
| UI Components | 18 | 12.5% |
| Data Management | 10 | 15.3% |
| Gesture & Interaction | 8 | 8.0% |
| Media & Assets | 7 | 9.8% |
| Networking | 4 | 11.2% |
| Other | 3 | 10.5% |
Task details are private. Contact us for research collaboration.
Real-World PRDs
Tasks derived from actual product requirement documents used in mobile app development.
Automated Testing
Comprehensive test suites that validate functionality, not just syntax correctness.
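As a sketch of what validating functionality (rather than syntax) means in practice, the hypothetical XCTest below asserts runtime behavior specified in a PRD. The component, its overflow rule, and all names are invented for illustration; the real SWE-Bench Mobile suites are private.

```swift
import XCTest

// Hypothetical stand-in for a component an agent might implement from a PRD.
// The actual benchmark tasks and tests are private; this only shows the style
// of behavior-level assertion a functional suite relies on.
final class BadgeCounter {
    private(set) var count = 0
    func increment(by n: Int) { count += n }
    // PRD-style rule: counts above 99 render as "99+".
    var displayText: String { count > 99 ? "99+" : String(count) }
}

final class BadgeCounterTests: XCTestCase {
    func testBadgeClampsAtNinetyNinePlus() {
        let badge = BadgeCounter()
        badge.increment(by: 150)
        // Fails for any solution that compiles but ignores the overflow rule.
        XCTAssertEqual(badge.displayText, "99+")
    }
}
```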
Reproducible Results
A standardized evaluation pipeline ensures consistent and comparable results.
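For a concrete reading of how such a pipeline could aggregate scores, here is a minimal Swift sketch. It assumes, as our interpretation rather than a published schema, that the leaderboard's two percentage columns are the share of the 50 tasks fully resolved and the share of the 449 test cases passed; all type and function names are illustrative.

```swift
import Foundation

// Illustrative result record; field names are assumptions, not the benchmark's schema.
struct TaskResult {
    let taskID: String
    let testsPassed: Int
    let testsTotal: Int
    // Assumed definition: a task is resolved only if every one of its tests passes.
    var resolved: Bool { testsPassed == testsTotal }
}

// Aggregates per-task results into the two leaderboard-style metrics:
// resolve rate over tasks and pass rate over individual test cases.
func summarize(_ results: [TaskResult]) -> (resolveRate: Double, testPassRate: Double) {
    let resolvedCount = results.filter(\.resolved).count
    let passed = results.reduce(0) { $0 + $1.testsPassed }
    let total = results.reduce(0) { $0 + $1.testsTotal }
    return (Double(resolvedCount) / Double(results.count),
            Double(passed) / Double(total))
}
```

Under this reading, a 12.0% resolve rate corresponds to 6 of the 50 tasks fully solved, and a 28.1% test pass rate to roughly 126 of the 449 test cases passing.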
Interested in SWE-Bench Mobile?
Contact us for research collaboration or to discuss evaluating your AI coding agent.
We are currently preparing the repo for public release. Please follow our project updates.