🌐HP | 📖 arXiv | GitHub | LeaderBoard
Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
- 2025.08: We added a leaderboard for the research community.
- 2025.05: We made this codebase public.
As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information from the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress.
We used the Azure OpenAI API, Anthropic Claude API, and Google Gemini API for the experiments. We will share the code for the OpenAI API soon.
export AZURE_OPENAI_API_KEY='your-api-key-here'
export AZURE_OPENAI_ENDPOINT="your-azure-endpoint-here"
export ANTHROPIC_API_KEY='your-api-key-here'
export GEMINI_API_KEY='your-api-key-here'
- Set up the standalone environment following the original WebArena repository.
- After evaluating each website, reset the environment to its initial state following the instructions here. After the reset, run the inference for the cross-site tasks.
WebChoreArena/
├── figs/        # figures used across the project
├── AgentOccam/  # web browsing agent module
├── BrowserGym/  # web browsing agent module
└── README.md    # main documentation for the overall project

Please dive into the AgentOccam and BrowserGym projects for more details.
We provide the dataset JSON files in AgentOccam/config_files and BrowserGym/config_files. The benchmark can be run directly without any additional downloads. (We also provide the dataset as a Kaggle Dataset, which you can download instead.)
The columns in this JSON file are defined as follows:
| Column Name | Description |
|---|---|
| `task_id` | Unique identifier for the task |
| `sites` | Websites used in the task |
| `start_url` | Initial URL where the agent begins |
| `start_url_lite` | Simplified start URL for easier tasks |
| `strage_state` | Path to the login/session state |
| `affect_environment` | Whether the task affects the environment |
| `required_wait` | Whether a wait is needed after the task |
| `intent_template` | Template defining the task goal |
| `intent` | Specific task goal or instruction |
| `required_obs` | Required modalities (any/text/image) |
| `type_main` | Main task category |
| `type_sub` | Subcategory of the task |
| `description` | How the task should be performed |
| `instantiation_dict` | Dictionary with content for the templates |
| `eval` | Evaluation method used |
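As a rough illustration of the schema above, a single task entry might look like the following sketch. All field values here are invented for demonstration and are not taken from the actual dataset; consult the real config files for exact formats.

```python
import json

# Hypothetical task entry illustrating the column schema above.
# Every value below is illustrative, not real benchmark data.
sample_task = {
    "task_id": 0,
    "sites": ["shopping"],
    "start_url": "http://localhost:7770",
    "start_url_lite": "http://localhost:7770",
    "strage_state": "./.auth/shopping_state.json",
    "affect_environment": False,
    "required_wait": False,
    "intent_template": "How many orders were placed in {{month}}?",
    "intent": "How many orders were placed in January?",
    "required_obs": "any",
    "type_main": "Calculation",
    "type_sub": "counting",
    "description": "Count the matching orders across result pages.",
    "instantiation_dict": {"month": "January"},
    "eval": {"eval_types": ["string_match"]},
}

# Entries round-trip through JSON, as in the provided config files.
serialized = json.dumps(sample_task, indent=2)
print(serialized.splitlines()[1])
```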
Running the full WebChoreArena benchmark can cost several hundred dollars in API usage. Therefore, we also provide a small subset of tasks. For each directory, the file small_set_ids.txt inside config_files specifies the task IDs used in the small subset. This subset corresponds to the one used in the subset experiments reported in Table 2 of the paper.
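One simple way to run only the small subset is to filter a config file by the IDs in small_set_ids.txt. The sketch below is a minimal, self-contained example with toy files; it assumes the config is a JSON list of task dicts and that small_set_ids.txt lists one task ID per line, which may differ from the actual file layouts.

```python
import json
from pathlib import Path

def load_small_subset(config_path: str, ids_path: str) -> list:
    """Return only the task entries whose task_id appears in ids_path."""
    tasks = json.loads(Path(config_path).read_text())
    small_ids = {int(tok) for tok in Path(ids_path).read_text().split()}
    return [t for t in tasks if t["task_id"] in small_ids]

# Minimal self-contained demo with toy files (not the real configs).
Path("tasks.json").write_text(json.dumps(
    [{"task_id": 0}, {"task_id": 1}, {"task_id": 2}]
))
Path("small_set_ids.txt").write_text("0\n2\n")

subset = load_small_subset("tasks.json", "small_set_ids.txt")
print([t["task_id"] for t in subset])  # [0, 2]
```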
| Agent | Model | Shopping | Admin | Reddit | GitLab | Cross | Overall |
|---|---|---|---|---|---|---|---|
| AgentOccam | GPT-4o (2024-05-13) | 10.3 | 4.5 | 9.9 | 7.1 | 0.0 | 6.8 |
| | Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 27.4 | 28.8 | 23.1 | 22.8 | 7.7 | 23.5 |
| | Gemini 2.5 Pro (preview-03-25)* | 41.9 | 42.4 | 44.0 | 38.6 | 10.8 | 37.8 |
| BrowserGym | GPT-4o (2024-05-13) | 0.9 | 2.3 | 5.5 | 3.9 | 0.0 | 2.6 |
| | Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 16.2 | 26.5 | 18.7 | 25.2 | 30.8 | 23.1 |
| | Gemini 2.5 Pro (preview-03-25)* | 47.9 | 50.0 | 44.0 | 40.2 | 40.0 | 44.9 |
*: Currently, preview-03-25 is being redirected to the latest stable version, and we have observed a performance degradation with the updated model. We will add those results later.
This repository builds on the following codebases. We sincerely appreciate their great work.
If you have questions, please open an issue mentioning @AtsuMiyai or send an email to miyai[at]cvm.t.u-tokyo.ac.jp

