
WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

🌐 HP | 📖 arXiv | GitHub | LeaderBoard


Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
The University of Tokyo
teaser.png

Figure 1. Overview of the WebChoreArena challenge. WebChoreArena extends WebArena with more complex and labor-intensive tasks, pushing the boundaries of agent capabilities. This enhanced benchmark allows for a clearer evaluation of progress in advanced models and reveals that even powerful models such as Gemini 2.5 Pro still have significant room for improvement.

🚀 News

  • 2025.08: We released a leaderboard for the research community.
  • 2025.05: We made this codebase public.

📕 Abstract

As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information from the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress.

task_type.png

📦 Requirements

API KEY

We used the Azure OpenAI API, the Anthropic Claude API, and the Google Gemini API for our experiments. We will share the code for the OpenAI API soon.

export AZURE_OPENAI_API_KEY='your-api-key-here'
export AZURE_OPENAI_ENDPOINT="your-azure-endpoint-here"
export ANTHROPIC_API_KEY='your-api-key-here'
export GEMINI_API_KEY='your-api-key-here'
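Before launching experiments, it can save a failed run to verify that the keys above are actually exported. A minimal sanity check (the variable names are taken from the exports above; the helper itself is hypothetical, not part of this codebase):

```python
import os

# API-key environment variables used in this repository (from the exports above).
REQUIRED_KEYS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
]


def missing_keys(env=None):
    """Return the names of required API-key variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Call `missing_keys()` at the top of a run script and abort early if it returns a non-empty list.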

End-to-end Evaluation

  1. Set up the standalone environment following the original WebArena repository.
  2. After evaluating each website, reset the environment to its initial state following the instructions here. After the reset, run inference for the cross-site tasks.

📂 Code Structure

WebChoreArena/
│── figs/          # Figures used across the project
│── AgentOccam/    # AgentOccam-based web browsing agent
│── BrowserGym/    # BrowserGym-based web browsing agent
│── README.md      # Main documentation for the overall project

Please dive into the AgentOccam and BrowserGym projects for more details.

📊 Dataset

We provide the dataset JSON files in both AgentOccam/config_files and BrowserGym/config_files, so the benchmark can be run directly without any additional downloads. (The dataset is also available as a Kaggle Dataset.)

Columns Info.

The columns in this JSON file are defined as follows:

| Column Name | Description |
|---|---|
| `task_id` | Unique identifier for the task |
| `sites` | Websites used in the task |
| `start_url` | Initial URL where the agent begins |
| `start_url_lite` | Simplified start URL for easier tasks |
| `strage_state` | Path to the login/session state |
| `affect_environment` | Whether the task affects the environment |
| `required_wait` | Whether a wait is needed after the task |
| `intent_template` | Template defining the task goal |
| `intent` | Specific task goal or instruction |
| `required_obs` | Required modalities (any/text/image) |
| `type_main` | Main task category |
| `type_sub` | Subcategory of the task |
| `description` | How the task should be performed |
| `instantiation_dict` | Dictionary with content for the templates |
| `eval` | Evaluation method used |
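As a quick way to inspect these fields, the sketch below loads one task config and pulls out a few of the columns listed above. The path and the assumption that each task is a single JSON object are illustrative; check the actual files under config_files for the exact layout:

```python
import json


def load_task(path):
    """Load a single task config JSON and return a few key fields.

    Assumes the file contains one JSON object whose keys match the
    column table above (task_id, sites, intent, type_main, eval, ...).
    """
    with open(path) as f:
        task = json.load(f)
    return {
        "task_id": task["task_id"],
        "sites": task["sites"],
        "intent": task["intent"],
        "type_main": task.get("type_main"),
        "eval": task["eval"],
    }
```

For example, `load_task("AgentOccam/config_files/<some_task>.json")` (hypothetical filename) returns a small summary dict for that task.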

Small Set

Running the full WebChoreArena benchmark can cost several hundred dollars in API usage. Therefore, we also provide a small subset of tasks. For each directory, the file small_set_ids.txt inside config_files specifies the task IDs used in the small subset. This subset corresponds to the one used in the subset experiments reported in Table 2 of the paper.
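Selecting the small subset amounts to reading small_set_ids.txt and keeping only the matching task configs. A minimal sketch, assuming the file lists one numeric task ID per line (verify the format against the actual small_set_ids.txt):

```python
def read_small_set_ids(path):
    """Read task IDs from small_set_ids.txt (assumed one ID per line)."""
    with open(path) as f:
        return {int(line.strip()) for line in f if line.strip()}


def filter_small_set(tasks, ids):
    """Keep only the task dicts whose task_id is in the small-set IDs."""
    return [t for t in tasks if t["task_id"] in ids]
```

Passing the resulting list to your runner then reproduces the subset setting from Table 2 of the paper.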

✅ Final Results

| Agent | Model | Shopping | Admin | Reddit | GitLab | Cross | Overall |
|---|---|---|---|---|---|---|---|
| AgentOccam | GPT-4o (2024-05-13) | 10.3 | 4.5 | 9.9 | 7.1 | 0.0 | 6.8 |
| AgentOccam | Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 27.4 | 28.8 | 23.1 | 22.8 | 7.7 | 23.5 |
| AgentOccam | Gemini 2.5 Pro (preview-03-25)* | 41.9 | 42.4 | 44.0 | 38.6 | 10.8 | 37.8 |
| BrowserGym | GPT-4o (2024-05-13) | 0.9 | 2.3 | 5.5 | 3.9 | 0.0 | 2.6 |
| BrowserGym | Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) | 16.2 | 26.5 | 18.7 | 25.2 | 30.8 | 23.1 |
| BrowserGym | Gemini 2.5 Pro (preview-03-25)* | 47.9 | 50.0 | 44.0 | 40.2 | 40.0 | 44.9 |

*: preview-03-25 currently redirects to the latest stable version, and we have observed a performance degradation with the updated model. We will add those results later.

🤝 Acknowledgement

This repository builds on the AgentOccam, BrowserGym, and WebArena codebases. We sincerely appreciate these great works.

✉️ Contact

If you have questions, please open an issue mentioning @AtsuMiyai or send an email to miyai[at]cvm.t.u-tokyo.ac.jp.
