This folder contains the evaluation harness for the MiniWoB++ benchmark, powered by BrowserGym, for evaluating how well a browsing-capable agent performs on synthetic web browsing tasks.
Please follow this document to set up a local development environment for OpenDevin.
Create a `config.toml` file at the root of the workspace if it does not already exist, and add the following configuration:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"

[sandbox]
box_type = "ssh"
timeout = 120

# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```

MiniWoB++ requires a static copy of its task websites to be accessible via URL from the machine running the OpenDevin agents.
- Clone miniwob (use a specific frozen commit for reproducibility):

```bash
git clone [email protected]:Farama-Foundation/miniwob-plusplus.git
git -C "./miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
```

- Set up the MiniWoB URL in `evaluation/miniwob/scripts/run_infer.sh` (replace `PATH_TO_MINIWOB_CLONED_REPO` below with the absolute path to your `miniwob-plusplus` folder):

```bash
export MINIWOB_URL="file://<PATH_TO_MINIWOB_CLONED_REPO>/miniwob/html/miniwob/"
```

Open the MiniWoB URL above in a browser and check that the task pages load correctly.
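A `file://` URL that points at a missing directory can fail in confusing ways later, so a quick programmatic check may save time. This is a hypothetical helper for illustration, not part of the evaluation scripts:

```python
import os
from pathlib import Path
from urllib.parse import urlparse

def check_miniwob_url(url: str) -> bool:
    """Return True if a file:// MiniWoB URL points at an existing directory."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False
    return Path(parsed.path).is_dir()

# Reads MINIWOB_URL as exported by run_infer.sh; the fallback path here
# is only an illustrative placeholder.
url = os.environ.get("MINIWOB_URL", "file:///tmp/miniwob-plusplus/miniwob/html/miniwob/")
print(check_miniwob_url(url))
```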
Run the evaluation:

```bash
bash evaluation/miniwob/scripts/run_infer.sh
```

Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`.
To calculate the average reward, run:

```bash
poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_outputs/outputs/miniwob/SOME_AGENT/EXP_NAME/output.jsonl
```

You can start your own fork of our huggingface evaluation outputs and submit a PR with your evaluation results following the guide here.
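If you want to inspect the numbers without the script, the sketch below shows one way to average a per-record reward over an `output.jsonl` file. The field name `test_result` is an assumption for illustration; check your actual `output.jsonl` for the real schema:

```python
import json

def average_reward(jsonl_path: str) -> float:
    """Average a per-task reward field across a JSONL results file.

    NOTE: the "test_result" field name is assumed for illustration;
    inspect your output.jsonl to confirm the actual schema.
    """
    rewards = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            rewards.append(float(record["test_result"]))
    return sum(rewards) / len(rewards) if rewards else 0.0
```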
Tested on BrowsingAgent V1.0.

MiniWoB++, 125 tasks (3 runs due to random task initialization), max 10 steps per task:

- GPT-4o: 0.384, 0.416, 0.424; avg: 0.408
- GPT-3.5: 0.288, 0.256, 0.272; avg: 0.272
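The per-run averages above can be reproduced with a couple of lines:

```python
# Quick arithmetic check of the reported per-run scores and their averages.
runs = {
    "GPT-4o": [0.384, 0.416, 0.424],
    "GPT-3.5": [0.288, 0.256, 0.272],
}
for model, scores in runs.items():
    print(model, round(sum(scores) / len(scores), 3))  # → 0.408 and 0.272
```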