
Commit 6a0ffc5

xingyaoww and enyst authored
[Evaluation] Use the latest official SWE-Bench Dockerization for evaluation (OpenHands#2728)
* add newline after patch to fix patch apply
* new swebench wip
* add newline after patch to fix patch apply
* only add newline if not empty
* update swebench source and update
* update gitignore for swebench eval
* update old prep_eval
* update gitignore
* add scripts for push and pull swebench images
* update eval_infer.sh
* update eval_infer for new docker workflow
* update script to create markdown report based on report.json
* update eval infer to use update output
* update readme
* only move result to folder if running whole file
* remove set-x
* update conversion script
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* make sure last line end with newline
* switch to an fix attempt branch of swebench
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md

---------

Co-authored-by: Engel Nyst <[email protected]>
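Several of the squashed commits above deal with patch application failing when a generated patch lacks a trailing newline. A minimal sketch of that fix, assuming it is implemented as a small string-normalization helper (the name `normalize_patch` is illustrative, not the actual OpenHands code):

```python
def normalize_patch(patch: str) -> str:
    """Append a trailing newline to a git patch if it is missing.

    `git apply` can reject a diff whose last line has no newline, so we
    add one -- but only when the patch is non-empty, mirroring the
    "only add newline if not empty" commit above.
    """
    if patch and not patch.endswith("\n"):
        patch += "\n"
    return patch
```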
1 parent 6246cb8 commit 6a0ffc5

11 files changed: 809 additions & 313 deletions

.gitignore

Lines changed: 5 additions & 1 deletion
```diff
@@ -212,4 +212,8 @@ cache
 config.toml
 config.toml.bak
 
-containers/agnostic_sandbox
+containers/agnostic_sandbox
+
+# swe-bench-eval
+image_build_logs
+run_instance_logs
```

evaluation/swe_bench/README.md

Lines changed: 15 additions & 7 deletions
```diff
@@ -2,6 +2,8 @@
 
 This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
 
+**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
+
 ## Setup Environment
 
 Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
```
````diff
@@ -10,7 +12,7 @@ Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/D
 
 In [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., code of the repository we want the agent to edit) AND the **conda environment**, so that in evaluation (inference) time, we can directly leverage existing environments for efficient evaluation.
 
-**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
+**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:
 
 ```bash
 docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
````
````diff
@@ -124,16 +126,23 @@ After running the inference, you will obtain a `output.jsonl` (by default it wil
 
 With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patches, and produce a fine-grained report.
 
+**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
+
 If you want to evaluate existing results, you should first run this to clone existing outputs
 
 ```bash
 git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
 ```
 
-To prepare for swe-bench evaluation, you should pull evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download swe-bench data by running:
+If you have extra local space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:
+
+```bash
+evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
+```
 
+If you want to save a bit of disk space (e.g., with ~50GB free disk space), while speeding up the image pre-build process, you can pull the environment-level docker images:
 ```bash
-evaluation/swe_bench/scripts/eval/prep_eval.sh
+evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
 ```
 
 Then you can run the following:
````
```diff
@@ -146,12 +155,11 @@ Then you can run the following:
 
 PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
 
```
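For concreteness, a minimal sketch of producing such a JSONL predictions file (the instance ID, model name, and patch below are placeholder values matching the `XXX`/`YYY`/`ZZZ` template above, not real data):

```python
import json

# Placeholder predictions illustrating the three required keys.
predictions = [
    {
        "instance_id": "ZZZ",                 # SWE-Bench instance ID
        "model_name_or_path": "YYY",          # identifier for your model/agent
        "model_patch": "diff --git a/x b/x\n...\n",  # unified diff as a string
    },
]

# JSONL format: one JSON object per line.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```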

```diff
-The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
+The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory:
 
 - `README.md`: a report showing which instances passed, failed, etc.
-- `logs/`: a directory of test logs
-- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
-- `summary.json`: a JSON file contains more fine-grained information for each test instance.
+- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
+- `eval_outputs/`: a directory of test logs
 
 ## Visualize Results
 
```
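Given the new `report.json` layout, checking which instances were resolved might look like the following sketch (the exact path and any keys beyond `"resolved_ids"` are assumptions based on the description above, not verified against the harness):

```python
import json
from pathlib import Path

# Output folder as described in the README above; adjust for your own run.
report_path = Path(
    "evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/"
    "gpt-4-1106-preview_maxiter_50_N_v1.0/report.json"
)

report = json.loads(report_path.read_text())

# Per the README, "resolved_ids" points to instance IDs resolved by the agent.
resolved_ids = report.get("resolved_ids", [])
print(f"Resolved {len(resolved_ids)} instances")
for instance_id in resolved_ids:
    print("  -", instance_id)
```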
