[Evaluation] Use the latest official SWE-Bench Dockerization for evaluation (OpenHands#2728)
* add newline after patch to fix patch apply
* new swebench wip
* add newline after patch to fix patch apply
* only add newline if not empty
* update swebench source and update
* update gitignore for swebench eval
* update old prep_eval
* update gitignore
* add scripts to push and pull swebench images
* update eval_infer.sh
* update eval_infer for new docker workflow
* update script to create markdown report based on report.json
* update eval_infer to use updated output
* update readme
* only move result to folder if running whole file
* remove set-x
* update conversion script
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* make sure last line ends with newline
* switch to a fix-attempt branch of swebench
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
---------
Co-authored-by: Engel Nyst <[email protected]>
evaluation/swe_bench/README.md: 15 additions & 7 deletions
@@ -2,6 +2,8 @@

This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.

+**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**

## Setup Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.

In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time we can directly leverage existing environments for efficient evaluation.

-**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
+**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:
@@ -124,16 +126,23 @@ After running the inference, you will obtain a `output.jsonl` (by default it will

With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

+**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**

If you want to evaluate existing results, you should first run this to clone existing outputs

-To prepare for swe-bench evaluation, you should pull the evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download the swe-bench data by running:

+If you have extra local disk space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:

If you want to save a bit of disk space (e.g., with ~50GB free), while speeding up the image pre-build process, you can pull the environment-level docker images:
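For context while reading the diff, a minimal `eval_infer.sh` invocation could look like the sketch below. It relies only on what this README states (the script takes the predictions file as its argument); the output path is illustrative and depends on your agent, model, and run configuration.

```bash
# Minimal sketch: evaluate an existing output.jsonl with eval_infer.sh.
# The path below is only an example run directory; point it at wherever
# your inference run actually wrote output.jsonl.
./evaluation/swe_bench/scripts/eval_infer.sh \
  evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```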
@@ -146,12 +155,11 @@ Then you can run the following:

PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.

-The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following the format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
+The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory:

- `README.md`: a report showing which instances passed, failed, etc.
-- `logs/`: a directory of test logs
-- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
-- `summary.json`: a JSON file that contains more fine-grained information for each test instance.
+- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
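As a quick way to inspect the result, the sketch below assumes only what is stated above: that `report.json` exposes a top-level `"resolved_ids"` list of resolved instance IDs (no other keys are relied on). It uses `jq`, which is not a dependency of the harness itself.

```bash
# Sketch: summarize report.json with jq (assumes a top-level "resolved_ids" array).
REPORT=evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/report.json
jq '.resolved_ids | length' "$REPORT"   # number of resolved instances
jq -r '.resolved_ids[]' "$REPORT"       # print each resolved instance ID
```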