
Commit 5ff9611

xingyaoww and huybery authored
A starting point for SWE-Bench Evaluation with docker (OpenHands#60)
* a starting point for SWE-Bench evaluation with docker
* fix the swe-bench uid issue
* typo fixed
* fix conda missing issue
* move files based on new PR
* Update doc and gitignore using devin prediction file from OpenHands#81
* fix typo
* add a sentence
* fix typo in path
* fix path

Co-authored-by: Binyuan Hui <[email protected]>
1 parent dc88dac commit 5ff9611

9 files changed

Lines changed: 172 additions & 2 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -160,6 +160,9 @@ cython_debug/
 .idea/
 .vscode/

+# evaluation
+evaluation/SWE-bench/data
+
 # frontend

 # dependencies
```

evaluation/README.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -19,4 +19,6 @@ all the preprocessing/evaluation/analysis scripts.
 - resources
   - Devin's outputs processed for evaluations is available on [Huggingface](https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output)
   - get predictions that passed the test: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
-  - get all predictions`wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
+  - get all predictions `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
+
+See [`SWE-bench/README.md`](./SWE-bench/README.md) for more details on how to run SWE-Bench for evaluation.
```

evaluation/SWE-bench/Dockerfile

Lines changed: 39 additions & 0 deletions
```dockerfile
FROM ubuntu:20.04

# https://github.com/princeton-nlp/SWE-bench/issues/15#issuecomment-1815392192
RUN apt-get update && \
    apt-get install -y bash gcc git jq wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN git config --global user.email "[email protected]"
RUN git config --global user.name "swebench"

RUN apt update && apt install -y build-essential

# Create new user
RUN useradd -ms /bin/bash swe-bench
USER swe-bench
WORKDIR /home/swe-bench

# Setup Conda
ENV PATH="/home/swe-bench/miniconda3/bin:${PATH}"
ARG PATH="/home/swe-bench/miniconda3/bin:${PATH}"
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh \
    && mkdir ~/.conda \
    && bash miniconda.sh -b \
    && rm -f miniconda.sh
RUN conda --version

# Setup SWE-Bench Env
COPY environment.yml .
RUN conda env create -f environment.yml

# Some missing packages
RUN pip install datasets python-dotenv gitpython

RUN conda init bash

CMD ["/bin/bash"]
# docker build -t opendevin/eval-swe-bench:v0.1 -f evaluation/swe-bench/Dockerfile evaluation/swe-bench/
# docker push opendevin/eval-swe-bench:v0.1
```

evaluation/SWE-bench/README.md

Lines changed: 79 additions & 0 deletions
# SWE-Bench Evaluation

Work in progress.

**TODOs**:

- [ ] Generate `predictions` files given an OpenDevin `Agent` implementation. We could borrow something from [Devin's eval-harness implementation](https://github.com/CognitionAI/devin-swebench-results/tree/main/harness), for example, [how to generate `TestSpec`](https://github.com/CognitionAI/devin-swebench-results/blob/main/harness/scripts.py#L150-L160).
- [ ] Make sure the evaluation suite runs on all repos. I only tested on `matplotlib` so far; `scikit-learn` does not work for now (see [this issue](https://github.com/princeton-nlp/SWE-bench/issues/57)).

## Run tests for a prediction file inside a docker container

Currently, the docker container should be able to run SWE-Bench. It was tested on `matplotlib`, but it requires further testing to make sure it works on other repositories. Currently, [it does not work for `scikit-learn`](https://github.com/princeton-nlp/SWE-bench/issues/57).

### Setup example data

```bash
cd evaluation/SWE-bench
./scripts/prepare_devin_swe_bench_data.sh

# Clone the repo
# This is a fork that fixes some issues that stop matplotlib from running (see https://github.com/princeton-nlp/SWE-bench/pull/56)
git clone https://github.com/xingyaoww/SWE-bench.git

# Enter the docker container
./scripts/run_docker_interactive.sh
```

### Run evaluation

```bash
#!/bin/bash
mkdir -p data/logs
mkdir -p data/testbeds

python SWE-bench/harness/run_evaluation.py \
    --predictions_path data/predictions/devin_swe_outputs.json \
    --swe_bench_tasks data/processed/swe-bench-test.json \
    --log_dir data/logs \
    --testbed data/testbeds \
    --skip_existing \
    --timeout 900 \
    --verbose
```

If successful, you will see command-line output similar to this:

```log
swe-bench@2f3a6b9fcab2:/swe-bench$ ./harness/run_evaluation.sh
/swe-bench/harness/run_evaluation.py:101: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(temp, datasets.arrow_dataset.Dataset)
2024-03-20 09:21:18,796 - INFO - Found 1 predictions across 1 model(s) in predictions file
2024-03-20 09:21:18,796 - INFO - [claude-2/matplotlib__matplotlib/3.6] # of predictions to evaluate: 1 (0 already evaluated)
2024-03-20 09:21:18,797 - INFO - [Testbed] Creating log directory /swe-bench/data/logs/claude-2
2024-03-20 09:21:18,797 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708
2024-03-20 09:21:18,797 - INFO - [Testbed] Using working directory /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23 for testbed
2024-03-20 09:21:18,797 - INFO - [Testbed] Repo matplotlib/matplotlib: 1 versions
2024-03-20 09:21:18,797 - INFO - [Testbed] Version 3.6: 1 instances
2024-03-20 09:21:18,797 - INFO - No conda path provided, creating temporary install in /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3...
2024-03-20 09:21:27,482 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3
2024-03-20 09:21:27,942 - INFO - [Testbed] Setting up testbed for matplotlib__matplotlib__3.6
2024-03-20 09:21:44,257 - INFO - [Testbed] Cloned matplotlib/matplotlib to /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6
2024-03-20 09:21:44,415 - INFO - [Testbed] Creating environment matplotlib__matplotlib__3.6; Command: /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/conda env create --file /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/environment.yml
2024-03-20 09:23:39,781 - INFO - [Testbed] Installing pip packages for matplotlib__matplotlib__3.6; Command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && pip install pytest
/swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6: 1 instances
2024-03-20 09:23:42,309 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Reset task environment to aca6e9d5e98811ca37c442217914b15e78127c89
2024-03-20 09:23:42,314 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred_try)
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Revert patch successful (pred_try)
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installing with command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && echo 'activate successful' && python -m pip install -e .
2024-03-20 09:24:54,966 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installation successful
2024-03-20 09:24:54,970 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (test)
2024-03-20 09:24:54,974 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred)
2024-03-20 09:25:04,775 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Test script run successful
swe-bench@2f3a6b9fcab2:/swe-bench$
```

### Interpret Results

Then you may find the results under `data/logs` and interpret them following [this guide](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-metrics).
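The metrics guide linked above scores each instance as resolved or not; a minimal sketch of aggregating such per-instance results into a resolution rate (the helper and its input shape are hypothetical conveniences, not part of the SWE-bench harness):

```python
def resolution_rate(results):
    """results: hypothetical dict mapping instance_id -> bool (resolved?)."""
    if not results:
        return 0.0
    # True counts as 1, False as 0, so this is resolved / total.
    return sum(results.values()) / len(results)

# Toy per-instance outcomes for two task instances.
sample = {
    "matplotlib__matplotlib-24362": True,
    "matplotlib__matplotlib-24970": False,
}
print(resolution_rate(sample))  # 0.5
```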
evaluation/SWE-bench/environment.yml

Lines changed: 15 additions & 0 deletions
```yaml
# FROM https://github.com/princeton-nlp/SWE-bench/blob/main/environment.yml
name: swe-bench
dependencies:
  - python=3.9
  - pip
  - pip:
      - beautifulsoup4
      - chardet
      - ghapi
      - GitPython
      - python-dotenv
      - requests
      - rich
      - transformers>=4.34.0
  - conda-forge::gh
```
evaluation/SWE-bench/scripts/download_test_data.py

Lines changed: 6 additions & 0 deletions
```python
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("princeton-nlp/SWE-bench")
test = dataset["test"].to_pandas()
test.to_json("data/processed/swe-bench-test.json", orient="records")
```
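`orient="records"` serializes the frame as a JSON array with one object per task instance. A small round-trip sketch, using a toy frame as a stand-in for the real `dataset["test"]` (the column names here are illustrative, not the full SWE-bench schema):

```python
import json

import pandas as pd

# Toy stand-in for dataset["test"].to_pandas(); the real frame holds
# full SWE-bench task instances.
test = pd.DataFrame([
    {"instance_id": "matplotlib__matplotlib-24362", "version": "3.6"},
    {"instance_id": "example__repo-1", "version": "1.0"},
])

# orient="records" produces a JSON array of row objects, the shape the
# file passed to --swe_bench_tasks has.
payload = test.to_json(orient="records")
records = json.loads(payload)
print(len(records), records[0]["instance_id"])
```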
evaluation/SWE-bench/scripts/prepare_devin_swe_bench_data.sh

Lines changed: 10 additions & 0 deletions
```bash
#!/bin/bash

set -xeo pipefail
mkdir -p data/processed
python3 scripts/download_test_data.py

# Download an example output file (FROM claude-2)
# https://gist.github.com/sorendunn/9f1f1fade59f986b4925b6633f9ff165
mkdir -p data/predictions
wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json -O data/predictions/devin_swe_outputs.json
```
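The downloaded predictions file is a JSON list of per-instance records. A hedged sanity-check sketch for such a file; the key names follow the SWE-bench predictions format (`model_name_or_path`, `instance_id`, `model_patch`) but should be verified against your harness version:

```python
# Required keys per the SWE-bench predictions format (an assumption here;
# confirm against the harness you are running).
REQUIRED_KEYS = {"model_name_or_path", "instance_id", "model_patch"}

def missing_keys(records):
    """Map record index -> set of required keys absent from that record."""
    return {
        i: REQUIRED_KEYS - rec.keys()
        for i, rec in enumerate(records)
        if not REQUIRED_KEYS <= rec.keys()
    }

# Toy records standing in for entries of devin_swe_outputs.json.
sample = [
    {
        "model_name_or_path": "devin",
        "instance_id": "matplotlib__matplotlib-24362",
        "model_patch": "diff --git a/lib/foo.py b/lib/foo.py\n...",
    },
    {"instance_id": "broken-record"},
]
print(missing_keys(sample))  # only index 1 is incomplete
```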
evaluation/SWE-bench/scripts/run_docker_interactive.sh

Lines changed: 14 additions & 0 deletions
```bash
#!/bin/bash

DOCKER_IMAGE=opendevin/eval-swe-bench:v0.1
WORK_DIR=`pwd`

docker run \
    -it \
    --rm \
    --user root \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $WORK_DIR:/swe-bench \
    -w /swe-bench \
    $DOCKER_IMAGE \
    /bin/bash -c "usermod -u $(id -u) swe-bench && su swe-bench"
```

requirements.txt

Lines changed: 3 additions & 1 deletion
```diff
@@ -1,2 +1,4 @@
+datasets
+pandas
 litellm
-termcolor
+termcolor
```
