|
| 1 | +# SWE-Bench Evaluation |
| 2 | + |
| 3 | +Work in progress. |
| 4 | + |
| 5 | +**TODOs**: |
| 6 | + |
| 7 | +- [ ] Generate `predictions` files given an OpenDevin `Agent` implementation. We could borrow something from [devin's eval-harness implementation](https://github.com/CognitionAI/devin-swebench-results/tree/main/harness), for example, [how to generate `TestSpec`](https://github.com/CognitionAI/devin-swebench-results/blob/main/harness/scripts.py#L150-L160). |
| 8 | +- [ ] Make sure the evaluation suite runs on all repos. So far it has only been tested on `matplotlib`; `scikit-learn` does not work for now (see [this issue](https://github.com/princeton-nlp/SWE-bench/issues/57)). |
| 9 | + |
| 10 | + |
| 11 | +## Run tests for a prediction file inside a docker container |
| 12 | + |
| 13 | +The docker container should currently be able to run SWE-Bench. It has been tested on `matplotlib`, but further testing is needed to confirm that it works on other repositories. For now, [it does not work for `scikit-learn`](https://github.com/princeton-nlp/SWE-bench/issues/57). |
| 14 | + |
| 15 | +### Setup example data |
| 16 | + |
| 17 | +```bash |
| 18 | +cd evaluation/SWE-bench |
| 19 | +./scripts/prepare_devin_swe_bench_data.sh |
| 20 | + |
| 21 | +# Clone the repo |
| 22 | +# This is a fork that fixes some issues that stop matplotlib from running (see https://github.com/princeton-nlp/SWE-bench/pull/56) |
| 23 | +git clone https://github.com/xingyaoww/SWE-bench.git |
| 24 | + |
| 25 | +# Enter the docker container |
| 26 | +./scripts/run_docker_interactive.sh |
| 27 | +``` |
| 28 | + |
| 29 | +### Run evaluation |
| 30 | + |
| 31 | +```bash |
| 32 | +#!/bin/bash |
| 33 | +mkdir -p data/logs |
| 34 | +mkdir -p data/testbeds |
| 35 | + |
| 36 | +python SWE-bench/harness/run_evaluation.py \ |
| 37 | + --predictions_path data/predictions/devin_swe_outputs.json \ |
| 38 | + --swe_bench_tasks data/processed/swe-bench-test.json \ |
| 39 | + --log_dir data/logs \ |
| 40 | + --testbed data/testbeds \ |
| 41 | + --skip_existing \ |
| 42 | + --timeout 900 \ |
| 43 | + --verbose |
| 44 | +``` |
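Before launching the command above, it can help to check the file passed to `--predictions_path`. The sketch below is not part of the harness. It assumes the predictions file is a JSON list of objects with the keys `instance_id`, `model_name_or_path`, and `model_patch`, as described in the upstream SWE-bench documentation; this README does not confirm that layout. The helper names `find_invalid_predictions` and `check_predictions_file` are hypothetical.

```python
import json
from pathlib import Path

# Assumed keys, taken from the upstream SWE-bench docs; not confirmed by this README.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def find_invalid_predictions(predictions):
    """Return (index, missing_keys) pairs for entries that lack a required key."""
    problems = []
    for index, entry in enumerate(predictions):
        if not isinstance(entry, dict):
            # A non-object entry is treated as missing every key.
            problems.append((index, list(REQUIRED_KEYS)))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append((index, missing))
    return problems


def check_predictions_file(path):
    """Load a JSON predictions file and report malformed entries."""
    predictions = json.loads(Path(path).read_text())
    return find_invalid_predictions(predictions)
```

For example, `check_predictions_file("data/predictions/devin_swe_outputs.json")` returns an empty list when every entry carries all three assumed keys.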
| 45 | + |
| 46 | +If the run succeeds, you will see command-line output similar to this: |
| 47 | + |
| 48 | +```log |
| 49 | +swe-bench@2f3a6b9fcab2:/swe-bench$ ./harness/run_evaluation.sh |
| 50 | +/swe-bench/harness/run_evaluation.py:101: SyntaxWarning: assertion is always true, perhaps remove parentheses? |
| 51 | + assert(temp, datasets.arrow_dataset.Dataset) |
| 52 | +2024-03-20 09:21:18,796 - INFO - Found 1 predictions across 1 model(s) in predictions file |
| 53 | +2024-03-20 09:21:18,796 - INFO - [claude-2/matplotlib__matplotlib/3.6] # of predictions to evaluate: 1 (0 already evaluated) |
| 54 | +2024-03-20 09:21:18,797 - INFO - [Testbed] Creating log directory /swe-bench/data/logs/claude-2 |
| 55 | +2024-03-20 09:21:18,797 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708 |
| 56 | +2024-03-20 09:21:18,797 - INFO - [Testbed] Using working directory /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23 for testbed |
| 57 | +2024-03-20 09:21:18,797 - INFO - [Testbed] Repo matplotlib/matplotlib: 1 versions |
| 58 | +2024-03-20 09:21:18,797 - INFO - [Testbed] Version 3.6: 1 instances |
| 59 | +2024-03-20 09:21:18,797 - INFO - No conda path provided, creating temporary install in /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3... |
| 60 | +2024-03-20 09:21:27,482 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3 |
| 61 | +2024-03-20 09:21:27,942 - INFO - [Testbed] Setting up testbed for matplotlib__matplotlib__3.6 |
| 62 | +2024-03-20 09:21:44,257 - INFO - [Testbed] Cloned matplotlib/matplotlib to /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6 |
| 63 | +2024-03-20 09:21:44,415 - INFO - [Testbed] Creating environment matplotlib__matplotlib__3.6; Command: /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/conda env create --file /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/environment.yml |
| 64 | +2024-03-20 09:23:39,781 - INFO - [Testbed] Installing pip packages for matplotlib__matplotlib__3.6; Command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && pip install pytest |
| 65 | +/swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6: 1 instances |
| 66 | +2024-03-20 09:23:42,309 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Reset task environment to aca6e9d5e98811ca37c442217914b15e78127c89 |
| 67 | +2024-03-20 09:23:42,314 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred_try) |
| 68 | +2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Revert patch successful (pred_try) |
| 69 | +2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installing with command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && echo 'activate successful' && python -m pip install -e . |
| 70 | +2024-03-20 09:24:54,966 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installation successful |
| 71 | +2024-03-20 09:24:54,970 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (test) |
| 72 | +2024-03-20 09:24:54,974 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred) |
| 73 | +2024-03-20 09:25:04,775 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Test script run successful |
| 74 | +swe-bench@2f3a6b9fcab2:/swe-bench$ |
| 75 | +``` |
| 76 | + |
| 77 | +### Interpret Results |
| 78 | + |
| 79 | +The results are written under `data/logs`; interpret them following [this guide](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-metrics). |
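As a quick sanity check before reading the per-instance logs, the console output shown earlier can be grouped by instance. The sketch below assumes the console output was saved to a text file and that per-instance lines follow the `timestamp - LEVEL - [environment] [instance_id] message` shape seen in the sample run. The names `summarize_console_log` and `reached_test_run` are hypothetical. Note that this only shows how far the harness got for each instance; whether an instance counts as resolved should still be read from `data/logs` using the linked guide.

```python
import re
from collections import defaultdict

# Matches per-instance lines such as:
# 2024-03-20 09:23:42,314 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred_try)
LINE_RE = re.compile(
    r"^\S+ \S+ - (?P<level>\w+) - "
    r"\[(?P<env>[^\]]+)\] \[(?P<instance>[^\]]+)\] (?P<message>.*)$"
)


def summarize_console_log(lines):
    """Group harness messages by instance id, in order of appearance."""
    messages = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # Lines without an [instance_id] field are skipped.
            messages[match.group("instance")].append(match.group("message"))
    return dict(messages)


def reached_test_run(messages):
    """True if the harness reported that the test script ran for this instance."""
    return any(m.startswith("Test script run successful") for m in messages)
```

With the console output saved to a file, `summarize_console_log(open(path))` yields a dictionary keyed by instance id, and `reached_test_run` can be applied to each value.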