## Introduction

This folder contains backend integration tests that rely on a mock LLM. It serves
two purposes:
1. Ensure the quality of development, including the OpenDevin framework and its agents.
2. Help contributors learn the OpenDevin workflow and see examples of real interactions
with a (powerful) LLM, without spending real money.

Why don't we run an open-source model, e.g. Llama 3, instead? There are two reasons:
1. LLMs cannot guarantee determinism, meaning the test behavior might change over time.
2. CI machines are not powerful enough to run any LLM that is sophisticated enough
to finish the tasks defined in the tests.

Note: integration tests are orthogonal to evaluations/benchmarks,
as they serve different purposes. Although benchmarks could also
catch bugs, some of which may not be caught by tests, benchmarks
require real LLMs, which are non-deterministic and costly.
We run the integration test suite for every single commit, which is
not feasible with benchmarks.

Known limitations:
1. To avoid the potential impact of non-determinism, we remove all special
characters and numbers (often used as PIDs) when comparing prompts. If two
prompts for the same task differ only in non-alphabetic characters, the wrong
mock response might be picked up.
2. The agent itself must not do anything non-deterministic, including
but not limited to using randomly generated numbers.
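
The comparison described in limitation 1 can be pictured with a short sketch. This is an illustration of the idea only; the actual matching logic lives in `conftest.py` and may differ:

```python
import re

def normalize(prompt: str) -> str:
    # Keep only ASCII letters, so that PIDs, timestamps, and punctuation
    # cannot break the match between a prompt and its mock response.
    return re.sub(r"[^a-zA-Z]", "", prompt)

# Two prompts that differ only in a PID normalize to the same key ...
assert normalize("kill process 1234") == normalize("kill process 9876")
# ... which also means two unrelated prompts could collide if they
# differ only in non-alphabetic characters (limitation 1 above).
```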

The folder is organised as follows:

```
├── README.md
├── conftest.py
├── mock
│   ├── [AgentName]
│   │   └── [TestName]
│   │       ├── prompt_*.log
│   │       └── response_*.log
└── [TestFiles].py
```

where `conftest.py` defines the infrastructure needed to load real-world LLM prompts
and responses for mocking purposes. Prompts and responses generated during real runs
of agents with real LLMs are stored under the `mock/<AgentName>/<TestName>` folders.
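
As a rough sketch of what such a loader could look like (the real implementation in `conftest.py` may pair files differently), each `prompt_*.log` is matched to the `response_*.log` that shares its suffix:

```python
from pathlib import Path

def load_mock_responses(mock_dir: str) -> dict:
    """Map each recorded prompt to its recorded response.

    Sketch only -- assumes prompt_001.log pairs with response_001.log, etc.
    """
    mapping = {}
    root = Path(mock_dir)
    for prompt_file in sorted(root.glob("prompt_*.log")):
        suffix = prompt_file.name[len("prompt_"):]
        response_file = root / f"response_{suffix}"
        if response_file.exists():
            mapping[prompt_file.read_text()] = response_file.read_text()
    return mapping
```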

## Run Integration Tests

Take a look at `run-integration-tests.yml` to learn how integration tests are
launched in the CI environment. Assuming you want to use `workspace` for testing, an
example is as follows:

```bash
rm -rf workspace; AGENT=PlannerAgent \
WORKSPACE_BASE="/Users/admin/OpenDevin/workspace" \
WORKSPACE_MOUNT_PATH="/Users/admin/OpenDevin/workspace" \
MAX_ITERATIONS=10 \
poetry run pytest -s ./tests/integration
```

Note: in order to run integration tests correctly, please ensure your workspace is empty.


## Write Integration Tests

To write an integration test, there are essentially two steps:

1. Decide your task prompt, and the result you want to verify.
2. Either construct the LLM responses by hand, or run OpenDevin with a real LLM. The system prompts and
LLM responses are recorded as logs, which you can then copy to the test folder.
The following paragraphs describe the process.

Your `config.toml` should look like this:

```toml
LLM_MODEL="gpt-4-turbo"
LLM_API_KEY="<your-api-key>"
LLM_EMBEDDING_MODEL="openai"
WORKSPACE_MOUNT_PATH="<absolute-path-of-your-workspace>"
```

You can choose any model you'd like to generate the mock responses.
You can even handcraft the mock responses yourself, which is especially useful
when you want to test the agent's behaviour in corner cases. Note that if you
use a very weak model (e.g. one with 8B parameters), chances are that most
agents won't be able to finish the task.

```bash
# Remove old logs if you are okay with losing them. This helps you locate the
# new prompts and responses quickly, but is NOT a must.
rm -rf logs
# Clear the workspace, otherwise OpenDevin might not be able to reproduce your
# prompts in the CI environment. Feel free to change the workspace name and
# path, but be sure to set `WORKSPACE_MOUNT_PATH` to the same absolute path.
rm -rf workspace
mkdir workspace
# Depending on the complexity of the task you want to test, you can change the
# iteration limit (-i) and the agent (-c). If you are adding a new test, try
# generating mock responses for every agent.
poetry run python ./opendevin/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
```

After running the above commands, you should be able to locate the real prompts
and responses that were logged. The log folder name follows the
`logs/llm/%y-%m-%d_%H-%M` format.

Now, move all files under that folder to the
`tests/integration/mock/<AgentName>/<TestName>` folder. For example, move all
files from the `logs/llm/24-04-23_21-55/` folder to the
`tests/integration/mock/MonologueAgent/test_write_simple_script` folder.
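
If you do this often, a small helper can automate the copy. The following is a hypothetical convenience function, not something shipped with the repository; all names and paths are parameters for illustration:

```python
import shutil
from pathlib import Path

def install_mock_logs(log_dir: str, mock_root: str,
                      agent: str, test_name: str) -> int:
    """Copy every .log file from a real run into the mock folder.

    Hypothetical helper for illustration only.
    """
    dest = Path(mock_root) / agent / test_name
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for log_file in sorted(Path(log_dir).glob("*.log")):
        shutil.copy2(log_file, dest / log_file.name)
        copied += 1
    return copied
```

For the example above, this would be called as `install_mock_logs("logs/llm/24-04-23_21-55", "tests/integration/mock", "MonologueAgent", "test_write_simple_script")`.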

That's it, you are good to go! When you launch an integration test, the mock
responses are loaded and used in place of a real LLM, so that we get
deterministic and consistent behavior, and, most importantly, without spending
real money.