
Commit e7b5ddf

Add integration test framework with mock llm (OpenHands#1301)
* Add integration test framework with mock llm
* Fix MonologueAgent and PlannerAgent tests
* Remove adhoc logging
* Use existing logs
* Fix SWEAgent and PlannerAgent
* Check-in test log files
* conftest: look up under test name folder only
* Add docstring to conftest
* Finish dev doc
* Avoid non-determinism
* Remove dependency on llm embedding model
* Init embedding model only for MonologueAgent
* Add adhoc fix for sandbox discrepancy
* Test ssh and exec sandboxes
* CI: fix missing sandbox type
* conftest: Remove hack
* Reword comment for TODO
1 parent bf5a2af commit e7b5ddf

48 files changed

Lines changed: 4053 additions & 30 deletions

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

```yaml
name: Run Integration Tests

on: [push, pull_request]

jobs:
  on-linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - name: SWEAgent-py311-ssh
            python-version: "3.11"
            agent: "SWEAgent"
            embedding-model: "none"
            sandbox: "ssh"
          - name: PlannerAgent-py311-ssh
            python-version: "3.11"
            agent: "PlannerAgent"
            embedding-model: "none"
            sandbox: "ssh"
          - name: MonologueAgent-py311-ssh
            python-version: "3.11"
            agent: "MonologueAgent"
            embedding-model: "local"
            sandbox: "ssh"
          - name: CodeActAgent-py311-ssh
            python-version: "3.11"
            agent: "CodeActAgent"
            embedding-model: "none"
            sandbox: "ssh"
          - name: SWEAgent-py311-exec
            python-version: "3.11"
            agent: "SWEAgent"
            embedding-model: "none"
            sandbox: "exec"
          - name: PlannerAgent-py311-exec
            python-version: "3.11"
            agent: "PlannerAgent"
            embedding-model: "none"
            sandbox: "exec"
          - name: MonologueAgent-py311-exec
            python-version: "3.11"
            agent: "MonologueAgent"
            embedding-model: "local"
            sandbox: "exec"
          - name: CodeActAgent-py311-exec
            python-version: "3.11"
            agent: "CodeActAgent"
            embedding-model: "none"
            sandbox: "exec"
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Poetry
        run: curl -sSL https://install.python-poetry.org | python3 -
      - name: Build Environment
        run: make build
      - name: Run Integration Tests
        env:
          SANDBOX_TYPE: ${{ matrix.sandbox }}
          AGENT: ${{ matrix.agent }}
          MAX_ITERATIONS: 10
          LLM_EMBEDDING_MODEL: ${{ matrix.embedding-model }}
        run: |
          rm -rf workspace
          mkdir workspace
          WORKSPACE_BASE="$GITHUB_WORKSPACE/workspace" WORKSPACE_MOUNT_PATH="$GITHUB_WORKSPACE/workspace" poetry run pytest -s ./tests/integration
```

.github/workflows/run-tests.yml

Lines changed: 0 additions & 20 deletions
This file was deleted.
Lines changed: 3 additions & 3 deletions
```diff
@@ -1,4 +1,4 @@
-name: Build & Run Tests
+name: Run Unit Tests
 
 on: [push, pull_request]
 
@@ -26,7 +26,7 @@ jobs:
       - name: Build Environment
         run: make build
       - name: Run Tests
-        run: poetry run pytest ./tests
+        run: poetry run pytest ./tests/unit
   on-linux:
     runs-on: ubuntu-latest
     strategy:
@@ -44,4 +44,4 @@ jobs:
       - name: Build Environment
         run: make build
       - name: Run Tests
-        run: poetry run pytest ./tests
+        run: poetry run pytest ./tests/unit
```

.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -57,7 +57,6 @@ cover/
 *.pot
 
 # Django stuff:
-*.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
```

CONTRIBUTING.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -85,5 +85,6 @@ Please refer to the README in each module:
 - [mock server](./opendevin/mock/README.md)
 
 ## Tests
-TODO: make sure code pass the test before submit.
+Please navigate to `tests` folder to see existing test suites.
+At the moment, we have two kinds of tests: `unit` and `integration`. Please refer to the README for each test suite. These tests also run on CI to ensure quality of
+the project.
```

agenthub/monologue_agent/utils/memory.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -73,6 +73,11 @@ def wrapper_get_embeddings(*args, **kwargs):
             azure_endpoint=config.get('LLM_BASE_URL', required=True),
             api_version=config.get('LLM_API_VERSION', required=True),
         )
+    elif (embedding_strategy is not None) and (embedding_strategy.lower() == 'none'):
+        # TODO: this works but is not elegant enough. The incentive is when
+        # monologue agent is not used, there is no reason we need to initialize an
+        # embedding model
+        embed_model = None
     else:
         from llama_index.embeddings.huggingface import HuggingFaceEmbedding
         embed_model = HuggingFaceEmbedding(
```
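The intent of the diff above is that setting `LLM_EMBEDDING_MODEL` to `none` skips embedding-model initialization entirely for agents that never use it. A minimal sketch of that selection logic, with hypothetical names and return values standing in for the real llama_index model objects:

```python
def resolve_embedding_model(strategy):
    """Illustrative sketch (not the actual OpenDevin code): pick an embedding
    backend from the LLM_EMBEDDING_MODEL setting."""
    if strategy is not None and strategy.lower() == 'none':
        # SWEAgent/PlannerAgent runs: no embedding model is needed at all.
        return None
    if strategy in ('openai', 'azureopenai'):
        return strategy  # stand-in for the corresponding API-backed model
    # Default: stand-in for a local HuggingFace embedding model.
    return 'local-huggingface'
```

With this shape, only MonologueAgent test configurations (which set the strategy to `local`) pay the cost of loading a model.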

opendevin/config.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -95,7 +95,7 @@ def get_parser():
     parser.add_argument(
         '-c',
         '--agent-cls',
-        default='MonologueAgent',
+        default=config.get(ConfigType.AGENT),
         type=str,
         help='The agent class to use',
     )
```
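The change means the `-c` default now follows the configured `AGENT` value instead of being hard-coded. The pattern looks roughly like this (a simplified sketch: `CONFIG` is a hypothetical stand-in for opendevin's config lookup):

```python
import argparse

# Hypothetical stand-in for opendevin's config store.
CONFIG = {'AGENT': 'MonologueAgent'}

parser = argparse.ArgumentParser()
parser.add_argument(
    '-c', '--agent-cls',
    default=CONFIG['AGENT'],  # previously hard-coded to 'MonologueAgent'
    type=str,
    help='The agent class to use',
)
args = parser.parse_args([])  # no -c given: falls back to the config value
```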

opendevin/llm/llm.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,4 +1,3 @@
-
 from litellm import completion as litellm_completion
 from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential
 from litellm.exceptions import APIConnectionError, RateLimitError, ServiceUnavailableError
```

opendevin/main.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -20,11 +20,13 @@ def read_task_from_stdin() -> str:
     return sys.stdin.read()
 
 
-async def main():
+async def main(task_str: str = ''):
     """Main coroutine to run the agent controller with task input flexibility."""
 
     # Determine the task source
-    if args.file:
+    if task_str:
+        task = task_str
+    elif args.file:
         task = read_task_from_file(args.file)
     elif args.task:
         task = args.task
```
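This signature change lets tests drive the entry point programmatically instead of going through CLI parsing. A hedged sketch of the calling pattern (a simplified stand-in for `opendevin/main.py`, not the real function body):

```python
import asyncio

async def main(task_str: str = ''):
    # Simplified stand-in: an explicit task string takes priority over
    # --file/--task command-line arguments.
    task = task_str if task_str else '<task resolved from CLI args>'
    return task

# An integration test can now pass the task directly:
result = asyncio.run(main("Write a shell script 'hello.sh' that prints 'hello'."))
```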

tests/integration/README.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@

## Introduction

This folder contains backend integration tests that rely on a mock LLM. It serves
two purposes:

1. Ensure the quality of development, including the OpenDevin framework and agents.
2. Help contributors learn the workflow of OpenDevin and see examples of real
interactions with a (powerful) LLM, without spending real money.

Why don't we launch an open-source model, e.g. LLAMA3? There are two reasons:

1. LLMs cannot guarantee determinism, meaning the test behavior might change.
2. CI machines are not powerful enough to run any LLM that is sophisticated enough
to finish the tasks defined in the tests.

Note: integration tests are orthogonal to evaluations/benchmarks, as they serve
different purposes. Although benchmarks could also capture bugs, some of which may
not be caught by tests, benchmarks require real LLMs, which are non-deterministic
and costly. We run the integration test suite for every single commit, which is
not possible with benchmarks.

Known limitations:

1. To avoid the potential impact of non-determinism, we remove all special
characters and numbers (often used as PIDs) when doing the comparison. If two
prompts for the same task differ only in non-alpha characters, the wrong mock
response might be picked up.
2. The agent itself must not do anything non-deterministic, including but not
limited to using randomly generated numbers.
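The alpha-only comparison described in limitation 1 can be sketched as follows (`normalize_prompt` is a hypothetical name; the real comparison lives in `conftest.py`):

```python
import re

def normalize_prompt(text: str) -> str:
    # Keep letters only, so PIDs, timestamps, and punctuation cannot make a
    # live prompt miss its recorded counterpart.
    return re.sub(r'[^a-zA-Z]', '', text)

# Two prompts that differ only in a PID compare equal:
normalize_prompt('Process 1234 started.') == normalize_prompt('Process 99 started.')  # True
```

The flip side, as noted above, is that two genuinely different prompts that differ only in non-alpha characters would also collide.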

The folder is organised as follows:

```
├── README.md
├── conftest.py
├── mock
│   ├── [AgentName]
│   │   └── [TestName]
│   │       ├── prompt_*.log
│   │       ├── response_*.log
└── [TestFiles].py
```

where `conftest.py` defines the infrastructure needed to load real-world LLM prompts
and responses for mocking purposes. Prompts and responses generated during real runs
of agents with real LLMs are stored under the `mock/AgentName/TestName` folders.
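A minimal sketch of what such a mock might look like (hypothetical class and names; the real machinery in `conftest.py` loads `prompt_*.log`/`response_*.log` pairs from disk and patches the LLM):

```python
import re

def _norm(text: str) -> str:
    # Letters-only comparison, matching the known limitation described above.
    return re.sub(r'[^a-zA-Z]', '', text)

class MockLLM:
    """Answer each live prompt with the recorded response whose normalized
    prompt text matches."""

    def __init__(self, recorded):
        # recorded: {prompt_text: response_text}, e.g. loaded from log files.
        self._responses = {_norm(p): r for p, r in recorded.items()}

    def completion(self, prompt: str) -> str:
        return self._responses[_norm(prompt)]

mock = MockLLM({'Write hello.sh (pid 1234)': 'echo hello > hello.sh'})
# A later run with a different pid still finds the recorded response:
mock.completion('Write hello.sh (pid 42)')  # -> 'echo hello > hello.sh'
```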
## Run Integration Tests

Take a look at `run-integration-tests.yml` to learn how integration tests are
launched in the CI environment. Assuming you want to use `workspace` for testing,
an example is as follows:

```bash
rm -rf workspace; AGENT=PlannerAgent \
WORKSPACE_BASE="/Users/admin/OpenDevin/workspace" WORKSPACE_MOUNT_PATH="/Users/admin/OpenDevin/workspace" MAX_ITERATIONS=10 \
poetry run pytest -s ./tests/integration
```

Note: in order to run integration tests correctly, please ensure your workspace is empty.

## Write Integration Tests

To write an integration test, there are essentially two steps:

1. Decide on your task prompt and the result you want to verify.
2. Either construct the LLM responses by hand, or run OpenDevin with a real LLM. The
system prompts and LLM responses are recorded as logs, which you can then copy to the
test folder.

The following paragraphs describe how to do this.

Your `config.toml` should look like this:

```toml
LLM_MODEL="gpt-4-turbo"
LLM_API_KEY="<your-api-key>"
LLM_EMBEDDING_MODEL="openai"
WORKSPACE_MOUNT_PATH="<absolute-path-of-your-workspace>"
```

You can choose any model you'd like to generate the mock responses. You can even
handcraft mock responses, which is especially useful when you want to test an
agent's behaviour in corner cases. If you use a very weak model (e.g. one with 8B
parameters), chances are most agents won't be able to finish the task.

```bash
# Remove logs if you are okay with losing them. This helps us locate the prompts
# and responses quickly, but is NOT a must.
rm -rf logs
# Clear the workspace, otherwise OpenDevin might not be able to reproduce your
# prompts in the CI environment. Feel free to change the workspace name and path.
# Be sure to set `WORKSPACE_MOUNT_PATH` to the same absolute path.
rm -rf workspace
mkdir workspace
# Depending on the complexity of the task you want to test, you can change the
# iteration limit. Change the agent accordingly. If you are adding a new test,
# try generating mock responses for every agent.
poetry run python ./opendevin/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
```

After running the above commands, you should be able to locate the logged real
prompts and responses. The log folder follows the `logs/llm/%y-%m-%d_%H-%M.log`
format.

Now, move all files under that folder to the
`tests/integration/mock/<AgentName>/<TestName>` folder. For example, move all files
from the `logs/llm/24-04-23_21-55/` folder to the
`tests/integration/mock/MonologueAgent/test_write_simple_script` folder.

That's it, you are good to go! When you launch an integration test, the mock
responses are loaded and used in place of a real LLM, so we get deterministic and
consistent behavior, and most importantly, without spending real money.