AweAgent provides two core capabilities:
- Pluggable Agent Scaffolds: a modular agent loop with extensible tools (bash, editor, search, think), pluggable LLM backends (OpenAI, Azure, Ark, SGLang), and configurable context management.
- Reproducible Evaluation: Docker-isolated execution, built-in evaluators, a batch runner with concurrent execution, and structured result/trajectory output.
AweAgent currently ships with support for the ScaleSWE (training data), BeyondSWE, and Terminal Bench 2.0 benchmarks.
- [2026-03-15] Terminus-2 agent scaffold with Terminal Bench 2.0 benchmark support
- [2026-03-11] Synced codebase with internal version for BeyondSWE + SearchSWE
- [2026-03-01] Initial release: SearchSWE agent scaffold with BeyondSWE & ScaleSWE support
```
awe_agent/
  core/                    # Framework internals
    agent/                 # Agent loop, context, trajectory, protocol
    condenser/             # Context window management
    config/                # YAML config loading & schema
    eval/                  # Evaluation (PatchTestEvaluator, isolation)
    llm/                   # LLM backends + tool-call formatting
      format/              # Three modes: openai_function, codeact_xml, terminus_json
    runtime/               # Container runtimes (Docker)
    task/                  # Task protocol, TaskRunner (unified batch engine)
    tool/                  # Tool registry (bash, editor, search, think, finish)
  scaffold/                # Agent implementations
    search_swe/            # SearchSWE agent with optional web search
    terminus_2/            # Terminus-2 agent (tmux + JSON keystrokes)
  tasks/                   # Benchmark-specific task & evaluator
    beyond_swe/            # BeyondSWE
    scale_swe/             # ScaleSWE
    terminal_bench_v2/     # Terminal Bench 2.0
configs/                   # YAML configurations (LLM, task, runtime)
recipes/                   # Reproducible entry points
  beyond_swe/              # BeyondSWE runner
  scale_swe/               # ScaleSWE runner
  terminal_bench_v2/       # Terminal Bench 2.0 runner
```
The built-in SearchSWE agent scaffold (`awe_agent/scaffold/search_swe/`) is a modular agent loop that can operate in two modes; switch between them with a single config flag:
| Mode | `enable_search` | `tool_call_format` | Description |
|---|---|---|---|
| SearchSWE | `true` | `openai_function` | Full tool set including web search & link summary |
| OpenHands-style | `false` | `codeact_xml` | CodeAct XML format, compatible with OpenHands agent behavior |
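As an illustration of the single-flag switch, a config fragment might look like the following (the exact key paths are assumptions; only the two flag names come from the table above):

```yaml
# Hypothetical config fragment: key nesting may differ in the real configs/
agent:
  enable_search: false            # OpenHands-style mode
  tool_call_format: codeact_xml   # use openai_function for SearchSWE mode
```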
Tool Blocks. The agent composes its tool set from independent, self-contained tool blocks. Each block implements a unified Tool protocol (name, JSON Schema parameters, async execute) and is registered via a plugin registry with entry-point discovery:
| Tool | Name | Description |
|---|---|---|
| Bash | `execute_bash` | Persistent shell session inside Docker with output truncation, timeout control, and a regex-based command blocklist |
| Editor | `str_replace_editor` | File viewer/editor with `view`, `create`, `str_replace`, and `insert` sub-commands |
| Search | `search` | Web search with anti-leak filtering (auto-blocks target repo URLs); only active when `enable_search: true` |
| Link Summary | `link_summary` | Fetches a URL and summarizes its content via a dedicated LLM; only active when `enable_search: true` |
| Think | `think` | Reasoning scratchpad with no environment side effects; helps the agent plan before acting |
| Finish | `finish` | Signals task completion and triggers evaluation |
Adding a custom tool is as simple as implementing the Tool protocol and registering it via a Python entry-point; no changes to the agent loop are required.
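As a minimal sketch of that pattern (the class shape and the entry-point group name are assumptions based on the description above, not the actual AweAgent API):

```python
# Hypothetical custom tool following the protocol described above:
# a name, JSON Schema parameters, and an async execute method.
import asyncio


class WordCountTool:
    """Example custom tool: counts the words in a string."""

    name = "word_count"
    # JSON Schema describing the tool's parameters
    parameters = {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    }

    async def execute(self, text: str) -> str:
        # Tools return their observation as a string for the LLM
        return str(len(text.split()))


# Registration would then go in pyproject.toml; the entry-point group
# name below is an assumption:
#
# [project.entry-points."awe_agent.tools"]
# word_count = "my_pkg.tools:WordCountTool"

if __name__ == "__main__":
    print(asyncio.run(WordCountTool().execute("hello brave new world")))  # 4
```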
The Terminus-2 agent scaffold (`awe_agent/scaffold/terminus_2/`) is a terminal agent that uses tmux for persistent shell sessions. The LLM outputs raw JSON with keystrokes; the framework translates this into actions via TerminusJSONFormat (the third ToolCallFormat alongside openai_function and codeact_xml). This allows Terminus-2 to run inside the standard AgentLoop, inheriting RL training, context condensing, stats tracking, and step callbacks, all without changing the LLM-facing prompt.
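The actual JSON keystroke schema and the TerminusJSONFormat API are not documented here; as a purely illustrative sketch of the translation idea, a keystroke payload could be turned into tmux commands like this:

```python
# Purely illustrative: the payload shape {"keystrokes": ..., "enter": ...}
# and the session name are made up, not Terminus-2's real schema.
import json
import shlex


def keystrokes_to_tmux(payload: str, session: str = "agent") -> list[str]:
    """Translate a hypothetical keystroke JSON payload into tmux
    send-keys commands for a persistent session."""
    action = json.loads(payload)
    cmds = [f"tmux send-keys -t {session} {shlex.quote(action['keystrokes'])}"]
    if action.get("enter", True):
        cmds.append(f"tmux send-keys -t {session} Enter")
    return cmds


print(keystrokes_to_tmux('{"keystrokes": "ls -la", "enter": true}'))
```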
It is designed for Terminal Bench 2.0 evaluation and aligned with the official Harbor framework. Evaluation runs in the same container as the agent (no patch is produced: the agent modifies container state directly).
SWE-bench style (BeyondSWE / ScaleSWE). After the agent finishes, evaluation runs in a separate Docker container to ensure a clean, tamper-proof environment:
- Check out the base commit in a fresh container
- Apply the agent-generated patch (6 auto-fallback strategies)
- Restore original test files (prevents the agent from gaming tests)
- Run fail-to-pass & pass-to-pass test suites via an injected pytest runner
- Report structured results (score, pass/fail details, trajectory)
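The steps above can be sketched as a small pipeline. This is an illustration of the flow, not the real PatchTestEvaluator API; the container methods and result shape are assumptions:

```python
# Illustrative sketch of the SWE-bench-style evaluation flow described
# above; helper names are hypothetical, not AweAgent's actual API.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    score: float
    details: dict = field(default_factory=dict)


def evaluate_patch(container, base_commit: str, patch: str,
                   fail_to_pass: list[str],
                   pass_to_pass: list[str]) -> EvalResult:
    container.checkout(base_commit)        # 1. fresh checkout in a clean container
    container.apply_patch(patch)           # 2. apply the agent patch (with fallbacks)
    container.restore_test_files()         # 3. undo any tampering with tests
    f2p = {t: container.run_test(t) for t in fail_to_pass}   # 4. run both suites
    p2p = {t: container.run_test(t) for t in pass_to_pass}
    passed = all(f2p.values()) and all(p2p.values())
    return EvalResult(score=1.0 if passed else 0.0,          # 5. structured report
                      details={"fail_to_pass": f2p, "pass_to_pass": p2p})
```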
Terminal Bench 2.0. Evaluation runs in the same container: the evaluator uploads test scripts, executes `bash /tests/test.sh`, and reads the reward file. This behavior is controlled by the `Evaluator.requires_same_session` protocol property.
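A minimal sketch of how such a protocol property could look; only `requires_same_session` is taken from the description above, the surrounding class shapes are assumptions:

```python
# Hypothetical sketch of the Evaluator protocol property discussed above.
from typing import Protocol


class Evaluator(Protocol):
    @property
    def requires_same_session(self) -> bool:
        """True: evaluate inside the agent's own container (Terminal Bench 2.0).
        False: evaluate in a separate, clean container (SWE-bench style)."""
        ...


class TerminalBenchEvaluator:
    @property
    def requires_same_session(self) -> bool:
        return True  # test.sh runs in the same container the agent modified


class PatchTestEvaluator:
    @property
    def requires_same_session(self) -> bool:
        return False  # the patch is re-applied in a fresh container
```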
Install with uv:

```shell
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
uv venv --python 3.11
uv pip install -e ".[dev]"
```

Or with pip:

```shell
git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
pip install -e ".[dev]"
```

Why editable install? AweAgent uses entry-points for plugin discovery. Without `-e`, the plugin registry cannot find LLM backends, agents, or tools.
| Benchmark | Description | Agent | Dataset | Guide |
|---|---|---|---|---|
| BeyondSWE | Doc2Repo, CrossRepo, DepMigrate, DomainFix | SearchSWE (with web search) | Hugging Face | README |
| ScaleSWE | Large-scale SWE-bench-style training datasets (20k instances) | SearchSWE (CodeAct XML) | Hugging Face | README |
| Terminal Bench 2.0 | Terminal tasks in containerized environments | Terminus-2 | GitHub (commit `69671fbaac6d67a7ef0dfec016cc38a64ef7a77c`) | README |
Download the datasets:

```python
# BeyondSWE
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AweAI-Team/BeyondSWE",
    repo_type="dataset",
    local_dir="<your_path>/BeyondSWE",
)
```

```shell
# ScaleSWE: see the Hugging Face collection for available splits
# https://huggingface.co/datasets/AweAI-Team/Scale-SWE

# Terminal Bench 2.0: clone and check out the pinned commit for reproducibility
git clone https://github.com/laude-institute/terminal-bench-2.git
cd terminal-bench-2
git checkout 69671fbaac6d67a7ef0dfec016cc38a64ef7a77c
# Each task folder contains instruction.md, task.toml, environment/, tests/
```

Then run a benchmark:

```shell
# Configure LLM
export OPENAI_API_KEY="sk-..."

# List instances (no Docker needed)
python recipes/beyond_swe/run.py \
    --data-file /path/to/beyondswe.jsonl --mode dry-run

# Batch run
python recipes/beyond_swe/run.py \
    --data-file /path/to/beyondswe.jsonl --mode batch
```

See each benchmark's guide above for full setup, CLI arguments, and output format.
Configs are YAML files with environment variable substitution (`${VAR}`, `${VAR:-default}`) and `!include` support.
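For illustration, a config using both features might look like this (file and key names here are made up, not taken from the shipped configs):

```yaml
# Hypothetical config fragment showing ${VAR}, ${VAR:-default}, and !include
llm: !include llm/openai.yaml
api_key: ${OPENAI_API_KEY}
model: ${MODEL_NAME:-gpt-4o}   # falls back to gpt-4o if MODEL_NAME is unset
```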
| Backend | Config File | Required Env Vars |
|---|---|---|
| OpenAI | `configs/llm/openai.yaml` | `OPENAI_API_KEY` |
| Azure OpenAI | `configs/llm/azure.yaml` | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` |
| Volcengine Ark | `configs/llm/ark.yaml` | `ARK_API_KEY`, `ARK_MODEL_ID` |
| SGLang | `configs/llm/sglang.yaml` | (self-hosted endpoint) |
Copy `.env.example` to `.env` and fill in your values:

```shell
cp .env.example .env
```

The `.env.example` is organized into three sections:

- LLM Backend (pick one): set the API key and endpoint for your chosen backend:

  ```shell
  # OpenAI
  OPENAI_API_KEY=sk-...

  # Or Azure OpenAI
  AZURE_OPENAI_API_KEY=your-key
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
  ```

- Task Data: path to your benchmark JSONL file:

  ```shell
  DATA_FILE=/path/to/data.jsonl
  ```

- Search Tools (optional, BeyondSWE search mode only): required when running with `enable_search: true`:

  ```shell
  SERPAPI_API_KEY=your-serpapi-key
  JINA_API_KEY=your-jina-key   # optional, 20 RPM free without key
  ```
See each benchmark guide for the full list of variables.
Our long-term goal is to build practical, general-purpose agents and optimize them with reinforcement learning.
Agent Scaffolds & Capabilities
- SearchSWE: coding agent with optional web search augmentation
- Terminus-2: terminal agent for Terminal Bench 2.0
- Deep Research agent
Evaluation & Optimization
- BeyondSWE & ScaleSWE benchmark support
- Terminal Bench 2.0 benchmark support
- More benchmarks: wider task coverage and domain diversity
- Agentic RL: scalable reinforcement learning infrastructure for agent optimization
This project is released under the Apache-2.0 License.
For any questions or feedback, please reach out to us at [email protected].