AweAgent

A general-purpose agent framework with pluggable scaffolds and reproducible evaluation.

Python 3.11+ | License: Apache-2.0

AweAgent provides two core capabilities:

  • Pluggable Agent Scaffolds: a modular agent loop with extensible tools (bash, editor, search, think), pluggable LLM backends (OpenAI, Azure, Ark, SGLang), and configurable context management.
  • Reproducible Evaluation: Docker-isolated execution, built-in evaluators, a batch runner with concurrent execution, and structured result/trajectory output.

AweAgent currently ships with support for the ScaleSWE (training data), BeyondSWE, and Terminal Bench 2.0 benchmarks.

📰 News

  • [2026-03-15] 🎉 Terminus-2 agent scaffold with Terminal Bench 2.0 benchmark support
  • [2026-03-11] 🎉 Synced codebase with the internal version for BeyondSWE + SearchSWE
  • [2026-03-01] 🎉 Initial release: SearchSWE agent scaffold with BeyondSWE & ScaleSWE support

πŸ—οΈ Architecture

```
awe_agent/
  core/              # Framework internals
    agent/           #   Agent loop, context, trajectory, protocol
    condenser/       #   Context window management
    config/          #   YAML config loading & schema
    eval/            #   Evaluation (PatchTestEvaluator, isolation)
    llm/             #   LLM backends + tool-call formatting
      format/        #     Three modes: openai_function, codeact_xml, terminus_json
    runtime/         #   Container runtimes (Docker)
    task/            #   Task protocol, TaskRunner (unified batch engine)
    tool/            #   Tool registry (bash, editor, search, think, finish)
  scaffold/          # Agent implementations
    search_swe/      #   SearchSWE agent with optional web search
    terminus_2/      #   Terminus-2 agent (tmux + JSON keystrokes)
  tasks/             # Benchmark-specific tasks & evaluators
    beyond_swe/      #   BeyondSWE
    scale_swe/       #   ScaleSWE
    terminal_bench_v2/  # Terminal Bench 2.0

configs/             # YAML configurations (LLM, task, runtime)
recipes/             # Reproducible entry points
  beyond_swe/        #   BeyondSWE runner
  scale_swe/         #   ScaleSWE runner
  terminal_bench_v2/ #   Terminal Bench 2.0 runner
```

🧩 Supported Scaffolds

SearchSWE

The built-in SearchSWE agent scaffold (awe_agent/scaffold/search_swe/) is a modular agent loop that can operate in two modes; switch between them with a single config flag:

| Mode | enable_search | tool_call_format | Description |
|------|---------------|------------------|-------------|
| SearchSWE | true | openai_function | Full tool set including web search & link summary |
| OpenHands-style | false | codeact_xml | CodeAct XML format, compatible with OpenHands agent behavior |
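The mode switch might look like the following config fragment. This is a hypothetical sketch: the key names `enable_search` and `tool_call_format` come from the table above, but the nesting and file layout are assumptions, not the repo's actual schema.

```yaml
# Hypothetical agent config; see configs/ for the real schema.
agent:
  enable_search: true                 # false -> OpenHands-style CodeAct mode
  tool_call_format: openai_function   # or: codeact_xml
```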

Tool Blocks. The agent composes its tool set from independent, self-contained tool blocks. Each block implements a unified Tool protocol (name, JSON Schema parameters, async execute) and is registered via a plugin registry with entry-point discovery:

| Tool | Name | Description |
|------|------|-------------|
| Bash | execute_bash | Persistent shell session inside Docker with output truncation, timeout control, and a regex-based command blocklist |
| Editor | str_replace_editor | File viewer/editor with view, create, str_replace, and insert sub-commands |
| Search | search | Web search with anti-leak filtering (auto-blocks target repo URLs). Only active when enable_search: true |
| Link Summary | link_summary | Fetch a URL and summarize its content via a dedicated LLM. Only active when enable_search: true |
| Think | think | Reasoning scratchpad with no environment side-effects; helps the agent plan before acting |
| Finish | finish | Signals task completion and triggers evaluation |

Adding a custom tool is as simple as implementing the Tool protocol and registering it via a Python entry point; no changes to the agent loop are required.
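A custom tool could look like the sketch below. The attribute names (`name`, `parameters`, `execute`) follow the protocol as described above, but the exact surface and the entry-point group name `awe_agent.tools` are assumptions, not the repo's verbatim API.

```python
import asyncio


class WordCountTool:
    """Toy example tool: counts words in its single string argument."""

    # Identifier and JSON Schema parameters, per the Tool protocol sketch.
    name = "word_count"
    parameters = {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "Text to count words in"}
        },
        "required": ["text"],
    }

    async def execute(self, text: str) -> str:
        # Tools return their observation to the agent loop as a string.
        return str(len(text.split()))


# Registration would then go in pyproject.toml (group name is hypothetical):
# [project.entry-points."awe_agent.tools"]
# word_count = "my_pkg.tools:WordCountTool"

if __name__ == "__main__":
    print(asyncio.run(WordCountTool().execute("list files in repo")))  # prints 4
```

Because discovery goes through entry points, the agent loop never imports your package directly; installing it is enough.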

Terminus-2

The Terminus-2 agent scaffold (awe_agent/scaffold/terminus_2/) is a terminal agent that uses tmux for persistent shell sessions. The LLM outputs raw JSON with keystrokes; the framework translates this into actions via TerminusJSONFormat (the third ToolCallFormat alongside openai_function and codeact_xml). This allows Terminus-2 to run inside the standard AgentLoop, inheriting RL training, context condensing, stats tracking, and step callbacks, without changing the LLM-facing prompt.

Terminus-2 is designed for Terminal Bench 2.0 evaluation and is aligned with the official Harbor framework. Evaluation runs in the same container as the agent (there is no patch step; the agent modifies container state directly).
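For intuition, an LLM turn in this mode might look like the JSON below. This example is purely illustrative; the actual field names and schema are defined by TerminusJSONFormat in the repo.

```json
{
  "analysis": "The build failed; inspect the end of the log first.",
  "commands": [
    {"keystrokes": "tail -20 build.log\n", "timeout_sec": 10}
  ],
  "is_task_complete": false
}
```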

Evaluation

SWE-bench style (BeyondSWE / ScaleSWE). After the agent finishes, evaluation runs in a separate Docker container to ensure a clean, tamper-proof environment:

  1. Check out the base commit in a fresh container
  2. Apply the agent-generated patch (6 auto-fallback strategies)
  3. Restore original test files (prevents the agent from gaming tests)
  4. Run fail-to-pass & pass-to-pass test suites via an injected pytest runner
  5. Report structured results (score, pass/fail details, trajectory)
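Step 2's auto-fallback can be pictured as trying progressively more lenient patch appliers until one succeeds. The sketch below is an assumption about the mechanism, with stand-in strategies; the six real strategies live in the repo's eval code.

```python
def apply_with_fallback(patch: str, strategies) -> str:
    """Try each (name, apply_fn) pair in order; return the first that succeeds."""
    errors = []
    for name, apply_fn in strategies:
        try:
            apply_fn(patch)
            return name
        except Exception as exc:  # strategy failed: fall through to a more lenient one
            errors.append((name, exc))
    raise RuntimeError(f"all strategies failed: {errors}")


# Hypothetical stand-ins for real appliers like `git apply` or `patch -p1 --fuzz`.
def git_apply(patch: str) -> None:
    raise ValueError("context mismatch")


def patch_p1(patch: str) -> None:
    return None  # succeeds


print(apply_with_fallback("diff --git ...",
                          [("git_apply", git_apply), ("patch_p1", patch_p1)]))
# prints patch_p1
```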

Terminal Bench 2.0. Evaluation runs in the same container: the evaluator uploads test scripts, executes bash /tests/test.sh, and reads the reward file. This behavior is controlled by the Evaluator.requires_same_session protocol property.
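A minimal sketch of how such a protocol property could steer the runner, assuming (not quoting) the repo's Evaluator interface:

```python
from typing import Protocol


class Evaluator(Protocol):
    @property
    def requires_same_session(self) -> bool: ...


class SameSessionEvaluator:
    requires_same_session = True   # Terminal Bench style: evaluate in the agent's container


class FreshContainerEvaluator:
    requires_same_session = False  # SWE-bench style: clean, tamper-proof container


def eval_container(evaluator, agent_container: str) -> str:
    # The runner reuses the agent's container only when the evaluator asks for it.
    return agent_container if evaluator.requires_same_session else "fresh-container"


print(eval_container(SameSessionEvaluator(), "agent-ctr-1"))    # prints agent-ctr-1
print(eval_container(FreshContainerEvaluator(), "agent-ctr-1")) # prints fresh-container
```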

🚀 Installation

uv (Recommended)

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
uv venv --python 3.11
uv pip install -e ".[dev]"
```

pip

```bash
git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
pip install -e ".[dev]"
```

Why editable install? AweAgent uses entry-points for plugin discovery. Without -e, the plugin registry cannot find LLM backends, agents, or tools.

📋 Supported Benchmarks

| Benchmark | Description | Agent | Dataset | Guide |
|-----------|-------------|-------|---------|-------|
| BeyondSWE | Doc2Repo, CrossRepo, DepMigrate, DomainFix | SearchSWE (with web search) | Hugging Face | README |
| ScaleSWE | Large-scale SWE-bench-style training datasets (20k instances) | SearchSWE (CodeAct XML) | Hugging Face | README |
| Terminal Bench 2.0 | Terminal tasks in containerized environments | Terminus-2 | GitHub (commit 69671fbaac6d67a7ef0dfec016cc38a64ef7a77c) | README |

Download Data

BeyondSWE:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AweAI-Team/BeyondSWE",
    repo_type="dataset",
    local_dir="<your_path>/BeyondSWE",
)
```

ScaleSWE: see the Hugging Face collection for available splits at
https://huggingface.co/datasets/AweAI-Team/Scale-SWE.

Terminal Bench 2.0: clone the repo and check out the pinned commit for reproducibility:

```bash
git clone https://github.com/laude-institute/terminal-bench-2.git
cd terminal-bench-2
git checkout 69671fbaac6d67a7ef0dfec016cc38a64ef7a77c
# Each task folder contains instruction.md, task.toml, environment/, tests/
```

Quick Example

```bash
# Configure LLM
export OPENAI_API_KEY="sk-..."

# List instances (no Docker needed)
python recipes/beyond_swe/run.py \
    --data-file /path/to/beyondswe.jsonl --mode dry-run

# Batch run
python recipes/beyond_swe/run.py \
    --data-file /path/to/beyondswe.jsonl --mode batch
```

See each benchmark's guide above for full setup, CLI arguments, and output format.

βš™οΈ Configuration

Configs are YAML files with environment variable substitution (${VAR}, ${VAR:-default}) and !include support.
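The two substitution forms behave like shell parameter expansion: `${VAR}` fails loudly when unset, `${VAR:-default}` falls back. A minimal sketch of that logic, assuming (not quoting) the repo's loader:

```python
import os
import re

# Covers the two forms from the README: ${VAR} and ${VAR:-default}.
_VAR = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def substitute_env(text: str) -> str:
    def repl(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = os.environ.get(name)
        if value is not None:
            return value
        if default is not None:
            return default
        raise KeyError(f"undefined environment variable: {name}")
    return _VAR.sub(repl, text)


os.environ["MY_MODEL"] = "gpt-4o"
print(substitute_env("model: ${MY_MODEL}"))            # prints model: gpt-4o
print(substitute_env("temperature: ${MY_TEMP:-0.2}"))  # prints temperature: 0.2
```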

LLM Backends

| Backend | Config File | Required Env Vars |
|---------|-------------|-------------------|
| OpenAI | configs/llm/openai.yaml | OPENAI_API_KEY |
| Azure OpenAI | configs/llm/azure.yaml | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT |
| Volcengine Ark | configs/llm/ark.yaml | ARK_API_KEY, ARK_MODEL_ID |
| SGLang | configs/llm/sglang.yaml | (self-hosted endpoint) |

Environment Variables

Copy .env.example to .env and fill in your values:

```bash
cp .env.example .env
```

The .env.example is organized into three sections:

  1. LLM Backend (pick one): set the API key and endpoint for your chosen backend:

    # OpenAI
    OPENAI_API_KEY=sk-...
    
    # Or Azure OpenAI
    AZURE_OPENAI_API_KEY=your-key
    AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
  2. Task Data: path to your benchmark JSONL file:

    DATA_FILE=/path/to/data.jsonl
  3. Search Tools (optional, BeyondSWE search mode only): required when running with enable_search: true:

    SERPAPI_API_KEY=your-serpapi-key
    JINA_API_KEY=your-jina-key          # optional, 20 RPM free without key

See each benchmark guide for the full list of variables.

πŸ—ΊοΈ Roadmap

Our long-term goal is to build practical, general-purpose agents and optimize them with reinforcement learning.

Agent Scaffolds & Capabilities

  • SearchSWE: coding agent with optional web search augmentation
  • Terminus-2: terminal agent for Terminal Bench 2.0
  • Deep Research agent

Evaluation & Optimization

  • BeyondSWE & ScaleSWE benchmark support
  • Terminal Bench 2.0 benchmark support
  • More benchmarks: wider task coverage and domain diversity
  • Agentic RL: scalable reinforcement learning infrastructure for agent optimization

📄 License

This project is released under the Apache-2.0 License.

📨 Contact

For any questions or feedback, please reach out to us at [email protected].
