sainarne15/Autonomy
Autonomy — Self-Improving Multi-Agent Software Development System

Version 1.0

A production-grade AI system where 9 specialized agents collaborate to turn a single English sentence into a fully built, tested, reviewed, and deployed full-stack application — and get better at it with every task.


The Idea

Most AI coding tools are single-shot: you prompt, you get code, you manually fix it. Autonomy asks a different question: what if the entire software development lifecycle — design, build, test, review, deploy — was an autonomous pipeline of specialized AI agents that learn from their own mistakes?

The core insight is that software development is not one skill — it's many. A good API designer thinks differently from a good test writer. A security reviewer catches things a code generator misses. By splitting the work across 9 agents, each with deep expertise in one domain, the system produces better results than any single monolithic prompt could.

But specialization alone isn't enough. The agents need to coordinate (the frontend must match the backend's API contract), self-correct (fix their own syntax errors before passing work downstream), and learn (remember that "always use async endpoints" after making the mistake once). Autonomy builds all three into the architecture.

Design Principles

  1. Contract-first development — The Planner writes an OpenAPI spec before any code. Every downstream agent reads this spec as the single source of truth. No agent guesses what another built.

  2. Identity ≠ Model — Each agent has a persistent identity (expertise, rules, past mistakes) stored in Firestore that survives model swaps. You can switch the backend agent from Claude to Llama and it keeps all its learned rules. The model is just an interchangeable inference engine.

  3. Self-correction before escalation — Every build agent checks its own output (syntax verification, self-review) before returning. Problems caught early cost nothing; problems caught late require a full retry loop.

  4. Human-in-the-loop, not human-out-of-the-loop — The system pauses and asks for approval before deploying. You see test results, review scores, a live demo URL, and total cost before deciding.

  5. Zero-cloud dev mode — The full pipeline works with just ANTHROPIC_API_KEY. No GCP, no E2B, no GitHub required. Mock storage fills in automatically.


What It Does

You describe an app in one sentence. Autonomy runs a pipeline of 9 AI agents that:

  1. Designs the full API contract (OpenAPI 3.1 spec)
  2. Writes the backend (FastAPI + SQLAlchemy)
  3. Writes the database layer (models + Alembic migrations)
  4. Writes the frontend (React + TypeScript)
  5. Tests everything (pytest + Jest in a sandboxed cloud VM)
  6. Reviews code for security and quality (OWASP Top 10)
  7. Pauses and asks you for approval (via Telegram or dashboard)
  8. Commits the code and opens a GitHub PR
  9. Generates Docker + CI/CD + Cloud Run deployment config

If tests fail or the reviewer finds issues, the orchestrator routes back to the responsible agent with structured feedback (exact file, line, error message) and retries — up to MAX_ITERATIONS times before escalating to you.
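The routing described above can be sketched as follows. This is an illustrative toy, not the actual orchestrator API: the function name, the file-ownership map, and the feedback dict shape are assumptions based on this README's description (exact file, line, error message).

```python
# Hypothetical sketch of failure routing: structured feedback names the
# broken file, and the orchestrator sends it back to the agent that owns it.
def route_failure(feedback: dict, file_owners: dict, iteration: int, max_iterations: int) -> str:
    """Pick the next pipeline node for a failed check."""
    if iteration >= max_iterations:
        return "human_escalation"                      # retries exhausted, ask the human
    return file_owners.get(feedback["file"], "backend_agent")  # assumed default owner

file_owners = {
    "main.py": "backend_agent",
    "models.py": "database_agent",
    "App.tsx": "frontend_agent",
}
feedback = {"file": "models.py", "line": 42, "error": "ImportError: cannot import name 'Base'"}

print(route_failure(feedback, file_owners, iteration=1, max_iterations=3))  # database_agent
print(route_failure(feedback, file_owners, iteration=3, max_iterations=3))  # human_escalation
```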


The 9-Agent Pipeline

START
  |
  v
+-----------------+
|  Orchestrator   |  Plans the task, sets quality criteria, drives the retry loop
+--------+--------+
         |
         v
+-----------------+
|    Planner      |  Generates OpenAPI spec -- the shared contract for all agents
+--------+--------+
         |
         v  (dynamic -- agents included based on task type)
+--------------------------------------------+
|  Backend Agent  | Database Agent | Frontend |
|  FastAPI routes | SQLAlchemy ORM | React/TS |
+--------------------------------------------+
         |
         v
+-----------------+
|    QA Agent     |  Writes pytest suite, runs in E2B sandbox, self-corrects
+--------+--------+
         |
         v
+-----------------+
|  Code Reviewer  |  OWASP + performance scoring, routes back on failure
+--------+--------+
         |         ^ retry loop (up to max_iterations)
         v
+-----------------+
|  Human Review   |  Pauses -> sends Telegram message with summary + live demo URL
+--------+--------+
         |  (approve / reject with feedback)
         v
+-----------------+     +-----------------+
|   Git Agent     |---->|  DevOps Agent   |
|  Commit + PR    |     |  Docker + CI/CD |
+-----------------+     +-----------------+
         |
        END

Why These Technologies

Every library in the stack was chosen for a specific reason. Here's the decision log.

LangGraph — Stateful Workflow Orchestration

Why not just chain LLM calls in a loop? Because the pipeline has:

  • Conditional branching — the orchestrator routes failures to whichever agent owns the broken file
  • Human-in-the-loop interrupts — the workflow pauses mid-execution, waits for Telegram/dashboard input, then resumes exactly where it left off
  • Persistent checkpointing — if the server crashes during a retry loop, it picks up from the last checkpoint on restart
  • Dynamic pipelines — the planner decides at runtime which agents are needed (a backend-only task skips frontend and database agents)

LangGraph gives us all of this as a first-class state machine with typed state (AgentState TypedDict), interrupt nodes, and pluggable checkpointers. The alternative was hand-rolling state persistence, interrupt/resume logic, and routing — which is exactly the kind of infrastructure that breaks in edge cases.

LiteLLM — Provider-Agnostic LLM Routing

Every agent calls litellm.acompletion() — one function, same signature, any provider. This means:

  • Switch any agent to a different model by changing one env var (MODEL_BACKEND=ollama/qwen2.5-coder)
  • Run the entire pipeline locally with Ollama (zero cost, no internet)
  • Use free-tier providers (OpenRouter :free, Groq) for development
  • Automatic retry with exponential backoff on rate limits (429) and transient errors (503)
  • Fallback to a backup model if the primary exhausts retries

LiteLLM translates between Anthropic, OpenAI, Ollama, Groq, DeepSeek, Gemini, and OpenRouter APIs — no provider-specific SDK code anywhere in agent logic.

Google Firestore — Agent Identity Persistence

Each agent's identity (expertise, learned rules, past mistakes, correction rules, code style, beliefs) is stored in Firestore. Why Firestore specifically?

  • Document model — identity is a nested JSON document (lists of rules, dicts of beliefs), not relational rows
  • Real-time updates — agents read their identity at the start of every task and write back after reflection
  • Version snapshots — every identity save creates a versioned subcollection entry; you can roll back any agent to any previous version
  • Serverless — no database to manage, automatic scaling, free tier covers development

The identity is completely model-independent. Swapping MODEL_BACKEND from Claude to DeepSeek changes the inference engine but preserves every learned rule.

Google BigQuery — Interaction Logging & Pattern Mining

Every LLM call is logged to BigQuery with full context: system prompt, user prompt, model response, token counts, cost, success/failure. Three tables:

| Table | Purpose |
|---|---|
| agent_interactions | Full input/output per agent call — the raw training data for future fine-tuning |
| agent_metrics | Performance metrics over time — success rates, token efficiency, cost trends |
| memory_diffs | What changed in each agent's identity after reflection — the audit trail for self-improvement |

Why BigQuery? Because the reflection system needs to query "show me this agent's last N interactions where success=false" — that's a SQL query, not a document lookup. BigQuery handles analytical queries over growing datasets without index management. The free tier (10 GB storage, 1 TB queries/month) covers even heavy development.
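As a concrete illustration, that reflection read might look like the query below. The dataset, table, and column names follow this README's descriptions (agent_system, agent_interactions, the logged prompt/response fields) but are assumptions about the actual schema.

```python
# Illustrative BigQuery SQL for "this agent's last N failed interactions".
# Table and column names are assumed from the README, not verified.
AGENT_ID, N = "backend_agent", 20

failure_query = f"""
SELECT system_prompt, user_prompt, response
FROM `agent_system.agent_interactions`
WHERE agent_id = '{AGENT_ID}' AND success = FALSE
ORDER BY timestamp DESC
LIMIT {N}
"""
print(failure_query.strip())
```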

SQLite — LangGraph Checkpoint Persistence

LangGraph checkpoints (the full pipeline state at each node) are stored in SQLite via AsyncSqliteSaver. This means:

  • Paused workflows survive server restarts — a task waiting for human approval at 2am is still waiting when the server starts at 9am
  • The retry loop state is durable — iteration count, accumulated artifacts, test results all persist
  • Zero infrastructure — SQLite is a single file (data/checkpoints.db), no database server

Falls back to in-memory MemorySaver if langgraph-checkpoint-sqlite isn't installed (checkpoints lost on restart, but the pipeline still works).

E2B — Sandboxed Code Execution

Generated code runs in an E2B Code Interpreter cloud sandbox — an isolated VM with its own filesystem, Python environment, and network. This serves three purposes:

  1. Safety — nothing executes on your machine. A generated rm -rf / would only affect the disposable sandbox.
  2. pytest execution — the QA agent writes tests, uploads them to the sandbox along with the generated source, runs pytest, and reads back structured results (pass/fail counts, error messages, stack traces).
  3. Live demo — after tests pass, the sandbox starts the generated FastAPI app on a public HTTPS URL so you can preview it before approving.

The sandbox has a configurable lifetime (SANDBOX_TIMEOUT_SECONDS, default 1 hour) with automatic keepalive pings so it survives long-running tasks.

Anthropic Claude — Primary LLM

Claude Sonnet powers all code-generation agents (backend, frontend, database, QA, reviewer, orchestrator, planner) because it handles:

  • Long structured output — full Python files, OpenAPI specs, React components
  • Instruction following — "output only valid JSON" actually means only valid JSON
  • Extended thinking — complex planning steps use adaptive thinking for multi-step reasoning
  • Prompt caching — agent identity is injected with cache_control: ephemeral, cutting identity token cost by ~90% across repeated calls

Claude Haiku handles mechanical agents (git, devops) where output is templated and speed matters more than reasoning depth.

FastAPI — API Server

The system exposes a REST API (not just a CLI) so it can be driven by the dashboard, Telegram, or any HTTP client. FastAPI was chosen for:

  • Async-native — the pipeline is fully async; FastAPI's async endpoints avoid blocking during long LLM calls
  • Auto-generated docs — /docs gives you interactive OpenAPI docs for free
  • Pydantic validation — request/response models are type-checked at the boundary
  • SSE support — StreamingResponse streams live LLM conversation events to the dashboard

python-telegram-bot — Human-in-the-Loop Notifications

The Telegram bot is the primary human interface for production use:

  • Task started/completed/failed notifications
  • Per-agent progress summaries as they complete
  • Rich review request with test stats, cost summary, and live demo URL
  • Inline keyboard buttons (Approve / Give Feedback / Restart Demo)
  • Budget alerts when cost approaches the limit
  • Reply to resume paused workflows naturally — the orchestrator LLM interprets your message

Uses long-polling (not webhooks) so it works behind NAT/firewalls without a public URL. Supports HTTP/SOCKS5 proxy for restricted networks.

React + TypeScript + Vite — Monitoring Dashboard

A real-time dashboard at http://localhost:3000 for monitoring pipeline execution:

| Tab | What You See |
|---|---|
| Overview | Total tasks, active agents, total cost, system uptime |
| Agents | All 9 agent cards with identity, model, status, success rate |
| Tasks | Task list + Kanban pipeline view with per-agent metrics |
| Logs | Recent interactions stream + cost chart over time |

The LLM Conversation panel streams every LLM call in real time via Server-Sent Events — you see exactly what each agent asks the model and what it responds, including token counts and per-call cost.

httpx — GitHub Integration

The Git Agent uses httpx (async HTTP client) to call the GitHub REST API directly — no git binary required on the host. Creates feature branches, commits artifacts, and opens pull requests programmatically.


Key Features

Agent Intelligence

  • Persistent identity per agent — each of the 9 agents stores its expertise, learned rules, past mistakes, and preferences in Firestore. Identity persists across restarts and is model-independent.
  • Conditional reflection — every N tasks (configurable, default 3), agents re-read their recent interactions from BigQuery and use an LLM to update their own rules. Duplicate rules are removed by similarity.
  • Prompt caching — agent identity is injected with cache_control: ephemeral on every Claude call, cutting identity token cost by ~90%.
  • Adaptive thinking — complex agents (orchestrator, planner, backend) use extended thinking for planning steps.

Self-Correction (within each agent)

Every build agent fixes its own mistakes before returning:

  • Backend & Database agents — syntax-check Python files in the E2B sandbox; if broken, run a multi-turn LLM correction pass.
  • Frontend agent — run a self-review pass for TypeScript issues (missing imports, wrong API paths, type mismatches); only emit fixed files if issues found.
  • QA agent — fast-path: re-run previous tests against updated source (no LLM cost). If test code has errors, fix with LLM. If source code is broken, return failure to the responsible agent.

Shared Context — How Agents Know What Others Built

All agents share a single AgentState dictionary managed by LangGraph:

  • openapi_spec — Planner writes this first; every downstream agent reads it as the integration contract
  • artifacts — accumulates all generated files ({"main.py": "...", "models.py": "..."}) with zero truncation between agents
  • agent_briefs — Planner writes per-agent task descriptions
  • feedback + review_issues — QA and Reviewer write structured issues; fix agents read them on retry
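A minimal sketch of that shared state is below. The real AgentState TypedDict has 70+ fields; only the ones named above are shown, and the field types are assumptions.

```python
from typing import TypedDict

# Simplified sketch of the shared LangGraph state (illustrative subset).
class AgentState(TypedDict, total=False):
    openapi_spec: dict           # written once by the Planner, read by everyone
    artifacts: dict[str, str]    # filename -> full file content, never truncated
    agent_briefs: dict[str, str] # per-agent task descriptions from the Planner
    feedback: list[dict]         # structured issues from QA
    review_issues: list[dict]    # structured issues from the Code Reviewer

state: AgentState = {"openapi_spec": {"openapi": "3.1.0"}, "artifacts": {}}
state["artifacts"]["main.py"] = "# generated FastAPI app"
print(sorted(state["artifacts"]))  # ['main.py']
```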

Code Execution in the Cloud (E2B Sandbox)

  • All code runs in an isolated cloud VM on E2B's infrastructure — nothing executes on your machine
  • Syntax checking — backend and database files verified before the pipeline continues
  • pytest execution — full test suite run in the sandbox with fastapi, uvicorn, httpx, pytest-asyncio
  • Frontend testing — Jest/Vitest tests for React components, run in the same sandbox
  • Live demo — after QA passes, the generated FastAPI app is started in the sandbox and a public HTTPS URL is returned for you to preview

Human-in-the-Loop

Two interrupt points in the pipeline:

  1. Pre-deploy review — after code review passes, workflow pauses and sends a rich Telegram message with per-agent build summary, test pass/fail stats, total cost, and live demo URL
  2. Failure escalation — if max_iterations is exhausted, workflow pauses and asks how to proceed

The orchestrator uses LLM-based intent parsing for human input — no keyword matching. Natural language like "looks great, ship it!" or "the search endpoint is broken, returns too many results" is correctly interpreted and routed.

Budget Control

  • MAX_TASK_BUDGET_USD — hard per-task cost ceiling (default $0.50)
  • When the limit is hit, workflow pauses and asks via Telegram whether to approve more budget
  • Reply naturally: "give it another dollar", "set budget to $2", or "abort"
  • The orchestrator LLM parses amounts from natural language (including "a dollar fifty", "50 cents", etc.)

Task Persistence

  • All task state persisted to data/tasks.json — survives server restart
  • LangGraph checkpoints stored in data/checkpoints.db (SQLite) — paused workflows can be resumed after restart
  • Output artifacts optionally written to disk at {output_directory}/{task_id}/ with preserved subdirectory structure

Live LLM Conversation Monitoring

The dashboard Tasks > LLM Conversation panel streams every LLM call in real time via Server-Sent Events:

  • Prompt bubble — the exact message sent to the model (expandable)
  • Response bubble — the model's reply, token count, and per-call cost
  • Agent badge — color-coded by which agent made the call
  • "Thinking..." pulse — animated indicator while waiting for a response
  • Replay mode — switching to a completed task replays the full conversation from history

GitHub Integration

  • Git agent commits all artifacts to the target repo using GitHub REST API (no local git required)
  • Opens a pull request with a structured description
  • DevOps agent generates Dockerfile, docker-compose.yml, .github/workflows/ci.yml, and Cloud Run config alongside the PR

Security

  • X-API-Key header authentication (optional in dev, enforced in production)
  • CORS restricted to configured origins
  • MAX_CONCURRENT_TASKS — prevents resource abuse
  • MAX_TASK_DESCRIPTION_LENGTH — caps prompt injection surface
  • Path traversal prevention on output file writes
  • E2B sandbox isolates all executed code from host
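The path traversal check mentioned above can be sketched like this; the function name and error handling are illustrative, not the project's actual implementation.

```python
from pathlib import Path

# Hedged sketch of path traversal prevention on output file writes:
# resolve the candidate path and refuse anything escaping the output root.
def safe_output_path(output_root: str, relative_name: str) -> Path:
    root = Path(output_root).resolve()
    candidate = (root / relative_name).resolve()
    if not candidate.is_relative_to(root):     # Python 3.9+
        raise ValueError(f"path escapes output directory: {relative_name}")
    return candidate

print(safe_output_path("/tmp/out", "src/main.py"))
# safe_output_path("/tmp/out", "../../etc/passwd")  -> raises ValueError
```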

Developer Mode (No Cloud Required)

Leave GCP_PROJECT_ID unset -> the system automatically uses:

  • MockFirestoreMemory — in-memory agent identity
  • MockBigQueryLogger — in-memory interaction logs

The full 9-agent pipeline works without any cloud account. You only need ANTHROPIC_API_KEY.
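A minimal sketch of that fallback, assuming a selection on GCP_PROJECT_ID as described above. The MockFirestoreMemory methods shown are illustrative; the real class lives in tests/mocks.py.

```python
# Illustrative dev-mode selection: no GCP project -> in-memory mock storage.
class MockFirestoreMemory:
    """Toy stand-in for agent identity storage (method names assumed)."""
    def __init__(self):
        self.store: dict[str, dict] = {}
    def save_identity(self, agent_id: str, identity: dict) -> None:
        self.store[agent_id] = identity
    def load_identity(self, agent_id: str) -> dict:
        return self.store.get(agent_id, {})

def make_memory(gcp_project_id: str):
    if gcp_project_id:
        raise NotImplementedError("real Firestore client not shown in this sketch")
    return MockFirestoreMemory()   # dev mode: nothing leaves the process

mem = make_memory("")              # GCP_PROJECT_ID unset
mem.save_identity("backend_agent", {"learned_rules": ["Always use async endpoints"]})
print(mem.load_identity("backend_agent")["learned_rules"])  # ['Always use async endpoints']
```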


Model-Agnostic Architecture

The Core Principle: Identity != Model

Every agent has a persistent identity stored in Firestore that is completely independent of which AI model powers it:

+---------------------------------------------+
|              Agent Identity                 |  <-- stored in Firestore
|  expertise, learned_rules, past_mistakes,   |  <-- persists across restarts
|  correction_rules, code_style, beliefs      |  <-- survives model changes
+---------------------------------------------+
                       |
                       | injected as system prompt context
                       v
+---------------------------------------------+
|                  LiteLLM                    |  <-- provider-agnostic router
|  acompletion(model="claude-sonnet-4-6", ..) |  <-- same call for any model
|  acompletion(model="groq/llama-3.3-70b",..) |
|  acompletion(model="ollama/qwen2.5-coder",) |
+---------------------------------------------+

You can switch the orchestrator from Claude to DeepSeek by setting MODEL_ORCHESTRATOR=deepseek/deepseek-chat — the orchestrator keeps all its learned rules and expertise unchanged. It simply speaks through a different model.

Supported Providers

| Provider | Model format | Example |
|---|---|---|
| Anthropic (Claude) | claude-{family}-{version} | claude-sonnet-4-6 |
| Ollama (local, free) | ollama/{model} | ollama/qwen2.5-coder |
| Groq (fast, free tier) | groq/{model} | groq/llama-3.3-70b-versatile |
| DeepSeek | deepseek/{model} | deepseek/deepseek-chat |
| Google Gemini | gemini/{model} | gemini/gemini-2.0-flash-exp |
| OpenAI | gpt-{version} | gpt-4o, gpt-4o-mini |
| OpenRouter (free tier) | openrouter/{org}/{model}:free | openrouter/qwen/qwen3-coder:free |
| OpenRouter (paid) | openrouter/{org}/{model} | openrouter/deepseek/deepseek-r1 |

Per-Agent Model Assignment

Each of the 9 agents has its own model, tuned for task complexity:

| Agent | Default Model | Why |
|---|---|---|
| orchestrator | claude-sonnet-4-6 | Multi-step reasoning, quality scoring, natural language input parsing |
| planner | claude-sonnet-4-6 | OpenAPI spec design — highest complexity output |
| backend_agent | claude-sonnet-4-6 | FastAPI code generation, security awareness |
| database_agent | claude-sonnet-4-6 | Schema design, indexing decisions |
| frontend_agent | claude-sonnet-4-6 | React + TypeScript, type safety |
| qa_agent | claude-sonnet-4-6 | Edge case reasoning, test design |
| code_reviewer | claude-sonnet-4-6 | OWASP analysis requires deep reasoning |
| git_agent | claude-haiku-4-5 | Deterministic commit messages — fast is enough |
| devops_agent | claude-haiku-4-5 | Dockerfile/CI YAML — structured, templated |

Override any assignment via environment variable:

# Run the backend on a free local model, keep orchestrator on Claude
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_ORCHESTRATOR=claude-sonnet-4-6

# Run entire pipeline on free OpenRouter models
MODEL_ORCHESTRATOR=openrouter/meta-llama/llama-3.3-70b-instruct:free
MODEL_PLANNER=openrouter/qwen/qwen3-coder:free
MODEL_BACKEND=openrouter/qwen/qwen3-coder:free
MODEL_QA=openrouter/qwen/qwen3-coder:free
MODEL_REVIEWER=openrouter/deepseek/deepseek-r1:free

# Run entire pipeline locally with Ollama (zero cost, no internet)
MODEL_ORCHESTRATOR=ollama/llama3.2
MODEL_PLANNER=ollama/qwen2.5-coder
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_DATABASE=ollama/qwen2.5-coder
MODEL_FRONTEND=ollama/qwen2.5-coder
MODEL_QA=ollama/llama3.2
MODEL_REVIEWER=ollama/llama3.2

LLM Parameter Overrides

Three layers, applied lowest to highest priority:

MODEL_LLM_PARAMS   (per-model or per-provider prefix)
  -> AGENT_LLM_PARAMS   (per-agent, any model)
    -> explicit call-site param   (always wins)

Examples:

# All Ollama models: low temperature for deterministic output
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}

# Code reviewer: force JSON output (works on Groq, Ollama, OpenAI)
AGENT_LLM_PARAMS = {"code_reviewer": {"response_format": {"type": "json_object"}}}
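The three layers above can be sketched as a single merge function. The env var names mirror the README; the resolution logic (prefix matching, merge order) is an assumption about the actual config.py behavior.

```python
# Sketch of three-layer LLM parameter resolution, lowest to highest priority.
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}
AGENT_LLM_PARAMS = {"code_reviewer": {"response_format": {"type": "json_object"}}}

def resolve_llm_params(model: str, agent: str, **explicit) -> dict:
    params: dict = {}
    # layer 1: per-model or per-provider-prefix defaults
    for key, overrides in MODEL_LLM_PARAMS.items():
        if model == key or model.startswith(key):
            params.update(overrides)
    # layer 2: per-agent overrides, any model
    params.update(AGENT_LLM_PARAMS.get(agent, {}))
    # layer 3: explicit call-site params always win
    params.update(explicit)
    return params

print(resolve_llm_params("ollama/qwen2.5-coder", "code_reviewer", temperature=0.0))
# {'temperature': 0.0, 'response_format': {'type': 'json_object'}}
```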

Claude-Specific Features (Graceful Degradation)

| Feature | Claude | Other providers |
|---|---|---|
| Prompt caching (cache_control: ephemeral) | Cuts identity token cost ~90% | Skipped silently |
| Adaptive thinking ("type": "adaptive") | Extended reasoning on complex tasks | Skipped silently |
| Streaming | Default for all calls | Also supported via LiteLLM |

Retry, Rate Limits, and Fallback

Free-tier providers have aggressive rate limits. The base agent handles this transparently:

  • Retried: RateLimitError (429), ServiceUnavailableError (503), APIConnectionError
  • Not retried: AuthenticationError (bad key), BadRequestError (bad prompt)
  • Backoff: exponential with jitter — 5s base, doubles each attempt, capped at 60s
  • Max retries: 6 by default (configurable via LLM_MAX_RETRIES)
  • Fallback: after all retries exhausted, switches to MODEL_FALLBACK if configured

Cost Tracking

config.cost_usd(model, input_tokens, output_tokens) handles pricing across all providers:

  • :free suffix -> $0.00 (OpenRouter free tier)
  • ollama/ prefix -> $0.00 (local inference)
  • Known models -> looked up from MODEL_PRICING dict in config.py
  • Unknown models -> $0.00 (conservative default)
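The pricing rules above can be sketched as a small function. The MODEL_PRICING rates shown are placeholders for illustration, not the project's actual numbers.

```python
# Sketch of provider-aware cost calculation following the rules above.
MODEL_PRICING = {
    # USD per 1M input / output tokens -- placeholder rates, not actual pricing
    "claude-sonnet-4-6": (3.00, 15.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    if model.endswith(":free") or model.startswith("ollama/"):
        return 0.0                       # OpenRouter free tier / local inference
    if model not in MODEL_PRICING:
        return 0.0                       # unknown model: conservative default
    in_price, out_price = MODEL_PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(cost_usd("ollama/qwen2.5-coder", 10_000, 2_000))         # 0.0
print(round(cost_usd("claude-sonnet-4-6", 10_000, 2_000), 4))  # 0.06
```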

How Agents Learn

Each agent maintains a persistent identity profile:

AgentIdentity {
  expertise:        ["FastAPI", "SQLAlchemy", "JWT auth"]
  learned_rules:    ["Always use async endpoints", "Parameterize all SQL queries"]
  past_mistakes:    ["Forgot to add CORS middleware in v1"]
  correction_rules: ["After adding auth, always test with an invalid token"]
  code_style:       "Google Python style guide"
  beliefs:          { "prefer_postgresql": true, "use_alembic": true }
  rule_stats:       { "Always use async endpoints": { use_count: 12, success_count: 11 } }
  version:          12
}

The Reflection Loop

Every REFLECTION_INTERVAL tasks (default: 3), the agent:

  1. Reads its recent interactions from BigQuery (successes and failures)
  2. Calls the LLM: "What new rules should I add based on these outcomes?"
  3. Quality gate — rejects rules that are too short (<15 chars), too long (>200 chars), or task-specific (mentions filenames, route paths, specific identifiers)
  4. Deduplication — exact match check, then semantic similarity (cosine over keyword tokens, threshold 65%) against existing rules
  5. Saves the diff to Firestore and logs it to the memory_diffs BigQuery table
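The deduplication step (4) can be sketched like this, using the 65% threshold from the list above. The keyword tokenization is an assumption; the actual implementation may tokenize differently.

```python
import math
import re

# Sketch of semantic dedup: cosine similarity over keyword token counts.
def tokens(rule: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for word in re.findall(r"[a-z]+", rule.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

def similarity(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    dot = sum(ta[w] * tb.get(w, 0) for w in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

def is_duplicate(new_rule: str, existing: list[str], threshold: float = 0.65) -> bool:
    # exact match first, then semantic similarity against every existing rule
    return any(new_rule == r or similarity(new_rule, r) >= threshold for r in existing)

existing = ["Always use async endpoints"]
print(is_duplicate("Always use async endpoints for routes", existing))  # True
print(is_duplicate("Parameterize all SQL queries", existing))           # False
```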

Rule Confidence Scoring

Every rule accumulates a track record:

  • After each task, record_rule_outcomes() increments use_count and (if successful) success_count for every active rule
  • get_weak_rules() surfaces rules with high usage but low success rate (used 10+ times, <55% success)
  • Weak rules become candidates for removal in the next reflection cycle
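The bookkeeping above can be sketched with the README's thresholds (10+ uses, <55% success). Function and field names follow the README; the in-memory layout is illustrative.

```python
# Sketch of rule confidence tracking: per-rule use/success counters.
rule_stats: dict[str, dict[str, int]] = {}

def record_rule_outcomes(active_rules: list[str], success: bool) -> None:
    for rule in active_rules:
        stats = rule_stats.setdefault(rule, {"use_count": 0, "success_count": 0})
        stats["use_count"] += 1
        if success:
            stats["success_count"] += 1

def get_weak_rules(min_uses: int = 10, max_rate: float = 0.55) -> list[str]:
    # high usage, low success rate -> candidate for removal
    return [
        rule for rule, s in rule_stats.items()
        if s["use_count"] >= min_uses and s["success_count"] / s["use_count"] < max_rate
    ]

for outcome in [True] * 4 + [False] * 8:           # 4/12 successes, ~33%
    record_rule_outcomes(["Always inline SQL strings"], outcome)
print(get_weak_rules())  # ['Always inline SQL strings']
```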

Safety Controls

| Control | Default | Purpose |
|---|---|---|
| REFLECTION_ENABLED | true | Master switch — counters still update when off |
| REFLECTION_DRY_RUN | false | Log changes without writing to Firestore |
| REFLECTION_MAX_RULE_LENGTH | 200 | Reject verbose, task-specific rules |
| Version snapshots | Last 10 kept | Roll back any agent to any previous identity version |

The result: agents that get measurably better at their specialty over time without any manual prompt engineering.


Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+ (for the dashboard)

1. Install Python dependencies

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Edit .env -- minimum required:
#   ANTHROPIC_API_KEY=sk-ant-...

3. Start the backend

uvicorn main:app --reload
# API:  http://localhost:8000
# Docs: http://localhost:8000/docs

4. Start the dashboard (new terminal)

cd dashboard
npm install
npm run dev
# Dashboard: http://localhost:3000

5. Submit a task

curl -X POST http://localhost:8000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task_description": "Build a REST API for a todo list with user auth and JWT",
    "max_iterations": 3,
    "use_sandbox": false
  }'

Or use the dashboard Submit Task form.


Environment Variables

Required

| Variable | Description |
|---|---|
| ANTHROPIC_API_KEY | Anthropic API key (or use another provider via model overrides) |

Optional — Cloud Services

| Variable | Default | Description |
|---|---|---|
| GCP_PROJECT_ID | (empty) | Google Cloud project — leave blank for dev mode with mocks |
| BQ_DATASET | agent_system | BigQuery dataset name |
| FIRESTORE_COLLECTION | agents | Firestore collection for agent identity |
| E2B_API_KEY | (empty) | E2B sandbox — sandbox disabled if not set |
| GITHUB_TOKEN | (empty) | GitHub PAT — git/devops agents disabled if not set |
| GITHUB_REPO | (empty) | owner/repo — target repository for commits/PRs |

Optional — Telegram

| Variable | Description |
|---|---|
| TELEGRAM_BOT_TOKEN | Get from @BotFather |
| TELEGRAM_CHAT_ID | Your chat ID — get from @userinfobot |
| TELEGRAM_PROXY | http://host:port or socks5://host:port |

Optional — Workflow

| Variable | Default | Description |
|---|---|---|
| MAX_ITERATIONS | 3 | Max retry loops before escalation |
| MAX_CONCURRENT_TASKS | 5 | Concurrent task limit (429 on overflow) |
| MAX_TASK_BUDGET_USD | 0.50 | Per-task cost ceiling in USD |
| REFLECTION_INTERVAL | 3 | Reflect every N completed tasks |
| REFLECTION_ENABLED | true | Toggle agent reflection on/off |
| REFLECTION_DRY_RUN | false | Log reflection changes without saving |
| LOCAL_OUTPUT_DIR | (empty) | Write generated files to disk |
| SANDBOX_TIMEOUT_SECONDS | 3600 | E2B sandbox lifetime (1 hour) |
| FRONTEND_SELF_REVIEW | false | Enable frontend agent self-review pass |

Optional — Model Overrides

All default to claude-sonnet-4-6 (or claude-haiku-4-5 for git/devops):

MODEL_ORCHESTRATOR, MODEL_PLANNER, MODEL_BACKEND, MODEL_DATABASE,
MODEL_FRONTEND, MODEL_QA, MODEL_REVIEWER, MODEL_GIT, MODEL_DEVOPS
MODEL_FALLBACK  -- tried when primary exhausts retries

Optional — LLM Tuning

| Variable | Default | Description |
|---|---|---|
| LLM_MAX_RETRIES | 6 | Retry count on rate limits / transient errors |
| LLM_RETRY_BASE_DELAY | 5.0 | Base backoff delay in seconds |
| LLM_RETRY_MAX_DELAY | 60.0 | Max backoff cap in seconds |

Optional — API Security

| Variable | Default | Description |
|---|---|---|
| API_KEY | (empty) | Require X-API-Key header (empty = no auth) |
| CORS_ORIGINS | localhost:3000,localhost:5173 | Allowed origins |

API Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /tasks | Submit a new task |
| GET | /tasks | List all tasks |
| GET | /tasks/{id} | Poll task status + results |
| POST | /tasks/{id}/reply | Resume a paused task with human input |
| DELETE | /tasks/{id} | Cancel a running task |
| GET | /agents | All agent identity summaries |
| GET | /agents/{id}/identity | Full agent identity (rules, mistakes, beliefs) |
| GET | /logs | Recent interactions (dev mode) |
| GET | /health | System health + config summary |
| GET | /tasks/{id}/events | SSE stream — live llm_start / llm_end events |
| GET | /stats | BigQuery summary (dev mode) |

Submit Task — Request Body

{
  "task_description": "Build a user auth API with JWT",
  "max_iterations": 3,
  "use_sandbox": true,
  "output_directory": "/absolute/path/to/write/files"
}

Task — Response Shape

{
  "task_id": "abc123...",
  "status": "running | completed | failed | waiting_for_input",
  "current_agent": "backend_agent",
  "pipeline": ["backend_agent", "database_agent", "qa_agent", "code_reviewer"],
  "iteration": 1,
  "artifacts": { "main.py": "...", "models.py": "..." },
  "agent_stats": {
    "backend_agent": { "status": "completed", "cost_usd": 0.012, "tokens": 4200, "files": ["main.py"] }
  },
  "agent_timeline": [
    { "ts": "2026-02-27T14:23:01", "agent": "backend_agent", "event": "completed", "summary": "..." }
  ],
  "pr_url": "https://github.com/owner/repo/pull/42",
  "demo_url": "https://sandbox-abc.e2b.dev",
  "needs_human_input": false,
  "human_input_prompt": null,
  "total_cost_usd": 0.087,
  "output_path": "/path/to/written/files"
}

Project Structure

autonomy/
+-- main.py                    # FastAPI server + all routes + workflow execution
+-- config.py                  # Model routing, pricing, env vars, LLM param overrides
+-- resilient_storage.py       # Disk-based task state persistence
+-- requirements.txt
+-- .env.example
|
+-- agents/
|   +-- base_agent.py          # call_claude(), reflection, retry logic, prompt caching
|   +-- event_bus.py           # Per-task SSE event queue (llm_start/llm_end)
|   +-- orchestrator.py        # Quality gates, retry routing, human input interpretation
|   +-- planner.py             # OpenAPI spec + pipeline design
|   +-- backend_agent.py       # FastAPI code generation + syntax self-correction
|   +-- database_agent.py      # SQLAlchemy + Alembic + syntax self-correction
|   +-- frontend_agent.py      # React + TypeScript + optional self-review
|   +-- qa_agent.py            # pytest/Jest generation + sandbox execution + fast-path reuse
|   +-- code_reviewer.py       # OWASP + security scoring + structured issue output
|   +-- git_agent.py           # GitHub commit + PR via REST API
|   +-- devops_agent.py        # Docker + CI/CD + Cloud Run config generation
|
+-- graph/
|   +-- state.py               # AgentState TypedDict (70+ fields)
|   +-- workflow.py            # LangGraph pipeline: nodes, routers, interrupt points
|
+-- memory/
|   +-- firestore_memory.py    # AgentIdentity class + reflection logic + version snapshots
|
+-- observability/
|   +-- bigquery_logger.py     # 3 BigQuery tables: interactions, metrics, memory_diffs
|
+-- sandbox/
|   +-- e2b_sandbox.py         # E2B cloud VM: pytest, Jest, syntax check, live demo
|
+-- telegram_bot/
|   +-- bot.py                 # Notifications, inline keyboards, reply handling
|
+-- tests/
|   +-- mocks.py               # MockFirestoreMemory + MockBigQueryLogger for dev mode
|
+-- dashboard/                 # React + Vite monitoring UI
|   +-- src/
|       +-- App.tsx            # 4-tab layout (Overview, Agents, Tasks, Logs)
|       +-- api/client.ts      # Typed API client with SSE support
|       +-- components/        # 9 React components (AgentGrid, TaskList, CostChart, etc.)
|
+-- data/                      # Auto-created, git-ignored
    +-- tasks.json             # Persisted task state
    +-- checkpoints.db         # LangGraph SQLite checkpointer

Tech Stack Summary

Core Pipeline

| Component | Library | Why |
|---|---|---|
| Workflow orchestration | LangGraph | Stateful graph with interrupts, checkpointing, conditional routing |
| LLM abstraction | LiteLLM | One acompletion() call for any provider — Claude, Ollama, Groq, etc. |
| LLM provider (primary) | Anthropic Claude | Best instruction following, prompt caching, extended thinking |
| API server | FastAPI | Async-native, auto-docs, Pydantic validation, SSE streaming |
| Validation | Pydantic v2 | Request/response type checking at API boundaries |

Persistence & Observability

| Component | Library | Why |
|---|---|---|
| Agent identity | Google Firestore | Document model for nested JSON identity, version snapshots, serverless |
| Interaction logging | Google BigQuery | Analytical queries for reflection, free tier, fine-tuning dataset |
| Workflow checkpoints | SQLite (via langgraph-checkpoint-sqlite) | Durable pause/resume, zero infrastructure, single file |
| Task state | JSON file (data/tasks.json) | Simple, human-readable, survives restart |

Execution & Integration

| Component | Library | Why |
|---|---|---|
| Code sandbox | E2B Code Interpreter | Isolated cloud VM, pytest/Jest execution, live demo URLs |
| GitHub integration | httpx | Async HTTP client, no git binary dependency |
| Notifications | python-telegram-bot | Long-polling (works behind NAT), inline keyboards, proxy support |

Dashboard

| Component | Library | Why |
|---|---|---|
| UI framework | React 18 | Component model, hooks, wide ecosystem |
| Language | TypeScript 5 | Type safety for API client, component props |
| Build tool | Vite 5 | Fast HMR, minimal config |
| Charts | Recharts | Simple declarative charts for cost/activity data |
| Icons | Lucide React | Consistent icon set, tree-shakeable |

Limitations (v2.0)

  • One concurrent human approval — only one task can be waiting for Telegram reply at a time
  • E2B demo URL — requires E2B API key; app must be FastAPI with a standard entry point
  • Git integration — only supports GitHub (not GitLab/Bitbucket)
  • Dashboard polling — task list polls every 3s (LLM conversation panel uses SSE for true real-time)

Roadmap

  • WebSocket-based real-time dashboard streaming
  • Multiple concurrent human-in-the-loop approvals
  • GitLab and Bitbucket support
  • Playwright tests for generated frontends
  • Agent-to-agent messaging (agents can ask each other questions mid-task)
  • Fine-tuning export — one-click export of BigQuery interactions to JSONL for LoRA/QLoRA
  • Weak rule auto-pruning based on rule confidence scores

Credits

This project was designed and built by Sai Narne with significant development assistance from Claude (Anthropic) — Claude Sonnet 4.6 and Claude Opus 4.6 contributed to architecture design, agent implementation, prompt engineering, and debugging across the entire codebase.


Built with LangGraph, Anthropic Claude, LiteLLM, and E2B.

About

9 AI agents collaborate to turn a single sentence into a fully built, tested, and deployed full-stack app. Built with Python, LangGraph, LiteLLM, React, TypeScript, E2B, Firestore, and GCP.
