Version 1.0
A production-grade AI system where 9 specialized agents collaborate to turn a single English sentence into a fully built, tested, reviewed, and deployed full-stack application — and get better at it with every task.
Most AI coding tools are single-shot: you prompt, you get code, you manually fix it. Autonomy asks a different question: what if the entire software development lifecycle — design, build, test, review, deploy — was an autonomous pipeline of specialized AI agents that learn from their own mistakes?
The core insight is that software development is not one skill — it's many. A good API designer thinks differently from a good test writer. A security reviewer catches things a code generator misses. By splitting the work across 9 agents, each with deep expertise in one domain, the system produces better results than any single monolithic prompt could.
But specialization alone isn't enough. The agents need to coordinate (the frontend must match the backend's API contract), self-correct (fix their own syntax errors before passing work downstream), and learn (remember that "always use async endpoints" after making the mistake once). Autonomy builds all three into the architecture.
- Contract-first development — The Planner writes an OpenAPI spec before any code. Every downstream agent reads this spec as the single source of truth. No agent guesses what another built.
- Identity ≠ Model — Each agent has a persistent identity (expertise, rules, past mistakes) stored in Firestore that survives model swaps. You can switch the backend agent from Claude to Llama and it keeps all its learned rules. The model is just an interchangeable inference engine.
- Self-correction before escalation — Every build agent checks its own output (syntax verification, self-review) before returning. Problems caught early cost nothing; problems caught late require a full retry loop.
- Human-in-the-loop, not human-out-of-the-loop — The system pauses and asks for approval before deploying. You see test results, review scores, a live demo URL, and total cost before deciding.
- Zero-cloud dev mode — The full pipeline works with just `ANTHROPIC_API_KEY`. No GCP, no E2B, no GitHub required. Mock storage fills in automatically.
You describe an app in one sentence. Autonomy runs a nine-agent pipeline that:
- Designs the full API contract (OpenAPI 3.1 spec)
- Writes the backend (FastAPI + SQLAlchemy)
- Writes the database layer (models + Alembic migrations)
- Writes the frontend (React + TypeScript)
- Tests everything (pytest + Jest in a sandboxed cloud VM)
- Reviews code for security and quality (OWASP Top 10)
- Pauses and asks you for approval (via Telegram or dashboard)
- Commits the code and opens a GitHub PR
- Generates Docker + CI/CD + Cloud Run deployment config
If tests fail or the reviewer finds issues, the orchestrator routes back to the responsible agent with structured feedback (exact file, line, error message) and retries — up to MAX_ITERATIONS times before escalating to you.
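The routing decision itself is simple enough to sketch in a few lines. This is an illustration, not the project's actual code — `route_failure` and the issue shape shown here are hypothetical, mirroring the structured feedback (agent, file, line, message) described above:

```python
# Simplified sketch of the orchestrator's retry routing (illustrative only;
# the real system routes via LangGraph conditional edges).

MAX_ITERATIONS = 3

def route_failure(issue: dict, iteration: int) -> str:
    """Return the next pipeline step for a failed check."""
    if iteration >= MAX_ITERATIONS:
        return "escalate_to_human"
    # The agent that owns the broken file gets the structured feedback.
    return issue["agent"]

issue = {"agent": "backend_agent", "file": "main.py", "line": 42,
         "message": "NameError: name 'db' is not defined"}

assert route_failure(issue, iteration=1) == "backend_agent"
assert route_failure(issue, iteration=3) == "escalate_to_human"
```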
```
START
  |
  v
+-----------------+
|  Orchestrator   |  Plans the task, sets quality criteria, drives the retry loop
+--------+--------+
         |
         v
+-----------------+
|    Planner      |  Generates OpenAPI spec -- the shared contract for all agents
+--------+--------+
         |
         v  (dynamic -- agents included based on task type)
+---------------------------------------------+
| Backend Agent  | Database Agent | Frontend  |
| FastAPI routes | SQLAlchemy ORM | React/TS  |
+---------------------------------------------+
         |
         v
+-----------------+
|    QA Agent     |  Writes pytest suite, runs in E2B sandbox, self-corrects
+--------+--------+
         |
         v
+-----------------+
|  Code Reviewer  |  OWASP + performance scoring, routes back on failure
+--------+--------+
         |   ^ retry loop (up to max_iterations)
         v
+-----------------+
|  Human Review   |  Pauses -> sends Telegram message with summary + live demo URL
+--------+--------+
         |  (approve / reject with feedback)
         v
+-----------------+     +-----------------+
|    Git Agent    |---->|  DevOps Agent   |
|   Commit + PR   |     |  Docker + CI/CD |
+-----------------+     +-----------------+
         |
        END
```
Every library in the stack was chosen for a specific reason. Here's the decision log.
Why not just chain LLM calls in a loop? Because the pipeline has:
- Conditional branching — the orchestrator routes failures to whichever agent owns the broken file
- Human-in-the-loop interrupts — the workflow pauses mid-execution, waits for Telegram/dashboard input, then resumes exactly where it left off
- Persistent checkpointing — if the server crashes during a retry loop, it picks up from the last checkpoint on restart
- Dynamic pipelines — the planner decides at runtime which agents are needed (a backend-only task skips frontend and database agents)
LangGraph gives us all of this as a first-class state machine with typed state (AgentState TypedDict), interrupt nodes, and pluggable checkpointers. The alternative was hand-rolling state persistence, interrupt/resume logic, and routing — which is exactly the kind of infrastructure that breaks in edge cases.
Every agent calls litellm.acompletion() — one function, same signature, any provider. This means:
- Switch any agent to a different model by changing one env var (`MODEL_BACKEND=ollama/qwen2.5-coder`)
- Run the entire pipeline locally with Ollama (zero cost, no internet)
- Use free-tier providers (OpenRouter `:free`, Groq) for development
- Automatic retry with exponential backoff on rate limits (429) and transient errors (503)
- Fallback to a backup model if the primary exhausts retries
LiteLLM translates between Anthropic, OpenAI, Ollama, Groq, DeepSeek, Gemini, and OpenRouter APIs — no provider-specific SDK code anywhere in agent logic.
Each agent's identity (expertise, learned rules, past mistakes, correction rules, code style, beliefs) is stored in Firestore. Why Firestore specifically?
- Document model — identity is a nested JSON document (lists of rules, dicts of beliefs), not relational rows
- Real-time updates — agents read their identity at the start of every task and write back after reflection
- Version snapshots — every identity save creates a versioned subcollection entry; you can roll back any agent to any previous version
- Serverless — no database to manage, automatic scaling, free tier covers development
The identity is completely model-independent. Swapping MODEL_BACKEND from Claude to DeepSeek changes the inference engine but preserves every learned rule.
Every LLM call is logged to BigQuery with full context: system prompt, user prompt, model response, token counts, cost, success/failure. Three tables:
| Table | Purpose |
|---|---|
| `agent_interactions` | Full input/output per agent call — the raw training data for future fine-tuning |
| `agent_metrics` | Performance metrics over time — success rates, token efficiency, cost trends |
| `memory_diffs` | What changed in each agent's identity after reflection — the audit trail for self-improvement |
Why BigQuery? Because the reflection system needs to query "show me this agent's last N interactions where success=false" — that's a SQL query, not a document lookup. BigQuery handles analytical queries over growing datasets without index management. The free tier (10 GB storage, 1 TB queries/month) covers even heavy development.
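In dev mode the same lookup can be mimicked in memory. A toy illustration — the `recent_failures` helper is hypothetical, not the real `MockBigQueryLogger` API:

```python
# Hypothetical in-memory version of the reflection query
# "show me this agent's last N interactions where success=false".
interactions = [
    {"agent": "qa_agent", "success": True,  "ts": 1},
    {"agent": "qa_agent", "success": False, "ts": 2},
    {"agent": "qa_agent", "success": False, "ts": 3},
]

def recent_failures(rows: list[dict], agent: str, n: int) -> list[dict]:
    failed = [r for r in rows if r["agent"] == agent and not r["success"]]
    return sorted(failed, key=lambda r: r["ts"], reverse=True)[:n]

assert [r["ts"] for r in recent_failures(interactions, "qa_agent", 2)] == [3, 2]
```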
LangGraph checkpoints (the full pipeline state at each node) are stored in SQLite via AsyncSqliteSaver. This means:
- Paused workflows survive server restarts — a task waiting for human approval at 2am is still waiting when the server starts at 9am
- The retry loop state is durable — iteration count, accumulated artifacts, test results all persist
- Zero infrastructure — SQLite is a single file (`data/checkpoints.db`), no database server

Falls back to the in-memory `MemorySaver` if `langgraph-checkpoint-sqlite` isn't installed (checkpoints are lost on restart, but the pipeline still works).
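The durability argument is easy to demonstrate with plain `sqlite3`. This is a simplified stand-in for what `AsyncSqliteSaver` does; the real checkpoint schema differs:

```python
# Minimal illustration of durable checkpointing in a single SQLite file.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # would be data/checkpoints.db on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints (task_id TEXT PRIMARY KEY, state TEXT)"
)

# Persist the pipeline state at a pause point.
state = {"iteration": 2, "artifacts": {"main.py": "..."}, "awaiting": "human_approval"}
conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
             ("abc123", json.dumps(state)))
conn.commit()

# After a "restart", the paused state is still there to resume from.
row = conn.execute("SELECT state FROM checkpoints WHERE task_id=?",
                   ("abc123",)).fetchone()
assert json.loads(row[0])["awaiting"] == "human_approval"
```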
Generated code runs in an E2B Code Interpreter cloud sandbox — an isolated VM with its own filesystem, Python environment, and network. This serves three purposes:
- Safety — nothing executes on your machine. A generated `rm -rf /` would only affect the disposable sandbox.
- pytest execution — the QA agent writes tests, uploads them to the sandbox along with the generated source, runs `pytest`, and reads back structured results (pass/fail counts, error messages, stack traces).
- Live demo — after tests pass, the sandbox starts the generated FastAPI app on a public HTTPS URL so you can preview it before approving.
The sandbox has a configurable lifetime (SANDBOX_TIMEOUT_SECONDS, default 1 hour) with automatic keepalive pings so it survives long-running tasks.
Claude Sonnet powers all code-generation agents (backend, frontend, database, QA, reviewer, orchestrator, planner) because it handles:
- Long structured output — full Python files, OpenAPI specs, React components
- Instruction following — "output only valid JSON" actually means only valid JSON
- Extended thinking — complex planning steps use adaptive thinking for multi-step reasoning
- Prompt caching — agent identity is injected with `cache_control: ephemeral`, cutting identity token cost by ~90% across repeated calls
Claude Haiku handles mechanical agents (git, devops) where output is templated and speed matters more than reasoning depth.
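As a rough sketch, marking the identity block cacheable follows Anthropic's prompt-caching content-block format; the exact wiring in this project may differ:

```python
# Sketch: a stable identity block marked cacheable on Claude calls.
# The message shape follows Anthropic's prompt-caching format (assumption:
# the project forwards this structure through LiteLLM unchanged).
identity_text = (
    "You are the backend agent. Learned rules: always use async endpoints."
)

system = [{
    "type": "text",
    "text": identity_text,
    "cache_control": {"type": "ephemeral"},  # reused across repeated calls
}]

messages = [{"role": "user", "content": "Generate the FastAPI routes."}]

assert system[0]["cache_control"]["type"] == "ephemeral"
```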
The system exposes a REST API (not just a CLI) so it can be driven by the dashboard, Telegram, or any HTTP client. FastAPI was chosen for:
- Async-native — the pipeline is fully async; FastAPI's async endpoints avoid blocking during long LLM calls
- Auto-generated OpenAPI docs — `/docs` gives you interactive API docs for free
- Pydantic validation — request/response models are type-checked at the boundary
- SSE support — `StreamingResponse` streams live LLM conversation events to the dashboard
The Telegram bot is the primary human interface for production use:
- Task started/completed/failed notifications
- Per-agent progress summaries as they complete
- Rich review request with test stats, cost summary, and live demo URL
- Inline keyboard buttons (Approve / Give Feedback / Restart Demo)
- Budget alerts when cost approaches the limit
- Reply to resume paused workflows naturally — the orchestrator LLM interprets your message
Uses long-polling (not webhooks) so it works behind NAT/firewalls without a public URL. Supports HTTP/SOCKS5 proxy for restricted networks.
A real-time dashboard at http://localhost:3000 for monitoring pipeline execution:
| Tab | What You See |
|---|---|
| Overview | Total tasks, active agents, total cost, system uptime |
| Agents | All 9 agent cards with identity, model, status, success rate |
| Tasks | Task list + Kanban pipeline view with per-agent metrics |
| Logs | Recent interactions stream + cost chart over time |
The LLM Conversation panel streams every LLM call in real time via Server-Sent Events — you see exactly what each agent asks the model and what it responds, including token counts and per-call cost.
The Git Agent uses httpx (async HTTP client) to call the GitHub REST API directly — no git binary required on the host. Creates feature branches, commits artifacts, and opens pull requests programmatically.
- Persistent identity per agent — each of the 9 agents stores its expertise, learned rules, past mistakes, and preferences in Firestore. Identity persists across restarts and is model-independent.
- Conditional reflection — every N tasks (configurable, default 3), agents re-read their recent interactions from BigQuery and use an LLM to update their own rules. Duplicate rules are removed by similarity.
- Prompt caching — agent identity is injected with `cache_control: ephemeral` on every Claude call, cutting identity token cost by ~90%.
- Adaptive thinking — complex agents (orchestrator, planner, backend) use extended thinking for planning steps.
Every build agent fixes its own mistakes before returning:
- Backend & Database agents — syntax-check Python files in the E2B sandbox; if broken, run a multi-turn LLM correction pass.
- Frontend agent — run a self-review pass for TypeScript issues (missing imports, wrong API paths, type mismatches); only emit fixed files if issues found.
- QA agent — fast-path: re-run previous tests against updated source (no LLM cost). If test code has errors, fix with LLM. If source code is broken, return failure to the responsible agent.
All agents share a single AgentState dictionary managed by LangGraph:
- `openapi_spec` — Planner writes this first; every downstream agent reads it as the integration contract
- `artifacts` — accumulates all generated files (`{"main.py": "...", "models.py": "..."}`) with zero truncation between agents
- `agent_briefs` — Planner writes per-agent task descriptions
- `feedback` + `review_issues` — QA and Reviewer write structured issues; fix agents read them on retry
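An illustrative slice of that shared state as a TypedDict — a field subset only, since the real `AgentState` has 70+ fields:

```python
# Illustrative slice of the shared LangGraph state (subset of fields).
from typing import TypedDict

class AgentState(TypedDict, total=False):
    openapi_spec: dict           # written by the Planner, read by everyone
    artifacts: dict[str, str]    # filename -> full file contents, never truncated
    agent_briefs: dict[str, str] # per-agent task descriptions from the Planner
    feedback: list[dict]         # structured issues from QA / Reviewer
    iteration: int               # current retry-loop count

state: AgentState = {"artifacts": {}, "iteration": 1}
state["artifacts"]["main.py"] = "from fastapi import FastAPI\n..."
assert "main.py" in state["artifacts"]
```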
- All code runs in an isolated cloud VM on E2B's infrastructure — nothing executes on your machine
- Syntax checking — backend and database files verified before the pipeline continues
- pytest execution — full test suite run in the sandbox with `fastapi`, `uvicorn`, `httpx`, `pytest-asyncio`
- Frontend testing — Jest/Vitest tests for React components, run in the same sandbox
- Live demo — after QA passes, the generated FastAPI app is started in the sandbox and a public HTTPS URL is returned for you to preview
Two interrupt points in the pipeline:
- Pre-deploy review — after code review passes, workflow pauses and sends a rich Telegram message with per-agent build summary, test pass/fail stats, total cost, and live demo URL
- Failure escalation — if `max_iterations` is exhausted, the workflow pauses and asks how to proceed
The orchestrator uses LLM-based intent parsing for human input — no keyword matching. Natural language like "looks great, ship it!" or "the search endpoint is broken, returns too many results" is correctly interpreted and routed.
- `MAX_TASK_BUDGET_USD` — hard per-task cost ceiling (default $0.50)
- When the limit is hit, the workflow pauses and asks via Telegram whether to approve more budget
- Reply naturally: "give it another dollar", "set budget to $2", or "abort"
- The orchestrator LLM parses amounts from natural language (including "a dollar fifty", "50 cents", etc.)
- All task state persisted to `data/tasks.json` — survives server restart
- LangGraph checkpoints stored in `data/checkpoints.db` (SQLite) — paused workflows can be resumed after restart
- Output artifacts optionally written to disk at `{output_directory}/{task_id}/` with preserved subdirectory structure
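A toy version of the disk-backed task state. The real `resilient_storage.py` may differ; the write-temp-then-rename pattern shown here is just one common way to make such a file crash-safe:

```python
# Toy disk-backed task state (illustrative; not the project's actual code).
import json
import os
import tempfile

def save_tasks(tasks: dict, path: str) -> None:
    """Write task state atomically: temp file first, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(tasks, f)
    os.replace(tmp, path)  # atomic on POSIX: readers never see a partial file

path = os.path.join(tempfile.gettempdir(), "tasks_demo.json")
save_tasks({"abc123": {"status": "waiting_for_input"}}, path)

with open(path) as f:
    assert json.load(f)["abc123"]["status"] == "waiting_for_input"
```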
The dashboard Tasks > LLM Conversation panel streams every LLM call in real time via Server-Sent Events:
- Prompt bubble — the exact message sent to the model (expandable)
- Response bubble — the model's reply, token count, and per-call cost
- Agent badge — color-coded by which agent made the call
- "Thinking..." pulse — animated indicator while waiting for a response
- Replay mode — switching to a completed task replays the full conversation from history
- Git agent commits all artifacts to the target repo using GitHub REST API (no local git required)
- Opens a pull request with a structured description
- DevOps agent generates `Dockerfile`, `docker-compose.yml`, `.github/workflows/ci.yml`, and Cloud Run config alongside the PR
- `X-API-Key` header authentication (optional in dev, enforced in production)
- CORS restricted to configured origins
- `MAX_CONCURRENT_TASKS` — prevents resource abuse
- `MAX_TASK_DESCRIPTION_LENGTH` — caps prompt injection surface
- Path traversal prevention on output file writes
- E2B sandbox isolates all executed code from host
Leave `GCP_PROJECT_ID` unset and the system automatically uses:

- `MockFirestoreMemory` — in-memory agent identity
- `MockBigQueryLogger` — in-memory interaction logs
The full 9-agent pipeline works without any cloud account. You only need ANTHROPIC_API_KEY.
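The fallback decision can be sketched as a tiny factory. `make_memory` and the class body here are hypothetical; only the `MockFirestoreMemory` name comes from the project:

```python
# Sketch of the dev-mode fallback: no GCP_PROJECT_ID -> mock storage.
class MockFirestoreMemory:
    """In-memory stand-in for Firestore-backed agent identity."""
    def __init__(self) -> None:
        self.identities: dict[str, dict] = {}  # agent_id -> identity doc

def make_memory(env: dict):
    """Pick real or mock storage based on configuration."""
    if env.get("GCP_PROJECT_ID"):
        raise NotImplementedError("real Firestore client would go here")
    return MockFirestoreMemory()

mem = make_memory({})  # no cloud project configured -> dev mode
assert isinstance(mem, MockFirestoreMemory)
```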
Every agent has a persistent identity stored in Firestore that is completely independent of which AI model powers it:
```
+---------------------------------------------+
| Agent Identity                              |  <-- stored in Firestore
|  expertise, learned_rules, past_mistakes,   |  <-- persists across restarts
|  correction_rules, code_style, beliefs      |  <-- survives model changes
+---------------------------------------------+
                      |
                      |  injected as system prompt context
                      v
+---------------------------------------------+
| LiteLLM                                     |  <-- provider-agnostic router
|  acompletion(model="claude-sonnet-4-6", ..) |  <-- same call for any model
|  acompletion(model="groq/llama-3.3-70b",..) |
|  acompletion(model="ollama/qwen2.5-coder",) |
+---------------------------------------------+
```
You can switch the orchestrator from Claude to DeepSeek by setting MODEL_ORCHESTRATOR=deepseek/deepseek-chat — the orchestrator keeps all its learned rules and expertise unchanged. It simply speaks through a different model.
| Provider | Model format | Example |
|---|---|---|
| Anthropic (Claude) | `claude-{family}-{version}` | `claude-sonnet-4-6` |
| Ollama (local, free) | `ollama/{model}` | `ollama/qwen2.5-coder` |
| Groq (fast, free tier) | `groq/{model}` | `groq/llama-3.3-70b-versatile` |
| DeepSeek | `deepseek/{model}` | `deepseek/deepseek-chat` |
| Google Gemini | `gemini/{model}` | `gemini/gemini-2.0-flash-exp` |
| OpenAI | `gpt-{version}` | `gpt-4o`, `gpt-4o-mini` |
| OpenRouter (free tier) | `openrouter/{org}/{model}:free` | `openrouter/qwen/qwen3-coder:free` |
| OpenRouter (paid) | `openrouter/{org}/{model}` | `openrouter/deepseek/deepseek-r1` |
Each of the 9 agents has its own model, tuned for task complexity:
| Agent | Default Model | Why |
|---|---|---|
| orchestrator | `claude-sonnet-4-6` | Multi-step reasoning, quality scoring, natural language input parsing |
| planner | `claude-sonnet-4-6` | OpenAPI spec design — highest complexity output |
| backend_agent | `claude-sonnet-4-6` | FastAPI code generation, security awareness |
| database_agent | `claude-sonnet-4-6` | Schema design, indexing decisions |
| frontend_agent | `claude-sonnet-4-6` | React + TypeScript, type safety |
| qa_agent | `claude-sonnet-4-6` | Edge case reasoning, test design |
| code_reviewer | `claude-sonnet-4-6` | OWASP analysis requires deep reasoning |
| git_agent | `claude-haiku-4-5` | Deterministic commit messages — fast is enough |
| devops_agent | `claude-haiku-4-5` | Dockerfile/CI YAML — structured, templated |
Override any assignment via environment variable:
```bash
# Run the backend on a free local model, keep orchestrator on Claude
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_ORCHESTRATOR=claude-sonnet-4-6

# Run entire pipeline on free OpenRouter models
MODEL_ORCHESTRATOR=openrouter/meta-llama/llama-3.3-70b-instruct:free
MODEL_PLANNER=openrouter/qwen/qwen3-coder:free
MODEL_BACKEND=openrouter/qwen/qwen3-coder:free
MODEL_QA=openrouter/qwen/qwen3-coder:free
MODEL_REVIEWER=openrouter/deepseek/deepseek-r1:free

# Run entire pipeline locally with Ollama (zero cost, no internet)
MODEL_ORCHESTRATOR=ollama/llama3.2
MODEL_PLANNER=ollama/qwen2.5-coder
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_DATABASE=ollama/qwen2.5-coder
MODEL_FRONTEND=ollama/qwen2.5-coder
MODEL_QA=ollama/llama3.2
MODEL_REVIEWER=ollama/llama3.2
```

Three layers, applied lowest to highest priority:
1. `MODEL_LLM_PARAMS` (per-model or per-provider prefix)
2. `AGENT_LLM_PARAMS` (per-agent, any model)
3. Explicit call-site param (always wins)
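That merge order can be sketched as successive dict updates. `resolve_params` is an illustrative helper, not the project's actual API:

```python
# Sketch of the three-layer param merge, lowest to highest priority.
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}       # layer 1
AGENT_LLM_PARAMS = {"code_reviewer": {"temperature": 0.3}} # layer 2

def resolve_params(model: str, agent: str, call_site: dict) -> dict:
    params: dict = {}
    for prefix, p in MODEL_LLM_PARAMS.items():  # per-model / provider prefix
        if model.startswith(prefix):
            params.update(p)
    params.update(AGENT_LLM_PARAMS.get(agent, {}))  # per-agent overrides
    params.update(call_site)                        # explicit param always wins
    return params

assert resolve_params("ollama/llama3.2", "code_reviewer", {})["temperature"] == 0.3
assert resolve_params("ollama/llama3.2", "planner", {"temperature": 0.9})["temperature"] == 0.9
```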
Examples:

```python
# All Ollama models: low temperature for deterministic output
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}

# Code reviewer: force JSON output (works on Groq, Ollama, OpenAI)
AGENT_LLM_PARAMS = {"code_reviewer": {"response_format": {"type": "json_object"}}}
```

| Feature | Claude | Other providers |
|---|---|---|
| Prompt caching (`cache_control: ephemeral`) | Cuts identity token cost ~90% | Skipped silently |
| Adaptive thinking (`"type": "adaptive"`) | Extended reasoning on complex tasks | Skipped silently |
| Streaming | Default for all calls | Also supported via LiteLLM |
Free-tier providers have aggressive rate limits. The base agent handles this transparently:
- Retried: `RateLimitError` (429), `ServiceUnavailableError` (503), `APIConnectionError`
- Not retried: `AuthenticationError` (bad key), `BadRequestError` (bad prompt)
- Backoff: exponential with jitter — 5s base, doubles each attempt, capped at 60s
- Max retries: 6 by default (configurable via `LLM_MAX_RETRIES`)
- Fallback: after all retries are exhausted, switches to `MODEL_FALLBACK` if configured
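The backoff schedule above — 5s base, doubling each attempt, capped at 60s — can be sketched as follows. Full jitter is assumed here, which may differ from the actual implementation:

```python
# Exponential backoff with full jitter (delays computed, not slept).
import random

BASE, CAP = 5.0, 60.0

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (0-indexed)."""
    ceiling = min(CAP, BASE * (2 ** attempt))  # 5, 10, 20, 40, 60, 60, ...
    return random.uniform(0, ceiling)          # full jitter within the ceiling

delays = [backoff_delay(a) for a in range(6)]
assert all(0 <= d <= 60.0 for d in delays)
assert min(CAP, BASE * 2 ** 4) == 60.0  # by attempt 4 the cap is reached
```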
`config.cost_usd(model, input_tokens, output_tokens)` handles pricing across all providers:

- `:free` suffix -> $0.00 (OpenRouter free tier)
- `ollama/` prefix -> $0.00 (local inference)
- Known models -> looked up from the `MODEL_PRICING` dict in `config.py`
- Unknown models -> $0.00 (conservative default)
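A minimal sketch of those pricing rules — the `MODEL_PRICING` numbers below are illustrative, not the project's actual rates:

```python
# Sketch of the cost_usd pricing rules (illustrative prices).
MODEL_PRICING = {  # USD per 1M tokens: (input, output)
    "claude-sonnet-4-6": (3.00, 15.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    if model.endswith(":free") or model.startswith("ollama/"):
        return 0.0                       # free tiers and local inference
    if model in MODEL_PRICING:
        inp, out = MODEL_PRICING[model]  # known model: look up rates
        return (input_tokens * inp + output_tokens * out) / 1_000_000
    return 0.0                           # unknown model: conservative default

assert cost_usd("ollama/qwen2.5-coder", 10_000, 2_000) == 0.0
assert cost_usd("openrouter/qwen/qwen3-coder:free", 10_000, 2_000) == 0.0
assert abs(cost_usd("claude-sonnet-4-6", 1_000_000, 0) - 3.00) < 1e-9
```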
Each agent maintains a persistent identity profile:
```
AgentIdentity {
  expertise:        ["FastAPI", "SQLAlchemy", "JWT auth"]
  learned_rules:    ["Always use async endpoints", "Parameterize all SQL queries"]
  past_mistakes:    ["Forgot to add CORS middleware in v1"]
  correction_rules: ["After adding auth, always test with an invalid token"]
  code_style:       "Google Python style guide"
  beliefs:          { "prefer_postgresql": true, "use_alembic": true }
  rule_stats:       { "Always use async endpoints": { use_count: 12, success_count: 11 } }
  version:          12
}
```
Every REFLECTION_INTERVAL tasks (default: 3), the agent:
- Reads its recent interactions from BigQuery (successes and failures)
- Calls the LLM: "What new rules should I add based on these outcomes?"
- Quality gate — rejects rules that are too short (<15 chars), too long (>200 chars), or task-specific (mentions filenames, route paths, specific identifiers)
- Deduplication — exact match check, then semantic similarity (cosine over keyword tokens, threshold 65%) against existing rules
- Saves the diff to Firestore and logs it to the `memory_diffs` BigQuery table
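The deduplication gate can be sketched as cosine similarity over keyword counts. The tokenization here is deliberately naive and may differ from the real implementation:

```python
# Sketch of the semantic dedup gate: cosine over keyword tokens,
# rejecting candidate rules above the 65% similarity threshold.
import math
import re

def tokens(rule: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for w in re.findall(r"[a-z]+", rule.lower()):
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_duplicate(new_rule: str, existing: list[str], threshold: float = 0.65) -> bool:
    return any(cosine(tokens(new_rule), tokens(r)) >= threshold for r in existing)

existing = ["Always use async endpoints"]
assert is_duplicate("Use async endpoints always", existing)       # reworded copy
assert not is_duplicate("Parameterize all SQL queries", existing)  # genuinely new
```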
Every rule accumulates a track record:
- After each task, `record_rule_outcomes()` increments `use_count` and (if successful) `success_count` for every active rule
- `get_weak_rules()` surfaces rules with high usage but low success rate (used 10+ times, <55% success)
- Weak rules become candidates for removal in the next reflection cycle
| Control | Default | Purpose |
|---|---|---|
| `REFLECTION_ENABLED` | `true` | Master switch — counters still update when off |
| `REFLECTION_DRY_RUN` | `false` | Log changes without writing to Firestore |
| `REFLECTION_MAX_RULE_LENGTH` | `200` | Reject verbose, task-specific rules |
| Version snapshots | Last 10 kept | Roll back any agent to any previous identity version |
The result: agents that get measurably better at their specialty over time without any manual prompt engineering.
- Python 3.11+
- Node.js 18+ (for the dashboard)
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env -- minimum required:
# ANTHROPIC_API_KEY=sk-ant-...
```

```bash
uvicorn main:app --reload
# API:  http://localhost:8000
# Docs: http://localhost:8000/docs
```

```bash
cd dashboard
npm install
npm run dev
# Dashboard: http://localhost:3000
```

```bash
curl -X POST http://localhost:8000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task_description": "Build a REST API for a todo list with user auth and JWT",
    "max_iterations": 3,
    "use_sandbox": false
  }'
```

Or use the dashboard Submit Task form.
| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key (or use another provider via model overrides) |
| Variable | Default | Description |
|---|---|---|
| `GCP_PROJECT_ID` | (empty) | Google Cloud project — leave blank for dev mode with mocks |
| `BQ_DATASET` | `agent_system` | BigQuery dataset name |
| `FIRESTORE_COLLECTION` | `agents` | Firestore collection for agent identity |
| `E2B_API_KEY` | (empty) | E2B sandbox — sandbox disabled if not set |
| `GITHUB_TOKEN` | (empty) | GitHub PAT — git/devops agents disabled if not set |
| `GITHUB_REPO` | (empty) | `owner/repo` — target repository for commits/PRs |
| Variable | Description |
|---|---|
| `TELEGRAM_BOT_TOKEN` | Get from @BotFather |
| `TELEGRAM_CHAT_ID` | Your chat ID — get from @userinfobot |
| `TELEGRAM_PROXY` | `http://host:port` or `socks5://host:port` |
| Variable | Default | Description |
|---|---|---|
| `MAX_ITERATIONS` | `3` | Max retry loops before escalation |
| `MAX_CONCURRENT_TASKS` | `5` | Concurrent task limit (429 on overflow) |
| `MAX_TASK_BUDGET_USD` | `0.50` | Per-task cost ceiling in USD |
| `REFLECTION_INTERVAL` | `3` | Reflect every N completed tasks |
| `REFLECTION_ENABLED` | `true` | Toggle agent reflection on/off |
| `REFLECTION_DRY_RUN` | `false` | Log reflection changes without saving |
| `LOCAL_OUTPUT_DIR` | (empty) | Write generated files to disk |
| `SANDBOX_TIMEOUT_SECONDS` | `3600` | E2B sandbox lifetime (1 hour) |
| `FRONTEND_SELF_REVIEW` | `false` | Enable frontend agent self-review pass |
All default to claude-sonnet-4-6 (or claude-haiku-4-5 for git/devops):
`MODEL_ORCHESTRATOR`, `MODEL_PLANNER`, `MODEL_BACKEND`, `MODEL_DATABASE`, `MODEL_FRONTEND`, `MODEL_QA`, `MODEL_REVIEWER`, `MODEL_GIT`, `MODEL_DEVOPS`

`MODEL_FALLBACK` — tried when the primary model exhausts its retries
| Variable | Default | Description |
|---|---|---|
| `LLM_MAX_RETRIES` | `6` | Retry count on rate limits / transient errors |
| `LLM_RETRY_BASE_DELAY` | `5.0` | Base backoff delay in seconds |
| `LLM_RETRY_MAX_DELAY` | `60.0` | Max backoff cap in seconds |
| Variable | Default | Description |
|---|---|---|
| `API_KEY` | (empty) | Require `X-API-Key` header (empty = no auth) |
| `CORS_ORIGINS` | `localhost:3000,localhost:5173` | Allowed origins |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/tasks` | Submit a new task |
| GET | `/tasks` | List all tasks |
| GET | `/tasks/{id}` | Poll task status + results |
| POST | `/tasks/{id}/reply` | Resume a paused task with human input |
| DELETE | `/tasks/{id}` | Cancel a running task |
| GET | `/agents` | All agent identity summaries |
| GET | `/agents/{id}/identity` | Full agent identity (rules, mistakes, beliefs) |
| GET | `/logs` | Recent interactions (dev mode) |
| GET | `/health` | System health + config summary |
| GET | `/tasks/{id}/events` | SSE stream — live `llm_start` / `llm_end` events |
| GET | `/stats` | BigQuery summary (dev mode) |
```json
{
  "task_description": "Build a user auth API with JWT",
  "max_iterations": 3,
  "use_sandbox": true,
  "output_directory": "/absolute/path/to/write/files"
}
```

```json
{
  "task_id": "abc123...",
  "status": "running | completed | failed | waiting_for_input",
  "current_agent": "backend_agent",
  "pipeline": ["backend_agent", "database_agent", "qa_agent", "code_reviewer"],
  "iteration": 1,
  "artifacts": { "main.py": "...", "models.py": "..." },
  "agent_stats": {
    "backend_agent": { "status": "completed", "cost_usd": 0.012, "tokens": 4200, "files": ["main.py"] }
  },
  "agent_timeline": [
    { "ts": "2026-02-27T14:23:01", "agent": "backend_agent", "event": "completed", "summary": "..." }
  ],
  "pr_url": "https://github.com/owner/repo/pull/42",
  "demo_url": "https://sandbox-abc.e2b.dev",
  "needs_human_input": false,
  "human_input_prompt": null,
  "total_cost_usd": 0.087,
  "output_path": "/path/to/written/files"
}
```

```
autonomy/
+-- main.py                  # FastAPI server + all routes + workflow execution
+-- config.py                # Model routing, pricing, env vars, LLM param overrides
+-- resilient_storage.py     # Disk-based task state persistence
+-- requirements.txt
+-- .env.example
|
+-- agents/
|   +-- base_agent.py        # call_claude(), reflection, retry logic, prompt caching
|   +-- event_bus.py         # Per-task SSE event queue (llm_start/llm_end)
|   +-- orchestrator.py      # Quality gates, retry routing, human input interpretation
|   +-- planner.py           # OpenAPI spec + pipeline design
|   +-- backend_agent.py     # FastAPI code generation + syntax self-correction
|   +-- database_agent.py    # SQLAlchemy + Alembic + syntax self-correction
|   +-- frontend_agent.py    # React + TypeScript + optional self-review
|   +-- qa_agent.py          # pytest/Jest generation + sandbox execution + fast-path reuse
|   +-- code_reviewer.py     # OWASP + security scoring + structured issue output
|   +-- git_agent.py         # GitHub commit + PR via REST API
|   +-- devops_agent.py      # Docker + CI/CD + Cloud Run config generation
|
+-- graph/
|   +-- state.py             # AgentState TypedDict (70+ fields)
|   +-- workflow.py          # LangGraph pipeline: nodes, routers, interrupt points
|
+-- memory/
|   +-- firestore_memory.py  # AgentIdentity class + reflection logic + version snapshots
|
+-- observability/
|   +-- bigquery_logger.py   # 3 BigQuery tables: interactions, metrics, memory_diffs
|
+-- sandbox/
|   +-- e2b_sandbox.py       # E2B cloud VM: pytest, Jest, syntax check, live demo
|
+-- telegram_bot/
|   +-- bot.py               # Notifications, inline keyboards, reply handling
|
+-- tests/
|   +-- mocks.py             # MockFirestoreMemory + MockBigQueryLogger for dev mode
|
+-- dashboard/               # React + Vite monitoring UI
|   +-- src/
|       +-- App.tsx          # 4-tab layout (Overview, Agents, Tasks, Logs)
|       +-- api/client.ts    # Typed API client with SSE support
|       +-- components/      # 9 React components (AgentGrid, TaskList, CostChart, etc.)
|
+-- data/                    # Auto-created, git-ignored
    +-- tasks.json           # Persisted task state
    +-- checkpoints.db       # LangGraph SQLite checkpointer
```
| Component | Library | Why |
|---|---|---|
| Workflow orchestration | LangGraph | Stateful graph with interrupts, checkpointing, conditional routing |
| LLM abstraction | LiteLLM | One acompletion() call for any provider — Claude, Ollama, Groq, etc. |
| LLM provider (primary) | Anthropic Claude | Best instruction following, prompt caching, extended thinking |
| API server | FastAPI | Async-native, auto-docs, Pydantic validation, SSE streaming |
| Validation | Pydantic v2 | Request/response type checking at API boundaries |
| Component | Library | Why |
|---|---|---|
| Agent identity | Google Firestore | Document model for nested JSON identity, version snapshots, serverless |
| Interaction logging | Google BigQuery | Analytical queries for reflection, free tier, fine-tuning dataset |
| Workflow checkpoints | SQLite (via langgraph-checkpoint-sqlite) | Durable pause/resume, zero infrastructure, single file |
| Task state | JSON file (`data/tasks.json`) | Simple, human-readable, survives restart |
| Component | Library | Why |
|---|---|---|
| Code sandbox | E2B Code Interpreter | Isolated cloud VM, pytest/Jest execution, live demo URLs |
| GitHub integration | httpx | Async HTTP client, no git binary dependency |
| Notifications | python-telegram-bot | Long-polling (works behind NAT), inline keyboards, proxy support |
| Component | Library | Why |
|---|---|---|
| UI framework | React 18 | Component model, hooks, wide ecosystem |
| Language | TypeScript 5 | Type safety for API client, component props |
| Build tool | Vite 5 | Fast HMR, minimal config |
| Charts | Recharts | Simple declarative charts for cost/activity data |
| Icons | Lucide React | Consistent icon set, tree-shakeable |
- One concurrent human approval — only one task can be waiting for Telegram reply at a time
- E2B demo URL — requires E2B API key; app must be FastAPI with a standard entry point
- Git integration — only supports GitHub (not GitLab/Bitbucket)
- Dashboard polling — task list polls every 3s (LLM conversation panel uses SSE for true real-time)
- WebSocket-based real-time dashboard streaming
- Multiple concurrent human-in-the-loop approvals
- GitLab and Bitbucket support
- Playwright tests for generated frontends
- Agent-to-agent messaging (agents can ask each other questions mid-task)
- Fine-tuning export — one-click export of BigQuery interactions to JSONL for LoRA/QLoRA
- Weak rule auto-pruning based on rule confidence scores
This project was designed and built by Sai Narne with significant development assistance from Claude (Anthropic) — Claude Sonnet 4.6 and Claude Opus 4.6 contributed to architecture design, agent implementation, prompt engineering, and debugging across the entire codebase.
Built with LangGraph, Anthropic Claude, LiteLLM, and E2B.