Version 1.0
A production-grade AI system where 9 specialized agents collaborate to turn a single English sentence into a fully built, tested, reviewed, and deployed full-stack application — and get better at it with every task.
Most AI coding tools are single-shot: you prompt, you get code, you manually fix it. Autonomy asks a different question: what if the entire software development lifecycle — design, build, test, review, deploy — was an autonomous pipeline of specialized AI agents that learn from their own mistakes?
The core insight is that software development is not one skill — it's many. A good API designer thinks differently from a good test writer. A security reviewer catches things a code generator misses. By splitting the work across 9 agents, each with deep expertise in one domain, the system produces better results than any single monolithic prompt could.
But specialization alone isn't enough. The agents need to coordinate (the frontend must match the backend's API contract), self-correct (fix their own syntax errors before passing work downstream), and learn (remember that "always use async endpoints" after making the mistake once). Autonomy builds all three into the architecture.
- Contract-first development — The Planner writes an OpenAPI spec before any code. Every downstream agent reads this spec as the single source of truth. No agent guesses what another built.
- Identity ≠ Model — Each agent has a persistent identity (expertise, rules, past mistakes) stored in Firestore that survives model swaps. You can switch the backend agent from Claude to Llama and it keeps all its learned rules. The model is just an interchangeable inference engine.
- Self-correction before escalation — Every build agent checks its own output (syntax verification, self-review) before returning. Problems caught early cost nothing; problems caught late require a full retry loop.
- Human-in-the-loop, not human-out-of-the-loop — The system pauses and asks for approval before deploying. You see test results, review scores, a live demo URL, and total cost before deciding.
- Zero-cloud dev mode — The full pipeline works with just `ANTHROPIC_API_KEY`. No GCP, no E2B, no GitHub required. Mock storage fills in automatically.
You describe an app in one sentence. Autonomy runs a nine-agent pipeline that:
- Designs the full API contract (OpenAPI 3.1 spec)
- Writes the backend (FastAPI + SQLAlchemy)
- Writes the database layer (models + Alembic migrations)
- Writes the frontend (React + TypeScript)
- Tests everything (pytest + Jest in a sandboxed cloud VM)
- Reviews code for security and quality (OWASP Top 10)
- Pauses and asks you for approval (via Telegram or dashboard)
- Commits the code and opens a GitHub PR
- Generates Docker + CI/CD + Cloud Run deployment config
If tests fail or the reviewer finds issues, the orchestrator routes back to the responsible agent with structured feedback (exact file, line, error message) and retries — up to MAX_ITERATIONS times before escalating to you.
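The routing decision itself is simple enough to sketch in a few lines. This is an illustration, not the project's actual code — `route_failure` and the issue shape shown here are hypothetical, mirroring the structured feedback (agent, file, line, message) described above:

```python
# Simplified sketch of the orchestrator's retry routing (illustrative only;
# the real system routes via LangGraph conditional edges).

MAX_ITERATIONS = 3

def route_failure(issue: dict, iteration: int) -> str:
    """Return the next pipeline step for a failed check."""
    if iteration >= MAX_ITERATIONS:
        return "escalate_to_human"
    # The agent that owns the broken file gets the structured feedback.
    return issue["agent"]

issue = {"agent": "backend_agent", "file": "main.py", "line": 42,
         "message": "NameError: name 'db' is not defined"}

assert route_failure(issue, iteration=1) == "backend_agent"
assert route_failure(issue, iteration=3) == "escalate_to_human"
```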
```
START
  |
  v
+-----------------+
|  Orchestrator   |  Plans the task, sets quality criteria, drives the retry loop
+--------+--------+
         |
         v
+-----------------+
|    Planner      |  Generates OpenAPI spec -- the shared contract for all agents
+--------+--------+
         |
         v  (dynamic -- agents included based on task type)
+---------------------------------------------+
| Backend Agent  | Database Agent | Frontend  |
| FastAPI routes | SQLAlchemy ORM | React/TS  |
+---------------------------------------------+
         |
         v
+-----------------+
|    QA Agent     |  Writes pytest suite, runs in E2B sandbox, self-corrects
+--------+--------+
         |
         v
+-----------------+
|  Code Reviewer  |  OWASP + performance scoring, routes back on failure
+--------+--------+
         |   ^ retry loop (up to max_iterations)
         v
+-----------------+
|  Human Review   |  Pauses -> sends Telegram message with summary + live demo URL
+--------+--------+
         |  (approve / reject with feedback)
         v
+-----------------+     +-----------------+
|    Git Agent    |---->|  DevOps Agent   |
|   Commit + PR   |     |  Docker + CI/CD |
+-----------------+     +-----------------+
         |
        END
```
Every library in the stack was chosen for a specific reason. Here's the decision log.
Why not just chain LLM calls in a loop? Because the pipeline has:
- Conditional branching — the orchestrator routes failures to whichever agent owns the broken file
- Human-in-the-loop interrupts — the workflow pauses mid-execution, waits for Telegram/dashboard input, then resumes exactly where it left off
- Persistent checkpointing — if the server crashes during a retry loop, it picks up from the last checkpoint on restart
- Dynamic pipelines — the planner decides at runtime which agents are needed (a backend-only task skips frontend and database agents)
LangGraph gives us all of this as a first-class state machine with typed state (AgentState TypedDict), interrupt nodes, and pluggable checkpointers. The alternative was hand-rolling state persistence, interrupt/resume logic, and routing — which is exactly the kind of infrastructure that breaks in edge cases.
Every agent calls litellm.acompletion() — one function, same signature, any provider. This means:
- Switch any agent to a different model by changing one env var (`MODEL_BACKEND=ollama/qwen2.5-coder`)
- Run the entire pipeline locally with Ollama (zero cost, no internet)
- Use free-tier providers (OpenRouter `:free`, Groq) for development
- Automatic retry with exponential backoff on rate limits (429) and transient errors (503)
- Fallback to a backup model if the primary exhausts retries
LiteLLM translates between Anthropic, OpenAI, Ollama, Groq, DeepSeek, Gemini, and OpenRouter APIs — no provider-specific SDK code anywhere in agent logic.
Each agent's identity (expertise, learned rules, past mistakes, correction rules, code style, beliefs) is stored in Firestore. Why Firestore specifically?
- Document model — identity is a nested JSON document (lists of rules, dicts of beliefs), not relational rows
- Real-time updates — agents read their identity at the start of every task and write back after reflection
- Version snapshots — every identity save creates a versioned subcollection entry; you can roll back any agent to any previous version
- Serverless — no database to manage, automatic scaling, free tier covers development
The identity is completely model-independent. Swapping MODEL_BACKEND from Claude to DeepSeek changes the inference engine but preserves every learned rule.
Every LLM call is logged to BigQuery with full context: system prompt, user prompt, model response, token counts, cost, success/failure. Three tables:
| Table | Purpose |
|---|---|
| `agent_interactions` | Full input/output per agent call — the raw training data for future fine-tuning |
| `agent_metrics` | Performance metrics over time — success rates, token efficiency, cost trends |
| `memory_diffs` | What changed in each agent's identity after reflection — the audit trail for self-improvement |
Why BigQuery? Because the reflection system needs to query "show me this agent's last N interactions where success=false" — that's a SQL query, not a document lookup. BigQuery handles analytical queries over growing datasets without index management. The free tier (10 GB storage, 1 TB queries/month) covers even heavy development.
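In dev mode the same lookup can be mimicked in memory. A toy illustration — the `recent_failures` helper is hypothetical, not the real `MockBigQueryLogger` API:

```python
# Hypothetical in-memory version of the reflection query
# "show me this agent's last N interactions where success=false".
interactions = [
    {"agent": "qa_agent", "success": True,  "ts": 1},
    {"agent": "qa_agent", "success": False, "ts": 2},
    {"agent": "qa_agent", "success": False, "ts": 3},
]

def recent_failures(rows: list[dict], agent: str, n: int) -> list[dict]:
    failed = [r for r in rows if r["agent"] == agent and not r["success"]]
    return sorted(failed, key=lambda r: r["ts"], reverse=True)[:n]

assert [r["ts"] for r in recent_failures(interactions, "qa_agent", 2)] == [3, 2]
```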
LangGraph checkpoints (the full pipeline state at each node) are stored in SQLite via AsyncSqliteSaver. This means:
- Paused workflows survive server restarts — a task waiting for human approval at 2am is still waiting when the server starts at 9am
- The retry loop state is durable — iteration count, accumulated artifacts, test results all persist
- Zero infrastructure — SQLite is a single file (`data/checkpoints.db`), no database server

Falls back to the in-memory `MemorySaver` if `langgraph-checkpoint-sqlite` isn't installed (checkpoints are lost on restart, but the pipeline still works).
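The durability argument is easy to demonstrate with plain `sqlite3`. This is a simplified stand-in for what `AsyncSqliteSaver` does; the real checkpoint schema differs:

```python
# Minimal illustration of durable checkpointing in a single SQLite file.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # would be data/checkpoints.db on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints (task_id TEXT PRIMARY KEY, state TEXT)"
)

# Persist the pipeline state at a pause point.
state = {"iteration": 2, "artifacts": {"main.py": "..."}, "awaiting": "human_approval"}
conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
             ("abc123", json.dumps(state)))
conn.commit()

# After a "restart", the paused state is still there to resume from.
row = conn.execute("SELECT state FROM checkpoints WHERE task_id=?",
                   ("abc123",)).fetchone()
assert json.loads(row[0])["awaiting"] == "human_approval"
```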
Generated code runs in an E2B Code Interpreter cloud sandbox — an isolated VM with its own filesystem, Python environment, and network. This serves three purposes:
- Safety — nothing executes on your machine. A generated `rm -rf /` would only affect the disposable sandbox.
- pytest execution — the QA agent writes tests, uploads them to the sandbox along with the generated source, runs `pytest`, and reads back structured results (pass/fail counts, error messages, stack traces).
- Live demo — after tests pass, the sandbox starts the generated FastAPI app on a public HTTPS URL so you can preview it before approving.
The sandbox has a configurable lifetime (SANDBOX_TIMEOUT_SECONDS, default 1 hour) with automatic keepalive pings so it survives long-running tasks.
Claude Sonnet powers all code-generation agents (backend, frontend, database, QA, reviewer, orchestrator, planner) because it handles:
- Long structured output — full Python files, OpenAPI specs, React components
- Instruction following — "output only valid JSON" actually means only valid JSON
- Extended thinking — complex planning steps use adaptive thinking for multi-step reasoning
- Prompt caching — agent identity is injected with `cache_control: ephemeral`, cutting identity token cost by ~90% across repeated calls
Claude Haiku handles mechanical agents (git, devops) where output is templated and speed matters more than reasoning depth.
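As a rough sketch, marking the identity block cacheable follows Anthropic's prompt-caching content-block format; the exact wiring in this project may differ:

```python
# Sketch: a stable identity block marked cacheable on Claude calls.
# The message shape follows Anthropic's prompt-caching format (assumption:
# the project forwards this structure through LiteLLM unchanged).
identity_text = (
    "You are the backend agent. Learned rules: always use async endpoints."
)

system = [{
    "type": "text",
    "text": identity_text,
    "cache_control": {"type": "ephemeral"},  # reused across repeated calls
}]

messages = [{"role": "user", "content": "Generate the FastAPI routes."}]

assert system[0]["cache_control"]["type"] == "ephemeral"
```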
The system exposes a REST API (not just a CLI) so it can be driven by the dashboard, Telegram, or any HTTP client. FastAPI was chosen for:
- Async-native — the pipeline is fully async; FastAPI's async endpoints avoid blocking during long LLM calls
- Auto-generated OpenAPI docs — `/docs` gives you interactive API docs for free
- Pydantic validation — request/response models are type-checked at the boundary
- SSE support — `StreamingResponse` streams live LLM conversation events to the dashboard
The Telegram bot is the primary human interface for production use:
- Task started/completed/failed notifications
- Per-agent progress summaries as they complete
- Rich review request with test stats, cost summary, and live demo URL
- Inline keyboard buttons (Approve / Give Feedback / Restart Demo)
- Budget alerts when cost approaches the limit
- Reply to resume paused workflows naturally — the orchestrator LLM interprets your message
Uses long-polling (not webhooks) so it works behind NAT/firewalls without a public URL. Supports HTTP/SOCKS5 proxy for restricted networks.
A real-time dashboard at http://localhost:3000 for monitoring pipeline execution:
| Tab | What You See |
|---|---|
| Overview | Total tasks, active agents, total cost, system uptime |
| Agents | All 9 agent cards with identity, model, status, success rate |
| Tasks | Task list + Kanban pipeline view with per-agent metrics |
| Logs | Recent interactions stream + cost chart over time |
The LLM Conversation panel streams every LLM call in real time via Server-Sent Events — you see exactly what each agent asks the model and what it responds, including token counts and per-call cost.
The Git Agent uses httpx (async HTTP client) to call the GitHub REST API directly — no git binary required on the host. Creates feature branches, commits artifacts, and opens pull requests programmatically.
- Persistent identity per agent — each of the 9 agents stores its expertise, learned rules, past mistakes, and preferences in Firestore. Identity persists across restarts and is model-independent.
- Conditional reflection — every N tasks (configurable, default 3), agents re-read their recent interactions from BigQuery and use an LLM to update their own rules. Duplicate rules are removed by similarity.
- Prompt caching — agent identity is injected with `cache_control: ephemeral` on every Claude call, cutting identity token cost by ~90%.
- Adaptive thinking — complex agents (orchestrator, planner, backend) use extended thinking for planning steps.
Every build agent fixes its own mistakes before returning:
- Backend & Database agents — syntax-check Python files in the E2B sandbox; if broken, run a multi-turn LLM correction pass.
- Frontend agent — run a self-review pass for TypeScript issues (missing imports, wrong API paths, type mismatches); only emit fixed files if issues found.
- QA agent — fast-path: re-run previous tests against updated source (no LLM cost). If test code has errors, fix with LLM. If source code is broken, return failure to the responsible agent.
All agents share a single AgentState dictionary managed by LangGraph:
- `openapi_spec` — Planner writes this first; every downstream agent reads it as the integration contract
- `artifacts` — accumulates all generated files (`{"main.py": "...", "models.py": "..."}`) with zero truncation between agents
- `agent_briefs` — Planner writes per-agent task descriptions
- `feedback` + `review_issues` — QA and Reviewer write structured issues; fix agents read them on retry
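An illustrative slice of that shared state as a TypedDict — a field subset only, since the real `AgentState` has 70+ fields:

```python
# Illustrative slice of the shared LangGraph state (subset of fields).
from typing import TypedDict

class AgentState(TypedDict, total=False):
    openapi_spec: dict           # written by the Planner, read by everyone
    artifacts: dict[str, str]    # filename -> full file contents, never truncated
    agent_briefs: dict[str, str] # per-agent task descriptions from the Planner
    feedback: list[dict]         # structured issues from QA / Reviewer
    iteration: int               # current retry-loop count

state: AgentState = {"artifacts": {}, "iteration": 1}
state["artifacts"]["main.py"] = "from fastapi import FastAPI\n..."
assert "main.py" in state["artifacts"]
```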
- All code runs in an isolated cloud VM on E2B's infrastructure — nothing executes on your machine
- Syntax checking — backend and database files verified before the pipeline continues
- pytest execution — full test suite run in the sandbox with `fastapi`, `uvicorn`, `httpx`, `pytest-asyncio`
- Frontend testing — Jest/Vitest tests for React components, run in the same sandbox
- Live demo — after QA passes, the generated FastAPI app is started in the sandbox and a public HTTPS URL is returned for you to preview
Two interrupt points in the pipeline:
- Pre-deploy review — after code review passes, workflow pauses and sends a rich Telegram message with per-agent build summary, test pass/fail stats, total cost, and live demo URL
- Failure escalation — if `max_iterations` is exhausted, the workflow pauses and asks how to proceed
The orchestrator uses LLM-based intent parsing for human input — no keyword matching. Natural language like "looks great, ship it!" or "the search endpoint is broken, returns too many results" is correctly interpreted and routed.
- `MAX_TASK_BUDGET_USD` — hard per-task cost ceiling (default $0.50)
- When the limit is hit, the workflow pauses and asks via Telegram whether to approve more budget
- Reply naturally: "give it another dollar", "set budget to $2", or "abort"
- The orchestrator LLM parses amounts from natural language (including "a dollar fifty", "50 cents", etc.)
- All task state persisted to `data/tasks.json` — survives server restart
- LangGraph checkpoints stored in `data/checkpoints.db` (SQLite) — paused workflows can be resumed after restart
- Output artifacts optionally written to disk at `{output_directory}/{task_id}/` with preserved subdirectory structure
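A toy version of the disk-backed task state. The real `resilient_storage.py` may differ; the write-temp-then-rename pattern shown here is just one common way to make such a file crash-safe:

```python
# Toy disk-backed task state (illustrative; not the project's actual code).
import json
import os
import tempfile

def save_tasks(tasks: dict, path: str) -> None:
    """Write task state atomically: temp file first, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(tasks, f)
    os.replace(tmp, path)  # atomic on POSIX: readers never see a partial file

path = os.path.join(tempfile.gettempdir(), "tasks_demo.json")
save_tasks({"abc123": {"status": "waiting_for_input"}}, path)

with open(path) as f:
    assert json.load(f)["abc123"]["status"] == "waiting_for_input"
```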
The dashboard Tasks > LLM Conversation panel streams every LLM call in real time via Server-Sent Events:
- Prompt bubble — the exact message sent to the model (expandable)
- Response bubble — the model's reply, token count, and per-call cost
- Agent badge — color-coded by which agent made the call
- "Thinking..." pulse — animated indicator while waiting for a response
- Replay mode — switching to a completed task replays the full conversation from history
- Git agent commits all artifacts to the target repo using GitHub REST API (no local git required)
- Opens a pull request with a structured description
- DevOps agent generates `Dockerfile`, `docker-compose.yml`, `.github/workflows/ci.yml`, and Cloud Run config alongside the PR
- `X-API-Key` header authentication (optional in dev, enforced in production)
- CORS restricted to configured origins
- `MAX_CONCURRENT_TASKS` — prevents resource abuse
- `MAX_TASK_DESCRIPTION_LENGTH` — caps prompt injection surface
- Path traversal prevention on output file writes
- E2B sandbox isolates all executed code from host
Leave `GCP_PROJECT_ID` unset and the system automatically uses:

- `MockFirestoreMemory` — in-memory agent identity
- `MockBigQueryLogger` — in-memory interaction logs
The full 9-agent pipeline works without any cloud account. You only need ANTHROPIC_API_KEY.
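The fallback decision can be sketched as a tiny factory. `make_memory` and the class body here are hypothetical; only the `MockFirestoreMemory` name comes from the project:

```python
# Sketch of the dev-mode fallback: no GCP_PROJECT_ID -> mock storage.
class MockFirestoreMemory:
    """In-memory stand-in for Firestore-backed agent identity."""
    def __init__(self) -> None:
        self.identities: dict[str, dict] = {}  # agent_id -> identity doc

def make_memory(env: dict):
    """Pick real or mock storage based on configuration."""
    if env.get("GCP_PROJECT_ID"):
        raise NotImplementedError("real Firestore client would go here")
    return MockFirestoreMemory()

mem = make_memory({})  # no cloud project configured -> dev mode
assert isinstance(mem, MockFirestoreMemory)
```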
Every agent has a persistent identity stored in Firestore that is completely independent of which AI model powers it:
```
+---------------------------------------------+
| Agent Identity                              |  <-- stored in Firestore
|  expertise, learned_rules, past_mistakes,   |  <-- persists across restarts
|  correction_rules, code_style, beliefs      |  <-- survives model changes
+---------------------------------------------+
                      |
                      |  injected as system prompt context
                      v
+---------------------------------------------+
| LiteLLM                                     |  <-- provider-agnostic router
|  acompletion(model="claude-sonnet-4-6", ..) |  <-- same call for any model
|  acompletion(model="groq/llama-3.3-70b",..) |
|  acompletion(model="ollama/qwen2.5-coder",) |
+---------------------------------------------+
```
You can switch the orchestrator from Claude to DeepSeek by setting MODEL_ORCHESTRATOR=deepseek/deepseek-chat — the orchestrator keeps all its learned rules and expertise unchanged. It simply speaks through a different model.
| Provider | Model format | Example |
|---|---|---|
| Anthropic (Claude) | `claude-{family}-{version}` | `claude-sonnet-4-6` |
| Ollama (local, free) | `ollama/{model}` | `ollama/qwen2.5-coder` |
| Groq (fast, free tier) | `groq/{model}` | `groq/llama-3.3-70b-versatile` |
| DeepSeek | `deepseek/{model}` | `deepseek/deepseek-chat` |
| Google Gemini | `gemini/{model}` | `gemini/gemini-2.0-flash-exp` |
| OpenAI | `gpt-{version}` | `gpt-4o`, `gpt-4o-mini` |
| OpenRouter (free tier) | `openrouter/{org}/{model}:free` | `openrouter/qwen/qwen3-coder:free` |
| OpenRouter (paid) | `openrouter/{org}/{model}` | `openrouter/deepseek/deepseek-r1` |
Each of the 9 agents has its own model, tuned for task complexity:
| Agent | Default Model | Why |
|---|---|---|
| orchestrator | `claude-sonnet-4-6` | Multi-step reasoning, quality scoring, natural language input parsing |
| planner | `claude-sonnet-4-6` | OpenAPI spec design — highest complexity output |
| backend_agent | `claude-sonnet-4-6` | FastAPI code generation, security awareness |
| database_agent | `claude-sonnet-4-6` | Schema design, indexing decisions |
| frontend_agent | `claude-sonnet-4-6` | React + TypeScript, type safety |
| qa_agent | `claude-sonnet-4-6` | Edge case reasoning, test design |
| code_reviewer | `claude-sonnet-4-6` | OWASP analysis requires deep reasoning |
| git_agent | `claude-haiku-4-5` | Deterministic commit messages — fast is enough |
| devops_agent | `claude-haiku-4-5` | Dockerfile/CI YAML — structured, templated |
Override any assignment via environment variable:
```bash
# Run the backend on a free local model, keep orchestrator on Claude
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_ORCHESTRATOR=claude-sonnet-4-6

# Run entire pipeline on free OpenRouter models
MODEL_ORCHESTRATOR=openrouter/meta-llama/llama-3.3-70b-instruct:free
MODEL_PLANNER=openrouter/qwen/qwen3-coder:free
MODEL_BACKEND=openrouter/qwen/qwen3-coder:free
MODEL_QA=openrouter/qwen/qwen3-coder:free
MODEL_REVIEWER=openrouter/deepseek/deepseek-r1:free

# Run entire pipeline locally with Ollama (zero cost, no internet)
MODEL_ORCHESTRATOR=ollama/llama3.2
MODEL_PLANNER=ollama/qwen2.5-coder
MODEL_BACKEND=ollama/qwen2.5-coder
MODEL_DATABASE=ollama/qwen2.5-coder
MODEL_FRONTEND=ollama/qwen2.5-coder
MODEL_QA=ollama/llama3.2
MODEL_REVIEWER=ollama/llama3.2
```

Three layers, applied lowest to highest priority:
1. `MODEL_LLM_PARAMS` (per-model or per-provider prefix)
2. `AGENT_LLM_PARAMS` (per-agent, any model)
3. Explicit call-site param (always wins)
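That merge order can be sketched as successive dict updates. `resolve_params` is an illustrative helper, not the project's actual API:

```python
# Sketch of the three-layer param merge, lowest to highest priority.
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}       # layer 1
AGENT_LLM_PARAMS = {"code_reviewer": {"temperature": 0.3}} # layer 2

def resolve_params(model: str, agent: str, call_site: dict) -> dict:
    params: dict = {}
    for prefix, p in MODEL_LLM_PARAMS.items():  # per-model / provider prefix
        if model.startswith(prefix):
            params.update(p)
    params.update(AGENT_LLM_PARAMS.get(agent, {}))  # per-agent overrides
    params.update(call_site)                        # explicit param always wins
    return params

assert resolve_params("ollama/llama3.2", "code_reviewer", {})["temperature"] == 0.3
assert resolve_params("ollama/llama3.2", "planner", {"temperature": 0.9})["temperature"] == 0.9
```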
Examples:

```python
# All Ollama models: low temperature for deterministic output
MODEL_LLM_PARAMS = {"ollama/": {"temperature": 0.1}}

# Code reviewer: force JSON output (works on Groq, Ollama, OpenAI)
AGENT_LLM_PARAMS = {"code_reviewer": {"response_format": {"type": "json_object"}}}
```

| Feature | Claude | Other providers |
|---|---|---|
| Prompt caching (`cache_control: ephemeral`) | Cuts identity token cost ~90% | Skipped silently |
| Adaptive thinking (`"type": "adaptive"`) | Extended reasoning on complex tasks | Skipped silently |
| Streaming | Default for all calls | Also supported via LiteLLM |
Free-tier providers have aggressive rate limits. The base agent handles this transparently:
- Retried: `RateLimitError` (429), `ServiceUnavailableError` (503), `APIConnectionError`
- Not retried: `AuthenticationError` (bad key), `BadRequestError` (bad prompt)
- Backoff: exponential with jitter — 5s base, doubles each attempt, capped at 60s
- Max retries: 6 by default (configurable via `LLM_MAX_RETRIES`)
- Fallback: after all retries are exhausted, switches to `MODEL_FALLBACK` if configured
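The backoff schedule above — 5s base, doubling each attempt, capped at 60s — can be sketched as follows. Full jitter is assumed here, which may differ from the actual implementation:

```python
# Exponential backoff with full jitter (delays computed, not slept).
import random

BASE, CAP = 5.0, 60.0

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (0-indexed)."""
    ceiling = min(CAP, BASE * (2 ** attempt))  # 5, 10, 20, 40, 60, 60, ...
    return random.uniform(0, ceiling)          # full jitter within the ceiling

delays = [backoff_delay(a) for a in range(6)]
assert all(0 <= d <= 60.0 for d in delays)
assert min(CAP, BASE * 2 ** 4) == 60.0  # by attempt 4 the cap is reached
```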
`config.cost_usd(model, input_tokens, output_tokens)` handles pricing across all providers:

- `:free` suffix -> $0.00 (OpenRouter free tier)
- `ollama/` prefix -> $0.00 (local inference)
- Known models -> looked up from the `MODEL_PRICING` dict in `config.py`
- Unknown models -> $0.00 (conservative default)
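A minimal sketch of those pricing rules — the `MODEL_PRICING` numbers below are illustrative, not the project's actual rates:

```python
# Sketch of the cost_usd pricing rules (illustrative prices).
MODEL_PRICING = {  # USD per 1M tokens: (input, output)
    "claude-sonnet-4-6": (3.00, 15.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    if model.endswith(":free") or model.startswith("ollama/"):
        return 0.0                       # free tiers and local inference
    if model in MODEL_PRICING:
        inp, out = MODEL_PRICING[model]  # known model: look up rates
        return (input_tokens * inp + output_tokens * out) / 1_000_000
    return 0.0                           # unknown model: conservative default

assert cost_usd("ollama/qwen2.5-coder", 10_000, 2_000) == 0.0
assert cost_usd("openrouter/qwen/qwen3-coder:free", 10_000, 2_000) == 0.0
assert abs(cost_usd("claude-sonnet-4-6", 1_000_000, 0) - 3.00) < 1e-9
```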
Each agent maintains a persistent identity profile:
```
AgentIdentity {
  expertise:        ["FastAPI", "SQLAlchemy", "JWT auth"]
  learned_rules:    ["Always use async endpoints", "Parameterize all SQL queries"]
  past_mistakes:    ["Forgot to add CORS middleware in v1"]
  correction_rules: ["After adding auth, always test with an invalid token"]
  code_style:       "Google Python style guide"
  beliefs:          { "prefer_postgresql": true, "use_alembic": true }
  rule_stats:       { "Always use async endpoints": { use_count: 12, success_count: 11 } }
  version:          12
}
```
Every REFLECTION_INTERVAL tasks (default: 3), the agent:
- Reads its recent interactions from BigQuery (successes and failures)
- Calls the LLM: "What new rules should I add based on these outcomes?"
- Quality gate — rejects rules that are too short (<15 chars), too long (>200 chars), or task-specific (mentions filenames, route paths, specific identifiers)
- Deduplication — exact match check, then semantic similarity (cosine over keyword tokens, threshold 65%) against existing rules
- Saves the diff to Firestore and logs it to the `memory_diffs` BigQuery table
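The deduplication gate can be sketched as cosine similarity over keyword counts. The tokenization here is deliberately naive and may differ from the real implementation:

```python
# Sketch of the semantic dedup gate: cosine over keyword tokens,
# rejecting candidate rules above the 65% similarity threshold.
import math
import re

def tokens(rule: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for w in re.findall(r"[a-z]+", rule.lower()):
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_duplicate(new_rule: str, existing: list[str], threshold: float = 0.65) -> bool:
    return any(cosine(tokens(new_rule), tokens(r)) >= threshold for r in existing)

existing = ["Always use async endpoints"]
assert is_duplicate("Use async endpoints always", existing)       # reworded copy
assert not is_duplicate("Parameterize all SQL queries", existing)  # genuinely new
```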
Every rule accumulates a track record:
- After each task, `record_rule_outcomes()` increments `use_count` and (if successful) `success_count` for every active rule
- `get_weak_rules()` surfaces rules with high usage but low success rate (used 10+ times, <55% success)
- Weak rules become candidates for removal in the next reflection cycle
| Control | Default | Purpose |
|---|---|---|
| `REFLECTION_ENABLED` | `true` | Master switch — counters still update when off |
| `REFLECTION_DRY_RUN` | `false` | Log changes without writing to Firestore |
| `REFLECTION_MAX_RULE_LENGTH` | `200` | Reject verbose, task-specific rules |
| Version snapshots | Last 10 kept | Roll back any agent to any previous identity version |
The result: agents that get measurably better at their specialty over time without any manual prompt engineering.
- Python 3.11+
- Node.js 18+ (for the dashboard)
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env -- minimum required:
# ANTHROPIC_API_KEY=sk-ant-...
```

```bash
uvicorn main:app --reload
# API:  http://localhost:8000
# Docs: http://localhost:8000/docs
```

```bash
cd dashboard
npm install
npm run dev
# Dashboard: http://localhost:3000
```

```bash
curl -X POST http://localhost:8000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task_description": "Build a REST API for a todo list with user auth and JWT",
    "max_iterations": 3,
    "use_sandbox": false
  }'
```

Or use the dashboard Submit Task form.
| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key (or use another provider via model overrides) |
| Variable | Default | Description |
|---|---|---|
| `GCP_PROJECT_ID` | (empty) | Google Cloud project — leave blank for dev mode with mocks |
| `BQ_DATASET` | `agent_system` | BigQuery dataset name |
| `FIRESTORE_COLLECTION` | `agents` | Firestore collection for agent identity |
| `E2B_API_KEY` | (empty) | E2B sandbox — sandbox disabled if not set |
| `GITHUB_TOKEN` | (empty) | GitHub PAT — git/devops agents disabled if not set |
| `GITHUB_REPO` | (empty) | `owner/repo` — target repository for commits/PRs |
| Variable | Description |
|---|---|
| `TELEGRAM_BOT_TOKEN` | Get from @BotFather |
| `TELEGRAM_CHAT_ID` | Your chat ID — get from @userinfobot |
| `TELEGRAM_PROXY` | `http://host:port` or `socks5://host:port` |
| Variable | Default | Description |
|---|---|---|
| `MAX_ITERATIONS` | `3` | Max retry loops before escalation |
| `MAX_CONCURRENT_TASKS` | `5` | Concurrent task limit (429 on overflow) |
| `MAX_TASK_BUDGET_USD` | `0.50` | Per-task cost ceiling in USD |
| `REFLECTION_INTERVAL` | `3` | Reflect every N completed tasks |
| `REFLECTION_ENABLED` | `true` | Toggle agent reflection on/off |
| `REFLECTION_DRY_RUN` | `false` | Log reflection changes without saving |
| `LOCAL_OUTPUT_DIR` | (empty) | Write generated files to disk |
| `SANDBOX_TIMEOUT_SECONDS` | `3600` | E2B sandbox lifetime (1 hour) |
| `FRONTEND_SELF_REVIEW` | `false` | Enable frontend agent self-review pass |
All default to claude-sonnet-4-6 (or claude-haiku-4-5 for git/devops):
`MODEL_ORCHESTRATOR`, `MODEL_PLANNER`, `MODEL_BACKEND`, `MODEL_DATABASE`, `MODEL_FRONTEND`, `MODEL_QA`, `MODEL_REVIEWER`, `MODEL_GIT`, `MODEL_DEVOPS`

`MODEL_FALLBACK` — tried when the primary model exhausts its retries
| Variable | Default | Description |
|---|---|---|
| `LLM_MAX_RETRIES` | `6` | Retry count on rate limits / transient errors |
| `LLM_RETRY_BASE_DELAY` | `5.0` | Base backoff delay in seconds |
| `LLM_RETRY_MAX_DELAY` | `60.0` | Max backoff cap in seconds |
| Variable | Default | Description |
|---|---|---|
| `API_KEY` | (empty) | Require `X-API-Key` header (empty = no auth) |
| `CORS_ORIGINS` | `localhost:3000,localhost:5173` | Allowed origins |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/tasks` | Submit a new task |
| GET | `/tasks` | List all tasks |
| GET | `/tasks/{id}` | Poll task status + results |
| POST | `/tasks/{id}/reply` | Resume a paused task with human input |
| DELETE | `/tasks/{id}` | Cancel a running task |
| GET | `/agents` | All agent identity summaries |
| GET | `/agents/{id}/identity` | Full agent identity (rules, mistakes, beliefs) |
| GET | `/logs` | Recent interactions (dev mode) |
| GET | `/health` | System health + config summary |
| GET | `/tasks/{id}/events` | SSE stream — live `llm_start` / `llm_end` events |
| GET | `/stats` | BigQuery summary (dev mode) |
```json
{
  "task_description": "Build a user auth API with JWT",
  "max_iterations": 3,
  "use_sandbox": true,
  "output_directory": "/absolute/path/to/write/files"
}
```

```json
{
  "task_id": "abc123...",
  "status": "running | completed | failed | waiting_for_input",
  "current_agent": "backend_agent",
  "pipeline": ["backend_agent", "database_agent", "qa_agent", "code_reviewer"],
  "iteration": 1,
  "artifacts": { "main.py": "...", "models.py": "..." },
  "agent_stats": {
    "backend_agent": { "status": "completed", "cost_usd": 0.012, "tokens": 4200, "files": ["main.py"] }
  },
  "agent_timeline": [
    { "ts": "2026-02-27T14:23:01", "agent": "backend_agent", "event": "completed", "summary": "..." }
  ],
  "pr_url": "https://github.com/owner/repo/pull/42",
  "demo_url": "https://sandbox-abc.e2b.dev",
  "needs_human_input": false,
  "human_input_prompt": null,
  "total_cost_usd": 0.087,
  "output_path": "/path/to/written/files"
}
```

```
autonomy/
+-- main.py                  # FastAPI server + all routes + workflow execution
+-- config.py                # Model routing, pricing, env vars, LLM param overrides
+-- resilient_storage.py     # Disk-based task state persistence
+-- requirements.txt
+-- .env.example
|
+-- agents/
|   +-- base_agent.py        # call_claude(), reflection, retry logic, prompt caching
|   +-- event_bus.py         # Per-task SSE event queue (llm_start/llm_end)
|   +-- orchestrator.py      # Quality gates, retry routing, human input interpretation
|   +-- planner.py           # OpenAPI spec + pipeline design
|   +-- backend_agent.py     # FastAPI code generation + syntax self-correction
|   +-- database_agent.py    # SQLAlchemy + Alembic + syntax self-correction
|   +-- frontend_agent.py    # React + TypeScript + optional self-review
|   +-- qa_agent.py          # pytest/Jest generation + sandbox execution + fast-path reuse
|   +-- code_reviewer.py     # OWASP + security scoring + structured issue output
|   +-- git_agent.py         # GitHub commit + PR via REST API
|   +-- devops_agent.py      # Docker + CI/CD + Cloud Run config generation
|
+-- graph/
|   +-- state.py             # AgentState TypedDict (70+ fields)
|   +-- workflow.py          # LangGraph pipeline: nodes, routers, interrupt points
|
+-- memory/
|   +-- firestore_memory.py  # AgentIdentity class + reflection logic + version snapshots
|
+-- observability/
|   +-- bigquery_logger.py   # 3 BigQuery tables: interactions, metrics, memory_diffs
|
+-- sandbox/
|   +-- e2b_sandbox.py       # E2B cloud VM: pytest, Jest, syntax check, live demo
|
+-- telegram_bot/
|   +-- bot.py               # Notifications, inline keyboards, reply handling
|
+-- tests/
|   +-- mocks.py             # MockFirestoreMemory + MockBigQueryLogger for dev mode
|
+-- dashboard/               # React + Vite monitoring UI
|   +-- src/
|       +-- App.tsx          # 4-tab layout (Overview, Agents, Tasks, Logs)
|       +-- api/client.ts    # Typed API client with SSE support
|       +-- components/      # 9 React components (AgentGrid, TaskList, CostChart, etc.)
|
+-- data/                    # Auto-created, git-ignored
    +-- tasks.json           # Persisted task state
    +-- checkpoints.db       # LangGraph SQLite checkpointer
```
| Component | Library | Why |
|---|---|---|
| Workflow orchestration | LangGraph | Stateful graph with interrupts, checkpointing, conditional routing |
| LLM abstraction | LiteLLM | One acompletion() call for any provider — Claude, Ollama, Groq, etc. |
| LLM provider (primary) | Anthropic Claude | Best instruction following, prompt caching, extended thinking |
| API server | FastAPI | Async-native, auto-docs, Pydantic validation, SSE streaming |
| Validation | Pydantic v2 | Request/response type checking at API boundaries |
| Component | Library | Why |
|---|---|---|
| Agent identity | Google Firestore | Document model for nested JSON identity, version snapshots, serverless |
| Interaction logging | Google BigQuery | Analytical queries for reflection, free tier, fine-tuning dataset |
| Workflow checkpoints | SQLite (via langgraph-checkpoint-sqlite) | Durable pause/resume, zero infrastructure, single file |
| Task state | JSON file (`data/tasks.json`) | Simple, human-readable, survives restart |
| Component | Library | Why |
|---|---|---|
| Code sandbox | E2B Code Interpreter | Isolated cloud VM, pytest/Jest execution, live demo URLs |
| GitHub integration | httpx | Async HTTP client, no git binary dependency |
| Notifications | python-telegram-bot | Long-polling (works behind NAT), inline keyboards, proxy support |
| Component | Library | Why |
|---|---|---|
| UI framework | React 18 | Component model, hooks, wide ecosystem |
| Language | TypeScript 5 | Type safety for API client, component props |
| Build tool | Vite 5 | Fast HMR, minimal config |
| Charts | Recharts | Simple declarative charts for cost/activity data |
| Icons | Lucide React | Consistent icon set, tree-shakeable |
- One concurrent human approval — only one task can be waiting for Telegram reply at a time
- E2B demo URL — requires E2B API key; app must be FastAPI with a standard entry point
- Git integration — only supports GitHub (not GitLab/Bitbucket)
- Dashboard polling — task list polls every 3s (LLM conversation panel uses SSE for true real-time)
- WebSocket-based real-time dashboard streaming
- Multiple concurrent human-in-the-loop approvals
- GitLab and Bitbucket support
- Playwright tests for generated frontends
- Agent-to-agent messaging (agents can ask each other questions mid-task)
- Fine-tuning export — one-click export of BigQuery interactions to JSONL for LoRA/QLoRA
- Weak rule auto-pruning based on rule confidence scores
This project was designed and built by Sai Narne with significant development assistance from Claude (Anthropic) — Claude Sonnet 4.6 and Claude Opus 4.6 contributed to architecture design, agent implementation, prompt engineering, and debugging across the entire codebase.
Built with LangGraph, Anthropic Claude, LiteLLM, and E2B.