Inspiration

  1. Single-agent code generators are fast but brittle—hallucinations, inconsistent structure, and no memory of what worked.
  2. We wanted a standard, interoperable way for specialized agents to collaborate, review, and learn from each run.
  3. Ephemeral, reproducible environments + real observability felt essential for credible demos and real-world use.

What it does

  1. CodeForge turns “prompt → code” into generate → review → learn → reuse with Google’s A2A protocol.
  2. A Generator (Gemini 2.5 Flash) produces full web apps, a Reviewer (Gemini 2.5 Pro) scores and secures them, and a Pattern Analyzer extracts reusable building blocks.
  3. Users ask in chat (CopilotKit AG UI), CodeForge returns files, review scores, a workflow log, and patterns used—spun up and previewed in Daytona workspaces, with tracing/evals in Weave.
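The generate → review → learn → reuse loop described above can be sketched as follows. This is a minimal illustration, not the actual CodeForge API: all function names, dict fields, and the scoring logic are hypothetical stand-ins for the Gemini-backed agents the Manager orchestrates over A2A.

```python
# Sketch of the generate -> review -> learn -> reuse loop.
# All names and thresholds here are illustrative stubs.

def generate(prompt, patterns):
    # Generator agent (Gemini 2.5 Flash in CodeForge) drafts the app,
    # reusing any patterns the library has accumulated.
    return {"files": {"app.py": f"# app for: {prompt}"}, "patterns_used": patterns}

def review(artifact):
    # Reviewer agent (Gemini 2.5 Pro) scores the artifact and flags issues.
    score = 0.9 if artifact["files"] else 0.0
    return {"score": score, "issues": []}

def learn(artifact, verdict, library):
    # Pattern Analyzer extracts reusable motifs from successful runs.
    if verdict["score"] >= 0.8:
        library.append({"motif": "scored-app", "source": list(artifact["files"])})
    return library

def run(prompt, library):
    patterns = [p["motif"] for p in library]   # reuse prior patterns
    artifact = generate(prompt, patterns)
    verdict = review(artifact)
    library = learn(artifact, verdict, library)
    return artifact, verdict, library

artifact, verdict, library = run("todo app", [])
```

Each pass through `run` can grow the library, so later prompts start with more reusable building blocks than earlier ones.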

How we built it

  1. Backend: FastAPI implementing A2A via JSON-RPC 2.0; Agent Cards for discovery; async orchestration Manager.
  2. Models: Gemini 2.5 Flash (synthesis) + Pro (deep review/security); structured outputs for stable inter-agent contracts.
  3. Learning: In-memory Pattern Library (with MongoDB backing); captures success/failure motifs for future runs.
  4. Frontend: React + CopilotKit AG UI actions calling the Manager; live status and follow-ups in chat.
  5. Infra & DX: Daytona spins ephemeral dev workspaces for deterministic builds/previews; Weave instruments latency, quality scores, and pattern hits.
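Since the A2A messages ride on JSON-RPC 2.0, the wire format between agents is a small envelope. A rough sketch of building and checking one follows; the `message/send` method and the message part structure are our reading of the A2A surface, and the validator is a toy, not the spec's full validation:

```python
import json
import uuid

def make_a2a_request(method, params):
    """Build a JSON-RPC 2.0 envelope of the kind A2A transports."""
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": method,
        "params": params,
    }

def validate_jsonrpc(msg):
    # Minimal structural check: version tag, id, and method must be present.
    return msg.get("jsonrpc") == "2.0" and "id" in msg and "method" in msg

# Hypothetical request asking an agent for a web app.
req = make_a2a_request("message/send", {
    "message": {
        "role": "user",
        "parts": [{"kind": "text", "text": "Generate a Flask todo app"}],
    },
})
wire = json.dumps(req)  # what goes over HTTP to the agent's endpoint
```

Keeping the envelope this uniform is what lets the FastAPI backend expose every agent behind the same JSON-RPC handler while Agent Cards handle discovery.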

Challenges we ran into

  1. Designing tight A2A schemas so agents remain decoupled but composable.
  2. Ensuring deterministic previews across machines (solved with Daytona).
  3. Getting consistent structured outputs from LLMs under tool pressure.
  4. Avoiding race conditions in multi-agent workflows and streaming logs.
  5. Balancing security (auth, CORS, rate limits) with hackathon speed.
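One way to attack challenges 1 and 3 together is to validate every structured model output against an explicit schema before it crosses an agent boundary, so a malformed review never reaches the Manager. A stdlib-only sketch with hypothetical field names and an illustrative gate threshold:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewResult:
    """Contract a reviewer's output must satisfy before it is accepted."""
    score: float                       # 0.0-1.0 quality score
    issues: list = field(default_factory=list)
    approved: bool = False

def parse_review(raw: dict) -> ReviewResult:
    # Reject malformed model output at the boundary instead of letting
    # it propagate into downstream workflow steps.
    score = raw.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"bad score: {score!r}")
    return ReviewResult(
        score=float(score),
        issues=list(raw.get("issues", [])),
        approved=score >= 0.8,  # review-gate threshold (illustrative)
    )

ok = parse_review({"score": 0.92, "issues": ["missing CSRF token"]})
```

Failing fast here also helps with the race-condition problem: an agent either emits a complete, well-formed result or the step is retried, so no half-parsed state leaks into the stream of workflow logs.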

Accomplishments that we're proud of

  1. End-to-end A2A multi-agent pipeline working with real UX in CopilotKit.
  2. Automatic review gate with quality scoring and actionable diffs.
  3. First cut of a Pattern Library that actually feeds back into generation.
  4. One-click reproducible previews via Daytona; credible traces/evals via Weave.
  5. Clean, documented APIs: /api/agents, /api/generate, /api/patterns, /api/metrics.
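For context, a /api/generate response bundles everything the chat UI renders in one payload. The shape below is a sketch: the field names and values are illustrative assumptions, not the shipped schema.

```python
# Hypothetical shape of a /api/generate response; all field names and
# values here are placeholders, not CodeForge's actual schema.
response = {
    "files": {"index.html": "<!doctype html>...", "app.py": "..."},
    "review": {"score": 0.91, "issues": []},
    "workflow_log": ["generator:done", "reviewer:done", "analyzer:done"],
    "patterns_used": ["auth-form", "rest-crud"],
    "preview_url": "https://example.invalid/workspace/abc123",  # Daytona preview (placeholder)
}
```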

What we learned

  1. Small schema decisions in A2A ripple across every agent; contracts are everything.
  2. Reviews catch far more than linting when they’re model-aware and spec-driven.
  3. Memory beats prompt engineering alone; patterns give stable, compounding gains.
  4. Observability (Weave) turns “it feels better” into proof; ephemeral envs (Daytona) turn “works for me” into “works for everyone.”

What's next for Code Forge

  1. New agents: Testing (Browserbase), Docs, Deployment, Security, Performance.
  2. Streamed A2A messages and partial results; smarter pattern ranking (bandits/RL).
  3. Persist patterns with clustering/search; per-tenant metrics and cost controls.
  4. Hardening: OAuth2/JWT, capability-scoped permissions, VPC isolation.
  5. Agent marketplace + templates so teams can plug CodeForge into their stacks.
