Inspiration
- Single-agent code generators are fast but brittle: they hallucinate, produce inconsistent structure, and retain no memory of what worked.
- We wanted a standard, interoperable way for specialized agents to collaborate, review, and learn from each run.
- Ephemeral, reproducible environments + real observability felt essential for credible demos and real-world use.
What it does
- CodeForge turns “prompt → code” into generate → review → learn → reuse with Google’s A2A protocol.
- A Generator (Gemini 2.5 Flash) produces full web apps, a Reviewer (Gemini 2.5 Pro) scores and secures them, and a Pattern Analyzer extracts reusable building blocks.
- Users ask in chat (CopilotKit AG UI); CodeForge returns files, review scores, a workflow log, and the patterns used, with the app spun up and previewed in a Daytona workspace and tracing/evals captured in Weave.
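The generate → review → learn → reuse loop above can be sketched as a small async pipeline. This is an illustrative sketch only: the agent function names, the `RunResult` shape, and the scoring threshold are assumptions, with the model calls stubbed out where CodeForge would invoke Gemini 2.5 Flash and Pro.

```python
# Hypothetical sketch of the generate -> review -> learn -> reuse loop.
# Names and signatures are illustrative, not CodeForge's actual API.
import asyncio
from dataclasses import dataclass, field

@dataclass
class RunResult:
    files: dict                 # filename -> source
    review_score: float         # Reviewer's quality score
    patterns_used: list = field(default_factory=list)

async def generate(prompt: str, patterns: list) -> dict:
    # Would call Gemini 2.5 Flash, conditioned on known-good patterns.
    return {"index.html": f"<!-- app for: {prompt} -->"}

async def review(files: dict) -> float:
    # Would call Gemini 2.5 Pro for quality/security review.
    return 8.5

async def extract_patterns(files: dict, score: float) -> list:
    # Pattern Analyzer: keep motifs only from high-scoring runs.
    return ["static-landing-page"] if score >= 8 else []

async def run_pipeline(prompt: str, library: list) -> RunResult:
    files = await generate(prompt, library)              # 1. generate
    score = await review(files)                          # 2. review
    new_patterns = await extract_patterns(files, score)  # 3. learn
    library.extend(p for p in new_patterns if p not in library)  # 4. reuse
    return RunResult(files=files, review_score=score,
                     patterns_used=list(library))

library: list = []
result = asyncio.run(run_pipeline("landing page", library))
```

Because the library persists across runs, every high-scoring generation makes the next prompt start from a stronger prior.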
How we built it
- Backend: FastAPI implementing A2A via JSON-RPC 2.0; Agent Cards for discovery; async orchestration Manager.
- Models: Gemini 2.5 Flash (synthesis) + Pro (deep review/security); structured outputs for stable inter-agent contracts.
- Learning: In-memory Pattern Library (with MongoDB backing); captures success/failure motifs for future runs.
- Frontend: React + CopilotKit AG UI actions calling the Manager; live status and follow-ups in chat.
- Infra & DX: Daytona spins ephemeral dev workspaces for deterministic builds/previews; Weave instruments latency, quality scores, and pattern hits.
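The A2A-over-JSON-RPC transport in the first bullet can be sketched with a stdlib-only dispatcher (the real backend wraps this in FastAPI). The Agent Card fields and the `agent.card` method name are assumptions; only the JSON-RPC 2.0 envelope and error codes follow the spec.

```python
# Minimal JSON-RPC 2.0 dispatch for A2A-style messages (stdlib only).
# Card fields and method names are illustrative assumptions.
import json

AGENT_CARD = {  # discovery document a peer agent would fetch
    "name": "codeforge-generator",
    "capabilities": ["generate_app"],
}

def _ok(rid, result):
    return json.dumps({"jsonrpc": "2.0", "id": rid, "result": result})

def _err(rid, code, message):
    return json.dumps({"jsonrpc": "2.0", "id": rid,
                       "error": {"code": code, "message": message}})

def handle_rpc(raw: str) -> str:
    req = json.loads(raw)
    rid = req.get("id")
    if req.get("jsonrpc") != "2.0":
        return _err(rid, -32600, "Invalid Request")   # spec-defined code
    if req.get("method") == "agent.card":
        return _ok(rid, AGENT_CARD)
    return _err(rid, -32601, "Method not found")      # spec-defined code

resp = json.loads(handle_rpc(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "agent.card"})))
```

Keeping the envelope this strict is what lets agents stay decoupled: any peer that speaks JSON-RPC 2.0 can discover and call another agent without shared code.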
Challenges we ran into
- Designing tight A2A schemas so agents remain decoupled but composable.
- Ensuring deterministic previews across machines (solved with Daytona).
- Getting consistent structured outputs from LLMs under tool pressure.
- Avoiding race conditions in multi-agent workflows and streaming logs.
- Balancing security (auth, CORS, rate limits) with hackathon speed.
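One tactic for the structured-output challenge above is validate-then-retry: reject any model reply that doesn't match the inter-agent contract and re-prompt. A minimal stdlib sketch, where the `score`/`issues`/`approved` field names are assumptions (a real build might use Pydantic instead):

```python
# Validate-then-retry for structured LLM output. Field names are
# hypothetical; the point is that downstream agents never see bad JSON.
import json

REQUIRED = {"score": float, "issues": list, "approved": bool}

def parse_review(raw: str) -> dict:
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

def review_with_retry(call_model, max_tries: int = 3) -> dict:
    last_err = None
    for _ in range(max_tries):
        try:
            return parse_review(call_model())
        except (ValueError, json.JSONDecodeError) as err:
            last_err = err  # a real system would re-prompt with the error
    raise RuntimeError(f"no valid structured output: {last_err}")

# Fake model: first reply is malformed, second satisfies the contract.
attempts = iter(['{"score": 7}',
                 '{"score": 8.5, "issues": [], "approved": true}'])
result = review_with_retry(lambda: next(attempts))
```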
Accomplishments that we're proud of
- End-to-end A2A multi-agent pipeline working with real UX in CopilotKit.
- Automatic review gate with quality scoring and actionable diffs.
- First cut of a Pattern Library that actually feeds back into generation.
- One-click reproducible previews via Daytona; credible traces/evals via Weave.
- Clean, documented APIs: /api/agents, /api/generate, /api/patterns, /api/metrics.
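A client call against the documented /api/generate endpoint might look like the following stdlib sketch. The endpoint path comes from the list above, but the base URL and the `prompt` payload field are assumptions, not the project's published schema.

```python
# Hypothetical client request for /api/generate. The payload field
# ("prompt") and base URL are assumptions, not a documented schema.
import json
from urllib import request

def build_generate_request(prompt: str,
                           base: str = "http://localhost:8000") -> request.Request:
    body = json.dumps({"prompt": prompt}).encode()
    return request.Request(f"{base}/api/generate", data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = build_generate_request("todo app with dark mode")
# with request.urlopen(req) as resp:   # requires the Manager to be running
#     print(json.load(resp))
```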
What we learned
- Small schema decisions in A2A ripple across every agent; contracts are everything.
- Reviews catch far more than linting when they're model-aware and spec-driven.
- Memory beats prompt engineering alone; patterns give stable, compounding gains.
- Observability (Weave) turns "it feels better" into proof; ephemeral envs (Daytona) turn "works for me" into "works for everyone."
What's next for Code Forge
- New agents: Testing (Browserbase), Docs, Deployment, Security, Performance.
- Streamed A2A messages and partial results; smarter pattern ranking (bandits/RL).
- Persist patterns with clustering/search; per-tenant metrics and cost controls.
- Hardening: OAuth2/JWT, capability-scoped permissions, VPC isolation.
- Agent marketplace + templates so teams can plug CodeForge into their stacks.