Inspiration
AI coding tools are great at generating code, but they often stop at suggestions. In real development workflows, code must be tested, executed, and debugged before it can be trusted. We were inspired by the idea of closing this gap — creating an AI system that doesn’t just generate code, but actually validates behavior through testing and execution.
AgentForge was built to explore what happens when AI agents reason about a codebase, generate tests, run them, analyze failures, and help developers fix issues faster.
What it does
AgentForge is an AI-powered testing and debugging workspace for backend codebases.
Given a repository, AgentForge:

1. Analyzes the repository structure and architecture
2. Generates a testing strategy
3. Automatically creates tests
4. Executes them using pytest
5. Analyzes failures
6. Produces a structured repair plan and IDE-ready prompts
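The stages above can be sketched as a simple staged pipeline in which each stage writes an artifact that later stages consume. All function and artifact names below are illustrative assumptions, not AgentForge's actual API, and the model and pytest calls are stubbed:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineRun:
    """Carries artifacts from one stage to the next."""
    repo_path: str
    artifacts: dict = field(default_factory=dict)

def analyze_repo(run: PipelineRun) -> None:
    # Stage 1: would scan the repository; stubbed with a canned summary.
    run.artifacts["analysis"] = {"framework": "FastAPI", "routes": ["/users"]}

def plan_tests(run: PipelineRun) -> None:
    # Stage 2: derive a testing strategy from the analysis artifact.
    routes = run.artifacts["analysis"]["routes"]
    run.artifacts["strategy"] = [f"test GET {r}" for r in routes]

def generate_tests(run: PipelineRun) -> None:
    # Stage 3: would call the reasoning model; stubbed as one file per item.
    run.artifacts["tests"] = {f"test_{i}.py": item
                              for i, item in enumerate(run.artifacts["strategy"])}

def execute_tests(run: PipelineRun) -> None:
    # Stage 4: would invoke pytest in an isolated workspace; stubbed result.
    run.artifacts["results"] = {"passed": 0, "failed": 1}

def diagnose_failures(run: PipelineRun) -> None:
    # Stages 5-6: turn execution results into a repair plan.
    if run.artifacts["results"]["failed"]:
        run.artifacts["repair_plan"] = ["inspect GET /users handler"]

STAGES = [analyze_repo, plan_tests, generate_tests,
          execute_tests, diagnose_failures]

def run_pipeline(repo_path: str) -> PipelineRun:
    run = PipelineRun(repo_path)
    for stage in STAGES:
        stage(run)
    return run
```

Passing artifacts through a shared run object is one way to keep each stage independently testable while the orchestrator stays a plain loop.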
Instead of stopping at code generation, AgentForge closes the loop between reasoning and execution.
The system uses Amazon Nova 2 Lite to power multiple reasoning agents that perform repository analysis, test generation, failure diagnosis, and repair planning.
How we built it
AgentForge is built as a multi-engine AI system with several components working together.
The backend is implemented in Python with FastAPI and orchestrates the full agent pipeline.
Key components include:

• Repository Parsing Engine – scans the repo and extracts routes, services, models, and dependencies
• Repository Knowledge Graph – builds relationships between files, functions, and APIs
• Retrieval Engine – creates structured context bundles for AI agents
• Agent Reasoning Engine – powered by Amazon Nova 2 Lite for test generation, failure analysis, and repair planning
• Test Execution Engine – creates an isolated workspace and runs tests with pytest
• Memory Engine – stores artifacts, execution results, and iteration history
• Orchestration Engine – manages pipeline stages and agent execution
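As a rough sketch of what the knowledge graph component might look like, the adjacency-list structure below relates files, functions, and API routes. The schema and relation names are assumptions for illustration, not the project's actual data model:

```python
from collections import defaultdict

class RepoGraph:
    """Tiny labeled-edge graph over repository entities."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].add((relation, dst))

    def related(self, node: str, relation: str) -> list[str]:
        """All targets reachable from `node` via `relation`, sorted."""
        return sorted(d for r, d in self.edges[node] if r == relation)

# Hypothetical file paths for a small FastAPI backend.
g = RepoGraph()
g.add_edge("app/routes/users.py", "defines", "GET /users")
g.add_edge("app/routes/users.py", "imports", "app/services/user_service.py")
g.add_edge("app/services/user_service.py", "uses", "app/models/user.py")

# Which modules does the route file depend on?
deps = g.related("app/routes/users.py", "imports")
```

A retrieval engine can then walk such edges to assemble only the files relevant to a given test target, instead of sending the whole repository to the model.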
The frontend is built with Next.js, providing a developer workspace where users can inspect artifacts, execution logs, generated tests, and repair plans.
Challenges we ran into
One of the biggest challenges was bridging AI reasoning with real code execution.
AI-generated tests can fail for many reasons:

• incorrect assumptions about application state
• missing fixtures or setup
• differences between expected and actual API behavior
We had to design a system that could capture execution results, parse failures, and feed that information back into the reasoning pipeline.
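One way to capture that feedback is to parse pytest's short-summary lines into structured records that can be injected into the next reasoning prompt. The sketch below targets pytest's `FAILED path::test - reason` summary format; the surrounding report text is a made-up example:

```python
import re

# Matches pytest short-summary lines such as:
#   FAILED tests/test_users.py::test_get_user - AssertionError: 404 != 200
FAILURE_RE = re.compile(
    r"^FAILED\s+(?P<file>[^:]+)::(?P<test>\S+)(?:\s+-\s+(?P<reason>.*))?$"
)

def parse_failures(pytest_output: str) -> list[dict]:
    """Extract structured failure records from pytest terminal output."""
    failures = []
    for line in pytest_output.splitlines():
        m = FAILURE_RE.match(line.strip())
        if m:
            failures.append(m.groupdict())
    return failures

report = """
=========================== short test summary info ===========================
FAILED tests/test_users.py::test_get_user - AssertionError: 404 != 200
FAILED tests/test_auth.py::test_login - KeyError: 'token'
"""
failures = parse_failures(report)
```

Records like these are easier to feed back into a model than raw terminal output, because the file, test name, and error message arrive as labeled fields.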
Another challenge was managing the complexity of the pipeline itself. Coordinating repository parsing, retrieval, agent reasoning, and execution required building a structured orchestration system that could reliably run each stage.
Accomplishments that we're proud of
We’re proud that AgentForge successfully demonstrates a full execution-aware AI workflow.
Instead of just generating suggestions, the system:

• analyzes a real codebase
• generates tests
• executes them in a real environment
• diagnoses failures
• produces actionable repair guidance
We also built a developer workspace that makes these artifacts transparent, allowing users to inspect every stage of the pipeline.
What we learned
Building AgentForge taught us that creating effective AI developer tools requires much more than simply calling a model API. One of the most important lessons was how critical prompt design and structured context are when building multi-agent systems.
Early on, we discovered that sending large unstructured prompts to the model produced inconsistent outputs. To address this, we designed a structured context pipeline that organizes repository information, test results, and previous reasoning into well-defined artifacts that can be injected into each model request.
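A structured context bundle can be as simple as a dataclass that renders clearly labeled sections into the prompt rather than concatenating raw files. The field names and section layout here are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    """Organizes what a single agent call is allowed to see."""
    task: str
    repo_summary: str
    test_results: list[str] = field(default_factory=list)
    prior_reasoning: list[str] = field(default_factory=list)

    def render(self, max_chars: int = 4000) -> str:
        sections = [
            ("TASK", self.task),
            ("REPOSITORY", self.repo_summary),
            ("TEST RESULTS", "\n".join(self.test_results) or "none yet"),
            ("PRIOR REASONING", "\n".join(self.prior_reasoning) or "none yet"),
        ]
        text = "\n\n".join(f"## {name}\n{body}" for name, body in sections)
        return text[:max_chars]  # hard cap keeps every prompt bounded

bundle = ContextBundle(
    task="Generate pytest tests for the /users route",
    repo_summary="FastAPI app; routes in app/routes, services in app/services",
    test_results=["test_get_user FAILED: 404 != 200"],
)
prompt = bundle.render()
```

Because every request goes through the same renderer, prompt size stays predictable and each agent sees the same section layout regardless of pipeline stage.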
We also learned how important structured memory between agent calls is. Each stage of the pipeline — repository analysis, test generation, execution, and failure diagnosis — produces artifacts that are stored and reused by later agents. This ensures that every Amazon Nova 2 Lite API request receives relevant, grounded context instead of raw code dumps.
Another key lesson was the importance of deterministic orchestration around AI systems. By separating reasoning tasks into specialized agents and giving them structured inputs, we were able to make the pipeline significantly more reliable.
Overall, the project reinforced that successful AI systems rely on a combination of good prompt engineering, structured data flow, and carefully designed agent orchestration, not just powerful models.
What's next for Agent_Forge
Future improvements include:

• supporting additional languages beyond Python
• deeper repository understanding using semantic embeddings
• automatic code repair and patch generation
• integration with GitHub repositories and CI pipelines
• iterative learning from previous debugging runs
Our goal is to evolve AgentForge into a system that can act as an AI-powered testing and debugging partner for developers, helping them move from code generation to reliable, production-ready software.