Jungle GenAI Genesis 2026

Inspiration

Most AI coding tools are excellent at producing code and terrible at living with the consequences. They generate, they gesture confidently, and then they leave the scene before anyone checks whether the thing actually runs. If the app is broken, the test is flaky, the auth flow fails, or the UI quietly implodes under real interaction, that usually gets discovered by a human later, after the model has already moved on to its next beautiful hallucination.

That gap is what pushed us towards Jungle. We were not interested in building another wrapper where a model writes code and everyone politely pretends that counts as verification. We wanted an environment in which an agent could build something, launch it, interact with it, observe the result, store the evidence, and then return to that history later without losing the plot. In other words, we wanted to give coding agents something they usually lack: a world to operate inside, a memory of what happened there, and a mechanism for learning from failure that does not disappear the moment the chat window scrolls.

Jungle came from that instinct. Not “make the agent smarter,” but “stop forcing it to work blind.” The project is also a chance to show how AI coding tools can be folded into a real-life workflow: one where users keep the autonomy those tools offer while retaining the observability, persistent history, and feedback loops that mirror actual engineering practice.

What it does

Jungle is a sandboxed runtime environment for testing and iterating on software with agents in the loop. A coding agent can hand Jungle a task, Jungle spins up a fresh environment for that project, executes a structured testing procedure against the running system, captures what happens, and stores the entire run as something you can actually revisit.

The key idea is that Jungle treats execution history as a first-class object. A run is not just a pass or fail badge and a pile of vibes. It becomes a navigable record: what request initiated the test, what environment was launched, what actions were taken, what the agent saw, where things failed, what artifacts were captured, and how that run relates to earlier or later versions. Instead of watching an agent thrash around and hoping it learned something, the user can move through the history, inspect individual moments, replay earlier tests, and branch off prior versions to try different procedures.

For the current demo, Jungle focuses on a strong and understandable slice of that vision: testing live web applications through a visible runtime loop. A project can be launched, exercised through Playwright, recorded, stored, and revisited through the interface. The user can inspect prior runs, return to earlier test branches, and evolve the testing path rather than starting from scratch each time.

Underneath that, the architecture is deliberately aimed at something broader: any agent-built system that emits observable signals should eventually be testable inside Jungle, whether that means frontend behavior, backend responses, database state, cache or queue behavior, or auth flows that break in annoying and realistic ways. So yes, the current demo is a testing environment. But the larger ambition is an execution substrate for agentic development.
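To make "a run as a navigable record" concrete, here is a minimal sketch of what such a record could look like. The field names are illustrative assumptions, not Jungle's actual schema: the point is that a run carries its request, environment, actions, artifacts, and a link to the run it branched from.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of a run as a first-class, navigable object.
# Field names are illustrative, not Jungle's real data model.
@dataclass
class RunRecord:
    run_id: str
    request: str                  # the testing request that initiated the run
    environment: str              # which sandboxed environment was launched
    actions: list = field(default_factory=list)    # what the executor did
    artifacts: list = field(default_factory=list)  # video, logs, screenshots
    status: str = "pending"       # pass / fail / error
    parent_run: Optional[str] = None  # the run this one branched from

# A run that branches off an earlier run "r1" instead of starting cold.
run = RunRecord(run_id="r2", request="test login flow",
                environment="env-7", parent_run="r1")
run.actions.append("fill #email")
run.status = "fail"
```

Because `parent_run` links runs together, history becomes a graph you can walk rather than a flat log you can only scroll.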

How we built it

We built Jungle as a desktop-hosted runtime system because we wanted the environment itself to be visible, persistent, and under our control. Electron gave us a native shell where the testing interface, execution timeline, run history, and artifact views could all live together instead of being scattered across terminal logs, browser tabs, and half-finished screenshots.

From there, we layered in an orchestration flow that starts with a testing request, turns that request into a structured plan, and then executes that plan against the live application. Playwright handles the active side of the loop by interacting with the running app the way a real user would: navigating pages, filling inputs, pressing buttons, traversing flows, and triggering states worth inspecting. Around that, Jungle manages the environment itself: run creation, state tracking, artifact storage, execution history, and the interfaces needed to make old runs reusable rather than disposable.

A big part of the architecture is how we think about history. Internally, we frame projects as environments that grow into branching execution trees rather than flat logs. A user should be able to go back to an earlier version, rerun a previous test, modify the procedure, and create a new branch without losing the context of what came before. That makes the environment feel less like a transient test session and more like a place where agent behavior can be observed, compared, and improved over time while still keeping the human in the loop.

We also leaned hard into evidence capture. Every run is meant to leave behind artifacts that are useful later: video, logs, execution status, test summaries, and the runtime context needed to understand not just that something failed, but how it failed and under what conditions. That matters because in agentic systems, “it broke” is almost never enough. You need to know what the agent touched, what the system returned, and what world state existed at the time.
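The branching-tree framing above can be sketched in a few lines. This is an assumption-laden toy, not Jungle's internal implementation: each node is a run, `branch` starts a new procedure from a point in history, and `lineage` recovers the context that led there.

```python
# Minimal sketch of runs as a branching execution tree rather than a
# flat log. Class and method names are illustrative assumptions.
class RunNode:
    def __init__(self, run_id, procedure, parent=None):
        self.run_id = run_id
        self.procedure = procedure  # ordered list of test steps
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def branch(self, run_id, new_procedure):
        # Start a new run from this point in history, keeping context.
        return RunNode(run_id, new_procedure, parent=self)

    def lineage(self):
        # Walk back to the root so a branch can be replayed with the
        # full context of what came before it.
        node, path = self, []
        while node is not None:
            path.append(node.run_id)
            node = node.parent
        return list(reversed(path))

root = RunNode("r1", ["open /", "click Sign up"])
retry = root.branch("r2", ["open /", "click Sign up", "fill email"])
```

Here `retry.lineage()` returns `["r1", "r2"]`, which is exactly the "go back, modify the procedure, branch without losing context" behavior described above.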

Challenges we faced

The hardest part was building a loop that stayed coherent once real execution entered the picture. Keeping the feedback loop positive requires staying aligned with the task at hand, which we achieved through chain-of-thought reasoning models and explicit communication between the evaluation model and the generative model. We also identified a key weak point in the critic layer: the Python Gemini critic was initially prompted like a generic video reviewer, which caused it to over-report issues unrelated to the actual test. We tightened the prompt by feeding it explicit feature scope, plan steps, assertions, and actual step outcomes so its findings stay anchored to what was specifically requested, keeping the feedback loop's signal clean rather than noisy.

As soon as an agent stops generating text and starts operating inside a live environment, every interface becomes fragile. The request has to be structured well enough to execute consistently. The parser has to turn vague intent into a test procedure that is useful instead of decorative. The runtime has to launch cleanly. The executor has to interact with the app robustly enough to exercise both normal use and edge cases. The UI has to reflect the truth of the run rather than some sanitized fake version of it. And the storage layer has to preserve enough detail for history to matter later, without turning the whole project into a landfill of random artifacts nobody can navigate.

The other challenge was resisting the urge to overbuild. The vision for Jungle is broad by design, but hackathons punish people who fall in love with their own roadmap. We had to choose a vertical slice that would prove the core idea clearly: a sandboxed environment, a visible testing loop, persistent run history, and the ability to revisit and branch past tests. That discipline was probably the difference between having an idea and having a product.
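The critic-scoping fix described above amounts to building the prompt from structured run data rather than handing the model a raw video. The sketch below shows the shape of that idea; the actual prompt Jungle sends to Gemini is not reproduced here, and every field name is an assumption.

```python
# Illustrative sketch of scoping a critic prompt to the requested test.
# The real prompt and field names in Jungle may differ.
def build_critic_prompt(feature, plan_steps, assertions, outcomes):
    lines = [
        f"You are reviewing a test run for ONE feature: {feature}.",
        "Only report issues tied to the plan steps and assertions below.",
        "Plan steps:",
    ]
    lines += [f"  {i + 1}. {step}" for i, step in enumerate(plan_steps)]
    lines.append("Assertions:")
    lines += [f"  - {a}" for a in assertions]
    lines.append("Observed step outcomes:")
    lines += [f"  - {o}" for o in outcomes]
    # The explicit scope fence is what stops generic-reviewer noise.
    lines.append("Ignore anything outside this scope.")
    return "\n".join(lines)

prompt = build_critic_prompt(
    feature="signup form validation",
    plan_steps=["open /signup", "submit empty form"],
    assertions=["inline error appears under email field"],
    outcomes=["step 1: ok", "step 2: error banner shown instead"],
)
```

Anchoring the critic to plan steps and observed outcomes, rather than to whatever it notices in the recording, is what keeps its findings tied to the request.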

Accomplishments we’re proud of

What we are proudest of is that Jungle makes agent testing inspectable. That sounds simple, but it is the whole difference between infrastructure and theater. Most demos in this space rely on the audience accepting that the agent “probably did something smart.” Jungle does not ask for that trust. It gives the user a run history they can open, inspect, and reason about. It stores the evidence. It keeps the execution timeline alive. It turns old runs into reusable context instead of dead output. That shift from disposable agent behavior to navigable runtime memory is, for us, the core accomplishment.

We are also proud that Jungle is framed around the environment itself rather than the model. The interesting part is not “look, the AI clicked a button.” The interesting part is that the environment can hold state, preserve memory, support branching histories, and eventually allow deliberate perturbation: slow network, broken auth, delayed responses, bad state, all the ugly conditions that real software has to survive. That is where the system starts becoming genuinely useful.

What we learned

We learned that once you build around execution instead of generation, observability stops being a bonus feature and becomes the entire game. Agents are only as useful as the evidence they leave behind. If a failure cannot be replayed, inspected, and connected to prior context, then it may as well not have happened. This mirrors a basic truth about software in general: undocumented black-box systems are counterproductive to deploying and maintaining it.

We also learned that “memory” in agent systems is usually discussed too abstractly. The useful memory is not just more text shoved back into context. It is structured runtime history: what environment existed, what actions were taken, what failed, what changed afterward, and which earlier branch still matters. That kind of memory is actionable.

And maybe most importantly, we learned that the infrastructure layer is where this space gets interesting. Models will improve. Tool calling will get easier. The durable value is in building environments where agent work can be tested, stored, compared, replayed, and trusted.
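"Actionable memory" can be made concrete with a small example: structured run history supports direct queries that raw transcript text cannot. The schema below is hypothetical; the queries are the kind an agent (or user) would actually ask of Jungle's history.

```python
# A sketch of actionable memory: structured run history you can query,
# rather than text pushed back into context. Schema is hypothetical.
runs = [
    {"id": "r1", "branch": "main",  "status": "pass", "failed_step": None},
    {"id": "r2", "branch": "fix-1", "status": "fail", "failed_step": "submit order"},
    {"id": "r3", "branch": "fix-1", "status": "fail", "failed_step": "submit order"},
    {"id": "r4", "branch": "fix-2", "status": "pass", "failed_step": None},
]

def last_failure(history):
    # The most recent failing run, so it can be replayed and inspected.
    return next((r for r in reversed(history) if r["status"] == "fail"), None)

def branches_still_failing(history):
    # Branches whose latest run still fails: the ones that still matter.
    latest = {}
    for r in history:  # later runs overwrite earlier ones per branch
        latest[r["branch"]] = r["status"]
    return sorted(b for b, s in latest.items() if s == "fail")
```

With this shape, "what failed, where, and which branch is still broken" is a lookup, not an act of recollection.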

What’s next for Jungle

The next step is to make Jungle more powerful as an environment, not just more impressive as a demo. That means deeper runtime observability, richer artifact capture, and a more explicit runtime map so users can move through version history and failure branches with less friction. It means letting users revisit earlier trees, edit old tests, and launch new branches from meaningful checkpoints instead of treating each run like a disconnected event. It also means adding deliberate perturbation of the environment so Jungle can test software under conditions that actually resemble reality: broken auth, slow network, unstable dependencies, delayed services, and other forms of engineered pain.

Beyond the current web-app slice, we want Jungle to become a more general testing substrate for anything an agent builds, as long as the system produces observable signals. Frontend, backend, database, queue, cache, auth state: if it runs, Jungle should be able to watch it, stress it, remember it, and help improve it.

The long-term goal is simple: agents should not just write software and hope. They should be able to enter an environment, prove what works, expose what fails, and carry that knowledge forward. Jungle is our attempt to build that world.
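The deliberate-perturbation idea can be sketched as a wrapper around whatever transport the environment uses. In Jungle's web slice this would likely sit at the network-interception layer (Playwright supports request routing); here a plain function stands in for that layer, and every name is an assumption.

```python
import random
import time

# Hedged sketch of deliberate environment perturbation: wrap a request
# handler so it can inject broken auth, latency, or dropped requests.
def perturb(handler, *, fail_auth=False, delay_s=0.0, drop_rate=0.0, rng=None):
    rng = rng or random.Random(0)  # seeded so perturbed runs are replayable
    def wrapped(request):
        if fail_auth and request.get("path", "").startswith("/auth"):
            return {"status": 401, "body": "auth broken on purpose"}
        if rng.random() < drop_rate:
            return {"status": 503, "body": "service dropped"}
        time.sleep(delay_s)  # simulate a slow network
        return handler(request)
    return wrapped

def ok(request):
    return {"status": 200, "body": "ok"}

# Auth is broken on purpose; everything else still works.
flaky = perturb(ok, fail_auth=True)
# Every request is dropped, simulating an unstable dependency.
dropper = perturb(ok, drop_rate=1.0)
```

Seeding the randomness matters: a perturbed run that cannot be replayed deterministically is exactly the kind of unreproducible failure Jungle exists to eliminate.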

Built With

electron, gemini, playwright, python
