Inspiration: The "Agentic Crisis"
When we built the original Jayu (v1), it was a breakthrough—a multimodal assistant that could see the screen and click buttons. It won "Best Overall App," but as we pushed it toward real-world tasks, we hit a wall. We call it the Stochastic Trap.
Current "Computer Use" agents operate on a "Happy Path." They look at a screenshot, guess the coordinates of a button, and click. If the button is 5 pixels to the left, or if a pop-up appears 500ms later, the agent fails. It doesn't know why it failed; it just keeps clicking.
We realized that to move from a fun demo to a production tool, we didn't need a smarter model; we needed a better body. We needed an agent that wasn't just "watching" the screen but was deeply integrated into the Operating System's reality.
Jayu v2 is our answer: a shift from a stochastic scripting engine to a Deterministic Cognitive State Machine.
What it does
Jayu v2 is an autonomous OS agent that performs complex, multi-step desktop workflows with engineering-grade reliability. Unlike standard vision agents, Jayu v2:
Grounds its vision: It doesn't just guess pixels; it queries the OS Accessibility Tree to find the exact, deterministic coordinates of UI elements.
Verifies its actions: It uses a "Verify-Act-Verify" (VAV) loop. It doesn't assume a click worked—it checks the state change before moving on.
Remembers its reasoning: It leverages Gemini 3's Thought Signatures to maintain a persistent "Chain of Thought" across API calls, preventing the "amnesia" that plagues long tasks.
How we built it
We architected Jayu v2 on three core pillars using Python and the Google Gen AI SDK.
- Perception Layer: Vision Fusion
The biggest failure mode for agents is Coordinate Drift. A model might see a button at pixel (100,100), but due to High-DPI scaling (Retina displays), the OS expects a click at (50,50).
We built a Hybrid Perception System. We capture the screen visually using mss (the "Skin") and simultaneously extract the underlying Accessibility (AX) Tree using pywinauto/atomacos (the "Skeleton").
We then map the visual hallucination to the ground truth using a rigorous coordinate transformation. When the vision model suggests a target, we lock it to the nearest AX node and apply a per-axis scaling factor:

ScaleFactor_x = Width_logical / Width_physical

(x_log, y_log) = (ScaleFactor_x · x_img, ScaleFactor_y · y_img)
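In code, the transform reduces to a per-axis scale. This is a minimal sketch (the class and method names here are illustrative; our production CoordinateTransformer additionally carries per-monitor offsets and scaling factors):

```python
class CoordinateTransformer:
    """Maps physical screenshot pixels to logical OS points.

    Minimal single-display sketch; a real implementation also handles
    multi-monitor origins and per-display scale factors.
    """

    def __init__(self, physical_size, logical_size):
        pw, ph = physical_size
        lw, lh = logical_size
        # ScaleFactor = Width_logical / Width_physical, per axis.
        self.sx = lw / pw
        self.sy = lh / ph

    def to_logical(self, x_img, y_img):
        # (x_log, y_log) = (ScaleFactor_x * x_img, ScaleFactor_y * y_img)
        return (x_img * self.sx, y_img * self.sy)


# A Retina-style display: 2880x1800 physical pixels, 1440x900 logical points.
t = CoordinateTransformer((2880, 1800), (1440, 900))
print(t.to_logical(100, 100))  # -> (50.0, 50.0)
```

This is exactly the failure case from above: the model sees a button at physical pixel (100, 100), but the mouse driver needs the logical point (50, 50).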
- Cognitive Layer: Thought Signatures
Standard LLM interactions are stateless. If an agent performs a 20-step task, it often forgets the "Why" by step 10.
We utilized Gemini 3's Thought Signatures—encrypted tokens that encapsulate the model's internal activation state. By passing these signatures back and forth between the client and the API, Jayu v2 maintains a cryptographic continuity of reasoning. It doesn't just see the history of what it did; it reloads the memory of why it planned to do it.
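The bookkeeping can be sketched as follows. This is a simplified simulation, not the real SDK flow: `call_model` stands in for a Gemini request, and the signature is stubbed as opaque bytes, whereas the real signatures are opaque tokens returned by the API. The invariant it demonstrates is the one that matters: every model turn is replayed with its signature attached.

```python
def call_model(history):
    """Stub for a Gemini call: returns (text, opaque thought signature)."""
    step = sum(1 for turn in history if turn["role"] == "model") + 1
    return f"plan step {step}", f"<sig-{step}>".encode()


history = [
    {"role": "user", "parts": ["Archive last month's invoices"], "signature": None}
]

for _ in range(3):
    text, signature = call_model(history)
    # The key invariant: append the model turn *with* its signature, so the
    # next request replays the reasoning state, not just the transcript.
    history.append({"role": "model", "parts": [text], "signature": signature})

assert all(turn["signature"] for turn in history if turn["role"] == "model")
```

Dropping a signature from any turn is what produces the step-10 "amnesia": the model still sees what it did, but no longer why.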
- Actuation Layer: Closed-Loop Control
We abandoned "fire-and-forget" actuation for industrial Control Theory. Every action is treated as a hypothesis.
- L2 Verification: We check the Accessibility Tree. Did the ToggleState change from Off to On?
- L3 Verification: For visual changes, we use the Structural Similarity Index (SSIM). If SSIM > 0.995 after a click, we know the screen didn't change, and the action failed.
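The L3 check can be sketched as a global SSIM over before/after frames. This is a simplified stand-in (real frames go through a windowed SSIM such as `skimage.metrics.structural_similarity`; the function names and constants here are illustrative, with C1 and C2 being the standard stabilizers for 8-bit images):

```python
import statistics


def global_ssim(img_a, img_b, c1=6.5025, c2=58.5225):
    """Global SSIM over two equal-length grayscale pixel sequences."""
    mu_a, mu_b = statistics.fmean(img_a), statistics.fmean(img_b)
    var_a = statistics.fmean((p - mu_a) ** 2 for p in img_a)
    var_b = statistics.fmean((p - mu_b) ** 2 for p in img_b)
    cov = statistics.fmean((a - mu_a) * (b - mu_b) for a, b in zip(img_a, img_b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    )


def click_took_effect(before, after, threshold=0.995):
    # SSIM above the threshold means the screen is effectively unchanged,
    # so the action is treated as failed and retried.
    return global_ssim(before, after) <= threshold
```

An unchanged screen scores SSIM = 1.0 and fails the check; any visible state change pulls the score well below the 0.995 threshold.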
Challenges we faced
The DPI Nightmare: Operating Systems lie about resolution. Mapping physical pixels from a screenshot to logical points for a mouse driver was the single hardest engineering hurdle. We had to build a custom CoordinateTransformer class to handle multi-monitor setups with different scaling factors.
"Dark Matter" Apps: Modern apps built with Electron or Flutter often hide their UI elements from the Accessibility Tree, rendering them invisible to standard tools. We developed a fallback "Grid Overlay" system that allows Gemini to switch to pure vision mode when it detects these "Dark Zones."
Context Window Limits: High-res screenshots eat tokens fast. We implemented a background "Janitor" process that summarizes old history using Gemini Flash, keeping our working memory fresh without losing the narrative arc.
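The Janitor's core move is collapsing old turns into a single summary turn. A sketch under stated assumptions: `summarize` stands in for the Gemini Flash call, stubbed here so the compaction logic is visible; names and the history structure are illustrative.

```python
def compact_history(history, keep_recent=4, summarize=None):
    """Collapse all but the most recent turns into one summary turn."""
    if len(history) <= keep_recent:
        return history
    # Stub for the Gemini Flash summarization call.
    summarize = summarize or (lambda turns: f"[summary of {len(turns)} earlier turns]")
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "user", "parts": [summarize(old)]}] + recent


history = [{"role": "user", "parts": [f"step {i}"]} for i in range(10)]
print(len(compact_history(history)))  # -> 5
```

Running this in the background keeps the prompt small while the summary turn preserves the narrative arc of what already happened.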
What we learned
Building Jayu v2 taught us that Context is King, but Grounding is God.
A Reasoning Model like Gemini 3 is incredibly powerful, but it is only as good as the data it perceives. By giving the model "Vision Fusion"—the ability to check its visual intuition against the hard data of the OS—we unlocked a level of reliability that simply isn't possible with vision alone.
We learned that the future of AI isn't just about "smarter" models; it's about building better interfaces between those models and the digital world they inhabit. Jayu v2 is the first step toward a truly symbiotic relationship between user and agent.