Inspiration

We've all been there: copying data between two systems that don't talk to each other, clicking through the same 15-step workflow for the hundredth time, or watching a colleague manually enter invoices into a legacy system from the 90s.

The RPA (Robotic Process Automation) market exists because this problem is massive—$50B+ annually. But traditional tools require recording macros, maintaining brittle CSS selectors, and hiring consultants to build each automation. The moment a UI changes, everything breaks.

We asked: What if automation could just see the screen and figure it out, like a human would?

What it does

Universal Tasker is an autonomous UI agent. You give it a goal in plain English—"Open Calculator and add 3+3" or "Fill out the patient billing form"—and it:

  1. Sees your screen (via screenshot)
  2. Reasons about what to do next (MiniMax M2.1 vision model)
  3. Executes real mouse/keyboard actions (pyautogui)
  4. Verifies each step worked before continuing
  5. Repeats until the goal is achieved

No scripts. No selectors. No API integrations. If a human can do it by looking at a screen, Universal Tasker can automate it.

How we built it

  • MiniMax M2.1 as the brain—multimodal vision + reasoning in a single API call
  • pyautogui for cross-platform mouse/keyboard control
  • DuckDB for persistent session memory and audit logging
  • Streamlit for rapid UI prototyping
  • Prompt engineering to make the agent reliable: always use Spotlight on macOS, prefer keyboard over mouse, verify before proceeding

The core loop is deceptively simple: screenshot → reason → act → verify → repeat. The complexity is in making each step reliable.
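
That loop can be sketched in a few lines of plain Python; `capture`, `decide`, `perform`, and `check` here are placeholders for the real screenshot grab, MiniMax call, pyautogui execution, and verification step.

```python
# Skeleton of the screenshot → reason → act → verify → repeat loop.
# The four callbacks stand in for the real implementations.
def run_agent(goal, capture, decide, perform, check, max_steps=20):
    for step in range(max_steps):
        screen = capture()                      # 1. see the screen
        action = decide(goal, screen)           # 2. reason about the next action
        if action == "DONE":
            return "DONE"
        perform(action)                         # 3. execute mouse/keyboard
        if not check(goal, action, capture()):  # 4. verify it worked
            return "LOST"                       # stop instead of thrashing
    return "TIMEOUT"

# Demo with stub callbacks: one action, then the model declares DONE.
actions = iter(["open Calculator", "DONE"])
result = run_agent("add 3+3",
                   capture=lambda: "screenshot",
                   decide=lambda goal, screen: next(actions),
                   perform=lambda action: None,
                   check=lambda goal, action, screen: True)
```

The `max_steps` cap and the `LOST` early exit are what keep "deceptively simple" from becoming "runs forever".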

Challenges we ran into

  1. Browser focus stealing keyboard input: Because the Streamlit UI runs in a browser, sending Cmd+Space to open Spotlight triggered the browser's own search bar instead of macOS Spotlight. We solved this with AppleScript focus management.
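
In essence, the fix is shelling out to `osascript` to hand focus to the target app before sending keystrokes. The app name and exact script below are illustrative, not the project's actual code:

```python
import subprocess  # noqa: F401  (used on macOS, see comment below)

def focus_cmd(app_name):
    # Build the osascript invocation that brings app_name to the
    # foreground, so subsequent pyautogui keystrokes land in it
    # rather than in the browser tab hosting the UI.
    script = f'tell application "{app_name}" to activate'
    return ["osascript", "-e", script]

cmd = focus_cmd("Finder")
# On macOS you would then run: subprocess.run(cmd, check=True)
# and only afterwards send Cmd+Space via pyautogui.
```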

  2. Timing and UI responsiveness: Actions need time to complete before the next screenshot. We added configurable delays and taught the agent to include time.sleep() in its code.
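
A minimal version of the configurable-delay idea looks like this; the default of 0.5 s is our assumption and would be tuned per machine and per app:

```python
import time

DEFAULT_DELAY = 0.5  # seconds; illustrative default, tuned in practice

def act_then_wait(action, delay=DEFAULT_DELAY):
    # Run the action, then pause so the UI can finish rendering
    # before the next screenshot is taken.
    action()
    time.sleep(delay)

start = time.monotonic()
act_then_wait(lambda: None, delay=0.05)
elapsed = time.monotonic() - start
```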

  3. Coordinate fragility: Clicking at (500, 300) works on one screen but misses on another. We're moving toward vision-based element detection rather than raw coordinates.
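
One stopgap while moving to vision-based detection is to store clicks as fractions of the screen and convert to pixels at runtime; in the real agent `screen_size` would come from `pyautogui.size()`:

```python
def to_pixels(norm_x, norm_y, screen_size):
    # Convert resolution-independent fractions (0.0–1.0) into concrete
    # pixel coordinates for the current display.
    w, h = screen_size
    return round(norm_x * w), round(norm_y * h)

# The same normalized target lands correctly on two different displays:
p1 = to_pixels(0.5, 0.3, (1920, 1080))  # → (960, 324)
p2 = to_pixels(0.5, 0.3, (2560, 1440))  # → (1280, 432)
```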

  4. Hallucination loops: The agent would sometimes "see" a Calculator that wasn't there. Step verification catches this—if the screenshot doesn't match expectations, it stops and reports LOST instead of thrashing.
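
The verification gate reduces to a small decision: does the post-action screenshot match the expectation the model stated before acting? Here `expectation_met` is a naive stand-in for the real vision check:

```python
def verify_step(expected, observed, expectation_met):
    # expectation_met stands in for the real vision-model comparison of
    # the post-action screenshot against the stated expectation
    # ("Calculator window is now visible", etc.).
    return "OK" if expectation_met(expected, observed) else "LOST"

# Naive stand-in check: did the expected window show up at all?
seen = lambda expected, observed: expected.lower() in observed.lower()

status_ok   = verify_step("Calculator", "Calculator window visible", seen)
status_lost = verify_step("Calculator", "empty desktop", seen)
```

Returning `LOST` as an explicit terminal status, rather than retrying blindly, is what turns a hallucination into a one-line failure report.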

Accomplishments we're proud of

  • It actually works: Real keyboard and mouse actions, not simulated. The Calculator really opens.
  • Full audit trail: Every step logged with before/after screenshots—perfect for debugging and compliance.
  • Graceful failure: When stuck, it stops and explains why instead of spinning forever.
  • Clean architecture: Prompt templates, rule-based fallbacks, and a UI that shows exactly what's happening.

What we learned

  • Vision models are shockingly good at understanding UIs—MiniMax correctly identifies buttons, text fields, and application states from screenshots
  • The hardest part of UI automation isn't the automation—it's handling the edge cases (focus, timing, verification)
  • Prompt engineering is software engineering: version control your prompts, test them, iterate

Built With

  • duckdb
  • minimax
  • pyautogui
  • python
  • streamlit