Inspiration

We've all been there: copying data between two systems that don't talk to each other, clicking through the same 15-step workflow for the hundredth time, or watching a colleague manually enter invoices into a legacy system from the 90s.

The RPA (Robotic Process Automation) market exists because this problem is massive—$50B+ annually. But traditional tools require recording macros, maintaining brittle CSS selectors, and hiring consultants to build each automation. The moment a UI changes, everything breaks.

We asked: What if automation could just see the screen and figure it out, like a human would?

What it does

Universal Tasker is an autonomous UI agent. You give it a goal in plain English—"Open Calculator and add 3+3" or "Fill out the patient billing form"—and it:

  1. Sees your screen (via screenshot)
  2. Reasons about what to do next (MiniMax M2.1 vision model)
  3. Executes real mouse/keyboard actions (pyautogui)
  4. Verifies each step worked before continuing
  5. Repeats until the goal is achieved

No scripts. No selectors. No API integrations. If a human can do it by looking at a screen, Universal Tasker can automate it.

How we built it

  • MiniMax M2.1 as the brain—multimodal vision + reasoning in a single API call
  • pyautogui for cross-platform mouse/keyboard control
  • DuckDB for persistent session memory and audit logging
  • Streamlit for rapid UI prototyping
  • Prompt engineering to make the agent reliable: always use Spotlight on macOS, prefer keyboard over mouse, verify before proceeding

The core loop is deceptively simple: screenshot → reason → act → verify → repeat. The complexity is in making each step reliable.
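
That loop can be sketched in a few lines of plain Python; `capture`, `decide`, `perform`, and `check` here are placeholders for the real screenshot grab, MiniMax call, pyautogui execution, and verification step.

```python
# Skeleton of the screenshot → reason → act → verify → repeat loop.
# The four callbacks stand in for the real implementations.
def run_agent(goal, capture, decide, perform, check, max_steps=20):
    for step in range(max_steps):
        screen = capture()                      # 1. see the screen
        action = decide(goal, screen)           # 2. reason about the next action
        if action == "DONE":
            return "DONE"
        perform(action)                         # 3. execute mouse/keyboard
        if not check(goal, action, capture()):  # 4. verify it worked
            return "LOST"                       # stop instead of thrashing
    return "TIMEOUT"

# Demo with stub callbacks: one action, then the model declares DONE.
actions = iter(["open Calculator", "DONE"])
result = run_agent("add 3+3",
                   capture=lambda: "screenshot",
                   decide=lambda goal, screen: next(actions),
                   perform=lambda action: None,
                   check=lambda goal, action, screen: True)
```

The `max_steps` cap and the `LOST` early exit are what keep "deceptively simple" from becoming "runs forever".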

Challenges we ran into

  1. Browser focus stealing keyboard input: Because the Streamlit UI runs in a browser, sending Cmd+Space to open Spotlight triggered the browser's own search bar instead of macOS Spotlight. We solved this with AppleScript focus management.
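
In essence, the fix is shelling out to `osascript` to hand focus to the target app before sending keystrokes. The app name and exact script below are illustrative, not the project's actual code:

```python
import subprocess  # noqa: F401  (used on macOS, see comment below)

def focus_cmd(app_name):
    # Build the osascript invocation that brings app_name to the
    # foreground, so subsequent pyautogui keystrokes land in it
    # rather than in the browser tab hosting the UI.
    script = f'tell application "{app_name}" to activate'
    return ["osascript", "-e", script]

cmd = focus_cmd("Finder")
# On macOS you would then run: subprocess.run(cmd, check=True)
# and only afterwards send Cmd+Space via pyautogui.
```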

  2. Timing and UI responsiveness: Actions need time to complete before the next screenshot. We added configurable delays and taught the agent to include time.sleep() in its code.
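
A minimal version of the configurable-delay idea looks like this; the default of 0.5 s is our assumption and would be tuned per machine and per app:

```python
import time

DEFAULT_DELAY = 0.5  # seconds; illustrative default, tuned in practice

def act_then_wait(action, delay=DEFAULT_DELAY):
    # Run the action, then pause so the UI can finish rendering
    # before the next screenshot is taken.
    action()
    time.sleep(delay)

start = time.monotonic()
act_then_wait(lambda: None, delay=0.05)
elapsed = time.monotonic() - start
```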

  3. Coordinate fragility: Clicking at (500, 300) works on one screen but misses on another. We're moving toward vision-based element detection rather than raw coordinates.
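
One stopgap while moving to vision-based detection is to store clicks as fractions of the screen and convert to pixels at runtime; in the real agent `screen_size` would come from `pyautogui.size()`:

```python
def to_pixels(norm_x, norm_y, screen_size):
    # Convert resolution-independent fractions (0.0–1.0) into concrete
    # pixel coordinates for the current display.
    w, h = screen_size
    return round(norm_x * w), round(norm_y * h)

# The same normalized target lands correctly on two different displays:
p1 = to_pixels(0.5, 0.3, (1920, 1080))  # → (960, 324)
p2 = to_pixels(0.5, 0.3, (2560, 1440))  # → (1280, 432)
```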

  4. Hallucination loops: The agent would sometimes "see" a Calculator that wasn't there. Step verification catches this—if the screenshot doesn't match expectations, it stops and reports LOST instead of thrashing.
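
The verification gate reduces to a small decision: does the post-action screenshot match the expectation the model stated before acting? Here `expectation_met` is a naive stand-in for the real vision check:

```python
def verify_step(expected, observed, expectation_met):
    # expectation_met stands in for the real vision-model comparison of
    # the post-action screenshot against the stated expectation
    # ("Calculator window is now visible", etc.).
    return "OK" if expectation_met(expected, observed) else "LOST"

# Naive stand-in check: did the expected window show up at all?
seen = lambda expected, observed: expected.lower() in observed.lower()

status_ok   = verify_step("Calculator", "Calculator window visible", seen)
status_lost = verify_step("Calculator", "empty desktop", seen)
```

Returning `LOST` as an explicit terminal status, rather than retrying blindly, is what turns a hallucination into a one-line failure report.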

Accomplishments we're proud of

  • It actually works: Real keyboard and mouse actions, not simulated. The Calculator really opens.
  • Full audit trail: Every step logged with before/after screenshots—perfect for debugging and compliance.
  • Graceful failure: When stuck, it stops and explains why instead of spinning forever.
  • Clean architecture: Prompt templates, rule-based fallbacks, and a UI that shows exactly what's happening.

What we learned

  • Vision models are shockingly good at understanding UIs—MiniMax correctly identifies buttons, text fields, and application states from screenshots
  • The hardest part of UI automation isn't the automation—it's handling the edge cases (focus, timing, verification)
  • Prompt engineering is software engineering: version control your prompts, test them, iterate

Built With

  • duckdb
  • minimax
  • pyautogui
  • python
  • streamlit