Inspiration
We've all been there: copying data between two systems that don't talk to each other, clicking through the same 15-step workflow for the hundredth time, or watching a colleague manually enter invoices into a legacy system from the 90s.
The RPA (Robotic Process Automation) market exists because this problem is massive—$50B+ annually. But traditional tools require recording macros, maintaining brittle CSS selectors, and hiring consultants to build each automation. The moment a UI changes, everything breaks.
We asked: What if automation could just see the screen and figure it out, like a human would?
What it does
Universal Tasker is an autonomous UI agent. You give it a goal in plain English—"Open Calculator and add 3+3" or "Fill out the patient billing form"—and it:
- Sees your screen (via screenshot)
- Reasons about what to do next (MiniMax M2.1 vision model)
- Executes real mouse/keyboard actions (pyautogui)
- Verifies each step worked before continuing
- Repeats until the goal is achieved
No scripts. No selectors. No API integrations. If a human can do it by looking at a screen, Universal Tasker can automate it.
How we built it
- MiniMax M2.1 as the brain—multimodal vision + reasoning in a single API call
- pyautogui for cross-platform mouse/keyboard control
- DuckDB for persistent session memory and audit logging
- Streamlit for rapid UI prototyping
- Prompt engineering to make the agent reliable: always use Spotlight on macOS, prefer keyboard over mouse, verify before proceeding
The core loop is deceptively simple: screenshot → reason → act → verify → repeat. The complexity is in making each step reliable.
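The loop above can be sketched in a few lines. This is a minimal, runnable sketch only: `see()`, `reason()`, `act()`, and `verify()` are stubs standing in for the real pieces (a pyautogui screenshot, a MiniMax M2.1 vision call, real input events, and a screenshot check), and the `_script` list and dict keys are illustrative assumptions, not the project's actual API.

```python
import time

# Stubs standing in for the real components: see() would capture a
# screenshot with pyautogui, reason() would call the MiniMax M2.1
# vision model, act() would drive the mouse/keyboard via pyautogui.
_script = [{"status": "ACT", "action": "press cmd+space", "wait": 0.0}]

def see():
    return "screenshot-bytes"

def reason(goal, screenshot):
    # Replay a canned plan; the real agent asks the vision model.
    return _script.pop(0) if _script else {"status": "DONE"}

def act(action):
    pass  # real mouse/keyboard events in the app

def verify(goal, screenshot):
    return True  # real version re-checks the screen against the goal

def run_agent(goal, max_steps=10):
    """Sketch of the core loop: screenshot -> reason -> act -> verify."""
    for _ in range(max_steps):
        decision = reason(goal, see())
        if decision["status"] == "DONE":
            return "DONE"                       # goal achieved
        act(decision["action"])
        time.sleep(decision.get("wait", 0.0))   # let the UI settle
        if not verify(goal, see()):
            return "LOST"                       # stop instead of thrashing
    return "TIMEOUT"

print(run_agent("Open Calculator and add 3+3"))  # DONE
```

Swapping the stubs for real implementations is where all the reliability work lives.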
Challenges we ran into
Browser focus stealing keyboard input: When running in a browser, Cmd+Space to open Spotlight would trigger the browser's search instead of macOS Spotlight. We solved this with AppleScript focus management.
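The fix can be sketched as a small helper that activates the target app via `osascript` before any keystrokes are sent. This is an assumption-laden sketch (macOS only; `focus_app` is a hypothetical name, not the project's actual function), guarded so it only shells out on Darwin.

```python
import subprocess
import sys

def focus_app(app_name):
    """Bring app_name to the foreground via AppleScript so keystrokes
    like Cmd+Space reach it instead of the browser (macOS only)."""
    script = f'tell application "{app_name}" to activate'
    if sys.platform == "darwin":
        subprocess.run(["osascript", "-e", script], check=True)
    return script  # returned so callers can log what was run

# e.g. focus_app("Calculator") before sending keystrokes with pyautogui
print(focus_app("Calculator"))
```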
Timing and UI responsiveness: Actions need time to complete before the next screenshot. We added configurable delays and taught the agent to include `time.sleep()` in its code.
Coordinate fragility: Clicking at (500, 300) works on one screen but misses on another. We're moving toward vision-based element detection rather than raw coordinates.
Hallucination loops: The agent would sometimes "see" a Calculator that wasn't there. Step verification catches this—if the screenshot doesn't match expectations, it stops and reports LOST instead of thrashing.
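A verification step of this shape can be sketched as a second, cheap question to the model: "does the screenshot actually show what we expect?" Here `ask_model` is a stub for a MiniMax vision call; the prompt wording and the OK/LOST return values are illustrative assumptions.

```python
def ask_model(prompt, screenshot):
    # Stub for a MiniMax M2.1 vision call; always answers NO here.
    return "NO"

def verify_step(expected, screenshot):
    """Ask the vision model whether the screen matches expectations.
    Returning LOST halts the agent rather than letting it act on a
    hallucinated UI state."""
    answer = ask_model(
        f"Does this screenshot show: {expected}? Answer YES or NO.",
        screenshot,
    )
    return "OK" if answer.strip().upper().startswith("YES") else "LOST"

print(verify_step("Calculator is open", b"..."))  # LOST (stub says NO)
```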
Accomplishments we're proud of
- It actually works: Real keyboard and mouse actions, not simulated. The Calculator really opens.
- Full audit trail: Every step logged with before/after screenshots—perfect for debugging and compliance.
- Graceful failure: When stuck, it stops and explains why instead of spinning forever.
- Clean architecture: Prompt templates, rule-based fallbacks, and a UI that shows exactly what's happening.
What we learned
- Vision models are shockingly good at understanding UIs—MiniMax correctly identifies buttons, text fields, and application states from screenshots
- The hardest part of UI automation isn't the automation—it's handling the edge cases (focus, timing, verification)
- Prompt engineering is software engineering: version control your prompts, test them, iterate
Built With
- duckdb
- minimax
- pyautogui
- python
- streamlit