A real-screen computer-use agent powered by multimodal LLMs. Built on Agent Zero with the Computer (Pointer) profile: see the display like a human, interact with UIs, fill forms, and complete tasks across applications.
Model support: Only DashScope qwen3.5-plus is verified so far. You must set api_base to the correct regional endpoint as well as the API key—see §4.4.
Pointer is an AI assistant for real desktop and browser UIs. It does not rely on spinning up a separate page in a headless browser; it captures what is visible on your monitor, understands on-screen elements, and acts through mouse, keyboard, shortcuts, and composite actions—the same inputs a person uses.
Capabilities include:
- Interactions in web and desktop apps: click, type, scroll, multi-select, drag, and more
- Multi-step flows: login wizards, forms, file pickers, paginated reading and extraction
- Vision: numbered UI overlays plus optional zoomed regions to reduce mis-clicks
- Structured memory: per-task screen extraction, merge, task_done:checkpoint/read, and persisted execution state (see agents/computer/docs)
Pointer runs as an Agent Zero profile (computer), so tool orchestration, dialogue, and extensions work the same way. The product name is Pointer.
Goal: We aim to fully replace repetitive, physical computer work—the routine clicking, typing, and screen navigation that people do every day—with capable automation. That includes personal daily tasks (your own desktop and browser workflows) and enterprise employees’ day-to-day work (repeated operational steps across standard business apps and web consoles), wherever vision, judgment, and multi-step control are needed.
Example use cases:
- Complex web workflows (login, search, forms, admin consoles, multi-page collection)
- Desktop apps and hybrid work (real windows in the foreground, coexisting with your workflow)
- Repetitive human–computer tasks that need vision + reasoning + long horizons
Video demo:
(TBD: link or embed)
Try it online / pilot access:
Contact [email protected] for trials, demos, or partnerships.
| Area | Description |
|---|---|
| Human-like operation | Live screenshots and UI annotation; prefer element indices, fall back to coordinates; OS-specific shortcut hints for macOS, Windows, and Linux. |
| Credential safety | Dedicated login and credential handling to avoid putting plaintext passwords in model-visible prompts; guided secure fill (see Computer Agent tool specs). |
| CAPTCHA handling | captcha_verify and related flows: when a CAPTCHA-like UI is detected, choose click / type / drag per protocol (exact behavior depends on implementation and model). |
| Full Agent Zero stack | Skills, memory and learning, code execution, subordinate agents, MCP, browser tooling, and more—usable alongside Pointer as your task requires. |
| Long-horizon tasks | Screen extraction, merge by task index, task_done:checkpoint / read, and persisted plans / progress / learnings to bound context and improve recoverability. |
For architecture, file layout, env vars, and tools, see agents/computer/README.md.
- Python: 3.10+ recommended (match project dependencies)
- OS: macOS, Windows, and Linux; Pointer injects OS-specific shortcut references
- Permissions: Screen capture and accessibility / assistive APIs may be required for screenshots and input—grant them per your OS settings
From the repository root:

    pip install -r requirements.txt
    pip install -r requirements2.txt

requirements2.txt includes LiteLLM and the OpenAI SDK (model calling). Install both files.
Optional for development / tests:

    pip install -r requirements.dev.txt

| OS | Notes |
|---|---|
| macOS | Allow your terminal or Python process Screen Recording and Accessibility; otherwise capture and control may fail. |
| Windows | Mind HiDPI scaling; run under a normal interactive desktop session (not a headless session without a desktop). |
| Linux | Needs a display server (e.g. X11). Headless hosts need a virtual display (e.g. Xvfb)—(TBD: recommended distros and minimal setup) |
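
The requirements and OS notes above can be turned into a quick preflight check. The following is a minimal sketch, not part of Pointer: the `preflight` function is hypothetical, and treating a missing `DISPLAY` variable as "no display server" is a heuristic that only covers X11-style setups such as Xvfb. OS permissions on macOS cannot be verified this way, so the sketch only prints a reminder.

```python
import os
import sys


def preflight(platform=None, environ=None, version=None):
    """Hypothetical helper: return a list of notes; an empty list means nothing was flagged."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    version = sys.version_info[:2] if version is None else tuple(version)[:2]

    notes = []
    if version < (3, 10):
        notes.append("Python 3.10+ is recommended")
    if platform.startswith("linux"):
        # Heuristic: X11 sessions (including Xvfb) are addressed through DISPLAY.
        if not environ.get("DISPLAY"):
            notes.append("DISPLAY is not set; start a display server (e.g. Xvfb) first")
    elif platform == "darwin":
        # Permissions cannot be checked reliably from here, so this is only a reminder.
        notes.append("confirm Screen Recording and Accessibility are granted")
    return notes


if __name__ == "__main__":
    for note in preflight():
        print("note:", note)
```

Running the file directly prints one line per note for the current machine; the optional arguments exist only so the check can be exercised without changing the real environment.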
- Copy or create an env file (e.g. `.env`) if your workflow uses one.
- In the Web UI settings or `.env`, set keys for your provider (see `conf/model_providers.yaml`), e.g.:
  - OpenAI, OpenRouter, Anthropic, etc.
  - DashScope (Qwen): `DASHSCOPE_API_KEY` or `API_KEY_DASHSCOPE` (provider id: `dashscope`)
- Set `api_base` for DashScope (required). LiteLLM does not auto-pick China vs international. In Settings, fill API base for each role that uses DashScope (Chat, Utility, Browser, etc.; field names such as chat model API base, utility model API base, browser model API base), or ensure the value is merged from `conf/model_providers.yaml` defaults. Use the URL that matches where your key was issued:

  | Region | `api_base` |
  |---|---|
  | China (Beijing) | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
  | International | `https://dashscope-intl.aliyuncs.com/compatible-mode/v1` |

  Wrong region + key combination typically causes auth or routing errors.
- Pick models for Chat, Utility, Browser, etc. Computer use typically needs a vision-capable Utility model for screenshot understanding and extraction.
Tested models (today): Only Qwen 3.5 Plus on DashScope has been validated end-to-end with Pointer. Use provider dashscope, the correct regional api_base (see table above), DASHSCOPE_API_KEY (or API_KEY_DASHSCOPE), and model name qwen3.5-plus (or the exact id from your DashScope console). Chat, Utility, and other roles may all need this stack until broader compatibility is confirmed. Other providers and model IDs are untested and may not work.
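
To make the region and key pairing concrete, here is a minimal sketch that assembles the values described above for one role. The `dashscope_settings` helper, the `cn` / `intl` region labels, and the returned dictionary layout are illustrative assumptions, not Pointer's or Agent Zero's actual configuration API; only the endpoint URLs, environment variable names, provider id, and model name come from this README.

```python
import os

# Regional endpoints as listed above; the "cn" / "intl" keys are illustrative labels.
DASHSCOPE_API_BASE = {
    "cn": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "intl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
}


def dashscope_settings(region, environ=None):
    """Hypothetical helper: collect the values to enter in Settings for one role."""
    environ = os.environ if environ is None else environ
    if region not in DASHSCOPE_API_BASE:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(DASHSCOPE_API_BASE)}")
    # Either variable name is accepted, matching the key setup step above.
    api_key = environ.get("DASHSCOPE_API_KEY") or environ.get("API_KEY_DASHSCOPE")
    if not api_key:
        raise RuntimeError("set DASHSCOPE_API_KEY or API_KEY_DASHSCOPE")
    return {
        "provider": "dashscope",
        "model": "qwen3.5-plus",
        "api_base": DASHSCOPE_API_BASE[region],
        "api_key": api_key,
    }
```

Failing early on an unknown region or a missing key is a deliberate choice here: a mismatched endpoint would otherwise only surface later as an auth or routing error from the provider.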
Set Settings → workdir. Pointer writes under it:
- `computer/snapshots/`: screenshots and debug images
- `computer/extract_data/`, `computer/task_done/`, `computer/execution_checkpoint/`: extraction and task state
Avoid pointing workdir at read-only or ephemeral locations unless intentional.
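
Before a long run, it can help to confirm that the chosen workdir is actually writable. The sketch below is an assumption-laden example rather than Pointer's own logic: `check_workdir` is a hypothetical helper, and it only reports where the subdirectories named above would live without creating them.

```python
import tempfile
from pathlib import Path

# Subdirectories named above, relative to workdir.
POINTER_SUBDIRS = (
    "computer/snapshots",
    "computer/extract_data",
    "computer/task_done",
    "computer/execution_checkpoint",
)


def check_workdir(workdir):
    """Hypothetical helper: verify workdir is writable and return the expected subpaths."""
    root = Path(workdir).expanduser().resolve()
    root.mkdir(parents=True, exist_ok=True)
    # Creating a throwaway file raises OSError on read-only locations.
    with tempfile.NamedTemporaryFile(dir=root):
        pass
    return [root / sub for sub in POINTER_SUBDIRS]
```

A call such as `check_workdir("~/pointer-work")` (the path is only an example) either raises an `OSError` or returns the four paths to watch for screenshots and task state.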
Pointer calls an HTTP annotation service to produce numbered UI boxes (base URL from COMPUTER_ANNOTATE_API_BASE, path /api/v1/annotate/all).
(TBD: how to deploy the service, Docker image, or doc links)
Common environment variables (details in agents/computer/README.md):
| Variable | Default | Meaning |
|---|---|---|
| `COMPUTER_ANNOTATE_API_BASE` | `http://127.0.0.1:8000` | Annotation API base URL |
| `COMPUTER_ANNOTATE_TIMEOUT` | `120` | Request timeout (seconds) |
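
As a rough illustration of how these two variables combine with the annotation path, the sketch below resolves the full endpoint URL and the timeout. The `annotate_config` function is hypothetical and shown only for clarity; the request and response format of the service is not documented here, so the sketch stops short of making a call.

```python
import os

# Path quoted in this README for the annotation endpoint.
ANNOTATE_PATH = "/api/v1/annotate/all"


def annotate_config(environ=None):
    """Hypothetical helper: return (endpoint_url, timeout_seconds) using the documented defaults."""
    environ = os.environ if environ is None else environ
    base = environ.get("COMPUTER_ANNOTATE_API_BASE", "http://127.0.0.1:8000")
    timeout = float(environ.get("COMPUTER_ANNOTATE_TIMEOUT", "120"))
    if timeout <= 0:
        raise ValueError("COMPUTER_ANNOTATE_TIMEOUT must be a positive number of seconds")
    # Strip a trailing slash so the joined URL never contains "//api".
    return base.rstrip("/") + ANNOTATE_PATH, timeout
```

With neither variable set, this resolves to `http://127.0.0.1:8000/api/v1/annotate/all` with a 120-second timeout, which is a convenient URL to probe when checking that the service is reachable.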

    python run_ui.py

(TBD: default port, HTTPS, reverse-proxy examples)
- Lightweight RPA: Web data extraction, chat-driven tasks, and visiting social or content sites (e.g. Xiaohongshu, X, Facebook) where work stays mostly in the browser or a small set of familiar UIs.
- Cross-application work: Uploading and downloading files across apps, delivering results in chat when a task completes, and auto-filling forms that span multiple programs or sites.
- Research: Work on professional or domain data that is public but requires login to access; produce and hand off research outputs, for example legal case research and similar deep dives.
- Digital coworkers: Full loop of receive a task → plan → execute on the machine → deliver results, aimed at recurring operational work patterned after roles such as developers, HR, and finance.
Pointer has full control of the computer (mouse, keyboard, system-level shortcuts). With bad prompts, model errors, or ambiguous goals, it may:
- Delete or corrupt files and important data
- Trigger unintended payments, emails, or account changes
- Leak privacy (screen content is sent to your model provider)
Use only in controlled environments, with backups and least privilege; high-stakes production use requires human oversight and auditing.
You are solely responsible for any direct or indirect damage from using this software.
- Email: [email protected]
- Feedback, trial requests, partnerships, custom work: [email protected]
See LICENSE in the repository root (if it differs from upstream Agent Zero, the file in this repo prevails).
Pointer builds on Agent Zero; Computer / Pointer code lives under agents/computer/. Thanks to Agent Zero and the broader open-source ecosystem.