A "cold start" web agent runner that explores web applications like a human user with no prior knowledge, attempting to complete goals and producing evidence-rich reports.
- Cold start navigation: Agent explores using only what's visible on screen (a11y snapshots)
- Goal-driven exploration: Provide a goal and optional success hints
- Evidence capture: Screenshots, video recordings, Playwright traces
- Help ladder escalation: Automatic fallback to search and help when stuck
- UX findings: Automatic detection of discoverability issues, bugs, and navigation problems
- Destructive action protection: Blocks delete/remove actions unless explicitly in goal
Cold Agent uses the Claude Code CLI for decision making, so it can run on your Claude Pro/Max subscription via an OAuth token. No separate API key is needed, though you can use one if you prefer. The agent:
- Captures a compact accessibility (a11y) snapshot of the current page
- Sends the snapshot to Claude via CLI with the goal and history context
- Receives Claude's choice of the next action (click, fill, search, etc.)
- Executes the action with Playwright and captures evidence
- Repeats until the goal is achieved or budget is exhausted
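
The loop above can be sketched roughly as follows. This is a minimal illustration, not the code in `agentLoop.ts`: the `Action`, `LoopDeps`, and `Budget` types, the `runLoop` name, and the `"budget_exhausted"` outcome label are all hypothetical, and the snapshot, Claude, and Playwright calls are injected as plain functions.

```typescript
// Hypothetical types; the real ones live in src/types.ts and may differ.
type Action = { type: string; target?: string; value?: string; reason?: string };

interface LoopDeps {
  snapshot: () => Promise<string>; // compact a11y snapshot of the current page
  decide: (snapshot: string, goal: string, history: Action[]) => Promise<Action>; // ask Claude
  execute: (action: Action) => Promise<void>; // run via Playwright and capture evidence
}

interface Budget {
  maxSteps: number;
  maxMinutes: number;
}

async function runLoop(goal: string, budget: Budget, deps: LoopDeps) {
  const history: Action[] = [];
  const deadline = Date.now() + budget.maxMinutes * 60_000;

  // Repeat until the goal is declared done or either budget runs out.
  while (history.length < budget.maxSteps && Date.now() < deadline) {
    const snap = await deps.snapshot();
    const action = await deps.decide(snap, goal, history);
    history.push(action);
    if (action.type === "done") {
      return { outcome: "success" as const, steps: history.length, history };
    }
    await deps.execute(action);
  }
  return { outcome: "budget_exhausted" as const, steps: history.length, history };
}
```

Injecting the three dependencies keeps the loop testable without a browser or a Claude session.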
- Node.js 20+
- Claude Code CLI installed
Cold Agent uses your Claude Pro/Max subscription via OAuth token. Run this once:
# Install Claude CLI if you haven't
npm install -g @anthropic-ai/claude-code
# Run Claude once to create config file
claude
# Complete any first-time setup prompts, then exit with Ctrl+C
# Get a long-lived OAuth token (opens browser for auth)
claude setup-token
# Copy the token it outputs and set it:
export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..."

Add the export to your `~/.zshrc` or `~/.bashrc` to persist it.

npm install

This will also install Chromium for Playwright.

# Set the OAuth token (if not in your shell profile)
export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..."
# Development mode with hot reload
npm run dev
# Or build and run
npm run build
npm start

The server starts on http://localhost:3000 by default.

If you see "Credit balance is too low":
- You may have an old `ANTHROPIC_API_KEY` set - unset it: `unset ANTHROPIC_API_KEY`
- Re-run `claude setup-token` in your terminal
- Make sure you're logging in with your Pro/Max account
If Claude keeps asking for setup prompts:
- Run `claude` interactively once to complete setup
- Make sure `~/.claude.json` exists

curl -X POST http://localhost:3000/runs \
-H "Content-Type: application/json" \
-d '{
"baseUrl": "https://example.com",
"goal": "Find the contact page and locate the email address"
}'

If you want to kick off many missions at once:

curl -X POST http://localhost:3000/runs/batch \
-H "Content-Type: application/json" \
-d '{
"runs": [
{ "baseUrl": "https://example.com", "goal": "Find pricing for teams" },
{ "baseUrl": "https://example.com", "goal": "Locate the security settings page" }
]
}'

Response:

{
"runId": "20260127_abc12345",
"status": "pending",
"message": "Run started",
"links": {
"status": "/runs/20260127_abc12345",
"artifacts": "/runs/20260127_abc12345/artifacts/"
}
}

curl -X POST http://localhost:3000/runs \
-H "Content-Type: application/json" \
-d '{
"baseUrl": "https://yourapp.example",
"goal": "Set up a waste stream and log a waste entry for that stream",
"auth": {
"type": "password",
"loginUrl": "https://yourapp.example/login",
"username": "[email protected]",
"password": "secret"
},
"budgets": {
"maxSteps": 40,
"maxMinutes": 6
},
"options": {
"headless": true,
"viewport": { "width": 1280, "height": 800 },
"recordVideo": true,
"recordTrace": true,
"networkAllowlist": ["yourapp.example", "cdn.yourapp.example"],
"successHints": {
"mustSeeText": ["Waste Stream", "Entry saved"],
"mustEndOnUrlIncludes": ["/waste", "/streams"]
}
}
}'

curl http://localhost:3000/runs/20260127_abc12345

Once a run finishes, you can open an HTML report (summary, findings, and a step timeline with screenshots/video):

open http://localhost:3000/runs/20260127_abc12345/report

The HTML report includes:
- Deep links like `#step-12` for sharing a specific point in the run
- A "Create GitHub issue" panel (if `GITHUB_REPO`/`GITHUB_TOKEN` are configured)

curl http://localhost:3000/runs

Configure:

export GITHUB_REPO="owner/repo"
export GITHUB_TOKEN="ghp_..."  # PAT with repo permissions

Then:

curl -X POST http://localhost:3000/runs/20260127_abc12345/issues/github \
-H "Content-Type: application/json" \
-d '{
"title": "Signup flow is hard to discover",
"labels": ["ux", "bug"]
}'

Response:

{ "number": 123, "url": "https://github.com/owner/repo/issues/123", "title": "Signup flow is hard to discover" }

You can describe a persona (who they are and what they want to learn or use) and have the LLM generate multiple specific missions/questions.

curl -X POST http://localhost:3000/personas/questions \
-H "Content-Type: application/json" \
-d '{
"baseUrl": "https://example.com",
"siteDescription": "Project management SaaS for enterprise teams with Gantt charts, resource allocation, and reporting",
"count": 6,
"persona": {
"name": "IT Admin",
"description": "A new IT admin responsible for onboarding employees and configuring access.",
"interests": ["user management", "SSO", "permissions", "audit logs", "billing"]
}
}'

Important: Include `siteDescription` to tell the LLM what your site/product does. Without it, the LLM may guess wrong (especially for domain names similar to other products).

The `/personas/runs` endpoint generates questions and starts one agent run per question (runs execute in parallel, limited by `MAX_CONCURRENT_RUNS`):

curl -X POST http://localhost:3000/personas/runs \
-H "Content-Type: application/json" \
-d '{
"baseUrl": "https://dyrt.co",
"siteDescription": "AI-powered waste intelligence platform for analyzing invoices and tracking diversion rates",
"count": 4,
"persona": {
"description": "waste facilities manager looking for software solutions",
"interests": ["waste management software", "compost monitoring", "facility operations"]
},
"budgets": { "maxSteps": 25, "maxMinutes": 4 },
"options": { "headless": true, "recordVideo": true }
}'

Response:

{
"persona": { "description": "...", "interests": [...] },
"siteDescription": "...",
"questions": ["Find pricing info...", "Locate case studies...", "..."],
"runIds": ["20260204_abc123", "20260204_def456", "..."]
}

Each agent's goal is prefixed with persona context so it knows who it's acting as:

[Persona: waste facilities manager | Interests: waste management, compost monitoring] Find pricing info...
Set "embedPersona": false in the request to disable this.
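
As an illustration of the prefix format shown above, the following sketch builds the goal string from a persona and a generated question. The `Persona` interface and the `embedPersona` function name are hypothetical; only the output format is taken from the example above.

```typescript
// Hypothetical shape, based on the persona fields used in the requests above.
interface Persona {
  name?: string;
  description: string;
  interests?: string[];
}

// Prefix a question with persona context, or return it unchanged when embed is false.
function embedPersona(question: string, persona: Persona, embed: boolean = true): string {
  if (!embed) return question;
  const parts = [`Persona: ${persona.description}`];
  if (persona.interests && persona.interests.length > 0) {
    parts.push(`Interests: ${persona.interests.join(", ")}`);
  }
  return `[${parts.join(" | ")}] ${question}`;
}
```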
{
"runId": "20260127_abc12345",
"status": "success",
"goal": "Find the contact page",
"baseUrl": "https://example.com",
"startedAt": "2026-01-27T10:00:00.000Z",
"endedAt": "2026-01-27T10:02:30.000Z",
"summary": {
"outcome": "success",
"reason": "Found contact page with email",
"completionEvidence": ["step:5", "step:8"]
},
"metrics": {
"steps": 8,
"pageTransitions": 3,
"backtracks": 1,
"searchUsed": false,
"stuckEvents": 0,
"consoleErrors": 0,
"failedRequests": 0,
"durationMs": 150000
},
"findings": [
{
"type": "discoverability",
"severity": "med",
"title": "Contact link not prominent",
"details": "Agent had to scroll to find contact link in footer",
"evidence": { "step": 4, "screenshot": "screens/step004.png" }
}
],
"artifacts": {
"traceZip": "artifacts/trace.zip",
"video": "artifacts/video.webm",
"stepsJson": "artifacts/steps.json",
"screenshotsDir": "artifacts/screens/"
}
}

src/
├── server.ts # Express API server
├── types.ts # TypeScript interfaces and schemas
└── run/
├── runOrchestrator.ts # Manages Playwright sessions and runs
├── agentLoop.ts # Core decision-action loop
├── snapshot.ts # Builds compact a11y snapshots
├── evidence.ts # Captures screenshots, video, traces
└── evaluator.ts # Post-run analysis and findings
The agent can only perform these actions:
- `click(target)` - Click buttons/links
- `fill(target, value)` - Fill text fields
- `select(target, option)` - Select dropdown options
- `scroll(direction)` - Scroll up/down
- `back()` - Go to previous page
- `wait(ms)` - Wait briefly
- `search(query)` - Use in-app search
- `openHelp()` - Open help documentation
- `done(reason, evidenceSteps)` - Declare goal complete

When the agent gets stuck:
- Phase 0 (steps 0-5): Normal exploration
- Phase 1 (steps 6-9): Try search with goal-related terms
- Phase 2 (steps 10-13): Open help if available
- Step 14+: Stop with "discoverability block" failure
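
The step ranges above map to a simple lookup. In this sketch the `LadderPhase` names and the `ladderPhase` function are hypothetical, and `step` is assumed to be whatever counter the agent uses for the ranges listed (the text does not say whether it resets after progress).

```typescript
// Hypothetical phase labels for the help ladder.
type LadderPhase = "explore" | "search" | "help" | "stop";

// Map a step counter to the escalation phase, using the ranges listed above.
function ladderPhase(step: number): LadderPhase {
  if (step <= 5) return "explore"; // Phase 0: normal exploration
  if (step <= 9) return "search";  // Phase 1: try search with goal-related terms
  if (step <= 13) return "help";   // Phase 2: open help if available
  return "stop";                   // Step 14+: "discoverability block" failure
}
```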
Progress is marked as:
- Major: URL path changed, new page title/heading appeared
- Some: Modal opened/closed, form validation appeared
- None: Same page with no meaningful changes
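
One way to read the progress levels above is as a comparison of page state before and after an action. The `PageState` fields and the `classifyProgress` function below are assumptions made for illustration; the actual signals the agent compares may differ.

```typescript
// Hypothetical summary of what the agent can observe about a page.
interface PageState {
  urlPath: string;
  title: string;
  heading: string;
  modalOpen: boolean;
  validationVisible: boolean;
}

type Progress = "major" | "some" | "none";

function classifyProgress(before: PageState, after: PageState): Progress {
  // Major: URL path changed, or a new page title/heading appeared.
  if (
    before.urlPath !== after.urlPath ||
    before.title !== after.title ||
    before.heading !== after.heading
  ) {
    return "major";
  }
  // Some: a modal opened or closed, or form validation appeared.
  if (
    before.modalOpen !== after.modalOpen ||
    (!before.validationVisible && after.validationVisible)
  ) {
    return "some";
  }
  // None: same page with no meaningful changes.
  return "none";
}
```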
| Type | Description |
|---|---|
| `discoverability` | Feature was hard to find |
| `copy` | Confusing or unclear labels |
| `validation` | Form validation issues |
| `bug` | Console errors or failed requests |
| `performance` | Slow page loads |
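
When consuming a report, the finding types above can be modeled as a union type. The `Finding` interface below is inferred from the sample report and is an assumption, as is the `countByType` helper; `severity` is left as a plain string because only `"med"` appears in the example.

```typescript
type FindingType = "discoverability" | "copy" | "validation" | "bug" | "performance";

// Inferred from the sample report; the real schema may have more fields.
interface Finding {
  type: FindingType;
  severity: string;
  title: string;
  details: string;
  evidence?: { step: number; screenshot?: string };
}

// Count findings per type, e.g. to summarize a batch of reports.
function countByType(findings: Finding[]): Record<FindingType, number> {
  const counts: Record<FindingType, number> = {
    discoverability: 0,
    copy: 0,
    validation: 0,
    bug: 0,
    performance: 0,
  };
  for (const f of findings) {
    counts[f.type] += 1;
  }
  return counts;
}
```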
- Destructive action blocklist: Won't click "delete", "remove", etc. unless goal explicitly requires it
- Rate limiting: 300-700ms delay between actions
- Network allowlist: Can restrict navigation to approved domains
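
The three safeguards above could look roughly like this. Everything here is a sketch under stated assumptions: the word list contains only the two examples named above, "explicitly in goal" is interpreted as the goal text containing the same word, and the allowlist is matched against exact hostnames (the request example lists `yourapp.example` and `cdn.yourapp.example` separately, which suggests subdomains are not implied). All function names are hypothetical.

```typescript
// Only the two words named in the text; the real blocklist is presumably longer.
const DESTRUCTIVE_WORDS = ["delete", "remove"];

// Block a click when the element label looks destructive and the goal does not mention that word.
function isBlockedDestructive(label: string, goal: string): boolean {
  const l = label.toLowerCase();
  const g = goal.toLowerCase();
  return DESTRUCTIVE_WORDS.some((word) => l.includes(word) && !g.includes(word));
}

// Pick a delay between actions in the 300-700ms range (inclusive).
function actionDelayMs(random: () => number = Math.random): number {
  return 300 + Math.floor(random() * 401);
}

// With no allowlist configured, every host is allowed (assumed default).
function isHostAllowed(hostname: string, allowlist?: string[]): boolean {
  if (!allowlist || allowlist.length === 0) return true;
  return allowlist.includes(hostname.toLowerCase());
}
```

To check a full URL, pass `new URL(url).hostname` as the first argument to `isHostAllowed`.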
| Variable | Description | Default |
|---|---|---|
| `PORT` | Server port | `3000` |
| `MAX_CONCURRENT_RUNS` | Max simultaneous Playwright runs (global queue) | `2` |
| `CLAUDE_CODE_OAUTH_TOKEN` | OAuth token from `claude setup-token` (Pro/Max subscription) | (none) |
| `ANTHROPIC_API_KEY` | Alternative: use API key instead of OAuth (uses API credits) | (none) |
| `USE_TMUX` | Set to `1` to use tmux interactive mode instead of pipe mode | `0` |
| `GITHUB_REPO` | GitHub repo to create issues in (`owner/repo`) | (none) |
| `GITHUB_TOKEN` | GitHub token (PAT) used for issue creation | (none) |
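
For reference, the variables and defaults in the table could be read into a typed object like this. The `Config` interface and `loadConfig` function are hypothetical and not taken from the project; the environment is passed in as a plain record so the function can be exercised without touching the real process environment.

```typescript
// Hypothetical config shape mirroring the table above.
interface Config {
  port: number;
  maxConcurrentRuns: number;
  oauthToken: string | undefined;
  apiKey: string | undefined;
  useTmux: boolean;
  githubRepo: string | undefined;
  githubToken: string | undefined;
}

function loadConfig(env: Record<string, string | undefined>): Config {
  // Parse an integer variable, falling back to the documented default.
  const intOr = (raw: string | undefined, fallback: number): number => {
    const n = Number.parseInt(raw ?? "", 10);
    return Number.isNaN(n) ? fallback : n;
  };
  return {
    port: intOr(env["PORT"], 3000),
    maxConcurrentRuns: intOr(env["MAX_CONCURRENT_RUNS"], 2),
    oauthToken: env["CLAUDE_CODE_OAUTH_TOKEN"] || undefined,
    apiKey: env["ANTHROPIC_API_KEY"] || undefined,
    useTmux: env["USE_TMUX"] === "1",
    githubRepo: env["GITHUB_REPO"] || undefined,
    githubToken: env["GITHUB_TOKEN"] || undefined,
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`.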
Claude Mode Priority:
1. If `ANTHROPIC_API_KEY` is set → uses Anthropic SDK (API credits, fastest)
2. If `CLAUDE_CODE_OAUTH_TOKEN` is set → uses CLI pipe mode (`claude -p`, Pro/Max subscription, recommended)
3. If `USE_TMUX=1` → uses tmux interactive mode (slower, useful for debugging)

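
A possible reading of this priority as code is sketched below. The `ClaudeMode` labels and `selectClaudeMode` function are hypothetical. One assumption deserves attention: the sketch checks `USE_TMUX=1` before falling back to pipe mode, because the configuration table describes `USE_TMUX` as selecting tmux "instead of pipe mode" and tmux mode also needs the OAuth token. The actual order in the code may differ.

```typescript
type ClaudeMode = "sdk" | "cli-pipe" | "tmux";

// Returns null when no credentials are configured.
function selectClaudeMode(env: Record<string, string | undefined>): ClaudeMode | null {
  // An API key takes precedence over everything else.
  if (env["ANTHROPIC_API_KEY"]) return "sdk";
  if (env["CLAUDE_CODE_OAUTH_TOKEN"]) {
    // Assumption: USE_TMUX=1 switches the OAuth path from pipe mode to tmux.
    return env["USE_TMUX"] === "1" ? "tmux" : "cli-pipe";
  }
  return null;
}
```

This also shows why a stale `ANTHROPIC_API_KEY` in the environment overrides the subscription token.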
Cold Agent supports three modes for calling Claude:
1. CLI Pipe Mode (Recommended)
Uses claude -p with your Pro/Max subscription via OAuth token. This is the fastest non-API option (~7-8 seconds per step).
export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..."
npm run dev

2. Anthropic SDK
Uses the Anthropic API directly. Fastest option but uses API credits instead of subscription.
export ANTHROPIC_API_KEY="sk-ant-..."
npm run dev

3. tmux Interactive Mode (Legacy)
Runs Claude in a tmux session (like Gastown). Slower (~30-60 seconds per step) but useful for debugging.
export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..."
export USE_TMUX=1
npm run dev
# Attach to see what Claude is doing:
tmux attach -t cold-agent-claude

Prompts are phrased as "analysis questions" (e.g., "I'm testing a web application and need to decide the next UI action") rather than role-playing directives, so they work smoothly with Claude Code's prompt handling.

Claude may return actions in various formats. The parser handles:
- Flat format: `{"action": {"type": "click", "target": "btn_1"}}`
- Nested format: `{"action": {"click": {"target": "btn_1"}}}`
- Shorthand format: `{"action": {"click": "btn_1"}}`
- Property aliases: `name` for `type`, `text` for `value`, etc.

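
To make the formats above concrete, here is a minimal normalization sketch that maps all of them to the flat shape. It is not the project's parser: `normalizeAction`, `renameAliases`, and the alias table (limited to the two aliases named above) are hypothetical, and a shorthand string is assumed to always mean `target`.

```typescript
// Only the two aliases named in the text; the real parser handles more.
const ALIASES: Record<string, string> = { name: "type", text: "value" };

function renameAliases(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[ALIASES[key] ?? key] = value;
  }
  return out;
}

// Convert flat, nested, or shorthand output into { type, ...params }.
function normalizeAction(raw: unknown): Record<string, unknown> {
  const action = (raw as { action?: unknown } | null)?.action;
  if (typeof action !== "object" || action === null) {
    throw new Error("missing action");
  }
  const obj = renameAliases(action as Record<string, unknown>);

  // Flat: {"type": "click", "target": "btn_1"}
  if (typeof obj["type"] === "string") return obj;

  const keys = Object.keys(obj);
  if (keys.length === 1) {
    const type = keys[0] as string;
    const body = obj[type];
    // Shorthand: {"click": "btn_1"} (assumes the string is the target)
    if (typeof body === "string") return { type, target: body };
    // Nested: {"click": {"target": "btn_1"}}
    if (typeof body === "object" && body !== null) {
      return { type, ...renameAliases(body as Record<string, unknown>) };
    }
  }
  throw new Error("unrecognized action format");
}
```

Normalizing to one shape up front means the rest of the loop only has to handle the flat format.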
# Run tests
npm test
# Watch mode
npm run test:watch
# Type check
npm run build

License: MIT