SiteSitter


The missing API for any website. Sits between any agent (human or AI) and any website. Compiles pages into structured data, enforces policy, and receipts every action.

Part of the Authensor Safety Stack

SiteSitter is part of Authensor — the open-source safety stack for AI agents. While Authensor handles action authorization and policy enforcement, SiteSitter provides web governance for browsing agents.

  • Authensor — Policy engine & control plane for agent action authorization
  • SafeClaw — Local agent gating with approval workflows
  • SiteSitter — Web governance for browsing agents (you are here)

The Problem

AI agents can browse the web. But they browse it blind:

  • No structure — An agent sees raw HTML. A "Submit" button on a search form and a "Submit" button on a wire transfer look identical.
  • No policy — Nothing stops an agent from clicking a dark-pattern upsell it didn't recognize. Agents fall for dark patterns 41% of the time.
  • No receipts — If an agent fills a form with your SSN, there's no record of what was sent, where, or why.
  • No memory — The agent doesn't know you saw this product cheaper elsewhere last week.

Humans have the same problems. We just suffer through them manually.

What SiteSitter Does

SiteSitter is a local HTTP server + browser extension that creates a governance layer for web browsing. Three things happen on every page:

1. Compile: HTML in, structured data out

Any webpage goes in. A typed, machine-readable Page IR comes out — entities, actions, risk levels, dark patterns, accessibility issues, provenance for every claim:

{
  "pageKind": "checkout",
  "entities": [
    { "type": "product", "name": "Wireless Headphones", "price": "$79.99", "confidence": 0.94 }
  ],
  "actions": [
    { "type": "purchase", "riskClass": "high_consequence", "requiresApproval": true },
    { "type": "add_warranty", "riskClass": "mutation", "darkPatterns": ["preselection"] }
  ]
}

No site-specific code needed. Works on any webpage via an 8-stage compilation pipeline with 16-stage universal extraction fallback.
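
For consumers written in TypeScript, the Page IR above could be modelled roughly as follows. This is a minimal sketch: the type names and the `actionsNeedingReview` helper are assumptions inferred from the JSON example, not the actual definitions in `@sitesitter/web-ir`.

```typescript
// Hypothetical types inferred from the JSON example above; the real
// definitions in @sitesitter/web-ir may differ.
type RiskClass = "read" | "soft_interaction" | "mutation" | "high_consequence";

interface Entity {
  type: string;
  name: string;
  price?: string;
  confidence: number;
}

interface Action {
  type: string;
  riskClass: RiskClass;
  requiresApproval?: boolean;
  darkPatterns?: string[];
}

interface PageIR {
  pageKind: string;
  entities: Entity[];
  actions: Action[];
}

// Return the actions an agent should not take on its own: anything marked
// as requiring approval or carrying at least one detected dark pattern.
function actionsNeedingReview(page: PageIR): Action[] {
  return page.actions.filter(
    (a) => a.requiresApproval === true || (a.darkPatterns?.length ?? 0) > 0,
  );
}

const checkout: PageIR = {
  pageKind: "checkout",
  entities: [
    { type: "product", name: "Wireless Headphones", price: "$79.99", confidence: 0.94 },
  ],
  actions: [
    { type: "purchase", riskClass: "high_consequence", requiresApproval: true },
    { type: "add_warranty", riskClass: "mutation", darkPatterns: ["preselection"] },
  ],
};

console.log(actionsNeedingReview(checkout).map((a) => a.type));
```

Both actions in the example are returned: the purchase because it requires approval, the warranty because it is flagged for preselection.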

2. Govern: Policy before action

Every action is evaluated against a browsing constitution: 26 enforceable rules across 6 categories (including privacy, safety, financial, healthcare, and consumer protection). Actions are risk-classified:

| Risk Class | Example | Requires |
| --- | --- | --- |
| read | View a page | Nothing |
| soft_interaction | Filter search results | Nothing |
| mutation | Submit a form | Policy check |
| high_consequence | Wire transfer, delete account | Human approval |

Dark patterns are detected and flagged. Agents get clean, classified data instead of adversarial HTML.
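
The risk classes listed above can be read as a simple lookup from class to required gate. The sketch below is illustrative only; the `Requirement` type and `requirementFor` function are hypothetical names, not part of the `@sitesitter/policy` API.

```typescript
type RiskClass = "read" | "soft_interaction" | "mutation" | "high_consequence";

// Hypothetical labels for the "Requires" column of the risk table.
type Requirement = "none" | "policy_check" | "human_approval";

// One entry per risk class; Record<RiskClass, ...> makes the compiler
// reject the table if a class is ever added without a requirement.
const REQUIREMENTS: Record<RiskClass, Requirement> = {
  read: "none",
  soft_interaction: "none",
  mutation: "policy_check",
  high_consequence: "human_approval",
};

function requirementFor(riskClass: RiskClass): Requirement {
  return REQUIREMENTS[riskClass];
}

console.log(requirementFor("mutation")); // policy_check
```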

3. Receipt: Cryptographic proof of everything

Every observation, policy decision, and action produces an immutable receipt in a content-addressed chain:

  • Who saw what, when
  • What was proposed, what policy said, what happened
  • W3C Verifiable Credentials with Ed25519 proofs
  • Merkle tree verification for any subset
  • GDPR-compliant erasure via key deletion
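
To make "content-addressed chain" concrete, here is a minimal sketch of hash-linked receipts using Node's built-in `crypto` module. The `Receipt` shape and the function names are assumptions made for illustration; the real format in `@sitesitter/receipts` (W3C VCs, Ed25519 proofs, Merkle trees) is richer than this.

```typescript
import { createHash } from "node:crypto";

// Hypothetical receipt shape, for illustration only.
interface Receipt {
  id: string; // SHA-256 over the receipt content: its content address
  prev: string | null; // id of the previous receipt, null for the first
  kind: "observation" | "policy_decision" | "action";
  payload: string;
  timestamp: string;
}

function hashReceipt(
  prev: string | null,
  kind: string,
  payload: string,
  timestamp: string,
): string {
  // Hash a fixed-order array so the digest does not depend on key order.
  return createHash("sha256")
    .update(JSON.stringify([prev, kind, payload, timestamp]))
    .digest("hex");
}

// Append without mutating the input chain.
function appendReceipt(
  chain: Receipt[],
  kind: Receipt["kind"],
  payload: string,
  timestamp: string,
): Receipt[] {
  const prev = chain.length > 0 ? chain[chain.length - 1].id : null;
  const id = hashReceipt(prev, kind, payload, timestamp);
  return [...chain, { id, prev, kind, payload, timestamp }];
}

// A chain is valid when every id matches its content and every `prev`
// points at the receipt immediately before it.
function verifyChain(chain: Receipt[]): boolean {
  return chain.every((r, i) => {
    const expectedPrev = i === 0 ? null : chain[i - 1].id;
    return (
      r.prev === expectedPrev &&
      r.id === hashReceipt(r.prev, r.kind, r.payload, r.timestamp)
    );
  });
}
```

Because each id covers the previous id, editing any earlier receipt invalidates every receipt after it, which is what makes the chain tamper-evident.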

Quick Start

Option A: Server + Browser Extension (interactive use)

git clone https://github.com/AUTHENSOR/SiteSitter.git
cd SiteSitter
pnpm install && pnpm build

# Start the governance server
node packages/runtime/dist/cli.js --db-path ./data.db

# Load the Chrome extension:
# chrome://extensions → Developer mode → Load unpacked → extensions/chrome/
# Navigate to any page → click SiteSitter → Capture

Option B: CLI (headless / scripting)

# Compile raw HTML into Page IR
node packages/cli/dist/main.js compile-html page.html --url https://example.com

# Accessibility audit
node packages/cli/dist/main.js audit bundle.json

# Dark pattern scan with regulatory citations
node packages/cli/dist/main.js compliance-report bundle.json

Option C: MCP Server (plug into Claude or any MCP client)

node packages/mcp-server/dist/cli.js

Every extracted action becomes an MCP tool with risk metadata. Claude sees structured entities and policy-evaluated actions instead of raw HTML.
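
As a rough illustration of that mapping, the sketch below turns one extracted action into an MCP-style tool descriptor. The `name`, `description`, and `inputSchema` fields are the standard parts of an MCP tool definition; the `sitesitter_` name prefix, the choice of annotation hints, and the `toToolDescriptor` helper are assumptions, and the actual bridge in `@sitesitter/mcp-server` may encode risk metadata differently.

```typescript
type RiskClass = "read" | "soft_interaction" | "mutation" | "high_consequence";

interface ExtractedAction {
  type: string;
  riskClass: RiskClass;
}

// Subset of an MCP tool definition, kept minimal for the sketch.
interface ToolDescriptor {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: Record<string, unknown> };
  annotations: { readOnlyHint: boolean; destructiveHint: boolean };
}

// Hypothetical mapping: the risk class is surfaced in the description and
// approximated with the readOnly/destructive annotation hints.
function toToolDescriptor(action: ExtractedAction): ToolDescriptor {
  return {
    name: `sitesitter_${action.type}`,
    description: `Perform "${action.type}" (risk class: ${action.riskClass})`,
    inputSchema: { type: "object", properties: {} },
    annotations: {
      readOnlyHint: action.riskClass === "read",
      destructiveHint: action.riskClass === "high_consequence",
    },
  };
}

console.log(toToolDescriptor({ type: "purchase", riskClass: "high_consequence" }));
```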

Option D: Docker (self-hosted)

# One-command start with persistent data
docker compose up -d

# Server is running at http://localhost:3838
curl http://localhost:3838/health

Data persists in a Docker volume. Set SPIRO_AUTH_TOKEN in your environment for token-based auth.

Option E: HTTP API (integrate with any agent)

# Compile a page
curl -X POST http://localhost:3838/compile \
  -H "Content-Type: application/json" \
  -d @observation-bundle.json

# Evaluate an action against policy
curl -X POST http://localhost:3838/evaluate \
  -H "Content-Type: application/json" \
  -d '{"action": {"actionType": "purchase", "riskClass": "high_consequence"}, "context": {"url": "https://shop.example.com"}}'

# Get receipts
curl http://localhost:3838/receipts

90+ API endpoints. Full reference in CLAUDE.md.
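
From TypeScript, the `/evaluate` call shown above could be wrapped like this. The request body mirrors the curl example; the response shape is not documented in this README, so the sketch returns it as `unknown`. The helper names and the `FetchLike` type are hypothetical, and no auth header is included because the header scheme is not specified here.

```typescript
// Request body shape copied from the curl example above.
interface EvaluateRequest {
  action: { actionType: string; riskClass: string };
  context: { url: string };
}

interface HttpInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal structural type so a fake can be injected in tests; the global
// fetch in Node 20 is compatible with it.
type FetchLike = (
  url: string,
  init: HttpInit,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the URL and request options without performing any I/O.
function buildEvaluateCall(
  baseUrl: string,
  request: EvaluateRequest,
): { url: string; init: HttpInit } {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/evaluate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(request),
    },
  };
}

// Send the request and return the parsed JSON body as-is.
async function evaluateAction(
  fetchImpl: FetchLike,
  baseUrl: string,
  request: EvaluateRequest,
): Promise<unknown> {
  const { url, init } = buildEvaluateCall(baseUrl, request);
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`POST /evaluate failed with status ${response.status}`);
  }
  return response.json();
}
```

With a server running locally, a call would look like `evaluateAction(fetch, "http://localhost:3838", { action: { actionType: "purchase", riskClass: "high_consequence" }, context: { url: "https://shop.example.com" } })`.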


Why This Doesn't Exist Yet

| Layer | Without SiteSitter | With SiteSitter |
| --- | --- | --- |
| What agents see | Raw HTML, adversarial CSS, dark patterns | Structured entities, classified actions, risk levels |
| Policy enforcement | None: agents act freely | Constitutional rules evaluated before every action |
| Dark pattern defense | None: agents are more susceptible than humans | 8-category detection with FTC/EU DSA citations |
| Audit trail | None: browsers have no accountability | Immutable receipt chain with Merkle proofs |
| Memory | None: every session starts blank | Entity fingerprinting, cross-site matching, temporal recall |
| Approval gates | None: no distinction between read and delete | Risk-classified actions with human-in-the-loop for high consequence |

Existing tools solve fragments: ad blockers filter, accessibility tools audit, browser automation acts. Nothing governs the full loop from observation to action to proof.


Architecture

Browser Extension (capture)
       │
       ▼ ObservationBundle (DOM + AX tree + screenshots)
┌──────────────────────┐
│   Compiler (8 stages) │ ── adapters, site families, dark pattern detection
└──────────┬───────────┘
           ▼ Page IR (entities, actions, regions, provenance)
┌──────────────────────┐
│   Policy Engine       │ ── 26 constitutional rules, risk classification
└──────────┬───────────┘
           ▼ PolicyEvaluation (allow / escalate / block)
┌──────────────────────┐
│   Runtime Server      │ ── HTTP API, SQLite, approval queue, SSE, auth
└──────────┬───────────┘
           ▼ Receipt (content-addressed, Merkle tree, W3C VC)
┌──────────────────────┐
│   Replay / Eval       │ ── benchmark, trace, diff, remix views
└──────────────────────┘

Packages

| Package | What it does |
| --- | --- |
| @sitesitter/web-ir | Core types: Page, Entity, Action, Region, Provenance, State |
| @sitesitter/compiler | 8-stage pipeline: classify, extract, detect dark patterns, validate |
| @sitesitter/policy | Constitutional browsing rules, risk evaluation, approval gates |
| @sitesitter/receipts | Immutable receipt chain, Merkle tree, W3C VCs, GDPR erasure |
| @sitesitter/runtime | HTTP server, SQLite, SSE, search, memory, reputation, federation |
| @sitesitter/replay | Record/replay, benchmarks, traces, differential testing |
| @sitesitter/adapters | Site-specific extraction, 20 built-in adapters, federated registry |
| @sitesitter/mcp-server | MCP bridge for Claude and compatible clients |
| @sitesitter/cli | 33+ headless commands for compilation, audit, and analysis |
| @sitesitter/playwright | Playwright integration: compile pages in E2E tests |
| @sitesitter/remix | 12 alternative HTML views (readable, spreadsheet, voice-nav, etc.) |
| Chrome extension | MV3: capture, inspect, approve, investigate |
| Firefox extension | MV2: cross-browser parity |

Key Capabilities

Compilation — Any webpage to structured IR. No per-site code. 16-stage universal extraction as fallback. Site family grammars for common patterns (e-commerce, news, social, forums, SaaS, healthcare, finance, academic).

Dark Patterns — 8 categories (confirmshaming, urgency, preselection, sneaking, obstruction, trick wording, social proof, misdirection). Regulatory references (FTC Act, EU DSA Art. 25, GDPR, CCPA). Evidence packages for filing complaints.

Streaming — MutationObserver-based real-time capture. Incremental compilation. Server-Sent Events push updates to clients. Works with SPAs and infinite scroll.

Memory — Entity fingerprinting across sites and time. Ebbinghaus-inspired forgetting engine. Deja vu detection ("you saw this product 3 weeks ago for $47 less"). Workflow memory for repeated tasks.
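
One common reading of an "Ebbinghaus-inspired" forgetting rule is exponential decay of retention over time, R = exp(-t / S), where t is the time since the memory was last reinforced and S is a stability constant. The sketch below assumes that model; the function names, the stability parameter, and the 0.1 threshold are illustrative and are not taken from SiteSitter's memory engine.

```typescript
// Retention under the classic exponential forgetting curve: 1.0 right
// after an observation, decaying toward 0 as time passes.
function retention(elapsedDays: number, stabilityDays: number): number {
  return Math.exp(-elapsedDays / stabilityDays);
}

// Hypothetical eviction rule: drop a memory once its retention falls
// below a threshold.
function shouldForget(
  elapsedDays: number,
  stabilityDays: number,
  threshold = 0.1,
): boolean {
  return retention(elapsedDays, stabilityDays) < threshold;
}

console.log(retention(7, 7).toFixed(3)); // 0.368, i.e. 1/e after one stability period
console.log(shouldForget(30, 7)); // true
```

A larger stability value keeps a memory alive longer, so an engine of this kind would typically raise it each time the same entity is seen again.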

Federation — Federated dark pattern observatory with Ed25519-signed reports. Peer-to-peer receipt witnesses. Consensus health scoring across instances.
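
Signing and verifying a report with Ed25519 can be sketched with Node's built-in `crypto` module. The report fields and helper names below are hypothetical, and serializing with `JSON.stringify` is a simplification: a real federation protocol would need a canonical encoding so that all peers hash identical bytes.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical report shape, for illustration only.
interface DarkPatternReport {
  url: string;
  category: string;
  observedAt: string;
}

function serialize(report: DarkPatternReport): Buffer {
  // Fixed field order keeps the bytes stable for this sketch.
  return Buffer.from(
    JSON.stringify([report.url, report.category, report.observedAt]),
  );
}

// Ed25519 hashes internally, so the algorithm argument is null.
function signReport(report: DarkPatternReport, privateKey: KeyObject): Buffer {
  return sign(null, serialize(report), privateKey);
}

function verifyReport(
  report: DarkPatternReport,
  signature: Buffer,
  publicKey: KeyObject,
): boolean {
  return verify(null, serialize(report), publicKey, signature);
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const report: DarkPatternReport = {
  url: "https://shop.example.com/checkout",
  category: "preselection",
  observedAt: "2024-01-01T00:00:00Z",
};
const signature = signReport(report, privateKey);
console.log(verifyReport(report, signature, publicKey)); // true
```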

Governance — Self-writing policies that learn from your decisions. Consent auto-handler. Web reputation scoring (8 dimensions). All learned rules are user-overridable.

Compliance — EU AI Act audit logging. GDPR receipt erasure. WCAG accessibility audits. FINRA/SOX/EDRM evidence exports. Berkeley Protocol case files.

Security — Prompt injection sanitizer for LLM pipelines. Ed25519 adapter signing with enforced verification. AES-256-GCM credential vault. Configurable CORS. Rate limiting. Circuit breakers.

Security Model

  • Read-only by default. Mutation is opt-in.
  • High-consequence actions always require human approval.
  • Websites are treated as adversarial.
  • Inference is separated from execution.
  • Provenance on every claim. No silent escalations.
  • Constitutional rules enforced below the prompt level.
  • SiteSitter does not "solve prompt injection." It contains, measures, and governs around it.
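
The first two principles can be expressed as a small decision function. This is a sketch under stated assumptions: the `PolicyConfig` shape and `decide` function are hypothetical, the allow / escalate / block outcomes are taken from the architecture diagram, and the real engine also evaluates constitutional rules before allowing a mutation.

```typescript
type RiskClass = "read" | "soft_interaction" | "mutation" | "high_consequence";
type Decision = "allow" | "escalate" | "block";

// Hypothetical configuration: mutations stay off unless explicitly enabled.
interface PolicyConfig {
  allowMutations: boolean;
}

const DEFAULT_CONFIG: PolicyConfig = { allowMutations: false };

function decide(
  riskClass: RiskClass,
  config: PolicyConfig = DEFAULT_CONFIG,
): Decision {
  switch (riskClass) {
    case "read":
    case "soft_interaction":
      return "allow";
    case "mutation":
      // Read-only by default: mutations are blocked until the user opts in.
      return config.allowMutations ? "allow" : "block";
    case "high_consequence":
      // Always routed to a human, regardless of configuration.
      return "escalate";
  }
}

console.log(decide("mutation")); // block
console.log(decide("mutation", { allowMutations: true })); // allow
console.log(decide("high_consequence", { allowMutations: true })); // escalate
```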

Development

pnpm install          # install dependencies
pnpm build            # build all packages
pnpm test             # 1,003 tests
pnpm typecheck        # 19 typecheck tasks
pnpm lint             # ESLint
pnpm format           # Prettier

Requires Node.js >= 20 and pnpm >= 9.

Contributing

We welcome contributions! See CONTRIBUTING.md for setup instructions and PR process.

For architecture documentation and API reference, see CLAUDE.md.

License

SiteSitter is licensed under the Apache License 2.0 (LICENSE-APACHE).

All packages, including the runtime server, are Apache 2.0. You can freely use, modify, and distribute SiteSitter for any purpose.
