webclaw

REST API

The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON.

Base URL

Use the cloud endpoint for managed infrastructure, or point at your own instance when self-hosting.

bash
# Cloud (managed)
https://api.webclaw.io

# Self-hosted
http://localhost:3000

Authentication

All requests require an API key sent via the Authorization header.

http
Authorization: Bearer <api_key>

Cloud: Create API keys from your dashboard at webclaw.io. Keys are prefixed with wc_.

Self-hosted: Pass --api-key when starting the server, or set the WEBCLAW_API_KEY environment variable. If neither is set, the server runs without authentication.

Note
Self-hosted instances with no API key configured accept all requests. Set one before exposing the server to the internet.

Request format

All POST endpoints accept a JSON body. Set the Content-Type header accordingly.

http
Content-Type: application/json

Response format

All responses are JSON. Successful responses return the data directly. Errors use a consistent shape:

json
{
  "error": "Human-readable error message"
}
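Because errors always share this one shape, a client can branch on the presence of the `error` key. A minimal sketch (the helper name is ours, not part of any SDK):

```python
def unwrap(payload: dict) -> dict:
    """Raise if the payload is a webclaw error object, else return it as-is."""
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"webclaw API error: {payload['error']}")
    return payload

# Success payloads pass through untouched:
unwrap({"markdown": "# Example"})
```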

Rate limiting

Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API page for plan details.

Output formats

The /v1/scrape endpoint supports 9 output formats. Pass one or more in the formats array.

| Format | Description |
| --- | --- |
| markdown | Clean markdown with resolved URLs and collected assets. Default. |
| text | Plain text with no formatting. |
| json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| llm | LLM-optimized. A 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. ~67% fewer tokens than raw HTML. |
| screenshot | Base64-encoded PNG screenshot of the page. |
| links | Array of all extracted links. Each entry has text and href fields. |
| rawHtml | The raw HTML string from the pipeline, before any extraction processing. |
| attributes | Extract DOM attributes by CSS selector. Requires the attribute_selectors parameter. |
| query | Page-level Q&A with an LLM. Requires the query parameter. Returns query_answer in the response. |

Document parsing

webclaw auto-detects document types from Content-Type headers and file extensions. In addition to PDF, the following formats are supported natively and converted to your requested output format:

  • DOCX -- Microsoft Word documents
  • XLSX -- Excel spreadsheets (each sheet becomes a markdown table)
  • CSV -- Comma-separated values
Tip
No special parameters are needed. Point /v1/scrape at a .docx, .xlsx, or .csv URL and webclaw handles the rest.

YouTube transcript extraction

YouTube URLs (youtube.com/watch) are auto-detected and processed differently. webclaw extracts structured markdown containing the video title, channel name, view count, publish date, duration, description, and the full transcript text.

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "formats": ["markdown"]}'
Note
No special parameters needed. Just pass a YouTube watch URL and webclaw returns the transcript as structured markdown.

Endpoints

The full list of available endpoints.

Core endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/scrape | Single URL extraction. Supports browser actions, screenshots, mobile emulation, and document parsing (PDF, DOCX, XLSX, CSV). |
| POST | /v1/crawl | Start an async crawl with path filtering, subdomain/external link control, and webhook support. |
| GET | /v1/crawl/{id} | Poll crawl status. |
| GET | /v1/crawl/{id}/stream | Stream crawl results via Server-Sent Events (SSE). |
| GET | /v1/crawl/history | List all crawl jobs for your account. |
| POST | /v1/crawl/{id}/retry | Retry a failed or completed crawl with the same configuration. |
| POST | /v1/batch | Multi-URL extraction. |
| POST | /v1/map | Sitemap discovery. |
| POST | /v1/search | Web search with parallel scraping of result pages. |
| POST | /v1/agent-scrape | AI agentic scraper that navigates pages to achieve a goal. |
| POST | /v1/extract | LLM JSON extraction with prompt-to-schema generation. |
| POST | /v1/summarize | LLM summarization. |
| POST | /v1/diff | Content change tracking. |
| POST | /v1/brand | Brand identity extraction. |

Webhooks

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/webhooks | Register a new webhook. |
| GET | /v1/webhooks | List all registered webhooks. |
| PATCH | /v1/webhooks/{id} | Update a webhook configuration. |
| DELETE | /v1/webhooks/{id} | Delete a webhook. |

Firecrawl v2 compatibility

Drop-in replacements for Firecrawl v2 endpoints. Point existing Firecrawl SDKs at webclaw by changing the base URL.

| Method | Path | Description |
| --- | --- | --- |
| POST | /v2/scrape | Firecrawl-compatible single URL scrape. |
| POST | /v2/crawl | Firecrawl-compatible async crawl. |
| GET | /v2/crawl/{id} | Firecrawl-compatible crawl status polling. |
| POST | /v2/search | Firecrawl-compatible web search. |

Utility

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check + Ollama status. |

POST /v1/search

Perform a web search and optionally scrape all result pages in parallel. Combines discovery and extraction into a single call.

POST /v1/search

Search the web and scrape result pages in parallel.

Request body

json
{
  "query": "best rust web frameworks 2026",
  "num_results": 5,
  "scrape": true,
  "formats": ["markdown"],
  "country": "us",
  "lang": "en"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | The search query string. |
| num_results | number | No | Number of search results to return. Default: 5. |
| scrape | boolean | No | When true, scrapes each result page in parallel. Default: true. |
| formats | string[] | No | Output formats for scraped pages. Same options as /v1/scrape. Default: ["markdown"]. |
| country | string | No | Two-letter country code to localize results (e.g. "us", "gb", "de"). |
| lang | string | No | Two-letter language code for results (e.g. "en", "fr"). |

Example

curl
curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 3,
    "formats": ["llm"],
    "country": "us"
  }'

POST /v1/agent-scrape

An AI-powered agentic scraper that autonomously navigates pages, clicks elements, fills forms, and extracts data to achieve a stated goal.

POST /v1/agent-scrape

AI agent that navigates a site to accomplish a scraping goal.

Request body

json
{
  "url": "https://news.ycombinator.com",
  "goal": "Find the top 5 Show HN posts from today with their scores and comment counts",
  "max_steps": 5
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Starting URL for the agent. |
| goal | string | Yes | Natural language description of what data to extract or what action to perform. |
| max_steps | number | No | Maximum number of navigation steps the agent can take. Default: 5. |

Example

curl
curl -X POST https://api.webclaw.io/v1/agent-scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "goal": "Find the top 5 Show HN posts with their scores",
    "max_steps": 3
  }'
Note
The agent uses a headless browser under the hood. Each step counts toward your plan's page credit usage.

POST /v1/extract

Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. When only a prompt is provided (no schema), the LLM generates a JSON schema first, then extracts data matching it. The response includes a generated_schema field showing the auto-generated schema.

POST /v1/extract

LLM-powered structured data extraction with prompt-to-schema generation.

Prompt-to-schema example

curl
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with name, price, and features"
  }'

When only prompt is provided, the response includes both the extracted data and the generated schema:

Response with generated_schema
{
  "data": {
    "tiers": [
      {"name": "Free", "price": 0, "features": ["500 pages/month"]},
      {"name": "Pro", "price": 49, "features": ["100k pages/month", "Priority support"]}
    ]
  },
  "generated_schema": {
    "type": "object",
    "properties": {
      "tiers": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "features": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }
}
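The generated_schema field is plain JSON Schema, so you can inspect it with ordinary traversal code before trusting the extracted data. A small helper (ours, not part of the API) that flattens an object/array schema into dotted field paths:

```python
def schema_paths(schema: dict, prefix: str = "") -> list[str]:
    """Flatten a JSON Schema's object/array properties into dotted paths."""
    paths = []
    if schema.get("type") == "object":
        for name, sub in schema.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            paths.append(path)
            paths.extend(schema_paths(sub, path))
    elif schema.get("type") == "array":
        # Recurse into array items, marking the hop with []
        paths.extend(schema_paths(schema.get("items", {}), prefix + "[]"))
    return paths
```

Applied to the generated schema above, this yields tiers, tiers[].name, tiers[].price, and tiers[].features.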

Enhanced scrape parameters

The /v1/scrape endpoint supports additional parameters for browser automation, screenshots, mobile emulation, cache control, page-level Q&A, and attribute extraction.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| actions | object[] | No | Array of browser actions to execute before extraction (click, type, scroll, wait, press, executeJavascript, etc.). |
| screenshot | boolean | No | Capture a full-page screenshot. Returned as a base64-encoded PNG in the response. |
| mobile | boolean | No | Emulate a mobile device viewport and user agent. |
| no_cache | boolean | No | Bypass the response cache and force a fresh fetch. |
| query | string | No | Natural language question about the page. Used with the query format. The answer is returned in query_answer. |
| attribute_selectors | object[] | No | Array of {selector, attribute} pairs. Used with the attributes format to extract specific DOM attributes by CSS selector. |

Browser actions

Actions execute in order before content extraction. Useful for clicking cookie banners, expanding sections, filling forms, scrolling to load lazy content, pressing keyboard keys, or running custom JavaScript.

| Action | Fields | Description |
| --- | --- | --- |
| click | selector | Click an element matching the CSS selector. |
| type | selector, value | Type text into an input element. |
| scroll | direction | Scroll the page. Direction: up or down. |
| wait | milliseconds or selector | Wait for the specified number of milliseconds, or for an element matching the selector to appear. |
| screenshot | -- | Take a screenshot at this point in the action sequence. |
| press | key | Send a keyboard event. Supported keys: Enter, Tab, Escape, ArrowDown, ArrowUp, Space, Backspace. |
| executeJavascript | code | Run custom JavaScript on the page. Results are returned in the js_results array in the response. |

Browser actions example

json
{
  "url": "https://example.com/dashboard",
  "actions": [
    { "type": "wait", "selector": ".login-form" },
    { "type": "type", "selector": "#email", "value": "[email protected]" },
    { "type": "press", "key": "Tab" },
    { "type": "type", "selector": "#password", "value": "secret" },
    { "type": "press", "key": "Enter" },
    { "type": "wait", "milliseconds": 2000 },
    { "type": "executeJavascript", "code": "return document.title" },
    { "type": "scroll", "direction": "down" },
    { "type": "screenshot" }
  ],
  "screenshot": true,
  "formats": ["markdown"]
}
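A malformed action fails only once the request reaches the server, so a client-side sanity check can save a round trip. A sketch (ours; the required fields are transcribed from the actions table above):

```python
# Required fields per action type, as documented in the actions table.
ACTION_FIELDS = {
    "click": {"selector"},
    "type": {"selector", "value"},
    "scroll": {"direction"},
    "wait": set(),          # takes "milliseconds" or "selector", checked below
    "screenshot": set(),
    "press": {"key"},
    "executeJavascript": {"code"},
}

def check_actions(actions: list[dict]) -> None:
    """Client-side sanity check that each action carries its required fields."""
    for action in actions:
        kind = action.get("type")
        if kind not in ACTION_FIELDS:
            raise ValueError(f"unknown action type: {kind!r}")
        missing = ACTION_FIELDS[kind] - action.keys()
        if missing:
            raise ValueError(f"{kind} action missing fields: {sorted(missing)}")
        if kind == "wait" and not ({"milliseconds", "selector"} & action.keys()):
            raise ValueError("wait action needs milliseconds or selector")
```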

Query format example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "formats": ["query"],
    "query": "What is the price of the Pro plan?"
  }'

Attributes format example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["attributes"],
    "attribute_selectors": [
      {"selector": "img", "attribute": "alt"},
      {"selector": "a", "attribute": "href"}
    ]
  }'

Enhanced crawl parameters

The /v1/crawl endpoint supports path filtering, subdomain and external link control, and webhook notifications.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| allow_subdomains | boolean | No | Allow the crawler to follow links to subdomains of the starting URL. Default: false. |
| allow_external_links | boolean | No | Allow the crawler to follow links to external domains. Default: false. |
| include_paths | string[] | No | Glob patterns for paths to include (e.g. /docs/**). Only matching URLs will be crawled. |
| exclude_paths | string[] | No | Glob patterns for paths to exclude (e.g. /blog/**). Matching URLs will be skipped. |
| webhook_url | string | No | URL to receive a POST request when the crawl completes or fails. |
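The page doesn't pin down the server's exact glob semantics, but you can preview which URLs a pattern set would keep using Python's fnmatch (whose * crosses path separators, so /docs/** behaves like a prefix match). A rough client-side sketch, not the crawler's actual matcher:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def path_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Approximate include/exclude path filtering for a crawl frontier."""
    path = urlparse(url).path
    # An empty include list means "include everything".
    if include and not any(fnmatch(path, pat) for pat in include):
        return False
    return not any(fnmatch(path, pat) for pat in exclude)
```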

Crawl sub-endpoints

GET /v1/crawl/{id}/stream

Stream crawl results in real-time via Server-Sent Events. Each page result is sent as an SSE event as it completes.
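The event framing is standard SSE: data: lines accumulate until a blank line terminates the event. The payload contents aren't specified on this page, so this sketch only handles the generic framing over an already-decoded line stream:

```python
def parse_sse(stream):
    """Yield the data payload of each Server-Sent Event in a line stream."""
    data_lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:  # blank line terminates an event
            yield "\n".join(data_lines)
            data_lines = []
```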

GET /v1/crawl/history

List all crawl jobs for your account, ordered by creation date.

POST /v1/crawl/{id}/retry

Retry a failed or completed crawl using the same configuration. Returns a new crawl ID.

Enhanced crawl example

curl
curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com",
    "max_depth": 3,
    "max_pages": 200,
    "include_paths": ["/docs/**"],
    "exclude_paths": ["/docs/changelog/**"],
    "allow_subdomains": false,
    "webhook_url": "https://your-app.com/webhooks/crawl-done"
  }'
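If you don't use webhooks, the alternative is to poll GET /v1/crawl/{id} until the job settles. A sketch of such a loop; note the status field and the "completed"/"failed" values are illustrative assumptions, not taken from this page:

```python
import time

def wait_for_crawl(fetch_status, poll_seconds: float = 2.0, max_polls: int = 100):
    """Poll a crawl job until it leaves its in-progress states.

    fetch_status is any callable returning the parsed JSON status body,
    e.g. a closure around GET /v1/crawl/{id}. The terminal status values
    ("completed", "failed") are assumptions for illustration.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("crawl did not finish within the polling budget")
```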

Webhooks

Register webhook URLs to receive POST notifications when async operations (crawls, batch jobs) complete. Manage webhooks with full CRUD.

POST /v1/webhooks

Register a new webhook endpoint.

Create webhook request

json
{
  "url": "https://your-app.com/webhooks/webclaw",
  "events": ["crawl.completed", "crawl.failed", "batch.completed"]
}
GET /v1/webhooks

List all registered webhooks for your account.

PATCH /v1/webhooks/{id}

Update an existing webhook's URL or event subscriptions.

DELETE /v1/webhooks/{id}

Delete a webhook. It will no longer receive notifications.

Example

Register a webhook
curl -X POST https://api.webclaw.io/v1/webhooks \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-app.com/webhooks/webclaw",
    "events": ["crawl.completed"]
  }'
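On the receiving side, your app needs to route incoming POSTs by event name. A sketch of such a dispatcher; the payload shape is an assumption here (we expect an "event" key matching the names used at registration, e.g. crawl.completed), since this page doesn't document the notification body:

```python
import json

# Hypothetical handler registry keyed by the event names shown above.
HANDLERS = {}

def on(event):
    """Decorator registering a handler for one webhook event name."""
    def register(fn):
        HANDLERS[event] = fn
        return fn
    return register

def dispatch(raw_body: bytes):
    """Route an incoming webhook POST body to a handler by its event name.

    Assumes the body carries an "event" field; verify against the actual
    notification payload your instance sends.
    """
    payload = json.loads(raw_body)
    handler = HANDLERS.get(payload.get("event"))
    return handler(payload) if handler else None

@on("crawl.completed")
def crawl_done(payload):
    return f"crawl finished: {payload.get('id')}"
```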

Firecrawl v2 compatibility

These endpoints accept the same request/response shapes as Firecrawl v2. Migrate by changing your base URL to webclaw -- no code changes needed.

POST /v2/scrape

Firecrawl-compatible scrape. Accepts Firecrawl v2 request format and returns Firecrawl v2 response shape.

POST /v2/crawl

Firecrawl-compatible async crawl. Returns a job ID in Firecrawl v2 format.

GET /v2/crawl/{id}

Poll a Firecrawl-compatible crawl job for status and results.

POST /v2/search

Firecrawl-compatible web search with scraping.

Tip
To migrate from Firecrawl, replace https://api.firecrawl.dev with https://api.webclaw.io in your SDK configuration. Authentication works the same way -- just use your webclaw API key instead.

Quick example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'