webclaw

REST API

The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON.

Base URL

Use the cloud endpoint for managed infrastructure, or point at your own instance when self-hosting.

bash
# Cloud (managed)
https://api.webclaw.io

# Self-hosted
http://localhost:3000

Authentication

All requests require an API key sent via the Authorization header.

http
Authorization: Bearer <api_key>

Cloud: Create API keys from your dashboard at webclaw.io. Keys are prefixed with wc_.

Self-hosted: Pass --api-key when starting the server, or set the WEBCLAW_API_KEY environment variable. If neither is set, the server runs without authentication.

Note
Self-hosted instances with no API key configured accept all requests. Set one before exposing the server to the internet.

Request format

All POST endpoints accept a JSON body. Set the Content-Type header accordingly.

http
Content-Type: application/json

Response format

All responses are JSON. Successful responses return the data directly. Errors use a consistent shape:

json
{
  "error": "Human-readable error message"
}
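Because errors always share this one shape, a client can branch on the presence of the `error` key. A minimal sketch (the helper name is ours, not part of any SDK):

```python
def unwrap(payload: dict) -> dict:
    """Raise if the payload is a webclaw error object, else return it as-is."""
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"webclaw API error: {payload['error']}")
    return payload

# Success payloads pass through untouched:
unwrap({"markdown": "# Example"})
```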

Rate limiting

Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API page for plan details.

Output formats

The /v1/scrape endpoint supports 9 output formats. Pass one or more in the formats array.

| Format | Description |
| --- | --- |
| markdown | Clean markdown with resolved URLs and collected assets. Default. |
| text | Plain text with no formatting. |
| json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| llm | LLM-optimized. A 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. ~67% fewer tokens than raw HTML. |
| screenshot | Base64-encoded PNG screenshot of the page. |
| links | Array of all extracted links. Each entry has text and href fields. |
| rawHtml | The raw HTML string from the pipeline, before any extraction processing. |
| attributes | Extract DOM attributes by CSS selector. Requires the attribute_selectors parameter. |
| query | Page-level Q&A with an LLM. Requires the query parameter. Returns query_answer in the response. |

Document parsing

webclaw auto-detects document types from Content-Type headers and file extensions. In addition to PDF, the following formats are supported natively and converted to your requested output format:

  • DOCX -- Microsoft Word documents
  • XLSX -- Excel spreadsheets (each sheet becomes a markdown table)
  • CSV -- Comma-separated values
Tip
No special parameters are needed. Point /v1/scrape at a .docx, .xlsx, or .csv URL and webclaw handles the rest.

YouTube transcript extraction

YouTube URLs (youtube.com/watch) are auto-detected and processed differently. webclaw extracts structured markdown containing the video title, channel name, view count, publish date, duration, description, and the full transcript text.

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "formats": ["markdown"]}'
Note
No special parameters needed. Just pass a YouTube watch URL and webclaw returns the transcript as structured markdown.

Endpoints

The full list of available endpoints.

Core endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/scrape | Single URL extraction. Supports browser actions, screenshots, mobile emulation, and document parsing (PDF, DOCX, XLSX, CSV). |
| POST | /v1/crawl | Start an async crawl with path filtering, subdomain/external link control, and webhook support. |
| GET | /v1/crawl/{id} | Poll crawl status. |
| GET | /v1/crawl/{id}/stream | Stream crawl results via Server-Sent Events (SSE). |
| GET | /v1/crawl/history | List all crawl jobs for your account. |
| POST | /v1/crawl/{id}/retry | Retry a failed or completed crawl with the same configuration. |
| POST | /v1/batch | Multi-URL extraction. |
| POST | /v1/map | Sitemap discovery. |
| POST | /v1/search | Web search with parallel scraping of result pages. |
| POST | /v1/agent-scrape | AI agentic scraper that navigates pages to achieve a goal. |
| POST | /v1/extract | LLM JSON extraction with prompt-to-schema generation. |
| POST | /v1/summarize | LLM summarization. |
| POST | /v1/diff | Content change tracking. |
| POST | /v1/brand | Brand identity extraction. |

Webhooks

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/webhooks | Register a new webhook. |
| GET | /v1/webhooks | List all registered webhooks. |
| PATCH | /v1/webhooks/{id} | Update a webhook configuration. |
| DELETE | /v1/webhooks/{id} | Delete a webhook. |

Firecrawl v2 compatibility

Drop-in replacements for Firecrawl v2 endpoints. Point existing Firecrawl SDKs at webclaw by changing the base URL.

| Method | Path | Description |
| --- | --- | --- |
| POST | /v2/scrape | Firecrawl-compatible single URL scrape. |
| POST | /v2/crawl | Firecrawl-compatible async crawl. |
| GET | /v2/crawl/{id} | Firecrawl-compatible crawl status polling. |
| POST | /v2/search | Firecrawl-compatible web search. |

Utility

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check + Ollama status. |

POST /v1/search

Perform a web search and optionally scrape all result pages in parallel. Combines discovery and extraction into a single call.

POST /v1/search

Search the web and scrape result pages in parallel.

Request body

json
{
  "query": "best rust web frameworks 2026",
  "num_results": 5,
  "scrape": true,
  "formats": ["markdown"],
  "country": "us",
  "lang": "en"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | The search query string. |
| num_results | number | No | Number of search results to return. Default: 5. |
| scrape | boolean | No | When true, scrapes each result page in parallel. Default: true. |
| formats | string[] | No | Output formats for scraped pages. Same options as /v1/scrape. Default: ["markdown"]. |
| country | string | No | Two-letter country code to localize results (e.g. "us", "gb", "de"). |
| lang | string | No | Two-letter language code for results (e.g. "en", "fr"). |

Example

curl
curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 3,
    "formats": ["llm"],
    "country": "us"
  }'

POST /v1/agent-scrape

An AI-powered agentic scraper that autonomously navigates pages, clicks elements, fills forms, and extracts data to achieve a stated goal.

POST /v1/agent-scrape

AI agent that navigates a site to accomplish a scraping goal.

Request body

json
{
  "url": "https://news.ycombinator.com",
  "goal": "Find the top 5 Show HN posts from today with their scores and comment counts",
  "max_steps": 5
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Starting URL for the agent. |
| goal | string | Yes | Natural language description of what data to extract or what action to perform. |
| max_steps | number | No | Maximum number of navigation steps the agent can take. Default: 5. |

Example

curl
curl -X POST https://api.webclaw.io/v1/agent-scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "goal": "Find the top 5 Show HN posts with their scores",
    "max_steps": 3
  }'
Note
The agent uses a headless browser under the hood. Each step counts toward your plan's page credit usage.

POST /v1/extract

Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. When only a prompt is provided (no schema), the LLM generates a JSON schema first, then extracts data matching it. The response includes a generated_schema field showing the auto-generated schema.

POST /v1/extract

LLM-powered structured data extraction with prompt-to-schema generation.

Prompt-to-schema example

curl
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with name, price, and features"
  }'

When only prompt is provided, the response includes both the extracted data and the generated schema:

Response with generated_schema
{
  "data": {
    "tiers": [
      {"name": "Free", "price": 0, "features": ["500 pages/month"]},
      {"name": "Pro", "price": 49, "features": ["100k pages/month", "Priority support"]}
    ]
  },
  "generated_schema": {
    "type": "object",
    "properties": {
      "tiers": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "features": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }
}
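The generated_schema field is plain JSON Schema, so you can inspect it with ordinary traversal code before trusting the extracted data. A small helper (ours, not part of the API) that flattens an object/array schema into dotted field paths:

```python
def schema_paths(schema: dict, prefix: str = "") -> list[str]:
    """Flatten a JSON Schema's object/array properties into dotted paths."""
    paths = []
    if schema.get("type") == "object":
        for name, sub in schema.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            paths.append(path)
            paths.extend(schema_paths(sub, path))
    elif schema.get("type") == "array":
        # Recurse into array items, marking the hop with []
        paths.extend(schema_paths(schema.get("items", {}), prefix + "[]"))
    return paths
```

Applied to the generated schema above, this yields tiers, tiers[].name, tiers[].price, and tiers[].features.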

Enhanced scrape parameters

The /v1/scrape endpoint supports additional parameters for browser automation, screenshots, mobile emulation, cache control, page-level Q&A, and attribute extraction.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| actions | object[] | No | Array of browser actions to execute before extraction (click, type, scroll, wait, press, executeJavascript, etc.). |
| screenshot | boolean | No | Capture a full-page screenshot. Returned as a base64-encoded PNG in the response. |
| mobile | boolean | No | Emulate a mobile device viewport and user agent. |
| no_cache | boolean | No | Bypass the response cache and force a fresh fetch. |
| query | string | No | Natural language question about the page. Used with the query format. The answer is returned in query_answer. |
| attribute_selectors | object[] | No | Array of {selector, attribute} pairs. Used with the attributes format to extract specific DOM attributes by CSS selector. |

Browser actions

Actions execute in order before content extraction. Useful for clicking cookie banners, expanding sections, filling forms, scrolling to load lazy content, pressing keyboard keys, or running custom JavaScript.

| Action | Fields | Description |
| --- | --- | --- |
| click | selector | Click an element matching the CSS selector. |
| type | selector, value | Type text into an input element. |
| scroll | direction | Scroll the page. Direction: up or down. |
| wait | milliseconds or selector | Wait for the specified number of milliseconds, or for an element matching the selector to appear. |
| screenshot | -- | Take a screenshot at this point in the action sequence. |
| press | key | Send a keyboard event. Supported keys: Enter, Tab, Escape, ArrowDown, ArrowUp, Space, Backspace. |
| executeJavascript | code | Run custom JavaScript on the page. Results are returned in the js_results array in the response. |

Browser actions example

json
{
  "url": "https://example.com/dashboard",
  "actions": [
    { "type": "wait", "selector": ".login-form" },
    { "type": "type", "selector": "#email", "value": "[email protected]" },
    { "type": "press", "key": "Tab" },
    { "type": "type", "selector": "#password", "value": "secret" },
    { "type": "press", "key": "Enter" },
    { "type": "wait", "milliseconds": 2000 },
    { "type": "executeJavascript", "code": "return document.title" },
    { "type": "scroll", "direction": "down" },
    { "type": "screenshot" }
  ],
  "screenshot": true,
  "formats": ["markdown"]
}
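A malformed action fails only once the request reaches the server, so a client-side sanity check can save a round trip. A sketch (ours; the required fields are transcribed from the actions table above):

```python
# Required fields per action type, as documented in the actions table.
ACTION_FIELDS = {
    "click": {"selector"},
    "type": {"selector", "value"},
    "scroll": {"direction"},
    "wait": set(),          # takes "milliseconds" or "selector", checked below
    "screenshot": set(),
    "press": {"key"},
    "executeJavascript": {"code"},
}

def check_actions(actions: list[dict]) -> None:
    """Client-side sanity check that each action carries its required fields."""
    for action in actions:
        kind = action.get("type")
        if kind not in ACTION_FIELDS:
            raise ValueError(f"unknown action type: {kind!r}")
        missing = ACTION_FIELDS[kind] - action.keys()
        if missing:
            raise ValueError(f"{kind} action missing fields: {sorted(missing)}")
        if kind == "wait" and not ({"milliseconds", "selector"} & action.keys()):
            raise ValueError("wait action needs milliseconds or selector")
```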

Query format example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "formats": ["query"],
    "query": "What is the price of the Pro plan?"
  }'

Attributes format example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["attributes"],
    "attribute_selectors": [
      {"selector": "img", "attribute": "alt"},
      {"selector": "a", "attribute": "href"}
    ]
  }'

Enhanced crawl parameters

The /v1/crawl endpoint supports path filtering, subdomain and external link control, and webhook notifications.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| allow_subdomains | boolean | No | Allow the crawler to follow links to subdomains of the starting URL. Default: false. |
| allow_external_links | boolean | No | Allow the crawler to follow links to external domains. Default: false. |
| include_paths | string[] | No | Glob patterns for paths to include (e.g. /docs/**). Only matching URLs will be crawled. |
| exclude_paths | string[] | No | Glob patterns for paths to exclude (e.g. /blog/**). Matching URLs will be skipped. |
| webhook_url | string | No | URL to receive a POST request when the crawl completes or fails. |
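The page doesn't pin down the server's exact glob semantics, but you can preview which URLs a pattern set would keep using Python's fnmatch (whose * crosses path separators, so /docs/** behaves like a prefix match). A rough client-side sketch, not the crawler's actual matcher:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def path_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Approximate include/exclude path filtering for a crawl frontier."""
    path = urlparse(url).path
    # An empty include list means "include everything".
    if include and not any(fnmatch(path, pat) for pat in include):
        return False
    return not any(fnmatch(path, pat) for pat in exclude)
```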

Crawl sub-endpoints

GET /v1/crawl/{id}/stream

Stream crawl results in real-time via Server-Sent Events. Each page result is sent as an SSE event as it completes.
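The event framing is standard SSE: data: lines accumulate until a blank line terminates the event. The payload contents aren't specified on this page, so this sketch only handles the generic framing over an already-decoded line stream:

```python
def parse_sse(stream):
    """Yield the data payload of each Server-Sent Event in a line stream."""
    data_lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:  # blank line terminates an event
            yield "\n".join(data_lines)
            data_lines = []
```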

GET /v1/crawl/history

List all crawl jobs for your account, ordered by creation date.

POST /v1/crawl/{id}/retry

Retry a failed or completed crawl using the same configuration. Returns a new crawl ID.

Enhanced crawl example

curl
curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com",
    "max_depth": 3,
    "max_pages": 200,
    "include_paths": ["/docs/**"],
    "exclude_paths": ["/docs/changelog/**"],
    "allow_subdomains": false,
    "webhook_url": "https://your-app.com/webhooks/crawl-done"
  }'
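If you don't use webhooks, the alternative is to poll GET /v1/crawl/{id} until the job settles. A sketch of such a loop; note the status field and the "completed"/"failed" values are illustrative assumptions, not taken from this page:

```python
import time

def wait_for_crawl(fetch_status, poll_seconds: float = 2.0, max_polls: int = 100):
    """Poll a crawl job until it leaves its in-progress states.

    fetch_status is any callable returning the parsed JSON status body,
    e.g. a closure around GET /v1/crawl/{id}. The terminal status values
    ("completed", "failed") are assumptions for illustration.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("crawl did not finish within the polling budget")
```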

Webhooks

Register webhook URLs to receive POST notifications when async operations (crawls, batch jobs) complete. Manage webhooks with full CRUD.

POST /v1/webhooks

Register a new webhook endpoint.

Create webhook request

json
{
  "url": "https://your-app.com/webhooks/webclaw",
  "events": ["crawl.completed", "crawl.failed", "batch.completed"]
}
GET /v1/webhooks

List all registered webhooks for your account.

PATCH /v1/webhooks/{id}

Update an existing webhook's URL or event subscriptions.

DELETE /v1/webhooks/{id}

Delete a webhook. It will no longer receive notifications.

Example

Register a webhook
curl -X POST https://api.webclaw.io/v1/webhooks \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-app.com/webhooks/webclaw",
    "events": ["crawl.completed"]
  }'
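On the receiving side, your app needs to route incoming POSTs by event name. A sketch of such a dispatcher; the payload shape is an assumption here (we expect an "event" key matching the names used at registration, e.g. crawl.completed), since this page doesn't document the notification body:

```python
import json

# Hypothetical handler registry keyed by the event names shown above.
HANDLERS = {}

def on(event):
    """Decorator registering a handler for one webhook event name."""
    def register(fn):
        HANDLERS[event] = fn
        return fn
    return register

def dispatch(raw_body: bytes):
    """Route an incoming webhook POST body to a handler by its event name.

    Assumes the body carries an "event" field; verify against the actual
    notification payload your instance sends.
    """
    payload = json.loads(raw_body)
    handler = HANDLERS.get(payload.get("event"))
    return handler(payload) if handler else None

@on("crawl.completed")
def crawl_done(payload):
    return f"crawl finished: {payload.get('id')}"
```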

Firecrawl v2 compatibility

These endpoints accept the same request/response shapes as Firecrawl v2. Migrate by changing your base URL to webclaw -- no code changes needed.

POST /v2/scrape

Firecrawl-compatible scrape. Accepts Firecrawl v2 request format and returns Firecrawl v2 response shape.

POST /v2/crawl

Firecrawl-compatible async crawl. Returns a job ID in Firecrawl v2 format.

GET /v2/crawl/{id}

Poll a Firecrawl-compatible crawl job for status and results.

POST /v2/search

Firecrawl-compatible web search with scraping.

Tip
To migrate from Firecrawl, replace https://api.firecrawl.dev with https://api.webclaw.io in your SDK configuration. Authentication works the same way -- just use your webclaw API key instead.

Quick example

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'