REST API
The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON.
Base URL
Use the cloud endpoint, https://api.webclaw.io, for managed infrastructure, or point at your own instance when self-hosting.
Authentication
All requests require an API key sent via the Authorization header.
Cloud: Create API keys from your dashboard at webclaw.io. Keys are prefixed with wc_.
Self-hosted: Pass --api-key when starting the server, or set the WEBCLAW_API_KEY environment variable. If neither is set, the server runs without authentication.
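As a minimal sketch, here is how a client might attach the key using only the Python standard library. The Bearer scheme is an assumption — the docs above only say the key goes in the Authorization header, so check your dashboard for the exact format.

```python
import json
import urllib.request

API_KEY = "wc_your_api_key"  # placeholder; real keys come from the webclaw.io dashboard

def authed_request(url: str, body: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request.
    The Bearer scheme is an assumption -- the docs only state that the
    key is sent via the Authorization header."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = authed_request("https://api.webclaw.io/v1/scrape", {"url": "https://example.com"})
# Nothing is sent until you pass req to urllib.request.urlopen().
```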
Request format
All POST endpoints accept a JSON body. Set the Content-Type header to application/json.
Response format
All responses are JSON. Successful responses return the data directly; errors use a consistent JSON error shape.
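The exact error fields are not reproduced here. As a purely illustrative sketch — the envelope and field names below are assumptions, not the documented shape — a client-side handler might look like:

```python
import json

def parse_error(body: str) -> str:
    """Extract a human-readable message from an error response.
    The {"error": {"code": ..., "message": ...}} envelope is an
    assumption for illustration -- consult the actual error reference."""
    payload = json.loads(body)
    err = payload.get("error", {})
    return f'{err.get("code", "unknown")}: {err.get("message", "")}'

msg = parse_error('{"error": {"code": "rate_limited", "message": "Too many requests"}}')
```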
Rate limiting
Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API page for plan details.
Output formats
The /v1/scrape endpoint supports 9 output formats. Pass one or more in the formats array.
| Format | Description |
|---|---|
| markdown | Clean markdown with resolved URLs and collected assets. Default. |
| text | Plain text with no formatting. |
| json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| llm | LLM-optimized. 9-step pipeline: image stripping, emphasis removal, link dedup, stat merging, whitespace collapse. ~67% fewer tokens than raw HTML. |
| screenshot | Base64-encoded PNG screenshot of the page. |
| links | Array of all extracted links. Each entry has text and href fields. |
| rawHtml | The raw HTML string from the pipeline, before any extraction processing. |
| attributes | Extract DOM attributes by CSS selector. Requires the attribute_selectors parameter. |
| query | Page-level Q&A with LLM. Requires the query parameter. Returns query_answer in the response. |
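A single /v1/scrape call can request several formats at once. A minimal sketch of the request payload (body only, no network call), with field names taken from the tables in this document:

```python
import json

# Request several output formats in one scrape call.
body = {
    "url": "https://example.com/pricing",
    "formats": ["markdown", "links", "screenshot"],
}
payload = json.dumps(body)
```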
Document parsing
webclaw auto-detects document types from Content-Type headers and file extensions. In addition to PDF, the following formats are supported natively and converted to your requested output format:
- DOCX -- Microsoft Word documents
- XLSX -- Excel spreadsheets (each sheet becomes a markdown table)
- CSV -- Comma-separated values
YouTube transcript extraction
YouTube URLs (youtube.com/watch) are auto-detected and processed differently. webclaw extracts structured markdown containing the video title, channel name, view count, publish date, duration, description, and the full transcript text.
Endpoints
The full list of available endpoints.
Core endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/scrape | Single URL extraction. Supports browser actions, screenshots, mobile emulation, and document parsing (PDF, DOCX, XLSX, CSV). |
| POST | /v1/crawl | Start async crawl with path filtering, subdomain/external link control, and webhook support. |
| GET | /v1/crawl/{id} | Poll crawl status |
| GET | /v1/crawl/{id}/stream | Stream crawl results via Server-Sent Events (SSE) |
| GET | /v1/crawl/history | List all crawl jobs for your account |
| POST | /v1/crawl/{id}/retry | Retry a failed or completed crawl with the same configuration |
| POST | /v1/batch | Multi-URL extraction |
| POST | /v1/map | Sitemap discovery |
| POST | /v1/search | Web search with parallel scraping of result pages |
| POST | /v1/agent-scrape | AI agentic scraper that navigates pages to achieve a goal |
| POST | /v1/extract | LLM JSON extraction with prompt-to-schema generation |
| POST | /v1/summarize | LLM summarization |
| POST | /v1/diff | Content change tracking |
| POST | /v1/brand | Brand identity extraction |
Webhooks
| Method | Path | Description |
|---|---|---|
| POST | /v1/webhooks | Register a new webhook |
| GET | /v1/webhooks | List all registered webhooks |
| PATCH | /v1/webhooks/{id} | Update a webhook configuration |
| DELETE | /v1/webhooks/{id} | Delete a webhook |
Firecrawl v2 compatibility
Drop-in replacements for Firecrawl v2 endpoints. Point existing Firecrawl SDKs at webclaw by changing the base URL.
| Method | Path | Description |
|---|---|---|
| POST | /v2/scrape | Firecrawl-compatible single URL scrape |
| POST | /v2/crawl | Firecrawl-compatible async crawl |
| GET | /v2/crawl/{id} | Firecrawl-compatible crawl status polling |
| POST | /v2/search | Firecrawl-compatible web search |
Utility
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check + Ollama status |
POST /v1/search
Perform a web search and optionally scrape all result pages in parallel. Combines discovery and extraction into a single call.
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | The search query string. |
| num_results | number | No | Number of search results to return. Default: 5. |
| scrape | boolean | No | When true, scrapes each result page in parallel. Default: true. |
| formats | string[] | No | Output formats for scraped pages. Same options as /v1/scrape. Defaults to ["markdown"]. |
| country | string | No | Two-letter country code to localize results (e.g. "us", "gb", "de"). |
| lang | string | No | Two-letter language code for results (e.g. "en", "fr"). |
Example
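A sketch of a /v1/search request body using the fields from the table above (payload only; values are illustrative):

```python
# Request body for POST /v1/search: search, then scrape the top results.
body = {
    "query": "rust async runtime comparison",
    "num_results": 3,
    "scrape": True,           # scrape each result page in parallel
    "formats": ["markdown"],
    "country": "us",
    "lang": "en",
}
```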
POST /v1/agent-scrape
An AI-powered agentic scraper that autonomously navigates pages, clicks elements, fills forms, and extracts data to achieve a stated goal.
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Starting URL for the agent. |
| goal | string | Yes | Natural language description of what data to extract or what action to perform. |
| max_steps | number | No | Maximum number of navigation steps the agent can take. Default: 5. |
Example
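A sketch of an agent-scrape request body (illustrative values; fields from the table above):

```python
# Request body for POST /v1/agent-scrape.
body = {
    "url": "https://example-store.com",
    "goal": "Find the three cheapest laptops and return their names and prices",
    "max_steps": 8,  # default is 5 per the table above
}
```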
POST /v1/extract
Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. When only a prompt is provided (no schema), the LLM generates a JSON schema first, then extracts data matching it. The response includes a generated_schema field showing the auto-generated schema.
Prompt-to-schema example
When only prompt is provided, the response includes both the extracted data and the generated schema.
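A sketch of a prompt-only request body. The prompt field name follows the wording above; the response then carries the extracted data alongside generated_schema:

```python
# Prompt-only request body for POST /v1/extract: no schema supplied,
# so the server generates a JSON schema first, then extracts against it.
body = {
    "url": "https://example.com/team",
    "prompt": "List each team member's name and role",
}
```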
Enhanced scrape parameters
The /v1/scrape endpoint supports additional parameters for browser automation, screenshots, mobile emulation, cache control, page-level Q&A, and attribute extraction.
| Field | Type | Required | Description |
|---|---|---|---|
| actions | object[] | No | Array of browser actions to execute before extraction (click, type, scroll, wait, press, executeJavascript, etc.). |
| screenshot | boolean | No | Capture a full-page screenshot. Returned as a base64-encoded PNG in the response. |
| mobile | boolean | No | Emulate a mobile device viewport and user agent. |
| no_cache | boolean | No | Bypass the response cache and force a fresh fetch. |
| query | string | No | Natural language question about the page. Used with the query format. The answer is returned in query_answer. |
| attribute_selectors | object[] | No | Array of {selector, attribute} pairs. Used with the attributes format to extract specific DOM attributes by CSS selector. |
Browser actions
Actions execute in order before content extraction. Useful for clicking cookie banners, expanding sections, filling forms, scrolling to load lazy content, pressing keyboard keys, or running custom JavaScript.
| Action | Fields | Description |
|---|---|---|
| click | selector | Click an element matching the CSS selector. |
| type | selector, value | Type text into an input element. |
| scroll | direction | Scroll the page. Direction: up or down. |
| wait | ms | Wait for the specified number of milliseconds. |
| screenshot | -- | Take a screenshot at this point in the action sequence. |
| press | key | Send a keyboard event. Supported keys: Enter, Tab, Escape, ArrowDown, ArrowUp, Space, Backspace. |
| executeJavascript | code | Run custom JavaScript on the page. Results are returned in the js_results array in the response. |
Browser actions example
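A sketch of a scrape request with an action sequence. The per-action fields come from the table above; the "type" discriminator key on each action object is an assumption (mirroring Firecrawl-style action objects, which these endpoints aim to be compatible with):

```python
# Scrape with browser actions executed in order before extraction.
# The "type" key naming each action is an assumption -- the per-action
# fields (selector, direction, ms, key) come from the actions table.
body = {
    "url": "https://example.com/docs",
    "formats": ["markdown"],
    "actions": [
        {"type": "click", "selector": "#accept-cookies"},
        {"type": "scroll", "direction": "down"},
        {"type": "wait", "ms": 500},
        {"type": "press", "key": "Enter"},
    ],
}
```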
Query format example
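A sketch of a page-level Q&A request: the query format is requested in formats, and the query parameter carries the question. Per the format table, the answer is returned in query_answer:

```python
# Page-level Q&A: the "query" format plus a "query" question.
body = {
    "url": "https://example.com/changelog",
    "formats": ["query"],
    "query": "What was added in the latest release?",
}
```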
Attributes format example
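A sketch of an attribute-extraction request: the attributes format plus the required attribute_selectors array of {selector, attribute} pairs:

```python
# Attribute extraction: pull specific DOM attributes by CSS selector.
body = {
    "url": "https://example.com/gallery",
    "formats": ["attributes"],
    "attribute_selectors": [
        {"selector": "img.photo", "attribute": "src"},
        {"selector": "a.download", "attribute": "href"},
    ],
}
```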
Enhanced crawl parameters
The /v1/crawl endpoint now supports path filtering, subdomain and external link control, and webhook notifications.
| Field | Type | Required | Description |
|---|---|---|---|
| allow_subdomains | boolean | No | Allow the crawler to follow links to subdomains of the starting URL. Default: false. |
| allow_external_links | boolean | No | Allow the crawler to follow links to external domains. Default: false. |
| include_paths | string[] | No | Glob patterns for paths to include (e.g. /docs/**). Only matching URLs will be crawled. |
| exclude_paths | string[] | No | Glob patterns for paths to exclude (e.g. /blog/**). Matching URLs will be skipped. |
| webhook_url | string | No | URL to receive a POST request when the crawl completes or fails. |
Crawl sub-endpoints
- GET /v1/crawl/{id}/stream -- Stream crawl results in real time via Server-Sent Events. Each page result is sent as an SSE event as it completes.
- GET /v1/crawl/history -- List all crawl jobs for your account, ordered by creation date.
- POST /v1/crawl/{id}/retry -- Retry a failed or completed crawl using the same configuration. Returns a new crawl ID.
Enhanced crawl example
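A sketch of a crawl request body combining path filtering, subdomain control, and a webhook, using the fields from the table above (URLs are illustrative):

```python
# Request body for POST /v1/crawl with path filtering and a webhook.
body = {
    "url": "https://example.com",
    "include_paths": ["/docs/**"],          # only crawl the docs tree
    "exclude_paths": ["/docs/archive/**"],  # ...except the archive
    "allow_subdomains": True,
    "allow_external_links": False,
    "webhook_url": "https://hooks.example.com/crawl-done",
}
```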
Webhooks
Register webhook URLs to receive POST notifications when async operations (crawls, batch jobs) complete. Manage webhooks with full CRUD.
- POST /v1/webhooks -- Register a new webhook endpoint.
- GET /v1/webhooks -- List all registered webhooks for your account.
- PATCH /v1/webhooks/{id} -- Update an existing webhook's URL or event subscriptions.
- DELETE /v1/webhooks/{id} -- Delete a webhook. It will no longer receive notifications.
Example
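A sketch of a webhook-registration body. The events field and its values are assumptions inferred from the "event subscriptions" mentioned for PATCH above — check the create endpoint's reference for the real field and event names:

```python
# Request body for POST /v1/webhooks. The "events" field and its values
# are assumptions inferred from the PATCH endpoint's "event subscriptions";
# consult the create endpoint's reference for the actual names.
body = {
    "url": "https://hooks.example.com/webclaw",
    "events": ["crawl.completed", "batch.completed"],
}
```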
Firecrawl v2 compatibility
These endpoints accept the same request/response shapes as Firecrawl v2. Migrate by changing your base URL to webclaw -- no code changes needed.
- POST /v2/scrape -- Firecrawl-compatible scrape. Accepts the Firecrawl v2 request format and returns the Firecrawl v2 response shape.
- POST /v2/crawl -- Firecrawl-compatible async crawl. Returns a job ID in Firecrawl v2 format.
- GET /v2/crawl/{id} -- Poll a Firecrawl-compatible crawl job for status and results.
- POST /v2/search -- Firecrawl-compatible web search with scraping.
Replace https://api.firecrawl.dev with https://api.webclaw.io in your SDK configuration. Authentication works the same way -- just use your webclaw API key instead.