Launches or adopts a Chromium/CDP browser session and publishes a stable session contract for all browser-backed extractors.
The important rule is:
- browser readiness is defined by the published `chrome/` session markers
- browser liveness is verified through CDP
- `chrome.pid` is only ownership/cleanup metadata for local processes
This plugin now supports two execution modes:
- `CHROME_ISOLATION=crawl`
  - one browser session per crawl
  - each snapshot gets its own tab in that browser
- `CHROME_ISOLATION=snapshot`
  - one browser session per snapshot
  - the snapshot owns the browser directly
It also supports two session sources:
- launch a local Chromium process
- adopt an existing browser via `CHROME_CDP_URL`
Most browser extractors only need:
- a connectable CDP endpoint
- a concrete page target
- sometimes loaded extensions
- sometimes a configured downloads directory
They should not need to know:
- whether the browser is local or remote
- whether the browser was launched by the Chrome plugin or provided externally
- whether the browser is crawl-scoped or snapshot-scoped
The Chrome plugin hides those details behind a shared `chrome/` session directory and `chrome_utils.js`.
Defined in config.json.
| Variable | Default | Meaning |
|---|---|---|
| `CHROME_ENABLED` | `true` | Enable browser-backed archiving. |
| `CHROME_BINARY` | `chromium` | Local Chromium binary to launch when not adopting an existing browser. |
| `CHROME_CDP_URL` | `""` | Adopt an already-running browser instead of launching a local one. Accepts WS or HTTP CDP endpoints. |
| `CHROME_IS_LOCAL` | `true` | Whether the owned browser process is local and should publish `chrome.pid`. If `CHROME_CDP_URL` is set, runtime behavior is external/non-local. |
| `CHROME_KEEPALIVE` | `false` | Whether the owning launch hook should exit immediately and leave the browser running, instead of staying alive and closing it during cleanup. |
| `CHROME_ISOLATION` | `crawl` | `crawl` for one browser per crawl, `snapshot` for one browser per snapshot. |

| Variable | Default | Meaning |
|---|---|---|
| `CHROME_TIMEOUT` | `60` | General Chrome operation timeout in seconds. |
| `CHROME_PAGELOAD_TIMEOUT` | `60` | Navigation/page-load timeout in seconds. |
| `CHROME_WAIT_FOR` | `networkidle2` | Puppeteer navigation completion condition. |
| `CHROME_DELAY_AFTER_LOAD` | `0` | Extra delay after page load completes. |
| `CHROME_HEADLESS` | `true` | Run headless. |
| `CHROME_SANDBOX` | `true` | Enable Chromium sandbox. |
| `CHROME_RESOLUTION` | `1440,2000` | Viewport/window size. |
| `CHROME_USER_AGENT` | `""` | Optional browser user agent override. |
| `CHROME_CHECK_SSL_VALIDITY` | `true` | Whether to reject invalid TLS certs. |

| Variable | Default | Meaning |
|---|---|---|
| `CHROME_USER_DATA_DIR` | `""` | User data dir for persistent local profile state. |
| `CHROME_EXTENSIONS_DIR` | persona-derived | Extension cache directory. |
| `CHROME_DOWNLOADS_DIR` | persona-derived | Download output directory configured via CDP after launch/adoption. |
| `CHROME_ARGS` | see config | Static Chromium flags. |
| `CHROME_ARGS_EXTRA` | `[]` | Final extra flags appended at launch. |
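
To make the interaction between the core variables concrete, the sketch below resolves a few of them from an environment-like object. `resolveChromeConfig` and `parseBool` are hypothetical names, not part of the plugin. The defaults are copied from the tables above, while the boolean parsing rules and the error on an unknown isolation mode are assumptions.

```javascript
// Hypothetical helper: resolve a few Chrome settings from an env-like object.
// Defaults mirror the documented tables; the parsing rules are assumptions.
function parseBool(value, fallback) {
  if (value === undefined || value === '') return fallback;
  return ['1', 'true', 'yes', 'on'].includes(String(value).toLowerCase());
}

function resolveChromeConfig(env = process.env) {
  const cdpUrl = env.CHROME_CDP_URL || '';
  const isolation = env.CHROME_ISOLATION || 'crawl';
  if (!['crawl', 'snapshot'].includes(isolation)) {
    // Assumption: unknown modes are rejected rather than silently defaulted.
    throw new Error(`Unsupported CHROME_ISOLATION: ${isolation}`);
  }
  return {
    enabled: parseBool(env.CHROME_ENABLED, true),
    binary: env.CHROME_BINARY || 'chromium',
    cdpUrl,
    // An adopted browser (CHROME_CDP_URL set) is treated as external/non-local.
    isLocal: cdpUrl ? false : parseBool(env.CHROME_IS_LOCAL, true),
    keepalive: parseBool(env.CHROME_KEEPALIVE, false),
    isolation,
    timeoutSeconds: Number(env.CHROME_TIMEOUT || 60),
  };
}
```

Note that `isLocal` is forced to `false` whenever `CHROME_CDP_URL` is set, which mirrors the documented rule that adopted browsers behave as external sessions regardless of `CHROME_IS_LOCAL`.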
Ownership (`CHROME_ISOLATION=crawl`):
- on_CrawlSetup__90_chrome_launch.daemon.bg.js owns the browser session
- on_CrawlSetup__91_chrome_wait.js verifies the crawl-scoped session is connectable
- on_Snapshot__10_chrome_tab.daemon.bg.js creates one page/tab per snapshot
Contract:
- `CRAWL_DIR/chrome/` publishes the browser session
- `SNAP_DIR/chrome/` publishes the tab/session for one snapshot inside that browser
Ownership (`CHROME_ISOLATION=snapshot`):
- on_Snapshot__09_chrome_launch.daemon.bg.js owns the browser session for that snapshot
- on_Snapshot__10_chrome_tab.daemon.bg.js adopts or verifies the already-published snapshot session
Contract:
- `SNAP_DIR/chrome/` is the only browser/session surface needed by downstream snapshot hooks
- there may be no crawl-scoped shared browser at all
When `CHROME_CDP_URL` is unset, `ensureChromeSession(...)` launches local Chromium and waits until:
- the browser endpoint is connectable
- an `about:blank` page exists
- CDP can attach to that page and read its title
Only after that does it publish cdp_url.txt.
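
The ordering matters more than the individual checks: the marker is written last, so its presence implies all three checks passed. A minimal sketch of that publish-after-verify pattern follows. `publishWhenReady` and the three injected probe functions are hypothetical stand-ins for real CDP checks, the sketch checks once rather than polling, and the write-then-rename step is an assumption about how a partial marker could be avoided, not a description of the plugin's code.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch of the publish-after-verify ordering. The probe functions are
// hypothetical and injected by the caller; real code would talk to CDP.
async function publishWhenReady(sessionDir, cdpUrl, probes) {
  const { isEndpointConnectable, findBlankPage, readPageTitle } = probes;

  if (!(await isEndpointConnectable(cdpUrl))) {
    throw new Error('browser endpoint is not connectable');
  }
  const targetId = await findBlankPage(cdpUrl);
  if (!targetId) {
    throw new Error('no about:blank page found');
  }
  // Reading the title proves CDP can attach to the page, not just list it.
  await readPageTitle(cdpUrl, targetId);

  // Assumption: write to a temp file and rename so readers never observe
  // a half-written marker.
  fs.mkdirSync(sessionDir, { recursive: true });
  const marker = path.join(sessionDir, 'cdp_url.txt');
  fs.writeFileSync(`${marker}.tmp`, cdpUrl);
  fs.renameSync(`${marker}.tmp`, marker);
  return marker;
}
```

If any probe fails, the function throws before touching the session directory, so consumers waiting on `cdp_url.txt` never see a browser that is not yet usable.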
When CHROME_CDP_URL is set, ensureChromeSession(...) adopts that browser instead of launching one.
Important behavior:
- `CHROME_CDP_URL` may be a WS or HTTP CDP endpoint
- `chrome.pid` is not required for external sessions
- readiness and liveness are based on CDP, not PID
- repeated launch/adoption against the same `chrome/` dir and same live `CHROME_CDP_URL` reuses the session in place instead of closing it
With `CHROME_KEEPALIVE=false` (the default), the owning launch hook stays alive and closes the browser on `SIGTERM`:
- first tries `Browser.close()` over CDP
- if the process is local and still alive after the timeout, escalates to `SIGTERM`/`SIGKILL`
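
The escalation order can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `closeOwnedBrowser`, `waitForExit`, and the injected `closeViaCdp`, `isAlive`, and `sendSignal` callbacks are hypothetical, and the timeout and polling interval are arbitrary.

```javascript
// Sketch of a CDP-first shutdown with signal escalation for local processes.
// All callbacks are hypothetical and supplied by the caller.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the process is gone or the timeout elapses.
async function waitForExit(isAlive, timeoutMs, pollMs = 20) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (!isAlive()) return true;
    await sleep(pollMs);
  }
  return !isAlive();
}

async function closeOwnedBrowser({ closeViaCdp, isLocal, isAlive, sendSignal, timeoutMs = 5000 }) {
  const steps = [];
  try {
    await closeViaCdp(); // e.g. send Browser.close over CDP
    steps.push('Browser.close');
  } catch (err) {
    steps.push('Browser.close failed');
  }
  // External browsers have no local process to signal.
  if (!isLocal) return steps;
  if (await waitForExit(isAlive, timeoutMs)) return steps;

  // Escalate only while the local process is still alive.
  for (const signal of ['SIGTERM', 'SIGKILL']) {
    sendSignal(signal);
    steps.push(signal);
    if (await waitForExit(isAlive, timeoutMs)) break;
  }
  return steps;
}
```

Returning the list of steps taken is a convenience for logging and testing; a real hook would more likely just log each step as it happens.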
Cleanup scope depends on isolation:
- `crawl` isolation: crawl launch hook owns browser shutdown
- `snapshot` isolation: snapshot launch hook owns browser shutdown
With `CHROME_KEEPALIVE=true`, the owning launch hook exits immediately and leaves the browser running.
This is useful when:
- an upstream provider manages lifecycle
- tests want to reattach repeatedly
- a caller wants to close the browser explicitly later
| Hook | Priority | Purpose |
|---|---|---|
| `on_CrawlSetup__90_chrome_launch.daemon.bg.js` | 90 | Launch/adopt crawl-scoped browser when `CHROME_ISOLATION=crawl`. No-op when `snapshot`. |
| `on_CrawlSetup__91_chrome_wait.js` | 91 | Wait for a connectable crawl session when `crawl`. Reports "snapshot isolation active" when `snapshot`. |

Chromium and extension dependencies are resolved before crawl setup from `config.json > required_binaries` during orchestrator preflight.

| Hook | Priority | Purpose |
|---|---|---|
| `on_Snapshot__09_chrome_launch.daemon.bg.js` | 9 | Launch/adopt snapshot-scoped browser when `CHROME_ISOLATION=snapshot`. No-op readiness check when `crawl`. |
| `on_Snapshot__10_chrome_tab.daemon.bg.js` | 10 | Create or adopt the snapshot page target. |
| `on_Snapshot__11_chrome_wait.js` | 11 | Verify snapshot `cdp_url.txt` + `target_id.txt` point at a live target. |
| `on_Snapshot__30_chrome_navigate.js` | 30 | Navigate the snapshot page and publish navigation markers. |
CRAWL_DIR/chrome/
Used when:
- `CHROME_ISOLATION=crawl`
- or crawl-scoped extension config hooks need a browser-wide session
SNAP_DIR/chrome/
Always the main contract for snapshot extractors.
Downstream snapshot hooks should consume:
- `SNAP_DIR/chrome/cdp_url.txt`
- `SNAP_DIR/chrome/target_id.txt`
- optional `SNAP_DIR/chrome/extensions.json`
- navigation markers written later by `chrome_navigate`
They should not reach back into CRAWL_DIR/chrome/ directly unless they are intentionally crawl-scoped browser hooks.
| File | Meaning |
|---|---|
| `cdp_url.txt` | Authoritative browser readiness marker. |
| `chrome.pid` | Local-process ownership metadata. Optional for external sessions. |
| `extensions.json` | Loaded extension metadata. Optional if no extensions are present. |

| File | Meaning |
|---|---|
| `target_id.txt` | Authoritative page-target marker for the snapshot. |
| `url.txt` | Requested URL used for snapshot reuse checks. |
| `navigation.json` | Structured navigation result, including errors. |

- `cdp_url.txt` means the browser is safe to connect to
- `target_id.txt` means a specific page target exists for this snapshot
- `navigation.json` means `chrome_navigate` completed, and success vs failure is encoded inside the JSON
- `chrome.pid` does not imply readiness
The shared helpers live in chrome_utils.js.
`ensureChromeSession(...)` launches or adopts a browser session and publishes the session markers.
Used by:
- crawl launch hook
- snapshot launch hook
Responsibilities:
- reuse healthy existing sessions when appropriate
- avoid destroying an explicit live `CHROME_CDP_URL` session when re-invoked against the same directory
- configure downloads over CDP
- import cookies if configured
- publish `extensions.json` before `cdp_url.txt`
- publish `chrome.pid` only for local sessions
Closes the owned browser session.
Behavior:
- send `Browser.close()` over CDP first
- for local processes, escalate to `SIGTERM`/`SIGKILL` if needed
- clean up stale session markers afterward
`waitForChromeSessionState(...)` waits for a published session directory to be ready.
Returns:
- `null` on timeout
- otherwise a normalized state object:
  - `sessionDir`
  - `cdpUrl`
  - `targetId`
  - `pid`
  - `extensions`
Important behavior:
- does not require `chrome.pid` by default
- can require `target_id.txt` with `requireTargetId: true`
- can require parseable `extensions.json` with `requireExtensionsLoaded: true`
- is purely marker-based readiness, not browser-connection liveness
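
The marker-based contract described above can be approximated in a few lines. The sketch below is a hypothetical stand-in (`waitForMarkers`, `readMarker`, `readExtensions`), not the code in `chrome_utils.js`. The polling interval, the default timeout, and parsing `chrome.pid` as a number are assumptions. Downstream plugins should call the real helper rather than reading the files themselves.

```javascript
const fs = require('fs');
const path = require('path');

// Read a marker file; a missing or empty file counts as "not published yet".
function readMarker(sessionDir, name) {
  try {
    return fs.readFileSync(path.join(sessionDir, name), 'utf8').trim() || null;
  } catch (err) {
    if (err.code === 'ENOENT') return null;
    throw err;
  }
}

// Parse extensions.json, returning null when it is absent or not valid JSON.
function readExtensions(sessionDir) {
  const raw = readMarker(sessionDir, 'extensions.json');
  if (raw === null) return null;
  try {
    return JSON.parse(raw);
  } catch (err) {
    return null;
  }
}

// Hypothetical stand-in for the documented behavior: poll the markers and
// return a normalized state object, or null on timeout.
async function waitForMarkers(sessionDir, options = {}) {
  const { timeoutMs = 1000, pollMs = 25, requireTargetId = false, requireExtensionsLoaded = false } = options;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const cdpUrl = readMarker(sessionDir, 'cdp_url.txt');
    const targetId = readMarker(sessionDir, 'target_id.txt');
    const extensions = readExtensions(sessionDir);
    const pidText = readMarker(sessionDir, 'chrome.pid');
    const ready = cdpUrl !== null
      && (!requireTargetId || targetId !== null)
      && (!requireExtensionsLoaded || extensions !== null);
    if (ready) {
      // chrome.pid is reported when present but never gates readiness.
      return { sessionDir, cdpUrl, targetId, pid: pidText ? Number(pidText) : null, extensions };
    }
    if (Date.now() >= deadline) return null;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Because this only inspects files, a non-null result says the session was published, not that the browser is still reachable; liveness still has to be confirmed over CDP.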
`connectToBrowserEndpoint(...)` connects Puppeteer to a browser endpoint.
Handles both:
- `ws://.../devtools/browser/...` via `browserWSEndpoint`
- `http://host:port` via `browserURL`
Use this for browser-scoped hooks that need a browser but not a specific page target.
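
The endpoint handling amounts to choosing which `puppeteer.connect()` option receives the URL. A minimal sketch of that dispatch is shown below. `connectOptionsFor` is a hypothetical name used only here, and accepting `wss:` and `https:` as well is an assumption.

```javascript
// Map a CDP endpoint to the puppeteer.connect() option that accepts it.
// `connectOptionsFor` is a hypothetical name used only for this sketch.
function connectOptionsFor(cdpUrl) {
  const { protocol } = new URL(cdpUrl);
  if (protocol === 'ws:' || protocol === 'wss:') {
    return { browserWSEndpoint: cdpUrl };
  }
  if (protocol === 'http:' || protocol === 'https:') {
    return { browserURL: cdpUrl };
  }
  throw new Error(`Unsupported CDP endpoint scheme: ${protocol}`);
}

// Usage with Puppeteer (requires the puppeteer-core package):
//   const puppeteer = require('puppeteer-core');
//   const browser = await puppeteer.connect(connectOptionsFor(cdpUrl));
```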
`connectToPage(...)` is the high-level helper for snapshot page consumers.
Options include:
- `chromeSessionDir`
- `timeoutMs`
- `requireTargetId`
- `requireExtensionsLoaded`
- `waitForNavigationComplete`
- `pageLoadTimeoutMs`
- `postLoadDelayMs`
- `puppeteer`
Returns:
- the session state from `waitForChromeSessionState(...)`
- plus:
  - `browser`
  - `page`
  - `cdpSession`
  - `targetId`
Important behavior:
- fails fast if there is no Chrome session at all
- if `waitForNavigationComplete: true`, waits for successful `navigation.json` before attaching
- creates a page CDP session for the returned page
- sends initial `Target.setAutoAttach({ autoAttach: true, waitForDebuggerOnStart: false, flatten: true })`
This is the preferred helper for almost all snapshot extractors.
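
A snapshot extractor built on this helper might look roughly like the sketch below. `extractTitle` is a hypothetical extractor, and `connectToPage` is passed in as a parameter so the example makes no claim about how `chrome_utils.js` exports it. The sketch assumes the helper takes a single options object with the option names listed above and returns the `browser`, `page`, and `targetId` fields described there.

```javascript
// Sketch of a snapshot extractor. `connectToPage` is injected so the example
// does not assume an import path; `extractTitle` is a hypothetical name.
async function extractTitle(connectToPage, snapDir) {
  const { browser, page, targetId } = await connectToPage({
    chromeSessionDir: `${snapDir}/chrome`,
    timeoutMs: 60000,
    requireTargetId: true,
    waitForNavigationComplete: true,
  });
  try {
    return { targetId, title: await page.title() };
  } finally {
    // Detach without closing: the launch hook owns the browser lifecycle.
    await browser.disconnect();
  }
}
```

Disconnecting instead of closing keeps the extractor consistent with the ownership rules: the browser may be shared across snapshots or provided externally, so only the owning launch hook should shut it down.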
Waits for navigation.json from chrome_navigate, parses it, and throws if navigation failed.
This is marker-based, not a live CDP page-state poll.
Low-level session inspection used mainly by core Chrome hooks for reuse/stale cleanup decisions.
Most downstream plugins should not need this directly.
Derives the SingleFile-style browser server URL from a published session directory.
Core tab lifecycle helpers used by the Chrome hooks.
Lifecycle with `CHROME_ISOLATION=crawl`:
- Extension installer hooks populate the extension cache.
- Crawl launch hook calls `ensureChromeSession(...)`.
- Browser-wide setup completes.
- `cdp_url.txt` is published in `CRAWL_DIR/chrome/`.
- Crawl wait verifies a real CDP connection.
- Snapshot tab hook creates a target and publishes `SNAP_DIR/chrome/target_id.txt`.
- Snapshot wait verifies the target is live.
- Navigate hook writes `navigation.json`.
- Later extractors call `connectToPage(...)`.
Lifecycle with `CHROME_ISOLATION=snapshot`:
- Snapshot launch hook calls `ensureChromeSession(...)` in `SNAP_DIR/chrome/`.
- `cdp_url.txt` is published in `SNAP_DIR/chrome/`.
- Snapshot tab hook adopts or creates the snapshot page target.
- Snapshot wait verifies the target is live.
- Navigate hook writes the navigation markers.
- Later extractors call `connectToPage(...)`.
If you are writing a Chrome-dependent plugin:
- use `connectToPage(...)` for snapshot/page extractors
- use `connectToBrowserEndpoint(...)` only for crawl-scoped browser hooks that do not need a page target
- use `waitForChromeSessionState(..., { requireExtensionsLoaded: true })` only when you truly need extension metadata before page attachment
- treat `cdp_url.txt` as the browser readiness gate
- treat `target_id.txt` as the page readiness gate
- treat `navigation.json` as the post-navigation readiness gate
Anti-patterns:
- do not require local Chrome unless absolutely necessary
- do not use `chrome.pid` as a readiness check
- do not read raw session files directly when `connectToPage(...)` or `waitForChromeSessionState(...)` already provide the contract
- do not hardcode provider-specific behavior in downstream plugins
- do not configure downloads by mutating Chromium profile preferences; use CDP/browser setup
Chrome itself does not know about specific extension plugins.
Extension flow:
- installer hooks populate extension cache metadata
- `ensureChromeSession(...)` loads or discovers those extensions into the active browser session
- `extensions.json` publishes the loaded metadata
- downstream extension-aware hooks consume the published `extensions` metadata
Download flow:
- browser/session setup configures downloads through CDP
- downstream hooks should consume the resulting files from the published downloads directory
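
For illustration, one CDP method that can set a download directory is `Browser.setDownloadBehavior`. Whether the plugin uses this exact call is an assumption, and `configureDownloads` is a hypothetical helper. The `cdpSession` argument is any object exposing a Puppeteer-style `send(method, params)`.

```javascript
// Hypothetical helper: point browser downloads at a directory using CDP.
// Assumes a Puppeteer-style CDP session with send(method, params).
async function configureDownloads(cdpSession, downloadsDir) {
  if (!downloadsDir) {
    throw new Error('downloadsDir is required');
  }
  await cdpSession.send('Browser.setDownloadBehavior', {
    behavior: 'allow',          // accept downloads without prompting
    downloadPath: downloadsDir, // where the browser writes files
    eventsEnabled: true,        // emit download progress events
  });
}
```

Configuring the directory over CDP works the same way for launched and adopted browsers, which is why it fits the contract better than editing profile preferences on disk.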
What downstream plugins can rely on:
- there is a published `chrome/` session directory
- `waitForChromeSessionState(...)` returns normalized session metadata
- `connectToPage(...)` gives them a live page plus connected page CDP session
- the browser may be local or external
- the browser may be crawl-scoped or snapshot-scoped
What they should not care about:
- how the browser was launched
- who owns process lifecycle
- whether `chrome.pid` exists