A Node.js crawler for the STAC Index API that fetches and processes catalog and collection data with configurable options. Includes an automated scheduler for periodic crawling.
- Features
- Quick Start
- Configuration
- Running Locally
- Docker
- Testing
- Dependencies
- Technical Decisions
- Architecture
- How It Works
- Troubleshooting
- Performance Tuning
- npm Scripts
- License
- Examples
- Single-run Mode: Execute crawler once and exit
- Scheduled Mode: Automated periodic crawling with configurable intervals
- Time Window Control: Optional restriction to specific hours (e.g., night-time crawling)
- Retry Logic: Automatic retry on crawl errors with configurable delay
- Environment-based Configuration: All settings configurable via `.env` file
- CLI Arguments: Override settings with command-line flags
- Database Integration: PostgreSQL storage with deadlock handling
- Parallel Execution: Efficient domain-based parallel processing with configurable rate limiting
- Graceful Shutdown: Stop after current batch with Ctrl+C, resume later
- Pause/Resume Support: Already-crawled collections are tracked and skipped on re-run
- Fresh Mode: Clear crawl log with `--fresh` flag to re-crawl everything
- STAC Validation: Validates collections using stac-node-validator
- Automatic Cleanup: Marks stale collections as inactive after 7 days without updates
# Install dependencies
npm install
# Copy and configure environment file
cp .env.example .env
# Run crawler once
npm start
# Run scheduler for automatic periodic crawling
node scheduler.js

The scheduler will:
- Run the crawler immediately on startup (configurable)
- Schedule next runs based on configured interval (default: 7 days)
- Respect time window restrictions if enabled
- Automatically retry on errors
The crawler can be configured using environment variables, CLI arguments, or a combination of both. CLI arguments take precedence over environment variables.
| Option | CLI Flag | Environment Variable | Default | Description |
|---|---|---|---|---|
| Mode | `-m, --mode` | `CRAWL_MODE` | `both` | Crawl mode: catalogs, apis, or both |
| Max Catalogs | `-c, --max-catalogs` | `MAX_CATALOGS` | `10` | Maximum number of catalogs to process (0 = unlimited) |
| Max APIs | `-a, --max-apis` | `MAX_APIS` | `5` | Maximum number of APIs to process (0 = unlimited) |
| Timeout | `-t, --timeout` | `TIMEOUT_MS` | `30000` | Timeout per operation in milliseconds |
| Max Depth | `-d, --max-depth` | `MAX_DEPTH` | `10` | Maximum recursion depth for nested catalogs (0 = unlimited) |
| Fresh | `-f, --fresh` | `FRESH_CRAWL` | `false` | Clear crawl log and re-crawl all collections |

| Option | CLI Flag | Environment Variable | Default | Description |
|---|---|---|---|---|
| Parallel Domains | `-p, --parallel-domains` | `PARALLEL_DOMAINS` | `2` | Number of domains to crawl in parallel |
| RPM per Domain | `--rpm-per-domain` | `MAX_REQUESTS_PER_MINUTE_PER_DOMAIN` | `60` | Max requests per minute per domain |
| Concurrency per Domain | `--concurrency-per-domain` | `MAX_CONCURRENCY_PER_DOMAIN` | `5` | Max concurrent requests per domain |

| Option | CLI Flag | Environment Variable | Default | Description |
|---|---|---|---|---|
| Max Concurrency | `--max-concurrency` | `MAX_CONCURRENCY` | `5` | Maximum concurrent requests (global) |
| Requests per Minute | `--rpm, --requests-per-minute` | `MAX_REQUESTS_PER_MINUTE` | `60` | Maximum requests per minute (global) |
| Domain Delay | `--domain-delay` | `SAME_DOMAIN_DELAY_SECS` | `1` | Delay between requests to same domain (seconds) |
| Max Retries | `--max-retries` | `MAX_REQUEST_RETRIES` | `3` | Maximum retries for failed requests |
| Environment Variable | Default | Description |
|---|---|---|
| `CRAWL_DAYS_INTERVAL` | `7` | Days between crawl runs |
| `CRAWL_RUN_ON_STARTUP` | `true` | Run crawler immediately on startup |
| `CRAWL_RETRY_ON_ERROR` | `true` | Retry if crawl fails but DB is ok |
| `CRAWL_RETRY_DELAY_HOURS` | `2` | Hours to wait before retry on error |
| `CRAWL_ENFORCE_TIME_WINDOW` | `false` | Enable time window restrictions |
| `CRAWL_ALLOWED_START_HOUR` | `22` | Start hour (0-23) when time window is enforced |
| `CRAWL_ALLOWED_END_HOUR` | `7` | End hour (0-23) when time window is enforced |
| `CRAWL_GRACE_PERIOD_MINUTES` | `30` | Grace period in minutes after end hour |
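
The time-window variables describe an hour range that may wrap past midnight (for example 22 to 7). The sketch below shows one plausible way to evaluate such a window. It is not taken from `scheduler.js`: `isWithinWindow` is a hypothetical helper, and treating the grace period as extra minutes appended after the end hour is an assumption based on the table's description.

```javascript
// Hypothetical helper, not the scheduler's actual implementation.
// Returns true when `date` (local time) falls inside the allowed window.
function isWithinWindow(date, startHour, endHour, graceMinutes = 0) {
  const DAY = 24 * 60;
  const nowMinutes = date.getHours() * 60 + date.getMinutes();
  const startMinutes = startHour * 60;

  // Window length in minutes; the modulo handles windows that wrap past midnight.
  const windowLength = ((endHour * 60 - startMinutes + DAY) % DAY) + graceMinutes;

  // Minutes elapsed since the most recent window start.
  const elapsed = (nowMinutes - startMinutes + DAY) % DAY;
  return elapsed < windowLength;
}

// Night window 22:00 - 07:00 with a 30 minute grace period
const atNight = isWithinWindow(new Date(2024, 0, 1, 23, 0), 22, 7, 30);
const atNoon = isWithinWindow(new Date(2024, 0, 1, 12, 0), 22, 7, 30);
console.log(atNight, atNoon);
```

Working in "minutes since window start" avoids separate branches for same-day and overnight windows.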
| Environment Variable | Description |
|---|---|
| `PGHOST` | PostgreSQL host |
| `PGPORT` | PostgreSQL port (default: 5432) |
| `PGUSER` | PostgreSQL username |
| `PGPASSWORD` | PostgreSQL password |
| `PGDATABASE` | PostgreSQL database name |
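
The precedence rule for crawler options (CLI arguments first, then environment variables, then defaults) can be illustrated with a small resolver. This is a sketch only: `resolveOption` and the `cli`/`env` objects are hypothetical and do not reflect how `utils/config.js` is actually written.

```javascript
// Hypothetical resolver: CLI value wins, then environment variable, then default.
function resolveOption(cliValue, envValue, defaultValue) {
  if (cliValue !== undefined) return cliValue;
  if (envValue !== undefined && envValue !== "") return envValue;
  return defaultValue;
}

// Stand-ins for parsed CLI flags and process.env (assumed shapes).
const cli = { maxDepth: "5" };
const env = { MAX_DEPTH: "3", TIMEOUT_MS: "60000" };

const maxDepth = Number(resolveOption(cli.maxDepth, env.MAX_DEPTH, 10)); // CLI wins: 5
const timeoutMs = Number(resolveOption(cli.timeout, env.TIMEOUT_MS, 30000)); // env wins: 60000
const maxApis = Number(resolveOption(cli.maxApis, env.MAX_APIS, 5)); // default: 5
console.log({ maxDepth, timeoutMs, maxApis });
```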
- Copy the example environment file:
cp .env.example .env
- Edit `.env` to customize settings:
# Database Configuration
PGHOST=localhost
PGPORT=5432
PGUSER=postgres
PGPASSWORD=yourpassword
PGDATABASE=stac_db
# Crawler Configuration
CRAWL_MODE=both
MAX_CATALOGS=0 # 0 = unlimited
MAX_APIS=0 # 0 = unlimited
TIMEOUT_MS=30000
MAX_DEPTH=3
# Scheduler Configuration
CRAWL_DAYS_INTERVAL=7
CRAWL_RUN_ON_STARTUP=true
CRAWL_RETRY_ON_ERROR=true
CRAWL_RETRY_DELAY_HOURS=2
# Time Window Configuration (optional)
# Set CRAWL_ENFORCE_TIME_WINDOW=true to restrict crawling to specific hours
CRAWL_ENFORCE_TIME_WINDOW=false
CRAWL_ALLOWED_START_HOUR=22 # 10 PM
CRAWL_ALLOWED_END_HOUR=7 # 7 AM
CRAWL_GRACE_PERIOD_MINUTES=30
- Run the crawler or scheduler:
# Single run
npm start
# Scheduled runs
node scheduler.js

Run the crawler with command-line arguments to override defaults or environment variables:
# Crawl only catalogs with custom limits
node index.js --mode catalogs --max-catalogs 20
# Crawl only APIs with extended timeout
node index.js -m apis -a 10 -t 60000
# Crawl both with all custom settings
node index.js -m both -c 50 -a 20 -t 45000 -d 5
# Start fresh - clear crawl log and re-crawl everything
node index.js --fresh
# Combine fresh mode with other options
node index.js -f -m apis -a 10
# Configure parallel crawling for high-performance servers
node index.js -p 5 --rpm-per-domain 120 --concurrency-per-domain 10
# Full unlimited crawl with fresh start
node index.js -f -m both -c 0 -a 0 -d 0

Display all available options:
node index.js --help

# Install dependencies
npm install
# Run with default configuration
npm start
# Run with custom configuration via CLI
node index.js --mode catalogs --max-catalogs 15

# Start the scheduler (runs in foreground)
node scheduler.js
# The scheduler will:
# - Run crawler immediately on startup (if CRAWL_RUN_ON_STARTUP=true)
# - Schedule next run based on CRAWL_DAYS_INTERVAL
# - Wait for allowed time window (if CRAWL_ENFORCE_TIME_WINDOW=true)
# - Automatically retry on errors (if CRAWL_RETRY_ON_ERROR=true)
# - Stop gracefully with Ctrl+C

Example 1: Night-time only crawling (22:00 - 07:00)
CRAWL_ENFORCE_TIME_WINDOW=true
CRAWL_ALLOWED_START_HOUR=22
CRAWL_ALLOWED_END_HOUR=7

Example 2: Business hours crawling (09:00 - 17:00)
CRAWL_ENFORCE_TIME_WINDOW=true
CRAWL_ALLOWED_START_HOUR=9
CRAWL_ALLOWED_END_HOUR=17

Example 3: No restrictions (default)
CRAWL_ENFORCE_TIME_WINDOW=false

# Build the image
docker build -t stac-crawler .
# Run single crawl with default configuration
docker run --rm stac-crawler
# Run with environment variables
docker run --rm \
-e PGHOST=host.docker.internal \
-e PGPORT=5432 \
-e PGUSER=postgres \
-e PGPASSWORD=yourpassword \
-e PGDATABASE=stac_db \
-e CRAWL_MODE=apis \
-e MAX_APIS=10 \
stac-crawler
# Run with CLI arguments
docker run --rm stac-crawler --mode catalogs --max-catalogs 20
# Run scheduler in Docker (detached)
docker run -d \
--name stac-scheduler \
-e PGHOST=host.docker.internal \
-e CRAWL_DAYS_INTERVAL=7 \
stac-crawler node scheduler.js

Or use npm scripts:
npm run docker:build
npm run docker:run

Create a `.env` file or modify `docker-compose.yml` to set environment variables:
# Start the crawler (single run)
docker-compose up -d
# View logs
docker-compose logs -f
# Stop the crawler
docker-compose down

For scheduled crawling with Docker Compose, modify `docker-compose.yml`:
services:
  crawler:
    build: .
    command: node scheduler.js # Use scheduler instead of single run
    env_file: .env
    restart: unless-stopped # Auto-restart on failure

Or use npm scripts:
npm run docker:compose:up
npm run docker:compose:down

Run the complete test suite:
npm test

Run tests in watch mode during development:
npm run test:watch

Run tests with coverage report:
npm test -- --coverage

The test suite covers utility functions across four test modules:
- `normalization.test.js` - Tests for catalog and collection normalization
  - Tests `deriveCategories()`, `normalizeCatalog()`, `normalizeCollection()`, `processCatalogs()`
- `parallel.test.js` - Tests for parallel execution utilities
  - Tests `getDomain()`, `groupByDomain()`, `createDomainBatches()`, `aggregateStats()`, `executeWithConcurrency()`, `calculateRateLimits()`, `logDomainStats()`
- `api.test.js` - Tests for API crawling utilities
  - Tests batch management, URL validation, STAC API response structures
  - Uses real STAC API endpoints (Microsoft Planetary Computer, Element 84, USGS, NASA CMR)
- `is_api.test.js` - Tests for is_api field functionality
  - Verifies collections are correctly marked as API or static catalog collections
  - Tests `handleCatalog()` and `handleCollections()` from handlers.js
All tests use real STAC domain names and collection IDs from production STAC APIs for realistic testing.
The crawler uses carefully selected libraries for specific functionality:
- Purpose: Advanced web crawling framework with built-in request management
- Why chosen:
- Automatic retry logic with exponential backoff
- Built-in rate limiting per domain
- Concurrent request handling with configurable concurrency
- Request queue management for large-scale crawling
- Automatic handling of timeouts and errors
- Key features used:
- `HttpCrawler` - For HTTP requests with JSON parsing
- Request/response handlers for custom processing
- Domain-based crawling strategies
- Alternative considered: Axios alone - rejected because it lacks built-in queue management and retry logic
- Purpose: HTTP client for direct API calls (non-crawling requests)
- Why chosen:
- Simple interface for one-off requests (e.g., fetching catalog list)
- Wide adoption and reliability
- Promise-based async/await support
- Used for: Initial STAC Index API calls before crawling starts
- Purpose: STAC object manipulation and metadata extraction
- Why chosen:
- Official STAC library with spec-compliant parsers
- Type detection (Collection, Catalog, Item)
- Built-in methods for extent extraction (`getBoundingBox()`, `getTemporalExtent()`)
- Link resolution (relative to absolute URLs)
- Key features used:
- `create()` - Parse JSON into STAC objects
- `isCollection()`, `isCatalog()` - Type checking
- Extent extraction methods
- Purpose: Validate STAC JSON against official schemas
- Why chosen:
- Uses official STAC JSON schemas
- Validates core spec + extensions (EO, SAT, Projection, etc.)
- Detailed error reporting with field-level messages
- Async validation suitable for high-volume crawling
- Key features used:
- Full STAC spec validation (v1.0.0, v1.1.0 support)
- Extension schema validation
- Error message extraction for debugging
- Critical for: Data quality - filters out malformed STAC metadata before database insertion
- Purpose: PostgreSQL database client with modern async/await support
- Why chosen:
- Type-safe SQL queries with tagged template literals
- Connection pooling built-in
- Better TypeScript support than `pg` alone
- Cleaner API than raw `pg`
- Key features used:
- Connection pool management
- Parameterized queries (SQL injection prevention)
- Transaction support
- Purpose: Environment variable management from `.env` files
- Why chosen:
- Standard solution for 12-factor app configuration
- Keeps sensitive credentials out of source code
- Development/production environment separation
- Used for: Database credentials, crawler configuration, scheduler settings
- Purpose: Testing framework
- Why chosen:
- Industry standard for Node.js testing
- Built-in assertion library
- Parallel test execution
- Coverage reporting
- Module mocking support
- Test coverage: 110 tests across normalization, parallel execution, and API utilities
- Configuration: Uses ES modules (`--experimental-vm-modules`) for modern JavaScript support
Node.js built-ins and bundled modules:
- `pg` (Pool) - Part of `@databases/pg`, PostgreSQL connection pooling
- `process.env` - Environment variable access
- `console` - Logging (no external logger to keep dependencies minimal)
Decision: Use PostgreSQL as the primary database
Reason:
- PostGIS extension: Native geospatial support for bounding box queries
- JSONB type: Efficient storage of STAC summaries and nested metadata
- Robust transactions: ACID compliance prevents data corruption during concurrent crawls
- Indexing: B-tree, GiST, and GIN indexes for fast spatial and text searches
- Scalability: Handles millions of collections without performance degradation
Decision: Group catalogs/APIs by domain and process domains in parallel
Reason:
- Rate limiting: Each domain has independent rate limits - prevents throttling
- Politeness: Distributes load across servers, avoiding overwhelming single hosts
- Efficiency: Processes multiple domains simultaneously while respecting per-domain limits
- Fairness: Prevents slow domains from blocking fast domains
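
To make the domain-based strategy concrete, the sketch below groups URLs by hostname and then splits the domains into batches that can be processed in parallel. It is illustrative only: `groupUrlsByDomain` and `toDomainBatches` are hypothetical names, and the real helpers in `utils/parallel.js` may behave differently.

```javascript
// Illustrative only: not the project's utils/parallel.js implementation.
// Groups URLs by hostname so each domain can get its own rate limit.
function groupUrlsByDomain(urls) {
  const groups = new Map();
  for (const url of urls) {
    let host;
    try {
      host = new URL(url).hostname;
    } catch {
      continue; // skip strings that are not valid absolute URLs
    }
    if (!groups.has(host)) groups.set(host, []);
    groups.get(host).push(url);
  }
  return groups;
}

// Splits the domain names into batches of at most `parallelDomains` entries.
function toDomainBatches(groups, parallelDomains) {
  const domains = [...groups.keys()];
  const batches = [];
  for (let i = 0; i < domains.length; i += parallelDomains) {
    batches.push(domains.slice(i, i + parallelDomains));
  }
  return batches;
}

const groups = groupUrlsByDomain([
  "https://a.example.com/catalog.json",
  "https://b.example.com/catalog.json",
  "https://a.example.com/other/catalog.json",
  "https://c.example.com/catalog.json",
  "not a url",
]);
const batches = toDomainBatches(groups, 2);
console.log(batches);
```

With `PARALLEL_DOMAINS=2`, the three domains in this example form two batches, and requests within each domain stay subject to that domain's own limits.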
Decision: Keep single-run crawler (index.js) separate from scheduler (scheduler.js)
Reason:
- Flexibility: Users can run one-off crawls or automated schedules
- Testing: Easier to test crawler logic without scheduler complexity
- Resource efficiency: Single runs exit immediately, don't hold resources
- Debugging: Simpler to debug individual components
- Docker compatibility: Can run different commands in containers
Decision: Collect 25 collections in memory, then flush to database
Reason:
- Performance: Reduces database connection overhead (25x fewer transactions)
- Memory efficiency: Prevents unbounded memory growth on large crawls
- Error recovery: Smaller batches = less data lost on errors
- Deadlock mitigation: Fewer concurrent transactions reduce deadlock risk
Batch size selection:
- Tested on 2GB RAM servers → 25 collections = ~10MB memory footprint
- Larger batches (100+) caused OOM on constrained servers
- Smaller batches (5-10) increased database load significantly
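
The batching decision can be sketched as a small buffer that flushes every 25 items. This is a simplified, synchronous illustration: `CollectionBatch` and `flushFn` are hypothetical, and the real crawler presumably writes to PostgreSQL asynchronously.

```javascript
// Hypothetical buffer: collects items and hands them to flushFn in batches.
class CollectionBatch {
  constructor(flushFn, batchSize = 25) {
    this.flushFn = flushFn;
    this.batchSize = batchSize;
    this.items = [];
  }

  add(item) {
    this.items.push(item);
    if (this.items.length >= this.batchSize) this.flush();
  }

  // Also call once at the end of a crawl to write the final partial batch.
  flush() {
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = []; // reset first so memory stays bounded
    this.flushFn(batch);
  }
}

// Stand-in for a database write: records the size of each flushed batch.
const flushedSizes = [];
const buffer = new CollectionBatch((batch) => flushedSizes.push(batch.length));
for (let i = 0; i < 60; i++) buffer.add({ id: `collection-${i}` });
buffer.flush();
console.log(flushedSizes); // 60 items are flushed as batches of 25, 25 and 10
```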
Decision: Retry database deadlocks up to 3 times with exponential backoff
Reason:
- PostgreSQL behavior: Concurrent inserts on related tables (keywords, extensions) can deadlock
- Automatic recovery: Transient deadlocks resolve after retry
- Exponential backoff: Reduces contention by spreading out retry attempts
- Max retries: Prevents infinite loops on persistent deadlocks
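
A minimal sketch of deadlock retry with exponential backoff follows. It assumes the database driver exposes the PostgreSQL SQLSTATE as `err.code` (as node-postgres does); `withDeadlockRetry`, `backoffDelayMs` and the 100 ms base delay are hypothetical choices, not the project's actual values.

```javascript
// SQLSTATE for deadlock_detected in PostgreSQL.
const DEADLOCK_SQLSTATE = "40P01";

// Delay doubles with each attempt: base, 2x base, 4x base, ...
function backoffDelayMs(attempt, baseDelayMs = 100) {
  return baseDelayMs * 2 ** attempt;
}

// Runs `operation`, retrying only on deadlock errors, up to `maxRetries` times.
async function withDeadlockRetry(operation, { maxRetries = 3, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const isDeadlock = Boolean(err) && err.code === DEADLOCK_SQLSTATE;
      if (!isDeadlock || attempt >= maxRetries) throw err; // give up
      const delay = backoffDelayMs(attempt, baseDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Other errors are rethrown immediately, so only transient deadlocks pay the retry cost.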
- `index.js` - Main crawler entry point for single runs
- `scheduler.js` - Scheduler for periodic automated crawling
- `utils/db.js` - Database helper with PostgreSQL connection pool
- `utils/normalization.js` - Data normalization and processing
- `utils/parallel.js` - Parallel execution utilities with domain-based batching
- `utils/config.js` - Configuration management (env vars + CLI)
- `utils/time.js` - Time formatting utilities
- `utils/handlers.js` - Request handlers for catalogs and collections with STAC validation
- `utils/endpoints.js` - STAC API endpoint discovery utilities
- `catalogs/catalog.js` - Static catalog crawling logic
- `apis/api.js` - STAC API crawling logic
The crawler operates in two modes: static catalog crawling and STAC API crawling. Both modes follow a similar workflow but use different strategies to discover and process STAC collections.
- Initialization: Fetch the list of static catalogs from STAC Index API (`https://www.stacindex.org/api/catalogs`)
- Domain Grouping: Group catalogs by domain to enable parallel processing while respecting rate limits
- Parallel Execution: Process multiple domains simultaneously with configurable concurrency
- Recursive Traversal: For each catalog:
  - Fetch the catalog JSON from its URL
  - Validate STAC structure using `stac-node-validator`
  - Migrate to normalized format using `stac-js`
  - Extract child links (catalogs and collections)
  - Recursively follow catalog links up to `MAX_DEPTH` (default: 3)
  - Process collection links to extract metadata
- Link Following: The crawler follows STAC link relations:
  - `rel=child` - Navigate to child catalogs/collections
  - `rel=item` - Skip (items are not processed, only collections)
  - `rel=self` - Used to determine the source URL
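
The link-following rule can be sketched as a filter over a STAC object's `links` array that keeps only `rel=child` entries and resolves relative `href` values against the catalog's own URL. `selectChildUrls` is a hypothetical helper; the crawler itself relies on `stac-js` for link resolution.

```javascript
// Hypothetical helper: returns absolute URLs of child catalogs/collections.
function selectChildUrls(stacJson, baseUrl) {
  const links = Array.isArray(stacJson.links) ? stacJson.links : [];
  return links
    .filter((link) => link && link.rel === "child" && typeof link.href === "string")
    .map((link) => new URL(link.href, baseUrl).href); // resolve relative hrefs
}

const catalog = {
  type: "Catalog",
  id: "example",
  links: [
    { rel: "self", href: "https://example.com/catalog/catalog.json" },
    { rel: "child", href: "./sub/collection.json" },
    { rel: "item", href: "./items/item-1.json" }, // ignored: items are not crawled
  ],
};
const childUrls = selectChildUrls(catalog, "https://example.com/catalog/catalog.json");
console.log(childUrls);
```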
- Initialization: Fetch the list of STAC APIs from STAC Index API
- Domain Grouping: Same as static catalog crawling
- API Discovery: For each API:
  - Fetch the API root endpoint
  - Validate STAC API compliance
  - Discover `/collections` endpoint from API conformance or links
  - Try multiple endpoint variations if needed (`/collections`, `/search`, etc.)
- Collection Enumeration:
  - Fetch all collections from `/collections` endpoint
  - Handle pagination if the API returns paged results
  - Process each collection individually
- Nested Catalog Support: If a collection contains child catalog links, recursively crawl them (up to `MAX_DEPTH`)
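
Collection enumeration with pagination can be sketched as a loop that follows `rel=next` links until none remain. This assumes the API returns a `collections` array and a `links` array in each page, as STAC API `/collections` responses do. `fetchAllCollections`, the injected `fetchJson` function and the `maxPages` guard are hypothetical and not part of the project's code.

```javascript
// Hypothetical sketch: collects all collections across paged responses.
// `fetchJson(url)` is an injected async function that returns parsed JSON.
async function fetchAllCollections(collectionsUrl, fetchJson, maxPages = 100) {
  const all = [];
  let url = collectionsUrl;
  for (let page = 0; url && page < maxPages; page++) {
    const body = await fetchJson(url);
    all.push(...(body.collections ?? []));
    // Follow the "next" link if the API returned one; otherwise stop.
    const next = (body.links ?? []).find((link) => link.rel === "next");
    url = next ? new URL(next.href, url).href : null;
  }
  return all;
}
```

The `maxPages` guard is a defensive assumption to stop on APIs whose `next` links loop back on themselves.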
The crawler stores the following data in PostgreSQL:
Collections (main data):
- Core metadata: `stac_id` (generated from slug + collection ID), `title`, `description`, `license`
- Spatial extent: Bounding box (`bbox`) stored as PostGIS geometry
- Temporal extent: Start and end dates
- STAC version: Version of STAC specification used
- Source tracking: `source_url` (original collection URL), `crawllog_catalog_id` (reference to source catalog)
Related data (linked tables):
- Keywords: Extracted from collection metadata, stored in `collection_keywords` with many-to-many relation
- STAC Extensions: List of STAC extensions used (e.g., `eo`, `sat`, `proj`), stored in `collection_stac_extension`
- Providers: Data providers with name, description, roles, and URL
- Assets: Collection-level assets (thumbnails, documentation, etc.)
- Summaries: Statistical summaries of collection properties
Crawl tracking (for pause/resume):
- `crawllog_catalog`: Stores the catalog/API URLs and slugs for future re-crawling
- `crawllog_collection`: Records which collection URLs have been processed and when
What is NOT stored:
- Individual items: The crawler only processes collections, not individual STAC items
- Catalog metadata: Static catalogs are only used for traversal, not saved to the database
- Full link arrays: Only essential links (self, root) are preserved
How Pausing Works:
- Graceful Shutdown: Press `Ctrl+C` once to trigger graceful shutdown
- Batch Completion: The crawler finishes the current batch of requests before stopping
- Progress Saved: All processed collections are saved to `crawllog_collection` with their source URLs
- Safe Exit: Database connections are properly closed
How Resuming Works:
- URL Lookup: When restarting, the crawler queries `crawllog_collection` for already-processed URLs
- Skip Logic: URLs in the crawl log are skipped during traversal
- Continue from Interruption: Only new/unprocessed collections are fetched
- Idempotent: Running the crawler multiple times is safe - duplicates are handled via `ON CONFLICT` clauses
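
The skip logic used on resume amounts to a set-membership check against the URLs already recorded in the crawl log. The sketch below uses an in-memory array as a stand-in for the rows read from `crawllog_collection`; `selectUnprocessed` is a hypothetical name.

```javascript
// Hypothetical helper: keeps only URLs that are not yet in the crawl log.
function selectUnprocessed(candidateUrls, processedUrls) {
  const seen = new Set(processedUrls); // O(1) lookups per candidate
  return candidateUrls.filter((url) => !seen.has(url));
}

// Stand-in for URLs loaded from crawllog_collection on startup.
const alreadyCrawled = ["https://example.com/collections/a"];
const discovered = [
  "https://example.com/collections/a",
  "https://example.com/collections/b",
];
const toFetch = selectUnprocessed(discovered, alreadyCrawled);
console.log(toFetch);
```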
Force Stop: Press Ctrl+C twice for immediate termination (may leave incomplete transactions)
The scheduler (scheduler.js) provides automated periodic crawling:
- Interval-based: Runs every `CRAWL_DAYS_INTERVAL` days (default: 7)
- Time Window Enforcement: Optional restriction to specific hours (e.g., night-time only)
- Startup Behavior: Configurable immediate run on startup (`CRAWL_RUN_ON_STARTUP`)
- Error Recovery: Automatic retry on crawl errors with configurable delay
- Recrawl Strategy: Full re-crawl of all catalogs/APIs - `ON CONFLICT` ensures updates rather than duplicates
Scheduling Logic:
Startup → DB Check → Time Window Check → Run Crawler → Success?
↓ Yes ↓ No (crawl error)
Schedule Next Wait RETRY_DELAY → Retry
↓
Wait Until Next → Run Crawler
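
The branch at the end of the scheduling flow (schedule the next regular run on success, or wait for the retry delay on a crawl error) can be expressed as a small function. This is a sketch under assumptions: `nextRunDelayMs` and the `config` field names are hypothetical, and falling back to the regular interval when retries are disabled is a guess at the behaviour.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: how long to wait before the next crawl attempt.
function nextRunDelayMs(crawlSucceeded, config) {
  const regularDelay = config.daysInterval * 24 * HOUR_MS;
  if (crawlSucceeded) return regularDelay;
  if (config.retryOnError) return config.retryDelayHours * HOUR_MS;
  return regularDelay; // assumption: without retry, wait for the next regular run
}

// Values mirror CRAWL_DAYS_INTERVAL=7, CRAWL_RETRY_ON_ERROR=true, CRAWL_RETRY_DELAY_HOURS=2
const config = { daysInterval: 7, retryOnError: true, retryDelayHours: 2 };
const afterSuccess = nextRunDelayMs(true, config);
const afterFailure = nextRunDelayMs(false, config);
console.log(afterSuccess / HOUR_MS, afterFailure / HOUR_MS);
```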
The crawler implements multi-layer validation to ensure data quality:
Library: stac-node-validator (v2.0.0-rc.1)
What it validates:
- STAC JSON structure compliance with official STAC schemas
- Required fields presence (id, type, stac_version, etc.)
- Field types and formats
- STAC extension schemas (e.g., EO, SAT, Projection)
- Link relation requirements
When it runs: Before processing any catalog or collection
Error handling:
- Non-compliant structures are logged with detailed error messages
- Collections with validation errors are skipped
- Statistics track compliant vs. non-compliant items
Library: stac-js (v0.1.9)
What it validates:
- Converts raw JSON to typed STAC objects
- Validates object type (Collection, Catalog, Item)
- Validates link structure and relationships
- Extracts spatial/temporal extents using STAC-aware parsers
- Resolves relative URLs to absolute URLs
When it runs: After STAC spec validation passes
Error handling:
- Migration failures indicate malformed STAC structures
- Failed migrations are logged and skipped
- `stac-js` methods return null for invalid data (e.g., `getBoundingBox()`)
Module: utils/normalization.js
What it normalizes:
- Categories/Keywords: Derives from multiple possible fields (categories, keywords, tags)
- Temporal extents: Handles null values, open-ended intervals
- Bounding boxes: Validates array structure, handles missing coordinates
- URLs: Extracts self links, resolves relative paths
- Provider roles: Normalizes role names (producer, processor, host, licensor)
- Fallback strategy: Uses multiple fallback levels to extract data
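
As one example of these checks, a bounding box can be validated by confirming it is an array of 4 (2D) or 6 (3D) finite numbers, which are the lengths the STAC specification allows. `normalizeBbox` is a hypothetical function and is not the code in `utils/normalization.js`.

```javascript
// Hypothetical check: returns the bbox if structurally valid, otherwise null.
function normalizeBbox(bbox) {
  if (!Array.isArray(bbox)) return null;
  if (bbox.length !== 4 && bbox.length !== 6) return null; // 2D or 3D only
  const allFinite = bbox.every((n) => typeof n === "number" && Number.isFinite(n));
  return allFinite ? bbox : null;
}

// In a STAC Collection the overall bbox is the first entry of extent.spatial.bbox.
const collection = { extent: { spatial: { bbox: [[-180, -90, 180, 90]] } } };
const overall = normalizeBbox(collection.extent.spatial.bbox[0]);
const broken = normalizeBbox([-180, null, 180, 90]);
console.log(overall, broken);
```

Returning `null` instead of throwing lets the caller fall back to other sources or store an empty extent.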
Validation checks:
- URL format: Ensures valid HTTP/HTTPS URLs before making requests
- Response status: Checks for 200 OK status codes
- Content-Type: Accepts JSON, GeoJSON, and some binary/text types
- Timeout enforcement: Requests timeout after configured duration
- Retry logic: Automatic retry with exponential backoff for failed requests
Rate limiting:
- Per-domain rate limits prevent overwhelming servers
- Configurable requests per minute per domain
- Crawler respects HTTP 429 (Too Many Requests) responses
The crawler tracks validation results:
- `stacCompliant` - Collections passing STAC validation
- `nonCompliant` - Collections failing STAC validation
- `collectionsSaved` - Successfully saved to database
- `collectionsFailed` - Failed database insertion
Example output:
Validation Results:
STAC Compliant: 450
Non-compliant: 12
Saved to DB: 448
Failed to save: 2
Check that:
- Database connection is configured correctly in `.env`
- Database is accessible and running
- Time window settings allow execution (if `CRAWL_ENFORCE_TIME_WINDOW=true`)
View scheduler status:
node scheduler.js
# Output shows current configuration and time window status

Increase `CRAWL_DAYS_INTERVAL`:
CRAWL_DAYS_INTERVAL=7 # Run every week

This is controlled by time window enforcement. To allow crawling anytime:
CRAWL_ENFORCE_TIME_WINDOW=false

Verify database configuration:
# Test connection manually
psql -h $PGHOST -p $PGPORT -U $PGUSER -d $PGDATABASE

Check environment variables are loaded:
node -e "require('dotenv').config(); console.log(process.env.PGHOST)"

The crawler has automatic deadlock retry logic with exponential backoff. If deadlocks persist:
- Reduce parallel execution settings
- Increase database connection pool size
- Check database load and indexing
The defaults are optimized for 2GB RAM servers. Control parallel processing via environment variables or CLI:
| Setting | Default | Description |
|---|---|---|
| `PARALLEL_DOMAINS` | `2` | Number of domains to process simultaneously |
| `MAX_REQUESTS_PER_MINUTE_PER_DOMAIN` | `60` | Rate limit per domain |
| `MAX_CONCURRENCY_PER_DOMAIN` | `5` | Max concurrent requests per domain |
Theoretical max throughput = PARALLEL_DOMAINS x MAX_REQUESTS_PER_MINUTE_PER_DOMAIN requests/min
Example for higher-resource servers:
# High-performance settings (4+ GB RAM)
PARALLEL_DOMAINS=5
MAX_REQUESTS_PER_MINUTE_PER_DOMAIN=120
MAX_CONCURRENCY_PER_DOMAIN=10
# Theoretical throughput: 600 req/min

Adjust pool size in `utils/db.js`:
const pool = new Pool({
  // ... other settings
  max: 10, // Increase for higher parallelism
});

Increase timeouts for slow endpoints:
TIMEOUT_MS=120000 # 2 minutes

npm start # Run crawler once
npm test # Run all tests
npm run test:watch # Run tests in watch mode
npm run docker:build # Build Docker image
npm run docker:run # Run Docker container
npm run docker:compose:up # Start with docker-compose
npm run docker:compose:down # Stop docker-compose

See LICENSE file in the project root.
Crawl only the first 3 APIs with a short timeout:
node index.js -m apis -a 3 -t 15000

Crawl 100 catalogs with maximum depth and extended timeout:
node index.js -m catalogs -c 100 -d 10 -t 120000

Crawl both catalogs and APIs with moderate settings:
node index.js -m both -c 25 -a 15 -t 45000 -d 4

Run complete crawl every 7 days, anytime:
CRAWL_DAYS_INTERVAL=7
CRAWL_RUN_ON_STARTUP=true
CRAWL_ENFORCE_TIME_WINDOW=false

Run every 7 days, only between 22:00 and 07:00:
CRAWL_DAYS_INTERVAL=7
CRAWL_ENFORCE_TIME_WINDOW=true
CRAWL_ALLOWED_START_HOUR=22
CRAWL_ALLOWED_END_HOUR=7
CRAWL_GRACE_PERIOD_MINUTES=30

Run every day with retry on errors:
CRAWL_DAYS_INTERVAL=1
CRAWL_RUN_ON_STARTUP=true
CRAWL_RETRY_ON_ERROR=true
CRAWL_RETRY_DELAY_HOURS=2

Full production configuration in `.env`:
# Database
PGHOST=db.production.com
PGPORT=5432
PGUSER=crawler_user
PGPASSWORD=secure_password
PGDATABASE=stac_production
# Crawler - Full scan
CRAWL_MODE=both
MAX_CATALOGS=0 # Unlimited
MAX_APIS=0 # Unlimited
TIMEOUT_MS=60000
MAX_DEPTH=5
# Scheduler - Weekly night crawls
CRAWL_DAYS_INTERVAL=7
CRAWL_RUN_ON_STARTUP=false # Wait for scheduled time
CRAWL_RETRY_ON_ERROR=true
CRAWL_RETRY_DELAY_HOURS=2
# Time Window - Night time only
CRAWL_ENFORCE_TIME_WINDOW=true
CRAWL_ALLOWED_START_HOUR=22
CRAWL_ALLOWED_END_HOUR=7
CRAWL_GRACE_PERIOD_MINUTES=30

Then run the scheduler:
node scheduler.js