A unified, modular Python pipeline for discovering and scraping job listings from Avature-powered career sites. This project automates the entire process from initial site verification to deep discovery of new domains and final job extraction.
The pipeline operates in four distinct stages, all managed by a central orchestrator.
```mermaid
graph TD
    A[Urls.txt] --> B(main.py)
    subgraph "The Pipeline (src/)"
        B --> C["1. Verification (Seeds)"]
        B --> D["2. Discovery (Domains)"]
        B --> E["3. Enrichment (Endpoints)"]
        B --> F["4. Scraping (Jobs)"]
    end
    subgraph "Results (output/)"
        C --> G[verified_seed_urls.json]
        D --> H[discovered_domains.json]
        E --> I[enriched_domains.json]
        G & I --> J[all_sites_to_scrape.json]
        F --> K[jobs.json]
    end
```
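The stages above are selected by command-line flags on the orchestrator. A minimal sketch of how `main.py` might map flags to stages, assuming `argparse` and illustrative stage names (the project's actual internals may differ):

```python
import argparse


def build_parser():
    """Build the orchestrator's CLI (flag names match the README; wiring is illustrative)."""
    parser = argparse.ArgumentParser(description="Avature scraping pipeline")
    for flag in ("all", "verify-seeds", "discovery", "enrich", "combine", "scrape"):
        parser.add_argument(f"--{flag}", action="store_true")
    parser.add_argument("--limit", type=int, default=None,
                        help="Cap the number of sites processed (for quick tests)")
    return parser


def select_stages(args):
    """Return the stages to run; --all enables everything, in pipeline order."""
    stages = [
        ("verify_seeds", args.all or args.verify_seeds),
        ("discovery", args.all or args.discovery),
        ("enrich", args.all or args.enrich),
        ("combine", args.all or args.combine),
        ("scrape", args.all or args.scrape),
    ]
    return [name for name, enabled in stages if enabled]


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    for stage in select_stages(cli_args):
        print(f"Running stage: {stage}")
```

Running stages in this fixed order matters because each stage consumes the previous stage's output file.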
- `main.py`: The central entry point for running the pipeline.
- `src/`: Modular logic of the application.
  - `verification/`: Logic for validating whether a domain is an active Avature site (legacy Phase 1).
  - `discovery/`: Advanced strategies for finding new domains (CT logs, GitHub, DNS, etc.; legacy Phase 2).
  - `enrichment/`: Deep discovery techniques to find specific `SearchJobs` endpoints.
  - `scraping/`: The scraper that extracts job details and descriptions.
  - `utils/`: Shared helpers for permissions (robots.txt) and deduplication.
- `output/`: Directory containing all intermediate and final result files.
- `Urls.txt`: The initial seed list of candidate domains.
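The deduplication helper in `utils/` can be approximated with a URL-normalization pass; this is a sketch, and the function names (`normalize_url`, `dedupe`) are assumptions rather than the project's actual API:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Normalize a URL so trivially different variants compare equal.

    Illustrative policy: case-insensitive, scheme ignored, leading "www."
    and trailing slashes dropped.
    """
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return f"{host}{parts.path.rstrip('/')}"


def dedupe(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Normalizing before comparison is what lets the pipeline treat `https://www.example.avature.net/careers/` and `http://example.avature.net/careers` as the same site.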
The pipeline requires Python 3.x and the dependencies listed in the project.
To execute the entire process from start to finish:
```
python main.py --all
```

You can run individual stages if you only need certain results:

- Seed Verification: `python main.py --verify-seeds`
- Domain Discovery: `python main.py --discovery`
- Endpoint Enrichment: `python main.py --enrich`
- Combine Lists: `python main.py --combine` (generates the final submission list)
- Scrape Jobs: `python main.py --scrape`
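The `--combine` step merges the verified seeds with the enriched discoveries into the final scrape list. The input and output file names come from the diagram above; the merge logic itself is a sketch that assumes each file holds a flat JSON list of URL strings:

```python
import json
from pathlib import Path


def combine(output_dir="output"):
    """Merge verified seeds and enriched domains into all_sites_to_scrape.json.

    Illustrative: assumes each input file is a JSON array of URL strings.
    Missing inputs are skipped so the step can run after a partial pipeline.
    """
    out = Path(output_dir)
    combined, seen = [], set()
    for name in ("verified_seed_urls.json", "enriched_domains.json"):
        path = out / name
        if not path.exists():
            continue
        for url in json.loads(path.read_text()):
            if url not in seen:
                seen.add(url)
                combined.append(url)
    (out / "all_sites_to_scrape.json").write_text(json.dumps(combined, indent=2))
    return combined
```

Seeds are read first so they keep priority ordering in the merged list.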
To run a quick test on a limited number of sites:
```
python main.py --all --limit 5
```

All results are saved to the `output/` directory:

- `verified_seed_urls.json`: Sites confirmed from your initial list.
- `all_sites_to_scrape.json`: The final, definitive list of all sites (seeds + discovered) used for scraping. This is the file for submission.
- `jobs.json`: The final product containing all unique job listings with full descriptions.
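Because the outputs are plain JSON, they are easy to inspect after a run. For example, assuming `jobs.json` is a list of job objects with an optional `site` field (the exact schema is not documented here), a quick per-site summary might look like:

```python
import json
from collections import Counter


def summarize_jobs(path="output/jobs.json"):
    """Print a quick summary of scraped jobs, grouped by source site.

    Illustrative: the "site" key is an assumed field, not a documented schema.
    """
    with open(path) as fh:
        jobs = json.load(fh)
    print(f"Total jobs: {len(jobs)}")
    by_site = Counter(job.get("site", "unknown") for job in jobs)
    for site, count in by_site.most_common(5):
        print(f"  {site}: {count}")
    return by_site
```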