A unified, modular Python pipeline for discovering and scraping job listings from Avature-powered career sites. This project automates the entire process from initial site verification to deep discovery of new domains and final job extraction.
The pipeline operates in four distinct stages, all managed by a central orchestrator.
```mermaid
graph TD
    A[Urls.txt] --> B(main.py)
    subgraph "The Pipeline (src/)"
        B --> C["1. Verification (Seeds)"]
        B --> D["2. Discovery (Domains)"]
        B --> E["3. Enrichment (Endpoints)"]
        B --> F["4. Scraping (Jobs)"]
    end
    subgraph "Results (output/)"
        C --> G[verified_seed_urls.json]
        D --> H[discovered_domains.json]
        E --> I[enriched_domains.json]
        G & I --> J[all_sites_to_scrape.json]
        F --> K[jobs.json]
    end
```
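The stages above are selected by command-line flags on the orchestrator. A minimal sketch of how `main.py` might map flags to stages, assuming `argparse` and illustrative stage names (the project's actual internals may differ):

```python
import argparse


def build_parser():
    """Build the orchestrator's CLI (flag names match the README; wiring is illustrative)."""
    parser = argparse.ArgumentParser(description="Avature scraping pipeline")
    for flag in ("all", "verify-seeds", "discovery", "enrich", "combine", "scrape"):
        parser.add_argument(f"--{flag}", action="store_true")
    parser.add_argument("--limit", type=int, default=None,
                        help="Cap the number of sites processed (for quick tests)")
    return parser


def select_stages(args):
    """Return the stages to run; --all enables everything, in pipeline order."""
    stages = [
        ("verify_seeds", args.all or args.verify_seeds),
        ("discovery", args.all or args.discovery),
        ("enrich", args.all or args.enrich),
        ("combine", args.all or args.combine),
        ("scrape", args.all or args.scrape),
    ]
    return [name for name, enabled in stages if enabled]


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    for stage in select_stages(cli_args):
        print(f"Running stage: {stage}")
```

Running stages in this fixed order matters because each stage consumes the previous stage's output file.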
- `main.py`: The central entry point for running the pipeline.
- `src/`: Modular logic of the application.
  - `verification/`: Logic for validating whether a domain is an active Avature site (legacy Phase 1).
  - `discovery/`: Advanced strategies for finding new domains (CT logs, GitHub, DNS, etc.; legacy Phase 2).
  - `enrichment/`: Deep discovery techniques to find specific `SearchJobs` endpoints.
  - `scraping/`: The scraper that extracts job details and descriptions.
  - `utils/`: Shared helpers for permissions (robots.txt) and deduplication.
- `output/`: Directory containing all intermediate and final result files.
- `Urls.txt`: The initial seed list of candidate domains.
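The deduplication helper in `utils/` can be approximated with a URL-normalization pass; this is a sketch, and the function names (`normalize_url`, `dedupe`) are assumptions rather than the project's actual API:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Normalize a URL so trivially different variants compare equal.

    Illustrative policy: case-insensitive, scheme ignored, leading "www."
    and trailing slashes dropped.
    """
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return f"{host}{parts.path.rstrip('/')}"


def dedupe(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Normalizing before comparison is what lets the pipeline treat `https://www.example.avature.net/careers/` and `http://example.avature.net/careers` as the same site.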
The pipeline requires Python 3.x and the dependencies listed in the project.
To execute the entire process from start to finish:
```
python main.py --all
```

You can run individual stages if you only need certain results:

- Seed Verification: `python main.py --verify-seeds`
- Domain Discovery: `python main.py --discovery`
- Endpoint Enrichment: `python main.py --enrich`
- Combine Lists: `python main.py --combine` (generates the final submission list)
- Scrape Jobs: `python main.py --scrape`
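The `--combine` step merges the verified seeds with the enriched discoveries into the final scrape list. The input and output file names come from the diagram above; the merge logic itself is a sketch that assumes each file holds a flat JSON list of URL strings:

```python
import json
from pathlib import Path


def combine(output_dir="output"):
    """Merge verified seeds and enriched domains into all_sites_to_scrape.json.

    Illustrative: assumes each input file is a JSON array of URL strings.
    Missing inputs are skipped so the step can run after a partial pipeline.
    """
    out = Path(output_dir)
    combined, seen = [], set()
    for name in ("verified_seed_urls.json", "enriched_domains.json"):
        path = out / name
        if not path.exists():
            continue
        for url in json.loads(path.read_text()):
            if url not in seen:
                seen.add(url)
                combined.append(url)
    (out / "all_sites_to_scrape.json").write_text(json.dumps(combined, indent=2))
    return combined
```

Seeds are read first so they keep priority ordering in the merged list.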
To run a quick test on a limited number of sites:
```
python main.py --all --limit 5
```

All results are saved to the `output/` directory:

- `verified_seed_urls.json`: Sites confirmed from your initial list.
- `all_sites_to_scrape.json`: The final, definitive list of all sites (seeds + discovered) used for scraping. This is the file for submission.
- `jobs.json`: The final product containing all unique job listings with full descriptions.
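Because the outputs are plain JSON, they are easy to inspect after a run. For example, assuming `jobs.json` is a list of job objects with an optional `site` field (the exact schema is not documented here), a quick per-site summary might look like:

```python
import json
from collections import Counter


def summarize_jobs(path="output/jobs.json"):
    """Print a quick summary of scraped jobs, grouped by source site.

    Illustrative: the "site" key is an assumed field, not a documented schema.
    """
    with open(path) as fh:
        jobs = json.load(fh)
    print(f"Total jobs: {len(jobs)}")
    by_site = Counter(job.get("site", "unknown") for job in jobs)
    for site, count in by_site.most_common(5):
        print(f"  {site}: {count}")
    return by_site
```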