DhanushAkula/Avature-Scrapper

# Avature Job Discovery & Scraper Pipeline

A unified, modular Python pipeline for discovering and scraping job listings from Avature-powered career sites. This project automates the entire process from initial site verification to deep discovery of new domains and final job extraction.

## Workflow Overview

The pipeline operates in four distinct stages, all managed by a central orchestrator.

```mermaid
graph TD
    A[Urls.txt] --> B(main.py)

    subgraph "The Pipeline (src/)"
        B --> C["1. Verification (Seeds)"]
        B --> D["2. Discovery (Domains)"]
        B --> E["3. Enrichment (Endpoints)"]
        B --> F["4. Scraping (Jobs)"]
    end

    subgraph "Results (output/)"
        C --> G[verified_seed_urls.json]
        D --> H[discovered_domains.json]
        E --> I[enriched_domains.json]
        G & I --> J[all_sites_to_scrape.json]
        F --> K[jobs.json]
    end
```
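In code, the stage chaining shown in the diagram might look roughly like this. This is a hedged sketch: the stage bodies are placeholders, not the actual logic in the `src/` modules.

```python
# Hedged sketch of the orchestration implied by the diagram; each stage
# body is a placeholder for the corresponding src/ module.
def run_pipeline(seed_urls, limit=None):
    sites = seed_urls[:limit] if limit is not None else seed_urls
    verified = [u for u in sites if u.strip()]         # 1. Verification (Seeds)
    discovered = []                                    # 2. Discovery (Domains)
    enriched = list(discovered)                        # 3. Enrichment (Endpoints)
    to_scrape = sorted(set(verified) | set(enriched))  # Combine -> all_sites_to_scrape
    return [{"site": url} for url in to_scrape]        # 4. Scraping (Jobs)
```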

## Project Structure

- `main.py`: The central entry point for running the pipeline.
- `src/`: The application's modular logic.
  - `verification/`: Validates whether a domain is an active Avature site (legacy Phase 1).
  - `discovery/`: Advanced strategies for finding new domains (CT logs, GitHub, DNS, etc.; legacy Phase 2).
  - `enrichment/`: Deep-discovery techniques for locating specific SearchJobs endpoints.
  - `scraping/`: The scraper that extracts job details and descriptions.
  - `utils/`: Shared helpers for permissions (robots.txt) and deduplication.
- `output/`: Directory containing all intermediate and final result files.
- `Urls.txt`: The initial seed list of candidate domains.
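The robots.txt permission check in `utils/` can be illustrated with the standard library's `urllib.robotparser`. A minimal sketch; the `AvatureScraper` user-agent string is an assumption, not the name the project actually uses:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, agent: str = "AvatureScraper") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`.

    The agent name is illustrative; substitute the scraper's real user agent.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```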

## Execution Instructions

The pipeline requires Python 3.x and the dependencies listed in the project.

### Run Full Pipeline

To execute the entire process from start to finish:

```bash
python main.py --all
```

### Run Specific Stages

You can run individual stages if you only need certain results:

- **Seed Verification:** `python main.py --verify-seeds`
- **Domain Discovery:** `python main.py --discovery`
- **Endpoint Enrichment:** `python main.py --enrich`
- **Combine Lists:** `python main.py --combine` (generates the final submission list)
- **Scrape Jobs:** `python main.py --scrape`
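A minimal sketch of what the combine step likely does, assuming each JSON file holds a flat list of URLs (the file names match the documented outputs, but the real merge logic may differ):

```python
import json
from pathlib import Path

def combine(out_dir="output"):
    """Merge seed and enriched URL lists into all_sites_to_scrape.json, de-duplicated.

    Assumes each input file is a JSON array of URL strings.
    """
    merged, seen = [], set()
    for name in ("verified_seed_urls.json", "enriched_domains.json"):
        path = Path(out_dir) / name
        if not path.exists():
            continue
        for url in json.loads(path.read_text()):
            key = url.lower().rstrip("/")  # normalize case and trailing slash
            if key not in seen:
                seen.add(key)
                merged.append(url)
    (Path(out_dir) / "all_sites_to_scrape.json").write_text(json.dumps(merged, indent=2))
    return merged
```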

### Development/Testing

To run a quick test on a limited number of sites:

```bash
python main.py --all --limit 5
```
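The documented flags suggest an `argparse` interface along these lines. A sketch only: flag handling in the actual `main.py` may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the documented flags (an assumed layout)."""
    p = argparse.ArgumentParser(description="Avature job discovery & scraping pipeline")
    p.add_argument("--all", action="store_true", help="run every stage in order")
    p.add_argument("--verify-seeds", action="store_true", help="stage 1: seed verification")
    p.add_argument("--discovery", action="store_true", help="stage 2: domain discovery")
    p.add_argument("--enrich", action="store_true", help="stage 3: endpoint enrichment")
    p.add_argument("--combine", action="store_true", help="merge lists for submission")
    p.add_argument("--scrape", action="store_true", help="stage 4: job scraping")
    p.add_argument("--limit", type=int, default=None, help="cap the number of sites processed")
    return p
```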

## Key Output Files

All results are saved to the output/ directory:

- `verified_seed_urls.json`: Sites confirmed from your initial list.
- `all_sites_to_scrape.json`: The final definitive list of all sites (seeds + discovered) used for scraping. This is the file for submission.
- `jobs.json`: The final product containing all unique job listings with full descriptions.
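Once a run finishes, `jobs.json` can be sanity-checked with a few lines of Python. Note the `url` key is an assumption about the output schema; adjust it to whatever field the scraper actually emits.

```python
def summarize_jobs(jobs):
    """Count total and unique listings, keyed by job URL (assumed field name)."""
    unique = {job.get("url") for job in jobs}
    return {"total": len(jobs), "unique": len(unique)}
```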
