Health Resource Crawler

Project Layout - Complete Project Structure

health-crawler/
├── README.md
├── requirements.txt
├── data/
│   ├── README.md
│   └── websites/
│       ├── us-ca.csv
│       ├── us-or.csv
│       ├── us-tx.csv
│       └── ...
├── examples/
│   ├── simple_example.py
│   ├── categorized_example.py
│   ├── batch_crawler_example.py
│   ├── clean_and_save.py
│   ├── cleaned_output/
│   │   ├── batch_crawl_results_20251129_145353.cleaned
│   │   └── ...
│   ├── output/
│   │   ├── batch_crawl_results_20251129_143325.json
│   │   ├── batch_crawl_results_20251129_144556.json
│   │   ├── batch_crawl_results_20251129_145053.json
│   │   ├── batch_crawl_results_20251129_145353.json
│   │   └── ...
│   └── summary_reports/
│       ├── summary_report_20251129_143325.txt
│       ├── summary_report_20251129_1445565.json
│       ├── summary_report_20251129_145053.json
│       ├── summary_report_20251129_145353.json
│       └── ...
└── docs/
    ├── DATA_DICTIONARY.md
    └── SOURCE_CATALOG.md

An intermediate-friendly web crawler for extracting public health resources from county websites.

What This Does

This crawler finds:

📞 Phone numbers for health services (toll-free numbers, crisis lines, emergency contacts, etc.)
📍 Addresses of clinics and health facilities
🏥 Names of healthcare facilities
👨‍🔧 Services offered by the respective county department
🏷️ Automatically categorizes and tags each resource
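
As an illustration of the first item, here is a minimal sketch of pulling US-style phone numbers out of page text with a regular expression. The pattern and the `find_phone_numbers` helper are assumptions made for this example, not the crawler's actual extraction logic.

```python
import re

# Hypothetical pattern: optional leading "1", a 3-digit area code (optionally
# in parentheses), then 3 and 4 digits separated by a space, dot, or hyphen.
PHONE_RE = re.compile(
    r"(?<!\d)(?:1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def find_phone_numbers(text: str) -> list[str]:
    """Return phone-number-like substrings in the order they appear."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]


sample = "Crisis line: 1-800-555-0199. Clinic front desk: (503) 555-0123."
print(find_phone_numbers(sample))
```

The lookarounds `(?<!\d)` and `(?!\d)` keep the pattern from matching inside longer digit runs such as ID numbers.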

Installation

# Install dependencies
pip install -r requirements.txt

Running the Batch Crawler

cd examples

python batch_crawler_example.py

Running the Cleaning Script

After running the batch crawler, run clean_and_save.py to generate a cleaned JSON file. The script automatically picks the latest JSON file from examples/output/.

cd examples

python clean_and_save.py

When it finishes, the script has moved low-confidence items to unverified_resources and written the cleaned JSON to examples/cleaned_output/.
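
How the script chooses the "latest" file is not spelled out here. One plausible approach, sketched below, relies on the YYYYMMDD_HHMMSS timestamp embedded in each file name, so that the alphabetically greatest name is also the newest. The `pick_latest` helper is hypothetical and not taken from clean_and_save.py.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_latest(paths: Iterable[Path]) -> Optional[Path]:
    """Return the newest batch_crawl_results_*.json path, or None if there is none.

    Assumes every file name embeds a YYYYMMDD_HHMMSS timestamp, so comparing
    names as strings also compares their timestamps.
    """
    results = [
        p for p in paths
        if p.name.startswith("batch_crawl_results_") and p.suffix == ".json"
    ]
    return max(results, key=lambda p: p.name, default=None)
```

From inside examples/, it could be called as `pick_latest(Path("output").iterdir())`.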

Important Considerations

Crawl respectfully and honor each site's robots.txt.
Add delays between requests (the scraper currently uses time.sleep(2)).
Some websites may block automated access.
This project is for educational purposes only.
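
The first two points can be sketched with the standard library alone. This is a minimal illustration, assuming the robots.txt text has already been downloaded. The `USER_AGENT` value and both helper names are hypothetical, and the crawler's own implementation may differ.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "health-crawler-example"  # hypothetical user-agent string
DELAY_SECONDS = 2  # same pause as the time.sleep(2) mentioned above


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(parser, urls, delay=DELAY_SECONDS):
    """Yield only the URLs robots.txt permits, pausing before the next one."""
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # skip anything the site asks crawlers not to visit
        yield url  # the caller fetches the page here
        time.sleep(delay)


rules = build_robots_parser("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch(USER_AGENT, "https://example.org/private/staff"))
```

Because `polite_urls` is a generator, the pause runs only when the caller asks for the next URL, which keeps the delay between requests rather than before the first one.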

State Data

We were given CSV files listing county health department websites for each of the 50 states. For example:

California: 58 counties in data/websites/us-ca.csv
Oregon: 36 counties in data/websites/us-or.csv
Texas: 254 counties in data/websites/us-tx.csv
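
A minimal sketch of reading one of these files into a county-to-URL mapping follows. The column names `county` and `url` are assumptions, since the CSV headers are not documented here. The sample rows use placeholder URLs.

```python
import csv
import io

# Hypothetical CSV content; the real headers in data/websites/*.csv may differ.
SAMPLE_CSV = """county,url
Alameda,https://example.org/alameda-health
Alpine,https://example.org/alpine-health
"""


def load_county_sites(csv_file) -> dict[str, str]:
    """Map each county name to its website URL from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row["county"].strip(): row["url"].strip() for row in reader}


sites = load_county_sites(io.StringIO(SAMPLE_CSV))
print(len(sites), "counties loaded")
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.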

Categories and Tags

Resources are automatically categorized:

CONTACT_INFO: Phone numbers, emails
LOCATION: Addresses, geographic areas
FACILITY: Clinic and hospital names
SERVICE: Health services offered

Resources are also tagged by common health topics:
vaccination, flu, COVID-19, pediatric, dental, mental_health, vision, measles, etc.
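
One simple way to produce such tags is whole-word keyword matching, sketched below. The keyword lists and the `tag_text` and `make_resource` helpers are hypothetical. Only the four category names come from the list above, and the crawler's real rules may differ.

```python
import re
from enum import Enum


class Category(Enum):
    """The four categories listed above."""
    CONTACT_INFO = "CONTACT_INFO"
    LOCATION = "LOCATION"
    FACILITY = "FACILITY"
    SERVICE = "SERVICE"


# Hypothetical keyword lists; the crawler's real matching rules may differ.
TAG_KEYWORDS = {
    "vaccination": ["vaccine", "vaccination", "immunization"],
    "flu": ["flu", "influenza"],
    "dental": ["dental", "dentist"],
    "mental_health": ["mental health", "counseling"],
}


def tag_text(text: str) -> list[str]:
    """Return every tag with at least one keyword appearing as a whole word."""
    lowered = text.lower()
    tags = []
    for tag, keywords in TAG_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            tags.append(tag)
    return tags


def make_resource(category: Category, value: str, context: str) -> dict:
    """Bundle an extracted value with its category and the tags found nearby."""
    return {"category": category.value, "value": value, "tags": tag_text(context)}


print(tag_text("Free flu shots and dental screenings"))
```

The `\b` word boundaries stop short keywords such as "flu" from matching inside unrelated words like "fluoride".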

Outputs

Raw JSON data saved to examples/output/
Human-readable county summaries saved to examples/summary_reports/
Cleaned JSON saved to examples/cleaned_output/

Cleaning & Quality Assurance

The repo includes examples/clean_and_save.py. This script normalizes values, moves low-confidence items to unverified_resources, and writes the cleaned JSON to the examples/cleaned_output/ folder.
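
The split described above can be sketched as follows. The 0.5 threshold and the `value`, `confidence`, and `resources` keys are assumptions for illustration. Only the `unverified_resources` name comes from this README, and the normalization shown (collapsing whitespace) is a stand-in for whatever the script actually does.

```python
CONFIDENCE_THRESHOLD = 0.5  # hypothetical cutoff; the script's value is not documented here


def split_by_confidence(resources, threshold=CONFIDENCE_THRESHOLD):
    """Normalize each item, then separate it by its confidence score.

    Items without a "confidence" key are treated as unverified.
    """
    verified, unverified = [], []
    for item in resources:
        # Stand-in normalization: trim and collapse runs of whitespace.
        cleaned = {**item, "value": " ".join(str(item.get("value", "")).split())}
        if item.get("confidence", 0.0) >= threshold:
            verified.append(cleaned)
        else:
            unverified.append(cleaned)
    return {"resources": verified, "unverified_resources": unverified}
```

Keeping the low-confidence items in a separate list, rather than discarding them, leaves them available for manual review.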

Future Work

Service coverage: Analyze which health topics are well covered and which are poorly covered
Website quality scoring: Rate sites by completeness of information
Data visualization: Create charts of the findings to reveal patterns
Database integration: Store results in a proper database
State-wise report generation: Create a per-state summary so that resources can be compared across states

Privacy / Ethics

Before publishing, manually review unverified_resources, as it may contain false positives.
