sualehalam/Wehealth-webscraping-project

Health Resource Crawler

Project Layout - Complete Project Structure

health-crawler/
├── README.md
├── requirements.txt
├── data/
│   ├── README.md
│   └── websites/
│       ├── us-ca.csv
│       ├── us-or.csv
│       ├── us-tx.csv
│       └── ...
├── examples/
│   ├── simple_example.py
│   ├── categorized_example.py
│   ├── batch_crawler_example.py
│   ├── clean_and_save.py
│   ├── cleaned_output/
│   │   ├── batch_crawl_results_20251129_145353.cleaned
│   │   └── ...
│   ├── output/
│   │   ├── batch_crawl_results_20251129_143325.json
│   │   ├── batch_crawl_results_20251129_144556.json
│   │   ├── batch_crawl_results_20251129_145053.json
│   │   ├── batch_crawl_results_20251129_145353.json
│   │   └── ...
│   └── summary_reports/
│       ├── summary_report_20251129_143325.txt
│       ├── summary_report_20251129_1445565.json
│       ├── summary_report_20251129_145053.json
│       ├── summary_report_20251129_145353.json
│       └── ...
└── docs/
    ├── DATA_DICTIONARY.md
    └── SOURCE_CATALOG.md

An intermediate-friendly web crawler for extracting public health resources from county websites.

What This Does

This crawler finds:

  • 📞 Phone numbers for health services (toll-free numbers, crisis lines, emergency contacts, etc.)
  • 📍 Addresses of clinics and health facilities
  • 🏥 Names of healthcare facilities
  • 👨‍🔧 Services offered by each county's health department
  • 🏷️ Automatically categorizes and tags each resource
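As an illustration of the phone-number step, here is a minimal sketch. The pattern and the extract_phones helper are assumptions made for this example, not the crawler's actual code; it only handles common US formats and normalizes matches to NNN-NNN-NNNN.

```python
import re

# Hypothetical pattern for US-style numbers such as (555) 123-4567 or 1-800-555-0199.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def extract_phones(text):
    """Return unique phone numbers found in text, in order of first appearance."""
    found = []
    for area, prefix, line in PHONE_RE.findall(text):
        number = f"{area}-{prefix}-{line}"
        if number not in found:  # drop repeats of the same number
            found.append(number)
    return found
```

A real page will also contain fax numbers, dates, and IDs that look like phone numbers, which is one reason a later cleaning pass is useful.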

Installation

# Install dependencies
pip install -r requirements.txt

Running the Batch Crawler

cd examples

python batch_crawler_example.py

Running the Cleaning Script

After running the batch crawler, run clean_and_save.py to generate a cleaned JSON file. The script automatically picks the latest JSON file from examples/output.

cd examples

python clean_and_save.py

When it finishes, the script has cleaned the data, moved low-confidence items to unverified_resources, and written the cleaned JSON to examples/cleaned_output.
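One way the "latest JSON" lookup could work is sketched below. It assumes the file names follow the batch_crawl_results_YYYYMMDD_HHMMSS.json pattern shown in the project layout, so sorting by name also sorts by time. The latest_json helper is hypothetical, and the real script may select the file differently (for example, by modification time).

```python
from pathlib import Path


def latest_json(output_dir):
    """Return the newest batch_crawl_results_*.json in output_dir, or None if empty."""
    # Timestamps in the file names are zero-padded, so name order equals time order.
    candidates = sorted(Path(output_dir).glob("batch_crawl_results_*.json"))
    return candidates[-1] if candidates else None
```

For example, latest_json("output") run from the examples folder would return the path of the most recent crawl result.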

Important Considerations

  • Always be respectful when crawling websites and respect robots.txt
  • Add delays between requests (currently the scraper uses time.sleep(2))
  • Some websites may block automated access
  • This is for educational purposes only
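The first two points can be sketched with the standard library's urllib.robotparser. This is an illustration only: the helper names and the "health-crawler" user agent string are assumptions, and the project's scraper may check robots.txt in another way.

```python
import time
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_SECONDS = 2  # same pause as the time.sleep(2) mentioned above


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent="health-crawler", delay=CRAWL_DELAY_SECONDS):
    """Yield only the URLs robots.txt permits, pausing between permitted ones."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers not to visit
        if not first:
            time.sleep(delay)  # be polite: wait before the next request
        first = False
        yield url
```

In a live crawl, the robots.txt text would be downloaded once per site (RobotFileParser can also do this itself via set_url and read) before any page is requested.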

State Data

We were given CSV files listing health department websites for each of the 50 states, for example:

  • California: 58 counties in data/websites/us-ca.csv
  • Oregon: 36 counties in data/websites/us-or.csv
  • Texas: 254 counties in data/websites/us-tx.csv
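A state file could be read along these lines. The column names "county" and "url" are assumptions for illustration; check data/README.md or the CSV header for the real ones and pass them in.

```python
import csv


def load_county_sites(csv_file, name_field="county", url_field="url"):
    """Read (county, url) pairs from an open CSV file, skipping rows with no URL."""
    sites = []
    for row in csv.DictReader(csv_file):
        url = (row.get(url_field) or "").strip()
        if url:  # some counties may have no website listed
            county = (row.get(name_field) or "").strip()
            sites.append((county, url))
    return sites
```

Typical usage would be to open data/websites/us-ca.csv with open(path, newline="", encoding="utf-8") and pass the file object to load_county_sites.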

Categories and Tags

Resources are automatically categorized:

  • CONTACT_INFO: Phone numbers, emails

  • LOCATION: Addresses, geographic areas

  • FACILITY: Clinic and hospital names

  • SERVICE: Health services offered

Resources are also tagged by common health topics:

  • Vaccination, flu, COVID-19, pediatric, dental, mental_health, vision, measles, etc.
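Topic tagging of this kind is often done with keyword matching. The sketch below shows the idea; the TAG_KEYWORDS table and tag_text function are hypothetical and cover only a few of the topics listed above, so the crawler's own keyword lists may differ.

```python
import re

# Hypothetical keyword table: tag name -> words that trigger it.
TAG_KEYWORDS = {
    "vaccination": ("vaccine", "vaccines", "vaccination", "immunization"),
    "flu": ("flu", "influenza"),
    "covid-19": ("covid",),
    "dental": ("dental", "dentist"),
    "mental_health": ("mental health", "behavioral health"),
}


def tag_text(text):
    """Return the sorted list of tags whose keywords appear as whole words in text."""
    lowered = text.lower()
    tags = []
    for tag, words in TAG_KEYWORDS.items():
        # \b keeps "flu" from matching inside words such as "fluoride".
        if any(re.search(r"\b" + re.escape(word) + r"\b", lowered) for word in words):
            tags.append(tag)
    return sorted(tags)
```

Whole-word matching trades some recall for fewer false positives, which matters when the results feed a public resource list.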

Outputs:

  • Raw JSON data saved to examples/output/
  • Human-readable county summary saved to examples/summary_reports/
  • Cleaned JSON saved to examples/cleaned_output/

Cleaning & Quality Assurance:

The repo includes examples/clean_and_save.py. This script normalizes values, moves low-confidence items to unverified_resources, and writes the cleaned JSON to the examples/cleaned_output folder.
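The low-confidence split might look roughly like this. The "confidence" field, the 0.5 threshold, and the "resources" key are assumptions for illustration; only the unverified_resources name comes from the project itself.

```python
def split_by_confidence(resources, threshold=0.5):
    """Separate resources into verified and unverified lists by confidence score."""
    verified, unverified = [], []
    for item in resources:
        # Items with no score are treated as low confidence.
        if item.get("confidence", 0.0) >= threshold:
            verified.append(item)
        else:
            unverified.append(item)
    return {"resources": verified, "unverified_resources": unverified}
```

Keeping the low-confidence items in a separate list, rather than deleting them, allows a manual review before anything is published.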

Future Work

  1. Service coverage: Analyze which health topics are well-covered and least-covered
  2. Website quality scoring: Rate sites by completeness of information
  3. Data visualization: Create charts of the findings to reveal patterns
  4. Database integration: Store results in a proper database
  5. State-wise report generation: Create an overall summary per state so that resources can be compared across states

Privacy / Ethics:

Before publishing, manually review unverified_resources, as it may contain false positives.