health-crawler/
├── README.md
├── requirements.txt
├── data/
│   ├── README.md
│   └── websites/
│       ├── us-ca.csv
│       ├── us-or.csv
│       ├── us-tx.csv
│       └── ...
├── examples/
│   ├── simple_example.py
│   ├── categorized_example.py
│   ├── batch_crawler_example.py
│   ├── clean_and_save.py
│   ├── cleaned_output/
│   │   ├── batch_crawl_results_20251129_145353.cleaned
│   │   └── ...
│   ├── output/
│   │   ├── batch_crawl_results_20251129_143325.json
│   │   ├── batch_crawl_results_20251129_144556.json
│   │   ├── batch_crawl_results_20251129_145053.json
│   │   ├── batch_crawl_results_20251129_145353.json
│   │   └── ...
│   └── summary_reports/
│       ├── summary_report_20251129_143325.txt
│       ├── summary_report_20251129_1445565.json
│       ├── summary_report_20251129_145053.json
│       ├── summary_report_20251129_145353.json
│       └── ...
└── docs/
    ├── DATA_DICTIONARY.md
    └── SOURCE_CATALOG.md

An intermediate-friendly web crawler that extracts public health resources from county websites.
This crawler finds:
- 📞 Phone numbers for health services (toll-free numbers, crisis lines, emergency contacts, etc.)
- 📍 Addresses of clinics and health facilities
- 🏥 Names of healthcare facilities
- 👨‍🔧 Services offered by each county department
- 🏷️ Categories and tags, assigned automatically to each resource
# Install dependencies
pip install -r requirements.txt

cd examples
python .\batch_crawler_example.py

After running the batch crawler, run clean_and_save.py to generate a cleaned JSON file (it automatically picks the latest JSON from examples/output/):

cd examples
python .\clean_and_save.py

When it finishes, the script has moved low-confidence items to unverified_resources and written the cleaned JSON to examples/cleaned_output/.

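As an illustration of how "picks the latest JSON" could work, here is a minimal sketch. It assumes the newest file is chosen by the timestamp embedded in the filename; the actual script may use a different rule (for example, file modification time), and the function name `find_latest_results` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_latest_results(
    output_dir: str, pattern: str = "batch_crawl_results_*.json"
) -> Optional[Path]:
    """Return the newest results file in output_dir, or None if there is none."""
    candidates = sorted(Path(output_dir).glob(pattern))
    # Timestamps of the form YYYYMMDD_HHMMSS sort chronologically as plain
    # strings, so the last name in sorted order is the most recent run.
    return candidates[-1] if candidates else None
```

Sorting by filename keeps the choice deterministic even if files are copied or touched after the crawl, which would change their modification times.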
- Always crawl websites respectfully and obey robots.txt
- Add delays between requests (the scraper currently uses time.sleep(2))
- Some websites may block automated access
- This is for educational purposes only
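The first two notes can be sketched with the standard library alone. This is not the crawler's actual code: the helper names, the `health-crawler` user agent string, and the injected `fetch` callable are assumptions made for illustration. Only the 2-second delay comes from the note above.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, parser, user_agent="health-crawler"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results[url] = fetch(url)
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of network access, so the filtering and pacing logic can be checked offline.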
We were given CSV files with health department websites for each of the 50 states, including:
- California: 58 counties in data/websites/us-ca.csv
- Oregon: 36 counties in data/websites/us-or.csv
- Texas: 254 counties in data/websites/us-tx.csv
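A minimal sketch of loading one of these files is shown below. The column layout is not documented in this section, so the `url` column name is an assumption, and both function names are hypothetical; check the header row of the real CSV before relying on it.

```python
import csv
from pathlib import Path


def read_county_rows(csv_path):
    """Read one state CSV into a list of dicts keyed by the header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def rows_with_url(rows, url_column="url"):
    """Keep rows whose URL cell is non-empty (url_column is an assumed name)."""
    return [row for row in rows if (row.get(url_column) or "").strip()]
```

For example, `rows_with_url(read_county_rows("data/websites/us-ca.csv"))` would return only the counties that have a website to crawl, provided the file really has a `url` column.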
Resources are automatically categorized:
- CONTACT_INFO: Phone numbers, emails
- LOCATION: Addresses, geographic areas
- FACILITY: Clinic and hospital names
- SERVICE: Health services offered

And tagged by common health topics:
- Vaccination, flu, COVID-19, pediatric, dental, mental_health, vision, measles, etc.
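To make the two steps concrete, here is a rough rule-based sketch. It assumes simple regex and keyword matching; the repo's real categorizer may work quite differently. The patterns, the keyword lists, and the names `categorize` and `tag_topics` are all hypothetical, and only the four category names and the topic tags come from the lists above.

```python
import re

# Assumed patterns: US-style phone numbers and simple street addresses.
PHONE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}\b")
ADDRESS = re.compile(
    r"\b\d{1,5}\s+\w+(\s\w+)*\s+(street|st|avenue|ave|road|rd|blvd|drive|dr)\b",
    re.IGNORECASE,
)
FACILITY_WORDS = ("clinic", "hospital", "health center")

# Assumed keyword lists; a real crawler would need a much larger vocabulary.
TOPIC_KEYWORDS = {
    "vaccination": ("vaccine", "vaccination", "immunization"),
    "flu": ("flu", "influenza"),
    "dental": ("dental", "dentist"),
    "mental_health": ("mental health", "counseling"),
}


def categorize(text):
    """Assign one category, checking the most specific patterns first."""
    if PHONE.search(text) or "@" in text:
        return "CONTACT_INFO"
    if ADDRESS.search(text):
        return "LOCATION"
    if any(word in text.lower() for word in FACILITY_WORDS):
        return "FACILITY"
    return "SERVICE"


def tag_topics(text):
    """Return the sorted topic tags whose keywords appear as whole words."""
    lowered = text.lower()
    return sorted(
        tag
        for tag, keywords in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", lowered) for kw in keywords)
    )
```

Rules like these are easy to audit but produce false positives, which is one reason a manual review step for low-confidence items is useful.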
- JSON (raw data) is saved to examples/output/
- Human-readable county summaries are saved to examples/summary_reports/
- Cleaned JSON is saved to examples/cleaned_output/
The repo includes clean_and_save.py (in examples/). This script moves low-confidence items to unverified_resources, normalizes values, and then writes the cleaned JSON to the examples/cleaned_output folder.
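The cleaning step can be sketched as follows. This is an illustration under stated assumptions, not the contents of clean_and_save.py: the `confidence` field, the 0.5 threshold, the `resources` key, and the phone format are all hypothetical. Only the `unverified_resources` name comes from the description above.

```python
def split_by_confidence(items, threshold=0.5):
    """Separate items into verified and unverified lists by confidence score."""
    verified, unverified = [], []
    for item in items:
        # Items with no score are treated as low confidence (an assumption).
        score = item.get("confidence", 0.0)
        (verified if score >= threshold else unverified).append(item)
    return {"resources": verified, "unverified_resources": unverified}


def normalize_phone(raw):
    """Format 10-digit numbers as NNN-NNN-NNNN; leave anything else trimmed."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return raw.strip()
```

Keeping low-confidence items in a separate list rather than deleting them means nothing is lost before a human has looked at it.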
- Service coverage: Analyze which health topics are well covered and which are poorly covered
- Website quality scoring: Rate sites by completeness of information
- Data visualization: Create charts of your findings to discover hidden patterns
- Database integration: Store results in a proper database
- State-wise report generation: Create an overall summary per state so resources can be compared across states
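As a starting point for the service coverage idea, the sketch below counts how many resources carry each topic tag. It assumes each resource is a dict with a `tags` list, which this section does not confirm; the function names are hypothetical.

```python
from collections import Counter


def topic_coverage(resources):
    """Count how many resources mention each topic tag."""
    counts = Counter()
    for resource in resources:
        # set() ensures a tag repeated on one resource is counted once.
        counts.update(set(resource.get("tags", [])))
    return counts


def least_covered(counts, all_topics, n=3):
    """Return the n topics with the fewest resources, including zero counts."""
    return sorted(all_topics, key=lambda topic: (counts.get(topic, 0), topic))[:n]
```

Passing the full topic list to `least_covered` matters because a topic that no site mentions never appears in the counter on its own.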
Before publishing, manually review unverified_resources, as it may contain false positives.