This project analyzes county-level public health data to understand what drives premature mortality and how those signals can inform policy decisions.
Across models, five factors consistently showed the strongest association with higher mortality rates:
- Limited access to primary care and higher preventable hospital stays
- Socioeconomic disadvantage (income, employment, education proxies)
- Population-level chronic health risks
- Behavioral risk factors tied to long-term disease burden
- Structural and environmental conditions affecting health outcomes
The goal is not prediction for its own sake, but identifying which levers matter most.
Public health policy decisions are often made without clear visibility into which county-level factors most strongly influence mortality outcomes.
This project asks a focused question: given observable public health indicators, which factors are most predictive of premature death, and how stable are those relationships across models?
The intent is to support prioritization and policy planning, not to replace expert judgment.
Feature importance analysis indicated that premature mortality is associated most strongly with structural and access-related conditions, rather than with any single behavioral or clinical metric.
Healthcare access and socioeconomic factors consistently ranked highest in feature importance across models, while purely clinical indicators explained only part of the variation.
These results suggest that effective mortality reduction requires upstream interventions, not only medical treatment.
This analysis uses a County Health Rankings (CHR)–style analytic dataset provided as a CSV file.
In the notebook, the dataset is referenced as:
analytic_data2025_v2.csv
The CHR analytic codebook explains variable naming conventions (e.g., v###_rawvalue), along with numerator, denominator, and confidence interval fields. This documentation is essential for interpreting feature meaning and avoiding misuse of derived metrics.
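As a rough illustration of how the naming convention can be used, the sketch below groups column names by measure ID and picks out the raw-value columns. Only the `v###_rawvalue` form comes from the codebook description above; the other suffixes (`numerator`, `denominator`, `cilow`, `cihigh`), the identifier columns, and the helper names are assumptions made for the example and should be checked against the actual codebook.

```python
import re
from collections import defaultdict

# Assumed pattern: "v" + digits + "_" + field suffix, e.g. "v001_rawvalue".
CHR_COLUMN = re.compile(r"^(v\d+)_(\w+)$")


def group_chr_columns(columns):
    """Map each measure ID (e.g. 'v001') to the field suffixes present for it."""
    grouped = defaultdict(list)
    for name in columns:
        match = CHR_COLUMN.match(name)
        if match:  # identifier columns such as 'county' do not match and are skipped
            measure_id, field = match.groups()
            grouped[measure_id].append(field)
    return dict(grouped)


def raw_value_columns(columns):
    """Keep only the point-estimate columns, the usual candidates for model inputs."""
    return [name for name in columns if name.endswith("_rawvalue")]


# Hypothetical header row, for illustration only.
example_header = [
    "fipscode", "state", "county",
    "v001_rawvalue", "v001_numerator", "v001_denominator",
    "v001_cilow", "v001_cihigh",
    "v002_rawvalue",
]

grouped = group_chr_columns(example_header)
features = raw_value_columns(example_header)
```

Filtering to raw values before modeling is one way to avoid feeding numerators, denominators, and confidence bounds into a model alongside the metric they describe.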
The notebook follows a standard analytics workflow:
- Load the analytic dataset and select the target variable
  - Target: Premature Death (raw value)
- Clean and preprocess features
  - Remove non-informative identifiers
  - Handle missing values and normalize inputs as needed
- Split data into training and test sets
- Benchmark multiple models
  - Linear Regression (interpretable baseline)
  - KNN Regression (local similarity baseline)
  - Random Forest Regression (captures non-linear interactions)
- Evaluate using variance explained (R²) and error magnitude (MSE)
This structure allows results to be compared across modeling assumptions while keeping the analysis grounded in policy interpretation.
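For reference, the two evaluation metrics in the final step can be written out directly. This is a minimal plain-Python sketch of the standard definitions; the notebook presumably relies on library implementations, and the function names and toy numbers here are illustrative, not taken from it.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the variance in y_true explained by y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy numbers for illustration only, not results from the notebook.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

mse = mean_squared_error(observed, predicted)  # 0.25
r2 = r_squared(observed, predicted)            # 0.95, up to floating-point rounding
```

MSE is expressed in squared units of the target, so it is only comparable between models fitted to the same target, whereas R² is unitless and can be negative on held-out data when a model does worse than predicting the mean.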
Several regression models were benchmarked to test robustness of findings across modeling assumptions.
While ensemble methods captured non-linear relationships more effectively than linear baselines, the relative importance of the top contributing factors remained consistent.
Model accuracy metrics are discussed in the notebook, but the primary value of this analysis is explanatory rather than predictive.
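One simple way to quantify whether the top factors agree across models is the overlap between each model's top-k features. The sketch below is an assumption about how such a check could look, not the notebook's method; the model labels, feature names, and scores are placeholders rather than results.

```python
from itertools import combinations


def top_k(importances, k):
    """Return the k feature names with the largest absolute importance."""
    ranked = sorted(importances, key=lambda name: abs(importances[name]), reverse=True)
    return ranked[:k]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_top_k_overlap(model_importances, k):
    """Jaccard overlap of top-k feature sets for every pair of models."""
    tops = {model: top_k(scores, k) for model, scores in model_importances.items()}
    return {
        (m1, m2): jaccard(tops[m1], tops[m2])
        for m1, m2 in combinations(tops, 2)
    }


# Placeholder scores for illustration only; these are not results from the notebook.
scores = {
    "linear": {"feature_a": 0.9, "feature_b": -0.7, "feature_c": 0.1, "feature_d": 0.05},
    "forest": {"feature_a": 0.5, "feature_b": 0.15, "feature_c": 0.3, "feature_d": 0.05},
}

overlap = pairwise_top_k_overlap(scores, k=2)
```

Importance scores are not on a common scale across model families (coefficients for linear regression, impurity- or permutation-based scores for random forests, and no built-in measure at all for KNN), which is why the comparison uses rankings rather than raw values.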
This repository is designed to be exploratory and reproducible.
- Review the notebook to understand feature selection, modeling choices, and interpretation logic
- Run the notebook end-to-end to reproduce preprocessing, modeling, and evaluation
- Use feature importance outputs to reason about policy-relevant drivers of mortality
- Extend the analysis with interpretability tools or geographic stratification as needed
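As a starting point for the geographic stratification mentioned in the last item, the sketch below groups county-level records by a region key and averages a numeric field within each group, for example model residuals per state. The field names (`state`, `residual`), the region codes, and the values are hypothetical and do not come from the dataset.

```python
from collections import defaultdict
from statistics import mean


def stratify_by(records, key):
    """Group a list of row dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row)
    return dict(groups)


def group_means(records, key, value_field):
    """Mean of `value_field` within each group, skipping missing (None) values."""
    result = {}
    for group, rows in stratify_by(records, key).items():
        values = [row[value_field] for row in rows if row[value_field] is not None]
        if values:  # groups with no usable values are left out
            result[group] = mean(values)
    return result


# Hypothetical records for illustration only.
records = [
    {"state": "AA", "residual": 1.0},
    {"state": "AA", "residual": 3.0},
    {"state": "BB", "residual": -2.0},
    {"state": "BB", "residual": None},
]

mean_residual_by_state = group_means(records, key="state", value_field="residual")
```

A group whose mean residual sits far from zero would indicate a region where the model systematically over- or under-predicts, which is a reasonable place to look for drivers the features do not capture.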