Heart-Failure-Prediction

Team 4 Project

Data: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download

Team Members	GitHub Account	Responsibilities	Reflection Video
Anis Bouwazra	anisess10	Decision Tree	https://drive.google.com/file/d/1ZlYW3CtXNp1zWHGlquYjldxXuCnAnwxh/view?usp=drive_link
Gina Zhang	Sgjzhang	KNN	https://drive.google.com/file/d/13wLcS-P8cenohMlMKm0v68lfUZjD14x-/view?usp=drive_link (video also here: Heart-Failure-Prediction/models at main · Sgjzhang/Heart-Failure-Prediction
Khatereh Maleki	Malekikh	Logistic Regression	https://drive.google.com/uc?id=1uXBsH8PtTSw6HSp2YnjzDcq-xzw966-a&export=download
Jason Pereira	Jasonpereira0	Random Forest	https://youtu.be/3yV3YC2iSsA
Shaumia Ranganathan	shaumia22	XGBoost	Project Overview video, Reflection video

Project Overview

This project aims to predict heart disease using a Data Science approach. It utilizes the “Heart Failure Prediction” dataset to analyze 11 key demographic, clinical, and exercise-related features that are significant predictors of heart disease. By combining visualization techniques and employing multiple classifier models after thorough preprocessing and hyperparameter tuning, to support early detection and intervention for individuals with cardiovascular disease or identifying those at high risk due to factors such as hypertension, diabetes, and hyperlipidemia, the best performing predictive model is selected post evaluation to ensure timely medical interventions, ultimately improving patient outcomes and reducing the global burden of CVDs.

The attributes in the dataset:

Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mm/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [Numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]

Business Value & Impact

Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming approximately 17.9 million lives annually and accounting for 31% of global mortality. A significant proportion of these deaths, four out of five, are attributed to heart attacks and strokes, with a substantial number occurring prematurely in individuals under the age of 70. Heart failure is a common consequence of CVDs, underscoring the need for early detection and management which results in:

Improved patient outcomes by assisting in early detection and timely medical intervention resulting in personalized treatment and improved patient quality of life.
Cost optimization but reducing unnecessary tests and hospitalization and promoting preventive care potential lowering costs with late-stage treatments.
Enhanced data driven decision-making for healthcare providers and/or policy makers.
Deployment at scale where the model can be used across various cardiac care environments.

Using the Heart Failure Prediction dataset from Kaggle, our goal is to deploy classifier models, evaluate them, and choose the best performing model to identify predictors of heart disease.

Risks & Limitations - Mitigation Strategies

It is important to acknowledge potential risks, limitations and ways to mitigate them to ensure transparency and reliability of the model and data visualizations. The following summarizes the risks & limitations and mitigation plans considered for them:

Data-Related Risks

Risk of Bias in Dataset Sources: The dataset combines data from multiple locations, each with different disease prevalence rates. This may affect model predictions. Mitigation: We validated prevalence (55%) to ensure a balanced mix and used class weighting to reduce bias.
Risk of Missing Demographic Information: No details on ethnicity, diet, smoking, or family history. Mitigation: Acknowledge these missing factors and suggest future work incorporating additional clinical data.
Risk of Potential Label Noise: Some diagnosis labels may be inconsistent due to multiple sources. Mitigation: Applied data cleaning techniques and validated predictions across models.
Risk of Missing or Incorrect Cholesterol Values: Some cholesterol values are recorded as zero, which is medically unlikely and suggests missing data. Impact on Model:
Zero values could affect feature distributions and impact model predictions.
Imputing arbitrary values (e.g., mean or median) could introduce bias without medical justification.
Mitigation: We retain these zero values as-is rather than replacing them with estimated numbers.

Model-Related Risks:

Risk of Model Selection Impact on Results: The type of model used affects not only accuracy but also interpretability and fairness.
- Mitigation: We tested multiple models (Random Forest, KNN, Linear Regression, Decision Trees, XGBoost) to compare performance, interpretability, and bias. Feature importance analysis (SHAP values) and fairness evaluation across subgroups (e.g., age, sex) to ensure model reliability.
Risk of Overfitting and underfitting: The model may learn dataset-specific patterns instead of generalizable trends.
- Mitigation: Used cross-validation and hyperparameter tuning to reduce overfitting and underfitting.

Deployment Risks:

Risk of False Diagnoses (False Positives & False Negatives): Incorrect predictions could lead to unnecessary anxiety or missed treatment.
- Mitigation: Implemented multiple models to ensure the best parameters and results will be selected from the advanced data analysis perspective. However, it is recommended to perform human oversight and clinical verification before making decisions based on model outputs.

Ethical Risks:

Risk of Data Privacy: In real-world applications, compliance with regulations and patient consent is essential.
- Mitigation: Stated that this dataset is anonymized but advised caution for real-world deployments.

Methods and Technologies

We used the following libraries for data preprocessing, model training, and evaluation: ###Core Libraries

Pandas & Numpy: For data manipulation and analysis (e.g., loading the dataset, handling missing values).
scikit-learn: For machine learning tasks:
- DecisionTreeClassifier: To build the Decision Tree model.
- train_test_split: To split the dataset into training and testing sets.
- GridSearchCV: For hyperparameter tuning.
- classification_report, confusion_matrix, roc_auc_score, mean_squared_error: For evaluating model performance.

Stratification and Cross-Validation

StratifiedKFold: For stratified cross-validation.
cross_val_score: To evaluate the model using cross-validation.

Visualization Libraries:

Matplotlib for creating static, animated, and interactive visualizations (e.g., bar charts, ROC curves).
Seaborn: For creating attractive statistical graphics (e.g., heatmaps, pair plots).
RocCurveDisplay ###Data Quality Assessment:
Evaluated prevalence rates to understand the distribution of the dataset.
Identified missing values using the isnull().sum() function.
Computed summary statistics (mean, min, and max values) for each feature.

Data Visualization:

Created boxplots for key features such as Cholesterol and Resting BP to detect zero values and potential outliers.
Developed bar plots to better capture the overall distribution of features in the heart failure dataset.

Feature Engineering

The following categorical features were converted into numerical values using OneHotEncoder:
- Sex
- ChestPainType
- RestingECG
- ExerciseAngina
- ST_Slope
After encoding, the data was split into training and testing sets using stratified sampling to preserve the target class distribution.

Model Training and Evaluation

Five models were trained and evaluated using cross-validation:
- Decision Tree
- k-Nearest Neighbors
- Random Forest
- Logistic Regression
- XGBoost
Each model’s performance was compared using metrics such as:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Confusion Matrix
- Log Loss
- RMSE
The best-performing model was finally deployed on the test dataset to establish reliable predictors of cardiovascular disease.

Project Plan

Load the Data
Explore the data a. Analyze and identify bias
Handle Missing Values (if needed) 4.Feature Engineering a. Encoding Categorical Values b. Creating new features - Define Age Groups, Categorize Cholesterol Levels c. Feature scaling/standardizing d. Visualize the numerical feature distribution e. Split the data - test and train datasets 5.Feature Selection a. Choose and fit a model
Hyperparameter Tuning a. GridSearchCV (Cross Validation) b. Identify the optimal hyperparameter combination
Evaluate the model
Model Selection - Compare and select the best performing model a. Visualize and analyze evaluation scores to support model choice. b. Explain the model performance
Save the model - pkl files
Load, deploy and test the model a. Preset data or User entered data

Category	Deliverable	ETA	Comments
Documentation	Draft Project repo & readme	Mar 12	Done
Documentation	Define Preliminary Project Plan	Mar 13	Done
Data Clean-up	Load Data	Mar 13	Done
Data Clean-up	Explore the Data	Mar 13	Done
Data Clean-up	Handle Missing Values (if needed)	Mar 14	Done
Feature Engineering	Encoding Categorical Values	Mar 17	Done
Feature Engineering	Creating new features (Define Age Groups, Categorize Cholesterol Levels)	Mar 17	Done
Feature Engineering	Feature scaling/standardizing	Mar 17	Done
Feature Engineering	Visualize the numerical feature distribution	Mar 17	Done
Feature Engineering	Split the data - test and train datasets	Mar 17	Done
Feature Selection	Choose and fit a model	Mar 17	Done
Hyperparameter Tuning	GridSearchCV (Cross Validation)	Mar 17	Done
Hyperparameter Tuning	Identify the optimal hyperparameter combination	Mar 17	Done
Model Analysis	Evaluate the Model	Mar 17	Done
Model Selection	Visualize and analyze evaluation score to support model choice	Mar 18	Done
Model Selection	Explain the model performance	Mar 19	Done
Model Selection	Save the model - pkl files	Mar 20	Done
Load, deploy and test the model	Preset data or User entered data	Mar 20	Done
Update Repo	Create Pull requests and push changes to main repo	Mar 21	Done
Finalize Documentation	Upload recordings	Mar 21	Done
Finalize Documentation	Upload final ReadME	Mar 21	Done

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in any data science project. It helps us understand the dataset, detect patterns, and identify potential issues before building a machine learning model. In this case, we are working with a dataset related to heart disease prediction.

Dataset Overview:
- The dataset typically includes 303 patients with 14 features, though some studies use a subset of these features.
- Features include age, sex, chest pain type, blood pressure, cholesterol levels, fasting blood sugar, resting ECG, maximum heart rate, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, thalassemia, and the target variable indicating heart disease presence.
Distribution of Features:
- Age, resting blood pressure, and cholesterol levels tend to follow a normal distribution.
- ST depression (oldpeak) is left-skewed, indicating that most patients have low ST depression values.
- Maximum heart rate (thalach) is right-skewed, suggesting that many patients achieve high heart rates during exercise.
Correlation Analysis:
Features like chest pain type, maximum heart rate, and ST segment slope show positive correlation with heart disease.
Age, ST depression, exercise-induced angina, number of major vessels, and thalassemia show negative correlation with heart disease.
Risk Factors:
- Age is a significant risk factor, with higher incidence of heart disease in older individuals.
- Males are more likely to have heart disease than females.
- High blood pressure and diabetes are also associated with increased risk.
Visualization Insights:
- Pair plots reveal linear separation between disease and non-disease groups for features like ST depression and maximum heart rate.
- Resting ECG values of 0 and 1 are associated with higher heart disease risk compared to value 2.
Data Quality Issues:
- Missing values are common and often imputed using mean values.
- Duplicate rows and data entry errors need to be checked and corrected.

The dataset was reviewed for any missing values and none were found. Visualized box plot to identify outliers, which were found in the RestingBP and Cholesterol data. Each of the features were analyzed to determine if there was anything unexpected and nothing odd was found. The numerical features were plotted by age and sex to evaluate any correlations.

The patterns emphasize there is an age related increase in resting BP and Cholesterol. It is also clear that there is a decline in the max HR as age increases. Furthermore, males seem to exhibit more variation in the values such as Cholesterol and resting BP. Females show more variation across the exercise induced angina.

Further EDA we conducted:

Understanding the dataset: each row represents a patient and each column provides information about their health.
Checking for Missing values: we find that there are no missing values, which is great because we don’t have to worry about filling in any gaps in the data.
Statistical Summary of the Data: to have quick overview of numerical features, including the mean, standard deviation, min/max values, and quartiles.
Encoding Categorical Variables: KNN require numerical data, we need to convert categorical variables into numbers.
Model Performance: We built a K-Nearest Neighbors (KNN) classifier and tuned its hyperparameters.

Implications for Modeling

Feature Selection:
- Selecting features with strong correlations to heart disease (e.g., chest pain type, maximum heart rate) can improve model performance.
Data Preprocessing:
- Handling missing values and outliers is crucial for robust model training.
Model Interpretability:
- Understanding the impact of each feature on predictions can aid in clinical decision-making.

Modelling

Random Forest

Accuracy: 0.79 Precision: 0.80 Recall: 0.83 F1-Score: 0.82 AUC-ROC: 0.88 RMS Error: 0.45 Classification Report: precision recall f1-score support

       0       0.78      0.74      0.76        82
       1       0.80      0.83      0.82       102

accuracy                           0.79       184

macro avg 0.79 0.79 0.79 184 weighted avg 0.79 0.79 0.79 184

Confusion Matrix: [[61 21] [17 85]]

Features related to stress tests (ST_Slope_Up, ST_Slope_Flat, and Oldpeak) and exercise-induced angina (ExerciseAngina_Y) dominate the model's predictions.
This suggests that clinical indicators observed during stress tests are more predictive than demographic factors like age or general measures like maximum heart rate.

XGBoost Classification Report: precision recall f1-score support

       0       0.87      0.83      0.85       103
       1       0.86      0.90      0.88       127

accuracy                           0.87       230

macro avg 0.87 0.86 0.86 230 weighted avg 0.87 0.87 0.86 230

Logistic Regression

precision recall f1-score support 0 0.83 0.87 0.85 82 1 0.89 0.85 0.87 102 accuracy 0.86 184 macro avg 0.86 0.86 0.86 184 weighted avg 0.86 0.86 0.86 184

Decision Tree Classification Report: precision recall f1-score support 0 0.80 0.87 0.83 77 1 0.90 0.84 0.87 107 accuracy 0.85 184 macro avg 0.85 0.86 0.85 184 weighted avg 0.86 0.85 0.85 184 AUC-ROC Score: 0.8556256827284863 RMSE: 0.3830654388414369

KNN Classification Report: precision recall f1-score support

       0       0.90      0.91      0.91        82
       1       0.93      0.92      0.93       102

accuracy                           0.92       184

macro avg 0.92 0.92 0.92 184 weighted avg 0.92 0.92 0.92 184

Data Analysis

Results

Model Testing Results:

Model	Accuracy	Precision	Recall	F1-Score	AUC-ROC	RMS Error
Random Forest	0.79	0.80	0.83	0.82	0.88	0.45
Logistic Regression	0.86	0.89	0.85	0.87	0.88	-
XG Boost	0.87	0.86	0.92	0.89	0.87	0.13
KNN	0.92	0.93	0.91	0.93	0.94	0.29
Decision Tree	0.85	0.90	0.84	0.87	0.86	0.38

Key Observations

Best Performing Model:

KNN achieves the highest accuracy (92%), precision (93%), recall (91%), and AUC-ROC (94%), making it the best-performing model in this comparison.

XGBoost Performance:

XGBoost shows strong performance with high recall (92%) and a good balance of precision (86%) and F1-score (89%).

Logistic Regression:

Logistic Regression performs well with high precision (89%) and accuracy (86%), but its recall (85%) is slightly lower compared to KNN and XGBoost.

Random Forest and Decision Tree:

Both models have lower performance metrics compared to KNN and XGBoost, with Random Forest having a slightly lower recall (83%) and Decision Tree achieving a balanced performance after optimization.

RMS Error Comparison

KNN has the lowest RMS Error (0.33), indicating fewer prediction errors.
XGBoost has a relatively low RMS Error (0.13), reflecting its strong predictive accuracy.
Decision Tree and Random Forest have higher RMS Errors (0.38 and 0.45, respectively), suggesting more prediction errors.

Based on the performance metrics provided, KNN is the best model for heart disease prediction in this scenario.

Example data：

dummy_data = pd.DataFrame({ 'Age': [54], 'Sex': label_encoders['Sex'].transform(['M']), # Encode categorical variables 'ChestPainType': label_encoders['ChestPainType'].transform(['ATA']), 'RestingBP': [130], 'Cholesterol': [223], 'FastingBS': [0], 'RestingECG': label_encoders['RestingECG'].transform(['Normal']), 'MaxHR': [138], 'ExerciseAngina': label_encoders['ExerciseAngina'].transform(['N']), 'Oldpeak': [0.6], 'ST_Slope': label_encoders['ST_Slope'].transform(['Flat']) })

Results： Probability of heart disease: 0.40 Explanation： The individual is a 54-year-old male with certain risk factors such as chest pain type (ATA), resting blood pressure (130), cholesterol level (223), and exercise-induced angina (no). The features provided may not point to an elevated risk of heart disease, which could be due to factors like normal resting ECG, low FastingBS, and normal MaxHR.

Key Reasons for Choosing KNN

Highest Accuracy (89%):

KNN achieved the highest accuracy among all models, indicating that it correctly classified the most samples.

Strong Recall (92%):

KNN had the highest recall for detecting heart disease cases, which is critical in healthcare applications where missing true positives can have severe consequences.

Best AUC-ROC (0.94):

KNN demonstrated the highest AUC-ROC score, reflecting its strong ability to distinguish between heart disease and non-heart disease cases.

Lowest RMS Error (0.29):

KNN had the lowest RMS Error, indicating fewer prediction errors and better overall predictive accuracy.

Balanced Precision and Recall:

KNN maintained a high precision (89%) and recall (91%), showing a good balance between avoiding false positives and identifying true positives. Why KNN is Suitable for Heart Disease Prediction

Interpretability:

While KNN is not as interpretable as linear models like Logistic Regression, its simplicity and ease of implementation make it a practical choice for many applications.

Robustness to Noise:

KNN can handle noisy data to some extent, which is beneficial in real-world datasets where data quality may vary.

Flexibility:

KNN can be easily tuned by adjusting the number of neighbors (n_neighbors) to improve performance on specific datasets. Considerations for Other Models XGBoost: Offers high recall but slightly lower accuracy and AUC-ROC compared to KNN. Logistic Regression: Provides balanced performance but is less accurate than KNN. Decision Tree: provides reliable performance but is outperformed by KNN and XGBoost.

Conclusion

KNN is the best choice for heart disease prediction due to its high accuracy, strong recall, and excellent AUC-ROC score. Its ability to minimize prediction errors while maintaining a good balance between precision and recall makes it particularly suitable for healthcare applications where accurate detection of heart disease is critical.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
data/raw		data/raw
models		models
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Heart-Failure-Prediction

Project Overview

Business Value & Impact

Risks & Limitations - Mitigation Strategies

Methods and Technologies

Stratification and Cross-Validation

Visualization Libraries:

Data Visualization:

Feature Engineering

Model Training and Evaluation

Project Plan

Exploratory Data Analysis

Implications for Modeling

Modelling

Data Analysis

Results

Model Testing Results:

Key Observations

RMS Error Comparison

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Heart-Failure-Prediction

Project Overview

Business Value & Impact

Risks & Limitations - Mitigation Strategies

Methods and Technologies

Stratification and Cross-Validation

Visualization Libraries:

Data Visualization:

Feature Engineering

Model Training and Evaluation

Project Plan

Exploratory Data Analysis

Implications for Modeling

Modelling

Data Analysis

Results

Model Testing Results:

Key Observations

RMS Error Comparison

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages