Skip to content

Jasonpereira0/Heart-Failure-Prediction

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 

Repository files navigation

Heart-Failure-Prediction

Team 4 Project

Data: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download

Team Members GitHub Account Responsibilities Reflection Video
Anis Bouwazra anisess10 Decision Tree https://drive.google.com/file/d/1ZlYW3CtXNp1zWHGlquYjldxXuCnAnwxh/view?usp=drive_link
Gina Zhang Sgjzhang KNN https://drive.google.com/file/d/13wLcS-P8cenohMlMKm0v68lfUZjD14x-/view?usp=drive_link (video also here: Heart-Failure-Prediction/models at main · Sgjzhang/Heart-Failure-Prediction
Khatereh Maleki Malekikh Logistic Regression https://drive.google.com/uc?id=1uXBsH8PtTSw6HSp2YnjzDcq-xzw966-a&export=download
Jason Pereira Jasonpereira0 Random Forest https://youtu.be/3yV3YC2iSsA
Shaumia Ranganathan shaumia22 XGBoost Project Overview video, Reflection video

Project Overview

This project aims to predict heart disease using a Data Science approach. It utilizes the “Heart Failure Prediction” dataset to analyze 11 key demographic, clinical, and exercise-related features that are significant predictors of heart disease. By combining visualization techniques and employing multiple classifier models after thorough preprocessing and hyperparameter tuning, to support early detection and intervention for individuals with cardiovascular disease or identifying those at high risk due to factors such as hypertension, diabetes, and hyperlipidemia, the best performing predictive model is selected post evaluation to ensure timely medical interventions, ultimately improving patient outcomes and reducing the global burden of CVDs.

The attributes in the dataset:

  1. Age: age of the patient [years]
  2. Sex: sex of the patient [M: Male, F: Female]
  3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  4. RestingBP: resting blood pressure [mm Hg]
  5. Cholesterol: serum cholesterol [mm/dl]
  6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
  8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  10. Oldpeak: oldpeak = ST [Numeric value measured in depression]
  11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  12. HeartDisease: output class [1: heart disease, 0: Normal]

Business Value & Impact

Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming approximately 17.9 million lives annually and accounting for 31% of global mortality. A significant proportion of these deaths, four out of five, are attributed to heart attacks and strokes, with a substantial number occurring prematurely in individuals under the age of 70. Heart failure is a common consequence of CVDs, underscoring the need for early detection and management which results in:

  • Improved patient outcomes by assisting in early detection and timely medical intervention resulting in personalized treatment and improved patient quality of life.
  • Cost optimization but reducing unnecessary tests and hospitalization and promoting preventive care potential lowering costs with late-stage treatments.
  • Enhanced data driven decision-making for healthcare providers and/or policy makers.
  • Deployment at scale where the model can be used across various cardiac care environments.

Using the Heart Failure Prediction dataset from Kaggle, our goal is to deploy classifier models, evaluate them, and choose the best performing model to identify predictors of heart disease.

Risks & Limitations - Mitigation Strategies

It is important to acknowledge potential risks, limitations and ways to mitigate them to ensure transparency and reliability of the model and data visualizations. The following summarizes the risks & limitations and mitigation plans considered for them:

  1. Data-Related Risks
  • Risk of Bias in Dataset Sources: The dataset combines data from multiple locations, each with different disease prevalence rates. This may affect model predictions. Mitigation: We validated prevalence (55%) to ensure a balanced mix and used class weighting to reduce bias.

  • Risk of Missing Demographic Information: No details on ethnicity, diet, smoking, or family history. Mitigation: Acknowledge these missing factors and suggest future work incorporating additional clinical data.

  • Risk of Potential Label Noise: Some diagnosis labels may be inconsistent due to multiple sources. Mitigation: Applied data cleaning techniques and validated predictions across models.

  • Risk of Missing or Incorrect Cholesterol Values: Some cholesterol values are recorded as zero, which is medically unlikely and suggests missing data. Impact on Model:

  • Zero values could affect feature distributions and impact model predictions.

  • Imputing arbitrary values (e.g., mean or median) could introduce bias without medical justification.

  • Mitigation: We retain these zero values as-is rather than replacing them with estimated numbers.

  1. Model-Related Risks:
  • Risk of Model Selection Impact on Results: The type of model used affects not only accuracy but also interpretability and fairness.
    • Mitigation: We tested multiple models (Random Forest, KNN, Linear Regression, Decision Trees, XGBoost) to compare performance, interpretability, and bias. Feature importance analysis (SHAP values) and fairness evaluation across subgroups (e.g., age, sex) to ensure model reliability.
  • Risk of Overfitting and underfitting: The model may learn dataset-specific patterns instead of generalizable trends.
    • Mitigation: Used cross-validation and hyperparameter tuning to reduce overfitting and underfitting.
  1. Deployment Risks:
  • Risk of False Diagnoses (False Positives & False Negatives): Incorrect predictions could lead to unnecessary anxiety or missed treatment.
    • Mitigation: Implemented multiple models to ensure the best parameters and results will be selected from the advanced data analysis perspective. However, it is recommended to perform human oversight and clinical verification before making decisions based on model outputs.
  1. Ethical Risks:
  • Risk of Data Privacy: In real-world applications, compliance with regulations and patient consent is essential.
    • Mitigation: Stated that this dataset is anonymized but advised caution for real-world deployments.

Methods and Technologies

We used the following libraries for data preprocessing, model training, and evaluation: ###Core Libraries

  • Pandas & Numpy: For data manipulation and analysis (e.g., loading the dataset, handling missing values).
  • scikit-learn: For machine learning tasks:
    • DecisionTreeClassifier: To build the Decision Tree model.
    • train_test_split: To split the dataset into training and testing sets.
    • GridSearchCV: For hyperparameter tuning.
    • classification_report, confusion_matrix, roc_auc_score, mean_squared_error: For evaluating model performance.

Stratification and Cross-Validation

  • StratifiedKFold: For stratified cross-validation.
  • cross_val_score: To evaluate the model using cross-validation.

Visualization Libraries:

  • Matplotlib for creating static, animated, and interactive visualizations (e.g., bar charts, ROC curves).
  • Seaborn: For creating attractive statistical graphics (e.g., heatmaps, pair plots).
  • RocCurveDisplay ###Data Quality Assessment:
  • Evaluated prevalence rates to understand the distribution of the dataset.
  • Identified missing values using the isnull().sum() function.
  • Computed summary statistics (mean, min, and max values) for each feature.

Data Visualization:

  • Created boxplots for key features such as Cholesterol and Resting BP to detect zero values and potential outliers.
  • Developed bar plots to better capture the overall distribution of features in the heart failure dataset.

Feature Engineering

  • The following categorical features were converted into numerical values using OneHotEncoder:
    • Sex
    • ChestPainType
    • RestingECG
    • ExerciseAngina
    • ST_Slope
  • After encoding, the data was split into training and testing sets using stratified sampling to preserve the target class distribution.

Model Training and Evaluation

  • Five models were trained and evaluated using cross-validation:
    • Decision Tree
    • k-Nearest Neighbors
    • Random Forest
    • Logistic Regression
    • XGBoost
  • Each model’s performance was compared using metrics such as:
    • Accuracy
    • Precision
    • Recall
    • F1 Score
    • ROC-AUC
    • Confusion Matrix
    • Log Loss
    • RMSE
  • The best-performing model was finally deployed on the test dataset to establish reliable predictors of cardiovascular disease.

Project Plan

  1. Load the Data
  2. Explore the data a. Analyze and identify bias
  3. Handle Missing Values (if needed) 4.Feature Engineering a. Encoding Categorical Values b. Creating new features - Define Age Groups, Categorize Cholesterol Levels c. Feature scaling/standardizing d. Visualize the numerical feature distribution e. Split the data - test and train datasets 5.Feature Selection a. Choose and fit a model
  4. Hyperparameter Tuning a. GridSearchCV (Cross Validation) b. Identify the optimal hyperparameter combination
  5. Evaluate the model
  6. Model Selection - Compare and select the best performing model a. Visualize and analyze evaluation scores to support model choice. b. Explain the model performance
  7. Save the model - pkl files
  8. Load, deploy and test the model a. Preset data or User entered data
Category Deliverable ETA Comments
Documentation Draft Project repo & readme Mar 12 Done
Documentation Define Preliminary Project Plan Mar 13 Done
Data Clean-up Load Data Mar 13 Done
Data Clean-up Explore the Data Mar 13 Done
Data Clean-up Handle Missing Values (if needed) Mar 14 Done
Feature Engineering Encoding Categorical Values Mar 17 Done
Feature Engineering Creating new features (Define Age Groups, Categorize Cholesterol Levels) Mar 17 Done
Feature Engineering Feature scaling/standardizing Mar 17 Done
Feature Engineering Visualize the numerical feature distribution Mar 17 Done
Feature Engineering Split the data - test and train datasets Mar 17 Done
Feature Selection Choose and fit a model Mar 17 Done
Hyperparameter Tuning GridSearchCV (Cross Validation) Mar 17 Done
Hyperparameter Tuning Identify the optimal hyperparameter combination Mar 17 Done
Model Analysis Evaluate the Model Mar 17 Done
Model Selection Visualize and analyze evaluation score to support model choice Mar 18 Done
Model Selection Explain the model performance Mar 19 Done
Model Selection Save the model - pkl files Mar 20 Done
Load, deploy and test the model Preset data or User entered data Mar 20 Done
Update Repo Create Pull requests and push changes to main repo Mar 21 Done
Finalize Documentation Upload recordings Mar 21 Done
Finalize Documentation Upload final ReadME Mar 21 Done

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in any data science project. It helps us understand the dataset, detect patterns, and identify potential issues before building a machine learning model. In this case, we are working with a dataset related to heart disease prediction.

  • Dataset Overview:

    • The dataset typically includes 303 patients with 14 features, though some studies use a subset of these features.
    • Features include age, sex, chest pain type, blood pressure, cholesterol levels, fasting blood sugar, resting ECG, maximum heart rate, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, thalassemia, and the target variable indicating heart disease presence.
  • Distribution of Features:

    • Age, resting blood pressure, and cholesterol levels tend to follow a normal distribution.
    • ST depression (oldpeak) is left-skewed, indicating that most patients have low ST depression values.
    • Maximum heart rate (thalach) is right-skewed, suggesting that many patients achieve high heart rates during exercise.
  • Correlation Analysis: image

  • Features like chest pain type, maximum heart rate, and ST segment slope show positive correlation with heart disease.

  • Age, ST depression, exercise-induced angina, number of major vessels, and thalassemia show negative correlation with heart disease.

  • Risk Factors:

    • Age is a significant risk factor, with higher incidence of heart disease in older individuals.
    • Males are more likely to have heart disease than females.
    • High blood pressure and diabetes are also associated with increased risk.
  • Visualization Insights:

    • Pair plots reveal linear separation between disease and non-disease groups for features like ST depression and maximum heart rate.
    • Resting ECG values of 0 and 1 are associated with higher heart disease risk compared to value 2.
  • Data Quality Issues:

    • Missing values are common and often imputed using mean values.
    • Duplicate rows and data entry errors need to be checked and corrected.

The dataset was reviewed for any missing values and none were found. image Visualized box plot to identify outliers, which were found in the RestingBP and Cholesterol data. Each of the features were analyzed to determine if there was anything unexpected and nothing odd was found. The numerical features were plotted by age and sex to evaluate any correlations. image

The patterns emphasize there is an age related increase in resting BP and Cholesterol. It is also clear that there is a decline in the max HR as age increases. Furthermore, males seem to exhibit more variation in the values such as Cholesterol and resting BP. Females show more variation across the exercise induced angina.

Further EDA we conducted:

  • Understanding the dataset: each row represents a patient and each column provides information about their health.
  • Checking for Missing values: we find that there are no missing values, which is great because we don’t have to worry about filling in any gaps in the data.
  • Statistical Summary of the Data: to have quick overview of numerical features, including the mean, standard deviation, min/max values, and quartiles.
  • Encoding Categorical Variables: KNN require numerical data, we need to convert categorical variables into numbers.
  • Model Performance: We built a K-Nearest Neighbors (KNN) classifier and tuned its hyperparameters.

Implications for Modeling

  • Feature Selection:
    • Selecting features with strong correlations to heart disease (e.g., chest pain type, maximum heart rate) can improve model performance.
  • Data Preprocessing:
    • Handling missing values and outliers is crucial for robust model training.
  • Model Interpretability:
    • Understanding the impact of each feature on predictions can aid in clinical decision-making.

Modelling

Random Forest

Accuracy: 0.79 Precision: 0.80 Recall: 0.83 F1-Score: 0.82 AUC-ROC: 0.88 RMS Error: 0.45 Classification Report: precision recall f1-score support

       0       0.78      0.74      0.76        82
       1       0.80      0.83      0.82       102

accuracy                           0.79       184

macro avg 0.79 0.79 0.79 184 weighted avg 0.79 0.79 0.79 184

Confusion Matrix: [[61 21] [17 85]]

image

image

image

  • Features related to stress tests (ST_Slope_Up, ST_Slope_Flat, and Oldpeak) and exercise-induced angina (ExerciseAngina_Y) dominate the model's predictions.
  • This suggests that clinical indicators observed during stress tests are more predictive than demographic factors like age or general measures like maximum heart rate.

XGBoost Classification Report: precision recall f1-score support

       0       0.87      0.83      0.85       103
       1       0.86      0.90      0.88       127

accuracy                           0.87       230

macro avg 0.87 0.86 0.86 230 weighted avg 0.87 0.87 0.86 230

image

image

image

Logistic Regression

precision recall f1-score support 0 0.83 0.87 0.85 82 1 0.89 0.85 0.87 102 accuracy 0.86 184 macro avg 0.86 0.86 0.86 184 weighted avg 0.86 0.86 0.86 184

image

image

image

Decision Tree Classification Report: precision recall f1-score support 0 0.80 0.87 0.83 77 1 0.90 0.84 0.87 107 accuracy 0.85 184 macro avg 0.85 0.86 0.85 184 weighted avg 0.86 0.85 0.85 184 AUC-ROC Score: 0.8556256827284863 RMSE: 0.3830654388414369

image

image

image

KNN Classification Report: precision recall f1-score support

       0       0.90      0.91      0.91        82
       1       0.93      0.92      0.93       102

accuracy                           0.92       184

macro avg 0.92 0.92 0.92 184 weighted avg 0.92 0.92 0.92 184

image

image

image

image

Data Analysis

Results

Model Testing Results:

Model Accuracy Precision Recall F1-Score AUC-ROC RMS Error
Random Forest 0.79 0.80 0.83 0.82 0.88 0.45
Logistic Regression 0.86 0.89 0.85 0.87 0.88 -
XG Boost 0.87 0.86 0.92 0.89 0.87 0.13
KNN 0.92 0.93 0.91 0.93 0.94 0.29
Decision Tree 0.85 0.90 0.84 0.87 0.86 0.38

Key Observations

  1. Best Performing Model:
  • KNN achieves the highest accuracy (92%), precision (93%), recall (91%), and AUC-ROC (94%), making it the best-performing model in this comparison.
  1. XGBoost Performance:
  • XGBoost shows strong performance with high recall (92%) and a good balance of precision (86%) and F1-score (89%).
  1. Logistic Regression:
  • Logistic Regression performs well with high precision (89%) and accuracy (86%), but its recall (85%) is slightly lower compared to KNN and XGBoost.
  1. Random Forest and Decision Tree:
  • Both models have lower performance metrics compared to KNN and XGBoost, with Random Forest having a slightly lower recall (83%) and Decision Tree achieving a balanced performance after optimization.

RMS Error Comparison

  • KNN has the lowest RMS Error (0.33), indicating fewer prediction errors.
  • XGBoost has a relatively low RMS Error (0.13), reflecting its strong predictive accuracy.
  • Decision Tree and Random Forest have higher RMS Errors (0.38 and 0.45, respectively), suggesting more prediction errors.

Based on the performance metrics provided, KNN is the best model for heart disease prediction in this scenario.

Example data:

dummy_data = pd.DataFrame({ 'Age': [54], 'Sex': label_encoders['Sex'].transform(['M']), # Encode categorical variables 'ChestPainType': label_encoders['ChestPainType'].transform(['ATA']), 'RestingBP': [130], 'Cholesterol': [223], 'FastingBS': [0], 'RestingECG': label_encoders['RestingECG'].transform(['Normal']), 'MaxHR': [138], 'ExerciseAngina': label_encoders['ExerciseAngina'].transform(['N']), 'Oldpeak': [0.6], 'ST_Slope': label_encoders['ST_Slope'].transform(['Flat']) })

Results: Probability of heart disease: 0.40 Explanation: The individual is a 54-year-old male with certain risk factors such as chest pain type (ATA), resting blood pressure (130), cholesterol level (223), and exercise-induced angina (no). The features provided may not point to an elevated risk of heart disease, which could be due to factors like normal resting ECG, low FastingBS, and normal MaxHR.

Key Reasons for Choosing KNN

  1. Highest Accuracy (89%):
  • KNN achieved the highest accuracy among all models, indicating that it correctly classified the most samples.
  1. Strong Recall (92%):
  • KNN had the highest recall for detecting heart disease cases, which is critical in healthcare applications where missing true positives can have severe consequences.
  1. Best AUC-ROC (0.94):
  • KNN demonstrated the highest AUC-ROC score, reflecting its strong ability to distinguish between heart disease and non-heart disease cases.
  1. Lowest RMS Error (0.29):
  • KNN had the lowest RMS Error, indicating fewer prediction errors and better overall predictive accuracy.
  1. Balanced Precision and Recall:
  • KNN maintained a high precision (89%) and recall (91%), showing a good balance between avoiding false positives and identifying true positives. Why KNN is Suitable for Heart Disease Prediction
  1. Interpretability:
  • While KNN is not as interpretable as linear models like Logistic Regression, its simplicity and ease of implementation make it a practical choice for many applications.
  1. Robustness to Noise:
  • KNN can handle noisy data to some extent, which is beneficial in real-world datasets where data quality may vary.
  1. Flexibility:
  • KNN can be easily tuned by adjusting the number of neighbors (n_neighbors) to improve performance on specific datasets. Considerations for Other Models XGBoost: Offers high recall but slightly lower accuracy and AUC-ROC compared to KNN. Logistic Regression: Provides balanced performance but is less accurate than KNN. Decision Tree: provides reliable performance but is outperformed by KNN and XGBoost.

Conclusion

KNN is the best choice for heart disease prediction due to its high accuracy, strong recall, and excellent AUC-ROC score. Its ability to minimize prediction errors while maintaining a good balance between precision and recall makes it particularly suitable for healthcare applications where accurate detection of heart disease is critical.

About

Team 4 Project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Jupyter Notebook 100.0%