Team 4 Project
Data: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download
| Team Members | GitHub Account | Responsibilities | Reflection Video |
|---|---|---|---|
| Anis Bouwazra | anisess10 | Decision Tree | https://drive.google.com/file/d/1ZlYW3CtXNp1zWHGlquYjldxXuCnAnwxh/view?usp=drive_link |
| Gina Zhang | Sgjzhang | KNN | https://drive.google.com/file/d/13wLcS-P8cenohMlMKm0v68lfUZjD14x-/view?usp=drive_link (video also here: Heart-Failure-Prediction/models at main · Sgjzhang/Heart-Failure-Prediction) |
| Khatereh Maleki | Malekikh | Logistic Regression | https://drive.google.com/uc?id=1uXBsH8PtTSw6HSp2YnjzDcq-xzw966-a&export=download |
| Jason Pereira | Jasonpereira0 | Random Forest | https://youtu.be/3yV3YC2iSsA |
| Shaumia Ranganathan | shaumia22 | XGBoost | Project Overview video, Reflection video |
This project aims to predict heart disease using a data science approach. It uses the “Heart Failure Prediction” dataset to analyze 11 key demographic, clinical, and exercise-related features that are significant predictors of heart disease. After thorough preprocessing and hyperparameter tuning, we combine visualization techniques with multiple classifier models and select the best-performing model after evaluation. The goal is to support early detection and timely medical intervention for individuals with cardiovascular disease, or at high risk of it due to factors such as hypertension, diabetes, and hyperlipidemia, ultimately improving patient outcomes and reducing the global burden of CVDs.
The attributes in the dataset:
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: oldpeak = ST [Numeric value measured in depression]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]
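The attribute list above can double as a lightweight schema check before any modeling. The sketch below is only an illustration: the helper name `check_schema` and the single sample row (including its `HeartDisease` value) are assumptions made for this example, not code or data from the project.

```python
import pandas as pd

# Column names as given in the attribute list above.
EXPECTED_COLUMNS = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease",
]

# Allowed codes for the coded columns, also taken from the list above.
ALLOWED_VALUES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "FastingBS": {0, 1},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
    "HeartDisease": {0, 1},
}


def check_schema(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems (empty list = looks valid)."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col, allowed in ALLOWED_VALUES.items():
        if col in df.columns:
            unexpected = set(df[col].unique()) - allowed
            if unexpected:
                problems.append(f"{col}: unexpected values {sorted(unexpected, key=str)}")
    return problems


# One illustrative row; the target value here is made up for the example.
sample = pd.DataFrame([{
    "Age": 54, "Sex": "M", "ChestPainType": "ATA", "RestingBP": 130,
    "Cholesterol": 223, "FastingBS": 0, "RestingECG": "Normal", "MaxHR": 138,
    "ExerciseAngina": "N", "Oldpeak": 0.6, "ST_Slope": "Flat", "HeartDisease": 0,
}])

print(check_schema(sample))
```

Running the same check on the full CSV right after loading would surface unexpected category codes early, before encoding.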
Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming approximately 17.9 million lives annually and accounting for 31% of global mortality. A significant proportion of these deaths, four out of five, are attributed to heart attacks and strokes, with a substantial number occurring prematurely in individuals under the age of 70. Heart failure is a common consequence of CVDs, underscoring the need for early detection and management which results in:
- Improved patient outcomes by assisting in early detection and timely medical intervention resulting in personalized treatment and improved patient quality of life.
- Cost optimization by reducing unnecessary tests and hospitalization and by promoting preventive care, potentially lowering the costs associated with late-stage treatments.
- Enhanced data driven decision-making for healthcare providers and/or policy makers.
- Deployment at scale where the model can be used across various cardiac care environments.
Using the Heart Failure Prediction dataset from Kaggle, our goal is to deploy classifier models, evaluate them, and choose the best performing model to identify predictors of heart disease.
It is important to acknowledge potential risks, limitations and ways to mitigate them to ensure transparency and reliability of the model and data visualizations. The following summarizes the risks & limitations and mitigation plans considered for them:
- Data-Related Risks:
  - Risk of Bias in Dataset Sources: The dataset combines data from multiple locations, each with different disease prevalence rates. This may affect model predictions.
    - Mitigation: We validated prevalence (55%) to ensure a balanced mix and used class weighting to reduce bias.
  - Risk of Missing Demographic Information: No details on ethnicity, diet, smoking, or family history.
    - Mitigation: Acknowledge these missing factors and suggest future work incorporating additional clinical data.
  - Risk of Potential Label Noise: Some diagnosis labels may be inconsistent due to multiple sources.
    - Mitigation: Applied data cleaning techniques and validated predictions across models.
  - Risk of Missing or Incorrect Cholesterol Values: Some cholesterol values are recorded as zero, which is medically unlikely and suggests missing data.
    - Impact on Model:
      - Zero values could affect feature distributions and impact model predictions.
      - Imputing arbitrary values (e.g., mean or median) could introduce bias without medical justification.
    - Mitigation: We retain these zero values as-is rather than replacing them with estimated numbers.
- Model-Related Risks:
  - Risk of Model Selection Impact on Results: The type of model used affects not only accuracy but also interpretability and fairness.
    - Mitigation: We tested multiple models (Random Forest, KNN, Logistic Regression, Decision Trees, XGBoost) to compare performance, interpretability, and bias. We also used feature importance analysis (SHAP values) and fairness evaluation across subgroups (e.g., age, sex) to ensure model reliability.
  - Risk of Overfitting and Underfitting: The model may learn dataset-specific patterns instead of generalizable trends.
    - Mitigation: Used cross-validation and hyperparameter tuning to reduce overfitting and underfitting.
- Deployment Risks:
  - Risk of False Diagnoses (False Positives & False Negatives): Incorrect predictions could lead to unnecessary anxiety or missed treatment.
    - Mitigation: Implemented multiple models so that the best parameters and results could be selected from a data analysis perspective. However, human oversight and clinical verification are recommended before making decisions based on model outputs.
- Ethical Risks:
  - Risk of Data Privacy: In real-world applications, compliance with regulations and patient consent is essential.
    - Mitigation: Stated that this dataset is anonymized but advised caution for real-world deployments.
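Several of the data-related mitigations above (validating prevalence, class weighting, and keeping zero cholesterol values) can be checked in a few lines. This is a minimal sketch on a made-up toy frame, not the project's code: the values are invented, and the `CholesterolIsZero` flag column is an assumption added for illustration (the project itself retains zeros as-is without adding a flag).

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy data with made-up values (not rows from the real dataset).
toy = pd.DataFrame({
    "Cholesterol": [223, 0, 180, 0, 289, 214, 195, 0, 237, 260],
    "HeartDisease": [0, 1, 0, 1, 1, 0, 1, 1, 0, 1],
})

# Prevalence: share of rows labeled as heart disease.
prevalence = toy["HeartDisease"].mean()

# "Balanced" class weights: n_samples / (n_classes * class_count).
classes = np.array([0, 1])
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=toy["HeartDisease"].to_numpy()
)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Count zero cholesterol readings and keep them unchanged; the flag column is
# a hypothetical addition so a model could tell "recorded as zero" apart.
zero_chol = int((toy["Cholesterol"] == 0).sum())
toy["CholesterolIsZero"] = (toy["Cholesterol"] == 0).astype(int)

print(f"prevalence={prevalence:.2f}, class_weight={class_weight}, zero_chol={zero_chol}")
```

The resulting `class_weight` dictionary can be passed to scikit-learn estimators that accept a `class_weight` parameter, such as `DecisionTreeClassifier` or `LogisticRegression`.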
We used the following libraries for data preprocessing, model training, and evaluation:

### Core Libraries
- Pandas & NumPy: For data manipulation and analysis (e.g., loading the dataset, handling missing values).
- scikit-learn: For machine learning tasks:
  - DecisionTreeClassifier: To build the Decision Tree model.
  - train_test_split: To split the dataset into training and testing sets.
  - GridSearchCV: For hyperparameter tuning.
  - classification_report, confusion_matrix, roc_auc_score, mean_squared_error: For evaluating model performance.
  - StratifiedKFold: For stratified cross-validation.
  - cross_val_score: To evaluate the model using cross-validation.
- Matplotlib: For creating static, animated, and interactive visualizations (e.g., bar charts, ROC curves).
- Seaborn: For creating attractive statistical graphics (e.g., heatmaps, pair plots).
- RocCurveDisplay (scikit-learn): For plotting ROC curves.

### Data Quality Assessment
- Evaluated prevalence rates to understand the distribution of the dataset.
- Identified missing values using the isnull().sum() function.
- Computed summary statistics (mean, min, and max values) for each feature.
- Created boxplots for key features such as Cholesterol and Resting BP to detect zero values and potential outliers.
- Developed bar plots to better capture the overall distribution of features in the heart failure dataset.
- The following categorical features were converted into numerical values using OneHotEncoder:
  - Sex
  - ChestPainType
  - RestingECG
  - ExerciseAngina
  - ST_Slope
- After encoding, the data was split into training and testing sets using stratified sampling to preserve the target class distribution.
- Five models were trained and evaluated using cross-validation:
  - Decision Tree
  - k-Nearest Neighbors
  - Random Forest
  - Logistic Regression
  - XGBoost
- Each model’s performance was compared using metrics such as:
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - ROC-AUC
  - Confusion Matrix
  - Log Loss
  - RMSE
- The best-performing model was finally deployed on the test dataset to establish reliable predictors of cardiovascular disease.
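The encoding, stratified split, and cross-validated comparison described above could look roughly like the following. This is a sketch under stated assumptions: the data is synthetic (random values with a made-up target rule), only three of the five models are included to keep it short, and the 80/20 split, `roc_auc` scoring, and random seeds are illustrative choices rather than the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (random values, made-up target rule).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(30, 78, n),
    "MaxHR": rng.integers(60, 203, n),
    "Oldpeak": rng.uniform(0, 4, n).round(1),
    "Sex": rng.choice(["M", "F"], n),
    "ChestPainType": rng.choice(["TA", "ATA", "NAP", "ASY"], n),
    "RestingECG": rng.choice(["Normal", "ST", "LVH"], n),
    "ExerciseAngina": rng.choice(["Y", "N"], n),
    "ST_Slope": rng.choice(["Up", "Flat", "Down"], n),
})
y = ((X["Oldpeak"] > 2) | (X["ExerciseAngina"] == "Y")).astype(int)

categorical = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
numeric = ["Age", "MaxHR", "Oldpeak"]

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Stratified split keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Cross-validate each model on the training split only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    cv_scores[name] = cross_val_score(
        pipe, X_train, y_train, cv=cv, scoring="roc_auc"
    ).mean()

for name, score in cv_scores.items():
    print(f"{name}: mean ROC-AUC = {score:.3f}")
```

Wrapping the encoder and scaler in a `Pipeline` means they are refit inside each cross-validation fold, which avoids leaking statistics from the validation fold into training.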
1. Load the Data
2. Explore the data
   a. Analyze and identify bias
3. Handle Missing Values (if needed)
4. Feature Engineering
   a. Encoding Categorical Values
   b. Creating new features - Define Age Groups, Categorize Cholesterol Levels
   c. Feature scaling/standardizing
   d. Visualize the numerical feature distribution
   e. Split the data - test and train datasets
5. Feature Selection
   a. Choose and fit a model
6. Hyperparameter Tuning
   a. GridSearchCV (Cross Validation)
   b. Identify the optimal hyperparameter combination
7. Evaluate the model
8. Model Selection - Compare and select the best performing model
   a. Visualize and analyze evaluation scores to support model choice.
   b. Explain the model performance
9. Save the model - pkl files
10. Load, deploy and test the model
    a. Preset data or User entered data
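The last two steps (saving the model as a pkl file, then loading it and testing it on preset or user-entered data) can be sketched as below. The model, the synthetic data from `make_classification`, and the file name `model.pkl` are all assumptions for illustration; the project's actual file names and model objects may differ.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 11 features (not the real dataset).
X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model to a pkl file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(model_path, "rb") as fh:
        loaded_model = pickle.load(fh)

# "Preset data": reuse one row; user-entered data would need the same
# preprocessing and column order as the training data.
new_patient = X[:1]
probability = loaded_model.predict_proba(new_patient)[0, 1]
print(f"Probability of the positive class: {probability:.2f}")
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.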
| Category | Deliverable | ETA | Comments |
|---|---|---|---|
| Documentation | Draft Project repo & readme | Mar 12 | Done |
| Documentation | Define Preliminary Project Plan | Mar 13 | Done |
| Data Clean-up | Load Data | Mar 13 | Done |
| Data Clean-up | Explore the Data | Mar 13 | Done |
| Data Clean-up | Handle Missing Values (if needed) | Mar 14 | Done |
| Feature Engineering | Encoding Categorical Values | Mar 17 | Done |
| Feature Engineering | Creating new features (Define Age Groups, Categorize Cholesterol Levels) | Mar 17 | Done |
| Feature Engineering | Feature scaling/standardizing | Mar 17 | Done |
| Feature Engineering | Visualize the numerical feature distribution | Mar 17 | Done |
| Feature Engineering | Split the data - test and train datasets | Mar 17 | Done |
| Feature Selection | Choose and fit a model | Mar 17 | Done |
| Hyperparameter Tuning | GridSearchCV (Cross Validation) | Mar 17 | Done |
| Hyperparameter Tuning | Identify the optimal hyperparameter combination | Mar 17 | Done |
| Model Analysis | Evaluate the Model | Mar 17 | Done |
| Model Selection | Visualize and analyze evaluation score to support model choice | Mar 18 | Done |
| Model Selection | Explain the model performance | Mar 19 | Done |
| Model Selection | Save the model - pkl files | Mar 20 | Done |
| Load, deploy and test the model | Preset data or User entered data | Mar 20 | Done |
| Update Repo | Create Pull requests and push changes to main repo | Mar 21 | Done |
| Finalize Documentation | Upload recordings | Mar 21 | Done |
| Finalize Documentation | Upload final README | Mar 21 | Done |
Exploratory Data Analysis (EDA) is an essential step in any data science project. It helps us understand the dataset, detect patterns, and identify potential issues before building a machine learning model. In this case, we are working with a dataset related to heart disease prediction.
- Dataset Overview:
  - Widely used versions of the heart disease data include 303 patients with 14 features, though some studies use a subset of these features. The dataset used in this project provides the 11 features and the target listed above.
  - Features in the 14-feature version include age, sex, chest pain type, blood pressure, cholesterol levels, fasting blood sugar, resting ECG, maximum heart rate, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, thalassemia, and the target variable indicating heart disease presence.
- Distribution of Features:
  - Age, resting blood pressure, and cholesterol levels tend to follow a normal distribution.
  - ST depression (oldpeak) is right-skewed: most patients have low ST depression values, with a tail of higher values.
  - Maximum heart rate (thalach) is left-skewed: many patients achieve high heart rates during exercise, with a tail of lower values.
- Features like chest pain type, maximum heart rate, and ST segment slope show positive correlation with heart disease.
- Age, ST depression, exercise-induced angina, number of major vessels, and thalassemia show negative correlation with heart disease.
- Risk Factors:
  - Age is a significant risk factor, with higher incidence of heart disease in older individuals.
  - Males are more likely to have heart disease than females.
  - High blood pressure and diabetes are also associated with increased risk.
- Visualization Insights:
  - Pair plots reveal linear separation between disease and non-disease groups for features like ST depression and maximum heart rate.
  - Resting ECG values of 0 and 1 are associated with higher heart disease risk compared to value 2.
- Data Quality Issues:
  - Missing values are common and often imputed using mean values.
  - Duplicate rows and data entry errors need to be checked and corrected.
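Statements about skewness and correlation like those above can be verified numerically with pandas. In the sketch below the data is synthetic and deliberately shaped: `Oldpeak` is drawn with a long right tail (most values low, positive skew) and `MaxHR` with a long left tail (most values high, negative skew). The target rule is also made up, so the numbers say nothing about the real dataset; only the method carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic columns with known shapes (not real patient data).
toy = pd.DataFrame({
    "Oldpeak": rng.exponential(scale=1.0, size=n),         # long right tail
    "MaxHR": 202 - rng.exponential(scale=25.0, size=n),    # long left tail
})
# Made-up target that depends on Oldpeak plus noise.
toy["HeartDisease"] = (toy["Oldpeak"] + rng.normal(0, 0.5, n) > 1.0).astype(int)

# Positive skew = right-skewed, negative skew = left-skewed.
skew = toy[["Oldpeak", "MaxHR"]].skew()

# Pearson correlation of each feature with the target.
corr = toy.corr()["HeartDisease"].drop("HeartDisease")

print(skew)
print(corr)
```

On the real data, the sign of `skew` for each numeric column and the sign of `corr` against `HeartDisease` would confirm or contradict each bullet directly.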
The dataset was reviewed for any missing values and none were found.
Box plots were used to identify outliers, which were found in the RestingBP and Cholesterol data. Each feature was analyzed for anything unexpected, and nothing odd was found.
The numerical features were plotted by age and sex to evaluate any correlations.

The patterns show an age-related increase in resting BP and cholesterol, along with a clear decline in max HR as age increases. Males also seem to exhibit more variation in values such as cholesterol and resting BP, while females show more variation in exercise-induced angina.
Further EDA we conducted:
- Understanding the dataset: Each row represents a patient and each column provides information about their health.
- Checking for missing values: We found no missing values, so no gaps in the data had to be filled.
- Statistical summary of the data: A quick overview of the numerical features, including the mean, standard deviation, min/max values, and quartiles.
- Encoding categorical variables: KNN requires numerical data, so categorical variables must be converted into numbers.
- Model performance: We built a K-Nearest Neighbors (KNN) classifier and tuned its hyperparameters.
- Feature Selection:
  - Selecting features with strong correlations to heart disease (e.g., chest pain type, maximum heart rate) can improve model performance.
- Data Preprocessing:
  - Handling missing values and outliers is crucial for robust model training.
- Model Interpretability:
  - Understanding the impact of each feature on predictions can aid in clinical decision-making.
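Building a KNN classifier and tuning its hyperparameters, as mentioned above, is commonly done with a scaling step plus `GridSearchCV`. The sketch below is an assumption about how this could be set up, not the project's notebook: the data comes from `make_classification`, and the parameter grid, fold count, and scoring are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (already numeric, so no encoding step is shown).
X, y = make_classification(n_samples=400, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# KNN is distance-based, so features are standardized before fitting.
knn = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; the values searched in the project are not stated.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    knn,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy = {test_accuracy:.2f}")
```

Because the scaler sits inside the pipeline, each cross-validation fold is scaled using only its own training portion.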
Random Forest
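Accuracy: 0.79, Precision: 0.80, Recall: 0.83, F1-Score: 0.82, AUC-ROC: 0.88, RMS Error: 0.45

Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.78 | 0.74 | 0.76 | 82 |
| 1 | 0.80 | 0.83 | 0.82 | 102 |
| Accuracy | | | 0.79 | 184 |
| Macro avg | 0.79 | 0.79 | 0.79 | 184 |
| Weighted avg | 0.79 | 0.79 | 0.79 | 184 |

Confusion Matrix (rows: actual class, columns: predicted class): [[61 21] [17 85]]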
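The headline Random Forest metrics can be re-derived from the confusion matrix alone, which is a useful sanity check on the report. The sketch below assumes the scikit-learn layout (rows are actual classes, columns are predicted classes), which is consistent with the support counts of 82 and 102.

```python
import numpy as np

# Confusion matrix reported above: rows = actual 0/1, columns = predicted 0/1.
cm = np.array([[61, 21],
               [17, 85]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()                        # 146 / 184
precision = tp / (tp + fp)                             # 85 / 106
recall = tp / (tp + fn)                                # 85 / 102
f1 = 2 * precision * recall / (precision + recall)     # 170 / 208

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values match the reported accuracy (0.79), precision (0.80), recall (0.83), and F1-score (0.82) for class 1.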
- Features related to stress tests (ST_Slope_Up, ST_Slope_Flat, and Oldpeak) and exercise-induced angina (ExerciseAngina_Y) dominate the model's predictions.
- This suggests that clinical indicators observed during stress tests are more predictive than demographic factors like age or general measures like maximum heart rate.
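XGBoost Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.87 | 0.83 | 0.85 | 103 |
| 1 | 0.86 | 0.90 | 0.88 | 127 |
| Accuracy | | | 0.87 | 230 |
| Macro avg | 0.87 | 0.86 | 0.86 | 230 |
| Weighted avg | 0.87 | 0.87 | 0.86 | 230 |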
Logistic Regression
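
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.83 | 0.87 | 0.85 | 82 |
| 1 | 0.89 | 0.85 | 0.87 | 102 |
| Accuracy | | | 0.86 | 184 |
| Macro avg | 0.86 | 0.86 | 0.86 | 184 |
| Weighted avg | 0.86 | 0.86 | 0.86 | 184 |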
Decision Tree Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.80 | 0.87 | 0.83 | 77 |
| 1 | 0.90 | 0.84 | 0.87 | 107 |
| Accuracy | | | 0.85 | 184 |
| Macro avg | 0.85 | 0.86 | 0.85 | 184 |
| Weighted avg | 0.86 | 0.85 | 0.85 | 184 |

AUC-ROC Score: 0.8556256827284863

RMSE: 0.3830654388414369
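KNN Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.90 | 0.91 | 0.91 | 82 |
| 1 | 0.93 | 0.92 | 0.93 | 102 |
| Accuracy | | | 0.92 | 184 |
| Macro avg | 0.92 | 0.92 | 0.92 | 184 |
| Weighted avg | 0.92 | 0.92 | 0.92 | 184 |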
| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC | RMS Error |
|---|---|---|---|---|---|---|
| Random Forest | 0.79 | 0.80 | 0.83 | 0.82 | 0.88 | 0.45 |
| Logistic Regression | 0.86 | 0.89 | 0.85 | 0.87 | 0.88 | - |
| XGBoost | 0.87 | 0.86 | 0.92 | 0.89 | 0.87 | 0.13 |
| KNN | 0.92 | 0.93 | 0.91 | 0.93 | 0.94 | 0.29 |
| Decision Tree | 0.85 | 0.90 | 0.84 | 0.87 | 0.86 | 0.38 |
- Best Performing Model:
  - KNN achieves the highest accuracy (92%), precision (93%), F1-score (93%), and AUC-ROC (0.94), with recall (91%) just below XGBoost's (92%), making it the best-performing model in this comparison.
- XGBoost Performance:
  - XGBoost shows strong performance with high recall (92%) and a good balance of precision (86%) and F1-score (89%).
- Logistic Regression:
  - Logistic Regression performs well with high precision (89%) and accuracy (86%), but its recall (85%) is slightly lower than that of KNN and XGBoost.
- Random Forest and Decision Tree:
  - Both models have generally lower performance metrics than KNN and XGBoost, with Random Forest having the lowest recall (83%) and Decision Tree achieving a balanced performance after optimization.
- KNN has a low RMS Error (0.29), indicating few prediction errors.
- XGBoost reports the lowest RMS Error value in the table (0.13), consistent with its strong predictive accuracy.
- Decision Tree and Random Forest have higher RMS Errors (0.38 and 0.45, respectively), suggesting more prediction errors.
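When RMS Error is computed on hard 0/1 predictions, each squared error is either 0 or 1, so RMSE equals the square root of the misclassification rate, i.e. sqrt(1 - accuracy). Assuming that is how the values in the table were computed (the source does not say), the sketch below compares each reported RMS Error with the value implied by the reported accuracy. The 0.02 tolerance is an arbitrary allowance for rounding.

```python
import math

# (accuracy, RMS Error) pairs copied from the comparison table above.
reported = {
    "Random Forest": (0.79, 0.45),
    "XGBoost": (0.87, 0.13),
    "KNN": (0.92, 0.29),
    "Decision Tree": (0.85, 0.38),
}

# For 0/1 predictions: RMSE = sqrt(error rate) = sqrt(1 - accuracy).
implied = {name: math.sqrt(1 - acc) for name, (acc, _) in reported.items()}

# Flag models whose reported value differs from the implied one beyond rounding.
inconsistent = [
    name for name, (_, rmse) in reported.items()
    if abs(implied[name] - rmse) > 0.02
]

for name, (acc, rmse) in reported.items():
    print(f"{name}: reported={rmse:.2f}, implied={implied[name]:.2f}")
print("Worth re-checking:", inconsistent)
```

Under this assumption, only the XGBoost value stands out: 0.13 equals 1 - 0.87, which would be the mean squared error rather than its square root. That is an inference from the table, not something the source confirms, so the XGBoost figure may not be directly comparable with the others.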
Based on the performance metrics provided, KNN is the best model for heart disease prediction in this scenario.
Example data:

    import pandas as pd

    # label_encoders: fitted per-column encoders from preprocessing (defined elsewhere)
    dummy_data = pd.DataFrame({
        'Age': [54],
        'Sex': label_encoders['Sex'].transform(['M']),  # Encode categorical variables
        'ChestPainType': label_encoders['ChestPainType'].transform(['ATA']),
        'RestingBP': [130],
        'Cholesterol': [223],
        'FastingBS': [0],
        'RestingECG': label_encoders['RestingECG'].transform(['Normal']),
        'MaxHR': [138],
        'ExerciseAngina': label_encoders['ExerciseAngina'].transform(['N']),
        'Oldpeak': [0.6],
        'ST_Slope': label_encoders['ST_Slope'].transform(['Flat'])
    })
Results:

Probability of heart disease: 0.40

Explanation: The individual is a 54-year-old male with chest pain type ATA, a resting blood pressure of 130, a cholesterol level of 223, and no exercise-induced angina. The features provided may not point to an elevated risk of heart disease, which could be due to factors like the normal resting ECG, low FastingBS, and normal MaxHR.
Key Reasons for Choosing KNN
- Highest Accuracy (92%):
  - KNN achieved the highest accuracy among all models, indicating that it correctly classified the most samples.
- Strong Recall (91%):
  - KNN's recall for detecting heart disease cases is among the highest of all models, which is critical in healthcare applications where missing true positives can have severe consequences.
- Best AUC-ROC (0.94):
  - KNN demonstrated the highest AUC-ROC score, reflecting its strong ability to distinguish between heart disease and non-heart disease cases.
- Low RMS Error (0.29):
  - KNN had a low RMS Error, indicating few prediction errors and good overall predictive accuracy.
- Balanced Precision and Recall:
  - KNN maintained a high precision (93%) and recall (91%), showing a good balance between avoiding false positives and identifying true positives.

Why KNN is Suitable for Heart Disease Prediction
- Interpretability:
  - While KNN is not as interpretable as linear models like Logistic Regression, its simplicity and ease of implementation make it a practical choice for many applications.
- Robustness to Noise:
  - KNN can handle noisy data to some extent, which is beneficial in real-world datasets where data quality may vary.
- Flexibility:
  - KNN can be easily tuned by adjusting the number of neighbors (n_neighbors) to improve performance on specific datasets.

Considerations for Other Models

- XGBoost: Offers high recall but slightly lower accuracy and AUC-ROC compared to KNN.
- Logistic Regression: Provides balanced performance but is less accurate than KNN.
- Decision Tree: Provides reliable performance but is outperformed by KNN and XGBoost.
KNN is the best choice for heart disease prediction due to its high accuracy, strong recall, and excellent AUC-ROC score. Its ability to minimize prediction errors while maintaining a good balance between precision and recall makes it particularly suitable for healthcare applications where accurate detection of heart disease is critical.