A comprehensive machine learning pipeline for predicting heart disease risk using the UCI Heart Disease Dataset. This project implements advanced data preprocessing, feature selection, dimensionality reduction, supervised and unsupervised learning, hyperparameter tuning, and deployment with a modern Streamlit web interface.
This project demonstrates a complete end-to-end machine learning workflow for medical diagnosis prediction, featuring:
- Advanced Data Preprocessing: Missing value imputation, categorical encoding, feature scaling
- Dimensionality Reduction: Principal Component Analysis (PCA) for feature optimization
- Intelligent Feature Selection: Multiple algorithms for optimal feature subset selection
- Supervised Learning: Multiple classification algorithms with performance comparison
- Unsupervised Learning: Clustering analysis for pattern discovery
- Hyperparameter Optimization: Automated model tuning for best performance
- Interactive Web Interface: Modern Streamlit UI for real-time predictions
- Production Deployment: Ready for cloud deployment with Ngrok
- Data Preprocessing: Handles missing values, categorical encoding, and feature scaling
- PCA Analysis: Dimensionality reduction while retaining 90% variance
- Feature Selection: Random Forest importance, RFE, and statistical tests
- Model Training: Logistic Regression, Decision Trees, Random Forest, SVM
- Clustering: K-Means and Hierarchical clustering for pattern discovery
- Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV optimization
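The "retaining 90% variance" step can be sketched with scikit-learn's `PCA`, which accepts a fraction as `n_components`. This is a minimal illustration on synthetic data, not the code in `02_pca_analysis.ipynb`. The array shapes, random seed, and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 clinical features (not the real dataset):
# 4 hidden factors mixed into 13 correlated columns, plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 13))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 13))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(f"components kept: {pca.n_components_}, variance retained: {explained:.3f}")
```

Passing a fraction rather than a fixed count lets the number of components adapt to the data.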
- Best Model: Tuned Random Forest Classifier
- Accuracy: 83.2%
- F1-Score: 85.4%
- Precision: 84.1%
- Recall: 86.7%
- AUC: 90.1%
- Interactive Interface: User-friendly form for patient data input
- Real-time Predictions: Instant heart disease risk assessment
- Visual Feedback: Clear risk indicators and recommendations
- Responsive Design: Modern, professional UI
Heart_Disease_Project/
├── data/                                # Dataset and processed data
│   ├── heart_disease_uci.csv            # Original UCI dataset
│   ├── X_scaled.csv                     # Scaled features
│   ├── X_selected.csv                   # Selected features
│   ├── X_pca.csv                        # PCA-transformed data
│   └── y_target.csv                     # Target variable
├── notebooks/                           # Jupyter notebooks for analysis
│   ├── 01_data_preprocessing.ipynb      # Data cleaning and preprocessing
│   ├── 02_pca_analysis.ipynb            # Dimensionality reduction
│   ├── 03_feature_selection.ipynb       # Feature selection algorithms
│   ├── 04_supervised_learning.ipynb     # Classification models
│   ├── 05_unsupervised_learning.ipynb   # Clustering analysis
│   └── 06_hyperparameter_tuning.ipynb   # Model optimization
├── models/                              # Trained models and artifacts
│   ├── best_model.pkl                   # Best performing model
│   ├── scaler.pkl                       # Feature scaler
│   └── selected_features.pkl            # Selected feature names
├── ui/                                  # Web application
│   └── app.py                           # Streamlit application
├── deployment/                          # Deployment configurations
│   └── ngrok_setup.txt                  # Ngrok deployment guide
├── results/                             # Analysis results
│   ├── evaluation_metrics.txt           # Model performance metrics
│   └── model_performance.csv            # Detailed performance comparison
├── run_complete_pipeline.py             # End-to-end pipeline execution
├── requirements.txt                     # Python dependencies
├── .gitignore                           # Git ignore rules
└── README.md                            # Project documentation
- Python 3.8+ (tested with Python 3.13)
- pip package manager
- Clone the repository:
    git clone https://github.com/mido-io/Heart_Disease_Project.git
    cd Heart_Disease_Project
- Install dependencies:
    pip install -r requirements.txt
- Run the complete pipeline:
    python run_complete_pipeline.py
- Launch the web application:
    streamlit run ui/app.py
- Access the application:
  - Local: http://localhost:8501
  - Network: http://192.168.0.102:8501 (example; the address depends on your machine's LAN IP)
- Total Samples: 920 patients
- Features: 13 clinical attributes
- Target: Binary classification (No Heart Disease / Heart Disease)
- Missing Values: Handled with intelligent imputation
- Data Quality: High-quality medical dataset with comprehensive preprocessing
| Feature | Description | Type |
|---|---|---|
| age | Age in years | Numerical |
| sex | Gender (Male/Female) | Categorical |
| cp | Chest pain type | Categorical |
| trestbps | Resting blood pressure | Numerical |
| chol | Serum cholesterol | Numerical |
| fbs | Fasting blood sugar > 120 mg/dl | Binary |
| restecg | Resting ECG results | Categorical |
| thalach | Maximum heart rate achieved | Numerical |
| exang | Exercise induced angina | Binary |
| oldpeak | ST depression induced by exercise | Numerical |
| slope | ST slope | Categorical |
| ca | Number of major vessels | Numerical |
| thal | Thalassemia | Categorical |
- Missing Value Handling: Intelligent imputation using median and mode
- Categorical Encoding: Label encoding for ordinal variables
- Feature Scaling: StandardScaler for normalization
- Data Validation: Comprehensive quality checks
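The preprocessing steps above can be sketched as follows. This is an illustrative example on a made-up four-row frame, not the project's actual notebook code. The column names follow the feature table, and the values and the split into numeric and categorical columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny made-up frame with gaps; not real patient data.
df = pd.DataFrame({
    "age": [63, 45, np.nan, 58],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "cp": ["typical angina", "asymptomatic", None, "asymptomatic"],
})

numeric_cols = ["age", "chol"]
categorical_cols = ["cp"]

# Impute: median for numeric columns, mode (most frequent value) for categorical ones.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Label-encode each categorical column into integer codes.
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardize numeric columns to zero mean and unit variance.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df)
```

In a real pipeline the imputation values and the scaler would be fitted on the training split only and reused at prediction time, which is presumably why `scaler.pkl` is stored alongside the model.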
- Random Forest Importance: Tree-based feature ranking
- Recursive Feature Elimination: Backward selection with cross-validation
- Statistical Tests: F-test for feature significance
- Final Selection: Top 10 most important features
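A rough sketch of the three selection techniques listed above, run on synthetic data from `make_classification`. The feature names (`f0` to `f12`), the choice of logistic regression as the RFE estimator, and the use of plain `RFE` instead of a cross-validated variant are assumptions made to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data with 13 columns, standing in for the clinical features.
X_arr, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, n_redundant=2, random_state=0
)
feature_names = [f"f{i}" for i in range(13)]  # hypothetical names
X = pd.DataFrame(X_arr, columns=feature_names)

# 1) Tree-based ranking: impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
feature_importance_df = pd.DataFrame(
    {"feature": feature_names, "importance": forest.feature_importances_}
).sort_values("importance", ascending=False)

# 2) Recursive feature elimination: repeatedly drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
rfe_features = X.columns[rfe.support_].tolist()

# 3) Univariate ANOVA F-test between each feature and the class label.
f_scores, p_values = f_classif(X, y)

# Final selection: top N features by forest importance.
N = 10
selected_features_names = feature_importance_df.head(N)["feature"].tolist()
print(selected_features_names)
```

The three methods can disagree, so comparing `rfe_features` against the importance ranking is a quick sanity check before fixing the final subset.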
- Logistic Regression: Linear baseline model
- Decision Tree: Interpretable tree-based model
- Random Forest: Ensemble method with 200 trees
- Support Vector Machine: Kernel-based classification
- Best Model: Tuned Random Forest (GridSearchCV optimized)
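Tuning a random forest with `GridSearchCV` might look like the sketch below. The parameter grid, the F1 scoring choice, the 3-fold split, and the synthetic data are all assumptions for illustration. The grid used in `06_hyperparameter_tuning.ipynb` is not documented here and is likely larger.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (not the heart disease dataset).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Deliberately small, hypothetical grid so the example runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training split
    scoring="f1",  # optimize F1 rather than plain accuracy
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, f"held-out accuracy: {test_accuracy:.3f}")
```

Keeping a held-out test split outside the search avoids reporting an optimistic score from the same folds that chose the hyperparameters.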
- Accuracy: Overall prediction correctness
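- Precision: Fraction of predicted positives that are correct (positive predictive value)
- Recall: Fraction of actual positives that are detected (sensitivity, or true positive rate)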
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the receiver operating characteristic curve
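To make the definitions above concrete, here is a small dependency-free sketch that computes accuracy, precision, recall, and F1 from binary labels. The function name and the example labels are made up for illustration. AUC-ROC is left out because it needs predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 true positives, 2 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting, recall usually matters most, since a false negative means a missed case.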
- Modern Design: Clean, professional medical interface
- Responsive Layout: Works on desktop and mobile devices
- Interactive Forms: User-friendly input validation
- Real-time Feedback: Instant prediction results
- Risk Assessment: Binary heart disease prediction
- Probability Scores: Confidence levels for predictions
- Medical Recommendations: Professional health advice
- Visual Indicators: Color-coded risk levels
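One way the probability score could be turned into a color-coded risk level is sketched below. The function name, the 0.3 and 0.7 cut-offs, and the colors are hypothetical. They are not taken from `ui/app.py` and carry no clinical meaning.

```python
def risk_level(probability, low=0.3, high=0.7):
    """Map a predicted disease probability to a (label, color) pair.

    The default thresholds are illustrative assumptions, not clinical cut-offs.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < low:
        return "low", "green"
    if probability < high:
        return "moderate", "orange"
    return "high", "red"


# Example with made-up probabilities, e.g. as returned by model.predict_proba.
for p in (0.12, 0.55, 0.91):
    label, color = risk_level(p)
    print(f"{p:.2f} -> {label} ({color})")
```

Exposing the thresholds as parameters keeps the mapping easy to adjust if the displayed levels need recalibrating.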
- Performance Metrics: Model accuracy and performance charts
- Feature Importance: Visual representation of key factors
- Distribution Analysis: Data exploration and insights
streamlit run ui/app.py
- Install Ngrok: https://ngrok.com/download
- Run the application: streamlit run ui/app.py
- In another terminal: ngrok http 8501
- Share the public URL for remote access
- Heroku: Container-based deployment
- AWS: EC2 instance with load balancing
- Google Cloud: App Engine or Compute Engine
- Azure: App Service or Container Instances
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 81.5% | 82.1% | 83.2% | 82.6% | 88.3% |
| Decision Tree | 78.3% | 79.1% | 80.5% | 79.8% | 85.7% |
| Random Forest | 82.1% | 83.4% | 84.2% | 83.8% | 89.2% |
| SVM | 80.7% | 81.8% | 82.9% | 82.3% | 87.6% |
| Random Forest (Tuned) | 83.2% | 84.1% | 86.7% | 85.4% | 90.1% |
- Chest Pain Type (cp): 18.5% importance
- Serum Cholesterol (chol): 15.2% importance
- Maximum Heart Rate (thalch): 14.8% importance
- Age: 12.3% importance
- ST Depression (oldpeak): 11.7% importance
export STREAMLIT_SERVER_PORT=8501
export STREAMLIT_SERVER_ADDRESS=0.0.0.0
python run_complete_pipeline.py --retrain
Modify run_complete_pipeline.py to adjust feature selection parameters:
# Select top N features
selected_features_names = feature_importance_df.head(N)['feature'].tolist()
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scikit-learn: Machine learning algorithms
- matplotlib: Data visualization
- seaborn: Statistical data visualization
- streamlit: Web application framework
- joblib: Model serialization
- Python 3.8+ (tested with 3.13)
- scikit-learn 1.3+
- pandas 2.0+
- streamlit 1.28+
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Install development dependencies: pip install -r requirements.txt
- Make your changes and test thoroughly
- Submit a pull request
- Follow PEP 8 style guidelines
- Add comprehensive docstrings
- Include unit tests for new features
- Update documentation for API changes
This project is licensed under the MIT License - see the LICENSE file for details.
- UCI Machine Learning Repository for the Heart Disease dataset
- Scikit-learn team for the excellent ML library
- Streamlit team for the intuitive web framework
- Medical professionals who provided domain expertise
For questions, issues, or contributions:
- Issues: Create a GitHub issue
- Discussions: Use GitHub Discussions
- Email: [[email protected]]
- Deep Learning Models: Neural network implementation
- Real-time Data: Live patient monitoring integration
- Mobile App: React Native or Flutter application
- API Development: RESTful API for third-party integration
- Advanced Analytics: Time-series analysis and trend prediction
- Multi-language Support: Internationalization features
Privacy Notice: All patient data is processed locally. No personal information is transmitted to external servers.
Built with ❤️ for advancing healthcare through machine learning