This project presents an end-to-end Machine Learning pipeline to predict whether a flight will be delayed before departure using historical flight and operational data.
The system leverages exploratory data analysis (EDA), feature engineering, and multiple ML models to identify delay patterns and generate predictive insights that can assist airlines and passengers in decision-making.
Flight delays cause significant operational inefficiencies and customer dissatisfaction. The objective of this project is to predict the probability that a flight will be delayed, using only features available before departure.
The dataset contains historical flight records with features such as:
- Airline carrier information
- Origin and destination airports
- Scheduled departure/arrival times
- Flight distance
- Temporal attributes (month, day, etc.)
Key analyses performed:
- 📈 Delay trends across months and time of day
- ✈️ Airline-wise delay distribution
- 🌍 Route-based delay patterns
- 📊 Distance vs delay relationship
- ⏱️ Identification of rush-hour effects
These insights guided feature engineering and model design.
The project incorporates domain-driven features to improve predictive performance:
- IsRushHour, IsHolidaySeason
- IsLongFlight, Distance
- Carrier_DelayRate, Origin_DelayRate, Dest_DelayRate
- Route_Frequency, IsPopularRoute
These features capture historical and behavioral patterns influencing delays.
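As a rough illustration of how flags such as IsRushHour, IsHolidaySeason, IsLongFlight, Route_Frequency, and IsPopularRoute might be derived, here is a minimal pandas sketch. The input column names (`Month`, `DepHour`, `Origin`, `Dest`) and every threshold are assumptions made for the example, not the values used in the notebook.

```python
import pandas as pd

# Toy data; the input column names are assumed for illustration.
flights = pd.DataFrame({
    "Month": [1, 7, 12, 3],
    "DepHour": [8, 13, 18, 23],
    "Distance": [300, 1800, 2500, 700],
    "Origin": ["JFK", "JFK", "LAX", "ORD"],
    "Dest": ["LAX", "LAX", "JFK", "JFK"],
})

# Hypothetical thresholds; the notebook may define these differently.
RUSH_HOURS = [7, 8, 9, 17, 18, 19]
HOLIDAY_MONTHS = [6, 7, 12]
LONG_FLIGHT_MILES = 1500
POPULAR_ROUTE_MIN_FLIGHTS = 2

flights["IsRushHour"] = flights["DepHour"].isin(RUSH_HOURS).astype(int)
flights["IsHolidaySeason"] = flights["Month"].isin(HOLIDAY_MONTHS).astype(int)
flights["IsLongFlight"] = (flights["Distance"] > LONG_FLIGHT_MILES).astype(int)

# Route_Frequency: number of rows sharing the same origin-destination pair.
flights["Route_Frequency"] = (
    flights.groupby(["Origin", "Dest"])["Origin"].transform("size")
)
flights["IsPopularRoute"] = (
    flights["Route_Frequency"] >= POPULAR_ROUTE_MIN_FLIGHTS
).astype(int)
```

In a leakage-aware pipeline, frequency and rate features like these would be fitted on the training split only and then mapped onto the test split.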
Strict precautions were taken to ensure model integrity:
- Post-departure columns excluded to prevent leakage: ArrDelay, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, ActualElapsedTime, TaxiIn, TaxiOut
- Only pre-departure features used
- Time-based sorting applied before splitting data
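The precautions above can be sketched as follows, assuming the column names listed above are the post-departure fields that were excluded. The `FlightDate` and `IsDelayed` columns, the helper `chronological_split`, and the 20% test fraction are hypothetical choices for this example.

```python
import pandas as pd

# Columns only known after departure (taken from the list above).
LEAKY_COLUMNS = [
    "ArrDelay", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
    "LateAircraftDelay", "ActualElapsedTime", "TaxiIn", "TaxiOut",
]

def chronological_split(df, date_col, test_fraction=0.2):
    """Sort by date, then hold out the most recent rows as the test set."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy data; FlightDate and IsDelayed are assumed column names.
flights = pd.DataFrame({
    "FlightDate": pd.to_datetime(
        ["2019-03-01", "2019-01-15", "2019-05-20", "2019-02-10", "2019-04-05"]
    ),
    "Distance": [500, 1200, 800, 2400, 300],
    "ArrDelay": [12, 0, 45, 3, 20],
    "TaxiOut": [15, 10, 22, 9, 18],
    "IsDelayed": [1, 0, 1, 0, 1],
})

# Drop leaky columns first; errors="ignore" skips names absent from the frame.
features = flights.drop(columns=LEAKY_COLUMNS, errors="ignore")
train, test = chronological_split(features, "FlightDate", test_fraction=0.2)
```

Splitting after sorting means every test flight is later than every training flight, which mirrors how the model would be used on future departures.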
The following models were trained and evaluated:
- Logistic Regression
- Random Forest
- XGBoost
To address imbalance in delayed vs non-delayed flights:
- class_weight='balanced' (Logistic Regression, Random Forest)
- scale_pos_weight (XGBoost)
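To make the two weighting options concrete, the sketch below shows what `class_weight='balanced'` computes in scikit-learn and one common way to pick `scale_pos_weight` (the ratio of negative to positive samples). The toy labels are invented, and the notebook's actual value of `scale_pos_weight` is not stated here, so the ratio is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 examples of class 0 and 2 of class 1 (invented for illustration).
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10, dtype=float).reshape(-1, 1)

# 'balanced' weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# A common heuristic for XGBoost: negatives divided by positives.
# It would be passed as XGBClassifier(scale_pos_weight=scale_pos_weight).
scale_pos_weight = float((y == 0).sum() / (y == 1).sum())

# scikit-learn applies the balanced weights internally when asked to.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Both approaches make errors on the rarer class cost more during training rather than resampling the data.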
Models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-Score
- ROC-AUC Score
- Confusion Matrix
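The metrics listed above can all be computed with `sklearn.metrics`. The snippet below is a minimal sketch on invented labels and probabilities; the 0.5 decision threshold is an assumption, not necessarily the one used in the notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth and predicted delay probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1, 0.8, 0.3]

# Hard predictions from an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

report = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from probabilities, not thresholded predictions.
    "roc_auc": roc_auc_score(y_true, y_prob),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
```

Only ROC-AUC is threshold-independent; the other scores change if the decision threshold moves.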
| Model | Accuracy | F1 Score | ROC-AUC |
|---|---|---|---|
| Logistic Regression | ~0.41 | ~0.57 | ~0.62 |
| Random Forest | ~0.52 | ~0.67 | ~0.60 |
| XGBoost | ~0.53 | ~0.68 | ~0.60 |
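- Tree-based models (Random Forest, XGBoost) outperform Logistic Regression on accuracy and F1 score, while Logistic Regression has a slightly higher ROC-AUC (~0.62 vs ~0.60)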
- High precision indicates strong confidence in delay predictions
- ROC-AUC of ~0.60 to ~0.62 is only modestly above the 0.5 of a random classifier, leaving considerable scope for improvement
- Certain routes and airports consistently exhibit higher delay rates
- Long-distance flights have increased delay probability
- Peak hours significantly impact delay likelihood
- Historical airline performance is a strong predictor
Target encoding (e.g., Carrier_DelayRate) was initially computed on the full dataset, which can introduce data leakage because test-set labels influence the encoded values.
- Compute encoding only on training data
- Apply mapping to test data
- Implement cross-validation (StratifiedKFold)
- Perform hyperparameter tuning (GridSearch / RandomSearch)
- Add SHAP-based model explainability
- Integrate real-time data (weather, airport congestion)
- Deploy as a web application (Flask/Streamlit)
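The leakage fix noted above (compute the encoding on training data only, then apply the mapping to test data) could look roughly like this. The helpers `fit_delay_rate` and `apply_delay_rate`, the `IsDelayed` target name, and the global-mean fallback for unseen carriers are assumptions for the sketch.

```python
import pandas as pd

def fit_delay_rate(train, key, target):
    """Mean delay rate per category, learned from training rows only."""
    return train.groupby(key)[target].mean()

def apply_delay_rate(df, key, rates, fallback):
    """Map learned rates onto any frame; unseen categories get the fallback."""
    return df[key].map(rates).fillna(fallback)

# Toy splits; IsDelayed is an assumed target column name.
train = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL", "DL"],
    "IsDelayed": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"Carrier": ["AA", "DL", "UA"]})

rates = fit_delay_rate(train, "Carrier", "IsDelayed")
global_rate = train["IsDelayed"].mean()  # fallback for carriers absent from train

train["Carrier_DelayRate"] = apply_delay_rate(train, "Carrier", rates, global_rate)
test["Carrier_DelayRate"] = apply_delay_rate(test, "Carrier", rates, global_rate)
```

The same pattern would apply to Origin_DelayRate and Dest_DelayRate. Within the training set each row still contributes to its own category's rate, so an out-of-fold variant may be worth considering once cross-validation is in place.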
├── final.ipynb # Main notebook (EDA + modeling)
├── Flight_delay.csv # Dataset
├── model_comparison.csv # Model performance results
├── feature_importance.csv # Feature importance analysis
├── test_predictions.csv # Predictions on test data
└── README.md # Project documentation
This system can help:
- ✈️ Airlines optimize scheduling and resource allocation
- 🧳 Passengers anticipate delays and plan better
- 🏢 Airports improve operational efficiency
This project demonstrates a real-world ML workflow, emphasizing:
- Proper feature engineering
- Leakage-aware modeling
- Business-oriented insights
It serves as a strong foundation for deploying predictive systems in aviation analytics.
Anshu Mishra