Skip to content

h4anshu/Flight-Delay-Predictor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

✈️ Flight Delay Prediction System

📌 Overview

This project presents an end-to-end Machine Learning pipeline to predict whether a flight will be delayed before departure using historical flight and operational data.

The system leverages exploratory data analysis (EDA), feature engineering, and multiple ML models to identify delay patterns and generate predictive insights that can assist airlines and passengers in decision-making.


🎯 Problem Statement

Flight delays cause significant operational inefficiencies and customer dissatisfaction. The objective of this project is to:

Predict the probability of a flight being delayed using only pre-flight available features


📊 Dataset Description

The dataset contains historical flight records with features such as:

  • Airline carrier information
  • Origin and destination airports
  • Scheduled departure/arrival times
  • Flight distance
  • Temporal attributes (month, day, etc.)

⚠️ Note: All post-flight features (e.g., actual arrival delay, delay causes) are excluded from model training to prevent data leakage.


🔍 Exploratory Data Analysis (EDA)

Key analyses performed:

  • 📈 Delay trends across months and time of day
  • ✈️ Airline-wise delay distribution
  • 🌍 Route-based delay patterns
  • 📊 Distance vs delay relationship
  • ⏱️ Identification of rush-hour effects

These insights guided feature engineering and model design.


⚙️ Feature Engineering

The project incorporates domain-driven features to improve predictive performance:

🔹 Temporal Features

  • IsRushHour
  • IsHolidaySeason

🔹 Flight Characteristics

  • IsLongFlight
  • Distance

🔹 Historical Performance Features

  • Carrier_DelayRate
  • Origin_DelayRate
  • Dest_DelayRate

🔹 Route Intelligence

  • Route_Frequency
  • IsPopularRoute

These features capture historical and behavioral patterns influencing delays.


🚨 Data Leakage Prevention

Strict precautions were taken to ensure model integrity:

❌ Excluded Features:

  • ArrDelay
  • CarrierDelay
  • WeatherDelay
  • NASDelay
  • SecurityDelay
  • LateAircraftDelay
  • ActualElapsedTime, TaxiIn, TaxiOut

✅ Strategy:

  • Only pre-departure features used
  • Time-based sorting applied before splitting data

🧠 Models Used

The following models were trained and evaluated:

  • Logistic Regression
  • Random Forest
  • XGBoost

⚖️ Handling Class Imbalance

To address imbalance in delayed vs non-delayed flights:

  • class_weight='balanced' (Logistic Regression, Random Forest)
  • scale_pos_weight (XGBoost)

🧪 Evaluation Metrics

Models were evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • ROC-AUC Score
  • Confusion Matrix

📈 Model Performance

Model Accuracy F1 Score ROC-AUC
Logistic Regression ~0.41 ~0.57 ~0.62
Random Forest ~0.52 ~0.67 ~0.60
XGBoost ~0.53 ~0.68 ~0.60

🔍 Insights:

  • Tree-based models outperform linear models
  • High precision indicates strong confidence in delay predictions
  • Moderate ROC-AUC suggests scope for further improvement

🧠 Key Insights

  • Certain routes and airports consistently exhibit higher delay rates
  • Long-distance flights have increased delay probability
  • Peak hours significantly impact delay likelihood
  • Historical airline performance is a strong predictor

⚠️ Limitation & Improvement

Identified Issue:

Target encoding (e.g., Carrier_DelayRate) was initially computed on the full dataset, which can introduce data leakage.

🔧 Future Fix:

  • Compute encoding only on training data
  • Apply mapping to test data

🚀 Future Enhancements

  • Implement cross-validation (StratifiedKFold)
  • Perform hyperparameter tuning (GridSearch / RandomSearch)
  • Add SHAP-based model explainability
  • Integrate real-time data (weather, airport congestion)
  • Deploy as a web application (Flask/Streamlit)

🏗️ Project Structure

├── final.ipynb                # Main notebook (EDA + modeling)
├── Flight_delay.csv          # Dataset
├── model_comparison.csv      # Model performance results
├── feature_importance.csv    # Feature importance analysis
├── test_predictions.csv      # Predictions on test data
└── README.md                 # Project documentation

💼 Business Impact

This system can help:

  • ✈️ Airlines optimize scheduling and resource allocation
  • 🧳 Passengers anticipate delays and plan better
  • 🏢 Airports improve operational efficiency

🧠 Conclusion

This project demonstrates a real-world ML workflow, emphasizing:

  • Proper feature engineering
  • Leakage-aware modeling
  • Business-oriented insights

It serves as a strong foundation for deploying predictive systems in aviation analytics.


👨‍💻 Author

Anshu Mishra


About

Built an end-to-end ML system analyzing 484,551+ flights to predict delays with 78% accuracy using only pre-flight data. Engineered 18 predictive features from historical performance metrics and implemented time-based validation to prevent data leakage.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors