This project is a complete data engineering workflow that transforms the raw Titanic dataset into a clean, structured format optimized for Machine Learning models.
The goal was to move from "raw and messy" data to "AI-ready" data. This involved handling missing values, encoding categorical variables, and engineering new features to improve potential model accuracy.
- VS Code: Primary development environment.
- Python 3.10+: Core programming language.
- Pandas & NumPy: For data manipulation and numerical logic.
- Matplotlib & Seaborn: For exploratory data analysis (EDA).
- Age: Filled missing values using the median grouped by Passenger Class and Sex.
- Embarked: Filled missing values with the mode (most common port).
- Cabin: Dropped due to having over 77% missing data.
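The imputation steps above can be sketched in pandas as follows. The toy frame and its values are hypothetical stand-ins for `train.csv`; only the column names (`Pclass`, `Sex`, `Age`, `Embarked`, `Cabin`) come from the Titanic dataset:

```python
import pandas as pd

# Toy stand-in for train.csv (values are illustrative only).
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Sex": ["female", "female", "male", "male"],
    "Age": [40.0, None, 22.0, None],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": ["C85", None, None, None],
})

# Age: fill missing values with the median of the passenger's (Pclass, Sex) group.
df["Age"] = df.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

# Embarked: fill missing values with the most common port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column entirely (mostly missing).
df = df.drop(columns=["Cabin"])
```

Grouping by class and sex before taking the median keeps the imputed ages plausible, since median age differs noticeably across those groups.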
I created several new features to capture hidden patterns:
- Title: Extracted from names (Mr, Mrs, Miss, etc.).
- FamilySize: Combined `SibSp` and `Parch`.
- IsAlone: A binary flag for passengers traveling without family.
- FareBin & AgeBin: Grouped continuous data into logical categories.
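A minimal sketch of these engineered features, assuming the standard Titanic column names (`Name`, `SibSp`, `Parch`, `Fare`, `Age`); the sample rows and bin edges are illustrative, not the notebook's exact choices:

```python
import pandas as pd

# Illustrative rows standing in for train.csv.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Fare": [7.25, 71.28],
    "Age": [22.0, 38.0],
})

# Title: the token between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# FamilySize: siblings/spouses + parents/children + the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone: 1 if the passenger has no family aboard.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# FareBin / AgeBin: bucket continuous values into coarse categories.
df["FareBin"] = pd.qcut(df["Fare"], 2, labels=["low", "high"])
df["AgeBin"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 120],
                      labels=["child", "young_adult", "adult", "senior"])
```

`pd.qcut` splits fares at quantile boundaries so each bin holds a similar number of passengers, while `pd.cut` uses fixed age boundaries.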
- Converted `Sex` to binary (0/1).
- Applied One-Hot Encoding to `Embarked` and `Title` to ensure the data is 100% numerical.
- Dropped non-predictive columns: `PassengerId`, `Name`, `Ticket`, `SibSp`, and `Parch`.
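The encoding and cleanup steps above can be sketched like this; the frame below is a hypothetical snapshot of the data after feature engineering, not the notebook's actual state:

```python
import pandas as pd

# Hypothetical frame after the feature-engineering stage.
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
    "Title": ["Mr", "Mrs"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
})

# Sex to binary 0/1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode the remaining categorical columns.
df = pd.get_dummies(df, columns=["Embarked", "Title"])

# Drop non-predictive columns (SibSp/Parch are redundant once FamilySize exists).
df = df.drop(columns=["PassengerId", "Name", "Ticket", "SibSp", "Parch"])
```

After this step every remaining column is numeric, which is the "AI-ready" requirement for most scikit-learn style models.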
- `titanic_data_preparation.ipynb`: The complete Jupyter Notebook with step-by-step logic.
- `train.csv`: The original raw dataset.
- `titanic_clean_ai_ready.csv`: The final processed output.
Completed as part of the AI Data Preparation Challenge.