Machine learning project predicting Titanic passenger survival using Python, pandas, and scikit-learn. Includes full data preprocessing, model training, and evaluation pipeline.
The goal is to build a model that predicts the Survived outcome (1 = survived, 0 = did not survive) based on passenger characteristics such as age, class, and sex.
- Data Loading: Imported `train.csv`, `test.csv`, and `gender_submission.csv`
- Data Cleaning: Filled missing values with median and mode
- Feature Encoding: Converted categorical features like `Sex` and `Embarked` into numerical values
- Model Training: Used Logistic Regression to train and validate survival predictions
- Evaluation: Measured accuracy and classification metrics on a validation set
- Prediction: Generated survival predictions for the test set and saved them as `submission.csv`
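The cleaning and encoding steps could look roughly like the sketch below. It is illustrative only: the function name `preprocess`, the choice of `Age` and `Fare` for median filling, and the specific integer codes are assumptions, not taken from the project code.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values and encode categorical columns (hypothetical helper)."""
    out = df.copy()
    # Numeric columns: fill gaps with the column median (column choice is assumed).
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].median())
    # Categorical column: fill gaps with the most frequent value (the mode).
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Encode categories as integers; these particular codes are an assumption.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return out


# Tiny made-up sample, used only to exercise the function.
sample = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Fare": [7.25, 71.28, None],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "S"],
})
clean = preprocess(sample)
print(clean)
```

Computing the median and mode on the training set and reusing those values for the test set is a common variation that keeps the two sets consistent.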
Algorithm: Logistic Regression
Libraries: pandas, numpy, scikit-learn, seaborn, matplotlib
Typical accuracy achieved: around 80% on validation data.
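A minimal sketch of training and validating a logistic regression model with scikit-learn follows. The helper name `train_and_evaluate`, the 80/20 split, the seed, and `max_iter=1000` are assumptions, and the data is synthetic so the snippet runs on its own; it does not reproduce the accuracy figure quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, seed=42):
    """Fit logistic regression and score it on a held-out validation split."""
    # Hold out 20% of the rows for validation (split ratio is an assumption).
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return model, accuracy_score(y_val, preds), classification_report(y_val, preds)


# Synthetic stand-in for the preprocessed Titanic features and labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model, acc, report = train_and_evaluate(X_demo, y_demo)
print(f"validation accuracy: {acc:.2f}")
print(report)
```

With the real data, `X` would be the encoded feature columns and `y` the `Survived` column from `train.csv`.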
- `train.csv`: Training data with labels (`Survived`)
- `test.csv`: Test data without labels
- `gender_submission.csv`: Sample submission file in Kaggle's format
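As a rough sketch, the three files can be read with pandas and a submission written in the same two-column layout as `gender_submission.csv` (`PassengerId`, `Survived`). The helper names `load_titanic` and `make_submission` and the directory argument are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir):
    """Read the three Kaggle CSV files from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    sample = pd.read_csv(data_dir / "gender_submission.csv")
    return train, test, sample


def make_submission(test, predictions, out_path="submission.csv"):
    """Pair each test PassengerId with its predicted label and save as CSV."""
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": pd.Series(predictions, index=test.index).astype(int),
    })
    # index=False keeps the file to the two expected columns.
    submission.to_csv(out_path, index=False)
    return submission
```

A typical call would be `train, test, sample = load_titanic("data")` followed by `make_submission(test, model.predict(X_test))`, where `"data"`, `model`, and `X_test` stand in for whatever the project actually uses.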
Key features include:
| Feature | Description |
|---|---|
| `Pclass` | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| `Sex` | Passenger gender |
| `Age` | Passenger age in years |
| `SibSp` | Number of siblings/spouses aboard |
| `Parch` | Number of parents/children aboard |
| `Fare` | Ticket fare |
| `Embarked` | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
- NumPy / Pandas — Data wrangling
- Matplotlib / Seaborn — Visualization
- Scikit-Learn — Machine learning modeling