Predicting College Graduate Earnings


1st Place Solution — DAT102x Data Science Capstone (October 2017)

A stacking ensemble approach to predicting the median earnings of college graduates from institutional characteristics, student demographics, and financial aid data. This solution won the capstone competition of the Microsoft Professional Data Science curriculum on edX, hosted by DrivenData.

| Metric | Score |
| --- | --- |
| Public Leaderboard RMSE | 2.8992 |
| Private Leaderboard RMSE | 2.9796 |

Why This Solution Won

The winning approach combined four key elements:

  1. Aggressive feature engineering — Synthetic features derived from domain knowledge and mathematical transformations
  2. ML-based imputation — Using ExtraTrees to predict missing values rather than simple mean/median fills
  3. Diverse ensemble architecture — 7 base models spanning different algorithm families, combined via stacking
  4. Iterative hyperparameter refinement — 20+ submissions systematically tuning each component

This README documents the complete technical approach.

Pipeline Architecture

```mermaid
flowchart LR
    subgraph Input["Raw CSV Data"]
        A1[train_values]
        A2[train_labels]
        A3[test_values]
    end

    subgraph Preprocess["Preprocessing"]
        B1[Encoding]
        B2[Imputation]
        B3[Synthesis]
        B4[Outlier Removal]
    end

    subgraph Train["Training"]
        C1[7 Base Models]
        C2[Meta-learner]
        C3[15-fold CV]
    end

    Output[("submission.csv")]

    Input --> Preprocess --> Train --> Output
```

The solution is implemented across three modules:

| Module | Purpose |
| --- | --- |
| `contestdata_preprocess.py` | Data cleaning, encoding, imputation, feature synthesis |
| `capstone_model.py` | Ensemble definition and hyperparameters |
| `submission.py` | Orchestration and prediction export |

Data Preprocessing

Categorical Encoding

Categorical variables were encoded with domain-aware ordering rather than arbitrary label encoding. For example, states were mapped based on their correlation with income outcomes:

```python
# States ordered by income correlation (highest-earning states first)
stateHash = {'NJ': 0, 'VA': 1, 'MD': 2, 'CT': 3, 'MA': 4, ...}
```

Degree types were encoded to preserve ordinal relationships:

```python
degreeHash = {
    "Graduate Degree": 0,
    "Bachelor's Degree": 1,
    "Associate's Degree": 2,
    "Certificate": 3
}
```

One-hot encoding was applied selectively to nominal categories without inherent ordering (school ownership, region).
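As an illustrative sketch of how the two encoding paths fit together (the helper function and column names here are hypothetical, not the repository's actual API): ordinal columns keep their hand-crafted integer codes, while nominal columns are expanded with `pd.get_dummies`:

```python
import pandas as pd

def encode_features(df, ordinal_maps, nominal_cols):
    """Apply domain-aware ordinal mappings, then one-hot encode nominal columns."""
    df = df.copy()
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)          # hand-ordered integer codes
    return pd.get_dummies(df, columns=nominal_cols)

df = pd.DataFrame({
    'school__degrees_awarded_predominant': ["Bachelor's Degree", 'Certificate'],
    'school__ownership': ['Public', 'Private nonprofit'],
})
degreeHash = {"Graduate Degree": 0, "Bachelor's Degree": 1,
              "Associate's Degree": 2, "Certificate": 3}
encoded = encode_features(
    df,
    {'school__degrees_awarded_predominant': degreeHash},
    ['school__ownership'],
)
```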

Missing Value Imputation

Rather than simple mean/median imputation, missing values in school__faculty_salary and student__demographics_age_entry were predicted using ExtraTreesRegressor trained on rows with complete data:

```python
from sklearn.ensemble import ExtraTreesRegressor

def interpolateFeatureValues(df, col):
    """Impute missing values in `col` by regressing on rows with complete data."""
    train = df[df[col].notnull()]
    test = df[df[col].isnull()]

    X_train = train.drop(columns=[col, 'id'])
    y_train = train[col]

    model = ExtraTreesRegressor(n_estimators=100, random_state=736283)
    model.fit(X_train, y_train)

    df.loc[df[col].isnull(), col] = model.predict(test.drop(columns=[col, 'id']))
    return df
```

This preserved relationships between features rather than injecting noise.
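A toy end-to-end run of this imputation, with the function repeated in condensed form and an illustrative four-row frame (the data is made up; only the technique matches the repository):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

def interpolateFeatureValues(df, col):
    """Fill NaNs in `col` with ExtraTrees predictions from the complete rows."""
    train, test = df[df[col].notnull()], df[df[col].isnull()]
    model = ExtraTreesRegressor(n_estimators=100, random_state=736283)
    model.fit(train.drop(columns=[col, 'id']), train[col])
    df.loc[df[col].isnull(), col] = model.predict(test.drop(columns=[col, 'id']))
    return df

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'x': [1.0, 2.0, 3.0, 2.0],
                   'school__faculty_salary': [7000.0, 8000.0, 9000.0, None]})
df = interpolateFeatureValues(df, 'school__faculty_salary')
# The missing salary is filled with a model prediction, not a column mean.
```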

Feature Engineering

Synthetic Features

New features were synthesized using mathematical transformations based on domain intuition. The synthesize() function creates derived features using quadratic-style combinations:

```python
import numpy as np

def synthesize(df, a, b, c):
    """Create synthetic feature: (-b + sqrt(b² - 4ac)) / 2a"""
    name = f'synth_{a}_{b}_{c}'
    df[name] = (-df[b] + np.sqrt(df[b]**2 - 4*df[a]*df[c])) / (2*df[a])
    return df
```

Key synthetic features included combinations of:

  • Admission rates and completion rates
  • Faculty salary and student demographics
  • Financial aid metrics and institutional characteristics
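Reusing the `synthesize()` definition above, a hypothetical call on stand-in columns looks like this (the column names and values are illustrative, not from the competition data):

```python
import numpy as np
import pandas as pd

def synthesize(df, a, b, c):
    """Quadratic-formula combination of three feature columns."""
    name = f'synth_{a}_{b}_{c}'
    df[name] = (-df[b] + np.sqrt(df[b]**2 - 4*df[a]*df[c])) / (2*df[a])
    return df

# Stand-in columns for, e.g., admission rate / completion rate / salary features
df = pd.DataFrame({'adm': [0.5, 0.25], 'comp': [-2.0, -1.5], 'sal': [1.0, 2.0]})
df = synthesize(df, 'adm', 'comp', 'sal')   # adds column 'synth_adm_comp_sal'
```

Note that triples with a negative discriminant produce NaN, so candidate combinations have to be screened before use.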

Income Binning for Classification Features

Continuous income was discretized into bins, then used to train a classifier. The predicted bin probabilities became additional features, giving the model a different "view" of the target:

```python
import pandas as pd

def createIncomeBins(income, num_bins=3):
    """Discretize continuous income into equal-width integer-labeled bins."""
    bins = pd.cut(income, bins=num_bins, labels=False)
    return bins
```

This technique adds ensemble diversity by mixing regression and classification signals.
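A hedged sketch of the bin-probability step (the classifier choice, data, and feature names here are assumptions, not taken from the repository):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(736283)
X = pd.DataFrame(rng.rand(300, 4), columns=list('abcd'))
income = X['a'] * 50 + rng.rand(300) * 5            # synthetic target

bins = pd.cut(income, bins=3, labels=False)          # createIncomeBins()
clf = ExtraTreesClassifier(n_estimators=50, random_state=736283)
clf.fit(X, bins)
proba = clf.predict_proba(X)                         # one column per bin
for i in range(proba.shape[1]):
    X[f'income_bin_p{i}'] = proba[:, i]              # classification "view"
```

In the real pipeline these probabilities would need to come from out-of-fold predictions on the training set to avoid target leakage.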

Outlier Removal

Aggressive outlier removal improved generalization:

```python
# Remove rows where high faculty salary correlates with very low income (data errors)
df = df[~((df['school__faculty_salary'] > 15000) & (df['income'] < 30))]

# Remove extreme faculty salary values (noise)
df = df[df['school__faculty_salary'] <= 20000]

# Cap unrealistic predictions
predictions[predictions > 120] = 120
```

Model Architecture

Stacking Ensemble

The core architecture is a stacking ensemble using mlxtend.StackingCVRegressor. Base model predictions become features for a meta-learner, with k-fold cross-validation preventing leakage.

```mermaid
flowchart BT
    subgraph Base["Base Models"]
        M1[AdaBoost + DT]
        M2[ExtraTrees]
        M3[XGBoost]
        M4[RandomForest]
        M5[GradientBoosting]
        M6[AdaBoost + ET]
        M7[LightGBM]
    end

    Predictions["Base Model Predictions (CV)"]
    Meta["Meta-Learner<br/>LinearRegression"]

    Base --> Predictions --> Meta
```

Base Models

Seven diverse regressors spanning different algorithm families:

| Model | Algorithm Family | Key Hyperparameters |
| --- | --- | --- |
| AdaBoost + DecisionTree | Boosted trees | `n_estimators=90, max_depth=12` |
| ExtraTreesRegressor | Bagged trees | `n_estimators=100, max_features=0.6` |
| XGBRegressor | Gradient boosting | `n_estimators=175, learning_rate=0.08` |
| RandomForestRegressor | Bagged trees | `n_estimators=50, max_depth=18` |
| GradientBoostingRegressor | Gradient boosting | `n_estimators=90, max_depth=5` |
| AdaBoost + ExtraTreeRegressor | Boosted trees | `n_estimators=70, max_depth=15` |
| LGBMRegressor | Gradient boosting | `n_estimators=90, num_leaves=50` |

Diversity across algorithm families ensures base models make uncorrelated errors—the key to effective ensembling.
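A quick numerical illustration of that claim, on synthetic data rather than the competition data: averaging two predictors with independent errors cuts RMSE by roughly 1/√2, while averaging two predictors that make identical errors gains nothing.

```python
import numpy as np

rng = np.random.RandomState(736283)
truth = rng.rand(1000) * 40
e1, e2 = rng.randn(1000), rng.randn(1000)   # two independent error streams

uncorr_avg = ((truth + e1) + (truth + e2)) / 2   # diverse "models"
corr_avg = ((truth + e1) + (truth + e1)) / 2     # duplicated "model"

def rmse(pred):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))
# rmse(uncorr_avg) is markedly lower than rmse(corr_avg)
```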

Meta-Learner

A simple LinearRegression combines base model predictions. Simplicity prevents overfitting at the meta-level:

```python
meta_regressor = LinearRegression()
# Alternative tested: ElasticNet(alpha=0.5, l1_ratio=0.5, positive=True)
```

Training Configuration

Cross-Validation Strategy

The stacking ensemble uses 15-fold cross-validation, higher than the typical 5-fold:

```python
model = build_StackingModelCV(
    cv=15,
    use_complex=True,
    use_positive_meta_bias=False
)
```

Higher fold count provides:

  • More training data per fold (each base model sees 93% of data)
  • More stable meta-learner features (15 out-of-fold predictions averaged)
  • Better generalization at the cost of training time

Reproducibility

All random states are fixed to 736283 across every model and operation:

```python
RANDOM_STATE = 736283

# Applied consistently:
ExtraTreesRegressor(random_state=RANDOM_STATE)
XGBRegressor(seed=RANDOM_STATE)
LGBMRegressor(random_state=RANDOM_STATE)
train_test_split(random_state=RANDOM_STATE)
```

This ensures identical results across runs and enables systematic hyperparameter comparison.

Hyperparameter Evolution

Parameters were tuned iteratively across 20+ submissions. Code comments reference the submission number where each configuration was validated:

```python
# Params derived from submission #8
ada_params = {'n_estimators': 90, 'learning_rate': 0.6}

# Params derived from submission #13
et_params = {'n_estimators': 100, 'max_features': 0.6}

# Params derived from submission #22
xgb_params = {'n_estimators': 175, 'learning_rate': 0.08}
```

Usage

Dependencies

```shell
pip install pandas numpy scikit-learn xgboost lightgbm mlxtend
```

Data Setup

The code expects three CSV files. Update the paths in submission.py to match your data location:

```python
# Original Windows paths (modify as needed):
train_values = pd.read_csv('D:/DataScience/train_values.csv')
train_labels = pd.read_csv('D:/DataScience/train_labels.csv')
test_values = pd.read_csv('D:/DataScience/test_values.csv')
```

Expected data format:

  • train_values.csv — Features for training samples
  • train_labels.csv — Target income values (joins on id)
  • test_values.csv — Features for prediction
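A minimal sketch of the `id` join, with two toy rows standing in for the real files:

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv
train_values = pd.DataFrame({'id': [1, 2],
                             'school__faculty_salary': [7000, 9000]})
train_labels = pd.DataFrame({'id': [1, 2], 'income': [28.0, 41.0]})

# Features and target are aligned by joining on the shared id column
train = train_values.merge(train_labels, on='id', how='inner')
```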

Running

```shell
python submission.py
```

This executes the full pipeline:

  1. Preprocesses training and test data
  2. Trains the stacking ensemble (15-fold CV)
  3. Generates predictions
  4. Exports submission.csv

Platform Note

XGBoost is configured for MinGW64 on Windows. On Linux/Mac, remove or modify:

```python
# In capstone_model.py — remove these lines on non-Windows:
mingw_path = 'C:\\Program Files\\mingw-w64\\...'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
```

License

MIT License — see LICENSE-MIT.txt


Author: Cris Benge
