Predicting College Graduate Earnings


1st Place Solution — DAT102x Data Science Capstone (October 2017)

A stacking ensemble approach to predicting the median earnings of college graduates from institutional characteristics, student demographics, and financial aid data. This solution won the capstone competition of the Microsoft Professional Data Science curriculum on edX, hosted by DrivenData.

| Metric | Score |
| --- | --- |
| Public Leaderboard RMSE | 2.8992 |
| Private Leaderboard RMSE | 2.9796 |

Why This Solution Won

The winning approach combined four key elements:

  1. Aggressive feature engineering — Synthetic features derived from domain knowledge and mathematical transformations
  2. ML-based imputation — Using ExtraTrees to predict missing values rather than simple mean/median fills
  3. Diverse ensemble architecture — 7 base models spanning different algorithm families, combined via stacking
  4. Iterative hyperparameter refinement — 20+ submissions systematically tuning each component

This README documents the complete technical approach.

Pipeline Architecture

```mermaid
flowchart LR
    subgraph Input["Raw CSV Data"]
        A1[train_values]
        A2[train_labels]
        A3[test_values]
    end

    subgraph Preprocess["Preprocessing"]
        B1[Encoding]
        B2[Imputation]
        B3[Synthesis]
        B4[Outlier Removal]
    end

    subgraph Train["Training"]
        C1[7 Base Models]
        C2[Meta-learner]
        C3[15-fold CV]
    end

    Output[("submission.csv")]

    Input --> Preprocess --> Train --> Output
```

The solution is implemented across three modules:

| Module | Purpose |
| --- | --- |
| `contestdata_preprocess.py` | Data cleaning, encoding, imputation, feature synthesis |
| `capstone_model.py` | Ensemble definition and hyperparameters |
| `submission.py` | Orchestration and prediction export |

Data Preprocessing

Categorical Encoding

Categorical variables were encoded with domain-aware ordering rather than arbitrary label encoding. For example, states were mapped based on their correlation with income outcomes:

```python
# States ordered by income correlation (highest-earning states first)
stateHash = {'NJ': 0, 'VA': 1, 'MD': 2, 'CT': 3, 'MA': 4, ...}
```

Degree types were encoded to preserve ordinal relationships:

```python
degreeHash = {
    "Graduate Degree": 0,
    "Bachelor's Degree": 1,
    "Associate's Degree": 2,
    "Certificate": 3
}
```

One-hot encoding was applied selectively to nominal categories without inherent ordering (school ownership, region).
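As an illustrative sketch of how the two encoding paths fit together (the helper function and column names here are hypothetical, not the repository's actual API): ordinal columns keep their hand-crafted integer codes, while nominal columns are expanded with `pd.get_dummies`:

```python
import pandas as pd

def encode_features(df, ordinal_maps, nominal_cols):
    """Apply domain-aware ordinal mappings, then one-hot encode nominal columns."""
    df = df.copy()
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)          # hand-ordered integer codes
    return pd.get_dummies(df, columns=nominal_cols)

df = pd.DataFrame({
    'school__degrees_awarded_predominant': ["Bachelor's Degree", 'Certificate'],
    'school__ownership': ['Public', 'Private nonprofit'],
})
degreeHash = {"Graduate Degree": 0, "Bachelor's Degree": 1,
              "Associate's Degree": 2, "Certificate": 3}
encoded = encode_features(
    df,
    {'school__degrees_awarded_predominant': degreeHash},
    ['school__ownership'],
)
```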

Missing Value Imputation

Rather than simple mean/median imputation, missing values in school__faculty_salary and student__demographics_age_entry were predicted using ExtraTreesRegressor trained on rows with complete data:

```python
from sklearn.ensemble import ExtraTreesRegressor

def interpolateFeatureValues(df, col):
    """Impute missing values in `col` by regressing on rows with complete data."""
    train = df[df[col].notnull()]
    test = df[df[col].isnull()]

    X_train = train.drop(columns=[col, 'id'])
    y_train = train[col]

    model = ExtraTreesRegressor(n_estimators=100, random_state=736283)
    model.fit(X_train, y_train)

    df.loc[df[col].isnull(), col] = model.predict(test.drop(columns=[col, 'id']))
    return df
```

This preserved relationships between features rather than injecting noise.
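A toy end-to-end run of this imputation, with the function repeated in condensed form and an illustrative four-row frame (the data is made up; only the technique matches the repository):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

def interpolateFeatureValues(df, col):
    """Fill NaNs in `col` with ExtraTrees predictions from the complete rows."""
    train, test = df[df[col].notnull()], df[df[col].isnull()]
    model = ExtraTreesRegressor(n_estimators=100, random_state=736283)
    model.fit(train.drop(columns=[col, 'id']), train[col])
    df.loc[df[col].isnull(), col] = model.predict(test.drop(columns=[col, 'id']))
    return df

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'x': [1.0, 2.0, 3.0, 2.0],
                   'school__faculty_salary': [7000.0, 8000.0, 9000.0, None]})
df = interpolateFeatureValues(df, 'school__faculty_salary')
# The missing salary is filled with a model prediction, not a column mean.
```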

Feature Engineering

Synthetic Features

New features were synthesized using mathematical transformations based on domain intuition. The synthesize() function creates derived features using quadratic-style combinations:

```python
import numpy as np

def synthesize(df, a, b, c):
    """Create synthetic feature: (-b + sqrt(b² - 4ac)) / 2a"""
    name = f'synth_{a}_{b}_{c}'
    df[name] = (-df[b] + np.sqrt(df[b]**2 - 4*df[a]*df[c])) / (2*df[a])
    return df
```

Key synthetic features included combinations of:

  • Admission rates and completion rates
  • Faculty salary and student demographics
  • Financial aid metrics and institutional characteristics
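Reusing the `synthesize()` definition above, a hypothetical call on stand-in columns looks like this (the column names and values are illustrative, not from the competition data):

```python
import numpy as np
import pandas as pd

def synthesize(df, a, b, c):
    """Quadratic-formula combination of three feature columns."""
    name = f'synth_{a}_{b}_{c}'
    df[name] = (-df[b] + np.sqrt(df[b]**2 - 4*df[a]*df[c])) / (2*df[a])
    return df

# Stand-in columns for, e.g., admission rate / completion rate / salary features
df = pd.DataFrame({'adm': [0.5, 0.25], 'comp': [-2.0, -1.5], 'sal': [1.0, 2.0]})
df = synthesize(df, 'adm', 'comp', 'sal')   # adds column 'synth_adm_comp_sal'
```

Note that triples with a negative discriminant produce NaN, so candidate combinations have to be screened before use.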

Income Binning for Classification Features

Continuous income was discretized into bins, then used to train a classifier. The predicted bin probabilities became additional features, giving the model a different "view" of the target:

```python
import pandas as pd

def createIncomeBins(income, num_bins=3):
    """Discretize continuous income into equal-width integer-labeled bins."""
    bins = pd.cut(income, bins=num_bins, labels=False)
    return bins
```

This technique adds ensemble diversity by mixing regression and classification signals.
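A hedged sketch of the bin-probability step (the classifier choice, data, and feature names here are assumptions, not taken from the repository):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(736283)
X = pd.DataFrame(rng.rand(300, 4), columns=list('abcd'))
income = X['a'] * 50 + rng.rand(300) * 5            # synthetic target

bins = pd.cut(income, bins=3, labels=False)          # createIncomeBins()
clf = ExtraTreesClassifier(n_estimators=50, random_state=736283)
clf.fit(X, bins)
proba = clf.predict_proba(X)                         # one column per bin
for i in range(proba.shape[1]):
    X[f'income_bin_p{i}'] = proba[:, i]              # classification "view"
```

In the real pipeline these probabilities would need to come from out-of-fold predictions on the training set to avoid target leakage.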

Outlier Removal

Aggressive outlier removal improved generalization:

```python
# Remove rows where high faculty salary correlates with very low income (data errors)
df = df[~((df['school__faculty_salary'] > 15000) & (df['income'] < 30))]

# Remove extreme faculty salary values (noise)
df = df[df['school__faculty_salary'] <= 20000]

# Cap unrealistic predictions
predictions[predictions > 120] = 120
```

Model Architecture

Stacking Ensemble

The core architecture is a stacking ensemble using mlxtend.StackingCVRegressor. Base model predictions become features for a meta-learner, with k-fold cross-validation preventing leakage.

```mermaid
flowchart BT
    subgraph Base["Base Models"]
        M1[AdaBoost + DT]
        M2[ExtraTrees]
        M3[XGBoost]
        M4[RandomForest]
        M5[GradientBoosting]
        M6[AdaBoost + ET]
        M7[LightGBM]
    end

    Predictions["Base Model Predictions (CV)"]
    Meta["Meta-Learner<br/>LinearRegression"]

    Base --> Predictions --> Meta
```

Base Models

Seven diverse regressors spanning different algorithm families:

| Model | Algorithm Family | Key Hyperparameters |
| --- | --- | --- |
| AdaBoost + DecisionTree | Boosted trees | `n_estimators=90, max_depth=12` |
| ExtraTreesRegressor | Bagged trees | `n_estimators=100, max_features=0.6` |
| XGBRegressor | Gradient boosting | `n_estimators=175, learning_rate=0.08` |
| RandomForestRegressor | Bagged trees | `n_estimators=50, max_depth=18` |
| GradientBoostingRegressor | Gradient boosting | `n_estimators=90, max_depth=5` |
| AdaBoost + ExtraTreeRegressor | Boosted trees | `n_estimators=70, max_depth=15` |
| LGBMRegressor | Gradient boosting | `n_estimators=90, num_leaves=50` |

Diversity across algorithm families ensures base models make uncorrelated errors—the key to effective ensembling.
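A quick numerical illustration of that claim, on synthetic data rather than the competition data: averaging two predictors with independent errors cuts RMSE by roughly 1/√2, while averaging two predictors that make identical errors gains nothing.

```python
import numpy as np

rng = np.random.RandomState(736283)
truth = rng.rand(1000) * 40
e1, e2 = rng.randn(1000), rng.randn(1000)   # two independent error streams

uncorr_avg = ((truth + e1) + (truth + e2)) / 2   # diverse "models"
corr_avg = ((truth + e1) + (truth + e1)) / 2     # duplicated "model"

def rmse(pred):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))
# rmse(uncorr_avg) is markedly lower than rmse(corr_avg)
```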

Meta-Learner

A simple LinearRegression combines base model predictions. Simplicity prevents overfitting at the meta-level:

```python
meta_regressor = LinearRegression()
# Alternative tested: ElasticNet(alpha=0.5, l1_ratio=0.5, positive=True)
```

Training Configuration

Cross-Validation Strategy

The stacking ensemble uses 15-fold cross-validation, higher than the typical 5-fold:

```python
model = build_StackingModelCV(
    cv=15,
    use_complex=True,
    use_positive_meta_bias=False
)
```

Higher fold count provides:

  • More training data per fold (each base model sees 93% of data)
  • More stable meta-learner features (15 out-of-fold predictions averaged)
  • Better generalization at the cost of training time

Reproducibility

All random states are fixed to 736283 across every model and operation:

```python
RANDOM_STATE = 736283

# Applied consistently:
ExtraTreesRegressor(random_state=RANDOM_STATE)
XGBRegressor(seed=RANDOM_STATE)
LGBMRegressor(random_state=RANDOM_STATE)
train_test_split(random_state=RANDOM_STATE)
```

This ensures identical results across runs and enables systematic hyperparameter comparison.

Hyperparameter Evolution

Parameters were tuned iteratively across 20+ submissions. Code comments reference the submission number where each configuration was validated:

```python
# Params derived from submission #8
ada_params = {'n_estimators': 90, 'learning_rate': 0.6}

# Params derived from submission #13
et_params = {'n_estimators': 100, 'max_features': 0.6}

# Params derived from submission #22
xgb_params = {'n_estimators': 175, 'learning_rate': 0.08}
```

Usage

Dependencies

```shell
pip install pandas numpy scikit-learn xgboost lightgbm mlxtend
```

Data Setup

The code expects three CSV files. Update the paths in submission.py to match your data location:

```python
# Original Windows paths (modify as needed):
train_values = pd.read_csv('D:/DataScience/train_values.csv')
train_labels = pd.read_csv('D:/DataScience/train_labels.csv')
test_values = pd.read_csv('D:/DataScience/test_values.csv')
```

Expected data format:

  • train_values.csv — Features for training samples
  • train_labels.csv — Target income values (joins on id)
  • test_values.csv — Features for prediction
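A minimal sketch of the `id` join, with two toy rows standing in for the real files:

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv
train_values = pd.DataFrame({'id': [1, 2],
                             'school__faculty_salary': [7000, 9000]})
train_labels = pd.DataFrame({'id': [1, 2], 'income': [28.0, 41.0]})

# Features and target are aligned by joining on the shared id column
train = train_values.merge(train_labels, on='id', how='inner')
```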

Running

```shell
python submission.py
```

This executes the full pipeline:

  1. Preprocesses training and test data
  2. Trains the stacking ensemble (15-fold CV)
  3. Generates predictions
  4. Exports submission.csv

Platform Note

XGBoost is configured for MinGW64 on Windows. On Linux/Mac, remove or modify:

```python
# In capstone_model.py — remove these lines on non-Windows:
mingw_path = 'C:\\Program Files\\mingw-w64\\...'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
```

License

MIT License — see LICENSE-MIT.txt


Author: Cris Benge
