# 1st Place Solution — DAT102x Data Science Capstone (October 2017)
A stacking ensemble approach to predicting median earnings of college graduates based on institutional characteristics, student demographics, and financial aid data. This solution won the Microsoft Professional Data Science curriculum capstone competition hosted by DrivenData.
| Metric | Score |
|---|---|
| Public Leaderboard RMSE | 2.8992 |
| Private Leaderboard RMSE | 2.9796 |
The winning approach combined four key elements:
- Aggressive feature engineering — Synthetic features derived from domain knowledge and mathematical transformations
- ML-based imputation — Using ExtraTrees to predict missing values rather than simple mean/median fills
- Diverse ensemble architecture — 7 base models spanning different algorithm families, combined via stacking
- Iterative hyperparameter refinement — 20+ submissions systematically tuning each component
This README documents the complete technical approach.
```mermaid
flowchart LR
    subgraph Input["Raw CSV Data"]
        A1[train_values]
        A2[train_labels]
        A3[test_values]
    end
    subgraph Preprocess["Preprocessing"]
        B1[Encoding]
        B2[Imputation]
        B3[Synthesis]
        B4[Outlier Removal]
    end
    subgraph Train["Training"]
        C1[7 Base Models]
        C2[Meta-learner]
        C3[15-fold CV]
    end
    Output[("submission.csv")]
    Input --> Preprocess --> Train --> Output
```
The solution is implemented across three modules:
| Module | Purpose |
|---|---|
| `contestdata_preprocess.py` | Data cleaning, encoding, imputation, feature synthesis |
| `capstone_model.py` | Ensemble definition and hyperparameters |
| `submission.py` | Orchestration and prediction export |
Categorical variables were encoded with domain-aware ordering rather than arbitrary label encoding. For example, states were mapped based on their correlation with income outcomes:
```python
# States ordered by income correlation (highest-earning states first)
stateHash = {'NJ': 0, 'VA': 1, 'MD': 2, 'CT': 3, 'MA': 4, ...}
```

Degree types were encoded to preserve ordinal relationships:

```python
degreeHash = {
    "Graduate Degree": 0,
    "Bachelor's Degree": 1,
    "Associate's Degree": 2,
    "Certificate": 3
}
```

One-hot encoding was applied selectively to nominal categories without inherent ordering (school ownership, region).
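As a minimal sketch of the selective one-hot step (the column names `school__ownership` and `school__region` here are illustrative, not taken verbatim from the dataset):

```python
import pandas as pd

# Illustrative frame: two nominal columns with no inherent ordering.
df = pd.DataFrame({
    'school__ownership': ['Public', 'Private nonprofit', 'Private for-profit'],
    'school__region': ['Northeast', 'Midwest', 'South'],
})

# One-hot encode only the nominal columns; ordinal columns (state, degree)
# keep their hand-crafted integer mappings from the hashes above.
df = pd.get_dummies(df, columns=['school__ownership', 'school__region'])
print(sorted(df.columns))
```

Each nominal column expands into one indicator column per category, while the ordinal mappings stay single integer columns.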
Rather than simple mean/median imputation, missing values in school__faculty_salary and student__demographics_age_entry were predicted using ExtraTreesRegressor trained on rows with complete data:
```python
from sklearn.ensemble import ExtraTreesRegressor

def interpolateFeatureValues(df, col):
    # Rows with observed values train the imputer; rows with NaNs get predictions.
    train = df[df[col].notnull()]
    test = df[df[col].isnull()]
    X_train = train.drop(columns=[col, 'id'])
    y_train = train[col]
    model = ExtraTreesRegressor(n_estimators=100, random_state=736283)
    model.fit(X_train, y_train)
    df.loc[df[col].isnull(), col] = model.predict(test.drop(columns=[col, 'id']))
    return df
```

This preserved relationships between features rather than injecting noise.
New features were synthesized using mathematical transformations based on domain intuition. The synthesize() function creates derived features using quadratic-style combinations:
```python
import numpy as np

def synthesize(df, a, b, c):
    """Create synthetic feature: (-b + sqrt(b² - 4ac)) / 2a (quadratic-formula root)."""
    name = f'synth_{a}_{b}_{c}'
    df[name] = (-df[b] + np.sqrt(df[b]**2 - 4*df[a]*df[c])) / (2*df[a])
    return df
```

Key synthetic features included combinations of:
- Admission rates and completion rates
- Faculty salary and student demographics
- Financial aid metrics and institutional characteristics
Continuous income was discretized into bins, then used to train a classifier. The predicted bin probabilities became additional features, giving the model a different "view" of the target:
```python
import pandas as pd

def createIncomeBins(income, num_bins=3):
    # Equal-width bins over the income range; labels are integer bin ids.
    return pd.cut(income, bins=num_bins, labels=False)
```

This technique adds ensemble diversity by mixing regression and classification signals.
Aggressive outlier removal improved generalization:
```python
# Remove rows where high faculty salary correlates with very low income (data errors)
df = df[~((df['school__faculty_salary'] > 15000) & (df['income'] < 30))]

# Remove extreme faculty salary values (noise)
df = df[df['school__faculty_salary'] <= 20000]

# Cap unrealistic predictions
predictions[predictions > 120] = 120
```

The core architecture is a stacking ensemble using mlxtend's `StackingCVRegressor`. Base model predictions become features for a meta-learner, with k-fold cross-validation preventing leakage.
```mermaid
flowchart BT
    subgraph Base["Base Models"]
        M1[AdaBoost + DT]
        M2[ExtraTrees]
        M3[XGBoost]
        M4[RandomForest]
        M5[GradientBoosting]
        M6[AdaBoost + ET]
        M7[LightGBM]
    end
    Predictions["Base Model Predictions (CV)"]
    Meta["Meta-Learner<br/>LinearRegression"]
    Base --> Predictions --> Meta
```
Seven diverse regressors spanning different algorithm families:
| Model | Algorithm Family | Key Hyperparameters |
|---|---|---|
| AdaBoost + DecisionTree | Boosted trees | n_estimators=90, max_depth=12 |
| ExtraTreesRegressor | Bagged trees | n_estimators=100, max_features=0.6 |
| XGBRegressor | Gradient boosting | n_estimators=175, learning_rate=0.08 |
| RandomForestRegressor | Bagged trees | n_estimators=50, max_depth=18 |
| GradientBoostingRegressor | Gradient boosting | n_estimators=90, max_depth=5 |
| AdaBoost + ExtraTreeRegressor | Boosted trees | n_estimators=70, max_depth=15 |
| LGBMRegressor | Gradient boosting | n_estimators=90, num_leaves=50 |
Diversity across algorithm families ensures base models make uncorrelated errors—the key to effective ensembling.
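One way to sanity-check this diversity claim (not part of the original code) is to measure how correlated the base models' out-of-fold errors are, shown here on synthetic data with three of the families:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=736283)

models = {
    'extratrees': ExtraTreesRegressor(n_estimators=50, random_state=736283),
    'rf': RandomForestRegressor(n_estimators=50, random_state=736283),
    'gbm': GradientBoostingRegressor(n_estimators=50, random_state=736283),
}

# Out-of-fold predictions per model, then pairwise correlation of their errors.
preds = pd.DataFrame({name: cross_val_predict(m, X, y, cv=5)
                      for name, m in models.items()})
errors = preds.sub(pd.Series(y), axis=0)
print(errors.corr().round(2))
```

Off-diagonal values well below 1.0 indicate the models fail on different rows, which is exactly what the meta-learner exploits.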
A simple LinearRegression combines base model predictions. Simplicity prevents overfitting at the meta-level:
```python
meta_regressor = LinearRegression()
# Alternative tested: ElasticNet(alpha=0.5, l1_ratio=0.5, positive=True)
```

The stacking ensemble uses 15-fold cross-validation, higher than the typical 5-fold:
```python
model = build_StackingModelCV(
    cv=15,
    use_complex=True,
    use_positive_meta_bias=False
)
```

A higher fold count provides:
- More training data per fold (each base model sees 93% of data)
- More stable meta-learner features (15 out-of-fold predictions averaged)
- Better generalization at the cost of training time
All random states are fixed to 736283 across every model and operation:
```python
RANDOM_STATE = 736283

# Applied consistently:
ExtraTreesRegressor(random_state=RANDOM_STATE)
XGBRegressor(seed=RANDOM_STATE)
LGBMRegressor(random_state=RANDOM_STATE)
train_test_split(random_state=RANDOM_STATE)
```

This ensures identical results across runs and enables systematic hyperparameter comparison.
Parameters were tuned iteratively across 20+ submissions. Code comments reference the submission number where each configuration was validated:
```python
# Params derived from submission #8
ada_params = {'n_estimators': 90, 'learning_rate': 0.6}

# Params derived from submission #13
et_params = {'n_estimators': 100, 'max_features': 0.6}

# Params derived from submission #22
xgb_params = {'n_estimators': 175, 'learning_rate': 0.08}
```

Install the dependencies:

```bash
pip install pandas numpy scikit-learn xgboost lightgbm mlxtend
```

The code expects three CSV files. Update the paths in `submission.py` to match your data location:
```python
# Original Windows paths (modify as needed):
train_values = pd.read_csv('D:/DataScience/train_values.csv')
train_labels = pd.read_csv('D:/DataScience/train_labels.csv')
test_values = pd.read_csv('D:/DataScience/test_values.csv')
```

Expected data format:

- `train_values.csv` — Features for training samples
- `train_labels.csv` — Target income values (joins on `id`)
- `test_values.csv` — Features for prediction
```bash
python submission.py
```

This executes the full pipeline:
- Preprocesses training and test data
- Trains the stacking ensemble (15-fold CV)
- Generates predictions
- Exports `submission.csv`
XGBoost is configured for MinGW64 on Windows. On Linux/Mac, remove or modify:
```python
# In capstone_model.py — remove these lines on non-Windows:
mingw_path = 'C:\\Program Files\\mingw-w64\\...'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
```

- Competition: DAT102x: Predict Student Earnings
- Final Leaderboard: DAT102x: Leaderboard
- Detailed Analysis: Analysis of College Graduate Earnings.pdf — Full methodology, visualizations, and findings
MIT License — see LICENSE-MIT.txt
Author: Cris Benge