Binary Classification with Apache Spark / HDFS

The goal of the competition (the data source) is to predict which parts will fail quality control. My goal is to use the Hadoop ecosystem to handle a large data set and establish a machine learning pipeline.

munge:
- Aggregate columns using RDD transformations.
- Create a column that indicates which of those column aggregations are outliers.

fit_predict:
- Model the data with the Spark machine learning package.
- Predict on the test data.

munge_fit_predict:
- Run this as is to use the toy data set example.
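
As a rough illustration of the munge step, here is a minimal plain-Python sketch of the per-row logic that the RDD transformations might apply. It is not the repository's code. The row layout `(part_id, values)`, the helper names `aggregate_row` and `flag_outliers`, the choice of sum and non-missing count as aggregations, and the 2-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, pstdev


def aggregate_row(row):
    """Collapse one row's numeric columns into (part_id, total, non-missing count).

    Hypothetical row layout: (part_id, [value_or_None, ...]).
    """
    part_id, values = row
    present = [v for v in values if v is not None]
    return part_id, sum(present), len(present)


def flag_outliers(aggregated, k=2.0):
    """Append a 0/1 outlier flag to each aggregated row.

    A row is flagged when its total lies more than k population standard
    deviations from the mean total (an assumed rule, chosen for illustration).
    """
    totals = [total for _, total, _ in aggregated]
    mu = mean(totals)
    sigma = pstdev(totals)
    return [
        (pid, total, n, int(sigma > 0 and abs(total - mu) > k * sigma))
        for pid, total, n in aggregated
    ]


# Toy rows (made up for this sketch): five typical parts and one extreme part.
toy_rows = [
    (1, [1.0, 2.0, None]),
    (2, [1.5, 1.5, 0.0]),
    (3, [3.0, None, None]),
    (4, [1.0, 1.0, 1.0]),
    (5, [2.0, 1.0, None]),
    (6, [50.0, 50.0, None]),
]

# In Spark, the per-row function could instead be passed to RDD.map,
# e.g. rdd.map(aggregate_row); here the built-in map stands in for it.
aggregated = list(map(aggregate_row, toy_rows))
flagged = flag_outliers(aggregated)
```

In a Spark job the mean and standard deviation would more likely come from a distributed computation such as `RDD.stats()` than from a local list. The local version above only shows the shape of the data before and after each step.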