This prototype tries two approaches available on the Microsoft Azure ML platform to optimize an ML pipeline:
- HyperDrive: Build an Azure ML pipeline using the Python SDK and a Scikit-Learn model, then use HyperDrive to search for the best hyperparameters
- Automated Machine Learning (AutoML): On the same dataset, use Azure AutoML to select the best ML algorithm and its hyperparameters
The results of the two methods are then compared to see which approach performs better.
The dataset originates from the UCI Machine Learning Repository and is also hosted on Microsoft Azure Open Datasets.
The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Dataset details:
Input variables:
Bank client data:
- age (numeric)
- job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- default: has credit in default? (categorical: 'no','yes','unknown')
- housing: has housing loan? (categorical: 'no','yes','unknown')
- loan: has personal loan? (categorical: 'no','yes','unknown')
Related to the last contact of the current campaign:
- contact: contact communication type (categorical: 'cellular','telephone')
- month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Social and economic context attributes:
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
- y - has the client subscribed a term deposit? (binary: 'yes', 'no')
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). The AutoML approach gave slightly better performance (accuracy = 91.66%) than the HyperDrive approach (accuracy = 91.29%).
The pipeline consists of a custom training module in Python (./script/train.py). This module uses Scikit-Learn to build a LogisticRegression model and trains it with 2 hyperparameters:
- Regularization Strength (C; in Scikit-Learn this is the inverse of regularization strength, so smaller values regularize more)
- Max iterations (max_iter)
It then scores model performance using the accuracy metric.
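The training step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of train.py: the function name `train_and_score`, the 80/20 split, and the synthetic data from `make_classification` are all assumptions made so the example runs on its own, and the hyperparameter values are simply reused from the best HyperDrive run reported in this document.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(X, y, C=1.0, max_iter=100, seed=0):
    """Fit LogisticRegression with the two tuned hyperparameters.

    Returns the fitted model and its accuracy on a held-out split.
    The 80/20 split is an assumption, not taken from train.py.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Synthetic stand-in data (NOT the bank marketing dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model, accuracy = train_and_score(X, y, C=65.45, max_iter=217)
print(f"Accuracy: {accuracy:.4f}")
```

In the real pipeline, C and max_iter would be supplied by HyperDrive for each child run rather than hard-coded.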
A HyperDrive run is configured with 4 main settings:
- Parameter Sampling: Defines the method used to navigate the hyperparameter space. Here the RandomParameterSampling class is selected, since a random sweep typically performs about as well as grid search while taking much less time. The 2 hyperparameters are sampled as follows:
  - C: random float between 0.01 and 100, drawn from a uniform distribution
  - max_iter: random integer between 10 and 500
- Early Termination Policy: Defines rules to terminate poorly performing runs in order to improve running time and computational efficiency. The Bandit policy is used since it helps eliminate poor runs more quickly. With slack_factor = 0.1 and delay_evaluation = 5, the policy terminates any run after the 5th interval whose metric is less than 1 / (1 + 0.1), or about 91%, of the metric of the current best-performing run.
- Scikit-Learn Estimator: Declares where to find the custom training module and which compute target is used to execute a run
- Metrics and Optimizing Goal: Since the training module uses accuracy as the primary metric to score model performance, HyperDrive needs to maximize the same metric
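The sampling ranges and the Bandit cutoff arithmetic above can be illustrated in plain Python. This is only a sketch of the logic, not the Azure ML implementation: the helper names are hypothetical, the integer range is assumed to be inclusive at both ends, and "after the 5th interval" is modeled as `interval > delay_evaluation`.

```python
import random


def sample_hyperparameters(rng):
    """Draw one random configuration from the search space described above."""
    return {
        "C": rng.uniform(0.01, 100),       # uniform float in [0.01, 100]
        "max_iter": rng.randint(10, 500),  # integer, both ends assumed inclusive
    }


def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric a run may report without being terminated (maximizing)."""
    return best_metric / (1 + slack_factor)


def should_terminate(run_metric, best_metric, interval,
                     slack_factor=0.1, delay_evaluation=5):
    """Apply the cutoff only once the delay period has passed."""
    if interval <= delay_evaluation:
        return False
    return run_metric < bandit_cutoff(best_metric, slack_factor)


rng = random.Random(0)
config = sample_hyperparameters(rng)
print(config)

# With a best accuracy of 0.9129, the cutoff is 0.9129 / 1.1, roughly 0.83.
print(round(bandit_cutoff(0.9129, 0.1), 4))
print(should_terminate(0.82, 0.9129, interval=6))  # below cutoff, past the delay
print(should_terminate(0.82, 0.9129, interval=3))  # still inside the delay period
```

Because accuracy values for this dataset cluster fairly close together, a slack factor of 0.1 is a lenient cutoff: only runs that fall well below the leader are stopped early.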
An experiment takes a HyperDrive configuration and starts a new HyperDrive main run on the desired training compute cluster. The main run then triggers multiple child runs, each assigned different hyperparameters picked by the provided parameter sampler.
The best run used the following hyperparameters and reached the following accuracy:
- Regularization Strength: 65.45
- Max iterations: 217
- Accuracy: 0.9129
The AutoML run tried multiple ML algorithms and different hyperparameters for each algorithm.
The best model is XGBoostClassifier with the following hyperparameters:
- colsample_bytree = 1
- eta = 0.2
- gamma = 0.1
- learning_rate = 0.1
- max_delta_step = 0
- max_depth = 6
- max_leaves = 3
- min_child_weight = 1
- missing = None
- n_estimators = 50
- n_jobs = 1
- nthread = None
- objective = 'reg:logistic'
- random_state = 0
- reg_alpha = 0
- reg_lambda = 1.7708333333333335
- scale_pos_weight = 1
- seed = None
- silent = None
- subsample = 0.9
- tree_method = 'auto'
- verbose = -10
- verbosity = 0
The top 5 features by global importance, in order, are:
- duration
- nr.employed
- emp.var.rate
- euribor3m
- month
The AutoML pipeline was configured to run for 30 minutes. It actually took 41m 23s to complete on a training cluster of 4 nodes, while the HyperDrive pipeline, configured with the early termination policy above, took 23m 39s.
AutoML offered slightly better accuracy (91.66% vs. 91.29% from HyperDrive), likely because the StandardScalerWrapper/XGBoostClassifier pipeline is a more expressive model than Scikit-Learn LogisticRegression. The accuracy might be higher if AutoML were allowed to run longer with the Deep Learning option enabled.
- Try running AutoML on a GPU-based cluster with Deep Learning enabled: Deep learning models can outperform traditional ML algorithms on some problems, and a GPU cluster can reduce training time significantly.
- Configure the run environment via a pip/conda environment.yml: When running the experiments from a local computer, several Python package conflicts occurred. This would be a big issue in production. Controlling the run environment with requirements.txt and/or environment.yml makes moving between environments more predictable.
- Find a way to debug train.py locally. This might require decoupling the retrieval of the AzureML Run object from the train.py file.
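One possible way to approach the last item is to pass the metric logger into the training code instead of fetching the Run object inside it. The sketch below is an assumption about how that could look, not the project's current code: `make_metric_logger` and `run_training` are hypothetical names, the metric name "Accuracy" is illustrative, and the accuracy value is a placeholder. It relies only on `Run.get_context()` and `Run.log(name, value)` from the azureml-core SDK, and falls back to printing when the SDK is not installed.

```python
from typing import Callable


def make_metric_logger() -> Callable[[str, float], None]:
    """Return a log(name, value) callable.

    Uses the AzureML Run when the SDK is importable; otherwise
    falls back to a plain print so the script can run locally.
    """
    try:
        from azureml.core import Run
    except ImportError:
        def log_locally(name: str, value: float) -> None:
            print(f"[local] {name} = {value}")
        return log_locally
    return Run.get_context().log


def run_training(log_metric: Callable[[str, float], None]) -> float:
    """Hypothetical stand-in for the body of train.py."""
    accuracy = 0.5  # placeholder value, not a real result
    log_metric("Accuracy", accuracy)
    return accuracy


if __name__ == "__main__":
    run_training(make_metric_logger())
```

With this structure, a local debugging session or a unit test can call `run_training` with any callable, for example one that appends to a list, without touching Azure at all.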



