<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anilkumarpanda.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anilkumarpanda.github.io/" rel="alternate" type="text/html" /><updated>2025-04-04T19:45:45+00:00</updated><id>https://anilkumarpanda.github.io/feed.xml</id><title type="html">Anilkumar Panda</title><subtitle>Data Scientist</subtitle><entry><title type="html">Error Analysis for Tabular Data - Part-2 (Residual Analysis)</title><link href="https://anilkumarpanda.github.io/ResidualAnalysis/" rel="alternate" type="text/html" title="Error Analysis for Tabular Data - Part-2 (Residual Analysis)" /><published>2022-04-15T00:00:00+00:00</published><updated>2022-04-15T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/ResidualAnalysis</id><content type="html" xml:base="https://anilkumarpanda.github.io/ResidualAnalysis/"><![CDATA[<p>Residuals refer to the difference between the recorded value of a dependent variable and the predicted value of a dependent variable for every row in a data set. Plotting the residual values against the predicted values is a time-honoured model assessment technique and a great way to see all your modelling results in two dimensions.</p>
<p>Analysing residuals can give us great insights into model behaviour, both at a global level and at a feature level. It can reveal, for example, which data points our model is making mistakes on.</p>
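<p>As a minimal sketch of the quantity being plotted (toy arrays, not the Lending Club data): for a binary classifier, the residual of each row is the actual class minus the predicted probability.</p>

```python
import numpy as np

def residuals(y_true, y_pred_proba):
    # Residual for a binary classifier: actual class minus predicted probability.
    return np.asarray(y_true, dtype=float) - np.asarray(y_pred_proba, dtype=float)

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.3, 0.8])
res = residuals(y_true, y_prob)  # [0.1, -0.2, 0.7, -0.8]
```

<p>Rows with a large absolute residual (the last two here) are the ones worth investigating.</p>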
<p>I will explore two methods for residual analysis:</p>
<ol>
<li>Residual analysis on global &amp; feature level</li>
<li>Modelling of residuals</li>
</ol>
<p>For demonstration purposes, I will explore a model created on Lending Club data. The data is available on <a href="https://www.kaggle.com/datasets/wordsforthewise/lending-club">Kaggle</a>. The task is to create a binary classification model to predict loan defaults at the time of loan origination. Such models are also known as application models/score models.</p>
<p>A lightgbm model was created. The details of the model can be found below.
<img src="../images/ra_lc_model_scorecard.png" alt="lending club model" />
Now that we have a model, let’s start with the residual analysis.</p>
<h3 id="residual-analysis-on-global-level">Residual Analysis on Global Level</h3>
<p>Here I will explore the model behaviour at the global level. The plot shows the model residuals against the actual values (classes) of the dependent variable.</p>
<p><img src="../images/ra_global.png" alt="global residuals" /></p>
<p>We see that the model is indeed making quite a few mistakes (large residuals), especially for <code>loan_status==1</code>. This raises a few interesting questions:</p>
<ul>
<li>Who are these customers?</li>
<li>Why is the model so wrong about them?</li>
<li>Are they somehow exerting undue influence on other predictions?</li>
</ul>
<p>To answer these questions we can also conduct residual analysis on the feature level.</p>
<h3 id="residual-analysis-on-feature-level">Residual Analysis on Feature Level</h3>
<p>In this method, we will plot the residuals against the predicted values of the dependent variable for each feature (or at least a few important features).
In our case, let us take the example of the <code>grade</code> feature.</p>
<p>Intuitively, the higher the grade, the lower the chance of default. However, on closer analysis we see that the model is making mistakes for customers who have a grade of <code>A</code> or <code>B</code>, i.e. producing high residuals.</p>
<p><img src="../images/ra_feature_level.png" alt="ra_grade" /></p>
<p>This is an interesting observation. It means that the model makes more errors when an application has a higher assigned grade.
A similar analysis can also be done for numerical features, where we bin the variable and look at the residuals for each bin.</p>
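<p>The per-feature view can be reproduced with a simple group-by. A sketch on synthetic data (the column names and values here are stand-ins, not the real Lending Club schema):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "grade": rng.choice(list("ABCD"), size=1000),
    "residual": rng.normal(0, 0.3, size=1000),
})
# Mean absolute residual per grade shows where the model errs most.
per_grade = df.groupby("grade")["residual"].agg(lambda r: r.abs().mean())

# For numerical features, bin first and aggregate per bin.
df["dti_bin"] = pd.cut(rng.uniform(0, 40, size=1000), bins=[0, 10, 20, 30, 40])
per_bin = df.groupby("dti_bin", observed=True)["residual"].agg(lambda r: r.abs().mean())
```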
<p>Residual analysis at the feature level can be used to understand the model behaviour in a univariate way.
Can we identify more such failure modes for the model? A definite pattern in which the model is making mistakes?</p>
<h3 id="modelling-of-residuals">Modelling of Residuals</h3>
<p>Modelling residuals with interpretable models is another great way to learn more about the mistakes your AI system could make, e.g. a decision tree model of the example ML model’s residuals for loan_default = 1, i.e. customers who default. While it reflects what was discovered before, it does so in a very direct way that exposes the logic of the failures. In fact, it’s even possible to build programmatic rules about when the model is likely to fail the worst.</p>
<p><img src="../images/ra_modeling_residuals.png" alt="residual_modelling" /></p>
<ul>
<li>
<p>The model shows high residuals when the loan <code>Grade = A and DTI &lt;= 20</code>, essentially when a <code>good looking</code> application fails.</p>
</li>
<li>
<p>This indicates that the above features alone are not good enough. We need to create features that capture deeper customer behaviours.</p>
</li>
</ul>
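<p>A sketch of this residual-modelling step, on synthetic data that mimics the failure mode above (the feature names and the injected signal are made up for illustration):</p>

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "grade_A": rng.integers(0, 2, 500),  # 1 if the loan has grade A
    "dti": rng.uniform(0, 40, 500),
})
# Synthetic residuals: larger when grade is A and DTI <= 20.
resid = 0.1 + 0.5 * X["grade_A"] * (X["dti"] <= 20) + rng.normal(0, 0.05, 500)

# A shallow, interpretable tree fit on the residual magnitude
# exposes the rule behind the worst failures.
surrogate = DecisionTreeRegressor(max_depth=2, random_state=0)
surrogate.fit(X, resid.abs())
print(export_text(surrogate, feature_names=list(X.columns)))
```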
<p>To summarise, residual analysis is a great way to understand model behaviour. These simple plots can help you uncover pockets of data that are likely to be the cause of the model’s failures.
Once you identify these you can do one or more of the following:</p>
<ol>
<li>Analyse the default rates in grades A, B, and C where the DTI is less than 20.
It could be that most of the defaults come from lower-grade applications,
so the ML model gives more weight to those defaults. Providing higher weights to defaults with grades A, B, and C could help.
This is indeed the case: when we give more weight to applications from grades A, B, and C compared to others, the model performance increases slightly from 0.68 to 0.69.</li>
<li>Engineer new features</li>
<li>Collect more data wherever possible</li>
<li>Create business rules at inference time.</li>
</ol>
<h4 id="further-analysis">Further Analysis</h4>
<p>The code to produce the plots can be found in the <a href="https://github.com/anilkumarpanda/blog_series/blob/master/notebooks/02_residual_analysis.ipynb">repo</a>.</p>
<p>A big shout out to <a href="https://github.com/jphall663">Patrick Hall</a> for the great work on <a href="https://nbviewer.org/github/jphall663/GWU_rml/blob/master/lecture_5.ipynb">model debugging</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Residuals refer to the difference between the recorded value of a dependent variable and the predicted value of a dependent variable for every row in a data set. Plotting the residual values against the predicted values is a time-honoured model assessment technique and a great way to see all your modelling results in two dimensions.]]></summary></entry><entry><title type="html">Monotonic Constraints for Boosting Models.</title><link href="https://anilkumarpanda.github.io/MonotonicConstraints/" rel="alternate" type="text/html" title="Monotonic Constraints for Boosting Models." /><published>2022-03-29T00:00:00+00:00</published><updated>2022-03-29T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/MonotonicConstraints</id><content type="html" xml:base="https://anilkumarpanda.github.io/MonotonicConstraints/"><![CDATA[<p>Boosting models like XGBoost and LightGBM have become go-to models for tabular datasets. This is because these models are highly performant and are able to learn various non-linear relationships within the data. However, in the real world, a highly performant model can sometimes turn out to be non-intuitive. There are also requirements to include domain knowledge in ML models.</p>
<p>Monotonicity constraints (MC) help us achieve the above objectives. Adding MCs helps with:</p>
<ul>
<li>Creating <code>interpretable</code> boosting models.</li>
<li>Incorporating domain knowledge into the ML models.</li>
</ul>
<p>This generally leads to models with less overfitting.
In this blog post, we will compare two models created on the same dataset, without and with MC. We will compare the model performances and further improve on them if required.</p>
<p>To be continued...</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Boosting models like XGBoost and LightGBM have become go-to models for tabular datasets. This is because these models are highly performant and are able to learn various non-linear relationships within the data. However, in the real world, a highly performant model can sometimes turn out to be non-intuitive. There are also requirements to include domain knowledge in ML models.]]></summary></entry><entry><title type="html">ML Model Metric Credibility</title><link href="https://anilkumarpanda.github.io/MetricCredibility/" rel="alternate" type="text/html" title="ML Model Metric Credibility" /><published>2022-03-09T00:00:00+00:00</published><updated>2022-03-09T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/MetricCredibility</id><content type="html" xml:base="https://anilkumarpanda.github.io/MetricCredibility/"><![CDATA[<p>&quot;How confident are you that the model performance (on test data) is not below 0.75 AUC?&quot; I came across this question in my work, and it inspired me to write this blog.</p>
<p>Generally we calculate the model performance on a holdout test dataset. For example, if you are working in the risk domain, you would train your model on historical data, and the latest data for which performance can be measured would be used as the test set.</p>
<p>We will get a certain model performance on the test data. How confident are we that the model performance will be above a certain threshold on a similar dataset, assuming the data characteristics have not changed? One way to check is to calculate the model performance on another test dataset. Why not! The challenge is that labelled data is quite hard to obtain in almost all scenarios. <em>If data is oil, good labelled data is gold.</em> So we cannot get another test dataset.</p>
<p>Well then, let’s sample the test data and calculate the model performance? In that case the performance will be similar, but the results will vary with the sampling parameters, and the assumptions of independence are not satisfied.</p>
<p>Is there a Python package for this?</p>
<p>Generally model KPIs have a lower limit. Businesses would gladly accept a model with higher performance in the real world than reported at the time of development. The inverse is not true: model performance below the lower limit is unacceptable, and the model would not be used.</p>
<p>For example, in a classification setting ROC-AUC is a standard KPI. Let’s say the minimum ROC-AUC for your project is 0.75, and we have a ROC-AUC of 0.8 on the test data. What are the chances that, given more data, the model performance on the test data would drop below 0.75?</p>
<p>The <a href="https://finraos.github.io/model-validation-toolkit/">Model Validation Toolkit</a> can help us calculate that. A detailed explanation can be found in the <a href="https://finraos.github.io/model-validation-toolkit/docs/html/credibility_user_guide.html">credibility section</a>. TL;DR: they use the <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a> to do that.</p>
<p>Let’s take the example of the <a href="https://anilkumarpanda.github.io/ErrorAnalysis/">model</a> I created for the FICO challenge.
The model’s test performance is 0.79. After plugging in the required variables,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">mvtk</span> <span class="kn">import</span> <span class="n">credibility</span>
<span class="c1"># Number of positive samples in the test set
</span><span class="n">positives</span> <span class="o">=</span> <span class="mi">1539</span>
<span class="c1"># Number of negative samples in the test set
</span><span class="n">negatives</span> <span class="o">=</span> <span class="mi">1420</span>
<span class="c1"># ROC-AUC of the model
</span><span class="n">roc_auc</span> <span class="o">=</span> <span class="mf">0.79</span>
<span class="c1"># ROC-AUC cutoff
</span><span class="n">cutoff</span> <span class="o">=</span> <span class="mi">75</span><span class="o">/</span><span class="mi">100</span>
<span class="n">auc_positives</span><span class="p">,</span> <span class="n">auc_negatives</span> <span class="o">=</span> <span class="n">credibility</span><span class="p">.</span><span class="n">roc_auc_preprocess</span><span class="p">(</span><span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">,</span> <span class="n">roc_auc</span><span class="p">)</span>
<span class="n">prob_below</span> <span class="o">=</span> <span class="n">credibility</span><span class="p">.</span><span class="n">prob_below</span><span class="p">(</span><span class="n">auc_positives</span><span class="p">,</span> <span class="n">auc_negatives</span><span class="p">,</span> <span class="n">cutoff</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Probability of ROC-AUC being below </span><span class="si">{</span><span class="n">cutoff</span><span class="si">}</span><span class="s"> is : </span><span class="si">{</span><span class="n">prob_below</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">%'</span><span class="p">)</span>

</code></pre></div></div>
<p>We see that the probability of the ROC-AUC being below 0.75 is 0.00%. Sweet!
So we can be confident that, given more data (without drift), our model’s AUC will most likely not drop below the cut-off.</p>
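<p>Under the hood this boils down to a beta-distribution CDF. An illustrative scipy sketch (the pseudo-counts below are hypothetical round numbers, not the actual output of <code>roc_auc_preprocess</code>):</p>

```python
from scipy.stats import beta

# Hypothetical Beta pseudo-counts whose mean is 0.79.
a, b = 790.0, 210.0
cutoff = 0.75

# Probability that the (bounded) metric falls below the cutoff.
prob_below = beta.cdf(cutoff, a, b) * 100
```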
<p>Similarly, we can also calculate the metric credibility for other performance measures like precision, recall, etc.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[&quot;How confident are you that the model performance (on test data) is not below 0.75 AUC?&quot; I came across this question in my work, and it inspired me to write this blog.]]></summary></entry><entry><title type="html">Error Analysis for Tabular Data - Part 1</title><link href="https://anilkumarpanda.github.io/ErrorAnalysis/" rel="alternate" type="text/html" title="Error Analysis for Tabular Data - Part 1" /><published>2021-09-29T00:00:00+00:00</published><updated>2021-09-29T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/ErrorAnalysis</id><content type="html" xml:base="https://anilkumarpanda.github.io/ErrorAnalysis/"><![CDATA[<p>Error analysis simply means looking at the misclassified data points and trying to infer why the model is making errors on this subpopulation.
There are some great videos/blogs about how to do error analysis for text/image data, e.g. this <a href="https://www.youtube.com/watch?v=JoAxZsdw_3w">video by Andrew Ng</a>.</p>
<p>However, I could not find anything related to error analysis for tabular data. One reason I can think of is that it is intuitively difficult to look at tabular data and see whether the model is misclassifying it.
Sure, there are many Python packages that help you do that, e.g. erroranalysis, mealy, etc. However, they are limited to showing you how to use the library, not how to use the error analysis methods to improve model performance. We will be using some of these in this blog post.</p>
<p>A model misclassification can be due to many reasons:</p>
<ol>
<li>The data point has a data quality issue, e.g. a loan given to a 300-year-old person.</li>
<li>The data point is an outlier, e.g. a very wealthy person defaulting on a small loan.</li>
<li>There are very few examples for the model to learn from, e.g. there are only 5 customers with a car loan, so the model has not learnt much about them.</li>
<li>Model bias: due to the inherent limitations of the model, the classification hasn’t been done correctly, e.g. linear model assumptions.</li>
</ol>
<p>With error analysis we can identify the subset of the dataset on which the model is making the most mistakes and then try to improve upon it.
Thus we can look into what the problematic features are, and what the typical values of these features are for the mispredicted samples.
This information can later be exploited to support the improvement strategy, which can be one of the following, as mentioned on the <a href="https://doc.dataiku.com/dss/latest/machine-learning/supervised/model-error-analysis.html">Dataiku website</a>:</p>
<ol>
<li>Improve model design: removing a problematic feature, removing samples likely to be mislabeled, ensemble with a model trained on a problematic subpopulation;</li>
<li>Enhance data collection: gather more data regarding the most erroneous or under-represented populations;</li>
<li>Select critical samples for manual inspection thanks to the Error Tree, and avoid primary predictions on them using model assertions.</li>
</ol>
<p>All of these are interesting options; in this blog post I will focus on the first one. We will explore it via an ML model created on the HELOC dataset provided in the <a href="https://community.fico.com/s/explainable-machine-learning-challenge">FICO Explainability Challenge</a>.
The data and the metadata can be obtained by <a href="https://community.fico.com/s/explainable-machine-learning-challenge?tabset-158d9=3">filling in this form</a>.</p>
<h2 id="problem-statement">Problem Statement</h2>
<p>In this challenge we are tasked to predict the variable <code>RiskPerformance</code>. I will not go into detail about the problem and the predictor variables. You can read the details in the <a href="https://community.fico.com/s/explainable-machine-learning-challenge">FICO Explainability Challenge website</a>.</p>
<p>The dataset is balanced, with ~52% positive class and ~48% negative class.</p>
<h2 id="ml-model">ML Model</h2>
<p>Before error analysis we need a model, so let’s go ahead and build one. We will make use of many packages to make our life easy.
Main among them are <a href="https://ing-bank.github.io/probatus/index.html">Probatus</a> for feature selection and analysis, <a href="https://www.scikit-yb.org/en/latest/api/classifier/threshold.html">Yellowbrick</a> for deciding the threshold, <a href="https://xgboost.readthedocs.io/en/latest/index.html">XGBoost</a> for classification, <a href="https://erroranalysis.ai/">erroranalysis</a> for error analysis, and of course scikit-learn for everything else.</p>
<h3 id="train-test-selection">Train Test Selection</h3>
<p>The dataset is divided into training and testing sets, with 30% of the samples in the test set.</p>
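<p>This is the standard stratified split; a sketch with scikit-learn (the arrays here are toy stand-ins for the HELOC features and the <code>RiskPerformance</code> target):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(5).normal(size=(100, 4))
y = np.r_[np.zeros(50), np.ones(50)]

# 30% held out; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```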
<h3 id="feature-selection-amp-hp-tuning">Feature Selection &amp; HP Tuning</h3>
<p>ShapRFECV - Recursive Feature Elimination using SHAP importance, from Probatus, is used for feature selection. As the name suggests, it uses Shapley values to eliminate features in a recursive manner.</p>
<p><img src="../images/ea_feature_selection.png" alt="feature selection" /></p>
<p>As the plot suggests, we can use only 5 features and still achieve the maximum performance (ROC-AUC in this case).</p>
<ul>
<li><code>NetFractionRevolvingBurden</code></li>
<li><code>NumSatisfactoryTrades</code></li>
<li><code>AverageMInFile</code></li>
<li><code>MSinceMostRecentInqexcl7days</code></li>
<li><code>ExternalRiskEstimate</code></li>
</ul>
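<p>Probatus’s <code>ShapRFECV</code> follows the classic recursive-elimination loop, just ranking features by SHAP importance rather than the model’s built-in importance. The same loop can be sketched with scikit-learn’s <code>RFECV</code> (shown here to keep the example dependency-light; this is not the notebook’s actual code):</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
# Recursively drop the weakest feature, cross-validating at each step.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3, scoring="roc_auc")
selector.fit(X, y)
# selector.support_ marks the features kept at the best CV score.
```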
<p>Once the features are selected, let’s tune the parameters for the XGBClassifier.
The initial parameter grid is as suggested in the book <a href="https://github.com/abhishekkrthakur/approachingalmost">Approaching (Almost) Any Machine Learning Problem</a>.</p>
<p>The model has the following performance.</p>
<p><img src="../images/ea_perf_before.png" alt="before performance" /></p>
<p>There is a slight overfit; however, this is a <code>decent</code> model for the purposes of this blog.</p>
<h3 id="feature-importance">Feature Importance</h3>
<p>If we look at the SHAP feature importances we notice that <code>ExternalRiskEstimate</code> is the most predictive feature.</p>
<p><img src="../images/ea_feature_importance.png" alt="feature importance" /></p>
<p>I have observed that in most cases the feature importances follow the Pareto 80:20 rule, i.e. 20% of the features provide 80% of the performance.
Hence, if we can spot and rectify issues in the most important features, the chances of improvement are quite high.</p>
<h2 id="error-analysis">Error Analysis</h2>
<p>So far so good. Now let’s begin with the error analysis. For this we will use the <code>ErrorAnalysisDashboard</code> provided by <a href="https://erroranalysis.ai/">Microsoft</a>.
The output of the <code>ErrorAnalysisDashboard</code> is an interactive widget containing the error tree. There are many more functionalities, like the grid view, the ability to exclude features from the error tree, etc.
However, I will skip those and focus on the use case, which is improving the model via error analysis.
You can see the other functionalities provided by the package in the <a href="https://github.com/microsoft/responsible-ai-widgets/tree/main/notebooks">example notebooks</a>.</p>
<p><img src="../images/ea_board_before_filtering.png" alt="ea dashboard" /></p>
<p>To identify important failure patterns, look for nodes with a stronger <code>dark red</code> colour (i.e., high error rate) and a higher fill line (i.e., high error coverage).</p>
<ul>
<li>Error coverage shows the percentage of errors that are concentrated in the selection out of all errors present in the dataset.</li>
<li>Error rate represents the percentage of instances in the node for which the system has failed.</li>
</ul>
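<p>In terms of a node’s boolean membership mask, the two quantities can be computed as follows (toy arrays for illustration):</p>

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 1])
errors = y_true != y_pred                      # 3 errors in total
in_node = np.array([True, True, False, True,
                    False, False, True, False])  # rows falling in the node

error_rate = errors[in_node].mean() * 100                     # % of node rows misclassified
error_coverage = errors[in_node].sum() / errors.sum() * 100   # % of all errors inside the node
```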
<p>Ideally it makes sense to start the error analysis process by looking into these leaf nodes first.</p>
<p>For our use case, we can see that for data points where <code>ExternalRiskEstimate &lt;= 83.50</code> and <code>NetFractionRevolvingBurden &lt;= 59.50</code> and <code>NumSatisfactoryTrades &lt;= 1.50</code>, the error rate is 48.11% and the error coverage is 9.75%.</p>
<p>Next, let’s look into these features and see if we can spot anything that looks like either a data quality issue or an anomaly.</p>
<p>If we look at the <code>ExternalRiskEstimate</code> feature, we notice that there are a few data points with values &lt; 0.
An <code>ExternalRiskEstimate</code> of less than 0 looks like a data quality issue, and these points can be eliminated.</p>
<p><img src="../images/ea_dep_external_risk.png" alt="External Risk Estimate" /></p>
<p>Similarly, if we look at the <code>NumSatisfactoryTrades</code> feature, there are a few data points with values &lt; 0.
A number of trades &lt; 0 again seems like a data quality issue. We will eliminate these as well.</p>
<p><img src="../images/ea_dep_trades.png" alt="Num Trades" /></p>
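<p>The removal step itself is a simple boolean filter; a sketch on a made-up frame:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "ExternalRiskEstimate": [72, -9, 80, 65],
    "NumSatisfactoryTrades": [12, 7, -8, 20],
})
# Drop rows where either feature takes an implausible negative value.
clean = df[(df["ExternalRiskEstimate"] >= 0)
           & (df["NumSatisfactoryTrades"] >= 0)]
```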
<p>Once we remove these data points and rerun the model, we get the following results.</p>
<p><img src="../images/ea_perf_after.png" alt="before performance" /></p>
<p>The model performance has improved!</p>
<p>Similarly, one can look at other variables, identify data points leading to errors, and treat them.
In this example we eliminated the data points; however, in your own use case these points could be treated differently based on business knowledge.</p>
<p>Apart from the library used above, <a href="https://github.com/dataiku-research/mealy">mealy</a> is another library that can be used for error analysis.</p>
<p>The notebook and code for reproducing the above results can be found in <a href="https://github.com/anilkumarpanda/erroranalysis/blob/main/notebooks/error_analysis.ipynb">my github account</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Error analysis simply means looking at the misclassified data points and trying to infer why the model is making errors on this subpopulation. There are some great videos/blogs about how to do error analysis for text/image data, e.g. a video by Andrew Ng.]]></summary></entry><entry><title type="html">Active Learning</title><link href="https://anilkumarpanda.github.io/ActiveLearning/" rel="alternate" type="text/html" title="Active Learning" /><published>2021-09-25T00:00:00+00:00</published><updated>2021-09-25T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/ActiveLearning</id><content type="html" xml:base="https://anilkumarpanda.github.io/ActiveLearning/"><![CDATA[<p>While browsing Twitter, I came across this <a href="https://twitter.com/abhi1thakur/status/1441088040873578500">tweet</a> by Abhishek.</p>
<p><img src="../images/abhi_labelling.png" alt="Labelling data" /></p>
<p>Though it is meant as a joke, there is quite some truth to it. Everybody wants to build an ML model, but many don’t want to put effort into obtaining/creating datasets.
Often, within a company there exist vast unlabelled datasets. The value in these datasets can be unlocked if they can be labelled efficiently.</p>
<p>The branch of machine learning that deals with labelling data is called active learning.
Though it is quite a vast field, I will try to provide a summary of the most frequent (read: easy) methods used, some learnings, and of course links to different resources.</p>
<h2 id="active-learningal-aka-query-learning">Active learning(AL) aka Query Learning</h2>
<p>Active learning (sometimes called “query learning” or “optimal experimental design” in the statistics literature)
is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns—to be “curious,” if you will—it will perform better with less training.</p>
<p>An AL process typically consists of the following steps:</p>
<ol>
<li>Having access to a pool of unlabelled data.</li>
<li>Selecting an initial set, either by randomly sampling the data or by filtering based on some rules, and labelling it.</li>
<li>Creating an ML model based on the labelled data.</li>
<li>Scoring the remaining dataset with the model trained in step 3.</li>
<li>Choosing the <em>best</em> data points to label next and continuing steps 3-5.</li>
<li>Stopping when a stopping criterion is met.</li>
</ol>
<p><img src="../images/active_learning_process.png" alt="Active Learning Process" /></p>
<p>From my experience, the crux of AL lies in step 2, i.e. how to bootstrap the process, and step 5, how to choose the best samples for labelling.</p>
<h3 id="begin-labelling">Begin Labelling</h3>
<p>Once you have the unlabelled dataset and a definition of what a class looks like (e.g. dog vs cat), the easiest way to start is to randomly select a sample and label it.
However, if the dataset has a high class imbalance, random selection will not yield many samples of the minority class.
In such cases, you should start with simple rules: rules that have a high chance of yielding the class of interest.</p>
<p>Unsupervised clustering can also be used: you can assign a label to a cluster and move ahead.</p>
<p>Once a decent number of samples have been labelled, you can train any classifier on this labelled set and score the unlabelled set.</p>
<h3 id="chosing-points-for-labelling">Choosing points for labelling:</h3>
<p>Once you have scored the remaining unlabelled data points, the next step is to label points that:</p>
<ul>
<li>either move the decision boundary of the classifier the most  or</li>
<li>improve the loss metric the most.</li>
</ul>
<p>To achieve this you can use many methods, the simplest being:</p>
<h4 id="uncertainity-sampling">Uncertainty Sampling:</h4>
<p>Consider a binary classification problem. You have created an initial labelled set and trained an ML model as mentioned in steps 1-3 above.
After scoring the unlabelled set, you choose for labelling the data points about which the model is least confident,
i.e. those where the predicted probability is close to 0.5 (or the median). These are the data points for which the model is least confident and which will gain the most from human labelling/knowledge.
Variations of this technique are entropy sampling and margin sampling.</p>
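<p>A minimal sketch of least-confidence sampling for the binary case (toy probabilities, not a real scored pool):</p>

```python
import numpy as np

def least_confident(proba, k):
    # Indices of the k points whose predicted probability is closest to 0.5.
    uncertainty = np.abs(np.asarray(proba) - 0.5)
    return np.argsort(uncertainty)[:k]

proba = np.array([0.95, 0.51, 0.10, 0.48, 0.70])
picks = least_confident(proba, 2)  # points 1 and 3 go to the human labeller
```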
<p>The main disadvantages of such methods are that they are prone to getting stuck in a localised region of the problem space and are also sensitive to outliers.
The next method tries to resolve these disadvantages.</p>
<h4 id="query-based-sampling-">Query based sampling:</h4>
<p>In query based sampling, you train multiple classifiers from different families of classifiers and choose the points for human labelling where most of the classifiers disagree.
This is a slightly more complex method, but it has some advantages over uncertainty sampling.</p>
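<p>A sketch of the disagreement computation for a committee of three classifiers (hard votes, made-up data):</p>

```python
import numpy as np

# Rows: classifiers; columns: unlabelled points.
votes = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
# Fraction of committee members deviating from the majority vote.
disagreement = (votes != majority).mean(axis=0)
query_order = np.argsort(-disagreement)  # most contested points first
```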
<h3 id="combination-of-sampling-methods">Combination of sampling methods:</h3>
<p>It is advised to start with random or rule-based sampling for the cold start, when no classifier is available.
Then one can move to more sophisticated approaches. You can also combine multiple methods.</p>
<p><img src="../images/al_elbow_method.png" alt="sampling_combinations" /></p>
<h3 id="stopping-criteria">Stopping criteria:</h3>
<p>The AL process can be stopped when :</p>
<ol>
<li>The allocated budget is over, i.e. the time/resources assigned to AL are exhausted.</li>
<li>The model KPI is reached.</li>
<li>Despite labelling more data points, the final model performance does not improve.</li>
</ol>
<p>Though there are many libraries that help you implement AL, e.g. <a href="https://modal-python.readthedocs.io/en/latest/">modAL</a> and <a href="https://github.com/NUAA-AL/ALiPy">ALiPy</a>,
I found it easier to create a custom interface in <a href="https://streamlit.io/">streamlit</a> and implement various sampling methods.</p>
<p>I have used AL in a few projects with good success, main among them:</p>
<ol>
<li>AL was used to <a href="youtube.com/watch?v=oyMawG0SlPU">solve a cold start problem in NLP</a> for ING.</li>
<li>The dataset used to train the <a href="https://twitter.com/bot_esr">ESR Bot</a> was created using AL.
This Twitter bot tweets environment-related news to spread awareness.</li>
</ol>
<p>Another package that uses AL is <a href="https://www.deduplipy.com/">Deduplipy</a> created by <a href="https://www.linkedin.com/in/frits-hermans-data-scientist/">Frits Hermans</a>.
It implements deduplication using active learning.</p>
<p>If you are interested in using active learning in your project or have any questions/feedback about this blog, <a href="../about.html">do reach out</a>.</p>
<h2 id="references">References</h2>
<ol>
<li><a href="http://burrsettles.com/pub/settles.activelearning.pdf">Survey of Active Learning approaches</a></li>
<li><a href="https://www.youtube.com/watch?v=0efyjq5rWS4">Pycon DE 2018</a></li>
<li><a href="https://learning.oreilly.com/videos/strata-dataconference/9781492050520/9781492050520-video324208">Cold start problem for NLP</a>- paywalled</li>
<li><a href="http://active-learning.net/">Active Learning website</a></li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[While browsing Twitter, I came across this tweet by Abhishek.]]></summary></entry><entry><title type="html">Thoughts on Feature Selection</title><link href="https://anilkumarpanda.github.io/FeatureSelection/" rel="alternate" type="text/html" title="Thoughts on Feature Selection" /><published>2020-04-14T00:00:00+00:00</published><updated>2020-04-14T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/FeatureSelection</id><content type="html" xml:base="https://anilkumarpanda.github.io/FeatureSelection/"><![CDATA[<p>All practitioners of ML know that feature selection is important and helps to reduce unwanted signal in your dataset.</p>
<p>Feature selection methods can be categorised as follows:</p>
<ol>
<li>Filter methods, e.g. selection based on correlation or mutual information.</li>
<li>Wrapper methods, e.g. recursive feature elimination.</li>
<li>Embedded methods, e.g. tree-based feature importances.</li>
</ol>
<p>Each method has its advantages and disadvantages, the primary distinguishing factors being speed versus the risk of overfitting.
In terms of speed, filter methods are faster than embedded methods, which in turn are faster than wrapper methods.
Wrapper methods have a higher chance of overfitting than embedded and filter methods.</p>
<p>To balance out the pros and cons a combined approach can be followed as shown below:</p>
<img src="../images/feature_selection.png" alt="Feature Selection" style="float: left; margin-right: 10px;" />
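<p>A rough sketch of this combined funnel with scikit-learn is shown below. The feature counts (30 → 15 → 10 → 5) and the choice of estimators are illustrative assumptions, not part of the diagram.</p>

```python
# Combined feature selection: filter -> embedded -> wrapper (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# 1. Filter: keep the 15 features with the highest mutual information.
filt = SelectKBest(mutual_info_classif, k=15).fit(X, y)
X_filt = filt.transform(X)

# 2. Embedded: keep the 10 features a random forest ranks highest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_filt, y)
top10 = np.argsort(forest.feature_importances_)[-10:]
X_emb = X_filt[:, top10]

# 3. Wrapper: recursive feature elimination down to the final 5.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=5).fit(X_emb, y)
X_final = X_emb[:, rfe.support_]
print(X_final.shape)  # (300, 5)
```

<p>Because the cheap filter step runs first, the expensive wrapper step only has to search over a small set of survivors.</p>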
<p>The main points to take into consideration are :</p>
<ol>
<li>
<p>Feature selection should be done in consultation with business experts. They have insight into the business problem and can guide us in eliminating some features right away, e.g. features that are very unstable, or, in the case of correlated features, which of them to drop. They can also validate the results of feature selection.</p>
</li>
<li>
<p>A bootstrapped sampling strategy can be used for feature selection, as features that are important in the entire dataset will be important in the sampled data as well.</p>
</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[All practitioners of ML know that feature selection is important and helpful for reducing the unwanted noise in your dataset.]]></summary></entry><entry><title type="html">MultiObjective Optimisation with Genetic Algorithm</title><link href="https://anilkumarpanda.github.io/MOGA/" rel="alternate" type="text/html" title="MultiObjective Optimisation with Genetic Algorithm" /><published>2019-11-16T00:00:00+00:00</published><updated>2019-11-16T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/MOGA</id><content type="html" xml:base="https://anilkumarpanda.github.io/MOGA/"><![CDATA[<p>For most real-world problems, we have to optimise two or more objective functions.
E.g. in machine learning we may have to optimise two or more objectives, such as maximising accuracy versus increasing interpretability. Almost always these objectives will conflict with each other.</p>
<p>There are multiple ways to solve multiobjective problems; in this post we will focus on a specific
algorithm called NSGA-II (Non-dominated Sorting Genetic Algorithm II).
There is an excellent lecture series you can follow <a href="https://www.youtube.com/watch?v=Hm2LK4vJzRw">here</a> for the theory and examples. I will focus here on the practical application, especially in the field of machine learning.</p>
<p>Before we go ahead, let us learn the basic building blocks of genetic algorithms and Pareto optimisation.</p>
<ol>
<li>
<p><strong>Genetic algorithm</strong>: an optimisation technique that relies on two major principles:
a. Diversity preservation
b. Selection pressure
Follow the lectures <a href="https://www.youtube.com/watch?v=Z_8MpZeMdD4&amp;t=6s">here</a> for an excellent introduction.</p>
</li>
<li>
<p><strong>Pareto optimal solution</strong>: a solution to a multi-objective problem is Pareto optimal if no objective can be improved without making at least one other objective worse.</p>
</li>
<li>
<p><strong>Pareto front</strong>: the set of all Pareto optimal solutions is called the Pareto front.</p>
</li>
</ol>
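<p>The two Pareto definitions above can be expressed in a few lines of code. The sketch below assumes both objectives are minimised (e.g. model complexity and error rate); the candidate points are made-up values.</p>

```python
# Pareto dominance and Pareto-front extraction for minimisation (illustrative).
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (complexity, error rate) pairs for six candidate models.
candidates = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 5), (6, 2)]
print(pareto_front(candidates))  # [(1, 9), (2, 7), (4, 4), (6, 2)]
```

<p>Here (3, 8) is dominated by (2, 7), and (5, 5) by (4, 4); the remaining points form the front. NSGA-II uses this same dominance relation inside its non-dominated sorting step.</p>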
<p>The NSGA-II approach combines both of the above principles to provide solutions to a multiobjective problem.</p>
<p>Some of the scenarios where MOGA can be used include:</p>
<ol>
<li>Reduce the complexity of our ML solution while maintaining or improving the selected metric (accuracy, ROC-AUC, precision, recall, etc.).</li>
<li><a href="https://medium.com/ing-blog/optimising-knockout-rules-for-lending-69b226e68b42">Optimise knockout rules in the lending process</a>.</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[For most real-world problems, we have to optimise two or more objective functions. E.g. in machine learning we may have to optimise two or more objectives, such as maximising accuracy versus increasing interpretability. Almost always these objectives will conflict with each other.]]></summary></entry><entry><title type="html">Up and running!</title><link href="https://anilkumarpanda.github.io/HelloWorld/" rel="alternate" type="text/html" title="Up and running!" /><published>2019-11-15T00:00:00+00:00</published><updated>2019-11-15T00:00:00+00:00</updated><id>https://anilkumarpanda.github.io/HelloWorld</id><content type="html" xml:base="https://anilkumarpanda.github.io/HelloWorld/"><![CDATA[<p>Hello there, this is my personal blog, where I note down my thoughts, both personal and professional.
I plan to write about technology, the books I have read, the side projects I am working on, etc.
I also plan to share my ideas freely here, so that others can discuss them with me and contribute as well.</p>
<p>You too can create your own blog, following the instructions from the <a href="https://github.com/barryclark/jekyll-now">Jekyll Now repository</a> on GitHub.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Hello there, this is my personal blog, where I note down my thoughts, both personal and professional. I plan to write about technology, the books I have read, the side projects I am working on, etc. I also plan to share my ideas freely here, so that others can discuss them with me and contribute as well.]]></summary></entry></feed>