Machine Learning | HumblePoster

Definition

Sun, 05 May 2019 00:00:00 +0100

Just as we judge tomorrow’s weather from past experience, and just as melon lovers pick a good melon from past experience, can a computer help humans do the same? Machine Learning is the discipline that makes this possible: human “experience” corresponds to “data” in the computer, and the computer learns from this empirical data to produce a model that lets it make valid judgments in the face of new situations.

Mitchell, author of another classic textbook, gives a formal definition based on three quantities:

P: The performance of a computer program on a task class $T$.
T: The type of task the computer program wants to achieve.
E: Denotes experience, i.e., a historical data set.

If a computer program improves its performance $P$ on task $T$ by using experience $E$, the program is said to have learned from $E$.

Terms

Sun, 05 May 2019 00:00:00 +0100

Data

Suppose we collect data on a batch of watermelons, e.g.:

(color=green; rootstock=crumpled; knock=muddy)
(color=black; rootstock=slightly crumpled; knock=dull)
(color=light white; rootstock=hard; knock=crisp)…

Each pair of brackets is the record of one watermelon. Definitions:

Data Set: The collection of all records.
Sample: Each record, also called an instance.
Feature or Attribute: A single property of a record, e.g. color or knock.
Vector: Each record can be represented as a point in the space spanned by its features, i.e. as a vector, e.g. (green, crumpled, muddy). Each watermelon is therefore a feature vector.
Dimensionality: The number of features of a sample.
Dimensional Disaster: The watermelon example has dimensionality 3. When the dimensionality becomes very large, learning becomes much harder; this is known as the dimensional disaster, more commonly called the curse of dimensionality.

Data Set

When a computer program learns from empirical data to produce a model, each record it uses is called a training sample. Once the model is trained and we want to test its performance on new samples, each new sample is called a test sample. Definitions:

Training Set: The set of all training samples [special].
Test Set: The set of all test samples [general].
Generalization: The ability of the learned model to apply to new samples, i.e. to go from the special to the general.

Classification

In the watermelon example, we want the computer to learn from data about watermelon features and train a decision model that determines whether a new watermelon is good or not. What we predict is whether the watermelon is good or bad, which is a discrete value. In contrast, predicting future population numbers from the population data of previous years yields continuous values. Definitions:

Classification: A problem where the predicted values are discrete.
Regression: A problem where the predicted values are continuous.

Method of learning

In this prediction process, we clearly know in advance whether each melon in the training set is good or bad; the learner learns the characteristics of these melons and derives a rule from them. The watermelons in the training set have been labeled, and these labels are called label information.

There are also cases where the data carries no labels. For example, we may want to divide a pile of watermelons into two smaller piles according to their characteristics, so that the watermelons within each pile are as similar as possible. For this problem we do not know beforehand how good or bad the watermelons are, so the samples carry no label information. Definitions:

Supervised Learning: A learning task whose training data has label information. Both classification and regression described above are supervised learning.
Unsupervised Learning: A learning task whose training data has no label information. Common examples are clustering and association rule mining.

Error and overfitting

Sun, 05 May 2019 00:00:00 +0100

Error

The difference between the learner’s prediction for a sample and the true value of that sample is called the error. Definitions:

Training Error or Empirical Error: The error on the training set.
Test Error: The error on the test set.
Generalization Error: The learner’s error on all new samples.

Overfitting

Clearly, we want learners that perform well on new samples, i.e. that have a small generalization error. The learner should therefore learn as many universal characteristics as possible from the training set, so that it can discriminate correctly when it encounters new samples.

However, a learner may learn the training set so well that it treats peculiarities of the training samples as general features. Conversely, its learning capacity may be insufficient to learn even the basic characteristics of the training set. Definitions:

Overfitting: Learning so much that characteristics specific to the training samples, which do not generalize, are learned as well.
Underfitting: Learning so poorly that the general properties of the training samples have not been learned well.

With overfitting, the training error is very small but the test error is large; with underfitting, both the training error and the test error are large. Underfitting is relatively easy to overcome, for example by increasing the number of training iterations. For overfitting there is still no fully satisfactory solution, which makes it a key obstacle in machine learning.
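The gap between training error and test error can be illustrated with a small sketch. It assumes a toy one-dimensional data set with 20% label noise and uses a 1-nearest-neighbour learner, which memorizes the training set; the data, the noise level and the function names are illustrative assumptions, not part of the original text.

```python
import random

def one_nn_predict(train, x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def error_rate(train, samples):
    """Fraction of samples whose 1-NN prediction differs from the true label."""
    wrong = sum(one_nn_predict(train, x) != y for x, y in samples)
    return wrong / len(samples)

def make_samples(n, rng, noise=0.2):
    """Toy rule: label is 1 when x > 0.5; each label is flipped with probability `noise`."""
    samples = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        samples.append((x, y))
    return samples

rng = random.Random(0)
train_set = make_samples(200, rng)
test_set = make_samples(200, rng)

train_error = error_rate(train_set, train_set)  # the learner has memorized these samples
test_error = error_rate(train_set, test_set)    # the memorized noise does not carry over
print(f"training error = {train_error:.3f}, test error = {test_error:.3f}")
```

Because the learner reproduces every noisy training label, its training error is zero while its test error stays clearly above zero, which is the overfitting pattern described above.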

Method of Evaluation

Sun, 05 May 2019 00:00:00 +0100

In realistic tasks we often have multiple algorithms to choose from, so how do we choose the best one? As mentioned in the last chapter, we want the learner with the ‘smallest generalization error’, and the ideal solution is to evaluate the generalization error of each model and select the smallest one. However, the generalization error describes how the model performs on all new samples, which we cannot access directly.

Thus, we usually use a ‘test set’ to test the learner’s ability to discriminate on new samples, and then use the test error on the test set as an approximation of the generalization error. The test set should be as mutually exclusive as possible with the training set, and here’s a little story to explain why.

Suppose the teacher gives the students 10 questions to practice and then uses the same 10 questions for the test. Some students may only be able to do these 10 questions and still get a high score, so the score clearly does not reflect their real level.

In our task we would like a model that generalizes well, just as the teacher would like the students not only to learn the course material but also to apply what they have learned to new problems.

Training samples are equivalent to the exercises the students practice, and the testing process is equivalent to the exam. If the test samples had been used for training, the result would be an over-optimistic estimate.

Split Training set and Test set

Sun, 05 May 2019 00:00:00 +0100

In order to use the test error of a test set as an approximation of the generalization error, we need to effectively split the initial data set into mutually exclusive training sets and test sets. The following are some common methods.

Hold-out

Divide the data set $D$ into two mutually exclusive sets, one as the training set $S$ and one as the test set $T$, satisfying $D=S{\cup}T$ and $S{\cap}T=\varnothing$. A common choice is to use about 2/3 to 4/5 of the samples for training and the rest for testing.

Note that the training and test sets should keep the data distribution as consistent as possible to avoid introducing additional bias; stratified sampling is commonly used to solve this problem.

At the same time, the result of a single hold-out is often not stable enough because the division is random, so we generally repeat the experiment with several random divisions and take the average.
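A stratified hold-out split can be sketched as follows. This is a minimal illustration, assuming the samples are identified by their index and that only the class labels are needed to stratify; the function name `stratified_hold_out` and the 70% training fraction are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_hold_out(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # random division within each class
        cut = round(len(indices) * train_fraction)  # same fraction taken from every class
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = ["good"] * 60 + ["bad"] * 40
S, T = stratified_hold_out(labels, train_fraction=0.7)
print(len(S), len(T))
```

Calling the function with several different seeds and averaging the resulting test errors corresponds to the repeated random divisions mentioned above.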

Cross Validation

Divide the data set $D$ into $k$ mutually exclusive subsets of similar size, satisfying $D=D_1{\cup}D_2{\cup}\cdots{\cup}D_k$ and $D_i{\cap}D_j=\varnothing$ $(i{\neq}j)$. As before, stratified sampling is used so that each subset keeps the data distribution as consistent as possible.

The idea of cross-validation is that each time the union of $k-1$ subsets is used as the training set and the remaining subset is used as the test set. This yields $k$ training/test divisions, so we perform $k$ rounds of training and testing and finally return the mean of the $k$ test results.
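The splitting step can be sketched in a few lines. This is a simplified illustration that shuffles indices and does not stratify; `k_fold_splits`, `cross_validate` and the `evaluate` callback are hypothetical names introduced for the example.

```python
import random

def k_fold_splits(m, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over m samples."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds, sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_validate(m, k, evaluate):
    """Average a user-supplied evaluate(train_idx, test_idx) score over the k splits."""
    scores = [evaluate(train_idx, test_idx) for train_idx, test_idx in k_fold_splits(m, k)]
    return sum(scores) / k

splits = list(k_fold_splits(m=20, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so each sample is used for testing once and for training $k-1$ times. Setting `k` equal to `m` puts a single sample in each fold.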

K-fold Cross Validation

Cross-validation is also called k-fold cross-validation; the most common value of $k$ is 10.

[Figure: diagram of 10-fold cross-validation]

Leave-One-Out

Similar to hold-out, there are many ways to divide the data set $D$ into $k$ subsets at random. Therefore k-fold cross-validation is usually repeated $p$ times with different random divisions, known as p-times k-fold cross-validation; a common choice is 10-times 10-fold cross-validation, which performs 100 training/testing runs.

In particular, when each of the $k$ subsets contains exactly one sample, the method is known as Leave-One-Out. The results of Leave-One-Out are often considered more accurate, but the computational cost is significant.

Bootstrapping

What we want to evaluate is the model trained on the whole of $D$. However, in Hold-out and Cross Validation the evaluated model is trained on a set smaller than $D$, because part of the samples is held back for testing, which inevitably introduces some estimation bias due to the difference in training sample size. Leave-One-Out is less affected by the change in training sample size, but its computational cost is too high. Bootstrapping addresses precisely this problem.

The basic idea of Bootstrapping is as follows: given a data set $D$ containing $m$ samples, randomly select one sample from $D$ at a time, copy it into $D'$, and put it back into $D$ so that it can be picked again in later draws. Repeating this $m$ times yields a data set $D'$ containing $m$ samples.

The probability that a given sample is never picked in $m$ draws is $(1-\frac{1}{m})^m$, and its limit is:

$\lim\limits_{m\to\infty}\left(1-\frac{1}{m}\right)^m=\frac{1}{e}\approx0.368$

Thus, approximately 36.8% of the samples in $D$ do not appear in $D'$, so $D'$ can be used as the training set and $D{\setminus}D'$ as the test set. Bootstrapping is useful when the data set is small and hard to split into training and test sets effectively. However, it introduces estimation bias, because the data set generated by bootstrapping (random sampling) alters the distribution of the initial data set. When the initial data set is large enough, Hold-out and Cross Validation are more commonly used.
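The 36.8% figure can be checked empirically with a short simulation. This is a sketch under the assumption that samples are represented by their indices; `bootstrap_split` is a hypothetical helper name.

```python
import math
import random

def bootstrap_split(m, seed=0):
    """Sample m indices with replacement; indices never drawn form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(m) for _ in range(m)]   # D', may contain repeats
    test_idx = sorted(set(range(m)) - set(train_idx))  # samples of D that never entered D'
    return train_idx, test_idx

m = 10_000
train_idx, test_idx = bootstrap_split(m)
oob_fraction = len(test_idx) / m
limit_term = (1 - 1 / m) ** m

print(f"fraction of D not in D': {oob_fraction:.3f}")
print(f"(1 - 1/m)^m = {limit_term:.4f}, 1/e = {1 / math.e:.4f}")
```

For a data set of this size, the observed fraction of samples left out of $D'$ should land close to $1/e$, in line with the limit above.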

Parameter Tuning

Sun, 05 May 2019 00:00:00 +0100

Most learning algorithms have some parameters that need to be set, and the performance of the learned model often varies significantly with the parameter configuration. Choosing these values is commonly referred to as parameter tuning.

Many parameters of a learning algorithm take values in a real range, so it is not feasible to train a model for every possible parameter value. It is common to select a range and a step size $\lambda$ for each parameter, which reduces the candidates to a finite set and makes the tuning process feasible.

For example, assuming that the algorithm has 3 parameters, each considering only 5 candidate values, there are $5^3$ = $125$ models to examine for each training/test set.
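The count of candidate models can be reproduced by enumerating the grid explicitly. The parameter names and candidate values below are hypothetical and only serve to illustrate the $5^3$ combinations.

```python
from itertools import product

# Hypothetical parameter names and candidate values, for illustration only.
grid = {
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "depth": [2, 4, 6, 8, 10],
    "regularization": [0.0, 0.01, 0.1, 1.0, 10.0],
}

names = list(grid)
# One dict per combination of candidate values: 5 * 5 * 5 configurations.
configurations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(configurations))  # 125
print(configurations[0])
```

Each configuration would be trained and evaluated on every training/test division, which is why the number of candidates per parameter is kept small in practice.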

Note that once the model and parameters have been chosen, we need to retrain the model on the whole initial data set $D$. This means that the samples initially held out for evaluation are also learned by the model, which improves the final result.

Performance Measure

Sun, 05 May 2019 00:00:00 +0100

Performance measures are evaluation criteria that measure the generalization ability of models. When comparing different models, different performance measures often lead to different judgments.

The Most Common Performance Measures

Mean Squared Error

In regression tasks, which predict continuous values, the most commonly used measure is the Mean Squared Error (MSE); many classical algorithms use MSE as their evaluation function.

$E(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}(f(x_i)-y_i)^2$

More generally, for data distribution $\mathcal{D}$ and probability density functions $p(\cdot)$, the MSE can be described as

$E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}(f(x)-y)^2p(x)dx$
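As a quick worked instance of the finite-sample formula, the sketch below computes the MSE for four made-up predictions; the numbers and the function name are assumptions for illustration.

```python
def mean_squared_error(predictions, targets):
    """E(f; D) = (1/m) * sum over i of (f(x_i) - y_i)^2."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    m = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m

predictions = [2.5, 0.0, 2.0, 8.0]   # f(x_i), made-up values
targets = [3.0, -0.5, 2.0, 7.0]      # y_i, made-up values
mse = mean_squared_error(predictions, targets)
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```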

Error rate & Accuracy

In classification tasks, which predict discrete values, the most commonly used measures are error rate and accuracy. Error rate is the proportion of incorrectly classified samples among all samples, and accuracy is the proportion of correctly classified samples among all samples, so error rate + accuracy = 1.

Error rate is defined as:

$E(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}\mathbb I(f(x_i)\neq{y_i})$

Accuracy is defined as:

$acc(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}\mathbb I(f(x_i)=y_i)=1-E(f;D)$

More generally, for data distribution $\mathcal{D}$ and probability density function $p(\cdot)$, the error rate and accuracy can be described as:

$E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb I(f(x)\neq{y})p(x)dx$

$acc(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb I(f(x)=y)p(x)dx=1-E(f;\mathcal{D})$
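The finite-sample versions of $E(f;D)$ and $acc(f;D)$ can be computed directly from the indicator sums. The labels below are made-up watermelon labels used only for illustration.

```python
def error_rate(predictions, labels):
    """E(f; D): fraction of samples with f(x_i) != y_i."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

def acc(predictions, labels):
    """acc(f; D): fraction of samples with f(x_i) == y_i."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

predictions = ["good", "bad", "good", "good", "bad"]
labels = ["good", "good", "good", "bad", "bad"]

E = error_rate(predictions, labels)  # 2 of 5 samples are misclassified
A = acc(predictions, labels)         # 3 of 5 samples are classified correctly
print(E, A)
```

The two values always sum to 1, as stated by $acc(f;D)=1-E(f;D)$.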

Precision/Recall/F1

Error rate and accuracy do not fit every task. For example, in a recommendation system we care about how much of the content pushed to the user is actually of interest to the user (i.e. precision), or how much of all the content of interest to the user we pushed (i.e. recall). Precision and recall are therefore more appropriate for describing such problems. For binary classification, the confusion matrix of the classification results and the precision/recall are defined as follows:

Fact \ Prediction	Positive	Negative
Positive	TP (True Positive)	FN (False Negative)
Negative	FP (False Positive)	TN (True Negative)

Precision $P$ and Recall $R$ are defined as:

$P=\frac{TP}{TP+FP}$

$R=\frac{TP}{TP+FN}$
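The four cells of the confusion matrix and the resulting $P$ and $R$ can be computed as in the following sketch. Encoding the positive class as 1 and returning 0.0 when a denominator is zero are conventions assumed for this example.

```python
def confusion_counts(predictions, labels, positive=1):
    """Return (TP, FP, FN, TN) for binary predictions against true labels."""
    tp = fp = fn = tn = 0
    for pred, true in zip(predictions, labels):
        if pred == positive and true == positive:
            tp += 1
        elif pred == positive and true != positive:
            fp += 1
        elif pred != positive and true == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def p_and_r(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 if a denominator is zero (assumed convention)."""
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    return P, R

labels =      [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(predictions, labels)
P, R = p_and_r(tp, fp, fn)
print(tp, fp, fn, tn)  # 3 1 1 5
print(P, R)            # 0.75 0.75
```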

Precision and Recall are a pair of conflicting measures. For example, if we want the pushed content to be of interest to users as often as possible, we can only push the content we are most certain about, so some content the users are interested in will be missed, which leads to low recall. If we want all the content that users are interested in to be pushed, we have to push nearly all the content, so precision becomes very low.

The P-R curve describes how precision and recall change together. It is constructed as follows: the test samples are ranked according to the learner’s prediction (generally a real value or probability), with the samples most likely to be positive at the front and the least likely at the back. The samples are then predicted as positive one by one in this order, and the current $P$ and $R$ values are calculated each time, giving one point of the curve per step.

[Figure: P-R curve]

How is the P-R curve evaluated? If the P-R curve of one learner $A$ is completely enclosed by the P-R curve of another learner $B$, then $B$ performs better than $A$. If the curves of $A$ and $B$ intersect, the learner with the larger area under the curve performs better. In general, however, the area under the curve is difficult to estimate, so the “Break-Even Point” (BEP) is used instead: it is the value at which $P=R$, and the higher the Break-Even Point, the better the performance.
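The ranking procedure for the P-R curve can be sketched as follows, assuming labels are encoded as 1 (positive) and 0 (negative) and that no two scores are tied; the scores are made-up values. In this discrete procedure $P$ equals $R$ exactly when the number of samples predicted positive equals the number of true positives in the data, which gives a simple way to read off the break-even point.

```python
def pr_curve(scores, labels):
    """Rank by score (descending); return the (R, P) point after each sample is predicted positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points, tp = [], 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label                                # label is 1 for positive, 0 for negative
        points.append((tp / total_pos, tp / k))    # (recall, P) with k samples predicted positive
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # made-up learner outputs
labels = [1, 1, 0, 1, 0, 0]

points = pr_curve(scores, labels)
for R, P in points:
    print(f"R = {R:.3f}, P = {P:.3f}")

# With k = (number of positives) samples predicted positive, P and R coincide.
break_even = points[sum(labels) - 1][1]
print(f"break-even point = {break_even:.3f}")
```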

Since $P$ and $R$ can conflict, they need to be considered together. The most common method is the F-Measure, also known as the F-Score, which is a weighted harmonic mean of $P$ and $R$, i.e.

$\frac{1}{F_\beta}=\frac{1}{1+\beta^2}\cdot(\frac{1}{P}+\frac{\beta^2}{R})$

$F_\beta=\frac{(1+\beta^2)\times{P}\times{R}}{(\beta^2\times{P})+R}$

In particular, when $\beta=1$ it becomes the common $F1$ measure, the harmonic mean of $P$ and $R$; the higher the $F1$, the better the model performs.

$\frac{1}{F1}=\frac{1}{2}\cdot(\frac{1}{P}+\frac{1}{R})$

$F1=\frac{2\times{P}\times{R}}{P+R}=\frac{2\times{TP}}{ALL+TP-TN}$, where $ALL$ is the total number of samples.
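The $F_\beta$ formula and its $F1$ special case can be checked numerically. The confusion-matrix counts below are made up, and returning 0.0 when both $P$ and $R$ are zero is an assumed convention.

```python
def f_beta(P, R, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if P == 0 and R == 0:
        return 0.0  # assumed convention for the degenerate case
    b2 = beta ** 2
    return (1 + b2) * P * R / (b2 * P + R)

# Made-up confusion matrix: TP = 3, FP = 1, FN = 2, TN = 4.
TP, FP, FN, TN = 3, 1, 2, 4
ALL = TP + FP + FN + TN
P = TP / (TP + FP)   # 0.75
R = TP / (TP + FN)   # 0.6

f1 = f_beta(P, R)
f1_from_counts = 2 * TP / (ALL + TP - TN)   # the count-based form of F1
f2 = f_beta(P, R, beta=2.0)     # beta > 1 gives more weight to R
f05 = f_beta(P, R, beta=0.5)    # beta < 1 gives more weight to P
print(f1, f1_from_counts, f2, f05)
```

Since $R<P$ in this example, weighting $R$ more heavily lowers the score and weighting $P$ more heavily raises it.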

Sometimes we have multiple binary confusion matrices, e.g. from multiple training runs or from training on multiple data sets. There are then two ways to estimate global performance: macro and micro. Macro first calculates the $P$ and $R$ values of each confusion matrix, then averages them into $macroP$ and $macroR$, and from these calculates $F_\beta$ or $F1$. Micro first averages the TP, FP, TN and FN of the confusion matrices, then calculates $P$ and $R$ from the averages, and from these $F_\beta$ or $F1$.

$macroP=\frac{1}{n}\sum\limits_{i=1}^{n}P_i$

$macroR=\frac{1}{n}\sum\limits_{i=1}^{n}R_i$

$macroF1=\frac{2\times{macroP}\times{macroR}}{macroP+macroR}$

$microP=\frac{\overline{TP}}{\overline{TP}+\overline{FP}}$

$microR=\frac{\overline{TP}}{\overline{TP}+\overline{FN}}$

$microF1=\frac{2\times{microP}\times{microR}}{microP+microR}$
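The difference between the two averaging schemes shows up clearly on a small example. The sketch assumes each confusion matrix is given as a tuple (TP, FP, FN, TN) and computes the micro values from the averaged TP, FP and FN, using $P=\frac{TP}{TP+FP}$ and $R=\frac{TP}{TP+FN}$ as defined earlier; the counts are made up.

```python
def f1(P, R):
    """Harmonic mean of P and R."""
    return 2 * P * R / (P + R)

def macro_scores(matrices):
    """Average the per-matrix P and R, then combine them into F1."""
    Ps = [tp / (tp + fp) for tp, fp, fn, tn in matrices]
    Rs = [tp / (tp + fn) for tp, fp, fn, tn in matrices]
    macro_p = sum(Ps) / len(Ps)
    macro_r = sum(Rs) / len(Rs)
    return macro_p, macro_r, f1(macro_p, macro_r)

def micro_scores(matrices):
    """Average the counts first, then compute P, R and F1 from the averages."""
    n = len(matrices)
    tp = sum(m[0] for m in matrices) / n
    fp = sum(m[1] for m in matrices) / n
    fn = sum(m[2] for m in matrices) / n
    micro_p = tp / (tp + fp)
    micro_r = tp / (tp + fn)
    return micro_p, micro_r, f1(micro_p, micro_r)

# Two made-up confusion matrices as (TP, FP, FN, TN).
matrices = [(8, 2, 2, 8), (1, 3, 1, 15)]

macro_p, macro_r, macro_f1 = macro_scores(matrices)
micro_p, micro_r, micro_f1 = micro_scores(matrices)
print(f"macro: P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"micro: P={micro_p:.3f} R={micro_r:.3f} F1={micro_f1:.3f}")
```

Macro averaging gives every confusion matrix the same weight, whereas micro averaging lets matrices with larger counts dominate, so the two can differ noticeably.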

ROC & AUC

The ROC curve is very similar to the P-R curve: in both, the samples are ranked and predicted as positive one by one in that order. The difference is that the ROC curve takes the “False Positive Rate” (FPR) as the horizontal axis and the “True Positive Rate” (TPR) as the vertical axis. ROC focuses on the quality of the ranking that the learner’s predicted values induce on the test samples.

$TPR=\frac{TP}{TP+FN}$

$FPR=\frac{FP}{TN+FP}$

Looking at the extremes of the curve: if every sample is predicted as positive, then FN=0 and TN=0, which gives the point (1,1); if every sample is predicted as negative, then TP=0 and FP=0, which gives the point (0,0). We can line the samples up in a queue, ranked by predicted value, and split the queue at different truncation points (i.e. thresholds) to analyze the shape of the curve. With the coordinates written as (FPR, TPR), (0,0) means that all samples are predicted as negative, (1,1) means that all samples are predicted as positive, (0,1) is the ideal case in which all positive cases appear before the negative cases, and (1,0) is the worst case in which all negative cases appear before the positive cases.

Similarly, if the ROC curve of one learner $A$ is completely enclosed by that of another learner $B$, $B$ is said to perform better than $A$. If the curves of $A$ and $B$ intersect, the learner with the larger area under the curve performs better. The area under the ROC curve is called AUC (Area Under ROC Curve). Unlike for the P-R curve, the AUC is easy to estimate: it is the sum of the areas of the small trapezoids under the ROC curve between consecutive points $(x_i, y_i)$. The larger the AUC, the better the quality of the ranking. An AUC of 1 means that all positive examples are ranked before the negative ones, and an AUC of 0 means that all negative examples are ranked before the positive ones.

$AUC=\frac{1}{2}\sum\limits_{i=1}^{m-1}(x_{i+1}-x_i)\cdot(y_i+y_{i+1})$
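The AUC formula can be applied directly to the points of an ROC curve. The sketch below plots FPR on the x-axis and TPR on the y-axis, and assumes 0/1 labels, at least one sample of each class and no tied scores; the scores are made-up values.

```python
def roc_points(scores, labels):
    """Rank by score (descending) and return the ROC points as (FPR, TPR) pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]          # threshold above every score: nothing predicted positive
    tp = fp = 0
    for _, label in ranked:        # lower the threshold past one sample at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """AUC = 1/2 * sum of (x_{i+1} - x_i) * (y_i + y_{i+1}) over consecutive points."""
    return 0.5 * sum((x2 - x1) * (y1 + y2)
                     for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]

area = auc(roc_points(scores, labels))
print(f"AUC = {area:.4f}")  # 8 of the 9 positive/negative pairs are ordered correctly
```

In this example the AUC equals the fraction of positive/negative pairs in which the positive sample is ranked first, which is why it measures the quality of the ranking.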

Cost-sensitive Error Rate & Cost Curve

In the above approaches all mistakes of the learner are treated equally, but in reality the cost of predicting a positive sample as negative is often different from the cost of predicting a negative sample as positive. For example, diagnosing a healthy person as ill only leads to additional checks, while diagnosing an ill person as healthy puts a life at risk. Taking binary classification as an example, we therefore introduce a cost matrix.

Fact \ Prediction	class 0	class 1
class 0	0	$cost_{10}$
class 1	$cost_{01}$	0

Under unequal error costs we want to minimize the overall cost, so the cost-sensitive error rate is:

$E(f;D;cost)=\frac{1}{m}(\sum\limits_{x_i\in{D}^+}\mathbb I(f(x_i)\neq{y_i})\times{cost_{01}}+\sum\limits_{x_i\in{D^-}}{\mathbb I}(f(x_i)\neq{y_i})\times{cost_{10}})$
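The formula can be turned into a few lines of code. Following the formula, `cost_01` is charged for each misclassified sample of $D^+$ and `cost_10` for each misclassified sample of $D^-$; encoding the positive class as 1 and the example costs are assumptions.

```python
def cost_sensitive_error(predictions, labels, cost_01, cost_10, positive=1):
    """E(f; D; cost): average cost of the mistakes, weighted by the true class of each sample."""
    total = 0.0
    for pred, true in zip(predictions, labels):
        if pred != true:
            # mistakes on D+ cost cost_01, mistakes on D- cost cost_10
            total += cost_01 if true == positive else cost_10
    return total / len(labels)

labels =      [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 1, 1, 1, 0, 0, 0]   # one mistake on D+, two mistakes on D-

weighted = cost_sensitive_error(predictions, labels, cost_01=5, cost_10=1)
plain = cost_sensitive_error(predictions, labels, cost_01=1, cost_10=1)
print(weighted)  # (5 + 1 + 1) / 8 = 0.875
print(plain)     # 3 / 8 = 0.375, the ordinary error rate
```

With both costs set to 1 the measure reduces to the ordinary error rate.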

Similarly, under unequal error costs the ROC curve is replaced by a cost curve. The horizontal axis of the cost curve is the probability cost of the positive case, $P(+)cost$, which takes values in [0,1], where $p$ is the probability that a sample is positive; the vertical axis is the normalized cost, which also takes values in [0,1].

$P(+)cost=\frac{p\times{cost_{01}}}{p\times{cost_{01}}+(1-p)\times{cost_{10}}}$

$cost_{norm}=\frac{FNR\times{p}\times{cost_{01}}+FPR\times(1-p)\times{cost_{10}}}{p\times{cost_{01}}+(1-p)\times{cost_{10}}}$

Plotting the cost curve is simple: take a point on the ROC curve, read off its FPR and TPR, and calculate $FNR=1-TPR$. Then draw a line segment from (0, FPR) to (1, FNR) in the cost plane; the area under this segment represents the expected overall cost under that condition. Transform every point of the ROC curve into such a line segment and take the lower boundary of all segments; the area it encloses is the expected overall cost of the learner under all conditions.

[Figure: cost curve and expected overall cost]
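The construction can be sketched numerically. Each ROC point contributes the line $FPR\cdot(1-x)+FNR\cdot x$ over $x=P(+)cost$, which runs from (0, FPR) to (1, FNR); the lower envelope of these lines is integrated with the trapezoid rule. The grid resolution and the function name are assumptions, and the result is a numerical approximation.

```python
def expected_total_cost(roc, steps=1000):
    """Area under the lower envelope of the cost lines derived from (FPR, TPR) points."""
    def envelope(x):
        # Line for one ROC point: from (0, FPR) to (1, FNR), with FNR = 1 - TPR.
        return min(fpr * (1 - x) + (1 - tpr) * x for fpr, tpr in roc)

    xs = [i / steps for i in range(steps + 1)]
    ys = [envelope(x) for x in xs]
    # Trapezoid rule over the grid of P(+)cost values.
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(steps))

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # ROC passes through the ideal point (0, 1)
trivial = [(0.0, 0.0), (1.0, 1.0)]               # only "all negative" and "all positive"

print(expected_total_cost(perfect))  # 0.0: the ideal point gives a zero-cost line
print(expected_total_cost(trivial))  # about 0.25: area under min(x, 1 - x)
```

A learner whose ROC curve reaches the ideal point has zero expected overall cost, while one that can only predict a single class for every sample is bounded by the triangle under the two diagonal lines.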

Linear regression

Sun, 05 May 2019 00:00:00 +0100

Linear probability regression

Sun, 05 May 2019 00:00:00 +0100

Basic concepts of Decision tree

Sun, 05 May 2019 00:00:00 +0100

In the last two articles, we introduced a variety of common evaluation methods and performance measures, so that we can select the most appropriate ones to calculate the learner’s test error based on the characteristics of the data set and the task.

However, the test error is affected by many factors, such as the randomness of the algorithm (e.g. K-Means) and the differences between test sets, so the same model may get different results each time. Moreover, the test error is only an approximation of the generalization error, not the true generalization performance of the learner.

So how do we compare the performance measures of one or more learners on the same or different test sets? This is what comparison tests are for. In addition, bias and variance are an important tool for explaining a learner’s generalization performance. This post continues from the previous one and focuses on comparison tests, bias and variance.
