<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="http://bllguo.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://bllguo.github.io/" rel="alternate" type="text/html" /><updated>2022-05-17T21:45:41+00:00</updated><id>http://bllguo.github.io/feed.xml</id><title type="html">Yet Another ML Blog</title><subtitle>data science and ML theory</subtitle><entry><title type="html">Neural Network Basics</title><link href="http://bllguo.github.io/Neural-network-basics/" rel="alternate" type="text/html" title="Neural Network Basics" /><published>2019-10-26T00:00:00+00:00</published><updated>2019-10-26T00:00:00+00:00</updated><id>http://bllguo.github.io/Neural-network-basics</id><content type="html" xml:base="http://bllguo.github.io/Neural-network-basics/">&lt;p&gt;These days there are too many deep learning resources and tutorials out there to count! Regardless, it would be remiss to gloss over the basics in a blog such as this. Let us quickly run through the fundamental ideas behind artificial neural networks.&lt;/p&gt;

&lt;h2 id=&quot;1-setup&quot;&gt;1. Setup&lt;/h2&gt;

&lt;p&gt;Recall that linear regression and classification models can be expressed as:&lt;/p&gt;

\[y(x, w) = f(\sum_{j=1}^M w_j\phi_j(x))\]

&lt;p&gt;In the linear regression case, $f(.)$ is merely the identity. In classification, it is a nonlinear activation function, such as the sigmoid $\sigma(.)$ in logistic regression (for instance, in binary classification, we set $p(C_1|x) = y(\phi) = \sigma(w^T\phi(x))$). What if we extend this model such that the basis functions $\phi_j(x)$ themselves are parameterized, alongside the existing coefficients $w_j$? In neural networks, we will do just that, using a series of functional transformations that we organize into “layers.”&lt;/p&gt;

&lt;p&gt;Suppose we have $D$-dimensional data $X$. We start with the first layer by defining $M^{(1)}$ linear combinations of the components of $X$ as follows:&lt;/p&gt;

\[a_j = \sum_{i=0}^D w_{ji}^{(1)}x_i\]

&lt;p&gt;where $j = 1, …, M^{(1)}$. These $a_j$ are the activations. The $w_{ji}^{(1)}$ are the weights, with the $w_{j0}^{(1)}$ being the biases (we roll them in by adding a corresponding $x_0 = 1$). The layer number is denoted by superscript. Take note that $M^{(1)}$ is a hyperparameter we have chosen for the number of activations - linear combinations of the inputs $X$ - in the first layer. Now, we apply a &lt;em&gt;differentiable&lt;/em&gt; and nonlinear activation function $h(.)$ to the activations:&lt;/p&gt;

\[z_j = h(a_j)\]

&lt;p&gt;Think of the $z_j$ as the outputs of the nonlinear basis functions $\phi_j(x)$ in normal linear models. They are known as the hidden units, because we do not typically have visibility into their values - they are directly fed into the next layer.&lt;/p&gt;

&lt;p&gt;Let’s take stock again. We fed the original data $x_0=1, x_1, …, x_D$ into the first layer. They were linearly combined into $M^{(1)}$ activations $a_j$ according to the weights $w_{ji}^{(1)}$, then passed through a nonlinear activation function $h(.)$. So the first layer spat out $z_j$, $M^{(1)}$ values in total.&lt;/p&gt;

&lt;p&gt;On to the next layer. We simply repeat the process with our hidden units. Now the inputs to the layer are our $z_j$:&lt;/p&gt;

\[a_k = \sum_{j=0}^{M^{(1)}} w_{kj}^{(2)}z_j\]

&lt;p&gt;where $k = 1, …, M^{(2)}$. $M^{(2)}$ is a hyperparameter we have chosen for the number of activations in the &lt;em&gt;second&lt;/em&gt; layer.&lt;/p&gt;

&lt;p&gt;As you may have guessed, we can continue in this way ad infinitum. But let us stick to two layers for now. The output layer, layer 2 in this case, needs to be treated differently compared to the inner, hidden layers. Whereas there were no real restrictions on choosing $M^{(1)}$ other than practical considerations, the choice of $M^{(2)}$ matters depending on the output we are expecting. For instance, in a regression task with a single target, we want a single value from the neural network. So $M^{(2)} = 1$. And we would simply use the identity for the activation function such that $y_k = a_k$. In multiclass classification, we would want to output a $p$-dimensional vector where $p$ is the number of classes. Then we can use a softmax for the activation function $\sigma_i(z) = \frac{\exp(z_i)}{\sum_{j=1}^p \exp(z_j)}$ to squash it into class probabilities.&lt;/p&gt;

&lt;p&gt;So overall, this two-layer network looks like:&lt;/p&gt;

\[y_k(x, w) = \sigma(\sum_{j=0}^{M^{(1)}}w_{kj}^{(2)}h(\sum_{i=0}^D w_{ji}^{(1)}x_i))\]

&lt;p&gt;Notice that if all the activation functions of all the hidden units are linear, we simply have a composition of linear transformations. This defeats the purpose; in this case we can always simply represent the network as a linear transformation without the intermediate transformations. So the nonlinearity of the activation functions is crucial.&lt;/p&gt;
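&lt;p&gt;To make this concrete, here is a minimal sketch of the two-layer forward pass in NumPy, assuming $\tanh$ for the hidden activation $h(.)$ and the identity output activation of the regression case; all names and sizes here are illustrative, not canonical:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W1, W2):
    """Two-layer feed-forward pass: tanh hidden layer, identity output.

    A column of ones is appended to absorb the biases, matching the
    convention x_0 = 1 in the text. Illustrative shapes:
    X is (N, D), W1 is (D+1, M1), W2 is (M1+1, M2).
    """
    ones = np.ones((X.shape[0], 1))
    A1 = np.hstack([ones, X]) @ W1    # activations a_j
    Z1 = np.tanh(A1)                  # hidden units z_j = h(a_j)
    A2 = np.hstack([ones, Z1]) @ W2   # output-layer activations a_k
    return A2                         # identity activation for regression

X = rng.normal(size=(5, 3))           # N=5 points, D=3
W1 = rng.normal(size=(4, 10))         # M1=10 hidden units
W2 = rng.normal(size=(11, 1))         # single regression output
y = forward(X, W1, W2)
```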

&lt;p&gt;We call these &lt;em&gt;feed-forward&lt;/em&gt; networks because evaluation and computation of the network propagates strictly forward. There must be no closed directed cycles, ensuring that the outputs are deterministic functions of the inputs. This latter point also marks a distinction between neural networks and probabilistic graphical models: the nodes, or neurons, here are deterministic rather than stochastic.&lt;/p&gt;

&lt;p&gt;Finally - it has been shown that neural networks with at least one hidden layer are universal approximators. That is, given enough hidden units, they can approximate any continuous function to arbitrary accuracy. Of course, the difficulty is in finding the appropriate set of parameters; in practice deeper networks have a much easier time representing complex nonlinearities.&lt;/p&gt;

&lt;h2 id=&quot;2-training&quot;&gt;2. Training&lt;/h2&gt;

&lt;p&gt;Now, how can we find the weight vector $w$ that minimizes an error function $E(w)$? If we take a step from $w$ to $w+\delta w$, the change in $E(w)$ is $\delta E \approx \delta w^T\nabla E(w)$. Clearly at a minimum of the error function, $\nabla E(w) = 0$. Of course, in general we cannot hope to find an analytical solution to $\nabla E(w) = 0$, so we will use numerical methods. These generally take the form of:&lt;/p&gt;

\[w^{(\tau + 1)} = w^{(\tau)} + \Delta w^{(\tau)}\]

&lt;p&gt;$\tau$ is the iteration number. Optimization methods differ in specification of $\Delta w^{(\tau)}$, but many use the gradient $\nabla E(w)$.&lt;/p&gt;

&lt;p&gt;The simplest method is gradient descent, where at every iteration we take a small step in the negative gradient direction - the direction of greatest rate of decrease of the error function. In &lt;em&gt;batch&lt;/em&gt; gradient descent, we compute $\nabla E$ over the entire training set in every iteration. There are more efficient methods than gradient descent, such as conjugate gradients, in which successive search directions are chosen to be mutually conjugate rather than simply following the gradient. That is, instead of stepping in the direction of the gradient, we take a step in a direction chosen so that we avoid undoing progress made along previous directions. Quasi-Newton methods are also more robust than simple batch gradient descent. They exploit second-order information as in Newton’s method, but sidestep its main drawback by iteratively building an approximation to the inverse Hessian rather than computing it directly.&lt;/p&gt;

&lt;p&gt;Many error functions can be represented as a sum over individual observations, such as in situations minimizing the negative log likelihood over i.i.d. data. Stochastic gradient descent, otherwise known as online gradient descent, exploits this to update the weights one observation at a time:&lt;/p&gt;

\[E(w) = \sum_n E_n(w)\]

\[w^{(\tau + 1)} = w^{(\tau)} - \eta\nabla E_n(w^{(\tau)})\]

&lt;p&gt;This has the benefit of being usable in online settings, and it is also less prone to becoming trapped in local minima: a stationary point of the error over the entire dataset will generally not be a stationary point of each individual $E_n$.&lt;/p&gt;
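&lt;p&gt;As a minimal sketch of stochastic gradient descent, consider least-squares on synthetic data, where $\nabla E_n(w) = (w^Tx_n - t_n)x_n$; the learning rate and epoch count here are arbitrary illustrative choices:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy problem: E(w) = sum_n E_n(w), with E_n the per-observation squared error.
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(3)
eta = 0.05                                # learning rate
for epoch in range(20):
    for n in rng.permutation(len(X)):     # one observation at a time
        grad_n = (X[n] @ w - t[n]) * X[n] # nabla E_n(w)
        w = w - eta * grad_n              # SGD update
```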

&lt;h2 id=&quot;3-evaluating-ew-via-backpropagation&quot;&gt;3. Evaluating $\nabla E(w)$ via Backpropagation&lt;/h2&gt;

&lt;p&gt;How can we efficiently compute the gradient $\nabla E(w)$? This is done via the process of error backpropagation.&lt;/p&gt;

&lt;p&gt;Consider a simple linear model, where the outputs $y_k$ are linear combinations of the inputs $x_i$:&lt;/p&gt;

\[y_k = \sum_i w_{ki}x_i\]

&lt;p&gt;Let the error function be:&lt;/p&gt;

\[E_n = \frac{1}{2}\sum_k (y_{nk} - t_{nk})^2\]

&lt;p&gt;where $y_{nk} = y_k(x_n, w)$. The partial with respect to weight $w_{ji}$:&lt;/p&gt;

\[\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial y_{nj}}\frac{\partial y_{nj}}{\partial w_{ji}}\]

\[\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})x_{ni}\]

&lt;p&gt;In a neural network, each neuron computes:&lt;/p&gt;

\[a_j = \sum_i w_{ji}z_i\]

\[z_j = h(a_j)\]

&lt;p&gt;where $h(.)$ is some nonlinearity. Successive application of these equations is &lt;em&gt;forward&lt;/em&gt; propagation of information through the network. Now consider evaluating $\frac{\partial E_n}{\partial w_{ji}}$. $E_n$ depends on $w_{ji}$ through the activation $a_j$:&lt;/p&gt;

\[\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_{nj}}\frac{\partial a_{nj}}{\partial w_{ji}}\]

&lt;p&gt;Going forward we will omit the $n$-subscripts.&lt;/p&gt;

&lt;p&gt;Let the errors $\delta$ be defined as:&lt;/p&gt;

\[\delta_j = \frac{\partial E_n}{\partial a_{j}}\]

&lt;p&gt;We also know that:&lt;/p&gt;

\[\frac{\partial a_{j}}{\partial w_{ji}} = z_i\]

&lt;p&gt;since $a_j = \sum_i w_{ji}z_i$. So the partial of the untransformed activation w.r.t the weights is the input from the previous layer.&lt;/p&gt;

&lt;p&gt;Thus:&lt;/p&gt;

\[\frac{\partial E_n}{\partial w_{ji}} = \delta_jz_i\]

&lt;p&gt;Expanding to multiple layers:&lt;/p&gt;

&lt;p&gt;For the output layer’s neurons, we know that:&lt;/p&gt;

\[\delta_k = \frac{\partial E_n}{\partial a_{k}} = y_k - t_k\]

&lt;p&gt;For the hidden layers, we simply need to apply the chain rule successively:&lt;/p&gt;

\[\delta_j = \frac{\partial E_n}{\partial a_{j}} = \sum_k \frac{\partial E_n}{\partial a_{k}}\frac{\partial a_{k}}{\partial a_{j}}\]

\[\delta_j = \sum_k \delta_k\frac{\partial a_{k}}{\partial a_{j}}\]

&lt;p&gt;Since $a_k = \sum_j w_{kj}h(a_j)$, we have $\frac{\partial a_k}{\partial a_j} = w_{kj}h'(a_j)$, and thus:&lt;/p&gt;

\[\delta_j = h'(a_j)\sum_k w_{kj}\delta_k\]

&lt;p&gt;That is, $\delta_j$ for a hidden unit can be obtained by using the $\delta$’s of the neurons in the following layer (figure from &lt;a href=&quot;https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/&quot;&gt;Bishop&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/backprop.png&quot; alt=&quot;backprop&quot; /&gt;&lt;/p&gt;
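&lt;p&gt;The recursion above can be sketched in NumPy for a small two-layer network (tanh hidden units, identity output, squared error, biases omitted for brevity), with a finite-difference check confirming the backpropagated gradient:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)

def net_and_grads(x, t, W1, W2):
    """One forward/backward pass for a two-layer net, following the
    delta recursion in the text (tanh hidden, identity output,
    squared error; biases omitted for brevity)."""
    a1 = W1 @ x                             # hidden activations a_j
    z1 = np.tanh(a1)                        # z_j = h(a_j)
    y = W2 @ z1                             # output, identity activation
    E = 0.5 * np.sum((y - t) ** 2)
    delta2 = y - t                          # output-layer deltas
    delta1 = (1 - z1**2) * (W2.T @ delta2)  # h'(a_j) * sum_k w_kj delta_k
    gW2 = np.outer(delta2, z1)              # dE/dw_kj = delta_k z_j
    gW1 = np.outer(delta1, x)               # dE/dw_ji = delta_j x_i
    return E, gW1, gW2

x = rng.normal(size=4)
t = rng.normal(size=2)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(2, 5))
E, gW1, gW2 = net_and_grads(x, t, W1, W2)

# Finite-difference check on one weight.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
Ep, _, _ = net_and_grads(x, t, W1p, W2)
numeric = (Ep - E) / eps
```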

&lt;p&gt;From here, with all the $\delta$s computed, we can use whichever update rule we wish in order to update the corresponding weights $w$.&lt;/p&gt;</content><author><name></name></author><summary type="html">These days there are too many deep learning resources and tutorials out there to count! Regardless, it would be remiss to gloss over the basics in a blog such as this. Let us quickly run through the fundamental ideas behind artificial neural networks.</summary></entry><entry><title type="html">Classification</title><link href="http://bllguo.github.io/Classification/" rel="alternate" type="text/html" title="Classification" /><published>2019-10-22T00:00:00+00:00</published><updated>2019-10-22T00:00:00+00:00</updated><id>http://bllguo.github.io/Classification</id><content type="html" xml:base="http://bllguo.github.io/Classification/">&lt;p&gt;I realize that the order of posts here seems without rhyme or reason. I have no justification to offer. But these posts are better late than never! Here we proceed to lay another block of the foundation by discussing classification, logistic regression, and finally, generalized linear models.&lt;/p&gt;

&lt;h2 id=&quot;1-the-classification-problem&quot;&gt;1. The Classification Problem&lt;/h2&gt;

&lt;p&gt;In classification, we wish to assign one of several classes $C_k$ to input data $x$.&lt;/p&gt;

&lt;p&gt;The classification problem can be broken down into two stages, inference and decision.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Inference stage: use the training data to learn the conditional posterior probabilities $p(C_k|x)$. Here we are considering the priors as $p(C_k)$. Thus, by Bayes’ theorem:&lt;/p&gt;

\[p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Decision stage: use these posterior probabilities to assign classes.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are three general approaches to classification, based on how they approach these stages.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Generative&lt;/p&gt;

    &lt;p&gt;Determine the class-conditional probability densities $p(x|C_k)$ for each class, as well as the priors $p(C_k)$. From there, use Bayes’ theorem to compute $p(C_k|x)$:&lt;/p&gt;

\[p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)} = \frac{p(x|C_k)p(C_k)}{\sum_k p(x|C_k)p(C_k)}\]

    &lt;p&gt;It is generative in the sense that we are also modeling the distribution of the inputs $x$, which allows us to generate synthetic $x$. This is demanding since we need to find the joint distribution over $x, C_k$ (the class priors are easy since we can estimate simply from the class proportions in the training set). When $x$ is of high dimensionality, we may need a large dataset to do this. Having the marginal density $p(x)$ can also be useful for outlier detection, where we identify points that have low probability under the model, and for which the model's predictions are therefore likely to be poor.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Discriminative&lt;/p&gt;

    &lt;p&gt;Determine the posteriors $p(C_k|x)$ directly, then use decision theory to assign classes. Discriminative approaches require less computational power, fewer parameters, and less data, and we still get to have the posteriors $p(C_k|x)$.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Discriminant&lt;/p&gt;

    &lt;p&gt;Determine a discriminant function $f(x)$ which directly maps $x$ to a class label. Discriminant approaches lose the probabilistic outputs entirely.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will be skipping over discriminants in the following discussion.&lt;/p&gt;

&lt;h2 id=&quot;2-the-generative-approach&quot;&gt;2. The Generative Approach&lt;/h2&gt;

&lt;p&gt;Let us start with the simplest case, binary classification. Recall we want to model the class-conditional densities $p(x|C_k)$ as well as the priors $p(C_k)$ in order to compute the posteriors $p(C_k|x)$ via Bayes’ theorem.&lt;/p&gt;

\[p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}\]

&lt;p&gt;Now generalizing to multiclass:&lt;/p&gt;

\[p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_j p(x|C_j)p(C_j)}\]

\[p(C_k|x) = \frac{\exp(a_k)}{\sum_j\exp(a_j)}\]

&lt;p&gt;where $a_k = \ln{p(x|C_k)p(C_k)}$. The form of the posterior is the softmax function applied to $a_k$.&lt;/p&gt;

&lt;p&gt;We can now throw in assumptions about the class-conditional densities $p(x|C_k)$ and solve for the posteriors.&lt;/p&gt;

&lt;p&gt;For example, suppose all of our features are binary. If there are $m$ features and $K$ classes, we have $K \times 2^m$ possible values of $(x, y)$ to enumerate in order to describe the discrete distribution. This is clearly infeasible at high dimensionality, and it only gets worse if we relax the binary condition.&lt;/p&gt;

&lt;p&gt;A simplified representation can be attained by assuming conditional independence of the features given the class $C_k$:&lt;/p&gt;

\[p(x|C_k) = \prod_j P(x_j|C_k)\]

&lt;p&gt;Then, since $p(x)$ does not depend on the class, we can pick the most probable class like so:&lt;/p&gt;

\[\text{arg max}_c\text{ }p(C_k=c|x) = \text{arg max}_c\text{ }P(C_k=c)\prod_j P(x_j|C_k=c)\]

&lt;p&gt;This is called the naive Bayes assumption.&lt;/p&gt;
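&lt;p&gt;A minimal naive Bayes sketch for binary features, with counting-based estimates for the priors and conditionals. The Laplace smoothing is an assumption added here to avoid zero probabilities; it is not part of the derivation above:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
# Binary features; estimate p(x_j = 1 | C_k) and p(C_k) by counting.
X = rng.integers(0, 2, size=(300, 5))
y = rng.integers(0, 2, size=300)

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
# Laplace smoothing (add-one) avoids zero probabilities.
cond = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                 for c in classes])       # p(x_j = 1 | C_k)

def predict(x):
    # argmax_c of log p(C_k=c) + sum_j log p(x_j | C_k=c)
    log_post = (np.log(priors)
                + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1))
    return classes[np.argmax(log_post)]

pred = predict(X[0])
```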

&lt;h2 id=&quot;3-discriminative-models-and-logistic-regression&quot;&gt;3. Discriminative Models and Logistic Regression&lt;/h2&gt;

&lt;p&gt;In the generative approach, we can use maximum likelihood to estimate the parameters of the class-conditional densities as well as the class priors, under specific assumptions over the class-conditional densities. From there we can apply Bayes’ theorem to find the posterior probabilities.&lt;/p&gt;

&lt;p&gt;We can also exploit the functional form of our model for the posterior probabilities, and use maximum likelihood to determine its parameters directly, without needing the class-conditionals or priors. Logistic regression is one example.&lt;/p&gt;

&lt;h3 id=&quot;31-binary&quot;&gt;3.1 Binary&lt;/h3&gt;

&lt;p&gt;Let’s start with the binary case once more.&lt;/p&gt;

&lt;p&gt;Recall the general formulation of the posterior:&lt;/p&gt;

\[p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}\]

&lt;p&gt;Notice that this can be expressed as:&lt;/p&gt;

\[p(C_1|x) = \sigma(a)\]

\[a = \ln{\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}}\]

&lt;p&gt;$\sigma$ is the sigmoid function:&lt;/p&gt;

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

&lt;p&gt;So instead of looking at the class-conditional density or the prior density, we roll that all up into the sigmoid. Let:&lt;/p&gt;

\[a = w^Tx\]

\[p(C_1|x) = \sigma(a) = \sigma(w^Tx)\]

&lt;p&gt;Then, the likelihood function can be expressed in terms of $p(C_1|x)$ (let $y \in \{0, 1\}$):&lt;/p&gt;

\[p(y|w) = \prod_n \sigma(w^Tx_n)^{y_n}(1-\sigma(w^Tx_n))^{1-y_n}\]

&lt;p&gt;Why can we assume this linear form for $a$? This is in fact a key point - this assumption implies a deeper assumption, that the target $y$ has a distribution in the exponential family.&lt;/p&gt;

&lt;p&gt;Taking the negative logarithm, we get the cross-entropy error function:&lt;/p&gt;

\[E(w) = -\log{(p(y|w))} = -\sum_n \left\{y_n\log{(\sigma(w^Tx_n))} + (1-y_n)\log{(1-\sigma(w^Tx_n))}\right\}\]

&lt;p&gt;Taking the gradient w.r.t $w$:&lt;/p&gt;

\[\nabla E(w) = \sum_n (\sigma(w^Tx_n) - y_n)x_n\]

&lt;p&gt;There is no closed-form solution here, but we can simply use gradient descent.&lt;/p&gt;
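&lt;p&gt;A short sketch of this gradient descent on synthetic data; the learning rate and iteration count are illustrative choices:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
w_true = np.array([2.0, -1.0])
p = 1 / (1 + np.exp(-(X @ w_true)))
y = np.less(rng.random(500), p).astype(float)   # binary targets

w = np.zeros(2)
eta = 0.1
for _ in range(2000):
    s = 1 / (1 + np.exp(-(X @ w)))  # sigma(w^T x_n)
    grad = X.T @ (s - y)            # nabla E(w), as derived above
    w = w - eta * grad / len(X)     # gradient descent step
```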

&lt;h3 id=&quot;32-multiclass&quot;&gt;3.2 Multiclass&lt;/h3&gt;

&lt;p&gt;To generalize to multiclass, let:&lt;/p&gt;

\[p(C_k|x) = f_k(x) = \frac{\exp(w^T_kx)}{\sum_j \exp(w^T_jx)}\]

&lt;p&gt;This is the softmax transformation.&lt;/p&gt;

&lt;p&gt;We use a 1-of-$K$ coding scheme where $y_n$ for a feature vector $x_n$ is a vector with all elements equal to 0, except for element $k$, which equals 1.&lt;/p&gt;

&lt;p&gt;The likelihood takes the form:&lt;/p&gt;

\[p(Y|w_1, ..., w_K) = \prod_n\prod_kp(C_k|x_n)^{y_{n, k}} = \prod_n\prod_kf_k(x_n)^{y_{n, k}}\]

&lt;p&gt;$Y$ is a $N\times K$ matrix of target variables.&lt;/p&gt;

&lt;p&gt;The negative log gives us the cross-entropy error function:&lt;/p&gt;

\[E(w_1, ..., w_K) = -\log{p(Y|w_1, ..., w_K)} = -\sum_n\sum_ky_{n, k}\log{f_k(x_n)}\]

&lt;p&gt;Keep in mind the constraint $\sum_ky_{n, k} = 1$ and take the gradient with respect to each $w_k$ to obtain:&lt;/p&gt;

\[\nabla_{w_k} E(w_1, ..., w_K) = \sum_{n=1}^N(f_{n,k} - y_{n, k})x_n\]
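&lt;p&gt;The multiclass case can be sketched the same way. The log-sum-exp shift inside the softmax is a standard numerical-stability trick, not part of the derivation above:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, K = 300, 4, 3
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, K))
logits = X @ W_true
P = np.exp(logits - logits.max(axis=1, keepdims=True))  # stability shift
P /= P.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(K, p=row) for row in P])
Y = np.eye(K)[labels]                 # 1-of-K coding

W = np.zeros((D, K))
eta = 0.5
for _ in range(500):
    A = X @ W
    F = np.exp(A - A.max(axis=1, keepdims=True))
    F /= F.sum(axis=1, keepdims=True) # f_k(x_n)
    grad = X.T @ (F - Y) / N          # (f_nk - y_nk) x_n, averaged
    W = W - eta * grad
```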

&lt;h2 id=&quot;4-generalized-linear-models&quot;&gt;4. Generalized Linear Models&lt;/h2&gt;

&lt;p&gt;A generalized linear model is a model of the form:&lt;/p&gt;

\[y = f(w^T\phi)\]

&lt;p&gt;$f(.)$ is the activation function, and $f^{-1}(.)$ is the link function. In the GLM, $y$ is a nonlinear function of a linear combination of the inputs. Notice that linear and logistic regression satisfy this definition.&lt;/p&gt;

&lt;p&gt;The link function describes the association between the mean of the target, $y$, and the linear term $w^T\phi$. It &lt;em&gt;links&lt;/em&gt; them in such a way that the range of the transformed mean, $f^{-1}(y)$, is $(-\infty, \infty)$, which allows us to form the linear equation $y = f(w^T\phi)$ and solve using MLE. Take binary classification as an example - the untransformed mean $y$ is a probability confined to $[0, 1]$, so the link function is needed to map it onto the whole real line.&lt;/p&gt;

&lt;h3 id=&quot;41-linear-regression-recap&quot;&gt;4.1 Linear Regression Recap&lt;/h3&gt;

&lt;p&gt;Now, let’s take a quick review of linear regression:&lt;/p&gt;

&lt;p&gt;Recall that for a linear regression model with Gaussian noise, the likelihood function is:&lt;/p&gt;

\[p(y|x, w, \sigma^2) = \prod_{i=1}^{N} N(y_i| w^Tx_i, \sigma^2)\]

&lt;p&gt;And the log likelihood is:&lt;/p&gt;

\[\ln(p(y|x, w, \sigma^2)) = \sum_{i=1}^{N} \ln(N(y_i| w^Tx_i, \sigma^2))\]

\[\ln(p(y|x, w, \sigma^2)) = \frac{N}{2}\ln\frac{1}{\sigma^2} - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}E(w)\]

&lt;p&gt;where:&lt;/p&gt;

\[E(w) = \sum_{i=1}^{N} (y_i-w^Tx_i)^2\]

&lt;p&gt;which is the sum-of-squares error function.&lt;/p&gt;

&lt;p&gt;Take the partial derivative w.r.t $w$:&lt;/p&gt;

\[\nabla ln(p(y|x, w, \sigma^2)) = \frac{1}{\sigma^2}\sum_{i=1}^N (y_i - w^Tx_i)x_i^T\]

&lt;p&gt;Have you noticed something? In linear regression, binary logistic regression, &lt;em&gt;and&lt;/em&gt; multiclass logistic regression, the gradient of the error function w.r.t $w$ takes the form $\sum_i(\hat{y}_i - y_i)x_i$.&lt;/p&gt;

&lt;p&gt;This is no coincidence - this is a consequence of 1. assuming the target has a distribution from the exponential family, and 2. our choice of activation function.&lt;/p&gt;
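&lt;p&gt;This shared form is easy to verify numerically. The sketch below compares the analytic per-observation gradients against central finite differences for both the squared-error/identity and cross-entropy/sigmoid pairings (a factor of $\frac{1}{2}$ is included in the squared error so that the gradient is exactly $(\hat{y} - y)x$):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=3)
w = rng.normal(size=3)

def fd_grad(loss, w, eps=1e-6):
    """Central finite-difference gradient of a scalar loss."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (loss(wp) - loss(wm)) / (2 * eps)
    return g

# Linear regression on one point: halved squared error, identity activation.
y_lin = 1.7
sq = lambda w: 0.5 * (w @ x - y_lin) ** 2
assert np.allclose(fd_grad(sq, w), (w @ x - y_lin) * x, atol=1e-5)

# Logistic regression on one point: cross-entropy, sigmoid activation.
y_bin = 1.0
sig = lambda a: 1 / (1 + np.exp(-a))
ce = lambda w: -(y_bin * np.log(sig(w @ x))
                 + (1 - y_bin) * np.log(1 - sig(w @ x)))
assert np.allclose(fd_grad(ce, w), (sig(w @ x) - y_bin) * x, atol=1e-5)
```

&lt;p&gt;In both cases the finite-difference gradient matches $(\hat{y} - y)x$, the prediction error times the input.&lt;/p&gt;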

&lt;h3 id=&quot;42-the-exponential-family&quot;&gt;4.2 The Exponential Family&lt;/h3&gt;

&lt;p&gt;The exponential family of distributions over $x$ given parameters $\eta$ is defined as:&lt;/p&gt;

\[p(x|\eta) = h(x)g(\eta)\exp(\eta^Tu(x))\]

&lt;p&gt;Where $\eta$ are the &lt;em&gt;natural parameters&lt;/em&gt; and $u(x)$ is some function of $x$. $g(\eta)$ is a normalizing coefficient that ensures:&lt;/p&gt;

\[g(\eta)\int h(x)\exp(\eta^Tu(x))dx = 1\]

&lt;p&gt;For simplicity, consider a restricted subclass of exponential family distributions, where $u(x)$ is the identity, and where we have introduced a scale parameter $s$ such that:&lt;/p&gt;

\[p(x|\eta, s) = \frac{1}{s}h(\frac{x}{s})g(\eta)\exp(\frac{\eta x}{s})\]

&lt;h3 id=&quot;43-canonical-link-functions&quot;&gt;4.3 Canonical Link Functions&lt;/h3&gt;

&lt;p&gt;Now, let us assume that our target variable $t$ is in the exponential family. Then,&lt;/p&gt;

\[p(t|\eta, s) = \frac{1}{s}h(\frac{t}{s})g(\eta)\exp(\frac{\eta t}{s})\]

&lt;p&gt;Consider taking the gradient of both sides with respect to $\eta$ and simplifying. We will arrive at:&lt;/p&gt;

\[E(t|\eta) = y = -s\frac{d}{d\eta}\ln g(\eta)\]

&lt;p&gt;Which shows us that $y$ and $\eta$ are related. Let this relationship be $\eta = \psi(y)$.&lt;/p&gt;

&lt;p&gt;Now go back and consider the log-likelihood for the model on $t$, assuming all observations have the same scale parameter.&lt;/p&gt;

\[\ln p(t|\eta, s) = \sum_{n=1}^N \ln p(t_n|\eta, s) = \sum_{n=1}^N \left(\ln g(\eta_n) + \frac{\eta_nt_n}{s}\right) + \text{const}\]

&lt;p&gt;Take the derivative with respect to $w$:&lt;/p&gt;

\[\nabla_w \ln p(t|\eta, s) = \sum_{n=1}^{N}\left(\frac{d}{d\eta_n}\ln g(\eta_n) + \frac{t_n}{s}\right)\frac{d\eta_n}{dy_n}\frac{dy_n}{da_n}\nabla a_n\]

&lt;p&gt;where $a_n = w^T\phi_n$. Now using $y = E(t|\eta)$, $\eta = \psi(y)$, and our earlier GLM definition $y_n = f(a_n)$:&lt;/p&gt;

\[\nabla_w \ln p(t|\eta, s) = \sum_{n=1}^{N}\frac{1}{s}(t_n-y_n)\psi'(y_n)f'(a_n)\phi_n\]

&lt;p&gt;If we choose the link function $f^{-1}(y) = \psi(y)$, then $f(\psi(y)) = y$. Consequently:&lt;/p&gt;

\[f'(\psi)\psi'(y) = \frac{df}{d\psi}\frac{d\psi}{dy} = \frac{dy}{d\psi}\frac{d\psi}{dy} = 1\]

&lt;p&gt;Notice $a = f^{-1}(y)$, hence $a = \psi$, and therefore from the above:&lt;/p&gt;

\[\psi'(y_n)f'(a_n) = 1\]

&lt;p&gt;And the error function has reduced to a familiar form:&lt;/p&gt;

\[\nabla E(w) = \frac{1}{s}\sum_{n=1}^N(y_n-t_n)\phi_n\]

&lt;p&gt;The proper choice of link function - the canonical link function - reduces the error gradient to this simple form, and also makes the natural parameter coincide with the linear predictor, $\eta_n = a_n = w^T\phi_n$. There are a number of other desirable statistical properties associated with the canonical link function.&lt;/p&gt;

&lt;h2 id=&quot;1-setup&quot;&gt;1. Setup&lt;/h2&gt;

&lt;p&gt;Suppose we have a training dataset consisting of $n$ observations. The $i$-th observation is a $\{x_i, y_i\}$ pair, where $x_i$ is a $D$-dimensional vector of input variables and $y_i$ the target variable (we shall assume one target, going forward). We would like to be able to predict $y$ when given some new value of $x$.&lt;/p&gt;

&lt;p&gt;The simple linear model involves a linear combination of the input variables $x$. Consider a single observation $x$; its vector components denoted $x_1, x_2, …, x_D$. Then we can imagine a function:&lt;/p&gt;

\[y(x, w) = w_0 + w_1x_1 + ... + w_Dx_D\]

&lt;p&gt;It takes in our inputs $x$ and a set of parameters $w$, and spits out a predicted target value. This is the simple linear regression model. Linear, in the sense that it is a linear combination of the &lt;em&gt;parameters&lt;/em&gt; $w$! While it also happens to be linear in the data $x$, do not be deceived. We can also consider linear combinations of nonlinear functions of the data like so:&lt;/p&gt;

\[y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(x)\]

&lt;p&gt;where the $\phi_j$ are &lt;em&gt;basis functions&lt;/em&gt;. Notice the indices - there are $M$ parameters in total, one for each $w$ (including $w_0$), and most importantly $M$ does not necessarily equal $D$. In practical terms, we can think of these basis functions as feature transformations on our original data $x$. The key is that by using nonlinear basis functions, we can turn $y(x, w)$ into a nonlinear function w.r.t the data $x$. A simple example is polynomial regression - consider an input $x$ and the basis functions as successive powers, s.t. $\phi_j(x) = x^j$. For our discussion, however, the choice of basis function has no effect on our analysis. We will choose the trivial identity function $\phi(x) = x$ and omit $\phi$ to simplify the notation.&lt;/p&gt;

&lt;h2 id=&quot;2-least-squares&quot;&gt;2. Least Squares&lt;/h2&gt;

&lt;p&gt;Let us assume that our target data is of the form:&lt;/p&gt;

\[y_i = w^Tx_i + \epsilon\]

&lt;p&gt;where $\epsilon \sim N(0, \sigma^2)$. That is, the target has mean given by $y(x,w) = w^Tx$ but is normally distributed around that value with a given variance $\sigma^2$:&lt;/p&gt;

\[p(y_i|x, w, \sigma^2) = N(y_i|w^Tx_i, \sigma^2)\]

&lt;p&gt;Then, across all observations, the likelihood function $P(D|W)$ becomes:&lt;/p&gt;

\[p(y|x, w, \sigma^2) = \prod_{i=1}^{N} N(y_i| w^Tx_i, \sigma^2)\]

&lt;p&gt;Now we can do maximum likelihood estimation to obtain the parameters $w$ and $\sigma^2$:&lt;/p&gt;

\[\ln(p(y|x, w, \sigma^2)) = \sum_{i=1}^{N} \ln(N(y_i| w^Tx_i, \sigma^2))\]

\[\ln(p(y|x, w, \sigma^2)) = \frac{N}{2}\ln\frac{1}{\sigma^2} - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}E(w)\]

&lt;p&gt;where:&lt;/p&gt;

\[E(w) = \sum_{i=1}^{N} (y_i-w^Tx_i)^2\]

&lt;p&gt;which is the sum-of-squares error function.&lt;/p&gt;

&lt;h3 id=&quot;21-solving-for-w&quot;&gt;2.1 Solving for $w$&lt;/h3&gt;

&lt;p&gt;Consider the maximization w.r.t $w$. Notice the first two terms in the log-likelihood are constant. So the MLE solution for $w$ reduces to minimizing the residual squared error.&lt;/p&gt;

&lt;p&gt;Taking the gradient of the log-likelihood w.r.t $w$ and setting it equal to 0, we can solve for $w_{MLE}$:&lt;/p&gt;

\[w_{MLE} = (X^TX)^{-1}X^Ty\]

&lt;p&gt;This should be a familiar result! Wrack your brain for a moment - we will return to this shortly.&lt;/p&gt;

&lt;p&gt;Notice that we must compute $(X^TX)^{-1}$. This can be very tricky in practice when the matrix is singular or close to singular.&lt;/p&gt;
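&lt;p&gt;A quick sketch of this result: solving the normal equations directly agrees with a dedicated least-squares solver on a well-conditioned toy problem, though in practice the solver is preferred precisely because of the conditioning issues mentioned above:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Normal equations: w = (X^T X)^{-1} X^T y, solved without forming the
# inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares solver handles ill-conditioned X more gracefully.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```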

&lt;p&gt;Here we can gain a greater understanding of the bias parameter $w_0$. We can make the bias parameter $w_0$ explicit and do MLE to see that:&lt;/p&gt;

\[w_{0, MLE} = \bar{y} - \sum_i w_i\bar{x}_i\]

&lt;p&gt;That is, it addresses the difference between the sample mean of the target, and the weighted sum of the sample means of the data. In other words - it is compensating for the bias, hence the name. More on bias later.&lt;/p&gt;

&lt;h3 id=&quot;22-solving-for-sigma2&quot;&gt;2.2 Solving for $\sigma^2$&lt;/h3&gt;

&lt;p&gt;Consider the maximization w.r.t $\sigma^2$. Taking the gradient of the log-likelihood w.r.t $\sigma^2$ and setting it equal to 0, we can solve for $\sigma^2_{MLE}$:&lt;/p&gt;

\[\sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^N (y_i-w_{MLE}^Tx_i)^2\]

&lt;p&gt;This is the variance of the true target values around the predictions of the fitted regression model.&lt;/p&gt;

&lt;h3 id=&quot;23-geometric-interpretation&quot;&gt;2.3 Geometric interpretation&lt;/h3&gt;

&lt;p&gt;Have you seen it? $Xw_{MLE}$ is the projection of $y$ onto the column space of $X$!&lt;/p&gt;

&lt;p&gt;Let us take a perspective informed by linear algebra. Consider the classic linear algebra problem - solving $Ax=b$. In this case, $Xw=y$.&lt;/p&gt;

&lt;p&gt;Sometimes, this cannot be solved exactly. When $X$ has more rows than columns ($m &amp;gt; n$, where $m$ is the number of rows and $n$ the number of columns), there are more equations than unknowns. Make the connection between $X$ and its real-world meaning - our dataset. This translates to: more observations than variables. Then there is generally no exact solution, as the $n$ columns span only a small subspace of $m$-dimensional space, and hence $y$ most likely lies outside the column space of $X$.&lt;/p&gt;

&lt;p&gt;Framed differently, $e = b - Ax = y - Xw$, the error, is not always 0. If it were, we could obtain an exact solution for $w$. But in the case of $m &amp;gt; n$, $e$ generally cannot be driven to 0. We can still find an approximate solution for $w$: the least squares solution, which chooses the $\hat{w}$ minimizing the Euclidean norm of $e$. Multiplying both sides by $X^T$ gives the normal equations:&lt;/p&gt;

\[X^TX\hat{w} = X^Ty\]

&lt;p&gt;The closest point to $y$ in $Xw$ is the projection of $y$ onto $X$.&lt;/p&gt;

&lt;p&gt;So we have shown that this is equivalent to the MLE result when assuming Gaussian noise!&lt;/p&gt;

&lt;h2 id=&quot;3-regularized-least-squares&quot;&gt;3. Regularized least-squares&lt;/h2&gt;

&lt;p&gt;We can control overfitting by introducing a regularization term to the error function. Recall our data-dependent error for $w$:&lt;/p&gt;

\[E_D(w) = \sum_{i=1}^{N} (y_i-w^Tx_i)^2\]

&lt;p&gt;We can add a regularization term and minimize this total error function instead:&lt;/p&gt;

\[E(w) = E_D(w) + \lambda E_W(w)\]

&lt;p&gt;From a probabilistic Bayesian point of view, we can interpret this regularization as assuming a prior on the parameters $w$. For example:&lt;/p&gt;

&lt;p&gt;A simple choice is a Gaussian. We don’t know much a priori - let’s say $w_j \sim N(0, \lambda^2)$.&lt;/p&gt;

\[\text{arg max}_w \log P(w|D,\sigma,\lambda) = \text{arg max}_w (\log P(D|w, \sigma) + \log P(w|\lambda))\]

&lt;p&gt;The regularization term will be $\frac{1}{2\lambda^2}w^Tw$ - the sum of squares of the weights. This regularizer is known as weight decay, as it heavily penalizes high weights due to the squaring. The resulting model is also known as ridge regression.&lt;/p&gt;

&lt;p&gt;The exact solution can be found in closed form:&lt;/p&gt;

\[w_{MAP} = (X^TX + \frac{\sigma^2}{\lambda^2}I)^{-1}X^Ty\]

&lt;p&gt;As $\lambda \rightarrow \infty$, our prior becomes broader, and MAP reduces to MLE! As $n$ increases, the entries of $X^TX$ grow linearly while the added term $\frac{\sigma^2}{\lambda^2}I$ stays fixed, so the prior&amp;#8217;s influence vanishes as we gather more data. Also notice that the added diagonal term avoids the problem of $X^TX$ being noninvertible.&lt;/p&gt;
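&lt;p&gt;A minimal sketch of the closed-form MAP solution, on toy data (the values of $\sigma$ and $\lambda$ here are illustrative assumptions):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
sigma, lam = 0.5, 1.0                  # noise std and prior std (toy values)
y = X @ w_true + sigma * rng.normal(size=n)

def ridge(X, y, sigma, lam):
    # w_MAP = (X^T X + (sigma^2 / lambda^2) I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + (sigma**2 / lam**2) * np.eye(d), X.T @ y)

w_map = ridge(X, y, sigma, lam)
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]   # unregularized solution
w_broad = ridge(X, y, sigma, 1e6)              # very broad prior: lambda -> inf
```

&lt;p&gt;With a very broad prior the MAP estimate coincides with the MLE, and for finite $\lambda$ the weights are shrunk toward 0.&lt;/p&gt;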

&lt;p&gt;A general regularizer can be used, for which the error looks like:&lt;/p&gt;

\[\frac{1}{2}\sum (y-w^Tx)^2 + \frac{\lambda}{2}\sum |w|^q\]

&lt;p&gt;When $q=1$ we have lasso regression, which corresponds to a Laplace prior. It has the property of feature selection, as it can drive the parameters to 0.&lt;/p&gt;

&lt;h2 id=&quot;4-bias-variance-decomposition-of-loss&quot;&gt;4. Bias-variance decomposition of loss&lt;/h2&gt;

&lt;p&gt;Let’s take a step back and engage in a frequentist thought experiment related to model complexity.&lt;/p&gt;

&lt;p&gt;We have a model that takes a dataset $D$ and predicts $y$ given $x$. Call it $h(x, D)$. A linear regression model, for example, would be $h(x, D)=w^T_Dx$.&lt;/p&gt;

&lt;p&gt;Using a typical squared error loss function, we can calculate the expected loss on a new observation $(x, y)$:&lt;/p&gt;

\[E(L) = E((h(x, D)-y)^2) = \int_x\int_y(h(x, D)-y)^2p(x, y)dxdy\]

&lt;p&gt;Of course, we can’t compute this without knowing $p(x, y)$, and we indeed do not know it. However, suppose we had a large number of datasets drawn i.i.d from $p(x, y)$. For any given dataset we can run our algorithm to obtain a prediction function $h(x, D)$. Clearly, different $D$ produce different predictors. The performance of the model as a whole can be evaluated by averaging over this ensemble of all possible datasets.&lt;/p&gt;

\[\bar{h}(x) = E_D(h(x, D))\]

&lt;p&gt;We want to know the expected loss (squared error) over all datasets, which will be the best way to evaluate the performance of the algorithm.&lt;/p&gt;

\[E(L) = E_{x, y, D}((h(x, D)-y)^2)\]

\[E(L) = \int_D\int_x\int_y(h(x, D)-y)^2p(x, y)p(D)dxdydD\]

&lt;p&gt;Add and subtract $\bar{h}(x)$ in the first term:&lt;/p&gt;

\[E_{x, y, D}((h(x, D)-y)^2) = E_{x, y, D}((h(x, D) - \bar{h}(x) + \bar{h}(x) - y)^2)\]

\[E_{x, y, D}((h(x, D)-y)^2) = E_{x, y, D}((h(x, D) - \bar{h}(x))^2 + (\bar{h}(x) - y)^2 + 2(h(x, D) - \bar{h}(x))(\bar{h}(x) - y))\]

&lt;p&gt;But the cross term vanishes since $E_D(h(x, D) - \bar{h}(x))=0$, and the other part doesn’t depend on $D$. So ultimately:&lt;/p&gt;

\[E_{x, y, D}((h(x, D)-y)^2) = E_{x, y, D}((h(x, D) - \bar{h}(x))^2) + E_{x, y}((\bar{h}(x) - y)^2)\]

&lt;p&gt;The first term is the variance of the model. The second term can be further decomposed. Following the same trick and defining $\bar{y}(x) = E_{y|x}(y)$ - the average $y$ at every $x$:&lt;/p&gt;

\[E_{x, y}((\bar{h}(x) - y)^2) = E_{x, y}((\bar{h}(x) - \bar{y}(x) + \bar{y}(x) - y)^2)\]

\[E_{x, y}((\bar{h}(x) - y)^2) = E_{x, y}((\bar{h}(x) - \bar{y}(x))^2 + (\bar{y}(x) - y)^2 + 2(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y))\]

&lt;p&gt;Again the cross term vanishes as $E(\bar{y}(x) - y) = 0$.&lt;/p&gt;

\[E_{x, y}((\bar{h}(x) - y)^2) = E_{x}((\bar{h}(x) - \bar{y}(x))^2) + E_{x, y}((\bar{y}(x) - y)^2)\]

&lt;p&gt;These are the squared bias and noise terms, respectively.&lt;/p&gt;

&lt;p&gt;So the expected loss is variance + bias squared + noise.&lt;/p&gt;
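&lt;p&gt;This decomposition can be checked numerically. Here is a small simulation, assuming a degree-1 polynomial fit to noisy sinusoidal data (an illustrative setup, not from this post):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3                      # noise std; the noise term is sigma**2
xs = np.linspace(0.0, 1.0, 50)   # fixed evaluation grid
ybar = np.sin(2 * np.pi * xs)    # \bar{y}(x), the true conditional mean

# Fit h(x, D) = a degree-1 polynomial on many independent datasets D.
n_datasets, n_points = 500, 12
preds = np.empty((n_datasets, xs.size))
for d in range(n_datasets):
    x_tr = rng.uniform(0.0, 1.0, n_points)
    y_tr = np.sin(2 * np.pi * x_tr) + sigma * rng.normal(size=n_points)
    preds[d] = np.polyval(np.polyfit(x_tr, y_tr, 1), xs)

hbar = preds.mean(axis=0)                   # \bar{h}(x), the average predictor
variance = ((preds - hbar) ** 2).mean()
bias_sq = ((hbar - ybar) ** 2).mean()
noise = sigma ** 2

# The mean squared distance to \bar{y}(x) splits exactly into variance + bias^2;
# the irreducible noise sigma^2 then adds on to give the expected loss.
msd = ((preds - ybar) ** 2).mean()
```

&lt;p&gt;The variance-plus-bias-squared identity holds exactly here (the cross term cancels by construction of $\bar{h}$), and adding $\sigma^2$ gives the expected loss.&lt;/p&gt;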

\[E_{x, y, D}((h(x, D)-y)^2) = E_{x, y, D}((h(x, D) - \bar{h}(x))^2) + E_{x}((\bar{h}(x) - \bar{y}(x))^2) + E_{x, y}((\bar{y}(x) - y)^2)\]</content><author><name></name></author><summary type="html">Let’s do a quick review of basic regression, to lay the framework for future posts. The goal of regression, one of the principal problems in supervised learning, is to predict the value(s) of one or more continuous target variables $y$, given a corresponding vector of input variables $x$.</summary></entry><entry><title type="html">Probabilistic PCA</title><link href="http://bllguo.github.io/Probabilistic-principal-component-analysis/" rel="alternate" type="text/html" title="Probabilistic PCA" /><published>2019-07-17T00:00:00+00:00</published><updated>2019-07-17T00:00:00+00:00</updated><id>http://bllguo.github.io/Probabilistic-principal-component-analysis</id><content type="html" xml:base="http://bllguo.github.io/Probabilistic-principal-component-analysis/">&lt;p&gt;Today let’s return to &lt;a href=&quot;https://bllguo.github.io/Principal-component-analysis/&quot;&gt;principal component analysis&lt;/a&gt; (PCA). Previously we had seen how PCA can be expressed as a variance-maximizing projection of the data onto a lower-dimension space, and how that was equivalent to a reconstruction-error-minimizing projection. Now we will show how PCA is &lt;em&gt;also&lt;/em&gt; a maximum likelihood solution to a continuous latent variable model, which provides us several useful benefits, some of which include:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Solvable using &lt;a href=&quot;https://bllguo.github.io/Expectation-maximization/&quot;&gt;expectation-maximization&lt;/a&gt; (EM) in an efficient manner, where we avoid having to explicitly construct the covariance matrix. Of course, we can also alleviate this issue in regular PCA by using singular value decomposition (SVD).&lt;/li&gt;
  &lt;li&gt;Handles missing values in the dataset&lt;/li&gt;
  &lt;li&gt;Permits a Bayesian treatment of PCA&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;1-setup&quot;&gt;1. Setup&lt;/h2&gt;

&lt;p&gt;Suppose we have $n$ observations of data $x$, with dimensionality $D$. Recall that our goal in PCA is to reduce this data by projecting $X$ onto a $M$-dimensional subspace where $M &amp;lt; D$.&lt;/p&gt;

&lt;p&gt;First let’s introduce a latent variable $z$ for the lower-dimensional principal component subspace. We assume a Gaussian prior distribution for $z$, $p(z)$ with zero mean and unit covariance.&lt;/p&gt;

\[p(z) = N(z|0, I)\]

&lt;p&gt;Let us also choose the conditional distribution $p(x|z)$ to be Gaussian:&lt;/p&gt;

\[p(x|z) = N(x|Wz + \mu, \sigma^2I)\]

&lt;p&gt;Here the mean is described as a linear function of $z$ with a $D\times M$ matrix $W$ and $D$-dimensional $\mu$.&lt;/p&gt;

&lt;p&gt;From a generative viewpoint, we can describe our $D$-dimensional data as a linear transformation of the $M$-dimensional latent $z$ with some Gaussian noise:&lt;/p&gt;

\[x = Wz + \mu + \epsilon\]

&lt;p&gt;where $\epsilon$ is $D$-dimensional Gaussian noise with zero mean and covariance $\sigma^2I$. So generating $x$ can be done by first sampling $z$ from its prior, then sampling $x$ from the distribution conditioned on that $z$.&lt;/p&gt;
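&lt;p&gt;The generative process is easy to sketch directly (the particular $W$, $\mu$, $\sigma$ below are toy assumptions for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, n = 3, 1, 200_000
W = np.array([[1.0], [0.5], [-0.5]])   # D x M loading matrix (toy values)
mu = np.array([1.0, 0.0, -1.0])
sigma = 0.2

# Generative process: z ~ N(0, I), eps ~ N(0, sigma^2 I), x = W z + mu + eps
z = rng.normal(size=(n, M))
eps = sigma * rng.normal(size=(n, D))
x = z @ W.T + mu + eps

# Implied marginal covariance C = W W^T + sigma^2 I (derived in the next section)
C = W @ W.T + sigma**2 * np.eye(D)
```

&lt;p&gt;With enough samples, the empirical mean and covariance of $x$ match $\mu$ and $WW^T + \sigma^2I$.&lt;/p&gt;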

&lt;h2 id=&quot;2-specifying-the-likelihood-and-prior&quot;&gt;2. Specifying the likelihood and prior&lt;/h2&gt;

&lt;p&gt;Now we want to determine the parameters $W, \mu, \sigma^2$ using maximum likelihood. We need to specify the likelihood function $p(x|W, \mu, \sigma^2)$ first.&lt;/p&gt;

\[p(x) = \int p(x|z)p(z)dz\]

&lt;p&gt;By the properties of the Gaussian and of linear-Gaussian models, this marginal distribution $p(x)$ is also Gaussian:&lt;/p&gt;

\[p(x) = N(x|\mu, C)\]

\[C = WW^T + \sigma^2I\]

&lt;p&gt;where $C$ is the $D\times D$ covariance matrix.&lt;/p&gt;

&lt;p&gt;This can be noticed from $x = Wz + \mu + \epsilon$:&lt;/p&gt;

\[E(x) = E(Wz+\mu+\epsilon) = \mu\]

\[cov(x) = E((Wz+\epsilon)(Wz+\epsilon)^T)\]

\[cov(x) = E(Wzz^TW^T) + E(\epsilon\epsilon^T) = WW^T + \sigma^2I\]

&lt;p&gt;Remember $z$ was defined to have zero mean and unit variance, and that $\epsilon$ was defined to have covariance $\sigma^2I$. The cross terms vanish because $z, \epsilon$ are uncorrelated.&lt;/p&gt;

&lt;p&gt;Next we compute the posterior $p(z|x)$. We will be handwavy about the mathematical details here, as we have not yet covered linear-Gaussian models, but nevertheless the final result is presented for completeness. (In any case the mathematics here is not the critical point.)&lt;/p&gt;

&lt;p&gt;It can be shown that&lt;/p&gt;

\[C^{-1} = \sigma^{-2}I - \sigma^{-2}WM^{-1}W^T\]

&lt;p&gt;where&lt;/p&gt;

\[M = W^TW + \sigma^2I\]

&lt;p&gt;which is $M\times M$.&lt;/p&gt;

&lt;p&gt;From here we can use linear-Gaussian results:&lt;/p&gt;

\[p(z|x) = N(z|M^{-1}W^T(x-\mu), \sigma^2M^{-1})\]

&lt;h2 id=&quot;3-maximum-likelihood-solution&quot;&gt;3. Maximum likelihood solution&lt;/h2&gt;

&lt;p&gt;Now we can use maximum likelihood to obtain the parameters. We will skip over the derivations, but the parameters all have closed form solutions:&lt;/p&gt;

\[\mu_{\text{MLE}} = \bar{x}\]

\[W_{\text{MLE}} = U_M(L_M - \sigma^2I)^{1/2}R\]

\[\sigma^2_{\text{MLE}} = \frac{1}{D-M}\sum_{i=M+1}^D \lambda_i\]

&lt;h3 id=&quot;31-w_textmle&quot;&gt;3.1 $W_{\text{MLE}}$&lt;/h3&gt;

&lt;p&gt;Let $S$ be the covariance matrix of the data $x$:&lt;/p&gt;

\[S = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T\]

&lt;p&gt;Then $U_M$ is a $D\times M$ matrix whose columns are eigenvectors of $S$ (at the maximum of the likelihood, the top $M$ eigenvectors).&lt;/p&gt;

&lt;p&gt;$L_M$ is $M\times M$ and contains the corresponding eigenvalues.&lt;/p&gt;

&lt;p&gt;$R$ is an arbitrary $M\times M$ orthogonal (orthonormal) matrix. This can be confusing - just substitute in $I$ if so, which is perfectly valid! $R$ can be thought of as a rotation matrix in latent $M$-space. If we substitute $W_{\text{MLE}}$ into $C$, it can be shown that $C$ is independent of $R$. The point is that the predictive density is unaffected by rotations in latent space.&lt;/p&gt;

&lt;h3 id=&quot;32-sigma2_textmle&quot;&gt;3.2 $\sigma^2_{\text{MLE}}$&lt;/h3&gt;

&lt;p&gt;We have assumed that the eigenvectors were arranged in decreasing order of the eigenvalues. Then $W$ is the principal subspace, and $\sigma^2_{\text{MLE}}$ is just the average variance of the remaining components.&lt;/p&gt;
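&lt;p&gt;A minimal sketch of the maximum likelihood fit, taking $R=I$ as suggested above (the synthetic data and its dimensions are illustrative assumptions):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, n = 5, 2, 50_000
W_true = rng.normal(size=(D, M))
sigma = 0.3
x = rng.normal(size=(n, M)) @ W_true.T + sigma * rng.normal(size=(n, D))

xbar = x.mean(axis=0)
S = (x - xbar).T @ (x - xbar) / n               # sample covariance

# Eigendecomposition, sorted by decreasing eigenvalue.
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

sigma2_mle = lam[M:].mean()                     # average discarded variance
U_M, L_M = U[:, :M], np.diag(lam[:M])
W_mle = U_M @ np.sqrt(L_M - sigma2_mle * np.eye(M))   # taking R = I
```

&lt;p&gt;Since the columns of $U_M$ are orthonormal, $W_{\text{MLE}}^TW_{\text{MLE}} + \sigma^2_{\text{MLE}}I = L_M$, and $\sigma^2_{\text{MLE}}$ recovers the true noise variance.&lt;/p&gt;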

&lt;h2 id=&quot;4-m-space-and-d-space&quot;&gt;4. $M$-space and $D$-space&lt;/h2&gt;

&lt;p&gt;Probabilistic PCA can be thought of as mapping points in $D$-space to $M$-space, and vice-versa. We can show that:&lt;/p&gt;

\[E(z|x) = M^{-1}W^T_{\text{MLE}}(x-\bar{x})\]

&lt;p&gt;Which is $x$’s posterior mean in latent space. Of course, in data space this is:&lt;/p&gt;

\[WE(z|x) + \mu\]

&lt;p&gt;The posterior covariance in latent space is:&lt;/p&gt;

\[\sigma^2M^{-1}\]

&lt;p&gt;which is actually independent of $x$.&lt;/p&gt;

&lt;p&gt;Notice that as $\sigma^2 \rightarrow 0$, the posterior mean simplifies to a familiar result:&lt;/p&gt;

\[(W^T_{\text{MLE}}W_{\text{MLE}})^{-1}W^T_{\text{MLE}}(x-\bar x)\]

&lt;p&gt;which is the orthogonal projection of $x$ onto $W$ - the standard PCA result!&lt;/p&gt;
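&lt;p&gt;We can confirm this limit numerically - a quick sketch with an arbitrary toy $W$ and a centered point (both illustrative assumptions):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(5)
D, M = 4, 2
W = rng.normal(size=(D, M))
x_centered = rng.normal(size=D)        # stands in for (x - x_bar)

def posterior_mean(W, x, sigma2):
    # E(z|x) = M^{-1} W^T x  with  M = W^T W + sigma^2 I
    Mmat = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(Mmat, W.T @ x)

# Orthogonal projection coordinates: (W^T W)^{-1} W^T x, the standard PCA result
proj = np.linalg.solve(W.T @ W, W.T @ x_centered)
```

&lt;p&gt;As $\sigma^2$ shrinks, the posterior mean converges to the orthogonal projection; for appreciable $\sigma^2$ it is shrunk away from it.&lt;/p&gt;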

&lt;h2 id=&quot;5-em-for-pca&quot;&gt;5. EM for PCA&lt;/h2&gt;

&lt;p&gt;We have just shown that probabilistic PCA can be solved using maximum likelihood with closed-form solutions. Why, then, would we want to use EM?&lt;/p&gt;

&lt;p&gt;Typical PCA requires us to evaluate the covariance matrix, which is $O(ND^2)$. In EM, we do not need the explicit covariance matrix. The costliest operations are sums over the data, which are $O(NDM)$. If $M \ll D$ then EM can be much more efficient despite having to go through multiple EM cycles.&lt;/p&gt;

&lt;p&gt;Second, if memory is a concern, or if the data is being streamed, the iterative nature of EM is useful as it can be used in an online learning fashion.&lt;/p&gt;

&lt;p&gt;Finally, thanks to our probabilistic model, we can deal with missing data in a principled way. We can marginalize over the distribution of the missing data in a generalized $E$-step (&lt;a href=&quot;https://core.ac.uk/download/pdf/397806.pdf&quot;&gt;Chen et al., 2009&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;6-bayesian-pca&quot;&gt;6. Bayesian PCA&lt;/h2&gt;

&lt;p&gt;Probabilistic PCA also allows us to take a Bayesian approach for selecting $M$. One example is by choosing a prior over $W$ that performs selection on relevant dimensions. This is an example of automatic relevance determination (ARD). We define a Gaussian prior for each component, or column of $W$, with respective precision hyperparameters $\alpha_i$:&lt;/p&gt;

\[p(W|\alpha) = \prod_{i=1}^M (\frac{\alpha_i}{2\pi})^{D/2}\exp{(-\frac{1}{2}\alpha_iw_i^Tw_i)}\]

&lt;p&gt;During the EM process, by maximizing the marginal likelihood after marginalizing out $W$, we can find the optimal $\alpha_i$. Some $\alpha_i$ will be driven to infinity, consequently pushing $w_i$ to 0. Then the finite $\alpha_i$ determine the relevant components $w_i$!&lt;/p&gt;

&lt;p&gt;We can also do a fully Bayesian treatment with priors over all parameters $\mu, \sigma^2, W$.&lt;/p&gt;</content><author><name></name></author><summary type="html">Today let’s return to principal component analysis (PCA). Previously we had seen how PCA can be expressed as a variance-maximizing projection of the data onto a lower-dimension space, and how that was equivalent to a reconstruction-error-minimizing projection. Now we will show how PCA is also a maximum likelihood solution to a continuous latent variable model, which provides us several useful benefits, some of which include: Solvable using expectation-maximization (EM) in an efficient manner, where we avoid having to explicitly construct the covariance matrix. Of course, we can also alleviate this issue in regular PCA by using singular value decomposition (SVD). Handles missing values in the dataset Permits a Bayesian treatment of PCA</summary></entry><entry><title type="html">Latent Dirichlet Allocation</title><link href="http://bllguo.github.io/Latent-dirichlet-allocation/" rel="alternate" type="text/html" title="Latent Dirichlet Allocation" /><published>2019-06-30T00:00:00+00:00</published><updated>2019-06-30T00:00:00+00:00</updated><id>http://bllguo.github.io/Latent-dirichlet-allocation</id><content type="html" xml:base="http://bllguo.github.io/Latent-dirichlet-allocation/">&lt;p&gt;Latent Dirichlet allocation (LDA) is a generative probabilistic model for discrete data. The objective is to find a lower dimensionality representation of the data while preserving the salient statistical structure - a complex way to describe clustering. It is commonly used in NLP applications as a topic model, where we are interested in discovering common topics in a set of documents.&lt;/p&gt;

&lt;h2 id=&quot;1-probability-distributions-in-lda&quot;&gt;1. Probability Distributions in LDA&lt;/h2&gt;

&lt;p&gt;First let us review some relevant probability distributions.&lt;/p&gt;

&lt;h3 id=&quot;11-poisson&quot;&gt;1.1 Poisson&lt;/h3&gt;

&lt;p&gt;Recall the Poisson distribution. It is a discrete distribution that models the probability of an event occurring $k$ times in an interval, assuming that events occur at a constant rate and independently (the occurrence of one event does not affect the probability of another occurring).&lt;/p&gt;

&lt;p&gt;The PMF is given by:&lt;/p&gt;

\[p(k) = e^{-\lambda}\frac{\lambda^k}{k!}\]

&lt;p&gt;where $\lambda$ is the average number of events per interval. Notice that the length of the interval is baked into $\lambda$. If we know the rate of events per unit interval $r$ and the interval length $t$, then $\lambda=rt$. $\lambda$ is both the expected value and the variance of the Poisson distribution.&lt;/p&gt;

&lt;h3 id=&quot;12-multinomial&quot;&gt;1.2 Multinomial&lt;/h3&gt;

&lt;p&gt;Next we have the multinomial distribution, a generalization of the binomial distribution. Whereas the binomial distribution can be viewed as a sequence of Bernoulli trials, i.e. coin flips, the multinomial can be viewed as a sequence of categorical trials, i.e. $k$-sided dice rolls.&lt;/p&gt;

&lt;p&gt;Given $k$ categories and $n$ trials, with the probabilities $p_i$ of a trial resulting in category $i$ of $k$, and $X_i$ the number of trials resulting in category $i$, the PMF is given by:&lt;/p&gt;

\[p(X_1=x_1, ..., X_k=x_k) = \frac{n!}{\prod_i^k x_i!}\prod_i^k p_i^{x_i}\]

&lt;p&gt;where $\sum_i^k x_i = n$. This can also be expressed with gamma functions:&lt;/p&gt;

\[p(X_1=x_1, ..., X_k=x_k) = \frac{\Gamma(n+1)}{\prod_i^k \Gamma(x_i+1)}\prod_i^k p_i^{x_i}\]

&lt;p&gt;Recall the gamma function is defined for positive integers $n$ as:&lt;/p&gt;

\[\Gamma(n) = (n-1)!\]

&lt;p&gt;and otherwise as:&lt;/p&gt;

\[\Gamma(z) = \int_0^{\infty} x^{z-1}e^{-x}dx\]

&lt;p&gt;The expected number of trials with outcome $i$ and the variance are:&lt;/p&gt;

\[E(X_i) = np_i\]

\[Var(X_i) = np_i(1-p_i)\]

\[Cov(X_i, X_j) = -np_ip_j\]
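&lt;p&gt;The multinomial PMF above is straightforward to implement - a small sketch using only the standard library:&lt;/p&gt;

```python
from math import factorial, prod

def multinomial_pmf(xs, ps):
    # P(X_1 = x_1, ..., X_k = x_k) = n! / (x_1! ... x_k!) * prod p_i^{x_i}
    n = sum(xs)
    coef = factorial(n) // prod(factorial(x) for x in xs)   # exact integer
    return coef * prod(p ** x for p, x in zip(ps, xs))
```

&lt;p&gt;With $k=2$ this reduces to the binomial PMF, and summing over all outcomes for fixed $n$ gives 1.&lt;/p&gt;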

&lt;h3 id=&quot;13-dirichlet&quot;&gt;1.3 Dirichlet&lt;/h3&gt;

&lt;p&gt;Finally, the Dirichlet distribution is a multivariate generalization of the beta distribution. A $k$-dimensional Dirichlet random variable $\theta$ takes values in the $(k-1)$-simplex ($\theta$ lies in the $(k-1)$-simplex if $\theta_i\geq 0, \sum_{i=1}^k\theta_i = 1$) with the following pdf:&lt;/p&gt;

\[p(\theta|\alpha) = \frac{1}{B(\alpha)}\prod_{i=1}^k\theta_i^{\alpha_i-1}\]

\[p(\theta|\alpha) = \frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^k\theta_i^{\alpha_i-1}\]

&lt;p&gt;where $\alpha$ is a $k$-dimensional parameter with positive elements. Note this looks very similar to the form of the multinomial! Recall the beta is the conjugate prior to the binomial; similarly the Dirichlet is the conjugate prior to the multinomial. So if our likelihood is multinomial with a Dirichlet prior, the posterior is also Dirichlet! This gives us some intuition for $\alpha_i$ as the prior proportion of the $i$th class.&lt;/p&gt;
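&lt;p&gt;The conjugate update is just addition of counts to $\alpha$ - a tiny sketch with toy numbers:&lt;/p&gt;

```python
import numpy as np

# Prior over k = 3 class proportions (toy hyperparameters).
alpha = np.array([2.0, 2.0, 2.0])

# Observed multinomial counts.
counts = np.array([10, 3, 1])

# Conjugacy: Dirichlet prior + multinomial likelihood -> Dirichlet posterior.
alpha_post = alpha + counts

prior_mean = alpha / alpha.sum()
post_mean = alpha_post / alpha_post.sum()
```

&lt;p&gt;The posterior mean shifts from the uniform prior toward the observed class frequencies, consistent with reading $\alpha_i$ as prior pseudo-counts.&lt;/p&gt;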

&lt;p&gt;The expected value and variance:&lt;/p&gt;

\[E(\theta_j) = \frac{\alpha_j}{\sum_{i=1}^k \alpha_i} = \frac{\alpha_j}{\alpha_0}\]

\[Var(\theta_j) = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0+1)}\]

&lt;h2 id=&quot;2-lda&quot;&gt;2. LDA&lt;/h2&gt;

&lt;p&gt;LDA is a three-level Bayesian model where every item is modeled as a finite mixture over a set of latent topics, and the topics are in turn modeled as infinite mixtures over underlying topic probabilities.&lt;/p&gt;

&lt;p&gt;It is helpful to discuss LDA in the context of NLP, in terms of guiding intuition. Do note that LDA can absolutely be applied to other problem spaces. We define:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Word: basic unit of discrete data. Given a vocabulary of size $V$, the $v$th word can be represented by a one-hot vector with a 1 in the $v$th index and 0 in all others.&lt;/li&gt;
  &lt;li&gt;Document: sequence of $N$ words $w = (w_1, …, w_N)$&lt;/li&gt;
  &lt;li&gt;Corpus: collection of $M$ documents $D = {w_1, … w_M}$&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;21-formulating-lda&quot;&gt;2.1 Formulating LDA&lt;/h3&gt;

&lt;p&gt;We want to find a probabilistic model for our corpus that assigns high probability to the members $d_1, …, d_M$, but also to similar documents outside of our training set.&lt;/p&gt;

&lt;p&gt;We will represent a document as a random mixture over latent topic variables, and the topics will themselves be described using distributions over words. That is, every document can consist of many topics, and words within the document belong to topics according to the distribution of words conditional on topics.&lt;/p&gt;

&lt;p&gt;LDA models the generative process for each document in the corpus as:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;$N \sim \text{Poisson}(\xi)$&lt;/li&gt;
  &lt;li&gt;$\theta \sim \text{Dir}(\alpha)$&lt;/li&gt;
  &lt;li&gt;For each of the $N$ words $w_n$:
    &lt;ol&gt;
      &lt;li&gt;Sample a topic $z_n \sim \text{Multinomial}(\theta)$&lt;/li&gt;
      &lt;li&gt;Sample a word $w_n$ from a multinomial conditioned on $z_n$, $w_n \sim \text{Multinomial}(\beta_{z_n})$, where each topic corresponds to its own $\beta_{z_n}$. This can also be written as $p(w_n | z_n, \beta)$ where $\beta$ is a $k\times V$ matrix with $\beta_{ij} = p(w^j=1|z^i=1)$.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that we have predetermined and fixed $k$. Secondly, note that the Poisson assumption is not relevant to the rest of the process and can be discarded for more realistic document length assumptions.&lt;/p&gt;
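&lt;p&gt;The generative process above can be sketched for a single toy document (the values of $k$, $V$, $\alpha$, and the random $\beta$ are illustrative assumptions):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(6)
k, V = 3, 8                          # number of topics, vocabulary size

alpha = np.full(k, 0.5)              # Dirichlet hyperparameter (toy value)
beta = rng.dirichlet(np.ones(V), k)  # k x V topic-word matrix; rows sum to 1

# Generate one document.
N = rng.poisson(20)                  # 1. document length
theta = rng.dirichlet(alpha)         # 2. topic mixture for this document
words, topics = [], []
for _ in range(N):
    z = rng.choice(k, p=theta)       # 3.1 sample a topic
    w = rng.choice(V, p=beta[z])     # 3.2 sample a word given the topic
    topics.append(z)
    words.append(w)
```

&lt;p&gt;Note how $\theta$ is drawn once per document while topics $z_n$ are drawn once per word - the hierarchy discussed in the next section.&lt;/p&gt;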

&lt;p&gt;Given the parameters $\alpha, \beta$, the joint distribution of the topic mixture $\theta$, the latent topic variables $z$, and the set of $N$ words $w$:&lt;/p&gt;

\[p(\theta, z, w|\alpha, \beta) = p(\theta|\alpha)\prod_{n=1}^Np(z_n|\theta)p(w_n|z_n,\beta)\]

&lt;p&gt;We can marginalize out the latent variables $\theta, z$ to get the marginal distribution of a document:&lt;/p&gt;

\[p(w|\alpha, \beta) = \int p(\theta|\alpha)(\prod_{n=1}^N\sum_{z_n} p(z_n|\theta)p(w_n|z_n, \beta))d\theta\]

&lt;p&gt;Then the probability of a corpus - $p(D|\alpha, \beta)$, the likelihood - is the product of these marginal probabilities over all documents.&lt;/p&gt;

&lt;h3 id=&quot;22-comparisons&quot;&gt;2.2 Comparisons&lt;/h3&gt;

&lt;p&gt;Take careful note of the hierarchy. LDA is not simply Dirichlet-multinomial clustering! There, we would sample a Dirichlet once for the corpus to describe the topic mixture, then sample a multinomial topic once per document, then select words conditional on that topic (clustering) variable. A document would be associated with a single topic. In LDA, topics are repeatedly sampled within each document.&lt;/p&gt;

&lt;p&gt;Indeed, what we have just described is the mixture of unigrams model. Each document is generated by choosing a topic $z$, then generating $N$ words from the multinomial $p(w|z)$:&lt;/p&gt;

\[p(w) = \sum_z p(z)\prod_{n=1}^Np(w_n|z)\]

&lt;p&gt;A unigram model would be even more simplistic, with the words of every document being drawn independently from one multinomial:&lt;/p&gt;

\[p(w) = \prod_{n=1}^Np(w_n)\]

&lt;h3 id=&quot;23-estimation-em-and-variational-inference&quot;&gt;2.3 Estimation, EM, and Variational Inference&lt;/h3&gt;

&lt;p&gt;We need to find the parameters $\alpha, \beta$ that maximize the likelihood of the data, marginalizing over latent $\theta, z$. Of course, whenever we have to find maximum-likelihood solutions for models with latent variables, we turn to EM. At a high level:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;E-step: compute the posterior of the latent variables $p(\theta, z|w, \alpha, \beta)$ given the document $w$ and the parameters&lt;/p&gt;

\[p(\theta, z|w, \alpha, \beta) = \frac{p(\theta, z, w|\alpha, \beta)}{p(w|\alpha, \beta)}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;M-step: estimate $\alpha, \beta$ given the revised latent variable estimates&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, the posterior of the latent variables is analytically intractable in general. In such situations we turn to approximations to the posterior distribution. The most common method, and the one discussed in the original LDA paper, is variational inference. We will leave the details of variational EM for a future time.&lt;/p&gt;</content><author><name></name></author><summary type="html">Latent Dirichlet allocation (LDA) is a generative probabilistic model for discrete data. The objective is to find a lower dimensionality representation of the data while preserving the salient statistical structure - a complex way to describe clustering. It is commonly used in NLP applications as a topic model, where we are interested in discovering common topics in a set of documents.</summary></entry><entry><title type="html">Minimum Description Length</title><link href="http://bllguo.github.io/Minimum-description-length/" rel="alternate" type="text/html" title="Minimum Description Length" /><published>2019-06-24T00:00:00+00:00</published><updated>2019-06-24T00:00:00+00:00</updated><id>http://bllguo.github.io/Minimum-description-length</id><content type="html" xml:base="http://bllguo.github.io/Minimum-description-length/">&lt;p&gt;The minimum description length principle is an approach for the model selection problem. It is underpinned by the beautifully simple concept of learning as compression. Any pattern or regularity in the data can be exploited to compress the data. Hence we can equate the two concepts - the more we can compress, the more we know about the data!&lt;/p&gt;

&lt;p&gt;Before we begin, I would like to give thanks to &lt;a href=&quot;https://mitpress.mit.edu/books/minimum-description-length-principle&quot;&gt;Peter D. Grünwald and his text, &lt;em&gt;The Minimum Description Length Principle&lt;/em&gt;&lt;/a&gt;, whose first chapter serves as the basis for this post.&lt;/p&gt;

&lt;h2 id=&quot;1-learning-as-compression&quot;&gt;1. Learning as Compression&lt;/h2&gt;

&lt;p&gt;Consider the following sequences as datasets $D$:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/mdl_sequences.png&quot; alt=&quot;sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first sequence is just a repetition of $0001$ - clearly extremely regular. The second is randomly generated, with no regularity. The third contains 4 times more 0s than 1s, and is a middle ground - less regular than the first, but more regular than the second.&lt;/p&gt;

&lt;p&gt;To formalize the concept of MDL, we will need a description method which maps descriptions to datasets and vice-versa. One example is a computer language, where a corresponding description would then be a program that outputs the dataset $D$.&lt;/p&gt;

&lt;p&gt;The first sequence, for example, can be easily expressed as a loop printing $0001$, which can be shown to be a compression to $O(\log n)$ bits. On the other hand, with high probability, the shortest program that can output sequence 2 will be a print statement of the sequence itself. Due to its random nature, the sequence cannot be compressed at all. The third sequence can be shown to be compressible to $O(n)$ bits.&lt;/p&gt;

&lt;p&gt;If we continue with computer languages as our description method, the minimum description length will be given by the length of the shortest program that can output the sequence and halt. This is the Kolmogorov complexity of the sequence. One might think that this would vary based on the choice of language, but according to the invariance theorem, for long enough sequences $D$, the lengths of the shortest programs in different languages will differ by only a constant term independent of $D$.&lt;/p&gt;

&lt;p&gt;However, this idealized MDL we have arrived at using computer languages is not computable. It can be shown that no program exists that can take $D$ and return the shortest program that outputs $D$.&lt;/p&gt;

&lt;p&gt;We are thus forced to use a more practical version of MDL, where we rely on more restrictive description methods, denoted $C$. We need to be able to compute the minimum description length of data $D$ when using said description method $C$.&lt;/p&gt;

&lt;h2 id=&quot;2-model-selection&quot;&gt;2. Model Selection&lt;/h2&gt;

&lt;p&gt;To further motivate our study of MDL, consider the following example of choosing a polynomial fit, again courtesy of Grünwald:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/mdl_tradeoff.png&quot; alt=&quot;tradeoff&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first polynomial, which is linear, seems to suffer from high bias. It is overly simplistic. On the other hand, the second polynomial seems to suffer from high variance - it is overly complex and has likely fit to the noise in the data. The third appears to strike the best balance between bias and variance, and so should generalize best. MDL will provide a way to make this decision, without handwavy arguments.&lt;/p&gt;

&lt;p&gt;Let a &lt;em&gt;hypothesis&lt;/em&gt; refer to a single probability distribution or function, and a &lt;em&gt;model&lt;/em&gt; refer to a family of probability distributions/functions with the same form. 
For example, in a polynomial fitting problem, hypothesis selection is the problem of selecting the degree of a polynomial and the corresponding parameters, whereas model selection is strictly choosing the degree.&lt;/p&gt;

&lt;h3 id=&quot;21-crude-mdl&quot;&gt;2.1 Crude MDL&lt;/h3&gt;

&lt;p&gt;Crudely:
Let $M^1, M^2, …$ be a set of candidate models. The best hypothesis $H \in M^1\cup M^2\cup …$ to compress data $D$ minimizes:&lt;/p&gt;

\[L(H) + L(D|H)\]

&lt;p&gt;$L(H)$ is the length in bits of the description of $H$;
$L(D|H)$ is the length in bits of the description of $D$ encoded using $H$.&lt;/p&gt;

&lt;p&gt;Then the best model is the smallest model containing $H$.&lt;/p&gt;

&lt;p&gt;For probabilistic hypotheses, a straightforward choice for the description method is given by $L(D|H)=-\log{P(D|H)}$, the Shannon-Fano code. Finding a code for $L(H)$, however, is hard, because the description length of a hypothesis $H$ can be very large in one model but short in another.&lt;/p&gt;
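&lt;p&gt;A tiny sketch of the Shannon-Fano code length $L(D|H)=-\log_2{P(D|H)}$ for the binary sequences of section 1, taking Bernoulli hypotheses (and ignoring $L(H)$, which is the same for both hypotheses here):&lt;/p&gt;

```python
from math import log2

def codelength_bits(data, p_one):
    # L(D|H) = -log2 P(D|H) for a Bernoulli(p_one) hypothesis H
    n1 = sum(data)
    n0 = len(data) - n1
    return -(n0 * log2(1 - p_one) + n1 * log2(p_one))

# A regular sequence: 0001 repeated 25 times (cf. section 1).
seq = [0, 0, 0, 1] * 25

# The hypothesis matching the observed frequencies compresses best.
len_fair   = codelength_bits(seq, 0.5)    # 100 bits: no compression
len_biased = codelength_bits(seq, 0.25)   # fewer bits
```

&lt;p&gt;The fair-coin hypothesis costs exactly 1 bit per symbol, while the hypothesis matching the 3:1 frequency of 0s to 1s encodes the same data in fewer bits - compression reflects captured regularity.&lt;/p&gt;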

&lt;h3 id=&quot;22-refined-mdl&quot;&gt;2.2 Refined MDL&lt;/h3&gt;

&lt;p&gt;In what is known as refined MDL, developed by Rissanen et al. (1996), we will use a one-part code $\bar L(D|M)$ instead of the two-part $L(H) + L(D|H)$. This will allow us to avoid the problem of specifying $L(H)$.&lt;/p&gt;

&lt;p&gt;The stochastic complexity of the data given the model is given by&lt;/p&gt;

\[\bar{L}(D|M)\]

&lt;p&gt;where, by definition, this quantity will be small whenever there is some hypothesis $H \in M$ such that $L(D|H)$ is small.&lt;/p&gt;


&lt;p&gt;This can itself be decomposed into:&lt;/p&gt;

\[\bar{L}(D|M) = L(D|\hat{H}) + COMP(M)\]

&lt;p&gt;where $\hat{H}$ is the best hypothesis in $M$ (the distribution that maximizes the probability of the given data $D$), and $COMP(M)$ the parametric complexity of the model $M$. The parametric complexity measures the model’s ability to fit random data - an idea of its richness, or complexity. Recall that we are minimizing $\bar{L}(D|M)$ - we avoid overfitting by penalizing overly complex models through the parametric complexity term.&lt;/p&gt;

&lt;h2 id=&quot;3-the-mdl-philosophy&quot;&gt;3. The MDL Philosophy&lt;/h2&gt;

&lt;p&gt;According to Grünwald, MDL informs a “radical” philosophy for statistical learning and inference, with the following main points:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Regularity as compression - the goal for statistical inference is compression. It is to distill regularity from noise. Refer to the example in section 1.&lt;/li&gt;
  &lt;li&gt;Models as languages - models are interpretable as languages for describing properties of the data. Individual hypotheses capture and summarize different patterns in the data. Notice that these patterns exist and are meaningful whether or not $\hat H \in M$ is the true distribution of $D$.&lt;/li&gt;
  &lt;li&gt;We have only the data - In fact, the concepts of “truth” and “noise” lose meaning in the MDL approach. Traditionally we assume some model generates the data and that noise is a random variable relative to this model. But in MDL noise is not a random variable. The “noise” is relative to the choice of model $M$ as the residual number of bits used to encode the data once $M$ is specified! Under the MDL principle, “noisy data” is just data that could theoretically be easily compressed using another model. This avoids what Rissanen describes as a fundamental flaw in other statistical methods, which assume the existence of a “ground truth” lying within our chosen model $M$. For instance, linear regression assumes Gaussian noise; Gaussian mixture models assume the data is generated from underlying Gaussian distributions. These are not true in practice! MDL, in contrast, has an interpretation that relies on no assumptions but the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;4-mdl-in-regression&quot;&gt;4. MDL in Regression&lt;/h2&gt;

&lt;p&gt;Consider a regression setting. To describe the description length of our model $\hat y$ and the data, we code two things - the residual $r = y - \hat{y}$, and the model (the parameters). In regression, this will be encoding 1. the features that are present in the model, and 2. the associated weights.&lt;/p&gt;

&lt;p&gt;The idea is that the residual term will be proportional to the test error, and the remaining term (the model parameters) will describe the model complexity, such that minimizing the sum (the minimum description length) minimizes the expected test error.&lt;/p&gt;

&lt;h3 id=&quot;41-residuals&quot;&gt;4.1 Residuals&lt;/h3&gt;

&lt;p&gt;First we code the residual. The optimal coding length is given by entropy - we can look at the entropy of $r$ in terms of the log-likelihood of $y$:&lt;/p&gt;

\[-\log{P(D|w,\sigma)} = -n\log{(\frac{1}{\sigma\sqrt{2\pi}})} + \frac{1}{2\sigma^2}\sum_i(y_i-w^Tx_i)^2\]

\[-\log{P(D|w,\sigma)} = n\log{(\sigma\sqrt{2\pi})} + \frac{Err}{2\sigma^2}\]

&lt;p&gt;The second term is the scaled sum of squares error. The problem is we don’t know $\sigma^2$.&lt;/p&gt;

&lt;p&gt;If we knew the true $w$, then for the corresponding $\hat{y}$, $\sigma^2=E((y-\hat{y})^2)=\frac{Err}{n}$ by the definition of variance. We don’t know it, but we can use our estimate of it.&lt;/p&gt;

&lt;p&gt;Using our estimate at the current iteration, we get $\frac{Err^t}{2Err^t/n}=\frac{n}{2}$, which is not helpful - it’s constant. But here we refer back to the likelihood’s constant term $n\log{(\sigma\sqrt{2\pi})}$, which remains.&lt;/p&gt;

\[n\log{(\sigma\sqrt{2\pi})} = n\log{(\sigma)} + ...\]

&lt;p&gt;Where … is some term which we can safely ignore, since it is independent of $\sigma$. We know:&lt;/p&gt;

\[n\log{(\sigma)} \sim n\log{\sqrt{\frac{Err}{n}}} = \frac{n}{2}\log{\frac{Err}{n}}\]

&lt;p&gt;So ultimately we have&lt;/p&gt;

\[-\log{P(D|w,\sigma)} \sim \frac{n}{2}\log{\frac{Err}{n}} \sim n\log{\frac{Err}{n}}\]
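&lt;p&gt;This bookkeeping is easy to verify numerically. A quick sketch with simulated Gaussian residuals (made-up data), checking that plugging $\hat{\sigma}^2 = \frac{Err}{n}$ into the negative log-likelihood leaves $\frac{n}{2}\log{\frac{Err}{n}}$ plus a constant that does not depend on the model, and that the error term collapses to the constant $\frac{n}{2}$:&lt;/p&gt;

```python
import numpy as np

# Simulated residuals r_i = y_i - w^T x_i (made-up data, Gaussian noise)
rng = np.random.default_rng(0)
n = 50
r = rng.normal(size=n)
err = np.sum(r ** 2)
sigma2 = err / n  # MLE plug-in estimate of the noise variance

# Negative log-likelihood evaluated at sigma^2 = Err/n ...
nll = n * np.log(np.sqrt(2 * np.pi * sigma2)) + err / (2 * sigma2)
# ... equals (n/2) log(Err/n) plus a model-independent constant
constant = (n / 2) * (1 + np.log(2 * np.pi))
assert np.isclose(nll, (n / 2) * np.log(err / n) + constant)
# the scaled error term itself is the constant n/2, as noted above
assert np.isclose(err / (2 * sigma2), n / 2)
```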

&lt;h3 id=&quot;42-weights&quot;&gt;4.2 Weights&lt;/h3&gt;

&lt;p&gt;We can code the model by asking: for each feature, is it in the model? and if so, what is its coefficient?&lt;/p&gt;

&lt;p&gt;Say we have $p$ features and we expect $q$ of them to be in the model. Then the probability that any given feature is present is $\frac{q}{p}$. The entropy is:&lt;/p&gt;

\[-\frac{q}{p}\log{(\frac{q}{p})} - \frac{p-q}{p}\log{(\frac{p-q}{p})}\]

&lt;p&gt;Two special cases that will come in handy below:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;If $q=\frac{p}{2}$, the cost of coding each feature presence or absence is 1 bit.&lt;/li&gt;
  &lt;li&gt;If $q=1$ then $\log{(\frac{p-q}{p})} \sim 0$ and the cost of coding each feature is $\sim -\log{(\frac{1}{p})} = \log{(p)}$ bits&lt;/li&gt;
&lt;/ol&gt;
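&lt;p&gt;The two special cases are easy to sanity-check in code - a small sketch, where the helper below is just the entropy above taken in base 2:&lt;/p&gt;

```python
import math

def presence_bits(q, p):
    # Expected bits per feature to code presence/absence,
    # with presence probability q/p (entropy in base 2)
    pr = q / p
    total = 0.0
    for t in (pr, 1.0 - pr):
        if t > 0:
            total -= t * math.log2(t)
    return total

# Case 1: q = p/2 costs exactly 1 bit per feature
assert presence_bits(50, 100) == 1.0
# Case 2: q = 1 - flagging the one present feature costs log2(p) bits
assert -math.log2(1 / 1024) == 10.0
```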

&lt;p&gt;Coding the weight values is admittedly handwavy - but the variance of $\hat{w}$ scales as $\frac{1}{n}$, so we spend $\log{\sqrt{n}} = \frac{\log{(n)}}{2}$ bits on each weight.&lt;/p&gt;

&lt;p&gt;So ultimately the description length is:&lt;/p&gt;

\[n\log{\frac{Err}{n}} + \lambda|w|_0\]

&lt;p&gt;Where&lt;/p&gt;

\[\lambda=-\log{(\pi)} + \frac{\log{(n)}}{2}\]

\[\pi=\frac{q}{p}\]
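&lt;p&gt;As a concrete illustration, here is a minimal NumPy sketch that scores a candidate feature subset by this two-part code. The helper name and toy data are made up; the per-feature penalty is the cost of flagging a present feature, $-\log{(\pi)}$, plus the $\frac{\log{(n)}}{2}$ bits for its weight:&lt;/p&gt;

```python
import numpy as np

def description_length(X, y, subset, q_expected=1):
    # Two-part code: n*log(Err/n) for the residuals plus, per included
    # feature, the bits to flag it and the bits to code its weight
    n, p = X.shape
    k = len(subset)
    if k:
        Xs = X[:, list(subset)]
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        err = float(np.sum((y - Xs @ w) ** 2))
    else:
        err = float(np.sum(y ** 2))
    lam = -np.log(q_expected / p) + np.log(n) / 2.0
    return n * np.log(err / n) + lam * k

# Toy check (made-up data): y depends only on feature 0, so the
# single-feature model should earn the shortest description
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
dl_true = description_length(X, y, [0])
```

&lt;p&gt;Scoring the true subset against the empty and full models should favor the true one: the residual term rewards fit while $\lambda$ charges for every extra feature.&lt;/p&gt;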

&lt;h3 id=&quot;43-limiting-cases&quot;&gt;4.3 Limiting Cases&lt;/h3&gt;

&lt;p&gt;Recall our discussion of MDL as training error + parametric complexity. Clearly the first term is capturing training error (the residuals) and the second the parametric complexity. Consider some limiting cases for $\lambda$:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;BIC - $\sqrt{n} \gg p$ - the cost of coding feature presence is negligible; we essentially charge $\frac{\log{(n)}}{2}$ per feature.&lt;/li&gt;
  &lt;li&gt;RIC - $p \gg \sqrt{n}$ - the cost of coding weights is negligible; we essentially charge $\log{(p)}$ bits per feature.&lt;/li&gt;
  &lt;li&gt;AIC - $q \sim \frac{p}{2}$ and $n, p$ small - we charge 1 bit per feature.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on $n$, $p$, and our expected $q$, a different penalty $\lambda$ is the more appropriate complexity charge.&lt;/p&gt;

&lt;h2 id=&quot;1-decomposing-the-log-likelihood&quot;&gt;1. Decomposing the log-likelihood&lt;/h2&gt;

&lt;p&gt;Suppose we have a model with variables $X$ and latent variables $Z$, with a joint distribution described by parameters $\theta$. The likelihood function is then:&lt;/p&gt;

\[p(X|\theta) = \sum_Z p(X, Z|\theta)\]

&lt;p&gt;The key assumption we make is that optimizing $p(X|\theta)$ is complex relative to optimizing $p(X, Z|\theta)$.&lt;/p&gt;

&lt;p&gt;To optimize over the latent variables we will need a distribution $q(Z)$ over the latent variables. Now we can represent the log likelihood $\ln p(X|\theta)$ as follows:&lt;/p&gt;

\[\ln p(X|\theta) = \sum_Z q(Z)\ln p(X|\theta)\]

&lt;p&gt;By the rules of probability we know $p(X,Z|\theta) = p(Z|X, \theta)p(X|\theta)$, hence:&lt;/p&gt;

\[\ln p(X|\theta) = \sum_Z q(Z)\ln \frac{p(X,Z|\theta)}{p(Z|X, \theta)}\]

\[\ln p(X|\theta) = \sum_Z q(Z)(\ln \frac{p(X,Z|\theta)}{q(Z)} - \ln\frac{p(Z|X, \theta)}{q(Z)})\]

\[\ln p(X|\theta) = \sum_Z q(Z)\ln \frac{p(X,Z|\theta)}{q(Z)} - \sum_Z q(Z)\ln\frac{p(Z|X, \theta)}{q(Z)}\]

\[\ln p(X|\theta) = \mathcal{L}(q,\theta) + KL(q\|p)\]

&lt;p&gt;Notice the second term is the KL-divergence between the prior $q(Z)$ and the posterior $p(Z|X, \theta)$! Furthermore, as KL-divergence is $\geq 0$, it follows that $\mathcal{L}(q,\theta) \leq \ln p(X|\theta)$. It is a lower bound on the log-likelihood.&lt;/p&gt;
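&lt;p&gt;The decomposition is easy to verify numerically. A tiny sketch with a two-valued latent variable and made-up probabilities, checking that $\mathcal{L}(q,\theta) + KL(q\|p) = \ln p(X|\theta)$ for an arbitrary choice of $q$:&lt;/p&gt;

```python
import math

# Joint p(X, Z | theta) at a single observed X, for Z in {0, 1}
# (the numbers are made up for illustration)
p_joint = {0: 0.12, 1: 0.28}
p_x = sum(p_joint.values())                  # p(X | theta)
posterior = {z: p_joint[z] / p_x for z in p_joint}

q = {0: 0.5, 1: 0.5}                         # any distribution over Z
L = sum(q[z] * math.log(p_joint[z] / q[z]) for z in q)     # lower bound
KL = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)  # KL(q || p)

assert KL >= 0                               # so L is a lower bound
assert math.isclose(L + KL, math.log(p_x))   # the decomposition holds
```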

&lt;h2 id=&quot;2-em-algorithm&quot;&gt;2. EM Algorithm&lt;/h2&gt;

&lt;p&gt;Now, let our current parameters be $\theta_0$.&lt;/p&gt;

&lt;h3 id=&quot;21-e-step&quot;&gt;2.1 E-step&lt;/h3&gt;

&lt;p&gt;Consider maximizing $\mathcal{L}(q,\theta_0)$ with respect to $q(Z)$, keeping $\theta_0$ fixed. Notice that the left-hand side $\ln p(X|\theta_0)$ does not depend on $q(Z)$, as $Z$ is marginalized out! So maximizing $\mathcal{L}$ means minimizing the KL-divergence, and when $q(Z) = p(Z|X,\theta_0)$ - setting the $q$ distribution to the posterior for the current $\theta_0$ - the KL-divergence term goes to 0. Then we have $\mathcal{L}(q,\theta_0) = \ln p(X|\theta_0)$.&lt;/p&gt;

&lt;p&gt;Now fix $q(Z)$ and take the partial w.r.t $\theta$ to maximize $\mathcal{L}(q,\theta)$ and find $\theta_1$. $\mathcal{L}$, the lower bound on the log-likelihood, will necessarily increase.&lt;/p&gt;

&lt;h3 id=&quot;22-m-step&quot;&gt;2.2 M-step&lt;/h3&gt;

&lt;p&gt;Remember that we fixed $q(Z) = p(Z|X, \theta_0)$. Substituting in $\mathcal{L}$:&lt;/p&gt;

\[\mathcal{L}(q,\theta) = \sum_Z p(Z|X, \theta_0)\ln p(X, Z|\theta) - \sum_z p(Z|X, \theta_0)\ln p(Z|X, \theta_0)\]

\[\mathcal{L}(q,\theta) = \sum_Z p(Z|X, \theta_0)\ln p(X, Z|\theta) + H(q(Z))\]

&lt;p&gt;Where $H(q(Z))$ is the entropy of the $q$ distribution. The key is that this quantity is independent of $\theta$. So really what we are maximizing is $\sum_Z p(Z|X, \theta_0)\ln p(X, Z|\theta) = Q(\theta, \theta_0)$ - the expected value of the complete-data ($X$ and $Z$) log-likelihood. Another way to think about this is a weighted MLE with the weights given by $q(Z)$.&lt;/p&gt;

&lt;p&gt;Also note that we fixed $q(Z) = p(Z|X, \theta_0)$ instead of $p(Z|X, \theta_1)$, so the KL-divergence term is positive (unless $\theta_0 = \theta_1$). Hence the log-likelihood actually increases by more than the increase in $\mathcal{L}$.&lt;/p&gt;

&lt;h3 id=&quot;23-summary&quot;&gt;2.3 Summary&lt;/h3&gt;

&lt;p&gt;The algorithm can be expressed beautifully as two simple steps - iteratively computing the posterior over the latent variables $p(Z|X, \theta)$, and then using the posterior to update the parameters $\theta$ by maximizing the expected full-data log-likelihood.&lt;/p&gt;

&lt;p&gt;Through the decomposition into $\ln p(X|\theta) = \mathcal{L}(q,\theta) + KL(q||p)$ we also have shown that the EM algorithm continually improves the lower-bound on the log-likelihood as well as the log-likelihood itself.&lt;/p&gt;</content><author><name></name></author><summary type="html">In the last two posts we have seen two examples of the expectation maximization (EM) algorithm at work, finding maximum-likelihood solutions for models with latent variables. Now we derive EM for general models, and demonstrate how it maximizes the log-likelihood.</summary></entry><entry><title type="html">Gaussian Mixtures and EM</title><link href="http://bllguo.github.io/Gaussian-mixtures/" rel="alternate" type="text/html" title="Gaussian Mixtures and EM" /><published>2019-06-11T00:00:00+00:00</published><updated>2019-06-11T00:00:00+00:00</updated><id>http://bllguo.github.io/Gaussian-mixtures</id><content type="html" xml:base="http://bllguo.github.io/Gaussian-mixtures/">&lt;p&gt;Continuing from last time, I will discuss Gaussian mixtures as another clustering method. We assume that the data is generated from several Gaussian components with separate parameters, and we would like to assign each observation to its most likely Gaussian parent. It is a more flexible and probabilistic approach to clustering, and will provide another opportunity to discuss expectation-maximization (EM). Lastly, we will see how K-means is a special case of Gaussian mixtures!&lt;/p&gt;

&lt;h2 id=&quot;1-gaussian-mixtures&quot;&gt;1. Gaussian Mixtures&lt;/h2&gt;

&lt;p&gt;Here we will take a probabilistic approach to clustering, where we assume the data is generated from $K$ underlying Gaussian distributions:&lt;/p&gt;

\[p(x) = \sum_{k=1}^K \pi_kN(x|\mu_k,\Sigma_k)\]

&lt;p&gt;where $\pi_k$, the mixing coefficient, is the probability that the sample was drawn from the $k$th Gaussian distribution. We can reformulate this model in terms of an explicit latent variable $z$ with a 1-of-K representation. That is, $z$ is $K$-dimensional and $z_k = 1$ when the sample belongs to the $k$th Gaussian. Let us proceed with this to arrive at the joint distribution $p(x, z) = p(z)p(x|z)$.&lt;/p&gt;

&lt;p&gt;The marginal distribution over $z$ can be specified:&lt;/p&gt;

\[p(z) = \prod_k \pi_k^{z_k}\]

&lt;p&gt;since&lt;/p&gt;

\[p(z_k=1) = \pi_k\]

&lt;p&gt;and $0 \leq \pi_k \leq 1$ and $\sum_k \pi_k=1$.&lt;/p&gt;

&lt;p&gt;The conditional for a particular $z$ is:&lt;/p&gt;

\[p(x|z_k=1) = N(x|\mu_k, \Sigma_k)\]

&lt;p&gt;Hence using the 1-of-K representation,&lt;/p&gt;

\[p(x|z) = \prod_k N(x|\mu_k, \Sigma_k)^{z_k}\]

&lt;p&gt;So the joint distribution is simply:&lt;/p&gt;

\[p(x, z) = \prod_k (\pi_kN(x|\mu_k, \Sigma_k))^{z_k}\]

&lt;p&gt;Now the marginal distribution of $x$ can be obtained from the joint:&lt;/p&gt;

\[p(x) = \sum_z p(z)p(x|z)\]

\[p(x) = \sum_z \prod_k^K (\pi_kN(x|\mu_k, \Sigma_k))^{z_k}\]

&lt;p&gt;Consider what the possible values of $z$ are. They are $K$-dimensional one-hot vectors. So the summation over $z$ is just over $K$ possibilities. Furthermore, if $z_k=1$, notice that the product simplifies to $\pi_kN(x|\mu_k, \Sigma_k)$ as $z_{i\neq k} = 0$.&lt;/p&gt;

\[p(x) = \sum_j^K \pi_jN(x|\mu_j, \Sigma_j)\]

&lt;p&gt;So we have recovered the initial Gaussian mixture formulation, but now with the latent $z$.&lt;/p&gt;

&lt;p&gt;Lastly we can define the conditional $p(z|x)$:&lt;/p&gt;

\[p(z_k=1|x) = \frac{p(x|z_k=1)p(z_k=1)}{p(x)}\]

\[p(z_k=1|x) = \frac{\pi_kN(x|\mu_k, \Sigma_k)}{\sum_j^K \pi_jN(x|\mu_j, \Sigma_j)} = \gamma(z_k)\]

&lt;p&gt;So $p(z_k=1|x) = \gamma(z_k)$ is the posterior probability of $z_k=1$, and $\pi_k$ the prior.&lt;/p&gt;

&lt;h2 id=&quot;2-maximum-likelihood&quot;&gt;2. Maximum likelihood&lt;/h2&gt;

&lt;p&gt;We have a dataset of observations $x^{(1)}, …, x^{(N)}$ which we assume are i.i.d. Their corresponding latent variables are $z^{(1)}, …, z^{(N)}$. We can stack them into matrices $X, Z$ which are $N\times D$ and $N\times K$, respectively.&lt;/p&gt;

&lt;p&gt;We know:&lt;/p&gt;

\[p(x) = \sum_j^K \pi_jN(x|\mu_j, \Sigma_j)\]

&lt;p&gt;So the likelihood:&lt;/p&gt;

\[p(X|\pi, \mu, \Sigma) = \prod_n\sum_k^K \pi_kN(x^{(n)}|\mu_k, \Sigma_k)\]

&lt;p&gt;And the log-likelihood:&lt;/p&gt;

\[\ln p(X|\pi, \mu, \Sigma) = \sum_n \ln(\sum_k^K \pi_kN(x^{(n)}|\mu_k, \Sigma_k))\]

&lt;h3 id=&quot;21-mu_k&quot;&gt;2.1 $\mu_k$&lt;/h3&gt;

&lt;p&gt;Taking the partial with respect to $\mu_k$ and carefully stepping through the math, we obtain:&lt;/p&gt;

\[\mu_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})x^{(n)}\]

\[N_k = \sum_n \gamma(z_{nk})\]

&lt;p&gt;Notice that $\mu_k$ is the weighted mean of $x$ where the weights are the posterior probabilities that $x^{(n)}$ belongs to Gaussian component $k$. $N_k$ thus can be interpreted as the effective number of points assigned to $k$.&lt;/p&gt;

&lt;h3 id=&quot;22-sigma_k&quot;&gt;2.2 $\Sigma_k$&lt;/h3&gt;

&lt;p&gt;Taking the partial with respect to $\Sigma_k$ we will arrive at a similar weighted result:&lt;/p&gt;

\[\Sigma_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})(x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T\]

&lt;h3 id=&quot;23-pi_k&quot;&gt;2.3 $\pi_k$&lt;/h3&gt;

&lt;p&gt;Lastly, we take the partial with respect to $\pi_k$. Recall $\sum_k \pi_k = 1$, so we will need a Lagrange multiplier to maximize&lt;/p&gt;

\[\ln p(X|\pi, \mu, \Sigma) + \lambda(\sum_k \pi_k - 1)\]

&lt;p&gt;We will arrive at&lt;/p&gt;

\[\pi_k = \frac{N_k}{N}\]

&lt;h3 id=&quot;24-expectation-maximization-for-gaussian-mixtures&quot;&gt;2.4 Expectation-maximization for Gaussian mixtures&lt;/h3&gt;

&lt;p&gt;This is not a closed-form solution, because the updates depend on $\gamma(z_{nk})$, which itself depends on the parameters. However, we can solve for the parameters using the iterative method of EM!&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Initialize the parameters and evaluate the initial log-likelihood&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;E-step: Evaluate the posteriors under the current parameters:&lt;/p&gt;

\[\gamma(z_{nk}) = \frac{\pi_kN(x^{(n)}|\mu_k, \Sigma_k)}{\sum_j^K \pi_jN(x^{(n)}|\mu_j, \Sigma_j)}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;M-step: Re-evaluate the parameters using the new posteriors:&lt;/p&gt;

\[\mu_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})x^{(n)}\]

\[\Sigma_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})(x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T\]

\[\pi_k = \frac{N_k}{N}\]

\[N_k = \sum_n \gamma(z_{nk})\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Calculate the log-likelihood and check stopping criteria:&lt;/p&gt;

\[\ln p(X|\pi, \mu, \Sigma) = \sum_n \ln(\sum_k^K \pi_kN(x^{(n)}|\mu_k, \Sigma_k))\]
  &lt;/li&gt;
&lt;/ol&gt;
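&lt;p&gt;The four steps translate directly into NumPy. A minimal sketch, fit to made-up well-separated data - no covariance regularization, random restarts, or log-likelihood stopping criterion:&lt;/p&gt;

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    # Minimal EM for a Gaussian mixture, following the steps above
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]   # 1. initialize
    sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # 2. E-step: responsibilities gamma(z_nk), via log-densities
        logp = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            mahal = np.sum(diff @ np.linalg.inv(sigma[k]) * diff, axis=1)
            logdet = np.log(np.linalg.det(sigma[k]))
            logp[:, k] = np.log(pi[k]) - 0.5 * (mahal + logdet + d * np.log(2 * np.pi))
        gamma = np.exp(logp - logp.max(axis=1, keepdims=True))
        gamma = gamma / gamma.sum(axis=1, keepdims=True)
        # 3. M-step: weighted updates of mu, sigma, pi
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / n
    return pi, mu, sigma, gamma

# Two well-separated synthetic clusters (assumed data for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])
pi, mu, sigma, gamma = em_gmm(X, K=2)
```

&lt;p&gt;With this much separation the recovered means should land near the true centers and the mixing coefficients near $\frac{1}{2}$; in practice one would add step 4, stopping when the log-likelihood plateaus.&lt;/p&gt;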

&lt;h2 id=&quot;3-expectation-maximization&quot;&gt;3. Expectation-maximization&lt;/h2&gt;

&lt;p&gt;In general, EM is used to find MLE solutions when we have latent variables. Consider the general log-likelihood:&lt;/p&gt;

\[\ln p(X|\theta) = \ln(\sum_Z p(X, Z|\theta))\]

&lt;p&gt;where we have introduced the latent variable matrix $Z$.&lt;/p&gt;

&lt;p&gt;We cannot compute this in practice, because we do not observe the latent variables $Z$. What we know about $Z$ is captured by our posterior $p(Z|X, \theta)$. We can use it to compute the expected value of the log-likelihood under this posterior.&lt;/p&gt;

&lt;p&gt;Consider our current parameters $\theta_0$ and the new parameters $\theta_1$.&lt;/p&gt;

&lt;p&gt;In the E-step, we use $\theta_0$ to find the posterior over the latent variables, $p(Z|X, \theta_0)$. We can use it to calculate:&lt;/p&gt;

\[Q(\theta, \theta_0) = \sum_Z p(Z|X, \theta_0)\ln p(X, Z|\theta)\]

&lt;p&gt;which is the expectation of the log-likelihood for a general $\theta$.&lt;/p&gt;

&lt;p&gt;In the M-step, we find $\theta_1$ by maximizing the expectation:&lt;/p&gt;

\[\theta_1 = \text{arg max}_{\theta} Q(\theta, \theta_0)\]

&lt;h2 id=&quot;4-comparison-with-k-means&quot;&gt;4. Comparison with K-means&lt;/h2&gt;

&lt;p&gt;Notice that K-means performs a hard, binary assignment of observations to clusters - a point is either in a cluster or it isn’t. In Gaussian mixtures, we make a soft assignment by way of the posterior probabilities. It turns out that K-means is a special case of Gaussian mixtures!&lt;/p&gt;

&lt;p&gt;Let the covariance matrices $\Sigma_k$ all be equal to $\epsilon I$. Then:&lt;/p&gt;

\[p(x|\mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{D/2}}\exp(-\frac{||x-\mu_k||^2}{2\epsilon})\]

&lt;p&gt;by the definition of the multivariate Gaussian.&lt;/p&gt;

&lt;p&gt;Treating the variances $\epsilon$ as fixed, we arrive at:&lt;/p&gt;

\[\gamma(z_{nk}) = \frac{\pi_k\exp(-\frac{||x^{(n)}-\mu_k||^2}{2\epsilon})}{\sum_j \pi_j\exp(-\frac{||x^{(n)}-\mu_j||^2}{2\epsilon})}\]

&lt;p&gt;As $\epsilon \rightarrow 0$, the term in the denominator with the smallest value of $\|x^{(n)}-\mu_j\|^2$ approaches $0$ slowest. So in the limit, $\gamma(z_{nl}) \rightarrow 1$ for the nearest center $l$ while the other posteriors converge to 0, i.e. $\gamma(z_{nl}) \rightarrow r_{nl}$. We have obtained the hard assignment of K-means. Recall:&lt;/p&gt;

\[r_{nk} = \begin{cases}
1,  &amp;amp; \text{if }k=\text{arg min}_j ||x_n-\mu_j||^2 \\\
0, &amp;amp; \text{otherwise}
\end{cases}\]

&lt;p&gt;The update equation for $\mu_k$ will simplify to the K-means result thanks to these weights. We can also show that as $\epsilon \rightarrow 0$, the log-likelihood converges to the distortion loss in K-means.&lt;/p&gt;
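&lt;p&gt;We can watch this hardening happen numerically - a small sketch with two fixed centers, equal mixing weights, and a shared variance $\epsilon$ (the numbers are made up for illustration):&lt;/p&gt;

```python
import numpy as np

# Two fixed centers and equal mixing weights; x sits closer to mu_1
x = np.array([1.0, 0.0])
mus = np.array([[0.0, 0.0], [3.0, 0.0]])
pis = np.array([0.5, 0.5])

def gamma(eps):
    # Posterior responsibilities with shared spherical covariance eps*I
    logits = np.log(pis) - np.sum((x - mus) ** 2, axis=1) / (2 * eps)
    w = np.exp(logits - logits.max())
    return w / w.sum()

g_soft = gamma(10.0)   # roughly [0.54, 0.46] - a genuinely soft assignment
g_hard = gamma(0.01)   # essentially [1, 0] - the K-means hard assignment
```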

&lt;p&gt;Notice the geometric interpretation here. Because of the way we have specified the $\Sigma_k$ in K-means (as diagonal matrices), the underlying assumption is spherical clusters. Whereas in Gaussian mixtures, we can have elliptical clusters depending on our $\Sigma_k$.&lt;/p&gt;</content><author><name></name></author><summary type="html">Continuing from last time, I will discuss Gaussian mixtures as another clustering method. We assume that the data is generated from several Gaussian components with separate parameters, and we would like to assign each observation to its most likely Gaussian parent. It is a more flexible and probabilistic approach to clustering, and will provide another opportunity to discuss expectation-maximization (EM). Lastly, we will see how K-means is a special case of Gaussian mixtures!</summary></entry><entry><title type="html">K-means and EM</title><link href="http://bllguo.github.io/K-means/" rel="alternate" type="text/html" title="K-means and EM" /><published>2019-06-10T00:00:00+00:00</published><updated>2019-06-10T00:00:00+00:00</updated><id>http://bllguo.github.io/K-means</id><content type="html" xml:base="http://bllguo.github.io/K-means/">&lt;p&gt;Clustering is an unsupervised learning problem in which we try to identify groupings of similar data points, i.e. learn the structure of our data. Today I will introduce K-means, a popular and simple clustering algorithm. Our true motivation will be to use this as a gentle introduction to clustering and the expectation maximization (EM) algorithm. In subsequent posts we will expand on this foundation towards Gaussian mixtures, and finally into latent Dirichlet allocation (LDA).&lt;/p&gt;

&lt;h2 id=&quot;1-k-means-clustering&quot;&gt;1. K-means Clustering&lt;/h2&gt;

&lt;p&gt;Suppose we have $n$ observations of data $x$, with dimensionality $D$. We want to partition them into $K$ clusters such that points in each cluster are close together, and are far from points outside.&lt;/p&gt;

&lt;p&gt;To indicate cluster assignment, we will define indicator variables $r_{nk}$ such that $r_{nk}=1$ if $x_n$ belongs to cluster $k$, and $r_{nj}=0$ for $j\neq k$.&lt;/p&gt;

&lt;p&gt;The simple scheme we employ will be based on distortion, or sum of squared distances, which is a very natural objective function to employ:&lt;/p&gt;

\[J = \sum_n\sum_k r_{nk}||x_n-\mu_k||^2\]

&lt;p&gt;$\mu_k$ is a $D$-dimensional vector representing the center of cluster $k$. We need to manipulate $r_{nk}$ and $\mu_k$ to minimize $J$.&lt;/p&gt;

&lt;p&gt;A simple iterative algorithm to do this is:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Initialize $\mu_k$&lt;/li&gt;
  &lt;li&gt;Minimize $J$ with respect to $r_{nk}$&lt;/li&gt;
  &lt;li&gt;Minimize $J$ with respect to $\mu_k$&lt;/li&gt;
  &lt;li&gt;Repeat 2-3 until convergence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is our first introduction to the expectation maximization (EM) algorithm!&lt;/p&gt;

&lt;h3 id=&quot;11-e-step---fracpartial-jpartial-r_nk&quot;&gt;1.1 E-step - $\frac{\partial J}{\partial r_{nk}}$&lt;/h3&gt;

&lt;p&gt;$\frac{\partial J}{\partial r_{nk}}$ is easy to calculate. We swiftly arrive at&lt;/p&gt;

\[r_{nk} = \begin{cases}
1,  &amp;amp; \text{if }k=\text{arg min}_j ||x_n-\mu_j||^2 \\\
0, &amp;amp; \text{otherwise}
\end{cases}\]

&lt;p&gt;Notice that recomputing the distances between every $\mu_k$ and $x_n$ in each E-step can be costly for large datasets.&lt;/p&gt;

&lt;h3 id=&quot;12-m-step---fracpartial-jpartial-mu_k&quot;&gt;1.2 M-step - $\frac{\partial J}{\partial \mu_k}$&lt;/h3&gt;

&lt;p&gt;$\frac{\partial J}{\partial \mu_k}$ is not much more difficult:&lt;/p&gt;

\[\frac{\partial J}{\partial \mu_k} = 2\sum_n r_{nk}(x_n-\mu_k) = 0\]

\[\mu_k = \frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}\]

&lt;p&gt;Which is actually the mean of all points in cluster $k$.&lt;/p&gt;
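&lt;p&gt;Putting the two steps together, a minimal NumPy sketch of the batch algorithm, run on made-up blob data (no restarts or tolerance tuning):&lt;/p&gt;

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    # Batch K-means: alternate the E-step (assignments) and
    # M-step (cluster means) until assignments stabilize
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # step 1: initialize
    r = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # E-step: assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # M-step: move each center to the mean of its points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[r]) ** 2).sum()   # distortion
    return mu, r, J

# Two well-separated toy blobs (assumed data for illustration)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(10.0, 0.5, size=(100, 2))])
mu, r, J = kmeans(X, K=2)
```

&lt;p&gt;The empty-cluster guard in the M-step simply leaves a center in place if no points were assigned to it, which is one common practical choice.&lt;/p&gt;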

&lt;h2 id=&quot;2-properties&quot;&gt;2. Properties&lt;/h2&gt;

&lt;p&gt;Convergence can be decided in many ways - a threshold on $J$, a maximum number of iterations, running until assignments no longer change, etc. Notice that under EM, the objective function can never increase after an iteration. Of course, we are not guaranteed to find the global optimum, and the final result depends on the (typically random) initialization of $\mu_k$.&lt;/p&gt;

&lt;p&gt;An alternative to this batch formulation of k-means can be derived using stochastic approximation methods, yielding the sequential update equation:&lt;/p&gt;

\[\mu_k^{1} = \mu_k^{0} + \eta_n(x_n-\mu_k^{0})\]

&lt;p&gt;$\eta_n$ is the learning rate parameter; its dependence on $n$ allows it to be annealed over time. This formulation lets us use k-means in online settings.&lt;/p&gt;
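&lt;p&gt;A quick sketch of the sequential update for a single center, on made-up data: with the common annealing schedule $\eta_n = \frac{1}{n}$, the update exactly reproduces the running mean of the points seen so far.&lt;/p&gt;

```python
import numpy as np

# Stream of points for one cluster (made-up data)
rng = np.random.default_rng(0)
points = rng.normal(loc=5.0, size=200)

mu, n_seen = 0.0, 0
for x in points:
    n_seen += 1
    eta = 1.0 / n_seen               # annealed learning rate
    mu = mu + eta * (x - mu)         # sequential update

# with eta_n = 1/n this is exactly the running mean
assert np.isclose(mu, points.mean())
```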

&lt;h2 id=&quot;3-kernelization&quot;&gt;3. Kernelization&lt;/h2&gt;

&lt;p&gt;So far, our formulation and interpretation of k-means is wholly reliant on squared Euclidean distance. This is not always the best metric for evaluating similarity of points. We can generalize with kernels $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;E-step&lt;/p&gt;

\[r_{nk} = \begin{cases}
 1,  &amp;amp; \text{if }k=\text{arg min}_j ||\phi(x_n)-\mu_j||^2 \\\
 0, &amp;amp; \text{otherwise}
 \end{cases}\]

    &lt;p&gt;Notice that we can express $\|\phi(x)-\mu_k\|^2$ purely using the kernel function, in yet another example of the kernel trick:&lt;/p&gt;

\[||\phi(x)-\mu_k||^2 = \phi(x)^T\phi(x) - 2\mu_k^T\phi(x) + \mu_k^T\mu_k\]

    &lt;p&gt;Let $y_k = \sum_n r_{nk}$:&lt;/p&gt;

\[||\phi(x)-\mu_k||^2 = \frac{1}{y_k^2}\sum_{n,m}r_{nk}r_{mk}k(x_n, x_m) - \frac{2}{y_k}\sum_n r_{nk}k(x_n, x) + k(x, x)\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;M-step&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

\[\mu_k = \frac{\sum_n r_{nk}\phi(x_n)}{\sum_n r_{nk}}\]</content><author><name></name></author><summary type="html">Clustering is an unsupervised learning problem in which we try to identify groupings of similar data points, i.e. learn the structure of our data. Today I will introduce K-means, a popular and simple clustering algorithm. Our true motivation will be to use this as a gentle introduction to clustering and the expectation maximization (EM) algorithm. In subsequent posts we will expand on this foundation towards Gaussian mixtures, and finally into latent Dirichlet allocation (LDA).</summary></entry><entry><title type="html">Due-tos</title><link href="http://bllguo.github.io/Differential-decomposition/" rel="alternate" type="text/html" title="Due-tos" /><published>2019-06-06T00:00:00+00:00</published><updated>2019-06-06T00:00:00+00:00</updated><id>http://bllguo.github.io/Differential-decomposition</id><content type="html" xml:base="http://bllguo.github.io/Differential-decomposition/">&lt;p&gt;This post is a throwback to the methodology behind one of my first analytics projects at System1. The due-to is a simple name for a simple idea - isolating the effects of individual key performance indicators (KPIs) on a business metric, like gross profit. Sometimes - most times, even - data science doesn’t have to be that sophisticated.&lt;/p&gt;

&lt;h2 id=&quot;1-problem-setup&quot;&gt;1. Problem Setup&lt;/h2&gt;

&lt;p&gt;Let us consider a simple example. Imagine being the proud owner of Widgets Incorporated. Your new widget has just entered the market and you want to do some digital marketing.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;You place an ad for your widget via Google Ads. After spending some amount at the auction, you end up acquiring some number of impressions, or some number of appearances on Google search engine result pages. A useful KPI here is the cost per impression, which we will denote hereafter as CPC. (Typically, in digital advertising, this is formulated as the cost per thousand impressions, or the cost per mille (CPM))&lt;/li&gt;
  &lt;li&gt;A few users who see your ad end up clicking through to your website, which extolls the virtues of the Widget 3.0. The number of clicks over the number of impressions will be the click-through rate, CTR.&lt;/li&gt;
  &lt;li&gt;From there, some number of visitors who click through also complete a purchase. Revenue is generated. We can easily define revenue per click, RPC.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, what is your gross profit $\pi$?&lt;/p&gt;

\[\pi = Revenue - Cost\]

&lt;p&gt;Let us do some algebraic manipulations to define profit in terms of our KPIs:&lt;/p&gt;

\[\pi = Impressions(\frac{Revenue - Cost}{Impressions})\]

\[\pi = Impressions(\frac{Revenue}{Clicks}\frac{Clicks}{Impressions} - \frac{Cost}{Impressions})\]

\[\pi = Impressions(RPC\times CTR - CPC)\]

&lt;p&gt;Let $V$ be the total volume of impressions.&lt;/p&gt;

\[\pi = V(RPC\times CTR - CPC)\]

&lt;p&gt;You can monitor the performance of your ad looking at the gross profit as well as the daily values of your KPIs.&lt;/p&gt;

&lt;p&gt;Imagine now that you notice a drastic decrease in profit week-over-week, with corresponding changes in the key metrics. How can you understand more quantitatively the effect of V vs. RPC or CTR on the change in profit? What is the effect on profit &lt;em&gt;due to&lt;/em&gt; changes in RPC or CPC?&lt;/p&gt;

&lt;h2 id=&quot;2-decomposition-w-differentiation&quot;&gt;2. Decomposition w/ Differentiation&lt;/h2&gt;

&lt;p&gt;More formally, we observe, over two time periods $t_1, t_2$, a change in $\pi$, $\Delta\pi$.&lt;/p&gt;

\[\Delta\pi = \pi_{t_2} - \pi_{t_1}\]

&lt;p&gt;We would like to identify how much of this $\Delta$ is due to the change in volume $\Delta V$, or change in RPC $\Delta RPC$, etc.&lt;/p&gt;

&lt;p&gt;Differentiate!&lt;/p&gt;

\[\frac{\partial\pi}{\partial t} = \frac{\partial(V(RPC\times CTR - CPC))}{\partial t}\]

&lt;p&gt;An elementary application of the chain rule will give us:&lt;/p&gt;

\[\frac{\partial\pi}{\partial t} = \frac{\partial V}{\partial t}(RPC\times CTR - CPC) + \frac{\partial RPC}{\partial t}(V\times CTR) + \\\frac{\partial CTR}{\partial t}(V \times RPC) - \frac{\partial CPC}{\partial t}(V)\]

&lt;p&gt;We can approximate the partials with our observed deltas, using finite differences. Here we simply use the forward difference:&lt;/p&gt;

\[\frac{\Delta\pi}{\Delta t} = \frac{\Delta V}{\Delta t}(RPC\times CTR - CPC) + \frac{\Delta RPC}{\Delta t}(V\times CTR) + \\\frac{\Delta CTR}{\Delta t}(V \times RPC) - \frac{\Delta CPC}{\Delta t}(V)\]

&lt;p&gt;Along the same vein, we can substitute in values for $V, RPC, CTR, CPC$ from our real observations. $V = V_{t_2}$, $V = V_{t_1}$, or $V = \frac{V_{t_2} + V_{t_1}}{2}$ are all valid choices.&lt;/p&gt;

&lt;p&gt;So we have successfully decomposed $\Delta\pi$! Nota bene:&lt;/p&gt;

\[\Delta \pi = \Delta \pi_{V} + \Delta \pi_{RPC} + \Delta \pi_{CTR} + \Delta \pi_{CPC}\]

\[\Delta \pi_V = \Delta V(RPC\times CTR - CPC)\]

\[\Delta \pi_{RPC} = \Delta RPC(V\times CTR)\]

\[\Delta \pi_{CTR} = \Delta CTR(V \times RPC)\]

\[\Delta \pi_{CPC} = -\Delta CPC(V)\]

&lt;p&gt;That is, the contribution to $\Delta \pi$ from each KPI is given directly from these terms. Dimensional analysis confirms that $\Delta \pi_{KPI}$ are all in units of $\Delta \pi$.&lt;/p&gt;
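&lt;p&gt;The whole decomposition fits in a few lines of code. A sketch with made-up KPI values, evaluating each term at the midpoint of the two periods; for the three-factor revenue term the bars match $\Delta\pi$ only up to small higher-order cross terms:&lt;/p&gt;

```python
# Week-over-week KPI snapshots (made-up numbers for illustration)
kpi_1 = dict(V=1000, RPC=2.0, CTR=0.10, CPC=0.05)
kpi_2 = dict(V=1100, RPC=2.2, CTR=0.09, CPC=0.04)

def profit(k):
    # pi = V * (RPC * CTR - CPC)
    return k["V"] * (k["RPC"] * k["CTR"] - k["CPC"])

mid = {name: (kpi_1[name] + kpi_2[name]) / 2 for name in kpi_1}
delta = {name: kpi_2[name] - kpi_1[name] for name in kpi_1}

# Contribution of each KPI, evaluated at midpoint values
due_to = {
    "V": delta["V"] * (mid["RPC"] * mid["CTR"] - mid["CPC"]),
    "RPC": delta["RPC"] * mid["V"] * mid["CTR"],
    "CTR": delta["CTR"] * mid["V"] * mid["RPC"],
    "CPC": -delta["CPC"] * mid["V"],
}

d_pi = profit(kpi_2) - profit(kpi_1)
# here: d_pi is about 23.8 and the bars sum to about 23.85 - equal
# up to the higher-order cross terms of the three-factor product
total = sum(due_to.values())
```

&lt;p&gt;The midpoint choice makes two-factor terms like $V \times CPC$ decompose exactly; any residual comes from the three-factor $V \times RPC \times CTR$ term and shrinks as the period-over-period changes get smaller.&lt;/p&gt;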

&lt;p&gt;With this information we can build a slick waterfall chart that satisfies our equation, with the first four bars summing to the last (count the heights!).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/dueto.png&quot; alt=&quot;dueto&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this case, reductions in cost per impression (CPC) since the last week drove the majority of the profit increase, but click-through rate plummeted. This may indicate that the keywords we are bidding on are no longer very relevant: while competition in the auction is lower for these keywords, consumer interest has also waned to a degree that nearly offsets that change. We can extract clear practical insights immediately.&lt;/p&gt;

&lt;p&gt;In general, this analysis can be useful in:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;identifying which KPI has the most impact on our metric of interest&lt;/li&gt;
  &lt;li&gt;providing a bird’s-eye view of macro-level trends and performance over two time periods&lt;/li&gt;
  &lt;li&gt;disentangling the effects of individual KPIs for particularly convoluted metrics&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">This post is a throwback to the methodology behind one of my first analytics projects at System1. The due-to is a simple name for a simple idea - isolating the effects of individual key performance indicators (KPIs) on a business metric, like gross profit. Sometimes - most times, even - data science doesn’t have to be that sophisticated.</summary></entry></feed>