Visualization and Projection

One of the common themes that I’ve emphasized so far on this blog is that we should try to analyze high dimensional data sets without being able to actually “see” them. However, it is often useful to visualize the data in some way, and having just introduced principal component analysis, this is probably a good time to start the discussion. There are a number of types of visualization that involve representing statistics about the data in different ways, but today we’ll look at ways of representing the actual data points in two or three dimensions.

In particular, what we want to do is to draw the individual data points of a high dimensional data set in a lower dimensional space so that the hard work can be done by the best pattern recognition machine there is: your brain. When you look at a two- or three-dimensional data set, you will naturally recognize patterns based on how close different points are to each other. Therefore, we want to represent the points so that the distances between them change as little as possible. In general, this is called projection, the term coming from the idea that we will do the same thing to the data as you do when you make shadow puppets: We project a high dimensional object (such as your three-dimensional hands) onto a lower dimensional object (such as the two-dimensional wall).

We’ve already used linear projection implicitly when thinking about higher dimensional spaces. For example, I suggested that you think about the three-dimensional space that we live in as a piece of paper on the desk of someone living in a higher dimensional space. In the post about the curse of dimensionality, we looked at the three-dimensional cube from the side and saw that it was two-dimensional, then noted that a four-dimensional cube would look like a three-dimensional cube if we could look at it “from the side”.

When we look at an object from the side like this, we are essentially ignoring one of the dimensions. This is the simplest form of projection, and in general we can choose to ignore more than one dimension at a time. For example, if you have data points in five dimensions and you want to plot them in two dimensions, you could just pick the two dimensions that you thought were most important and plot the points based on those. It’s hard to picture how that works because you still have to think about the original five-dimensional data. But it is similar to taking a two-dimensional data set and throwing away one of the dimensions, as in the left and middle pictures in the Figure below. You can see the shadow puppet analogy too: In the figure on the left, the light is to the right of the data, while in the middle, it’s above.

[Figure: a two-dimensional data set projected onto the vertical axis (left), the horizontal axis (middle), and a diagonal line (right).]

Notice that the two-dimensional data set appears to be made up of three groups of three data points, but when we project onto the vertical or horizontal axis, we see one group of six and one group of three. This is because in each of these directions, one of the groups of three is hiding behind one of the others. So both projections are deceptive, and in both projections we lose information. This is not surprising, since we’re essentially throwing away all the information in one of the dimensions.

In particular, when we look at the vertical projection on the left of the Figure, each point could actually be anywhere along the horizontal line that it sits on. This is similar to the fact that when you see a dark spot in a shadow puppet, you know it comes from some part of your hand, but you don’t know how far from the wall that part of your hand is – there is a whole line of possibilities from the light bulb to the wall. That’s precisely why we can make such a variety of apparent shapes with shadow puppets. The more dimensions we throw away, the worse it gets: If we project from five dimensions to two, then each point in the two-dimensional plot could be anywhere in a three-dimensional hyperplane in the original five-dimensional space.
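To make this concrete, here is a minimal numpy sketch of axis-aligned projection. The coordinates are hypothetical, loosely mimicking the three groups of three described above (the exact values in the figure are unknown): projecting onto an axis just means keeping one coordinate and throwing the other away.

```python
import numpy as np

# Nine 2-D points in three groups of three (hypothetical data,
# loosely based on the figure: B sits above A, C sits to the right of A).
points = np.array([
    [1.0, 1.0], [1.2, 1.1], [1.1, 0.9],   # group A
    [1.0, 3.0], [1.2, 3.1], [1.1, 2.9],   # group B
    [3.0, 1.0], [3.2, 1.1], [3.1, 0.9],   # group C
])

# Projecting onto an axis = keeping that coordinate, discarding the rest.
horizontal = points[:, 0]  # groups A and B collapse onto each other
vertical = points[:, 1]    # groups A and C collapse onto each other
```

Dropping a column is all there is to it: in the `horizontal` projection, groups A and B land on exactly the same values, so one group of three hides behind the other, just as in the figure.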

We can sometimes improve things a little by projecting along an angle rather than directly along the features. This is called a linear projection (even though we’re projecting onto a plane or hyperplane rather than a line). For example, if we project onto the diagonal line shown in the picture on the right of the Figure, the three groups of three data points actually look like three separate groups in the projection. The groups don’t look as well separated, but at least they’re no longer hiding behind each other.
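Computationally, projecting onto a line through the origin is just a dot product with a unit vector (and projecting onto a plane or hyperplane is a matrix product with several orthonormal direction vectors). Here is a sketch using the three hypothetical group centers from before; the oblique direction is chosen for illustration, since the exact angle in the figure is unknown.

```python
import numpy as np

# Centers of the three hypothetical groups from the figure.
centers = np.array([[1.0, 1.0],   # group A
                    [1.0, 3.0],   # group B, directly above A
                    [3.0, 1.0]])  # group C, directly right of A

# A unit vector along an oblique direction (illustrative choice).
direction = np.array([2.0, 1.0]) / np.sqrt(5.0)

# The coordinate of each point along the line is a dot product.
coords = centers @ direction
# All three values are distinct: no group hides behind another.
```

Along either axis, two of the centers share a coordinate and collapse together; along this tilted direction, all three get distinct coordinates, which is exactly the improvement the diagonal projection in the figure provides.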

From any angle, the projection is the result of crushing/flattening the data, so we still lose information and the shape of the data will be distorted in some way. The way to minimize the distortion is to crush along the directions where the data is the least spread out and to preserve the dimensions along which it is the widest. In the previous post on Principal Component Analysis, I discussed how to find the directions in which a data set is the most “spread out”, and we called the vectors pointing in these directions the principal components. So this suggests that projecting onto the longest two or three principal components (depending on whether we want a two-dimensional or three-dimensional projection) should produce a relatively good picture of the data.

The picture that results from calculating the principal components, then using them to create a linear projection, is often called a PCA projection. For example, this is (pretty close to) the projection on the right in the Figure above. Like any form of projection, there will still be distortion, but using the principal components should (at least in theory) minimize it.
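A PCA projection can be sketched in a few lines of numpy: center the data, take the singular value decomposition (whose right singular vectors are the principal directions, ordered from most to least spread), and keep the coordinates along the top two. This is only a minimal illustration, not the full story; library implementations such as scikit-learn’s PCA handle the details more robustly. The example data set here is made up for the demonstration: 3-D points lying near a tilted plane.

```python
import numpy as np

def pca_project(points, k=2):
    """Project points (an n x d array) onto their top k principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by spread.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Hypothetical example: 100 points in 3-D lying near a tilted plane.
rng = np.random.default_rng(0)
flat = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.0, 0.5],
                                             [0.0, 1.0, 0.5]])
noisy = flat + 0.01 * rng.normal(size=(100, 3))

shadow = pca_project(noisy, k=2)  # a (100, 2) array: the PCA projection
```

Because the points barely spread out in the direction perpendicular to the plane, crushing along that direction loses almost nothing, which is exactly the sense in which the PCA projection minimizes distortion.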

Note that because PCA assumes a relatively simple model for the data set (a Gaussian blob), it essentially ignores much of the structure of data sets in which the underlying distribution is curved in some way. So the PCA projection will not go out of its way to preserve this structure in the projection. I mentioned in the PCA post that there are advanced techniques, such as kernel PCA, that can be used to make PCA work better on data sets with a curved/more complicated structure. There are also projection algorithms unrelated to PCA that are specifically designed for curved data sets. However, those will have to wait for future posts.

Posted in Visualization | 3 Comments

Principal Component Analysis

Now that we’ve gotten a taste of the curse of dimensionality, let’s look at another potential problem with the basic form of regression we discussed a few posts back. Notice that linear/least squares regression always gives you an answer, whether or not that answer is any good. If you want to find a line that best fits your data, the regression algorithm will give you a line, regardless of the actual structure of the data. So what we also need is a way to determine how good the model discovered by regression is at approximating the data. There are a few different ways to do this, and in this post, I’ll introduce one called Principal Component Analysis (PCA).

Continue reading

Posted in Modeling | 22 Comments

The curse of dimensionality

Now that we’ve had a glimpse of what it means to analyze data sets in different dimensions, we should take a little detour to consider really high dimensional data. In the discussion of regression, I suggested using your intuition about planes in three-dimensional space to understand hyperplanes in higher dimensions. This is a great way to get around the fact that we can’t visualize dimensions higher than three, but we still have to be careful to understand how geometry in higher dimensions is different from geometry in two or three dimensions. As it turns out, a number of aspects of higher dimensional geometry are quite counter-intuitive and this is a major component of what’s called the curse of dimensionality.

Continue reading

Posted in Modeling | 10 Comments

General regression and overfitting

In the last post, I discussed the statistical tool called linear regression for different dimensions/numbers of variables and described how it boils down to looking for a distribution concentrated near a hyperplane of dimension one less than the total number of variables (co-dimension one). For two variables this hyperplane is just a line, which is probably how you usually picture regression. In this post, I’ll discuss a more flexible version of regression, in which we allow the line or hyperplane to be curved.

Continue reading

Posted in Modeling, Regression | 14 Comments

The geometry of linear regression

In this post, we’ll warm up our geometry muscles by looking at one of the most basic data analysis techniques: linear regression. You’ve probably encountered it elsewhere, but I want to think about it from the point of view of geometry and, particularly, the distributions that I introduced in my previous post. Recall that our goal is to infer a probability distribution from a set of data points. Linear regression follows a very common pattern among modeling algorithms: We choose a basic form that we want the distribution to have, then we choose the distribution that best fits the given data among all distributions of this form.

Continue reading

Posted in Modeling, Regression | 28 Comments

Filling in the gaps – Probability Distributions

In the last post, I discussed how one can analyze a data set from the point of view of geometry, by thinking of each data point as coordinates in a high dimensional space. The thing is, when we’re analyzing these data sets, we’re usually not interested in the points in the data set, so much as the points that aren’t in the data set. For example, if we’re using past customer data to predict whether a new potential customer will buy a product, the new customer is not in our initial data set. So the goal of analyzing the data points from past customers is to understand the points that may correspond to new customers. The same goes for determining if a tumor is benign, if a certain blood pressure drug will work on a given patient, or any other application of data analysis.

Continue reading

Posted in Introduction | 9 Comments

What is data?

In this post, we’ll start working on understanding how computers and their programmers actually go about analyzing large, high-dimensional data sets. I’ll start by describing three different ways that one can think about data, each of which suggests a different set of tools and possibilities.

Continue reading

Posted in Introduction | 10 Comments

Different names for data analysis

Welcome to the Shape of Data blog. Over the next few months, I plan to write a number of posts illustrating how understanding the geometry behind data analysis can lead to deeper insights and a more intuitive understanding of the data. But before we delve into geometry, it might be a good idea to discuss the many different names that are often used to describe the analysis of large data sets.

Continue reading

Posted in Introduction | 14 Comments