Duality and Coclustering

In mathematics, the term “duality” usually refers to a relationship between two objects, or types of objects, that is somehow symmetric. (Note that this is different from some other uses of the term outside of mathematics.) For example, if we start with a cube and draw a dot in the center of each of its eight faces, these dots will form the corners of an octahedron (a three-dimensional shape with eight faces and six corners) inside the cube, as on the left in the Figure below. But if we do the same thing with an octahedron, drawing a dot in the center of each of its faces, then forming a shape from them, we will get back a cube, as on the right in the Figure. So, the relationship that gets us from the cube to the octahedron is the same as the relationship that gets us from the octahedron to the cube.

duality

Continue reading

Posted in Uncategorized | 4 Comments

Configuration Spaces and the Meaning of Probability

I recently finished reading Nate Silver’s book The Signal and the Noise, which has gotten me thinking about how exactly one should interpret models/probability distributions, and the predictions they make. (If you’ve read this book or plan to read it, I also recommend reading Cathy O’Neil’s review of it.) Ultimately what a model does is to make a claim about the probability that a certain statement is or is not true. I always found this idea slightly troubling, since any fact is either true or false; to say that a true statement has a 70% probability of being true seems kind of meaningless, even if you don’t know that it’s true. When you’re making predictions about future events, where the statements aren’t yet true or false at all, this seems even more problematic. But it turns out that one can make philosophical sense of these sorts of statements by making a slight adjustment, namely saying that something has a 70% probability of being true given what we know about it. To explain what I mean by this, I want to introduce the idea of a configuration space.

Continue reading

Posted in Modeling | 9 Comments

Case Study 6: Digital images

In the last two posts, I described how we could generalize the notion of “tokens,” which we first saw when analyzing non-numeric census data, to time series. In this context, a token is a short snippet from a time series that is very similar to a lot of snippets of the same length from other time series. This week, we’ll generalize the idea further, from one-dimensional time series to two-dimensional images/pictures.

Continue reading

Posted in Feature extraction | 3 Comments

Case Study 5: Wavelets

In the last post, I explained how to measure how similar two time series of the same length are using the idea of resonance from physics. The answer boiled down to calculating the dot product of the two time series, a measure that is commonly used in linear algebra for calculating the angle between two vectors. So this fits nicely into the general theme of this blog – thinking of data as geometric information (in this case vectors) and translating the analytical tools into geometric properties. In this post, I want to focus on the case when we have time series of possibly different lengths. In this case, we can’t take the dot product of two time series because there isn’t a one-to-one way to match up the times from the first series with the times from the second. Instead, we’ll use the idea of resonance to compare shorter snippets of the two time series using the idea of tokens from the text analysis case study.

Continue reading

Posted in Feature extraction | 6 Comments

Case study 4: Resonance and Robots

In the last few posts, I described how to turn data sets with varying amounts of structure into vector data that could be analyzed using standard data mining/machine learning algorithms. All the types of data that I looked at were static, i.e. each data point was a single instance of information, whether it was measurements from a single iris flower, census data from a single person or the contents of a single text message. In this post and the next, I will switch gears to data that varies with time, which is called time series data. In other words, we will want to take a sequence of readings and turn them into a single data point in some (possibly high dimensional) data space.

There are countless ways to analyze time series, and my goal here is just to describe a few initial ideas that one can try. We’ll take our cues from two sources: First, one of the most fundamental time series in the human experience is sound, which is transmitted into our ears as a (very fast) sequence of pressure changes on our ear drums. Second, each text message that we looked at in Case study 3 can be thought of as a time series in which each letter in the message is the next value in the time series.

The key to translating the text messages into data vectors was that we were willing to throw away some of the information in each text message, namely the order in which the words appeared. We were left with a handful of words/tokens for each text message, which we translated into a vector with each entry a 1 or 0, indicating whether or not each token is present in the given message. (This is called a bag of words representation.) Note that we didn’t completely discard the time information: the smallest unit of time in the series is the letters, and the words/tokens are snippets of this series of letters. So the order of the letters within each token/snippet matters, but we discard the order in which the tokens/snippets occur relative to each other.

It is often useful to do something similar for more general types of time series: given a collection of tokens that can appear in some type of time series, we may want to ask which of these tokens appear in a particular time series. For example, if we have a collection of sound recordings (which as noted above, we can think of as time series) and a recording of a bird chirp (which will be our token), we can ask which recordings include that chirp (or a similar chirp). As with the text messages, we won’t necessarily care when in the recording the chirp occurs, just whether or not it’s there. If we had a collection of short token recordings, we could record which of them occur in a given longer recording as a binary vector analogous to the bag-of-words for a piece of text.

Note that there are essentially two parts to this process: First, we need to decide what tokens we’re looking for. Second, we need to decide whether or not they appear in a given time series. In this post, I’m going to focus on the second part. In my next post, I’ll discuss the first, which is much more technical and relies on understanding the second part.

So we would like a way to decide which tokens appear in a given series, so that we can translate each time series into a vector of 1s and 0s, like we did with the text messages. (Note that these vectors will all have the same dimension, namely the number of tokens that we’re looking for, independent of the length of the original time series.) In practice, we will often end up with feature values between 0 and 1, since the tokens may be very close to something in a given time series but not match exactly. For example, if our sound recording has a chirp from a different bird in the same species as the bird from the token, their chirps will probably be a little different. So we might want to record a 92% match as a .92, etc.

Most time series come in the form of a stream of numbers. In some series the numbers come at regular intervals. For example in sound recordings, each number is essentially the pressure on the membrane of a microphone caused by a sound wave at a given time. The amount of time between measurements is determined by the recording frequency, usually measured in kilohertz (kHz). A 44.1 kHz MP3 file has 1/44,100 seconds between each time step. When recording these types of time series, it is only necessary to include the reading at each time, as long as the frequency is recorded in the metadata.

In other time series, there are irregular intervals between readings. For example, the ticker for a stock symbol on the New York Stock Exchange records the price that was paid for the stock each time someone buys the stock from someone else. There may be large time gaps between sales, followed by bursts of rapid trading. A file that records this type of time series must include a time stamp for each reading, in addition to the value at that time.

There are also time series in which more than just one number is recorded at each time. For example, there are data sets (such as this one) that record position information from the gyroscope and accelerometer in a cell phone over a period of time. At each time, a six-dimensional vector with three readings from each sensor is recorded.

So let’s return to the original question: How do we decide if a given token is represented in a given time series? Note that the token will be a short time series of the same form as the time series in question. This is where our ears come in. Even among those of us who are tone deaf, ears are exceptionally good at picking out frequencies in sounds. In fact, when your ears transmit sound data to your brain, they do so not as a sequence of pressures, but as a spectrum of frequencies. (This translation into frequencies is called a Fourier transform.) The translation is done not by neurons, but by a physical process that takes place in the part of the ear called the cochlea, using a physical phenomenon called resonance.

You should have an intuitive understanding of resonance if you’ve ever pushed someone on a swing (something I’ve done my fair share of in the last few years.) The first time you ever pushed someone on a swing, you probably figured out very quickly that you need to push them while they’re moving away from you, not towards you, as in the Figure below. This is simple physics: If you push them while they’re moving towards you, you will be acting against their momentum, and thus decrease their overall kinetic energy. If you push when they’re moving away from you, you will be adding to their kinetic energy. (In fact, the best time to push is when they’re in the middle of their swing and closest to the ground, but this is very hard to do.)

swingresonance

So if you want to increase the amount of kinetic and potential energy of the child in the swing (in response to their cries of “higher! higher!”) you have to push in the same rhythm/pattern as the swing. But the rate at which the swing goes back and forth is determined almost entirely by the length of the swing (for an ideal pendulum, it doesn’t even depend on the weight of the child). No matter how hard you push or what rate you push at, the number of times they go back and forth per minute will be the same.

In other words, the pattern that the swing follows is fixed. You can think of this fixed pattern as a token. If two people of equal strength were pushing two children of equal weight, the person whose pushing pattern most closely matched the pattern of the swing would end up pushing their child higher. These two pushing patterns play the role of the two time series that we want to check for the token. The higher the swing goes, the better the match for the token. A match between the two patterns is called resonance and the fixed frequency of the swing is called its resonance frequency. (Note that the musical term resonance as opposed to dissonance is more general, but closely related to the way we’re using the term here.)

The cochlea in your ear does this with thousands of little hairs called stereocilia. Each hair is a different length and thus naturally vibrates at a different frequency, the same way that the swing goes back and forth at a fixed rate. When a sound wave vibrates the fluid in the cochlea, it causes these little hairs to vibrate (like pushing the swing) and the amount that each stereocilium vibrates (how high the swing goes) depends on how closely the pattern of the sound matches the natural pattern of the stereocilium. So, each hair is in charge of a single frequency and the amount that that hair is vibrating at any given time tells the brain the extent to which its assigned frequency was present in the last few microseconds of sound. (The liquid in the cochlea makes sure that the stereocilia don’t keep vibrating for long after the sound ends.)

So, that’s all well and good, but now we have to figure out how to do the same thing with a computer. As it turns out, it’s not so hard. If we’re willing to skip over some interesting (but reasonably dense) physics and calculus, we can calculate the amount of energy that each push adds to the swing, or that each moment of the sound wave adds to the stereocilium by just multiplying the force of the push by the velocity of the swing/stereocilium. There are two possible directions for the push and for the velocity, so we have to choose one direction to be positive and the other negative. If the push and the velocity are both positive or both negative then the product is positive, so you’re adding energy. (Aren’t negative numbers great!) If one is positive and the other negative, i.e. the push is in the opposite direction from the velocity, then the product is negative, so you’re reducing the energy.

For simplicity, let’s say our token is a time series of length five, with regular intervals. We can think of those five numbers as a vector (a,b,c,d,e). To decide if the token appears at a certain time in a longer time series, we will take the five readings starting at that given time and form a vector (v,w,x,y,z). To have this vector “push on” the token to see how much energy it creates, we just have to multiply the corresponding elements (which will give us the added energy at each individual time) then add them together (for the total energy). In other words, the value av + bw + cx + dy + ez tells us how closely they match.
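As a minimal sketch in Python (the token and snippet values here are made up for illustration, not taken from any real recording), the matching score is just an element-wise multiply-and-add:

```python
# A made-up token and a made-up snippet of the same length.
token = [0.0, 1.0, 0.0, -1.0, 0.0]
snippet = [0.1, 0.9, 0.0, -0.8, 0.2]

# av + bw + cx + dy + ez: the energy contributed at each time step,
# summed to get the total "energy" of the match.
energy = sum(t * s for t, s in zip(token, snippet))
print(energy)  # roughly 1.7
```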

If you remember your linear algebra, you may recognize this as the dot product of the two vectors. Recall that the dot product is used to calculate the angles between vectors and has played a role in almost all of my earlier posts (though I didn’t always mention it by name.) That’s quite convenient because it means that we can just use the token (a,b,c,d,e) and the snippet (v,w,x,y,z) like any other pair of vectors. (It may seem like a waste to have put all this energy into explaining resonance when we ultimately decided to just use the time series in the most direct and simple way, but at least we now know why this simple approach is reasonable.)

There is one more small detail: Remember that when we compared the swing pushers, I insisted that the pushers must have equal strength and the swing riders must have equal weight. It’s obvious that if one of the pushers is significantly stronger than the other, they will make the child go higher, no matter how bad their timing is (though it may not be a pleasant experience for the child.) We can overcome this by (drumroll please…) normalizing the vectors (a,b,c,d,e) and (v,w,x,y,z) to have length 1, i.e. so that the dot product of each vector with itself is 1. After we do this, their dot product will be exactly equal to the cosine of the angle between them. A larger cosine (more energy added to the swing) means a smaller angle, so having the patterns match closely will correspond to the two vectors being very close in the data space. This gives us a value between -1 and 1 (between 0 and 1 when all the readings are nonnegative) that tells us how closely the token matches the snippet from the original time series. If we want to decide whether or not the token appears anywhere in the time series, we’ll need to look at a number of overlapping snippets throughout the time series, compare each one to the token, and record the maximum dot product over all the snippets.
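A rough sketch of the whole procedure, normalization and sliding window included (the numbers are invented; this is just one way it might look in Python):

```python
import math

def cosine(u, v):
    # Normalize both vectors to length 1 and take the dot product,
    # which equals the cosine of the angle between them.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def best_match(token, series):
    # Slide the token over every snippet of the same length and
    # record the maximum cosine over all positions.
    n = len(token)
    return max(cosine(token, series[i:i + n])
               for i in range(len(series) - n + 1))

token = [0.0, 1.0, 0.0, -1.0, 0.0]                  # made-up token
series = [0.2, 0.1, 0.0, 2.0, 0.0, -2.0, 0.1, 0.3]  # made-up series
print(best_match(token, series))  # close to 1: a strong match
```

(In practice an all-zero snippet would need special handling, since the normalization divides by the snippet’s length.)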

For series with irregular time intervals, things are a little more complicated, but the same basic theory applies. Since I said my goal was just to give you a few ideas about how you might start analyzing time series, I’m going to leave that case alone for now.

Instead, let’s look at a specific data set of time series with regular intervals. The Robot Execution Failures Data Set on the UCI repository has time series of force and torque measurements taken from a robot right after it detected a failure. Each time series is labeled with one of four classes (normal operation or one of three failure modes), and the goal is to predict the label based on the time series.

One nice thing about this data set is that each time series is the same length, namely 15 time steps of six force/torque readings each (three forces and three torques), 21 milliseconds apart, starting from the time of failure detection. Since these are all the same length, we don’t need to cut snippets out. Instead, we will treat each time series as a single vector of dimension 15 * 6 = 90. We’ll look at methods involving snippets and tokens in the next post. For now, I just want to demonstrate that thinking of time series as vectors in this way is a reasonable procedure.

Each time series in the data set is arranged as a block of data such as the following:

normal
	-1	-1	63	-3	-1	0
	0	0	62	-3	-1	0
	-1	-1	61	-3	0	0
	-1	-1	63	-2	-1	0
	-1	-1	63	-3	-1	0
	-1	-1	63	-3	-1	0
	-1	-1	63	-3	0	0
	-1	-1	63	-3	-1	0
	-1	-1	63	-3	-1	0
	-1	-1	61	-3	0	0
	-1	-1	61	-3	0	0
	-1	-1	64	-3	-1	0
	-1	-1	64	-3	-1	0
	-1	-1	60	-3	0	0
	-1	0	64	-2	-1	0

The first line contains the label for the time series. In this example, ‘normal’ means there wasn’t a failure. In the file ‘lp1.data’, there are three other labels: ‘collision’, ‘obstruction’ and ‘fr_collision’. Each row contains six force and torque readings from a single point in time. As described above, we’re going to just concatenate these lines into one long vector. Here’s the Python code to do this. (The file has been read into a list of strings called ‘lines’ and the number of time series has already been calculated and stored in the variable ‘rows’.)

import scipy

# Create an empty data matrix and labels list
data = scipy.zeros([rows, 90])
labels = []

# Cycle through each block of data
for i in range(rows):
    # Read the label from the first line of the block
    labels.append(lines[i*18].strip()) # strip the trailing newline

    for j in range(15): # Cycle through the data lines
        d = lines[i*18 + j + 1].split('\t')
        # Read in the six numbers from each time step
        for k in range(6):
            data[i,j*6+k] = float(d[k+1])

There are only 88 time series in the file, but as noted above, we have 90 features. Whenever there are more features than data points, any classification algorithm is virtually guaranteed to suffer from overfitting. However, we can get a rough idea of how much structure this form of feature extraction produces by doing a PCA plot. Here it is:
RobotPCA1

The ‘normal’ time series are shown in green and form a very tight grouping. The ‘obstruction’ and ‘fr_collision’ series are shown in red and yellow, and appear to be well separated from the greens, and to a lesser extent from each other. The blue points are in the ‘collision’ class and it is hard to tell whether or not they’re separated from the greens. So let’s do another plot with just the greens and blues. Here it is:

RobotPCA2

This is more or less a blow-up of the green/blue blob in the picture above. Again, the green ‘normal’ points are tightly packed. But this time the blue points are widely dispersed. They don’t appear to be linearly separable from the green, because they surround the green on all sides. But, of course, because of the way projection works, there’s no way to tell for sure unless we run a linear classifier like SVM or logistic regression on it. I don’t think it’s worth doing on this data set since, as I noted above, overfitting is going to be a huge problem. But if you’d like to, you can download all the code from the Shape-of-Data github archive and try it yourself. (Note that I changed the name of the robot data file from “lp1.data” to “robotlp1.data” so that it would be easier to remember what it is.)
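For the curious, the projection behind plots like these can be sketched by hand with NumPy. The matrix below is random, standing in for the 88 x 90 robot data; the real plots would feed the two projected columns to a scatter plot, colored by label:

```python
import numpy as np

# Random stand-in for the 88 x 90 robot data matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(88, 90))

# PCA by hand: center the data, then project onto the top two
# right singular vectors of the centered matrix.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T  # one (x, y) pair per time series
print(projected.shape)  # (88, 2)
```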

So this suggests that forming longer vectors from time series (or snippets of time series) in this way does a reasonable job of creating the type of structure that we’re used to. As I noted above, the real trick is going to be to use this process with snippets/tokens in order to analyze time series of different lengths. But that will have to wait for my next post.

Posted in Feature extraction | 1 Comment

Case study 3: Free form text

In the past two posts, I’ve been looking at ways to turn “unstructured” data into vector data that can be analyzed with techniques like the ones that I’ve described elsewhere on this blog. One of the most common types of unstructured data that you’ll find on the internet is text, whether it’s news stories, tweets or spam e-mails. This can be very difficult to deal with, since the meaning behind the text generally depends heavily on things like the order that the words are in, their context and on subtle cultural subtexts. The field of natural language processing has come up with some very clever ways to deal with a lot of these issues, but that’s not what I’m going to write about today. There are many analysis problems that don’t require a complete understanding of the meaning of different pieces of text, so much as a general impression. So today, I want to look at a relatively simple way to deal with text data by generalizing the method from my last post for dealing with token-based data.

As with the last few posts, we’re going to be interested in classification problems. For example, let’s say we have a collection of texts such as a number of different news stories or a number of tweets from Twitter, and we want to decide which of them are about fishing. We should be able to get a fairly accurate prediction just by looking at what words appear in each text. For example if we see both the words “bait” and “fish”, there’s a good chance we’ve got a positive. There are some words where things could go either way. For example, is the word “tackle” referring to fishing equipment or to football? If we’re reading it on our own, we would be able to tell from context. But for a long enough text, even if we don’t know the context of each specific word, we should be able to make a prediction based on what other words appear. There will also be some words that make the text less likely to be about fishing. For example, the word “touchdown” would be a good negative word, to cancel out any appearance of “tackle” in the wrong context.

For a topic that you know well, you could probably pick out a number of words that will increase or decrease the chances that a text is about your topic. But our goal, of course, is to have the computer pick out these words, i.e. to build a model for deciding which texts are related to a given topic, or fit into some other type of classification.

A popular version of this problem is the detection of spam messages. So this week, we’ll look at the SMS Spam Collection data set from the UCI data archive. Each line in this data set consists of either the word spam or ham (the latter of which means a non-spam message), followed by the contents of an SMS message (i.e. a text message you would get on your phone.) The goal is to use the spam/ham label on a subset of the messages (the training set) to build a model that will correctly predict whether each of the remaining messages (the evaluation set) is spam or ham.

As I mentioned above, the most common way to do this is to use the token idea from my last post. In the census data set, there are a number of categories, each with a small number of possible values. Each data point consists of a list of tokens, one from each category. We turned this data into vectors by creating one column for each possible choice (rather than one for each category.)
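As a quick reminder of what that encoding looks like, here is a minimal sketch with a made-up option list (not the real census columns):

```python
# One feature per possible token, rather than one per category.
employment_options = ["private", "self-employed-inc", "federal-gov"]

def one_hot(value, options):
    # 1.0 in the position of the matching option, 0.0 elsewhere.
    return [1.0 if value == opt else 0.0 for opt in options]

print(one_hot("self-employed-inc", employment_options))
# [0.0, 1.0, 0.0]
```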

When we deal with text, it initially seems like we’re in a very different situation: We don’t have any categories, and there are an essentially limitless number of possible words that could appear. But actually, things aren’t so different. For example, we could think of the text message as a single category, where instead of picking a single option/token for that category, we get to choose as many as we want. We still have a slight problem that in theory, there are an unlimited number of possible words that could appear in text messages. Luckily, there are a relatively small number of words that actually appear in the text messages in this particular data set, so we’ll just deal with those.

In other words, we can build our own list of possible tokens by recording all the different words that appear in all the text messages. Below is the python code to do this. Note that the line after the first comment strips all punctuation from the line and makes it lower case. This way any word that appears right before a comma or period, or at the beginning of a sentence, doesn’t get counted as a separate token. (I realize that proper punctuation is not very common in text messages, but you never know…) Here’s the code.

import string
from collections import Counter

f = open("SMSSpamCollection.data", "r")
# Read each line of the data file, removing punctuation
lines = [l.translate(string.maketrans("",""),
            string.punctuation).lower() for l in f]
f.close()

wordcount = Counter() # Tracks how often each word appears

for l in lines:             # For each text message
    for w in l.split()[1:]: # For each word in the text
        wordcount[w] += 1   # Record it in wordcount

# Create a list of words that appear more than five times
words = [w for w in wordcount if wordcount[w] > 5]

Two things to note: First, I changed the data file name from “SMSSpamCollection” to “SMSSpamCollection.data” so that it would have the same file extension as the other data files I’ve looked at. Second, I split the third line of code into two to make it fit in the browser window. If you copy this into a python file, you’ll have to unsplit it to make it work. You can also download the entire script from the Shape of Data github page.

After running this, I found that among the 5,574 text messages (of which 747 are spam), there are 9,663 different words that appear. Among these words, there are only 1,600 that appear more than five times. I decided to leave out the less frequent words, since we want a model that will focus on features that are likely to come up often in future text messages. For example, many of the less frequent words are URLs, which seem to appear in a lot of the spam messages and probably won’t recur.

Now that we have a list of tokens, we can translate the text messages into vectors. As with the last post, we will create a set of vectors with one dimension/feature for each possible token/word. For each text message and each word, we will set the corresponding entry in the vector to 1.0 if the word appears in the text message and to 0.0 otherwise. Another option would be to set each entry to the number of times the word appears in the text message, but we won’t explore that today. Here’s the python code:

import scipy

# Create empty data and label holders
data = scipy.zeros([len(lines),len(words)])
labels = scipy.zeros(len(lines))

for i in range(len(lines)):     # For each text message
    l = lines[i].split()        # Make a list of words
    for j in range(len(words)): # For each token
        if words[j] in l[1:]:   # Set the entry to 1.0 if
            data[i,j] = 1.0     # the word appears

    if l[0] == "spam":   # Set the labels
        labels[i] = 1.0

We now have all the text message data in the same type of format that we used for both the Iris data set and the census data set, so we can analyze it the same way.

Again, note that we have thrown away a lot of information in this process. In particular, we no longer know what order the words appeared in, so we can’t reconstruct the original meaning of each text. In fact, in many cases we won’t be able to tell what part of speech a word is, since there are words like “mug” that can be used both as a noun and as a verb, depending on context. One way to deal with this issue would be to use a natural language processing algorithm like part-of-speech tagging which predicts how each word is being used and tags it as noun/verb/adjective/etc. In python, the NLTK package can do all of this relatively easily, but we won’t go into that today.

Once we have the data in vector form, we can create a PCA plot using the same code from the last post, which is shown below. You can see that there appears to be some real structure, and that the blue dots (the spam) are reasonably separated from the green dots (the ham).

spamPCA1

To see how well separated they really are, we can run a classification algorithm. When I split the data into 80% training and 20% evaluation sets and ran logistic regression, I got an accuracy of 98.2%, which is pretty darn good.
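A sketch of that experiment with scikit-learn; the matrix and labels below are a small synthetic stand-in for the real message/token data, so the toy accuracy here is unrelated to the 98.2% figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 "messages" over a 40-word vocabulary,
# labeled spam (1) exactly when both of the first two words appear.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 40)).astype(float)
labels = (data[:, 0] + data[:, 1] > 1).astype(float)

# 80% training / 20% evaluation split, then fit and score.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # fraction predicted correctly
```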

spamPCA2

We also have the option of scaling the features, like we did with the census data. The resulting PCA plot is shown on the left. The combination of treating the words as tokens and scaling them based on how common they are is called tf-idf, which stands for term frequency – inverse document frequency. In other words, each feature represents how often each word appears in the given text, scaled down by how common that word is across all the texts (typically via the logarithm of the inverse of the fraction of texts containing it). You can do this automatically with the scikit-learn tf-idf function, but like I’ve said before, I like to do it by hand to make sure I really know what’s going on. With logistic regression, I get an accuracy of 92.7% on the scaled data, which is still pretty good, but lower than with the unscaled features.
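For concreteness, one common tf-idf variant can be hand-rolled on a toy corpus like this (the messages and vocabulary are invented, and the exact scaling used in the post’s script may differ):

```python
import math

# Toy corpus standing in for the text messages (word lists only).
texts = [["win", "cash", "now"],
         ["are", "we", "meeting", "now"],
         ["cash", "prize", "now"]]
vocab = ["win", "cash", "now", "meeting"]

# Document frequency: how many texts contain each word.
df = {w: sum(1 for t in texts if w in t) for w in vocab}

# Presence of each word, scaled by log(total texts / texts with word),
# so a word that appears everywhere (like "now") gets weight zero.
n = len(texts)
vectors = [[(w in t) * math.log(n / df[w]) for w in vocab]
           for t in texts]
print(vectors[0])
```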

One interesting thing about using a linear model like logistic regression for this type of data is that the hyperplane that (roughly) separates the blue and the green points is defined by a vector, i.e. a list of numbers with one number for each word in our list. When we evaluate a new message (by taking a dot product), the model goes through this list of numbers and adds up all the numbers that correspond to words appearing in the message. If the sum is positive, it predicts that the message is spam. If the sum is negative, it predicts not spam. So the vector discovered by the logistic regression model basically carries out the scheme that I described at the beginning of this post: It assigns a weight to each word according to the extent to which that word indicates that a message will be spam. Then it combines all those weights (which allows some words to cancel out others) to decide if the overall message is spam. And moreover, as we saw from the training and evaluation sets, this scheme actually works pretty well.
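That weight-per-word description can be made concrete by reading the fitted coefficient vector back against the word list. A toy sketch (the vocabulary and messages are invented, not from the SMS data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up vocabulary and word-presence vectors: one row per message,
# one column per word; labels mark spam (1) vs ham (0).
words = ["free", "prize", "meeting", "lunch"]
data = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [1, 0, 0, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 0],
                 [0, 1, 1, 1]], dtype=float)
labels = np.array([1, 1, 1, 0, 0, 0], dtype=float)

model = LogisticRegression().fit(data, labels)

# One weight per word: positive pushes toward "spam", negative toward
# "ham"; the dot product with a new message's vector sums the weights
# of the words it contains.
weights = dict(zip(words, model.coef_[0]))
print(weights)
```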

Posted in Feature extraction | 5 Comments

Case study 2: Tokens in census data

In last week’s post, I presented a brief tutorial on loading and analyzing the classic Iris data set, which consists of four length measurements from each of 150 iris flowers. That data set was relatively easy to deal with because each set of four measurements can be thought of as a vector in the four-dimensional data space. In this and the next few posts, I will look at data sets with less obvious structure, and consider different ways of translating the data into vectors that can be analyzed with geometric methods. This week I want to look at a collection of census data from the UCI Machine Learning Repository. This data set consists of (anonymized) demographic data reported by 48,842 Americans on the 1994 US census. Each line consists of a list of numbers indicating things like age and tokens representing things like employment type. The tokens in each column are taken from a relatively short list. For example, under employment type, the options include private, self-employed-inc, Federal-gov and a few others. So in this post, I want to consider the question: How can we translate the token data into vectors?

Continue reading

Posted in Feature extraction | 10 Comments

Case study 1: Iris

Since the start of this blog, we’ve covered a lot of different algorithms that attempt to discover and summarize the geometric structure in a given data set. But as it turns out, this part of the data analysis is the (relatively) easy part. In the real world, most data starts out as what’s often referred to as “unstructured data”: Rather than being nicely arranged as rows and columns of numbers that can be interpreted as vectors in a high dimensional space, it comes in some other form. There will often be patterns in these other forms of data, but not the types of patterns to which we can immediately apply the sort of analysis that I’ve been describing for the past few months. So the first job of the data analyst (or the data scientist if you prefer) is generally to extract the structure from a set of “unstructured” data, i.e. to transform the patterns into a type of structure that we’re used to. This process is often called feature extraction or feature engineering. In the next few posts, I’m going to look at some specific data sets, with different initial levels of structure, and consider different possible ways to extract the kind of structure that we want from them.

Continue reading

Posted in Feature extraction | 3 Comments

TILO/PRC Clustering

In the last few weeks, we’ve looked at a number of different algorithms for dividing a data set into clusters, and a few ways of measuring how good the clusters found by such an algorithm are. This week, I’m going to introduce a new approach inspired by closely related problems in abstract geometry and topology. (So this falls under Topological Data Analysis.) As a disclaimer I should point out that while most of the material that I’ve covered so far on this blog is fairly mainstream and well established, today’s post will cover techniques that have come out of my own (recent) research, so they’re not mainstream (yet.) This week I’m going to introduce a measure of cluster separation that gets closer to the idea of measuring bottlenecks, and an algorithm (called TILO/PRC) for finding clusters that get a good score under this metric.

Continue reading

Posted in Uncategorized | Leave a comment

Modularity – Measuring cluster separation

We’ve now seen a number of different clustering algorithms, each of which will divide a data set into a number of subsets. This week, I want to ask the question: How do we know if the answer that a clustering algorithm gives us is any good? As always, since most data sets will be in higher dimensional data spaces, we usually won’t be able to just look at the data set to see how separated the clusters are. For supervised learning, we can assess the quality of a model by splitting the data into a training set and an evaluation set. This will tell us, for example, if we are guilty of overfitting. For clustering, however, we often won’t even know if there should be observable clusters, let alone how many there should be. Recall that the goal of clustering is to find sets of data points that are much more densely packed than the data set as a whole. So we need a way to decide whether a given subset really is both internally dense and separated from the rest of the data set.

Continue reading

Posted in Clustering, Unsupervised learning | Leave a comment