Genetic algorithms and symbolic regression

A few months ago, I wrote a post about optimization using gradient descent, which involves searching for a model that best meets certain criteria by repeatedly making adjustments that improve things a little bit at a time. In many situations, this works quite well and will always or almost always find the best solution. But in other cases, this approach can fall into a locally optimal solution: one that isn't the overall best, but is better than any nearby solution. A common way to deal with this sort of situation is to add some randomness into the algorithm, making it possible to jump out of one of these locally optimal solutions into a slightly worse solution that is adjacent to a much better one. In this post, I want to explore one such approach, called a genetic algorithm (or an evolutionary algorithm), which I'll illustrate with a specific type of genetic algorithm called symbolic regression. I first heard about this in an article about a small company in Somerville, MA called Nutonian that has built a whole data science platform around it.
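
To make the local-minimum escape concrete, here's a minimal sketch of a genetic algorithm (my own toy example, much simpler than symbolic regression and nothing to do with Nutonian's system). It minimizes a double-well function: gradient descent started in the right-hand basin would stall at the local minimum near x = 1, but random mutations let candidates jump into the deeper basin near x = -1.

```python
import random

random.seed(0)

def cost(x):
    # Double-well function: a local minimum near x = 1, and a deeper
    # global minimum near x = -1 (the 0.3*x term breaks the symmetry).
    return (x * x - 1) ** 2 + 0.3 * x

# Start the entire population in the worse basin.
population = [random.uniform(0.5, 1.5) for _ in range(20)]

for generation in range(200):
    # Mutation: random jumps that can escape the local basin.
    children = [x + random.gauss(0, 0.5) for x in population]
    # Selection: keep the best individuals from parents and children.
    population = sorted(population + children, key=cost)[:20]

best = min(population, key=cost)
```

The key contrast with gradient descent is that a mutation is allowed to make things temporarily worse; selection only filters afterwards.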


Posted in Modeling, Regression | 1 Comment

P-values

I’m going to start this post with a confession: Up until a few days ago, the only thing I knew about p-values was that Randall Munroe didn’t seem to like them. My background is in geometry, not statistics, even though I occasionally try to fake it. But it turns out that a lot of other people don’t like p-values either, such as the journal Basic and Applied Social Psychology, which recently banned them. So I decided to do some reading (primarily Wikipedia) and it turns out, like most things in the world of data, there’s some very interesting geometry involved, at least if you know where to look.
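
Before anything else, it helps to have a concrete p-value in hand. Here's a toy example of my own (not from any of the articles above): the one-sided p-value for a run of coin flips, i.e. the probability that a fair coin would produce a result at least as extreme as the one observed.

```python
from math import comb

def p_value(heads, flips):
    """One-sided p-value: the probability of seeing at least `heads`
    heads in `flips` tosses of a fair coin."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

# The more extreme the observation, the smaller the p-value:
# 60 heads out of 100 is already fairly unlikely under a fair coin.
result = p_value(60, 100)
```

Note that this number says how surprising the data would be *if* the coin were fair; it is not the probability that the coin is fair, which is one of the common misreadings.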


Posted in Fundamentals | 12 Comments

Convolutional neural networks

Neural networks have been around for a number of decades now and have seen their ups and downs. Recently they’ve proved to be extremely powerful for image recognition problems. Or, rather, a particular type of neural network called a convolutional neural network has proved very effective. In this post, I want to build off of the series of posts I wrote about neural networks a few months ago, plus some ideas from my post on digital images, to explain the difference between a convolutional neural network and a classical (is that the right term?) neural network.
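The operation that gives these networks their name is easy to sketch: the same small grid of weights (the kernel) slides across the whole image, rather than every pixel getting its own weight as in a classical fully connected layer. Here's a minimal pure-Python version, with a hypothetical edge-detecting kernel as the example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of lists), producing one
    output value per position: the sum of elementwise products. In a
    convolutional layer, these kernel weights are shared across all
    positions, which is what makes the layer so much smaller."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector: it responds where pixel values change
# from left to right, and is zero on flat regions.
image = [[0, 0, 1, 1]] * 4
edge_kernel = [[-1, 1], [-1, 1]]
response = convolve2d(image, edge_kernel)
```

On this image the response is nonzero only in the middle column, exactly where the dark region meets the bright one.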


Posted in Classification, Feature extraction | 3 Comments

Precision, Recall, AUCs and ROCs

I occasionally like to look at the ongoing Kaggle competitions to see what kind of data problems people are interested in (and the discussion boards are a good place to find out what techniques are popular.) Each competition includes a way of scoring the submissions, based on the type of problem. An interesting one that I’ve seen for a number of classification problems is the area under the Receiver Operating Characteristic (ROC) curve, sometimes shortened to the ROC score or AUC (Area Under the Curve). In this post, I want to discuss some interesting properties of this scoring system, and its relation to another similar measure – precision/recall.
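One property worth knowing up front: the area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. That makes it computable directly from pairwise comparisons, as in this sketch (the scores are made up, not from any Kaggle competition):

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive example outscores a randomly chosen
    negative one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three positives, three negatives: one positive (0.4) is out-scored
# by one negative (0.7), so 8 of the 9 pairs are ordered correctly.
labels = [1, 1, 1, 0, 0, 0]
score = auc([0.9, 0.8, 0.4, 0.7, 0.3, 0.2], labels)
```

A perfect ranking gives 1.0, and random scores give about 0.5, regardless of how imbalanced the classes are; precision/recall behaves quite differently on imbalanced data.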


Posted in Classification | 3 Comments

Optimization

Optimization is a topic that has come up in a number of posts on this blog, but that I’ve never really addressed directly. So, I thought it was about time that I gave it its own post. The term “optimization” has a number of different, often very specific meanings, and I’m probably not going to do any of them justice. Instead, I’ll start with a fairly vague sense of the term: optimization meaning to pick one option out of some collection of possible options in a way that maximizes (or minimizes) some measure. Or in other words, finding the “best” option, with respect to some carefully defined meaning of “best”. But as you’ll see below, we’ll actually be interested in a much more specific definition of optimization.
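
In that vague sense, optimization over a finite collection of options is just an argmax. A toy sketch, with a made-up revenue function standing in for the carefully defined "best":

```python
def revenue(price):
    # Hypothetical measure to maximize: revenue under a toy demand
    # model where demand falls linearly as the price rises.
    demand = max(0.0, 100.0 - 2.0 * price)
    return price * demand

# The collection of options: candidate prices 0.0, 0.5, ..., 50.0.
candidates = [p / 2 for p in range(0, 101)]

# Optimization, in the vague sense: pick the option with the best measure.
best_price = max(candidates, key=revenue)
```

The interesting cases, of course, are the ones where the options can't all be enumerated and checked like this, which is where the more specific definitions come in.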


Posted in Classification, Regression | 3 Comments

Statistics vs. Heuristics

Now that I’m finally getting the hang of my new job, I thought I’d try to get back to blogging regularly. It’s been very interesting seeing things from the perspective of software engineering, which is all about building systems around the algorithms that come out of computer science, statistics, mathematics, etc. In particular, the goal is usually to use the simplest suitable algorithm for a job, since you know it’s likely going to be just one piece of a large, complex system and simpler algorithms are easier to write, to debug and to maintain. In particular, it’s often tempting to use heuristics when a theoretically sound algorithm is unknown or would be overly complex. (Standard disclaimer: These comments represent my own opinions and have not been approved or endorsed by my employer.)


Posted in Software engineering | 6 Comments

Map/Reduce

In my last post, I described the PageRank algorithm that was the basis for the original Google search. This week, I want to use PageRank to motivate the MapReduce framework for distributed data analysis. These days, MapReduce is beginning to fall out of fashion and is being replaced by new technologies, many of which allow data stream analysis or are graph-based. But most of these new distributed frameworks build, in one way or another, on the basic ideas of MapReduce. And as I’ll explain in future posts, you can see traces of the basic MapReduce concepts in many of the new frameworks.
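
The whole MapReduce pattern fits in a few lines; the classic example is counting words. This single-machine sketch only imitates the framework, of course. The real point is that each map call and each reduce call is independent, so in a cluster they can run on different machines, with the shuffle step doing the routing over the network.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (key, value) pair for each word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key. On a cluster, this is
    # the step that routes every pair for a given key to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all the values for a single key.
    return key, sum(values)

documents = ["the cat sat", "the dog sat"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```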


Posted in Distributed learning | 1 Comment

PageRank

At the end of June, I’m going to begin a new job as a software engineer in Google’s Cambridge office. For the past few months, preparing for this change has kept me too busy to write anything, but now that I have some time, I want to mark the occasion with a post describing the algorithm that got Google started in the first place: PageRank. This will also be a good way to lead into a subsequent post on the Map/Reduce framework for large-scale data analysis algorithms, continuing with the topic of distributed machine learning that I started right before my hiatus. (And, while I’m not a Google employee yet, it probably doesn’t hurt to include the standard disclaimer that the opinions expressed below are my own and have not been endorsed or approved by Google.)
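The algorithm itself is a short computation: repeatedly let each page share its current rank equally among the pages it links to, with a damping factor modeling a surfer who occasionally jumps to a random page. Here's a sketch on a tiny made-up web (three hypothetical pages; real implementations also have to handle pages with no outgoing links, which this one ignores):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: each page repeatedly distributes its rank
    equally among the pages it links to. The (1 - damping) term is
    the chance of jumping to a random page instead."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new = {page: (1.0 - damping) / n for page in links}
        for page, outlinks in links.items():
            share = damping * ranks[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        ranks = new
    return ranks

# A tiny hypothetical web: B and C both link to A, so A ranks highest.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(links)
```

Notice that B outranks C even though both have one outgoing link and no special content: B is linked to by the highly ranked A, which is exactly the recursive idea behind PageRank.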


Posted in Distributed learning | 13 Comments

Distributed Learning

So far on this blog, my focus has been on conventional algorithms for data mining and machine learning, i.e. algorithms that can run on a standard desktop or laptop computer with a single processor. However, one of the trends in modern data analysis is parallel computing – the use of networks of machines (often called clusters, but unrelated to the clusters that we’ve seen previously on this blog) that work together, dividing up the computation and allowing the operator to work with much larger data sets. This is one of the main ideas behind the buzzword “Big data.” As it turns out, many conventional machine learning/data mining algorithms can be efficiently translated into parallel algorithms, and in this post, I want to give an introduction to this process. There won’t be a lot of geometry directly involved in this, but the better we understand the conventional algorithms (say, via geometry), the easier it will be to effectively translate them into parallel/distributed algorithms.
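As a first taste of what "translating into a parallel algorithm" means: many computations reduce to combining small per-chunk summaries, so each machine only ever needs its own slice of the data. Here's a single-machine sketch of that pattern, computing a mean (the chunking is simulated, but the partial_stats calls are independent and could each run on a different machine):

```python
def partial_stats(chunk):
    # Each worker summarizes its chunk with a tiny sufficient
    # statistic: (sum, count). This is all it sends back.
    return sum(chunk), len(chunk)

def combine(stats):
    # The coordinator merges the summaries. No single machine ever
    # needs to hold the full data set.
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count

data = list(range(1, 101))  # pretend this is too big for one machine
chunks = [data[i:i + 25] for i in range(0, 100, 25)]
mean = combine([partial_stats(chunk) for chunk in chunks])
```

Algorithms whose inner loops decompose like this (sums, counts, gradients) parallelize almost for free; the hard cases are the ones that don't.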


Posted in Distributed learning | 2 Comments

K-modes

I recently read an interesting Wired story about Chris McKinlay (a fellow alum of Middlebury College), who used a clustering algorithm to understand the pool of users on the dating site OkCupid (and successfully used this information to improve his own profile.) The data involved was answers to multiple-choice questions, which is very similar to the categorical data that I discussed a few posts back. But, instead of translating the data into vectors, like I discussed in that post, McKinlay used an algorithm called K-modes that works directly on this type of data. So, I thought this would be a good excuse to write a post about K-modes and how it compares to translating the data into vectors and then running K-means.
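The two ingredients that let K-modes work directly on categorical data are easy to state: the distance between two users is the number of questions they answered differently, and a cluster's center is the most common answer to each question (the mode) rather than an average. A sketch with made-up answer data (not McKinlay's, and omitting the surrounding cluster-assignment loop):

```python
from collections import Counter

def mismatches(a, b):
    # K-modes distance: count the positions where two answer
    # vectors differ. No numeric encoding needed.
    return sum(x != y for x, y in zip(a, b))

def mode_center(cluster):
    # The center takes the most common answer at each position,
    # so it is itself a valid answer vector.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# Hypothetical multiple-choice answers, one letter per question.
users = [("a", "b", "a"), ("a", "b", "c"), ("a", "d", "c")]
center = mode_center(users)
```

Compare this with K-means on one-hot vectors, where a center's coordinates are fractions that no longer correspond to any actual answer; that difference is the heart of the comparison in this post.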


Posted in Clustering, Feature extraction | 39 Comments