PostgreSQL, Excel, R, and a Really Big Data Set!

At work I’ve started working with the biggest data set I’ve ever seen!  First, let me qualify my use of the term “Big Data”.  The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and, to a lesser extent, in Excel) is only 395,928.  That certainly pales in comparison to truly BIG data sets, but for me it’s huge!  For now, the data I’m working with has 307 variables (albeit with a big chunk of null values).  Do the math: if there were no null values in this data set, the number of data points would be 395,928 * 307 = 121,549,896!!  Even if 60% of the table consisted of null values, that would still leave 48,619,958 data points.  Wow!

Working with so much data (it came from a database dump), I’ve noticed a few things:

1) PostgreSQL is awesome at handling it.  I really haven’t had to wait too long for any of my queries (mainly CREATE TABLE queries) to execute.  The longest wait was in fact for the final query I ran today, which joined all of the transformed tables I had created to a biographical information table.  I don’t think it took much longer than a minute, at most.

In fact, what I found really helpful in PostgreSQL was recoding some of the qualitative variables using CASE WHEN … THEN clauses.  Those are super handy and execute with seemingly little difficulty.

2) Excel 2010 had LOTS of difficulty letting me manipulate my data set once I loaded it up.  Granted, the file was a 176-megabyte CSV file… there’s no shame in whining when you have the weight of the world on your shoulders!

One problem I had in Excel was that, at one point, it stopped immediately redrawing the Excel window every time I changed to a different ribbon.  For example, after clicking over to the “File” ribbon from the “Home” ribbon, I would still see my data set rather than the “File” ribbon options.  Sizing the Excel window down and dragging it out of view to the bottom of the screen and back seemed to help at times; waving my mouse cursor over where all the relevant buttons should be helped at others.

Another problem: when I went to recode some variables in Excel into dummy variables, it took SO LONG to finish the job!  Granted, I used the standard “Find / Replace” window, which might not be the best option in Excel.  I’ve learned my lesson, though: I have to either recode in PostgreSQL, or do what I describe below, in R.

3) Although I haven’t done too much with the data set in R yet, it has handled it with pizzazz!  R loaded the csv file in just under 2 minutes (I was honestly expecting longer).  Once I had it loaded, I realized that for the variables in my data set that had no zeros indicating the absence of something, I would need to include those zeros somehow.  I didn’t actually want to modify the data set itself, as that would add to how much RAM R was using (it was already using 1.5 gigs of RAM!).  So, I made a simple function that puts zeros into a vector just for the purpose of graphing or statistical testing.

In other words, if I’m doing a boxplot and I need my X variable to contain zeros where there are only NA values, I run the boxplot command, passing the X variable to the function so that the zeros get added temporarily and not permanently (e.g. boxplot(Y ~ zerofy(X))).  This should hopefully prevent me from using too much RAM.
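For illustration, a zerofy() along these lines can be as simple as the following (the implementation here is my sketch; only the function’s name and purpose come from the description above):

```r
# zerofy: return a copy of the vector with NAs replaced by zeros,
# leaving the original data set untouched
zerofy <- function(x) {
  x[is.na(x)] <- 0
  x
}

zerofy(c(5, NA, 2, NA))  # c(5, 0, 2, 0)
```

Because the substitution happens on a temporary copy inside the function call, no extra column ever gets stored in the data frame.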

I haven’t yet run any typical statistical tests on the data set, so I can’t attest to how quickly they work on it, but that will come very very soon.  So far, I’m quite impressed with the open source tools I use and their performance on this data!

Fun Editing R Graphs in Inkscape

Last week, I read a chapter out of Visualize This by Nathan Yau.  I was, of course, delighted to see that he was championing the use of R.  One really cool thing that I learned from his book, and was very surprised about, was that you can export an R graph in PDF form and then easily edit individual elements of the graph in Adobe Illustrator.

Ever since I started making graphs in R, I had always assumed that they were essentially “flat” images, where if you wanted to edit them you’d have to use an eraser tool and then laboriously redraw components.  However, when you open up a graph in PDF format (and, I now presume, SVG format as well), you can select individual elements of the graph and fine-tune their look very nicely.

The disadvantage for me, when thinking about Adobe Illustrator, is the price!  Thankfully you can usually find an open source alternative, and he mentions this, of course: you can also use Inkscape.  Perhaps deferring to Adobe Illustrator enthusiasts, however, he doesn’t go further than that, and all his examples are in Illustrator.

So, I decided to do my own example and post it here!  I took some time series data that I wrote about in a previous post and made a very simple plot.  The only graphing code I used was effectively this: plot(timeseriesvariable, type="b", main="Some Title").  Nice and simple, right?  I then exported to PDF and opened it in Inkscape.  Here’s what it looked like when I opened it up in Inkscape:
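In code, the whole round trip before Inkscape amounts to wrapping that plot call in a pdf() graphics device (the data below is a stand-in, not the actual series from that previous post):

```r
# Open a PDF device, draw the plot, close the device; the resulting
# file can then be opened directly in Inkscape for editing.
timeseriesvariable <- ts(cumsum(rnorm(20)))  # stand-in data
pdf("graph_for_inkscape.pdf")
plot(timeseriesvariable, type = "b", main = "Some Title")
dev.off()
```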

[Image: basic graph as exported from R]

Pretty basic looking at this point.  One oddity: the q’s in the plot are actually supposed to be circles!  Anyway, next I selected the x-axis label and changed the text, and the font, to make it look a bit nicer:

[Image: graph with x-axis label changed]

For the next few steps, I increased the width of the line segments, changed the title size and font, and added brown circles where the q’s were.  I learned a neat trick for placing the circles: after you place the first one, right-click on it and click Copy, then move your mouse cursor to where you want the next one to be and press Ctrl-V; the circle will show up exactly at your cursor position.

[Image: graph with lines, points, and title changed]

For the next steps, I selected the box that encompasses the chart area and deleted it.  Then I selected each numeric value along the Y axis and rotated them (using the little curved purple arrows near the top left of the screen).  This is similar to what Nathan shows you in his book.  Now it’s looking pretty nice!

[Image: graph with box removed and y-axis values rotated]

Finally, I inserted a new layer, made a light gray box on it, and pushed that layer behind the original layer so that it forms a nice newspaper-y background (again, à la Nathan).  I also changed the text and font of the y-axis label.  Here’s a “finished” product:

[Image: graph with y-axis label changed and a light gray background added]

And there you have it!  You can save the new image into whatever format you want, and distribute!  That was a lot of fun.  I didn’t have to use a heck of a lot of R code and my finished product looks pretty fancy.  Most importantly, it was all done with free software that can be installed on Linux, Windows or Mac.

Sampling and the Analysis of Big Data

After my last post, I came across a few articles supporting the opinion that if you have a good reason to take random samples from a “big” dataset, you’re not committing some kind of sin:

Big Data Blasphemy: Why Sample?

To Sample or Not to Sample… Does it Even Matter?

The moral of the story is that you can sample from “big data” so long as the analysis you’re doing doesn’t require some part of the data that will be excluded as part of the sampling process (an example being the top or bottom so many records based on some criterion).

“Big Data Analytics”? Hrm….

I do data mining and modeling really often these days.  However, the datasets that I work with really wouldn’t be considered “Big Data” (around 25,000 to about 200,000 rows, with quite a lot of variables).  I don’t know if I’ll ever be in a position to work with “Big Data”, but all the hype around it gets me thinking from time to time.  

Question: If I’ve got millions upon millions of records to work with, do I really need to submit all of them to my data analysis software (R) for data mining and modeling?

Answer: Not in the least bit.  If all I’m doing is looking for trends and building models that predict some desired behaviour, all I would have to do is get a handful of random samples that are small enough to fit into my data analysis software.  Then I could do my data mining on any one of the samples, build a model or models, and then test it/them on the other samples.  Apparently, random sampling is possible in Hadoop.  What this means is that if I get these random samples from the DBMS, I can just use the same kinds of techniques I’ve been using all along.
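The workflow I have in mind can be sketched in a few lines of R (the data here is made up, standing in for the millions of records):

```r
set.seed(1)
# made-up "big" data: a binary outcome driven by one predictor
big.df <- data.frame(x = rnorm(100000))
big.df$y <- rbinom(100000, 1, plogis(big.df$x))

# draw two non-overlapping random samples: one to mine/model, one to test
idx1 <- sample(nrow(big.df), 5000)
idx2 <- sample(setdiff(seq_len(nrow(big.df)), idx1), 5000)
train <- big.df[idx1, ]
test  <- big.df[idx2, ]

# build the model on one sample...
model <- glm(y ~ x, data = train, family = binomial)

# ...and check that it holds up on the other
pred <- predict(model, newdata = test, type = "response")
mean((pred > 0.5) == test$y)  # classification accuracy on the held-out sample
```

In practice the two samples would come out of the DBMS rather than out of an in-memory data frame, but the modeling side is the same.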

Am I missing something, or is “Big Data Analytics” more of a marketing term than an actual reality?

Evaluating an Excel Fuzzy Matching Macro Using Logistic Regression in R

For work today I needed to compare a list of names (and spousal names, when available) against an in-house database dump of names.  In an ideal world, the first and last names of the two lists would all match and that would be the end of the story.  Obviously it’s not an ideal world, and names don’t always match.

That being said, the task was this: after matching up last names exactly between both lists, try to find matches on first name that may or may not be exact.  The first name could be the primary first name, or it could be the first name of the spouse.  

The first thing I looked for was a fuzzy matching macro that I could easily use in Excel and that would give me a few scores: (1) a first-name match score, (2) a spouse’s-first-name match score, and (3) a spouse’s-last-name match score (this last one I wanted to be an exact match rather than a fuzzy one).

I found a fuzzy matching script here that allowed me to make these comparisons across columns in the same row, inserted it as a VBA module function, and compared the names across lists (within the same row, remember) for all records.

Once I started sorting through the potential matches based on how close to 1 the scores were, I started noticing some extra variables that seemed to predict a decent match.  For one, when there was a parenthetical insertion in the first name in the in-house dump of names, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “William (Bill)”).  Don’t ask me why that happened!  Also, another thing I noticed was that if there was an ampersand in the first name in the in-house list, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “Richard & Donna”).  Again, don’t ask me why that happened.  I haven’t a clue!

I coded the file for what seemed to be a match and what did not, and then was curious to enter the scores and the matching column into R to do a logistic regression, to find out how well I could model my own choices of matches between lists based on the fuzzy matching scores.
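The model call itself looked something like the following (the data frame name and the toy data are placeholders of mine; the variable names match the actual analysis):

```r
set.seed(2)
n <- 300
# placeholder data with the same column names as the real file
matches.df <- data.frame(
  First.Name.Match        = runif(n),
  Parenthetical           = rbinom(n, 1, 0.1),
  Ampersand               = rbinom(n, 1, 0.1),
  Spouse.First.Name.Match = runif(n),
  Spouse.Last.Name.Match  = rbinom(n, 1, 0.5)
)
matches.df$Match <- rbinom(n, 1, plogis(-3 + 4 * matches.df$First.Name.Match))

# logistic regression with the spouse interaction term described below
fuzzy.model <- glm(Match ~ First.Name.Match + Parenthetical + Ampersand +
                     Spouse.First.Name.Match:Spouse.Last.Name.Match,
                   data = matches.df, family = binomial)
summary(fuzzy.model)
```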

 

Again, the First.Name.Match variable was a continuous fuzzy matching score; Parenthetical and Ampersand were simple binary variables describing whether or not a parenthetical insertion or an ampersand appeared within the first name in the in-house list; and Spouse.First.Name.Match:Spouse.Last.Name.Match is an interaction term stating that, so long as the spouse’s last name is an exact match, the Spouse.First.Name.Match scores are run against the event probability.

As you can see, all variables significantly predicted the probability of me judging that row to be a probable match.  In fact, when I selected those rows with a predicted event probability of .5 or higher, about 78% of my actual matches showed up.

In conclusion, if strapped for time, this fuzzy matching script is actually quite nice.

Sample Rows from a Data Frame that Exclude the ID Values from Another Sample in R

In order to do some modeling, I needed to make a training sample and a test sample from a larger data frame.  Making the training sample was easy enough (see my earlier post), but I was going crazy trying to figure out how to make a second sample that excluded the rows I had already sampled in the first sample.

After trying out some options myself, looking extensively on the net, and asking for help on the r-help forum, I came up with the following function that finally does what I need it to do:
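Since the embedded gist no longer displays here, this is a reconstruction from the summary that follows (the function name is my own; the argument order matches the description):

```r
# Sample 'size' rows from main.df, excluding rows whose ID value already
# appears in sample1.df (reconstructed; the original gist is not shown)
sample.excluding <- function(main.df, sample1.df, size, id.main, id.sample1) {
  leftover <- main.df[!(main.df[[id.main]] %in% sample1.df[[id.sample1]]), ]
  leftover[sample(nrow(leftover), size), ]
}
```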

 

To summarize the function: you enter the big data frame first (here termed “main.df”), then your first sample data frame with the ID values that you want to exclude (here termed “sample1.df”), then your sample size, then the ID variable names in both data frames, enclosed in quotes.

Functions like this certainly make my working life with R easier in preventing me from having to type in syntax like that every time I want that kind of a task done.

Sampling Rows from a Data Frame in R

It’s time, yet again, for a simple and useful function I created that helped me at work.  I was looking for a way to sample whole rows of a very large data frame with many columns so that I could build a regression model on a subset of my data.  

While I was acquainted with the sample() function, I realized today that it couldn’t be used directly to sample whole rows of a data frame.  It’s really just meant for vectors.  So, I looked up how to get a sample of rows from a data frame and found an answer on the R help forums.  The meat of the function below comes from that answer:
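The gist embed has also gone missing here, but the function amounted to something like this (reconstructed; the key idea from the forum answer is sampling row indices and subsetting with them):

```r
# Return a random sample of n whole rows from data frame df
randomRows <- function(df, n) {
  df[sample(nrow(df), n), ]
}
```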

So, just put the data frame as the first argument of the function, and the number representing the size of your sample, and there you have it!  You just cut a random sample out of your data frame.

Length by a Grouping Variable with NA Values Omitted in R

Today, after my supervisor pointed out a discrepancy between some graphs of percentages from a data frame I was working with and the raw numbers in the table from which those percentages were taken, I realized that I was including some NAs in my length calculations.

The data were simple binary columns, with 0 being the absence of an attribute, 1 being the presence, and NA being an incalculable value.  The dependent variable here was whether or not people donated at a certain level, and the independent variable was a simple binary grouping variable.  First I got the sum of the dependent variable by the independent variable (i.e. how many people had donated at that level depending on the independent variable):

tapply(y, x, sum, na.rm=TRUE)

That worked simply enough.  Then I wanted to extract the total number of people who donated, regardless of whether they had reached the specified level:

tapply(y, x, length)

That gave me numbers, but they included the NAs.  I know that to get the number of non-NA values from a vector in R, all you have to type is sum(!is.na(x)) and there you go, but I needed this by a grouping vector.  So I realized what I needed to do this evening and made a laughably small function:
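The function embed isn’t showing here, but it boils down to a one-liner along these lines (my reconstruction from the description):

```r
# Count non-NA values of y within each level of the grouping variable x
length.no.na <- function(y, x) {
  tapply(!is.na(y), x, sum)
}
```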

Even though the meat of this function is very small, it’s still nice to simplify 🙂  Live and learn I guess!

Logistic Regression in R: Concordance and Discordance Testing

I run a lot of logistic regression models at work.  At my previous job, where I used Minitab, I always saw stats on Concordant Pairs, Discordant Pairs, and Ties in the model summary output for Logistic Regression modelling.  I tried to look for a function that gives you the same stats for a logistic regression model in R, but wasn’t successful.  So, I decided to write my own:


# Assuming the input is a stored binomial GLM object
Concordance = function(GLM.binomial) {
  outcome_and_fitted_col = cbind(GLM.binomial$y, GLM.binomial$fitted.values)
  # get a subset of outcomes where the event actually happened
  ones = outcome_and_fitted_col[outcome_and_fitted_col[,1] == 1,]
  # get a subset of outcomes where the event didn't actually happen
  zeros = outcome_and_fitted_col[outcome_and_fitted_col[,1] == 0,]
  # Equate the lengths of the event and non-event tables
  if (nrow(ones) > nrow(zeros)) {
    ones = ones[1:nrow(zeros),]
  } else {
    zeros = zeros[1:nrow(ones),]
  }
  # Columns will be c(ones_outcome, ones_fitted, zeros_outcome, zeros_fitted)
  ones_and_zeros = data.frame(ones, zeros)
  # initiate columns to store concordant, discordant, and tie pair evaluations
  conc = rep(NA, nrow(ones_and_zeros))
  disc = rep(NA, nrow(ones_and_zeros))
  ties = rep(NA, nrow(ones_and_zeros))
  for (i in 1:nrow(ones_and_zeros)) {
    if (ones_and_zeros[i,2] > ones_and_zeros[i,4]) {
      # concordant pair: the event has the higher fitted probability
      conc[i] = 1; disc[i] = 0; ties[i] = 0
    } else if (ones_and_zeros[i,2] == ones_and_zeros[i,4]) {
      # tied pair: equal fitted probabilities
      conc[i] = 0; disc[i] = 0; ties[i] = 1
    } else {
      # discordant pair: the non-event has the higher fitted probability
      conc[i] = 0; disc[i] = 1; ties[i] = 0
    }
  }
  # Here we save the various rates
  conc_rate = mean(conc, na.rm=TRUE)
  disc_rate = mean(disc, na.rm=TRUE)
  tie_rate = mean(ties, na.rm=TRUE)
  return(list(concordance=conc_rate, num_concordant=sum(conc),
              discordance=disc_rate, num_discordant=sum(disc),
              tie_rate=tie_rate, num_tied=sum(ties)))
}


An example of the output from this function follows:

> Concordance(GLM.3)
$concordance
[1] 0.6798001

$num_concordant
[1] 2312

$discordance
[1] 0.3201999

$num_discordant
[1] 1089

$tie_rate
[1] 0

$num_tied
[1] 0

You can see from the above stats that the model I evaluated with the Concordance function had a concordance rate of 68% and a discordance rate of 32%.  So, seemingly, evidence of a good model!

Useful averages!

As I work in the fundraising world, the averages I tend to come across aren’t very useful for describing the data, due to the natural skewness of raw donation amounts.  I’d like to talk about one particular kind of average that is quite useful: the average of the logged times when my daughter has woken up, gone for a nap, or gone to sleep.

 

[Image: sleep schedule graph]

As you can see, the wonderful thing about the averages depicted in the above R graph is that the lines showing her sleep schedule each day go as much above each average line as below it, and they don’t deviate too far.  So, when describing what time she wakes up, goes for her second nap, and goes to sleep, the average is a fairly accurate description.

Of course we can go a step further and calculate the standard deviation of the morning wakeups and second nap times to say that she wakes up at 6:56am plus or minus 53 minutes, and goes to her second nap at 2:15pm plus or minus 1 hour 27 minutes, and then we have an even better idea what her behaviour is.
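For the curious, averaging clock times in R just takes a conversion to minutes past midnight first (the times below are made up for illustration, not her actual log):

```r
# Convert "HH:MM" strings to minutes past midnight, then summarize
wake.times <- c("06:10", "07:30", "06:45", "07:20")
mins <- sapply(strsplit(wake.times, ":"),
               function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]))
avg <- mean(mins)
dev <- sd(mins)
# format the average back into a clock time
sprintf("%02d:%02d", avg %/% 60, round(avg %% 60))  # "06:56"
```

The standard deviation in minutes (dev) then gives the “plus or minus” part directly.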

You can in fact see a baseline shift in her bedtimes after the 10 mark on the x axis.  This was when we started trying to get her to bed a bit earlier.  So from day 1 to 10 she was going to sleep at about 9:38pm, whereas from day 11 to 20 she was going to sleep at about 8:39pm.  Overall, of course, this balances out to an average bedtime of 9:09pm.  Assuming we stick to her earlier sleep schedule, her sleep times will continue to vary near that 8:30pm mark.