An Embarrassing Admission: Copy-Pasting Tables with Text Containing Spaces from Excel to R

I can't believe I didn't learn this earlier, but I never knew how to accurately copy a table from Excel containing text with spaces and paste it into a data frame in R without the spaces being mistaken for delimiters between variables.

Say you have a column title in an Excel table like "Group Size". You then copy the table and try to load it into R using: my.table = read.table("clipboard", header=TRUE). R then gives you an error like this:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 2 elements

That's where I stopped most times and made sure that my column titles had no spaces in them (replacing spaces with dots), or just did something else entirely. Finally, I learned that if you add one teeny tiny argument to the read.table command... read.table("clipboard", header=TRUE, sep="\t") ... then all text values with spaces in them actually get imported into R without a fuss! Spaces in column names get converted into dots automatically, but actual text values in the columns with spaces are untouched and accessible as they should be in R. Yay!
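Here is a minimal, reproducible sketch of the fix. textConnection() stands in for "clipboard" so the example runs anywhere; on Windows you would use read.table("clipboard", header = TRUE, sep = "\t") right after copying from Excel.

```r
# Simulated clipboard contents: tab-separated, with spaces in both
# column names and values (as copied from Excel).
pasted <- "Group Size\tTeam Name\n5\tRed Team\n8\tBlue Sky"

# sep = "\t" is the key argument: tabs, not spaces, delimit the columns.
my.table <- read.table(textConnection(pasted), header = TRUE, sep = "\t")

names(my.table)       # "Group.Size" "Team.Name" -- spaces in names become dots
my.table$Team.Name    # values keep their spaces: "Red Team" "Blue Sky"
```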

PostgreSQL, Excel, R, and a Really Big Data Set!

At work I've started working with the biggest data set I've ever seen! First, let me qualify my use of the term "Big Data". The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and, to a lesser extent, in Excel) is just 395,928. This certainly pales in comparison to truly BIG data sets, but for me it's huge! For now, the data I'm working with has 307 variables (though with a big chunk of null values). Just calculate: if there were no null values in this data set, the number of data points would be 395,928 * 307 = 121,549,896!! Even if 60% of the table consisted of null values, that would still leave 48,619,958 data points. Wow!

Working with so much data (it came from a database dump), I’ve noticed a few things:

1) PostgreSQL is awesome at handling it. I really haven't had to wait too long for each of my queries (create table queries, mainly) to execute. The longest I had to wait was in fact for the final query I ran today, which joined all of the transformed tables I created to a biographical information table. I don't think it took much longer than a minute, at most.

In fact, what I found really helpful in PostgreSQL was recoding some of the qualitative variables using CASE WHEN … THEN clauses. Those are super handy and execute with seemingly little difficulty.
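For anyone unfamiliar with the pattern, the recoding looks something like this (the table and column names here are made up for illustration, not from my actual data):

```sql
-- Recode a qualitative variable into a numeric code while
-- building a transformed table.
CREATE TABLE donors_recoded AS
SELECT id,
       CASE WHEN giving_level = 'None'         THEN 0
            WHEN giving_level IN ('Low', 'Mid') THEN 1
            ELSE 2
       END AS giving_code
FROM donors;
```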

2) Excel 2010 had LOTS of difficulty letting me manipulate my data set once I loaded it up. Granted, the file was a 176-megabyte CSV file… there's no shame in whining when you have the weight of the world on your shoulders!

One problem I had in Excel was that, at one point, it stopped immediately redrawing the Excel window every time I switched to a different ribbon. For example, I would click on the "File" ribbon after being on the "Home" ribbon, and I would still see my data set rather than the "File" ribbon options. Sizing the Excel window down and dragging it out of view to the bottom of the screen and back seemed to help at times, and waving my mouse cursor where the relevant buttons should be helped at other times.

Another problem arose when I went to recode some variables in Excel into dummy variables: it took SO LONG to finish the job! Granted, I used the standard "Find / Replace" window, which might not be the best option in Excel. I've learned my lesson though: I have to either recode in PostgreSQL, or do what I describe below in R.

3) Although I haven't done too much with the data set in R yet, it has handled it with pizzazz! R loaded the CSV file in just under 2 minutes (I was honestly expecting longer). Once I had it in, I realized that for the variables where the absence of something was recorded as NA rather than as a zero, I would need to include the zeros somehow. I didn't actually want to modify the data set itself, as that would add to how much RAM R was using (it was already using 1.5 gigs of RAM!). So, I made a simple function that puts zeros into a vector just for the purpose of graphing or statistical testing.

In other words, if I'm doing a boxplot and I need my X variable to contain zeros where there are only NA values, I run the boxplot command, passing the X variable to the function so that the zeros get added temporarily and not permanently (e.g., boxplot(Y ~ zerofy(X))). This should hopefully prevent me from using too much RAM.
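The original function isn't shown in the post, so this is just my guess at what a zerofy() like that could look like: it replaces NA with 0 in a copy of the vector, leaving the original column untouched.

```r
# Replace NAs with zeros in a temporary copy of a vector, so the
# underlying data frame never has to store the zeros itself.
zerofy <- function(x) {
  x[is.na(x)] <- 0
  x
}

X <- c(NA, 3, NA, 7)
zerofy(X)   # 0 3 0 7 -- a zero-filled copy
X           # the original vector still has its NAs
# Usage in a plot call: boxplot(Y ~ zerofy(X))
```

Because R copies the vector on modification, the zero-filled version only exists for the duration of the plotting or testing call.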

I haven’t yet run any typical statistical tests on the data set, so I can’t attest to how quickly they work on it, but that will come very very soon.  So far, I’m quite impressed with the open source tools I use and their performance on this data!

Evaluating an Excel Fuzzy Matching Macro Using Logistic Regression in R

For work today I needed to compare a list of names (and spousal names, when available) against an in-house database dump of names.  In an ideal world, the first and last names of the two lists would all match and that would be the end of the story.  Obviously it’s not an ideal world, and names don’t always match.

That being said, the task was this: after matching up last names exactly between both lists, try to find matches on first name that may or may not be exact.  The first name could be the primary first name, or it could be the first name of the spouse.  

The first thing I looked for was a fuzzy matching macro that I could easily use in Excel and that would give me a few scores: (1) a first name match score, (2) a spouse's first name match score, and (3) a spouse's last name match score (this last one I required to be an exact match rather than a fuzzy one).

I found a fuzzy matching script online that allowed me to make these comparisons across columns in the same row, inserted it as a VBA module function, and compared the names across lists (within the same row, remember) for all records.

Once I started sorting through the potential matches based on how close to 1 the scores were, I started noticing some extra variables that seemed to predict a decent match.  For one, when there was a parenthetical insertion in the first name in the in-house dump of names, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “William (Bill)”).  Don’t ask me why that happened!  Also, another thing I noticed was that if there was an ampersand in the first name in the in-house list, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “Richard & Donna”).  Again, don’t ask me why that happened.  I haven’t a clue!

I coded the file for what seemed to be a match and what did not, and then was curious to enter the scores and the matching column into R to do a logistic regression and find out how well I could model my own choices of matches between lists based on the fuzzy matching scores.


To explain the model terms: First.Name.Match was a continuous fuzzy matching score; Parenthetical and Ampersand were simple binary variables describing whether or not a parenthetical insertion or an ampersand appeared within the first name in the in-house list; and Spouse.First.Name.Match:Spouse.Last.Name.Match was an interaction term stating that, so long as the spouse's last name was an exact match, the Spouse.First.Name.Match scores were run against the event probability.
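The model call itself would have looked something like the sketch below. The data here are simulated (my actual coded file isn't shown), so only the formula structure reflects the model described; the fitted numbers are illustrative.

```r
# Simulated stand-in for the hand-coded matching file.
set.seed(1)
n <- 200
d <- data.frame(
  First.Name.Match        = runif(n),          # continuous fuzzy score
  Parenthetical           = rbinom(n, 1, 0.2), # e.g. "William (Bill)"
  Ampersand               = rbinom(n, 1, 0.2), # e.g. "Richard & Donna"
  Spouse.First.Name.Match = runif(n),
  Spouse.Last.Name.Match  = rbinom(n, 1, 0.5)  # 1 = exact last-name match
)
# Simulated 0/1 judgment of "probable match".
d$Match <- rbinom(n, 1, plogis(-3 + 4 * d$First.Name.Match + d$Parenthetical))

# Logistic regression with the interaction term described above.
fit <- glm(Match ~ First.Name.Match + Parenthetical + Ampersand +
             Spouse.First.Name.Match:Spouse.Last.Name.Match,
           data = d, family = binomial)

# Rows with predicted event probability >= .5 are the candidate matches.
candidates <- d[predict(fit, type = "response") >= 0.5, ]
```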

All variables significantly predicted the probability of me judging a row to be a probable match. In fact, when I selected the rows with a predicted event probability of .5 or higher, about 78% of my actual matches showed up.

In conclusion, if strapped for time, this fuzzy matching script is actually quite nice.

Excel vs. R for Plotting Data

Today I made a couple of bar graphs, using the ggplot2 package in R, just to depict the relative importance of different terms in my predictive models.  After seeing the result, I asked myself:

Is it really worth the annoyance of setting up the data frame, typing in the code (with the inherent reordering of factor levels based on the stats I was displaying), and then exporting the graph, when I can get a pretty similar-looking graph in Excel with less time and effort?

Really… why do I need to spend more time when the main differences between the graphs I can make in either program amount to the background colour (gray vs. white), axis text saliency (slightly grayed out vs. more visible), the numbering of the y axis, bar colour, and gridlines? When I compare simple plots, that's all it amounts to for me.
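For context, the ggplot2 version of the kind of bar graph I mean is something like this (the terms and importance values here are made up for illustration), with reorder() doing the factor-level shuffling mentioned above:

```r
library(ggplot2)

# Hypothetical relative importances of model terms.
imp <- data.frame(
  term  = c("First.Name.Match", "Parenthetical", "Ampersand"),
  value = c(0.61, 0.22, 0.17)
)

# reorder() sorts the bars by the statistic being displayed,
# which is the "inherent reordering of factor levels" step.
p <- ggplot(imp, aes(x = reorder(term, value), y = value)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Relative importance")
```

Not a lot of code, but the data frame setup and reordering are exactly the overhead that Excel's chart wizard skips for a simple plot like this.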

I think if simple plots are all that I want, I’m going to stick with Excel.  For anything fancier that I probably wouldn’t do in Excel, I’ll use R and ggplot2, or just the plain vanilla R plotting options.  I need to save my time!