save.ffdf and load.ffdf: Save and load your big data – quickly and neatly!

I’m very indebted to the ff and ffbase packages in R.  Without them, I would probably have to use some less savoury stats program for the bigger data analysis projects I do at work.

Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes using ffsave and ffload.  The syntax isn’t so bad, but the process it puts your computer through is cumbersome: it takes a while to save and load, and ffsave creates (by default) a bunch of randomly named ff files in a temporary directory.

For that reason, I was happy to come across a link to a pdf presentation (sorry, I’ve lost it now) summarizing some cool features of ffbase.  I learned that instead of using ffsave and ffload, you can use save.ffdf and load.ffdf, which have very simple syntax:

save.ffdf(ffdfname, dir="/PATH/TO/STORE/FF/FILES")

Use that, and it creates a directory containing ff files that bear the same names as the columns of your ff dataframe!  It also stores an .RData and an .Rprofile file there.  Then there is:

load.ffdf(dir="/PATH/TO/STORE/FF/FILES")

As simple as that, you load your files, and you’re done!  I think what I like about these functions is that they allow you to easily choose where the ff files are stored, removing the worry about important files being in your temporary directory.
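Putting the two together, a minimal round trip looks something like this (the data frame name and directory path are made up for illustration):

```r
library(ff)
library(ffbase)

# a small ffdf standing in for a big dataset
donations <- as.ffdf(data.frame(id = 1:5, amount = c(10, 25, 5, 40, 15)))

# one ff file per column, plus an .RData and an .Rprofile,
# all in a directory that you choose
save.ffdf(donations, dir = "./ffdb")

# later, even in a fresh session: everything comes back under the same names
load.ffdf(dir = "./ffdb")
```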

Store your big data!!

My Goodness. What a Fat Dataset!

Recently at work we got sent a data file containing information on donations to a specific charitable organization, ranging all the way back to the 80’s.  Usually, when we receive a dataset with a donation history in it, each row represents a specific gift from a specific person at a specific time.  Also, each column represents some kind of information about that gift.  The result is usually a dataset which is fairly long (thousands or hundreds of thousands, in my recent experience) with maybe about 15 columns or more.

In this case, each row represented one person, but there were 1,551 columns!!  As it turned out, after the first column, which was the ID of the person donating the money, there were supposed to be just 31 more columns describing the gift in each row.  However, the person who put the data together decided that each row should represent a person, not a gift, so we got 31*50 columns: every subsequent gift from a person was represented by an extra 31 columns to the right of the previous 31.  Ridiculous!!

Anyway, I knew that I could reshape this using R, by stacking all 50 copies of each variable together and making sure the resulting 31 vectors took the names of the first 31 columns.  Following is a gist that shows what eventually worked for me:


library(reshape2)  # for melt()

# Extract the database IDs and repeat them 50 times so the ID column is long
# enough for the new long-form dataset (596,100 rows)
client.data.new = rep(client.data[, 1], 50)
for (i in 2:32) {
  # For each of the first 31 columns after the ID column, find the 49 matching
  # columns to the right (spaced 31 apart) and stack them using melt
  stacked.data = melt(client.data, id.vars = "CnBio_ID",
                      measure.vars = seq(i, i + 31*49, 31),
                      value.name = names(client.data)[i])
  # Once we have one new stacked column, append it to the right of whatever has
  # already been built in the new long-form dataset
  client.data.new = data.frame(client.data.new, stacked.data[, 3])
}
# The "value.name" argument didn't give me the right variable names for my
# new long-form dataset, but this last line of code fixes that
names(client.data.new) = names(client.data)[1:32]

In conclusion, if you need your dataset to get in shape, you need only remember one letter: R!

A Return to Reliable R

The saga with Statistica continues:

Statistica kept crashing on me while doing my data processing.  One of the big problems was a wonderful bug that occurred when some of my text data variables were coded (unsurprisingly) as text!  Under this condition, I could only add a small number of extra variables before any further variable I tried to add would crash the program!

I was told that this is a known bug in Statistica and that they’re hoping to fix it in an update coming around the end of the year.  In the meantime, a workaround is to go into the “Variable Specs” for any variable coded as Text, recode it as “Double”, save the worksheet, and try again.  That seemed to get rid of the crashing, but then my biographical ID column, which held all the original database IDs for the individuals in my dataset, got messed up.  Numerous IDs that were previously unique became spontaneously reassigned to more than one person.  I can’t have that, because once I’m done with the dataset I have to return important parts of it to the clients I work with so they can put certain new columns into their database.  So it was a bit of a catch-22.

My supervisor advised me to make a new, strictly numeric, ID column outside of Statistica, and import only the new ID column, and not the old one, back into the program.  I did that, and all seemed well until finally it crashed, yet again!  This time, I had no clue whatsoever why the crash happened.

That’s when I told myself “screw it, I’m wasting time in Statistica and am going to do the rest of this analysis in R”.  Man, is it ever nice to be back in R.  Ironically, things are much simpler and flow a lot faster for me.  The only problem is that I have a few projects coming up soon that really need a data analysis program that can handle humongous data sets.  For that reason, I’m probably going to have to see if reinstalling Statistica makes it more reliable to work with.  If not, I suppose I’ll have to move on to other options!

Processing Data from a Statistica Worksheet Using R

Context: I work with data from non-profit organizations, so a big concern in many of my analyses is whether, and by how much, people’s donations change from one year to the next.  One of the things I normally like to do is compute a value for each person that represents how much their yearly donations are increasing or decreasing on average over 5 years (a simple slope from the regression of their giving values on the years in which they gave).  It was pretty simple and quick to do this in R for previous projects, so there was no hassle there.  Now that we have Statistica in the office, my supervisor wants me to use it for our current project.

Problem: I was looking for a way in Statistica to do the above slope calculation for each row in a dataset of roughly 82,000 rows, and could not find one.

Solution:  As I mentioned in my last post, it’s possible to feed your Statistica dataset into R using the Statconn DCOM server, so that you can process/analyze it in R and then output your results back into Statistica.  So, I fed my dataset of 82,000 rows and 264 columns into R, reused some code from an earlier project to calculate 5-year giving slopes for each row in the dataset, and output a new worksheet with the newly calculated slopes column.  Although the code is pretty simple, the entire process took about 5 minutes, which was unbearably slow!!  It’s a pretty important part of my analysis, so going without it isn’t an option.
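For the record, the slope itself is cheap to compute in R.  A vectorized sketch on toy numbers (all names and values made up; each row holds five yearly giving totals) looks like this:

```r
# hypothetical example: two donors, five years of giving each
giving <- matrix(c(10, 20, 30, 40, 50,
                    5,  5,  5,  5,  5), nrow = 2, byrow = TRUE)
years <- 1:5

# slope of y on x is sum((x - mean(x)) * y) / sum((x - mean(x))^2);
# a single matrix product gives it for every row at once
x.centered <- years - mean(years)
slopes <- as.vector(giving %*% x.centered) / sum(x.centered^2)
slopes  # first donor increases by 10/year, second is flat: 10, 0
```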

I sent an email to one of the Statistica support guys, so hopefully they have a way of doing this kind of data processing natively, instead of having to wait all that time for the data to be processed through R.

Sampling and the Analysis of Big Data

After my last post, I came across a few articles supporting the opinion that if you have a good reason to take random samples from a “big” dataset, you’re not committing some kind of sin:

Big Data Blasphemy: Why Sample?

To Sample or Not to Sample… Does it Even Matter?

The moral of the story is that you can sample from “big data” so long as the analysis you’re doing doesn’t require some part of the data that will be excluded by the sampling process (an example being the top or bottom so many records based on some criterion).

“Big Data Analytics”? Hrm….

I do data mining and modeling really often these days.  However, the datasets that I work with really wouldn’t be considered “Big Data” (around 25,000 to about 200,000 rows, with quite a lot of variables).  I don’t know if I’ll ever be in a position to work with “Big Data”, but all the hype around it gets me thinking from time to time.  

Question: If I’ve got millions upon millions of records to work with, do I really need to submit all of them to my data analysis software (R) for data mining and modeling?

Answer: Not in the least bit.  If all I’m doing is looking for trends and building models that predict some desired behaviour, all I would have to do is get a handful of random samples that are small enough to fit into my data analysis software.  Then I could do my data mining on any one of the samples, build a model or models, and then test it/them on the other samples.  Apparently, random sampling is possible in Hadoop.  What this means is that if I get these random samples from the DBMS, I can just use the same kinds of techniques I’ve been using all along.
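As a toy sketch of that workflow, in base R with made-up numbers: draw two non-overlapping random samples, build a model on one, and test it on the other.

```r
set.seed(1)
# a stand-in for a table with millions of records
big <- data.frame(x = rnorm(10000))
big$y <- 2 * big$x + rnorm(10000)

# two non-overlapping random samples, each small enough for any analysis tool
train.rows <- sample(nrow(big), 1000)
test.rows  <- sample(setdiff(seq_len(nrow(big)), train.rows), 1000)

# mine/model on one sample...
fit <- lm(y ~ x, data = big[train.rows, ])

# ...then test it on the other
preds <- predict(fit, big[test.rows, ])
cor(preds, big[test.rows, "y"])
```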

Am I missing something, or is “Big Data Analytics” more of a marketing term than an actual reality?

Sample Rows from a Data Frame that Exclude the ID Values from Another Sample in R

In order to do some modeling, I needed to make a training sample and a test sample from a larger data frame.  Making the training sample was easy enough (see my earlier post), but I was going crazy trying to figure out how to make a second sample that excluded the rows I had already sampled in the first sample.

After trying out some options myself, looking extensively on the net, and asking for help on the r-help forum, I came up with the following function that finally does what I need it to do:
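In outline, it’s something like this (a sketch only: the function name and internals here are my own placeholders, with the arguments following the summary below):

```r
# sample n whole rows from main.df, excluding any row whose ID already
# appears in the first sample
sample.not.in <- function(main.df, sample1.df, n, main.id, sample1.id) {
  # drop every row of main.df whose ID is in the first sample
  remaining <- main.df[!(main.df[[main.id]] %in% sample1.df[[sample1.id]]), ]
  # then draw the second sample from what's left
  remaining[sample(nrow(remaining), n), ]
}
```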

 

To summarize the function, you enter in the big data frame first (here termed “main.df”), then your first sample data frame that has the ID values you want to exclude (here termed “sample1.df”), then your sample size, then the ID variable names in both data frames, enclosed in quotes.

Functions like this certainly make my working life with R easier, saving me from typing out that syntax every time I need this kind of task done.

Sampling Rows from a Data Frame in R

It’s time, yet again, for a simple and useful function I created that helped me at work.  I was looking for a way to sample whole rows of a very large data frame with many columns so that I could build a regression model on a subset of my data.  

While I was acquainted with the sample() function, I realized today that it couldn’t be used directly to sample whole rows of a data frame.  It’s really just meant for vectors.  So, I looked up how to get a sample of rows from a data frame and found an answer on the R help forums.  The meat of the function below comes from that answer:
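A sketch of the kind of function I mean (the name is my own placeholder; the sample() call on row indices is the meat of it):

```r
# sample n whole rows from a data frame: sample() picks random row
# indices, and those indices pull complete rows out of the data frame
row.sample <- function(df, n) {
  df[sample(nrow(df), n), ]
}
```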

So, just pass the data frame as the first argument and the size of your sample as the second, and there you have it!  You’ve just cut a random sample out of your data frame.

Custom Making a Data Frame in R for Graphing Purposes

I have no problems importing all sorts of data sets into R for statistical analysis.  However, there’s always a step where I’ve extracted the relevant stats and want to put them into multiple graphs.  When I get the particular stats I need, I usually stick them in Excel, in tabular form, and make my graphs from there.

Excel is great because it makes it ridiculously easy to customize my graphs.  Today, however, I kept trying to copy and paste the tables I had made in Excel back into R to build some fancy graphs with ggplot2.  I kept getting very frustrated, because R is very picky about copying and pasting text from Excel.

I realized that if I want to remove the frustration of copying tables from Excel, I should just construct these “pre-graph” tables in R instead.
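For instance, a tiny made-up “pre-graph” table and chart (all names and numbers invented for illustration):

```r
library(ggplot2)

# a small "pre-graph" table: one text column categorizing the stats,
# one numeric column holding the stats themselves
pre.graph <- data.frame(gift.type = c("Annual", "Major", "Planned"),
                        total     = c(125000, 480000, 210000))

# chart the table directly, no copy-and-paste from Excel required
ggplot(pre.graph, aes(x = gift.type, y = total)) +
  geom_bar(stat = "identity")
```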

The above represents a very simple example, but essentially I haven’t even been doing anything like it at work to try to use R graphics instead of Excel graphs.

In sum:

  1. Get the needed stats
  2. Type up a column or more with text items that categorize the stats from step 1
  3. Type up a column with the stats from step 1
  4. Put those columns together in a data frame and chart.

The sqldf package in R. Awesome!

I only recently learned how to use the sqldf package to perform SELECT queries on the data frames I have loaded into the R workspace.  There’s a lot I already know how to do when selecting or subsetting data in R, but I didn’t know how to do the equivalent of joining two datasets by some ID variable.

Luckily, that kind of operation is easy when you put it into SQL syntax.  It basically looks something like this:

newdata = sqldf('SELECT dataframe1.*, dataframe2.somevariable FROM dataframe1 LEFT JOIN dataframe2 ON dataframe1.ID = dataframe2.ID')

Things get a bit more complicated when you have dots in your data frame and variable names.  Below is what the call to sqldf has to look like in that situation:

newdata = sqldf('SELECT "data.frame1".*, "data.frame2".some_variable FROM "data.frame1" LEFT JOIN "data.frame2" ON "data.frame1".ID = "data.frame2".ID')

You’ll notice that I had to put double quotes around the dataframe references, so that the SQL code didn’t get confused by the presence of dots.  The underscore in the variable reference “some_variable” actually translates into “some.variable” when it looks for the referent in the R data frame.  It’s a little messy, but totally worth it when I consider I don’t even know how to do this kind of operation in R otherwise!
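To make the first form concrete, here’s a self-contained toy version (the data frames and values are invented):

```r
library(sqldf)

# toy stand-ins for two datasets that share an ID variable
donors <- data.frame(ID = 1:3, name = c("Ann", "Bob", "Carl"))
gifts  <- data.frame(ID = c(1, 3), amount = c(100, 250))

# left join: keep every donor, attaching the gift amount where one exists
newdata <- sqldf("SELECT donors.*, gifts.amount
                  FROM donors LEFT JOIN gifts ON donors.ID = gifts.ID")
newdata
```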