Know Your Dataset: Specifying colClasses to load up an ffdf

When I finally figured out how to use the ff package to load data into R, I was apparently working with data that was relatively painless to load through read.csv.ffdf (see my previous post).  Just this past Sunday, I naively followed my own post to load a completely new dataset (over 400,000 rows and about 180 columns) for analysis.  Unfortunately for me, the data file was a bit messier, and read.csv.ffdf wasn’t able to settle on the column classes by itself.  It would chug along until a column that it had first taken to be one data type turned out to contain another, and then it would stop with an error message, basically telling me it didn’t want to adapt to the changing assumptions about which data type each column held.

So, I set out to learn how I could use the colClasses argument of read.csv.ffdf to manually set the data type for each column.  I adapted the following solution from a Stack Overflow thread about specifying colClasses in the regular read.csv function.

First, load up a sample of the big dataset using the read.csv command (The following is obviously non-random. If you can figure out how to read the sample in randomly, I think it would work much better):

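fname = "C:/big/data/loc.csv"  # path to the big CSV file
headset = read.csv(fname, header = TRUE, nrows = 5000, stringsAsFactors = TRUE)  # text columns as factors, as before R 4.0.0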
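On the wish for a random sample: here is a minimal sketch of one way it might be done in base R, keeping each data line with some probability while reading the file in chunks. The helper name sample_csv_lines and its arguments p and chunk_lines are my own inventions, not part of ff or base R, and the sketch assumes that no quoted field contains a line break.

```r
# Hypothetical helper (not part of ff or base R): keep each data line with
# probability `p`, reading the file chunk by chunk so it never sits in RAM whole.
# Assumes one record per line (no quoted fields containing line breaks).
sample_csv_lines <- function(path, p = 0.01, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)           # always keep the header line
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_lines)
    if (length(chunk) == 0L) break           # end of file reached
    kept <- c(kept, chunk[runif(length(chunk)) < p])
  }
  read.csv(text = c(header, kept), header = TRUE)
}

# Small self-contained demonstration on a temporary file
set.seed(1)
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2000, x = rnorm(2000)), demo_file, row.names = FALSE)
headset_random <- sample_csv_lines(demo_file, p = 0.1, chunk_lines = 500L)
nrow(headset_random)   # roughly 10% of the 2000 rows
```

The result could stand in for headset in the steps that follow. The sample size is itself random, so pick p with the row count of your file in mind.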

The next command generates a named character vector holding every variable name in your dataset along with the class R derived for it from the rows you imported:

headclasses = sapply(headset, class)

Now comes the fairly manual part. Look at the list of variables and classes (data types) that you generated, and look for obvious mismatches. Examples could be a numeric variable that got coded as a factor or logical, or a factor that got coded as a numeric. When you find such a mismatch, the following syntax suffices for changing a class one at a time:

headclasses["variable.name"] = "numeric"

Obviously, the “variable.name” should be replaced by the actual variable name you’re reclassifying, and the “numeric” string can also be “factor”, “ordered”, “Date”, “POSIXct” (the last two being date/time data types). Finally, let’s say you want to change every variable that got coded as “logical” into “numeric”. Here’s some syntax you can use:

headclasses[grep("logical", headclasses)] = "numeric"
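Hunting for mismatches by eye gets tedious with 180 columns. A rough sketch of a first pass: for every column read as text, compute the share of values that parse as numbers, and flag the columns where that share is high. The helper numeric_share, the 0.5 cut-off, and the made-up demo_head data frame are all assumptions for illustration; flagged columns still need a human look.

```r
# Hypothetical helper: share of non-empty values in a column that parse as numbers
numeric_share <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x) & nzchar(x)]
  if (length(x) == 0L) return(NA_real_)
  mean(!is.na(suppressWarnings(as.numeric(x))))
}

# Made-up stand-in for `headset`
demo_head <- data.frame(
  amount = factor(c("10.5", "3", "n/a", "7.25")),   # numbers plus one stray token
  city   = factor(c("Toronto", "Ottawa", "Toronto", "Guelph")),
  flag   = c(NA, NA, NA, NA)                         # all-missing columns come back as logical
)
demo_classes <- sapply(demo_head, class)

# Only text-like columns are candidates for "should have been numeric"
text_cols <- names(demo_classes)[demo_classes %in% c("factor", "character")]
shares    <- sapply(demo_head[text_cols], numeric_share)
suspects  <- names(shares)[!is.na(shares) & shares >= 0.5]
suspects
```

Here only amount is flagged, because three of its four values are numbers. The all-missing flag column shows up as "logical" in demo_classes, which is the other kind of mismatch worth checking.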

Once you are certain that all the classes in the vector you just generated and modified are accurate for the dataset, you can load up the data with confidence, using headclasses:

bigdataset = read.csv.ffdf(file="C:/big/data/loc.csv", first.rows=5000, colClasses=headclasses)
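To see the whole idea in miniature without ff, here is a sketch that uses plain read.csv on a small made-up file. The file, its column names, and the choice of 50 head rows are assumptions for the demonstration. The first 50 rows mislead R about two columns, and the corrected class vector then reads the full file cleanly.

```r
set.seed(42)
demo_file <- tempfile(fileext = ".csv")
n <- 200
demo <- data.frame(
  id    = 1:n,
  score = c(rep(NA, 50), round(runif(n - 50), 3)),   # empty at first, numeric later
  code  = c(rep("12", 50), rep("A7", n - 50))        # looks like integers at first
)
write.csv(demo, demo_file, row.names = FALSE, na = "")

# Classes guessed from the first 50 rows only
head_rows <- read.csv(demo_file, header = TRUE, nrows = 50)
guessed   <- sapply(head_rows, class)

# Forcing the guessed classes onto the whole file fails partway through
failed <- try(read.csv(demo_file, header = TRUE, colClasses = guessed), silent = TRUE)

# Manual corrections, one column at a time
fixed <- guessed
fixed["score"] <- "numeric"
fixed["code"]  <- "factor"

full <- read.csv(demo_file, header = TRUE, colClasses = fixed)
sapply(full, class)
```

The failed read is a small-scale version of the error that the messy dataset produced for me, and the fix is the same: correct the class vector before the full load.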

This was certainly not easy, but I must say that I seem to be willing to jump through many hoops for R!!

Big data analysis, for free, in R (or “How I learned to load, manipulate, and save data using the ff package”)

Before choosing to support the purchase of Statistica at my workplace, I came across the ff package as an option for working with really big datasets (with special attention paid to ff dataframes, or ffdf). It looked like a good option to use, allowing dataframes with multiple data types and way more rows than if I were loading such a dataset into RAM as is normal with R. The one big problem I had is that every time I tried to use the ffsave function to save my work from one R session to the next, it told me that it could not find an external zip utility on my Windows machine. I guess because I just had so much else going on, I didn’t have the patience to do the research to find a solution to this problem.

This weekend I finally found some time to revisit this problem, and managed to find a solution! From what I can tell, R appears to expect, in cases like the ffsave function, that you have command-line utilities like a zip utility at the ready and recognizable by R. Although I haven’t tested the ff package on either of my Linux laptops at home, I suspect that R recognizes the utilities that come pre-installed on them. However, in the Windows case, the solution seems to be to install a supplementary group of command-line programs called Rtools.  When you visit the page, be sure to download the version of Rtools that corresponds with your R version.

When you go through the installation process, you will see a screen like below. Be sure that you check the same boxes as in the screenshot below so that R knows where the zip utility lives.

Once you have it installed, that’s when the fun finally begins. As in the smaller data case, I like reading in CSV files, and ff provides read.csv.ffdf for importing external data into R. Let’s say that you have a data file named bigdata.csv. Here is a command for loading it up:

bigdata = read.csv.ffdf(file="c:/fileloc/bigdata.csv", first.rows=5000, colClasses=NA)

The first part of the command, directing R to your file, should look straightforward. The first.rows argument tells it how big the first chunk of data it reads in should be (ff reads your data a part at a time to save RAM.  Correct me if I’m wrong).  Finally, and importantly, the colClasses=NA argument tells R not to assume the data types of each of your columns from the first chunk alone.
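To picture what reading in chunks means, here is a rough base R analogy, not ff's actual implementation: open a connection, read a first chunk with the header, then keep reading fixed-size chunks with the column names and classes taken from that first chunk. The file and the chunk size of 10 are made up for the demonstration.

```r
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = seq(0.5, 12.5, by = 0.5)),
          demo_file, row.names = FALSE)

con <- file(demo_file, open = "r")

# First chunk: header plus 10 rows; the column classes are settled here
first_chunk <- read.csv(con, header = TRUE, nrows = 10)
col_names   <- names(first_chunk)
col_classes <- sapply(first_chunk, class)
rows_read   <- nrow(first_chunk)

# Later chunks reuse those names and classes; only one chunk is held at a time
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10,
             col.names = col_names, colClasses = col_classes),
    error = function(e) NULL   # at end of file read.csv may error or return zero rows
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  rows_read <- rows_read + nrow(chunk)
}
close(con)
rows_read
```

Because every later chunk is read with the classes taken from the first one, a column that changes character further down the file is exactly where this kind of reader runs into trouble.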

Now that you’ve loaded your big dataset, you can manipulate it at will.  If you look at the ff and ffbase documentation, a lot of the standard R functions for working with and summarizing data have been optimized for use with ff dataframes and vectors.  The upshot of this is that working with data stored in ffdf format seems to be a pretty similar experience compared to working with normal data frames.  Importantly, when you want to subset your data frame to create a test sample, the ffbase package replaces the subset command so that the resultant subset is also an ffdf, and doesn’t take up more of your RAM.

I noticed that you can use the glm() and lm() functions on an ffdf, but I think you have to be careful because they are not optimized for use with ffdfs and therefore will take up the usual amount of memory if you save them to your workspace.  So if you build models using these functions, be sure to select a sample from your ffdf that isn’t overly big!
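A sketch of that sampling step, using an ordinary data frame as a stand-in for the ffdf. My assumption, worth checking against the ff documentation, is that indexing an ffdf by row numbers in the same way hands back a regular in-memory data frame. The data, the sample size of 5000, and the model formula are all invented for illustration.

```r
set.seed(123)
# Stand-in for the big table (an ffdf in the real workflow)
big <- data.frame(x = rnorm(50000))
big$y <- rbinom(50000, 1, plogis(0.5 * big$x))

# Draw a manageable random subset of row numbers and pull only those rows
idx <- sort(sample(nrow(big), 5000))
model_sample <- big[idx, ]

# Fit the model on the sample rather than on the full table
fit <- glm(y ~ x, family = binomial, data = model_sample)
coef(fit)
```

Sorting the sampled row numbers is a small courtesy to disk-backed data, since rows are then fetched in file order.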

Next comes the step of saving your work.  The syntax is simple enough:

ffsave(bigdata, file="C:/fileloc/Rwork/bigdata")

This saves a .ffData file and a .RData file to the directory of your choice with “bigdata” as the filenames.

Then, when you want to load up your data in a new R session during some later time, you use the simple ffload command:

ffload(file="C:/fileloc/Rwork/bigdata")

It gives you some warning messages, but as far as I can tell they do not get in the way of accessing your data.  That covers the basics of working with big data using the ff package.    Have fun analyzing your data using less RAM! 🙂

A Return to Reliable R

The saga with Statistica continues:

Statistica kept crashing on me while doing my data processing.  One of the big problems was a wonderful bug that occurred when some of my text data variables were coded (unsurprisingly) as text!  Under this condition, I would only be able to add a certain small number of extra variables when I needed to make them, and then after that, any extra variable that I tried to add would crash the program!

I was told that this is a known bug in Statistica and they’re hoping to fix it with an update coming around by the end of the year.  In the meantime, a workaround is to go into the “Variable Specs” for any variable coded as Text and recode it as “Double”, save the worksheet, then try again.  That seemed to get rid of the crashing, but then my biographical ID column that held all the original database IDs for the individuals in my dataset got messed up.  Numerous IDs, which were previously unique, became spontaneously reassigned to more than one person.  I can’t have that because once I’m done with the dataset, I have to return important parts of it to the clients I work with so they can put certain new columns into their database.  So it was a bit of a catch-22.

My supervisor advised me to make a new, strictly numeric, ID column outside of Statistica, and import only the new ID column, and not the old one, back into the program.  I did that, and all seemed well until finally it crashed, yet again!  This time, I had no clue whatsoever why the crash happened.

That’s when I told myself “screw it, I’m wasting time in Statistica and am going to do the rest of this analysis in R”.  Man, is it ever nice to be back in R.  Ironically, things are much simpler and flow a lot faster for me.  The only problem is that I have a few projects coming up soon that really need a data analysis program that can handle humongous data sets.  For that reason, I’m probably going to have to see if reinstalling Statistica makes it more reliable to work with.  If not, I suppose I’ll have to move on to other options!

Memory Management in R, and SOAR

The more I’ve worked with my really large data set, the more cumbersome the work has become to my work computer.  Keep in mind I’ve got a quad core with 8 gigs of RAM.  With growing irritation at how slow my work computer becomes at times while working with these data, I took to finding better ways of managing my memory in R.

The best/easiest solution I’ve found so far is in a package called SOAR.  To put it simply, it allows you to store specific objects in R (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to analyze them in R without having them loaded into your RAM.  I stress the word “analyze” because every time I try to add variables to the data frames that I store, the data frame comes back into RAM and once again slows me down.

An example might suffice:

> r = data.frame(a=rnorm(10,2,.5),b=rnorm(10,3,.5))
> r
          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] "r"

So, the first thing I did was to make a data frame with some columns, which got stored in my workspace, and thus loaded into RAM.  Then, I loaded the SOAR package, and set my local cache to “testsession”.  The practical implication of that is that a directory gets created within the current directory that R is working out of (in my case, “/home/inkhorn/testsession”), and that any objects passed to the Store command get saved as RData files in that directory.

Sure enough, you see my workspace before and after I store the r object.  Now you see the object, now you don’t!  But then, as I show, even though the object is not in the workspace, you can still analyze it (in my case, calculate a mean from one of the columns).  However, as soon as I try to make a new column in the data frame… voila … it’s back in my workspace, and thus RAM!
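My best guess at why this happens, sketched in base R rather than with SOAR itself: an object that is visible but not in the workspace must be found somewhere else on R's search path, and modifying such an object writes a changed copy into the workspace. The attached environment below is a made-up stand-in for SOAR's cache, and the names cache, demo_cache, and cached_df are hypothetical.

```r
# A made-up "cache" standing in for SOAR's stored objects
cache <- new.env()
assign("cached_df", data.frame(a = c(1, 2, 3)), envir = cache)
attach(cache, name = "demo_cache")

here <- environment()   # the workspace this code is running in

# Not in the workspace, yet still usable because it sits on the search path
before <- exists("cached_df", envir = here, inherits = FALSE)
avg <- mean(cached_df$a)

# Adding a column creates a modified copy in the workspace
cached_df$b <- c(4, 5, 6)
in_workspace <- exists("cached_df", envir = here, inherits = FALSE)

# The attached copy is untouched and still has only column "a"
cached_cols <- names(get("cached_df", envir = as.environment("demo_cache")))

detach("demo_cache", character.only = TRUE)
```

If SOAR works along these lines, it would explain the transcript: reading a stored object leaves the workspace alone, while assigning into it leaves a full copy behind until it is stored again.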

So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement.  Every time I’m done making new columns in the data frame, I just have to pass the object to the Store command, and away to the hard disk it goes, and out of my RAM.  It’s quite liberating not having a stupendously heavy workspace, as when I’m trying to leave or enter R, it takes forever to save/load the workspace.  With the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.

Another thing I noticed is that if I keep the GLMs that I’ve generated in my workspace, that seems to take up a lot of RAM as well and slow things down.  So, with writing the main dataframe to disk, and keeping GLMs out of memory, R is flying again!
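For the GLMs, a base R sketch of the same park-it-on-disk habit, using saveRDS and readRDS instead of SOAR. The data and the model are invented for the demonstration. A fitted glm carries residuals, fitted values, the model frame, and more, so it can be far larger than the handful of coefficients you usually want at hand.

```r
set.seed(7)
d <- data.frame(x = rnorm(20000))
d$y <- rbinom(20000, 1, plogis(d$x))

fit_full  <- glm(y ~ x, family = binomial, data = d)
size_full <- as.numeric(object.size(fit_full))   # bytes held by the fitted model

# Park the full model on disk and keep only the small pieces needed right now
model_file <- tempfile(fileext = ".rds")
saveRDS(fit_full, model_file)
coefs <- coef(fit_full)
rm(fit_full)

size_coefs <- as.numeric(object.size(coefs))
c(model_bytes = size_full, coef_bytes = size_coefs)

# Bring the full model back only when it is actually needed
fit_back <- readRDS(model_file)
```

In a real session the file would live somewhere permanent rather than in tempfile(), which is used here only to keep the example self-contained.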