save.ffdf and load.ffdf: Save and load your big data – quickly and neatly!

I’m very indebted to the ff and ffbase packages in R.  Without them, I would probably have to use some less savoury stats program for the bigger data analysis projects I do at work.

Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes using ffsave and ffload.  The syntax isn’t so bad, but the process it puts your computer through is a bit cumbersome.  Saving and loading both take a while, and ffsave creates (by default) a bunch of randomly named ff files in a temporary directory.

For that reason, I was happy to come across a link to a pdf presentation (sorry, I’ve lost it now) summarizing some cool features of ffbase.  I learned that instead of using ffsave and ffload, you can use save.ffdf and load.ffdf, which have very simple syntax:

save.ffdf(ffdfname, dir="/PATH/TO/STORE/FF/FILES")

Use that, and it creates a directory in which it stores ff files named after the columns of your ff dataframe!  It also stores an .RData file and an .Rprofile file there.  Then there is:

load.ffdf(dir="/PATH/TO/STORE/FF/FILES")

As simple as that, you load your files, and you’re done!  I think what I like about these functions is that they allow you to easily choose where the ff files are stored, removing the worry about important files being in your temporary directory.
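Here’s a minimal round-trip sketch of the two calls above. It assumes the ff and ffbase packages are installed; the iris data and the iris_ffdb directory are just hypothetical stand-ins for your own ffdf and storage location.

```r
library(ff)
library(ffbase)

# Hypothetical example data: the built-in iris data frame converted to an ffdf
iris_ffdf <- as.ffdf(iris)

# Hypothetical storage location; substitute a permanent directory of your own
ff_dir <- file.path(tempdir(), "iris_ffdb")

# Store the ff files (one per column) plus the bookkeeping files in ff_dir
save.ffdf(iris_ffdf, dir = ff_dir)
list.files(ff_dir, all.files = TRUE)

# Simulate a fresh session: drop the object, then restore it under its old name
rm(iris_ffdf)
load.ffdf(dir = ff_dir)
dim(iris_ffdf)
```

Note that load.ffdf brings the object back under the name it had when you saved it, so there is nothing to assign.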

Store your big data!!

Which Torontonians Want a Casino? Survey Analysis Part 2

In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built.  So, here goes something:

The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the survey, the one asking people to rate the importance of a long list of issues.  When I looked at this list originally, I knew that I would want to reduce the dimensionality using PCA.

library(psych)
issues.pca = principal(casino[,8:23], 3, rotate="varimax",scores=TRUE)

The PCA resulted in the 3 components listed in the table below.  The first component had variables loading on to it that seemed to relate to the casino being a big attraction with lots of features, so I named it “Go big or Go Home”.  On the second component there seemed to be variables loading on to it that related to technical details, while the third component seemed to have variables loading on to it that dealt with social or environmental issues.

Columns: variable; loading on Go Big or Go Home; loading on Concerned with Technical Details; loading on Concerned with Social/Environmental Issues or not; Issue/Concern
Q3_A 0.181 0.751 Design of the facility
Q3_B 0.366 0.738 Employment Opportunities
Q3_C 0.44 0.659 Entertainment and cultural activities
Q3_D 0.695 0.361 Expanded convention facilities
Q3_E 0.701 0.346 Integration with surrounding areas
Q3_F 0.808 0.266 New hotel accommodations
Q3_G -0.117 0.885 Problem gambling & health concerns
Q3_H 0.904 Public safety and social concerns
Q3_I 0.254 0.716 Public space
Q3_J 0.864 0.218 Restaurants
Q3_K 0.877 0.157 Retail
Q3_L 0.423 0.676 -0.1 Revenue for the city
Q3_M 0.218 0.703 0.227 Support for local businesses
Q3_N 0.647 0.487 -0.221 Tourist attraction
Q3_O 0.118 0.731 Traffic concerns
Q3_P 0.497 0.536 0.124 Training and career development

Once I was satisfied that I had a decent understanding of what the PCA was telling me, I loaded the component scores into the original dataframe.

casino[,110:112] = issues.pca$scores
names(casino)[110:112] = c("GoBigorGoHome","TechnicalDetails","Soc.Env.Issues")
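If you don’t have the survey data handy, the same idea can be sketched on simulated data. This sketch swaps in base R’s prcomp() and varimax() for psych’s principal(), so the numbers it produces are not meant to match the table above; the nine items and three latent attitudes are made up for illustration.

```r
set.seed(1)
n <- 500

# Three hypothetical latent attitudes, each driving three made-up survey items
latent  <- matrix(rnorm(n * 3), ncol = 3)
pattern <- kronecker(diag(3), matrix(1, nrow = 1, ncol = 3))  # 3 x 9 indicator pattern
items   <- latent %*% pattern + matrix(rnorm(n * 9, sd = 0.5), ncol = 9)
colnames(items) <- paste0("item", 1:9)

# PCA on the correlation matrix (items centred and scaled)
pca <- prcomp(items, center = TRUE, scale. = TRUE)

# Loadings = eigenvectors scaled by the component standard deviations
raw_loadings <- pca$rotation[, 1:3] %*% diag(pca$sdev[1:3])

# Varimax rotation of the first three components
rotated <- varimax(raw_loadings)
round(unclass(rotated$loadings), 2)

# Standardised component scores, rotated with the same rotation matrix
std_scores <- sweep(pca$x[, 1:3], 2, pca$sdev[1:3], "/")
rot_scores <- std_scores %*% rotated$rotmat
head(rot_scores)
```

Each simulated item should load strongly on exactly one rotated component, which is the kind of pattern I was looking for when naming the components above.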

In order to investigate the question of who wants a casino and where, I decided to use question 6 as a dependent variable (the one asking where they would want it built, if one were to be built) and the PCA components as independent variables.  This is a good question to use, because the answer options, if you remember, are “Toronto”, “Adjacent Municipality” and “Neither”.  My approach was to model each response individually using logistic regression.

casino$Q6[casino$Q6 == ""] = NA
casino$Q6 = factor(casino$Q6, levels=c("Adjacent Municipality","City of Toronto","Neither"))

adj.mun = glm(casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
toronto = glm(casino$Q6 == "City of Toronto" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
neither = glm(casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))

Following are the summaries of each GLM:
Toronto:


Call:
glm(formula = casino$Q6 == "City of Toronto" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6426 -0.4745 -0.1156 0.4236 3.4835
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58707 0.04234 -37.48 <2e-16 ***
GoBigorGoHome 1.76021 0.03765 46.75 <2e-16 ***
TechnicalDetails 1.77155 0.05173 34.24 <2e-16 ***
Soc.Env.Issues -1.63057 0.04262 -38.26 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13537 on 10365 degrees of freedom
Residual deviance: 6818 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 6826
Number of Fisher Scoring iterations: 6

Adjacent municipality:


Call:
glm(formula = casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0633 -0.7248 -0.5722 -0.3264 2.7136
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.45398 0.02673 -54.394 < 2e-16 ***
GoBigorGoHome -0.41989 0.02586 -16.239 < 2e-16 ***
TechnicalDetails 0.18764 0.02612 7.183 6.82e-13 ***
Soc.Env.Issues 0.52325 0.03221 16.243 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10431.8 on 10365 degrees of freedom
Residual deviance: 9756.4 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 9764.4
Number of Fisher Scoring iterations: 5

Neither location:


Call:
glm(formula = casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails +
Soc.Env.Issues, family = binomial(logit), data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4090 -0.7344 -0.3934 0.8966 2.7194
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.22987 0.02415 -9.517 <2e-16 ***
GoBigorGoHome -0.85050 0.02462 -34.549 <2e-16 ***
TechnicalDetails -1.00182 0.02737 -36.597 <2e-16 ***
Soc.Env.Issues 0.69707 0.02584 26.972 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 14215 on 10365 degrees of freedom
Residual deviance: 10557 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 10565
Number of Fisher Scoring iterations: 4

And here is a quick summary of the above GLM information:
[Figure: Summary of Casino GLMs]

Judging from these results, it looks like those who want a casino in Toronto don’t focus on the big social/environmental issues surrounding the casino, but do focus on the flashy and non-flashy details and benefits alike.  Those who want a casino outside of Toronto do care about the social/environmental issues, don’t care as much about the flashy details, but do have a focus on some of the non-flashy details.  Finally, those not wanting a casino in either location care about the social/environmental issues, but don’t care about any of the details.
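One way to make the coefficients easier to read is to turn them into odds ratios. The sketch below just exponentiates the estimates copied from the three summaries above; reading “one unit” as roughly one standard deviation assumes the component scores are standardised, which I haven’t verified here.

```r
# Coefficients copied from the three GLM summaries above
coefs <- rbind(
  toronto  = c(GoBigorGoHome =  1.76021, TechnicalDetails =  1.77155, Soc.Env.Issues = -1.63057),
  adjacent = c(GoBigorGoHome = -0.41989, TechnicalDetails =  0.18764, Soc.Env.Issues =  0.52325),
  neither  = c(GoBigorGoHome = -0.85050, TechnicalDetails = -1.00182, Soc.Env.Issues =  0.69707)
)

# Odds ratio for a one-unit increase in each component score
odds_ratios <- round(exp(coefs), 2)
odds_ratios
```

Values above 1 mean higher scores on that component go with higher odds of choosing that location; values below 1 mean lower odds.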

Here’s where the issue of location comes into play.  When I look at the summary for the GLM that predicts who wants a casino in an adjacent municipality, I get the feeling that it’s picking up people living in the downtown core who just don’t think the area can handle a casino.  In other words, I think there might be a “not in my backyard!” effect.

The first inkling that this might be the case comes from an article from the Martin Prosperity Institute (MPI), who analyzed the same data set and managed to get a very nice looking heat map of the responses to the first question on the survey, asking people how they feel about having a new casino in Toronto.  From this map, it does look like people in Downtown Toronto are feeling pretty negative about a new casino, whereas those in the far east and west of Toronto are feeling better about it.

My next evidence comes from the cities uncovered by geocoding the responses in the data set.  I decided to create a very simple indicator variable, distinguishing those for whom the “City” is Toronto from those for whom the city is anything else.  I like this better than the MPI analysis, because it looks at people’s attitudes towards a casino both inside and outside of Toronto (rather than towards the concept of a new casino in Toronto).  If there really is a “not in my backyard!” effect, I would expect to see evidence that those in Toronto are more disposed towards a casino in an adjacent municipality, and that those from outside of Toronto are more disposed towards a casino inside Toronto!  Here we go:


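library(ff)
library(ggplot2)  # ggplot(), geom_bar()
library(reshape2) # melt()
library(scales)   # percent labels
library(ggthemes) # theme_wsj()
ffload(file="casino", overwrite=TRUE)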
casino.orig$Outside.of.Toronto = as.ff(ifelse(casino.orig[,"City"] == "Toronto",0,1))
casino.in.toronto = glm(casino.orig[,"Q6"] == "City of Toronto" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
casino.outside.toronto = glm(casino.orig[,"Q6"] == "Adjacent Municipality" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
summary(casino.in.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "City of Toronto" ~ Outside.of.Toronto,
family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9132 -0.9132 -0.7205 1.4669 1.7179
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.21605 0.02600 -46.77 <2e-16 ***
Outside.of.Toronto 0.55712 0.03855 14.45 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16278 on 13881 degrees of freedom
Residual deviance: 16070 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 16074
Number of Fisher Scoring iterations: 4
————————————————–
summary(casino.outside.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "Adjacent Municipality" ~
Outside.of.Toronto, family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7280 -0.7280 -0.5554 -0.5554 1.9726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.19254 0.02583 -46.16 <2e-16 ***
Outside.of.Toronto -0.59879 0.04641 -12.90 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13786 on 13881 degrees of freedom
Residual deviance: 13611 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 13615
Number of Fisher Scoring iterations: 4
——————————————-
# Cross-tabulate residence by preferred location (columns 2:4 presumably skip the blank Q6 level)
casino.loc.by.city.res = as.matrix(table(casino.orig[,"Outside.of.Toronto"], casino.orig[,"Q6"])[,2:4])
rownames(casino.loc.by.city.res) = c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto")
# Row proportions (within city of residence), reshaped to long format for ggplot
casino.loc.by.city.res = melt(prop.table(casino.loc.by.city.res,1))
names(casino.loc.by.city.res) = c("City of Residence","Where to Build the Casino","value")
# stat="identity" plots the proportions as given instead of counting rows
ggplot(casino.loc.by.city.res, aes(x=`Where to Build the Casino`, y=value, fill=`City of Residence`)) + geom_bar(stat="identity", position="dodge") + scale_y_continuous(labels=percent) + theme_wsj() + scale_fill_discrete(name="City of Residence") + ggtitle("If a casino is built, where would you prefer to see it located?")

[Figure: Where located by city of residence]

As you can see here, those from outside of Toronto are more likely to suggest building a casino in Toronto compared with those from the inside, and less likely to suggest building a casino in an adjacent municipality (with the reverse being true of those from the inside of Toronto).

That being said, when you do the comparison within city of residence (instead of across it like I just did), those from the inside of Toronto seem equally likely to suggest that the casino be built in or outside of the city, whereas those outside are much more likely to suggest building the casino inside Toronto than outside.  So, depending on how you view this graph, you might only say there’s evidence for a “not in my backyard!” effect for those living outside of Toronto.
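The within-residence comparison can also be read straight off the two GLMs above, because with a single 0/1 predictor the fitted probabilities are just the inverse logit of the coefficients. A small sketch, using the estimates copied from the two summaries (the figures may differ slightly from the bars in the chart, since the models and the table don’t necessarily drop exactly the same rows):

```r
# Intercepts and Outside.of.Toronto coefficients copied from the two summaries above
b_toronto  <- c(intercept = -1.21605, outside =  0.55712)
b_adjacent <- c(intercept = -1.19254, outside = -0.59879)

# plogis() is the inverse logit: it converts log-odds into probabilities
predicted <- rbind(
  "Lives inside Toronto"  = c(plogis(b_toronto[["intercept"]]), plogis(b_adjacent[["intercept"]])),
  "Lives outside Toronto" = c(plogis(sum(b_toronto)),           plogis(sum(b_adjacent)))
)
colnames(predicted) <- c("City of Toronto", "Adjacent Municipality")
round(predicted, 3)
```

This works out to roughly 23% versus 23% for those living inside Toronto, and roughly 34% versus 14% for those living outside, which is the asymmetry described above.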

As a final note, I’ll remind you that although these analyses point to which Torontonians do want a new casino, the fact from this survey remains that about 71% of respondents are unsupportive of a casino in Toronto, and 53% don’t want a casino built in either Toronto or an adjacent municipality.  I really have to wonder if they’re still going to go ahead with it!

Know Your Dataset: Specifying colClasses to load up an ffdf

When I finally figured out how to successfully use the ff package to load data into R, I was apparently working with relatively pain-free data to load up through read.csv.ffdf (see my previous post).  Just this past Sunday, I naively followed my own post to load a completely new dataset (over 400,000 rows and about 180 columns) for analysis.  Unfortunately for me, the data file was a bit messier, and so read.csv.ffdf wasn’t able to finalize the column classes by itself.  It would chug along until a column that it had first taken to be one data type turned out to contain another, and then it would stop with an error message, basically telling me it wouldn’t revise its assumptions about which data type each column held.

So, I set out to learn how I could use the colClasses argument in the read.csv.ffdf command to manually set the data types for each column.  I adapted the following solution from a Stack Overflow thread about specifying colClasses in the regular read.csv function.

First, load up a sample of the big dataset using the read.csv command (The following is obviously non-random. If you can figure out how to read the sample in randomly, I think it would work much better):

headset = read.csv(fname, header = TRUE, nrows = 5000)

The next command generates a named vector with one entry per variable in your dataset, giving the class R was able to derive from the rows you imported:

headclasses = sapply(headset, class)

Now comes the fairly manual part. Look at the list of variables and classes (data types) that you generated, and look for obvious mismatches. Examples could be a numeric variable that got coded as a factor or logical, or a factor that got coded as a numeric. When you find such a mismatch, the following syntax suffices for changing a class one at a time:

headclasses["variable.name"] = "numeric"

Obviously, the “variable.name” should be replaced by the actual variable name you’re reclassifying, and the “numeric” string can also be “factor”, “ordered”, “Date”, “POSIXct” (the last two being date/time data types). Finally, let’s say you want to change every variable that got coded as “logical” into “numeric”. Here’s some syntax you can use:

headclasses[grep("logical", headclasses)] = "numeric"

Once you are certain that all the classes represented in the list you just generated and modified are accurate to the dataset, you can load up the data with confidence, using the headclasses list:

bigdataset = read.csv.ffdf(file="C:/big/data/loc.csv", first.rows=5000, colClasses=headclasses)
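To see the whole workflow run end to end, here’s a small self-contained sketch. It uses a hypothetical toy CSV and plain read.csv as a stand-in for read.csv.ffdf, so it only illustrates how the colClasses vector gets built and corrected, not how ff behaves on a really big file.

```r
# Hypothetical toy file: 'flag' is empty in the first rows, so R will guess "logical"
toy <- data.frame(id   = 1:20,
                  flag = c(rep(NA, 10), seq(0.5, 5, by = 0.5)),
                  grp  = rep(c("a", "b"), 10))
fname <- tempfile(fileext = ".csv")
write.csv(toy, fname, row.names = FALSE)

# Step 1: read a small sample and record the classes R guessed
headset <- read.csv(fname, header = TRUE, nrows = 5)
headclasses <- sapply(headset, class)
headclasses

# Step 2: correct the obvious mismatches
headclasses[headclasses == "logical"] <- "numeric"  # all-NA sample column
headclasses["grp"] <- "factor"                      # treat the grouping column as a factor

# Step 3: read the full file with the corrected classes
full <- read.csv(fname, header = TRUE, colClasses = headclasses)
sapply(full, class)
```

The point of the toy “flag” column is that the sample alone would have led R astray, which is the same kind of trap the messy dataset set for me.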

This was certainly not easy, but I must say that I seem to be willing to jump through many hoops for R!!

Big data analysis, for free, in R (or “How I learned to load, manipulate, and save data using the ff package”)

Before choosing to support the purchase of Statistica at my workplace, I came across the ff package as an option for working with really big datasets (with special attention paid to ff dataframes, or ffdf). It looked like a good option to use, allowing dataframes with multiple data types and way more rows than if I were loading such a dataset into RAM as is normal with R. The one big problem I had is that every time I tried to use the ffsave function to save my work from one R session to the next, it told me that it could not find an external zip utility on my Windows machine. I guess because I just had so much else going on, I didn’t have the patience to do the research to find a solution to this problem.

This weekend I finally found some time to revisit this problem, and managed to find a solution! From what I can tell, R appears to expect, in cases like the ffsave function, that you have command-line utilities like a zip utility at the ready and recognizable by R. Although I haven’t tested the ff package on either of my Linux laptops at home, I suspect that R recognizes the utilities that come pre-installed on them. However, in the Windows case, the solution seems to be to install a supplementary group of command-line programs called Rtools.  When you visit the page, be sure to download the version of Rtools that corresponds with your R version.

When you go through the installation process, you will see a screen like below. Be sure that you check the same boxes as in the screenshot below so that R knows where the zip utility lives.

Once you have it installed, that’s when the fun finally begins. As in the smaller data case, I like reading in CSV files, and ff provides read.csv.ffdf for importing external data into R. Let’s say that you have a data file named bigdata.csv. Here is a command for loading it up:

bigdata = read.csv.ffdf(file="c:/fileloc/bigdata.csv", first.rows=5000, colClasses=NA)

The first part of the command, directing R to your file, should look straightforward. The first.rows argument tells it how big the first chunk of data it reads in should be (ff reads your data a part at a time to save RAM.  Correct me if I’m wrong).  Finally, the colClasses=NA argument leaves it to R to work out the data type of each column for itself, rather than having you spell them out.
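To make the chunk-by-chunk idea concrete, here’s a minimal base R sketch that reads a hypothetical small CSV ten rows at a time through an open connection. This is only my illustration of the general idea, not ff’s actual implementation; the file, the chunk size and the variable names are all made up.

```r
# Hypothetical small file standing in for a big CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, x = (1:25) / 2), csv_path, row.names = FALSE)

con <- file(csv_path, open = "r")

# First chunk: consumes the header and fixes the column names and classes
first_chunk <- read.csv(con, nrows = 10)
col_names   <- names(first_chunk)
col_classes <- sapply(first_chunk, class)
rows_seen   <- nrow(first_chunk)

# Later chunks: keep reading from where the connection left off
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 10, header = FALSE,
             col.names = col_names, colClasses = col_classes),
    error = function(e) NULL  # read.csv errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  rows_seen <- rows_seen + nrow(chunk)
}
close(con)
rows_seen
```

Only one chunk is held in memory at a time here; a real workflow would do something with each chunk (summarise it, append it to disk) before moving on.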

Now that you’ve loaded your big dataset, you can manipulate it at will.  If you look at the ff and ffbase documentation, a lot of the standard R functions for working with and summarizing data have been optimized for use with ff dataframes and vectors.  The upshot of this is that working with data stored in ffdf format seems to be a pretty similar experience compared to working with normal data frames.  Importantly, when you want to subset your data frame to create a test sample, the ffbase package replaces the subset command so that the resultant subset is also an ffdf, and doesn’t take up more of your RAM.

I noticed that you can use the glm() and lm() functions on an ffdf, but I think you have to be careful because they are not optimized for use with ffdfs and therefore will take up the usual amount of memory if you save them to your workspace.  So if you build models using these functions, be sure to select a sample from your ffdf that isn’t overly big!

Next comes the step of saving your work.  The syntax is simple enough:

ffsave(bigdata, file="C:/fileloc/Rwork/bigdata")

This saves a .ffData file and a .RData file to the directory of your choice with “bigdata” as the filenames.

Then, when you want to load up your data in a new R session during some later time, you use the simple ffload command:

ffload(file="C:/fileloc/Rwork/bigdata")

It gives you some warning messages, but as far as I can tell they do not get in the way of accessing your data.  That covers the basics of working with big data using the ff package.    Have fun analyzing your data using less RAM! 🙂