Split, Apply, and Combine for ffdf

Call me incompetent, but I just can’t get ffdfdply to work with my ffdf dataframes.  I’ve tried repeatedly and it just doesn’t seem to work!  I’ve seen numerous examples on Stack Overflow, but maybe I’m applying them incorrectly.  Wanting to do some split-apply-combine on an ffdf, yet again, I finally broke down and made my own function that seems to do the job!  It’s still crude, I think, and it will probably break down when there are NA values in the vector that you want to split on, but here it is:

mtapply = function(dvar, ivar, funlist) {
  # One row per level of the grouping variable, one column per function
  groups = names(table(ivar))
  outtable = matrix(NA, length(groups), length(funlist),
                    dimnames = list(groups, funlist))
  col = 1
  for (f in funlist) {
    # Each function name is parsed from its string and applied via tapply
    outtable[, col] = as.matrix(tapply(dvar, ivar, eval(parse(text = f))))
    col = col + 1
  }
  return(outtable)
}

As you can see, I’ve made it so that the result is a bunch of tapply vectors inserted into a matrix.  “dvar”, unsurprisingly, is your dependent variable; “ivar”, your independent variable.  “funlist” is a vector of function names typed in as strings (e.g. c("median", "mean", "max")).  I’ve wasted so much of my time trying to get ddply or ffdfdply to work on an ffdf that I’m happy to now have anything that does the job for me.
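To illustrate, here’s how mtapply might be called on some made-up donation data (the function is reproduced so the snippet runs on its own):

```r
# mtapply, reproduced from above so this example is self-contained
mtapply = function(dvar, ivar, funlist) {
  groups = names(table(ivar))
  outtable = matrix(NA, length(groups), length(funlist),
                    dimnames = list(groups, funlist))
  col = 1
  for (f in funlist) {
    outtable[, col] = as.matrix(tapply(dvar, ivar, eval(parse(text = f))))
    col = col + 1
  }
  return(outtable)
}

# Made-up data: donation amounts split by a two-level grouping variable
amounts = c(10, 20, 30, 40, 50, 60)
group   = c("A", "A", "A", "B", "B", "B")
mtapply(amounts, group, c("mean", "median", "max"))
#   mean median max
# A   20     20  30
# B   50     50  60
```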

Now that I think about it, this will fall short if you ask it to output more than one quantile for each of your split levels.  If you can improve this function, please be my guest!

A function to find the “Penultimax”

Penulti-what?  Let me explain: Today I had to iteratively go through each row of a donor history dataset and compare a donor’s maximum yearly donation total to the second highest yearly donation total.  In even more concrete terms, for each row I had to compare the maximum value across 5 columns against the next highest number.  This seemed to be a rather unique task, and so I had to make an R function to help carry it out.

So, I named the function “penultimax”, to honour the idea that it’s finding the second highest value, or second max.  It works pretty simply: it removes the maximum value from the input vector and returns the maximum of what remains, if anything remains at all.  Following is the code for it (notice that it draws on an earlier function that I made, called safe.max, which returns an NA when it can’t find a maximum, instead of an error):


penultimax = function(invector) {
  # A vector with 0 or 1 elements has no second-highest value
  if (length(invector) <= 1) {
    return(NA)
  }
  first.max = safe.max(invector)
  # Once we get the max, take it out of the vector and make newvector
  # (note: this drops ALL values tied with the max)
  newvector = invector[!invector == first.max]
  # If newvector now has nothing in it, return NA
  if (length(newvector) == 0) {
    return(NA)
  }
  # The max of what remains is the second highest number (the penultimax);
  # safe.max returns NA if it can't find one
  second.max = safe.max(newvector)
  if (is.na(second.max)) {
    return(NA)
  } else {
    return(second.max)
  }
}

safe.max = function(invector) {
  # Return NA instead of a warning/-Inf when every value is NA
  na.pct = sum(is.na(invector)) / length(invector)
  if (na.pct == 1) {
    return(NA)
  } else {
    return(max(invector, na.rm = TRUE))
  }
}
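A quick illustration of the behaviour, using compacted versions of the two functions so the snippet runs standalone (the input values are made up):

```r
# Compacted safe.max and penultimax, equivalent to the definitions above
safe.max = function(invector) {
  if (all(is.na(invector))) NA else max(invector, na.rm = TRUE)
}
penultimax = function(invector) {
  if (length(invector) <= 1) return(NA)
  newvector = invector[!invector == safe.max(invector)]
  if (length(newvector) == 0) return(NA)
  safe.max(newvector)
}

penultimax(c(100, 250, 175, 80, 250))  # ties with the max are dropped: 175
penultimax(c(42))                      # only one value: NA
penultimax(c(NA, NA))                  # nothing calculable: NA
```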


Did I miss something out there that’s simpler than what I wrote?

Sample Rows from a Data Frame that Exclude the ID Values from Another Sample in R

In order to do some modeling, I needed to make a training sample and a test sample from a larger data frame.  Making the training sample was easy enough (see my earlier post), but I was going crazy trying to figure out how to make a second sample that excluded the rows I had already sampled in the first sample.

After trying out some options myself, looking extensively on the net, and asking for help on the r-help forum, I came up with the following function that finally does what I need it to do:

 

To summarize the function, you enter the big data frame first (here termed “main.df”), then your first sample data frame containing the ID values that you want to exclude (here termed “sample1.df”), then your sample size, then the ID variable names in both data frames enclosed in quotes.
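A minimal sketch of such a function, assuming the argument order just described (the name sample.excluding is my own, purely illustrative):

```r
# Sketch: draw a sample of n rows from main.df, excluding any rows whose
# IDs already appear in sample1.df (name and arguments are illustrative)
sample.excluding = function(main.df, sample1.df, n, main.id, sample1.id) {
  # Keep only the rows whose ID is NOT among the first sample's IDs
  remaining = main.df[!main.df[[main.id]] %in% sample1.df[[sample1.id]], ]
  # Draw the second sample from what's left
  remaining[sample(nrow(remaining), n), ]
}

main.df = data.frame(id = 1:10, donation = (1:10) * 5)
sample1.df = main.df[sample(nrow(main.df), 4), ]       # training sample
test.df = sample.excluding(main.df, sample1.df, 3, "id", "id")
any(test.df$id %in% sample1.df$id)  # FALSE: no overlap with the first sample
```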

Functions like this certainly make my working life with R easier in preventing me from having to type in syntax like that every time I want that kind of a task done.

Sampling Rows from a Data Frame in R

It’s time, yet again, for a simple and useful function I created that helped me at work.  I was looking for a way to sample whole rows of a very large data frame with many columns so that I could build a regression model on a subset of my data.  

While I was acquainted with the sample() function, I realized today that it couldn’t be used directly to sample whole rows of a data frame; it’s really just meant for vectors.  So, I looked up how to get a sample of rows from a data frame and found an answer on the R help forums.  The meat of the function below comes from that answer:

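For reference, here is a sketch of what such a function boils down to: sample row indices, then subset (the name row.sample is illustrative):

```r
# Sample n whole rows from a data frame by sampling row indices
row.sample = function(df, n) {
  df[sample(nrow(df), n), ]
}

df = data.frame(id = 1:100, value = rnorm(100))
training = row.sample(df, 20)
nrow(training)  # 20
```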
So, just put the data frame as the first argument of the function, and the number representing the size of your sample, and there you have it!  You just cut a random sample out of your data frame.

Length by a Grouping Variable with NA Values Omitted in R

Today, after my supervisor pointed out a discrepancy between some graphs of percentages from a data frame I was working with and the table of raw numbers from which those percentages were taken, I realized that I was including some NAs in my length calculations.

The data were simple binary columns, with 0 being the absence of an attribute, 1 being the presence, and NA being an incalculable value.  The dependent variable here was whether or not people donated at a certain level, and the independent variable was a simple binary grouping variable.  First I got the sum of the dependent variable by the independent variable (i.e. how many people had donated at that level depending on the independent variable):

tapply(y, x, sum, na.rm=TRUE)

That worked simply enough.  Then I wanted to extract the total number of people who donated, regardless of whether they had reached the specified level:

tapply(y, x, length)

That gave me numbers, but they included the NAs.  I know that to simply get the number of non-NA values from a vector in R, all you have to type is sum(!is.na(x)) and there you go, but I needed this by a grouping vector.  So I realized what I needed to do this evening and made a laughably small function:
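A sketch along those lines, wrapping that one-liner in tapply (the name lengthnona is illustrative, not necessarily what the original function was called):

```r
# Count the non-NA values of y within each level of the grouping variable x
lengthnona = function(y, x) {
  tapply(y, x, function(v) sum(!is.na(v)))
}

# Made-up binary data with NAs, split by a two-level grouping variable
y = c(1, 0, NA, 1, NA, 0)
x = c("A", "A", "A", "B", "B", "B")
lengthnona(y, x)
# A B
# 2 2
```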

Even though the meat of this function is very small, it’s still nice to simplify 🙂  Live and learn I guess!