sapply is my new friend!

I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be!

There are a couple of cases so far where I’ve found that sapply really comes in handy for me:

1) If I want to quickly see some descriptive stats for multiple columns in my dataframe. For example,

sapply(mydf[,10:20], median, na.rm=TRUE)

would show me the medians of columns 10 through 20, displaying the column names above each median value.
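Here's a quick illustration with made-up data (the dataframe and column names are stand-ins for whatever you're working with):

```r
# Toy dataframe standing in for a real one; the column names are hypothetical
mydf = data.frame(gift2011 = c(10, NA, 30, 20),
                  gift2012 = c(5, 15, NA, 25))

# Median of each column, ignoring NAs; sapply returns a named vector
sapply(mydf, median, na.rm = TRUE)
# gift2011 gift2012
#       20       15
```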

2) If I want to apply the same function to multiple vectors in my dataframe, modifying them in place. I oftentimes have count variables that have NA values in place of zeros. I made a “zerofy” function to add zeros into a vector that lacks them. So, if I want to use my function to modify these count columns, I can do the following:

mydf[,30:40] = sapply(mydf[,30:40], zerofy)

Which then replaces the original data in columns 30 through 40 with the modified data! Handy!
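For the curious, a zerofy function like the one described can be as simple as this (my minimal sketch, not necessarily the original):

```r
# Replace NA values with zeros in a count vector
zerofy = function(invector) {
  invector[is.na(invector)] = 0
  invector
}

zerofy(c(3, NA, 1, NA))
# 3 0 1 0
```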

Access individual elements of a row while using the apply function on your dataframe (or “applying down while thinking across”)

The apply function in R is a huge work-horse for me across many projects.  My usage of it is pretty stereotypical.  Usually, I use it to make aggregations of a targeted group of columns for every row in a dataframe.  Those aggregations could be counts, sum totals, or possibly a binary column that flags some condition based on a count or sum total from the entire group of columns.

In my latest project, I found a different usage for the apply function.  I had a dataframe detailing the donations made each year by individual people over the last 7 years.  I also had a column specifying the year when they converted from being a ‘regular’ donor to donating through a specific program.  The idea was to create a column containing the last yearly amount donated before the year they converted.  In other words, for each row, I had to extract the year they converted, find when they had last donated before that year, and pull out that amount.

Here’s the R code I used with some sample data to help recreate what I did:

test = matrix(c(31, NA, 33, 37, NA, 2011,
                NA, 33, NA, 31, 37, 2012,
                37, 33, 31, NA, NA, 2009),
              nrow=3, ncol=6, byrow=TRUE,
              dimnames=list(c(1,2,3), c(seq(2008,2012), 'year')))

value.b4.year = apply(test[,1:6], 1, function(x)
  ifelse(any(!is.na(x[1:5])),
         x[max(which(!is.na(x[1:which(seq(2008,2012) == x[6]) -1])))],
         NA))

First, I set up the sample data.  It’s obviously pretty simple; the dataframe I used in reality also had some rows with NA values throughout all of the first 5 columns.

It’s in the apply function where the real magic begins.  First, we load up all relevant columns for each row (test[,1:6]).  You can specify whichever columns you need from your dataframe, so long as you use c() in the indexing brackets when referencing non-contiguous columns.  Once you load them up, no matter what their original column indices were, they become columns 1 through the total number of columns loaded as far as the rest of the apply function is concerned.  (Say you load up 15 column indices scattered across the dataframe: regardless of their original positions, those columns become 1:15 when you refer to them in the subsequent function call.)
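To make that repositioning concrete with a tiny made-up matrix:

```r
m = matrix(1:20, nrow = 2, byrow = TRUE)  # a 2 x 10 matrix, columns 1 through 10
picked = m[, c(3, 7, 10)]                 # three scattered column positions
# Inside apply, those three columns become x[1], x[2], x[3]
apply(picked, 1, function(x) x[2])        # always pulls original column 7
# 7 17
```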

After the function (x) part of the function, I put a condition stating that the value assignment should only happen if there is any data in the first 5 elements of the row; otherwise, an NA gets assigned.  Notice that you don’t have to operate on, or select, every single column that you’ve loaded into the apply statement; very important!

The next line is dizzyingly complex, so let me break it down for you:

x[1:which(seq(2008,2012) == x[6]) -1]

In this part of the line, I’m calling up all elements of the row, from the first element, until the element just before the one corresponding to the year listed in the sixth element.  We start with seq(2008,2012) == x[6], which marks the specific year that corresponds with the value of the sixth element.  Feeding that statement into the which function gives you the position, within the row represented by x, of the year listed in the sixth element.  Subtracting 1 sets the end index to the column just before the one corresponding with the year.  (Strictly speaking, because of R’s operator precedence, the -1 applies to the whole sequence, turning 1:k into 0:(k-1); since indexing with 0 selects nothing, the net effect is the same as 1:(k-1).)

To show you what the above gets you on line 1 of the sample data:

test[1,1:which(seq(2008,2012) == test[1,6]) -1]

2008 2009 2010 
  31   NA   33

Next, by nesting the above statement in the which and !is.na function calls, we eliminate those elements in the row that have NA values in them and find the positions where there is some value.

which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1]))

2008 2010 
   1    3

Now that we have the positions of all values in the row that occurred before the year listed, we just need to select the last (highest) position, by nesting the above function call inside a call to the max function:

max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])))

3

Right, so the last value before the year in column 6 for row 1 is in position 3!  That finally leads us to the syntax I used in the apply statement (somewhat different here, because we’re looking at one row of the sample data):

test[1,max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])))]

33

So, for row 1, the last donation amount before conversion was 33. Let’s look at the results of the apply statement from above:

value.b4.year

 1  2  3 
33 31 37

Yup, for each of these rows, the values saved in value.b4.year are the last values donated before the year specified in the sixth column.

So, the lesson here is that when you are using apply to perform an operation on each row of a dataframe, the value of any one column from that row is easily accessible by referring to the position of that value relative to the entire group of columns that you have passed to the apply function.  Hopefully this will help you in the future as I know it will help me 🙂


Processing Data from a Statistica Worksheet Using R

Context: I work with data from non-profit organizations, so a big concern in many of my analyses is whether, and how much, people are donating from one year to the next.  One of the things I normally like to do in my analyses is get a value for each person that represents how much their yearly donations are increasing or decreasing on average over 5 years (a simple slope from the regression of their giving values on the years that they gave).  It was pretty simple and quick to do this in R for previous projects, so there was no hassle there.  Now that we have Statistica in the office, my supervisor wants me to use it for our current project.
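For context, the R side of that slope calculation can be sketched roughly like this (the data, sizes, and column layout here are made up for illustration):

```r
# Hypothetical giving history: one row per donor, one column per year
giving = matrix(c(10, 12, NA, 15, 18,
                   5, NA,  5,  4,  3), nrow = 2, byrow = TRUE)
years = 2008:2012

# Per-row regression slope of giving on year, skipping missing years
slopes = apply(giving, 1, function(x) {
  ok = !is.na(x)
  if (sum(ok) < 2) return(NA)  # need at least two points for a slope
  unname(coef(lm(x[ok] ~ years[ok]))[2])
})
slopes
# about 1.9 for the first donor, about -0.49 for the second
```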

Problem: I was looking for a way in Statistica of doing the above slope calculation for each row in a dataset of roughly 82,000 rows, and could not find it.

Solution:  As I mentioned in my last post, it’s possible to feed your Statistica dataset into R using the Statconn Dcom server, so that you can process it/analyze it in R and then output your results back into Statistica.  So, I fed my dataset of 82,000 rows and 264 columns into R, and used some code I had used previously to calculate 5 year giving slopes for each row in the data set, and to output a new worksheet with the newly calculated slopes column.  Although the code is pretty simple, the entire process seemed to take about 5 minutes, which was unbearably slow!!  It’s a pretty important part of my analysis, so going without it isn’t an option.

I sent an email to one of the Statistica support guys, so hopefully they have a way of doing this kind of data processing natively, instead of having to wait all that time for the data to be processed through R.

Guess who wins: apply() versus for loops in R

Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data!! Every time I tried to click on a different ribbon, the screen didn’t even register that I had clicked on that ribbon. So, I took the hint, and decided to do my data processing in R.

One of the tasks that I needed to do was calculate a maximum value, in each row of the dataset, from multiple monetary values in 5 different fields. The first thing I noticed was that the regular max() function in R doesn’t quite like it when you ask for a maximum over nothing but NA values: with na.rm=TRUE it returns -Inf, because removing the NAs leaves it taking the max of an empty set. So, I decided to create a “safe” max function:
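You can see that behavior directly:

```r
# With na.rm=TRUE and nothing but NAs, max() is left with an empty set,
# which it defines as -Inf (and it emits a warning to that effect)
max(c(NA, NA), na.rm = TRUE)
# -Inf
```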


safe.max = function(invector) {
  na.pct = sum(is.na(invector)) / length(invector)
  if (na.pct == 1) {
    return(NA)
  } else {
    return(max(invector, na.rm = TRUE))
  }
}


Finding that it was working, I then constructed a simple for loop to iterate through my ~395,000 rows. As you could imagine, this was taking forever! After much looking around, I realized that the best solution was actually a base function, apply()!!

I constructed my “max” variable with one simple line of code:

big.dataset$max_money = apply(as.matrix(big.dataset[,214:218]), 1, safe.max)

Compared to the for loop, which was taking forever, this method was a breeze! It took less than a minute to get through the whole data set. Moral of the story? When you’re dealing with lots of data, code as efficiently as possible!
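If you're curious to check the claim on your own machine, here's a rough benchmark sketch on simulated data (the sizes and the compact safe.max variant are mine; your timings will vary):

```r
set.seed(1)
m = matrix(runif(50000 * 5), ncol = 5)  # hypothetical sizes
m[sample(length(m), 20000)] = NA        # sprinkle in some NAs

# Compact variant of the safe.max idea from above
safe.max = function(invector) {
  if (all(is.na(invector))) NA else max(invector, na.rm = TRUE)
}

# For loop version (preallocated, so it's a fair fight)
t_loop = system.time({
  res_loop = numeric(nrow(m))
  for (i in 1:nrow(m)) res_loop[i] = safe.max(m[i, ])
})

# apply() version
t_apply = system.time({
  res_apply = apply(m, 1, safe.max)
})

stopifnot(identical(res_loop, res_apply))  # same answers either way
print(rbind(loop = t_loop, apply = t_apply))
```

One detail worth noting: preallocating res_loop matters, because growing a vector inside a loop is what usually makes for loops feel painfully slow.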