sapply is my new friend!

I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be!

There are a couple of cases so far where I’ve found that sapply really comes in handy for me:

1) If I want to quickly see some descriptive stats for multiple columns in my dataframe. For example,

sapply(mydf[,10:20], median, na.rm=TRUE)

would show me the medians of columns 10 through 20, displaying the column names above each median value.

2) If I want to apply the same function to multiple vectors in my dataframe, modifying them in place. I oftentimes have count variables that have NA values in place of zeros. I made a “zerofy” function to add zeros into a vector that lacks them. So, if I want to use my function to modify these count columns, I can do the following:

mydf[,30:40] = sapply(mydf[,30:40], zerofy)

Which then replaces the original data in columns 30 through 40 with the modified data! Handy!
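The zerofy function itself isn't shown in the post, so here's a minimal sketch of what it might look like (the name is from the post, but the implementation is an assumption):

```r
# A plausible sketch of the "zerofy" helper mentioned above: replace NA
# values in a count vector with zeros. (The original function isn't shown
# in the post, so this implementation is an assumption.)
zerofy = function(x) {
  x[is.na(x)] = 0
  x
}
```

With that definition, the sapply call above replaces the NAs in columns 30 through 40 with zeros.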

Package sqldf eases the multivariable sorting pain

This will be a quick one.  I was trying to sort my dataframe so that it went in ascending order on one variable and descending order on another variable.  This was really REALLY bothersome to try to figure out with base R functions.  Then I remembered sqldf!

# Assuming dataframe named 'mydf' and 'V1' and 'V2' are your variables you want to sort on in opposite directions
library(sqldf)
mydf.sorted = sqldf("SELECT * FROM mydf ORDER BY V1 ASC, V2 DESC")

That’s all! No pain, and lots of gain. Show me how to do this so simply with base R and I’ll be impressed.

EDIT:

After many comments on this post, it’s become painfully obvious to me that I wasn’t using the right search terms (or looking at that negative sign on the second argument to the order function) when searching for a base R solution to this problem. Well, we all make mistakes, right? I hope this post has still been helpful to others, and it isn’t a completely redundant internet resource on this topic!
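For the record, here's the base R approach those comments pointed to: when the variable to be sorted descending is numeric, a negative sign on the second argument to order does the job (toy data here, since the original mydf isn't shown):

```r
# Toy data standing in for the original dataframe.
mydf = data.frame(V1 = c(2, 1, 1, 2), V2 = c(10, 30, 20, 40))

# Ascending on V1, descending on V2: negate the numeric V2 inside order.
# (For character columns, you'd negate xtfrm(V2) instead.)
mydf.sorted = mydf[order(mydf$V1, -mydf$V2), ]
```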

Estimating Ages from First Names Part 2 – Using Some Morbid Test Data

In my last post, I wrote about how I compiled a US Social Security Administration data set into something usable in R, and mentioned some issues scaling it up to be usable for bigger datasets.  I also mentioned the need for data to test out the accuracy of my estimates.  First, I’ll show you how I prepped the dataset so that it became more scalable (for the code that got us here, see my last post):

name_data_wavgpop_unisex = ddply(name_data, .(Name), function (x) sum(x$Rel_Pop*as.numeric(x$Year))/sum(x$Rel_Pop))
name_data_wavgpop_unisex$V1 = round(name_data_wavgpop_unisex$V1,0)

Above, I’ve taken a different tack to predicting expected year of birth from a name than the one I started with in my last post. Here I’m using the relative popularity of the name in each year as a weight for that year value. Multiplying the weights by the years, I get a weighted average of Year that serves as the predicted year of birth. Then I round the predictions off to the nearest integer and continue on my way. Also, because test data doesn’t seem to come packaged with gender info, I’ve constructed the weighted averages using all relative popularity values for each name, regardless of whether that name has been used for both sexes (e.g. “Jordan”).

Now enter the test data. I’ve discovered that the easiest way of getting real names and ages off the internet is by looking for lists of victims of some horrible tragedy. The biggest such list of victims I could find was a list of 9/11 victims.  It’s not exactly formatted for easy analysis, and I was too lazy to get the data programmatically, so I just copy-pasted the names and ages from the first 4 lists on the page (all from either American Airlines or United Airlines) into LibreOffice Calc, for a total of 285 observations.  I then extracted the first names and imported the first names and ages into R.

worldtrade = read.csv("world trade.csv")
worldtrade.ages = sqldf("SELECT a.*, b.V1 as Year FROM [worldtrade] AS a LEFT JOIN name_data_wavgpop_unisex AS b on a.Name == b.Name")
worldtrade.ages$Pred.Age = 2001 - as.numeric(worldtrade.ages$Year)

As you can see, I opted to use sqldf to append the appropriate predicted birth years for each name on the list I imported. I then got the predicted ages by subtracting each predicted birth year from 2001. Finally, let’s have a look at the resulting model fit (showing how close each predicted age was to the real age of the victim at the time of death):

[Scatter plot: Predicted vs. Real Ages for 9/11 Victims]
As you can see, it’s not a tight prediction in the least.  According to the model fit statistics, the adjusted R-squared is 14.6% and the residual standard error is 15.58 years.  You can also see from the scatter plot that the prediction doesn’t become reasonably linear until about age 30 and onwards.  Overall, I’d say it’s not too impressive, and I’d imagine it’s even worse for predicting who’s under 10 years old!

Well, this was fun (if only a little disappointing).  That’s statistics for you – sometimes it confirms, and sometimes it humbles you.  If you think you can show me a better way of using this name trending data to better predict ages than what I’ve done here, feel free to show me!

Estimate Age from First Name

Today I read a cute post from Flowing Data on the most trendy names in US history. What caught my attention was a link posted in the article to the source data, which happens to be yearly lists of baby names registered with the US Social Security Administration since 1880
(see here). I thought that it might be good to compile and use these lists at work for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a bunch of files in a directory instead of manually loading one or two), and
(2) It could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe that contains in each row a name, how many people were registered as having been born with that name in a given year, the year, the total population for that year, and the relative proportion of people with that name in that year.

Once I got that dataframe, I built a function to query that dataframe for the year when a given name was most popular, an estimated age using that year, and the relative proportion of people born with that name from that year.

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this upwards so that age estimates can be calculated for each row in a vector of first names. As the code stands below, the function I made takes too long to be scaled up effectively.

I’m wondering what’s the best approach to take? Some ideas I have so far follow:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and then build a function to query the data table instead.
  • Failing the efficacy of the above two ideas, load the popularity data into Python, and make a function to query it there instead.

    Does anyone have any better ideas for me?
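As a rough sketch of the first idea above: precompute a one-row-per-name lookup table (here with base R split/lapply, though ddply as used below would work just as well), then answer queries with a vectorized match instead of re-scanning the full dataframe for every name. The toy data is made up; the column names follow the post:

```r
# Toy stand-in for the full name_data dataframe built in the code below.
name_data_toy = data.frame(
  Name    = c("Anna", "Anna", "Ben", "Ben"),
  Year    = c(1950, 1980, 1950, 1980),
  Rel_Pop = c(0.02, 0.01, 0.005, 0.03),
  stringsAsFactors = FALSE
)

# One row per name: the year of peak relative popularity.
peak_idx = unlist(lapply(split(seq_len(nrow(name_data_toy)), name_data_toy$Name),
                         function(i) i[which.max(name_data_toy$Rel_Pop[i])]))
peak_years = data.frame(Name      = name_data_toy$Name[peak_idx],
                        Peak_Year = name_data_toy$Year[peak_idx],
                        stringsAsFactors = FALSE)

# Vectorized age estimate for a whole vector of first names at once.
estimate_ages = function(input_names, current_year = 2013) {
  current_year - peak_years$Peak_Year[match(input_names, peak_years$Name)]
}
```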


    library(stringr)
    library(plyr)
    # We're assuming you've downloaded the SSA files into your R project directory.
    file_listing = list.files()[3:135]
    for (f in file_listing) {
      year = str_extract(f, "[0-9]{4}")
      if (year == "1880") { # Initializing the very long dataframe
        name_data = read.csv(f, header=FALSE)
        names(name_data) = c("Name", "Sex", "Pop")
        name_data$Year = rep(year, dim(name_data)[1])
      } else { # Adding onto the very long dataframe
        name_data_new = read.csv(f, header=FALSE)
        names(name_data_new) = c("Name", "Sex", "Pop")
        name_data_new$Year = rep(year, dim(name_data_new)[1])
        name_data = rbind(name_data, name_data_new)
      }
    }
    year_pop_totals = ddply(name_data, .(Year), function (x) sum(x$Pop))
    name_data = merge(name_data, year_pop_totals, by.x="Year", by.y="Year", all.x=TRUE)
    name_data$Rel_Pop = name_data$Pop/name_data$V1
    estimate_age = function (input_name, sex = NA) {
      if (is.na(sex)) {
        # 1921 is a year I chose arbitrarily. Change how you like.
        name_subset = subset(name_data, Name == input_name & Year >= 1921)
      } else {
        name_subset = subset(name_data, Name == input_name & Year >= 1921 & Sex == sex)
      }
      year_and_rel_pop = name_subset[which(name_subset$Rel_Pop == max(name_subset$Rel_Pop)), c(1,6)]
      current_year = as.numeric(substr(Sys.time(), 1, 4))
      estimated_age = current_year - as.numeric(year_and_rel_pop[1])
      return(list(year_of_birth=as.numeric(year_and_rel_pop[1]),
                  age=estimated_age,
                  relative_pop=sprintf("%1.2f%%", year_and_rel_pop[2]*100)))
    }


    I’ll also accept any suggestions for cleaning up my code as is 🙂
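One cleanup suggestion of my own, hedged as just a sketch: the grow-in-a-loop pattern above can be replaced with lapply plus a single do.call(rbind, ...), which reads each file once and binds everything at the end (base R's regmatches stands in for stringr's str_extract so the snippet is self-contained):

```r
# Read one SSA file, label its columns, and tag each row with the year
# taken from the file name (e.g. "yob1950.txt" -> "1950").
read_name_file = function(f) {
  d = read.csv(f, header = FALSE)
  names(d) = c("Name", "Sex", "Pop")
  d$Year = regmatches(f, regexpr("[0-9]{4}", f))
  d
}

# The whole loop then collapses to one line:
# name_data = do.call(rbind, lapply(file_listing, read_name_file))
```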

    Access individual elements of a row while using the apply function on your dataframe (or “applying down while thinking across”)

    The apply function in R is a huge work-horse for me across many projects.  My usage of it is pretty stereotypical.  Usually, I use it to make aggregations of a targeted group of columns for every row in a dataframe.  Those aggregations could be counts, sum totals, or possibly a binary column that flags some condition based on a count or sum total from the entire group of columns.

    In my latest project, I found a different usage for the apply function.  I had a dataframe detailing the donations made each year, by individual people, for the last 7 years.  I also had a column specifying the year when they converted from being a ‘regular’ donor to donating through a specific program.  The idea was to create a column specifying the last yearly amount donated before the year they had converted.  In other words, for each row, information had to be extracted on the year they converted, when they had donated before that year, and how much was that amount.

    Here’s the R code I used with some sample data to help recreate what I did:

    test = matrix(c(31, NA, 33, 37, NA, 2011, 
                    NA, 33, NA, 31, 37, 2012,
                    37, 33, 31, NA, NA, 2009), nrow=3, ncol=6, byrow=TRUE, dimnames=list(c(1,2,3),c(seq(2008,2012),'year')))
    
    value.b4.year = apply(test[,1:6], 1, function (x) 
      ifelse(any(!is.na(x[1:5])), 
             x[max(which(!is.na(x[1:(which(seq(2008,2012) == x[6]) - 1)])))], 
             NA))

    First, I set up the sample data.  It’s obviously pretty simple, and the dataframe I used in reality had some rows where there were all NA values throughout the first 5 columns.

    It’s in the apply function where the real magic begins.  First, we load all the relevant columns into the apply function for each row (test[,1:6]).  You can specify whichever columns you need from your dataframe, so long as you use c() in the indexing brackets when referencing your dataframe.  Once loaded, no matter what their original column indices, those columns become columns 1 through however many you loaded, as far as the rest of the apply call is concerned.  (Say you load 15 different column indices, spread out across the dataframe; regardless of the position each index represents, those columns become 1:15 when referring to them in the subsequent function call.)

    After the function (x) part of the function, I put a condition stating that the following value assignment should only happen if there is any data in the first 5 elements of the row, otherwise an NA gets assigned.  Notice that you don’t have to operate on, or select, every single column that you’ve loaded into the apply statement; very important!

    The next line is dizzyingly complex, so let me break it down for you:

    x[1:(which(seq(2008,2012) == x[6]) - 1)]

    In this part of the line, I’m calling up all elements of the row from the first element until the element before the one corresponding to the year listed in the sixth element.  We start with seq(2008,2012) == x[6], which marks the specific year that corresponds with the value of the sixth element.  Feeding that statement into the which function gives the position, within the row x, of the year listed in the sixth element.  Subtracting 1 sets the end index to the column just before the one corresponding with that year.

    To show you what the above gets you on line 1 of the sample data:

    test[1,1:(which(seq(2008,2012) == test[1,6]) - 1)]
    
    2008 2009 2010 
      31   NA   33

    Next, by nesting the above statement in the which and !is.na function calls, we eliminate those elements in the row that have NA values in them and find the positions where there is some value.

    which(!is.na(test[1,1:(which(seq(2008,2012) == test[1,6]) - 1)]))
    
    2008 2010 
       1    3

    Now that we have the positions of all values in the row that occurred before the year listed, we just need the highest such position, which we get by nesting the above function call inside a call to the max function:

    max(which(!is.na(test[1,1:(which(seq(2008,2012) == test[1,6]) - 1)])))
    
    3

    Right, so the last value before the year in column 6 for row 1 is in position 3!  That finally leads us to the syntax I used in the apply statement (somewhat different here, because we’re looking at one row of the sample data):

    test[1,max(which(!is.na(test[1,1:(which(seq(2008,2012) == test[1,6]) - 1)])))]
    
    33

    So, for row 1, the last donation amount before conversion was 33. Let’s look at the results of the apply statement from above:

    value.b4.year
    
     1  2  3 
    33 31 37

    Yup, for each of these rows, the values saved in value.b4.year are the last values donated before the year specified in the sixth column.

    So, the lesson here is that when you are using apply to perform an operation on each row of a dataframe, the value of any one column from that row is easily accessible by referring to the position of that value relative to the entire group of columns that you have passed to the apply function.  Hopefully this will help you in the future as I know it will help me 🙂
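As a footnote, the same row-wise logic can also be packaged as a named helper function, which is a bit easier to read and to test than a nested ifelse (this is an alternative sketch, not the code from the post):

```r
# Same idea as the apply call above, written as a named helper: find the
# last non-NA donation strictly before the conversion year in the final column.
last_before_year = function(x, years = 2008:2012) {
  vals   = x[seq_along(years)]            # the yearly donation columns
  cutoff = which(years == x[length(x)])   # position of the conversion year
  before = vals[seq_len(cutoff - 1)]      # donations before that year
  if (all(is.na(before))) return(NA)
  before[max(which(!is.na(before)))]
}

test = matrix(c(31, NA, 33, 37, NA, 2011,
                NA, 33, NA, 31, 37, 2012,
                37, 33, 31, NA, NA, 2009),
              nrow = 3, byrow = TRUE)
value.b4.year = apply(test, 1, last_before_year)
```

A side benefit is that the all-NA edge case (and the conversion-year-is-2008 case) now falls out as an explicit NA rather than an error.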

     

    When the “reorder” function just isn’t good enough…

    The reorder function, in R 3.0.0, is behaving strangely (or I’m really not understanding something).  Take the following simple data frame:

    df = data.frame(a1 = c(4,1,1,3,2,4,2), a2 = c("h","j","j","e","c","h","c"))

    I expect that if I call the reorder function on the a2 vector, using the a1 vector as the vector to order the second one by, then any summary stats that I run on the a2 vector will be ordered according to the numbers in a1.  However, look what happens:

    table(reorder(df$a2, df$a1))
    c e h j 
    2 1 2 2

    I found out that in order to get it in the order specified by the numbers in the first vector, the following code seems to work:

    df$a2 = factor(df$a2, levels=unique(df$a2)[order(unique(df$a1))], ordered=TRUE)

    Now look at the result:

    table(df$a2)
    
    j c e h 
    2 2 1 2

    One thing I notice here is that R seems to be keeping the factor levels alphabetically organized. When I specify the levels by using the “unique” function, it allows itself to break the alphabetic organization.

    Why won’t the reorder function work in this case?

    Using ddply to select the first record of every group

    I had a very long file of monetary transactions (about 207,000 rows) with about two handfuls of columns describing each transaction (including date).  The task I needed to perform on this file was to select the value from one of the categorical descriptor columns (called “appeal”) associated with the first transaction found for every ID in the file.  Luckily for me, the file was organized so that the older transactions came before the newer transactions.  The coding solution I used was pretty simple, but took maybe 5 – 10 minutes to complete.

    Assuming the data frame is called ‘trans.file’, the numerical ID is called, ‘Num.ID’, and the categorical descriptor column of interest is called ‘appeal’, here’s the simple code:

    first.appeal.by.id = ddply(trans.file, “Num.ID”, function (x) as.character(x$appeal[1]))

    I do like simple, but I also like fast.  Is there a less computationally expensive way of doing this?  Please tell me if so 🙂
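Answering my own question with a hedged sketch: since the file is already sorted oldest-first within each ID, base R's duplicated can grab the first row per ID in a single vectorized pass, with no per-group function calls (toy data here, as the real file obviously isn't shown):

```r
# Toy stand-in for the transaction file: older transactions come first.
trans.file = data.frame(Num.ID = c(1, 1, 2, 3, 3, 3),
                        appeal = c("a", "b", "c", "d", "e", "f"),
                        stringsAsFactors = FALSE)

# !duplicated() is TRUE only for the first occurrence of each ID,
# so this keeps exactly the first transaction per ID.
first.appeal.by.id = trans.file[!duplicated(trans.file$Num.ID),
                                c("Num.ID", "appeal")]
```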

    Split, Apply, and Combine for ffdf

    Call me incompetent, but I just can’t get ffdfdply to work with my ffdf dataframes.  I’ve tried repeatedly and it just doesn’t seem to work!  I’ve seen numerous examples on stackoverflow, but maybe I’m applying them incorrectly.  Wanting to do some split-apply-combine on an ffdf, yet again, I finally broke down and made my own function that seems to do the job! It’s still crude, I think, and it will probably break down when there are NA values in the vector that you want to split, but here it is:

    mtapply = function (dvar, ivar, funlist) {
      lenlist = length(funlist)
      outtable = matrix(NA, dim(table(ivar)), lenlist,
                        dimnames=list(names(table(ivar)), funlist))
      # "col" rather than "c" as the counter, to avoid masking base R's c()
      col = 1
      for (f in funlist) {
        # match.fun looks each function up by name, a safer idiom than
        # eval(parse(text=f))
        outtable[,col] = as.matrix(tapply(dvar, ivar, match.fun(f)))
        col = col + 1
      }
      return(outtable)
    }

    As you can see, I’ve made it so that the result is a bunch of tapply vectors inserted into a matrix.  “dvar”, unsurprisingly, is your dependent variable, and “ivar” your independent variable.  “funlist” is a vector of function names typed in as strings (e.g. c("median","mean","sd"); note that base R’s mode() returns a value’s storage mode, not the statistical mode, so it isn’t a useful entry here).  I’ve wasted so much of my time trying to get ddply or ffdfdply to work on an ffdf that I’m happy I now have anything that does the job for me.

    Now that I think about it, this will fall short if you ask it to output more than one quantile for each of your split levels.  If you can improve this function, please be my guest!
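On the multiple-quantiles point, here's one hedged sketch of a fix: let tapply return one vector per level and stack them with do.call(rbind, ...), so each row of the result is a level and each column a quantile (untested on an actual ffdf):

```r
# Several summary values per split level: tapply returns a list when FUN
# yields more than one number; rbind stacks those vectors into a matrix
# with one row per level and the quantile labels as column names.
quantiles_by_group = function(dvar, ivar, probs = c(0.25, 0.5, 0.75)) {
  do.call(rbind, tapply(dvar, ivar, quantile, probs = probs, na.rm = TRUE))
}
```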

    My Goodness. What a Fat Dataset!

    Recently at work we got sent a data file containing information on donations to a specific charitable organization, ranging all the way back to the 80’s.  Usually, when we receive a dataset with a donation history in it, each row represents a specific gift from a specific person at a specific time.  Also, each column represents some kind of information about that gift.  The result is usually a dataset which is fairly long (thousands or hundreds of thousands, in my recent experience) with maybe about 15 columns or more.

    In this case, each row represented one person, but there were 1,551 columns!!  As it turned out, after the first column, which was the ID of the person donating the money, there were supposed to be just 31 extra columns to describe the gift in each row.  However, the person who put the data together decided that we should get 31*50 columns so that each row represented a person, and not a gift, and every subsequent gift from that person was represented by an extra 31 columns to the right of the previous 31.  Ridiculous!!

    Anyway, I knew that I could reshape this using R by stacking all 50 copies of the same variable together, and making sure that the 31 resulting vectors took the names of the first 31 columns.  Following is a gist that shows what eventually worked for me:


    library(reshape2) # for melt
    # Here's where I extract the database IDs and repeat them 50 times to make the
    # column long enough for my new long-form dataset (596,100 rows)
    client.data.new = rep(client.data[,1], 50)
    for (i in 2:32) {
      # For each column in the first 31 after the ID column, find the 49 matching
      # columns to the right and stack them using melt
      stacked.data = melt(client.data, id.vars="CnBio_ID",
                          measure.vars=seq(i, (i + (31*49)), 31),
                          value.name=names(client.data)[i])
      # Once we have one new stacked column, we add it on to the right of whatever
      # has already been built in the new long-form dataset.
      client.data.new = data.frame(client.data.new, stacked.data[,3])
    }
    # I don't know why the "value.name" argument didn't give me the right variable
    # names for my new long-form dataset, but it was easy enough to use this last
    # line of code to fix that!
    names(client.data.new) = names(client.data)[1:32]

    In conclusion, if you need your dataset to get in shape, you need only remember one letter: R!

    Know Your Dataset: Specifying colClasses to load up an ffdf

    When I finally figured out how to successfully use the ff package to load data into R, I was apparently working with relatively pain-free data to load up through read.csv.ffdf (see my previous post).  Just this past Sunday, I naively followed my own post to load a completely new dataset (over 400,000 rows and about 180 columns) for analysis.  Unfortunately for me, the data file was a bit messier, so read.csv.ffdf wasn’t able to finalize the column classes by itself.  It would chug along until certain columns in my dataset, which it at first took to be one data type, proved to be a different data type, and then it would give me an error message basically telling me it didn’t want to adapt to the changing assumptions about which data type each column represented.

    So, I set out to learn how I could use the colClasses argument in the read.csv.ffdf command to manually set the data types for each column.  I adapted the following solution from a stackoverflow thread about specifying colClasses in the regular read.csv function.

    First, load up a sample of the big dataset using the read.csv command (The following is obviously non-random. If you can figure out how to read the sample in randomly, I think it would work much better):

    headset = read.csv(fname, header = TRUE, nrows = 5000)

    The next command generates a list of all the variable names in your dataset, and the classes R was able to derive based on the number of rows you imported:

    headclasses = sapply(headset, class)

    Now comes the fairly manual part. Look at the list of variables and classes (data types) that you generated, and look for obvious mismatches. Examples could be a numeric variable that got coded as a factor or logical, or a factor that got coded as a numeric. When you find such a mismatch, the following syntax suffices for changing a class one at a time:

    headclasses["variable.name"] = "numeric"

    Obviously, the “variable.name” should be replaced by the actual variable name you’re reclassifying, and the “numeric” string can also be “factor”, “ordered”, “Date”, “POSIXct” (the last two being date/time data types). Finally, let’s say you want to change every variable that got coded as “logical” into “numeric”. Here’s some syntax you can use:

    headclasses[grep("logical", headclasses)] = "numeric"

    Once you are certain that all the classes represented in the list you just generated and modified are accurate to the dataset, you can load up the data with confidence, using the headclasses list:

    bigdataset = read.csv.ffdf(file="C:/big/data/loc.csv", first.rows=5000, colClasses=headclasses)

    This was certainly not easy, but I must say that I seem to be willing to jump through many hoops for R!!