Estimate Age from First Name

Today I read a cute post from Flowing Data on the trendiest names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880
(see here). I thought it might be worthwhile to compile and use these lists at work, for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a bunch of files in a directory instead of manually loading one or two), and
(2) It could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe that contains in each row a name, how many people were registered as having been born with that name in a given year, the year, the total population for that year, and the relative proportion of people with that name in that year.

Once I got that dataframe, I built a function that queries it for the year when a given name was most popular, an estimated age based on that year, and the relative proportion of people born with that name in that year.

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this upwards so that age estimates can be calculated for each row in a vector of first names. As the code stands below, the function I made takes too long to be scaled up effectively.

I’m wondering what the best approach would be. Some ideas I have so far:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and build a function to query the data table instead.
  • If neither of the above is fast enough, load the popularity data into Python and build the lookup function there instead.

    Does anyone have any better ideas for me?
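A minimal sketch of the first idea, assuming the `name_data` dataframe built by the code at the bottom of this post (columns `Name`, `Year`, `Rel_Pop`); the function and table names here are mine, not part of the original code:

```r
# Collapse name_data down to one row per name (its peak year and relative
# popularity), so age estimates for a whole vector of names become a
# single vectorized match() instead of one subset() call per name.
build_peak_table <- function(name_data) {
  do.call(rbind, lapply(split(name_data, name_data$Name), function(d) {
    d[which.max(d$Rel_Pop), c("Name", "Year", "Rel_Pop")]
  }))
}

estimate_ages <- function(names_vec, peak_table,
                          current_year = as.numeric(format(Sys.Date(), "%Y"))) {
  idx <- match(names_vec, peak_table$Name)  # NA for names not in the table
  current_year - as.numeric(peak_table$Year[idx])
}
```

Because `match()` does one hashed lookup over the whole vector, unknown names simply fall out as NA ages, and the per-name `subset()` scans disappear entirely.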


    library(stringr)
    library(plyr)

    # We're assuming you've downloaded the SSA files into your R project directory.
    file_listing = list.files()[3:135]

    for (f in file_listing) {
      year = str_extract(f, "[0-9]{4}")
      if (year == "1880") { # Initializing the very long dataframe
        name_data = read.csv(f, header=FALSE)
        names(name_data) = c("Name", "Sex", "Pop")
        name_data$Year = rep(year, dim(name_data)[1])
      } else { # Adding onto the very long dataframe
        name_data_new = read.csv(f, header=FALSE)
        names(name_data_new) = c("Name", "Sex", "Pop")
        name_data_new$Year = rep(year, dim(name_data_new)[1])
        name_data = rbind(name_data, name_data_new)
      }
    }

    # Total registered births per year, then each name's relative popularity
    year_pop_totals = ddply(name_data, .(Year), function (x) sum(x$Pop))
    name_data = merge(name_data, year_pop_totals, by.x="Year", by.y="Year", all.x=TRUE)
    name_data$Rel_Pop = name_data$Pop/name_data$V1

    estimate_age = function (input_name, sex = NA) {
      if (is.na(sex)) {
        # 1921 is a year I chose arbitrarily. Change how you like.
        name_subset = subset(name_data, Name == input_name & Year >= 1921)
      } else {
        name_subset = subset(name_data, Name == input_name & Year >= 1921 & Sex == sex)
      }
      # Columns 1 and 6 are Year and Rel_Pop after the merge above
      year_and_rel_pop = name_subset[which(name_subset$Rel_Pop == max(name_subset$Rel_Pop)), c(1,6)]
      current_year = as.numeric(substr(Sys.time(), 1, 4))
      estimated_age = current_year - as.numeric(year_and_rel_pop[1])
      return(list(year_of_birth=as.numeric(year_and_rel_pop[1]),
                  age=estimated_age,
                  relative_pop=sprintf("%1.2f%%", year_and_rel_pop[2]*100)))
    }


    I’ll also accept any suggestions for cleaning up my code as is 🙂
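One cleanup suggestion along those lines: the loading loop above grows `name_data` with `rbind` on every iteration, which copies the whole accumulated dataframe each time. An alternative is to read each file with `lapply` and bind once at the end (the helper name is mine, and `regmatches` is just a base-R stand-in for `str_extract`):

```r
# Read one SSA file, attaching the 4-digit year pulled from its filename
read_year_file <- function(f) {
  d <- read.csv(f, header = FALSE, col.names = c("Name", "Sex", "Pop"))
  d$Year <- regmatches(basename(f), regexpr("[0-9]{4}", basename(f)))
  d
}
# name_data <- do.call(rbind, lapply(file_listing, read_year_file))
```

Binding once at the end means each file's frame is built independently and the expensive concatenation happens a single time.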

  • save.ffdf and load.ffdf: Save and load your big data – quickly and neatly!

    I’m very indebted to the ff and ffbase packages in R.  Without them, I probably would have to use some less savoury stats program for my bigger data analysis projects that I do at work.

    Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes using ffsave and ffload.  The syntax isn’t so bad, but the resulting process it puts your computer through to save and load your ff dataframe is a bit cumbersome.  It takes a while to save and load, and ffsave creates (by default) a bunch of randomly named ff files in a temporary directory.

    For that reason, I was happy to come across a link to a pdf presentation (sorry, I’ve lost it now) summarizing some cool features of ffbase.  I learned that instead of using ffsave and ffload, you can use save.ffdf and load.ffdf, which have very simple syntax:

    save.ffdf(ffdfname, dir="/PATH/TO/STORE/FF/FILES")

    Use that, and it creates a directory wherein it stores ff files that bear the same names as your column names from your ff dataframe!  It also stores an .RData and .Rprofile file as well.  Then there is:

    load.ffdf(dir="/PATH/TO/STORE/FF/FILES")

    As simple as that, you load your files, and you’re done!  I think what I like about these functions is that they allow you to easily choose where the ff files are stored, removing the worry about important files being in your temporary directory.

    Store your big data!!

    Access individual elements of a row while using the apply function on your dataframe (or “applying down while thinking across”)

    The apply function in R is a huge work-horse for me across many projects.  My usage of it is pretty stereotypical.  Usually, I use it to make aggregations of a targeted group of columns for every row in a dataframe.  Those aggregations could be counts, sum totals, or possibly a binary column that flags some condition based on a count or sum total from the entire group of columns.

    In my latest project, I found a different usage for the apply function.  I had a dataframe detailing the donations made each year, by individual people, for the last 7 years.  I also had a column specifying the year when they converted from being a ‘regular’ donor to donating through a specific program.  The idea was to create a column specifying the last yearly amount donated before the year they had converted.  In other words, for each row, information had to be extracted on the year they converted, when they had donated before that year, and how much was that amount.

    Here’s the R code I used with some sample data to help recreate what I did:

    test = matrix(c(31, NA, 33, 37, NA, 2011, 
                    NA, 33, NA, 31, 37, 2012,
                    37, 33, 31, NA, NA, 2009), nrow=3, ncol=6, byrow=TRUE, dimnames=list(c(1,2,3),c(seq(2008,2012),'year')))
    
    value.b4.year = apply(test[,1:6], 1, function (x) 
      ifelse(any(!is.na(x[1:5])), 
      x[max(which(!is.na(x[1:which(seq(2008,2012) == x[6]) -1])))], 
             NA))

    First, I set up the sample data.  It’s obviously pretty simple, and the dataframe I used in reality had some rows where there were all NA values throughout the first 5 columns.

    It’s in the apply function where the real magic begins.  First, we load up all relevant columns into the apply functions for each row (test[,1:6]).  You can specify/insert whichever columns you need from your dataframe, so long as you use c() in the indexing brackets when you’re referencing your dataframe.  Once you load them up, no matter what the column indices/numbers, they become columns 1 through the total number of columns you loaded up as far as the rest of the apply function is concerned. (Say you load up 15 different columns indices, all spread out in terms of their relative position numbers.  Regardless of the position each index represents, those columns become 1:15 when referring to them in the subsequent function call).
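A tiny illustration of that renumbering, with a toy matrix of my own:

```r
# Columns handed to apply() are renumbered 1..n in the order you list them;
# the original column positions don't carry over into the function.
m <- matrix(1:20, nrow = 2, dimnames = list(NULL, letters[1:10]))

# We pass columns 3, 7, and 9, so inside the function x[2] refers to
# original column 7 ("g"), not column 2 of m.
picked <- apply(m[, c(3, 7, 9)], 1, function(x) unname(x[2]))
# picked holds the two values from column "g" of m
```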

    After the function (x) part of the function, I put a condition stating that the following value assignment should only happen if there is any data in the first 5 elements of the row; otherwise an NA gets assigned.  Notice that you don’t have to operate on, or select, every single column that you’ve loaded into the apply statement; very important!

    The next line is dizzyingly complex, so let me break it down for you:

    x[1:which(seq(2008,2012) == x[6]) -1]

    In this part of the line, I’m calling up all elements of the row, from the first element, until the element before the one corresponding to the year listed in the sixth element (i.e. 1:end year).  We start with  seq(2008,2012) == x[6], which marks the specific year that corresponds with the value of the sixth element.   Feeding that statement into the which function gives you the position in the row represented by x of the year listed in the sixth element.  Subtracting by 1 sets the end index as that column just before the one corresponding with the year.
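One small aside on that expression: in R, the `:` operator binds more tightly than binary `-`, so `1:which(...) - 1` actually evaluates as `(1:which(...)) - 1`, a sequence that starts at 0. It still returns the intended elements only because a 0 index is silently dropped when subsetting; writing `1:(which(...) - 1)` says the same thing unambiguously:

```r
# ":" binds tighter than binary "-"
1:5 - 1      # (1:5) - 1  ->  0 1 2 3 4
1:(5 - 1)    #            ->  1 2 3 4

# A zero index is dropped in subsetting, which is why the original works
x <- c(10, 20, 30, 40, 50)
x[0:3]       # same elements as x[1:3] -> 10 20 30
```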

    To show you what the above gets you on line 1 of the sample data:

    test[1,1:which(seq(2008,2012) == test[1,6]) -1]
    
    2008 2009 2010 
      31   NA   33

    Next, by nesting the above statement in the which and !is.na function calls, we eliminate those elements in the row that have NA values in them and find the positions where there is some value.

    which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1]))
    
    2008 2010 
       1    3

    Now that we have the positions of all values in the row that occurred before the year listed, we just need to select the last of those positions, by nesting the above function call inside a call to the max function:

    max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])))
    
    3

    Right, so the last value before the year in column 6 for row 1 is in position 3!  That finally leads us to the syntax I used in the apply statement (somewhat different here, because we’re looking at one row of the sample data):

    test[1,max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])))]
    
    33

    So, for row 1, the last donation amount before conversion was 33. Let’s look at the results of the apply statement from above:

    value.b4.year
    
     1  2  3 
    33 31 37

    Yup, for each of these rows, the values saved in value.b4.year are the last values donated before the year specified in the sixth column.

    So, the lesson here is that when you are using apply to perform an operation on each row of a dataframe, the value of any one column from that row is easily accessible by referring to the position of that value relative to the entire group of columns that you have passed to the apply function.  Hopefully this will help you in the future as I know it will help me 🙂

     

    Which Torontonians Want a Casino? Survey Analysis Part 2

    In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built.  So, here goes something:

    The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the survey, the one asking people to rate the importance of a long list of issues.  When I looked at this list originally, I knew that I would want to reduce the dimensionality using PCA.

    library(psych)
    issues.pca = principal(casino[,8:23], 3, rotate="varimax",scores=TRUE)

    The PCA resulted in the 3 components listed in the table below.  The first component had variables loading on to it that seemed to relate to the casino being a big attraction with lots of features, so I named it “Go big or Go Home”.  On the second component there seemed to be variables loading on to it that related to technical details, while the third component seemed to have variables loading on to it that dealt with social or environmental issues.

    Go Big or Go Home Concerned with Technical Details Concerned with Social/Environmental Issues or not Issue/Concern
    Q3_A 0.181 0.751 Design of the facility
    Q3_B 0.366 0.738 Employment Opportunities
    Q3_C 0.44 0.659 Entertainment and cultural activities
    Q3_D 0.695 0.361 Expanded convention facilities
    Q3_E 0.701 0.346 Integration with surrounding areas
    Q3_F 0.808 0.266 New hotel accommodations
    Q3_G -0.117 0.885 Problem gambling & health concerns
    Q3_H 0.904 Public safety and social concerns
    Q3_I 0.254 0.716 Public space
    Q3_J 0.864 0.218 Restaurants
    Q3_K 0.877 0.157 Retail
    Q3_L 0.423 0.676 -0.1 Revenue for the city
    Q3_M 0.218 0.703 0.227 Support for local businesses
    Q3_N 0.647 0.487 -0.221 Tourist attraction
    Q3_O 0.118 0.731 Traffic concerns
    Q3_P 0.497 0.536 0.124 Training and career development

    Once I was satisfied that I had a decent understanding of what the PCA was telling me, I loaded the component scores into the original dataframe.

    casino[,110:112] = issues.pca$scores
    names(casino)[110:112] = c("GoBigorGoHome","TechnicalDetails","Soc.Env.Issues")

    In order to investigate the question of who wants a casino and where, I decided to use question 6 as a dependent variable (the one asking where they would want it built, if one were to be built) and the PCA components as independent variables.  This is a good question to use, because the answer options, if you remember, are “Toronto”, “Adjacent Municipality” and “Neither”.  My approach was to model each response individually using logistic regression.

    casino$Q6[casino$Q6 == ""] = NA
    casino$Q6 = factor(casino$Q6, levels=c("Adjacent Municipality","City of Toronto","Neither"))
    
    adj.mun = glm(casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
    toronto = glm(casino$Q6 == "City of Toronto" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
    neither = glm(casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))

    Following are the summaries of each GLM:
    Toronto:


    Call:
    glm(formula = casino$Q6 == "City of Toronto" ~ GoBigorGoHome +
    TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
    data = casino)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    -3.6426 -0.4745 -0.1156 0.4236 3.4835
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.58707 0.04234 -37.48 <2e-16 ***
    GoBigorGoHome 1.76021 0.03765 46.75 <2e-16 ***
    TechnicalDetails 1.77155 0.05173 34.24 <2e-16 ***
    Soc.Env.Issues -1.63057 0.04262 -38.26 <2e-16 ***
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 13537 on 10365 degrees of freedom
    Residual deviance: 6818 on 10362 degrees of freedom
    (7400 observations deleted due to missingness)
    AIC: 6826
    Number of Fisher Scoring iterations: 6

    Adjacent municipality:


    Call:
    glm(formula = casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome +
    TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
    data = casino)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    -1.0633 -0.7248 -0.5722 -0.3264 2.7136
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.45398 0.02673 -54.394 < 2e-16 ***
    GoBigorGoHome -0.41989 0.02586 -16.239 < 2e-16 ***
    TechnicalDetails 0.18764 0.02612 7.183 6.82e-13 ***
    Soc.Env.Issues 0.52325 0.03221 16.243 < 2e-16 ***
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 10431.8 on 10365 degrees of freedom
    Residual deviance: 9756.4 on 10362 degrees of freedom
    (7400 observations deleted due to missingness)
    AIC: 9764.4
    Number of Fisher Scoring iterations: 5

    Neither location:


    Call:
    glm(formula = casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails +
    Soc.Env.Issues, family = binomial(logit), data = casino)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    -2.4090 -0.7344 -0.3934 0.8966 2.7194
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.22987 0.02415 -9.517 <2e-16 ***
    GoBigorGoHome -0.85050 0.02462 -34.549 <2e-16 ***
    TechnicalDetails -1.00182 0.02737 -36.597 <2e-16 ***
    Soc.Env.Issues 0.69707 0.02584 26.972 <2e-16 ***
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 14215 on 10365 degrees of freedom
    Residual deviance: 10557 on 10362 degrees of freedom
    (7400 observations deleted due to missingness)
    AIC: 10565
    Number of Fisher Scoring iterations: 4

    And here is a quick summary of the above GLM information:
    Summary of Casino GLMs

    Judging from these results, it looks like those who want a casino in Toronto don’t focus on the big social/environmental issues surrounding the casino, but do focus on the flashy and non-flashy details and benefits alike.  Those who want a casino outside of Toronto do care about the social/environmental issues, don’t care as much about the flashy details, but do have a focus on some of the non-flashy details.  Finally, those not wanting a casino in either location care about the social/environmental issues, but don’t care about any of the details.

    Here’s where the issue of location comes into play.  When I look at the summary for the GLM that predicts who wants a casino in an adjacent municipality, I get the feeling that it’s picking up people living in the downtown core who just don’t think the area can handle a casino.  In other words, I think there might be a “not in my backyard!” effect.

    The first inkling that this might be the case comes from an article from the Martin Prosperity Institute (MPI), who analyzed the same data set and managed to produce a very nice looking heat map of the responses to the first question on the survey, which asked people how they feel about having a new casino in Toronto.  From this map, it does look like people in Downtown Toronto feel pretty negative about a new casino, whereas those in the far east and west of Toronto feel better about it.

    My next evidence comes from the cities uncovered by geocoding the responses in the data set.  I decided to create a very simple indicator variable, distinguishing those for whom the “City” is Toronto, and those for whom the city is anything else.  I like this better than the MPI analysis, because it looks at people’s attitudes towards a casino both inside and outside of Toronto (rather than towards the concept of a new casino in Toronto).  If there really is a “not in my backyard!” effect, I would expect to see evidence that those in Toronto are more disposed towards a casino in an adjacent municipality, and that those from outside of Toronto are more disposed towards a casino inside Toronto!  Here we go:


    library(ff)
    library(ggthemes)
    ffload(file="casino", overwrite=TRUE)
    casino.orig$Outside.of.Toronto = as.ff(ifelse(casino.orig[,"City"] == "Toronto",0,1))
    casino.in.toronto = glm(casino.orig[,"Q6"] == "City of Toronto" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
    casino.outside.toronto = glm(casino.orig[,"Q6"] == "Adjacent Municipality" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
    summary(casino.in.toronto)
    Call:
    glm(formula = casino.orig[, "Q6"] == "City of Toronto" ~ Outside.of.Toronto,
    family = binomial(logit), data = casino.orig)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    -0.9132 -0.9132 -0.7205 1.4669 1.7179
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.21605 0.02600 -46.77 <2e-16 ***
    Outside.of.Toronto 0.55712 0.03855 14.45 <2e-16 ***
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 16278 on 13881 degrees of freedom
    Residual deviance: 16070 on 13880 degrees of freedom
    (3884 observations deleted due to missingness)
    AIC: 16074
    Number of Fisher Scoring iterations: 4
    ————————————————–
    summary(casino.outside.toronto)
    Call:
    glm(formula = casino.orig[, "Q6"] == "Adjacent Municipality" ~
    Outside.of.Toronto, family = binomial(logit), data = casino.orig)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    -0.7280 -0.7280 -0.5554 -0.5554 1.9726
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.19254 0.02583 -46.16 <2e-16 ***
    Outside.of.Toronto -0.59879 0.04641 -12.90 <2e-16 ***
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 13786 on 13881 degrees of freedom
    Residual deviance: 13611 on 13880 degrees of freedom
    (3884 observations deleted due to missingness)
    AIC: 13615
    Number of Fisher Scoring iterations: 4
    ——————————————-
    library(reshape2)  # for melt()
    library(scales)    # for percent()
    casino.loc.by.city.res = as.matrix(table(casino.orig[,"Outside.of.Toronto"], casino.orig[,"Q6"])[,2:4])
    rownames(casino.loc.by.city.res) = c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto")
    casino.loc.by.city.res = melt(prop.table(casino.loc.by.city.res, 1))
    names(casino.loc.by.city.res) = c("City of Residence", "Where to Build the Casino", "value")
    ggplot(casino.loc.by.city.res, aes(x=`Where to Build the Casino`, y=value, fill=`City of Residence`)) +
      geom_bar(stat="identity", position="dodge") +
      scale_y_continuous(labels=percent) +
      theme_wsj() +
      scale_fill_discrete(name="City of Residence") +
      ggtitle("If a casino is built, where would you prefer to see it located?")


    Where located by city of residence

    As you can see here, those from outside of Toronto are more likely to suggest building a casino in Toronto than those from inside the city, and less likely to suggest building a casino in an adjacent municipality (with the reverse being true of those from inside Toronto).

    That being said, when you do the comparison within city of residence (instead of across it like I just did), those from inside Toronto seem equally likely to suggest that the casino be built in or outside of the city, whereas those outside are much more likely to suggest building the casino inside Toronto than outside.  So, depending on how you view this graph, you might say there’s evidence for a “not in my backyard!” effect only for those living outside of Toronto.

    As a final note, I’ll remind you that although these analyses point to which Torontonians do want a new casino, the fact from this survey remains that about 71% of respondents are unsupportive of a casino in Toronto, and 53% don’t want a casino built in either Toronto or an adjacent municipality.  I really have to wonder if they’re still going to go ahead with it!

    When the “reorder” function just isn’t good enough…

    The reorder function, in R 3.0.0, is behaving strangely (or I’m really not understanding something).  Take the following simple data frame:

    df = data.frame(a1 = c(4,1,1,3,2,4,2), a2 = c("h","j","j","e","c","h","c"))

    I expect that if I call the reorder function on the a2 vector, using the a1 vector as the vector to order the second one by, then any summary stats that I run on the a2 vector will be ordered according to the numbers in a1.  However, look what happens:

    table(reorder(df$a2, df$a1))
    c e h j 
    2 1 2 2

    I found out that in order to get it in the order specified by the numbers in the first vector, the following code seems to work:

    df$a2 = factor(df$a2, levels=unique(df$a2)[order(unique(df$a1))], ordered=TRUE)

    Now look at the result:

    table(df$a2)
    
    j c e h 
    2 2 1 2

    One thing I notice here is that R seems to be keeping the factor levels alphabetically organized. When I specify the levels by using the “unique” function, it allows itself to break the alphabetic organization.

    Why won’t the reorder function work in this case?
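For what it’s worth, `reorder`’s documented default is to order the levels by the mean of the second argument within each level, ascending, rather than by order of appearance. With these data the within-level means are j = 1, c = 2, e = 3, h = 4, and a current version of R does return the levels as j, c, e, h (the same order the workaround above produces), so the odd output may have been specific to R 3.0.0:

```r
df <- data.frame(a1 = c(4, 1, 1, 3, 2, 4, 2),
                 a2 = factor(c("h", "j", "j", "e", "c", "h", "c")))
# Default FUN is mean; levels come back ascending by within-level mean
levels(reorder(df$a2, df$a1))  # "j" "c" "e" "h"
```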

    Do Torontonians Want a New Casino? Survey Analysis Part 1

    Toronto City Council is in the midst of a very lengthy process of considering whether or not to allow the OLG to build a new casino in Toronto, and if so, where.  The process started in November of 2012, and set out to answer this question through many and varied consultations with the public and key stakeholders in the city.

    One of the methods of public consultation that they used was a “Casino Feedback Form”, a survey that was distributed online and in person.  By the time the deadline to collect responses had passed (January 25, 11:59pm), they had collected a whopping 17,780 responses.  The agency seemingly responsible for the survey is called DPRA, and from what I can tell they did a pretty decent job of creating and distributing it.

    In a surprisingly modern and democratic move, Toronto City Council made the response data for the survey available on the Toronto Open Data website, which I couldn’t help but download and analyze for myself (with R, of course!).

    For a relatively small survey, it’s very rich in information.  I love having hobby data sets to work with from time to time, and so I’m going to dedicate a few posts to the analysis of this response data file.  This post will not show too much that’s different from the report that DPRA has already released, as it contains largely univariate analyses.  In later posts, however, I will get around to asking and answering those questions that are of a more multivariate nature!  To preserve the flow of the post, I will put the R code at the end, instead of interspersing it throughout like I normally do.  Unless otherwise specified, all numerical axes represent the % of people who selected a particular response on the survey.

    Without further ado, I will start with some key findings:

    Key Findings

    1. With 17,780 responses, Toronto City Council obtained for themselves a hefty data set with pretty decent geographical coverage of the core areas of Toronto (Toronto, North York, Scarborough, Etobicoke, East York, Mississauga).  This is much better than Ipsos Reid’s Casino Survey response data set of 906 respondents.
    2. Only 25.7% of respondents were somewhat or strongly in favour of having a new casino in Toronto.  I’d say that’s overwhelmingly negative!
    3. Ratings of the suitability of a casino in three different locations by type of casino indicate that people are more favourable towards an Integrated Entertainment Complex (basically a casino with extra amenities) vs. a standalone casino.
    4. Of the three different locations, people were most favourable towards an Integrated Entertainment Complex at the Exhibition Place.  However, bear in mind that only 27.4% of respondents thought it was suitable at all.  This is a ‘best of the worst’ result!
    5. When asked to rate the importance of a list of issues surrounding the building of a new casino in Toronto, respondents rated as most important the following issues: safety, health, addiction, public space, traffic, and integration with surrounding areas.

    Geographic Distribution of Responses

    In a relatively short time, City Council managed to collect many responses to their survey.  I wanted to look at the geographic distribution of all of these responses.  Luckily, the survey included a question that asked for the first 3 characters of the respondents’ postal code (the FSA).  If you have a file containing geocoded postal codes, you can then plot the respondents on a map.  I managed to find such a file on a website called geocoder.ca, with latitude and longitude coordinates for over 90,000 postal codes.  Once I got the file into R, I made sure that all FSA codes in the survey data were capitalized, created an FSA column in the geocoded file, and then merged the geocoded dataset into the survey dataset.  This isn’t a completely valid approach, but when looking at a broad area, I don’t think the errors in plotting points on a map are going to look that serious.
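The merge step described above might look roughly like this; the column names for the geocoder file (`PostalCode`, `Lat`, `Lon`) and the toy data are my assumptions, not the actual contents of the geocoder.ca file:

```r
# Toy stand-ins for the two files (hypothetical column names)
geo    <- data.frame(PostalCode = c("M5V 2T6", "M4C 1B5"),
                     Lat = c(43.64, 43.69), Lon = c(-79.39, -79.31))
survey <- data.frame(FSA = c("m5v", "K1A"))

geo$FSA    <- toupper(substr(geo$PostalCode, 1, 3))  # derive an FSA column
survey$FSA <- toupper(survey$FSA)                    # capitalize survey FSAs
survey <- merge(survey, geo[, c("FSA", "Lat", "Lon")],
                by = "FSA", all.x = TRUE)            # keep unmatched responses as NA
```

Using `all.x = TRUE` keeps every survey response, so FSAs with no match in the geocoded file simply get NA coordinates rather than being dropped.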

    For a survey about Toronto, the geographic distribution was actually pretty wide.  Have a look at the complete distribution:

    Total Geo Distribution of Responses

    Obviously there seem to be a whole lot of responses in Southern Ontario, but we even see a smattering of responses in neighbouring provinces as well.  However, let’s look at a way of zooming in on the large cluster of Southern Ontario cities.  From the postal codes, I was able to get the city in which each response was made.  From that I pulled out what looked like a good cluster of top Southern Ontario cities:

              City	 # Responses
           Toronto	8389
        North York	1553
       Scarborough	1145
         Etobicoke	936
         East York	462
       Mississauga	201
           Markham	149
          Brampton	111
     Richmond Hill	79
         Thornhill	62
              York	59
             Maple	58
            Milton	30
          Oakville	30
        Woodbridge	30
        Burlington	28
            Oshawa	25
         Pickering	22
            Whitby	19
          Hamilton	17
            Bolton	14
            Guelph	13
          Nobleton	12
            Aurora	11
              Ajax	10
           Caledon	10
       Stouffville	10
            Barrie	9

    Lots of people in Toronto, obviously; a fair amount in North York, Scarborough, and Etobicoke; and then the frequency drops off sharply from there. However, these city labels come from the geocoding, and who knows whether some people it places in Toronto are actually from North York (the tree, then one of its apples). So, I filtered the latitude and longitude coordinates based on this top list to get the following zoomed-in map:

    Toronto Geo Distribution of Responses

    Much better than a table!  I used transparency on the colours of the circles to help better distinguish dense clusters of responses from sparse ones.  Based on the map, I can see 3 patterns:

    1) It looks like a huge cluster of responses came from an area of Toronto approximately bounded by Dufferin on the west, Highway 404 on the east, the Gardiner on the south, and the 401 on the north.

    2) There’s also an interesting vertical cluster that runs from well south of Highway 400 and the 401 north to the 407.

    3) I’m not sure I would call this a cluster per se, but there definitely seems to be a pattern where you find responses all the way along the Gardiner Expressway/Queen Elizabeth Way/Kingston Road/401 from Burlington to Oshawa.

    Now for the survey results!

    Demographics of Respondents


    As you can see, about 80% of the respondents disclosed their gender, with a noticeable bias towards men.  Also, most of the respondents who disclosed their age were between 25 and 64 years of age.  This might be a disadvantage, according to a recent report by Statistics Canada on gambling.  If you look at page 6 of the report, you will see that of all age groups of female gamblers, those 65 and older spend the most money on casinos, slot machines, and VLTs per spending household.  However, I guess it’s better to have some information than none.

    Feelings about the new casino

    Feelings about the new casino

    Well, isn’t that something?  Only about a quarter of all people surveyed actually have positive feelings about a new casino!  I have to say this is pretty telling.  You would think this would be damning information, but here’s where we fall into the trap of whether or not to trust a survey result.

    Here we have this telling response, but then again, Ipsos Reid conducted a poll that gathered 906 responses and concluded that 52% of Torontonians “either strongly or somewhat support a new gambling venue within its borders”.  People were asked about their support of a new casino at the beginning and again at the end: after they had supplied people with all the various arguments from both sides of the debate, they asked the question again, and apparently the proportion supporting the casino was 54% on the second instance of the question.  They don’t even link to the original question form, so I’m left to wonder exactly how it was phrased, and what preceded it.  The only hint is in this phrase: “if a vote were held tomorrow on the idea of building a casino in the city of Toronto…”.  Does that seem comparable to you?

    A Casino for “Toronto The Good”?

    Casino fit image of Toronto

    This question seems to be pretty similar to the first question.  If a new casino fits your image of Toronto perfectly, then you’re probably going to be strongly in favour of one!  Obviously, most people seem pretty sure that a new casino just isn’t the kind of thing that would fit in with their image of “Toronto the Good”.

    Where to build a new casino

    Where casino built

    In the response pattern here, we see a kind of ‘not in/near my backyard’ mentality.  A slight majority of respondents seem to be saying that if a new casino is to be built, it should be somewhere decently far away from Toronto, perhaps so that they don’t have to deal with the consequences.  I’ll eat my sock if the “Neither” folks aren’t the same people who were strongly opposed to the casino.

    Casino Suitability in Downtown Area

    Casino suitability at Exhibition Place

    Casino Suitability at Port Lands

    They also asked respondents to rate the suitability of a new casino in three different locations:

    1. A downtown area (bounded by Spadina Avenue, King Street, Jarvis Street and Queens Quay)
    2. Exhibition Place (bounded by Gardiner Expressway, Lake Shore Boulevard, Dufferin Street and Strachan Avenue)
    3. Port Lands (located south of the Don Valley and Gardiner/Lake Shore, east of the downtown core)

    Looking at the above 3 graphs, you see right away that a kind of casino called an Integrated Entertainment Complex (kind of a smorgasbord of casino, restaurant, theatre, hotel, etc.) is more favourable than a standalone casino at any location.  That being said, the responses are still largely negative!  Of the 3 locations for an Integrated Entertainment Complex (IEC), it was Exhibition Place that rated most positively, by a small margin (18.1% said highly suitable, vs. 16.2% for downtown Toronto).  There are definitely those at Exhibition Place who want the Toronto landmark to be chosen!

    Desired Features of IEC by Location


    These charts indicate that those who can imagine an Integrated Entertainment Complex in any of the 3 locations would like features there that allow them to sit/stand and enjoy themselves.  Restaurants, Cultural and Arts Facilities, and Theatre are tops in all 3 locations (but still bear in mind that less than half opted for those choices).  A quick Google search reveals that restaurants and theatres are mentioned in a large number of search results.  An article in the Toronto Sun boasts that an Integrated Entertainment Complex would catapult Toronto into the stars as even more of a tourist destination.  Interestingly, the article also mentions the high monetary value of convention visitors and how much that would add to the revenues generated for the city.  I find it funny that the popularity of having convention centre space in this survey is at its highest when people are asked about Exhibition Place.  Exhibition Place already has convention centre space!!  I don’t understand the logic, but maybe someone will explain it to me.

    Issues of Importance Surrounding a New Casino

    Issues of Importance re the New Casino

    Unlike the previous graphs, this one charts the % who gave a particular response on each item.  In this case, the graph shows the % of respondents who gave the answer “Very Important” when asked to rate each issue surrounding the new casino.  Unlike some of the previous questions, this one did not include a “No Casino” option, so more people can contribute positively to the response distribution.  You can see that people are pretty riled up about some serious social and environmental issues.  They’re worried about safety, health, addiction, public space (sounds like a worry about clutter to me), traffic, and integration with surrounding areas.  I’ll bet that the people worried about these top 5 issues are the people most likely to say that they don’t want a casino anywhere.  It will be interesting to uncover some factor structure here and then find out what the pro and anti casino folks are concerned with.

    For my next post, I have a few simple questions in mind to investigate:

    1. Who exactly wants or doesn’t want a new casino, and where?  What are those who do and don’t want a casino most concerned with?
    2. Is there a “not in my backyard” effect going on, where those who are closest to the proposed casino spots are the least likely to want it there, but more likely to want a casino elsewhere?  I have latitude/longitude coordinates, and can convert them into distances from the proposed casino spots.  I think that will be interesting to look at!


    library(ff)
    library(ffbase)
    library(stringr)
    library(ggplot2)
    library(ggthemes)
    library(reshape2)
    library(scales) # provides percent, used for the axis labels in the plots below
    library(RgoogleMaps)
    # Loading 2 copies of the same data set so that I can convert one and have the original for its text values
    casino = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
    casino.orig = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
    # Here's the dataset of canadian postal codes and latitude/longitude coordinates
    pcodes = read.csv.ffdf(file="/home/inkhorn/Downloads/zipcodeset.txt", first.rows=50000, next.rows=50000, colClasses=NA, header=FALSE)
    # I'm doing some numerical recoding here. If you can tell me a cleaner way of doing this
    # then by all means please do. I found this process really annoyingly tedious.
    casino$Q1_A = ifelse(casino.orig$Q1_A == "Neutral or Mixed Feelings", 3,
    ifelse(casino.orig$Q1_A == "Somewhat in Favour", 4,
    ifelse(casino.orig$Q1_A == "Somewhat Opposed", 2,
    ifelse(casino.orig$Q1_A == "Strongly in Favour", 5,
    ifelse(casino.orig$Q1_A == "Strongly Opposed", 1,NA)))))
    casino$Q2_A = ifelse(casino.orig$Q2_A == "Does Not Fit My Image At All", 1,
    ifelse(casino.orig$Q2_A == "Neutral / I am Not Sure",2,
    ifelse(casino.orig$Q2_A == "Fits Image Somewhat", 3,
    ifelse(casino.orig$Q2_A == "Fits Image Perfectly", 4, NA))))
    for (i in 8:24) {
    casino[,i] = ifelse(casino.orig[,i] == "Not Important At All", 1,
    ifelse(casino.orig[,i] == "Somewhat Important", 2,
    ifelse(casino.orig[,i] == "Very Important", 3,NA)))}
    for (i in c(31:32,47,48,63,64)) {
    casino[,i] = ifelse(casino.orig[,i] == "Highly Suitable",5,
    ifelse(casino.orig[,i] == "Neutral or Mixed Feelings",3,
    ifelse(casino.orig[,i] == "Somewhat Suitable",4,
    ifelse(casino.orig[,i] == "Somewhat Unsuitable",2,
    ifelse(casino.orig[,i] == "Strongly Unsuitable",1,NA)))))}
    # There tended to be blank responses in the original dataset. When seeking to
    # plot the responses in their original text option format, I got rid of them in some cases,
    # or coded them as "Did not disclose" in others.
    casino.orig$Q1_A[casino.orig$Q1_A == ""] = NA
    casino.orig$Q1_A = factor(casino.orig$Q1_A, levels=c("Strongly Opposed","Somewhat Opposed","Neutral or Mixed Feelings","Somewhat in Favour","Strongly in Favour"))
    # Here's the graph showing how people feel about a new casino
    ggplot(subset(casino.orig, !is.na(Q1_A)), aes(x=Q1_A,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("How do you feel about having a new casino in Toronto?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)
    # How does the casino fit into your image of toronto…
    ggplot(subset(casino.orig, Q2_A!= ''), aes(x=Q2_A,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("How does a new casino in Toronto fit your image of the City of Toronto?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)),geom="text") + scale_y_continuous(labels=percent)
    # Where you'd prefer to see it located
    ggplot(subset(casino.orig, Q6!= ''), aes(x=Q6,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("If a casino is built, where would you prefer to see it located?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)
    # Here I reorder the text labels from the questions asking about suitability of the downtown location
    casino.orig$Q7_A_StandAlone = reorder(casino.orig$Q7_A_StandAlone, casino$Q7_A_StandAlone)
    casino.orig$Q7_A_Integrated = reorder(casino.orig$Q7_A_Integrated, casino$Q7_A_Integrated)
    # Reshaping the downtown ratings data for graphing..
    stand.and.integrated.ratings.downtown = cbind(prop.table(as.matrix(table(casino.orig$Q7_A_StandAlone)[1:5])),
    prop.table(as.matrix(table(casino.orig$Q7_A_Integrated)[1:5])))
    colnames(stand.and.integrated.ratings.downtown) = c("Standalone Casino","Integrated Entertainment Complex")
    stand.and.integrated.ratings.downtown.long = melt(stand.and.integrated.ratings.downtown, varnames=c("Rating","Casino Type"), value.name="Percentage")
    # Graphing ratings of casino suitability for the downtown location
    ggplot(stand.and.integrated.ratings.downtown.long, aes(x=stand.and.integrated.ratings.downtown.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nin Downtown Toronto by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.downtown.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75),position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
    # Reshaping the exhibition place ratings for graphing
    stand.and.integrated.ratings.exhibition = cbind(prop.table(as.matrix(table(casino.orig$Q7_B_StandAlone)[2:6])),
    prop.table(as.matrix(table(casino.orig$Q7_B_Integrated)[2:6])))
    colnames(stand.and.integrated.ratings.exhibition) = c("Standalone Casino","Integrated Entertainment Complex")
    stand.and.integrated.ratings.exhibition.long = melt(stand.and.integrated.ratings.exhibition, varnames=c("Rating","Casino Type"), value.name="Percentage")
    # Reordering the rating text labels for the graphing.
    stand.and.integrated.ratings.exhibition.long$Rating = factor(stand.and.integrated.ratings.exhibition.long$Rating, levels=levels(casino.orig$Q7_A_StandAlone)[1:5])
    # Graphing ratings of casino suitability for the exhibition place location
    ggplot(stand.and.integrated.ratings.exhibition.long, aes(x=stand.and.integrated.ratings.exhibition.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nat Exhibition Place by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.exhibition.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75), position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
    # Reshaping the Port Lands ratings for graphing
    stand.and.integrated.ratings.portlands = cbind(prop.table(as.matrix(table(casino.orig$Q7_C_StandAlone)[2:6])),
    prop.table(as.matrix(table(casino.orig$Q7_C_Integrated)[2:6])))
    colnames(stand.and.integrated.ratings.portlands) = c("Standalone Casino", "Integrated Entertainment Complex")
    stand.and.integrated.ratings.portlands.long = melt(stand.and.integrated.ratings.portlands, varnames=c("Rating","Casino Type"), value.name="Percentage")
    # Reordering the rating text labels for the graphing.
    stand.and.integrated.ratings.portlands.long$Rating = factor(stand.and.integrated.ratings.portlands.long$Rating, levels=levels(casino.orig$Q7_A_StandAlone)[1:5])
    # Graphing ratings of casino suitability for the port lands location
    ggplot(stand.and.integrated.ratings.portlands.long, aes(x=stand.and.integrated.ratings.portlands.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nat Port Lands by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.portlands.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75), position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
    # This was the part in my analysis where I looked at postal codes (FSAs really) and their coordinates
    # Sorry I'm not more linear in how I do my analysis vs. write about it 🙂
    # You'll notice that I've imported the geocode file as ffdf. This led to faster merging with the
    # original casino data set. This meant that I had to coerce the casino.orig data frame into ffdf format
    # But I work with it every day at work, so I'm used to it by now, despite its idiosyncrasies.
    casino.orig$PostalCode = toupper(casino.orig$PostalCode)
    pcodes = read.csv.ffdf(file="/home/inkhorn/Downloads/zipcodeset.txt", first.rows=50000, next.rows=50000, colClasses=NA, header=FALSE)
    names(pcodes) = c("Postal","Lat","Long","City","Prov")
    pcodes$FSA = as.ff(as.factor(toupper(substr(pcodes[,"Postal"], 1,3))))
    casino.orig = as.ffdf(casino.orig)
    casino.orig$PostalCode = as.ff(as.factor(toupper(casino.orig[,"PostalCode"])))
    casino.orig = merge(casino.orig, pcodes, by.x="PostalCode", by.y="FSA", all.x=TRUE)
    # This is the code for the full map I generated
    casino.gc = casino.orig[which(!is.na(casino.orig[,"Lat"])),] # making sure only records with coordinates are included…
    mymap = MapBackground(lat=casino.gc$Lat, lon=casino.gc$Long)
    PlotOnStaticMap(mymap, casino.gc$Lat, casino.gc$Long, cex=1.5, pch=21, bg="orange")
    # Here I'm getting a list of cities, winnowing it down, and using it to filter the
    # geocode coordinates to zoom in on the map I generated.
    cities = data.frame(table(casino.orig[,"City"]))
    cities = cities[cities$Freq > 0,]
    cities = cities[order(cities$Freq, decreasing=TRUE),]
    cities = cities[cities$Var1 != '',]
    cities.filter = cities[1:28,] # Here's my top cities variable (I set an arbitrary dividing line…)
    names(cities.filter) = c("City","# Responses")
    # Here's where I filtered the original casino ffdf so that it only contained the cities
    # that I wanted to see in Southern Ontario
    casino.top.so = casino.orig[which(casino.orig[,"City"] %in% cities.filter$Var1),]
    # here's a transparency function that I used for the southern ontario map
    addTrans <- function(color,trans)
    {
    # This function adds transparency to a color.
    # Define transparency with an integer between 0 and 255
    # 0 being fully transparent and 255 being fully visible
    # Works with either color and trans a vector of equal length,
    # or one of the two of length 1.
    if (length(color)!=length(trans)&!any(c(length(color),length(trans))==1)) stop("Vector lengths not correct")
    if (length(color)==1 & length(trans)>1) color <- rep(color,length(trans))
    if (length(trans)==1 & length(color)>1) trans <- rep(trans,length(color))
    num2hex <- function(x)
    {
    hex <- unlist(strsplit("0123456789ABCDEF",split=""))
    return(paste(hex[(x-x%%16)/16+1],hex[x%%16+1],sep=""))
    }
    rgb <- rbind(col2rgb(color),trans)
    res <- paste("#",apply(apply(rgb,2,num2hex),2,paste,collapse=""),sep="")
    return(res)
    }
    # Finally here's the southern ontario map code
    mymap = MapBackground(lat=casino.top.so$Lat, lon=casino.top.so$Long)
    PlotOnStaticMap(mymap, casino.top.so$Lat, casino.top.so$Long, cex=1.5, pch=21, bg=addTrans("orange",10))
    # Here's some code for summarizing and plotting the response data to the question
    # around issues of importance regarding the new casino (question 3)
    q3.summary = matrix(NA, 16,1,dimnames=list(c("Design of the facility",
    "Employment opportunities","Entertainment and cultural activities",
    "Expanded convention facilities", "Integration with surrounding areas",
    "New hotel accommodations","Problem gambling & health concerns",
    "Public safety and social concerns","Public space",
    "Restaurants","Retail","Revenue for the City","Support for local businesses",
    "Tourist attraction","Traffic concerns","Training and career development"),c("% Very Important")))
    for (i in 8:23) {
    q3.summary[i-7] = mean(casino[,i] == 3, na.rm=TRUE)}
    q3.summary = as.data.frame(q3.summary[order(q3.summary[,1], decreasing = FALSE),])
    names(q3.summary)[1] = "% Very Important"
    q3.summary$Concern = rownames(q3.summary)
    q3.summary = q3.summary[order(q3.summary$"% Very Important", decreasing=FALSE),]
    q3.summary$Concern = factor(q3.summary$Concern, levels=q3.summary$Concern)
    ggplot(q3.summary, aes(x=Concern, y=q3.summary$"% Very Important")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("Issues of Importance Surrounding\nthe New Casino") + scale_x_discrete(name="Issues of Importance") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + theme_wsj()
    # This chunk of code deals with summarizing and plotting the questions surrounding
    # what features people might want if a new Integrated Entertainment Complex is built
    q7a.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
    "Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
    for (i in 36:44) {
    q7a.summary[i-35] = mean(casino[,i], na.rm=TRUE)}
    q7a.summary = as.data.frame(q7a.summary[order(q7a.summary[,1], decreasing = FALSE),])
    names(q7a.summary)[1] = "% Include"
    q7a.summary$feature = rownames(q7a.summary)
    q7a.summary$feature = factor(q7a.summary$feature, levels=q7a.summary$feature)
    ggplot(q7a.summary, aes(x=feature, y=q7a.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex in Downtown Toronto") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
    q7b.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
    "Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
    for (i in 52:60) {
    q7b.summary[i-51] = mean(casino[,i], na.rm=TRUE)}
    q7b.summary = as.data.frame(q7b.summary[order(q7b.summary[,1], decreasing = FALSE),])
    names(q7b.summary)[1] = "% Include"
    q7b.summary$feature = rownames(q7b.summary)
    q7b.summary$feature = factor(q7b.summary$feature, levels=q7b.summary$feature)
    ggplot(q7b.summary, aes(x=feature, y=q7b.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex at the Exhibition Place") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
    q7c.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
    "Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
    for (i in 68:76) {
    q7c.summary[i-67] = mean(casino[,i], na.rm=TRUE)}
    q7c.summary = as.data.frame(q7c.summary[order(q7c.summary[,1], decreasing = FALSE),])
    names(q7c.summary)[1] = "% Include"
    q7c.summary$feature = rownames(q7c.summary)
    q7c.summary$feature = factor(q7c.summary$feature, levels=q7c.summary$feature)
    ggplot(q7c.summary, aes(x=feature, y=q7c.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex in Port Lands") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
    # It sucks, but I imported yet another version of the casino dataset so that I wouldn't have to use
    # the annoying ffdf indexing notation (e.g. df[,"variable1"])
    casino.orig2 = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
    # Finally, here's some code where I processed and plotted the Gender and Age demographic variables
    casino$Gender = casino.orig$Gender
    casino$Gender = ifelse(!(casino.orig2$Gender %in% c("Female","Male","Transgendered")), "Did not disclose",
    ifelse(casino.orig2$Gender == "Female","Female",
    ifelse(casino.orig2$Gender == "Male","Male","Transgendered")))
    casino$Gender = factor(casino$Gender, levels=c("Transgendered","Did not disclose","Female","Male"))
    ggplot(casino, aes(x=Gender,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("Gender Distribution of Respondents") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)),
    geom="text") + scale_y_continuous(labels=percent)
    casino$Age = ifelse(casino.orig2$Age == "", "Did not disclose",
    ifelse(casino.orig2$Age == "Under 15", "Under 15",
    ifelse(casino.orig2$Age == "15-24", "15-24",
    ifelse(casino.orig2$Age == "25-34", "25-34",
    ifelse(casino.orig2$Age == "35-44", "35-44",
    ifelse(casino.orig2$Age == "45-54","45-54",
    ifelse(casino.orig2$Age == "55-64","55-64",
    ifelse(casino.orig2$Age == "65 or older", "65 or older","Did not disclose"))))))))
    casino$Age = factor(casino$Age, levels=c("Did not disclose","Under 15","15-24","25-34","35-44","45-54","55-64","65 or older"))
    ggplot(casino, aes(x=Age,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("Age Distribution of Respondents") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)
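    As an aside on the nested ifelse() recodes near the top of this script (which, as I said, I found annoyingly tedious): a named lookup vector can do the same recoding in a couple of lines. Here is just a sketch of the idea, using a toy response vector and the Q1 labels from above:

```r
# Map each response label to its numeric code, then index into the map by label.
# Unmatched labels (including blank responses) come out as NA automatically.
responses = c("Strongly Opposed", "Neutral or Mixed Feelings",
              "Somewhat in Favour", "")          # "" = blank response
q1.map = c("Strongly Opposed" = 1, "Somewhat Opposed" = 2,
           "Neutral or Mixed Feelings" = 3, "Somewhat in Favour" = 4,
           "Strongly in Favour" = 5)
unname(q1.map[responses])                        # 1 3 4 NA
```

    In the script above this would be, e.g., casino$Q1_A = unname(q1.map[as.character(casino.orig$Q1_A)]).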

    Using ddply to select the first record of every group

    I had a very long file of monetary transactions (about 207,000 rows) with about two handfuls of columns describing each transaction (including date).  The task I needed to perform on this file was to select the value from one of the categorical descriptor columns (called “appeal”) associated with the first transaction found for every ID in the file.  Luckily for me, the file was organized so that the older transactions came before the newer transactions.  The coding solution I used was pretty simple, but took maybe 5 – 10 minutes to complete.

    Assuming the data frame is called ‘trans.file’, the numerical ID is called, ‘Num.ID’, and the categorical descriptor column of interest is called ‘appeal’, here’s the simple code:

    first.appeal.by.id = ddply(trans.file, "Num.ID", function(x) as.character(x$appeal[1]))

    I do like simple, but I also like fast.  Is there a less computationally expensive way of doing this?  Please tell me if so 🙂
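    Since the file already has the oldest transaction first within each ID, one cheaper base-R approach (a sketch I haven't benchmarked against the real 207,000-row file) is to keep only the first row per ID with duplicated(), which avoids calling a function once per group:

```r
# Toy stand-in for trans.file: transactions sorted oldest-first within each ID
trans.file = data.frame(Num.ID = c(1, 1, 2, 2, 2, 3),
                        appeal = c("a", "b", "c", "a", "b", "b"),
                        stringsAsFactors = FALSE)
# duplicated() marks every repeat of an ID, so !duplicated() keeps the
# first (i.e. oldest) transaction per ID in a single vectorized pass
first.appeal.by.id = trans.file[!duplicated(trans.file$Num.ID),
                                c("Num.ID", "appeal")]
first.appeal.by.id$appeal   # "a" "c" "b"
```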

    Split, Apply, and Combine for ffdf

    Call me incompetent, but I just can’t get ffdfdply to work with my ffdf dataframes.  I’ve tried repeatedly and it just doesn’t seem to work!  I’ve seen numerous examples on stackoverflow, but maybe I’m applying them incorrectly.  Wanting to do some split-apply-combine on an ffdf, yet again, I finally broke down and made my own function that seems to do the job! It’s still crude, I think, and it will probably break down when there are NA values in the vector that you want to split, but here it is:

    mtapply = function (dvar, ivar, funlist) {
      lenlist = length(funlist)
      outtable = matrix(NA, dim(table(ivar)), lenlist, dimnames=list(names(table(ivar)), funlist))
      col = 1
      for (f in funlist) {
        # match.fun() looks the function up by its name string,
        # which is less fragile than eval(parse(text=f))
        outtable[,col] = as.matrix(tapply(dvar, ivar, match.fun(f)))
        col = col + 1
      }
      return (outtable)}

    As you can see, I’ve made it so that the result is a bunch of tapply vectors inserted into a matrix.  “dvar”, unsurprisingly, is your dependent variable.  “ivar” is your independent variable.  “funlist” is a vector of function names typed in as strings (e.g. c("median", "mean", "mode")).  I’ve wasted so much of my time trying to get ddply or ffdfdply to work on an ffdf that I’m happy to now have anything that does the job for me.

    Now that I think about it, this will fall short if you ask it to output more than one quantile for each of your split levels.  If you can improve this function, please be my guest!
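    One way around that single-value-per-group limitation (again, just a sketch of mine, not stress-tested) is to let each function return a vector per group and column-bind the tapply results, so something like quantile() can emit several columns per split level:

```r
# Sketch: like mtapply, but each function may return a vector per group.
# tapply(..., simplify = FALSE) gives a list of per-group results; rbind
# stacks them into a groups-by-values matrix, and cbind joins the functions.
mtapply2 = function(dvar, ivar, funlist) {
  pieces = lapply(funlist, function(f)
    do.call(rbind, tapply(dvar, ivar, match.fun(f), simplify = FALSE)))
  do.call(cbind, pieces)
}
set.seed(1)
out = mtapply2(rnorm(100), rep(c("a", "b"), 50),
               c("mean", function(x) quantile(x, c(.25, .75))))
dim(out)   # 2 rows (one per group) x 3 columns (mean, 25%, 75%)
```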

    Finding Patterns Amongst Binary Variables with the homals Package

    It’s survey analysis season for me at work!  When analyzing survey data, one kind of analysis I’ve realized I’m not used to doing is finding patterns in binary data.  In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in people’s responses to this question?

    I tried applying PCA and Factor Analysis alternately, but they really don’t seem well suited to the analysis of data consisting of only binary columns (1s and 0s).  In searching for something that works, I came across the homals package.  While its main function is described as a “homogeneity analysis”, the one ability that interests me is called “non-linear PCA”.  This is supposed to be able to reduce the dimensionality of your dataset even when the variables are all binary.

    Well, here’s an example using some real survey data (with masked variable names).  First we start off with the purpose of the data and some simple summary stats:

    It’s a group of 6 variables (answer choices) showing people’s checkbox responses to a question asking them why they donated to a particular charity.  Following are the numbers of responses to each answer choice:

    mapply(whydonate, FUN=sum, 1)
     V1  V2  V3  V4  V5  V6 
    201  79 183 117 288 199

    With the possible exception of answer choice V2, there are some pretty healthy numbers in each answer choice.  Next, let’s load up the homals package and run our non-linear PCA on the data.

    library(homals)
    fit = homals(whydonate)
    
    fit
    Call: homals(data = whydonate)
    
    Loss: 0.0003248596 
    
    Eigenvalues:
        D1     D2 
    0.0267 0.0156 
    
    Variable Loadings:
               D1          D2
    V1 0.28440348 -0.10010355
    V2 0.07512143 -0.10188037
    V3 0.09897585  0.32713745
    V4 0.20464762  0.21866432
    V5 0.26782837 -0.09600215
    V6 0.33198532 -0.04843107

    As you can see, it extracts 2 dimensions by default (it can be changed using the “ndim” argument in the function), and it gives you what looks very much like a regular PCA loadings table.

    Reading it naively, the pattern I see in the first dimension goes something like this: People tended to answer affirmatively to answer choices 1,4,5, and 6 as a group (obviously not all the time and altogether though!), but those answers didn’t tend to be used alongside choices 2 and 3.

    In the second  dimension I see: People tended to answer affirmatively to answer choices 3 and 4 as a group.  Okay, now as a simple check, let’s look at the correlation matrix for these binary variables:

    cor(whydonate)
    
               V1            V2            V3         V4          V5         V6
    V1 1.00000000  0.0943477325  0.0205241732 0.16409945 0.254854574 0.45612458
    V2 0.09434773  1.0000000000 -0.0008474402 0.01941461 0.038161091 0.08661938
    V3 0.02052417 -0.0008474402  1.0000000000 0.21479291 0.007465142 0.11416164
    V4 0.16409945  0.0194146144  0.2147929137 1.00000000 0.158325383 0.22777471
    V5 0.25485457  0.0381610906  0.0074651417 0.15832538 1.000000000 0.41749064
    V6 0.45612458  0.0866193754  0.1141616374 0.22777471 0.417490642 1.00000000

    The first dimension is easy to spot in the “V1” column above. Also, we can see the second dimension in the “V3” column above – both check out! I find that neat and easy. Does anyone use anything else to find patterns in binary data like this? Feel free to tell me in the comments!

    Multiple Classification and Authorship of the Hebrew Bible

    Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc.  I started to wonder if I could do something similar with the component books of the Torah (Hebrew bible).

    A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy).  The remaining three appear to be a confusing hodgepodge compilation from multiple authors.  I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

    The theory behind the following analysis (taken from the English textual world, of course) seems to be this: when someone writes a book, they write with a very particular style.  If you are going to detect that style statistically, it is convenient to detect it using function words.  Function words (“and”, “also”, “if”, “with”, etc.) need to be used regardless of content, and therefore should show up throughout the text being analyzed.  Each author uses a distinct number/proportion of each of these function words, and authors are therefore distinguishable based on their profiles of usage.
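    To make the profile idea concrete, here is a quick sketch (with made-up counts, not the actual Torah data) of turning raw function-word counts into per-book usage profiles:

```r
# Toy word-count matrix: rows = books, columns = function words
counts = rbind(bookA = c(al = 120, el = 90, asher = 200),
               bookB = c(al = 80,  el = 150, asher = 100))
# Convert raw counts to within-book proportions so books of different
# lengths are comparable; each row then sums to 1
profiles = counts / rowSums(counts)
round(profiles, 3)
```

    Once every book (or chapter) has such a profile row, the profiles can be fed to a classifier or a simple distance measure to compare putative authors.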

    With that in mind, I started my journey.  The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which hebrew function words to look for.  For the Torah text, I relied on the inimitable Chabad.org.  They hired good rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting html page numbers from their source code.

    Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English.  However, I found a good 32 of them, as listed below:

    Transliteration Hebrew Function Word Rough Translation Word Count
    al עַל Upon 1262
    el אֶל To 1380
    asher אֲשֶׁר That 1908
    ca_asher כַּאֲשֶׁר As 202
    et אֶת (Direct object marker) 3214
    ki כִּי For/Because 1030
    col וְכָל + כָּל + לְכָל + בְּכָל + כֹּל All 1344
    ken כֵּן Yes/So 124
    lachen לָכֵן Therefore 6
    hayah_and_variants תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה Be 819
    ach אַךְ But 64
    byad בְּיַד By 32
    gam גַם Also/Too 72
    mehmah מֶה + מָה What 458
    haloh הֲלֹא Was not? 17
    rak רַק Only 59
    b_ad בְּעַד For the sake of 5
    loh לֹא No/Not 1338
    im אִם If 332
    al2 אַל Do not 217
    ele אֵלֶּה These 264
    haheehoo הַהִוא + הַהוּא That 121
    ad עַד Until 324
    hazehzot הַזֶּה + הַזֹּאת + זֶה + זֹאת This 474
    min מִן From 274
    eem עִם With 80
    mi מִי Who 703
    oh אוֹ Or 231
    maduah מַדּוּעַ Why 10
    etzel אֵצֶל Beside 6
    heehoo הִוא + הוּא + הִיא Him/Her/It 653
    az אָז Thus 49

    This list is not exhaustive, but definitely not small!  My one hesitation when coming up with this list surrounds the Hebrew word for “and”.  “And” takes the form of a single letter that attaches to the beginning of a word (a “vav” marked with a different vowel sound depending on its context), which I was afraid to try to extract because I worried that if I tried to count it, I would mistakenly count other vav’s that were a valid part of a word with a different meaning.  It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.
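    Since the word counting was done in Python, here is a minimal sketch of how standalone tokens can be counted with word-boundary regexes.  The `count_word` function and the transliterated sample string are purely illustrative, not the actual scraper code, and note that niqqud (vowel points) in real Hebrew text can complicate `\b` matching:

```python
import re

def count_word(text, word):
    # \b word boundaries avoid matching the word inside a longer token,
    # which is the same worry described above for the prefixed vav ("and")
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

chapter = "al el al asher et et et"  # toy transliterated "chapter"
count_word(chapter, "et")  # -> 3 (does not match the "e" inside "asher")
```

This boundary-based approach works for freestanding words, but it cannot by itself separate a prefixed vav meaning “and” from a vav that is simply the first letter of a word, which is exactly the problem noted above.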

    Anyhow, following is the structure of Torah:

    ‘Chumash’ / Book Number of Chapters
    ‘Bereishit’ / Genesis 50
    ‘Shemot’ / Exodus 40
    ‘Vayikra’ / Leviticus 27
    ‘Bamidbar’ / Numbers 36
    ‘Devarim’ / Deuteronomy 34

    Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

    > m = ggplot(torah, aes(x=wordcount))
    > m + geom_histogram() + facet_grid(chumash ~ .)

    Word Count Dist by Chumash

    You can see that the books are not entirely different in terms of the word counts of their component chapters, except perhaps Vayikra, which tends towards shorter chapters.
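    A quick numeric complement to the histograms is the mean chapter length per book; here is a sketch in Python, with toy numbers standing in for the real per-chapter word counts:

```python
# Toy (book, wordcount) pairs standing in for the real data
chapters = [
    ("Bereishit", 400), ("Bereishit", 500), ("Bereishit", 450),
    ("Devarim", 300), ("Devarim", 350), ("Devarim", 320),
]

# Group the counts by book, then average each group
grouped = {}
for book, wc in chapters:
    grouped.setdefault(book, []).append(wc)
means = {book: sum(v) / len(v) for book, v in grouped.items()}
```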

    After making a Python script to count the above words within each chapter of each book, I loaded it up into R and split it into a training and testing sample:

    torah$randu = runif(187, 0, 1)
    torah.train = torah[torah$randu <= .4,]
    torah.test = torah[torah$randu > .4,]
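    The same uniform-random split can be sketched in Python; one refinement worth adding (in R, via `set.seed()`) is seeding the generator so the split is reproducible from run to run:

```python
import random

random.seed(42)  # seeding makes the split reproducible across runs
chapters = list(range(187))  # stand-ins for the 187 chapter rows
randu = [random.uniform(0, 1) for _ in chapters]

# Roughly 40% of rows land in training, the rest in testing
train = [c for c, r in zip(chapters, randu) if r <= 0.4]
test = [c for c, r in zip(chapters, randu) if r > 0.4]
len(train) + len(test)  # -> 187, and no chapter appears in both
```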

    For this analysis, it seemed that using Random Forests made the most sense.  However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:

    torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)
    
    torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)
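    The `_1` suffix in the second model appears to denote the proportion version of each count.  A toy Python sketch of that derivation, with illustrative names and numbers:

```python
# Toy chapter: raw function-word counts plus the chapter's total word count
wordcount = 200
counts = {"et": 10, "ki": 4}

# Each *_1 predictor is presumably the raw count divided by chapter length
props = {word + "_1": n / wordcount for word, n in counts.items()}
props["et_1"]  # -> 0.05
```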

    As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

    > importance(torah.rf)

     Word MeanDecreaseAccuracy MeanDecreaseGini
    hayah 31.05139 5.979567
    heehoo 20.041149 4.805793
    loh 18.861843 6.244709
    mah 18.798385 4.316487
    al 16.85064 5.038302
    caasher 15.101464 3.256955
    et 14.708421 6.30228
    asher 14.554665 5.866929
    oh 13.585169 2.38928
    el 13.010169 5.605561
    gam 5.770484 1.652031
    ki 5.489 4.005724
    haheehoo 2.330776 1.375457

    > importance(torah.rf.props)

    Word MeanDecreaseAccuracy MeanDecreaseGini
    asher_1 37.074235 6.791851
    heehoo_1 29.87541 5.544782
    al_1 26.18609 5.365927
    el_1 17.498034 5.003144
    col_1 17.051121 4.530049
    hayah_1 16.512206 5.220164
    loh_1 15.761723 5.157562
    ele_1 14.795885 3.492814
    mi_1 12.391427 4.380047
    gam_1 12.209273 1.671199
    im_1 11.386682 2.651689
    oh_1 11.336546 1.370932
    mah_1 9.133418 3.58483
    caasher_1 5.135583 2.059358

    It’s funny that, from a raw-numbers perspective, hayah, the Hebrew verb “to be”, shows up at the top of the list.  That’s the same result as in the Shakespeare et al. analysis!  Having established that all variables in each model had some kind of an effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the Torah could be classified by that model:

    > torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
    > torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")
    
    > xtabs(~torah.test$chumash + torah.test$pred.chumash)
                      torah.test$pred.chumash
    torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
           'Bamidbar'            4            5          2         8          7
           'Bereishit'           1           14          1        14          2
           'Devarim'             1            2         17         0          1
           'Shemot'              2            4          4         9          2
           'Vayikra'             5            0          4         0          5
    
    > prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
                      torah.test$pred.chumash
    torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
           'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
           'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
           'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
           'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
           'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286
    
    > xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
                      torah.test$pred.chumash.props
    torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
           'Bamidbar'            0            5          0        13          8
           'Bereishit'           1           16          0        13          2
           'Devarim'             0            2         11         4          4
           'Shemot'              1            4          2        13          1
           'Vayikra'             3            3          0         0          8
    
    > prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
                      torah.test$pred.chumash.props
    torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
           'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
           'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
           'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
           'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
           'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

    So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of the chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) had over 50% of their chapters correctly classified in the testing sample.
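    As a single summary number, the overall accuracy of the raw-count model can be computed directly from the first confusion matrix above:

```python
# Confusion matrix for the raw-count model, copied from the cross-tab
# above (rows = actual book, columns = predicted book)
conf = [
    [4, 5, 2, 8, 7],    # Bamidbar
    [1, 14, 1, 14, 2],  # Bereishit
    [1, 2, 17, 0, 1],   # Devarim
    [2, 4, 4, 9, 2],    # Shemot
    [5, 0, 4, 0, 5],    # Vayikra
]
correct = sum(conf[i][i] for i in range(5))  # diagonal = correct calls
total = sum(sum(row) for row in conf)
round(correct / total, 2)  # -> 0.43
```

So the model is well above the 20% chance level overall, but most of that lift comes from Devarim and Bereishit.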

    I’m tempted to say here, at the least, that there is evidence that Devarim was written with one stylistic framework in mind, and potentially one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of a single stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?

    The last part of this analysis comes from a suggestion given to me by a friend: once I had modelled the unique profiles of function words within each book of the Torah, I should run that model on some post-Torah texts.  Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings.  If the same author was behind all four books, then when I run my Torah-trained model on these books, their chapters should tend to be classified as Devarim/Deuteronomy more often than as the other books.

    As above, I show below the distribution of word counts by book, for comparison’s sake:

    > m = ggplot(neviim, aes(wordcount))
    > m + geom_histogram() + facet_grid(chumash ~ .)

    Word Count Dist by Prophets Book

    Interestingly, chapters in these books of the Prophets seem to be a bit longer, on average, than those in the Torah.

    Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my random forest Torah models on them.  As you can see below, the results were anything but clear on the matter:

    > xtabs(~neviim$chumash + neviim$pred.chumash)
                  neviim$pred.chumash
    neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
          'Joshua'           3            7          7         6          1
          'Judges'           2           11          2         6          0
          'Kings'            0            8          3        10          1
    
    > xtabs(~neviim$chumash + neviim$pred.chumash.props)
                  neviim$pred.chumash.props
    neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
          'Joshua'           2            8          6         7          1
          'Judges'           0            9          2         9          1
          'Kings'            0            7          6         7          2

    I didn’t even bother to re-express these tables as fractions, because it’s quite clear that none of the three books of the Prophets that I analyzed was predominantly classified into any one category.  Looking at these tables, I don’t yet see any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these later texts.
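    For what it’s worth, the Devarim column of the first cross-tab works out to the following fractions, which bear out the lack of a strong Deuteronomist signal:

```python
# Chapters classified as Devarim by the raw-count model, and each book's
# total chapters, both read off the first cross-tab above
devarim_hits = {"Joshua": 7, "Judges": 2, "Kings": 3}
totals = {"Joshua": 24, "Judges": 21, "Kings": 22}
fracs = {b: round(devarim_hits[b] / totals[b], 2) for b in totals}
```

Even Joshua, the best of the three, has under a third of its chapters classified as Devarim.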

    Conclusion

    I don’t think that this has been a full enough analysis.  There are a few things in it that bother me, or make me wonder, that I’d love input on.  Let me list those things:

    1. As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis.  I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
    2. Similar to my exclusion of “and”, there are a few one letter prepositions that I have not included as individual predictor variables.  Namely ל, ב, כ, מ, signifying “to”, “in”/”with”, “like”, and “from”.  How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
    3. Is it more valid to consider the results of the analysis done on raw word counts, or the one done on proportions (each word count divided by the total number of words in the chapter), or are both valid?
    4. Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis?  I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

    All that being said, I’m intrigued that when you look at the raw number of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent.  If you’d like, you can look at the python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding!  You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set.  By all means, do the analysis yourself and tell me how to do it better 🙂