Load Packages Automatically in RStudio

I recently finished a long stretch of work on a project that required me to draw on four R packages.  Each time I got back to work on the project, I’d have to load the packages manually, as needed.  It got really annoying and constantly made me wonder whether there was some way to get these packages loaded automatically every time I switched to that project in RStudio, so that I didn’t have to waste my time.

I finally got around to looking for an answer, and found it on an RStudio suggestions forum.  You create a .Rprofile file in your project directory, either from within RStudio (see the example below) or from a text editor, and then just type in the library commands that load the R packages you want.  Save it, and from then on, every time you switch to your project in RStudio, the packages most relevant to your project will be loaded automatically!  Here’s an example of making the file from within RStudio:

> file.edit("~/Desktop/foo/.Rprofile")
# This opens up a script window, within which you can enter in your library commands
library(ggplot2)
library(scales)
library(plyr)
library(reshape2)

Easy!

Are scatterplots too complex for lay folks?

Usually, I like to write about the solutions to problems I’ve had, but today I only have a problem to write about.

This is the second research job I’ve had outside of academia, and in both cases I’ve met with resistance when I’ve tried to display bivariate relations using scatterplots. For example, a colleague came by my work computer yesterday while I had on screen a scatterplot with a linear trend line showing. She looked at the plot and blurted out, “Wow, that looks complicated!”

One option that I’ve tried in the past, where I’d normally use a scatterplot, is to discretize, or bin, the x variable and show the average y value for each range of the x variable. This certainly can help to show the direction of the relation, but it hides the number of data points that go into the averages. I like being able to see the diagonal, stick-like orientation of a dot cloud signalling a strong correlation, or the loose, circular orientation that signals no apparent correlation.
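For what it’s worth, that binned-average fallback takes only a few lines in R.  Here’s a minimal sketch using the built-in mtcars data (the variables and the choice of 5 bins are just illustrative):

```r
# Bin the x variable into ranges and show the average y per bin,
# in place of a scatterplot of y against x
x <- mtcars$wt    # e.g. car weight as the x variable
y <- mtcars$mpg   # e.g. fuel economy as the y variable
bins <- cut(x, breaks = 5)          # discretize x into 5 ranges
bin.means <- tapply(y, bins, mean)  # average y within each range
barplot(bin.means, xlab = "Weight bin", ylab = "Mean mpg")
```

The trade-off described above shows up immediately: the direction of the relation survives, but the number of points behind each bar does not (printing the bin counts as bar labels is one partial remedy).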

Do I have to say goodbye to the scatterplot now that I’m outside of academia? Will I ever be able to fruitfully use it again to communicate results to lay folks? Are the bar graph and line graph my only viable tools for data visualization now? Is there something I’m missing that could help people see scatterplots as helpful representations of bivariate relations? I’d appreciate answers to any of my questions.

Bar Graph Colours That Work Well

Ever since I started using ggplot2 more often at work to make graphs, I’ve realized something about the use of colour in bar graphs vs. dot plots: when I’m looking at a graph displayed on the brilliant ViewSonic monitor I use at work, the same relatively intense colours that work well in a dot plot start to bother me in a bar graph.  Take the bar graph immediately below, for example.  The colour choice is not a bad one, but there’s something about the intensity of the colours that makes me want to find a new set somewhat more soothing to my eyes.

The first resource I found was a “color encyclopedia” website called Color Hex, where I started looking for colours that seemed more soothing and could be used to compare 3 bars against one another in a bar graph.  You can search for colour names, look colours up by their hexadecimal values, or even browse a list of “web safe colors”.  I stumbled upon the particular purple displayed in the graph below, and the site simply gave me the other colours in the triad as suggestions.

Looking at this triad of colours, I’m actually quite pleased, but I still didn’t really understand why these colours worked, or how to select a new triad that wouldn’t bother me.  I shuffled through many different colours on the Color Hex website, and nothing else seemed to work for me, as I wasn’t selecting colours based on any theory.

Then I stumbled upon an article by the good people at Perceptual Edge.  They seemed to confirm my earlier observation that the same intense colours that work well in dot plots don’t work so well in bar plots.   Their solution is a simple one: choose from a list of colours of medium intensity.  On page 6 of the document linked above, they offer 8 different hues that look nice in a bar plot.  All I had to do to use these colours in the plots below was take a screenshot of the document, bring it into Inkscape, and hover the eye-dropper tool over the colours to get their hexadecimal values.  If you’re interested in using the values, I’ve typed them out at the bottom of this post.  Now take a look at the graphs below:

The two graphs above follow the same principle that I had unknowingly touched upon when I chose the colours from the Color Hex website: stick with medium intensity, and your eyes won’t be jarred by the colour contrast.  I like that!

Anyway, below I show you the code I used to manually input the hexadecimal colour values into my ggplot bar graphs, and the list of 8 hexadecimal colour values corresponding with the colour boxes on page 6 of the Perceptual Edge document.  The variables a, c, and b were just variable names from a mock data frame that I cooked up for the purpose of the plots.

> colours = c("#599ad3", "#f9a65a", "#9e66ab")
> ggplot(e, aes(x=a, y=c, fill=b)) + geom_bar(stat="identity", position="dodge") + coord_flip() + scale_fill_manual(values=colours)

#727272
#f1595f
#79c36a
#599ad3
#f9a65a
#9e66ab
#cd7058
#d77fb3
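If you’d rather not retype them, here are those same eight values as an R vector, ready to drop into scale_fill_manual (the name pe.colours is just my own):

```r
# The eight medium-intensity Perceptual Edge hues listed above
pe.colours <- c("#727272", "#f1595f", "#79c36a", "#599ad3",
                "#f9a65a", "#9e66ab", "#cd7058", "#d77fb3")
# e.g.: ... + scale_fill_manual(values = pe.colours[1:3])
```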

Functions ddply and melt make plotting summary stats in R more tolerable

The main reason why I have usually chosen Excel for my plots at work is that I had difficulty feeding summary stats in R into a plotting function.  One thing I learned this week is how to turn summary stats into a data frame suitable for plotting, making the whole process of plotting in R more tolerable for me.  Below I show the process using the ever-popular iris dataset.  I use ddply (from the plyr package) and melt (from reshape2) to both summarize and restructure the data into a form amenable to plotting.

> library(plyr)
> library(reshape2)
> length.by.species = ddply(iris, "Species", function (x) quantile(x$Sepal.Length, c(.25,.5,.75)))
> length.by.species
     Species   25% 50% 75%
1     setosa 4.800 5.0 5.2
2 versicolor 5.600 5.9 6.3
3  virginica 6.225 6.5 6.9
> length.by.species = melt(length.by.species, variable.name="Quantile", value.name="Sepal.Length")
> length.by.species
     Species Quantile Sepal.Length
1     setosa      25%        4.800
2 versicolor      25%        5.600
3  virginica      25%        6.225
4     setosa      50%        5.000
5 versicolor      50%        5.900
6  virginica      50%        6.500
7     setosa      75%        5.200
8 versicolor      75%        6.300
9  virginica      75%        6.900

One thing you can see in my call to ddply is that the main qualitative variable, whose values are used to subset your data frame, is referred to in quotes.  Somehow I find that a bit weird (I’m used to referring to variables without quotes, I suppose!).  Other than that, the syntax of ddply is similar enough to the apply family of functions, so no more complaints here.  You can also see that the function returns a nice, neat data frame where the quantiles I asked for are columns, and the values of the Species variable mark the rows (the subsets of the data frame).

The melt command is easy enough: it simply wants to know what to call the column that will hold the former column titles (Quantile!) and what to call the numeric measure that the values come from (Sepal.Length).

Now that the summary stats are in a “long” format data frame, with one column holding the numbers and two columns holding text, it’s just a simple one-liner to create a graph (here done in ggplot2).  Below I show one line to create a dodged bar graph, and another line to create a dot plot, both showing the 1st to 3rd quartiles of Sepal.Length by Species.

ggplot(length.by.species, aes(y=Sepal.Length, x=Species, fill=Quantile)) + geom_bar(stat="identity", position="dodge")
ggplot(length.by.species, aes(x=Sepal.Length, y=Species, colour=Quantile)) + geom_point(size=4)

Thank you ddply and melt!

An embarrassing admission: copy-pasting tables with text containing spaces from Excel to R

I can’t believe I didn’t learn this earlier, but I never knew how to copy tables from Excel that had text with spaces in them, and paste them into a data frame in R, without the spaces being mistaken for field separators.

Say you have a column title in an Excel table like “Group Size”.  You copy the table and try to load it into R using: my.table = read.table("clipboard", header=TRUE).  R then gives you an error like this:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 2 elements

That’s where I stopped most times, making sure that my column titles had no spaces in them (replacing spaces with dots), or just doing something else entirely.  Finally, I learned that if you add one teeny tiny argument to the read.table command… read.table("clipboard", header=TRUE, sep="\t") … then all text values with spaces in them actually get imported into R without a fuss!  Spaces in column names get converted into dots automatically, but actual text values in the columns are untouched and accessible as they should be in R.  Yay!
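You can see the difference without touching the clipboard by feeding read.table a string via its text argument (the little table here is made up):

```r
# A made-up tab-separated table with spaces in both a column name
# and the text values
tab <- "Group Size\tCount\nSmall group\t10\nLarge group\t25"
# Without sep="\t", any whitespace splits fields and line 1 ends up
# with too many elements; with sep="\t", only tabs separate fields
my.table <- read.table(text = tab, header = TRUE, sep = "\t")
names(my.table)       # "Group.Size" "Count" -- the space became a dot
my.table$Group.Size   # the text values keep their spaces
```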

Memory Management in R, and SOAR

The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer (keep in mind, I’ve got a quad core with 8 GB of RAM).  With growing irritation at how slow the machine gets at times while working with these data, I took to finding better ways of managing my memory in R.

The best/easiest solution I’ve found so far is a package called SOAR.  Put simply, it lets you store specific R objects (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to analyze them in R without having them loaded into your RAM.  I emphasize analyze because every time I try to add variables to a stored data frame, it comes back into RAM and once again slows me down.

An example might suffice:

> r = data.frame(a=rnorm(10,2,.5),b=rnorm(10,3,.5))
> r
          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] “r”

So, the first thing I did was make a data frame with some columns, which got stored in my workspace, and thus loaded into RAM.  Then I loaded the SOAR library and set my local cache to “testsession”.  The practical implication is that a directory gets created within R’s current working directory (in my case, “/home/inkhorn/testsession”), and any objects passed to the Store command get saved as RData files in that directory.

Sure enough, you see my workspace before and after I store the r object.  Now you see the object, now you don’t!  But then, as I show, even though the object is not in the workspace, you can still analyze it (in my case, calculate a mean from one of the columns).  However, as soon as I try to make a new column in the data frame… voila … it’s back in my workspace, and thus RAM!

So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement.  Every time I’m done making new columns in the data frame, I just pass the object to the Store command, and away to the hard disk it goes, out of my RAM.  It’s quite liberating not having a stupendously heavy workspace; saving and loading a heavy workspace when leaving or entering R takes forever.  With the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.

Another thing I noticed is that the GLMs I’ve generated take up a lot of RAM as well, and slow things down, if I keep them in my workspace.  So, with the main data frame written to disk, and the GLMs kept out of memory, R is flying again!

Ack! Duplicates in the Data!

As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL.  To get this massive data set, I had to write a query that was massive by dint of the number of LEFT JOINs in it.  Today I caught myself wondering if I had remembered to add DISTINCT to the SELECT clause of my query, as that would prevent any obvious duplicates.  I looked and, whoops, I hadn’t.  It turns out that I didn’t have too many duplicates (only 135 out of about 395,000 rows), but I was still bothered.

I took a look at the duplicates and found a small bunch in the main name and contact table where the ID number was the same across records, but one record had actual address and contact info, while the other was pretty bare (in other words, missing information in various contact fields).  Although there was only one duplicate for each affected record, the many LEFT JOINs multiplied the number of duplicate rows to about 135.

I really didn’t want to export the data again from the database and redo all the calculations I had made, so I had to find some way of cleaning up all these duplicates from within R.  Thankfully, I had turned the many contact fields in the data into dummy variables in view of the data mining that I had to do later on (I love dummy variables!).  This made it possible to distinguish duplicates based on whether they had more or fewer fields filled out, enabling me to remove them, as shown in the script below:


# These column numbers represent fields with name/contact info that I've
# marked with 1s and 0s depending on whether or not there's anything in
# the field.
bio_cols = c(5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26)
# Now we get the row numbers of all the records with duplicate IDs
dupe_id_rows = which(duplicated(big.dataset$ID))
# Now we get the actual ID values that were duplicated
dupe_ids = big.dataset$ID[dupe_id_rows]
# The following line gives us a sum of the number of fields, for each
# record, that have information in them
info_sums = apply(big.dataset[dupe_id_rows, bio_cols], 1, sum)
# Next we build a data frame that glues this information together. The
# goal is to isolate those rows representing records with missing info,
# and then delete them from the larger data set
a = data.frame(sums=info_sums,ids=dupe_ids,rows=dupe_id_rows)
# Now we order the data frame by the ID values
a = a[order(a[,2]),]
# Here we get the mean number of fields, for each ID value, filled out
# between duplicates that have more information, and duplicates that have
# missing information
a$avg = rep(tapply(a$sums, a$ids, mean), tapply(a$sums, a$ids, length))
# Here we make a column that marks a row for deletion (with a 1 instead
# of a 0) if the sum of fields that have information in them is less
# than the mean number for that ID value. Those rows that are marked
# are the duplicates that are missing info.
a$del = ifelse(a$sums < a$avg,1,0)
# The following is a full list of unique row numbers from the bigger
# dataset that constitute duplicates that are missing information.
rows_to_del = a$rows[a$del == 1]
# Here we delete those rows from the bigger dataset
big.dataset = big.dataset[-rows_to_del,]
# Since all remaining dupes are exact duplicates, this line removes them
big.dataset = big.dataset[!duplicated(big.dataset$ID),]

Well, it worked!  No more problems with duplication.  Hopefully I’ll remember to use DISTINCT in my SQL queries in the future 🙂
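As an aside, the core idea compresses nicely.  Here’s a toy version on a made-up data frame, using ave in place of the tapply bookkeeping above: for each duplicated ID, keep the row with the most fields filled in.

```r
# Toy data: IDs 1 and 3 are duplicated; the 0/1 columns are dummy
# variables marking whether each contact field holds information
df <- data.frame(ID    = c(1, 1, 2, 3, 3),
                 name  = c(1, 1, 1, 1, 0),
                 addr  = c(1, 0, 1, 1, 0),
                 phone = c(1, 0, 0, 1, 0))
info_sums <- rowSums(df[, c("name", "addr", "phone")])
# Within each ID, keep only the row(s) with the most information...
df <- df[ave(info_sums, df$ID, FUN = max) == info_sums, ]
# ...then drop any remaining exact ties on ID
df <- df[!duplicated(df$ID), ]
nrow(df)  # 3 -- one row per ID, the fullest one each time
```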

Mining for relations between nominal variables

The task today was to find which variables had significant relations with an important grouping variable in the big dataset I’ve been working with lately.  The grouping variable has 3 levels, and represents different behaviours of interest.  At first I tried putting the grouping variable as the dependent variable in a multinomial logistic regression, but I didn’t really trust the output, and the goal was really just to construct a bunch of graphs showing significant bivariate nominal relations in the data.

That’s when I turned to my good old friend, the chi-squared test.  All I had to do was select the variables that I wanted to test against the grouping variable, and build a list containing, for each test, the chi-squared statistic, the name of the variable being tested, and the crosstab of the two variables for later graphing.  So that’s exactly what I did:


testvars = c(6,7,9,10,11,12,13,14,16,17,18,19,20,21,23,24,25,26,384,375,376,386,385,387,388)
resultlist = c()
for (i in testvars) {
  xsq = chisq.test(big.dataset[,i], big.dataset$DV_3lvls)$statistic
  varname = names(big.dataset)[i]
  tab = xtabs(~DV_3lvls + big.dataset[,i], data=big.dataset)
  resultlist = rbind(resultlist, list(chisq=xsq, testvar=varname, xtab=tab))
}

One really sweet thing about matrices in R is that you can mix them up, with some cells holding numbers, some holding text, and sub-objects in others!  A typical row of resultlist would look something like this:

     xsq   testvar     xtab
[1,] 200.7 "variable1" numeric,6

Then all I needed to do to see the variable name and crosstab for that variable was call resultlist[1,2:3], and that gave me the numbers to graph.
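Here’s a runnable miniature of that loop on built-in data, testing a couple of mtcars factors against cyl (the variable choices are arbitrary; suppressWarnings covers the small-expected-count warnings you’d get from toy data):

```r
resultlist <- c()
for (v in c("vs", "am")) {
  xsq <- suppressWarnings(chisq.test(mtcars[[v]], mtcars$cyl))$statistic
  tab <- table(mtcars[[v]], mtcars$cyl)
  # rbind-ing lists builds a matrix whose cells hold mixed types
  resultlist <- rbind(resultlist, list(chisq = xsq, testvar = v, xtab = tab))
}
resultlist[1, 2:3]  # the variable name and its crosstab
```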

Guess who wins: apply() versus for loops in R

Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data! Every time I clicked on a different ribbon, the screen didn’t even register the click. So I took the hint, and decided to do my data processing in R.

One of the tasks I needed to do was calculate, in each row of the dataset, a maximum value from monetary values in 5 different fields. The first thing I noticed was that the regular max() function in R doesn’t quite like it when you ask for the maximum of a series of NA values: with na.rm=TRUE, it returns -Inf (with a warning) when every value is NA. So, I decided to create a “safe” max function:


safe.max = function(invector) {
  na.pct = sum(is.na(invector)) / length(invector)
  if (na.pct == 1) {
    return(NA)
  } else {
    return(max(invector, na.rm = TRUE))
  }
}

Finding that it was working, I then constructed a simple for loop to iterate through my ~395,000 rows. As you could imagine, this was taking forever! After much looking around, I realized that the best solution was actually a base function, apply()!!

I constructed my “max” variable with one simple line of code:

big.dataset$max_money = apply(as.matrix(big.dataset[,214:218]), 1, safe.max)
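A quick sanity check on a toy matrix (safe.max as defined above, repeated here in slightly condensed form so the snippet runs on its own):

```r
# Condensed version of the safe.max function from above
safe.max <- function(invector) {
  if (all(is.na(invector))) NA else max(invector, na.rm = TRUE)
}
m <- rbind(c(1, NA, 3),
           c(NA, NA, NA))
apply(m, 1, safe.max)  # 3 for the first row, NA for the all-NA second row
```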

Compared to the for loop, which was taking forever, this method was a breeze! It took less than a minute to get through the whole data set. Moral of the story? When you’re dealing with lots of data, code as efficiently as possible!

Projects in RStudio

Now that I have one enormous project and one smaller one on the go, I find it helps me considerably to keep each one in its own RStudio project.  Each project has its own scripts that I’ve been working on, its own extra variables or data frames that I’ve had to construct, and doesn’t take up more of my RAM than need be.  I like the way RStudio helps keep me organized!