Easy and Beautiful Plotting

I just read an article from an R blog called Revolutions, in which the author reviews a book called “R in Action”.  What I like best in this blog post is the example he gives of a really nice-looking graph that doesn’t take very much code to create:

[Image: the graph from the Revolutions post]

It is elegant and informative, and here is the code he tells us is necessary to create it (bearing in mind that the data is readily available from the “car” library in R):

[Image: Scatterplot with grouped regression lines]

For this kind of graphing, and this little code, I would definitely want to graph in R.
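As a rough illustration of how little code a plot like this needs, here is a minimal base R sketch. It is not the code from the Revolutions post: it assumes the built-in mtcars data set and groups cars by number of cylinders, both chosen here purely for illustration.

```r
# Assumption: mtcars (built into R) stands in for the data used in the post.
groups <- sort(unique(mtcars$cyl))
colours <- seq_along(groups)  # one base R colour index per group

# Scatterplot of fuel economy against weight, coloured by cylinder count.
plot(mtcars$wt, mtcars$mpg,
     col = colours[match(mtcars$cyl, groups)], pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "MPG vs. weight by number of cylinders")

# Fit one linear regression per group and draw each fitted line.
fits <- lapply(groups, function(g) {
  lm(mpg ~ wt, data = mtcars[mtcars$cyl == g, ])
})
for (i in seq_along(groups)) {
  abline(reg = fits[[i]], col = colours[i], lwd = 2)
}

legend("topright", legend = paste(groups, "cylinders"),
       col = colours, pch = 19, lwd = 2)
```

Packages such as car or ggplot2 can produce a similar plot in a single call, at the cost of an extra dependency.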

Excel vs. R for Plotting Data

Today I made a couple of bar graphs, using the ggplot2 package in R, just to depict the relative importance of different terms in my predictive models.  After seeing the result, I asked myself:

Is it really worth the annoyance of setting up the data frame, typing in the code (including reordering the factor levels based on the stats I was displaying), and then exporting the result, when I can get a pretty similar-looking graph in Excel in less time and with less effort?

Really… why do I need to spend more time when the main differences between the graphs I can make in either program amount to the background colour (gray vs. white), axis text salience (slightly grayed out vs. more visible), numbering of the y axis, bar colour, and gridlines?  When I compare simple plots, that’s all it amounts to for me.

I think if simple plots are all that I want, I’m going to stick with Excel.  For anything fancier that I probably wouldn’t do in Excel, I’ll use R and ggplot2, or just the plain vanilla R plotting options.  I need to save my time!

Custom Making a Data Frame in R for Graphing Purposes

I have no problems importing all sorts of data sets into R for statistical analysis.  However, there is always the step where I’ve extracted the relevant stats that I want to put into multiple graphs.  When I get the particular stats I need, I usually just stick them in Excel, in tabular form, and then just make my graphs from there.  

Excel is great because it then becomes ridiculously easy to customize my graphs from within it.  Today, however, I kept trying to copy and paste the tables I made in Excel back into R to try building some fancy graphs with ggplot2.  I kept getting very frustrated, though, because R is very picky about text copied and pasted from Excel.

I realized that if I want to remove the frustration of copying tables from Excel, I should just construct these “pre-graph” tables in R instead.
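A minimal sketch of what such a “pre-graph” table could look like. The term names and importance values are made up for illustration, and `pregraph` is a hypothetical name. The ggplot2 part only runs if that package is installed.

```r
# Hypothetical stats: made-up importance scores for four model terms.
term <- c("age", "income", "region", "tenure")
importance <- c(0.12, 0.45, 0.08, 0.35)

# Put the label column and the stat column together in one data frame.
pregraph <- data.frame(term = term, importance = importance)

# Reorder the factor levels by the statistic so the bars come out sorted.
pregraph$term <- reorder(pregraph$term, pregraph$importance)

# Base R bar chart, drawn in level order.
in_level_order <- order(pregraph$term)
barplot(pregraph$importance[in_level_order],
        names.arg = levels(pregraph$term),
        ylab = "Importance")

# The same chart with ggplot2, if it is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pregraph, aes(x = term, y = importance)) +
    geom_bar(stat = "identity")
  print(p)
}
```

Building the table directly in R means there is nothing to paste from Excel, and the factor reordering happens in one line.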

The above is a very simple example, but I haven’t been doing even this much at work to use R graphics instead of Excel graphs.

In sum:

  1. Get the needed stats
  2. Type up a column or more with text items that categorize the stats from step 1
  3. Type up a column with the stats from step 1
  4. Put those columns together in a data frame and chart.

The sqldf package in R. Awesome!

I only recently learned how to use the sqldf package to perform SELECT queries on the data frames I have loaded into the R workspace.  There’s a lot I already know how to do when selecting or subsetting data in R, but I don’t know how to do the equivalent of joining two datasets by some ID variable.

Luckily, that kind of operation is easy when you put it into SQL syntax.  It basically looks something like this:

newdata = sqldf('SELECT dataframe1.*, dataframe2.somevariable FROM dataframe1 LEFT JOIN dataframe2 ON dataframe1.ID = dataframe2.ID')
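Here is a small runnable sketch of that join, using two made-up data frames (the contents of `dataframe1` and `dataframe2` are assumptions for illustration). For comparison, it also shows base R's merge(), which can express the same kind of left join. The sqldf part only runs if the package is installed.

```r
# Hypothetical toy data: ID 2 has no match in dataframe2.
dataframe1 <- data.frame(ID = c(1, 2, 3), group = c("a", "b", "c"))
dataframe2 <- data.frame(ID = c(1, 3), somevariable = c(10, 30))

# Base R left join: keep every row of dataframe1 (all.x = TRUE).
newdata <- merge(dataframe1, dataframe2, by = "ID", all.x = TRUE)
print(newdata)

# The same join written as SQL, if sqldf is available.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  newdata_sql <- sqldf("SELECT dataframe1.*, dataframe2.somevariable
                        FROM dataframe1
                        LEFT JOIN dataframe2
                        ON dataframe1.ID = dataframe2.ID")
  print(newdata_sql)
}
```

In both versions the unmatched row (ID 2) is kept, with a missing value for somevariable.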

Things get a bit more complicated when you have dots in your dataframe and variable names.  Below is what the call to sqldf has to look like if you are in a situation like that:

newdata = sqldf('SELECT "data.frame1".*, "data.frame2".some_variable FROM "data.frame1" LEFT JOIN "data.frame2" ON "data.frame1".ID = "data.frame2".ID')

You’ll notice that I had to put double quotes around the data frame references, so that the SQL code didn’t get confused by the presence of dots.  In the version of sqldf I’m using, the underscore in the variable reference “some_variable” actually translates into “some.variable” when it looks for the referent in the R data frame (this behaviour may differ between sqldf versions).  It’s a little messy, but totally worth it when I consider that I don’t even know how to do this kind of operation in R otherwise!
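A minimal sketch of the quoting, with made-up data. To sidestep the version-dependent handling of dots in column names, the joined column here is a hypothetical `score` with no dot in it, so only the data frame names need double quotes. The query is only run if sqldf is installed.

```r
# Hypothetical toy data frames whose names contain dots.
data.frame1 <- data.frame(ID = c(1, 2, 3), group = c("a", "b", "c"))
data.frame2 <- data.frame(ID = c(1, 3), score = c(10, 30))

# Single quotes delimit the R string; double quotes protect the dotted
# table names inside the SQL.
query <- 'SELECT "data.frame1".*, "data.frame2".score
          FROM "data.frame1"
          LEFT JOIN "data.frame2"
          ON "data.frame1".ID = "data.frame2".ID'

if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  newdata <- sqldf(query)
  print(newdata)
}
```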