Jesse Cambon: Musings on data and open source software

Tidygeocoder 1.0.4 (2021-10-17)

Tidygeocoder v1.0.4 is released! đŸŸ This release adds support for the Geoapify geocoding service (thanks Daniel Possenriede!), a progress bar, more helpful console output, and new functions for combining the results of multiple geocoding queries. A more detailed overview of the changes in this release is available in the changelog.

Progress Bars and Console Output

Progress bars are now displayed for single-input geocoding queries (i.e. not batch queries). Additionally, console messages now show by default which geocoding service was used, how many addresses or coordinates were given to it, and how long the query took to execute.

The progress_bar parameter can be used to toggle the use of the progress bar while the quiet parameter can be used to silence console messages that are displayed by default. See the documentation for geo() or reverse_geo() for details.

Additionally, the quiet, progress_bar, and verbose parameters can now be set permanently via options(). For example, options(tidygeocoder.progress_bar = FALSE) will disable progress bars for all queries.
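
As a sketch of how this looks in a session (the tidygeocoder.quiet option name is assumed to follow the same tidygeocoder.<parameter> pattern shown above for the progress bar):

```r
# Set tidygeocoder console behavior once per session via options()
options(
  tidygeocoder.progress_bar = FALSE, # disable progress bars for all queries
  tidygeocoder.quiet = TRUE          # silence the default console messages
)
```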

Combining Multiple Queries

In past releases of the package, method = "cascade" could be used in the geo() and geocode() functions to combine the results of geocoding queries from two different services. The “cascade” method is now deprecated in favor of two new and more flexible functions: geocode_combine() and geo_combine(). These functions allow for executing and combining the results of more than two queries and they allow the queries to be fully customized.
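
For vector (non-dataframe) inputs, geo_combine() plays the analogous role. A minimal sketch, assuming geo_combine() accepts the same queries list format and the vector inputs of geo() (the addresses here are illustrative):

```r
library(tidygeocoder)

# Sketch: try the US Census geocoder first, then fall back to ArcGIS
# for any inputs the first query could not match
fallback_results <- geo_combine(
  queries = list(
    list(method = 'census'),
    list(method = 'arcgis')
  ),
  address = c('1600 Pennsylvania Ave NW, Washington, DC', 'Buenos Aires')
)
```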

To demonstrate the utility of these new functions, below I’ve assembled a dataset of addresses to geocode. The first five are street-level addresses in the United States that can be geocoded with the US Census geocoding service. However, three of these addresses will not return results from the US Census batch service (see issue #87 for more information) and must instead be geocoded with the US Census single-address geocoder. Also, the last three addresses are cities outside the United States and require a different geocoding service entirely (the US Census service only covers the United States).

library(tidyverse)
library(tidygeocoder)

mixed_addresses <- tribble(
  ~street_address, ~city, ~state_cd, ~zip_cd,
  "624 W DAVIS ST #1D",   "BURLINGTON", "NC",  27215,
  "201 E CENTER ST #268", "MEBANE",     "NC",  27302,
  "7833  WOLFE LN",       "SNOW CAMP",  "NC",  27349,
  "202 C St",             "San Diego",  "CA",  92101,
  "121 N Rouse Ave",      "Bozeman",    "MT",  59715
) %>%
  bind_rows(
    tibble(city = c('Taipei', 'Moscow', 'Buenos Aires'))
  )

If we wanted to geocode a large dataset with addresses such as these, we might first try to geocode as many as possible via the US Census batch service, then attempt the remaining addresses with the US Census single address geocoder, and then finally send any remaining unfound addresses to another service. We’ll accomplish this workflow in the code below.

The geocode_combine() function accepts a dataframe input and a list of queries, where each query is itself a list (i.e. a list of lists). Each list in the queries argument contains parameters that are passed to the geocode() function. Optionally, the query_names argument can be used to specify a label for each query’s results.

Below, the street, city, state, and postalcode arguments are specified for the first two queries, while the address argument (i.e. a single-line address) is pointed at the city column for the third query (the ArcGIS service only accepts a single-line address argument and doesn’t use address component arguments such as city and state).

results <- mixed_addresses %>%
  geocode_combine(
    queries = list(
      list(method = 'census', mode = 'batch',
           street = 'street_address', city = 'city', state = 'state_cd', postalcode = 'zip_cd'),
      list(method = 'census', mode = 'single',
           street = 'street_address', city = 'city', state = 'state_cd', postalcode = 'zip_cd'),
      list(method = 'arcgis', address = 'city')
    ),
    query_names = c('census - batch', 'census - single', 'arcgis')
  )
## 

## Passing 8 addresses to the US Census batch geocoder

## Query completed in: 1.9 seconds

## Passing 6 addresses to the US Census single address geocoder

## Query completed in: 3.6 seconds

## Passing 3 addresses to the ArcGIS single address geocoder

## Query completed in: 1.4 seconds
street_address city state_cd zip_cd lat long query
624 W DAVIS ST #1D BURLINGTON NC 27215 36.09598 -79.44453 census - single
201 E CENTER ST #268 MEBANE NC 27302 36.09683 -79.26977 census - single
7833 WOLFE LN SNOW CAMP NC 27349 35.89866 -79.43713 census - single
202 C St San Diego CA 92101 32.71676 -117.16283 census - batch
121 N Rouse Ave Bozeman MT 59715 45.68066 -111.03203 census - batch
NA Taipei NA NA 25.03737 121.56355 arcgis
NA Moscow NA NA 55.75696 37.61502 arcgis
NA Buenos Aires NA NA -34.60849 -58.37344 arcgis

By default, the results of the queries are combined into a single dataframe as shown above, and the query column shows which query produced each result. Alternatively, the results of each query can be returned as separate dataframes in a list by setting return_list = TRUE.

By default, only addresses that are not found by a query are passed on to the subsequent query. However, setting cascade = FALSE will pass all addresses to all queries. See the documentation for the geocode_combine() function for more usage details.
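
As a sketch of the list-returning behavior, reusing the mixed_addresses dataframe defined above (the query list is abbreviated to two services for brevity):

```r
library(tidyverse)
library(tidygeocoder)

# Sketch: return each query's results as its own dataframe in a list
results_list <- mixed_addresses %>%
  geocode_combine(
    queries = list(
      list(method = 'census', mode = 'batch',
           street = 'street_address', city = 'city',
           state = 'state_cd', postalcode = 'zip_cd'),
      list(method = 'arcgis', address = 'city')
    ),
    return_list = TRUE # a list of dataframes instead of one combined dataframe
  )
```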

Package Housekeeping

The return_type, geocodio_v, mapbox_permanent, mapquest_open, iq_region, and here_request_id parameters are now deprecated in favor of the new api_options parameter. For instance, instead of return_type = "geographies", you should now use api_options = list(census_return_type = "geographies").
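
A before-and-after sketch of this migration (the address is illustrative):

```r
library(tidygeocoder)

# Before (deprecated parameter):
# geo('100 Main St, Springfield', method = 'census', return_type = 'geographies')

# After: service-specific settings go through api_options,
# with the option name prefixed by the method ('census_' here)
geo('100 Main St, Springfield', method = 'census',
    api_options = list(census_return_type = 'geographies'))
```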

Also, the cascade_order, param_error, and batch_limit_error parameters in geo() are now deprecated as they were only required because of the deprecated “cascade” method. Refer to the documentation for geo() or reverse_geo() for details.

Tidygeocoder 1.0.3 (2021-04-19)

Tidygeocoder v1.0.3 is released on CRAN! This release adds support for reverse geocoding (geocoding geographic coordinates) and 7 new geocoder services: OpenCage, HERE, Mapbox, MapQuest, TomTom, Bing, and ArcGIS. Refer to the geocoder services page for information on all the supported geocoder services.

Big thanks go to Diego HernangĂłmez and Daniel Possenriede for their work on this release. You can refer to the changelog for the details on the changes in the release.

Reverse Geocoding

In this example we’ll randomly sample coordinates in Madrid and label them on a map. The coordinates are placed in a dataframe and reverse geocoded with the reverse_geocode() function. The Nominatim (“osm”) geocoder service is used and several API parameters are passed via the custom_query argument to request additional columns of data from Nominatim. Refer to Nominatim’s API documentation for more information on these parameters.

library(tidyverse, warn.conflicts = FALSE)
library(tidygeocoder)
library(knitr)
library(leaflet)
library(glue)
library(htmltools)

num_coords <- 25 # number of coordinates
set.seed(103) # for reproducibility

# latitude and longitude bounds
lat_limits <- c(40.40857, 40.42585)
long_limits <- c(-3.72472, -3.66983)

# randomly sample latitudes and longitude values
random_lats <- runif(
  num_coords, 
  min = lat_limits[1], 
  max = lat_limits[2]
  )

random_longs <- runif(
  num_coords, 
  min = long_limits[1], 
  max = long_limits[2]
  )

# Reverse geocode the coordinates
# the speed of the query is limited to 1 coordinate per second to comply
# with Nominatim's usage policies
madrid <- reverse_geo(
              lat = random_lats, long = random_longs,
              method = 'osm', full_results = TRUE,
              custom_query = list(extratags = 1, addressdetails = 1, namedetails = 1)
          )

After geocoding our coordinates, we can construct HTML labels with the data returned from Nominatim and display these locations on a leaflet map.

# Create html labels
# https://rstudio.github.io/leaflet/popups.html
madrid_labelled <- madrid %>%
  transmute(
    lat, 
    long, 
    label = str_c(
        ifelse(is.na(name), "", glue("<b>Name</b>: {name}</br>")),
        ifelse(is.na(suburb), "", glue("<b>Suburb</b>: {suburb}</br>")),
        ifelse(is.na(quarter), "", glue("<b>Quarter</b>: {quarter}")),
        sep = ''
    ) %>% lapply(htmltools::HTML)
  )

# Make the leaflet map
madrid_labelled %>% 
  leaflet(width = "100%", options = leafletOptions(attributionControl = FALSE)) %>%
  setView(lng = mean(madrid$long), lat = mean(madrid$lat), zoom = 14) %>%
  # Map Backgrounds
  # https://leaflet-extras.github.io/leaflet-providers/preview/
  addProviderTiles(providers$Stamen.Terrain, group = "Terrain") %>%
  addProviderTiles(providers$OpenRailwayMap, group = "Rail") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  addTiles(group = "OSM") %>%
  # Add Markers
  addMarkers(
    labelOptions = labelOptions(noHide = F), lng = ~long, lat = ~lat,
    label = ~label,
    group = "Random Locations"
  ) %>%
  # Map Control Options
  addLayersControl(
    baseGroups = c("OSM", "Terrain", "Satellite", "Rail"),
    overlayGroups = c("Random Locations"),
    options = layersControlOptions(collapsed = TRUE)
  )

Limits

This release also improves support for returning multiple results per input with the limit argument. Consider this batch query with the US Census geocoder:

tie_addresses <- tibble::tribble(
  ~res_street_address, ~res_city_desc, ~state_cd, ~zip_code,
  "624 W DAVIS ST   #1D",   "BURLINGTON", "NC",  27215,
  "201 E CENTER ST   #268", "MEBANE",     "NC",  27302,
  "7833  WOLFE LN",         "SNOW CAMP",  "NC",  27349,
)

tg_batch <- tie_addresses %>%
  geocode(
    street = res_street_address,
    city = res_city_desc,
    state = state_cd,
    postalcode = zip_code,
    method = 'census', 
    full_results = TRUE
  )
res_street_address res_city_desc state_cd zip_code lat long id input_address match_indicator match_type matched_address tiger_line_id tiger_side
624 W DAVIS ST #1D BURLINGTON NC 27215 NA NA 1 624 W DAVIS ST #1D, BURLINGTON, NC, 27215 Tie NA NA NA NA
201 E CENTER ST #268 MEBANE NC 27302 NA NA 2 201 E CENTER ST #268, MEBANE, NC, 27302 Tie NA NA NA NA
7833 WOLFE LN SNOW CAMP NC 27349 NA NA 3 7833 WOLFE LN, SNOW CAMP, NC, 27349 Tie NA NA NA NA

You can see NA results are returned and the match_indicator column indicates a “Tie”. This is what the US Census batch geocoder returns when multiple results are available for each input address (see issue #87 for more details).

To see all available results for these addresses, you will need to use the mode argument to force single-address (not batch) geocoding and set limit > 1. The return_input argument (new in this release) must be set to FALSE to allow limit to take a value other than 1. See the geocode() function documentation for details.

tg_single <- tie_addresses %>%
  geocode(
    street = res_street_address,
    city = res_city_desc,
    state = state_cd,
    postalcode = zip_code,
    limit = 100,
    return_input = FALSE,
    method = 'census', 
    mode = 'single',
    full_results = TRUE
  )
street city state postalcode lat long matchedAddress tigerLine.tigerLineId tigerLine.side addressComponents.fromAddress addressComponents.toAddress addressComponents.preQualifier addressComponents.preDirection addressComponents.preType addressComponents.streetName addressComponents.suffixType addressComponents.suffixDirection addressComponents.suffixQualifier addressComponents.city addressComponents.state addressComponents.zip
624 W DAVIS ST #1D BURLINGTON NC 27215 36.09598 -79.44453 624 W DAVIS ST, BURLINGTON, NC, 27215 71662708 L 618 628   W   DAVIS ST     BURLINGTON NC 27215
624 W DAVIS ST #1D BURLINGTON NC 27215 36.08821 -79.43201 624 E DAVIS ST, BURLINGTON, NC, 27215 71664000 L 600 698   E   DAVIS ST     BURLINGTON NC 27215
201 E CENTER ST #268 MEBANE NC 27302 36.09683 -79.26977 201 W CENTER ST, MEBANE, NC, 27302 71655977 R 201 299   W   CENTER ST     MEBANE NC 27302
201 E CENTER ST #268 MEBANE NC 27302 36.09582 -79.26624 201 E CENTER ST, MEBANE, NC, 27302 71656021 R 299 201   E   CENTER ST     MEBANE NC 27302
7833 WOLFE LN SNOW CAMP NC 27349 35.89866 -79.43713 7833 WOLFE LN, SNOW CAMP, NC, 27349 71682243 L 7999 7801       WOLFE LN     SNOW CAMP NC 27349
7833 WOLFE LN SNOW CAMP NC 27349 35.89693 -79.43707 7833 WOLF LN, SNOW CAMP, NC, 27349 71685327 L 7801 7911       WOLF LN     SNOW CAMP NC 27349

We can now see there are two available results for each address. Note that this particular issue with “Tie” batch results is specific to the US Census geocoder service. Refer to the api_parameter_reference documentation for more details on the limit parameter.

The limit parameter can also be used to return all matches for a more general query:

paris <- geo('Paris', method = 'opencage', full_results = TRUE, limit = 10)
address lat long formatted annotations.currency.name
Paris 48.85670 2.351462 Paris, France Euro
Paris 33.66180 -95.555513 Paris, TX 75460, United States of America United States Dollar
Paris 38.20980 -84.252987 Paris, Kentucky, United States of America United States Dollar
Paris 36.30195 -88.325858 Paris, TN 38242, United States of America United States Dollar
Paris 39.61115 -87.696137 Paris, IL 61944, United States of America United States Dollar
Paris 44.25995 -70.500641 Paris, Maine, United States of America United States Dollar
Paris 35.29203 -93.729917 Paris, AR 72855, United States of America United States Dollar
Paris 39.48087 -92.001281 Paris, MO 65275, United States of America United States Dollar

The R Markdown file that generated this post is available here.

Tidygeocoder 1.0.2 (2021-01-18)

Tidygeocoder v1.0.2 “Yodeling Yak” is now on CRAN. This release adds support for the popular Google geocoder service (thanks @chris31415926535) and also includes several bugfixes and enhancements. Refer to the changelog for details on the release and to the tidygeocoder homepage for a comparison of all supported geocoder services.

You can make use of the Google geocoder service by passing method = 'google' to the geo() and geocode() functions. Note that the Google geocoder service requires registering for an API key and charges per query. The Google API key needs to be stored in the GOOGLEGEOCODE_API_KEY environment variable for use with tidygeocoder.
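
A minimal sketch of a Google query (the key and address are placeholders; in practice you would set the key in ~/.Renviron rather than in a script):

```r
library(tidygeocoder)

# Placeholder key -- never commit a real API key to source control
Sys.setenv(GOOGLEGEOCODE_API_KEY = "<your-api-key>")

white_house <- geo('1600 Pennsylvania Ave NW, Washington, DC',
                   method = 'google')
```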

Also new in this release, US Census batch geocoding results now return geographic FIPS codes in character format (instead of numeric) to preserve leading zeros (#47). The return_type = 'geographies' query in the usage example shows the new data format.

Additionally, arguments passed to the geo() and geocode() functions that aren’t valid for the selected geocoder service (e.g. the Census geocoder doesn’t have a country argument) will now throw errors:

library(tidygeocoder)

geo(city = 'Auckland', country = 'New Zealand', method = 'census')
## Error in geo(city = "Auckland", country = "New Zealand", method = "census"): The following parameter(s) are not supported for the "census" method:
## 
## country
## 
## See ?api_parameter_reference for more details.

If you’re interested in contributing to the package and would like to add support for other geocoder services, updated instructions on how to go about this are located here.

Introducing Tidygeocoder 1.0.0 (2020-07-15)

Tidygeocoder v1.0.0 is now live on CRAN. There are numerous new features and improvements, such as batch geocoding (submitting multiple addresses per query), returning full results from geocoder services (not just latitude and longitude), address component arguments (city, country, etc.), query customization, and reduced package dependencies.

For a full list of new features and improvements refer to the release page on Github. For usage examples you can reference the Getting Started vignette.

To demonstrate a few of the new capabilities of this package, I decided to make a map of the stadiums for the UEFA Champions League Round of 16 clubs. To start, I looked up the addresses for the stadiums and put them in a dataframe.

library(dplyr)
library(tidygeocoder)
library(ggplot2)
library(maps)
library(ggrepel)

# https://www.uefa.com/uefachampionsleague/clubs/
stadiums <- tibble::tribble(
~Club,                ~Street,   ~City,   ~Country,
"Barcelona",          "Camp Nou", "Barcelona", "Spain",
"Bayern Munich",      "Allianz Arena", "Munich", "Germany",
"Chelsea",            "Stamford Bridge", "London", "UK",
"Borussia Dortmund",  "Signal Iduna Park", "Dortmund", "Germany",
"Juventus",           "Allianz Stadium", "Turin", "Italy",
"Liverpool",          "Anfield", "Liverpool", "UK",
"Olympique Lyonnais", "Groupama Stadium", "Lyon", "France",
"Man. City",          "Etihad Stadium", "Manchester", "UK",
"Napoli",             "San Paolo Stadium", "Naples", "Italy",
"Real Madrid",        "Santiago Bernabéu Stadium", "Madrid", "Spain",
"Tottenham",          "Tottenham Hotspur Stadium", "London", "UK",
"Valencia",           "Av. de SuĂšcia, s/n, 46010", "Valencia", "Spain",
"Atalanta",           "Gewiss Stadium", "Bergamo", "Italy",
"Atlético Madrid",    "Estadio Metropolitano", "Madrid", "Spain",
"RB Leipzig",         "Red Bull Arena", "Leipzig", "Germany",
"PSG",                "Le Parc des Princes", "Paris", "France"
  )

To geocode these addresses, you can use the geocode function as shown below. New in v1.0.0, the street, city, and country arguments specify the address. The Nominatim (OSM) geocoder is selected with the method argument. Additionally, the full_results and custom_query arguments (also new in v1.0.0) are used to return the full geocoder results and set Nominatim’s “extratags” parameter which returns extra columns.

stadium_locations <- stadiums %>%
  geocode(street = Street, city = City, country = Country, method = 'osm',
          full_results = TRUE, custom_query = list(extratags = 1))

This returns 40 columns including the longitude and latitude. A few of the columns returned due to the extratags argument are shown below.

stadium_locations %>%
  select(Club, City, Country, extratags.sport, extratags.capacity, extratags.operator, extratags.wikipedia) %>%
  rename_with(~gsub('extratags.', '', ., fixed = TRUE)) %>%
  knitr::kable()
Club City Country sport capacity operator wikipedia
Barcelona Barcelona Spain soccer NA NA en:Camp Nou
Bayern Munich Munich Germany soccer 75021 NA de:Allianz Arena
Chelsea London UK soccer 41837 Chelsea Football Club en:Stamford Bridge (stadium)
Borussia Dortmund Dortmund Germany soccer NA NA de:Signal Iduna Park
Juventus Turin Italy soccer NA NA it:Allianz Stadium (Torino)
Liverpool Liverpool UK soccer 54074 Liverpool Football Club en:Anfield
Olympique Lyonnais Lyon France soccer 58000 Olympique Lyonnais fr:Parc Olympique lyonnais
Man. City Manchester UK soccer NA Manchester City Football Club en:City of Manchester Stadium
Napoli Naples Italy soccer NA NA en:Stadio San Paolo
Real Madrid Madrid Spain soccer 85454 NA es:Estadio Santiago Bernabéu
Tottenham London UK soccer;american_football 62062 Tottenham Hotspur en:Tottenham Hotspur Stadium
Valencia Valencia Spain NA NA NA NA
Atalanta Bergamo Italy soccer NA NA NA
Atlético Madrid Madrid Spain soccer NA NA es:Estadio Metropolitano (Madrid)
RB Leipzig Leipzig Germany NA NA NA de:Red Bull Arena (Leipzig)
PSG Paris France soccer 48527 Paris Saint-Germain fr:Parc des Princes

Below, the stadium locations are plotted on a map of Europe using the longitude and latitude coordinates and ggplot.

ggplot(stadium_locations, aes(x = long, y = lat)) +
  borders("world", xlim = c(-10, 10), ylim = c(40, 55)) +
  geom_label_repel(aes(label = Club), force = 2, segment.alpha = 0) +
  geom_point() +
  theme_void()

Alternatively, an interactive map can be created with the leaflet library:

library(leaflet)

stadium_locations %>% # Our dataset
  leaflet(width = "100%", options = leafletOptions(attributionControl = FALSE)) %>%
  setView(lng = mean(stadium_locations$long), lat = mean(stadium_locations$lat), zoom = 5) %>%
  # Map Backgrounds
  addProviderTiles(providers$Stamen.Terrain, group = "Terrain") %>%
  addProviderTiles(providers$NASAGIBS.ViirsEarthAtNight2012, group = "Night") %>%
  addProviderTiles(providers$Stamen.Toner, group = "Stamen") %>%
  addTiles(group = "OSM") %>%
  # Add Markers
  addMarkers(
    labelOptions = labelOptions(noHide = F), lng = ~long, lat = ~lat,
    clusterOptions = markerClusterOptions(maxClusterRadius = 10), label = ~Club,
    group = "Stadiums"
  ) %>%
  # Map Control Options
  addLayersControl(
    baseGroups = c("OSM", "Stamen", "Terrain", "Night"),
    overlayGroups = c("Stadiums"),
    options = layersControlOptions(collapsed = TRUE)
  )


If you find any issues with the package or have ideas on how to improve it, feel free to file an issue on GitHub. For reference, the R Markdown file that generated this blog post can be found here.

Deploying R Markdown Online (2020-03-22)

R Markdown is a great tool for creating a variety of documents with R code, and it’s a natural choice for producing blog posts such as this one. However, depending on which blog software you use, you may run into problems related to the file paths for figure images (such as ggplot charts), which will require tweaks to your R Markdown workflow.

This blog post demonstrates a simple solution to this problem that will also give you central control over R Markdown knit settings across your site. I use this solution for this blog and a GitHub repository of data science resources. You can also find the R Markdown file that generated this blog post here.

Note that in this post I will be talking about implementing a solution for a Jekyll blog that is hosted via GitHub pages. Some modifications may be required if you are using another blog or website platform. However, this solution should be adaptable to all blogs or websites that use Markdown.

For Jekyll there are two steps to building web content (HTML) from an R Markdown file. The first is to knit the R Markdown (.Rmd) file which creates the Markdown (.md) file. The second step is to use the jekyll build command to create HTML content which is what will be displayed online.

1. Knit:          R Markdown (.Rmd) ----> Markdown (.md)
2. Jekyll Build:  Markdown (.md)    ----> HTML (.html)

The Problem

When I first used R Markdown to create a post for this blog, none of my figures showed up in the post. The issue was that Jekyll creates the HTML file for a blog post in a different location than the R Markdown (.Rmd) and Markdown (.md) files and this breaks figure file paths. This blog post describes the problem in more detail.

Also, by default R Markdown stores files for figures two folder levels deep, using the R Markdown file location as its root (i.e. <rmarkdown-filename>_files/figure-gfm/image.png). I find it more convenient to organize figure files in a separate root directory from my R Markdown files and store the images only one folder level deep (i.e. /rmd_images/<rmarkdown-filename>/image.png). You can see this folder structure in action here (posts are in the _posts folder and figures are in the rmd_images folder).

The Solution

This solution uses a single R script file (.R) which contains knit settings adjustments and is referenced by all R Markdown (.Rmd) files. This allows you to edit knit settings in one central location and use these settings whenever you knit an R Markdown file. Modifications are made to the knit process so that figure image files are saved in a well organized folder structure and the HTML files display figures properly.

The contents of this central R script, which I have named rmd_config.R, are below. It lives in the root directory of my GitHub repository, and its contents are run (via source) when each R Markdown file is knit.

# get name of file during knitting and strip file extension
rmd_filename <- stringr::str_remove(knitr::current_input(), "\\.Rmd")

# Figure path on disk = base.dir + fig.path
# Figure URL online = base.url + fig.path
knitr::opts_knit$set(base.dir = stringr::str_c(here::here(), "/"), base.url = "/") # project root folder
knitr::opts_chunk$set(fig.path = stringr::str_c(file.path("rmd_images", rmd_filename), "/"))

Here is what is going on in the above script:

  • The filename of our R Markdown file is extracted using knitr::current_input() and stored in the variable rmd_filename (str_remove is used to remove the .Rmd file extension).
  • The here package establishes our ‘base’ directory (the root folder of our GitHub repository). The base directory path could change based on which computer we use and where we put our GitHub repository files so the here package allows us to automatically find this path.
  • The fig.path, which is where our figures will be stored, is set to a folder named after the R Markdown file being run that resides in the ‘/rmd_images’ root directory.

To utilize the above script in an R Markdown file, we simply insert the code below as a chunk into the R Markdown file. This will source the script to apply all the necessary knit settings when an R Markdown file is knit.

source(here::here("rmd_config.R"))

For a Jekyll blog, you’ll also want to include certain YAML header content such as tags or the title of the post. To do this we can use the preserve_yaml output setting in generating our .md file and then insert whatever YAML content we need into the header. Below is an example YAML header (the first part of your R Markdown document) which will generate a GitHub-style (“gfm”) .md document. In this example I’ve added the fields “layout”, “title”, “date”, “author”, and “tags”, which are all used by Jekyll in creating the blog post.

---
layout: post
title:  "Deploying R Markdown Online"
date:   2020-03-22
author: Jesse Cambon
tags: [ r ]
output: 
  md_document:
    pandoc_args: ["--wrap=none"]
    variant: gfm
    preserve_yaml: TRUE
---

Note that the pandoc_args setting prevents the knit process from inserting extra line breaks into the Markdown file that don’t exist in our R Markdown file. The extra line breaks are normally invisible, but I found they showed up when I pushed content to R-Bloggers, which caused paragraphs to be broken up.

One other note on Jekyll is that it uses the liquid template language. If you want to display characters on your blog that are used by liquid such as {{}} (R’s “curly-curly” operator) then you will need to wrap these problematic characters with the raw and endraw liquid tags as described here. This prevents Jekyll from attempting to parse these characters as liquid syntax and passes them on in raw form to the HTML file for display.

Conclusion

To see this solution in action, you can look at the GitHub repository that produces this blog here and the R Markdown file for this specific blog post here. To provide a self-contained example of a figure displaying, I’ve created a simple histogram plot below and you’ll find the image file neatly filed away in the rmd_images directory underneath a subfolder named after this blog post.

One caveat is that this approach does assume that each R Markdown filename is unique. If this is not the case then you’ll need to make some modifications to the central rmd_config.R file above; otherwise figure images from different R Markdown files may save in the same directory (and possibly overwrite each other). However, the solution described here is quite flexible and could be adapted to a variety of use cases with tweaks to the rmd_config.R file.

hist(mtcars$disp)

March 26, 2021: this post has been updated with simplified R Markdown code.

Data Science Essentials (2020-01-12)

One of the greatest strengths of R for data science work is the vast number and variety of packages and capabilities available. However, it can be intimidating to navigate this large and dynamic open source ecosystem, especially for a newcomer. All the information you need is out there, but it is often fragmented across numerous Stack Overflow threads and websites.

In an attempt to consolidate some of this information, this blog post demonstrates fundamental methods that I have used repeatedly as a data scientist. This code should get you started with some essential and broadly useful data science tasks in R: data manipulation, summarization, and visualization.

I will mainly rely on the dplyr, tidyr, and ggplot2 packages which all have excellent documentation that you can refer to for further details. Datasets that are built into these packages will be used so that there is no need to download external data. Also note that the input and output datasets will be displayed for each example, but at times only the first several rows will be shown for display purposes.

If you’d like to follow along while running the code, you can find the R Markdown file that generated this blog post here. Also, if you haven’t installed the tidyverse packages already, you’ll need to do that first with this command: install.packages('tidyverse').

Basic Data Manipulation

To begin, we need to load the tidyverse packages:

library(tidyverse)

Now, let’s take a look at the mpg dataset from the ggplot2 package:

manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact

We’ll perform a few of the most commonly used data manipulation operations on this dataset using the dplyr package. If you’re new to R, the <- operator you see below assigns the value on its right-hand side to the name on its left. In this example, we manipulate the ‘mpg’ dataset and save the result as the ‘mpg_subset’ dataset.

If you’re not familiar with dplyr, note that the “pipe” operator %>% is used to pass the output of a function to the following function. This allows us to perform data manipulations in sequence in a clear and readable way.
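
A tiny base-R illustration of what the pipe does (values chosen for the example):

```r
library(magrittr) # %>% is also loaded with dplyr/tidyverse

# These two expressions are equivalent: the pipe passes its left-hand
# side as the first argument of the function on its right
mean(c(1, 2, 3))      # 2
c(1, 2, 3) %>% mean() # 2
```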

In this example, we filter to the two rows containing Nissan vehicles from model year 2005 or later with 4 cylinders, create two new columns, reorder the columns, remove four columns, and rename a column. In the order they are used below, here are the main functions that accomplish this:

  • filter controls which rows we want to keep from the input dataset. In this example, three conditions are applied using the “&” (AND) operator.
  • mutate is used to create new columns.
  • str_c combines multiple strings. In this case we are combining the manufacturer and model string fields into a single field with a single space in between.
  • select is used to pick which columns from the input dataset we want to keep. This select statement sets the order of the first four columns, includes the remaining columns, but then removes four columns.
    • A ‘-’ before a column name indicates that we want to remove that column.
    • The everything() function is shorthand for selecting all remaining columns and is an example of a select helper.
  ‱ rename is used to rename the ‘fl’ column to ‘fuel_type’.

mpg_subset <- mpg %>%
  filter(cyl == 4 & year >= 2005 & manufacturer == "nissan") %>%
  mutate(
    mpg_ratio = hwy / cty,
    make_model = str_c(manufacturer, " ", model)
  ) %>%
  select(
    make_model, year, hwy, cty, everything(),
    -manufacturer, -model, -drv, -trans
  ) %>%
  rename(fuel_type = fl)
make_model year hwy cty displ cyl fuel_type class mpg_ratio
nissan altima 2008 31 23 2.5 4 r midsize 1.347826
nissan altima 2008 32 23 2.5 4 r midsize 1.391304

Summary Statistics

Calculating summary statistics like counts, means, and medians is a good initial step to understand a dataset. To count observations (rows) by a categorical variable, we can use the count function. Here we find the number of rows for each value of the ‘cyl’ (cylinders) column:

count_cyl <- mpg %>%
  count(cyl)
cyl n
4 81
5 4
6 79
8 70

A broader variety of statistics can be calculated using the group_by and summarize functions. In this example, we create a new categorical column ‘class_c’ (which combines 2 seaters and subcompact vehicles into a single category) using the case_when function and then calculate a variety of basic summary statistics by this column.

The arrange function is used to order the rows in the dataset in descending order of the created ‘count’ variable. Note that the ‘ungroup’ function is not strictly necessary in this case, but it is a good practice if we plan to manipulate our dataset in the future without using groups.

mpg_stats <- mpg %>%
  select(class, hwy) %>%
  mutate(class_c = case_when(
    class %in% c("2seater", "subcompact") ~ "subcompact",
    TRUE ~ class
  )) %>%
  group_by(class_c) %>%
  summarize(
    count = n(),
    min_hwy = min(hwy),
    max_hwy = max(hwy),
    median_hwy = median(hwy),
    mean_hwy = mean(hwy)
  ) %>%
  ungroup() %>%
  arrange(desc(count)) # sort dataset
class_c count min_hwy max_hwy median_hwy mean_hwy
suv 62 12 27 17.5 18.12903
compact 47 23 44 27.0 28.29787
midsize 41 23 32 27.0 27.29268
subcompact 40 20 44 26.0 27.72500
pickup 33 12 22 17.0 16.87879
minivan 11 17 24 23.0 22.36364

Stacking Data

If you have datasets whose columns or rows align, you can combine them by stacking the datasets vertically or horizontally. To demonstrate this, we will first use the slice function to subset the ‘mpg’ dataset by row numbers to create the ‘mpg1’ and ‘mpg2’ datasets.

mpg1 <- mpg %>%
  slice(1) %>%
  select(manufacturer, model, hwy, cty) %>%
  mutate(dataset = 1)
manufacturer model hwy cty dataset
audi a4 29 18 1
mpg2 <- mpg %>%
  slice(44:45) %>%
  select(manufacturer, model, hwy, cty) %>%
  mutate(dataset = 2)
manufacturer model hwy cty dataset
dodge caravan 2wd 17 11 2
dodge caravan 2wd 22 15 2

Since these two datasets we just created have the same columns, we can stack them vertically using bind_rows:

mpg_stack_vert <- mpg1 %>%
  bind_rows(mpg2)
manufacturer model hwy cty dataset
audi a4 29 18 1
dodge caravan 2wd 17 11 2
dodge caravan 2wd 22 15 2

Now let’s create a third subset of the ‘mpg’ dataset using the same rows that generated ‘mpg1’ and ‘mpg2’ above, but with different columns. We’ll call it ‘mpg3’:

mpg3 <- mpg %>%
  slice(1, 44:45) %>%
  select(displ, year)
displ year
1.8 1999
3.3 2008
3.8 1999

We can stack the ‘mpg_stack_vert’ and ‘mpg3’ datasets horizontally since their rows align (we used the ‘slice’ function to subset the ‘mpg’ dataset on the same row numbers). We use the bind_cols function to do this.

mpg_stack_horz <- mpg_stack_vert %>%
  bind_cols(mpg3)
manufacturer model hwy cty dataset displ year
audi a4 29 18 1 1.8 1999
dodge caravan 2wd 17 11 2 3.3 2008
dodge caravan 2wd 22 15 2 3.8 1999

Joining Data

If you have datasets that contain a common “key” column (or a set of key columns) then you can use one of the join functions from dplyr to combine these datasets. First let’s create a dataset named ‘car_type’ which contains the unique combinations of the ‘manufacturer’, ‘model’, and ‘class’ column values using the distinct function.

car_type <- mpg %>%
  select(manufacturer, model, class) %>%
  distinct()
manufacturer model class
audi a4 compact
audi a4 quattro compact
audi a6 quattro midsize
chevrolet c1500 suburban 2wd suv

Now we will join this newly created ‘car_type’ dataset to the ‘mpg_stack_horz’ dataset (created above) using the ‘left_join’ function. The ‘manufacturer’ and ‘model’ columns are used as joining keys. The resulting dataset, ‘joined’, now contains all the columns from ‘mpg_stack_horz’ as well as the ‘class’ column from the ‘car_type’ dataset.

joined <- mpg_stack_horz %>%
  left_join(car_type, by = c("manufacturer", "model")) %>%
  select(-dataset, everything()) # make the 'dataset' column last
manufacturer model hwy cty displ year class dataset
audi a4 29 18 1.8 1999 compact 1
dodge caravan 2wd 17 11 3.3 2008 minivan 2
dodge caravan 2wd 22 15 3.8 1999 minivan 2

Converting Long to Wide Format

Let’s take a look at the us_rent_income dataset from the tidyr package:

GEOID NAME variable estimate moe
01 Alabama income 24476 136
01 Alabama rent 747 3
02 Alaska income 32940 508
02 Alaska rent 1200 13

Each row of this dataset pertains to either income or rent as we can see by looking at the value of the ‘variable’ column. This is an example of a “long” data format. The long format is versatile and desirable for data manipulation, but we may want to convert to the “wide” data format in some cases, particularly for presenting data.

To perform this conversion, we can use the pivot_wider function from tidyr. The end result is that the rent and income variables are put into separate columns. This function has two arguments you will need to set:

  • names_from: name of the column which contains values that will become our new column names.
  • values_from: name of the column which contains the values that will populate our new columns.

Additionally we use the select function to drop two columns, drop_na to remove rows with missing values, and mutate to create an income to rent ratio.

col_ratio <- us_rent_income %>%
  select(-GEOID, -moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  drop_na() %>%
  mutate(income_rent_ratio = income / (12 * rent))
NAME income rent income_rent_ratio
Alabama 24476 747 2.730478
Alaska 32940 1200 2.287500
Arizona 27517 972 2.359139

Converting Wide to Long Format

Now let’s look at the world_bank_pop dataset from tidyr (only the first 6 columns are shown for display purposes):

country indicator 2000 2001 2002 2003
ABW SP.URB.TOTL 42444.000000 43048.000000 43670.000000 44246.00000
ABW SP.URB.GROW 1.182632 1.413021 1.434559 1.31036
ABW SP.POP.TOTL 90853.000000 92898.000000 94992.000000 97017.00000

This dataset is in “wide” format since a categorical variable, in this case the year, is stored in the column names. To convert this dataset to the “long” format, which can be more convenient for data manipulation, we use the pivot_longer function from tidyr. This function takes three inputs:

  • cols (1st argument): the columns we want to pivot. In this example we create this list by subtracting the columns we don’t want to pivot.
  • names_to: the name of the new column which will contain the current column names as values.
  • values_to: the name of the new column which will contain the values.

We also use the ‘mutate’ and ‘as.numeric’ functions to convert our new ‘year’ variable to numeric and then filter so that our output dataset only includes certain years using the ‘seq’ function. The across function is used to apply the ‘as.numeric’ function to the ‘year’ column. The format for the ‘seq’ function is seq(start, stop, increment).

wb_pop <- world_bank_pop %>%
  pivot_longer(c(-country, -indicator), names_to = "year", values_to = "value") %>%
  mutate(across(year, as.numeric)) %>% # convert to numeric
  filter(year %in% seq(2000, 2016, 2))
country indicator year value
ABW SP.URB.TOTL 2000 42444
ABW SP.URB.TOTL 2002 43670
ABW SP.URB.TOTL 2004 44669

Visualizations

Now that we have manipulated and summarized some datasets, we’ll make a few visualizations with ggplot2. ggplot2 graphs are constructed by adding together a series of ggplot2 functions with the “+” operator. This gives us a large amount of customization options since these functions can be combined in many different ways.

Below you will find code for several commonly used charts, but you can refer to ggplot’s documentation for more information. Here is a brief overview of the package:

  • The ggplot function initializes a graph and typically specifies the dataset that is being used.
  • At least one geom (geometric object) function such as geom_histogram, geom_point, or geom_line is included which controls how data will be displayed.
  • The aes (aesthetic mappings) function controls which variables are used in the plot. This function can be included as part of the ggplot function or in a geom function depending on whether you want the effect to be global or specific to a geom function.
  • The formatting of the chart (such as margins, legend position, and grid lines) can be modified using preset themes such as theme_bw and theme_classic or the theme function which gives more manual control.
  • The ‘color’ parameter is used for setting the color of plot lines and points while the ‘fill’ parameter controls the color of areas (such as bars on bar charts). These parameters can be set to a value such as ‘navy’ or to a categorical variable. You can read more about this on ggplot’s site here.
  • To save a plot to a file use the ggsave function.
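Putting those pieces together, here is a minimal skeleton (illustrative only; the output file name is hypothetical) that combines a dataset, a geom, a preset theme, and ggsave:

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = displ, y = hwy)) + # initialize with data and aesthetic mappings
  geom_point(color = "navy") +                     # a geom controls how the data is drawn
  theme_bw()                                       # a preset theme controls formatting

ggsave("my_plot.png", plot = p, width = 6, height = 4) # save the plot to a file
```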

Scatter Plots

Scatter plots are used to visually examine the relationship between two continuous variables and can be created using geom_point. In this example, we plot highway MPG against engine displacement for the ‘mpg’ dataset. A ‘Transmission’ column is created to combine the various transmission types in the ‘trans’ variable into the ‘auto’ (automatic) and ‘manual’ categories using the str_detect function.

The ‘color’ argument in the ‘aes’ function is used to color our points according to the newly created ‘Transmission’ variable. A legend is automatically created and we’ve positioned it at the top of our graph.

ggplot(
  data = mpg %>%
    mutate(Transmission = case_when(str_detect(trans, "auto") ~ "auto", TRUE ~ "manual")),
  aes(x = displ, y = hwy, color = Transmission)
) +
  geom_point() +
  theme_light() +
  theme(
    legend.position = "top",
    legend.text = element_text(size = 11)
  ) +
  xlab("Displacement (L)") +
  ylab("Highway MPG")

Line Charts

Here we create a line graph with the SP.POP.GROW indicator from the ‘wb_pop’ dataset we created earlier from World Bank data. SP.POP.GROW is the percent population growth of a country, and we divide its value (which is in the ‘value’ column) by 100 to convert it to a proportion for percent formatting.

In this example, both lines and points are displayed for our data because we have used both the geom_point and geom_line functions. The expansion function is used to control the margins on the x axis. We’ve also formatted the y axis as a percentage using the ‘labels’ argument in scale_y_continuous.

ggplot(
  wb_pop %>% filter(country %in% c("USA", "CAN", "MEX") & indicator == "SP.POP.GROW"),
  aes(x = year, y = value / 100, color = country)
) +
  theme_minimal() +
  geom_line() +
  geom_point() + # lines and points
  scale_x_continuous(expand = expansion(mult = c(.05, .05))) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    legend.title = element_blank(), # suppress legend title
    panel.grid.minor.x = element_blank(),
    legend.text = element_text(size = 11),
    legend.position = "right"
  ) +
  xlab("Year") +
  ylab("Population Growth")

Histograms

Histograms display distributions of variables. We use a histogram to look at the distribution of highway MPG below. You may want to experiment with the ‘binwidth’ argument in the geom_histogram function to see how it affects what your histogram looks like. The expansion function is used to control the margins on the y axis.

ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 1) +
  theme_bw() +
  scale_y_continuous(expand = expansion(mult = c(0, .05))) +
  xlab("Highway MPG") +
  ylab("Vehicles")

Bar Charts

Bar charts are commonly used to show relative size and can be created with geom_bar. I find it helpful to order the bars by their size, which I’ve done with the reorder function below. The geom_text function is used to add the labels to the top of the bars.

ggplot(
  data = mpg_stats,
  aes(x = reorder(class_c, -mean_hwy), y = mean_hwy)
) +
  geom_bar(stat = "identity", color = "black") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + # expand top margin
  geom_text(aes(label = round(mean_hwy)), vjust = -0.5) + # label bars
  theme_bw() +
  xlab("Vehicle Class") +
  ylab("Mean Highway MPG") + # axis labels
  theme(panel.grid = element_blank()) # turn off grid

Lollipop Charts

Lollipop charts can be an attractive alternative to bar charts. We construct one here using geom_segment and geom_point. The coord_flip function is used to orient the chart horizontally instead of vertically. We use the theme function to hide all grid lines except for the major vertical lines. The reorder function is again used to order the axis (in this case by rent in descending order).

ggplot(data = col_ratio %>% arrange(desc(rent)) %>% head(15), aes(x = NAME, y = rent)) +
  geom_segment(aes(x = reorder(NAME, rent), xend = NAME, y = 0, yend = rent), color = "grey") +
  geom_point(size = 3) +
  theme_minimal() +
  theme(
    plot.subtitle = element_text(face = "bold", hjust = 0.5),
    plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank()
  ) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar, expand = expansion(mult = c(0, .1))) +
  labs(
    title = "US States with the Highest Rent",
    caption = "Source: 2017 American Community Survey (Census)"
  ) +
  xlab("") +
  ylab("Median Monthly Rent")

Additional References

Here are some additional resources that you may find useful:

  • For importing data from files, refer to the readr (for CSV and text files) or readxl (for excel spreadsheets) packages.
  • To coerce a column in a dataset into a different format, you can use the as.numeric, as.character, as.Date, and as.factor functions (from base R). For more functions to work with date and datetime data see the lubridate package, for strings reference the stringr package, and for manipulating factors you can use the forcats package.
  • For quickly summarizing datasets with basic summary statistics, you can use the summary function (base R) or the skimr package.
  • The purrr package allows you apply functions across the values of a list using the map function. One example of where this is useful is in reading and combining data from multiple sheets in an excel spreadsheet by applying a function that reads a single sheet to a list of sheets.
  • I keep reference data science code (both R and Python) in a GitHub repository. You’ll find some more advanced techniques like modeling demonstrated there.
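To sketch the purrr/readxl idea mentioned above — reading every sheet of a workbook and stacking the results — a minimal version could look like this (the file path is hypothetical):

```r
library(dplyr)  # for the %>% pipe
library(purrr)
library(readxl)

path <- "data/workbook.xlsx" # hypothetical file

# name the sheet vector by itself so .id records the sheet name,
# then read each sheet and row-bind the results into one data frame
all_sheets <- excel_sheets(path) %>%
  set_names() %>%
  map_dfr(~ read_excel(path, sheet = .x), .id = "sheet")
```

This assumes all sheets share compatible columns; otherwise map() returning a list of data frames may be safer.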

March 26 2021: this post has been updated to reflect changes in ggplot and dplyr syntax.

Practical Tidy Evaluation (2019-12-08)

Tidy evaluation is a framework for controlling how expressions and variables in your code are evaluated by tidyverse functions. This framework, housed in the rlang package, is a powerful tool for writing more efficient and elegant code. In particular, you’ll find it useful for passing variable names as inputs to functions that use tidyverse packages like dplyr and ggplot2.

The goal of this post is to offer accessible examples and intuition for putting tidy evaluation to work in your own code. Because of this I will keep conceptual explanations brief, but for more comprehensive documentation you can refer to dplyr’s website, rlang’s website, the ‘Tidy Evaluation’ book by Lionel Henry and Hadley Wickham, and the Metaprogramming Section of the ‘Advanced R’ book by Hadley Wickham.

Motivating Example

To begin, let’s consider a simple example of calculating summary statistics with the mtcars dataset. Below we calculate maximum and minimum horsepower (hp) by the number of cylinders (cyl) using the group_by and summarize functions from dplyr.

library(dplyr)
hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    min_hp = min(hp),
    max_hp = max(hp)
  )
cyl min_hp max_hp
4 52 113
6 105 175
8 150 335

Now let’s say we wanted to repeat this calculation multiple times while changing which variable we group by. A brute force method to accomplish this would be to copy and paste our code as many times as necessary and modify the group by variable in each iteration. However, this is inefficient especially if our code gets more complicated, requires many iterations, or requires further development.

To avoid this inelegant solution you might think to store the name of a variable inside of another variable like this groupby_var <- "vs". Then you could attempt to use your newly created “groupby_var” variable in your code: group_by(groupby_var). However, if you try this you will find it doesn’t work. The “group_by” function expects the name of the variable you want to group by as an input, not the name of a variable that contains the name of the variable you want to group by.

This is the kind of headache that tidy evaluation can help you solve. In the example below we use the quo function and the “bang-bang” !! operator to set “vs” (engine shape, 0 = V-shaped, 1 = straight) as our group by variable. The “quo” function allows us to store the variable name in our “groupby_var” variable and “!!” extracts the stored variable name.

groupby_var <- quo(vs)

hp_by_vs <- mtcars %>%
  group_by(!!groupby_var) %>%
  summarize(
    min_hp = min(hp),
    max_hp = max(hp)
  )
vs min_hp max_hp
0 91 335
1 52 123

The code above provides a method for setting the group by variable by modifying the input to the “quo” function when we define “groupby_var”. This can be useful, particularly if we intend to reference the group by variable multiple times. However, if we want to use code like this repeatedly in a script then we should consider packaging it into a function. This is what we will do next.

Making Functions with Tidy Evaluation

To use tidy evaluation in a function, we will still use the “!!” operator as we did above, but instead of “quo” we will use the enquo function. Our new function below takes the group by variable and the measurement variable as inputs so that we can now calculate maximum and minimum values of any variable we want. Also note two new features I have introduced in this function:

  • The as_label function extracts the string value of the “measure_var” variable (“hp” in this case). We use this to set the value of the “measure_var” column.
  • The “walrus operator” := is used to create a column named after the variable name stored in the “measure_var” argument (“hp” in the example). The walrus operator allows you to use strings and evaluated variables (such as “measure_var” in our example) on the left hand side of an assignment operation (where there would normally be a “=” operator) in functions such as “mutate” and “summarize”.

Below we define our function and use it to group by “am” (transmission type, 0 = automatic, 1 = manual) and calculate summary statistics with the “hp” (horsepower) variable.

car_stats <- function(groupby_var, measure_var) {
  groupby_var <- enquo(groupby_var)
  measure_var <- enquo(measure_var)
  return(mtcars %>%
    group_by(!!groupby_var) %>%
    summarize(
      min = min(!!measure_var),
      max = max(!!measure_var)
    ) %>%
    mutate(
      measure_var = as_label(measure_var),
      !!measure_var := NA
    ))
}
hp_by_am <- car_stats(am, hp)
am min max measure_var hp
0 62 245 hp NA
1 52 335 hp NA

We now have a flexible function that contains a dplyr workflow. You can experiment with modifying this function for your own purposes. Additionally, as you might suspect, you could use the same tidy evaluation functions we just used with tidyverse packages other than dplyr.

As an example, below I’ve defined a function that builds a scatter plot with ggplot2. The function takes a dataset and two variable names as inputs. You will notice that the dataset argument “df” needs no tidy evaluation. The as_label function is used to extract our variable names as strings to create a plot title with the “ggtitle” function.

library(ggplot2)
library(stringr) # for str_c
scatter_plot <- function(df, x_var, y_var) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)

  return(ggplot(data = df, aes(x = !!x_var, y = !!y_var)) +
    geom_point() +
    theme_bw() +
    theme(plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5)) +
    geom_smooth() +
    ggtitle(str_c(as_label(y_var), " vs. ", as_label(x_var))))
}
scatter_plot(mtcars, disp, hp)

As you can see, we’ve plotted the “hp” (horsepower) variable against “disp” (displacement) and added a regression line. Now, instead of copying and pasting ggplot code to create the same plot with different datasets and variables, we can just call our function.

The “Curly-Curly” Shortcut and Passing Multiple Variables

To wrap things up, I’ll cover a few additional tricks and shortcuts for your tidy evaluation toolbox.

  • The “curly-curly” {{ }} operator directly extracts a stored variable name from “measure_var” in the example below. In the prior example we needed both “enquo” and “!!” to evaluate a variable like this so the “curly-curly” operator is a convenient shortcut. However, note that if you want to extract the string variable name with the “as_label” function, you will still need to use “enquo” and “!!” as we have done below with “measure_name”.
  • The syms function and the “!!!” operator are used for passing a list of variables as a function argument. In prior examples “!!” was used to evaluate a single group by variable; we now use “!!!” to evaluate a list of group by variables. One quirk is that to use the “syms” function we will need to pass the variable names in quotes.
  • The walrus operator “:=” is again used to create new columns, but now the column names are defined with a combination of a variable name stored in a function argument and another string (“_min” and “_max” below). We use the “enquo” and “as_label” functions to extract the string variable name from “measure_var” and store it in “measure_name” and then use the “str_c” function from stringr to combine strings. You can use similar code to build your own column names from variable name inputs and strings.

Our new function is defined below and is first called to group by the “cyl” variable and then called to group by the “am” and “vs” variables. Note that the “!!!” operator and “syms” function can be used with either a list of strings or a single string.

get_stats <- function(data, groupby_vars, measure_var) {
  groupby_vars <- syms(groupby_vars)
  measure_name <- as_label(enquo(measure_var))
  return(
    data %>% group_by(!!!groupby_vars) %>%
      summarize(
        !!str_c(measure_name, "_min") := min({{ measure_var }}),
        !!str_c(measure_name, "_max") := max({{ measure_var }})
      )
  )
}
cyl_hp_stats <- mtcars %>% get_stats("cyl", mpg)
gear_stats <- mtcars %>% get_stats(c("am", "vs"), gear)
## `summarise()` has grouped output by 'am'. You can override using the `.groups` argument.
cyl mpg_min mpg_max
4 21.4 33.9
6 17.8 21.4
8 10.4 19.2
am vs gear_min gear_max
0 0 3 3
0 1 3 4
1 0 4 5
1 1 4 5

This concludes my introduction to tidy evaluation. Hopefully this serves as a useful starting point for using these concepts in your own code.

Geocoding with Tidygeocoder (2019-11-11)

Tidygeocoder is a newly published R package which provides a tidyverse-style interface for geocoding. It returns latitude and longitude coordinates in tibble format from addresses using the US Census or Nominatim (OSM) geocoder services. In this post I will demonstrate how to use it to plot a few Washington, DC landmarks on a map in honor of the recent Washington Nationals World Series win.

First we will construct a dataset of addresses (dc_addresses) and use the geocode function from tidygeocoder to find longitude and latitude coordinates.

library(dplyr)
library(tidygeocoder)

dc_addresses <- tribble(
  ~name, ~addr,
  "White House", "1600 Pennsylvania Ave Washington, DC",
  "National Academy of Sciences", "2101 Constitution Ave NW, Washington, DC 20418",
  "Department of Justice", "950 Pennsylvania Ave NW, Washington, DC 20530",
  "Supreme Court", "1 1st St NE, Washington, DC 20543",
  "Washington Monument", "2 15th St NW, Washington, DC 20024"
)

coordinates <- dc_addresses %>%
  geocode(addr)

The geocode function adds longitude and latitude coordinates as columns to our dataset of addresses. The default geocoder service used is the US Census, but Nominatim or a hybrid approach can be chosen with the method argument (see the documentation for details). Our newly created coordinates dataset looks like this:

name addr lat long
White House 1600 Pennsylvania Ave Washington, DC 38.89875 -77.03535
National Academy of Sciences 2101 Constitution Ave NW, Washington, DC 20418 38.89211 -77.04678
Department of Justice 950 Pennsylvania Ave NW, Washington, DC 20530 38.89416 -77.02501
Supreme Court 1 1st St NE, Washington, DC 20543 38.88990 -77.00591
Washington Monument 2 15th St NW, Washington, DC 20024 38.88979 -77.03291

Now that we have the coordinates we want to plot, we will use the OpenStreetMap package to make a map of DC.

library(OpenStreetMap)
dc_map <- openmap(c(38.905, -77.05), c(38.885, -77.00))
dc_map.latlng <- openproj(dc_map)

Note that the coordinates supplied to the openmap function above were obtained using openstreetmap.org (use the export button to extract coordinates). The openmap function downloads a street map and the openproj function projects it onto a latitude and longitude coordinate system so that we can overlay our coordinates, which is what we do next.

library(ggplot2)
library(ggrepel)
autoplot(dc_map.latlng) +
  theme_minimal() +
  theme(
    axis.text.y = element_blank(),
    axis.title = element_blank(),
    axis.text.x = element_blank(),
    plot.margin = unit(c(0, 0, 0, 0), "cm")
  ) +
  geom_point(data = coordinates, aes(x = long, y = lat), color = "navy", size = 4, alpha = 1) +
  geom_label_repel(
    data = coordinates,
    aes(label = name, x = long, y = lat), show.legend = F, box.padding = .5, size = 5
  )

dc-map

And that’s our map. The geom_label_repel function from ggrepel provides the text labels and geom_point from ggplot2 supplies the points. Alternatively, the leaflet package provides an excellent interface to plot coordinates on an interactive map. For more information on tidygeocoder, visit its home on GitHub or CRAN.
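As a sketch of the leaflet alternative mentioned above (reusing the ‘coordinates’ dataset created earlier), an interactive version could be as short as:

```r
library(leaflet)

leaflet(data = coordinates) %>%
  addTiles() %>% # default OpenStreetMap tiles
  addMarkers(lng = ~long, lat = ~lat, popup = ~name)
```

The ~ formula syntax tells leaflet to look up those columns in the supplied data frame.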
