Tom Hope

Regular expressions for replication
2021-07-01

As part of the publication process for my recent article on how states preempt separatist conflict, I needed to submit replication materials to the journal. I took my graduate quantitative methods sequence with the late Tom Carsey, so I’ve long been a proponent of replicability efforts in social science. I also had an hourly job in grad school replicating quantitative results for multiple political science journals, so I’m very familiar with best practices for replication. Unfortunately, in the four years since I wrote the first line of code for this project, somewhere in between defending my dissertation and starting a new job (ok, fine, almost immediately after writing that first line of code), I got a little lazy.

Sometimes it’s faster (easier) to just write code that works for you, on your system, without any consideration for some poor researcher who may try to replicate your results in the future.1 This tendency was especially bad for this project because at various points in time I was writing code to run on my personal laptop and two different high performance computing clusters. This is a recipe for code that doesn’t travel well and will almost certainly fail to replicate.

There were a lot of changes I made to my code to ensure my results replicate, but the most tedious (and time consuming, by far) was cleaning up my file paths. Due to the computationally intensive GIS work and Bayesian statistics involved in the project, I ran lots of code on a cluster, and then pulled the results back to my laptop to summarize and create figures. This unsurprisingly resulted in a huge mess when looking at the project as a whole, rather than any individual script. Luckily, R and Rstudio made things (relatively) painless to fix.

File paths

Anytime you load a dataset into R, you need to specify the path to that file. The same’s true when you save R output to a file. This article started as a chapter of my dissertation, so all of the code originally lived in the Dissertation folder on my laptop. However, as I started adapting it into an article-length manuscript, I created a new Conflict Preemption folder in my Projects folder. By the time the article was accepted, I had two main folders I needed to combine:

  • /Users/Rob/Dropbox/UNC/Dissertation/Onset
  • /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption

Both of these folders live in my Dropbox, but that’s about where the similarities end. I wrote most of the code for running models while still at UNC, so when I added new scripts to run models to respond to reviewer comments, I still stuck them in the UNC folder. That also means that all of the output of these models ended up in the UNC folder when it got transferred from the cluster. However, when I needed to do something simpler like create a time series plot of the number of separatist groups in existence, I wrote that code in the WashU folder. I also had a script in the WashU folder to load all of the results and generate plots from them. Because this script and the data it needed to load were in completely different directories, this is what I had to do to load the data to create one of the main figures:

load('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Figure Data/marg_eff_pop_df_cy.RData')

Not particularly likely to work on anyone else’s computer. To fix this, I needed to move all of the data to the Conflict Preemption folder, which was easy, and then rewrite all of the code that referenced file locations, which was less easy.

Here

As a first step, I needed to chop off /Users/Rob/Dropbox/UNC/Dissertation/Onset/ from the start of every file path. All the files for the article, including both the R scripts and the various data files, now live in /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption, but all of the file paths in the scripts still start with /Users/Rob/Dropbox/UNC/Dissertation/Onset, because that’s where all the files were before. You can do this just using the standard find and replace functionality built into RStudio. However, there’s no guarantee that someone in the future will correctly set R’s working directory before running the code. I used the here R package to ensure that R can always find everything it needs for my code. All you have to do is wrap file paths in the here() function in the package, and they’ll be automatically completed with the full file path, letting R find your files.2

You need to use the relative path to each file, so for a file with an absolute path of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption/Figure Data/marg_eff_pop_df_cy.RData, the relative path (relative to the project folder of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption) would be Figure Data/marg_eff_pop_df_cy.RData. The final bit of R code looks like this:

load(here('Figure Data/marg_eff_pop_df_cy.RData'))

The addition of that here() in between load() and the file path means that things are no longer as simple as finding and replacing the start of the file path.

Regular expressions

Luckily, I was able to take advantage of RStudio’s built in support for regular expressions to save myself from having to manually change each line of code that either loaded or saved a file. Regular expressions are a powerful way to search through and manipulate text. You can activate them in RStudio’s find and replace dialog by checking the Regex box:

Once you’ve done that, certain characters in your search will no longer be interpreted literally. The most important difference is probably ., which is a stand-in for any character.3 This is similar to how * is a wildcard in the Unix shell, e.g., you can use ls *.R to list all R script files in a folder. The main regular expression feature I used is the capturing group, which allows you to identify and extract a subset of a line of text. You designate a capturing group by surrounding the desired text with parentheses. To fix all of the code loading RData files from the Figure Data folder, my regular expression looked like this:

'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\.RData)'

It starts with /Users/Rob/Dropbox/UNC/Dissertation/Onset/, which is the part I want to get rid of. Next, (Figure Data/.*\.RData) tells the regular expression to look for any character (.) repeated any number of times (*), followed by .RData. Because . is a special character in regular expressions, we have to escape it with a backslash (\) to match a literal period. This will match any file name ending in .RData in the Figure Data folder. If we left out the leading /Users/Rob/Dropbox/UNC/Dissertation/Onset/, we’d still capture the part we want to keep, but since the old prefix wouldn’t be part of the search string, it would never get replaced. The same logic explains why we include the opening and closing quotation marks: if we didn’t, we’d end up with a here() call nested inside quotation marks, which R would treat as part of a string rather than as a function call.
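These matching rules behave the same way in R’s own regex functions, which makes it easy to sanity check a pattern before trusting it in find and replace. (Inside an R string the backslash itself has to be escaped, so the \. above is written '\\.'.)

```r
## '.' matches any single character, so 'gr.y' matches gray and grey but not graay
grepl('gr.y', c('gray', 'grey', 'graay'))
## [1]  TRUE  TRUE FALSE

## '\\.' matches a literal period, anchoring the .RData extension
grepl('Figure Data/.*\\.RData', 'Figure Data/marg_eff_pop_df_cy.RData')
## [1] TRUE
```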

At this point I had the core of the line that I wanted to keep, but I still needed to extract it and place it inside a call to here(). You accomplish this using a backreference to the capturing group. To reference the first capturing group, you use either \1 or $1 depending on which flavor of regular expressions you’re working with. Figuring out which flavor a given tool uses is one of the more annoying parts of regular expressions, and you’ll sometimes just have to find out through trial and error. Luckily, RStudio accepts either version!

To replace the absolute path with a relative one wrapped in a here() call, this is what I typed into the Replace field in the find and replace dialog:

here('$1')

and it resulted in this:

here('Figure Data/marg_eff_pop_df_cy.RData')

Thanks to the power of capture groups, you can just hit the replace all button and instantly transform every file path into a much more portable and replication-friendly one.
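If you’d rather test the search and replace pair before pointing RStudio’s dialog at a whole project, gsub() performs the same substitution programmatically; base R uses the \1 style backreference, written '\\1' inside a string:

```r
line <- "load('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Figure Data/marg_eff_pop_df_cy.RData')"

## same pattern and capture group as in the find and replace dialog
gsub("'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\\.RData)'",
     "here('\\1')",
     line)
## [1] "load(here('Figure Data/marg_eff_pop_df_cy.RData'))"
```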

A little bit faster now

If you’re feeling really confident that you moved every file correctly, you can replace all file paths with the following regular expression:

'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(.*\..*)'

This will get any file with a file extension (the initial .* picks up any preceding subdirectories and the file name, \. matches a literal period, and the final .* matches the extension) and stick the whole relative path into the resulting here() call. As an example, this will turn this:

groups <- readRDS('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data/groups_nightlights.RDS')

into this:

groups <- readRDS(here::here('Input Data/groups_nightlights.RDS'))
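You can also skip the dialog entirely and apply the substitution to every script at once. Here’s a sketch of a hypothetical helper (fix_paths() is my own name, not from any package) built on the same pattern; I’d only run something like this on a version-controlled or backed-up copy of the project:

```r
## hypothetical helper: rewrite old absolute paths in a script as here() calls
fix_paths <- function(script) {
  lines <- readLines(script)
  lines <- gsub("'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(.*\\..*)'",
                "here('\\1')",
                lines)
  writeLines(lines, script)
}

## apply it to every R script in the project directory
## invisible(lapply(list.files(pattern = '\\.R$', recursive = TRUE), fix_paths))
```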
  1. I’m using ‘replication’ here to mean that the code used to generate quantitative results from a dataset should produce those same results when run by another researcher, not in the sense that independent researchers following the published protocol can collect the data themselves and arrive at the same conclusion. I use the term ‘reproducible’ to describe that property. Annoyingly, different fields use opposing definitions of these two terms.

  2. Specifically, here() will key into the .Rproj file included in my replication materials and use that to properly locate everything else. 

  3. Except for newlines, carriage returns, and other end of line special characters. 

Faceted maps in R
2021-05-19

I recently needed to create a choropleth of a few different countries for a project I’m working on about the targeting of UN peacekeepers by non-state armed actors. A choropleth is a type of thematic map where data are aggregated up from smaller areas (or discrete points) to larger ones and then visualized using different colors to represent different numeric values.

See this simple example, which displays the area of each county in North Carolina, from the sf package documentation.1 First, we need to load sf and then get the built-in nc dataset:

library(sf)
nc <- st_read(system.file('shape/nc.shp', package = 'sf'))
plot(nc[1])

Since I needed to generate choropleths for multiple countries, I decided to use ggplot2’s powerful faceting functionality. Unfortunately, as I discuss below, ggplot2 and sf don’t work together perfectly in ways that become more apparent (and problematic) the more complex your plots get. I moved away from faceting, and just glued together a bunch of separate plots, but then I had to figure out how to end up with a shared legend for five separate plots. Read on to see how I solved both of these issues.

The data

I already loaded sf to make the plot of North Carolina above, so now let’s load the remaining packages we’ll use:

library(tidyverse) # data manipulation and plotting
library(tmap)      # spatial plots
library(cowplot)   # combine plots
library(RWmisc)    # clean plot theme

I’m working with cleaned and subsetted versions of ACLED and GADM, which I’ve uploaded to my website as PKO.Rdata if you want to download them and run this code yourself. The acled object contains a list of attacks on peacekeepers in active Chapter VII UN peacekeeping missions in Sub-Saharan Africa, while the adm object contains all of the second-order administrative districts (ADM2) in the five countries with active missions.

## load data
load(url('https://jayrobwilliams.com/data/PKO.Rdata'))

## inspect
head(acled)
head(adm)
## Simple feature collection with 6 features and 30 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -3.6102 ymin: 0.4966 xmax: 29.4654 ymax: 19.4695
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 31
##   data_id   iso event_id_cnty event_id_no_cnty event_date  year time_precision
##     <dbl> <dbl> <chr>                    <dbl> <date>     <dbl>          <dbl>
## 1 6713346   140 CEN47283                 47283 2019-12-27  2019              1
## 2 6689432   180 DRC16211                 16211 2019-12-08  2019              1
## 3 7578005   180 DRC16182                 16182 2019-12-04  2019              1
## 4 7191069   466 MLI3253                   3253 2019-10-21  2019              1
## 5 6759702   466 MLI3225                   3225 2019-10-06  2019              1
## 6 6023339   466 MLI3224                   3224 2019-10-06  2019              1
## # … with 24 more variables: event_type <chr>, sub_event_type <chr>,
## #   actor1 <chr>, assoc_actor_1 <chr>, inter1 <dbl>, actor2 <chr>,
## #   assoc_actor_2 <chr>, inter2 <dbl>, interaction <dbl>, region <chr>,
## #   country <chr>, admin1 <chr>, admin2 <chr>, admin3 <chr>, location <chr>,
## #   geo_precision <dbl>, source <chr>, source_scale <chr>, notes <chr>,
## #   fatalities <dbl>, timestamp <dbl>, iso3 <chr>, month <dbl>,
## #   geometry <POINT [°]>

## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 18.54607 ymin: 4.221635 xmax: 22.395 ymax: 9.774724
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 20
##   GID_0 NAME_0   GID_1  NAME_1 NL_NAME_1 GID_2 NAME_2 VARNAME_2 NL_NAME_2 TYPE_2
##   <chr> <chr>    <chr>  <chr>  <chr>     <chr> <chr>  <chr>     <chr>     <chr> 
## 1 CAF   Central… CAF.1… Bamin… <NA>      CAF.… Bamin… <NA>      <NA>      Sous-…
## 2 CAF   Central… CAF.1… Bamin… <NA>      CAF.… Ndélé  <NA>      <NA>      Sous-…
## 3 CAF   Central… CAF.2… Bangui <NA>      CAF.… Bangui <NA>      <NA>      Sous-…
## 4 CAF   Central… CAF.3… Basse… <NA>      CAF.… Alind… <NA>      <NA>      Sous-…
## 5 CAF   Central… CAF.3… Basse… <NA>      CAF.… Kembé  <NA>      <NA>      Sous-…
## 6 CAF   Central… CAF.3… Basse… <NA>      CAF.… Minga… <NA>      <NA>      Sous-…
## # … with 10 more variables: ENGTYPE_2 <chr>, CC_2 <chr>, HASC_2 <chr>,
## #   ID_0 <dbl>, ISO <chr>, ID_1 <dbl>, ID_2 <dbl>, CCN_2 <dbl>, CCA_2 <chr>,
## #   geometry <MULTIPOLYGON [°]>

First attempt: ggplot2

The first step is to associate each individual attack with the ADM2 it occurred in, which we can do with the st_join() function. st_join() executes a left join by default, so by using adm for the x argument and acled for the y argument, we end up with one row for every ADM2 with no attacks in it, and n rows for each ADM2 with attacks in it, where n equals the number of attacks in that ADM2. We can then use group_by() and summarize() to create a count of attacks for each ADM2 by summing the number of non-NA observations of event_id_cnty, the main ID field in ACLED. Finally, I log this count variable (using log1p(), since some ADM2s have no attacks and ln(0) is undefined) so that outliers in northern Mali and the eastern DRC don’t wash out the rest of the color scale. Putting it all together:

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +                 # no borders
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  theme_rw() +                        # clean plot
  theme(axis.text = element_blank(),  # no lat/long values
        axis.ticks = element_blank()) # no lat/long ticks

That’s a lot of wasted white space, and it can make certain countries harder to see. Let’s split it out using facet_wrap(). We simply add a facet_wrap() call to our ggplot2 code, and tell it to split by our country name variable, NAME_0:

adm %>% 
  st_join(acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  facet_wrap(~ NAME_0) +
  theme_rw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

We’ve got facets, but everything is still clearly on the same scale. Let’s set scales = 'free' in our call to facet_wrap() to try and fix that.

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  facet_wrap(~ NAME_0, scales = 'free') +
  theme_rw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())
## Error: coord_sf doesn't support free scales

And we get an error. It turns out that the ggplot2 codebase assumes it can manipulate axes independently of one another. This is very much not the case with geographic data, where a meter vertically needs to equal a meter horizontally, so coord_sf() locks the axes in much the same manner as coord_fixed().2 To try and get around the limitations of ggplot2’s non-spatial origins, I turned to a package written from the ground up for plotting spatial data.

Second attempt: tmap

My googling led me to this Stack Overflow answer extolling the virtues of the tmap package.3 tmap is a package for drawing thematic maps from sf objects using a syntax very similar to ggplot2. We can reuse the same data wrangling code as before and pipe it into our plotting function, which this time is tm_shape(). We then add a call to tm_polygons() to get our colored features and tm_facets() to split them apart. Note that unlike ggplot2, we need to quote the names of variables in tmap functions:

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  tm_shape() +
  tm_polygons('attacks', title = 'PKO targeting\nevents (logged)') +
  tm_facets('NAME_0')

Much better so far! However, notice that tmap defaults to assuming that our attacks variable is discrete. We’ll need to tell it that it’s continuous. And what if we moved that legend down to the bottom right to get rid of the wasted space currently there?

adm %>% 
  st_join(acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  tm_shape() +
  tm_polygons(col = 'attacks',
              title = 'PKO targeting\nevents (logged)',
              style = 'cont') +                  # continuous variable
  tm_facets('NAME_0') +
  tm_layout(legend.outside.position =  "bottom", # legend outside below
            legend.position = c(.8, 1.1))        # manually position legend

This is…fine. You’ll notice that there’s a lot of white space at the bottom of the plot, which I still haven’t figured out how to eliminate, and I personally prefer the color palette options available in ggplot2. Finally, there’s not much control over the legend compared to what you get with ggplot2, so let’s head back there and try to come at this problem from a different direction.

Third attempt: cowplot

While we’re still using ggplot2 to make individual plots, we need some way to combine them into a final plot. We can rely on the plot_grid() function in the cowplot library for that.4 We need to create five subplots, which we could do manually, but let’s do it programmatically because at some point you may need to do this for 27 different countries. The best way to store our five subplots is in a list, because lists in R can contain any type of R object as their elements.5 I’m going to use the map() function from the purrr package to accomplish this, but you could also use lapply(). map() takes a list as its first argument, .x, and a function as its second, .f. To see how map() works, look at the following example:

map(1:3, sample)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 2 3 1

map() returns a list of length 3 because our input .x was a vector of length 3, and it applies the function .f to each element of .x. I’m going to use an anonymous function to filter adm to only contain ADM2s from one country at a time, then create each of our subplots separately, just as we did above:

pko_countries <- c('Central African Republic', 'Democratic Republic of the Congo',
                   'Mali', 'South Sudan', 'Sudan')

## create maps in separate plots, force common scale between them
maps <- map(.x = pko_countries, 
            .f = function(x) adm %>% 
              filter(NAME_0 == x) %>% 
              st_join(acled) %>% 
              group_by(NAME_0, NAME_1, NAME_2) %>% 
              summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
              ggplot(aes(fill = attacks)) +
              geom_sf(lwd = NA) +
              scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
              theme_rw() +
              theme(axis.text = element_blank(),
                    axis.ticks = element_blank()))

We can either supply each individual subplot to plot_grid() separately, or we can use the plotlist argument to pass a list of plots; good thing we saved them in a list:

## use COWplot to combine and add single legend
plot_grid(plotlist = maps, labels = LETTERS[1:5], label_size = 10, nrow = 2)

I tried using the name of each country as the subplot label, but because label positioning is relative to the width of labels it was impossible to get them all nicely left-aligned. As a result, I had to settle on using letters to label the subplots and then identifying them in the figure caption in text. As you’ll see later, there’s no perfect way of accomplishing this and you’ll have to make a trade-off somewhere.

Setting aside that compromise, there’s still one issue with this plot that we can fix. We’re measuring the same thing (attacks on UN peacekeeping personnel) in all five choropleths, so there’s no need for five separate scales.

Shared legend

The cowplot documentation demonstrates how to use the get_legend() function to extract the legend from one of the subplots and then add it as another element to plot_grid(), placing it in the bottom right like we sort of managed to do with tmap. However, we need to add theme(legend.position = 'none') to the ggplot call for each subplot, otherwise we’ll just end up with six legends. That’s a change we need to apply to each element of our list of maps, which makes it another job that map() is perfect for! We’ll use map() to remove the legend from each subplot in maps, then use get_legend() to add a single legend in the bottom right.

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none'))),
          get_legend(maps[[1]]),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

This doesn’t look right! We told plot_grid() to start with our maps, so why is the legend the first thing in the plot? If you look closely at the documentation for plot_grid(), you’ll see that the ... argument comes before the plotlist argument in the function definition. Even when we specify plotlist first, the function will add plotlist after ....6 To fix this, all we need to do is concatenate the results of get_legend() with the results of our call to map(). Note that we need to first transform the former to a list with list(), otherwise each element of it will be concatenated separately rather than as a grob object:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps[[1]]))),
          labels = LETTERS[1:5],
          label_size = 10,
          nrow = 2)
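To see why the legend needs that list() wrapper, it helps to know that c() splices the elements of each list it’s given. A grob is itself a list-like object, so concatenating it directly would break it apart into its components; a quick base R illustration:

```r
leg <- list(grobs = 'a', widths = 'b')  # stand-in for a legend grob, which is list-like

length(c(list(1, 2), leg))        # leg's two components get spliced in separately
## [1] 4
length(c(list(1, 2), list(leg)))  # wrapping keeps leg together as one element
## [1] 3
```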

So far so good. But if we try using a different map in our call to get_legend(), things get weird:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps[[4]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

Each subplot has its own unique legend that’s automatically generated from the values of attacks it contains. This is even worse than it might seem at first glance, because it means that the various subplots are in no way comparable to one another!

Accurate shared legend

To avoid misrepresenting the data, we need to ensure that each subplot has the same legend. The easiest way to do this is to manually set the legend for each subplot in our call to scale_fill_continuous(). Even though we’re manually setting the bounds of the legend, that doesn’t mean we have to hard code them. We can use a simpler version of our code to join attacks to ADM2s and then calculate the highest number of attacks across all countries in the data. Then we take advantage of the fact that scale_fill_continuous() can pass additional parameters to continuous_scale() via the ... argument. The continuous_scale() function is a low-level function used throughout ggplot2 to construct continuous scales, and it has a limits argument that sets the bounds of the scale. All we have to do is pass the minimum and maximum (logged) numbers of attacks in the data and we’re in business:

st_join(adm, acled) %>% 
  st_drop_geometry() %>%   # we don't need a map at the end; drop geometry to speed up
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  pull(attacks) %>%        # extract attacks variable
  range() -> attacks_range # get min and max

## create maps in separate plots, force common scale between them
maps_shared <- map(.x = pko_countries, 
                   .f = function(x) adm %>% 
                     filter(NAME_0 == x) %>% 
                     st_join(acled) %>% 
                     group_by(NAME_0, NAME_1, NAME_2) %>% 
                     summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
                     ggplot(aes(fill = attacks)) +
                     geom_sf(lwd = NA) +
                     scale_fill_continuous(limits = attacks_range,
                                           name = 'PKO targeting\nevents (logged)') +
                     theme_rw() +
                     theme(axis.text = element_blank(),
                           axis.ticks = element_blank()))

Now all that’s left is to use plot_grid() to put it all together:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_shared[[1]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

And unlike before, the legend is identical regardless of which subplot we use with get_legend():

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_shared[[4]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

This approach is still useful even if you’re not working with spatial data. plot_grid() is powerful because it lets you make asymmetric arrangements like this example from the cowplot documentation:

p1 <- ggplot(mtcars, aes(disp, mpg)) + 
  geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) +
  geom_point()

plot_grid(p1, p2, labels = c('A', 'B'), rel_widths = c(1, 2))

If the units you’re faceting by contain substantially different observations, you might end up in a situation where the automatically generated legends are different from one another. Manually creating the scale of the legend and ensuring it’s the same for all plots would solve this problem here, too.
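As a sketch of what that would look like with these non-spatial plots (redefining p1 and p2 with a color aesthetic added purely for illustration), you could compute the limits once and pass them to each plot’s scale:

```r
library(ggplot2)
library(cowplot)

## shared limits computed once from the full data
hp_range <- range(mtcars$hp)

p1 <- ggplot(mtcars, aes(disp, mpg, color = hp)) +
  geom_point() +
  scale_color_continuous(limits = hp_range)
p2 <- ggplot(mtcars, aes(qsec, mpg, color = hp)) +
  geom_point() +
  scale_color_continuous(limits = hp_range)

## identical scales mean either plot's legend can represent both
plot_grid(p1 + theme(legend.position = 'none'),
          p2 + theme(legend.position = 'none'),
          get_legend(p1),
          nrow = 1, rel_widths = c(2, 2, 1))
```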

Bonus: still to solve

Don’t let anyone convince you they know everything. I still haven’t managed to get my ideal solution working (conditional on regular faceting with facet_wrap() being out of the question). I tried to create five subplots and add a facet label to each by making each one a single-panel facet. Straightforward enough, right?

maps_facet <- map(.x = pko_countries, 
                  .f = function(x) adm %>% 
                    filter(NAME_0 == x) %>% 
                    st_join(acled) %>% 
                    group_by(NAME_0, NAME_1, NAME_2) %>% 
                    summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
                    ggplot(aes(fill = attacks)) +
                    geom_sf(lwd = NA) +
                    scale_fill_continuous(limits = attacks_range,
                                          name = 'PKO targeting\nevents (logged)') +
                    facet_wrap(~NAME_0) +
                    theme_rw() +
                    theme(axis.text = element_blank(),
                          axis.ticks = element_blank()))

plot_grid(plotlist = c(map(.x = maps_facet,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_facet[[1]]))),
          nrow = 2)

Not so much, and no amount of tinkering with the align and axis arguments to plot_grid() has yielded any improvement. The specific paper this plot is for doesn’t have any other plots with facets, so I’m content to go with my inelegant solution of lettered labels and a key to them in the figure caption. If that weren’t the case, I might still be fiddling with this and getting deeper and deeper into the source code for plot_grid().

  1. If you’re wondering why the largest county area is in the ballpark of 0.25, it’s because the data are in square degrees, an old non-SI unit of measurement that’s defined in terms of how much the field of view from a given point is obstructed by an object. GIS is so easy these days, folks. 

  2. The more I learn about how ggplot2 and sf work under the hood, the more amazed I am that geom_sf() Just Works in 80% of cases, let alone works at all. 

  3. The answer also listed the geom_spatial() function from the ggspatial package as an alternative option, but I couldn’t get it to work. The answer is three and a half years old, which means it’s very possible something changed in either sf or ggspatial that broke this solution. So it goes. 

  4. It’s much more powerful and easily customizable than gridExtra::grid.arrange().

  5. They can also contain heterogeneous elements, which will come in handy later.

  6. If you check out the actual source code of plot_grid(), line 9 shows you that the function is indeed putting ... ahead of plotlist: plots <- c(list(...), plotlist)

Finding Backcountry Campsites with CalTopo, OpenStreetMap, and R
2021-01-04

Like many people, I’ve been spending more time outdoors during this pandemic. While this means daily walks in my neighborhood, it also means getting out into the wilderness and sleeping in a tent when I can. Although outdoor recreation is one of the safer ways to entertain yourself these days, it’s not without its own concerns. The difficulty of safely getting to trailheads means that while I’m backpacking more than usual, it’s still not as often as I’d like.

That means I’m spending a decent chunk of time thinking about and planning future trips. At some point in the process of doing this, I realized that I could use the GIS skills from my day job to help make planning future trips more efficient. In this post I walk through how you can use GIS tools in R to help with some of the route planning for a multiday backpacking trip. Specifically, how you can use open source spatial data on geography and transportation infrastructure to identify potential campsites along a hiking trail.

This was largely an exercise in seeing how I could apply GIS skills I’ve learned in the study of political violence to small-scale GPS navigation. I haven’t had the opportunity to hit the trail and test out any of the assumptions I use in this process yet, so you should view this post as more of a (loose) method than concrete suggestions. For a short and simple point-to-point hike with only one route, there’s really no need to engage in this level of GIS analysis. I’ve kept things simple to make them easier to follow, but this approach could actually be useful and save some time when planning a longer trip with many potential routes.

Backcountry camping

At some point in the future, I want to hike the Uwharrie Trail in Uwharrie National Forest in central North Carolina, near where I went to grad school. As I think about this (probably far off) trip, I’ve been using CalTopo to plan my route.

If you spend any amount of time in the outdoors, you should know about CalTopo. CalTopo is a website that lets you plan routes (hiking, skiing, rafting, etc.) on top of super high resolution topographic maps. You can then turn your smartphone into a full-featured GPS and use it to follow those routes (CalTopo offers a mobile app, as does Gaia GPS, both for about $20 a year). While the Uwharrie Trail is a pretty straightforward hike, I’ve been using this as an excuse to try and apply my GIS skills in a new context.

CalTopo is great, but it’s very point and click. I like doing things programmatically when I can, so that means it’s time to grab some of the open source data that CalTopo uses so we can play around with it in R. The base map in CalTopo is called MapBuilder Topo, and uses OpenStreetMap data as its starting point, so let’s start there.

Disclaimer

This guide is intended to show how to identify potential backcountry campsites on public land where dispersed camping is permitted. If you are backpacking in an area with designated, maintained backcountry campsites, you should use them. Dispersed camping is typically permitted in less-traveled areas where the impact of campers is better minimized by diffusing it rather than concentrating it into a handful of designated sites.1

Always check regulations for any land you plan to camp on to see if there are specific requirements for site selection or areas where camping is prohibited. Picking an actual campsite requires identifying areas where your safety will be maximized and the long-term impact of your stay will be minimized. See this guide for the basics and this series for a slightly more hardcore set of principles to follow. And remember, never go into the wilderness without telling someone where you’re going and when you should be back.

Getting the data

OpenStreetMap (OSM) is an open source map of the entire globe; think of it as a hybrid of Google Maps and Wikipedia. OSM is designed so that anyone can easily add to or edit it. Setting aside the normative value of this perspective, this is helpful for us because it means that OSM is transparent. We can use the excellent osmdata R package to query OSM via the Overpass API, and we can use the OSM website itself to learn the various parameters we’ll need in our queries.

Trails

The getting started vignette covers much of the basics of using osmdata. The key functions are osmdata::opq(), which builds a query to the Overpass API, and osmdata::add_osm_feature(), which requests specific features. OSM classifies features using key-value pairs, and we can use the OSM website to figure out just which pairs we need. Navigate to an area of interest, right-click on the feature of interest, and then select “query features.”

Next, select the desired feature in the dialog on the left of the screen. In this case, select the “Relation” rather than the “Path” because the path will only include one segment of the trail while the relation will include its entire length.

We can see here that the Uwharrie Trail relation has route=hiking, so that’s the key-value pair we’ll have to specify in our query.

Make sure to use the bbox argument to osmdata::opq(), otherwise you’ll request every hiking trail in the world! You can manually specify the four edges of a bounding box to search in, or you can use the osmdata::getbb() function to get it automatically using the Nominatim geocoder.

library(tidyverse)
library(sf)
library(osmdata)

## get hiking routes in Uwharrie National Forest
unf_trails <- opq(bbox = getbb('uwharrie national forest usa')) %>% 
  add_osm_feature(key = 'route', value = 'hiking') %>% 
  osmdata_sf()

Notice that we use the osmdata::osmdata_sf() function to convert the resulting object for use with the sf R package. Let’s inspect the resulting object of class osmdata_sf.

## inspect
unf_trails
## Object of class 'osmdata' with:
##                  $bbox : 35.3951403,-80.0236608,35.4351403,-79.9836608
##         $overpass_call : The call submitted to the overpass API
##                  $meta : metadata including timestamp and version numbers
##            $osm_points : 'sf' Simple Features Collection with 3341 points
##             $osm_lines : 'sf' Simple Features Collection with 26 linestrings
##          $osm_polygons : 'sf' Simple Features Collection with 0 polygons
##        $osm_multilines : 'sf' Simple Features Collection with 1 multilinestrings
##     $osm_multipolygons : NULL

We can see that the unf_trails object includes points, lines, polygons, multilines, and multipolygons. We want to use the lines since that will include any short trail segments that aren’t part of a larger trail. We can easily plot the trails using this object.

## plot
plot(unf_trails$osm_lines$geometry, col = 'coral4')

Don’t get lost

Let’s do some quick sanity checks. First, Wikipedia tells us the trail should be about 20 miles. We can use the sf::st_length() function to measure the length of each trail segment, and the sf::st_union() function to combine all segments. We’ll get our answer in meters, which, as a metric-deprived American, won’t be all that helpful to me. To get around this, we can use the units::set_units() function to convert from meters to miles.

## measure total trail length
st_union(unf_trails$osm_lines$geometry) %>% # combine all segments.
  st_length() %>% # measure length
  units::set_units(mi) # convert to miles
## 28.26457 [mi]

While that’s initially concerning, a closer reading of the Wikipedia article for the trail reveals that it was originally 40 miles long, so OSM likely includes some of the Northern section of the trail beyond what’s officially recognized today.

We should also plot the bounding box that osmdata::getbb() ends up generating to ensure we’re not missing any part of the trail. We can do this with the OpenStreetMap R package. Here we unfortunately need to manually specify the bounding box as a series of two vectors with the latitude and longitude coordinates of the upper-left and lower-right corners of the box. OpenStreetMap::openmap() uses (latitude, longitude) pairs, not (longitude, latitude) pairs as is more common in GIS, i.e., (y, x) not (x, y), so be sure to include them in that order.[^lat-long] OpenStreetMap::openproj() also takes a projection argument, so I use sf::st_crs(4326)$proj4string to generate one automatically, ensuring I don’t introduce a typo somewhere by accident.

[^lat-long]: I spent 20 minutes not understanding why I couldn’t get this to work before I finally read the documentation. Don’t be like me, folks.

library(OpenStreetMap)

## get bounding box
unf_bb <- getbb('uwharrie national forest usa')

## get OSM tiles
unf_tile <- openmap(c(unf_bb[2,2],  # max lat (upper left)
                      unf_bb[1,1]), # min long
                    c(unf_bb[2,1],  # min lat (lower right)
                      unf_bb[1,2]), # max long
                    type = 'osm', mergeTiles = T)

## project map tiles and plot (OSM comes in Mercator...)
plot(openproj(unf_tile, projection = st_crs(4326)$proj4string))

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Uh oh. We can see that we’re only getting a small portion of the total trail and that it trails (heh) off the map on three sides. That’s not great, so let’s fix it. We can start by looking up Uwharrie National Forest itself on the OSM website. This gives us the boundaries of the official forest land in orange.

We can see from the dialog on the left that the forest’s OSM ID is 2918413, so we can use the osmdata::opq_osm_id() function to get the polygons for the forest’s boundaries. Let’s grab the forest boundaries and plot them, along with the bounding box they imply and the bounding box that resulted from osmdata::getbb() (in red) for comparison.

## get Uwharrie National Forest Boundaries
unf <- opq_osm_id(type = 'relation', id = 2918413) %>% 
  osmdata_sf()

## plot Uwharrie National Forest polygons
plot(unf$osm_multipolygons$geometry, col = 'lightgreen', border = NA, bty = 'n')

## construct line for original bounding box
plot(st_multilinestring(list(matrix(c(unf_bb[1, 1], unf_bb[2, 1],
                                      unf_bb[1, 1], unf_bb[2, 2],
                                      unf_bb[1, 2], unf_bb[2, 2],
                                      unf_bb[1, 2], unf_bb[2, 1],
                                      unf_bb[1, 1], unf_bb[2, 1]),
                                      ncol = 2, byrow = T))),
     add = T, col = 'red')

## plot bounding box for Uwharrie National Forest polygons
plot(st_as_sfc(st_bbox(unf$osm_multipolygons)), add = T)

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Wow, we were missing a lot before. Let’s use the bounding box for the entire forest as our new bounding box. First, we plot OSM using this new bounding box. sf::st_bbox() yields a named vector of four numbers (xmin, ymin, xmax, ymax) rather than the matrix that osmdata::getbb() produces, so we need to pick out the right elements to specify the upper-left and lower-right corners of our new, bigger bounding box.

## get OSM tile for Uwharrie National Forest polygons
unf_full_tile <- openmap(c(st_bbox(unf$osm_multipolygons)[4],  # lat
                           st_bbox(unf$osm_multipolygons)[1]), # long
                         c(st_bbox(unf$osm_multipolygons)[2],  # lat
                           st_bbox(unf$osm_multipolygons)[3]), # long
                         type = 'osm', mergeTiles = T)

## project and plot OSM tile
plot(openproj(unf_full_tile, projection = st_crs(4326)$proj4string))

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

That’s much better! We’re getting a lot of area beyond the trail, but it’s easy to filter that out later so it’s better to grab too much than too little.

The whole trail

Now we can go back and grab all hiking trails in Uwharrie National Forest using our new bounding box. osmdata::opq() expects a bounding box in a certain format, so let’s inspect it to see what we’re working with and what we need to reshape the output of sf::st_bbox(unf$osm_multipolygons) into:

## bbox format osmdata::opq() expects
unf_bb
##         min       max
## x -80.02366 -79.98366
## y  35.39514  35.43514
## rearrange sf::st_bbox() output
matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
       dimnames = list(c('x', 'y'), c('min', 'max')))
##         min       max
## x -80.17085 -79.73170
## y  35.21987  35.63684

Note that I’m specifying row and column names when creating the new bounding box. Without them, osmdata::opq() will fail! We can now plug this new bounding box object into osmdata::opq() and get all hiking routes in the forest.

## get hiking trails in all of Uwharrie National Forest
unf_trails_full <- opq(bbox = matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
                                     dimnames = list(c('x', 'y'), c('min', 'max')))) %>% 
  add_osm_feature(key = 'route', value = 'hiking') %>% 
  osmdata_sf()

## plot
plot(unf_trails_full$osm_lines$geometry, col = 'coral4')

Now we’re getting a bunch of trails across the Pee Dee River in Morrow Mountain State Park. Again, it’s easy to drop these extra trails later, so for the moment, more complete is better than less complete. These data come from OpenStreetMap, so they also include lots of usable metadata. Let’s take a look at the fields included in our lines:

## inspect
glimpse(unf_trails_full$osm_lines)
## Rows: 106
## Columns: 37
## $ osm_id            <chr> "32024414", "216945232", "216945234", "216945241", …
## $ name              <chr> "Uwharrie Trail", "Mountain Loop Trail", "Mountain …
## $ alt_name          <chr> "Uwharrie National Recreation Trail", NA, NA, NA, N…
## $ bicycle           <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ bridge            <chr> NA, NA, "yes", "yes", NA, NA, "boardwalk", NA, NA, …
## $ construction      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dog               <chr> NA, "leashed", "leashed", "leashed", "leashed", NA,…
## $ foot              <chr> "designated", "designated", "designated", "designat…
## $ footway           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ highway           <chr> "path", "path", "path", "path", "path", "path", "pa…
## $ horse             <chr> NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ lanes             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ layer             <chr> NA, NA, "1", "1", NA, NA, "1", NA, NA, NA, NA, NA, …
## $ motor_vehicle     <chr> NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ name_1            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ oneway            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rcn_ref           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sac_scale         <chr> NA, "mountain_hiking", "mountain_hiking", "mountain…
## $ service           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "parkin…
## $ smoothness        <chr> NA, "bad", "good", "good", "bad", NA, NA, NA, NA, N…
## $ source            <chr> NA, NA, NA, NA, NA, "GPS_2009", "GPS_2009", "GPS_20…
## $ surface           <chr> "dirt", "ground", "wood", "wood", "ground", "ground…
## $ symbol            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "wh…
## $ tiger.cfcc        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.county      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_type   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.reviewed    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left_1  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tracktype         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ trail_visibility  <chr> NA, "excellent", "excellent", "excellent", "excelle…
## $ wheelchair        <chr> NA, "no", "no", "no", "no", NA, NA, NA, NA, NA, "no…
## $ geometry          <LINESTRING [°]> LINESTRING (-80.0435 35.310..., LINESTRI…

We can use the “name” field to subset the data. If you were considering some parallel or spur trails, you could use sf::st_filter() in combination with sf::st_is_within_distance() to instead just grab trails near your primary trail.

## extract OSM lines and filter
ut <- unf_trails_full$osm_lines %>% filter(name == 'Uwharrie Trail')
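As a sketch of that distance-based alternative (the 500 m cutoff and the `ut_nearby` name are my own hypothetical choices; the other object names follow the ones used above), you could keep every segment near the Uwharrie Trail rather than filtering on the name field:

```r
## hypothetical: keep all trail segments within 500 m of the Uwharrie Trail,
## catching spurs and parallel trails that a name filter would drop
ut_proj <- st_transform(ut, st_crs(32119)) # project to a meter-based CRS first

ut_nearby <- unf_trails_full$osm_lines %>%
  st_transform(st_crs(32119)) %>%
  st_filter(ut_proj, .predicate = st_is_within_distance, dist = 500)
```

This mirrors the st_filter() calls used for water and roads below, just applied to the trail layer itself.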

Now we’ve gotten the Uwharrie Trail twice: once using a smaller bounding box and once using a larger one. We can plot them both to see whether the initial query missed any segments.

## plot
plot(ut$geometry, col = 'red')
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Luckily the initial query still picked up every segment, but that won’t always be the case if you start with an inaccurate initial bounding box. If the entire Uwharrie Trail wasn’t collected into a relation, we might have missed large chunks of it on either end. Now we can use the bounding box for the Uwharrie Trail to capture any other features we care about nearby.

Water

The first other feature we need is water. On any multi-day trip, being able to refill your water is essential. The OSM wiki page on waterways shows us that the values we need to grab relevant water sources are river and stream. Although not well-documented, you can supply multiple value arguments to osmdata::add_osm_feature() using c(). This will let us quickly and easily grab both rivers and streams in the area.2

## create bbox for just the Uwharrie Trail; no need for all water in the whole National Forest
ut_bb <- matrix(st_bbox(ut), ncol = 2, dimnames = list(c('x', 'y'), c('min', 'max')))

## get rivers and streams and extract OSM lines
ut_water <- opq(bbox = ut_bb) %>% 
  add_osm_feature(key = 'waterway', value = c('river', 'stream')) %>% 
  osmdata_sf()

Our next step will be to drop any water sources more than a kilometer from the trail. This will simplify our analysis later and also minimizes our environmental impact. To conduct GIS operations in meters, we need to project our data from latitude and longitude-based WGS84 to a meter-based coordinate reference system (CRS). The CRS database epsg.io shows that NAD83 / North Carolina (EPSG:32119) is the projection for data in North Carolina, so we use sf::st_transform() along with sf::st_crs() to project our trail and water source objects. This lets us calculate distances in feet/meters rather than decimal degrees. We’ll use this to limit the water features to those that fall within 1 km of the trail. This way we’re not limiting ourselves to only water features that directly intersect the trail, but we’re also not retaining a bunch of features that are farther off-trail than I like to hike for water.

## project trail
ut <- st_transform(ut, st_crs(32119))

## project water sources
ut_water <- ut_water$osm_lines %>% 
  st_transform(st_crs(32119)) %>% 
  st_filter(ut, .predicate = st_is_within_distance, dist = 1000)

## plot
plot(ut_water$geometry, col = 'lightblue')
plot(ut$geometry, add = T, col = 'coral4')

Roads

If we want to be near water, we want to be far from roads. OpenStreetMap has lots of different categories of roads, so we’ll want to capture all the major ones, as well as service roads and “tracks”, which is how OpenStreetMap refers to forest roads.3 OSM identifies roads with the key “highway,” and inspecting the OSM wiki page on roads shows us the various values we’ll need to grab all relevant roads.

## get roads, project, and limit to w/in 1000 m of trail
ut_roads <- opq(bbox = ut_bb) %>% 
  add_osm_feature(key = 'highway',
                  value = c('primary', 'secondary', 'tertiary', 'residential',
                            'unclassified', 'track', 'service')) %>% 
  osmdata_sf() %>% 
  magrittr::extract2('osm_lines') %>% 
  st_transform(st_crs(32119)) %>% 
  st_filter(ut, .predicate = st_is_within_distance, dist = 1000)

## plot
plot(ut_roads$geometry, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')

Note the use of magrittr::extract2() to extract the osm_lines object from the osmdata_sf object returned by osmdata::osmdata_sf(). This is how you can access a list element in a pipeline, and is equivalent to $osm_lines.
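Under the hood, magrittr::extract2() is just a pipe-friendly alias for `[[`, so the following forms all return the same element (a toy list here stands in for the osmdata result):

```r
library(magrittr)

## toy list standing in for an osmdata_sf object
x <- list(osm_points = 'points', osm_lines = 'lines')

x$osm_lines              # "lines"
x[['osm_lines']]         # "lines"
extract2(x, 'osm_lines') # "lines"
```

The extract2() form is just the one that slots cleanly into a %>% pipeline.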

Campsites

To locate potential campsites we need to identify our priorities and use them to define a set of rules for selecting potential sites. For this exercise, I’m using the following:

  1. I’d like to be within 750 feet of a water source. Some (more hardcore) backpackers prefer to be farther away from water sources to minimize the chance of encountering animals. Since Uwharrie National Forest isn’t an area with heightened bear activity, I’m willing to trade the chance of a raccoon sniffing around my bear canister for a shorter walk to refill my water.

  2. The US Forest Service requires that you camp at least 200 feet away from any water source. This is good practice everywhere, but it’s required in National Forests, so we want to make sure any potential campsites are at least 200 feet from any water features.

  3. The Uwharrie Trail is a fairly heavily-trafficked trail, so I’d like to avoid going more than 1/4 mile off-trail to find a campsite. This will minimize the disturbance to the surrounding area.4 All of the semi-official campsites on the Uwharrie Trail are a good ways off the trail itself, so staying near the trail will contain my impact on a large scale, but minimize it locally.

  4. If you’re not in a designated campsite, you should be at least 200 feet away from any trail. Again, this seeks to minimize your impact on the area by spreading out campsites over time.

  5. If I’m making the effort to carry my shelter, sleep system, and food on my back, you better believe I don’t want to be hearing any cars at night. To try and minimize the chances of this happening, I want to be at least 1,000 feet from any roads. The lower section of the trail skirts particularly close to a residential neighborhood, so this is an important consideration.

  6. I’m going to drop any potential campsites smaller than 0.1 km^2. Choosing where to actually pitch your tent within a potential site area requires many considerations like drainage, wind exposure, and avoiding dead trees overhead. This means that we want to have ample space in which to find the ideal tent spot, so dropping small potential sites reduces the possibility of arriving at a spot and finding that there’s no good place for your tent.

With all of those factors in mind, we can now define our potential campsites and then narrow them down. I start by buffering the rivers and streams by 750 feet with sf::st_buffer(), which gives us every area within 750 feet of a water source. Then I move down my list of conditions, buffering the relevant feature and using sf::st_intersection() when I want to ensure I stay within a given distance of that feature and sf::st_difference() when I want to stay a given distance away from that feature.

Since NAD83 uses meters as the unit of measurement, we need to convert these distances in feet into meters. Again, the units package makes this easy with the units::set_units() function.

## buffer water 750 ft
campsites <- st_union(ut_water) %>% 
  st_buffer(dist = units::set_units(750, ft))

## buffer water 200 ft and subtract
campsites <- st_union(ut_water) %>% 
  st_buffer(dist = units::set_units(200, ft)) %>% 
  st_difference(x = campsites)

## buffer trail 1/4 mile and intersect
campsites <- st_union(ut) %>% 
  st_buffer(dist = units::set_units(.25, mi)) %>% 
  st_intersection(x = campsites)

## buffer trail 200 ft and subtract
campsites <- st_union(ut) %>% 
  st_buffer(dist = units::set_units(200, ft)) %>% 
  st_difference(x = campsites)

## buffer roads to 1000 ft and subtract
campsites <- st_union(ut_roads) %>% 
  st_buffer(dist = units::set_units(1000, ft)) %>% 
  st_difference(x = campsites)

## cast multipolygon to polygons and convert to sf
campsites <- campsites %>% 
  st_cast('POLYGON') %>% 
  st_sf() %>% 
  mutate(id = 1:n()) %>% # create ID variable
  filter(st_area(.) > units::set_units(.1, km^2)) # filter to > .1 sq km

The animation below shows each step in the process in order:

Elevation

So far we haven’t really done anything that you couldn’t do on CalTopo, albeit in a less programmatic way. Let’s change that by bringing in some elevation data. Elevation is important when hiking because it determines how many climbs your lungs will have to endure and how many descents your knees will. CalTopo has great built-in tools for generating elevation profiles and more detailed terrain statistics that can tell you what to expect along a given route. However, you can only calculate them for lines or polygons you’ve manually drawn.

While we could import the potential campsite polygons we’ve just generated into CalTopo and then calculate the terrain statistics, this has two major drawbacks. First, you have to point and click through generating the report for each polygon because there’s no way to batch process. Second, and more importantly, this would use a lot of processing power and computing time on CalTopo’s servers. If, unlike me, you have a paid subscription, you might feel less bad about this, but I’m trying not to take advantage of such an awesome service that CalTopo currently provides for free.

We can use R’s capabilities to handle raster data to solve both of these problems! The elevatr package lets you easily download elevation data in the form of a digital elevation model. These models combine multiple measurements from satellites to produce a single image of the earth where the value of each pixel represents the elevation of a given area. elevatr allows you to easily access elevation data compiled from a number of different data sources. The main function is elevatr::get_elev_raster(), which takes an sf object as its first argument and z, a zoom level from 1 to 14. We can also specify the clip = 'bbox' argument to crop the resulting raster to just the bounding box of our potential campsites, and not the entire tile they fall in.

library(raster)
library(elevatr)

## get elevation raster and clip to bbox
elev <- get_elev_raster(campsites, z = 13, clip = 'bbox')

## plot to inspect
plot(elev, col = grey(1:100/100))
plot(ut$geometry, add = T, col = 'coral4')

Since we can see that the highest point in the area is only about 300 meters above sea level, we don’t need to worry about absolute elevation when picking potential sites. Instead, we want to know how level these areas are; no one wants to wake up smushed against the downhill wall of their tent. We can use the raster::terrain() function to calculate the slope in each pixel.

## calculate slope
camp_slope <- terrain(elev, opt = 'slope', unit = 'degrees')

## plot slope
plot(camp_slope)
plot(ut$geometry, add = T, col = 'coral4')

All that’s left to do is aggregate slope measures to each polygon, and then calculate some sort of summary statistic to tell us how steep each potential site is overall. I’m going to use the median of each area’s slope rather than its average to avoid giving undue influence to outliers (if a 0.5 km^2 area is largely flat with a cliff at one edge, then it’s likely still a good candidate for a campsite). Let’s filter out all areas with a median slope of more than 10°.
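A toy slope vector (made-up numbers, not real raster values) shows why the median is the safer summary here: a single cliff-like pixel drags the mean far more than the median.

```r
## mostly-flat area with one steep pixel at the edge
slopes <- c(2, 3, 3, 4, 45)

mean(slopes)   # 11.4 -- the outlier dominates
median(slopes) # 3    -- still reflects the typical pixel
```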

## calculate median slope for each polygon and filter
campsites <- campsites %>% 
  mutate(med_slope = raster::extract(camp_slope, ., fun = median, na.rm = T)) %>%
  filter(med_slope < 10) 

With that done, we can now plot our potential campsite locations and all the features used to define them:

## plot campsites and all features
plot(ut$geometry, type = 'n')
plot(campsites$geometry, add = T, col = 'lightgreen', border = NA)
plot(ut_water$geometry, add = T, col = 'lightblue')
plot(ut_roads$geometry, add = T, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')

This is a pretty picture, but it’s not very useful. To make it so that we can actually navigate to any of these spots, we need to get them onto a topographic map.

Plan it

To make our map usable, all we have to do is export the potential campsite polygons from R so that we can import them into CalTopo. CalTopo supports a number of file formats for importing, but the one we want to use is GeoJSON. We can use the geojsonio package to easily convert our polygons from sf objects to GeoJSON format and then save them to disk to import into CalTopo.5

There are two (potentially) tricky things we need to do. First, make sure we reproject our NAD83 data back to decimal degree-based WGS84 so that CalTopo can properly reference them. Second, we want to take advantage of R’s capabilities to efficiently wrangle data and create a name field for our polygons so they’ll be easy to identify and reference once they’re in CalTopo. To do this, we need to create a “title” field in our sf object before we convert it to GeoJSON.6

library(geojsonio)

## create site number field; transmute b/c all fields other than label are lost on import
campsites %>% 
  transmute(title = str_c('Potential Site ', row_number())) %>% 
  st_transform(st_crs(4326)) %>% # project to WGS84
  geojson_json() %>% 
  geojson_write(file = 'campsites.json')

## export Uwharrie Trail to save the trouble of tracing it
ut %>% 
  st_transform(st_crs(4326)) %>% # project to WGS84
  geojson_json() %>% 
  geojson_write(file = 'trail.json')

At this point all that’s left to do is click the “Import” button in CalTopo and select your newly created .json file. You can check out the potential campsites live on CalTopo below:

Some closing thoughts viewing the potential sites in context on CalTopo:

  • Potential Site 2 looks promising. It’s slightly downhill from the trail, and has some relatively flat ground. However, the water source is an intermittent stream (denoted by the three dots in the blue line), so depending on time of year there may not actually be easy access to water here.
  • Potential Site 4 is located near both a perennial and an intermittent stream, so the odds of finding a usable water source are higher. Across the trail to the West you can see an area that meets all of our site selection criteria except gentle slopes due to the steep rise to the 795 foot peak nearby.
  • Potential Site 7 demonstrates the limitations of this approach because there are two forest roads near it on the Forest Service map that aren’t included in OpenStreetMap. Google Maps shows that there’s a private RV campground here, so best to avoid it. Doubly so because it’s largely outside of Uwharrie National Forest’s boundaries (the green lines). This is why it’s important to check more than just the terrain before you go!
  1. See here for a discussion of different types of campsites and contexts in which they are usually found. 

  2. If we didn’t do this, we’d have to use c() to combine multiple osmdata_sf objects and then extract the osm_lines object from the combined osmdata_sf object. 

  3. The US Forest Service maintains GIS data on forest roads on National Forest land, but the API to access them is…less than user friendly so I’m ignoring them for this illustration. 

  4. In very sparsely-traveled areas, it can be better to seek out campsites far from the trail to avoid camping in areas where others have recently stayed. This can help prevent the emergence of ‘social’ campsites that are not officially recognized or maintained but are frequently used. It will also reduce the chance that you’ll encounter any local wildlife that have learned that such spots can be a source of easy meals. 

  5. Want to find potential campsites for a trail that’s not in OpenStreetMap? CalTopo supports exports as well as imports, so you can trace the route in CalTopo, export it, then load it in R with sf::st_read() and then carry out the steps above! 

  6. CalTopo refers to an object’s name field as its “Label” in the interface, but this isn’t what it’s called under the hood. I had to export a line I create and inspect the resulting .json file to find out that it’s referred to as a “title” instead. 

]]>
Tom Hope
R Markdown, Jekyll, and Footnotes
2020-10-26 · https://tomhoper.github.io//posts/2020/10/jekyll-footnotes

Update: 05/19/2021 John MacFarlane helpfully pointed out that this is all incredibly unnecessary because pandoc makes it easy to add support for footnotes to GitHub-Flavored Markdown. The documentation notes that you can add extensions to output formats they don’t normally support. Since standard markdown natively supports footnotes when used as an output format, I didn’t even think to look into manually enabling them for GitHub-Flavored Markdown.

If you’re running pandoc from the command line all you need to do is add -t gfm+footnotes to your pandoc command. If you’re working with .Rmd files like me, all you need to do is add +footnotes to the end of the variant: gfm line in your YAML header. As a side benefit, you can drop the --wrap=preserve flag and end up with .md files that aren’t hundreds of columns wide. I’m leaving the original post up below in case anyone who has an even weirder use case than me might find it helpful, or if any of my students ever stumble across this page and don’t believe that I’m still constantly learning, too.
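For an .Rmd file, the relevant chunk of the YAML header would look something like this (a sketch; your other output options will vary):

```yaml
output:
  md_document:
    variant: gfm+footnotes
```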

I use jekyll to create my website. Jekyll converts Markdown files into the HTML that your browser renders into the pages you see. As others and I have written before, it’s pretty easy to use R Markdown to generate pages with R code and output all together. One thing has consistently eluded me, however: footnotes.

Every time I try to include footnotes in my .Rmd file, they end up mangled and not actually footnotes in the final HTML page. My solution thus far has been to just avoid footnotes and lean heavily on parenthetical asides when I’m using R Markdown to generate a page. My recent post on using SQL style filtering to preprocess large spatial datasets before loading them into memory needed a whopping six footnotes, so I finally had to sit down and figure it out.

What’s happening

The ‘standard’ method for adding footnotes in R Markdown is actually a bit of a cheat compared to the method in the official Markdown specification. R Markdown lets you use a LaTeX-esque syntax for defining footnotes:

Here is some body text.^[This footnote will appear at the bottom of the page.]

However, Jekyll uses the official Markdown specification for footnotes, so this won’t work. Instead, we need to define them with the official syntax:

Here is some body text.[^1]

[^1]: This footnote will appear at the bottom of the page.

However, when R Markdown converts your file from standard Markdown to GitHub-Flavored Markdown, something strange happens and the output in your .md file will look like this:

Here is some body text.\[1\]

1. This footnote will appear at the bottom of the page.

When Jekyll converts the Markdown file to HTML, you end up with a sad lonely unclickable [1] where your footnote should go. The content of the footnote does appear at the bottom of the page, but it lacks the footnote formatting so it just looks like regular text and there’s no link to click and return to the footnote’s place in the text.

Why it’s happening

Understanding what’s happening here (and thus how to fix it) requires a slightly detailed explanation of what exactly happens when you hit that Knit button in RStudio. First, the knitr package runs all of the code in your .Rmd file and creates a .md file. Next, pandoc takes the .md file and converts it to whatever output format you want.1

R Markdown flowchart
Image courtesy of RStudio

Pandoc is the source of our problems here. The square brackets that set off a footnote are metacharacters in Markdown, since they’re used to construct links (among other things, like citations with pandoc-citeproc). When Pandoc sees them while converting from standard Markdown to GitHub-Flavored Markdown, it (logically) decides that they’re literal content and escapes them with a backslash so they survive as plain text in the GitHub-Flavored Markdown. Unfortunately for us, we want our square brackets to be treated as special characters and not turned into text. This is a known issue with Pandoc (see this issue on GitHub) so it will eventually get fixed, but in the meantime I’ve come up with a workaround.

How to fix it

Pandoc allows you to tag both code chunks and inline code with a special raw attribute which will ensure they’re passed on to the output format unmodified. To do this, just enclose any text with backticks (`) and then put {=markdown} immediately after the closing backtick. This will ensure that Pandoc doesn’t alter the ‘code’ in the backticks at all. It’s debatable whether the [^1] used to define a footnote is really code, but for our purposes treating it like code will ensure that our footnotes work in the final output:

Here is some body text.`[^1]`{=markdown}

`[^1]:`{=markdown} This footnote will appear at the bottom of the page.

There’s one more tweak we have to make to get this to work. If any of your footnotes are longer than 72 characters,2 then Pandoc will split them up and divide them into multiple lines in the output .md file. Since footnotes need to be all on the same line, this will break them and you’ll have a bunch of sentence fragments at the end of your page right above the equally fragmented footnotes. To fix this, we need to use the --wrap argument to Pandoc in our YAML header. Below is the YAML header for the .Rmd file that generates the .md file that Jekyll uses to generate the HTML your browser uses to render this page.

---
title: Footnotes in `.Rmd` files
output:
  md_document:
    variant: gfm
    preserve_yaml: TRUE
    pandoc_args: 
      - "--wrap=preserve"
knit: (function(inputFile, encoding) {
  rmarkdown::render(inputFile, encoding = encoding, output_dir = "../_posts") })
date: 2020-10-26
permalink: /posts/2020/10/jeykll-footnotes
excerpt_separator: <!--more-->
toc: true
tags:
  - jekyll
  - rmarkdown
---

By specifying --wrap=preserve, we tell Pandoc to respect the line breaks present in the .Rmd file when generating the .md file.3 Accordingly, our footnotes will be intact and functional in the final web page.

Proof

And now, to prove to you that this post really did start out as a .Rmd file, here’s some R code and a plot. Everyone’s seen mtcars a million times, and it turns out that iris was originally published in the Annals of Eugenics, so I went digging for a new built-in dataset.4 I landed on the Loblolly dataset, which records the heights of 14 different loblolly pine trees.5

library(ggplot2)
ggplot(Loblolly, aes(x = age, y = height, group = Seed)) +
  geom_line(alpha = .5) +
  labs(x = 'Age (years)', y = 'Height (feet)') +
  theme_bw()

It looks like all of the trees in the sample followed a pretty similar growth trajectory! Finally, to really really prove this page started out as a .Rmd file, here’s the sessionInfo():

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.11      knitr_1.30           magrittr_1.5        
##  [4] tidyselect_1.1.0     munsell_0.5.0        colorspace_1.4-1    
##  [7] here_0.1             R6_2.4.1             rlang_0.4.8         
## [10] dplyr_1.0.2          stringr_1.4.0        tools_4.0.2         
## [13] grid_4.0.2           gtable_0.3.0         xfun_0.18           
## [16] withr_2.3.0          htmltools_0.5.0.9001 ellipsis_0.3.1      
## [19] yaml_2.2.1           rprojroot_1.3-2      digest_0.6.25       
## [22] tibble_3.0.4         lifecycle_0.2.0      crayon_1.3.4        
## [25] purrr_0.3.4          vctrs_0.3.4          glue_1.4.2          
## [28] evaluate_0.14        rmarkdown_2.3        stringi_1.5.3       
## [31] compiler_4.0.2       pillar_1.4.6         generics_0.0.2      
## [34] scales_1.1.1         backports_1.1.10     pkgconfig_2.0.3
  1. Pandoc is incredibly powerful, but it’s also incredibly opaque and difficult to learn. You can create incredibly fancy PDF and HTML documents in R Markdown without ever having to know anything about Pandoc. 

  2. The default output width defined by the --columns argument to Pandoc. 

  3. You can also use --wrap=none, which will put every paragraph in a single gigantic line of text. 

  4. If you’re willing to install additional packages, Allison Horst’s palmerpenguins package is fantastic and fills much the same educational niche as iris. See here for even more alternatives. 

  5. Fun fact, loblolly pine seeds were carried aboard Apollo 14 and subsequently planted throughout the US.

Working with Large Spatial Data in R (2020-09-25)
https://tomhoper.github.io//posts/2020/09/spatial-sql

In my research I frequently work with large datasets. Sometimes that means datasets that cover the entire globe, and other times it means working with lots of micro-level event data. Usually, my computer is powerful enough to load and manipulate all of the data in R without issue. When my computer’s fallen short of the task at hand, my solution has often been to throw it at a high performance computing cluster. However, I finally ran into a situation where the data proved too large even for that approach.

As a result, I finally had to teach myself how to break large spatial datasets into more manageable chunks. In the process I learned a little SQL and a lot about the underlying software libraries that power the r-spatial ecosystem of R packages. In this post, I walk through the workflow I developed for this task and explain the logic behind each step.

On disk

The general idea is to work with data ‘on disk’ instead of ‘in memory’. Normally, when you load a dataset into R, your computer reads it from whatever storage media it uses (hard drive or solid state drive) into memory (RAM). Memory is considerably faster to read from and write to than storage, which is what lets you complete simple operations in R in the blink of an eye. Most consumer computers have much more storage than RAM (my 2015 MacBook Pro has 256 GB of storage and 8 GB of memory) so it’s very possible to end up with a dataset larger than your computer’s memory. In fact, it doesn’t have to be anywhere near the size of your computer’s memory to bump into this limit because every other application you have running uses up memory as well.

To deal with this issue, you can extract just the parts of a dataset you need to work with at any given time; this subset will be loaded into memory, and the rest remains on disk and invisible to R1. A couple of R packages exist for dealing with this issue, such as bigmemory for basic R data types like numerics or disk.frame for dplyr-compatible operations, but neither supports spatial data.

I’m going to use the cshapes dataset to illustrate and explain this workflow2. You can download and extract it from within R:

## download cshapes dataset
download.file('http://downloads.weidmann.ws/cshapes/Shapefiles/cshapes_0.6.zip',
              'cshapes.zip')

## extract cshapes dataset
unzip('cshapes.zip')

## check that dataset extracted correctly
list.files(path = '.', pattern = 'cshapes')
## [1] "cshapes_shapefile_documentation.txt" "cshapes.dbf"                        
## [3] "cshapes.prj"                         "cshapes.shp"                        
## [5] "cshapes.shx"                         "cshapes.zip"

Then use the sf package to load the data and check them out:

## load packages
library(tidyverse)
library(sf)

## read in cshapes
cshapes <- st_read('cshapes.shp')

The cshapes dataset is specifically designed to be easy to load and manipulate on a conventional laptop computer. To do this, it sacrifices a significant degree of detail in the polygons that represent each individual state. For many analyses, this is fine and won’t affect the results. However, sometimes you need to measure the length of borders between states, and the coastline paradox dictates that you use the highest resolution spatial data possible. In that case, the data might be too large for your computer to hold in memory. If that’s the case, then it’s time to start thinking about leaving the data on disk and only loading what you really need at any given point.

SQL

Luckily, the sf package supports SQL queries to filter the data on disk and only read in a subset of the total data. SQL is a language for interacting with relational databases, and is incredibly fast compared to loading data into R and then filtering it. SQL has many variants, referred to as dialects, and the sf package uses one called the OGR SQL dialect to interact with spatial datasets. The basic structure of a SQL call is SELECT col FROM "table" WHERE cond.

  • SELECT tells the database what columns (fields in SQL parlance) we want
  • FROM tells the database what table (databases can have many tables) to select those columns from
  • WHERE tells the database we only want rows where some condition is true

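To make that mapping concrete, here’s a small sketch of how those three pieces assemble into the string you’d hand to st_read()’s query argument (the column and layer names are the cshapes ones used below; paste0() is the base-R analogue of the str_c() used elsewhere in this post):

```r
## assemble an OGR SQL query from its three pieces
cols <- '*'              # SELECT: which fields to return (* = all)
tbl  <- '"cshapes"'      # FROM: the layer name, double-quoted
cond <- 'GWCODE = 600'   # WHERE: the filter condition

query <- paste0('SELECT ', cols, ' FROM ', tbl, ' WHERE ', cond)
query
## [1] "SELECT * FROM \"cshapes\" WHERE GWCODE = 600"
```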
If you use the tidyverse a lot, this may seem familiar to you because it’s pretty similar to dplyr syntax, except dplyr already knows which data frame you want to work with. If we want to only load one polygon at a time into R, then we need to know the field (or combination of fields) that uniquely identifies a polygon. To demonstrate, let’s load just the polygon for Morocco that begins in 1976 when it annexed the Northern part of Western Sahara. Let’s cheat by looking at the data I’ve loaded into R:

## filter to Morocco beginning in 1976
cshapes %>% filter(CNTRY_NAME == 'Morocco', GWSYEAR == 1976)
## Simple feature collection with 1 feature and 24 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -15.22687 ymin: 23.11465 xmax: -1.011809 ymax: 35.91916
## geographic CRS: WGS 84
##   CNTRY_NAME     AREA CAPNAME CAPLONG CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY COWEYEAR
## 1    Morocco 576351.8   Rabat   -6.83  34.02       220     600     1976         4       1     1979
##   COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY ISONAME ISO1NUM ISO1AL2
## 1         8       4    600    1976        4      1    1979        8      4 Morocco     504      MA
##   ISO1AL3                       geometry
## 1     MAR MULTIPOLYGON (((-4.420418 3...

The cshapes dataset records when states change territorial boundaries or capital locations, so the combination of a state name or identifier and a start or end date uniquely identifies all rows in the data. Since this polygon begins on April 1, 1976 and the Gleditsch and Ward code for Morocco is 600, plugging it all into the query argument to st_read() gets us:

## read in morocco polygon
morocco <- st_read('cshapes.shp',
                   query = 'SELECT * FROM "cshapes" WHERE GWCODE = 600 AND GWSYEAR = 1976 AND GWSMONTH = 4 AND GWSDAY = 1')

## verify country name
morocco$CNTRY_NAME
## [1] "Morocco"

Awesome! We were able to read in just one polygon from the cshapes dataset. Note that * means all columns. As I mentioned above, this is cheating, since we had to read the whole dataset into R with a standard st_read() call to learn the names and values of the variables we then filtered on.

Sneaking a peek

When this isn’t an option, we can sneak a peek at the data by loading just the first observation into R. This requires significantly less memory than loading an entire dataset, and can give us the information we need to filter the full dataset and read in one observation at a time. Most SQL implementations don’t have row numbers, so it’s hard to just grab the first row of the data for this purpose. However, the OGR SQL dialect documentation notes that it implements a special field called FID that is a feature ID, i.e., a row number. We can take advantage of FID to select the first polygon from the data using the query argument to st_read() again:

## read in first row of the data
cshapes_row <- st_read('cshapes.shp', query = 'SELECT * FROM "cshapes" WHERE FID = 1')

## inspect
cshapes_row
## Simple feature collection with 1 feature and 24 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -58.0714 ymin: 1.836245 xmax: -53.98612 ymax: 6.001809
## geographic CRS: WGS 84
##   CNTRY_NAME     AREA    CAPNAME CAPLONG   CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY
## 1   Suriname 145952.3 Paramaribo   -55.2 5.833333         1     115     1975        11      25
##   COWEYEAR COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY  ISONAME
## 1     2016         6      30    115    1975       11     25    2016        6     30 Suriname
##   ISO1NUM ISO1AL2 ISO1AL3                 _ogr_geometry_
## 1     740      SR     SUR POLYGON ((-55.12796 5.82217...

Even if we knew that the data had an ID column and start and end dates, we wouldn’t know the precise formatting (capitalization, underscores or dashes) of column names, or whether start and end dates are stored as one column or sets of three like they are here.

Making a list

We still need more information if we want to iterate through the polygons in the data and load them one at a time. We know what columns uniquely identify the rows, but we don’t know all the values they take on. Without that, we’re stuck. What (usually) makes spatial data big is not the tabular data themselves, but the spatial features they’re attached to. This is particularly the case with polygons, which can be incredibly large in size for complex features. So, the goal here is to get the data we care about (ID column and start date) and ditch everything else, loading only the bare minimum into memory.

To do this, we’ll use the ogr2ogr() function in the gdalUtils package3. ogr2ogr() converts between different spatial data formats. It also offers two features that we’re going to use to cut down the data to the bare minimum. The select argument is a SQL selection, so we’re going to create a comma separated list of our key columns. The nlt argument specifies what type of geometry to create in the output. Conveniently it accepts NONE as a value, which will yield a plain table of data with none of the memory-hogging geometries:

## load package
library(gdalUtils)

## convert to nonspatial geometry
ogr2ogr(src_datasource_name = 'cshapes.shp', dst_datasource_name = 'cshapes_no_geom',
        select = 'GWCODE,GWSYEAR,GWSMONTH,GWSDAY', nlt = 'NONE')

This will create a shapefile in the new directory cshapes_no_geom called cshapes. The usual .shp and .shx components of a shapefile are missing, but the .dbf part is there, and that’s the one we care about. Load it up with st_read() and we’ll have what we need:

## load non-geometry table
cshapes_id <- st_read('cshapes_no_geom/cshapes.dbf')

## inspect
head(cshapes_id)
##   GWCODE GWSYEAR GWSMONTH GWSDAY
## 1    110    1966        5     26
## 2    115    1975       11     25
## 3     52    1962        8     31
## 4    101    1946        1      1
## 5    990    1962        1      1
## 6    972    1970        6      4

Now you can load polygons one at a time and perform whatever geometric operations you need to. To illustrate, I’ll load the first four polygons in the dataset, calculate their area, and then plot them.

## set up four panel plot
par(mfrow = c(1, 4), mar = c(6.1, 4.1, 4.1, 4.1))

## read in each polygon and plot 
for (i in 1:4) {
  
  ## build SQL query
  query_str <- str_c('SELECT * FROM "cshapes" WHERE GWCODE = ', cshapes_id$GWCODE[i],
                     ' AND GWSYEAR = ', cshapes_id$GWSYEAR[i],
                     ' AND GWSMONTH = ', cshapes_id$GWSMONTH[i],
                     ' AND GWSDAY = ', cshapes_id$GWSDAY[i])
  
  ## read in data
  pol <- st_read('cshapes.shp', query = query_str)
  
  ## plot data
  pol %>%
    st_geometry() %>% 
    plot(main = pol$CNTRY_NAME,
         sub = str_c(round(units::set_units(st_area(pol), 'km^2'), digits = 0),
                      ' km^2'))
  
}

Won’t you be my neighbor?

Sometimes (oftentimes in spatial analysis) we need not just a polygon, but also its neighbors. That means loading just one polygon is insufficient. If your data are already in R, this is easy with the st_filter() function, but it’s much trickier if you’re trying to filter data before loading them into R4. Luckily, st_read() has you covered! The wkt_filter argument accepts a well-known text string that can be used to filter the data before loading them into R5. Well-known text is a standard string representation of geometry, and is actually how the sf package prints geometry in R:

st_point(c(1, 2))
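To see the WKT form directly (a small aside; st_as_text() also works on bare geometry objects like this point, as we’ll use in a moment):

```r
library(sf)

## an sfg point geometry, and its well-known text representation
pt <- st_point(c(1, 2))
st_as_text(pt)
## [1] "POINT (1 2)"
```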

We want to use the wkt_filter argument to only load polygons that intersect with our Morocco polygon into R. To do that, we need to convert our polygon to a well-known text string with the st_as_text() function, then pass it to st_read(). However, st_as_text() only accepts sfc and sfg objects, not sf objects:

## create well known text object to filter cshapes on disk
morocco_wkt <- st_as_text(morocco)
## Error in UseMethod("st_as_text"): no applicable method for 'st_as_text' applied to an object of class "c('sf', 'data.frame')"

To get around this, we need to drop the data on morocco and extract just the geometry of the polygon with st_geometry():

## create well known text object to filter cshapes on disk
morocco_wkt <- morocco %>% 
  st_geometry() %>% # convert to sfc
  st_as_text() # convert to well known text

## plot morocco and neighbors
st_read('cshapes.shp', wkt_filter = morocco_wkt) %>%
  st_geometry() %>%
  plot(main = morocco$CNTRY_NAME)

## add morocco polygon on top
morocco %>% 
  st_geometry() %>%
  plot(add = T, col = rgb(0, 1, 0, .5))

Notice that there are multiple polygon boundaries within the green area of our green Morocco polygon. That’s because there are 4 Morocco polygons in the data starting in 1956, 1958, 1976, and 1979. Be sure to filter the dataset, either as part of the SQL query or in a dplyr::filter() so that you only get polygons that existed contemporaneously with your polygon of interest.

Wrapping up

So far, we’ve covered:

  • How to extract the first polygon for a spatial dataset and learn the names of identifier columns
  • How to strip the geometry from a spatial dataset and extract just a table of these columns
  • How to use these columns to iterate through the polygons in the dataset and import them one at a time, or along with their neighbors

You can technically skip the first two steps and just move the .shp and .shx files out of the directory before loading the .dbf file with st_read(), but that kind of feels like cheating to me6, and it only works with shapefiles. If you have another type of spatial dataset, read on.
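For the record, that shortcut is only a few lines (a sketch, assuming the cshapes files downloaded earlier sit in your working directory; renaming rather than deleting means nothing is lost if a step fails):

```r
## temporarily hide the geometry files so st_read() only finds the .dbf table
file.rename('cshapes.shp', 'cshapes.shp.bak')
file.rename('cshapes.shx', 'cshapes.shx.bak')

## read just the attribute table
cshapes_tab <- sf::st_read('cshapes.dbf')

## put the geometry files back
file.rename('cshapes.shp.bak', 'cshapes.shp')
file.rename('cshapes.shx.bak', 'cshapes.shx')
```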

This time for real

In my research, I often need to work with spatial data that’s measured at or aggregated up to different administrative divisions (ADMs). GADM helpfully provides a global dataset of ADMs. Although you can download ADMs for specific countries, I work with data in enough different countries that I finally decided to just download the entire dataset. While the cshapes example above just illustrated how to implement a pipeline for working with spatial data on disk, you may actually need to use one with these data depending on your machine’s hardware.

This master dataset comes as a GeoPackage. Most importantly for us, that means we can’t just delete a few component files to load the non-spatial table from the dataset; we have to convert it from a spatial dataset to a non-spatial one with ogr2ogr(). The GeoPackage contains ADMs from level 0 (countries) all the way down to level 5. Each level is stored as a separate layer in the .gpkg, and we can get a list of available layers with the st_layers() function:

## get layers
st_layers('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg')
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1     level0 Multi Polygon      256      2
## 2     level1 Multi Polygon     3610     10
## 3     level2 Multi Polygon    45962     13
## 4     level3 Multi Polygon   147427     16
## 5     level4 Multi Polygon   138053     14
## 6     level5 Multi Polygon    51427     15

We want to work with the third-order administrative divisions (cities, towns, and other municipalities in the US context), so we need the level3 layer. Where we just used the name of the dataset in our SQL call before, this time we’ll use level3. Now we just follow the same workflow as with the cshapes dataset above:

## get first observation
level3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = 'SELECT * FROM "level3" WHERE FID = 1', layer = 'level3')

## inspect
level3
## Simple feature collection with 1 feature and 16 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13.08792 ymin: -8.010127 xmax: 13.59943 ymax: -7.708598
## geographic CRS: WGS 84
##   GID_0 NAME_0   GID_1 NAME_1 NL_NAME_1     GID_2 NAME_2 NL_NAME_2       GID_3 NAME_3 VARNAME_3
## 1   AGO Angola AGO.1_1  Bengo      <NA> AGO.1.1_1 Ambriz      <NA> AGO.1.1.1_1 Ambriz      <NA>
##   NL_NAME_3  TYPE_3 ENGTYPE_3 CC_3 HASC_3                           geom
## 1      <NA> Commune   Commune <NA>   <NA> MULTIPOLYGON (((13.12764 -7...

This time we have a single column that uniquely identifies observations, GID_3, so we only have to extract one column from the dataset. We use the ogr2ogr() function as before, but we have to specify the layer = 'level3' argument since the GeoPackage has more than one layer and we want to work with a specific one. Since GID_3 is our identifier column, that’s what we select from the dataset:

## convert to nonspatial geometry
ogr2ogr(src_datasource_name = '/Users/Rob/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
        dst_datasource_name = 'gadm34_levels_no_geom',
        layer = 'level3',
        select = 'GID_3',
        nlt = 'NONE')

## load non-geometry table
gadm_ids <- st_read('gadm34_levels_no_geom/level3.dbf')

## inspect
head(gadm_ids)
##         GID_3
## 1 AGO.1.1.1_1
## 2 AGO.1.1.2_1
## 3 AGO.1.1.3_1
## 4 AGO.1.2.1_1
## 5 AGO.1.2.2_1
## 6 AGO.1.2.3_1

And we can again read the polygons into R one at a time and perform whatever spatial operations we need. Since our identifying column is a string this time, we need to enclose it in quotes in our SQL call. SQL is very picky about quotation mark types, so while we needed to surround our layer name with double quotes, we need to surround our identifier variable with single quotes. I’m already using single quotes to define the character string for the SQL call, so I need to escape the single quotes around the identifier. You can do this with a single backslash (\). Thus, you can include single quotes in a single-quoted string like this: 'this is a string \'this is another part of a string\''. Other than that wrinkle, things are pretty much the same as with cshapes:

## for reproducibility
set.seed(27599)

## set up four panel plot
par(mfrow = c(1, 4), mar = c(2.1, 4.1, 4.1, 4.1))

## read in each polygon and plot
for (i in sample(1:nrow(gadm_ids), 4, replace = F)) { # mix it up
  
  ## build SQL query
  query_str <- str_c('SELECT * FROM "level3" WHERE GID_3 = \'',
                     gadm_ids$GID_3[i], '\'')
  
  ## read in polygon for ADM3 i
  adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = query_str, layer = 'level3')
  
  ## plot polygon and label with full name
  print(plot(adm3$geom,
             main = adm3 %>%
               select(starts_with('NAME_')) %>% # get all name variables
               st_drop_geometry() %>% # drop geometry
               rev() %>% # reverse order of names to 3, 2, 1, 0
               str_c(collapse = ', '), # collapse w/ commas
             cex.main = .6))
  
}
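As a quick sanity check of the quote escaping described above (the GID value is just the first one from the ID table we extracted earlier):

```r
## single quotes escaped inside a single-quoted R string...
q1 <- 'SELECT * FROM "level3" WHERE GID_3 = \'AGO.1.1.1_1\''

## ...yield exactly the same string as single quotes typed directly
## inside a double-quoted R string
q2 <- "SELECT * FROM \"level3\" WHERE GID_3 = 'AGO.1.1.1_1'"

identical(q1, q2)
## [1] TRUE
```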

Spatially filtering the GADM dataset is just as easy as with cshapes. To illustrate, I’m going to pull out a random polygon and use it to filter the data. However, these are third-order administrative divisions, and so it’s possible that even capturing all adjacent polygons won’t cover a very large area. To deal with this concern, we can buffer the polygon with the st_buffer() function before we convert it to well-known text:

## import single polygon
adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = str_c('SELECT * FROM "level3" WHERE FID = 63130'))

## create well known text object to filter GADM on disk
adm3_wkt <- adm3 %>% 
  st_geometry() %>% # convert to sfc
  st_buffer(.025) %>% # buffer .025 decimal degrees
  st_as_text() # convert to well known text

## plot Dakkoun and neighbors w/in .025 decimal degrees
st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
        layer = 'level3', wkt_filter = adm3_wkt) %>%
  st_geometry() %>%
  plot(main = adm3 %>%
               select(starts_with('NAME_')) %>%
               st_drop_geometry() %>%
               rev() %>%
               str_c(collapse = ', '))

## plot Dakkoun and highlight
adm3 %>%
  st_geometry() %>%
  plot(add = T, col = 'green')

## plot buffered polygon used to filter GADM on disk
adm3 %>% 
  st_geometry() %>% 
  st_buffer(.025) %>% 
  st_cast('LINESTRING') %>%
  plot(add = T, col = 'blue')

The green polygon above is Dakkoun, the 63,130th polygon in the dataset. The blue line is the extent of the .025 decimal degree buffer applied to it before filtering the dataset. This workflow can speed things up when working with these data, considering there are 147,427 third-order administrative division polygons in the dataset.

Making data manageable

The query and wkt_filter arguments to st_read() can help you work with large spatial datasets that are either too big to load into memory, or too slow to work with once loaded. While this is less of a concern with low resolution datasets created by social scientists, it can be incredibly useful if you ever have to work with super high resolution data created by remote sensing technologies or actual cartographers and geographers.

  1. This is the approach that the raster package uses. R only stores information on the extent and resolution of a raster in memory; the actual values in each cell of a raster are only loaded into memory when accessed by R using a function like extract()

  2. Although I’m using cshapes as an example throughout this post so you can easily follow along and run the code yourself, it’s a small enough dataset that no modern machine should have trouble loading it. At the end of this post I also apply the workflow to a much larger dataset, where it actually pays off

  3. This function is just a wrapper around the GDAL utility ogr2ogr. You could also do this with ogr2ogr directly in the shell, but it’s much uglier: ogr2ogr -f "ESRI SHAPEFILE" cshapes_no_geom.shp cshapes.shp cshapes -nlt NONE -select GWCODE,GWSYEAR,GWSMONTH,GWSDAY

  4. st_filter() accepts various spatial predicates beyond the default of st_intersects(). This filtering on disk gives much less fine-grained control. If you need more precision, you can load more nearby polygons by buffering the polygon before filtering the input like here and then using st_filter() with your spatial predicate of choice. 

  5. I spent over an hour trying to figure out how to tell the query parameter to use PostGIS or SpatiaLite dialects instead of the OGR SQL dialect so I could execute a spatial filter before finding the wkt_filter argument to st_read(). Always read the documentation carefully. 

  6. Having to move or delete files also risks losing them; the ogr2ogr() approach is safer in this regard. 

Jekyll and HTML Widgets (2020-09-19)
https://tomhoper.github.io//posts/2020/09/jekyll-html

I’m currently compiling a list of university-affiliated programs designed to help prepare students for graduate study in political science and assist them in the process of applying to graduate school (a labyrinthine and opaque process in many regards). Since travel costs can be a deciding factor for some students when deciding whether to apply to these programs, I thought it would be nice to also put them on a map.

While just plotting them on a map is easy, since it will be on a web page, I figured why not also embed links to each program in the map as well. In theory this is easy thanks to R packages like leaflet, which leverages the (unsurprisingly named) leaflet JavaScript library for interactive webmaps. However, because I use Jekyll instead of Hugo for my site, I can’t just use the blogdown R package and have everything magically work.

Steven Miller’s tutorial on integrating R Markdown and Jekyll is the starting point for my own use of R Markdown and Jekyll, so check that out first for a quick primer on how to use R Markdown to render .Rmd files into the .md files that Jekyll uses to render your website. This approach works fantastically well for static images, and requires just a little tweaking to make interactive widgets like leaflet maps work.

Leaflet

We’ll use three packages to create our map. The tidyverse is pretty well-documented at this point, but I use it to write efficient and readable code. tidygeocoder is a geocoder that can use a variety of geocoding services and works well with data frames and tibbles. Finally, leaflet is what we’ll use to create our actual map widget.

library(tidyverse)
library(tidygeocoder)
library(leaflet)

First, we need to load our data. This is a CSV file of program information that I’ve compiled myself.

## read in data
predoc <- read_csv('predoc.csv')

## inspect the data
predoc
## # A tibble: 9 x 4
##   Institution          Name                     Location      URL                                   
##   <chr>                <chr>                    <chr>         <chr>                                 
## 1 University of South… POIR Predoctoral Summer… Los Angeles,… https://dornsife.usc.edu/poir/predoct…
## 2 Duke University      Ralph Bunche Summer Ins… Durham, NC, … https://www.apsanet.org/rbsi          
## 3 UC San Diego         START                    La Jolla, CA… https://grad.ucsd.edu/diversity/progr…
## 4 MIT                  MSRP                     Cambridge, M… https://oge.mit.edu/graddiversity/msr…
## 5 UC Irvine            SURF                     Irvine, CA, … https://grad.uci.edu/about-us/diversi…
## 6 University of Washi… NSF REU: Spatial Models… Tacoma, WA, … https://www.tacoma.uw.edu/smed/nsf-re…
## 7 University of North… NSF REU: Civil Conflict… Denton, TX, … https://untconflictmgmtreu.wordpress.…
## 8 Princeton University Emerging Scholars in Po… Princeton, N… https://politics.princeton.edu/gradua…
## 9 Harvard University   PS-Prep                  Cambridge, M… https://projects.iq.harvard.edu/ps-pr…

First, we need to get latitude and longitude coordinates from our place names to plot them on a map. We’ll use the geocode() function, where the first argument is a data frame containing a column with the location information we want to use. The second argument is address, which tells the geocoder to use the information stored in the Location column of our data frame, and then method = 'osm' dispatches it to the Open Street Map geocoder, Nominatim.

Next, we’ll use mutate() to create a new variable to hold the popup text a user will see when they click on a point. I want to provide the university name, the program’s name, and then a link to the program’s information page. I use the str_c() function to combine the Institution and Name columns, and then I use another call to str_c() to format the URL. This second call looks like str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'), where URL is the name of the URL field. It combines the standard start of an HTML anchor tag (<a href=") with the URL itself, adds the link text of “Program Info”, and then closes the tag. The one unusual element is target="_PARENT" in the anchor tag. This is necessary to make any links a user clicks open normally, instead of within the frame used to embed it into the page (more on that later).

Once we’ve prepped our popup text, we just pass the data frame to leaflet(), add a background map (I’ve used a styled map, but you can also get the default map with addTiles()), and then the markers themselves. The one tricky part of addMarkers() is that it expects its arguments as one-sided formulas, not just variable names like tidyverse functions. geocode() has created lat and long columns, so pass those through as well as our label column, and we’re good to go.

Map it

Putting all the above code together in a pipeline looks like this:

## prep and plot
predoc %>% 
  geocode(address = Location, method = 'osm') %>% ## geocode locations
  mutate(lab = str_c(Institution, Name,
                     str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
                     sep = '<br>')) %>% # paste fields into popup text
  leaflet() %>% # create leaflet map widget
  addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
  addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) # add markers with popup text

Unfortunately this code produces an error that stops R Markdown dead in its tracks; like, the-error = T-knitr-chunk-option-won’t-even-save-you dead in its tracks. What gives? R Markdown is supposed to be able to render interactive widgets no problem. The issue is that R Markdown can render those widgets for HTML output, but since we’re creating a GitHub Flavored Markdown document that Jekyll then turns into HTML, R Markdown chokes. It can’t embed an HTML widget into a plain text markdown document. Luckily there is a way around this, but it involves an extra step and dealing with some file paths.

R Markdown, HTML widgets, and Jekyll

To make things work, we have to manually save the HTML from our widget, and then embed it into our resulting markdown document. Then, when Jekyll renders the markdown to HTML, it will be visible in the final HTML files that comprise your website. This involves telling R where to save the HTML, then referencing it using raw HTML code in our markdown document. We’re going to do this with the htmlwidgets R package.

## load htmlwidgets to save map widget
library(htmlwidgets)

## prep and plot
predoc %>% 
  geocode(address = Location, method = 'osm') %>% ## geocode locations
  mutate(lab = str_c(Institution, Name,
                     str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
                     sep = '<br>')) %>% # paste fields into popup text
  leaflet() %>% # create leaflet map widget
  addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
  addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) %>% # add markers with popup text
  saveWidget(here::here('/files/html/posts', 'predoc_map.html')) # save map widget

The code is identical to that above, with the addition of the final line, which saves the map widget as an HTML file called predoc_map.html in /files/html/posts using the saveWidget() function. You’ll notice I use the here() function from the here R package to supply the file argument to saveWidget(). here is great because it very intelligently finds the top level of whatever project you’re working on and then constructs file paths from there. It has a number of ways to determine where a project ‘starts’, but for us it works because our website is a git repo and contains a .git directory.

Frame it

All that’s left to do is embed the map widget in the page using an iframe. iframes allow you to embed an HTML page inside of another HTML page. Since saveWidget() saved our map widget as an HTML file that’s nothing but our map, we can then embed it into our page using an iframe. Jekyll allows raw HTML in markdown files which it ignores and passes through untouched into the final HTML files it produces. Here’s the code I used for the map in this post.

<iframe src="/files/html/posts/predoc_map.html" height="600px" width="100%" style="border:none;"></iframe>

The main argument is src="...", which tells the iframe what content it will contain. Notice that this is the same file path I just specified above in saveWidget(). As long as that directory exists in your website repo, everything will work smoothly. There are three important arguments in addition to the content of the iframe itself:

  • height is how tall you want the iframe to be; here I’ve specified it in pixels, but you can also use inches, centimeters, or percentages as you’ll see below
  • width is how wide you want the iframe to be; I’ve used a percentage here because the AcademicPages template is responsive and will resize itself on smaller screens
  • style is where I tell the iframe not to include a border so it blends seamlessly with the rest of the page

The finished product

Here’s what the final map looks like. If you didn’t know the extra effort it took, it would blend seamlessly into the page. Theoretically this should work for any HTML widget, like those produced by the plotly R package. If you haven’t checked plotly out, you really should. It can turn ggplot2 plots into interactive widgets with a single line of code!

]]>
Tom Hope
Extracting UN Peacekeeping Data from PDF Files2020-08-28T00:00:00-05:002020-08-28T00:00:00-05:00https://tomhoper.github.io//posts/2020/08/pdf-dataSome coauthors and I recently published a piece in the Monkey Cage on the recent military coup in Mali and the overthrow of president Ibrahim Boubacar Keïta. We examine what the ouster of Keïta means for the future of MINUSMA, the United Nations peacekeeping mission in Mali. One of my contributions that didn’t make the final cut was this plot of casualties to date among UN peacekeepers in the so-called big 5 peacekeeping missions.

These missions are distinguished from other current UN peacekeeping missions by high levels of violence (both overall and against UN personnel) and expansive mandates that go beyond ‘traditional’ goals of stabilizing post-conflict peace. The conflict management aims of these operations necessarily expose peacekeepers to high levels of risk. If we want to try to understand what the future of MINUSMA might look like dealing with a new government in Mali, it’s important to place MINUSMA in context among the remainder of the big 5 missions. To help do so, I turned to the source for data on peacekeeping missions, the UN.

Nonstandard formats

When we wrote the piece, the Peacekeeping open data portal page on fatalities only had a link to this PDF report instead of the usual CSV file (the CSV file is back, so you don’t technically have to go through all of these steps to recreate this figure). Here’s what the first page of that PDF looks like:

Since we were working on a short deadline, I needed to get these data out of that PDF. The most direct option is to just copy and paste the data into an Excel sheet. However, these data run to 148 pages, so all that copying and pasting would be tiring and risks introducing errors when your attention eventually slips and you forget to include page 127.

Getting the data

Enter the tabulizer R package. This package is just a (much) friendlier wrapper to the Tabula Java library, which is designed to extract tables from PDF documents. To do so, just plug in the file name of the local PDF you want or URL for a remote one:

library(tabulizer)

## data PDF URL
dat <- 'https://peacekeeping.un.org/sites/default/files/fatalities_june_2020.pdf'

## get tables from PDF
pko_fatalities <- extract_tables(dat, method = 'stream')

The extract_tables() function has two different methods for extracting data: lattice for more structured, spreadsheet like PDFs and stream for messier files. While the PDF looks pretty structured to me, method = 'lattice' returned a series of one variable per line gibberish, so I specify method = 'stream' to speed up the process by not forcing tabulizer to determine which algorithm to use on each page.

Note that you may end up getting several warnings, such as the ones I received:

## WARNING: An illegal reflective access operation has occurred
## WARNING: Illegal reflective access by RJavaTools to method java.util.ArrayList$Itr.hasNext()
## WARNING: Please consider reporting this to the maintainers of RJavaTools
## WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
## WARNING: All illegal access operations will be denied in a future release

Everything still worked out fine for me, but you may run into problems in the future based on the warning about future releases.

Cleaning the data

We end up with a list that is 148 elements long, one per page. Each element is a matrix, reflecting the structured nature of the data. Normally, we could just combine this list of matrices into a single object with do.call(rbind, pko_fatalities):

do.call(rbind, pko_fatalities)
## Error in (function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2)

But if we do this, we get an error! Let’s take a look and see what’s going wrong. We can use lapply() in combination with dim() to do so:

head(lapply(pko_fatalities, dim))
## [[1]]
## [1] 54  9
## 
## [[2]]
## [1] 54  7
## 
## [[3]]
## [1] 54  7
## 
## [[4]]
## [1] 54  7
## 
## [[5]]
## [1] 54  7
## 
## [[6]]
## [1] 54  7

The first matrix has an extra two columns, causing our attempt to rbind() them all together to fail.

head(pko_fatalities[[1]])
##      [,1]                   [,2]                            [,3] [,4]              
## [1,] "Casualty_ID"          "Incident_Date Mission_Acronym" ""   "Type_of_Casualty"
## [2,] "BINUH‐2019‐12‐00001"  "30/11/2019 BINUH"              ""   "Fatality"        
## [3,] "BONUCA‐2004‐06‐04251" "01/06/2004 BONUCA"             ""   "Fatality"        
## [4,] "IPTF‐1997‐01‐02515"   "31/01/1997 IPTF"               ""   "Fatality"        
## [5,] "IPTF‐1997‐09‐02720"   "17/09/1997 IPTF"               ""   "Fatality"        
## [6,] "IPTF‐1997‐09‐02721"   "17/09/1997 IPTF"               ""   "Fatality"        
##      [,5]                       [,6]                [,7] [,8]                     
## [1,] "Casualty_Nationality"     "M49_Code ISOCode3" ""   "Casualty_Personnel_Type"
## [2,] "Haiti"                    "332 HTI"           ""   "Other"                  
## [3,] "Benin"                    "204 BEN"           ""   "Military"               
## [4,] "Germany"                  "276 DEU"           ""   "Police"                 
## [5,] "United States of America" "840 USA"           ""   "Police"                 
## [6,] "United States of America" "840 USA"           ""   "Police"                 
##      [,9]              
## [1,] "Type_Of_Incident"
## [2,] "Malicious Act"   
## [3,] "Illness"         
## [4,] "Accident"        
## [5,] "Accident"        
## [6,] "Accident"
head(pko_fatalities[[2]])
##      [,1]                    [,2]                 [,3]       [,4]       [,5]     
## [1,] "MINUSCA‐2015‐10‐09459" "06/10/2015 MINUSCA" "Fatality" "Burundi"  "108 BDI"
## [2,] "MINUSCA‐2015‐10‐09468" "13/10/2015 MINUSCA" "Fatality" "Burundi"  "108 BDI"
## [3,] "MINUSCA‐2015‐11‐09509" "10/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [4,] "MINUSCA‐2015‐11‐09510" "22/11/2015 MINUSCA" "Fatality" "Rwanda"   "646 RWA"
## [5,] "MINUSCA‐2015‐11‐09511" "30/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [6,] "MINUSCA‐2015‐12‐09542" "06/12/2015 MINUSCA" "Fatality" "Congo"    "178 COG"
##      [,6]                     [,7]              
## [1,] "Military"               "Malicious Act"   
## [2,] "Military"               "Accident"        
## [3,] "Military"               "Malicious Act"   
## [4,] "Military"               "To Be Determined"
## [5,] "International Civilian" "Illness"         
## [6,] "Military"               "Illness"

We can see that the first page has two blank columns, accounting for the 9 columns compared to the 7 columns for all other pages. Closer inspection of the header on the first page and the columns on both the first and second pages reveals that there actually should be 9 columns in the data.

The Incident_Date and Mission_Acronym columns are combined into one, as are the M49_Code and ISOCode3 columns. We’ll fix the data in those two columns in a bit, but first we have to get rid of the empty columns in the first page before we can merge the data from all the pages. We could just tell R to drop those columns manually with pko_fatalities[[1]][, -c(3, 7)], but this isn’t a very scalable solution if we have lots of columns with this issue.

To do this programmatically, we need a way to identify empty columns. If this was a list of data frames, we could use colnames() to identify the empty columns. However, extract_tables() has given us a matrix with the column names in the first row. Instead, we’ll just get the first row of the matrix. Since we’re accessing a matrix that is the first element in a list, we want to use pko_fatalities[[1]][1,] to index pko_fatalities. Next, we’ll use the grepl() function to identify the empty columns. We want to search for the regular expression ^$, which means the start of a line immediately followed by the end of a line, i.e., an empty string. Finally, we negate it with a ! to return only non-empty column names:

## drop two false empty columns on first page
pko_fatalities[[1]] <- pko_fatalities[[1]][, !grepl('^$', pko_fatalities[[1]][1,])]
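As a quick sanity check, here’s the same ^$ pattern applied to a toy character vector (the values are invented to mimic the first page’s header row):

```r
## toy header row with two empty entries, like page one of the PDF
header <- c('Casualty_ID', '', 'Type_of_Casualty', '')

## ^$ only matches completely empty strings
grepl('^$', header) # FALSE TRUE FALSE TRUE

## negating the match keeps just the non-empty entries
header[!grepl('^$', header)] # "Casualty_ID" "Type_of_Casualty"
```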

With that out of the way, we can now combine all the pages into one giant matrix. After that, I convert the matrix into a data frame, set the first row as the column names, and then drop the first row.

## rbind pages
pko_fatalities <- do.call(rbind, pko_fatalities)

## set first row as column names and drop
pko_fatalities <- data.frame(pko_fatalities)
colnames(pko_fatalities) <- pko_fatalities[1, ]
pko_fatalities <- pko_fatalities[-1, ]

Now that we’re working with a data frame, we can finally tackle those two sets of mashed up columns. To do this, we’ll use the separate() function from the tidyr package, which I load via the tidyverse package. separate() is magically straightforward. It takes a column name (which I have to enclose in backticks thanks to the space), a character vector of names for the resulting columns, and a regular expression to split on. I use \\s, which matches any whitespace character. I also filter out any duplicate header rows that may have crept in (there’s one on page 74, at the very least).

library(tidyverse)
library(lubridate) # for dmy() below; tidyverse doesn't attach lubridate

## separate columns tabulizer incorrectly merged
pko_fatalities <- pko_fatalities %>% 
  filter(Casualty_ID != 'Casualty_ID') %>% # drop any repeated header(s)
  separate(`Incident_Date Mission_Acronym`, c('Incident_Date', 'Mission_Acronym'),
           sep = '\\s', convert = T, extra = 'merge')  %>% 
  separate(`M49_Code ISOCode3`, c('M49_Code', 'ISOCode3'),
           sep = '\\s', convert = T) %>% 
  mutate(Incident_Date = dmy(Incident_Date)) # convert date to date object

You’ll notice I also supply two other arguments here: convert and extra. The former will automatically convert the data type of resulting columns, which is useful because it converts M49_Code into an int object (Incident_Date remains a character string, which is why I convert it with dmy() afterward). The latter tells separate() what to do if it detects more matches of the splitting expression than you’ve supplied column names. There are 18 observations where the mission acronym is listed as “UN Secretariat”. That means that separate() will detect a second whitespace character in these 18 rows. If you don’t explicitly set extra, you’ll get a warning telling you what happened with those extra characters. By setting extra = 'merge', you’re telling separate() to effectively ignore any space after the first one and keep everything to the right of the first space as part of the output. Thus, our "UN Secretariat" observations are preserved instead of being chopped off to just "UN".
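To see extra = 'merge' in action, here’s a minimal sketch on a two-row toy data frame (the second row is invented to mimic the “UN Secretariat” observations, not taken from the real data):

```r
library(tidyr)

## one value splits into two pieces, the other into three
toy <- data.frame(combined = c('30/11/2019 BINUH',
                               '01/06/2004 UN Secretariat'))

## with extra = 'merge', everything after the first space stays in
## the last column, so 'UN Secretariat' survives intact
separate(toy, combined, c('Incident_Date', 'Mission_Acronym'),
         sep = '\\s', extra = 'merge')
```

Without extra = 'merge', the second row would be chopped to "UN" and separate() would warn about the discarded piece.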

Creating the plot

Now that we’ve got the data imported and cleaned up, we can recreate the plot from the Monkey Cage piece. However, first we need to bring in some outside information and calculate some simple statistics.

Preparing the data

Before we can plot the data, we need to bring in some mission-level information, namely what country each mission operates in. We can get this easily from the Peacekeeping open data portal master dataset. Once I load the data into R I select just the mission acronym and country of operation. I then edit the strings for CAR and DRC to add newlines between words with \n to make them fit better into the plot.

## get active PKO data and clean up country names
read_csv('https://data.humdata.org/dataset/819dce10-ac8a-4960-8756-856a9f72d820/resource/7f738eb4-6f77-4b5c-905a-ed6d45cc5515/download/coredata_activepkomissions.csv') %>% 
  select(Mission_Acronym, Country = ACLED_Country) %>% 
  mutate(Country = case_when(Country == 'Central African Republic' ~
                               'Central\nAfrican\nRepublic',
                             Country == 'Democratic Republic of Congo' ~
                               'Democratic\nRepublic\nof the Congo',
                             TRUE ~ Country)) -> pko_data

We’re looking to see how dangerous peacekeeping missions are for peacekeepers, so we want to only look at fatalities that are the result of deliberate acts. The data contain 6 different types of incident, so let’s check them out:

table(pko_fatalities$Type_Of_Incident)
## 
##         Accident          Illness    Malicious Act   Self‐Inflicted To Be Determined 
##             2712             2582             2096              268              244 
##          Unknown 
##               50

Malicious acts are the third highest type of incident, so it’s important for us to subset the data to ensure we’re counting the types of attacks we’re interested in. Since we’re looking at fatalities in the big 5 missions, we also need to subset the data to just these missions. We’re going to use the summarize() function in conjunction with group_by() to calculate several summary statistics for each mission. We’ll also use the time_length() and interval() functions from the lubridate package, so load that as well.

library(lubridate)

## list of PKOs to include
pkos <- c('MINUSMA', 'UNAMID', 'MINUSCA', 'MONUSCO', 'UNMISS')

## aggregate mission level data
pko_fatalities %>% 
  filter(Type_Of_Incident == 'Malicious Act',
         Mission_Acronym %in% pkos) %>% 
  group_by(Mission_Acronym) %>% 
  summarize(casualties = n(),
            casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
            casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
            casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
            casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
            casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
            casualties_loc = sum(Casualty_Personnel_Type == 'Local'),
            duration = time_length(interval(min(Incident_Date),
                                            max(Incident_Date)),
                                   unit = 'year')) %>% 
  mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
                             TRUE                         ~ '')) %>% 
  left_join(pko_data, by = 'Mission_Acronym') %>% 
  mutate(Country = factor(Country,
                          levels = Country[order(casualties,
                                                 decreasing = T)])) -> data_agg

Breaking down the summarize() call and the rest of this pipeline:

  • casualties = n() counts the total number of fatalities in each mission because each row is one fatality
  • casualties_mil = sum(Casualty_Personnel_Type == 'Military') counts how many of those casualties were UN troops
  • the other casualties_... lines do the same for different categories of UN personnel
  • the code to the right of duration calculates how long each mission has lasted by:
    • finding the first and last date of a fatality in each mission
    • creating an interval object from those dates
    • calculating the length of that period in years
  • create an indicator variable noting whether or not an observation belongs to MINUSMA

Finally, we merge on the country information contained in pko_data and convert Country to a factor with levels that are decreasing in fatalities. This last step is necessary to have a nice ordered plot.
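The duration calculation above can be checked on a toy pair of dates (the dates here are invented for illustration, not taken from the fatality data):

```r
library(lubridate)

## exactly seven calendar years separate these two dates
time_length(interval(dmy('01/01/2013'), dmy('01/01/2020')), unit = 'year')
## [1] 7
```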

Plot it

With that taken care of, we can create the plot using ggplot. I’m using the label argument to place mission acronyms inside the bars with geom_text(), and a second call to geom_text() with the casualties variable to place fatality numbers above the bars. The nudge_y argument in each call to geom_text() ensures that they’re vertically spaced out, making them readable instead of overlapping.

ggplot(data_agg, aes(x = Country, y = casualties, label = Mission_Acronym)) +
  geom_bar(stat = 'identity', fill = '#5b92e5') +
  geom_text(color = 'white', nudge_y = -10) +
  geom_text(aes(x = Country, y = casualties, label = casualties),
            data = data_agg, inherit.aes = F,
            nudge_y = 10) +
  labs(x = '', y = 'UN Fatalities',
       title = 'UN fatalities in big 5 peacekeeping operations') +
  theme_bw()

Plot it (again)

We can also create some other plots to visualize how dangerous each mission is to peacekeeping personnel. While total fatalities are an important piece of information, the rate of fatalities can tell us more about the intensity of the danger in a given conflict.

data_agg %>% 
  ggplot(aes(x = duration, y = casualties, label = MINUSMA)) +
  geom_point(size = 2.5, color = '#5b92e5') +
  geom_text(nudge_x = 1) +
  expand_limits(x = 0, y = 0) +
  labs(x = 'Mission duration (years)', y = 'Fatalities (total)',
       title = 'UN fatalities in big 5 peacekeeping operations') +
  theme_bw()

We can see from this plot that not only does MINUSMA have the most peacekeeper fatalities out of any mission, it reached that point in a comparatively short amount of time. To really drive this point home, we can draw on the fantastic gganimate package. We’re going to animate cumulative fatality totals over time, so we need a yearly version of our mission-level data frame from above. The code below is pretty similar except we’re grouping by both Mission_Acronym and a variable called Year that we’re generating with the year() function in lubridate (it extracts the year from a Date object).

pko_fatalities %>% 
  filter(Type_Of_Incident == 'Malicious Act',
         Mission_Acronym %in% pkos) %>% 
  group_by(Mission_Acronym, Year = year(Incident_Date)) %>% 
  summarize(casualties = n(),
            casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
            casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
            casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
            casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
            casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
            casualties_loc = sum(Casualty_Personnel_Type == 'Local')) %>% 
  mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
                             TRUE                         ~ ''),
         Mission_Year = Year - min(Year) + 1) %>% 
  left_join(pko_data, by = 'Mission_Acronym') %>% 
  mutate(Country = factor(Country, levels = levels(data_agg$Country))) -> data_yr

Once we’ve done that, we need to make a couple tweaks to our data to ensure that our plot animates correctly. I use the new across() function (which is likely going to eventually replace mutate_at, mutate_if, and similar functions) to select all columns that start with “casualties”. Then, I supply the cumsum() function to the .fns argument, and use the .names argument to append “_cml” to the end of each resulting variable’s name. This argument uses glue syntax, which allows you to embed R code in strings by enclosing it in curly braces. The complete() function uses the full_seq() function to fill in any missing years in each mission, i.e., a year in the middle of a mission without any fatalities due to malicious acts. Finally, the fill() function fills in any rows we just added that are missing fatality data due to an absence of fatalities that year.
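Before running the full pipeline, here’s what across() with the glue-style .names argument does on a made-up two-column data frame (toy values, not the real fatality counts):

```r
library(dplyr)

toy <- data.frame(casualties     = c(1, 2, 3),
                  casualties_mil = c(1, 0, 2))

## cumsum every column starting with 'casualties', keeping the
## originals and appending '_cml' to each new column's name
mutate(toy, across(starts_with('casualties'),
                   .fns = cumsum, .names = '{col}_cml'))
## adds casualties_cml = 1, 3, 6 and casualties_mil_cml = 1, 1, 3
```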

Now we’re ready to animate our plot! We construct the ggplot object like before, but this time we add the transition_manual() function to the end of the plot specification. This function tells gganimate what the ‘steps’ in our animation are. Since we’ve got individual years, we’re using the manual version of transition_ instead of the many fancier versions included in the package.

If you check out the documentation for transition_manual(), you’ll notice that there are a handful of special label variables you can use when constructing your plot. These will update as the plot cycles through its frames, allowing you to convey information about the flow of time. I’ve used the current_frame variable, again with glue syntax, to make the title of the plot display the current mission year as the frames advance.

library(gganimate)

data_yr %>% 
  arrange(Mission_Year) %>% 
  mutate(across(starts_with('casualties'), .fns = cumsum, .names = '{col}_cml')) %>%
  complete(Mission_Year = full_seq(Mission_Year, 1)) %>%
  fill(Year:casualties_loc_cml, .direction = 'down') %>%
  filter(Mission_Year <= 6) %>% # youngest mission is UNMISS
  ggplot(aes(x = Country, y = casualties_cml, label = casualties_cml)) +
  geom_bar(stat = 'identity', fill = '#5b92e5') +
  geom_text(nudge_y = 10) +
  labs(x = '', y = 'UN Fatalities',
       title = 'UN fatalities in big 5 peacekeeping operations: mission year {current_frame}') +
  theme_bw() +
  transition_manual(Mission_Year)

While the scatter plot above illustrates that UN personnel working for MINUSMA have suffered the most violence in the shortest time out of any big 5 mission, this animation makes it abundantly clear, especially since MONUSCO and UNMISS both experience years without a single UN fatality from a deliberate attack. Visualizations like these are a great way to showcase your work, especially if you’re dealing with dynamic data. While you still can’t easily include them in a journal article, they’re fantastic tools for conference presentations or

]]>
Tom Hope
Adding Content to an Academic Website2020-08-07T00:00:00-05:002020-08-07T00:00:00-05:00https://tomhoper.github.io//posts/2020/08/website-contentOne thing I haven’t covered in my previous posts on creating and customizing an academic website is how to actually add content to your site. You know, the stuff that’s the reason why people go to your website in the first place? If you’ve followed those guides, your website should be professional looking and already feeling a little bit different from the stock template. However, adding new pages or tweaking the existing pages can be a little intimidating, and I realized I should probably walk through how to do so. Luckily Jekyll’s use of Markdown makes it really easy to add new content!

Content

Editing the welcome page for your site (_pages/about.md) is relatively straightforward. Things get a little trickier if you want to add an entirely new page to your website. You’ll notice that I have a software page on my site that isn’t part of the academicpages template. I’ll use that page as a running example to walk you through adding a new page to your site.

First steps

First things first, we need to create a file for the page itself. The main pages for your website are generated from Markdown files contained in the _pages directory. Create a new file called software.md in _pages. Now, open it up in RStudio or your text editor of choice. If you’ve looked at the .md files for other pages, you’ll notice that they all start with a similar block of text. This is a YAML header that tells Jekyll the basic information needed to build the page. There are lots of different options you can include, but the only two you really need are the permalink for the page and its title. Add the following to the top of software.md:

---
permalink: /software/
title: "Software"
---

Anything after that second line of dashes will be translated into actual content on the page.

Fill it out

Now we need to make our new page actually say something. My software page lists the R packages I’ve contributed to and includes links to miscellaneous other bits of code like functions for working with video data in Python or the LaTeX template I used for my dissertation. You can check out the .md source file for my software page on my GitHub.

A couple of things to notice:

  • You can create headings using pound signs
    • More pound signs produce smaller headings
  • You can create links using standard Markdown syntax, e.g., [link text](url)
    • If you’re linking to a page generated from a source .md file in _pages, just put a slash before the page name and don’t include any extension, e.g., [software](/software)
  • You can embed images by adding an exclamation point before the opening [ in Markdown link syntax, e.g., ![](/images/profile.png)
  • You can create code blocks like the ones on this page by enclosing text in triple backticks
    • Put the name of the programming language after the opening backticks to activate syntax highlighting
  • You can also embed raw HTML directly like I used to include three images next to one another
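Putting those bullets together, here's a hypothetical snippet you could drop into software.md (the package name, URLs, and image path are placeholders, not anything from my actual site):

````markdown
## R packages

I maintain [mypackage](https://github.com/jayrobwilliams/mypackage),
which you can read more about on my [research](/research) page.

![](/images/mypackage-hex.png)

```r
install.packages("mypackage")
```
````

The `##` produces a second-level heading, the first link points to an external URL, the second to another page in _pages, and the `r` after the opening backticks turns on R syntax highlighting.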

These tools should be sufficient to let you build an awesome new page for your website. However, letting visitors actually get to your new page requires a little more work.

You can’t get there from here

If you want to just add a link to your new page from an existing page, like the homepage, that’s easy and can be accomplished by adding a link to the Markdown source in _pages/about.md. That’s how I added my teaching materials page; it’s just a link on my teaching page. But what about if you want your new page to be easily accessed from the fancy navigation bar at the top of the site?

To do that, we’ll need to edit the files Jekyll uses to control navigation on the site. Open up _data/navigation.yml and get ready to add our new page to the top menu. This is what it looks like in the template:

# main links
main:
  - title: "Publications"
    url: /publications/

  - title: "Talks"
    url: /talks/    

  - title: "Teaching"
    url: /teaching/    
    
  - title: "Portfolio"
    url: /portfolio/
        
  - title: "Blog Posts"
    url: /year-archive/
    
  - title: "CV"
    url: /cv/
    
  - title: "Guide"
    url: /markdown/

The order that items appear in top-to-bottom in this file is also the order they’ll appear in left-to-right in the navigation bar. So decide where you want your new page to go, and slot it in. This is what _data/navigation.yml looks like for my website:

# main links
main:
  - title: "Publications"
    url: /publications/
    
  - title: "Research"
    url: /research/

  - title: "Teaching"
    url: /teaching/

  - title: "Software"
    url: /software/

  - title: "Posts"
    url: /posts/
    
  - title: "CV"
    url: /cv/

Again, you can check out the current version of this file for my site at my GitHub if you want. If I’ve changed anything in the navigation menu since I wrote this post, those changes will be reflected there.

You’ll also notice that the Guide link is no longer there in my _data/navigation.yml. Removing elements from this file drops them from the navigation menu, so if there are any other pages in the template you don’t plan to use, go ahead and remove them now.

Once you’re happy with your new page, it’s time to tell git about your changes, and then upload them to GitHub. You can do this with

git add _pages/software.md _data/navigation.yml
git commit -m "add software page"
git push

If you followed the guide on uploading changes to GitHub in my post on making an academic website, all of the above code should run smoothly and in a few minutes you’ll have a new page on your website.

Uploading files

One of the advantages of using GitHub pages to host your website is that you don’t have to use Dropbox to host PDFs of your working papers and published articles, not to mention your CV. If you use Wix or WordPress, you may have to upload your files to Dropbox, and then link to them on your site. This process has three major downsides:

  1. You have to update your website in two places to add or update a PDF
  2. Google Scholar will ignore Dropbox links, so you won’t get a record of your scholarship online
  3. If someone clicks a Dropbox link while viewing your site on their phone or tablet, it may take them to the Dropbox app or pop up a message about the app not being installed

All of these are less than ideal. Luckily, GitHub Pages has the capability to address all three already built in. When you make an update to your website and git push it to GitHub, all tracked files get uploaded with it. This means it’s super easy to upload your PDFs to your site and link directly to them. I’ll walk through how to do this with an example PDF called working-paper.pdf.

First, copy the PDF into the files/pdf directory in your site’s directory. Next we need to tell git about this file, which we do with

git add files/pdf/working-paper.pdf
git commit -m "add working paper"
git push

Don’t forget to add a link to the paper somewhere on your research page so that visitors can access it. Here’s an example of what that link might look like: [Working Paper](/files/pdf/working-paper.pdf). And if you want to use the fancy button from my post on customizing your site, you would do this: [Working Paper](/files/pdf/working-paper.pdf){: .btn--research}.

Designing for mobile

One of the advantages of the academicpages template is that it is responsive, meaning that layouts change automatically with screen size to present content in the most efficient manner. Take a look at my website on your phone to see how a smaller device changes the site’s layout. When you’re editing your website, it’s a good idea to periodically check how it appears on a phone, as it’s likely that a number of visitors to your site will view it on their phones.1

To do so, you can use tools like Chrome’s device mode, but this can be annoying and doesn’t perfectly capture the experience of navigating your site on a small touchscreen. The best way to do that is, unsurprisingly, to use your actual phone. However, this requires a slight tweak to our usual bundle exec jekyll serve command. We need to add a --host argument to the command, where the value of the argument is our computer’s IP address. There are many ways to look this up, but here are two quick ones you can execute from the terminal:

  • On MacOS: ifconfig en0 | grep inet | grep -v inet6 | awk '{print $2}'
  • On Linux: hostname -I

What each of these will do is capture the local IP address of the computer. Often this will be something like 192.168.1.x or 10.0.0.x. This won’t let you access the site from outside your network over the internet, but it will let you access it locally on your own network. Once you’ve found your local IP address, you can serve your site on your local network, letting you view it on your phone or tablet. For example, my IP address is 192.168.1.6, so putting it all together I get:

bundle exec jekyll serve --host 192.168.1.6

This is quite a lot to type, and your computer’s local IP address can change occasionally, so you can’t just keep putting in the same IP address each time. You can save yourself some time by creating an alias for the command. An alias is simply a way to refer to a longer command with a shorter label. To do this, you’ll need to edit your .bash_profile configuration file.2 The easiest way to do this is to run

nano ~/.bash_profile

This will open up your .bash_profile in nano, a simple text editor.3 I’ve decided to call my aliased command serve-site, but you could call it anything you want. Scroll down to the end of the file and add either

alias serve-site="bundle exec jekyll serve --host=$(ifconfig en0 | grep inet | grep -v inet6 | awk '{print $2}')"

for MacOS or

alias serve-site="bundle exec jekyll serve --host=$(hostname -I)"

for Linux. Once you’ve added this line, save the file by pressing ctrl+o and then enter to use the existing filename, overwriting the old version of .bash_profile. Then press ctrl+x to close nano. The last step is to tell your terminal about this new alias. You can accomplish this with

source ~/.bash_profile

regardless of whether you’re on Linux or MacOS. Now whenever you want to check out your website on a mobile device, you can just navigate to your website’s directory and use the new serve-site alias to launch it locally.

Windows

If you’re trying to figure out how to do this on Windows, I haven’t forgotten about you, I just have no idea how to do this on Windows ¯\_(ツ)_/¯. My recommendation would be to do a lot of googling, or to install the Windows Subsystem for Linux, which will allow you to use a bash shell to interact with your files.

  1. As a senior faculty member once pointed out to me, the search committee member who didn’t fully read your application is most likely to pull up your website on their phone during a committee meeting. 

  2. I’m assuming that you’re using bash as your shell. If you’re using a different shell, see this list for which configuration files you should be editing. Other shells may also define aliases in different ways. 

  3. Feel free to use a different editor or use the edit command if you’ve set the default editor to your preferred editor. 

Tom Hope
Customizing an Academic Website
2020-07-06
https://tomhoper.github.io//posts/2020/07/customizing-website

This is a followup to my previous post on creating an academic website. If you’ve followed that guide, you should have a website that’s professional-looking and informative, but it’s probably lacking something to really make it feel like your own. There are an infinite number of ways you could customize the academicpages template (many of them far, far beyond my abilities) but I’m going to walk you through the process I used to start tweaking my website. The goal here isn’t to tell you how you should personalize your website, but to give you the tools to learn how to implement whatever changes you want to make.

You’ll notice that the differences between my website and the template aren’t limited to my mug in the sidebar on the left. These differences are especially apparent if you compare my publications page with the template’s. You’ll notice on my page that article titles no longer link to separate pages, and I’ve got fancy icons to link to my PDF copy and to the version of record.

I’ve made several minor changes from the template to make my website feel a bit more like my own. These tweaks are varyingly difficult to accomplish, but they all involve a bit of trial and error. While academicpages is a great template, the accompanying documentation isn’t particularly useful if you want to make any changes that go beyond content into formatting. Thus, each new tweak I implement begins with something of a scavenger hunt.

Essentially, you need to track down where in the source code of your website a variable is originally defined, and then edit it there. Luckily, RStudio makes this relatively straightforward with its Find in Files function. You can access this special search from the Edit menu, or by pressing Cmd+Shift+f on MacOS or Ctrl+Shift+f otherwise. Once you’ve brought up the Find in Files dialog, enter the name of the variable you’re looking for in the ‘Find’ box and your website’s directory in the ‘Search in’ box (for me that’s ~/Dropbox/Website).
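If you're not an RStudio user, grep gives you a rough terminal equivalent. This sketch builds a scratch directory standing in for your site so it runs anywhere; in practice you'd just run the final grep from your website's folder:

```shell
set -e
# Scratch directory standing in for your website's folder:
site=$(mktemp -d) && mkdir "$site/_sass"
echo 'font-size: $type-size-7;' > "$site/_sass/syntax.scss"
# -r searches recursively, -n prints line numbers, much like Find in Files:
grep -rn 'type-size-7' "$site"
```

Each match is printed as file:line:content, which tells you exactly where to go edit.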

Easier on the eyes

One thing I don’t like about the template’s settings out of the box is the font size in highlighted code blocks. It’s way too small! While the small font size makes it more likely that an entire line of code will fit on the screen without having to scroll, how useful is that if you have to squint to see what’s there?

My first order of business in customizing my site was thus to increase the font size used in code blocks. I know that code highlighting functions are often referred to as syntax highlighters. Using this knowledge, I searched for “syntax” using the Find in Files dialog as seen below:

The most relevant hit is a file called _sass/syntax.scss. The _sass directory is where you’ll find lots of options to change the appearance of your site, since it contains the CSS code that determines much of how your site looks. If you scroll through _sass/syntax.scss, you’ll find this chunk of code, which controls how text is rendered in code boxes:

.highlight {
  margin: 0;
  font-family: $monospace;
  font-size: $type-size-7;
  line-height: 1.8;
}

We’ve found our next clue! We need to figure out where $type-size-7 is defined. If you do a Find in Files search for that, you’ll learn it’s in _sass/variables.scss. Open that file and Cmd+f (Ctrl+f if you’re not on a Mac) for $type-size-7, and you’ll find this chunk of code:

/* type scale */
$type-size-1                : 2.441em;  // ~39.056px
$type-size-2                : 1.953em;  // ~31.248px
$type-size-3                : 1.563em;  // ~25.008px
$type-size-4                : 1.25em;   // ~20px
$type-size-5                : 1em;      // ~16px
$type-size-6                : 0.75em;   // ~12px
$type-size-7                : 0.6875em; // ~11px
$type-size-8                : 0.625em;  // ~10px

Here’s where font sizes are defined for the entire website! Luckily the code is well commented, so we know that the $type-size-7 used in code blocks is roughly 11 pixels high. I first tried setting it to $type-size-5, but that was too big for my tastes. Alas, $type-size-6 was too small, and so I resolved to make my own.

The font sizes are defined in ems (a unit of typography you’re no doubt familiar with if you’ve also spent too much time mucking about with LaTeX) so it’s easy to create a new one. $type-size-6 is 0.75 ems and $type-size-5 is 1 em, so to find the exact middle of them we can do

(1 + .75) / 2
## [1] 0.875

Now all we have to do is define our new font size in _sass/variables.scss

$type-size-syntax           : 0.875em;  // ~14px

and tell the site to use it in _sass/syntax.scss

.highlight {
  margin: 0;
  font-family: $monospace;
  font-size: $type-size-syntax;
  line-height: 1.8;
}

Save all these changes, restart your local webserver with bundle exec jekyll serve (or wait for it to rebuild and reload if it was already running) and check out the changes. If everything went smoothly, you should have larger and easier to read text in any code blocks on your website.
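As a quick sanity check on those pixel comments (assuming the common 16px browser default for 1em), the new size lands exactly halfway between $type-size-6 (12px) and $type-size-5 (16px):

```shell
# 0.875em * 16px/em = 14px, the midpoint of 12px and 16px
awk 'BEGIN { printf "%.0fpx\n", 0.875 * 16 }'   # prints: 14px
```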

Fixing fancy icons

While you’re messing around in _sass/syntax.scss, there’s another easy fix that I highly recommend doing. The version of the template at the time I’m writing this has a small error in this file that prevents code boxes from rendering properly. You may be wondering why code boxes have a small square in the top right corner of them; I know I was. Take a look below for an example of this from my software page:

It turns out there’s supposed to be a little ‘code’ icon there. I did some digging on Stack Overflow, and found this answer, which told me that there’s an error in how _sass/syntax.scss references the Font Awesome library the icon comes from.

&:before {
  position: absolute;
  top: 0;
  right: 0;
  padding: 0.5em;
  background-color: $lighter-gray;
  content: "\f121";
  font-family: "fontawesome" !important;
  font-size: $type-size-6;
  line-height: 1;
  text-transform: none;
  speak: none;
}

The font-family line should be font-family: "Font Awesome 5 Free" !important;. Once you’ve made this change, the icon should show up.

I’ve submitted a pull request fixing the issue, so if you’re reading this in the future hopefully it’s been fixed in the template and you have no idea what I’m talking about.

Adding some color

While we’re in _sass/variables.scss, there are a couple of other easy changes we can make. I never particularly liked the light blue that the template uses for links. There’s a section in _sass/variables.scss that controls what color links appear as.

/* links */
$link-color                 : $info-color;
$link-color-hover           : mix(#000, $link-color, 25%);
$link-color-visited         : mix(#fff, $link-color, 25%);
$masthead-link-color        : $primary-color;
$masthead-link-color-hover  : mix(#000, $primary-color, 25%);

Looking at this code, we can see that $link-color is defined as the same as $info-color. Further up, this is defined as

$info-color                 : #52adc8;

You’ll notice that $info-color is defined with an alphanumeric string and not in a more familiar format like RGB. That’s because it’s a web color in hexadecimal format. If you Google ‘color picker’ you’ll find a handy little applet where you can preview different hex colors or even convert from RGB to hex representation. Since I am a postdoc, I decided to just use my institution’s colors. If you change $info-color to any other color, it will change the color of links across your website. Remember to wait for your webserver to reload the changes, otherwise you won’t be able to see them.
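A hex color is just the red, green, and blue channels written as two hexadecimal digits each; the template's #52adc8, for example, is RGB(82, 173, 200). If you'd rather skip the applet, printf can do the conversion:

```shell
# %02x formats each channel as two lowercase hex digits:
printf '#%02x%02x%02x\n' 82 173 200   # prints: #52adc8
```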

Pushing buttons

The pages where I present my research highlight one of the other big changes I’ve made to my site. The buttons I use to link to papers and replication archives (see here for an example) are not part of the template. I’d originally just used text links here, and thought using some buttons instead might liven up the page a bit. Unfortunately, the default buttons defined in the Academicpages template don’t fit super well given the other changes I’ve made.

The default button looks like this, which is fine, but doesn’t fit with a more colorful custom theme. My new themed button looks like this, which fits a little better and has less space around it.1

While both buttons have a nice hover effect where they change color to let you know you’re over them, the second one incorporates the site’s accent color and is a bit more dynamic since the text changes color in addition to the background. Like most things CSS, buttons are defined in the _sass directory in _sass/buttons.scss.

There are lots of existing button types defined here, but we want to create our own. We could edit the default button styling in _sass/buttons.scss, but if we redefine the existing base button class, we can end up with all kinds of weird side effects for other elements of the site, like the social media share buttons at the bottom of this post. To avoid this, we’ll define a new button subclass, which inherits the basic aspects of a button on the site but only applies the special behavior when we explicitly tell a button to use it.

After some time poking around the W3Schools page on buttons, and a lot of trial and error, I came up with the following CSS code:

/* research page buttons */
&--research {
  display: inline-block;
  margin-bottom: 0.25em;
  padding: 0.125em 0.25em;
  color: $link-color;
  text-align: center;
  text-decoration: none !important;
  border: 1px solid;
  border-color: $link-color;
  border-radius: $border-radius;
  cursor: pointer;

  &:hover {
    color: #fff;
    background-color: $link-color !important;
  }
}
The key parts are

  • color: $link-color;: use the site accent color for the text
  • text-decoration: none !important;: don’t underline the button text
  • border: 1px solid;: draw a one pixel border around the button
  • border-color: $link-color;: use the site accent color for the border
  • border-radius: $border-radius;: use a four pixel radius for the border ($border-radius is defined in _sass/variables.scss)

To add a button to a page, you simply tack on {: .btn--research} after a link, like so

[this](#Buttons){: .btn--research}

Going forward

This is just a brief overview of the ways you can tweak your website from the base provided by the template. Let Google and Stack Overflow be your guides. There will be some trial and error, but the beauty of git is that even if you break something, it’s easy to roll back your changes to when everything was working.
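For example, here's a sketch of that rollback in action. It sets up a throwaway demo repository so it touches nothing real; in your actual site you'd skip the setup and just run the git checkout inside your website's folder (the file name here is only an illustration):

```shell
set -e
# Throwaway repo so nothing real is touched:
demo=$(mktemp -d) && cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo 'color: #52adc8;' > variables.scss
git add variables.scss && git commit -qm 'working state'
echo 'broken edit' > variables.scss   # an edit that breaks the site
git checkout -- variables.scss        # discard it, restoring the last commit
cat variables.scss                    # prints: color: #52adc8;
```

`git checkout -- <file>` only discards uncommitted edits; to undo a change you've already committed, `git revert` is the history-preserving option.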

  1. You’ll notice that this button is very similar to the ones that the Hugo Academic theme uses. Just because I don’t like the theme as a whole doesn’t mean there aren’t really elegant parts of it. 

Tom Hope
Building an Academic Website
2020-06-30
https://tomhoper.github.io//posts/2020/06/academic-website

If you’re an academic, you need a website. Obviously I agree with this since you’re reading this on my website, but if you don’t have one, you should get one. Most universities these days provide a free option, usually powered by WordPress (both WashU and UNC use WordPress for their respective offerings). While these sites are quick to set up and come with the prestige of a .edu URL, they have several drawbacks that have been extensively written on. If you’re a junior scholar, having your own personal webpage is even more important:

  • If (when) you move institutions, you’ll lose your website
  • Even if you can export the contents of a WordPress site, there’s no guarantee it will seamlessly integrate with another university’s implementation
  • Even worse, you’ll lose your search engine ranking since you’ll be starting over from square one with a new URL

Even if you stay at the same institution for the rest of your career, you’re at the mercy of IT and your site may be taken down by a change to the hosting platform at some point in the future.

My approach

There are plenty of guides out there on how to create a personal website using tools like WordPress, Wix, or Google Sites. The free versions of these tools often come with ads, or at the least a message telling you which tool was used to create the website. I take a different approach that requires some (minimal) coding experience, but produces a beautiful and professional website that is ad free. I use a static site generator that produces HTML from easy-to-edit Markdown files. Because the resulting site is static (it’s just a collection of files with no interactivity, so users can’t, e.g., fill out and submit forms), it can be hosted for free with GitHub Pages. Steven Miller has a nice rundown on all of the advantages of this approach.

Who this guide is for

This guide is intended for someone with a basic level of coding experience and comfort with Markdown files. Anyone who’s received graduate training in quantitative social science in the past few years should have all the necessary skills. I’m assuming some familiarity with using the terminal, but no experience with Git or GitHub is needed.

There are other guides to using static site generators to make academic websites, but they all assume a very high level of experience with the required tools and the ability to conduct extensive troubleshooting on your own. The template I use contains a 6 point checklist with no illustrations or examples. I’ve written this guide for people who have some experience coding, but don’t want to spend an afternoon learning two new languages on their own. If you know how to use R Markdown you have all the skills you need to build a website.

Conventions

In this tutorial, I use a couple of conventions to describe computer code and the actions you’ll do with it. Anytime you see content between two angle brackets, you should replace the content with the appropriate version for yourself. For example, <yourusername> would become jayrobwilliams for me, because that is my GitHub username. You’ll also frequently see

highlighted code blocks like this

with an icon in the top right corner. These can denote either code you should enter and run, or the output of running a command. I will clearly state which of these options applies in any given case. Sometimes you will also see inline code like this, and again I’ll note whether these represent code you should run, or the output of code.

A brief aside on Git-speak: these periodic indented blocks will explain the terminology that Git uses to help you understand what each Git command actually does.

Getting started

Two of the most popular programs available for building static sites from Markdown files are Jekyll and Hugo. Relevant for us is that each one has a full-featured theme for academic websites that you’ve probably seen before. There are plenty of differences under the hood, but the most important one for building an academic website is that Hugo integrates nicely with the blogdown R package, letting you write your website entirely in R. I chose Jekyll over Hugo because I liked the Jekyll theme better than the Hugo one. My life might have been easier if I’d gone with Hugo and blogdown, but here we are. Consequently, this tutorial will show you how to set up an academic website using Jekyll.

GitHub

The first thing you need to do is create a GitHub account if you don’t already have one. If you do have one, great! If you don’t already have an account, think about your username carefully. Unless you set up a custom domain name (which we’ll cover in a separate post), your website’s URL will be <yourusername>.github.io. Setting up a custom domain name is a little tricky and isn’t free, so pick a username that works as a URL as well.

Git-speak aside: the basic unit of GitHub is the repository. Repositories are just folders (directories, if you want to be pedantic), but Git keeps a record of the files in the folders. We’ll start by making a repository on GitHub and then later download that repository to our computer. In both cases, it’s just a folder. The magic of Git is that we can link the two so that changes you make in your local repository (the one on your computer) will sync with the remote repository (the one on GitHub). When people (myself included) get lazy, they’ll often shorten repository to ‘repo’.

Once you’ve logged into GitHub, the next step is to head over to the repository for the template we’ll be using. We need to copy the template so we can get our own version that we can mess around with. In GitHub parlance we need to fork the template repository to get our own copy. Find the Fork button at the top right of the template repository (highlighted in green below), and click it.1

After a brief wait (and an amusing GIF), you’ll land at your forked copy of the repo. If you look closely, you’ll notice that the name in the top left has changed. Where it originally said “academicpages/academicpages.github.io” it should now read “<yourusername>/academicpages.github.io”. As you can see below, for me this is now “jayrobwilliams/academicpages.github.io”. We need to change this! Not just so people know that this is your website, but to get it online. Click the Settings button with the gear icon (highlighted in green below).

This will, unsurprisingly, take you to the Settings page. The very first option (highlighted in green below) is the repository name, and that’s all we need to worry about now. It should look like this, with <yourusername> in place of jayrobwilliams before the .github.io. Now, we need to change the repository name to <yourusername>.github.io. It’s important that the first part of the repo name exactly match your GitHub username, or you’ll run into trouble later when putting your website online.

In my case, I renamed the repo to “jayrobwilliams.github.io” because my GitHub username is jayrobwilliams.

We’ve told GitHub that this repository is for our website, so now it’s time to start working on that website. To do that, we need to get the files for our website into our computer.

Git

Since we’re going to host our website on GitHub, you’ll need to make sure you have Git installed. Git is a version control system designed to let teams of programmers collaborate on projects seamlessly. For us, it’s just going to be the way that we upload files for our webpage to GitHub. You can download Git from the official downloads page, but there are a couple of easier ways. If you’re on MacOS, you can install Git via the Xcode Command Line Tools by running

xcode-select --install

This has the advantage of also installing a Ruby development environment, which we’ll need later. If you’re on Windows, you should download Git Bash. While the Git you’ll get from the official website will work just fine, Git Bash emulates a Unix terminal, meaning that you’ll use Unix commands instead of the Windows commands you’d use in cmd.exe. Why is this useful? I use MacOS and Linux, so this tutorial is written from that perspective. Beyond my laziness in not providing the equivalent Windows commands alongside the Unix ones, most Git related information you’ll find online at places like Stack Overflow will be written for a Unix audience. By using Git Bash, you’ll be able to follow any advice you find and not have to translate it into Windows commands. Most (all??) Linux distributions ship with Git, so you should be good to go if you’re using Linux.

Now that you’ve got Git, you can download the files that will make up your website to your computer. To do that, we’ll clone the repository from GitHub.

Git-speak aside: cloning a repository means creating a local repository (folder) on your computer that’s connected to the remote repository (on GitHub). Cloning differs from downloading in that you are setting up a connection between the two folders so you can keep changes you make locally synced up with the remote repository (which is where GitHub will build your website from).

Find the green Clone or Download button on the right side of your repository’s page (highlighted in green below). When you click it, a dialog will pop up that contains the URL you will use to clone your repository to your computer. Click the clipboard icon to the right, and the URL will be copied to your clipboard.

Once you’ve done that, open a terminal and navigate to where you want your website’s folder to live. I keep mine at the top level of my Dropbox, so I would type

cd ~/Dropbox

to navigate there. Use the pwd command to verify you’re in the right place; if you’re also putting your website at the top level of your dropbox, the output should be /Users/<yourusername>/Dropbox/. Now it’s time to clone our GitHub repo to our computer and create our local repo.

Do this by typing

git clone https://github.com/<yourusername>/<yourusername>.github.io.git

and running it. This will create a folder called <yourusername>.github.io, which is honestly kind of clunky and not instantly informative when you’re scanning through your files. To give it a more informative name, e.g., Website, run

git clone https://github.com/<yourusername>/<yourusername>.github.io.git Website

Adding Website at the end will clone the GitHub repo into a folder called ‘Website’ instead of <yourusername>.github.io.

Making it yours

Now that you’ve got all the files needed to build your website on your computer, it’s time to start personalizing this generic template!

Editing pages

All of the actual content of your site is contained in .md files, which are Markdown files. Most of these files live in the _pages directory. The documentation that comes with the template is relatively straightforward, and does a pretty good job telling you how to customize your site. For example, the landing page can be changed by editing the _pages/about.md file, and the sidebar information is controlled by the _config.yml file. Let’s start by editing _config.yml to put our name on the site and change the sidebar to reflect our information.

I’ve copied the first several lines of _config.yml below:

# Welcome to Jekyll!
#
# This config file is meant for settings that affect your entire site, values
# which you are expected to set up once and rarely need to edit after that.
# For technical reasons, this file is *NOT* reloaded automatically when you use
# `jekyll serve`. If you change this file, please restart the server process.

# Site Settings
locale                   : "en-US"
title                    : "Your Name / Site Title"
title_separator          : "-"
name                     : &name "Your Name"
description              : &description "personal description"
url                      : https://academicpages.github.io # the base hostname & protocol for your site e.g. "https://mmistakes.github.io"
baseurl                  : "" # the subpath of your site, e.g. "/blog"
repository               : "academicpages/academicpages.github.io"
teaser                   :  # filename of teaser fallback teaser image placed in /images/, .e.g. "500x300.png"
breadcrumbs              : false # true, false (default)
words_per_minute         : 160
future                   : true
read_more                : "disabled" # if enabled, adds "Read more" links to excerpts
talkmap_link             : false #change to true to add

You’ll want to change the following things:

  • "Your Name / Site Title" to your name
  • &name "Your Name" to your name
  • https://academicpages.github.io to the name of your repository (and also your github username) i.e., https://<yourusername>.github.io
  • "academicpages/academicpages.github.io" to your GitHub username and the repo name i.e., "<yourusername>/<yourusername>.github.io"

When you’re finished, it should look like this:

# Welcome to Jekyll!
#
# This config file is meant for settings that affect your entire site, values
# which you are expected to set up once and rarely need to edit after that.
# For technical reasons, this file is *NOT* reloaded automatically when you use
# `jekyll serve`. If you change this file, please restart the server process.

# Site Settings
locale                   : "en-US"
title                    : "<Your Name>"
title_separator          : "-"
name                     : &name "<Your Name>"
description              : &description "personal description"
url                      : https://<yourusername>.github.io # the base hostname & protocol for your site e.g. "https://mmistakes.github.io"
baseurl                  : "" # the subpath of your site, e.g. "/blog"
repository               : "<yourusername>/<yourusername>.github.io"
teaser                   :  # filename of teaser fallback teaser image placed in /images/, .e.g. "500x300.png"
breadcrumbs              : false # true, false (default)
words_per_minute         : 160
future                   : true
read_more                : "disabled" # if enabled, adds "Read more" links to excerpts
talkmap_link             : false #change to true to add

This will put your name on the website and tell GitHub where to look for the files that make up your site. Now let’s check out our changes and make sure everything’s working like it’s supposed to!
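As an aside, if you’d rather script these substitutions than edit by hand, sed can make all four at once. Everything below is illustrative: jdoe and Jane Doe are made-up stand-ins, and the block writes its own throwaway _config.yml in a temp directory so you can try it safely; point sed at your real file instead.

```shell
# Illustrative only: jdoe / Jane Doe are made up; run against your real _config.yml
cd "$(mktemp -d)"
cat > _config.yml <<'EOF'
title                    : "Your Name / Site Title"
name                     : &name "Your Name"
url                      : https://academicpages.github.io
repository               : "academicpages/academicpages.github.io"
EOF
sed -i.bak \
  -e 's|"Your Name / Site Title"|"Jane Doe"|' \
  -e 's|&name "Your Name"|\&name "Jane Doe"|' \
  -e 's|academicpages/academicpages.github.io|jdoe/jdoe.github.io|' \
  -e 's|https://academicpages.github.io|https://jdoe.github.io|' \
  _config.yml
cat _config.yml
```

The `-i.bak` flag edits the file in place while keeping the original as _config.yml.bak, so a botched substitution is easy to undo. Note that the `&` in the name line has to be escaped as `\&` in the replacement, since sed treats a bare `&` as “the whole match.”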

Previewing your website

Once we upload our modified files to GitHub (and tell GitHub to turn them into a website, which we’ll cover below), they’re out there on the internet for everyone to see. If you’re like me, you’ll make a lot of mistakes when working on your website. There’s no need to broadcast all of those mistakes to the world, and we can avoid this very easily by previewing our website locally. What this means is building the site from the various .md files, rendering it to HTML, and then viewing it. We can do all of that on our computer without ever having to put it online.

To preview your website locally, you’ll need to install Jekyll on your computer. The easiest way to do this is with Bundler. Bundler is a package manager for Ruby, which is the programming language that Jekyll is written in. This means that we need a full Ruby development environment to get Jekyll working to run our website locally. If you’re on Windows, RubyInstaller will give you everything you need. If you’re on MacOS, your computer comes with Ruby, but not the development headers required for Bundler to work. If you installed Git via the Xcode Command Line tools earlier, you’re good to go here. If not, you can install Ruby via Homebrew with

brew install ruby

If you’re on Linux, just install Ruby via your package manager. Once that’s taken care of, it’s time to install Bundler. Do so by running

gem install bundler

in a terminal. If you’ve navigated away from the directory where your website is located (which is ~/Dropbox/Website for me), head back there now. Do this with

cd ~/Dropbox/Website

but replace ~/Dropbox/Website with the path to your website’s repo. Next, we need to install any packages (called ‘gems’ in Ruby) that Jekyll depends on. This is where Bundler shines by taking care of this whole process for us; it reads the Gemfile included with the source code and installs all required gems

bundle install
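For reference, a Gemfile is just a plain-text list of the gems a project depends on; Bundler reads it and installs everything listed. The template ships its own Gemfile, so you don’t need to write one — the sketch below is purely illustrative of what such a file looks like, not the template’s actual contents.

```ruby
# Illustrative Gemfile -- use the one shipped with the template, not this
source "https://rubygems.org"

gem "jekyll"                                # the site generator itself
gem "github-pages", group: :jekyll_plugins  # pins gems to the versions GitHub Pages runs
```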

If you want to see what’s been installed, run gem list before and after bundle install. If everything worked correctly, you can now launch your website! What we’re going to do is start a webserver on your computer, which will let you access your website locally without having to put it out on the internet. We do this with

bundle exec jekyll serve

The bundle exec command is just a prefix that lets Ruby access all of the gems specified in the Gemfile. The jekyll serve command builds your website and starts a webserver so that you can view it locally. To access your website, open a browser and go to 127.0.0.1:4000 or localhost:4000. It should look exactly like the example site.

This is a special version of your site that’s only accessible from your computer; no one else can see it! So this is the perfect place to play around, experiment, and see how to make your site do what you want it to. This process is surprisingly easy. Make a change to a file e.g., editing _pages/about.md to introduce yourself, and save the file. That’s all you have to do; Jekyll will notice the change to the file and automatically rebuild the site. All that’s left to do is refresh your browser so you can see the changes!

Once you’ve made a couple changes to see how it works, you might want to turn off the webserver and make lots of changes, then check out your handiwork. Or maybe you’re just done working on your website for now. Either way, it’s time to shut down the webserver. To do so, you can just close the terminal window, but your terminal will warn you that a process is still running.

To save yourself some time and do this faster, simply press Ctrl+c.2

Getting online

Alright, you’ve made some changes from the template, checked them out locally, and you’re ready to share your website with the world. This is a two step process: first we upload all of our modified files to the GitHub repo we forked from the template, then we configure GitHub Pages to build and deploy our website. Optionally, if you want a custom domain name, there’s also some configuration outside of GitHub Pages to connect your domain name with your website.

Uploading changes to GitHub

To upload your changes to GitHub, we first have to make Git locally aware of them. We do this by committing the changes, then pushing them to the repo on GitHub.

Git-speak aside: Git stores file histories as a series of changes or differences. A batch of changes (which can include changes in one or more files) is called a commit. When you want to tell the remote repo (the one on GitHub) about changes you’ve made, you push a commit from the local repo to the remote one. Once you do this, GitHub looks at the differences and modifies the files in the remote repo.
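To see these moving parts without touching your real website, here’s a disposable demo in a temp directory (every name here is made up, and nothing leaves your machine):

```shell
# A throwaway local repo standing in for your website repo (names are made up)
cd "$(mktemp -d)"
git init -q
echo 'title: "Demo"' > _config.yml
git add _config.yml                                   # stage the change
git -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m "add my info to config file"         # record the staged change as a commit
git log --oneline                                     # the new commit now appears in the history
```

The `-c user.name=… -c user.email=…` flags just supply an identity for this one command, so the demo works even if you haven’t configured Git globally.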

Before we can commit the changes, we need to stage them. This just involves telling Git what changes we want to commit. To make our lives easier, let’s check in on what changes we’ve made

git status

You should get results that look something similar to this:

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   _config.yml

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	Gemfile.lock

no changes added to commit (use "git add" and/or "git commit -a")

We can ignore the first part for now. The second part (Changes not staged for commit) will list any files that Git knows about that have changed. The third part (Untracked files) includes files we haven’t told Git about, so as far as it’s concerned they don’t exist. I’ve edited _config.yml to replace the template’s information with my own, so it shows up here. If you want to verify the changes you made, you’ll want to diff the file. Do this with

git diff _config.yml

This is what the output looked like for me. Note that your output may or may not be color-coded depending on what type of system you’re on and your Git settings.

diff --git a/_config.yml b/_config.yml
index 1dc605c..4affb4e 100644
--- a/_config.yml
+++ b/_config.yml
@@ -7,13 +7,13 @@
 
 # Site Settings
 locale                   : "en-US"
-title                    : "Your Name / Site Title"
+title                    : "Rob Williams"
 title_separator          : "-"
-name                     : &name "Your Name"
+name                     : &name "Rob Williams"
 description              : &description "personal description"
-url                      : https://academicpages.github.io # the base hostname & protocol for your site e.g. "https://mmistakes.github.io"
+url                      : https://jayrobwilliams.github.io # the base hostname & protocol for your site e.g. "https://mmistakes.github.io"
 baseurl                  : "" # the subpath of your site, e.g. "/blog"
-repository               : "academicpages/academicpages.github.io"
+repository               : "jayrobwilliams/jayrobwilliams.github.io"
 teaser                   :  # filename of teaser fallback teaser image placed in /images/, .e.g. "500x300.png"
 breadcrumbs              : false # true, false (default)
 words_per_minute         : 160

Each line that begins with a + indicates an insertion and each line that starts with a - is a deletion. You can see that I’ve just replaced the generic information with mine. Now that we’re confident in the changes we made to _config.yml, we need to tell Git to record them. We first stage the file with

git add _config.yml

If we re-run git status, we’ll see

On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   _config.yml

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	Gemfile.lock

You’ll notice that _config.yml has moved from Changes not staged for commit to Changes to be committed (Changes not staged for commit has disappeared since we haven’t made any changes to other files that Git is tracking). Now we’re finally ready to commit our changes!

git commit -m "add my info to config file"

This tells Git to record the changes we made to _config.yml. The -m "add my info to config file" is the commit message for this commit.3 You should see something like this, but with a different number after master on the first line.4

[master a8af7b1] add my info to config file
 1 file changed, 4 insertions(+), 4 deletions(-)

That means our changes have been recorded locally by Git. Once we’ve committed our changes to _config.yml, it’s time to upload them to GitHub! We do this by pushing the changes to GitHub, which will then modify its copies of the files to match ours. This is (relatively) straightforward

git push

Git will then prompt you for your username and password.5 This will be your GitHub username and password. Enter each of them and you should be good to go! If everything goes smoothly, you will see a message something like this

Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 4 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 426 bytes | 426.00 KiB/s, done.
Total 5 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:jayrobwilliams/jayrobwilliams.github.io.git
   83c3bc4..a8af7b1  master -> master

This is just Git telling you exactly how it recorded the changes and uploaded them to GitHub; you can safely ignore the details. If things don’t go smoothly, you may get an error that looks something like this

fatal: No configured push destination.
Either specify the URL from the command-line or configure a remote repository using

    git remote add <name> <url>

and then push using the remote name

    git push <name>

Your repo on GitHub is the “remote” that your local Git needs access to. To fix this error, we just need to tell Git where to find the remote

git remote add origin https://github.com/<yourusername>/<yourusername>.github.io.git
git push -u origin master
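You can confirm the remote was registered with git remote -v. The scratch repo below is purely illustrative (example/example.github.io is a made-up repo; substitute your own URL):

```shell
# Hypothetical check in a scratch repo: confirm the remote was registered
cd "$(mktemp -d)"
git init -q
git remote add origin https://github.com/example/example.github.io.git
git remote -v    # lists origin twice: once for fetch, once for push
```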

The next step is to go to GitHub and check out your changed file. Head back to your repository (https://github.com/<yourusername>/<yourusername>.github.io, as a reminder) and click the commits button at the bottom left (highlighted in green below).

On the commits page, you should see the commit you just made

If you do, then you’re all set.

Going live

All that’s left to do now is take our website live. If using git on the command line has been stressing you out, you’ll be glad to know that this part happens entirely in your browser.

You want to go back to the settings page for your repo (the button with the gear icon at the top right of the page) since that’s where we’ll turn this collection of files into a website. Scroll down until you get to the GitHub Pages section. Click the dropdown currently labeled “None” under the “Source” subheading, and choose “master branch” from the popup (highlighted in green below).6

Once you do this, the page should reload. Scroll back down to the “GitHub Pages” heading, and there should be a message at the top that “Your site is ready to be published at http://<yourusername>.github.io/”. You can see what this looks like for me below

Now, here’s where things may get a little tricky. Go to http://<yourusername>.github.io/ and see if there’s a website there. If there is, skip down to the last section below. If not, go back to the settings page for your repo and see if there’s still a “ready to be published” message. If there is, you may have to push an additional change to the repo to trigger building your site. Editing _pages/about.md to personalize the welcome message and short biography is a good candidate here. Once you’ve personalized your site’s landing page, record the changes and push them to the repo

git add _pages/about.md
git commit -m "edit welcome page"
git push

Then head back to the repo’s settings page and see if the “ready to be published” message has changed to “Your site is published at http://<yourusername>.github.io/”. You can see what this looks like for my website below

You’ll notice that the URL for my site is http://jayrobwilliams.com, not http://jayrobwilliams.github.io. That’s because I’ve bought a custom domain name for my site. While this does make your website seem slightly more professional, it’s not free (unlike .github.io) and it can be a little tricky to set up. I’ll cover how to get a custom domain name for your site in a future post.

Next steps

Congratulations! You now have a professional looking academic website that’s yours forever; you won’t lose access to it when you leave your current institution. At this point you should continue making it your own by adding information on your research projects (edit the .md files in the _portfolio folder to have them populate the Portfolio page with information on each of your research projects), teaching experience (_pages/teaching.md), and accomplishments (_pages/cv.md). Remember that you can preview any changes locally with bundle exec jekyll serve. If you add any new files, like PDFs of your working papers or publications, be sure to git add them as well, so that they’ll be uploaded to your repo with the next git push. You may have noticed that there are lots of small differences between my website and the template, and not just in terms of content. In a future post, I’ll talk about how to further customize the look of your site to make it feel more unique.
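As a refresher on that last point, here’s a disposable demo of how a brand-new file moves from untracked to staged (the directory and PDF name are made up; your repo’s layout may differ):

```shell
# Illustrative: new files are invisible to Git until you add them
cd "$(mktemp -d)"
git init -q
mkdir files
touch files/working-paper.pdf   # stand-in for a real PDF
git status --short              # shows "?? files/" -- untracked
git add files/working-paper.pdf
git status --short              # shows "A  files/working-paper.pdf" -- staged
```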

  1. GitHub might look a little different for you since it recently went through a redesign. Everything you’re looking for is still there and should be recognizable from the images. 

  2. Yes, even if you’re on MacOS; Ctrl+c has been the way to cancel programs on the command line since the 1960s while the Command (⌘) key did not exist on computers until the Macintosh in 1984, and the terminal is olddd. 

  3. Yes, you can omit the -m "..." with git commit, but it will open up your system’s default editor (likely vim, emacs, or nano), none of which are remotely beginner-friendly. Until you decide to learn one of them, git commit -m is perfectly fine. 

  4. This is the hash for the commit; it’s a 40-character string that uniquely identifies the changes that you’re committing. This is useful if you want to retrace your steps or undo things in the future. 

  5. If you’ve set up SSH authentication for GitHub, then you won’t be prompted to enter your credentials. 

  6. Ignore the “Choose a theme” button; it’s for use with bare bones GitHub pages sites and the academicpages template supplies all the components for its theme. 
