Andy Beet | Research Scientist and Analyst | https://andybeet.com

Air: Formatting R code
2025-07-15 | https://andybeet.com/posts/2025/07/air_formatting

Have you been thinking that you should probably start using a style guide? No, I don’t mean a fashion guru! I mean following a guide that keeps code formatted consistently across all functions and scripts: consistent white space, consistent indentation in conditional statements and for loops. Code formatted exactly the same way ALL the time?

In R, this is pretty easy to do using Air, a tool created by the Posit folks. Described as “an extremely fast R formatter”, it will format your code every time you hit save. It is best used in conjunction with the tidyverse style guide.

As of the time of writing, auto formatting is ONLY applied to .r and .R files. R chunks inside R Markdown files (.Rmd) are not formatted.

Note: Air is a layout formatter, “ensuring that whitespace, newlines, and other punctuation conform to a set of rules and standards”

Reformatting existing code

Using this tool you can reformat ALL existing code in a repo. From the terminal, simply type the command air format . (note the trailing dot). Or, if you’d prefer, you can open every R file and hit save. I know which one I’ll choose!

Reformatting all future code

The easiest way to keep the repo formatted consistently is to include an air.toml file in the root directory of your repo. The file can remain empty, or you can add the specific formatting settings you’d like for your repo. If left empty, Air will apply its default settings, which is often a good option.
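For reference, a minimal air.toml might look like the following sketch. The option names shown here (line-width, indent-style, indent-width) are the commonly documented ones, but check the Air documentation for the current set:

```toml
# air.toml at the repo root; an empty file also works (defaults apply)
[format]
line-width = 80        # wrap lines longer than this
indent-style = "space" # use spaces rather than tabs
indent-width = 2       # number of spaces per indent level
```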

The benefit of using this air.toml file is that anyone who contributes to your repo will automatically have their contributed code formatted as dictated by the toml file, rather than by any user-level settings: a no-worry solution for consistent code formatting. The only caveat is that contributors will need to have Air installed too, but with air.toml in the repo, and a pull request template indicating that Air formatting is required, this should be painless.

Reformatting code using GitHub Actions

If you want to take things to the next level, you can create a GitHub Action to run on all pull requests. The action will check for formatting inconsistencies and fail if formatting is required. If this is of interest, see the Air documentation.
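As a sketch of what such a workflow might look like (the posit-dev/setup-air action name and the --check flag are taken from my reading of the Air docs; verify both before relying on them):

```yaml
# .github/workflows/format-check.yaml
name: format-check
on: [pull_request]
jobs:
  format-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumed action for installing Air; check the Air docs for the current name
      - uses: posit-dev/setup-air@v1
      # Fails the job if any file would be reformatted
      - run: air format . --check
```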

Now all you need to worry about is coding styles, a subject we’ll talk about at a later date.

Using GitHub templates
2025-06-15 | https://andybeet.com/posts/2025/06/github_templates

Maintaining multiple GitHub repos is rewarding until you’re stuck deciphering a flood of disorganized issues and PRs. Eventually, the lack of standard formatting becomes a major bottleneck for any productive workflow. You’d like to see a consistent format for all issues and pull requests! Templates are what you’ve been looking for!

Issue and Pull Request (PR) templates act as a structured “handshake” between a project maintainer and a contributor. Without them, communication often becomes a messy back-and-forth of “Which version are you using?” or “What does this code actually do?” etc.

Here are a few reasons why they are great for collaborative teams:

Thoughtful submission

Templates act as a checklist that forces contributors to think through their submission.

For Issues: Instead of a report saying “It’s broken,” a template requires a Minimal Reproducible Example (Reprex), environment details (like your R version or OS), and clear steps to recreate the bug.

For Pull Requests: It ensures the author explains the “Why” behind a change, not just the “What.”

Reduces “Review Fatigue”

Reviewing code is mentally taxing. Templates reduce this “cognitive load” by:

Standardizing Layout: When every PR looks the same, reviewers know exactly where to find the testing instructions, the link to the Jira/GitHub issue, and the “breaking changes” warning.

Informative for new contributors

For someone new to your project, the “New Issue” button can be intimidating.

Templates provide scaffolding. They tell the user exactly what information is valued, making them feel more confident that their contribution will be accepted.

It serves as a subtle way to enforce Contribution Guidelines without making someone read a 2,000-word CONTRIBUTING.md file first.

Better historical records

Six months from now, when you’re wondering why a specific line of code was changed, a well-filled PR template provides the context that a single-line commit message often misses. It captures the intent and the testing process used at the time.

How to implement templates

In GitHub, templates can be written in either markdown or YAML files and are saved in standard locations:

  • .github/ISSUE_TEMPLATE - for issue templates
  • .github/PULL_REQUEST_TEMPLATE - for PR templates

You can include as many issue templates as you like. For example, you could include templates for:

  • Bug Reporting
  • Feature Requests
  • Data Issues
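To illustrate, here is a minimal bug report written using GitHub’s issue forms YAML schema; the labels and prompts are just placeholders to adapt to your project:

```yaml
# .github/ISSUE_TEMPLATE/bug_report.yaml
name: Bug report
description: Report something that is broken
labels: [bug]
body:
  - type: textarea
    attributes:
      label: Minimal reproducible example
      description: Code that reproduces the bug (a reprex if possible)
    validations:
      required: true
  - type: input
    attributes:
      label: R version and operating system
    validations:
      required: true
```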

Caveat: Pull requests

There is a caveat worth mentioning with pull request templates. If you’d like a default template to appear EVERY time you open a pull request, you must have a default template named pull_request_template.md residing in the .github folder.
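For example, a bare-bones default template might look like this (the section headings are just suggestions):

```markdown
<!-- .github/pull_request_template.md -->
## What does this PR do?

## Why is it needed?

## Related issue(s)
Closes #

## How was it tested?
```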

If you want more than this default template, you will need to add additional templates to the PULL_REQUEST_TEMPLATE folder. GitHub will NOT prompt you to choose one; if you have a default, that will be used. And, as of the date of publication, GitHub does not provide a drop-down to select a pull request template. To access the templates in the PULL_REQUEST_TEMPLATE folder you need to append a snippet of text to the URL: &template=your_template_name.md.

Examples

There are plenty of examples from across the web to draw inspiration from.

And if you want to go one step further … you can create repository templates. These can be set up to contain all of the above templates, a CONTRIBUTING file, CODE_OF_CONDUCT, LICENSE, auto formatting etc. and whenever a repo is created all of these files are bundled with the creation! Pretty cool!

In summary, templates are easy to create, human readable, easy to implement, and have been proven to reduce hair loss!

Enjoy!

Pulling data from ERDDAP™
2025-05-15 | https://andybeet.com/posts/2025/05/accessing_erddap

Many publicly available scientific datasets from around the world can be found on servers supported by ERDDAP™ (the Environmental Research Division’s Data Access Program). Many organizations, like government agencies and universities, run their own ERDDAP™ servers and host oceanographic and environmental datasets.

The data are served up in a consistent way, allowing users to visualize and download data in various standard data formats, like netCDF (nc), csv, txt, JSON.

While it is perfectly reasonable to interact with these ERDDAP™ servers through a web interface, and often preferable initially to identify data sources, read metadata, and visualize the data, pulling and working with the data is often best accomplished using a language like Python or R. Since I mostly use R for my work, we’ll go that route.

For R users there is a wonderful package called rerddap designed to help search, connect to, and download data from ERDDAP™ servers. Let’s go through an example to demonstrate how to get started. The steps involved are:

  • List all servers (rerddap::servers())
  • List all datasets on a specific server (rerddap::ed_datasets())
  • Select and explore dataset fields (rerddap::info())
  • Pull data (rerddap::tabledap() or rerddap::griddap())

List servers

First, let’s see the list of available servers using the servers() function:

rerddap::servers()
#> # A tibble: 63 × 4
#>    name                                                  short_name url   public
#>    <chr>                                                 <chr>      <chr> <lgl> 
#>  1 Voice of the Ocean                                    VOTO       http… TRUE  
#>  2 St. Lawrence Global Observatory - CIOOS | Observatoi… SLGO-OGSL  http… TRUE  
#>  3 CoastWatch West Coast Node                            CSWC       http… TRUE  
#>  4 ERDDAP at the Asia-Pacific Data-Research Center       APDRC      http… TRUE  
#>  5 NOAA's National Centers for Environmental Informatio… NCEI       http… TRUE  
#>  6 Biological and Chemical Oceanography Data Management… BCODMO     http… TRUE  
#>  7 European Marine Observation and Data Network (EMODne… EMODnet    http… TRUE  
#>  8 European Marine Observation and Data Network (EMODne… EMODnet P… http… TRUE  
#>  9 Marine Institute - Ireland                            MII        http… TRUE  
#> 10 CoastWatch Caribbean/Gulf of Mexico Node              CSCGOM     http… TRUE  
#> # ℹ 53 more rows

Created on 2025-12-08 with reprex v2.1.1

List datasets

The “CoastWatch West Coast Node” server looks interesting; let’s explore that. We’ll need to grab the url field to explore the datasets hosted on this server.

servers <- rerddap::servers()
servers |> 
  dplyr::filter(short_name == "CSWC") |> 
  dplyr::select(url)
#> # A tibble: 1 × 1
#>   url                                     
#>   <chr>                                   
#> 1 https://coastwatch.pfeg.noaa.gov/erddap/

Created on 2025-12-08 with reprex v2.1.1

Now you can either copy this URL into your browser and explore the contents of the server from there, or you can use rerddap to list the datasets. It is worth mentioning that there are often two types of data hosted on these servers:

  • Gridded data, termed griddap, in NetCDF (nc) format
  • Tabular data, termed tabledap, often in csv format

Let’s search for all of the tabular data on this server.

servers <- rerddap::servers()
url <- servers |> 
  dplyr::filter(short_name == "CSWC") |> 
  dplyr::pull(url)
  
rerddap::ed_datasets(which = "tabledap", url = url)
#> # A tibble: 293 × 17
#>    griddap Subset     tabledap Make.A.Graph wms   files Accessible Title Summary
#>    <chr>   <chr>      <chr>    <chr>        <chr> <chr> <chr>      <chr> <chr>  
#>  1 ""      https://c… https:/… https://coa… ""    ""    public     * Th… "This …
#>  2 ""      https://c… https:/… https://coa… ""    "htt… public     Audi… "Audio…
#>  3 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Hydro…
#>  4 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Sampl…
#>  5 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Cruis…
#>  6 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Fish …
#>  7 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Egg m…
#>  8 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Fish …
#>  9 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Size …
#> 10 ""      https://c… https:/… https://coa… ""    "htt… public     CalC… "Devel…
#> # ℹ 283 more rows
#> # ℹ 8 more variables: FGDC <chr>, ISO.19115 <chr>, Info <chr>,
#> #   Background.Info <chr>, RSS <chr>, Email <chr>, Institution <chr>,
#> #   Dataset.ID <chr>

Created on 2025-12-08 with reprex v2.1.1

We can see that there are 293 datasets available, each with its own url field (tabledap), an ID (Dataset.ID), and other accompanying metadata. As mentioned earlier, it may be easier to look at these from within a web browser. But for the sake of this example let’s just look at the Summary metadata.

servers <- rerddap::servers()
url <- servers |> 
  dplyr::filter(short_name == "CSWC") |> 
  dplyr::pull(url)
datasets <- rerddap::ed_datasets(which = "tabledap", url = url)
datasets |> 
  dplyr::select(Dataset.ID,Summary)
#> # A tibble: 293 × 2
#>    Dataset.ID           Summary                                                 
#>    <chr>                <chr>                                                   
#>  1 allDatasets          "This dataset is a table which has a row of information…
#>  2 testTableWav         "Audio data from a local source.\n\ncdm_data_type = Oth…
#>  3 erdCalCOFINOAAhydros "Hydrographic data collected by CTD as part of CalCOFI …
#>  4 erdCalCOFIcufes      "Samples collected using the Continuous Underway Fish-E…
#>  5 erdCalCOFIcruises    "Cruises using one or more ships conducted as part of t…
#>  6 erdCalCOFIeggcnt     "Fish egg counts and standardized counts for eggs captu…
#>  7 erdCalCOFIeggstg     "Egg morphological developmental stage for eggs of sele…
#>  8 erdCalCOFIlrvcnt     "Fish larvae counts and standardized counts for eggs ca…
#>  9 erdCalCOFIlrvsiz     "Size data for selected larval fish captured in CalCOFI…
#> 10 erdCalCOFIlrvstg     "Developmental stages (yolk sac, preflexion, flexion, p…
#> # ℹ 283 more rows

Created on 2025-12-08 with reprex v2.1.1

Explore dataset

Now, from experience I know there is a dataset hosted here containing data from the National Data Buoy Center (NDBC). This dataset has Dataset.ID = “cwwcNDBCMet”.

datasets |> 
  dplyr::filter(grepl("NDBC",Summary)) |> 
  dplyr::pull(Dataset.ID)
  
#> [1] "cwwcNDBCMet"

Once we have this ID, we are ready to explore and pull the data. Let’s grab all of the information relating to this dataset using the function info():

data_info <- rerddap::info(datasetid = "cwwcNDBCMet", url = "https://coastwatch.pfeg.noaa.gov/erddap/")

data_info is a list object containing metadata: the URL of the server, the variable names, variable types, and the data range of each variable. This list object is what you’ll pass to the function tabledap() to pull the data.

Pull the dataset

Even though we have identified the dataset we want, we don’t really know how much data there is! If you tried to pull all of the data at once, there is a good chance your computer would crash! So, in preparation, we can explore the set of available variables using a couple of methods that involve the data_info list object:

  • rerddap::browse(data_info) - will open a webpage on ERDDAP™ describing the data set

  • data_info$variables will list the variables available and data_info$alldata[[variableName]] will show more detailed information about each variable.

erddap_url <- 'https://coastwatch.pfeg.noaa.gov/erddap/'
datasetid <- "cwwcNDBCMet"
data_info <- suppressMessages(rerddap::info(datasetid, url = erddap_url))
data_info$variables
#>    variable_name data_type           actual_range
#> 1            apd     float              0.0, 95.0
#> 2           atmp     float           -153.4, 50.0
#> 3            bar     float          800.7, 1198.8
#> 4           dewp     float            -99.9, 48.7
#> 5            dpd     float              0.0, 64.0
#> 6            gst     float              0.0, 75.5
#> 7       latitude     float          -55.0, 71.758
#> 8      longitude     float       -177.75, 179.001
#> 9            mwd     short                 0, 359
#> 10          ptdy     float            -13.1, 14.9
#> 11       station    String                       
#> 12          tide     float            -9.37, 6.15
#> 13          time    double 4910400.0, 1.7697828E9
#> 14           vis     float              0.0, 66.7
#> 15            wd     short                 0, 359
#> 16          wspd     float              0.0, 96.0
#> 17          wspu     float            -98.7, 97.5
#> 18          wspv     float            -98.7, 97.5
#> 19          wtmp     float            -98.7, 50.0
#> 20          wvht     float             0.0, 92.39

Created on 2026-01-30 with reprex v2.1.1

What jumps out here is the station variable and the latitude and longitude variables. We’ll now use these to pull the list of stations available

erddap_url <- 'https://coastwatch.pfeg.noaa.gov/erddap/'
datasetid <- "cwwcNDBCMet"
data_info <- suppressMessages(rerddap::info(datasetid, url = erddap_url))

rerddap::tabledap(
  data_info,
  fields = c("station", "longitude", "latitude"),
  distinct = TRUE
)
#> info() output passed to x; setting base url to: https://coastwatch.pfeg.noaa.gov/erddap
#> <ERDDAP tabledap> cwwcNDBCMet
#> # A tibble: 1,329 × 3
#>    station longitude latitude
#>    <chr>       <dbl>    <dbl>
#>  1 0Y2W3       -87.3    44.8 
#>  2 18CI3       -86.9    41.7 
#>  3 20CM4       -86.5    42.1 
#>  4 23020        38.5    22.2 
#>  5 31201       -48.1   -27.7 
#>  6 32012       -85.4   -19.6 
#>  7 32301      -105.     -9.9 
#>  8 32302       -85.1   -18   
#>  9 32487       -77.7     3.52
#> 10 32488       -77.5     6.26
#> # ℹ 1,319 more rows

Created on 2026-01-30 with reprex v2.1.1

Now that you have identified the set of buoy stations available, you can either pull data for individual stations or for a collection of stations within a geographic region. Either task requires the rerddap::tabledap() function. For example:

Get data from a single station

Select the station(s) of interest, then get the data. In this example we’ll pull all of the data associated with buoy 32012.

erddap_url <- 'https://coastwatch.pfeg.noaa.gov/erddap/'
datasetid <- "cwwcNDBCMet"
data_info <- suppressMessages(rerddap::info(datasetid, url = erddap_url))

variables <- data_info$variables$variable_name

rerddap::tabledap(
  datasetid,
  fields = variables,
  'station="32012"'
)
#> <ERDDAP tabledap> cwwcNDBCMet
#>    File size:    [9.37 mb]
#> # A tibble: 84,918 × 20
#>      apd  atmp   bar  dewp   dpd   gst latitude longitude   mwd  ptdy station
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl> <int> <dbl>   <int>
#>  1  6.88   NaN   NaN   NaN  13.8   NaN    -19.6     -85.4   215   NaN   32012
#>  2  7.01   NaN   NaN   NaN  13.8   NaN    -19.6     -85.4   253   NaN   32012
#>  3  6.8    NaN   NaN   NaN  11.4   NaN    -19.6     -85.4   202   NaN   32012
#>  4  7.31   NaN   NaN   NaN  11.4   NaN    -19.6     -85.4   200   NaN   32012
#>  5  7.32   NaN   NaN   NaN  10.8   NaN    -19.6     -85.4   190   NaN   32012
#>  6  7.09   NaN   NaN   NaN  11.4   NaN    -19.6     -85.4   204   NaN   32012
#>  7  7.68   NaN   NaN   NaN  10.8   NaN    -19.6     -85.4   207   NaN   32012
#>  8  7.07   NaN   NaN   NaN  12.9   NaN    -19.6     -85.4   235   NaN   32012
#>  9  7.09   NaN   NaN   NaN  11.4   NaN    -19.6     -85.4   219   NaN   32012
#> 10  6.94   NaN   NaN   NaN  10     NaN    -19.6     -85.4   201   NaN   32012
#> # ℹ 84,908 more rows
#> # ℹ 9 more variables: tide <dbl>, time <dttm>, vis <dbl>, wd <int>, wspd <dbl>,
#> #   wspu <dbl>, wspv <dbl>, wtmp <dbl>, wvht <dbl>

Created on 2026-01-30 with reprex v2.1.1

Now you can work with this data in R for whatever purpose you’d like.

Get data by geographic region

If you want to simply pull all of the buoys in a specific region, you can do this too. ERDDAP™ has a few server-side functions that let you narrow your search. For example, let’s pull all stations within a region along the Northeast USA seaboard (around Cape Cod, MA), between latitudes [41.6, 41.8] and longitudes [-70.5, -69.5].

erddap_url <- 'https://coastwatch.pfeg.noaa.gov/erddap/'
datasetid <- "cwwcNDBCMet"
data_info <- suppressMessages(rerddap::info(datasetid, url = erddap_url))

variables <- data_info$variables$variable_name

rerddap::tabledap(
  datasetid,
  fields = variables,
  'latitude>=41.6','latitude<=41.8','longitude<=-69.5','longitude>=-70.5'
)
#> <ERDDAP tabledap> cwwcNDBCMet
#>    Path: [C:\Users\ANDREW~1.BEE\AppData\Local\Temp\RtmpSQAjbB\R\rerddap\c9f6726fc0a4f89d39c0c92913fabd90.csv]
#>    Last updated: [2026-01-30 11:00:00.736641]
#>    File size:    [54.57 mb]
#> # A tibble: 501,476 × 20
#>      apd  atmp   bar  dewp   dpd   gst latitude longitude   mwd  ptdy station
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl> <int> <dbl> <chr>  
#>  1   NaN   4.9  998.   NaN   NaN   9.8     41.7     -70.0    NA   NaN CHTM3  
#>  2   NaN   4.9  998.   NaN   NaN  11.9     41.7     -70.0    NA   NaN CHTM3  
#>  3   NaN   4.9  998    NaN   NaN  11.3     41.7     -70.0    NA   NaN CHTM3  
#>  4   NaN   4.9  998    NaN   NaN  10.8     41.7     -70.0    NA   NaN CHTM3  
#>  5   NaN   5    998.   NaN   NaN  10       41.7     -70.0    NA   NaN CHTM3  
#>  6   NaN   5    998.   NaN   NaN  11.4     41.7     -70.0    NA   NaN CHTM3  
#>  7   NaN   5.1  998.   NaN   NaN   9.9     41.7     -70.0    NA   NaN CHTM3  
#>  8   NaN   5    998.   NaN   NaN  11.6     41.7     -70.0    NA   NaN CHTM3  
#>  9   NaN   5    998.   NaN   NaN   8.9     41.7     -70.0    NA   NaN CHTM3  
#> 10   NaN   4.9  998.   NaN   NaN  10.6     41.7     -70.0    NA   NaN CHTM3  
#> # ℹ 501,466 more rows
#> # ℹ 9 more variables: tide <dbl>, time <dttm>, vis <dbl>, wd <int>, wspd <dbl>,
#> #   wspu <dbl>, wspv <dbl>, wtmp <dbl>, wvht <dbl>

Created on 2026-01-30 with reprex v2.1.1

Within this narrow geographic region there is only one station, CHTM3, which is a buoy off Chatham, MA.

Summary

Getting the data from ERDDAP™ into R requires a little work. You can use a combination of the R package rerddap and the ERDDAP™ website to help identify the data you are interested in.

Fortunately, for the buoy station data used in this example there is an R package called buoydata that helps identify which buoy stations are available and provides tools to pull station data hassle free.

Enjoy exploring the masses of data on ERDDAP™

Go get an ORCID iD!
2025-03-15 | https://andybeet.com/posts/2025/03/orcid_ids

What is an ORCID iD? Should you get one? What can you do with it? If you are a researcher or a scholar, it might be worth signing up for one. Here’s why …

What is an ORCID?

An Open Researcher and Contributor ID (ORCID) is a 16-digit, persistent, name-independent unique identifier. From the ORCID site: “founded specifically to help solve the problem of name ambiguity in research and to enable transparent and trustworthy connections between researchers, their contributions, and their affiliations.”

So even if you change names or affiliations, your ORCID remains the same and “follows” you through your career. Read more about ORCID on the official site.

And … It is free to obtain

What’s it used for?

Many publishers and institutions use ORCID to seamlessly share information between data systems, so the next time you submit an article for publication, expect to see a field asking for it! Outside of publishing, you can use it in your R package development to link your software development skills to your scientific publishing.

Simply add it to the DESCRIPTION file of your R package and, if you use pkgdown (which you should :index_pointing_at_the_viewer:), it will propagate to the online documentation. For example, adding this snippet (of course, you’ll have your own 16-digit code when you sign up!)

Authors@R: 
    person(given = "Andy",family = "Beet", role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0001-8270-7090"))

… results in pkgdown displaying a hyperlinked ORCID icon in two locations: the home page and the citation page.

For more info on the benefits of using pkgdown to showcase your R package, you might find the post on Enhancing R packages useful.

So go on, get yourself an ORCID iD!!

Enhancing an R package
2025-02-15 | https://andybeet.com/posts/2025/02/enhancing_rpackages

So you want to create an R package? Well, there are a ton of resources available to help with this, the best being the official book from the Posit folks, R Packages (2e). However, after the initial setup, organization, and implementation of the package, there are several other components you should think about including to make it function, look, and feel polished!

You can see ALL of the components mentioned in this post implemented in the R package stocksmart

We’ll start with some cosmetic enhancements.

Add a Hex

You’ve seen these. Many sites have these cool-looking hexes; the Posit team has one for every R package in the tidyverse! The good news is that they are easy to create and implement.

The hexSticker package is a good starting point for creating a hex! Simply save your design as logo.png, add it to your man/figures folder, and link to it from your README.md with an image tag after the name of your package. Something like this:

# package_name <img src="proxy.php?url=man/figures/logo.png" align="right" width="120" />

Create a pkgdown site

If you’d like others to use your package, then having a nice looking website with documentation is essential. A step made very easy using a combination of the pkgdown and the usethis packages.

  • pkgdown will create the website locally from a single function call, build_site(). The default layout is pretty good right out of the box, but you can customize it a little if you’d like.
  • usethis will create a GitHub Action (with the functions use_github_action() or use_pkgdown_github_pages()) to redeploy your website every time you make changes to the code or documentation.
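In practice, the setup boils down to two calls from the package root, sketched below. Both functions are real exports of the packages just mentioned, but they modify your repository, so treat this as a sketch rather than something to paste blindly:

```r
# Build and preview the pkgdown site locally
pkgdown::build_site()

# One-time setup: adds a GitHub Action that rebuilds and deploys
# the site to GitHub Pages whenever you push to the default branch
usethis::use_pkgdown_github_pages()
```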

Not only will a website make your package more user friendly, it will highlight your documentation and show you where you need to focus more attention. A previous post on GitHub actions for R packages explains this in more detail.

Add user guides

Adding a helpful user guide, or several articles introducing users to your package, can enhance its adoption by others. They are also pretty easy to create! If you know R Markdown or Quarto, it’s a breeze. Just save your .Rmd or .qmd files in the vignettes folder of your package and pkgdown will take care of integrating them into your site.

Custom issue templates

Now, I’m sure your package is awesome!! But some users may want additional features that you didn’t anticipate, or maybe they found a bug in your code! I know, very unlikely, right? :rofl: The standard issue templates provided by GitHub work just fine as a reporting tool, but to avoid a lot of back and forth with a user of your package you can customize templates to ask for exactly the information you need to either troubleshoot a bug or add a feature.

These custom templates are written in YAML and saved in the folder .github/ISSUE_TEMPLATE.

Contributing guidelines and code of conduct

Adding a guide with instructions on how users can contribute to your package is worthwhile if you want to avoid potential future headaches! Although this does depend on whether people actually read the guidelines, which often they don’t! This begs the question: why bother? Well, the answer is that you can point people to the document when needed without having to waste time explaining the process each time. The same applies to a code of conduct document, outlining what you expect from contributors regarding behaviour and the type of environment you want to foster.

Best practice is to add these two documents as markdown (.md) files in the root of your repository, named CODE_OF_CONDUCT.md and CONTRIBUTING.md. A great template to get you started can be found at @jessesquires.

Oh, once again pkgdown will take care of integrating them into your site automatically!

Enhancements to make your package more robust

R-CMD checks: Continuous integration

One of the most painful debugging experiences you might encounter is being told that your package won’t install on someone’s OS! What do you do? You work on a Windows machine and a macOS user just informed you they are having issues installing your package! The answer is to use GitHub Actions. As with the pkgdown example above, you can very easily create a workflow that, for every pushed commit or pull request, runs a series of checks (analogous to those required when submitting to CRAN) to tell you, among other things, whether your package installs without issues on a variety of operating systems!

Again, the usethis package has a function, use_github_action("check-standard"), that will create the appropriate workflow YAML for you. This workflow is analogous to the devtools::check() function you can run locally, which checks your package using “all known best practices”.

Unit tests

Probably one of the least-utilized practices in package deployment! Not because unit tests are hard to implement, but because it can be hard to think about what kinds of tests you need or should implement.

Well, the Posit folks help with the implementation part of unit testing via the testthat package. This package has tools to set up the file structure and supports many different types of tests. And if you’ve implemented the R-CMD continuous-integration workflow (described above), then all of the tests you create will run when you push a commit or make a pull request!
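To give a flavour, a test file in tests/testthat/ might look like the following sketch. Here celsius_to_fahrenheit() is a hypothetical package function, used purely for illustration:

```r
# tests/testthat/test-conversions.R
test_that("celsius_to_fahrenheit() converts correctly", {
  # Known reference points of the conversion
  expect_equal(celsius_to_fahrenheit(0), 32)
  expect_equal(celsius_to_fahrenheit(100), 212)
})

test_that("celsius_to_fahrenheit() rejects non-numeric input", {
  expect_error(celsius_to_fahrenheit("warm"))
})
```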

These tests are pretty important, if thought about carefully, since they should catch many bugs BEFORE you release your package. And if you add new features or change some of the functionality, the tests should help determine whether the outputs are still reasonable and whether you have introduced unexpected bugs.

Releases

Versioning your package should be taken very seriously. Reproducibility is so important in today’s age of rapidly changing software. If you don’t version your software, your future self and others will find it extremely difficult to reproduce old code. In addition, versioning should be used to communicate changes, like bug fixes or new features, to users and other developers.

For details regarding when and how to version your package, see versioning R packages.

Summary

You can see ALL of the components mentioned in this post implemented in the R package stocksmart

Versioning R packages
2025-01-15 | https://andybeet.com/posts/2025/01/versioning_rpackages

You’ve reached the point at which you want to officially start versioning your R package. How do you go about doing this? When should you version? How often? What versioning scheme should you use? How do you document the changes?

All great questions! :rofl: Let’s talk this through …

So, in principle, every push of content to the main branch of your R package repository should trigger a versioning/release event. This statement assumes that you are following best practices and have adopted a branching strategy. This earlier post on selecting a branching strategy is worth a read if you are unsure what this means.

Under the branching strategy I like to employ, called feature branching, all new features and bug fixes reside on their own branch until completion. At that point they are pulled into the development branch, often named dev, via a pull request, at which point multiple workflows are run to check various aspects of the code, typically R-CMD checks. Check out the post on enhancing your R packages for more info on this.

If all checks pass, then the development branch is ready to be pulled into the main branch. If they fail, you’ll need to address the issues and fix them. Better to do it now than have a user submit an issue at a later date.

When all checks pass, you need to update the package version in the DESCRIPTION file and update the NEWS.md file to summarize all relevant changes to the package, whether a feature enhancement, a bug fix, or something else.
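As an illustration, a NEWS.md entry might look like this (the package name, version numbers, and changes are invented for the example):

```markdown
# mypackage 1.3.0

* Added `summarize_stations()` for quick station-level summaries (#27)

# mypackage 1.2.1

* Fixed a crash when the `fields` argument was empty (#25)
```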

After you merge the pull request from dev -> main, you immediately release the package using the release feature on GitHub. The release description should use the information in the NEWS.md file, and the release version should match the version number in the DESCRIPTION file. The target commit associated with the release should be the latest commit on the main branch.

So this covers the how, when, and what. But what about the questions of how often a package should be released, and which versioning scheme should be used?

The versioning scheme that is considered best practice is semantic versioning. An earlier post, Semantic versioning, goes into more detail on how this relates to R package development.

And with regard to how often you should release, well, that depends on several things: how frequently bugs are found and addressed, and how quickly you want to add new features. Of course, these do not need to be versioned as independent events. You can bundle new features and bug fixes into the same release of your package. You just need to adjust the version number to reflect these changes and document them in the NEWS.md file, often termed the changelog.

Summary

Under this workflow, at any point in time the main branch of the repository should represent a working released version of your package. If someone came across your repository and installed your package, it should be expected to work just fine! All developmental work, whether new features or bug fixes, should reside on its own branch and only be merged into the main branch when fully tested and ready for release! This way the main branch remains “clean” of issues.

Hope this helps!

Andy Beet
Semantic versioning
2024-12-01
https://andybeet.com/posts/2024/12/semantic_vesioning

The official site for semantic versioning is available in many languages, indicating, in part, how important this subject is for software development worldwide. It provides a clear, standardized way to communicate the nature of changes between releases, which helps manage dependencies and predict the impact of updates. By using a version format like MAJOR.MINOR.PATCH, developers indicate whether an update is a backwards-incompatible breaking change (MAJOR), a new backwards-compatible feature (MINOR), or a backwards-compatible bug fix (PATCH).

In the development of R packages it is no different. Let’s dive a little deeper with some examples. Following semantic versioning, we adopt the format MAJOR.MINOR.PATCH, for example v3.5.1. The interpretation of these three components can be a little confusing, especially in the context of R package development.

  • Changes in the PATCH component refer to small changes, like bug fixes. It is assumed that these changes do NOT break any existing code. They are “backward compatible”. Nothing changes in the package except that an error has been corrected.
  • Changes in the MINOR component refer to changes such as new features. These new features integrate with the package, they complement other features already present, and do NOT break existing code.
  • Changes in the MAJOR component refer to changes that would break existing code. For example, if a user was using a particular function in your package, and you changed the code in that function such that it returned a different output structure, the user’s code would break. This would be “backward incompatible”.

Examples

Suppose a package is currently at version 3.5.1

  • If one or more bugs were fixed (PATCH), the next release would be version 3.5.2
  • If a new feature was added (MINOR), the next release would be version 3.6.0. (Note, the PATCH is reset to zero. It would not be 3.6.1)
  • If a new breaking change was added (MAJOR), the next release would be version 4.0.0. (Note, both MINOR and PATCH are reset to zero)
  • If several bugs were fixed (PATCH) and a couple of new features added (MINOR), the next release would be version 3.6.0. (Note, default to the higher order change, MINOR)
  • If several bugs were fixed (PATCH), a couple of new features added (MINOR), and a restructure of the package (MAJOR), the next release would be version 4.0.0. (Note, default to the higher order change, MAJOR)
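The bump rules above can be expressed as a small helper function. This is purely illustrative (bump_version() is not part of any package; in practice a tool like usethis::use_version() does this for you):

```r
# Illustrative helper: bump a MAJOR.MINOR.PATCH version string,
# given the highest-order change in the release
bump_version <- function(version, change = c("patch", "minor", "major")) {
  change <- match.arg(change)
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3)
  if (change == "patch") {
    parts[3] <- parts[3] + 1
  } else if (change == "minor") {
    parts[2] <- parts[2] + 1
    parts[3] <- 0                 # PATCH resets to zero
  } else {
    parts[1] <- parts[1] + 1
    parts[2] <- 0                 # MINOR and PATCH reset to zero
    parts[3] <- 0
  }
  paste(parts, collapse = ".")
}

bump_version("3.5.1", "patch") # "3.5.2"
bump_version("3.5.1", "minor") # "3.6.0"
bump_version("3.5.1", "major") # "4.0.0"
```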

I think you get it! Enough said.

Automate versioning for R data packages
2024-11-01
https://andybeet.com/posts/2024/11/automate_versioning

Do you find yourself downloading data from a server over and over again to make sure you have the latest data? Then struggle to work out what’s changed since the last time you downloaded it?

If so, why not create an R package to version and document the data changes? That way you’ll always be able to go back to an older version, and you’ll be able to track what has changed through time.

You can do all of this using a GitHub action. Within the workflow, the steps you’ll need include:

  • A function to connect to the server or API and download the data on a defined schedule
  • A function to compare the new data pull with the current data in the package
  • An optional step to email yourself a report explaining the differences, if any (using a parameterized Rmd)
  • An optional step to update the README.md file to inform the user of the date of the most recent data pull
  • A function to update the version number in the DESCRIPTION file
  • A function to document the data changes in the NEWS.md file
  • Commit all changes to the repo
  • Create a new GitHub release using the notes added to the NEWS.md
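The scheduling and commit pieces of such a workflow can be sketched as a GitHub Actions file. This is a minimal, illustrative skeleton: the update_data.R script, which would do the downloading, comparing, versioning, and documenting, is a hypothetical name, not a real file.

```yaml
# .github/workflows/update-data.yml -- illustrative sketch
name: Update data
on:
  schedule:
    - cron: "0 17 * * 3"   # every Wednesday at 17:00 UTC (noon EST)
  workflow_dispatch:        # also allow manual runs
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Pull, compare, version, and document the data
        run: Rscript update_data.R
      - name: Commit and push any changes
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git diff --staged --quiet || git commit -m "Automated data update"
          git push
```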

This seems like a lot, and it might take a while to get things set up and working correctly, but the benefits are huge!

Example: Stocksmart package

As a real example, consider the R data package stocksmart. This is an R data package that serves up data from all federally managed fish stocks in the USA. Our group relies on this data to update annual ecosystem reports. Having the data automatically pulled, versioned, and documented saves us a lot of time.

The specifics of this workflow are:

  • A cron job to update this data set, running on a schedule every Wednesday at noon EST.
  • Following semantic versioning, these data changes are considered patch fixes and hence the version number will increment by 0.0.1 each time the data is updated.
  • A report is emailed to me after the workflow has finished running, outlining the changes seen in the data. This is achieved using the Send mail GitHub action.
  • The README.Rmd and README.md are updated to add the date of the most recent pull
  • The NEWS.md and the DESCRIPTION file are updated to outline changes
  • All files are committed to the repo
  • If changes are found, a new release is created using the usethis package, specifically the use_github_release() function.

You will need to add secrets to your repo to allow the email action and the GitHub release function to behave as expected. Once set up, this workflow is a maintenance-light way to always have up-to-date data to work with in R, not only for you, but for the R community at large!

Happy coding!

Error handling with `tryCatch()`
2024-10-01
https://andybeet.com/posts/2024/10/Error_handling

Error handling is not something scientists typically incorporate into their code. However, there are situations in which doing so can save you hours of headache!

Some examples:

  • You want to run a lot of model simulations/fitting in sequence. If one model iteration causes an error, then your code stops without finishing.
  • You are pulling a large amount of data from a server to analyse. Connection to the server is interrupted and your workflow stops.
  • Generally, you have a series of steps in a workflow. If one of the steps fails, the whole workflow fails.

Ideally, you want to be able to handle these situations within your code, adapt to the error, and continue.

In the examples above, solutions might be:

  • If a model causes an error, write the issue to a file, then simulate a replacement.
  • If the server connection is interrupted and throws an error, try to reconnect before continuing.

Using the tryCatch() function, bundled in base R, is a great option.
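For instance, here is a minimal, self-contained sketch of the first scenario above: fitting a model to several datasets, where one dataset would normally halt the loop with an error. The data are made up for illustration.

```r
set.seed(1)

# One good dataset and one that will make lm() error (no rows)
datasets <- list(
  good = data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10)),
  bad  = data.frame(x = numeric(0), y = numeric(0))
)

# Wrap the model fit in tryCatch() so a failing fit records NA
# (and a message) instead of stopping the whole loop
fit_safely <- function(d) {
  tryCatch(
    coef(lm(y ~ x, data = d))[["x"]],
    error = function(e) {
      message("fit failed: ", conditionMessage(e))
      NA_real_
    }
  )
}

slopes <- vapply(datasets, fit_safely, numeric(1))
slopes # slope for "good" is close to 2; "bad" is NA
```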

Example 1: Database connection

This example attempts to access an Oracle server to pull data. Custom messages are returned based on the type of warning thrown.

chan <- tryCatch(
  {
    driver <- ROracle::Oracle()
    ROracle::dbConnect(driver, dbname = server, username = uid, password = pwd)
  },
  warning = function(w) {
    msg <- conditionMessage(w)
    if (grepl("login denied", msg)) {
      message("Login to server failed - check username and password")
    }
    if (grepl("locked", msg)) {
      message("Login to server failed - account may be locked")
    }
    message(paste0("Cannot connect to database: ", server))
    return(NA)
  },
  error = function(e) {
    message(paste0("Terminal error: ", conditionMessage(e)))
    return(NA)
  }
)

Example 2: Download a file from a url

Attempt to download a file from an online location. If it’s missing, or can’t be downloaded for some reason, the code skips the file.

# Attempt to get the file; catch errors (e.g. missing file) and warnings
result <- tryCatch(
  {
    downloader::download(fpath, destfile = destfile, quiet = TRUE, mode = "wb")
    TRUE
  },
  error = function(e) {
    message(paste0("No data for ", afname))
    FALSE
  },
  warning = function(w) {
    FALSE
  }
)

# Skip this file and move on to the next iteration of the loop
if (!result) {
  next
}

In summary, error handling can save you a lot of frustration from code breaking prematurely!

Add release tags in git
2024-08-01
https://andybeet.com/posts/2024/08/add_release_tags

Scenario: You have a repository that you’ve been using for a while. You use it for your scientific work and have published papers based on it. However, you never created releases of the repo corresponding to each publication, and now you have no way of reproducing the work used in your publications. What can you do?

Well, you can add release tags to any “old” commit, provided you can identify the commits! This isn’t a recommended practice, nor a substitute for following best practices going forward, but in a pinch it can help out “the younger, inexperienced you of years gone by”.

Turns out it is pretty simple too!

First, identify the commit hash. Second, decide on the tag name you want to assign to this commit. Then, run the git tag command.

For example, if your hash is 1e4b567712d785bb972665a2edd9401a17d9875d and you want to tag it with the name v1.3.1, then you’d run the following lines of code in the terminal:

git tag v1.3.1 1e4b567712d785bb972665a2edd9401a17d9875d
git push origin v1.3.1

If you want to annotate the release, storing the tagger’s name, the date and time, and a message, you can use the following:

git tag -a v1.3.1 1e4b567712d785bb972665a2edd9401a17d9875d -m "Version 1.3.1 Release"
git push origin v1.3.1

Repeat this as many times as you’d like.
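Putting it together, here is a self-contained sketch in a throwaway repository (the commit messages, tag name, and identities are all made up) showing how to find an old commit’s hash from the log and tag it retroactively:

```shell
set -e
# Create a throwaway repo with two commits
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name demo && git config user.email demo@example.com
git commit -q --allow-empty -m "analysis used in 2022 paper"
git commit -q --allow-empty -m "later, unrelated work"

# Identify the hash of the commit you forgot to tag
hash=$(git log --format="%H %s" | grep "2022 paper" | cut -d" " -f1)

# Tag it retroactively with an annotated tag, then confirm
git tag -a v1.0.0 "$hash" -m "Version used in 2022 paper"
git describe --tags "$hash"
```

In a real repo you would finish with `git push origin v1.0.0`, as shown above.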

In GitHub you should now be able to create a release using this tag!

Make sure you add good release notes in the description field so you know why you tagged this particular point in time as a release point!
