rrmetab: Reproducible Research in R

orientation internal support tutorial

Introduction to the rrmetab R package, which helps set up a fully reproducible research workflow in FL Metab.

Peter-Paul Pichler https://www.pik-potsdam.de/members/pichler (Social Metabolism & Impacts, Potsdam Institute for Climate Impact Research) https://www.pik-potsdam.de
04-19-2021

The rrmetab package

The rrmetab package supports a completely computationally reproducible workflow in R for quantitative research project development in our group, from data cleaning to paper publication. It builds on a number of existing packages that achieve different aspects of reproducibility and combines them into a standard workflow that is simple and works for most projects.

Computational reproducibility

Computational reproducibility means that we want to maximize the chances that someone (including future you) can re-run your analysis from raw data to published paper and obtain the same results. The following steps ensure computational reproducibility.

  1. Code and data: Exact documentation and publication of all data manipulation steps during analysis. We try to go one step further and aim for “one-click” reproducibility, which means that we want to be able to go from raw data to finished analysis with one command. For this, it is important to avoid any manual data manipulation (e.g. in Excel). In our group we use literate programming with R and R Markdown for analysis and documentation and GitLab for publication.

  2. R environment: For someone to be able to run your code, they typically need a lot more than just your code and data. They often need to know the R version and the packages you have used. Since R packages can change over time they need to know which exact version of a package you have used. There are various ways to document this, but the rrmetab package uses renv to record all R dependencies.

  3. Computational environment: Finally, some code behaviour might depend on the operating system or on other aspects of the computational environment outside of R (for example, which compiler was used). The packages or package versions you have used may also no longer be available in the future. Ideally, reproducible code is therefore published together with the complete computational environment in which it was run. The rrmetab package relies on Docker and the GitLab container registry to achieve this.
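As a rough illustration of the second step, the sketch below collects the kind of information that has to be pinned down for an R environment: the R version, the platform, and the exact package versions. The helper name `describe_environment` is hypothetical and not part of rrmetab or renv; renv records the real thing for you in its lockfile.

```r
# Illustration only: gather the details that a lockfile has to pin down.
# `describe_environment` is a hypothetical helper, not an rrmetab or renv function.
describe_environment <- function(packages) {
  versions <- vapply(
    packages,
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  list(
    r_version = as.character(getRversion()),
    platform  = R.version$platform,
    packages  = versions
  )
}

# Base packages are used here so the example runs on any R installation.
env <- describe_environment(c("stats", "utils"))
str(env)
```

Writing this down by hand quickly becomes tedious and error-prone for a real project, which is the reason for delegating it to renv.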

Quality assurance

Besides reproducibility, the two most important aspects of scientific software development are arguably correctness and understandability (unclear in what order). You obviously aim for your code to be correct, and it should also be intelligible to others (and especially to future you). The rrmetab package builds on the rrtools package for project organization, which in turn derives its structure from R package development (e.g. devtools and usethis). This means that all tools that support testing and documentation are available and that the folder structure is familiar.

Testing

Unit tests are commonly used to help improve code correctness. They are a fairly basic testing concept, but since the software we develop in everyday research is mostly used as intended, unit tests can already catch a lot of unexpected behaviour and bugs. The rrmetab package recommends the testthat package for unit testing.
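A minimal sketch of what such a unit test could look like with testthat. The function `tonnes_to_kilotonnes` is a made-up example, not something rrmetab provides, and the sketch assumes that testthat is installed.

```r
library(testthat)

# Hypothetical analysis helper: convert masses from tonnes to kilotonnes.
tonnes_to_kilotonnes <- function(x) {
  stopifnot(is.numeric(x))
  x / 1000
}

# One test groups a few related expectations about the helper.
test_that("tonnes_to_kilotonnes converts values and rejects non-numeric input", {
  expect_equal(tonnes_to_kilotonnes(2500), 2.5)
  expect_equal(tonnes_to_kilotonnes(c(0, 1000)), c(0, 1))
  expect_error(tonnes_to_kilotonnes("2500"))
})
```

In a compendium, tests like this would normally live in files under tests/testthat/, following the usual R package convention, so that they can be run together.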

Documentation & Organization

Even correct and reproducible software code is nearly useless to anyone (very much to future you) if it is very hard to understand. There are a lot of moving parts which make a software project hard to keep track of if you do not follow some common practices. The first helpful practice is to organize data, code, paper, etc in a useful and familiar way and the second is to document your code. The research compendium structure (in rrtools) extends the R package folder structure to achieve this. This also helps with code documentation as you can now use the full code documentation infrastructure (e.g. roxygen2 and devtools) of R package development.
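To give an idea of what roxygen2-style documentation looks like, here is a small sketch. The function `per_capita` is invented for illustration; the comment block above it uses standard roxygen2 tags, and the code itself runs in plain R without roxygen2 being installed.

```r
#' Express a total on a per-capita basis
#'
#' @param total Numeric vector of totals (for example, emissions in tonnes).
#' @param population Numeric vector of population counts, of the same length
#'   as `total` or of length one.
#' @return Numeric vector of per-capita values.
#' @examples
#' per_capita(total = c(100, 250), population = c(50, 100))
#' @export
per_capita <- function(total, population) {
  stopifnot(is.numeric(total), is.numeric(population), all(population > 0))
  total / population
}

per_capita(total = c(100, 250), population = c(50, 100))
# [1] 2.0 2.5
```

When such a function is placed in the R/ folder of the compendium, the comment block can be turned into a help page with the usual package development tools.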

Preconditions

The FL Metab reproducible research workflow assumes you have the following tools installed:

  1. RStudio and R

  2. Git (and a GitLab account)

  3. Docker

How to use rrmetab?

Getting started

Install the rrmetab package.

remotes::install_gitlab(repo="pichler/rrmetab", 
                        host="https://gitlab.pik-potsdam.de", 
                        auth_token = "wF_HNvUpjpmb_3TyYGv9")

Next, create an empty repository on GitLab. Then create a new RStudio project based on this repository: navigate to File -> New Project, select Version Control and then Git, and enter the SSH URL of your repository (it should look like git@gitlab.pik-potsdam.de:yourname/yourrepository.git).

After this, simply call the new_project() function from the rrmetab package.

rrmetab::new_project()

Calling this function without any parameters runs the default setup which should work for most projects. It does a number of things for you automatically:

  1. It calls rrtools::use_compendium(), which creates the R package structure using the name of the active project as the compendium name. It will only ask whether you want to overwrite the existing R project file, which you should answer with yes.

  2. It will add an MIT license with PIK and you as copyright holder (Name and Email guessed from your git credentials). You can check if they are correct in the LICENSE.md file.

  3. It will create a skeleton README.Rmd file by calling rrtools::use_readme_rmd(). This file needs to be edited later to include all the relevant information for your publication.

  4. It will create the folder structure for your analysis (by calling rrtools::use_analysis()), which is explained below, and place a paper.Rmd in the paper folder. This is a template that renders to a Word file. If you want to use a different template, simply delete the paper folder and add any template, for example from the rticles package.

  5. It will create a custom Dockerfile based on your current configuration (R version and renv version) that pulls your git repository and creates a container image that includes your current R environment (all packages you are using in the project).

  6. It initializes renv by calling renv::init().

Compendium folder structure

Taken from rrtools.

    analysis/
    |
    ├── paper/
    │   ├── paper.Rmd       # this is the main document to edit
    │   └── references.bib  # this contains the reference list information
    |
    ├── figures/            # location of the figures produced by the Rmd
    |
    ├── data/
    │   ├── raw_data/       # data obtained from elsewhere
    │   └── derived_data/   # data generated during the analysis
    |
    └── templates
        ├── journal-of-archaeological-science.csl
        |                   # this sets the style of citations & reference list
        ├── template.docx   # used to style the output of the paper.Rmd
        └── template.Rmd
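If you want to check that a project follows this layout, a few lines of base R are enough. The sketch below is only an illustration: `missing_compendium_dirs` is a hypothetical helper, not an rrmetab or rrtools function, and the folder names are taken from the tree above.

```r
# Sub-folders of analysis/ taken from the layout shown above.
expected_dirs <- c(
  "paper",
  "figures",
  file.path("data", "raw_data"),
  file.path("data", "derived_data"),
  "templates"
)

# Hypothetical helper: return the expected folders that do not exist yet.
missing_compendium_dirs <- function(project_root = ".") {
  paths <- file.path(project_root, "analysis", expected_dirs)
  expected_dirs[!dir.exists(paths)]
}

# Demonstration in a throwaway directory that only contains analysis/paper.
demo_root <- file.path(tempdir(), "demo-compendium")
dir.create(file.path(demo_root, "analysis", "paper"),
           recursive = TRUE, showWarnings = FALSE)
missing <- missing_compendium_dirs(demo_root)
print(missing)
```

Called without arguments from the project root, the helper reports which of the expected folders are absent in the current project.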

Package installation

Compared with a regular R project, the main thing that changes is the preferred way to install the packages you use. To create a reproducible environment, renv gives each project its own package library. The advantage of isolating the project library is that different projects can rely on different versions of the same package: project A might need dplyr 0.8.5 while project B requires dplyr 1.0.0. This is no longer a problem, but it also means that the project library is empty when you start a project, so packages that are already installed on your computer are not visible to it. The best way to install a package is to call

renv::install("tidyverse")

This first searches your local library for an installed version of the tidyverse and, if it finds one, simply copies it to your project, which takes almost no time. If you use the standard install.packages() (or click on Packages and Install in RStudio), the package is downloaded from CRAN and possibly compiled, which can take a long time.

renv writes an renv.lock file to your project root which records all packages used in the project (no matter how you installed them). You can update this file manually by calling renv::snapshot(). If you want to save your environment, you can simply store this lockfile somewhere and restore your library at any time by calling renv::restore(). For example, the project library is not pushed to git, so when you pull someone’s project for the first time (or when they have installed new packages), renv will tell you that your local library is out of sync and that you need to call renv::restore() to install all required packages.
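The restore step after pulling a project can be sketched as follows. The wrapper `restore_if_lockfile` is a hypothetical convenience function, not part of rrmetab or renv; it only calls renv::restore() when a lockfile is actually present and assumes that renv is installed in that case.

```r
# Hypothetical wrapper: restore the project library only if a lockfile exists.
restore_if_lockfile <- function(project = ".") {
  lockfile <- file.path(project, "renv.lock")
  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", normalizePath(project))
    return(invisible(FALSE))
  }
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("The renv package is needed to restore the project library.")
  }
  # Install the package versions recorded in renv.lock without asking.
  renv::restore(project = project, prompt = FALSE)
  invisible(TRUE)
}

# Typical use from the project root after `git pull`:
# restore_if_lockfile()
```

After installing or updating packages yourself, the counterpart is renv::snapshot(), which writes the new state back to renv.lock so that it can be committed.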

Publishing a compendium

These steps are required before you publish your research compendium (typically upon acceptance of the paper):

  1. Fill out all the information in README.Rmd

  2. Check whether your project depends on any system libraries and, if so, include them in your Dockerfile (this might be automated in the future). If you want to test your Dockerfile on a private repository, you will need to deploy a GitLab access token that allows read access to your repository. If not, remove the token from the pull line (TODO: clean this up).

  3. Create a Docker image and push it to the GitLab container registry

Docker image

This assumes you have Docker installed and (on Linux) have followed the steps to manage Docker as a non-root user. If you haven’t done the latter, you always have to run docker commands with sudo on Linux.

To create a docker image from your repository and your Dockerfile, open a terminal, navigate to where your Dockerfile is and run:

docker build -t "myimagename:1.0.0" .

where you choose a name for your image and optionally add a version (here 1.0.0). If everything goes right (which it probably won’t on the first try), you can see your image by typing:

docker images

There are various ways to run a docker container from an image. In our case, the image contains RStudio which we can run and access via a browser. So we start up a container by running:

docker run -e PASSWORD=greatpwd --rm -p 8787:8787 myimagename:1.0.0

RStudio now requires that you set a password (which you choose). The --rm tells docker to remove the container when it is stopped, and -p sets the port where you can reach RStudio from a browser. So after this command, open a browser and go to localhost:8787 which should get you to a login page where you use user rstudio and the password you have set. After that click on File -> Open Project, open your project and knit your paper as you normally would.

GitLab container registry

In principle, the Dockerfile is enough to make your project fully reproducible. However, people will still have to create the image which takes time and relies on all of the packages and sources still being available in the future. To make extra sure, it is even better to create the image and share the image online. To do this, your repository needs to be public.

We can now do this at PIK directly in the GitLab container registry. To push your image to your repository, first log in by typing:

docker login gitlab.pik-potsdam.de:5050

Then build the image using this command:

docker build -t gitlab.pik-potsdam.de:5050/pichler/myrepositoryname .

Once it has built successfully, push the image like this:

docker push gitlab.pik-potsdam.de:5050/pichler/myrepositoryname

Running a container from the GitLab container registry

Now everybody can run this container simply by calling:

docker run -e PASSWORD=whatever --rm -p 8787:8787 gitlab.pik-potsdam.de:5050/pichler/myrepositoryname
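The build, push and run commands all use the same image reference, made up of the registry host, your namespace and the repository name. The sketch below assembles that reference in R and prints the three commands, which can help to avoid typos. `registry_image` is a hypothetical helper, and appending a version tag is an assumption that goes beyond the commands above, which rely on the default tag.

```r
# Hypothetical helper: build the image reference used by docker build/push/run.
# The default registry host is the one used in the commands above.
registry_image <- function(namespace, repository, tag = "latest",
                           registry = "gitlab.pik-potsdam.de:5050") {
  stopifnot(nzchar(namespace), nzchar(repository), nzchar(tag))
  sprintf("%s/%s/%s:%s", registry, namespace, repository, tag)
}

image <- registry_image("pichler", "myrepositoryname", tag = "1.0.0")

# Print the commands instead of running them, so nothing is pushed by accident.
cat(
  sprintf("docker build -t %s .", image),
  sprintf("docker push %s", image),
  sprintf("docker run -e PASSWORD=<yourpassword> --rm -p 8787:8787 %s", image),
  sep = "\n"
)
```

The printed lines can be pasted into a terminal after replacing the password placeholder.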

(TODO: this info should be added to README.Rmd)

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Pichler (2021, April 19). FL Metab methods blog: rrmetab: Reproducible Research in R. Retrieved from http://www.pik-potsdam.de/~pichler/metab/blog/posts/2021-04-19-rrmetab-reproducible-research-in-r/

BibTeX citation

@misc{pichler2021rrmetab:,
  author = {Pichler, Peter-Paul},
  title = {FL Metab methods blog: rrmetab: Reproducible Research in R},
  url = {http://www.pik-potsdam.de/~pichler/metab/blog/posts/2021-04-19-rrmetab-reproducible-research-in-r/},
  year = {2021}
}