Workshop 5: Open Science and Reproducibility in R


April 25, 2017
Developed by Monica Granados

Perhaps you are familiar with this workflow:

1. Collect data
2. Input data into spreadsheet software
3. Export data to statistical analysis software
4. Plot results using a third piece of software
5. Import images into a word processor
6. Export manuscript pdf for submission

And if a single value changes in step 1 or 2 - you have to repeat the entire workflow from start to finish.

But what if I told you there was a better way? What if you built an automated workflow that carried changes straight through to the manuscript? This same workflow would also allow you to collaborate across the globe and make your research transparent and available to everyone.

Open science is the movement to make research and data accessible to all. Open science can be practiced in varying degrees. Practicing in the open can involve:

Publishing your raw data (GitHub)
Using open source tools (R)
Making your analyses reproducible (R Markdown)
Publishing open access or posting a pre-print (bioRxiv)

You could engage in all of the above or only one and still contribute to a movement that is working to accelerate science and make research accessible to all. In this workshop we will build a reproducible workflow that facilitates practicing in the open. The workflow follows the outline of the workshop:

1. Learn how to add your data to a public repository on GitHub
2. Use R and R Markdown to code a reproducible manuscript
3. Knit your manuscript into a pdf to post as a pre-print on bioRxiv

GitHub is a web-based Git or version control repository and Internet hosting service. In this workshop we will learn how to upload data into GitHub so we can have R retrieve it directly from the site, simultaneously rendering your data open. We will also briefly go over some of the collaborative power of GitHub.

Setting up an account

Go to https://github.com/ and set up an account. Pick a username. Mine is Monsauce.

Create a repository in GitHub

There are two ways to interact with GitHub: directly through the web GUI and through the terminal. Let's first use the GUI option as it is more accessible. We will proceed to the terminal when we do an overview of collaborations with GitHub.

We are going to use the Doubs river fish communities data set as the data for our manuscript. You can get it from this link and save it locally to your hard drive: DoubsSpe data. The Doubs data set is a data frame of fish community data where the first column contains site names from 1 to 30 and the remaining columns are fish taxa. The taxa columns are populated by fish abundance data (counts).

Once you have an account on GitHub the first thing we want to do before we can upload data is to create a repository in your profile for the data. Let's call our new repository - QCBS Open Science Workshop.

1. Click "New Repository" and in Repository name enter "QCBS Open Science Workshop" 
2. Make your repository open and **click initialize with a README**

3. Click "Create Repository" at the bottom of the page

Upload data to GitHub

Now that you have a repository we can upload our data.

4. Click "Upload files" 
5. Click "choose your files" 
6. Select the Doubs data file on your hard drive
7. Add a comment describing the upload in the "Commit changes" box and click "Commit changes"

Your data is now publicly available and discoverable by R! In the next section we will look at writing an R Markdown document in R Studio so that the data can be pulled into the analysis in R and woven into a document.

Collaborative tools in GitHub

GitHub first and foremost is a platform for sharing and collaborating. GitHub uses Git or version control to keep track of changes to your files - this is particularly useful when you have multiple people contributing data or code. To preview the capability of Git we are going to go through the Try Git tutorial where you'll be introduced to concepts like branching, committing and merging: Let's Try Git

Once you have mastered using Git, you can see how useful it would be to have collaborators using Git and GitHub. You could each be working on code simultaneously, and version control would allow you to keep track of changes and differences between your versions and then merge the changes together, all the while making your data open through GitHub.

Committing your changes also means the newest file (with the most recent data) is available for R to read in without having to constantly change file names! If you are not comfortable with using the terminal you can always use the web GUI and upload the most updated data file by first deleting the previous version.

R Markdown

R Markdown is a feature of R Studio that allows you to combine text and code to produce a dynamic, reproducible manuscript.

Installing R Markdown in R Studio

First we need to install the R Markdown package from within R Studio.

| Install R Markdown

install.packages("rmarkdown")

Next we need to open a new R Markdown script. In the File menu in R Studio navigate to New>New File>R Markdown… You'll see that you have the opportunity to name your file. I have named mine “QCBS Open Science Workshop.” You'll also see you have the option to output the manuscript in various formats. For now choose pdf, but you have the option to change it later. R Markdown files have the .Rmd file extension.

New R Markdown file

Like R, R Markdown has its own syntax. You can get a cheat sheet for the syntax here but let's go over some of the basics we would need to create a manuscript.

You'll notice that when you created the new R Markdown file it includes some information on syntax to include R code, plots and text. You'll find this really helpful when you are starting but for now let's delete this so we can start from a blank slate.

At the top of the page you will see the name of the file, date and output format. When you make new versions of the manuscript you can change the date stamp here so that you know what date you created it.

Text in R Markdown

Text in R Markdown works very much like your usual word processor, except that instead of a toolbar for bold, underline, italics and headings you have to use syntax.

plain text
**bold**
*italics*
# Header 1
## Header 2

Let's start writing some text for our manuscript for the Doubs Fish data in our R Markdown file. We are going to write a title and the methods section for the paper. Borcard et al (2011) has the following information about the Doubs Fish data set:

In an important doctoral thesis, Verneaux proposed to use fish species to characterize ecological zones along European rivers and streams. He showed that fish communities were good biological indicators of these water bodies. The Doubs data set contains part of the data used by Verneaux for his studies. These data have been collected at 30 sites along the Doubs River, which runs near the France–Switzerland border in the Jura Mountains. The first matrix contains coded abundances of 27 fish species.

Now in your R Markdown document add a title to the manuscript (Header 1), a sub-heading for the Methods section (Header 2) and then some text on data (plain text).

Then click Knit to see how the document is rendered!
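Knitting can also be triggered from the R console, which can be handy once the workflow is scripted. The following is a minimal sketch, assuming the rmarkdown package is installed. The file name `manuscript_sketch.Rmd`, the title and the text are placeholders made up for illustration, and the output format is HTML rather than pdf so that the sketch does not depend on a LaTeX installation.

```r
# Write a tiny placeholder R Markdown file to a temporary folder.
# The file name and its contents are hypothetical, for illustration only.
rmd.path <- file.path(tempdir(), "manuscript_sketch.Rmd")
rmd.lines <- c(
  "---",
  "title: \"QCBS Open Science Workshop\"",
  "output: html_document",
  "---",
  "",
  "# Fish communities of the Doubs River",
  "",
  "## Methods",
  "",
  "Fish were sampled at 30 sites along the Doubs River."
)
writeLines(rmd.lines, rmd.path)

# Knit from the console; this does the same job as the Knit button.
# Rendering needs the rmarkdown package and pandoc, so check for both first.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out.path <- rmarkdown::render(rmd.path, quiet = TRUE)
  print(out.path)  # path of the rendered file
}
```

To knit your own manuscript this way, point `rmarkdown::render()` at your .Rmd file instead of the placeholder.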

Code chunks in R Markdown

The best feature of using R Markdown is that you can integrate R code with text. When you want to include R code you have a couple of options:

1. Run and print the code
2. Run but hide the code
3. Print but not run the code

| run and print

```{r eval=TRUE, echo=TRUE}
#run and print the code 
length(iris$Species)
```

| run but hide the code

```{r eval=TRUE, echo=FALSE}
#run but hide the code 
length(iris$Species)
```

| print but not run the code

```{r eval=FALSE,echo=TRUE}
#print but not run the code
length(iris$Species)
```

Let's write some code to load the packages we need for subsequent analyses and bring the data in from GitHub. Because we want to run this code without printing the code or its output in the document, we will use include=FALSE.

Start by making a new Results heading. Then we need to get the https address of where your file is on GitHub. Navigate to your repository, click the “doubsspe.csv” file name and then click “Raw” on the right-hand side. Then copy the https address and add it to your code as shown below.

| run but hide the code

```{r include=FALSE}
#load the packages we need (install them first with install.packages() if necessary)
library(RCurl)
library(tidyr)
library(ggplot2)

#download the raw csv from GitHub as text and read it into a data frame
doubs.URL <- getURL("https://raw.githubusercontent.com/Monsauce/QCBS-Open-Science-Workshop/master/doubsspe.csv")
doubs <- read.csv(text = doubs.URL)

#the data is in wide form so let's convert it to long
doubs.long <- gather(doubs, species, count, CHA:ANG)
```
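If the wide-to-long step is unfamiliar, it can help to try it on a tiny data frame in the console first. This is only a sketch, assuming the tidyr package is installed. The data frame `toy` and its numbers are made up for illustration, with a site column `X` and two taxa columns named like those in the Doubs data.

```r
library(tidyr)

# Toy wide data frame (made-up numbers): one row per site, one column per taxon
toy <- data.frame(X = c(1, 2), CHA = c(0, 1), TRU = c(3, 5))

# gather() stacks the taxa columns into a species column and a count column,
# repeating the site identifier X for each taxon
toy.long <- gather(toy, species, count, CHA:TRU)

print(toy.long)
```

The two sites and two taxa become four rows, one per site and taxon combination, which is the shape that ggplot2 expects.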

Inline code

Now that our data is loaded we can start our data analysis. Let's say for example that we wanted to report the total number of species in our data set. Rather than typing the value as text, you can make this number dynamic by coding it, so that it updates if the data frame changes. You can add inline code as follows:

The data set had a total richness of `r length(unique(doubs.long$species))`
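Before knitting, it can be useful to check what an inline expression evaluates to by running it in the console. A small sketch with a made-up data frame, `toy.long`, that mirrors the columns of `doubs.long` (the values are invented for illustration):

```r
# Toy long-format data (made-up numbers) with the same columns as doubs.long
toy.long <- data.frame(
  X = c(1, 2, 1, 2),
  species = c("CHA", "CHA", "TRU", "TRU"),
  count = c(0, 1, 3, 5)
)

# The same expression used in the inline code
richness <- length(unique(toy.long$species))

# Preview the sentence as it would read once knitted
sentence <- paste("The data set had a total richness of", richness)
print(sentence)
```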

Plots in R Markdown

The best part of R Markdown is that not only can you run your analyses and combine them with text, but you can also make the plots for your manuscript so they automatically update when you change your data frame. Code chunks for plots need some extra syntax, such as options for plot dimensions and for suppressing warnings and messages. Let's make a plot of the distribution of fish counts at each site.

| display plot

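```{r, fig.width=5, fig.height=6.5, echo=FALSE, message=FALSE, warning=FALSE}
#fig.width and fig.height set the figure size in inches
#message=FALSE and warning=FALSE keep console notes out of the document
#histogram of fish counts, one panel per site (column X)
Figure.1 <- ggplot(doubs.long, aes(count)) +
  geom_histogram() +
  facet_wrap(~X) +
  theme_minimal() +
  xlab("Number of individuals") +
  ylab("Number of species")

plot(Figure.1)
```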

Now add a figure caption at the bottom of your code chunk.
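An alternative to typing the caption as plain text under the chunk is the knitr chunk option `fig.cap`, which goes in the chunk header. The following is a minimal sketch, assuming ggplot2 is installed. The toy data, the object name `Figure.toy` and the caption wording are made up for illustration. For pdf output the figure number is typically added automatically, so it is left out of the caption text here.

```{r, fig.width=5, fig.height=6.5, echo=FALSE, message=FALSE, fig.cap="Distribution of fish counts at each site."}
library(ggplot2)

# Toy long-format data (made-up numbers) standing in for doubs.long
toy.long <- data.frame(
  X = c(1, 1, 1, 2, 2, 2),
  species = c("CHA", "TRU", "VAI", "CHA", "TRU", "VAI"),
  count = c(0, 3, 0, 1, 5, 4)
)

# Same plot structure as Figure.1, with an explicit number of bins
Figure.toy <- ggplot(toy.long, aes(count)) +
  geom_histogram(bins = 5) +
  facet_wrap(~X) +
  theme_minimal() +
  xlab("Number of individuals") +
  ylab("Number of species")

print(Figure.toy)
```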

R Markdown code (.Rmd) from the exercises can be found here

Once your manuscript is ready, you can post it on a pre-print server like bioRxiv so that your research is accessible to anyone, anywhere, even if you are submitting the paper to a journal.

There are many ways to engage in open science. Even by simply making your data open, publishing open access or posting a pre-print, you are having an enormous positive effect on the accessibility of science to other scientists and the general public. Imagine a world where the USSR and the USA collaborated to go to the moon, or where labs worked together, sharing data on the world's most pressing problems instead of competing to get to the finish line first. Or a world where textbooks are free and available to anyone. Isn't that a world you want to create?

We've only scratched the surface of open science practices. Mozilla (the people who make Firefox) is committed to making the web and science more open and has compiled a number of resources to help you practice in the open.

Mozilla Open Advice

Mozilla GitHub intro

Meet other open scientists

Want to dive into open science!?

Science Open: A platform for open science research
Join the OOO Canada Network!
Reproducibility workshop from fellow Mozilla Open leader Hao Ye