# Differences

This shows you the differences between two versions of the page.

r_workshop2 [2016/10/21 10:48] vincent_fugere [QCBS R Workshops]
r_workshop2 [2018/10/06 21:29] (current) mariehbrice [Workshop 2: Loading and manipulating data]
Line 11: Line 11:
  Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy
- **Summary:** In this workshop, you will learn how to load and view your data in R. You will learn basic commands to inspect and visualize your data, and learn how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key ! (or maybe two, or three…)
+ **Summary:** In this workshop, you will learn how to load, view, and manipulate your data in R. You will learn basic commands to inspect and visualize your data, and learn how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key (or maybe two, or three…)! We will then introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.
- Link to associated Prezi: [[http://prezi.com/wg4rggjfqucv/qcbs-r-workshop-2/|Prezi]]
+ **Link to new [[https://qcbsrworkshops.github.io/Workshops/workshop02/workshop02-en/workshop02-en.html|Rmarkdown presentation]]**
+
+ Link to old [[http://prezi.com/wg4rggjfqucv/qcbs-r-workshop-2/|Prezi presentation]]
  Download the R script and data for this lesson:
Line 19: Line 21:
    - [[http://qcbs.ca/wiki/_media/co2_good.csv|Dataset 1]]
    - [[http://qcbs.ca/wiki/_media/co2_broken.csv|Dataset 2]]
+
  ===== Learning Objectives =====
+   - Creating an R project
    - Writing a script
    - Loading, exploring and saving data
-   - Fixing a broken data frame
+   - Learn to manipulate data frames with tidyr, dplyr, magrittr
+
+ ===== RStudio Projects =====
+
+ What is this?
+   - Within RStudio, Projects make it easy to separate and keep your work organized.
+   - All files, scripts, documentation related to a specific project are bound together
+
+ Encourages reproducibility and easy sharing.
+
+ ===== Create a new project =====
+
+ Use the **Create project** command (available in the Projects menu and the global toolbar)
+
+ {{:0_create_a_new_project.png?400|}}
+
+ ===== Keep your files organized =====
+
+ One project = one folder
+
+ {{:0_folderdata1.png?400|}}
+
+ =====Preparing data for R=====
+
+   * Datasets should be stored as **comma separated files (.csv)** in the Data folder.
+   * Comma separated files (.csv) can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
+   * file -> save as .csv
+
+ ====Choose file names wisely====
+   * Good:
+     * rawDatasetAgo2017.csv
+     * co2_concentrations_QB.csv
+     * 01_figIntro.R
+   * Bad:
+     * final.csv //(Uninformative!)//
+     * safnnejs.csv //(Random!)//
+     * 1-4.csv //(Avoid using numbers!)//
+     * Dont.separate.names.with.dots.csv //(Can lead to reading file errors!)//
+
+ ====Choose variable names wisely====
+   * Use short informative titles (e.g. "Time_1" not "First time measurement")
+   * Good: "Measurements", "SpeciesNames", "Site"
+   * Bad: "a", "3", "supercomplicatedverylongname"
+   * Column values must match their intended use
+
+
+
+ ====Things to consider with your data====
+   * No text in numeric columns
+   * Do not include spaces!
+   * NA (not available) can be used for missing values, and blank entries in numeric columns will automatically be read as NA
+   * Name your variables informatively
+   * **Look for typos!**
+   * Avoid numeric values for data that do not have a numeric meaning (e.g. subject, replicate, treatment)
+   * For example, if subjects are "1,2,3" change to "A,B,C" or "S1,S2,S3"
+   * Use CONSISTENT formats for dates, numbers, metrics, etc.
+   * Do not include notes, additional headings, or merged cells!
+   * One variable per column!
+
+ ====Bad data examples====
+
+ {{:excel_notes.png|}}
+ {{:horribledata.png|}}
+
+ It is possible to do all your data preparation work within R.
+ This has several benefits:
+   * Saves time for large datasets
+   * Keeps original data intact
+   * Keeps track of the manipulation and transformation you did
+   * Can switch between long and wide format data very easily (more on this later and in workshop 4)
+   * For a useful resource, see [[https://www.zoology.ubc.ca/~schluter/R/data/]]
  ===== Writing a script =====
Line 31: Line 104:
  To use a script, just highlight commands and press "Run" or press command-enter (Mac) or ctrl-enter (PC).
+ ==== Creating an R script ====
+ {{:1_create_an_r_script.png?500|}}
+ {{:2_create_an_r_script2.mod_arrow.png?600|}}
  ==== Commands & Comments ====
  Use the '# symbol' to denote comments in scripts. The '# symbol' tells R to ignore anything remaining on a given line of the script when running commands.
- Since comments are ignored when running script, they allow you to leave yourself notes in your code or tell collaborators what you did. A script with comments is a good step towards reproducible science and annotating someone's script is a good way to learn.
+ Since comments are ignored when running a script, they allow you to leave yourself notes in your code or tell collaborators what you did. A script with comments is a good step towards reproducible science, and annotating someone's script is a good way to learn. Try to be as detailed as possible!
  # This is a comment, not a command
+
  ==== Header ====
Line 54: Line 131:
  ====Section Heading====
- You can use four # signs in a row to create section headings to help organize your script.
+ You can use four # signs in a row to create section headings to help organize your script. This allows you to move quickly between sections and hide sections.
  For example:
+
  #### Housekeeping ####
- RStudio displays a small arrow next to the line number where the section heading was created.  If you click on the arrow, you will hide this section of the script.
+ RStudio displays a small arrow next to the line number where the section heading was created. If you click on the arrow, you will hide this section of the script.
+
+ You can also move quickly between sections using the drop-down menu at the bottom of the script window.
+
+ {{:4_section_headings_mod_arrow_2.png?600|}}
  ==== Housekeeping ====
- It is good practice to have a command at the top of your script to clear the R memory. This will help prevent errors such as using old data that has been left in your workspace. The command ''rm(list=ls())'' will clear memory.
+ The first command at the top of all scripts should be ''rm(list=ls())''. This will clear R's memory, and will help prevent errors such as using old data that has been left in your workspace.
  rm(list=ls())  # Clears R workspace
Line 78: Line 160:
  A = "Test"  # <- or = can be used equally
- #Note that it is best practice to use "<-" for assignment instead of "="
+ # Note that it is best practice to use "<-" for assignment instead of "="
  A
Line 84: Line 166:
  A
+
  ====Important Reminders====
Line 97: Line 180:
+ ----
  =====Loading, exploring and saving data=====
- ====Working Directory====
+ ====Working directory====
  R needs to know the directory where your data and files are stored in order to load them. You can see which directory you are currently working in by using the ''getwd()'' command.
Line 110: Line 194:
  When you load a script, R automatically sets the working directory to the folder containing the script.
- To specify a path, use a "/" to separate folders, subfolders and file names.
+ To change working directories using the ''setwd()'' function, specify the working directory's path using a "/" to separate folders, subfolders and file names. You can also click Session > Set working directory > Choose directory...
- There are several ways you can set the working directory:
-   * You can simply type the full path of the directory in the parentheses of the command ''setwd()''.  For example:
-
- setwd('/Users/vincentfugere/Desktop/QCBS_R_Workshop2')  # Mac Example
- setwd('C:/Users/Johanna/Documents/PhD/R_Workshop2')  # Windows Example
- # **Note that this path will NOT work on your computer!
-
-   * You can use ''choose.dir()'' to get a pop up to navigate to the appropriate directory.
-
- setwd(choose.dir())  # Note that this may not work on a Mac.
-
-   * You can click on session / set working directory / choose directory
- ====Display The Content Of The Working Directory====
+ ====Display the content of the working directory====
  The command ''dir()'' displays the content of the working directory.
Line 138: Line 210:
  Use the ''read.csv()'' command to import data in R.
- CO2<-read.csv("CO2_good.csv") # Creates an object called CO2 by loading data from a file called "CO2_good.csv"
+ CO2 <- read.csv("co2_good.csv") # Creates an object called CO2 by loading data from a file called "co2_good.csv"
- This command specifies that you will be creating an R object named "CO2" by reading a csv file called "CO2_good.csv".  This file must be located in your current working directory.
+ This command specifies that you will be creating an R object named "CO2" by reading a csv file called "co2_good.csv".  This file must be located in your current working directory.
-
- Alternatively, you can choose the file to load interactively using the ''file.choose()'' command.
-
- CO2<-read.csv(file.choose())
-
-
- Recall, that the question mark can be used to pull up the help page for a command.
+ Recall that the question mark can be used to find out what arguments the function requires.
  ?read.csv # Use the question mark to pull up the help page for a command
Line 157: Line 223:
- CO2<-read.csv("CO2_good.csv", header = TRUE)
+ CO2 <- read.csv("co2_good.csv", header = TRUE)
- NOTE: If you have a French operating system or CSV editor, you may need to use ''read.csv2()'' instead of ''read.csv()''
+ NOTE: If your operating system or CSV editor is in French, you may need to use ''read.csv2()'' instead of ''read.csv()''
- ==== Looking at Data ====
+
+ {{:5_importing_data_mod_arrow.png?900|}}
+
+ Notice that RStudio now provides information on the CO2 data in your workspace. The workspace refers to all the objects that you create during an R session.
+
+ ==== Looking at data ====
  The CO2 dataset consists of repeated measurements of CO2 uptake from six plants from Quebec and six plants from Mississippi at several levels of ambient CO2 concentration.
  Half of the plants of each type were chilled overnight before the experiment began.
Line 168: Line 239:
  |CO2| Look at the whole data frame|
  |head(CO2)| Look at the first few rows|
+ |tail(CO2)| Look at the last few rows |
  |names(CO2)| Names of the columns in the data frame|
  |attributes(CO2) | Attributes of the data frame|
+ |dim(CO2)  | Dimensions of the data frame|
  |ncol(CO2) | Number of columns|
  |nrow(CO2) | Number of rows|
Line 185: Line 258:
- CO2<-read.csv("CO2_good.csv",header=FALSE)
+ CO2<-read.csv("co2_good.csv",header=FALSE)
  Check the ''str()'' of CO2. What is wrong here? Reload the data with header=TRUE before continuing.
- ==== Reminder from workshop 1: Accessing Data ====
+ ==== Reminder from workshop 1: Accessing data ====
  Data within a data frame can be extracted by several means. Let's consider a data frame called //mydata//. Use square brackets to extract the content of a cell.
+
+ {{:table_reminder_from_workshop_1_accessing_data.png?500|}}
+
  mydata[2,3] # extracts the content of row 2 / column 3
Line 201: Line 277:
  mydata[1,] # extracts the content of the first row
+ The square brackets can also be used recursively
+
+ mydata[,1][2] # this extracts the second element of the first column
+
  If row number is omitted, the whole column is extracted. Similarly, the ''$'' sign followed by the corresponding header can be used.
- mydata[,1] # extracts the content of the first column
+ mydata$Variable1 # extracts a specific column by its name ("Variable1")
- mydata$header # extracts the content of the column which has the corresponding header
+
- ====Data Exploration====
+ ====Renaming variables====
- It can be very useful to plot all variable combinations when you are examining your data.
+ Variable names (i.e. column names) can be changed within R.
- plot(CO2) # Plot of all variable combinations
+ # First let's make a copy of the dataset to play with!
+ CO2copy <- CO2
+ # names() gives you the names of the variables present in the data frame
+ names(CO2copy)
+
+ # Changing from English to French names (make sure you have the same levels!)
+ names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc","absortion")
- Do you want to see if one of your variables is normally distributed?  Use the ''hist()'' command.
+ ====Creating new variables====
+ New variables can be easily created and populated. For example, variables and strings can be concatenated together using the function ''paste()''.
- hist(CO2$uptake) # The $is used to extract a specific column from a data frame by name.
+ # Let's create a unique id for our samples using the function paste()
+ # see ?paste and ?paste0
+ # Don't forget to use "" for strings
+ CO2copy$uniqueID <- paste0(CO2copy$Plante,"_", CO2copy$Categorie, "_", CO2copy$Traitement)
+
+ # Observe the results
+ head(CO2copy$uniqueID)
- There are many built in functions in R that can be used to obtain information about your data.  Two commonly used functions are ''mean()'' and ''sd()''.
+ Creating new variables works for numbers and mathematical operations as well!
+
+ # Let's standardize our variable "absortion" to relative values
+ CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion) # Changing to relative values
+
+ # Observe the results
+ head(CO2copy$absortionRel)
+
+
+
+ ====Subsetting data====
+
+ There are many ways to subset a data frame.
- conc_mean<-mean(CO2$conc) # Calculate mean of the "conc" column of the "CO2" object. Save as "conc_mean"
+ # Let's keep working with our CO2copy data frame
- conc_mean # Display object "conc_mean"
+
+ ## Subsetting by variable name
+ CO2copy[,c("Plante", "absortionRel")] # Selects only "Plante" and "absortionRel" columns. (Don't forget the ","!)
+
+ ## Subsetting by row
+ CO2copy[1:50,] # Subset rows 1 to 50 of the data frame
+
+ ### Subsetting by matching with a factor level
+ CO2copy[CO2copy$Traitement == "nonchilled",] # Select observations matching only the nonchilled Traitement.
+
+ ### Subsetting according to a numeric condition
+ CO2copy[CO2copy$absortion >= 20, ] # Select observations with absortion greater than or equal to 20
+
+ ### Conditions can be complementary (the & (and) operator)
+ CO2copy[CO2copy$Traitement  == "nonchilled" & CO2copy$absortion >= 20, ]
- conc_sd<-sd(CO2$conc) # Calculate sd of "conc" column and save as "conc_sd"
+ # We are done playing with the dataset copy. Let's erase it.
- conc_sd
+ CO2copy <- NULL
+
+
+ See [[https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html|here]] for all the logical operators you can use to subset a data frame in R.
+
+ ====Data exploration====
+
+ A good way to start your data exploration is to look at some basic statistics of your dataset using the ''summary()'' function.
+
+
+ summary(CO2) # Get summary statistics of your dataset
+
+
+ You can also explore specific parts of your data frame with other functions, such as ''mean()'', ''sd()'', ''hist()'', and ''print()''.
+
+
+ # Calculate mean and standard deviation of the concentration, and assign them to new variables
+ meanConc <- mean(CO2$conc)
+ sdConc <- sd(CO2$conc)
+
+ # print() prints any given value to the R console
+ print(paste("the mean of concentration is:", meanConc))
+ print(paste("the standard deviation of concentration is:", sdConc))
+
+ # Let's plot a histogram to explore the distribution of "uptake"
+ hist(CO2$uptake)
+
+ # Increasing the number of bins to see the pattern more clearly
+ hist(CO2$uptake, breaks = 40)
Line 237: Line 383:
- To use apply, you have to specify three arguments.  The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on columns (2) or rows(1) of data; the third argument is the function you would like to apply.  For example:
+ To use apply, you have to specify three arguments. The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on rows (1) or columns (2) of your dataset; the third argument is the function you would like to apply.
+ For example:
  apply(CO2[,4:5], MARGIN = 2, FUN = mean) # Calculate mean of the two columns in the data frame that contain continuous data
+
  ====Save your workspace====
Line 248: Line 395:
- save.image(file="CO2_project_Data.RData") # Save workspace
+ save.image(file="co2_project_Data.RData") # Save workspace
  rm(list=ls())  # Clears R workspace
- load("CO2_project_Data.RData") #Reload everything that was in your workspace
+ load("co2_project_Data.RData") #Reload everything that was in your workspace
- head(CO2) # Looking good :)
+ head(CO2) # Looking good! :)
+
  ====Exporting data====
  If you want to save a data file that you have created or edited in R, you can do so using the ''write.csv()'' command.  Note that the file will be written into the current working directory.
- write.csv(CO2,file="CO2_new.csv") # Save object CO2 to a file named CO2_new.csv
+ write.csv(CO2,file="co2_new.csv") # Save object CO2 to a file named co2_new.csv
- ====Preparing data for R====
-   * When preparing files for R, you should save them as .csv files.
-   * Almost all applications (Excel, GoogleDocs, LibreOffice, etc) can save a file as a csv (comma separated values)
-   * Use short informative titles (i.e. "Time_1" not "First time measurement")
-   * Column values must match their intended use.
-   * No text in numeric columns, including spaces
-   * NA can be used for missing values
-   * Avoid numeric values for data that does not have a numeric meaning (i.e.
- subject, replicate, treatment)
-   * For example, if subjects are "1,2,3" change to "A,B,C" or "S1,S2,S3"
-   * Do not include notes, additional headings, or merged cells!
- It is possible to do all data preparation in R.  This has several benefits:
-   * Saves time for large datasets
-   * Preserves original data
-   * Can switch between long and wide format data very easily (more on this in workshop 4)
-   * For a useful resource, see [[https://www.zoology.ubc.ca/~schluter/R/data/]]
- ====Use your data====
- **Challenge**
- Try to load, explore, plot and save your own data in R.
- Does it load properly?  If not, try fixing it in R.  Save your fixed data and then try opening it in Excel.
- =====Fixing a Broken Data Frame=====
- **Harder Challenge**
+ ====Use your data CHALLENGE====
+ Try to load, explore, plot and save your own data in R.
+ Does it load properly? If not, try fixing it in R. Save your fixed data and then try opening it in Excel.
+ =====Fixing a Broken Data Frame=====
+ Data can be messy, and compatibility issues are common. For example, sharing data between a Mac and Windows, or between computers set up on different continents, can lead to weird datasets.
+ Let's practice how to solve some common errors.
+ ==== Fix a broken data frame CHALLENGE ====
- # Read a broken CO2 csv file into R and find the problems
+ # Read the co2_broken.csv file into R and find the problems
- CO2<-read.csv("CO2_broken.csv") # Overwrite CO2 object with broken CO2 data
+ CO2<-read.csv("co2_broken.csv") # Overwrite CO2 object with broken CO2 data
  head(CO2) # Looks messy
  CO2 # Indeed!
Line 300: Line 434:
    * This is probably what your data or downloaded data looks like.
-   * Fix it in R (or not)
+   * You can fix the data frame in R (or not...)
    * Give it a try before looking at the solution!
    * Work with your neighbours and have fun :)
Line 317: Line 451:
  Also remember that you can use "?" to look up help for a function (e.g. ''?str'').
- HINT: There are 4 problems!
+ **HINT: There are 4 problems!**
  Answers:
Line 331: Line 465:
  Here, "TAB" was used instead of ",".
- CO2 <- read.csv("CO2_broken.csv",sep = "")
+ CO2 <- read.csv("co2_broken.csv",sep = "")
  ?read.csv
Line 348: Line 482:
  To fix this problem, you can tell R to skip the first two rows when reading in this file.
- CO2<-read.csv("CO2_broken.csv",sep = "",skip=2)  # By adding the skip argument into the read.csv function, R knows to skip the first two rows
+ CO2<-read.csv("co2_broken.csv",sep = "",skip=2)  # By adding the skip argument into the read.csv function, R knows to skip the first two rows
  head(CO2) # You can now see that the CO2 object has the appropriate headings
Line 368: Line 502:
  <hidden>
- CO2 <- read.csv("CO2_broken.csv",sep = "",skip = 2,na.strings = c("NA","na","cannot_read_notes"))
+ ?read.csv
+
+
+ {{:read_table_help1.png|}}
+ {{:read_table_help2.png|}}
+
+
+ CO2 <- read.csv("co2_broken.csv",sep = "",skip = 2,na.strings = c("NA","na","cannot_read_notes"))
  By identifying "cannot_read_notes" as NA data, R reads these columns properly.
- Remember that NA stands for not available.
+ Remember that NA (capital!) stands for not available.
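The effect of ''na.strings'' can be checked without the workshop files. The sketch below is only an illustration: the three-row table in ''raw_text'' is made up (it is not the content of co2_broken.csv), and it is read through the ''text'' argument of ''read.csv()'' so that no file is needed.

```r
# Hypothetical miniature table (NOT the real co2_broken.csv):
# whitespace-separated, with stray text in two numeric columns
raw_text <- "Plant conc uptake
Qn1 95 16.0
Qn2 cannot_read_notes 13.6
Qn3 250 na"

# Without na.strings, the stray text prevents R from reading
# "conc" and "uptake" as numbers
messy <- read.csv(text = raw_text, sep = "")
str(messy)

# Declaring the stray strings as missing values lets R read both
# columns as numeric, with NA where the text used to be
clean <- read.csv(text = raw_text, sep = "",
                  na.strings = c("NA", "na", "cannot_read_notes"))
str(clean)
```

Comparing the two ''str()'' outputs shows which columns changed type once the stray strings were declared as missing values.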
Line 413: Line 554: + + ---- + + =====Learn to manipulate data with tidyr, dplyr, magrittr===== + + ==== Using tidyr to reshape data frames ==== + + {{:tidyrsticker.png?200|}} + + ==== Why "tidy" your data? ==== + + Tidying allows you to manipulate the structure of your data while preserving all original information. Many functions in R require (or work better with) a data structure that isn't always easily readable by people. + + In contrast to aggregation, which reduces many cells in the original data set to one cell in the new dataset, tidying preserves a one-to-one connection. Although aggregation can be done with many functions in R, the ''tidyr'' package allows you to both reshape and aggregate within a single syntax. + + Install / Load the ''tidyr'' package: + + if(!require(tidyr)){install.packages("tidyr")} + library(tidyr) + + + ==== Wide vs. long data ==== + + **Wide** format data has a separate column for each variable or each factor in your study. One row can therefore include several different observations. + + **Long** format data has a column stating the measured variable types and a column containing the values associated with those variables (each column is a variable, each row is one observation). This is considered "tidy" data because it is easily interpreted by most packages for visualization and analysis in ''R''. + + The format of your data depends on your specific needs, but some functions and packages such as ''dplyr'', ''lm()'', ''glm()'', ''gam()'' require long format data. The ''ggplot2'' package can use wide data format for some basic plotting, but more complex plots require the long format (example to come). + + Additionally, long form data can more easily be aggregated and converted back into wide form data to provide summaries, or to check the balance of sampling designs.
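As a rough illustration of that last point, here is a minimal base R sketch. The data frame `long_example` and its values are made up for this example; the only assumption is that the data are already in long format, with one row per measurement.

```r
# Hypothetical long-format data: one row per measurement
long_example <- data.frame(
  Species     = rep(c("Oak", "Elm", "Ash"), times = 2),
  Measurement = rep(c("DBH", "Height"), each = 3),
  Value       = c(12, 20, 13, 56, 85, 55)
)

# Check the balance of the design: number of rows per Species x Measurement cell
balance <- table(long_example$Species, long_example$Measurement)
balance

# Aggregate: mean value for each type of measurement
means <- aggregate(Value ~ Measurement, data = long_example, FUN = mean)
means
```

A balanced design shows the same count in every cell of `balance`; a cell with a different count points to a missing or duplicated measurement.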
+ + We can use the ''tidyr'' package to manipulate the structure of your data while preserving all original information, using the following functions: + + * 1. ''gather()'' our data (wide --> long) + * 2. ''spread()'' our data (long --> wide) + + {{:gather-spread.png?600|}} + + Let's pretend you send out your field assistant to measure the diameter at breast height (DBH) and height of three tree species for you. The result is this "wide" data set. + + + > wide <- data.frame(Species = c("Oak", "Elm", "Ash"), + DBH = c(12, 20, 13), + Height = c(56, 85, 55)) + > wide + Species DBH Height + 1     Oak  12     56 + 2     Elm  20     85 + 3     Ash  13     55 + + + ==== gather(): Making your data long ==== + + ?gather + + + Most of the packages in the Hadleyverse will require long format data where each row is an entry and each column is a variable. Let's try to "gather" this wide data using the ''gather()'' function in tidyr. ''gather()'' takes multiple columns, and gathers them into key-value pairs. + + The function takes the following arguments: + * **data**: a data frame (e.g. "wide") + * **key**: name of the new column containing variable names (e.g. "Measurement") + * **value**: name of the new column containing variable values (e.g. "Value") + * **...**: name or numeric index of the columns we wish to gather (e.g. "DBH" or "Height") + + + # Gathering columns into rows + + > long <- gather(wide, Measurement, Value, DBH, Height) + > long + Species Measurement Value + 1     Oak         DBH 12 + 2     Elm         DBH 20 + 3     Ash         DBH 13 + 4     Oak      Height 56 + 5     Elm      Height 85 + 6     Ash      Height 55 + + + Let's try this with the CO2 dataset. 
Here we might want to collapse the last two quantitative variables: + + + CO2.long <- gather(CO2, response, value, conc, uptake) + head(CO2) + head(CO2.long) + tail(CO2.long) + + + ==== spread(): Making your data wide ==== + + ''spread()'' uses the same syntax as ''gather()''. The function requires 3 arguments: + * **data**: A data frame (e.g. "long") + * **key**: Name of the column containing variable names (e.g. "Measurement") + * **value**: Name of the column containing variable values (e.g. "Value") + + + # Spreading rows into columns + > wide2 <- spread(long, Measurement, Value) + > wide2 + Species DBH Height + 1     Ash  13     55 + 2     Elm  20     85 + 3     Oak  12     56 + + + ==== separate(): Separate two (or more) variables in a single column ==== + + + Sometimes you might have really messy data that has two variables in one column. Thankfully the ''separate()'' function can (wait for it) separate the two variables into two columns. + + The ''separate()'' function splits a column by a character string separator. It requires 4 arguments: + * **data**: A data frame (e.g. "long") + * **col**: Name of the column you wish to separate + * **into**: Names of new variables to create + * **sep**: Character which indicates where to separate + + Let's create a really messy data set: + + + set.seed(8) + messy <- data.frame(id = 1:4, + trt = sample(rep(c('control', 'farm'), each = 2)), + zooplankton.T1 = runif(4), + fish.T1 = runif(4), + zooplankton.T2 = runif(4), + fish.T2 = runif(4)) + messy + + + First, we want to convert this wide dataset to long format. + + + messy.long <- gather(messy, taxa, count, -id, -trt) + head(messy.long) + + + Then, we want to split those two sampling times (T1 & T2). The syntax we use here is to tell R ''separate(data, what column, into what, by what)''. 
The tricky part here is telling R where to separate the character string in your column entry using a regular expression to describe the character that separates them. Here, the string should be separated by the period ''"."''. + + + messy.long.sep <- separate(messy.long, taxa, into = c("species", "time"), sep = "\\.") + head(messy.long.sep) + + + The argument ''sep = "\\."'' tells R to split the character string around the period (.). We cannot simply type ''"."'' because, in a regular expression, an unescaped period matches any single character. + + ====Recap: tidyr==== + + ''tidyr'' is a package that reshapes the layout of data sets. + * Convert from **wide format to long format** using ''gather()'' + * Convert from **long format to wide format** using ''spread()'' + * Split and merge columns with ''separate()'' and ''unite()'' + + Here's a cheat sheet to help you use ''tidyr'' and ''dplyr'' for more data wrangling: [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]] + + ==== tidyr CHALLENGE ==== + + //Using the ''airquality'' dataset, ''gather()'' all the columns (except Month and Day) into rows. 
Then ''spread()'' the resulting dataset to return the same data format as the original data.// + + + ?airquality + data(airquality) + + + ++++Solution| + + # Use gather() to convert the dataset to long format + air.long <- gather(airquality, variable, value, -Month, -Day) + head(air.long) + # Note that the syntax used here indicates we wish to gather ALL the columns except "Month" and "Day" + + # Then, use spread() to convert the dataset back to wide format + air.wide <- spread(air.long, variable, value) + head(air.wide) + + ++++ + + ---- + + ===== Data manipulation with dplyr ===== + + {{:dplyrsticker.png?200|}} + + ==== Intro to dplyr ==== + + The vision of the ''dplyr'' package is to simplify data manipulation by distilling all the common data manipulation tasks to a set of intuitive functions (or "verbs"). The result is a comprehensive set of tools that facilitates data manipulation, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. + + In addition to ease of use, it is also an amazing package because: + * it can crunch huge datasets wicked fast (written in ''C++'') + * it plays nice with the RStudio IDE and other packages in the Hadleyverse + * it can interface with external databases and translate your R code into SQL queries + * if Batman was an R package, he would be ''dplyr'' (mastering fear of data, adopting cool technologies) + + Certain R base functions work similarly to dplyr functions, including: ''split()'', ''subset()'', ''apply()'', ''sapply()'', ''lapply()'', ''tapply()'' and ''aggregate()'' + + Let's install and load the ''dplyr'' package: + + + if(!require(dplyr)){install.packages("dplyr")} + library(dplyr) + + + The ''dplyr'' package is built around a core set of "verbs" (or functions). 
We will start with the following 4 verbs because these operations are ubiquitous in data manipulation: + + * ''select()'': select columns from a data frame + * ''filter()'': filter rows according to defined criteria + * ''arrange()'': re-order data based on criteria (e.g. ascending, descending) + * ''mutate()'': create or transform values in a column + + ====Select a subset of columns with ''select()''==== + + {{:select.png?600|}} + + The general syntax for this function is ''select(dataframe, column1, column2, ...)''. Most ''dplyr'' functions will follow a similarly simple syntax. ''select()'' requires at least 2 arguments: + * **data**: the dataset to manipulate + * **...**: column names, positions, or complex expressions (separated by commas) + + For example: + + select(data, column1, column2) # select columns 1 and 2 + select(data, c(2:4,6)) # select columns 2 to 4 and 6 + select(data, -column1) # select all columns except column 1 + select(data, starts_with("x.")) # select all columns that start with "x." + + + Here are more examples of how to use ''select()'': + + {{:select.helper.png?400|}} + + The ''airquality'' dataset contains several columns: + + + > head(airquality) + Ozone Solar.R Wind Temp Month Day + 1    41     190  7.4   67     5   1 + 2    36     118  8.0   72     5   2 + 3    12     149 12.6   74     5   3 + 4    18     313 11.5   62     5   4 + 5    NA      NA 14.3   56     5   5 + 6    28      NA 14.9   66     5   6 + + + For example, if we are only interested in the variation of "Ozone" over time within the ''airquality'' dataset, we can select the subset of required columns for further analysis: + + + > ozone <- select(airquality, Ozone, Month, Day) + > head(ozone) + Ozone Month Day + 1    41     5   1 + 2    36     5   2 + 3    12     5   3 + 4    18     5   4 + 5    NA     5   5 + 6    28     5   6 + + + ==== Select a subset of rows with 
''​filter()''​ ==== + + A common operation in data manipulation is the extraction of a subset based on specific conditions. The general syntax for this function is ''​filter(dataframe,​ logical statement 1, logical statement 2, ...)''​. + + {{:​filter.png?​600|}} + + Remember that logical statements provide a TRUE or FALSE answer. The ''​filter()''​ function retains all the data for which the statement is TRUE. This can also be applied on characters and factors. Here is a useful reminder of how logic works in R. + {{:​logic.helper.png?​500|}} + + For example, in the ''​airquality''​ dataset, suppose we are interested in analyses that focus on the month of August during high temperature events: + + + > august <- filter(airquality,​ Month == 8, Temp >= 90) + > head(august) + Ozone Solar.R Wind Temp Month Day + 1    89     229 10.3   ​90 ​    ​8 ​  8 + 2   ​110 ​    ​207 ​ 8.0   ​90 ​    ​8 ​  9 + 3    NA     ​222 ​ 8.6   ​92 ​    ​8 ​ 10 + 4    76     ​203 ​ 9.7   ​97 ​    ​8 ​ 28 + 5   ​118 ​    ​225 ​ 2.3   ​94 ​    ​8 ​ 29 + 6    84     ​237 ​ 6.3   ​96 ​    ​8 ​ 30 + ​ + + ==== Sorting rows with ''​arrange()''​ ==== + + In data manipulation,​ we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series. ​ + + The ''​arrange()''​ function re-orders rows by one or multiple columns, using the following syntax: ''​arrange(data,​ variable1, variable2, ...)''​. ​ + + By default, rows are sorted in ascending order. Note that we can also sort in descending order by placing the target column in ''​desc()''​ inside the ''​arrange()''​ function as follows: ''​arrange(data,​ variable1, desc(variable2),​ ...)''​. 
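As a quick sketch of the ''desc()'' syntax just described, the following assumes only that ''dplyr'' is installed and uses the built-in ''airquality'' dataset. The object name `hot_first` is made up for this illustration.

```r
library(dplyr)

# Sort by month in ascending order and, within each month,
# from the hottest day to the coolest day
hot_first <- arrange(airquality, Month, desc(Temp))
head(hot_first)
```

Because `Month` comes first, `desc(Temp)` only decides the order of rows within the same month.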
+ + Example: Let's use the following code to create a scrambled version of the airquality dataset + + > air_mess <- sample_frac(airquality,​ 1) + > head(air_mess) + Ozone Solar.R Wind Temp Month Day + 21      1       ​8 ​ 9.7   ​59 ​    ​5 ​ 21 + 42     ​NA ​    259 10.9   ​93 ​    ​6 ​ 11 + 151    14     191 14.3   ​75 ​    ​9 ​ 28 + 108    22      71 10.3   ​77 ​    ​8 ​ 16 + 8      19      99 13.8   ​59 ​    ​5 ​  8 + 104    44     192 11.5   ​86 ​    ​8 ​ 12 + ​ + + Now, let's arrange the data frame back into chronological order, sorting by ''​Month'',​ and then by ''​Day'':​ + + + > air_chron <- arrange(air_mess,​ Month, Day) + > head(air_chron) + Ozone Solar.R Wind Temp Month Day + 1    41     ​190 ​ 7.4   ​67 ​    ​5 ​  1 + 2    36     ​118 ​ 8.0   ​72 ​    ​5 ​  2 + 3    12     149 12.6   ​74 ​    ​5 ​  3 + 4    18     313 11.5   ​62 ​    ​5 ​  4 + 5    NA      NA 14.3   ​56 ​    ​5 ​  5 + 6    28      NA 14.9   ​66 ​    ​5 ​  6 + ​ + + Try to see the difference when we change the order of the target columns: + + arrange(air_mess,​ Day, Month) + ​ + + ==== Create and populate columns with ''​mutate()''​ ==== + + Besides subsetting or sorting your data frame, you will often require tools to transform your existing data or generate some additional data based on existing variables. We can use the function ''​mutate()''​ to compute and add new columns in your dataset. + + The ''​mutate()''​ function follows this syntax: ''​mutate(data,​ newVar1 = expression1,​ newVar2 = expression2,​ ...)''​. + + {{:​mutate.png?​600|}} + + Let's create a new column using ''​mutate()''​. 
For example, suppose we would like to convert the temperature variable from degrees Fahrenheit to degrees Celsius: + + > airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9)) + > head(airquality_C) + Ozone Solar.R Wind Temp Month Day   Temp_C + 1    41     190  7.4   67     5   1 19.44444 + 2    36     118  8.0   72     5   2 22.22222 + 3    12     149 12.6   74     5   3 23.33333 + 4    18     313 11.5   62     5   4 16.66667 + 5    NA      NA 14.3   56     5   5 13.33333 + 6    28      NA 14.9   66     5   6 18.88889 + + + Note that the syntax here is quite simple, but within a single call of the ''mutate()'' function, we can replace existing columns, we can create multiple new columns, and each new column can be created using newly created columns within the same function call. + + ---- + + ===== dplyr and magrittr, a match made in heaven ===== + + {{:magrittrsticker.png?200|}} + + The ''magrittr'' package brings a new and exciting tool to the table: a pipe operator. Pipe operators provide ways of linking functions together so that the output of a function flows into the input of the next function in the chain. The syntax for the ''magrittr'' pipe operator is ''%>%''. The ''magrittr'' pipe operator truly unleashes the full power and potential of ''dplyr'', and we will be using it for the remainder of the workshop. First, let's install and load it: + + + if(!require(magrittr)){install.packages("magrittr")} + library(magrittr) + + + Using it is quite simple, and we will demonstrate that by combining some of the examples used above. Suppose we wanted to ''filter()'' rows to limit our analysis to the month of June, then convert the temperature variable to degrees Celsius. With the tools we have seen so far, we could nest one function call inside the other: + + + june_C <- mutate(filter(airquality, Month == 6), Temp_C = (Temp-32)*(5/9)) + + + This code can be difficult to decipher because we start on the inside and work our way out. 
As we add more operations, the resulting code becomes increasingly illegible. Instead of wrapping each function one inside the other, we can accomplish these 2 operations by linking both functions together: + + + june_C <- airquality %>% + filter(Month == 6) %>% + mutate(Temp_C = (Temp-32)*(5/9)) + + + Notice that within each function, we have removed the first argument which specifies the dataset. Instead, we specify our dataset first, then "pipe" into the next function in the chain. + + The advantages of this approach are that our code is less redundant and functions are executed in the same order we read and write them, which makes it easier and quicker to both translate our thoughts into code and read someone else's code and grasp what is being accomplished. As the complexity of your data manipulations increases, it quickly becomes apparent why this is a powerful and elegant approach to writing your ''dplyr'' code. + + **Quick tip:** In RStudio we can insert this pipe quickly using the following hotkey: ''Ctrl'' (or ''Cmd'' for Mac) +''Shift''+''M''. + + ===== dplyr - grouped operations and summaries ===== + + The ''dplyr'' verbs we have explored so far can be useful on their own, but they become especially powerful when we link them with each other using the pipe operator (''%>%'') and by applying them to groups of observations. The following functions allow us to split our data frame into distinct groups on which we can then perform operations individually, such as aggregating/summarising: + + * ''group_by()'': group data frame by a factor for downstream commands (usually summarise) + * ''summarise()'': summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. ''min()'', ''max()'', ''mean()'') + + These verbs provide the needed backbone for the Split-Apply-Combine strategy that was initially implemented in the ''plyr'' package, the predecessor of ''dplyr''. 
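To make the three steps of Split-Apply-Combine concrete, here is a minimal sketch written in base R rather than ''dplyr'', so that each step appears on its own line. It uses the built-in ''airquality'' dataset; the object names (`wind_by_month`, `mean_wind`, `wind_summary`) are made up for this illustration.

```r
# 1. Split: one vector of wind speeds per month
wind_by_month <- split(airquality$Wind, airquality$Month)

# 2. Apply: compute a summary for each piece
mean_wind <- sapply(wind_by_month, mean)

# 3. Combine: reassemble the results into a single data frame
wind_summary <- data.frame(Month = as.integer(names(mean_wind)),
                           mean_wind = unname(mean_wind))
wind_summary
```

With ''dplyr'', ''group_by()'' plays the role of the split step, while ''summarise()'' applies the function to each group and combines the results in one call.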
​ + + {{:​split-apply-combine.png?​600|}} + + Let's demonstrate the use of these with an example using the ''​airquality''​ dataset. Suppose we are interested in the mean temperature and standard deviation within each month: + + + > month_sum <- airquality %>​% ​ + group_by(Month) %>​% ​ + summarise(mean_temp = mean(Temp), + sd_temp = sd(Temp)) + > month_sum + Source: local data frame [5 x 3] + + Month mean_temp ​ sd_temp + (int)     ​(dbl) ​   (dbl) + 1     ​5 ​ 65.54839 6.854870 + 2     ​6 ​ 79.10000 6.598589 + 3     ​7 ​ 83.90323 4.315513 + 4     ​8 ​ 83.96774 6.585256 + 5     ​9 ​ 76.90000 8.355671 + ​ + + ===== dplyr & magrittr CHALLENGE ===== + + //Using the ''​ChickWeight''​ dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator.// + + + ?​ChickWeight + data(ChickWeight) + ​ + + ++++Solution| ​ + + # Use group_by() to divide the dataset by "​Chick"​ + # Use summarise() to calculate the weight gain within each group + > weight_diff <- ChickWeight %>​% ​ + group_by(Chick) %>​% ​ + summarise(weight_diff = max(weight) - min(weight)) + > weight_diff + Source: local data frame [50 x 2] + + Chick weight_diff + ​(fctr) ​      (dbl) + 1      18           4 + 2      16          16 + 3      15          27 + 4      13          55 + 5       ​9 ​         58 + 6      20          76 + 7      10          83 + 8       ​8 ​         92 + 9      17         100 + 10     ​19 ​        114 + ..    ...         ... + ​ + + Note that we are only calculating the difference between max and min weight. This doesn'​t necessarily correspond to the difference in mass between the beginning and the end of the trials. 
Closely inspect the data for chick # 18 to understand why this is the case: + + + > chick_18 <- ChickWeight %>% filter(Chick == 18) + > chick_18 + weight Time Chick Diet + 1     ​39 ​   0    18    1 + 2     ​35 ​   2    18    1 + ​ + + Here we notice that chick 18 has in fact lost weight (and probably died during the trial). From a scientific perspective,​ perhaps a more interesting question is which of the 4 diets results in the greatest weight gain in chicks. We could calculate this using 2 more useful ''​dplyr''​ functions: ''​first()''​ and ''​last()''​ allow us to access the (need I say respectively) first and last observation within a group. ​ + ++++ + ---- + + ===== Ninja hint ===== + + Note that we can group the data frame using more than one factor, using the general syntax as follows: ''​group_by(group1,​ group2, ...)''​ + + Within ''​group_by()'',​ the multiple groups create a layered onion, and each subsequent single use of the ''​summarise()''​ function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on ''​group2'',​ the resulting data set would remain grouped by ''​group1''​ for downstream operations. + + ===== dplyr & magrittr NINJA CHALLENGE ===== + + //Using the ''​ChickWeight''​ dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator. 
(Hint: ''first()'' and ''last()'' may be useful here.)// + + ++++Solution| + + > diet_summ <- ChickWeight %>% + group_by(Diet, Chick) %>% + summarise(weight_gain = last(weight) - first(weight)) %>% + group_by(Diet) %>% + summarise(mean_gain = mean(weight_gain)) + > diet_summ + # A tibble: 4 × 2 + Diet mean_gain + <fctr>     <dbl> + 1      1     114.9 + 2      2     174.0 + 3      3     229.5 + 4      4     188.3 + + + + Given that the solution to the last challenge requires that we compute several operations in sequence, it provides a nice example to demonstrate why the syntax implemented by ''dplyr'' and ''magrittr'' is so useful. An additional challenge (if you are well versed in base ''R'' functions) would be to reproduce the same operations using fewer keystrokes. We tried, and failed... Perhaps we are too accustomed to ''dplyr'' now. + ++++ + ---- + + ===== dplyr - Merging data frames ===== + + In addition to all the operations we have explored, ''dplyr'' also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other ''R'' packages: + + * ''left_join()'' + * ''right_join()'' + * ''inner_join()'' + * ''anti_join()'' + + These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs. + + =====More resources on data manipulation===== + * [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf|The RStudio Data Wrangling Cheat Sheet]] + * [[http://r4ds.had.co.nz/transform.html|Learn more about ''dplyr'']] + * [[http://seananderson.ca/2014/09/13/dplyr-intro.html|Sean Anderson's Intro to dplyr and pipes]] + * [[https://rpubs.com/bradleyboehmke/data_wrangling|Bradley Boehmke's Intro to data wrangling]] + + + +
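For readers who want a first taste of the four join functions listed under "Merging data frames", here is a minimal sketch. It assumes ''dplyr'' is installed; the two small tables (`trees` and `traits`) and their values are made up for this illustration and share the key column `Species`.

```r
library(dplyr)

# Two small hypothetical tables sharing the key column "Species"
trees <- data.frame(Species = c("Oak", "Elm", "Ash"),
                    DBH = c(12, 20, 13),
                    stringsAsFactors = FALSE)
traits <- data.frame(Species = c("Oak", "Elm", "Pine"),
                     Leaf = c("lobed", "serrated", "needle"),
                     stringsAsFactors = FALSE)

# Keep every row of trees; Leaf is NA where traits has no match (Ash)
left <- left_join(trees, traits, by = "Species")

# Keep every row of traits; DBH is NA where trees has no match (Pine)
right <- right_join(trees, traits, by = "Species")

# Keep only the species present in both tables (Oak and Elm)
inner <- inner_join(trees, traits, by = "Species")

# Keep the rows of trees that have no match in traits (Ash)
anti <- anti_join(trees, traits, by = "Species")

left
right
inner
anti
```

The functions differ only in which unmatched rows they keep, so comparing the four outputs side by side is a quick way to decide which one a given task needs.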