__MAJOR UPDATE__ As of Fall 2021, this wiki has been discontinued and is no longer being actively developed. All updated materials and announcements for the QCBS R Workshop Series are now housed on the [[https://r.qcbs.ca/workshops/r-workshop-02/|QCBS R Workshop website]]. Please update your bookmarks accordingly to avoid outdated material and/or broken links. Thank you for your understanding, Your QCBS R Workshop Coordinators. ====== QCBS R Workshops ====== [[http://qcbs.ca/|{{:logo_text.png?nolink&500|}}]] This series of [[r|10 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community. //The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.// ====== Workshop 2: Loading and manipulating data ====== Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy **Summary:** In this workshop, you will learn how to load, view, and manipulate your data in R. You will learn basic commands to inspect and visualize your data, and how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key (or maybe two, or three…)! We have included an advanced users section introducing tidyr and dplyr, two powerful tools for managing and re-formatting your dataset, as well as for applying simple or complex functions to subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages. **Link to new [[https://qcbsrworkshops.github.io/workshop02/workshop02-en/workshop02-en.html|Rmarkdown presentation]]** Download the R script and data for this lesson: - [[http://qcbs.ca/wiki/_media/script_workshop02-en.r|Script]] - [[http://qcbs.ca/wiki/_media/co2_good.csv|Dataset 1]] - [[https://raw.githubusercontent.com/QCBSRworkshops/workshop02/dev/workshop02-en/data/co2_broken.csv|Dataset 2]] (//After following this link, right-click on the page to save the file as .csv//). ===== Learning Objectives ===== 1. Creating an R project 2. Writing a script 3. Loading, exploring and saving data 4. (For advanced users) Learning to manipulate data frames with tidyr, dplyr and magrittr ===== RStudio Projects ===== What is this? - Projects make it easy to keep your work organized. - All files, scripts, and documentation related to a specific project are bound together with a .Rproj file - Encourages reproducibility and easy sharing ===== Create a new project ===== Use the **Create project** command (available in the Projects menu and the global toolbar) {{:0_create_a_new_project.png?400|}} ===== Keep your files organized ===== One project = one folder Place similar files inside their own folders Keep track of versions {{:0_folderdata1.png?400|}} =====Preparing data for R===== * Datasets should be stored as **comma-separated files (.csv)** in your Data folder.
* Comma-separated files (.csv) can be created in almost all applications (Excel, LibreOffice, Google Docs) * File -> Save as .csv ====Naming files==== * Good: * rawDatasetAgo2017.csv * co2_concentrations_QB.csv * 01_figIntro.R * Bad: * final.csv //(Uninformative!)// * safnnejs.csv //(Random!)// * 1-4.csv //(Avoid using numbers!)// * Dont.separate.names.with.dots.csv //(Can lead to reading file errors!)// ====Naming variables==== * Use short, informative titles (e.g. "Time_1" rather than "First time measurement") * Good: "Measurements", "SpeciesNames", "Site" * Bad: "a", "3", "supercomplicatedverylongname" * Make sure column values match their intended use ====Common data preparation mistakes==== * No text in numeric columns * Do not include spaces! * NA (not available) can be used for missing values, and blank entries will automatically be replaced with NA * Name your variables informatively * **Look for typos!** * Avoid numeric values for data that do not have a numeric meaning (e.g. subject, replicate, treatment) * For example, if subjects are "1,2,3" change to "A,B,C" or "S1,S2,S3" * Use CONSISTENT formats for dates, numbers, metrics, etc. * Do not include notes, additional headings, or merged cells! * One variable per column! ====Bad data examples==== {{:excel_notes.png|}} It is possible to do all your data preparation work within R. This has several benefits: * Saves time for large datasets * Keeps the original data intact * Keeps track of the manipulations and transformations you did * Can switch between long- and wide-format data very easily (more on this later and in workshop 4) * For a useful resource, see [[https://www.zoology.ubc.ca/~schluter/R/data/]] ===== Writing a script ===== An R script is a text file that contains all of the commands you will use. Once written and saved, your R script will allow you to make changes and re-run analyses with little effort. To use a script, just highlight commands and press "Run", or press Cmd+Enter (Mac) or Ctrl+Enter (PC). ==== Creating an R script ==== {{:1_create_an_r_script.png?500|}} {{:2_create_an_r_script2.mod_arrow.png?600|}} ==== Commands & Comments ==== Use the ''#'' symbol to denote comments in scripts. The ''#'' symbol tells R to ignore anything remaining on a given line of the script when running commands. Since comments are ignored when running a script, they allow you to leave yourself notes in your code or tell collaborators what you did. A script with comments is a good step towards reproducible science, and annotating someone else's script is a good way to learn. Try to be as detailed as possible! # This is a comment, not a command ==== Header ==== It is recommended that you use comments to put a header at the beginning of your script with essential information: project name, author, date, version of R ## QCBS R Workshop ## ## Workshop 2 - Loading and manipulating data ## Author: Quebec Center for Biodiversity Science ## Date: Fall 2014 ## R version 2.15.0 ====Section Heading==== You can use four # signs in a row to create section headings to help organize your script, for example: #### Housekeeping #### This allows you to move quickly between sections and to hide sections. RStudio displays a small arrow next to the line number where the section heading was created. Clicking on the arrow hides this section of the script. You can also move quickly between sections using the drop-down menu at the bottom of the script window.
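Putting these pieces together, here is a minimal sketch of what the top of a script could look like (the author, date, and file name are placeholders to adapt to your own project):

# A minimal script skeleton (illustrative placeholders)
## QCBS R Workshop ##
## Workshop 2 - Loading and manipulating data
## Author: Your Name
## Date: Fall 2014
## R version 2.15.0

#### Housekeeping ####
rm(list = ls()) # Clear R's workspace (explained in the next section)

#### Load data ####
CO2 <- read.csv("co2_good.csv") # Assumes the file is in your working directory (see below)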
{{:4_section_headings_mod_arrow_2.png?600|}} ==== Housekeeping ==== The first command at the top of all scripts should be ''rm(list=ls())''. This will clear R's memory, and will help prevent errors such as using old data that has been left in your workspace. rm(list=ls()) # Clears R workspace ?rm ?ls We can test this command by adding data to the workspace and seeing how ''rm(list=ls())'' will remove it. A<-"Test" # Put some data into the workspace, to see how rm(list=ls()) removes it A <- "Test" # Note that you can use a space before and after <- A = "Test" # <- or = can be used equally # Note that it is best practice to use "<-" for assignment instead of "=" A rm(list=ls()) A ====Important Reminders==== - R is ready for commands when you see the chevron '>' displayed in the console. If the chevron isn't displayed, you typed an incomplete command and R is waiting for more input (it displays a '+' instead). Press ESC to cancel the command and get R ready for a new one. - R is case sensitive, i.e. "A" is a different object than "a" a<-10 A<-5 a A rm(list=ls()) # Clears R workspace again ---- =====Loading, exploring and saving data===== ====Working directory==== R needs to know the directory where your data and files are stored in order to load them. You can see which directory you are currently working in by using the ''getwd()'' command. getwd() # This command shows the directory you are currently working in When you open RStudio through a project (.Rproj) file, the working directory is automatically set to the folder containing that project. To change working directories with the ''setwd()'' function, specify the path using a "/" to separate folders, subfolders and file names, for example (a hypothetical path): setwd("C:/Users/JaneDoe/Documents/Data") # Replace this example path with your own You can also click Session > Set working directory > Choose directory... ====Display the content of the working directory==== The command ''dir()'' displays the content of the working directory. dir() # This command shows the content of the directory you are currently working in You can check: * Whether or not the file you plan to open is present in the current directory * The correct spelling of the file name (//e.g.// 'myfile.csv' instead of 'MyFile.csv') ====Importing data==== Use the ''read.csv()'' command to import data into R. CO2 <- read.csv("co2_good.csv") # Creates an object called CO2 by loading data from a file called "co2_good.csv" This command specifies that you will be creating an R object named "CO2" by reading a csv file called "co2_good.csv". This file must be located in your current working directory. Recall that the question mark can be used to find out what arguments the function requires. ?read.csv # Use the question mark to pull up the help page for a command In the help file you will note that adding the argument header=TRUE tells R that the first line of the spreadsheet contains column names and not data. CO2 <- read.csv("co2_good.csv", header = TRUE) NOTE: If your operating system or CSV editor is in French, you may need to use ''read.csv2()'' instead of ''read.csv()'', because French locales use semicolons as field separators and commas as decimal points. {{:5_importing_data_mod_arrow.png?900|}} Notice that RStudio now provides information on the CO2 data in your workspace. The workspace refers to all the objects that you create during an R session. ==== Looking at data ==== The CO2 dataset consists of repeated measurements of CO2 uptake from six plants from Quebec and six plants from Mississippi at several levels of ambient CO2 concentration. Half of the plants of each type were chilled overnight before the experiment began.
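As a quick sanity check on this design, you can cross-tabulate the plants' origin and treatment with base R's ''table()'' function (a small illustrative check, assuming the columns are named "Type" and "Treatment", as in the file provided):

table(CO2$Type, CO2$Treatment) # Counts observations in each combination of origin and treatment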
There are some common commands that are useful to look at imported data: |CO2| Look at the whole data frame| |head(CO2)| Look at the first few rows| |tail(CO2)| Look at the last few rows | |names(CO2)| Names of the columns in the data frame| |attributes(CO2) | Attributes of the data frame| |dim(CO2) | Dimensions of the data frame| |ncol(CO2) | Number of columns| |nrow(CO2) | Number of rows| |summary(CO2) | Summary statistics| |str(CO2) | Structure of the data frame | The ''str()'' command is very useful to check the data type/mode for each column (i.e. to check that all factors are factors, and that numeric data is stored as integer or numeric). There are many common problems: * Factors loaded as text (character) and vice versa * Factors including too many levels because of a typo * Numeric or integer data being loaded as character due to a typo (e.g. a stray space, or a comma instead of a "." as the decimal separator) **Exercise** Try to reload the data using: CO2<-read.csv("co2_good.csv",header=FALSE) Check the ''str()'' of CO2. What is wrong here? Reload the data with header=TRUE before continuing. ==== Reminder from workshop 1: Accessing data ==== Data within a data frame can be extracted in several ways. Let's consider a data frame called //mydata//. Use square brackets to extract the content of a cell. {{:table_reminder_from_workshop_1_accessing_data.png?500|}} mydata[2,3] # extracts the content of row 2 / column 3 If the column number is omitted, the whole row is extracted. mydata[1,] # extracts the content of the first row If the row number is omitted, the whole column is extracted. mydata[,1] # extracts the content of the first column The square brackets can also be used recursively. mydata[,1][2] # this extracts the second element of the first column Similarly, the ''$'' sign followed by the corresponding column name can be used. mydata$Variable1 # extracts a specific column by its name ("Variable1") ====Renaming variables==== Variable names (i.e. column names) can be changed within R. # First let's make a copy of the dataset to play with! CO2copy <- CO2 # names() gives you the names of the variables present in the data frame names(CO2copy) # Changing from English to French names (make sure the vector contains one name per column!) names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc","absorption") ====Creating new variables==== New variables can be easily created and populated. For example, variables and strings can be concatenated together using the function ''paste()''. # Let's create a unique ID for our samples using the function paste() # see ?paste and ?paste0 # Don't forget to use "" for strings CO2copy$uniqueID <- paste0(CO2copy$Plante,"_", CO2copy$Categorie, "_", CO2copy$Traitement) # Observe the results head(CO2copy$uniqueID) Creating new variables works for numbers and mathematical operations as well! # Let's standardize our variable "absorption" to relative values CO2copy$absorptionRel <- CO2copy$absorption/max(CO2copy$absorption) # Changing to relative values # Observe the results head(CO2copy$absorptionRel) ====Subsetting data==== There are many ways to subset a data frame. # Let's keep working with our CO2copy data frame ## Subsetting by variable name CO2copy[,c("Plante", "absorptionRel")] # Selects only the "Plante" and "absorptionRel" columns. (Don't forget the ","!) ## Subsetting by row CO2copy[1:50,] # Subsets rows 1 to 50 ### Subsetting by matching with a factor level CO2copy[CO2copy$Traitement == "nonchilled",] # Selects only the observations matching the nonchilled Traitement.
### Subsetting according to a numeric condition CO2copy[CO2copy$absorption >= 20, ] # Selects observations with absorption greater than or equal to 20 ### Conditions can be combined using the & (and) operator CO2copy[CO2copy$Traitement == "nonchilled" & CO2copy$absorption >= 20, ] # We are done playing with the dataset copy. Let's erase it. rm(CO2copy) # Removes the object from the workspace Go [[https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html| here]] to check all the logical operators you can use to subset a data frame in R ====Data exploration==== A good way to start your data exploration is to look at some basic statistics of your dataset using the ''summary()'' function. summary(CO2) # Get summary statistics of your dataset You can also use other functions to calculate basic statistics about specific parts of your data frame, using ''mean()'', ''sd()'', ''hist()'', and ''print()''. # Calculate the mean and standard deviation of the concentration, and assign them to new variables meanConc <- mean(CO2$conc) sdConc <- sd(CO2$conc) # print() prints any given value to the R console print(paste("the mean of concentration is:", meanConc)) print(paste("the standard deviation of concentration is:", sdConc)) # Let's plot a histogram to explore the distribution of "uptake" hist(CO2$uptake) # Increase the number of bins to see the pattern better hist(CO2$uptake, breaks = 40) The function ''apply()'' can be used to apply a function to multiple columns of your data simultaneously. Use the ''?apply'' command to get more information about ''apply()''. ?apply To use apply(), you have to specify three arguments. The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on rows (1) or columns (2) of your dataset; the third argument is the function you would like to apply. For example: apply(CO2[,4:5], MARGIN = 2, FUN = mean) # Calculate the mean of the two columns in the data frame that contain continuous data ====Save your workspace==== Saving your workspace saves all the objects currently loaded in R, so that you can reload them later, even after using the ''rm(list=ls())'' command to delete everything in the workspace. Use ''save.image()'' to save the workspace: save.image(file="co2_project_Data.RData") # Save workspace rm(list=ls()) # Clears R workspace load("co2_project_Data.RData") # Reload everything that was in your workspace head(CO2) # Looking good! :) ====Exporting data==== If you want to save a data file that you have created or edited in R, you can do so using the ''write.csv()'' command. Note that the file will be written into the current working directory. write.csv(CO2,file="co2_new.csv") # Save object CO2 to a file named co2_new.csv ====Use your data CHALLENGE==== Try to load, explore, plot and save your own data in R. Does it load properly? If not, try fixing it in R. Save your fixed data and then try opening it in Excel. =====Fixing a Broken Data Frame===== Data can be messy, and compatibility issues are common. For example, sharing data between a Mac and a Windows machine, or between computers set up on different continents, can lead to weird-looking datasets. Let's practice how to solve some common errors. ==== Fix a broken dataframe ==== # Read the co2_broken.csv file into R and find the problems CO2<-read.csv("co2_broken.csv") # Overwrite the CO2 object with the broken CO2 data head(CO2) # Looks messy CO2 # Indeed! * This is probably what your own raw or downloaded data looks like.
* You can fix the data frame in R (or not...) * Give it a try before looking at the solution! * Work with your neighbours and have fun :) Some useful functions: * ''?read.csv'' * ''head()'' * ''str()'' * ''class()'' * ''unique()'' * ''levels()'' * ''which()'' * ''droplevels()'' Note: For these functions, you have to put the name of the data object in the parentheses (e.g. ''head(CO2)''). Also remember that you can use "?" to look up the help for a function (e.g. ''?str''). **HINT: There are 4 problems!** Problem #1: The data appears to be lumped into one column Solution: Re-import the data, but specify the separation among entries. The sep argument tells R what character separates the values on each line of the file. Here, "TAB" was used instead of ",". CO2 <- read.csv("co2_broken.csv",sep = "") # sep = "" treats any whitespace (including tabs) as a separator ?read.csv Problem #2: The data does not start until the third line of the file, so you end up with the file's notes as the headings. head(CO2) # The head() command allows you to see that the data has not been read in with the proper headings Solution: To fix this problem, you can tell R to skip the first two rows when reading in this file. CO2<-read.csv("co2_broken.csv",sep = "",skip=2) # By adding the skip argument into the read.csv function, R knows to skip the first two rows head(CO2) # You can now see that the CO2 object has the appropriate headings Problem #3: The "conc" and "uptake" variables are considered factors instead of numbers, because there are comments/text in the numeric columns. str(CO2) # The str() command shows you that both 'conc' and 'uptake' are labelled as factors class(CO2$conc) unique(CO2$conc) # By looking at the unique values in this column, you see that both columns contain "cannot_read_notes" unique(CO2$uptake) ?unique Solution: ?read.csv {{:read_table_help1.png|}} {{:read_table_help2.png|}} CO2 <- read.csv("co2_broken.csv",sep = "",skip = 2,na.strings = c("NA","na","cannot_read_notes")) By identifying "cannot_read_notes" as NA data, R reads these columns properly. Remember that NA (capital!) stands for not available. head(CO2) str(CO2) # You can see that the conc variable is now an integer and the uptake variable is now treated as numeric Problem #4: There are only two treatments (chilled and nonchilled), but there are spelling errors causing it to look like 4 different treatments. str(CO2) # You can see that 4 levels are listed for Treatment levels(CO2$Treatment) unique(CO2$Treatment) # The 4 different treatments are "nonchilled", "nnchilled", "chilled", and "chiled" Solution: # You can use which() to find rows with the typo "nnchilled" which(CO2$Treatment=="nnchilled") # Row number ten # You can then correct the error using indexing: CO2$Treatment[10] <- "nonchilled" # Alternatively, doing it with a single command: CO2$Treatment[which(CO2$Treatment=="nnchilled")] <- "nonchilled" # Now doing the same for "chiled": CO2$Treatment[which(CO2$Treatment=="chiled")] <- "chilled" Have we fixed the problem? str(CO2) # Structure still identifies 4 levels of the factor unique(CO2$Treatment) # But unique() says that only two are used CO2<-droplevels(CO2) # This command drops the unused levels from all factors in the data frame str(CO2) # Fixed! ---- =====Advanced users section===== =====Learn to manipulate data with tidyr, dplyr, magrittr===== ==== Using tidyr to reshape data frames ==== {{:tidyrsticker.png?200|}} ==== Why "tidy" your data? ==== Tidying allows you to manipulate the structure of your data while preserving all original information.
Many functions in R require (or work better with) a data structure that isn't always easily readable by people. In contrast to aggregation, which reduces many cells in the original data set to one cell in the new dataset, tidying preserves a one-to-one connection. Although aggregation can be done with many functions in R, the ''tidyr'' package allows you to both reshape and aggregate within a single syntax. Install / load the ''tidyr'' package: if(!require(tidyr)){install.packages("tidyr")} library(tidyr) ==== Wide vs. long data ==== **Wide** format data has a separate column for each variable or each factor in your study. One row can therefore include several different observations. **Long** format data has a column stating the measured variable types and a column containing the values associated with those variables (each column is a variable, each row is one observation). This is considered "tidy" data because it is easily interpreted by most packages for visualization and analysis in ''R''. The format of your data depends on your specific needs, but some functions and packages such as ''dplyr'', ''lm()'', ''glm()'', ''gam()'' require long format data. The ''ggplot2'' package can use the wide data format for some basic plotting, but more complex plots require the long format (example to come). Additionally, long form data can more easily be aggregated and converted back into wide form data to provide summaries, or to check the balance of sampling designs. We can use the ''tidyr'' package to manipulate the structure of your data while preserving all original information, using the following functions: * 1. ''gather()'' our data (wide --> long) * 2. ''spread()'' our data (long --> wide) {{:gather-spread.png?600|}} Let's pretend you send out your field assistant to measure the diameter at breast height (DBH) and height of three tree species for you. The result is this "wide" data set. > wide <- data.frame(Species = c("Oak", "Elm", "Ash"), DBH = c(12, 20, 13), Height = c(56, 85, 55)) > wide Species DBH Height 1 Oak 12 56 2 Elm 20 85 3 Ash 13 55 ==== gather(): Making your data long ==== ?gather Most of the packages in the Hadleyverse require long format data, where each row is an entry and each column is a variable. Let's try to "gather" this wide data using the ''gather()'' function in tidyr. ''gather()'' takes multiple columns, and gathers them into key-value pairs. The function requires at least 3 arguments: * **data**: a data frame (e.g. "wide") * **key**: name of the new column containing variable names (e.g. "Measurement") * **value**: name of the new column containing variable values (e.g. "Value") * **...**: name or numeric index of the columns we wish to gather (e.g. "DBH" or "Height") # Gathering columns into rows > long <- gather(wide, Measurement, Value, DBH, Height) > long Species Measurement Value 1 Oak DBH 12 2 Elm DBH 20 3 Ash DBH 13 4 Oak Height 56 5 Elm Height 85 6 Ash Height 55 Let's try this with the CO2 dataset. Here we might want to collapse the last two quantitative variables: CO2.long <- gather(CO2, response, value, conc, uptake) head(CO2) head(CO2.long) tail(CO2.long) ==== spread(): Making your data wide ==== ''spread()'' uses the same syntax as ''gather()''. The function requires 3 arguments: * **data**: A data frame (e.g. "long") * **key**: Name of the column containing variable names (e.g. "Measurement") * **value**: Name of the column containing variable values (e.g.
"Value") # Spreading rows into columns > wide2 <- spread(long, Measurement, Value) > wide2 Species DBH Height 1 Ash 13 55 2 Elm 20 85 3 Oak 12 56 ==== separate(): Separate two (or more) variables in a single column ==== Some times you might have really messy data that has two variables in one column. Thankfully the ''separate()'' function can (wait for it) separate the two variables into two columns. The ''separate()'' function splits a columns by a character string separator. It requires 4 arguments: * **data**: A data frame (e.g. "long") * **col**: Name of the column you wish to separate * **into**: Names of new variables to create * **sep**: Character which indicates where to separate Let's create a really messy data set: set.seed(8) messy <- data.frame(id = 1:4, trt = sample(rep(c('control', 'farm'), each = 2)), zooplankton.T1 = runif(4), fish.T1 = runif(4), zooplankton.T2 = runif(4), fish.T2 = runif(4)) messy First, we want to convert this wide dataset to long format. messy.long <- gather(messy, taxa, count, -id, -trt) head(messy.long) Then, we want to split those two sampling times (T1 & T2). The syntax we use here is to tell R ''separate(data, what column, into what, by what)''. The tricky part here is telling R where to separate the character string in your column entry using a regular expression to describe the character that separates them. Here, the string should be separated by the period ''"."''. messy.long.sep <- separate(messy.long, taxa, into = c("species", "time"), sep = "\\.") head(messy.long.sep) The argument ''sep = "\\."'' tells R to splits the character string around the period (.). We cannot type directly ''"."'' because it is a regular expression that matches any single character. ====Recap: tidyr==== ''tidyr'' is a package that reshapes the layout of data sets. * Convert from **wide format to long format** using ''gather()'' * Convert from **long format to wide format** using ''spread()'' * Split and merge columns with ''unite()'' and ''separate()'' Here's cheat sheet to help you use ''tidyr'' and ''dplyr'' for more data wrangling: [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]] ==== tidyr CHALLENGE ==== //Using the ''airquality'' dataset, ''gather()'' all the columns (except Month and Day) into rows. Then ''spread()'' the resulting dataset to return the same data format as the original data.// ?air.quality data(airquality) ++++Solution| # Use gather() to convert the dataset to long format air.long <- gather(airquality, variable, value, -Month, -Day) head(air.long) # Note that the syntax used here indicates we wish to gather ALL the columns except "Month" and "Day" # Then, use spread() to convert the dataset back to wide format air.wide <- spread(air.long , variable, value) head(air.wide) ++++ ---- ===== Data manipulation with dplyr ===== {{:dplyrsticker.png?200|}} ==== Intro to dplyr ==== The vision of the ''dplyr'' package is to simplify data manipulation by distilling all the common data manipulation tasks to a set of intuitive functions (or "verbs"). The result is a comprehensive set of tools that facilitates data manipulation, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. 
In addition to ease of use, it is also an amazing package because: * it can crunch huge datasets wicked fast (it is written in ''C++'') * it plays nice with the RStudio IDE and other packages in the Hadleyverse * it can interface with external databases and translate your R code into SQL queries * if Batman were an R package, he would be ''dplyr'' (mastering fear of data, adopting cool technologies) Certain base R functions work similarly to dplyr functions, including: ''split()'', ''subset()'', ''apply()'', ''sapply()'', ''lapply()'', ''tapply()'' and ''aggregate()''. Let's install and load the ''dplyr'' package: if(!require(dplyr)){install.packages("dplyr")} library(dplyr) The ''dplyr'' package is built around a core set of "verbs" (or functions). We will start with the following 4 verbs, because these operations are ubiquitous in data manipulation: * ''select()'': select columns from a data frame * ''filter()'': filter rows according to defined criteria * ''arrange()'': re-order data based on criteria (e.g. ascending, descending) * ''mutate()'': create or transform values in a column ====Select a subset of columns with ''select()''==== {{:select.png?600|}} The general syntax for this function is ''select(dataframe, column1, column2, ...)''. Most ''dplyr'' functions follow a similarly simple syntax. ''select()'' requires at least 2 arguments: * **data**: the dataset to manipulate * **...**: column names, positions, or complex expressions (separated by commas) For example: select(data, column1, column2) # select the columns named column1 and column2 select(data, c(2:4,6)) # select columns 2 to 4 and 6 select(data, -column1) # select all columns except column1 select(data, starts_with("x.")) # select all columns whose names start with "x." Here are more examples of how to use ''select()'': {{:select.helper.png?400|}} The ''airquality'' dataset contains several columns: > head(airquality) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 For example, suppose we are only interested in the variation of "Ozone" over time within the ''airquality'' dataset; we can then select the subset of required columns for further analysis: > ozone <- select(airquality, Ozone, Month, Day) > head(ozone) Ozone Month Day 1 41 5 1 2 36 5 2 3 12 5 3 4 18 5 4 5 NA 5 5 6 28 5 6 ==== Select a subset of rows with ''filter()'' ==== A common operation in data manipulation is the extraction of a subset based on specific conditions. The general syntax for this function is ''filter(dataframe, logical statement 1, logical statement 2, ...)''. {{:filter.png?600|}} Remember that logical statements provide a TRUE or FALSE answer. The ''filter()'' function retains all the data for which the statement is TRUE. This can also be applied to characters and factors. Here is a useful reminder of how logic works in R. {{:logic.helper.png?500|}} For example, in the ''airquality'' dataset, suppose we are interested in analyses that focus on the month of August during high temperature events: > august <- filter(airquality, Month == 8, Temp >= 90) > head(august) Ozone Solar.R Wind Temp Month Day 1 89 229 10.3 90 8 8 2 110 207 8.0 90 8 9 3 NA 222 8.6 92 8 10 4 76 203 9.7 97 8 28 5 118 225 2.3 94 8 29 6 84 237 6.3 96 8 30 ==== Sorting rows with ''arrange()'' ==== In data manipulation, we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series.
The ''arrange()'' function re-orders rows by one or multiple columns, using the following syntax: ''arrange(data, variable1, variable2, ...)''. By default, rows are sorted in ascending order. Note that we can also sort in descending order by placing the target column in ''desc()'' inside the ''arrange()'' function, as follows: ''arrange(data, variable1, desc(variable2), ...)''. Example: Let's use the following code to create a scrambled version of the airquality dataset > air_mess <- sample_frac(airquality, 1) > head(air_mess) Ozone Solar.R Wind Temp Month Day 21 1 8 9.7 59 5 21 42 NA 259 10.9 93 6 11 151 14 191 14.3 75 9 28 108 22 71 10.3 77 8 16 8 19 99 13.8 59 5 8 104 44 192 11.5 86 8 12 Now, let's arrange the data frame back into chronological order, sorting by ''Month'' and then by ''Day'': > air_chron <- arrange(air_mess, Month, Day) > head(air_chron) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 Try to see the difference when we change the order of the target columns: arrange(air_mess, Day, Month) ==== Create and populate columns with ''mutate()'' ==== Besides subsetting or sorting your data frame, you will often require tools to transform your existing data or generate some additional data based on existing variables. We can use the function ''mutate()'' to compute and add new columns to your dataset. The ''mutate()'' function follows this syntax: ''mutate(data, newVar1 = expression1, newVar2 = expression2, ...)''. {{:mutate.png?600|}} Let's create a new column using ''mutate()''. For example, suppose we would like to convert the temperature variable from degrees Fahrenheit to degrees Celsius: > airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9)) > head(airquality_C) Ozone Solar.R Wind Temp Month Day Temp_C 1 41 190 7.4 67 5 1 19.44444 2 36 118 8.0 72 5 2 22.22222 3 12 149 12.6 74 5 3 23.33333 4 18 313 11.5 62 5 4 16.66667 5 NA NA 14.3 56 5 5 13.33333 6 28 NA 14.9 66 5 6 18.88889 Note that the syntax here is quite simple, but within a single call of the ''mutate()'' function we can replace existing columns, create multiple new columns, and even use a newly created column to compute another new column in the same function call. ---- ===== dplyr and magrittr, a match made in heaven ===== {{:magrittrsticker.png?200|}} The ''magrittr'' package brings a new and exciting tool to the table: a pipe operator. Pipe operators provide ways of linking functions together so that the output of one function flows into the input of the next function in the chain. The syntax for the ''magrittr'' pipe operator is ''%>%''. The ''magrittr'' pipe operator truly unleashes the full power and potential of ''dplyr'', and we will be using it for the remainder of the workshop. First, let's install and load it: if(!require(magrittr)){install.packages("magrittr")} library(magrittr) Using it is quite simple, and we will demonstrate that by combining some of the examples used above. Suppose we wanted to ''filter()'' rows to limit our analysis to the month of June, then convert the temperature variable to degrees Celsius. We can tackle this problem step by step, as before: june_C <- mutate(filter(airquality, Month == 6), Temp_C = (Temp-32)*(5/9)) This code can be difficult to decipher because we start on the inside and work our way out. As we add more operations, the resulting code becomes increasingly illegible.
Instead of wrapping each function one inside the other, we can accomplish these 2 operations by linking both functions together: june_C <- airquality %>% filter(Month == 6) %>% mutate(Temp_C = (Temp-32)*(5/9)) Notice that within each function, we have removed the first argument, which specifies the dataset. Instead, we specify our dataset first, then "pipe" it into the next function in the chain. The advantages of this approach are that our code is less redundant and that functions are executed in the same order we read and write them, which makes it easier and quicker to both translate our thoughts into code and read someone else's code and grasp what is being accomplished. As the complexity of your data manipulations increases, it quickly becomes apparent why this is a powerful and elegant approach to writing your ''dplyr'' code. **Quick tip:** In RStudio, we can insert the pipe quickly using the following hotkey: ''Ctrl''+''Shift''+''M'' (''Cmd''+''Shift''+''M'' on a Mac). ===== dplyr - grouped operations and summaries ===== The ''dplyr'' verbs we have explored so far can be useful on their own, but they become especially powerful when we link them with each other using the pipe operator (''%>%'') and by applying them to groups of observations. The following functions allow us to split our data frame into distinct groups, on which we can then perform operations individually, such as aggregating/summarising: * ''group_by()'': group the data frame by a factor for downstream commands (usually summarise) * ''summarise()'': summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. ''min()'', ''max()'', ''mean()'', etc.) These verbs provide the needed backbone for the Split-Apply-Combine strategy that was initially implemented in the ''plyr'' package, on which ''dplyr'' is built. {{:split-apply-combine.png?600|}} Let's demonstrate the use of these with an example using the ''airquality'' dataset. Suppose we are interested in the mean temperature and its standard deviation within each month: > month_sum <- airquality %>% group_by(Month) %>% summarise(mean_temp = mean(Temp), sd_temp = sd(Temp)) > month_sum Source: local data frame [5 x 3] Month mean_temp sd_temp (int) (dbl) (dbl) 1 5 65.54839 6.854870 2 6 79.10000 6.598589 3 7 83.90323 4.315513 4 8 83.96774 6.585256 5 9 76.90000 8.355671 ===== dplyr & magrittr CHALLENGE ===== //Using the ''ChickWeight'' dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study. Employ ''dplyr'' verbs and the ''%>%'' operator.// ?ChickWeight data(ChickWeight) ++++Solution| # Use group_by() to divide the dataset by "Chick" # Use summarise() to calculate the weight difference within each group > weight_diff <- ChickWeight %>% group_by(Chick) %>% summarise(weight_diff = max(weight) - min(weight)) > weight_diff Source: local data frame [50 x 2] Chick weight_diff (fctr) (dbl) 1 18 4 2 16 16 3 15 27 4 13 55 5 9 58 6 20 76 7 10 83 8 8 92 9 17 100 10 19 114 .. ... ... Note that we are only calculating the difference between the maximum and minimum weight. This does not necessarily correspond to the difference in mass between the beginning and the end of the trials. Closely inspect the data for chick # 18 to understand why this is the case: > chick_18 <- ChickWeight %>% filter(Chick == 18) > chick_18 weight Time Chick Diet 1 39 0 18 1 2 35 2 18 1 Here we notice that chick 18 has in fact lost weight (and probably died during the trial).
From a scientific perspective, perhaps a more interesting question is which of the 4 diets results in the greatest weight gain in chicks. We could calculate this using 2 more useful ''dplyr'' functions: ''first()'' and ''last()'' allow us to access (respectively) the first and last observations within a group. ++++ ---- ===== Ninja hint ===== Note that we can group the data frame using more than one factor, using the general syntax as follows: ''group_by(group1, group2, ...)'' Within ''group_by()'', the multiple groups create a layered onion, and each subsequent single use of the ''summarise()'' function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on ''group2'', the resulting data set would remain grouped by ''group1'' for downstream operations. ===== dplyr & magrittr NINJA CHALLENGE ===== //Using the ''ChickWeight'' dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ ''dplyr'' verbs and the ''%>%'' operator. (Hint: ''first()'' and ''last()'' may be useful here.)// ++++Solution| > diet_summ <- ChickWeight %>% group_by(Diet, Chick) %>% summarise(weight_gain = last(weight) - first(weight)) %>% group_by(Diet) %>% summarise(mean_gain = mean(weight_gain)) > diet_summ # A tibble: 4 × 2 Diet mean_gain 1 1 114.9 2 2 174.0 3 3 229.5 4 4 188.3 Given that the solution to the last challenge requires that we compute several operations in sequence, it provides a nice example to demonstrate why the syntax implemented by ''dplyr'' and ''magrittr'' is so useful. An additional challenge (if you are well versed in base ''R'' functions) would be to reproduce the same operations using fewer keystrokes. We tried, and failed... Perhaps we are too accustomed to ''dplyr'' now. ++++ ---- ===== dplyr - Merging data frames ===== In addition to all the operations we have explored, ''dplyr'' also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other ''R'' packages: * ''left_join()'' * ''right_join()'' * ''inner_join()'' * ''anti_join()'' These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs; a small illustrative sketch follows the resource list below. =====More resources on data manipulation===== * [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf|The RStudio Data Wrangling Cheat Sheet]] * [[http://r4ds.had.co.nz/transform.html|Learn more about ''dplyr'']] * [[http://seananderson.ca/2014/09/13/dplyr-intro.html|Sean Anderson's Intro to dplyr and pipes]] * [[https://rpubs.com/bradleyboehmke/data_wrangling|Bradley Boehmke's Intro to data wrangling]]
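To give you a first taste of these joining verbs, here is a minimal sketch of ''left_join()'' using two invented data frames (the column names and values are purely illustrative and are not part of the workshop data):

# Two hypothetical tables that share a "site" column
plots <- data.frame(site = c("A", "B", "C"), treatment = c("control", "fertilized", "control"))
counts <- data.frame(site = c("A", "B", "D"), richness = c(12, 18, 9))

# left_join() keeps every row of 'plots' and appends the matching columns of 'counts';
# a site with no match in 'counts' (here "C") gets NA for richness
left_join(plots, counts, by = "site")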