Differences

This shows you the differences between two versions of the page.

r_workshop2 [2018/09/26 15:49]
katherinehebert [Importing data]
r_workshop2 [2021/10/13 16:03] (current)
lsherin
Line 1: Line 1:
 +<WRAP group>
 +<WRAP centeralign>​
 +<WRAP important>​
 +<wrap em> __MAJOR UPDATE__ </​wrap> ​
 +
 +<wrap em> As of Fall 2021, this wiki has been discontinued and is no longer being actively developed. </​wrap> ​
 +
 +<wrap em> All updated materials and announcements for the QCBS R Workshop Series are now housed on the [[https://​r.qcbs.ca/​workshops/​r-workshop-02/​|QCBS R Workshop website]]. Please update your bookmarks accordingly to avoid outdated material and/or broken links. </​wrap>​
 +
 +<wrap em> Thank you for your understanding,​ </​wrap>​
 +
 +<wrap em> Your QCBS R Workshop Coordinators. </​wrap>​
 +
 +</​WRAP>​
 +</​WRAP>​
 +<WRAP clear></​WRAP>​
 +
 ======= QCBS R Workshops ======= ======= QCBS R Workshops =======
  
Line 11: Line 28:
 Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy
  
-**Summary:** In this workshop, you will learn how to load, view, and manipulate your data in R. You will learn basic commands to inspect and visualize your data, and learn how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key (or maybe two, or three…)! We will then introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.
+**Summary:** In this workshop, you will learn how to load, view, and manipulate your data in R. You will learn basic commands to inspect and visualize your data, and learn how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key (or maybe two, or three…)! We have included an advanced users section where we will introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.
 + 
 +**Link to new [[https://​qcbsrworkshops.github.io/​workshop02/​workshop02-en/​workshop02-en.html|Rmarkdown presentation]]**
  
-Link to associated Prezi: [[http://​prezi.com/​wg4rggjfqucv/​qcbs-r-workshop-2/​|Prezi]] 
  
 Download the R script and data for this lesson: ​ Download the R script and data for this lesson: ​
-  - [[http://​qcbs.ca/​wiki/​_media/​script_workshop2.R|Script]]+  - [[http://​qcbs.ca/​wiki/​_media/​script_workshop02-en.r|Script]]
   - [[http://​qcbs.ca/​wiki/​_media/​co2_good.csv|Dataset 1]]   - [[http://​qcbs.ca/​wiki/​_media/​co2_good.csv|Dataset 1]]
-  - [[http://qcbs.ca/wiki/_media/​co2_broken.csv|Dataset 2]]+  - [[https://raw.githubusercontent.com/QCBSRworkshops/workshop02/​dev/​workshop02-en/​data/​co2_broken.csv|Dataset 2]] (//After following this link, right-click on the page to save the file as .csv//).
  
 ===== Learning Objectives ===== ===== Learning Objectives =====
  
-  - Creating an R project +1. Creating an R project 
-  ​- ​Writing a script + 
-  ​- ​Loading, exploring and saving data +2. Writing a script 
-  ​- ​Learn to manipulate data frames with tidyr, dplyr, maggritr+ 
 +3. Loading, exploring and saving data 
 + 
 +(For advanced users) 
 + 
 +4. Learn to manipulate data frames with tidyr, dplyr, magrittr
  
 ===== RStudio Projects ===== ===== RStudio Projects =====
  
 What is this?  What is this? 
-  - Within RStudio, ​Projects make it easy to separate and keep your work organized. ​  +  - Projects make it easy to keep your work organized. ​  
-  - All files, scripts, documentation related to a specific project are bound together+  - All files, scripts, documentation related to a specific project are bound together ​with a .Rproj file
  
-Encourages reproducibility and easy sharing+Encourages reproducibility and easy sharing
  
 ===== Create a new project ===== ===== Create a new project =====
Line 44: Line 67:
  
 One project = one folder One project = one folder
 +
 +Place similar files inside of their own folders ​
 +
 +Keep track of versions
  
 {{:​0_folderdata1.png?​400|}} {{:​0_folderdata1.png?​400|}}
Line 53: Line 80:
   * file -> save as .csv   * file -> save as .csv
  
-====Choose file names wisely====+====Naming files====
   * Good:    * Good: 
     * rawDatasetAgo2017.csv     * rawDatasetAgo2017.csv
Line 64: Line 91:
     * Dont.separate.names.with.dots.csv //(Can lead to reading file errors!)//     * Dont.separate.names.with.dots.csv //(Can lead to reading file errors!)//
  
-====Choose variable names wisely====+====Naming variables====
   * Use short informative titles (i.e. "​Time_1"​ not "First time measurement"​)   * Use short informative titles (i.e. "​Time_1"​ not "First time measurement"​)
     * Good: "​Measurements",​ "​SpeciesNames",​ "​Site"​     * Good: "​Measurements",​ "​SpeciesNames",​ "​Site"​
Line 72: Line 99:
  
  
-====Things to consider with your data====+====Common ​data preparation mistakes==== 
   * No text in numeric columns   * No text in numeric columns
   * Do not include spaces!   * Do not include spaces!
Line 87: Line 115:
  
 {{:​excel_notes.png|}} {{:​excel_notes.png|}}
-{{:​horribledata.png|}}+
  
 It is possible to do all your data preparation work within R. This has several benefits: ​ It is possible to do all your data preparation work within R. This has several benefits: ​
Line 249: Line 277:
   * Factors loaded as text (character) and vice versa   * Factors loaded as text (character) and vice versa
   * Factors including too many levels because of a typo   * Factors including too many levels because of a typo
-  * Numeric or integer data being loaded as character due to a typo (including ​space or using a comma instead of a "​."​ for a decimal)+  * Numeric or integer data being loaded as character due to a typo (including space or using a comma instead of a "​."​ for a decimal)
  
 **Exercise** ​ **Exercise** ​
Line 261: Line 289:
 Check the ''​str()''​ of CO2.  What is wrong here?  Reload the data with header=TRUE before continuing. Check the ''​str()''​ of CO2.  What is wrong here?  Reload the data with header=TRUE before continuing.
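
A minimal sketch of what the ''header'' argument changes, assuming a small hypothetical file written to a temporary location (the column names and values below are made up for illustration and are not the workshop data):

<code rsplus | >
# Hypothetical two-column file, written to a temporary location for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("Plant,uptake", "Qn1,16.0", "Qn2,30.4"), tmp)

# Without a header, the column names become the first row of data,
# so every column is read as text (character, or factor in R < 4.0)
no_header <- read.csv(tmp, header = FALSE)
str(no_header)

# With header = TRUE, the first line supplies the column names
# and uptake is read as a numeric variable
with_header <- read.csv(tmp, header = TRUE)
str(with_header)
</code>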
  
-==== Reminder from workshop 1: Accessing ​Data ====+==== Reminder from workshop 1: Accessing ​data ====
  
 Data within a data frame can be extracted by several means. Let's consider a data frame called //mydata//. Use square brackets to extract the content of a cell.  Data within a data frame can be extracted by several means. Let's consider a data frame called //mydata//. Use square brackets to extract the content of a cell. 
Line 421: Line 449:
 Let's practice how to solve some common errors. ​ Let's practice how to solve some common errors. ​
  
----- Fix a broken dataframe ​CHALLENGE ----+==== Fix a broken dataframe ​====
  
 # Read co2_broken.csv file into R and find the problems # Read co2_broken.csv file into R and find the problems
Line 451: Line 479:
 **HINT: There are 4 problems!** **HINT: There are 4 problems!**
  
-Answers: 
- 
-Answer #1 
-<​hidden>​ 
 Problem #1: The data appears to be lumped into one column Problem #1: The data appears to be lumped into one column
  
 Solution: Solution:
-<​hidden>​+
 Re-import the data, but specify the separation among entries. Re-import the data, but specify the separation among entries.
 The sep argument tells R what character separates the values on each line of the file. The sep argument tells R what character separates the values on each line of the file.
Line 466: Line 490:
 ?read.csv ?read.csv
 </​code>​ </​code>​
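
As a hedged illustration of the ''sep'' argument, the sketch below writes a small hypothetical whitespace-separated file (not the workshop file) and reads it twice. The separator used in your own file may differ, so check it in a text editor first.

<code rsplus | >
# Hypothetical whitespace-separated file, for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("Plant conc uptake", "Qn1 95 16.0", "Qn2 175 30.4"), tmp)

# read.csv() assumes sep = ",", so everything is lumped into one column
lumped <- read.csv(tmp)
ncol(lumped)

# sep = "" tells R that any whitespace separates the values
separated <- read.csv(tmp, sep = "")
ncol(separated)
</code>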
-</​hidden> ​ 
-</​hidden>​ 
  
-Answer #2 +
-<​hidden>​+
 Problem #2: The data does not start until the third line of the txt file, so you end up with notes on the file as the headings. Problem #2: The data does not start until the third line of the txt file, so you end up with notes on the file as the headings.
 <code rsplus | > <code rsplus | >
Line 477: Line 498:
  
 Solution: Solution:
-<​hidden>​+
 To fix this problem, you can tell R to skip the first two rows when reading in this file. To fix this problem, you can tell R to skip the first two rows when reading in this file.
 <code rsplus | > <code rsplus | >
Line 483: Line 504:
 head(CO2) # You can now see that the CO2 object has the appropriate headings head(CO2) # You can now see that the CO2 object has the appropriate headings
 </​code>​ </​code>​
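
The same idea can be tried on a small, self-contained example. This is only a sketch: the two note lines and the data below are hypothetical stand-ins for a file that begins with comments.

<code rsplus | >
# Hypothetical file whose first two lines are notes rather than data
tmp <- tempfile(fileext = ".csv")
writeLines(c("NOTE: hypothetical comment line",
             "NOTE: another comment line",
             "Plant,conc,uptake",
             "Qn1,95,16.0",
             "Qn2,175,30.4"), tmp)

# skip = 2 ignores the first two lines, so the third line becomes the header
skipped <- read.csv(tmp, skip = 2)
names(skipped)
</code>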
-</​hidden>​ 
-</​hidden>​ 
  
 Answer #3 Answer #3
-<​hidden>​+
 Problem #3: "​conc"​ and "​uptake"​ variables are considered factors instead of numbers, because there are comments/​text in the numeric columns. Problem #3: "​conc"​ and "​uptake"​ variables are considered factors instead of numbers, because there are comments/​text in the numeric columns.
 <code rsplus | > <code rsplus | >
Line 498: Line 517:
  
 Solution: Solution:
-<​hidden>​+
 <code rsplus | > <code rsplus | >
 ?read.csv ?read.csv
Line 517: Line 536:
 str(CO2) # You can see that conc variable is now an integer and the uptake variable is now treated as numeric str(CO2) # You can see that conc variable is now an integer and the uptake variable is now treated as numeric
 </​code>​ </​code>​
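
One way to handle text in numeric columns is the ''na.strings'' argument of ''read.csv()''. The sketch below uses a hypothetical file in which the entries "cannot_read" and "na" stand in for comments typed into numeric columns; the text found in your own data will differ.

<code rsplus | >
# Hypothetical file with text entries inside numeric columns
tmp <- tempfile(fileext = ".csv")
writeLines(c("conc,uptake", "95,16.0", "cannot_read,30.4", "250,na"), tmp)

# Read as-is: both columns are treated as text because of the stray entries
raw <- read.csv(tmp)
str(raw)

# Declare the stray entries as missing values so the columns are read as numbers
clean <- read.csv(tmp, na.strings = c("NA", "na", "cannot_read"))
str(clean)
</code>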
-</​hidden>​ 
-</​hidden>​ 
  
-Answer #4 + 
-<​hidden>​+
 Problem #4: There are only two treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments. Problem #4: There are only two treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.
 <code rsplus | > <code rsplus | >
Line 530: Line 547:
  
 Solution: Solution:
-<​hidden>​ 
 <code rsplus | > <code rsplus | >
 # You can use which() to find rows with the typo "​nnchilled"​ # You can use which() to find rows with the typo "​nnchilled"​
Line 550: Line 566:
 str(CO2) # Fixed! str(CO2) # Fixed!
 </​code>​ </​code>​
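
The ''which()'' approach can also be practised on a stand-alone vector. This is a minimal sketch with a hypothetical treatment vector; the misspelling "chiled" is invented for illustration.

<code rsplus | >
# Hypothetical treatment vector containing two misspelled entries
trt <- factor(c("chilled", "nonchilled", "nnchilled", "chiled", "chilled"))
levels(trt)  # the typos show up as extra levels

# Work on the character values, correct the typos, then rebuild the factor
trt_chr <- as.character(trt)
trt_chr[which(trt_chr == "nnchilled")] <- "nonchilled"
trt_chr[which(trt_chr == "chiled")] <- "chilled"
trt_fixed <- factor(trt_chr)
levels(trt_fixed)
</code>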
-</​hidden>​ +--- 
-</​hidden>​+ 
 + 
 +=====Advanced users section===== 
 + 
  
----- 
  
 =====Learn to manipulate data with tidyr, dplyr, magrittr===== =====Learn to manipulate data with tidyr, dplyr, magrittr=====
Line 575: Line 594:
 ==== Wide vs. long data ==== ==== Wide vs. long data ====
  
-**Wide** format data has a separate column for each variable or each factor in your study.
+**Wide** format data has a separate column for each variable or each factor in your study. One row can therefore include several different observations.
  
-**Long** format data has a column stating the measured variable types and a column containing the values associated to those variables (each column is a variable, each row is an observation). This is considered "​tidy"​ data because it is easily interpreted by most packages for visualization and analysis in ''​R''​.+**Long** format data has a column stating the measured variable types and a column containing the values associated to those variables (each column is a variable, each row is one observation). This is considered "​tidy"​ data because it is easily interpreted by most packages for visualization and analysis in ''​R''​.
  
 The format of your data depends on your specific needs, but some functions and packages such as ''​dplyr'',​ ''​lm()'',​ ''​glm()'',​ ''​gam()''​ require long format data. The ''​ggplot2''​ package can use wide data format for some basic plotting, but more complex plots require the long format (example to come). The format of your data depends on your specific needs, but some functions and packages such as ''​dplyr'',​ ''​lm()'',​ ''​glm()'',​ ''​gam()''​ require long format data. The ''​ggplot2''​ package can use wide data format for some basic plotting, but more complex plots require the long format (example to come).
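
To make the distinction concrete, here is a small sketch with a hypothetical data frame (species names and measurements are invented). It uses ''pivot_longer()'', which assumes ''tidyr'' version 1.0.0 or later is installed.

<code rsplus | >
library(tidyr)

# Hypothetical wide data: one row per species, one column per measured variable
wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH     = c(12, 20, 13),
                   Height  = c(56, 85, 55))

# Long format: one column names the measured variable, another holds its value,
# so each row is a single observation
long <- pivot_longer(wide,
                     cols      = c(DBH, Height),
                     names_to  = "Measurement",
                     values_to = "Value")
long
</code>

The wide table has 3 rows (one per species), while the long table has 6 rows (one per species and measurement).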
Line 687: Line 706:
 </​code>​ </​code>​
  
-Then, we want to split those two sampling times (T1 & T2). The syntax we use here is to tell R separate(data, what column, into what, by what). The tricky part here is telling R where to separate the character string in your column entry using a regular expression to describe the character that separates them.Here the string should be separated by the period (.)
+Then, we want to split those two sampling times (T1 & T2). The syntax we use here is to tell R ''separate(data, what column, into what, by what)''. The tricky part here is telling R where to separate the character string in your column entry using a regular expression to describe the character that separates them. Here, the string should be separated by the period ''"."''.
  
 <code rsplus | > <code rsplus | >
Line 703: Line 722:
   * Split and merge columns with ''​unite()''​ and ''​separate()''​   * Split and merge columns with ''​unite()''​ and ''​separate()''​
  
-Here's cheat sheet to help you use tidyr and dplyr for more data wrangling: [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]]
+Here's a cheat sheet to help you use ''tidyr'' and ''dplyr'' for more data wrangling: [[https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]]
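
As a short recap of ''separate()'', the sketch below splits a hypothetical column on a period. Because the ''sep'' argument is interpreted as a regular expression, the period is escaped as ''"\\."'' so that it is read as a literal dot.

<code rsplus | >
library(tidyr)

# Hypothetical data frame: the key column holds "time.variable" strings
df <- data.frame(id  = 1:2,
                 key = c("T1.DBH", "T2.Height"))

# Split key into two columns at the literal period
df_split <- separate(df, col = key, into = c("Time", "Measurement"), sep = "\\.")
df_split
</code>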
  
 ==== tidyr CHALLENGE ==== ==== tidyr CHALLENGE ====
Line 829: Line 848:
 </​code>​ </​code>​
  
-==== Sort columns ​with ''​arrange()''​ ==== +==== Sorting rows with ''​arrange()''​ ==== 
  
 In data manipulation,​ we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series. ​ In data manipulation,​ we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series. ​
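
As a brief sketch of how ''arrange()'' behaves, the example below sorts the built-in ''airquality'' dataset in two ways. The object names are only illustrative.

<code rsplus | >
library(dplyr)

# Sort chronologically: first by Month, then by Day within each month
by_date <- arrange(airquality, Month, Day)
head(by_date)

# Sort from the hottest to the coolest day with desc()
hottest_first <- arrange(airquality, desc(Temp))
head(hottest_first, 3)
</code>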
Line 877: Line 896:
 {{:​mutate.png?​600|}} {{:​mutate.png?​600|}}
    
-Let's create a new column using ''​mutate()''​. For example, suppose we would like to convert the temperature variable ​form degrees Fahrenheit to degrees Celsius:+Let's create a new column using ''​mutate()''​. For example, suppose we would like to convert the temperature variable ​from degrees Fahrenheit to degrees Celsius:
 <code rsplus | > <code rsplus | >
 > airquality_C <- mutate(airquality,​ Temp_C = (Temp-32)*(5/​9)) > airquality_C <- mutate(airquality,​ Temp_C = (Temp-32)*(5/​9))
Line 1003: Line 1022:
 ---- ----
  
-==== Ninja hint ====+===== Ninja hint =====
  
 Note that we can group the data frame using more than one factor, using the general syntax as follows: ''​group_by(group1,​ group2, ...)''​ Note that we can group the data frame using more than one factor, using the general syntax as follows: ''​group_by(group1,​ group2, ...)''​
Line 1009: Line 1028:
 Within ''​group_by()'',​ the multiple groups create a layered onion, and each subsequent single use of the ''​summarise()''​ function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on ''​group2'',​ the resulting data set would remain grouped by ''​group1''​ for downstream operations. Within ''​group_by()'',​ the multiple groups create a layered onion, and each subsequent single use of the ''​summarise()''​ function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on ''​group2'',​ the resulting data set would remain grouped by ''​group1''​ for downstream operations.
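
The onion analogy can be checked with ''group_vars()'', which reports the current grouping variables. This is a sketch using the built-in ''airquality'' dataset and assumes the default behaviour of ''summarise()'', which drops the last grouping level when each group is reduced to one row.

<code rsplus | >
library(dplyr)

# Two grouping layers: Month (outer) and Day (inner)
by_month_day <- group_by(airquality, Month, Day)
group_vars(by_month_day)

# The first summarise() peels off the Day layer; the result is still grouped by Month
daily <- summarise(by_month_day, mean_temp = mean(Temp))
group_vars(daily)

# A second summarise() peels off Month, leaving one row per month and no grouping
monthly <- summarise(daily, mean_temp = mean(mean_temp))
group_vars(monthly)
</code>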
  
-==== dplyr & magrittr NINJA CHALLENGE ====+===== dplyr & magrittr NINJA CHALLENGE ​=====
  
 //Using the ''​ChickWeight''​ dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator. (Hint: ''​first()''​ and ''​last()''​ may be useful here.)// //Using the ''​ChickWeight''​ dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator. (Hint: ''​first()''​ and ''​last()''​ may be useful here.)//
Line 1035: Line 1054:
 ---- ----
  
-==== dplyr - Merging data frames ====+===== dplyr - Merging data frames ​=====
  
 In addition to all the operations we have explored, ''​dplyr''​ also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other ''​R''​ packages: In addition to all the operations we have explored, ''​dplyr''​ also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other ''​R''​ packages:
Line 1046: Line 1065:
 These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs. These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs.
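
For readers who want a first taste, here is a minimal sketch of two ''dplyr'' join functions, ''left_join()'' and ''inner_join()'', applied to two hypothetical data frames (site names, habitats and counts are invented).

<code rsplus | >
library(dplyr)

# Hypothetical lookup table of sites and a table of counts per site
sites  <- data.frame(Site    = c("A", "B", "C"),
                     Habitat = c("forest", "field", "marsh"))
counts <- data.frame(Site  = c("A", "A", "B", "D"),
                     Count = c(3, 5, 2, 7))

# left_join() keeps every row of counts; site D has no match, so its Habitat is NA
left <- left_join(counts, sites, by = "Site")
left

# inner_join() keeps only the rows whose Site appears in both tables
inner <- inner_join(counts, sites, by = "Site")
inner
</code>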
  
-====More on data manipulation====+=====More ​resources ​on data manipulation=====
   * [[https://​www.rstudio.com/​wp-content/​uploads/​2015/​02/​data-wrangling-cheatsheet.pdf|The RStudio Data Wrangling Cheat Sheet]]   * [[https://​www.rstudio.com/​wp-content/​uploads/​2015/​02/​data-wrangling-cheatsheet.pdf|The RStudio Data Wrangling Cheat Sheet]]
   * [[http://​r4ds.had.co.nz/​transform.html|Learn more about ''​dplyr''​]]   * [[http://​r4ds.had.co.nz/​transform.html|Learn more about ''​dplyr''​]]