Differences

This shows you the differences between two versions of the page.

r_workshop4 [2018/09/26 15:39]
shaun.turney [3.2 T-test]
r_workshop4 [2019/08/08 17:52]
mariehbrice [Workshop 4: Linear models]
Line 9: Line 9:
 ====== Workshop 4: Linear models ======
  
-Developed by: Catherine Baltazar, Bérenger Bourgeois, Zofia Taranu, Shaun Turney, William Vieira
+Developed by: Catherine Baltazar, Bérenger Bourgeois, Zofia Taranu, Shaun Turney, Willian Vieira
  
 **Summary:** In this workshop, you will learn how to implement basic linear models commonly used in ecology in R such as simple regression, analysis of variance (ANOVA), analysis of covariance (ANCOVA), and multiple regression. After verifying visually and statistically the assumptions of these models and transforming your data when necessary, the interpretation of model outputs and the plotting of your final model will no longer keep secrets from you!
  
-Link to associated Prezi: [[https://prezi.com/qk2xegtlj44b/|Prezi]]
+**Link to new [[https://qcbsrworkshops.github.io/workshop04/workshop04-en/workshop04-en.html|Rmarkdown presentation]]**
+
+Link to old [[https://prezi.com/qk2xegtlj44b/|Prezi presentation]]
  
 Download the R script and data for this lesson:
Line 94: Line 96:
 In the following sections, we do not always explicitly restate the above assumptions for every model. Be aware, however, that these assumptions are implicit in all linear models, including all models presented below.
  
+====1.4 Test statistics and p-values====
+
+Once you've run your model in R, you will receive a model output that includes many numbers. It takes practice to understand what each of these numbers means and which to pay the most attention to. The model output includes the estimates of the parameters (the β coefficients). The output also includes test statistics. The particular test statistic depends on the linear model you are using (t is the test statistic for linear regression and the t-test, and F is the test statistic for ANOVA).
  
-====1.Work flow====
+In linear models, the null hypothesis is typically that there is no relationship between two continuous variables, or that there is no difference between the levels of a categorical variable. The larger the absolute value of the test statistic, the less compatible the data are with the null hypothesis. This is quantified in the model output by the p-value. It is tempting to think of the p-value as the probability that the null hypothesis is true, but that is a simplification. (Technically, the p-value is the probability that, assuming the null hypothesis is true, the test statistic would be at least as large in magnitude as the observed test statistic.) By convention, we reject the null hypothesis if the p-value is less than 0.05 (5%). This cut-off value is called α (alpha). If we reject the null hypothesis, we say that the alternative hypothesis is supported: there is a significant relationship or a significant difference. Note that we do not "prove" hypotheses, only support or reject them.
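
As an illustration of the technical definition of the p-value, the sketch below fits a simple regression and recomputes the p-value of the slope by hand from its t statistic. The data are simulated and the object names (''x'', ''y'', ''fit'', ''t_obs'', ''p_reported'', ''p_manual'') are hypothetical; they are not part of the workshop's bird data set.

<code rsplus | Recomputing a p-value from the t statistic (illustrative sketch)>
# Simulated data for illustration only (not the workshop data)
set.seed(42)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

# Fit a simple linear regression and extract the coefficient table
fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)

t_obs      <- coefs["x", "t value"]   # observed test statistic for the slope
p_reported <- coefs["x", "Pr(>|t|)"]  # p-value reported by summary()

# Two-sided p-value: probability, under the null hypothesis, of a t statistic
# at least as large in magnitude as the observed one
p_manual <- 2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)

all.equal(p_reported, p_manual)  # the two values should agree
p_reported < 0.05                # compare to the conventional alpha
</code>

The same logic applies to the F statistic of an ANOVA, with ''pf()'' in place of ''pt()'' and a one-sided upper tail.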
+====1.Work flow====
  
 Below we will explore several kinds of linear models. The way you create and interpret each model will differ in the specifics, but the principles behind them and the general work flow will remain the same. For each model we will work through the following steps:
 
-  - Plot the data
+  - Visualize the data (data visualization could also come later in your work flow)
   - Create a model
   - Test the model assumptions
Line 156: Line 162:
 ^ AvgAbund | The average abundance across all sites\\ where found in NA|Continuous/ numeric|
 ^ Mass     | The body size in grams| Continuous/ numeric|
-^ Diet     | Type of food consumed| Discrete – 5 levels (Plant; PlantInsect;\\ Insect; InserctVert; Vertebrate)|
+^ Diet     | Type of food consumed| Discrete – 5 levels (Plant; PlantInsect;\\ Insect; InsectVert; Vertebrate)|
 ^ Passerine| Is it a songbird/ perching bird| Boolean (0/1)|
 ^ Aquatic  | Is it a bird that primarily lives in/ on/ next to the water| Boolean (0/1)|
Line 211: Line 217:
  
 <code rsplus | Testing Normality: hist() function>
-# Plot Y ~ X and the regression line
 # Plot Y ~ X and the regression line
 plot(bird$MaxAbund ~ bird$Mass, pch=19, col="coral", ylab="Maximum Abundance",
Line 573: Line 578:
 === Running a t-test with lm() ===
 
-A t-test is a linear model and a specific case of ANOVA (see below) with one factor with 2 levels. As such, we can also run the t-test with the ''lm()'' function in R:
+A t-test is a linear model and a specific case of ANOVA with one factor with 2 levels. As such, we can also run the t-test with the ''lm()'' function in R:
 
 <code rsplus | T-test as a linear model >
Line 600: Line 605:
 ==== 3.3 Running an ANOVA ====
 
-The t-test is only for a single categorical explanatory variable with 2 levels. For all other variables with categorical explanatory variables we use ANOVA. First, let's visualize the data using ''boxplot()''. Recall that by default, R will order you groups in alphabetical order. We can reorder the groups according to the median of each Diet level. \\ Another way to graphically view the effect sizes is to use ''plot.design()''. This function will illustrate the levels of a particular factor along a vertical line, and the overall value of the response is drawn as a horizontal line.
+The t-test is only for a single categorical explanatory variable with 2 levels. For all other linear models with categorical explanatory variables we use ANOVA. First, let's visualize the data using ''boxplot()''. Recall that by default, R will order your groups in alphabetical order. We can reorder the groups according to the median of each Diet level. \\ Another way to graphically view the effect sizes is to use ''plot.design()''. This function will illustrate the levels of a particular factor along a vertical line, and the overall value of the response is drawn as a horizontal line.
 
 <code rsplus | ANOVA>
Line 685: Line 690:
 ==== 3.6 Complementary test ====
 
-Importantly, ANOVA cannot identify which treatment is different from the others in terms of response variable. To determine this, post-hoc tests that compare the levels of the explanatory variables (i.e. the treatments) two by two, must be performed. While several post-hoc tests exist (e.g. Fischer’s least significant difference, Duncan’s new multiple range test, Newman-Keuls method, Dunnett’s test, etc.), the Tukey’s range test is used in this example using the function ''TukeyHSD'' as follows:
+Importantly, ANOVA cannot identify which treatment is different from the others in terms of the response variable. It can only identify that a difference is present. To determine the location of the difference(s), post-hoc tests that compare the levels of the explanatory variables (i.e. the treatments) two by two must be performed. While several post-hoc tests exist (e.g. Fisher’s least significant difference, Duncan’s new multiple range test, Newman-Keuls method, Dunnett’s test, etc.), Tukey’s range test is used in this example, via the function ''TukeyHSD'', as follows:
 
 <code rsplus| Post-hoc Tukey Test>
Line 1018: Line 1023:
 ==== 6.1 Assumptions ====
 
-As with models seen above, to be valid ANCOVA models must meet the statistical assumptions of linear models that can be verified using diagnostic plots, i.e.
-  - Normal distribution of the model residuals
-  - Homoscedasticty of the residual variance
-  - Independence of the residuals
-  - Equal variance between different levels of a given factor
-In addition, ANOVA models must have:
+As with models seen above, to be valid ANCOVA models must meet the statistical assumptions of linear models that can be verified using diagnostic plots. In addition, ANCOVA models must have:
   - The same value range for all covariates
   - Variables that are //fixed//
Line 1280: Line 1280:
 ----
 
-CHALLENGE 7
+**CHALLENGE 7**
+
+Compare the different polynomial models in the previous example, and determine which model is the most appropriate. Extract the adjusted R squared, the regression coefficients, and the p-values of this chosen model.
 
 ++++ Challenge 7: Solution|