Differences

This shows you the differences between two versions of the page.

r_workshop7 [2016/09/06 13:56]
zofia.taranu [Poisson-lognormal GLMM]
r_workshop7 [2019/08/08 17:57] (current)
mariehbrice [Workshop 7: Generalized linear mixed models]
Line 5: Line 5:
 This series of [[r|10 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology.
 These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
 +
 +//The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.//
  
 ====== Workshop 7: Generalized linear mixed models ======
Line 12: Line 14:
 **Summary:** A significant limitation of linear mixed models is that they cannot accommodate response variables that do not have a normal error distribution. Most biological data do not follow the assumption of normality. In this workshop, you will learn how to use **generalized** linear models, which are important tools to overcome the distributional assumptions of linear models. You will learn the major distributions used depending on the nature of the response variables, the concept of the link function, and how to verify assumptions of such models. We will also build on the previous workshop to combine knowledge on linear mixed models and extend it to generalized linear mixed effect models.
  
-Link to associated Prezi: [[http://prezi.com/suxdvftadzl4/|Prezi]]
+**Link to new [[https://qcbsrworkshops.github.io/workshop07/workshop07-en/workshop07-en.html|Rmarkdown presentation]]**
+
+Link to old [[http://prezi.com/suxdvftadzl4/|Prezi presentation]]
  
 Download the R script and data for this lesson:
Line 50: Line 54:
 </file>
  
-=== Scatterplots === 
  
 Can we see any relationship between Galumna and environmental variables?
Line 75: Line 78:
  
 Indeed, Galumna seems to vary negatively as a function of WatrCont, i.e. Galumna seems to prefer drier sites.
- 
-=== Testing linearity ===  
  
 Fit linear models (function 'lm') to test whether these relationships are statistically significant.
Line 140: Line 141:
 which literally means that any given observation (y<sub>i</sub>) is drawn from a normal distribution with parameters μ (which depends on the value of x<sub>i</sub>) and σ<sup>2</sup>.
  
-=== Problems with lm === 
 Predict Galumna abundance at a water content of 300 using this equation and the linear model that we fitted earlier. You will need values for β<sub>0</sub> and β<sub>1</sub> (regression coefficients) and ε<sub>i</sub> (the deviation of observed values from the regression line).
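
As a rough illustration of this prediction step, the sketch below fits a linear model to simulated data and computes the expected value at a water content of 300, first by hand from the coefficients and then with ''predict()''. The data frame, the object names (''mites_sim'', ''lm_sim'') and the simulation parameters are assumptions made for this example; they are not the workshop's mite dataset, so the numbers will differ from those obtained with the real data.

```r
# Hypothetical stand-in for the mite data (simulated, NOT the workshop dataset)
set.seed(1)
n <- 70
WatrCont <- runif(n, min = 130, max = 830)
# Counts that decline with water content
Galumna <- rpois(n, lambda = exp(2 - 0.007 * WatrCont))
mites_sim <- data.frame(Galumna, WatrCont)

# Linear model: Galumna_i = b0 + b1 * WatrCont_i + e_i, with e_i ~ N(0, sigma^2)
lm_sim <- lm(Galumna ~ WatrCont, data = mites_sim)

b0 <- coef(lm_sim)[["(Intercept)"]]     # estimated intercept
b1 <- coef(lm_sim)[["WatrCont"]]        # estimated slope
sigma_hat <- summary(lm_sim)$sigma      # residual standard deviation

# Expected abundance at WatrCont = 300, computed by hand and with predict()
mu_300 <- b0 + b1 * 300
pred_300 <- predict(lm_sim, newdata = data.frame(WatrCont = 300))

# Under the model, observations at WatrCont = 300 are draws from N(mu_300, sigma^2)
y_draws <- rnorm(1000, mean = mu_300, sd = sigma_hat)

# Proportion of simulated "abundances" that are negative
mean(y_draws < 0)
```

Because the normal distribution is continuous and unbounded, draws such as ''y_draws'' can be non-integer or negative, which is not meaningful for abundance counts.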
  
Line 685: Line 685:
 All the GLMs introduced (Poisson, quasi-Poisson and NB) to model count data use the same log-linear mean function (log(//µ//) = **X**.//β//), but make different assumptions about the remaining likelihood. Quasi-Poisson and NB are favored to deal with overdispersion. However, in some cases the data may contain too many zeros, and zero-augmented models can be useful as they extend the mean function by modifying (typically, increasing) the likelihood of zero counts (e.g. zero-inflated Poisson [ZIP]).
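
The shared mean function can be checked with a small simulation. The sketch below uses made-up overdispersed counts (the data frame ''sim'' and all parameter values are assumptions for illustration only) and fits Poisson and quasi-Poisson GLMs, plus a negative binomial GLM if the MASS package is installed.

```r
# Simulated overdispersed counts (hypothetical data, not the workshop dataset)
set.seed(123)
n <- 200
x <- runif(n, min = 0, max = 2)
mu <- exp(0.5 + 1 * x)               # log-linear mean: log(mu) = 0.5 + 1 * x
y <- rnbinom(n, mu = mu, size = 1)   # variance = mu + mu^2, larger than the mean
sim <- data.frame(x, y)

# Poisson and quasi-Poisson GLMs use the same log link and mean function
glm_pois  <- glm(y ~ x, family = poisson, data = sim)
glm_qpois <- glm(y ~ x, family = quasipoisson, data = sim)

# Coefficient estimates are identical between the two fits
coef(glm_pois)
coef(glm_qpois)

# The quasi-Poisson fit additionally estimates a dispersion parameter
disp <- summary(glm_qpois)$dispersion
disp

# Quasi-Poisson standard errors are the Poisson ones scaled by sqrt(dispersion)
se_pois  <- summary(glm_pois)$coefficients[, "Std. Error"]
se_qpois <- summary(glm_qpois)$coefficients[, "Std. Error"]

# Negative binomial GLM (fitted only if the MASS package is available)
if (requireNamespace("MASS", quietly = TRUE)) {
  glm_nb <- MASS::glm.nb(y ~ x, data = sim)
  print(coef(glm_nb))
}
```

In this sketch the Poisson and quasi-Poisson models differ only in their standard errors, while the negative binomial model changes the assumed variance-mean relationship and can therefore yield slightly different coefficient estimates.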
  
 +===== Other distributions =====
 +  * When the response variable consists of percentages or proportions that do not arise from successes and failures in //n// yes/no experiments (Bernoulli experiments), it is not possible to use the binomial distribution. In this case, it is often advised to perform a **logit transformation** of the data and use an lm(m). See this [[http://onlinelibrary.wiley.com/doi/10.1890/10-0340.1/abstract|interesting article]].
 +  * For data that appear normally distributed after a log-transformation, it can be advisable to use a **log-normal distribution** in a glm instead of log-transforming the data.
 +  * A **Gamma distribution** can also be used. It is similar to a log-normal distribution, but is more versatile.
 +  * The **Tweedie distribution** is a versatile family of distributions that is useful for data with a mix of zeros and positive values (not necessarily counts). See the [[https://cran.r-project.org/web/packages/tweedie/index.html|R Tweedie package]].
 +  * When the data comprise an excess number of zeros that arise from a different process than the one generating the counts, the **zero-inflated** Poisson or zero-inflated negative binomial distributions should be used. These methods are available in the [[http://glmmadmb.r-forge.r-project.org/|glmmADMB package]], among others.
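
Two of the options listed above, the logit transformation followed by a linear model and a Gamma GLM, can be sketched with base R alone. All data here are simulated and the object names (''lm_logit'', ''glm_gamma'') are hypothetical; this is a minimal illustration under those assumptions, not a recommended analysis pipeline.

```r
set.seed(7)
n <- 60
x <- runif(n, min = 0, max = 10)

# --- Logit transformation for continuous proportions ---
# Hypothetical proportions strictly between 0 and 1 (e.g. percent cover / 100)
p <- plogis(-2 + 0.4 * x + rnorm(n, sd = 0.5))

# qlogis() computes log(p / (1 - p)); exact 0s or 1s would give -Inf / Inf
# and would need to be adjusted before transforming
logit_p <- qlogis(p)
lm_logit <- lm(logit_p ~ x)

# Back-transform a prediction to the proportion scale with plogis()
pred_prop <- plogis(predict(lm_logit, newdata = data.frame(x = 5)))
pred_prop

# --- Gamma GLM for positive, right-skewed continuous data ---
# Simulated response with mean exp(0.5 + 0.2 * x) and shape 2
y <- rgamma(n, shape = 2, rate = 2 / exp(0.5 + 0.2 * x))
glm_gamma <- glm(y ~ x, family = Gamma(link = "log"))
coef(glm_gamma)
```

The log link in the Gamma GLM models the mean on the log scale while leaving the response itself untransformed, which is the main practical difference from log-transforming the data before fitting a linear model.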
  
 ===== Generalized linear mixed models (GLMMs) =====
Line 1078: Line 1083:
  
  
-** CHALLENGE 4 **
+** CHALLENGE 3 **
  
 Using the //inverts// dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions:
Line 1087: Line 1092:
   - Finally, once you have determined the best distribution family, re-evaluate your random and fixed effects.
  
-++++ Challenge: Solution 4|
+++++ Challenge: Solution 3|
 <file rsplus | Effect of feeding type and temperature on PLD>
 # Load data
Line 1242: Line 1247:
 B. Bolker (2009) [[http://ms.mcmaster.ca/~bolker/emdbook/|Ecological Models and Data in R]]. Princeton University Press.\\
 A. Zuur et al. (2009) [[http://link.springer.com/book/10.1007/978-0-387-87458-6|Mixed Effects Models and Extensions in Ecology with R]]. Springer.\\
 +
 +**Articles**
 +
 +Harrison, X. A., L. Donaldson, M. E. Correa-Cano, J. Evans, D. N. Fisher, C. E. D. Goodwin, B. S. Robinson, D. J. Hodgson, and R. Inger. 2018. [[https://peerj.com/preprints/3113/|A brief introduction to mixed effects modelling and multi-model inference in ecology]]. PeerJ 6:e4794–32.
  
 **Websites**