# Differences

This shows you the differences between two versions of the page.

Comparing: r_workshop6 [2015/02/25 21:52] zofia.taranu [Poisson-lognormal GLMM] → r_workshop6 [2021/03/23 12:35] lsherin
[[http://qcbs.ca/|{{:logo_text.png?nolink&500|}}]]

- This series of [[r|8 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology.
- These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
- ====== Workshop 6: Generalized linear (mixed) models ======
- Developed by: Cédric Frenette Dussault, Vincent Fugère, Thomas Lamy, Zofia Taranu
- **Summary:** A significant limitation of linear (mixed) models is that they cannot accommodate response variables that do not have a normal error distribution. Unfortunately, most variables in biological datasets are not normally distributed! In this workshop, you will learn how to use **generalized** linear models, which are important tools to overcome the distributional assumptions of linear models. You will learn the major distributions used depending on the nature of the response variables, the concept of the link function, and how to verify assumptions of such models. We will also build on the previous workshop to combine knowledge on linear mixed models and extend it to generalized linear mixed effects models.
- Link to associated Prezi: [[http://prezi.com/suxdvftadzl4/qcbs-r-workshop-6/|Prezi]]
- Download the R script and data for this lesson:
+ This series of [[r|10 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
+ //The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.//
+ **IMPORTANT NOTICE: MAJOR UPDATES**
+ **March 2021 update:** This wiki has been discontinued and is no longer being actively developed or updated.
- - [[http://qcbs.ca/wiki/_media/workshop6_script.r|Script]]
- - [[http://qcbs.ca/wiki/_media/mites.csv|Mites data]]
- - [[http://qcbs.ca/wiki/_media/faramea.csv|Faramea data]]
- - [[http://qcbs.ca/wiki/_media/banta_totalfruits.csv|Arabidopsis data]]
- - [[http://qcbs.ca/wiki/_media/inverts.csv|Inverts data]]
- - [[http://qcbs.ca/wiki/_media/glmm_funs.r|glmm_funs]] //(code to download for use in the GLMM section)//
- ===== Limitations of linear and linear mixed models =====
- To illustrate why generalized linear models are incredibly useful, it is best to first try to understand the limitations of linear models (Workshop 3), which also include linear mixed effects models (Workshop 5). Let's load a dataset and try to fit some linear models.
- setwd("~/Desktop")
- mites <- read.csv('mites.csv')
- The dataset that you just loaded is a subset of the classic "Oribatid mite dataset", which has been used in numerous texts (e.g. Borcard, Gillet & Legendre, //Numerical Ecology with R//) and which is available in the "vegan" library. The dataset consists of 70 moss and mite samples collected at University of Montreal's Station de Biologie des Laurentides, Saint-Hippolyte, QC. Each sample includes measurements for 5 environmental variables and abundance data for 35 mite taxa. In the reduced dataset that we will be using throughout this workshop, we only included the 5 environmental measurements and the abundance of a single mite taxon, "//Galumna sp.//" Our goal will be to model the abundance, occurrence (presence/absence), and proportion of Galumna as a function of the 5 environmental variables; therefore, we also created a presence/absence variable and a proportion variable for Galumna.
- head(mites)
- str(mites)
- # 70 mite communities sampled from moss cores collected at the Station de Biologie des Laurentides, QC.
- # For each core/sample, the following data are provided:
- # $Galumna: abundance of mite genus 'Galumna'
- # $pa: presence (1) or absence (0) of Galumna, irrespective of abundance
- # $totalabund: total abundance of mites of all species
- # $prop: proportion of Galumna in the mite community, i.e. Galumna/totalabund
- # $SubsDens: substrate density
- # $WatrCont: water content
- # $Substrate: type of substrate
- # $Shrub: abundance of shrubs, coded as a factor
- # $Topo: microtopography (blanket or hummock)
- ** CHALLENGE 1 **
- Can we see any relationships between Galumna and environmental variables?
- ++++ Challenge 1: Solution |
+ Updated materials for the QCBS R Workshop Series are now housed on the QCBS R Workshop [[https://github.com/QCBSRworkshops/workshop07|GitHub page]].
+ Available materials include:
+ - The [[https://qcbsrworkshops.github.io/workshop07/pres-en/workshop07-pres-en.html|Rmarkdown presentation]] for this workshop;
+ - An [[https://qcbsrworkshops.github.io/workshop07/book-en/workshop07-script-en.R|R script]] that follows the presentation;
+ - [[https://qcbsrworkshops.github.io/workshop07/book-en/index.html|Written materials]] that accompany the presentation in bookdown format.
+ **January 2021 update:** The structure of Workshops 6 and 7 has been changed.
+ If you are looking for the workshop covering generalized linear models (GLM), please refer to [[r_workshop7|Workshop 6]].
+ If you are looking for the workshop covering linear and generalized linear mixed models (LMM and GLMM), please refer to [[r_workshop6|Workshop 7]].
+ ====== Workshop 7: Linear and generalized linear mixed models ======
+ Developed by: Catherine Baltazar, Dalal Hanna, Jacob Ziegler
+ Section on GLMMs developed by: Cédric Frenette Dussault, Vincent Fugère, Thomas Lamy, Zofia Taranu
+ **Summary:** Mixed effects models allow ecologists to overcome a number of limitations associated with traditional linear models.
- plot(mites)
- {{:6_spm.jpg?800|}}
- There seems to be a negative relationship between Galumna abundance and water content (WatrCont). Occurrence (presence/absence; pa) and proportion (prop) also seem to be negatively correlated with WatrCont.
- ++++
- We can take a closer look by specifically plotting those 3 response variables as a function of WatrCont:
- par(mfrow=c(1,3)) # divide plot area into 1 row and 3 columns to have 3 plots in the same figure
- plot(Galumna ~ WatrCont, data = mites, xlab = 'Water content', ylab = 'Abundance')
- boxplot(WatrCont ~ pa, data = mites, xlab = 'Presence/Absence', ylab = 'Water content')
- plot(prop ~ WatrCont, data = mites, xlab = 'Water content', ylab = 'Proportion')
- par(mfrow=c(1,1)) # reset to default setting
- {{:6_3plot.jpeg?600|}}
- Indeed, Galumna seems to vary negatively as a function of WatrCont, i.e. Galumna seems to prefer drier sites.
- ** CHALLENGE 2 **
- Fit linear models (function 'lm') to test whether these relationships are statistically significant.
- ++++ Challenge 2: Solution |
+ In this workshop, you will learn when it is important to use a mixed effects model to analyze your data. We will walk you through the steps to conduct a linear mixed model analysis, check its assumptions, report results, and visually represent your model in R. We will also build on the previous workshop to combine knowledge on linear mixed models and extend it to generalized linear mixed effects models.
+ **Link to new [[https://qcbsrworkshops.github.io/workshop07/pres-en/workshop07-pres-en.html|Rmarkdown presentation]]**
+ Link to old [[http://prezi.com/5syafj73u0qe/|Prezi presentation]]
+ Link to old [[http://prezi.com/suxdvftadzl4/|Prezi presentation covering GLMMs]]
+ Download the R script and data for this lesson:
+ - [[http://qcbs.ca/wiki/_media/qcbs_w6_data.csv|Fish data]]
+ - [[http://qcbs.ca/wiki/_media/lmm_e.r|R Code]]
+ - [[http://qcbs.ca/wiki/_media/banta_totalfruits.csv|Arabidopsis data]]
+ - [[http://qcbs.ca/wiki/_media/inverts.csv|Inverts data]]
+ - [[http://qcbs.ca/wiki/_media/glmm_e.r|GLMM Script]] //(this is the script for [[r_workshop7|Workshop 6]], but lines 434-784 cover the GLMM section of this workshop)//
+ - [[http://qcbs.ca/wiki/_media/glmm_funs.r|glmm_funs]] //(code to download for use in the GLMM section)//
+ In addition, you can look at [[lmm.workshop|a more up-to-date and detailed presentation of mixed models]].
+ ===== Learning Objectives =====
+ - What is a linear mixed effects model (LMM) and why should I care?
+ - How do I implement LMMs in R?
+ - A priori model building and data exploration
+ - Coding potential models and model selection
+ - Model validation
+ - Interpreting results and visualizing the model
+ - Using LMMs when the assumptions of normality and error distribution are not met (GLMMs).
+ ===== Overview =====
+ Biological and ecological data are often messy. Usually, there is an inherent structure to the data (i.e. single observations are not always independent), relationships between variables of interest might differ depending on grouping factors like species, and more often than not sample sizes are low, making it difficult to fit models that require many parameters to be estimated. Linear mixed effects models (LMM) were developed to deal with these issues. They can be applied to a great number of ecological questions and take many different forms. In this workshop we will use a simple question-based approach to learn the basics of how LMMs operate and how to fit them. We will then conclude with a thought experiment: applying LMMs to a more diverse set of questions.
+ ===== 1. What is a LMM and why should I care? =====
- lm.abund <- lm(Galumna ~ WatrCont, data = mites)
- summary(lm.abund)
- lm.pa <- lm(pa ~ WatrCont, data = mites)
- summary(lm.pa)
- lm.prop <- lm(prop ~ WatrCont, data = mites)
- summary(lm.prop)
- Yes, there is a strong and significant relationship for all 3 response variables!
- ++++
- Wait a minute... Let's validate these models to make sure that we respect the assumptions of linear models, starting with the abundance model.
- plot(Galumna ~ WatrCont, data = mites)
- abline(lm.abund)
- {{:6_lm1.jpeg|}}
- The model does not fit well. It predicts negative abundance values when WatrCont exceeds 600, which does not make any sense. The model also does poorly at predicting high abundance values at low values of WatrCont.
- plot(lm.abund)
- {{:6_diagplots.jpeg|}}
- Diagnostic plots show that the data/model violate the assumptions of homogeneity of variance (the graph on the left shows that residuals are larger at higher fitted values) and normality (the graph on the right indicates that residuals are not distributed as expected from a normal distribution, i.e. many points are far from the predicted values given by the dotted line). Therefore, we need to reject this model, and can't use it to conclude that Galumna abundance varies as a function of water content. The diagnostic plots for the presence-absence model and the proportion model also indicate that these models are inappropriate:
- # Proportion
- plot(prop ~ WatrCont, data = mites)
- abline(lm.prop)
- plot(lm.prop)
- # Presence/Absence
- plot(pa ~ WatrCont, data = mites)
- abline(lm.pa)
- plot(lm.pa)
- It is quite common with biological datasets that the assumptions of homogeneity of variance (homoscedasticity) and normality are not met. These two assumptions are the main problem with linear models, and the main reason why we need generalized linear models. Let's revisit the basic equation for a linear model to better understand where these assumptions come from. The equation for a linear model is:
+ LMMs allow you to use all the data you have instead of using means of non-independent samples; they account for structure in your data (for example, quadrats nested in sites nested in forests); they allow relationships to vary by different grouping factors (sometimes referred to as random effects); and they require fewer parameter estimates than classical regression, which saves you degrees of freedom. But how do they do all of this? These questions will hopefully become clear during this section. First, let's start by getting to know the dataset.
+ ==== 1.1 Introduction to the dataset ====
+ # Remove prior objects in R
+ rm(list=ls())
+ # Place all workshop material in one folder on your computer.
+ # Run the following line of code and use the browsing window to choose the QCBS_W6_Data.csv
+ # file in the folder that contains the workshop material.
+ file.choose()
+ # Set the working directory to the folder which contains the lab material by copying and pasting
+ # all but the R file name from the output of the file.choose() command into the setwd() command.
+ # For example paste "/Users/ziegljac/Documents/QCBS_R/" -> include the quotations
+ # NOT "/Users/ziegljac/Documents/QCBS_R/Get_Data_Func.R" -> include the quotations
+ setwd()
+ # Load useful libraries and data.
+ # If you have never loaded these libraries before, you will have to use
+ # the install.packages() function before the library() function.
+ library(ggplot2)
+ library(lme4)
+ library(arm)
+ library(AICcmodavg)
+ data <- read.csv('QCBS_W6_Data.csv')
+ The dataset we will be using deals with fish trophic positions. Individuals were sampled over a range of body lengths from three different fish species caught in six different lakes.
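Beyond the diagnostic plots discussed above, the same residual assumptions can be probed numerically. This is a minimal sketch, not part of the original workshop script, assuming the mites data and the lm.abund model fitted earlier are in the workspace:

```r
# Hedged sketch: numeric companions to plot(lm.abund), base R only.
# Assumes 'mites' and 'lm.abund' exist as fitted above.

# Shapiro-Wilk test of residual normality (small p-value = non-normal residuals)
shapiro.test(residuals(lm.abund))

# A rough heteroscedasticity check: a strong correlation between absolute
# residuals and fitted values suggests non-constant variance
cor(abs(residuals(lm.abund)), fitted(lm.abund))
```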
- y<sub>i</sub> = β<sub>0</sub> + β<sub>1</sub>x<sub>i</sub> + ε<sub>i</sub>, where:
- * y<sub>i</sub> is the predicted value for the response variable,
- * β<sub>0</sub> is the intercept of the regression line between y and x,
- * β<sub>1</sub> is the slope of the regression line between y and x,
- * x<sub>i</sub> is the value of the explanatory variable,
- * ε<sub>i</sub> are the residuals of the model, which are drawn from a normal distribution with a varying mean but a constant variance.
- This last point about ε<sub>i</sub> is important. This is where the assumptions of normality and homoscedasticity originate. It means that the residuals (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution. Recall that all normal distributions have two parameters, μ (the mean of the distribution) and σ<sup>2</sup> (the variance of the distribution):
- {{:6_normal_params.jpeg|}}
- In a linear model, μ changes based on values of x (the predictor variable), but σ<sup>2</sup> has the same value for all values of y. Indeed, another equation to represent linear models is:
- y<sub>i</sub> ~ //N//(μ = β<sub>0</sub> + β<sub>1</sub>x<sub>i</sub>, σ<sup>2</sup>),
- which literally means that any given observation (y<sub>i</sub>) is drawn from a normal distribution with parameters μ (which depends on the value of x<sub>i</sub>) and σ<sup>2</sup>.
- ** CHALLENGE 3 **
- Predict Galumna abundance at a water content of 300 using this equation and the linear model that we fitted earlier. You will need values for β<sub>0</sub> and β<sub>1</sub> (the regression coefficients) and ε<sub>i</sub> (the deviation of observed values from the regression line).
- ++++ Challenge 3: Solution |
+ Here is a visual representation to help wrap your head around all of this! Note: only three sizes of fish are shown within each species, but in reality there are 10 individuals per species.
+ {{:fig_1_qcbs_wiki.png?800|}}
+ A simple question you might answer with this dataset is: **does fish trophic position increase with fish size for all three fish species?** This will be our motivating question for this workshop.
+ ----
+ **CHALLENGE 1**
+ For your first challenge, reproduce plots 1-3 from the //QCBS_W5_LMM.R// code. Observe the plots and try to get a sense of what is occurring. Two key questions to ask yourself are:
+ - Do I expect all species to increase in trophic position with length? In the exact same way?
+ - Do I expect these relationships to be the same across lakes? What might differ?
+ ++++ Challenge 1 Answer |
+ # Used to strip down figures to make them simpler
+ fig <- theme_bw() + theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank(), panel.background=element_blank()) + theme(strip.background=element_blank(), strip.text.y = element_text()) + theme(legend.background=element_blank()) + theme(legend.key=element_blank()) + theme(panel.border = element_rect(colour="black", fill=NA))
+ # Make the following three plots to explore the data
+ plot <- ggplot(aes(Fish_Length, Trophic_Pos), data=data)
+ # Plot 1 - All Data
+ plot + geom_point() + xlab("Length (mm)") + ylab("Trophic Position") + labs(title="All Data") + fig
+ # Plot 2 - By Species
+ plot + geom_point() + facet_wrap(~ Fish_Species) + xlab("Length (mm)") + ylab("Trophic Position") + labs(title="By Species") + fig
+ # Plot 3 - By Lake
+ plot + geom_point() + facet_wrap(~ Lake) + xlab("Length (mm)") + ylab("Trophic Position") + labs(title="By Lake") + fig
+ **OUTPUT**\\
+ **Plot 1** {{:fig_2_w5.png?300|}} **Plot 2** {{:fig_3_w5.png?300|}} **Plot 3** {{:plot3.png?300|}}
+ Do I expect all species to increase in trophic position with length?
- coef(lm.abund)
- # (Intercept)     WatrCont
- # 3.439348672  -0.006044788
- This model predicts that for a water content value of, say, 300, we should obtain a Galumna abundance of 1.63:
- 3.44 + (-0.006 x 300) = 1.63
- That is the expected abundance if there were no residual. To get our predicted values, we need to add ε<sub>i</sub>. This is where we use the normal distribution. For x = 300, our model predicts that abundance values should be drawn from a normal distribution with mean = 1.63. We can find the standard deviation σ for our abundance model by extracting it from the model summary:
- summary(lm.abund)$sigma
- We find that σ is roughly 1.51. We now have all the coefficients that we need to model Galumna abundance. At a water content of 300, abundance values should follow a normal distribution with parameters μ = 1.63 and σ = 1.51.
- ++++
- At a water content of 400, abundance values should follow a normal distribution with parameters μ = 3.44 + (-0.006 x 400) = 1.02 and σ = 1.51, etc. Each y<sub>i</sub> value is modeled using a different normal distribution, with a mean that depends on x<sub>i</sub>. The spread of all of those normal distributions (σ), however, is the same. The function lm() finds the optimal σ value that minimizes the total residual sum of squares and uses it for all normal distributions used to model y. Graphically, this looks like this:
- {{:6_lm_assump.jpeg|}}
- The four normal distributions on this graph represent the probability of observing a given Galumna abundance for 4 different water content values. The mean of the normal distribution varies as a function of water content (hence μ decreases with water content), but σ is always 1.51 (i.e. the variance is homogeneous across all values of x). This model is inappropriate for at least two reasons:
- 1. Values are on average farther from the regression line at low water content values. That is, there is more residual variance around the predicted values for low values of x, such that ε<sub>i</sub> varies as a function of x, thus violating the assumption of homoscedasticity. It makes no sense to use a constant value of σ: the normal distributions used to predict y at low values of x should ideally be wider (have a larger σ) than the normal distributions used to predict y at large x values, but linear models do not permit this.
- 2. It makes no sense to use a normal distribution to predict y based on x. Our response variable is abundance, which can only take integer values. Yet, at water content = 300, the abundance value that our model predicts to be the most probable is 1.63! We know that the probability of observing 1.63 individuals at water content = 300 is actually zero, as is the probability of observing any fraction (non-integers). Our predicted values should be modeled using a distribution that only predicts integers, rather than a continuous distribution like the normal distribution.
+ In the exact same way? All species seem to increase in trophic position with length, but the slope might be different across species.
+ Do I expect these relationships to be the same across lakes? What might differ? Some parameters specific to one particular lake might change the relationship, such as the primary productivity of the system.
+ ++++
+ ==== 1.2 How do I analyze this dataset? ====
+ === Run many separate analyses ===
+ One way to analyze these data is to fit linear regressions for each species in each lake. Here is a plot of species 1 in lake 1:
+ {{:fig_5_w5.png?400|}}
+ Notice you would have to estimate a slope and an intercept parameter for each regression (2 parameters x 3 species x 6 lakes = 36 parameter estimates), and the sample size for each analysis would be 10. There is a decreased chance of finding an effect due to low sample size, and an increased familywise error rate due to multiple comparisons.
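The separate-regressions approach described above (one lm() per species-by-lake combination) can be sketched in a few lines. This is an illustrative sketch assuming the fish data frame and its column names (Fish_Species, Lake, Trophic_Pos, Fish_Length) as loaded earlier, not the workshop's own code:

```r
# Hedged sketch: one lm() per species-by-lake group.
# Assumes 'data' was read from QCBS_W6_Data.csv as above.
fits <- lapply(split(data, list(data$Fish_Species, data$Lake)),
               function(d) lm(Trophic_Pos ~ Fish_Length, data = d))

# 18 separate models (3 species x 6 lakes), 36 estimated parameters in total
length(fits)

# Inspect one group's intercept and slope
coef(fits[[1]])
```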
- This is a very common problem, as biological data often follow one of a myriad of other statistical distributions besides the normal distribution.
- Generalized linear models can solve these two problems. Keep reading!
- ===== Biological data and distributions =====
- Statisticians have described a multitude of distributions that correspond to different types of data. A distribution provides the probability of observing each possible outcome of an experiment or survey (for example, abundance = 8 Galumna is one such "outcome" of a survey). "Discrete" distributions have a range that only includes integers, while "continuous" distributions have a range that also includes fractions (the normal distribution is an example of a continuous distribution). All distributions have parameters that dictate the shape of the distribution (for example, μ and σ<sup>2</sup> for the normal distribution). For a good overview of statistical distributions, we recommend that you refer to chapter 4 of Ben Bolker's //Ecological Models and Data in R//. Here we just briefly discuss a few distributions that are useful for ecologists and generalized linear modeling.
- We have already seen that our response variable "Galumna abundance" can only take integer values. Abundance, therefore, follows a discrete distribution, with no fractions in its range. A useful distribution to model abundance data is the "Poisson" distribution, named after Siméon Denis Poisson. The Poisson distribution is a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution (i.e. the mean and the variance of a Poisson distribution are equal). Here are 3 examples of Poisson distributions with different values of λ, corresponding in this case to the mean number of Galumna observed in a fictive set of samples:
- {{:6_poisson.jpeg|}}
- Note that at low λ values (when the mean is close to zero) the distribution is skewed to the right, while at large λ values (large mean) the distribution is symmetrical. The variance increases with the mean, predicted values are always integers, and the range of a Poisson distribution is always strictly non-negative; all of these properties are useful for modelling count data, for example the abundance of a given taxon, the number of seeds in a plot, etc. Our variable mites$Galumna seems to follow a Poisson distribution with a low value of λ (indeed, if we calculate the mean abundance of Galumna across all samples using the function mean(), we find that it is close to zero):
- hist(mites$Galumna)
+ === Run one lumped analysis ===
+ Another way to analyze these data is to fit a single linear regression ignoring species and lake. Again, here is the plot of all the data:
+ {{:fig_6_w5.png?400|}}
+ Notice you now have a huge sample size and far fewer parameters to estimate. But what about pseudoreplication? Observations from the same lake and species are likely to be correlated. Also, look at all that noise in the data; surely some of it is due to differences among lakes and species.
+ In this case we simply want to know if trophic position increases with length, and we don't really care that this relationship might differ slightly among species due to biological processes that we didn't measure, or among lakes due to unmeasured environmental variables. This is variation we simply want to control for (sometimes referred to as random factors).
+ === Fit a LMM ===
+ LMMs are a balance between separating and lumping. They:
+ - Provide slope and intercept parameters for each species and lake (separating), but estimate fewer parameters than classical regression.
+ - Use all the data available (lumping), while accounting for pseudoreplication and controlling for differences among lakes and species.
+ ==== 1.3 Fixed and Random Effects ====
+ There is a debate in the literature about the definition of fixed and random effects.
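The mean-equals-variance property of the Poisson distribution described above is easy to verify by simulation with base R's rpois(); the λ value here is arbitrary, chosen only for illustration:

```r
# Hedged sketch: for a Poisson distribution, mean = variance = lambda.
set.seed(42)
x <- rpois(100000, lambda = 1.5)  # lambda chosen arbitrarily for illustration

mean(x)  # close to 1.5
var(x)   # also close to 1.5

# Draws are discrete and strictly non-negative, as count data should be
all(x == round(x) & x >= 0)  # TRUE
```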
There are many possible definitions,​ and we chose to present those we think are easier to apply when doing your analyses. ​ - mean(mites$Galumna) + - ​ + - {{:​6_galumnahist.jpeg|}} + ===Fixed effect=== + When a variable has a fixed effect, data is usually gathered from all it's possible levels. The person doing the analyses is also interested in making conclusions about the levels for which the data was gathered. - Our variable mites$pa (presence-absence) takes yet another form. It consists ​of only zeros and ones, such that a Poisson distribution would not be appropriate to model this variable. + Example ​of a fixed effect: comparing mercury concentration in fish from three different habitats. Habitat has a fixed effect as we sampled in each 3 habitats and we are interested in making conclusions about these specific habitats. - ​ + ===Random effect=== - hist(mites$pa) + Variables with a random effect are also called random factors, as they are only categorical variables ​(not continuous variables). A random effect is observed when the data only includes a random sample of the factor'​s many possible levels, which are all of interest. They usually are grouping factors for which you want to control the effect in your model, but are not interested in their specific effect on the response variable. ​ - ​ + - {{:6_pa.jpeg|}} + Example of a random effect: studying mercury contamination in fish in Ugandan crater lakes. For logistical reasons, you can't sample all the crater lakes, so you sample only 8 of them. However, fish from a given lake might have some sort of correlation between themselves (auto-correlation) since they experience the same environmental conditions. Even though you're not interested in the effect of each lake specifically,​ you should account for this potential correlation with a random factor (crater lake) in order to make conclusions about crater lakes in general. 
====1.4 How do LMMs work?====

===A. Intercepts and/or slopes are allowed to vary by lake and species===
In LMMs, allowing intercepts and/or slopes to vary by certain factors (random effects) simply means you assume they come from a normal distribution. A mean and standard deviation of that distribution are estimated based on your data. The most likely intercepts and slopes from that distribution are then fit by optimization (e.g. maximum likelihood or restricted maximum likelihood).

**Intercepts**

In the case of species, only a mean and standard deviation are estimated for the distribution of species intercepts, instead of three separate species intercepts. The mean of this distribution is the 'species-level model'. In this example, we only have three species. In general, the more levels you have for a given factor, the more accurately the parameters of the distribution can be estimated (three may be a little low for estimating a mean and standard deviation, but it makes simpler graphs!).
When you implement LMMs in R, note that the intercept in the summary is the species-level intercept (i.e. the mean of all random intercepts).

{{:fig_7_w5.png?600|}}

Likewise for lake, only the mean and standard deviation of the lake intercepts are estimated, saving you the need to estimate six different lake intercept parameters. This saves degrees of freedom, as fewer parameter estimates are needed given the amount of data.

{{:fig_8_w5.png?600|}}
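One way to see that the reported intercept is the mean of the random intercepts is to compare ''fixef()'' and ''coef()'' on a fitted model. This is a sketch with simulated data and hypothetical names (not the workshop data set), assuming ''lme4'' is installed:

```r
library(lme4)

# Three species with different true intercepts and a common slope
set.seed(2)
d <- data.frame(
  Species = factor(rep(c("A", "B", "C"), each = 30)),
  Length  = rnorm(90)
)
d$TP <- 2 + c(-0.4, 0.1, 0.3)[d$Species] + 0.5 * d$Length + rnorm(90, 0, 0.2)

m <- lmer(TP ~ Length + (1 | Species), data = d)
fixef(m)["(Intercept)"]           # the species-level (mean) intercept shown in summary()
coef(m)$Species[, "(Intercept)"]  # each species' intercept = mean + its random deviation
ranef(m)$Species                  # the deviations themselves, centred near zero
```

Each species' fitted intercept in ''coef()'' is exactly the fixed (mean) intercept plus that species' deviation in ''ranef()''.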
**Slopes**

The same concept is used for allowing slopes to vary by a given factor (random effect). This is a little harder to visualize than the intercepts. In the case of species, only a mean and standard deviation of the slope parameters are estimated, instead of three separate slopes. Again, when you implement LMMs in R, the slope in the summary is the species-level slope.

{{:fig_9_w5.png?600|}}
===B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of the data===
If a certain species or lake is data-poor, as in the example below (we have a balanced design, so this is not the case in our scenario), the model will rely more heavily on the group-level model for the intercept and slope of that species or lake.

{{:fig_12_w5.png?600|}}

Confidence intervals of intercepts and slopes are adjusted to account for pseudoreplication based on the **intraclass correlation coefficient (ICC)**: how much variation is there among groups versus within groups? This determines your **effective sample size**: an adjusted sample size based on how correlated the data within groups are.

**High ICC**

{{:fig_10_w5.png?400|}}
In this scenario, the LMM will treat points within a lake more like an overall mean because they are highly correlated. Therefore, the effective sample size will be smaller, leading to larger confidence intervals around your slope and intercept parameters.

**Low ICC**
{{:fig_11_w5.png?400|}}

In this scenario, the LMM will treat points within a lake more independently because the data are less correlated within groups compared to among groups. Therefore, the effective sample size will be larger, leading to smaller confidence intervals around your slope and intercept parameters.

----
**CHALLENGE 2**

For your second challenge, answer these two questions with your neighbours.
How will the ICC and confidence intervals be affected in these two scenarios:
-Fish trophic positions are not variable among lakes?
-Fish trophic positions are similar within lakes but variable among lakes?
++++ Challenge 2 Answer |
-Low ICC, smaller confidence intervals
-High ICC, larger confidence intervals
++++
----
=====2. How do I implement LMMs in R?=====

The mixed model protocol in R:
2.1: A priori model building and data exploration
2.2: Coding potential models and model selection
2.3: Model validation
2.4: Interpreting results and visualizing the model

====2.1 A priori model building and data exploration====

We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes. So we know we want a model that looks something like this:

<m>{TP_ijk} ∼ {Length_i} + {Lake_j} + {Species_k} + {ε}</m>
where,\\
<m>{TP_ijk}</m> is the trophic position of individual (i) from lake (j) of species (k)\\
and\\
ε are the residuals of the model (i.e. the unexplained variation).\\

**Data Exploration**

Always make sure you've done "Housekeeping" on your data before you start building models! Is the data in the right structure? Use the following code to view the structure and variable types within your data frame.
<code>
################ Section 2 #########################
# Running a mixed model in R
# Four-step process to build a mixed model in R

# 1) A priori model building and data exploration ####
# i)  Map out the model based on a priori knowledge
# We know that we want to build a model that evaluates the relationship
# between trophic position and length, while accounting for lake and species variation
# Trophic Position ~ Length + Species + Lake

# ii) Housekeeping and data exploration
# Ensure that the structure of your data is correct
str(data)
</code>
++++ Output |
{{:fig_13_w5.png?600|}}
++++

Look at the distribution of samples across factors:
<code>
# Look at sample distribution across factors to check
# if there are any major unequal distributions
table(data$Lake)
table(data$Fish_Species)
</code>

++++ Output |
**Lake**\\
----
{{:fig_15_w5.png?200|}}\\
----
**Species**\\
----
{{:fig_14_w5.png?100|}}\\
----
++++

This data set is perfectly balanced, but mixed models can also deal with unbalanced designs (as we so often have to in ecology!).
Look at the distribution of continuous variables:
<code>
# Look at distribution of continuous variables
# Transform if necessary (this will avoid future problems with homogeneity of model residuals)
hist(data$Fish_Length)
hist(data$Trophic_Pos)
</code>
++++ Output |
**Length**\\
----
{{:fig_16_w5.png?400|}}\\
----
**Trophic Position**\\
----
{{:fig_17_w5.png?400|}}\\
----
++++
Major skews can lead to problems with model homogeneity down the road. So if necessary, make transformations. In this case, the data seem OK.

Check for collinearity between variables:
It is important not to include two collinear variables in the same model; otherwise their effects on the response variable will be confounded, i.e. the model can't tell which of the collinear variables is responsible for the variation in the response variable. By default, the model would give a lot of explanatory power to the first variable in the model, and little power to the following variables.

In this example, we have only one explanatory variable, so there is no risk of collinearity. If you did have another explanatory variable (Var2) and wanted to check for collinearity, you could use:
<code>
# Check for collinearity between variables
plot(data)
cor(data$Fish_Length, data$Var2)
</code>

----
**CHALLENGE 3**\\
What are some additional measures we could have gotten in the wild that might have been collinear with length?
++++ Challenge 3 Answer |
One example is fish mass, which is highly correlated with fish length. Therefore, we would not want to include both in the same model as predictors.
++++
----

Consider the scale of your data:\\
If two variables in the same model have very different ranges of scale, the criteria mixed models use to come up with parameter estimates are likely to return 'convergence errors'.
Z-correction standardizes your variables and adjusts for this scaling problem by placing all your variables on the same scale, even if they were originally in different units:\\
<m>{z} = ({x} - {mean(x)}) / {sd(x)}</m>\\

Because the data we are dealing with have such variable scales (length is on a longer scale than trophic position), we z-correct it.
<code>
# Consider the scales of your variables
# Note: when 2 variables have very different ranges of scale, the criteria mixed models
# use to come up with parameter estimates are likely to return 'convergence errors'
# Z-correcting adjusts for this scaling problem:
# What is a z-correction?: z = (x - mean(x)) / sd(x)
# Z-correct Length
data$Z_Length <- (data$Fish_Length - mean(data$Fish_Length)) / sd(data$Fish_Length)
# Z-correct Trophic Position
data$Z_TP <- (data$Trophic_Pos - mean(data$Trophic_Pos)) / sd(data$Trophic_Pos)
</code>
To know if a mixed model is necessary for your data set, you must establish whether it is actually important to account for variation in the factors that might be affecting the relationship that you're interested in: Lake and Species, in this case.\\
We can do this by:
-Creating a linear model without factors
-Calculating the residuals of this linear model
-Plotting the residuals of the linear model against the factors of interest

<code>
# Find out if it is important to account for variation in "random effects"
# by comparing the residuals of a linear model without the random effects with
# the potential random effects
lm.test <- lm(Z_TP ~ Z_Length, data = data)
lm.test.resid <- rstandard(lm.test)
# Species effect
plot(lm.test.resid ~ data$Fish_Species, xlab = "Species", ylab = "Standardized residuals")
abline(0, 0, lty = 2)
# Lake effect
plot(lm.test.resid ~ data$Lake, xlab = "Lake", ylab = "Standardized residuals")
abline(0, 0, lty = 2)
</code>

++++ Output |
**Species**\\
----
{{:fig_18_w5.png?400|}}\\
----
**Lake**\\
----
{{:fig_19_w5.png?400|}}\\
----
++++

For this model, we should keep the random effects because the standardized residuals show variation across lakes and species.

====2.2 Coding potential models and model selection====
Our a priori model\\
<m>{TP_ijk} ∼ {Length_i} + {Lake_j} + {Species_k} + {ε}</m>\\
coded in R looks like this:\\
{{:fig_20_w5.png?500|}}

The lmer structure is not intuitive. The basic parts of the function are:\\
{{:fig_21_w5.png?500|}}\\

REML (restricted maximum likelihood) is the default estimation method in the lmer function. REML estimates can be used when comparing models with the same fixed effects (i.e. nested models). However, if you are comparing models whose fixed effects differ, then maximum likelihood should be used to fit the parameters, as this method is not dependent on the coefficients of the fixed effects. Fitting with maximum likelihood is done by setting REML=FALSE in the lmer command.
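The REML-versus-ML workflow can be sketched as follows. This uses a simulated stand-in for the workshop's z-corrected data frame (the data values are made up; only the variable names follow the workshop), and assumes ''lme4'' is installed:

```r
library(lme4)

# Simulated stand-in for the workshop's z-corrected data frame
set.seed(3)
data <- data.frame(
  Z_Length     = rnorm(180),
  Lake         = factor(rep(1:6, each = 30)),
  Fish_Species = factor(rep(1:3, times = 60))
)
data$Z_TP <- 0.5 * data$Z_Length + rnorm(6, 0, 0.4)[data$Lake] + rnorm(180, 0, 0.3)

# Models differing in FIXED effects: fit with maximum likelihood (REML = FALSE)
m.full <- lmer(Z_TP ~ Z_Length + (1 | Lake) + (1 | Fish_Species), data = data, REML = FALSE)
m.null <- lmer(Z_TP ~ 1        + (1 | Lake) + (1 | Fish_Species), data = data, REML = FALSE)
AIC(m.full, m.null)  # likelihood-based comparison is valid because REML = FALSE

# Report parameter estimates from the retained model refit with REML:
m.best <- update(m.full, REML = TRUE)
```

The key point is the two-step pattern: compare with ''REML = FALSE'', then refit the chosen model with ''REML = TRUE'' for the estimates you report.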
What about coding varying slopes as well as intercepts?\\
{{:fig_22_w5.png?500|}}\\

----
**CHALLENGE 4**\\
Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species.
<code>
lmer(Z_TP ~ Z_Length + (1 | Lake) + (1 | Species), data = data, REML = TRUE)
</code>
++++ Challenge 4 Answer |
<code>
lmer(Z_TP ~ Z_Length + (1 + Z_Length | Lake) + (1 + Z_Length | Species), data = data, REML = TRUE)
</code>
++++
----

To determine whether you've built the best linear mixed model based on your a priori knowledge, you must compare this a priori model to possible alternative models. With the data set we are working with, there are many potential models that might best fit the data.

----
**CHALLENGE 5**\\
Make a list of 7 alternative models to the following model that could be built and compared.
Note: if we had different fixed effects among the models, we would have to indicate REML=FALSE in order to compare them with likelihood-based methods like AIC (see explanation above). However, you should report parameter estimates from the 'best' model using REML=TRUE.
- + - library(ggplot2) + lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE) - ggplot(mites, aes(x = WatrCont, y = pa)) + geom_point() + + ​ - stat_smooth(method ​= "​glm"​, family= "​binomial"​, se = FALSE) + xlab("Water content"​) + + ++++ Challenge 5 Answer | - ylab("​Probability of presence"​) + + ​ - ggtitle("​Probability of presence of Galumna sp. as a function of water content"​) + m1 <- lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE) - + m2 <- lmer(Z_TP~Z_Length ​+ (1+Z_Length|Lake) + (1+Z_Length|Species),​ data = data, REML=TRUE) + m3 <- lmer(Z_TP~Z_Length + (1|Species),​ data = data, REML=TRUE) + m4 <- lmer(Z_TP~Z_Length + (1|Lake), data = data, REML=TRUE) + m5 <- lmer(Z_TP~Z_Length ​+ (1+Z_Length|Species), data = data, REML=TRUE) + m6 <- lmer(Z_TP~Z_Length ​+ (1+Z_Length|Lake),​ data = data, REML=TRUE) + m7 <- lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1|Species), data = data, REML=TRUE) + m8 <- lmer(Z_TP~Z_Length + (1|Lake) + (1+Z_Length|Species),​ data = data, REML=TRUE) - {{::​logistic_regression_plot.png?​500|}} + #Bonus model! - ==== An example with proportion ​data ==== + M0<​-lm(Z_TP~Z_Length,​data=data) + # It is always useful to build the simple linear model without any varying intercept and slope factors to see the difference in AICc values. Although "​lm"​ does not use the same estimation method + ​ + ++++ + ---- - Proportion data are more similar ​to binary data than you might think! Suppose you’re in the field to estimate the prevalence of a disease within wild populations ​of white-tailed deer along an environmental gradient of resource availability. You sample ten individuals ​in ten different populations ​to have an estimate of disease prevalence. Your data sheet has ten lines with the following information:​ population ID and number of individuals ​with the disease. 
At first sight, it seems you have count data, but this is inaccurate because you know how many individuals were sampled in each population. In fact, you have proportion data: it is the number of individuals with the disease out of ten sampled individuals. If you remember what you've read in the section on distributions, you should choose a binomial distribution to model these data. Let's illustrate this with an example: + Now that we have a list of potential models, we can compare them to get a better idea of what is going on in our dataset and to select the one(s) with the highest predictive power given the data. Models can be compared using the "AICc" function from the "AICcmodavg" package. The Akaike Information Criterion (AIC) is a measure of model quality that can be used to compare models. + AICc corrects for bias created by small sample size when estimating AIC. - + - # Let's generate some data based on the deer example: + # 2) Coding potential models and model selection#### - # We randomly choose a number between 1 and 10 for the number of infected deer. + # i) Coding all potential models - # Ten deer were sampled in ten populations. + # List of all potential models--> - # Resource availability is an index to characterise the habitat. + # Note: you can choose not to code ones that do not make biological sense. - set.seed(123) + # Linear model with no random effects - n.infected <- sample(x = 1:10, size = 10, replace = TRUE) + M0<-lm(Z_TP~Z_Length,data=data) - n.total <- rep(x = 10, times = 10) + # Full model with varying intercepts - res.avail <- rnorm(n = 10, mean = 10, sd = 1) + M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data, REML=FALSE) - # Next, let's build the model. Notice how the proportion data is specified.
+ # Full model with varying intercepts and slopes - # We have to specify the number of cases where disease was detected + M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake),​ data=data, REML=FALSE) - # and the number of cases where the disease was not detected. + # No Lake, varying intercepts only - prop.reg ​<- glm(cbind(n.infected, n.total - n.infected) ~ res.avail, family ​= binomial) + M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data, REML=FALSE) - summary(prop.reg) + # No Species, varying intercepts only - # If your data is directly transformed into proportions, here is the way to do it in R: + M4<​-lmer(Z_TP~Z_Length + (1|Lake), data=data, REML=FALSE) - # Let's first create a vector of proportions + # No Lake, varying intercepts ​and slopes - prop.infected ​<- n.infected / n.total + M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data, REML=FALSE) - # We have to specify the "​weights"​ argument in the glm function to indicate the number of trials per site + # No Species, varying intercepts and slopes - prop.reg2 ​<- glm(prop.infected ​~ res.avail, family ​= binomial, weights ​= n.total) + M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake),​ data=data, REML=FALSE) - summary(prop.reg2) + # Full model with varying intercepts and slopes only varying by lake - # The summaries of both prop.reg ​and prop.reg2 are identical! + M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE) - + # Full model with varying intercepts ​and slopes only varying by species + M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE) - ===== GLMs with count data ===== - To illustrate count data we will use a new dataset called faramea. 
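The equivalence of the two binomial specifications above (a ''cbind()'' of successes and failures vs. proportions with a ''weights'' argument) can be verified directly. This sketch simply re-runs the deer example both ways:

<code rsplus>
# Simulate the deer data as above
set.seed(123)
n.infected <- sample(x = 1:10, size = 10, replace = TRUE)
n.total <- rep(x = 10, times = 10)
res.avail <- rnorm(n = 10, mean = 10, sd = 1)

# Specification 1: two-column matrix of (successes, failures)
m1 <- glm(cbind(n.infected, n.total - n.infected) ~ res.avail, family = binomial)
# Specification 2: proportions, with the number of trials as weights
m2 <- glm(n.infected/n.total ~ res.avail, family = binomial, weights = n.total)

# Both specifications should yield the same coefficients
all.equal(coef(m1), coef(m2))
</code>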
+ # ii) Compare models using AICc values + # Compute AICc values for each model + AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8)) + # Put values into one table for easy comparison + Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8") + AICtable<-data.frame(Model=Model, AICc=AICc) + AICtable + # M8 has the lowest AICc value, so it is the best-fit model + # M2 is also a good fit, but all other models are not nearly as good. + # Note: when you compare models they must be fit by + # Maximum Likelihood (ML) and not by Restricted Maximum Likelihood (REML) + - + ++++ Output| - faramea <- read.csv("faramea.csv", header = TRUE) + {{:fig_23_w5.png?200|}} - + ++++ - In this dataset, the number of trees of the species //Faramea occidentalis// was measured in 43 quadrats on Barro Colorado Island in Panama. For each quadrat, environmental characteristics were also recorded, such as elevation or precipitation. Let's take a look at the number of //Faramea occidentalis// found at each quadrat. + The model with the lowest AICc value has the most predictive power given the data. Some suggest that if models are within 2 AICc units of each other then they are equally plausible. In this case, we can take a closer look at M8 and M2, but all other models have such high AICc values that they do not explain as much variation in the data as M8 and should not be considered plausible options to explain the patterns in the data. + ---- + **CHALLENGE 6**\\ + Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that its AICc value was 2nd best?
- + - {{ :plot_faramea_abd.png?500 |}} + + M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE) + - We can see that there are only positive and integer values. Given these specificities, the Poisson distribution, + ++++ Challenge 6 Answer | - described above, seems to be the perfect choice to model this data. + M2: The trophic position is a function of length, and both the intercept and the effect of length on trophic position can vary by fish species and lake. + + M8: The trophic position is a function of length, and both the intercept and the effect of length on trophic position can vary by fish species, but only the intercept can vary by lake (not the slope of trophic position on length). - ==== Poisson GLMs ==== + Biologically speaking, M2 says that intrinsic factors of species (e.g. growth rates) and lakes (e.g. productivity, community composition, etc.) cause the relationship between trophic position and length to differ (i.e. both slopes and intercepts), whereas M8 says that intrinsic factors of species alone cause this relationship to differ (i.e. slopes) and that on average trophic positions might be higher or lower in one lake versus another (i.e. intercepts). - As we saw above, the Poisson distribution is particularly relevant to model count data because it: + - * specifies the probability only for integer values + - * P(y<0) = 0, hence the probability of any negative value is null + - * the mean-variance relationship allows for heterogeneity (e.g. when variance generally increases with the mean) + - In this example, we want to test whether elevation (a continuous explanatory variable) influences //Faramea occidentalis// abundance. Hence, a Poisson GLM (i.e. a simple Poisson regression) seems to be a good choice to model the number of //Faramea occidentalis// as a function of elevation. Poisson GLMs are usually a good way to start modeling count data.
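The mean-variance property listed above is easy to check by simulation. A small sketch, not part of the original workshop code:

<code rsplus>
# For a Poisson variable, the variance equals the mean
set.seed(42)
y <- rpois(n = 100000, lambda = 4)
mean(y) # close to 4
var(y)  # also close to 4, since Var(Y) = E(Y) for the Poisson
</code>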
+ These models are very similar in structure, and their close AICc values reflect this. The added complexity of allowing varying slopes by lake in M2 does not improve the model's predictive power when compared to M8. + ++++ - {{ ::plot_faramea_vs_elevation.png?400 |}} + ---- + ====2.3 Model validation==== + To check independence, plot model fitted values vs residual values: + + # 3) Checking model assumptions#### + # Checking for M8 + # A. Look at independence: plot fitted values vs residuals + E1 <- resid(M8) + F1<-fitted(M8) + plot(x = F1, + y = E1, + xlab = "Fitted Values", + ylab = "Normalized residuals") + abline(h = 0, lty = 2) + + ++++ Output| + {{:fig_24_w5.png?400|}} + ++++ - **But what does a Poisson GLM do?**\\ + The even spread of the residuals suggests that the model is a good fit for the data. - It assumes that the response variables Y<sub>i</sub> have been generated by a Poisson distribution with mean and variance //µ//<sub>i</sub>. + - Y<sub>i</sub> ∼ Poisson(//µ//<sub>i</sub>) + To check the assumption of homogeneity, plot residuals vs each covariate in the model: - with E(Y<sub>i</sub>) = Var(Y<sub>i</sub>) = //µ//<sub>i</sub> + + # B. Look at independence: + # i.
plot residuals vs each covariate in the model + # Fish_Length + plot(x = data$Z_Length, + y = E1, + xlab = "Z Length", + ylab = "Normalized residuals") + abline(h = 0, lty = 2) + # Note: observed groupings are created by the nature of the data because in the data set + # we only measured individuals from 5 categories of lengths (big, small, and three groups in between) + # Species + boxplot(E1 ~ Fish_Species, + ylab = "Normalized residuals", + data = data, xlab = "Species") + abline(h = 0, lty = 2) + #Lake + boxplot(E1 ~ Lake, + ylab = "Normalized residuals", + data = data, xlab = "Lake") + abline(h = 0, lty = 2) + + ++++ Output| + ---- + {{:fig_25_w5.png?400|}} + ---- + {{:fig_26_w5.png?400|}} + ---- + {{:fig_27_w5.png?400|}} + ---- + ++++ + The equal spread above and below zero indicates that there are no homogeneity problems with these variables. - Recall from above that //µ//<sub>i</sub> can be replaced by the systematic model\\ + Ideally, you would also do the above analysis with every covariate not in your model. If you observe patterns in these plots, you will know that there is variation in your dataset that could be accounted for by these covariates, and so you should reconsider their inclusion in your model. However, because there are no covariates we measured that we decided not to include in the model, this point is not applicable to this analysis. - Y<sub>i</sub> ~ Poisson(β<sub>0</sub> + β<sub>1</sub>X<sub>i</sub>) + It is also important to check the distribution of residual error. Normally distributed residuals indicate that your model predictions are not biased high or low: - A systematic part is used as a linear combination of unknown parameters //β// representing the effects of different explanatory variables. **X** is the covariate matrix (which does not include an intercept term in this example). This systematic part is defined as: + + # D.
Look at normality: histogram + hist(E1) + ​ + ++++ Output| + {{:​fig_28_w5.png?​400|}} + ++++ - //​β//<​sub>​0​ + **X**<​sub>​i​.//β// + ====2.4 Interpreting results and visualizing the model==== - The mean of the distribution //µ//i​ is related ​to the systematic part using a logarithm link function. As a result, ​the relationship between ​the mean and the linear predictor is log-linear. + You can view the model summary using: - * log(//​µ//<​sub>​i​) = //​β//<​sub>​0 + **X**<​sub>​i​.//β// + - or + # Look at model summary - * //​µ//<​sub>​i​ = exp(//​β//<​sub>​0​ + **X**<​sub>​i​.//​β//​) + # This allows you to get an idea of the variance explained by the different components + # of the model and the "​significance"​ of fixed effects + summary(M8) + + ++++ Output| + {{:​fig_29_w5.png?500|}} + ++++ + The output is broken up into descriptions of the Random effects ​(things we allow to vary based on the normal distribution) and Fixed effects (things we estimate in the same way as classical regression): - The Poisson distribution gives you the probability that a particular Y<​sub>​i​ value is observed for a given mean //​µ//<​sub>​i​ = exp(//​β//<​sub>​0​ + **X**<​sub>​i​.//​β//​). In this model, the unknown parameters are included in the vector of regression coefficients //β// (plus the intercept //​β//<​sub>​0​) and can be estimated using maximum-likelihood (ML) estimation. + {{:​fig_30_w5.png?500|}} - + - Fitting a Poisson GLM in R requires only setting //family = poisson// in the function glm(). By default the link function is log. 
+ - + {{:​fig_31_w5.png?​500|}} - glm.poisson = glm(Faramea.occidentalis~Elevation,​ data=faramea,​ family=poisson) + - summary(glm.poisson) + - ​ + - The output ​is similar ​to a '​**lm**'​ output ​(see workshop 3) and gives you the parameter ​estimates which can also be retrieved using other functions:​ + To determine if the slope and, therefore, the effect of length on trophic position ​is significantly different from zero you first have to calculate the confidence interval ​(CI) for the slope parameter (estimate for Z_Length in the fixed effects section = 0.4223). The CI = Standard Error of the estimate x 1.96 plus or minus the parameter estimate. If the CI overlaps with zero, then the slope is not significantly different from zero at the 0.05 level. - ​ + - # intercept + - summary(glm.poisson)$coefficients[1,1] + - # slope of elevation + - summary(glm.poisson)$coefficients[2,​1] + - ​ + + {{:​fig_32_w5.png?​500|}} ​ + ---- + **CHALLENGE 7**\\ + a) What is the slope and confidence interval of the Z_Length variable in the M8 model? - ==== Model validation and the problem of overdispersion ==== + b) Is the Z_Length slope significantly different from 0? - An important aspect ​of the summary can be found in the last lines. + ++++ Challenge 7 Answer | - <​file ​rsplus | Poisson GLM summary> + a) What is the slope and confidence interval ​of the Z_Length variable ​in the M8 model? - summary(glm.poisson) + -Slope = 0.4223 - # Null deviance: 414.81  on 42  degrees of freedom + - # Residual deviance: 388.12  on 41  degrees of freedom + <​file>​ + upper CI = 0.4223 + 0.09*1.96 = 0.5987 + lower CI = 0.4223 - 0.09*1.96 = 0.2459 ​ - Remember that ML estimation is used to estimate ​the parameters. In the goodness-of-fit section we mentioned that the deviance was a ML equivalent of the sum of squares in linear models. Here, the null deviance and the residual deviance are equivalent to the total sum of squares and the residual sum of squares respectively. 
The residual deviance is defined as twice the difference between the log likelihood of a model that provides a perfect fit to the data (a saturated model) and the log likelihood of the model. If our model is correct, the asymptotic distribution of the residual deviance is approximated using a χ² distribution with //n//-//p//-1 degrees of freedom (where //n// is the number of observations and //p// the number of covariates). This implies that the residual deviance should be approximately equal to the residual degrees of freedom. In our example, the residual deviance equals 388.12 while we have 41 (43-1-1) degrees of freedom. The former is about 9.5 times greater than the latter, so the model is **overdispersed**. + b) Is the Z_Length slope significantly different from 0? + -Yes, because the CI does not overlap with 0 + ++++ + ---- - **Overdispersion** + To visualize this model you must obtain the coefficients (intercept and slope) of each component of the model. - As a consequence, overdispersion can be computed for any model using the parameter φ: + Overall, our group-level model coefficients (aka "coefs"; in this case, just one intercept and one slope) can be found in the summary of the model in the fixed effects section. Coefs for each of the model levels (in this case: Lake and Species) that were fit from a normal distribution can be obtained using the "coef()" function. - φ = residual deviance / residual degrees of freedom + Two ways to visualize these data are: - * φ < 1 indicates underdispersion + -To show the group level model - * φ = 1 indicates no overdispersion + -To show species and lake level models + - * φ > 1 indicates overdispersion - Why does a Poisson GLM exhibit overdispersion? This arises when the variance of the data is higher than expected from the Poisson distribution. This frequently occurs when data include many zeros and/or many very high values.
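The dispersion parameter φ defined above can be computed directly from a fitted model. A sketch, assuming the ''glm.poisson'' object fit earlier in this section:

<code rsplus>
# phi = residual deviance / residual degrees of freedom
phi <- glm.poisson$deviance / glm.poisson$df.residual
phi # values well above 1 indicate overdispersion
</code>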
Looking back at the distribution of our data (above) suggests that they contain many zeros, preventing us from using a Poisson GLM. Overdispersion may also result from missing covariates, missing interaction terms, or the presence of strong outliers. + 1. To show the group level model:\\ - The Poisson distribution can account only partially for heterogeneity in the data due to the mean-variance relationship, but in some cases the variance increases even faster than the mean. Computing the mean and the variance of our dataset suggests this is occurring: + Obtain the parameters of interest\\ - + {{:fig_33_w5.png?600|}} - mean(faramea$Faramea.occidentalis) + - var(faramea$Faramea.occidentalis) + - + - In practice, Poisson GLMs are useful for describing the mean //µ//<sub>i</sub> but underestimate the variance in the data, making all model-based tests too liberal! There are two ways of dealing with overdispersion, which will be developed below: + and plot the data with the model overlaid - * correct for it by doing a **quasi-Poisson GLM** + - * choose another distribution such as the **negative binomial** + # Visualizing model results#### - ==== quasi-Poisson GLMs ==== + # There are several ways of visualizing the results of a mixed model, all of which - The principle behind a quasi-Poisson GLM is very simple; the overdispersion parameter (φ) is added to the expected variance equation:
​ + # So the first step is to obtain the model coefficients to be able to add them to the figures + coef(M8) + # Now put the coefs into dataframes to make them more easy to manipulate + Lake.coef <- as.data.frame(coef(M8)$Lake) + colnames(Lake.coef) <- c("​Intercept","​Slope"​) + Species.coef <- as.data.frame(coef(M8)$Fish_Species) + colnames(Species.coef) <- c("​Intercept","​Slope"​) - E(Y​i​) = //µ//i + # Plot 1 - All Data + # Make a plot that includes all the data + plot <- ggplot(aes(Z_Length,​Z_TP),data=data) + Plot_AllData ​<- plot + geom_point() + xlab("​Length (mm)") + ylab("​Trophic Position"​) + + labs(title="​All Data") + fig + # Add a layer that has an abline with the intercept and slope of the relationship between length and + # trophic position. + # Note that you can obtain the intercept and slope of the fixed factor directly from the model summary + summary(M8) + Plot_AllData + geom_abline(intercept = -0.0009059, slope =0.4222697) + + ++++ Output| + {{:​fig_34_w5.png?​500|}} + ++++ - Var(Y<​sub>​i​) = φ.//​µ//<​sub>​i​ + 2. To show species and lake level models:\\ - The systematic part and the link function remains the same. The difference is that φ will first be estimated to correct the model. Parameter Estimates will be the same but the standard errors ​of the parameters are multiplied by √φ, in other terms, some marginally significant p-values may no longer hold. + Obtain ​the parameters ​of interest\\ - In R, The '​**quasipoisson**'​ family object can be used to deal with count data exhibiting overdispersion (the '​**quasibinomial**'​ family object can do the same for binomial data). The fitted φ value will be returned in the summary of the GLM. 
There are two ways to perform this GLM: + {{:fig_35_w5.png?200|}} - + and plot the data with the model overlaid: - # Option 1, fit a new quasi-Poisson GLM + - glm.quasipoisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=quasipoisson) + # Plot 2 - By Species - # Option 2, build from the previous model and update it: + # Plot the data color coded by Species - glm.quasipoisson = update(glm.poisson,family=quasipoisson) + Plot_BySpecies<-plot + geom_point(aes(colour = factor(Fish_Species)), size = 4) + xlab("Length (mm)") + - # output + ylab("Trophic Position") + labs(title="By Species") + fig - summary(glm.quasipoisson) + # Add the regression line with the intercepts and slopes specific to each species - + Plot_BySpecies + geom_abline(intercept = Species.coef[1,1], slope =Species.coef[1,2], colour="coral2") + + geom_abline(intercept = Species.coef[2,1], slope =Species.coef[2,2], colour = "green4") + + geom_abline(intercept = Species.coef[3,1], slope =Species.coef[3,2], colour="blue1") - If you look at the summary of the model, you will see that φ is estimated as 15.97. We then made a good choice by updating the model to account for overdispersion. However, if we look at the p-values, we note that elevation is no longer significant. Yet, 15.97 is quite a lot of overdispersion; in general, quasi-Poisson GLMs are favoured when φ falls between 1 and 15, though these limits are arbitrary. When overdispersion is higher than 15-20, we recommend moving to the negative binomial. For the sake of pedagogy, we will assume that we are not satisfied with this model and fit a final negative binomial GLM to our data.
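The φ value reported by the quasi-Poisson summary can also be extracted programmatically. A sketch based on the ''glm.quasipoisson'' model above:

<code rsplus>
# Estimated dispersion parameter (phi)
summary(glm.quasipoisson)$dispersion
# Standard errors are inflated by sqrt(phi), which is why
# marginally significant p-values may no longer hold
sqrt(summary(glm.quasipoisson)$dispersion)
</code>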
+ # Plot 3 - By Lake + # Plot the data color coded by lake + Plot_ByLake<-plot + geom_point(aes(colour = factor(Lake)), size = 4) + xlab("Length (mm)") + + ylab("Trophic Position") + labs(title="By Lake") + fig + # Add in regression lines with the intercepts specific to each lake + Plot_ByLake + geom_abline(intercept = Lake.coef[1,1], slope =Lake.coef[1,2], colour="coral2") + + geom_abline(intercept = Lake.coef[2,1], slope =Lake.coef[2,2], colour="khaki4") + + geom_abline(intercept = Lake.coef[3,1], slope =Lake.coef[3,2], colour="green4") + + geom_abline(intercept = Lake.coef[4,1], slope =Lake.coef[4,2], colour="darkgoldenrod") + + geom_abline(intercept = Lake.coef[5,1], slope =Lake.coef[5,2], colour="royalblue1") + + geom_abline(intercept = Lake.coef[6,1], slope =Lake.coef[6,2], colour="magenta3") + + ++++ Output| + {{:fig_36_w5.png?500|}} + {{:fig_37_w5.png?500|}} + ++++ - Two other points are important to keep in mind when using quasi-Poisson GLMs and dealing with overdispersion: + =====Thought Experiment===== - * **quasi-Poisson GLMs do not have AIC scores.** An important aspect is that quasi-Poisson GLMs do not correspond to models with fully specified likelihoods and rely on quasi-ML estimation (i.e. pseudolikelihood). One consequence is that quasi-Poisson GLMs do not have AIC scores for model comparisons. However, variants of AIC have been developed to deal with this situation (e.g. quasi-AIC). + Mixed models are really good at accounting for variation in ecological data without losing too many degrees of freedom. + We covered only a small amount of what LMMs can do. Provided below are a couple of other thought experiments with data structure similar to the workshop data, and two textbooks that fully detail the utility of LMMs. - * **Overdispersion affects model comparison.** Indeed, overdispersion also influences the comparison of two nested models and has to be taken into account when φ is known.
For instance, let's assume that we want to compare GLM1, with //p//<sub>1</sub> parameters, to GLM2, with //p//<sub>2</sub> parameters, such that GLM1 is nested within GLM2 and //p//<sub>2</sub> > //p//<sub>1</sub>. Model comparison is achieved based on a generalized likelihood ratio test, which can be written as a function of the difference of deviances between the two GLMs, D<sub>1</sub> and D<sub>2</sub> respectively. If overdispersion is known, deviances have to be scaled (i.e. corrected) accordingly as D* = D/φ. The final test will be based on the criterion D*<sub>1</sub> - D*<sub>2</sub>, which is assumed to follow a χ² distribution with //p//<sub>2</sub>-//p//<sub>1</sub> degrees of freedom when GLM1 is correct. + ---- - * In some cases φ is not known. For instance, this occurs when you run a GLM with a normal error distribution. In that case, φ can be estimated //a posteriori// using the residual deviance of the larger model, so the criterion becomes [(D<sub>1</sub>-D<sub>2</sub>)/(//p//<sub>2</sub>-//p//<sub>1</sub>)]/[D<sub>2</sub>/(//n//-//p//<sub>2</sub>)] and is assumed to follow an F distribution with //p//<sub>2</sub>-//p//<sub>1</sub> and //n//-//p//<sub>2</sub> degrees of freedom. + **CHALLENGE 8**\\ - + Situation: - ==== Negative binomial GLMs ==== + You collected biodiversity estimates from 1000 quadrats that are within 10 different sites, which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity. - + - GLMs with a negative binomial (NB) distribution are favored when overdispersion is extreme. The NB distribution contains an additional parameter //k//, particularly handy for count data containing a preponderance of zeros. Before we go into R stuff, we should see what lies behind a negative binomial GLM. An NB distribution is actually a combination of two distributions: a Poisson distribution and a gamma distribution.
The NB distribution first assumes that a discrete random variable is Poisson distributed, but its mean, //µ//, is assumed to follow a gamma distribution. The mixture between the Poisson and gamma distributions can be simplified into a density function specific to the NB, which has two parameters //µ// and //k//. - + - Y ~ NB(//µ//, //k//) - + - E(Y) = //µ// and Var(Y) = //µ// + //µ//²///k// - + - Here, we can see how overdispersion will be accounted for by the NB distribution in GLMs. The second term of the variance determines the amount of overdispersion. In fact, it is indirectly determined by //k//, where //k// is also called the dispersion parameter. If //k// is large (relative to //μ//²), the second term, //µ//²///k//, approximates 0, and the variance of Y is //μ//. In such cases the NB converges to the Poisson distribution and you might as well use a Poisson distribution. The smaller //k//, the larger the overdispersion. + - Just like with other GLMs, an NB GLM is specified following the three fundamental steps. It first assumes that Y<sub>i</sub> is negative binomial distributed with mean //μ//<sub>i</sub> and parameter //k//. - + - Y<sub>i</sub> ∼ NB(//µ//<sub>i</sub>, //k//) - + - E(Y<sub>i</sub>) = //µ//<sub>i</sub> and Var(Y<sub>i</sub>) = //µ//<sub>i</sub> + //µ//<sub>i</sub>²///k// - + - The last two steps define the systematic part and the link function between the mean of Y<sub>i</sub> and the predictor function. In NB GLMs the link function is logarithmic, ensuring that fitted values are always positive. - + - * log(//µ//<sub>i</sub>) = //β//<sub>0</sub> + **X**<sub>i</sub>.//β// - or - * //µ//<sub>i</sub> = exp(//β//<sub>0</sub> + **X**<sub>i</sub>.//β//) - + - NB is not in the glm() function, so you need to install and load the MASS library. Not sure whether you already installed this package? No problem, you can use the following function: - + - ifelse(length(which(installed.packages() == "MASS")) == 0, - {print("MASS not installed. Installing...
"); install.packages("MASS")}, + - print("MASS already installed")) + - + - Alternatively, if you know that this package is not installed, you can directly use the command + - install.packages("MASS") + - + - and remember to load the package + - library("MASS") + - + - The negative binomial GLM can be built using the glm.nb() function: + - glm.negbin = glm.nb(Faramea.occidentalis~Elevation, data=faramea) + - summary(glm.negbin) + - + - The summary is similar to other GLM summaries (e.g. Poisson GLMs), though we now have a parameter theta, which stands for parameter //k// in the variance of the NB distribution. Its standard error is also provided, but care is needed with its use, as the interval is not symmetric and we are testing on the boundary. + - + - + - + - + - ==== Plotting the final GLM to the data ==== + - The NB GLM appears to be the best fit to our data. We can plot the relationship between the abundance of //Faramea occidentalis// and elevation.
+ - + - # plot the observed data + - plot(faramea$Elevation, faramea$Faramea.occidentalis, xlab="Elevation (m)", ylab=expression(paste("Number of", "  ", italic(Faramea~occidentalis))), pch=16, col=rgb(4,139,154,150,maxColorValue=255)) + - + - # pull values for intercept and beta from the summary and put them in the exponential equation + - curve(exp(summary(glm.negbin)$coefficients[1,1]+summary(glm.negbin)$coefficients[2,1]*x),from=range(faramea$Elevation)[1],to=range(faramea$Elevation)[2],add=T, lwd=2, col="orangered") + - + - # pull the standard error as well to plot the equations for the confidence envelope + - curve(exp(summary(glm.negbin)$coefficients[1,1]+1.96*summary(glm.negbin)$coefficients[1,2]+summary(glm.negbin)$coefficients[2,1]*x+1.96*summary(glm.negbin)$coefficients[2,2]),from=range(faramea$Elevation)[1],to=range(faramea$Elevation)[2],add=T,lty=2, col="orangered") + - curve(exp(summary(glm.negbin)$coefficients[1,1]-1.96*summary(glm.negbin)$coefficients[1,2]+summary(glm.negbin)$coefficients[2,1]*x-1.96*summary(glm.negbin)$coefficients[2,2]),from=range(faramea$Elevation)[1],to=range(faramea$Elevation)[2],add=T,lty=2, col="orangered") + - + - {{ :nb_glm_fit_to_faramea.png?450 |}} + - + - We can see that the number of //Faramea occidentalis// significantly decreases with elevation. However, the confidence envelope of the NB model is large at low elevation. + - + - ==== Conclusion on GLMs with count data ==== + - + - All the GLMs introduced (Poisson, quasi-Poisson and NB) to model count data use the same log-linear mean function (log(//µ//) = **X**.//β//), but make different assumptions about the remaining likelihood. Quasi-Poisson and NB are favored to deal with overdispersion. However, in some cases the data may contain too many zeros, and zero-augmented models can be useful as they extend the mean function by modifying (typically, increasing) the likelihood of zero counts (e.g.
zero-inflated Poisson [ZIP]). + + What mixed model could you use for this dataset? ​ + ++++ Challenge 8 Answer | + + >​lmer(Bio_Div ~ Productivity + (1|Forest/​Site)) + >#Here the random effects are nested (i.e. Sites within forests), and not crossed. + ​ + ++++ + ---- + ---- + **CHALLENGE 9**\\ + Situation: + You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. ​ You are mainly interested in knowing if habitat is a good predictor of mercury concentration. ​ + What mixed model could you use for this data set? + ++++ Challenge 9 Answer | + + > lmer(Mercury~Length*Habitat_Type+ (1|Site)...) + ​ + ++++ + ---- ===== Generalized linear mixed models (GLMMs) ===== ===== Generalized linear mixed models (GLMMs) ===== - As seen above, heterogeneous response distributions can be accurately modelled by relaxing the assumptions of normally distributed errors and letting the errors take on different distribution families (e.g., Poisson or negative binomial). What happens if there is a structure in the dataset such that observations sampled in certain sites are more similar (correlated) than observations sampled in different sites? As we saw in [[r_workshop5|Workshop ​5]], we can account for this lack of independence by adding random effects in the model. Recall that intercepts and/or slopes are allowed to vary among grouping factors (random effects) and the variability in slopes/ intercepts among groups is assumed to follow a normal distribution. Random effects are also referred to as shrinkage estimates because they represent a weighted average of the group-level and the overall fit (fixed effect). 
The amount of shrinkage towards the overall fit is more severe if the within-group variability is large compared to the among-group variability (i.e., when we have less confidence in the random estimate due to low sample size or high within-group variability, the estimate is 'pulled' towards the fixed effect).

The properties of Linear Mixed Models (Workshop 6), which incorporate random effects, and those of Generalized Linear Models (seen above), which handle non-normal data, can be combined to effectively model such instances in a Generalized Linear Mixed Model (GLMM) framework.
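The "weighted average" view of shrinkage can be illustrated numerically without any mixed-model package. The sketch below uses simulated data, and the variance components are assumed known purely for illustration (in practice ''lmer'' estimates them):

<code rsplus>
# Shrinkage as a precision-weighted average (illustrative simulation;
# the variance components tau2 and sigma2 are assumed known here)
set.seed(123)
tau2   <- 1    # among-group variance (random-intercept variance)
sigma2 <- 4    # within-group (residual) variance
n      <- 5    # observations per group

# simulate 10 true group means, then the observed mean of n observations each
true_means <- rnorm(10, mean = 0, sd = sqrt(tau2))
obs_means  <- sapply(true_means, function(m) mean(rnorm(n, m, sqrt(sigma2))))

# weight on the group-level mean: small when within-group noise dominates
w      <- tau2 / (tau2 + sigma2 / n)
grand  <- mean(obs_means)
shrunk <- w * obs_means + (1 - w) * grand

# every shrunken estimate sits between its group mean and the overall mean,
# i.e. it is 'pulled' towards the fixed effect
cbind(obs_means, shrunk, grand)
</code>

Increasing ''sigma2'' (or decreasing ''n'') lowers ''w'' and pulls the group estimates harder towards the grand mean, matching the description above.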
The extension of GLMs to account for this additional structure follows similar steps to those introduced in the mixed modelling workshop.\\

To provide a brief overview of GLMMs, we will visit a case study presented by [[http://www.facecouncil.org/puf/wp-content/uploads/2009/10/Generalized-linear-mixed-models.pdf|Bolker et al. (2009)]] and later by [[http://www.cell.com/cms/attachment/601623/4742452/mmc1.pdf|Bolker et al. (2011)]], where the structure (or lack of independence) in the dataset was due to observations being measured across different populations.
The authors evaluated the extent to which the response of //Arabidopsis thaliana// (mouse-ear cress) to its environmental drivers (nutrient availability and herbivory) varied as a result of the population and/or genotype it belonged to.\\

Using the //Arabidopsis// dataset and the detailed analysis presented by Bolker and colleagues, we will explore the effects of nutrient levels, herbivory and their interaction (fixed effects) on the fruit production of //Arabidopsis thaliana//, and the across-population and/or across-genotype variability in these relationships (random effects).

=== Structure in dataset ===

When we look at the interaction between nutrient and clipping across the 9 different populations, we note that there is a consistent nutrient effect (a higher baseline under the high-nutrient treatment). The effect of clipping is weaker, but in some populations we note a negative slope.

- If we observe a linear relationship between the variance and the mean with a slope = 1, then the **Poisson** family is appropriate,
Var = φµ  + - If we observe a linear mean-variance relationship with a slope > 1 (i.e. Var = φµ  where φ  > 1), then the **quasi-Poisson** family (as introduced above) should be applied, where φ  > 1), then the **quasi-Poisson** family (as introduced above) should be applied, - - Finally, a quadratic relationship between the variance and the mean (i.e. Var = µ(1 + α + - Finally, a quadratic relationship between the variance and the mean (i.e. Var = µ(1 + α ) or µ(1 + µ/k)), is characteristic of overdispersed data that is driven by an underlying heterogeneity among samples. In this case, the **negative binomial** (Poisson-gamma) would be more appropriate. + ) or µ(1 + µ/k)), is characteristic of overdispersed data that is driven by an underlying heterogeneity among samples. In this case, the **negative binomial** (Poisson-gamma) would be more appropriate. {{ :​worksp_6_error_dist.png?​450 |}} {{ :​worksp_6_error_dist.png?​450 |}} Line 812: Line 854: ==== Poisson GLMM ==== ==== Poisson GLMM ==== - Let's build a GLMM using the ''​glmer''​ function of the //lme4// package. This model has a random intercept for all genotype and population ​treatments. We include the nuisance variables (rack and germination method) as fixed effects. Given the mean-variance relationship from above, we most likely will need a model with overdispersion,​ but let's start with a Poisson model. + Let's build a GLMM using the ''​glmer''​ function of the //lme4// package. This model has a random intercept for all genotype and population treatments. We include the nuisance variables (rack and germination method) as fixed effects. Given the mean-variance relationship from above, we most likely will need a model with overdispersion,​ but let's start with a Poisson model. Line 857: Line 899: ==== Poisson-lognormal GLMM ==== ==== Poisson-lognormal GLMM ==== - Another option that we have not yet seen is to use a Poisson-lognormal distribution. 
The Poisson-lognormal approach deals with overdispersion by implementing observation-level random effects (OLRE; see [[https://peerj.com/articles/616.pdf|Harrison (2014)]]), which model the extra-Poisson variation in the response variable using a random effect with a unique level for every data point.

A Poisson-lognormal model effectively places a lognormal prior on ε<sub>i</sub>. A Poisson-lognormal distribution with mean µ and lognormal prior variance σ<sup>2</sup> has variance:

var(y) = µ + µ<sup>2</sup> [exp(σ<sup>2</sup>) - 1]

By comparison, recall that the negative binomial (Poisson-gamma) distribution has variance:

var(y) = µ + µ<sup>2</sup>/k

More generally, the variance term σ<sup>2</sup> in the Poisson-lognormal distribution will depend on the grouping level we select (e.g., the individual, genotype or population level); that is, the Poisson-lognormal model allows a more flexible approach to assigning the observed aggregation to different sources of heterogeneity. Here, to implement the observation-level random effect, we will evaluate it at the individual level.
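The variance formula above can be checked by simulation in base R: draw Poisson counts whose rates carry lognormal noise, then compare the empirical variance of the counts with µ + µ²[exp(σ²) - 1]. A sketch with arbitrary parameter values, independent of the //Arabidopsis// data:

<code rsplus>
# Simulation check of the Poisson-lognormal variance formula
set.seed(1)
n      <- 1e6
m      <- 1     # mean of the normal distribution on the log scale
sigma2 <- 0.5   # lognormal prior variance (sigma^2)

lambda <- exp(rnorm(n, mean = m, sd = sqrt(sigma2)))  # lognormal Poisson rates
y      <- rpois(n, lambda)

mu       <- exp(m + sigma2 / 2)            # E[y] under the lognormal prior
v_theory <- mu + mu^2 * (exp(sigma2) - 1)  # var(y) = mu + mu^2 [exp(sigma^2) - 1]

c(empirical = var(y), theoretical = v_theory)  # the two should closely agree
</code>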
Now that we have the appropriate error distribution, we can test the importance of the random intercepts (for //population// and //genotype//) by comparing nested models with and without the random effects of interest, using either:

- an __information theoretic approach__ (such as the Akaike Information Criterion, AIC), which, as we saw in [[r_workshop6#coding_potential_models_and_model_selection|Workshop 6]], examines several competing hypotheses (models) simultaneously to identify the model with the highest predictive power given the data. As before, we will use the AICc to correct for small sample sizes. Or,
- a __frequentist approach__ (traditional null hypothesis testing, or the drop1 approach), where the significance of each term is evaluated in turn by comparing nested models with the ''anova()'' function and assessing the significance of the likelihood ratio test (LRT). Keep in mind that with this approach we are testing a null hypothesis of zero variance for the random effects; because variance cannot be negative, we are testing the parameter on the boundary of its feasible region. The reported p-value is therefore approximately twice what it should be (i.e., we have truncated the half of the possible values that fall below 0), and can be divided by two to correct for this.

++++ Bonus Section: Random slopes |

**NB**: We cannot test for random slopes in this dataset, because whenever random slopes for the fixed effects are included in the model we obtain a strong correlation between the random intercepts and slopes. These perfect correlations may result from our model being over-parametrized. Thus, while there could very well be some variation in the nutrient or clipping effect at the genotype or population levels, and while this variation may be what we are most interested in, there is simply not enough data to test these effects with confidence.

Let's have a look at an example of this by testing the random slope terms for clipping:

We can examine the variance components using the model ''summary''. Alternatively, we could examine the correlation matrix among the random clipping intercepts and slopes. A third option is to use the ''printvc'' function, which prints the variance-covariance matrices along with an equals sign ( = ) next to any covariance component with a nearly perfect correlation.

There is a strong effect of rack (a change in AIC of 55 if we remove this variable), whereas the effects of status and of the interaction term are weak (the change in AIC when either is dropped is less than 2). We can thus start by removing the non-significant interaction term so that we can test the main effects of nutrient and clipping.
Our fi + Both the main effects of nutrient and clipping are strong (large change in AIC of 135.6 (mpl5) and 10.2 (mpl6) if either nutrient or clipping are dropped, respectively). Our fi nal model includes the fixed nutrient (strong positive) and clipping (negative) effects, the nuisance variables rack, the observation-level random e ffects (1 | X) to account for over-dispersion and the variation in overall fruit set at the //​population//​ and //​genotype//​ levels. + nal model includes the + fixed nutrient (strong positive) and clipping (negative) effects, the nuisance variables rack, the observation-level random e + ffects (1 | X) to account for over-dispersion and the variation in overall fruit set at the //​population//​ and //​genotype//​ levels. - ** CHALLENGE ​7 ** + ** CHALLENGE ​10 ** Using the //inverts// dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions: Using the //inverts// dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions: Line 1080: Line 1131: - Finally, once you determined the best distribution family, re-evaluate your random and fixed effects. - Finally, once you determined the best distribution family, re-evaluate your random and fixed effects. - ++++ Challenge: Solution ​1| + ++++ Challenge: Solution ​3| # Load data # Load data Line 1229: Line 1280: ​ ++++ ++++ - ==== Additional GLMM resources ==== + =====Additional resources===== + + ===LMM Textbooks=== + Gelman, A., and Hill, J. (2006). Data analysis using regression and multilevel/​hierarchical models (Cambridge University Press). + + Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. (2009). Mixed effects models and extensions in ecology with R (Springer). + + ===GLMM resources=== **Books** **Books** B. 
Bolker (2009) [[http://ms.mcmaster.ca/~bolker/emdbook/|Ecological Models and Data in R]]. Princeton University Press.\\
A. Zuur et al. (2009) [[http://link.springer.com/book/10.1007/978-0-387-87458-6|Mixed Effects Models and Extensions in Ecology with R]]. Springer.\\

**Articles**

Harrison, X. A., L. Donaldson, M. E. Correa-Cano, J. Evans, D. N. Fisher, C. E. D. Goodwin, B. S. Robinson, D. J. Hodgson, and R. Inger. 2018. [[https://peerj.com/preprints/3113/|A brief introduction to mixed effects modelling and multi-model inference in ecology]]. PeerJ 6:e4794.

**Websites**

[[http://glmm.wikidot.com/|GLMM for ecologists]]: A great website on GLMMs, with a Q&A section!