====== Workshop 4: Linear models ======
  
Developed by: Catherine Baltazar, Bérenger Bourgeois, Zofia Taranu, Shaun Turney, Willian Vieira
  
**Summary:** In this workshop, you will learn how to implement basic linear models commonly used in ecology in R, such as simple regression, analysis of variance (ANOVA), analysis of covariance (ANCOVA), and multiple regression. After verifying the assumptions of these models visually and statistically, and transforming your data when necessary, the interpretation of model outputs and the plotting of your final model will no longer keep secrets from you!
  - ANCOVA
  - Multiple linear regression
===== 1. Overview =====
  
====1.1 Defining mean and variation====

Scientists have always been interested in determining relationships between variables. In this workshop we will learn how to use linear models, a set of models that quantify relationships between variables.

To begin, we will define some important concepts that are central to linear models: mean and variation. The mean is a measure of the average value of a population. Suppose we have a random variable x, for example the height of the people in this room, and we would like to describe its patterns. The first way to summarize it is with the mean (be aware that there are several ways of estimating it):

<m>overline{x} = {sum{i=1}{n}{x_i}}/{n}</m>

But the mean alone cannot represent a population fully. We can also describe the population using measures of variation. Variation is the spread around the mean: for example, whether all people in the room are approximately the same height (low variation) or whether there are many tall and short people (high variation). The mean deviation, variance, standard deviation and coefficient of variation are all measures of variation, which we define below. We can first measure the deviation of each element from the mean:

<m>x_i-overline{x}</m>

With the deviation of each value, we can calculate the mean deviation:

<m>{sum{i=1}{n}{delim{|}{x_i-overline{x}}{|}}}/{n}</m>

To make all deviations positive without using the absolute value, we can square them instead. This is where the (sample) variance comes from:

<m>s^2 = {sum{i=1}{n}{(x_i-overline{x})^2}}/{n-1}</m>

However, by squaring each deviation, the result is no longer in meaningful units. Back to our example with the height of people in this room, the variance would be in m<sup>2</sup>, which is not the unit of what we are measuring. To convert the value back to meaningful units, we can calculate the standard deviation:

<m>s = sqrt{s^2}</m>

Finally, the coefficient of variation (also known as the relative standard deviation) expresses the standard deviation as a percentage of the mean:

<m>CV = {s/overline{x}}*100</m>
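
The same quantities are easy to compute in R. Below is a minimal sketch using an invented vector of heights (the vector ''heights'' is made up purely for illustration); note that R's ''var()'' and ''sd()'' use the sample (n-1) denominator.

<code rsplus | Mean and measures of variation in R>
# A hypothetical vector of heights (in metres), for illustration only
heights <- c(1.55, 1.62, 1.70, 1.74, 1.81, 1.90)

mean(heights)                          # mean
mean(abs(heights - mean(heights)))     # mean deviation
var(heights)                           # variance (in m^2)
sd(heights)                            # standard deviation (back in metres)
sd(heights) / mean(heights) * 100      # coefficient of variation (%)
</code>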

====1.2 Linear models====

In linear models, we use the concepts of mean and variation to describe the relationship between two variables. Linear models are so named because they describe the relationship between variables as lines:

<m> {y_i} = {β_0} + {β_1}{x_{i1}} + ... + {β_p}{x_{ip}} + {ε_i} </m>

where\\

<m>{y_i}</m> is the response variable,\\
<m>{β_0}</m> is the intercept of the regression line,\\
<m>{β_1}</m> is the regression coefficient of the first explanatory variable,\\
<m>{β_p}</m> is the regression coefficient of the pth explanatory variable,\\
<m>{x_{i1}}</m> is the first explanatory variable,\\
<m>{x_{ip}}</m> is the pth explanatory variable,\\
<m>{ε_i}</m> are the residuals of the model\\

The response variable is the variable you want to explain. It's also known as the dependent variable. There is only one response variable. The explanatory variables are the variables you think may explain your response variable. They're also known as independent variables. There can be one or many explanatory variables. For example, suppose we want to explain variation in the height of people in a room. Height is the response variable. Some possible explanatory variables could be gender or age.

In linear models the response variable must be continuous, while the explanatory variables can be continuous or categorical. A continuous variable has an infinite number of possible values. A categorical variable has a limited number of possible values. Age, temperature, and latitude are all continuous variables. Sex, developmental stage, and country are all categorical variables. For continuous explanatory variables, the linear model tests whether there is a significant correlation between the explanatory and response variable. For categorical explanatory variables, the linear model tests whether there is a significant difference between the different levels (groups) in their mean value of the response variable. This should become clearer as we learn about specific types of linear models in the sections below.

In almost all cases, the explanatory variables will not explain all of the variation in the response variable. Gender and age, for example, will not be enough to predict everyone's height perfectly. The remaining, unexplained variation is called error or residuals.

The goal of the linear model is to find the best estimates of the parameters (the β variables) and then assess the goodness of fit of the model. Several methods have been developed to calculate the intercept and coefficients of linear models, and the appropriate choice depends on the model. The general concept behind these methods is that the residuals are minimized.
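
As a quick illustration, the sketch below simulates data from a known linear relationship and lets ''lm()'' estimate the intercept and slope; all object names and parameter values here are invented.

<code rsplus | A simulated linear model>
set.seed(42)
x <- runif(100, 0, 10)                  # explanatory variable
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # response = intercept + slope*x + residual error
sim.lm <- lm(y ~ x)
coef(sim.lm)                            # estimated intercept and slope (close to 2 and 0.5)
</code>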

Depending on the kind of explanatory variables considered and their number, different statistical tools can be used to assess these relationships. The table below lists the five types of statistical analysis that will be covered in this workshop:
  
^ Statistical analysis^ Type of response\\ variable Y^ Type of explanatory\\ variable X^ Number of\\ explanatory variables ^ Number of\\ levels k^
^ Multiple regression | :::        | Continuous| 2 or more| |
  
====1.3 Linear model assumptions====

To be valid, a linear model must meet 4 assumptions, otherwise the model results cannot be safely interpreted:
  - The residuals are independent
  - The residuals are normally distributed
  - The residuals have a mean of 0
  - The residuals are homoskedastic (they have constant variance)
  
Note that all of these assumptions concern the residuals, not the response or explanatory variables. The residuals must be independent, meaning that there isn't an underlying structure missing from the model (usually spatial or temporal autocorrelation). The residuals must be normally distributed with a mean of 0, meaning that the largest proportion of residuals have a value close to 0 (i.e. the error is small) and the distribution is symmetrical (i.e. the response variable is overestimated and underestimated equally often). The residuals must be homoskedastic, meaning that the error doesn't change much as the values of the predictor variables change.
  
In the following sections, we do not always explicitly restate the above assumptions for every model. Be aware, however, that these assumptions are implicit in all linear models, including all models presented below.
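
As a generic sketch of how these checks look in R (the model ''mymodel'' below is fit on a built-in dataset purely for illustration; the bird example in section 2 walks through the real diagnostics):

<code rsplus | Checking residuals of a fitted model>
# Hypothetical example model (any lm object would do)
mymodel <- lm(Sepal.Length ~ Petal.Length, data = iris)

hist(residuals(mymodel))          # residuals should be centred on 0 and roughly symmetrical
mean(residuals(mymodel))          # mean of residuals is ~0 by construction for lm fits
shapiro.test(residuals(mymodel))  # formal test of residual normality
plot(mymodel)                     # the four diagnostic plots (see section 2.2)
</code>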
  
====1.4 Test statistics and p-values====
  
Once you've run your model in R, you will receive model output that includes many numbers. It takes practice to understand what each of these numbers means and which to pay the most attention to. The model output includes the estimates of the parameters (the β variables). The output also includes test statistics. The particular test statistic depends on the linear model you are using (t is the test statistic for the linear regression and the t-test, and F is the test statistic for ANOVA).

In linear models, the null hypothesis is typically that there is no relationship between two continuous variables, or that there is no difference between the levels of a categorical variable. The larger the absolute value of the test statistic, the more improbable it is that the null hypothesis is true. The exact probability is given in the model output and is called the p-value. You could think of the p-value as the probability that the null hypothesis is true, although that's a bit of a simplification. (Technically, the p-value is the probability that, given the assumption that the null hypothesis is true, the test statistic would be the same as or of greater magnitude than the actual observed test statistic.) By convention, we consider that if the p-value is less than 0.05 (5%), then we reject the null hypothesis. This cut-off value is called α (alpha). If we reject the null hypothesis then we say that the alternative hypothesis is supported: there is a significant relationship or a significant difference. Note that we do not "prove" hypotheses, only support or reject them.
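
In R, these statistics can be pulled directly out of the fitted model; a minimal sketch (again using an illustrative model on a built-in dataset):

<code rsplus | Finding the test statistic and p-value in the output>
mymodel <- lm(Sepal.Length ~ Petal.Length, data = iris)  # hypothetical example model

summary(mymodel)               # full output: estimates, t values, p-values, R2, F-statistic
summary(mymodel)$coefficients  # matrix with Estimate, Std. Error, t value and Pr(>|t|)
anova(mymodel)                 # ANOVA table with the F-statistic and its p-value
</code>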
====1.5 Work flow====

Below we will explore several kinds of linear models. The way you create and interpret each model will differ in the specifics, but the principles behind them and the general work flow will remain the same. For each model we will work through the following steps:

  - Visualize the data (data visualization could also come later in your work flow)
  - Create a model
  - Test the model assumptions
  - Adjust the model if assumptions are violated
  - Interpret the model results

===== 2. Simple linear regression =====

Simple linear regression is a type of linear model which contains a single, continuous explanatory variable. The regression tests whether there is a significant correlation between the two variables.

Simple linear regression involves two parameters which must be estimated: an intercept (<m>{β_0}</m>) and a slope coefficient (<m>{β_1}</m>). Ordinary least squares is the most widely used estimation method, and also corresponds to the default method of the ''lm'' function in R. Ordinary least squares fits a line such that the sum of the squared vertical distances between the observed data and the linear regression model (i.e. the residuals) is minimized.

Click below to see the math in more detail.
<hidden>
Using ordinary least squares, the intercept β<sub>0</sub> and the slope β<sub>1</sub> of the linear regression can be calculated as:
  
<m>β_{1}={sum{i}{}{(x_{i}-overline{x})(y_{i}-overline{y})}}/{sum{i}{}{(x_{i}-overline{x})^2}} = {Cov(x,y)}/{Var(x)}</m>
  
<m>β_{0}=overline{y}-β_{1}overline{x}</m>
</hidden>
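
These formulas map directly onto R's ''cov()'' and ''var()'' functions; the sketch below (with simulated data) computes the ordinary least squares estimates by hand and checks them against ''lm()''.

<code rsplus | Ordinary least squares by hand>
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

b1 <- cov(x, y) / var(x)      # slope: Cov(x,y) / Var(x)
b0 <- mean(y) - b1 * mean(x)  # intercept: mean(y) - slope * mean(x)
c(b0, b1)
coef(lm(y ~ x))               # same estimates from lm()
</code>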
  
==== 2.1 Running a linear model ====
  
Using the bird dataset, we will first examine the linear regression of maximum abundance as a function of mass.
^ AvgAbund | The average abundance across all sites\\ where found in NA|Continuous/ numeric|
^ Mass     | The body size in grams| Continuous/ numeric|
^ Diet     | Type of food consumed| Discrete – 5 levels (Plant; PlantInsect;\\ Insect; InsectVert; Vertebrate)|
^ Passerine| Is it a songbird/ perching bird| Boolean (0/1)|
^ Aquatic  | Is it a bird that primarily lives in/ on/ next to the water| Boolean (0/1)|
  
Note that Family, Diet, Passerine, and Aquatic are all categorical variables although they are encoded in different ways (string, discrete, boolean).
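
Before modelling, it is worth checking how R has read these columns in; the sketch below shows the generic calls (assuming the data frame is called ''bird'', as in the code that follows).

<code rsplus | Checking how variables are encoded>
str(bird)                          # check the class of each column
bird$Diet <- as.factor(bird$Diet)  # make sure categorical variables are treated as factors
bird$Passerine <- as.factor(bird$Passerine)
levels(bird$Diet)                  # list the levels of a factor
</code>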

We are now ready to run our linear model:
  
lm1 <- lm(bird$MaxAbund ~ bird$Mass) # where Y ~ X means Y "as a function of" X
</code>
==== 2.2 Verifying assumptions ====
  
<code rsplus | Diagnostic plots >
</code>
  
=== Verifying independence ===

Linear models can only be applied to independent data. This means that the y<sub>i</sub> at a given x<sub>i</sub> value must not be influenced by other x<sub>i</sub> values. Violation of independence can happen if your data represent some form of dependence structure, such as spatial or temporal correlation.

There is no simple diagnostic plot for independence, unfortunately. Instead, you must consider your data carefully. Is there some underlying structure in your data that makes your data points dependent on each other? If you collect data from the same sites over time (i.e. a time series) or if you collect multiple data points from the same organism, your data violate the assumption of independence. You will need to use a different type of model instead.

=== Verifying that residual variance is constant and residual mean is 0 ===

//Residual vs Fitted plot// - The first of the diagnostic plots produced by ''plot(lm1)'' illustrates the spread of the residuals across the fitted values. Each point represents the distance of the observed response from the model's prediction. If the residuals spread randomly around the 0 line, this indicates that the relationship is linear and that the mean of the residuals is 0. If the residuals form an approximately horizontal band around the 0 line, this indicates homogeneity of error variance (i.e. they are homoskedastic). If the residuals form a funnel shape, the residuals are not homoskedastic.
  
{{:workshop_3_lm1_residuals_vs_fitted.png?300|}}
  
//Scale-location plot// - The third graph of the diagnostic plots enables one to verify whether the residual spread increases with the fitted values (i.e. identifies whether the spread in the residuals is due to the selected explanatory variable). If the spread increases, the homoscedasticity assumption is not respected.
  
{{:workshop_3_lm1_scale-location.png?300|}}
  
=== Verifying that residuals are normally distributed ===
  
//QQ plot// - Normality can be assessed from the QQplot of the diagnostic plots. This graph compares the probability distribution of the model residuals to the probability distribution of a normal data series.
If the standardized residuals lie linearly on the 1:1 line of the QQplot, the residuals can be considered normally distributed.
  
The points of the QQplot are nonlinear, which suggests that the residuals are not normally distributed.
  
=== Checking for high leverage ===

In addition to the assumption testing above, we are also interested in whether any of our data points have high leverage. This is not assumption testing //per se//, but it will affect our interpretation of the data. If some of the observations in a dataset possess strongly different values from others, a model fitting problem can arise such that these high leverage data influence the model calculation.

//Residuals vs Leverage plot// - High leverage data can be visualised on the fourth diagnostic plot (i.e. residuals vs leverage), which identifies the observation numbers of the high leverage data point(s). If (and only if!) these observations correspond to mis-measurements or represent exceptions, they can be removed from the original dataset.\\
{{:workshop_3_lm1_leverage.png?300|}}
==== 2.3 Normalizing data ====
  
In the example provided above, the model residuals were not normally distributed and therefore the assumption of residual normality is violated. We may still be able to use a linear regression model if we can address this violation. The next step is to try to normalize the variables using transformations. Often, if we can make the explanatory and/or response variables normally distributed, then the model residuals will become normally distributed. In addition to QQ-plots, we can assess the normality of a variable by drawing a histogram using the function ''hist'', and checking visually whether the data series appears to follow a normal distribution. For example:
  
<code rsplus | Testing Normality: hist() function>
# Plot Y ~ X and the regression line
plot(bird$MaxAbund ~ bird$Mass, pch=19, col="coral", ylab="Maximum Abundance",
{{:lm1_yvsx.png?300|}} {{:maxabund_hist.png?300|}} {{:mass_hist.png?300|}}
  
-second ​way to assess normality is to use the Shapiro-Wilk normality test that compares the distribution of the observed data series to a normal distribution using the function ''​shapiro.test''​.+third way to assess normality is to use the Shapiro-Wilk normality test that compares the distribution of the observed data series to a normal distribution using the function ''​shapiro.test''​.
  
The null and alternate hypotheses of this test are:
  
The histograms, Shapiro tests and skewness all indicate that the variables need to be transformed to normalize (e.g. a log<sub>10</sub> transformation).
==== 2.4 Data transformation ====
In case of non-normality, response and explanatory variables can be transformed to enhance their normality following these rules:
  
^ Substantially negative skewness | <m>log_10{(K-x)}</m>| log10(K - x)|
    
Thus, log<sub>10</sub> transformations should be applied and saved in the bird data frame. The model can then be re-run, verified and interpreted.
  
<code rsplus | Data Transformation >
{{:hist_logmaxabund.png?300|}} {{:hist_logmass.png?300|}}
{{:plot_lm2_.png?550|}}
==== 2.5 Model output ====
  
Once all these assumptions have been verified, the model results can be interpreted. These results are called in R using the function ''summary''.
Users must, however, be aware that a significant relationship between two variables does not always imply causality. Conversely, the absence of a significant linear regression between y and x does not always imply an absence of relationship between these two variables; this is for example the case when a relationship is not linear.
  
The goodness of fit of the linear regression model is assessed from the **adjusted-R<sup>2</sup>** (here, 0.05484). This value is a measure of the proportion of variation explained by the model.

Click below to see the math in more detail.
<hidden>
<m> overline{R}^2=1-(1-R^2){n-1}/{n-p-1} </m>
    
<m> {SS_tot}=sum{i}{}{({y_i}-overline{y})}^2 </m> is the total sums of squares,\\
<m> {SS_reg}=sum{i}{}{(hat{y_i}-overline{y})}^2 </m> is the regression sums of squares - also called the explained sums of squares.
</hidden>
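
To see where these numbers come from, the sketch below recomputes R<sup>2</sup> and the adjusted R<sup>2</sup> from the model object and compares them to the values reported by ''summary()''; it assumes the model ''lm2'' from this section has already been created.

<code rsplus | Adjusted R-squared by hand>
y <- model.response(model.frame(lm2))  # observed response values used in the model
ss.res <- sum(residuals(lm2)^2)        # residual sum of squares
ss.tot <- sum((y - mean(y))^2)         # total sum of squares
r2 <- 1 - ss.res/ss.tot
n <- length(y); p <- 1                 # p = number of explanatory variables
adj.r2 <- 1 - (1 - r2) * (n - 1)/(n - p - 1)
c(r2, adj.r2)
summary(lm2)$r.squared                 # should match r2
summary(lm2)$adj.r.squared             # should match adj.r2
</code>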
  
The higher the adjusted-R<sup>2</sup> is, the better the data fit the statistical model, knowing that this coefficient varies between 0 and 1. In this case, the relationship between //logMaxAbund// and //logMass// is quite weak.\\
The last line of the R output represents the F-statistic of the model and its associated p-value. If this p-value is less than 0.05, the model explains the data relationship better than a null model.\\
==== 2.6 Plotting ====
  
Linear regression results are generally represented by a plot of the response variable as a function of the explanatory variable, on which the regression line is added (and if needed the confidence intervals), using the R code:
  
{{:yvxx_lm2.png?400|}}
==== 2.7 Subsetting ====
  
We may also run the analysis on a subset of observations, for example, on terrestrial birds only.
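
One way to do this is sketched below (assuming ''Aquatic'' is coded 0/1 as in the table above, so terrestrial birds are the rows where Aquatic equals 0; the object names are illustrative):

<code rsplus | Subsetting observations>
# Keep only terrestrial (non-aquatic) birds
bird.terr <- subset(bird, Aquatic == 0)
lm.terr <- lm(logMaxAbund ~ logMass, data = bird.terr)
summary(lm.terr)
</code>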
  
----
===== 3. ANOVA =====
  
Analysis of Variance (ANOVA) is a type of linear model for a continuous response variable and one or more categorical explanatory variables. The categorical explanatory variables can have any number of levels (groups). For example, the variable "colour" might have three levels: green, blue, and yellow. ANOVA tests whether the means of the response variable differ between the levels; for example, whether blueberries differ in their mass depending on their colour.
  
ANOVA calculations are based on partitioning the sums of squares, and compare the within-treatment variance to the between-treatment variance. If the between-treatment variance is greater than the within-treatment variance, this means that the treatments affect the response variable more than the random error (corresponding to the within-treatment variance), and that the response variable is likely to be significantly influenced by the treatments.
  
In the ANOVA, the comparison of the between-treatment variance to the within-treatment variance is made through the calculation of the F-statistic, which corresponds to the ratio of the mean sum of squares of the treatment (MS<sub>Trt</sub>) to the mean sum of squares of the error (MS<sub>E</sub>). These two terms are obtained by dividing their respective sums of squares by their corresponding degrees of freedom, as is typically presented in an ANOVA table (click below to see it). Finally, the p-value of the ANOVA is calculated from the F-statistic, which follows an F distribution.
  
Click to see the math in more detail below.
<hidden>
^ Source of\\ variation^ Degrees of\\ freedom (df)^ Sums of squares^ Mean squares ^ F-statistic^
^ Total   |<m>ra-1</m>| <m>{SS_Tot}=sum{i,j}{}({y_ij}-overline{y})^2</m>              |     | |
^ Factor A|<m>a-1</m>| <m>{SS_FactorA}= r sum{i}{}({overline{y}_i}-overline{y})^2</m>|<m>{MS_FactorA}={SS_FactorA}/{a-1}</m>|<m>F={MS_FactorA}/{MS_E}</m>|
^ Error   |<m>a(r-1)</m>| <m>{SS_E}= sum{i,j}{}({y_ij}-{overline{y}_i})^2</m>            |<m>{MS_E}={SS_E}/{a(r-1)}</m>| |

a: number of levels of the explanatory variable A; r: number of replicates per treatment; <m>overline{y}</m>: general mean of the response variable; <m>overline{y}</m><sub>i</sub>: mean of the response variable for all the replicates of treatment i.
</hidden>

==== 3.1 Types of ANOVA ====
  - **One-way ANOVA**\\ One categorical explanatory variable with 2 or more levels. If there are 2 levels, a t-test can be used alternatively.
  - **Two-way ANOVA** //[[http://qcbs.ca/wiki/r_workshop4?&#two-way_anova|(see section below)]]//\\ - 2 or more categorical explanatory variables,\\ - Each categorical explanatory variable can have multiple levels,\\ - The interactions between each categorical explanatory variable must be tested.
  - **Repeated measures**\\ ANOVA can be used for repeated measures, but we won't cover this today. **Linear Mixed-effect Models** can also be used for this kind of data //[[http://qcbs.ca/wiki/r_workshop6|(see Workshop 6)]]//.
==== 3.2 T-test ====

When you have a single categorical explanatory variable with only two levels, you can run a Student's t-test to test for a difference in the mean of the two levels. If appropriate for your data, you can choose to test a unilateral hypothesis. This means that you can test the more specific assumption that one level has a higher mean than the other, rather than that they simply have different means.

Click to see the math in more detail below.
<hidden>
For the t-test, the t statistic used to find the p-value is calculated as:\\
<m> t= (overline{y}_1-overline{y}_2)/sqrt{{s_1}^2/n_1 + {s_2}^2/n_2} </m>
s<sub>1</sub><sup>2</sup> and s<sub>2</sub><sup>2</sup> are the variances of the response variable y for group 1 and 2, respectively,\\
n<sub>1</sub> and n<sub>2</sub> are the sample sizes of groups 1 and 2, respectively.
</hidden>
  
Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.
  
=== Assumptions ===
If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.

  - **Normality of data**\\ As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed, but have reasonably symmetrical distributions, a mean which is close to the centre of the distribution, and only one mode (highest point in the frequency histogram), then a t-test will still work as long as the sample is sufficiently large (rule of thumb ~30 observations). If the data are heavily skewed, then we may need a very large sample before a t-test works. In such cases, an alternate non-parametric test should be used.
  - **Homoscedasticity**\\ Another important assumption of the two-sample t-test is that the variances of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If population variances are unequal, then the probability of a Type I error is greater than α.\\ The robustness of the t-test increases with sample size and is higher when groups have equal sizes.\\ We can test for a difference in variances among two populations and ask what is the probability of taking two samples from two populations having identical variances and have the two sample variances be as different as are s<sub>1</sub><sup>2</sup> and s<sub>2</sub><sup>2</sup>.\\ To do so, we must do the variance ratio test (i.e. an F-test; see the sketch below).
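
In R, this variance ratio test is available as ''var.test()''; a minimal sketch, assuming ''logMass'' has been added to the bird data frame as in the earlier sections:

<code rsplus | Variance ratio test>
# F-test comparing the variance of logMass between the two Aquatic groups
var.test(logMass ~ Aquatic, data = bird)
</code>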
  
=== Violation of assumptions ===
  
If variances between groups are not equal, it is possible to use corrections, like the [[http://en.wikipedia.org/wiki/Welch's_t_test|Welch correction]]. If assumptions cannot be respected, the non-parametric equivalent of the t-test is the [[http://en.wikipedia.org/wiki/Mann–Whitney_U_test|Mann-Whitney test]]. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a [[http://en.wikipedia.org/wiki/Student's_t-test#Paired_samples|Paired t-test]].
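
For reference, here is roughly how these alternatives look in R (a sketch; note that ''t.test()'' applies the Welch correction by default unless ''var.equal=TRUE'' is specified):

<code rsplus | Alternatives when t-test assumptions are violated>
t.test(logMass ~ Aquatic, data = bird, var.equal = FALSE)  # Welch correction (unequal variances)
wilcox.test(logMass ~ Aquatic, data = bird)                # Mann-Whitney (non-parametric)
# Paired t-test, e.g. the same individuals measured in two years
# (year1 and year2 are hypothetical vectors of paired measurements):
# t.test(year1, year2, paired = TRUE)
</code>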
=== Running a t-test ===
  
In R, t-tests are implemented using the function ''t.test''. For example, to test for a mass difference between aquatic and non-aquatic birds, you should write:\\
  
Here, we show that the ratio of variances is not statistically different from 1, therefore variances are equal, and we proceeded with our t-test. Since p < 0.05, the hypothesis of no difference between the two bird types (Aquatic vs. terrestrial) was rejected.
  
=== Unilateral t-test ===
- +
The //alternative// option of the ''t.test'' function allows for the use of a unilateral t-test. For example, if users want to test if non-aquatic birds are less heavy than aquatic birds, the function can be written:
  
In this case, the calculated t-statistic is t = -7.7707 with df = 52 degrees of freedom, which gives a p-value of 1.468e-10. As the calculated p-value is less than 0.05, the null hypothesis is rejected. Thus, aquatic birds are significantly heavier than non-aquatic birds.
  
=== Running a t-test with lm() ===
  
A t-test is a linear model and a specific case of ANOVA with one factor with 2 levels. As such, we can also run the t-test with the ''lm()'' function in R:
  
<code rsplus | T-test as a linear model >
ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)
anova(ttest.lm1)
</code>
  
    Analysis of Variance Table
    Response: logMass
              Df  Sum Sq  Mean Sq   F value    Pr(>F)
    Aquatic    1  19.015  19.0150   60.385     2.936e-10 ***
    Residuals 52  16.375  0.3149
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

When variances are equal (i.e., two-sample t-test), we can show that t<sup>2</sup> = F:
  
<code rsplus | >
ttest1$statistic^2
anova(ttest.lm1)$F
</code>
  
    
==== 3.3 Running an ANOVA ====
  
The t-test is only for a single categorical explanatory variable with 2 levels. For all other linear models with categorical explanatory variables we use ANOVA. First, let's visualize the data using ''boxplot()''. Recall that by default, R will order your groups in alphabetical order. We can reorder the groups according to the median of each Diet level.\\ Another way to graphically view the effect sizes is to use ''plot.design()''. This function will illustrate the levels of a particular factor along a vertical line, and the overall value of the response is drawn as a horizontal line.
  
<code rsplus | ANOVA>
anova(anov1)
</code>
-==== 3.Verifying assumptions ====+==== 3.Verifying assumptions ==== 

As with the simple linear regression and t-test, ANOVA must meet the four assumptions of linear models. Below are some tips on how to test these assumptions for an ANOVA.

  - **Normal distribution**\\ The residuals of the ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted.
  - **Homoscedasticity**\\ To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots. If these plots present equivalent spread of the residuals for each of the fitted values, then the residual variance can be considered homogeneous.\\ A second way to assess the homogeneity of residual variance is to perform a Bartlett test on the ANOVA model using the function ''bartlett.test''. If the p-value of this test is greater than 0.05, the null hypothesis H<sub>0</sub>: s<sub>1</sub><sup>2</sup> = s<sub>2</sub><sup>2</sup> =... = s<sub>j</sub><sup>2</sup> =... = s<sub>n</sub><sup>2</sup> is accepted and the homoscedasticity assumption is respected.\\ Usual transformations of the explanatory variables can be used if the homogeneity of residual variance is not met.
  - **Additivity**\\ In addition to the assumption testing, it is important to consider whether the effects of two factors are additive. The effects are additive if the effect of one factor remains constant over all levels of the other factor, and each factor influences the response variable independently of the other factor(s).

If assumptions are violated, you can try to transform your data, which could potentially equalize variances and normalize residuals, and can convert a multiplicative effect into an additive effect. Or, if you can't (or don't want to) transform your data, the non-parametric equivalent of ANOVA is the [[http://en.wikipedia.org/wiki/Kruskal–Wallis_one-way_analysis_of_variance|Kruskal-Wallis test]].
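
The Kruskal-Wallis test is available in base R as ''kruskal.test()''; a minimal sketch with the same response and factor used in this section (assuming ''bird'' and ''logMaxAbund'' are set up as above):

<code rsplus | Kruskal-Wallis test>
kruskal.test(logMaxAbund ~ Diet, data = bird)
</code>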
  
<code rsplus| Model diagnostics >
  
Ideally the first diagnostic plot should show similar scatter for each Diet level. The Shapiro and Bartlett tests are both non-significant, therefore residuals are assumed to be normally distributed and variances are assumed to be equal.
-==== 3.Model output ====+==== 3.Model output ====
 Once your ANOVA model has been validated, its results can be interpreted. The R output of ANOVA model depends of the function that has been used to implement the ANOVA. If the ''​aov''​ function is used to implement the ANOVA model Once your ANOVA model has been validated, its results can be interpreted. The R output of ANOVA model depends of the function that has been used to implement the ANOVA. If the ''​aov''​ function is used to implement the ANOVA model
  
This R output corresponds exactly to the ANOVA table of your model. This output also presents the degrees of freedom, the sum of squares, the mean sum of squares and the F-value previously explained. For this example, the diet significantly influences the abundance of birds as the p-value is less than 0.05. The null hypothesis can then be rejected, meaning that at least one of the diet treatments influenced the abundance differently than the other treatments.
  
-==== 3.Complementary test ====+==== 3.Complementary test ====
  
Importantly, ANOVA cannot identify which treatment is different from the others in terms of the response variable. It can only identify that a difference is present. To determine the location of the difference(s), post-hoc tests that compare the levels of the explanatory variables (i.e. the treatments) two by two must be performed. While several post-hoc tests exist (e.g. Fisher's least significant difference, Duncan's new multiple range test, Newman-Keuls method, Dunnett's test, etc.), the Tukey's range test is used in this example using the function ''TukeyHSD'' as follows:
  
<code rsplus| Post-hoc Tukey Test>
# Illustrative: pairwise comparisons of the Diet levels for the aov1 model fitted above
TukeyHSD(aov1)
</code>
In this case, the only significant difference in abundance occurs between the PlantInsect diet and the Vertebrate diet.
  
==== 3.Plotting ====
  
After having verified the assumptions of your ANOVA model, interpreted the ANOVA table and differentiated the effect of the treatments using post-hoc tests or contrasts, the ANOVA results can be graphically illustrated using a ''barplot''. This shows the response variable as a function of the explanatory variable levels, where standard errors can be superimposed on each bar, as well as the different letters representing the treatment groups (according to the post-hoc test).
{{:lm_fig_12.png?650|}}
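A minimal sketch of such a barplot (means and standard errors only, without the post-hoc letters; it assumes the ''bird'' data frame and ''logMaxAbund'' variable used above):

<code rsplus| Barplot sketch>
# Mean log abundance per Diet level, with standard error bars (illustrative)
means <- tapply(bird$logMaxAbund, bird$Diet, mean)
n     <- tapply(bird$logMaxAbund, bird$Diet, length)
se    <- tapply(bird$logMaxAbund, bird$Diet, sd) / sqrt(n)

bp <- barplot(means, ylab = "log(Maximum abundance)", xlab = "Diet",
              ylim = c(0, max(means + se) * 1.2))
arrows(bp, means + se, bp, means - se, angle = 90, code = 3, length = 0.05)
</code>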
  
==== 3.Contrasts (advanced section/ optional) ====
  
<hidden>
  * Contrasts are group mean comparisons based on an a priori hypothesis.
  * These groups can be composed of one or many levels of a factor.
  * We can test simple hypotheses (e.g. μ<sub>1</sub> = μ<sub>2</sub>) or more complex hypotheses (e.g. (μ<sub>1</sub> + μ<sub>2</sub>)/2 = μ<sub>3</sub>).

The number of comparisons has to be lower than or equal to the number of degrees of freedom of the ANOVA, and the comparisons have to be independent of one another.
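A user-defined contrast can be supplied directly to the factor before fitting the model. The sketch below is illustrative only: it tests the hypothesis (μ<sub>1</sub> + μ<sub>2</sub>)/2 = μ<sub>3</sub> and assumes that //Diet// has five levels; adjust the weights to your own factor and hypothesis.

<code rsplus| A priori contrast (sketch)>
levels(bird$Diet)                        # check the order of the factor levels

my.contrast <- cbind(c(1, 1, -2, 0, 0))  # weights sum to zero; assumes five Diet levels
contrasts(bird$Diet) <- my.contrast      # remaining contrasts are filled in automatically

aov.contr <- lm(logMaxAbund ~ Diet, data = bird)
summary(aov.contr)                       # the first Diet coefficient tests the contrast
</code>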

Contrasts are given as an R output for the ANOVA and can be called to visualize the parameter estimates for each level of the explanatory variable in comparison to a baseline level. This output is obtained using ''summary.lm(aov1)'' when the ANOVA model was implemented with the ''aov'' function, and using ''summary(anov1)'' when the ANOVA was implemented with the ''lm'' function. This output performs a linear regression for each level of the explanatory variable and calculates their associated parameters:
  
  Call: lm(formula = logMaxAbund ~ Diet, data = bird)
===== 4. Two-way ANOVA =====
  
In the above section, the ANOVA models had a single categorical variable. We can also create ANOVA models with multiple categorical explanatory variables. When there are two categorical explanatory variables, we refer to the model as a two-way ANOVA. A two-way ANOVA tests several hypotheses: that there is no difference in mean among levels of variable A; that there is no difference in mean among levels of variable B; and that there is no interaction between variables A and B. A significant interaction means that the mean value of the response variable for each level of variable A changes depending on the level of B. For example, perhaps the relationship between the colour of a fruit and its mass will depend on the plant species; if so, we say there is an interaction between colour and species.
  
Click to see the math in more detail below.
<hidden>
The one-way ANOVA table has to be rewritten to add the second explanatory term as well as the interaction term. Thus, a two-way ANOVA table corresponds to:
  
^ Source of\\ variation^ Degrees of\\ freedom (df)^ Sums of squares^ Mean squares ^ F-statistic^
  
a: number of levels of the explanatory variable A; b: number of levels of the explanatory variable B; r: number of replicates per treatment
 +</​hidden>​
 +
==== 4.1 Running a two-way ANOVA ====
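A two-way ANOVA with an interaction term can be fit with ''aov''. The following minimal sketch uses the built-in ''warpbreaks'' dataset (illustrative only, not part of the workshop data):

<code rsplus| Two-way ANOVA (sketch)>
# Number of warp breaks as a function of wool type, tension, and their interaction
two.way <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(two.way)   # main effects of wool and tension, plus the wool:tension interaction
</code>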
  
===== 6. ANCOVA =====
  
Analysis of covariance (ANCOVA) is a linear model that tests the influence of one categorical explanatory variable (or more) and one continuous explanatory variable (or more) on a continuous response variable. Each level of the categorical variable is described by its own slope and intercept. In addition to testing if the response variable differs for at least one level of the categorical variable, ANCOVA also tests whether the response variable might be influenced by its relationship with the continuous variable (called the covariate in ANCOVA), and by any differences between group levels in the way that the continuous variable influences the response (i.e. the interaction). The ANCOVA hypotheses are thus: there is no difference in the mean among levels of the categorical variable; there is no correlation between the response variable and the continuous explanatory variable; and there is no interaction between the categorical and continuous explanatory variables.
==== 6.1 Assumptions ====
  
As with the models seen above, to be valid, ANCOVA models must meet the statistical assumptions of linear models, which can be verified using diagnostic plots. In addition, ANCOVA models must have:
  - The same value range for all covariates
  - Variables that are //fixed//
  - No interaction between categorical and continuous variables
  
**__Note:__** A //fixed// variable is one that you are specifically interested in (e.g. bird mass). In contrast, a random variable is noise that you want to control for (e.g. the site a bird was sampled in).
The different possible goals of the ANCOVA are to determine the effects of:
  
  - the categorical and continuous variables on the response variable
  - the categorical variable(s) on the response variable after removing the effect of the continuous variable(s)
  - the categorical variable(s) on the relationship between the continuous variable(s) and the response variable
  
Importantly, these goals are only met if there is no significant interaction between the categorical and continuous variables! Examples of significant interactions between the categorical and continuous variables (for an ANCOVA with one factor and one covariate) are illustrated by the second and third panels below:
  
{{:ancova_schematic.png?550|}}
  
The same logic follows for ANCOVAs with multiple categorical and/or continuous variables.
  
==== 6.3 Running an ANCOVA ====
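As a minimal sketch, an ANCOVA can be fit with ''lm'' on the built-in ''CO2'' dataset, combining a continuous variable (//conc//) and a categorical variable (//Treatment//); the object name ''ancova.example'' is illustrative:

<code rsplus| ANCOVA (sketch)>
# CO2 uptake as a function of concentration (continuous), chilling treatment
# (categorical), and their interaction
ancova.example <- lm(uptake ~ conc * Treatment, data = CO2)
anova(ancova.example)
</code>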
  
If only your categorical variable is significant, drop your continuous variable from the model: you will then have an **ANOVA**.\\
If only your continuous variable is significant, drop your categorical variable from the model: you will then have a **simple linear regression**.\\
If your interaction is significant, you might want to test which levels of your categorical variable have different slopes, and to question whether ANCOVA is the most appropriate model.\\
  
In the CO2 example above, both the continuous and categorical variables are significant, but the interaction is non-significant. If you replace //Treatment// with //Type//, however, you will see an example of a significant interaction.
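To see this, the same model can be refitted with //Type// (a minimal sketch, assuming the model formula used above):

<code rsplus| Significant interaction (sketch)>
# Refit the CO2 ANCOVA with Type instead of Treatment
ancova.type <- lm(uptake ~ conc * Type, data = CO2)
anova(ancova.type)   # the conc:Type interaction term should now be significant
</code>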
  
If you want to compare means across factors, you can use adjusted means, which use the equations given by the ANCOVA to estimate the means of each level of the categorical variable, corrected for the effect of the covariate (the continuous variable):
  
<code rsplus| Adjusted means>
# Illustrative sketch using the 'effects' package; assumes the ANCOVA object
# ancova.example fitted above
library(effects)
adj.means <- effect("Treatment", ancova.example)
plot(adj.means)
</code>
===== 7. Multiple regression =====
  
Multiple regression tests the effects of several continuous explanatory variables on a response variable.
  
==== 7.1 Assumptions ====
  
In addition to the usual assumptions of linear models, it is important to test for orthogonality because it affects model interpretation. Variables are not orthogonal when explanatory variables are collinear: if one explanatory variable is correlated with another, they are likely to explain the same variability in the response variable, and the effect of one variable will be masked by the other.
  
If you see any pattern between two explanatory variables, they are collinear. Collinearity must be avoided, as the effect of each explanatory variable will be confounded! Possible solutions are:
  - Try multidimensional analysis //[[http://qcbs.ca/wiki/r_workshop9|(see workshop 9)]]//,
  - Try a pseudo-orthogonal analysis.

Collinearity between explanatory variables can be assessed with the variance inflation factor, using the ''vif'' function of the 'HH' package:

<code rsplus| Variance Inflation Factor>
library(HH)   # provides the vif() function used here
vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)
</code>

which gives the following output:

    clFD      clTmi     clTma      clP      grass 
   13.605855  9.566169  4.811837  3.196599  1.165775

As a variance inflation factor higher than 5 indicates collinear variables, the R output shows that //clDD//, //clFD// and //clTmi// are highly collinear. Only one of these explanatory variables can thus be retained in the final regression model.
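Pairwise correlations and scatterplots provide a complementary, visual check of the same collinearity. A minimal sketch (it assumes the ''Dickcissel'' data frame loaded in the next section):

<code rsplus| Visual check of collinearity (sketch)>
clim.vars <- Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP", "grass")]
round(cor(clim.vars), 2)   # correlation matrix of the candidate explanatory variables
pairs(clim.vars)           # scatterplot matrix
</code>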
==== 7.2 Dickcissel dataset ====
  
----
  
**CHALLENGE 7**

Compare the different polynomial models in the previous example, and determine which model is the most appropriate. Extract the adjusted R squared, the regression coefficients, and the p-values of this chosen model.
  
++++ Challenge 7: Solution|
</hidden>
==== 7.4 Stepwise regression ====
<hidden>
To obtain a final multiple regression model, users can first implement a full regression model containing all the explanatory variables and then drop the non-significant variables using a stepwise selection procedure. In this method, non-significant variables are successively dropped one by one from the model, and the goodness-of-fit of each successive model is compared based on the AIC [[http://en.wikipedia.org/wiki/Akaike_information_criterion|(Akaike's Information Criterion)]], until all the explanatory variables of the model are significant. Note that a lower AIC value indicates a better goodness of fit (i.e., the best model is the one with the lowest AIC). In R, stepwise selection is implemented using the function ''step'':
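A minimal sketch of this procedure (the full model below is an assumption: the workshop's actual full model may contain additional explanatory variables, and ''abund'' is assumed to be the Dickcissel abundance column):

<code rsplus| Stepwise selection (sketch)>
lm.full <- lm(abund ~ clDD + clFD + clTmi + clTma + clP + grass, data = Dickcissel)
lm.step <- step(lm.full)   # successively drops terms that do not lower the AIC
summary(lm.step)
</code>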
  
The model now accounts for 31.44% of the Dickcissel abundance variability, with //clTmi// being the most significant explanatory variable.\\
Nevertheless, some of the selected explanatory variables are highly correlated and should be dropped from the final model in order to remove uninformative variables.
</hidden>
  
===== 8. Variance partitioning (advanced section/ optional) =====
<hidden>