Differences

This shows you the differences between two versions of the page.

r_workshop8 [2016/02/06 10:24]
berenger_bourgeois
r_workshop8 [2021/10/13 16:05] (current)
lsherin
-======= QCBS R Workshops =======
-[[http://qcbs.ca/|{{:logo_text.png?nolink&500|}}]]
-This series of [[r|10 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology.
-These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
+<WRAP group>
+<WRAP centeralign>
+<WRAP important>
+<wrap em> __MAJOR UPDATE__ </wrap>
+
+<wrap em> As of Fall 2021, this wiki has been discontinued and is no longer being actively developed. </wrap>
+
+<wrap em> All updated materials and announcements for the QCBS R Workshop Series are now housed on the [[https://r.qcbs.ca/workshops/r-workshop-08/|QCBS R Workshop website]]. Please update your bookmarks accordingly to avoid outdated material and/or broken links. </wrap>
+
+<wrap em> Thank you for your understanding, </wrap>
+
+<wrap em> Your QCBS R Workshop Coordinators </wrap>
+
+</WRAP>
+</WRAP>
+<WRAP clear></WRAP>
  
 ======= QCBS R Workshops =======
 These open-access workshops were created by members of the Quebec Centre for Biodiversity Science both for members of the QCBS and the larger community.
  
-====== Workshop 8: Ordination ======
+//The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.//
  
-Developed by: Bérenger Bourgeois, Xavier Giroux-Bougard, Amanda Winegardner, Emmanuelle Chrétien and Monica Granados.
-(Material in R script obtained from: Borcard, Gillet & Legendre (2011). //Numerical Ecology with R//. Springer New York.)
+<wrap em>IMPORTANT NOTICE: MAJOR UPDATES</wrap>
  
-**Summary:** In this workshop, you will learn the basics of multivariate ordination analyses for your own data. You will learn to choose appropriate distance metrics and transformations for unconstrained ordinations, and to perform Principal Component Analysis, Correspondence Analysis, Principal Coordinate Analysis and Nonmetric MultiDimensional Scaling, to find patterns in your community composition data!
+**March 2021 update:** This wiki has been discontinued and is no longer being actively developed or updated. Updated materials for the QCBS R Workshop Series are now housed on the QCBS R Workshop [[https://github.com/QCBSRworkshops/workshop08|GitHub page]].
  
-Link to associated Prezi: [[https://prezi.com/isb-5h_tanxo/qcbs-r-workshop-8/|Prezi]]
+Available materials include:
+  - The [[https://qcbsrworkshops.github.io/workshop08/pres-en/workshop08-pres-en.html|Rmarkdown presentation]] for this workshop;
+  - An [[https://qcbsrworkshops.github.io/workshop08/book-en/workshop08-script-en.R|R script]] that follows the presentation - //*under construction*//;
+  - [[https://qcbsrworkshops.github.io/workshop08/book-en/index.html|Written materials]] that accompany the presentation in bookdown format - //*under construction*//.
  
-Download the R script and data required for this workshop:
-  *  [[http://qcbs.ca/wiki/_media/workshop8_ordination.R | R Script]]
-  *  [[http://qcbs.ca/wiki/_media/DoubsEnv.csv|DoubsEnv data]]
-  *  [[http://qcbs.ca/wiki/_media/DoubsSpe.csv|DoubsSpe data]]
-  *  [[http://qcbs.ca/wiki/_media/coldiss.R|Coldiss R function]]
+====== Workshop 8: Generalized additive models ======
  
-Make sure to load the following packages (see how in the R script):
-  *  [[http://cran.r-project.org/web/packages/vegan/index.html|vegan (for multivariate analyses)]]
-  *  [[http://cran.r-project.org/web/packages/gclus/index.html|gclus (for clustering graphics)]]
-  *  [[http://cran.r-project.org/web/packages/ape/index.html|ape (for phylogenetics)]]
+Developed by: Eric Pedersen and Zofia Taranu
  
-<code rsplus | Load the required packages and functions>
-install.packages("vegan")
-install.packages("gclus")
-install.packages("ape")
-library(vegan)
-library(gclus)
-library(ape)
-source(file.choose()) # use coldiss.R which you have downloaded to your own directory
-</code>
+**Summary:** The goal of today's workshop will be to first examine what we mean by a non-linear model, and how GAMs (Generalized Additive Models) allow us to handle non-linear relationships. We will also go over how to plot and interpret these non-linear relationships, how to include interaction terms and autocorrelated errors, and expand on a previous workshop by briefly examining a mixed modelling framework. Lastly, we will briefly touch upon what GAMs are doing behind the scenes.
  
-======What is ordination?======
+**Link to new [[https://qcbsrworkshops.github.io/workshop08/workshop08-en/workshop08-en.html|Rmarkdown presentation]]**
  
-Ordination is a set of methods for depicting samples in multiple dimensions (Clarke and Warwick 2011) and often feels like a catch-all term in ecological statistics. Ecologists are often told to “run a PCA” in the face of complex and messy multivariate data. R code for ordination techniques is readily available and relatively straight-forward to implement for most data. Interpretation of ordination analyses can be more difficult, especially if you are unsure of the specific questions you wish to explore with a particular ordination method. As such, while ordination methods are very useful for helping simplify and make sense of multivariate data, careful consideration of why the methods are being used and which ones are most appropriate is necessary for strong ecological analyses!
+Link to old [[https://prezi.com/ktlh8t8-03yt/|Prezi presentation]]
  
-When you use an ordination method, you are taking a set of variables and creating new principal axes along which samples (sites etc.) are scored or ordered (Gotelli and Ellison 2004), in order to reduce or simplify the data, i.e. to create new axes that represent most of the variation in the data. As an example, a dataset with 24 variables may be reduced to five principal components that represent the main patterns of variation amongst samples. Unconstrained ordination methods are generally not frameworks for hypothesis testing; rather they are best suited for exploratory data analysis. The different types of ordination can be useful for many different questions; see [[http://ordination.okstate.edu/|the Ordination Website]] for an overview of different types of ordination.
+Download the R script and data for this lesson:
+  - [[http://qcbs.ca/wiki/_media/gam_e.r|Script]]
+  - [[http://qcbs.ca/wiki/_media/other_dist.csv|Data]]
+===== Learning objectives =====
+  - Use the //mgcv// package to fit non-linear relationships,
+  - Understand the output of a GAM to help you understand your data,
+  - Use tests to determine if a non-linear model fits better than a linear one,
+  - Include smooth interactions between variables,
+  - Understand the idea of a basis function, and why it makes GAMs so powerful,
+  - Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.
+**Prerequisites:** Some experience in R (enough to be able to run a script and examine data and R objects) and a basic knowledge of regression (you should know what we mean by linear regression and ANOVA).
  
-======Getting started with data======
+===== 0. The linear model... and where it fails =====
  
-We will use two main data sets in the first part of this workshop. “DoubsSpe.csv” is a data frame of fish community data where the first column contains site names from 1 to 30 and the remaining columns are fish taxa. The taxa columns are populated by fish abundance data (counts). “DoubsEnv.csv” is a data frame of environmental data for the same sites contained in the fish community data frame. Again, the first column contains site names from 1 to 30. The remaining columns contain measurements for 11 abiotic variables. Note that data used in ordination analyses is generally in [[http://en.wikipedia.org/wiki/Wide_and_narrow_data|wide-format]].
+What do we mean by the linear model? Regression is the workhorse of statistics. It allows us to model a response variable as a function of predictors plus error. Linear regression is what most people first encounter in statistics. As we saw in the [[http://qcbs.ca/wiki/r_workshop4|linear models]] workshop, regression makes four major assumptions:
+  - normally distributed error
+  - the variance of the errors doesn't change (homoscedastic)
+  - i.i.d. (each error is independent of the others)
+  - a linear response: <m> {y} = {β_0} + {β_1}{x} </m>
  
-<code rsplus | Load the Doubs Species and Environmental Data>
-#Species community data frame (fish abundance): “DoubsSpe.csv”
-spe<- read.csv(file.choose(), row.names=1)
-spe<- spe[-8,] #Site number 8 contains no species and so row 8 (site 8) is removed. Be careful to
-#only run this command line once as you are overwriting "spe" each time.
-#Environmental data frame: “DoubsEnv.csv”
-env<- read.csv(file.choose(), row.names=1)
-env<- env[-8,] #Remove corresponding abiotic data for site 8 (since removed from fish data).
-#Again, be careful to only run the last line once.
-</code>
+There's only one way for the linear model to be right:
+{{::graphic0.1.png?350|}}
  
-======1. Explore the data======
+And yet so many ways for it to fail:
  
-=====1.1 Species data =====
+{{::graphic0.2.png?525|}}
  
-We can use summary functions to explore the “spe” data (fish community data) and discover things like the dimensions of the matrix, column headings and summary statistics for the columns. This is a review from Workshop 2.
+So how can we fix it? We must first know what the regression model is trying to do:
+   * fit a line that goes through the middle of the data,
+   * do this without overfitting the data, such as simply using a line between each point and its neighbours.
+Linear models do this by finding the best fit straight line that passes through the data. In contrast, additive models do this by fitting a curve through the data, but controlling how wiggly the line can get //(more on this later)//.
+===== 1. Introduction to GAMs =====
  
-<code rsplus | Explore DoubsSpe>
+Let's look at an example. First, we'll generate some data, and plot it.
-names(spe) #see names of columns in spe +
-dim(spe) #dimensions of spe; number of columns and rows  +
-str(spe) #displays internal structure of objects +
-head(spe) #first few rows of the data frame +
-summary(spe) #summary statistics for each column; min valuemedian value, max value, mean value etc  +
-</​code>​+
  
-Look at the species’ distribution frequencies.
+<file rsplus | Generate and plot data>
 +library(ggplot2) 
 +set.seed(10) 
 +n = 250 
 +x = runif(n,​0,​5) 
 +y_model = 3*x/​(1+2*x) 
 +y_obs = rnorm(n,​y_model,​0.1) 
 +data_plot = qplot(x, y_obs) +  
 +            geom_line(aes(y=y_model)) +  
 +            theme_bw() 
 +print(data_plot) 
 +</​file>​
  
-<code rsplus | Species distribution of DoubsSpe data>
-#Species distribution
-(ab <- table(unlist(spe))) #note that when you put an entire line of code in brackets like this, the output for that operation is displayed right away in the R console
-barplot(ab, las=1, xlab="Abundance class", ylab="Frequency", col=grey(5:0/5))
-</code>
+{{::graphic1.1.jpg?350|}}
+We can appreciate that if we were to fit this as a linear regression model, we would violate the assumptions listed above. Let's have a look, using the ''gam()'' command from the //mgcv// package here to model an ordinary least squares regression //(we will see below how to use ''gam()'' to specify a smoothed, non-linear term)//.
  
-{{:spe_barplot.png?300|}}
-Can see that there is a high frequency of zeros in the abundance data.
+<file rsplus | Linear model>
+library(mgcv)
 +linear_model = gam(y_obs~x) 
 +model_summary=summary(linear_model) 
 +print(model_summary) 
 +data_plot = data_plot+ 
 +             ​geom_line(colour="​red",​ 
 +             ​aes(y=fitted(linear_model))) 
 +print(data_plot) 
 +</​file>​
  
-See how many absences there are in the fish community data.
-<code rsplus | Absences in fish data>
-sum(spe==0)
-</code>
+We can see in the model summary that our linear model explains quite a bit of variance (R<sup>2</sup><sub>(adj)</sub> = 0.639); however, diagnostic plots of the model residuals would quickly show that the error variance is not normally distributed nor homoscedastic, and that an important non-linear pattern remains. Let's now try to fix this by fitting the data using a smoothed (non-linear) term.
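Those diagnostics are easy to produce for yourself; a minimal sketch using base R plots on the ''linear_model'' object fitted above:

<file rsplus | Residual diagnostics (sketch)>
par(mfrow = c(1, 2))
plot(fitted(linear_model), residuals(linear_model),
     xlab = "Fitted values", ylab = "Residuals")  # curvature and changing spread = non-linearity and heteroscedasticity
abline(h = 0, lty = 2)
qqnorm(residuals(linear_model)); qqline(residuals(linear_model))  # departures from the line = non-normal errors
par(mfrow = c(1, 1))
</file>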
  
-Look at the proportion of zeros in the fish community data.
-<code rsplus | Proportion of zeros>
-sum(spe==0)/(nrow(spe)*ncol(spe))
-</code>
-The proportion of zeros in the dataset is ~0.5.
-See the frequency at which different numbers of species occur together.
+We will revisit this later, but briefly, GAMs are effectively a nonparametric form of regression where the <m> beta </m>*x<sub>i</sub> of a linear regression is replaced by arbitrary smooth functions of the explanatory variables, f(x<sub>i</sub>), and the model becomes:
+
+<m>{y_i} = f(x_{i}) + {ε_i}</m>
-<code rsplus | Species occurrences> +
-spe.pres <- colSums(spe>​0) #compute the number of sites where each species is present.  +
-hist(spe.pres,​ main="​Species occurrence",​ las=1, xlab="​Frequency of occurrences",​ breaks=seq(0,30, by=5), col="​grey"​) +
-</code> +
-Can see that an intermediate number of sites contain the highest number of species. ​+
  
-Calculate the number of species that occur at each site. This is a simplistic way of calculating species richness for each site. In some cases, the number of individuals found in a site may differ between sites. Because there may be a relationship between the number of individuals counted in a site or sample and the taxa or species richness, [[http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/diversity.html|rarefied species richness]] may be a more appropriate metric than total richness. The rarefy() function in vegan can be used to generate rarefied species richness.
+where y<sub>i</sub> is the response variable, x<sub>i</sub> the covariate, and //f// is the smooth function.
  
-<code rsplus | Species richness>
-site.pres <- rowSums(spe>0) #number of species with a value greater than 0 in that site row
-hist(site.pres, main="Species richness", las=1, xlab="Frequency of sites", ylab="Number of species", breaks=seq(0,30, by=5), col="grey")
-</code>
+Importantly, given that the smooth function f(x<sub>i</sub>) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response. That is, as opposed to one fixed coefficient <m> beta </m>, the function //f// can continually change over the range of x<sub>i</sub>. The degree of smoothness (or wiggliness) of //f// is controlled using penalized regression determined automatically in //mgcv// using a generalized cross-validation (GCV) routine (Wood 2006).
  
-=====1.2 Environmental data=====
-Explore the environmental data and create a panel of plots to compare collinearity:
+In ''gam()'' smooth terms are specified by expressions of the form: ''s(x)'', where x is the covariate that the smooth is a function of.
  
-<code rsplus | Explore Doubs Env data>
-names(env)
-dim(env)
-str(env)
-head(env)
-summary(env)
-pairs(env, main="Bivariate Plots of the Environmental Data" )
-</code>
+<file rsplus | GAM>
+gam_model = gam(y_obs~s(x))
+summary(gam_model)
+data_plot = data_plot +
+     geom_line(colour="blue",aes(y=fitted(gam_model)))
+print(data_plot)
+</file>
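The automatic smoothness selection mentioned above can be inspected directly on the fitted object; a small sketch (''sp'' and ''gcv.ubre'' are standard components of an mgcv fit):

<file rsplus | Inspecting smoothness selection (sketch)>
gam_model$sp            # estimated smoothing parameter; larger values give smoother (less wiggly) fits
gam_model$gcv.ubre      # the minimized GCV score used to choose that smoothing parameter
summary(gam_model)$edf  # effective degrees of freedom of the smooth term
</file>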
  
-In this case, the environmental data (explanatory variables) are all in different units and need to be standardized prior to computing distance measures to perform most ordination analyses. Standardize the environmental data (11 variables) using the function decostand() in vegan.
+The variance explained by our model has increased by 20% (R<sup>2</sup><sub>(adj)</sub> = 0.859) and when we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is the winner.
  
-<code rsplus | Decostand>
+{{::graphic1.3.jpg?375|}}
-env.z <- decostand(env,​ method="​standardize"​) +
-apply(env.z, 2, mean) # the data are now centered (means~0) +
-apply(env.z,​ 2, sd)   # the data are now scaled (standard deviations=1) +
-</​code>​ +
-======2. Association measures====== ​+
  
-Matrix algebra ultimately lies below all ordinations. A matrix consists of data (e.g. measured values) in rows and columns. Prior to starting multivariate analyses, you likely have matrices with ecological data (such as the DoubsEnv or DoubsSpe that we have used before), but in ordination methods, your original data matrices are used to create association matrices. Before launching into ordination methods, it is important to spend some time with your data matrices. Creating an association matrix between objects or among descriptors allows for calculations of similarity or distance between objects or descriptors (Legendre and Legendre 2012). Exploring the possible association measures that can be generated from your data prior to running ordinations can help you to better understand what distance measure to use within ordination methods. At this point in time, it may be difficult to see the purpose behind different dissimilarity indices, or describing what they do to your multivariate data, but we will need this knowledge when we consider canonical ordination methods later on.
+The //mgcv// package also includes a default plot to look at the smooths:
  
-IN SUMMARY: To ordinate objects you need to compute distances among them. Distances can be computed in several ways, taking into account abundance or presence/absence data. More importantly, there are several properties that are important for distance metrics that are explored with the data examples below. For more information about the properties of distance metrics and some key terms, click on the hidden section below.
+<file rsplus | smooth plot>
 +plot(gam_model) 
 +</file>
  
-<hidden>
+How do we test whether the non-linear model offers a significant improvement over the linear model? We can use the ''gam()'' and ''anova()'' commands to formally test whether an assumption of linearity is justified. To do so, we must simply set our smoothed model so that it is nested in our linear model; that is, we create a model object that includes both ''x'' (linear) and ''s(x)'' (non-linear) and we ask whether adding ''s(x)'' to the model with only ''x'' as a covariate is supported by the data.
-**Some key terms:​** ​+
  
-**Association -** “general term to describe any measure or coefficient to quantify the resemblance or difference between objects or descriptors. In an analysis between descriptors,​ zero means no association.” (Legendre and Legendre 2012). ​ 
  
-**Similarity -** a measure that is “maximum (S=1) when two objects are identical and minimum when two objects are completely different.” (Legendre and Legendre 2012)
+<file rsplus | Test for linearity>
 +linear_model = gam(y_obs~x) 
 +nested_gam_model ​gam(y_obs~s(x)+x) 
 +print(anova(linear_model,​ nested_gam_model,​ test="​Chisq"​)) 
 +</​file>​
  
-**Distance (also called dissimilarity) -** a measure that is “maximum (D=1) when two objects are completely different”. (Legendre and Legendre 2012). Distance or dissimilarity (D) = 1-S.
+The non-linear term is significant:
  
-Choosing an association measure depends on your data, but also on what you know, ecologically, about your data. For example, Euclidean distance is a very common and easy to use distance measure and is useful for understanding how different two samples are based on co-occurrence of species. The way Euclidean distance is calculated though relies on zeros in the dataset, meaning that two samples or sites without any species in common may appear more similar than two samples that share a few species. This could be misleading and it would be best to choose a different distance measure if you have a lot of zeros in your species matrix. This is commonly referred to as the “double zero” problem in ordination analyses.
+   Analysis of Deviance Table
+   Model 1: y_obs ~ x
+   Model 2: y_obs ~ s(x) + x
+         Resid. Df    Resid. Dev     Df        Deviance    Pr(>Chi)
+   1    248.00        6.5846
+   2    240.68        2.4988         7.3168    4.0858      < 2.2e-16 ***
+   ---
+   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  
-Here are some commonly used dissimilarity (distance) measures (recreated from Gotelli and Ellison 2004):
+Note that the model ''y_obs~s(x)'' gives exactly the same results as ''y_obs~s(x)+x'', as shown above for the ''nested_gam_model''. We used the ''s(x)+x'' to illustrate the nestedness of the model, however, we will omit the ''+x'' in the nested model comparisons that follow.
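A quick way to convince yourself of this equivalence (a sketch; ''deviance()'' is a standard extractor for gam fits):

<file rsplus | Checking the equivalence (sketch)>
deviance(gam(y_obs ~ s(x)))      # smooth term only
deviance(gam(y_obs ~ s(x) + x))  # smooth plus redundant linear term; essentially the same deviance
</file>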
  
-^Measure name^Property^Description^
+----
-^Euclidean^Metric^Distance between two points in 2D space.^ +
-^Manhattan^Metric^Distance between two points, where the distance is the sum of differences of their Cartesian coordinates, i.e. if you were to make a right angle between the points.^
-^Chord^Metric^This distance is generally used to assess differences due to genetic drift.^ +
-^Mahalanobis^Metric^Distance between a point and a set distribution,​ where the distance is the number of standard deviations of the point from the mean of the distribution.^ +
-^Chi-square^Metric^Similar to Euclidean.^ +
-^Bray-Curtis^Semi-metric^Dissimilarity between two samples (or sites) where the sum of lower values for species present in both samples are divided by the sum of the species counted in each sample.^ +
-^Jaccard^Metric^[[http://​en.wikipedia.org/​wiki/​Jaccard_index|Description]] ^ +
-^Sorensen'​s^Semi-metric^Bray-Curtis is 1 - Sorensen^ +
-</​hidden>​+
  
-=====2.1 Distance measures=====
+**CHALLENGE 1**
  
-**Quantitative species data**
-We can use the vegdist() function to compute dissimilarity indices for community composition data. These can then be visualized as a matrix if desired.
+We will now try this test with some new simulated data, just to get a handle on it. First, generate the data using the code below, then fit a linear and smoothed GAM model to the relation between ''x_test'' and ''y_test_obs''.
+What is the effective degrees of freedom of the smoothed term? Determine if linearity is justified for these data.
  
-<code rsplus | vegdist>
-spe.db<-vegdist(spe, method="bray") # "bray" with presence-absence data is Sorensen dissimilarity
-spe.dj<-vegdist(spe, method="jac") # Jaccard dissimilarity
-spe.dg<-vegdist(spe, method="gower") # Gower dissimilarity
-spe.db<-as.matrix(spe.db) #Put in matrix form (can visualize, write to .csv etc.)
-</code>
+<file rsplus| Generate data>
+n <- 250
+x_test <- runif(n,-5,5)
+y_test_fit <- 4*dnorm(x_test)
+y_test_obs <- rnorm(n,y_test_fit, 0.2)
+</file>
  
-A condensed version of the spe.db matrix where the numbers represent the distance (dissimilarity) between the first 3 species in DoubsSpe would look like this:
+++++ Challenge 1 Solution |
-^ ^Species ​1^Species 2^Species 3^ +
-^Species 1^0.0^0.6^0.68^ +
-^Species 2^0.6^0.0^0.14^ +
-^Species 3^0.68^0.14^0.0^ +
-You can see that when comparing a species to itself (e.g. Species 1 to Species 1), the distance = 0, because species 1 is like itself and distance is maximum when the species compared are 100% different. ​+
  
-These same distance measures can be calculated from presence-absence data by setting binary=TRUE in the vegdist() function. This will give slightly different distance measures.
+<file rsplus| Challenge 1 solution>
+data_plot <- qplot(x_test, y_test_obs) +
 +  geom_line(aes(y=y_test_fit))+ 
 +  theme_bw() 
 +print(data_plot)
  
-You can also create graphical depictions of these association matrices using a function called coldiss.
-<hidden>
-This function can be sourced using the script coldiss.R:
+linear_model_test <- gam(y_test_obs~x_test)
+nested_gam_model_test <- gam(y_test_obs~s(x_test)+x_test)
+print(anova(linear_model_test, nested_gam_model_test, test="Chisq"))
-<code rsplus | coldiss>​ +
-windows() +
-coldiss(spe.db,​ byrank=FALSE,​ diag=TRUE) # Heat map of Bray-Curtis dissimilarity +
-windows(+
-coldiss(spe.dj, byrank=FALSE,​ diag=TRUE# Heat map of Jaccard dissimilarity +
-windows()  +
-coldiss(spe.dgbyrank=FALSEdiag=TRUE# Heat map of Gower dissimilarity +
-</​code>​ +
-{{:​coldiss_Bray.png?​800|}}+
  
-The figure shows a dissimilarity matrix which mirrors what you would see if you exported the matrix or viewed it within the R console. The ‘hot’ colour (purple) shows areas of high dissimilarity.
-</hidden>
+summary(nested_gam_model_test)$s.table
+</file>
  
-**Quantitative environmental data**
-Let’s look at //associations// between environmental variables (also known as Q mode):
-<code rsplus | dist, environmental variables>
-?dist # this function also computes dissimilarity matrices
-env.de<-dist(env.z, method = "euclidean") # euclidean distance matrix of the standardized environmental variables
-windows() #Creates a separate graphical window
-coldiss(env.de, diag=TRUE)
-</code>
+  Analysis of Deviance Table
+
+  Model 1: y_test_obs ~ x_test
+  Model 2: y_test_obs ~ s(x_test) + x_test
+       Resid. Df   Resid. Dev    Df       Deviance   Pr(>Chi)
+  1    248.0       81.09
+  2    240.5       7.46          7.5012   73.629     < 2.2e-16 ***
+  ---
+  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  
-We can then look at the //dependence// between environmental variables (also known as R mode):
-<code rsplus | correlations, environmental variables>
+                 edf   Ref.df        F p-value
+  s(x_test) 7.602145 8.029057 294.0944       0
-(env.pearson<-cor(env)) # Pearson r linear correlation +
-round(env.pearson,​ 2#Rounds the coefficients to 2 decimal points  +
-(env.ken<​-cor(env,​ method="​kendall"​)) # Kendall tau rank correlation +
-round(env.ken, 2)  +
-</​code>​+
  
-The Pearson correlation measures the linear correlation between two variables. The Kendall tau is a rank correlation which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
+Answer: Yes, non-linearity is justified. The effective degrees of freedom (edf) are >> 1.
  
-In some cases, there may be mixed types of environmental variables. Q mode can still be used to find associations between these environmental variables. We’ll do this by first creating an example dataframe:
+{{::challenge_1_soln.jpg?350|}}
-<code rsplus ​Example dataframe>​ +
-var.g1<​-rnorm(30,​ 0, 1) +
-var.g2<​-runif(30,​ 0, 5) +
-var.g3<​-gl(3,​ 10) +
-var.g4<​-gl(2,​ 5, 30) +
-(dat2<​-data.frame(var.g1,​ var.g2, var.g3, var.g4)) +
-str(dat2) +
-summary(dat2) +
-</​code>​+
  
-A dissimilarity matrix can be generated for these mixed variables using the Gower dissimilarity matrix:
-<code rsplus | daisy>
+++++
+===== 2. Multiple smooth terms =====
-?daisy #This function can handle NAs in the data +
-(dat2.dg<​-daisy(dat2,​ metric="​gower"​)) +
-coldiss(dat2.dg) +
-</​code>​+
  
-**Challenge 1 - Introductory**
+GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions. For this section, we'll be using a dataset automatically generated by the ''gamSim()'' command of the //mgcv// package. For further details on what type of data you can simulate with this function, see ''?gamSim''. Here, we will use example 5 (eg=5) of gamSim, which simulates an additive example plus a factor variable.
-Discuss with your neighbour: How can we tell how similar objects are when we have multivariate ​data? Make a list of all your suggestions+
  
-**Challenge 1 - Introductory Solution** +<file rsplus | Generate non-linear data
-<hidden+gam_data = gamSim(eg=5) 
-Discuss as a group. ​ +head(gam_data) ​ 
-</hidden>+</file>
  
-**Challenge 1 - Advanced**
-Calculate the Bray-Curtis and the Gower dissimilarity of species abundance for CHA, TRU and VAI for sites 1, 2 and 3 (using the “spe” and “env” dataframes) //without using the decostand() function.//
+We will now look at how we can predict ''y'' based on the other variables. Start with a basic model, with one smoothed term (x1) and one categorical predictor (x0, which has 4 levels).
  
-**Challenge 1 Advanced Solution**
-<hidden>
+<file rsplus | Generate non-linear data>
+basic_model = gam(y~x0+s(x1), data= gam_data)
 +basic_summary = summary(basic_model) 
 +print(basic_summary$p.table) 
 +print(basic_summary$s.table) 
 +plot(basic_model) 
 +</file>
  
-First, it is helpful to know the formula for Bray-Curtis dissimilarity:
+Here, the ''p.table'' provides the significance table for each parametric term and the ''s.table'' is the significance table for smooths. Note that for the latter, the wiggliness of the smoothed term ''s(x1)'' is indicated by the edf parameter (effective degrees of freedom); the higher the edf, the more non-linear the smoothing spline. A high value (8–10 or higher) means that the curve is highly non-linear, whereas a smoother with 1 effective degree of freedom is a straight line. In contrast, in linear regression the //model// degrees of freedom is equivalent to the number of non-redundant free parameters, p, in the model (and the //residual// degrees of freedom are given by n-p). We will revisit the edf later in this workshop.
-Bray-Curtis dissimilarity : d[jk] = (sum abs(x[ij]-x[ik]))/(sum (x[ij]+x[ik]))+
  
-Next, subset the species data so that only sites 1, 2 and 3 are included and only the species CHA, TRU and VAI:
-<code rsplus | Subset>
-spe.challenge<-spe[1:3,1:3] #”[1:3,” refers to rows 1 to 3 while “,1:3]” refers to the first 3 species columns (in this case the three variables of interest)
-</code>
+   print(basic_summary$p.table)
+               Estimate Std. Error   t value     Pr(>|t|)
+   (Intercept) 8.550030  0.3655849 23.387258 1.717989e-76
+   x02         2.418682  0.5165515  4.682364 3.908046e-06
 +   ​x03 ​        ​4.486193 ​ 0.5156501 ​ 8.700072 9.124666e-17 
 +   ​x04 ​        ​6.528518 ​ 0.5204234 12.544629 ​1.322632e-30 
 +   > print(basic_summary$s.table
 +              ​edf ​  ​Ref.df ​       F      p-value 
 +   s(x1) 1.923913 2.406719 42.84268 1.076396e-19
  
-Determine total species abundance for each site of interest (sum of the 3 rows). This will be for the denominator in the above equation. ​ 
-<code rsplus | Total sp abundance by site> 
-(Abund.s1<​-sum(spe.challenge[1,​])) 
-(Abund.s2<​-sum(spe.challenge[2,​])) 
-(Abund.s3<​-sum(spe.challenge[3,​])) 
-#() around code will cause output to print right away in console 
-</​code>​ 
  
-Now calculate the difference in species abundances for each pair of sites. For example, what is the difference between the abundance of CHA and TRU in site 1? You need to calculate the following differences:
+In our basic model the edf of smooth function s(x1) is ~2, which suggests a non-linear curve. The plot of the model nicely illustrates the shape of this non-linear smoother:
-CHA and TRU site 1 +
-CHA and VAI site 1 +
-TRU and VAI site 1 +
-CHA and TRU site 2 +
-CHA and VAI site 2 +
-TRU and VAI site 2 +
-CHA and TRU site 3 +
-CHA and VAI site 3 +
-TRU and VAI site 3+
  
-<code rsplus | Pairs of sites>
+{{::graphic2.2.jpg?425|}}
-Spec.s1s2<​-0 +
-Spec.s1s3<​-0 +
-Spec.s2s3<​-0 +
-for (i in 1:3) { +
-  Spec.s1s2<​-Spec.s1s2+abs(sum(spe.challenge[1,​i]-spe.challenge[2,i])) +
-  Spec.s1s3<​-Spec.s1s3+abs(sum(spe.challenge[1,​i]-spe.challenge[3,​i])) +
-  Spec.s2s3<​-Spec.s2s3+abs(sum(spe.challenge[2,​i]-spe.challenge[3,​i])) ​} +
-</​code>​+
  
-Now take the differences you have calculated as the numerator in the equation for Bray-Curtis dissimilarity and the total species abundance that you already calculated as the denominator.
+We can add a second term, x2, but specify a linear relationship with y (that is, both linear and non-linear terms can be included in a GAM). The new linear term, x2, will appear in the ''p.table'', for which a regression coefficient estimate is indicated. In the ''s.table'', we will once again find the non-linear smoother, s(x1), and its wiggliness parameter.
-<code rsplus | Bray-Curtis dissimilarity>​ +
-(db.s1s2<​-Spec.s1s2/(Abund.s1+Abund.s2)) #Site 1 compared to site 2 +
-(db.s1s3<​-Spec.s1s3/(Abund.s1+Abund.s3)) #Site 1 compared to site 3 +
-(db.s2s3<​-Spec.s2s3/​(Abund.s2+Abund.s3)) #Site 2 compared to site 3  +
-</​code>​+
  
-You should find values of 0.5 for site 1 to site 2, 0.538 for site 1 to site 3 and 0.053 for site 2 to 3.  
  
-Check your manual results with what you would find using the function vegdist() with the Bray-Curtis method:
-<code rsplus | vegdist, Bray-Curtis dissimilarity>
-(spe.db.challenge<-vegdist(spe.challenge, method="bray"))
-</code>
+<file rsplus | two term>
+two_term_model <- gam(y~x0+s(x1)+x2, data=gam_data)
+two_term_summary <- summary(two_term_model)
+print(two_term_summary$p.table)
+print(two_term_summary$s.table)
 +</file>
  
-A matrix looking like this is produced, which should be the same as your manual calculations:
+If we wish to explore whether the relationship between y and x2 is non-linear, we can model x2 as a non-linear smooth term instead. As before, we can use an ANOVA to test if the smoothed term is necessary.
-^ ^Site 1^Site 2^ +
-^Site 2^0.5^--^ +
-^Site 3^0.538^0.0526^+
  
-For the Gower dissimilarity, proceed in the same way but use the appropriate equation:
-Gower dissimilarity :  d[jk] = (1/M) sum(abs(x[ij]-x[ik])/(max(x[i])-min(x[i])))
-<code rsplus | Challenge solution, Gower>
-# Calculate the number of columns in your dataset
-M<-ncol(spe.challenge)
+<file rsplus | two smooth>
+two_smooth_model <- gam(y~x0+s(x1)+s(x2), data=gam_data)
+two_smooth_summary <- summary(two_smooth_model)
+print(two_smooth_summary$p.table)
+print(two_smooth_summary$s.table)
 +plot(two_smooth_model,​page=1) 
 +</​file>​
  
-# Calculate the species abundance differences between pairs of sites for each species
+{{::graphic2.4.jpg?600|}}
-Spe1.s1s2<​-abs(spe.challenge[1,​1]-spe.challenge[2,​1]) +
-Spe2.s1s2<​-abs(spe.challenge[1,​2]-spe.challenge[2,​2]) +
-Spe3.s1s2<​-abs(spe.challenge[1,​3]-spe.challenge[2,​3]) +
-Spe1.s1s3<​-abs(spe.challenge[1,​1]-spe.challenge[3,​1]) +
-Spe2.s1s3<​-abs(spe.challenge[1,​2]-spe.challenge[3,​2]) +
-Spe3.s1s3<​-abs(spe.challenge[1,​3]-spe.challenge[3,​3]) +
-Spe1.s2s3<​-abs(spe.challenge[2,​1]-spe.challenge[3,​1]) +
-Spe2.s2s3<​-abs(spe.challenge[2,​2]-spe.challenge[3,​2]) +
-Spe3.s2s3<​-abs(spe.challenge[2,​3]-spe.challenge[3,​3])+
  
-# Calculate the range of each species abundance between sites
+When more than one covariate is included in the model, as above, the fitted response can be partitioned into the contributions of each variable as shown. Here we can appreciate the varying magnitude of the effect of each variable; the y-axis represents the contribution (effect) of each covariate to the fitted response, centered on 0. If the confidence intervals had overlapped with zero for certain values of x (or throughout the entire range), this would imply a non-significant effect at those x values (or of x in its entirety). When the contribution for an individual covariate changes along the range of the x-axis, the change in that covariate is associated with a change in the response.
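A minimal sketch of how a single term's contribution (with its confidence band) can be drawn from the model above; ''select'', ''shade'' and ''seWithMean'' are standard arguments of ''plot.gam()'':

<file rsplus | Plotting one term's contribution (sketch)>
plot(two_smooth_model, select = 2, shade = TRUE, seWithMean = TRUE)  # contribution of s(x2) only
abline(h = 0, lty = 2)  # where the band overlaps this line, the term has no detectable effect
</file>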
-Range.spe1<-max(spe.challenge[,​1]- min (spe.challenge[,1]) +
-Range.spe2<​-max(spe.challenge[,​2]) - min (spe.challenge[,​2]) +
-Range.spe3<-max(spe.challenge[,3]) - min (spe.challenge[,​3])+
  
-# Calculate the Gower dissimilarity
-(dg.s1s2<-(1/M)*((Spe2.s1s2/Range.spe2)+(Spe3.s1s2/Range.spe3)))
-(dg.s1s3<-(1/M)*((Spe2.s1s3/Range.spe2)+(Spe3.s1s3/Range.spe3)))
+<file rsplus | ANOVA>
+anova(basic_model,two_term_model,two_smooth_model, test="Chisq")
+</file>
-(dg.s2s3<-(1/M)*((Spe2.s2s3/​Range.spe2)+(Spe3.s2s3/​Range.spe3)))+
  
-# Compare your results
-(spe.db.challenge<-vegdist(spe.challenge, method="gower"))
-</code>
-</hidden>
+    Analysis of Deviance Table
+    Model 1: y ~ x0 + s(x1)
+    Model 2: y ~ x0 + s(x1) + x2
+    Model 3: y ~ x0 + s(x1) + s(x2)
+      Resid. Df Resid. Dev      Df Deviance   Pr(>Chi)
+    1    394.08     5231.6
+    2    393.10     4051.3 0.97695   1180.2  < 2.2e-16 ***
+    3    385.73     1839.5 7.37288   2211.8  < 2.2e-16 ***
 +    --- 
 +    Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  
 +The best fit model is the model with both smooth terms for x1 and x2.
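As a quick cross-check (not in the original script), the same ranking can be obtained by comparing the AIC of the three fitted models:

<file rsplus | AIC comparison (sketch)>
AIC(basic_model, two_term_model, two_smooth_model)  # lower AIC indicates a better fit/complexity trade-off
</file>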
  
 +----
  
-=====2.2. Transformations of community composition data===== 
  
-Sometimes species/community data may also need to be standardized or transformed. The decostand() function in vegan provides standardization and transformation options for community composition data.
+**CHALLENGE 2**
  
-Transforming abundance or count data to presence-absence data:
-<code rsplus | decostand, presence-absence>
+Create two new models, with x3 as a linear and smoothed term.
+Use plots, coefficient tables and the anova function to determine if x3 is an important term to include.
-spe.pa<​-decostand(spe,​ method="​pa"​) ​ +
-</​code>​+
  
-Other transformations can be used to correct for the influence of rare species, e.g. the Hellinger transformation:
+++++ Challenge 2 solution |
-<code rsplus ​Hellinger and Chi-square>​ +
-#Hellinger transformation +
-spe.hel<​-decostand(spe,​ method="​hellinger"​) #can also use method=”hell”+
  
-#Chi-square transformation
-spe.chi<-decostand(spe, method="chi.square")
-</code>
+<file rsplus | Challenge 2 solution>
+three_term_model <- gam(y~x0+s(x1)+s(x2)+x3, data=gam_data)
+three_smooth_model <- gam(y~x0+s(x1)+s(x2)+s(x3), data=gam_data)
 +three_smooth_summary ​<- summary(three_smooth_model)
  
-**Optional Challenge 2 - Advanced**
-Produce Hellinger and Chi-square transformed abundance data for the “spe” data without using decostand().
+print(three_smooth_summary$p.table)
+print(three_smooth_summary$s.table)
 +plot(three_smooth_model,​page=1) 
 +# edf = 1 therefore term is linear.
  
-**Optional Challenge 2 - Advanced Solution**
-<hidden>
-The Hellinger transformation is a transformation that down-weights the importance given to rare species in a sample.
+anova(two_smooth_model,three_term_model,test="Chisq")
+# term x3 is not significant
+</file>
  
-<code rsplus | Solution>
-# Hellinger transformation
+++++
+===== 3. Interactions =====
-# First calculate the total species abundances by site +
-(site.totals=apply(spe, 1, sum))+
  
-# Scale species abundances by dividing them by site totals
-(scale.spe<-spe/site.totals)
+There are two ways to include interactions between variables:
+  * if one variable is smoothed, and the other isn't: using the 'by' argument,
+  * fully smoothing both variables.
+The 'by' argument lets you have a smooth term vary between different levels of a factor. We will examine this using our categorical variable x0 and ask whether the non-linear smoother s(x2) varies across different levels of x0. To determine whether smooth terms differ significantly among factor levels, we will use an ANOVA test on the interaction term.
  
-# Calculate the square root of scaled species abundances
-(sqrt.scale.spe<-sqrt(scale.spe))
+<file rsplus | categorical interaction>
+categorical_interact <- gam(y~x0+s(x1)+s(x2,by=x0),data=gam_data)
 +categorical_interact_summary <- summary(categorical_interact) 
 +print(categorical_interact_summary$s.table) 
 +plot(categorical_interact,​page=1) 
 +# or alternatively:​ plot using vis.gam function, where theta is the degree rotation on the x-y plane 
+vis.gam(categorical_interact,view=c("x2","x0"),theta=40,n.grid=500,border=NA)
 +anova(two_smooth_model,​ categorical_interact,​test="​Chisq"​) 
 +</​file>​ 
 +{{::​graphic3.1b.png?​350|}}
  
-# Compare the results
+Visually, we see that the shapes of the smooth terms are comparable among the four levels of x0. The anova test confirms this as well (Deviance = 98.6, p = 0.2347).
-sqrt.scale.spe +
-spe.hel +
-sqrt.scale.spe-spe.hel # or: sqrt.scale.spe/spe.hel+
  
-# Chi-square transformation
+Finally, we'll look at the interaction between two smooth terms, x1 and x2. Here the 'by' argument is dropped (both variables are quantitative).
-# First calculate ​the total species abundances ​by site +
-(site.totals<​-apply(spe,​ 1, sum))+
  
-# Then calculate the square root of total species abundances
-(sqrt.spe.totals<-sqrt(apply(spe, 2, sum)))
+<file rsplus | smooth interact>
+smooth_interact <- gam(y~x0+s(x1,x2),data=gam_data)
 +smooth_interact_summary ​<- summary(smooth_interact) 
 +print(smooth_interact_summary$s.table) 
 +plot(smooth_interact,page=1,scheme=3) 
+# plot(smooth_interact,page=1,scheme=1) will give a similar plot to the vis.gam()
 +vis.gam(smooth_interact,​view=c("​x1","​x2"​),​theta=40,​n.grid=500,​border=NA)  
 +anova(two_smooth_model,​smooth_interact,​test="​Chisq"​) 
 +</​file>​
  
-# Scale species abundances by dividing them by the site totals and the species totals
+{{::graphic3.2b.png?350|}}
-scale.spe2<​-spe +
-for (i in 1:​nrow(spe)) ​{ +
-  for (j in 1:ncol(spe)) { +
-   ​(scale.spe2[i,​j]=scale.spe2[i,​j]/​(site.totals[i]*sqrt.spe.totals[j])) ​  }}+
  
-#Adjust the scaled species abundances by multiplying by the square root of the species matrix total
+The interaction between s(x1) and s(x2) is significant and the 2D plot nicely illustrates this non-linear interaction, where y is greatest at high values of x1 but low to mid-values of x2 (try playing around with ''theta'' to rotate the figure around. If you plan to run a lot of these, delete the argument ''n.grid=500'' to make the plots run faster).
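An alternative worth knowing about (not in the original script) is the tensor-product smooth ''te()'', often preferred when the two interacting covariates are measured on different scales; a minimal sketch using the same gam_data:

<file rsplus | Tensor-product alternative (sketch)>
smooth_interact_te <- gam(y ~ x0 + te(x1, x2), data = gam_data)  # te() builds a tensor-product interaction surface
summary(smooth_interact_te)$s.table
</file>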
-(adjust.scale.spe2<​-scale.spe2*sqrt(sum(rowSums(spe))))+
  
-# Compare the results
+===== 4. Changing basis =====
-adjust.scale.spe2 +
-spe.chi +
-adjust.scale.spe2-spe.chi # or: adjust.scale.spe2/​spe.chi +
-</​code>​ +
-</​hidden>​+
  
-======3. Unconstrained Ordination======
+We won't go too much further into this in this workshop, but you should know that you can expand on the basic model we've looked at today with:
 +  * more complicated smooths, by changing the basis, 
 +  * other distributions:​ anything you can do with a GLM using the family argument, 
+  * mixed effect models, using gamm or the gamm4 package (see the sketch after this list).
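A minimal sketch of what those three extensions look like as calls (assumptions: gam_data from above is reused; the Poisson example is purely illustrative):

<file rsplus | Extending the basic GAM (sketch)>
# 1. a different basis: cubic regression spline instead of the default thin-plate spline
basis_model <- gam(y ~ s(x1, bs = "cr"), data = gam_data)
# 2. a different distribution, via the family argument (illustrative only; needs count data)
# count_model <- gam(count ~ s(x1), family = poisson, data = some_count_data)
# 3. a mixed model: a random intercept for the factor x0, fitted with gamm()
mixed_model <- gamm(y ~ s(x1), random = list(x0 = ~1), data = gam_data)
</file>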
+We will first have a look at changing the basis, followed by a quick intro to using other distributions and GAMMs (Generalized Additive Mixed Models). Let's look at one place where knowing how to change the basis can help: cyclical data. That is, data where you want the predictor to match at the ends. Imagine you have a time series of climate data, broken down by monthly measurement, and you want to see if there's a time-trend in yearly temperature. We'll use the Nottingham temperature time series for this:
  
-Unconstrained ordination allows us to organize samples, sites or species along continuous gradients (e.g. ecological or environmental). The key difference between unconstrained and constrained ordination (discussed later in this workshop, see ‘Canonical Ordination’) is that in the unconstrained techniques we are not attempting to define a relationship between independent and dependent sets of variables.
+<file rsplus | cyclic basis>
 +data(nottem) 
 +n_years <- length(nottem)/​12 
 +nottem_month <rep(1:12, times=n_years) 
 +nottem_year <- rep(1920:​(1920+n_years-1),​each=12) 
 +nottem_plot <- qplot(nottem_month,​nottem,​  
 +                    colour=factor(nottem_year),​  
 +                    geom="​line"​) + theme_bw() 
 +print(nottem_plot) 
 +</​file>​
  
-Unconstrained ordination can be used to:
-- Assess relationships //within// a set of variables (not between sets).
-- Find key components of variation between samples, sites, species etc.
-- Reduce the number of dimensions in multivariate data without substantial loss of information.
-- Create new variables for use in subsequent analyses (such as regression). These principal components are weighted, linear combinations of the original variables in the ordination.
+Using the nottem data, we have created three new vectors; ''n_years'' corresponds to the number of years of data (20 years), ''nottem_month'' is a categorical variable coding for the 12 months of the year, for every year sampled (sequence 1 to 12 repeated 20 times), and ''nottem_year'' is a variable where the year corresponding to each month is provided.
-[[http://​www.umass.edu/​landeco/​teaching/​multivariate/​schedule/​ordination1.pdf|Source]]+
  
-=====3.1 Principal Component Analysis:=====
+Monthly data from the years 1920 through to 1940:
  
-Principal component analysis (PCA) was originally described by Pearson (1901) although it is more often attributed to Hotelling (1933) who proposed it independentlyThe method and several of its implications for data analysis are presented in the seminal paper of Rao (1964). PCA is used to generate a few key variables from a larger set of variables that still represent as much of the variation in the dataset as possible. PCA is a powerful technique to analyze quantitative descriptors (such as species abundances),​ but cannot be applied to binary data (such as species absence/​presence)+{{::​graphic4.1.jpg?​500|}} ​
  
-Based on a dataset containing normally distributed variables, the first axis (or principal-component axis) of a PCA is the line that goes through the greatest dimension of the concentration ellipsoid describing the multinormal distribution. Similarly, the following axes go through the dimensions of the concentration ellipsoid by decreasing length. One can thus derive a maximum of p principal axes from a data set containing p variables.
+To model this, we need to use what's called a cyclical cubic spline, or "cc", to model month effects as well as a term for year.
  
-For this, PCA rotates the original system of axes defined by the variables in a way that the successive new axes are orthogonal to one another and correspond to the successive dimensions of maximum variance of the scatter of points. The new variables produced in a PCA are uncorrelated with each other (axes are orthogonal to each other) and therefore may be used in other analyses such as multiple regression ​(Gotelli and Ellison 2004). The principal components give the positions of the objects in the new system of coordinates. PCA works on an association matrix among variables containing the variances and covariances of the variables. PCA preserves Euclidean distances and detects linear relationships. As a consequenceraw species abundance data are subjected to a pre-transformation ​(i.e. a Hellinger transformationbefore computing a PCA+<file rsplus | cyclic gam > 
+year_gam <- gam(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
 +summary(year_gam)$s.table 
 +plot(year_gam,​page=1,​ scale=0) 
 +</​file>​
  
-//To do a PCA you need://
+{{::graphic4.2.jpg?700|}}
-- A set of variables (with no distinction between independent or dependent variables, i.e. a set of species OR a set of environmental variables).  +
-- Samples that are measured for the same set of variables +
-- Generally a dataset that is longer than it is wide is preferred+
  
-PCA is most useful with data with more than two variables, but is easier to describe using a two dimensional example. In this example (derived from Clarke and Warwick 2001), we have nine sites with abundance data for two species:
+There is about a 1-1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees variation in temperature, on average.
+The actual data vary around these values and that is the unexplained variance. Here we can see one of the very interesting bonuses of using GAMs: we can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.
  
-^Site^Species 1^Species 2^
+Over longer time scales, for example using paleolimnological data, others ([[http://www.aslo.info/lo/toc/vol_54/issue_6_part_2/2529.pdf|Simpson & Anderson 2009; Fig. 3c]]) have used GAMs to plot the contribution (effect) of temperature on algal assemblages in lakes, to illustrate how significant contributions only occurred during two extreme cold events (that is, the contribution is significant when the confidence intervals do not overlap zero, at around 300 and 100 B.C.). This allowed the authors to not only determine how much variance was explained by temperature over the last few centuries, but also pinpoint when in time this effect was significant. If of interest to you, the code to plot either the response surface (''type = "response"'') or the terms (''type = "terms"'') is given below. When terms is selected, you obtain the same figure as above.
-^A^6^2^ +
-^B^0^0^ +
-^C^5^8^ +
-^D^7^6^ +
-^E^11^6^ +
-^F^10^10^ +
-^G^15^8^ +
-^H^18^14^ +
-^I^14^14^+
  
-If you plotted these samples in the two dimensions, the plot would look like this:  
-{{:​pcaex_1.png?​500|}} 
  
-This is a straightforward scatter plot and is an ordination plot, but you can imagine that it is more difficult to produce and visualize such a plot if there were more than two species. In that case, you would want to reduce the number of variables to principal components. If the 2D plot above was too complex and we wanted to reduce the data into a single dimension, we would be doing so with a PCA:
+++++ Contribution plot|
+<file rsplus | contribution plot>
+pred <- predict(year_gam, type = "terms", se = TRUE)
 +I<​-order(nottem_year) 
 +plusCI<​-I(pred$fit[,​1] + 1.96*pred$se[,1]) 
 +minusCI<​-I(pred$fit[,​1] - 1.96*pred$se[,​1]) 
 +xx <- c(nottem_year[I],​rev(nottem_year[I])) 
 +yy <- c(plusCI[I],​rev(minusCI[I])) 
 +plot(xx,​yy,​type="​n",​cex.axis=1.2) 
 +polygon(xx,​yy,​col="​light grey",​border="​light grey"​) 
 +lines(nottem_year[I],​ pred$fit[,​1][I],lty=1) 
 +abline(h=0) 
 +</​file>​ 
 +++++
  
-{{:pcaex_2.png?500|}}
+===== 5. Other distributions =====
  
-In this case, the first principal component is the line of best fit through the points (sites), with the sites perpendicular to the line.
+To provide a brief overview on how to use GAMs when the response variable does not follow a normal distribution and is either count or proportion data (e.g., Gamma, binomial, Poisson, negative binomial), the example that follows uses a dataset where a binomial family distribution is needed and a non-linear relationship with the explanatory variable is evident. Here, the response variable represents the number of successes (event happened) versus failures over the course of an experiment.
  
-A second principal component is then added perpendicular to the first axis:
+<file rsplus| Loading the data>
 +gam_data3 <- read.csv("​other_dist.csv"​) 
 +summary(gam_data3) 
 +str(gam_data3) 
 +</​file>​
  
-{{:pcaex_3.png?500|}}
+  'data.frame': 514 obs. of  4 variables:
 +   $ prop : num  1 1 1 1 0 1 1 1 1 1 ... 
 +   $ total: int  4 20 20 18 18 18 20 20 20 20 ... 
 +   $ x1   : int  550 650 750 850 950 650 750 850 950 550 ... 
 +   $ fac  : Factor w/ 4 levels "​f1","​f2","​f3",​..:​ 1 1 1 1 1 1 1 1 1 1 ...
  
-The final plot then is the two PC axes rotated where the axes are now principal components as opposed to species:
+''prop'' is the response variable, and is equal to the proportion of //successes / (successes + failures)//. Note that there are numerous cases where the proportion equals 1 or 0 which indicates that the outcomes were always successes or failures, respectively, at a given time point in the experiment.\\
 +''​x1''​ is the time since the start of experiment (our explanatory variable).\\ 
 +''​total''​ represents the number of //successes + failures// observed at any time point of the experiment.\\ 
 +''​fac''​ is a factor coding for trials 1 through 4 of the experiment (we will not use this variable in this section).
  
-{{:pcaex_4.png?500|}}
+Let's start by visualizing the data. We are interested in the number of successes in comparison to failures as x1 increases.
 +We will calculate the grand averages of the proportions of successes per time bin (x1) given that there are repeated measures per x1 value (numerous trials and observations per trial).
  
-For PCAs with more than two variables, principal components are added like this (Clarke and Warwick 2001):
+<file rsplus| Data visualization>
+emptyPlot(range(gam_data3$x1), c(0,1), h=.5,
 +          main="​Probability of successes",​ ylab="​Probability",​xlab="​x1"​)
  
-PC1 = axis that maximises the variance of the points that are projected perpendicularly onto the axis.
-PC2 = must be perpendicular to PC1, but the direction is again the one in which variance is maximised when points are perpendicularly projected.
-PC3 and so on: perpendicular to the first two axes.
-Essentially, when there are more than two dimensions, PCA produces a new space in which all PCA axes are orthogonal (e.g. where the correlation among any two axes is null) and where the PCA axes are ordered according to the percent of the variance of the original data they explain.
+avg <- aggregate(prop ~ x1, data=gam_data3, mean, na.rm=TRUE)
+lines(avg$x1, avg$prop, col="orange", lwd=2)
+</file>
  
-The “spe” data includes 27 fish taxa. To simplify the 27 fish taxa into a smaller number of fish-related variables or to identify where different sites or samples are associated with particular fish taxa, we can run a PCA.
+{{::graphic1_otherdist.jpg?450|}}
  
 +As x1 increases, so does the probability of successes. Would you say that this trend is linear or non-linear? We will test this using a logistic GAM (we use a ''​binomial''​ family distribution given that our response is proportion data).
  
-Run a PCA on Hellinger-transformed species data:
-<code rsplus | PCA using rda() in vegan>
-#Run the PCA using the rda() function (NB: rda() is used for both PCA and RDA)
-spe.h.pca <- rda(spe.hel)
-#Extract the results
-summary(spe.h.pca) #overall results
+<file rsplus| Logistic GAM>
+prop_model <- gam(prop~ s(x1), data=gam_data3, weights=total, family="binomial")
+prop_summary <- summary(prop_model)
+print(prop_summary$p.table)
+print(prop_summary$s.table)
+plot(prop_model)
+</file>
-</code>+
  
-The summary looks like this:
-{{:pcaoutput_1v2.png?800|}}
-{{:pcaoutput_2.png?800|}}
+               Estimate  Std. Error  z value   Pr(>|z|)
+  (Intercept)  1.173978  0.02709613  43.32641  0
+
+          edf       Ref.df    Chi.sq     p-value
+  s(x1)   4.591542  5.615235  798.9407   2.027751e-164
-{{:​pcaoutput_3.png?800|}}+
  
  
-**Interpretation of ordination results:**
+What does the intercept represent in this model?
+  Recall that the model uses the count outcomes to calculate the //logit//, which is the log odds ratio between successes and failures:
+<m> logit = log({N_{success}/N_{failures}}) </m>
  
-A new word from this output is "eigenvalue". An eigenvalue is the value of the change in the length of a vector, and for our purposes is the amount of variation explained by each axis in a PCA. As you can see, the summary function produces a fair amount of information. From the summary we can see how much of the variance in the data is explained by the unconstrained variables (i.e. variables where we have made no effort to define relationships between the variables). In this case, the total variance of the sites explained by the species is 0.5. The summary also tells you what proportion of the total explained variance is explained by each principal component in the PCA: the first axis of the PCA thus explains 51.33% of the variation, and the second axis 12.78%. You can also extract certain parts of the output, try this:
-<code rsplus | Part of the PCA summary>
-summary(spe.h.pca, display=NULL) # only eigenvalues and their contribution to the variance
+- If successes = failures, the ratio is 1 and the logit is 0 (log(1) = 0).\\
+- If successes have a larger count than failures, the ratio is greater than 1 and the logit has a positive value (e.g., log(2) = 0.69).\\
+- If successes have a smaller count than failures, the ratio is lower than 1 and the logit has a negative value (e.g., log(0.5) = -0.69).\\
-eigen(cov(spe.hel)) # also compute the eigenvalues +
-</​code>​+
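+To see these three cases concretely, the log odds can be computed directly in R (the counts below are made up purely for illustration):
+<file rsplus| Log odds by hand (sketch)>
+log(50/50)   # equal successes and failures: log(1) = 0
+log(80/40)   # more successes than failures: log(2) is about 0.69
+log(40/80)   # fewer successes than failures: log(0.5) is about -0.69
+</file>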
  
-Sometimes you may want to extract the scores (i.e. the coordinates within a PCA biplot) for either the “sites” (the rows in your dataset, whether they be actual sites or not) or the “species” (the variables in your data, whether they be actual species or some other variables). This is useful if you want to then use a principal component as a variable in another analysis, or to make additional graphics. For example, with the “spe” dataset, you might want to obtain a single variable that is a composite of all the fish abundance data and then use that variable in a regression with another variable, or plot it across a spatial gradient. To extract scores from a PCA, use the scores() function:  +Thus, the intercept is the log odds ratio of successes to failures (logit), and indicates whether on average there are more successes than failures. Here, the estimated intercept coefficient is positive, which means that there are more successes than failures overall.
-<code rsplus | scores()>​ +
-spe.scores <- scores(spe.h.pcadisplay="​species",​ choices=c(1,​2)) # species scores ​on the first two PCA axes +
-site.scores <- scores(spe.h.pcadisplay="​sites",​ choices=c(1,​2)) # sites scores on the first two PCA axes +
-#Note: if you don't specify the number of principal components to extract (e.g. choices=c(1,2) or choices=c(1:2)), then all of the scores will be extracted for all of the principal components
-</​code>​+
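+A sketch of how the intercept can be read on the probability scale (assuming the ''prop_model'' fitted above; ''plogis()'' is the inverse-logit function in base R):
+<file rsplus| Intercept on the probability scale (sketch)>
+# back-transform the estimated log odds ratio to a probability of success
+plogis(coef(prop_model)["(Intercept)"])
+# with the estimate printed above, plogis(1.174) is roughly 0.76
+</file>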
  
-The PCA on the “spe” fish data produces as many principal components as there are fish taxa (columns), which in this case means that 27 principal components are produced. In many cases though, you may have done a PCA to reduce the number of variables to deal with and to produce composite variables for the fish. In this case, you are likely interested in knowing how many of these principal components are actually significant or add new information to the PCA (i.e. how many principal components do you need to retain before you aren't really explaining any more variance with the additional principal components). To determine this, you can use the Kaiser-Guttman criterion and produce a barplot showing the point at which the principal components are no longer explaining a significant amount of variance. The barplot produced by the code below highlights the point at which a new principal component explains less than the average amount of variance explained by all of the eigenvalues:  +What does the smooth term indicate?
-<code rsplus | Significant axes> +  * This represents how the log odds of successes vs failures changes over x1 (time in this example). Since the edf > 1, the trend is non-linear: the proportion of successes increases more and more rapidly over time (if, for example, the response represents the counts of species A versus species B, and nutrient concentrations are increased over time, these results would indicate that species A is observed increasingly often as nutrient concentrations approach its niche optimum over the course of the experiment).
-# Identify the significant axis using the Kaiser-Guttman criterion +
-ev <- spe.h.pca$CA$eig +
-ev[ev>​mean(ev)] +
-n <- length(ev) +
-barplot(ev, main="​Eigenvalues",​ col="​grey",​ las=2) +
-abline(h=mean(ev),​ col="​red"​)  +
-legend("​topright",​ "​Average eigenvalue",​ lwd=1, col=2, bty="​n"​) +
-</​code>​+
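+One way to check the intuition that the trend is non-linear is to compare the smooth fit against a purely linear (on the logit scale) fit; a sketch, assuming ''gam_data3'' and ''prop_model'' from above:
+<file rsplus| Linear vs smooth fit (sketch)>
+prop_linear <- gam(prop ~ x1, data=gam_data3, weights=total, family="binomial")
+anova(prop_linear, prop_model, test="Chisq")  # likelihood-ratio style comparison
+AIC(prop_linear, prop_model)                  # lower AIC = better fit
+</file>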
  
-{{:​pca_sigaxes_sp.png?​500|}}+=== Visualizing the trend over time ===
  
-From this barplotyou can see that once you reach PC6, the proportion of variance explained falls below the average proportion explained by the other components. If you take another look at the PCA summary, you will notice that by the time you reach PC5, the cumulative proportion of variance explained by the principal components is 85%.+Lastlywe will see the different ways this relationship could be represented graphically.
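-Since the same Kaiser-Guttman barplot is reused several times in this workshop, one option (a sketch, not part of the original script) is to wrap it in a small helper function:
-<code rsplus | Kaiser-Guttman barplot as a helper function (sketch)>
-kg.plot <- function(ev, main="Eigenvalues") {
-  # ev is a vector of eigenvalues, e.g. spe.h.pca$CA$eig
-  barplot(ev, main=main, col="grey", las=2)
-  abline(h=mean(ev), col="red")
-  legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
-}
-kg.plot(spe.h.pca$CA$eig)
-</code>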
  
-A PCA is not just for species data. It can also be run and interpreted in the same way using standardized environmental variables:​ +<file rsplus| ​Plotting the trend
-<code rsplus | PCA on environmental variables+par(mfrow=c(1,2)
-#Run the PCA +plot(prop_model, select=1, scale=0, shade=TRUE
-env.pca <- rda(env.z# or rda(env, scale=TRUE)+abline(h=0)
  
-#Extract the results +out <- plot_smooth(prop_model,​ view="​x1",​main=""​) 
-summary(env.pca+(diff <- find_difference(out$fv$fit,​ out$fv$CI, xVals=out$fv$x1)
-summary(env.pcascaling=2)  +addInterval(0lowVals=diff$start,​ highVals = diff$end, col='​red',​ lwd=2
-</code>+abline(v=c(diff$start,​ diff$end), lty=3, col='​red'​) 
 +text(mean(c(diff$start,​ diff$end)), 2.1, "sign. more \n success",​ col='​red',​ font=3
 +</file>
  
-Scaling refers to which set of scores is scaled by the eigenvalues. Scaling = 2 means that the species scores are scaled by the eigenvalues, whereas scaling = 1 means that the site scores are scaled by the eigenvalues. Scaling = 3 means that both species and site scores are scaled symmetrically by the square root of the eigenvalues. Using scaling = 1 means that the Euclidean distances among objects (i.e. the rows in your data) are preserved, whereas scaling = 2 means that the correlations amongst descriptors (i.e. the columns in your data) are preserved. This means that when you look at a biplot of a PCA that has been run with scaling=2, the angle between descriptors represents their correlation.  +{{::graphic2_otherdist.jpg?800|}}
  
-<code rsplus | Significant axes> +What do these plots tell us about successes versus failures? 
-# Identify ​the significant axis using the Kaiser-Guttman criterion +  * Left plot: contribution/​ partial effect (if we had more than one explanatory variable). Over time the log odds increases, so over time successes increase and failures decrease.  
-ev <- env.pca$CA$eig +  * Right plot: fitted valuesintercept included ​(summed effect if we had more than one explanatory variable included in the model). Here we see that the log odds ratio is estimated around zero at the start of the experiment; this means there are equal amounts of successes and failures. Gradually successes increaseand at around x1=400 there are significantly more successes than failures ​(the smooth is significantly different from zero). We also illustrated how we could use the plot to determine at which x1 value this occurs.
-ev[ev>​mean(ev)] +
-n <- length(ev) +
-barplot(evmain="​Eigenvalues",​ col="​grey",​ las=2) +
-abline(h=mean(ev), col="​red"​)  +
-legend("​topright",​ "​Average eigenvalue",​ lwd=1, col=2, bty="​n"​) +
-</​code>​ +
-Compare ​this eigenvalue barplot with the one you created for the species PCA+
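-As a quick way to see the difference between the two scalings, vegan's ''biplot()'' can draw both side by side (a sketch, assuming the ''env.pca'' object created above):
-<code rsplus | Comparing scaling 1 and scaling 2 (sketch)>
-par(mfrow=c(1,2))
-biplot(env.pca, scaling=1, main="Scaling 1: distances among sites")
-biplot(env.pca, scaling=2, main="Scaling 2: correlations among descriptors")
-</code>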
  
-As you saw in the explanation of the summary outputa lot of information can be extracted from a PCA before even plotting it. However, many of us interpret ​data best visually and a figure of a PCA  is often the best way to convey major patterns. PCA biplots can be plotted in base R. A PCA biplot includes ​the x-axis as the first Principal Component and the y-axis as the second Principal Component. ​+Lastlyto help interpret the results, we could transform the summed effects back to proportions with the function ''​plot_smooth''​ from the //itsadug// package:
  
-For some plotting exercises, let's start with the //species PCA//. A basic biplot without any customization could be plotted like this, where the site positions are shown by black numbers and species’ positions are shown by red species codes. Remember, species positions come from plotting species along PCs and the site positions are derived from the weighted sums of species scores.  +<file rsplus| ​Transformed plot
-<code rsplus | Plot species PCA+par(mfrow=c(1,​1)) 
-plot(spe.h.pca+plot_smooth(prop_model,​ view="​x1",​ main="",​ 
-</code+            transform=plogis,​ ylim=c(0,​1)) 
 +abline(h=.5, v=diff$start,​ col='​red',​ lty=2
 +</file>
  
-{{:spe_PCA1.png?800|}}+{{::​graphic3_otherdist.jpg?450|}}
  
-Basically, ​the plot above plots the PCA like this:  +As we already derived from the logit plot, we see that at around x1=400 the proportion of successes increases significantly above 0.5. 
-<code rsplus | Plot species PCA line by line> +===== 6Quick intro to GAMMs =====
-plot(spe.h.pcatype="​n"​) #produces a blank biplot with nothing displayed but the axes +
-points(spe.h.pca, dis="sp", col="blue") #points are added for the species (columns) (dis="sp")
-#use text() instead of points() if you want the labels +
-points(spe.h.pca, dis="​sites",​ col="​red"​) #points are added for the sites (rows) +
-</​code>​ +
-  +
-{{:​spe_PCA2.png?​800|}}+
  
-To create more detailed plots and to play with aesthetics, try this code:  +==== Dealing with non-independence ​====
-<code rsplus | More aesthetically pleasing PCA plot> +
-#Biplot of the PCA on transformed species data (scaling 1) +
-windows() +
-plot(spe.h.pca) +
-windows() +
-biplot(spe.h.pca) +
-windows() +
-plot(spe.h.pca,​ scaling=1, type="​none",​ # scaling 1 distance biplot :  +
-                                        # distances among objects in the biplot approximate their Euclidean distances +
-                                        # but angles among descriptor vectors DO NOT reflect their correlation +
-     ​xlab ​c("PC1 (%)", round((spe.h.pca$CA$eig[1]/​sum(spe.h.pca$CA$eig))*100,​2)),​ #this comes from the summary +
-     ylab = c("PC2 (%)", round((spe.h.pca$CA$eig[2]/​sum(spe.h.pca$CA$eig))*100,​2))) +
-points(scores(spe.h.pca,​ display="​sites",​ choices=c(1,​2),​ scaling=1),​ +
-       ​pch=21,​ col="​black",​ bg="​steelblue",​ cex=1.2) +
-text(scores(spe.h.pca,​ display="​species",​ choices=c(1),​ scaling=1),​ +
-     ​scores(spe.h.pca,​ display="​species",​ choices=c(2),​ scaling=1),​ +
-     ​labels=rownames(scores(spe.h.pca,​ display="​species",​ scaling=1)), +
-     col="​red",​ cex=0.8)   +
-</​code>​+
  
-This code produces 3 plots, the final plot is the most visually pleasing: ​ +When observations are not independentGAMs can be used to either incorporate:​ 
-{{:spe_PCA3.png?800|}}+  * a serial correlation structure to model residual autocorrelation (autoregressive AR, moving average MA, or a combination of the two ARMA),  
+  * random effects that account for non-independence among observations from the same site.
+That is, in addition to changing the basis as with the nottem example above, we can also add complexity to the model by incorporating an autocorrelation structure or mixed effects, using the ''gamm()'' function in //mgcv// or the //gamm4// package.
 +To start, let's have a look at the first case; a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.
  
-What conclusions can you draw from this plot? From the plot, you can see the site scores shown by the blue dots. If you superimposed site labels on top, you could see which sites are closer to each other in terms of the species found at those sites, but even without the specific labels, you can see that there are only a few sites that are farther away from the majority. The species are shown by their names in red, and you can see, for example, that the species "ABL" is either absent from, or much less prevalent in, the majority of sites compared with the species closer to the centre of the ordination.  +<file rsplus| Correlated errors>
 +par(mfrow=c(1,2)) 
 +acf(resid(year_gam),​ lag.max = 36main = "​ACF"​) 
 +pacf(resid(year_gam)lag.max = 36main = "pACF"
 +</​file>​
  
-Biplots need to be interpreted in terms of angles from the center of the plot, not just in terms of proximity. Two variables at a 90° angle are uncorrelated. Two variables pointing in opposing directions are negatively correlated. Two variables very close together are strongly correlated, at least in the space described by these axes.  +{{::graphic5.1.jpg?700|}}
  
-Now let's look at a plot of the //environmental PCA//:  +The autocorrelation function plot (ACF; first panel above) provides the cross-correlation of a time series with itself at different points in time (i.e. similarity between observations at increasingly large time lags). In contrast, the partial autocorrelation function (PACF) gives the partial correlation of a time series with its own lagged values, after controlling for the values of the time series at all shorter lags. The ACF and pACF plots are thus used to identify after how many time steps observations start to be independent from one another. The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed. We can test this by adding AR structures to the Nottingham temperature model, one with an AR(1) (correlation at a lag of one time step) and one with an AR(2) (correlation at lags of one and two time steps), and test for the best-fit model using ANOVA.
-<code rsplus | Biplot ​of environment PCA> +
-#Biplot of the PCA on the environmental variables ​(scaling 2) +
-windows() +
-plot(env.pca) +
-windows() +
-plot(env.pcascaling=2, type="​none",​ # scaling 2 = correlation biplot :  +
-                                      # distances among abjects in the biplot DO NOT approximate their Euclidean distances +
-                                      # but angles among descriptor vectors reflect their correlation +
-     xlab = c("PC1 (%)", round((env.pca$CA$eig[1]/​sum(env.pca$CA$eig))*100,2)), +
-     ylab = c("PC2 (%)", round((env.pca$CA$eig[2]/​sum(env.pca$CA$eig))*100,2)), +
-     xlim = c(-1,1), ylim=c(-1,​1)) +
-points(scores(env.pcadisplay="​sites",​ choices=c(1,2), scaling=2),  +
-       ​pch=21,​ col="​black",​ bg="​darkgreen",​ cex=1.2)  +
-text(scores(env.pca,​ display="​species",​ choices=c(1), scaling=2),​ +
-     ​scores(env.pca,​ display="​species",​ choices=c(2), scaling=2),​ +
-     ​labels=rownames(scores(env.pca,​ display="​species",​ scaling=2)), +
-     ​col="​red",​ cex=0.8) +
-</​code> ​+
  
-{{:​env_PCA1.png?​800|}}+<file rsplusAR models>​ 
 +year_gam <- gamm(nottem~s(nottem_year)+s(nottem_month,​ bs="​cc"​)) 
 +year_gam_AR1 <- gamm(nottem~s(nottem_year)+s(nottem_month,​ bs="​cc"​),​ 
 +                     ​correlation = corARMA(form = ~ 1|nottem_year,​ p = 1)) 
 +year_gam_AR2 <- gamm(nottem~s(nottem_year)+s(nottem_month,​ bs="​cc"​),​ 
 +                     ​correlation = corARMA(form = ~ 1|nottem_year,​ p = 2)) 
 +anova(year_gam$lme,​year_gam_AR1$lme,​year_gam_AR2$lme) 
 +</​file>​
  
-Remember that PCA biplot is really just a scatter plot, where the axes are scores extracted from composite variablesThis means that there are many different ways you can plot a PCAFor exampletry using your ggplot() skills from Workshop 4 to extract PCA scores and plot an ordination ​in ggplot+                   Model df      AIC      BIC    logLik ​  ​Test ​  ​L.Ratio p-value 
 +  year_gam$lme ​        ​1 ​ 5 1109.908 1127.311 -549.9538 ​                         
 +  year_gam_AR1$lme ​    ​2 ​ 6 1101.218 1122.102 -544.6092 1 vs 2 10.689206 ​ 0.0011 
 +  year_gam_AR2$lme ​    ​3 ​ 7 1101.598 1125.962 -543.7988 2 vs 3  1.620821 ​ 0.2030 
 +The AR(1) provides ​significant increase in fit over the naive model (LRT = 10.69, p = 0.0011)but there is very little improvement in moving to the AR(2(LRT = 1.62, p = 0.203). So it is best to include only the AR(1) structure ​in our model. 
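-As a minimal sketch of the ggplot() idea mentioned above (assuming the //ggplot2// package is installed and the ''spe.h.pca'' object from earlier):
-<code rsplus | PCA site scores in ggplot2 (sketch)>
-library(ggplot2)
-# extract the site scores as a data frame and plot them with ggplot()
-site.sc <- as.data.frame(scores(spe.h.pca, display="sites", choices=c(1,2)))
-ggplot(site.sc, aes(x=PC1, y=PC2)) +
-  geom_point(colour="steelblue") +
-  geom_hline(yintercept=0, linetype="dashed") +
-  geom_vline(xintercept=0, linetype="dashed") +
-  labs(x="PC1", y="PC2")
-</code>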
 +==== Mixed modelling ====
  
+As we saw in the previous section, ''bs'' specifies the type of underlying basis function. For random intercepts and linear random slopes we use bs="re", but for random smooths we use bs="fs".
  
-**Use of PCA axes as composite explanatory variables:** +Three different types of random effects are distinguished when using GAMMs (where //fac// is a categorical variable coding for the random effect and //x0// is a continuous fixed effect):
 +  ​//random intercepts//​ adjust the height of other model terms with a constant value: s(fac, bs="​re"​) 
 +  ​//random slopes// adjust the slope of the trend of a numeric predictors(fac, x0, bs="​re"​) 
 +  ​//random smooths// adjust the trend of a numeric predictor in a nonlinear way: s(x0, factor, bs="​fs",​ m=1), where the argument m=1 sets a heavier penalty for the smooth moving away from 0, causing shrinkage to the mean.
  
-In some cases, users want to reduce several environmental variables to a smaller number of composite variables. When PCA axes represent ecological gradients (i.e. when environmental variables are correlated with the PCA axes in a meaningful way), users can use the site scores along a PCA axis in further analyses instead of the raw environmental data. In other words, site scores along a PCA axis represent linear combinations of descriptors and can consequently be used as a proxy of ecological conditions in “post-ordination” analyses.  +We will first examine a GAMM with only a random intercept. As before, we will use the ''gamSim()'' function to automatically generate a dataset, this time with a random effect component generated, then run a model with a random intercept using //fac// as the random factor.
-In the example above, the first PCA axis can be related to a gradient from oligotrophic, oxygen-rich to eutrophic, oxygen-deprived water: from left to right, a first group of sites displays the highest values of altitude (alt) and slope (pen), and the lowest values of river discharge (deb) and distance from the source (das). The second group of sites has the highest values of oxygen content (oxy) and the lowest nitrate concentration (nit). A third group shows intermediate values for almost all the measured variables.
-If users are then interested in identifying whether a specific species is associated with oligotrophic or eutrophic water, one can correlate the species abundances with the site scores along PCA axis 1. For example, if one wants to assess whether the species TRU prefers oligotrophic or eutrophic water, the following linear model can be used:
  
-<code rsplus | Response of TRU species to the first ordination gradient+<file rsplus| ​random intercept
-Sites_scores_Env_Axis1<- scores(env.pca, display="​sites",​ choices=c(1),​ scaling=2+# generate and view data 
-spe$ANG +gam_data2 ​<- gamSim(eg=6
-plotSites_scores_Env_Axis1,​ spe$TRU) +head(gam_data2)
-summary(lm(spe$TRU~Sites_scores_Env_Axis1)) +
-abline(lm(spe$TRU~Sites_scores_Env_Axis1)) +
-</​code>​+
  
-This simple model shows that the abundance of the species TRU is significantly dependent on the site scores along PCA axis 1 (t = -5.30, p = 1.35e-05, adj-R2 = 49.22%), i.e. it depends on an oligotrophic-eutrophic gradient. The following plot identifies a preference of this species for oligotrophic water.  +# run random intercept model
 +gamm_intercept <- gam(y ~ s(x0) + s(facbs="​re"​)data=gam_data2)
  
 +# examine model output
 +summary(gamm_intercept)$s.table
 +</​file>​
  
-**Challenge 3** +Note that there is now smoother term for the random intercept in the summary tableYou can plot and view the random intercepts for each level of //fac// as follows:
-Run PCA of the “mite” species abundance dataWhat are the significant axes of variation? Which groups of sites can you identify? Which species are related to each group of sites? Use +
-<code rsplus | Mite species data> +
-mite.spe <- mite #mite data is from the vegan package +
-</​code> ​+
  
-**Challenge 3 - Solution** +<file rsplus| ​plot of random intercepts
-<​hidden>​ +plot(gamm_interceptselect=2)  
-Your code likely looks something like the following. +select=2 because the random effect appears as the second entry in the summary table. 
-<code rsplus | Mite species PCA+</file>
-#Hellinger transformation of mite data and PCA +
-mite.spe.hel <- decostand(mite.spemethod="​hellinger"​) +
-mite.spe.h.pca <- rda(mite.spe.hel) +
-  +
-#What are the significant axes?  ​ +
-ev <- mite.spe.h.pca$CA$eig  +
-ev[ev>mean(ev)] +
-n <- length(ev) +
-barplot(ev, main="​Eigenvalues",​ col="​grey",​ las=2) +
-abline(h=mean(ev),​ col="​red"​)  +
-legend("​topright",​ "​Average eigenvalue",​ lwd=1, col=2, bty="​n"​) ​+
  
-#Output summary/results     +We can also use the ''plot_smooth'' function to visualize the model. In contrast to the default ''plot.gam'', this function plots a smooth of the summed effects of a GAM (based on predictions), as we saw earlier, and can optionally remove the random effects. Here, we will plot the summed effect of x0 without the random effects, and then plot the predictions for all four levels of the random //fac// effect:
-summary(mite.spe.h.pcadisplay=NULL)  +
-windows()+
  
-#Plot the biplot +<file rsplus | plot factor levels>​ 
-plot(mite.spe.h.pcascaling=1, type="none", ​ +par(mfrow=c(1,2), cex=1.1) 
-     xlab=c("PC1 (%)", ​round((mite.spe.h.pca$CA$eig[1]/​sum(mite.spe.h.pca$CA$eig))*100,​2)), +plot_smooth(gamm_interceptview="x0", ​rm.ranef=TRUE,​ main="intercept + s(x1)", ​rug=FALSE) 
-     ylab=c("PC2 (%)", round((mite.spe.h.pca$CA$eig[2]/​sum(mite.spe.h.pca$CA$eig))*100,2)))  +plot_smooth(gamm_intercept,​ view="​x0",​ cond=list(fac="1"),  
-points(scores(mite.spe.h.pca,​ display="​sites"​choices=c(1,2), scaling=1), +            main="​... ​+ s(fac)"col='​orange'​ylim=c(8,21), rug=FALSE
-       pch=21col="black", ​bg="steelblue", ​cex=1.2+plot_smooth(gamm_interceptview="x0", ​cond=list(fac="2")add=TRUE, col='​red'​
-text(scores(mite.spe.h.pcadisplay="species", ​choices=c(1), scaling=1)+plot_smooth(gamm_interceptview="x0", ​cond=list(fac="​3"​), add=TRUEcol='​purple'​) 
-     scores(mite.spe.h.pcadisplay="species", ​choices=c(2), scaling=1),​ +plot_smooth(gamm_interceptview="x0", ​cond=list(fac="4")add=TRUE, col='​turquoise'​
-     ​labels=rownames(scores(mite.spe.h.pca,​ display="species", ​scaling=1)), +</file>
-     col="​red",​ cex=0.8  +
-</code+
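+If you want the estimated intercept adjustments themselves (rather than a plot), they are stored among the model coefficients; a sketch, assuming ''gamm_intercept'' from above:
+<file rsplus| Extracting the random intercept estimates (sketch)>
+# coefficients of the bs="re" smooth are the per-level intercept adjustments
+coef(gamm_intercept)[grep("s\\(fac\\)", names(coef(gamm_intercept)))]
+</file>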
  
-And your resulting biplot likely looks something like this+{{::​graphic5.3.jpg?​650|}}
  
-{{:mite_pca.png?​800|}}+Next we will run and plot a model with random slopes:
  
-A dense cluster of related species and sites can be seen in the biplot.  +<file rsplus | random slope> 
-</hidden>+gamm_slope <- gam(y ~ s(x0) + s(x0, fac, bs="​re"​),​ data=gam_data2) 
 +summary(gamm_slope)$s.table
  
 +plot_smooth(gamm_slope,​ view="​x0",​ rm.ranef=TRUE,​ main="​intercept + s(x0)",​ rug=FALSE)
 +plot_smooth(gamm_slope,​ view="​x0",​ cond=list(fac="​1"​), ​
 +            main="​... + s(fac)",​ col='​orange',​ylim=c(7,​22),​ rug=FALSE)
 +plot_smooth(gamm_slope,​ view="​x0",​ cond=list(fac="​2"​),​ add=TRUE, col='​red'​)
 +plot_smooth(gamm_slope,​ view="​x0",​ cond=list(fac="​3"​),​ add=TRUE, col='​purple'​)
 +plot_smooth(gamm_slope,​ view="​x0",​ cond=list(fac="​4"​),​ add=TRUE, col='​turquoise'​)
 +</​file> ​      
  
-=====3.2Correspondence Analysis:​=====+{{::​graphic5.4.jpg?​650|}} ​   ​
  
-One of the key assumptions made in PCA is that species are related to each other linearly and that they respond linearly to ecological gradients. This is not necessarily the case with a lot of ecological data (e.g. many species have unimodal species distributions). Using PCA on data with unimodal species distributions or a lot of zero values may lead to a phenomenon called the “horseshoe effect”, and can occur with long ecological gradients. As such, a CA or Correspondence Analysis may be a better option for this type of data, see Legendre and Legendre (2012) for further information. As CA preserves Chi2 distances (while PCA preserves Euclidean distances), this technique is indeed better suited to ordinate datasets containing unimodal species distributions,​ and has, for a long time, been one of the favourite tools for the analysis of species presence–absence or abundance data. In CA, the raw data are first transformed into a matrix Q of cell-by-cell contributions to the Pearson Chi2 statistic, and the resulting table is submitted to a singular value decomposition to compute its eigenvalues and eigenvectors. The result is an ordination, where it is the Chi2 distance that is preserved among sites instead of the Euclidean distance in PCA. The Chi2 distance is not influenced by the double zeros. Therefore, CA is a method adapted to the analysis of species abundance data without pre-transformation. Contrary to PCA, CA can also be applied to analyze ​both quantitative and binary data (such as species abundances or absence/​presence).  +We will now include ​both a random intercept ​and slope term.
-As in PCA, the Kaiser-Guttman criterion can be applied to determine the significant axes of CA, and ordination axes can be extracted to be used in multiple regressions.
  
-Run a CA on species data:  +<file rsplus| ​random intercept and slope
-<code rsplus | CA using cca() in vegan> +gamm_int_slope <- gam(y ~ s(x0) + s(fac, bs="re")
-#Run the CA using the cca() function (NB: cca() is used for both CA and CCA) +                      + s(fac, x0bs="re"​), ​data=gam_data2
-spe.ca ​<- cca(spe) +summary(gamm_int_slope)$s.table
-  +
-# Identify the significant axes +
-ev<​-spe.ca$CA$eig +
-ev[ev>​mean(ev)+
-n=length(ev) +
-barplot(evmain="Eigenvalues", col="​grey",​ las=2+
-abline(h=mean(ev)col="red") +
-legend("​topright"​"​Average eigenvalue",​ lwd=1, col=2, bty="​n"​+
-</​code> ​+
  
-{{ :​ca_guttmankaiser.png |}}+plot_smooth(gamm_int_slope,​ view="​x0",​ rm.ranef=TRUE, main="​intercept + s(x0)",​ rug=FALSE) 
 +plot_smooth(gamm_int_slope,​ view="​x0",​ cond=list(fac="​1"​),​  
 +            main="​... + s(fac) + s(fac, x0)", col='​orange',​ ylim=c(7,​22),​ rug=FALSE) 
 +plot_smooth(gamm_int_slope,​ view="​x0",​ cond=list(fac="​2"​),​ add=TRUE, col='​red',​ xpd=TRUE) 
 +plot_smooth(gamm_int_slope,​ view="​x0",​ cond=list(fac="​3"​),​ add=TRUE, col='​purple',​ xpd=TRUE) 
 +plot_smooth(gamm_int_slope,​ view="​x0",​ cond=list(fac="​4"​),​ add=TRUE, col='​turquoise',​ xpd=TRUE) 
 +</​file>​
  
 +{{::​graphic5.5.jpg?​650|}} ​
  
-From this barplot, you can see that once you reach CA6, the proportion of variance explained falls below the average proportion explained by the other axes. If you take another look at the CA summary, you will notice that by the time you reach CA5, the cumulative proportion of variance explained by the CA axes is 84.63%.  +Note that the random slope is static in this case:
  
-<code rsplus | Extract the CA results+<file rsplus| ​plot of random slopes
-summary(spe.ca) #overall results +plot(gamm_int_slope, select=3) 
-summary(spe.ca, display=NULL) # only axis eigenvalues and contribution +# select=3 because the random slope appears as the third entry in your summary table.
-</code+</file>
  
-{{ :ca_summary.png |}}+Lastly, we will examine a model with a random smooth.
  
-CA results are presented in a similar manner as PCA results. You can see here that CA1 explains 51.50% of the variation in species abundances, while CA2 explains 12.37% of the variation.  +<file rsplus| random smooth>
 +gamm_smooth <- gam(y ~ s(x0fac, bs="​fs",​ m=1), data=gam_data2) 
 +summary(gamm_smooth)$s.table 
 +</​file>​
  
-<code rsplus | Construct ​the CA biplots>​ +Here, if the random slope varied along x0we would see different curves for each level:
-par(mfrow=c(1,​2)) +
-#### scaling 1 +
-plot(spe.ca,​ scaling=1, type="​none",​ main='​CA - biplot scaling 1', xlab=c("​CA1 (%)", round((spe.ca$CA$eig[1]/​sum(spe.ca$CA$eig))*100,​2)),​ +
-ylab=c("​CA2 (%)", round((spe.ca$CA$eig[2]/​sum(spe.ca$CA$eig))*100,2)))+
  
-points(scores(spe.cadisplay="​sites",​ choices=c(1,2), scaling=1), pch=21, col="​black",​ bg="​steelblue",​ cex=1.2)+<file rsplus| plot of random slopes>​ 
 +plot(gamm_smoothselect=1)  
 +# select=1 because the smooth slope appears as the first entry in your summary table. 
 +</​file>​
  
-text(scores(spe.ca, display="​species",​ choices=c(1), scaling=1), +All of the above models can in turn be compared using ''​anova()''​ as in the previous sections to determine the best fit model. 
-     ​scores(spe.ca,​ display="​species",​ choices=c(2), scaling=1), +===== 7. GAMs behind the scenes ​======
-     ​labels=rownames(scores(spe.ca,​ display="​species",​ scaling=1)),col="​red",​ cex=0.8)+
  
-#### scaling 2 +We will now take a few minutes to look at what GAMs are doing behind the scenes. Let's first consider a model containing one smooth function of one covariate, x<sub>i</sub>:
-plot(spe.cascaling=1, type="​none",​ main='​CA - biplot scaling 2', xlab=c("​CA1 (%)", round((spe.ca$CA$eig[1]/sum(spe.ca$CA$eig))*100,​2)),​ +
-     ​ylab=c("​CA2 (%)", round((spe.ca$CA$eig[2]/​sum(spe.ca$CA$eig))*100,​2)),​ ylim=c(-2,​3))+
  
-points(scores(spe.ca,​ display="​sites",​ choices=c(1,2), scaling=2), pch=21, col="​black",​ bg="​steelblue",​ cex=1.2) +<​m>​{y_i} ​f(x_{i}+ {ε_i}</m>
-text(scores(spe.ca,​ display="​species",​ choices=c(1),​ scaling=2),​ +
-     ​scores(spe.ca,​ display="​species",​ choices=c(2),​ scaling=2),​ +
-     ​labels=rownames(scores(spe.ca,​ display="​species",​ scaling=2)),​col="​red",​ cex=0.8) +
-</code>+
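+Following up on the ''anova()'' suggestion above, here is a quick sketch of one such comparison (AIC works as well, since all four models were fitted with ''gam()''):
+<file rsplus| Comparing the random effect structures (sketch)>
+AIC(gamm_intercept, gamm_slope, gamm_int_slope, gamm_smooth)
+# lower AIC indicates the random effect structure with the best fit
+</file>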
  
-{{ :ca_biplot.png |}} +To estimate the smooth function //f//, we need to represent the above equation in such a way that it becomes a linear model. This can be done by choosing a basis, b<sub>i</sub>(x), defining the space of functions of which //f// is an element:
  
-These biplots show a group of sites, located in the left part of the plot, with similar fish communities characterized by numerous species such as GAR, TAN, PER, ROT, PSO and CAR; in the upper right corner, another site cluster characterized by the species LOC, VAI and TRU is identified; the last site cluster, in the lower right corner of the biplot, is characterized by the species BLA, CHA and OMB.  +<m>f(x) = sum{i=1}{q}{b_{i}(x)β_{i}}</m>
  
-**Challenge 4** +**A simple example: a polynomial basis**
-Run a CA of the “mite” species abundance data. What are the significant axes of variation? Which groups of sites can you identify? Which species are related to each group of sites? ​+
  
-**Challenge ​- Solution** +Suppose that //f// is believed to be a 4<sup>th</suporder polynomial, so that the space of polynomials of order 4 and below contains //f//A basis for this space would then be:
-<hidden> +
-Your code likely looks something like the following:  +
-<code rsplus | CA with mite species> +
-# CA on mite species  +
-mite.spe.ca<​-cca(mite.spe)+
  
-#What are the significant axes ?   +<m>b_{1}(x)=1,   b_{2}(x)=x  b_{3}(x)=x^{2}  b_{4}(x)=x^{3}, ​  b_{5}(x)=x^{4}</​m>​
-ev<-mite.spe.ca$CA$eig +
-ev[ev>mean(ev)+
-n=length(ev) +
-barplot(ev, main="​Eigenvalues"​col="​grey"​las=2) +
-abline(h=mean(ev), col="​red"​) +
-legend("​topright",​ "​Average eigenvalue",​ lwd=1, col=2, bty="​n"​)+
  
-#Output summary/​results ​  +so that f(xbecomes:
-summary(mite.spe.ca,​ display=NULL)+
  
-#Plot the biplot +<m>f(x) = β_{1} + x_{i}β_{2} + x^{2}_{i}β_{3} + x^{3}_{i}β_{4} + x^{4}_{i}β_{5}</m>
-windows() +
-plot(mite.spe.ca,​ scaling=1, type="​none",​ +
-     ​xlab=c("​PC1 (%)", round((mite.spe.ca$CA$eig[1]/​sum(mite.spe.ca$CA$eig))*100,​2)), +
-     ​ylab=c("​PC2 (%)", round((mite.spe.ca$CA$eig[2]/sum(mite.spe.ca$CA$eig))*100,​2))) +
-points(scores(mite.spe.ca,​ display="​sites",​ choices=c(1,​2),​ scaling=1),​ +
-       ​pch=21,​ col="​black",​ bg="​steelblue",​ cex=1.2) +
-text(scores(mite.spe.ca,​ display="​species",​ choices=c(1),​ scaling=1),​ +
-     ​scores(mite.spe.ca,​ display="​species",​ choices=c(2),​ scaling=1),​ +
-     ​labels=rownames(scores(mite.spe.ca,​ display="​species",​ scaling=1)),​ +
-     ​col="​red",​ cex=0.8) +
-</code>+
  
-And your resulting biplot likely looks something like this+and the full model now becomes:
  
-{{ :ca_mite_biplot.png |}} +<m>y_{i} = β_{1} + x_{i}β_{2} + x^{2}_{i}β_{3} + x^{3}_{i}β_{4} + x^{4}_{i}β_{5} + ε_{i}</m>
  
-</hidden>+The basis functions are each multiplied by a real valued parameter, β<​sub>​i</sub>, and are then summed to give the **final curve f(x)**.
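+As a toy sketch of this idea (with simulated data, not one of the workshop datasets), the polynomial basis functions simply become columns of an ordinary linear model:
+<file rsplus| A polynomial basis is just a linear model (sketch)>
+set.seed(1)
+x <- runif(100)
+y <- sin(2*pi*x) + rnorm(100, sd=0.3)
+poly_fit <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
+coef(poly_fit)  # the beta_i that multiply each basis function b_i(x)
+</file>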
  
-=====3.3. Principal Coordinates Analysis:=====+{{::​polynomial_basis_example.png?​600|}}
  
-PCA and CA each impose the distance preserved among objects: the Euclidean distance (and several others with pre-transformations) for PCA, and the Chi2 distance for CA. If one wishes to ordinate objects on the basis of another distance measure, more appropriate to the problem at hand, then PCoA is the method of choice. In PCA we rotated and plotted our data so that as much variation as possible was explained by the first Principal Component, and we can see how much the "species", either actual species or environmental variables, contribute to each component by looking at the scores (also called loadings). Another type of unconstrained ordination is called Principal Coordinate Analysis (PCoA). In PCoA, points are added to plane space one at a time using Euclidean distance (//or whatever distance (dissimilarity) metric you choose//). Basically, one point is added, then a second so that its distance from the first point is correct, then the third point, and so on, adding as many axes (dimensions) as necessary along the way. Choosing between PCA and PCoA can be tricky, but generally PCA is used to summarize multivariate data into as few dimensions as possible, whereas PCoA can be used to visualize distances between points. PCoA can be particularly suited for datasets that have more columns than rows. For example, if hundreds of species have been observed over a set of quadrats, then an approach based on a PCoA using Bray-Curtis similarity (see below) may be best suited.  +By varying the β<sub>i</sub> we can vary the form of f(x) to produce any polynomial function of order 4 or lower.
  
-Run a PCoA on the Hellinger transformed species abundances (back to DoubsSpe): +**Another examplea cubic spline basis**
-<code rsplus | cmdscale for PCoA> +
-#Using cmdscale() +
-?cmdscale +
-cmdscale(dist(spe.hel),​ k=(nrow(spe)-1),​ eig=TRUE)+
  
-#Using pcoa() +A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in valueEach section of cubic has different coefficients
-?pcoa +
-spe.h.pcoa <- pcoa(dist(spe.hel))+
  
-# Extract the results +{{::​cubic_spline.png?450|}}
-spe.h.pcoa ​+
  
-#Construct the biplot +Here's a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:
-biplot.pcoa(spe.h.pcoa, spe.hel, dir.axis2=-1)
-</​code>​+
  
-The output looks like this (and [[https://www.youtube.com/watch?v=lRdX1qhI7Dw|this video]] might help with the explanation of eigenvalues in terms of ordination):  +{{::graphic6.1.jpg?300|}}
  
-{{:​pcoaoutput_1.png?800|}}+In this example, the knots are evenly spaced through the range of observed x values. 
+However, the choice of the degree of model smoothness is controlled by the number of knots, which was arbitrary. Is there a better way to select the knot locations?
  
-{{:​pcoaoutput_2.png?​800|}}+**Controlling the degree of smoothing with penalized regression splines**
  
-And the PCoA biplot, like this:  +Instead of controlling smoothness by altering the number of knots, we keep the number of knots fixed at a size a little larger than reasonably necessary, and control the model's smoothness by adding a "wiggliness" penalty.
  
-{{:pcoa_spe.png?​500|}}+So, rather than fitting the model by minimizing (as with least squares regression):
  
-You can also run a PCoA using a different distance measure (e.g. Bray-Curtis). Here is a PCoA run on a Bray-Curtis dissimilarity matrix: ​+<m> ||y XB||^{2} </m>
  
-<code rsplus | PCoA with Bray-Curtis>​ +it can be fit by minimizing:
-spe.bray.pcoa <- pcoa(spe.db) #where spe.db is the species dissimilarity matrix using Bray-Curtis.  +
-spe.bray.pcoa +
-biplot.pcoa(spe.bray.pcoa,​ spe.hel, dir.axis2=-1) +
-#Note that the distance measure chosen strongly influences the results.  +
-</​code>​+
  
-**Challenge 5** +<m> ||y XB||^{2} + {lambda}int{0}{1}{[f^{''​}(x)]^{2}dx} </m>
-Run a PCoA on the Hellinger-transformed ​//mite species// abundance data. What are the significant axes? Which groups of sites can you identify? Which species are related to each group of sites? How do the PCoA results compare with the PCA results? ​+
  
-**Challenge 5 - Solution** 
-<​hidden>​ 
-Your code likely looks something like this:  
-<code rsplus | PCoA with mite species> 
-mite.spe.h.pcoa <- pcoa(dist(mite.spe.hel)) 
-mite.spe.h.pcoa 
-windows() 
-biplot.pcoa(mite.spe.h.pcoa,​ mite.spe.hel,​ dir.axis2=-1) 
-</​code>​ 
  
-And your biplot like this+As <m> lambda </m> goes to <m> infty </m>, the model becomes linear. The selection of the best fit smoothing parameter, <m> lambda </m>, makes use of a cross-validation approach. If <m> lambda </m> is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed. Ideally, it would be good to choose <m> lambda </m> so that the //predicted f// is as close as possible to //f//. A suitable criterion might be to choose <m> lambda </m> to minimize:
  
-{{:​pcoa_mite.png?​500|}}+<m> M = 1/n sum{i=1}{n}{(hat{f_{i}} - f_{i})^{2}} </m>
  
-We see that species 16 and 31 are farther away from other species ​in terms of distance ​and therefore their distribution across ​the sites is highly dissimilar from the other species of mites (and each other). Site labels that practically overlap each other are good examples of sites with low dissimilarity (i.e. high similarity) to each other in terms of the species that are found at those sites.  +Since //f// is unknown, M is estimated using a generalized cross validation technique ​that leaves out each datum from the data in turn and considers ​the average ability of models fitted to the remaining data to predict the left out datum (for further details, see Wood 2006).
-</​hidden>​+
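+If you want to see the smoothing parameter that this procedure selected for a fitted model, it is stored in the model object; a sketch, assuming the models fitted earlier in this workshop:
+<file rsplus| Inspecting the selected smoothing parameter (sketch)>
+prop_model$sp     # lambda chosen for the logistic GAM
+year_gam$gam$sp   # for a gamm() fit, look inside the $gam component
+</file>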
  
 +**Illustration of the principle behind cross validation**
  
-=====3.4 Nonmetric MultiDimensional Scaling=====+{{::​illustration_of_smooth_sel.png?600|}}
  
-The unconstrained ordination methods presented above allow us to organize objects (e.g. sites) characterized by descriptors (e.g. species) in full-dimensional space. In other words, PCA, CA and PCoA compute a large number of ordination axes (proportional to the number of descriptors) representing the variation of descriptors among sites, and preserve the distances among objects (the Euclidean distances in PCA, the Chi2 distances in CA, and the type of distance defined by the user in PCoA). Users can then select the axes of interest (generally the first two, as they explain the largest part of the variation) to represent objects in an ordination plot. The resulting biplot thus represents the distances among objects well (e.g. the between-site similarity), but fails to represent the whole variation of the ordination space (as Axis 3, Axis 4, …, Axis n are not shown on the biplot, but still contribute to explaining the variation among objects).  +In the first panel, the curve fits many of the data poorly and does no better with the missing point. In the third panel, the curve fits the noise as well as the signal, and the extra variability induced causes it to predict the missing datum rather poorly. In the second panel, however, we see that the curve fits the underlying signal quite well, smoothing through the noise, and the missing datum is reasonably well predicted.
  
-In some cases, the priority is not to preserve the exact distances among sites, but rather to represent as accurately as possible the relationships among objects in a small number of axes (generally two or three) specified by the user. In such cases, nonmetric multidimensional scaling (NMDS) is the solution. If two axes are selected, the biplot produced from NMDS is the best 2D graphical representation of between-object similarity: dissimilar objects are far apart in the ordination space and similar objects are close to one another. Moreover, NMDS allows users to choose the distance measure applied to calculate the ordination.  +{{::gcv.png?600|}}
  
-To find the best representation of objects, NMDS applies an iterative procedure that tries to position the objects in the requested number of dimensions in such a way as to minimize a stress function (scaled from 0 to 1), which measures the goodness-of-fit of the distance adjustment in the reduced-space configuration. Consequently, the lower the stress value, the better the representation of objects in the ordination space. An additional way to assess the appropriateness of an NMDS is to construct a Shepard diagram, which plots the distances among objects in the ordination plot against the original distances. The R2 obtained from the regression between these two sets of distances measures the goodness-of-fit of the NMDS ordination.  +**Brief note on effective degrees of freedom (edf)**
  
-<code rsplus | NMDS on spe dataset using Bray-Curtis distance and k=2 ordination axes> +How many degrees of freedom does a fitted GAM have?
-# Run the NMDS +
-spe.nmds<​-metaMDS(spe,​ distance='​bray',​ k=2) +
-  +
-### Extract the results +
-spe.nmds+
  
-### Assess the goodness of fit and draw Shepard plot +Instead of providing the output of the cross-validation in terms of <m> lambda </m> (some tuning parameter modulating the fit's complexity), the GAM function in the //mgcv// package uses a term called the effective degrees of freedom (edf) as a consistent way to quantify model complexity (that is, to justify our intuition that the degrees of freedom tracks model complexity).
-spe.nmds$stress +Because the number of free parameters in GAMs is difficult to definethe edf are instead related to <m> lambda </m>, where the effect of the penalty is to reduce the effective degrees of freedom. So let's say the arbitrarily large number of knots is set to k=10, then k-1 sets the upper limit on the degrees of freedom associated with a smooth term. This number then decreases as the penalty <m> lambda </m> increases until the best fit penalty is found by cross-validation.
-stressplot(spe.nmdsmain='​Shepard plot')+
  
-# Construct the biplot 
-windows() 
-plot(spe.nmds,​ type="​none",​ main=paste('​NMDS/​Bray - Stress=',​ round(spe.nmds$stress,​ 3)), 
-     ​xlab=c("​NMDS1"​),​ 
-     ​ylab=c("​NMDS2"​)) 
-points(scores(spe.nmds,​ display="​sites",​ choices=c(1,​2)),​ 
-       ​pch=21,​ col="​black",​ bg="​steelblue",​ cex=1.2) 
-text(scores(spe.nmds,​ display="​species",​ choices=c(1)),​ 
-     ​scores(spe.nmds,​ display="​species",​ choices=c(2)),​ 
-     ​labels=rownames(scores(spe.nmds,​ display="​species"​)),​ 
-     ​col="​red",​ cex=0.8) 
-</​code>​ 
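+To connect this back to the code: ''k'' sets the (deliberately generous) basis size, and the penalty then shrinks the effective degrees of freedom below k-1. A sketch, assuming the ''nottem_year'' and ''nottem_month'' vectors created for the Nottingham example:
+<file rsplus| Basis size versus edf (sketch)>
+mod_k <- gam(nottem ~ s(nottem_year, k=10) + s(nottem_month, bs="cc"))
+summary(mod_k)$s.table   # compare the edf column with k-1
+gam.check(mod_k)         # includes a rough check that k is not set too low
+</file>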
  
-{{ :​shepard_plot.png |}}+===== References =====
  
-The Shepard plot identifies a strong correlation between observed dissimilarity and ordination distance (R2 > 0.95), highlighting a high goodness-of-fit of the NMDS.  +There's a great deal more out there on GAMs... here we have barely scratched the surface.
  
-{{ :nmds_biplot.png |}} +Simon Wood, the author of the mgcv package, has a very useful website with introductory talks and notes on how to use GAMs:
- +  http://people.bath.ac.uk/sw283/mgcv/
-The biplot of the NMDS shows a group of closely clustered sites characterized by the species BLA, TRU, VAI, LOC, CHA and OMB, while the other species form a cluster of sites in the upper right part of the graph. Four sites in the lower part of the graph are strongly different from the others.  +He's also written a book: Generalized Additive Models: An Introduction with R, which we used as a reference for this workshop.
- +
-**Challenge 6** +
-Run the NMDS of the mite species abundances in 2 dimensions based on Bray-Curtis distance. Assess the goodness-of-fit of the ordination ​and interpret the biplot. +
- +
-**Challenge 6 - Solution** +
-<​hidden>​ +
-Your code likely looks something like this +
-<code rsplus | NMDS with mite species>​ +
-mite.spe.nmds<​-metaMDS(mite.spe,​ distance='​bray',​ k=2) +
-### Extract the results +
-mite.spe.nmds +
- +
-### Assess the goodness of fit +
-mite.spe.nmds$stress +
-stressplot(mite.spe.nmds,​ main='​Shepard plot'​) +
- +
-### Construct the biplot +
-windows() +
-plot(mite.spe.nmds,​ type="​none",​ main=paste('​NMDS/Bray - Stress=',​ round(mite.spe.nmds$stress,​ 3)), +
-     ​xlab=c("​NMDS1"​),​ +
-     ​ylab=c("​NMDS2"​)) +
-points(scores(mite.spe.nmds, display="​sites",​ choices=c(1,​2)),​ +
-       ​pch=21,​ col="​black",​ bg="​steelblue",​ cex=1.2) +
-text(scores(mite.spe.nmds,​ display="​species",​ choices=c(1)),​ +
-     ​scores(mite.spe.nmds,​ display="​species",​ choices=c(2)),​ +
-     ​labels=rownames(scores(mite.spe.nmds,​ display="​species"​)),​ +
-     ​col="​red",​ cex=0.8) +
-</code> +
- +
-And your plots like this:  +
- +
-{{ :​nmds_mite_shepard.png |}} +
- +
-The correlation between observed dissimilarity and ordination distance is high (R2 > 0.91) and the stress value is relatively low, which together indicate a good accuracy of the NMDS ordination.
- +
-{{ :nmds_mite_biplot.png |}} +
- +
-No cluster of sites can be precisely defined from the NMDS biplot, showing that most of the species occurred in most of the sites, i.e. only a few sites shelter distinct communities.
-</​hidden>​ +
- +
-=====3.5. Summary===== +
-Ordination is a powerful statistical technique for studying the relationships of objects characterized by several descriptors (e.g. sites described by biotic communities, or environmental variables), but several ordination methods exist. These methods mainly differ in the type of distance preserved, the type of variables allowed and the dimensions of the ordination space. To better guide your choice of the ordination method to use, the following table identifies the main characteristics of the four ordination methods presented:
    
-{{ :summary_ordination.jpg |}} +Material from this workshop were also obtained from the following fantastic blogs and tutorials:​ 
- +  * http://www.fromthebottomoftheheap.net/​blog/​ 
-In the next workshop, you will see how to identify the relationships between biotic communities and sets of environmental variables describing the same sites, using canonical analyses+  * http://​www.sfs.uni-tuebingen.de/​~jvanrij/​Tutorial/​GAMM.html 
 +  * http://www.sfs.uni-tuebingen.de/~jvanrij/Tutorial/GAMM.html
 +  * http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/AnswersLab2.html
 +Finally, the help pages, available through ?gam in R, are an excellent resource.