These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

//The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.//

**IMPORTANT NOTICE: MAJOR UPDATES**

**March 2021 update:** This wiki has been discontinued and is no longer being actively developed or updated. Updated materials for the QCBS R Workshop Series are now housed on the QCBS R Workshop [[https://github.com/QCBSRworkshops/workshop09|GitHub page]].

Available materials include:
  - The [[https://qcbsrworkshops.github.io/workshop09/pres-en/workshop09-pres-en.html|Rmarkdown presentation]] for this workshop;
  - An [[https://qcbsrworkshops.github.io/workshop09/book-en/workshop09-script-en.R|R script]] that follows the presentation;
  - [[https://qcbsrworkshops.github.io/workshop09/book-en/index.html|Written materials]] that accompany the presentation in bookdown format.

====== Workshop 9: Multivariate analyses ======

Developed by: Bérenger Bourgeois, Xavier Giroux-Bougard, Amanda Winegardner, Emmanuelle Chrétien and Monica Granados. (Material in the R script obtained from: Borcard, Gillet & Legendre (2011). //Numerical Ecology with R//. Springer New York.)

**Summary:** In this workshop, you will learn the basics of multivariate analyses that will allow you to reveal patterns in your community composition data. You will first learn how to choose appropriate distance metrics and transformations, and then how to perform various multivariate analyses: clustering analysis, Principal Component Analysis (PCA), Correspondence Analysis (CA), Principal Coordinate Analysis (PCoA) and Non-Metric MultiDimensional Scaling (NMDS).

**Link to new [[https://qcbsrworkshops.github.io/workshop09/workshop09-en/workshop09-en.html|Rmarkdown presentation]]**

Link to old [[https://prezi.com/isb-5h_tanxo/|Prezi presentation]]

Download the R script and data required for this workshop:
  * [[http://qcbs.ca/wiki/_media/multivar1_e.r|R Script]]
  * [[http://qcbs.ca/wiki/_media/DoubsEnv.csv|DoubsEnv data]]
  * [[http://qcbs.ca/wiki/_media/DoubsSpe.csv|DoubsSpe data]]
  * [[http://qcbs.ca/wiki/_media/coldiss.R|Coldiss R function]]

Make sure to load the following packages (see how in the R script):
  * [[http://cran.r-project.org/web/packages/vegan/index.html|vegan (for multivariate analyses)]]
  * [[http://cran.r-project.org/web/packages/gclus/index.html|gclus (for clustering graphics)]]
  * [[http://cran.r-project.org/web/packages/ape/index.html|ape (for phylogenetics)]]

<code rsplus>
install.packages("vegan")
install.packages("gclus")
install.packages("ape")

library(vegan)
library(gclus)
library(ape)
source(file.choose()) # use coldiss.R, which you have downloaded to your own directory
</code>

======What is ordination?======

Ordination is a set of methods for depicting samples in multiple dimensions (Clarke and Warwick 2011) and often feels like a catch-all term in ecological statistics. Ecologists are often told to “run a PCA” in the face of complex and messy multivariate data. R code for ordination techniques is readily available and relatively straightforward to implement for most data. Interpretation of ordination analyses can be more difficult, however, especially if you are unsure of the specific questions you wish to explore with a particular ordination method.
As such, while ordination methods are very useful for helping simplify and make sense of multivariate data, careful consideration of why the methods are being used, and of which ones are most appropriate, is necessary for strong ecological analyses!

When you use an ordination method, you are taking a set of variables and creating new principal axes along which samples (sites, etc.) are scored or ordered (Gotelli and Ellison 2004), in order to reduce or simplify the data, i.e. to create new axes that represent most of the variation in the data. As an example, a dataset with 24 variables may be reduced to five principal components that represent the main patterns of variation amongst samples. Unconstrained ordination methods are generally not frameworks for hypothesis testing; rather, they are best suited for exploratory data analysis. The different types of ordination can be useful for many different questions; see [[http://ordination.okstate.edu/|the Ordination Website]] for an overview of the different types of ordination.

======Getting started with data======

We will use two main data sets in the first part of this workshop. “DoubsSpe.csv” is a data frame of fish community data, where the first column contains site names from 1 to 30 and the remaining columns are fish taxa. The taxa columns are populated by fish abundance data (counts). “DoubsEnv.csv” is a data frame of environmental data for the same sites contained in the fish community data frame. Again, the first column contains site names from 1 to 30. The remaining columns contain measurements for 11 abiotic variables. Note that data used in ordination analyses are generally in [[http://en.wikipedia.org/wiki/Wide_and_narrow_data|wide format]].

======1. Explore the data======

=====1.1 Species data=====

We can use summary functions to explore the “spe” data (fish community data) and discover things like the dimensions of the matrix, the column headings and summary statistics for the columns. This is a review from Workshop 2.

The proportion of zeros in the dataset is ~0.5. This is high but not uncommon for species abundance data. See the frequency at which different numbers of species occur together:

<code rsplus>
spe.pres <- colSums(spe > 0) # compute the number of sites where each species is present
hist(spe.pres, main="Species occurrence", las=1, xlab="Frequency of occurrences", breaks=seq(0, 30, by=5), col="grey")
</code>

You can see that most species occur at an intermediate number of sites.

Calculate the number of species that occur at each site. This is a simplistic way of calculating species richness for each site. In some cases, the number of individuals found in a site may differ between sites. Because there may be a relationship between the number of individuals counted in a site or sample and the taxa or species richness, [[http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/diversity.html|rarefied species richness]] may be a more appropriate metric than total richness. The rarefy() function in vegan can be used to generate rarefied species richness.

<code rsplus>
site.pres <- rowSums(spe > 0) # number of species with a value greater than 0 in that site row
hist(site.pres, main="Species richness", las=1, xlab="Frequency of sites", ylab="Number of species", breaks=seq(0, 30, by=5), col="grey")
</code>

=====1.2 Environmental data=====

<code rsplus>
apply(env.z, 2, sd) # the data are now scaled (standard deviations = 1)
</code>

======2. Association measures======

Matrix algebra ultimately lies beneath all ordinations. A matrix consists of data (e.g. measured values) in rows and columns. Prior to starting multivariate analyses, you likely have matrices with ecological data (such as the DoubsEnv or DoubsSpe data that we have used before), but in ordination methods your original data matrices are used to create association matrices. Before launching into ordination methods, it is important to spend some time with your data matrices. Creating an association matrix between objects or among descriptors allows for calculations of similarity or distance between objects or descriptors (Legendre and Legendre 2012). Exploring the possible association measures that can be generated from your data prior to running ordinations can help you to better understand what distance measure to use within ordination methods.
At this point, it may be difficult to see the purpose behind the different dissimilarity indices, or to describe what they do to your multivariate data, but we will need this knowledge when we consider canonical ordination methods later on.

IN SUMMARY: To ordinate objects you need to compute distances among them. Distances can be computed in several ways, taking into account abundance or presence/absence data. More importantly, there are several properties of distance metrics that are explored with the data examples below. For more information about the properties of distance metrics and some key terms, click on the hidden section below.

<hidden>
**Some key terms:**

**Association -** “general term to describe any measure or coefficient used to quantify the resemblance or difference between objects or descriptors. In an analysis between descriptors, zero means no association.” (Legendre and Legendre 2012).

**Similarity -** a measure that is “maximum (S=1) when two objects are identical and minimum when two objects are completely different.” (Legendre and Legendre 2012).

**Distance (also called dissimilarity) -** a measure that is “maximum (D=1) when two objects are completely different” (Legendre and Legendre 2012). Distance or dissimilarity (D) = 1 - S.
</hidden>

Choosing an association measure depends on your data, but also on what you know ecologically about your data. For example, Euclidean distance is a very common and easy-to-use distance measure, and is useful for understanding how different two samples are based on the co-occurrence of species. The way Euclidean distance is calculated, though, relies on the zeros in the dataset, meaning that two samples or sites without any species in common may appear more similar than two samples that share a few species. This can be misleading, and it is best to choose a different distance measure if you have many zeros in your species matrix. This is commonly referred to as the “double zero” problem in ordination analyses.
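To make the double-zero problem concrete, here is a minimal sketch with made-up abundances (the toy matrix and the hand-rolled bray() helper are illustrative only, not part of the workshop data or script):

```r
# Toy data: sites A and B share no species, while B and C share both of theirs
toy <- rbind(
  A = c(0, 0, 1, 1),
  B = c(1, 1, 0, 0),
  C = c(2, 3, 0, 0)
)

# Euclidean distance: A-B (nothing in common) comes out SMALLER than B-C
as.matrix(dist(toy, method = "euclidean"))

# Bray-Curtis by hand, following the formula used later in this workshop:
# d = sum(abs(a - b)) / sum(a + b)
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)
bray(toy["A", ], toy["B", ]) # 1: completely dissimilar, as it should be
bray(toy["B", ], toy["C", ]) # ~0.43: B and C are recognized as similar
```

vegdist(toy, method="bray") from vegan returns the same values, because double zeros never enter either the numerator or the denominator of the Bray-Curtis sums.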
Here are some commonly used dissimilarity (distance) measures (recreated from Gotelli and Ellison 2004):

^Measure name^Property^Description^
^Euclidean^Metric^Distance between two points in 2D space.^
^Manhattan^Metric^Distance between two points, where the distance is the sum of the differences of their Cartesian coordinates, i.e. as if you made a right angle between the points.^
^Chord^Metric^This distance is generally used to assess differences due to genetic drift.^
^Mahalanobis^Metric^Distance between a point and a distribution, where the distance is the number of standard deviations of the point from the mean of the distribution.^
^Chi-square^Metric^Similar to Euclidean.^
^Bray-Curtis^Semi-metric^Dissimilarity between two samples (or sites), where the sum of the lower values for species present in both samples is divided by the sum of the species counted in each sample.^
^Jaccard^Metric^[[http://en.wikipedia.org/wiki/Jaccard_index|Description]]^
^Sorensen's^Semi-metric^Bray-Curtis dissimilarity is 1 - Sorensen similarity.^

=====2.1 Distance measures=====

**Quantitative species data**

We can use the vegdist() function to compute dissimilarity indices for community composition data. These can then be visualized as a matrix if desired.

<code rsplus>
spe.db <- vegdist(spe, method="bray")  # "bray" with presence-absence data is Sorensen dissimilarity
spe.dj <- vegdist(spe, method="jac")   # Jaccard dissimilarity
spe.dg <- vegdist(spe, method="gower") # Gower dissimilarity
spe.db <- as.matrix(spe.db) # put in matrix form (can visualize, write to .csv, etc.)
</code>

A condensed version of the spe.db matrix, where the numbers represent the distance (dissimilarity) between the first 3 species in DoubsSpe, would look like this:

^ ^Species 1^Species 2^Species 3^
^Species 1^0.0^0.6^0.68^
^Species 2^0.6^0.0^0.14^
^Species 3^0.68^0.14^0.0^

You can see that when comparing a species to itself (e.g. Species 1 to Species 1) the distance is 0, because a species is identical to itself, and that the distance is maximum when the species compared are 100% different.

These same distance measures can be calculated from presence-absence data by setting binary=TRUE in the vegdist() function. This will give slightly different distance measures.

You can also create graphical depictions of these association matrices using a function called coldiss.

<hidden>
This function can be sourced using the script coldiss.R:

<code rsplus>
windows()
coldiss(spe.db, byrank=FALSE, diag=TRUE) # heat map of Bray-Curtis dissimilarity
windows()
coldiss(spe.dj, byrank=FALSE, diag=TRUE) # heat map of Jaccard dissimilarity
windows()
coldiss(spe.dg, byrank=FALSE, diag=TRUE) # heat map of Gower dissimilarity
</code>

{{:coldiss_Bray.png?800|}}

The figure shows a dissimilarity matrix that mirrors what you would see if you exported the matrix or viewed it within the R console. The ‘hot’ colour (purple) shows areas of high dissimilarity.
</hidden>

**Quantitative environmental data**

Let's look at //associations// between environmental variables (also known as Q mode):

<code rsplus>
?dist # this function also computes dissimilarity matrices
env.de <- dist(env.z, method="euclidean") # Euclidean distance matrix of the standardized environmental variables
windows() # creates a separate graphical window
coldiss(env.de, diag=TRUE)
</code>

We can then look at the //dependence// between environmental variables (also known as R mode):

<code rsplus>
(env.pearson <- cor(env)) # Pearson r linear correlation
round(env.pearson, 2) # rounds the coefficients to 2 decimal places
(env.ken <- cor(env, method="kendall")) # Kendall tau rank correlation
round(env.ken, 2)
</code>

The Pearson correlation measures the linear correlation between two variables. The Kendall tau is a rank correlation, which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
In some cases, there may be mixed types of environmental variables. Q mode can still be used to find associations between these environmental variables. We'll do this by first creating an example data frame:

<code rsplus>
var.g1 <- rnorm(30, 0, 1)
var.g2 <- runif(30, 0, 5)
var.g3 <- gl(3, 10)
var.g4 <- gl(2, 5, 30)
(dat2 <- data.frame(var.g1, var.g2, var.g3, var.g4))
str(dat2)
summary(dat2)
</code>

A dissimilarity matrix can be generated for these mixed variables using the Gower dissimilarity:

<code rsplus>
library(cluster) # daisy() comes from the cluster package
?daisy # this function can handle NAs in the data
(dat2.dg <- daisy(dat2, metric="gower"))
coldiss(dat2.dg)
</code>

**Challenge 1 - Introductory**

Discuss with your neighbour: How can we tell how similar objects are when we have multivariate data? Make a list of all your suggestions.

**Challenge 1 - Introductory Solution**
<hidden>
Discuss as a group.
</hidden>

**Challenge 1 - Advanced**

Calculate the Bray-Curtis and Gower dissimilarities of the abundances of species CHA, TRU and VAI for sites 1, 2 and 3 (using the “spe” and “env” dataframes) //without using the decostand() function.//
**Challenge 1 - Advanced Solution**
<hidden>
First, it is helpful to know the formula for the Bray-Curtis dissimilarity:

Bray-Curtis dissimilarity : d[jk] = (sum abs(x[ij]-x[ik]))/(sum (x[ij]+x[ik]))

Next, subset the species data so that only sites 1, 2 and 3 are included, along with only the species CHA, TRU and VAI:

<code>
spe.challenge<-spe[1:3,1:3] # "[1:3," selects rows (sites) 1 to 3, while ",1:3]" selects the first three species columns (here, the three species of interest)
</code>

Determine the total species abundance for each site of interest (the sum of each row). These totals form the denominator of the equation above:

<code>
(Abund.s1<-sum(spe.challenge[1,]))
(Abund.s2<-sum(spe.challenge[2,]))
(Abund.s3<-sum(spe.challenge[3,]))
# Wrapping a line in () prints its output right away in the console
</code>

Now calculate, for each species, the difference in abundance between each pair of sites. For example, what is the difference between the abundance of CHA in site 1 and in site 2? You need the following differences:

CHA between sites 1 and 2
TRU between sites 1 and 2
VAI between sites 1 and 2
CHA between sites 1 and 3
TRU between sites 1 and 3
VAI between sites 1 and 3
CHA between sites 2 and 3
TRU between sites 2 and 3
VAI between sites 2 and 3

<code>
Spec.s1s2<-0
Spec.s1s3<-0
Spec.s2s3<-0
for (i in 1:3) {
  Spec.s1s2<-Spec.s1s2+abs(spe.challenge[1,i]-spe.challenge[2,i])
  Spec.s1s3<-Spec.s1s3+abs(spe.challenge[1,i]-spe.challenge[3,i])
  Spec.s2s3<-Spec.s2s3+abs(spe.challenge[2,i]-spe.challenge[3,i])
}
</code>

Now use the differences you have just calculated as the numerator of the Bray-Curtis equation, and the total site abundances you calculated earlier as the denominator:

<code>
(db.s1s2<-Spec.s1s2/(Abund.s1+Abund.s2)) # Site 1 compared to site 2
(db.s1s3<-Spec.s1s3/(Abund.s1+Abund.s3)) # Site 1 compared to site 3
(db.s2s3<-Spec.s2s3/(Abund.s2+Abund.s3)) # Site 2 compared to site 3
</code>

You should find values of 0.5 for site 1 to site 2, 0.538 for site 1 to site 3 and 0.053 for site 2 to site 3.

Check your manual results against the function vegdist() with the Bray-Curtis method:

<code>
(spe.db.challenge<-vegdist(spe.challenge, method="bray"))
</code>

A matrix like this is produced, which should match your manual calculations:

^ ^Site 1^Site 2^
^Site 2^0.5^--^
^Site 3^0.538^0.0526^

For the Gower dissimilarity, proceed in the same way but use the appropriate equation:

Gower dissimilarity : d[jk] = (1/M) sum(abs(x[ij]-x[ik])/(max(x[i])-min(x[i])))

<code>
# Calculate the number of columns (species) in your dataset
M<-ncol(spe.challenge)

# Calculate the abundance difference between each pair of sites for each species
Spe1.s1s2<-abs(spe.challenge[1,1]-spe.challenge[2,1])
Spe2.s1s2<-abs(spe.challenge[1,2]-spe.challenge[2,2])
Spe3.s1s2<-abs(spe.challenge[1,3]-spe.challenge[2,3])
Spe1.s1s3<-abs(spe.challenge[1,1]-spe.challenge[3,1])
Spe2.s1s3<-abs(spe.challenge[1,2]-spe.challenge[3,2])
Spe3.s1s3<-abs(spe.challenge[1,3]-spe.challenge[3,3])
Spe1.s2s3<-abs(spe.challenge[2,1]-spe.challenge[3,1])
Spe2.s2s3<-abs(spe.challenge[2,2]-spe.challenge[3,2])
Spe3.s2s3<-abs(spe.challenge[2,3]-spe.challenge[3,3])

# Calculate the range of each species' abundance across the sites
Range.spe1<-max(spe.challenge[,1]) - min(spe.challenge[,1])
Range.spe2<-max(spe.challenge[,2]) - min(spe.challenge[,2])
Range.spe3<-max(spe.challenge[,3]) - min(spe.challenge[,3])

# Calculate the Gower dissimilarity
# (the species 1 terms are omitted because its range across these sites is zero,
# which would cause a division by zero)
(dg.s1s2<-(1/M)*((Spe2.s1s2/Range.spe2)+(Spe3.s1s2/Range.spe3)))
(dg.s1s3<-(1/M)*((Spe2.s1s3/Range.spe2)+(Spe3.s1s3/Range.spe3)))
(dg.s2s3<-(1/M)*((Spe2.s2s3/Range.spe2)+(Spe3.s2s3/Range.spe3)))

# Compare your results
(spe.db.challenge<-vegdist(spe.challenge, method="gower"))
</code>
</hidden>
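The manual calculations above can be wrapped into small base-R helper functions. This is a sketch only: the spe.challenge matrix below is an illustrative stand-in chosen to reproduce the dissimilarities quoted above (0.5, 0.538, 0.053), not the actual Doubs data.

```r
# Illustrative abundance matrix (NOT the real Doubs data): values chosen so the
# Bray-Curtis dissimilarities match those quoted above
spe.challenge <- matrix(c(0, 3, 0,
                          0, 5, 4,
                          0, 5, 5),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(paste0("site", 1:3), c("sp1", "sp2", "sp3")))

# Bray-Curtis dissimilarity between rows (sites) j and k:
# d[jk] = sum(|x[ij] - x[ik]|) / sum(x[ij] + x[ik])
bray <- function(x, j, k) sum(abs(x[j, ] - x[k, ])) / sum(x[j, ] + x[k, ])

# Gower dissimilarity: (1/M) * sum over species of |x[ij] - x[ik]| / range,
# skipping species whose range across sites is zero (avoids division by zero)
gower <- function(x, j, k) {
  rng  <- apply(x, 2, max) - apply(x, 2, min)
  keep <- rng > 0
  sum(abs(x[j, keep] - x[k, keep]) / rng[keep]) / ncol(x)
}

round(bray(spe.challenge, 1, 2), 3)   # 0.5
round(bray(spe.challenge, 1, 3), 3)   # 0.538
round(bray(spe.challenge, 2, 3), 3)   # 0.053
round(gower(spe.challenge, 1, 2), 3)  # 0.6
```

The functions generalize the per-pair arithmetic of the solution: any pair of rows can be compared without re-typing the differences.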
=====2.2 Transformations of community composition data=====

Species/community data may also need to be standardized or transformed. The decostand() function of the vegan package provides standardization and transformation options for community composition data.

Transforming abundance or count data to presence-absence data:

<code>
spe.pa<-decostand(spe, method="pa")
</code>

Other transformations can be used to correct for the influence of rare species, e.g. the Hellinger transformation:

<code>
# Hellinger transformation
spe.hel<-decostand(spe, method="hellinger") # you can also use method="hell"

# Chi-square transformation
spe.chi<-decostand(spe, method="chi.square")
</code>

**Optional Challenge 2 - Advanced**

Produce Hellinger and Chi-square transformed abundance data for the "spe" data without using decostand().

**Optional Challenge 2 - Advanced Solution**
<hidden>
The Hellinger transformation down-weights the importance given to rare species in a sample.

<code>
# Hellinger transformation
# First calculate the total species abundance per site
(site.totals<-apply(spe, 1, sum))

# Scale species abundances by dividing them by the site totals
(scale.spe<-spe/site.totals)

# Take the square root of the scaled species abundances
(sqrt.scale.spe<-sqrt(scale.spe))

# Compare the results
sqrt.scale.spe
spe.hel
sqrt.scale.spe-spe.hel # or: sqrt.scale.spe/spe.hel
</code>

<code>
# Chi-square transformation
# First calculate the total species abundance per site
(site.totals<-apply(spe, 1, sum))

# Then calculate the square root of the total abundance of each species
(sqrt.spe.totals<-sqrt(apply(spe, 2, sum)))

# Scale species abundances by dividing them by the site totals
# and by the square roots of the species totals
scale.spe2<-spe
for (i in 1:nrow(spe)) {
  for (j in 1:ncol(spe)) {
    scale.spe2[i,j]<-scale.spe2[i,j]/(site.totals[i]*sqrt.spe.totals[j])
  }
}

# Adjust the scaled abundances by multiplying by the square root
# of the grand total of the species matrix
(adjust.scale.spe2<-scale.spe2*sqrt(sum(rowSums(spe))))

# Compare the results
adjust.scale.spe2
spe.chi
adjust.scale.spe2-spe.chi # or: adjust.scale.spe2/spe.chi
</code>
</hidden>
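The looped chi-square solution above can also be written in vectorized base R. This is a sketch under the assumption that the community data is a plain abundance matrix; a small toy matrix stands in for the Doubs "spe" data here.

```r
# Toy abundance matrix standing in for "spe" (illustrative values only)
spe.toy <- matrix(c(10, 0, 5,
                     5, 5, 0,
                     0, 5, 5), nrow = 3, byrow = TRUE)

# Hellinger: square root of abundances divided by their site (row) totals
hel <- sqrt(spe.toy / rowSums(spe.toy))

# Chi-square: y[ij] * sqrt(grand total) / (row total * sqrt(column total));
# the outer product (%o%) builds the matrix of denominators in one step
chi <- spe.toy * sqrt(sum(spe.toy)) /
  (rowSums(spe.toy) %o% sqrt(colSums(spe.toy)))

# Quick check of the Hellinger result: each transformed row has a unit sum of
# squares, because the squared values are relative abundances within a site
rowSums(hel^2)  # 1 1 1
```

The vectorized forms replace both the row-wise division and the double for loop of the challenge solution in a single expression each.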
=====2.3 Clustering=====

One application of association matrices is clustering. Clustering is not a statistical method per se, because it does not test a hypothesis; rather, it highlights structures in the data by partitioning either the objects or the descriptors. Similar objects are combined into groups, allowing distinctions (or contrasts) between groups to be identified. One goal for ecologists could be to divide a set of sites into groups with respect to their environmental conditions or their community composition.

Clustering results are often represented as dendrograms (trees), in which objects agglomerate into groups. There are several families of clustering methods; for the purpose of this workshop, we will present an overview of three hierarchical agglomerative clustering methods: single linkage, complete linkage, and Ward's minimum variance clustering. See chapter 8 of Legendre and Legendre (2012) for more details on the different families of clustering methods.

In hierarchical methods, elements of lower-ranking clusters (or groups) become members of larger, higher-ranking clusters, e.g. species, genus, family, order. Prior to clustering, one needs to create an association matrix among the objects. A distance matrix is the default choice of the clustering functions in R. The association matrix is first sorted in order of increasing distance (or decreasing similarity). Groups are then formed hierarchically, following rules specific to each method.

Let's take a simple example of a Euclidean distance matrix between 5 objects, sorted in ascending order.

{{ :cluster.example.png |}}

In single linkage agglomerative clustering (also called nearest neighbour sorting), the objects at the closest distances agglomerate. The two closest objects agglomerate first, the next two closest objects/clusters are then linked, and so on, which often generates long, thin clusters or chains of objects (see how objects 1 to 5 cluster successively). Conversely, in complete linkage agglomerative clustering, an object agglomerates to a group only when it is linked to the furthest element of the group, which in turn links it to all members of that group (in the example above, the group formed of objects 3 and 4 only clusters with group 1-2 at 0.4, the distance at which objects 1, 2, 3 and 4 are all linked). As a result, complete linkage clustering forms many small, separate groups, and is more appropriate when looking for contrasts or discontinuities in the data.

Let's compare the single and complete linkage clustering methods using the Doubs fish species data. The species data were already Hellinger-transformed. Since cluster analysis requires similarity or dissimilarity indices, the first step is to generate the Hellinger distance matrix.
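A side note before the vegan code: vegdist() with method="euclidean" computes an ordinary Euclidean distance, so base R's stats::dist() applied to Hellinger-transformed data yields the same Hellinger distance matrix. A minimal sketch with toy abundances (illustrative, not the Doubs data):

```r
# Toy abundances (illustrative); the Hellinger transformation divides each
# abundance by its row (site) total and takes the square root
spe.toy <- matrix(c(10, 0, 5,
                     5, 5, 0), nrow = 2, byrow = TRUE)
spe.toy.hel <- sqrt(spe.toy / rowSums(spe.toy))

# Euclidean distance on Hellinger-transformed data = Hellinger distance
spe.toy.dhel <- dist(spe.toy.hel)
spe.toy.dhel
```

A useful property for interpretation: Hellinger distances are bounded above by sqrt(2), unlike raw Euclidean distances on abundances.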
<code>
spe.dhel<-vegdist(spe.hel, method="euclidean") # generates the distance matrix from the Hellinger-transformed data

# See the difference between the two objects
head(spe.hel)  # Hellinger-transformed species data
head(spe.dhel) # Hellinger distances among sites
</code>

Most clustering methods can be computed with the function hclust() of the stats package.
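Before applying hclust() to the fish data, the behaviour of the two linkage rules can be sketched on five toy points along a line (illustrative data, with gaps doubling from left to right), by comparing the fusion levels ($height) of the resulting dendrograms:

```r
# Five points on a line (toy data); gaps double from left to right
pts <- c(1, 2, 4, 8, 16)
d <- dist(pts)

h.single   <- hclust(d, method = "single")
h.complete <- hclust(d, method = "complete")

# Single linkage fuses at nearest-neighbour gaps, so each point chains on:
h.single$height    # 1 2 4 8
# Complete linkage fuses only at the distance to the furthest group member,
# so fusion levels grow much faster:
h.complete$height  # 1 3 7 15
```

Complete linkage fusion levels are always at least as large as the single linkage ones, which is why it produces more compact, contrasted groups.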
<code>
# Perform single linkage clustering
spe.dhel.single<-hclust(spe.dhel, method="single")
plot(spe.dhel.single)

# Perform complete linkage clustering
spe.dhel.complete<-hclust(spe.dhel, method="complete")
plot(spe.dhel.complete)
</code>

{{ :clust_single.png |}}{{ :clust_complete.png |}}

Are there big differences between the two dendrograms?

In single linkage clustering, chains of objects occur (e.g. sites 19, 29, 30, 20, 26, etc.), whereas more contrasted groups are formed in the complete linkage clustering.

Ward's minimum variance clustering differs from these two methods in that it clusters objects into groups using the criterion of least squares (similar to linear models). At the outset, each object is considered a cluster of its own. At each step, the pair of clusters that merges is the one whose fusion leads to the minimum increase in the total within-group sum of squares.

Again, it is possible to generate a Ward's minimum variance clustering with hclust(). However, the dendrogram shows squared distances by default. In order to compare this dendrogram to the single and complete linkage clustering results, one must calculate the square root of the distances.
<code>
# Perform Ward's minimum variance clustering
spe.dhel.ward<-hclust(spe.dhel, method="ward.D2")
plot(spe.dhel.ward)

# Re-plot the dendrogram using the square roots of the fusion levels
spe.dhel.ward$height<-sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward)
plot(spe.dhel.ward, hang=-1) # hang=-1 aligns all objects on the same line
</code>

{{ :clust_ward.png |}}
{{ :clust_wardfinal.png |}}

The clusters generated using Ward's method tend to be more spherical and to contain more similar numbers of objects.

One must be careful in the choice of an association measure and of a clustering method in order to correctly address a question. What are you most interested in: gradients? Contrasts? In addition, the results should be interpreted with respect to the properties of the method used. If more than one method seems suitable for an ecological question, computing them all and comparing the results is the way to go. As a reminder, clustering is not a statistical method, but further steps can be taken to identify interpretable clusters (e.g. where to cut the tree) or to compute clustering statistics. Clustering can also be combined with ordination in order to distinguish groups of sites. These topics go beyond the scope of this workshop, but see Borcard et al. (2011) for more details.
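The "where to cut the tree" step mentioned above can be sketched with stats::cutree(), which turns a dendrogram into a group label per object. The toy data below is illustrative; in the workshop itself the dendrogram would come from spe.dhel.ward.

```r
# Toy 1-D data with two obvious groups (illustrative)
pts <- c(1, 1.5, 2, 10, 10.5, 11)
hc  <- hclust(dist(pts), method = "ward.D2")

# Cut the dendrogram into k = 2 groups and recover a group label per object
groups <- cutree(hc, k = 2)
groups  # 1 1 1 2 2 2
```

The resulting label vector can then be used, for example, to colour sites in an ordination plot or to summarize environmental conditions by group.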
======3. Unconstrained Ordination======

Unconstrained ordination allows us to organize samples, sites or species along continuous gradients (e.g. ecological or environmental). The key difference between unconstrained and constrained ordination (discussed later in this workshop; see "Canonical Ordination") is that in the unconstrained techniques we are not attempting to define a relationship between independent and dependent sets of variables.

Unconstrained ordination can be used to:
  - Assess relationships //within// a set of variables (not between sets).
  - Find key components of variation between samples, sites, species, etc.
  - Reduce the number of dimensions in multivariate data without substantial loss of information.
  - Create new variables for use in subsequent analyses (such as regression). These principal components are weighted, linear combinations of the original variables in the ordination.

[[http://www.umass.edu/landeco/teaching/multivariate/schedule/ordination1.pdf|Source]]

=====3.1 Principal Component Analysis=====

Principal component analysis (PCA) was originally described by Pearson (1901), although it is more often attributed to Hotelling (1933), who proposed it independently. The method and several of its implications for data analysis are presented in the seminal paper of Rao (1964). PCA is used to generate a few key variables from a larger set of variables that still represent as much of the variation in the dataset as possible. PCA is a powerful technique for analyzing quantitative descriptors (such as species abundances), but it cannot be applied to binary data (such as species absence/presence).

Based on a dataset containing normally distributed variables, the first axis (or principal-component axis) of a PCA is the line that goes through the greatest dimension of the concentration ellipsoid describing the multinormal distribution. Similarly, the following axes go through the dimensions of the concentration ellipsoid in order of decreasing length. One can thus derive a maximum of p principal axes from a data set containing p variables.

To do so, PCA rotates the original system of axes defined by the variables in such a way that the successive new axes are orthogonal to one another and correspond to the successive dimensions of maximum variance of the scatter of points. The new variables produced in a PCA are uncorrelated with each other (the axes are orthogonal) and may therefore be used in other analyses such as multiple regression (Gotelli and Ellison 2004). The principal components give the positions of the objects in the new system of coordinates. PCA works on an association matrix among variables containing the variances and covariances of the variables. PCA preserves Euclidean distances and detects linear relationships. As a consequence, raw species abundance data should be subjected to a pre-transformation (e.g. a Hellinger transformation) before computing a PCA.
- #Variation partitioning with all explanatory variables - spe.part.all <- varpart(spe.hel,​ envchem, envtopo) - spe.part.all - windows(title="​Variation partitioning - all variables"​) - plot(spe.part.all,​ digits=2)  ​ - ​ + //To do a PCA you need:// ​ + - A set of variables (with no distinction between independent or dependent variables, i.e. a set of species OR a set of environmental variables). ​ + - Samples that are measured for the same set of variables. ​ + - Generally a dataset that is longer than it is wide is preferred. ​ - The output looks like: + PCA is most useful with data with more than two variables, but is easier to describe using a two dimensional example. In this example (derived from Clarke and Warwick 2001) , we have nine sites with abundance data for two species: - {{ :​varpart_output.png |}} + - {{ :varpart_output_venn.png |}} + - In this case, the chemical variables explain 24.10% of the fish species composition,​ the physiographic variables explain ​11.20% of the fish species composition and the interaction of these two types of variables explained 23.30% of the fish species composition. Note that the varpart() function also identify the fractions that can be tested for significance using the function anova.cca(). ​ + ^Site^Species 1^Species 2^ + ^A^6^2^ + ^B^0^0^ + ^C^5^8^ + ^D^7^6^ + ^E^11^6^ + ^F^10^10^ + ^G^15^8^ + ^H^18^14^ + ^I^14^14^ - Users can also perform variation partitioning between data sets that only contain significant environmental variables: + If you plotted these samples in the two dimensions, the plot would look like this: + {{:​pcaex_1.png?​500|}} + This is a straight forward scatter plot and is an ordination plot, but you can imagine that it is more difficult to produce and visualize such a plot if there were more than two species. In that case, you would want to reduce the number of variables to principal components. 
If the 2D plot above was too complex and we wanted to reduce the data into a single dimension, we would be doing so with a PCA: - ​ + {{:​pcaex_2.png?​500|}} - #RDA of chemistry variables + - spe.chem <- rda(spe.hel~.,​ data=envchem) + - + - #Select significant chemistry variables + - R2a.all.chem <- RsquareAdj(spe.chem)$adj.r.squared + - ordiR2step(rda(spe.hel~1,​ data=envchem),​ + - ​scope= formula(spe.chem),​ direction= "​forward",​ R2scope=TRUE,​ pstep=1000) + - names(envchem) + - (envchem.pars <- envchem[, c( 4, 6, 7 )]) + - + - #RDA with other environmental variables + - spe.topo <- rda(spe.hel~.,​ data=envtopo) + - R2a.all.topo <- RsquareAdj(spe.topo)$adj.r.squared + - ordiR2step(rda(spe.hel~1,​ data=envtopo),​ + - ​scope= formula(spe.topo),​ direction= "​forward",​ R2scope=TRUE,​ pstep=1000) + - names(envtopo) + - envtopo.pars <- envtopo[, c(1,2)] + - + - #Varpart + - spe.part <- varpart(spe.hel,​ envchem.pars,​ envtopo.pars) + - windows(title="​Variation partitioning - parsimonious subsets"​) + - plot(spe.part,​ digits=2) + - + - #Tests of significance + - anova.cca(rda(spe.hel,​ envchem.pars),​ step=1000) # Test of fractions [a+b] + - anova.cca(rda(spe.hel,​ envtopo.pars),​ step=1000) # Test of fractions [b+c] + - env.pars <- cbind(envchem.pars,​ envtopo.pars) + - anova.cca(rda(spe.hel,​ env.pars), step=1000) # Test of fractions [a+b+c] + - anova.cca(rda(spe.hel,​ envchem.pars,​ envtopo.pars),​ step=1000) # Test of fraction [a] + - anova.cca(rda(spe.hel,​ envtopo.pars,​ envchem.pars),​ step=1000) # Test of fraction [c] + - ​ + + In this case, the first principal component is the line of best fit through the points (sites) with the sites perpendicular to the line. - Now, the chemical variables explain 25.30% of the fish species composition,​ the physiographic variables explain 14.20% of the fish species composition and the interaction of these two types of variables explained 19.60% of the fish species composition. 
- All these fractions are significant (p<0.001).
+ A second principal component is then added perpendicular to the first axis:
+ {{:pcaex_3.png?500|}}
- **Challenge 3**: Perform variation partitioning of the mite species abundances with a first dataset for the significant substrate variables (SubsDens, WatrCont and Substrate) and a second dataset for the significant other variables (Shrub and Topo). What proportion of the variation is explained by each dataset? What are the significant fractions?
+ The final plot is then the two PC axes rotated, so that the axes are now principal components as opposed to species:
+ {{:pcaex_4.png?500|}}
- **Challenge 3**: Solution
- <hidden>
- This is what your code may look like:
+ For PCAs with more than two variables, principal components are added like this (Clarke and Warwick 2001):
- str(mite.env)
- (mite.subs <- mite.env[,c(1,2,3)]) #First set of variables outlined in challenge
- (mite.other <- mite.env[,c(4,5)]) #Second set of variables outlined in challenge
- #RDA for mite.subs
- rda.mite.subs <- rda(mite.spe.hel~., data=mite.subs)
- R2a.all.subs <- RsquareAdj(rda.mite.subs)$adj.r.squared
- #Forward selection for mite.subs
- ordiR2step(rda(mite.spe.hel~1, data=mite.subs),
-   scope=formula(rda.mite.subs), direction="forward", R2scope=TRUE, pstep=1000)
- names(mite.subs)
- (mite.subs.pars <- mite.subs[, c(2,3)])
- #RDA for mite.other
- rda.mite.other <- rda(mite.spe.hel~., data=mite.other)
- R2a.all.other <- RsquareAdj(rda.mite.other)$adj.r.squared
- #Forward selection for mite.other
- ordiR2step(rda(mite.spe.hel~1, data=mite.other),
-   scope=formula(rda.mite.other), direction="forward", R2scope=TRUE, pstep=1000)
- names(mite.other)
- (mite.other.pars <- mite.other[, c(1,2)])
- #Variation partitioning
- (mite.spe.part <- varpart(mite.spe.hel, ~WatrCont+Substrate, ~Shrub+Topo,
-   data=mite.env))
- windows(title="Variation partitioning - parsimonious subsets")
- plot(mite.spe.part, digits=2)
- # Tests of all testable fractions
- anova.cca(rda(mite.spe.hel~WatrCont+Substrate, data=mite.env), step=1000) # Test of fractions [a+b]
- anova.cca(rda(mite.spe.hel~Shrub+Topo, data=mite.env), step=1000) # Test of fractions [b+c]
- (env.pars <- cbind(mite.env[,c(2,3,4,5)]))
- anova.cca(rda(mite.spe.hel~WatrCont+Substrate+Shrub+Topo, data=env.pars), step=1000) # Test of fractions [a+b+c]
- anova.cca(rda(mite.spe.hel~WatrCont+Substrate+Condition(Shrub+Topo), data=env.pars), step=1000) # Test of fraction [a]
- anova.cca(rda(mite.spe.hel~Shrub+Topo+Condition(WatrCont+Substrate), data=env.pars), step=1000) # Test of fraction [c]
- In this case, the substrate variables explain 14.00% of the mite species composition, the other environmental variables explain 9.10%, and the interaction of these two sets of variables explains 16.90%. All these fractions are significant (p<0.001).
- </hidden>
+ PC1 = the axis that maximises the variance of the points that are projected perpendicularly onto the axis.
+ PC2 = must be perpendicular to PC1, but its direction is again the one in which variance is maximised when points are perpendicularly projected onto it.
+ PC3 and so on: perpendicular to the first two axes.
+ Essentially, when there are more than two dimensions, PCA produces a new space in which all PCA axes are orthogonal (i.e. the correlation between any two axes is zero) and where the axes are ordered according to the percentage of the variance of the original data they explain.
+ The "spe" data includes 27 fish taxa. To simplify the 27 fish taxa into a smaller number of fish-related variables, or to identify how different sites or samples are associated with particular fish taxa, we can run a PCA.
- ======3. Multivariate regression tree======
- Multivariate regression tree (MRT) is a constrained clustering technique. Introduced by De'ath (2002), MRT partitions a quantitative response matrix using a matrix of explanatory variables that constrains (guides) where to divide the data of the response matrix. RDA and MRT are both regression techniques: the former explains the global structure of relationships through a linear model, while the latter better highlights local structures and interactions among variables by producing a tree model.
+ Run a PCA on the Hellinger-transformed species data:
+ #Run the PCA using the rda() function (NB: rda() is used for both PCA and RDA)
+ spe.h.pca <- rda(spe.hel)
+ #Extract the results
+ summary(spe.h.pca) #overall results
- Advantages of the MRT compared to the RDA:
-   * it does not make assumptions about the shape of the relationships between species and environmental variables (quantitative or categorical),
-   * it is robust in dealing with missing values,
-   * it is robust in dealing with collinearity among the explanatory variables,
-   * it is insensitive to transformations of the explanatory variables, which allows the use of raw values,
-   * the outcome, the tree, is easy to interpret, especially for a non-scientist audience.
- The MRT technique splits the data into clusters of samples similar in their species composition based on environmental value thresholds. It involves two procedures running at the same time: 1) the computation of the constrained partitioning of the data, and 2) the calculation of the relative error of the successive partitioning levels by multiple cross-validations. The function mvpart() from the package mvpart computes both the partition and the cross-validation.
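The first of these two procedures, the constrained split search, can be sketched in base R (a toy illustration of the within-group sum-of-squares criterion only, not the mvpart() implementation; the data below are invented):

```r
# Toy response matrix (2 "species" at 8 sites) and one explanatory variable
Y <- cbind(sp1 = c(1, 2, 1, 2, 8, 9, 8, 9),
           sp2 = c(9, 8, 9, 8, 2, 1, 2, 1))
x <- c(100, 120, 140, 160, 300, 320, 340, 360)  # e.g. altitude

# Within-group SS: squared distances to the group centroid, summed over columns
wss <- function(Y) sum(scale(Y, center = TRUE, scale = FALSE)^2)

# Try every split point along x; keep the one minimizing total within-group SS
best.split <- function(Y, x) {
  o <- order(x); Y <- Y[o, , drop = FALSE]; x <- x[o]
  cand <- sapply(1:(length(x) - 1), function(i) {
    wss(Y[1:i, , drop = FALSE]) + wss(Y[(i + 1):length(x), , drop = FALSE])
  })
  i <- which.min(cand)
  list(threshold = mean(x[i:(i + 1)]), within.SS = cand[i])
}

split1 <- best.split(Y, x)
split1$threshold  # midpoint between the two obvious groups of sites
```

Repeating this search within each resulting subgroup grows the tree, exactly as described above; mvpart() additionally cross-validates each tree size.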
- A quick note on MRT terminology:
- Leaf: terminal group of sites.
- Node: point where the data split into two groups; it is characterized by a threshold value of an explanatory variable.
- Branch: each group formed by a split.
+ The summary looks like this:
+ {{:pcaoutput_1v2.png?800|}}
+ {{:pcaoutput_2.png?800|}}
+ {{:pcaoutput_3.png?800|}}
+ **Interpretation of ordination results:**
+ A new word from this output is "eigenvalue". An eigenvalue is the value of the change in the length of a vector, and for our purposes it is the amount of variation explained by each axis in a PCA. As you can see, the summary function produces a fair amount of information. From the summary we can see how much of the variance in the data is explained by the unconstrained variables (i.e. variables for which we have made no effort to define relationships). In this case, the total variance of the sites explained by the species is 0.5. The summary also tells you what proportion of the total explained variance is explained by each principal component: the first axis of the PCA explains 51.33% of the variation, and the second axis 12.78%. You can also extract certain parts of the output; try this:
+ summary(spe.h.pca, display=NULL) # only eigenvalues and their contribution to the variance
+ eigen(cov(spe.hel)) # also computes the eigenvalues
- **1- Constrained partitioning of the data**
+ Sometimes you may want to extract the scores (i.e. the coordinates within a PCA biplot) for either the "sites" (the rows in your dataset, whether they are actual sites or not) or the "species" (the variables in your data, whether they are actual species or some other variables). This is useful if you want to use a principal component as a variable in another analysis, or to make additional graphics.
+ For example, with the "spe" dataset, you might want to obtain a single variable that is a composite of all the fish abundance data and then use that variable in a regression with another variable, or plot it across a spatial gradient. To extract scores from a PCA, use the scores() function:
+ spe.scores <- scores(spe.h.pca, display="species", choices=c(1,2)) # species scores on the first two PCA axes
+ site.scores <- scores(spe.h.pca, display="sites", choices=c(1,2)) # site scores on the first two PCA axes
+ #Note: if you don't specify the number of principal components to extract (e.g. choices=c(1,2) or choices=c(1:2)), scores will be extracted for all of the principal components.
- First, the method computes all possible partitions of the sites into two groups. For each quantitative explanatory variable, the sites are sorted by ascending values of the variable; for categorical variables, the sites are aggregated by levels so that all combinations of levels can be tested. The method splits the data after the first object, the second object, and so on, and computes the sum of within-group sums of squared distances to the group mean (within-group SS) for the response data. It retains the partition into two groups that minimizes the within-group SS, along with the corresponding threshold value/level of the explanatory variable. These steps are repeated within the two subgroups formed previously, until every object forms its own group; in other words, until each leaf of the tree contains one object.
+ The PCA on the "spe" fish data produces as many principal components as there are fish taxa (columns), which in this case means that 27 principal components are produced. In many cases, though, you may have done a PCA to reduce the number of variables to deal with and to produce composite variables for the fish.
+ In this case, you are likely interested in knowing how many of these principal components are actually significant or adding new information to the PCA (i.e. how many principal components you need to retain before you are no longer explaining much additional variance). To determine this, you can use the Kaiser-Guttman criterion and produce a barplot showing the point at which the principal components stop explaining a significant amount of variance. The barplot below marks the point at which the variance explained by a new principal component falls below the average amount explained by all of the eigenvalues:
+ # Identify the significant axes using the Kaiser-Guttman criterion
+ ev <- spe.h.pca$CA$eig
+ ev[ev>mean(ev)]
+ n <- length(ev)
+ barplot(ev, main="Eigenvalues", col="grey", las=2)
+ abline(h=mean(ev), col="red")
+ legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
+ {{:pca_sigaxes_sp.png?500|}}
- **2- Cross-validation and pruning the tree**
- The mvpart function also performs a cross-validation and identifies the best predictive tree. The cross-validation procedure consists of using a subset of the objects to construct the tree and then allocating the remaining objects to the groups. In a good predictive tree, objects are assigned to the appropriate groups. The cross-validated relative error (CVRE) is the measure of this predictive error. Without cross-validation, one would retain the number of partitions minimizing the variance not explained by the tree (i.e. the relative error: the sum of the within-group SS over all leaves divided by the overall SS of the data). This is, so to speak, the solution maximizing the R2, and it is explanatory rather than predictive.
+ From this barplot, you can see that once you reach PC6, the proportion of variance explained falls below the average proportion explained by the other components.
+ If you take another look at the PCA summary, you will notice that by the time you reach PC5, the cumulative proportion of variance explained by the principal components is 85%.
- Let's create a multivariate regression tree on the Doubs data.
+ A PCA is not just for species data. It can also be run and interpreted in the same way using standardized environmental variables:
+ #Run the PCA
+ env.pca <- rda(env.z) # or rda(env, scale=TRUE)
+ #Extract the results
+ summary(env.pca)
+ summary(env.pca, scaling=2)
+ Scaling refers to what portion of the PCA is scaled to the eigenvalues. Scaling = 2 means that the species scores are scaled by the eigenvalues, whereas scaling = 1 means that the site scores are scaled by the eigenvalues. Scaling = 3 means that both species and site scores are scaled symmetrically by the square root of the eigenvalues. Using scaling = 1 means that the Euclidean distances among objects (e.g. the rows in your data) are preserved, whereas scaling = 2 means that the correlations among descriptors (e.g. the columns in your data) are preserved. This means that when you look at a biplot of a PCA that has been run with scaling = 2, the angle between descriptors represents their correlation.
- ?mvpart
- #Prepare the data: remove "distance from source"
- env <- subset(env, select = -das)
- # Create the regression tree
- doubs.mrt <- mvpart(data.matrix(spe.hel)~., env,
-   legend=FALSE, margin=0.01, cp=0, xv="pick",
-   xval=nrow(spe.hel), xvmult=100, which=4)
+ # Identify the significant axes using the Kaiser-Guttman criterion
+ ev <- env.pca$CA$eig
+ ev[ev>mean(ev)]
+ n <- length(ev)
+ barplot(ev, main="Eigenvalues", col="grey", las=2)
+ abline(h=mean(ev), col="red")
+ legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
+ Compare this eigenvalue barplot with the one you created for the species PCA.
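The scaling rules described above can be made concrete with a small SVD sketch in base R (an illustration of the general idea only; vegan applies additional multiplicative constants, so the numbers will not match rda() output exactly, and USArrests stands in for real environmental data):

```r
# Centred, standardized data matrix: PCA is the SVD X = U D V'
X <- scale(USArrests, center = TRUE, scale = TRUE)
s <- svd(X)

# Scaling 1 (distance biplot): sites carry the singular values,
# so Euclidean distances among rows are preserved
sites.sc1   <- s$u %*% diag(s$d)
species.sc1 <- s$v

# Scaling 2 (correlation biplot): descriptors carry the singular values,
# so angles among descriptor arrows reflect their correlations
sites.sc2   <- s$u
species.sc2 <- s$v %*% diag(s$d)

# Either way, the product of site and species scores recovers X
max(abs(sites.sc1 %*% t(species.sc1) - X))
```

The design choice between the two scalings is therefore about which geometry you want to read off the biplot, not about the underlying decomposition, which is identical.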
+ As you saw in the explanation of the summary output, a lot of information can be extracted from a PCA before even plotting it. However, many of us interpret data best visually, and a figure of a PCA is often the best way to convey major patterns. PCA biplots can be plotted in base R. A PCA biplot includes the x-axis as the first principal component and the y-axis as the second principal component.
- At this point, you will need to select, from the following graph, the tree whose size (number of groups) is appropriate to the aim of your study. This step requires the argument xv="pick". In other words, you must prune the tree by picking the best-fit tree. Indeed, a fully resolved tree is not the desirable outcome; instead, one is usually interested in a tree that includes only informative partitions/groups. It is also possible to have an a priori idea of the number of potential groups to be retained.
+ For some plotting exercises, let's start with the //species PCA//. A basic biplot without any customization could be plotted like this, where the site positions are shown by black numbers and species positions are shown by red species codes. Remember, species positions come from plotting species along PCs, and the site positions are derived from the weighted sums of species scores.
+ plot(spe.h.pca)
+ {{:spe_PCA1.png?800|}}
- {{ :cross_validation.png |}}
+ Basically, the code above plots the PCA like this:
+ plot(spe.h.pca, type="n") #produces a blank biplot with nothing displayed but the axes
+ points(spe.h.pca, dis="sp", col="blue") #points are added for the species (columns) (dis=)
+ #use text() instead of points() if you want the labels
+ points(spe.h.pca, dis="sites", col="red") #points are added for the sites (rows)
+ {{:spe_PCA2.png?800|}}
- The graph shows the relative error RE (in green) and the cross-validated relative error CVRE (in blue) of trees of increasing size.
- The red dot indicates the solution with the smallest CVRE, and the orange dot shows the smallest tree within one standard error of the CVRE. It has been suggested that instead of choosing the solution minimizing the CVRE, it would be more parsimonious to opt for the smallest tree for which the CVRE is within one standard error of the tree with the lowest CVRE (Breiman et al. 1984). The green bars at the top indicate the number of times each size was chosen during the cross-validation process.
+ To create more detailed plots and to play with aesthetics, try this code:
+ #Biplot of the PCA on transformed species data (scaling 1)
+ windows()
+ plot(spe.h.pca)
+ windows()
+ biplot(spe.h.pca)
+ windows()
+ plot(spe.h.pca, scaling=1, type="none", # scaling 1 = distance biplot:
+ # distances among objects in the biplot approximate their Euclidean distances
+ # but angles among descriptor vectors DO NOT reflect their correlation
+   xlab = c("PC1 (%)", round((spe.h.pca$CA$eig[1]/sum(spe.h.pca$CA$eig))*100,2)), #this comes from the summary
+   ylab = c("PC2 (%)", round((spe.h.pca$CA$eig[2]/sum(spe.h.pca$CA$eig))*100,2)))
+ points(scores(spe.h.pca, display="sites", choices=c(1,2), scaling=1),
+   pch=21, col="black", bg="steelblue", cex=1.2)
+ text(scores(spe.h.pca, display="species", choices=c(1), scaling=1),
+   scores(spe.h.pca, display="species", choices=c(2), scaling=1),
+   labels=rownames(scores(spe.h.pca, display="species", scaling=1)),
+   col="red", cex=0.8)
- This graph is interactive, which means you will have to click on the blue point corresponding to your choice of tree size. Once you do so, the corresponding multivariate regression tree will appear. If you click on the orange dot, the following tree appears.
+ This code produces three plots; the final plot is the most visually pleasing:
+ {{:spe_PCA3.png?800|}}
+ What conclusions can you draw from this plot? You can see the site scores shown by the blue dots.
+ If you superimposed site labels on top, you could see which sites are closer to each other in terms of the species found at those sites; but even without the specific labels, you can see that only a few sites lie far from the majority. The species are shown by their names in red, and you can see, for example, that the species "ABL" is not found, or not found at the same prevalence, in the majority of sites as other species closer to the centre of the ordination.
- {{ :mrt_1se.png |}}
+ Biplots need to be interpreted in terms of angles from the center of the plot, not just in terms of proximity. Two variables at a 90-degree angle are uncorrelated. Two variables pointing in opposing directions are negatively correlated. Two variables very close together are strongly correlated, at least in the space described by these axes.
- The statistics at the bottom of the figure are: the residual error (the complement of the R2 of the model, in this case 43.7%), the cross-validated error, and the standard error. This tree has only two leaves separated by one node. This node splits the data into two groups at the threshold altitude value of 361.5 m.
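The angle rule for biplots can be checked directly: for centred variables, the cosine of the angle between two descriptor vectors equals their Pearson correlation (exactly in the full-dimensional space; in a 2-D biplot the match is only approximate). A tiny sketch with made-up vectors:

```r
# Two made-up descriptor vectors
x <- c(2, 4, 6, 9, 12)
y <- c(1, 3, 8, 10, 11)

# Centre them, then compare cos(angle) with the correlation coefficient
xc <- x - mean(x)
yc <- y - mean(y)
cos.angle <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

cos.angle
cor(x, y)  # identical value
```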
+ Now let's look at a plot of the //environmental PCA//:
+ #Biplot of the PCA on the environmental variables (scaling 2)
+ windows()
+ plot(env.pca)
+ windows()
+ plot(env.pca, scaling=2, type="none", # scaling 2 = correlation biplot:
+ # distances among objects in the biplot DO NOT approximate their Euclidean distances
+ # but angles among descriptor vectors reflect their correlation
+   xlab = c("PC1 (%)", round((env.pca$CA$eig[1]/sum(env.pca$CA$eig))*100,2)),
+   ylab = c("PC2 (%)", round((env.pca$CA$eig[2]/sum(env.pca$CA$eig))*100,2)),
+   xlim = c(-1,1), ylim=c(-1,1))
+ points(scores(env.pca, display="sites", choices=c(1,2), scaling=2),
+   pch=21, col="black", bg="darkgreen", cex=1.2)
+ text(scores(env.pca, display="species", choices=c(1), scaling=2),
+   scores(env.pca, display="species", choices=c(2), scaling=2),
+   labels=rownames(scores(env.pca, display="species", scaling=2)),
+   col="red", cex=0.8)
+ {{:env_PCA1.png?800|}}
- Each leaf is characterized by a small barplot showing the abundances of the species, its number of sites and its relative error.
- We can compare this tree with the 10-group solution, as suggested by the CVRE criterion, or choose a solution in between, e.g. with 4 leaves.
+ Remember that a PCA biplot is really just a scatter plot, where the axes are scores extracted from composite variables. This means that there are many different ways to plot a PCA. For example, try using your ggplot() skills from Workshop 4 to extract PCA scores and plot an ordination in ggplot.
+ **Use of PCA axes as composite explanatory variables:**
- # Using the CVRE criterion
- doubs.mrt.cvre <- mvpart(data.matrix(spe.hel)~., env,
-   legend=FALSE, margin=0.01, cp=0, xv="pick",
-   xval=nrow(spe.hel), xvmult=100, which=4)
- # Choosing the best number of partitions ourselves
+ In some cases, users want to reduce several environmental variables to a small number of composite variables.
+ When PCA axes represent ecological gradients (i.e. when environmental variables are correlated with a PCA axis in a meaningful way), users can use the site scores along that axis in further analyses instead of the raw environmental data. In other words, site scores along a PCA axis represent linear combinations of the descriptors and can consequently be used as proxies of ecological conditions in "post-ordination" analyses.
- doubs.mrt.4 <- mvpart(data.matrix(spe.hel)~., env,
-   legend=FALSE, margin=0.01, cp=0, xv="pick",
-   xval=nrow(spe.hel), xvmult=100, which=4)
+ In the example above, the first PCA axis can be related to a gradient from oligotrophic, oxygen-rich to eutrophic, oxygen-deprived water: from left to right, a first group displays the highest values of altitude (alt) and slope (pen), and the lowest values of river discharge (deb) and distance from the source (das). A second group of sites has the highest values of oxygen content (oxy) and the lowest nitrate concentration (nit). A third group shows intermediate values for almost all the measured variables.
- {{ :mrt_cvre.png |}}
+ If users are then interested in identifying whether a specific species is associated with oligotrophic or eutrophic water, they can correlate the species abundances with the site scores along PCA axis 1. For example, to assess whether the species TRU prefers oligotrophic or eutrophic water, the following linear model can be used:
+ Sites_scores_Env_Axis1 <- scores(env.pca, display="sites", choices=c(1), scaling=2)
+ plot(Sites_scores_Env_Axis1, spe$TRU)
+ summary(lm(spe$TRU~Sites_scores_Env_Axis1))
+ abline(lm(spe$TRU~Sites_scores_Env_Axis1))
+ This simple model shows that the abundance of the species TRU depends significantly on the site scores along PCA axis 1 (t = -5.30, p = 1.35e-05, adj-R2 = 49.22%), i.e. it depends on the oligotrophic-eutrophic gradient. The plot identifies a preference of this species for oligotrophic water.
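The same "post-ordination" workflow can be reproduced end to end in base R with prcomp() and simulated data (a sketch only; the variable names and data here are invented, not the Doubs dataset):

```r
set.seed(42)
# Simulated environmental matrix: 30 sites, 3 descriptors driven by one gradient
grad <- seq(-1, 1, length.out = 30)               # hidden ecological gradient
env.sim <- cbind(alt = 10 * grad + rnorm(30, sd = 1),
                 oxy = -5 * grad + rnorm(30, sd = 1),
                 nit =  4 * grad + rnorm(30, sd = 1))

# PCA, then use the PC1 site scores as a composite explanatory variable
pc1 <- prcomp(env.sim, scale. = TRUE)$x[, 1]

# A species abundance that also responds to the gradient
abund <- 20 - 6 * grad + rnorm(30, sd = 1)
mod <- lm(abund ~ pc1)
summary(mod)$adj.r.squared  # PC1 captures the gradient, so the fit is strong
```

Because the three descriptors are redundant expressions of one gradient, regressing on PC1 uses all of them at once while spending a single degree of freedom.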
- {{ :mrt_4.png |}}
- The 10-group solution has high explanatory power, but its predictive power (indicated by the cross-validated error) is only slightly better than that of the 2-group solution. The 4-group solution seems to be a good compromise.
+ **Challenge 3**
+ Run a PCA of the "mite" species abundance data. What are the significant axes of variation? Which groups of sites can you identify? Which species are related to each group of sites? Use:
+ mite.spe <- mite #mite data is from the vegan package
- More information can be obtained by looking at the summary output.
+ **Challenge 3 - Solution**
+ <hidden>
+ Your code likely looks something like the following.
+ #Hellinger transformation of mite data and PCA
+ mite.spe.hel <- decostand(mite.spe, method="hellinger")
+ mite.spe.h.pca <- rda(mite.spe.hel)
+ #What are the significant axes?
+ ev <- mite.spe.h.pca$CA$eig
+ ev[ev>mean(ev)]
+ n <- length(ev)
+ barplot(ev, main="Eigenvalues", col="grey", las=2)
+ abline(h=mean(ev), col="red")
+ legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
+ #Output summary/results
+ summary(mite.spe.h.pca, display=NULL)
+ windows()
+ #Plot the biplot
+ plot(mite.spe.h.pca, scaling=1, type="none",
+   xlab=c("PC1 (%)", round((mite.spe.h.pca$CA$eig[1]/sum(mite.spe.h.pca$CA$eig))*100,2)),
+   ylab=c("PC2 (%)", round((mite.spe.h.pca$CA$eig[2]/sum(mite.spe.h.pca$CA$eig))*100,2)))
+ points(scores(mite.spe.h.pca, display="sites", choices=c(1,2), scaling=1),
+   pch=21, col="black", bg="steelblue", cex=1.2)
+ text(scores(mite.spe.h.pca, display="species", choices=c(1), scaling=1),
+   scores(mite.spe.h.pca, display="species", choices=c(2), scaling=1),
+   labels=rownames(scores(mite.spe.h.pca, display="species", scaling=1)),
+   col="red", cex=0.8)
- summary(doubs.mrt)
- {{ :doubs_mrt_summary.png |}}
- CP stands for "complexity parameter", which is the equivalent of the variance explained by each node. The CP at nsplit 0 is the R2 of the whole tree. The summary then outlines, for each node, the best threshold values to split the data. While informative, this output is very dense. A more detailed yet more manageable output can be generated using the wrapper function MRT() of the MVPARTwrap package.
+ And your resulting biplot likely looks something like this:
+ {{:mite_pca.png?800|}}
- Plus, this function also allows the identification of discriminant species.
+ A dense cluster of related species and sites can be seen in the biplot.
+ </hidden>
- # Find discriminant species with MRT results
- doubs.mrt.wrap <- MRT(doubs.mrt, percent=10, species=colnames(spe.hel))
- summary(doubs.mrt.wrap)
- # Extract indval p-values
- doubs.mrt.indval <- indval(spe.hel, doubs.mrt$where)
- doubs.mrt.indval$pval
- # Extract the indicator species of each node, with its indval
+ =====3.2. Correspondence Analysis=====
+ One of the key assumptions made in PCA is that species are related to each other linearly and that they respond linearly to ecological gradients. This is not necessarily the case with a lot of ecological data (e.g. many species have unimodal distributions). Using PCA on data with unimodal species distributions or many zero values may lead to a phenomenon called the "horseshoe effect", which can occur along long ecological gradients. In such cases, a CA, or Correspondence Analysis, may be a better option; see Legendre and Legendre (2012) for further information. Because CA preserves Chi2 distances (while PCA preserves Euclidean distances), this technique is better suited to ordinating datasets containing unimodal species distributions, and it has, for a long time, been one of the favourite tools for the analysis of species presence-absence or abundance data.
+ In CA, the raw data are first transformed into a matrix Q of cell-by-cell contributions to the Pearson Chi2 statistic, and the resulting table is submitted to a singular value decomposition to compute its eigenvalues and eigenvectors. The result is an ordination in which the Chi2 distance is preserved among sites, instead of the Euclidean distance of PCA. The Chi2 distance is not influenced by double zeros; therefore, CA is a method adapted to the analysis of species abundance data without pre-transformation. Contrary to PCA, CA can also be applied to analyze both quantitative and binary data (such as species abundances or absence/presence).
+ As in PCA, the Kaiser-Guttman criterion can be applied to determine the significant axes of a CA, and ordination axes can be extracted to be used in multiple regressions.
- doubs.mrt.indval$maxcls[which(doubs.mrt.indval$pval<=0.05)]
- doubs.mrt.indval$indcls[which(doubs.mrt.indval$pval<=0.05)]
- {{ ::doubs_mrt_discriminant.png |}}
- {{ ::doubs_mrt_finalpart.png |}}
+ Run a CA on the species data:
+ #Run the CA using the cca() function (NB: cca() is used for both CA and CCA)
+ spe.ca <- cca(spe)
+ # Identify the significant axes
+ ev <- spe.ca$CA$eig
+ ev[ev>mean(ev)]
+ n <- length(ev)
+ barplot(ev, main="Eigenvalues", col="grey", las=2)
+ abline(h=mean(ev), col="red")
+ legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
+ {{ :ca_guttmankaiser.png |}}
- The main discriminant species of the first split are TRU, VAI and ABL. TRU and VAI contribute highly to the left leaf, and ABL is the most indicative species of the sites at lower altitude (<361.5 m). This output also indicates which sites are included in each leaf.
- {{ ::doubs_mrt_indval.png |}}
+ From this barplot, you can see that once you reach CA6, the proportion of variance explained falls below the average proportion explained by the other components.
+ If you take another look at the CA summary, you will notice that by the time you reach CA5, the cumulative proportion of variance explained by the axes is 84.63%.
- The second part of the code allows us to test the significance of the indicator value of each species through a permutation test. For each significant indicator species, we extracted the leaf number and the indicator value. In this particular case, TRU, VAI and LOC are all significant indicator species of the left leaf, TRU having the highest indicator value (0.867).
+ summary(spe.ca) #overall results
+ summary(spe.ca, display=NULL) # only axis eigenvalues and their contribution
- **Challenge 4**: Run the multivariate regression tree for the mite data. Select the minimum size of tree within one SE of the CVRE. What is the proportion of variance explained by this tree? How many leaves does this tree contain? What are the discriminant species?
+ {{ :ca_summary.png |}}
- **Challenge 4** - Solution
+ CA results are presented in a similar manner as PCA results. You can see here that CA1 explains 51.50% of the variation in species abundances, while CA2 explains 12.37% of the variation.
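The CA algorithm sketched at the start of this section (a Q matrix of Chi2 contributions, then an SVD) can be reproduced in base R. This toy version (invented counts) recovers the classical result that the total inertia of a CA equals the Pearson Chi2 statistic divided by the table total:

```r
# Toy abundance table: 3 sites x 3 species
Y <- rbind(c(10, 10, 20),
           c(10, 15, 10),
           c(15,  5,  5))

P  <- Y / sum(Y)             # relative frequencies
rw <- rowSums(P)             # row (site) weights
cw <- colSums(P)             # column (species) weights
E  <- rw %o% cw              # expected proportions under independence

# Q: cell-by-cell contributions to the Pearson Chi2 statistic
Q <- (P - E) / sqrt(E)

s   <- svd(Q)
eig <- s$d^2                 # CA eigenvalues
inertia <- sum(eig)          # total inertia

# Check against the classical Chi2 statistic of the table
chi2 <- suppressWarnings(chisq.test(Y)$statistic)
all.equal(as.numeric(inertia), as.numeric(chi2) / sum(Y))
```

Because Q is already centred by the independence model, the trivial axis is removed and the last singular value is numerically zero.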
+ par(mfrow=c(1,2))
+ #### scaling 1
+ plot(spe.ca, scaling=1, type="none", main='CA - biplot scaling 1',
+   xlab=c("CA1 (%)", round((spe.ca$CA$eig[1]/sum(spe.ca$CA$eig))*100,2)),
+   ylab=c("CA2 (%)", round((spe.ca$CA$eig[2]/sum(spe.ca$CA$eig))*100,2)))
+ points(scores(spe.ca, display="sites", choices=c(1,2), scaling=1), pch=21, col="black", bg="steelblue", cex=1.2)
+ text(scores(spe.ca, display="species", choices=c(1), scaling=1),
+   scores(spe.ca, display="species", choices=c(2), scaling=1),
+   labels=rownames(scores(spe.ca, display="species", scaling=1)), col="red", cex=0.8)
+ #### scaling 2
+ plot(spe.ca, scaling=2, type="none", main='CA - biplot scaling 2',
+   xlab=c("CA1 (%)", round((spe.ca$CA$eig[1]/sum(spe.ca$CA$eig))*100,2)),
+   ylab=c("CA2 (%)", round((spe.ca$CA$eig[2]/sum(spe.ca$CA$eig))*100,2)), ylim=c(-2,3))
+ points(scores(spe.ca, display="sites", choices=c(1,2), scaling=2), pch=21, col="black", bg="steelblue", cex=1.2)
+ text(scores(spe.ca, display="species", choices=c(1), scaling=2),
+   scores(spe.ca, display="species", choices=c(2), scaling=2),
+   labels=rownames(scores(spe.ca, display="species", scaling=2)), col="red", cex=0.8)
- mite.mrt <- mvpart(data.matrix(mite.spe.hel)~., mite.env,
-   legend=FALSE, margin=0.01, cp=0, xv="pick",
-   xval=nrow(mite.spe.hel), xvmult=100, which=4)
- summary(mite.mrt)
- mite.mrt.wrap <- MRT(mite.mrt, percent=10, species=colnames(mite.spe.hel))
- summary(mite.mrt.wrap)
- mite.mrt.indval <- indval(mite.spe.hel, mite.mrt$where)
- mite.mrt.indval$pval
- mite.mrt.indval$maxcls[which(mite.mrt.indval$pval<=0.05)]
- mite.mrt.indval$indcls[which(mite.mrt.indval$pval<=0.05)]
- 25.6% of the variation in the mite species assemblage across sites is explained by the partition of the sites based on the water content of the substrate (at 385.1 mg/L). LCIL is a discriminant species of sites with higher water content, with an indicator value of 0.715.
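The indicator value computed by indval() combines specificity and fidelity (Dufrêne and Legendre 1997: IndVal = A x B, where A is the species' mean abundance in a group relative to all groups and B is its occurrence frequency within that group). A simplified base-R sketch on invented data, not the indicspecies/labdsv implementation:

```r
# Abundance of one species at 8 sites, and a 2-group partition of the sites
abund  <- c(5, 7, 6, 8, 0, 0, 1, 0)
groups <- c(1, 1, 1, 1, 2, 2, 2, 2)

indval.simple <- function(abund, groups) {
  A <- tapply(abund, groups, mean)       # mean abundance per group
  A <- A / sum(A)                        # specificity
  B <- tapply(abund > 0, groups, mean)   # fidelity (occurrence frequency)
  A * B                                  # one IndVal per group, in [0, 1]
}

iv <- indval.simple(abund, groups)
round(iv, 3)  # the species clearly indicates group 1
```

The permutation test used by indval() then shuffles the group labels many times to ask how often a value this large arises by chance.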
+ {{ :ca_biplot.png |}}
+ These biplots show a group of sites in the left part of the plot with similar fish communities characterized by numerous species such as GAR, TAN, PER, ROT, PSO and CAR; in the upper right corner, another cluster of sites characterized by the species LOC, VAI and TRU; and a last cluster of sites in the lower right corner of the biplot characterized by the species BLA, CHA and OMB.
+ **Challenge 4**
+ Run a CA of the "mite" species abundance data. What are the significant axes of variation? Which groups of sites can you identify? Which species are related to each group of sites?
- ======4. Linear discriminant analysis======
+ **Challenge 4 - Solution**
+ <hidden>
+ Your code likely looks something like the following:
+ # CA on mite species
+ mite.spe.ca <- cca(mite.spe)
- Linear discriminant analysis (LDA) is a constrained (canonical) technique that allows you to determine how well your independent set of variables explains an a priori grouping. This grouping may have been obtained from a previous clustering analysis (see Workshop 8) or from a hypothesis (e.g. a grouping based on sites at different latitudes or different treatments). An LDA can also be used to classify new data into these pre-determined groups. You can imagine some useful applications of this technique, such as assessing which population a fish should be classified in based on its morphology, or classifying whether a new paper is a freshwater, marine or terrestrial study based on its abstract, given papers from those pre-determined biomes.
+ #What are the significant axes?
+ ev <- mite.spe.ca$CA$eig
+ ev[ev>mean(ev)]
+ n <- length(ev)
+ barplot(ev, main="Eigenvalues", col="grey", las=2)
+ abline(h=mean(ev), col="red")
+ legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
- LDA computes discriminant functions from standardized descriptors.
These coefficients quantify the relative contributions of the (standardized) explanatory variables to the discrimination of objects. Identification functions can be computed from the original (not standardized) descriptors to classify new data into pre-determined groups. + #Output summary/​results ​ - Let’s continue to work with the Doubs fish data. First we must ensure that the within-group covariance matrices of the explanatory variables are homogeneous – a condition necessary for the application of LDA. + summary(mite.spe.ca, display=NULL) - First we want to make an a priori classification that is independent from the environmental data set. We know that there is a general relationship that indicates environmental variables change with latitude ​(Budyko 1969). Here we will classify our Doubs fish sites based on latitude to determine how well the environmental factors explain our latitude grouping. Our groups are determined by simply dividing the range of latitudes equally into three groups and then assigning each site to a group depending on where they fall along the divided range.   
​ + #Plot the biplot + windows() + plot(mite.spe.ca, scaling=1, type="​none",​ + ​xlab=c("PC1 (%)", round((mite.spe.ca$CA$eig[1]/​sum(mite.spe.ca$CA$eig))*100,​2)),​ + ​ylab=c("​PC2 (%)", round((mite.spe.ca$CA$eig[2]/​sum(mite.spe.ca$CA$eig))*100,​2))) + points(scores(mite.spe.ca,​ display="​sites", choices=c(1,​2),​ scaling=1),​ + ​pch=21,​ col="​black",​ bg="​steelblue",​ cex=1.2) + text(scores(mite.spe.ca, display="​species",​ choices=c(1),​ scaling=1),​ + ​scores(mite.spe.ca,​ display="​species",​ choices=c(2),​ scaling=1),​ + ​labels=rownames(scores(mite.spe.ca,​ display="​species",​ scaling=1)),​ + ​col="​red",​ cex=0.8) + ​ - ​ + And your resulting biplot likely looks something like this: - #load spatial data to determine groups + - spa <- read.csv ('http://​www.davidzeleny.net/​anadat-r/​data-download/​DoubsSpa.csv',​ row.names = 1) + - spa <- spa[,-8] + - #View spatial data + {{ :​ca_mite_biplot.png |}} - View (spa) + - #add site numbers + - numbers<-(1:30) + - numbers<​-numbers[!numbers%in%8] + - spa$site<​-numbers + - #make groups based on lattitude y<82=group1, 82<​y<​156=group2, y>156=group3 + =====3.3. Principal Coordinates Analysis===== - spa.group<​-ddply(.data=spa, .variables=.(x, y, site), ​.fun= summarise, group = if(y <= 82) 1 else if (y <= 156) 2 else 3) + - #order by site + PCA as well as CA impose the distance preserved among objects: the Euclidean distance (and several others with pre-transformations) for PCA and the Chi2 distance for CA. If one wishes to ordinate objects on the basis of another distance measure, more appropriate to the problem at hand, then PCoA is the method of choice. In PCA we rotated and plotted our data so that as much variation was explained by a first Principal Component and we can see how much "​species"​, either actual species or environmental variables, contribute to each component by looking at the scores ​(also called loadings). 
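Since CA preserves the Chi-square distance among sites, it can help intuition to compute that distance directly for a tiny example. Below is a minimal sketch in Python; the two-site, three-species abundance table and the helper `chi_square_distance` are made up for illustration, and the formula is the standard one, D = sqrt( y++ · Σj (1/y+j) · (y1j/y1+ − y2j/y2+)² ), where y++ is the table grand total, y+j the species (column) totals, and yi+ the site (row) totals:

```python
import math

def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of an abundance table."""
    grand = sum(sum(row) for row in table)            # y++
    col_sums = [sum(col) for col in zip(*table)]      # y+j
    ri, rk = sum(table[i]), sum(table[k])             # yi+, yk+
    s = sum((table[i][j] / ri - table[k][j] / rk) ** 2 / col_sums[j]
            for j in range(len(col_sums)) if col_sums[j] > 0)
    return math.sqrt(grand * s)

# Made-up table: 2 sites (rows) x 3 species (columns)
table = [[1, 1, 0],
         [0, 1, 1]]
print(chi_square_distance(table, 0, 1))  # ≈ 1.414 (sqrt(2))
```

Note how the division by the column totals down-weights abundant species: this weighting is what distinguishes the Chi-square distance from the plain Euclidean distance preserved by PCA.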
Another type of unconstrained ordination is called Principal Coordinates Analysis (PCoA). In PCoA, points are added to plane space one at a time using Euclidean distance (//or whatever distance (dissimilarity) metric you choose//). Basically, one point is added, then a second so that its distance from the first point is correct, then a third point, and so on, adding as many axes (dimensions) as necessary along the way. Choosing between PCA and PCoA can be tricky, but generally PCA is used to summarize multivariate data into as few dimensions as possible, whereas PCoA can be used to visualize distances between points. PCoA can be particularly well suited to datasets that have more columns than rows. For example, if hundreds of species have been observed over a set of quadrats, then an approach based on a PCoA using the Bray-Curtis dissimilarity (see below) may be best suited.

Run a PCoA on the Hellinger-transformed species abundances (back to DoubsSpe):
<code rsplus | PCoA>
# Using cmdscale()
?cmdscale
cmdscale(dist(spe.hel), k=(nrow(spe)-1), eig=TRUE)

# Using pcoa()
?pcoa
spe.h.pcoa <- pcoa(dist(spe.hel))

# Extract the results
spe.h.pcoa

# Construct the biplot
biplot.pcoa(spe.h.pcoa, spe.hel, dir.axis2=-1)
</code>

The output looks like this (and [[https://www.youtube.com/watch?v=lRdX1qhI7Dw|here]] is a video that might help explain eigenvalues in the context of ordination):

{{:pcoaoutput_1.png?800|}}

{{:pcoaoutput_2.png?800|}}

And the PCoA biplot, like this:

{{:pcoa_spe.png?500|}}

You can also run a PCoA using a different distance measure (e.g. Bray-Curtis). Here is a PCoA run on a Bray-Curtis dissimilarity matrix:
<code rsplus | PCoA with Bray-Curtis>
spe.bray.pcoa <- pcoa(spe.db) # where spe.db is the species dissimilarity matrix computed with Bray-Curtis
spe.bray.pcoa
biplot.pcoa(spe.bray.pcoa, spe.hel, dir.axis2=-1)
# Note that the distance measure chosen strongly influences the results.
</code>

**Challenge 5**
Run a PCoA on the Hellinger-transformed //mite species// abundance data.
What are the significant axes? Which groups of sites can you identify? Which species are related to each group of sites? How do the PCoA results compare with the PCA results?

**Challenge 5 - Solution**
<hidden>
Your code likely looks something like this:
<code rsplus | PCoA on mite species>
mite.spe.h.pcoa <- pcoa(dist(mite.spe.hel))
mite.spe.h.pcoa
windows()
biplot.pcoa(mite.spe.h.pcoa, mite.spe.hel, dir.axis2=-1)
</code>

And your biplot like this:

{{:pcoa_mite.png?500|}}

We see that species 16 and 31 are farther away from the other species in terms of distance, and therefore their distribution across the sites is highly dissimilar from that of the other mite species (and from each other). Site labels that practically overlap are good examples of sites with low dissimilarity (i.e. high similarity) in terms of the species found at those sites.
</hidden>

=====3.4. Nonmetric MultiDimensional Scaling=====

The unconstrained ordination methods presented above allow one to organize objects (e.g. sites) characterized by descriptors (e.g. species) in full-dimensional space. In other words, PCA, CA and PCoA compute a large number of ordination axes (proportional to the number of descriptors) representing the variation of descriptors among sites, and preserve the distances among objects (the Euclidean distance in PCA, the Chi-square distance in CA, and the distance defined by the user in PCoA). Users can then select the axes of interest (generally the first two, as they explain the largest part of the variation) to represent objects in an ordination plot. The resulting biplot thus represents the distances among objects well (e.g. the between-site similarity), but fails to represent the full variation of the ordination space (as Axis3, Axis4, …, Axisn are not shown on the biplot, yet still contribute to the variation among objects).

In some cases, the priority is not to preserve the exact distances among sites, but rather to represent the relationships among objects as accurately as possible in a small number of axes (generally two or three) specified by the user. In such cases, nonmetric multidimensional scaling (NMDS) is the solution. If two axes are selected, the biplot produced from NMDS is the best 2D graphical representation of between-object similarity: dissimilar objects are far apart in the ordination space and similar objects are close to one another. Moreover, NMDS allows users to choose the distance measure applied to calculate the ordination.

To find the best representation of objects, NMDS applies an iterative procedure that tries to position the objects in the requested number of dimensions so as to minimize a stress function (scaled from 0 to 1), which measures the goodness of fit of the distance adjustment in the reduced-space configuration. Consequently, the lower the stress value, the better the representation of objects in the ordination space.
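The stress function itself is simple to write down. The sketch below computes Kruskal's stress-1 in Python from two matched lists of pairwise distances; all numbers are made up, and note that real NMDS implementations (such as vegan's metaMDS) compare ordination distances against monotone-regression fits of the dissimilarities rather than against the raw dissimilarities used in this simplified version:

```python
import math

def stress1(observed, config):
    """Simplified Kruskal stress-1 from matched lists of pairwise distances."""
    num = sum((d - o) ** 2 for o, d in zip(observed, config))
    den = sum(d ** 2 for d in config)
    return math.sqrt(num / den)

# Made-up dissimilarities and matching ordination distances for 3 sites
observed = [0.40, 0.70, 0.55]   # e.g. Bray-Curtis dissimilarities
config   = [0.42, 0.68, 0.55]   # distances among the same sites in the 2-D plot
print(round(stress1(observed, config), 3))  # → 0.029
```

A configuration that reproduced the observed dissimilarities exactly would give a stress of 0; the small value here reflects the close (made-up) match between the two lists.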
An additional way to assess the appropriateness of an NMDS is to construct a Shepard diagram, which plots the distances among objects in the ordination plot against the original distances. The R² obtained from the regression between these two sets of distances measures the goodness of fit of the NMDS ordination.

<code rsplus | NMDS>
# Run the NMDS
spe.nmds<-metaMDS(spe, distance='bray', k=2)

### Extract the results
spe.nmds

### Assess the goodness of fit and draw a Shepard plot
spe.nmds$stress
stressplot(spe.nmds, main='Shepard plot')

# Construct the biplot
windows()
plot(spe.nmds, type="none", main=paste('NMDS/Bray - Stress=', round(spe.nmds$stress, 3)),
     xlab=c("NMDS1"),
     ylab=c("NMDS2"))
points(scores(spe.nmds, display="sites", choices=c(1,2)),
       pch=21, col="black", bg="steelblue", cex=1.2)
text(scores(spe.nmds, display="species", choices=c(1)),
     scores(spe.nmds, display="species", choices=c(2)),
     labels=rownames(scores(spe.nmds, display="species")),
     col="red", cex=0.8)
</code>

{{ :shepard_plot.png |}}

The Shepard plot identifies a strong correlation between the observed dissimilarities and the ordination distances (R² > 0.95), highlighting a high goodness of fit of the NMDS.

{{ :nmds_biplot.png |}}

The NMDS biplot shows a group of closely located sites characterized by the species BLA, TRU, VAI, LOC, CHA and OMB, while the other species form a cluster of sites in the upper right part of the plot. Four sites in the lower part of the plot are strongly different from the others.
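The NMDS above (and the challenge that follows) is computed on Bray-Curtis dissimilarities. As an aside, that measure is easy to compute by hand: BC = 1 − 2W/(A + B), where W is the sum of the per-species minima of the two abundance vectors and A and B are their total abundances. A minimal Python sketch, with made-up counts:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    w = sum(min(a, b) for a, b in zip(x, y))   # W: sum of per-species minima
    return 1 - 2 * w / (sum(x) + sum(y))       # A, B: total abundances

# Made-up abundance vectors for two sites
x = [2, 0, 4]
y = [1, 3, 1]
print(bray_curtis(x, y))  # 1 - 4/11 ≈ 0.636
print(bray_curtis(x, x))  # identical sites → 0.0
```

Because it is bounded between 0 (identical composition) and 1 (no shared species), Bray-Curtis is a popular choice for community abundance data.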
**Challenge 6**
Run the NMDS of the mite species abundances in 2 dimensions based on a Bray-Curtis distance. Assess the goodness of fit of the ordination and interpret the biplot.

**Challenge 6 - Solution**
<hidden>
Your code likely looks something like this:
<code rsplus | NMDS on mite species>
mite.spe.nmds<-metaMDS(mite.spe, distance='bray', k=2)

### Extract the results
mite.spe.nmds

### Assess the goodness of fit
mite.spe.nmds$stress
stressplot(mite.spe.nmds, main='Shepard plot')

### Construct the biplot
windows()
plot(mite.spe.nmds, type="none", main=paste('NMDS/Bray - Stress=', round(mite.spe.nmds$stress, 3)),
     xlab=c("NMDS1"),
     ylab=c("NMDS2"))
points(scores(mite.spe.nmds, display="sites", choices=c(1,2)),
       pch=21, col="black", bg="steelblue", cex=1.2)
text(scores(mite.spe.nmds, display="species", choices=c(1)),
     scores(mite.spe.nmds, display="species", choices=c(2)),
     labels=rownames(scores(mite.spe.nmds, display="species")),
     col="red", cex=0.8)
</code>

And your plots like this:

{{ :nmds_mite_shepard.png |}}

The strong correlation between the observed dissimilarities and the ordination distances (R² > 0.91), together with the relatively low stress value, indicates a good accuracy of the NMDS ordination.

{{ :nmds_mite_biplot.png |}}

No clear clusters of sites can be identified from the NMDS biplot, showing that most of the species occur in most of the sites; in other words, few sites shelter specific communities.
</hidden>

=====3.5. Summary=====

Ordination is a powerful statistical technique for studying the relationships among objects characterized by several descriptors (e.g. sites described by biotic communities, or by environmental variables), but several ordination methods exist. These methods differ mainly in the type of distance preserved, the type of variables allowed, and the dimensions of the ordination space. To better guide your choice of ordination method, the following table identifies the main characteristics of the four ordination methods presented:

{{ :summary_ordination.jpg |}}

In the next workshop, you will see how to identify the relationships between biotic communities and sets of environmental variables describing the same sites, using canonical analyses.
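To make the "distance preserved" idea in the table above concrete, here is a from-scratch sketch of the core PCoA machinery on a toy example: three sites placed at made-up positions 0, 3 and 5 along a single gradient. Gower double-centering of the squared Euclidean distances, followed by an eigen-decomposition (a simple power iteration for the first axis), recovers the sites' centered positions — which is exactly what "preserving the distance" means:

```python
import math

# Toy example: three sites at positions 0, 3 and 5 along one gradient
pos = [0.0, 3.0, 5.0]
n = len(pos)
d2 = [[(a - b) ** 2 for b in pos] for a in pos]   # squared Euclidean distances

# Gower double-centering: b_ij = -0.5*(d2_ij - rowmean_i - rowmean_j + grandmean)
row = [sum(r) / n for r in d2]
grand = sum(row) / n
B = [[-0.5 * (d2[i][j] - row[i] - row[j] + grand) for j in range(n)]
     for i in range(n)]

# Power iteration for the leading eigenpair of B (first PCoA axis)
v = [1.0, 0.0, 0.0]
for _ in range(200):
    w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))

# Site scores on axis 1 recover the centered positions (up to sign)
axis1 = [math.sqrt(lam) * x for x in v]
print([round(x, 3) for x in axis1])  # → [2.667, -0.333, -2.333]
```

The pairwise differences between the recovered scores are exactly the original inter-site distances (3, 5 and 2); functions like pcoa() and cmdscale() perform this same centering and decomposition, returning all axes at once.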