
INVITED REVIEWS AND SYNTHESES

Application of multivariate statistical techniques in microbial ecology

O. PALIY and V. SHANKAR

Department of Biochemistry and Molecular Biology, Boonshoft School of Medicine, Wright State University, 260 Diggs Laboratory, 3640 Col. Glenn Hwy, Dayton, OH 45435, USA

Abstract

Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large-scale ecological data sets. The effect has been particularly noticeable in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure.

Keywords: microbial communities, microbial ecology, microbiota, multivariate, ordination, statistics

Received 1 May 2015; revision received 15 December 2015; accepted 22 December 2015

Introduction

The past decade has seen significant progress in ecological research due in part to the advent and increased utilization of novel high-throughput experimental technologies. With approaches such as high-throughput next-generation sequencing, oligonucleotide and DNA microarrays, high-sensitivity mass spectrometry, and nuclear magnetic resonance analysis, researchers are able to generate massive amounts of molecular data even in a single experiment. These methods have had an especially powerful effect on the field of microbial ecology, where we can now apply DNA, RNA, protein, and metabolite identification and measurement techniques to the whole microbial community without a need to separate or isolate individual community members. These technologies have provided the foundation for many groundbreaking advances in our understanding of microbial community organization, function, and interactions within a community and with other organisms. Examples include the assessment of marine microbiota response to the Deepwater Horizon oil spill (Mason et al. 2014), functional analysis of microbiomes in different soils (Fierer et al. 2012), identification of enterotypes in the human intestinal microbiota (Arumugam et al. 2011), detection of seasonal fluctuations in oceanic bacterioplankton (Gilbert et al. 2012) and discovery of the loss of gut microbial interactions in human gastrointestinal diseases (Shankar et al. 2013, 2015).

Correspondence: Oleg Paliy, Fax: (1) 937 775 3730; E-mail: oleg.paliy@wright.edu

© 2016 John Wiley & Sons Ltd

Molecular Ecology (2016) 25, 1032–1057 doi: 10.1111/mec.13536

Because a large amount of data is generated even in a single high-throughput experiment, powerful statistical tools are needed to examine and interpret the results. Because many different variables such as species, genes, proteins or metabolites are measured in each sample or site, the analysis of these data sets is generally performed using multivariate statistics (definitions of most commonly used terms are provided in Box 1). Indeed, many types of standard multivariate statistical analyses have been employed for the assessment of such high-throughput data sets, and novel approaches are also being developed. The multitude of possible statistical choices makes it a daunting task for an investigator not experienced with these tools to pick a good technique to use. In

Box 1. Terminology used in multivariate statistical analyses

Biplot: a two-dimensional diagram of the ordination analysis output that simultaneously shows variable positioning and object positioning in a reduced dimensionality space.

Canonical analysis: a general term for a statistical technique that aims to find relationship(s) between sets of variables by searching for latent (hidden) gradients that associate these sets of variables.

Constrained and unconstrained ordination: the constrained multivariate techniques attempt to explain the variation in a set of response variables (e.g. species abundance) by the variation in a set of explanatory variables (e.g. environmental parameters) measured in the same set of objects (e.g. samples or sites). The matrix of explanatory variables is said to constrain the multivariate analysis of the data set of response variables, and the output of constrained analysis typically displays only the variation that can be explained by constraining variables. In contrast, the unconstrained multivariate techniques only examine the data set of response variables, and the output of unconstrained analysis reflects overall variance in the data.

Gradient analysis: this term describes the study of distribution of variable values in the data set along gradients. As the goal of ordination analysis is to order objects along the main gradients of dispersion in the data set, both of these terms can be used synonymously. Two different types of gradient analysis are usually recognized:

Indirect gradient analysis utilizes only one data set of measured variables. This term is synonymous with unconstrained ordination.

Direct gradient analysis in contrast uses additionally available data to guide (direct) the analysis of the data set of measured variables. It produces axes that are constrained to be a function of explanatory variables. This term is synonymous with constrained ordination.
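The difference between unconstrained and constrained ordination can be illustrated with a small numerical sketch. The example below uses the idea behind one common constrained method, redundancy analysis, which can be viewed as a principal component analysis of the values fitted by regressing the response variables on the explanatory variables. The simulated matrices `E` (explanatory) and `Y` (response), their sizes and the noise level are assumptions made purely for illustration and do not come from any study discussed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40  # number of objects (samples or sites), hypothetical

# Hypothetical explanatory matrix: 2 environmental variables per object
E = rng.normal(size=(n, 2))
# Hypothetical response matrix: 4 "species", partly driven by E plus noise
B = rng.normal(size=(2, 4))
Y = E @ B + rng.normal(scale=0.5, size=(n, 4))

# Centre both matrices column by column
Yc = Y - Y.mean(axis=0)
Ec = E - E.mean(axis=0)

# Unconstrained view: total variance of the response data (via SVD)
_, s_all, _ = np.linalg.svd(Yc, full_matrices=False)
total_var = (s_all ** 2).sum() / (n - 1)

# Constrained view: regress responses on explanatory variables,
# then decompose only the fitted (explained) part
coef, *_ = np.linalg.lstsq(Ec, Yc, rcond=None)
Y_fit = Ec @ coef
_, s_fit, _ = np.linalg.svd(Y_fit, full_matrices=False)
constrained_var = (s_fit ** 2).sum() / (n - 1)

print("share of variance explained by constraints:",
      round(constrained_var / total_var, 3))
print("number of constrained axes:", int((s_fit > 1e-8).sum()))
```

With two explanatory variables, at most two constrained axes carry variance, which is one way to see that the constrained output displays only the variation attributable to the explanatory data.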

Data transformation: this term describes the process of applying a mathematical function to the full set of measured values in a systematic way. See Box 2 for a description of common data transformations.

Distance: it quantifies the dissimilarity between objects in a specific coordinate system. Objects that are similar have small in-between distance; objects that are different have large distance between them. For normalized distances, Distance (X,Y) can be related to Similarity (X,Y) as either $D = 1 - S$, $D = \sqrt{1 - S}$ or $D = \sqrt{1 - S^2}$. Many different mathematical functions can be used to calculate distances among objects or variables (Legendre & Legendre 2012). The choice of distance measure has a profound effect on the output of multivariate analysis and should be chosen based on the characteristics of the studied ecological data set.
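As a minimal sketch of the relationship between normalized similarity and distance, the example below converts a similarity S in [0, 1] to a distance using D = 1 - S, D = sqrt(1 - S) and D = sqrt(1 - S^2). The Jaccard similarity is used here only as one convenient example of a normalized similarity; the function names, the method labels and the two taxon lists are hypothetical and chosen for illustration.

```python
import math


def jaccard_similarity(a, b):
    """Jaccard similarity of two presence/absence sets (1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty samples are identical
    return len(a & b) / len(a | b)


def similarity_to_distance(s, method="linear"):
    """Convert a normalized similarity s in [0, 1] to a distance."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if method == "linear":      # D = 1 - S
        return 1.0 - s
    if method == "sqrt":        # D = sqrt(1 - S)
        return math.sqrt(1.0 - s)
    if method == "sqrt_sq":     # D = sqrt(1 - S^2)
        return math.sqrt(1.0 - s * s)
    raise ValueError(f"unknown method: {method}")


# Hypothetical presence/absence data for two sites
site_a = {"Bacteroides", "Prevotella", "Roseburia"}
site_b = {"Bacteroides", "Roseburia", "Escherichia"}

s = jaccard_similarity(site_a, site_b)  # 2 shared taxa out of 4 in total
for method in ("linear", "sqrt", "sqrt_sq"):
    print(method, round(similarity_to_distance(s, method), 3))
```

The three conversions give different distances for the same similarity, which is one reason the choice of measure can change the output of a downstream analysis.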

Eigenvector and eigenvalue: in ordination methods such as PCA, eigenvectors represent the gradients of data set dispersion in ordination space and are used as ordination axes, and eigenvalues designate the strength of each gradient.
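The role of eigenvectors and eigenvalues in PCA can be sketched numerically. The example below, which assumes a small simulated data set (the matrix `X` and its dimensions are hypothetical), computes the covariance matrix of centred data, extracts its eigenvectors to use as ordination axes, and reports the share of variance carried by each axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 objects x 3 variables, the first two correlated
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    2 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

Xc = X - X.mean(axis=0)          # centre each variable
cov = np.cov(Xc, rowvar=False)   # 3 x 3 covariance matrix

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # strongest gradient first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs              # object coordinates on the ordination axes
explained = eigvals / eigvals.sum()
print("proportion of variance per axis:", np.round(explained, 3))
```

In this sketch the variance of the object scores along each axis equals the corresponding eigenvalue, which is the sense in which an eigenvalue measures the strength of a gradient.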

Explanatory/predictor/independent variable: all these terms describe the type of variables that explain (predict) other variables.

Ordination: a general term that can be described as "arranging objects in order" (Goodall 1954). The goal of ordination analysis is to generate a reduced number of new synthetic axes that are used to display the distribution of objects along the main gradients in the data set.

Orthogonal: mathematical term that means "perpendicular" or "at right angle to". Due to this feature, the orthogonal variables are linearly independent (Rodgers et al. 1984).

Randomization tests: a group of related tests of statistical significance that are based on the randomization of measured data values to assess whether the value of a calculated metric (such as species diversity) can be obtained by chance.
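A randomization test can be sketched in a few lines. The example below assumes two groups of hypothetical diversity values and asks how often a random reassignment of values to groups yields a difference in group means at least as large as the observed one. The function name `permutation_test`, the number of permutations and the data are illustrative assumptions, not values from any study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=9999, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical diversity index values for two sets of samples
group_1 = [3.1, 3.4, 2.9, 3.6, 3.3, 3.5]
group_2 = [2.4, 2.8, 2.2, 2.7, 2.5, 2.9]

observed, p_value = permutation_test(group_1, group_2)
print("observed difference in means:", round(observed, 3))
print("permutation p-value:", round(p_value, 4))
```

A small p-value indicates that a difference this large is rarely produced by chance reshuffling alone. The same scheme applies to other metrics by swapping the statistic computed inside the loop.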

Response/dependent variable: both of these terms describe the main measured variables in a study. Many multivariate analyses examine the relationship between these variables and other variables that are said to predict or explain these measured variables (such as environmental factors). Thus, the measured variables are "dependent on" or "responding to" the values of the explanatory/predictor variable(s).

Triplot: a two-dimensional diagram of the multivariate analysis output that in addition to response variables and objects shown on biplot also displays explanatory variables.

Unimodal distribution: a distribution with one peak on the variable density plot.

Variance, variability and variation: the differences in the use of these three related terms can be described by the following statement: Variance is a statistical measure of data variation and dispersion, which describe the amount of variability in the data set.


this review, we provide short descriptions of the most frequently used multivariate statistical techniques, we compare different methods, and we present examples of how these approaches have been utilized in recent studies. This text is not meant to serve as an exhaustive overview of existing methods, as the landscape of currently available multivariate analyses is vast and growing. Rather, our goal is to familiarize the reader with the most c