Utilizing novel diversity estimators to quantify multiple dimensions of microbial biodiversity across domains
© Doll et al.; licensee BioMed Central Ltd. 2013
Received: 19 July 2013
Accepted: 7 November 2013
Published: 15 November 2013
Microbial ecologists often employ methods from classical community ecology to analyze microbial community diversity. However, these methods have limitations because microbial communities differ from macro-organismal communities in key ways. This study sought to quantify microbial diversity using methods that are better suited for data spanning multiple domains of life and dimensions of diversity. Diversity profiles are one novel, promising way to analyze microbial datasets. Diversity profiles encompass many other indices, provide effective numbers of diversity (mathematical generalizations of previous indices that better convey the magnitude of differences in diversity), and can incorporate taxa similarity information. To explore whether these profiles change interpretations of microbial datasets, diversity profiles were calculated for four microbial datasets from different environments spanning all domains of life as well as viruses. Both similarity-based profiles that incorporated phylogenetic relatedness and naïve (not similarity-based) profiles were calculated. Simulated datasets were used to examine the robustness of diversity profiles to varying phylogenetic topology and community composition.
Diversity profiles provided insights into microbial datasets that were not detectable with classical univariate diversity metrics. For all datasets analyzed, there were key distinctions between calculations that incorporated phylogenetic diversity as a measure of taxa similarity and naïve calculations. The profiles also provided information about the effects of rare species on diversity calculations. Additionally, diversity profiles were used to examine thousands of simulated microbial communities, showing that similarity-based and naïve diversity profiles only agreed approximately 50% of the time in their classification of which sample was most diverse. This is a strong argument for incorporating similarity information and calculating diversity with a range of emphases on rare and abundant species when quantifying microbial community diversity.
For many datasets, diversity profiles provided a different view of microbial community diversity compared to analyses that did not take into account taxa similarity information, effective diversity, or multiple diversity metrics. These findings are a valuable contribution to data analysis methodology in microbial ecology.
Keywords: Diversity indices; Diversity profiles; Phylogenetic diversity; Effective numbers; Community similarity
With the widespread use of culture-independent, high-throughput sequencing technologies, ecologists have begun to describe the diversity of microbial communities that were previously difficult to detect (e.g., [1–3]). Given the newness of these data types and the fact that the aims of microbial studies are usually similar to those of macro-ecology, microbial ecologists often use methods from classical community ecology to analyze their data. These include Shannon’s H, Berger-Parker Evenness, rarefaction, and ordination.
While the use of established ecological metrics to analyze microbial diversity may sometimes be appropriate, the data produced by ecologists surveying macro-organismal communities differ from data obtained by high-throughput sequencing of microbial communities in three key ways. First, in contrast to plant and animal assemblages, microbial assemblages are typically made up of more than one domain of life, thus necessitating the ability to quantify diversity across very disparate organism types. Second, many classical indices assume ecological communities are composed of unique species. However, traditional biological species concepts do not fit the natural histories of many microbial taxa that routinely undergo non-homologous recombination [8–10] and sometimes lack sexual reproduction. (It is worth noting that the concept of species is widely questioned for macro-organisms as well.) Finally, unlike with macro-organisms, researchers are often unable to directly observe and characterize microbes and their traits in situ [12, 13]. The taxonomic/phylogenetic and functional genes of environmental microbes are now commonly sequenced, but it is still very difficult to link the taxonomy of an individual microbe to the environmental functions it carries out.
These differences create methodological issues when discrete, taxonomic-based metrics are used to analyze microbial community datasets. The culture-independent approaches employed by microbial ecologists usually survey a variety of genes, intergenic spacers, and transcripts, which are typically classified into discrete, taxonomic bins called Operational Taxonomic Units (OTUs). Homologous genetic fragments that share less than a certain percentage of nucleotide polymorphisms are classified as being in the same genus or species (e.g., 97% similarity of the 16S gene is widely used for “species”) [14–16]. This cutoff fails to adequately include the homology (and thus shared ecological function) with which the species concept was originally conceived.
The limitations of applying traditional diversity indices to microbial datasets lacking clear species delineations leave a number of questions: How can we quantify diversity using methods that are better suited for microbial datasets which span multiple domains of life? Does including similarity in our analyses change our interpretation of patterns of microbial diversity? What is the utility of including multiple dimensions of microbial diversity (i.e., taxonomic and phylogenetic) in our analyses?
One promising new way to analyze microbial community diversity and address these questions is through the use of diversity profiles, which were recently developed by Leinster & Cobbold [17, 18]. These profiles are graphs that are used to display effective numbers of diversity (i.e., effective diversities). Effective diversities are mathematical generalizations of previous indices that behave much more intuitively, satisfying a number of desirable mathematical properties that provide meaningful percentage and ratio comparisons. This is useful because many indices that have been traditionally used to describe macro-organismal community diversity and evenness can be quantitatively unintuitive (Inverse Simpson’s Diversity Index, Shannon’s Entropy, Gini-Simpson Index, etc.). For example, a community comprised of 10 hawks and 10 hummingbirds might experience a 50% decrease of both species, resulting in five hawks and five hummingbirds, but this change would not manifest as a 50% decrease in either Simpson Diversity or Shannon Diversity. Due to this, Hill and later Jost formulated effective number diversity metrics, which are simple entropies weighted by an order parameter, q. As the q parameter increases, the relative weight given to rare taxa in diversity index calculations declines. The effective diversity of order zero (q = 0) is equivalent to species richness (the total number of entities), order 1 is the exponential of the Shannon index, and the limit q = ∞ depends only on the relative abundance of the most dominant taxon.
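The ratio behaviour of effective numbers can be illustrated with a minimal Python sketch. It assumes two hypothetical, perfectly even communities of 10 and 20 species (not data from this study) and compares classical indices with naïve effective numbers of order q. The function names are illustrative only.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H = -sum(p_i * ln p_i), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def gini_simpson(p):
    """Gini-Simpson index, 1 - sum(p_i^2)."""
    return 1.0 - sum(x * x for x in p)

def hill_number(p, q):
    """Naive effective number of species of order q (no similarity information)."""
    p = [x for x in p if x > 0]
    if q == 1:
        # The formula is undefined at q = 1; the limit is exp(Shannon entropy).
        return math.exp(shannon_entropy(p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Hypothetical communities: 10 and 20 equally abundant species.
even_10 = [1 / 10] * 10
even_20 = [1 / 20] * 20

for label, p in (("10 species", even_10), ("20 species", even_20)):
    print(label,
          "H =", round(shannon_entropy(p), 3),
          "Gini-Simpson =", round(gini_simpson(p), 3),
          "effective numbers (q = 0, 1, 2) =",
          [round(hill_number(p, q), 3) for q in (0, 1, 2)])
```

Under these assumptions the effective numbers double when the number of equally common species doubles, whereas Shannon entropy and the Gini-Simpson index increase by much smaller factors. This is the sense in which effective numbers support percentage and ratio comparisons.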
Diversity profiles significantly improve these previous calculations of effective diversity by adding community similarity information into diversity calculations, using a similarity matrix, Z. The term “similarity” is used by Leinster & Cobbold to refer to the degree of distance or difference between organisms. The similarity matrix can accommodate genetic similarity, phenotypic similarity, or any other biologically meaningful source of similarity between two or more entities. Incorporating this information into similarity-sensitive calculations of community diversity can greatly alter conclusions regarding diversity levels. For example, when taking into account similarity between taxa, a bird community comprised of one hawk, one hummingbird, and one goose would be more diverse than a community of three distinct hummingbird species. However, if similarity between taxa were not taken into account, these communities would be classified as equally diverse.
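The bird example can be made concrete with a short Python sketch of the similarity-sensitive diversity of Leinster & Cobbold at q = 2. The pairwise similarity values (0.1 between distantly related birds, 0.9 between hummingbird species) are invented for illustration and are not taken from any dataset.

```python
def similarity_diversity(p, Z, q):
    """Similarity-sensitive diversity of order q (valid for q != 1).

    p is a list of relative abundances; Z is a square similarity matrix
    with ones on the diagonal and entries between 0 and 1.
    """
    n = len(p)
    # (Zp)_i: how "ordinary" taxon i is, given the abundance of similar taxa.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    total = sum(p[i] * zp[i] ** (q - 1) for i in range(n) if p[i] > 0)
    return total ** (1.0 / (1.0 - q))

abundances = [1 / 3, 1 / 3, 1 / 3]  # one individual of each of three species

# Naive case: every species is treated as completely distinct.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical similarities: hawk, hummingbird, goose (distantly related).
distant = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]]
# Hypothetical similarities: three closely related hummingbird species.
close = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]

naive = similarity_diversity(abundances, identity, 2)
mixed_birds = similarity_diversity(abundances, distant, 2)
hummingbirds = similarity_diversity(abundances, close, 2)
print(naive, mixed_birds, hummingbirds)
```

With these assumed values the naïve calculation returns three effective species for either community, while the similarity-based calculation ranks the hawk, hummingbird, and goose community well above the three-hummingbird community.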
For microbial communities, which are often characterized by phylogenetic molecular markers, the use of a metric based on the average evolutionary relatedness of a community conveys more information on the uniqueness and potential function of that community than does a discrete, OTU-based approach. Recent work by Chao and colleagues, which expands on research by Faith, develops a measure of effective phylogenetic diversity. Effective phylogenetic diversity scales traditional diversity metrics by the hypothesized shared evolutionary history between taxa. Calculating phylogenetic diversity requires scaling raw taxonomic diversity by the shared evolutionary branches in a phylogeny. These branches can be either time-calibrated (ultrametric) or non-ultrametric. Even if a phylogeny is unavailable, the inclusion of cladistic data can be meaningful, if they accurately model shared ancestry within the study community. If the relative abundances of taxa or sequences are known, branches can also be weighted by abundance to compare the phylogenetic evenness among samples.
Given the differences between microbial and macro-organismal community data, the primary objective of this study was to evaluate the use of diversity profiles when analyzing microbial assemblages to determine whether the inclusion of similarity data (in our case, phylogenetic data) changes our interpretation of experimental and observational data. First, to explore whether diversity profiles alter our interpretation of microbial diversity data, we calculated diversity profiles for four datasets from different environments containing all domains of life and viruses. For comparison purposes, four statistics of pairwise community dissimilarity were calculated for the microbial datasets and plotted as dendrograms. Because diversity profiles can take into account the similarity of taxa and the relative importance of rare versus abundant taxa, we sought to evaluate how incorporating the phylogenetic similarity of taxa provides a different view of microbial diversity compared to traditional taxonomy-based metrics.
Second, we looked for evidence of bias and robustness of phylogenetic diversity profiles using simulated communities. We created numerous communities that varied in their rank abundance distributions, tree topologies, and whether ultrametric or non-ultrametric trees were used. Tree topologies were also simulated to create communities that spanned a large range of tree balances. Tree balance is determined by evolutionary processes, in particular lineage divergence and extinction rates and patterns, which differ greatly among real microbial communities. We wanted to compare how “naïve” diversity profiles (what Leinster & Cobbold term calculations that do not take taxa similarity information into account) and similarity-based diversity profiles are influenced by the topological characteristics (e.g., tree ultrametricity, tree balance) of the sampled communities. We tested the concordance between taxonomic and phylogenetic measures of diversity and composition. We predicted that since OTU-based metrics are discrete transformations of phylogenetic measures, they would generally agree. Simulations (and real data) were also used to test whether this concordance is correlated with aspects of the sampled community including aspects of its phylogenetic topology, richness, and abundance distribution. Our analyses indicate that phylogenetic diversity profiles provide insights into microbial community diversity that would not be discernible with the use of traditional univariate diversity metrics.
Diversity profiles were calculated for experimental, observational, and simulated microbial communities, as presented in detail by Leinster & Cobbold. Briefly, consider a fully sampled community that contains S unique species. The relative abundances of the species are given by $p_1, \ldots, p_S$, such that $p_i \geq 0$ and $\sum_{i=1}^{S} p_i = 1$. Because $p_i \neq 0$, diversity profiles consider only species that are actually present in a community.
Information regarding the similarities between species in the community is taken into account by a matrix $Z = (Z_{ij})$. The matrix has dimensions S × S, and $Z_{ij}$ measures the similarity between the ith and the jth species. Similarity is scored such that $0 \leq Z_{ij} \leq 1$, so that 0 represents complete dissimilarity between two species and 1 represents identical species. When similarity information is not available, or authors do not wish to include it, Z is set to the identity matrix ($Z_{ii} = 1$ and $Z_{ij} = 0$ for $i \neq j$), so that every species is treated as completely distinct from every other, and this results in a naïve calculation.
Diversity profiles were then calculated across the range of a sensitivity parameter, q, for the values of 0 ≤ q ≤ ∞. At low values of q, such as q = 0, calculations of diversity are sensitive to rare taxa, and as q moves toward ∞, diversity calculations become more and more insensitive to the contributions of rare taxa.
For q ≠ 1, ∞, the diversity profile calculation is thus

$$ {}^{q}D^{Z}(p) = \left( \sum_{i=1}^{S} p_i \, (Zp)_i^{\,q-1} \right)^{1/(1-q)}, $$

where $(Zp)_i = \sum_{j=1}^{S} Z_{ij} \, p_j$. The resulting ${}^{q}D^{Z}(p)$ is an effective number, and for certain values of q and Z, ${}^{q}D^{Z}(p)$ corresponds to a commonly used diversity index. For example, for naïve diversity profiles that do not take into account similarity between species, q = 0 is equivalent to species richness, q = 1 (defined as the limit as q approaches 1) is the exponential of Shannon Diversity, q = 2 equals 1/D (inverse Simpson Diversity), and as q moves toward ∞, the value approaches the reciprocal of the Berger-Parker index (the relative abundance of the most abundant species).
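A minimal Python sketch of this calculation is given below, including the limiting cases q = 1 and q = ∞ as defined by Leinster & Cobbold. The abundances and the similarity matrix are hypothetical, the naïve case is taken here to be the identity matrix, and the function name `diversity` is illustrative rather than part of any published package.

```python
import math

def diversity(p, Z, q):
    """Similarity-sensitive diversity of order q for abundances p and matrix Z."""
    n = len(p)
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    present = [i for i in range(n) if p[i] > 0]  # absent taxa are ignored
    if q == 1:
        # Limit as q -> 1: exp(-sum p_i * ln (Zp)_i)
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in present))
    if q == math.inf:
        # Limit as q -> infinity: 1 / max (Zp)_i
        return 1.0 / max(zp[i] for i in present)
    return sum(p[i] * zp[i] ** (q - 1) for i in present) ** (1.0 / (1.0 - q))

# Hypothetical community of four taxa.
p = [0.5, 0.3, 0.15, 0.05]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Hypothetical similarities: taxa 0 and 1 are close relatives.
Z = [[1.0, 0.5, 0.2, 0.2],
     [0.5, 1.0, 0.2, 0.2],
     [0.2, 0.2, 1.0, 0.2],
     [0.2, 0.2, 0.2, 1.0]]

qs = [0, 0.5, 1, 1.5, 2, 3, 4, 5]
naive_profile = [diversity(p, identity, q) for q in qs]
similarity_profile = [diversity(p, Z, q) for q in qs]

for q, d_naive, d_sim in zip(qs, naive_profile, similarity_profile):
    print(q, round(d_naive, 3), round(d_sim, 3))
```

In the naïve profile, the value at q = 0 is the richness (four taxa) and the value at q = 2 is the inverse Simpson index. Both profiles decline as q increases, because rare taxa contribute less at higher orders.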
We calculated diversity profiles for 0 ≤ q ≤ 5. When plotting the profiles, we created larger insets for 1 ≤ q ≤ 2. For a more detailed description of the formulae used to calculate diversity profiles (e.g., their relationship to well-known diversity metrics, their potential benefits in diversity studies, examples of diversity profiles applied to macro-organism community datasets), refer to Leinster & Cobbold’s work.
Environmental microbial datasets
Table 1. Research questions and hypotheses that shaped the design of the four environmental microbial community datasets
Acid mine drainage bacteria and archaea
1) Are environmental (Env) samples more diverse than bioreactor (BR) biofilms?
H1: Bioreactor growth conditions usually have a higher pH than the environment, and the geochemistry of the drainage might differ from growth media. Thus, environmental biofilms are expected to be more diverse than bioreactor-grown biofilms.
2) Is biofilm diversity higher at higher stages of biofilm development?
H2: As biofilms begin to establish, early growth-stage biofilms are expected to be less diverse. As they mature, more organisms join the community, increasing diversity.
Hypersaline lake viruses
1) How do viral diversities change across spatiotemporal replicates?
H1: Viral diversity will be greatest in pools with larger volume (2010A and 2007A samples).
H2: Community dissimilarity will cluster by site, then by year.
Subsurface bacteria
1) Does acetate addition affect the diversity and composition of soil microbial communities?
H1: Acetate addition will stimulate growth of a subset of the microbial community capable of using it as an electron donor.
2) Does vanadium addition affect the diversity and composition of soil microbial communities?
H2: Vanadium addition will reduce the diversity and evenness of the communities and favor those who can both use acetate as an electron donor and vanadium as an electron receptor and/or tolerate vanadium at high concentrations.
Substrate-associated soil fungi
1) How do plant community type (forest vs. grassland), substrate type (wood vs. straw), and time (6 months vs. 18 months) affect saprotrophic fungal assemblages?
H1: Wood substrates will be more diverse than straw substrates, because the wood substrate is more complex and requires a larger group of fungi to decompose it compared with a simpler substrate, such as straw.
H2: Plant community type will have a greater effect on diversity than substrate type or time, because it will determine which fungi can colonize a substrate.
Table 2. Results of the diversity profiles for the four environmental microbial community datasets
Columns: Naïve profiles results; Was this predicted?; Similarity profiles results; Was this predicted?
Acid mine drainage bacteria and archaea
BR less diverse than most Env. samples
BR less diverse than Env. samples
High GS only more diverse than early GS for Env-1
Highest GS (GS 2) is most diverse of all samples
BR more diverse than Env-2, but less than Env-4
Env. samples mostly more diverse than BR
Higher GS is less diverse than lower GS for BR
Highest GS is most diverse of all samples
Hypersaline lake viruses
Diversity greater in larger pools
Yes (2010A for 2/3 genes; not true for Cluster 667)
Diversity greater in combined 2007A samples and/or 2010A
Subsurface bacteria
Background > Acetate > Vanadium + acetate
Background ≈ Vanadium + acetate > Acetate
Substrate-associated soil fungi
At all q: Wood T2 > Wood T1 > Straw T1 > Straw T2; No crossing along q
Straw T2 least diverse at all q
At q = 0, Straw T1 has second lowest diversity, but by q = 3, has highest diversity
Wood T2 > Wood T1 at all q
At all q: Wood T1 > Straw T1 > Wood T2 > Straw T2; No crossing along q
At all q: Straw T1 > Wood T1 > Wood T2 > Straw T2; No crossing along q
Acid mine drainage bacteria and archaea
Total RNA was purified from eight environmental biofilm communities, collected from the Richmond Mine at Iron Mountain, Northern California in 2010 and 2011. In addition, total RNA was extracted from five biofilms grown in laboratory bioreactors using Richmond Mine inoculum in 2009 and 2010. Biofilms were collected or harvested at varying stages of development, ranging from early (GS0) through mid (GS1) to late (GS2), as described previously.
RNA from all 13 samples was converted to cDNA and subjected to Illumina library preparation and sequencing at the University of California Davis. Six environmental samples (from locations Env-1, Env-2, Env-3) and two bioreactor samples were sequenced using the HiSeq 2500 Illumina platform. Two environmental samples (from locations Env-2 and Env-4) and three bioreactor samples were sequenced using the GAIIx Illumina platform. A total of 256 million reads, 75–100 bp long, were mapped to the small subunit (SSU) rRNA Silva database (including Archaea, Bacteria and Eukarya) with a similarity cutoff of 97% identity. SSU rRNA reads were then assembled using Cufflinks, and clustered at 97% identity using uclust. SSU gene sequences were aligned using the SINA aligner webserver, and a phylogenetic tree was constructed using FastTree with options -gtr -nt -gamma. Normalized count values obtained from Cufflinks were used as a measure of abundance of SSU rRNA gene sequences, as described earlier.
Hypersaline lake viruses
As previously described in detail [30, 31], eight surface water samples were collected from two locations (A and B) within hypersaline Lake Tyrrell, Victoria, Australia (~330 g/L NaCl), with dates, locations, time scales, and sample IDs as follows: January 2007 (two samples, site A, two days apart, 2007At1, 2007At2), January 2009 (one sample, site B, 2009B), January 2010 (one sample, site A, 2010A; four samples, site B, each approximately one day apart, 2010Bt1, 2010Bt2, 2010Bt3, 2010Bt4). In the summer, when samples were collected, the lake dries and leaves residual briny “pools” in a few isolated sites. Sites A and B are different pools ~300 m apart.
Post-0.1 μm filtrates were concentrated via tangential flow filtration for the collection of viral particles, followed by DNA extraction and metagenomic sequencing. 454-Titanium technology (~400 bp reads) was used to sequence samples 2010Bt1 and 2010Bt3, and Illumina GAIIx paired-end technology (~100 bp reads) was used to sequence the remaining six samples, for a total of 6.4 billion bp. Previous analyses of these data show that there was no observable difference between the 454-Titanium data and the Illumina data [30–32]. Each sample was assembled separately via Newbler, ABySS, or Velvet. Genes from all contigs >500 bp were predicted with Prodigal, and predicted genes longer than 300 bp were retained and clustered at 95% nucleotide identity, using uclust. Corresponding predicted proteins were separately 1) annotated with InterProScan and 2) clustered at 40% amino acid identity, using uclust. In the absence of a universal marker gene, six viral “OTU groups” were chosen. Three were used for this study: methyltransferases (the most abundant annotation), concanavalin A-like glucanases/lectins (the most abundant annotation likely to be exclusive to viruses), and Cluster 667 (one of the largest protein clusters of unknown function). Proteins for each OTU group were aligned with MUSCLE, and a phylogenetic tree was constructed from the alignments, using FastTree with default parameters.
Subsurface bacteria
DNA was extracted from five sediment samples taken from in situ flow-through columns buried in sampling wells in a shallow, uranium and vanadium-contaminated aquifer in Rifle, Colorado, as described previously. Samples were from background sediment (B), sediment stimulated with carbon and vanadium addition (V1, V2), and sediment stimulated with carbon addition alone (A1, A2). Universal primers and gradient PCR were used to amplify the 16S small subunit ribosomal RNA gene from the organisms sampled.
HiSeq Illumina paired-end technology was used to sequence 2.7 megabases of PCR product at the University of California, Davis. The sequencing consisted of 26,954,412 100-base pair reads. Reads were mapped to reference sequences from the Silva database with the EMIRGE iterative algorithm [41, 42]. The genes were aligned to each other, using the SSU-align software. The alignment was automatically masked with the ssu-mask program. Bacterial OTUs were then clustered at a 97% nucleotide identity cutoff, using usearch. A phylogenetic tree was constructed with the aligned sequences via the FastTree maximum likelihood method with options -gtr -nt and 1000 iterations of the FastTree bootstrap [40, 44].
Substrate-associated soil fungi
The goal of this study was to determine if substrate, space, time or plant community were the major determinants of fungal saprotrophic community composition. Sampling of buried substrates (straw and wood blocks) occurred on Bolinas Ridge on Mount Tamalpais in Marin County, California, USA along four 10 × 10 m blocks in 2007 and 2008, as previously described. Two blocks were in the coastal grassland and two blocks were in the adjacent forest dominated by Pseudotsuga menziesii. The region is characterized as having a Mediterranean climate with a seasonal summer drought. DNA was extracted from 32 bait bags filled with sterile wheat straw and 32 small conifer wood blocks that had been buried (<10 cm) in both the grassland and forest blocks (16 straw samples and 16 wood samples were buried in each plant community type). Half of the straw and wood substrates were buried for six months (time point 1), while the others were buried for 18 months (time point 2).
DNA was purified, and the LSU region (LROR_F/LR5-F) was PCR amplified with 10 bp MID barcodes. 454 pyrosequencing of 1/8 of a plate resulted in a total of 123,117 LSU sequences. Reads were trimmed and filtered using the QIIME software. Non-fungal taxa, sequences that resulted in no BLAST matches, and singletons were removed from the analysis. OTUs were conservatively determined at 95% sequence similarity. FastTree was used for phylogenetic tree building in QIIME. For community analyses, only samples with at least 600 LSU sequence reads were included.
Analysis of datasets
Yule normalized Colless’ I tree balance calculations for the four environmental microbial community datasets
Columns: Dataset; Number of tips; Yule normalized Colless’ I
Acid mine drainage bacteria and archaea
Hypersaline lake viruses: Cluster 667
Substrate-associated soil fungi
In order to compare the diversity calculations produced by diversity profiles to more traditional calculations of community composition for the same datasets, four different statistics of pairwise community dissimilarity were computed (abundance-weighted Jaccard, unweighted Jaccard, abundance-weighted UniFrac, and unweighted UniFrac). The Jaccard dissimilarity is one minus the ratio of the number of taxa shared between two samples to the total number of taxa found across both samples. Pairwise phylogenetic dissimilarity between samples was calculated using the UniFrac method. This metric measures the proportion of unshared phylogenetic branch lengths between two samples. Ward’s minimum-variance method was used to complete hierarchical clustering on the samples based on each dissimilarity metric and plot them as dendrograms. Please see Additional file 1 for these results.
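As an illustration of the taxonomy-based statistics, the following Python sketch computes unweighted and abundance-weighted Jaccard dissimilarities for two hypothetical OTU count tables. The abundance-weighted form shown (one minus the sum of minimum counts over the sum of maximum counts, sometimes called the Ružička dissimilarity) is an assumption, since the exact weighting used in the study is not specified here. UniFrac and Ward clustering are not shown.

```python
def jaccard_dissimilarity(a, b):
    """Unweighted Jaccard: 1 - (shared taxa) / (taxa present in either sample)."""
    present_a = {taxon for taxon, count in a.items() if count > 0}
    present_b = {taxon for taxon, count in b.items() if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)

def weighted_jaccard_dissimilarity(a, b):
    """Abundance-weighted Jaccard (assumed form): 1 - sum(min) / sum(max)."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(max(a.get(t, 0), b.get(t, 0)) for t in taxa)
    return 1.0 - shared / total if total else 0.0

# Hypothetical OTU counts for two samples.
sample_1 = {"otu1": 10, "otu2": 5, "otu3": 0}
sample_2 = {"otu1": 2, "otu3": 8, "otu4": 1}

unweighted = jaccard_dissimilarity(sample_1, sample_2)
weighted = weighted_jaccard_dissimilarity(sample_1, sample_2)
print(unweighted, round(weighted, 4))
```

Here only one of the four observed OTUs is shared, so the unweighted dissimilarity is 0.75. The weighted value is higher because the shared OTU is abundant in only one of the two samples.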
We simulated hundreds of microbial communities in order to better measure the degree to which differences between naïve and similarity-based diversity profiles are influenced by the abundance and phylogenetic distributions of microbial communities. Each simulated community was distributed according to one of four possible commonly fitted rank abundance distributions (Log Normal, Geometric, Log Series, or Uniform) and had a random phylogenetic tree topology. Tree topologies were simulated so as to create communities that spanned a large range of tree imbalances. Tree imbalance was quantified using Yule normalized Colless’ I tree balance statistic. Lastly, all trees were simulated in both ultrametric and non-ultrametric versions to test the effects of branch lengths on the diversity profiles.
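The tree imbalance statistic can be sketched in a few lines of Python. The example below computes the raw (unnormalized) Colless’ I for binary trees written as nested two-element tuples, a representation chosen here purely for illustration. The Yule normalization applied in the study is deliberately omitted.

```python
def colless_index(tree):
    """Return (number of tips, raw Colless' I) for a binary tree.

    A tip is any non-tuple object; an internal node is a (left, right) tuple.
    Colless' I sums, over all internal nodes, the absolute difference between
    the numbers of tips descending from the left and right child.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, i_left = colless_index(left)
    n_right, i_right = colless_index(right)
    return n_left + n_right, i_left + i_right + abs(n_left - n_right)

# Hypothetical four-tip topologies.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(colless_index(balanced))
print(colless_index(caterpillar))
```

A perfectly balanced tree scores zero, and a fully pectinate ("caterpillar") tree reaches the maximum for its number of tips, so larger values indicate greater imbalance.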
To look for systematic differences between naïve and phylogenetic diversity profiles, we repeatedly (100 times) took a random sample of OTUs from two simulated communities and calculated the proportion of times that the naïve and phylogenetic diversity profiles agreed on which random sample was more diverse. We analyzed whether agreement between naïve and similarity-based diversity profiles systematically differed based on numbers of OTUs sampled, whether trees were ultrametric or non-ultrametric, Fisher’s alpha diversity values, or tree imbalance values.
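The agreement test can be sketched as a toy simulation in Python. This is not the authors' simulation code: it draws geometric rank abundance distributions only, replaces simulated phylogenies with random pairwise similarity values, and compares diversities at a single order (q = 2). All function names and parameter values are hypothetical.

```python
import random

def diversity(p, Z, q):
    """Similarity-sensitive diversity of order q (valid for q != 1)."""
    n = len(p)
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    return sum(p[i] * zp[i] ** (q - 1) for i in range(n) if p[i] > 0) ** (1.0 / (1.0 - q))

def random_community(n_taxa, rng):
    """Geometric rank abundance distribution with a random decay parameter."""
    k = rng.uniform(0.2, 0.8)
    raw = [k * (1 - k) ** rank for rank in range(n_taxa)]
    total = sum(raw)
    return [x / total for x in raw]

def random_similarity(n_taxa, rng):
    """Symmetric matrix with ones on the diagonal and random off-diagonal values."""
    Z = [[1.0] * n_taxa for _ in range(n_taxa)]
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            Z[i][j] = Z[j][i] = rng.random()
    return Z

def agreement(n_trials=100, n_taxa=8, q=2, seed=0):
    """Proportion of trials in which naive and similarity-based diversity
    agree on which of two random communities is more diverse."""
    rng = random.Random(seed)
    identity = [[1.0 if i == j else 0.0 for j in range(n_taxa)] for i in range(n_taxa)]
    agree = 0
    for _ in range(n_trials):
        p_a, p_b = random_community(n_taxa, rng), random_community(n_taxa, rng)
        z_a, z_b = random_similarity(n_taxa, rng), random_similarity(n_taxa, rng)
        naive_a_wins = diversity(p_a, identity, q) > diversity(p_b, identity, q)
        similarity_a_wins = diversity(p_a, z_a, q) > diversity(p_b, z_b, q)
        agree += naive_a_wins == similarity_a_wins
    return agree / n_trials

print(agreement())
```

The returned proportion depends on the assumed distributions and similarity values, so it should not be read as a reproduction of the agreement rates reported in this study.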
Results and discussion
Given the potential limitations of applying traditional diversity indices to microbial datasets produced by high-throughput sequencing, we sought to evaluate microbial diversity using methods that might be better suited for microbial taxa that span multiple domains of life and multiple dimensions of diversity (e.g., taxonomic, phylogenetic). The advantages of using diversity profiles are that they encompass a number of other common diversity indices and allow for the incorporation of species similarity information.
Summaries of the four environmental microbial community datasets
Acid mine drainage bacteria and archaea
Total RNA was collected from 8 environmental biofilms and 5 bioreactor biofilms at varying stages of development: early (GS0), mid (GS1), and late (GS2). RNA from all samples was converted to cDNA. 6 environmental and 2 bioreactor samples were sequenced using HiSeq 2500 Illumina. 2 environmental and 3 bioreactor samples were sequenced using GAIIx Illumina.
159 SSU-rRNA sequence fragments were identified in 13 biofilms. The number of reads and SSU-rRNA sequences assembled from the GAIIx and the HiSeq platforms differed greatly; thus the rarefied data from these sequencing methods were analyzed separately (HiSeq: Figure 2, GAIIx: Additional file 1: Figure S1).
Hypersaline lake viruses
8 surface water samples were collected within a hypersaline lake as follows: Jan. 2007 (2 samples, site A, 2 days apart, 2007At1, 2007At2), Jan. 2009 (1 sample, site B, 2009B), Jan. 2010 (1 sample, site A, 2010A; 4 samples, site B, each ~1 day apart, 2010Bt1, 2010Bt2, 2010Bt3, 2010Bt4). 454-Titanium was used to sequence samples 2010Bt1 and 2010Bt3. Illumina GAIIx was used to sequence the remaining 6 samples.
630 methyltransferase genes, 411 concanavalin A-like glucanases/lectins, and 71 putative genes falling under Cluster 667 were assembled from the viral metagenomic reads (Methyltransferase: Additional file 1: Figure S2, Concanavalin: Additional file 1: Figure S3, Cluster 667: Figure 1).
Subsurface bacteria
DNA was extracted from 5 sediment samples taken from in situ flow-through columns buried in sampling wells in a shallow, uranium and vanadium-contaminated aquifer: background sediment (B), sediment stimulated with carbon and vanadium addition (V1, V2), and sediment stimulated with carbon addition alone (A1, A2). HiSeq Illumina was used to sequence 16S SSU-rRNA PCR product.
25,966 OTUs were identified from 5 subsurface samples (Figure 3).
Substrate-associated soil fungi
DNA was extracted from 32 straw bait bags and 32 wood blocks that were buried in grassland and forest (16 straw and 16 wood in each). Half of the substrates were buried for six months (time point 1) and half for 18 months (time point 2). 454-Titanium was used to sequence the PCR amplified LSU region.
Naïve microbial diversity comparisons may vary with the sensitivity parameter, q
Diversity profiles calculated from the experimental and observational datasets provided insights into microbial community diversity that would not be perceivable through the use of a classical univariate diversity metric. The sensitivity of diversity profiles to rarity greatly affected diversity measurements. Richness calculations count all taxa equally, greatly overestimating the contribution of rare taxa to diversity, whereas diversity measurements at high values of q are insensitive to the contribution of rare OTUs. Diversity profiles illustrate this stark contrast and highlight the question of the importance of ultra-rare taxa, the “rare biosphere” of Sogin et al. Previously, these ultra-rare taxa were not included in diversity calculations because they were not detected using older methods of measuring microbial taxa (clone libraries, low depth sequencing, DGGE, etc.). Newer techniques such as deep short-read sequencing have revealed the existence of these taxa, but introduced more bias into older diversity indices such as species richness calculations. The datasets analyzed here demonstrate the importance of rare taxa.
Similarity information may alter microbial diversity calculations
The analyses presented here demonstrate the value of using diversity profiles to incorporate phylogenetic diversity as a measure of taxa similarity into diversity calculations. For all four microbial datasets we analyzed, we saw key distinctions between naïve taxonomic diversity calculations and those that incorporated phylogenetic information. For example, in the subsurface bacterial dataset, naïve measurements of OTU richness for each treatment indicated that the background sample (no treatment) contained the highest diversity for all values of q (Table 2, Figure 3A). Additionally, naïve measurements of both acetate-only samples were more diverse than the samples amended with both acetate and vanadium. These were the expected results as the experiment involved a treatment that should have selected for taxa that could use acetate as a carbon source and vanadium as an energy source (Table 1).
Phylogenetic results, on the other hand, suggested that the vanadium-acetate samples were as diverse as background samples and more diverse than the acetate-only treatments (Table 2, Figure 3B), indicating that perhaps the ability to use vanadium for energy or to tolerate its presence was more phylogenetically widespread than expected. Previous analysis of these data using Faith’s phylogenetic diversity metric found the background sediment to be most phylogenetically diverse, which Figure 3B also shows at q = 0. However, the crossing of the background sample and the acetate and vanadium treated samples when 1 ≤ q ≤ 2 in Figure 3B indicates a greater diversity of common taxa in the treated sites. This indicates that adding abundance information to measures of phylogenetic diversity through the use of diversity profiles can add depth to the interpretation of diversity calculations.
In another example, in forest samples at T = 1 in the substrate-associated soil fungi dataset, wood substrates contained greater naïve taxonomic diversity. This higher diversity on wood substrates compared to straw substrates was hypothesized because the wood substrate is more complex and requires a larger group of fungi to decompose it compared with a simpler substrate, such as straw (Table 1). However, the wood substrates actually contained lower phylogenetic diversity than straw substrates (Additional file 1: Figure S4). These results indicate that the fungal communities growing on wood substrates contained more member taxa that were closely related to each other, because when phylogenetic similarity was included in diversity calculations, the diversity of wood substrate fungal communities decreased.
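The reversal described above (higher naïve diversity but lower phylogenetic diversity) can be illustrated with a minimal sketch of the similarity-sensitive diversity of Leinster & Cobbold, in which a matrix Z holds pairwise taxon similarities between 0 and 1 and the identity matrix recovers the naïve case. The communities, the 0.9 within-clade similarity, and all names below are hypothetical choices for illustration. They are not values from the fungal dataset, and the snippet is not the R code used in this study.

```python
import numpy as np

def similarity_diversity(abundances, Z, q):
    """Similarity-sensitive diversity of order q for similarity matrix Z (diagonal = 1)."""
    p = np.asarray(abundances, dtype=float)
    p = p / p.sum()
    Z = np.asarray(Z, dtype=float)
    keep = p > 0                      # drop undetected taxa
    p, Z = p[keep], Z[np.ix_(keep, keep)]
    zp = Z @ p                        # expected similarity of each taxon to a random individual
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(zp))))
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

# Hypothetical "wood-like" community: 6 equally abundant taxa in 2 tight clades
clade = np.full((3, 3), 0.9)          # assumed similarity between close relatives
np.fill_diagonal(clade, 1.0)
wood_Z = np.block([[clade, np.zeros((3, 3))],
                   [np.zeros((3, 3)), clade]])
wood = np.ones(6)

# Hypothetical "straw-like" community: 4 equally abundant, unrelated taxa
straw_Z = np.eye(4)
straw = np.ones(4)

for q in (0, 1, 2):
    naive = (similarity_diversity(wood, np.eye(6), q),
             similarity_diversity(straw, np.eye(4), q))
    phylo = (similarity_diversity(wood, wood_Z, q),
             similarity_diversity(straw, straw_Z, q))
    print(f"q = {q}: naive wood/straw = {naive[0]:.2f}/{naive[1]:.2f}, "
          f"similarity-based wood/straw = {phylo[0]:.2f}/{phylo[1]:.2f}")
```

Under these assumptions the wood-like community ranks higher with the naïve calculation but lower once similarity is included, because its taxa are concentrated in two closely related clades.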
This demonstrated capacity of diversity profiles to incorporate effective phylogenetic diversity, as well as other measures of similarity between taxa, is particularly meaningful for analyzing microbial diversity data. Macro-organismal ecologists have long been concerned with the interactions between an organism’s traits and aspects of its ecology, such as its niche axes or its role in ecosystem processes [54–57]. Many macro-eukaryote traits, when mapped to phylogenies, show evidence for phylogenetic conservatism [58, 59]. That is, certain traits are shared more often by closely related taxa than would be expected by chance. Even bacteria and archaea show evidence for trait conservatism, despite the role of non-homologous recombination in their evolutionary history [60, 61]. The phylogenetic distribution of a microbial assemblage can thus influence ecosystem processes via differences in the suite of traits present. Phylogenetic trait conservatism in microbes also has practical implications, such as potentially guiding current research in drug discovery or biodegradation [62–64].
Diversity analyses of environmental microbial samples can span all domains of life. It is thus highly desirable to evaluate and critically assess a method that can address the diversity of microbial assemblages effectively across domains, as well as across samples with substantial differences in rare membership, while using a full complement of the information contained in DNA and RNA sequence analysis. As there is no universal marker gene for viruses, there are no robust means of determining viral phylogeny from community sequencing data. Apart from a few groups of well-characterized viruses, it is difficult to characterize viral phylogenetic relationships at all. In our similarity-based profiles, we assume that sequence and, therefore, tree similarity are proxies for phylogenetic similarity. This is reasonable for phylogenetically informative genes, such as the SSU rRNA genes in cellular organisms. However, in the case of genes from the hypersaline virus dataset, and any other viral metagenomic data to which diversity profiles may be applied, this is almost certainly not true. In our application of sequence similarity-based diversity profiles to viruses, we essentially (incorrectly) inferred phylogeny from functional genes that are likely subject to extensive horizontal gene transfer. While these genes are still informative in that they might correspond to the host range and thus the viruses’ community function, we suggest that naïve diversity profiles will be more useful for analyses of viral assemblages than similarity-based profiles, unless a more robust means of determining viral phylogeny is discovered.
Diversity profile simulations
The four microbial datasets analyzed in this study were well-suited to test the application of diversity profiles to microbial data, particularly because they spanned multiple domains of life and dimensions of diversity. However, while treatment replicates were included in the diversity profiles for two of the datasets (hypersaline lake viruses, subsurface bacteria), they were not included for the other two datasets. Therefore, statistical tests were not performed to determine whether the diversity of a group of samples was significantly higher or lower than that of other groups. Additionally, while it is noteworthy that we analyzed four unique microbial datasets within this study, our conclusions about how diversity profiles perform when analyzing microbial data are limited by this relatively small number of datasets.
In order to address these shortcomings of the data, we simulated microbial communities. Simulations allowed us to utilize diversity profiles at the scale of hundreds of simulated microbial datasets with a range of abundance distributions and phylogenetic tree topologies, so that analyses were carried out with greatly increased replication. The major finding from this simulation study is that when we repeatedly took a random sample of OTUs from two simulated communities and compared their diversity, naïve and similarity-based diversity profiles agreed only approximately 50% of the time in their classification of which sample was most diverse (95% confidence interval was 29.8% to 74.6%, mean was 52.2% across all experiments). This finding is a strong argument for analyzing more than taxonomic diversity when quantifying the diversity of microbial communities. The evolutionary or phylogenetic distance among members of microbial consortia is arguably foundational in assessing diversity of these nodes of life that span the domains. It appears that microbial diversity analyses should include similarity information whenever it is available or its omission should be appropriately justified. Such similarity information need not include continuous evolutionary distances, but could be as simple as assigning similarity values based on general taxonomic group.
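A minimal sketch of this kind of comparison is given below. It is not the simulation design used in this study, which varied phylogenetic tree topology. Instead, it uses the simpler option mentioned above of assigning similarity by taxonomic group. The lognormal abundance model, the 0.8 within-group similarity, the single order q, and all function names are assumptions made for illustration only.

```python
import numpy as np

def diversity(p, Z, q):
    """Similarity-sensitive diversity of order q; Z = identity gives the naive value."""
    zp = Z @ p
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(zp))))
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

def group_similarity(groups, within=0.8):
    """Similarity 1 on the diagonal, `within` for taxa sharing a group label, 0 otherwise."""
    groups = np.asarray(groups)
    Z = np.where(groups[:, None] == groups[None, :], within, 0.0)
    np.fill_diagonal(Z, 1.0)
    return Z

def random_community(rng):
    """Draw hypothetical relative abundances and a group-based similarity matrix."""
    n_taxa = rng.integers(20, 60)
    n_groups = rng.integers(2, 15)
    abundances = rng.lognormal(mean=0.0, sigma=1.5, size=n_taxa)
    groups = rng.integers(0, n_groups, size=n_taxa)
    return abundances / abundances.sum(), group_similarity(groups)

def agreement_rate(n_pairs=500, q=1.0, seed=0):
    """Fraction of community pairs ranked the same way by naive and similarity-based diversity."""
    rng = np.random.default_rng(seed)
    agree = 0
    for _ in range(n_pairs):
        pa, Za = random_community(rng)
        pb, Zb = random_community(rng)
        naive_a_higher = diversity(pa, np.eye(len(pa)), q) > diversity(pb, np.eye(len(pb)), q)
        sim_a_higher = diversity(pa, Za, q) > diversity(pb, Zb, q)
        agree += naive_a_higher == sim_a_higher
    return agree / n_pairs

if __name__ == "__main__":
    for q in (0, 1, 2):
        print(f"q = {q}: agreement = {agreement_rate(q=q):.1%}")
```

The agreement rate produced by this sketch depends entirely on the assumed abundance and similarity models, so it should not be compared directly with the value reported above.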
This study explored whether similarity-based diversity profiles can aid our interpretation of microbial diversity. The findings indicate that the use of phylogenetic metrics and effective numbers can provide additional insight into the diversity of microbial communities when combined with naïve analyses that do not take into account similarity information or multiple diversity metrics. The ongoing question of how to best analyze microbial community datasets is paramount to deducing the processes that affect the composition and function of microbial communities. The type of information and metric used to measure biological diversity in any study of microbial diversity is a decision that must be well-justified prior to hypothesis testing instead of being made arbitrarily based solely on which metrics are popularly used by plant and animal ecologists. This justification, in turn, should be based on evidence produced by work, such as this study, that has systematically tested the efficacy and utility of these diversity metrics under a range of situations.
Availability of supporting data
The R code adapted from Leinster & Cobbold and used to calculate diversity profiles is available for download and use at https://gist.github.com/darmitage. The hypersaline lake viruses raw sequencing reads are available in the NCBI BioProject (accession number PRJNA81851, http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA81851). The subsurface bacteria dataset is available at: http://banfieldlab.berkeley.edu/SOM/yelton2012/.
OTUs: Operational taxonomic units
Funding for this project was provided by a National Science Foundation Grant (#1050680) to Sandy Andelman and Julia Parrish: The Dimensions of Biodiversity Distributed Graduate Seminar (DBDGS). HMD was funded by a National Science Foundation Graduate Research Fellowship. Funding for JBE and the hypersaline lake virus study was provided by National Science Foundation award 0626526 and Department of Energy award DE-FG02-07ER64505. JK was funded by a NASA – Harriett G. Jenkins Pre-Doctoral Fellowship and a Mycological Society of America – NAMA Memorial Fellowship. The authors would like to thank S. Andelman, J. Parrish, C. Maranto, R. Sewell Nesteruk, J. Prosser, T. Bruns, and all other DBDGS participants for their input throughout the project.
- Roesch LFW, Fulthorpe RR, Riva A, Casella G, Hadwin AKM, Kent AD, Daroub SH, Camargo FAO, Farmerie WG, Triplett EW: Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 2007, 1: 283-290.
- Fulthorpe RR, Roesch LFW, Riva A, Triplett EW: Distantly sampled soils carry few species in common. ISME J. 2008, 2: 901-910. 10.1038/ismej.2008.55.
- Fierer N, McCain CM, Meir P, Zimmermann M, Rapp JM, Silman MR, Knight R: Microbes do not follow the elevational diversity patterns of plants and animals. Ecology. 2011, 92: 797-804. 10.1890/10-1170.1.
- Shannon CE: A mathematical theory of communication. Bell System Technical Journal. 1948, 27: 379-423. 10.1002/j.1538-7305.1948.tb01338.x.
- Berger WH, Parker FL: Diversity of Planktonic Foraminifera in deep-sea sediments. Science. 1970, 168: 1345-1347. 10.1126/science.168.3937.1345.
- Bent SJ, Forney LJ: The tragedy of the uncommon: understanding limitations in the analysis of microbial diversity. ISME J. 2008, 2: 689-695. 10.1038/ismej.2008.44.
- Hill TCJ, Walsh KA, Harris JA, Moffett BF: Using ecological diversity measures with bacterial communities. FEMS Microbiol Ecol. 2003, 43: 1-11. 10.1111/j.1574-6941.2003.tb01040.x.
- Taylor JW, Jacobson DJ, Kroken S, Kasuga T, Geiser DM, Hibbett DS, Fisher MC: Phylogenetic species recognition and species concepts in fungi. Fung Genet Biol. 2000, 31: 21-32. 10.1006/fgbi.2000.1228.
- Rosselló-Mora R, Amann R: The species concept for prokaryotes. FEMS Microbiol Rev. 2001, 25: 39-67.
- Staley JT: The bacterial species dilemma and the genomic-phylogenetic species concept. Philos Trans R Soc Lond B Biol Sci. 2006, 361: 1899-1909. 10.1098/rstb.2006.1914.
- Mishler BD: Species are not uniquely real biological entities. Contemporary Debates in Philosophy of Biology. Edited by: Ayala FJ, Arp R. 2010, Oxford: Wiley-Blackwell, 110-122.
- Tiedje JM, Asuming-Brempong S, Nüsslein K, Marsh TL, Flynn SJ: Opening the black box of soil microbial diversity. Appl Soil Ecol. 1999, 13: 109-122. 10.1016/S0929-1393(99)00026-8.
- Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J: Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinf. 2007, 8: 299. 10.1186/1471-2105-8-299.
- Horner-Devine MC, Lage M, Hughes JB, Bohannan BJM: A taxa-area relationship for bacteria. Nature. 2004, 432: 750-753. 10.1038/nature03073.
- O’Brien HE, Parrent JL, Jackson JA, Moncalvo J-M, Vilgalys R: Fungal community analysis by large-scale sequencing of environmental samples. Appl Environ Microbiol. 2005, 71: 5544-5550. 10.1128/AEM.71.9.5544-5550.2005.
- Buée M, Reich M, Murat C, Morin E, Nilsson RH, Uroz S, Martin F: 454 Pyrosequencing analyses of forest soils reveal an unexpectedly high fungal diversity. New Phytol. 2009, 184: 449-456. 10.1111/j.1469-8137.2009.03003.x.
- Leinster T, Cobbold CA: Measuring diversity: the importance of species similarity. Ecology. 2012, 93: 477-489. 10.1890/10-2402.1.
- Chao A, Chiu C-H, Jost L: Phylogenetic diversity measures based on Hill numbers. Philos Trans R Soc Lond B Biol Sci. 2010, 365: 3599-3609. 10.1098/rstb.2010.0272.
- Hill MO: Diversity and evenness: a unifying notation and its consequences. Ecology. 1973, 54: 427-432. 10.2307/1934352.
- Jost L: Entropy and diversity. Oikos. 2006, 113: 363-375. 10.1111/j.2006.0030-1299.14714.x.
- Martiny AC, Treseder K, Pusch G: Phylogenetic conservatism of functional traits in microorganisms. ISME J. 2013, 7: 830-838. 10.1038/ismej.2012.160.
- Faith DP: Conservation evaluation and phylogenetic diversity. Biol Conserv. 1992, 61: 1-10. 10.1016/0006-3207(92)91201-3.
- Cadotte MW, Davies TJ, Regetz J, Kembel SW, Cleland E, Oakley TH: Phylogenetic diversity metrics for ecological communities: integrating species richness, abundance, and evolutionary history. Ecol Lett. 2010, 13: 96-105. 10.1111/j.1461-0248.2009.01405.x.
- Mooers AØ, Heard SB: Inferring evolutionary process from phylogenetic tree shape. Q Rev Biol. 1997, 72: 31-54. 10.1086/419657.
- Simpson EH: Measurement of diversity. Nature. 1949, 163: 688. 10.1038/163688a0.
- Haegeman B, Hamelin J, Moriarty J, Neal P, Dushoff J, Weitz JS: Robust estimation of microbial diversity in theory and in practice. ISME J. 2013, doi:10.1038/ismej.2013.10
- Goltsman D: PhD thesis. Community Genomic, Proteomic, and Transcriptomic Analyses of Acid Mine Drainage Biofilm Communities. 2013, Berkeley, California, USA: University of California Berkeley, Environmental Science, Policy and Management Department
- Roberts A, Pimentel H, Trapnell C, Pachter L: Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinf. 2011, 27: 2325-2329. 10.1093/bioinformatics/btr355.
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinf. 2010, 26: 2460-2461. 10.1093/bioinformatics/btq461.
- Emerson JB, Thomas BC, Andrade K, Allen EE, Heidelberg KB, Banfield JF: Dynamic viral populations in hypersaline systems as revealed by metagenomic assembly. Appl Environ Microbiol. 2012, 78: 6309-6320. 10.1128/AEM.01212-12.
- Emerson JB, Andrade K, Thomas BC, Norman A, Allen EE, Heidelberg KB, Banfield JF: Virus-host and CRISPR dynamics in archaea-dominated Hypersaline Lake Tyrrell, Victoria, Australia. Archaea. 2013, 2013: 370871.
- Emerson JB, Thomas BC, Andrade K, Heidelberg KB, Banfield JF: New approaches indicate constant viral diversity despite shifts in assemblage structure in an Australian hypersaline lake. Appl Environ Microbiol. In Press
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19: 1117-1123. 10.1101/gr.089532.108.
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.
- Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 2010, 11: 119. 10.1186/1471-2105-11-119.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Res. 2005, 33: W116-W120. 10.1093/nar/gki442.
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.
- Price MN, Dehal PS, Arkin AP: FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol. 2009, 26: 1641-1650. 10.1093/molbev/msp077.
- Yelton AP, Williams KH, Fournelle J, Wrighton KC, Handley KM, Banfield JF: Vanadate and acetate biostimulation of contaminated sediments decreases diversity, selects for specific taxa, and decreases aqueous V(5+) concentration. Environ Sci Technol. 2013, 47: 6500-6509.
- Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF: EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol. 2011, 12: R44. 10.1186/gb-2011-12-5-r44.
- Miller CS, Handley KM, Wrighton KC, Frischkorn KR, Thomas BC, Banfield JF: Short-Read assembly of full-length 16S Amplicons reveals bacterial diversity in subsurface sediments. PLoS ONE. 2013, 8 (2): e56018. 10.1371/journal.pone.0056018.
- Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinf. 2009, 25: 1335-1337. 10.1093/bioinformatics/btp157.
- Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010, 5: e9490. 10.1371/journal.pone.0009490.
- Kerekes J: PhD thesis. Species Diversity, Ecology and Laccase Gene Diversity of Saprotrophic Fungi across Different Plant Community Types. 2011, Berkeley, California, USA: University of California, Berkeley, Department of Plant and Microbial Biology
- Amend AS, Seifert K, Samson R, Bruns TD: Indoor fungal composition is geographically patterned and more diverse in temperate zones than in the tropics. Proc Natl Acad Sci USA. 2010, 107: 13748-13753. 10.1073/pnas.1000454107.
- Tedersoo L, Jairus T, Horton BM, Abarenkov K, Suvi T, Saar I, Kõljalg U: Strong host preference of ectomycorrhizal fungi in a Tasmanian wet sclerophyll forest as revealed by DNA barcoding and taxon-specific primers. New Phytol. 2008, 180: 479-490. 10.1111/j.1469-8137.2008.02561.x.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7: 335-336. 10.1038/nmeth.f.303.
- Colless DH: Review of Phylogenetics: the theory and practice of Phylogenetic systematics. Syst Zool. 1982, 31: 100-104. 10.2307/2413420.
- Jaccard P: Distribution de la flore alpine dans le basin de dranses et dans quelques regions voisines. Bull Société Vaudoise Sci Natur. 1901, 37: 241-272.
- Lozupone C, Knight R: UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005, 71: 8228-8235.
- Ward JH: Hierarchical Grouping to Optimize an Objective Function. J Am Stat Assoc. 1963, 58: 236-244. 10.1080/01621459.1963.10500845.
- Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA. 2006, 103: 12115-12120. 10.1073/pnas.0605127103.
- Hooper DU, Vitousek PM: The effects of plant composition and diversity on ecosystem processes. Science. 1997, 277: 1302-1305. 10.1126/science.277.5330.1302.
- Tilman D, Lehman CL, Thomson KT: Plant diversity and ecosystem productivity: theoretical considerations. Proc Natl Acad Sci USA. 1997, 94: 1857-1861. 10.1073/pnas.94.5.1857.
- Silvertown J: Plant coexistence and the niche. Trends Ecol Evol. 2004, 19: 605-611. 10.1016/j.tree.2004.09.003.
- Ackerly DD, Cornwell WK: A trait-based approach to community assembly: partitioning of species trait values into within- and among-community components. Ecol Lett. 2007, 10: 135-145. 10.1111/j.1461-0248.2006.01006.x.
- Chazdon RL, Careaga S, Webb C, Vargas O: Community and phylogenetic structure of reproductive traits of woody species in wet tropical forests. Ecol Monogr. 2003, 73: 331-348. 10.1890/02-4037.
- Brumfield RT, Tello JG, Cheviron ZA, Carling MD, Crochet N, Rosenberg KV: Phylogenetic conservatism and antiquity of a tropical specialization: Army-ant-following in the typical antbirds (Thamnophilidae). Mol Phylogenet Evol. 2007, 45: 1-13. 10.1016/j.ympev.2007.07.019.
- Placella SA, Brodie EL, Firestone MK: Rainfall-induced carbon dioxide pulses result from sequential resuscitation of phylogenetically clustered microbial groups. Proc Natl Acad Sci USA. 2012, 109: 10931-10936. 10.1073/pnas.1204306109.
- Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, Clemente JC, Burkepile DE, Vega Thurber RL, Knight R, Beiko RG, Huttenhower C: Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol. 2013, 31: 814-821. 10.1038/nbt.2676.
- Galvão TC, Mohn WW, de Lorenzo V: Exploring the microbial biodegradation and biotransformation gene pool. Trends Biotechnol. 2005, 23: 497-506. 10.1016/j.tibtech.2005.08.002.
- Ferrer M, Beloqui A, Timmis KN, Golyshin PN: Metagenomics for mining new genetic resources of microbial communities. J Mol Microbiol Biotechnol. 2009, 16: 109-123. 10.1159/000142898.
- Singh BK, Macdonald CA: Drug discovery from uncultivable microorganisms. Drug Discov Today. 2010, 15: 792-799. 10.1016/j.drudis.2010.07.002.
- Blum MG, François O: Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst Biol. 2006, 55: 685-691. 10.1080/10635150600889625.
- Fisher RA, Corbet AS, Williams CB: The relation between the number of species and the number of individuals in a random sample of an animal population. J Anim Ecol. 1943, 12: 42-58. 10.2307/1411.
- Magurran AE, Henderson PA: Explaining the excess of rare species in natural species abundance distributions. Nature. 2003, 422: 714-716. 10.1038/nature01547.
- Sunagawa S, Woodley CM, Medina M: Threatened corals provide underexplored microbial habitats. PLoS ONE. 2010, 5: e9554. 10.1371/journal.pone.0009554.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.