Analysis and comparison of the pan-genomic properties of sixteen well-characterized bacterial genera

Trost, Brett; Haakensen, Monique; Pittet, Vanessa; Ziola, Barry; Kusalik, Anthony

doi:10.1186/1471-2180-10-258

Research article
Open access
Published: 13 October 2010

BMC Microbiology volume 10, Article number: 258 (2010)

Abstract

Background

The increasing availability of whole genome sequences allows the gene or protein content of different organisms to be compared, leading to burgeoning interest in the relatively new subfield of pan-genomics. However, while several studies have analyzed protein content relationships in specific groups of bacteria, there has yet to be a study that provides a general characterization of protein content relationships in a broad range of bacteria.

Results

A variation on reciprocal BLAST hits was used to infer relationships among proteins in several groups of bacteria, and data regarding protein conservation and uniqueness in different bacterial genera are reported in terms of "core proteomes", "unique proteomes", and "singlets". We also analyzed the relationship between protein content similarity and the percent identity of the 16S rRNA gene in pairs of bacterial isolates from the same genus, and found that the strength of this relationship varied substantially depending on the genus, perhaps reflecting different rates of genome evolution and/or horizontal gene transfer. Finally, core proteomes and unique proteomes were used to study the proteomic cohesiveness of several bacterial species, revealing that some bacterial species had little cohesiveness in their protein content, with some having fewer proteins unique to that species than randomly-chosen sets of isolates from the same genus.

Conclusions

The results described in this study aid our understanding of protein content relationships in different bacterial groups, allowing us to make further inferences regarding genome-environment relationships, genome evolution, and the soundness of existing taxonomic classifications.

Background

Historically, taxonomic analyses have been performed using a diverse and often arbitrary selection of morphological and phenotypic characteristics. Today, these characteristics are generally considered unsuitable for generating reliable and consistent taxonomies for prokaryotes, as there is no rational basis for choosing which morphological or phenotypic properties should be examined. Moreover, it is doubtful that individual phenotypes or small collections of phenotypes can consistently and correctly represent evolutionary relationships [1]. The unsuitability of phenotypic traits, along with the advent of DNA sequencing, has led to 16S rRNA gene sequence comparisons becoming the standard technique for taxonomic analyses [1], although it has been argued that the cpn60 gene allows for greater evolutionary discrimination [2]. Over time, the trend has moved toward using a greater number of genes to infer phylogenetic relationships--in part due to the increasing ease and reduced cost associated with DNA sequencing, but also due to doubts about the accuracy of evolutionary relationships inferred from a single gene. Phylogeny can be inferred from a number of universally conserved housekeeping genes using multi-locus sequence analysis (MLSA) [3, 4].

While 16S rRNA gene sequence analysis and MLSA have proven to be effective tools for phylogenetics, a major deficiency inherent in these techniques is that only a small amount of information is used to represent an entire organism. This practice has largely been accepted due to the time and cost of genome sequencing. However, recent improvements in sequencing technology have substantially reduced the resources necessary to sequence a genome, and there are now numerous genome sequences available in publicly accessible databases. The accelerating pace of genome sequencing provides the opportunity to explore the use of entire genomes in analyzing evolutionary relationships.

Numerous approaches to determining relatedness via whole genomes have been devised (reviewed in [5]), with examples being dinucleotide frequencies [6], G + C content [7], codon usage [8, 9], gene order [10], and oligopeptide composition [11, 12]. Yet another approach to whole-genome phylogenetics is the comparison of gene content. This technique works by predicting orthologues in pairs of organisms and then assigning a "distance" between each pair based on the putative number of shared genes. This technique was originally proposed by Snel et al. [13] and was subsequently revisited with larger groups of organisms [14, 15]. However, horizontal gene transfer is a major complicating factor in using these methods to infer evolutionary relationships in prokaryotes [16].

Recently, a new subfield called pan-genomics has become established as a framework for exploring the genomic relatedness of bacterial groups. Unlike the studies cited in the previous paragraph, pan-genomics does not involve inferring phylogeny from genome content; rather, it encompasses broad-based characterizations of gene- or protein-content relationships in a given group of organisms. Pan-genomics was introduced by Tettelin et al. [17], who sequenced several strains of the bacterium Streptococcus agalactiae and then analyzed the genomic diversity of those isolates in terms of a "core genome" (genes present in all isolates) and a "dispensable genome" (genes not present in all isolates). Two more examples of pan-genomic analyses are those done for Vibrio [18] and for Escherichia coli [19]. Review articles summarizing concepts and developments in microbial pan-genomics are also available [20, 21].

Despite the increasing interest in pan-genomics, we do not know of a study providing a general characterization and comparison of gene/protein content relationships in many different bacterial groups. To fill this gap, this study reports the results of several different analyses that compare the protein content of different bacteria. When beginning this study, we were faced with the choice of comparing either gene content or protein content. Both have been examined in previous work; for example, Tettelin et al. [17] studied both gene sets and predicted protein sets, whereas Rasko et al. [19] used predicted proteins exclusively. For two reasons, we chose to explore protein content rather than gene content. First, since protein content is more directly related to function and physiology than gene content, the use of protein content was more appropriate for relating pan-genomic properties to factors like habitats, environmental niches, and selective pressures. Second, since we perform comparisons across diverse genera, the lower level of variability in protein sequences compared to gene sequences (due to the degeneracy of the genetic code) may provide an advantage when using BLAST to compare the more divergent organisms. The popularity of tools such as tblastx [22, 23] also speaks to the desirability of comparing gene sequences via the corresponding proteins. While we expect the use of gene content versus protein content to yield largely similar results, the reader should be aware that there could be some differences.

This paper communicates the results of three major analyses, with the first two involving protein content comparisons at the genus level, and the third involving comparisons at the species level. In the first analysis, we quantify and analyze the number of proteins (i.e. orthologues) found in all members of a given bacterial genus (its "core proteome"), the number of proteins found in one genus, but in none of the other genera used in this study (its "unique proteome"), and the number of proteins found in only a single isolate of a genus ("singlets"). The second analysis examines the relationship between protein content similarity and 16S rRNA gene percent identity in pairs of bacterial isolates from the same genus. Finally, the third analysis examines several bacterial species to determine whether their proteomes are more cohesive than randomly-selected sets of isolates from the same genus. For the third analysis, we use an operational definition of "cohesion". Specifically, we say that a bacterial species is proteomically cohesive if it satisfies two criteria: first, that its core proteome is larger than those of randomly-selected groups of isolates from the same genus; and second, that it contains more proteins unique to all members of that species than there are proteins unique to randomly-selected groups of isolates from the same genus.

Results and Discussion

Proteomes used

Sixteen genera met the requirements outlined in the Methods section, comprising a total of 211 isolates from 106 species. Table 1 shows the number of isolates and species used for each genus, while additional file 1 provides more detailed information about each individual isolate (i.e. genus, species, strain/isolate identity, proteome size, and genome size).

Table 1 Bacteria used in this study

Orthologue detection

To detect orthologues, we used a variation on the reciprocal BLAST hits (RBH) method. Specifically, for two proteins to be declared orthologues, they had to be each other's best BLAST hit, and both BLAST hits had to attain E-values less than a defined threshold. The Methods section describes an analytical method for choosing this E-value threshold, as well as an empirical technique for estimating the degree to which the chosen E-value threshold will affect our analyses. In this section, we apply those techniques to choose an appropriate E-value threshold for the comparisons done in this study.
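The criterion described above can be illustrated with a minimal sketch. This is not the implementation used in this study (the exact variation on RBH is described in the Methods section); it assumes that BLAST results have already been parsed into hypothetical (query, subject, E-value) tuples, and the function names and example data are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical hit record: (query_id, subject_id, e_value)
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit], threshold: float) -> Dict[str, str]:
    """Map each query to its lowest-E-value subject, ignoring hits at or above the threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, e_value in hits:
        if e_value >= threshold:
            continue  # not significant enough to count as a match
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit],
                         threshold: float = 1e-13) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs that are each other's best hit, with both hits below the threshold."""
    best_ab = best_hits(a_vs_b, threshold)
    best_ba = best_hits(b_vs_a, threshold)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Invented example data for two small proteomes A and B
a_vs_b = [("a1", "b1", 1e-50), ("a1", "b2", 1e-20),
          ("a2", "b2", 1e-5), ("a3", "b3", 1e-40)]
b_vs_a = [("b1", "a1", 1e-48), ("b2", "a1", 1e-18),
          ("b3", "a2", 1e-30), ("b3", "a3", 1e-35)]

orthologues = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(orthologues))
```

In this example, a2 has no orthologue because its only hit fails the E-value test, and b2 has none because its best hit (a1) prefers b1.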

Analytical method

In the Methods section, we show that an appropriate E-value threshold can be chosen using the equation $E = M / (n_{p}^{2} n_{o}^{2})$, where E is the E-value threshold, M is the desired value for the expected number of spurious matches, n_p is the number of proteins in a given organism's proteome, and n_o is the number of organisms involved in a comparison. In choosing a threshold for the comparisons used in this study, we noted that the bacterial isolate examined in this paper with the largest genome, Burkholderia xenovorans strain LB400, encodes 8951 ≈ 10⁴ proteins. Thus, a conservative value for n_p would be 10⁴. Furthermore, the greatest number of organisms used in a single comparison was n_o = 211 (when finding proteins unique to a given genus). Finally, we chose M = 1, since the results of a given comparison would be only negligibly affected by a single spurious match. Thus, the chosen E-value threshold was E = 1/((10⁴)² × 211²) ≈ 10^-13, meaning that two proteins were considered orthologues if the matches between the two proteins (in both directions) had E-values less than 10^-13, in addition to each being the other's best BLAST hit.
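The arithmetic behind this threshold can be checked with a few lines of code. The function name below is hypothetical; the formula and the input values (M = 1, n_p = 10⁴, n_o = 211) are those given in the text.

```python
def evalue_threshold(m: float, n_p: float, n_o: int) -> float:
    """E = M / (n_p^2 * n_o^2), the threshold formula quoted from the Methods section."""
    return m / (n_p ** 2 * n_o ** 2)


# Values used in this study: M = 1, n_p = 10^4, n_o = 211
threshold = evalue_threshold(1, 1e4, 211)
print(f"{threshold:.2e}")  # on the order of 10^-13
```

The exact value is slightly above 2 × 10^-13, which the text rounds to the nearest power of ten.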

Empirical method

To estimate the potential impact of the choice of E-value threshold on our analyses, three pairs of proteomes were arbitrarily selected in each of three categories: isolates from the same species; isolates from different species but the same genus; and isolates from different genera. These three categories were selected as they span the range of relatedness encountered in our analysis. For each pair of proteomes, the orthologue detection procedure described in the Methods section was used to determine the number of proteins in the first proteome, but not in the second proteome, over the range of E-value thresholds 10⁰, 10^-1,...,10^-180. Figure 1 shows the number of unique proteins for each comparison for each E-value threshold used.
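A threshold sweep of this kind can be sketched as follows. The sketch relies on an assumption that should be stated plainly: the identity of a protein's best hit does not depend on the threshold, so only the significance test needs to be repeated for each threshold. The input format (a hypothetical mapping from each protein of the first proteome to the forward and reverse E-values of its reciprocal best hit, or None if it has none) and all names are illustrative, not taken from the Methods section.

```python
from typing import Dict, List, Optional, Tuple


def unique_counts(rbh_evalues: Dict[str, Optional[Tuple[float, float]]],
                  exponents: List[int]) -> Dict[int, int]:
    """Count proteins without an accepted orthologue at each threshold 10**exponent."""
    counts: Dict[int, int] = {}
    for exponent in exponents:
        threshold = 10.0 ** exponent
        counts[exponent] = sum(
            1
            for pair in rbh_evalues.values()
            # Unique if there is no reciprocal best hit, or if either
            # direction fails the E-value test at this threshold.
            if pair is None or max(pair) >= threshold
        )
    return counts


# Invented example: four proteins from the first proteome
rbh_evalues = {
    "p1": (1e-100, 1e-90),
    "p2": (1e-20, 1e-25),
    "p3": (1e-5, 1e-8),
    "p4": None,  # no reciprocal best hit at all
}

counts = unique_counts(rbh_evalues, [0, -13, -50, -180])
print(counts)
```

As the threshold becomes more stringent (a more negative exponent), fewer matches are accepted and the number of unique proteins can only stay the same or grow, which is the direction of the trend reported for Figure 1.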

For all three comparisons in all three categories, the number of unique proteins differed substantially depending on the E-value threshold chosen. For example, the number of proteins found in the proteome of Pseudomonas putida strain GB-1 but not in that of P. putida strain KT2440 (see Figure 1A) ranged from 3882 when using an E-value threshold of 10^-180 to 1075 when using a threshold of 10⁰. The plot for P. putida can be divided into two distinct sections. The first section ranged from an E-value threshold of 10^-180 to a threshold of approximately 10^-31, over which there was a nearly perfectly linear decrease in the number of unique proteins as the exponent in the E-value threshold was increased. The second section spanned E-value thresholds from 10^-30 to 10⁰. As in the first section, the number of unique proteins decreased as the E-value threshold was increased, but the slope was much shallower; that is, a given increase in the threshold produced a smaller decrease in the number of unique proteins than in the first section. This same trend was observed in the other two intra-species comparisons. Owing to the more divergent sequences of their proteins, all three inter-genus comparisons (Figure 1C) showed a distinctly different pattern: a very gradual slope between thresholds of 10^-180 and 10^-51, and then a steeper slope between thresholds of 10^-50 and 10⁰. As expected, the trend seen in all three inter-species (but intra-genus) comparisons (Figure 1B) was intermediate between the intra-species and inter-genus comparisons.

Figure 1 shows that, while the number of unique proteins differed substantially over the full range of E-value thresholds tested, the values did not differ by much over the range of E-value thresholds that might reasonably be chosen (say, between 10^-30 and 10^-2). For example, Figure 1A shows that P. putida strain GB-1 had 1097 proteins not found in P. putida strain KT2440 at an E-value threshold of 10^-3, versus 1144 at a threshold of 10^-13. Similarly, Figure 1C shows that Yersinia enterocolitica had 3185 proteins not found in Clostridium tetani at a threshold of 10^-3, versus 3322 at a threshold of 10^-13. As the magnitudes of these differences are small, and because an E-value threshold of 10^-13 is justified by the above analytical method, we used this threshold for the rest of our analyses.

Comparing the protein content of selected genera

Identification of core proteomes, unique proteomes, and singlets

To provide a general characterization of pan-genomic relationships in different genera, the orthologue detection procedure described in the Methods section was used to find core proteomes, unique proteomes, and singlets for each of the 16 genera listed in Table 1. If a given orthologous group contained proteins from all isolates of a given genus, it was considered to be part of the core proteome for that genus. If a given orthologous group contained proteins from all isolates of a given genus and no proteins from any other isolate in any of the other genera given in Table 1, then it was considered to be part of the unique proteome for that genus. Finally, if a given group contained just a single protein from a single isolate of a given genus, then it was referred to as a singlet. Note that although a singlet protein for a given isolate could not have been found in any other isolates from the same genus (by definition), it may have been found in the proteomes of isolates from other genera. Figure 2 displays the relationship between a genus's median proteome size and its core proteome size (A), its unique proteome size (B), and the average number of singlets per isolate (C). We compared against the median proteome size rather than the mean to eliminate the effect of outliers, since some genera have one or more isolates with far larger or smaller proteomes than most other isolates from the same genus.
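The three definitions above can be expressed compactly in code. The following is an illustrative sketch rather than the procedure from the Methods section: it assumes that orthologous groups are available as sets of hypothetical (isolate, protein) pairs, that proteins without any orthologue appear as single-member groups, and that each genus contains at least two isolates.

```python
from typing import Dict, List, Set, Tuple

Protein = Tuple[str, str]  # (isolate_id, protein_id), a hypothetical representation


def classify_groups(groups: List[Set[Protein]], genus_of: Dict[str, str],
                    genus: str) -> Dict[str, object]:
    """Count core groups, unique groups, and per-isolate singlets for one genus."""
    members = {iso for iso, g in genus_of.items() if g == genus}
    core = 0
    unique = 0
    singlets = {iso: 0 for iso in members}
    for group in groups:
        isolates = {iso for iso, _ in group}
        if members <= isolates:
            core += 1  # proteins from every isolate of the genus
            if isolates == members:
                unique += 1  # and from no isolate of any other genus
        in_genus = [protein for protein in group if protein[0] in members]
        if len(in_genus) == 1:
            # Exactly one protein from the genus; other genera may still contribute
            singlets[in_genus[0][0]] += 1
    return {"core": core, "unique": unique, "singlets": singlets}


# Invented example: genus X has isolates X1 and X2; genus Y has isolate Y1
genus_of = {"X1": "X", "X2": "X", "Y1": "Y"}
groups = [
    {("X1", "a"), ("X2", "a"), ("Y1", "a")},  # core for X, but shared with Y
    {("X1", "b"), ("X2", "b")},               # core and unique for X
    {("X1", "c"), ("Y1", "c")},               # singlet for X1 (also found in Y)
    {("X2", "d")},                            # singlet for X2
    {("Y1", "e")},                            # irrelevant to X
]

result = classify_groups(groups, genus_of, "X")
print(result)
```

Note that every unique group is also a core group, so the unique proteome is a subset of the core proteome under these definitions.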

Figure 2A shows that the different genera varied significantly in the ratio of their median proteome size to their core proteome size. Genera appearing below the best-fit line had a larger ratio of median proteome size to core proteome size than those appearing above the line. This ratio could be interpreted as showing the relative proteomic similarity of the isolates of each genus. For example, if genus A has a very low ratio, then many proteins found in a given isolate of genus A are actually found in all genus A isolates, whereas if genus B has a very high ratio, then many proteins found in a given isolate of genus B are not found in all genus B isolates. To use the language of Tettelin et al. [17], genera with a high ratio contain isolates that generally have large dispensable genomes, and vice versa.

The fact that genera such as Lactobacillus and Clostridium had a large ratio is consistent with reports that characterize the taxonomic classifications of these genera as overly broad. For instance, Ljungh and Wadstrom [24] argued that Lactobacillus should be split up into a number of separate genera, and Collins et al. [25] made a similar argument for Clostridium. On the other side of the spectrum, Brucella and Xanthomonas, among others, had low median proteome size to core proteome size ratios. This is consistent with the fact that all pairs of isolates in each of these two genera had 16S rRNA genes that were more than 99.5% identical to each other (see also the next section, which provides a comparison of proteomic similarity with 16S rRNA gene similarity).

The best-fit line in Figure 2A had an R² value of 0.46, showing that the median proteome size of a given genus explained less than half of the variation in core proteome size. Another factor that could explain differences in core proteome sizes is simply the number of isolates used, since the core proteome size of a given genus can only decrease (or remain the same) as more isolates are added to the analysis. In their report on the pan-genomics of Streptococcus agalactiae [17], for example, Tettelin and co-authors showed that, as additional isolates were added, the core genome of this species decreased in a fashion consistent with a decaying exponential function, eventually approaching some asymptotic value. Other factors that could explain differences in core proteome sizes include the quality of a genus's taxonomic classification, the frequency of horizontal gene transfer, the number of mobile genetic elements (e.g. plasmids), and the nature and variety of environments that the isolates inhabit.
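The claim that a core proteome can only shrink or stay the same as isolates are added follows directly from its definition as an intersection. A small sketch makes this concrete, assuming each isolate's proteome has already been reduced to a set of hypothetical orthologous-group labels; the data are invented.

```python
from typing import List, Optional, Set


def core_sizes_as_isolates_added(proteomes: List[Set[str]]) -> List[int]:
    """Return the core proteome size after each successive isolate is included."""
    sizes: List[int] = []
    core: Optional[Set[str]] = None
    for proteome in proteomes:
        # The core is the running intersection of all proteomes seen so far
        core = set(proteome) if core is None else core & proteome
        sizes.append(len(core))
    return sizes


# Invented example: four isolates described by orthologous-group labels
proteomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "e"},
    {"a", "b", "f"},
]

sizes = core_sizes_as_isolates_added(proteomes)
print(sizes)
```

Because each step intersects the current core with one more set, the sequence of sizes is non-increasing whatever order the isolates are added in, although the intermediate values do depend on that order.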

The proteins comprising the core proteome of a given genus could be considered the fundamental units of information required for the existence of isolates of that genus as they currently exist in their environments, and include both housekeeping proteins and proteins required for environment-specific functions. The latter category of proteins would be the most informative in terms of characterizing the commonalities of a given group of bacteria. For instance, the protein encoded by the acpM gene, which is involved in mycolic acid synthesis [26], comprises part of the core proteome of the Mycobacterium genus, and thus is part of the unique lipid metabolism that characterizes mycobacteria. As a greater number of core proteomes are revealed through additional genome sequencing, core proteomes may be capable of revealing the fundamental requirements for life in relation to basal function or to specific niches, habitats, and diseases.

Whereas the core proteome is the set of proteins that a particular group of bacteria have in common, the unique proteome is what makes a group different from other groups (i.e. would not include conserved housekeeping proteins). The relationship between median proteome size and unique proteome size for the genera used in this study is given in Figure 2B. The trend was somewhat similar to that shown in Figure 2A, with both Lactobacillus and Clostridium having very few unique proteins and Xanthomonas having many unique proteins. However, there were some interesting differences. For instance, Mycobacterium had a fairly small core proteome, but had a larger unique proteome than all genera except Xanthomonas and Rhizobium. We hypothesized that this may be a reflection of the diverse lipid metabolism of mycobacteria, which among other things provides these organisms with their unique cell wall structure [27]. Mycobacterium tuberculosis strain H37Rv, for instance, contains around 250 enzymes for fatty acid biosynthesis alone, compared to a fifth of that for E. coli [28]. To tentatively examine this hypothesis, we analyzed the annotations of the 332 proteins unique to the mycobacteria. We report data here for a representative isolate, Mycobacterium ulcerans strain Agy99. Many of the 332 proteins were associated, in this isolate, with the structure or synthesis of the cell membrane, with 83 membrane proteins, 12 transferases, and 17 lipoproteins. In addition, 65 of the proteins were uncharacterized, and it is plausible that many of these uncharacterized proteins may also be associated with the mycobacterial cell wall, since our knowledge of its biology is still far from complete [29, 30].

The R² value of 0.23 for the best-fit line indicates that median proteome size explains little of the variation in unique proteome size. It is likely that much of this variation could be explained by some of the same factors mentioned for core proteome size, in particular the environments inhabited by a particular genus and the amount of specialization required to adapt to those environments.

The unique proteome of a given group of bacteria (not necessarily a genus) can be regarded as the protein complement that makes it distinct from other taxonomic groups. The DNA sequences of the open reading frames corresponding to the unique proteome would therefore be good candidates for group-specific identification methods, such as group-specific PCR. Given that PCR-based identification methods require conserved regions in the DNA sequences, the unique proteome would provide a broad range of possible targets. Conserved regions of DNA have been used for group-specific identification before; for instance, three of us performed phylum-specific PCR using conserved regions in the 16S rRNA gene as targets [31, 32]. As another example, O'Sullivan et al. [33] determined orthologous relationships among the genes in several lactic acid bacteria in order to identify niche-specific (specifically, gut-specific and dairy-specific) genes.

Another interesting application of unique proteomes could be to strengthen the argument for the taxonomic reclassification of certain genera. For example, the Lactobacillus genus had a very small unique proteome compared to other genera. While this fact alone would not be enough to show that the taxonomy of Lactobacillus should be re-examined, it does help support this contention in combination with other data (e.g. [24]). If care is used in the selection of groups, unique proteomes could also provide insight on factors or evolutionary trends leading to virulence, adaptation to specific environmental niches, or currently-unknown metabolic functions.

In contrast to the core and unique proteomes, the average number of singlets per isolate in a given genus (Figure 2C) exhibited a fairly strong relationship with the median proteome size (R² = 0.74). This was not surprising, since one would expect the number of singlets to increase with proteome size. Nonetheless, it is still rather striking that most isolates have hundreds of proteins not found in any other isolate from the same genus, reflecting the sheer amount of diversity in the protein content of even very closely related organisms. This is consistent with previous observations that new genes continue to be added to a given bacterial species with each new genome sequenced, and thus that it may be impossible to ever fully describe a given species in terms of its collective genome content [21].

Whereas unique proteins may be useful for developing genus-specific (or, more generally, group-specific) identification techniques, singlets would be similarly useful for facilitating strain-specific identification. Additionally, whereas the core proteome represents the protein complement necessary for life among all the niches and habitats occupied by the different strains of a given group, singlets could be linked to more specific lifestyle requirements of a single strain.

Comparison of proteomic similarity with 16S rRNA gene similarity

Phylogenetic studies currently use 16S rRNA gene sequence comparisons as the standard method for the taxonomic classification of prokaryotes. Two isolates are typically described as being of the same species if their 16S rRNA genes are more than 97% identical, and of the same genus if their 16S rRNA genes are more than 95% identical [34], although our data (see Table 2) suggest that the lower limit for a genus is closer to 90% (and Clostridium and Lactobacillus represent exceptions even to this boundary, as some pairs of isolates in these genera have identities well below 90%). However, analogous thresholds for proteomic similarity--if they exist--are currently unknown. Additionally, while other studies have reported a relationship between genomic similarity and identity of the 16S rRNA gene, no statistical correlation has been reported (a substantial review of this topic is given by Rosello-Mora and Amann [35]). We therefore sought to investigate the relationship between protein content similarity and 16S rRNA gene similarity in pairs of isolates from the same genus. In doing so, we used two different measures of proteomic similarity: "shared proteins" (the number of proteins found in the proteomes of both isolates--in other words, the number of orthologues), and "average unique proteins" (the average of the number of proteins found in isolate A but not isolate B, and the number of proteins found in isolate B but not isolate A). For a given genus, both of these proteomic similarity measures were plotted against the 16S rRNA gene percent identity for all pairs of isolates, and linear regression was used to describe the nature of the relationship (slope and R² value) between these variables. As described in the Methods section, only pairs of isolates whose 16S rRNA genes were less than 99.5% identical were included in this analysis. As a result, no slope and R² values could be determined for Brucella and Xanthomonas, as no pairs of isolates within these genera had 16S rRNA gene percent identities less than this cutoff. Table 2 contains the results of these analyses.
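The two similarity measures and the regression step can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for Table 2: it assumes one-to-one orthologues (so that the proteins unique to one isolate are its proteome size minus the shared count), uses ordinary least squares written out by hand, and operates on invented data. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_unique(size_a: int, size_b: int, shared: int) -> float:
    """Mean of (proteins in A but not B) and (proteins in B but not A)."""
    return ((size_a - shared) + (size_b - shared)) / 2


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, R^2) of the least-squares line; assumes xs and ys both vary."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Invented pairs: (16S identity in %, proteome size A, proteome size B, shared proteins)
pairs = [
    (95.0, 3000, 3200, 2000),
    (96.0, 3000, 3100, 2100),
    (97.0, 3100, 3300, 2200),
    (98.0, 3200, 3200, 2300),
    (99.8, 3000, 3000, 2900),  # excluded below: identity is not less than 99.5%
]

kept = [p for p in pairs if p[0] < 99.5]
identities: List[float] = [p[0] for p in kept]
shared = [p[3] for p in kept]
avg_unique = [average_unique(p[1], p[2], p[3]) for p in kept]

shared_slope, shared_r2 = linear_fit(identities, shared)
unique_slope, unique_r2 = linear_fit(identities, avg_unique)
print(shared_slope, shared_r2, unique_slope, unique_r2)
```

In this invented data set the shared proteins measure rises with 16S identity and the average unique proteins measure falls, which is the pattern the text describes as expected for most genera.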

Table 2 Results of comparison between protein content similarity and 16S rRNA gene percent identity

In contrast to 16S rRNA gene percent identity, Table 2 shows that there is no specific range of proteomic diversity for a genus. In other words, although a reasonably consistent cutoff has traditionally been used for bounding the 16S rRNA gene identity of isolates from the same genus, there does not seem to be a corresponding lower limit for shared proteins or upper limit for average unique proteins.

Table 2 indicates that most genera exhibited a direct relationship between shared proteins and 16S rRNA gene percent identity, and an inverse relationship between average unique proteins and 16S rRNA gene percent identity. This was expected given that larger numbers for the shared proteins measure indicate greater similarity, whereas larger numbers for the average unique proteins measure indicate greater dissimilarity. Interestingly, however, Neisseria exhibited the opposite trend; also anomalous were Rickettsia and Rhizobium, which had positive slopes for both proteomic similarity metrics. Surprisingly, the relationship between 16S rRNA gene similarity and protein content similarity was fairly weak for most genera. Specifically, only four of the 14 genera exhibited a strong (R² > 0.5) relationship between 16S rRNA gene identity and either of the proteomic similarity measures. Two of these genera (Bacillus and Yersinia) showed a strong relationship between 16S rRNA gene identity and both proteomic similarity measures, whereas Vibrio exhibited a strong correlation only for the shared proteins measure and Burkholderia had a strong correlation only for the average unique proteins measure.

Perhaps most interestingly, the R² values for the shared proteins measure and the average unique proteins measure were sometimes quite different even for the same genus. This could be attributed to the fact that the number of shared proteins in two isolates is a measure of gene conservation, whereas the average number of unique proteins in two isolates is a measure of gene gain or loss. For example, the R² value for Vibrio when using the shared proteins measure was 0.81, compared to just 0.03 when using the average unique proteins measure. This could indicate that a subset of genes were highly conserved over time while a large amount of gene loss/acquisition occurred, which ultimately enabled Vibrio isolates to inhabit the various niches in which they are currently found.

As described in the Methods section, we also created three phylogenetic trees, with the first based on 16S rRNA gene similarity, the second based on the number of shared proteins between two isolates, and the third based on the average unique proteins between two isolates. Collapsed versions of these trees are given in Figures 3A, 3B, and 3C, respectively, while trees showing all individual isolates are available as additional files 2, 3 and 4.

There are several notable observations that can be made through comparisons of these three phylogenetic trees. For the most part, the trees were similar; for example, the intra-genus diversity was large for Lactobacillus and Clostridium in all three phylogenetic trees (demonstrated by the height of each triangle). However, the methods based on protein content did sometimes give results different from those given by the method based on 16S rRNA gene similarity, which is typically used for nomenclature. Notably, the Bacillus genus was divided in both protein content-based trees, but not in the tree based on the 16S rRNA gene. Additionally, there were marked differences between the shared protein method (proposed by Snel et al. [13]) and the average unique proteins method (introduced in this paper). The shared proteins method resulted in a taxonomy fairly similar to that found when using the 16S rRNA gene, suggesting that their respective rates of evolution are similar. Conversely, the average unique proteins method gave a somewhat different view of taxonomy. For example, the genus Clostridium has been described as extremely heterogeneous [25], and this is reflected in the divergence of some species of this genus from the rest of the clostridia in the average unique proteins tree. As another example, the species Lactobacillus casei and Lactobacillus plantarum both have much larger proteomes than other lactobacilli, which is likely the cause of their divergence from the rest of their genus.

It is a widely held assumption that the 16S rRNA gene is one of the few genes that can be regarded as an approximate molecular clock, and that other genes--and the genome as a whole--can have a very different rate of evolution compared to the 16S rRNA gene, due to various selective pressures and horizontal gene transfer [1]. Table 2 represents a quantitative approach to examining the relationship between the evolutionary relatedness of different organisms (as measured by the similarity of their 16S rRNA genes) and their degree of genomic similarity (as measured by shared proteins or average unique proteins). It seems reasonable to hypothesize that a stronger relationship between 16S rRNA gene similarity and proteomic similarity for a given genus would imply a lower selective pressure on the organisms' genomes, and vice versa. This difference in selective pressure may in turn reflect the fact that different genera live in different environments, or that the organisms belonging to a given genus may inhabit a greater variety of environments than the organisms belonging to a second genus. As evolutionary pressures experienced by organisms differ based on their environmental niche and life cycle, we expect to see different patterns of association between 16S rRNA gene identity and proteomic content emerge as a greater number of genome sequences become available.

Comparing the protein content of selected species

Evaluating taxonomic classifications by determining how well species are clustered based on protein content

In this section, we provide a novel perspective on the soundness of the taxonomic classifications of different species. Broadly speaking, the classification of a set of organisms into a single species could be described as "good" if two criteria are met: the organisms are very similar to each other, and they are distinct from other organisms of the same genus. This section reports the results of examining these two criteria from the perspective of protein content; specifically, the isolates of a given species are considered to be similar to each other if they have a larger core proteome than randomly-selected sets of isolates of the same genus, and are considered to be distinct from other organisms of the same genus if they have a larger unique proteome than randomly-selected sets of isolates of the same genus.

For each species from the genera listed in Table 1 that had two or more isolates sequenced, we compared the core proteome size and the unique proteome size of that species to those of randomly-generated sets of isolates from the same genus. The results of this analysis are given in Tables 3 and 4. Also, additional file 5 contains the organisms comprising each random group, as well as the core proteome size and unique proteome size of each.
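The comparison just described can be illustrated with a minimal sketch. It assumes that each isolate's proteome has already been reduced to a set of protein-family labels (in the paper these relationships are inferred with a variation on reciprocal BLAST hits, which is not reproduced here). The function names, the toy isolate names, and the way random groups are drawn (every same-sized combination of isolates from the genus other than the species itself) are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
import random


def core_proteome(proteomes, group):
    """Protein families present in every isolate of `group`."""
    return set.intersection(*(proteomes[name] for name in group))


def unique_proteome(proteomes, group):
    """Core families of `group` that occur in no isolate outside the group."""
    outside = set()
    for name, families in proteomes.items():
        if name not in group:
            outside |= families
    return core_proteome(proteomes, group) - outside


def random_groups(proteomes, species_group, n_groups=25, seed=0):
    """Up to `n_groups` same-sized sets of isolates, excluding the species itself.

    Assumption: candidates are all same-sized combinations from the genus;
    the paper's actual sampling procedure is given in its Methods section.
    """
    target = frozenset(species_group)
    candidates = [
        frozenset(combo)
        for combo in combinations(sorted(proteomes), len(target))
        if frozenset(combo) != target
    ]
    if len(candidates) <= n_groups:
        return candidates  # fewer groups exist than requested
    return random.Random(seed).sample(candidates, n_groups)


# Hypothetical toy genus: four isolates, two of which form the species under test.
proteomes = {
    "sp1_a": {"f1", "f2", "f3", "f4"},
    "sp1_b": {"f1", "f2", "f3", "f5"},
    "sp2_a": {"f1", "f2", "f6"},
    "sp2_b": {"f1", "f7"},
}
species = ["sp1_a", "sp1_b"]

core = core_proteome(proteomes, species)
unique = unique_proteome(proteomes, species)
groups = random_groups(proteomes, species)
random_core_sizes = [len(core_proteome(proteomes, g)) for g in groups]
```

Enumerating every combination is only practical for small genera; for larger ones, drawing samples directly would be the more sensible choice.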

Table 3 Results of protein content cohesiveness experiments


Table 4 Results of protein content cohesiveness experiments (continued)


The primary purpose of this section was to investigate the utility of this cohesiveness analysis for identifying bacterial species that might be misclassified. A cursory reading of Tables 3 and 4 revealed that, while most species satisfied both of the above criteria, some species either had core or unique proteomes that were not significantly larger than the average of the random groups, or had several corresponding random groups that had larger core or unique proteomes than the species itself. A lack of cohesiveness in the proteomes of a given species indicates that its taxonomic classification may need revisiting. However, these results must be interpreted with caution. A closer look at these species revealed that the classification of some really did appear to warrant re-examination, whereas the apparent lack of cohesiveness of others had alternative explanations. In the following paragraphs, we discuss several examples. First, we describe the cohesiveness results for Bacillus anthracis, which is indeed proteomically cohesive based on Tables 3 and 4. Next, we discuss Rhizobium leguminosarum and Yersinia pestis, both of which look uncohesive based on these tables but whose lack of cohesiveness can readily be explained. Finally, we look at two species that probably do warrant reclassification, Bacillus cereus and Bacillus thuringiensis.

As an example of reading Tables 3 and 4, consider the first row of Table 3, which contains B. anthracis. The core proteome of the three sequenced B. anthracis isolates contained 4941 proteins. When sets of three Bacillus isolates were randomly chosen as described in the Methods section, however, the average core proteome size was just 2123. According to a two-tailed t-test, the P-value for this comparison was less than 0.001, indicating that the difference in core proteome size between the three B. anthracis isolates, and randomly chosen sets of three Bacillus isolates, was statistically significant. In fact, none of the 25 randomly-generated sets contained a larger core proteome than the set of B. anthracis isolates. B. anthracis therefore satisfied our first criterion, since the three B. anthracis isolates had more similar protein content than randomly-chosen sets of three Bacillus isolates. B. anthracis also satisfied the second criterion, which stated that species should be distinct from other isolates of the same genus. Table 3 shows that the B. anthracis isolates contained 168 proteins not found in any other Bacillus isolate, compared to an average of just one unique protein for the 25 randomly-generated sets (P-value < 0.001). None of the 25 randomly-generated sets contained more unique proteins than the three B. anthracis isolates. Overall, the fact that B. anthracis satisfied both criteria supports its current taxonomic classification.
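A minimal sketch of how one row of such a table might be computed is given below. It assumes a one-sample comparison of the species' value against the values from the random groups; the exact form of the t-test used in the paper is specified in its Methods section, so this choice is an assumption. The function name and the numbers are hypothetical, and the P-value itself is left out (it could be obtained from the t distribution with n - 1 degrees of freedom, for example with `scipy.stats.ttest_1samp`).

```python
from math import sqrt
from statistics import mean, stdev


def compare_to_random(species_value, random_values):
    """Summarize how a species' proteome size compares with random groups.

    Returns (mean of random groups, t statistic, number of random groups
    whose value exceeds the species' value). The t statistic is None when
    the random values have zero variance, since it is then undefined.
    """
    n = len(random_values)
    avg = mean(random_values)
    sd = stdev(random_values)
    t_stat = (avg - species_value) / (sd / sqrt(n)) if sd > 0 else None
    n_larger = sum(value > species_value for value in random_values)
    return avg, t_stat, n_larger


# Hypothetical values: a species core proteome of 50 proteins versus
# five random groups of the same size.
avg, t_stat, n_larger = compare_to_random(50, [20, 22, 18, 24, 16])
```

A large negative t statistic together with `n_larger == 0` corresponds to the situation described for B. anthracis, where no random group matched the species' value.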

As another example, consider R. leguminosarum. There were 3678 proteins in its core proteome, compared to an average of 4063 for randomly selected sets of two Rhizobium isolates. This difference was not statistically significant because only four corresponding random groups could be created. Two of the four random groups (the first containing Rhizobium etli strain ATCC 51251 and R. leguminosarum strain 3841, and the second containing R. etli strain CIAT 652 and R. leguminosarum strain 3841) had larger core proteome sizes than the two R. leguminosarum isolates. The results for unique proteomes were similar, with the same two random groups having a larger unique proteome size than the two R. leguminosarum isolates. However, this apparent lack of cohesiveness can be attributed to differences in the proteome sizes of the individual isolates: the proteome of R. leguminosarum strain WSM2304 contains just 4320 proteins, compared to 5921 for the next-smallest Rhizobium isolate. As such, it might be expected that two Rhizobium isolates having proteomes much larger than that of R. leguminosarum strain WSM2304 would also have a larger core and/or unique proteome.

The apparent lack of cohesiveness of Y. pestis can also be readily explained, although the reason is different from that for R. leguminosarum. There were four random groups of seven isolates each, all of which contained a mixture of Y. pestis and Yersinia pseudotuberculosis isolates, that had larger core proteomes than the seven Y. pestis isolates. All of the isolates of both Y. pestis and Y. pseudotuberculosis had proteome sizes that fall within a fairly narrow range (about 3900-4300 proteins), so the larger core proteomes of these random groups cannot be attributed to large differences in proteome sizes. Rather, these results make sense given that Y. pestis and Y. pseudotuberculosis are very closely related, with Y. pestis having recently diverged from Y. pseudotuberculosis. However, it is known that Y. pestis has acquired additional factors that enable it to cause a severe disease very different from that caused by Y. pseudotuberculosis [36].

Finally, the lack of cohesiveness of some species' proteomes does indeed suggest the need for taxonomic reclassification. For example, B. cereus had a much larger core proteome than the randomly generated sets, but had just two unique proteins. While two unique proteins was more than the average for the randomly-generated sets (none of which had any unique proteins), it was far fewer than the number of unique proteins possessed by other species having four (or more) sequenced isolates. Similarly, B. thuringiensis had a larger core proteome than the corresponding random sets, but actually had a smaller unique proteome than the average of the random sets. In addition, the B. thuringiensis isolates had fewer unique proteins than seven of the 25 corresponding random sets. In contrast to R. leguminosarum and Y. pestis, for B. cereus and B. thuringiensis we could not identify any reason for the lack of cohesiveness other than a possible misclassification. Given that there are many different ways in which the taxonomic classification of a given species can be evaluated, the reclassification of these species could not be justified using only one kind of analysis. However, data like those given in this section could be combined with other kinds of data in order to make a stronger argument. For instance, some of the B. cereus and B. thuringiensis isolates used in this study in fact have 99-100% 16S rRNA identity with isolates of the opposite species, and a lower percent identity (less than 99%) with isolates of the species to which they are currently assigned. Combined with the very small unique proteomes of B. cereus and B. thuringiensis, this suggests that there may be isolates named as thuringiensis that should really be named as cereus, and vice versa. As it can be difficult to resolve species assignments using only the 16S rRNA gene, the core/unique proteome analyses introduced here may well assist in the proper naming of isolates that are difficult to speciate.
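The 16S rRNA comparison mentioned above could be screened for along the following lines. This is a rough sketch only: it assumes that pairwise percent identities are already available, and it flags an isolate when its best identity to an isolate of another species exceeds its best identity within its own species. That rule, the function name, and all isolate names and identity values are hypothetical and are not taken from the paper.

```python
def flag_cross_species(identity, species_of):
    """Isolates whose closest 16S rRNA match lies in a different species.

    `identity` maps frozenset({isolate_a, isolate_b}) to percent identity;
    `species_of` maps each isolate to its currently assigned species.
    """
    flagged = []
    for isolate, species in species_of.items():
        own, other = [], []
        for partner, partner_species in species_of.items():
            if partner == isolate:
                continue
            value = identity[frozenset((isolate, partner))]
            (own if partner_species == species else other).append(value)
        # Only comparable when both within- and between-species pairs exist.
        if own and other and max(other) > max(own):
            flagged.append(isolate)
    return sorted(flagged)


# Hypothetical isolates and identities, for illustration only.
species_of = {"x1": "sp_x", "x2": "sp_x", "y1": "sp_y", "y2": "sp_y"}
identity = {
    frozenset(("x1", "x2")): 99.8,
    frozenset(("y1", "y2")): 98.6,
    frozenset(("x1", "y1")): 98.5,
    frozenset(("x1", "y2")): 99.6,
    frozenset(("x2", "y1")): 98.4,
    frozenset(("x2", "y2")): 99.5,
}
flagged = flag_cross_species(identity, species_of)
```

A flagged isolate would only be a candidate for closer inspection; as noted above, a single kind of analysis is not enough to justify reclassification.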

Conclusions

In this paper, we examined pan-genomic relationships and their applications in several groups of bacteria. It was found that different bacterial genera vary widely in core proteome size, unique proteome size, and the number of singlets that their isolates contain, and that these variables are explained only partly by differences in proteome size. We also found that the relationship between protein content similarity and the percent identity of the 16S rRNA gene varied substantially in different genera, with a fairly strong association in a few genera and little or no association in most other genera. Finally, we found that most bacterial species were fairly cohesive in their protein content, but that the protein content of some species (such as B. thuringiensis) was no more cohesive than that of randomly selected sets of isolates from the same genus, indicating that the current taxonomy of those species may need to be revisited. The differing pan-genomic properties of the various genera reported in this paper reflect the fact that different groups of bacteria have diverse evolutionary pressures and unequal rates of genomic evolution, and provide a starting point for a general, genome-based understanding of such differences in a broad range of bacteria.

We also note that the analyses described in this paper could be applied to any groups of interest, whether or not the bacteria included in each group have a common taxonomic classification. The commonalities in each group could instead be related to phenotype; for example, ability to live in a particular environment, physiological properties, metabolic capabilities, or even disease pathogenesis. As such, the methods described in this paper have broad applicability and should be useful for further pan-genomic comparisons in the future.

There are a number of opportunities to build upon the work performed in this study. For instance, it would be interesting to further characterize proteins that are found in only a single isolate of a given genus (singlets). Our research revealed that the isolates of most genera contain, on average, hundreds of singlets. This phenomenon could be further described by answering questions like: how much variation is there in the number of singlets in isolates of the same genus? Do isolates inhabiting certain environments possess more singlets than other isolates? Do singlets tend to be biased toward any particular functional category of protein? Another avenue for future work would be to enhance our study of the relationship between protein content similarity and 16S rRNA gene similarity. Despite the existence of usually-consistent lower bounds for 16S rRNA gene similarity for isolates of the same genus, in this study we were unable to determine corresponding bounds for protein content similarity. However, we considered only absolute measures of protein content (i.e. absolute numbers of shared proteins or average unique proteins), and it would also be worthwhile to devise biologically meaningful bounds using a relative measure that could take into account factors like the proteome sizes of the individual isolates, the number of individual isolates, and so on. Finally, perhaps the most obvious opportunity for future work is simply to repeat the analyses described in this paper when more genome sequences become available. Given the increasing pace of genome sequencing, in the future it should be possible to do a study similar to this one with dozens or even hundreds of genera, rather than just 16, which will allow us to gain a far richer understanding of the pan-genomic relationships among bacteria.