Gene copy number variation and its significance in cyanobacterial phylogeny
© Schirrmeister et al.; licensee BioMed Central Ltd. 2012
Received: 18 January 2012
Accepted: 25 June 2012
Published: 15 August 2012
Skip to main content
© Schirrmeister et al.; licensee BioMed Central Ltd. 2012
Received: 18 January 2012
Accepted: 25 June 2012
Published: 15 August 2012
In eukaryotes, variation in gene copy numbers is often associated with deleterious effects, but may also have positive effects. For prokaryotes, studies on gene copy number variation are rare. Previous studies have suggested that high numbers of rRNA gene copies can be advantageous in environments with changing resource availability, but further association of gene copies and phenotypic traits are not documented. We used one of the morphologically most diverse prokaryotic phyla to test whether numbers of gene copies are associated with levels of cell differentiation.
We implemented a search algorithm that identified 44 genes with highly conserved copies across 22 fully sequenced cyanobacterial taxa. For two very basal cyanobacterial species, Gloeobacter violaceus and a thermophilic Synechococcus species, distinct phylogenetic positions previously found were supported by identical protein coding gene copy numbers. Furthermore, we found that increased ribosomal gene copy numbers showed a strong correlation to cyanobacteria capable of terminal cell differentiation. Additionally, we detected extremely low variation of 16S rRNA sequence copies within the cyanobacteria. We compared our results for 16S rRNA to three other eubacterial phyla (Chroroflexi, Spirochaetes and Bacteroidetes). Based on Bayesian phylogenetic inference and the comparisons of genetic distances, we could confirm that cyanobacterial 16S rRNA paralogs and orthologs show significantly stronger conservation than found in other eubacterial phyla.
A higher number of ribosomal operons could potentially provide an advantage to terminally differentiated cyanobacteria. Furthermore, we suggest that 16S rRNA gene copies in cyanobacteria are homogenized by both concerted evolution and purifying selection. In addition, the small ribosomal subunit in cyanobacteria appears to evolve at extraordinary slow evolutionary rates, an observation that has been made previously for morphological characteristics of cyanobacteria.
Many genes originated via gene duplication in both prokaryotes and eukaryotes. Evolution after gene duplication can follow several scenarios . Subfunctionalization leads to gene copies evolving specialized functions, all of which are necessary for performing the original gene function. In the neofunctionalization scenario, one gene copy is preserved by purifying selection, while the other copy may evolve a novel function through rapid adaptation. Finally, in a process known as pseudogenization, one gene copy will lose its function due to accumulation of mutations. Another possible evolutionary fate for gene duplicates is gene conservation. Conserved gene copies can be easily detected based on their high levels of sequence similarity, which typically occurs for genes whose products are needed in high concentrations. All gene copies are strongly expressed in such cases. Gene duplicates can maintain their identical function in two ways: by purifying selection which prevents the duplicates from diverging, or alternatively through concerted evolution where frequent gene conversion maintains sequence identity within the genome .
Gene copy number variants have been frequently found and studied in humans , but are also known to exist in other eukaryotic organisms, such as mouse , maize , and yeast . Studies on human copy number variants revealed that multiple gene copies are often associated with diseases [6, 7], but can also have positive effects as has been shown for salivary amylase genes . Less is known about consequences of protein coding gene copy number variations in prokaryotes. Though there have been studies on variation of ribosomal RNA gene copy numbers and possible consequences [9, 10]. Bacteria exhibiting multiple rRNA gene copies seem to respond faster to resource availability . Accelerated growth rate has been conjectured to be a result of high ribosomal copy numbers . In E. coli it is known that more than one rRNA operon has to be functional to express sufficient ribosomes and achieve maximum growth. Bacteria generally possess fewer than 10 rRNA gene copies , though some Proteobacteria and Firmicutes may have as many as 15 copies of rRNA operons . Furthermore, ribosomal RNA copy numbers have been suggested to be phylogentically informative . Phylogenetic positions of organisms and the amount of rRNA operon copy numbers they possess are generally associated.
Although potentially important effects of ribosomal copy numbers have been suggested in various studies, protein coding gene copies are less considered. This could be due to the assumption that selection for faster cell replication leads to genome reduction in prokaryotes , which would reduce the likelihood of survival of multiple gene copies. Indeed, a tendency towards genome reduction has been observed in endosymbiotic bacteria, and in free living prokaryotes including unicellular marine cyanobacteria . However, conclusions that contradict this have been made by Kou and colleagues  who suggest that a lack of large prokaryotic genomes could be the result of selection acting on an upper limit of genome size. Thus, if there is no selective genome reduction in prokaryotes, multiple gene copies might be more widely distributed and of greater importance for prokaryotes than is believed so far.
Among prokaryotes cyanobacteria depict one of the morphologically most diverse phyla. Several of their morphotypes seem to exist for over two billion years as indicated by a well preserved fossil record [18, 19]. Cyanobacteria inhabit diverse environments. They had (and still have) an exceptional influence on the planet due to their ability to conduct oxygenic photosynthesis and fix nitrogen. According to their morphology, cyanobacteria have been classified into five different sections , though molecular data indicate that probably none of the five groups is monophyletic [21–26]. Section I and II consist of unicellular cyanobacteria. Section II species can be distinguished from all other cyanobacteria based on their reproduction via multiple fission. Cyanobacteria belonging to section III to V exhibit filamentous growth. Across the five existing morphotype sections cyanobacteria exhibit several patterns of differentiation. The majority of extant cyanobacterial species control gene expression using a circadian clock. Additionally, several multicellular cyanobacteria developed mechanisms to differentiate not only temporarily, but also spatially. Trichodesmium is the only section III genus known, able to produce specialized cells (‘diazocytes’) in the middle of a filament [27–29]. The principal form of terminal cell differentiation is observed in section IV and V cyanobacteria. Given the morphological variety found in this phylum, we ask whether gene dosage (multiple gene copies per cell) is associated with adaptive morphological strategies such as cell differentiation in cyanobacteria. Variation in 16S rRNA gene copy sequences and numbers has been reported previously for cyanobacterial genera [30, 31], but no phenotypic correlations were found. Little is known about protein coding gene copy numbers in cyanobacteria.
In this study we searched for both ribosomal RNA and protein coding gene copy number variation in diverse species of cyanobacteria for which full genome sequences were available. Ribosomal RNA gene copies were examined since it is known that they might occur in multiple copies and exhibit gene dosage effects [11–13]. Segments of genes within the rRNA operon are strongly conserved because of their functional relevance . These unique features have made 16S rRNA gene sequences a favored taxonomic marker for prokaryotes . Although rRNA sequence variation within a genome is low for most species , considerable intragenomic differences have been reported in some non-cyanobacterial species [10, 34]. This has led to the questioning of the reliability of 16S rRNA genes as a taxonomic marker. We examined sequence identity of rRNA genes within species of cyanobacteria by conducting phylogenetic analyses and calculating phylogenetic distances. Results for cyanobacteria were compared to data from the prokaryotic phyla Chroroflexi, Spirochaetes, and Bacteroidetes. Paralogs of 16S rRNA genes are almost identical in cyanobacterial species and suggest a deviation from divergent evolution of gene copies. Investigating variation in copies of the internal transcribed spacer region (ITS), located between the 16S and 23S rRNA genes, suggests that both concerted evolution and purifying selection are viable hypotheses for the evolution of 16S rRNA in cyanobacteria. Furthermore, we observed an exceptionally strong sequence conservation in 16S rRNA orthologs within the cyanobacterial phylum. A level of conservation that could not be observed in any of the eubacterial phyla studied here.
Aside from ribosomal RNA genes, we identified 41 protein coding genes which possess multiple conserved gene copies in at least one cyanobacterial species (Additional file 1). From this total of 44 genes, only six showed significant correlations to morphological characteristics. Ribosomal RNA genes were the main class of genes exhibiting conserved gene copies that were significantly correlated to the cyanobacterial sections IV and V. Species capable of terminal cell differentiation exhibited four or five copies of ribosomal genes. Furthermore, Gloebacter violaceus and a thermophilic Synechococcus species share a distinct pattern of gene copy numbers which adds independent support to previous studies that have grouped these species separately from the rest of cyanobacteria, closer to an eubacterial outgroup [22, 35–39].
Synechococcus sp. JA-3-3Ab, a unicellular cyanobacterium isolated from a hot spring in Yellow Stone National Park [41, 42], exhibited a pattern of gene copy numbers that generally deviated from the pattern observed in other Synechococci. It shared identical copy numbers of protein coding genes with Gloeobacter violaceus. These included a series of not yet annotated genes missing in all other cyanobacteria. This pattern of almost identical conserved gene copy numbers supports other phylogenetic and phylogenomic studies that place these two species close to each other at the base of the cyanobacterial phylogenetic tree [36–38]. In a previous study using 16S rRNA sequences, Schirrmeister et al. observed a close phylogenetic relationship of Gloeobacter violaceus and another Synechococcus strain  isolated from the same source as Synechococcus sp. JA-3-3Ab. Similar results have been found elsewhere . The phylogenetic distance of Gloeobacter violaceus to other extant cyanobacteria has been pointed out before . Major differences involve the light harvesting machinery. Gloebacter violaceus lacks thylacoid membranes , and various genes from photosystems I and II.
Data of cyanobacterial 16S rRNA gene sequences
# of copies
Acharyochloris marina MBIC11017
Anabaena variabilis ATCC 29413
Arthrospira platensis NIES 39
Cyanothece sp. PCC 7424
Cyanothece sp. PCC 8801
Gloeobacter violaceus PCC 7421
Microcystis aeruginosa NIES-843
Nostoc azollae 0708
Nostoc punctiforme PCC 73102
Nostoc sp. PCC 7120
Prochlorococcus marinus MIT 9211
Prochlorococcus marinus MIT 9303
P. marinus subsp. pastoris str. CCMP1986 (MED)
Synechococcus elongatus PCC 6301
Synechococcus sp. JA-3-3Ab
Synechococcus sp. PCC 7002
Synechococcus sp. RCC307
Synechococcus sp. WH 7803
Synechocystis sp. PCC 6803
Thermosynechococcus elongatus BP-1
Trichodesmium erythraeum IMS101
Using Spearman’s rank correlation coefficient (ρ) and Pearson’s correlation coefficient (R), we estimated a potential correlation of copy numbers to the defined morphological groups. Both tests indicated significant correlations to morphological groups for all ribosomal genes and two transposase coding genes. Furthermore, Spearman’s ρattested a significant correlation to morphology for photosystem II reaction center D2 protein (ρ=0.62), and a weaker correlation to Gas vesicle protein GVPa (ρ=0.58) coding genes. A significant Pearson’s correlation was found for a gene coding for a hypothetical protein (R=0.58). In Figure 3 distributions of ribosomal RNA gene copy numbers across morphological groups are presented as boxplot graphics with correlation coefficients, and p-values shown. All taxa capable of terminal differentiation exhibited four copies of ribosomal RNA genes. Correlation coefficients for 16S and 23S rRNA genes were ρ=0.74/R=0.86, in both cases, and ρ=0.63/R=0.8 for the 5S rRNA genes. Including additional data from the rrn-database  (Additional file 2), resulted in an even stronger correlation of 16S rRNA gene copy numbers to cyanobacterial species capable of terminal differentiation (ρ=0.87/R=0.9; Additional file 3). Cyanobacteria belonging to section IV and V form terminally differentiated cells (called heterocysts) in the absence of fixed nitrogen. In these cells oxygen sensitive nitrogen fixation can take place while neighbouring cells conduct oxygenic photosynthesis. These heterocystous cells undergo various structural and physiological alterations to protect nitrogenase from oxygen in a ‘microanaerobic’ environment. As a result they lose their ability to conduct photosynthesis and to divide. Multiple rRNA gene copies could have positive effects during heterocyst formation, the same way as they help E.coli to achieve maximum growth , and increases responses to changing environmental conditions . An increased amount of functional ribosomal operons likely depicts an advantage in the process of cell differentiation, during which expression of various genes is upregulated .
Previous studies have sometimes questioned the potential of 16S rRNA gene sequences as a taxonomic marker due to variation that has been observed between gene paralogs in some non-cyanobacterial organism [10, 34]. We explored sequence variation of 16S rRNA genes in cyanobacteria by reconstructing phylogenetic trees with Bayesian inference. We evaluated the divergence of 16S rRNA gene copies within and between cyanobacterial taxa. The inferred Bayesian consensus tree is displayed in Figure 2. Investigated cyanobacteria, exhibit one to four 16S rRNA copies per genome. Unicellular species partition in two major groups: species belonging to the marine pico-phytoplankton genera Synechococcus and Prochlorococcus, and members of the genera Synechocystis, Cyanothece and Microcystis which show a closer relation to multicellular cyanobacteria. All multicellular species studied here are closely related, and species capable of terminal differentiation form a monophyletic group. Comparisons of our study to previous findings show high similarities. Our results agree with a comparative phylogenomics approach used by Swingley et al., a consensus tree of concatenated sequences presented by Blank and Sànchez-Baracaldo , and, are highly similar to 16S rRNA analyses conducted by Schirrmeister et al.. Using a larger taxon set , we previously inferred polyphyletic groupings of undifferentiated multicellular species belonging to section III. This however is not deducible from the taxonomically more limited full genome data set used in the present study.
Comparison of mean distances within cyanobacteria and to other eubacterial phyla
Within a genome
95% confidence intervals
95% confidence intervals
The strong conservation of 16S rRNA sequence copies in cyanobacteria and Eubacteria examined here suggests that 16S rRNA in these species is shaped by strong purifying selection and/or concerted evolution. Generally, it is assumed that ribosomal genes in Archaea and Eubacteria are shaped by concerted evolution . 16S rRNA genes can be subdivided in strongly conserved and more variable regions. One would expect that if purifying selection acts as the major force for conservation of gene copies within a genome, some neutral variation should be detected in these variable regions. The extraordinary conservation of 16S rRNA in cyanobacteria seems to indicate that concerted evolution is a more likely explanation. To verify this suggestion we examined variation in the internal transcribed spacer region, located between the 16S and 23S rRNA gene. Though previous studies have suggested conservation of some regions in the ITS sequence, several regions should not be affected by selection and evolve neutrally. If the entire ITS sequence showed the same degree of conservation as does the 16S gene sequence, then purifying selection —which would only act on the functional parts— could be rejected as a driving force. However, the strong conservation found in cyanobacterial 16S rRNA gene sequences could not be confirmed for the ITS-regions of four cyanobacterial taxa (Additional file 9). For cyanobacteria and the eubacterial phyla studied here, both concerted evolution and strong purifying selection, appear to be the main contributing factors.
Although, cyanobacteria are assumed to be an ancient phylum which presumably raised oxygen levels in the atmosphere more than 2.3 billion years ago , variation in 16S rRNA copies is extremely low. Indeed, phylogenetic tree reconstructions for 16S rRNA result in relatively short estimated branch lengths within this phylum, compared to other eubacterial phyla (Figure 2). Short evolutionary distances for 16S rRNA sequences are consistent with a pattern that has been found for morphological characters in cyanobacteria before. In 1994, J.W. Schopf compared the tempo and mode of evolution in cyanobacteria from the Precambrian, to evolutionary patterns observed in fossils during the Phanerozoic. The latter have been described by G.G. Simpson in his book “The tempo and mode of evolution” . Schopf found that evolutionary predictions which Simpson made for metazoan fossils from the Phanerozoic, can also be applied to cyanobacteria. Morphologically, cyanobacteria seem to evolve not only at a “bradytelic”, but “hypobradytelic” mode, meaning at exceedingly low evolutionary rates. Fossils from the Precambrian strongly resemble present morphotypes. The oldest undisputed cyanobacterial fossils date back circa 2.0 billion years [18, 19]. Morphological appearance of these microfossils already suggests the presence of at least four of the morphological sections described by Castenholz . It seems that cyanobacteria reached their maximum morphological complexity two billion years ago, and many of today’s species could be described as so-called ‘living fossils’. It remains to be seen whether the low evolutionary rates as seen in 16S rRNA sequences and morphological features, is also seen at the genomic and metabolic level. This question can be further resolved as further genomic sequences become available for the cyanobacteria.
Among 22 fully sequenced cyanobacterial taxa that were carefully chosen according to phylogenetic position and morphological characteristics, we identified 41 protein coding genes that occur as multiple highly conserved copies in at least one cyanobacterial species. Copy numbers of ribosomal genes show a significant correlation to cyanobacterial species that are capable of terminal differentiation. The formation of heterocysts, morphologically modified cells for nitrogen fixation, requires a strong increase in gene expression, for which an accumulation of ribosomes could be of potential advantage. Further testing would be required though, to make causal conclusions for increased rRNA operons in cyanobacteria belonging to section IV and V. Furthermore, phylogenetic analyses revealed a high conservation of 16S rRNA copies within eubacterial species. Though this is true for all phyla that have been analyzed, cyanobacteria exhibit an exceptionally strong conservation. Comparison to variation in ITS regions point to concerted evolution via homologous recombination and purifying selection as the forces behind 16S rRNA sequence evolution. Comparison of interspecific genetic distances within several prokaryotic phyla, showed significantly lower variation of cyanobacterial 16S rRNA gene sequences. This suggests that 16S rRNA gene sequences evolve by a ‘hypobradytelic’ mode of evolution, previously suggested for morphological characteristics in cyanobacteria .
For this study we only used cyanobacterial taxa with fully sequenced and annotated genomes publicly available on GenBank (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Of those 42 genomes (as of August 2011), 36 belong to singlecelled strains, covering 10 different species in total. The remaining six genomes belong to multicellular strains, each representing another species. The taxon sampling was done to exclude a bias towards unicellular closely related cyanobacteria which are overrepresented in the genome-database . Therefore, to cover the widest possible range of morphotypes, we selected one or more, fully sequenced taxa of each species for a total dataset of 22 cyanobacterial strains. More precisely, we included multiple strains of species Cyanothece sp.(2), Synechococcus sp.(4), and Prochlorococcus marinus(3), which, following the examination of previous phylogenies [39, 47, 58, 59], are assumed to add phylogenetic diversity. No outgroup was included in the phylogenetic analyses. Gloeobacter violceus has been shown to be closest to eubacterial outgroups . Therefore, phylogenetic trees are represented accordingly.
In order to find genes with multiple copies, we applied the orthology prediction algorithm OMA  to the set of 22 complete cyanobacteria genomes. First we looked for clusters of highly conserved paralogous genes within each species. From the all-against-all pairwise sequence alignments computed by OMA, we selected pairwise hits within each species with an alignment score of at least 130 and minimum sequence identity of ≥98%, ≥95% and ≥90%. We then used these hits as edges in a homology graph, and identified clusters of highly conserved paralogs as connected components. Finally, we removed hits within a cluster if the pairwise distance differed significantly from the mean distance within the cluster. In the second step, we grouped detected homologous clusters across species using OMA alignments, but this time with a score cut-off of 180 and minimum sequence identity of ≥50%. We further required that ≥0.8·n i ·n j of hits between any pair of clusters i and j be present in order to be considered, where n i n j is the number of genes in clusters i and j, respectively. If a cluster in one genome grouped with several clusters in another genome, we chose the one with the lowest average pairwise distance. Again, homologous groups were extracted as connected components from the resulting graph. Finally, single orthologs from the OMA orthologous matrix (i.e, with no detected multiple copies within their originating genome) were matched and added to corresponding homologous groups.
We tested whether a correlation between cell differentiation and copy numbers could be observed for the identified genes. To do this, we devided cyanobacterial species into four different groups of cell differentiation (G0-G3; see results). Five strains belong to G0, 12 taxa belong to G1, Tricodesmium is the only genus in G2, and four species belong to G3. For 16S rRNA genes additional data could be obtained from rrndb-database  (Additional file 3). Adding these data resulted in a taxon set of 16S rRNA gene sequences as follows: five strains belonging to G0, 12 strains representing G1, Trichodesmium as the only species in G2 and 11 species in G3. Spearman’s rank and Pearson’s correlation coefficients were applied in order to estimate associations between conserved copy numbers and morphological groups (G0-G3), using R-software. Correlations with a p-value<0.01 were considered to be significant.
We conducted separate phylogenetic analyses of 16S rRNA gene sequences of cyanobacteria (Table 1) and four different eubacterial phyla (Additional file 10). For all taxa included in the phylogenetic trees, full genome sequences were available. All sequences were downloaded from GenBank . For cyanobacteria two phylogenetic trees were reconstructed. One including a single 16S rRNA sequence per taxon and another including all 16S rRNA copies per taxon. Final taxon sets included 22 sequences in the first case and 48 sequences in the latter. The datasets were aligned using Clustal-X software with default settings  (1,325nt incl. gaps). Gaps were excluded from the analysis. Phylogenetic reconstructions were done using Bayesian analysis as implemented in MrBayes software . Two Metropolis coupled Markov Chain Monte Carlo (M C 3) searches were run for 107 generations each using three heated and one cold chain. Figure 1 and Figure 2 show the consensus trees of 16,002 trees that were sampled every 1,000th generation from the M C 3 searches, excluding the first 2,000 trees of each run (burn-in). At that point the log probabilities reached stationarity and average standard deviation of split frequencies were below 0.02. Performance of the MCMC and stationarity of the parameters were checked using Tracer v1.5 . Effective Sample Sizes (ESS) were all above 200, supporting a well mixed MCMC run.
Phylogenetic analysis described for cyanobacteria was equally conducted for the phyla Auificae, Bacteroidetes, Chloroflexi and Spirochaetes. The non-cyanobacterial phylogenetic trees were reconstructed including all 16S rRNA gene copies of each taxon. M C 3analyses were run for 106 generations. The first 200,000 generations of each run were discarded as a burn-in. Parameters and trees were sampled every 1,000th generation resulting in a final set of 1,602 trees. The resulting Bayesian consensus trees for each phylum with posterior probabilities displayed at the nodes, have been visualized with FigTree v1.3.1 .
For each set of aligned 16S rRNA gene sequences, distance matrices were calculated applying a K80 substitution model as implemented in the program baseml of PAML v4.3 . The same was done for the internal transcribed spacer region (ITS) in cyanobacteria (Additional file 9). The resulting numeric matrices were imaged as color matrices using the R-package “plotrix” . The color gradient of each matrix was scaled by the matrix’s minimum and maximum values. Mean distances were calculated within strains (between paralogs; d W ) and between strains (between orthologs; d B ), for each phylum. Significant differences in mean distances were confirmed with bootstrap re-samplings of independent values from the original dataset. To estimate significant differences of mean distances within species (d W ), independent distance values were sampled 10,000 times for each species. Bootstrap re-sampling was done on each of these sample sets. Mean distances were hence calculated and their distribution plotted in a histogram (Additional file 4). The resulting overall mean, of the distributions, as well as 95% confidence intervals are presented in Table 2. To confirm potential differences of mean distances between species (d B ) compared to other phyla, independent values were sampled 10,000 times. These datasets were re-sampled and mean distances calculated. The distributions are displayed in Additional file 5. The resultant overall mean, of each distribution, as well as 95% confidence intervals are shown in Table 2. Independence of distance estimations was assumed if from the corresponding matrix each column and row was only chosen once.
For statistical advice and support we would like to thank Erik Postma. Furthermore, we would like to thank Dr. Manuela Filippini Cattani, Dr. Miroslav Svercel and Valentina Rossetti for helpful comments on various versions of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.