Peptidoglycan: a post-genomic analysis

Background: To derive post-genomic, neutral insight into the peptidoglycan (PG) distribution among organisms, we mined 1,644 genomes listed in the Carbohydrate-Active Enzymes database for the presence of a minimal 3-gene set that is necessary for PG metabolism. This gene set consists of one gene from the glycosyltransferase family GT28, one from family GT51 and at least one gene belonging to one of five glycoside hydrolase families (GH23, GH73, GH102, GH103 and GH104).
Results: None of the 103 Viruses or 101 Archaea examined possessed the minimal 3-gene set, but this set was detected in 1 of the 42 Eukarya members (Micromonas sp., coding for GT28, GT51 and GH103) and in 1,260/1,398 (90.1%) of Bacteria, with a 100% positive predictive value for the presence of PG. A Pearson correlation test showed that GT51 family genes were significantly associated with PG, with a value of 0.963 and a P-value less than 10⁻³. This result was confirmed by a phylogenetic comparative analysis showing that the GT51-encoding gene was significantly associated with PG, with Pagel's scores of 60 and 51 (percentage of error close to 0%). Phylogenetic analysis indicated that the GT51 gene history comprised eight loss events and one gain event, and suggested a dynamic, ongoing process.
Conclusions: Genome analysis is a neutral approach to explore prospectively the presence of PG in uncultured, sequenced organisms with high predictive values.


Background
The macromolecule peptidoglycan (PG) is a component of the bacterial cell wall that participates in withstanding osmotic pressure, maintaining the cell shape and anchoring other cell envelope components [1]. PG is composed of linear glycan strands cross-linked by short peptides, with glycan strands of alternating N-acetylglucosamine (GlcNAc) and N-acetylmuramic acid (MurNAc) residues linked by β-1→4 bonds [1]. PG is at the basis of the first classification of bacteria, which used the staining procedure developed by Hans Christian Joachim Gram in 1884 [2]. This method reveals the presence of PG, with blue-colored Gram-positive bacteria having a thick PG layer, red-colored Gram-negative bacteria having a thin PG layer and poorly stained bacteria lacking PG. However, Gram staining lacks sensitivity and specificity for the detection of PG: for example, Mycobacterium organisms show variable results with Gram staining even though they do have PG [3]. In addition, PG-less Planctomycetes and Chlamydia bacteria stain red like Gram-negative bacteria [4,5]. Electron microscopy of the cell wall later refined the earlier optical microscopy observations, and biochemical analyses made it possible to determine the PG composition of the cell wall, contributing to the description of additional Gram-positive species [6].
PG biosynthesis is a dynamic, complex process involving 20 enzymatic reactions, including the formation of GlcNAc-MurNAc dimers by a glycosyltransferase (GT) of family GT28 (in this report, we adopted the family classification described in the CAZy database [7,8]) and the polymerization of the dimers into linear glycan strands by a family GT51 glycosyltransferase [9]. These two glycosyltransferase families are the only ones involved in PG synthesis. Furthermore, PG lysis involves enzymes that may belong to six different glycoside hydrolase (GH) families: GH23, GH25, GH73, GH102, GH103 and GH104. Indeed, the GH23 and GH25 families include enzymes called lysozymes, which are known to lyse PG. GH73 family enzymes show a fold similar to that of GH23, and the GH102, GH103 and GH104 families show similar catalytic activities. We therefore supposed that the six GH families could be isofunctional. Consequently, to be able to synthesize and degrade PG, an organism needs a minimal set of three genes, comprising one GT28 gene, one GT51 gene and at least one gene from one of the five GH families retained for the screen (GH23, GH73, GH102, GH103 and GH104).
To circumvent the limitations associated with the aforementioned morphological and biochemical approaches to assess the presence of PG in living organisms, we aimed to develop a post-genomic, neutral approach to depict its presence among sequenced representatives of the four domains of life [10] by screening the Carbohydrate-Active Enzymes database (CAZy) [8] for the presence of the minimal set of three genes.

Results
Whereas none of the 103 tested Viruses and none of the 101 tested Archaea genomes exhibited the 3-gene set (Table 1, Additional file 1), some representatives encode [...] (Figure 1, Additional file 1). A total of 4 other photosynthetic eukaryotic genomes do not contain the complete 3-gene set but do encode a portion of these genes: the Ostreococcus lucimarinus CCE9901 and Oryza sativa japonica group nuclear genomes encode one and four GT28 genes, respectively, and the Arabidopsis thaliana nuclear and chloroplastic genomes encode a total of four GT28 genes. The Paulinella chromatophora chromatophore genome encodes one GT28 and one GT51 gene. Three non-photosynthetic Eukaryota genomes encode one GH23 gene: Cryptococcus bacillisporus WM276, Cryptococcus neoformans var. neoformans and Homo sapiens. By analyzing the presence of at least one gene of the 3-gene set in 42 Eukaryota genomes, we found that these genes were significantly more frequent in the photosynthetic Eukaryota genomes (5/7, 71.4%) than in the non-photosynthetic Eukaryota genomes (3/35, 8.5%) (P-value=0.0001). Comparing the presence of each gene family between Bacteria and the other domains of life yielded a significant association between Bacteria and the presence of GH23, GH73, GH102, GH103, GT28 (P-value <10⁻⁷) and GH104 (P-value <2×10⁻⁵). The 3-gene set was found in 1,260/1,398 (90.1%) bacteria, whereas 138 (9.9%) bacteria appeared to lack at least one of these three genes. These data yielded a 77.8% negative predictive value of the 3-gene set for the presence of PG (Table 1). The Pearson correlation test indicated a significant correlation between the absence of any gene of the 3-gene set and the absence of PG, with the highest correlation value (0.963) for GT51 (P<10⁻³), as confirmed by the principal component analysis (Figure 2).
Table 1 (fragment; column headers not recoverable): Deinococcus-Thermus (n=13): 13 (100%). Fibrobacteres-Acidobacteria (n=7): 6 (86%), 6 (86%), 7 (100%), 0, 2 (29%), 0, 0, 0, 6 (86%).
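The comparison between photosynthetic and non-photosynthetic Eukaryota reported above (5/7 versus 3/35) can be checked with a short calculation. The sketch below is an illustration only: it assumes a Pearson chi-squared test without continuity correction on a 2x2 table, which may differ from the exact option the authors selected in EpiInfo, and the function name `chi2_2x2` is ours.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts from the text: 5 of 7 photosynthetic and 3 of 35 non-photosynthetic
# Eukaryota genomes carry at least one gene of the 3-gene set.
stat, p_value = chi2_2x2(5, 2, 3, 32)
print(f"chi2 = {stat:.2f}, p = {p_value:.1e}")
```

Under these assumptions the statistic is close to 14.9 and the P-value is close to 0.0001, in line with the reported value. Because one expected cell count is small, an exact test would often be preferred for a table of this size.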
The phylogenetic comparative analysis yielded 13 clusters (Table 2, Additional file 4). Two of the clusters associated the loss of PG with the loss of PG metabolism genes: one involved PG loss and GT51 loss, with a Pagel's score of 60, a percentage of error close to zero and five positive dates (cluster III); the other involved PG loss and the loss of the GT51 and GH23 genes, with a Pagel's score of 51, a percentage of error close to zero and four positive dates (cluster IV).
The multivariable analysis of life style, genome size, GC content and absence or presence of PG indicated that a GC content <50%, a genome size <1.5 Mb and an obligate intracellular life style were significantly correlated with the absence of PG, with odds ratios of 7.7, 80 and 19.5 and confidence intervals of 3-15.5, 42.4-152.4 and 11.7-32.5, respectively (P<10⁻³). Examples of such GT51-negative, PG-less obligate intracellular Bacteria include Chlamydia [16], Anaplasma, Ehrlichia, Neorickettsia and Orientia [17,18].
Figure 2 Multiple variable analysis of peptidoglycan metabolism genes. a) Pearson correlation test results. b) Principal component analysis results. For both panels, we compared the absence of each gene with the absence of PG and excluded values obtained from genomes with no information for PG.

Discussion
In this study, mining the CAZy database allowed the detection of a minimal set of three genes involved in PG synthesis among the four different domains of life. The fact that this complete 3-gene set was not detected in Archaea and Viruses agrees with the previously known absence of PG in these organisms and validates our method [19]. In Archaea, family GT28 genes are only very distantly related to the bona fide bacterial GTs involved in PG synthesis, and it is possible that the archaeal GT28 enzymes have a function unrelated to PG. In Viruses, detecting a few genes potentially involved in the synthesis and degradation of PG was not surprising: such viruses are bacterial phages, in which GH genes could have recombined with the bacterial host genome [20,21] and could be used to break through the peptidoglycan layer to penetrate the bacterial host.
More surprising was the observation that the Eukaryote Micromonas sp. encodes a complete 3-gene set. Micromonas sp. is a photosynthetic picoplanktonic green alga containing chloroplasts (Figure 5) [22]. A significant association was observed between photosynthetic Eukaryotes and the presence of genes involved in PG metabolism. Chloroplasts are thought to descend from photosynthetic Cyanobacteria ancestors, and their presence in photosynthetic Eukaryotes is thought to result from Eukaryotes-Cyanobacteria symbiosis [23]. Moreover, PG has been detected in the cell wall of Glaucophytes chloroplasts [24,25]. We therefore interpreted the presence of a complete 3-gene set in Micromonas sp. as deriving from its chloroplast, and the presence of some PG metabolism genes in other photosynthetic Eukaryotes as remnants of an ancient complete set. Additionally, the Eukaryote GT28 gene could be a remote homolog involved in plant-specific glycolipid biosynthesis and not in PG metabolism. In this scenario, Eukaryotes ancestors did not encode genes for PG biosynthesis, some photosynthetic Eukaryotes later acquired such a capacity after Eukaryotes-Cyanobacteria symbiosis 1.5-1.2 billion years ago (Keeling 2004), and lateral genetic transfer occurred between Eukaryotes and chloroplasts [25-27]. GH23 is also encoded by free non-photosynthetic Eukaryotes; in Eukaryotes, GH23 could act as an antimicrobial molecule [28]. Accordingly, we found that the minimal 3-gene set was specific for Bacteria, with a 100% positive predictive value for the presence of PG. Its negative predictive value was low, but we further determined that a lack of GT51 in the genome had a negative predictive value of 100% for the lack of PG in an organism. Moreover, our phylogenetic comparative analysis correlated the GT51 gene history with the PG history. Indeed, among the clusters that included PG losses, GT51 gene losses were involved with a good Pagel's score (clusters III and IV; Table 2). These results show that PG function is strongly linked to the presence of the GT51 gene. Thus, the GT51 gene could be used to predict the capacity of an organism to produce PG in its cell wall. A lack of GT51 was found in <10% of bacterial organisms. Under a parsimony hypothesis, this observation suggests that Bacteria ancestral genomes encoded GT51 and that the lack of the GT51 gene in some bacteria results from loss events. Surprisingly, such loss events are observed in almost two thirds of Bacteria phyla, indicating that several independent loss events occurred during the evolutionary history of these different phyla. These scenarios were confirmed by the gain/loss analysis, which featured a GT51-containing Bacteria ancestor and eight GT51 losses. Moreover, we noticed that GT51 loss occurred in only a few strains of the same species, as observed for Prochlorococcus marinus. Our careful examination of these genomes did not find any GT51 gene fragment, which supports genuine GT51 loss events and suggests that such losses are ongoing.
Table 2 (fragment: last row label "Loss GH73"; notes follow). Pagel's score was based on a chi-squared test with four degrees of freedom and was applied to two events. Functional PG corresponds to the presence of PG in the cell wall. A date corresponds to a node for which events were observed. *Details of the dates are given in Additional file 4.
A loss event could be counterbalanced by GT51 acquisition, as observed in Akkermansia muciniphila of the Verrucomicrobia phylum. A. muciniphila lives within the intestinal microbiome, a large microbial community in which several lateral gene transfers have been reported [29]. GT51 gain/loss is thus a dynamic process that depends on the selection pressure resulting from the balance between the advantages and disadvantages of PG.
PG supports some important functions of the bacterial cell, preserving cell integrity by withstanding turgor pressure and maintaining a defined yet flexible shape. PG also anchors other cell envelope components and intimately participates in cell growth and cell division processes [1]. Nevertheless, PG is also an Achilles' heel for Bacteria, as some environmental organisms produce molecules that inhibit PG synthesis. The mold Penicillium notatum was shown by Alexander Fleming to produce penicillin, a PG synthesis inhibitor and the first antibiotic used to treat bacterial infections in humans [30]. Vancomycin is another PG synthesis inhibitor produced by the soil bacterium Streptomyces orientalis [31].

Figure 3 A 16S rDNA sequence phylogenetic tree-like representation. This representation features Bacteria phyla comprising organisms with a GT51 gene (black), phyla including some close representatives without a GT51 gene (green), phyla including isolated representatives without a GT51 gene (blue) and phyla for which all representatives lack a GT51 gene (red).
However, PG is found in the vast majority of bacteria, including bacterial organisms living in the same niches as antibiotic-producing organisms. Accordingly, we observed that the absence of PG correlates with an intracellular life style and genome reduction [32]. In addition, free-living PG-less Bacteria and Archaea organisms use various osmoadaptation strategies, such as the intracellular accumulation of inorganic ions, salt-tolerant enzymes or the accumulation of selected negative or neutral organic molecules [33,34], to maintain cell shape despite the absence of PG. Archaea cell walls could also contain other polymers, such as pseudomurein, methanochondroitin, heterosaccharide and glutaminylglycan, that participate in the mechanical strength of the cell wall [19].

Conclusions
The exploration of PG in bacteria shows great heterogeneity in PG content. Genome analysis combined with ancestral reconstructions and phylogenetic comparative analyses offers a neutral tool to explore this heterogeneity and trace the evolutionary history of PG. These analyses also allowed the identification of genes that could be used to predict functional features.

Screening the CAZy database
We extracted the GH23, GH73, GH102, GH103, GH104, GT28 and GT51 gene content for each genome available in CAZy in April 2011 [7], i.e., 1,398 Bacteria genomes distributed among 21 phyla, 42 Eukaryota genomes, 101 Archaea genomes and 103 Viruses genomes. This database regularly screens finished GenBank genomes for their content in carbohydrate-active enzymes and provides the EC number, gene name and product description of each enzyme. We then searched for the simultaneous presence of one GT28, one GT51 and at least one GH as evidence for PG metabolism. To assess the predictive value of this minimal 3-gene set, we correlated its bioinformatic detection with biological evidence for the presence of PG. We searched for biological evidence of the presence of PG by screening PubMed [35] using 'peptidoglycan', 'cell wall', 'life style' and the name of the genus as keywords. We further explored the HAMAP website [36], the GenBank database [37] and the Genomes OnLine Database (GOLD) [38] for additional strain and genomic information. To confirm the absence of the GT51 gene in a strain, the GT51 gene nucleotide sequence of the closest strain was extracted and compared with the complete genome of the strain using National Center for Biotechnology Information (NCBI) BLAST.
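The screening rule and the predictive values described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the per-genome family counts are assumed to be available as plain dictionaries, the genome names and counts are hypothetical, and the five GH families are those listed in the abstract.

```python
REQUIRED_GT = ("GT28", "GT51")
GH_FAMILIES = ("GH23", "GH73", "GH102", "GH103", "GH104")


def has_minimal_set(counts):
    """True when a genome encodes GT28, GT51 and at least one screened GH family."""
    has_gts = all(counts.get(family, 0) > 0 for family in REQUIRED_GT)
    has_gh = any(counts.get(family, 0) > 0 for family in GH_FAMILIES)
    return has_gts and has_gh


def predictive_values(records):
    """Compute (PPV, NPV) from (set_detected, pg_present) pairs.

    A value is None when its denominator is zero.
    """
    tp = sum(1 for detected, pg in records if detected and pg)
    fp = sum(1 for detected, pg in records if detected and not pg)
    tn = sum(1 for detected, pg in records if not detected and not pg)
    fn = sum(1 for detected, pg in records if not detected and pg)
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


# Hypothetical toy data: family counts and literature evidence for PG.
genomes = {
    "genome_A": ({"GT28": 1, "GT51": 2, "GH23": 3}, True),
    "genome_B": ({"GT28": 1, "GH73": 1}, False),
    "genome_C": ({"GT28": 1, "GT51": 1}, True),  # no GH detected, yet PG reported
    "genome_D": ({"GH23": 1}, False),
}
records = [(has_minimal_set(counts), pg) for counts, pg in genomes.values()]
ppv, npv = predictive_values(records)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In this toy set the positive predictive value is 1 and the negative predictive value is 2/3, which mirrors the pattern reported in the study: detection of the set is specific, whereas its absence is a weaker predictor.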

Statistical analyses
We examined the significance of the association between each gene family and each domain of life using the chi-squared test and STATCALC from EpiInfo version 6. The data were entered into an Excel spreadsheet and analyzed using PASW Statistics 17.0 (SPSS Inc., Chicago, Illinois, USA). To assess the independent factors associated with the absence of PG, binary logistic regression was performed. The dependent variable was the absence of PG, and the independent variables were life style, GC content and genome size. The goodness of fit of the regression model was tested using the Hosmer-Lemeshow test. A correlation analysis was performed using the Pearson correlation test to assess the relationship between the absence of PG and the absence of each PG metabolism gene in the study. Principal component analysis (PCA) was used to identify collinearity between the absence of PG and the absence of each gene. The results of the PCA are shown on a factor loading plot.
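Two of the statistics named above can be illustrated with the standard library alone. The sketch below is not the PASW analysis: it computes a Pearson correlation between binary absence indicators, and an unadjusted odds ratio with a Woolf (log-scale) 95% confidence interval from a 2x2 table, which is a simpler stand-in for the multivariable logistic regression used in the study. All input values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio of [[a, b], [c, d]] with a Woolf confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical indicators for eight genomes: 1 = absent, 0 = present.
gene_absent = [1, 1, 1, 0, 0, 0, 0, 0]
pg_absent = [1, 1, 0, 0, 0, 0, 0, 0]
r = pearson(gene_absent, pg_absent)

# Hypothetical 2x2 table: rows = intracellular / not, columns = PG absent / present.
odds_ratio, lower, upper = odds_ratio_ci(20, 10, 5, 25)
print(f"r = {r:.3f}; OR = {odds_ratio:.1f} (95% CI {lower:.2f}-{upper:.2f})")
```

An adjusted odds ratio from logistic regression accounts for the other covariates and will generally differ from the unadjusted value shown here.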

Phylogenetic tree construction
Bacteria phylogenetic trees were constructed based on the 16S rRNA gene sequence. An initial phylogenetic tree containing 111 16S rRNA gene sequences representing each Bacteria phylum was constructed and rooted using the 16S rRNA gene sequence of the Archaea Methanobrevibacter smithii. Multiple sequence alignments were performed using MUSCLE [39]. Phylogeny reconstruction from the aligned sequences was performed in MEGA 5 using the neighbor-joining method, with bootstrap support [40] computed from 1,000 iterations. To further highlight the different PG evolution events, a second 16S rRNA gene sequence-based phylogenetic tree incorporating 1,114 sequences was constructed using the Maximum Likelihood method.
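Distance-based tree building with bootstrap support, as described above, rests on two simple operations: computing pairwise distances from an alignment and resampling alignment columns with replacement. The sketch below illustrates both. It assumes an uncorrected p-distance, since the distance model used in MEGA 5 is not stated in the text, and it does not build the tree itself; the taxon names and sequences are hypothetical.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping taxon name to aligned sequence."""
    names = sorted(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n]) for m in names for n in names}


def bootstrap_alignment(alignment, rng):
    """One bootstrap replicate: resample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment of three 16S-like fragments.
alignment = {
    "taxon_1": "ACGTACGT",
    "taxon_2": "ACGTACGA",
    "taxon_3": "AC-TTCGA",
}
matrix = distance_matrix(alignment)
replicate = bootstrap_alignment(alignment, random.Random(0))
print(matrix[("taxon_1", "taxon_2")], len(replicate["taxon_1"]))
```

In a full analysis, a tree would be inferred from each of the 1,000 replicates, and the support of a branch would be the fraction of replicate trees that contain it.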

Phylogenetic comparative analysis
The gain/loss event analysis was conducted using the DAGOBAH multi-agent software system [41], which integrates the PhyloPattern library [42] for Mirkin parsimony [43] ancestral node annotation and for the automatic reading of trees. The parameters were set to minimize the detection of gain events. To explore the link between the selected genes and PG, two vertical clustering calculations were conducted by DAGOBAH, one focusing on dates (framing of two speciation events) and the other focusing on feature number (gene or PG). Clusters were verified using Pagel's method [44].
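The idea of annotating ancestral nodes and counting gain and loss events can be illustrated on a small tree. The sketch below uses Fitch parsimony on a binary presence/absence character as a stand-in; it is not Mirkin parsimony and does not reproduce DAGOBAH or PhyloPattern. An ambiguous root is resolved as "present", an assumption made here to favor losses over gains in the spirit of the parameter choice described above. The tree, taxon names and states are hypothetical.

```python
def fitch_up(node, leaf_state, sets):
    """Bottom-up pass: record the candidate state set of every node in `sets`."""
    if isinstance(node, str):
        sets[node] = {leaf_state[node]}
    else:
        for child in node:
            fitch_up(child, leaf_state, sets)
        left, right = node
        common = sets[left] & sets[right]
        sets[node] = common or (sets[left] | sets[right])
    return sets


def fitch_down(node, parent_state, sets, events):
    """Top-down pass: fix one state per node and count gains and losses."""
    candidates = sets[node]
    if parent_state is None:
        # Assumption: an ambiguous root is taken as "present" (1).
        state = 1 if 1 in candidates else 0
    elif parent_state in candidates:
        state = parent_state
    else:
        state = next(iter(candidates))
        events["loss" if parent_state == 1 else "gain"] += 1
    if not isinstance(node, str):
        for child in node:
            fitch_down(child, state, sets, events)
    return events


def count_events(tree, leaf_state):
    """Return the number of gain and loss events implied by the reconstruction."""
    sets = fitch_up(tree, leaf_state, {})
    return fitch_down(tree, None, sets, {"gain": 0, "loss": 0})


# Hypothetical rooted binary tree (nested tuples) and GT51 states (1 = present).
tree = ((("A", "B"), "C"), ("D", "E"))
gt51 = {"A": 1, "B": 0, "C": 1, "D": 1, "E": 1}
events = count_events(tree, gt51)
print(events)
```

On this toy tree the reconstruction places GT51 at the root and infers a single loss on the branch leading to taxon B.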