GASdb: a large-scale and comparative exploration database of glycosyl hydrolysis systems

Background The genomes of numerous cellulolytic organisms have recently been sequenced or are in the pipeline to be sequenced. Systematic analyses of these genomes, as well as of the recently sequenced metagenomes, could lead to discoveries of novel biomass-degradation systems in nature. Description By scanning all the proteins in the UniProt Knowledgebase and the JGI Metagenome database, we have identified 4,679 free acting glycosyl hydrolases with carbohydrate binding domains and 49,099 without. Cellulosome components were observed only in bacterial genomes, and 166 cellulosome-dependent glycosyl hydrolases were identified. Our analysis revealed unexpectedly wide distributions of two less well-studied bacterial glycosyl hydrolysis systems, in which glycosyl hydrolases may bind to the cell surface directly rather than through surface anchoring proteins, or cellulosome complexes may bind to the cell surface by novel mechanisms other than the typically used SLH domains. In addition, we found that animal-gut metagenomes are substantially enriched with novel glycosyl hydrolases. Conclusions The biomass degradation systems identified through our large-scale search are organized into an easy-to-use database, GASdb, at http://csbl.bmb.uga.edu/~ffzhou/GASdb/, which should be useful to both experimental and computational biofuel researchers.


Background
As a promising alternative energy source to fossil fuels, biofuels can be produced through degradation and fermentation of lignocellulosic biomass of plant cell walls [1,2]. A key challenge in converting biomass to fuels lies in the special structures of cell walls that plants have formed during evolution to resist decomposition from microbes and enzymes. It is this defense system of plants that makes their conversion to fuel difficult, which is known as the biomass recalcitrance problem [3]. Considerable efforts have been invested into searches for microbes, specifically cellulolytic microbes, which can effectively break down this defense system in plants.
Cellulolytic microbes degrade biomass by secreting glycosyl hydrolases, which bind to the biomass through their carbohydrate binding domains (CBMs) and then cut various chemical bonds of the biomass with their catalytic domains [4]. It has been observed that a glycosyl hydrolase without a CBM (WGH) has lower catalytic efficiency than one with a CBM [5,6]. Some microbes use multiple glycosyl hydrolases directly and independently of each other for biomass degradation, whereas others organize them into large protein complexes, called cellulosomes, through scaffolding (Sca) proteins. The former are called free acting hydrolases (FAC), and the latter cellulosome dependent hydrolases (CDC) [4,7]. Some anaerobic microbes use both systems for biomass degradation [7], while most other cellulolytic microbes use only one of them. When degrading biomass, cellulosomes are generally attached to the host cell surface by binding to cell surface anchoring (SLH) proteins [8]. The general observation has been that cellulosomes degrade biomass into short-chain sugars more efficiently than free acting cellulases [8]. Our goal in this computational study is to identify and characterize all the component proteins of the biomass degradation system in an organism, which we call the glydrome of the organism.
We have systematically re-annotated and analyzed the functional domains and signal peptides of all the proteins in the UniProt Knowledgebase and the JGI Metagenome database, aiming to identify novel glycosyl hydrolases or novel mechanisms for biomass degradation. Based on their domain compositions, we have classified all the identified glydrome components into five categories, namely FAC, WGH, CDC, SLH and Sca. To our surprise, two less well-studied glycosyl hydrolysis systems were found to be widely distributed in 63 bacterial genomes: (a) glycosyl hydrolases may bind directly to the cell surface through their own cell surface anchoring domains rather than through those of the cell surface anchoring proteins, and (b) cellulosome complexes may bind to the cell surface through novel mechanisms other than the previously observed SLH domains. Our analyses also suggest that animal-gut metagenomes are significantly enriched with novel glycosyl hydrolases. All the identified glydrome elements are organized into an easy-to-use database, GASdb, at http://csbl.bmb.uga.edu/~ffzhou/GASdb/.

Data sources
We downloaded the UniProt Knowledgebase release 14.8 (Feb 10, 2009) [9] with 7,754,276 proteins, and all the 46 metagenomes from the JGI IMG/M database [10] with 1,504,133 proteins. The three simulated metagenomes in the database were excluded from our analysis.

Annotation and database construction
We identified the signal peptides and analyzed the functional domains of all the proteins using SignalP version 3.0 [13,14] and Pfam version 23.0 [15]. A protein is defined as a cell surface anchoring protein if it has one SLH domain and one Cohesin domain; as a scaffolding protein if it has at least three Cohesin domains, or one Cohesin domain and one carbohydrate binding domain; as a cellulosome dependent catalytic protein if it has one catalytic domain and one dockerin domain; and as a free acting catalytic protein if it has one catalytic domain and one CBM domain. All the other proteins with one catalytic domain are defined as weak catalytic proteins.
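The domain-composition rules above can be sketched as a small decision function. This is only an illustration, not the pipeline used to build GASdb. The domain labels (`GH` for any catalytic glycosyl hydrolase domain, `CBM`, `Cohesin`, `Dockerin`, `SLH`), the function name, the reading of "one" as "at least one", and the order in which the rules are tested are all assumptions.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_protein(domains: Iterable[str]) -> Optional[str]:
    """Assign a glydrome category from a protein's list of domain labels.

    Labels are hypothetical shorthand: "GH" (catalytic domain), "CBM",
    "Cohesin", "Dockerin", "SLH". Rule order is an assumption.
    """
    n = Counter(domains)  # missing labels count as 0
    if n["SLH"] >= 1 and n["Cohesin"] >= 1:
        return "SLH"   # cell surface anchoring protein
    if n["Cohesin"] >= 3 or (n["Cohesin"] >= 1 and n["CBM"] >= 1):
        return "Sca"   # scaffolding protein
    if n["GH"] >= 1 and n["Dockerin"] >= 1:
        return "CDC"   # cellulosome dependent catalytic protein
    if n["GH"] >= 1 and n["CBM"] >= 1:
        return "FAC"   # free acting catalytic protein
    if n["GH"] >= 1:
        return "WGH"   # weak catalytic protein
    return None        # not a glydrome component


print(classify_protein(["GH", "CBM"]))           # FAC
print(classify_protein(["GH", "CBM", "SLH"]))    # FAC (SLH without Cohesin)
print(classify_protein(["Cohesin"] * 3))         # Sca
```

Under these assumptions, a hydrolase that carries an SLH domain but no Cohesin domain is still classified by its catalytic and CBM domains, so its SLH domain has to be reported separately.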
We calculated the percentages of glydrome components only in genomes with at least 1,000 proteins, since most of the others may not have been completely sequenced. Three-dimensional protein structures were predicted using LOMETS [16]. Gene Ontology annotations of the proteins were predicted using PFP [17].
To make the annotated glydromes easily accessible, a database, GASdb, was constructed using the PHP scripting language.

Identified glydromes in bacteria
4,616 FACs are identified from the 7.75 million proteins in the UniProt Knowledgebase (release 14.8) [see Additional file 1]. The majority of them, 2,774 (61.71%), are from bacterial genomes. 1,019 FACs are found in the phylum Firmicutes, which includes a number of well-studied cellulolytic organisms such as Anaerocellum thermophilum [18], Caldicellulosiruptor saccharolyticus [19] and Clostridium thermocellum [20,21]. In addition, a large number of FACs are found in each of two other phyla, namely Bacteroidetes (342 FACs) and Actinobacteria (425 FACs). Overall, these three phyla harbour 64.38% (~1,786/2,774) of our identified bacterial FACs, compared to the 25.12% of all bacterial genomes that belong to these phyla.
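The share quoted for the three phyla can be reproduced from the counts in the paragraph above. The short sketch below uses only numbers stated in the text. The variable names are illustrative, and the fold-enrichment figure is derived here and is not reported in the original analysis.

```python
# FAC counts per phylum, as reported in the text.
facs_by_phylum = {"Firmicutes": 1019, "Bacteroidetes": 342, "Actinobacteria": 425}
bacterial_facs = 2774        # all FACs found in bacterial genomes
genome_share_pct = 25.12     # share of bacterial genomes belonging to these phyla

facs_in_three_phyla = sum(facs_by_phylum.values())
fac_share_pct = 100.0 * facs_in_three_phyla / bacterial_facs

# Derived for illustration only: how over-represented FACs are in these phyla.
fold_enrichment = fac_share_pct / genome_share_pct

print(facs_in_three_phyla)         # 1786
print(round(fac_share_pct, 2))     # 64.38
print(round(fold_enrichment, 2))   # 2.56
```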
The previous observation has been that a functional cellulosome consists of at least one cell surface anchoring protein with SLH domains, at least one scaffolding protein and a number of cellulosome dependent glycosyl hydrolases [3,8,22,23]. Our search and analysis results indicate that novel biomass-degradation mechanisms may exist in the genomes or metagenomes that we analyzed, the details of which will need further study. For example, Clostridium acetobutylicum was known to encode a scaffolding protein and a few cellulosome dependent enzymes, but it is not clear how the cellulosome is anchored to the cell surface [24,25], as no SLH domains were identified in the genome [see Additional file 1]. A similar question holds for four other Firmicutes, i.e. Clostridium cellulolyticum, Clostridium cellulovorans, Clostridium josui and Ruminococcus flavefaciens. Unexpectedly, the scaffolding proteins in all these genomes except for Ruminococcus flavefaciens encode a domain of unknown function (PF03442: DUF291). Our data support the previous observation that the four DUF291 domains in the C. cellulovorans scaffolding protein CbpA are possibly involved in anchoring the cellulosome on the cell surface [26].
A somewhat unusual glydrome was identified in Paenibacillus sp. JDR-2 of the phylum Firmicutes. Paenibacillus sp. JDR-2 was known to encode modular xylanases [27,28], as shown in Figure 1. It is surprising to find 4 SLH proteins, i.e. B1D7Q9, B1D969, B1DGS5 and B1DIS9, but no other cellulosome components in Paenibacillus sp. JDR-2. Our search did not find any dockerin domains in the genome, suggesting the possibility that the organism uses an unknown biomass-degradation mechanism. In addition, our search identified SLH domains in 6 FACs and 5 WGHs of this organism, as shown in Figure 1. The superfamily of Ig-like fold domains is found in a variety of cell surface proteins [29], and the presence of such domains (Big_2, Big_4, fn3, etc.) in the aforementioned proteins further supports the possibility that they anchor to the cell surface.

Identified glydromes in archaea
18 FACs are identified in six genera of Archaea, i.e. Thermococcus, Halobacterium, Pyrococcus, Thermofilum, Caldivirga and Haloferax [see Additional file 1], covering 11 genomes. Each of these 11 archaeal genomes encodes 1-3 FACs together with up to 28 WGHs. FACs were previously known to be encoded in four archaeal genomes, i.e. Halobacterium mediterranei [30], Pyrococcus furiosus [31,32], Pyrococcus kodakaraensis [33] and Ferroplasma acidiphilum strain Y [34]. Three of them are in our list. The glycosyl hydrolase in Ferroplasma acidiphilum strain Y is missing from our database because our annotation relies on two databases, CAZy [35] and Pfam [15], neither of which includes this enzyme. 14 of the 18 identified FACs have homologs in different species of the same genus, with NCBI BLAST E-values < 1e-132, suggesting that these enzymes were already present in these archaeal genomes before the divergence of these species. 385 proteins are annotated as WGHs in the 93 genomes from 30 archaeal genera. No cellulosome components were found in any of the archaeal genomes.

Identified glydromes in eukaryotes
The top four eukaryotic genomes by number of WGHs are from the phylum Streptophyta: Oryza sativa subsp. japonica (Rice) (828 WGHs), Arabidopsis thaliana (Mouse-ear cress) (678 WGHs), Vitis vinifera (Grape) (602 WGHs) and Zea mays (Maize) (284 WGHs).
It is interesting to observe that there are 272 and 224 WGHs in the human and mouse genomes, respectively. Besides two other plant genomes, i.e. Oryza sativa subsp. indica (Rice) (258 WGHs) and Physcomitrella patens subsp. patens (Moss) (226 WGHs), all the other 6 eukaryotic genomes encoding more than 200 WGHs are from the fungal phylum Ascomycota. No cellulosome components were identified in the eukaryotic genomes. 200 (~73.53%) of the human WGHs are homologous to mouse WGHs with NCBI BLAST E-values < 1e-23. The majority of these enzymes were therefore present in the human and mouse lineages before their divergence 75 million years ago [36].
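A homolog count like the human-versus-mouse figure above can be sketched as a filter over BLAST hits. The sketch assumes BLAST tabular output with the default 12 columns (query ID first, subject ID second, E-value eleventh). The function name and the protein IDs in the sample are hypothetical, and the source does not state how its hits were actually tallied.

```python
from typing import Iterable, Set


def queries_with_hit(blast_lines: Iterable[str], evalue_cutoff: float) -> Set[str]:
    """Return query IDs having at least one hit with E-value below the cutoff.

    Assumes tab-separated BLAST output with the default 12 columns.
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < evalue_cutoff:
            hits.add(query_id)
    return hits


# Hypothetical sample hits (IDs and values invented for illustration).
sample = [
    "humanWGH_1\tmouseWGH_7\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-90\t320",
    "humanWGH_2\tmouseWGH_3\t40.0\t120\t72\t2\t5\t124\t9\t128\t2e-05\t48",
    "humanWGH_3\tmouseWGH_9\t70.0\t250\t75\t1\t1\t250\t3\t252\t3e-40\t160",
]
homologous = queries_with_hit(sample, evalue_cutoff=1e-23)
print(sorted(homologous))             # ['humanWGH_1', 'humanWGH_3']

# The fraction reported in the text: 200 of 272 human WGHs.
print(round(100 * 200 / 272, 2))      # 73.53
```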

Identified glydromes in metagenomes
Overall, 63 FACs and 6,072 WGHs are found in 42 metagenomes, the only exception being TM7b, which was sampled from the human mouth. The top two metagenomes by number of glycosyl hydrolases are from termite guts (12 FACs and 1,150 WGHs) and diversa silage soil (13 FACs and 820 WGHs). Since the number of proteins in a metagenome varies from 452 in termite gut fosmids to 185,274 in the diversa silage soil, we calculated the percentage of glycosyl hydrolases in each metagenome. On average, 0.65% of the proteins in a metagenome are glycosyl hydrolases. We noted that all the metagenomes in which glycosyl hydrolases account for more than 1% of the proteins are from animal guts (including human, mouse and termite). This is confirmed by an independent study using BLAST mapping [37]. No cellulosome components were identified in any metagenome.
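The normalization described above is a simple ratio, sketched below for the one metagenome whose FAC, WGH and total protein counts are all given in the text (diversa silage soil). The function names and the use of a strict "greater than 1%" test are assumptions made for illustration.

```python
def gh_percentage(n_fac: int, n_wgh: int, n_proteins: int) -> float:
    """Percentage of proteins in a (meta)genome that are glycosyl hydrolases."""
    if n_proteins <= 0:
        raise ValueError("n_proteins must be positive")
    return 100.0 * (n_fac + n_wgh) / n_proteins


def is_enriched(pct: float, threshold_pct: float = 1.0) -> bool:
    """True if the glycosyl hydrolase percentage exceeds the threshold."""
    return pct > threshold_pct


# Counts for diversa silage soil, taken from the text.
silage_pct = gh_percentage(n_fac=13, n_wgh=820, n_proteins=185_274)
print(f"{silage_pct:.2f}%")        # 0.45%
print(is_enriched(silage_pct))     # False
```

By this measure the silage soil sample, despite its large absolute count, falls below both the 0.65% average and the 1% level seen only in animal-gut metagenomes.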

Utility
The query interface of GASdb
All the annotated glydromes are organized into an easy-to-use database, GASdb (Figure 2). A user can find proteins of interest by browsing, or by searching with keywords or BLAST. The overall organization of each glydrome can be displayed, and a high resolution image of each protein can be downloaded for use in publications, as shown in Figure 3. A user can also display the signal peptide and functional domains of a given protein and of its homologs, found using BLAST with an E-value cutoff of 1e-20, as shown in Figure 3.

The comparative analysis interface of GASdb
The glydromes of multiple genomes can be displayed side by side in the Compare interface. First, the user finds the genome(s) of interest by keyword search in the Compare interface. One or more genomes can then be selected from the left panel in Figure 4 and added to the right panel for final display. The user can also remove genomes from the right panel. Clicking the Compare button displays, on the next page, the signal peptides and functional domains of the proteins in the glydromes selected in the right panel, as shown in Figure 4.

Discussion
The majority (52.90%) of glycosyl hydrolases (including FACs, CDCs and WGHs) in our database are encoded by the 1,771 bacterial genomes. The 1,668 eukaryotic genomes contribute 34.98% of the total glycosyl hydrolases. Glycosyl hydrolases are therefore much more enriched in bacteria than in eukaryotes, considering the substantially larger sizes of eukaryotic genomes. Cellulosome components are observed only in Firmicutes, except for the CDC xynB (Q7UF11) from Rhodopirellula baltica. None of the other glycosyl hydrolases have dockerin domains, and they were annotated as FACs or WGHs. Although the catalytic domain and the CBM domain of a glycosyl hydrolase can function independently, the CBM domain is known to play an important role in the catalytic efficiency of a glycosyl hydrolase [5,6]. The annotated FACs may therefore have higher catalytic efficiency than the WGHs.
A cell surface anchoring protein binds to the cell surface through its two or three SLH domains, and binds to the cellulosome scaffolding proteins together with the CDCs through interacting pairs of cohesin and dockerin domains. It is unexpected to find SLH domains in an additional 5 FACs and 5 WGHs of Paenibacillus sp. JDR-2, as the only previous observation of this kind is Q53I45 (XynA) in the Paenibacillus sp. JDR-2 genome [28]. We believe that these glycosyl hydrolases may bind to the cell surface through their own SLH domains, as Paenibacillus sp. JDR-2 encodes SLH proteins but no scaffolding proteins or CDCs. It would be interesting to study how Paenibacillus sp. JDR-2 acquired the SLH proteins or lost the other cellulosome components. This is not a unique feature of Paenibacillus sp. JDR-2, as there are 26 FACs and 52 WGHs with SLH domains in other organisms, all of which are bacteria except for the moss Physcomitrella patens. Many of these enzymes have been experimentally confirmed to anchor on the cell surface through their SLH domains, e.g. the cell surface xylanase xyn5 (Q8GHJ4) from Paenibacillus sp. W-61 [38,39], the extracellular endoglucanase celA (Q9ZA17) from Thermoanaerobacterium polysaccharolyticum [40] and the endoxylanase (Q60043) from Thermoanaerobacterium sp. strain JW/SL-YS 485 [41].
Our data indicate that cellulosomes could be linked to the cell surface by novel mechanisms other than the typically used SLH domains. Five Firmicutes encode scaffolding proteins and CDCs but no recognizable SLH domains, a key feature of cell surface anchoring proteins. Cellulosomes were observed to anchor on the cell surface in Clostridium cellulolyticum [22], Clostridium cellulovorans [42] and Ruminococcus flavefaciens [7], but the detailed mechanisms remain unknown. The cellulosomes in Clostridium acetobutylicum and Clostridium josui may also be linked to the cell surface through unknown mechanisms. Our analysis suggests that the domain of unknown function DUF291 (PF03442) might be involved in attaching these cellulosomes to the cell surface. We predicted the 3D structure of the first DUF291 domain in the scaffolding protein Q977Y4 of the Clostridium acetobutylicum glydrome, as shown in Figure 5. The first template (1EHX) does not carry an obvious functional implication, while the second one (1CS6) is involved in cell adhesion [43,44]. The two predicted structures of the DUF291 domain are similar to each other, with RMSD ~2.7 Å and TM-score 0.6 according to TM-align [45,46].
We collected 41 proteins that are encoded in the same operons as the components of the Clostridium acetobutylicum glydrome but are not in GASdb. 16 of these proteins cover the following functional categories: binding (GO:0005488), catalytic activity (GO:0003824) and transporter activity (GO:0005215); the remaining 25 are hypothetical or uncharacterized proteins. Only five proteins were annotated to be involved in glycosyl hydrolysis, e.g. carbohydrate binding (GO:0030246) or hydrolase activity (GO:0016787). Three of these five proteins, i.e. Q97EZ1, Q97FI9 and Q97TI3, do not have recognizable Pfam domains related to glycosyl hydrolysis. Q97TP4 is annotated to be an
In general, the glycosyl hydrolases and the cellulosome components attack the biomass after they are secreted outside the cells and properly assembled [23,47], and hence we would expect them to have signal peptides. However, the majority of the annotated glycosyl hydrolases do not have any signal peptides, based on the predictions of SignalP 3.0 [13,14]. We found that over 65% of WGHs across all organisms except for Eukaryota do not have predicted signal peptides, suggesting the possibility that these proteins use a novel secretion mechanism.
The ratio between the numbers of WGHs and FACs in a glydrome tends to be no more than 30. We calculated this ratio for each glydrome in a genome or metagenome with at least 1,000 proteins and at least one FAC and one WGH. The average ratios between the numbers of WGHs and FACs are 9.98, 12.55 and 14.40 for archaea, bacteria and eukaryota, with standard deviations of 8.22, 16.65 and 12.25, respectively. Overall, over 90% of the glydromes in each of archaea, bacteria and eukaryota have a ratio lower than 30. It is surprising to find that the metagenomes encode 95.38 times more WGHs than FACs but no cellulosome components. We speculate that these WGHs may use some novel CBM domains. An alternative hypothesis is that microbes in a community secrete WGHs generously to degrade biomass and live on the hydrolysis products in nearby regions only.
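The ratio statistics described above can be sketched as follows. The records are toy data invented for illustration, the field and function names are hypothetical, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether a sample or population estimate was used.

```python
from statistics import mean, stdev
from typing import Dict, List


def wgh_fac_ratios(glydromes: List[Dict], min_proteins: int = 1000) -> List[float]:
    """WGH/FAC ratio for each glydrome that passes the filters in the text:
    at least `min_proteins` proteins, at least one FAC and at least one WGH."""
    return [
        g["WGH"] / g["FAC"]
        for g in glydromes
        if g["proteins"] >= min_proteins and g["FAC"] >= 1 and g["WGH"] >= 1
    ]


# Toy records (not real GASdb data).
toy = [
    {"name": "genome_A", "proteins": 2500, "FAC": 2, "WGH": 20},
    {"name": "genome_B", "proteins": 4000, "FAC": 1, "WGH": 35},
    {"name": "genome_C", "proteins": 800, "FAC": 1, "WGH": 5},    # too few proteins
    {"name": "genome_D", "proteins": 3000, "FAC": 0, "WGH": 12},  # no FAC
    {"name": "genome_E", "proteins": 5200, "FAC": 4, "WGH": 60},
]

ratios = wgh_fac_ratios(toy)
share_below_30 = sum(r < 30 for r in ratios) / len(ratios)

print(ratios)                    # [10.0, 35.0, 15.0]
print(mean(ratios))              # 20.0
print(round(stdev(ratios), 2))   # 13.23
print(round(share_below_30, 2))  # 0.67
```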

Conclusions
We conducted the first large-scale annotation of glydromes in all the sequenced genomes and metagenomes, and made a number of interesting observations about them. Among these, two less well-studied glydromes were observed in dozens of organisms: A) glycosyl hydrolases were found to have cell surface anchoring domains and may bind to the cell surface by themselves; and B) Clostridium acetobutylicum and four other bacteria from the phylum Firmicutes encode all cellulosome components except for the cell surface anchoring (SLH) proteins, suggesting that their cellulosomes may link to the cell surface through some novel mechanisms. Individual cases have been experimentally observed, but further studies are needed to uncover the underlying mechanisms and how they evolved into the current glydrome structures. Our data also suggest that animal gut metagenomes are rich in novel glycosyl hydrolases, providing new targets for further experimental studies.