GASdb: a large-scale and comparative exploration database of glycosyl hydrolysis systems
© Zhou et al. 2010
Received: 4 August 2009
Accepted: 4 March 2010
Published: 4 March 2010
The genomes of numerous cellulolytic organisms have recently been sequenced or are in the pipeline to be sequenced. Systematic analyses of these genomes, together with the recently sequenced metagenomes, could lead to the discovery of novel biomass-degradation systems in nature.
We have identified 4,679 and 49,099 free acting glycosyl hydrolases with and without carbohydrate binding domains, respectively, by scanning all the proteins in the UniProt Knowledgebase and the JGI Metagenome database. Cellulosome components were observed only in bacterial genomes, and 166 cellulosome-dependent glycosyl hydrolases were identified. Our analysis revealed unexpectedly wide distributions of two less well-studied bacterial glycosyl hydrolysis systems, in which glycosyl hydrolases may bind to the cell surface directly rather than through linking to surface anchoring proteins, or cellulosome complexes may bind to the cell surface by novel mechanisms other than the commonly used SLH domains. In addition, we found that animal-gut metagenomes are substantially enriched with novel glycosyl hydrolases.
The biomass degradation systems identified through our large-scale search are organized into an easy-to-use database, GASdb, at http://csbl.bmb.uga.edu/~ffzhou/GASdb/, which should be useful to both experimental and computational biofuel researchers.
As a promising alternative energy source to fossil fuels, biofuels can be produced through degradation and fermentation of the lignocellulosic biomass of plant cell walls [1, 2]. A key challenge in converting biomass to fuels lies in the special structures of cell walls that plants have formed during evolution to resist decomposition by microbes and enzymes. It is this defense system of plants that makes their conversion to fuel difficult, which is known as the biomass recalcitrance problem. Considerable effort has been invested in searching for microbes, specifically cellulolytic microbes, that can effectively break down this defense system in plants.
Cellulolytic microbes degrade biomass by secreting glycosyl hydrolases, binding to the biomass using their carbohydrate binding domains (CBMs), and then cutting various chemical bonds of the biomass using their catalytic domains. It has been observed that a glycosyl hydrolase without a CBM domain (denoted WGH) has a lower catalytic efficiency than one with such a domain [5, 6]. While some microbes use multiple glycosyl hydrolases directly, independent of each other, for biomass degradation, other microbes use them in an organized fashion, i.e., orchestrating them into large protein complexes, called cellulosomes, through scaffolding (Sca) proteins. The former are called free acting hydrolases (FAC), and the latter are called cellulosome dependent hydrolases (CDC) [4, 7]. Some anaerobic microbes use both systems for biomass degradation, while most of the other cellulolytic microbes use only one of them. When degrading biomass, cellulosomes are generally attached to their host cell surfaces by binding to the cell surface anchoring (SLH) proteins. The general observation has been that cellulosomes are more efficient than free acting cellulases in degrading biomass into short-chain sugars. Our goal in this computational study is to identify and characterize all the component proteins of the biomass degradation system in an organism, which is called the glydrome of the organism.
We have systematically re-annotated and analyzed the functional domains and signal peptides of all the proteins in the UniProt Knowledgebase and the JGI Metagenome database, aiming to identify novel glycosyl hydrolases or novel mechanisms for biomass degradation. Based on their domain compositions, we have classified all the identified glydrome components into five categories, namely FAC, WGH, CDC, SLH and Sca. To our surprise, two less well-studied glycosyl hydrolysis systems were found to be widely distributed in 63 bacterial genomes. In these systems, (a) glycosyl hydrolases may bind directly to the cell surfaces by their own cell surface anchoring domains rather than through those in the cell surface anchoring proteins, or (b) cellulosome complexes may bind to the cell surface through novel mechanisms other than the previously observed SLH domains. Our analyses also suggest that animal-gut metagenomes are significantly enriched with novel glycosyl hydrolases. All the identified glydrome elements are organized into an easy-to-use database, GASdb, at http://csbl.bmb.uga.edu/~ffzhou/GASdb/.
We downloaded the UniProt Knowledgebase release 14.8 (Feb 10, 2009) with 7,754,276 proteins, and all 46 metagenomes from the JGI IMG/M database with 1,504,133 proteins. The three simulated metagenomes in the database were excluded from our analysis.
We identified the signal peptides and analyzed the functional domains of all the proteins using SignalP version 3.0 [13, 14] and Pfam version 23.0. A protein is defined as a cell surface anchoring protein if it has one SLH domain and one Cohesin domain; a scaffolding protein has at least three Cohesin domains, or one Cohesin domain and one carbohydrate binding domain; a cellulosome dependent catalytic protein has one catalytic domain and one dockerin domain; a free acting catalytic protein has one catalytic domain and one CBM domain; and all the other proteins with one catalytic domain are defined as weak catalytic proteins.
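The classification rules above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the pipeline used to build GASdb: the domain labels (`SLH`, `Cohesin`, `Dockerin`, `CBM*`), the set of catalytic families and the order in which the rules are tested are hypothetical, and "one domain" is read here as "at least one".

```python
from collections import Counter

# Hypothetical set of catalytic (glycosyl hydrolase) domain labels.
# The source does not list the families used, so these are placeholders.
CATALYTIC_FAMILIES = frozenset({"GH5", "GH9", "GH10"})


def classify(domains, catalytic_families=CATALYTIC_FAMILIES):
    """Return a glydrome category for a protein given its domain labels.

    Assumptions: "one X domain" means "at least one", and rules are
    tested in the order SLH, Sca, CDC, FAC, WGH (precedence is not
    specified in the text).
    """
    counts = Counter(domains)
    n_cohesin = counts["Cohesin"]
    has_slh = counts["SLH"] > 0
    has_dockerin = counts["Dockerin"] > 0
    has_cbm = any(d.startswith("CBM") for d in domains)
    has_catalytic = any(d in catalytic_families for d in domains)

    if has_slh and n_cohesin >= 1:
        return "SLH"   # cell surface anchoring protein
    if n_cohesin >= 3 or (n_cohesin >= 1 and has_cbm):
        return "Sca"   # scaffolding protein
    if has_catalytic and has_dockerin:
        return "CDC"   # cellulosome dependent catalytic protein
    if has_catalytic and has_cbm:
        return "FAC"   # free acting catalytic protein
    if has_catalytic:
        return "WGH"   # weak catalytic protein
    return None        # not a glydrome component under these rules


if __name__ == "__main__":
    for doms in (["SLH", "Cohesin"], ["GH5", "CBM3"], ["GH9"]):
        print(doms, "->", classify(doms))
```

Under this reading, a hydrolase that carries an SLH domain but no Cohesin domain is still assigned to FAC or WGH, so its SLH domain would have to be recorded separately.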
We calculated the percentages of glydrome components only in genomes with at least 1,000 proteins, since most of the others may not have been completely sequenced. Three-dimensional protein structures were predicted using LOMETS. Gene Ontology annotations of the proteins were predicted using PFP.
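The percentage calculation with its 1,000-protein cutoff could look roughly like the following. The genome names and counts in the example are invented for illustration and do not come from GASdb.

```python
def glydrome_percentages(genomes, min_proteins=1000):
    """Map genome name -> percentage of proteins that are glydrome components.

    `genomes` maps a name to (total_proteins, glydrome_components).
    Genomes below `min_proteins` are skipped, since they may be incomplete.
    """
    percentages = {}
    for name, (n_proteins, n_components) in genomes.items():
        if n_proteins < min_proteins:
            continue
        percentages[name] = 100.0 * n_components / n_proteins
    return percentages


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    example = {
        "genome_A": (4000, 100),
        "genome_B": (2500, 25),
        "genome_C": (600, 30),   # excluded: fewer than 1,000 proteins
    }
    for name, pct in glydrome_percentages(example).items():
        print(f"{name}: {pct:.2f}%")
```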
To make the annotated glydromes easily accessible, a database, GASdb, was constructed using the PHP scripting language.
A total of 4,616 FACs were identified from the 7.75 million proteins in the UniProt Knowledgebase (release 14.8) [see Additional file 1]. The majority of them, 2,774 (61.71%), are from bacterial genomes. 1,019 FACs are found in the phylum Firmicutes, which includes a number of well-studied cellulolytic organisms such as Anaerocellum thermophilum, Caldicellulosiruptor saccharolyticus and Clostridium thermocellum [20, 21]. In addition, a large number of FACs are found in each of two other phyla, namely Bacteroidetes (342 FACs) and Actinobacteria (425 FACs). Overall, these three phyla harbour 64.38% (~1,786/2,774) of our identified bacterial FACs, compared with the 25.12% of all bacterial genomes that these phyla cover.
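The share of bacterial FACs held by the three phyla can be checked with a few lines of arithmetic. The counts are taken from the paragraph above; the fold-enrichment figure at the end is a derived illustration and is not a value reported in the text.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# FAC counts per phylum, as reported in the text.
fac_by_phylum = {"Firmicutes": 1019, "Bacteroidetes": 342, "Actinobacteria": 425}
bacterial_facs = 2774          # all bacterial FACs
genome_share = 25.12           # % of bacterial genomes in these three phyla

three_phyla_facs = sum(fac_by_phylum.values())
fac_share = percent(three_phyla_facs, bacterial_facs)
# Derived illustration only: ratio of FAC share to genome share.
fold = fac_share / genome_share

print(f"{three_phyla_facs} FACs = {fac_share:.2f}% of bacterial FACs "
      f"({fold:.2f}-fold over the genome share)")
```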
The previous observation has been that a functional cellulosome consists of at least one cell surface anchoring protein with SLH domains, at least one scaffolding protein and a number of cellulosome dependent glycosyl hydrolases [3, 8, 22, 23]. Our search and analysis results indicate that novel biomass-degradation mechanisms may exist in the genomes or metagenomes that we analyzed, the details of which will need further study. For example, Clostridium acetobutylicum was known to encode a scaffolding protein and a few cellulosome dependent enzymes, but it is not clear how the cellulosome is anchored to the cell surface [24, 25], as no SLH domains were identified in the genome [see Additional file 1]. A similar question holds for four other Firmicutes, i.e. Clostridium cellulolyticum, Clostridium cellulovorans, Clostridium josui and Ruminococcus flavefaciens. Unexpectedly, the scaffolding proteins in all these genomes except for Ruminococcus flavefaciens encode a domain of unknown function (PF03442: DUF291). Our data support the previous observation that the four DUF291 domains in the C. cellulovorans scaffolding protein CbpA are possibly involved in anchoring the cellulosome on the cell surface.
Overall, a large number of glycosyl hydrolases without carbohydrate binding domains or dockerin domains were identified in the bacterial genomes. More than 2,000 WGHs are found in each of the following four phyla: Proteobacteria (10,442 WGHs), Firmicutes (6,084 WGHs), Bacteroidetes (2,885 WGHs) and Actinobacteria (2,371 WGHs). The top three bacterial genomes with the highest percentages of glycosyl hydrolases (FACs, WGHs and CDCs) are Bacteroides intestinalis DSM 17393 (5.11%), Bacteroides ovatus ATCC 8483 (4.49%) and Bacteroides thetaiotaomicron (4.40%).
Eighteen FACs were identified in six genera of Archaea, i.e. Thermococcus, Halobacterium, Pyrococcus, Thermofilum, Caldivirga and Haloferax [see Additional file 1], covering 11 genomes. Each of these 11 archaeal genomes encodes 1-3 FACs together with up to 28 WGHs. FACs were known to be encoded in four archaeal genomes, i.e. Halobacterium mediterranei, Pyrococcus furiosus [31, 32], Pyrococcus kodakaraensis and Ferroplasma acidiphilum strain Y. Three of them are in our list. The glycosyl hydrolase in Ferroplasma acidiphilum strain Y is missing from our database because our annotation is based on knowledge from two databases, CAZy and Pfam, neither of which includes this enzyme. Fourteen of the 18 identified FACs are homologous to each other, with NCBI BLAST E-values < 1e-132, in different species of the same genus, suggesting that these enzymes have been present in the 11 archaeal genomes since at least before the divergence of these species.
385 proteins are annotated as WGHs in the 93 genomes from 30 archaeal genera. No cellulosome components were found in any of the archaeal genomes.
A total of 1,824 FACs are found in the 1,668 eukaryotic genomes covering 23 phyla, 62.23% (1,135/1,824) of which are from fungal genomes. The green plant phylum Streptophyta (664 FACs) contributes 36.40% of the FACs. All the other phyla encode fewer than 100 FACs. Four plant genomes encode more than 45 FACs: Oryza sativa subsp. japonica (Rice) (99 FACs), Vitis vinifera (Grape) (71 FACs), Arabidopsis thaliana (Mouse-ear cress) (65 FACs) and Zea mays (Maize) (47 FACs). The remaining 25 non-fungal FACs are encoded in 5 unicellular algae and 6 animal genomes.
A total of 17,048 WGHs are found in the 1,668 eukaryotic genomes. The top three phyla in the numbers of FACs are also the top three in the numbers of WGHs: 2,328, 5,444 and 5,171 WGHs are encoded in the phyla Arthropoda, Ascomycota and Streptophyta, respectively. The top four eukaryotic genomes in the numbers of WGHs are from the phylum Streptophyta: Oryza sativa subsp. japonica (Rice) (828 WGHs), Arabidopsis thaliana (Mouse-ear cress) (678 WGHs), Vitis vinifera (Grape) (602 WGHs) and Zea mays (Maize) (284 WGHs).
It is interesting to observe that there are 272 and 224 WGHs in the human and mouse genomes, respectively. Besides two other plant genomes, i.e. Oryza sativa subsp. indica (Rice) (258 WGHs) and Physcomitrella patens subsp. patens (Moss) (226 WGHs), all the other 6 eukaryotic genomes encoding more than 200 WGHs are from the fungal phylum Ascomycota. No cellulosome components were identified in the eukaryotic genomes. 200 (~73.53%) of the human WGHs are homologous to mouse WGHs, with NCBI BLAST E-values < 1e-23. The majority of these enzymes have therefore been present in the human and mouse genomes since at least before their divergence 75 million years ago.
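A homology count of this kind can be obtained by filtering BLAST hits on their E-values. The sketch below assumes BLAST tabular output with the standard 12 columns (E-value in column 11); the exact BLAST settings behind the reported numbers are not stated in the text, and the protein identifiers in the example are invented.

```python
def homologous_queries(blast_tabular, max_evalue=1e-23):
    """Return the set of query IDs with at least one hit below `max_evalue`.

    Expects tab-separated BLAST tabular output with the standard 12
    columns: query ID in column 1 and E-value in column 11.
    """
    queries = set()
    for line in blast_tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            queries.add(query_id)
    return queries


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    rows = [
        ("humanA", "mouseX", "85.0", "300", "45", "0", "1", "300", "1", "300", "1e-150", "520"),
        ("humanB", "mouseY", "40.0", "120", "72", "3", "5", "124", "9", "128", "2e-10", "60"),
        ("humanC", "mouseZ", "70.0", "250", "75", "1", "1", "250", "1", "250", "3e-40", "160"),
    ]
    example = "\n".join("\t".join(row) for row in rows)
    print(sorted(homologous_queries(example)))

    # Fraction reported in the text: 200 of 272 human WGHs.
    print(f"{100 * 200 / 272:.2f}%")
```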
Overall, 63 FACs and 6,072 WGHs are found in 42 of the metagenomes; the exception is TM7b, which was sampled from the human mouth. The top two metagenomes in the numbers of glycosyl hydrolases are from termite guts (12 FACs and 1,150 WGHs) and diversa silage soil (13 FACs and 820 WGHs). Since the number of proteins in a metagenome varies from 452 in termite gut fosmids to 185,274 in the diversa silage soil, we calculated the percentage of glycosyl hydrolases in each metagenome. On average, 0.65% of the proteins in a metagenome are glycosyl hydrolases. We noted that all the metagenomes in which more than 1% of the proteins are glycosyl hydrolases are from animal guts (including human, mouse and termite). This is confirmed by an independent study using BLAST mapping. No cellulosome components were identified in any metagenome.
The majority (52.90%) of the glycosyl hydrolases (including FACs, CDCs and WGHs) in our database are encoded by the 1,771 bacterial genomes. The 1,668 eukaryotic genomes contribute 34.98% of the total glycosyl hydrolases. Glycosyl hydrolases are therefore much more enriched in bacteria than in eukaryotes, considering the substantially larger sizes of eukaryotic genomes. Cellulosome components are observed only in Firmicutes, except for the CDC xynB (Q7UF11) from Rhodopirellula baltica. None of the other glycosyl hydrolases have dockerin domains, and they were annotated as FACs or WGHs. Although the catalytic domain and the CBM domain of a glycosyl hydrolase can function independently, the CBM domain is known to play an important role in the catalytic efficiency of a glycosyl hydrolase [5, 6]. The annotated FACs may therefore have higher catalytic efficiency.
A cell surface anchoring protein binds to the cell surface through its two or three SLH domains, and binds to the cellulosome scaffolding proteins together with the CDCs through the interacting pairs of cohesin domains and dockerin domains. It is unexpected to find SLH domains in an additional 5 FACs and 5 WGHs of Paenibacillus sp. JDR-2, as the only previous related observation is Q53I45 (XynA) in the Paenibacillus sp. JDR-2 genome. We believe that these glycosyl hydrolases may bind to the cell surface through their own SLH domains, as Paenibacillus sp. JDR-2 encodes SLH proteins but no scaffolding proteins or CDCs. It would be interesting to study how Paenibacillus sp. JDR-2 acquired the SLH proteins or lost the other cellulosome components. We noticed that this is not a unique feature of Paenibacillus sp. JDR-2, as there are 26 FACs and 52 WGHs with SLH domains in other organisms, all of which are bacteria except for the moss Physcomitrella patens. Many of these enzymes have been experimentally confirmed to anchor on the cell surface through the SLH domains, e.g. the cell surface xylanase xyn5 (Q8GHJ4) from Paenibacillus sp. W-61 [38, 39], the extracellular endoglucanase celA (Q9ZA17) from Thermoanaerobacterium polysaccharolyticum and the endoxylanase (Q60043) from Thermoanaerobacterium sp. strain JW/SL-YS 485.
We collected 41 proteins that are encoded in the same operons as the components of the Clostridium acetobutylicum glydrome but are not in GASdb. Sixteen of these proteins cover the following functional categories: binding (GO:0005488), catalytic activity (GO:0003824) and transporter activity (GO:0005215), and the remaining 25 are hypothetical or uncharacterized proteins. Only five proteins were annotated as being involved in glycosyl hydrolysis, e.g. carbohydrate binding (GO:0030246) or hydrolase activity (GO:0016787). Three of the five proteins missing from GASdb, i.e. Q97EZ1, Q97FI9 and Q97TI3, do not have recognizable Pfam domains related to glycosyl hydrolysis. Q97TP4 is annotated as an esterase (family 4 CE). The cellulosome integrating protein Q97KK4 has only one Cohesin domain, occupying ~77.35% (140/181) of its total length, and might have been inactivated by domain deletion.
In general, the glycosyl hydrolases and the cellulosome components attack the biomass after they are secreted outside the cells and properly assembled [23, 47], and hence we would expect them to have signal peptides. However, the majority of the annotated glycosyl hydrolases do not have any signal peptides, based on the predictions of SignalP 3.0 [13, 14]. We found that over 65% of WGHs across all organisms except for Eukaryota do not have predicted signal peptides, suggesting the possibility that these proteins use a novel secretion mechanism.
The ratio between the numbers of WGHs and FACs in a glydrome tends to be no more than 30. We calculated this ratio for each glydrome in a genome or metagenome with at least 1,000 proteins and at least one FAC and one WGH. We observed that the average ratios between the numbers of WGHs and FACs are 9.98, 12.55 and 14.40 for archaea, bacteria and eukaryota, with standard deviations of 8.22, 16.65 and 12.25, respectively. Over 90% of the glydromes in each of archaea, bacteria and eukaryota have a ratio lower than 30. It is surprising to find that the metagenomes encode 95.38 times more WGHs than FACs but no cellulosome components. We speculate that these WGHs may use some novel CBM domains in these metagenomes. An alternative hypothesis could be that microbes in a community generously secrete WGHs to degrade biomass and live on the hydrolysis products in the nearby regions only.
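The ratio analysis above can be sketched as follows. The input tuples are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def wgh_fac_ratios(glydromes, min_proteins=1000):
    """Return WGH/FAC ratios for glydromes that pass the filters.

    Each glydrome is (total_proteins, n_fac, n_wgh). A glydrome is kept
    only if it has at least `min_proteins` proteins, one FAC and one WGH.
    """
    ratios = []
    for n_proteins, n_fac, n_wgh in glydromes:
        if n_proteins >= min_proteins and n_fac >= 1 and n_wgh >= 1:
            ratios.append(n_wgh / n_fac)
    return ratios


# Hypothetical glydromes, for illustration only.
example = [
    (3200, 4, 40),
    (5100, 2, 30),
    (800, 1, 50),    # dropped: fewer than 1,000 proteins
    (2500, 0, 12),   # dropped: no FAC
    (4000, 1, 45),
]
ratios = wgh_fac_ratios(example)
share_below_30 = sum(r < 30 for r in ratios) / len(ratios)

if __name__ == "__main__":
    # stdev() is the sample standard deviation (an assumption, see above).
    print(f"mean={mean(ratios):.2f} sd={stdev(ratios):.2f} "
          f"below 30: {100 * share_below_30:.1f}%")
```

For the pooled metagenome counts reported earlier (63 FACs and 6,072 WGHs), the excess of WGHs over FACs is (6,072 - 63)/63 ≈ 95.38, which matches the figure quoted above.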
We conducted the first large-scale annotation of glydromes in all the sequenced genomes and metagenomes, and made a number of interesting observations about them. Among these, two less well-studied glydromes were observed in dozens of organisms: A) glycosyl hydrolases were found to have cell surface anchoring domains and can bind to the cell surfaces by themselves; and B) Clostridium acetobutylicum and four other bacteria from the phylum Firmicutes encode all cellulosome components except for the cell surface anchoring SLH proteins, suggesting that the cellulosomes may link to the cell surfaces through some novel mechanisms. Individual cases have been experimentally observed, but further studies are needed to uncover the underlying mechanisms and how they evolved into the current glydrome structures. Our data also suggest that the animal gut metagenomes are rich in novel glycosyl hydrolases, providing new targets for further experimental studies.
Project name: GASdb;
Project home page: http://csbl.bmb.uga.edu/~ffzhou/GASdb/;
Operating systems: Platform independent;
Programming language: Perl, PHP, Apache
Restrictions to use by non-academics: none.
This work is supported in part by the grant for the BioEnergy Science Center, which is a U.S. Department of Energy BioEnergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science, the National Science Foundation (DBI-0354771, ITR-IIS-0407204, DBI-0542119, CCF0621700), National Institutes of Health (1R01GM075331 and 1R01GM081682) and a Distinguished Scholar grant from the Georgia Cancer Coalition. We'd like to thank Dr Yanbin Yin for his helpful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.