Comprehensive molecular biological approaches, such as transcriptome or proteome analysis, are essential for understanding infection caused by virulent organisms, including GAS. Most post-genomic analysis relies on annotations derived from genome research. However, as mentioned above, previous genome analysis identified a number of "hypothetical proteins" that possibly represent unrecognized CDSs. Typical genome analysis uses a similarity-based search procedure: a query sequence derived from a list of ORFs in a genome is searched against a database of known amino acid sequences. These databases, such as NCBInr, have grown exponentially. Several genomes have been re-evaluated semi-automatically with programs developed for gene identification [3, 5–7]. In an intra-species genomic overview of S. pyogenes, gene prediction fell largely into two groups depending on whether the gene predictor ERGO was used (Additional file 1) [32–35]. Genes were predicted by ERGO in seven out of 13 S. pyogenes genome analyses, with an average CDS coverage of 89.05% of the genome and an average protein-coding gene length of 861 bp. In contrast, other gene prediction programs were used in the other five analyses, giving an average CDS coverage of 86.61% of the genome and an average protein-coding gene length of 890 bp. This suggests that the ERGO system predicted shorter ORFs than the other gene predictors. The ERGO system may have over-predicted genes that the other gene predictors dismissed. The trade-off between unrecognized ORFs and over-prediction of genes should be resolved using experimental evidence.
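The two summary statistics compared above (CDS coverage of the genome and average protein-coding gene length) can be computed from a list of CDS coordinates. The following is a minimal sketch, not the procedure used to build Additional file 1: the function name `cds_summary`, the 1-based inclusive coordinate convention, and the decision to count overlapping bases only once in the coverage figure are all assumptions made for illustration.

```python
def cds_summary(genome_length, cds_intervals):
    """Return (coverage_percent, mean_cds_length_bp).

    cds_intervals: list of (start, end) pairs, 1-based and inclusive
    (an assumed convention). Overlapping CDSs are merged for the
    coverage figure so that shared bases are counted once (assumption).
    """
    if not cds_intervals:
        return 0.0, 0.0

    # Merge overlapping intervals before summing covered bases.
    merged = []
    for start, end in sorted(cds_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start + 1 for start, end in merged)

    # Mean length uses every CDS individually, overlaps included.
    lengths = [end - start + 1 for start, end in cds_intervals]
    return 100.0 * covered / genome_length, sum(lengths) / len(lengths)


# Toy example with made-up coordinates (not data from any GAS genome).
coverage, mean_len = cds_summary(1000, [(1, 300), (201, 500), (601, 900)])
```

A predictor that splits or adds short ORFs raises the coverage while lowering the mean length, which is the pattern described for the ERGO-annotated genomes.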
In fact, methods for gene prediction have been developed, and novel CDSs have been found by experimentally supported approaches [2, 8, 13]. Dandekar et al. revised the Mycoplasma pneumoniae genome and increased the total number of ORFs from 677 to 688 by integrating a gene-identifying program with proteomic experiments. They found 10 new CDSs in intergenic regions; two were identified by two-dimensional gel electrophoresis followed by mass spectrometry, and one ORF was dismissed. The public genome annotation (GenBank: U00089) was revised based on this study. In Pseudomonas fluorescens PF0-1, Kim et al. searched for unrecognized genes using cell fractionation data (global, soluble, and insoluble) followed by off-line two-dimensional liquid chromatography combined with tandem mass spectrometry. They found 16 novel genes, of which six were in intergenic regions, nine overlapped with predicted genes on the antisense strand, and one overlapped with a predicted gene in another reading frame in the same direction. Payne et al. evaluated the genomes of Yersinia pestis by proteomic analysis to complement genome annotation, and the annotations of 21 other Yersinia genomes in public databases were improved, including four new CDSs. An excellent application of proteomics to genome annotation was provided for the hyperthermophilic crenarchaeon Aeropyrum pernix. The number of proteins encoded by A. pernix has been a matter of some debate because of its high GC content and codon usage. Proteomic analysis of this archaeon provided useful information, including 19 newly identified CDSs. The results of proteomic analysis were used as a reliable index for the development of further gene annotation methods. In S. pyogenes, a number of CDSs remain annotated as "(conserved) hypothetical proteins", even though 13 intra-species genomes have been sequenced. Although strain SF370 is widely used by many researchers, its annotation has remained almost the same as when it was first published in the public database.
We envisioned that the re-evaluation of the SF370 genome with proteomic experimental evidence would provide useful information.
We identified nine novel genes that were transcribed and translated in SF370, based on assignments of MS/MS spectra against a list of six-frame ORFs rather than a list of known CDSs. Two of these nine genes were identified in our previous report, and the transcription of both genes was verified by RT-PCR (Figure 1). OppA is believed to be a lipoprotein associated with virulence in mice. The oligopeptide permease complex consists of a periplasmic binding protein (OppA), two transmembrane proteins (OppB and OppC), and two membrane-associated cytoplasmic ATPases (OppD and OppF) encoded on a polycistronic operon. CsrR, also known as CovR, is a component of a two-component signaling system that responds to stressors such as temperature, salt concentration, pH, antibiotics, and iron starvation [38–40]. In addition, the CsrR/S system is known to regulate several virulence factors, such as the hyaluronic acid capsule, streptolysin S, streptokinase, and pyrogenic exotoxin B (SpeB). The CDS in ORF6306 encodes a fibronectin binding protein with a molecular weight of 85.1 kDa, and is believed to be involved in adhesion to host cell surfaces. Although two other fibronectin binding proteins, SPy0430 and SPy1013, were annotated in SF370, neither of them could be detected in our proteome analysis. ORF5890 contains a CDS that encodes a 96.7 kDa enzyme that is considered to be a bifunctional acetaldehyde-CoA/alcohol dehydrogenase (EC 220.127.116.11 and 18.104.22.168). Four proteins encoded by novel ORFs are believed to have relatively low molecular weights: ORF15403 (26.6 kDa), ORF5890 (22.6 kDa), ORF703 (20.7 kDa), and ORF106976 (11.5 kDa). The full length of ORF106976 corresponds to 105 amino acid residues. Although the homologous ORF was previously identified in MGAS315, the annotation for ORF106976 in SF370 was omitted, probably because of its short length.
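To make the idea of a six-frame ORF list concrete, the sketch below translates a DNA sequence in all six reading frames and keeps every stop-to-stop stretch above a minimum length. It is an illustration only, not the pipeline used in this study: the function names, the stop-to-stop definition of an ORF, the `min_aa` cutoff, and the use of the standard codon assignments are assumptions.

```python
from itertools import product

# Standard codon assignments, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_orfs(dna, min_aa=30):
    """List (strand, frame, peptide) for stop-to-stop ORFs in six frames.

    min_aa is a hypothetical length cutoff; codons containing
    ambiguous bases are translated as 'X'.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = "".join(
                CODON_TABLE.get(seq[i:i + 3], "X")
                for i in range(frame, len(seq) - 2, 3)
            )
            # Stop codons ('*') delimit candidate ORFs.
            for segment in protein.split("*"):
                if len(segment) >= min_aa:
                    orfs.append((strand, frame, segment))
    return orfs


# Toy sequence, not taken from the SF370 genome.
orfs = six_frame_orfs("ATGAAATTTGGGTAA", min_aa=4)
```

Searching spectra against such a list does not depend on prior gene calls, which is why CDSs absent from the published annotation can still be matched.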
Unexpectedly, relatively few (nine) novel CDSs were discovered in the SF370 genome, which possesses approximately 100 fewer CDSs than other GAS genomes. The number of new CDSs was comparable with previous reports [2, 8, 13]. In this study, two or more MS/MS spectra matching a unique peptide sequence in an ORF were used as the criterion for protein identification. Although the main goal of this study was a precise re-evaluation of the SF370 genome, this criterion may be too strict for short ORFs. A relaxed criterion, in which a protein is identified from a single MS/MS spectrum matching a unique peptide sequence, is worth considering when screening for unidentified CDSs using a six-frame database. Alternatively, we suggest that an analysis integrating proteomics and tiling DNA arrays should identify more of the short unrecognized ORFs. Although it would be easy to find unrecognized genes in a genome by several in silico strategies, such as intra-species genome comparison or searching with GO annotation, further experimental verification of the presence of the mRNA or proteins encoded by the genes is important. Proteomics-driven re-annotation with a six-frame database allows unrecognized genes to be identified and their gene products to be verified at the same time.
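The effect of the identification criterion on short ORFs can be illustrated with a small filter over peptide-spectrum matches. This is a sketch under stated assumptions rather than the software used here: the `(spectrum_id, peptide, orf_id)` record layout, the function name `identify_orfs`, and the reading of "unique peptide" as a peptide that maps to exactly one ORF are hypothetical, and the ORF identifiers and peptides are invented.

```python
from collections import defaultdict


def identify_orfs(psms, min_spectra=2):
    """Return the set of ORFs supported by a unique peptide.

    psms: iterable of (spectrum_id, peptide, orf_id) tuples (assumed layout).
    An ORF is accepted when at least one peptide that maps only to that
    ORF is matched by min_spectra or more distinct spectra.
    """
    peptide_orfs = defaultdict(set)  # peptide -> ORFs it maps to
    spectra = defaultdict(set)       # (ORF, peptide) -> matching spectra
    for spectrum_id, peptide, orf_id in psms:
        peptide_orfs[peptide].add(orf_id)
        spectra[(orf_id, peptide)].add(spectrum_id)

    identified = set()
    for (orf_id, peptide), ids in spectra.items():
        if len(peptide_orfs[peptide]) == 1 and len(ids) >= min_spectra:
            identified.add(orf_id)
    return identified


# Invented matches for illustration only.
psms = [
    ("s1", "PEPTIDEA", "ORF_A"),
    ("s2", "PEPTIDEA", "ORF_A"),
    ("s3", "PEPTIDEB", "ORF_B"),  # short ORF seen in a single spectrum
    ("s4", "SHAREDK", "ORF_C"),   # peptide shared by two ORFs
    ("s4", "SHAREDK", "ORF_D"),
    ("s5", "SHAREDK", "ORF_D"),
]
strict = identify_orfs(psms, min_spectra=2)
relaxed = identify_orfs(psms, min_spectra=1)
```

With the strict setting only `ORF_A` passes, whereas the relaxed setting also admits `ORF_B`. A short ORF yields few detectable peptides, so it is the kind of candidate the single-spectrum criterion would recover, at the cost of more false positives.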
The other aim of this study was to experimentally characterize hypothetical genes in GAS and to re-annotate hypothetical proteins by comprehensive analysis. Transcriptomic and/or proteomic analysis has been widely applied to generate functional annotations for hypothetical genes in many organisms [9–12]. Such analyses generated functional annotations for 54 CDSs (9.71% of HyPs) in Desulfovibrio vulgaris, 538 CDSs (33.1% of HyPs) in Shewanella oneidensis, and 129 CDSs (10.6% of HyPs) in the Haemophilus influenzae genome [9–11]. In the SF370 genome, approximately 40% of proteins had been annotated as "hypothetical" or "conserved hypothetical" proteins. We identified 126 hypothetical proteins in three cellular fractions under three different culture conditions. Proteomics-driven functional annotation can help not only to deduce the response of cells under stressful culture conditions, as in transcriptome analysis, but also to deduce the cellular location of the expressed proteins. Absolute quantification of proteins should establish whether the number of peptide sequences detected under each culture condition and in each cellular fraction reflects the abundance of a particular protein [42, 43]. Furthermore, homology search-based annotations, including GO, SignalP, and SOSUI, were integrated with the proteomic experimental evidence to annotate unrecognized proteins. This integrated functional annotation provided interesting information for unknown proteins. For example, SPy0843 was assigned the "cell" GO term and had a SignalP score of 0.898. In the proteomic analysis, this protein was identified only in the insoluble fraction, and was more abundant under the static and CO2 culture conditions than under shaking conditions.
It is speculated that the product of SPy0843 may be located in the cell membrane or cell wall, may be associated with the Sec pathway, and may be upregulated under non-shaking culture conditions. Another example is SPy0317, which was assigned the GO terms "cell envelope", "external encapsulating structure", "transport", and "transporter activity", was predicted by SOSUI to have one membrane-spanning domain, and had a SignalP score of 0.999. The product of SPy0317 was observed in all cellular fractions, and was relatively highly expressed under shaking culture conditions. It is speculated that SPy0317 is secreted via the Sec pathway and is involved in the transport of substances, especially under shaking culture conditions, which mimic mechanical or oxygenic stress. Other interesting examples were SPy1260 and SPy1262, which were identified with relatively high numbers of MS/MS spectra despite neither being assigned any GO terms. They merit further biochemical and biological investigation.
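The kind of per-protein summary used in these examples (in which fractions a protein was detected, and under which culture condition it yielded the most spectra) can be sketched as follows. This is an illustrative simplification, not the quantification method of this study: the function name, the use of raw spectral counts as an abundance proxy, and all counts shown are assumptions, and the numbers are invented rather than taken from the data set.

```python
def summarize_detection(counts):
    """Summarize where and when a protein was detected.

    counts: {(fraction, condition): spectral_count} (assumed layout).
    Returns (fractions with at least one spectrum, condition with the
    highest total spectral count, or None when there are no counts).
    """
    fractions = sorted({f for (f, _), n in counts.items() if n > 0})

    per_condition = {}
    for (_, condition), n in counts.items():
        per_condition[condition] = per_condition.get(condition, 0) + n

    top = max(per_condition, key=per_condition.get) if per_condition else None
    return fractions, top


# Invented counts for a hypothetical protein, for illustration only.
counts = {
    ("insoluble", "static"): 5,
    ("insoluble", "CO2"): 6,
    ("insoluble", "shaking"): 1,
    ("soluble", "static"): 0,
}
fractions, top_condition = summarize_detection(counts)
```

Such a summary only organizes the experimental evidence. Combining it with GO, SignalP, and SOSUI predictions to propose a localization or pathway, as done above, remains an interpretive step.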
A high degree of protein variation was observed in the supernatant compared to the insoluble and soluble fractions of the cell (Figure 2). Our previous reports suggested that stressors, such as the addition of antibiotics [39, 44], influenced the expression of extracellular proteins. These results suggest that GAS cells change their expression patterns of extracellular proteins when adapting to environmental stresses. In contrast to extracellular proteins, core proteins were easily identified in cell-body fractions under the different culture conditions. It is hypothesized that the protein components that we observed were a consequence of growth during the stationary phase of the cultures. For example, a previous report indicated that different culture atmospheres modulated surface structures. Bisno et al. reported that the expression level of the M protein in the cell wall-associated fraction was greater under 5% CO2 culture conditions. Our results were consistent with this finding (Additional file 4). Interestingly, the highest amounts of M protein in the supernatant were observed under shaking culture conditions. We speculate that the M protein is detached from the cell wall by the mechanical effects of shaking, although this should be investigated further.