The gene expression data of Mycobacterium tuberculosis based on Affymetrix gene chips provide insight into regulatory and hypothetical genes

Background Tuberculosis remains a leading infectious disease with global public health threat. Its control and management have been complicated by multi-drug resistance and latent infection, which prompts scientists to find new and more effective drugs. With the completion of the genome sequence of the etiologic bacterium, Mycobacterium tuberculosis, it is now feasible to search for new drug targets by sieving through a large number of gene products and conduct genome-scale experiments based on microarray technology. However, the full potential of genome-wide microarray analysis in configuring interrelationships among all genes in M. tuberculosis has yet to be realized. To date, it is only possible to assign a function to 52% of proteins predicted in the genome. Results We conducted a functional-genomics study using the high-resolution Affymetrix oligonucleotide GeneChip. Approximately one-half of the genes were found to be always expressed, including more than 100 predicted conserved hypotheticals, in the genome of M. tuberculosis during the log phase of in vitro growth. The gene expression profiles were analyzed and visualized through cluster analysis to epitomize the full details of genomic behavior. Broad patterns derived from genome-wide expression experiments in this study have provided insight into the interrelationships among genes in the basic cellular processes of M. tuberculosis. Conclusion Our results have confirmed several known gene clusters in energy production, information pathways, and lipid metabolism, and also hinted at potential roles of hypothetical and regulatory proteins.


Background
Knowledge about the genome sequence of Mycobacterium tuberculosis [1] has contributed to recent advancement in understanding the biology of this organism and its clinical relevance. Concurrent with this development, a highthroughput genome-wide gene expression analysis device in the form of microarrays has rapidly emerged as a seemingly indispensable tool for studying genomics in the modern era. These developments have brought about the revolutionary conception of new prophylactic and therapeutic interventions in the genomic perspective. Its significance should be clear, as tuberculosis is still causing millions of deaths in the world.
DNA microarrays have been applied to analyze M. tuberculosis. The first type of application focuses on genotyping, for example, species identification [2,3] and detection of drug-resistant mutants [4,5]. The second type of application seeks to explore altered gene expression and understand biological pathways in terms of up-regulated and down-regulated genes in certain conditions of interest, such as drug challenge [6], hypoxia [7], starvation [8], high temperature [9], and in vivo [10]. However, existing applications do not exploit the full potential of genomescale microarray analysis in configuring interrelationships among all genes in M. tuberculosis. We pioneered the approach that applied the Affymetrix M. tuberculosis Gene-Chip to gene expression analysis. Previously, this Gene-Chip was used for applications related to genotyping.
Our study is aimed to explore the whole-genome behavior of M. tuberculosis during log-phase growth by conducting a bioinformatics analysis on genome-wide gene expression data generated from microarray hybridization. Our results enrich the current understanding of genome functions based on sequence analysis and functional studies of individual genes in such aspects as deduction of possible roles of conserved hypothetical and regulatory proteins.

Active genes involved in growth
Research on M. tuberculosis has yet to answer the questions of how many and what genes are active during normal growth in a standard in vitro environment and how they are related to each other in a global genome-wide sense.
To answer these questions, we adopted the Affymetrix GeneChip system, which, based on a specific oligonucleotide array format, could provide the absolute signal intensity in a single condition as well as the signal ratio between two conditions. Furthermore, its built-in statistical algorithm computes the so-called Detection p-value that determines the presence or absence of any given mRNA. It is this feature that we capitalize on to explore the genomic behavior of M. tuberculosis. A gene is active when it is expressed. Gene activity is measured by the expression level (i.e., abundance of corresponding mRNA) detected by microarray hybridization in this study.
We found that about one-half of the genes in M. tuberculosis genome participated in bacterial growth during the log (exponential) phase in standard broth culture. The average and standard deviation of expression signal intensity for each active gene would reflect its relative level of activity and the range of variation (Table 1). It was noted that the extent of gene expression did not necessarily correlated with essentiality; for example, among the ten most expressed genes, only two were essential genes as defined by high density mutagenesis [11].

Functional genomic analysis
The microarray data of those genes involved in the in vitro growth of M. tuberculosis during log phase were analyzed through the hierarchical clustering algorithm of Eisen's Cluster program [12]. This cluster-analysis program allowed us to explore the internal structure of the data and derive useful information concerning the coordination and collaboration among the genes. A measure fundamental to clustering is that of similarity. We define similarity between genes by their correlation in terms of gene expression patterns across multiple samples. The dendrogram generated by the algorithm (Figure 1) was organized according to this measure and displayed in a way to optimize the similarity of adjacent elements (genes). In the dendrogram, more related genes were joined earlier, and several highly dense clusters with peaks spreading along were visible.
Based upon the above analysis, the dendrogram was partitioned, using a correlation threshold of 0.9, into a number of disjoint clusters related to functional classes known previously. Each cluster contained functionally related genes that might share similar functions or roles in physiology or metabolism and were likely subject to the same regulatory mechanism. For instance, one cluster was found to be similar to the FAS-II operon, and another cluster comprised largely of genes encoding ribosomal proteins (Table 2). Notice, however, genes of different classes may co-exist in the same cluster if they show correlation in their activity. This situation is exemplified by energy-dependent transportation across cell membrane, which necessitates the coupling between the gene class of cell processes and that of energy production.
The clustering results were further visualized through Eisen's TreeView program to generate a heat map where the brightness of the red color represented the intensity of gene expression ( Figure 1). Based on the cluster analysis results and the observation of four conspicuous shining bands, the map was tentatively divided into four zones from the top to the bottom, called: zones 1, 2, 3, and 4, in the consecutive order. Each zone contained clusters of genes that were strongly correlated in their expression patterns and in that sense, functionally related. Each zone in the map was represented by genes that expressed most (i.e., the brightest in red), Thus, the map zones were labeled according to the functional class (genolist.pasteur.fr/TubercuList/) of their representative genes as "intermediate and lipid metabolism", "energy", "information", and "cell wall, cell processes, and metabolism", respectively.
The two most conspicuous clusters in the first zone of the gene expression map consisted of a cluster represented by genes in the functional category of intermediate metabo-lism, such as cysH, ggtB, and nirA, and the other cluster represented by genes involved in the FAS-II cycle [6], such as accD6, kasA, and KasB. The formation of the cluster with emphasis on the FAS-II cycle reflects the importance of this pathway in M. tuberculosis growth. As for interme-diate metabolism, M. tuberculosis can metabolize many kinds of carbohydrates, hydrocarbons, alcohols, ketones and carboxylic acids [1]. This zone was adjacent to the energy zone, suggesting the close relationship between intermediate metabolism and energy production. The most prominent genes in the second zone of the gene expression map were the ATP synthase gene complex, which produces ATP from ADP and is critical in energy metabolism. In addition, genes encoding enzymes involved in the respiratory chain, such as the nuo (NADHubiquinone oxidoreductase) gene complex, were clustered into this zone. In another study [13], ATP synthase and nuo gene complexes were found to be down-regulated together. Thus, they may be co-regulated.
The third zone of the gene expression map was represented by genes involved in the information pathways, e.g., dnaQ (DNA polymerase III), recA (recombinase), rpoA (DNA-directed RNA polymerase), and the 30S and 50S ribosome protein gene complexes (rpsB, rpsC, rplB, rplC, etc.). As these genes play a vital role in genetic information replication, transcription, and translation, their expression is essential for maintaining bacterial growth. We noticed that the gene Rv2258C (encoding a possible transcriptional regulatory protein) correlated well with a group of ribosomal protein genes. This result is related to a report that the production of ribosomes is increased through the transcriptional regulation of genes encoding ribosomal proteins during the growth phase of yeast [14].
The last zone of the gene expression map was represented by genes classified under the category of cell wall, cell processes, and metabolism. This category comprises membrane proteins and proteins involved in cell processes, including secreted and transmembrane proteins [15], as well as enzymes involved in intermediate metabolism. Genes and their protein derivatives located in this zone are related to cell wall synthesis, transportation of organic and inorganic substance across membrane, and immunological responses, such as narK2 (a nitrate/nitrite transporter), fbpC (mycolyl transferase), hsp (a protein induced by heat stress), ald (a secreted enzyme), ctpF (a metal cation transporter), sodA (superoxide dismutase, which destroys radicals), ompA (an outer membrane protein), amt (an ammonium-transport integral membrane The gene expression map of genes involved in the log-phase growth of Mycobacterium tuberculosis Rv3582c ASK
The most important feature of mycobacterial cell wall is the substantial amount (up to 60% of the total mass) of lipid components, particularly, the very long chain mycolic acids, which are combined with surface glycolipids to form a pseudolipid bilayer [16]. Since the cell wall synthesis involves lipid metabolism [6], the main cluster in zone 4 also contains some genes in this class, such as fadE26 (encoding acyl-CoA dehydrogenase) and fadD16 (encoding fatty-acid-CoA ligase).
Each zone was tentatively named according to the most expressed genes it contained. The semantics of each zone would further depend on whether each zone represents a concentration of genes in the same functional categories and that concentration is consistent with the function classification of the most expressed genes in that zone. So we further analyzed the percentage distribution of functional classes for the genes in the major clusters of each zone on the gene expression map ( Table 3). The functional classification of a gene is based on current knowledge in the field and can be accessed from the TubercuList Server (genolist.pasteur.fr/TubercuList/). Our analysis indicated that each zone contained clusters of genes of Cell wall, cell processes, and metabolism The whole-genome gene expression image (Figure 1 There were about a dozen of regulatory genes associated with the major functional clusters identified in this study (Table 4). Their significance is reflected by their possible roles reported in other studies. These genes likely regulate other genes in the same cluster at the transcriptional level and deserve to be examined in more details. Regulatory genes associated with small clusters are considered less important and hence not examined here.
The Rv2989 gene was associated with the cluster characterized by energy metabolism and respiration in our data.
The association relation appears to be consistent with reports that this gene is up-regulated at high temperatures [9] and down-regulated after starvation [8].
Rv3334 was up-regulated at high temperatures [9] as well as after starvation [8]. This gene is probably in the MerR (mercury resistance) family and its protein is similar to many regulatory proteins in sequence.
Rv0081 can be induced by hypoxia [7]. Its presence in growing bacterial cells is called into question. However, this gene is only weakly expressed in the present study, and likely to be up-regulated if oxygen is depleted.
The whiB4 gene encodes a protein homologous to a Streptomyces sporulation factor [17], and the gene is up-regulated after starvation [8], suggesting a possible link between starvation and sporulation. In addition, the association of whiB4 with the class of cell wall and cell processes makes sense from the point that sporulation could potentially involve cell membrane.
Rv0653c is probably in the TetR family. Its significance is reflected by the fact that proteins in this family are involved in the transcriptional control of multi-drug efflux pumps and pathogenicity [18]. Another regulatory gene, Rv0165c, is probably in the GntR family. In E. coli, GntR regulates gluconate uptake and catabolism as a repressor [19]. Both Rv0653c and Rv0165c are apparently involved in a membrane-associated cellular process. Another gene worth attention is Rv1931c, which regulates genes important for virulence of M. tuberculosis [20], pos- Probable transcriptional regulatory protein probably MerR-family Cell wall, cell processes, and metabolism Rv0465c Probable transcriptional regulatory protein Ditto Rv0081 Probable transcriptional regulatory protein Ditto Rv3681c Probable transcriptional regulatory protein Ditto Rv0653c Possible transcriptional regulatory protein probably TetR-family Ditto Rv0165c Possible transcriptional regulatory protein probably GntR-family Ditto Rv1931c Probable transcriptional regulatory protein Ditto Rv1151c Probable transcriptional regulatory protein Ditto Rv1956 Possible transcriptional regulatory protein Ditto  sibly via a cellular process that translates extracellular stimuli into a transcriptional signal.

Discussion
The availability of the complete genome sequence of Mycobacterium tuberculosis [1] combined with rapidly emerging microarray technology [21] has catalyzed the process of understanding the bacterial biology and pathogenicity and expedited the development of new diagnostics and therapeutics for tuberculosis. The microarray approach has enabled high-throughput gene expression analysis on a genomic scale in a field known as functional genomics [22]. In particular, elucidation of functional relationships among genes based on genome-wide gene expression data from DNA microarray hybridization has been successfully demonstrated for eukaryotes, notably yeast [12,23], but it has not been done for M. tuberculosis.
To date, the functional classification of genes in M. tuberculosis is mainly based on the biological study of individual genes as well as sequence analysis and comparison with homologous genes in other bacteria. In this study, we provide a comprehensive analysis that addresses this issue from the perspective of functional genomics.
In the application to M. tuberculosis, DNA microarrays have been used for comparing species, detecting drugresistant mutants, and studying biological behavior under various conditions. In general, there are two computational paradigms for microarray data analysis. The first paradigm is to identify genes differentially expressed across two conditions; the second paradigm is to identify genes expressed in a coordinated manner that share common roles in cellular physiology or metabolism. Most of the applications for M. tuberculosis to date are based on the first paradigm, whereas our study described here is based on the second paradigm.
In this study, active genes were identified by means of the Affymetrix GeneChip. As its unique feature, the Affymetrix system uses multiple oligonucleotide probes for implementing each gene sequence to be interrogated. Furthermore, the system is capable of analyzing the presence or absence of each mRNA. In contrast to the cDNA microarray system that is focused on differential gene expression across two conditions, the Affymetrix system can calculate gene expression in a single condition and compare gene expression across multiple conditions. In this way, the Affymetrix system is more flexible and informative. The flexibility can be attributed to the use of PM/MM probes, instead of two explicitly defined external conditions, for implementing the test/control mechanism in microarray hybridization.
Our method for analysis of in vitro genomic activity of M. tuberculosis can be extended to study functional genomics in vivo. Understanding what genes are switched on or off between in vitro and in vivo conditions would shed light on issues, such as how biological adaptation leads to bacterial latency, why there is discrepancy between laboratory sensitivity and clinical efficacy, and so on. Thus, the functional-genomics data obtained in this work can serve as a reference for interpreting data generated in other contexts. Our genomic analysis was based on multiple RNA samples extracted during log-phase growth in contrast to other coordinated gene expression analyses based on samples collected in a time course or under different conditions. Our experiment design is justified, given the fact that the expression level of any gene has considerable fluctuation from time to time (Table 1) in log-phase growth, as evidenced from the observation that the standard deviation of a gene expression was sometimes greater than 50% of its mean across samples. Variation in gene expression across different time settings enabled the correlation among different genes to be analyzed. The validity of our experiments is supported by reconfirmation of several known growth-related gene clusters. However, our approach is not applicable to samples collected during stationary phase, when little variation in gene expression is expected across samples.
The in vitro broth culture condition has often been used as the reference condition to study the gene expression of M. tuberculosis in other conditions, such as hypoxia and starvation. However, the present study is the first to explore the functional genomics of this organism grown in log-phase culture. Bacterial growth can be divided into four different phases: lag phase, exponential or log phase, stationary phase, and death phase. It is the log phase that we focused our study on. During this phase, high growth activity is evidenced by our data showing that about half of the genes in the genome were expressed. In contrast, many genes in M. tuberculosis are repressed during the stationary phase, a condition similar to but milder than the non-replicating state of tubercle bacilli in an anaerobic condition [24]. In particular, the dormancy regulatory gene, dosR is weakly induced during the stationary phase while strongly induced in an anaerobic non-replicating state. An interesting finding based on our work is that dosR is always moderately expressed even in the log phase, suggesting its possible housekeeping role. Our data further showed that an important gene, acr (hspX), which is induced under hypoxia [7] and starvation [8], was always expressed in the log phase. Global gene expression profiling analysis of M. tuberculosis in mouse [25] and human tissue [10] indicated that lipid metabolism was critical for the bacilli to survive in the host environment. In these conditions, isocitrate lyase (ICL), an enzyme of the glyoxylate shunt (a pathway alternative to the tricarboxylic acid cycle) and related to mycobacterial persistence in macrophages [26], is up-regulated. It is consistent with our find-ing that icl is weakly expressed or absent in the log-phase culture.
The extent to which gene expression profiles across a set of independently collected samples suffice to separate genes into functional clusters in consistency with prior knowledge is attributable to the rigorous statistical model built in the Affymetrix system. Several familiar gene groups with clear designated functions, such as electron transport, protein synthesis and type II fatty acid synthesis, were observed in the data, offering credence to our analysis. However, genes associated with different functional classes would be placed in the same cluster if they appear to co-express. This implies that genes can share commons roles while differing in their functions, as illustrated by the earlier example of energy-dependent transportation across the cell membrane. In fact, comparing gene clusters based on gene expression with gene classes based on sequence analysis would offer new opportunities for redefining interrelationships among genes in the genome.
Gene clusters built out of expression profiles can be configured as functional linkage networks among genes but these clusters do not correspond directly to protein networks [10] constructed using a combination of Rosetta stone, phylogenetic profile, conserved gene neighbor, and operon computational methods. Genes, which share similar biological functions, may operate at different stages of the cell cycle or become active under different conditions, and hence have different expression profiles [10]. However, some gene families, in particular those encoding ribosome proteins are closely linked in both gene-expression and protein networks, as seen in our data.
There remained 1051 so-called conserved hypothetical genes/proteins annotated in the genome sequence of M. tuberculosis [15], among which many were found to be active in our analysis, attesting to the predictive power of the genome annotation program [1]. More importantly, this finding suggests that many hypothetical proteins actively participate in the growth process. Unfortunately, the functions of these hypotheticals remain uncharacterized, reflecting the limitation of the sequence-based approach to genome annotation. Microarray-based functional-genomics offers a fundamentally different approach to gene annotation, in which the roles of uncharacterized genes may be hypothesized if they coexpress with known genes ( Table 5).
Organized according to correlations in gene expression across samples, the gene expression image created by Eisen's Cluster and TreeView programs ( Figure 1) enabled us to visualize four transcriptional profiles, which, named according to the functional classes of the dominant genes in that region and put in a linear order over the image, were "intermediate and lipid metabolism", "energy and respiration", "information pathways", and "cell wall, cell processes and metabolism". The dendrogram was constructed and displayed so that similar clusters were likely to be located in proximate nodes. In the present application, as the similarity measure is based on the correlation in gene expression, physical distance on the tree reflects the degree of correlation among gene clusters, even with no guarantee of their optimal linear ordering in the tree [12]. As the expression of the genetic code lies at the heart of all physiological processes and metabolisms, it is logical that the information gene cluster functionally correlates with other gene clusters, a view supported by the observation that it was situated around the center of the image.
As it is now, there are more than 100 transcriptional regulatory genes in M. tuberculosis genome [1,15]. However, only a fraction of them have been experimentally studied in detail for their functions. A recent survey shows that regulatory proteins account for 9 (20%) out of 45 virulence factors identified in M. tuberculosis [27]. Unraveling the gene regulatory network would allow us to understand both physiological and virulence mechanisms and to develop novel drugs that work at the level of gene regulation. Since elucidating the roles of these genes and their clinical relevance is always time-consuming; it is practically necessary to set up priority for them. A reasonable assumption is that a regulatory gene regulates some other genes in the same functional cluster with a high probability. Under this assumption, we have identified several potentially important transcriptional regulatory genes involved in major biological pathways (Table 4). Their significance has been indicated by analysis based on the literature. Further biological investigation on these genes is warranted in the future work.
All the microarray data and supplementary materials produced in this study are posted at our web site [see Additional file 1].

Conclusion
Genes involved in the in vitro log-phase growth of M. tuberculosis have been identified. The gene expression map ( Figure 1) represents broad patters of functional concordance of closely related genes, but more importantly, it summarizes the coordinated cellular activities associated with the growth process on the genomic level. As it is today, hundreds of genes in the genome are annotated as conserved hypotheticals without clearly specified functions. Our data have shown that more than 100 such hypotheticals were actually expressed in the cell medium, and their biological roles can be suggested by their correlation with other known genes. In addition, the roles of most transcriptional regulatory genes predicted in the genome remain to be elucidated. In this study, we have discovered several regulatory genes that may exert regulatory influence on the growth of M. tuberculosis, and their roles may be inferred by what functional clusters they join. The data and information generated here provide an integrated genomic view about gene functions and interrelationships in M. tuberculosis, and can be incorporated in new experiments for research in tuberculosis. This study has not only transcriptionally validated several known gene clusters but also provided insight into a host of unknown hypothetical and regulatory genes. Boulder, CO). The RNA sample was precipitated and washed in ethanol, and redissolved to make a final concentration of 1 mg/ml. The purity of RNA was estimated by the ratio of the readings at 260 nm and 280 nm (A260/ A280) in the UV. 20 ul RNA samples were sent to the UCI DNA core and further checked through a quality and quantity test based on electrophoresis before microarray hybridization.

Microarray hybridization and analysis
In this study, we used the anti-sense Affymetrix M. tuberculosis genome array (GeneChip). The probe selection was based on the genome sequence of M. tuberculosis H37Rv [1]. Each annotated ORF (Open Reading Frame) or IG (Intergenic Region) was interrogated with oligonucleotide probe pairs. The gene chip represented all 3924 ORFs and 738 intergenic regions of H37Rv. Twenty 25-mer probes were selected within each ORF or IG. These probes are called PM (Perfect-Match) probes. The sequence of each PM probe is perturbed with a single substitution at the middle base. They are called MM (Mismatch) probes. A PM probe and its respective MM probe constitute a probe pair. The MM probe serves as a negative control for the PM probe in hybridization.
Microarray hybridization followed the Affymetrix protocol. In brief, the assay utilized reverse transcriptase and random hexamer primers to produce DNA complementary to the RNA. The cDNA products were then fragmented by DNAase I and labeled with terminal transferase and biotinylated GeneChip DNA Labeling Reagent at the 3' terminal.
Each RNA sample was hybridized with one gene array to produce the expression data of all genes on the array. We performed eleven independent bacterial cultures and RNA extractions at different times, and collected eleven sets of microarray data for this study. A global normalization scheme is applied so that each array's median value is adjusted to a predefine value (500).

Bioinformatics analysis
The gene expression data were analyzed by the program GCOS (GeneChip Operating Software) version 1.4. In the program, the Detection algorithm determines whether a measured transcript is detected (P Call) or not detected (A Call) on a single array according to the Detection p-value that is computed by applying the one-sided Wilcoxon's signed rank test to test the Discrimination scores (R) against a predefined adjustable threshold τ. The Discrimination score calculated for each probe pair is a function of the PM intensity (PMI) and the MM intensity (MMI), as given by R = (PMI -MMI)/(PMI + MMI) The parameter τ controls the sensitivity and specificity of the analysis, and was set to a typical value of 0.015, and the Detection p-value cutoffs, α 1 and α 2 , set to their typical values, 0.04 and 0.06, respectively, according to the Affymetrix system.
In this study, a gene was determined to be always (usually) active if the derived mRNA was present (P-call) in more than 90% (50%) of the RNA samples with a Detection p-value < 0.001. The gene-expression data were further analyzed using Eisen's Cluster and TreeView programs [12]. The whole-genome gene expression map was produced by the hierarchical clustering algorithm based on the average-linkage method in the program with the similarity measure defined by Pearson's correlation coefficient.