Whole-genome tiling array expression profiling
To survey the transcriptome of G217B, we designed a set of 93 unique tiling microarrays (Figure 2). The G217B genome contains a large number of repeat regions, including the MAGGY retrotransposon[7], which were excluded from the tiling microarray probes. Both strands of the remaining sequence were tiled with 50 mer probes at an average frequency of one probe every 60 base pairs (Figure 2). These arrays were hybridized with a pool of fluorescently labeled cDNA generated from cells grown under a variety of conditions. Because technical limitations did not allow us to isolate sufficient poly-adenylated-RNA from filamentous cells (which represent the soil form of this organism and must be grown under biosafety level three conditions due to the production of aerosolizable infectious spores), we focused on the pathogenic yeast form. G217B yeast cells were subjected to numerous growth conditions (see Materials and Methods) which had previously been observed to elicit potent transcriptional responses[8, 9]. Tiles that passed an empirically determined detection threshold were merged into TARs, as described in the Materials and Methods.
Detection of predicted genes
The GSC predicted that the G217B genome contains 11,221 genes, but 1,611 of these gene predictions contain repeat sequence, including the MAGGY transposon, and were excluded from further analysis. Of the remaining 9,610 predictions, 6,008 were detected in our tiling microarrays (Figure 3a). 60% of the gene predictions have some correspondence to the detected TARs: 47% of the predictions were cleanly detected only on the predicted strand (represented in Figure 3b i), 7% were detected only on the antisense strand (Figure 3b ii), and 6% had tiling and/or prediction support for transcription on both strands (Figure 3b iii), leaving 26% of the predicted set unsupported by our tiling data (Figure 3a). Detection on both strands is consistent with the presence of sense and/or antisense transcripts in one or more of the growth conditions profiled by this experiment. It has been shown that the DNA-dependent DNA polymerase activity of reverse transcriptase can generate false positive opposite strand signal in tiling experiments; e.g., two thirds of putative antisense transcripts in a Saccharomyces cerevisiae tiling experiment were not detected in the presence of actinomycin D[10]. Therefore, the number of sense/antisense pairs observed in our experiment is likely to be an overestimate.
Detection on only the antisense strand may correspond to incorrect predictions coinciding with bona fide transcripts on the opposite strand (e.g., Figure 3b iii, in which there is a spurious prediction antisense to the known 5' UTR of FDH1[9]) or to true genes that are repressed by an antisense transcript in our pooled yeast sample. Due to this ambiguity, genes in this category were not considered "detected". An additional 264 novel transcripts, which were not present in the predicted set, were also detected (Figure 3b iv), as described below. As part of the web database associated with this study, the detected transcript set can be viewed in the context of the raw tiling signal and predicted gene set (as in Figure 3), allowing human estimation of transcript set accuracy on a case by case basis.
Features of transcribed regions in the H. capsulatum genome
As is common for tiling data, the boundaries of TARs did not correspond precisely with the boundaries of the predicted genes. There were two common instances of this pattern. First, in many cases, additional transcription was detected 5' and 3' of the predicted gene (Figure 3b). This was most likely due to untranslated (UTR) sequences which are missed by the gene model and resulted in a longer length distribution for the TARs compared to the predicted genes (Figure 4). Second, it was not uncommon for a single long transcript to span multiple predictions. In some cases, this was due to the sequence encoding a single TAR being incorrectly predicted to contain multiple genes. In others, this was due to multiple genes being incorrectly detected as a single transcript, either due to spurious or pathological background signal or due to intergenic regions too small to be distinguished from introns. In the case of the Saccharomyces cerevisiae genome, multi-gene detected transcripts could be segmented based on sharp transitions in the intensity of the tiling signal[11]. Such analysis would be difficult in the present study, primarily because the tiling sample is a pool of cDNAs corresponding to multiple transcriptional states of the H. capsulatum yeast phase, each of which may contain transcript isoforms that differ by splicing and transcriptional start site (we have documented such variability for several phase specific transcripts in H. capsulatum[9]). Ultimately, we attempted to minimize this limitation of the tiling array method by selecting transcript detection parameters that distinguish the mostly small introns from the mostly large intergenic regions.
The majority of TARs that did not overlap with gene predictions corresponded to unpredicted UTR sequences. For example, 29% of non-overlapping TAR sequence can be interpreted as 5'UTR (immediately upstream of and contiguous with a gene prediction), and 35% as 3'UTR (immediate downstream of and contiguous with a gene prediction). Additionally, 33% of non-overlapping TARs corresponded to the intervening sequence between two predictions (i.e., intergenic sequence incorrectly detected as transcribed due to the resolution limits of the tiling strategy, or long transcripts incorrectly predicted as multiple genes).
Tiling arrays revealed 264 novel genes
One advantage of a tiling strategy is that it can uncover novel TARs that do not correspond to the predicted genes. Our tiling analysis detected 264 such loci that were not represented in the GSC predicted gene set for G217B (e.g., Figure 3b iv). TARs were designated as "novel" if (1) they had no overlap with gene predictions on either strand, (2) they did not fall into the "5'UTR", "3'UTR", or "intervening" classifications described above (i.e., the flanking 5' and 3' base did not coincide with a gene prediction), and (3) they had no overlap with repeat regions.
94 TARs that did not coincide with the predicted gene set were chosen for experimental validation by RT-PCR. 79 of these TARs were detected in a first-pass analysis with a single primer pair, giving a validation rate of 84%. A representative sampling of RT-PCR results is shown in Figure 5.
To determine whether the novel loci correspond to conserved sequences in other genomes and, if so, if these homologous loci have been independently annotated as transcribed (i.e., if they are merely unannotated in G217B), we searched for conserved sequences in other dimorphic fungal pathogens within the order Onygenales (4 strains of H. capsulatum, 2 strains of Blastomyces dermatitidis, 3 strains of Paracoccidioides brasiliensis, and the reference strain of Coccidioides immitis).
A BLASTX search of the isolated novel sequences against the predicted protein sets yielded a number of hits in other genomes (173 of the isolated novel sequences had hits in at least one non-G217B H. capsulatum gene set, and 63 of these had hits in at least one non-H. capsulatum gene set). To increase the sensitivity of this search, we performed an INPARANOID-based[12] mapping of syntenic loci that flanked each novel locus (Figure 6). Genes in 20 kb windows on either side of the novel TAR could be independently mapped to orthologs on a common contig in at least 8 other genomes for 217 of the isolated novel sequences. Of the 173 isolated novel sequences with BLASTX hits, 156 could be mapped to syntenic loci, and, for 150 of these, the BLASTX hit coincided with the mapped locus. A TBLASTX (translated nucleotide vs. translated nucleotide) comparison of the isolated novel sequence to the mapped locus yielded a significant alignment (e ≤ 10-6) for at least 4 H. capsulatum strains in 210 cases, for both B. dermatitidis strains in 80 cases, for at least two P. brasiliensis strains in 31 cases, and for the reference C. immitis strain in 7 cases. In general, the TBLASTX results were consistent with evolutionary distance from G217B (e.g. sequences conserved between H. capsulatum and B. dermatitidis were also conserved among H. capsulatum strains).
Taken together, these results suggest that: 1) the isolated novel sequences are conserved at the sequence level, and, therefore, likely to be transcribed, relative to the other H. capsulatum strains in most cases, and relative to B. dermatitidis for about half of the cases; 2) transcripts with deeply conserved sequence across the Onygenales also tend to be predicted as genes in most of these fungi; and 3) for about half of the isolated novel sequences, a corresponding gene prediction exists in another genome, highlighting differences in the prediction pipelines, while the other half represent truly novel discoveries of this tiling experiment.
Using standard expression profiling and sequence homology to enrich gene validation
To complement our tiling arrays, we took advantage of our archive of expression data compiled across several distinct growth conditions, including iron limitation, and all three morphologies (yeast, mycelia, and conidia). We surveyed whether gene predictions were detected in these expression profiling experiments, which employed whole-genome oligonucleotide microarrays where each prediction was represented by one or two gene-optimized 70 mer probes. Additionally, we used INPARANOID[12] to determine if gene predictions had homologs in other fungi. This validation by inferred homology to genes in other fungi relied on sequence conservation independent of expression pattern. The validation criteria for each strategy are given in the methods section and the results are summarized in Figure 7 (detailed per-gene results are available as Additional file 1, Table S1 and may be browsed interactively at http://histo.ucsf.edu). By these criteria, 8,115 non-repeat predicted proteins were validated by gene expression and 7,129 were validated by sequence homology.
Genes that were validated by tiling, gene expression, and sequence homology represented the largest category of predictions (5,379 genes) and accounted for 56% of the non-repeat predicted gene set. The next largest category was 1,404 genes validated by gene-expression and sequence conservation but not by the tiling experiment (15% of the non-repeat predicted gene set), followed by 845 genes (9%) validated only by expression array, and 487 genes (5%) validated by expression and tiling but not sequence conservation. 1,099 gene predictions (11%) were unvalidated by any of the three methods.
In the following discussion, predicted genes are referred to by their common names. Additional file 2, Table S2 gives the corresponding systematic names.
Genes that were missed by tiling array showed enriched expression in the mycelial form
As expected, gene predictions that were not detected by tiling tended to show reduced expression in the yeast phase and enhanced expression in the mycelial form. Examples include TYR1 and ABC4, both previously identified as highly enriched in the mycelial phase [9]; VELC, a mycelial-enriched paralog of the morphological regulators RYP2 and RYP3 [13]; and the ortholog of BDBG_03463, which is paralogous to the B. dermatitidis gene BYS1 (BYS1 itself has no ortholog in H. capsulatum)[14, 15].
Other notable categories of genes not detected by tiling include genes in heavily repeat-masked regions of the genome (where the tiling density is, therefore, too low to analyze) and genes with weak expression that did not give significant signal over background on tiling arrays.
Genes that were not detected by homology represented short or interrupted predictions
For genes not detected by homology, there were two related trends: (1) the predicted lengths were short, on the order of those genes not detected by any method (Figure 4); and/or (2) a single TAR was inappropriately split into multiple predicted genes. For example, the copper-repressed gene ELI1, which is known to be expressed as a single mRNA[16], is split into two predictions. Both predictions are detected by expression and tiling, but only the 3' prediction, which contains the coding sequence, is detected by homology, whereas the 5' prediction, which likely contains 5'UTR, is not. Short predictions are difficult to detect as homologs for two reasons: short runs of sequence similarity are likely to occur by chance, resulting in lower BLASTP p-values for hits to these predictions; and INPARANOID requires 50% reciprocal coverage between orthologs, resulting in rejection of genes where the predicted length is significantly smaller than that of the corresponding homologs. The same issues arise for split predictions, with the additional restriction that INPARANOID will make an ortholog assignment for only one member of a split pair, automatically resulting in rejection of the other member.
In all of these cases, the discrepancy between the experimental and sequence based results is a useful indication that the predicted gene model should be revised. In many cases, comparison of the transcript detected by tiling array to the results of less stringent sequence searches (e.g., BLASTX of the transcribed genomic sequence) is a useful starting point for such revision.
Genes not detected by homology also tend to show enriched expression in conidia, the vegetative spores generated by H. capsulatum filaments. H. capsulatum conidia, or their counterparts in any closely related fungi, have not been extensively studied; thus, the homologs of these genes may be unpredicted or entirely absent in the comparison genomes.
Genes that were validated only by homology have restricted expression profiles
The category of genes with orthologs in other fungi but no direct observation in our experimental data was relatively small (254 predictions representing 3% of the non-repeat gene set) and is predicted to contain genes that are expressed only under very restricted conditions that were not sampled in our expression data. Consistent with this hypothesis, we find STE3, the a-factor receptor whose expression has been observed only in mutants of G217B[17]; the ortholog of N. crassa RID, which is required for the RIP process and therefore expected to be expressed only during meiosis[18]; and the ortholog of T. reesei AXE2, a hemicellulolytic enzyme whose expression is dependent on carbon source[19].
Empirical redesign of microarray probes
Our tiling arrays and homology predictions can be used to inform future design of microarray probes. Because the expression experiments draw from a more diverse set of samples than the tiling experiments, detection of a predicted gene by homology and tiling but not by expression suggested a platform-specific defect in the 70 mer probe designed to detect that gene on our whole-genome oligonucleotide arrays (rather than a failure of the expression experiments to sample the appropriate condition). Our analyses support this hypothesis. In particular, the 70-mer probes for genes that failed to be detected by expression array tend to lie outside of the transcribed locus detected by tiling (e.g., the nitrositive-stress induced transcript COX12[8]), or span a predicted intron not supported by the tiling data (i.e., due to incorrect gene prediction, the 70 mer probe targets a discontiguous sequence in the true transcript). We are currently augmenting the expression array platform with new 70 mers for these genes, based on the coincidence of tiling transcripts with predicted exons.
Genes that failed to be validated by any method
We were unable to validate 1,099 predictions, or 11% of the non-redundant genes, by any method. This group primarily corresponds to wholly undetected predictions but may also include a small number of correct predictions for which the 5' end is undetected due to the 3' bias of the tiling experiment.
The unvalidated genes are significantly shorter than the detected genes (Figure 4). This observation could be due to false negatives in the tiling data (short transcripts are more difficult to detect because they are difficult to distinguish from background noise) or false gene predictions (there is an increased likelihood of short sequences fitting a gene model by chance). We note that genes validated only by expression (our only validation method that is independent of transcript length) are significantly shorter than genes validated by all methods but significantly longer than the unvalidated genes, lending weight to both explanations.