RNA Extraction and cDNA synthesis
To generate a diverse RNA sample for the tiling experiment, we prepared RNA from yeast-form Histoplasma capsulatum strain G217B (ATCC 26032; a kind gift of William Goldman, Washington University, St. Louis, MO) under a variety of conditions, including early, middle, and late logarithmic growth; stationary phase; heat shock (42°C for 30 min); oxidative stress (1 mM menadione for 80 min); sulfhydryl reducing stress (10 mM DTT for 2 hours); and a range of media (HMM, 3M, YPD, and SD complete). Total RNA and polyA RNA were prepared as previously described [8, 9]. Cy5-labeled cDNA was prepared from individual RNA samples as previously described, and an equal mass of cDNA was pooled from each sample and hybridized to individual tiling arrays as described below.
Whole Genome Tiling Array Design
The whole genome tiling arrays were designed based on the GSC Histoplasma capsulatum strain G217B genome assembly as of 11/30/2004. Degenerate sequence and transposable elements were removed from the assembly using RepeatMasker with default parameters and the repeat families determined by the GSC. The remaining sequence was tiled with 50 mer probes at an average frequency of one probe every 60 base pairs. Probe spacing was adjusted to minimize variation in melting temperature, and a subset of probes were truncated to optimize synthesis, in collaboration with CombiMatrix. The number of arrays used to tile a given contig was minimized, and the location of tiling probes was randomized within a given array.
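As an illustration of the basic tiling step only, the following minimal sketch places fixed-length probes at a fixed step along a repeat-masked sequence and skips any probe that overlaps masked bases. It assumes masked bases are written as N, and it does not attempt the melting-temperature adjustment, probe truncation, or array assignment described above; the function name and parameters are hypothetical.

```python
def tile_probes(masked_seq, probe_len=50, step=60):
    """Return (start, probe) tuples for probes that contain no masked (N) bases.

    Hypothetical helper: fixed spacing only, no melting-temperature adjustment.
    """
    probes = []
    for start in range(0, len(masked_seq) - probe_len + 1, step):
        probe = masked_seq[start:start + probe_len]
        if "N" not in probe.upper():
            probes.append((start, probe))
    return probes


# Toy sequence: 120 bp of unique sequence, 60 masked bases, 120 bp of unique sequence
toy = "ACGT" * 30 + "N" * 60 + "ACGT" * 30
starts = [start for start, _ in tile_probes(toy)]
```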
In addition, each array contained a common set of control probes, viz.: quality control (QC) and negative control (NC) probes designed by CombiMatrix (Mukilteo, WA); positive control probes tiling the genomic loci and non-genic flanking sequence of TEF1(P40911), TYR1, and CBP1(AF006209); and probes specific to a spike-in control sequence. The QC, NC, and spike-in probes were not considered in the analysis.
Hybridization of tiling arrays
Fluorescently labeled cDNA was hybridized to CombiMatrix arrays as previously described. In addition to the Cy5-labeled sample described above, a common Cy3-labeled sample was used as a counterpoint reference on each array.
Images of the hybridized arrays were acquired with a GenePix 4000B scanner (Axon Instruments) controlled by the GenePix 4.0 program (Molecular Devices). Each array was scanned three times using the following PMT settings for the 635 nm laser: 400, 450, 540. Images were gridded with GenePix 4.0 and the median foreground intensity for each feature was used as the input for subsequent analysis. Based on the negative control probes, signal/noise was constant for the three scans, so all subsequent analysis was carried out using the lowest PMT scan.
Probe detection on tiling arrays
Background intensity was estimated based on the median intensities of a control set of known antisense and intergenic regions, a method similar to the use of median intensities of known introns in the analysis of rice tiling data. Specifically, the background intensity was estimated as the median intensity of the positive control probes corresponding to the intergenic (untranscribed) regions flanking CBP1 and TYR1 and the antisense (untranscribed) probes for CBP1, TYR1, and TEF1. A tiling probe was considered detected if it had intensity greater than the background intensity estimated for the corresponding array. 58% of the tiling probes were considered detected by this method.
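The per-array detection rule can be sketched as follows. This is a minimal illustration, not the analysis code used for the study; the function name and the example intensities are hypothetical.

```python
from statistics import median


def detect_probes(tiling_intensities, control_intensities):
    """Estimate background as the median of the untranscribed control probes
    on one array, then flag each tiling probe whose intensity exceeds it."""
    background = median(control_intensities)
    detected = [x > background for x in tiling_intensities]
    return background, detected


# Hypothetical intensities for a single array
background, detected = detect_probes(
    tiling_intensities=[120, 480, 950, 300],
    control_intensities=[250, 310, 290, 400, 275],
)
```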
Transcript detection on tiling arrays
In H. capsulatum, introns are small enough to make detection of complete transcripts feasible (in contrast to, e.g., Homo sapiens) but are large and irregular enough to make such detection non-trivial (in contrast to, e.g., Escherichia coli or Saccharomyces cerevisiae). For this study, we traded resolution for improved signal to noise and defined transcripts as genomic loci ≥ 200 bp for which the normalized density of detected probes was greater than 65% of the normalized density of all probes. Smoothed densities were calculated with the density function in R using a bandwidth of 500 bp, and transcripts were truncated such that transcript ends coincided with detected tiles.
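One possible reading of this density criterion is sketched below in Python rather than R. The assumptions are ours and are not confirmed by the text: a Gaussian kernel whose standard deviation equals the bandwidth (the default behaviour of the R density function), each density normalized to integrate to 1 over its own probe set, evaluation on a 10 bp grid, and probe positions given as integer coordinates on a single contig. All names are hypothetical.

```python
import math


def kde(positions, x, bandwidth):
    """Gaussian kernel density estimate at x, normalized to integrate to 1."""
    norm = len(positions) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions) / norm


def call_transcripts(all_pos, detected_pos, bandwidth=500.0, ratio=0.65,
                     min_len=200, step=10):
    """Return (start, end) loci where the detected-probe density exceeds
    ratio * all-probe density, trimmed so both ends fall on detected probes."""
    if not detected_pos:
        return []
    all_pos, detected_pos = sorted(all_pos), sorted(detected_pos)
    grid = range(all_pos[0], all_pos[-1] + 1, step)
    above = [kde(detected_pos, x, bandwidth) > ratio * kde(all_pos, x, bandwidth)
             for x in grid]
    transcripts, start = [], None
    for i, flag in enumerate(above + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            lo, hi = grid[start], grid[i - 1]
            inside = [p for p in detected_pos if lo <= p <= hi]
            # Truncate to detected probes and apply the minimum-length filter
            if inside and inside[-1] - inside[0] >= min_len:
                transcripts.append((inside[0], inside[-1]))
            start = None
    return transcripts
```

The smoothing lets a locus bridge short runs of undetected probes (for example, small introns), which is the trade of resolution for signal to noise described above.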
In order to avoid regions of the tiling path that were rendered sparse due to repeat masking, transcript detection was restricted to regions spanning at least 10 kb of genome sequence with a minimum tiling density of 1 probe per 250 bp (1/5 of the target tiling density).
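A simple way to approximate this restriction is sketched below. It interprets the minimum tiling density as a maximum gap of 250 bp between neighbouring probes, which is an assumption on our part; a windowed probe count would be an equally plausible reading. The function name is hypothetical.

```python
def tiled_regions(probe_starts, max_gap=250, min_span=10_000):
    """Return (start, end) runs of probes in which neighbouring probes are at
    most max_gap bp apart and the run spans at least min_span bp."""
    regions = []
    starts = sorted(probe_starts)
    if not starts:
        return regions
    run_start = prev = starts[0]
    for pos in starts[1:]:
        if pos - prev > max_gap:
            # Gap too large: close the current run and start a new one
            if prev - run_start >= min_span:
                regions.append((run_start, prev))
            run_start = pos
        prev = pos
    if prev - run_start >= min_span:
        regions.append((run_start, prev))
    return regions
```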
6,172 transcripts were detected. The length distribution (in terms of genomic locus) for detected and predicted transcripts is shown in Figure 4. Known transcripts showed a mild 3' bias, meaning that signal intensity was enriched at the 3' end of the gene, as expected given the method of sample preparation.
The genomic coordinates of the detected transcripts are given in Additional file 3, Table S3, and the probe intensities are given in Additional files 4 and 5, Data S4 and Data S5.
94 novel TARs were examined by RT-PCR. For each transcript, up to 5 primer pairs (giving 400-500 bp products) were designed using the Primer3 program (with the Primer3plus default parameters). The designed primer pairs were then screened for redundant products using the re-PCR program, and the first non-redundant pair was chosen for each target (targets with 5 redundant pairs were rejected).
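The selection rule can be sketched as below. The redundancy check is passed in as a function because the sketch does not call Primer3 or re-PCR; in practice that check would be derived from the re-PCR output. All names are hypothetical.

```python
def choose_primer_pair(candidate_pairs, is_redundant, max_pairs=5):
    """Return the first non-redundant primer pair among the first max_pairs
    candidates, or None if every candidate is redundant (target rejected)."""
    for pair in candidate_pairs[:max_pairs]:
        if not is_redundant(pair):
            return pair
    return None


# Hypothetical candidates; the first pair is assumed to give a redundant product
pairs = [("fwd1", "rev1"), ("fwd2", "rev2"), ("fwd3", "rev3")]
redundant = {("fwd1", "rev1")}
chosen = choose_primer_pair(pairs, lambda pair: pair in redundant)
```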
PolyA RNA corresponding to the cDNA used for tiling arrays was subjected to RT-PCR analysis, with the exception that RNA from early log-phase cells was not included due to limited material. The pooled RNA was DNase treated and reverse transcribed with AffinityScript (Stratagene). PCR reactions were carried out using AmpliTaq polymerase (Applied Biosystems) for 35 cycles of 94°C for 15 s, 56°C for 15 s, and 72°C for 4 min.
Reaction products were visualized on a 1% agarose gel and were considered detected if they occurred at the length predicted by the re-PCR program with no corresponding band in the "no RT" control.
The sequences of the full set of novel TARs are given in Additional file 6, Data S6.
For the purpose of validation, the length of a predicted gene was taken as its full genomic locus (including introns and exons).
RECON-identified repeat-families from the GSC (including the MAGGY transposon) were mapped to the genome with REPEATMASKER using default settings and excluding simple sequence repeats. Predicted genes with greater than 20% of their length covered by REPEATMASKER-annotated repeat sequence were classified as repeats and removed from further analysis.
Non-repeat genes with greater than 50% of their length covered by detected TARs were classified as validated by tiling.
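The two coverage thresholds above can be illustrated with a short sketch. It assumes half-open (start, end) coordinates on a single contig and ignores strand; the function names and labels are hypothetical.

```python
def covered_fraction(gene, intervals):
    """Fraction of the gene locus (start, end) covered by the union of intervals."""
    g_start, g_end = gene
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in intervals
        if s < g_end and e > g_start
    )
    covered, cursor = 0, g_start
    for s, e in clipped:
        s = max(s, cursor)  # avoid double-counting overlapping intervals
        if e > s:
            covered += e - s
            cursor = e
    return covered / (g_end - g_start)


def classify_gene(gene, repeats, tars):
    """Apply the repeat filter first, then the tiling validation threshold."""
    if covered_fraction(gene, repeats) > 0.20:
        return "repeat"
    if covered_fraction(gene, tars) > 0.50:
        return "validated"
    return "not validated"
```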
The following two-channel G217B whole-genome oligonucleotide microarray data sets were used for validation by expression profiling: wild-type and ryp1 mutant 37°C and RT samples hybridized against a pooled reference (9 arrays); direct hybridizations of yeast, mycelial, and conidial samples (6 arrays; Inglis et al., unpublished); and iron depletion time courses hybridized against a pooled reference (8 arrays plus 10 arrays; Hwang et al., unpublished). In keeping with our standard analysis pipeline for this platform, probes were considered detected if they were not manually flagged as bad and the sum of background-subtracted median intensities for the two channels was greater than 500. Non-repeat predicted genes were classified as validated by expression array if they mapped to at least one detected probe in at least 3 of the 33 arrays.
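The probe-level and gene-level rules can be sketched as follows. This is an illustration under stated assumptions, not the pipeline itself: the argument names are hypothetical, and the per-array input to the gene-level rule is assumed to be a precomputed flag for whether the gene mapped to at least one detected probe on that array.

```python
def probe_detected(cy5_median, cy5_background, cy3_median, cy3_background,
                   flagged_bad, threshold=500):
    """A probe is detected if it is not flagged and the summed
    background-subtracted median intensities exceed the threshold."""
    if flagged_bad:
        return False
    signal = (cy5_median - cy5_background) + (cy3_median - cy3_background)
    return signal > threshold


def validated_by_expression(gene_detected_per_array, min_arrays=3):
    """gene_detected_per_array: one boolean per array (33 in this study)."""
    return sum(gene_detected_per_array) >= min_arrays
```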
Annotated gene sets from the following genomes were used for validation by homology to other fungi: Blastomyces dermatitidis er-3 and slh14081; Paracoccidioides brasiliensis pb01, pb03, and pb18; Coccidioides immitis rs; Aspergillus nidulans; Aspergillus fumigatus (TIGR); Aspergillus oryzae (DOGAN); Neurospora crassa; Magnaporthe oryzae (formerly Magnaporthe grisea); Fusarium graminearum; Candida albicans (CGD, orfs19 gene set); Saccharomyces cerevisiae (SGD); Cryptococcus neoformans H99; and Ustilago maydis. Except where noted, all gene sets were obtained from the BROAD Institute. Pairwise ortholog/in-paralog mapping to G217B was performed by running INPARANOID with default parameters and no outgroup for each genome. Predicted genes were classified as validated by homology if they were a member of an orthogroup (direct ortholog to a gene in the target genome or in-paralog of a G217B gene with a direct ortholog in the target genome) for at least 3 of the 16 target genomes.
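The counting rule can be sketched as below, assuming the pairwise results have already been parsed into, for each target genome, a list of orthogroups represented as sets of G217B gene identifiers (direct orthologs plus in-paralogs). The data structure, gene identifiers, and function name are hypothetical, and the sketch does not parse INPARANOID output.

```python
def validated_by_homology(gene, orthogroups_by_genome, min_genomes=3):
    """Count target genomes in which the gene belongs to any orthogroup and
    compare the count with the minimum number of genomes required."""
    hits = sum(
        any(gene in group for group in groups)
        for groups in orthogroups_by_genome.values()
    )
    return hits >= min_genomes


# Hypothetical example with four target genomes
example = {
    "genome_a": [{"gene1", "gene2"}],
    "genome_b": [{"gene1"}],
    "genome_c": [{"gene3"}, {"gene1", "gene4"}],
    "genome_d": [{"gene2"}],
}
```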
Microarray data have been submitted to the NCBI Gene Expression Omnibus (GEO) under accession number [GEO:GSE31155].
Nucleotide sequence data for the reported novel TARs are available in the Third Party Annotation Section of the DDBJ/EMBL/GenBank databases under the accession numbers TPA: BK008128-BK008391.