Genomic homogeneity between Mycobacterium avium subsp. avium and Mycobacterium avium subsp. paratuberculosis belies their divergent growth rates

Background Mycobacterium avium subspecies avium (M. avium) is frequently encountered in the environment, but also causes infections in animals and immunocompromised patients. In contrast, Mycobacterium avium subspecies paratuberculosis (M. paratuberculosis) is a slow-growing organism that is the causative agent of Johne's disease in cattle and chronic granulomatous infections in a variety of other ruminant hosts. Yet we show that, despite their divergent phenotypes and the diseases they cause, the genomes of M. avium and M. paratuberculosis share greater than 97% nucleotide identity over large (25 kb) genomic regions analyzed in this study.
Results To characterize genome similarity between these two subspecies and to attempt to understand their different growth rates, we designed oligonucleotide primers from M. avium sequence to amplify 15 minimally overlapping fragments of M. paratuberculosis genomic DNA encompassing the chromosomal origin of replication. This strategy resulted in the successful amplification and sequencing of a contiguous 11-kb fragment containing the putative Mycobacterium paratuberculosis origin of replication (oriC). This fragment contained 11 predicted open reading frames that showed a conserved gene order in the oriC locus when compared with several other Gram-positive bacteria. In addition, a GC skew analysis identified the origin of chromosomal replication, which lies between the genes dnaA and dnaN. Multiple DnaA boxes and the ATP-binding site in dnaA were also found in M. paratuberculosis. The strong nucleotide identity of M. avium and M. paratuberculosis in the region surrounding the origin of chromosomal replication led us to compare other areas of these genomes. A DNA homology matrix of 2 million nucleotides from each genome revealed strong synteny with only a few sequences present in one genome but absent in the other. Finally, the 16S rRNA gene from these two subspecies is 100% identical.
Conclusions We present, for the first time, a description of the oriC region in M. paratuberculosis. In addition, genomic comparisons between these two mycobacterial subspecies suggest that differences in the oriC region may not be significant enough to account for the diverse bacterial replication rates. Finally, the few genetic differences present outside the origin of chromosomal replication in each genome may be responsible for the diverse growth rates or phenotypes observed between the avium and paratuberculosis subspecies.


Background
Mycobacteria are Gram-positive, acid-fast, pleomorphic, non-motile rods belonging to the order Actinomycetales. Mycobacterium avium complex organisms consist of the human and animal pathogens M. avium subsp. avium, M. avium subsp. paratuberculosis, and M. avium subsp. silvaticum [1]. DNA-DNA hybridization studies long ago established a genetic similarity between M. avium subspecies avium (M. avium) and M. avium subspecies paratuberculosis (M. paratuberculosis) [2][3][4]. Now that whole genome sequencing technologies are available, investigators can begin to examine genetic relatedness in greater detail through direct nucleotide-nucleotide comparisons. These comparisons are particularly important in instances where few or no specific diagnostic tests exist to distinguish two genetically similar bacteria.
The literature reports genetic similarity between M. paratuberculosis and M. avium at between 72% and 95% [2,4] depending on the region analyzed. However, despite the reported similarities, these mycobacteria are quite different phenotypically. M. paratuberculosis is an intracellular pathogen that infects ruminant animals, most notably cattle and sheep. The site of infection is the gastrointestinal tract, where it causes a chronic inflammatory ailment termed Johne's disease [5]. In contrast, M. avium is common in the environment, causes tuberculosis in birds, and disseminated infections in HIV patients [6]. Growth of M. paratuberculosis is characterized by its slow rate (doubling time of 22-26 hours, compared to 10-12 hours for M. avium) and requirement of mycobactin in culture media [5]. With the absence of a well-defined genetic system for M. paratuberculosis, a comparative genomic approach holds great potential in addressing the genetic basis for many of these phenotypic differences.
The genus Mycobacterium contains species that range from fast-growing saprophytes such as M. smegmatis and M. fortuitum to slow-growing pathogens such as M. leprae, M. tuberculosis and M. paratuberculosis. Although the chromosomal origin of replication has been studied in some mycobacteria [7,8], the genetic organization of the origin of replication in M. paratuberculosis was previously unknown. Knowledge of the gene organization and sequence of this region is particularly important because chromosomal replication may be regulated by a common mechanism that could directly affect rate of growth.
Several features of the oriC region are highly conserved among bacteria. The sequence immediately flanking the dnaA gene is considered the origin of chromosomal replication, or oriC region [9,10]. This region contains several genes that encode proteins required for basic cellular functions, including the protein subunit of RNase P (RnpA), ribosomal protein L34 (RpmH), the replication initiator protein (DnaA), the beta subunit of DNA polymerase III (DnaN), the recombination repair protein RecF, and the DNA gyrase proteins GyrA and GyrB. The relative gene order in this region is also highly conserved in many bacteria, especially the Gram-positives [11]. Although intergenic sequences in this region are conserved only among closely related organisms, the DnaA box is found in the non-coding regions flanking dnaA in most bacteria studied [12]. DnaA boxes are conserved nucleotide sequences (TTGTCCACA) where the DnaA protein binds to DNA, triggering events that ultimately lead to replication initiation and DNA synthesis [9].
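The DnaA box search described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in this study: the function names, the one-mismatch tolerance, and the toy sequence are assumptions made for the example. Only the consensus TTGTCCACA comes from the text.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_dnaa_boxes(seq: str, consensus: str = "TTGTCCACA", max_mismatches: int = 1):
    """Scan both strands for windows within max_mismatches of the consensus.

    Returns a list of (start, strand, window, mismatches) tuples, where start
    is the 0-based position on the forward strand. The mismatch tolerance is
    an assumption for this sketch, not a value taken from the paper.
    """
    seq = seq.upper()
    k = len(consensus)
    motifs = (("+", consensus), ("-", reverse_complement(consensus)))
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, motif in motifs:
            mismatches = hamming(window, motif)
            if mismatches <= max_mismatches:
                hits.append((i, strand, window, mismatches))
    return hits


# Toy sequence (invented): one forward box and one reverse-strand box.
toy = "GGGTTGTCCACAGGGTGTGGACAAGGG"
for hit in find_dnaa_boxes(toy, max_mismatches=0):
    print(hit)
```

Scanning the reverse complement as well matters because the box is non-palindromic, so a box on the opposite strand looks like a different 9-mer on the forward strand.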
In an effort to understand the genetic basis for growth rate and other phenotypic differences between M. paratuberculosis and M. avium, we have analyzed the genetic similarity of these genomes using two strategies. First, the putative oriC region of M. paratuberculosis was amplified, sequenced and compared with M. avium and other bacteria. Second, we examined nucleotide identity outside the oriC region using DNA homology matrix analysis as well as using several hundred M. paratuberculosis sequences from a random shotgun library compared with M. avium sequences present in the unfinished microbial genomes database. Our results show that these subspecies not only have a conserved gene order surrounding the origin of chromosomal replication, but also have high synteny and nucleotide identity throughout both genomes. In addition, this preliminary comparative survey of the genomes of M. avium and M. paratuberculosis shows even greater similarity (97%) than the literature suggests (72% to 95%) [2].

Identification of predicted ORFs encoding replication-related proteins
An ~11-kb contiguous genomic fragment from M. paratuberculosis was amplified and sequenced using 15 primer pairs designed from M. avium genomic sequence in the putative oriC region (Fig. 1). This strategy enabled the successful amplification of all 15 minimally overlapping fragments of ~800 bp in length for this region of the M. paratuberculosis chromosome. A putative replication origin was identified by GC skew analysis [14]. A strong inflection point in the GC plot marks this origin (Fig. 1). Eleven ORFs were identified using the gene prediction software Artemis [15] (release 3; The Sanger Centre http://www.sanger.ac.uk/Software/Artemis/). Similarity searches were conducted locally using the BLASTP algorithm through the Artemis interface. Seven of these ORFs have high identity to proteins essential for basic cellular processes, including replication, in other mycobacterial species (Table 1). The function of GidB is unknown, but it may have a role in cell division [11]. RNase P, which consists of the protein subunit RnpA and a catalytic RNA subunit, is essential for generating mature tRNAs by cleaving the 5'-terminal leader sequences of precursor tRNAs [16]. rpmH encodes ribosomal protein L34, and DnaA is the initiator protein for chromosome replication. The β subunit of DNA polymerase III is encoded by dnaN. The recF gene product is involved in recombination, DNA repair, and induction of the SOS response, and may also have a role in replication [17]. Bacterial DNA gyrase, a tetramer consisting of A and B subunits, catalyzes the ATP-dependent unwinding of covalently closed circular DNA [18]. The remaining predicted ORFs in this region have high similarity to hypothetical proteins from M. tuberculosis (Table 1).
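One common way to locate an inflection point of the kind described above is to compute GC skew, (G - C)/(G + C), in windows along the sequence and look for the extremum of the cumulative curve. The sketch below assumes that formulation; the exact procedure of reference [14], the window size, and the function names are not taken from this paper, and the example sequence is synthetic.

```python
def gc_skew(seq: str, window: int = 1000, step: int = 1000):
    """Windowed GC skew, (G - C) / (G + C), with window-centre positions."""
    seq = seq.upper()
    positions, skews = [], []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
        positions.append(start + window // 2)
    return positions, skews


def cumulative_minimum(positions, skews):
    """Position and value at which the cumulative skew reaches its minimum.

    In many bacterial chromosomes this turning point lies near the
    replication origin; that interpretation is an assumption here.
    """
    total = 0.0
    best_pos, best_val = None, float("inf")
    for pos, skew in zip(positions, skews):
        total += skew
        if total < best_val:
            best_pos, best_val = pos, total
    return best_pos, best_val


# Synthetic sequence: C-rich for 200 bp, then G-rich for 200 bp.
toy = "ACCC" * 50 + "AGGG" * 50
pos, skew = gc_skew(toy, window=40, step=40)
print(cumulative_minimum(pos, skew))  # turning point just before the switch at 200
```

On an 11-kb fragment the window and step would need to be much smaller than the 1000-bp defaults shown, and the sign convention (minimum versus maximum) depends on which strand is read.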

Sequence homology and conserved gene order in the oriC region of mycobacteria and other Gram-positive bacteria
Alignment of the region surrounding oriC for several mycobacteria and other Gram-positive bacteria provides some interesting comparisons (Fig. 2). The M. paratuberculosis oriC region conforms to the conserved gene order that is present in other mycobacteria as well as the closely related Streptomyces coelicolor. Even the more distantly related Bacillus subtilis shows some degree of synteny in this region. The fast-growing M. smegmatis species contains a gnd sequence between dnaN and recF, which is absent in the slow-growing mycobacteria (Fig. 2). However, there appear to be no notable differences between M. avium and M. paratuberculosis at this level. The M. smegmatis coding sequence, gnd, has similarity to the 6-phosphogluconate dehydrogenase genes in E. coli, but the mycobacterial protein is predicted to be about 200 amino acids shorter than the E. coli homolog. The length of noncoding intergenic regions between rpmH-dnaA and dnaA-dnaN is well conserved among the bacteria shown in Fig. 2. In many bacteria where a functional oriC has been identified, this gene order is conserved and oriC is adjacent to the dnaA gene [9,10,19].

Figure 2
Comparative gene order in the oriC region of mycobacteria and other Gram-positive bacteria. The relative gene order in this region of M. paratuberculosis conforms to the highly conserved order found in other Gram-positive bacteria. Numbers indicate the length of the ORF or intergenic region. Arrows show the direction of transcription.
The amino acid sequence of each gene product was compared with the corresponding sequence in M. paratuberculosis for all species in this study (Table 2). The data show that while gene order is conserved, the percent identity declines in comparisons with mycobacteria other than M. avium. This percent identity declines even further in comparisons with non-mycobacterial sequences such as S. coelicolor and Corynebacterium glutamicum (Table 2).  (Fig. 3). In addition, a hexameric sequence thought to be recognized by ATP-DnaA (AGATCT) was found in the 3' non-coding sequence adjacent to dnaA (Fig. 3b). The additional DnaA boxes in M. paratuberculosis are likely necessary to open the DNA helix of this GC-rich organism (69% GC content).

Conserved functional motifs in the
The dnaA gene is divided into four functional domains based on analysis of several dnaA mutants [24]. These domains consist of (1) an area near the N-terminus thought to be involved in the ability of the DnaA protein to aggregate, (2) an ATP-binding domain, (3) a domain that maps to a region near the C-terminus and is involved in DNA binding, and (4) a final domain of unknown function that may bind DnaB. The conserved ATP-binding site that is found in domain III in other bacteria was also located in M. paratuberculosis (Fig. 3b). An AT-rich stretch of 19 nucleotides (74% A+T), which in other bacteria serves as the site of local unwinding of DNA after DnaA-DNA interaction, was located in non-coding sequence adjacent to dnaA (Fig. 3b). The noncoding sequences flanking dnaA are slightly AT-rich in general, relative to the rest of the genome sequence, consistent with findings in other Gram-positive bacteria (38%-40% A+T, vs. ~33% in the entire sequence).
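An AT-rich stretch like the one described above can be located with a simple sliding-window scan. The following is a minimal sketch under assumed names; the 19-nucleotide width comes from the text, while the toy sequence is invented. As a plausibility check, 14 A or T bases out of 19 would round to the reported 74%, although the actual base counts are not given here.

```python
def at_fraction(seq: str) -> float:
    """Fraction of A and T bases in a DNA string."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def richest_at_window(seq: str, width: int = 19):
    """Start (0-based) and A+T fraction of the most AT-rich window.

    Ties are resolved in favour of the leftmost window.
    """
    best_start, best_frac = -1, -1.0
    for i in range(len(seq) - width + 1):
        frac = at_fraction(seq[i:i + width])
        if frac > best_frac:
            best_start, best_frac = i, frac
    return best_start, best_frac


# Toy sequence (invented): a 19-nt A tract embedded in GC-rich flanks.
toy = "G" * 30 + "A" * 19 + "C" * 30
print(richest_at_window(toy))      # (30, 1.0)
print(round(100 * 14 / 19))        # 74
```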

The vast majority of M. paratuberculosis K-10 genomic sequence has considerable nucleotide similarity to sequences from the human pathogenic isolate M. avium 104
As a basis for all nucleotide comparisons between M. avium and M. paratuberculosis in this study, an alignment of the 16S rRNA gene was performed. That analysis revealed 100% nucleotide identity over the entire 1,472-bp gene (data not shown). Likewise, the oriC region in M. paratuberculosis was found to share a high level of nucleotide identity (~98%) with M. avium. Calculation of the rates of total nucleotide diversity (π), synonymous substitution per synonymous site (dS) and non-synonymous substitution per non-synonymous site (dN) revealed patterns of variation within the range observed from sequence data outside the oriC region. These calculations showed a high degree of similarity between the two sequences and a predominance of synonymous over non-synonymous substitutions (Fig. 1). The patterns of nucleotide substitution  [25], indicates that genomic similarity continues outside the surrounding oriC region (Fig. 4). When evaluating similarities between two sequences of this size, a matrix comparison is the method of first choice. In addition, the matrix method displays matching regions in the context of the sequence as a whole, making it easy to determine if the regions are repeated or inverted. For example, figure 4 shows a large 56.6 kb genomic inversion of the region surrounding nucleotide 350,000. The DNA identity matrix also identified sequences that were present in one genome but absent in the other, as shown by the broken diagonal lines (Fig. 4). These data show remarkable similarity over large regions in both mycobacterial genomes.
Finally, we analyzed 548 recombinant clones from a randomly sheared M. paratuberculosis small insert library in order to obtain specific rates of nucleotide substitutions. Sequences from these clones represented over 350,000 bp of unique (non-overlapping) M. paratuberculosis genomic DNA and comprised 7% of the estimated 5 Mb genome sequence.
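To illustrate why a matrix (dot plot) comparison makes inversions easy to spot, the sketch below indexes exact k-mer matches between two sequences on both strands. It is not the software cited as [25]; the function names, the word size, and the toy sequence are assumptions. Collinear regions produce "+" hits along the diagonal, while an inverted segment produces "-" hits along an anti-diagonal.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmer_matches(a: str, b: str, k: int = 12):
    """Exact k-mer matches between a and b as (i, j, strand) tuples.

    i and j are 0-based start positions in a and b. Strand "+" means the
    words are identical; "-" means the word in b is the reverse complement
    of the word in a, as expected inside an inverted region.
    """
    a, b = a.upper(), b.upper()
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        for i in index.get(word, ()):
            hits.append((i, j, "+"))
        for i in index.get(revcomp(word), ()):
            hits.append((i, j, "-"))
    return hits


# Toy example (invented): invert positions 10-19 of a 30-nt sequence.
a = "ATGGCGTACCTGAAGTCGATCCGTTAGGCA"
b = a[:10] + revcomp(a[10:20]) + a[20:]
for hit in kmer_matches(a, b, k=5):
    print(hit)
```

Within the inverted segment the hits satisfy i + j = constant, which is the anti-diagonal seen in a plotted matrix. For megabase-scale genomes a larger word size and a plotting step would be needed; neither is shown here.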
From this analysis, we estimated the rates of total synonymous and non-synonymous substitutions for 200 fragments that were aligned in-frame and then analyzed with the program NAGV2 [26] using the methods of Nei and Gojobori [27]. The results of these analyses show that the average nucleotide diversity between the two subspecies is 2.59% ± 0.06% (range, 0% to 18.8%; median, 1.85% ± 0.05%). The results also show that the average rates of synonymous substitution per synonymous site are 3.38% ± 1.32% (range, 0% to 19.5%; median, 3.5% ± 1.5%). In contrast, the rates of non-synonymous substitution per non-synonymous site were 1.89% ± 0.05% (range, 0% to 12.9%; median, 1.3% ± 0.05%). These results not only indicate that the two subspecies have a high degree of nucleotide identity (>97%), but also suggest that the patterns of substitution have favored synonymous substitutions, as can be expected under purifying (negative) selection.
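For readers unfamiliar with the Nei and Gojobori approach, the following is a simplified sketch of the calculation for two in-frame, gap-free aligned coding sequences. It is not the NAGV2 program, and it makes assumptions worth stating: the standard genetic code is used, changes to or through stop codons are simply counted as non-synonymous rather than treated specially, and all function names and the toy sequences are hypothetical.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon: str) -> float:
    """Number of synonymous sites in one codon (0 to 3)."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    count += 1
    return count / 3


def codon_differences(c1: str, c2: str):
    """Synonymous and non-synonymous differences, averaged over all pathways."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return 0.0, 0.0
    paths = list(permutations(diff))
    syn = non = 0
    for path in paths:
        current = c1
        for pos in path:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODE[current] == CODE[nxt]:
                syn += 1
            else:
                non += 1
            current = nxt
    return syn / len(paths), non / len(paths)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction; only defined for p < 0.75."""
    return -0.75 * log(1 - 4 * p / 3)


def nei_gojobori(seq1: str, seq2: str) -> dict:
    """Estimate dS and dN for two aligned coding sequences (A, C, G, T only)."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, equal length, in frame")
    pairs = [(seq1[i:i + 3].upper(), seq2[i:i + 3].upper())
             for i in range(0, len(seq1), 3)]
    S = sum((syn_sites(a) + syn_sites(b)) / 2 for a, b in pairs)
    N = 3 * len(pairs) - S
    Sd = Nd = 0.0
    for a, b in pairs:
        s, n = codon_differences(a, b)
        Sd += s
        Nd += n
    pS = Sd / S if S else 0.0
    pN = Nd / N if N else 0.0
    return {"S": S, "N": N, "Sd": Sd, "Nd": Nd, "pS": pS, "pN": pN,
            "dS": jukes_cantor(pS), "dN": jukes_cantor(pN)}


# Toy alignment (invented): one synonymous (GGG/GGA) and one
# non-synonymous (AAA/AGA) difference across three codons.
print(nei_gojobori("GGGCTGAAA", "GGACTGAGA"))
```

A ratio of dN to dS below 1, as in the averages reported above (1.89% versus 3.38%), is the pattern usually read as selection against amino acid replacements.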

Discussion
With the genome sequencing projects of M. paratuberculosis and M. avium nearing completion, we have been able to compare large amounts of sequence data for the first time. Our results show substantial nucleotide identity above even that reported previously in the literature [2][3][4]. Paradoxically, the overall nucleotide identity between these phenotypically distinct mycobacteria appears similar to that observed with two phenotypically identical Helicobacter pylori isolates at ≥98% nucleotide identity [28].
The high nucleotide identity shared between M. paratuberculosis and M. avium directly conflicts with their divergent phenotypic characteristics. Because of strong similarity in the oriC region, alternative hypotheses should be tested to explain the growth rate differences between M. avium and M. paratuberculosis. Genomic rearrangements and the presence of unique genes identified by matrix analysis in this study are two such possibilities that could account for some of the phenotypic differences. We have recently reported on M. paratuberculosis coding sequences that are absent in M. avium [29]. From an analysis of 48% of the M. paratuberculosis genome, only 27 predicted coding sequences were found to be absent in M. avium. Therefore, an estimated total of 50-60 M. paratuberculosis coding sequences might be absent in M. avium following a whole genome analysis. This extremely low number of unique M. paratuberculosis genes is in stark contrast to E. coli where the MG1655 isolate contains 528 genes not found in the EDL933 isolate [30]. Further analysis of this limited number of unique coding sequences will be critical in developing specific diagnostic reagents. Finally, a detailed analysis of coding sequences unique to each respective mycobacterial genome and their genetic regulatory networks will be necessary to understand the molecular basis for growth rate and other phenotypic differences.
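The whole-genome estimate above follows from a simple proportional extrapolation, sketched here. The function name is hypothetical, and the calculation assumes that unique coding sequences are spread evenly across the genome, which may not hold if they cluster in islands.

```python
def extrapolate_unique_genes(observed: int, fraction_surveyed: float) -> float:
    """Scale an observed count to the whole genome, assuming uniform density."""
    if not 0 < fraction_surveyed <= 1:
        raise ValueError("fraction_surveyed must be in (0, 1]")
    return observed / fraction_surveyed


# 27 unique coding sequences found in 48% of the genome.
print(extrapolate_unique_genes(27, 0.48))  # about 56, within the 50-60 estimate
```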
Other potential explanations include the presence of global regulators, insertion sequences, transcription-translation rates, genomic rearrangements and ribosomal RNA operons. Each genome possesses insertion elements (IS900, IS1311) at unique loci that could affect growth rate or other phenotypes through insertional mutation. Foley-Thomas et al. [31] compared the expression of the luciferase gene in M. paratuberculosis with the fast-growing M. smegmatis and concluded that the rates of transcription and translation may not account for the slow growth of M. paratuberculosis.
We present evidence for at least one large-scale genomic rearrangement between these two subspecies. This rearrangement consists of a 56.6 kb inversion that contains approximately 61 predicted coding sequences (Bannantine and Kapur, unpublished). Genomic rearrangements such as that described could have a profound effect on phenotype. The presence of multiple copies of ribosomal RNA operons within a genome can contribute directly to a faster growth rate. The increased gene dosage results in more ribosomes and therefore increased protein translational capacity. However, only one rRNA operon is present in each subspecies, and this is also true for the fast growers Mycobacterium abscessus and Mycobacterium chelonae [32]. These fast-growing mycobacteria have multiple promoters that increase the transcriptional rate of the rRNA operon to overcome gene dosage limitations [32]. The rRNA operon promoter structures have not been mapped by primer extension for either M. paratuberculosis or M. avium, but if M. avium had multiple functional rRNA operon promoters, that might account for the growth rate differences.
The genetic organization of the origin of replication has been characterized in several Gram-positive bacteria including B. subtilis, S. coelicolor, M. tuberculosis, M. avium, M. leprae, and M. smegmatis [8]. The results of our investigation on the oriC region of M. paratuberculosis show that each of the 15 primer pairs, designed from M. avium sequence data, resulted in the successful amplification and subsequent sequencing of an ~11 kb region of the M. paratuberculosis genome. The sequenced region encodes 11 putative proteins, several of which show a high level of identity to proteins that are known or predicted to be involved in DNA replication. However, we found a cluster of substitutions in a region of rnpA (data not shown). It is noteworthy that in this region of the gene, each of the nucleotide substitutions results in an amino acid replacement. While mutations in this region of the gene are known to result in dramatic differences in the ability of bacteria to respond to environmental stresses [33], the functional significance of these differences between M. avium and M. paratuberculosis is at present unknown. While these sequencing efforts have revealed a conserved gene order in the oriC of Gram-positive bacteria [11], the nucleotide and amino acid identity between M. paratuberculosis and M. avium in this region is much stronger than that with other mycobacteria and other Gram-positive bacteria (see Table 2). It is well recognized that the characterization of gene organization in the oriC region as well as the complete genome sequence will provide a springboard for addressing questions such as the nature of the slow growth rate of M. paratuberculosis as compared to the genetically related rapidly growing mycobacteria. Progress on these research fronts will improve our chances of understanding and controlling infections caused by M. paratuberculosis and related pathogens.
The conservation of functional sequence motifs in the oriC of other Gram-positive organisms has provided clues to the mechanism of bacterial replication. For instance, DnaA monomers bind to specific, non-palindromic 9-nucleotide sequences called DnaA boxes, and this interaction is thought to initiate replication. The oriC of Gram-positive bacteria typically contains 10-30 of these DnaA boxes, often found in non-coding regions flanking the dnaA gene. The interaction of DnaA with DnaA boxes promotes the local unwinding of a nearby AT-rich region, providing an entry site for the DnaB/DnaC helicase complex. The dnaA gene itself is divided into four domains that differ in the extent of sequence homology [34]. Domain IV is responsible for DnaA box recognition and domain III is a highly conserved region containing the ATP-binding site [13,35]. Domain I participates in cooperative DnaA protein-DNA interactions [36].
The genetic relatedness of M. paratuberculosis with other mycobacterial subspecies has been the root cause of the lack of development of M. paratuberculosis-specific diagnostic tests. By comparing the genome sequences of both M. paratuberculosis and M. avium, specific diagnostic tests may be developed and a better understanding of the molecular differences that contribute to unique phenotypes will be obtained. Finally, knowledge of the complete genome sequence of M. paratuberculosis is expected to facilitate the identification of diagnostic sequences in this economically significant veterinary pathogen.

Conclusion
With the genomes of M. paratuberculosis and M. avium nearly completed, investigators will be able to analyze the similarities and differences between these genomes in considerable detail. Through a comparative genomic analysis of over 2 million nucleotides, we have shown that the two subspecies, avium and paratuberculosis, are highly similar at the gene and nucleotide level. This is in stark contrast to the phenotypic differences that each displays.

Strains and growth media
A cattle isolate (K-10) of M. paratuberculosis [31] has been chosen for genome sequencing studies. The organism was grown in Middlebrook 7H9 broth supplemented with OADC (Difco Laboratories, Detroit, MI), Tween 80, and mycobactin J (Allied Monitor, Fayette, MO) as described by Bannantine et al. [37]. M. avium strain 104 was grown in Middlebrook 7H9 broth. DNA was extracted using the Qiagen QIAamp Tissue Kit (Chatsworth, CA).