Diversity in coding tandem repeats in related Neisseria spp.

Background: Tandem repeats contained within coding regions can mediate phase variation when the repeated units change the reading frame of the coding sequence in a copy number dependent manner. Coding tandem repeats are those which do not alter the reading frame with copy number, and changes in the copy number of these repeats may then potentially alter the function or antigenicity of the protein encoded. Three complete neisserial genomes were analyzed and compared to identify coding tandem repeats where the number of copies of the repeat will have some structural consequence for the protein. This is the first study to address coding tandem repeats that may affect protein structures using comparative genomics, combined with a population survey to investigate which show interstrain variability.
Results: A total of 28 genes were identified. Of these, 22 contain coding tandem repeats that vary in copy number between the three sequenced strains, three strain specific genes were included for investigation on the basis of having >90% identity between repeated units, and three genes with repeated elements of >250 bp were included although no length variations were seen in the genomes. Amplification of these 28 coding tandem repeat containing regions from a set of largely unrelated strains, with sequencing of repeats showing altered copy number, revealed further repeat length variation in several cases.
Conclusion: Eighteen genes were identified which have variation in repeat copy number between strains of the same species, twelve of which show greater diversity in repeat copy number than is present in the sequenced genomes. In some cases, this may reflect a mechanism for the generation of antigenic variation, as previously described in other species. However, some of the genes identified encode proteins with cytoplasmic functions, including sugar metabolism, DNA repair, and protein production, in which repeat length variation may have other functions. Coding tandem repeats appear to represent a largely unexplored mechanism of generating diversity in the Neisseria spp.


Background
Variable copy number tandem repeats have been observed in a number of prokaryotic genomes [1,2]. These are adjacent sequences that are directly repeated, the repeated units of which may be identical or partially degenerate. Coding tandem repeats are those tandem repeats that are completely contained within a coding sequence and are composed of repeated units in which copy number will not disrupt the reading frame. Therefore, all coding tandem repeats have repeated units composed of 3 bp or multiples of 3 bp. These are distinct from intergenic repeats and from repeats such as those that mediate phase variation. There are many examples in which variation in copy number within coding tandem repeats has been shown to affect virulence and alter the ability of antibodies to bind to bacterial antigens. In Streptococcus agalactiae, there is a reduction in copies of a coding tandem repeat within the α C-protein from the same strain isolated from mother and neonate [3]. The proteins with deleted repeat units are no longer recognised by anti-α C-protein antibodies, and repeat deletion escape mutants can be generated with enhanced pathogenicity in immune mice [4]. These repeats share similarity with other streptococcal sequences in the Rib and Esp proteins, which also vary in the length of coding tandem repeats between strains [5][6][7]. Tandem repeated structures in the group A streptococcal M proteins, which are extensively studied virulence determinants, vary in length due to intragenic homologous recombination events [8,9]. Size variation in surface proteins Lmp1 and Lmp3 of Mycoplasma hominis has been correlated to tandem repeats at the C-terminal end of the proteins and contributes to immune evasion through antigenic variation [10]. In Mycoplasma hyorhinis, immune escape variants of the Vlp proteins are generated through intragenic recombination between the C-terminal coding tandem repeat region in homologues vlpA, vlpB, and vlpC [11,12]. 
Also, there is evidence that repeat epitopes can influence the overall antigenicity of proteins, as well as the availability of epitopes. For example, addition of tandem repeats in the PAc protein of Streptococcus mutans, which normally contains three long repeated regions, induces higher antibody production than the native peptide [13].
In the Neisseria spp., variable copy number coding tandem repeats have been observed previously only in PilQ [14], and DcaC [15], while different copy numbers of a coding tandem repeat have been reported separately for Lip / H.8 [16,17]. Although the functional consequences of these variations have yet to be determined, this is a potentially important mechanism of adaptation available to these species. A comprehensive analysis to identify genes in which potentially functional variation of this type occurs has not previously been performed in the Neisseria or any other bacterial species. In this study, comparisons of the complete genomes of N. gonorrhoeae strain FA1090, and N. meningitidis strains MC58 and Z2491 were conducted to identify all coding tandem repeats, and to identify which of these varied in copy number between the sequenced strains. Upon its availability, the N. meningitidis strain FAM18 genome sequence was added to this analysis. The coding tandem repeats were further investigated in a small diverse collection of strains, to extend the genome-based observations, and to determine which genes are likely to be undergoing functional variation of this type. A range of genes with potentially functionally important diversity in repeat encoded structures was identified.

Coding regions identified as containing coding tandem repeats
The three available complete neisserial genome sequences [18][19][20] were compared to identify genes containing coding tandem repeats associated with variation in the copy number of the repeated units. Each tandem repeat was evaluated to determine whether the entirety of the repeat is located within the predicted coding sequence and whether it leaves the reading frame unaltered. Tandem repeats that did not meet these criteria are not coding tandem repeats and as such were not investigated. Twenty-two genes were identified (Table 3), including pilQ [14] and dcaC [15], in which diversity in the coding tandem repeats was reported previously, and Lip / H.8 antigen [16,17], for which the two publications report different copy numbers of the coding tandem repeat in the single gene addressed. In addition, 2 genes only present in N. gonorrhoeae strain FA1090 (TR23, XNG0938 & TR25, XNG0481) and 1 gene only present in N. meningitidis strain MC58 (TR24, NMB1848) were included for further investigation, each having >90% identity between the repeated units (Table 3). Although these could not be assessed for differences in copy number of the coding tandem repeats between the genome sequenced strains, the high degree of identity between the repeated units justified further investigation to determine if diversity exists. A further 3 genes (TR26-TR28) were included on the basis of tandem repeats composed of repeated units of greater than 250 bp, although the copy numbers for these did not differ between the sequenced strains (Table 3). Although outside the primary criteria of this study, the unusually long nature of the coding tandem repeated units led to the inclusion of these three genes for investigation here, to assess if diversity in copy number in such repeats exists. The repeated elements within the coding tandem repeats in the selected candidate genes ranged in size from 6 bp to 273 bp (Table 4).
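The two inclusion criteria described above (the repeat lies entirely within the predicted coding sequence, and its copy number cannot shift the reading frame) can be sketched as a simple filter. This is a minimal illustration only, not the pipeline used in the study; the `TandemRepeat` class, the function name, and the coordinate convention (0-based, end-exclusive) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class TandemRepeat:
    """Hypothetical record for one tandem repeat tract on a genome."""
    start: int        # 0-based start of the whole repeat tract
    end: int          # end of the tract (exclusive)
    unit_length: int  # length of one repeated unit in bp


def is_coding_tandem_repeat(repeat: TandemRepeat, cds_start: int, cds_end: int) -> bool:
    """Return True if the tract is fully inside the CDS and is frame-neutral.

    Frame neutrality here means the unit length is a multiple of 3 bp, so that
    gaining or losing whole units never changes the reading frame.
    """
    inside_cds = cds_start <= repeat.start and repeat.end <= cds_end
    frame_neutral = repeat.unit_length % 3 == 0
    return inside_cds and frame_neutral
```

Under these assumptions, a 66 bp unit inside a CDS passes, whereas a 7 bp unit (which would shift the frame, as in phase-variable repeats) or a tract extending past the CDS boundary is rejected.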
These 28 genes were assessed using PCR in 11 neisserial strains to identify additional diversity in coding tandem repeat copy numbers. These 11 strains were chosen on the basis of previously observed diversity in repeat copy numbers of dcaC [15], and included 6 N. meningitidis, 3 N. lactamica, and 2 N. gonorrhoeae strains (Table 1). N. meningitidis strain MC58 was used as a positive and size control in the PCR. The previous dcaC study revealed no variability in tandem copy number between the N. gonorrhoeae strains studied. For the 2 gonococcus specific genes, 11 N. gonorrhoeae strains were analyzed, using strain FA1090 as a positive and size control.
Primers were designed flanking the tandem repeats such that PCR product size could be used to determine the number of copies of the coding tandem repeated unit. In the case of TR19 (tonB), the gene contains 2 tandem repeats, which were addressed separately (TR19a and TR19b). In the case of TR5 (pilQ) a compound tandem repeat is present, such that the 5' 24 bp of the 66 bp tandem repeat is then repeated itself as a 24 bp tandem repeat immediately following the 66 bp repeat (Figure 1). Therefore, TR5 was evaluated by sequencing in all strains. Additional sequencing was done for all of the products where the size of the PCR product suggested that the length of the tandemly repeated region might differ from the sequenced strains. In all, over 200 sequencing reactions were conducted to ascertain the sequence of the coding tandem repeat containing region(s) of the 28 coding sequences.
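The relationship between PCR product size and repeat copy number can be made concrete with a short calculation. The sketch below assumes that the only size difference between a test amplicon and the control amplicon comes from whole repeat units; the function name and all numeric values in the usage note are hypothetical and are not taken from the study.

```python
def estimate_copy_number(product_bp: int, control_bp: int,
                         control_copies: int, unit_bp: int) -> tuple[int, int]:
    """Estimate repeat copy number from amplicon size relative to a size control.

    Returns (estimated_copies, residual_bp). A non-zero residual means the size
    difference is not a whole number of units, which would argue for sequencing
    the product rather than relying on gel sizing alone.
    """
    size_difference = product_bp - control_bp
    unit_difference = round(size_difference / unit_bp)
    residual_bp = size_difference - unit_difference * unit_bp
    return control_copies + unit_difference, residual_bp
```

For example, with a hypothetical control product of 400 bp carrying 3 copies of a 21 bp unit, a 442 bp product gives `(5, 0)`, while a 450 bp product gives `(5, 8)`, flagging a discrepancy of 8 bp that sizing alone cannot explain.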

Observed differences between coding repeat lengths
Of the 28 genes containing coding tandem repeats, 6 were found to have differences in the number of coding tandem repeats that appear to divide along species lines in the limited strain collection used (Table 4; TR9, TR14, TR17, TR18, TR19, TR21). There was no length variation in one of the two N. gonorrhoeae specific genes (Table 4; TR25), nor in the three additional genes included in the study based on the length of the repeat (>250 bp) (Table 4; TR26, TR27, TR28), suggesting that these long repeats are comparatively stable. Six of the genes displayed no length differences beyond those seen in the sequenced strains (Table 4; TR1, TR3, TR6, TR7, TR12, TR16). Each of these had relatively few copies of the repeat (1 or 2, 2 or 3, or 1 or 3), whereas those which show variation additional to that seen in the genome sequence comparisons tended to have more copies of the repeated unit.

Predicted and known surface proteins with coding repeat copy number variation suggesting antigenic variation
The presence of coding tandem repeats within the genes encoding surface proteins has been recognized in other species as a mechanism of antigenic variation mediated by changes in the number of repeats [3][4][5][6][7][8][9][10][11][12]. In these cases, changes in the number of tandem repeat copies alter the protein epitopes and presumably offer some benefit to the organism through immune evasion. This process has not been directly demonstrated in the pathogenic Neisseria spp., nor has a detailed study of any bacterial genome been conducted in an attempt to identify the repertoire of coding tandem repeats within a strain. This is, therefore, the first report of its kind, and additionally includes data from genomic comparisons, diverse strain analysis, and sequencing of new copy number differences in the identified coding tandem repeats.
Several of the genes identified in this study are either known to be surface proteins, or are predicted to be surface exposed. Of the 28 genes investigated here, twelve are outer membrane proteins, or are predicted to be surface associated (TR1, TR4, TR5, TR7, TR11, TR13, TR14, TR15, TR16, TR20, TR21, TR26). For comparison, analysis of the complete genome of N. meningitidis strain MC58 [20] predicted 570 putative surface-exposed proteins out of 2158 annotated features [21]. Six of the 12 genes identified here contained tandem repeat copy numbers that differed from those of the sequenced strains (TR4, TR5, TR11, TR13, TR15, TR20). Additionally, two genes are predicted to encode cytoplasmic proteins, which are antigenic in other species (TR2 & TR8). This does not necessarily mean these two CDSs encode surface proteins, which is why they are included in a separate section (Cytoplasmic proteins with variable numbers of coding tandem repeats, which may also be antigenic surface proteins), but likewise it is possible that these proteins are surface exposed. Overall, half of the genes identified that contain coding tandem repeats (14 of 28) may be surface exposed proteins.
Figure 1: Two consecutive tandem repeat elements exist in pilQ (TR5). The first repeated unit is 66 bp.
A potential vaccine candidate
NMB2001 (TR4), a protein with some homology with the p60 invasin from Listeria monocytogenes [22], was identified as a potential vaccine candidate in the study based upon the genome sequencing project [21]. It has been determined to be surface exposed and available for antibody binding. The presence of a tandem repeat was referred to in that paper, but length variation was not described. The 29 amino acid repeat encoded in this protein, which constitutes the majority of the N-terminal portion of the protein, is variable among both the N. meningitidis and N. gonorrhoeae strains tested (Table 4). It is possible that changes in the coding tandem repeat copy number may alter the antigenicity of the protein, which could complicate its use in any new vaccine.
The compound tandem repeat of pilQ
A compound tandem repeat was identified in pilQ, composed of a repeat of 66 bp followed by one of 24 bp, the latter being similar to the 5' 24 bp of the 66 bp repeat (Figure 1). These repeats have been described and studied previously, with a slightly different description of the repeat structure [14]. PilQ forms a dodecameric pore in the neisserial outer membrane, through which the pilus extends from the periplasm to the extracellular space [23]. It is not known how the changes in repeat numbers might affect the protein-protein interactions in the dodecamer, or where these repeats are located in the pore structure. Strain variability in this study is similar to that described previously, with one or two copies of the 66 bp repeat, and one to five copies of the 24 bp repeat, with a notable exception. In N. meningitidis strain FAM18, there are no complete copies of either the 66 bp or 24 bp repeats (Table 4). This is due to a deletion in the gene extending from 50 bp into what would be the first copy of the 66 bp repeat to 361 bp after the end of the tandem repeat containing region. This large deletion in pilQ generates a frame-shift mutation in this strain and removes the annealing site of the PCR primer TR5R; therefore, TR5Rv2 was also designed. A large deletion (303 bp) which comprises a portion of a tandem repeat as well as non-repeated genic sequence is also seen in NMB2050 (TR26), although in this case the deletion does not generate a frame-shift.
Deletions associated with the tandem repeats in pilQ have not previously been reported. If the compound tandem repeat containing portion of the PilQ protein represents exposed epitopes, then the variation in the tandem repeat structure may be involved in antigenic variation. It has also been suggested that changes in the repeat alter the assembly of pilin in the context of variations in PilE and/or PilC expression [14].

Lipoproteins and putative lipoproteins
Annotated as a hypothetical protein in N. meningitidis strain MC58, TR11 (NMB1333), is predicted by PSORT to be an outer membrane or periplasmic protein. The NCBI Conserved Domain Search reveals that this CDS contains a sequence with homology to the Peptidase family M23/ M37, which in addition to the eukaryotic proteins of the family, includes bacterial lipoproteins that have no peptidase activity. The most 5' of the two tandem repeat sequences present in the gene, a repeat composed of 9 copies of a 21 bp element, does not display inter-strain differences in length, although in N. gonorrhoeae strains FA1090 and FA19 copy 7 has 9 bp deleted. The second repeat within this gene is a 15 bp (5 amino acid) tandem repeat present in two, three, and four copies, differences in lengths being present within both the meningococcal and gonococcal strains. The C-terminus of this protein, 3' of these tandem repeats, contains the region with homology to bacterial lipoproteins.
TR13 (NMB1468), is also predicted to be a lipoprotein. This CDS contains a 21 bp coding tandem repeat that is present in two, three, four, or seven copies in the strains studied (Table 4). This sequence has no significant homology to other sequences in the public databases. It is noteworthy that a number of the proteins encoded by CDSs containing coding tandem repeats are, or are predicted to be, lipoproteins (TR4, TR11, TR13, TR14, TR15,  TR21). In addition to the potential to antigenically vary the protein sequence, and therefore the structure of these surface exposed molecules, the change in number of repeated units may also influence the lipid component of the protein, as has been suggested for Lip [24].
The Lip repeat
Lip (TR15, also known as the H.8 antigen) is largely composed of a repeated 5 amino acid motif, and has been sequenced previously by two groups. The two reported sequences differ in the number of tandemly repeated sequences [16,17]. Although variation of the number of these repeats between strains has not previously been addressed at the DNA level in the literature, Lip is known to vary in gel mobility, suggesting significant inter-strain differences in size [25], and in the form of its lipid component [24]. A virulence-associated lipoprotein, this protein was investigated in the 1980s as a vaccine target due to its antigenicity and capacity to generate an antibody response during disseminated gonococcal infections [26]. Changes in the Mr of the protein correlate with serum resistance and neutrophil enzyme resistance [25], although these changes were also demonstrated to affect the immunogenicity and/or antigenicity of gonococcal P.1 [27]. Lip can be present as a multimer, but this too is dependent on the Mr of the monomer [25]. Here we demonstrate 7 different length variants of the tandem repeat that comprises most of the gene, in which only 69 bp (23 amino acids) of coding sequence are outside the tandem repeat. This is the first report in which the DNA repeat from the Lip encoding gene has been sequenced from different strains, demonstrating a high degree of diversity with copy numbers ranging from 10 to 18 copies. No PCR products were generated from the commensal N. lactamica strains, which is consistent with restriction of this gene to the pathogenic species [28]. This protein has not been pursued recently as a vaccine candidate, probably because antibodies directed against it were poorly bactericidal [29,30]. A second gene within the genomes (NMB1533/NMA1733) contains repeated copies of the AAEAP Lip consensus sequence [31]. The seven repeat copies in this 'azurin-like protein' do not vary between the sequenced strains and therefore it is not included under the criteria of this study. It should be noted that this second CDS has been mis-annotated in both published genomes as H.8, while the real Lip/H.8 antigen CDS (NMB1523/NMA1723) is annotated as a hypothetical protein and putative proline-rich repeat protein, respectively [19,20].
One of the two genes identified as present only in N. gonorrhoeae strain FA1090 (TR25, XNG0481) also has a tandem repeat which is similar to the AAEAP repeat of Lip, the 'azurin-like protein', and AniA. In this case the repeat was identified by ETANDEM as being 30 bp. It is present in 3 copies, or 6 copies of the 15 bp repeat, in all of the gonococcal strains evaluated. This CDS is present in a gonococcal specific island composed of 58 genes. At one end are genes whose homology indicates a prophage, including a putative phage integrase, transcriptional regulator, phage repressor, and DNA helicase. At the other end of this region are pemK and pemI, which were identified on plasmid R100 and are involved in its maintenance [36]. Therefore this region has features of both an integrated bacteriophage and an integrated plasmid, with a CDS in the middle containing a tandem repeat similar to that seen in other neisserial genes.
Of the genes evaluated, TR24 (NMB1848), was included in the study due to the high identity (97%) between repeat copies, although the gene itself was only found in N. meningitidis strain MC58. TR24 has more repeat copies than the two gonococcal genes also added to the study for this reason, the meningococcal gene having 15 copies of an 18 bp tandem repeat, rather than 3 copies in the two gonococcal genes (an 18 bp element in TR23, and a 30 bp element in TR25). Four length variants of this gene were identified, including differences between the closely related N. meningitidis strains MC58 and 44/76 (Table 4). This tandem repeated unit makes up most of the coding sequence of the gene, there being only 72 bp (24 amino acids) of coding sequence that is not within the tandem repeat. The composition of the majority of this gene by a varying number of coding tandem repeats is reminiscent of Lip, but the location and function of this gene product is not known.
The conserved dcaC repeat
DcaC (TR20) is predicted to be an outer membrane protein of unknown function containing a 36 amino acid variable copy number tandem repeat, and has been described previously [15]. Although the gene as a whole has no homology to others in the public databases, homologues of the dcaC repeat are present in several hypothetical proteins. In Magnetococcus sp. MC-1 CDS Mmc10969 there are 14 copies of the repeat (NZ_AAAN01000134); nine and ten in E. coli strain CFT073 CDSs c1269 and c5321, respectively [37]; four in Chlorobium tepidum strain TLS CDS CT0958 [38]; six in Vogesella indigofera strain ATCC19706 ORF1 (AF088857.1); five and three in Haemophilus somnus strain 129PT CDSs Hsom0164 (NZ_AABG01000001) and Hsom1526 (NZ_AABG01000013), respectively; and three in Pasteurella multocida strain PM70 CDS PM1611 [39]. Such conservation of repeat homology without overall homology of the proteins has not previously been reported. It appears, therefore, that the presence of a protein containing this repeat, and the variability in the copy number of this repeat, is conserved. In the Neisseria spp., the number of tandem repeats within the gene clearly increases the number of distinct hydrophobic regions within the protein (Figure 2).

The species-specific rmpM repeat
The shortest tandem repeat included in this study is 2 amino acids (6 bp) and is contained within RmpM (NMB0382, TR21). In this case, the presence of 6 copies of the repeat or 2 copies of the repeat appears to be linked with species, meningococci having the former and gonococci the latter. The presence of no PCR products in the N. lactamica strains is consistent with other work on this gene in the commensal Neisseria spp. [40], which suggests that a homologue is only present in some strains.
A potential adhesin
NMB0586 (TR7) is a putative adhesin that contains a 12 bp tandem repeat, which is actually a 2 amino acid repeat, the translation of the 12 bp repeat being HDHD. Although only 2 to 3 copies of this repeat were identified in this study, the product of this gene is predicted to be expressed on the outer membrane or in the periplasmic space. Most of the length of the predicted protein sequence shares homology with ABC transport periplasmic components/surface adhesins. The crystal structure for TroA, a periplasmic zinc-binding protein from Treponema pallidum, has been solved [41,42] and the placement of the TR7 tandem repeat in the structure suggests that it may alter any substrate binding capacity this neisserial protein may have.
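The observation that a 12 bp DNA repeat can correspond to a 2 amino acid protein repeat can be illustrated with a short sketch. The nucleotide sequence below is hypothetical (the actual TR7 unit is not given in the text); it simply uses synonymous codons for histidine and aspartate so that the DNA unit has a period of 12 bp while its translation, HDHD, has a period of 2 residues. The codon table is deliberately restricted to the four codons needed.

```python
# Partial codon table: only the His and Asp codons used in this illustration.
CODONS = {"CAT": "H", "CAC": "H", "GAT": "D", "GAC": "D"}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string using the partial codon table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def smallest_period(seq: str) -> int:
    """Return the length of the shortest unit that, repeated, rebuilds seq exactly."""
    n = len(seq)
    for p in range(1, n + 1):
        if n % p == 0 and seq[:p] * (n // p) == seq:
            return p
    return n


# Hypothetical 12 bp unit: His-Asp-His-Asp encoded with alternating synonymous codons.
hypothetical_unit = "CATGACCACGAT"
protein_unit = translate(hypothetical_unit)
```

With this hypothetical unit, `smallest_period(hypothetical_unit)` is 12 at the DNA level, while `protein_unit` is `"HDHD"` and `smallest_period(protein_unit)` is 2, which is one way a DNA-level repeat finder and a protein-level description can report different unit sizes for the same tract.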

Predicted and known cytoplasmic proteins in which altered coding tandem repeat copy number may alter function
The only one of the 28 genes found to contain no copies of the repeat in one of the strains was TR10, mfd. Phase variation [43] and recombination [44,45] are two characteristic features of the pathogenic Neisseria species. Mfd in other species has been linked with both DNA repair and recombination [46]. A knock-out mutant has been investigated in Haemophilus influenzae to determine whether Mfd influences phase variation rates, and no difference was found between wild-type and mutant [47]. However, the presence of diversity between neisserial species and strains in the length of a relatively long (69 amino acid) repeat within this protein may significantly affect its activity or interactions.
The greatest variation in copy number is seen within the gene with one of the shortest tandem repeated units. TR22 (NMB0281) contains a 9 bp coding tandem repeat at the 5' end, present in 2, 5, 6, 7, 9, 11, 16, 19, and 26 copies (Table 4). The C-terminus of the protein has homology to a rotamase domain. These enzymes increase the rate of protein folding by catalyzing the interconversion of cis-proline and trans-proline. It is possible that the copy number of the coding tandem repeat influences the rate or substrate preference of this enzymatic reaction. Tandem repeats in glucansucrases have been previously identified near the active site of these enzymes in Leuconostoc and Streptococcus species where they may contribute to their function through substrate binding [48].
Within a gonococcus-specific region is a CDS (XNG0938; TR23) that has variable numbers of an 18 bp (6 amino acid) coding tandem repeat. The region that contains TR23 also contains 18 other genes not present in the meningococcal genomes, including a divergently transcribed CDS with homology to a phage repressor protein. TR23 itself contains a region that is similar to the integrase core domain found in viral integrases and PSORT predicts this to be a cytoplasmic protein. The features of the genes in this region therefore suggest that this region is derived from a prophage.
Table footnotes: § From [20] unless otherwise noted. || Gene annotation from [74]. ¶ NMA1680 is annotated on the reverse complement strand compared to NMB1468 and XNG0955. ** Gene annotations from [75] and [30]. †† OMP: outer membrane protein. Gene annotation from [33]. ‡‡ Gene annotation from [76]. §§ Gene annotation from [15]. |||| Gene annotation from [77]. ¶¶ Corresponding gene is not present in this strain. *** Gene annotation from [78].

Cytoplasmic proteins with variable numbers of coding tandem repeats, which may also be antigenic surface proteins
Two, three, and four copies of a 33 bp tandem repeat are found in pgk (TR2, NMB0010). This gene encodes phosphoglycerate kinase (EC 2.7.2.3), a cytoplasmic enzyme involved in the pathway converting glucose to pyruvate [49]. This protein is conserved between prokaryotes and eukaryotes, and the crystal structures of both pig muscle [50] and Thermotoga maritima phosphoglycerate kinase have been determined [51]. The repeated region in the Neisseria spp. phosphoglycerate kinase maps onto an exposed surface portion of the protein. It was recently reported that in group B streptococcus phosphoglycerate kinase is a surface protein and antibodies directed against it provide protection against infection [52]. It is unclear at this time whether the neisserial protein is cytoplasmic, surface associated, or both, although it should be noted that group B streptococci and serogroup B N. meningitidis share capsule characteristics, and that the sugars for these may be a substrate for surface exposed phosphoglycerate kinase, in addition to its cytoplasmic role. Strains which varied in the copy number of tandem repeats were serogroup B N. meningitidis strains NGE30 (2 copies), BZ133 (2 copies), MC58 (3 copies) and 44/76 (4 copies). In contrast, neither the other serogroups of N. meningitidis, nor the N. gonorrhoeae strains, displayed any variability in tandem repeat copy number (Table 4).
A second protein that functions in the cytoplasm has been identified in this study, SucB (TR8, NMB0956). This is the dihydrolipoamide succinyltransferase (E2o) of the 2-oxoglutarate dehydrogenase complex, a component of the TCA cycle. Although the sequence of this gene in E. coli contains no repeats, the corresponding acetyltransferase component of the pyruvate dehydrogenase complex (E2p) does [53]. In Brucella melitensis and Coxiella burnetii, the product of sucB is immunogenic, antibodies to SucB being present in the serum of infected sheep and Q fever patients, respectively [54,55]. While there is no evidence that SucB is a surface exposed protein in these species, it does raise the possibility that variation of the 30 bp repeat in the neisserial gene may alter the antigenicity of this protein. Alternatively, the changes in the protein due to the differing tandem repeat copy numbers may offer certain neisserial strains adaptive advantages through altered enzymic activity.

The range of mechanisms generating diversity in Neisseria
Neisserial species have a number of different mechanisms by which they generate diversity. At the level of genic composition they are naturally transformable using a species-specific uptake sequence [56], have the capacity to generate mosaic genes [57,58], have a relatively highly panmictic population structure [59,60], and have genetic loci preferentially associated with strain-divergent genes within Minimal Mobile Elements [61]. At the level of phase variation they have many known and candidate switching genes [43,62], and also have systems utilizing recombination to diversify specific genes such as pilE [63]. Each of these influences the dynamic way in which different strains interact with their hosts, and the flexibility with which a colonizing population can diversify and adapt to the differing and changing environments within a single host. Flexibility due to variation in the number of the coding tandem repeats in the genes highlighted in this paper, as reflected by differences in repeat copy number between strains and species, probably represents an additional mechanism by which these host-restricted pathogens can optimize their niche adaptation to their human hosts. Coding tandem repeats within the Neisseria spp. are likely to add a further level of diversity generation within the already highly adaptable, dynamic, and variable neisserial population.

Conclusions
Through alteration of the copy number of coding tandem repeats, the Neisseria spp. may have an additional mechanism of generation of diversity that has not previously been explored in detail. While the alteration of the copy number of coding tandem repeats has been recognized previously in three genes (pilQ, dcaC, lip), the functional consequence of these changes has not been addressed. This is the first report to identify all the sequenced neisserial genes that have coding tandem repeats and determine if these are present in variable copy number. From this assessment, it becomes apparent that this is potentially a mechanism for antigenic variation of surface proteins and / or for functional variation of cytoplasmic proteins.

Whole genome analysis to identify coding tandem repeats
The previously described whole genome analysis methodology [62,64] was applied, using the ACEDB graphical interface [65]. The complete genome sequences of N. meningitidis serogroup B strain MC58 [20], N. meningitidis serogroup A strain Z2491 [19], and N. gonorrhoeae strain FA1090 [18] (publicly available from 1997, downloaded November 2000 from ftp://ftp.genome.ou.edu/pub/gono/gono-2k.fa) were assessed. Direct tandem repeats were identified using ETANDEM for repeat components of up to 100 bp, because it consumes computational cycles in a logarithmically expanding fashion with sequence length. EQUICKTANDEM does not have such heavy computational demands, and was used only for the identification of repeats between 100 and 1000 bp. Both programs are from the EMBOSS package [66], and were used with standard parameter settings. Near the completion of this project the complete sequence of N. meningitidis serogroup C strain FAM18 became available from The Wellcome Trust Sanger Institute (ftp://ftp.sanger.ac.uk/pub/pathogens/nm/), and the coding tandem repeat copy numbers of the 28 genes identified from the initial 3-way genome sequence comparison were similarly assessed.
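To make the idea of scanning a sequence for direct tandem repeats concrete, the following is a deliberately naive sketch. It is not the algorithm used by ETANDEM or EQUICKTANDEM: it only finds exact, uninterrupted copies, whereas the repeats discussed in this paper may be partially degenerate. The function name, parameters, and defaults are assumptions made for illustration.

```python
def find_exact_tandem_repeats(seq: str, min_unit: int = 3, max_unit: int = 12,
                              min_copies: int = 2) -> list[tuple[int, int, int]]:
    """Find exact direct tandem repeats by brute force.

    Returns a list of (start, unit_length, copies) tuples. For each unit length,
    the scan moves left to right, counts consecutive identical copies of the
    unit starting at each position, and skips past any tract it reports.
    """
    hits = []
    n = len(seq)
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + 2 * k <= n:
            unit = seq[i:i + k]
            copies = 1
            # Extend while the next k bases are an exact copy of the unit.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, k, copies))
                i += copies * k
            else:
                i += 1
    return hits
```

Note that this simple scan reports the same tract again at multiples of the true unit length (a 3 bp repeat in four copies is also a 6 bp repeat in two copies), so a real analysis would need to collapse such hits and tolerate mismatches between units.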

Bacterial strains and growth conditions
The neisserial strains used are shown in Table 1. These strains were chosen based on the results obtained previously concerning copy number differences in the coding tandem repeat in dcaC [15]. In addition to the information presented in that publication, further information on most of these strains can be obtained from the Neisseria Multi Locus Sequence Typing website http://neisseria.mlst.net developed by Dr Man-Suen Chan and sited at the University of Oxford. Strains were propagated on GC agar (Difco Laboratories) containing the Kellogg supplement and ferric nitrate [67] at 37°C under 5% (v/v) CO2.

PCR amplification and sequencing
Chromosomal DNA extractions were performed using the method of McAllister and Stephens [68] or Ausubel et al. [69]. PCR from chromosomal DNA was performed using Invitrogen Taq DNA Polymerase or Bioline Bio-X-Act polymerase according to the manufacturers' instructions using the primer pairs shown in Table 2. When necessary, secondary primers were designed to obtain PCR products and sequences, denoted v2 in Table 2. PCR products were resolved on the appropriate concentration of either SeaKem® LE agarose gels (Flowgen) or MetaPhor® agarose gels (Flowgen) containing 0.5 µg/ml ethidium bromide (Sigma). PCR product size was determined using Quantity One® Quantitation Software (BIORAD). Automated sequencing used ABI Prism® BigDye™ Terminator Cycle Sequencing version 2.0 or version 3.0 (Applied Biosystems), and was resolved on an ABI Prism® 3100 DNA Sequencer (Applied Biosystems).

Nucleotide sequence analysis
The Basic Local Alignment Search Tool (BLAST) [70] was used to search publicly available microbial genome sequences, GenBank, or EMBL. The complete genome sequence of N. gonorrhoeae strain FA1090 was obtained from the N. gonorrhoeae Genome Sequencing Project at the University of Oklahoma (http://www.genome.ou.edu/gono.html), which was independently annotated as described previously [43]. XNG numbers refer to this annotation, and where no N. meningitidis homologue is present to identify these CDSs ( v4.0 (NCBI). The Wisconsin Package from GCG (Accelrys) was used for nucleotide and amino acid sequence analysis and alignments. Staden was used for ABI sequence trace assembly and analysis. Predictions of signal sequences and protein localization were generated using PSORT, which currently claims 83% prediction accuracy [71]. Transmembrane domains and hydrophobicity profiles were predicted using TopPredII [72,73].