Typing Clostridium difficile strains based on tandem repeat sequences
© Zaiß et al. 2009
Received: 02 October 2008
Accepted: 08 January 2009
Published: 08 January 2009
Genotyping of epidemic Clostridium difficile strains is necessary to track their emergence and spread. Portability of genotyping data is desirable to facilitate inter-laboratory comparisons and epidemiological studies.
This report presents results from a systematic screen for variation in repetitive DNA in the genome of C. difficile. We describe two tandem repeat loci, designated 'TR6' and 'TR10', which display extensive sequence variation that may be useful for sequence-based strain typing. Based on an investigation of 154 C. difficile isolates comprising 75 ribotypes, tandem repeat sequencing demonstrated excellent concordance with widely used PCR ribotyping and equal discriminatory power. Moreover, tandem repeat sequences enabled the reconstruction of the isolates' largely clonal population structure and evolutionary history.
We conclude that sequence analysis of the two repetitive loci introduced here may be highly useful for routine typing of C. difficile. Tandem repeat sequence typing resolves phylogenetic diversity to a level equivalent to PCR ribotypes. DNA sequences may be stored in databases accessible over the internet, obviating the need for the exchange of reference strains.
Clostridium difficile is a Gram-positive, spore-forming, obligately anaerobic bacterium. It is the leading cause of nosocomial diarrhoea among patients undergoing antibiotic treatment [1, 2]. The severity of C. difficile-associated disease (CDAD) ranges from mild diarrhoea to pseudomembranous colitis, toxic megacolon, and intestinal perforation [3–6]. Mortality rates of CDAD reportedly range from 6 to 30% [5, 7, 8]. During the last decade, the incidence of CDAD has increased significantly in North America [9–12] and Europe [4, 8, 13, 14]. In the USA and Canada, this increase has been associated with the emergence of a novel, hypervirulent strain designated NAP1/027 [11, 15]. Strains with the same genotype and associated outbreaks have also been reported from several European countries [14, 16–18].
For infection control investigations and epidemiological studies, it is mandatory to track the emergence and spread of epidemic strains. For this purpose, appropriate genotyping methods are needed. The utility of a typing method will depend on its inter-laboratory reproducibility and data portability, its discriminatory power and concordance of identified groupings with epidemiology, the temporal stability of the genetic markers investigated, and the universal typeability of isolates. Multilocus variable number of tandem repeats analysis (MLVA) is the most discriminatory method presently available for typing C. difficile [20, 21]. Recently reported results suggested that the level of resolution achieved through MLVA may be highly useful for detecting epidemiological clusters of CDAD within and between hospitals [21, 22]. The genetic loci currently exploited for MLVA-typing of C. difficile accumulate variation so rapidly, however, that longer-term relationships between isolates get obscured. It is therefore advisable - and has been a common practice - to combine MLVA with the analysis of more conserved genetic markers [20–23]. The most commonly applied approaches to genotyping C. difficile at present are DNA macrorestriction analysis (based on pulsed-field gel electrophoresis, mostly used in Canada and the USA [12, 15, 24]) and PCR ribotyping (in Europe [25–27]). These two methods yield largely concordant results [23, 27]. While DNA macrorestriction has slightly higher discriminatory power than PCR ribotyping, it is also more labour-intensive and time-consuming [23, 27–29].
A major disadvantage of PCR ribotyping, DNA macrorestriction, and other band-based typing techniques (including restriction endonuclease analysis (REA)) is the poor portability and interlaboratory comparability of the generated data. Bacterial strains to be compared usually need to be run on the same electrophoresis gels, which requires the exchange of reference strains between institutions. This requirement seriously hampers epidemiological investigations, particularly at international scales [21, 23].
Typing procedures based on DNA sequences overcome these limitations, since sequence data may easily be exchanged and stored in databases that are accessible via the internet. Accordingly, a scheme for multilocus sequence typing (MLST) of C. difficile was developed recently that is based on sequences from seven housekeeping gene fragments. While MLST to date has been applied to a limited number of isolates, available data allowed a first glimpse at the largely clonal genetic population structure of C. difficile [23, 31, 32]. In clonal bacteria, novel genotypes in the course of evolution are generated primarily through mutations, which in slowly evolving housekeeping genes are rare. Hence, it is this very clonality of C. difficile and the associated linkage disequilibrium that causes MLST to provide poor discriminatory power, which is exemplified by the fact that relevant epidemic strains are not resolved. In addition, MLST remains too expensive to be applied for routine typing aside from dedicated research projects.
More variable genomic regions may provide improved discrimination ability. In contrast to MLST, it may even suffice to sequence a single locus or very few genetic loci that are sufficiently variable, since - analysing a clonal population - phylogenetic inferences will rarely be confounded through homologous genetic recombination. Sequence-based typing schemes relying on one or several highly discriminatory markers have previously been established for a number of pathogens, including Staphylococcus aureus (spa gene), Campylobacter jejuni (flaA) [34, 35], Streptococcus pyogenes (emm) and Neisseria meningitidis (porA, fetA) [37–39].
The surface layer protein gene slpA has recently been proposed as a promising target for sequence-based typing of C. difficile. The limited data available suggest extremely high sequence variation among isolates and, correspondingly, excellent discriminatory power [23, 40]. To date, however, slpA sequencing reportedly has been applied to a total of only 11 different ribotypes, and it is not clear if the method is universally applicable [23, 40]. It is anticipated that the requirement for degenerate oligonucleotide primers may restrict the general utility of the current protocol. The method has not yet been successfully transferred to any other laboratory [23, 40].
This present report describes the development and application of a new assay for genotyping C. difficile that is based on sequence analysis of two stretches of repetitive DNA. Investigating a panel of 154 diverse C. difficile isolates, we demonstrate extensive sequence variation in these genomic regions, resulting in high discriminatory power, and excellent concordance with PCR ribotyping.
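Table 1. Characteristics of tandem repeat loci TR6 and TR10. [Table body not fully recoverable. Surviving column headings: tandem repeat locus; copy no. range; no. of different repeats. Surviving genome coordinate ranges, in source order: 725321:725600 and 3753166:3753574.]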
Genomic regions containing short tandem repeats may evolve fast due to intra-molecular recombination and frequent polymerase slippage during DNA replication [43–45]. Accordingly, loci TR6 and TR10 displayed both sequence polymorphisms, generated through substitution of individual nucleobases (Additional files 3, 4), and length polymorphisms, resulting from repeat copy number variation (Additional file 2). Sequences of individual repeats were highly variable, with a nucleotide diversity π of 0.28 ± 0.01 for TR6 and 0.23 ± 0.01 for TR10. The majority of nucleotide substitutions at locus TR6 were synonymous, i.e., they left the encoded amino acid sequence unaffected, and hence may be considered selectively neutral. This was reflected by a Ka/Ks value of 0.39 (that is, below 1), suggesting that TR6 sequences evolve under purifying selection. Locus TR10 does not encode any protein and, hence, its sequence variation is likely neutral, too.
While TR6 and TR10 displayed remarkable sequence variation, both loci seemed sufficiently stable to identify genetically related isolates collected over time. First, the stability of TR6 and TR10 was demonstrated by two VPI 10463 and three 630 strains (including the published genome sequence) that had each been handled in different laboratories prior to our analysis (Additional file 1) and, hence, had independently been subcultured multiple times, yet shared the same respective TRST sequence types (Additional file 1). Furthermore, stability of both tandem repeat regions was circumstantially suggested by identical sequences found in multiple isolates sharing the same ribotype but originating from different geographical regions (Additional file 1).
Results were compared to PCR ribotyping on the basis of 154 isolates including international reference strains and clinical isolates collected at various German laboratories (Additional file 1). These isolates had been preselected from the material available to represent maximal diversity as judged on the basis of PCR ribotyping and geographic origin. They represented 75 different ribotypes (Additional file 1). Figure 2 shows a neighbor joining dendrogram based on the repeat successions in concatenated TR6 and TR10 sequences.
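Table 2. Discriminatory power and concordance of tandem repeat sequence typing and PCR ribotyping. [Table body not fully recoverable. Surviving column headings: no. of strains included; no. of different types; concordance with ribotyping (%). Surviving interval values, in source order: 0.953 - 0.982; 0.954 - 0.981; 0.911 - 0.951; 0.934 - 0.964.]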
TRST demonstrated high overall concordance with PCR ribotyping for the set of strains typed in this study, resulting in a calculated adjusted Rand's index of 89.8% (Table 2). The probability that a pair of isolates with the same ribotype also shared identical TRST sequence types was 89.6% (Wallace index 0.896). Accordingly, ribotypes usually corresponded to specific TRST sequence types (Figure 2). For example, 18 isolates with ribotype 027, originating from six different European countries, displayed identical sequences at TR6 and TR10 that discriminated them from all other isolates, and jointly were assigned TRST sequence type tr-027 (Additional file 1, Figure 2). Similarly, four isolates with ribotype 017 from three different countries, including the reference strain for toxinotype VIII, were assigned sequence type tr-017 (Additional file 1, Figure 2). Future work on larger numbers of isolates may reveal that sequencing a single locus (TR6 or TR10) will suffice to identify epidemiologically relevant strains. For the sake of concordance with PCR ribotyping, however, we presently suggest sequencing both loci. As outlined above, this strategy will also detect the impact of recombination.
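The two concordance measures quoted above can be illustrated with a minimal sketch that counts isolate pairs sharing a type under each scheme. The function names and the toy labels are assumptions made for illustration; the study itself used EpiCompare, whose implementation is not reproduced here.

```python
from collections import Counter
from math import comb


def pair_counts(types_a, types_b):
    """Count isolate pairs that share a type in A, in B, and in both."""
    if len(types_a) != len(types_b):
        raise ValueError("both schemes must type the same isolates")
    n_pairs = comb(len(types_a), 2)
    same_a = sum(comb(c, 2) for c in Counter(types_a).values())
    same_b = sum(comb(c, 2) for c in Counter(types_b).values())
    same_both = sum(comb(c, 2) for c in Counter(zip(types_a, types_b)).values())
    return n_pairs, same_a, same_b, same_both


def wallace(types_a, types_b):
    """Probability that a pair sharing a type in A also shares a type in B."""
    _, same_a, _, same_both = pair_counts(types_a, types_b)
    if same_a == 0:
        raise ValueError("no pair of isolates shares a type in scheme A")
    return same_both / same_a


def adjusted_rand(types_a, types_b):
    """Adjusted Rand index: pair agreement corrected for chance agreement."""
    n_pairs, same_a, same_b, same_both = pair_counts(types_a, types_b)
    expected = same_a * same_b / n_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        raise ValueError("index undefined for these partitions")
    return (same_both - expected) / (maximum - expected)


# Toy labels (hypothetical, not data from this study).
scheme_a = ["A", "A", "A", "B", "B", "C"]
scheme_b = ["x", "x", "x", "y", "z", "w"]
w_ab = wallace(scheme_a, scheme_b)   # scheme B splits group "B" of scheme A
w_ba = wallace(scheme_b, scheme_a)
ari = adjusted_rand(scheme_a, scheme_b)
```

The Wallace coefficient is directional: in the toy data, scheme B subdivides one group of scheme A, so the coefficient from A to B is below 1 while the coefficient from B to A equals 1.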
Discrepancies between TRST and ribotyping were apparent where either method split a particular group of isolates into two or three classes, whereas the other lumped them into one (Figure 2). In virtually all of these cases, however, the respective isolates were affiliated to identical MLST sequence types or to single locus variants with respect to MLST (i.e., identical sequences at six out of seven MLST loci), indicating their close phylogenetic relatedness. Phylogenetic coherence of these additional (sub-)classes will remain unclear as long as there are no phylogenetic markers available to investigate the detailed evolutionary history of C. difficile within MLST sequence types.
Evolutionary relationships between isolates may be revealed through tandem repeat sequence alignment and phylogenetic analysis. This is also feasible for those isolates that were assigned different TRST types. For example, MLST indicates that ribotypes 027, 156, and 019 are closely related, since corresponding isolates are assigned two MLST sequence types that differ at one locus only (Figure 3). A close relationship of ribotypes 027 and 019 has previously also been found on the basis of DNA macrorestriction analysis, when isolates with both ribotypes were assigned to the 'North American Pulsotype NAP1'. Concordantly with MLST and macrorestriction, TRST also indicated the relatedness of these types through similar tandem repeat sequences that clustered tightly in the phylogenetic tree (Figure 2), yet it maintained the discriminatory power of PCR ribotyping by assigning three different sequence types (tr-034, tr-027, tr-019) (Figure 2). Similarly, ribotypes 078 and RKI35 were indicated to be closely related to ribotype 066 by both MLST and TRST (Figures 2 and 3). In contrast, these relationships were not at all apparent on the basis of ribotyping band patterns (Figure 4).
Phylogenetic relatedness was also indicated in cases where TRST was more discriminatory than PCR ribotyping. For example, ribotypes 001, 163, 087, 014, and 117 each were subdivided into several TRST types (Figure 2). Clusters of related tandem repeat sequences in the phylogenetic tree still corresponded to PCR ribotypes (Figure 2), which warrants the comparability of results from both methods. This feature may be highly desirable, since it will facilitate, for example, cross-referencing to ribotyping-based examinations and maintaining the continuity of ongoing surveillance programs.
Ribotyping does not enable phylogenetic analyses based on dissimilar banding patterns, and the relatedness of different ribotypes has not commonly been assessed. In the long run, large-scale mutation discovery and genomic (re-)sequencing will reveal the phylogenetic validity of typing procedures.
We anticipate that PCR ribotyping will eventually be replaced by typing procedure(s) based on DNA sequences. The inherent portability of sequence data will obviate the need for the exchange of reference strains and enable decentralised genotyping efforts, which may boost large scale investigations on the molecular diversity of C. difficile. At present, however, our knowledge about the diversity and population biology of this important pathogen is very limited [23, 31, 32]. As a consequence, it is generally not clear if isolate groupings provided by various typing methods, including PCR ribotyping, are concordant with the epidemiology of associated disease [21, 23]. Related to these considerations, one limitation of this present study is the lack of epidemiologically linked isolates in our data set. Investigations in the near future should evaluate the utility of tandem repeat sequencing for infection chain tracking and short-term epidemiological investigations.
Sequence analysis of tandem repeats TR6 and TR10 provided full typeability across a wide range of C. difficile isolate diversity, excellent concordance with PCR ribotyping, and equal discriminatory ability. Sequence clades corresponded to phylogenetically coherent groupings. This sequencing-based typing approach may prove particularly useful because DNA sequences can easily be exchanged via the internet.
A total of 154 C. difficile isolates comprising 75 different ribotypes were used in this study. The strain collection included both international reference strains and selected clinical isolates from various German hospitals, collected in 2007 and 2008. More detailed information about individual isolates is given in Additional file 1.
Genomic DNA was isolated from cultures grown for 48 h on cycloserine-cefoxitin fructose agar (OXOID, Basingstoke, UK), by using the DNeasy Blood & Tissue Kit (QIAGEN, Hilden, Germany) according to the manufacturer's recommendations.
PCR ribotyping initially was performed at the Reference Laboratory for Clostridium difficile at the Leiden University Medical Center in the Netherlands and later was transferred to the Robert Koch Institute. We followed the protocol of Bidet et al., except that PCR products were run on 1.5% agarose gels in 1× TBE at 85 volts for 4 hours. Isolates were assigned novel PCR ribotypes if their patterns differed from previously named patterns by at least one band.
To facilitate the application of tandem repeat sequence typing, a duplex PCR was designed using the following primers: TR6-F (5'-TTTCAACTTGTCCAGTTTTTAAGTC-3') and TR6-R (5'-ATGACATAGCGTTTGTGGAAT-3'); TR10-F (5'-TGCATCAAATTGGTCAAGACTC-3') and TR10-R (5'-TGAAATCATTGACTATAAAGCAAAA-3'). DNA amplification was performed on 1 μl of purified genomic DNA in a final volume of 50 μl containing 0.1 μM of TR6 and 1 μM of TR10 primers, 200 μM of each deoxynucleoside triphosphate, 1× PeqLab PCR buffer Y (20 mM Tris-HCl, 16 mM (NH4)2SO4, 0.01% Tween 20, 2 mM MgCl2) and 1.25 units Hot Taq-DNA-Polymerase (PeqLab, Erlangen, Germany). After an initial denaturation at 96°C for 3 min, the protocol consisted of 35 cycles at 96°C for 45 s, 52°C for 45 s, and 72°C for 45 s, followed by a final extension at 72°C for 7 min. PCR products were prepared for sequencing using the QIAquick® PCR Purification Kit (QIAGEN, Hilden, Germany), and 0.35 μl of the purified products were applied for sequencing using the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, USA) with the same primers as employed in the PCR. Automated sequence detection was performed on an ABI capillary sequencing system and sequences were analysed using the BioNumerics 5.10 software (Applied Maths, Belgium).
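As an illustration of how the primer pairs above delimit the sequenced regions, the following minimal sketch locates an amplicon on a template by exact string matching. It is an assumption-laden simplification: the function names are hypothetical, only one strand orientation is searched, mismatches are not tolerated, and the template is a synthetic toy sequence rather than C. difficile genomic DNA.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_amplicon(template, forward_primer, reverse_primer):
    """Return the amplicon (primers included) for exact-match priming, or None.

    Only the given strand of the template is searched, which is a
    simplification of real in silico PCR.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Primer sequences as given in the text.
TR6_F = "TTTCAACTTGTCCAGTTTTTAAGTC"
TR6_R = "ATGACATAGCGTTTGTGGAAT"

# Synthetic toy template (hypothetical insert between the primer sites).
toy_insert = "ACGTACGTACGT"
toy_template = "GG" + TR6_F + toy_insert + reverse_complement(TR6_R) + "CC"
amplicon = find_amplicon(toy_template, TR6_F, TR6_R)
```

A production tool would search both strands and allow a limited number of primer mismatches; neither is attempted here.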
Data processing was performed with BioNumerics 5.10 by using a novel, dedicated "Repeat Typing" plugin that allowed automated batch assembly of trace files. The assignment of TRST sequence types was based on the successive occurrence of user-defined repeats in concatenated sequences from both tandem repeat loci. A repeat distance matrix for matching and clustering was calculated based on the DSI model, a mutation model comprising substitutions, indels (insertions or deletions), and duplications. Subsequent cluster analysis was performed based on the neighbor joining algorithm.
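The idea of assigning a type from the succession of user-defined repeats can be sketched as follows. This is not the algorithm of the BioNumerics plugin, which is not described in detail here; the repeat catalogue, the repeat codes, the greedy parsing strategy, and the type names are all hypothetical and serve only to illustrate the principle.

```python
def parse_repeats(array_seq, catalogue):
    """Split a repeat array into a succession of catalogue repeat codes.

    Greedy parsing, longest catalogue repeat first; raises ValueError if a
    position matches no known repeat.
    """
    by_length = sorted(catalogue.items(), key=lambda kv: len(kv[1]), reverse=True)
    codes, pos = [], 0
    while pos < len(array_seq):
        for code, repeat in by_length:
            if array_seq.startswith(repeat, pos):
                codes.append(code)
                pos += len(repeat)
                break
        else:
            raise ValueError(f"unknown repeat at position {pos}")
    return tuple(codes)


def assign_type(profile, known_types):
    """Return the type for a repeat succession, registering new profiles."""
    if profile not in known_types:
        known_types[profile] = f"type-{len(known_types) + 1}"
    return known_types[profile]


# Hypothetical repeat catalogue and toy repeat arrays for two loci.
catalogue = {"r1": "ACGTAC", "r2": "ACGTTC", "r3": "GGATCA"}
locus_1 = parse_repeats("ACGTACACGTTCACGTAC", catalogue)
locus_2 = parse_repeats("GGATCAGGATCA", catalogue)

known_types = {}
# The type is defined by the concatenated successions from both loci.
first = assign_type(locus_1 + locus_2, known_types)
second = assign_type(locus_1 + locus_2, known_types)
other = assign_type(locus_2 + locus_1, known_types)
```

Two isolates receive the same type only if the full succession of repeats at both loci is identical; clustering of non-identical successions would additionally require a distance model such as the DSI model mentioned above, which is not implemented in this sketch.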
Clostridium difficile isolates were typed by MLST as described previously. Sequence data were submitted to the C. difficile MLST database http://www.pasteur.fr/recherche/genopole/PF8/mlst/Cdifficile.html to assign allele profiles and the resulting sequence types. Sequence types were analysed by constructing a dendrogram based on the UPGMA (Unweighted Pair Group Method with Arithmetic mean) clustering algorithm using the multistate categorical similarity coefficient (tolerance 0%) available in the BioNumerics software.
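A categorical comparison of allele profiles can be sketched as below, assuming that the similarity between two profiles is the fraction of loci with identical alleles; whether this matches every detail of the BioNumerics coefficient is an assumption. The allele numbers are toy values, and the UPGMA clustering step itself is not shown.

```python
def categorical_similarity(profile_a, profile_b):
    """Fraction of loci at which two allele profiles carry the same allele."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


def is_single_locus_variant(profile_a, profile_b):
    """True if the profiles differ at exactly one locus."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b)) == 1


# Toy seven-locus allele profiles (hypothetical allele numbers).
st_x = (1, 1, 1, 1, 1, 1, 1)
st_y = (1, 1, 1, 1, 1, 1, 2)
st_z = (2, 3, 1, 4, 1, 5, 2)

sim_xy = categorical_similarity(st_x, st_y)
slv_xy = is_single_locus_variant(st_x, st_y)
slv_xz = is_single_locus_variant(st_x, st_z)
```

Because alleles are treated as unordered categories, two alleles either match or do not; the number of nucleotide differences between alleles plays no role in this coefficient.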
Seven-locus MLVA was conducted as described previously [20, 22], except that the different loci were PCR-amplified individually and PCR products were sequenced for repeat copy number determination. To facilitate sequence analysis of MLVA locus C6, two novel oligonucleotide primers were used: C6-F 5'-CCAAGTCCCAGGATTATTGC-3' and C6-R 5'-AACATGGGGATTGGAATTGA-3'. Repeat copy numbers were determined manually using BioNumerics 5.10 software. The summed tandem-repeat difference was calculated where appropriate; it is the sum of repeat differences between two isolates at all seven MLVA loci.
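The summed tandem-repeat difference defined above can be written as a short function. The sketch assumes that the per-locus difference is taken as an absolute value; the locus labels and copy numbers are placeholders, not data from this study.

```python
def summed_tandem_repeat_difference(copies_a, copies_b):
    """Sum of absolute repeat copy number differences over all shared loci."""
    if copies_a.keys() != copies_b.keys():
        raise ValueError("both isolates must be typed at the same loci")
    return sum(abs(copies_a[locus] - copies_b[locus]) for locus in copies_a)


# Placeholder locus labels and toy copy numbers for two isolates.
isolate_1 = {"L1": 10, "L2": 22, "L3": 35, "L4": 7, "L5": 4, "L6": 11, "L7": 2}
isolate_2 = {"L1": 12, "L2": 22, "L3": 30, "L4": 7, "L5": 5, "L6": 11, "L7": 2}

strd = summed_tandem_repeat_difference(isolate_1, isolate_2)
```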
An index of discrimination was calculated to compare the discriminating capacity of ribotyping and TRST. The discriminatory index was defined as the average probability of two consecutively sampled strains being characterized as different types. This probability depends on the number of strain types and their frequency distribution in the population. Discriminatory indices were calculated based on Simpson's index of diversity. Confidence intervals for discriminatory indices were determined as described previously. The concordance of two typing schemes was calculated based on the adjusted Rand's and Wallace's coefficients. While the Rand's coefficient allows a quantitative evaluation of the global congruence between two typing systems, the Wallace's coefficient compares the congruence of schemes depending on the directionality of typing by estimating the probability that a pair of isolates sharing the same type in system 1 also share the same type in system 2, and vice versa. Calculation of all parameters was performed with EpiCompare software, version 1.0 (Ridom GmbH, Würzburg, Germany).
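Simpson's index of diversity, in the form commonly used to compare typing methods, is D = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where N is the number of isolates and n_j the number of isolates assigned to type j. It gives the probability that two isolates drawn without replacement receive different types. The sketch below is a minimal illustration with toy labels; the function name is an assumption, and confidence intervals are not computed.

```python
from collections import Counter


def simpsons_index_of_diversity(type_labels):
    """Probability that two isolates drawn without replacement differ in type."""
    n = len(type_labels)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(type_labels).values()
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


# Toy type assignments (hypothetical, not data from this study).
toy_types = ["A", "A", "A", "B", "B", "C"]
d_toy = simpsons_index_of_diversity(toy_types)
```

An index of 1 means every isolate has its own type, and an index of 0 means all isolates share one type.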
The nucleotide diversity (π) and the ratio (Ka/Ks) of the average number of non-synonymous substitutions per non-synonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks) were calculated by using DnaSP, version 4.5.
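Nucleotide diversity can be illustrated as the average proportion of differing sites over all pairs of aligned sequences. The sketch below assumes gap-free sequences of equal length and uses toy data; it does not reproduce how DnaSP treats gaps or ambiguous bases, and it does not compute Ka/Ks.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average number of pairwise differences per site (gap-free alignment)."""
    if len(aligned_seqs) < 2:
        raise ValueError("at least two sequences are required")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(aligned_seqs, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(seq_1, seq_2)) for seq_1, seq_2 in pairs
    )
    return total_differences / (len(pairs) * length)


# Toy alignment (hypothetical sequences).
toy_alignment = ["ACGT", "ACGA", "TCGA"]
pi_toy = nucleotide_diversity(toy_alignment)
```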
We are grateful to all people that have contributed bacterial isolates to this study, particularly to M. Kist, T. Åkerlund, H. Rüssmann, and B. Bornhofen. We thank Wolfgang Witte for inspiring discussions and generous support. For excellent technical assistance we thank Heike Illiger, Annette Weller, and the staff at the sequencing unit of the Robert Koch Institute. This work was partially supported by a grant from the German Federal Ministry of Health.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.