Application of a target enrichment-based next-generation sequencing protocol for identification and sequence-based prediction of pneumococcal serotypes

Background The use of whole-genome sequencing in diagnostic microbiology, although feasible, is still limited by the associated expense and by the complex bioinformatics pipelines needed for data analysis. We describe the use of target enrichment-based next-generation sequencing for pneumococcal identification and serotyping, applied to the 23-valent polysaccharide vaccine serotypes, as an affordable alternative to whole-genome sequencing. Results Correct identification of Streptococcus pneumoniae and prediction of common vaccine serotypes (12 to serotype level and the rest to serogroup level) were achieved for all serotypes with >500 reads mapped against serotype sequences. A proportion-based criterion also enabled the identification of two serotypes present in the same sample, indicating that the method could be used to detect co-colonizing serotypes. The results obtained were comparable to, or an improvement on, the existing molecular serotyping methods for S. pneumoniae in relation to the polysaccharide vaccine serotypes. Conclusion We propose that this method has the potential to become an affordable and adaptable alternative to whole-genome sequencing for pneumococcal identification and serotyping.


Background
Identification and capsular serotyping of S. pneumoniae (pneumococcus) remain the cornerstone of both the diagnosis and the surveillance of pneumococcal disease and carriage in the presence of extended-spectrum pneumococcal conjugate vaccination. Currently available biochemical, serological and molecular methods are restricted by their throughput, turnaround time and limitations in multiplexing for multiple serotype detection [1,2].
Next Generation Sequencing (NGS) is a technique with the potential to fill this gap. Whole Genome Sequencing (WGS) on NGS platforms has been used for serotype prediction and for studying serotype changes in response to clinical interventions at research level [3,4]. In the diagnostic laboratory, WGS has been used to investigate potential outbreaks of drug-resistant pathogens [5]. However, the routine use of WGS with bench-top NGS remains limited by complex bioinformatics pipelines, computational capacities, relatively lower throughput per run without sample pooling, longer preparation times, and substantial costs.
Target enrichment followed by NGS has gained wider use in cancer studies and in the mutation analyses associated with congenital diseases [6,7]. The use of indexed sample preparation methods and the pooling of multiple samples lowers the costs further. The combined use of target enrichment and sample pooling is gaining popularity as an option to sequence only the targets of interest at a higher depth of coverage, enabling a more cost-effective use of sequencing reads; however, this strategy has hardly been explored in diagnostic microbiology. While the relatively small genome sizes in microbiology make the application of WGS attractive, adequate depth of sequences for interpretation may not be readily achievable. Target enrichment will enable a more cost-effective use of NGS in microbiology and overcome the issue of low copy numbers in the regions of interest. In this study, we explored a methodology based on target enrichment coupled with tagged sample pooling for the identification and serotyping of S. pneumoniae by using NGS. Criteria for the correct assignment of 23-valent polysaccharide vaccine (PPSV23) serotypes in relation to the total number of mapped reads per sample and the proportion of reads aligned to reference serotype sequences were also evaluated.

Bacterial isolates and DNA isolation
Twenty-four representative strains of S. pneumoniae of the PPSV 23 serotypes and serotype 6A, along with NCTC11189 Streptococcus sp. viridans type, obtained from the culture collection at the Department of Microbiology, The Chinese University of Hong Kong were used for the study. Isolates were grown overnight on blood agar plates. DNA was extracted by boil lysis, and the supernatant after centrifugation at 6,010 g for five minutes was used as the template. DNA quality and quantity of the neat and diluted samples were measured by the nanodrop method. In addition, multiple serotypes in varying proportions were also examined. Serotypes 19F and 3 were mixed in the proportion of 1:1, 1:10, 1:50, and 1:100 (1 = 75 ng/μl) to assess the feasibility of multiple serotype detection. DNA from NCTC11189 Streptococcus sp. viridans type was combined with that of serotype 19F at the proportions of 1:1, 1:10, 1:50, and 1:100 (1 = 75 ng/μl) to assess the feasibility of differentiating S. pneumoniae from members of the viridans group mimicking the nasopharyngeal niche.
Targeted enrichment of serotype specific regions and sample pooling
A multiplex PCR using 23 previously described primer pairs [8], capable of amplifying the capsular genes of the polysaccharide vaccine serotypes (PPSV 23) and 17 closely related types (PCR 1), was used for target enrichment. An 18-bp nucleotide (nt) adaptor (TCTATTGGGCTATGTCAC) was incorporated at the 5' end of each of the primers. The PCR-amplified fragments ranged from 259 to 413 bp in length.
A second multiplex PCR in addition to the abovementioned 23 primer pairs incorporated a pair of consensus primers (For-ATCGAACTCTTRCGYAATCTA and Rev-TCAAACTTRTCTTTTGGATAAGARC) targeting a sequence signature in the lytA gene capable of differentiating pneumococcus from the viridans group of streptococci (PCR 2).
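The consensus lytA primers contain the IUPAC ambiguity codes R (A or G) and Y (C or T), so each one represents a small pool of oligonucleotides. The following is a minimal sketch, not part of the published protocol, that enumerates those pools. Prepending the 18-nt adaptor to the lytA primers is an assumption made here for illustration; the text states this explicitly only for the 23 serotype primer pairs.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes (a subset is enough for these primers).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# Primer and adaptor sequences as given in the text.
LYTA_FORWARD = "ATCGAACTCTTRCGYAATCTA"
LYTA_REVERSE = "TCAAACTTRTCTTTTGGATAAGARC"
ADAPTOR = "TCTATTGGGCTATGTCAC"

forward_variants = expand_degenerate(LYTA_FORWARD)
reverse_variants = expand_degenerate(LYTA_REVERSE)

# Assumption: the lytA primers carry the same 5' adaptor as the serotype primers.
tailed_forward = [ADAPTOR + seq for seq in forward_variants]

if __name__ == "__main__":
    for seq in forward_variants:
        print(seq)
```

Each primer has two degenerate positions, so each expands to four concrete sequences.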
Thermodynamic properties of the resultant primer-adaptor combinations were evaluated by using the OligoAnalyzer 3.1 software (http://sg.idtdna.com/analyzer/applications/oligoanalyzer/). Eight unique 5-nt indexes were selected from the ready-made index sequences available at http://cloud.github.com/downloads/fairclothlab/edittag/edit_metric_tags.txt [9]. One of these 5-nt indexes was attached to the amplicons from each sample by way of a modified step-out PCR (MSO-PCR), so that the samples could be pooled prior to library preparation [10]. The primer used for MSO-PCR had the same sequence as the 18-nt adaptor, with a unique 5-nt index at the 5' end. PCR 1 and PCR 2 were performed by using the Platinum® multiplex PCR kit (ABI by Life Technologies) with 1 μl of DNA extract as a template in a total reaction volume of 10 μl. The MSO-PCR had a total volume of 50 μl with 4 μl of PCR 1 or PCR 2 amplicons as templates and used Acti Taq DNA polymerase (Roche Diagnostics) for amplification. Thermocycling parameters were in accordance with the respective manufacturers' instructions, with annealing temperatures of 60°C and 53°C for 30 and 15 cycles, respectively.
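As an illustration of how an MSO-PCR primer is assembled and why well-separated indexes matter, the sketch below builds index-plus-adaptor primers and reports the smallest pairwise distance within a set of indexes. The eight indexes listed are hypothetical placeholders, not the ones used in the study, and Hamming distance is used here as a simplification of the edit-distance metric behind the cited tag list.

```python
from itertools import combinations

ADAPTOR = "TCTATTGGGCTATGTCAC"  # 18-nt adaptor from the text

# Hypothetical 5-nt indexes, for illustration only.
EXAMPLE_INDEXES = ["AACCA", "ACGTC", "CTTGG", "GAGAT",
                   "TGCAA", "CCATT", "GTTCG", "TAGGC"]

def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(tags):
    """Smallest Hamming distance over all pairs of tags."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))

def mso_primer(index, adaptor=ADAPTOR):
    """MSO-PCR primer: unique index at the 5' end, followed by the adaptor."""
    return index + adaptor

if __name__ == "__main__":
    print("minimum pairwise distance:", min_pairwise_distance(EXAMPLE_INDEXES))
    for index in EXAMPLE_INDEXES:
        print(index, mso_primer(index))
```

A larger minimum distance means more sequencing errors are needed before one index can be misread as another.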
Sequencing and sequence analysis
PCR amplicons were purified by using a QIAquick PCR product purification kit (Qiagen) and quantified by using Qubit (Life Technologies). Eight samples tagged with unique indexes were pooled in equal quantities to create a single sample for library preparation. Library preparation was performed by using the TruSeq DNA library preparation kit V2 (Illumina) without fragmentation. Sequencing was performed on a MiSeq bench-top sequencer (Illumina) with 2 × 150 bp methodology. Paired-end sequencing data from the MiSeq Reporter software were further analyzed off instrument. Quality filtering of the paired-end data, de-multiplexing and trimming were performed by using the FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) prior to mapping against reference sequences using Bowtie 2 [11]. The reference amplicon sequences for data alignment were generated from the accession numbers cited by Kong et al. [8] for serotype-specific capsular gene sequences and from [GenBank: AJ419979 and GenBank: AJ244307] for the atypical and typical pneumococcal lytA gene sequences, respectively.
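The de-multiplexing and trimming step can be pictured with the following minimal sketch. It is a plain-Python stand-in for illustration, not the FASTX toolkit workflow used in the study, and it assumes that each read begins with the 5-nt index followed immediately by the 18-nt adaptor and that matching is exact. The sample names and indexes in the usage example are hypothetical.

```python
ADAPTOR = "TCTATTGGGCTATGTCAC"  # 18-nt adaptor from the text

def demultiplex(reads, index_to_sample, adaptor=ADAPTOR):
    """Bin reads by their leading index and strip index + adaptor.

    Reads whose prefix is not a known index, or whose index is not
    followed by the adaptor, are returned separately as unassigned.
    """
    index_len = len(next(iter(index_to_sample)))
    bins = {sample: [] for sample in index_to_sample.values()}
    unassigned = []
    for read in reads:
        sample = index_to_sample.get(read[:index_len])
        if sample is None or not read.startswith(adaptor, index_len):
            unassigned.append(read)
            continue
        bins[sample].append(read[index_len + len(adaptor):])
    return bins, unassigned

if __name__ == "__main__":
    # Hypothetical indexes and sample names.
    index_to_sample = {"AACCA": "sample_1", "TGCAA": "sample_2"}
    reads = [
        "AACCA" + ADAPTOR + "GATTACA",
        "TGCAA" + ADAPTOR + "CCCC",
        "GGGGG" + ADAPTOR + "TTTT",  # unknown index
        "AACCA" + "ACGTACGT",        # index present, adaptor missing
    ]
    bins, unassigned = demultiplex(reads, index_to_sample)
    print(bins)
    print(len(unassigned), "unassigned reads")
```

Requiring the adaptor after the index is a design choice in this sketch; it discards reads in which a matching 5-nt prefix occurs by chance.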

Data analysis
The experiments with multiple serotypes were designed to establish cut-off limits that would enable the identification of one or more serotypes from a sweep of colonies while excluding the possibility of false positive serotype allocation due to sample pooling. Thus, the percentage of reads assigned to each serotype out of the total number of mapped reads per given sample was calculated. The aim of PCR 2 that incorporated the lytA primers was to establish cut-off limits to identify pneumococci from previously unidentified α-haemolytic streptococci and to assign predicted serotypes. The percentage of reads assigned to the typical or atypical lytA genes out of the total mapped reads per sample and the percentage of the reads assigned to each serotype out of the total reads mapped against serotype sequences for the given sample were calculated. The percentages of correct serotype identification by using various criteria, including the proportion of mapped reads and a defined minimum number of total mapped reads, were examined.
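The two denominators described above can be made concrete with a short sketch. It assumes that each mapped read has already been reduced to the name of the reference it aligned to; the labels `lytA_typical` and `lytA_atypical` and the function name are hypothetical and chosen only for this example.

```python
from collections import Counter

LYTA_TARGETS = ("lytA_typical", "lytA_atypical")  # hypothetical reference labels

def read_percentages(mapped_targets, lyta_targets=LYTA_TARGETS):
    """Summarise one sample's mapped reads.

    Returns (serotype_total, serotype_pct, lyta_pct), where
    - lyta_pct uses ALL mapped reads in the sample as the denominator, and
    - serotype_pct uses only reads mapped to serotype sequences.
    """
    counts = Counter(mapped_targets)
    total = sum(counts.values())
    serotype_counts = {k: v for k, v in counts.items() if k not in lyta_targets}
    serotype_total = sum(serotype_counts.values())

    lyta_pct = {}
    if total:
        lyta_pct = {name: 100.0 * counts[name] / total for name in lyta_targets}
    serotype_pct = {}
    if serotype_total:
        serotype_pct = {k: 100.0 * v / serotype_total
                        for k, v in serotype_counts.items()}
    return serotype_total, serotype_pct, lyta_pct

if __name__ == "__main__":
    # Made-up read assignments, for illustration only.
    targets = ["19F"] * 80 + ["3"] * 20 + ["lytA_typical"] * 100
    print(read_percentages(targets))
```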

Results and discussion
The total numbers of mapped reads in the samples with a single serotype in the two reactions (PCR 1 and PCR 2) ranged from 169 to 4,987, with a median of 2,162 (IQR 1,059-3,691) reads, and from 128 to 5,238, with a median of 1,658 (IQR 807-2,550) reads, respectively. Considering all samples, the mean percentages of reads assigned to the correct serotype per sample were 80.5% (95% CI 72.0-89.0%) and 80.7% (95% CI 72.9-88.5%) in the two PCR reactions, respectively. All serotypes but one were correctly assigned when the percentage of reads matched to a single serotype exceeded 70% of all the reads in the sample.
The percentages of correctly identified pneumococcal serotypes based on various cut-off limits are listed in Table 1. The first criterion for serotype allocation was defined by the presence of >15% of the total reads mapped against serotype sequences per sample. The percentages of correct serotype identification based on this criterion were 87.5% and 91.7% for PCR 1 and PCR 2, respectively. The cut-off of >15% of reads was derived from the experimental evaluation of multiple serotypes (please refer to paragraph 3 of Results and discussion). The percentage of correct identification increased when more stringent criteria were added in defining the cut-off, and it reached 100% when only the samples with >500 mapped reads were considered for serotype allocation. The interpretation of results for individual samples based on each of the three criteria, namely >15% mapped reads (criterion 1), with the addition of >200 (criterion 2) or >500 (criterion 3) reads mapped against the serotype sequences, is listed in Tables 2 and 3. Considering only criterion 1, in all five instances where serotypes were incorrectly assigned, the total number of reads mapped to the serotype sequences was <500. In PCR 1, both samples with aberrant results had a total number of mapped reads <200. For PCR 2, the correct identification of a single serotype was possible in 21/24 (87.5%) of all samples, 21/23 (91.3%) of samples with >200 reads, and 19/19 (100%) of samples with >500 reads (Table 1). Thus, a minimum of >500 reads mapped against serotype sequences was adopted as the criterion to filter samples suitable for serotype allocation.
Table 4 shows the results of individual samples containing two bacterial isolates tested for pneumococcal identification and/or serotype assignment. These results from mixed serotypes were used to establish a minimum percentage of reads as a cut-off for allocating serotypes, which enabled the identification of a second serotype. The number of reads mapped against a given serotype was not proportional to the original ratio of input DNA. The total number of reads mapped against serotype 3 was in general lower than that of serotype 19F. The mean proportion of reads mapped against serotype 3 was 18.4% (95% CI 16.0-20.8%) for PCR 1 (Table 4a) and 18.2% (95% CI 15.6-20.8%) for PCR 2 (Table 4b). A cut-off of 15%, based on the lower limit of total reads mapped to a particular serotype sequence, was used for the assignment of a serotype. Accordingly, a result would not be analyzed where the cut-off or coverage threshold was not achieved [12]. Thus, we propose that samples with >500 reads assigned to serotypes should be considered for further serotype allocation and that any serotype with >15% of the reads assigned should be considered as present within the sample. Applying both criteria, 22 samples tested in PCR 1 and 19 samples in PCR 2 were eligible for further analysis, and all these samples were allocated serotypes correctly. PCR 2 enabled the confirmation of pneumococcal identification as well as the detection of viridans group streptococci. Although the proportions of viridans group streptococci and S. pneumoniae were not quantifiable, it is possible to apply this method to identify and serotype pneumococci from non-purified primary cultures of α-haemolytic colonies.
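The proposed two-step rule (more than 500 reads mapped to serotype sequences, then more than 15% of those reads for each serotype called) can be sketched as follows. The function name, its default parameters and the read counts in the usage example are illustrative assumptions, not the study's analysis script or data.

```python
def call_serotypes(serotype_counts, min_total_reads=500, min_percent=15.0):
    """Apply the read-depth filter and the proportion cut-off to one sample.

    serotype_counts maps serotype (or serogroup) name to the number of reads
    mapped to its reference sequence. Returns None when the sample does not
    exceed the depth threshold (not eligible for typing), otherwise a sorted
    list of serotypes whose share of reads exceeds min_percent.
    """
    total = sum(serotype_counts.values())
    if total <= min_total_reads:
        return None
    return sorted(
        name for name, n in serotype_counts.items()
        if 100.0 * n / total > min_percent
    )

if __name__ == "__main__":
    # Made-up read counts, for illustration only.
    print(call_serotypes({"19F": 820, "3": 180}))  # mixed sample, 18% minor type
    print(call_serotypes({"14": 950, "9V": 50}))   # minor signal below cut-off
    print(call_serotypes({"19F": 150, "3": 19}))   # too few reads to type
```

Returning `None` for shallow samples keeps "not eligible for typing" distinct from "eligible, but no serotype above the cut-off", which would be an empty list.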
The inability of the method to quantify the proportion of serotypes or bacterial species within a sample could be due to amplification bias in the PCR efficiency of the primers and the differences in the target sequences. However, a second serotype could still be qualitatively identified.
The method used for quantification of DNA prior to pooling the samples could be made more specific by using a fluorescence-based quantification method, and this may help to improve the results pertaining to quantitative identification. However, as the measurement of DNA quality excluded possible protein contamination and as the outcome was measured qualitatively rather than quantitatively, the impact on the final results was negligible.
The pooling of samples with the addition of unique nucleotide tags helped us to reduce the associated cost. However, this also introduces the possibility of false positive identification of a second serotype due to chimeric amplicon generation. Thus, to achieve a balance between detecting co-colonization and sample pooling, the cut-off for defining serotypes needs to be fine-tuned for the given number of samples pooled, and this could therefore be considered a limitation of the method. Where the presence of a second serotype is a possibility, the cut-off determination needs to consider the possibility of including a false positive second serotype with a lower threshold and the false negative exclusion of a minor serotype with a higher cut-off limit.
If the method is to be applied to identified pneumococcal isolates where a single serotype is expected, the serotype with the highest proportion of reads mapped would be the only one found in the given sample. The use of PCR 1 for target enrichment is sufficient when the objective is to identify the serotype of an identified pneumococcal isolate. The use of PCR 2 for target enrichment is recommended when the template is DNA from primary cultures or patient samples, as it identifies the presence of pneumococci by the characteristic lytA sequences. This would enable the identification of samples containing serotypes not included in the current PCR and samples with abnormal non-pneumococcal isolates harboring capsular sequences.
The protocol could potentially be extended to include more serotypes and direct detection from clinical samples. The validation process could be improved by evaluating the performance of different indexes and by establishing the maximum plexity achievable for sample pooling. The whole process currently takes about 50 hours to complete; however, the method is amenable to automation, and the recently released TruSeq Nano (Illumina/USA) commercial test has the potential to reduce the time required. Even with the current turn-around time, this protocol is appropriate where a larger number of samples are to be tested or if applied to direct sample testing. The analysis process could be further simplified by using the Custom Amplicon Workflow of MiSeq Reporter with an alternative amplicon manifest, in which the in-house index-adaptor-primer sequences substitute for the upstream and downstream probe sequences and a pseudo-genome created from all possible amplicon-adaptor-index combinations is used instead of the reference genome.
As proof of concept, we demonstrated that target enrichment-based NGS could be applied with results similar to those of conventional molecular serotyping methods. The serotype resolution for the PPSV 23 capsular types from this method is similar to that of the triplex RT-PCR method recently recommended by CDC, Atlanta, USA (http://www.cdc.gov/ncidod/biotech/strep/pcr.htm, accessed on 01/06/2013). However, the latter method does not include serotypes 10A/F and 15B/C, and it requires seven sequential PCRs. A sequetyping method described by Leung et al. using a pair of consensus primers, although simpler in terms of the number of primers used, does not achieve the same serotype grouping pattern as the one in the current protocol; furthermore, the success rate for samples of PPSV 23 serotypes was lower at 86% [13]. All molecular methods, including the current method, have the drawback of not being able to differentiate some very closely related serotypes.
A number of comparable molecular methodologies with the capability of detecting colonization by multiple serotypes have recently been described [14][15][16][17]. Direct clinical samples and sweeps of colonies from primary plates have been employed as the templates for these methods. The sequential application of multiplex PCR has been shown to detect multiple serotypes from nasopharyngeal secretions [14]. The currently widely used CDC-recommended multiplex PCR is capable of identifying 40 different sero-identities by eight multiplexes, while in the current method a single PCR is capable of enriching the 23 polysaccharide capsular vaccine serotypes, at least to related serogroups, in a single reaction. There is potential to expand the number of sero-identities in the current method because the resolution is not limited by the need for bands with detectable differences in length. Microarray-based serotyping methods have been used successfully to identify and type pneumococcal serotypes [15][16][17]. However, microarray technology remains expensive and needs complex interpretation, whereas bench-top NGS technology is becoming more user-friendly [1]. PCR followed by capillary electrophoresis and PCR followed by ionization mass spectrometry, both with the capability of detecting multiple colonization, have also been described [18,19]. However, NGS with its competitive market is becoming more affordable, and to the best of our knowledge this is the first instance in which target enrichment coupled to NGS has been used to develop an assay to identify and serotype S. pneumoniae. The routine use of WGS in a microbiology laboratory using the same platform has been evaluated recently [20]. However, the number of samples to be pooled together is limited. Our results demonstrate that target enrichment-based NGS could be used for the identification and sequence-based serotype prediction of pneumococcus.
Proprietary chemistry-based methods for target enrichment are not widely available for microbiological applications, and the kits remain expensive. However, the availability of Taq polymerases with increased multiplexing capabilities and the elimination of the need for gel-based differentiation of bands increase the possibility of using multiplex PCR for target enrichment. In addition to using this method for typing organisms, syndrome-based identification panels could also be developed by using a similar methodology.
Table footnotes: Denominator: total number of reads mapped against serotype/group-specific sequences. *Criterion 1: any serotype with >15% of sequence reads mapped is assigned to the sample. **Criterion 2: samples with >200 reads are considered for further typing AND fulfil criterion 1. ***Criterion 3: samples with >500 reads are considered for further typing AND fulfil criterion 1.

Conclusions
We demonstrated the feasibility of using a custom enrichment-based sequencing methodology for S. pneumoniae identification and serotyping. A multiplex PCR containing primers for serotype/group-specific regions of the 23-valent polysaccharide vaccine serotypes was used for target enrichment, followed by NGS on a MiSeq platform, to serotype pneumococci successfully at cut-off read levels defined during the study. A second multiplex PCR containing an additional pair of primers that helps to identify pneumococci was also successfully applied for target enrichment. The first enrichment PCR could be used to serotype identified pneumococcal isolates, while the second could be used on primary cultures and direct samples. The principle of using simple multiplex PCR for target enrichment followed by NGS could be adapted for syndrome-based diagnosis and typing using either bacterial isolates or patient samples.

Availability of supporting data
Not available.