Whole genome single nucleotide polymorphism based phylogeny of Francisella tularensis and its application to the development of a strain typing assay

Background A low genetic diversity in Francisella tularensis has been documented. Current DNA based genotyping methods for typing F. tularensis offer a limited and varying degree of subspecies, clade and strain level discrimination power. Whole genome sequencing is the most accurate and reliable method to identify, type and determine phylogenetic relationships among strains of a species. However, lower cost typing schemes are necessary in order to enable typing of hundreds or even thousands of isolates. Results We have generated a high-resolution phylogenetic tree from 40 Francisella isolates, including 13 F. tularensis subspecies holarctica (type B) strains, 26 F. tularensis subsp. tularensis (type A) strains and a single F. novicida strain. The tree was generated from global multi-strain single nucleotide polymorphism (SNP) data collected using a set of six Affymetrix GeneChip® resequencing arrays with the non-repetitive portion of LVS (type B) as the reference sequence complemented with unique sequences of SCHU S4 (type A). Global SNP based phylogenetic clustering was able to resolve all non-related strains. The phylogenetic tree was used to guide the selection of informative SNPs specific to major nodes in the tree for development of a genotyping assay for identification of F. tularensis subspecies and clades. We designed and validated an assay that uses these SNPs to accurately genotype 39 additional F. tularensis strains as type A (A1, A2, A1a or A1b) or type B (B1 or B2). Conclusion Whole-genome SNP based clustering was shown to accurately identify SNPs for differentiation of F. tularensis subspecies and clades, emphasizing the potential power and utility of this methodology for selecting SNPs for typing of F. tularensis to the strain level. Additionally, whole genome sequence based SNP information gained from a representative population of strains may be used to perform evolutionary or phylogenetic comparisons of strains, or selection of unique strains for whole-genome sequencing projects.


Background
Francisella tularensis, a Gram-negative bacterium, is the causative agent of tularemia and a Category A select agent. F. tularensis is divided into three subspecies (subsp.): tularensis (type A); holarctica (type B); and mediasiatica [1,2]. Tularemia caused by type A strains occurs only in North America, whereas tularemia caused by type B strains occurs throughout the northern hemisphere. Together these two species account for the majority of cases of tularemia worldwide. F. tularensis subsp. mediasiatica includes strains predominant in central Asia [3]. F. novicida has been suggested to be a subspecies of F. tularensis based on genetic similarity [4,5], but is still formally recognized as a distinct species. F. novicida has been isolated from North America and Australia, and rarely causes human disease even though it can cause a lethal infection in the murine model of disease [3,6].
Current DNA based genotyping methods for typing F. tularensis offer a varying degree of power to discriminate subspecies, clades and strains [2,7,8]. Two clades, A1 and A2, within F. tularensis subsp. tularensis have been reported based on multiple subtyping methods including multi-locus variable number tandem repeat analysis (MLVA), pulsed-field gel electrophoresis (PFGE) and multi-locus sequence typing [2]. Clades within A1, A1a and A1b, have been identified by PFGE [9]. A limited degree of variation has been observed within type B strains by all methods. MLVA currently provides the highest degree of strain discrimination for F. tularensis, however it is limited in its ability to perform evolutionary analyses and to estimate relationships among very closely related strains [10].
Development of high-resolution genotyping methods for F. tularensis can ideally be met by whole genome sequencing of multiple strains. Whole genome sequencing is the most accurate and reliable method to identify and discriminate strains of a species, especially those species with a high degree of genome homogeneity. Genomic sequence information of several type A and B strains is now available http://www.ncbi.nlm.nih.gov/sites/ent rez?db=genomeprj&orig_db=&term=Fran cis-ella%20tularensis&cmd=Search. F. tularensis has a single circular chromosome with genome size of ~1.89 Mb. Naturally occurring plasmids have not been reported for F. tularensis strains so far. A low genetic diversity in F. tularensis has been documented. Based on whole genome sequencing, the genetic variation between the type B live vaccine strain (LVS) and two other type B strains, FSC200 and OSU18, is only 0.08% and 0.11% respectively. F. tularensis subsp. holarctica strain FSC200 is a virulent strain of European origin whereas F. tularensis subsp. holarctica strain OSU18 is a virulent strain isolated in the United States. A higher genetic variation of 0.7% has been reported between a type B (LVS) and type A (SCHU S4) strain [11]. Global single nucleotide polymorphism (SNP) information, based on whole genome sequencing, offers several advantages over existing typing methods because each individual nucleotide may be a useful genetic character. The cumulative differences in two or more sequences provide a larger number of discriminators that can be used to genotype and distinguish bacterial strains. Strain genotypes that are built upon SNP variation are highly amenable to evolutionary reconstruction and can be readily analyzed in a phylogenetic and population genetic context to: i) assign unknown strains into wellcharacterized clusters; ii) reveal closely related siblings of a particular strain; and iii) examine the prevalence of a specific allele in a population of closely related strains that may in turn correlate with phenotypic features of the infectious agent [12]. SNPs also provide potential markers for the purpose of strain identification important for forensic and epidemiological investigations.
Previously, we reported an Affymetrix GeneChip ® based approach for whole genome F. tularensis resequencing and global SNP determination [13]. We now report the whole genome sequence and global SNP data from 40 Francisella strains using this approach. Twenty six F. tularensis type A (20 A1 and 6 A2), thirteen F. tularensis type B and one F. novicida strain were used for phylogenetic SNP analysis and identification of high-quality SNPs for use as typing markers. Based on our global analysis of 40 genomes, we were able to identify a series of SNPs at various levels of hierarchy. We used these SNPs to develop and validate a low-cost PCR-based assay for typing and discriminating F. tularensis isolates.

F. tularensis custom resequencing array set
The basis of the Affymetrix GeneChip ® resequencing by hybridization and the details of the design of F. tularensis GeneChip ® set has been described earlier [13]. Briefly, the design is primarily on the basis of the DNA sequence of strain LVS (GenBank Accession: AM 233362) serving as a reference and complemented with unique sequences of SCHU S4 (GenBank Accession: AJ 749949). A total of 1,764,558 queryable bases were identified for resequencing by hybridization after exclusion of ~9.22% of repetitive sequence from the design. This sequence was tiled onto a set of six CustomSeq 300 K GeneChips ® by Affymetrix, Inc., (Santa Clara, CA). This design provides approximately 91% of the F. tularensis double stranded genome sequence information from holarctica (type B) and tularensis (type A) subspecies. The whole genome resequencing was performed in duplicate for all query strains.

Whole genome amplification, resequencing assay and raw data acquisition
Francisella genomic DNA amplification, DNA fragmentation, labeling, hybridization and acquisition of raw data was carried out exactly as described earlier [13].

Processing of raw data with bioinformatic filters
Hybridization of a whole-genome sample on an Affymetrix ® resequencing array platform can lead to incorrect basecalls due to a number of systematic effects that are less prevalent when the sample consists of a purified PCR product. We have developed bioinformatic filters to account for most of these predictable adverse effects. Our bioinformatic filters consist of a set of Perl scripts that operate on the CHP files generated by GSEQ software and produce a list of high-confidence SNP calls from the larger raw set of SNPs calls present in those files. The scripts are available for download from our website http:// pfgrc.jcvi.org/index.php/compare_genomics/ snp_scripts.html. Each filter serves to reduce the number of candidate SNPs. The output of one filtering step becomes the input for the next. The detailed descriptions of these filters have been reported [13].
Briefly, the quality filter implemented in GSEQ software initially eliminates SNP calls that have been assigned low quality scores based on the difference in signal intensity between the highest intensity probe pair and the next highest intensity pair at a particular locus. The first filter applied is the "low homology filter" which identified regions that performed poorly as a result of deletions in the sample relative to the reference sequence.   no-calls and SNP calls. SNP calls that occur within the defined low homology region are removed from the list of high-confidence SNP calls. The next script is referred to as the alternate homology filter. The alternate homology effect is caused by the sequences in the query DNA sample capable of hybridizing with high efficiency to more than one probe pair at a locus on the array. When a locus contains two strongly hybridizing probe pairs, the GSEQ software may make a SNP call, a reference base call or a nocall ("N"), depending on the relative signal strengths of the probe pairs. The alternate homology filter identifies SNP calls that may have arisen as a result of this effect based on the difference in binding energy between the alternate (SNP) sequence and the reference sequence. If the difference between these two binding energies is = 11.5 kcal/mol, the SNP call is assumed to be an artifact of the alternate sequence homology, and it is removed from the list of high confidence SNP calls. The remaining SNP calls are then put through the footprint effect filter. The artifact called the footprint effect is caused by the occurrence of a real SNP in a query sample that results in a destabilizing effect on 25-mers in the immediate vicinity of the SNP. The footprint effect filter algorithm assumes that a genuine SNP is most likely to cause spurious SNP calls at locations within 10 bases on either side of the genuine SNP. Any SNP call that occurs more than 10 base positions from the nearest neighboring SNP call is assumed to be valid, and any SNP call that has one or more neighbors within 10 base positions is subjected to the filter. Since any number of consecutive SNP calls within 10 base positions of each other may occur in the data, this filter is implemented as a recursive algorithm.
For each list of consecutive SNP calls that each lies within 10 bases of its neighbors, the algorithm identifies the SNP call having the highest quality score. That SNP call is accepted as valid, and its immediate neighbors are removed from the list of high confidence SNP calls. This action may break the original list of neighboring SNP calls into two separate lists. All resulting lists are processed recursively in the same way, until all of the SNP calls have been accepted or rejected. This algorithm is implemented in the RemoveFootprintEffect.pl Perl program. All the above filters are applied to individual data sets generated for any sample, following which a final filter referred to as the replicate combination filter is applied. The replicate combination filter generates the list of common SNPs present in both the experiments.

Phylogenetic clustering, selection of SNP markers and PCR primer design from multistrain global Francisella SNP collection
We generated a phylogenetic tree from the resequencing data by considering only those locations at which a SNP occurred in one or more of the forty strains. For each strain, we constructed a sequence containing the base calls at each of the locations at which a SNP was found in some strain(s). This resulted in forty sequences, each containing 19,897 base calls (including no-calls) which were used for the phylogenetic analysis. The phylogenetic tree was generated using the MrBayes program, version 3.1.2 [15][16][17].
The program was run for 200,000 generations, using a haploid model. The root of the resulting tree was inferred by midpoint rooting. The resulting tree is reported as a cladogram and as a phylogram. A phylogenetic tree (Addi-  tional File 1) was also generated from the same data using the dnaml (maximum likelihood) program of the PHYLIP package version 3.6 [18].
Node pairings which discriminated between subspecies or clades were selected for the development of diagnostic typing assays. Criteria used to select SNP locations for the assay were: 1. The SNP location must cleanly differentiate the two nodes of interest. Within each of the nodes, all of the member strains must share the same base call at the location, and the two nodes must differ at the location.
2. The sequences downstream of the SNP location must be in sufficient agreement among all strains from both nodes so that an appropriate primer can be chosen from the consensus sequence (the consensus at the primer location may not contain "N" calls or any conflicting base calls).
3. The primer sequences must have melting temperatures within a specific limited range (60°C to 70°C).
4. The predicted PCR product size must be within the range 150 to 500 bp.
We developed a set of programs to identify candidate SNP locations for the real-time PCR (RT-PCR) assay. SNPTree uses the phylogenetic tree and the multi-FASTA files from the resequencing experiments as input, assigns arbitrary node numbers to all nodes in the tree, and produces a set of multi-FASTA files, one for each node in the tree, of the consensus base calls for each node. The consensus call is "N" unless all members of a particular node share the same base call at that location. The program also produces a set of files, one for each node, listing the base calls that occur at every SNP location, for all SNP positions detected within the entire set of 40 samples (19,897 locations). The program CompareNodes uses the SNP list files for any two nodes and produces a list of SNP locations that cleanly differentiate the two nodes (described above). The program CreatePrimer3 uses a list of discriminating SNP locations and the multi-FASTA files for two nodes, and creates an input file for the Primer3 program [19]. CreatePrimer3 also chooses the 5'-forward primers, which are constrained by the locations of the SNPs. The Primer3 software [19] is then used to identify appropriate 3'reverse primers. The Primer3 program enforces the last three criteria listed above. This process resulted in the design of a large number of primers for candidate SNP locations for most node pairs that may be used as diagnostic markers. The final set of SNP markers/locations we used was selected manually by identifying primers distrib-uted over the entire genome. The programs SNPTree, CompareNodes and CreatePrimer3 were developed at the J. Craig Venter Institute specifically for this study and are freely available for download ftp://ftp.jcvi.org/pub/soft ware/pfgrc/SNPTree/SNPTreePackage.tar.gz. These programs along with our bioinformatic filter pipeline can easily be adapted for other bacterial model systems for whole genome resequencing and SNP phylogeny using the Affymetrix resequencing array platform.
Primer3 software was used to design discriminating PCR primers based on the set of discriminating locations identified. Three primers were designed at each discriminating location: a 5'-forward primer with the node X call in the 3' position; a 5'-forward primer with the node Y call in the 3' position; and a single 3'-reverse primer. A base call at the discriminating location is determined by two PCR reactions where one of the two yields a lower cycle threshold (Ct) value. The RT-PCR primers used are shown in Additional File 2.  [13]. We used this approach to collect whole-genome sequence and global SNP information from 40 Francisella strains. Table 1 shows the list of strains analyzed in this study. Twenty six type A (20 A1 and 6 A2), thirteen type B and one F. novicida strain were resequenced.

Real-time PCR assays for F. tularensis typing
The base call rate and number of SNPs for F. tularensis A1, A2 and type B strains are shown in Figure 1 [20,21]. A1 strains showed the highest number of SNPs when compared to the reference sequence with a range of 5929 to 6543 whereas A2 strains displayed a range of 4732 to 5469 SNPs. The average number of SNPs for A1 strains was 6362 ± 161 and 5096 ± 281 for A2 strains.

Whole genome phylogenetic clustering of strains and SNP analysis
The cladogram and phylogram generated from the wholegenome resequence SNP data of all 40 Francisella strains is shown in Figure 2. Phylogenetic analysis revealed distinct clustering of the strains into the two subspecies, type A and type B, with further separation of strains within clusters. F. novicida (FRAN003) was distinct from type A and type B and formed its own phylogenetic group. Nodes (including internal nodes and leaf nodes) of the phylogenetic tree were assigned numbers by the SNPTree program. All type A strains emerged from node 4, whereas all type B strains emerged from node 50. The type A strains were divided into two primary sub-nodes, node 39 and node 5, corresponding to clades A2 and A1 respectively. A1 strains further subdivided into node 8, node 23, and node 5, corresponding to clades A1a and A1b and the MA00-2987 strain, respectively (Table 1). SCHU S4, the laboratory type A strain, fell within the A1a clade (node 8). Type B strains also divided into two clades based on nodes 52 and 64; these clades are referred to here as B1 and B2, respectively. The Japanese holarctica isolate FRAN024 formed its own phylogenetic group. Subsections of the phylogenetic tree at higher resolution, representing the type A1 (excluding MA00-2987), A2 and B strains (excluding FRAN024) are shown in Figure 3.
Whole genome resequencing and SNP profiles of F. tularensis strains Figure 1 Whole genome resequencing and SNP profiles of F. tularensis strains. (A) Whole genome resequencing call rates and (B) single nucleotide polymorphic profiles of 39 F. tularensis type A and B strains. The data is an average of sample analysis performed in duplicate. The filtered base call rate and the filtered SNP values were obtained by processing the raw data from Affymetrix software through our bioinformatic filters [13]. Strains are displayed as either A1, A2 or type B for comparative analysis. F. tularensis subsp. novicida (FRAN003) displayed an average filtered base call rate of 83.041% and 12407 filtered SNPs (data not shown).

A B
The differentiating SNPs and maximum SNP separation numbers are indicators of the diversity within each node, as these represent SNP differences between members of the node (rather than SNP differences relative to the reference genome). The differentiating SNPs are the number of locations at which two or more member strains have differing base calls. Maximum SNP separation is the maximum number of SNP differences that are found between any two members of the node. As expected, the SNP diversity is greatest within subspecies (type A and type B) and decreases within clades; B1, A1a and A1b strains showed the least diversity (maximum SNP separation of 76, 75 and 38, respectively). Typing methods have previously revealed less diversity within type B than type A strains [2,[21][22][23]. Similarly, our data show less diversity among type B isolates, with a maximum SNP separation of 602 when the Japanese holarctica strain FRAN024 is excluded from this analysis (B*). However, when all type B isolates, including the Japanese holarctica strain FRAN024, are included in the analysis, our data indicates a similar level Whole genome SNP based phylogenetic analysis of Francisella strains Figure 2 Whole genome SNP based phylogenetic analysis of Francisella strains. Phylogenetic analysis of resequenced Francisella strains. The whole-genome resequencing data was pared down to those base positions at which a SNP call occurred in one or more of the forty strains. These sequences were used to generate a phylogenetic tree using the MrBayes program as described in methods. This tree was then displayed as a cladogram (A) and as a phylogram (B) using the TreeView program http://taxonomy.zoology.gla.ac.uk/rod/treeview.html. Distinct clustering of type A and type B strains was observed. Both type A and B strains were further discriminated within the clusters. In the cladogram, the percentage values on the branches are the probabilities of the partitions indicated by each branch. The numbers shown in red are node numbers of significant nodes that are referenced in the manuscript. In the phylogram, the branch lengths are proportional to the mean of the posterior probability density, and a scale bar is given to relate the branch lengths to their numeric values.
of diversity for types A and B (maximum SNP separation of 2779 and 2833, respectively).
The presence of a large number of differentiating SNPs within each phylogenetic node suggests that a deeper level of discrimination can be achieved by identifying SNPs unique to individual strains. The smallest number of differentiating SNPs within a phylogenetic node was 71 (A1b strains). The phylogram ( Figure 2B) indicates that the closest clade pairings are between A1a/A1b and B1/B2 which is quantitatively in agreement with the SNP differences as shown in Additional File 4. Phylogenetic analyses performed by two independent approaches (Bayesian in Figure 2 and maximum likelihood in Additional File 1) showed some differences only at the level of minor clades in the trees. These did not affect the subsequent analyses.

Typing assays based on high quality global SNP markers
Node pairings that discriminated between F. tularensis subspecies or within subspecies were selected for the development of SNP diagnostic typing assays (Figure 2). The four node pairings were node 4 and node 50, node 52 and node 64, node 39 and node 5, and node 8 and node 23 for discrimination of type A vs. type B, B1 vs. B2, A2 vs. A1 and A1a vs. A1b, respectively. A SNP location was selected to differentiate between two nodes in the tree when all strains belonging to one node contain the SNP call and all strains belonging to the other node contain the reference call at that location. The location of the 32 in silico identified diagnostic SNP markers in the F. tularensis LVS genome are shown in Figure 4. Fourteen SNP loci were in the forward strand, sixteen in the reverse and two loci were in non-coding intergenic regions. The discriminating nodes, SNP location, locus name, gene symbol with product and the role category is described in the Additional File 5.
To show that SNPs can be used as diagnostic markers for typing of F. tularensis subspecies and clades, RT-PCR assays were designed. Initially, seven F. tularensis strains were used to screenthe 32 RT-PCR discriminatory SNP positions for the ability to distinguish type A vs. type B, A1 vs. A2, A1a vs. A1b, and B1 vs. B2. Preliminary results indicated 5 out of 9 primer sets (684048, 917759, 1014623, 1136971, 1581977) distinguished type A and type B, 3 out of 9 primer sets distinguished A1 and A2 (521982, 1025460, 1507435), 2 out of 5 primers sets distinguished A1a and A1b (518892, 1574929) and 3 out of Expanded phylogram for F. tularensis A1, A2 and type B strains  Table 2) were examined.
The data for 4 primer sets (1014623, 521982, 299153 and 1574929) is shown in Figure 5. These assays are hierarchical in nature. The first primer set determines whether a strain is type A or type B based on SNP 1014623. In type A and type B strains, this nucleotide position is T and C, respectively. A strain identified as type B can be further typed as B1 or B2 based on SNP 299153 (G in B1 strains and T in B2 strains). Similarly, strains identified as type A can be classified as A1 or A2 based on SNP 521982 (T in A1 strains and C in A2 strains) and A1 strains further characterized as A1a or A1b by SNP 1574929 (G in A1a strains and C in A1b strains).
As shown in Figure 5, the type A and type B SNP assay clearly distinguished between the 23 type A and 16 type B strains. The 23 type A strains were then subdivided into 15 A1 and 8 A2 strains and the 15 A1 strains were subsequently further sub-divided into 8 A1a and 7 A1b strains. For all 23 type A strains, the classification of strains as A1, A2, A1a or A1b by diagnostic SNP typing corresponds with PmeI PFGE typing results (Table 2) [14], emphasizing the power and the utility of this simpler methodology for classification of type A clades.
Type B strains were also resolved into B1 and B2 clades based on a single SNP. As these clades were newly identified by our SNP based phylogenetic clustering, resequenced B1 (KY00 1708 and MO01-1673) and B2 (LVS, OR96 0246) strains were included as positive controls. Of the 16 type B strains tested, nine isolates were classified as B2 and 7 isolates were classified as B1. Isolates from Russia (RC 503), Spain (SP03 1782 and SP98 2108) Finland (SP03 1783) and the US were identified as B2 by this assay, whereas isolates from Canada and the US were identified as B1, providing evidence for geographic clustering of type B isolates based on this SNP marker. In summary, this work shows the potential for development of SNP typing markers based on a relatively small number of "complete" genome sequences. For future work, it will be important to define a set of SNPs that could be used for high-resolution discrimination to the strain level. * contains all the type B strains with the exception of FRAN024, Japanese holarctica strain. Total SNPs are locations at which a SNP occurs in one or more strains in the node (if the same SNP occurs in more than one strain, that location is counted only once). Common SNPs are locations where all strains in the node share the same base call, which is different from the reference call on the resequencing platform. Unique SNPs are locations where just a single strain in the node has a base call that differs from the reference sequence. Differentiating SNPs are locations at which at least two strains in the node have different base calls. Maximum SNP separation is the number of base calls separating the two most distant members of the node. Differentiating SNPs and maximum SNP separation are both indicators of the degree of diversity within the node. The detection of diversity is limited by the extent to which our sample set is representative of the variability within each clade in nature. Refer to Figure 2 for the details of strain clustering.

Discussion
Whole genome comparative analysis and collection of high-confidence global SNPs from multiple strains of a given bacterial species has a number of applications in both basic and translational research. Our study was undertaken with an objective of providing the scientific community with whole-genome sequence and SNP information from multiple strains of F. tularensis, enabling rapid advancements in our understanding of basic and applied biology of this organism. F. tularensis has been recognized as a causative agent of tularemia for almost a century [24] and is classified as a category A biodefense agent. We have collected nearly complete (~91%) genome sequence and global SNP information from forty Francisella strains using our whole genome high-density resequencing array platform [13]. All the sequence and SNP information is publicly available to the scientific community from Biodefense and Public Health Database (Bio-HealthBase) at http://www.biohealthbase.org/GSearch/ home.do?decorator=Francisella. BioHealthBase is a Bio-informatics Resource Center (BRC) for biodefense and emerging/re-emerging infectious diseases that is supported by the National Institute of Allergy and Infectious Diseases (NIAID). The data can also be obtained from our web site at http://pfgrc.jcvi.org/index.php/ compare_genomics/francisella_genotyping.html or through the JCVI ftp server at ftp://ftp.jcvi.org/pub/data/ PFGRC/Ft_DataRelease/. This multi-strain high-quality nearly complete genome sequence and global SNP information provides a unique opportunity to perform comparative genome analysis between F. tularensis strains, thus contributing towards a better understanding of pathogenicity and evolutionary relationships of this species. We have used this information to build a robust whole genome based phylogeny that enabled the identification of SNP discriminatory markers. We further validated high quality global SNP markers for typing of F. tularensis subspecies and clades as a proof of concept that these markers may be used for future development of high-resolution typing methods.
Location of in silico identified diagnostic SNP markers in the F. tularensis LVS genome Figure 4 Location of in silico identified diagnostic SNP markers in the F. tularensis LVS genome. Representation of in silico discriminating SNP markers on the F. tularensis LVS genome. The vertical colored bar represents the position of the SNP marker on the LVS with the relevant node pair indicated by color. Loci containing the discriminatory SNP markers in the forward and reverse strands are shown in red and blue respectively. Two markers in the non-coding sequences of the genome are also shown.
Previous reports have suggested that greater genetic diversity exists among type A as compared to type B strains [2]. Our whole genome SNP based analysis of 12 type B isolates from North America and Russia appears to confirm this observation. However, SNP data obtained after inclusion of a Japanese type B strain (FRAN024) indicated a similar level of SNP diversity in type A and type B strains (Table 3). Sufficient SNP diversity was observed among type B strains to generate an internal structure in the phylogenetic tree (Figure 2) as well as to resolve all unique strains. The single F. novicida isolate in our study, FRAN003 (U112), had the lowest base call rate (83.041%) and the highest number of SNPs (12,407) among our samples. The low base call rate is a likely reflection of the sequence divergence between the F. novicida strain (U112) and the reference sequence on our resequencing chips. Rohmer et. al [11]. have reported a nucleotide sequence identity of 97.8% between the LVS and F. novicida U112 genomes. The differences in these two approaches may be due to the fact that array-based resequencing is sensitive to sequence divergence, and performs best with samples that are homologous with the ref-  A number of molecular approaches have been used to better understand the diversity of Francisella [2,21,[25][26][27]. New subdivisions within F. tularensis subspecies have been revealed by these approaches. Differing methods provide differing resolution as most of the methods sample only a subset of the whole genome in order to assess relationships among different isolates [2]. MLVA is considered to provide the highest discriminatory power (i.e. strain level) [2,21,28]. PFGE typing has been used to identify four distinct type A genotypes, A1a, A1b, A2a and A2b [9], not previously observed by MLVA typing. PFGE typing combined with epidemiologic data revealed that the observed genetic diversity among type A strains correlated with differences in clinical outcome and geographic distribution. A1b strains were associated with significantly higher mortality in humans as compared to A1a, A2 or type B strains. Type B strains display little or no genetic diversity by PFGE [14] and a number of other molecular methods [2,10,[21][22][23].

Real-time PCR evaluation of SNP diagnostic markers
Comparative whole-genome sequence analysis provides the highest level of discrimination among different strains, but has not been widely used due to the high cost of this method. Keim et al [2] have shown a wholegenome SNP phylogeny of Francisella using ~8000 syntenic SNPs from the published whole genome sequences of seven strains. Use of only two type A and two type B genomes was sufficient to reveal that type A strains differ greatly from each other unlike type B strains. More recently, the phylogenetic structure of F. tularensis has been reported based on whole genome SNP analysis of thirteen publicly available genome sequences; 29,774 SNPs were used in this analysis [4]. In this study, we have constructed a phylogenetic profile of forty Francisella strains based on whole genome sequences. This to our knowledge is the first report of a phylogenetic model based on nearly complete genomes of multiple strains of F. tularensis using Affymetrix resequencing arrays.
We have demonstrated that resequencing data may be used to generate high-resolution phylogenetic trees based on global SNPs. The advantage of this sequence-based approach is that SNP based phylogenetic trees can be used for evolutionary analyses. The comparative analysis based on the phylogenetic relatedness of strains can provide significant insights into the varying degree of phenotypes and ecotypes of an organism. The total number of complete genomes required to achieve an optimum phylogenetic profile from the multiple strains of an organism will be determined by the degree of plasticity of the genome. Adequate phylogenetic relationship can be determined with a sufficient number of genomes from diverse isolates of an organism and the whole genome comparative analysis of such related strains can provide real biological insights into the adaptation and evolution of a species. Such phylogenetic-based comparative analysis can capture genomic differences of very closely related strains and provide valuable information for the development of rapid molecular sequence based assays, capable of discrimination to the strain level.

Conclusion
The whole genome resequencing array platform provides sequence and SNP information from multiple strains for any infectious agent with an available whole genome sequence. Multi-strain whole genome sequence data allows one to build robust phylogenetic models for an organism based on global SNPs. Whole genome SNP based phylogenetic trees can guide meaningful comparative analysis of strains to better understand the biology of an organism as well as in translational research such as in developing high resolution economical SNP based typing assays. We have collected whole genome sequence and SNP information from forty strains of Francisella to construct a global phylogeny. Our data shows a good correlation with the previously published reports using limited genomic sequence information and also provides higher strain resolution. We used the whole genome SNP phylogeny to identify informative SNP markers specific to major nodes in the tree and to develop a genotyping assay for subspecies and clades of F. tularensis strains. Less diverse type B strains could even be discriminated into two clades, B1 and B2, based on a single SNP. Our whole genome SNP based phylogenetic clustering shows high potential for identifying SNP markers within F. tularensis capable of discriminating to the strain level. This finding should greatly facilitate the rapid and low-cost typing of F. tularensis strains in the future.

Authors' contributions
GAP-planned, developed and co-coordinated the project, analyzed the data, wrote the manuscript; MHH -bioinformatic tool development and data analysis, contributed to the progress of the project and manuscript writing; JMPcontributed to the data analysis and manuscript preparation; SP-wet lab analysis, performed resequencing assays of the samples; SAK-bioinformatic data analysis, preparation of tables and figures; MJW-contributed to the data analysis and manuscript preparation; CM-data collection for the SNP typing assay of samples; MJ-contribution towards development and optimization of the SNP typing assay; MES-participated in data analysis and manuscript preparation; RDF-project oversight; SNP-project design, manuscript contribution and project oversight. All authors read and approved the final manuscript.