Genome sequencing and comparative genomics provides insights on the evolutionary dynamics and pathogenic potential of different H-serotypes of Shiga toxin-producing Escherichia coli O104

Background Various H-serotypes of the Shiga toxin-producing Escherichia coli (STEC) O104, including H4, H7, H21, and H¯, have been associated with sporadic cases of illness and have caused food-borne outbreaks globally. In the U.S., STEC O104:H21 caused an outbreak associated with milk in 1994. However, there is little known on the evolutionary origins of STEC O104 strains, and how genotypic diversity contributes to pathogenic potential of various O104 H-antigen serotypes isolated from different ecological niches and/or geographical regions. Results Two STEC O104:H21 (milk outbreak strain) and O104:H7 (cattle isolate) strains were shot-gun sequenced, and the genomes were closed. The intimin (eae) gene, involved in the attaching-effacing phenotype of diarrheagenic E. coli, was not found in either strain. Examining various O104 genome sequences, we found that two “complete” left and right end portions of the locus of enterocyte effacement (LEE) pathogenicity island were present in 13 O104 strains; however, the central portion of LEE was missing, where the eae gene is located. In O104:H4 strains, the missing central portion of the LEE locus was replaced by a pathogenicity island carrying the aidA (adhesin involved in diffuse adherence) gene and antibiotic resistance genes commonly carried on plasmids. Enteroaggregative E. coli-specific virulence genes and European outbreak O104:H4-specific stx2-encoding Escherichia P13374 or Escherichia TL-2011c bacteriophages were missing in some of the O104:H4 genome sequences available from public databases. Most of the genomic variations in the strains examined were due to the presence of different mobile genetic elements, including prophages and genomic island regions. The presence of plasmids carrying virulence-associated genes may play a role in the pathogenic potential of O104 strains. Conclusions The two strains sequenced in this study (O104:H21 and O104:H7) are genetically more similar to each other than to the O104:H4 strains that caused an outbreak in Germany in 2011 and strains found in Central Africa. A hypothesis on strain evolution and pathogenic potential of various H-serotypes of E. coli O104 strains is proposed. Electronic supplementary material The online version of this article (doi:10.1186/s12866-015-0413-9) contains supplementary material, which is available to authorized users.


Background
The occurrence of a large outbreak involving enteroaggregative hemorrhagic E. coli (EAHEC) O104:H4 infections in Europe and North America in 2011 with prior reporting of similar strains in patients with severe infections and hemolytic uremic syndrome (HUS) in Asia, Norway, and the Republic of Georgia raises questions about the origin, evolution, genetic diversity, and virulence of these pathogens [1][2][3][4][5][6][7]. The German outbreak O104:H4 strain 2011-C-3493 had the characteristics of an enteroaggregative E. coli (EAEC); however, it carried the gene that encodes Shiga toxin 2a (stx 2a ), and thus can also be classified as a Shiga toxin-producing E. coli (STEC). Bielaszewska et al. [8] analyzed 80 isolates from this outbreak and found that they belonged to ST (sequence type) 678, similar to the HUSEC041 strain isolated from a patient with HUS in Germany in 2001. Moreover, the outbreak strains possessed a plasmid carrying CTX-M-15, encoding an extended-spectrum-β-lactamase, which is absent in HUSEC041. E. coli O104:H4 strains and O104 strains with other H-types showed genetic and phenotypic differences, including the presence of specific virulence and antibiotic resistance genes, the types of Shiga toxin genes, and their pulsed-field gel electrophoresis profiles [7,9,10].
Differences in virulence and antibiotic resistance genes linked to disease severity and HUS among various O104 serotype strains, including O104:H11, O104:H21, O104:H7, O104:H2, O104:H12, and O104:H16 have been observed [10,23]. The differential phenotypic properties of O104 strains with H2, H4, H11, H12, and H21 antigens on selective media, such as Brilliance ESBL (extendedspectrum beta-lactamases), CHROMagar STEC, and CHROMagar O104 were also reported [24,25]. However, there is little information regarding the genetic basis of virulence of non-O104:H4 pathogens and how they might have emerged or were transmitted. Therefore, these serotypes should be fully investigated at the genomic level to identify genes important for virulence and for growth and survival in food. Moreover, besides EAEC-STEC O104:H4, STEC O104:H7, and STEC O104: H21, several stx-negative O104:H4 strains have been isolated from human patients in Central Africa [4,26,27] at different locations and at different times. Thus far, humans are considered as the only natural reservoir of EAEC strains [28], and the search for the source of the O104:H4 outbreak strain revealed that there was no animal reservoir for EAEC/EHEC/STEC O104:H4 isolates [29][30][31][32]. In contrast, STEC O104:H7 and STEC O104:H21 are reported to be associated with contaminated milk [22], feces of cattle, and/or from food of bovine origin [33]. Facilitated by next generation sequencing (NGS) technologies, the identification of mobile genetic elements, including plasmids, transposons, prophages, genomic islands, and chromosomal pathogenicity islands (PAIs) encoding virulence-and antibiotic resistance-associated genes, has provided insight into the genomic differences among environmental, animal, and clinical strains isolated from geographically dispersed areas. Unlike other STEC that cause serious human illness, STEC O104 strains lack the eae gene; however they can attach to intestinal cells. The EAHEC O104:H4 outbreak strain 2011C-3493 also lacked eae; however, this strain possessed a plasmid (pAA) that encodes for adherence fimbriae [9]. Studies found that not all O104 serotypes possess aggR, which encodes a regulator of virulence plasmid and chromosomal genes including aggregative adherence fimbriae and aaiA-P (aggR-activated island encoding a type VI secretion system), a characteristic of EAEC. EAHEC O104:H4 colonizes the human bowel through aggregative adherence fimbriae encoded by the EAEC plasmid (pAA). In the European O104:H4 outbreak strains, the intimin (eae) function was substituted by the plasmid-encoded aggregative adherence fimbrial colonization mechanism, and once attached, the cells were able to produce and deliver Shiga toxin 2a, resulting in severe illness and HUS. Recently, Beutin et al. [1] studied the possible uptake of the chromosome-encoded stx2phage P13374 (from the German EAHEC O104:H4 strain) by EAEC strains of different serotypes. In this study, the authors concluded that the stx2-phage P13374 had a restricted host range, and its spread was dose-sensitive. One of the bovine-specific phages (P13803) was found to be capable of lysogenizing a stx-negative EAEC O104:H4 strain and converted it into an EAEC-STEC producing Stx 2a [1,34].
Previously published sequence data on O104:H4 isolates revealed variations in gene content among strains from different hosts and geographical sources [2,4,6,9,35]. Differences were observed in virulence-and resistance-associated genes among various H-serotypes of O104 strains. A better understanding of the genomic differences among strains belonging to different O104:H serotypes will help to reveal the basis for their pathogenicity and will provide information to understand how they evolved. The aim of this study was to sequence the genomes of the O104:H21 milk outbreak strain and a STEC O104:H7 cattle isolate, which had not been associated with human illness, and perform comparative analyses with the genomes of several E. coli O104:H4 strains isolated from different geographical locations. The resultant sequence analysis revealed notable differences among O104 strains belonging to H-serotypes H21, H7, and H4, including the presence of different plasmids and other mobile genetic elements.

Comparison at the genomic level
In order to gain insight on the genomic differences among environmental, animal, and clinical strains isolated from geographically dispersed areas, the whole genome sequence of various EAEC/STEC O104 strains were analyzed. Whole-genome comparison showed that these 13 O104 strains shared a conserved core chromosomal backbone but contained various mobile genomic elements reflected in the differences in their genome sizes. Details will be further discussed in the next three sections.
A comparison of seven O104 genomes, as visualized by the program MAUVE [36], including three O104:H21 strains isolated during an outbreak of hemorrhagic colitis in Montana in 1994 [37], is shown in Figure 1. Homologous regions that are conserved are shown in the same colors in Figure 1; gaps within a unique genomic region are in white. In this Figure, considerable conservation in the 7 genomes is revealed, although some serotype-specific regions were observed. For example, all three O104:H4 isolates possesed 3 prominent regions either completely missing (region #1, PAI carrying aidA and various type VI secretion-associated proteins; region #2, hypothetical proteins), or partially missing (region #7, tellurite resistance proteins and transporters) in O104:H7 and O104:H21 strains. Region #3 (phage_Yersin_413C; Additional file 2: Table S2) was present in all analyzed O104:H4 and O104:H21 strains, but missing in the sequenced O104:H7 strain. Other regions of significant genomic dissimilarity in O104:H7 include region #4 (various transporters and ATPase), #5 (transporters, etc.), and #6 (the adhesin gene, iha, Table 1).
Each O104 strain carried 1-4 plasmids of different sizes, ranging from 3,173 to 169,634 bp (Table 2). Interestingly, there is no whole genome sequence but partial fragment available.     Figure 2A and B).

Whole genome phylogenetic comparisons among various O104 H type strains and other LEE-negative STEC
In this study, we first inferred a whole-genome phylogeny of strains based on an alignment (Additional file 1: Table  S1) based on an alignment of whole-genome contigs or complete full genome sequence mapped onto the complete closed genome sequence of the O104: strains belonging to the H4 type, and that H7 and H21 had a closer relationship to each other than to O104:H4 ( Figure 3A). The similarity of the profiles of the mobile genetic elements and the phylogenetic tree analysis indicated that O104:H7 and O104:H21 might share a similar evolutionary path since they are more closely related to each other than to O104:H4 strains. Phylogenetic comparisons among various LEE-negative STEC serotypes and different O104 strains showed that LEE-negative STEC serotypes O91:H21 and O113:H21 were genetically closer to O104:H21 and O104:H7 than to O104:H4 ( Figure 3B).

Our phylogenetic analyses (Figures 3) demonstrated that
O104 strains with the same H-type clustered together (i.e. monophyletic) but their corresponding serogroups were cladded into multiple independent lineages (i.e. polyphyletic). Scattered distribution patterns of virulence factors and resistance genes, as well as phylogenetic analysis of these different O104 strains may suggest that strains from individual lineages might have acquired virulence genes and antibiotic resistance genes independently in parallel evolutionary processes.
stx and eae genotypes Shiga toxins Stx 1 and Stx 2 are recognized as the most important phage-encoded virulence factors in all diseaseassociated STEC strains, and the genes encoding for these toxins are located in the genome of mobile bacteriophages [38,39]. The stx phages are genetically distinct groups of temperate phages that can be identified in their prophage states inserted in the STEC chromosome, but also can be detected as phages released from the bacteria into the environment and/or animal hosts. stx phages can exist in polluted waters, feces, food, in humans, and the environment [40,41]. Generally speaking, phages could be used to detect and to derive lineage/evolution of newly emerged STEC strains from different hosts and/or environments [34,[42][43][44]. The diagram in Figure 4A Table S2). The details of stx phage modular genetic structure in various O104 strains are illustrated on the diagram of Figure 4A. Genomic sequence analysis of the European outbreak O104: H4 strains showed that all isolates possessed stx 2a and some also carried stx 2c , stx 2d , or the combination of stx 2a , stx 2c , and stx 2d genes (data not shown). STEC strains that carry specific stx 2 subtypes, especially stx 2a , stx 2c , and stx 2d tend to be more virulent [45]. In this study, activatable stx 2d was found to be the sole stx gene in our sequenced O104:H21 and O104:H7 strains. Sequence analysis also revealed that none of the O104 strains harbored the stx 2e , stx 2f , or stx 2g alleles.
eae is another important virulence gene of EPEC, EHEC, and STEC strains. The three prominent LEE integration sites selenocystyl-tRNA (selC), pheU, and pheV were previously described [46]. The different insertion sites of the LEE in various STEC suggest that the acquisition of LEE may depend on the strain genetic backgrounds, the environmental conditions to which the strain is exposed, and the mechanisms of genetic recombination. Our analysis showed that the O104:H7, O104: H21, and O104:H4 do not possess eae genes. It has been reported that serotype O104:H − carried an eae-TAU (GenBank Accession: FM872416.1) variant. Strains expressing eae-TAU were generally restricted to and efficiently colonized Peyer's patches in human intestinal mucosa [47]; so far, serotype O104:H − has been found only in humans [21]. Schmidt and colleagues [48] reported a PAI in E. coli O91:H − that occurs exclusively in a subgroup of STEC strains that are eae negative and contain the variant stx 2d gene, similar to O104:H7 and O104:H21 in this study.
Also interestingly, a DNA fragment containing the LEE-encoded translocated-intimin receptor (tir) effector gene from serotype O104:H12 strain 4051-6 was sequenced (GenBank Accession# AB288103.1); however tir was missing in all O104:H4, O104H7, and O104:H21 strains examined in the current study ( Table 1). The tir gene is typically encoded by the LEE, flanked by ORF19 (upstream) and eae (downstream) although the whole genome sequence of O104:H12 strain 4051-6 is not available currently. The functional Tir protein is translocated into the epithelial cell and is integrated into the cell membrane where it functions as a receptor for eae. Oswald and coworkers [49] showed that allelic differences in eae were associated with differences in tissue and host specificity. Thus, it is possible that O104 strains belonging to H7 or H21 serotype possess the EAEC or EHEC genetic background to acquire functional genes of various PAIs or eae variants ( Figure 5, Additional file 3: Figure S1). A detailed description of Figure 5 is presented in next section.
Identification and analysis of prophages, genomic islands, and/or pathogenicity islands The number of predicted prophages varied greatly among various O104 H-type strains (Additional file 2: Table S2). Our analyses showed the distribution of different types of bacteriophages, particularly the stx 2 -carrying prophages, in the genomes of O104:H7, O104:H21, and O104:H4 supporting the hypothesis that the different H-serotypes of O104 may acquire diverse types of bacteriophages due to a possible genome adaptation to niches [50,51]. For instance, the similarity between bacteriophage types in the genomes of O104:H7 and O104:H21 shown in (Additional file 2: Table S2) suggested that they may come from the same evolutionary path or evolved in a similar environment.
PAIs belong to a specialized class of GIs. They carry multiple virulence loci, may take up a large chromosomal region (>10 kb) of pathogens, are absent from non-pathogens, and usually have a different GC content from that of the core genome. A PAI from an O103:H − strain which is inserted into the selC site between yicI and nlpA (shown in Figure 5) contains 40 ORFs encoding proteins with ≥ 50 deduced amino acids and one selC site (numbered sequentially from 1 to 41). The proteins encoded in this island include a novel serine protease EspI (ORF#15); an adherence-associated locus, similar to iha of E. coli O157:H7 (ORF#18); an E. coli vitamin B12 receptor (ORF #20); an araC-type regulatory module (ORF #22-26); and several other important and/or unknown function proteins. The remaining sequence consists largely of complete and incomplete insertion sequences, prophage sequences, and an intact phage integrase gene that is located directly downstream of the chromosomal selC [48]. A pairwise alignment of the selC-tRNA loci between yicI and nlpA to this typical O103 PAI shows some interesting genomic structural differences among O104 H-serotypes, O103:H − , O157:H7, non-O157 STEC, and other serotypes including E. coli K-12 ( Figure 6). All of these strains shown in Figures 5 and 6 share the same gene structure at the N-terminal (yicI, yicJ, and selC) and C-terminal (yicL and nlpA), but none of these serotypes are genetically similar in the central portion of this typical PAI structure.
The genomic region of the selC site in O104:H7 and O104:H21 is 7775 bp and it encodes no known functional proteins. Here we propose the concept of "pseudo PAI", a genomic region less than 10 kb in size that lacks key functional gene products typically found in O157:H7 and O103:H − ( Figure 5). Interestingly, strains carrying "pseudo PAI" in this selC site normally have smaller genome sizes and they could have lower pathogenicity if they also lack plasmids carrying virulence genes. It is important to note that the eae gene has at least 27 different variants and one of the subtypes, eae-TAU, has been isolated from patients with hemolytic-uremic syndrome in O104:H − (Additional file 3: Figure S1) [52]. As shown in Figure 6, the presence and size of the PAI in the selC tRNA locus also shows no correlation with serotype. A "pseudo PAI" (7775 bp between genes yicI to nlpA) was found in strains of serotypes O104:H7, O104:H21, O111: H − , O113:H21, O121:H19, O91:H21, O103:H25, O103:H2, O26:H11, and even non-pathogenic E. coli K12. These differences in the size of the functional PAIs (serotypes O26, O145, O157, O104:H4, and strain 4797/97 O103:H − ) or "pseudo PAI" among the same serogroup was also observed for O104 and O103:H − . These observations are suggestive of the acquisition of the different PAIs by the same serotype, including O104:H4 strains due to their different ecological niches and/or the geographical regions where they were found. The selC site can be targeted by bacteriophage carrying LEE or LEE-like PAIs (e.g., O157, O26, etc.), bacteriophage harboring resistance genes (e.g. O104:H4) (shown in Figures 5 and 6) through horizontal genetic transfer (HGT) and adaptation to new ecological niches, or it can remain empty (e.g., O104:H7 and O104:H21).
As stated above, within different types of PAI inserted into the selC site, genes that may encode antibiotic resistance, are associated with altered metabolism, or that express virulence factors are located in the central portion of the island and are surrounded by mobilityassociated genes (ORF with red color, Figure 5). Mobile genetic elements (ORF 28-30, 33, 34-36, and 38) are extensively interspersed throughout the genome of serotype O104:H4, but less so in O104:H21 (ORF 28, 29, 38, 19, 30, 33) and O104:H7 (no ORFs, except the pseudo PAI portion), particularly O104:H7 ( Figure 5). However, Schmidt and coworkers [46] indicated that LEE in O91: H − was flanked by prophage sequences and was devoid of IS-related sequences, which were only shown in O157:H7 in Figure 5. Our data may indicate that O104 H-serotypes could acquire PAI with or without an intact or a portion of LEE by phage transduction or HGT, which potentially could make O104 H-serotypes more pathogenic.
At present, it is not known if the PAIs in different O104 H-serotypes were in the process of stepwise insertion of genes into the region, or stepwise deletion of genes from the "original" PAI, similar to those in O103 and O91:Hserotypes via complex mechanisms of horizontal transfer and/or recombination. A novel PAI, originally identified in O103:H − and O91:H − , was also partially found in one O104 H − serotype but showed some rearrangement (Additional file 3: Figure S1).
Based on our analyses, we propose an evolutionary model (Additional file 3: Figure S1) in the selC-tRNA site of various H-serotypes of E. coli O104 strains in a stepwise insertion fashion. This model suggests that O104:H − and O104:H4 may derive from the ancestral O104:H7 or O104:H21 by the acquisition of antibiotic resistance genes (O104:H4) or eae-TAU carried by a prophage/PAI (O104: H − ) and other genetic components, such as the plasmid carrying aggR gene. Various O104 strains were also shown to carry stx 2 [53] (Table 2), implicating that within the O104 serogroup there are multiple acquisition events of prophage, GI, and/or PAI. PAIs can excise at different frequencies depending on growth conditions [54][55][56]. This suggests that genetic versatility is needed for the survival of E. coli in diverse environments, which may be relevant regarding its host specificity, as well.
The stx 1and stx 2cprophages are normally flanked by integrative element IS629 at both ends in almost all O157 and non-O157 STEC, including O26, O45, O111, O121, and O145. Our analysis of the stx prophages indicated that the transposable insertion sequence element IS629 may be a driver of stx 1 and stx 2c prophage evolution among O104:H7, O104:H21, and other LEE-negative STEC strains. We have hypothesized that stx genes may excise from African strains after infection, owing to a functional change of a putative integrase near the yqgA (inner membrane protein) gene due to frameshift-and nonsensemutations. After transposition, there is only one IS629 left in O104:H4. IS629 insertion is highly biased toward prophages and prophage-like integrative elements. Both IS629 and IS602 are not present in O104:H7 and O104:H21 but exist in all O104:H4 strains (Table 1). IS602 also appears to be part of kanamycin (and related aminoglycosides) resistance transposons [57]. In addition, O104: H4-specific CRISPR2 was not found in any of the other STEC strains examined in this study (Table 1), including O104:H7 and O104:H21. The presence and the functions of CRISPR2 [58] may play a significant role in O104:H4 for maintaining a high degree of genomic plasticity and flexibility for adapting to host and environmental changes. Analysis of IS629 and CRISPR distribution among O104 strains might be useful for identification and detection of specific O104 strains and population genetic analysis, as well as molecular epidemiology studies. The presence of IS629, CRISPR, and prophage regions (data not shown) may potentially contribute to the larger genome size of O104:H4 (Additional file 1: Table S1 and Figure 1).
There was no apparent correlation between serotypes and plasmid profiles among O104:H4, O104:H7, and O104:H21 strains, as well as other non-O104 STEC strains (Table 2, Figure 2). There was marked variability in the numbers and types of genes carried by the plasmids in the O104 strains (Tables 2 and 3); none of the 13 E. coli O104 isolates carried the same set of plasmids. STEC O104:H7 RM9387 had four plasmids, while most other O104 strains carried only 1 or 2 plasmids. The following genes used to define EAEC, STEC, and EHEC strains were located on some plasmids: aggR, aggBCD cluster (encoding major fimbrial subunits), saa, ecf cluster, ehx cluster, etp cluster (type II secretion system apparatus), toxB (putative adhesion), katP (catalase-peroxidase), espP, espA, sepA (serine protease sepA precursor), subAB, typeIV pilus gene cluster, mabB, stcE (zinc metalloprotease), and excA (exclusion-determining protein), in agreement with other previous reports [10,[59][60][61][62]. These genes are useful  for the molecular detection and characterization of STEC, EAEC, and EHEC strains. A comparison of the various O104 strains revealed that subAB (a potent toxin involved in inducing cell death), the ehx cluster, espP, and excA were found in O104:H7, O104:H21, and LEE-negative O113:H21 strains, but not in O104:H4 strains. Therefore, it is suggested that virulence of the O104:H21 outbreak strain may be due in part to plasmid-associated virulence genes, including subAB, the ehx cluster, and espP ( Table 2). In Table 2, only O104:H4 strains and an O111:H21 strain carried the transcriptional activator aggR and/or the aggregate B-C-D gene cluster ( Table 2). subAB was more prevalent in LEE-negative than in LEE-positive STEC strains and likely contributes to the progression to severe disease [63], since the O113:H21 strain was involved in an outbreak with cases of hemolytic uremic syndrome (Table 2).

Antibiotic resistance gene profiling
There were a number of genes related to antibiotic resistance in O104:H4. The O104:H4 outbreak strains were resistant to at least 14 different antibiotics including ampicillin, sulfonamide, cefotaxime, ceftazidime, streptomycin, sulfamethoxazole, trimethoprim, cotrimoxazole, tetracycline, and nalidixic acid, but were susceptible to imipenem, kanamycin, gentamicin, chloramphenicol, and ciprofloxacin [62]. The O104:H4 outbreak strain carried a plasmid, pESBL, that encodes for extended spectrum β-lactamase CTX-M-15 [64,65]. Although the actual plasmid sequences were not available, genomic analysis showed that the four O104:H4 strains isolated from Africa carried some resistance genes (both chromosomal and plasmid-borne), but not as many as the O104:H4 strains of European origin (Table 3). On the other hand, O104:H7 and O104:H21 strains did not carry the plasmidborne antibiotic resistance genes found in the German outbreak O104:H4 strains (Table 3). These antibiotic resistance genes listed in Table 3 may have been acquired through horizontal gene transfer by transposon-based plasmid integration (Table 3, Additional file 3: Figure S1). The analysis of the plasmid profiles of various O104 Htypes and other STEC serotypes (Table 2) is consistent with the observation that EHEC plasmid-encoded genes ( Table 2, GenBank accession CP009105 and CP009107), including hly cluster, excA, subA/B, espP, were only found in STEC O104:H7 and O104:H21 strains, but not in O104:H4 strains [10].

Conclusions
The milk-associated outbreak STEC O104:H21 strain 94-3024 and the O104:H7 cattle isolate RM9387 were shotgun sequenced, analyzed, and compared to elucidate the potential relationship, origin, and evolution of bacterial pathogenesis in various O104 H-serotypes. The observed variation between different O104 H-serotype genomes demonstrated that genome-wide divergence likely occurred via acquisition and loss of genomic islands, prophages, and plasmids. The genetic diversity in O104:H7, O104:H21, and O104:H4 serotypes was reflected in differences in virulence and resistance genes carried on the chromosome and on plasmids, suggesting their independent evolution demonstrated by different distribution and types of genetic mobile elements. Genetic diversity of stx 2 bacteriophages among O104: H4, O104:H7, and O104:H21 is illustrated in Figure 4. Further studies are needed to define the sources (cattle and other animals, fresh produce, the environment, humans) of these stx phages and to determine the frequency of lysogenization of E. coli O104:H7 and O104: H21 and phage origin by comparing to stx 2 -carrying phages from E. coli O104:H4 and other non-O104 EAEC/ STEC strains. Their roles in dynamic bacterial genome evolution have been increasingly highlighted by the fact that many sequenced bacterial genomes contain multiple prophages carrying a wide range of genes, such as stx and antimicrobial resistance genes.
Bacteriophages are major genetic factors promoting horizontal gene transfer (HGT) between bacteria. In this study, a "pseudo selC-tRNA site PAI" was defined containing perfect direct repeats and transposons, but was less than 10-kb in length ( Figure 5, Additional file 3: Figure S1). The results in Figure 5 indicated that O104: H4 and O104:H21 may represent various insertion intermediates of the O104:H7 strain generated in the course of O104 evolution. Our data provide new insights into the potential activities of the functional prophages embedded in bacterial genomes and may lead to the formulation of a novel concept of inter-prophage interactions in prophage communities. That is to say, strains containing prophage without genetic defects may be potentially more capable of spreading (gaining or losing) important virulence determinants such as eae and other genetic traits to other bacterial strains ( Figure 6). Our findings suggest that more research is needed to understand the potential roles of prophage in HGT between bacteria and in the evolution of bacterial pathogens.
A key virulence factor, stx 2 , was found in all European O104:H4 outbreak strains, but only in one of the strains of African origin (C777-09) ( Table 1). This finding supports the contention that similarly virulent O104:H4 isolates could be widespread; however, they could be genetically different due to their adaption to specific niches. It is also suggestive that both O104:H7 and O104:H21 may have evolved into pathogenic STEC strains due to the acquisition of stx, other virulence genes, and antibiotic resistance genes such as mdtL, emrE, ksgA, ydeB, dacC, folA-O157, and others from other STEC. It is reasonable to suggest that genetic variation of STEC O104 may partly be due to adaptation to local environments and interactions with other bacteria and hosts. Serotypes O104:H − and O104: H12 have been isolated from humans and associated with HUS, thus it is also important that future studies examine the pathogenic potential of different O104 H-types, other than H4, H7, and H21.

Methods
Strains, genome sequencing, assembly, and annotation STEC O104:H21 strain 94-3024 and O104:H7 strain RM9387 were obtained from the Centers for Disease Control and Prevention (Atlanta, GA) and from Dr. Robert Mandrell at the USDA Agricultural Research Service, Western Regional Research Center (Albany, CA), respectively. Ion Torrent libraries were prepared following the manufacturer's recommended library construction procedures. Ion Torrent 316-chips with the 200-bp One-Touch kit was used for the generation of sequencing data on the Ion Torrent Personal Genome Machine (PGM). High molecular weight DNA for PacBio sequencing was extracted using Qiagen Genomic-tip 100/G columns and a modified manufacturer's protocol as previously described [66]. Ten micrograms of DNA was sheared to a targeted size of 20 kb using a g-TUBE (Corvaris, Woburn, MA.) and concentrated using 0.45X volume of AMPure PB magnetic beads (Pacific Biosciences, Menlo Park, CA) following the manufacturer's protocol. Sequencing libraries were created using 5 micrograms of sheared, concentrated DNA and the PacBio DNA Template Prep Kit 2.0 (3Kb -10Kb) according to the manufacturer's protocol. The library was bound with polymerase P5 followed by sequencing on a Pacific BioSciences (PacBio) RS II sequencing platform (with chemistry C3 and the 120-min data collection protocol). A fastq file was generated from the PacBio reads using SMRTanalysis Version 2.1 and errorcorrected reads were created using pacBioToCA with self-correction [67]. The longest 20X of the corrected reads were assembled with Celera Assembler 7.0 [65]. The resulting contigs were polished using Quiver [68] and annotated using a local instance of Do-It-Yourself Annotator (DIYA) [69] for initial gene prediction and frame shift verification. The annotated genome sequence was imported into Geneious (Biomatters LTD., Auckland, New Zealand) and duplicated sequence removed from the 5′ and 3′ ends to generate the circularized chromosome. The chromosome was reoriented with the dnaA gene at the 5′ end. Reoriented circular chromosomes were reanalyzed using Quiver and a final annotated chromosome was generated with DIYA. Hybrid error correction and de novo assembly of single-molecule sequencing reads were evaluated and confirmed by Ion Torrent sequencing. The complete genomes of O104:H21 and O104:H7 used in this study were submitted to Prokka [70] to predict genes and annotate all assemblies, followed by manual checking for the final genome submission to NCBI. The identification of potential frame shifts and pseudo-genes of these two query finished genome sequences were also identified using the NCBI online service Microbial Genome Submission Check [71].

Genome comparisons
The Artemis comparison tool, ACT version 10 [72] was used to plot nucleotide similarities (blastn) between O104: H serotypes and other non-O104 STEC. The comparison of particular pathogenicity/genomic islands, plasmids, and prophages was performed using the ACT alignment program at the default settings. These predicted genomic islands and prophages were identified from sequence alignments and breakpoint sites and were further manually curated. The gene name and locus ID were directly assigned based on the NCBI Reference Sequence files.
Mauve was used for comparing and visualizing a number of genomes. By applying O104:H4 strain 2009EL-2050 (NC_018650.1) genome as the "reference" strain, all FDA draft raw sequences for O104:H21 (Table 2, strains BAA-178 and BAA-182) downloaded from SRA (http://trace. ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj) and our two newly sequenced genomes were re-assembled and ordered based on reference genome. In general, progressive Mauve was applied for whole genome comparison. Homologs and/or orthologous relationship of the nucleotide sequences for profiling of virulence-and resistanceassociated genes between genomes were determined using NCBI Basic Local Alignment Search Tool (BLAST) with the following criteria: identity >80%, e-value <1e − 10, and coverage >90%. The identified homologs and/or orthologous genes were further manually curated and confirmed.

Phage identification and analysis
Prophages and putative phage-like elements in the O104: H4 reference strain 2009EL-2050 (NC_018650.1) and the newly sequenced O104:H7 and O104:H21 genomes were analyzed using prophage-predicting PHAST Web server [73]. Regions identified algorithmically as "intact", "questionable", and "incomplete" by PHAST, and regions sharing a high degree of sequence similarity and conserved synteny with predicted "intact" prophages, were marked as prophages. The extent of sequence similarity and synteny among these predicted prophage sequences were then aligned and presented using software ACT.

Identification and comparison of pathogenicity islands and other genomic islands
IslandViewer [74] was used to predict genomic islands and/or pathogenicity islands (PAIs) [75] in the O104:H4 reference strain 2009EL-2050. The predicted genomic islands and/or pathogenicity islands were then used as the "reference genomic islands template" for the identification of corresponding genomic islands from various H serotypes of O104 isolates and other non-O104 STEC genomes based on 90% identity and 90% coverage (90%, 50 nts). The LEE protein sequences from the top 7 STEC serogroups (O26, O45, O103, O111, O121, O145, and O157) were used as query sequences for large-scale BLAST score ration analysis and BLASTP to detect the presence of LEE islands and its associated genes (homologs). The LEE pathogenicity island insertion site was determined using BLASTN to identify the contig and alignment coordinates of the intimin gene, eae, among all O104:H serotype strains listed in this study. The presence of the LEE carrying eae and its flanking genes was confirmed by manual inspection of the annotation for this region, and/or by detection of the coordinates of genes encoded within the LEE of O157:H7 strain EDL933.

Identification and comparison of virulence factors and antibiotic resistance genes
Only bacterial genes experimentally confirmed to be involved in bacterial pathogenesis and antibiotic resistance were collected through an extensive literature mining in PubMed. This was followed by reference mapping to confirm their existence with corresponding raw sequencing reads. SNP discovery for RpoB (mutations in rpoB, β-subunit of RNA polymerase, can result in antibiotic resistance) was also performed (data not shown).

Phylogeny tree construction
The phylogenetic relationship of the two sequenced O104: H7 and O104:H21 strains and other O104 strains was visualized and constructed using MAUVE whole-genome alignment and presented with the program Mega6 based on features in the genome or specific regions such as PAIs ( Figure 6).