Analysis of the lambdoid prophage element e14 in the E. coli K-12 genome

Background
Many sequenced bacterial genomes harbor phage-like elements or cryptic prophages. These elements have been implicated in pathogenesis, serotype conversion and phage immunity. The e14 element is a defective lambdoid prophage element present at 25 min in the E. coli K-12 genome. This prophage encodes important functional genes such as lit (T4 exclusion), mcrA (modified cytosine restriction activity) and pin (recombinase).
Results
Bioinformatic analysis of the e14 prophage sequence shows the modular nature of the e14 element, which shares a large part of its sequence with the Shigella flexneri phage SfV. Based on this similarity, the regulatory region, including the repressor and Cro proteins and their binding sites, was identified. The protein product of b1149 was found to be a fusion of a replication protein and a terminase. The genes b1143, b1151 and b1152 were identified as putative pseudogenes. A number of duplications of the stfE tail fibre gene of e14 are seen in plasmid p15B. A protein-based comparative approach using the COG database as a starting point helped detect lambdoid prophage-like elements in a representative set of completely sequenced genomes.
Conclusions
The e14 element was characterized for the function of its encoded genes, the regulatory regions, replication origin and homology with other phage and bacterial sequences. Comparative analysis at nucleotide and protein levels suggests that a number of important phage-related functions are missing in the e14 genome, including parts of the early left operon, early right operon and late operon. The loss of these genes is the result of at least three major deletions that have occurred on e14 since its integration. A comparative protein-level approach using the COG database can be effectively used to detect defective lambdoid prophage-like elements in bacterial genomes.


Background
Bacterial genomes harbor several types of mobile elements including transposons, insertion elements and temperate bacteriophages, both functional and defective. These elements can encode various important functions, including toxins, virulence factors, bacteriophage resistance, restriction modification systems and antibiotic resistance [1]. Prophages, both intact and defective, are particularly important in this context because they are resident elements that influence the physiology of the host bacteria. They have been implicated in serotype conversion, pathogenesis and phage immunity [reviewed by [2,3]].
The temperate lambda-like (lambdoid) phages have highly mosaic genomes with respect to each other. This forms the basis of the "modular genome hypothesis" proposed by Botstein in 1980 [4]. According to this hypothesis phages evolve by interchanging genetic elements (modules), each of which can be considered as a functional unit [5,6]. In spite of this diversity, E. coli and other enterobacterial genomes are recognized to contain a number of lambda-like cryptic prophages [reviewed by [7,8]]. For example the very well characterized E. coli K-12 genome carries eight convincingly identified prophages (λ itself and seven others; all of the latter are defective and six, DLP-12, e14, Rac, QIN, CPS-53, and Eut, are thought to be lambdoid in nature [reviewed by [7,9,10]]). The high rate of recombination, deletions and insertions present in such cryptic phage elements makes their unambiguous detection and determination of evolutionary linkages difficult (see below).
The e14 element, the subject of this report, is one such defective prophage element that is integrated into the E. coli K-12 genome at 25 min on the chromosome within the isocitrate dehydrogenase (icd) gene [11,12]. The sequence of the e14 element is available with the sequencing of the E. coli K-12 genome; it is 15.4 kbp long and lies between 1195432 bp and 1210646 bp on the K-12 chromosome [13]. The element has at one end 216 bp of homology with the C-terminal end of the host icd gene, and the actual crossover for integration (the attachment site) occurs between the first 11 bp at one end of the homology in e14 and an 11 bp sequence inside the host icd gene [12]. The integration event fused the e14 "icd replacement region" to the N-terminal portion of the host icd gene, causing only two amino acid changes in the isocitrate dehydrogenase protein [14]. The element is capable of excision if the host SOS response is triggered. Both excision and re-integration occur in a site-specific manner [11,15]. e14 shares its integration site with phage 21 and has a similar integration machinery to that of phage 21; both have slightly overlapping int and xis genes. These two genes are transcribed leftward and lie about 3 kb from the e14 att site [12,14]. However, e14 and phage 21 must have different specificities of site recognition since phage 21 Int and Xis cannot cure cells of the e14 element as demonstrated by Wang et al. [14].
Experimental data on e14 are scattered in the scientific literature. The e14 element was originally identified by Greener and Hill [16], and mapped on the E. coli K-12 chromosome and cloned by Plasterk et al. [17,18] and Maguin et al. [19]. A restriction map of the element was made which largely corresponds with the now available sequence [18]. Current E. coli genome databases attribute 20/21 ORFs to the e14 element [20][21][22]. Most of these are annotated as putative or hypothetical proteins and very few have a functional annotation. The element is known to encode several important functions including the lit gene involved in T4 exclusion [23,24], the rglA (mcrA) gene involved in restriction of hydroxymethylated nonglucosylated T4 phages [25,26], and the pin gene involved in inversion of an adjacent 1794 bp segment within e14 [17,27]. In addition to these, it has also been reported to encode a kil function and a concomitant repressor protein [18], and an SOS induced cell division inhibition function attributed to the sfiC gene [19,28]. Defined regions of e14 encoding these latter functions have been implicated by mapping the kil, repressor and sfiC functions. However, the actual genes corresponding to these functions have not been previously identified. Recent sequencing of numerous bacteriophage genomes now allows a much more sophisticated bioinformatic analysis of the genetic content of e14 and prediction of the function of many of its genes.
E. coli is perhaps the best-understood cellular organism, and K-12 is the most highly studied E. coli strain. If this model genome is to be completely understood, and this goal now seems achievable, it is essential that we understand its prophage elements. Here, we use a sequence analysis approach to further understand the evolution, and phage- and host-related functions of the e14 element.

Results and Discussion
Overall genetic structure of e14
The e14 element is a relic lambdoid prophage element integrated into the icd gene of the E. coli K-12 genome. Little is known about where the e14 element came from or why it is maintained in the E. coli genome. We have attempted to characterize this prophage element keeping in mind that it is a lambdoid prophage that almost certainly was once functional. The element is 15204 bp in length and has an average G+C content of 0.45, while the E. coli chromosome has G+C content of 0.5. BLAST searches performed against the non-redundant database identified several sequences that are particularly closely related to e14. These regions of homology along with adjacent regions were then used in pairwise BLAST and MegaBlast searches [29] in order to extend the similar regions. The important hits found after general and sequence specific BLAST are shown (Figure 1, Table 1). Important homologues of e14 include bacteriophages SfV, 21, ST64B, HK97, ΦP27 and prophage portions of several enterobacterial genomes and plasmid p15B (Figure 1 and Table 4 below). Figure 1 shows graphically that there are 24 open reading frames present in the e14 element and their relationships to a generic lambdoid phage genome. All but four e14 genes have convincing homology to genes present in the genomes of other lambdoid phages, and all the phage-encoded homologs of these twenty genes are present in similar locations on those phages. Seven of the e14 genes can be deduced to be functional and four appear to have obviously debilitating truncations. With its divergent early operons and late operon downstream of the early right operon, all of which contain genes that are homologous to other lambdoid phages, e14 is clearly derived from a lambdoid phage ancestor.
It also seems clear that e14 no longer encodes a whole phage genome and that its ancestral prophage has suffered three major deletions, one in the early left operon, one that fused the early right and late operons and one within the late operon. In addition, there is a possible insertion near its left end since its original integration into the K-12 chromosome. We discuss these relationships in more detail below.
Figure 1 Overview of the e14 genome. The genetic functions of a generic lambdoid bacteriophage genome (brown rectangle) are shown above displayed with a transcriptional map (black arrows). In the middle, the section of the E. coli K-12 genome that contains e14 (gray rectangle) is shown with ORFs denoted by rectangular arrows oriented in the direction of transcription (green, host genes; red, e14 genes that are likely nonfunctional; black, e14 genes that are known to be functional; blue, e14 genes whose functionality cannot be assessed at present; parentheses indicate the boundaries of the P-invertible element). Small black arrows above the e14 map denote putative promoters, vertical lines denote putative terminators and small black squares putative operators. The yellow regions between the lambdoid and e14 maps indicate regions where e14 has homology to at least one known member of the lambdoid phage family (see text for details). Below, colored rectangles mark regions of highest homology between e14 and various known phages and prophages with regions of greater similarity closer to the e14 map (these are not meant to show all known homologies, only the closest ones); CPS-53 is a defective prophage in E. coli K-12, CP-933H is a prophage in E. coli EDL933 and CP073-5, Sti4b, and Sti8 are provisional names for prophages in E. coli CFT073 and S. typhi CT18 (Supplementary Material of Ref. [39]).

The regulatory switch in the e14 genome
The regulatory switch that determines whether a lambdoid phage will follow a lytic or lysogenic life cycle includes an operator/promoter sequence and two major regulatory proteins similar to the Cro and CI repressor proteins of phage lambda [30]. Analysis of the e14 sequence suggests that its regulatory mechanism is similar to other functional lambdoid phages, especially Shigella flexneri phage SfV. This similarity has been reported previously by Allison et al. [31]. The e14 regulatory switch region has 96% identity to that of SfV. This region encodes the b1145 and b1146 proteins of e14 and the homologous P34 and P35 of SfV. (We will use mostly the "b" gene nomenclature system of Blattner et al. [13] for E. coli K-12 genes, because the genes are numerically named in their order on the chromosome, which makes their relative locations obvious; Table 1 gives both names for each open reading frame). The b1145 (e14) and P34 (SfV) proteins are identical except for a single amino acid difference: V4 of P34 is I4 of b1145 protein. A domain search with b1145 protein shows that it belongs to the LexA group of SOS-response transcriptional repressors similar to the lambda repressor (Interpro id: IPR006198, PFAM id: PF00717). In addition, b1145 protein is similar in sequence to the "CI repressors" of several other bacteriophages including Bordetella phage BPP-1 (accession no. AAK40284), and Salmonella phages P22 [32] and ST64B [33]. The experimental results of Plasterk and van de Putte [18] and Maguin et al. [19] suggest that disruption of the single e14 EcoRI site (Figure 1) is sufficient to negate repression of the kil function (see below). Since this site lies within the b1145 gene, one can conclude with reasonable certainty that b1145 encodes a functional prophage repressor that is responsive to an SOS signal. Its location between the early left and early right operons and leftward orientation is identical to prophage repressor genes in all known lambdoid phages.
The second protein player in this regulation is the Cro protein, which is structurally similar to CI repressor and binds to the same operators, but performs the opposite function of facilitating a lytic rather than a lysogenic life cycle. The putative b1146 protein matches the SfV P35 protein with 99% identity over 66 amino acid residues with a single residue change: C49 of P35 is Y48 of the b1146 protein. We note that the original annotation of b1146 is about 100 residues longer than the SfV P35 protein and other known Cro protein homologs. It is very likely that the first 101 residue section was wrongly predicted as part of this ORF. The smaller 66 codon ORF has a plausible RBS (Figure 2) and is identified by the gene prediction program GeneMark [34] (using E. coli as a typical model) as a separate ORF.
The b1145 protein is deduced to be functional, since the early left operon functions IntE and Vxis and early right operon kil function are normally off in K-12 (see below), and b1146 also seems likely to be functional by virtue of its near identity to its SfV homologue. In the well-characterized lambdoid phages the CI and Cro repressors bind to the same operators which overlap the promoters for the two divergent early operons. SfV and e14 are 95.2% identical in nucleotide sequence over a 1643 bp region that includes b1145, b1146 and the two potential operator regions. Allison et al. [31] predicted three inverted repeats (we note that they are all closely related to the consensus palindrome TTGTACCTNNNAGGTACAA) in SfV in the cI-cro intergenic region that might act as O_R of lambda phage. These sites are maintained in the e14 sequence with two differences in the first repeat, one in the second repeat and two in the third repeat (Figure 2). Since it has been experimentally shown that LexA repression controls expression of functions in both the e14 early left and early right operons [18,19], and transcription of the majority of genes tested in these operons has been found to increase following UV irradiation [35], the promoters and operators on both sides of b1145 appear to have remained intact. We note that there are also two putative operator sequences (above) between b1144 and b1145 centered on e14 bp 5976 and 5996 (and similar sequences in SfV), and that plausible, correctly oriented promoters (see below) overlap both the left and right putative early operator regions.
Figure 2 The regulatory region of the e14 element.
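As an illustration of how operator-like repeats related to the consensus palindrome TTGTACCTNNNAGGTACAA could be located in a sequence, the following minimal Python sketch scans every window of a DNA string and reports those within a chosen number of mismatches of the consensus. The function name, the mismatch threshold and the treatment of N as a wildcard are assumptions made for this example; they are not the procedure used by Allison et al. [31] or in this work.

```python
# Consensus palindrome quoted in the text; N is treated as a wildcard (assumption).
CONSENSUS = "TTGTACCTNNNAGGTACAA"


def find_operator_like_sites(seq, consensus=CONSENSUS, max_mismatches=2):
    """Return (start, window, mismatches) for each window of `seq` that
    differs from `consensus` at no more than `max_mismatches` non-N positions.

    `start` is a 0-based offset into `seq`. Hypothetical helper for illustration.
    """
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Count disagreements only at positions where the consensus is specific.
        mismatches = sum(
            1 for c, s in zip(consensus, window) if c != "N" and c != s
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits
```

Applied to a real sequence, the threshold would need tuning: a value of 2 matches the largest number of differences per repeat mentioned above, but it is only a starting point.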

DNA replication functions
Where it has been studied, canonical lambdoid phage DNA replication starts from a single origin (ori) in a bidirectional (θ) mode and later by a rolling circle (σ) mode. The origin of replication typically lies within the early right operon and is characterized by the presence of several repeats and palindromic sequences. Cumulative GC skew plots can provide a way of identifying origins of replication. The lowest point in such a plot usually corresponds to the origin of replication and the highest point to the replication termination point [36,37]. A cumulative GC skew plot was made for the e14 element. As is evident from this plot (Figure 3), two significant minima are present. The first of these minima occurs about 2000 bp from the left end of the element and the second is about 6000 bp from the left end. Neither of these regions shows associated repeat regions. By analogy with the various canonical lambdoid phages, the replication origin should be in or near the replication genes that are located in the middle of the early right operon. The N-terminal portion of b1149 appears to be a replication-related protein, suggesting the original replication origin was near b1149. This region is close to the second minimum seen in Figure 3. e14 is a decaying prophage element, and it is likely that the origin of replication has been deleted in the course of evolution; Figure 1 shows that it was likely removed by the middle proposed major deletion that has affected e14 (see below).
Figure 3 Cumulative GC-plot of e14.
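The cumulative GC skew calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming non-overlapping windows (the Methods give a window size of 500 but do not state whether windows overlap) and assigning a skew of zero to windows without G or C; the function names are hypothetical and this is not the Perl code used in this work.

```python
def cumulative_gc_skew(seq, window=500):
    """Return a list of (window_start, cumulative_skew) points.

    For each non-overlapping window (assumption), the skew (G - C) / (G + C)
    is computed and added to a running total.
    """
    seq = seq.upper()
    points = []
    total = 0.0
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows with no G or C contribute nothing (assumption).
        skew = (g - c) / (g + c) if (g + c) else 0.0
        total += skew
        points.append((start, total))
    return points


def position_of_minimum(points):
    """Return the start of the window at which the cumulative skew is lowest."""
    return min(points, key=lambda p: p[1])[0]
```

On a short element such as e14, the position of a minimum is only resolved to within one window, so a candidate origin found this way would still need supporting evidence such as nearby repeats or replication genes.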

Modular genome organization
The presently annotated e14 genome contains 20/21 predictable ORFs, which are available from the public databases including the Ecogene database [20], Genobase [21] and the Swissprot database [22]. Most of these encode putative proteins with no current functional annotation. Based on available data, sequence similarity, domain and motif searches an attempt was made to provide functional annotation for all the ORFs (Table 1). In the following paragraphs we will comment on the e14 genes from left to right across the element.
Most lambdoid bacteriophages do not have any complete genes between the att site and the int gene. However, e14 genes b1137, b1138 and lit lie between the att site and integrase gene. This has resulted in speculations regarding the origin of these genes. The three genes also show a significantly lower G+C content (Table 1) than the remainder of e14. All three genes show LexA-dependent transcriptional induction on UV irradiation [35], but this could be an indirect result of e14 induction. Interestingly, the intergenic region between b1138 and lit harbors multiple exact 8 bp repeats that are highly AT-rich. b1137 was previously annotated as a putative methyltransferase and as being involved in tellurite resistance, but these matches are very weak; it also shows four possible transmembrane segments and low similarity to certain eukaryotic proteins. The next ORF encodes the lit function. Expression of this protein inhibits protein expression late in phage T4 development. The protein interacts with a short sequence, the gol region within gene 23 that is the major head protein gene of phage T4 [38]. Lit is a protease known to cleave EF-Tu, resulting in global inhibition of translation and death of E. coli cells infected with T4 phage [39]. These three e14 genes are unique in that none have convincing homologs in the current database that are phage- or bacteria-encoded. Therefore the origins of this region are difficult to establish. It could have been picked up "recently" by the original functional phage ancestor of e14 through a specialized transduction (imprecise excision) mechanism before its integration here, or it could have been inserted here by some other process after e14's integration; its location next to att makes the former path more attractive.
By sequence homology with phage 21 and other phages, the integrase and excisionase function are encoded by intE and vxis, respectively, which form overlapping ORFs that are almost certainly functional as e14 is capable of SOS induced excision from the chromosome. Both IntE (b1140) and Vxis (b1141) show LexA-dependent transcriptional induction on UV-irradiation [35].
The small hypothetical b1142 protein is about 11 kDa in size and is similar to the N-terminal region of gene c3200 of E. coli O6:H1 CFT073 (87% identity over 54 residues) [40]. The latter protein is much larger than its e14 homolog, and is encoded in a similar position in a lambdoid prophage in that genome. The C-terminal region of this CFT073 protein shows close sequence similarity to hypothetical proteins in SfV, ST64B, CPS-53, and Xylella fastidiosa prophage XfP4 [41], and each of these homologs lies in the early left operon in these lambdoid elements. It is possible that b1142 is a remnant of a larger gene and the deletion event that truncated it could be the left major e14 deletion in Figure 1. Gene b1143 encodes a protein with weak similarity to the putative protein encoded by gene STY2069 of Salmonella enterica CT18 [42], which lies in the early left operon of a prophage there. b1144 encodes a 94 amino acid protein which matches hypothetical proteins encoded in the early left operons of prophages from S. flexneri and S. enterica. b1144 also shows high transcriptional induction upon UV-irradiation [35].
The next two ORFs, b1145 and b1146, correspond to the CI repressor and Cro proteins as discussed in the previous section.
The b1147 and b1148 genes have no known function, but both show convincing similarity to hypothetical proteins of lambdoid phage origin. For example, phages SfV and ST64B carry homologs of b1147 and b1148 in similar locations as in e14. The roles of these homologs have not been studied, however a lethal (kil) function that kills the host bacterium was mapped by Plasterk and van de Putte [18] to what we now can deduce is the b1146-b1149 interval. Since b1146 and b1149 are homologs of nonlethal genes, it seems most likely that b1147 and/or b1148 encode this lethal function. We also note that a lethal sfiC function was mapped to the e14 element by Maguin et al. [19]. Their data are consistent with sfiC being a CI repressor-controlled gene, but its location was not accurately mapped. It is not known whether kil and sfiC are the same or different functions. Experimental evidence suggests that the sfiC gene product interacts with the FtsZ cell division protein and is responsible for an irreversible blockage of cell division [19], unlike the reversible inhibition brought about by SulA [28]. The protein product is highly stable even in lon + strains and does not show significant similarity to any non-phage protein. It is interesting that other lambdoid phages are known to encode FtsZ inhibitors in their early left operons [43][44][45][46][47].
The b1149 protein appears to be a unique fusion between a replication protein and phage terminase. The first 78 residues are quite similar to the N-termini of putative replication proteins from E. coli O157:H7 prophage CP-933P [10] (sprot id: Q8XAD8) and phages ΦP27 [48], ST64T [49] and SfV, while the rest of the b1149 protein is extremely similar to the C-termini of terminase proteins of ST64B (98% identical) and SfV (96%) and other phages. The deletion that caused this gene fusion is the middle major e14 deletion in Figure 1, and it seems unlikely that the b1149 protein product is now functional. b1150 is a very small protein that is highly similar to proteins encoded by genes in the same location by phages ST64B, SfV and ΦP27. b1151 closely resembles portal proteins involved in head assembly from phages ST64B and ΦP27 over the N-terminal 135 amino acid residues. In bacteriophage ST64B the portal protein is 414 residues and the ΦP27 protein is 413 residues. b1151 is almost certainly a C-terminally truncated pseudogene derived from a homolog of these larger proteins. This truncation and the N-terminal truncation relative to its homologs of the next gene, b1152, represent the boundaries of the right major e14 deletion in Figure 1. b1152 and b1153 are tail protein homologs of gene 47 and 48 proteins of phage Mu, which has a contractile tail. SfV phage tail proteins are their closest homologs and occur in similar relative positions. The N-terminal 106 residues of b1154 are similar to a 22 kDa protein from SfV (85% identity over 100 residues) and show similarity to side tail fiber proteins in other phages. The remaining C-terminal 103 residues are weakly related to the predicted protein of gene plu2959 of Photorhabdus luminescens TT01 [50], which lies in the tail region of a prophage in that genome. The left boundary of the Pin-invertible element, which starts 11582 bp from the left attachment site of e14, lies within b1154, 96 codons from the 5'-terminus.
b1155 shows close resemblance in its C-terminal region to genes in prophages CPS-53 of E. coli K-12, CP-933H of E. coli EDL933, and Sti8 of S. enterica CT18. TfaE (b1156) shows 90% identity to the tail fibre assembly protein of bacteriophage HK97. b1155 and b1156 proteins are members of the large tail fibre assembly (Tfa) protein family that includes phage T4 gene 38 and Mu gene 50 proteins.
The pin (b1158) protein is a site-specific DNA invertase like the Min invertase of p15B, Gin of phage Mu, Hin of S. enterica, and Cin of phages P1 and P7, as well as putative invertases on a number of prophages in the sequenced bacterial genomes such as Sp1 of E. coli Sakai [51], Sti3 and Sti7 of S. enterica CT18, and Fels-2 of S. enterica LT2 [52]. These invertases in turn belong to a larger family of site-specific resolvase and recombinase proteins. The Pin protein catalyses the inversion of a 1794 bp long fragment referred to as the P-element [18]. This invertible element lies between bases 11582 and 13405 from the left att site and encompasses the four ORFs b1154, b1155, b1156 and stfE (b1157). When the early right/late operon fusion, in which these genes lie, is transcribed, genes b1155 and b1156 are not expected to be expressed in the shown (Figure 1) orientation of the P-element (and b1157 is not expected to have any bona fide translation start), but after inversion, the b1157 open reading frame would be fused to the N-terminal 96 codons of b1154 and b1156 would be placed in the correct orientation for expression. StfE (b1157) and b1154 appear to encode the C-termini of alternate side tail fibre proteins, and b1155 and b1156 appear to encode alternate tail fibre assembly proteins.
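At the sequence level, the effect of an inversion such as the one catalysed by Pin can be modelled as replacing a segment by its reverse complement, which is why ORFs inside the segment change strand and can become fused to a different upstream reading frame. The short Python sketch below illustrates this; the function names are hypothetical, and the coordinates are 0-based and end-exclusive (a Python slicing convention assumed here, not the 1-based coordinates quoted in the text).

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_segment(seq, start, end):
    """Return `seq` with seq[start:end] replaced by its reverse complement.

    `start` and `end` are 0-based and end-exclusive (assumption for this sketch).
    """
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]
```

Applying invert_segment twice with the same coordinates restores the original sequence, mirroring the reversible switching between the two orientations of the element.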
The last gene of the e14 element, mcrA, encodes a methylation-dependent restriction endonuclease belonging to the HNH family of proteins found in several bacterial and bacteriophage systems [25,53]. In vivo studies on McrA suggest that it restricts T-even phage DNA that is hydroxymethylated and non-glucosylated (RglA activity) and also cleaves HpaII and SssI methylated DNA [25]. No close homologs of mcrA are known on other phage or prophage genomes, but many temperate phages carry genes that protect the host bacterium from attack by other phages.
Operons in the e14 element were predicted based on the co-occurrence of genes in the same order in different genomes [54] (Figure 1), but putative promoters and terminators that seem reasonable for expression of the various genes in light of their analogy to lambdoid phage genes could be identified (Tables 2 and 3) based on the known coding regions and operons. These need to be experimentally verified.
The mosaic nature of the e14 genome is evident from the various similarity searches conducted using this sequence as query (Figure 1). A large section of the e14 genome is very similar to the SfV phage genome. However, unlike the SfV phage, which is a functional temperate bacteriophage and encodes 53 proteins, the e14 element encodes only about 23 proteins. As has been discussed above, this suggests that large deletions must have occurred during the course of evolution of the e14 element. These deletions have removed important genes like the major head coat protein and major tail shaft protein genes, lysis genes and replication genes, and we have made suggestions as to where each of the three major deletions might have occurred. This is not the first case of such deletions in defective prophage elements; for example, the K-12 Rac prophage also appears to have suffered at least one large deletion [7]. An interesting observation is the presence of a number of paralogs of StfE protein in different orientations in p15B and Salmonella enterica. p15B is a plasmid in E. coli 15T which shows 81% homology to bacteriophage [55]. The site-specific DNA inversion system of the plasmid however is very similar to the pin recombinase gene and to part of the invertible DNA of the Pin system as shown in Figure 1. The p15B recombinase system is known to be more complex than other recombinase systems including the Cin recombinase system in P1 and is capable of alternately assembling one out of six different ORFs [55,56]. This suggests that such recombinase systems which interchange alternate virion host specificity proteins (tail fibres) could potentially also be of greater complexity; however, there is no experimental evidence for this idea.

Search for prophage elements in other genomes
With the recent surge in the availability of genomic DNA sequences it has been found that many microbial genomes harbor prophage elements. These elements can encode key functions including virulence factors, toxins and phage immunity proteins. Thus, detection of such elements in bacterial genome sequences becomes very important. Given the low sequence similarity between parts of the phage elements and their modular nature (nonhomologous alternatives are known for many of the modules [5]), the search for such elements is no simple task. Even if the search is restricted to the tailed temperate phages (there are other kinds of temperate DNA phages [57,58]) or even to the lambdoid phages, none of the phage genes are sufficiently conserved to serve as a single marker for all prophages, and to make it more difficult, in any given case any particular gene could have been deleted from a defective prophage [7,8]. We attempted to search for e14-like modules in the sequenced prokaryotic genomes keeping the mosaic nature of these elements in mind. This approach is different from other approaches in that it does not rely on a single gene like integrase or terminase for phage detection but has the potential to use the entire known pool of temperate tailed phage-encoded genes for detection. The initial search involved looking at the COG [59] database in the eight bacterial genomes listed in Table 5 (these eight include representatives from the rather distantly related Proteobacteria and Firmicutes). COG entries on these genomes for each of the e14 proteins were analyzed. In cases where the COG entries were not available, the COGNITOR program was used to obtain hits. The COG hits were sorted by organism and on the locus of occurrence in the organism. Genes encoding the COG hits for the different e14 proteins, which were within 30,000 base pairs of each other, were then grouped together.
Any region with more than two genes in such a cluster was considered to be a putative prophage element and further analyzed. Twenty-six phage-related regions were identified by this analysis, of which 23 are already known phage-like areas in the bacterial genomes. Two (labeled P2 and P3 in Table 5) of the remaining three regions are probably non-phage areas. The region identified in Bacillus subtilis and marked as P1 could be a decaying prophage region as it has at least 4 genes in this area which perform phage related functions apart from the two used in the detection. These include yneA (which contains a lysin motif), ynzC (a site specific recombinase which shows similarity to the integrases of phage phi-FC1, Lactococcus lactis phage TP901-1 and Listeria innocua phage A118), yndG (a virulence factor and KicB toxin homolog) and yndB (related to a prophage XfP2 protein in Xylella fastidiosa XF2524).
Further, this region is also identified as a possible phage element "5" and shows compositional variation compared to the rest of the Bacillus subtilis genome [60]. We could thus identify several lambdoid prophage elements in a representative set of bacterial genomes using such a protein level approach. This approach takes into consideration the modular nature of phage genomes and looks for orthologs of the genes of the defective prophage e14 that exist in proximity of each other. We hasten to mention that a number of putative prophages were not found by this analysis. But this exercise was knowingly severely limited by only taking orthologs of e14 genes into consideration, and a similar approach using the entire pool of known lambdoid phage (or even all temperate phage) genes should make a much more sensitive and robust technique for detecting phage elements, and, importantly, it can be automated.
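The grouping step described in this section (COG hits sorted by organism and locus, grouped when they lie within 30,000 bp of each other, and retained when a group contains more than two genes) can be sketched as follows. This is a minimal Python illustration under stated assumptions: each hit is reduced to a single coordinate, "within 30,000 base pairs of each other" is interpreted as the gap between neighbouring hits on the sorted list, and the function name and input format are hypothetical rather than taken from the pipeline used in this work.

```python
from collections import defaultdict


def cluster_hits(hits, max_gap=30_000, min_genes=3):
    """Group hits into putative prophage regions.

    `hits` is an iterable of (genome, gene, position) tuples (assumed format).
    Neighbouring hits on the same genome are chained while the gap between
    them is at most `max_gap`; chains with at least `min_genes` genes
    (that is, more than two) are returned as (genome, [genes]) tuples.
    """
    by_genome = defaultdict(list)
    for genome, gene, pos in hits:
        by_genome[genome].append((pos, gene))

    regions = []
    for genome, items in by_genome.items():
        items.sort()  # order hits by position along the genome
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= max_gap:
                current.append(item)
            else:
                if len(current) >= min_genes:
                    regions.append((genome, [g for _, g in current]))
                current = [item]
        if len(current) >= min_genes:
            regions.append((genome, [g for _, g in current]))
    return regions
```

Chaining neighbouring hits means a region can span more than 30,000 bp in total; a stricter reading that caps the overall span would need a different grouping rule.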

Methods
Sequence manipulation and analysis were done using the EMBOSS [61] and GCG [62] suites of sequence analysis tools. Perl scripts were used for calculating cumulative GC plots and facilitating other searches. The cumulative GC was calculated as ∑(G-C)/(G+C) using a window size of 500. Domain searches were done using the NCBI CDD server [63]; the Interpro [64], PFAM [65] and SMART [66] databases were used wherever necessary. Promoter sequences in the e14 genome were detected using the BPROM [67] utility. The predicted promoters were then analyzed for the presence of ORFs in close vicinity. The promoters for which an ORF could not be assigned are not listed in this work. Rho-independent terminators were detected using the terminator program available with the GCG package. The program is an adaptation of the terminator program by Brendel and Trifonov 1984 [68]. Promoters and terminators that could not be explained functionally were ignored even though the prediction servers identified several with high scores. Information on operons within the e14 genome was obtained with TIGROperons [54]. The COG database [59] was used to find orthologs of proteins encoded by the e14 element. For the proteins that are not known to belong to any of the COGs listed, the COGNITOR application was used to identify orthologs.

Authors' contributions
PM was responsible for data collection and analysis. SK conceived of the study, and participated in its design and analysis. SC helped to bring this information into the biological context of past and current prophage research.