Non-canonical transcriptional start sites in E. coli O157:H7 EDL933 are regulated and appear in surprisingly high numbers

Analysis of genome-wide transcription start sites (TSSs) has revealed unexpected complexity, since RNA polymerase recognizes not only the canonical TSSs of annotated genes. Non-canonical TSSs were detected antisense to, or within, annotated genes, as well as new intergenic (orphan) TSSs not associated with known genes. Previously, it was hypothesized that many such signals represent noise or pervasive transcription without biological function. Here, a modified Cappable-seq protocol was used to determine the primary transcriptome of enterohemorrhagic E. coli O157:H7 EDL933 (EHEC). We used four different growth media, each in exponential and stationary growth phase, with three replicates per condition. This yielded 19,975 EHEC canonical and non-canonical TSSs that occurred reproducibly in three biological replicates, which calls into question the hypothesis of experimental noise or pervasive transcription. Accordingly, conserved promoter motifs were found upstream, indicating genuine TSSs. More than 50% of 5,567 canonical and between 32% and 47% of 10,355 non-canonical TSSs were differentially expressed across media and growth phases, providing evidence for a potential biological function of non-canonical TSSs as well. Thus, reproducible and environmentally regulated expression suggests that a substantial number of the non-canonical TSSs may have an as-yet-unknown function rather than being the result of noise or pervasive transcription.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12866-023-02988-6.

Additionally, the TSS comparison between EHEC and E. coli MG1655 is included (sheet 3) with the name of the gene (gene number for EHEC genes) and the distance between start codon and TSS for MG1655 (A, B) and EHEC (C, D) and the deviation between MG1655 and EHEC distances (E).
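The distance and deviation columns described above can be illustrated with a minimal sketch. It assumes 1-based genome coordinates, a strand-aware distance that is positive when the TSS lies upstream of the start codon, and a deviation defined as the EHEC distance minus the MG1655 distance. The class name, function names, sign conventions, and example coordinates are hypothetical and are not taken from the supplementary table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneTss:
    """One gene/TSS pair in one strain (hypothetical record layout)."""
    start_codon: int  # coordinate of the first base of the start codon
    tss: int          # coordinate of the transcription start site
    strand: str       # "+" or "-"


def tss_distance(rec: GeneTss) -> int:
    """Distance (nt) between TSS and start codon; positive if the TSS is upstream."""
    if rec.strand == "+":
        return rec.start_codon - rec.tss
    if rec.strand == "-":
        return rec.tss - rec.start_codon
    raise ValueError(f"unknown strand: {rec.strand!r}")


def distance_deviation(mg1655: GeneTss, ehec: GeneTss) -> int:
    """Deviation between strains; assumed here to be EHEC minus MG1655."""
    return tss_distance(ehec) - tss_distance(mg1655)


# Hypothetical coordinates for one orthologous gene pair.
mg = GeneTss(start_codon=1000, tss=960, strand="+")
eh = GeneTss(start_codon=5000, tss=5035, strand="-")
print(tss_distance(mg), tss_distance(eh), distance_deviation(mg, eh))
```

A deviation of zero under these assumptions would mean that the TSS sits at the same distance from the start codon in both strains.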

Supplementary Table S6
List of annotated genes with antisense TSSs (asTSSs). Annotated gene number, strand, start and stop coordinates (A-D), asTSS position and strand (E, F), asTSS presence in the analyzed culture conditions (G-N), and categorization of each asTSS as u-asTSS, d-asTSS or asTSS (O).

Supplementary Table S9
Promoter prediction for canonical TSS and non-canonical TSS. The output of the bTSSfinder promoter prediction is given for promoters of canonical TSS (first sheet), promoters of non-canonical TSS (second sheet), and promoters of random genome positions (third sheet).

Supplementary Table S10
Differentially expressed TSSs, listed for the comparisons between LB and minimal medium, LB + acid, or LB + NaCl in exponential growth phase (sheet 1) or in stationary growth phase (sheet 2), or between growth phases (sheet 3).

Supplementary File S1
Tool version numbers, settings and input file instructions for programs used in Cappable-seq sequencing evaluation and data evaluation.

1, Primary transcripts with 5' triphosphates (yellow) are labeled with a desthiobiotin cap (DTG-TEG-GTP) at the 5' end using vaccinia capping enzyme, whereas fragments with 5' monophosphates (blue) remain unchanged. 2, After fragmentation, 5' monophosphorylated fragmentation products are left unmodified, while desthiobiotin-capped 5' fragments are captured with streptavidin beads. 3, Contaminating monophosphorylated fragments are ligated to a 5' Illumina TruSeq sequencing adapter with unique sequence tag 1. 4, The desthiobiotin cap is removed using Cap-Clip Acid Pyrophosphatase, and 5, fragments are ligated to a 5' Illumina TruSeq sequencing adapter with unique sequence tag 2. 6, Fragments are then used to create a Next Generation Sequencing library. Libraries were sequenced single-end (75 bp) on an Illumina NextSeq 500.

Supplementary Figure S2
Distribution of internal TSSs (iTSSs) over the length of the annotated genes. The number of iTSSs is shown as a density distribution according to their position within the length-normalized annotated gene (AG). (A) iTSSs associated with annotated genes (i.e., 1,233 TSSs which may belong to the next gene downstream or to annotated genes that have a wrongly annotated start codon) and (B) genuine iTSSs (i.e., 3,404 TSSs with an increased S/N ratio but not classified as belonging to the AG).
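The length normalization behind such a distribution can be sketched as follows. This is a minimal illustration and not the procedure used to draw the figure: it assumes inclusive gene coordinates, measures position from the 5' end of the gene in a strand-aware way, and replaces the density estimate with a simple binned count. The function names and the number of bins are hypothetical.

```python
def relative_position(tss: int, gene_start: int, gene_end: int, strand: str) -> float:
    """Position of an internal TSS within its gene, scaled to [0, 1] from the 5' end."""
    lo, hi = sorted((gene_start, gene_end))
    if not lo <= tss <= hi:
        raise ValueError("TSS lies outside the gene")
    length = hi - lo
    if length == 0:
        raise ValueError("gene has zero length")
    if strand == "+":
        offset = tss - lo
    elif strand == "-":
        offset = hi - tss
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return offset / length


def binned_counts(positions: list[float], bins: int = 10) -> list[int]:
    """Count relative positions per equal-width bin (stand-in for a density plot)."""
    counts = [0] * bins
    for p in positions:
        index = min(int(p * bins), bins - 1)  # p == 1.0 falls into the last bin
        counts[index] += 1
    return counts


# Hypothetical iTSS on the reverse strand, one quarter of the way from the 3' end.
print(relative_position(125, 100, 200, "-"))
print(binned_counts([0.0, 0.05, 0.5, 1.0]))
```

Normalizing by gene length makes genes of different sizes comparable, so an excess of iTSSs near either end of the gene becomes visible regardless of gene length.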

Supplementary Figure S3
In-vitro activity assay of different promoters (the position of each TSS is indicated below the graph). Promoters for five asTSSs and one oTSS (position 4867698) were identified. Promoter sequences (i.e., fragments upstream of each TSS) were introduced into the promoterless GFP reporter vector pProbe-NT and the construct was transformed into E. coli Top10. Cells were grown in LB medium supplemented with 450 mM NaCl (for promoter constructs at TSS 2285574 and 2285499) or in plain LB medium (all other constructs) until OD600 = 0.6 was reached, and fluorescence was measured. Promoter activity is shown as mean fluorescence of three biological replicates for cells grown with the vector-promoter construct (blue bars) or with the empty vector (orange bars). The latter represents the background signal of the vector in the respective experiment. Significantly enhanced activity was found for all analyzed promoters (p < 0.05, Welch two-sample t-test, significance level α = 5 %).

Supplementary Figure S4
Promoter sequence logos for sequences upstream of canonical TSSs (A), non-canonical TSSs (B), and random genome positions (C). Sequences are sorted into groups with (left panels) and without (right panels) a promoter predicted by the program bTSSfinder. The number of sequences used to create each sequence logo is indicated.

Supplementary Figure S6
Volcano plots for transcription start sites of different categories, i.e., gTSS of fAG and hAG, iTSS, asTSS, and oTSS. Differential expression is calculated for a given stress/growth condition compared to non-stress, i.e., altered growth medium in exponential ('exp') or stationary ('stat') growth phase, or altered growth phase ('phase'). The -log10 FDR (false discovery rate) is plotted against the log2 FC (fold change) for the indicated stress situation. Upregulated, downregulated and unchanged (not significantly different) TSSs are marked in green, yellow and grey, respectively. Dashed and dotted lines indicate the cutoffs, i.e., |log2 FC| > 2 and FDR < 0.05 (-log10 FDR > -log10(0.05)), respectively.
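The volcano-plot cutoffs can be expressed as a small classification rule. The sketch below assumes that a TSS counts as differentially expressed only when both cutoffs are passed (absolute log2 fold change above 2 and FDR below 0.05), with strict inequalities. The function names are hypothetical, and the log2 FC and FDR values are assumed to come from an upstream differential expression tool that is not shown here.

```python
import math


def classify_tss(log2_fc: float, fdr: float,
                 fc_cutoff: float = 2.0, fdr_cutoff: float = 0.05) -> str:
    """Label a TSS as 'up', 'down' or 'unchanged' using both cutoffs."""
    significant = fdr < fdr_cutoff
    if significant and log2_fc > fc_cutoff:
        return "up"
    if significant and log2_fc < -fc_cutoff:
        return "down"
    return "unchanged"


def volcano_point(log2_fc: float, fdr: float) -> tuple[float, float]:
    """Return (x, y) plot coordinates: log2 FC against -log10 FDR."""
    if fdr <= 0:
        raise ValueError("FDR must be positive to take its logarithm")
    return (log2_fc, -math.log10(fdr))


# Hypothetical values for three TSSs.
for fc, q in [(2.5, 0.01), (-3.0, 0.001), (2.5, 0.2)]:
    print(classify_tss(fc, q), volcano_point(fc, q))
```

On the plot, the fold-change cutoff corresponds to two vertical lines at log2 FC = -2 and 2, and the FDR cutoff to one horizontal line at -log10(0.05), which is approximately 1.3.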