Statistical characterization of the GxxxG glycine repeats in the flagellar biosynthesis protein FliH and its Type III secretion homologue YscL
© Trost and Moore; licensee BioMed Central Ltd. 2009
Received: 06 October 2008
Accepted: 16 April 2009
Published: 16 April 2009
FliH is a protein involved in the export of components of the bacterial flagellum. Here we describe the presence of glycine-rich repeats in FliH of the form AxxxG(xxxG)mxxxA, where the value of m varies considerably in FliH proteins from different bacteria. While GxxxG and AxxxA patterns have previously been described, the long glycine repeat segments in FliH proteins have yet to be characterized. The Type III secretion system homologue of FliH (YscL, AscL, PscL, etc.) also contains a similar GxxxG repeat; the repeat is therefore evolutionarily conserved in these proteins, suggesting an important structural role or biological function.
A set of FliH and YscL protein sequences was downloaded from GenBank, and then filtered to reduce redundancy, to ensure the soundness of the sequences, and to eliminate, as much as possible, confounding phylogenetic signal between individual sequences by implementing a pairwise 25% sequence identity cut-off. The general features of the glycine-rich repeats in these proteins were examined, and it was found that the length of these repeat segments varied substantially among FliH proteins but was fairly consistent for the Type III (YscL) homologue sequences, with values of m ranging from 0 to 12 for FliH and 0 to 2 for YscL. The amino acid sequence distribution of each of the three positions in the GxxxG repeats was found to differ significantly from the overall amino acid composition of the FliH/YscL proteins. The high frequency of Glu, Gln, Lys and Ala residues in the repeat positions, which is not likely indicative of any contaminating phylogenetic signal, suggests an α-helical structure for this motif. In addition, we sought to determine whether certain pairs of amino acids, in certain pairs of positions, were found together significantly more often than would be predicted by chance. Several statistically significant correlations were uncovered, which may be important for maintaining helical stability or for forming helix-helix interactions. These correlations are likely not of a phylogenetic origin as the originating sequences for the pair correlations are derived from a low similarity set and the individual incidences of the pair correlations do not cluster in any obvious phylogenetic sense, nor is there much evidence of strict sequence conservation outside the positions of the glycine residues. 
Finally, the α-helices from a non-redundant set of proteins from the Protein Data Bank were searched for GxxxG repeats similar in length to those found in FliH; however, no helix contained more than three contiguous glycine repeat segments. Long glycine repeats similar to those found in FliH are thus presumably quite rare in nature.
The glycine repeats in YscL and particularly FliH represent an intriguing amino acid sequence motif that is very rare in nature. Although we do not attempt to offer a mechanism whereby these repeats may have evolved, we do place the existence of the motif and some residue pairings within a rational structural context. While crystal structures of these proteins are necessary to fully elucidate the structural and functional significance of these repeats, the characterization reported here represents a first step in understanding this unique sequence feature.
The bacterial flagellum is an apparatus that projects outward from the cell membrane, and employs rotation of a flexible filament attached to a universal joint (the hook) for propulsion. The flagellum is made up of four components: the basal body, which houses the flagellar rotary motor and export apparatus; the rod, which spans the periplasm, peptidoglycan, and outer membrane; the hook, which acts as a universal joint; and the filament, which acts as the propulsion device (reviewed in [1, 2]). In order to construct a functional flagellum, the constituent proteins must first be synthesized in the cytoplasm and then be transported to their site of incorporation in a temporally and spatially regulated manner. A specialized Type III secretion system called the flagellar export apparatus is used to transport the individual components of the flagellum across the two cell membranes of gram-negative bacteria. The bacterial flagellar export apparatus (reviewed in [1, 2]) is composed of a number of proteins, including two integral membrane proteins, FlhA and FlhB, which also contain globular cytoplasmic domains; four additional integral membrane proteins, FliO, FliP, FliQ, and FliR; and two membrane-associated cytoplasmic proteins, FliH and FliI. Other structural components of the flagellar basal body (FliF) and C-ring (FliG, FliM, FliN) are also required for flagellum assembly. In addition, enteric gram-negative bacteria have a number of substrate-specific chaperones associated with the flagellar export apparatus (e.g. FlgN, FliT, FliS, FliJ). These proteins act in concert with the flagellar export ATPase FliI in translocating partially unfolded substrates, such as the filament component flagellin, in an export-competent state through the basal body pore. Ultrastructural and biochemical investigations of the flagellar basal body and the Type III secretion system indicate that these systems have evolved from a common ancestor [3, 4].
In support of these observations, most of the flagellar export components have conserved orthologues (ranging from 20–40% pairwise identity) in the Type III secretion system of gram-negative pathogenic bacteria [5, 6], including FliI (InvC, HrcN etc.), FliH (YscL), FliN (HrcQB), and FlhA (SctV) [7–11].
The present study investigates a conserved GxxxG (where "x" represents any amino acid) sequence motif unique to the flagellar FliH/YscL family of proteins. Naming conventions for YscL-like proteins are rather inconsistent, as this protein often has different names in different organisms; for ease of reference, all YscL-like proteins will be referred to in this paper simply as "YscL". An alignment of the complete sequences of a representative group of FliH and YscL sequences along with a schematic domain organization is provided in Figures 1, 2 and 3. The extreme N-terminal region of FliH is very poorly conserved overall, although some sequence conservation is evident within individual bacterial groups (e.g. enterobacteria, epsilon proteobacteria); no such conservation is evident in the YscL protein family. A GxxxG segment of variable length follows, then a poorly conserved segment likely to be helical in structure, followed by a well-conserved C-terminal domain known to be responsible for the interaction with the N-terminus of the flagellar/Type III ATPase (Figures 1, 2 and 3).
When we noticed the presence of conserved consecutive GxxxG repeats in FliH/YscL, we asked if this motif had been previously observed in other types of proteins. Lemmon et al. first discovered that specific interactions are required for the transmembrane helix-helix dimerization of glycophorin A. It was later shown that dimerization was mediated by a GxxxG-containing motif. The GxxxG motif has been identified as the dominant motif in the transmembrane regions of hundreds of proteins [24, 25], and appears to play a critical role in the stabilization of helix-helix interactions. Such motifs were subsequently observed in many soluble proteins. The amino acid composition of the variable positions in the glycine repeats of soluble proteins is certain to be very different from that of transmembrane proteins; transmembrane proteins would contain mostly hydrophobic residues in the variable positions of the repeats, while the variable positions in soluble proteins would contain mostly hydrophilic residues. As such, the only commonality between glycine repeats in transmembrane proteins and glycine repeats in soluble proteins is likely to be the glycines found at every fourth residue. As glycine lacks a side chain, it is suitable for allowing the close packing of helices, and could hence facilitate helix-helix dimerization.
Most annotated FliH sequences contain a segment of repeats of the form AxxxG(xxxG)mxxxA, where m typically ranges from 2 to 10 depending on the bacterial species. There is some variation in this pattern: not all sequences contain the N-terminal-side Axxx or the C-terminal-side xxxA, and FliH proteins from some species have no GxxxG repeats at all. Nevertheless, a significant proportion (44% in our set of sequences) of FliH proteins extracted from the non-redundant sequence database (see Methods) do exhibit the AxxxG(xxxG)mxxxA pattern. In addition to this long AxxxG(xxxG)mxxxA repeat segment, most FliH proteins also contain one or more shorter repeat segments elsewhere in the primary sequence (Figures 1, 2 and 3), which usually contain just a single AxxxG, GxxxG, or GxxxA. These shorter repeat segments are very poorly conserved, do not show an obvious preference for particular amino acids at any of the three middle non-glycine positions, and often contain proline. Hence, these non-conserved GxxxG segments are unlikely to be either helical or biologically significant. To differentiate the two patterns, we will refer to the longest repeat segment in a particular FliH protein as its "primary repeat segment". YscL proteins exhibit similar patterns, except that they generally have shorter primary repeat segments.
We report here a statistical characterization of the amino acids composing the variable positions in the primary repeat segments of a varied collection of FliH and YscL sequences from different bacterial species. As they are analyzed separately, the specific portion of the repeat segments being discussed – AxxxG, GxxxG, or GxxxA – will be referred to as the "repeat type". Additionally, we make the distinction between the first, second, and third variable residue in a given repeat, which will be denoted as positions x1, x2, and x3, respectively. Below, we describe the analysis performed on FliH, which is of primary interest due to its uniquely long primary repeat segments. Some of the analysis described below was also performed for YscL; full details are provided in the Results and Methods sections.
Finally, we sought to determine how prevalent long glycine repeats are in other types of proteins not related to FliH, and to identify a protein of known three-dimensional structure that contains a FliH-like repeat segment that is involved in helix-helix dimerization. To address both goals, a large number of protein structures were downloaded from the Protein Data Bank (PDB; http://www.rcsb.org/pdb). These structures were searched for the presence of helices with glycine repeats, and one protein with a FliH-like glycine repeat segment was chosen as a molecular model for the types of interactions that might occur in FliH proteins.
The work presented here represents a comprehensive characterization of a relatively unusual primary sequence pattern. While this study focuses mainly on FliH/YscL and their glycine repeat segments, the results should also add to our understanding of the general characteristics of glycine repeat-containing α-helices in water-soluble proteins.
Sets of proteins acquired
FliH proteins and YscL proteins were downloaded and filtered as described in the Methods section to obtain a set of FliH sequences and a set of YscL sequences where no sequence was more than 25% identical to any other sequence. After filtering, 50 FliH sequences and 16 YscL sequences remained.
Initial characterization of glycine repeat segments
Initially, some general data regarding the composition of the 50 chosen FliH sequences were gathered. The average number of GxxxGs found in a primary repeat segment was 2.84, with a standard deviation of 2.53; the fewest number found in this set was 0, while the greatest number was 10. (In describing the length of a sequence's primary repeat segment, we include only GxxxGs; AxxxGs and GxxxAs are not included in the total). Although the longest repeat found in this dataset was 10, there exist FliH sequences with even longer repeats. For instance, the FliH from E. coli strain 53638 (GenBank accession number EDU66533) contains a repeat of length 12; however, this sequence was excluded when imposing the 25% identity sequence cut-off. A histogram showing the number of FliH sequences having primary repeat segments of different lengths is given in Figure 4. The majority of sequences have repeats with a length of 3 or less, while a few sequences have much longer repeats. Interestingly, the distribution of the lengths of the primary repeat segments in a set of 167 FliH sequences for which no sequence is more than 90% identical to any other sequence is very similar to that shown in Figure 4, indicating that bias arising from high sequence similarity in the available FliH sequences used has little effect on the results. This histogram is available as Additional file 3. In contrast to FliH, the primary repeat segments of YscL were much more uniform in length. Five sequences had no repeat segment at all, while 7 sequences had a repeat of length 1 and 4 sequences had a repeat of length 2. This stark difference in the distribution of the repeat lengths between FliH and YscL invites speculation concerning the importance of the repeat in these two proteins. 
That FliH apparently experiences selection pressure for longer repeats, whereas YscL does not, suggests that longer repeats are advantageous to the function of FliH but not to that of YscL; however, the nature of this difference is unclear.
Of the FliH sequences that had at least one GxxxG (a total of 44 sequences), the repeat segments of 22 sequences were flanked by both an Axxx on the N-terminal side and an xxxA on the C-terminal side. A lower number (13 sequences) contained only an initial Axxx, while few sequences had only an xxxA at the end (4 sequences) or neither an N-terminal-side Axxx nor a C-terminal-side xxxA (5 sequences). It thus appears that the initial Axxx is more strongly conserved than the terminating xxxA. Just two of the YscL sequences contained repeats with both the initial AxxxG and the terminal GxxxA, and an equal number (4 each) contained only the initial AxxxG or only the terminal GxxxA.
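The counting convention described above (only GxxxG units contribute to the length of a primary repeat segment, with the flanking Axxx and xxxA recorded separately) can be illustrated with a short sketch. This is not the authors' program: the function name, the returned fields, the example sequence, and the choice to resolve ties by taking the first longest run are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

# Maximal run starting at each G: one G followed by one or more "xxxG" units.
# The lookahead allows overlapping candidate runs to be examined.
_RUN = re.compile(r"(?=(G(?:[A-Z]{3}G)+))")


class RepeatSegment(NamedTuple):
    start: int        # 0-based index of the first Gly of the run
    n_gxxxg: int      # number of GxxxG units (AxxxG / GxxxA not counted)
    has_n_axxx: bool  # Ala found four residues before the first Gly
    has_c_xxxa: bool  # Ala found four residues after the last Gly


def primary_repeat_segment(seq: str) -> Optional[RepeatSegment]:
    """Return the longest run of contiguous GxxxG units, or None if absent."""
    best = None
    for m in _RUN.finditer(seq):
        n = (len(m.group(1)) - 1) // 4  # each unit adds four residues
        if best is None or n > best[1]:  # assumption: first longest run wins
            best = (m.start(), n)
    if best is None:
        return None
    start, n = best
    end = start + 4 * n  # index of the last Gly in the run
    has_n = start >= 4 and seq[start - 4] == "A"
    has_c = end + 4 < len(seq) and seq[end + 4] == "A"
    return RepeatSegment(start, n, has_n, has_c)


# Hypothetical sequence containing AxxxG(xxxG)xxxA, i.e. two GxxxG units.
example = "MKAYEEGQLLGYRAGFEQAPP"
result = primary_repeat_segment(example)
```

A sequence with no GxxxG at all returns None, which corresponds to a primary repeat length of 0 in the text.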
Secondary structure prediction
Several secondary structure prediction programs were used to predict the secondary structure of the primary repeat segments of selected FliH and YscL proteins, and the prediction programs consistently and convincingly classified these regions as α-helical for all of the proteins tested. The tools used are given in [27–31]. Thus, there is a strong basis for interpreting the sequence characteristics of the glycine repeat segments as being important either for helical stability, or for making helix-helix interactions.
Multiple alignment of the glycine repeats
Calculating the amino acid distribution in the primary repeat segments
Although the amino acid compositions in each position-repeat-type combination show distinct biases, there are also overriding similarities. The analysis below is specific to FliH, but similar biases are seen with YscL. For instance, in the x1 position of AxxxG repeats, Arg is found at a much higher frequency (20%) than it is in x1 of GxxxG (10%) (Figures 5, 7 and 8). Tyr or Phe account for more than 30% of the residues found in position x1 of AxxxG but are never found in positions x2 or x3 of AxxxG or very rarely for x2 or x3 of GxxxG. More apparent still is the bias in position x3 toward Glu, which accounts for more than a third of the residues found in that position.
In GxxxG repeats, Tyr and Phe account for over 45% of the x1 positions, Leu with 15% compared to zero in AxxxG, and then Arg and Lys together making up approximately 15%. Glu, Gln, and Ala together account for about 2/3 of the residues in position x3. Of note is that Gln makes up over 15% of the residues in the x3 position of GxxxGs, while the similar amino acid Asn, differing from Gln only by virtue of having one fewer methylene group in its side chain, is rarely found in that position.
It is also interesting to examine how the amino acid distribution differs in each of the three repeat types. In general, the amino acid distribution in each repeat position is fairly similar, with a general preference for Ala, Glu, Gln, Arg, Lys, and Tyr. However, there are some obvious differences: AxxxGs and GxxxGs have a very high frequency of Tyr or Phe in position x1, whereas these are comparatively rare in GxxxAs. Ala is quite common in position x3 of GxxxGs, but is less common in GxxxAs and rare in AxxxGs. Arg is quite common in positions x1 and x2 in AxxxGs and GxxxAs, but is less common in GxxxGs.
More generally, Figures 7 and 8 suggest that, particularly for GxxxGs, positions x2 and x3 are basically equivalent in their amino acid preferences, while the amino acid frequencies in position x1 are significantly different than that of x2 and x3. This observation suggests that position x1 has a fundamentally different structural role than either positions x2 or x3; one possibility is that the amino acid in position x1 facilitates helix-helix interactions, while the amino acids in x2 and x3 are involved in maintaining helical stability.
In addition, the frequencies obtained using these FliH and YscL datasets are very similar to those obtained when using sets of sequences where the maximum pairwise identity is 90%, rather than 25%. The frequency distribution for the 25% identity sets depicted in Figures 7 and 8 is also provided for the 90% identity sequence sets in Additional file 4. This observation is consistent with the hypothesis that positions x1-x3 in the GxxxG repeats have undergone extensive mutation during the course of evolution, but have reached an equilibrium amino acid composition that is consistent with the structural and functional constraints placed on these motifs. That multiple combinations of a few amino acid types are observed, and not a distinct conserved sequence pattern at x1-x3, suggests that there are multiple permutations of amino acid residues that equally fulfil the structural/functional requirements of these repeats in FliH protein and its role in the flagellar export apparatus.
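As an illustration of how position-specific frequencies of this kind can be tabulated, the sketch below tallies residues at x1, x2 and x3 separately for each repeat type. It assumes each primary repeat segment has already been extracted as a string that starts and ends on an anchor residue (Ala or Gly); the function names and the example segment are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

REPEAT_TYPES = ("AxxxG", "GxxxG", "GxxxA")


def tally_positions(segments):
    """Count residues at x1, x2, x3 for each repeat type.

    Each segment is assumed to be in register, i.e. anchor residues
    (Ala or Gly) sit at indices 0, 4, 8, ... of the string.
    """
    counts = defaultdict(Counter)  # key: (repeat_type, "x1" | "x2" | "x3")
    for seg in segments:
        for i in range(0, len(seg) - 4, 4):
            unit = seg[i:i + 5]
            repeat_type = unit[0] + "xxx" + unit[4]
            if repeat_type not in REPEAT_TYPES:
                continue  # skip units whose anchors are not A/G as expected
            for k in (1, 2, 3):
                counts[(repeat_type, f"x{k}")][unit[k]] += 1
    return counts


def frequencies(counter):
    """Convert raw counts into fractions that sum to 1."""
    total = sum(counter.values())
    return {aa: n / total for aa, n in counter.items()}


# Hypothetical segment: AxxxG, two GxxxG units, then GxxxA.
counts = tally_positions(["AYEEGQLLGYRAGFEQA"])
gxxxg_x1 = frequencies(counts[("GxxxG", "x1")])
```

Comparing each resulting distribution with the overall composition of the full-length proteins would then show the kind of positional bias discussed in this section.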
Finding correlations between pairs of amino acids in specific positions in the primary repeat segments
We sought to find pairs of amino acids in specific positions that occur together significantly more often than would be predicted by chance. This analysis was performed only for FliH; due to their short primary repeat segments, the same analysis would not be meaningful for YscL proteins. The pair correlation, a value that is greater than one if a particular pair of amino acids in a given pair of positions occurs more often than would be expected by chance, was calculated for each possible pair of amino acids, and in each possible pair of positions, within the primary repeat segments. The statistical significance for each correlation was computed using a χ2 test.
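One plausible reading of this calculation is sketched below. The exact formula is an assumption, since the text states only that the pair correlation exceeds one when a pair occurs more often than chance predicts: here it is taken as the observed count divided by the count expected if the two positions were independent, and the χ2 statistic is computed on the corresponding 2 × 2 contingency table without continuity correction (1 degree of freedom). The function name and the example data are hypothetical.

```python
import math


def pair_correlation(pairs, a, b):
    """Observed/expected ratio and chi-square test for one residue pair.

    pairs: list of (residue_at_first_position, residue_at_second_position)
    tuples, one per occurrence of the two positions in the data set.
    """
    n = len(pairs)
    n_ab = sum(1 for p, q in pairs if p == a and q == b)
    n_a = sum(1 for p, _ in pairs if p == a)
    n_b = sum(1 for _, q in pairs if q == b)

    # Expected co-occurrence count under independence of the two positions.
    expected = n_a * n_b / n if n else 0.0
    ratio = n_ab / expected if expected else float("nan")

    # 2 x 2 contingency table: (a or not a) x (b or not b).
    t11, t12 = n_ab, n_a - n_ab
    t21, t22 = n_b - n_ab, n - n_a - n_b + n_ab
    denom = (t11 + t12) * (t21 + t22) * (t11 + t21) * (t12 + t22)
    chi2 = n * (t11 * t22 - t12 * t21) ** 2 / denom if denom else 0.0

    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return ratio, chi2, p_value


# Hypothetical observations for one pair of positions.
observations = ([("Q", "Y")] * 6 + [("Q", "F")] * 2
                + [("A", "Y")] * 2 + [("A", "F")] * 10)
ratio, chi2, p_value = pair_correlation(observations, "Q", "Y")
```

With counts as small as those in short repeat segments, an exact test may be preferable to the χ2 approximation; the sketch only shows the general shape of the calculation.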
Table 1: Significant pair correlations in the FliH glycine repeats
p-values: 8.0 × 10⁻⁴, 4.0 × 10⁻³, 4.7 × 10⁻³, 5.5 × 10⁻³, 9.1 × 10⁻³, 1.2 × 10⁻², 1.7 × 10⁻², 2.8 × 10⁻², 4.4 × 10⁻², 4.4 × 10⁻², 4.9 × 10⁻²
As expected, most of the significant patterns found in Table 1 involve residues that are nearby in the primary sequence, although there is an important exception. The most significant correlation is GxAxGxxxGxAxG, which is surprising given that it is a longer-range pattern. It is possible that the Ala residues in the x2 positions contribute to helical stability via hydrophobic interactions or by some other mechanism. Some correlations are readily explicable; for instance, the pattern GQxxGYxxG seems plausible, as the NE2 amide hydrogen of the Gln residue at x1 should be able to either donate a hydrogen bond to the Tyr residue OH or provide its N-H group to make an amino-aromatic interaction. Furthermore, the NE2 amide hydrogen of a Gln residue in position x1 can also donate a hydrogen bond to the backbone carbonyl oxygen of the first Gly residue in the neighbouring twofold related GxxxG helix segment, presuming standard GxxxG helix dimerization. However, other patterns are more difficult to explain. For instance, the pattern GYxxGFxxG is found twice as often as would be expected by chance, but the Phe and Tyr side chains are unlikely to interact directly with each other, as both side chains would presumably be in a χ1 = 180° conformation favoured by aromatic residues in helices, preventing van der Waals stacking of the aromatic rings. The strong positive correlation may indicate that the combination of these two residues in these positions is conducive to forming helix-helix interactions through close contacts of the aromatic side chain on one helix with the glycine backbone atoms on the adjacent helix, again assuming standard GxxxG helix dimerization.
Identifying glycine repeats in the helices of other proteins
Table 2: Glycine repeat frequencies in PDB helices (column: % of all helices)
Longer GxxxG repeats
Table 3: Proteins in the PDB containing the GxxxGxxxGxxxG motif
The structure of glycine repeat-containing helices in other proteins as a model for FliH
Although no crystal structure has been solved for any FliH protein, one can still obtain insight into the structure of the FliH glycine repeats by examining the crystal structures of other proteins that also have glycine repeats. Unfortunately, there are no solved structures of proteins having long glycine repeats. The best alternative would be to use one of the proteins given in Table 3, but the amino acid composition of the glycine repeats in these helices is so unlike that of the FliH proteins that none would make a good model for the type of interaction that might be formed between helices in FliH.
Thus, the remaining approach is to find a protein that contains a single GxxxG repeat having FliH-like amino acids in the variable positions. In their analysis of helical interaction motifs in proteins, Kleiger et al. provide a table of proteins that contain GxxxG repeats that mediate helix-helix interactions. The glycine repeat in each PDB file given by Kleiger and co-authors was identified, and it was found that some of these contained amino acids in the variable positions that were similar to the amino acids that are commonly found in the glycine repeats in FliH.
Parts (C) and (D) of Figure 9 suggest that interactions between adjacent glycine residues may have an important role in the dimerization process, as the lack of a bulky side chain in this residue allows a C-H...O hydrogen bond to form between the two Gly residues. In addition, the closest contacts between residues with side chains appear to be between the x1 position in the first helix and the x2 position of the second twofold symmetry-related helix. In the case of 1HJR, the NE of the Arg residue in position x1 donates a hydrogen bond to the OE1 oxygen atom of the Gln residue in x2 on the opposite helix. Although residues in positions x2 and x3 can also make interactions with the adjacent twofold symmetry-related helix, they do not appear to be as close together in space.
Functional significance of the variability in length of glycine repeats in different FliH proteins
The large variability in the lengths of the glycine repeat segments in different FliH proteins raises the question of whether helix-helix dimerization, or some other property inherent to the GxxxG sequences, is functionally important in FliH. If so, it would imply that one of two things is true: either the FliH proteins with few or no glycine repeats are able to form helix-helix dimers anyway, perhaps due to the presence of some other motif, or these FliH proteins assume some other structure that happens to be functionally equivalent to the helix-helix dimers that are presumably found in the GxxxG repeat-rich FliH proteins. It seems possible that this distinction could be the result of FliH genes ancestrally acquiring a GxxxG segment that has over time undergone convergent evolution, with two or more ancestral proteins evolving semi-independently into a functionally similar end product: some evolving into the glycine repeat-rich FliH proteins, and others evolving into FliH proteins lacking these repeats. The extremely low sequence identity between many FliH proteins would also support this hypothesis. This also raises the question of how such repeats might evolve. Comparison of closely related FliH GxxxG sequence repeats from BLAST searches (results not shown) suggests that additional repeats are likely added one at a time in four-residue steps. How this might occur during DNA replication or recombination is not known. The evolution of multiple short sequence motifs, although a challenging problem, is outside the scope of this analysis, but is certain to attract the attention of other researchers in the future.
Comparison of glycine repeat frequencies with quantitative α-helix propensities
It is interesting to compare the amino acid frequencies given in Figures 7 and 8 with the experimentally-derived propensity of each amino acid to be in an α-helix. The scale derived by Pace and Scholtz assigns a number between 0 and 1 kcal/mol to each amino acid, with higher energies reflecting decreased helix propensity. According to their scale, Ala has the highest helix propensity, while Pro has the lowest. Consistent with this scale, Figures 7 and 8 show that four of the nine position-repeat-type combinations contain Ala at a relatively high frequency (over 10%). In contrast, Leu, the second-most favourable helix-forming residue, is present at high frequencies (~14%) only in position x1 of GxxxG repeats. Glu and Gln, which are found at high frequency in the glycine repeats, have only moderate helix propensity according to Pace and Scholtz's scale (lower than Leu, Met, and Lys, all of which are found at much lower frequencies in the primary repeat segments than either Glu or Gln).
It is possible that the amino acid composition required for helix-helix dimerization is distinctly different than that found in a typical α-helix. For instance, we have argued above that the hydrogen bonding capability of side chains (e.g. Glu, Gln, Arg) in positions x1 and x2 may be very important in side chain-side chain or side chain-backbone interactions in dimeric GxxxG helix-helix interactions. Further work would involve careful structural and biochemical characterization of various idealized GxxxG motifs in peptides and proteins.
It is important to acknowledge that many different scales have been developed for measuring the α-helix propensity of the amino acids, and although they are mostly consistent with one another, each scale is derived from a unique set of experimental parameters. In this case, we have chosen to compare our results with Pace and Scholtz's scale, but other scales are qualitatively very similar, with Ala, Glu, Met, Leu, Phe, Lys and Gln generally acknowledged as being helix forming residues. For instance, one secondary structure propensity scale that is commonly found in biochemistry textbooks lists Glu as the most favorable helix residue, which is more consistent with the composition of the glycine repeats in FliH. However, this same scale also lists Tyr as being somewhat unfavourable in helices, whereas in FliH Tyr is strongly favoured in position x1 of AxxxG and GxxxG motifs. This underscores the often stated caveat that context is everything in protein structure. The presence of glycine in such helical segments reinforces this point, as glycine residues are not normally acknowledged as being helix formers except within certain local sequence contexts.
Looking beyond the PDB to find proteins with glycine repeats
We report that there are no sequences found in the PDB set that we downloaded containing helices with glycine repeats anywhere near the length of those found in some FliH proteins. As a relatively small fraction of all known protein sequences have had their structures solved, one would have a better chance of finding long glycine repeats by searching a larger database of protein sequences (not structures), such as the Swiss-Prot database. Some preliminary analysis was performed as a starting point for addressing this problem. The entire Swiss-Prot database, which consisted of 261,515 sequences at the time that it was downloaded, was searched for FliH-like glycine repeat segments. Of course, since these sequences do not contain secondary structure information, there was no way to limit the search to α-helices. Eighteen sequences were found that contained repeat segments of length 11 or longer; however, all of these segments consisted of low-complexity repeats (for instance, the protein with Swiss-Prot accession number P19260 contains the repeat GSAGGSAGGSAGGSAGGSAGGSAGGSAGGSAGGSAGGSAGGSAGGSAGG), and thus were in no way analogous to repeats in FliH. The longest glycine repeat segment that was not a low-complexity repeat was of length 10, which was found in a presumably uncharacterized protein from Rickettsia japonica simply called "17 kDa surface antigen" (Swiss-Prot accession number Q52764). Further analysis would have to be done with this Swiss-Prot-derived sequence information in order to identify repeat segments that are similar to those found in FliH.
While many different short protein sequence motifs have been characterized, the glycine repeats in FliH and YscL are an unusual example. First, there is an obvious structural hypothesis that puts the general features of the sequence motif in context, and the secondary structure preferences of the residues found in the repeats strongly suggest an α-helical structure. However, not all observed pairwise residue correlations in adjacent repeats are entirely well-explained within the context of the presented structural model. In addition, we have no plausible explanation for why only FliH proteins, and no other sequences, contain these unique GxxxG repeats. There is also no obvious reason or explanation for the highly variable number of repeats in different FliH sequences. However, sequence deletions in Salmonella FliH that affect in vitro ATPase hydrolysis assays for a FliI:FliH complex (either by enhancing or reducing FliI's ATPase activity) overlap with one or more of the Salmonella FliH GxxxG repeats (see introduction). This suggests that secondary interactions between FliI and FliH, in addition to the well-known interaction between the C-domain of FliH and the N-terminal 15 residues of FliI, may depend critically on the presence of the GxxxG motif [15, 18]. Studies on the ATPase activities and/or export capability of FliI:FliH pairs from other motile bacteria with engineered deletions in the FliH GxxxG repeats would likely shed light on the importance of the GxxxG repeats in flagellar protein export. While the extremely long length of the repeats in some FliH proteins implies that the repeats may cooperate to perform an important functional or structural role, the fact that other FliH sequences have short repeat segments, or even no repeat segment at all, would suggest otherwise. Alternatively, another unidentified protein involved in the flagellum export pathway may be able to compensate for deletion of the GxxxG motifs in FliH.
Given the likely structural constraints on FliH participating in the flagellar export pathway via interactions with FliI, FliN and other proteins at the base of the flagellar export pore, it will be interesting to see if more than one protein participates in interactions with the FliH GxxxG motifs. It is also interesting that extremely long glycine repeats evolved in FliH, but not in its Type III secretion homologue YscL, and this may actually tell us something, albeit cryptically, about differences in the two export systems. The extremely biased amino acid composition of the glycine repeats suggests that these regions may adopt nonstandard helix-helix tertiary or quaternary interactions that will be of interest for structural biologists to elucidate. Lastly, and perhaps most interestingly, the extreme rarity of this motif in other proteins is very surprising given that nature tends to find similar structural solutions to a biological problem multiple times. Crystal structures and careful biochemical/biological analysis of these proteins should ultimately be able to address these fascinating issues.
Acquiring the set of FliH proteins
We endeavored to acquire FliH proteins from as many different bacterial species as possible. To accomplish this, GenBank was searched for protein sequences whose annotation contained the word "FliH", and these protein sequences were downloaded. In addition, the FliH sequence from Salmonella and the FliH sequence from H. pylori were used as input to PSI-BLAST, and the sequences attaining e-values of less than 10⁻³ after two iterations were downloaded. All of these sequences were aggregated into a single set that will be denoted "set A".
Filtering of FliH sequences
Redundancy in set A was reduced by using the EMBOSS program needle to perform pairwise global alignments between all possible pairs of sequences. That is, each sequence in set A was globally aligned with every other sequence, and the % identity between each pair of sequences was recorded. The gap opening penalty used in needle was 8, while the gap extension penalty was set to 0.5; all other settings were left at their default values. Using the % identity data for each pair in set A, a new set of proteins ("set B") was derived such that no protein in the latter set was more than 25% identical to any other protein in that same set. The purpose of this was to eliminate as much as possible the phylogenetic signal, which could potentially confound the statistical results. This set was used to derive the data shown in Figures 4, 5, 7 and 8. For comparison purposes, a larger set of proteins was created; in this set, no protein was more than 90% identical to any other protein. Analysis of this set is shown in Additional files 3 and 4.
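As an illustration of how the pairwise % identity values might be collected, the following Python sketch builds a needle command line with the gap penalties quoted above and extracts the identity from the text of a needle report. It is not the authors' script: the file names are hypothetical, the helper names are invented for this example, and it assumes that the report contains a line of the form "# Identity: matches/length", with identity expressed relative to the alignment length.

```python
import re

def needle_command(a_fasta, b_fasta, out_path):
    """Build (but do not run) an EMBOSS needle command line.

    Gap penalties follow the values quoted in the text (open 8, extend 0.5);
    the file paths are placeholders supplied by the caller.
    """
    return ["needle", "-asequence", a_fasta, "-bsequence", b_fasta,
            "-gapopen", "8", "-gapextend", "0.5", "-outfile", out_path]

# Assumed report line, e.g. "# Identity:      54/240 (22.5%)"
_IDENTITY = re.compile(r"^# Identity:\s+(\d+)/(\d+)", re.MULTILINE)

def percent_identity(needle_report):
    """Return % identity (matches / alignment length) from a needle report."""
    match = _IDENTITY.search(needle_report)
    if match is None:
        raise ValueError("no '# Identity:' line found in report")
    matches, length = int(match.group(1)), int(match.group(2))
    return 100.0 * matches / length

# Hypothetical report fragment used only to exercise the parser.
sample_report = """# Length: 240
# Identity:      54/240 (22.5%)
# Similarity:    98/240 (40.8%)
"""
print(percent_identity(sample_report))
```

Running the command for every pair of sequences in set A and storing the parsed values would give the table of pairwise identities that the filtering step requires.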
Note that the obvious method for deriving set B is simply to randomly delete one of the proteins whenever two proteins in set A are found to be more than 25% identical. However, this method may result in more proteins being deleted than necessary. Consider three proteins X, Y, and Z, where X and Y are both more than 25% identical to Z but are not more than 25% identical to each other (casual testing suggested that this does happen occasionally). Suppose that X is first compared to Z and found to be more than 25% identical, and X is arbitrarily chosen for deletion. Then Y is compared to Z, and one of these proteins is deleted. Now only one protein is left, despite the fact that only Z needed to be deleted in order to satisfy the requirements of set B. To solve this problem and maximize the number of sequences left after filtering, the following algorithm was used: for each protein $p$ in set A, a set $\psi_p$ is maintained that contains all the other proteins that are more than 25% identical to $p$. The sequence $M$ with the highest value of $|\psi_M|$ is found, and $M$ is then removed from set A; in addition, $M$ is also deleted from every other protein's $\psi_p$. This process is repeated until $\psi_p = \emptyset$ for all $p$.
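The greedy procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the input layout (a dictionary keyed by pairs of protein names) and the alphabetical tie-breaking rule are assumptions made for this example.

```python
def filter_redundant(names, identity, threshold=25.0):
    """Greedily remove proteins until no remaining pair exceeds `threshold`.

    names    -- iterable of protein identifiers (set A)
    identity -- dict mapping (name_a, name_b) tuples to % identity
    Returns the set of retained identifiers (set B).
    """
    # psi[p] holds every protein more than `threshold` percent identical to p.
    psi = {p: set() for p in names}
    for (a, b), pct in identity.items():
        if pct > threshold:
            psi[a].add(b)
            psi[b].add(a)

    kept = set(psi)
    while kept:
        # Protein with the most over-threshold neighbours; sorting first makes
        # ties resolve alphabetically (an arbitrary choice for this sketch).
        worst = max(sorted(kept), key=lambda p: len(psi[p]))
        if not psi[worst]:
            break  # every psi set is empty: filtering is complete
        kept.remove(worst)
        for neighbour in psi.pop(worst):
            psi[neighbour].discard(worst)
    return kept

# The three-protein example from the text: only Z needs to be removed.
example = {("X", "Z"): 30.0, ("Y", "Z"): 40.0, ("X", "Y"): 10.0}
print(sorted(filter_redundant(["X", "Y", "Z"], example)))
```

In the X, Y, Z example, Z has two over-threshold neighbours and is removed first, after which X and Y both have empty sets and are retained.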
To remove proteins that were unlikely to actually be FliH, the mean length μ of the sequences in set B was computed, as well as the standard deviation σ of these lengths. Protein sequences having a length outside the range μ ± 1.5σ were deleted. Finally, a multiple alignment of the sequences was created using T-Coffee, and sequences that appeared, on the basis of the alignment, unlikely to be FliH were deleted.
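A minimal Python sketch of the length filter is given below. The text does not state whether the population or sample standard deviation was used; the population form is assumed here, and the function name and example sequences are invented for illustration.

```python
from statistics import mean, pstdev

def filter_by_length(seqs, k=1.5):
    """Keep sequences whose length lies within mean ± k standard deviations.

    seqs -- dict mapping sequence name to amino acid string
    Assumption: population standard deviation (pstdev); the source does not
    specify which estimator was used.
    """
    lengths = [len(s) for s in seqs.values()]
    mu = mean(lengths)
    sigma = pstdev(lengths)
    low, high = mu - k * sigma, mu + k * sigma
    return {name: s for name, s in seqs.items() if low <= len(s) <= high}

# Hypothetical toy set: five sequences of about 100 residues and one outlier.
toy = {"p1": "A" * 100, "p2": "A" * 102, "p3": "A" * 98,
       "p4": "A" * 101, "p5": "A" * 99, "long": "A" * 300}
print(sorted(filter_by_length(toy)))
```

In this toy set the 300-residue sequence falls outside μ ± 1.5σ and is dropped, while the five sequences near 100 residues are kept.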
Acquiring and filtering the YscL sequences
The procedure used to acquire YscL sequences was similar to that used to acquire the FliH sequences. The only difference was that, because these proteins are named inconsistently, a GenBank search was not performed; rather, the set consisted only of significant matches from a PSI-BLAST search using the YscL sequence from Yersinia enterocolitica. The sequences were then filtered in the same manner as the FliH sequences.
Characterization of amino acid frequencies in the primary repeat segments
is distributed as $\chi^2$ with 19 degrees of freedom. The P-value corresponding to each $\chi^2_{kR}$ was determined using the Statistics::Distributions Perl module.
Determining correlations between pairs of amino acids in the primary repeat segments
To determine whether certain pairs of amino acids occur together in certain positions at frequencies significantly greater than would be expected by chance, correlations for all possible pairs of amino acids were calculated for each possible pair of positions within a given primary repeat segment. Correlations were determined only in GxxxG repeats (AxxxG and GxxxA repeats were ignored). Statistical analysis was performed as described previously [31, 32]. Consider a typical segment in a FliH protein with $m$ GxxxG repeats. Define $n_{ijkld}$ to be the number of times that amino acid $i$ is found at position $x_k$ in some arbitrary repeat $r$ ($1 \le r \le m$) and amino acid $j$ is found at position $x_l$ in the $(r + d)$th repeat ($1 \le r + d \le m$). Thus, the possible values for $i$ and $j$ are the 20 amino acids, and $k$ and $l$ can each be 1, 2, or 3. Values for $d$ range from 0 to 9; the upper value was chosen because the longest repeat found in any FliH protein in set B was of length 10. If $d = 0$, the two amino acids in the pair are in the same repeat; if $d = 1$, they are in adjacent repeats, and so on. When $d = 0$, only pairs with $k < l$ are considered. To compute $n_{ijkld}$, the following procedure was used:
For each FliH sequence p
    For each GxxxG repeat r in p with r + d ≤ m
        If position x_k in repeat r contains residue i and
           position x_l in repeat (r + d) contains residue j
            Add 1 to n_ijkld
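The counting procedure can be sketched in Python as shown below. This is an illustrative sketch, not the authors' script: it assumes that each primary repeat segment has already been reduced to a list of three-letter strings (the three x positions of each GxxxG repeat, in order), and the function name and data layout are invented for this example.

```python
from collections import Counter

def count_pairs(segments, max_d=9):
    """Tally co-occurrences keyed by (i, j, k, l, d).

    segments -- list of repeat segments; each segment is a list of
                3-character strings holding positions x1..x3 of one repeat
    max_d    -- largest repeat separation considered (9 in the text)
    """
    n = Counter()
    for repeats in segments:
        m = len(repeats)
        for d in range(max_d + 1):
            for r in range(m - d):           # ensures repeat r + d exists
                for k in range(3):
                    for l in range(3):
                        if d == 0 and k >= l:
                            continue         # same repeat: only k < l
                        i = repeats[r][k]
                        j = repeats[r + d][l]
                        n[(i, j, k + 1, l + 1, d)] += 1
    return n

# Hypothetical segment with two repeats, G-AED-G-KLE-G.
counts = count_pairs([["AED", "KLE"]])
print(counts[("D", "K", 3, 1, 1)])
```

The marginal counts used later (the number of times amino acid i occupies position k for a given d, and the total number of pairs for that d) can be obtained by summing this table over the remaining indices.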
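The expected number of times that the two amino acids occur together, assuming no correlation, is
$$E_{ijkld} = \frac{n_{ikd}\, n_{jld}}{n_d},$$
where $n_{ikd}$ is the number of times amino acid $i$ is found at position $x_k$ (with any amino acid at position $x_l$), $n_{jld}$ is the analogous value for the other amino acid, and $n_d$ is the total number of pairs. Note that superfluous subscripts are dropped in the preceding notation.
Let
$$g_{ijkld} = \frac{n_{ijkld}}{E_{ijkld}}$$
denote the pair correlation, which will be greater than one if the amino acids at the indicated positions are found at a greater frequency than would be expected given their individual frequencies in those positions, and vice versa.
The significance of each pair correlation was assessed using
$$\chi^2_{ijkld} = \frac{(n_{ijkld} - E_{ijkld})^2}{E_{ijkld}}.$$
If the null hypothesis is true ($n_{ijkld} = E_{ijkld}$), then $\chi^2_{ijkld}$ will have a $\chi^2$ distribution with one degree of freedom.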
The following example illustrates the above procedure. Assume that we want to find the pair correlation between Asp in position $x_3$ and Glu in position $x_1$ in pairs of repeats that have one repeat between them. This corresponds to the pattern GxxDGxxxGExxG, and therefore $i$ = D, $j$ = E, $k = 3$, $l = 1$, and $d = 2$. Also assume that the number of possible instances in which these amino acids could occur together in the stated pattern, in all the FliH proteins, is 263 ($n_d = 263$). Of these instances, Asp is found in position $x_3$ of the left-hand repeat 22 times, while Glu occurs in position $x_1$ of the right-hand repeat 9 times ($n_{ikd} = 22$ and $n_{jld} = 9$). Thus, the number of times Asp and Glu would be expected to appear together in these positions, assuming no correlation, is $E_{ijkld} = (22 \times 9)/263 = 0.753$. The actual number of times that they occur together is $n_{ijkld} = 5$; the pair correlation is thus $g_{ijkld} = 5/0.753 = 6.64$, meaning that this pairing of amino acids in the stated positions is found 6.64 times as often as would be expected at random. The $\chi^2$ value is $(5 - 0.753)^2/0.753 = 23.95$, which corresponds to a P-value of $9.8 \times 10^{-7}$, meaning that this correlation is highly statistically significant.
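The arithmetic of this worked example can be reproduced with a short Python sketch. The function name is invented for illustration, and the P-value is obtained from the closed-form survival function of the χ² distribution with one degree of freedom, erfc(√(χ²/2)), rather than from the Perl module mentioned earlier.

```python
import math

def pair_statistics(n_obs, n_i, n_j, n_total):
    """Return (expected count, pair correlation g, chi-square, P-value).

    n_obs   -- observed co-occurrences of the residue pair
    n_i     -- occurrences of the first residue at its position
    n_j     -- occurrences of the second residue at its position
    n_total -- total number of pairs for this separation d
    """
    expected = n_i * n_j / n_total
    g = n_obs / expected
    chi2 = (n_obs - expected) ** 2 / expected
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, g, chi2, p_value

# Values from the Asp/Glu example in the text.
expected, g, chi2, p_value = pair_statistics(5, 22, 9, 263)
print(f"E={expected:.3f} g={g:.2f} chi2={chi2:.2f} P={p_value:.1e}")
```

Because this sketch does not round the expected count to 0.753 before continuing, its χ² value differs from the 23.95 quoted in the text in the second decimal place; the P-value agrees to the precision reported.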
Identifying glycine repeats in proteins in the Protein Data Bank
A total of 7,963 proteins were downloaded from the PDB by first searching for molecules that contain protein, then removing structures solved by a method other than X-ray crystallography, and finally using the "remove similar sequences at 40% identity" option.
Each PDB file was searched using a Perl script for helices that contain glycine repeats. If multiple helices had the exact same sequence, then all but one of these were discarded. This occurred both in the same protein (when there are multiple identical subunits), and between proteins (despite the sequences being less than 40% identical according to the PDB's criteria, some PDB files still contained helices with sequences that were the same as helices found in another PDB file).
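The kind of search described above can be illustrated with a short Python sketch that operates on helix sequences already extracted from the PDB files. It is not the Perl script used in the study: the function names are hypothetical, only GxxxG units are detected (the flanking AxxxG and GxxxA units are not handled), and duplicate helices are removed by exact string comparison.

```python
def find_gxxxg_runs(seq, min_units=1):
    """Return (start_index, n_units) for each maximal run of GxxxG units.

    A run is a series of glycines spaced exactly four residues apart;
    n_units counts the GxxxG units in the run (glycines minus one).
    """
    runs = []
    for start, residue in enumerate(seq):
        if residue != "G":
            continue
        if start >= 4 and seq[start - 4] == "G":
            continue  # this glycine continues an earlier run
        end = start
        while end + 4 < len(seq) and seq[end + 4] == "G":
            end += 4
        n_units = (end - start) // 4
        if n_units >= min_units:
            runs.append((start, n_units))
    return runs

def unique_helices(helices):
    """Drop helices whose sequence exactly matches an earlier one."""
    return list(dict.fromkeys(helices))

# Hypothetical helix sequences; the first occurs twice.
helices = unique_helices(["MGAEKGLLEGQ", "MGAEKGLLEGQ", "GAAAA"])
print([(h, find_gxxxg_runs(h)) for h in helices])
```

For the toy input, the duplicate helix is discarded, the first helix yields one run of two GxxxG units starting at index 1, and the second helix yields none.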
Secondary structure prediction
We thank Paul O'Toole (UCC Cork) for many helpful discussions. Work in SM's lab is funded in part by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).
- Macnab RM: How bacteria assemble flagella. Annu Rev Microbiol. 2003, 57: 77-100. 10.1146/annurev.micro.57.030502.090832.
- Macnab RM: Flagella and motility. Escherichia coli and Salmonella: Cellular and Molecular Biology. Edited by: Neidhardt FC, Curtiss R, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE. 1996, ASM Press, Washington DC, 123-145.
- Blocker A, Komoriya K, Aizawa SI: Type III secretion systems and bacterial flagella: insights into their function from structural similarities. Proc Natl Acad Sci USA. 2003, 100: 3027-3030. 10.1073/pnas.0535335100.
- Kubori T, Matsushima Y, Nakamura D, Uralil J, Lara-Tejero M, Sukhan A, Galan JE, Aizawa SI: Supramolecular structure of the Salmonella typhimurium type III protein secretion system. Science. 1998, 280: 602-605. 10.1126/science.280.5363.602.
- Van Gijsegem F, Gough C, Zischek C, Niqueux E, Arlat M, Genin S, Barberis P, German S, Castello P, Boucher C: The hrp gene locus of Pseudomonas solanacearum, which controls the production of a type III secretion system, encodes eight proteins related to components of the bacterial flagellar biogenesis complex. Mol Microbiol. 1995, 15: 1095-1114. 10.1111/j.1365-2958.1995.tb02284.x.
- Hueck CJ: Type III protein secretion systems in bacterial pathogens of animals and plants. Microbiol Mol Biol Rev. 1998, 62: 379-433.
- Jackson MW, Plano GV: Interactions between type III secretion apparatus components from Yersinia pestis detected using the yeast two-hybrid system. FEMS Microbiol Lett. 2000, 186: 85-90. 10.1111/j.1574-6968.2000.tb09086.x.
- Jouihri N, Sory MP, Page AL, Gounon P, Parsot C, Allaoui: MxiK and MxiN interact with the Spa47 ATPase and are required for transit of the needle components MxiH and MxiI, but not of Ipa proteins, through the type III secretion apparatus of Shigella flexneri. Mol Microbiol. 2003, 49: 755-767. 10.1046/j.1365-2958.2003.03590.x.
- González-Pedrajo B, Minamino T, Kihara M, Namba K: Interactions between C ring proteins and export apparatus components: a possible mechanism for facilitating type III protein export. Mol Microbiol. 2006, 60: 984-998. 10.1111/j.1365-2958.2006.05149.x.
- Minamino T, Macnab RM: Interactions among components of the Salmonella flagellar export apparatus and its substrates. Mol Microbiol. 2000, 35: 1052-1064. 10.1046/j.1365-2958.2000.01771.x.
- Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P: The protein-protein interaction map of Helicobacter pylori. Nature. 2001, 409: 211-215. 10.1038/35051615.
- Fadouloglou VE, Tampakaki AP, Glykos NM, Bastaki MN, Hadden JM, Phillips SE, Panopoulos NJ, Kokkinidis M: Structure of HrcQB-C, a conserved component of the bacterial type III secretion systems. Proc Natl Acad Sci USA. 2004, 101: 70-75. 10.1073/pnas.0304579101.
- Brown PN, Mathews MA, Joss LA, Hill CP, Blair DF: Crystal structure of the flagellar rotor protein FliN from Thermotoga maritima. J Bacteriol. 2005, 187: 2890-2902. 10.1128/JB.187.8.2890-2902.2005.
- O'Toole PW, Lane MC, Porwollik S: Helicobacter pylori motility. Microbes Infect. 2000, 2: 1207-1214. 10.1016/S1286-4579(00)01274-0.
- Minamino T, Macnab RM: FliH, a soluble component of the type III flagellar export apparatus of Salmonella, forms a complex with FliI and inhibits its ATPase activity. Mol Microbiol. 2000, 37: 1494-1503. 10.1046/j.1365-2958.2000.02106.x.
- Minamino T, González-Pedrajo B, Oosawa K, Namba K, Macnab RM: Structural properties of FliH, an ATPase regulatory component of the Salmonella type III flagellar export apparatus. J Mol Biol. 2002, 322: 281-290. 10.1016/S0022-2836(02)00754-4.
- González-Pedrajo B, Fraser GM, Minamino T, Macnab RM: Molecular dissection of Salmonella FliH, a regulator of the ATPase FliI and the type III flagellar protein export pathway. Mol Microbiol. 2002, 45: 967-982. 10.1046/j.1365-2958.2002.03047.x.
- Lane MC, O'Toole PW, Moore SA: Molecular basis of the interaction between the flagellar export proteins FliI and FliH from Helicobacter pylori. J Biol Chem. 2006, 281: 508-517. 10.1074/jbc.M507238200.
- Blaylock B, Riordan KE, Missiakas DM, Schneewind O: Characterization of the Yersinia enterocolitica type III secretion ATPase YscN and its regulator, YscL. J Bacteriol. 2006, 188: 3525-3534. 10.1128/JB.188.10.3525-3534.2006.
- Minamino T, Namba K: Distinct roles of the FliI ATPase and proton motive force in bacterial flagellar protein export. Nature. 2008, 451: 485-488. 10.1038/nature06449.
- Pallen MJ, Bailey CM, Beatson SA: Evolutionary links between FliH/YscL-like proteins from bacterial type III secretion systems and second-stalk components of the FoF1 and vacuolar ATPases. Protein Sci. 2006, 15: 935-941. 10.1110/ps.051958806.
- Lemmon MA, Flanagan JM, Treutlein HR, Zhang J, Engelman DM: Sequence specificity in the dimerization of transmembrane α-helices. Biochemistry. 1992, 31: 12719-12725. 10.1021/bi00166a002.
- Langosch D, Brosig B, Kolmar H, Fritz HJ: Dimerisation of the glycophorin A transmembrane segment in membranes probed with the ToxR transcription activator. J Mol Biol. 1996, 263: 525-530. 10.1006/jmbi.1996.0595.
- Senes A, Gerstein M, Engelman DM: Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with β-branched residues at neighboring positions. J Mol Biol. 2000, 296: 921-936. 10.1006/jmbi.1999.3488.
- Russ WP, Engelman DM: The GxxxG motif: a framework for transmembrane helix-helix association. J Mol Biol. 2000, 296: 911-919. 10.1006/jmbi.1999.3489.
- Kleiger G, Grothe R, Mallick P, Eisenberg D: GXXXG and AXXXA: common α-helical interaction motifs in proteins, particularly in extremophiles. Biochemistry. 2002, 41: 5990-5997. 10.1021/bi0200763.
- Pace CN, Scholtz JM: A helix propensity scale based on experimental studies of peptides and proteins. Biophys J. 1998, 75: 422-427. 10.1016/S0006-3495(98)77529-0.
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4.
- Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.
- Lifson S, Sander C: Specific recognition in the tertiary structure of β-sheets of proteins. J Mol Biol. 1980, 139: 627-639. 10.1016/0022-2836(80)90052-2.
- Wouters MA, Curmi PM: An analysis of side chain interactions and pair correlations within antiparallel β-sheets: the differences between backbone hydrogen-bonded and non-hydrogen-bonded residue pairs. Proteins. 1995, 22: 119-131. 10.1002/prot.340220205.
- Roussel A, Cambillau C: TURBO-FRODO. 1991, Silicon Graphics, Mountain View, CA.
- DeLano WL: The PyMol molecular graphics system. 2002, DeLano Scientific, Palo Alto, CA.
- Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ: JPred: a consensus secondary structure prediction server. Bioinformatics. 1998, 14: 892-893. 10.1093/bioinformatics/14.10.892.
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16: 404-405. 10.1093/bioinformatics/16.4.404.
- Kneller DG, Cohen FE, Langridge R: Improvements in protein secondary structure prediction by an enhanced neural network. J Mol Biol. 1990, 214: 171-182. 10.1016/0022-2836(90)90154-E.
- PROF – secondary structure prediction system. [http://www.aber.ac.uk/~phiwww/prof/]
- Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002, 47: 228-235. 10.1002/prot.10082.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.