Conserved amino acid markers from past influenza pandemic strains
© Allen et al. 2009
Received: 09 January 2009
Accepted: 22 April 2009
Published: 22 April 2009
Finding the amino acid mutations that affect the severity of influenza infections remains an open and challenging problem. Of special interest is better understanding how current circulating influenza strains could evolve into a new pandemic strain. Influenza proteomes from distinct viral phenotype classes were searched for class specific amino acid mutations conserved in past pandemics, using reverse engineered linear classifiers.
Thirty-four amino acid markers associated with host specificity and high mortality rate were found. Some markers had little impact on distinguishing the functional classes by themselves; in combination with other mutations, however, they improved class prediction. Pairwise combinations of influenza genomes were checked for reassortment and mutation events needed to acquire the pandemic conserved markers. Evolutionary pathways involving H1N1 human and swine strains mixed with avian strains show the potential to acquire the pandemic markers with a double reassortment and one or two amino acid mutations.
The small mutation combinations found at multiple protein positions associated with viral phenotype indicate that surveillance tools could monitor genetic variation beyond single point mutations to track influenza strains. Finding that certain strain combinations have the potential to acquire pandemic conserved markers through a limited number of reassortment and mutation events illustrates the potential for reassortment and mutation events to lead to new circulating influenza strains.
Influenza A has evolved toward host specific mechanisms of infection leading to genetic divergence between human and avian strains. Sequence divergence is so striking that single nucleotide counts are sufficient for classifying the host type for most influenza strains when analyzing whole segment or whole genome data. A notable exception is the H5N1 avian strain that crosses the species barrier and can lead to deadly human infection. The H5 surface protein, hemagglutinin (HA), in some cases is able to recognize human cell receptors [2,3] along with mutations that allow the virus to better survive in the upper respiratory tract. To date, however, there are relatively low numbers of human H5N1 infections compared to the more human persistent subtypes, which may be in part due to inefficient human to human transmission [5,6]. In this study the influenza viruses from the pandemics of 1918, 1957 and 1968 with elements of avian (or avian-like) strains mixed with genetic elements persistent in humans [7–9] are used to provide a historic map of enduring genetic features from past pandemics and their circulation in current human, avian and swine strains.
Whole influenza genomes were searched for genetic markers conserved in pandemic strains that are associated with two features of infection: host specificity and high mortality rate. For host specificity a search was designed to find amino acid mutations in human influenza strains that were not observed in avian strains. The approach for defining host specificity markers closely followed earlier work that predicted positions in the genome associated with human host specificity. Other recent work looked more broadly for human markers beyond the pandemic conserved regions. Both of these studies examined amino acid point mutations using differing measures for functional significance. In this study a new approach was developed to look for combinations of mutations in the genome that identify host specific evolutionary pressures beyond single point mutations. New mutations were identified that exhibit a co-variation mutation pattern. Evaluating mutation combinations allowed for the analysis of genetic markers where single point mutations failed to distinguish high and low mortality rate strains. In total 34 host specific and high mortality rate pandemic conserved markers were found. The ultimate goal of our study was to examine how the 34 pandemic conserved markers might re-emerge in a future single strain. While marker re-emergence in a single strain does not predict pandemic potential, their presence could highlight unexpected evolutionary events in circulating strains that warrant closer scrutiny.
Influenza genomes not used in the marker estimation process were searched for the presence or absence of the markers. The human host specific markers were sought in the recent avian strains infecting humans (H5N1, H9N2, H7N3 and H7N7), the high mortality rate associated markers were sought in the avian strains, and both marker sets were sought in non-avian non-human strains (e.g. swine, cat and others). The high mortality rate markers appeared in a wide variety of avian strains but the recent avian to human strain crossovers lacked most of the human strain specific markers. Human persistent strains retained human specific markers (by definition) but lacked most of the high mortality rate markers. Swine strains fell in the middle, carrying both high mortality rate and host specificity markers but with no single strain containing all 34 markers. Using a maximum parsimony principle, likely evolutionary pathways for the re-emergence of the 34 markers in a single strain were considered with a computational experiment. The fewest evolutionary events through reassortment and mutation needed for a single influenza strain to acquire all 34 markers in the presence of a second strain were counted. Starting with a small number of sequenced H1N1 human and swine strains, mixes with avian strains were found to acquire the 34 pandemic markers through a combination of 4 or fewer segment reassortment and amino acid mutation events.
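The event counting described above can be illustrated with a minimal sketch. It assumes a simplified cost model that is not taken from the paper's methods: each segment acquired from the second strain counts as one reassortment event, each marker still missing afterwards counts as one amino acid mutation, and segments are treated independently. The function name, the segment labels and the marker labels (m1 to m6) are hypothetical.

```python
def min_events(recipient, donor, markers_per_segment):
    """Count the fewest events for `recipient` to carry every marker.

    recipient, donor: dict mapping segment name -> set of markers present.
    markers_per_segment: dict mapping segment name -> set of required markers.
    Assumed cost model: 1 event per reassorted segment, 1 event per
    missing marker that must arise by mutation.
    """
    total = 0
    plan = {}
    for segment, required in markers_per_segment.items():
        # Option 1: keep the recipient's segment and mutate what is missing.
        cost_keep = len(required - recipient.get(segment, set()))
        # Option 2: take the donor's segment (1 event), then mutate the rest.
        cost_swap = 1 + len(required - donor.get(segment, set()))
        if cost_swap < cost_keep:
            plan[segment] = "reassort"
            total += cost_swap
        else:
            plan[segment] = "keep"
            total += cost_keep
    return total, plan


# Toy example with hypothetical marker labels.
required = {"PB2": {"m1", "m2"}, "PA": {"m3", "m4", "m5"}, "NS": {"m6"}}
swine = {"PB2": {"m1", "m2"}, "PA": set(), "NS": set()}
avian = {"PB2": set(), "PA": {"m3", "m4", "m5"}, "NS": set()}

total, plan = min_events(swine, avian, required)
print(total, plan)
```

In this toy case the cheapest pathway is one reassortment (the PA segment) plus one mutation (the NS marker), for two events in total. Running the function over all ordered strain pairs and keeping the pairs with the smallest totals would mirror the pairwise search described in the text, under the stated assumptions.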
The genetic marker identification procedure uses a discriminative classifier (a linear support vector machine) with cross validation to build two models, one for host specificity and one for high mortality rate strains. The discriminative classifier is a computational tool that is designed to classify an unknown sample as belonging to one of two classes. Here one classifier model is designed to classify the influenza host type; the second model is designed to classify the influenza mortality rate type. Each model takes as input the 11 influenza proteins aligned and concatenated and classifies the strain in the case of host specificity as being human or avian. For mortality rate, input strains are divided into high and low mortality rate strain classes. The purpose for building the classifier is to find the positions in the influenza genome that are important in the model for accurately classifying input strains, a problem commonly referred to as the feature selection problem. Candidate markers are found by building new classifiers that take as input a small subset of the influenza proteome. The input sets that lead to classifiers that match the accuracy of the original classifier (which uses the entire proteome as input) highlight the amino acid markers that are important for class discrimination. An iterative procedure is used. For the initial step all single amino acid positions are found that separate the two classes (human/avian or high/low mortality rate). The iterative step n identifies the n sized (potentially non-contiguous sequence) combinations that separate the data such that each combination does not contain a smaller sized combination that separates the two classes equally well. This procedure yields a set of non-redundant mutation patterns that separate the two classes. The iterative procedure is important so that a candidate marker is only included as part of a distinguishing pattern when it adds to the classification accuracy.
For example, if position 21 in the PB2 protein distinguishes avian and human strains by itself, then position 21 would not be included in any larger combination of features (say, a pair formed with position 22 in the PB2 protein). Only markers that contribute significantly to classification accuracy are included in the final result. Details on selecting candidate functional markers are given in the Methods section.
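The iterative search for non-redundant combinations can be sketched as follows. This is an illustration only: the accuracy test of the classifier is replaced here by a much stricter stand-in criterion (a set of positions "separates" the classes when no residue pattern at those positions occurs in both classes), and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def separates(positions, class_a, class_b):
    """True if no residue pattern at `positions` occurs in both classes."""
    patterns_a = {tuple(seq[p] for p in positions) for seq in class_a}
    patterns_b = {tuple(seq[p] for p in positions) for seq in class_b}
    return patterns_a.isdisjoint(patterns_b)


def minimal_separating_sets(class_a, class_b, candidates, max_size):
    """Find separating position sets that contain no smaller separating set."""
    found = []
    for n in range(1, max_size + 1):
        for combo in combinations(candidates, n):
            # Skip any combination that contains a smaller set already found.
            if any(set(prev) <= set(combo) for prev in found):
                continue
            if separates(combo, class_a, class_b):
                found.append(combo)
    return found


# Toy aligned sequences: position 0 separates the classes on its own,
# positions 1 and 2 separate them only in combination.
human = ["KLA", "KSG"]
avian = ["ELG", "ESA"]

result = minimal_separating_sets(human, avian, range(3), max_size=3)
print(result)  # [(0,), (1, 2)]
```

Position 0 is reported alone and is excluded from every larger combination, while positions 1 and 2 are reported only as a pair, which is the co-variation pattern that a single-position search would miss.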
Sixteen positions in the influenza genome were found to be associated with human host specificity. The markers were found on the non-structural protein 1 (NS1), non-structural protein 2 (NS2), matrix protein 1 (MP1), nucleoprotein (NP), acidic protein (PA) and the basic polymerase 2 (PB2) protein. Each strain was assigned a genotype, which showed whether the human consensus amino acid variant was present at each of the 16 positions. Strains excluded from the marker estimation process, human infections of avian origin and non-human non-avian strains, were checked for evidence of an enrichment of human specificity markers relative to the remaining avian strains. With one exception the human infections of avian origin showed a genotype that was distinct from the most common avian genotype background but the number of accumulated human markers was small.
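A minimal sketch of the genotype assignment and frequency ranking is given below. It assumes a genotype is a tuple of 0/1 flags, one per marker position, and that rank 0 denotes the most frequent genotype; the function names, positions and sequences are hypothetical toy values rather than data from the study.

```python
from collections import Counter


def genotype(sequence, marker_positions, human_consensus):
    """Return a 0/1 tuple: 1 where the human consensus residue is present."""
    return tuple(
        int(sequence[p] == human_consensus[p]) for p in marker_positions
    )


def rank_genotypes(sequences, marker_positions, human_consensus):
    """List distinct genotypes from most to least frequent (rank 0 first)."""
    counts = Counter(
        genotype(s, marker_positions, human_consensus) for s in sequences
    )
    return [g for g, _ in counts.most_common()]


# Toy data: two marker positions with assumed human consensus residues.
marker_positions = [0, 2]
human_consensus = {0: "K", 2: "L"}
avian_sequences = ["EAS", "EAS", "KAS", "EAL"]

ranks = rank_genotypes(avian_sequences, marker_positions, human_consensus)
print(ranks[0])  # (0, 0): the most common background carries no human markers
```

Summing the entries of a genotype gives the number of human markers a strain has accumulated, which is the quantity compared between crossover strains and the avian background.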
The columns are grouped so that avian to human crossover genotypes are clustered into three groups labeled at the top of Figure 1 as: H7 (avian frequency rank 0 and 14), H5N1 beginning in 2003 (rank 2, 8, 3, 16 and 9) [7,16–19] and the H5N1/H9N2 Hong Kong outbreaks from 1997–1999 (rank 13, 15, 6, and 17) [20,21]. Additional similar genotype patterns are placed in adjacent columns. A pattern emerges from the two most common avian genotypes ranked 0 and 1 in Figure 1. These two genotypes cover 60% of the sequenced strains and span nearly all of the subtypes including H5N1, H9N2, H7N7 and H7N3. Among the lethal avian to human crossovers there are two genotypes that arise in humans that are not found in sequenced avian strains (rank 16 and 17). These cases could be examples of post infection mutations, or alternatively show the limits in the coverage of sequenced avian strains.
In a second experiment human influenza strains were separated into two groups: a high mortality rate group containing pandemic genomes selected from the 1918, 1957 and 1968 outbreaks, human H5N1 and the H1N1 1976 deadly New Jersey infection and a low mortality rate group containing all other whole genome human infection samples. As with the pandemic conserved host type markers, the high mortality rate markers were required to be positively identified in each of the sequenced strains associated with the three pandemic outbreaks (e.g. perfect conservation and no ambiguous sequence codes). Eighteen of 2,112 sequenced human influenza genomes (9 of 286 when samples were grouped by year, subtype and location) not in the high mortality rate class contained all 18 of the identified high mortality rate markers. These cases occurred in H2N2 and H3N2 strains from the 1960s and 1970s in years following their respective pandemics.
The most common non-human non-avian genotype (rank 43 in Figure 2) is a swine H1N1, which shares many of the high mortality rate variants but misses the mutations found on the NS and PB1 segments. The second most common subtype shares all but one of the high mortality rate variants and is circulating in horses (rank 15), but Figure 1 shows that H3N8 lacks most of the human host type markers (rank 19 and 21 in Figure 1). The complete high mortality rate variant (rank 0) is found in H5N1 cases that infect a broad host range including swine, tiger, domestic cat, civet, and stone marten. Figure 1 shows that these strains (most with the genotype with rank 9 in Figure 1) contain only a small number of human specific markers, similar to the H5N1 human infections. The differences in genotypes show that swine host both strains carrying human transmission markers and strains enriched with the high mortality rate markers. This could present an opportunity for two strains to mix and evolve into a swine strain with all 34 of the predicted pandemic conserved markers.
Recent work mixing avian H5N1 with human H3N2 in ferret models has shown that combining the H5N1 cell surface proteins with the internal human proteins need not lead directly to efficient ferret to ferret transmission, which serves as a model for human to human transmission. In this approach only reassortment events were considered, highlighting the complexity that may be involved in acquiring the precise mix of genetic elements required for an H5N1 virus to acquire pandemic potential.
[Table: Minimal evolutionary steps to acquire all 34 pandemic markers.]
[Table: Strains sequenced since 2006 with 4 or fewer events needed to acquire the 34 markers.]
A distinct genotype subset emerges from the avian background from which the human crossovers are derived with some strains adopting a limited number of human persistent markers. Overall, the human infections of avian origin have acquired no more than a few human specific markers, which suggests that avian strains are not rapidly acquiring human persistent markers through genetic drift. The high mortality rate markers are ubiquitous in the avian background and are distinct from the vast majority of human infections. While the host type markers clearly separate avian and human strains, there are a number of cases where descendants of the 1957 and 1968 pandemics continued to retain all of the predicted high mortality rate markers. Finding that classification accuracy for high mortality rate strains is lower than the host type classification weakens support for the notion of a single essential common set of high mortality rate markers. The reduced classification accuracy comes primarily from the fact that the H2N2 sequences continue to maintain the 18 markers into the 1960s, well past the associated pandemic. Thus, these 18 markers do not clearly distinguish between pandemic and non-pandemic associated H2N2 strains. Instead the results support the hypothesis that additional factors play an important role in determining the mortality rates of a specific strain. This highlights the potential importance of host immunity and antigenic novelty to pandemic potential. Even in the case of host type markers where classification accuracy is very high, markers could be missed. For example, the HA and NA genes play a critical role in host specific infection, but this study focused specifically on the persistent markers, and host specificity markers were found only on the more heavily conserved internal proteins. Additional potentially important host type markers that are not persistent should still exist.
It is worth noting that 5 of the 18 high mortality rate markers lie on the NA or PB1 segments implying that they were independently introduced into the three respective pandemic outbreaks. Aside from the 18 high mortality rate markers persisting in H2N2 strains past the 1957 pandemic time frame, the markers give an overall high degree of classification accuracy and, therefore, a potentially useful common, although not sufficient, set of associated genetic factors. Among the high mortality rate strains not associated with a pandemic, only the 1976 H1N1 isolate does not carry all 18 markers (4 are not present). Because the 1976 sample is a small contributor to the total number of high mortality rate features, it does not significantly contribute to the classification model. Substituting a single alternate 1976 swine strain, for example, would have limited impact on the markers chosen unless more strains were added or a single strain was given the same weight as the pandemic strains in which perfect conservation is required. In this case mixing low mortality rate strains into the high mortality rate class would substantially alter the reported set of persistent markers. Requiring perfect conservation with the 1976 H1N1 strain would reduce the number of candidate markers to 14 (or fewer if an alternate 1976 swine strain were used). Similarly, swapping in nearly any other H3N2 sequence from the low mortality rate class, including those from the 1970s, would alter the candidate marker set due to a lack of conservation.
Evolutionary pathways through reassortment and mutation show that strain combinations starting with H1N1 human and swine need the fewest events to acquire the pandemic conserved markers. Several of these pathways would lead to novel strains with H5N1 subtypes that could challenge human immunity. The potential need for an extended time or number of exposures for strains to acquire the human persistent mutations combined with the high mortality rate markers associated with avian strains suggests how swine could act as a mixing vessel where both human specific and high mortality rate markers are found to persist. Additional work may reveal restrictions that limit the strain combinations that lead to viable new strains. Measuring the rate of co-infection in swine and human, particularly in cases where an avian like strain is suspected to be present, could provide additional data for more precisely modeling the likelihood of the reassortment events that combine with mutations to facilitate mutation combinations important to infection.
A pattern classification approach is used with heuristic feature selection [14,24] to predict the candidate markers. The input is a multiple sequence alignment (using MUSCLE) for a collection of influenza genomes, where the 11 proteins are concatenated together. Each position in the alignment is converted to a bit vector of length 21, where an entry of 1 in the vector indicates the presence of one of the 20 amino acids or an insertion symbol. For an input alignment of length x (and a bit vector of length 21 × x), finding all n sized mutation subsets requires checking x choose n combinations, which is time prohibitive even for small n when x is large. A heuristic is used to exploit the information obtained from the linear support vector machine (LSVM) to reduce the size of x to 60 and limit n to 10. Note that even this size (~7 × 10^10) in theory could be too large to efficiently process. Since smaller combination sizes were found, the search space size was sufficiently reduced to compute a solution. The LSVM computes weights for each position in the alignment reflecting the relative influence on the classifier. These weights are used to select the x most heavily weighted mutations from which to consider combinations. A similar approach was used in document classification and a related approach was taken to classify 70 antibody light chain proteins. LSVM code was developed by modifying the software package LIBSVM.
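The encoding and the size of the reduced search space can be checked with a short sketch. The alphabet ordering, the use of "-" as the insertion symbol, the treatment of unknown residues and the ranking by absolute weight are assumptions made for illustration; the function names are hypothetical and the classifier weights are taken as given rather than computed.

```python
from math import comb

# 20 amino acids plus one insertion/gap symbol (ordering is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_column(residue):
    """Encode one alignment column as a bit vector of length 21."""
    # Symbols outside the alphabet (e.g. ambiguity codes) give all zeros.
    return [1 if residue == symbol else 0 for symbol in ALPHABET]


def encode_alignment_row(row):
    """Concatenate the per-column encodings into a 21 * len(row) vector."""
    bits = []
    for residue in row:
        bits.extend(encode_column(residue))
    return bits


def top_weighted_positions(position_weights, x):
    """Indices of the x positions with the largest absolute weight."""
    order = sorted(
        range(len(position_weights)),
        key=lambda i: abs(position_weights[i]),
        reverse=True,
    )
    return order[:x]


encoded = encode_alignment_row("KL-")
print(len(encoded), sum(encoded))  # 63 3

# Number of 10-position subsets drawn from 60 candidate positions.
print(comb(60, 10))  # 75394027566
```

The count of roughly 7.5 × 10^10 subsets at the upper limit agrees in order of magnitude with the figure quoted above, and shows why the search is only tractable when separating combinations are found at much smaller sizes.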
The expected classification accuracy is defined by the accuracy of the LSVM using the aligned proteome as input and 5-fold cross validation. Similar to the approach taken in earlier work on human specific markers, sequences in the multiple sequence alignment used for training the classifier were labeled either human or avian depending on the host, excluding the avian to human crossover samples (H5N1, H9N2, H7N7 and H7N3) from training and testing. The 2,026 human persistent strains and 1,018 avian strains were grouped by time, location and subtype, with representative samples chosen at random to yield 281 distinct human strains and 560 distinct avian strains. Classifier accuracy was estimated by randomly dividing the data set into 5 non-overlapping partitions. The classifier was trained on 4 of the partitions and accuracy was measured by the percentage of correct classifications on the fifth partition, with the percentage of correct classifications calculated separately for each class to account for the difference in class size. The average of all 5 tested non-overlapping partitions was calculated giving two accuracy values (one for each class) and the final accuracy measure was the average of these two values. The 34 pandemic conserved markers given in this report were required to be positively identified in every sequenced strain in each of the three pandemic outbreaks without deviation from the majority consensus. This led to three previously reported markers being excluded from this report for lack of conservation or positive identification (when an ambiguous sequence code was present) in one of the sequenced strains associated with the pandemic outbreaks.
The host specificity classifier misclassified 2 human and 2 avian strains for a classification accuracy of 99.5%. The classification errors appeared to be due to recent reassortment events that suggest the presence of influenza genomes that are a mix of both human and avian strains.
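The class-balanced accuracy measure and the reported 99.5% figure can be reproduced with a minimal sketch. The helper names are hypothetical, the sketch assumes every held-out partition contains at least one sample of each class, and the final arithmetic pools the errors over all partitions, which may differ slightly from averaging partition by partition.

```python
import random


def partition(items, k=5, seed=0):
    """Randomly split items into k non-overlapping partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def per_class_accuracy(true_labels, predicted_labels, label):
    """Fraction of samples with true class `label` that were classified correctly."""
    pairs = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)


def balanced_accuracy(folds, labels=("human", "avian")):
    """Average per-class accuracy over folds, then average over classes.

    folds: list of (true_labels, predicted_labels), one per held-out partition.
    """
    class_means = []
    for label in labels:
        scores = [per_class_accuracy(t, p, label) for t, p in folds]
        class_means.append(sum(scores) / len(scores))
    return sum(class_means) / len(class_means)


# Pooled check with the counts given in the text:
# 2 of 281 human and 2 of 560 avian strains misclassified.
pooled = (279 / 281 + 558 / 560) / 2
print(round(100 * pooled, 1))  # 99.5
```

Averaging the two per-class accuracies, rather than counting 4 errors in 841 strains, keeps the larger avian class from dominating the estimate.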
The high mortality rate data set was constructed using the same procedure as the host type dataset and the same 5-fold cross validation procedure was used to estimate accuracy. A total of 111 influenza genomes were classified as high-mortality rate strains and 2,001 were classified as low-mortality rate strains, with a non-redundant subset taken for training (35 high mortality rate, and 255 low mortality rate). The percentage of high and low mortality rate strains that were correctly classified was 96.2% and 96.9% respectively (an average of 96.6%). The lower accuracy for the high mortality rate classifier compared to the host type classifier likely highlights the genetic complexity associated with high mortality rate and the influence of other important factors such as host interaction.
Newly generated classifiers using only a small subset of the aligned proteomes as input were required to match the original classifier accuracy (99.5% for host type and 96.6% for high mortality rate type) within a margin of error defined by a confidence threshold. The confidence thresholds were defined by confidence intervals assuming 1 sided t-test comparisons using the standard deviation in the cross validation tests. Lowering the classification accuracy threshold allowed for the possibility of undetected reassortment events and other potential strain labeling errors (such as host interaction factors) that preclude perfect separation of class types.
Ten of the 13 previously reported pandemic conserved host specificity positions were found. The 3 remaining markers (702 PB2, 28 PA and 552 PA) were not predicted due to lack of conservation among the pandemic strains. The host specific mutations reported here but not in that earlier work are attributed to the use of mutation combinations to guide the search for new genetic markers. Two mutations of note not previously reported that gave at least a 5% increase in accuracy at the highest classification accuracy threshold (99.5%) were 400 PA and 70 NS1. The 400 PA human consensus amino acid was Leucine and 3% of the avian strains had Leucine, with the remainder split between Serine and Proline. In the case of 70 NS1, 99.6% of human samples had Lysine along with 23% of the avian strains. (The avian consensus amino acid was Glutamic acid.)
JEA was supported in part by an IC Postdoctoral fellowship. We thank Stephen P. Velsko for valuable discussions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.