In Silico identification of pathogenic strains of Cronobacter from Biochemical data reveals association of inositol fermentation with pathogenicity

Background: Cronobacter, formerly known as Enterobacter sakazakii, is a food-borne pathogen known to cause neonatal meningitis, septicaemia and death. Current diagnostic tests for identification of Cronobacter do not differentiate between species, necessitating time-consuming 16S rDNA gene sequencing or multilocus sequence typing (MLST). The organism is ubiquitous, being found in the environment and in a wide range of foods, although there is variation in pathogenicity between Cronobacter isolates and between species. The ability to differentiate between pathogenic and non-pathogenic strains is therefore of interest to the food industry and regulators.
Results: Here we report the use of Expectation Maximization clustering to categorise 98 strains of Cronobacter as pathogenic or non-pathogenic based on biochemical test results from standard diagnostic test kits. Pathogenicity of a strain was postulated on the basis of either pathogenic symptoms associated with strain source or corresponding MLST sequence types, allowing the clusters to be labelled as containing either pathogenic or non-pathogenic strains. The resulting clusters gave good differentiation of strains into pathogenic and non-pathogenic groups, corresponding well to isolate source and MLST sequence type. The results also revealed a potential association between pathogenicity and inositol fermentation. An investigation of the genomes of Cronobacter sakazakii and C. turicensis revealed that the gene for inositol monophosphatase is associated with putative virulence factors in pathogenic strains of Cronobacter.
Conclusions: We demonstrated a computational approach allowing existing diagnostic kits to be used to identify pathogenic strains of Cronobacter. The resulting clusters correlated well with MLST sequence types and revealed new information about the pathogenicity of Cronobacter species.


Background
Cronobacter, formerly known as Enterobacter sakazakii [1], is a bacterial genus containing seven species [2,3] in the family Enterobacteriaceae: C. sakazakii, C. malonaticus, C. muytjensii, C. turicensis, C. dublinensis, C. universalis, and C. condimenti. The organism has received considerable attention recently due to its association with neonatal infections, especially meningitis, necrotizing enterocolitis, septicaemia and subsequent death [4,5]. These bacteria have been isolated from a wide range of foodstuffs [6-8]; it is therefore important to be able to detect Cronobacter species in food. Several diagnostic tests exist for this purpose. However, most of these tests make no distinction as to the species of the bacteria. Not all Cronobacter species are known to be pathogenic to infants, and some can cause asymptomatic colonisation. The strict microbiological criteria for the presence of Cronobacter in powdered infant formula (< 1 Cronobacter cell/10 g) for intended age < 6 months [9] mean that it is of great interest to differentiate between pathogenic and non-pathogenic strains. Although a range of possible virulence features (e.g. ompA, adhesins, iron-uptake mechanisms) have been identified in Cronobacter and reviewed elsewhere [10], their presence does not correspond to clinical symptoms. Therefore, the identification of further discriminating factors would be useful. Currently, to differentiate between species, it is necessary to sequence either the 16S rRNA gene [11] or the MLST genes [12]; the latter is required for searching the Cronobacter MLST database [12,13]. There were 178 isolates of Cronobacter recorded in the MLST database [13] at the time of analysis (March 2011). Although it is known that sequence type 4 (ST 4) strains are associated with meningitis [14], neither of the above methods is able to differentiate between pathogenic and non-pathogenic strains; they only identify individual species.
Moreover, both methods are time consuming compared with the use of biochemical diagnostic test kits which take 4-18 hours to produce results that can easily be interpreted.
For this reason we aimed to develop methods for identifying which of the strains in the Cronobacter genus are pathogenic based on data obtained from standard biochemical diagnostic tests. These tests are those commonly used to identify Cronobacter isolates and are used in their taxonomic description [2,3,11]. Here we used the Expectation Maximization (EM) clustering algorithm to divide the data on the basis of the biochemical test results. Since the precise pathogenic status of most Cronobacter strains is unknown, we considered the resulting clusters as being pathogenic or not on the basis of (a) the source from which the strains were isolated and/or (b) MLST types previously associated with pathogenic or non-pathogenic strains [14] (see Materials and Methods). The clustering of the biochemical test results was also examined for traits associated with pathogenicity.

Results and Discussion
Clustering the dataset for Test 1 with the number of clusters set to 2 resulted in clusters 1 ($p_1 = 0.26$) and 2 ($p_2 = 0.74$) containing 25 and 65 strains respectively ($L = -3.119$; Table 1), where $p_i$ ($i = 1, 2$) is the probability of cluster membership for a randomly chosen strain and $L$ is the maximum log likelihood (see Materials and Methods). According to our hypothesis, cluster 2 was most likely to contain pathogenic strains since all ST 4 strains were assigned to this cluster. It is known that ST 4 strains are associated with the most serious pathogenic states such as meningitis in infants [14]. Of the other MLST types, ST 1 and 3 were placed exclusively with the potentially non-pathogenic strains in cluster 1. ST 7 was split between the two clusters, with 7 of 11 strains in the non-pathogenic grouping. All except one ST 8 strain were predicted to be in the pathogenic cluster, as were all of the ST 12 strains (Table 1). The group with unspecified clinical source (22 strains) was divided between the two clusters, indicating that not all clinical isolates are likely to be pathogenic and that isolation of a strain from a clinical sample does not, by itself, allow us to infer pathogenicity. For example, one clinical isolate classified as non-pathogenic was obtained from a breast abscess; it is plausible that this was a secondary infection, although it is not known whether another infectious agent was isolated. Thus this may indeed be a non-pathogenic strain. Two asymptomatic strains appeared in the pathogenic cluster; one of these strains is ST 12 and the other ST 13. Several ST 12 strains are from clinical sources and it is likely that all ST 12 strains will have similar pathogenic characteristics. Therefore, we can speculate that these strains could have caused an infection following a higher ingested dose or in a host with a lower immune status.
Clustering for the Test 2 dataset gave two clusters in which 84 strains (91% of the data) were in cluster 2 ($p_2 = 0.9$) and eight strains (9% of the data) were in cluster 1 ($p_1 = 0.1$, $L = -6.44$; Table 2). Cluster 1 included one strain associated with a clinical diagnosis (ST 31), which was likely to be pathogenic, as well as one ST 4 strain; the remainder were placed in cluster 2. The heterogeneity of MLST types in both clusters, as well as the small number of strains in cluster 1, suggests that the biochemical data in Test 2 is not sufficient to differentiate between pathogenic and non-pathogenic strains. To test this, the EM algorithm was allowed to determine automatically the number of clusters to which the data should be assigned (data not shown). Only a single cluster was produced, indicating that the Test 2 data is not sufficient to differentiate between Cronobacter strains.
Clustering of the Test 3 dataset (Table 3) resulted in cluster 1 containing 40 instances ($p_1 = 0.61$) and cluster 2 containing 25 instances ($p_2 = 0.39$, $L = -16.726$). The majority of the ST 4 strains were grouped in the second cluster, indicating that this cluster contains the potentially pathogenic strains. However, all other MLST types (with multiple strains available) were split between the two clusters. ST 1 was mostly placed in the non-pathogenic cluster, with one strain in cluster 2. ST 3 was split evenly (three in each) between the two clusters. Most of the ST 7 strains were found to be non-pathogenic, with just one strain being pathogenic. However, many strains indicated as pathogenic in the Test 1 results (and also Test 2) were placed in the larger, potentially non-pathogenic grouping. Based on the division of strains of the same MLST type between clusters, it is likely that the results of Test 3 are less accurate than those of Test 1 and Test 4 (see below), although many ST 1 and ST 4 strains appeared to be correctly assigned. Note that this test has the smallest number of strains available; it is expected that the availability of more data will greatly improve the results of clustering using this diagnostic test data.
For the fourth test, cluster 1 contained 33 strains ($p_1 = 0.44$) and cluster 2 contained 43 strains ($p_2 = 0.56$). The clusters are shown in Table 4 ($L = -2.598$). This clustering assignment was successful at differentiating between MLST types. ST 1 and 3 were placed entirely in the non-pathogenic grouping (cluster 1) and, with two exceptions (strains 552, 553), the ST 4 strains were placed in cluster 2, allowing us to label the latter as the potentially pathogenic cluster. All except two ST 7 strains (strains 515, 535) were placed in the non-pathogenic cluster. ST 8 strains were placed in the pathogenic cluster, as were all except one strain of ST 12 (strain 520). A more detailed examination of the strains allocated to each cluster showed that all strains labelled as pathogenic were positive for the inositol fermentation (Ino) test, whilst the prospective non-pathogenic strains were negative for this test. Although this is not conclusively shown by the result of the inositol test in Test 1 and Test 2, the Test 1 data does indicate a bias towards strains with inositol fermentation in the pathogenic cluster. This suggested that either inositol fermentation was a requirement for pathogenicity, or that the genetic locus conferring inositol fermentation was linked to genes conferring pathogenic traits. The latter interpretation was supported by the two apparently pathogenic ST 4 strains which were negative for inositol fermentation (strains 552 and 553): strain 552 was isolated from infant formula, but strain 553 was associated with neonatal meningitis, indicating pathogenesis. It is probable that the inositol fermentation gene was lost from these strains, but the pathogenic traits acquired alongside it remained. It should be noted that this test is different from the INO test in the Test 2 dataset, which we removed from the analysis as it produces the same result for all Cronobacter strains.

Consensus Clustering
Aggregating the clustering assignments based on the majority rule (two out of four) for the 48 strains which have data available from all four tests resulted in the clusters shown in Table 5. The results showed that the majority of ST 4 strains were placed in cluster 2. However, ST 1, 3 and 7 strains were still split between the two clusters, and only 10 of the 48 strains were placed in the non-pathogenic category. It was hypothesised that Test 2 could be skewing the results, as this test did not differentiate between strains of different MLST sequence types. We therefore excluded this test from the consensus clustering, allowing the 50 strains for which data was available from Tests 1, 3 and 4 to be analysed. Test 3 was retained since many ST 1 and ST 4 strains appeared to be correctly assigned. The results (Table 6) were similar to those for clustering with Test 4 alone. All strains of ST 1, 3 and 7 appeared in cluster 1 (the potential non-pathogenic grouping). With two exceptions (strains 552, 553), the ST 4 strains were grouped in cluster 2 (potentially pathogenic strains) along with the remainder of MLST types. The consensus clustering of the Tests 1, 3 and 4 datasets also showed the same correlation with inositol fermentation as the results for Test 4 alone.
The results of all four clustering analyses gave plausible assignments of the data into two clusters, one of which has the propensity of being pathogenic and the other of being non-pathogenic. The various MLST types were not divided equally between the clusters, as would be expected if assignment were due to chance alone. Our hypothesis that strains with similar biochemical characteristics will have similar pathogenicity seems to hold, since strains with pathogenic source isolates were grouped together throughout, although a small number of strains were placed in the non-pathogenic cluster whilst having putative pathogenic status. Strains with the same MLST type were generally grouped together, indicating, as might be expected, that strains with the same MLST type have similar biochemical characteristics.
Table 5. Consensus clustering generated from Tests 1-4 data.
To further investigate the association of inositol fermentation with pathogenicity, we examined the annotated genome of C. sakazakii BAA-894 [Genbank: CP000783] (strain 658) [15] for genes associated with inositol fermentation. Whilst BAA-894 is ST 1 and negative for inositol fermentation, this strain was isolated from powdered formula associated with a clinical outbreak [15] and is therefore likely to be a pathogenic strain. The gene coding for inositol monophosphatase [Genbank: ESA_00718, EC:3.1.3.25], which is annotated in the KEGG database [16] as part of the inositol phosphate metabolism pathway [KEGG: esa00562], was found in close proximity (approx 41 kb upstream) to a predicted protein [Genbank: ESA_00756] which has been identified in the BAA-894 genome and found in two other meningitic strains of C. sakazakii (strains 701, 767) by hybridization with the BAA-894 genome [15]. Strains 701 and 767 are ST 4 and were associated with fatal outbreaks, indicating this as a putative virulence factor. This was also found to be in close proximity to the zinc-containing metalloprotease locus characterized by Kothary et al. [17]. Also, at a distance of approximately 82 kb upstream, was a prophage fragment, GR3 [Genbank: ESA_00604-ESA_00630], which contains genes homologous to the Yersinia pseudotuberculosis adhesion pathogenicity island, as well as genes identified in strains 701 and 767 and the reference genome [Genbank: BAA-894]. Despite BAA-894 being deficient for inositol fermentation, the proximity of these genes to inositol monophosphatase and their implication as putative virulence factors suggest that the inositol monophosphatase gene is associated with pathogenesis and support our hypothesis that inositol fermentation is linked to the pathogenicity of Cronobacter species.
The lack of inositol fermentation in BAA-894 may be explained by the loss of another gene, as yet unknown, which also plays a crucial role in the inositol phosphate metabolism pathway. The genome of a C. turicensis strain [Genbank: FN543093-FN543096, ST 19, strain 1211] has also been sequenced [18]. No biotyping data exists for C. turicensis strains. However, the original characterisation of the C. turicensis species [2] showed that C. turicensis is positive for inositol fermentation and the C. turicensis strain sequenced contains the inositol monophosphatase gene associated with pathogenesis. The majority of C. turicensis strains were placed in the pathogenic cluster in Tests 1 and 2, but not in Test 3 (no data on C. turicensis is available for Test 4). The sequenced strain 1211 was pathogenic in Tests 1 and 2 (Tables 1 and 2).
Our clustering method has demonstrated that it is possible to differentiate rapidly between pathogenic and non-pathogenic strains, and may lead to a simple diagnostic test that can reliably identify pathogenic strains.

Conclusions
Here we have used Expectation Maximization clustering to divide strains of Cronobacter into groups of pathogenic and non-pathogenic strains based on the results of diagnostic biochemical tests. The clustering assignments showed promise, clearly dividing the data into two clusters containing obviously pathogenic and non-pathogenic strains, based on the source of isolate and the MLST type of the strain. However, further experiments characterising the pathogenicity of Cronobacter strains are required to confirm the accuracy of the classification. Nevertheless, our results demonstrated a clear association between pathogenic strains and inositol fermentation, supported by genomic proximity of putative virulence factors to the gene coding for inositol monophosphatase.

Sources of bacterial strains
A total of 98 Cronobacter strains were analyzed in this study. Strains were from diverse food, clinical and environmental sources worldwide. The following species of Cronobacter were included: C. sakazakii NCTC 11467T, C. malonaticus LMG 23826T, C. turicensis LMG 23827T, C. muytjensii ATCC 51329T, C. dublinensis LMG 23823T and C. universalis NCTC 9529T. Strains were kindly donated by the following organizations: Health Products and Food Branch (Health Canada); CDC (Atlanta, USA); Children's Hospital (Los Angeles CA, USA); Northern Foods (UK); Oxoid ThermoFisher Ltd. (Basingstoke, UK); Hospital České Budějovice (Czech Republic); Institut für Tierärztliche Nahrungsmittelkunde Milchwissenschaften (Justus-Liebig-Universität Gießen, Germany); Nottingham City Hospital Trust (Nottingham, UK) and the Department of Medical Microbiology, Radboud (Nijmegen, Netherlands). All other strains were food and environmental isolates from the culture collection at Nottingham Trent University (Nottingham, UK) [19].

Dataset
We examined results from four sets of diagnostic tests carried out on a total of 98 strains encompassing six species of Cronobacter. For a complete list of strains used in this work and their details see Additional File 1 and references [1-3,15,18,20-28]. Each test comprises a series of enzyme assays which produce a colour change recorded by the user. Bacterial species can then be identified by a characteristic series of changes in colour. All tests were carried out in accordance with the manufacturers' instructions and replicated three times; biotyping was performed as in [1]. The tests were those commonly used in the identification of Cronobacter species, and in taxonomic descriptions of the genus [2,3,12,19].
The four tests were:

Test 1
API 20 E (bioMérieux SA; Marcy-l'Etoile, France) [29] consists of 20 enzyme assays scored as positive or negative. The assays are in the form of a strip of 20 cupules, each containing a dehydrated substrate to which the reagents are added; for details of the specific tests see [29] and the manufacturers' instructions. Gram negative bacterial species are identified by comparison to an online database.

Test 2
ID 32E (bioMérieux SA; Marcy-l'Etoile, France) [30] consists of 32 miniaturised enzyme assays with positive or negative scores. These assays can be measured either manually or automatically, and Gram negative bacterial species are identified by comparison to an online database.

Test 3
API Zym (bioMérieux SA; Marcy-l'Etoile, France) [31] consists of 20 cupules with 19 enzyme assays and one control. The assays produce a coloured response which is scored in intensity between 0 and 5.
[13] as well as source, geographical location and date of isolation. These attributes were removed for the purpose of clustering but were used to label the data afterwards. The result of each enzyme assay was represented categorically. In the case of Tests 1, 2 and 4 this was 0 or 1 for a negative or positive result respectively, a positive result being one which shows activity for the enzyme in the sample. Test 3 had categories ranging from 0 to 5, where 0 indicates no reaction and categories 1-5 indicate a range of positive responses, with 5 being the strongest. Thus, each strain from each dataset was represented by a vector of attributes, with each attribute containing the result of one of the enzyme assays in the corresponding test.

Features used
The enzyme assays used in this study were not designed to discriminate between species or genotypes of Cronobacter. In all four tests there were assays where all (or almost all) strains were reported as producing the same result, either positive or negative. Attributes for which all strains produced the same result (Tests 1, 2 and 4) or occupied a single category (Test 3) were removed from the list of features used for clustering. The features from each test used to perform clustering are listed in Table 7.
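The feature-removal step described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `informative_features` and the three-strain example are hypothetical, and the assay values are made up for demonstration.

```python
from typing import Dict, List


def informative_features(rows: List[Dict[str, int]]) -> List[str]:
    """Return the assay names whose result differs between at least two strains."""
    if not rows:
        return []
    names = list(rows[0].keys())
    # An assay is kept only if more than one distinct value was observed.
    return [name for name in names if len({row[name] for row in rows}) > 1]


# Hypothetical results for three strains (NOT data from the study).
example_strains = [
    {"ONPG": 1, "INO": 1, "H2S": 0},
    {"ONPG": 1, "INO": 0, "H2S": 0},
    {"ONPG": 1, "INO": 1, "H2S": 0},
]

kept = informative_features(example_strains)
print(kept)
```

The same function works unchanged for graded scores such as the 0 to 5 categories of Test 3, since it only checks whether more than one distinct value occurs.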

EM Clustering
Clustering was carried out using the Expectation Maximization (EM) algorithm [32] implemented in the Weka machine learning package [33]. Whilst we give a brief explanation of the algorithm here, the reader is advised to consult the reference for full details. The algorithm operates by using Gaussian mixture models to estimate the maximum likelihood of membership in a cluster. In Gaussian mixture models the data is drawn from a mixture of $k$ Gaussian distributions with mean $\mu_i$ and standard deviation $\sigma_i$ ($1 \le i \le k$). The algorithm begins by randomly selecting the parameters $\mu_i$ and $\sigma_i$ and computing the probability of cluster membership for each data point based on the probability density function defined by $\mu_i$ and $\sigma_i$. The distribution parameters are then re-estimated, the cluster membership is recomputed, and these steps are repeated until a termination threshold is reached and/or the procedure converges to a local maximum of the likelihood function. For the two-component mixture model used in this study, the resulting probabilities of a random strain being in class 1 and 2 were denoted by $p_1$ and $p_2$ respectively ($p_1 + p_2 = 1$), and the maximum of the log-likelihood estimate was denoted by $L$.
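The iteration described above can be sketched in a few lines of Python. This is an illustrative sketch only and is not Weka's implementation: it assumes independent attributes within each cluster, initialises the means at randomly chosen data points, and reports the log likelihood averaged per strain, all of which are assumptions made here. The function names, the tolerance `tol` and the toy data are hypothetical.

```python
import math
import random
from typing import List, Tuple

MIN_STD = 1.0e-6  # lower bound on each standard deviation
MAX_ITER = 100    # upper bound on the number of EM iterations


def log_gauss(x: float, mu: float, sd: float) -> float:
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mu) / sd) ** 2


def em_cluster(data: List[List[float]], k: int = 2, tol: float = 1e-6,
               seed: int = 1) -> Tuple[List[float], List[int], float]:
    """Return (mixing weights, hard labels, mean log likelihood)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Assumed initialisation: means at k random points, overall spread as std.
    means = [list(row) for row in rng.sample(data, k)]
    spread = []
    for j in range(d):
        col = [row[j] for row in data]
        m = sum(col) / n
        spread.append(max(math.sqrt(sum((x - m) ** 2 for x in col) / n), MIN_STD))
    stds = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k
    prev_ll, ll, resp = -math.inf, -math.inf, []
    for _ in range(MAX_ITER):
        # E-step: posterior probability of each cluster for each strain.
        resp, ll = [], 0.0
        for row in data:
            logs = [math.log(weights[c])
                    + sum(log_gauss(row[j], means[c][j], stds[c][j]) for j in range(d))
                    for c in range(k)]
            top = max(logs)
            norm = top + math.log(sum(math.exp(v - top) for v in logs))
            resp.append([math.exp(v - norm) for v in logs])
            ll += norm
        # M-step: re-estimate weights, means and standard deviations.
        for c in range(k):
            total = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = total / n
            for j in range(d):
                mu = sum(r[c] * row[j] for r, row in zip(resp, data)) / total
                var = sum(r[c] * (row[j] - mu) ** 2 for r, row in zip(resp, data)) / total
                means[c][j], stds[c][j] = mu, max(math.sqrt(var), MIN_STD)
        if ll - prev_ll < tol:  # stop once the likelihood no longer improves
            break
        prev_ll = ll
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, labels, ll / n


# Hypothetical binary assay results (NOT data from the study).
toy_data = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
weights, labels, mean_ll = em_cluster([[float(v) for v in row] for row in toy_data])
print(weights, labels, mean_ll)
```

The returned weights play the role of $p_1$ and $p_2$. As with any EM run, the result depends on the random initialisation, so different seeds can converge to different local maxima.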
The following initial parameters for the EM algorithm were used: the maximum number of iterations was set to 100, the minimum standard deviation was set to 1.0E-06, and the number of clusters was set to 2. The number of clusters was pre-specified for all experiments in this work; we set the number of clusters to two as we were seeking to split the data into pathogenic and non-pathogenic groups. Evaluation of the pathogenicity of the resulting clusters was somewhat subjective since the pathogenic status of the majority of Cronobacter strains was not known. However, some samples were clearly the source of pathogenic effects such as meningitis or septicaemia. There was evidence that strains with MLST sequence type (ST) 4 cause the most severe infections [14]. This was supported by the fact that eight of the ten strains with a clinical diagnosis were ST 4. Since strains with similar biochemical properties are likely to have similar pathogenic status, it was hypothesised that if the majority of ST 4 strains are placed into one cluster then this cluster is likely to be pathogenic, whereas the remaining cluster is likely to be non-pathogenic. Therefore we designated the cluster with the largest number of ST 4 strains as pathogenic.
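The labelling rule in the last sentence above can be expressed compactly. The sketch below is an illustration under stated assumptions: the function name `pathogenic_cluster` and the example assignments are hypothetical, and the behaviour when two clusters hold the same number of ST 4 strains (the first cluster encountered wins) is a choice made here, since the text does not specify it.

```python
from collections import Counter
from typing import List, Optional


def pathogenic_cluster(assignments: List[int],
                       sequence_types: List[Optional[int]]) -> int:
    """Return the index of the cluster containing the most ST 4 strains."""
    st4_counts = Counter(
        cluster for cluster, st in zip(assignments, sequence_types) if st == 4
    )
    if not st4_counts:
        raise ValueError("no ST 4 strains present; the clusters cannot be labelled")
    return st4_counts.most_common(1)[0][0]


# Hypothetical cluster assignments and MLST types (None = type unknown).
example_clusters = [0, 1, 1, 1, 0, 1]
example_types = [1, 4, 4, 8, 4, None]

target = pathogenic_cluster(example_clusters, example_types)
strain_labels = [
    "pathogenic" if c == target else "non-pathogenic" for c in example_clusters
]
print(target, strain_labels)
```

Only the ST 4 strains influence which cluster is designated pathogenic; every other strain simply inherits the label of the cluster it was assigned to.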
Since it is reasonable to assume that similar MLST types will have similar levels of pathogenicity, the spectrum of MLST types in each cluster is a good indicator of the accuracy of the assignment, and takes into account factors such as differences between species of Cronobacter. To date only a few plausible virulence features have been identified, such as ompA, adhesins, and iron-uptake mechanisms, many of which are distributed across the seven Cronobacter species [10].

Consensus clustering
Consensus clustering was carried out to combine the results generated by the four tests. It was hypothesised that the consensus clustering would result in a more accurate classification of strains into the appropriate cluster. The four clustering assignments were combined by giving each assignment one vote, with the majority determining the cluster assignment of each strain. Any tie (i.e. two of four votes for each cluster) resulted in the strain being placed in the pathogenic cluster; this decreased the probability of missing a pathogenic strain while increasing the risk of a false positive. This was accepted as a good compromise, since missing a pathogenic strain has more serious consequences than misclassifying a non-pathogenic strain as pathogenic. The consensus clustering was carried out on the 48 strains for which data for all four diagnostic tests was available.
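The voting scheme described above, including the tie-breaking rule, can be sketched as follows. The function name `consensus_label`, the strain identifiers and the votes are hypothetical and serve only to illustrate the rule.

```python
from typing import Dict, List

PATHOGENIC = "pathogenic"
NON_PATHOGENIC = "non-pathogenic"


def consensus_label(votes: List[str]) -> str:
    """Majority vote over per-test labels; a tie is resolved as pathogenic."""
    n_pathogenic = sum(vote == PATHOGENIC for vote in votes)
    n_non_pathogenic = len(votes) - n_pathogenic
    # ">=" implements the rule that ties go to the pathogenic cluster.
    return PATHOGENIC if n_pathogenic >= n_non_pathogenic else NON_PATHOGENIC


# Hypothetical per-test labels for three strains (NOT data from the study).
per_test_votes: Dict[str, List[str]] = {
    "strain_A": [PATHOGENIC, PATHOGENIC, NON_PATHOGENIC, NON_PATHOGENIC],  # tie
    "strain_B": [NON_PATHOGENIC, NON_PATHOGENIC, NON_PATHOGENIC, PATHOGENIC],
    "strain_C": [PATHOGENIC, NON_PATHOGENIC, PATHOGENIC, PATHOGENIC],
}

consensus = {strain: consensus_label(v) for strain, v in per_test_votes.items()}
print(consensus)
```

With an odd number of tests, for example when only three tests are combined, ties cannot occur and the rule reduces to a plain majority vote.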

Additional material
Additional File 1: Cronobacter strains. Strains used in this study including source of isolation, MLST Type, references and which experiments they were used in.