- Research article
- Open Access
Genotype-phenotype matching analysis of 38 Lactococcus lactisstrains using random forest methods
https://doi.org/10.1186/1471-2180-13-68
© Bayjanov et al.; licensee BioMed Central Ltd. 2013
- Received: 23 November 2012
- Accepted: 20 March 2013
- Published: 26 March 2013
Abstract
Background
Lactococcus lactis is used in dairy food fermentation and for the efficient production of industrially relevant enzymes. The genome content and different phenotypes have been determined for multiple L. lactis strains in order to understand intra-species genotype and phenotype diversity and annotate gene functions. In this study, we identified relations between gene presence and a collection of 207 phenotypes across 38 L. lactis strains of dairy and plant origin. Gene occurrence and phenotype data were used in an iterative gene selection procedure, based on the Random Forest algorithm, to identify genotype-phenotype relations.
Results
A total of 1388 gene-phenotype relations were found, of which some confirmed known gene-phenotype relations, such as the importance of arabinose utilization genes only for strains of plant origin. We also identified a gene cluster related to growth on melibiose, a plant disaccharide; this cluster is present only in melibiose-positive strains and can be used as a genetic marker in trait improvement. Additionally, several novel gene-phenotype relations were uncovered, for instance, genes related to arsenite resistance or arginine metabolism.
Conclusions
Our results indicate that genotype-phenotype matching by integrating large data sets provides the possibility to identify gene-phenotype relations, possibly improve gene function annotation and identified relations can be used for screening bacterial culture collections for desired phenotypes. In addition to all gene-phenotype relations, we also provide coherent phenotype data for 38 Lactococcus strains assessed in 207 different phenotyping experiments, which to our knowledge is the largest to date for the Lactococcus lactis species.
Keywords
- Lactis
- Metal Resistance
- Plasmid Gene
- Random Forest Algorithm
- Nisin Production
Background
Lactococcus lactis – a low-GC Gram-positive model organism, found frequently in both dairy and non-dairy [1] environments, has been extensively studied due to its industrial importance. Major focus of these studies has been on dairy isolates, of which the genomes of three isolates have been sequenced [2–4]. Plant isolates compared to dairy isolates show higher stress-tolerance and have more extensive fermentative abilities [5]. Due to their larger genetic and metabolic repertoire non-dairy isolates of L. lactis are therefore of interest in dairy food fermentation [6]. Strains used in dairy starter cultures have presumably evolved from plant strains, where some metabolic capabilities were lost in order to adapt to dairy environments [7]. Recently, the genome of ssp. lactis strain KF147 was fully sequenced [8] and that of strain KF282 was partially sequenced [9]. These two plant L. lactis isolates were reported to possess many genes related to uptake of plant cell-wall degradation products such as arabinose and xylose [9]. Many genes present in these two isolates are new and do not have homologs in the three L. lactis strains IL1403, MG1363 and SK11 of dairy origin [9]. Recently, the genomes of several other L. lactis strains have also been fully sequenced [10–13]. Furthermore, many L. lactis strains were reported to have plasmids, enriching the genotypic and phenotypic repertoire of this species [3, 14]. L. lactis strains isolated from different niches have been reported to have high genomic sequence divergence [15–17], also at the subspecies level [18]. Their gene content partly reflects their phenotypic properties such as niche adaptation [9, 16, 18].
Genotype and phenotype information for 38 L.lactis strains that was used in genotype-phenotype matching
Strain name | Subspecies | Isolation origin | # present genes (out of 4026) | # phenotyping experiments (out of 130a) |
---|---|---|---|---|
AM2 | cremoris | dairy | 2563 | 119 |
ATCC19435T | lactis | dairy | 2047 | 121 |
DRA4 | lactis | dairy | 2182 | 123 |
E34 | lactis | plant | 2022 | 123 |
FG2 | cremoris | dairy | 2301 | 117 |
HP | cremoris | dairy | 2307 | 122 |
IL1403 | lactis | dairy | 2289 | 127 |
K231 | lactis | plant | 2067 | 124 |
K337 | lactis | plant | 2002 | 126 |
KF134 | lactis | plant | 2039 | 128 |
KF146 | lactis | plant | 2087 | 130 |
KF147 | lactis | plant | 2472 | 126 |
KF196 | lactis | plant | 1978 | 126 |
KF201 | lactis | plant | 2020 | 125 |
KF24 | lactis | plant | 2119 | 128 |
KF282 | lactis | plant | 1937 | 127 |
KF67 | lactis | plant | 2096 | 128 |
KF7 | lactis | plant | 2109 | 125 |
KW10 | cremoris | plant | 2039 | 126 |
LMG14418 | lactis | dairy | 2259 | 113 |
LMG6897T | cremoris | dairy | 2308 | 113 |
LMG8520 | hordniae | insect | 1903 | 113 |
LMG8526 | lactis | plant | 1985 | 123 |
LMG9446 | lactis | plant | 1983 | 125 |
LMG9449 | lactis | plant | 2221 | 125 |
Li-1 | lactis | plant | 2198 | 126 |
M20 | lactis | plant | 2090 | 121 |
MG1363 | cremoris | dairy | 2397 | 125 |
ML8 | lactis | dairy | 2339 | 123 |
N41 | cremoris | plant | 2405 | 121 |
N42 | lactis | plant | 2361 | 125 |
NCDO763 | cremoris | dairy | 2414 | 126 |
NCDO895 | lactis | dairy | 2285 | 124 |
P7266 | lactis | plant | 1917 | 126 |
P7304 | lactis | plant | 2223 | 127 |
SK11 | cremoris | dairy | 2551 | 119 |
UC317 | lactis | dairy | 2280 | 125 |
V4 | cremoris | dairy | 2313 | 113 |
Results
Strain similarity based on phenotypes
Experiments grouped based on experimental conditions
Group name | Number of experiments | Description |
---|---|---|
Growth on sugar | 16 | Contains phenotypes based on 50CH API experiments |
Antibiotic resistance | 18 | Contains phenotypes based on antibiotic resistance experiments |
Metal resistance | 17 | Contains phenotypes based on metal resistance experiments |
Growth on milk or polysaccharides | 11 | Contains phenotypes based on growth on milk or polysaccharides |
Other experiments | 10 | Contains phenotypes based on all remaining experiments, which include growth test on medium with nisin, arginine hydrolase, salt or different enzymes. |
Genotype-phenotype matching
Integrated analysis using an iterative gene selection allowed identification of gene-phenotype relations that could not be found by studying genotype and phenotype data separately. In genotype-phenotype matching, we used the presence/absence of 4026 ortholog groups (OGs; see Methods) in 38 L. lactis strains (Table 1) determined by comparative genome hybridization (CGH) as genotype data. These 38 strains are a subset of a large representative collection of L. lactis trains that covers genotype, niche and phenotype diversity of L. lactis species [15]. For phenotype data, we used phenotypic measurements of these strains in 207 experiments that were previously assessed in separate studies (see Methods and Additional file 1). After pre-processing, phenotype data from 130 experiments was usable for genotype-phenotype matching (see Methods). Only associations of genes to accurately classified phenotypes (see Methods) were considered in further analysis, which resulted in 140 phenotypes, assessed in 74 different experiments.
Many gene-phenotype relations were identified: a total of 1388 OGs or on average 565 genes per reference strain were identified to be related to at least one of these 140 phenotypes. In the present study, we focussed on gene clusters consisting of at least two phenotype-related genes that are in close genomic proximity (e.g., in operons; see Methods). Transposases, integrases and phage proteins were also removed, because relations between these proteins and phenotypes are likely to be spurious. Discarding above-mentioned genes decreased the percentage of phenotype-related genes by about 50% on average. In analyzing gene clusters, we first considered gene clusters of which their presence relates to a positive trait (e.g., growth) and absence relates to a negative trait (e.g., no growth). There were also many gene clusters with inverse patterns, where an absence of a gene cluster leads to a positive trait. An inverse relationship between genes and phenotypes might indicate that in the absence of a regulator, genes previously inhibited by this particular regulator can become active, which in turn might lead to a positive trait (e.g., survival of a strain). In the supplementary data we provide all identified relations including inverse relations (see genotype-phenotype relations in an Additional file 2 that contains a mini-website).
Genes related to carbohydrate utilization
Integration of gene significance with its presence/absence. A gene that is present in at least 75% of strains of a phenotype is assumed to be predominantly present and a gene that is absent in at least 75% of strains of a phenotype is assumed to be predominantly absent; otherwise a gene is assumed to be present in a subset of strains. Gene-phenotype relations were visualized by integrating each gene’s phenotype importance with its predominant presence/absence in strains of this particular phenotype, whereas in visualizing gene-strain relations gene’s contribution score and presence/absence in a corresponding strain were used.
L. lactis KF147 gene clusters correlated to growth on the sugars arabinose, melibiose and sucrose. Colours represent strength of relationship between a gene and a phenotype (Figure 1). Phenotypes are either shown as last digits in column names or with suffixes “high” or “low”, where 0 indicates there is no growth and other numbers indicate different growth levels in different experiments as described in the Additional file 1. Here “high” and “low” phenotypes indicate high and low growth levels, respectively. For gene annotations see Additional file 3.
A large cluster of 11 genes (Figure 2) was found to be related to growth on melibiose, a plant disaccharide, but not to any of the other carbohydrates tested. This confirms an earlier observation that strain KF147 can utilize this disaccharide while 3 other strains IL1403 (dairy), SK11 (dairy) and KF282 (plant) strains cannot grow on melibiose [9, 26]. We also investigated whether a genomic region that encompasses these genes was deleted in melibiose-negative strains, because chromosomal deletion of a 12 kb region in Streptococcus mutans strains leads to melibiose-negative phenotype [27, 28]; this 12 kb region contains orthologs of LLKF_2260-2262 of strain KF147. Because tiling pan-genome CGH arrays were used to identify gene occurrence in these strains, deletion of a genomic region in query strains can be determined (see Methods). Therefore, we visualized a small genomic region of approximately 20 Kb (see Additional file 2) that covers the starting position of LLKF_2250 and the end position of LLKF_2270 on the KF147 genome. This region encompasses all these 11 genes and several more genes. Indeed, we also observed that this large 20 Kb region was deleted or absent in all melibiose-negative strains from both plant and dairy origin (see Additional file 2). Probably, only 10 genes consecutively located in a 15 Kb region (corresponding to genes LLKF_2259-LLKF_2269 in strain KF147) are necessary for growth on melibiose.
Genes related to metal resistance
Genes related to metal resistance. A) Genes correlated to copper resistance were found on plasmids C and D of L. lactis SK11. B) L. lactis MG1363 genes that were found to be correlated to arsenite resistance. C) Gene-to-strain relations for L. lactis MG1363 genes shown in B. Colours represent strength of relationship (Figure 1) between a gene and a phenotype for A and B, but between a gene and a strain for C. Phenotypes are shown as the final digits in column names, where 0 indicates there is no resistance and other numbers indicate different resistance levels in different experiments as described in the Additional file 1. For gene annotations see Additional file 3.
Genes related to arginine metabolism
Genes related to arginine metabolism. A) Two clusters of L. lactis IL1403 genes related to arginine metabolism. B) A L. lactis MG1363 gene cluster correlated to arginine metabolism. Colours represent strength of relationship between a gene and a phenotype (Figure 1). Phenotypes are either shown as last digits in column names or with suffixes “high” or “low”, where 0 indicates no growth and other numbers indicate different growth levels as described in the Additional file 1. Here “high” and “low” phenotypes indicate high and low enzyme activity levels, respectively. For gene annotations see Additional file 3.
Plasmid genes related to phenotypes
Genes correlated to growth on lactose were found on plasmid D of L. lactis SK11. Colours represent strength of relationship between a gene and a phenotype (Figure 1). Phenotypes are either shown as last digits in column names or with suffixes “high” or “low”, where 0 indicates there is no growth and other numbers indicate different growth levels in different experiments as described in the Additional file 1. Here “high” and “low” phenotypes indicate high and low growth levels, respectively. For gene annotations see Additional file 3.
Partial gene-phenotype relations
For each experiment category several (on average 9) partial relations between gene clusters and phenotypes, where a gene is present in only a subset of strains with a particular phenotype (Figure 1), were identified. Most of these gene clusters contain only two genes and were often found to be relevant to a negative trait (e.g.: no-growth). As example we present partial relations between a cluster of four genes of strain MG1363 (and their orthologs in query strains) and arsenite resistance (Figure 3B). These genes were found to be relevant for strains growing at 0.9625 mM of arsenite and are present in most of the highly resistant strains. However, some of these genes are only present in a subset of strains with no or mild resistance (Figure 3B). Visualizing occurrence of these genes in strains revealed that they are mostly absent in strains with no arsenite resistance phenotype and mostly present in strains with mild or high arsenite resistance phenotypes (Figure 3C).
Discussion
Genotype-phenotype association analysis of 38 L. lactis strains by integrating large genotype and phenotype data sets allowed screening of gene to phenotype relations. Only the top 50 genes per phenotype were selected as important (see Methods), because probably most relevant genes related to a phenotype should be among these 50 genes and their correlated genes. Indeed, only less than 1% of phenotypes had 50 or more related genes in the top list. Furthermore, identified relations were visualized by integrating each gene’s occurrence with its phenotype importance, which allows a quick screening of many relations. However, some relations could be due to an indirect effect of other factors that were not taken into account. For example, the anti-correlation between sucrose and lactose metabolism could be a bias resulting from starter-culture selection programmes, where often bacteriocin-negative strains were selected that could have led to selection of strains that can use lactose instead of sucrose. Additionally, for some phenotypes we could not find many related genes, for example, well-known arginine-metabolism related genes were not found as relevant to metabolism of arginine. Therefore, we analyzed all OGs with gene members containing a word ‘arginine’ in their annotation and genes of the arginine deiminase pathway (arcABCD). However, all these genes were either present in all or in at least 36 out of 38 strains, and such genes are removed in the pre-processing step of PhenoLink, because they are not capable to separate strains with different phenotypes (see Methods). We described a few examples where the annotation of genes could be refined and a few cases where new functions are suggested for genes with unknown functions. We were able to pinpoint only a few novel relations, but analyzing all identified gene-phenotype relations in detail should allow finding even more novel relations and refining annotations of more genes.
Genotype-phenotype matching allows comprehensive screening for possible relations between genes and phenotypes. We had data for 38 strains and, thus, there were relatively few strains with a given phenotype and in some experiments many strains manifested the same phenotype. Therefore, few partial gene-phenotype relations were identified in this study. More partial relations could be identified using data sets based on more L. lactis strains, which would allow finding analogous genes that have similar function but different sequences.
Even with DNA sequencing prices dropping, determining the gene content of dozens of strains by genome sequencing could still be costly. Pan-genome arrays allow querying occurrence of genes in multiple strains more cost-effectively, but genes absent in reference sequences and strongly divergent genes would be missed. Though the presence/absence data can be linked to phenotypes, it cannot account for effects of regulatory control or post-translational modifications. Thus putative gene-phenotype relations should be experimentally tested by high-throughput techniques such as gene expression analysis.
Annotating genes of a genome is essential in understanding the genomic properties of any strain. Gene annotation is often based on sequence similarity, so mistakes in annotating a single gene could propagate to genes of different organisms through annotation by sequence similarity. Therefore identified gene-phenotype relations should be experimentally validated and linked to other information sources such as pathway information. This would allow decreasing error propagation introduced by sequence similarity based gene function prediction approaches. Genotype-phenotype matching results show that the largest group of proteins related to phenotypes was hypothetical proteins indicating that gene annotations could still be improved for all 4 reference strains. Genomes of more bacterial strains are sequenced on a daily basis, which shows the critical importance of accurate gene function prediction. Identified gene-phenotype relations would allow more accurately determining functions of many genes, and hence better understanding of genotype- and phenotype-level differences among 38 L. lactis strains. We provide all identified relations as well as complete genotype and phenotype data set (see Additional files). This data set not only serves as a collection of leads to phenotypes, but due to large data size could also be used to test different association methods.
Conclusions
Lactococcus lactis has been extensively studied due to its industrial importance. Here we provide a coherent genotype and phenotype dataset and its interpretation for the Lactococcus species. We integrated for 38 L. lactis strains their genotypic measurements as well as phenotypes derived from 207 different experiments (see Methods) to identify gene-phenotype relations. Our results are publicly available (see also Additional files) and contains many leads into Lactococcus species-wide genotype-phenotype relations that can further be analysed and experimentally validated. These relations could be used to refine functions of genes. As new genome sequences emerge frequently, this would allow annotating gene functions for these new genomes more accurately and predicting phenotypes of new strains based on their DNA sequence.
Methods
Strains
For genotyping, a total of 39 L. lactis strains were selected from 91 L. lactis strains of which several phenotype and genotype properties were previously assessed [15]. These strains were isolated from plant and dairy niches and belong to 3 different subspecies: lactis (28 strains), cremoris (10 strains) and hordniae (one strain). These strains represent the genotype, niche and phenotype diversity of the L. lactis species [15]. Phenotypic properties of the strain NIZOB2244B were not assessed; therefore, 38 strains were used in genotype-phenotype matching (see Table 1).
Phenotypic diversity tests
Strains were incubated in 96-well micro-plates in quadruplicate in 250 μl M17 broth (Oxoid Ltd., Basingstoke, Hampshire, England) supplemented with 1% glucose (wt/vol) (GM17). Medium was supplemented either with different concentrations of NaCl; nisin (Sigma Chemical, St Louis, USA); metals; antibiotics; or polysaccharides (see Additional file 1). The plates were incubated overnight at 30°C [31].
For incubation of strains in GM17 medium different temperatures (4, 17, 30, 37 or 45°C) were used. Strains were also incubated in several other media: skimmed milk, skimmed milk supplemented with 0.5% yeast extract (Difco, Becton, Dickinson and company, Sparks, USA) and MRS-broth (Merck KGaA, Germany). Fermentation tests of arginine hydrolase activity, 50 different sugars and citrate were performed as reported previously [15]. Activity of several enzymes, i.e. branched chain aminotransferase, alpha-hydroxyisocaproic acid dehydrogenase, aminopeptidase N, cystathionine β lyase, X-prolyl dipeptidyl aminopeptidase and esterase in strains growing on GM17-broth or CDM-media, were previously assessed [32, 33]. More information about phenotyping experiments and results of these experiments are available in an Additional file 1.
Genotype data
The gene content of L. lactis strains was previously determined by pan-genome CGH arrays, where tiling array probes were based on chromosomal, plasmid and single gene or operon DNA sequences of this species as described in [34]. Next to probes targeting all known genes within Lactococcus sp. [35] we additionally targeted intergenic regions. However, in this study, we did not use the probes targeting intergenic regions. We grouped orthologous genes into ortholog groups (OGs); bidirectional orthologous relations among genes of four fully sequenced strains were identified by pair-wise comparisons using InParanoid [36] with default parameters [34]. The genomes used were from L. lactis strains ssp. lactis IL1403, ssp. lactis KF147, ssp. cremoris SK11 and ssp. cremoris MG1363. MG1363 replaces the incomplete chromosomal sequence of KF282 strain that was used in the array design [34]. Genes with inconsistent bidirectional orthologous relations and plasmid genes of plasmid-containing strains (SK11 and KF147) were each treated as a separate OG containing a single gene. In total, 4026 OGs were created of which 149 specified single plasmid genes. For a gene member of an OG scoring the signal intensities of aligned probes determines its presence/absence in a query strain [34]. We used the PanCGHweb web-tool to find presence/absence of OGs in these strains [37].
Visualizing and identifying presence or absence of a genomic segment
Presence or absence of contiguously located genes (i.e. a gene cluster) in a query strain indicates that the whole genomic region encompassing these genes is present or absent in this particular strain. Therefore presence or absence of a genomic segment in a query strain compared to a reference strain was identified. To this end, probes aligning to a genomic region of interest in a reference strain were identified. The log ratio of probe signals in a query strain to the reference strain was visualized to identify presence or absence of a genomic region in a query strain.
Data pre-processing
In PhenoLink, genotype and phenotype data are pre-processed before using them in genotype-phenotype matching analysis. PhenoLink is based on the Random Forest algorithm [38]. In random forest classification, trees are trained based on random selections of genes and strains, genes with the same occurrence pattern could get different contribution scores [39]. This score is an estimate of how important a gene is to correctly classify a certain strain. Additionally, genes that are either present or absent in (almost) all queried strains have negligible impacts to separate strains of differing phenotypes [40]. Thus we did not use genes with homogeneous occurrence patterns and used only one of the highly correlated genes in further analysis. Prior to classification, phenotypes with continuous measurements were grouped into 3 bins, where each bin represents a different category. Strains that belong to the middle category were not used in genotype-phenotype matching to improve the classification accuracy. Additionally, in some experiments most of the strains exhibited a single phenotype such as the capability to grow on a certain sugar. Such an imbalance often leads to biased classification. Therefore imbalance in the number of strains per phenotype was decreased by creating 100 bags [22].
Genotype-phenotype matching
Genes related to phenotypes were identified using PhenoLink mostly with default parameter settings. To decrease effects of random selection, the same genotype and phenotype data were classified 3 times and only genes consistently relating to phenotypes were selected. Additionally, only genes with a positive contribution score for at least a few (in this study 3) strains of a phenotype were used for further classification, which decreases spurious relations between genes and phenotypes. This iterative removal of genes continued until no more than a few (in this study 5) genes were removed [22]. Only relations to phenotypes that were classified with at least 60% accuracy were used in further visualization and analysis, which was empirically defined to allow visualizing even weaker relations. The accuracy was estimated by the Random Forest algorithm and is the percentage of strains that were correctly classified. For each phenotype, genes were sorted based on their phenotype importance, which is the sum of gene’s contribution score for each strain of this particular phenotype, and genes with the highest phenotype importance (in this study the top 50 genes) were selected. Genes that had homogenous occurrence patterns (variance < 0.05) were not used in genotype-phenotype matching. Highly correlated genes (e.g. members of the same operon) were added to the selected top genes provided that they were correlated to any gene in the top genes. The added gene was assigned the same phenotype importance as the gene to which it is correlated.
Visualization of gene-phenotype relations
Visualization of the identified gene-phenotype relations facilitates quick screening and simplifies the analysis of these relations. Visualizing relations between accurately classified phenotypes (in this study a total of 140) and genes (here a total of 1388 OGs or on average 565 genes for each of the 4 reference strains) creates a large figure, which is difficult to analyze. To simplify visualization and analysis of gene-phenotype relations, phenotyping experiments were categorized into 5 groups based on experiment type: (i) growth on sugar, (ii) antibiotic resistance, (iii) metal resistance, (iv) growth on milk or polysaccharides and (v) remaining experiments (see also Table 2 and Additional file 1). Genes related to these phenotypes were visualized by merging the presence/absence of a gene with its phenotype importance. Since a gene’s presence/absence is strain-specific, its occurrence in strains of a phenotype was quantified to determine if a gene is predominantly present or absent. Merging predominant presence/absence of a gene with its phenotype importance creates 6 possible combinations each represented with a different colour as shown in Figure 1. A gene that is present in at least 75% of strains of a phenotype is assumed to be predominantly present and a gene that is absent in at least 75% of strains of a phenotype is assumed to be predominantly absent; otherwise a gene is assumed to be present in a subset of strains.
Visualization of gene-phenotype relations in reference strains allows identification of genes that are localized in close genomic proximity (e.g., members of the same operon). Therefore, gene-phenotype relations for corresponding genes of the reference strains were included in the visualization (see also Additional file 2). Two reference strains (SK11 and KF147) have plasmids; therefore, in the visualization a total of 149 plasmid genes were also used. In visualizing gene-phenotype relations, the phenotype importance of an OG was used for all its members. For each reference strain on average 565 gene-phenotype relations were found, but we focussed our analysis on phenotype-related gene clusters, which are genes in close genomic proximity. Two genes were considered in close proximity if a distance between their genomic starting positions did not exceed 2500 nucleotides, which was empirically determined. Using distances larger than 2500 nucleotides results in visualizing more non-neighbouring genes (false-positives), but using smaller distance would discard some neighbouring genes (false-negatives). Discarding true neighbours from visualization has more impact than including non-neighbours, because non-neighbouring genes can be easily recognized in visualization. Remaining gene-phenotype relations were visualized based on genomic order of genes.
Partial relations between genes and phenotypes, where a gene is present in only a subset of strains with a particular phenotype, were visualized with black colour (Figure 1). Gene’s occurrence in a strain was merged with its contribution score as shown in Figure 1. Gene-strain relations were visualized to show in which strains a gene is present and to which strains of a phenotype a gene was found to be relevant.
Clustering of strains based on phenotypes
Hierarchical clustering of strains based on their phenotypes could reveal the phenotypic similarity of strains, which might be linked to their genotype. Thus, strains were hierarchically clustered based on the phenotypes using the euclidean distance metric and the average linkage agglomerative clustering method [39]. Experiments that only contained phenotype information for all 38 strains were used in clustering and strains were clustered for each of the 5 experiment categories separately (see Table 2 and Additional file 1). Clustering was not performed for fifth experiment category, because there were only 5 experiments where all 38 strains had phenotype information.
Availability of supporting data
The data sets supporting the results of this article are included within the article and its additional files.
Declarations
Acknowledgements
We thank Douwe Molenaar for useful discussions.
Funding
JB was funded by Besluit Subsidies Investeringen Kennisinfrastructuur (BSIK) grant [through the Netherlands Genomics Initiative (NGI)]; BioRange programme [as part of, the Netherlands Bioinformatics Centre (NBIC)]; and the NGI (as part of the Kluyver Centre for Genomics of Industrial Fermentation).
Authors’ Affiliations
References
- Sandine WE, Radich PC, Elliker PR: Ecology of lactic streptococci. A review. J Milk Food Technol. 1972, 35: 179-206.Google Scholar
- Bolotin A, Wincker P, Mauger S, Jaillon O, Malarme K, Weissenbach J, Ehrlich SD, Sorokin A: The complete genome sequence of the lactic acid bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res. 2001, 11 (5): 731-753. 10.1101/gr.GR-1697R.PubMedPubMed CentralView ArticleGoogle Scholar
- Makarova K, Slesarev A, Wolf Y, Sorokin A, Mirkin B, Koonin E, Pavlov A, Pavlova N, Karamychev V, Polouchine N: Comparative genomics of the lactic acid bacteria. Proc Natl Acad Sci USA. 2006, 103 (42): 15611-15616. 10.1073/pnas.0607117103.PubMedPubMed CentralView ArticleGoogle Scholar
- Wegmann U, O'Connell-Motherway M, Zomer A, Buist G, Shearman C, Canchaya C, Ventura M, Goesmann A, Gasson MJ, Kuipers OP: Complete genome sequence of the prototype lactic acid bacterium Lactococcus lactis subsp. cremoris MG1363. J Bacteriol. 2007, 189 (8): 3256-3270. 10.1128/JB.01768-06.PubMedPubMed CentralView ArticleGoogle Scholar
- Nomura M, Kobayashi M, Narita T, Kimoto-Nira H, Okamoto T: Phenotypic and molecular characterization of Lactococcus lactis from milk and plants. J Appl Microbiol. 2006, 101 (2): 396-405. 10.1111/j.1365-2672.2006.02949.x.PubMedView ArticleGoogle Scholar
- van Hylckama Vlieg JE, Rademaker JL, Bachmann H, Molenaar D, Kelly WJ, Siezen RJ: Natural diversity and adaptive responses of Lactococcus lactis. Curr Opin Biotechnol. 2006, 17 (2): 183-190. 10.1016/j.copbio.2006.02.007.PubMedView ArticleGoogle Scholar
- Kelly WJ, Ward LJ, Leahy SC: Chromosomal diversity in Lactococcus lactis and the origin of dairy starter cultures. Genome Biol Evol. 2010, 2: 729-744.PubMedPubMed CentralGoogle Scholar
- Siezen RJ, Bayjanov J, Renckens B, Wels M, Van Hijum SA, Molenaar D, Van Hylckama Vlieg JE: Complete genome sequence of Lactococcus lactis subsp. lactis KF147, a plant-associated lactic acid bacterium. J Bacterio. 2010, 192 (10): 2649-2650. 10.1128/JB.00276-10.View ArticleGoogle Scholar
- Siezen RJ, Starrenburg MJ, Boekhorst J, Renckens B, Molenaar D, van Hylckama Vlieg JE: Genome-scale genotype-phenotype matching of two Lactococcus lactis isolates from plants identifies mechanisms of adaptation to the plant niche. Appl Environ Microbiol. 2008, 74 (2): 424-436. 10.1128/AEM.01850-07.PubMedPubMed CentralView ArticleGoogle Scholar
- Gao Y, Lu Y, Teng KL, Chen ML, Zheng HJ, Zhu YQ, Zhong J: Complete genome sequence of Lactococcus lactis subsp. lactis CV56, a probiotic strain isolated from the vaginas of healthy women. J Bacteriol. 2011, 193 (11): 2886-2887. 10.1128/JB.00358-11.PubMedPubMed CentralView ArticleGoogle Scholar
- Bolotin A, Quinquis B, Ehrlich SD, Sorokin A: Complete genome sequence of Lactococcus lactis subsp. cremoris A76. J Bacteriol. 2012, 194 (5): 1241-1242. 10.1128/JB.06629-11.PubMedPubMed CentralView ArticleGoogle Scholar
- Kato H, Shiwa Y, Oshima K, Machii M, Araya-Kojima T, Zendo T, Shimizu-Kadota M, Hattori M, Sonomoto K, Yoshikawa H: Complete genome sequence of Lactococcus lactis IO-1, a lactic acid bacterium that utilizes xylose and produces high levels of L-lactic acid. J Bacteriol. 2012, 194 (8): 2102-2103. 10.1128/JB.00074-12.PubMedPubMed CentralView ArticleGoogle Scholar
- Ainsworth S, Zomer A, De Jager V, Bottacini F, Van Hijum SA, Mahony J, Van Sinderen D: Complete Genome of Lactococcus lactis subsp. cremoris UC509.9, Host for a Model Lactococcal P335 Bacteriophage. Genome Announc. 2013, 1 (1): pii: e00119-12. doi: 10.1128/genomeA.00119-12. Epub 2013 Jan 31Google Scholar
- Siezen RJ, Renckens B, VanSwam I, Peters S, Van Kranenburg R, Kleerebezem M, DeVos WM: Complete sequences of four plasmids of Lactococcus lactis subsp. cremoris SK11 reveal extensive adaptation to the dairy environment. Appl Environ Microbiol. 2005, 71 (12): 8371-8382. 10.1128/AEM.71.12.8371-8382.2005.PubMedPubMed CentralView ArticleGoogle Scholar
- Rademaker JL, Herbet H, Starrenburg MJ, Naser SM, Gevers D, Kelly WJ, Hugenholtz J, Swings J, van Hylckama Vlieg JE: Diversity analysis of dairy and nondairy Lactococcus lactis isolates, using a novel multilocus sequence analysis scheme and (GTG)5-PCR fingerprinting. Appl Environ Microbiol. 2007, 73 (22): 7128-7137. 10.1128/AEM.01017-07.PubMedPubMed CentralView ArticleGoogle Scholar
- Siezen RJ, Bayjanov JR, Felis GE, van der Sijde MR, Starrenburg M, Molenaar D, Wels M, van Hijum SA, van Hylckama Vlieg JE: Genome-scale diversity and niche adaptation analysis of Lactococcus lactis by comparative genome hybridization using multi-strain arrays. Microb Biotechnol. 2011, 4 (3): 383-402. 10.1111/j.1751-7915.2011.00247.x.PubMedPubMed CentralView ArticleGoogle Scholar
- Taibi A, Dabour N, Lamoureux M, Roy D, LaPointe G: Evaluation of the genetic polymorphism among Lactococcus lactis subsp. cremoris strains using comparative genomic hybridization and multilocus sequence analysis. Int J Food Microbiol. 2010, 144 (1): 20-28. 10.1016/j.ijfoodmicro.2010.02.020.PubMedView ArticleGoogle Scholar
- Passerini D, Beltramo C, Coddeville M, Quentin Y, Ritzenthaler P, Daveran-Mingot ML, Le Bourgeois P: Genes but not genomes reveal bacterial domestication of Lactococcus lactis. PLoS One. 2010, 5 (12): e15306-10.1371/journal.pone.0015306.PubMedPubMed CentralView ArticleGoogle Scholar
- Nieto-Arribas P, Sesena S, Poveda JM, Palop L, Cabezas L: Genotypic and technological characterization of Lactococcus lactis isolates involved in processing of artisanal Manchego cheese. J Appl Microbiol. 2009, 107 (5): 1505-1517. 10.1111/j.1365-2672.2009.04334.x.PubMedView ArticleGoogle Scholar
- Psoni L, Kotzamanidis C, Yiangou M, Tzanetakis N, Litopoulou-Tzanetaki E: Genotypic and phenotypic diversity of Lactococcus lactis isolates from Batzos, a Greek PDO raw goat milk cheese. Int J Food Microbiol. 2007, 114 (2): 211-220. 10.1016/j.ijfoodmicro.2006.09.020.PubMedView ArticleGoogle Scholar
- Tan-a-ram P, Cardoso T, Daveran-Mingot ML, Kanchanatawee S, Loubiere P, Girbal L, Cocaign-Bousquet M: Assessment of the diversity of dairy Lactococcus lactis subsp. lactis isolates by an integrated approach combining phenotypic, genomic, and transcriptomic analyses. Appl Environ Microbiol. 2011, 77 (3): 739-748. 10.1128/AEM.01657-10.PubMedPubMed CentralView ArticleGoogle Scholar
- Bayjanov JR, Molenaar D, Tzeneva V, Siezen RJ, van Hijum SA: PhenoLink - a web-tool for linking phenotype to omics data for bacteria: application to gene-trait matching for Lactobacillus plantarum strains. BMC Genomics. 2012, 13: 170-10.1186/1471-2164-13-170.PubMedPubMed CentralView ArticleGoogle Scholar
- Rauch PJ, De Vos WM: Characterization of the novel nisin-sucrose conjugative transposon Tn5276 and its insertion in Lactococcus lactis. J Bacteriol. 1992, 174 (4): 1280-1287.PubMedPubMed CentralGoogle Scholar
- Rauch PJ, Beerthuyzen MM, de Vos WM: Distribution and evolution of nisin-sucrose elements in Lactococcus lactis. Appl Environ Microbiol. 1994, 60 (6): 1798-1804.PubMedPubMed CentralGoogle Scholar
- Kelly WJ, Davey GP, Ward LJ: Characterization of lactococci isolated from minimally processed fresh fruit and vegetables. Int J Food Microbiol. 1998, 45 (2): 85-92. 10.1016/S0168-1605(98)00135-4.PubMedView ArticleGoogle Scholar
- Bachmann H, Starrenburg MJ, Molenaar D, Kleerebezem M, van Hylckama Vlieg JE: Microbial domestication signatures of Lactococcus lactis can be reproduced by experimental evolution. Genome Res. 2012, 22 (1): 115-124. 10.1101/gr.121285.111.PubMedPubMed CentralView ArticleGoogle Scholar
- Russell RR, Aduse-Opoku J, Sutcliffe IC, Tao L, Ferretti JJ: A binding protein-dependent transport system in Streptococcus mutans responsible for multiple sugar metabolism. J Biol Chem. 1992, 267 (7): 4631-4637.PubMedGoogle Scholar
- Ushiro I, Lumb SM, Aduse-Opoku J, Ferretti JJ, Russell RR: Chromosomal deletions in melibiose-negative isolates of Streptococcus mutans. J Dent Res. 1991, 70 (11): 1422-1426. 10.1177/00220345910700110501.PubMedView ArticleGoogle Scholar
- Efstathiou JD, McKay LL: Inorganic salts resistance associated with a lactose-fermenting plasmid in Streptococcus lactis. J Bacteriol. 1977, 130 (1): 257-265.PubMedPubMed CentralGoogle Scholar
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28 (1): 33-36. 10.1093/nar/28.1.33.PubMedPubMed CentralView ArticleGoogle Scholar
- Kutahya OE, Starrenburg MJ, Rademaker JL, Klaassen CH, Van Hylckama Vlieg JE, Smid EJ, Kleerebezem M: High-resolution AFLP Typing of Lactococcus lactis Strains Enables Identification of Genetic Markers for Subspecies Related Phenotypes. Appl Environ Microbiol. 2011, 77 (15): 5192-5198. 10.1128/AEM.00518-11.PubMedPubMed CentralView ArticleGoogle Scholar
- Bachmann H, Starrenburg MJ, Dijkstra A, Molenaar D, Kleerebezem M, Rademaker JL, van Hylckama Vlieg JE: Regulatory phenotyping reveals important diversity within the species Lactococcus lactis. Appl Environ Microbiol. 2009, 75 (17): 5687-5694. 10.1128/AEM.00919-09.PubMedPubMed CentralView ArticleGoogle Scholar
- Bachmann H, Kruijswijk Z, Molenaar D, Kleerebezem M, van Hylckama Vlieg JE: A high-throughput cheese manufacturing model for effective cheese starter culture screening. J Dairy Sci. 2009, 92 (12): 5868-5882. 10.3168/jds.2009-2372.PubMedView ArticleGoogle Scholar
- Bayjanov JR, Wels M, Starrenburg M, van Hylckama Vlieg JE, Siezen RJ, Molenaar D: PanCGH: a genotype-calling algorithm for pangenome CGH data. Bioinformatics. 2009, 25 (3): 309-314. 10.1093/bioinformatics/btn632.PubMedPubMed CentralView ArticleGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci USA. 2005, 102 (39): 13950-13955. 10.1073/pnas.0506758102.PubMedPubMed CentralView ArticleGoogle Scholar
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314 (5): 1041-1052. 10.1006/jmbi.2000.5197.PubMedView ArticleGoogle Scholar
- Bayjanov JR, Siezen RJ, van Hijum SA: PanCGHweb: a web tool for genotype calling in pangenome CGH data. Bioinformatics. 2010, 26 (9): 1256-1257. 10.1093/bioinformatics/btq103.PubMedPubMed CentralView ArticleGoogle Scholar
- Breiman L: Random forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. 2009, New York: SpringerView ArticleGoogle Scholar
- Dudoit S, Fridlyand J, Speed TP: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. J Am Stat Assoc. 2002, 97 (457): 77-87. 10.1198/016214502753479248.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.