A Multilocus Sequence Typing System (MLST) reveals a high level of diversity and a genetic component to Entamoeba histolytica virulence
© Gilchrist et al.; licensee BioMed Central Ltd. 2012
Received: 21 December 2011
Accepted: 6 July 2012
Published: 27 July 2012
The outcome of an Entamoeba histolytica infection is variable and can result in either asymptomatic carriage, immediate or latent disease (diarrhea/dysentery/amebic liver abscess). An E. histolytica multilocus genotyping system based on tRNA gene-linked arrays has shown that genetic differences exist among parasites isolated from patients with different symptoms however, the tRNA gene-linked arrays cannot be located in the current assembly of the E. histolytica Reference genome (strain HM-1:IMSS) and are highly variable.
To probe the population structure of E. histolytica and identify genetic markers associated with clinical outcome we identified in E. histolytica positive samples selected single nucleotide polymorphisms (SNPs) by multiplexed massive parallel sequencing. Profile SNPs were selected which, compared to the reference strain HM-1:IMSS sequence, changed an encoded amino acid at the SNP position, and were present in independent E. histolytica isolates from different geographical origins. The samples used in this study contained DNA isolated from either xenic strains of E. histolytica trophozoites established in culture or E. histolytica positive clinical specimens (stool and amebic liver abscess aspirates). A record of the SNPs present at 16 loci out of the original 21 candidate targets was obtained for 63 of the initial 84 samples (63% of asymptomatically colonized stool samples, 80% of diarrheal stool, 73% of xenic cultures and 84% of amebic liver aspirates). The sequences in all the 63 samples both passed sequence quality control metrics and also had the required greater than 8X sequence coverage for all 16 SNPs in order to confidently identify variants.
Our work is in agreement with previous findings of extensive diversity among E. histolytica isolates from the same geographic origin. In phylogenetic trees, only four of the 63 samples were able to group in two sets of two with greater than 50% confidence. Two SNPs in the cylicin-2 gene (EHI_080100/XM_001914351) were associated with disease (asymptomatic/diarrhea p = 0.0162 or dysentery/amebic liver abscess p = 0.0003). This study demonstrated that there are genetic differences between virulent and avirulent E. histolytica strains and that this approach has the potential to define genetic changes that influence infection outcomes.
The eukaryotic parasite Entamoeba histolytica, the causative agent of amebiasis, is a major cause of morbidity and mortality worldwide, as well as a category B priority biodefense pathogen . In Dhaka, Bangladesh, surveys done in a cohort of children living in an urban slum showed evidence of E. histolytica infection (determined by detection of parasite antigen in either diarrhea or monthly surveillance stool) in 80% of the children tested . Host genetics can influence susceptibility to infectious disease and a single amino acid substitution in the host cytokine receptor homology domain 1 of LEPR and a difference in the leukocyte antigen class II allele expressed are associated with increased susceptibility to intestinal infection by the E. histolytica [3, 4]. Symptomatic disease occurs in only a minority of E. histolytica infections (20%) in an unpredictable manner and an initially asymptomatic infection can over time convert to invasive disease (~12.5%), amebic liver abscess can occur years after travel to an endemic area [5, 6]. It is hypothesized that both host and parasite factors contribute to the outcome of an E. histolytica. However, although progress has been made in both the identification and characterization of parasite virulence factors and in understanding the regulation of their gene expression, direct manipulation of the E. histolytica genome remains elusive, and the traits affecting parasite virulence have not been genetically mapped [8–17].
Despite this variations that occur within repeat-containing genes in the amoeba genome chitinase and serine-rich E. histolytica protein SREHP have been used to examine the link between E. histolytica genetics and disease [18–22]. The high rates of polymorphism however at the loci make it difficult to use them for this purpose and an association between some of these markers and virulence has not been proven in large scale studies [18, 21]. However, based on the composition of highly repetitive tRNA arrays, E. histolytica has been shown to have distinct genotypes with different potentials to cause disease [23–27].
E. histolytica tRNA genes are unusually organized in 25 arrays containing up to 5 tRNA genes in each array, with intergenic regions between tRNA genes containing short tandem repeats (STRs) . A 6-locus (D-A, S-Q, R-R, A-L, STGA-D, and N-K) tRNA gene-linked genotyping system has shown that the number of STRs at these loci differ in parasite populations isolated from three clinical groups (asymptomatic, diarrhea/dysentery and liver abscess) [24, 26]. The variations occurring in tRNA genotypes, even between the ameba strains isolated from the intestine and in the liver abscess of the same patient, suggest that not all strains of E. histolytica have the same capacity to reach the liver of the infected host . However, the diversity of tRNA linked STR genotypes occurring even in a restricted geographic region, and the frequent occurrence of novel genotypes, limit their usefulness to predict infection outcome or to probe the population structure of E. histolytica [25, 29, 30]. The extensive genetic polymorphism in the repeat sequences of SREHP, chitinase and tRNA arrays for instance could reflect slippage occurring during E. histolytica DNA replication as Tibayrenc et al. hypothesize that the parasites exist as clonal populations that are stable over large geographical areas and long periods of time [31, 32].
Compared with other DNA markers, single nucleotide polymorphisms (SNPs) are genetically stable, amenable to future automated methods of detection, and in contrast to the highly repetitive tRNA arrays, their location can be mapped in the E. histolytica genome [33–35]. After the first sequencing and assembly of Entamoeba histolytica HM-1:IMSS genome was published by Loftus et al. Bhattacharya et al. amplified and sequenced 9 kb of coding and non-coding DNA to evaluate the variability of E. histolytica SNPs in 14 strains and identified a link between some genotypes and clinical outcome . The advent of the next generation of high throughput genomic sequencing (NGS) technologies has provided more comprehensive opportunities to investigate variation in the genome of E. histolytica and clinical outcome by allowing the fast and efficient way to sequence laboratory-cultured ameba of clinical relevance [35, 37]. These cultured strains were isolated from different geographical areas endemic for amebiasis and contained large numbers of “strain-specific” SNPs in addition to SNPs present in more than one strain . The sequence variations associated with virulence strains previously identified in the sequenced 9 kb DNA (a synonomous SNP in XM_001913658.1the heavy subunit of the Gal/GalNAc lectin gene (894A/G), and SNPs in the non-coding DNA either between XM_652295.1 and XM_652296.2 sequences (236T/G, 240A/G and 561T/G) or 5’ of the Amoebapore C transcript XM_650937.2 (407A/C and 422A) seemed to be present only the two to four Bangladesh isolates sequenced by Bhattacharya et al. and were not present in the available international sequenced whole genomes .
The goal of this work was to develop a set of less variable markers to profile a large number of strains from different regions of the globe, therefore we selected additional non-synonomous SNPs which Bhattacharya et al. had shown to be less variable, to probe the population structure of E. histolytica in depth . The new SNPs were present with a frequency of 0.3-0.6 in the pool of geographically disparate E. histolytica parasites whose genomes had been sequenced. We restricted our SNP candidates for initial analysis to genes with the potential to be involved in the virulence of this parasite [8–17]. As our current hypothesis is that the development of disease is multifactorial, or polygenic, and involves a combination of parasite factors in the current work we selected several loci to test for their association with disease outcome in E. histolytica. These loci contained SNPs that resulted in non-synonomous changes to the encoded amino acids, were present in more than three of the sequenced E. histolytica genomes, and enriched either in strains originating from symptomatic or asymptomatic infections. We have shown that two of these SNPs were significantly associated with disease severity in Bangladesh isolates.
Initial identification and validation of single nucleotide polymorphisms identified using Next Generation Sequencing
Genomes sequenced by the Genomic Sequencing Center for Infectious Diseases (GSCID) and the Institute of Integrative Biology, E. histolytica Genome sequencing projects
Genbank identifier if available
GSCID E. histolytica Genome Sequencing Project
R. Haque, unpublished data ICDDR,B
Ali et al. 2007 
Escueta-de Cadiz et al. 2010 
Escueta-de Cadiz et al. 2010 
Escueta-de Cadiz et al. 2010 
University of Liverpool E. histolytica Genome Sequencing Project
Ungar et al., 1985 
University of Liverpool genome resequencing project 
University of Liverpool genome resequencing project 
R. Haque, unpublished data ICDDR,B
Diamond, and Clark. 1993 
R. Haque, unpublished data ICDDR,B 
R. Haque, unpublished data ICDDR,B 
Verification, by Sanger sequencing, of 12 polymorphic loci identified by Next Generation Sequencing (NGS) of E. histolytica genomes
Genbank accession number
Candidate single nucleotide polymorphisms
The resampling results described above indicated that SNPs were maintained within an E. histolytica strain under continuous culture. However, this does not exclude the possibility that a particular genotype may change in frequency within an endemic population. To test for association between SNPs and disease outcome, E. histolytica samples were collected from an area endemic for amebiasis (ICDDR and Rajshahi Medical College, Rajshahi, Bangladesh- Additional file 1: Table S4). Both field samples and xenic cultures established from asymptomatic and symptomatic infections were used as a source of DNA (19 amebic liver aspirates; 26 xenic cultures (14 established from asymptomatic infections and 12 from diarrheal); 20 E. histolytica positive samples from diarrheal stool; and 19 E. histolytica positive samples collected during monthly stool sample surveillance). We anticipated that the virulence of this parasite in humans may not be the direct target of selection, because invasive disease does not seem to confer an advantage to pathogen dissemination . To focus on potentially genetically stable SNPs, which were nevertheless variably present in the different stains, we selected non-synonomous SNPs in the available data that were present in at least four, but not more than nine genomes. This allowed us to select for polymorphic SNPs that frequently occur in ameba and may represent genetically stable or ancestral variants that remain at a frequency of 0.3 to 0.6 a frequency that gave us sufficient statistical power to detect 2x differences within the amebic population surveyed in this study. For a SNP to be considered a candidate for association with symptomatic disease it had to occur at a greater frequency in the isolates from symptomatic amebic infections. Twenty-one potentially informative loci were chosen for further analysis in a larger number of E. histolytica isolates as described in the methods section of this paper (Additional file 1: Table S5 and S6).
SNP genotyping of E. histolytica clinical isolates
The 21 marker loci selected from whole genome sequencing data were used to genotype clinical isolates of E. histolytica. DNA isolated from three sources, stool samples, short term xenic cultures of parasites from stool and amebic liver abscess aspirates, was used as a template to amplify the 21 loci. PCR products were sequenced using Illumina sequencing technology and the resulting demuliplexed sequence reads aligned to reference sequences representing the genes to which each amplicon corresponds in order to determine the nucleotide(s) present in the sampled genomes (see Additional file 1: Table S7).
Five of the 21 targets were not consistently co-amplified in our PCR reactions. This could have been due to differences in primer efficiency or off-target amplification in the xenic culture and stool specimens that contain an undefined mixture of intestinal microflora or it may also be because the gene is missing from some isolates or highly divergent. These five loci were not included in later analyses that only used the 16 remaining loci. We selected only those samples where all target sequences were efficiently amplified and we observed good coverage of all the expected amplicon sequences. If one or more of the targets was missing, then the sample was eliminated (Additional file 1: Table S7). The final data set consisted of 63 or the original 84 samples (63% of asymptomatically colonized stool samples, 80% of diarrheal stool, 73% of xenic cultures and 84% of amebic liver aspirates) which passed quality control and had the greater than 8 fold sequence coverage needed to confidently call SNPs.
The effect of adaptation of to in vitro culture on SNP allele frequencies
To examine the potential effect of adaption to in vitro culture on the frequency of SNP alleles, and therefore how well transiently or long established cultured trophozoites represent the parasite population, SNP allele frequencies were compared between parasites genotyped directly from stool samples and those from cultured trophozoites (Additional file 1: Table S10).
To confirm that this change in the distribution of the Non-Reference SNPs was due to culture conditions, and not due to the Bangladesh origin of these samples, the comparison was repeated using only the strains isolated and transiently established in culture at ICDDR, B and genotyped using MLST during this study. A similar trend was observed.
EHI_065250 (LCAT) belongs to a gene family that consists of ten genes; they range in identity from 82% to 51% (Additional file 2: Figure 1A and B); two are highly similar to EHI_065250 (82 and 81% identity). The primers used in SNP amplification were specific for EHI_065250 and did not amplify the other members of this gene family. The other LCAT gene sequences are sufficiently different that off-target amplification would be detected in the sequence alignments of the Illumina reads. Such off-target amplification was never observed, confirming that amplification was specific for the target EHI_065250 locus.
Association of SNPs with disease phenotype
Significance of SNP distribution in Invasive amebic liver abscess, dysentery and Asymptomatic disease
Location in reference contig
Genetic differences between virulent and avirulent E. histolytica strains
Amebic liver abscess is a complication only found in adults whereas dysentery is more frequent in children. The liver aspirate samples in this study were collected from adults, at Rajshahi Medical College Hospital, Bangladesh. This is a geographically distinct location from the dysenteric and asymptomatic samples that were collected from children in Dhaka, Bangladesh. One goal of this study was to identify SNPs to type the virulence potential of the parasite in amebic liver aspirates; if SNPs occur at different frequencies in Dhaka and Rajshahi isolates they will appear as potential biomarkers of parasites with the potential to initiate amebic liver abscesses. The difference in SNP 1&2 frequency in both asymptomatic and diarrheal samples was replicated however in the sequenced genomes from diverse populations in Asia and South America (described in Table 1 and Additional file 1: Table S6 and included in Data set 2 Additional file 1: Table S11) [24, 29, 35, 39].
The previously discussed locus, LCAT EHI_065250, which contained five different SNPs (3–7), was also associated with symptomatic disease however possible selection in culture rendered the distribution less significant within the larger data set (Table 3).
E. histolytica SNPs were identified in amebic DNA isolated from a Bangladesh population by amplicon sequencing. Non-Reference SNPs in the EHI_080100 cylicin-2 gene were significantly associated with the virulence phenotype (amebic liver abscess > asymptomatic > diarrhea or dysentery).
We initially analyzed the genetic diversity among 12 sequenced E. histolytica genomes that represented different geographical origins and disease manifestations, and selected a set of 21 polymorphic sites in coding regions where SNPs change the encoded amino acid. The distribution of these 21 non-synonymous SNPs in field isolates and cultured strains of E. histolytica were examined in samples collected from an endemic area in Bangladesh by multilocus sequence typing (MLST). Of 16 loci that passed quality control five were invariantor very infrequent in Bangladesh.
Our results are inconsistent with a model of clonality in E. histolytica populations. In a clonal population we would expect to see strong linkage disequilibrium between markers, since linkage would not be eroded by recombination and sexual reassortment. In fact, we saw only two identical genotypes in our sample, suggesting a considerable amount of recombination and/or reassortment. Our results support previous observations, based on short tandem repeat DNA sequences, of high diversity among genotypes even within limited geographical areas [18, 21, 24]. Due to this complexity, the number of whole genomes sequenced in the pilot studies, were not sufficient to predict accurately the SNPs associated with disease. However, 2 out of the 16 loci examined,(EHI_065250 and EHI_080100), were significantly associated with disease in isolates collected in Rajshahi and Dhaka, Bangladesh.
One caveat to this study was that the amebic liver abscess samples were collected in Rajshahi but the stools samples were collected at a different location (Mirpur, Dhaka); the differences in the grouping of liver abscess and stool E. histolytica could reflect geographical differentiation . Ali et al. have however previously described different genotypes in liver abscess and enteric samples from the same patients . This suggests a possible genetic selection for parasites with invasive capabilities. Based on our data we suggest a divergent rather than sequential model of the potential to cause severe disease . Non-Reference EHI_080100 SNPs were more frequent in asymptomatic than diarrhea/dysentery samples. However the Non-Reference SNP potentially predisposed the asymptomatic infection to initiate an amebic liver abscess rather than amebic colitis (p = 0.0182) as the Non-Reference EHI_080100 SNPs, were present with even higher prevalence, in samples from amebic liver abscess (p = 0.0003, q = 0.0144). Additional studies are needed to identify additional amebic biomarkers associated with invasive disease.
In both EHI_065250 and EHI_080100 the consequence of the Non-Reference polymorphisms was to change two amino acids within the C-terminal domains. The reason behind the association of these SNPs with invasive disease is not yet clear. The polymorphic genes have not previously been associated with a virulent phenotype, and other than the previously discussed change in at a potential phosphorylation site, there were no other predicted changes in protein function using the currently available bioinformatics tools (PolyPhen http://genetics.bwh.harvard.edu/pph2/http://sift.jcvi.org/www/SIFT_seq_submit2.html) [47, 48].
Locations of informative SNPs
ContiguousE. histolytica DNA region ID
Location of SNP(s) (bp)
The currently identified SNPs could act as genetic “markers” in incomplete linkage disequilibrium with neighboring DNA that contains causative or regulatory SNP (r-SNP) mutations that result in a modulation of gene expression. It is interesting to note that contiguous with the EHI_065250 gene is one of the genes encoding the intermediate subunit of the Galactose- and N-acetyl-D-galactosamine (Gal/GalNAc) inhibitable lectin (igl2) . The Gal/GalNAc inhibitable lectinis a well-characterised virulence factor in E. histolytica.
It is also possible that amino acids changes resulting from the SNPs directly influence the biological activity of the encoded protein and that these changes affect the ability of the trophozoite to invade its host. What has never been clear is the advantage to the E. histolytica parasite to the causation of invasive disease . It is conceivable that these SNPs result in a maladaptation of the coevolving genomes of parasite and host and are the reason for the expression of the virulent E. histolytica phenotype [3, 4]. The effect of the changes described in this study on the stability and function of the encoded protein is currently under investigation.
E. histolytica does not follow the model of T. gondii that exists predominantly in a few main lineages . Rather, even in population from a single geographic location, majority of the individual parasites show unique genotypes. The number of tRNA-linked genotypes discovered, are likely to continue increasing in number and will enable the measurement of strain diversity. However, the results presented in this work support the hypothesis that a relationship exists between the genotype of an E. histolytica strain and parasite virulence. Unlike the tRNA-linked sequence types (Ali et al, 2012) which are merely surrogate markers for the prediction of infection outcomes, non-synonymous SNPs detected in the present study shows promise to identify parasite factors directly linked to infection outcomes . Although preliminary, our findings identified two candidate genes that may contribute to the pathogenesis of these parasites. The level of genetic variation we observed increases the importance of the SNPs we have linked to disease. We are currently investigating the impact of the non-synonomous changes on the function of these proteins.
To fully understand the genetics of this parasite, additional biomarkers will be needed to understand virulence and different outcomes of the disease at the genome level. In the absence of stable clonal populations deeper characterization of the variation in the E. histolytica genome requires sequences from additional ameba strains. Using the protocol described in this paper usable sequence data was gathered from approximately half of the field samples. This allowed the testing of the association of selected candidate SNPs within an endemic population. Given the large amount of variation that occurs, SNPs need to be carefully chosen to type the virulence potential in an E. histolytica MLST schema rather than to reflect parasite phylogeny. Future studies are needed which focus on the genome of the infecting parasite in conjunction with the genome of the infected host.
The Ethical Review Committee at ICCDR,B approved this study. Written informed consent was provided by all study participants and/or their legal guardians.
Cultured E. histolytica strains used for genotyping
E. histolytica trophozoites isolated from patients of all age groups seen at the hospital for diarrheal diseases, or from children living in an urban slum area in Dhaka were established in culture at the International Centre for Diarrhoeal Diseases Research, Bangladesh (ICDDR,B). Polyxenic cultures were maintained in biphasic Robinson's medium at 37°C (listed in Additional file 1: Table S4) .
E. histolytica DNA derived from stool samples
Recently collected stools from patients of all age groups seen at the hospital for diarrheal diseases, or from children living in an urban slum area in Dhaka were examined macroscopically for the presence of blood and mucus; a smear of feces in 0.9% saline was examined microscopically for the presence of erythrocytes, leukocytes, and E. histolytica trophozoites. The DNA was extracted using a slightly modified QIAamp DNA Stool Mini Kit protocol (Qiagen Inc., Valencia, CA) as described previously for specimens from ICDDR,B . Stool samples are also listed in Additional file 1: Table S4.
E. histolytica DNA derived from Amebic Liver Abscess (ALA) aspirates
Aspirates from patients with amebic liver abscesses were obtained only from adults because ALA is an extremely rare complication in children . A presumptive diagnosis of ALA was based on clinical picture, ultrasound examination and positive serology using an E. histolytica antigen based ELISA (TechLab E. histolytica II) . Abscess fluid was obtained under ultrasound guidance from patients with ALA and was purified using the modified QIAamp DNA Stool Mini Kit protocol described above (samples are listed in Additional file 1: Table S4) .
Primers for these experiments were designed using the publically available Primer3 program and checked for specificity using the NCBI Primer-BLAST tool  (http://www.ncbi.nlm.nih.gov/tools/primer-blast/). All primers used in this study are listed in either Additional file 1: Table S2 or Table S4.
Whole genome sequencing of axenic cultured E. histolytica strains
Whole genome sequencing of five of the E. histolytica strains used in this study was carried out at the J. Craig Venter Institute. These sequence traces are deposited athttp:// http://www.ncbi.nlm.nih.gov/bioproject/9532 dbSNPs Genbank(http://www.ncbi.nlm.nih.gov/projects/SNP/) and AmoebaDB (http://amoebadb.org/amoeba/)[57, 58]. This project is also fully described at the NCBI Bio Project page (Accession: PRJNA9532). Whole genome re-sequencing was performed at the Institute of Integrative Biology, (Centre for Genomic Research) University of Liverpool and results deposited at AmoebaDB [35, 57]. For a complete list of E. histolytica genomes, sequencing technology and Sequencing Center see Table 1 and Additional file 1: Table S1.
SNP detection and selection of candidate informative SNPs
For genome-wide SNP detection at JCVI the sequenced strains were analyzed using the CLC Genomics Workbench 4.0.2 SNP detection component as described below (see SNP detection and validation of amplicon sequences). In genomes sequenced at the Centre for Genomic Research, SNPs were identified according to the methods described Weedall et al. . For a list of the SNP detection method used in each genome see Additional file 1: Table S1. SNPs are listed in Additional file 1: Table S5. The default in the table was Reference and only high confidence SNPs (sequence coverage >8) were identified as Non-Reference. At selected locations a visual inspection of available sequence traces was performed to identify lower confidence SNPs (Additional file 1: Table S6). To identify “ancestral” or genetically stable SNPs we selected SNPs that were present in more than three strains. To pick out SNPs linked to disease the SNPs were grouped according whether the sequenced genome was first isolated from patients with asymptomatic or symptomatic disease. The list of weighted selection criteria included whether the SNPs enriched asymptomatic or symptomatic isolates, if the SNP was present in repeat regions or large E. histolytica protein families, whether it was contained in genes with any potential role in virulence, or if orthologous sequences were present in the non-pathogenic but closely related species E. dispar . The selected SNPs are shown in Additional file 1: Table S6.
Preliminary amplicon sequencing and validation
PCR amplifications were performed on a C1000 Thermal Cycler (Bio-Rad) using the High Fidelity Phusion DNA polymerase Master Mix (Finnzymes). Sample DNA (0.5 μl) was added to a 25 μl reaction mix containing 125 pm of the designated primers (5 nM). After an initial denaturation step of 98°C, denaturation at 98°C for 10 sec, annealing of primers at 50°C for 30 sec and elongation at 72°C for 30 sec was performed for 34 cycles. This was followed by a final extension at 72°C for 10 min. The amplified products were separated on a 2% agarose gel and the DNA fragments of the correct size were gel purified and sequenced by Sanger sequencing (GENEWIZ, Inc).
PCR amplification of SNP markers and preparation ofmuliplexed sequencing libraries
For clinical samples and low copy number culture material, amplicons were generated by nested PCR (see Additional file 1: Table S2 and S3). PCR amplifications were carried out using Phusion High Fidelity DNA polymerase Master Mix (Finnzymes). 1 μl of first round amplified DNA was used as template for the second round of amplification, using the same conditions as for the first round PCR with the exception that the annealing temperature was increased to 60°C and the nested PCR primers were used with tails that contained the unique “barcode” sequences and adaptors necessary for Illumina paired-end sequencing, as described by Meyer and Kircher (Additional file 1: Table S4) . DNA from cultured parasites was used directly as template for the second round PCR amplification only, as its more abundant template made nested PCR unnecessary.
After this step, the different PCR products amplified from original samples were pooled in groups of 5 or 6 and one μl was amplified using 200 nM of the IS4 primer and an indexing primer (Additional file 1: Tables S2 and S4) for an initial denaturation step of 98°C, denaturation at 98°C for 10 sec, annealing of primers at 60°C for 20 sec and elongation at 72°C for 20 sec was performed for 34 cycles. This was followed by a final extension at 72°C for 10 min.
The final amplification product was pooled and purified using the Qiagen MinElute 96 UF PCR purification kit according to manufacturer’s directions, and an aliquot of the amplified library and the accompanying negative controls were run on a 2% agarose gel for quality control. The final library was pooled and DNA concentration determined using a Quant-iT Kit (Invitrogen). Prior to submission for sequencing the size distribution of the DNA in the pooled library sample was examined for insert sizes and confirmed to be of the expected range (200–300 bp) using an Agilent 2100 bioanalyzer.
Illumina paired-end sequencing of amplicons containing SNP markers
An aliquot of the multiplexed libraries (5 pmol) was denatured and then processed with the Illumina Cluster Generation Station at the J. Craig Venter Institute, Rockville, MD (JCVI, MD, USA), following the manufacturers protocol. Libraries were sequenced on an Illumina GAII,run for 100 cycles to produce reads of 100 bp. Images were collected over 120 tiles (one lane) which contained 715,000 ±60 clusters per tile.
Data filtering and analysis pipeline
After the run image analysis, base calling and error estimation were performed using Illumina/Solexa Pipeline (version 0.2.2.6). Perl scripts were used to sort and bin all sequences according to indexes CASAVA 1.6 (Illumina).
Alignment of sequence reads and SNP typing
Amplicon sequence analysis was performed using the high-throughput sequencing module of CLC Genomics Workbench 4.0.2. Raw read output for each indexed amplicon set (derived from samples as indicated in Additional file 1: Table S4) was cleaned by trimming of adaptor sequences, ambiguous nucleotides and low quality sequences with average quality scores less than 20. The remaining reads were used for reference assembly. To assess the level of redundancy and non-specific alignment in each individual dataset, an initial reference-based assembly was executed using the whole E. histolytica HM-1:IMSS reference genome (Genbank accession AAFB00000000). As some level of non-specific alignment occurred, the alignment conditions utilized for the final mapping of Illumina reads to the reference assembly were adjusted to require a global alignment of 80% identity over at least 80% of the specific concatenated reference assembly of the target sequences (see Additional file 1: Table S3). Default local alignment settings with mismatch cost of 2, deletion cost of 3 and insertion cost of 3 were used. Reads that were not assembled into contigs in the reference assembly were not analyzed. Consensus sequences derived from the reference assemblies for each amplicon set were utilized for SNP scoring and further phylogenetic analysis.
SNP detection in the amplified DNA was performed using CLC Genomics Workbench 4.0.2 SNP detection component, which is based on the Neighborhood Quality Standard (NQS) algorithm . To identify quality SNPs, putative SNPs were screened following specific criteria based on the read depth, minor allele frequency (10%), the quality of flanking regions and absence of other SNPs within 15-bp flanking regions. SNPs located in repetitive regions were also not considered. The central base quality score of ≥30 and average surrounding base quality score of ≥20 were set to assess the quality of reads at positions for SNP detection. A minimum coverage of 10 and a minimum variant frequency of two was required, and the variations compared against the reference sequence were counted as SNPs. The NQS algorithm looked at each position in the genome alignment to determine if there was a SNP at that position.
The sequences spanning the SNPs were extracted and the IUB base code guide used to describe heterologous bases (see Additional file 1: Table S8). At each locus the sum of the squared allele frequencies was subtracted from 1 to gauge the diversity (heterozygosity) in both the original sequenced genomes and the new MLST data (Figure 2). The E. dispar Mercator whole genome alignment deposited in AmoebaDB was used to obtain the equivalent sequences where they existed in this related species (Additional file 1: Table S8) [57, 61]. The statistical significance of SNP distribution or genotype group versus the phenotypic manifestation of disease (asymptomatic/diarrhea or dysentery/amebic liver abscess) was determined by use of a Chi-squared contingency test or Fisher’s Exact test using the Prism 5 program (GraphPad Software) and the resulting p values were corrected for multiple comparisons by use of the false discovery rate formula of Benjamini and Hochberg in the R program FDR online calculator made freely available by the SDM project [62, 63]. To obtain the correction for multiple comparisons in the pairwise comparison the p-values of all possible combinations (i.e. asymptomatic vrs dysentery; asymptomatic vrs amebic liver abscess; dysentery vrs amebic liver abscess) for a given data set were combined prior to correction. A FDR of 10% was considered significant (http://sdmproject.com/utilities/?show=FDR_).
This investigation was supported by grant 5R01AI043596 from NIAID to WAP. This project has also been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract numbers N01-AI30071 and/or HHSN272200900007C.We wish to thank Dr Karen Beeson for her expert advice regarding next-generation sequencing technology, Drs. Cynthia Snider and Poonum Korpe for transportation of Bangladesh DNA samples and Dr. A. Mackey, Dr. B. Mann and Dr. M. Taniuchi for informative discussions. We also wish to thank Dr. B. Mann and C. B. Bousquet for careful reading of this manuscript.
- NIAID: Biodefense Research Agenda for Category B and C Priority Pathogens. NIH Publication. 2003, 03-5315: 1-50.
- Haque R, Mondal D, Duggal P, Kabir M, Roy S, Farr BM, Sack RB, Petri WA: Entamoeba histolytica infection in children and protection from subsequent amebiasis. Infect Immun. 2006, 74: 904-909. 10.1128/IAI.74.2.904-909.2006.PubMedView Article
- Duggal P, Haque R, Roy S, Mondal D, Sack RB, Farr BM, Beaty TH, Petri WA: Influence of human leukocyte antigen class II alleles on susceptibility to Entamoeba histolytica. J Infect Dis. 2004, 189: 520-526. 10.1086/381272.PubMedView Article
- Duggal P, Guo X, Haque R, Peterson KM, Ricklefs S, Mondal D, Alam F, Noor Z, Verkerke HP, Marie C, Leduc CA, Chua SC, Myers MG, Leibel RL, Houpt E, Gilchrist CA, Sher A, Porcella SF, Petri WA: A mutation in the leptin receptor is associated with Entamoeba histolytica infection in children. J Clin Invest. 2011, 121: 1191-1198. 10.1172/JCI45294.PubMedView Article
- Haque R, Mondal D, Karim A, Molla IH, Rahim A, Faruque ASG, Ahmad N, Kirkpatrick BD, Houpt E, Snider C, Petri WA: Prospective case–control study of the association between common enteric protozoal parasites and diarrhea in Bangladesh. Clin Infect Dis. 2009, 48: 1191-1197. 10.1086/597580.PubMedView Article
- Haque R, Kabir M, Noor Z, Rahman SMM, Mondal D, Alam F, Rahman I, Al Mahmood A, Ahmed N, Petri WA: Diagnosis of amebic liver abscess and amebic colitis by detection of Entamoeba histolytica DNA in blood, urine, and saliva by a real-time PCR assay. J Clin Microbiol. 2010, 48: 2798-2801. 10.1128/JCM.00152-10.PubMedView Article
- Guo X, Houpt E, Petri WA: Crosstalk at the initial encounter: interplay between host defense and ameba survival strategies. Curr Opin Immunol. 2007, 19: 376-384. 10.1016/j.coi.2007.07.005.PubMedView Article
- Gilchrist CA HE, Trapaidze N, Fei Z, Crasta O, Asgharpour A, Evans C, Martino-Catt S, Baba DJ, Stroup S, Hamano S, Ehrenkaufer G, Okada M, Singh U, Nozaki T, Mann BJ, Petri WA: Impact of intestinal colonization and invasion on the Entamoeba histolytica transcriptome. Mol Biochem Parasitol. 2006, 147: 163-76. 10.1016/j.molbiopara.2006.02.007.PubMedView Article
- Gilchrist CA, Moore ES, Zhang Y, Bousquet CB, Lannigan JA, Mann BJ, Petri WA: Regulation of Virulence of Entamoeba histolytica by the URE3-BP Transcription Factor. mBio. 2010, 1: e00057-10PubMedView Article
- Gilchrist CA, Petri WA: Using differential gene expression to study Entamoeba histolytica pathogenesis. Trends Parasitol. 2009, 25: 124-131. 10.1016/j.pt.2008.12.007.PubMedView Article
- Clark CG, Alsmark UCM, Tazreiter M, Saito-Nakano Y, Ali V, Marion S, Weber C, Mukherjee C, Bruchhaus I, Tannich E, Leippe M, Sicheritz-Ponten T, Foster PG, Samuelson J, Noël CJ, Hirt RP, Embley TM, Gilchrist CA, Mann BJ, Singh U, Ackers JP, Bhattacharya S, Bhattacharya A, Lohia A, Guillén N, Duchêne M, Nozaki T, Hall N: Structure and content of the Entamoeba histolytica genome. Adv Parasitol. 2007, 65: 51-190.PubMedView Article
- Singh U, Ehrenkaufer GM: Recent insights into Entamoeba development: identification of transcriptional networks associated with stage conversion. Int J Parasitol. 2009, 39: 41-47. 10.1016/j.ijpara.2008.09.004.PubMedView Article
- MacFarlane R, Bhattacharya D, Singh U: Genomic DNA microarrays for Entamoeba histolytica: applications for use in expression profiling and strain genotyping. Exp Parasitol. 2005, 110: 196-202. 10.1016/j.exppara.2005.03.006.PubMedView Article
- Linford AS, Moreno H, Good KR, Zhang H, Singh U, Petri WA: Short hairpin RNA-mediated knockdown of protein expression in Entamoeba histolytica. BMC Microbiol. 2009, 9: 38-10.1186/1471-2180-9-38.PubMedView Article
- Bracha R, Nuchamowitz Y, Anbar M, Mirelman D: Transcriptional silencing of multiple genes in trophozoites of Entamoeba histolytica. PLoS Pathogens. 2006, 2: e48-10.1371/journal.ppat.0020048.PubMedView Article
- Zhang H, Alramini H, Tran V, Singh U: Nuclear localized antisense small RNAs with 5’-polyphosphate termini regulate long-term transcriptional gene silencing in Entamoeba histolytica G3 strain. J Biol Chem. 2011, 286: 44467-44479. 10.1074/jbc.M111.278184.PubMedView Article
- Abhyankar MM, Haviland SM, Gilchrist CA, Petri WA: Development of a negative selectable marker for Entamoeba histolytica. J Visualized Exp. 2010, 46: e2410-10.3791/2410.
- Haghighi A, Kobayashi S, Takeuchi T, Thammapalerd N, Nozaki T: Geographic diversity among genotypes of Entamoeba histolytica field isolates. J Clin Microbiol. 2003, 41: 3748-3756. 10.1128/JCM.41.8.3748-3756.2003.PubMedView Article
- Samie A, Obi CL, Bessong PO, Houpt E, Stroup S, Njayou M, Sabeta C, Mduluza T, Guerrant RL: Entamoeba histolytica : genetic diversity of African strains based on the polymorphism of the serine-rich protein gene. Exp Parasitol. 2008, 118: 354-361. 10.1016/j.exppara.2007.09.008.PubMedView Article
- Simonishvili S, Tsanava S, Sanadze K, Chlikadze R, Miskalishvili A, Lomkatsi N, Imnadze P, Petri WA, Trapaidze N: Entamoeba histolytica: the serine-rich gene polymorphism-based genetic variability of clinical isolates from Georgia. Exp Parasitol. 2005, 110: 313-317. 10.1016/j.exppara.2005.02.015.PubMedView Article
- Haghighi A, Kobayashi S, Takeuchi T, Masuda G, Nozaki T: Remarkable genetic polymorphism among Entamoeba histolytica isolates from a limited geographic area. J Clin Microbiol. 2002, 40: 4081-4090. 10.1128/JCM.40.11.4081-4090.2002.PubMedView Article
- Ghosh S, Frisardi M, Ramirez-Avila L, Descoteaux S, Sturm-Ramirez K, Newton-Sanchez OA, Santos-Preciado JI, Ganguly C, Lohia A, Reed S, Samuelson J: Molecular epidemiology of Entamoeba spp.: evidence of a bottleneck (Demographic sweep) and transcontinental spread of diploid parasites. J Clin Microbiol. 2000, 38: 3815-3821.PubMed
- Ali IKM, Zaki M, Clark CG: Use of PCR amplification of tRNA gene-linked short tandem repeats for genotyping Entamoeba histolytica. J Clin Microbiol. 2005, 43: 5842-5847. 10.1128/JCM.43.12.5842-5847.2005.PubMedView Article
- Ali IKM, Mondal U, Roy S, Haque R, Petri WA, Clark CG: Evidence for a link between parasite genotype and outcome of infection with Entamoeba histolytica. J Clin Microbiol. 2007, 45: 285-289. 10.1128/JCM.01335-06.PubMedView Article
- Blessmann J, Ali IKM, Nu PAT, Dinh BT, Viet TQN, Van AL, Clark CG, Tannich E: Longitudinal study of intestinal Entamoeba histolytica infections in asymptomatic adult carriers. J Clin Microbiol. 2003, 41: 4745-4750. 10.1128/JCM.41.10.4745-4750.2003.PubMedView Article
- Ali IKM, Haque R, Alam F, Kabir M, Siddique A, Petri WA: Evidence for a link between locus R-R sequence type and outcome of infection with Entamoeba histolytica. Clin Microbiol Infect. 2012, 18: E235-7. 10.1111/j.1469-0691.2012.03826.x.PubMedView Article
- Clark CG, Ali IKM, Zaki M, Loftus BJ, Hall N: Unique organisation of tRNA genes in Entamoeba histolytica. Mol Biochem Parasitol. 2006, 146: 24-29. 10.1016/j.molbiopara.2005.10.013.PubMedView Article
- Ali IKM, Solaymani-Mohammadi S, Akhter J, Roy S, Gorrini C, Calderaro A, Parker SK, Haque R, Petri WA, Clark CG: Tissue invasion by Entamoeba histolytica: evidence of genetic selection and/or DNA reorganization events in organ tropism. PLoS Negl Trop Dis. 2008, 2: e219-10.1371/journal.pntd.0000219.PubMedView Article
- Escueta-de Cadiz A, Kobayashi S, Takeuchi T, Tachibana H, Nozaki T: Identification of an avirulent Entamoeba histolytica strain with unique tRNA-linked short tandem repeat markers. Parasitol Int. 2010, 59: 75-81. 10.1016/j.parint.2009.10.010.PubMedView Article
- Watanabe K, Gatanaga H, Escueta-de Cadiz A, Tanuma J, Nozaki T, Oka S: Amebiasis in HIV-1-infected Japanese men: clinical features and response to therapy. PLoS Negl Trop Dis. 2011, 5: e1318-10.1371/journal.pntd.0001318.PubMedView Article
- Tibayrenc M, Kjellberg F, Ayala FJ: A clonal theory of parasitic protozoa: the population structures of Entamoeba, Giardia, Leishmania, Naegleria, Plasmodium, Trichomonas, and Trypanosoma and their medical and taxonomical consequences. Proc Natl Acad Sci U S A. 1990, 87: 2414-2418. 10.1073/pnas.87.7.2414.PubMedView Article
- Wells RD, Dere R, Hebert ML, Napierala M, Son LS: Advances in mechanisms of genetic instability related to hereditary neurological diseases. Nucleic Acids Res. 2005, 33: 3785-3798. 10.1093/nar/gki697.PubMedView Article
- Lorenzi HA, Puiu D, Miller JR, Brinkac LM, Amedeo P, Hall N, Caler EV: New assembly, reannotation and analysis of the Entamoeba histolytica genome reveal new genomic features and protein content information. PLoS Negl Trop Dis. 2010, 4: e716-10.1371/journal.pntd.0000716.PubMedView Article
- Loftus B, Anderson I, Davies R, Alsmark UCM, Samuelson J, Amedeo P, Roncaglia P, Berriman M, Hirt RP, Mann BJ, Nozaki T, Suh B, Pop M, Duchene M, Ackers J, Tannich E, Leippe M, Hofer M, Bruchhaus I, Willhoeft U, Bhattacharya A, Chillingworth T, Churcher C, Hance Z, Harris B, Harris D, Jagels K, Moule S, Mungall K, Ormond D, Squares R, Whitehead S, Quail MA, Rabbinowitsch E, Norbertczak H, Price C, Wang Z, Guillén N, Gilchrist C, Stroup SE, Bhattacharya S, Lohia A, Foster PG, Sicheritz-Ponten T, Weber C, Singh U, Mukherjee C, El-Sayed NM, Petri WA, Clark CG, Embley TM, Barrell B, Fraser CM, Hall N: The genome of the protist parasite Entamoeba histolytica. Nature. 2005, 433: 865-868. 10.1038/nature03291.PubMedView Article
- Weedall GD, Clark CG, Koldkjær P, Kay S, Bruchhaus I, Paterson S, Hall N: Genomic diversity of the human intestinal parasite Entamoeba histolytica. Genome Biol. 13 (5): R38-[Epub ahead of print]
- Bhattacharya D, Haque R, Singh U: Coding and noncoding genomic regions of Entamoeba histolytica have significantly different rates of sequence polymorphisms: implications for epidemiological studies. J Clin Microbiol. 2005, 43: 4815-9. 10.1128/JCM.43.9.4815-4819.2005.PubMedView Article
- Wilson IW, Weedall GD, Hall N: Host-Parasite interactions in Entamoeba histolytica and Entamoeba dispar: what have we learned from their genomes?. Parasite Immunology. 34: 90-99.
- Mukherjee C, Clark CG, Lohia A: Entamoeba shows reversible variation in ploidy under different growth conditions and between life cycle phases. PLoS Negl Trop Dis. 2008, 2: e281-10.1371/journal.pntd.0000281.PubMedView Article
- Ungar BL, Yolken RH, Quinn TC: Use of a monoclonal antibody in an enzyme immunoassay for the detection of Entamoeba histolytica in fecal specimens. AmJTrop Med Hyg. 1985, 34: 465-472.
- Diamond LS, Clark CG: A redescription of Entamoeba histolytica Schaudinn, 1903 (Emended Walker, 1911) separating it from Entamoeba dispar Brumpt, 1925. J Eukaryot Microbiol. , 40: 340-344.
- Ghosh SK, Samuelson J: Involvement of p21racA, phosphoinositide 3-kinase, and vacuolar ATPase in phagocytosis of bacteria and erythrocytes by Entamoeba histolytica: suggestive evidence for coincidental evolution of amebic invasiveness. Infect Immun. 1997, 65: 4243-4249.PubMed
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol Biol Evol. 2011, 28: 2731-2739. 10.1093/molbev/msr121.PubMedView Article
- Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol. 1999, 294: 1351-1362. 10.1006/jmbi.1999.3310.PubMedView Article
- Gilchrist CA, Baba DJ, Zhang Y, Crasta O, Evans C, Caler E, Sobral BWS, Bousquet CB, Leo M, Hochreiter A, Connell SK, Mann BJ, Petri WA: Targets of the Entamoeba histolytica transcription factor URE3-BP. PLoS Negl Trop Dis. 2008, 2: e282-10.1371/journal.pntd.0000282.PubMedView Article
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.PubMedView Article
- Hughes MA, Petri WA: Amebic liver abscess. Infect Dis Clin N Am. 2000, 14: 565-582. 10.1016/S0891-5520(05)70121-5. viiiView Article
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30: 3894-3900. 10.1093/nar/gkf493.PubMedView Article
- Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31: 3812-3814. 10.1093/nar/gkg509.PubMedView Article
- Lorenzi H, Thiagarajan M, Haas B, Wortman J, Hall N, Caler E: Genome wide survey, discovery and evolution of repetitive elements in three Entamoeba species. BMC Genomics. 2008, 9: 595-10.1186/1471-2164-9-595.PubMedView Article
- Cheng XJ, Hughes MA, Huston CD, Loftus B, Gilchrist CA, Lockhart LA, Ghosh S, Miller-Sims V, Mann BJ, Petri WA, Tachibana H: Intermediate subunit of the Gal/GalNAc lectin of Entamoeba histolytica is a member of a gene family containing multiple CXXC sequence motifs. Infect Immun. 2001, 69: 5892-5898. 10.1128/IAI.69.9.5892-5898.2001.PubMedView Article
- Petri WA, Haque R, Mann BJ: The bittersweet interface of parasite and host: lectin-carbohydrate interactions during human invasion by the parasite Entamoeba histolytica. Annu Rev Microbiol. 2002, 56: 39-64. 10.1146/annurev.micro.56.012302.160959.PubMedView Article
- Sibley LD, Khan A, Ajioka JW, Rosenthal BM: Genetic diversity of Toxoplasma gondii in animals and humans. Philos Trans R Soc Lond. 2009, 364: 2749-2761. 10.1098/rstb.2009.0087.View Article
- Robinson GL: Laboratory cultivation of some human parasitic amoebae. J Gen Microbiol. 1968, 53: 69-79. 10.1099/00221287-53-1-69.PubMedView Article
- Taniuchi M, Verweij JJ, Noor Z, Sobuz SU, van Lieshout L, Petri WA, Haque R, Houpt ER: High throughput multiplex PCR and probe-based detection with Luminex beads for seven intestinal parasites. AmJTrop Med Hyg. 2011, 84: 332-337.View Article
- Haque R, Huston CD, Hughes M, Houpt E, Petri WA: Amebiasis. N Engl J Med. 2003, 348: 1565-1573. 10.1056/NEJMra022710.PubMedView Article
- Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology. 2000, 132: 365-386.PubMed
- Aurrecoechea C, Barreto A, Brestelli J, Brunk BP, Caler EV, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS, Heiges M, Iodice J, Kissinger JC, Kraemer ET, Li W, Nayak V, Pennington C, Pinney DF, Pitts B, Roos DS, Srinivasamoorthy G, Stoeckert CJ, Treatman C, Wang H: AmoebaDB and MicrosporidiaDB: functional genomic resources for Amoebozoa and Microsporidia species. Nucleic Acids Res. 2011, 39: D612-619. 10.1093/nar/gkq1006.PubMedView Article
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29: 308-311. 10.1093/nar/29.1.308.PubMedView Article
- Meyer M, Kircher M: Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols. 2010, 2010: pdb.prot5448-10.1101/pdb.prot5448.PubMedView Article
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407: 513-516. 10.1038/35035083.PubMedView Article
- Dewey CN: Aligning multiple whole genomes with Mercator and MAVID. Meth Mol Biol. 2007, 395: 221-236. 10.1007/978-1-59745-514-5_14.View Article
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J Royal Stat Soc. Series B (Methodological). 2010, 57: 289-300.
- Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. J Comput Graph Stat. 1996, 5: 299-314.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.