The eukaryotic parasite Entamoeba histolytica, the causative agent of amebiasis, is a major cause of morbidity and mortality worldwide, as well as a category B priority biodefense pathogen. In Dhaka, Bangladesh, surveys of a cohort of children living in an urban slum showed evidence of E. histolytica infection (determined by detection of parasite antigen in either diarrheal or monthly surveillance stool) in 80% of the children tested. Host genetics can influence susceptibility to infectious disease: a single amino acid substitution in the cytokine receptor homology domain 1 of host LEPR and a difference in the leukocyte antigen class II allele expressed are both associated with increased susceptibility to intestinal infection by E. histolytica [3, 4]. Symptomatic disease occurs in only a minority of E. histolytica infections (20%) and in an unpredictable manner. An initially asymptomatic infection can convert over time to invasive disease (~12.5%), and amebic liver abscess can occur years after travel to an endemic area [5, 6]. It is hypothesized that both host and parasite factors contribute to the outcome of an E. histolytica infection. However, although progress has been made both in the identification and characterization of parasite virulence factors and in understanding the regulation of their gene expression, direct manipulation of the E. histolytica genome remains elusive, and the traits affecting parasite virulence have not been genetically mapped [8–17].
Despite this, variations that occur within repeat-containing genes in the ameba genome, such as those encoding chitinase and the serine-rich E. histolytica protein (SREHP), have been used to examine the link between E. histolytica genetics and disease [18–22]. However, the high rates of polymorphism at these loci make them difficult to use for this purpose, and an association between some of these markers and virulence has not been proven in large-scale studies [18, 21]. Nevertheless, based on the composition of highly repetitive tRNA arrays, E. histolytica has been shown to have distinct genotypes with different potentials to cause disease [23–27].
E. histolytica tRNA genes are unusually organized in 25 arrays containing up to 5 tRNA genes each, with the intergenic regions between tRNA genes containing short tandem repeats (STRs). A 6-locus (D-A, S-Q, R-R, A-L, STGA-D, and N-K) tRNA gene-linked genotyping system has shown that the number of STRs at these loci differs among parasite populations isolated from three clinical groups (asymptomatic, diarrhea/dysentery and liver abscess) [24, 26]. The variations occurring in tRNA genotypes, even between the ameba strains isolated from the intestine and from the liver abscess of the same patient, suggest that not all strains of E. histolytica have the same capacity to reach the liver of the infected host. However, the diversity of tRNA-linked STR genotypes occurring even in a restricted geographic region, and the frequent occurrence of novel genotypes, limit their usefulness for predicting infection outcome or probing the population structure of E. histolytica [25, 29, 30]. For instance, the extensive genetic polymorphism in the repeat sequences of SREHP, chitinase and the tRNA arrays could reflect slippage occurring during E. histolytica DNA replication, as Tibayrenc et al. hypothesize that the parasites exist as clonal populations that are stable over large geographical areas and long periods of time [31, 32].
Compared with other DNA markers, single nucleotide polymorphisms (SNPs) are genetically stable and amenable to future automated methods of detection and, in contrast to the highly repetitive tRNA arrays, their location can be mapped in the E. histolytica genome [33–35]. After the first sequence and assembly of the E. histolytica HM-1:IMSS genome was published by Loftus et al., Bhattacharya et al. amplified and sequenced 9 kb of coding and non-coding DNA to evaluate the variability of E. histolytica SNPs in 14 strains and identified a link between some genotypes and clinical outcome. The advent of next-generation high-throughput sequencing (NGS) technologies has provided more comprehensive opportunities to investigate variation in the genome of E. histolytica and its relation to clinical outcome by offering a fast and efficient way to sequence laboratory-cultured ameba of clinical relevance [35, 37]. These cultured strains were isolated from different geographical areas endemic for amebiasis and contained large numbers of “strain-specific” SNPs in addition to SNPs present in more than one strain. The sequence variations associated with virulent strains previously identified in the sequenced 9 kb of DNA (a synonymous SNP in XM_001913658.1, the heavy subunit of the Gal/GalNAc lectin gene (894A/G), and SNPs in the non-coding DNA either between the XM_652295.1 and XM_652296.2 sequences (236T/G, 240A/G and 561T/G) or 5’ of the Amoebapore C transcript XM_650937.2 (407A/C and 422A)) seemed to be present only in the two to four Bangladesh isolates sequenced by Bhattacharya et al. and were not present in the available international sequenced whole genomes.
The goal of this work was to develop a set of less variable markers to profile a large number of strains from different regions of the globe. We therefore selected additional non-synonymous SNPs, which Bhattacharya et al. had shown to be less variable, to probe the population structure of E. histolytica in depth. The new SNPs were present at a frequency of 0.3-0.6 in the pool of geographically disparate E. histolytica parasites whose genomes had been sequenced. We restricted our SNP candidates for initial analysis to genes with the potential to be involved in the virulence of this parasite [8–17]. Because our current hypothesis is that the development of disease is multifactorial, or polygenic, and involves a combination of parasite factors, in the current work we selected several loci to test for their association with disease outcome in E. histolytica. These loci contained SNPs that resulted in non-synonymous changes to the encoded amino acids, were present in more than three of the sequenced E. histolytica genomes, and were enriched in strains originating from either symptomatic or asymptomatic infections. We show that two of these SNPs were significantly associated with disease severity in Bangladesh isolates.
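The selection criteria listed above (non-synonymous change, presence in more than three sequenced genomes, and a pooled frequency of 0.3-0.6) can be illustrated with a minimal Python sketch. This is not the pipeline used in this work: the `CandidateSNP` record, its field names, and the helper functions are hypothetical, and because the text does not define how enrichment was assessed, the sketch only reports the proportion of carriers in each clinical group rather than applying a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSNP:
    """Hypothetical record for one candidate SNP (layout is an assumption)."""
    locus: str                # illustrative identifier, not a real accession
    non_synonymous: bool      # True if the SNP changes the encoded amino acid
    carriers: frozenset       # names of sequenced genomes carrying the variant


def passes_filters(snp, n_genomes, min_freq=0.3, max_freq=0.6, min_carriers=4):
    """Apply the three criteria stated in the text.

    "More than three" genomes is encoded as at least four carriers; the
    frequency is the fraction of all sequenced genomes carrying the variant.
    """
    freq = len(snp.carriers) / n_genomes
    return (
        snp.non_synonymous
        and len(snp.carriers) >= min_carriers
        and min_freq <= freq <= max_freq
    )


def group_proportions(snp, symptomatic, asymptomatic):
    """Return the carrier proportion in each clinical group.

    Assumption: both groups are non-empty sets of genome names. No enrichment
    threshold is applied here because the text does not specify one.
    """
    p_symptomatic = len(snp.carriers & symptomatic) / len(symptomatic)
    p_asymptomatic = len(snp.carriers & asymptomatic) / len(asymptomatic)
    return p_symptomatic, p_asymptomatic
```

For example, with ten hypothetical genomes, a non-synonymous SNP carried by four of them passes all three filters, whereas one carried by only three, or by seven, does not.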