Intrastrain and interstrain genetic variation within a paralogous gene family in Chlamydia pneumoniae

Background Chlamydia pneumoniae causes human respiratory diseases and has recently been associated with atherosclerosis. Analysis of the three recently published C. pneumoniae genomes has led to the identification of a new gene family (the Cpn 1054 family) that consists of 11 predicted genes and gene fragments. Each member encodes a polypeptide with a hydrophobic domain characteristic of proteins localized to the inclusion membrane. Results Comparative analysis of this gene family within the published genome sequences provided evidence of multiple levels of genetic variation within this single collection of paralogous genes. Frameshift mutations are found that result in both truncated gene products and pseudogenes that vary among isolates. Several genes in this family contain polycytosine (polyC) tracts either upstream or within the terminal 5' end of the predicted coding sequence. The length of the polyC stretch varies between paralogous genes and within single genes in the three genomes. Sequence analysis of genomic DNA from a collection of 12 C. pneumoniae clinical isolates was used to determine the extent of the variation in the Cpn 1054 gene family. Conclusions These studies demonstrate that sequence variability is present both among strains and within strains at several of the loci. In particular, changes in the length of the polyC tract associated with the different Cpn 1054 gene family members are common within each tested C. pneumoniae isolate. The variability identified within this newly described gene family may modulate either phase or antigenic variation and subsequent physiologic diversity within a C. pneumoniae population.


Background
Chlamydia pneumoniae is an obligate intracellular bacterium that infects and causes disease in the respiratory tract [1,2] and has recently been associated with heart disease [3]. Approximately 10% of pneumonia cases and 5% of bronchitis and sinusitis cases in the U.S. are attributed to C. pneumoniae infection. Pathogenic mechanisms utilized by C. pneumoniae to replicate and disseminate within hosts remain unclear.
Little is known about strain-specific determinants of C. pneumoniae. Isolates of C. pneumoniae are virtually indistinguishable using 16S rRNA [4], restriction fragment length polymorphism [5], and amplified fragment length polymorphism analysis [6]. Unlike C. trachomatis, only a single serotype or genotype of C. pneumoniae has been identified by any of the above methods.
Genomic analyses have recently revealed a large gene family of 21 polymorphic outer membrane proteins (Pmps) with predicted outer membrane localization in C. pneumoniae [7,10,11]. The function of this gene family in chlamydial growth and development remains unknown. Several studies have examined genetic variation and strain differentiation of Pmp proteins, which may be important for genetic flexibility and adaptive response. Recently, it has been reported that interstrain and intrastrain variation in gene expression and protein production of pmpG6 and pmpG10 is modulated by deletion of tandem repeats in pmpG6 [9,11] and by variation in the length of a polyguanosine tract in pmpG10 [12,13]. This evidence suggests that variation may be an important requisite for the function of this gene family in the biology of Chlamydiae.
Examination of the C. pneumoniae genome sequences by Daugaard et al. [14] demonstrated that a unique and related family of genes is found within the C. pneumoniae genome, and that variation among strains leads to differences within several members of the gene family. Gene products of these paralogous genes contained a unique bi-lobed hydrophobic domain, which is a predictive marker for localization to the inclusion membrane [15]. In this study, we further characterize this family by examining variation in sequence of the family members both within and among different C. pneumoniae isolates.

The conserved hydrophobic domain of the Cpn 1054 gene family
In a previous study, we demonstrated that a bi-lobed (50-60 amino acid) hydrophobic domain is a predictive marker for protein localization to the inclusion membrane (Inc proteins; [15]). Consequently, 68 C. pneumoniae proteins, including several members of the Cpn 1054 family (Figure 3), were identified as putative inclusion membrane-associated proteins. The nucleotide sequence found at the 5' end of each paralogous repeat unit represents the region of highest identity among most family members (Figure 2). This includes the sequences that encode the major hydrophobic domain of each repeat unit (Figure 4A). In some cases (Cpn 007, 124, 126) the hydrophobic domain represents the primary reason the gene is included in the Cpn 1054 gene family (Figure 4A,4B; Figure 5). Three nearly identical copies of this domain are found within Cpn 007 (Figure 3), a protein that is otherwise not similar to other family members.

Homopolymeric cytosine (poly C) tract and the variations of the length of poly C tracts in the Cpn 1054 gene family
Genomic analysis of the Cpn 1054 gene family revealed a repeat sequence of 6 to 15 homopolymeric cytosine residues, positioned either upstream or immediately within the predicted 5' end of Cpn 008, 010, 041, 043, 045, 1054, and 1055 (Figure 2). Analysis of each sequenced genome indicated that three Cpn 1054 gene family members had identical polyC tracts in each of the respective genomes: Cpn 041 (CCCCCCTCCCC), Cpn 043 (12C), and Cpn 045 (CCCCCC). In each of the other family members, however, the length of the polyC tract varies in the published genomes. These analyses were expanded to examine the polyC tracts in several clinical isolates. Sequence data collected directly from amplification products of several Cpn 1054 family members showed that polymorphisms within the polyC tract were present in all tested family members except Cpn 041 and Cpn 045 (not shown). Therefore, the polyC tracts in Cpn 041 and Cpn 045 appear to be conserved among isolates, a result consistent with the data in the genome sequences.
In order to examine the clonal variation within other Cpn 1054 family members, sequences surrounding the polyC tracts were amplified from selected genes and the products cloned into plasmids. The polyC tract was then examined through nucleotide sequence analysis of a selection of independent plasmid constructs. Sequence variation within the polyC tract was first examined for Cpn 043, Cpn 1054, and Cpn 1055 of C. pneumoniae AR39. Sequence analysis of independent recombinant plasmids showed variability in the length of the polyC tract of each tested gene (Figure 6A). We next examined variation within a single gene, Cpn 1055, using genomic DNAs from strains AR39, AR 458 and PS 32 as template. Variation in the length of the polyC tract was observed in this gene from each tested isolate (Figure 6B).
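The clone-by-clone comparison of polyC tract lengths described above can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this study: the clone reads are hypothetical, the function names are invented for illustration, and the minimum tract length of 6 is taken from the 6 to 15 residue range reported here.

```python
import re
from collections import Counter

def longest_polyc(seq: str, min_len: int = 6) -> int:
    """Length of the longest run of C in seq, or 0 if no run reaches min_len."""
    runs = [len(m.group()) for m in re.finditer(r"C+", seq.upper())]
    longest = max(runs, default=0)
    return longest if longest >= min_len else 0

def tally_tract_lengths(clone_reads):
    """Count how many cloned reads carry each polyC tract length."""
    return Counter(longest_polyc(read) for read in clone_reads)

# Hypothetical clone reads (not data from the study): identical flanking
# sequence, with tracts of 9, 10, 10 and 11 cytosines.
clones = ["ATGGA" + "C" * n + "TTGAGT" for n in (9, 10, 10, 11)]
tally = tally_tract_lengths(clones)
for length, count in sorted(tally.items()):
    print(f"{length}C: {count} clone(s)")
```

With these toy reads the tally is one clone with 9C, two with 10C and one with 11C. A tract interrupted by another base, such as the CCCCCCTCCCC sequence of Cpn 041, is reported by this sketch as its longest uninterrupted run (6).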
Two approaches were used to demonstrate that the observed variation in length of the polyC tract was not a function of PCR errors during the analysis. First, two different thermostable polymerases (Taq and Pwo polymerase) were used to generate the primary amplification products for cloning and subsequent sequence analysis. Amplifications with each enzyme resulted in clones with variation in the length of the polyC tract (Figure 6C). A second approach for examination of the possibility of PCR errors was to reamplify the polyC tract from a single plasmid template, and examine the sequence of the polyC tract directly in these amplification products. No variation in the length of the polyC tract of Cpn 043, 1054, and 1055 was identified in these PCR products (not shown). These results support the conclusion that the variability in the length of the polyC tracts is not an artifact of the amplification process, and thus the observed variability reflects differences at these loci within individual isolates.

[...] is similar to Cpn 008, 010, 011, 043, 045, and 1055 while the 3' end was similar to the Cpn 009, 010.1, 012, 042, 044, 046, and 1056. The range of predicted amino acid similarity is 20-99%, as indicated by shading. Vertical bars indicate termination codons that interrupt the reading frame. The position of the beginning of each family member in the CWL029 genome is indicated above the sequence. All numbering is based on the CWL029 genome sequence [7].

Figure 2
Multiple sequence alignments of the 5' end and upstream regions of selected Cpn 1054 family members. Three genes have been left out of this figure (Cpn 007, 124, 126) because they lack significant sequence identity at the 5' end. All data were obtained from the CWL029 genome sequence [7]. The designation for each sequence includes the gene number of the 1054 family member and the gene directly upstream. The position in the genomic contig is also indicated for each sequence. Where appropriate, the predicted stop codon of upstream genes (Cpn 009, 042, 044, 1054; red boxes) is indicated. Two regions encompassing candidate start codons for each gene are indicated with green boxes. In some cases this includes the codon GTG as a candidate translational start site, as predicted through the Chlamydia Genome Website. PolyC tracts, boxed in blue, are upstream of the predicted start site or are within the gene near the 5' end of the coding region.

Allelic differences within Cpn 010-010.1
Analysis of published genomic sequence of several 1054 family members identified polymorphisms within several genes. Two examples of these polymorphisms can be found in Cpn 010/10.1 and Cpn 1054, family members that are over 98% identical at the nucleotide level [14]. Analysis of the published genome sequences indicates that these two genes likely duplicated through nonreciprocal exchanges or by gene conversion [16]. Comparative analysis of the three genomes showed that short sequence polymorphisms are present in Cpn 010/10.1, Cpn 043, and Cpn 1054 [14]. Cpn 010/10.1 and Cpn 1054 are especially interesting, as the two genes differ by only a few nucleotides over a 2,400 base pair coding sequence. The primary differences between these genes are 1) a single nucleotide polymorphism (SNP) that alters the reading frame in Cpn 010/10.1 [14], and 2) a region of diversity at the 3' end of each gene (Figure 7). Daugaard et al. [14] identified an RFLP marker that exploits the SNP, and other candidate RFLPs can be identified in the 3' region of diversity (Figure 7). Short sequence polymorphisms were examined in Cpn 010/10.1 by PCR amplification and nucleotide sequencing in 12 C. pneumoniae isolates. Variability at both polymorphisms was observed between strains. The frameshift mutation leading to the truncated sequence of Cpn 010 was identified in CWL029 and 4 other isolates: AR388, AR231, KA5C and KA66 (not shown). The 3' polymorphism that distinguishes Cpn 010/10.1 from Cpn 1054 was found within Cpn 10.1 in two isolates: AR39 and AC43 (Figure 7). There was no apparent linkage between the central frameshift mutation and the 3' polymorphism, as the two regions varied independently of one another. There was also no evidence that these sequences vary within individual C. pneumoniae isolates.

Figure 3
Hydrophilicity profiles of selected members of the Cpn 1054 gene family using the Kyte-Doolittle method with a sliding window of 7. The scale of hydrophilicity is indicated for Cpn 007 and the scale is identical in all panels. The bi-lobed hydrophobic domain is boxed in each profile. Note that the scale on the horizontal axis varies for each protein.
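The idea behind an RFLP marker for the Cpn 010/10.1 and Cpn 1054 polymorphisms, a sequence difference that creates or destroys a restriction site and so changes the fragment pattern, can be sketched in Python as follows. The two amplicon sequences are hypothetical, and EcoRI is used only as a familiar example enzyme; the enzymes and sites relevant to Figure 7 are not reproduced here.

```python
def fragment_lengths(seq: str, site: str, cut_offset: int) -> list:
    """Fragment lengths after cutting a linear sequence at every occurrence of site.

    cut_offset is the number of bases from the start of the site to the cut.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# EcoRI recognizes GAATTC and cuts after the first G (G^AATTC).
ECORI_SITE, ECORI_OFFSET = "GAATTC", 1

# Hypothetical amplicons: allele_b differs from allele_a by one base
# (A -> G) that destroys the EcoRI site.
allele_a = "ATGCTAGCTAGGAATTCGGCTAAGCTTAGC"
allele_b = "ATGCTAGCTAGGAGTTCGGCTAAGCTTAGC"

print(fragment_lengths(allele_a, ECORI_SITE, ECORI_OFFSET))  # [12, 18]
print(fragment_lengths(allele_b, ECORI_SITE, ECORI_OFFSET))  # [30]
```

In this toy case the allele that retains the site yields two fragments and the other allele remains uncut, which is the kind of difference an RFLP assay reads out on a gel.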

Discussion
Relatively little is known about molecular pathogenesis, genetic diversity and adaptive strategy of C. pneumoniae.
Although the genomic organization of the three sequenced strains is very similar (over 99.9% identical), there are regions of variation within each isolate [8]. In the present study, we have identified a paralogous gene family within C. pneumoniae, designated as the Cpn 1054 gene family. This family consists of eleven paralogous loci, with single repeat elements consisting of single ORFs or ORF pairs. The identity of the predicted polypeptide sequences shared among family members ranges from 20-99%. It is likely that the diversity of these genes arose through gene duplication and subsequent diversification. It appears that certain duplications were relatively recent, as at least two of the repeated loci, Cpn 010/10.1 and Cpn 1054, are nearly identical. Analysis of the three genomes also demonstrates that apparent gene conversion has occurred between Cpn 010/10.1 and Cpn 1054 in strain AR39 [16], and that an intact 1054 ORF is found within each sequenced genome [14]. However, its location varies between the two loci. The redundant nature of the Cpn 1054 family members is somewhat unusual given the generally reductive evolutionary strategy of the chlamydiae [17]. There is no evidence that the Cpn 1054 gene family is found outside of C. pneumoniae, and thus the members of the family may be important in the unique biological traits of this species.

Alignment of the three copies of the hydrophobic domain within Cpn 007 (residue ranges in parentheses):
Cp007 (61-110)    PGLSSVISSPAGMGACALGCVMLALGIDVLADSTIRSLPTYLLDEGHPQS
Cp007 (171-220)   PGLSSVISSPAGMGACALGCVMLALGIDVLADSTIRSLPTYPLDEGHPQS
Cp007 (281-324)   PGLSSIISSPAEMGACALGCVMLALGIDVLQQEAEAALARLPEE
Gene duplication and subsequent genetic drift are the likely means by which variation is manifested between members of the Cpn 1054 gene family. Variation is also observed within individual family members, both between strains [14] and, as shown in this report, within individual isolates. Several gene family members, including Cpn 008, 010, 043, 1054, and 1055, contain homopolymeric cytosine repeats either upstream or at the predicted 5' end of the coding region. In C. pneumoniae, variation of the short repeat of homopolymeric nucleotides was first identified in the pmp family. Comparative genomic analysis and cloning expression showed that the length of the polyG tract of pmpG10 varies between strains and within an isolate [12,13]. Furthermore, variation in the length of the polyG tract has been shown to play a role in the differential expression of PmpG10 [12]. Variability in short nucleotide repeats generated via slipped strand mispairing is a key element in the generation of phenotypic diversity within many pathogenic microorganisms [18]. Further investigation will be required to determine if the expression of members of the Cpn 1054 gene family is affected by the observed variability in length of the polyC tract.
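One way a change in homopolymeric tract length could affect expression, when the tract lies inside a coding region, is by shifting the reading frame. The Python sketch below illustrates only that general principle with a hypothetical toy gene; it does not model any Cpn 1054 family member, and whether expression of these genes is affected remains to be determined, as stated above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_before_stop(seq: str) -> int:
    """Number of codons read from position 0 before the first stop codon (or the end)."""
    count = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count

def toy_gene(tract_len: int) -> str:
    """Hypothetical gene: start codon, polyC tract, fixed downstream sequence."""
    downstream = "GCTGACCTAAAGGTTGCACTGGAATAA"  # invented; open in frame 0 until the final TAA
    return "ATG" + "C" * tract_len + downstream

# Tract lengths that differ by a multiple of three keep the downstream frame;
# other lengths shift it and expose an early stop codon in this toy sequence.
results = {n: codons_before_stop(toy_gene(n)) for n in (9, 10, 11, 12)}
print(results)  # {9: 12, 10: 5, 11: 7, 12: 13}
```

Tracts of 9 and 12 cytosines leave the downstream sequence in frame, while 10 or 11 cytosines terminate translation early in this example. A tract located upstream of the start codon would not act through this mechanism.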
Although the proteins in the Cpn 1054 gene family are classified as candidate inclusion membrane proteins [15], their subcellular locations and role in infection and disease remain to be identified. It is also not yet known whether the Cpn 1054 gene family is expressed individually or coordinately, or to what extent each gene is expressed during the course of an infection. However, the variation, both within and between strains, is a potential requisite for this gene family that may contribute to the unique biology of C. pneumoniae.

Conclusions
The C. pneumoniae genome contains a gene family (the Cpn 1054 gene family) consisting of 18 different genes in 11 paralogous loci. Variation is observed both within and among isolates. This variation may be useful for the biotyping of C. pneumoniae clinical isolates, and may be important in phenotypic diversity within the species.

Bacterial strains, plasmids and C. pneumoniae genomes
All experiments were conducted using a collection of independent clinical isolates from a strain library (Table 1). Genomic DNA of C. pneumoniae was isolated from purified EBs using the methods of Campbell et al. [19]. Extracted DNA was stored at -20°C. Genomic analyses were conducted using sequences from the genome websites listed in the introduction. Open reading frames, nucleotide positions and contig numbering were annotated based upon the C. pneumoniae CWL029 genome [7].

Bioinformatics analysis of the Cpn 1054 gene family of C. pneumoniae genomes
DNA and polypeptide sequences were aligned using CLUSTALW analysis (MacVector™ 6.0; Oxford Molecular, Genetics Computer Group, Inc. Madison, WI). Each gene and each predicted gene product was also subjected to gapped BLASTX and BLASTN, respectively (http://www.ncbi.nlm.nih.gov/BLAST/) [20]. The similarity between two different DNA sequences was determined using the BLAST 2 Sequences program (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html) [21]. Hydrophilicity profiles of the gene product of each Cpn 1054 family member were determined using hydropathy plot analysis ([22]; MacVector™).
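The sliding-window hydropathy calculation behind such profiles can be sketched in Python. This is a minimal illustration using the published Kyte-Doolittle scale and a window of 7 (the window stated for Figure 3); it is not the MacVector implementation, and the function name is invented here. Positive values indicate hydrophobic stretches; a hydrophilicity plot would show the same information with the sign reversed, which is an assumption about the plotting convention.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein: str, window: int = 7) -> list:
    """Mean Kyte-Doolittle hydropathy of every full window along the protein."""
    values = [KD[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# A 30-residue stretch of Cpn 007 taken from the alignment printed in this article.
segment = "PGLSSVISSPAGMGACALGCVMLALGIDVL"
profile = hydropathy_profile(segment)
print(len(profile))  # 24 windows for a 30-residue segment
print([round(v, 2) for v in profile])
```

A sequence of length n yields n - 6 window averages with a window of 7, so the profile is slightly shorter than the protein.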
Phylogenetic analyses of both DNA and amino acid sequences were performed using PAUP* [23]. In this study Cpn 0186 (IncA), and four additional candidate inclusion membrane proteins, Cpn 0284, Cpn 0285, Cpn 0829 and Cpn 0830, were selected as members of an outgroup for analysis. Phylogenetic trees were inferred by neighbor-joining to estimate evolutionary distances. Bootstrap values were obtained from a consensus of 100 neighbor-joining trees.
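The bootstrap step can be illustrated with a short Python sketch that resamples alignment columns with replacement and recomputes a pairwise distance matrix for each of 100 replicates. It does not reproduce PAUP*: the neighbor-joining step itself is omitted, the toy alignment and sequence names are hypothetical, and the uncorrected p-distance is an assumed distance measure, since the distance model used in the study is not stated.

```python
import random

def p_distance(a: str, b: str) -> float:
    """Proportion of differing positions, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (name1, name2) with name1 < name2."""
    names = sorted(alignment)
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m in names for n in names if m < n
    }

def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping all rows in register."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Hypothetical aligned sequences of equal length (placeholder names).
toy = {"seqA": "ATGCCCGATA", "seqB": "ATGCCCGTTA", "seqC": "ATGACC-TTA"}
original = distance_matrix(toy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
matrices = [distance_matrix(rep) for rep in replicates]
print(original)
```

Each replicate matrix would then be passed to a neighbor-joining routine, and the bootstrap value for a branch is the percentage of the 100 replicate trees in which that grouping appears.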

DNA amplification and sequence analysis of the Cpn 1054 gene family
Purified genomic DNA of selected isolates (Table 1) was amplified using specific oligonucleotide primers flanking the DNA region of interest (Table 2). The amplification products were purified using a Qiaquick PCR purification spin column kit (Qiagen, Valencia CA). Purified PCR products were sequenced using an ABI PRISM 377 (Perkin-Elmer, Norwalk, CT) through the Oregon State University Center for Gene Research and Biotechnology.

Examination of variation within isolates through cloning of PCR products
The variation of the length of the polyC tract within Cpn 043, 1054 and 1055 was determined through sequence analysis of purified amplification products and through sequencing of amplification products following cloning into plasmids. Both Taq [...]

[...] [7]. Candidate restriction enzyme sites that may be used for RFLP analysis are indicated.

Nucleotide sequence accession numbers
The nucleotide sequences of variants within Cpn 010 identified from independent clinical isolates were deposited in GenBank under the following accession numbers: AF474017 through AF474026, and AF461543 through AF461552.