Open Access

Hepatitis C virus whole genome position weight matrix and robust primer design

  • Ping Qiu1Email author,
  • Xiao-Yan Cai2,
  • Luquan Wang1,
  • Jonathan R Greene1 and
  • Bruce Malcolm3
BMC Microbiology20022:29

DOI: 10.1186/1471-2180-2-29

Received: 11 July 2002

Accepted: 25 September 2002

Published: 25 September 2002

Abstract

Background

The high degree of sequence heterogeneity found in Hepatitis C virus (HCV) isolates, makes robust nucleic acid-based assays difficult to generate. Polymerase chain reaction based techniques, require efficient and specific sequence recognition. Generation of robust primers capable of recognizing a wide range of isolates is a difficult task.

Results

A position weight matrix (PWM) and a consensus sequence were built for each region of HCV and subsequently assembled into a whole genome consensus sequence and PWM. For each of the 10 regions, the number of occurrences of each base at a given position was compiled. These counts were converted to frequencies that were used to calculate log odds scores. Using over 100 complete and 14,000 partial HCV genomes from GenBank, a consensus HCV genome sequence was generated along with a PWM reflecting heterogeneity at each position. The PWM was used to identify the most conserved regions for primer design.

Conclusions

This approach allows rapid identification of conserved regions for robust primer design and is broadly applicable to sets of genomes with all levels of genetic heterogeneity.

Background

Genetic heterogeneity is a hallmark of RNA viruses in general, and the hepatitis C virus (HCV) in particular, due to the lack of fidelity of viral RNA-dependent RNA polymerases [1, 2]. In HCV, this genetic diversity has been organized into six major genotypes and numerous subtypes (over 80). Isolates of the same genotype have an average DNA sequence identity of 95%, but different genotypes have DNA sequence identity close to 65% on average [25].

Nucleic acid-based assays, such as the polymerase chain reaction (PCR), the ligase chain reaction (LCR), nucleic acid sequence-based amplification (NASBA), branched chain DNA (bDNA) and sequence analysis itself, rely on the efficient hybridization of oligonucleotides to the targeted sequence. Mismatches between the oligonucleotides and the targeted nucleic acid can affect duplex stability and may compromise the ability of a system to amplify and detect the targeted sequences. Numerous factors determine the effect of mismatches, including: the length of the oligonucleotide, the nature and position of the mismatches, the temperature of hybridization, the presence of co-solvents and the concentrations of oligonucleotides, as well as monovalent and divalent cations [6].

The sequence heterogeneity of HCV challenges efficient detection with nucleic acid-based assays. PCR is widely used for the detection of HCV specific nucleic acids due to its sensitivity. Generally speaking however, effective primers require the genotype of the sample to be known in advance and even then will often be less than 100% effective due to minor variations in the isolates.

Design of robust primers to maximize success with unsequenced isolates (i.e. clinical samples), is a common challenge facing the molecular virologists. A number of software products exist to facilitate primer selection with defined genomes. Many factors are considered in these programs, for example, melting temperature of primers, avoiding primer dimers, avoiding self-complementary primers etc (e.g., Primer Premier [7], Primer3 [8], PRIDE [9]). These algorithms deal mostly with a single template or a small number of sequences. Little effort has been made to handle large number of heterogeneous variants of a given genome.

A large number of HCV related sequences have been deposited in GenBank, making genome wide comparison of all different HCV genotypes and subtypes possible. In this report, more than 100 complete and 14,000 partial sequences deposited in GenBank (Release 129, April 15, 2002) have been used to generate a genome wide consensus sequence and Position Weight Matrix (PWM) for the HCV genome. A PWM based approach for identifying highly conserved regions is proposed which should aid in robust primer design for nucleic acid-based assays. This approach is general enough to be used to optimize any set of genomes with a high degree of heterogeneity.

Results and Discussion

Aligning genomes and generating a position weight matrix(PWM)

One HCV genome (D90208) was used as a template and was separated into pieces based on known gene boundaries. All complete and partial sequences that contained a given region were collected by TBLASTN [10] against HCV sequences from the GenBank non-redundant nucleotide sequence database (nt). An alignment was then made for each part of the genome using ClustalW [11]. A weight score for each position in each fragment was calculated and a PWM was created for that fragment. A whole genome PWM was created by joining the individual PWMs. Finally a 25-bp window, (representative of a typical primer), was walked through the genome/PWM to identify the most conserved locations for primer design.

Due to the extreme genetic heterogeneity of the HCV genome and the nature and large number of complete and partial sequences in the public database, a direct genome wide sequence alignment was not feasible. The approach taken, to break the HCV genome into 10 pieces according to the gene boundaries, proved to be successful. HCV sequence D90208 was chosen arbitrarily as the template sequence and the number of sequences included for alignment of each region is indicated in Table 1.
Table 1

Sequence numbers for each region that was used to construct the whole genome PWM

Region

Start

Stop

# of Sequences

5' UTR

1

329

1333

CORE

330

899

1818

E1

900

1475

3830

E2(p7)

1476

2564

3792

NS2

2565

3407

1496

NS3

3408

5300

520

NS4A

5301

5462

277

NS4B

5463

6245

345

NS5A

6246

7586

1571

NS5B

7587

9413

1914

Some regions of the HCV genome share only 50 percent identity across strains. Figure 1 shows a plot of conservation score using a 10-bp window for the whole HCV genome. Region 1–350, which corresponds to the 5' UTR is very conserved across all strains while region 1860–2230 corresponding to the E2 protein, is very heterogeneous. In addition, third position wobble causes mismatching at virtually every third base (in the coding region), leading as expected, to less identity at the DNA level [12]. In the process of collecting sequences for each HCV region, using a nucleotide level comparison algorithm like BLASTN, a lot of sequence entries will be missed. To solve this problem, a protein level comparison algorithm TBLASTN was used via a six-frame translation. Different stringency scores were used to ensure that as many sequences as possible were retrieved. A sequence was chosen for alignment (for a given region) if it shared at least 50% identity over a 30 amino acid stretch or 65 % identity over a 20 amino acid stretch, or over 90% identity over a 10 amino acid stretch with the template sequence. These cutoffs were chosen following inspection of the blast hits for the different regions. Only 4.9% of the available sequences were discarded due to failure to meet the aforementioned criteria.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2180-2-29/MediaObjects/12866_2002_Article_63_Fig1_HTML.jpg
Figure 1

HCV genome conservation score distribution.

For each regional alignment, flanking sequences were trimmed prior to generating the PWM. The genome wide PWM was created by combining all individual PWMs (see additional file 1). Insertions (represented by '-' in additional file 1) were added to the template sequence only if greater than 1% of the sequences contained this insertion. This was done to reduce the inclusion of spurious insertions that are caused by sequencing errors or that exist in only a single isolate. A consensus sequence was derived by picking the most frequently occurring base at each position.

Choosing a conserved region for optimized primer design

Using the PWM, the most conserved stretches were rapidly identified making possible the design of robust primers based on the criteria described in Methods. The 25-bp segments in Table 2 and Table 3 are listed by positions in the genome. The higher the odds score, the more conserved the region. Figure 2 shows samples from the final PWM; one 10-bp region in NS5B has a very low conservation score (A), a second 10-bp region shows a very high conservation score (B). This approach allows rapid identification of the most conserved regions of the genome with no regard for self-complementarity of primers, optimizing melting temperature, avoiding primer dimers, etc. Once potential regions of interest are identified, other primer design algorithms can then be used to ensure that self-complementarity etc. will not be a problem. This two step strategy for designing robust primers can be applied to any set of genomes with a high degree of heterogeneity such as viruses, bacterial genes etc. Once a specific sequence has been identified, partially degenerate or multiple oligonucleotides can easily be generated as deemed appropriate for the particular application. Empirical validation of all primers is still prudent.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2180-2-29/MediaObjects/12866_2002_Article_63_Fig2_HTML.jpg
Figure 2

Comparison of two 10-bp regions in NS5B: the first with a very low conservation score (A), the second with a very high conservation score (B). Conservation scores were calculated by taking the average of the highest log odds score for each position (see Methods). The sequence shown on top of the matrices is the consensus sequence.

Table 2

Suggested forward primer regions based on HCV whole genome PWM

Primer Start

Primer End

Conservation Score

Primer Start

Primer End

Conservation Score

Primer Start

Primer End

Conservation Score

7

31

1.804

465

489

1.936

5672

5696

1.821

8

32

1.805

470

494

1.938

6263

6287

1.74

32

56

1.94

471

495

1.939

6302

6326

1.775

33

57

1.94

559

583

1.716

6317

6341

1.732

40

64

1.954

560

584

1.736

6318

6342

1.734

54

78

1.957

580

604

1.873

6323

6347

1.78

93

117

1.911

581

605

1.873

6356

6380

1.786

94

118

1.912

582

606

1.873

6376

6400

1.728

106

130

1.962

600

624

1.89

6424

6448

1.744

107

131

1.962

610

634

1.757

6431

6455

1.77

141

165

1.956

611

635

1.779

6542

6566

1.736

142

166

1.956

612

636

1.779

6556

6580

1.707

143

167

1.957

619

643

1.798

6976

7000

1.708

165

189

1.865

707

731

1.614

7018

7042

1.748

168

192

1.866

1297

1321

1.782

7081

7105

1.862

178

202

1.897

1298

1322

1.782

7097

7121

1.922

204

228

1.841

1302

1326

1.766

7098

7122

1.925

212

236

1.871

1303

1327

1.804

7249

7273

1.66

213

237

1.881

1402

1426

1.719

7250

7274

1.686

214

238

1.881

1403

1427

1.733

7251

7275

1.686

215

239

1.882

1404

1428

1.733

7288

7312

1.68

218

242

1.893

1447

1471

1.765

7568

7592

1.782

219

243

1.893

1448

1472

1.795

7569

7593

1.782

220

244

1.899

1449

1473

1.794

7581

7605

1.829

221

245

1.9

1754

1778

1.587

7587

7611

1.902

222

246

1.901

1755

1779

1.626

7689

7713

1.734

225

249

1.922

2511

2535

1.676

7879

7903

1.792

237

261

1.922

2572

2596

1.796

7895

7919

1.812

238

262

1.922

2692

2716

1.856

7985

8009

1.849

263

287

1.952

2715

2739

1.928

8015

8039

1.804

264

288

1.954

2746

2770

1.789

8016

8040

1.803

337

361

1.833

2747

2771

1.791

8034

8058

1.835

345

369

1.849

2748

2772

1.792

8035

8059

1.852

351

375

1.883

2820

2844

1.928

8040

8064

1.878

364

388

1.934

3247

3271

1.728

8089

8113

1.88

381

405

1.918

3308

3332

1.806

8090

8114

1.88

387

411

1.899

3328

3352

1.851

8159

8183

1.872

395

419

1.929

3329

3353

1.871

8192

8216

1.794

405

429

1.914

3330

3354

1.87

8233

8257

1.791

406

430

1.914

3439

3463

1.591

8234

8258

1.811

407

431

1.937

3607

3631

1.729

8295

8319

1.807

408

432

1.936

3703

3727

1.852

8312

8336

1.728

422

446

1.923

3886

3910

1.779

8726

8750

1.739

443

467

1.908

3892

3916

1.762

8893

8917

1.686

444

468

1.908

4955

4979

1.788

8926

8950

1.816

447

471

1.907

5049

5073

1.898

8962

8986

1.74

448

472

1.907

5186

5210

1.751

8963

8987

1.749

451

475

1.909

5187

5211

1.774

9048

9072

1.642

452

476

1.922

5198

5222

1.801

9114

9138

1.717

453

477

1.922

5354

5378

1.84

9180

9204

1.794

454

478

1.922

5355

5379

1.843

9202

9226

1.745

455

479

1.922

5499

5523

1.811

9315

9339

1.826

456

480

1.921

5636

5660

1.724

9475

9499

1.704

464

488

1.929

5657

5681

1.678

   

To ensure optimal polymerization, the 3' end and the penultimate position were required to be G or C with frequencies of ≥0.98 and the upstream position, (3' -2), a G or C with a frequency of ≥0.90 or alternatively an A or T with a frequency of ≥0.95.

Table 3

Suggested reverse primer regions based on HCV whole genome PWM.

Primer Start

Primer End

Conservation Score

Primer Start

Primer End

Conservation Score

Primer Start

Primer End

Conservation Score

30

54

1.94

467

491

1.938

3352

3376

1.822

31

55

1.94

470

494

1.938

3630

3654

1.808

55

79

1.956

471

495

1.939

3726

3750

1.684

63

87

1.942

474

498

1.938

5043

5067

1.876

77

101

1.929

475

499

1.931

5072

5096

1.835

116

140

1.972

476

500

1.926

5209

5233

1.818

117

141

1.972

477

501

1.923

5210

5234

1.792

129

153

1.954

478

502

1.891

5221

5245

1.763

164

188

1.865

487

511

1.918

5377

5401

1.754

165

189

1.865

488

512

1.918

5378

5402

1.732

166

190

1.865

493

517

1.89

5522

5546

1.745

187

211

1.9

494

518

1.89

5659

5683

1.695

188

212

1.891

572

596

1.805

5680

5704

1.882

191

215

1.89

582

606

1.873

5695

5719

1.866

197

221

1.878

603

627

1.811

6286

6310

1.782

201

225

1.846

604

628

1.77

6325

6349

1.815

209

233

1.868

605

629

1.77

6340

6364

1.79

235

259

1.923

633

657

1.897

6341

6365

1.746

236

260

1.922

634

658

1.864

6379

6403

1.648

237

261

1.922

642

666

1.743

6454

6478

1.783

238

262

1.922

691

715

1.542

6579

6603

1.838

241

265

1.915

730

754

1.484

7120

7144

1.868

242

266

1.914

856

880

1.58

7121

7145

1.868

243

267

1.913

1297

1321

1.782

7272

7296

1.794

244

268

1.912

1320

1344

1.893

7273

7297

1.754

245

269

1.912

1325

1349

1.862

7311

7335

1.674

260

284

1.953

1326

1350

1.834

7591

7615

1.856

261

285

1.953

1425

1449

1.762

7712

7736

1.821

286

310

1.924

1426

1450

1.729

7902

7926

1.813

287

311

1.922

1439

1463

1.716

8038

8062

1.878

360

384

1.936

1470

1494

1.67

8057

8081

1.912

387

411

1.899

1471

1495

1.62

8058

8082

1.912

403

427

1.914

1777

1801

1.532

8092

8116

1.864

407

431

1.937

2534

2558

1.737

8112

8136

1.833

418

442

1.894

2715

2739

1.928

8256

8280

1.833

428

452

1.914

2769

2793

1.918

8293

8317

1.806

429

453

1.914

2770

2794

1.907

8916

8940

1.839

430

454

1.914

2771

2795

1.894

8985

9009

1.684

445

469

1.908

3270

3294

1.759

9203

9227

1.742

452

476

1.922

3331

3355

1.854

9498

9522

1.574

466

490

1.936

3351

3375

1.84

   

To ensure optimal polymerization, the 3' end and the penultimate position were required to be G or C with frequencies of ≥0.98 and the upstream position, (3' -2), a G or C with a frequency of ≥ 0.90 or alternatively an A or T with a frequency of ≥ 0.95.

Methods

Databases and Resources

Genbank Release 129 was downloaded from ftp://ncbi.nlm.nih.gov. Pairwise alignment TBLASTN [13] was used to determine whether or not two sequences share similarity. ClustalW [11] was used for multiple sequence alignment. All non-commercial softwares used in this study were written in PERL 5.0.

Construction of alignment

All HCV related sequences were extracted from GenBank (Release 129) by using keyword HCV or Hepatitis C. D90208 was chosen as the organizing template for its fully annotated genome in the GenBank. (Other organizing HCV genomes yielded virtually identical consensus sequences and PWM profiles.) The genomes were separated into 10 regions according to D90208's annotation: 5' UTR, core, E1, E2(P7), NS2, NS3, NS4A, NS4B, NS5A, NS5B. The DNA sequences for each of these regions were retrieved and used for TBLASTN analysis against all HCV sequences. If a sequence shared 50% identity over 90-bp (30 amino acids), 65% identity over 60-bp (20 amino acids) or 90% over 30-bp (10 amino acids) with the query template region, it was considered to contain part of the corresponding gene from a HCV genome in that region, and therefore was used for multiple sequences alignments of this region. For each region, a multiple alignment was done using ClustalW. Alignment was manually curated to eliminate obvious false alignments due to bad sequence quality or inappropriate BLAST hits.

Construction of PWM

A PWM and a consensus sequence were built for each region of HCV and subsequently assembled into a whole genome consensus sequence and PWM. For each of the 10 regions, the number of occurrences of each base at a given position was compiled. These counts were converted to frequencies that were used to calculate log odds scores. The odds score is the frequency observed divided by the theoretical frequency expected (i.e., the background frequency of the base, usually averaged over the genome ~0.25/base). For example, if the base frequency is 0.79 and the estimated background frequency is 0.25, the odds score would be 0.79/0.25 = 3.16. Finally, odds scores were converted to log odds scores by taking the logarithm base 2.

Wi,j = log2(Fi,j/Pi,)

Where

Wi,j is the scoring matrix value of base i in position j

Fi,j is the frequency of base i in position j, Pi is the background frequency of base i

As the logarithm of zero is infinity, a zero occurrence of a particular base in the matrix creates a problem. In this case, a large negative log odds score may be used at such a position in a scoring matrix. A formula proposed by Hertz and Stormo [14] was used instead in our calculations.

Wi,j = log2 [(Ci,j + Pi)/{(N + 1)Pi}] ≈ log2(Fi,j/Pi,)

Where Ci,j is the count of base i in position j, N is the total number of sequences.

Choosing a conserved region for primer design

By sliding 25 bp window (representing average primer length) incrementally along the genome in 1-bp intervals, an average of the highest log odds scores for each position (either A, C, G or T) was calculated to generate the overall degree of conservation (conservation score) for this 25-bp region.

https://static-content.springer.com/image/art%3A10.1186%2F1471-2180-2-29/MediaObjects/12866_2002_Article_63_Equa_HTML.gif

where L is the length of the region (25-bp in this case).

For PCR applications (or those involving polymerization, where homology at the 3' end of the primer is most critical), it is recommended that the 3' end and the penultimate position be G or C with frequencies of ≥0.98. It is also beneficial if the previous position (3' -2) is a G or C with a frequency of ≥0.90 or alternatively, an A or T with a frequency of ≥0.95. Regions that contain insertions, should in general, be avoided.

Declarations

Authors’ Affiliations

(1)
Bioinformatics Group and Discovery Technology Department, Schering-Plough Research Institute
(2)
Bioanalytical Department, Schering-Plough Research Institute
(3)
Antiviral Therapy Department, Schering-Plough Research Institute

References

  1. Grakoui A, McCourt DW, Wychowski C, Feinstone SM, Rice CM: Characterization of the hepatitis C virus-encoded serine proteinase: determination of proteinase-dependent polyprotein cleavage sites. J Virol. 1993, 67: 2832-2843.PubMed CentralPubMedGoogle Scholar
  2. Davis GL: Hepatitis C virus genotypes and quasispecies. Am J Med. 1999, 107: 21S-26S. 10.1016/S0002-9343(99)00376-9.View ArticlePubMedGoogle Scholar
  3. Kato N: Genome of human hepatitis C virus (HCV): gene organization, sequence diversity, and variation. Microb Comp Genomics. 2000, 5: 129-151.View ArticlePubMedGoogle Scholar
  4. Kato N: Molecular virology of hepatitis C virus. Acta Med Okayama. 2001, 55: 133-159.PubMedGoogle Scholar
  5. Forns X, Bukh J: The molecular biology of hepatitis C virus. Genotypes and quasispecies. Clin Liver Dis. 1999, 3: 693-716.View ArticlePubMedGoogle Scholar
  6. Wetmur JG, Sninsky JJ, In Innis MA, Gelfand DH, Sninsky JJ: PCR strategies. 1995, New York: Academic PressGoogle Scholar
  7. Singh VK, Mangalam AK, Dwivedi S, Naik S: Primer premier: program for design of degenerate primers from a protein sequence. Biotechniques. 1998, 24: 318-319.PubMedGoogle Scholar
  8. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.PubMedGoogle Scholar
  9. Haas S, Vingron M, Poustka A, Wiemann S: Primer design for large scale sequencing. Nucleic Acids Res. 1998, 26: 3006-3012. 10.1093/nar/26.12.3006.PubMed CentralView ArticlePubMedGoogle Scholar
  10. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.View ArticlePubMedGoogle Scholar
  11. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.PubMed CentralView ArticlePubMedGoogle Scholar
  12. Lin HJ, Lau JY, Lauder IJ, Shi N, Lai CL, Hollinger FB: The hepatitis C virus genome: a guide to its conserved sequences and candidate epitopes. Virus Res. 1993, 30: 27-41. 10.1016/0168-1702(93)90013-D.View ArticlePubMedGoogle Scholar
  13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.View ArticlePubMedGoogle Scholar
  14. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.View ArticlePubMedGoogle Scholar

Copyright

© Qiu et al; licensee BioMed Central Ltd. 2002

This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.