Virus variation resources at the National Center for Biotechnology Information: dengue virus
© Resch et al. 2009
Received: 15 January 2009
Accepted: 02 April 2009
Published: 02 April 2009
There is an increasing number of complete and incomplete virus genome sequences available in public databases. This large body of sequence data harbors information about epidemiology, phylogeny, and virulence. Several specialized databases, such as the NCBI Influenza Virus Resource or the Los Alamos HIV database, offer sophisticated query interfaces along with integrated exploratory data analysis tools for individual virus species to facilitate extracting this information. Thus far, there has not been a comprehensive database for dengue virus, a significant public health threat.
We have created an integrated web resource for dengue virus. The technology developed for the NCBI Influenza Virus Resource has been extended to process non-segmented dengue virus genomes. To allow efficient processing of the dengue genome, which is large in comparison with individual influenza segments, we developed an offline pre-alignment procedure that generates a multiple sequence alignment of all dengue sequences. The pre-calculated alignment is then used to rapidly create alignments of sequence subsets in response to user queries. This improvement in technology will also facilitate the incorporation of additional virus species in the future. The set of virus-specific databases at NCBI, referred to here as the Virus Variation Resources (VVR), allows users to build complex queries against virus-specific databases and then apply exploratory data analysis tools to the results. Metadata are collected automatically where possible and extended with data extracted from the literature.
The NCBI Dengue Virus Resource integrates dengue sequence information with relevant metadata (sample collection time and location, disease severity, serotype, sequenced genome region) and facilitates retrieval and preliminary analysis of dengue sequences using integrated web analysis and visualization tools.
The National Center for Biotechnology Information (NCBI) Virus Variation Resources (VVR) provide web retrieval interfaces, analysis and visualization tools for virus sequence datasets. In this paper we describe the recent extension of the collection of resources to include the Dengue Virus Resource in addition to the existing Influenza Virus Resource [1, 2]. The NCBI Dengue Virus Resource was created to support a collaborative effort by the National Institute of Allergy and Infectious Diseases (NIAID), the Broad Institute, and the Novartis Institute for Tropical Diseases (NITD) to create a large collection of complete dengue genome sequences and provide access to the sequences and linked geographic and clinical information. This effort includes the NIAID-funded sequencing of dengue genomes from a wide geographic range by the Broad Institute and its collaborators.
The World Health Organization (WHO) estimates that up to 50 million individuals in more than 100 tropical and sub-tropical countries are infected with the mosquito-borne dengue virus (DENV) each year, resulting in 500,000 hospitalizations [3, 4]. With improvements in disease identification, reporting and surveillance, the number of reported dengue cases has been increasing in recent decades (Figure 1), as has the geographic range of the virus and its main vector, Aedes aegypti, making dengue a growing public health concern, especially in developing nations. Dengue infections can result in a wide spectrum of disease severity, ranging from sub-clinical infection to dengue fever (DF), an influenza-like illness that is commonly self-limiting, to the life-threatening dengue haemorrhagic fever (DHF)/dengue shock syndrome (DSS).
The number of DENV sequences available in the public sequence repositories has been growing steadily. The value of these sequences would be enhanced by integrating the database with exploratory tools for preliminary phylogenetic analysis, with searches on epidemiological, geographic, and medical information, and with convenient interactive visualization. DengueInfo was developed by NITD as a resource for retrieving whole genomes and associated metadata. Similarly, whole genome sequences generated at the Broad Institute can be accessed and queried directly from the institute's online database. However, neither of these resources provides an integrated interface to analysis and visualization tools, nor does either provide access to all dengue sequences irrespective of origin or length. To meet these needs, we extended the functionality developed by the authors of the NCBI Influenza Virus Resource to the non-segmented dengue virus. Since the DENV genome is more than 4 times larger than the largest individual influenza virus segment, multiple sequence alignments could not be calculated on request as is done for influenza virus and are instead pre-calculated offline. The alignment calculation is a three-step procedure that first generates multiple protein alignments for the polyproteins derived from complete genome records of each DENV serotype, then merges the serotype-specific protein alignments, and finally adds shorter protein sequences iteratively. Coding sequence alignments are calculated on demand from the protein alignments. The new NCBI Virus Variation Resource is a flexible tool that can be extended to other viruses, for example West Nile virus.
The current Virus Variation Resource includes dengue and influenza virus sequences. The NCBI Influenza Virus Resource was described elsewhere [1, 2]. Here we describe the extension of this resource to include dengue virus sequence data. Since the dengue genome is not segmented but more than 4 times longer than the longest influenza segment, a different approach to calculating multiple alignments is used for dengue sequences. While alignments in the Influenza Resource are calculated on demand, dengue alignments are pre-calculated to increase responsiveness and reduce server loads. Details of this approach are described in a later section.
All DENV nucleotide and protein sequences available in the public DDBJ/EMBL/GenBank repositories are evaluated for inclusion in the database. Patent sequences and sequences that contain obvious errors or vector sequences are excluded and the serotype classification is verified by comparison with a reference sequence set. Metadata (disease severity, collection date, collection location, serotype, genome region) are taken from the records, if available, or obtained from the literature. The region of the DENV genome covered by the sequence is determined by alignment and made available for queries. Newly public sequences are detected in the NCBI data stream daily and are usually added to the database within a week of becoming available.
[Table: number of total dengue records and of records with known collection country, known collection year, and known disease severity]
Virus Variation Resource data are stored in the relational database system Microsoft SQL Server 2005 using a simple schema: nucleic acid sequences and their metadata are stored in one table, and protein sequences are stored in a second table and linked to their encoding sequences through an id field.
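The two-table layout can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, and all table names, column names, accessions, and sample rows are hypothetical; the text above states only that nucleotide records carry the metadata and that protein records link back to them through an id field.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative assumptions,
# not the actual VVR schema.
SCHEMA = """
CREATE TABLE nucleotide (
    id               INTEGER PRIMARY KEY,
    accession        TEXT NOT NULL UNIQUE,
    serotype         INTEGER,
    country          TEXT,
    collection_year  INTEGER,
    disease_severity TEXT,
    sequence         TEXT NOT NULL
);
CREATE TABLE protein (
    id            INTEGER PRIMARY KEY,
    nucleotide_id INTEGER NOT NULL REFERENCES nucleotide(id),
    accession     TEXT NOT NULL UNIQUE,
    sequence      TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding two made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO nucleotide VALUES "
        "(1, 'NUC0001', 3, 'Thailand', 1988, 'DHF', 'ATGAAC')"
    )
    conn.execute(
        "INSERT INTO nucleotide VALUES "
        "(2, 'NUC0002', 2, 'Brazil', 2001, NULL, 'ATGAAT')"
    )
    conn.execute("INSERT INTO protein VALUES (1, 1, 'PRO0001', 'MN')")
    conn.execute("INSERT INTO protein VALUES (2, 2, 'PRO0002', 'MN')")
    conn.commit()
    return conn


def proteins_for(conn, serotype, country):
    """Return protein accessions whose encoding record matches the metadata."""
    rows = conn.execute(
        "SELECT p.accession FROM protein AS p "
        "JOIN nucleotide AS n ON n.id = p.nucleotide_id "
        "WHERE n.serotype = ? AND n.country = ? "
        "ORDER BY p.accession",
        (serotype, country),
    )
    return [row[0] for row in rows]
```

Because the metadata live only on the nucleotide record, a protein query by serotype or country has to join through the id field, as in proteins_for above.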
Multiple alignments of the available DENV protein sequences in VVR are pre-calculated offline using the following three-step procedure. First, all complete protein sequences of each serotype are aligned separately in a multiple alignment step. Then, the individual intra-type alignments are merged to create a seed alignment covering the complete dengue polyprotein. Finally, incomplete sequences are aligned one by one against the sequences of the same type from the seed alignment using sequence-to-profile alignments. If a gap column is inserted into the profile during one of the iterative alignment steps, it is introduced into the complete seed alignment of all types to preserve consistency. When new sequences are added to the VVR database, they are added to the existing alignment through the last step of the alignment procedure. Periodically, the alignment is completely recalculated to take advantage of the increase in the number of complete sequences. Alignments are calculated with MUSCLE, driven by a set of custom Perl programs that rely on the BioPerl toolkit. Nucleotide alignments of the coding regions are generated dynamically as codon alignments based on the protein alignments.
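The consistency rule in the last step, in which a gap column opened in one serotype profile is propagated to every row of the seed alignment, can be sketched as follows. This is a simplified illustration in Python, not the Perl/BioPerl code used for the resource. The sequence-to-profile alignment itself is assumed to have been run already by an external aligner; the function names, the representation of an alignment as a dict of equal-length strings, and the example sequences are all hypothetical.

```python
def insert_gap_columns(alignment, columns):
    """Insert a gap column before each given column index in every row.

    `alignment` maps sequence names to aligned strings of equal length.
    `columns` are indices in the coordinates of the existing alignment.
    """
    updated = {}
    for name, row in alignment.items():
        chars = list(row)
        # Work from right to left so earlier indices remain valid.
        for col in sorted(columns, reverse=True):
            chars.insert(col, "-")
        updated[name] = "".join(chars)
    return updated


def add_aligned_fragment(seed, name, aligned_row, new_gap_columns):
    """Add one fragment that was aligned to a serotype profile.

    `aligned_row` is the fragment in the coordinates of the widened
    alignment; `new_gap_columns` lists the seed columns before which the
    profile alignment opened a new gap column (assumed aligner output).
    """
    updated = insert_gap_columns(seed, new_gap_columns)
    width = len(next(iter(updated.values())))
    if len(aligned_row) != width:
        raise ValueError("fragment row does not match alignment width")
    updated[name] = aligned_row
    return updated


# Made-up example: the fragment carries one extra residue (A) that forces
# a new column before seed column 2 in rows of every serotype.
seed = {"D1_complete": "MKTL", "D2_complete": "MRTL"}
result = add_aligned_fragment(seed, "D1_fragment", "-KATL", [2])
```

After the call, both complete sequences contain a gap at the new column and all three rows have the same width, which is the property that lets any subset of rows be pulled from the stored alignment later.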
The multiple alignment viewer is accessible from the results view. It assembles the requested pre-aligned sequences and displays them with a measure of sequence variability and a consensus anchor sequence at the top (Figure 3C). Any of the sequences can be chosen to replace the consensus as the anchor. Sequences can be selected for pairwise Blast-2-sequences alignment, and aligned sequences can be downloaded in FASTA or a print-friendly format. CDS alignments are calculated dynamically from the pre-calculated protein alignment by mapping codons to their corresponding amino acids, with coding changes highlighted in a different color. Note that only the regions selected in the query are displayed in the alignment and that the number of displayed residues in the alignment is limited to avoid delivering excessive amounts of data to client browsers. Currently the limit is 100,000 residues (for example, 200 sequences of length 500), but planned improvements to the alignment viewer will likely raise this limit.
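The operations described above (assembling a subset of pre-aligned rows, deriving a codon alignment from an aligned protein, and checking the display limit) can be sketched in a few lines of Python. The sketch rests on several assumptions that are not stated in the text: columns that contain only gaps within the subset are dropped, each CDS contains exactly one codon per residue with no stop codon, and the residue limit is counted as the total number of alignment cells. The function names are hypothetical.

```python
def extract_subset(alignment, names):
    """Pull the requested rows and drop columns that are gaps in all of them."""
    if not names:
        return {}
    rows = [alignment[name] for name in names]
    keep = [
        i for i in range(len(rows[0]))
        if any(row[i] != "-" for row in rows)
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


def thread_codons(aligned_protein, cds):
    """Build a codon-aligned CDS row from an aligned protein row.

    Each residue is replaced by its codon and each protein gap by '---'.
    """
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * residues:
        raise ValueError("CDS length does not match protein length")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(
        "---" if aa == "-" else next(codons) for aa in aligned_protein
    )


def within_display_limit(alignment, limit=100_000):
    """Check the total number of alignment cells against the display limit."""
    return sum(len(row) for row in alignment.values()) <= limit
```

For example, extracting rows a and b from {"a": "M-KT", "b": "M-RT", "c": "MAKT"} removes the second column, which is occupied only in row c. Because the codon rows are derived from the protein rows, the nucleotide alignment never breaks a reading frame.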
Phylogenetic or clustering trees can be calculated and displayed for protein sequences or their corresponding CDS sequences. The tree builder is accessible from the results and the alignment views with the "Build a tree" button and allows sequences to be selected for inclusion based on a trade-off between total length of the alignment and the exclusion of short sequences. Various measures of distance for protein and nucleotide sequences are available and are identical to those described for the NCBI Influenza Virus Resource. Trees can be constructed from the distance matrices using the neighbor-joining, average linkage, complete linkage, or single linkage algorithms. To facilitate the display of trees with many leaf nodes, an adaptive resolution technique is employed in which some branches are displayed in a sub-scale representation (Figure 3D). Users can interactively manipulate the aggregation or refinement of any branch in the tree. In addition, certain metadata, such as year or country of isolation, can be displayed on the tree and are shown as aggregate measures for aggregated branches.
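As an illustration of building a tree from a distance matrix, the following Python sketch implements average linkage clustering, one of the four algorithms listed above. The distance used here, the proportion of differing residues over columns where neither sequence has a gap, is an assumption chosen for simplicity and is not necessarily one of the measures offered by the resource. The function names are hypothetical, and the output is a Newick-style topology without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing residues over columns ungapped in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def average_linkage_tree(alignment):
    """Cluster aligned sequences by average linkage; return a topology string."""
    # Pairwise distances between the original sequences.
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }
    # Each cluster label maps to the names of the sequences it contains.
    clusters = {name: [name] for name in alignment}

    def cluster_distance(label_a, label_b):
        members_a, members_b = clusters[label_a], clusters[label_b]
        total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = merged
    return next(iter(clusters)) + ";"
```

Complete or single linkage would differ only in cluster_distance, taking the maximum or minimum pairwise distance instead of the mean. Neighbor-joining uses a different merge criterion and is not covered by this sketch.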
It has been reported that strains of DENV-3 circulating in Thailand prior to 1992 are distinct from those circulating after 1992, and this finding has been interpreted as an extinction of existing DENV-3 strains and the emergence of new, locally evolved strains. This event reportedly coincided with the replacement of DENV-2 by DENV-3 as the majority serotype in Thailand. We demonstrate a preliminary analysis of dengue sequences using the tools of the Virus Variation Resource that supports this observation.
The Virus Variation Resource currently covers dengue and influenza viruses. However, the framework of this resource may be applied to other viruses. The Influenza Virus Resource has been very successful since its inception and we hope that additional resources in a similar mold will prove useful for other communities.
Virus Variation Resources constitute a tool that allows the included virus sequences to be queried by available metadata which include geographic and medical information. Sequences resulting from these searches can then be downloaded in aligned or unaligned forms and optionally subjected to exploratory data analysis using the built-in tools. The technology for pre-calculating multiple sequence alignments can be applied to other collections, including the existing Influenza Virus Resource and a resource for the West Nile Virus that we plan to develop in the future.
VVR databases and tools are provided as a free service by the National Center for Biotechnology Information and can be accessed at http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/.
NCBI: National Center for Biotechnology Information
NIAID: National Institute of Allergy and Infectious Diseases
NITD: Novartis Institute for Tropical Diseases
VVR: Virus Variation Resource
DHF: dengue haemorrhagic fever
DSS: dengue shock syndrome
WNV: West Nile virus
WHO: World Health Organization
INSD: International Nucleotide Sequence Database.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. We thank Dr. D. Lipman (NCBI), Dr. J. Ostell (NCBI), Dr. J. Rodney Brister (NCBI), Dr. S. Ciufo (NCBI), Dr. S. Watowich (UTMB), Dr. M. Schreiber (NITD), Dr. E. Holmes (Pennsylvania State University), Dr. M. Miller (NIH Fogarty International Center), and the participants of the "Discovery and Evaluation of Therapeutics against Dengue" workshop for helpful discussions. P. Bolotov (NCBI), M. Kimelman (NCBI), and S. Zhdanov (NCBI) contributed to the setup of the database backend and the daily scan of new sequence records.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.