HCV Database
HCV sequence database
 



To our users: Please note that the HCV database site is no longer funded. We try to keep the database updated and the tools running, but unfortunately we cannot guarantee help with using this site. The data will not be manually curated either.


Things you can do:
Find sequences in the database
Download sequences from the database
Retrieve data about the sequences
Analyze sequences
Work with the sequences using our tools
Download ready-made alignments

Finding sequences in the database

The HCV search interface allows you to find and download sequences on the basis of a number of criteria. The search interface is quite flexible and has many more features than are immediately visible; they are explained in the search interface help page. Aside from the regular fields, a number of fields can be searched and listed using the 'other fields' pulldown menu. The output of the search interface automatically includes all the fields that were used for the search, and a number of other fields are displayed by default. Other fields can be listed by marking them with a '*' in the search page. The search interface also shows the location of the resulting sequences graphically, relative to the complete HCV genome (regular sequences are shown in red, reverse-complement sequences in blue).

Downloading sequences

The search interface allows you to search on a variety of fields. You can either download all sequences (as nucleotides or amino acid sequences) that meet your criteria, or you can limit your set to a specific gene or region by selecting that genomic region on the search interface. The genomic regions can be specified as whole genes or as coordinates; you can find the coordinates by checking the HCV genomic map, or if you want to download sequences that match a region you are looking at, you can use the HCV numbering engine to find the beginning and end coordinates you should use. Sequence names can be output in three formats, chosen to provide concise labels (genosubtype and name, the default), optimal information (genosubtype, sampling country, sampling year, name), and uniquely identifying labels (accession, short description; this naming system can be used for PHYLIP trees and other programs that restrict the names to 10 characters and/or require unique names).

You can retrieve sequences as an alignment, or unaligned. If you choose aligned sequences, one of two things happens. If you used a genomic region or sequence coordinates to retrieve your alignment, it will be limited to this region. Otherwise, you will end up with an alignment that covers the entire genome, i.e. one that is around 11,000 characters long. This can be convenient if you want to align your sequences to a set of complete genomes, or to other sequences retrieved using the same method (these alignments may differ by a few positions). Please note that the alignments are not necessarily optimal and may require manual adjustment, but they form a very good starting point. WARNING: If you download an alignment, sequences that do not have valid coordinates relative to H77 will NOT be included in the alignment! This can happen if the sequences are very short, if they contain non-HCV inserts, or if they are reverse complements.
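As an illustration of how reference coordinates relate to alignment columns, here is a minimal Python sketch. It assumes you have a downloaded alignment as plain strings and that one row is the reference (for example H77). The function name and the toy sequences are hypothetical, and this is not the code the database itself uses.

```python
def columns_for_reference_range(reference_row, start, end, gap_chars="-."):
    """Map 1-based, inclusive reference positions start..end to a
    (first_column, last_column_exclusive) slice of the alignment."""
    position = 0          # number of non-gap reference residues seen so far
    first = last = None
    for col, char in enumerate(reference_row):
        if char in gap_chars:
            continue      # gap columns do not advance the reference coordinate
        position += 1
        if position == start:
            first = col
        if position == end:
            last = col + 1
            break
    if first is None or last is None:
        raise ValueError("requested range lies outside the reference row")
    return first, last


# Toy alignment; the first row plays the role of the reference.
alignment = ["AC--GTA", "ACTTGTA", "A-T-GTA"]
first, last = columns_for_reference_range(alignment[0], 2, 3)
region = [row[first:last] for row in alignment]
```

Because gap columns in the reference are skipped when counting, the slice includes any insertions that fall between the two reference positions.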

Retrieving background data about the sequences

It is possible to download the output shown in the search interface as tab-delimited files, which allows you to tabulate background data for the retrieved set that will not show up in the sequence names. Examples of background information: patient information (code, health status, age, gender, risk factor, infection date, infection country and city), comments from the authors or HCV database staff, tissue type, and strain, isolate, and clone names.
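A tab-delimited download can be read with any standard tool. The Python sketch below parses such a file and groups accessions by genotype. The column headers and the accession values are assumptions made up for illustration; check the header line of your own download for the real field names.

```python
import csv
import io


def read_background_table(handle):
    """Parse a tab-delimited export into a list of dicts keyed by column header."""
    return [dict(row) for row in csv.DictReader(handle, delimiter="\t")]


# Hypothetical export: these headers and values are placeholders, not real records.
sample = (
    "Accession\tGenotype\tCountry\tSampling year\n"
    "XX000001\t1a\tUS\t1999\n"
    "XX000002\t3a\tIN\t2004\n"
    "XX000003\t1a\tFR\t2001\n"
)

rows = read_background_table(io.StringIO(sample))

# Group accessions by genotype.
by_genotype = {}
for row in rows:
    by_genotype.setdefault(row["Genotype"], []).append(row["Accession"])
```

For a real download, replace `io.StringIO(sample)` with `open(path, newline="")`.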

Analyzing sequences

On the website, you can find a number of tools that were developed in Los Alamos, many originally created for the HIV databases. All of these tools accept a variety of sequence input formats, and all have documentation and examples to help you use them. Some of the analysis programs we provide are:
HCV-BLAST runs a BLAST search on the HCV database.
TreeMaker generates simple neighbor-joining trees from an alignment.
Syn-Nonsyn analyzes synonymous and non-synonymous mutations in the data, and allows you to build trees based on either.
PCOORD provides principal component analysis, a method to study hard-to-see patterns in your sequence data.
Geography can be used to map or tabulate the geographical distribution of genotypes and subtypes.
N-Glycosite lets you analyze the pattern of N-linked glycosylation sites in your proteins.
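To make the last item concrete: potential N-linked glycosylation sites are conventionally recognized by the sequon N-X-S/T, where X is any amino acid except proline. The Python sketch below finds such sequons in an ungapped protein sequence. It illustrates the motif only, under the assumption of ungapped single-letter input, and is not the N-Glycosite implementation.

```python
import re

# N followed by any residue except P, then S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of the N in each N-X-[ST] sequon (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Toy peptide, not a real HCV sequence.
sites = find_sequons("MNGTNPSANNTS")
```

In the toy peptide, the N at position 5 is skipped because it is followed by proline.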

Using the toolset

Presently the following tools are available:

Consensus

Consensus creates one or more consensus sequences; a large number of parameters control how they are built.
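The basic idea behind a consensus can be sketched in a few lines of Python: take the most common character in each alignment column, and fall back to an ambiguity symbol when no character is frequent enough. The function name, the threshold parameter and its default are assumptions for illustration; the Consensus tool has its own options, which this sketch does not reproduce.

```python
from collections import Counter


def majority_consensus(aligned, threshold=0.5, ambiguous="N"):
    """Per-column majority consensus of equal-length aligned sequences.

    A column's most common character is used only if its frequency is
    strictly greater than `threshold`; otherwise `ambiguous` is emitted.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    consensus = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue if count / len(column) > threshold else ambiguous)
    return "".join(consensus)


# Toy alignment.
toy = ["ACGT", "ACGA", "AGGA", "ATCA"]
strict = majority_consensus(toy)                  # column 2 has no clear majority
relaxed = majority_consensus(toy, threshold=0.4)
```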

Gene Cutter

Gene Cutter is a tool that clips pre-defined coding regions from a nucleotide alignment, then codon aligns and provides translations of the cut regions.

SeqConvert

With SeqConvert you can convert between eight common formats. Formats supported are fasta, msf, gcg, gde, clustalw, ig, slx and table.

Gapstrip/squeeze

Strip out the gaps from your sequences, in preparation for making a tree or other analysis.
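One common way to strip gaps from an alignment is to drop whole columns, either every column that contains a gap or only those that consist entirely of gaps. The Python sketch below shows both variants. The function name, the `only_all_gap` switch and the set of gap characters are assumptions, and the web tool may behave differently.

```python
def strip_gap_columns(aligned, gap_chars="-.", only_all_gap=False):
    """Remove gap columns from equal-length aligned sequences.

    only_all_gap=False drops every column containing at least one gap;
    only_all_gap=True drops only columns made up entirely of gaps.
    """
    has_gap = all if only_all_gap else any
    kept = [col for col in zip(*aligned)
            if not has_gap(char in gap_chars for char in col)]
    if not kept:
        return ["" for _ in aligned]
    # Transpose the kept columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]


stripped = strip_gap_columns(["AC-GT", "A-CGA", "ATCGT"])
squeezed = strip_gap_columns(["A-C-", "A-GT"], only_all_gap=True)
```

Dropping whole columns keeps the remaining positions aligned across sequences, which matters if the result goes into a tree-building program.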

PeptGen

This tool lets you map overlapping peptides, with options to adjust length and overlap and to exclude user-selected amino acids from C- and N-terminal positions. Subtype consensus sequences are available.
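Generating overlapping peptides is essentially a sliding window over the protein. The Python sketch below assumes a fixed peptide length and a fixed overlap, and adds one final peptide if the regular steps would not reach the end of the protein. The function name and the default values are illustrative assumptions, and the terminal amino acid exclusion option of PeptGen is not modeled here.

```python
def overlapping_peptides(protein, length=15, overlap=11):
    """Return (1-based start, peptide) tuples tiling `protein`.

    Consecutive peptides share `overlap` residues; a final peptide ending
    at the last residue is added if the regular steps would miss the tail.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    if len(protein) <= length:
        return [(1, protein)] if protein else []
    step = length - overlap
    last_start = len(protein) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # cover the C-terminal tail
    return [(start + 1, protein[start:start + length]) for start in starts]


# Toy sequence (letters only, not a real protein).
even = overlapping_peptides("ABCDEFGHIJ", length=4, overlap=2)
tail = overlapping_peptides("ABCDEFGHI", length=4, overlap=1)
```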

QuickAlign

Automatically align your primer or sequence fragment to the complete genome alignment. The interface returns the coordinates (H77 numbering) and an alignment of the fragment to all sequences in the whole genome alignment.

Epilign

Automatically align your amino acid epitope against the alignments we have up on the web.

SeqPublish

Paste your alignment into the window and have it formatted for publication: identical columns are replaced by dashes, and the sequences are printed in blocks of user-determined length.
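This kind of formatting can be sketched in Python as follows. The sketch assumes that "identical" means identical to the first sequence, which is treated as the reference and printed in full; that interpretation, the function names and the label width are assumptions, not a description of how SeqPublish works internally.

```python
def mask_identical(aligned, same="-"):
    """Replace characters identical to the first (reference) sequence with `same`."""
    reference = aligned[0]
    masked = [reference]
    for seq in aligned[1:]:
        masked.append("".join(same if a == b else a
                              for a, b in zip(seq, reference)))
    return masked


def in_blocks(names, seqs, width=60, label_width=12):
    """Format named sequences in interleaved blocks of `width` columns."""
    lines = []
    for start in range(0, len(seqs[0]), width):
        for name, seq in zip(names, seqs):
            lines.append(f"{name:<{label_width}}{seq[start:start + width]}")
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip("\n")


masked = mask_identical(["ACGT", "ACGA", "TCGT"])
text = in_blocks(["ref", "s1", "s2"], masked, width=2)
```

Note that if the input already uses dashes for gaps, a different symbol should be passed as `same` to keep gaps and identities distinguishable.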

Sequence Locator Tool

A quick way to find the position of your nucleotide or protein sequence in HCV relative to H77, or to do the reverse: type in H77 coordinates for an amino acid sequence, and the program retrieves the corresponding sequence (to do the same with nucleotides, use the search interface).

Downloading ready-made alignments

We provide a number of ready-made nucleotide and amino acid alignments that can be used as reference sets for new alignments, analyses, trees, subtyping, etc. These alignments are trimmed to contain only one sequence per patient or epidemiologically related set, and all sequences in them are genotyped and annotated. There are two types of alignments: the complete gene alignments, which contain all sequences that meet these criteria, and the genotype reference set, which contains a small number of sequences for each genotype and subtype.




Questions or comments? Contact us at hcv-info@lanl.gov