HCV Database
HCV sequence database
 



To our users

Please note that the HCV database site is no longer funded. We try to keep the database updated and the tools running, but unfortunately, we cannot guarantee we can provide help for using this site. Data won't be manually curated either.


More on the data in this database

Some of the information you can find in the database was provided by the authors or the people who generated the sequence, but a lot of it was manually annotated by the database staff. We use some general rules in annotating. We try not to enter information unless we feel quite certain that it is correct, but we try to be as complete as possible. We check some of the information the authors provide, but not all of it; thus, the genotypes and subtypes you can find in the database are a mixture of annotations by the database and the authors. Some of it is based on information we request from the authors, so not all of it is necessarily traceable to publications. If we change an author's designation, this is usually mentioned in the comments.

When sequences appear very suspect (example: in a study with two patients, one sequence attributed to one patient clusters very closely with all the sequences from the other patient), we put the label 'contaminant' in the name. This does NOT mean the sequence is necessarily a contaminant; it can also be a sample mix-up or a sequence that behaves oddly. This label is used only to warn users that something is strange about this sequence; usually the comment explains the reason for the label.

We store some patient data (though never data that could lead to identification) that can be important for epidemiological studies, such as risk factor, country and date of infection (if available). We also use patient records without further information to link together all data that are known to be from one patient. Please note that if there is no patient record, there could still be multiple sequences from a patient, because the data are not completely annotated!

The annotation of the year and place of infection is a difficult issue, because the data are often suggestive but not certain. If a sequence is genotype 4a and the patient is an Egyptian who has lived in France for 5 years, can we assume the patient was infected in Egypt? If a patient can only recall receiving a blood transfusion in 1988, do we accept that as the year of infection? We try to make these judgments carefully, and err on the side of caution.

Finally, to identify patients we use a *patient ID*, a number that is automatically generated by the database. There is also the *patient code*, which is a name we try to make descriptive. Unfortunately, the patient code is often "patient 1" if that is how the patient is indicated in the publication. There are many "patient 1"s! To distinguish them, they all have a patient ID, which is shown in parentheses after the patient code.
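As a small illustration of how the patient ID disambiguates identical patient codes, the display convention described above can be sketched as follows. The function name and the ID values are hypothetical; only the "code followed by ID in parentheses" layout comes from the text.

```python
def patient_label(patient_code: str, patient_id: int) -> str:
    """Format a patient for display: descriptive code, then the database ID in ()."""
    return f"{patient_code} ({patient_id})"


# Two different patients that were both called "patient 1" in their publications.
# The IDs below are made up for this example.
labels = [patient_label("patient 1", 1021), patient_label("patient 1", 3344)]
print(labels)
```

Because the ID is generated by the database, the combined label stays unique even when the code taken from the publication does not.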

When annotating the data, we try to add as much information as possible. This means, for example, that if a sampling city isn't available, we may enter the sampling region, such as 'Scotland' or 'Sichuan'. Some fields are therefore descriptive rather than selection fields. This also goes for a few other fields like patient health, and sometimes for genotype; for example, chimeras between genotypes 1a and 2b are shown as '1a/2b'. The same annotation style is used for recombinants. Dates are never entered with less than year precision, so if someone was infected "in 1988 or 1989", this will only be entered in the comments field, not in the date field.

The genotype/subtype classification of HCV is unfortunately quite fragmented and unsystematic. We provide a table that shows the genotypes and subtypes we have in the database. In the future this table will be automatically updated. Because of the annotation lag, for the time being we cannot guarantee that genotypes/subtypes that are not in this table are not defined. We have decided to follow the classification of Simmonds et al. (1996) as this seems to be the least controversial; we realize that it is not uncontroversial. Following the lead of the HIV database, when a sequence is known to belong to different genotypes in two regions, both regions will be designated as recombinant, even though they may each belong to only one genotype.
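The recombinant rule can be sketched roughly as below. This is not the database's own code: the function name, the region names, and the choice to sort the genotypes before joining them with '/' are assumptions made for illustration. The text only establishes that every region of such a sequence carries the recombinant designation.

```python
from typing import Dict


def designate_regions(region_genotypes: Dict[str, str]) -> Dict[str, str]:
    """Apply the 'both regions are recombinant' rule to per-region genotypes.

    If all regions share one genotype, the input is returned unchanged.
    Otherwise every region receives the combined label, e.g. '1a/2b'.
    """
    distinct = sorted(set(region_genotypes.values()))  # ordering is an assumption
    if len(distinct) <= 1:
        return dict(region_genotypes)
    label = "/".join(distinct)
    return {region: label for region in region_genotypes}


# Hypothetical example: core types as 1a, NS5B types as 2b.
print(designate_regions({"core": "1a", "NS5B": "2b"}))
```

A practical consequence is that a search for genotype 1a in the core region alone would not match this sequence on an exact label, since the region is labelled with the combined designation.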

The database staff attempts to assign genotypes to as many sequences as possible. If genotypes are missing, this can be because the sequence hasn't been annotated yet, because it is a very short sequence (e.g. HVR-1; we try to prioritize longer sequences over shorter ones), or because its classification is not clear. For example, genotypes 1 and 6 can be indistinguishable in the UTRs, so many UTR sequences will not be genotyped. There are cases where authors confidently assign a genotype while we feel less certain; in most cases, we will follow the authors' classification, unless we are quite sure it is wrong.

Please note that 'drug naive' means drug naive at the time of sampling. The patient may subsequently have received therapy.

If patients have received a liver transplant during or close to the sampling, this is noted in the 'Patient health' field.

More on retrieving sequences

When the sequences are uploaded into the database, they are internally aligned against a 'model sequence' that represents all sequences that are already present in the database. For this alignment we use the HMMER program, written by Sean Eddy. The start and end coordinates of each sequence relative to the model sequence, as well as the location of all the gaps, are stored in the database. When you request all sequences encompassing the core gene, for example, the coordinates for the core gene in the model sequence are retrieved, and all sequences with a lower (or equal) start point and a higher (or equal) stop point are retrieved. When the sequences are downloaded, the gaps relative to the model sequence are inserted.

For the small image that shows the location of the sequence relative to the genome, a slightly different set of coordinates is used, relative to H77 instead of the model sequence. These coordinates are produced by an algorithm, and are identical to the coordinates that the Sequence Locator tool produces. Please note that the location of some sequences cannot be accurately determined, often because they are too short or because they are located in a region where H77 is undefined (such as D85026). The exact method used for creating the internal alignments and retrieving the regions has been described here.
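The region lookup described above amounts to an interval-containment test on the stored coordinates. A minimal sketch follows, assuming inclusive coordinates on the model sequence; the class, the function, the accession names, and all coordinate values are hypothetical and are not taken from the database itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoredSequence:
    accession: str
    start: int  # first model-sequence position covered (inclusive)
    end: int    # last model-sequence position covered (inclusive)


def sequences_encompassing(
    sequences: List[StoredSequence], region_start: int, region_end: int
) -> List[StoredSequence]:
    """Return sequences whose start <= region start and whose end >= region end."""
    return [
        s for s in sequences
        if s.start <= region_start and s.end >= region_end
    ]


# Made-up region coordinates on the model sequence, for illustration only.
REGION = (342, 914)

stored = [
    StoredSequence("SEQ_A", 1, 9600),    # covers the whole region
    StoredSequence("SEQ_B", 342, 914),   # exactly the region (equal bounds count)
    StoredSequence("SEQ_C", 400, 1500),  # starts too late
    StoredSequence("SEQ_D", 300, 900),   # ends too early
]

hits = sequences_encompassing(stored, *REGION)
print([s.accession for s in hits])  # ['SEQ_A', 'SEQ_B']
```

Note that a sequence overlapping only part of the requested region is excluded under this rule, which may explain why a short fragment does not appear in a region search even though it lies within the gene.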


Questions or comments? Contact us at seq-info@lanl.gov