HCV Database
HCV sequence database

To our users: please note that the HCV database site is no longer funded. We try to keep the database updated and the tools running, but unfortunately we cannot guarantee help with using this site, and the data will no longer be manually curated.

On the definition of "clusters" and "related sequences" in the HCV database

The "exclude related" function selects only the first sequence (usually the one with the lowest accession number) from each cluster of epidemiologically related sequences from one genomic region, and discards the rest. This lets you restrict an alignment to sequences that are not (too) closely related, which can save a lot of work if the goal is a general overview of variability or phylogeny. "Epidemiologically related" means multiple sequences from one patient, or sequences from known or very likely transmission clusters. The clusters have been defined based on what is described in the literature.

To get an overview of which clusters have been defined in the database, select "Clusters" in the "Other fields" pulldown menu and type an underscore character (_) in the search field. This will list all defined clusters in the database, and the patients and sequences they are defined on. To select a single cluster (for example the cluster associated with patient "Recip 1", a recipient of an infected blood transfusion), type "Recip 1" in the Patient Code field, select "Cluster name" in the "Other fields" pulldown menu and type an asterisk (*) in the search field; this will list all clusters this patient is part of. You can then copy the cluster name and search on it to find all other members of the cluster.

The "exclude related" function only works when the search has been restricted to a specific genomic region. This is because it makes little sense to discard (for example) the E1 sequences from all cluster members when there is a related sequence from another region. Even if all sequences in a retrieval span only one region, you have to use the genomic region field in the search; otherwise the search interface won't "know" that all sequences are from that region.
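The selection rule described above can be illustrated with a minimal Python sketch. This is not the database's actual implementation; the Entry type, the exclude_related function, and the accession numbers are hypothetical and exist only for illustration. The sketch assumes that "first" always means the lowest accession number when sorted as text (the database says this is only usually the case), and that sequences without a cluster annotation are kept unchanged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one search result."""
    accession: str          # accession number (made-up values below)
    region: str             # genomic region, e.g. "E1"
    cluster: Optional[str]  # annotated cluster name, or None if unannotated


def exclude_related(entries: List[Entry], region: str) -> List[Entry]:
    """Keep one entry per annotated cluster within a single genomic region."""
    if not region:
        # Mirrors the requirement that a genomic region must be specified.
        raise ValueError("a genomic region is required")
    in_region = [e for e in entries if e.region == region]
    kept = []
    seen_clusters = set()
    # Assumption: "first" means lowest accession, compared as plain strings.
    for entry in sorted(in_region, key=lambda e: e.accession):
        if entry.cluster is None:
            kept.append(entry)  # unannotated sequences are never excluded
        elif entry.cluster not in seen_clusters:
            seen_clusters.add(entry.cluster)
            kept.append(entry)  # first member of this cluster
    return kept


entries = [
    Entry("AB000003", "E1", "Anti-D Ireland"),
    Entry("AB000001", "E1", "Anti-D Ireland"),
    Entry("AB000002", "E1", None),
    Entry("AB000004", "NS5B", "Anti-D Ireland"),
    Entry("AB000005", "E1", None),
]
kept_accessions = [e.accession for e in exclude_related(entries, "E1")]
print(kept_accessions)
```

In this made-up data set, only one of the two E1 sequences from the "Anti-D Ireland" cluster survives, while the two unannotated E1 sequences are both kept, whether or not they are actually related.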

The composition of the clusters is usually straightforward, but not in all cases. Sometimes there are "subclusters", where patients a, b, and c were infected by donor X, and patient a then infected patient d. In this case, we have defined two clusters, one with (see cluster "Anti-D Ireland plus secondary recipients") and one without ("Anti-D Ireland") the secondary recipients. (One could argue that patients a and d also form a subcluster, but we felt that few people would be interested in this subcluster.) Another complication is dual infection or superinfection, especially with two different genotypes (see for example the clusters associated with PubMed ID 11170062). In these cases we have split each dually infected patient into two 'quasi-patients', usually with a suffix indicating the difference between the two (for example, "Donor B-2a" and "Donor B-1b"). Donor B-2a is part of the cluster "Blood donors North China, genotype 2a", while Donor B-1b is part of "Blood donors North China, genotype 1b-1".

FINAL NOTE: this function depends on manual annotation. If a cluster is not (yet) annotated, the related sequences in it will not be automatically excluded! Please make sure to check your data after download and discard any remaining related sequences. If you find a big unannotated cluster and the information on which sequences are related is available, please let us know and we will add the annotation.
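As a rough aid for the post-download check, the following Python sketch flags patients that contribute more than one sequence to a downloaded set. It assumes you have extracted (accession, patient code) pairs from your download; the function name find_repeated_patients and the sample values are hypothetical. It only catches multiple sequences from one patient. Unannotated transmission clusters across different patients still have to be identified from the literature.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def find_repeated_patients(
    rows: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Return {patient code: sorted accessions} for patients with >1 sequence."""
    by_patient = defaultdict(list)
    for accession, patient in rows:
        if patient:  # skip rows with no patient code; nothing to group on
            by_patient[patient].append(accession)
    return {p: sorted(accs) for p, accs in by_patient.items() if len(accs) > 1}


# Made-up accession numbers, for illustration only.
rows = [
    ("AB000010", "Recip 1"),
    ("AB000011", "Recip 1"),
    ("AB000012", "Donor B-2a"),
    ("AB000013", None),
]
repeated = find_repeated_patients(rows)
print(repeated)
```

Any patient listed in the result is a candidate for manual review: keep one sequence and discard the others if they are indeed related.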

Questions or comments? Contact us at hcv-info@lanl.gov