HCV Database
 



To our users: Please note that the HCV database site is no longer funded. We try to keep the database updated and the tools running, but unfortunately we cannot guarantee help with using this site. Data will not be manually curated either.


ElimDupes Explanation

Introduction

There are various ways of defining "duplicateness" in two sequences.

1. The strongest definition would be the case in which two sequences match exactly as in:

     ACCCTGATTAGC   seq1
     ACCCTGATTAGC   seq2

2. Slightly less strong than perfect match is the situation in which the sequences match in all respects except the case of the letters:

     ACCCTGATTAGC   seq1
     aCCCtGATTaGC   seq2

3. A third consideration is the case of gaps and other non-letter or "extraneous" characters. With gaps removed, the two sequences below are duplicates.

     ACCCTGATTAGC       seq1
     ACCCT----GATTAGC   seq2

4. Fourth, there is the case of one sequence that matches part of another:

     ACCCTGATTAGC   seq1
         TGAT       seq2

5. A final consideration is the similarity of sequences. In the example below, 8 of the 10 bases of seq2 are duplicated in seq1. Thus, the two sequences are said to be 80% similar.

     ACCCTGATTA   seq1
     ACCCGTATTA   seq2

Input

The tool accepts a single input file of sequences, which can be in any of the Common Sequence Formats. It is best to submit aligned sequences, but the tool can align the sequences if needed. To have the tool do the alignment, uncheck the checkbox below the input form. The tool runs much more slowly when it has to perform the alignment.

Option Summary
Option Details

Remove extraneous characters from sequences

'No' (default) means that gaps and other non-letter characters will not be removed and thus will be included in the comparisons. In this case, the following two sequences will not be considered duplicates:

     ACCCTGATTAGC       seq1
     ACCCT----GATTAGC   seq2

If this option is changed to 'Yes', the gaps will be removed from seq2 and the two sequences will be treated as duplicates.
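
The effect of this option can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function name `strip_extraneous` is hypothetical, and it assumes that "extraneous" means any character that is not an ASCII letter.

```python
import re

def strip_extraneous(seq):
    """Hypothetical helper: drop every character that is not an ASCII letter.

    Assumption: gaps ('-'), digits, spaces and punctuation all count as
    "extraneous"; the real tool may use a different character set.
    """
    return re.sub(r"[^A-Za-z]", "", seq)

seq1 = "ACCCTGATTAGC"
seq2 = "ACCCT----GATTAGC"

# With the option set to 'No', the raw strings are compared and differ.
print(seq1 == seq2)                                      # False
# With the option set to 'Yes', gaps are removed before comparing.
print(strip_extraneous(seq1) == strip_extraneous(seq2))  # True
```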


Make all letters uppercase

'Yes' (default) converts all characters to upper case. With this setting the following two sequences will be treated as duplicates:

     ACCCTGATTAGC   seq1
     aCCCtGATTaGC   seq2

If this option is changed to 'No' the above two sequences will not be considered duplicates.
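
As a rough sketch of how exact duplicates could be found with and without case conversion, consider the Python below. The function `find_duplicates` and its `uppercase` parameter are hypothetical names chosen for illustration, and the sketch assumes that the first occurrence of a sequence is the one kept as unique.

```python
def find_duplicates(records, uppercase=True):
    """Map the name of each unique sequence to the names of its duplicates.

    records: list of (name, sequence) pairs.
    Assumption: the first occurrence is kept as the unique sequence.
    """
    first_seen = {}   # comparison key -> name of the first sequence with that key
    duplicates = {}   # name of unique sequence -> list of duplicate names
    for name, seq in records:
        key = seq.upper() if uppercase else seq
        if key in first_seen:
            duplicates[first_seen[key]].append(name)
        else:
            first_seen[key] = name
            duplicates[name] = []
    return duplicates

records = [("seq1", "ACCCTGATTAGC"), ("seq2", "aCCCtGATTaGC")]
print(find_duplicates(records))                   # {'seq1': ['seq2']}
print(find_duplicates(records, uppercase=False))  # {'seq1': [], 'seq2': []}
```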


Consider subsequences as duplicates

'Yes' (default) means that a shorter sequence that is contained within a longer sequence will be considered a duplicate. For example, consider the two sequences:

     ACCCTGATTAGC       seq1
     ACCCT-------       seq2

If gaps are removed (Remove extraneous characters set to 'Yes') then the sequences become:

     ACCCTGATTAGC       seq1
     ACCCT              seq2

If Consider subsequences as duplicates = 'Yes', then seq2 will be considered a duplicate of seq1, otherwise not.
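
The containment test can be illustrated with a minimal Python sketch. The name `is_contained` is hypothetical, and the sketch assumes that "contained within" means a contiguous match, as in the examples on this page; the tool itself may apply further rules.

```python
def is_contained(shorter, longer):
    """Return True if `shorter` occurs as a contiguous run inside `longer`.

    Assumption: containment is a plain substring match on the processed
    sequences (after any gap removal or case conversion).
    """
    return shorter in longer

seq1 = "ACCCTGATTAGC"
seq2 = "ACCCT-------".replace("-", "")   # gaps removed, leaving "ACCCT"

print(is_contained(seq2, seq1))    # True: seq2 is a duplicate of seq1
print(is_contained("TGAT", seq1))  # True: a match in the middle also counts
print(is_contained(seq1, seq2))    # False: the longer one is not contained
```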


Restore original sequences in output

'Yes' (default) means the downloadable result file will contain the original, unchanged sequences rather than the forms altered by tool options such as changing case or stripping gaps.


Eliminate sequences more similar than...

In the example below, 8 of the 10 bases of seq2 are duplicated in seq1. Thus, the two sequences are said to be 80% similar. If this option is set to 79% or less, these two sequences will be treated as duplicates. If the option is set to 80% or higher, these sequences will not be considered duplicates.

     ACCCTGATTA   seq1
     ACCCGTATTA   seq2
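
The arithmetic behind this example can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that similarity is the percentage of aligned positions with identical characters in two sequences of equal length; the tool's exact calculation may differ. The strict "greater than" comparison follows from the description above, where a setting of 80% does not eliminate an 80% similar pair.

```python
def percent_similarity(a, b):
    """Percentage of aligned positions at which a and b are identical.

    Assumption: both sequences are already aligned and of equal length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def more_similar_than(a, b, threshold_percent):
    """True if the pair would be treated as duplicates at this setting."""
    return percent_similarity(a, b) > threshold_percent

seq1 = "ACCCTGATTA"
seq2 = "ACCCGTATTA"

print(percent_similarity(seq1, seq2))     # 80.0
print(more_similar_than(seq1, seq2, 79))  # True: treated as duplicates
print(more_similar_than(seq1, seq2, 80))  # False: not duplicates
```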

Analyze input by groups

This option performs the analysis and produces files of unique sequences for each group. A "group" is defined by the first N characters of the sequence name. For example, if your sequence set is based on samples taken at specific points in time for a given patient, then your labels might be something like:

>Week04.seq1
>Week04.seq2
>Week16.seq1
>Week16.seq2

If you enter 6 in the "Analyze input by groups" box, ElimDupes will group the sequences by the first 6 characters of their names and treat them as distinct groups.

Note that if you choose to create a file of unique sequences with _count added... the resulting file will contain the unique sequences for all groups, with a blank line between groups. This allows you to easily cut and paste the entire results, or just the results for a given group.
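
Grouping by the leading characters of the name can be sketched as follows. The function `group_by_prefix` is a hypothetical name, and the sketch assumes that the ">" marker of the header line is not counted as part of the name.

```python
def group_by_prefix(names, n):
    """Group sequence names by their first n characters.

    Assumption: names are given without the leading '>' of the header line.
    Groups are returned in order of first appearance.
    """
    groups = {}
    for name in names:
        groups.setdefault(name[:n], []).append(name)
    return groups

names = ["Week04.seq1", "Week04.seq2", "Week16.seq1", "Week16.seq2"]
for prefix, members in group_by_prefix(names, 6).items():
    print(prefix, members)
# Week04 ['Week04.seq1', 'Week04.seq2']
# Week16 ['Week16.seq1', 'Week16.seq2']
```

Each group would then be analyzed for duplicates on its own, so identical sequences in different groups are not merged.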


Create File of unique sequence with _count added...

This option, "Create a file of unique sequences with _count added to (or updated in) the sequence name," will create an additional file of unique sequences where the number of occurrences (count) of a given sequence is appended to the sequence name. The rank of the count (the sequence with the highest count has a rank of 1) is optionally added in ".rank_count" format.

Note that if "Analyze input by groups" is selected, the counts (and ranks, if chosen) will be reset at the beginning of each group. The output for all groups will be combined in a single file with a blank line between groups.

This option is helpful for handling deep sequencing data, reducing the sequences to unique forms with their counts and ranks. Sometimes these files need to be trimmed after alignment; trimming the ends can create more repeats, so the file can be reduced further. For example:

>seq1
GTGGATCCGTAAAGA
>seq2
GTGGATCCGTAAAAA
>seq3
GTGGATCCGTAAAAA
>seq4
TTGGATCCGTAAAAA

Gives:

>seq2.1_2
GTGGATCCGTAAAAA
>seq1.2_1
GTGGATCCGTAAAGA
>seq4.2_1
TTGGATCCGTAAAAA

If the user then trims the last 2 bases and resubmits the alignment above (with its .rank_count names), the counts in the names are updated and the result is:

>seq1.1_3
GTGGATCCGTAAA
>seq4.2_1
TTGGATCCGTAAA
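
The naming scheme of the first example above (four input sequences, three unique) can be reproduced with the Python sketch below. It is an illustration under stated assumptions, not the tool's implementation: `rank_count_names` is a hypothetical function, the first occurrence supplies the name, tied counts share a rank, and output is ordered by descending count. It does not handle the second case, where counts already present in the names are updated.

```python
def rank_count_names(records):
    """Collapse identical sequences and append '.rank_count' to the kept name.

    records: list of (name, sequence) pairs.
    Assumptions: the first occurrence supplies the name; sequences with
    equal counts share a rank; output is sorted by descending count,
    with ties left in input order.
    """
    groups = {}  # sequence -> [name of first occurrence, count]
    for name, seq in records:
        if seq in groups:
            groups[seq][1] += 1
        else:
            groups[seq] = [name, 1]

    # sorted() is stable, so tied counts keep their input order.
    ordered = sorted(groups.items(), key=lambda item: -item[1][1])

    result = []
    for seq, (name, count) in ordered:
        # Rank is 1 plus the number of unique sequences with a higher count.
        rank = 1 + sum(1 for _, other in groups.values() if other > count)
        result.append((f"{name}.{rank}_{count}", seq))
    return result

records = [
    ("seq1", "GTGGATCCGTAAAGA"),
    ("seq2", "GTGGATCCGTAAAAA"),
    ("seq3", "GTGGATCCGTAAAAA"),
    ("seq4", "TTGGATCCGTAAAAA"),
]
for name, seq in rank_count_names(records):
    print(">" + name)
    print(seq)
```

Run on the four input sequences, this prints the names seq2.1_2, seq1.2_1 and seq4.2_1, matching the first output shown above.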

Output

1. First, ElimDupes displays the option settings for this run:

Options used:
Remove extraneous characters from sequences: true
Make all letters uppercase: true
Consider subsequences as duplicates: true
Use original sequences in output: true
Create a file of unique sequences with _count: true
Add rank to unique sequences with count (.rank_count format): true

2. Next, ElimDupes displays links to view and download the file with _count (and optional rank), if selected:

Unique sequences with rank and count appended (.rank_count):      View    Download

3. Next, ElimDupes displays the analysis. Note that if "Analyze input by groups" is selected, this section will repeat for each group.

Unique sequences file:                     View    Download

Duplicate (eliminated) sequences file:     View    Download

Tab-delimited summary table below:                 Download

---------------------------------------------------------------------------------
Unique             Number of   Duplicate
sequences         duplicates   sequences
---------------------------------------------------------------------------------
A3_seq1                    2   A3_seq2, A3_seq4
A3_seq3
A1_seq1                    3   A1_seq2, A1_seq3, A1_seq4
---------------------------------------------------------------------------------
Total unique seqs = 3
Total duplicate seqs = 5





Questions or comments? Contact us at hcv-info@lanl.gov