Ambiguity Consensus Explanation
Introduction
Ambiguity Consensus Maker takes an input file of
aligned nucleotide sequences and
calculates a consensus sequence made up of the IUPAC ambiguity codes
for each column in the alignment. The program provides the consensus alone, or the consensus appended to the original
alignment.
A good way to understand the options available is
to click the Sample Input button at the top of the submission
page. This loads a simple, hypothetical alignment in Table format. You can use this file to test the results of various options. Each column of the Sample Input has been chosen to
illustrate one of the IUPAC ambiguity codes. "Sequence11" of the sample input was chosen to illustrate that the tool ignores non-ACGTU characters.
Input options
- Input requirements:
- Consensus Maker
recognizes most standard alignment formats. If the
program fails to decipher your format, try resubmitting
the alignment in Fasta or Table format.
Please see our Format Converter tool.
- Ambiguity Consensus Maker accepts only nucleotide sequences.
- Do not mix "T" and "U" in the same alignment.
- If your alignment contains sequences of varying
length, Consensus Maker will equalize the lengths of
sequences by adding spaces to the ends of short
sequences.
- If the input alignment contains blocks of sequences
(e.g., HIV sequences grouped by subtype) then the program can
calculate a consensus for each sequence block; it does not create a
consensus of the consensuses. The program recognizes sequence blocks
by how the component sequences are named (see details below).
- If your input contains tilde (~) characters, they will be converted to dashes (-).
- All non-ACGTU characters will be ignored in calculating the consensus. This includes the gap character. For example, if one column in an alignment of 100 sequences contains 75 gaps, 20 G, and 5 A, this column would be considered to contain 20% A (not 5%). Any IUPAC codes in the input sequences will be similarly ignored.
- Squeeze gaps. If this option is selected, any
columns of your alignment that consist entirely of gaps
are removed before the consensus is calculated.
Default = no.
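The input handling described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names normalize_alignment and base_percentages are hypothetical, and treating the padding space as a gap when squeezing is an assumption.

```python
from collections import Counter

# Assumption: both the dash and the padding space count as gaps when squeezing.
GAP_CHARS = "- "

def normalize_alignment(sequences, squeeze_gaps=False):
    """Convert tildes to dashes, pad short sequences, optionally drop all-gap columns."""
    seqs = [s.replace("~", "-") for s in sequences]
    width = max((len(s) for s in seqs), default=0)
    seqs = [s.ljust(width) for s in seqs]  # pad the ends of short sequences with spaces
    if squeeze_gaps and seqs:
        keep = [i for i in range(width)
                if any(s[i] not in GAP_CHARS for s in seqs)]
        seqs = ["".join(s[i] for i in keep) for s in seqs]
    return seqs

def base_percentages(column):
    """Percentage of each base in a column, ignoring every non-ACGTU character."""
    counts = Counter(ch for ch in column.upper() if ch in "ACGTU")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: 100.0 * n / total for base, n in counts.items()}
```

For the column described above (75 gaps, 20 G, and 5 A), base_percentages reports A at 20% and G at 80%, because the gaps are left out of the total.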
Consensus-by-Block Options
- Make consensus for
each block. If the input contains blocks of sequences,
this option allows you to calculate a consensus for each block, not just a
single consensus for the alignment as a whole. Default =
false. If false, only a single consensus is computed for
the entire alignment. If true, then you must ensure that
the names of the sequences in your alignment follow a
conventional format that can be read by the program.
Sequences must have names like "A.US.57866". The program
reads the letter(s) before the first dot ("A") and uses
them to define an "A" group of sequences. Another group of
sequences is defined in the alignment when that
prefix changes, e.g., B.FR.98332. See example below.
This is the naming convention followed by the HIV
database; thus alignments downloaded from the database will already be in this format. The output will have a CONSENSUS_A and a
CONSENSUS_B. If more than one character is present before
the first dot, those characters will become the block
name; e.g., CRF01AE.X34577 will define a CRF01AE
consensus block. If you don't want to calculate multiple
consensuses, then your sequences can be named in any manner.
- Show number of sequences in consensus.
If consensuses are computed for
each block, this option will show how many
sequences comprised each block. The number will be shown
following each consensus name, e.g., CON_A(23). Default = false.
- Minimum number of sequences per block for
consensus. If a block contains fewer than "n"
sequences, then don't calculate a consensus for that
block. Default = 3. This number only applies if you are making consensuses for blocks
within the alignment.
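The block options above can be illustrated with a short Python sketch. It is not the program's own code: the function names are hypothetical, the label prefix "CONSENSUS_" is taken from the text above, and grouping sequences by prefix even when they are not adjacent in the file is an assumption.

```python
def block_name(seq_name):
    """Return the text before the first dot, e.g. 'A.US.57866' -> 'A'."""
    return seq_name.split(".", 1)[0]

def consensus_blocks(names, min_seqs=3, show_counts=False):
    """Group sequence names by block and label the blocks that get a consensus."""
    blocks = {}
    for name in names:
        blocks.setdefault(block_name(name), []).append(name)

    labelled = {}
    for block, members in blocks.items():
        if len(members) < min_seqs:
            continue  # too few sequences: no consensus for this block
        label = f"CONSENSUS_{block}"
        if show_counts:
            label += f"({len(members)})"  # sequence count follows the name
        labelled[label] = members
    return labelled
```

Each returned label maps to the sequences whose columns would be passed to the consensus calculation for that block.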
Consensus Calculation Options
- Characters to count when making consensus. The
program considers "ACGTU" when making a consensus.
- Character presence
percentage.
If a column of an alignment contains 99
"A" and 1 "G", would you want to give this a consensus of
"A" or "R" (where R is the IUPAC code for
purines A or G)? In other words, if a
character is present below a certain "presence
percentage" threshold, should it be ignored when making
the consensus? You can set this presence percentage
threshold in the box provided. The default is "0", which
means every occurrence of A, C, G, T, or U counts. For example, if you
set the value to 2%, then the G in the above
example would be ignored and the consensus would be
"A".
The character presence percentage has an upper limit. A value over 50% never makes logical sense: in a column split evenly between two bases, both would fall below the threshold and be ignored. The program determines the largest logical value for your alignment, and this value is always somewhere between 25% and 50%. If you set the percentage higher than this upper limit, you will receive an error message. For a typical nucleotide alignment, it rarely makes sense to set the value higher than 25%, because a column split evenly among four bases has each base at exactly 25%.
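The calculation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's implementation: the function names are hypothetical, a base is assumed to be ignored only when its percentage is strictly below the threshold, and a column with no countable bases is assumed to yield "-". The code table itself is the standard IUPAC nucleotide code set.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def column_consensus(column, threshold_pct=0.0):
    """IUPAC code for one alignment column; only A, C, G, T, U are counted."""
    counts = Counter(ch for ch in column.upper() if ch in "ACGTU")
    total = sum(counts.values())
    if total == 0:
        return "-"  # assumption: nothing countable in this column
    # Assumption: a base is dropped only if its share is strictly below the threshold.
    kept = {b for b, n in counts.items() if 100.0 * n / total >= threshold_pct}
    if not kept:
        return "-"  # assumption: threshold excluded every base
    code = IUPAC[frozenset("T" if b == "U" else b for b in kept)]
    # Report U rather than T for an RNA alignment with an unambiguous column.
    return "U" if code == "T" and "U" in counts else code

def ambiguity_consensus(sequences, threshold_pct=0.0):
    """Consensus string for equal-length aligned sequences."""
    return "".join(column_consensus("".join(col), threshold_pct)
                   for col in zip(*sequences))
```

With the column from the example above (99 "A" and 1 "G"), column_consensus returns "R" at the default threshold of 0 and "A" when the threshold is set to 2.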
Output Options
- Consensus + alignment.
Results will show the consensus placed at the top of the
user's alignment. Default = true. When false, the output
consists of the consensus alone.
- Show number of sequences.
If consensuses are to be computed for
each block in the alignment, this option will show how many
sequences occurred in each block. The number will be shown
following each consensus name, e.g., CON_A(23). The default
is to not show numbers.
Examples
Example of using names to identify
alignment blocks:
In the table-formatted file below there are two blocks, an "A1"
block and a "B" block, recognizable by the "A1." and "B." (note the
dot) with which the names begin. Each block contains three sequences, so two consensuses will be calculated
for this alignment if "Do consensus for each block" is true and "Min.
no. seqs. for consensus" is 3 or lower.
A1.FR.83.IIIB_A04321 aaactatcgtagctagctagctgatcgatgctagctgatcg.... etc
A1.FR.83.IIIC_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
A1.DE.96.POIURR_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
B.FR.82.LAI_K03455 aaactatcgtagctagctttctgatcgatgctagctgatcg.... etc
B._._.N833_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc
B.US.99.JK77_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc
Example of "pretty print" output:
CON gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA
1a.-.COLONEL_AF290978 ---------- --TTGGGGGC GACACTCCAC CATGAATCAC CCCCCTGTGA
1a.-.H77_AF009606 GCCAGCCCCC TGATGGGGGC GACACTCCAC CATGAATCAC TCCCCTGTGA
1a.-.HEC278830_AJ278830 GCCAGCCCCC TGATGGGGGC GACGCTCCAC CATGAATCAC TCCCCTGTGA
CON GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG
1a.-.COLONEL_AF290978 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG
1a.-.H77_AF009606 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG
1a.-.HEC278830_AJ278830 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCGTGGCG TTAGTATGAG
CON TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.COLONEL_AF290978 TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.H77_AF009606 TGTCGTGCAG CCTTCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.HEC278830_AJ278830 TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
Example of "output aligned" output:
CON gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA
1a.-.COLONEL_AF290978 .......... ..T------- ---------- ---------- C---------
1a.-.H77_AF009606 ---------- ---------- ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---G------ ---------- ----------
CON GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG
1a.-.COLONEL_AF290978 ---------- ---------- ---------- ---------- ----------
1a.-.H77_AF009606 ---------- ---------- ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ----G----- ----------
CON TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.COLONEL_AF290978 ---------- ---------- ---------- ---------- ----------
1a.-.H77_AF009606 ---------- ---T------ ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ---------- ----------
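The masking rule in the "output aligned" example can be sketched in Python. The rule is inferred from the example rows only, so treat it as an assumption: a gap in the sequence is shown as ".", a character that matches the consensus (ignoring case) is shown as "-", and any other character is shown as-is. The function name mask_against_consensus is hypothetical.

```python
def mask_against_consensus(consensus, sequence):
    """Render one sequence row relative to the consensus row above it."""
    out = []
    for con, ch in zip(consensus, sequence):
        if ch == " ":
            out.append(" ")   # keep the spaces that separate 10-column groups
        elif ch == "-":
            out.append(".")   # gap in the sequence
        elif ch.upper() == con.upper():
            out.append("-")   # same as the consensus, ignoring case
        else:
            out.append(ch)    # difference from the consensus
    return "".join(out)

con = "gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA"
row = "---------- --TTGGGGGC GACACTCCAC CATGAATCAC CCCCCTGTGA"
print(mask_against_consensus(con, row))
```

The two strings used here are the CON row and the 1a.-.COLONEL_AF290978 row from the first block of the "pretty print" example.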
Example of formatted output (nexus):
#NEXUS
begin taxa;
dimensions ntax=4;
taxlabels
CON
1a._.COLONEL_AF290978
1a._.H77_AF009606
1a._.HEC278830_AJ278830
;
end;
begin characters;
dimensions nchar=150;
format interleave datatype=dna;
matrix
CON gccagccccctgaTGGGGGCGACaCTCCACCATGAATCACtCCCCTGTGA
1a._.COLONEL_AF290978 ------------TTGGGGGCGACACTCCACCATGAATCACCCCCCTGTGA
1a._.H77_AF009606 GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGA
1a._.HEC278830_AJ278830 GCCAGCCCCCTGATGGGGGCGACGCTCCACCATGAATCACTCCCCTGTGA
CON GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCaTGGCGTTAGTATGAG
1a._.COLONEL_AF290978 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.H77_AF009606 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.HEC278830_AJ278830 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCGTGGCGTTAGTATGAG
CON TGTCGTGCAGCCTcCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.COLONEL_AF290978 TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.H77_AF009606 TGTCGTGCAGCCTTCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.HEC278830_AJ278830 TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
;
end;
Questions or comments? Contact us at
seq-info@lanl.gov