HCV Database
HCV sequence database

To our users Please note that the HCV database site is no longer funded. We try to keep the database updated and the tools running, but unfortunately, we cannot guarantee we can provide help for using this site. Data won't be manually curated either.

How to make a phylogenetic tree



Methods to build a tree

First of all, it is important to realize that all tree building methods assume that the alignment is correct; errors in the alignment can lead to a very misleading tree. Once the alignment is optimal, there are many different methods to create phylogenies. Roughly, the methods can be divided into distance-based and character-based methods. Character-based methods use the individual substitutions among the sequences to determine the most likely ancestral relationships, while distance-based methods first calculate the overall distance between the all pairs of sequences, and then calculate a tree based on those distances. Maximum parsimony (MP) and maximum likelihood (ML) are the most important character-based methods. In most comparative studies, ML seems to be the method that yields the best trees (Leitner, 1996).

The most important drawback of ML is that it is very computationally intensive; it is almost unusable with more than a few dozen sequences. MP is much faster than ML, but still slow compared to most distance-based methods. Both these programs calculate large numbers of trees and compare them by either the likelihood or the parsimony score. There are programs available which speed up the process considerably, such as FastDNAml (Olsen, 1994) but for large numbers of sequences the computational burden is still a major hurdle.

The distance-based methods are generally much faster than the character-based ones. In this group, Neighbor Joining (NJ) is by far the most popular method. It is very fast and generally quite good, although there are conditions under which it systematically produces a wrong tree (bias). NJ is not the only biased method: in fact, under the right conditions (mostly when evolutionary rates are different for different sites, which is frequently the case for HIV) any method is biased, i.e. systematically produces the wrong tree (Kuhner & Felsenstein, 1994). The importance of the bias for the everyday tree is a matter of fierce debate.

Weighbor ("weighted Neighbor-Joining") is a method by Bruno, Socci, and Halpern which gives less weight to the longer distances in the distance matrix. The resulting trees are less sensitive to specific biases than NJ and MP, and negative branch lengths are avoided. The method is much faster than ML and usually faster than MP, but much slower than NJ.

To some extent the choice between methods can be based on the purpose of the tree. For subtyping sequences, an NJ tree based on a matrix of genetic distances is generally good enough. It is not vital that the tree is correct in every branch, only that the sequence of interest is clustered with the right subtype. Almost any method will solve this problem correctly. However, when more detailed information on the evolutionary relationships is important, for instance in forensic analyses, or when studying rates of evolution or trying to resolve key relationships (for example, short domains and potential recombination), more realistic models of evolution and unbiased tree reconstruction methods should be used.

Genetic distances

The simplest way to calculate a distance between two sequences is to count the number of differences. This measure is often called the Hamming distance. It reflects the difference between sequences, but ignores their evolutionary relationship. The most intuitive way to show how this can be misleading is to think of superimposed or reversed mutations: a nucleotide in a sequence, say an A, gets mutated and becomes a G. This results in a difference of 1. Many replication rounds later the same site gets mutated again. If it becomes an A, the difference now goes down, even though the genetic distance (the number of mutational events) has gone up. If it then becomes a T or a C, the number of differences doesn't change, but the evolutionary distance should increase, because there have now been two evolutionary events.

The oldest attempt to adjust the number of differences between two sequences for the chances of a parallel or back mutation was designed by Jukes and Cantor (1969). They proposed the corrective formula D=-3/4ln(1-(4p/3)). The relation between the percent difference and the genetic distance according to the Jukes/Cantor formula is shown in Figure 1. The figure illustrates two things. First, the effect of the correction increases with the difference between the sequences, and is negligible with very homologous sequences. Second, the effect of saturation can be seen: beyond a certain number of differences (~75%), it becomes impossible to tell what the true genetic distance is.

Jukes-Cantor image
Figure 1. The percent difference vs. genetic distance according to the Jukes/Cantor formula.

Since this formula was first proposed in 1969, many different models have been proposed to accurately estimate the underlying evolutionary distance from the observed number of differences between the sequences, taking into account new knowledge about the behavior of DNA, such as different transition/transversion rates, different base frequencies, and non-uniform substitution rates between sites. Not all of these sophisticated methods are easily available (Dnadist, for example, only offers Jukes-Cantor, Kimura 2-parameter, Jin-Nei, and the F84 model Phylip uses for its maximum likelihood trees; Mega also offers Tamura and Tamura-Nei and a set of Gamma distribution-based distances).

A relatively new development is the use of models that incorporate variable evolutionary rates across sites. This is an important extension of the existing models, especially for HIV, which is well-known to show dramatically different evolutionary rates (e.g. under the influence of immune escape) over the length of the genome. Gary Olsen has a program called DNArates which can estimate the rates of evolution in different sites, given an initial tree. The number of categories for the rates can be specified by the user. Using the DNArates program can quite dramatically increase the quality of the resulting trees. The program is unpublished, but can be obtained from Gary Olsen's website and from the UIBIO archive.

Many simulations show that the importance of using a realistic substitution model for estimating the genetic distance depends very much on the divergence of the sequences. If the expected number of substitutions per site is small (below 0.2), the resulting tree will not change much when different substitution models are used. If the sequences are highly diverged, however, and especially when the number of substitutions per site rises above 1, the differences become marked, the the choice of the substitution model is very important.

Tree-building programs available

Probably the most widely used program suite is Felsenstein's PHYLIP. It offers a large array of methods, including ML, MP and NJ. The choice of genetic distance is fairly limited, and learning to use the programs requires some time investment, but their versatility makes them very popular. Most allow input of user trees, jumbling (randomizing input orders), and outgroup designation.

MEGA is a nifty little program for DOS/Windows. It offers NJ and MP (although in a quirky implementation) and the sophisticated Tamura-Nei genetic distance estimation. MEGA is able to build trees on the basis of silent or non-silent mutations only, and has a fast bootstrapping option built in. It is available for minimal cost from the authors. Downsides are that the program is fairly unstable and has memory problems with large sets of sequences, and the printing of the trees often gives problems.

PHYLOWIN is a UNIX/Linux based program with a very nice graphical user interface. It does ML, MP and NJ, allows selection of sequences (rows) and position (columns) and bootstrapping. It allows the user to define subsets of sequences and positions and save them. It is available free of charge for academic users.

FastDNAml computes fast(er) ML trees. It is not very easy to use, but is often the only option if one wants to calculate ML trees for more than 20 sequences. It uses input file or command line parameters to specify the trees. The C source code for the program and an executable for Powermac are available by FastDNAml FTP site.

PAUP* is a very versatile program that does maximum likelihood treebuilding in addition to parsimony, and allows incorporation of variable rates per site. It is presently distributed as a beta version, and costs around $100. The Mac version is entirely menu-driven, but the user interface is very clumsy. There are no user interfaces for the Windows and Unix versions.

Links to these and dozens of other phylogenetic programs can be found at Joe Felsenstein's Phylogeny Programs site.

Our Treemaker interface

Our Treemaker interface is just that: it interfaces between the user and the phylogenetic programs Dnadist, Neighbor, and Drawtree/Drawgram from Joe Felsenstein's PHYLIP suite. The interface can only be used to make Neighbor-Joining trees, which may not be optimal for all circumstances, but usually form a good starting point for more sophisticated analysis; and as mentioned above, if the inferences made from the tree need not be very exact, for example for subtyping a small region or for a quick contamination check, this tree can suffice. Please note that the interface does not do bootstrapping.

The interface takes a sequence alignment in several formats, allows the adjustment of a few parameters (transition/transversion ratio, outgroup, and the shape of the tree). It uses the F84 genetic distance estimate (i.e. the option called 'ML' in Dnadist).

Common mistakes in making a tree

  1. Presenting the output from PHYLIP's Consense program as the final tree.

    This tree gives only a branching order. The 'branchlengths' are not true branch lengths, but rather reflect the % bootstrap values. For this reason, the Treefile that Consense produces does not contain branch lengths, and when printed, all branches in the tree have the same length, like this:

    tree without branch lengths

    REMEDY: If you want to include bootstrap values in a tree created with PHYLIP, the simplest method is to simply paste the values into the non-bootstrapped tree with valid branchlengths. The better way is to use the tree you get from the Consense program as input for another run of the tree-building program to have the branch lengths estimated for that particular tree. This also solves the (infrequent) problem where the topology of the consensus tree doesn't exactly match the one of the original tree.

  2. Visible alignment errors and/or unrecognized hypermutation

    When one sequence protrudes far out beyond all the others and there is no inherent reason for it to be so different, further inspection is needed. Frequently the cause is an alignment error or a hypermutated sequence. The tree looks like this:

    tree with hypermutant

    The misaligned/hypermutated sequence is shown in red.

    a. Visually check the alignment of your sequences. There are many alignment editors available that make it very easy to do this. We gives links to sites where you can download sequence editing software on our External Tools page.
    b. Check your sequence for hypermutation. Hypermutation, a relatively common phenomenon in HIV, means a very high incidence of G -> A mutations, usually resulting in a non-viable sequence. We have created an interface on our site that was designed to detect hypermutation: HYPERMUT.

  3. Unrecognized recombination

    When an isolate branches off close to the root between two subtypes, especially if it has a long branch or if it is not from a particularly old isolate, there is a chance that it is a recombinant. See the tree below for an example, in this case the MAL isolate, an A/D/I recombinant (shown in red). Similarly, when a sequence branches off between two clusters from one patient isolate, it can be a within-patient recombinant. In this case it can be the result of a real recombination event, or a PCR artifact.

    tree with recombinant

    REMEDY: If you suspect your sequence may be a recombinant, there are a multitude of ways to look at this more closely. On this website we maintain RIP, which produces an alignment and an easy-to-read plot that shows the similarity of your sequence to a set of reference sequences over the entire length of the sequence. If there are major changes in what the most similar sequence is over the length of your sequence, this suggests recombination.

  4. Assuming a molecular clock

    The graphical representation of this assumption is that all branches end on one vertical line, representing the present day, like this:

    clock tree

    This assumption is not realistic for HIV, and it is very uncommon for other organisms. The most commonly used method that produces these trees is UPGMA; the Kitsch (which explicitly assumes a molecular clock) program from the PHYLIP suite also results in this type of tree.

    REMEDY: Use a different tree reconstruction method. Neighbor-joining, Maximum Parsimony and Maximum Likelihood all produce trees that do not assume a strict molecular clock.
    NOTE: When using PAUP*, the default method for printing a tree is a 'rectangular cladogram' which will make the tree look like this. Setting the drawing style to 'phylogram' remedies this.

What to do when your tree looks funny.

If none of these errors describe your situation, but you still think your tree is off, there are a few things you can do.

  1. Try a different method, or a different distance estimate. In rare cases (when there are many equivalent trees) even the input order of the sequence can make a difference; use the Jumble option provided in DNAML, or rearrange your input file.
  2. Try using a different program that uses the same tree reconstruction method and compare the results.
  3. Use a different outgroup: although the outgroup does not affect the structure of the tree, it can sometimes make it easier to interpret.
  4. Split your sequences in half and see if the resulting trees are different. This suggests either recombination or dramatic evolutionary rate differences. In some cases (an example is the Rev-responsive element or RRE in HIV) two adjacent regions can be under such different constraint that they evolve very differently; building a tree from a sequence that spans both regions can give confusing results.

Tree 'reliability'

By far the most popular test for trees is the bootstrap. Contrary to what many people think, this is not a test of how accurate your tree is; it only gives information about the stability of the tree topology (the branching order), and it helps assess whether the sequence data is adequate to validate the topology. The bootstrap randomly resamples columns from your alignment, so that some positions will not be used and others will be used more than once, and builds a new tree from this dataset. This is done as many times as you specify. The bootstrap value is a count or percentage of how often each branch was present in exactly the same topology in all the resampled trees, so it gives an impression of how much the tree topology could change if, for example, you'd reconstruct it using a different gene. There are many rules of thumb about how to interpret the bootstrap. It is known to be a conservative measure, so a bootstrap of 95% gives more than 95% confidence in that branch. The number of 70% is often cited as a cut-off for a 'reliable' branch (see Hillis & Bull, 1993).

Further reading

Internet resources

Basic books and texts

Molecular evolution and HIV


Questions or comments? Contact us at hcv-info@lanl.gov