The distance matrix option in raxml

jiajie

Jul 26, 2011, 6:23:44 AM
to ra...@googlegroups.com
Dear Alexis,

I need to calculate an n-by-n distance matrix for n sequences. I see that RAxML can generate this matrix by optimizing over an input tree. I also use the programs in PHYLIP, which estimate the distances directly from each pair of sequences.

For distance-based tree methods, I think the PHYLIP way of generating the distance matrix is the standard one. I cannot see the point of using a tree to get the distance matrix, unless there is proof that this indeed gives a better estimate.

But of course RAxML is much faster than PHYLIP (PHYLIP needs 20+ hours for a data set of around 4000 taxa). One reason might be that RAxML only calculates around 2n distances along the tree (if I am right), whereas PHYLIP actually calculates n^2/2 distances.

    

Best

Jiajie 

Alexis

Jul 26, 2011, 10:03:12 AM
to raxml
Dear Jiajie,

Okay, here is how I implemented pair-wise distance computations in
RAxML:

RAxML initially either reads in a reference tree via -t or computes a
parsimony tree. This tree is then used to infer the parameters of the
model, such as the alpha parameter of GAMMA and the GTR rate matrix.
Once those parameters have been estimated for the entire tree and over
all sequences, I use the estimated GTR and alpha parameters to compute
pairwise distances between sequences.

So basically what happens is that I keep the GTR and alpha parameters
fixed, connect each pair of taxa by a branch, and do an ML estimate of
the length of the branch connecting those two taxa using the standard
Newton-Raphson numerical optimization procedure. Does that make sense?

I just use the tree initially because it was not clear to me if it
makes sense to optimize alpha and the GTR rates for each pair of taxa
individually.
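
The per-pair step described above can be sketched roughly as follows. This is not RAxML's code: to keep it short it swaps in the Jukes-Cantor (JC69) model in place of GTR+GAMMA, so there are no rate or alpha parameters to hold fixed, and the function names, bounds and fallback step are assumptions made for this illustration. What it does show is the core idea: join two sequences by a single branch and find the branch length that maximizes the likelihood with Newton-Raphson.

```python
import math

MIN_DIST, MAX_DIST = 1e-8, 10.0  # arbitrary bounds chosen for this sketch


def jc69_derivatives(d, n_same, n_diff):
    """First and second derivative of the JC69 pairwise log-likelihood at d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # P(same base at both ends of the branch)
    p_diff = 0.25 - 0.25 * e  # P(one specific different base)
    d1 = -n_same * e / p_same + n_diff * (e / 3.0) / p_diff
    d2 = (n_same * ((4.0 / 3.0) * e * p_same - e * e) / p_same ** 2
          - n_diff * ((4.0 / 9.0) * e * p_diff + e * e / 9.0) / p_diff ** 2)
    return d1, d2


def ml_distance(seq_a, seq_b, tol=1e-10, max_iter=100):
    """ML branch length (expected substitutions per site) between two aligned sequences."""
    # Simplification: only columns where both sequences have A, C, G or T are used.
    cols = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
            if a in "ACGT" and b in "ACGT"]
    if not cols:
        raise ValueError("no comparable sites")
    n_diff = sum(a != b for a, b in cols)
    n_same = len(cols) - n_diff
    if n_diff == 0:
        return 0.0
    if n_diff / len(cols) >= 0.75:
        return MAX_DIST  # saturated pair: the likelihood has no finite maximum
    d = n_diff / len(cols)  # start from the uncorrected p-distance
    for _ in range(max_iter):
        d1, d2 = jc69_derivatives(d, n_same, n_diff)
        if d2 < 0.0:
            d_new = d - d1 / d2  # Newton-Raphson step
        else:
            d_new = d + math.copysign(0.5 * d, d1)  # fallback: bounded uphill move
        d_new = min(max(d_new, MIN_DIST), MAX_DIST)
        if abs(d_new - d) < tol:
            return d_new
        d = d_new
    return d


print(ml_distance("ACGTACGTAC", "ACGTACGTTC"))
```

For JC69 the result can be checked against the closed-form correction d = -3/4 ln(1 - 4p/3); under GTR+GAMMA no closed form exists, which is why a numerical optimizer is needed there.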

Also RAxML computes all pair-wise distances between all pairs of taxa.
How did you get the impression that it's only about 2n
distances that it computes? I just tested with the latest GIT version
and it returns all pairs.
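
To put numbers on the two counts discussed here, a small sketch (the function names are made up for this example): a full matrix has n(n-1)/2 distinct pairs, while an unrooted binary tree on n taxa has 2n-3 branches, which is presumably where the "around 2n" impression came from.

```python
def n_pairs(n):
    """Number of distinct taxon pairs in an n-by-n distance matrix."""
    return n * (n - 1) // 2


def n_branches_unrooted(n):
    """Number of branches in an unrooted, fully bifurcating tree with n tips."""
    return 2 * n - 3


n = 4000  # data set size mentioned earlier in the thread
print(n_pairs(n))              # 7998000
print(n_branches_unrooted(n))  # 7997
```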
Hope this helps,

Alexis

BUI Quang Minh

Jul 26, 2011, 11:02:00 AM
to ra...@googlegroups.com
Hi Alexis,

Computing the ML distance matrix this way makes complete sense to me.
I think most other programs do this as well, though they may not be as
efficient as RAxML. The reasons are:

- The estimation of the model parameters and the Gamma shape (alpha) is
not sensitive to the tree topology as long as you have a reasonable
tree. So using the initial parsimony tree is fine.
- In theory, one cannot estimate the alpha parameter using just two
sequences (see also Ziheng Yang's 1994 paper). Moreover, the GTR
parameters can only be estimated from two sequences if they are long
enough or contain enough information. In practice, however, we rarely
have such long sequences. You may have long sequences, but if they come
from different genes and you use a partitioned model, then the
per-partition sequence lengths are short again.

Cheers,
Minh

Zhang Jiajie

Jul 29, 2011, 1:43:01 PM
to ra...@googlegroups.com
Hi Alexis and Minh,

That explains a lot of things, thanks a lot!

In the beginning I thought that RAxML simply extracts the pair-wise distances from the tree once the edge lengths are optimized. Would this make any difference compared with the way it is implemented in RAxML now?


Best

Jiajie

************************************************************

Jiajie Zhang
PhD Candidate

Institute of Biochemistry
University of Luebeck

Institute for Neuro- and Bioinformatics
University of Luebeck

Graduate School for Computing in Medicine and Life Sciences
University of Luebeck

Ratzeburger Allee 160
23538 Luebeck
Germany

Phone: +49-451-500-4065
Phone: +49-451-317-93111
Fax: +49-451-500-4068

E-mail: zhang...@biochem.uni-luebeck.de
E-mail: bestzha...@gmail.com
Web: www.biochem.uni-luebeck.de
Web: www.gradschool.uni-luebeck.de

*************************************************************

Alexis

Jul 30, 2011, 12:05:30 PM
to raxml
Dear Jiajie,

> That explains a lot of things, thanks a lot!

Thanks for the additional info Minh :-)

> In the beginning I thought that RAxML simply extracts the pair-wise distances from the tree once the edge lengths are optimized. Would this make any difference compared with the way it is implemented in RAxML now?

Yes, it does, I am just computing plain pairwise distances without
caring about the other taxa. If you have two distant taxa t1 and t2
the direct ML distance estimate is expected to be less accurate
than if you had some additional taxa x1,x2,...,xn that would sit on
the path between t1 and t2 (see the whole discussion about taxon
sampling etc., e.g., http://sysbio.oxfordjournals.org/content/51/4/588.short).

So for getting the patristic distances you just need an ML tree with
branch lengths and need to write a little script, whereas the RAxML
pair-wise distance mode gives you plain distances between pairs of
taxa.
It would actually be interesting to test what the variance/differences
between direct distances and tree-based patristic distances will be.
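
A "little script" of the kind mentioned above could look roughly like this. It is a minimal sketch, not a tested tool: the parser and function names are written for this example, and it assumes a plain Newick string with unquoted labels and no bracketed comments or branch labels. The patristic distance of a pair is the sum of branch lengths on the path through their lowest common ancestor.

```python
import re
from itertools import combinations


def parse_newick(newick):
    """Parse a Newick string into parent links, branch lengths and leaf ids."""
    parent, length, leaves = {}, {}, {}
    stack = []     # internal nodes whose ")" has not been seen yet
    closed = None  # internal node just closed by ")", awaiting its label/length

    def new_node():
        node = len(parent)
        parent[node] = stack[-1] if stack else None
        length[node] = 0.0
        return node

    for token in re.findall(r"[(),;]|[^(),;]+", newick):
        token = token.strip()
        if not token:
            continue
        if token == "(":
            stack.append(new_node())
            closed = None
        elif token == ",":
            closed = None
        elif token == ")":
            closed = stack.pop()
        elif token == ";":
            break
        else:
            label, _, len_text = token.partition(":")
            node = closed if closed is not None else new_node()
            if closed is None:
                leaves[label.strip()] = node  # a new tip
            if len_text:
                length[node] = float(len_text)
            closed = None
    return parent, length, leaves


def patristic_distances(newick):
    """Return {(taxon_a, taxon_b): summed branch length along the tree path}."""
    parent, length, leaves = parse_newick(newick)

    def path_to_root(node):
        path, dist = {}, 0.0
        while node is not None:
            path[node] = dist  # distance from the tip up to this ancestor
            dist += length[node]
            node = parent[node]
        return path

    paths = {name: path_to_root(node) for name, node in leaves.items()}
    distances = {}
    for a, b in combinations(sorted(paths), 2):
        pa, pb = paths[a], paths[b]
        # The first ancestor of a that is also an ancestor of b is their LCA.
        distances[(a, b)] = next(pa[x] + pb[x] for x in pa if x in pb)
    return distances


tree = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.15);"
for pair, dist in patristic_distances(tree).items():
    print(pair, round(dist, 6))
```

Storing one leaf-to-root path per taxon keeps the memory footprint small; the pair loop is still quadratic in the number of taxa, as any full matrix must be.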

Cheers,

Alexis

jonas ghyselinck

Sep 4, 2012, 8:59:34 AM
to ra...@googlegroups.com

Dear Alexis,

I would like to calculate patristic distance matrices for large ML trees (i.e. 1400 taxa). To do so, we constructed trees with RAxML and exported them with branch lengths in Newick format. We tried several programs to calculate patristic distances from the Newick trees (PHYLOCOM, PATRISTIC, RAMI), but all failed due to the large amount of data.

I noticed that RaxMLGui has a tool available: “Pairwise distances” – Compute distances for all taxa pairs in the data set (“-f x”).

However, I'm not sure whether this generates the sum of branch lengths between all pairs of taxa in the tree. Could you maybe clarify this?

Since I'm not a computer scientist and have no experience with writing scripts whatsoever, could you tell me whether you know of any scripts that can generate such matrices?

I would like to thank you in advance for your time!

Looking forward to hearing from you,

Best,

Jonas
Ghent University
Department of Biochemistry and Microbiology (WE10)
Laboratory of Microbiology
K.L. Ledeganckstraat 35
9000 Ghent, Belgium
Phone: +32-9-2645101

Alexandros Stamatakis

Sep 4, 2012, 11:17:08 AM
to ra...@googlegroups.com
Dear Jonas,

> I would like to calculate patristic distance matrices for large ML trees
> (i.e. 1400 taxa). In order to do so, we constructed trees with RAxML and
> exported branch lengths (trees were exported in newick format). We tried
> several software programs to calculate patristic distances from the newick
> trees (PHYLOCOM, PATRISTIC, RAMI), but all failed due to the large amount
> of data.
>
> I noticed that RaxMLGui has a tool available: “Pairwise distances” – Compute
> distances for all taxa pairs in the data set (“-f x”).
> However, I'm not sure whether this generates the sum of branch lengths
> between all pairs of taxa in the tree? Could you maybe clarify this?

from the RAxML help:

"-f x": compute pair-wise ML distances, ML model parameters will be
estimated on an MP starting tree or a user-defined tree passed
via "-t", only allowed for GAMMA-based models of rate
heterogeneity

This computes the pair-wise distances between two sequences, not the
patristic ones.
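
If the goal is an n-by-n matrix, the pair list written by "-f x" has to be reshaped. The sketch below assumes, without confirmation from this thread, that the output is a text file with one pair per line in the form "taxonA taxonB distance" (whitespace separated); please check this against your own output file before relying on it. The function name is made up for this example.

```python
def read_pairwise_distances(lines):
    """Turn 'taxonA taxonB distance' lines into (names, symmetric matrix)."""
    names, index, entries = [], {}, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"unexpected line layout: {line!r}")
        a, b, dist = fields[0], fields[1], float(fields[2])
        for name in (a, b):
            if name not in index:
                index[name] = len(names)
                names.append(name)
        entries.append((a, b, dist))
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for a, b, dist in entries:
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = dist  # distances are symmetric
    return names, matrix


# Hypothetical input in the assumed layout.
example = ["t1 t2 \t 0.10", "t1 t3 \t 0.25", "t2 t3 \t 0.30"]
names, matrix = read_pairwise_distances(example)
print(names)
for row in matrix:
    print(row)
```

In practice the lines would come from an open file handle, for example `with open(path) as fh: names, matrix = read_pairwise_distances(fh)`.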

> Since I'm not an informatician and have no experience with writing scripts
> whatsoever, could you tell me if you know scripts that allow to generate
> such matrices?

try dist.pl from my colleague Markus:

http://www.goeker.org/mg/distance/


Cheers,

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

jonas ghyselinck

Sep 6, 2012, 3:28:59 AM
to ra...@googlegroups.com
Dear Alexis,
 
Thank you for your reply!
According to you, does it make sense to calculate the correlation between branch lengths in two trees between corresponding pairs of taxa if the trees are ML trees that were constructed with the GTRCAT approximation? Or would you recommend using GTRGAMMA for this purpose?
 
Thanks!
Best,
Jonas.
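
The comparison asked about here could be computed along these lines. This is only a sketch: it assumes each tree has already been reduced to a dictionary mapping a taxon pair to its patristic distance (the dictionaries and the function name are hypothetical), and it reports a plain Pearson coefficient over the pairs present in both.

```python
import math


def pearson_over_shared_pairs(dist_a, dist_b):
    """Pearson correlation of distances for taxon pairs present in both inputs."""
    keys = sorted(set(dist_a) & set(dist_b))
    if len(keys) < 2:
        raise ValueError("need at least two shared pairs")
    xs = [dist_a[k] for k in keys]
    ys = [dist_b[k] for k in keys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("distances are constant in one input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical distances from two trees over the same three taxa.
tree1 = {("A", "B"): 0.10, ("A", "C"): 0.30, ("B", "C"): 0.35}
tree2 = {("A", "B"): 0.12, ("A", "C"): 0.28, ("B", "C"): 0.40}
print(pearson_over_shared_pairs(tree1, tree2))
```

Because the entries of a distance matrix are not independent of each other, the coefficient is best read as a descriptive summary; a standard significance test on it would not be appropriate.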

Alexandros Stamatakis

Sep 6, 2012, 8:31:43 AM
to ra...@googlegroups.com
Dear Jonas,

> Thank you for your reply!

:-)

> According to you, does it make sense to calculate the correlation between
> branch lengths in two trees between corresponding pairs of taxa if the
> trees are ML trees that were constructed with the GTRCAT approximation? Or
> would you recommend using GTRGAMMA for this purpose?

GTRCAT should be fine as long as you use the latest RAxML version from
github where we fixed the branch length issue.

More info and data regarding CAT and branch lengths and how they relate
to GAMMA can be found in the following two papers:

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0009490
http://www.biomedcentral.com/1471-2105/12/470/

jonas ghyselinck

Sep 20, 2012, 5:33:36 AM
to ra...@googlegroups.com
Dear Alexis,
 
I calculated pairwise distances between sequences in RAxML v7.3.2 for a dataset of 1381 sequences. However, I found that a number of sequences had pairwise distances larger than one. Since this is not possible, I was wondering whether you have any idea of what would be the cause of this?
 
Thank you in advance!
 
Best,
Jonas.
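
One possible explanation for distances above one, which this thread does not confirm: ML distances are measured in expected substitutions per site rather than as the fraction of differing sites, and multiple substitutions at the same site allow that number to exceed one. A minimal sketch using the closed-form Jukes-Cantor (JC69) correction, which is an assumption here and much simpler than the GAMMA-based models that "-f x" requires:

```python
import math


def jc69_distance(p):
    """JC69 distance (expected substitutions per site) from the fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for a finite JC69 distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.1, 0.3, 0.5, 0.6):
    print(f"p = {p:.1f}  ->  d = {jc69_distance(p):.3f}")
```

Under this model a pair differing at 60% of its sites already has a distance of about 1.21, so values above one point to very divergent (or poorly aligned) sequences rather than to an impossible result.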

Luciano Brocchieri

Nov 28, 2012, 8:58:51 AM
to ra...@googlegroups.com
Alexis,
I am very interested in exactly understanding how RAxML computes pairwise distances. Let me see if I understand.
Given an input tree (and, I suppose, an input alignment) RAxML uses the tree to optimize parameters.
I understand this includes the parameter alpha of the gamma distribution. Is there an option to choose a transition rate matrix (e.g., LG) or to optimize the coefficients of a GTR and estimates of equilibrium frequencies?
Is there an option to estimate the fraction of invariant sites and to choose number of rate categories in the discretized gamma?
Does the program output the values of the estimated rates (median or average) for each rate category?
I suppose once all parameters are estimated, they are used to obtain ML estimates of evolutionary distance of each pair (n choose 2 pairs) independently. Am I understanding correctly?
Thank you very much!
Luciano

jonas ghyselinck

Nov 28, 2012, 9:05:20 AM
to ra...@googlegroups.com
Dear Alexis,
 
How does RAxML treat gaps when calculating pairwise distances? Does it ignore gaps, does it penalize? Does it count a string of gaps as a single gap?
 
Thank you!
Jonas.

Alexandros Stamatakis

Nov 28, 2012, 1:25:44 PM
to ra...@googlegroups.com
Hi Luciano,

> I am very interested in exactly understanding how RAxML computes pairwise
> distances. Let me see if I understand.
> Given an input tree (and, I suppose, an input alignment) RAxML uses the
> tree to optimize parameters.
> I understand this includes the parameter alpha of the gamma distribution.

yes.

> Is there an option to choose a transition rate matrix (e.g., LG) or to
> optimize the coefficients of a GTR and estimates of equilibrium frequencies?

Yes, it's documented in the manual and the on-line help you can get by
typing ./raxmlHPC -h

> Is there an option to estimate the fraction of invariant sites

yes, please see the v704 manual.

> and to
> choose number of rate categories in the discretized gamma?

no, it's hard coded to 4.

> Does the program output the values of the estimated rates (median or
> average) for each rate category?

no, but you can choose between the median and average in the latest
RAxML version that is on github (I don't recall the command line
parameter right now).

> I suppose once all parameters are estimated, they are used to obtain ML
> estimates of evolutionary distance of each pair (n choose 2 pairs)
> independently. Am I understanding correctly?

Yes, that's correct.

Cheers,

Alexis

Alexandros Stamatakis

Nov 28, 2012, 1:27:07 PM
to ra...@googlegroups.com
Dear Jonas,

it treats them in the same way as for tree searches.
Please search the forum for this topic which has already been discussed
a couple of times.

Cheers,

Alexis