Effect of parsimony random seed on semantics of trees


darthvader

Apr 23, 2012, 7:59:51 AM
to ra...@googlegroups.com
I have been running RAxML on some datasets. I noticed that the tree structure varies with the random seed (-p parameter).
Why is the random seed needed? How is it used internally?
Is it guaranteed that runs with different -p values will result in similar trees (apart from differences in ordering)?

darthvader

Apr 23, 2012, 8:03:51 AM
to ra...@googlegroups.com
Also, could someone explain what bootstrapping is, or direct me to a resource where I can read about it in the context of phylogenetics?

Fernando Izquierdo

Apr 23, 2012, 8:10:14 AM
to ra...@googlegroups.com
Hi Darth,

The random seed makes the generation of the parsimony starting tree
deterministic, so runs are reproducible: given a seed, you always get
the same starting tree and hence the same final ML tree. Different
seeds generally produce different starting trees, which in turn can
lead to different ML trees.

In general you should use a random number as a starting seed.
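
To illustrate why a fixed seed gives reproducible runs, here is a minimal
Python sketch. It is not RAxML code: the function name addition_order and
the taxon names are made up for the example. It only shows that a seeded
pseudo-random generator always yields the same taxon insertion order.

```python
import random

def addition_order(taxa, seed):
    """Return a pseudo-random taxon insertion order determined entirely by `seed`."""
    rng = random.Random(seed)   # private generator; the global RNG state is untouched
    order = list(taxa)
    rng.shuffle(order)
    return order

taxa = ["A", "B", "C", "D", "E", "F"]   # hypothetical taxon names
print(addition_order(taxa, 12345) == addition_order(taxa, 12345))  # True: same seed, same order
print(addition_order(taxa, 1))
print(addition_order(taxa, 2))          # a different seed usually gives a different order
```

Writing down the seed you used is therefore enough to regenerate the same
starting tree later.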

May the force be with you,
Fernando

Fernando Izquierdo

Apr 23, 2012, 8:22:01 AM
to ra...@googlegroups.com
Dear raxml users,

Please note that the scope of this group should cover only
raxml-related questions.

For background questions on computational phylogenetics, such as
bootstrapping, you can have a look at Ziheng Yang's textbook
"Computational Molecular Evolution" or Felsenstein's "Inferring
Phylogenies".

Best,
Fernando

Alexandros Stamatakis

Apr 23, 2012, 8:54:48 AM
to ra...@googlegroups.com
Different runs with different seeds may or may not result in different trees; this depends on the dataset.
What is guaranteed is that when you specify the same parsimony seed, you'll obtain the same starting trees.

The procedure/algorithm used is called randomized stepwise addition order and is described in the RAxML-III paper
from 2005.
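
The general idea of randomized stepwise addition can be sketched in a few
lines of Python. This is a toy illustration under stated assumptions, not
RAxML's implementation: trees are rooted nested tuples, candidates are
scored with the Fitch parsimony count, ties are broken by taking the first
candidate found, and the five-taxon alignment is made up.

```python
import random

def fitch(tree, seqs):
    """Return (per-site state sets, parsimony score) for a rooted binary tree."""
    if isinstance(tree, str):                      # leaf: one state set per site
        return [{ch} for ch in seqs[tree]], 0
    left_sets, left_cost = fitch(tree[0], seqs)
    right_sets, right_cost = fitch(tree[1], seqs)
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        if a & b:
            sets.append(a & b)                     # states agree: no extra change
        else:
            sets.append(a | b)                     # disagreement costs one change
            cost += 1
    return sets, cost

def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)                            # attach on the branch above this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)

def stepwise_addition(seqs, seed):
    """Build a tree by inserting taxa in a seed-determined random order."""
    rng = random.Random(seed)
    order = sorted(seqs)                           # fixed base order, then shuffle
    rng.shuffle(order)
    tree = (order[0], order[1])
    for taxon in order[2:]:
        # keep the insertion point with the lowest parsimony score (first one on ties)
        tree = min(insertions(tree, taxon), key=lambda t: fitch(t, seqs)[1])
    return tree, fitch(tree, seqs)[1]

# Made-up toy alignment, for illustration only.
seqs = {"A": "AAGTC", "B": "AAGCC", "C": "ACGCT", "D": "CCTCT", "E": "CCTTT"}
for seed in (1, 2, 3):
    print(seed, *stepwise_addition(seqs, seed))
```

Because the insertion order is the only random ingredient here, the same
seed always rebuilds the same tree, while different seeds may or may not
end up with the same one.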

Alexis

--
Dr. Alexandros Stamatakis
www.exelixis-lab.org

darthvader

Apr 23, 2012, 9:34:43 AM
to ra...@googlegroups.com, Alexandros...@gmail.com

I will definitely read that to understand more.
By different trees, do you mean trees with possibly different semantics, like (A,(B,C),D) and ((A,B),C,D), or just a difference in ordering?
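
One way to tell a real difference from a mere ordering difference is to
compare the sets of bipartitions (splits) that the two trees induce on the
taxa, which is the idea behind the Robinson-Foulds distance. The sketch
below is a hedged illustration in Python: the helper names are made up, it
assumes plain Newick strings with commas and without branch lengths, and it
assumes both trees contain the same taxa.

```python
def parse_newick(text):
    """Parse a plain Newick string (no branch lengths) into nested tuples."""
    text = text.strip().rstrip(";").replace(" ", "")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                               # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                           # consume ","
                children.append(node())
            pos += 1                               # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]                     # leaf name

    return node()

def clades(node, out):
    """Collect the leaf set below every internal node into `out`."""
    if isinstance(node, str):
        return frozenset([node])
    clade = frozenset().union(*(clades(child, out) for child in node))
    out.append(clade)
    return clade

def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted."""
    found = []
    all_taxa = clades(tree, found)
    ref = min(all_taxa)                            # normalise: keep the side without `ref`
    result = set()
    for clade in found:
        side = all_taxa - clade if ref in clade else clade
        if 1 < len(side) < len(all_taxa) - 1:
            result.add(side)
    return result

def same_topology(newick_a, newick_b):
    return splits(parse_newick(newick_a)) == splits(parse_newick(newick_b))

print(same_topology("(A,(B,C),D)", "(D,(C,B),A)"))   # True: only the ordering differs
print(same_topology("(A,(B,C),D)", "((A,B),C,D)"))   # False: different groupings
```

In the first pair both trees group B with C; in the second pair one tree
groups B with C and the other groups A with B, so they are different trees
no matter how they are drawn.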

sergios-orestis kolokotronis

Apr 23, 2012, 12:07:50 PM
to ra...@googlegroups.com
How difficult would it be to jumble the seed number in RAxML in order to explore better starting trees? I know it can be scripted and coupled with ML searches (-N), but it should be useful to people building large trees. Just saying...

Trees explored better will be...

Fernando Izquierdo

Apr 23, 2012, 6:28:13 PM
to ra...@googlegroups.com
Hi Sergio,

This is an interesting question. However, it is difficult to know what
a good starting tree is. In our experience, starting trees that do not
look promising in terms of likelihood (LH) may take search paths that
end up leading to the better ML trees. In other words, we are not aware
of any straightforward way to identify "good starting trees" early in
the search phase.

As far as I know, the closest we have come to that is limiting the tree
search to an inner part of the tree, which is faster than a search on
the full tree. We saw there that the good starting trees for the
constrained search (SPR moves only in the inner part of the tree) were
also good trees when the full search (SPRs on the full tree) was done.

The latter is discussed here:
F. Izquierdo-Carrasco, S.A. Smith, A. Stamatakis: "Algorithms, Data
Structures, and Numerics for Likelihood-based Phylogenetic Inference
of Huge Trees". BMC Bioinformatics 12:470, 2011

Cheers,
Fernando

Alexandros Stamatakis

Apr 24, 2012, 1:57:22 PM
to ra...@googlegroups.com
I agree with Fernando, I don't see much potential there.

If you want to experiment with this idea, get the latest Parsimonator version from GitHub, which now reports parsimony
scores, generate, say, 1000 starting trees, and then extract the 10 best ones or so.

But as Fernando says, the search space is weird, and a good starting tree in terms of parsimony score will not necessarily
generate a good ML tree.
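
The selection step of that experiment could look like the Python sketch
below. It assumes the (seed, parsimony score) pairs have already been
collected; how to obtain them from the program's output is not shown, and
the scores here are randomly generated placeholders, not real results.

```python
import heapq
import random

def best_starting_trees(scores, k=10):
    """Return the k (seed, score) pairs with the lowest parsimony score."""
    return heapq.nsmallest(k, scores.items(), key=lambda item: item[1])

# Placeholder data: 1000 seeds with made-up parsimony scores (lower is better).
rng = random.Random(0)
scores = {seed: rng.randint(5000, 5200) for seed in range(1, 1001)}

for seed, score in best_starting_trees(scores):
    print(seed, score)
```

The seeds that come out can then be reused to regenerate exactly those
starting trees for the ML searches.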

Alexis


jonas ghyselinck

Nov 19, 2012, 9:28:27 AM
to ra...@googlegroups.com
Dear,
 
Do I understand correctly that the 'Number of runs' (-#) option specifies the number of best-scoring (with respect to the highest likelihood value) parsimony trees on which an ML search will be performed?
If this is the case, I would understand that, since an ML search is performed on a larger number of starting trees, the chances of obtaining an ML tree with a high likelihood value increase...
Or should I interpret this as meaning that a consensus tree is constructed from the specified number of best parsimony starting trees, and an ML search is then performed on that consensus tree?
 
I read your publications from 2004 and 2005 on this matter, but I cannot seem to figure out the answer.
 
I'm asking because I performed two ML tree searches on a short-read 16S sequence library (280 bp) that was generated from a full-length 16S library, and likewise on the full-length library. For each library, I calculated the Pearson correlation between the patristic distances of the two ML trees inferred from it. The correlation between the two short-read trees was higher than the correlation between the two full-length trees, which surprised me. Since full-length sequences contain more phylogenetic information, I would expect a higher correlation between the two full-length trees... Could this be due to the starting trees on which the ML searches were performed?
 
Thank you in advance!
 
Cheers,
Jonas.

Alexandros Stamatakis

Nov 19, 2012, 10:54:44 AM
to ra...@googlegroups.com
Hi Jonas,

> Do I understand it correctly that by specifying the 'Number of runs' (-#)
> option, you actually specify the number of best scoring (with respect
> to the highest likelihood value) parsimony trees on which an ML search will
> be performed.

By -# you specify the number of randomized stepwise addition order
parsimony trees that will be generated. So if you specify -# 10 RAxML
will generate 10 (most probably distinct) parsimony starting trees that
will subsequently be optimized under ML.

> If this is the case, I would understand that - since an ML search is being
> performed on a higher number of starting trees - chances of obtaining an ML
> tree with high likelihood value increase...

Yes, that's our hope, because the searches for the best-known ML trees
will start at different points in the vast search space and will take
different trajectories.
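
The benefit of several starting points can be illustrated with a toy
multi-start hill climb in Python. Tree space is not a one-dimensional
curve, so this is only an analogy: the score function, the search range
and all names are made up, and nothing here reflects how RAxML searches.

```python
import math
import random

def score(x):
    """Toy multimodal 'likelihood' surface over the integers 0..199 (higher is better)."""
    return math.sin(x / 7.0) * 10 + math.sin(x / 3.0) * 4 - abs(x - 120) / 20.0

def hill_climb(start):
    """Greedy local search: step to the better neighbour until neither improves."""
    x = start
    while True:
        best = max((n for n in (x - 1, x + 1) if 0 <= n < 200), key=score)
        if score(best) <= score(x):
            return x                    # local optimum reached
        x = best

def multi_start(n_starts, seed):
    """Run one hill climb per random starting point and keep the best end point."""
    rng = random.Random(seed)
    starts = [rng.randrange(200) for _ in range(n_starts)]
    return max((hill_climb(s) for s in starts), key=score)

for n in (1, 10, 100):
    x = multi_start(n, seed=12345)
    print(n, x, round(score(x), 3))
```

With the same seed, the first starting point is shared by all three runs,
so adding starts can only keep or improve the best score found; it never
guarantees the global optimum.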

> Or do I have to interpret this that from the specified number of best
> parsimony starting trees a consensus tree is being constructed on which an
> ML search is performed.

Nope.

> I read your publications from 2004 and 2005 on this matter, but I cannot
> seem to figure out the answer.
>
> I'm asking this because I performed two ML tree searches from a short read
> 16S sequence library (280bp) that was generated from a full length 16S
> library. I calculated the pearson correlation between patristic distances
> between two ML trees generated from the same library (once for the short
> reads and once for the full length) and I saw that the correlation between
> the trees from the short read library was higher than the correlation
> between trees on the full length library; which was surprising to me.

That is indeed a bit surprising, but maybe your sample (two trees on the
full alignment and two trees on the short alignment) is just too small
and this happened by chance.

> Since
> there is more phylogenetic information contained within full length
> sequences I would expect a higher correlation between two full length
> trees...

That would be my expectation as well, assuming of course that there is
no bug in your analysis pipeline and scripts for computing patristic
distances and correlations.
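
A small reference computation can help to sanity-check such scripts. The
Python sketch below is written under stated assumptions: each tree is given
as a list of (node, node, branch length) edges, both trees have the same
leaves, and the node names and branch lengths are made up. The patristic
distance between two leaves is the sum of branch lengths on the path
between them, and the correlation is taken over all leaf pairs.

```python
from math import sqrt

def patristic_distances(edges, leaves):
    """Return {(a, b): path length} for every leaf pair a < b."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        seen = {src: 0.0}               # distance from src to every reached node
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + w
                    stack.append(nxt)
        for dst in leaves:
            if src < dst:
                dist[(src, dst)] = seen[dst]
    return dist

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def tree_correlation(edges1, edges2, leaves):
    d1 = patristic_distances(edges1, leaves)
    d2 = patristic_distances(edges2, leaves)
    pairs = sorted(d1)                  # same pair order for both trees
    return pearson([d1[p] for p in pairs], [d2[p] for p in pairs])

leaves = ["A", "B", "C", "D"]
tree1 = [("A", "x", 0.1), ("B", "x", 0.2), ("x", "y", 0.3), ("C", "y", 0.1), ("D", "y", 0.4)]
scaled = [(u, v, 2 * w) for u, v, w in tree1]          # same tree, branches doubled
tree2 = [("A", "x", 0.1), ("C", "x", 0.2), ("x", "y", 0.3), ("B", "y", 0.1), ("D", "y", 0.4)]

print(tree_correlation(tree1, scaled, leaves))  # close to 1.0: distances differ only by scale
print(tree_correlation(tree1, tree2, leaves))   # clearly lower: B and C swapped places
```

If a pipeline does not reproduce a correlation of about 1 for a tree
compared with a rescaled copy of itself, the bug is in the scripts rather
than in the trees.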

On the other hand, it may be that you just happened to subsample a
phylogenetically stable region of the 16S, such as the V6 region.

> Could this be due to the starting trees on which an ML search was
> performed?

Hard to tell. You should definitely use a larger sample, e.g., 10 ML
trees on the long alignment and 10 ML trees on the short alignment.

Cheers,

Alexis


--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org
