RAxML-EPA and bootstrapping


Josh Singer

Nov 25, 2016, 1:28:17 PM
to raxml
Hi, 

Could you help clarify what happens with bootstrapping in the EPA? It is mentioned in the paper "Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood":

The EPA algorithm can optionally use the nonparametric bootstrap (Felsenstein 1985) to account for uncertainty in the placement of the QS. An example for this is shown in Fig. 3. Thus, a QS might be placed repeatedly onto different edges of the RT with various levels of support. For the bootstrap procedure, we introduce additional heuristics to accelerate the insertion process. During the insertions onto the RT using the original alignment, we keep track of the insertion scores for all QS into all edges of the RT. For every QS, we can then sort the insertion edges by their scores and for each bootstrap replicate only conduct insertions for a specific QS into 10% of the best-scoring insertion edges on the RT. This reduces the number of insertion scores to be computed per QS on each bootstrap replicate by 90% and therefore approximately yields a 10-fold speedup for the bootstrapping procedure. 
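As a rough illustration of the arithmetic in the quoted paragraph, the sketch below counts how many insertion scores would be computed over a bootstrap run with and without the 10% shortlist. The function name, its parameters, and the example sizes are hypothetical and are not part of RAxML; the only facts used are that an unrooted binary RT with n taxa has 2n - 3 edges and that the heuristic keeps 10% of the edges per QS.

```python
import math

def insertion_evaluations(num_taxa, num_queries, num_replicates, keep_fraction=1.0):
    """Count insertion scores computed across all bootstrap replicates.

    Hypothetical helper for illustration only; not part of RAxML.
    """
    num_edges = 2 * num_taxa - 3  # edges in an unrooted binary reference tree
    # Each QS is only inserted into the kept fraction of edges per replicate.
    edges_per_query = math.ceil(keep_fraction * num_edges)
    return num_queries * num_replicates * edges_per_query

# Made-up sizes: 500 reference taxa, 1000 QS, 100 bootstrap replicates.
full = insertion_evaluations(500, 1000, 100)
reduced = insertion_evaluations(500, 1000, 100, keep_fraction=0.1)
print(full / reduced)  # ratio of work without vs. with the shortlist
```

The ratio comes out close to 10, which matches the approximately 10-fold speedup claimed in the paper.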

How do we specify on the command line to use bootstrapping with the EPA?

How do we specify the proportion of insertion branches where bootstrapping is applied (the 10% mentioned above)?

What effect does bootstrapping have on the EPA results? Are the bootstrapping results output somewhere? Do they act as a filter on the returned placements, or guide them somehow?

I also noticed this option in the manual:
-G enable the ML-based evolutionary placement algorithm heuristics by specifying a threshold value (fraction of insertion branches to be evaluated using thorough insertions under ML).

Just to clarify the meaning of this:
-- It doesn't have anything to do with the bootstrapping mentioned above, correct?
-- If I understand correctly, all insertion branches will be evaluated using the "fast" method. Then the top 10% under this method (-G 0.1) will be evaluated using the "slow" / "thorough" method. Is that correct? 

Thanks in advance!

Josh S.

 

Alexandros Stamatakis

Nov 25, 2016, 1:59:30 PM
to ra...@googlegroups.com
Dear Josh,

Bootstrapping in the EPA was deprecated a long time ago; the
likelihood weights work much better for obtaining support for
placements, in particular because the query sequences tend to be short.
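For readers unfamiliar with likelihood weights, here is a minimal sketch of the usual definition: the likelihood of a placement on one branch divided by the sum of the likelihoods over all candidate branches. The function name and the example log-likelihoods are made up for illustration and are not taken from RAxML output.

```python
import math

def likelihood_weight_ratios(log_likelihoods):
    """Normalize per-branch placement log-likelihoods into weights summing to 1."""
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid floating-point underflow.
    scaled = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical log-likelihoods for one query sequence on three branches.
weights = likelihood_weight_ratios([-10234.1, -10236.4, -10241.0])
print(weights)
```

A weight close to 1 on a single branch suggests a confident placement, while weight spread across several branches suggests uncertainty.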


> I also noticed this option in the manual:
>
> -G enable the ML-based evolutionary placement algorithm heuristics by
> specifying a threshold value (fraction of insertion branches to be evaluated
> using thorough insertions under ML).
>
>
> Just to clarify the meaning of this:
> -- It doesn't have anything to do with the bootstrapping mentioned
> above, correct?

Correct.

> -- If I understand correctly, all insertion branches will be evaluated
> using the "fast" method. Then the top 10% under this method (-G 0.1)
> will be evaluated using the "slow" / "thorough" method. Is that correct?

Correct again :-)
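The two-stage idea confirmed above can be sketched as follows. This is only an illustration under stated assumptions, not RAxML's implementation: `fast_score` and `thorough_score` are hypothetical callables that return a log-likelihood for inserting the query at a branch, and `fraction` plays the role of the value passed to -G. The sketch assumes a non-empty list of branches.

```python
import math

def two_stage_placement(branches, fast_score, thorough_score, fraction=0.1):
    """Score all branches cheaply, then re-score only the best fraction.

    fast_score and thorough_score are hypothetical callables that return a
    log-likelihood (higher is better) for inserting the query at a branch.
    """
    # Stage 1: rank every branch by the cheap score, best first.
    ranked = sorted(branches, key=fast_score, reverse=True)
    num_candidates = max(1, math.ceil(fraction * len(ranked)))
    candidates = ranked[:num_candidates]
    # Stage 2: only the shortlisted branches pay for the expensive evaluation.
    best = max(candidates, key=thorough_score)
    return best, candidates

# Toy example: 20 branches, with the two scores peaking at different branches.
best, candidates = two_stage_placement(
    list(range(20)),
    fast_score=lambda b: -abs(b - 7),
    thorough_score=lambda b: -abs(b - 8),
    fraction=0.25,
)
print(best, candidates)
```

The trade-off is that a branch ranked poorly by the fast score is never re-evaluated, so a smaller fraction is faster but relies more heavily on the fast score being a good proxy.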

Alexis


--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Josh Singer

Nov 28, 2016, 6:01:24 AM
to raxml
Thanks for the helpful response.
 
> Bootstrapping in the EPA was deprecated a long time ago; the
> likelihood weights work much better for obtaining support for
> placements, in particular because the query sequences tend to be short.

We're actually using the EPA to get fast placement results for sequences that may be as long as the sequences in the reference set, about 10 kb.

Would this lead to problems in terms of the placement support?

Josh

Alexandros Stamatakis

Nov 28, 2016, 6:05:36 AM
to ra...@googlegroups.com
Hi Josh,

> Thanks for the helpful response.

:-)

> Bootstrapping in the EPA was deprecated a long time ago; the
> likelihood weights work much better for obtaining support for
> placements, in particular because the query sequences tend to be short.
>
>
> We're actually using EPA to get fast placement results for sequences
> which may be (up to) equal length with the reference sequence set, about
> 10kb.

Interesting :-)

> Would this lead to problems in terms of the placement support?

I guess not: the longer the query sequences, the better the placement
signal should be, unless you have chimeric sequences in there, some
sort of lateral gene transfer, or species tree/gene tree incongruence.
Nonetheless, with such long sequences it might even be possible to detect
these effects using the placement weights.

Alexis


Josh Singer

Nov 28, 2016, 6:15:11 AM
to ra...@googlegroups.com
Thanks again, Alexis. I was indeed thinking of using the placement results to detect chimeric sequences.

Josh
