Hello,
The group I work with is putting together a quantitative quality check based on the output of epa-ng, and we have a few questions whose answers we hope will help us do this effectively. Our goal is to assess how well a set of reads places onto a group in a phylogenetic tree. We created a script that outputs the average pendant length and likelihood weight ratio for a chosen group in the tree.
Our questions are:
If we are assessing the likelihood weight ratio of the best placement of a read, would the likelihood weight ratio be lowered based on how closely the nearest reference sequence is related to the reference sequence in question?
Is pendant length the best indicator of the confidence of a placement or would that be the likelihood weight ratio?
Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?
Very much appreciate everything your group does and the benefit all of it has for the scientific community.
Best,
Matt
Hi Matt,
If we are assessing the likelihood weight ratio of the best placement of a read, would the likelihood weight ratio be lowered based on how closely the nearest reference sequence is related to the reference sequence in question?
Is pendant length the best indicator of the confidence of a placement or would that be the likelihood weight ratio?
I would say it's both, as you could theoretically have placements
that have both a high LWR and a long pendant length. While
that is an edge case, I have seen it happen in real data as a sort
of placement long-branch attraction: a reference tree had
some branches that were much longer than the average of the other
branches, and these in effect swept up reads that did not have any
good fit or close reference sequence in the tree. (This effect is more
pronounced when using the placement heuristic that is enabled by
default.)
So a low best LWR, or rather a flat LWR distribution, is the
first indicator that something doesn't fit, but to be sure you
should also check the pendant length of "high confidence"
placements and probably set some threshold above which they
are designated as suspect.
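A minimal sketch of such a two-step check, assuming the placements are in a standard jplace file with the usual `like_weight_ratio` and `pendant_length` fields. The function name and both thresholds (`min_lwr`, `max_pendant`) are hypothetical placeholders for illustration, not recommended values; sensible cutoffs depend on the reference tree and its branch lengths.

```python
import json

def flag_suspect_placements(jplace_text, min_lwr=0.5, max_pendant=0.1):
    """Return {read_name: reason} for reads whose best placement looks suspect.

    min_lwr and max_pendant are hypothetical thresholds, not recommendations.
    """
    data = json.loads(jplace_text)
    fields = data["fields"]
    lwr_idx = fields.index("like_weight_ratio")
    pen_idx = fields.index("pendant_length")

    suspects = {}
    for entry in data["placements"]:
        # Best placement = the row with the highest likelihood weight ratio.
        best = max(entry["p"], key=lambda row: row[lwr_idx])
        if best[lwr_idx] < min_lwr:
            reason = "low best LWR"
        elif best[pen_idx] > max_pendant:
            reason = "high LWR but long pendant length"
        else:
            continue
        # Names are stored under "n" (plain names) or "nm" (name, multiplicity).
        if "n" in entry:
            names = entry["n"]
            if isinstance(names, str):
                names = [names]
        else:
            names = [pair[0] for pair in entry.get("nm", [])]
        for name in names:
            suspects[name] = reason
    return suspects
```

It could be called as `flag_suspect_placements(open("result.jplace").read())`, where the file name is again only an example. Reads that pass the LWR check but fail the pendant-length check are the "swept up by a long branch" cases described above.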
Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?
Yes, if the read is close to them all, either because of biology or because the read is too short or lacks distinct signal. In that case the likelihood weight is split among the similar branches, so the best LWR drops.
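To illustrate why this happens: the LWR of a branch is its placement likelihood divided by the sum of the likelihoods over the candidate branches, so several near-identical branches have to share the weight. The sketch below is a toy calculation with made-up log-likelihood values, not output from epa-ng, and the function name is hypothetical.

```python
import math

def likelihood_weight_ratios(log_likelihoods):
    """Normalize per-branch log-likelihoods into likelihood weight ratios."""
    top = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(ll - top) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

# One clearly best branch: almost all of the weight lands on it.
distinct = likelihood_weight_ratios([-1000.0, -1005.0, -1006.0, -1007.0])

# Four near-identical branches: the best LWR ends up close to 1/4.
similar = likelihood_weight_ratios([-1000.0, -1000.1, -1000.1, -1000.2])
```

In the second case the placement is not wrong, only spread over a clade of similar references, which is why the best LWR alone can look worse than the placement really is.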
Very much appreciate everything your group does and the benefit all of it has for the scientific community.
:)
Thanks for reaching out! Don't hesitate to ask more questions. And apologies for not having this information published somewhere yet, for not having made placement more robust in these situations, and for not yet offering a better way of evaluating placement quality. These are issues I am aware of but haven't had the time to address properly. I think that assessing the quality of a reference tree (specifically with regard to its suitability for placement), how to build a good one, and how to better evaluate the raw results are all understudied and could use some serious effort.
Oh, and by the way we do also have a separate google group for
placement: phylogeneti...@googlegroups.com
to which I'm also posting this answer!
Happy Placement,
Pierre
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raxml/b45ea01f-37cf-4646-8bf2-291627c6ad25o%40googlegroups.com.
--
MSc Pierre Barbera
Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org
HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: PD Dr. Wolfgang Müller
Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?
Hi Matt,
To chime in on Guido's suggestion that "if the different placement positions of a sequence are close to each other on the tree, then that is usually a good thing": there is a measure that quantifies this, called the Expected Distance between Placement Locations (EDPL), which is implemented, for example, in gappa: https://github.com/lczech/gappa/wiki/Subcommand:-edpl
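As a rough illustration of the idea, one common formulation of the EDPL for a single read is the sum over all pairs of its placements of the product of their LWRs times the distance between the two locations on the tree. The sketch below assumes the pairwise distances are already available as a matrix, which is a simplification; real implementations compute them from the reference tree and the position of each placement on its branch. The function name and numbers are made up.

```python
def edpl(lwrs, distances):
    """Expected distance between placement locations for one read.

    lwrs[i] is the LWR of placement i; distances[i][j] is the path length
    on the reference tree between placement locations i and j
    (assumed to be precomputed here).
    """
    n = len(lwrs)
    return sum(
        lwrs[i] * lwrs[j] * distances[i][j]
        for i in range(n)
        for j in range(n)
    )

# Flat LWRs on neighboring branches: uncertain, but only locally.
local = edpl([0.5, 0.5], [[0.0, 0.02], [0.02, 0.0]])

# The same LWRs on distant branches: uncertain across the tree.
spread = edpl([0.5, 0.5], [[0.0, 1.5], [1.5, 0.0]])
```

Both reads have the same best LWR of 0.5, but the first has a small EDPL and the second a large one, so the measure separates local uncertainty from placements scattered across the tree.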
Depending on your code base, you might also want to have a look
at my library for working with placement data: https://github.com/lczech/genesis
Spoiler alert: this is shameless advertising. But it might be
helpful, as several such measures are already implemented there,
and most new ideas can probably be added easily. If you have any
suggestions on missing features, please let me know ;-)
All the best
Lucas