EPA-ng placement quality assessment

Matthew Hays

Jun 9, 2020, 4:49:49 PM

to raxml

Hello,

The group I work with is attempting to put together a quantitative quality check based on the output of epa-ng and had a few questions that we hoped could help us ensure that we are accomplishing this effectively. Our goal is to assess how well a set of reads places on a group on a phylogenetic tree. We created a script that outputs the average pendant length and likelihood weight ratio of a chosen group on a phylogenetic tree.

Our questions are:

If we are assessing the likelihood weight ratio of the best placement of a read, would the likelihood weight ratio be lowered based on how closely the nearest reference sequence is related to the reference sequence in question?

Is pendant length the best indicator of the confidence of a placement or would that be the likelihood weight ratio?

Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?

Very much appreciate everything your group does and the benefit all of it has for the scientific community.

Best,

Matt

Pierre Barbera

Jun 10, 2020, 8:29:40 AM

to ra...@googlegroups.com, Phylogenetic Placement

Hi Matt,

> If we are assessing the likelihood weight ratio of the best placement of a read, would the likelihood weight ratio be lowered based on how closely the nearest reference sequence is related to the reference sequence in question?

I'm not sure I get the question. Do you mean: is the LWR towards the best-hit reference lower if that reference also has a close neighbor in the tree? If so, the answer is "probably yes, depending on the reference data". If reference sequences are too similar to each other from the perspective of the query reads, the placement likelihoods will be very close to each other too, effectively "smearing" the LWR distribution over the similar regions of the tree.
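To make the "smearing" concrete, here is a minimal sketch of how likelihood weight ratios follow from per-branch placement likelihoods, assuming the usual definition (each branch's likelihood divided by the sum over all evaluated branches). The function name and the log-likelihood values are made up for illustration; this is not EPA-ng's own code.

```python
import math

def likelihood_weight_ratios(log_likelihoods):
    """Normalize per-branch placement log-likelihoods into LWRs.

    Subtracting the maximum before exponentiating avoids underflow,
    since placement log-likelihoods are typically large negative numbers.
    """
    best = max(log_likelihoods)
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical query: two near-identical reference branches fit equally
# well, a third branch fits much worse.
lwrs = likelihood_weight_ratios([-1000.0, -1000.0, -1020.0])
print(lwrs)
```

With two equally good branches the weight is split roughly in half between them, so the best LWR cannot exceed about 0.5 even though the read clearly belongs to that region of the tree.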

> Is pendant length the best indicator of the confidence of a placement or would that be the likelihood weight ratio?

I would say it's both, as you could theoretically have placements that have a high LWR but also a long pendant length. While that is an edge case, I have seen it happen in real data as a sort of placement long-branch attraction: a reference tree had some branches that were much longer than the average of the other branches, and these in effect swept up reads that did not have any good fit / close reference sequences in the tree. (This effect is more pronounced when using the placement heuristic that is enabled by default.)

So a low best LWR, or rather a flat LWR distribution, is the first indicator that something doesn't fit, but to be sure you should also check the pendant length on "high confidence" placements and probably have some criterion above which they should be designated as suspect.
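A two-step check along these lines could be sketched as follows, assuming the standard jplace layout (a "fields" list naming the columns of each row in "p", and read names under "n" or "nm"). The thresholds `min_lwr` and `max_pendant` and the toy data are placeholders I made up; sensible cutoffs depend on the reference tree.

```python
import json

def load_jplace(path):
    """Read a jplace file (plain JSON) into a dict."""
    with open(path) as handle:
        return json.load(handle)

def best_placements(jplace):
    """Yield (names, best LWR, pendant length of that placement) per query."""
    fields = jplace["fields"]
    i_lwr = fields.index("like_weight_ratio")
    i_pen = fields.index("pendant_length")
    for pquery in jplace["placements"]:
        # Names are stored either as "n" or as "nm" (name, multiplicity) pairs.
        names = pquery.get("n") or [nm[0] for nm in pquery.get("nm", [])]
        best = max(pquery["p"], key=lambda row: row[i_lwr])
        yield names, best[i_lwr], best[i_pen]

def flag_suspect(jplace, min_lwr=0.8, max_pendant=0.5):
    """Flag reads with a low best LWR and/or a long pendant branch."""
    flagged = []
    for names, lwr, pendant in best_placements(jplace):
        reasons = []
        if lwr < min_lwr:
            reasons.append("low best LWR")
        if pendant > max_pendant:
            reasons.append("long pendant")
        if reasons:
            flagged.extend((name, reasons) for name in names)
    return flagged

# Toy data, invented for illustration only.
example = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[3, -1200.1, 0.95, 0.01, 0.02]], "n": ["read_ok"]},
        {"p": [[5, -1300.0, 0.40, 0.02, 0.03],
               [6, -1300.2, 0.35, 0.01, 0.03]], "n": ["read_flat"]},
        {"p": [[9, -1500.0, 0.99, 0.10, 1.20]], "nm": [["read_far", 1]]},
    ],
}
flagged = flag_suspect(example)
print(flagged)
```

The third toy read is the case described above: its LWR looks confident, and only the pendant-length check catches it.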

> Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?

Yes, if the read is close to them all, either because of biology or because the read is too short / lacks distinct signal.

> Very much appreciate everything your group does and the benefit all of it has for the scientific community.

:)

Thanks for reaching out! Don't hesitate to ask more questions. And apologies for not having this information out somewhere yet, or making placement more robust in these situations, or not having a better way of evaluating placement quality out yet. These are issues I am aware of, but haven't had the time to properly address yet. I think both assessing the quality of a reference tree (specifically with regards to suitability for placement) and how to make a good one, and also how to better evaluate the raw results are understudied and could use some serious effort.

Oh, and by the way we do also have a separate google group for placement: phylogeneti...@googlegroups.com
to which I'm also posting this answer!

Happy Placement,
Pierre

--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raxml/b45ea01f-37cf-4646-8bf2-291627c6ad25o%40googlegroups.com.

-- 
MSc Pierre Barbera

Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: PD Dr. Wolfgang Müller

Grimm

Jun 10, 2020, 11:51:50 AM

to raxml

Hi Matt,

> Would having a large group of closely related reference sequences (on a tree) reduce the overall confidence of each read placement?

This depends mainly on how well-sorted the signal in such a subtree is, i.e. the proportion of phylogenetically sorted vs. stochastically distributed patterns. When the matrix is perfectly compatible with the reference tree, even a single mutation can be enough. But usually the signal in the terminal brushes is not perfectly sorted.

For such a complex real-world scenario, see our recent pre-print comparing the auto-identification capacity of EPA and simple BLAST for HTS data in oaks, targeting a nuclear intergenic spacer of low to high variability that usually has thousands of non-identical but similar copies per locus (and HTS catches most of them):

Piredda et al. (2020), Authorea. DOI: 10.22541/au.158696014.43811940

High-throughput sequencing of 5S-IGS in oaks - exploring intragenomic variation and algorithms to recognize target species in pure and mixed samples

The critical question is: How exact do you want to have your queries placed, what is the purpose of the placement?

I would argue: if your reference includes a large group of closely related sequences, why bother about confidence at all, as long as a query is always placed in that subtree? A split confidence supporting several alternative placements is only worrisome when those alternatives imply substantially different positions in the tree. For example, EPA will produce ambivalent split support for

- queries with character suites ancestral to more than one otherwise coherent clade (one can observe this e.g. when placing fossils using EPA via a morphomatrix in a molecularly defined tree)
- recombinant sequences, where bits of the sequences have different evolutionary histories (and sources)
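
One simple way to express "the query is always placed in that subtree" is to sum the LWRs of all placements that fall on branches of the group of interest. The sketch below assumes you already know which edge numbers belong to that subtree; the function name, edge numbers and LWR values are hypothetical.

```python
def lwr_within_clade(placements, clade_edges):
    """Sum the LWRs of a query's placements that fall on edges of a clade.

    placements:  iterable of (edge_num, lwr) pairs for one query
    clade_edges: set of edge numbers making up the subtree of interest
    """
    return sum(lwr for edge_num, lwr in placements if edge_num in clade_edges)

# Hypothetical query: support is split, but almost entirely within one clade.
query = [(12, 0.40), (13, 0.35), (14, 0.20), (40, 0.05)]
clade = {12, 13, 14}
accumulated = lwr_within_clade(query, clade)
print(round(accumulated, 2))
```

Here the best single LWR is only 0.40, yet nearly all of the weight sits inside the same subtree, which is the harmless kind of split support.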

Happy placing, Guido

Lucas Czech

Jun 11, 2020, 12:40:38 AM

to ra...@googlegroups.com

Hi Matt,

to chime in on Guido's suggestion of "if the different placement positions of a sequence are close to each other on the tree, then that is usually a good thing": There is a measure to quantify that, called Expected Distance between Placement Locations (EDPL), which we for example implemented here: https://github.com/lczech/gappa/wiki/Subcommand:-edpl
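
As a rough illustration of the idea behind EDPL, the sketch below weights the tree distance between every pair of placement locations of one query by the product of their LWRs. It assumes the pairwise distances have already been computed from the tree, and it sums over ordered pairs; whether gappa counts each pair once or twice is not something this sketch claims to reproduce. All names and numbers are made up.

```python
def edpl(lwrs, distances):
    """LWR-weighted sum of pairwise distances between placement locations.

    lwrs:      list of LWRs, one per placement of the query
    distances: square matrix, distances[i][j] = path length along the tree
               between placement locations i and j (zero on the diagonal)
    """
    n = len(lwrs)
    return sum(
        lwrs[i] * lwrs[j] * distances[i][j]
        for i in range(n)
        for j in range(n)
    )

# Same split support, neighbouring vs. distant locations (invented values).
near = edpl([0.5, 0.5], [[0.0, 0.02], [0.02, 0.0]])
far = edpl([0.5, 0.5], [[0.0, 1.5], [1.5, 0.0]])
print(near, far)
```

Both toy queries have the same flat LWR distribution, but only the second one spreads its weight across distant parts of the tree.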

Depending on your code base, you might also want to have a look at my library for working with placement data: https://github.com/lczech/genesis
Spoiler alert: this is shameless advertising. But it might be helpful, as several such measures are already implemented there, and most new ideas can probably be added easily. If you have any suggestions on missing features, please let me know ;-)

All the best
Lucas
