unintuitive clade assignment in gappa extract

Grace Pold

Sep 9, 2021, 8:58:45 AM

to Phylogenetic Placement

Hello,

I was looking at my placement outputs and realized that if I simply provide an in/outgroup clade list, gappa extract places the "basal" outgroup reads in the ingroup.

For instance, in the tree (link here), all blue tips correspond to the ingroup and all red tips to the outgroup. However, during clade extraction, all placements falling on the orange/brown edge are assigned to the ingroup (i.e., counted as blue) rather than to the outgroup or marked as uncertain. Is there a way to change this so the clade assignment only goes as far back as the root? I know I can add a second outgroup coming off the ingroup in my reference tree (as depicted by the dashed line), and this will correct it. But I'm trying to avoid re-running all the placements if possible.

Thanks so much!

Lucas Czech

Sep 10, 2021, 5:11:01 PM

to Phylogenetic Placement

Hey Grace,

ah yes, I see the issue there. The reason gappa behaves that way is that phylogenetically, the orange/brown edge is (in a sense) the same edge as the gray one on the other side of the root. That is, the tree is "rooted" only in the sense that there exists a bifurcating top node in the tree, but the two branches leading away from the root are in fact one branch. The likelihood model that is used in ML tree inference and in phylogenetic placement does not consider the separation that is induced by the root node. Hence, I'm hesitant to just add an option as you suggest.

However, as an idea to solve this use case: I could add an option to select whether the branch connecting a clade to the rest of the tree is considered part of that clade or not. Then, the orange/brown branch could be excluded, I think. I'd have to play around with that for a bit to see what a straightforward solution would look like. Could you maybe send me one of the jplace files where this case occurs?

Thanks and so long
Lucas

PS: Kudos! Your question is really well phrased and clear, and you outlined potential solutions already! There should be more people that thorough when asking for user support ;-)

--
You received this message because you are subscribed to the Google Groups "Phylogenetic Placement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to phylogenetic-plac...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/phylogenetic-placement/e14bedf4-2b7a-4c7f-9a94-d629b126d38dn%40googlegroups.com.

Pierre Barbera

Sep 13, 2021, 6:06:47 AM

to phylogeneti...@googlegroups.com

Hi Grace,

as Lucas points out, this is a bit of an edge case from the point of view of the methods. The way EPA-ng handles this is that it unroots the tree, performs the placement, then maps the placements back onto the rooted tree. For the branch that used to be split by the root node, it decides which side of the root to map to based on the distal length (the attachment point). Especially with queries that don't have a strong "pull" toward the tree and don't place so well, it could be the case that they attach high up near the root like that. In these cases the signal is also often low, which makes it more likely that they get "misassigned" to the outgroup.
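To make the mapping step concrete, here is a toy sketch in Python. It is not EPA-ng's code: the function name `map_to_rooted`, the "left"/"right" labels, and the convention of measuring the position from the left child's end of the merged branch are all assumptions made for illustration. It only shows that a single attachment position on the merged branch determines which side of the root a placement ends up on.

```python
def map_to_rooted(position, left_length, right_length):
    """Toy model: map a point on the merged root branch back to one side.

    `position` is measured from the left child's end of the merged branch
    (hypothetical convention, not necessarily the one EPA-ng uses).
    Returns (side, distance_from_that_side's_child_node).
    """
    total = left_length + right_length
    if not 0.0 <= position <= total:
        raise ValueError("position must lie on the merged branch")
    if position <= left_length:
        # The attachment point is still below the root on the left side.
        return ("left", position)
    # Otherwise it has crossed the root and belongs to the right side.
    return ("right", total - position)


# A query attached just past the root lands on the "right" branch,
# even though it is almost equally close to the "left" one.
near_root = map_to_rooted(0.26, left_length=0.25, right_length=0.5)
```

In this toy model a tiny shift in the attachment point near the root flips the assigned side, which is one way to picture why low-signal queries near the root are assigned unstably.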

So here is what I would do:

1) for those queries, check how well they actually placed. One strong indication is how high the highest LWR (third number in the output) is; if it's low, then there's probably not enough signal to place it confidently. Another indication is the pendant length, which is the distance of the placed query to the branch where it was placed. If it's comparatively high (as in, as large as the diameter of the tree or larger), then it's similarly bad.

2) your tree looks very small; it could be that this issue improves/goes away with a more comprehensive reference tree. The idea there is that more references -> clearer signal -> stronger placement, and stronger toward the leaves rather than the basal branches.
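The check in point 1 could be sketched roughly as follows in Python. This assumes the standard jplace layout (a `fields` list plus `placements` entries holding `p` rows and either `n` or `nm` names). The function names and the thresholds `min_lwr` and `max_pendant` are hypothetical placeholders; a sensible pendant cutoff depends on your tree (for example, its diameter).

```python
import json


def load_jplace(path):
    """Read a .jplace file (plain JSON) into a dict."""
    with open(path) as handle:
        return json.load(handle)


def summarize_best_placements(jplace):
    """Return one row per query name, describing its highest-LWR placement."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    pend_i = fields.index("pendant_length")
    rows = []
    for pquery in jplace["placements"]:
        best = max(pquery["p"], key=lambda p: p[lwr_i])
        # Names come either as "n" (list of names) or "nm" (name, multiplicity).
        names = pquery.get("n") or [pair[0] for pair in pquery.get("nm", [])]
        for name in names:
            rows.append({
                "name": name,
                "edge_num": best[edge_i],
                "best_lwr": best[lwr_i],
                "pendant_length": best[pend_i],
            })
    return rows


def flag_weak(rows, min_lwr, max_pendant):
    """Names whose best placement has a low LWR or a long pendant branch."""
    return [r["name"] for r in rows
            if r["best_lwr"] < min_lwr or r["pendant_length"] > max_pendant]


# Made-up example data, only to show the shape of the input.
example_jplace = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[3, -1000.0, 0.95, 0.01, 0.02], [4, -1003.0, 0.05, 0.02, 0.03]],
         "n": ["read_A"]},
        {"p": [[1, -1200.0, 0.2, 0.1, 2.5], [17, -1200.1, 0.18, 0.05, 2.4]],
         "nm": [["read_B", 1]]},
    ],
}
rows = summarize_best_placements(example_jplace)
weak = flag_weak(rows, min_lwr=0.5, max_pendant=1.0)
```

For a real file, `summarize_best_placements(load_jplace("placements.jplace"))` would give the same kind of table, which can then be inspected before choosing any cutoffs.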

Thank you for the great question, as Lucas said! Please keep asking as this is helping to improve the tools and the overall workflow of doing placement.

Happy Placement,
Pierre

Grace Pold

Sep 13, 2021, 12:21:43 PM

to Pierre Barbera, Phylogenetic Placement

Hi Pierre,

Thank you! I realize I probably should have initially replied to Lucas on-list rather than directly.

Sorry for the confusion. The tree I am actually using is much bigger (~1700 tips)...I just drew a smaller tree for the post so it would be easier to see the problem.

You are correct that many of the placements are very uncertain in the tree, which is a whole other, very interesting question once you remove the technical reasons for it. However, I think in this case the issue is how the tree is split and how that branch going back to the root is interpreted, rather than the placements themselves.

What I am trying to do is split the tree into the ingroup and the outgroup so that the reads which are placed confidently in the ingroup are kept and counted and those on the outgroup are not. I am happy to discard anything which isn't placed decisively inside the ingroup (whether by center of mass across all placements for a read or single best placement).

So, are you suggesting just to discard all reads where the pendant branch length for the placements is long and/or do a filtering by edge number in the .jplace file (since I know which edge number the one leading to the outgroup is)? Because I think that could work for the first analysis I need to do with just ingroup counts rather than ordinations...

Thanks!

Grace

Pierre Barbera

Sep 15, 2021, 3:02:08 AM

to phylogeneti...@googlegroups.com

Hi Grace,

as for the pendant length, it's a bit tricky. I can't give definitive answers there since I'm not aware of any previous attempts at classifying them into "close enough" and "too long". I do think it would be reasonable to exclude queries that have very long pendant lengths, though I would be conservative there.

As for filtering based on LWR, center of mass sounds like a good approach. However, it can be deceiving if the LWR distribution is perfectly flat (uniform across the tree), in which case you might not even see that in the output, since by default (for now) epa-ng only outputs a maximum of 7 placements per query. The latter can be fixed by running with "--filter-acc-lwr 0.95 --filter-max 10000" or similar settings. One of the next releases will overhaul that behaviour to give more reasonable feedback/output in these situations.
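One way to spot the flat-distribution case from the output is sketched below in Python. It is an illustration rather than anything epa-ng or gappa provides: `lwr_flatness` is a hypothetical helper that reports how much LWR mass the listed placements cover and how evenly that mass is spread (normalized Shannon entropy, where 1.0 means perfectly uniform among the reported placements).

```python
import math


def lwr_flatness(lwrs):
    """Return (reported_mass, normalized_entropy) for one query's LWR values.

    reported_mass well below 1 suggests the output was truncated and the
    remaining mass sits on branches that were not written out.
    normalized_entropy near 1 means the reported placements are about
    equally likely, i.e. there is no clear best placement.
    """
    total = sum(lwrs)
    if total <= 0 or len(lwrs) < 2:
        return total, 0.0
    probs = [value / total for value in lwrs if value > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return total, entropy / math.log(len(lwrs))


# Made-up values: seven equally weak placements vs. one clear winner.
flat_mass, flat_entropy = lwr_flatness([0.01] * 7)
peaked_mass, peaked_entropy = lwr_flatness([0.97, 0.02, 0.01])
```

A query with low reported mass and entropy close to 1 is the situation described above, where a center of mass computed from only the reported placements would be misleading.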

Filtering out the specific branches could also be a temporary fix to your issue as you mentioned. First I would try a general filtering, and then apply gappa extract again.

Let us know how it goes!

Pierre

Grace Pold

Sep 15, 2021, 9:20:03 AM

to Pierre Barbera, Phylogenetic Placement

Hi Pierre,

Thank you!

I decided that queries with 80% cumulative mass in the outgroup for the top 7 placements should not be considered part of the ingroup. So this is what I did for a temporary work-around:

#1. Determine where cumulative LWR is 0.8

gappa edit accumulate --jplace-path placements.jplace --threshold 0.8 --out-dir ./ --file-prefix acc0.8_

#2: Convert this file to a csv
guppy to_csv -o acc0.8_accumulated.csv --out-dir ./ --point-mass acc0.8_accumulated.jplace

#3: Prepare a file for splitting the jplace files by extracting a list of reads belonging in the ingroup (in R). This generates a list of reads where the 0.8 cumulative mass is NOT in the outgroup.

allhits<-read.delim("acc0.8_accumulated.csv", sep=",")

outgroup_edgelist<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

goodhits<-subset(allhits, !edge_num%in%outgroup_edgelist)$name
towrite<-data.frame(goodhits, rep("ingroup", length(goodhits)))
write.table(towrite, "acc_0.8Ingroup.csv", sep=",",col.names=FALSE, row.names=FALSE)

#4. Split the jplace file to exclude the reads hitting the outgroup:
gappa edit split --jplace-path ./acc0.8_accumulated.jplace --split-file acc_0.8Ingroup.csv --out-dir . --file-prefix acc0.8_NoOutgroup_

#(or for the original jplace file with all placements)

gappa edit split --jplace-path placements.jplace --split-file acc_0.8Ingroup.csv --out-dir . --file-prefix NoOutgroup_

This successfully gives all reads "definitively" within the ingroup based on the criterion outlined at the top. It also rejects any reads with placements on either side of the root in the accumulate step. I think this is OK for the moment.
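For comparison, a simpler variant of this filter can be sketched directly on the jplace data in Python. Note that this is not what `gappa edit accumulate` does: instead of accumulating mass into basal branches, it just sums each query's LWR over the ingroup edges and keeps the query if that sum reaches a threshold. The function name `ingroup_queries` and the example edge numbers are hypothetical; the ambiguous root-adjacent edge would need to be listed in `outgroup_edges` to be excluded.

```python
def ingroup_queries(jplace, outgroup_edges, min_ingroup_mass=0.8):
    """Names of queries whose summed LWR outside `outgroup_edges` is large enough."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    outgroup = set(outgroup_edges)
    keep = []
    for pquery in jplace["placements"]:
        # Sum the LWR of every reported placement that is not on an outgroup edge.
        ingroup_mass = sum(p[lwr_i] for p in pquery["p"]
                           if p[edge_i] not in outgroup)
        if ingroup_mass >= min_ingroup_mass:
            # Names come either as "n" (list) or "nm" (name, multiplicity).
            names = pquery.get("n") or [pair[0] for pair in pquery.get("nm", [])]
            keep.extend(names)
    return keep


# Made-up example: edges 1 to 16 play the role of the outgroup.
example_jplace = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[20, -900.0, 0.9, 0.01, 0.02], [2, -903.0, 0.05, 0.01, 0.02]],
         "n": ["read_in"]},
        {"p": [[5, -950.0, 0.7, 0.02, 0.05], [21, -951.0, 0.3, 0.02, 0.05]],
         "n": ["read_out"]},
        {"p": [[16, -980.0, 0.5, 0.03, 0.4], [22, -980.0, 0.5, 0.03, 0.4]],
         "nm": [["read_split", 2]]},
    ],
}
kept = ingroup_queries(example_jplace, outgroup_edges=range(1, 17))
```

Because the mass is not renormalized, a query whose reported placements cover little of the total LWR is dropped as well, which is on the conservative side.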

Best,

Grace
