# Constraint tree construction strategy

### rutge...@naturalis.nl

Dec 26, 2023, 5:16:03 AM
to raxml
Hi all,

I'm wondering what the right strategy is for constraint tree construction in the following case:
1. I have a set of taxa A, B, C, D, E, F, O1, O2, O3.
2. I know that A..F are in the ingroup, and O1..O3 are not
3. O1..O3 are not necessarily monophyletic themselves with respect to the ingroup
4. I have a topology for A, C, E and F, for example (((A,C),E),F);
I want to accomplish the following:
• B and D should be free to graft anywhere in the ingroup from 4.
• The outgroups should be free to form a ladder leading to the MRCA of the ingroup
For example, the following should be legal:

((((((((A,B),C),D),E),F)n1,O1),O2),O3);

How do I accomplish this? If I specify the known topology from 4. and I stay agnostic about B and D, then the ingroup topology should work out. If I set O1..O3 as outgroups, I expect the tree to be rooted on n1. (Right?) I could live with that because I only care about the ingroup anyway, but I guess it would output the outgroups as monophyletic, at least in terms of how it's drawn. Or is there another way to specify the monophyly of the ingroup? E.g. by adding B and D as a polytomy at the root of the ingroup from 4.?

Thanks and happy holidays!

Rutger

### Grimm

Jan 5, 2024, 3:45:25 AM
to raxml
Hi Rutger,

while RAxML allows for multifurcating constraints I don't think it can do nested, gold tree-like constraints.

Defining an outgroup only roots the tree for post-inference interpretation; it has no effect on the topology optimisation. To be able to root a tree with a multi-tip outgroup, we assume that in- and outgroup are reciprocally monophyletic, i.e. the tree has a branch (internode) / supports a taxon bipartition ingroup | outgroup. To enforce this, we would in your example use the following constraint:
(A,B,C,D,E,F) [or (O1,O2,O3)]
If the outgroup taxa don't have that quality and, in an unconstrained analysis, some of them mingle with part of the ingroup, the outgroup cannot be used to root the ingroup tree. Similarly, any outgroup sample that changes ingroup relationships may be problematic.
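A minimal sketch of how the ingroup | outgroup bipartition could be written out as a multifurcating Newick constraint. The helper name `bipartition_constraint` is made up for illustration. The ingroup is left as one unresolved clade and the outgroup tips hang unresolved off the root, so B and D stay free inside the ingroup and O1..O3 are free to form a ladder. How the string is handed to RAxML (classic RAxML documents `-g` for multifurcating constraint trees) should be checked against the manual of the version in use.

```python
def bipartition_constraint(ingroup, outgroup):
    """Return a Newick string: ingroup as one unresolved clade, outgroup tips unresolved at the root."""
    if set(ingroup) & set(outgroup):
        raise ValueError("a taxon cannot be in both the ingroup and the outgroup")
    clade = "(" + ",".join(ingroup) + ")"
    return "(" + ",".join([clade] + list(outgroup)) + ");"


# Taxon names taken from the example in this thread.
constraint = bipartition_constraint(list("ABCDEF"), ["O1", "O2", "O3"])
print(constraint)  # ((A,B,C,D,E,F),O1,O2,O3);
```

Note that this only enforces ingroup monophyly; it does not encode the known (((A,C),E),F) backbone.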

But the question I have is why you want to enforce a gold tree [ (((A,C),E),F) ] during analysis at all?

If your primary interest is to test where B and D connect to the already known backbone topology, you can just use the "evolutionary placement algorithm" (EPA), reading in the known topology as guide tree and testing any tip to be placed as a query.
For the example, we read in (((A,C),E),F)Ox as reference tree, an alignment including all tips (i.e. incl. B and D as well) and test B and D as queries. EPA will then give you a ML score for where B and D best-place (and 2nd-, 3rd-best...) within this reference tree.
Since you want to have your tree rooted, including the outgroup closest to the ingroup will be enough to see if B or D would be early branching sisters or nested within the ingroup. Note that ML has a 50:50 chance to escape long-branch attraction, i.e. a long branching ingroup query may be attracted by a too distant outgroup, or outgroups.

EPA is implemented in classic RAxML but also available as stand-alone tool for very large query numbers: EPA-ng

If your data doesn't support the gold tree well enough (we used EPA, for instance, to place fossils within a molecular-based tree; morphological data usually reflect phylogenetic relationships imperfectly), but you want to keep it, e.g. for trait mapping purposes, and add B and D, EPA would also be the tool to place B and D within your gold tree. You then just take this topology, feed it into RAxML and optimise only the branch lengths and model(s).
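As a rough illustration of the data preparation for such a placement run, the sketch below splits one alignment into a reference part (tips present in the guide tree) and a query part (tips to be placed). The function names and the toy sequences are invented for this example. EPA-ng takes the reference and query alignments as separate files; the exact option names are worth checking against `epa-ng --help`.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_reference_query(records, reference_tips):
    """Split aligned records into (reference, query) dicts by guide-tree membership."""
    missing = set(reference_tips) - set(records)
    if missing:
        raise ValueError(f"guide-tree tips missing from alignment: {sorted(missing)}")
    reference = {n: s for n, s in records.items() if n in reference_tips}
    query = {n: s for n, s in records.items() if n not in reference_tips}
    return reference, query


def to_fasta(records):
    """Serialise a name -> sequence dict back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Toy alignment (made-up sequences); the guide tree is (((A,C),E),F) plus one outgroup O1.
toy = ">A\nACGT\n>B\nACGA\n>C\nACGT\n>D\nACTA\n>E\nAGTA\n>F\nGGTA\n>O1\nGGCC\n"
reference, query = split_reference_query(read_fasta(toy), {"A", "C", "E", "F", "O1"})
print(sorted(query))  # ['B', 'D']
```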

Cheers, Guido.

### Rutger Vos

Jan 7, 2024, 1:46:31 AM
Hi Guido,

thanks very much for your detailed reply, and best wishes for 2024.

> But the question I have is why you want to enforce a gold tree [ (((A,C),E),F) ] during analysis at all?

Basically, I want the branch lengths for the tree for A..F.

I have partial topologies from OpenTOL and aligned COI barcodes from BOLD, hence my question. (As an aside: I want to do this for all families in the BOLD taxonomy, so I'm sure I'll run into many edge cases...)

> If your primary interest is to test where B and D connect to the already known backbone topology, you can just use the "evolutionary placement algorithm" (EPA), reading in the known topology as guide tree and testing any tip to be placed as a query.
> For the example, we read in (((A,C),E),F)Ox as reference tree, an alignment including all tips (i.e. incl. B and D as well) and test B and D as queries. EPA will then give you a ML score for where B and D best-place (and 2nd-, 3rd-best...) within this reference tree.
> Since you want to have your tree rooted, including the outgroup closest to the ingroup will be enough to see if B or D would be early branching sisters or nested within the ingroup. Note that ML has a 50:50 chance to escape long-branch attraction, i.e. a long branching ingroup query may be attracted by a too distant outgroup, or outgroups.
>
> EPA is implemented in classic RAxML but also available as stand-alone tool for very large query numbers: EPA-ng
>
> If your data doesn't support the gold tree well enough (we used EPA, for instance, to place fossils within a molecular-based tree; morphological data usually reflect phylogenetic relationships imperfectly), but you want to keep it, e.g. for trait mapping purposes, and add B and D, EPA would also be the tool to place B and D within your gold tree. You then just take this topology, feed it into RAxML and optimise only the branch lengths and model(s).

Mmm... interesting! I guess I would have to do two things separately? I.e. first place all the unplaced tips to their ML locations, and then estimate the branch lengths on the resulting tree. Your point about LBA is well taken: I'm sure there are going to be many weird situations, including constraint trees that are obviously not good for the underlying alignment. I'll burn that bridge when I get to it.

Cheers,

Rutger

### Grimm

Jan 8, 2024, 6:42:04 AM
to raxml
Hi Rutger,
there may be an intermediate filtering step needed, particularly since you are primarily interested in branch lengths. The protocol for step 1 could look like this:
1. Extract the guide tree topologies for all families. I would take a well-hung published genus tree for that; the more data used for that tree, the better. An alternative may be a large tree based on an oligo-gene or multi-gene matrix that includes a coxI partition. Root the family reference trees based on the status quo when exporting them to NEWICK.
2. Align the coxI data per family (no need for outgroups at this step). If you have the coxI partition from a (or several) superfamily or order-level tree(s), I suggest using mafft to just align all other data into the reference partition.
3. Use EPA-ng to place all tips not in the guide-tree/ref matrix.
Generating the emended topology for the second step may have some hiccups to keep in mind.
The jPlace file will typically include a lot of queries that have very high likelihood weight ratios (LWR) with tips or internal branches in the reference tree. This is because they
• belong to the same tip taxon (species or genus, depending on how broadly sampled the reference tree was and how differentiated the taxa are in this barcode), or
• are unambiguous sibling(s) of this tip/subtree
E.g. if B is a sister to A and coxI has the capacity to resolve this, then all B accessions would have a high (near-)unambiguous LWR for A, while that for another A tip would always be ~1. If B is a sister of A + C, then the B queries would have high LWR for the A+C root branch, while the A would still have LWR ~1 for A naturally.

In addition, you'll probably have queries with split LWRs ("wildcards" or "rogues"). These can be
• resolution issues: e.g. if A and C are too close sequentially or imbalanced (short A tip, long C tip), B sisters, as well as A and C tips, may have split LWR support for the other branches of the A-B-C subtree. E.g. a less drifted A could split its support as best-hit A tip, second-best hit C tip, while a much drifted A may be more attracted by the A+C root due to a sort of local LBA. The Bs may just split their support more or less equally between the A tip, the C tip or the A+C root.
• what I like to call "cousin lineages": members of the same clade that are too unique (genetically) to be attracted by a definite branch within a subtree. A classic are lineages diverging from fast ancient radiations that annoyingly survived but did not partake in the later radiations within the lineage. They often turn out to be phylogenetic rogues.
• genetic fossils: taxa that have not evolved their coxI as much as their siblings. I have never worked with animals except for foraminifers, but in plants it's quite common that members of some species/genera carry a barcode that is, genetically-evolutionarily speaking, ancestral to those of their next relatives (sister species/genera), because those relatives underwent substantially more genetic drift. Our standard phylogenetic trees cannot place ancestors, as they would need to be optimised at the nodes, so the LWR of an ancestral barcode is typically split between different branches. In the case of a perfect ancestor and data situation, the split LWR profile would include all the branches (internodes) joining in that node. E.g. if B had the ancestral barcode of A and C, the LWR could split at roughly 0.33 each across the A tip, the C tip and the A+C root (we actually had and have cases like that in our EPA-ed samples).
That is, you may have difficulties taking the EPA result directly and would need an intermediate tip filtering step, maybe a second-level check-up tree inference, to make a decision on the final topology to optimise for the branch lengths.
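The filtering step could start from something like the sketch below, which reads the placement records of a jplace document and flags each query as "unambiguous" or "split" by its best LWR. It assumes the standard jplace layout (a `fields` list naming the columns, and `placements` entries with a `p` table plus names under `n` or `nm`). The 0.9 threshold, the function name and the toy numbers are arbitrary choices for illustration, not recommendations.

```python
import json


def classify_placements(jplace, threshold=0.9):
    """Map each query name to (label, [(edge_num, lwr), ...]) sorted by descending LWR."""
    lwr_col = jplace["fields"].index("like_weight_ratio")
    edge_col = jplace["fields"].index("edge_num")
    result = {}
    for placement in jplace["placements"]:
        rows = sorted(placement["p"], key=lambda row: row[lwr_col], reverse=True)
        profile = [(row[edge_col], row[lwr_col]) for row in rows]
        label = "unambiguous" if profile[0][1] >= threshold else "split"
        # Names are stored either as "n" (plain) or "nm" (name, multiplicity) pairs.
        names = placement.get("n") or [pair[0] for pair in placement.get("nm", [])]
        for name in names:
            result[name] = (label, profile)
    return result


# Toy jplace content (invented values); a real file would be read with json.load(open(path)).
toy = json.loads("""
{
  "fields": ["edge_num", "likelihood", "like_weight_ratio", "distal_length", "pendant_length"],
  "placements": [
    {"n": ["A_accession2"], "p": [[0, -90.0, 0.99, 0.01, 0.001]]},
    {"n": ["B_accession1"], "p": [[1, -100.1, 0.33, 0.02, 0.01],
                                  [0, -100.0, 0.34, 0.01, 0.01],
                                  [2, -100.2, 0.33, 0.00, 0.01]]}
  ]
}
""")

for name, (label, profile) in classify_placements(toy).items():
    print(name, label, profile)
```

Queries flagged as "split" are the candidates for the second-level check-up described above; the full LWR profile is kept so that cases like the three-way ~0.33 split around a node can be recognised.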

One example: if e.g. B and D are the sister and next-sister of A+C and coxI supports an A-B-C-D clade, both B and D (as queries only) will get high LWRs with the A+C root (in the reference tree) during EPA. If B is genetically less drifted, D is much drifted, and any other reference tip is substantially distinct from A and C, then B may have a somewhat split LWR between the A+C root and either or both of the A and C tips, being a sister and genetically closest to the A-B-C-D ancestor. D queries, however, may split their LWR between the A+C root and the sister clade to A+C because of possible inclade-outclade LBA, irrespective of whether the dichotomous evolution sequence is D + (B + (A+C)) or B + (D + (A+C)). In that case, you would have to run a tree with one or two B and D representative(s) and an outgroup sample as large as possible to make a call for either one (I would just go for networks to make that decision; however, it's much harder to publish). Or just optimise the branch lengths on both alternatives.
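For the "both alternatives" route, the two candidate topologies can be written out mechanically. This is a small sketch with a hypothetical helper, `graft_ladder`, that attaches tips one after another as successive sisters to a core clade; the surrounding `((...,E),F)` backbone is taken from the example earlier in the thread.

```python
from itertools import permutations


def graft_ladder(core, sisters):
    """Attach each tip in turn as sister to the clade built so far."""
    clade = core
    for tip in sisters:
        clade = f"({clade},{tip})"
    return clade


# Each ordering of B and D gives one alternative: the first tip is the closer sister of A+C.
alternatives = [
    f"(({graft_ladder('(A,C)', order)},E),F);" for order in permutations(["B", "D"])
]
for newick in alternatives:
    print(newick)
# (((((A,C),B),D),E),F);
# (((((A,C),D),B),E),F);
```

Each string can then be used as a fixed topology for a branch-length and model optimisation run, and the resulting likelihoods compared.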

Good placing, Guido.