Influence of initial partitions on topology and branch lengths

Jacopo Martelossi

Jun 14, 2019, 6:17:49 AM
to IQ-TREE
Dear IQ-tree community,

I am an MSc student working on an insect phylogeny of about 200 species using 8 genes, including mitochondrial and nuclear PCGs and rRNAs. The concatenated alignment was trimmed with Gblocks and thus reduced to about 4 kbp. I am observing some degree of saturation using DAMBE and a lot of heterogeneity in base composition using AliGROOVE.

While testing different initial partitioning schemes, I observed that they influence the topology and branch lengths of the tree quite strongly.

The initial partitioning schemes I have been using are the ones I have seen commonly used:

(a) all genes separately, resulting in 8 initial partitions.

(b) all codon positions separately + rRNA, resulting in 4 initial partitions.

(c) each codon position of each PCG + each rRNA separately, resulting in 28 partitions.

(d) Moreover, I also tested the GHOST model with 4 classes.

My question is: how can I decide which is the "correct" - if such a thing exists! - initial partitioning scheme? Until now I observed the average nodal support of each phylogeny (UFBoot) and I did likelihood mapping, to assess how clearly each analysis resolves the quartets.

Does it make sense? Can I compare the likelihood values of the trees, and can I use a different test to focus on a single topology?

Thank you all in advance for the support,

Jacopo


PS: does it make sense to use concordance factors for only 8 genes?

Minh Bui

Jun 25, 2019, 11:03:58 AM
to IQ-TREE, Jacopo Martelossi
Hi Jacopo,

Sorry for the delay, answers below:

On 14 Jun 2019, at 6:17 am, Jacopo Martelossi <jacopo.m...@gmail.com> wrote:

Dear IQ-tree community,

I am an MSc student working on an insect phylogeny of about 200 species using 8 genes, including mitochondrial and nuclear PCGs and rRNAs. The concatenated alignment was trimmed with Gblocks and thus reduced to about 4 kbp. I am observing some degree of saturation using DAMBE and a lot of heterogeneity in base composition using AliGROOVE.

While testing different initial partitioning schemes, I observed that they influence the topology and branch lengths of the tree quite strongly.

My first suggestion is to look at the alignment carefully. Make sure to minimise alignment errors. Are the sequences orthologous? (Paralogs, for example, are quite problematic.) This is perhaps the most important step. Wrong alignments may heavily influence the results.

Let’s assume for now that the alignments are OK.

Given this observation and the saturation issue, I’m wondering what happens if you don’t use Gblocks. How many sites did Gblocks remove? You can check whether the problem is reduced when you have more sites in the alignment.

You can also try to reduce the number of sequences. For example, you can look at the composition test performed at the beginning of the IQ-TREE run, and perhaps remove those sequences with very low p-values, which do not fit the alignment in terms of sequence composition.
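To get a feel for what such a composition check measures, here is a minimal Python sketch. It is not IQ-TREE's own code: it simply compares each sequence's A/C/G/T counts with the counts expected from the alignment-wide frequencies using a chi-square statistic (3 degrees of freedom for DNA), and ignores gaps and ambiguity codes. The taxon names and the toy sequences are made up for illustration.

```python
from collections import Counter

BASES = "ACGT"
# Chi-square critical value for 3 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF3_P05 = 7.815


def base_counts(seq):
    """Count A, C, G, T in a sequence; gaps and ambiguity codes are ignored."""
    counts = Counter(seq.upper())
    return [counts[b] for b in BASES]


def composition_chi2(alignment):
    """Return {name: chi-square statistic} comparing each sequence's base
    counts with those expected from the alignment-wide base frequencies."""
    per_seq = {name: base_counts(seq) for name, seq in alignment.items()}
    totals = [sum(col) for col in zip(*per_seq.values())]
    grand_total = sum(totals)
    freqs = [t / grand_total for t in totals]

    stats = {}
    for name, observed in per_seq.items():
        n = sum(observed)
        stats[name] = sum(
            (obs - n * f) ** 2 / (n * f)
            for obs, f in zip(observed, freqs)
            if f > 0
        )
    return stats


# Hypothetical toy alignment: taxon_C is strongly A-biased.
toy = {
    "taxon_A": "ACGTACGTACGTACGTACGT",
    "taxon_B": "ACGTACGTACGTACGTACGA",
    "taxon_C": "AAAAAAAAAAAAAAATAAAT",
}
stats = composition_chi2(toy)
flagged = sorted(name for name, x in stats.items() if x > CHI2_CRIT_DF3_P05)
print(flagged)
```

Sequences that end up in `flagged` are candidates for a closer look, not for automatic removal; the p-values reported by IQ-TREE itself remain the reference.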

You can also analyse the translated amino acid sequences. This may also help to reduce saturation.
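A rough sketch of the translation step in Python, for anyone who wants to script it. One assumption matters here: for an insect dataset the mitochondrial PCGs would normally be translated with the invertebrate mitochondrial code (NCBI table 5) and the nuclear PCGs with the standard code (table 1). The function names and the example sequence are hypothetical, and the input is assumed to be in frame.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (NCBI translation tables 1 and 5).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
INVERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Build a codon -> amino acid dictionary from a 64-letter string."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(seq, table):
    """Translate an in-frame aligned nucleotide sequence.
    '---' becomes '-', and any codon not in the table becomes 'X'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        out.append("-" if codon == "---" else table.get(codon, "X"))
    return "".join(out)


nuclear = codon_table(STANDARD)
mito = codon_table(INVERT_MITO)

example = "ATGATAAGA---TGA"  # made-up sequence
print(translate(example, nuclear))
print(translate(example, mito))
```

The same nucleotide string gives different proteins under the two codes (ATA, AGA and TGA are read differently), so mixing them up would introduce artificial substitutions.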


The initial partitioning schemes I have been using are the ones I have seen commonly used:

(a) all genes separately, resulting in 8 initial partitions.

(b) all codon positions separately + rRNA, resulting in 4 initial partitions.

(c) each codon position of each PCG + each rRNA separately, resulting in 28 partitions.

(d) Moreover, I also tested the GHOST model with 4 classes.

Yes, all these partitioning schemes make sense. I would also recommend using the MERGE option, which tries to reduce the number of initial partitions to minimize the BIC score.
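As an illustration of how a scheme like (c) can be written down, the Python sketch below generates a NEXUS partition file with one charset per codon position of each PCG and one per rRNA. The gene names and alignment coordinates are invented; the real boundaries have to come from the concatenated alignment.

```python
def codon_charsets(name, start, end):
    """Three NEXUS charset lines, one per codon position, for a PCG that
    occupies alignment columns start..end (1-based, inclusive)."""
    return [
        f"    charset {name}_pos{p + 1} = {start + p}-{end}\\3;"
        for p in range(3)
    ]


def partition_nexus(genes):
    """genes: list of (name, start, end, is_coding) tuples."""
    lines = ["#nexus", "begin sets;"]
    for name, start, end, is_coding in genes:
        if is_coding:
            lines.extend(codon_charsets(name, start, end))
        else:
            lines.append(f"    charset {name} = {start}-{end};")
    lines.append("end;")
    return "\n".join(lines) + "\n"


# Hypothetical gene boundaries, for illustration only.
genes = [
    ("cox1", 1, 600, True),
    ("rrna16S", 601, 1000, False),
]
text = partition_nexus(genes)
print(text)
```

Saved to a file (say `scheme_c.nex`, a made-up name), this could be passed to IQ-TREE together with the merging option, for example `iqtree -s concat.fas -spp scheme_c.nex -m MFP+MERGE`. Please check the exact flags against the documentation of the version in use, since the partition options differ between releases.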

My question is: how can I decide which is the "correct" - if such a thing exists! - initial partitioning scheme?

“All models are wrong, but some are useful” ;-) To be precise, what you can do is find the “best” model among the currently available ones. And as hinted above, you can use the BIC: perform the different runs, look at the .iqtree file of each, and pick the one with the smallest BIC score.
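If there are many runs, collecting the scores by hand gets tedious. Below is a small Python sketch that pulls the BIC out of each report and picks the smallest. It assumes the report contains a line of the form `Bayesian information criterion (BIC) score: <number>`; the scheme labels and the scores in the example are made up.

```python
import re

# Assumed format of the BIC line in an .iqtree report.
BIC_LINE = re.compile(
    r"Bayesian information criterion \(BIC\) score:\s*([-+0-9.eE]+)"
)


def read_bic(report_text):
    """Extract the BIC score from the text of one .iqtree report."""
    match = BIC_LINE.search(report_text)
    if match is None:
        raise ValueError("no BIC line found in report")
    return float(match.group(1))


def best_by_bic(reports):
    """reports: {scheme label: report text}. Returns (best label, all scores)."""
    scores = {label: read_bic(text) for label, text in reports.items()}
    return min(scores, key=scores.get), scores


# Made-up excerpts standing in for the contents of three .iqtree files.
reports = {
    "a_by_gene": "Bayesian information criterion (BIC) score: 250410.7\n",
    "b_by_codon": "Bayesian information criterion (BIC) score: 249875.2\n",
    "c_gene_x_codon": "Bayesian information criterion (BIC) score: 250102.9\n",
}
best, scores = best_by_bic(reports)
print(best, scores[best])
```

With real runs, each report text could be read with `pathlib.Path("run_a.iqtree").read_text()` (file name hypothetical). BIC scores are only comparable between runs on exactly the same alignment, so a Gblocks-trimmed versus untrimmed alignment, or nucleotides versus amino acids, cannot be ranked this way.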

Until now I observed the average nodal support of each phylogeny (UFBoot)

What do you mean by “average”?

and I did likelihood mapping, to assess how clearly each analysis resolves the quartets.

Does it make sense? Can I compare the likelihood values of the trees, and can I use a different test to focus on a single topology?

As said, you should compare the BIC, not likelihood. BIC will balance the trade-off between having more parameter-rich models and the likelihoods. 
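The trade-off can be made concrete with the usual definition BIC = -2 lnL + k ln(n), where k is the number of free parameters and n the number of alignment sites. The Python sketch below uses invented numbers for two hypothetical schemes on a 4000-site alignment.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: -2 lnL + k ln(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


n_sites = 4000  # roughly the alignment length mentioned in this thread

# Invented values: the richer scheme fits better (higher lnL)
# but needs many more parameters.
simple = {"lnL": -125000.0, "k": 420}
rich = {"lnL": -124900.0, "k": 640}

bic_simple = bic(simple["lnL"], simple["k"], n_sites)
bic_rich = bic(rich["lnL"], rich["k"], n_sites)

print(round(bic_simple, 1), round(bic_rich, 1))
print("preferred:", "simple" if bic_simple < bic_rich else "rich")
```

Here the gain of 100 log-likelihood units does not pay for the 220 extra parameters, so the simpler scheme is preferred even though its likelihood is lower. This is why likelihoods alone should not be compared across schemes.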

Does that answer your questions?

Minh



Jacopo Martelossi

Jun 25, 2019, 4:03:29 PM
to IQ-TREE
Hi Minh,
first of all, thanks for your answer. I had already put some of your suggestions into practice by checking the AICc and the BIC, and by running some LRTs between the likelihoods of the trees obtained with the various partitioning schemes. Thanks for the suggestion to use the translated sequences; this could work for my situation!

Best regards,

Jacopo.