Pairwise distance in .mldist


Sishuo Wang

Jun 4, 2019, 2:23:16 AM
to IQ-TREE
Hi all,

I wish to calculate the genetic distances for my DNA alignment using the best-fit substitution model estimated by IQ-TREE.

The minimum sequence identity of my alignment was ~80% and the length of the alignment was ~1000 bp. It seemed that when +G was in the rate model, the pairwise distances in the .mldist file were very high, with many over 1.0. However, when +G was not specified (e.g., with GTR or GTR+F as the model), the distances in the .mldist file made much more sense to me (maximum ~32%).

So, my questions are: i) was the sequence identity of the alignment too low to use +G, and ii) would only -m GTR (or whatever the best substitution model) work for my purpose? Thanks.

Best regards,
Sishuo
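
For intuition about why adding +G raises the estimates, here is a minimal sketch. It assumes the Jukes-Cantor (JC69) model with a continuous Gamma distribution rather than GTR with discrete categories, and it uses closed-form distance formulas rather than the ML optimisation that IQ-TREE performs, so the numbers are illustrative only. The value p = 0.20 mirrors the ~80% minimum identity mentioned above; the alpha values are arbitrary.

```python
import math

def jc69_distance(p):
    """JC69 distance (substitutions per site) from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_gamma_distance(p, alpha):
    """JC69 distance when site rates follow a continuous Gamma with shape alpha (mean 1)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p = 0.20  # proportion of differing sites at ~80% identity
print(f"raw p-distance : {p:.3f}")
print(f"JC69           : {jc69_distance(p):.3f}")
for alpha in (2.0, 1.0, 0.5, 0.2):  # arbitrary shape values, smaller = more heterogeneity
    print(f"JC69+G (alpha={alpha}): {jc69_gamma_distance(p, alpha):.3f}")
```

For the same observed difference, the Gamma-corrected distance is always at least as large as the uncorrected one, and it grows as alpha shrinks.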

Minh Bui

Jun 6, 2019, 5:36:36 AM
to IQ-TREE, Sishuo Wang
Hi Sishuo,

Without rate heterogeneity, the genetic distances might be underestimated. However, Richard Neher recently observed that GTR+G4 might lead to overestimation because it uses just 4 rate categories. I suggest that you try GTR+G16, which should in theory approximate the Gamma distribution better.

Moreover, did you test which model best fits your data? I suggest that you also try the free-rate model, e.g. GTR+R, but let ModelFinder find the best number of categories. Then use the distances computed from the best-fit model.

Cheers,
Minh
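
To make the point about 4 versus 16 categories concrete, here is a rough sketch of the mean-based discrete Gamma approximation. It is not IQ-TREE's code: the helper names are made up for illustration, the incomplete gamma function is computed with a plain power series, and alpha = 0.5 is an arbitrary choice. It only shows how closely the discrete rates match the spread of the continuous distribution; it does not reproduce the overestimation mentioned above.

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0))

def gamma_quantile(alpha, q):
    """Quantile of Gamma(shape=alpha, rate=alpha), i.e. mean 1, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < q:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k):
    """Mean rate within each of k equal-probability categories."""
    cuts = [gamma_quantile(alpha, i / k) for i in range(1, k)]
    # Partial expectation of the rate up to each cut point, from 0 up to the full mean of 1.
    cum = [0.0] + [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]

def variance(rates):
    """Population variance of equally weighted category rates."""
    mean = sum(rates) / len(rates)
    return sum((r - mean) ** 2 for r in rates) / len(rates)

alpha = 0.5  # arbitrary shape parameter for illustration
for k in (4, 16, 64):
    rates = discrete_gamma_rates(alpha, k)
    print(f"k={k:2d}: largest rate {max(rates):.3f}, variance {variance(rates):.3f}")
print(f"continuous Gamma variance: {1.0 / alpha:.3f}")
```

With more categories, the largest category rate and the variance of the rates move toward those of the continuous Gamma, which is the sense in which G16 is a closer approximation than G4.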

--
You received this message because you are subscribed to the Google Groups "IQ-TREE" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iqtree+un...@googlegroups.com.
To post to this group, send email to iqt...@googlegroups.com.
Visit this group at https://groups.google.com/group/iqtree.
To view this discussion on the web visit https://groups.google.com/d/msgid/iqtree/2f83feeb-e19b-405a-869f-bfb957cf2c1c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Minh Bui

Jun 6, 2019, 5:38:36 AM
to iqt...@googlegroups.com, Sishuo Wang

Sishuo Wang

Jun 7, 2019, 4:48:33 AM
to IQ-TREE
Hi Minh,

Thanks so much for your help! I double-checked my alignment, and it seemed that a few of the sequences were likely not orthologous to the rest, which might explain the unexpectedly large genetic distances found in .mldist. After removing them, I followed your suggestions and it worked perfectly. I have one more question. It seemed that when +I was included in the best-fit model, the pairwise distances shown in .mldist could be even lower than the "raw" distances (i.e., the proportion of sites that differ between two sequences). So, should I use +I to calculate the pairwise distances for my alignment (it was often included in the best-fit model, e.g., GTR+G+I), or would it be fine to ignore +I? Thanks.

Best,
Sishuo


On Thursday, June 6, 2019 at 5:38:36 PM UTC+8, Minh Bui wrote:
FYI, here is the Twitter thread on this issue:


Minh Bui

Jun 11, 2019, 11:20:04 PM
to IQ-TREE, Sishuo Wang
Hi Sishuo,

On 7 Jun 2019, at 6:48 pm, Sishuo Wang <sishuow...@gmail.com> wrote:

Hi Minh,

Thanks so much for your help! I double-checked my alignment, and it seemed that a few of the sequences were likely not orthologous to the rest, which might explain the unexpectedly large genetic distances found in .mldist.

I see, that’s the reason. I think IQ-TREE caps distances at an upper bound of 9.0 or so. If a distance gets close to that bound, you likely have non-homologous sequences. IQ-TREE also prints a WARNING about saturated distances.

After removing them, I followed your suggestions and it worked perfectly. I have one more question. It seemed that when +I was included in the best-fit model, the pairwise distances shown in .mldist could be even lower than the "raw" distances (i.e., the proportion of sites that differ between two sequences). So, should I use +I to calculate the pairwise distances for my alignment (it was often included in the best-fit model, e.g., GTR+G+I), or would it be fine to ignore +I? Thanks.

I actually would not expect GTR+I+G distances to be lower than the “raw” (Hamming) distances. How much lower are they?

It’s true that GTR+I+G distances will generally be lower than GTR+G distances.

Minh
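
One way to answer the "how much lower" question is to compare the model distances with raw p-distances pair by pair. The sketch below assumes that the .mldist file is a PHYLIP-style square matrix (first line the number of sequences, then one name and row of distances per line). The function names, the toy sequences, and the toy matrix are all made up for illustration, and gaps are handled by pairwise deletion, which may differ from the settings used for the raw distances in this thread.

```python
def parse_phylip_matrix(text):
    """Parse a PHYLIP-style square distance matrix into (names, rows)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:n + 1]])
    return names, rows

def p_distance(a, b):
    """Proportion of differing sites among positions where both sequences have A, C, G or T."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def pairs_below_raw(names, ml, seqs, tol=1e-6):
    """List (name1, name2, model_distance, raw_distance) where the model distance is lower."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            raw = p_distance(seqs[names[i]], seqs[names[j]])
            if ml[i][j] < raw - tol:
                flagged.append((names[i], names[j], ml[i][j], raw))
    return flagged

# Toy data, invented for illustration only.
seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTTT", "C": "ACGAACGTAC"}
mldist_text = """3
A 0.000 0.100 0.110
B 0.100 0.000 0.360
C 0.110 0.360 0.000
"""
names, ml = parse_phylip_matrix(mldist_text)
flagged = pairs_below_raw(names, ml, seqs)
for n1, n2, d_ml, d_raw in flagged:
    print(f"{n1}-{n2}: model {d_ml:.3f} < raw {d_raw:.3f} ({100 * (1 - d_ml / d_raw):.0f}% lower)")
```

Any pair reported by this check has a model distance below the observed proportion of differences, which is the situation questioned above.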


Sishuo Wang

Jun 11, 2019, 11:36:07 PM
to IQ-TREE
Hi Minh,

They can be 50% lower than the Hamming distance. I am attaching the results and the alignment. The raw distance was calculated using dna.dist (pairwise deletion = false). I guess there are problems in the alignment? Thank you so much.

Best,
Sishuo
G.mldist
G+I.mldist
RAW.dist
5.aln

Minh Bui

Jul 3, 2019, 9:18:13 PM
to IQ-TREE, Sishuo Wang
Hi Sishuo,

I had a thorough look, and it looks like there is a bug in the distance computation (.mldist file) for +I and +I+G models. The distances are not scaled properly, and the underestimation gets worse as the proportion of invariable sites gets higher. We will fix the bug, but for now please use GTR+G distances instead of GTR+I+G distances.

So thanks for the discussion, that led me to find this issue!

Minh
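
As a reference point for what a properly scaled +I distance looks like, here is a small sketch under JC69+I. This is an assumption made for simplicity: it is not GTR+I+G, it is not IQ-TREE's code, and it says nothing about the actual cause of the bug. Variable sites evolve at rate 1/(1 - p_inv) so that the mean rate over all sites is 1, and the distance is expressed in substitutions per site averaged over all sites.

```python
import math

def jc69_inv_distance(p, p_inv):
    """JC69+I distance from the proportion p of differing sites and the invariable fraction p_inv."""
    variable = 1.0 - p_inv
    inner = 1.0 - 4.0 * p / (3.0 * variable)
    if inner <= 0.0:
        raise ValueError("p is too large for this p_inv (saturated)")
    return -0.75 * variable * math.log(inner)

p = 0.20  # illustrative proportion of differing sites
for p_inv in (0.0, 0.2, 0.4, 0.6):  # arbitrary invariable-site proportions
    d = jc69_inv_distance(p, p_inv)
    print(f"p_inv={p_inv:.1f}: distance {d:.3f} (raw p-distance {p:.3f})")
```

With this scaling the distance stays at or above the raw p-distance for every p_inv, so a model distance well below the raw value points to a scaling problem rather than to the data.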

