Number of informative and uninformative parsimony sites


Taylor Paisie

Nov 10, 2016, 1:36:21 PM
to IQ-TREE
After using TREE-PUZZLE for likelihood mapping analysis, I can see the number of parsimony-informative and uninformative sites in the .puzzle file. Since I'm running whole genomes, TREE-PUZZLE is very slow and inefficient. Is there a command I'm not finding in the documentation that would report the number of parsimony-informative and uninformative sites?

Thanks!

Bui Quang Minh

Nov 11, 2016, 5:38:42 AM
to iqt...@googlegroups.com, Taylor Paisie
Hi there,

Right now there is no option to print these, so some coding is necessary. Moreover, do you have partitioned data and want to see the number of informative sites across partitions?

Minh


--
You received this message because you are subscribed to the Google Groups "IQ-TREE" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iqtree+un...@googlegroups.com.
To post to this group, send email to iqt...@googlegroups.com.
Visit this group at https://groups.google.com/group/iqtree.
For more options, visit https://groups.google.com/d/optout.

--
Bui Quang Minh
Center for Integrative Bioinformatics Vienna (CIBIV)
Campus Vienna Biocenter 5, VBC5, Ebene 1
A-1030 Vienna, Austria
Phone: ++43 1 4277 74326
Email: minh.bui (AT) univie.ac.at







Heiko Schmidt

Nov 11, 2016, 5:42:19 AM
to iqt...@googlegroups.com
Dear Taylor,

There is a plan to integrate that feature into IQ-TREE as well.

For the time being, I would suggest using TREE-PUZZLE to determine the informative-site statistics
while setting ‘Tree search procedure?’ (the k button) to ‘Pairwise distances only (no tree)’,
so that it stops without a tree search.

Best wishes,
Heiko



Taylor Paisie

Nov 11, 2016, 11:38:36 AM
to IQ-TREE, tpai...@gmail.com, minh...@univie.ac.at
Thank you for answering! No, I do not have partitioned data. These are whole virus genomes, and they take unbearably long to run in TREE-PUZZLE.

Taylor Paisie

Nov 11, 2016, 11:39:30 AM
to IQ-TREE
Thank you! This is great to hear, I think my lab will be very happy to hear this!!!

Bui Quang Minh

Nov 16, 2016, 6:59:46 AM
to IQ-TREE, Taylor Paisie
Dear Taylor,

I just realized that this information is already printed in the log file like this:

….
Alignment most likely contains DNA/RNA sequences
Alignment has 17 sequences with 1998 columns and 1152 patterns (1009 informative sites)

meaning that the alignment has 1009 parsimony-informative sites.
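
For anyone who wants to pull this number out of many log files, here is a minimal Python sketch. It assumes the log line has exactly the wording quoted above; other IQ-TREE versions may phrase it differently, in which case the pattern needs adjusting. The function name `parse_alignment_summary` is made up for this example.

```python
import re

# Pattern for the log line as quoted in this thread (an assumption about the
# wording; other IQ-TREE versions may print this line differently).
ALIGNMENT_LINE = re.compile(
    r"Alignment has (\d+) sequences with (\d+) columns and (\d+) patterns "
    r"\((\d+) informative sites\)"
)

def parse_alignment_summary(log_text):
    """Return the four counts from the alignment summary line, or None if absent."""
    match = ALIGNMENT_LINE.search(log_text)
    if match is None:
        return None
    sequences, columns, patterns, informative = map(int, match.groups())
    return {
        "sequences": sequences,
        "columns": columns,
        "patterns": patterns,
        "informative": informative,
    }

example = (
    "Alignment most likely contains DNA/RNA sequences\n"
    "Alignment has 17 sequences with 1998 columns and 1152 patterns "
    "(1009 informative sites)\n"
)
summary = parse_alignment_summary(example)
print(summary["informative"])
```

The function returns None rather than raising when the line is missing, so a batch script can simply skip logs that do not match.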

Minh

Federico Gaiti

Feb 17, 2017, 1:09:34 PM
to IQ-TREE, tpai...@gmail.com, minh...@univie.ac.at
Hi,

Following up on this: I am now using IQ-TREE and, while I see that the number of informative sites is printed in the log file, I would like to ask whether there is a way to find out exactly which sites in the alignment are informative.

Thank you
Fede


Bui Quang Minh

Feb 18, 2017, 9:16:23 AM
to iqt...@googlegroups.com, minh...@univie.ac.at, tpai...@gmail.com
Dear Fede,
Unfortunately there is no such option, though it should be easy to implement. However, TREE-PUZZLE may have it. You can also check out the R package ape.

Minh


Federico Gaiti

Feb 21, 2017, 10:43:01 AM
to iqt...@googlegroups.com, minh...@univie.ac.at, tpai...@gmail.com
Hi Minh,

Thanks for your reply. I was also wondering: what is the logic behind determining these “informative” sites? How are they determined, and what does “informative” actually mean?

Thank you
Fede



Heiko Schmidt

Feb 21, 2017, 11:38:01 AM
to iqt...@googlegroups.com, Minh, Bui Quang, tpai...@gmail.com
Dear Fede,

“Informative” here actually means parsimony-informative, that is, sites that would be informative for a parsimony analysis.

A site is (parsimony) informative if (what I call) the 2,2,2-rule applies:

An informative site:
- contains at least 2 different (informative) characters
(wildcards like ‘X’ in proteins, ‘N’ in DNA, ‘-’ etc. are typically regarded as not informative)
- and of these (informative) characters at least 2
- occur at least twice (2x)

For example, a column like:
- A,A,C,C,C,C is informative (2xA + 4xC)
- A,A,C,T,C,C is informative (2xA + 3xC)
- A,A,G,T,C,G is informative (2xA + 2xG)
- A,A,T,T,C,C is informative (2xA + 2xC + 2xT)
- A,A,-,N,C,C is informative (2xA + 2xC)
- A,G,C,T,C,C is not informative
- A,-,C,T,-,C is not informative
- C,C,C,C,C,C is not informative
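
The 2,2,2-rule can be sketched in a few lines of Python. This is only an illustration of the rule as described in this message, not code from IQ-TREE or TREE-PUZZLE. The function name and the wildcard set `-?NX` are assumptions: for protein data ‘N’ is asparagine and should not be treated as a wildcard, and partial ambiguity codes such as R or Y are treated here as ordinary characters for simplicity.

```python
from collections import Counter

# Assumed wildcard set for DNA-like data: gap, missing, and fully ambiguous codes.
# For proteins, pass a set without "N" (asparagine), e.g. frozenset("-?X").
WILDCARDS = frozenset("-?NX")

def is_parsimony_informative(column, wildcards=WILDCARDS):
    """2,2,2-rule: at least 2 different characters each occur at least twice."""
    counts = Counter(ch for ch in column.upper() if ch not in wildcards)
    repeated = [ch for ch, n in counts.items() if n >= 2]
    return len(repeated) >= 2

# The example columns from the list in this message, written without commas.
columns = ["AACCCC", "AACTCC", "AAGTCG", "AATTCC", "AA-NCC",
           "AGCTCC", "A-CT-C", "CCCCCC"]
for column in columns:
    print(column, is_parsimony_informative(column))
```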

This is often used as an approximate measure of whether you have enough potentially informative sites to resolve a tree.
One wants far more informative sites than the n-3 internal branches to be resolved in a tree (where n is the number of sequences/leaves/taxa). If you have few informative sites, it is clear why a tree remains unresolved or why one cannot gain support for branches.

However, this does not mean that the sites are informative for a certain tree/branch or that the informative sites are congruent with each other.
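
As a rough worked example of the n-3 comparison, the sketch below uses the numbers from the log line quoted earlier in this thread (17 sequences, 1009 informative sites). The helper names are hypothetical, and the ratio is only a crude indicator; as noted above, it says nothing about whether the sites support any particular branch.

```python
def internal_branches(num_sequences):
    """Internal branches in a fully resolved unrooted tree with n leaves (n - 3)."""
    if num_sequences < 3:
        raise ValueError("need at least 3 sequences")
    return num_sequences - 3

def informative_sites_per_branch(num_informative, num_sequences):
    """Informative sites per internal branch; a rough indicator only."""
    branches = internal_branches(num_sequences)
    if branches == 0:
        return float("inf")  # a 3-taxon tree has no internal branch to resolve
    return num_informative / branches

print(internal_branches(17))
print(informative_sites_per_branch(1009, 17))
```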

Does that help?

Best wishes,
Heiko

Federico Gaiti

Feb 21, 2017, 11:53:36 AM
to iqt...@googlegroups.com, Minh, Bui Quang, tpai...@gmail.com
Hi Heiko,

Thanks for your explanation. Really helpful. Would the same rule apply to binary data with three types of characters (0, 1 and ‘?' for missing)?

Best
Fede

Heiko Schmidt

Feb 21, 2017, 12:34:36 PM
to iqt...@googlegroups.com, Federico Gaiti, Minh, Bui Quang, tpai...@gmail.com
Yes, this applies to all kinds of character data.
In the binary case, each of the characters (0, 1) has to occur at least twice for the site to be informative (and ‘?’ is an uninformative wildcard).
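
For binary data the rule reduces to a very small check. A minimal sketch, assuming the states are written as the characters 0 and 1 and missing data as ‘?’ (the function name is made up for this example):

```python
from collections import Counter

def is_informative_binary(column, missing="?"):
    """True if both 0 and 1 occur at least twice, ignoring missing-data symbols."""
    counts = Counter(ch for ch in column if ch not in missing)
    return counts["0"] >= 2 and counts["1"] >= 2

print(is_informative_binary("0011?"))  # both states occur twice
print(is_informative_binary("0111?"))  # state 0 occurs only once
print(is_informative_binary("00??1"))  # state 1 occurs only once
```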

Best wishes,
Heiko

Federico Gaiti

Feb 23, 2017, 6:35:55 PM
to Heiko Schmidt, iqt...@googlegroups.com, Minh, Bui Quang, tpai...@gmail.com
Hi Heiko,

Thank you. Really helpful. Are you aware of a way to find out exactly which sites in the alignment are informative? With TREE-PUZZLE I could get the unresolved quartets, but I think that answers a different question from the one I am asking.

Thanks
Fede

Federico Gaiti

Feb 24, 2017, 3:56:06 PM
to Heiko Schmidt, iqt...@googlegroups.com, Minh, Bui Quang, tpai...@gmail.com
For example, using the option -wsr I can write per-site rates to a .rate file. I got output that looks like this:

Site Rate Category Categorized_rate
1 1.00000 1 0.31635
2 1.00000 1 0.31635
3 1.00000 1 0.31635
4 0.87711 1 0.31635
5 0.80212 1 0.31635
6 0.80212 1 0.31635
7 0.80212 1 0.31635
8 0.80212 1 0.31635
9 0.80212 1 0.31635
...

What do the categories correspond to? Does the site number (e.g., Site 1) correspond to the column number in the PHYLIP input file?

Thank you
Fede

Bui Quang Minh

Feb 26, 2017, 12:38:04 AM
to Federico Gaiti, IQ-TREE
Dear Fede,

Assuming that you used a Gamma rate heterogeneity model with 4 categories (the same holds if you used a FreeRate model), the columns of the .rate file have the following meaning:

Site:             Column number in the alignment file, starting from 1.
Rate:             Evolutionary rate of the site inferred by the empirical Bayesian method (posterior mean).
Category:         The index of the Gamma category that this site most likely falls into (the one with the highest posterior probability). Category 1 means that the site is slowly evolving, whereas category 4 means fast evolving. Thus, you can use this column to quickly classify the sites in the alignment.
Categorized_rate: The rate of the corresponding category.

Note that if you also combine Gamma rates with an invariant-site model (I+G4), then there is an additional Category 0, meaning that the corresponding site is likely invariable.
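
To classify sites by category as suggested above, the .rate file can be read with a short Python script. This is a sketch that assumes the four whitespace-separated columns shown in this thread; any line that does not start with a site number (header, comments, blank lines) is skipped. The function names are made up for this example.

```python
from collections import defaultdict

def read_site_rates(lines):
    """Parse .rate rows into (site, rate, category, categorized_rate) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip the header, blank lines, comments, and anything not in 4 columns.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), float(fields[1]),
                     int(fields[2]), float(fields[3])))
    return rows

def sites_by_category(rows):
    """Group 1-based site numbers by their rate category."""
    groups = defaultdict(list)
    for site, _rate, category, _categorized_rate in rows:
        groups[category].append(site)
    return dict(groups)

# A few rows copied from the output posted earlier in this thread.
sample = """Site Rate Category Categorized_rate
1 1.00000 1 0.31635
4 0.87711 1 0.31635
5 0.80212 1 0.31635
""".splitlines()

rows = read_site_rates(sample)
print(sites_by_category(rows))
```

For a real file, an open file handle can be passed directly to `read_site_rates`, since it only iterates over lines.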

Further reading: a comparison paper by Mayrose et al. (2004) (https://doi.org/10.1093/molbev/msh194) showed that the empirical Bayesian method performs well. That’s why it was implemented in IQ-TREE.

Cheers, Minh

Federico Gaiti

Feb 26, 2017, 9:24:36 PM
to Bui Quang Minh, IQ-TREE
Great — thank you Minh. 

Another question: how are the likelihood distances (.mldist) estimated?

Thanks
Fede

Bui Quang Minh

Feb 27, 2017, 4:20:05 AM
to Federico Gaiti, IQ-TREE
Hi Fede again,

A maximum parsimony tree is first constructed from the alignment, then parameters of the given model are estimated based on the parsimony tree. The estimated model parameters are now used to compute the pairwise distance between sequences. Thus, model parameters are not re-estimated for each pairwise comparison. This is because some parameters (like the Gamma shape) cannot be estimated from just 2 sequences.
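
To illustrate the idea of fixing model parameters once and reusing them for every pair, here is a toy Python sketch. It is not IQ-TREE's implementation: it swaps in the simple JC69 model with Gamma-distributed rates, for which the pairwise distance has the closed form d = (3/4) α [(1 - 4p/3)^(-1/α) - 1], where p is the proportion of differing sites. The value of α, the sequences, and the function names are all made up; in the procedure described above, α would come from the single model fit on the parsimony tree.

```python
def p_distance(seq_a, seq_b, wildcards="-?N"):
    """Proportion of differing sites among columns where both sequences have a base."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a not in wildcards and b not in wildcards]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc_gamma_distance(p, alpha):
    """JC69 + Gamma distance; alpha is supplied, not re-estimated for each pair."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

# Hypothetical shape parameter, standing in for the value estimated once
# on the parsimony tree.
alpha = 0.5
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACGAACGTTT"}
names = sorted(seqs)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        d = jc_gamma_distance(p_distance(seqs[x], seqs[y]), alpha)
        print(x, y, round(d, 5))
```

The same α is used in all three comparisons, which mirrors the point that a parameter like the Gamma shape cannot be estimated from just two sequences.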

Hope that helps,
Minh

Federico Gaiti

Feb 27, 2017, 1:31:15 PM
to Bui Quang Minh, IQ-TREE
Great — thank you. Really helpful.
Fede

Federico Gaiti

Jul 24, 2017, 3:34:40 PM
to Heiko Schmidt, IQ-TREE, Minh, Bui Quang, tpai...@gmail.com
Hi Heiko,

Thank you for your reply.

Are you aware of a way to find out exactly which sites in the alignment are informative? Basically, a way to know which sites are being used by the algorithm to build the tree?

Thank you

Best,
Fede



Heiko Schmidt

Jul 24, 2017, 4:06:45 PM
to Federico Gaiti, IQ-TREE, Minh, Bui Quang, tpai...@gmail.com
Dear Fede,

I am thinking about implementing this in the code, so that one can get, e.g., a letter for each site indicating whether that site is informative or not. So far I have not had the time to do so.
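
Until such an option exists, a per-site letter string can be produced outside IQ-TREE. The Python sketch below is one possible way to do it and is not an IQ-TREE feature; the letters ‘I’ and ‘U’, the wildcard set, and the function name are assumptions chosen for this example. It takes the aligned sequences as plain strings, so reading the alignment file is left to the user.

```python
from collections import Counter

def informative_mask(sequences, wildcards="-?NX"):
    """One letter per alignment column: 'I' informative, 'U' uninformative."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    mask = []
    for column in zip(*sequences):
        counts = Counter(ch.upper() for ch in column
                         if ch.upper() not in wildcards)
        repeated = sum(1 for n in counts.values() if n >= 2)
        mask.append("I" if repeated >= 2 else "U")
    return "".join(mask)

alignment = ["AACA", "AACG", "CATC", "CATT", "CA-C"]
mask = informative_mask(alignment)
print(mask)
# 1-based site numbers of the informative columns
print([i + 1 for i, flag in enumerate(mask) if flag == "I"])
```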

One thing to keep in mind: the count of parsimony-informative and uninformative sites is an approximation to get an idea about the information in the data (besides likelihood mapping). The definition applies to (unweighted) parsimony, where a site is informative if it (a) contains at least 2 different characters [not including gaps or wildcards like N or X] and (b) of these characters at least 2 occur (c) at least twice. That means one can form at least two clusters of at least 2 sequences with equal characters. I call this the 2-2-2 rule.

However, in the ML likelihood computation, the evolutionary model can also exploit the similarity of residues (like purines or pyrimidines). Depending on the parameterisation of the model (usually estimated from the data), transitions are typically more likely than transversions, so two purines (or two pyrimidines) are more closely related than a purine and a pyrimidine.
Since unweighted parsimony does not distinguish between mutation types, such information is not reflected by parsimony (un)informative sites, so there is more information in the alignment than the count of informative sites suggests.
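
The distinction between mutation types can be made concrete with a small Python helper. This is only an illustration of the purine/pyrimidine classification, not part of any likelihood computation; it assumes unambiguous DNA bases (A, C, G, T), and the function name is made up.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES

def substitution_type(a, b):
    """Classify a pair of DNA bases as 'identical', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("expected unambiguous DNA bases A, C, G or T")
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"))  # purine vs purine
print(substitution_type("C", "T"))  # pyrimidine vs pyrimidine
print(substitution_type("A", "C"))  # purine vs pyrimidine
```

Unweighted parsimony counts all three pairs above as one change each, whereas a model with different transition and transversion rates treats them differently.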

Best wishes,
Heiko

Federico Gaiti

Jul 24, 2017, 10:50:10 PM
to Heiko Schmidt, IQ-TREE, Minh, Bui Quang, tpai...@gmail.com
Hi Heiko

Thank you for your reply and detailed explanation. Really useful as usual.  

Implementing the info on whether a site is informative or not in the code would be amazing, if you have a chance to do it. 

Thanks again 
Fede

Bui Quang Minh

Aug 1, 2017, 6:00:35 AM
to IQ-TREE, tpai...@gmail.com
Dear Fede and Heiko, 

I just created an enhancement issue in GitHub: https://github.com/Cibiv/IQ-TREE/issues/34

It should be easy to implement. What we still need is the output format. What do you think the output file should look like?

So, could you please add comments to this GitHub issue? (You will need a GitHub account for this.)

Thanks, Minh

Federico Gaiti

Aug 2, 2017, 1:36:33 PM
to iqt...@googlegroups.com, tpai...@gmail.com
Thanks Minh,

I will comment on GitHub.

Best,
Fede
