Using dendropy to find quartet frequencies in list of trees?

94 views
Skip to first unread message

Ruth Davidson

unread,
Oct 16, 2014, 3:18:42 PM10/16/14
to dendrop...@googlegroups.com
Hi Jeet

I need to write a function that finds quartet frequencies in a list of trees and I hope I can do it with DendroPy's frequency of splits function.  From looking at this post on the users group 


I inferred that I pass the labels for one half of the split that I am looking for.  To demonstrate, here is output where I am trying to find the relative frequencies of the three binary unrooted quartet topologies, represented as the one non-trivial split corresponding to teh internal edge: (1) 'MON', 'MAC' | 'TUR', 'BOS,  (2) 'MON', 'TUR', | 'MAC', 'BOS, and (3) 'MON', 'BOS', | 'MAC', 'TUR'.  

>>> import dendropy

>>> treelist = dendropy.TreeList.get_from_path("true-trees-1X.tre", "newick")

>>> l = treelist.taxon_set.labels()

>>> l

['MON', 'MAC', 'TUR', 'BOS', 'SUS', 'VIC', 'FEL', 'CAN', 'EQU', 'MYO', 'PTE', 'SOR', 'ERI', 'CAV', 'RAT', 'MUS', 'CAL', 'TAR', 'NEW', 'PAN', 'HOM', 'GOR', 'PON', 'MIC', 'OTO', 'TUP', 'SPE', 'DIP', 'ORY', 'OCH', 'CHO', 'DAS', 'PRO', 'LOX', 'ECH', 'ORN', 'GAL']

>>> splita = 'MON', 'MAC'

>>> splitb = 'MON', 'TUR'

>>> splitc = 'MON', 'BOS'

>>> treelist.frequency_of_split(labels=splita)

0.94825

>>> treelist.frequency_of_split(labels=splitb)

0.0

>>> treelist.frequency_of_split(labels=splitc)

0.0

>>> 

I was expecting the three numbers to total to one, because all the trees in the set treelist are binary and are on the full taxa set:  Here is "proof" though I am sure there is a better way to check for polytomies, and for the full taxon set, sorry, here it is:


>>> polytomycheck = [[e for e in treelist[i].postorder_edge_iter()] for i in range(len(treelist))]

>>> len(polytomycheck[0])

72

>>> len(l)

37

>>> check2 = [len(polytomycheck[j]) == 72 for j in range(4000)]

>>> False in check2

False

Also, I expect the same frequency for using "the other side of split a:  split a is for the quartet (1) 'MON', 'MAC' | 'TUR', 'BOS, and so using the other side I expect to see 0.94825 again, but I don't:

>>> othersidesplita = 'TUR','BOS'

>>> treelist.frequency_of_split(labels=othersidesplita)

0.73825

Any suggestions are much appreciated!  I am not attached to doing this a certain way; I just need to write a function that estimates the relative frequency in a list of trees of the three possible unrooted quartet topologies on four taxa appear as induced subtrees. 


Thanks so much!







Derrick Zwickl

unread,
Oct 20, 2014, 12:38:32 PM10/20/14
to dendrop...@googlegroups.com

Hi Ruth,

In my package pygot (https://github.com/zwickl/pygot) I do exactly that, finding quartet frequencies within larger trees using Dendropy.  You can grab the code and take a look at the dendropyScoreTriples.py script.  You need to provide taxon bitmasks to count the frequency of embedded quartets, unless you want to prune down to the taxon set of interest, which is very inefficient.

Best,
Derrick
--
You received this message because you are subscribed to the Google Groups "DendroPy Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dendropy-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ruth Davidson

unread,
Oct 22, 2014, 11:55:53 AM10/22/14
to dendrop...@googlegroups.com
Hi Derrick,

Thanks for sharing your link and files.  I also read the accompanying sysbio paper and thought it was fantastic. 

In dendropyScoreTriples.py, do you mean the function 

def dendropy_score_triples(quartets, treefiles, oryza_names=False, write_trees=False, output_dir='./', filter_file=None, gene_order_file=None):

I am puzzled as to what inputs format the quartets are in, I looked in 


https://github.com/zwickl/pygot/blob/4ce01a3e81f132d4f7ab0de78ec4fa947c9cca12/examples/quartetExampleMasterScript.sh

and I see a reference to a "quartet file"-what does the file look like?

 I am also confused by the bitmasks:  from my messing around shown below it seems that different subsets of taxa are identified by different bitmasks.  Also, I still don't understand how to give Dendropy the right bitmask for a given split in the frequency of splits dendropy function....

>>> L = list(treelist.taxon_set.labels())

>>> L

['MON', 'MAC', 'TUR', 'BOS', 'SUS', 'VIC', 'FEL', 'CAN', 'EQU', 'MYO', 'PTE', 'SOR', 'ERI', 'CAV', 'RAT', 'MUS', 'CAL', 'TAR', 'NEW', 'PAN', 'HOM', 'GOR', 'PON', 'MIC', 'OTO', 'TUP', 'SPE', 'DIP', 'ORY', 'OCH', 'CHO', 'DAS', 'PRO', 'LOX', 'ECH', 'ORN', 'GAL']

>>> treelist.taxon_set.get_taxa_bitmask(labels = 'MON')

0

>>> treelist.taxon_set.get_taxa_bitmask(labels = 'PAN')

0

>>> treelist.taxon_set.get_taxa_bitmask(labels = [L[1], L[3]])

10

>>> treelist.taxon_set.get_taxa_bitmask(labels = [L[1], L[3], L[5]])

42

Derrick Zwickl

unread,
Oct 23, 2014, 3:38:02 PM10/23/14
to dendrop...@googlegroups.com

Hi Ruth,

This is the Dendropy discussion list, so we should probably discuss some of those details separately...

Regardless, some basic background (that you may not need): in Dendropy (and other programs), a bitmask is a series of zero's and ones, one per taxon, that indicate a particular split.  i.e., zero bits are on one side of the split, and ones are on the other.  The bit strings can be stored/represented as integers, which I think is why you are getting decimal numbers below.  You should be able to do bin(bitmask) on them to see the bitstring representation.  The reason that the splits are generally represented as bit strings is that bit-wise ANDs, ORs and complements can be used to verify split compatibility (with Bunemann's three point condition), split identity, etc.  A "mask" is another bit string that can be used with a bit-wise AND operation to mask out bits (taxa) that you aren't interested in (by setting them to zero) when checking split compatibility, which is how you can test for splits embedded within larger trees. 

See the is_compatible() function in the Dendropy.treesplit module, which takes the bitmasks for two splits to be tested for compatibility, and a taxon mask to allow some taxa to be disregarded when testing that compatibility.  If you mask out all but four taxa, then split compatibility is the same as split identity.  I use the is_compatible() function with a mask in the masked_frequency_of_split() function in the pygot.dendroutils module. 

Best,
Derrick

Jeet Sukumaran

unread,
Oct 23, 2014, 3:53:01 PM10/23/14
to dendrop...@googlegroups.com

The (in-progress) tutorial on bipartitions for DendroPy 4 provides an
overview of how bipartitions/splits are modeled using the bitmasks are used:

https://github.com/jeetsukumaran/DendroPy/blob/DendroPy4/doc/source/tutorial/bipartitions.rst#modeling-bipartitions

While written for DendroPy 4, the principles apply to DendroPy3 in so
far as the bitmasks used to model splits/bipartitions. However, the
various code and usage examples are DendroPy 4 only. The modeling of
bipartitions is a quite a bit more rudimentary in DendroPy 3 as opposed
to 4. In 3, there distinction between a "split bitmask" and a "leafset
bitmask" is not clear and can be confusing, as we use the same term for
either. In DendroPy 4, these are distinct concepts, with the values
being equal for rooted trees and the former being the normalized value
of the latter for unrooted trees.

The modeling of bipartitions/splits in DendroPy 4 is much more thorough,
with lots of built-in support for, e.g. rendering the splits etc., yet
at the same time is just as performant as DendroPy 3. I would use
DendroPy 4 for bipartitions/splits modeling, though some aspects of the
API are unstable and subject to change over the next few weeks[*]

--jeet

[*] In particular, I have not yet decided whether or not the Bipartition
object should be mutable and therefore not hashable, or immutable and
therefore hashable, or facultatively mutable/immutable and therefore
facultatively unhashable/hashable. Currently, it is the last of these,
with the default being immutable/hashable (i.e., you have to say
``bipartition.is_mutable=True`` to modify an existing bipartition). I
think this might be confusing for users, and I am thinking of just
making Bipartitions mutable, period. This will mean they cannot be used
in sets or dictionary keys, which are desirable to enable fast look-ups,
but I am thinking that we might just use the split bitmasks or leafset
bitmasks directly for this.
>> <https://github.com/zwickl/pygot>) I do exactly that, finding
>> quartet frequencies within larger trees using Dendropy. You can
>> grab the code and take a look at the dendropyScoreTriples.py
>> script. You need to provide taxon bitmasks to count the frequency
>> of embedded quartets, unless you want to prune down to the taxon
>> set of interest, which is very inefficient.
>>
>> Best,
>> Derrick
>>
>> On 10/16/14 12:18 PM, Ruth Davidson wrote:
>>> Hi Jeet
>>>
>>> I need to write a function that finds quartet frequencies in a
>>> list of trees and I hope I can do it with DendroPy's frequency of
>>> splits function. From looking at this post on the users group
>>>
>>> https://groups.google.com/forum/#!searchin/dendropy-users/april$20wright/dendropy-users/qs48o-J8d1Q/IvV7oOO14UgJ
>>> <https://groups.google.com/forum/#%21searchin/dendropy-users/april$20wright/dendropy-users/qs48o-J8d1Q/IvV7oOO14UgJ>
>>> send an email to dendropy-user...@googlegroups.com <javascript:>.
>>> For more options, visit https://groups.google.com/d/optout
>>> <https://groups.google.com/d/optout>.
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "DendroPy Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to dendropy-user...@googlegroups.com
>> <mailto:dendropy-user...@googlegroups.com>.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--



--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com
--------------------------------------
Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

Jeet Sukumaran

unread,
Oct 23, 2014, 3:56:49 PM10/23/14
to dendrop...@googlegroups.com
Oh yes, citation information is also available at the bottom the page ;)
>> <https://github.com/zwickl/pygot>) I do exactly that, finding
>> quartet frequencies within larger trees using Dendropy. You can
>> grab the code and take a look at the dendropyScoreTriples.py
>> script. You need to provide taxon bitmasks to count the frequency
>> of embedded quartets, unless you want to prune down to the taxon
>> set of interest, which is very inefficient.
>>
>> Best,
>> Derrick
>>
>> On 10/16/14 12:18 PM, Ruth Davidson wrote:
>>> Hi Jeet
>>>
>>> I need to write a function that finds quartet frequencies in a
>>> list of trees and I hope I can do it with DendroPy's frequency of
>>> splits function. From looking at this post on the users group
>>>
>>> https://groups.google.com/forum/#!searchin/dendropy-users/april$20wright/dendropy-users/qs48o-J8d1Q/IvV7oOO14UgJ
>>> <https://groups.google.com/forum/#%21searchin/dendropy-users/april$20wright/dendropy-users/qs48o-J8d1Q/IvV7oOO14UgJ>
>>> send an email to dendropy-user...@googlegroups.com <javascript:>.
>>> For more options, visit https://groups.google.com/d/optout
>>> <https://groups.google.com/d/optout>.
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "DendroPy Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to dendropy-user...@googlegroups.com
>> <mailto:dendropy-user...@googlegroups.com>.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--



Jeet Sukumaran

unread,
Oct 23, 2014, 4:14:05 PM10/23/14
to dendrop...@googlegroups.com
Also, have a look at:


https://pythonhosted.org/DendroPy/tutorial/treestats.html#frequency-of-a-split-in-a-collection-of-trees



https://pythonhosted.org/DendroPy/tutorial/treestats.html#frequency-of-a-split-in-a-collection-of-trees
> <https://github.com/zwickl/pygot>) I do exactly that, finding
>> send an email to dendropy-user...@googlegroups.com <javascript:>.
>> For more options, visit https://groups.google.com/d/optout
>> <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Jeet Sukumaran

unread,
Oct 23, 2014, 4:37:25 PM10/23/14
to dendrop...@googlegroups.com
Disregard this: Derrick pointed out that you are interested in a
backbone constraint, rather than the functionality offered by the
approaches below. The tutorial I referenced above will help in
understanding how to go about this, while Derrick's code will actually
do it for you. I suggest reading the former and using the latter.

Ruth Davidson

unread,
Oct 30, 2014, 10:57:16 AM10/30/14
to dendrop...@googlegroups.com
I appreciate very much both of you pointing me to the background on the bitmasks. I see now why they are used and how they could be used to avoid pruning which is the way I have implemented some other things that are indeed slow.    I want to get on board with the bitmasks but I'm still having a hard time figuring out how to actually use them- I read the source for is_compatible


and masked_frequency_of_split


But I'm still at a loss as how to pass masks as inputs to them-in particular there seems to be a bound method for getting a split bitmask from an edge, like I see in the source code for find_edge_from_split:  
e = root.edge
cm = e.split_bitmask


 but I can't get that to work on a small example...

>>> trees = dendropy.TreeList.get_from_path("true-trees-1X.tre", "newick")

>>> edges1 = [e for e in trees[1].postorder_edge_iter()]

>>> edges1[1].split_bitmask

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

AttributeError: 'Edge' object has no attribute 'split_bitmask'


There are a few things I still don't understand:  1.  How to get a list of bitmasks for a tree representing all the splits, 2.  how to mask the taxa I don't care about when I input the information to masked_frequency_of_splits 3. what an example would look like where you actually give a mask as an input to either is_compatible or masked_frequency_of_splits. I can't seem to actually get dendropy to tell me what the split_bitmask is for any edge or split....  For example if I want to feed a split_bitmask to the function split_bitmask_string()

https://pythonhosted.org/DendroPy/library/taxon.html#dendropy.dataobject.taxon.TaxonSet.split_bitmask_string

I don't know where to find my split_bitmask info in the first place...

Thanks as always!

Ruth


Jeet Sukumaran

unread,
Oct 30, 2014, 11:37:45 AM10/30/14
to dendrop...@googlegroups.com
Hi Ruth,

The good news is that all of these things can easily be done, and, in
fact, are done regularly within the library itself.

The bad news is that these are not documented, as the whole splits
architecture was a "under-the-hood" system in DendroPy 3.

The good news is that in DendroPy 4, bipartitions/split bitmasks etc.
are going to be public exposed parts of the API, and thus will
(eventually) get good documentation.

The bad news is that you have to wait till this documentation is written
and I cannot give you an estimate of when then might be. Before the end
of the year, certainly. Within the next few weeks, maybe.

The good news, though, is that, as I noted before, all of the things you
want to do are already being done by the DendroPy code internally. So if
you want to use this system sooner, you will need to to do the due
diligence and actually go through the DendroPy code.

Some hints to get you started:

(1) you will need to call ``tree.update_splits()`` to actually get
split bitmasks added to the tree.
(2) the above will add the ``split_bitmask`` attribute to each edge
(3) the above will also add a ``split_edges`` dictionary to each
tree, where the keys are the split bitmasks and the values are the edges
on which they are found
(4) for *unrooted* trees, very confusingly, the splits used as keys
in the "split_edges" dictionary are *normalized*, while the
``split_bitmask`` attribute of the edge are *unnormalized* (i.e., what I
am calling ``leafset_bitmask`` in DendroPy 4). Apologies for this
confusion. It will go away in DendroPy 4. The ``split_edges`` dictionary
of unrooted trees is a special dictionary that automatically normalizes
any key passed to it, so both the normalized and unnormalized versions
of keys reference the same value.
(5) The complexity described above does not hold for rooted trees

-- jeet
> >> an email to dendropy-user...@googlegroups.com <javascript:>
> >> <mailto:dendropy-user...@googlegroups.com <javascript:>>.
> >> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
> >
>
> --
>
>
>
> --------------------------------------
> Jeet Sukumaran
> --------------------------------------
> jeetsu...@gmail.com <javascript:>
> --------------------------------------
> Blog/Personal Pages:
> http://jeetworks.org/
> GitHub Repositories:
> http://github.com/jeetsukumaran <http://github.com/jeetsukumaran>
> <http://www.flickr.com/photos/jeetsukumaran/sets/>
> --------------------------------------
>
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
Reply all
Reply to author
Forward
0 new messages