polymorphisms


Genevieve

May 7, 2015, 10:44:24 PM
to bucky...@googlegroups.com
I ran MrBayes on about 1000 genes to generate input for BUCKy. However, most of these genes have thousands of trees sampled, so BUCKy is getting stuck. I'm wondering whether this has to do with the presence of ambiguous nucleotides in my alignments. I hadn't thought of masking these prior to running MrBayes. How are polymorphisms handled by BUCKy? Could this be why I am getting so many distinct splits?


Thanks!


Geneviève

Weisrock, David

May 8, 2015, 12:24:56 PM
to bucky...@googlegroups.com
Hi Genevieve,

BUCKy only sees the trees you input, so ambiguity character codings in the sequence data never come into play.

As far as I know (and I may not be up to date on this), MrBayes does not use ambiguity codings (e.g., R, Y, etc.) in an analysis. Even if you have them in your data set, they are treated as unknown characters (i.e., N). So, if you have a lot of polymorphic character positions, these are uninformative for your gene tree reconstructions.

It sounds like you are saying that your individual gene tree posterior distributions have many (thousands of) distinct topologies, which would indicate very little phylogenetic information in the data sets. This may mean that you should not expect BUCKy to be very informative about levels of concordance across loci.

I have not run BUCKy on that many loci before, but my recommendation would be to randomly subsample your loci and perform analyses on smaller sets of loci. Start with something small, like 50 loci, and work your way up to determine how many you can run in a single analysis. Perhaps you could repeat this a bunch of times to gauge average estimates of concordance. Just a thought.
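
The random subsampling suggested above could be sketched roughly as follows. This is only an illustration in Python: the locus names, the subset size of 50, and the 10 replicates are placeholder assumptions, and running BUCKy on each subset is left out.

```python
import random


def subsample_loci(loci, size, replicates, seed=0):
    """Draw `replicates` random subsets of `size` loci each.

    Sampling is without replacement within a subset; different
    subsets may share loci. A fixed seed keeps the draw repeatable.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(loci, size)) for _ in range(replicates)]


# Hypothetical locus names standing in for the ~1000 genes.
loci = [f"locus{i:04d}" for i in range(1, 1001)]

# Start small (50 loci) and repeat to gauge average concordance estimates.
subsets = subsample_loci(loci, size=50, replicates=10)
```

Each subset would then be analysed separately, and the subset size increased until a single run is no longer feasible.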

Dave


--
You received this message because you are subscribed to the Google Groups "BUCKy users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bucky-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Department of Biology
University of Kentucky
101 Thomas Hunt Morgan Building
Lexington, KY 40506



Cécile Ané

May 8, 2015, 1:02:16 PM
to bucky...@googlegroups.com
Hi Geneviève,

You might be getting many distinct trees from each gene if these genes
have few informative sites, or if there are many taxa. Polymorphisms
contribute less information than non-ambiguous nucleotides, but they
should not be the main reason why each gene has thousands of distinct
trees. BUCKy definitely works better with fewer taxa. If your scientific
hypothesis is focused on a particular difficult edge in your species
tree, you could sample taxa around that edge to reduce the number of
taxa and still address your main question.

Cecile.

--
Cecile Ane
Departments of Statistics and of Botany
University of Wisconsin - Madison
www.stat.wisc.edu/~ane/

CALS statistical consulting lab:
www.cals.wisc.edu/calslab/stat_consulting.php

Cécile Ané

May 8, 2015, 1:24:21 PM
to bucky...@googlegroups.com
One approach we've taken to subsample loci was to rank them by
'informativeness', based on the number of distinct trees in their
MrBayes MCMC sample. We then took as many loci as computationally
possible, starting from those with fewest distinct trees.
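
A rough Python sketch of this ranking is given below. It assumes each locus has a per-gene tree summary with one line per distinct topology, of the form `<newick>; <count>`, and that each topology is always written the same way; both points are assumptions to check against your own files. The gene names and counts are made up.

```python
def count_distinct_trees(lines):
    """Count distinct topologies in lines assumed to look like '<newick>; <count>'."""
    topologies = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop the trailing count and keep the topology string.
        topology = line.rsplit(None, 1)[0]
        topologies.add(topology)
    return len(topologies)


def pick_most_informative(summaries, max_loci):
    """Return up to `max_loci` locus names, fewest distinct trees first."""
    ranked = sorted(
        summaries,
        key=lambda name: (count_distinct_trees(summaries[name]), name),
    )
    return ranked[:max_loci]


# Made-up summaries for three hypothetical loci.
summaries = {
    "geneA": ["(1,(2,(3,4))); 900", "(1,(3,(2,4))); 100"],
    "geneB": ["(1,(2,(3,4))); 1000"],
    "geneC": ["(1,(2,(3,4))); 400", "(1,(3,(2,4))); 350", "(1,(4,(2,3))); 250"],
}

print(pick_most_informative(summaries, 2))  # ['geneB', 'geneA']
```

Here `max_loci` stands for whatever number of loci turns out to be computationally feasible.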

On 05/08/2015 11:24 AM, Weisrock, David wrote:
> [...] I have not run BUCKy on that many loci before, but my recommendation
> would be to randomly subsample your loci and perform analyses on smaller
> sets of loci. Start with something small, like 50 loci, and work your
> way up to determine how many you can run in a single analysis. Perhaps
> you could repeat this a bunch of times to gauge average estimates of
> concordance. Just a thought. [...]

Bret Larget

May 8, 2015, 1:25:09 PM
to bucky...@googlegroups.com
In addition, MrBayes does handle ambiguous nucleotides in an appropriate manner. If there is an R, for example, the likelihood is computed using the sum of the probabilities of that character being an A or a G, and each such ambiguous character is handled similarly.
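
To illustrate the idea of summing over the compatible states, here is a small Python sketch. It is not MrBayes code: the Jukes-Cantor (JC69) model is used only as a simple stand-in for whatever substitution model is actually configured, and the function names are made up for this example.

```python
import math

# IUPAC nucleotide codes mapped to the states they are compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def jc69_prob(parent, child, t):
    """JC69 probability of parent -> child along a branch of length t
    (expected substitutions per site). Stand-in model, assumed here."""
    decay = math.exp(-4.0 * t / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def tip_likelihood(parent, observed, t):
    """Probability of the observed tip code given the parent state,
    summed over every nucleotide the (possibly ambiguous) code allows."""
    return sum(jc69_prob(parent, state, t) for state in IUPAC[observed])


# An R at the tip contributes P(A) + P(G); an N contributes a total of 1,
# which is why N carries no information while R still carries some.
print(tip_likelihood("A", "R", 0.1))
print(tip_likelihood("A", "N", 0.1))
```

Under this sketch an R favours parent states A or G over C or T, whereas an N gives the same value for every parent state.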

BUCKy assumes that the samples from each locus are accurate estimates of the posterior distribution over all trees given the data at that locus. If the data from a locus is not informative enough to measure the probabilities of individual most probable trees accurately, then BUCKy may not fare so well. Having lots of genes need not be a problem. Having lots of taxa means that tree space is huge and often means that some (most/all?) loci are not informative enough for tree-level probability measures to be accurate. I agree with Cecile that if you can address your scientific question with a subset of taxa, this may help BUCKy be useful to you.

-Bret Larget

Weisrock, David

May 8, 2015, 2:08:39 PM
to bucky...@googlegroups.com
Thanks for the clarification on MrBayes, Bret. I seem to remember that earlier versions didn't use ambiguities as information. Maybe I'm thinking of BEAST. Anyway, thanks!

Dave


