CSV format for diploid data in BEASTvntr

363 views
Skip to first unread message

Santiago Sánchez

unread,
Mar 7, 2017, 12:26:52 PM3/7/17
to beast...@googlegroups.com
Dear Users,

I've been trying out the BEASTvntr (microsatellite inference in BEAST2) package, and up to know I've successfully managed to run the example file based on the instruction is in the GitHub page (e.g. single partition and constant coalescent). However my goal is to use the EBSP model. So I have a couple of questions (probably for the developers):

1) If I understand correctly, for a multi-locus analysis you need to select the "multiple partitions" option. Now with multiple partitions, would it make sense to infer gamma for between locus variation. In the example it is suggested to  estimate gamma with 6 categories. I was assuming this was done because basically all loci are being concatenated. But is this needed for a multi-locus analysis?

2) The example input file is in straight forward in csv format. But how to specify loci for diploid organisms? The example file is based on data from bacteria.

3) I recall that in previous BEAST 1.x versions one could fix an average rate among loci. Ideally I would like to do this for my analysis (e.g. fix a mean rate, allowing rates to be different between loci). How to do this in BEAST2?

Thanks in advance,
Santiago
--
==========================
Santiago Sanchez-Ramirez, PhD
Postdoctoral Associate
Ecology and Evolutionary Biology
University of Toronto
==========================

s.g.

unread,
Mar 8, 2017, 10:44:19 AM3/8/17
to beast-users

Hi Santiago!
I also tried yesterday the example from BEASTvntr, but ran into the error ' Could not find a proper state to initialize'. I used BEAST2.43-2.45. Did you changed anything that was not mentioned in the readme? Which beast version did u use?


3) For multiple partitions sharing the same tree, one partition can be chosen as a reference and the remainder can have rates estimated relative to the reference

Best,
sven

Santiago Sánchez

unread,
Mar 8, 2017, 11:31:22 AM3/8/17
to beast-users
Hi Sven,

For the example file, I followed exactly as instructed in the README file (or the text in the GitHub page, for that matter). Can you show me what error you get? Specifically, which part of the model is failing in the likelihood computation.

Just to give further info, I actually managed to run the EBSP model. Here are some points for advice/discussion:

1) To format my diploid data, I spliced alleles into "a" and "b", using the same intuition as when using sequence data. Given that the assumption is that there is free recombination between loci, I thought the order of the alleles into "a" and "b" wouldn't matter much. So for instance, if the data looked like this, with two columns per locus, and the numbers being the number of repeats:

ind01,2,3,5,6,2,1,3,2
ind02,2,2,5,6,2,4,3,2
...

My spliced file looked like:

ind01a,2,5,2,3
ind01b,3,6,1,2
ind02a,2,5,2,3
ind02b,2,6,4,2
...

Any input this part would be greatly appreciated.

2) For the substitution model, I linked a single model for all loci (partitions), and used gamma for rate heterogeneity between loci (6 categories, as suggested in the tutorial). Everything else in the site model tab I left as is was.

3) For the clock model, I kept a strict clock and fixed the rate of the first locus to a known prior rate. All the others were estimated. This was the only workaround I could think about, given that I don't know how to specify an average rate for all loci.

4) In the prior tab, I selected the tree prior for each locus (EBSP) and specified gamma distributions for all clock.rate priors. I also used a lognormal distribution with a sensible mean for the population.mean prior.

5) Finally, I specified the number of generations, and adjusted sampling rate for everything to a ratio of 1:10.

Any comment/suggestion/advice is welcome!

Cheers,
Santiago


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at https://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/d/optout.

carlo pacioni

unread,
Mar 8, 2017, 6:46:15 PM3/8/17
to beast-users, arjun.sin...@gmail.com
Hi Santiago,
thanks for sharing this.

I have only very quickly had a look at this a while back and I agree with you that when multi-partitions are used, each column is a locus and diploid data need to be re-organised with two rows per individual.
As for your initial email, I'd have thought that there is no need to estimate a gamma for rate heterogeneity (your point #2). Is there a reason why you used a rate heterogeneity?

I also have a few other questions re this package, I hope you don't mind if attach them to your post:
I was curious to know whether anyone knows how/whether missing data can be included in msat analysis in BEAST2. The standard  '?' it is not recognised.
Regarding the substitution model parameterization is there any recommendation on which one to use (i.e. when to use which?).

Lastly, I am having some problem to link substitution models in BEAUti, did you have this problem too?

Cheers,
carlo

Arjun Dhawan

unread,
Mar 9, 2017, 8:33:59 AM3/9/17
to beast-users, arjun.sin...@gmail.com, Don Klinkenberg
Hi there,

I am the maintainer of beastvntr. Recently, a new version of beastvntr (0.1.1) came out, which properly recognizes missing data (such as '?'). The multiple partitions option was originally designed to be able to estimate model parameters for each locus separately. In that case, it would make no sense to have multiple categories, since each partition only holds 1 site. The idea of having multiple categories is to allow for rate variation among sites.

Considering the spliced alleles, I am not sure if I completely understand what you want to achieve, but I believe what you want is this:

fileA.csv:
ind01,2,5,2,3
ind02,2,5,2,3

fileB.csv
ind01,3,6,1,2
ind02,2,6,4,2

You should import both files (importing each as a single partition), and then link the Trees of both partitions. This causes the MCMC to run with separate model parameters for each allele (if you want, you can still link some of the model parameters), but having a shared phylogeny.
In this particular case, it can make sense to have rate categories, to account for the different rates that might be present on each of the 4 sites.

Hope this helps.

Cheers,

Arjun

Op donderdag 9 maart 2017 00:46:15 UTC+1 schreef carlo pacioni:

s.g.

unread,
Mar 9, 2017, 9:10:52 AM3/9/17
to beast-users
Hi Santiago,
also thanks for sharing, I get the error right at the initialisation. So the likelihood computation does not start.
Cheers,
sven
----------------------------------------------------------------------------------------
Start likelihood: -Infinity after 10 initialisation attempts
P(posterior) = -Infinity (was -Infinity)
    P(prior) = -166.74417605958266 (was -166.74417605958266)
        P(CoalescentConstant.t:comas2009_VNTR) = -166.94814886390859 (was -166.94814886390859)
        P(GammaShapePrior.s:comas2009_VNTR) = -1.0 (was -1.0)
        P(PopSizePrior.t:comas2009_VNTR) = 1.2039728043259361 (was 1.2039728043259361)
    P(likelihood) = -Infinity (was -Infinity)
        P(treeLikelihood.comas2009_VNTR) = -Infinity (was -Infinity)
Fatal exception: Could not find a proper state to initialise. Perhaps try another seed.
java.lang.RuntimeException: Could not find a proper state to initialise. Perhaps try another seed.
    at beast.core.MCMC.run(Unknown Source)
    at beast.app.BeastMCMC.run(Unknown Source)
    at beast.app.beastapp.BeastMain.<init>(Unknown Source)
    at beast.app.beastapp.BeastMain.main(Unknown Source)
    at beast.app.beastapp.BeastLauncher.main(Unknown Source)
Fatal exception: Could not find a proper state to initialise. Perhaps try another seed.
java.lang.RuntimeException: An error was encounted. Terminating BEAST
    at beast.app.util.ErrorLogHandler.publish(Unknown Source)
    at java.util.logging.Logger.log(Logger.java:738)
    at java.util.logging.Logger.doLog(Logger.java:765)
    at java.util.logging.Logger.log(Logger.java:788)
    at java.util.logging.Logger.severe(Logger.java:1464)
    at beast.app.beastapp.BeastMain.<init>(Unknown Source)
    at beast.app.beastapp.BeastMain.main(Unknown Source)
    at beast.app.beastapp.BeastLauncher.main(Unknown Source)
----------------------------------------------------------------------------------------

Arjun

unread,
Mar 9, 2017, 9:32:28 AM3/9/17
to beast...@googlegroups.com
Hi Sven,

Can you send the XML file that you are trying to run?

Cheers,

Arjun

--
You received this message because you are subscribed to a topic in the Google Groups "beast-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/beast-users/8qwtobmIWqo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to beast-users+unsubscribe@googlegroups.com.

To post to this group, send email to beast...@googlegroups.com.

s.g.

unread,
Mar 9, 2017, 10:23:59 AM3/9/17
to beast-users
Hi Arjun,
sure, thanks for having a look at it!

Cheers,
sven
To unsubscribe from this group and all its topics, send an email to beast-users...@googlegroups.com.

To post to this group, send email to beast...@googlegroups.com.
comas2009_VNTR.xml

Santiago Sánchez

unread,
Mar 9, 2017, 11:24:39 AM3/9/17
to beast-users
Hello everyone,

Thanks for joining the discussion.

I'm still not sure whether to use gamma or not (I'm still trying to think in terms sequence data), given that I'm linking the substitution model across partitions, and thus trying to account for rate differences among loci (?). But, yes it probably makes more sense to not use gamma.

Carlo: great questions! I simply excluded all individuals with missing data. But now that Arjun clarified that new version allows for "?" this is a much better option. I also would like to know more details about the specification of substitution models.

Arjun: I'm not sure if what you are suggesting (two files as single partitions) is the right way to go. If both single partition files are linked, this would result in a single concatenated tree, wouldn't it? In a multi-locus analysis all loci must have separate trees. And in diploid data all individuals are either homozygous (two copies of the same allele) or heterozygous (two different alleles), the two copies being independent. Splicing the two copies into "a" and "b" would be enough to capture the variation within a locus, regardless of whether two alleles belong to the same individual or not.

Sven: Probably Arjun is better equipped to help you with this ;-)

Cheers,
Santiago

Arjun

unread,
Mar 9, 2017, 12:44:54 PM3/9/17
to beast...@googlegroups.com
Santiago: I now understand what you are trying to do. Linking the trees would indeed be inappropriate in your case.

Sven: the XML file you created doesn't hold any information about the substitution model. I am not sure why this happens in your case, usually Beauti should select the Sainudiin model by default. Perhaps clicking on the Sainudiin model from the dropdown menu in the Site model tab will resolve it. I have included an XML that does work (you can check out the difference between this file, and yours, and see what I mean).

Cheers,
Arjun

To unsubscribe from this group and stop receiving emails from it, send an email to beast-users+unsubscribe@googlegroups.com.

To post to this group, send email to beast...@googlegroups.com.
Visit this group at https://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/d/optout.
--
==========================
Santiago Sanchez-Ramirez, PhD
Postdoctoral Associate
Ecology and Evolutionary Biology
University of Toronto
==========================

--
You received this message because you are subscribed to a topic in the Google Groups "beast-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/beast-users/8qwtobmIWqo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to beast-users+unsubscribe@googlegroups.com.
comas2009_VNTR.xml
Reply all
Reply to author
Forward
0 new messages