theta from simulated data

Michael Harvey

unread,

Mar 26, 2013, 4:22:44 PM3/26/13

to dadi...@googlegroups.com

Ryan,

I have more of a popgen question, but it hinges on how dadi estimates theta. I ran some simulations, and I am trying to recover the parameter values used in the simulations using dadi, but I am having trouble equating the theta estimates from dadi to the theta value I used in ms. For example, I simulated 50,000 loci of 100bp each (5,000,000 total sites). These contained 106,348 SNPs, of which 105,641 were biallelic. I randomly selected one biallelic SNP per locus for analysis, resulting in a set of 43,654 SNPs. My theta estimates from dadi are all approximately 1.0. I simulated data, however, using per-locus theta values of 0.2 (0.002 per site). I am trying to convert the parameter estimate to see if I can recover the parameter value I used in the simulation.

I think I need to correct for the fact that I only used 1 of every ~2.4 SNPs in the analysis. I was thinking I could do this as follows:

(dadi theta estimate)/(number of sites in dadi analysis)*(total number of SNPs)/(number of SNPs in analysis) = corrected per-site theta

This, however, results in the following:

(1.0/43654)*(106348/43654) = 5.5E-5/site

Which is quite different from the expected 0.002/site. Is this due to estimation error in dadi? Or (more likely) am I doing something wrong?

Many thanks for any assistance,

Mike Harvey

Gutenkunst, Ryan N - (rgutenk)

unread,

Mar 26, 2013, 6:35:56 PM3/26/13

to dadi...@googlegroups.com

Hello Mike,

First, I'm surprised you're getting theta estimates from dadi of approximately 1. They should be much larger with that many SNPs. How are you getting theta from dadi? Typically you would fit the model, which by default takes theta=1, then use optimal_sfs_scaling to get the corresponding theta. (Since the frequency spectrum is simply linear in theta.)

The estimate of theta dadi provides can be equated with 4 Ne mu L. Here L is the length of sequence from which the SNPs were derived, not the number of SNPs themselves. The estimate from dadi should align well with the input to ms, if you were using all the SNPs ms gave you. In this case, since you only kept ~1/2.4 SNPs in your data set, the "effective length of sequence" used for the dadi is smaller by that factor, so you would need to correct for that.

So in short the dadi result should match up with the ms result of total effective theta = 0.2 per locus * 50,000 loci * 1/2.4 ~ 4,000.

Does this help?

Best,

Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user?hl=en-US.
For more options, visit https://groups.google.com/groups/opt_out.

--

Ryan Gutenkunst

Assistant Professor

Molecular and Cellular Biology

University of Arizona

phone: (520) 626-0569

http://gutengroup.mcb.arizona.edu

Michael Harvey

unread,

Mar 26, 2013, 8:24:27 PM3/26/13

to dadi...@googlegroups.com

Ryan,

Yes, that is very helpful. I was confused about how to use optimal_sfs_scaling to obtain theta. But when I print the optimal scaling it outputs something on the order of 4230, which (as you note) is very similar to what I would expect. I assume I simply multiply this by the optimized relative theta values (which are close to 1) to obtain the theta for each population?

Thanks again,

Mike

Gutenkunst, Ryan N - (rgutenk)

unread,

Mar 27, 2013, 10:06:08 AM3/27/13

to dadi...@googlegroups.com

Hi Mike,

Yes. By default dadi defines the ancestral population to have size "1" and measures all other sizes relative to it. Theta is also defined in terms of that population size. So you would just multiply the relative sizes dadi infers by the optimal_sfs_scaling to get the corresponding population-specific thetas.

Best,

Ryan

dan...@berkeley.edu

unread,

Jun 16, 2017, 1:37:06 PM6/16/17

to dadi-user

Hi Ryan,

Tagging along Mike's question. I am not exactly sure what is my L value in correcting theta.

Below is how I filtered down to the SNP dataset I used for dadi:

I used ddRADseq to acquire my sequences. Instead of doing de novo assembly to generate RAD tags, we mapped our sequences to a draft genome and that gave us an overall of ~ 3 million sites. From ~3 million sites, we called SNPs, and that gave us 123,797 SNPs, we further filtered out possible paralogs and that leaves us 56,343 SNPs to be the input file for dadi.

The theta value that dadi reports was 753. The substitution rate we used is 1.1 x10^-8. If I used 3 million sites as L, then the Na is smaller than one should expected.

So I am not sure what is a reasonable way for deciding the number of sites (L) in my case.

Thank you,

Dana

Gutenkunst, Ryan N - (rgutenk)

unread,

Jun 20, 2017, 1:49:55 PM6/20/17

to dadi...@googlegroups.com

Hello Dana,

Your estimate for L should account for all filtering your did. In your case, I would adjust by L = 3e6 * 56,343/123,797.

Best,

Ryan

Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--

Ryan Gutenkunst
Associate Professor of Molecular and Cellular Biology, University of Arizona
phone: (520) 626-0569, office: LSS 325, web: http://gutengroup.mcb.arizona.edu

Latest papers:
“Selection on network dynamics drives differential rates of protein domain evolution”
PLoS Genetics; http://dx.doi.org/10.1371/journal.pgen.1006132
"Inferring demographic history using two-locus statistics"
Genetics; http://doi.org/10.1534/genetics.117.201251

Reply all

Reply to author

Forward