discordance between dadi simulated sfs and ms simulation sis

489 views
Skip to first unread message

afreed...@gmail.com

unread,
Jun 11, 2018, 2:34:49 PM6/11/18
to dadi-user
Hi Ryan,

I am using dadi for two purposes: 1) to test alternative hypotheses by fitting different models, and 2) to compare the genome-wide Fst frequency distribution between the inferred model and the observed data. Re: objective 2, I am trying to assess the extent to which the inferred model captures an excess of high fst sites between the two populations. Thus, my plan was to simulate the model inferred by dadi with ms and obtain the Fst distribution for those outputs.

THE PROBLEM:
The SFS projection using dadi.Spectrum.from_ms_file looks nothing like that produced by dadi using the inferred parameter set. The ms-based model has null values for a large portion of the ms space. 

dadi:


ms:


I read through much of the discussion on the google group to minimize the probability of making a rookie mistake. Here are the details:

RAD-SEQ:
I generated 134900 94bp long RAD-seq loci containing 205935 SNPs. Downsampling one SNP per locus left me with 56055 SNPs for input to dadi. Thus, L, following your previous recommendations in this forum = 94*134900*56055/205935 = 3451628. There's no ancestral genome available, so inference uses a folded sfs.

MODEL:
The best fitting 2-population model is one with divergence accompanied by gene flow that stops relatively recently. The point at which gene flow stops is also the one at which population size occurs on both branches. Setting the time of gene flow cessation and pop size change to be simultaneous, while not entirely realistic, represents a tradeoff to avoid over-parameterizing. The residuals are centered on zero, which suggests the model does a decent job. It also has unequivocal support (based upon AIC) relative to 13 other 2-population models, as well as 17 other 3-population models that accounts for substructure within one of the populations. I attribute the difference between the dadi-simulated model and the observed data to unaccounted for population structure within the two defined populations (that may be incipient species). These are lizards with low dispersal, and populations sampled over several hundred kilometers so genetic turnover within populations is somewhat expected. 

Forward in time dadi parameters:
theta = 5911.67
pop 1 relative pop size post divergence = .1009
pop 2 relative pop size post divergence = .5933
T1, time from divergence until gene flow ceases = 5.3655
T2, time from cessation of gene flow until present = 0.0269
m12 = migration from pop 2 into 1 = 1.1616
m21 = migration from pop 1 into 2 = 0.4132
pop 1 relative pop size after size change = 29.9198
pop 2 relative pop size after size change = 8.7145

To simulate unlinked RAD loci, I generate a single-locus ms command by adjusting theta to be locus specific: 5911.67 * 94/3451628, i.e. multiply theta x individual rad locus length/total estimated length (L).


MS SIM SETUP
Basically, I simulate the set of unlinked rad loci and sum their respective sfs:

ms_fs = dadi.Spectrum.from_ms_file(popen('ms 306 56055 -t 0.160995617141 -I 2 91 215 -n 1 29.9198 -n 2 8.7145 -en 0.01345 1 0.1009 -en 0.01345 2 0.5933 -em 0.01345 1 2 0.8264 -em 0.01345 2 1 2.3232 -ej 2.6962 1 2),average=False)
folded_sfs=dadi.Spectrum.fold(ms_fs)
fig = pylab.figure(1)
fig.clear()
dadi.Plotting.plot_single_2d_sfs(folded_sfs,vmin=0.005)

note that the vmin setting was the same used for the plotting of the data and dadi inferred sfs.

MS ARGS IN DETAIL
-n 1 29.9198 : relative pop size of pop 1 (post expansion) at present
-n 2 8.7145 : relative pop size of pop 2 (post expansion) at present
-en 0.01345 1 0.1009: going backward in time, pop 1 size changes, in ms units, at 0.0269/2 = 0.01345 to .1009
-en 0.01345 2 0.5933 : similarly, pop 2 size changes at same time as pop 1 to .5933 

### time parameter conversions ###
dadi parameters are forward in time, while ms is backwards. m12-dadi, the fraction of pop 1 comprised of migrants from pop 2, is equivalent in coalescent terms to m21 for ms. thus, given that dadi rates are multiplied by 2 to get equivalent ms rates:

 -em 0.1345 1 2 0.8264  : M12, the # of individuals from pop 1 comprised of backward in time migrants from pop 2
 -em 0.1345 2 1 2.3232 : M21 the # of individuals from pop 2 comprised of backward in time migrants from pop 1
 -ej 2.6962 1 2 : the full length of the genealogy in ms units is (T1+T2)/2 = (5.3655 + 0.0269)/2 = 2.6962, at which time all lineages from population 1 are moved to population 2
 
I'd be happy to have you point out a rookie mistake someplace, and welcome any other suggestions if I haven't!

Thanks,
Adam
 

Ryan Gutenkunst

unread,
Jun 11, 2018, 3:49:59 PM6/11/18
to dadi...@googlegroups.com
Hello Adam,

I don’t see any obvious errors, but I haven’t run any code.

Try a few edge cases to narrow down where the problem is.
1) If you set migration to zero, do you get agreement?
2) If you keep migration throughout the time, do you get agreement?
3) What if you remove population growth, and just leave them at fixed size?

If that debugging doesn’t help you isolate the problem, then send me the dadi model code in addition the ms code, and I can poke at it some.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

Adam Freedman

unread,
Jun 12, 2018, 12:50:41 PM6/12/18
to dadi...@googlegroups.com
Hi Ryan,

In the middle of generating the edge cases. One clear thing, is that removing gene flow severely exacerbates the wonkiness. ms_fs = dadi.Spectrum.from_ms_file(popen('ms 306 56055 -t 0.160995617141 -I 2 91 215 -n 1 29.9198 -n 2 8.7145 -en 0.01345 1 0.1009 -en 0.01345 2 0.5933 -ej 2.6962 1 2'),average=False) produces the following:



so clearly, the way gene flow is parameterized is not causing the issue. more to follow later in between putting out other fires ....

but thanks for getting back to me so quickly yesterday!

-Adam

To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+unsubscribe@googlegroups.com.

To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to a topic in the Google Groups "dadi-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dadi-user/N623Qe75iDs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dadi-user+unsubscribe@googlegroups.com.

Ryan Gutenkunst

unread,
Jun 12, 2018, 1:03:09 PM6/12/18
to dadi-user
Hi Adam,

But does that spectrum agree with your dadi model with gene flow=0? You need to be making the comparisons here to track down where the error is in translating between ms and dadi.

Best,
Ryan

On Jun 12, 2018, at 9:50 AM, Adam Freedman <afreed...@gmail.com> wrote:

Hi Ryan,

In the middle of generating the edge cases. One clear thing, is that removing gene flow severely exacerbates the wonkiness. ms_fs = dadi.Spectrum.from_ms_file(popen('ms 306 56055 -t 0.160995617141 -I 2 91 215 -n 1 29.9198 -n 2 8.7145 -en 0.01345 1 0.1009 -en 0.01345 2 0.5933 -ej 2.6962 1 2'),average=False) produces the following:

<nogeneflow.png>

To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.

Adam Freedman

unread,
Jun 12, 2018, 1:16:54 PM6/12/18
to dadi...@googlegroups.com
I see what you're saying. stay tuned ...

Sarah Flanagan

unread,
Jul 31, 2019, 12:27:01 AM7/31/19
to dadi-user
Were you able to resolve this issue? I am running into a similar issue. In my case, I have ddRADseq data and I generated 251339 RAD loci (each 95bp long) with a total of 1370051 SNPs. I filtered and down-sampled to a dataset with 12103 unpolarized SNPs, each from a different RAD locus. I compared a number of models using Daniel Portik's dadi-pipeline and a model with ongoing asymmetric migration since a population split had the best fit.

The following forward-in-time dadi parameters were estimated from the best-fitting model:

Size of population 1 after split (nu1) = 0.1317
Size of population 2 after split (nu2) = 8.4225
Time in the past of split in units of 2*Na generations (T) = 0.1441
Migration from pop 2 to pop 1 (m12) = 0.7777
Migration from pop 1 to pop 2 (m21) = 0.0561
theta = 227.83

Based on what I've seen on this forum, I converted these parameters to ms arguments as follows:
-m 1 2 1.5554 [m12*2]
-m 2 1 0.1122 [m21*2]
-n 1 0.1317
-n 2 0.84225
-ej 0.07205 2 1 [T/2]

For theta, I converted it using the previously-recommended formula of (dadiTheta * locusLength) / ( (numSNPused * numLoci * locusLength) / totalNumSNPs ), which was:
-t = (227.83  * 95) / ( (12103 * 251339 * 95) / 1370051 )  = 0.1026112

like this:
core="-m 1 2 1.5554 -m 2 1 0.1122 -n 1 0.1317 -n 2 8.4225 -ej 0.07205 2 1"
command
=dadi.Misc.ms_command(theta=0.1026112, ns=(60,70), core=core, iter=12103)
ms_fs
=dadi.Spectrum.from_ms_file(os.popen(command))
folded_sfs
=dadi.Spectrum.fold(ms_fs)
dadi
.Plotting.plot_single_2d_sfs(folded_sfs,vmin=0.000001)


The sfs of asym_mig with these parameters looks like this:

FL_asym_mig_folded.png



But the output of the ms run above looks like this:

FL_ms_asym_mig_folded.png



Following the suggestions in this post, I tested it with equal migration between the two populations, no migration, and without the -ej term (the attached script has all of the code for these exploratory analysis). I found that removing the -ej term improved the ms-simulated sfs:

FL_ms_asym_mig_folded_noEj.png

However I'm not convinced that this ms model is the appropriate one for my dadi model.

Is anyone able to spot an issue in my implementation of the ms code? Has anyone successfully solved this type of issue?

Thank you in advance for your help!

Sarah
To unsubscribe from this group and stop receiving emails from it, send an email to dadi...@googlegroups.com.

To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to a topic in the Google Groups "dadi-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dadi-user/N623Qe75iDs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dadi...@googlegroups.com.

To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi...@googlegroups.com.

To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to a topic in the Google Groups "dadi-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dadi-user/N623Qe75iDs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dadi...@googlegroups.com.
257_debug_ms.py

Ryan Gutenkunst

unread,
Jul 31, 2019, 1:51:43 PM7/31/19
to dadi-user
Hi Sarah,

Your demographic parameter conversion looks fine to me. But I don’t follow the theta conversion formula you’re using. You want your bootstrap data sets to have the same total theta as your original data data set. Are you downsampling the ms simulations to get to one SNP per locus?

It’s hard to tell whether your ms and dadi simulations agree, because the ms simulation is so sparse. Run many ms simulations then average the results to compare with the dadi model (and look at the residuals to see if there is systematic difference).

Best,
Ryan

> On Jul 30, 2019, at 9:27 PM, Sarah Flanagan <spflana...@gmail.com> wrote:
>
> Were you able to resolve this issue? I am running into a similar issue. In my case, I have ddRADseq data and I generated 251339 RAD loci (each 95bp long) with a total of 1370051 SNPs. I filtered and down-sampled to a dataset with 12103 unpolarized SNPs, each from a different RAD locus. I compared a number of models using Daniel Portik's dadi-pipeline and a model with ongoing asymmetric migration since a population split had the best fit.
>
> The following forward-in-time dadi parameters were estimated from the best-fitting model:
>
> Size of population 1 after split (nu1) = 0.1317
> Size of population 2 after split (nu2) = 8.4225
> Time in the past of split in units of 2*Na generations (T) = 0.1441
> Migration from pop 2 to pop 1 (m12) = 0.7777
> Migration from pop 1 to pop 2 (m21) = 0.0561
> theta = 227.83
>
> Based on what I've seen on this forum, I converted these parameters to ms arguments as follows:
> -m 1 2 1.5554 [m12*2]
> -m 2 1 0.1122 [m21*2]
> -n 1 0.1317
> -n 2 0.84225
> -ej 0.07205 2 1 [T/2]
>
> For theta, I converted it using the previously-recommended formula of (dadiTheta * locusLength) / ( (numSNPused * numLoci * locusLength) / totalNumSNPs ), which was:
> -t = (227.83 * 95) / ( (12103 * 251339 * 95) / 1370051 ) = 0.1026112
>
> like this:
> core="-m 1 2 1.5554 -m 2 1 0.1122 -n 1 0.1317 -n 2 8.4225 -ej 0.07205 2 1"
> command=dadi.Misc.ms_command(theta=0.1026112, ns=(60,70), core=core, iter=12103)
> ms_fs=dadi.Spectrum.from_ms_file(os.popen(command))
> folded_sfs=dadi.Spectrum.fold(ms_fs)
> dadi.Plotting.plot_single_2d_sfs(folded_sfs,vmin=0.000001)
>
>
> The sfs of asym_mig with these parameters looks like this:
> <FL_asym_mig_folded.png>
>
>
>
> But the output of the ms run above looks like this:
> <FL_ms_asym_mig_folded.png>
>
>
>
> Following the suggestions in this post, I tested it with equal migration between the two populations, no migration, and without the -ej term (the attached script has all of the code for these exploratory analysis). I found that removing the -ej term improved the ms-simulated sfs:
>>> ms:
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/d41cabfd-4eb6-4903-8d4c-f4e79005bb52%40googlegroups.com.
> <257_debug_ms.py><FL_asym_mig_folded.png><FL_ms_asym_mig_folded.png><FL_ms_asym_mig_folded_noEj.png>

Sarah Flanagan

unread,
Aug 7, 2019, 5:28:00 PM8/7/19
to dadi-user
Hi Ryan,

Thank you for your response! I have a few follow-up questions:
  1. To calculate theta I was following what the original post had done, which was: (dadi's theta estimate)*(individual rad locus length)/(total estimated length (L)), where L = (length of RAD loci)*(total number of RAD loci)*(number of SNPs in the analysis)/(total number of SNPs). Your response prompted me to scour the dadi-user group more thoroughly and I see that this estimate differs slightly from others that have been suggested. For instance, in this post, the conversion to ms theta is (daid-theta)*(length of sequence)/(total length of sequence used), and in this post, the conversion appears to be (dadi-theta)/L, where L = (total number of rad loci)*(number of SNPs in the analysis)*(total number of SNPs). What would you say is the best way to convert dadi's estimated theta to the input theta for ms for RAD-seq data?
  2. I think I am a bit confused about specifying the number of loci in ms. I thought the number of iterations is equivalent to the number of SNPs simulated by ms (and I had used the number of SNPs in my dataset) -- is there a different way to specify the number of SNPs, and the number of SNPs per locus, in ms?
  3. How do you recommend that I average the spectra produced by multiple ms runs to compare the models?
Thank you so much for your help and the time you put into answering all of the questions here on the forum!

All the best,
Sarah





On Thursday, August 1, 2019 at 5:51:43 AM UTC+12, Ryan Gutenkunst wrote:
Hi Sarah,

Your demographic parameter conversion looks fine to me. But I don’t follow the theta conversion formula you’re using. You want your bootstrap data sets to have the same total theta as your original data data set. Are you downsampling the ms simulations to get to one SNP per locus?

It’s hard to tell whether your ms and dadi simulations agree, because the ms simulation is so sparse. Run many ms simulations then average the results to compare with the dadi model (and look at the residuals to see if there is systematic difference).

Best,
Ryan

> To unsubscribe from this group and stop receiving emails from it, send an email to dadi...@googlegroups.com.

Ryan Gutenkunst

unread,
Aug 7, 2019, 7:50:18 PM8/7/19
to dadi-user
Hi Sarah,

Maybe it’s time to back up here. Why are you running ms again? Typically we would only do coalescent simulations if we wanted to test how linkage affects our data. But if you’re simulating RAD seq loci independently and only taking one SNP per locus, then you’re generating data with no linkage anyways. In this case, the only difference between dadi and ms would be if you made an error in converting model specifications.

If you’re assuming you’re loci are unlinked, then the likelihood dadi calculates is correct, and you can do model selection with likelihood ratio tests, etc.

Best,
Ryan
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/3b03f540-eb61-4f3d-a2d8-1d6cf69e6c58%40googlegroups.com.

Sarah Flanagan

unread,
Aug 7, 2019, 9:45:42 PM8/7/19
to dadi-user
Hi Ryan,

I would like to compare the distribution of Fsts in the observed data to a null distribution, so I'm trying to simulate data using the best-inferred model for my dataset. I was under the impression that dadi does not simulate data that can be used for this purpose, so I'm trying to use ms to generate simulated data. Do you have a suggestion for a better way to do this?

Thank you again for your rapid response!

All the best,
Sarah

On Thursday, August 8, 2019 at 11:50:18 AM UTC+12, Ryan Gutenkunst wrote:
Hi Sarah,

Maybe it’s time to back up here. Why are you running ms again? Typically we would only do coalescent simulations if we wanted to test how linkage affects our data. But if you’re simulating RAD seq loci independently and only taking one SNP per locus, then you’re generating data with no linkage anyways. In this case, the only difference between dadi and ms would be if you made an error in converting model specifications.

If you’re assuming you’re loci are unlinked, then the likelihood dadi calculates is correct, and you can do model selection with likelihood ratio tests, etc.

Best,
Ryan

Ryan Gutenkunst

unread,
Aug 8, 2019, 11:54:47 AM8/8/19
to dadi...@googlegroups.com
Hi Sarah,

The suitability of dadi for this depends on what assumptions you’re willing to make. Are you assuming your RADseq loci are independent? If not, you’ll need a much more complex simulation scheme to capture that dependence anyways. If yes, then you don’t need to use coalescent simulations.

Best,
Ryan
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/6202cbda-7583-4778-b08f-5ec6dca0c91d%40googlegroups.com.

Reply all
Reply to author
Forward
0 new messages