I am using dadi for two purposes: 1) to test alternative hypotheses by fitting different models, and 2) to compare the genome-wide Fst frequency distribution between the inferred model and the observed data. Re: objective 2, I am trying to assess the extent to which the inferred model captures an excess of high fst sites between the two populations. Thus, my plan was to simulate the model inferred by dadi with ms and obtain the Fst distribution for those outputs.
The SFS projection using dadi.Spectrum.from_ms_file looks nothing like that produced by dadi using the inferred parameter set. The ms-based model has null values for a large portion of the ms space.
I read through much of the discussion on the google group to minimize the probability of making a rookie mistake. Here are the details:
RAD-SEQ:
I generated 134900 94bp long RAD-seq loci containing 205935 SNPs. Downsampling one SNP per locus left me with 56055 SNPs for input to dadi. Thus, L, following your previous recommendations in this forum = 94*134900*56055/205935 = 3451628. There's no ancestral genome available, so inference uses a folded sfs.
MODEL:
The best fitting 2-population model is one with divergence accompanied by gene flow that stops relatively recently. The point at which gene flow stops is also the one at which population size occurs on both branches. Setting the time of gene flow cessation and pop size change to be simultaneous, while not entirely realistic, represents a tradeoff to avoid over-parameterizing. The residuals are centered on zero, which suggests the model does a decent job. It also has unequivocal support (based upon AIC) relative to 13 other 2-population models, as well as 17 other 3-population models that accounts for substructure within one of the populations. I attribute the difference between the dadi-simulated model and the observed data to unaccounted for population structure within the two defined populations (that may be incipient species). These are lizards with low dispersal, and populations sampled over several hundred kilometers so genetic turnover within populations is somewhat expected.
Forward in time dadi parameters:
theta = 5911.67
pop 1 relative pop size post divergence = .1009
pop 2 relative pop size post divergence = .5933
T1, time from divergence until gene flow ceases = 5.3655
T2, time from cessation of gene flow until present = 0.0269
m12 = migration from pop 2 into 1 = 1.1616
m21 = migration from pop 1 into 2 = 0.4132
pop 1 relative pop size after size change = 29.9198
pop 2 relative pop size after size change = 8.7145
To simulate unlinked RAD loci, I generate a single-locus ms command by adjusting theta to be locus specific: 5911.67 * 94/3451628, i.e. multiply theta x individual rad locus length/total estimated length (L).
MS SIM SETUP
Basically, I simulate the set of unlinked rad loci and sum their respective sfs:
ms_fs = dadi.Spectrum.from_ms_file(popen('ms 306 56055 -t 0.160995617141 -I 2 91 215 -n 1 29.9198 -n 2 8.7145 -en 0.01345 1 0.1009 -en 0.01345 2 0.5933 -em 0.01345 1 2 0.8264 -em 0.01345 2 1 2.3232 -ej 2.6962 1 2),average=False)
folded_sfs=dadi.Spectrum.fold(ms_fs)
fig = pylab.figure(1)
fig.clear()
dadi.Plotting.plot_single_2d_sfs(folded_sfs,vmin=0.005)
note that the vmin setting was the same used for the plotting of the data and dadi inferred sfs.
MS ARGS IN DETAIL
-n 1 29.9198 : relative pop size of pop 1 (post expansion) at present
-n 2 8.7145 : relative pop size of pop 2 (post expansion) at present
-en 0.01345 1 0.1009: going backward in time, pop 1 size changes, in ms units, at 0.0269/2 = 0.01345 to .1009
-en 0.01345 2 0.5933 : similarly, pop 2 size changes at same time as pop 1 to .5933
### time parameter conversions ###
dadi parameters are forward in time, while ms is backwards. m12-dadi, the fraction of pop 1 comprised of migrants from pop 2, is equivalent in coalescent terms to m21 for ms. thus, given that dadi rates are multiplied by 2 to get equivalent ms rates:
-em 0.1345 1 2 0.8264 : M12, the # of individuals from pop 1 comprised of backward in time migrants from pop 2
-em 0.1345 2 1 2.3232 : M21 the # of individuals from pop 2 comprised of backward in time migrants from pop 1
-ej 2.6962 1 2 : the full length of the genealogy in ms units is (T1+T2)/2 = (5.3655 + 0.0269)/2 = 2.6962, at which time all lineages from population 1 are moved to population 2
I'd be happy to have you point out a rookie mistake someplace, and welcome any other suggestions if I haven't!
Thanks,
Adam