Comparing models of migration with no migration

545 views
Skip to first unread message

Pedro

unread,
Feb 26, 2015, 7:17:24 AM2/26/15
to dadi...@googlegroups.com
Hi Ryan,

I'm working on some demographic analyses where I want to compare simple IM models with and without migration. Eventually I would also like to had an unsampled population to the inference, but that is other case (but please see my previous post related to some problems I'm having related to this; sorry for posting so much lately...).
For the IM models with asymmetric migration and no migration I have good parameter convergence (I guess, because estimates based on T that are in some way in agreement to expectations...). As expected, the model with migration gave higher T than the one without migration.

I now want to test whether the model with migration is significantly better than the model without migration. Reading the FAQs in dadi manual, I understand that I will need to simulate data sets under the best- fit model without migration, and then fit each data set using the no-migration model, and the migration model. I know the core parameters I will have to give to ms to generate such datasets, however, there are a few things that I do not fully understand, maybe because of my inexperience with ms. The dadi module dadi.Misc.ms_command() needs theta, recomb and rsites. What is the meaning of these paramters?
Is theta the one inferred with dadi.Inference.optimal_sfs_scaling()?
recomb is the assumed recombination rate. Is it the same as r2 (rsquare)? or is it other measure?
rsites is sites for recombination. Can this be the same value as the effective sequence length (L) used to calculate Nref?

Many thanks for your help.
Pedro

Gutenkunst, Ryan N - (rgutenk)

unread,
Feb 27, 2015, 7:55:50 AM2/27/15
to dadi...@googlegroups.com
Hi Pedro,

On Feb 26, 2015, at 5:17 AM, Pedro <p.alme...@gmail.com> wrote:
I now want to test whether the model with migration is significantly better than the model without migration. Reading the FAQs in dadi manual, I understand that I will need to simulate data sets under the best- fit model without migration, and then fit each data set using the no-migration model, and the migration model. I know the core parameters I will have to give to ms to generate such datasets, however, there are a few things that I do not fully understand, maybe because of my inexperience with ms. The dadi module dadi.Misc.ms_command() needs theta, recomb and rsites. What is the meaning of these paramters?
Is theta the one inferred with dadi.Inference.optimal_sfs_scaling()?
recomb is the assumed recombination rate. Is it the same as r2 (rsquare)? or is it other measure?
rsites is sites for recombination. Can this be the same value as the effective sequence length (L) used to calculate Nref?

Yes, theta is the one inferred from dadi.optimal_sfs_scaling().
recomb is 4*Ne*rho*L, where rho is the per-site per-generation rate of recombination and L is the sequence length. For humans, for example, rho=1e-8 is a reasonable value. (Although that ignores a lot of heterogeneity.)
rsites is mostly an internal ms parameter, controlling how many places in the sequence recombination is allowed to happen. Typically it doesn't affect the results much as long as it is large. You might try starting with L, but if it's too slow you can scale back.

Note that doing these simulations correctly can be tricky. Because you need to account for things like how much missing data you have. (See the original dadi paper for the hoops we had to jump through.) We're currently writing up an approach that avoids these many simulations and refitting, based on the Godambe Information Matrix. If you have lots of trouble with the current approach, let us know and we may be able to help.

Best,
Ryan


Many thanks for your help.
Pedro

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

Pedro

unread,
Mar 3, 2015, 9:24:20 AM3/3/15
to dadi...@googlegroups.com
Hi Ryan,

the data I have is from whole-genome sequencing. The genome is about 12.2Mb (16 chromosomes) and we were able to cover in the two populations under study ~11.6 Mb high quality sites with ~114 kb segregant sites. This considering all individuals [18,25]. To run dadi, and to account for missing data, I projected the samples down to [14,22]. This projection was chosen by testing all possible projections and choosing the one with the highest number of segregating sites (~9.3 kb). I then ran two IM models, without migration and with asymmetrical migration between pops. Looking at the residual plots it seems to me that the asymmetrical migration model fits better to the data (and has a higher ll and lower AIC), but the estimated parameters for migration are relatively low I think:
m12 m21
0.03491 0.00295

To simulate datasets I was thinking that because my projection reduced the number of segregating sites by a factor of ~0.8, I need to simulate ~11.6 Mb x 0.8 of sequence (saw this in a previous post) for the entire dataset and then project down to [14,22].
In your PLoS Gen paper you group genes in linked blocks, but considering I have sequence for entire chromosomes I could break chromosomes into some specified length (eg linkage decay to half) and simulate datasets for each one of these regions with size reduced to 80%.
In the end each of these datasets would be imported to dadi with dadi.Spectrum.from_ms_file to generate new estimates.
Does this sounds a good approach? Maybe that is something more that I'm missing and should account for...

Many thanks,
Pedro
asymetricMig_2d_fit2D_folded_run02.pdf
noMig_2d_fit2D_folded_run02.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,
Mar 9, 2015, 2:18:48 PM3/9/15
to dadi...@googlegroups.com
Hello Pedro,

That should be a good approach overall. You'll want to simulate with [18,25] individuals, then project your simulations down. You should find that you don't need to reduce your segregating sites by 0.8, because that's accounted for when you estimate theta. (The mu you estimate becomes the probability of mutation + being successfully called in at least [14,22] individuals.)

Breaking the chromosomes into pieces is reasonable, if you need to do that to get reasonably efficient simulations.

From your plots, It does seem that the migration model better fits the shared polymorphisms. Although it is concerning that the polymorphisms private to each population are poorly modeled.

Best,
Ryan

<asymetricMig_2d_fit2D_folded_run02.pdf><noMig_2d_fit2D_folded_run02.pdf>

Pedro

unread,
Mar 27, 2015, 10:20:02 AM3/27/15
to dadi...@googlegroups.com
Hello Ryan,

many thanks for all your help throughout.

I have now started to simulate data with dadi using the IM demographic model which fitted much better the private polymorphisms.
The function to simulate data for each chromosome is:

mscore = Demographics2D.IM_mscore(params)
mscommand = dadi.Misc.ms_command(scaledtheta, datasize, mscore, int(args.rep), recomb=rec, rsites=chrL)
msfs=dadi.Spectrum.from_ms_file(os.popen(mscommand))
## fold simulated data because original data is also folded
msfold = msfs.fold()

In the end I would like to have about 100 replicates for each chromosome and then add them up to make the final dataset similar to the one used to estimate ML parameters. My problem now is whether I could do this by save the frequency spectrum msfold and then add all spectrums together, for example:

finalfs = msfold1 + msfold2 + ... + msfold16 (16 chromosomes)

or is it better to save the raw output from ms and then in some way concatenate all simulated chromosomes?
After making some tests, I noticed that the summed frequency spectrum of some data matches very closely to the freq spectrum of the concatenated data, although there were some differences but only in the far 8th decimal number, so I guess this approach should be ok.

Many thanks,
Pedro

Gutenkunst, Ryan N - (rgutenk)

unread,
Apr 1, 2015, 2:16:27 PM4/1/15
to dadi...@googlegroups.com
Hello Pedro,

As far as dadi is concerned, the summed frequency spectrum is enough, as you do below.

Best,
Ryan
--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569
http://gutengroup.mcb.arizona.edu

Reply all
Reply to author
Forward
0 new messages