I hope some of you can help me. I just started working with fastsimcoal together with some other people in our lab. We would like to test some different IM models using (two population diverging with gene flow) and have tried to make some models to test. We have s site-frequency-spectrum so we want to run several models in order to test which scenario fits our data the best.
So far we came of with a model looking like the one below. I just present a simple case with isolation with migration and constant population sizes and a possible change in the ancestral population. Our problem is that this model looks somewhat different from the one presented in the fastsimcoal manual (page 64) and although we think it is correct we started to have a doubt. The main differences are the way we describe the splitting event but also the way we define the migration priors.
I present both our model and the one from the manual. I hope some of you can comment on whether or not these two models actually simulate the same scenario or if they are different. In the latter case we hope you can explain the difference in the assumed models.
Our IM model. Two populations diverging with gene flow:
//Number of population samples (demes)
2
//Population effective sizes (number of genes)
N1
N2
//Sample sizes
56
56
//Growth rates : negative growth implies population expansion
0
0
//Number of migration matrices : 0 implies no migration between demes
1
//Migration matrix 0
0 M01
M10 0
//historical event: time, source, sink, migrants, new size, new growth rate, migr. matrix
2 historical event
T1 0 0 1 R1 0 0
T1 1 0 1 1 0 0
//Number of independent loci [chromosome]
65000 0
//Per chromosome: Number of linkage blocks
1
//per Block:data type, number of loci, per generation recombination and mutation rates and optional parameters
FREQ 100 0 1e-8
Our prios
// Priors and rules file
// *********************
[PARAMETERS]
//#isInt? #name #dist.#min #max
//all Ns are in number of haploid individuals
1 N1 unif 100 5000000 output
1 N2 unif 100 5000000 output
1 NANC unif 100 5000000 output
1 T1 unif 100 5000000 output
0 M01 logunif 1e-20 1e-1 output
0 M10 logunif 1e-20 1e-1 output
[RULES]
[COMPLEX PARAMETERS]
0 R1 = NANC/N1 hide
The IM model from the manual (page 64):
//Parameters for the coalescence simulation program : simcoal.exe
2 samples to simulate :
//Population effective sizes (number of genes)
NPOP1
NPOP2
//Samples sizes and samples age
20
30
//Growth rates: negative growth implies population expansion
0
0
//Number of migration matrices : 0 implies no migration between demes
2
//Migration matrix 0
0 MIG21
MIG12 0
//Migration matrix 1
0 0
0 0
//historical event: time, source, sink, migrants, new deme size, growth rate, migr mat index
1 historical event
TDIV 0 1 1 RESIZE 0 1
//Number of independent loci [chromosome]
1 0
//Per chromosome: Number of contiguous linkage Block: a block is a set of contiguous loci
1
//per Block:data type, number of loci, per gen recomb and mut rates
FREQ 1 0 2.5e-8e-8
// Priors and rules file
// *********************
[PARAMETERS]
//#isInt? #name #dist.#min #max
//all N are in number of haploid individuals
1 ANCSIZE unif 100 100000 output
1 NPOP1 unif 100 100000 output
1 NPOP2 unif 100 100000 output
0 N1M21 logunif 1e-2 20 hide
0 N2M12 logunif 1e-2 20 hide
1 TDIV unif 100 20000 output
[RULES]
[COMPLEX PARAMETERS]
0 RESIZE = ANCSIZE/NPOP2 hide
0 MIG21 = N1M21/NPOP1 output
0 MIG12 = N2M12/NPOP2 output
Thank you so much for your reply. I have started some analyses now using the model you suggested and it runs smoothly. It will be interesting how much this will differ from my old runs. THANK YOU.
I have another question that we have discussed in my group regarding estimating confidence intervals on the various parameter estimates. It seems to us that the most accurate thing to do is to bootstrap over your own data: meaning that if you use a site-frequency-spectrum then you would bootstrap over your individual SNP data (as an example) in order to generate new SFS data sets for fsc analysis. However, it seems that some people estimate the confidence by generating e.g. 100 data sets corresponding to the estimated parameter values (I think that such can be found in the .par files) and then run fsc on these datasets. Which of these two options would you find most appropriate and why? We talked a lot about this and to us it seems most accurate to bootstrap over your own data instead of simulating new data but we may be mistaken?
A last thing I thought a bit about is if anyone tried to incorporate possible selection into your models. Although most e.g. SNPs would be assumed to be under neutral evolution some may be under positive selection (e.g. between two species) and depending on the scenario these loci may constitute a good part of your data (especially the highly differentiated ones). Did anyone try to play around with selection and how it may bias parameter estimates in fsc??? And is it possible to incorporate selection at all?
Thank you so much for your help
Sincerely Vivi and Magnus
Chris