Thank you for building this great software.
Fastsimcoal2 is user-friendly, robust and fast, especially with multithreading.
However, I'm not sure how to infer confidence intervals for the estimated parameters.
This is my understanding of the process:
1) Estimate the parameter with FSC.
2) Repeat step 1 50-100 times.
3) Choose the parameters from the run with the highest MaxEstLhood as the "point estimate".
4) Simulate 100 SFS with the parameters (point estimate) from step 3.
5) Re-estimate the parameters for each of the 100 SFS from step 4 (30 replicates for every SFS).
6) The 100 parameter estimates from step 5 can be used to infer confidence intervals, and the 100 CLR values (log10(CLO/CLE)) could be used to evaluate the model.
Please check whether my understanding is right.
Another question is:
When I used the command "fsc25 -i 1PopBot20Mb_maxL.par -n 100" to simulate the SFS, the program told me "Do not output expected MAF spectrum Do not output expected DAF spectrum" and the SFS could not be output.
Thanks for any help!
Kun
1) Estimate the parameters with FSC.
2) Use the maximum-likelihood parameters (found in the file *_maxL.par) to generate, say, 100 SFS (preferably from DNA data; you need to modify the _maxL.par file to do this).
3) Re-estimate the parameters for each of the 100 SFS from step 2 (30-50 or more runs for every SFS).
4) The 100 parameter estimates from step 3 can be used to infer confidence intervals.
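As a rough illustration of steps 3 and 4 (this is not part of fastsimcoal2), here is a minimal Python sketch. It assumes you have already collected the results of every run into plain dictionaries; the layout `runs_per_sfs`, the function names, and the use of a simple percentile interval are all assumptions made for the example, and only the key `MaxEstLhood` is taken from the fsc output. For each simulated SFS it keeps the run with the highest MaxEstLhood, then takes the 2.5% and 97.5% percentiles of the retained estimates.

```python
def best_run(runs):
    """Return the run (a dict) with the highest MaxEstLhood among the runs for one SFS."""
    return max(runs, key=lambda r: r["MaxEstLhood"])


def percentile_ci(estimates, level=0.95):
    """Percentile interval of a list of numbers, using linear interpolation."""
    xs = sorted(estimates)
    if not xs:
        raise ValueError("need at least one estimate")

    def quantile(q):
        pos = q * (len(xs) - 1)          # fractional index into the sorted list
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def bootstrap_ci(runs_per_sfs, param, level=0.95):
    """runs_per_sfs: one list of run dicts per simulated SFS (hypothetical layout)."""
    best_values = [best_run(runs)[param] for runs in runs_per_sfs]
    return percentile_ci(best_values, level)
```

With 100 simulated SFS and 30-50 runs each, `runs_per_sfs` would hold 100 lists of 30-50 dictionaries, and `bootstrap_ci(runs_per_sfs, "NCUR")` would return the lower and upper bounds for a parameter called `NCUR` (a made-up name; use whatever names appear in your .est file).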
Concerning evaluating the fit of the data to the model from the difference between the "observed" and "maximum" likelihoods: I have noticed that this procedure is too stringent. Most of the time you will reject the hypothesis that the data were generated under the model you are simulating (and this rejection will usually be right, since we generally do not know the true evolutionary scenario). I would rather advise comparing models with AIC than hoping to find the absolute true model.
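For the AIC comparison, a small Python sketch follows. It uses the standard definition AIC = 2k - 2 ln(L), and it assumes that the likelihood you feed in is the MaxEstLhood value in log10 units, so it is first converted to a natural logarithm. The function name, the model labels and the numbers in the usage note are made up for illustration.

```python
import math


def aic_from_log10(log10_lhood, n_params):
    """AIC = 2k - 2 ln(L), with the likelihood supplied as log10(L)."""
    ln_lhood = log10_lhood * math.log(10.0)   # convert log10 to natural log
    return 2.0 * n_params - 2.0 * ln_lhood


def rank_models(models):
    """models: dict name -> (log10 likelihood, number of parameters).

    Returns (name, AIC, delta AIC) tuples sorted from best (lowest AIC) to worst.
    """
    scored = [(name, aic_from_log10(lh, k)) for name, (lh, k) in models.items()]
    scored.sort(key=lambda item: item[1])
    best_aic = scored[0][1]
    return [(name, aic, aic - best_aic) for name, aic in scored]
```

For example, `rank_models({"bottleneck": (-1000.0, 3), "constant": (-1005.0, 1)})` lists the bottleneck model first: its extra two parameters cost 4 AIC units, while the gain of 5 log10 likelihood units is worth about 23.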
See my comment in point 2 above to answer your final question.
Hope it helps
laurent
I'm sorry, but I'm still a little confused.
In step 3, we need 30-50 or more runs for every SFS in order to get precise parameter estimates. But why don't we need that in step 1?
If I use DNA data as the output of simulation, the output file could be very large. It would be very difficult if I want to simulate a whole genome.
I've tried "-D", "-m" and other FSC options to get SFS output, but they just don't work.
Kun