Proper way to bootstrap


jason....@gmail.com

Mar 27, 2015, 5:38:08 PM
to fasts...@googlegroups.com
The following post suggests a bootstrap procedure that simply simulates data under the best-fit model and then re-estimates the parameter values for each simulated data set (https://groups.google.com/forum/#!topic/fastsimcoal/N956Af31iA4). That approach does not sound like a proper bootstrap to me (can someone please correct me if I am wrong?). Rather, I think the following steps should be followed, where X is the desired number of bootstrap replicates:

1) Generate X observed SFS by sampling SNPs from the SNP pool with replacement.
2) For each SFS, estimate the parameters using FSC.

The resulting distribution of parameter values can then be used to generate a 95% CI etc.
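
The two steps above can be sketched in Python as follows. This is a minimal illustration rather than anything from fastsimcoal itself: it assumes a single population, a per-SNP vector of derived-allele counts (the hypothetical input `derived_counts`), and SNPs that can be treated as independent. The function names are made up for the example, and step 2 (running FSC on each replicate SFS) is left outside the code.

```python
import numpy as np

def build_sfs(derived_counts, n_chrom):
    """1D SFS: number of SNPs carrying 0..n_chrom derived alleles."""
    return np.bincount(derived_counts, minlength=n_chrom + 1)

def bootstrap_sfs(derived_counts, n_chrom, n_boot, seed=0):
    """Resample SNPs with replacement and rebuild the SFS for each replicate."""
    rng = np.random.default_rng(seed)
    derived_counts = np.asarray(derived_counts)
    n_snps = derived_counts.size
    replicates = []
    for _ in range(n_boot):
        # Draw n_snps SNP indices with replacement (assumes independent SNPs).
        idx = rng.integers(0, n_snps, size=n_snps)
        replicates.append(build_sfs(derived_counts[idx], n_chrom))
    return np.array(replicates)

def percentile_ci(estimates, level=0.95):
    """Percentile interval from the bootstrap distribution of one parameter."""
    alpha = (1.0 - level) / 2.0
    return np.quantile(estimates, [alpha, 1.0 - alpha])
```

Each row returned by `bootstrap_sfs` would be written out as an observed SFS and fitted separately; the parameter estimates collected across replicates can then be passed to `percentile_ci`.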

Laurent Excoffier

Mar 30, 2015, 3:40:23 AM
to fasts...@googlegroups.com
What you describe is non-parametric bootstrap, which is also fine.
The assumptions are different from those of the parametric bootstrap, but both are possible.
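
For contrast, here is a rough sketch of how parametric replicates differ: they are drawn from the fitted model instead of from the observed SNPs. In practice the replicate data sets would be simulated with fastsimcoal under the best-fit parameters. The multinomial draw below is only a stand-in that assumes independent SNPs and an expected SFS (the hypothetical input `expected_sfs`) already obtained from the fitted model.

```python
import numpy as np

def parametric_bootstrap_sfs(expected_sfs, n_snps, n_boot, seed=0):
    """Draw replicate SFS from the model's expected SFS (stand-in for simulation)."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(expected_sfs, dtype=float)
    probs = probs / probs.sum()  # normalise to entry probabilities
    # Each row is one simulated SFS containing n_snps SNPs.
    return rng.multinomial(n_snps, probs, size=n_boot)
```

As with the non-parametric version, parameters are then re-estimated on every replicate; the difference lies only in where the replicate data come from.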

jason....@gmail.com

Mar 30, 2015, 9:17:42 AM
to fasts...@googlegroups.com
Thanks for the clarification of the two types of bootstrap. My understanding is that the parametric bootstrap assumes that the model being used is the true model. None of the candidate models tested by programs like fastsimcoal is likely to be true, so does it make sense to use the parametric bootstrap?

Also, bootstraps are generally used to determine confidence intervals etc. Could we instead use a non-parametric bootstrap to determine support for each of a series of candidate models, as an alternative to AIC or LRT (which I understand are problematic for some of the likelihood calculations in fastsimcoal)? For example, perform 1000 non-parametric bootstraps and, for each one, determine the best-fit model among your candidate models. Support for a model would simply be the frequency at which that model is chosen as the best fit across the 1000 bootstraps.
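
The support measure proposed here amounts to a simple tally. A minimal sketch, assuming a list holding the label of the best-fit model for each bootstrap replicate (the model labels and the function name are hypothetical):

```python
from collections import Counter

def model_support(best_model_per_bootstrap):
    """Fraction of bootstrap replicates in which each model was the best fit."""
    counts = Counter(best_model_per_bootstrap)
    total = len(best_model_per_bootstrap)
    return {model: n / total for model, n in counts.items()}
```

For example, `model_support(["A", "A", "B", "A"])` gives a support of 0.75 for model A and 0.25 for model B.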

Laurent Excoffier

Mar 31, 2015, 3:15:06 AM
to fasts...@googlegroups.com
With the parametric bootstrap, you would indeed get the CI assuming the models are true.
With the non-parametric bootstrap, you would still get the CI assuming your tested model is true as well, but in a less obvious way.

Your idea for model choice is interesting, and I think worth doing, but I'd rather report the distribution of the relative likelihoods of the different models over the 1000 bootstraps.
best

L
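
One possible reading of this suggestion, sketched under stated assumptions: for every bootstrap replicate, express each model's likelihood relative to the best model in that replicate, then look at how those ratios are distributed over replicates. The sketch assumes the fitted likelihoods are available as log10 values in a replicates-by-models table; the function name is hypothetical, and no penalty for the number of parameters is applied.

```python
import numpy as np

def relative_likelihoods(log10_lhoods):
    """Likelihood of each model relative to the best model, per replicate.

    log10_lhoods: array-like of shape (n_bootstraps, n_models).
    """
    l = np.asarray(log10_lhoods, dtype=float)
    # Subtracting the row maximum before exponentiating avoids underflow.
    return 10.0 ** (l - l.max(axis=1, keepdims=True))
```

Each column of the result can then be summarised (for example with quantiles or a histogram) to report the distribution for one model over the bootstraps.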

jason....@gmail.com

Apr 1, 2015, 7:07:42 AM
to fasts...@googlegroups.com
Hi Laurent,

I would also favor the likelihood approach for model choice. I am exploring alternatives like the bootstrap method because it was not clear to me from your 2013 PLoS Genetics paper whether likelihood ratio tests or AIC are valid, given that you are dealing with composite likelihoods. To be honest, I did not fully understand those sections of the paper, or under which models you could and could not use AIC or LRT. Also, it would seem that with very large datasets, even a trivial increase in the complexity of the model is likely to result in a delta AIC > 2, for example, meaning that this method of model choice may be less than ideal with big data. Would you mind clarifying when you would find it appropriate to use AIC or similar methods for model choice, or what approach you would suggest?

Thanks

Laurent Excoffier

Apr 15, 2015, 5:03:18 AM
to fasts...@googlegroups.com
I used the LRT on bootstrap data to see if the data could be considered as being generated under the best model identified by the AIC.
This within-model LRT approach will tell you if the tested model can be considered as the true one. In general it is not, and you will conclude most of the time that you did not get the right model. This is to be expected, I guess...

The comparison of AIC between models should allow you to choose which of your tested models best fits the data. It is indeed expected that by increasing the amount of data you should have increasing power to detect the best model.
However, I am not sure that by simply increasing the complexity (number of parameters) of the model, you will necessarily improve the likelihood that much if the parameters you add are irrelevant.

For all this to work in theory, your likelihood needs to be a real likelihood (SNPs have to be independent, and you need to use the multidimensional SFS, not the multiple pairwise SFS), and your likelihood needs to be computed accurately.

Sorry if you knew all that

L
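
For reference, the AIC comparison discussed above can be sketched as follows, using the standard definition AIC = 2k - 2 ln(L). The sketch assumes the maximised likelihood is available as a log10 value, so it is converted to a natural log first; the function names are hypothetical, and the caveats about independent SNPs and a true multidimensional SFS still apply.

```python
import math

def aic_from_log10(log10_lhood, n_params):
    """AIC = 2k - 2 ln(L), with the likelihood supplied as log10(L)."""
    ln_lhood = log10_lhood * math.log(10)  # convert log10(L) to ln(L)
    return 2 * n_params - 2 * ln_lhood

def delta_aic(aics):
    """Difference between each model's AIC and the lowest AIC in the set."""
    best = min(aics)
    return [a - best for a in aics]
```

Because ln(L) grows in magnitude with the number of SNPs while the penalty 2k stays fixed, delta AIC values do tend to get larger as the data set grows, which is the concern raised earlier in the thread.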

jason....@gmail.com

May 14, 2015, 11:15:30 AM
to fasts...@googlegroups.com
Thanks for those details. The multidimensional SFS approach only works well for npop up to 4, correct? So if I wish to build a model with npop > 4, I will need the multiple pairwise SFS, for which the likelihoods are not proper and thus AIC cannot be calculated. For npop > 4 when using multiple pairwise SFS, do you have a suggested means of comparing fit across a set of models?

Laurent Excoffier

May 14, 2015, 3:55:39 PM
to fasts...@googlegroups.com
No, the multiSFS approach works well for larger numbers of populations, but the multiSFS matrix rapidly becomes very large, making it difficult to handle. We are using it routinely for 6-8 populations, but with very small sample sizes.
If you use multiple joint 2D SFS, then model choice based on AIC won't work. I have no replacement for this at this time.

jason....@gmail.com

May 14, 2015, 5:36:24 PM
to fasts...@googlegroups.com
OK, thanks for those details.

roberto....@gmail.com

Sep 17, 2018, 3:42:31 PM
to fastsimcoal
A quick follow-up question on this: what do you consider very small sample sizes? I'm modelling 8 populations with sample sizes 2N=4-20. Would this multiSFS be too big? Would you suggest using several pairwise SFS or downscaling sample sizes to get a manageable 8D-SFS?

Thanks in advance!

Laurent Excoffier

Oct 30, 2018, 3:02:49 PM
to fastsimcoal
Hi,
It depends on the number of SNPs you have.
If you have millions of SNPs, then I guess such a multiSFS is fine, but if you have only a few thousand it is hopeless.
L
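
To put rough numbers on this: a multiSFS has one entry for every combination of derived-allele counts, i.e. the product of (2N + 1) over populations. A small sketch (function names are hypothetical) that computes the number of entries and the average number of SNPs per entry:

```python
import math

def multisfs_entries(sample_sizes):
    """Number of entries in a multidimensional SFS: product of (2N_i + 1)."""
    return math.prod(n + 1 for n in sample_sizes)

def mean_snps_per_entry(n_snps, sample_sizes):
    """Average number of SNPs falling in each multiSFS entry."""
    return n_snps / multisfs_entries(sample_sizes)
```

With 2N = 4 in all eight populations the multiSFS has 5^8 = 390,625 entries, whereas with 2N = 20 in all eight it has 21^8, about 3.8 x 10^10. The configuration asked about (2N between 4 and 20) falls somewhere in between, so a few thousand SNPs would leave almost every entry empty.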