3 populations, multiSFS or 3 jointSFS

brianjo...@gmail.com

Jul 23, 2014, 11:17:03 AM

to fasts...@googlegroups.com

First of all, I love this program.
If you have more than 2 populations, there's an option to use either a multidimensional SFS or a series of joint SFS, one for each pairwise population comparison. Maybe I'm not understanding the likelihood calculations correctly, but what's the difference between using one input format or the other? In theory, there should be more information in the full multiSFS than in a set of pairwise SFS. When I use the exact same populations and .tpl and .est files but vary the input format (one 3D SFS or three pairwise SFS), I get different likelihood values for the models I construct. The ranking of the models (different branching structures, same 3 populations) also changes, such that the least likely model for the 3D SFS becomes the most likely one if I use a set of three two-dimensional SFS.
Thanks for any help!

Brian

brianjo...@gmail.com

Aug 4, 2014, 11:36:36 AM

to fasts...@googlegroups.com

I figured out my problem.
I think it's misleading how the format of the 3D SFS is specified in the manual. It seems like you first cycle through variants in deme 0 (holding the other demes constant, i.e. at zero at first), then deme 1, then deme 2. I gathered this from the different sample sizes in the example in the manual (page 39) and how it cycles through them. However, it's the exact opposite (first going through deme 2, then deme 1, then deme 0), which I proved to myself by taking the example files that come with the program and marginalizing the 3D SFS into three 2D SFS. Changing this also made all of my analyses go from making no sense to making complete sense.
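For anyone who wants to repeat this check, here is a minimal sketch of the marginalization in Python with NumPy. It assumes the 3D SFS entries have already been read into a flat list in file order, that deme 2 cycles fastest as described above, and that each deme with n sampled gene copies contributes n + 1 classes. The function name `marginal_2d_sfs` and the toy counts are made up for illustration and are not part of fastsimcoal.

```python
import numpy as np

def marginal_2d_sfs(flat_sfs, sample_sizes):
    """Reshape a flattened 3D SFS and sum out one deme at a time.

    flat_sfs     : entries in file order (assumed: deme 2 varies fastest)
    sample_sizes : (n0, n1, n2), number of gene copies sampled per deme
    """
    n0, n1, n2 = sample_sizes
    shape = (n0 + 1, n1 + 1, n2 + 1)
    # C-order reshape: the last axis (deme 2) changes fastest,
    # matching the ordering assumed above.
    sfs = np.asarray(flat_sfs, dtype=float).reshape(shape)
    return {
        (0, 1): sfs.sum(axis=2),  # indexed [deme 0 count, deme 1 count]
        (0, 2): sfs.sum(axis=1),  # indexed [deme 0 count, deme 2 count]
        (1, 2): sfs.sum(axis=0),  # indexed [deme 1 count, deme 2 count]
    }

# Toy data with unequal sample sizes, so a wrong ordering shows up
# immediately as a shape mismatch or as implausible marginals.
sizes = (2, 3, 4)
flat = np.arange(3 * 4 * 5)
pairs = marginal_2d_sfs(flat, sizes)
for demes, table in pairs.items():
    print(demes, table.shape, table.sum())
```

Using different sample sizes per deme is what makes the check informative: if the assumed ordering were wrong, the marginals would not match 2D SFS computed directly from the same data. How rows and columns map onto demes in the pairwise joint SFS files is not covered here and should be checked against the manual.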

Laurent Excoffier

Aug 18, 2014, 9:58:02 AM

to fasts...@googlegroups.com

The difference between using one format or the other lies in the computation of the likelihood, as explained in the PLOS Genetics paper and on p. 47 of the manual: the likelihood for the multidimensional SFS is computed according to equation (5), and that based on the pairwise comparisons is computed according to equation (7).
This explains why you obtain different likelihoods when using one model or the other.
I would use a multiSFS when I have up to 4 populations and use the composite pairwise likelihood when I have more populations. If you try both, the multiSFS is generally more trustworthy. Note also that the pairwise likelihood cannot be used for comparing models with AIC, as the underlying theory does not apply to composite likelihoods.
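As a rough illustration of the AIC comparison mentioned above, here is a minimal Python sketch for models fitted to a multiSFS. It assumes the reported maximum likelihood is in log10 units (check this for your version) and converts it to natural-log units before applying AIC = 2k - 2 ln L. The function name, the model names, the likelihood values and the parameter counts are all made up for illustration.

```python
import math

def aic_from_log10_lhood(log10_lhood, n_params):
    """AIC = 2k - 2 ln(L), with the likelihood supplied in log10 units."""
    ln_lhood = log10_lhood * math.log(10)  # convert log10(L) to ln(L)
    return 2 * n_params - 2 * ln_lhood

# Hypothetical results: (log10 likelihood, number of estimated parameters)
models = {
    "model_A": (-12345.6, 5),
    "model_B": (-12340.1, 7),
}

aic = {name: aic_from_log10_lhood(lh, k) for name, (lh, k) in models.items()}
best = min(aic.values())
for name, value in sorted(aic.items(), key=lambda item: item[1]):
    # Delta AIC is the difference from the best (lowest) AIC.
    print(f"{name}: AIC = {value:.1f}, delta AIC = {value - best:.1f}")
```

This only makes sense for likelihoods obtained from the multiSFS. For the composite pairwise likelihood the same arithmetic would run, but the resulting numbers would not be valid AIC values.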

Laurent Excoffier

Aug 18, 2014, 10:04:04 AM

to fasts...@googlegroups.com

I do not clearly see how your second post relates to the previous one, but the format of the multiSFS follows that of dadi (somehow for compatibility purposes), and I agree that the reverse ordering of the population labels can be misleading.

I'm glad that it makes sense to you now :-)

Matt

Aug 5, 2016, 12:25:01 PM

to fastsimcoal

Dear Dr. Excoffier and other users,

Given that we should not use the multiSFS for more than 4 populations, and since the pairwise SFS cannot be compared via AIC, how can we compare among models when more than 4 populations are involved?

Thanks so much!
Matt

Laurent Excoffier

Sep 22, 2016, 11:05:23 AM

to fastsimcoal

Hi,

You could actually use a multiSFS for more than 4 pops. 5 or 6 may be okay if you do not have too many individuals and you have enough SNPs... but if you have many pops, the number of entries in your SFS will explode and most of the SFS will be empty, causing problems...
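To get a feel for how fast the SFS grows, here is a small Python sketch that counts the entries of a multidimensional SFS as the product of (n_i + 1) over populations, where n_i is the number of gene copies sampled in population i. The sample size of 20 gene copies per population is an arbitrary choice for illustration, and the function name is made up.

```python
from math import prod

def n_sfs_entries(sample_sizes):
    """Number of cells in a multidimensional SFS: product of (n_i + 1)."""
    return prod(n + 1 for n in sample_sizes)

# Same sample size in every population, increasing number of populations.
for n_pops in range(2, 7):
    print(n_pops, "pops:", n_sfs_entries([20] * n_pops), "entries")
```

With 20 gene copies per population, 4 populations already give 194,481 entries and 6 populations give more than 85 million, so most cells will stay empty unless the number of SNPs is very large.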

With more populations, the other problem is that the number of parameters to estimate quickly grows, and becomes very large.