I am seeking advice on a persistent issue regarding the expected (simulated) 2D-SFS in my demographic modeling.
I have built a 5-population model (Pop 0 to Pop 4, representing different subspecies along the ring) that includes historical isolation and a recent secondary contact event between the terminal populations (Pop 0 and Pop 4).
Because I don't have a reliable outgroup, I am using the folded SFS (Minor Allele Frequency). My observed 2D-SFS heatmaps are complete and continuous, showing valid SNPs in the top-left and bottom-right corners (representing high-frequency minor alleles / private high-frequency alleles).
The Problem: While the overall likelihood scores seem reasonable, the expected (simulated) 2D-SFS heatmaps show completely empty/blank corners for specific population pairs, failing to capture the high-frequency alleles present in my observed data.
Interestingly, this missingness follows a very strict and consistent pattern:
Missing corners (top-left / bottom-right blank): Pop 0-1, 0-2, 1-2, 0-3, 1-3, and 2-3.
Complete/Normal heatmaps: Pop 0-4, 1-4, 2-4, and 3-4.
In other words, any pair involving Pop 4 is simulated completely fine. But any pair among the other four populations completely fails to simulate high-frequency alleles. I also tested a 4-population model and found the exact same structural missingness for pairs not involving the terminal western population.
Data Context: To achieve comprehensive geographic sampling, my dataset is a merged VCF of WGS (Whole Genome Sequencing) and GBS data. Specifically, Pop 1 and Pop 2 contain 10 WGS individuals each, while the rest of the individuals and populations are sequenced via GBS.
What I have checked:
I am aware that folded SFS naturally has a blank half (since MAF > 0.5 is impossible). The missingness I am referring to is in the valid half of the heatmap where the observed data clearly has SNPs.
I have rigorously constrained the .est and .tpl files to ensure that size changes (T_CHG) strictly occur during the historical isolation period and do not overlap with divergence times (TDIV) or the secondary contact period (T_SC_END).
My Questions:
Why would the optimization algorithm completely fail to generate high-frequency alleles for specific population pairs, while perfectly simulating them for pairs involving Pop 4?
Could the massive number of low-frequency alleles (singletons) from the WGS samples in Pop 1 and Pop 2 be "hijacking" the likelihood optimization, forcing the model to overestimate recent migration rates among Pop 0, 1, 2, and 3, thereby wiping out the corners?
Is there a recommended way in fastsimcoal2 to handle this?
I have attached my .tpl and .est files, as well as the PDF of the 2D-SFS heatmaps for your reference.
Any insights into the underlying mechanics of this issue or suggestions for parameterization would be greatly appreciated!
Best regards,
Lei