Hi Ryan,
Thank you for your response. That aligns with my expectations, but it seems like there's something wrong in my pipeline. I know that the models I'm using aren't completely matching the data, but I also know that the parameters estimated are orders of magnitude off.
I'm using variant data from 59 individuals (30 and 29) stored in a .vcf, which I project to a SFS of 2n-2, which could be smaller. My base genome (autosomal) has 1.98 Gb mappable sites, and I called variants from ~9x coverage data. Filtering reduced 130,436,888 variants to 66,067,163 sites. I do not have an invariant VCF as it's whole-genome and that file size is unwieldy, but as a base approximation I could assume that 51% of the initial sequence length is my final sequence length (1Gb).
When I look at my model outputs, the population sizes and time intervals look way low. I assume that this is related to how I'm scaling it, but it could be that the model is a poor fit, or that I'm making an error in the process somewhere. But, for my best fit replicates, I'm getting nref=10000, npops=(500,2500), split_T~2700-5600. Some replicates have further split times, but the population sizes are all pretty small across all replicates and models. Nucleotide diversity for the populations is around (0.23%,0.21%) which is a couple of orders of magnitude off what I'd estimate using dadi outputs (.0007 - .004 %).
I'm wondering what could be messing with this inference. Obviously, a good estimate for L is important, but it could be other factors; for instance, there is some population structure. I'm considering downprojecting a bit further and increasing the maxiter, and possibly adding a bottleneck before split time, but I don't want an overly complex model. I know that the model itself isn't matching the data, but I haven't figured out exactly what needs to be done. I've also attached the model sfs and residuals for some of the best-fit data.
Additional information:
I'm estimating a split-with-migration, so I have models with and without migration, with a linear and exponential size change, and two phases to approximate secondary contact. I don't think the details of the models are atypical, but could provide them.
Population sizes: start at 2, from 1e-3 to 100
migration rates: stat at 0.2, from 1e-3 to 30
Time intervals: start at 1, from 1e-2 to 15
and I perturb_params with fold=1 before inference with maxiter= 200, and pts [40,50,60]
Example calculation with model:
def nu1_func(t):
return nu1A * (nu1B / nu1A) ** (t / TA)
def nu2_func(t):
return nu2A * (nu2B / nu2A) ** (t / TA)
phi = Integration.two_pops(phi, xx, TA,nu1_func, nu2_func,m12=m12, m21=m21)
Outputs:
loglik -13531.50 theta 169586.1
nu1A 1.9405859 nu2A 0.04918462
nu1B 0.05179008 nu2B 0.2568233
TA 0.1520766
m12 0.2254417 m21 0.7371130
Calculation:
sumL= 1985216215 * (66067163 )/130436888 = 1005525395
mu=4*(10^-9)
nref=theta / (4* sumL * mu) = 10539
n1B=nu1B*nref =457
n2B=nu2B*nref =2568
splittime=TA*2*$nref =3210