Hi Ryan,
I have two main questions related with model fit/comparison, and I’d appreciate your help in clarifying them. But before that, I’ll provide some background on the analyses I’ve conducted so far so that the questions make more sense:
I’m currently running 2D models where I want to estimate the divergence time between two fish populations that I suspect diverged very recently (within the last century) due to a barrier. I started with 1D models for each population as you have previously suggested, and models suggested expansions in both. I proceeded to 2D models testing my hypothesis according to what I know about the system and some expectations: there was an ancestral population that went through an expansion (due to colonization by fish) that diverged in two (due to the barrier), then each population decreased in size (due to smaller habitat caused by the complete fragmentation of the barrier) and no gene flow was allowed (it was a complete barrier for a while). More recently, the barrier has been modified to allow passage and some management has been conducted, so Ne could have increased again and gene flow is allowed, but I didn’t include this third epoch in my model to avoid creating one very complex for now (and since this has happened very recently, maybe the growth suggested by my initial models is related with a larger ancestral population – but I might be wrong). I know that the optimization will find the most likely values for Ne’s, but I wanted to mention the null hypothesis I’m testing. This is the function I defined to represent this 2-epoch scenario:
def twoEp_growth_decline(params, ns, pts):
nu0, nu1, T1, nud1, nuu1, nud2, nuu2, T2, m2_12 , m2_21 = params
xx = Numerics.default_grid(pts)
phi = PhiManip.phi_1D(xx) # T0: ancestral pop
# ancestral exponential change
nu_func1 = lambda t : nu0 * (nu1/nu0) ** (t/T1) # ancient exponential change (e.g. growth)
phi = Integration.one_pop(phi, xx, T1, nu_func1)
# split
phi = PhiManip.phi_1D_to_2D(xx, phi)
# linear changes in each subpop
nu1_func2 = lambda t: nud1 + (nud2 - nud1)*t/T2
nu2_func2 = lambda t: nuu1 + (nuu2 - nuu1)*t/T2
phi = Integration.two_pops(phi, xx, T2, nu1 = nu1_func2, nu2 = nu2_func2, m12 = m2_12, m21 = m2_21) # T1: different pop changes
fs = Spectrum.from_phi(phi, ns, (xx,xx))
return fs
I’m using Portik’s pipeline for the optimization. After
this, the optimized parameters were [2.2006, 0.8126, 2.8392, 6.0689, 0.0924,
3.679, 5.739, 0.1671, 1.9129, 0.6256] which suggest opposite things to what I
initially predicted (e.g. ancestral decline instead of a growth). However, the
fit of the model was really bad, I have residuals on the order of 10^13 (which I
don’t understand why they are not shown in the plot, but I’m looking at the
range of the scale). I played around with vmin= and resid_range= to find the regions
of my spectrum that are both under and overestimated. Most of my spectrum is
underestimated, especially at the corners (unique, intermediate/high-frequency variation
in each population), but I’m attaching a photo where most of my data is located
(vmin = 1; attached figures 1 and 2), and it shows that residuals are not random at all. I modified the
bounds and initial parameters according to my hypothesis to look for a more
likely parameter space, but the resulting models didn’t improve and still
had huge residuals (I initially ran simpler models that didn't involve popualtion size changes, but they failed https://groups.google.com/g/dadi-user/c/4pbwzbMp5p8 and 1d models suggest strong Ne changes, see below).

Because of that, I decided to take a step back and return to my 1D models in order to assess the change (i.e. instantaneous, linear, exponential) of each population, and to see if 1, 2 or 3-epoch models fit them better. After running several replicates of Portik’s pipeline with 1 or 2-epoch models, I found that an instantaneous or exponential increase have happened in both populations. However, I find it hard to choose what change was better as both models have very similar residual plots (attached figure 3). Thus, I decided to proceed with 2 or 3-epoch models with some combinations of instantaneous and exponential changes (to simulate a bottleneck). I plotted the replicates that had the highest log-likelihoods and that “converged” on similar values across different iterations, but I still can’t decide on which model is qualitatively better as residuals look very similar. I was interested in knowing if I could change the range of residuals in 1d plots and couldn’t find an argument, so I calculated the values with resid = dadi.Inference.Anscombe_Poisson_residual(model, fs) to inspect them better. However, I also found huge residuals for these 1d models – the values are those printed for each model in the figure 4 I’m attaching). So here are my first round of questions: Is this function giving the same residual values than those in the plot? If so, it would mean that I still have bad 1d models! but if not, is there a way a to change the residual range in the 1d plots? I'm really confused about residual values here.
Because my 1, 2 and 3-epoch models are not nested, I cannot use the LRT, so I decided to try to calculate the clAIC (following what was initially discussed here https://groups.google.com/g/dadi-user/c/n1-ovndv8FI?pli=1), and since I find it hard to select the best fit based on the residual plots alone, I want to also incorporate a more quantitative model selection approach. For this second part, I basically would like to know if I'm using the appropriate objects generated by dadi’s functions, specifically for the H, J matrices and if I should be using log during the calculation – I’m showing comments related to this in red. I’m following the source code from this page https://gitlab.mbb.cnrs.fr/khalid/dadi/-/blob/master/dadi/Godambe.py?ref_type=heads. This is the code I wrote (I’m using an example function with the best optimized parameters from a Portik’s pipeline run):
def ep3_ins_exp_ins(params, ns, pts):
nu1, nu2, nu3, T1, T2, T3 = params
xx = Numerics.default_grid(pts)
phi = PhiManip.phi_1D(xx)
# T1: ancestral time
phi = Integration.one_pop(phi, xx, T1, nu1)
# T2: exp change
nu_func2 = lambda t : nu1 * (nu2/nu1) ** (t/T2)
phi = Integration.one_pop(phi, xx, T2, nu_func2)
# T3: inst change
phi = Integration.one_pop(phi, xx, T3, nu3)
fs = Spectrum.from_phi(phi, ns, (xx,))
return fs
func = ep3_ins_exp_ins
func_ex = dadi.Numerics.make_extrap_log_func(func)
grid_pts = [450, 460, 470]
p0 = [0.5889,0.2077,21.7962,0.221,0.0468,0.0629]
data = fsd
# Getting the matrices
GIM, H, J, cU = dadi.Godambe.get_godambe(func_ex, grid_pts, boots, p0, data, eps=0.01, log=True) #I do log because that’s what I’m using for the func_ex, but I don’t know enough math to be sure if this makes sense
# Getting the inverse Hessian
H_inv = numpy.linalg.inv(H)
# Getting the trace of the first term
first_term = numpy.trace(J @ H_inv)
# Getting the log.likelihood of the model
ns = fsd.sample_sizes
modeld = func_ex(p0, ns, grid_pts)
ll = dadi.Inference.ll_multinom(modeld, fsd)
# Calculating the clAIC!
clAIC = 2*first_term - 2*ll
This is a really long post, apologies for that. I'd really appreciate your help with my two concerns and any additional comment/advice that can help me to move forward to good 2D models for my system.
Thank you very much,
Maria
On Apr 3, 2026, at 2:17 PM, Maria Paula Rodriguez <mrodrig...@gmail.com> wrote:
Hi Ryan,
I have two main questions related with model fit/comparison, and I’d appreciate your help in clarifying them. But before that, I’ll provide some background on the analyses I’ve conducted so far so that the questions make more sense:
I’m currently running 2D models where I want to estimate the divergence time between two fish populations that I suspect diverged very recently (within the last century) due to a barrier. I started with 1D models for each population as you have previously suggested, and models suggested expansions in both. I proceeded to 2D models testing my hypothesis according to what I know about the system and some expectations: there was an ancestral population that went through an expansion (due to colonization by fish) that diverged in two (due to the barrier), then each population decreased in size (due to smaller habitat caused by the complete fragmentation of the barrier) and no gene flow was allowed (it was a complete barrier for a while). More recently, the barrier has been modified to allow passage and some management has been conducted, so Ne could have increased again and gene flow is allowed, but I didn’t include this third epoch in my model to avoid creating one very complex for now (and since this has happened very recently, maybe the growth suggested by my initial models is related with a larger ancestral population – but I might be wrong). I know that the optimization will find the most likely values for Ne’s, but I wanted to mention the null hypothesis I’m testing. This is the function I defined to represent this 2-epoch scenario:
def twoEp_growth_decline(params, ns, pts):
nu0, nu1, T1, nud1, nuu1, nud2, nuu2, T2, m2_12 , m2_21 = params
xx = Numerics.default_grid(pts)
phi = PhiManip.phi_1D(xx) # T0: ancestral pop
# ancestral exponential change
nu_func1 = lambda t : nu0 * (nu1/nu0) ** (t/T1) # ancient exponential change (e.g. growth)
phi = Integration.one_pop(phi, xx, T1, nu_func1)
# split
phi = PhiManip.phi_1D_to_2D(xx, phi)
# linear changes in each subpop
nu1_func2 = lambda t: nud1 + (nud2 - nud1)*t/T2
nu2_func2 = lambda t: nuu1 + (nuu2 - nuu1)*t/T2
phi = Integration.two_pops(phi, xx, T2, nu1 = nu1_func2, nu2 = nu2_func2, m12 = m2_12, m21 = m2_21) # T1: different pop changes
fs = Spectrum.from_phi(phi, ns, (xx,xx))
return fs
I’m using Portik’s pipeline for the optimization. After this, the optimized parameters were [2.2006, 0.8126, 2.8392, 6.0689, 0.0924, 3.679, 5.739, 0.1671, 1.9129, 0.6256] which suggest opposite things to what I initially predicted (e.g. ancestral decline instead of a growth). However, the fit of the model was really bad, I have residuals on the order of 10^13 (which I don’t understand why they are not shown in the plot, but I’m looking at the range of the scale). I played around with vmin= and resid_range= to find the regions of my spectrum that are both under and overestimated. Most of my spectrum is underestimated, especially at the corners (unique, intermediate/high-frequency variation in each population), but I’m attaching a photo where most of my data is located (vmin = 1; attached figures 1 and 2), and it shows that residuals are not random at all. I modified the bounds and initial parameters according to my hypothesis to look for a more likely parameter space, but the resulting models didn’t improve and still had huge residuals (I initially ran simpler models that didn't involve popualtion size changes, but they failed https://groups.google.com/g/dadi-user/c/4pbwzbMp5p8 and 1d models suggest strong Ne changes, see below).<2. 2D-twoepoch.Down224.Ups239_vmin1_res3.png><1. 2D-twoepoch.Down224.Ups239_noVmin.png>
--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/947ba512-4517-4226-bbd8-e4e3bffadea2n%40googlegroups.com.
<2. 2D-twoepoch.Down224.Ups239_vmin1_res3.png><3. 1D-Upstream-1_2epoch.PNG><1. 2D-twoepoch.Down224.Ups239_noVmin.png><4. 1D-Downstream-2_3epoch.PNG>
On Apr 9, 2026, at 12:38 PM, Maria Paula Rodriguez <mrodrig...@gmail.com> wrote:
Hi Ryan,
1/6. Oh ok, so just to double-check: the instantaneous change is because the Nref (which is "internally" set to 1) is not nu0 in my model (and not in any model unless the most ancient nu is set to 1), so my nu0 could take a different value (e.g. 10), right?
I ran the model again adding nu0 = 1, as well as removing the more recent linear changes and any migration (to keep it simple). I'm attaching the result with and without vmin = 1 (attached figures 1 and 2) for the replicate with the highest log likelihood, which is again suggesting a growth for the two daughter populations (nu1 = 1.6 and nu2 = 2.7) compared to nu0 (0.12). As you can see, I still get huge residuals; this model is mostly overestimating the corners (high and intermediate-frequency private variants) and some unique SNPs, but when I look at a smaller residual range it is overestimating the shared variation.
2/3. My original sample sizes were 184 and 175 individuals in my two populations, which would be 368 and 350 chromosomes, and I projected down in easySFS to 222 and 243, respectively, maximizing the number of SNPs. I followed your suggestion of subsampling the data to ease visualization (e.g. I cropped the model and data sfs to 20 chromosomes - attached figure 3, I'm not sure if there's a function to properly subsample) and it shows that the model is mostly underestimating the shared and unique rare variation and overestimating the unique more common variation - I find challenging finding biological explanations for the [1,1], but maybe recent gene flow can help with that(?).
The range of residuals is much smaller in the cropped region of the SFS, which might be related to what you're mentioning of losing more variants in the corners with the downprojection. You said that subsampling can help plotting, but would it be a good idea to subsample and conduct the demographic analysis with fewer samples? I haven't seen many other posts with a similar sample size to mine, and I don't know if this might be complicating the optimization too (I kept all samples from my two genetic groups separated by the barrier, but some of them are farther to it, and I don't know how IBD can complicate the demographic analysis). However, I decided to keep all my samples as the divergence is suspected to be recent (within the last century) in order to increase the detection of recent events (https://link.springer.com/article/10.1186/s12862-014-0254-4). Do you have any recommendations for what might be a better subsampling/downprojection strategy? I also want to mention that I allowed a maximum site missingness of 30% when filtering my vcf as I wanted to keep sites with a low proportion of missing data, and I thought it could benefit the downprojection according to what I understand from it (i.e. "more confident" resampling if most variants have calls).
Sorry, I didn't understand what "outliers SNPs" you were referring to in #3. Do you mean, the SNPs in the corners (variants that are private/close to be private to each population) that are smeared out during the downprojection? To test for neutrality, I could conduct a HWE test within each population as a way to test for neutrality maybe(?) (I think doing this with both populations combined is not good as I can lose sites that are associated with structure), and I assume that when you say "real" you refer to remaining sequencing/filtering artefacts in the data - besides the site missingness I did standard site/genotype quality filters and excluded sites with a high heterozygosity (probable paralogs) but not much more than that.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/57c3261b-b8bc-4ae3-a591-d773c2d5b08cn%40googlegroups.com.
<3. 2D-CROPPED-ancExpChange_split_noMig_vmin1.png><2. 2D-ancExpChange_split_noMig_vmin1_resRange3.png><1. 2D-ancExpChange_split_noMig_Novmin1.png>