model residuals/fit and clAIC implementation


Maria Paula Rodriguez

Apr 3, 2026, 5:17:09 PM
to dadi-user

Hi Ryan,

I have two main questions related to model fit/comparison, and I’d appreciate your help in clarifying them. But before that, I’ll provide some background on the analyses I’ve conducted so far so that the questions make more sense:

I’m currently running 2D models where I want to estimate the divergence time between two fish populations that I suspect diverged very recently (within the last century) due to a barrier. I started with 1D models for each population, as you previously suggested, and those models suggested expansions in both. I then proceeded to 2D models testing my hypothesis, based on what I know about the system and some expectations: there was an ancestral population that went through an expansion (due to colonization by fish) and then diverged in two (due to the barrier); afterwards each population decreased in size (due to smaller habitat caused by the complete fragmentation of the barrier) and no gene flow was allowed (it was a complete barrier for a while). More recently, the barrier has been modified to allow passage and some management has been conducted, so Ne could have increased again and gene flow is allowed, but I didn’t include this third epoch to avoid creating a very complex model for now (and since this happened very recently, maybe the growth suggested by my initial models reflects a larger ancestral population instead – but I might be wrong). I know that the optimization will find the most likely values for the Ne’s, but I wanted to mention the null hypothesis I’m testing. This is the function I defined to represent this 2-epoch scenario:

from dadi import Numerics, PhiManip, Integration, Spectrum

def twoEp_growth_decline(params, ns, pts):
    nu0, nu1, T1, nud1, nuu1, nud2, nuu2, T2, m2_12, m2_21 = params
    xx = Numerics.default_grid(pts)
    # T0: equilibrium ancestral population
    phi = PhiManip.phi_1D(xx)
    # T1: ancestral exponential change (e.g. growth) from nu0 to nu1
    nu_func1 = lambda t: nu0 * (nu1/nu0) ** (t/T1)
    phi = Integration.one_pop(phi, xx, T1, nu_func1)
    # split into the two populations
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    # T2: linear size changes in each subpopulation, with migration
    nu1_func2 = lambda t: nud1 + (nud2 - nud1)*t/T2
    nu2_func2 = lambda t: nuu1 + (nuu2 - nuu1)*t/T2
    phi = Integration.two_pops(phi, xx, T2, nu1=nu1_func2, nu2=nu2_func2,
                               m12=m2_12, m21=m2_21)
    fs = Spectrum.from_phi(phi, ns, (xx, xx))
    return fs

 

I’m using Portik’s pipeline for the optimization. The optimized parameters were [2.2006, 0.8126, 2.8392, 6.0689, 0.0924, 3.679, 5.739, 0.1671, 1.9129, 0.6256], which suggest the opposite of what I initially predicted (e.g. an ancestral decline instead of a growth). However, the fit of the model was really bad: I have residuals on the order of 10^13 (I don’t understand why they are not shown in the plot, but that’s the range of the colour scale). I played around with vmin= and resid_range= to find the regions of my spectrum that are under- and overestimated. Most of my spectrum is underestimated, especially at the corners (unique, intermediate/high-frequency variation in each population), but I’m attaching plots focused on where most of my data is located (vmin = 1; attached figures 1 and 2), and they show that the residuals are not random at all. I modified the bounds and initial parameters according to my hypothesis to look for a more likely parameter space, but the resulting models didn’t improve and still had huge residuals (I initially ran simpler models that didn’t involve population size changes, but they failed, see https://groups.google.com/g/dadi-user/c/4pbwzbMp5p8, and the 1D models suggest strong Ne changes, see below).
1. 2D-twoepoch.Down224.Ups239_noVmin.png
2. 2D-twoepoch.Down224.Ups239_vmin1_res3.png
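For reference, this is roughly the plotting call I’ve been using to explore those residuals (just a sketch; model is the extrapolated model spectrum from the best-fit parameters, and the population labels are approximate):

import dadi
# vmin=1 sets the lower limit of the log colour scale for data/model,
# and resid_range clips the residual colour scale (these are just the values I tried)
dadi.Plotting.plot_2d_comp_multinom(model, fs, vmin=1, resid_range=3,
                                    pop_ids=('Down', 'Ups'))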


Because of that, I decided to take a step back and return to my 1D models to assess the type of change (i.e. instantaneous, linear, exponential) in each population, and to see whether 1-, 2- or 3-epoch models fit them better. After running several replicates of Portik’s pipeline with 1- or 2-epoch models, I found that an instantaneous or exponential increase has happened in both populations. However, I find it hard to choose which type of change fits better, as both models have very similar residual plots (attached figure 3). I therefore decided to proceed with 2- or 3-epoch models combining instantaneous and exponential changes (to simulate a bottleneck). I plotted the replicates that had the highest log-likelihoods and that “converged” on similar values across different iterations, but I still can’t decide which model is qualitatively better, as the residuals look very similar. I wanted to change the range of residuals shown in the 1D plots but couldn’t find an argument for it, so I calculated the values with resid = dadi.Inference.Anscombe_Poisson_residual(model, fs) to inspect them better. However, I also found huge residuals for these 1D models (the values are those printed for each model in the attached figure 4).

So here is my first round of questions: is this function giving the same residual values as those in the plot? If so, it would mean that I still have bad 1D models! But if not, is there a way to change the residual range in the 1D plots? I’m really confused about the residual values here.
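(In case it helps to see exactly what I mean, this is roughly how I’m inspecting those 1D residuals by hand – just a sketch, with model being the spectrum returned by the model function:)

import dadi
import matplotlib.pyplot as plt
# compute the Anscombe Poisson residuals and plot them with a range of my choosing
resid = dadi.Inference.Anscombe_Poisson_residual(model, fs)
plt.plot(resid)
plt.ylim(-3, 3)  # the "residual range" I'd like to control in the 1D comparison plots
plt.ylabel('Anscombe residual')
plt.show()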

Because my 1-, 2- and 3-epoch models are not nested, I cannot use the LRT, so I decided to try to calculate the clAIC (following what was initially discussed here: https://groups.google.com/g/dadi-user/c/n1-ovndv8FI?pli=1); since I find it hard to select the best fit based on the residual plots alone, I also want to incorporate a more quantitative model selection approach. For this second part, I basically would like to know whether I’m using the appropriate objects generated by dadi’s functions, specifically for the H and J matrices, and whether I should be using log during the calculation – my doubts are marked as comments in the code. I’m following the source code from this page: https://gitlab.mbb.cnrs.fr/khalid/dadi/-/blob/master/dadi/Godambe.py?ref_type=heads. This is the code I wrote (I’m using an example function with the best optimized parameters from a Portik’s pipeline run):

def ep3_ins_exp_ins(params, ns, pts):
    nu1, nu2, nu3, T1, T2, T3 = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    # T1: ancestral epoch at constant size nu1
    phi = Integration.one_pop(phi, xx, T1, nu1)
    # T2: exponential change from nu1 to nu2
    nu_func2 = lambda t: nu1 * (nu2/nu1) ** (t/T2)
    phi = Integration.one_pop(phi, xx, T2, nu_func2)
    # T3: instantaneous change to nu3
    phi = Integration.one_pop(phi, xx, T3, nu3)
    fs = Spectrum.from_phi(phi, ns, (xx,))
    return fs

   

import numpy
import dadi

func = ep3_ins_exp_ins
func_ex = dadi.Numerics.make_extrap_log_func(func)

grid_pts = [450, 460, 470]
p0 = [0.5889, 0.2077, 21.7962, 0.221, 0.0468, 0.0629]
data = fsd  # fsd is my data spectrum; boots is the list of bootstrapped spectra (both defined earlier)

# Getting the matrices
GIM, H, J, cU = dadi.Godambe.get_godambe(func_ex, grid_pts, boots, p0, data, eps=0.01, log=True)
# I use log=True because that's what I'm using for func_ex, but I don't know enough math to be sure this makes sense

# Getting the inverse of the Hessian
H_inv = numpy.linalg.inv(H)

# Getting the trace of the first term
first_term = numpy.trace(J @ H_inv)

# Getting the log-likelihood of the model
ns = fsd.sample_sizes
modeld = func_ex(p0, ns, grid_pts)
ll = dadi.Inference.ll_multinom(modeld, fsd)

# Calculating the clAIC!
clAIC = 2*first_term - 2*ll

 

This is a really long post – apologies for that. I’d really appreciate your help with my two concerns, as well as any additional comments/advice that can help me move toward good 2D models for my system.


Thank you very much,

Maria

3. 1D-Upstream-1_2epoch.PNG
4. 1D-Downstream-2_3epoch.PNG

Ryan Gutenkunst

Apr 8, 2026, 1:23:00 PM
to dadi...@googlegroups.com
Hello Maria,

Lots of progress since our last exchange!

A few thoughts:
1. In your two-population model, I would set nu0 = 1 if you want to model smoothly starting exponential growth. Right now you have an instantaneous size change, followed by exponential change thereafter.
2. I think projecting your data is making the “corner” residuals harder to interpret. What were your original sample sizes? The projection “smears out” data points, so rare SNPs in those corners end up with low occupation across many SFS bins, which isn’t visible when vmin=1. You might consider subsampling the data for easier visualization.
3. More generally, it would be worth looking at those outlier SNPs carefully in the data to see whether you believe they’re real and neutral. They could be distorting the fit quite a bit.
4. In general, the 1D models are almost mathematically unidentifiable from each other once they have more than two parameters in them. So I wouldn’t focus on trying to distinguish between them or do statistical tests.
5. resid = dadi.Inference.Anscombe_Poisson_residual(model, fs) is how that calculation is done, but the model needs to be normalized by theta first, which may be what you’re seeing (see the sketch after this list).
6. How does the two-population model fit if you remove the linear change afterwards? I suspect that will be really hard to infer well, and the complexity may be interfering with optimization.
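For point 5, a rough sketch of what I mean (with model being the unscaled spectrum returned by your model function):

import dadi
# scale the model by the optimal theta before computing residuals;
# the comparison plotting functions do this scaling internally
theta = dadi.Inference.optimal_sfs_scaling(model, fs)
resid = dadi.Inference.Anscombe_Poisson_residual(model*theta, fs)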

Best,
Ryan


Maria Paula Rodriguez

Apr 9, 2026, 3:38:46 PM
to dadi-user
Hi Ryan,

Thanks for your prompt reply and for sharing your ideas. I’ll follow up by combining related points; my specific questions are included in each:

4/5. Good to know these things, thanks for clarifying them!

1/6. Oh ok, so just to double-check: the instantaneous change is there because Nref (which is "internally" set to 1) is not nu0 in my model (and wouldn’t be in any model unless the most ancient nu is set to 1), so my nu0 could take a different value (e.g. 10), right?
I ran the model again with nu0 = 1, and also removed the more recent linear changes and any migration (to keep it simple). I’m attaching the result with and without vmin = 1 (attached figures 1 and 2) for the replicate with the highest log-likelihood, which again suggests growth in the two daughter populations (nu1 = 1.6 and nu2 = 2.7) compared to nu0 (0.12). As you can see, I still get huge residuals; this model mostly overestimates the corners (high- and intermediate-frequency private variants) and some unique SNPs, but when I look at a smaller residual range it also overestimates the shared variation.
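For reference, the simplified function now looks roughly like this (a sketch; parameter names are approximate, and it uses the same dadi imports as the earlier model):

def ancExpChange_split_noMig(params, ns, pts):
    nu0, nu1, nu2, T1, T2 = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    # ancestral exponential change starting smoothly from Nref = 1 and ending at nu0
    nu_func1 = lambda t: nu0 ** (t/T1)
    phi = Integration.one_pop(phi, xx, T1, nu_func1)
    # split, then constant sizes in each daughter population and no migration
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    phi = Integration.two_pops(phi, xx, T2, nu1=nu1, nu2=nu2, m12=0, m21=0)
    fs = Spectrum.from_phi(phi, ns, (xx, xx))
    return fs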

2/3. My original sample sizes were 184 and 175 individuals in my two populations (368 and 350 chromosomes), and I projected down in easySFS to 222 and 243, respectively, maximizing the number of SNPs. I followed your suggestion of subsampling the data to ease visualization (e.g. I cropped the model and data SFS to 20 chromosomes – attached figure 3; I’m not sure if there’s a function to properly subsample), and it shows that the model mostly underestimates the shared and unique rare variation and overestimates the unique, more common variation. I find it challenging to come up with a biological explanation for the [1,1] entry, but maybe recent gene flow could help with that(?).

The range of residuals is much smaller in the cropped region of the SFS, which might be related to what you mentioned about losing variants in the corners with the downprojection. You said that subsampling can help with plotting, but would it be a good idea to subsample and conduct the demographic analysis with fewer samples? I haven’t seen many other posts with a sample size similar to mine, and I don’t know if this might be complicating the optimization too (I kept all samples from my two genetic groups separated by the barrier, but some of them are farther from it, and I don’t know how IBD could complicate the demographic analysis). However, I decided to keep all my samples, as the divergence is suspected to be recent (within the last century), in order to increase the power to detect recent events (https://link.springer.com/article/10.1186/s12862-014-0254-4). Do you have any recommendations for a better subsampling/downprojection strategy? I also want to mention that I allowed a maximum site missingness of 30% when filtering my VCF, as I wanted to keep sites with a low proportion of missing data, and I thought it could benefit the downprojection as I understand it (i.e. "more confident" resampling if most variants have calls).

Sorry, I didn’t understand which "outlier SNPs" you were referring to in #3. Do you mean the SNPs in the corners (variants that are private or close to private to each population) that are smeared out during the downprojection? To test for neutrality, maybe I could conduct an HWE test within each population(?) (I think doing this with both populations combined is not good, as I could lose sites associated with structure), and I assume that by "real" you mean remaining sequencing/filtering artefacts in the data – besides the site missingness, I applied standard site/genotype quality filters and excluded sites with very high heterozygosity (probable paralogs), but not much more than that.

To better address the recent demography of my study system, I ran a temporal Ne analysis in GONE2 (https://www.nature.com/articles/s41467-025-61378-w), and it estimated a larger ancestral metapopulation size that has persistently declined to the present. Most of the replicates I’ve run in dadi lead to optimized parameters where the daughter populations are larger than the ancestral one (like the one I’m showing here) – although the fit is bad – and it is hard to reconcile all these results. I’m currently running Portik’s pipeline with models constraining the population sizes (nu0 >= nu1 + nu2; see the sketch below for the idea) to see what they look like, but these take longer. Maybe the inferred recent growth is related to the loss of the high-frequency variants you suspect is happening (which could inflate the bins of rare SNPs, giving a signal of growth), so I can share more information about my filtering pipeline/downprojection if needed.
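(For illustration, the idea of the constraint is a reparameterization roughly like this sketch, so that nu1 + nu2 can never exceed nu0 – I’m not sure it’s the cleanest way to do it:)

def split_constrained(params, ns, pts):
    # f1 and f2 are bounded in (0, 1), so nu1 + nu2 <= nu0 by construction
    nu0, f1, f2, T1, T2 = params
    nu1 = nu0 * f1
    nu2 = nu0 * (1 - f1) * f2
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    # ancestral epoch at size nu0
    phi = Integration.one_pop(phi, xx, T1, nu0)
    # split, then constant daughter sizes and no migration
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    phi = Integration.two_pops(phi, xx, T2, nu1=nu1, nu2=nu2, m12=0, m21=0)
    fs = Spectrum.from_phi(phi, ns, (xx, xx))
    return fs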

I really appreciate the time you're taking to help me think this through!

Best,
Maria
3. 2D-CROPPED-ancExpChange_split_noMig_vmin1.png
2. 2D-ancExpChange_split_noMig_vmin1_resRange3.png
1. 2D-ancExpChange_split_noMig_Novmin1.png

Ryan Gutenkunst

Apr 14, 2026, 4:51:21 PM
to dadi...@googlegroups.com
Hello Maria,

On Apr 9, 2026, at 12:38 PM, Maria Paula Rodriguez <mrodrig...@gmail.com> wrote:

> Hi Ryan,

> 1/6. Oh ok, so just to double-check: the instantaneous change is there because Nref (which is "internally" set to 1) is not nu0 in my model (and wouldn’t be in any model unless the most ancient nu is set to 1), so my nu0 could take a different value (e.g. 10), right? […]

Yes.

> 2/3. My original sample sizes were 184 and 175 individuals in my two populations (368 and 350 chromosomes), and I projected down in easySFS to 222 and 243, respectively, maximizing the number of SNPs. I followed your suggestion of subsampling the data to ease visualization (e.g. I cropped the model and data SFS to 20 chromosomes – attached figure 3; I’m not sure if there’s a function to properly subsample) […]

By subsampling I meant getting down to the lower sample size by randomly choosing individuals for each SNP, rather than projecting, which averages over all choices. Maybe this isn’t supported in easySFS.
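If it’s not, dadi itself can do that kind of subsampling when building the spectrum from a VCF. Roughly (the file names and population labels here are placeholders – double-check the argument names against the docs):

import dadi
# randomly subsample 10 diploid individuals per population at each SNP, instead of projecting
dd = dadi.Misc.make_data_dict_vcf('variants.vcf.gz', 'popfile.txt',
                                  subsample={'Down': 10, 'Ups': 10})
# 10 diploids = 20 chromosomes per population; set polarized according to your data
fs = dadi.Spectrum.from_data_dict(dd, pop_ids=['Down', 'Ups'],
                                  projections=[20, 20], polarized=False)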

> […]

> Sorry, I didn’t understand which "outlier SNPs" you were referring to in #3. Do you mean the SNPs in the corners (variants that are private or close to private to each population) that are smeared out during the downprojection? […]

Yes, I mean SNPs that are extremely differentiated, almost at fixation in one population and absent from the other. Those extreme outliers might be issues in the data or some very strong (and interesting) selection. There are few enough that you can just filter them out directly based on allele frequencies.
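As a quick check of how much they’re driving the fit, you could also mask the corresponding entries of the data spectrum before refitting. A rough sketch (the indices are just illustrative):

# mask bins that are near fixation in one population and absent in the other;
# masked entries are then left out of the likelihood and the plots
fs.mask[-3:, 0] = True
fs.mask[0, -3:] = True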

Best,
Ryan

