model residuals/fit and clAIC implementation

2 views
Skip to first unread message

Maria Paula Rodriguez

unread,
Apr 3, 2026, 5:17:09 PM (2 days ago) Apr 3
to dadi-user

Hi Ryan,

 I have two main questions related with model fit/comparison, and I’d appreciate your help in clarifying them. But before that, I’ll provide some background on the analyses I’ve conducted so far so that the questions make more sense:

I’m currently running 2D models where I want to estimate the divergence time between two fish populations that I suspect diverged very recently (within the last century) due to a barrier. I started with 1D models for each population as you have previously suggested, and models suggested expansions in both. I proceeded to 2D models testing my hypothesis according to what I know about the system and some expectations: there was an ancestral population that went through an expansion (due to colonization by fish) that diverged in two (due to the barrier), then each population decreased in size (due to smaller habitat caused by the complete fragmentation of the barrier) and no gene flow was allowed (it was a complete barrier for a while). More recently, the barrier has been modified to allow passage and some management has been conducted, so Ne could have increased again and gene flow is allowed, but I didn’t include this third epoch in my model to avoid creating one very complex for now (and since this has happened very recently, maybe the growth suggested by my initial models is related with a larger ancestral population – but I might be wrong). I know that the optimization will find the most likely values for Ne’s, but I wanted to mention the null hypothesis I’m testing. This is the function I defined to represent this 2-epoch scenario:

def twoEp_growth_decline(params, ns, pts):

    nu0, nu1, T1, nud1, nuu1, nud2, nuu2, T2, m2_12 , m2_21 = params

     xx =  Numerics.default_grid(pts)

    phi = PhiManip.phi_1D(xx) # T0: ancestral pop

       # ancestral exponential change

    nu_func1 = lambda t : nu0 * (nu1/nu0) ** (t/T1) # ancient exponential change (e.g. growth)

    phi = Integration.one_pop(phi, xx, T1, nu_func1) 

   # split

    phi = PhiManip.phi_1D_to_2D(xx, phi)

    # linear changes in each subpop

    nu1_func2 = lambda t: nud1 + (nud2 - nud1)*t/T2

    nu2_func2 = lambda t: nuu1 + (nuu2 - nuu1)*t/T2

    phi = Integration.two_pops(phi, xx, T2, nu1 = nu1_func2, nu2 = nu2_func2, m12 = m2_12, m21 = m2_21) # T1: different pop changes

    fs = Spectrum.from_phi(phi, ns, (xx,xx))

    return fs

 

I’m using Portik’s pipeline for the optimization. After this, the optimized parameters were [2.2006, 0.8126, 2.8392, 6.0689, 0.0924, 3.679, 5.739, 0.1671, 1.9129, 0.6256] which suggest opposite things to what I initially predicted (e.g. ancestral decline instead of a growth). However, the fit of the model was really bad, I have residuals on the order of 10^13 (which I don’t understand why they are not shown in the plot, but I’m looking at the range of the scale). I played around with vmin= and resid_range= to find the regions of my spectrum that are both under and overestimated. Most of my spectrum is underestimated, especially at the corners (unique, intermediate/high-frequency variation in each population), but I’m attaching a photo where most of my data is located (vmin = 1; attached figures 1 and 2), and it shows that residuals are not random at all. I modified the bounds and initial parameters according to my hypothesis to look for a more likely parameter space, but the resulting models didn’t improve and still had huge residuals (I initially ran simpler models that didn't involve popualtion size changes, but they failed https://groups.google.com/g/dadi-user/c/4pbwzbMp5p8 and 1d models suggest strong Ne changes, see below).2. 2D-twoepoch.Down224.Ups239_vmin1_res3.png1. 2D-twoepoch.Down224.Ups239_noVmin.png


Because of that, I decided to take a step back and return to my 1D models in order to assess the change (i.e. instantaneous, linear, exponential) of each population, and to see if 1, 2 or 3-epoch models fit them better. After running several replicates of Portik’s pipeline with 1 or 2-epoch models, I found that an instantaneous or exponential increase have happened in both populations. However, I find it hard to choose what change was better as both models have very similar residual plots (attached figure 3). Thus, I decided to proceed with 2 or 3-epoch models with some combinations of instantaneous and exponential changes (to simulate a bottleneck). I plotted the replicates that had the highest log-likelihoods and that “converged” on similar values across different iterations, but I still can’t decide on which model is qualitatively better as residuals look very similar. I was interested in knowing if I could change the range of residuals in 1d plots and couldn’t find an argument, so I calculated the values with resid = dadi.Inference.Anscombe_Poisson_residual(model, fs) to inspect them better. However, I also found huge residuals for these 1d models – the values are those printed for each model in the figure 4 I’m attaching). So here are my first round of questions: Is this function giving the same residual values than those in the plot? If so, it would mean that I still have bad 1d models! but if not, is there a way a to change the residual range in the 1d plots? I'm really confused about residual values here.

Because my 1, 2 and 3-epoch models are not nested, I cannot use the LRT, so I decided to try to calculate the clAIC (following what was initially discussed here https://groups.google.com/g/dadi-user/c/n1-ovndv8FI?pli=1), and since I find it hard to select the best fit based on the residual plots alone, I want to also incorporate a more quantitative model selection approach. For this second part, I basically would like to know if I'm using the appropriate objects generated by dadi’s functions, specifically for the H, J matrices and if I should be using log during the calculation – I’m showing comments related to this in red. I’m following the source code from this page https://gitlab.mbb.cnrs.fr/khalid/dadi/-/blob/master/dadi/Godambe.py?ref_type=heads. This is the code I wrote (I’m using an example function with the best optimized parameters from a Portik’s pipeline run):

def ep3_ins_exp_ins(params, ns, pts):

    nu1, nu2, nu3, T1, T2, T3 = params

    xx = Numerics.default_grid(pts)

    phi = PhiManip.phi_1D(xx)

    # T1: ancestral time

    phi = Integration.one_pop(phi, xx, T1, nu1)

    # T2: exp change

    nu_func2 = lambda t : nu1 * (nu2/nu1) ** (t/T2)   

    phi = Integration.one_pop(phi, xx, T2, nu_func2)

    # T3: inst change

    phi = Integration.one_pop(phi, xx, T3, nu3)

    fs = Spectrum.from_phi(phi, ns, (xx,))

    return fs

   

func = ep3_ins_exp_ins

func_ex = dadi.Numerics.make_extrap_log_func(func)

 

grid_pts = [450, 460, 470]

p0 = [0.5889,0.2077,21.7962,0.221,0.0468,0.0629]

data = fsd

# Getting the matrices

GIM, H, J, cU = dadi.Godambe.get_godambe(func_ex, grid_pts, boots, p0, data, eps=0.01, log=True) #I do log because that’s what I’m using for the func_ex, but I don’t know enough math to be sure if this makes sense

# Getting the inverse Hessian

H_inv = numpy.linalg.inv(H)

# Getting the trace of the first term

first_term = numpy.trace(J @ H_inv)

 

# Getting the log.likelihood of the model

ns = fsd.sample_sizes

modeld = func_ex(p0, ns, grid_pts)

ll = dadi.Inference.ll_multinom(modeld, fsd)

# Calculating the clAIC!

clAIC = 2*first_term - 2*ll

 

This is a really long post, apologies for that. I'd really appreciate your help with my two concerns and any additional comment/advice that can help me to move forward to good 2D models for my system.


Thank you very much,

Maria

3. 1D-Upstream-1_2epoch.PNG
4. 1D-Downstream-2_3epoch.PNG
Reply all
Reply to author
Forward
0 new messages