Hi Ryan,
I have two main questions related with model fit/comparison, and I’d appreciate your help in clarifying them. But before that, I’ll provide some background on the analyses I’ve conducted so far so that the questions make more sense:
I’m currently running 2D models where I want to estimate the divergence time between two fish populations that I suspect diverged very recently (within the last century) due to a barrier. I started with 1D models for each population as you have previously suggested, and models suggested expansions in both. I proceeded to 2D models testing my hypothesis according to what I know about the system and some expectations: there was an ancestral population that went through an expansion (due to colonization by fish) that diverged in two (due to the barrier), then each population decreased in size (due to smaller habitat caused by the complete fragmentation of the barrier) and no gene flow was allowed (it was a complete barrier for a while). More recently, the barrier has been modified to allow passage and some management has been conducted, so Ne could have increased again and gene flow is allowed, but I didn’t include this third epoch in my model to avoid creating one very complex for now (and since this has happened very recently, maybe the growth suggested by my initial models is related with a larger ancestral population – but I might be wrong). I know that the optimization will find the most likely values for Ne’s, but I wanted to mention the null hypothesis I’m testing. This is the function I defined to represent this 2-epoch scenario:
def twoEp_growth_decline(params, ns, pts):
nu0, nu1, T1, nud1, nuu1, nud2, nuu2, T2, m2_12 , m2_21 = params
xx = Numerics.default_grid(pts)
phi = PhiManip.phi_1D(xx) # T0: ancestral pop
# ancestral exponential change
nu_func1 = lambda t : nu0 * (nu1/nu0) ** (t/T1) # ancient exponential change (e.g. growth)
phi = Integration.one_pop(phi, xx, T1, nu_func1)
# split
phi = PhiManip.phi_1D_to_2D(xx, phi)
# linear changes in each subpop
nu1_func2 = lambda t: nud1 + (nud2 - nud1)*t/T2
nu2_func2 = lambda t: nuu1 + (nuu2 - nuu1)*t/T2
phi = Integration.two_pops(phi, xx, T2, nu1 = nu1_func2, nu2 = nu2_func2, m12 = m2_12, m21 = m2_21) # T1: different pop changes
fs = Spectrum.from_phi(phi, ns, (xx,xx))
return fs
I’m using Portik’s pipeline for the optimization. After
this, the optimized parameters were [2.2006, 0.8126, 2.8392, 6.0689, 0.0924,
3.679, 5.739, 0.1671, 1.9129, 0.6256] which suggest opposite things to what I
initially predicted (e.g. ancestral decline instead of a growth). However, the
fit of the model was really bad, I have residuals on the order of 10^13 (which I
don’t understand why they are not shown in the plot, but I’m looking at the
range of the scale). I played around with vmin= and resid_range= to find the regions
of my spectrum that are both under and overestimated. Most of my spectrum is
underestimated, especially at the corners (unique, intermediate/high-frequency variation
in each population), but I’m attaching a photo where most of my data is located
(vmin = 1; attached figures 1 and 2), and it shows that residuals are not random at all. I modified the
bounds and initial parameters according to my hypothesis to look for a more
likely parameter space, but the resulting models didn’t improve and still
had huge residuals (I initially ran simpler models that didn't involve popualtion size changes, but they failed https://groups.google.com/g/dadi-user/c/4pbwzbMp5p8 and 1d models suggest strong Ne changes, see below).

Because of that, I decided to take a step back and return to my 1D models in order to assess the change (i.e. instantaneous, linear, exponential) of each population, and to see if 1, 2 or 3-epoch models fit them better. After running several replicates of Portik’s pipeline with 1 or 2-epoch models, I found that an instantaneous or exponential increase have happened in both populations. However, I find it hard to choose what change was better as both models have very similar residual plots (attached figure 3). Thus, I decided to proceed with 2 or 3-epoch models with some combinations of instantaneous and exponential changes (to simulate a bottleneck). I plotted the replicates that had the highest log-likelihoods and that “converged” on similar values across different iterations, but I still can’t decide on which model is qualitatively better as residuals look very similar. I was interested in knowing if I could change the range of residuals in 1d plots and couldn’t find an argument, so I calculated the values with resid = dadi.Inference.Anscombe_Poisson_residual(model, fs) to inspect them better. However, I also found huge residuals for these 1d models – the values are those printed for each model in the figure 4 I’m attaching). So here are my first round of questions: Is this function giving the same residual values than those in the plot? If so, it would mean that I still have bad 1d models! but if not, is there a way a to change the residual range in the 1d plots? I'm really confused about residual values here.
Because my 1, 2 and 3-epoch models are not nested, I cannot use the LRT, so I decided to try to calculate the clAIC (following what was initially discussed here https://groups.google.com/g/dadi-user/c/n1-ovndv8FI?pli=1), and since I find it hard to select the best fit based on the residual plots alone, I want to also incorporate a more quantitative model selection approach. For this second part, I basically would like to know if I'm using the appropriate objects generated by dadi’s functions, specifically for the H, J matrices and if I should be using log during the calculation – I’m showing comments related to this in red. I’m following the source code from this page https://gitlab.mbb.cnrs.fr/khalid/dadi/-/blob/master/dadi/Godambe.py?ref_type=heads. This is the code I wrote (I’m using an example function with the best optimized parameters from a Portik’s pipeline run):
def ep3_ins_exp_ins(params, ns, pts):
nu1, nu2, nu3, T1, T2, T3 = params
xx = Numerics.default_grid(pts)
phi = PhiManip.phi_1D(xx)
# T1: ancestral time
phi = Integration.one_pop(phi, xx, T1, nu1)
# T2: exp change
nu_func2 = lambda t : nu1 * (nu2/nu1) ** (t/T2)
phi = Integration.one_pop(phi, xx, T2, nu_func2)
# T3: inst change
phi = Integration.one_pop(phi, xx, T3, nu3)
fs = Spectrum.from_phi(phi, ns, (xx,))
return fs
func = ep3_ins_exp_ins
func_ex = dadi.Numerics.make_extrap_log_func(func)
grid_pts = [450, 460, 470]
p0 = [0.5889,0.2077,21.7962,0.221,0.0468,0.0629]
data = fsd
# Getting the matrices
GIM, H, J, cU = dadi.Godambe.get_godambe(func_ex, grid_pts, boots, p0, data, eps=0.01, log=True) #I do log because that’s what I’m using for the func_ex, but I don’t know enough math to be sure if this makes sense
# Getting the inverse Hessian
H_inv = numpy.linalg.inv(H)
# Getting the trace of the first term
first_term = numpy.trace(J @ H_inv)
# Getting the log.likelihood of the model
ns = fsd.sample_sizes
modeld = func_ex(p0, ns, grid_pts)
ll = dadi.Inference.ll_multinom(modeld, fsd)
# Calculating the clAIC!
clAIC = 2*first_term - 2*ll
This is a really long post, apologies for that. I'd really appreciate your help with my two concerns and any additional comment/advice that can help me to move forward to good 2D models for my system.
Thank you very much,
Maria