trouble running neutral model


Alex Hart

Feb 15, 2022, 9:10:05 AM
to dadi-user
Hi,

New dadi user here. I have two species for which I am trying to find the most likely demographic history by comparing the AIC of various models. The underlying genetic data are RADseq SNPs, which have been processed into a VCF.

I'll use species no. 1 as an example, since the two are fundamentally similar. Species 1 has 97 samples, and after using an external program to look for population structure, no significant structure was found. I have no ancestral data, so I generated the frequency spectrum and folded it.

When using 'dadi.Plotting.plot_1d_fs(fs_folded)', it looks like this:
example_FS.png

If I set up the model as follows:
data = dadi.Spectrum.from_file('species1.fs')
func = dadi.Demographics1D.snm
pts_l = [60, 70, 80]
expected_sfs = func(data, ns, pts_l)

then I get an error:
TypeError: 'list' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred: [...]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
I was under the impression that the snm model doesn't take any parameters, since its only inputs are the number of samples and the grid. I am aware the grid is not optimal right now - I am just trying to get the model to work!

If I specify an upper bound, a lower bound, and p0, with two or three elements for each, then the snm model runs with the following code:
upper_bound = [1,1,1]
lower_bound = [0,0,0]
p0 = [1,1,1]

func_ex = dadi.Numerics.make_extrap_log_func(func)
p0 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound,
                              lower_bound=lower_bound)
print('Beginning optimization ************************************************')
popt = dadi.Inference.optimize_log(p0, data, func_ex, pts_l,
                                   lower_bound=lower_bound,
                                   upper_bound=upper_bound,
                                   verbose=len(p0), maxiter=100)
# The verbose argument controls how often progress of the optimizer is
# printed. It's useful for keeping track of the optimization process.
print('Finished optimization **************************************************')


However, I'm not sure whether I should accept the output it produces, given that the model is not supposed to take input parameters. I used the same code to run the growth, bottlegrowth, and two_epoch models, which all indicated that the best-fit values of nu and T (and variants) were very low (usually below 0.1). I have interpreted this as meaning either that I have run the models incorrectly for both species/all models, or that both species have demographic histories in which little has changed for a long time.

Thanks in advance,

Alex



Ryan Gutenkunst

Feb 15, 2022, 12:24:23 PM
to dadi-user
Hello Alex,

1) First, your data look really weird. It is very unusual to see a 1D spectrum that is not monotonically decreasing at low frequencies. It appears that your calling pipeline is missing a large majority of low frequency variants. Before trusting any dadi model, it would be imperative to understand potential biases in your sequencing/calling pipeline.

2) The SNM model takes no parameters, so the call would be ‘expected_sfs = func_ex([], ns, pts_l)’. You also need to create an extrapolating version of the function ‘func_ex = dadi.Numerics.make_extrap_func(dadi.Demographics1D.snm)’.

3) For the SNM model, there are no parameters to optimize. I’m not sure what you intend by setting upper and lower bounds of three parameters each.

4) You should not rely on any of your modeling outputs, since your input data are qualitatively weird.

Best,
Ryan

> On Feb 15, 2022, at 7:10 AM, 'Alex Hart' via dadi-user <dadi...@googlegroups.com> wrote:

Alex Hart

Feb 16, 2022, 4:35:52 AM
to dadi-user
Hi Ryan,

Thanks for your swift response. Your comments about my data being weird make a lot of sense, since I have extremely heavily filtered and processed them to remove low-frequency variants (as is necessary for other parts of my analysis). I will re-run the analysis with a less-filtered dataset.

Cheers,

Alex

Alex Hart

Mar 15, 2022, 11:45:48 AM
to dadi-user
Hi Ryan,

I was wondering if I could have some further help with regards to the neutral model. I reproduced my vcf files as you suggested with less filtering and was able to run and optimise models for "growth", "bottlegrowth", "two_epoch", and "snm". Here's what they look like, using "growth" as an example:
pasc_growth_model.png

I know that I can use the AIC of each model to compare how well each model fits the dataset. However, I would like to confirm that these models fit better (or worse) than the neutral model. I don't think you can calculate the AIC for the neutral model in the usual way, since it has no parameters, which, as I understand it, you need in order to calculate AIC.

However, I saw a paper (https://doi.org/10.1098/rspb.2017.2806 if you're interested) which used the LRT to compare the neutral model to other models. I understand that LRTs are meant to compare a simple model against a more complex model in which the simple one is nested. I have been trying to write a code section that performs LRTs using the neutral model as my simple model and the model of interest as the complex model within which the neutral model is nested. However, I cannot get a coherent answer from it, presumably because of the mandatory parameter nesting. A summary of the code:

# defining our complex model
complex_func = dadi.Demographics1D.two_epoch
complex_func_ex = dadi.Numerics.make_extrap_log_func(complex_func)

# best fit params optimised elsewhere
p_lrt = [12.4432, 0.286792]

# running the complex model
model = complex_func_ex(p_lrt, ns, pts_l)
# Likelihood of the data given the model AFS.
ll_complex_model = dadi.Inference.ll_multinom(model, data)

# bootstrapping
boots = dadi.Misc.bootstraps_from_dd_chunks(chunks, Nboot, pop_ids, ns)

# LRT
adj = dadi.Godambe.LRT_adjust(complex_func_ex, pts_l,
                              boots, p_lrt, data,
                              nested_indices=[0], multinom=True)
D_adj = adj*2*(ll_complex_model - neutral_ll)

The value of D_adj changes depending on which nested indices I specify, but the calculation will not run if I don't specify any. I'm afraid I don't understand the statistics underpinning these tests well enough to find a solution. I am very much out of my depth with this level of statistics and would appreciate any advice you can give, since I can't help but feel I might be barking up the wrong tree!
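A sketch of the last step, in case it helps others reading this thread: once an adjusted statistic D_adj is in hand, the p-value for a single parameter tested on a boundary (e.g. T = 0) comes from a 50:50 mixture of a point mass at zero and a chi^2 with 1 degree of freedom. This uses only the standard library; note that with two nested parameters, as discussed later in the thread, the correct weighting is more complex.

```python
import math

def chi2_sf_1dof(x):
    # Survival function of a chi^2 with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def boundary_lrt_pvalue(D_adj):
    # Null distribution for a single parameter on a boundary (e.g. T = 0):
    # 50:50 mixture of a point mass at 0 and a chi^2 with 1 dof.
    if D_adj <= 0:
        return 1.0
    return 0.5 * chi2_sf_1dof(D_adj)

# Example: D_adj = 4.0 is significant at the 5% level under this mixture.
p = boundary_lrt_pvalue(4.0)
```

dadi also ships dadi.Godambe.sum_chi2_ppf for computing critical values of such weighted chi^2 mixtures, which may be more convenient in practice.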

Kind regards, and thanks in advance,

Alex

Mario Ernst

Sep 28, 2023, 10:20:34 AM
to dadi-user
Hi Alex, 

I am having the same issue. I am running different 1D models and would like to compare these to the neutral model to see if they are a better fit. Did you figure out a way to test this?

All the best, Mario

Ryan Gutenkunst

Oct 3, 2023, 1:26:11 PM
to dadi-user
Hello Mario,

Not sure why I didn’t answer this last year… In any case, the issue is that both model parameters are nested here, because the SNM can be arrived at from the two_epoch model by either setting T = 0 or nu = 1. Note also that in this case, because one of the parameters lies on the boundary (T = 0), the null distribution is a more complex weighted mixture of chi-squared distributions. (See the supplemental material of https://doi.org/10.1093/molbev/msv255 .)

All that said, I actually think visual inspections of the residuals are more informative than formal tests for genomic data in any case.

Best,
Ryan

> On Sep 28, 2023, at 7:20 AM, Mario Ernst <film...@gmail.com> wrote:

Message has been deleted

Ryan Gutenkunst

Apr 8, 2024, 7:03:00 PM
to dadi-user
Hello Mario,

See replies inline below.

> On Mar 27, 2024, at 6:02 AM, Mario Ernst <film...@gmail.com> wrote:
> When you say that it is an issue that the snm_1d can be arrived at from the two_epoch model by either setting T = 0 or nu = 1, do you mean to say that the results don't provide any meaningful information in terms of model selection? If so, is there a way around it?

In general, we want to use the Godambe approach to calculate the adjusted likelihood ratio accounting for linkage between the SNPs. That’s a little bit subtle in this case, because you can nest the snm model within the two_epoch model two ways.

> Alex mentioned that it is not possible to do AIC because the standard neutral model has no parameters but it would be possible to calculate the AIC as -2*log-likelihood of the model, right?

Correct.
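In code, a sketch of that comparison, using the log-likelihoods quoted later in this message (note the two_epoch value there was evaluated at neutral-equivalent parameters rather than at an independent optimum, so this is purely illustrative):

```python
def aic(ll, k):
    # Akaike information criterion: 2k - 2 * log-likelihood.
    return 2 * k - 2 * ll

ll_neutral = -74.47790204264794    # snm: k = 0 free parameters
ll_two_epoch = -74.47790204244299  # two_epoch: k = 2 free parameters (nu, T)

aic_neutral = aic(ll_neutral, 0)     # reduces to -2 * ll when k = 0
aic_two_epoch = aic(ll_two_epoch, 2)

# Lower AIC wins; here the two extra parameters buy essentially no
# likelihood, so the neutral model is preferred.
neutral_preferred = aic_neutral < aic_two_epoch
```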

> I am asking because I am running 1D models (snm_1d and two_epoch) for multiple populations and almost always the AIC supports the more complex models. However, the LRT never finds significant support for the complex model (the difference in the likelihoods is extremely low) and I am wondering why that is.

I’m a bit confused since your example showed a difference in log likelihood ~10^-10, which no method should ever say lends support to the more complex model.

Best,
Ryan

>
> Find below the code for one example population (and the respective fitted spectra attached):
>
> ###### define the models
> def snm_1d(notused, ns, pts):
>     """
>     Standard neutral model.
>     ns = (n1,)
>     n1: Number of samples in resulting Spectrum
>     pts: Number of grid points to use in integration.
>     """
>     xx = Numerics.default_grid(pts)
>     phi = PhiManip.phi_1D(xx)
>     fs = Spectrum.from_phi(phi, ns, (xx,))
>     return fs
>
> snm_1d.__param_names__ = []
> snm = snm_1d
>
> def two_epoch(params, ns, pts):
>     """
>     Instantaneous size change some time ago.
>     params = (nu,T)
>     ns = (n1,)
>     nu: Ratio of contemporary to ancient population size
>     T: Time in the past at which size change happened (in units of 2*Na
>        generations)
>     n1: Number of samples in resulting Spectrum
>     pts: Number of grid points to use in integration.
>     """
>     nu,T = params
>     xx = Numerics.default_grid(pts)
>     phi = PhiManip.phi_1D(xx)
>     phi = Integration.one_pop(phi, xx, T, nu)
>     fs = Spectrum.from_phi(phi, ns, (xx,))
>     return fs
>
> two_epoch.__param_names__ = ['nu', 'T']
>
> def two_epoch_LRT(params, ns, pts):
>     """
>     Instantaneous size change some time ago, with theta as an explicit
>     parameter.
>     params = (nu,T,theta)
>     ns = (n1,)
>     nu: Ratio of contemporary to ancient population size
>     T: Time in the past at which size change happened (in units of 2*Na
>        generations)
>     theta: Scaled mutation rate
>     n1: Number of samples in resulting Spectrum
>     pts: Number of grid points to use in integration.
>     """
>     nu,T,theta = params
>     xx = Numerics.default_grid(pts)
>     phi = PhiManip.phi_1D(xx,theta0=theta)
>     phi = Integration.one_pop(phi, xx, T, nu, theta0=theta)
>     fs = Spectrum.from_phi(phi, ns, (xx,))
>     return fs
>
> #calculate theta and likelihood for neutral model
> proj=[34]
> fs = dadi.Spectrum.from_data_dict(dd, [pop], polarized = False,projections = proj)
> ns = fs.sample_sizes
> pts = [max(ns)+300, max(ns)+350, max(ns)+400]
> func_ex = dadi.Numerics.make_extrap_log_func(snm_1d)
> model_fit = func_ex([], ns, pts)
> ll_neutral = dadi.Inference.ll_multinom(model_fit, data) #ll_neutral=-74.47790204264794
> theta = dadi.Inference.optimal_sfs_scaling(model_fit, data) #theta=820.9806169050971
>
> #run the models as if the scenario was neutral
> func_ex_two_epoch = dadi.Numerics.make_extrap_log_func(two_epoch_LRT)
> popt_two_epoch=[ 1, 0.8396 , 820.98061691] #Ne=1, T=0.8396 is optimised param for two_epoch model elsewhere, theta=820.98061691 calculated above for snm_1d
> model_two_epoch = func_ex_two_epoch(popt_two_epoch, ns, pts)
> ll_two_epoch = dadi.Inference.ll(model_two_epoch, fs) #ll_two_epoch=-74.47790204244299
>
> #degrees of freedom: 2 (as Ne and T are added to the snm_1d model)
> #LRT statistic: 2*(ll_two_epoch-ll_neutral)=4.098978934052866e-10
> #critical value (p<0.05)=5.991, hence not significant
>
> All the best, Mario
> <fitted_alleni_r70_pop2_reps_50_50_100_maxiters_100_100_100_folds_3_2_1_two_epoch.pdf><fitted_alleni_r70_pop2_reps_50_50_100_maxiters_100_100_100_folds_3_2_1_snm_1d.pdf>

Mario Ernst

Apr 11, 2024, 12:30:51 PM
to dadi-user
Dear Ryan, 

thank you so much for your reply!

Since I am working with RADseq data with one SNP per locus, I am assuming the SNPs are unlinked. I therefore thought there was no need to calculate composite likelihoods or to use the Godambe approach.

The reason the LR statistic is so low while the AIC clearly supports the more complex model is, I think, that I am doing the LRT wrong. I probably haven't understood correctly, but I am trying to run the complex model with parameter values that make it the same model as the best-fit simple model (following your explanation here: https://groups.google.com/g/dadi-user/c/wXneKCnXa6A/m/xOx6l0eYBQAJ).

So to give you a different example, lets say I am working with the following 2D models:

def no_mig(params, ns, pts):
    """
    Split into two populations, no migration.
    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    T: Time in the past of split (in units of 2*Na generations)
    """
    nu1, nu2, T = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=0, m21=0)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

def sym_mig(params, ns, pts):
    """
    Split into two populations, with symmetric migration.
    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    T: Time in the past of split (in units of 2*Na generations)
    m: Migration rate between populations (2*Na*m)
    """
    nu1, nu2, m, T = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=m, m21=m)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

Following parameter optimization for the no_mig model, I obtain [0.3969, 0.8135, 0.4706] as parameter values and a likelihood of -1160.6053407377483.

Then, I incorporate these values into the sym_mig model with m=0, to make it the same model as the best-fit no_mig model:
func_ex_sym_mig = dadi.Numerics.make_extrap_log_func(sym_mig)
model_sym_mig = func_ex_sym_mig([0.3969,0.8135,0,0.4706], ns, pts) # nu1, nu2, m, T 
ll_sym_mig = dadi.Inference.ll_multinom(model_sym_mig, fs)

But ll_sym_mig ends up being -1160.6053407377483, the same likelihood as I obtained initially.

If you can pinpoint where I went wrong in my approach, I would greatly appreciate your insights on the correct method for executing the LRT. Thank you so much for your support!

Best regards, Mario

Ryan Gutenkunst

Apr 11, 2024, 6:17:46 PM
to dadi-user
Hello Mario,

Correct, if you can assume unlinked loci, you don’t need to worry about composite likelihoods and the Godambe approach.

For the LRT then, you just need to compare likelihoods of the complex model and the simple model. (If you turn the complex model into the simple model, you should get the same likelihood, up to numerical noise. We need to do something more complex for the Godambe, hence the previous discussion about turning one model into the other.)

The one remaining challenge with the LRT is that the null distribution isn’t a simple chi^2 when your parameters lie on the boundary of the parameter space. We discuss that a bit in the Godambe paper (Coffman et al.).
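As a sketch of that full procedure for the no_mig vs sym_mig case (the no_mig log-likelihood is the one from earlier in the thread; the sym_mig value is a made-up stand-in, since a real test needs an independent optimization of the complex model):

```python
import math

ll_simple = -1160.6053407377483  # no_mig, optimized: nu1, nu2, T
ll_complex = -1158.21            # sym_mig, optimized: HYPOTHETICAL value

# Likelihood-ratio statistic from two independently optimized fits.
D = 2 * (ll_complex - ll_simple)

# m = 0 sits on the boundary of sym_mig's parameter space, so the null is
# a 50:50 mixture of a point mass at 0 and a chi^2 with 1 dof;
# chi^2_1 survival function is erfc(sqrt(D / 2)).
p = 0.5 * math.erfc(math.sqrt(D / 2.0)) if D > 0 else 1.0
```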

Best,
Ryan
