global optimum, parameter estimation variations, and nested models comparison

90 views

Skip to first unread message

Yuanshu Pu

unread,

Aug 22, 2024, 1:37:15 PM8/22/24

to dadi-user

Dear Ryan and dadi-users,

First of all, many thanks for maintaining this group! I’ve been learning a lot from the previous discussions.

I’m working with ddRad seq data from 15 populations of a single species, each with 7-10 individuals. I don’t have an outgroup, so I’m working with folded fs and have used only a single SNP per locus, assuming they are unlinked. Using Admixture, I identified 4 clusters among these populations. I pooled all individuals from each cluster into one dadi population and started by testing some 1D models for them.

I followed the optimization routine from Portik et al. (2017), running at least 5 rounds of optimization with 100 replicate optimizations per round. The starting parameters were initially random, but after each round, the best scoring replicate's parameters were used to generate perturbed starting parameters for the next round. Convergence was considered achieved when at least two replicates in each of the last two rounds reached the same optimum (with the best log-likelihood values within a 0.005 difference). If not, I increased the number of rounds until convergence was reached. I repeated this routine multiple times for each model, starting with random parameters each time, to produce independent versions of the optimum. Comparing these versions should help determine if a global optimum was found.

Now I’m a bit stuck with properly understanding the results, especially the estimated parameters. In specific:

In the attached table, I summarized 10 optimization versions of a 2-epoch 1D model (population increase followed by a decrease). Two versions (v1, v8) reached the optimum with the lowest AIC (142.68), while five versions (v4,v5,v7,v9,v10) reached an optimum with AIC=142.70, with slightly different parameters (see the attached plots for estimation of parameter T1 and nu1 from different versions). I’m wondering, which optimum should be considered global: the one with the lowest AIC or the one supported by more independent versions? Alternatively, should they be considered the same since the difference is minor? If so, is it valid to report a range (like a 95% confidence interval) of the converged replicates of parameter estimates rather than just the single best log-likelihood one? Or instead, are more rounds of optimization needed for more precise convergence (perhaps aiming for a log-likelihood difference of 0.0001 instead of 0.005)?
I’m uncertain about the concept of “nested models” with which the AIC are comparable. With one same fs, I tested a 1epoch model with linear population decline (optimized AIC = 203.16, with two parameters; see attached plot of comparison between model and data); and a 2epoch model with population increase in T1 followed by a decline in T2 (optimized AIC = 142.7, with four parameters; I have trouble plotting the result because the current population size at convergence is 0). Can I directly compare these results to interpret that “a historical increase before population decline is better supported”, or should I run the 1epoch model in a 2epoch framework, with the same four parameters but with restrictions between them?
The parameter estimations of the 1epoch and 2epoch models from question2 are drastically different: the 1epoch model suggested a population decline from Ne=58,6000 to Ne=290 over just 10k years; while the 2epoch model estimated an initial population increase from Ne=50,000 to Ne=1,000,000 over 70[10-90] k years, followed by a decrease to Ne=0 over ~170[150-200] k years. I’m wondering why this occur (i.e. why is the decrease time in the 2epoch model much longer than in the 1epoch model), and how to deal with the large variation in some parameters’ estimations.
Lastly, is it possible to find different patterns for each population within the same ADMIXTURE cluster, or is pooling all populations of each cluster into one “dadi population” sufficient?

I appreciate any insights or advice you can provide!

Best regards,

Yuanshu

REST_mod2_incdec_optimum_1_actual.nu1_violin_plot.png

Ryan Gutenkunst

unread,

Aug 29, 2024, 7:23:38 PM8/29/24

to dadi-user

Hello Yuanshu,

See responses below.

On Aug 22, 2024, at 9:18 AM, Yuanshu Pu <yuans...@evobio.eu> wrote:

Dear Ryan and dadi-users,
First of all, many thanks for maintaining this group! I’ve been learning a lot from the previous discussions.
I’m working with ddRad seq data from 15 populations of a single species, each with 7-10 individuals. I don’t have an outgroup, so I’m working with folded fs and have used only a single SNP per locus, assuming they are unlinked. Using Admixture, I identified 4 clusters among these populations. I pooled all individuals from each cluster into one dadi population and started by testing some 1D models for them.
I followed the optimization routine from Portik et al. (2017), running at least 5 rounds of optimization with 100 replicate optimizations per round. The starting parameters were initially random, but after each round, the best scoring replicate's parameters were used to generate perturbed starting parameters for the next round. Convergence was considered achieved when at least two replicates in each of the last two rounds reached the same optimum (with the best log-likelihood values within a 0.005 difference). If not, I increased the number of rounds until convergence was reached. I repeated this routine multiple times for each model, starting with random parameters each time, to produce independent versions of the optimum. Comparing these versions should help determine if a global optimum was found.

Now I’m a bit stuck with properly understanding the results, especially the estimated parameters. In specific:

In the attached table, I summarized 10 optimization versions of a 2-epoch 1D model (population increase followed by a decrease). Two versions (v1, v8) reached the optimum with the lowest AIC (142.68), while five versions (v4,v5,v7,v9,v10) reached an optimum with AIC=142.70, with slightly different parameters (see the attached plots for estimation of parameter T1 and nu1 from different versions). I’m wondering, which optimum should be considered global: the one with the lowest AIC or the one supported by more independent versions? Alternatively, should they be considered the same since the difference is minor? If so, is it valid to report a range (like a 95% confidence interval) of the converged replicates of parameter estimates rather than just the single best log-likelihood one? Or instead, are more rounds of optimization needed for more precise convergence (perhaps aiming for a log-likelihood difference of 0.0001 instead of 0.005)?

The global optimum is the one with the lowest likelihood (and thus lowest AIC). I’m not sure what is being reported here in terms of median versus p95. For a single data set, all you want to report is the best maximum likelihood, which is a single parameter set.

I’m uncertain about the concept of “nested models” with which the AIC are comparable. With one same fs, I tested a 1epoch model with linear population decline (optimized AIC = 203.16, with two parameters; see attached plot of comparison between model and data); and a 2epoch model with population increase in T1 followed by a decline in T2 (optimized AIC = 142.7, with four parameters; I have trouble plotting the result because the current population size at convergence is 0). Can I directly compare these results to interpret that “a historical increase before population decline is better supported”, or should I run the 1epoch model in a 2epoch framework, with the same four parameters but with restrictions between them?

Nesting is more relevant for likelihood ratio tests than AIC. (The whole point of AIC is that it is supposed to work with non-nested models.) Using AIC, you would interpret this result to favor the four-parameter model.

The parameter estimations of the 1epoch and 2epoch models from question2 are drastically different: the 1epoch model suggested a population decline from Ne=58,6000 to Ne=290 over just 10k years; while the 2epoch model estimated an initial population increase from Ne=50,000 to Ne=1,000,000 over 70[10-90] k years, followed by a decrease to Ne=0 over ~170[150-200] k years. I’m wondering why this occur (i.e. why is the decrease time in the 2epoch model much longer than in the 1epoch model), and how to deal with the large variation in some parameters’ estimations.

One possibility is that there may be issues with your data / sfs. In the plot, the difference between singletons and doubletons is surprisingly large, and you can see that the model does not fit it well. This could be caused by low sequence depth leading you to miss rare alleles. The two epoch model could be fitting this data bias to arrive at biologically unrealistic results.

We’ve recently implemented some functionality for low-coverage data analysis into dadi and dadi-cli: https://doi.org/10.1101/2024.07.19.604366 .

Lastly, is it possible to find different patterns for each population within the same ADMIXTURE cluster, or is pooling all populations of each cluster into one “dadi population” sufficient?

Admixture clusters can contain substructure within them, and treating a substructured group of populations as a single modeled population can introduce biases. So yes, you could infer different models for your different populations within a cluster.

Best,

Ryan

I appreciate any insights or advice you can provide!

Best regards,
Yuanshu

<REST_mod2_incdec_optimum_1_summary_table.jpg>
<REST_mod2_incdec_optimum_1_actual.nu1_violin_plot.png><REST_mod2_incdec_optimum_1_actual.T1_violin_plot.png><REST_mod1_fs.jpg>

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/ac752ad0-4541-4a7f-b0eb-dc2a712ff74fn%40googlegroups.com.
<REST_mod1_fs.jpg><REST_mod2_incdec_optimum_1_actual.T1_violin_plot.png><REST_mod2_incdec_optimum_1_actual.nu1_violin_plot.png><REST_mod2_incdec_optimum_1_summary_table.jpg>

Reply all

Reply to author

Forward

0 new messages