2 populations model split migration bottlegrowth

459 views
Skip to first unread message

Mathilde SALAMON

unread,
Jan 14, 2022, 3:40:45 PM1/14/22
to dadi-user
Hello Dr. Gutenkunst,

I am new to dadi and I am trying to specify a model for two populations (HA and PG) with an initial split, constant migration, later a bottleneck event followed by exponential recovery in population 2. I used the models for YRI/CEU and the split_mig in Demographics2D to built this custom model. In this model, Ts should be much larger than Tb.

I am not sure if I specified the order of events and parameters correctly:

def split_mig_bottlegrowth(params, ns, pts):
"""
Model of split with migration then instantaneous size change followed by exponential growth in one population
params list is
nu1: Size of population 1 after split.
nu2: Size of population 2 after split.
nu2B: The bottleneck size for pop2
nu2F: The final size for pop2
m: The scaled migration rate
Ts: The time between the split and bottleneck
Tb: The scaled time between the bottleneck in pop 2 and present.

ns = (n1,n2): Size of fs to generate.
pts: Number of points to use in grid for evaluation.
"""
nu1,nu2, nu2B, nu2F, m, Tb, Ts = params
n1,n2 = ns
# Define the grid we'll use
xx = yy = dadi.Numerics.default_grid(pts)

# phi for the equilibrium ancestral population
phi = PhiManip.phi_1D(xx)
# phi for the ancestral population split
phi = PhiManip.phi_1D_to_2D(xx, phi)

# phi for the migration between pop 1 and pop 2 post split
phi = Integration.two_pops(phi, xx, Ts, nu1, nu2, m12=m, m21=m)

# Define a function to describe the bottleneck then exponential growth in population 2
# Use lambda for this function
nu2_func = lambda t: nu2B*(nu2F/nu2B)**(t/Tb)
phi = dadi.Integration.two_pops(phi, xx, Tb, nu1=nu1, nu2=nu2_func,
m12=m, m21=m)

# Finally, calculate the spectrum.
sfs = dadi.Spectrum.from_phi(phi, (n1,n2), (xx,yy))
return sfs

I am planning to compare this model with the nested split_mig model (no bottlegrowth) and a simpler model with split but no migration.

Thank you for your consideration,
Mathilde Salamon
PhD student in Biology
Lab Derry - Ecology and Evolution of aquatic Ecosystems
http://aquaticecoevo.uqam.ca
Université du Québec à Montréal (Canada)
Model_2Dsplit_mig_bottlegrowth.png

Ryan Gutenkunst

unread,
Jan 25, 2022, 4:20:26 PM1/25/22
to dadi-user
Hello Mathilde,

My apologies for the slow reply. Some dadi emails got incorrectly sent to my spam folder.

This looks like a correct implementation of your verbal model.

Best,
Ryan
> <Model_2Dsplit_mig_bottlegrowth.png>
>
> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/933849fe-dec9-4786-a12b-c22674ecc8a3n%40googlegroups.com.
> <Model_2Dsplit_mig_bottlegrowth.png>

Mathilde SALAMON

unread,
Apr 7, 2022, 11:44:37 AM4/7/22
to dadi-user
Hello Ryan,

thank you for the confirmation of my verbal model, it was greatly appreciated.

I was able to run optimizations on a slightly modified model (inequal migration) with dadi.Misc.perturb_params on the log of the parameters, and plot the results. The optimization kept pushing the upper bound for the migration rates, and the best fit has very large migration rates (upper bounds at 90 for mri and 140 for mir). I am also getting some weird results with the plotting: the folded SFS from data looks weird already, and is very different from the model SFS (which I guess looks that way because of the large migration rates).

Do you have an idea what might cause this ? I am working with pooled samples (80 individuals pooled per populations) and I wonder if that could be the problem.
 
Thank you,
Mathilde
Model.png
HA_PG.png
optimization.png

Ryan Gutenkunst

unread,
Apr 7, 2022, 11:46:07 AM4/7/22
to dadi-user
Hello Mathilde,

Those are very high migrations rates, I guess tempered by the extremely small population sizes nu2 you’re inferring.

The data spectrum does look extremely weird. There are very distinct signs that your procedure of pooling and calling SNP frequencies is strongly biased. For example, the lack of low-frequency shared polymorphisms and the strips at particularly frequencies. I wouldn’t trust any model inference until you understand the consequences of your SNP calling strategy well.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/afb5e453-431f-48e1-a021-c738effa62e9n%40googlegroups.com.
> <Model.png><HA_PG.png><optimization.png>

Mathilde SALAMON

unread,
Apr 12, 2022, 12:12:43 PM4/12/22
to dadi-user
Hello Ryan,

thank you for your answer. I tried to fix the issue with the SNP frequencies but I am still getting the same issue with the sfs. I generated it with the SNP data method using the functions:
dd=dadi.Misc.make_data_dict()
pop_ids, ns = ['HA', 'PG'], [80, 80]
fs = dadi.Spectrum.from_data_dict(dd, pop_ids, ns, polarized = False)

Each population has 40 diploid individuals. I looked at the output fs file and it also looks weird. I was wondering if I made a mistake in the SNP data input: since we don't know the ancestral state of the alleles (we don't have a good outgroup), it means that the spectrum should be folded. Does that also mean that the Allele 1 in the SNP data format should be the minor allele ?

Here is an extract of my SNP data input:
REF ALT Allele1 HA PG Allele2 HA PG LOCUS
-A- -G- A 40 35 G 40 45 1.128
-T- -C- T 75 66 C 5 14 8.12
-G- -T- G 54 70 T 26 10 18.1655
-C- -T- C 56 65 T 24 15 18.2766
-C- -A- C 80 72 A 0 8 18.65
-C- -T- C 40 80 T 40 0 20.1389
-C- -T- C 80 80 T 0 0 20.51
-T- -C- T 27 80 C 53 0 20.782
-C- -T- C 1 11 T 79 69 23.63

Thank you for your help,
Best,
Mathilde
PG_HA.fs

Ryan Gutenkunst

unread,
Apr 12, 2022, 6:50:50 PM4/12/22
to dadi-user
Hello Mathilde,

Dadi does not care which allele is specified as allele 1 or allele 2. I suspect the issue with SNP frequencies arises earlier in your calling pipeline.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/5fd0bced-7263-4188-9477-ae6b5f897b8dn%40googlegroups.com.
> <PG_HA.fs>

Mathilde SALAMON

unread,
Apr 15, 2022, 11:51:21 AM4/15/22
to dadi-user
Hello Ryan,

thank you for your answer, that's what I thought. I am rerunning my calling pipeline and hopefully will get better results.

Best,
Mathilde

Mathilde SALAMON

unread,
May 3, 2022, 10:32:32 AM5/3/22
to dadi-user
Hello Ryan,

I rerun my SNP calling pipeline using a different method, and it seems that the data fs look a little bit better (presence of low-frequency shared polymorphisms), but the strips at particular frequencies are still there. I am really not sure what could cause this. I am also getting some errors at the beginning of the optimization (using the dadi.Inference.optimize_log function):

Beginning optimization ************************************************
8       , -496810     , array([ 0.0901689  ,  32.2392    ,  0.00901311 ,  4.16508    ,  9.65673    ,  4.00496    ,  1.79129    ,  0.0286078  ])
16      , -423761     , array([ 0.190932   ,  32.5291    ,  0.0107814  ,  4.28896    ,  16.5152    ,  4.28416    ,  1.80309    ,  0.0198988  ])
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
381.0 372593.0
381.0 372593.0
381.0 372593.0
381.0 372593.0
381.0 372593.0
381.0 372593.0
24      , -796464     , array([ 0.204067   ,  37.4219    ,  0.0550363  ,  7.51519    ,  6.7188     ,  10.6923    ,  2.27205    ,  0.00154578 ])
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.
WARNING:Inference:Model is masked in some entries where data is not.
WARNING:Inference:Number of affected entries is 77. Sum of data in those entries is 381:
381.0 372593.0
381.0 372593.0
381.0 372593.0
32      , -420012     , array([ 0.191864   ,  32.8642    ,  0.0121467  ,  4.47304    ,  15.4627    ,  4.57631    ,  1.83383    ,  0.0165068  ])
40      , -420654     , array([ 0.240726   ,  34.8134    ,  0.00845121 ,  4.82029    ,  11.7067    ,  7.17143    ,  2.38726    ,  0.0142131  ])
48      , -419284     , array([ 0.206221   ,  33.5056    ,  0.0108196  ,  4.57757    ,  14.1528    ,  5.27923    ,  1.99431    ,  0.0157397  ])
56      , -420551     , array([ 0.219414   ,  34.7422    ,  0.00917865 ,  4.84808    ,  12.9347    ,  5.08161    ,  2.02714    ,  0.014058   ])
64      , -419344     , array([ 0.208277   ,  33.6754    ,  0.0105341  ,  4.6205     ,  13.9473    ,  5.24659    ,  1.99961    ,  0.0154532  ])
72      , -419323     , array([ 0.208277   ,  33.6754    ,  0.0105341  ,  4.6205     ,  13.9473    ,  5.24659    ,  1.99961    ,  0.0154686  ])
80      , -419305     , array([ 0.207062   ,  33.5553    ,  0.0107016  ,  4.59512    ,  14.0682    ,  5.26582    ,  1.99848    ,  0.0156216  ])
88      , -419302     , array([ 0.20638    ,  33.4878    ,  0.0107972  ,  4.58089    ,  14.1368    ,  5.28197    ,  1.99472    ,  0.0157173  ])
96      , -419304     , array([ 0.206238   ,  33.4737    ,  0.0108172  ,  4.57792    ,  14.1653    ,  5.27897    ,  1.99436    ,  0.0157373  ])
104     , -419303     , array([ 0.206332   ,  33.483     ,  0.010804   ,  4.58446    ,  14.1417    ,  5.27747    ,  1.9946     ,  0.0157241  ])
112     , -419287     , array([ 0.206248   ,  33.4748    ,  0.0108266  ,  4.57813    ,  14.1501    ,  5.2788     ,  1.99438    ,  0.0157359  ])
120     , -419306     , array([ 0.206303   ,  33.5137    ,  0.010808   ,  4.57928    ,  14.1445    ,  5.27792    ,  1.99452    ,  0.0157281  ])
128     , -419303     , array([ 0.206457   ,  33.475     ,  0.0108154  ,  4.57819    ,  14.1498    ,  5.27876    ,  1.99439    ,  0.0157355  ])
136     , -419289     , array([ 0.206286   ,  33.4785    ,  0.0108105  ,  4.57891    ,  14.1463    ,  5.2782     ,  1.99448    ,  0.0157306  ])
144     , -419279     , array([ 0.206286   ,  33.4785    ,  0.0108105  ,  4.57891    ,  14.1463    ,  5.2782     ,  1.99448    ,  0.0157463  ])
152     , -419282     , array([ 0.206254   ,  33.4753    ,  0.010815   ,  4.57825    ,  14.1496    ,  5.27871    ,  1.99639    ,  0.0157351  ])
160     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.28366    ,  1.99445    ,  0.0157321  ])
168     , -419298     , array([ 0.206268   ,  33.48      ,  0.0108095  ,  4.57946    ,  14.1619    ,  5.27842    ,  1.99451    ,  0.0157296  ])
176     , -419282     , array([ 0.206274   ,  33.4776    ,  0.0108118  ,  4.58332    ,  14.1475    ,  5.27838    ,  1.99446    ,  0.0157319  ])
184     , -419291     , array([ 0.206275   ,  33.4774    ,  0.0108228  ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
192     , -419284     , array([ 0.206275   ,  33.5109    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
200     , -419305     , array([ 0.206481   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
208     , -419285     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
216     , -419280     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157479  ])
224     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99645    ,  0.0157321  ])
232     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.28366    ,  1.99445    ,  0.0157321  ])
240     , -419300     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1616    ,  5.27838    ,  1.99445    ,  0.0157321  ])
248     , -419282     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.58327    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
256     , -419291     , array([ 0.206275   ,  33.4774    ,  0.0108228  ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
264     , -419284     , array([ 0.206275   ,  33.5109    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
272     , -419305     , array([ 0.206481   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
280     , -419285     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
288     , -419280     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157479  ])
296     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99645    ,  0.0157321  ])
304     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.28366    ,  1.99445    ,  0.0157321  ])
312     , -419300     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1616    ,  5.27838    ,  1.99445    ,  0.0157321  ])
320     , -419282     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.58327    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
328     , -419291     , array([ 0.206275   ,  33.4774    ,  0.0108228  ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
336     , -419284     , array([ 0.206275   ,  33.5109    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
344     , -419305     , array([ 0.206481   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
352     , -419285     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
360     , -419280     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157479  ])
368     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99645    ,  0.0157321  ])
376     , -419284     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.28366    ,  1.99445    ,  0.0157321  ])
384     , -419300     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1616    ,  5.27838    ,  1.99445    ,  0.0157321  ])
392     , -419282     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.58327    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
400     , -419291     , array([ 0.206275   ,  33.4774    ,  0.0108228  ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
408     , -419284     , array([ 0.206275   ,  33.5109    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
416     , -419305     , array([ 0.206481   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
424     , -419285     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
432     , -419280     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157479  ])
440     , -419285     , array([ 0.206275   ,  33.4774    ,  0.010812   ,  4.57869    ,  14.1474    ,  5.27838    ,  1.99445    ,  0.0157321  ])
Finished optimization **************************************************

At least this time the optimization does not seem to be pushing the bounds of parameters, and the migration rates make more sense.

Thank you,
Best,
Mathilde

Mathilde SALAMON

unread,
May 3, 2022, 1:28:50 PM5/3/22
to dadi-user
I forgot to attach the fs
HA_PG.png

Mathilde SALAMON

unread,
May 6, 2022, 10:56:09 AM5/6/22
to dadi-user
Hello Ryan,

I tried to change the way I prepared the dataset for dadi. Initially I was thinning my dataset (1 every 50 SNPs, ~1 SNP per kb) and I also had an issue with duplicate names for SNPs when I imported the file, which ended up deleted a bunch of SNPs. So I retried the optimization without thinning (I saw in one of the post of the dadi-user group that it was not necessary for neutral models), fixing the duplicate issue and removing the uninformative snps. It seems that this change improved the SFS but there are still some strips at given frequencies. I see two additional options that could bias the SFS in this way: 1) the filtering procedure for the pool-seq data (minimum read count, minimum coverage per pool, max coverage per pool), 2) the procedure to transform the allele frequency data into the input format for dadi with the dadi_input_pool function from the genomalicious package, which convert allele frequency data into read counts based on the number of individuals per pool, rounder up or down based on the pool size, starting with the reference allele: https://rdrr.io/github/j-a-thia/genomalicious/man/dadi_inputs_pools.html

I also tried to remove or not SNPs putatively under selection based on genome scans, but this didn't really change the pattern in the SFS.

This is the optimization output I got with the attached SFS (first round of optimization):

## First optimization ##
# Parameters are: (nu1, nu2, nu2B, nu2F, mri, mir, Ts, Tb)
# upper_bound = [200, 200, 200, 200, 20, 20, 10, 10]
# lower_bound = [1e-2, 1e-2, 1e-3, 1e-2, 0, 0, 0, 0]
# p0 = [0.20,33,0.01, 5,14,5,2,0.02]

Beginning optimization ************************************************
8       , -8.98758e+06, array([ 0.14979    ,  34.6859    ,  0.0132009  ,  8.23115    ,  19.8       ,  5.89777    ,  1.70246    ,  0.014394   ])
24      , -8.97079e+06, array([ 0.149597   ,  34.6765    ,  0.0131538  ,  8.22403    ,  19.8423    ,  5.89736    ,  1.70308    ,  0.0144553  ])
32      , -8.93616e+06, array([ 0.149023   ,  34.6485    ,  0.0130127  ,  8.2107     ,  19.7281    ,  5.89562    ,  1.71012    ,  0.0145197  ])
40      , -8.806e+06  , array([ 0.146751   ,  34.5368    ,  0.0124759  ,  8.11689    ,  19.3549    ,  5.88867    ,  1.73858    ,  0.0147805  ])
48      , -8.4989e+06 , array([ 0.153787   ,  34.2516    ,  0.0105902  ,  7.75837    ,  19.6094    ,  5.51377    ,  1.86106    ,  0.0145775  ])
64      , -8.49805e+06, array([ 0.153794   ,  34.2139    ,  0.0105881  ,  7.75744    ,  19.6122    ,  5.51455    ,  1.86234    ,  0.0145754  ])
72      , -8.49629e+06, array([ 0.153794   ,  34.2139    ,  0.0105881  ,  7.75744    ,  19.6122    ,  5.51455    ,  1.86234    ,  0.01459    ])
80      , -8.49656e+06, array([ 0.153802   ,  34.2104    ,  0.010586   ,  7.75651    ,  19.6149    ,  5.51532    ,  1.86548    ,  0.0145734  ])
96      , -8.49689e+06, array([ 0.153804   ,  34.2096    ,  0.0105855  ,  7.75628    ,  19.6353    ,  5.51552    ,  1.86393    ,  0.0145729  ])
104     , -8.4974e+06 , array([ 0.153806   ,  34.2087    ,  0.0105849  ,  7.7638     ,  19.6163    ,  5.51571    ,  1.86425    ,  0.0145723  ])
120     , -8.4971e+06 , array([ 0.153806   ,  34.2427    ,  0.0105848  ,  7.75599    ,  19.6165    ,  5.51576    ,  1.86433    ,  0.0145722  ])
128     , -8.4976e+06 , array([ 0.15396    ,  34.2083    ,  0.0105847  ,  7.75593    ,  19.6167    ,  5.51581    ,  1.86441    ,  0.0145721  ])
136     , -8.41882e+06, array([ 0.154641   ,  33.8239    ,  0.0103515  ,  7.65321    ,  19.9255    ,  5.60313    ,  2.01224    ,  0.0143441  ])
144     , -8.41698e+06, array([ 0.154641   ,  33.8239    ,  0.0103515  ,  7.65321    ,  19.9255    ,  5.60313    ,  2.01224    ,  0.0143585  ])
152     , -8.418e+06  , array([ 0.154641   ,  33.8239    ,  0.0103515  ,  7.6532     ,  19.9256    ,  5.60314    ,  2.01428    ,  0.0143441  ])
160     , -8.41874e+06, array([ 0.154642   ,  33.8238    ,  0.0103514  ,  7.65318    ,  19.9256    ,  5.60876    ,  2.01228    ,  0.0143441  ])
176     , -8.41919e+06, array([ 0.154642   ,  33.8238    ,  0.0103514  ,  7.66084    ,  19.9256    ,  5.60315    ,  2.01229    ,  0.014344   ])
184     , -8.42035e+06, array([ 0.154642   ,  33.8238    ,  0.0103618  ,  7.65317    ,  19.9256    ,  5.60316    ,  2.01231    ,  0.014344   ])
200     , -8.41948e+06, array([ 0.154796   ,  33.8238    ,  0.0103514  ,  7.65317    ,  19.9256    ,  5.60317    ,  2.01231    ,  0.014344   ])
208     , -8.41878e+06, array([ 0.154642   ,  33.8238    ,  0.0103514  ,  7.65317    ,  19.9257    ,  5.60317    ,  2.01231    ,  0.014344   ])
216     , -8.41694e+06, array([ 0.154642   ,  33.8238    ,  0.0103514  ,  7.65317    ,  19.9257    ,  5.60317    ,  2.01231    ,  0.0143584  ])
224     , -8.40822e+06, array([ 0.154746   ,  33.7761    ,  0.0103226  ,  7.64043    ,  19.9646    ,  5.61417    ,  2.03362    ,  0.0143158  ])
232     , -8.40898e+06, array([ 0.154746   ,  33.7761    ,  0.0103226  ,  7.64043    ,  19.9646    ,  5.61979    ,  2.03158    ,  0.0143158  ])
240     , -8.40862e+06, array([ 0.154746   ,  33.7761    ,  0.0103226  ,  7.64043    ,  19.9845    ,  5.61417    ,  2.03158    ,  0.0143158  ])
248     , -8.40453e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.64171    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
256     , -8.41059e+06, array([ 0.154746   ,  33.7761    ,  0.010333   ,  7.64043    ,  19.9646    ,  5.61417    ,  2.03159    ,  0.0143158  ])
264     , -8.40591e+06, array([ 0.154781   ,  33.7939    ,  0.0103131  ,  7.63619    ,  19.9776    ,  5.61784    ,  2.03805    ,  0.0143064  ])
272     , -8.40538e+06, array([ 0.154948   ,  33.7549    ,  0.0103099  ,  7.63478    ,  19.9819    ,  5.61907    ,  2.04021    ,  0.0143033  ])
280     , -8.40425e+06, array([ 0.154797   ,  33.7528    ,  0.0103086  ,  7.63422    ,  19.9836    ,  5.61955    ,  2.04106    ,  0.014302   ])
288     , -8.40241e+06, array([ 0.154797   ,  33.7528    ,  0.0103086  ,  7.63422    ,  19.9836    ,  5.61955    ,  2.04106    ,  0.0143163  ])
296     , -8.40336e+06, array([ 0.154798   ,  33.7523    ,  0.0103083  ,  7.6341     ,  19.984     ,  5.61965    ,  2.04328    ,  0.0143018  ])
304     , -8.40411e+06, array([ 0.154799   ,  33.7523    ,  0.0103083  ,  7.63408    ,  19.984     ,  5.6253     ,  2.04128    ,  0.0143017  ])
320     , -8.40453e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.64171    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
328     , -8.40571e+06, array([ 0.154799   ,  33.7522    ,  0.0103186  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
336     , -8.40429e+06, array([ 0.154799   ,  33.786     ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
344     , -8.40484e+06, array([ 0.154954   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
352     , -8.40414e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
360     , -8.40229e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.014316   ])
368     , -8.40334e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04333    ,  0.0143017  ])
376     , -8.4041e+06 , array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.6253     ,  2.04129    ,  0.0143017  ])
392     , -8.40453e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.64171    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
400     , -8.40571e+06, array([ 0.154799   ,  33.7522    ,  0.0103186  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
408     , -8.40429e+06, array([ 0.154799   ,  33.786     ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
416     , -8.40484e+06, array([ 0.154954   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
424     , -8.40414e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.0143017  ])
432     , -8.40229e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04129    ,  0.014316   ])
440     , -8.40334e+06, array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.61968    ,  2.04333    ,  0.0143017  ])
448     , -8.4041e+06 , array([ 0.154799   ,  33.7522    ,  0.0103083  ,  7.63407    ,  19.9841    ,  5.6253     ,  2.04129    ,  0.0143017  ])
464     , -8.42105e+06, array([ 0.15478    ,  33.8314    ,  0.010356   ,  7.65522    ,  19.9194    ,  5.6014     ,  2.00923    ,  0.0143486  ])
480     , -8.39994e+06, array([ 0.154835   ,  33.7356    ,  0.0102983  ,  7.62965    ,  19.9976    ,  5.62352    ,  2.05013    ,  0.0142919  ])
Finished optimization **************************************************
Best-fit parameters: [1.53786835e-01 3.42173650e+01 1.05902177e-02 7.75836584e+00
 1.96094402e+01 5.51376621e+00 1.86106012e+00 1.45775061e-02]
Maximum log composite likelihood: -8498746.500881964
Optimal value of theta: 897470.0361729742

Do you think one or both of the options I mentioned could be causing the bias in the SFS ?

Thank you,
Mathilde
HA_PG.png

Mathilde SALAMON

unread,
May 11, 2022, 10:03:15 AM5/11/22
to dadi-user
Hello Ryan,

I was finally able to solve the issue that was causing the strips at particular frequencies in the SFS, which came from the file transformation from frequency data to dadi input format with the genomalicious package.

There are two methods proposed to infer the SFS based on allele frequencies: 1) rounding the allele counts to the nearest integer based on the number of chromosomes methodSFS=='counts' which is the one I initially used, 2) getting the read counts for the ref allele based on a binomial draw (n = 1,size = number of chromosomes in pool, prob = frequency of ref allele), then substracting the allele counts for the ref allele from the number of chromosomes in pool to obtain the allele count of the alternative allele.

The first method was creating a bias in the SFS because of the rounding.

Thank you for all your help,
Best,
Mathilde
HA_PG.png

Ryan Gutenkunst

unread,
May 11, 2022, 4:03:47 PM5/11/22
to dadi-user
Hello Mathilde,

Thanks of the update. That’s a subtle issue and I’ve glad you found it. We may be analyzing some pool-seq data ourselves in the near future, so we’ll have to the on the lookout for that.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/d96e0b03-7993-412b-845b-2de69500ec32n%40googlegroups.com.
> <HA_PG.png>

Mathilde SALAMON

unread,
Jul 4, 2022, 5:07:07 PM7/4/22
to dadi-user
Hello Ryan,

despite being able to fix the previous error with the generation of the sfs, I am still having issues with the parameter inference. The plots of the SFS look weird, with strips of positive residuals at low frequencies. I have tried several approach to filter my dataset, but without much success so far.

For the coverage filter, I used it because the estimator of allele frequencies from pool seq data can be biased, with the variance in allele frequencies depending on the coverage and size of the pool (eq. 4 in Gautier et al., Mol Ecol, 2013 https://pubmed.ncbi.nlm.nih.gov/23730833/). I tried to fix this issue by following the example in https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.13765 (only study using dadi with pool-seq data that I found) and filtered the SNPs within the 1st and 3rd quartiles of the coverage distribution within populations (between 9X and 18X), with the goal of obtaining relatively homogeneous coverage between the two pools (the size of the pool is equal between the 2 samples).

Here are my results:
Test 5:

Dataset filtering: removed outliers from genome scans + fixed/lost alleles in PG or HA (removed all SNPs that were fixed/lost in one pop instead of uninformative SNPs ==> error)

Nb SNPs: 6,657,545

Log likelihood: -2,117,809

Inferred theta: 189,091

Test 6:

Dataset filtering: removed outliers from genome scans + fixed/lost alleles in PG or HA (error) + filter for homogenous coverage SNPs within the 1st and 3rd quartile of coverage distribution

Nb SNPs: 1,588,634

Log likelihood: -554,805

Inferred theta: 41,882


Test 7: 

Dataset filtering: removed outliers from genome scans + uninformative SNPs (fixed/lost alleles in PG and HA) + filter for homogenous coverage SNPs within the 1st and 3rd quartile of coverage distribution

Nb SNPs: 4,768,169

Log likelihood: NA

Inferred theta: NA

Test 8:

Dataset filtering: Changed minimum minor allele frequency threshold from 0.0125 to 0.08 to allow for singletons + removed outliers from genomic scans + uninformative SNPs (fixed/lost alleles in PG and HA) + filter for homogenous coverage SNPs within the 1st and 3rd quartile of coverage distributio

Nb SNPs: 868,873 (thinned)

Log likelihood: -250,113 (before optimization)

Inferred theta: 12,469 (before optimization)

For the test #8 I converted the parameters (optimization on the log scale) and the results are really off: we don't really know the size of population PG after split but it is way too large, same for the time of the population split, and the time of the potential bottleneck is know (should be ~15 years/generations).

Based on these results, my intuition would be to try to remove the coverage filter but after so many lengthy trials and errors (each optimization run takes 3-5 days), I would be very grateful to know your opinion on this.

Thank you and best wishes, Mathilde


parameter_conversion_test8_thinned.png
HA_PG_test7_opt1.png
HA_PG_test6_opt1.png
HA_PG_test5_opt3.png
test8__opt1_full_HA_PG.png

Ryan Gutenkunst

unread,
Jul 8, 2022, 6:52:35 PM7/8/22
to dadi...@googlegroups.com
Hello Mathilde,

Hmm… The models are definitely still struggling to deal with the low frequency alleles. When you say you remove alleles, are you also masking those entries in the site frequency spectrum you feed to dadi? If not, then the model will distort itself trying to adjust for that removal. It’s a hard case… If you don’t have confidence in the calling of rare alleles, then you might just need to mask those entries through the entire spectrum.

When you optimize on the log scale, the reported parameter values will still be on the linear scale.

You can reduce your computation time by projecting your spectra downward. That will at least help with exploring potential scenarios. It will also reduce the effect of low-frequency variants being poorly called.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/b0a1b306-3c8e-4fd7-9fdf-1e4f57bdebd0n%40googlegroups.com.
> <parameter_conversion_test8_thinned.png><HA_PG_test7_opt1.png><HA_PG_test6_opt1.png><HA_PG_test5_opt3.png><test8__opt1_full_HA_PG.png>

Mathilde SALAMON

unread,
Jul 11, 2022, 5:02:01 PM7/11/22
to dadi-user
Hello Ryan,

thank you for your answer.

I did not mask entries in the site frequency spectrum, maybe that would be a solution. I removed SNPs that are 1) outliers in genome scans, 2) uninformative (fixed or lost in both populations) and 3) that are not between the 1st and 3rd quartiles of coverage distribution. I believe the last filter might be the issue so I am going to do another trial run removing this filter and checking the correspondance of coverage per SNPs between the two populations to see if there are large differences. I will do this by projecting the spectra downward as you suggested.

For the parameters conversion, I had not understood that the output values were on the linear scale, I thought a first exponential conversion was necessary. Without exponentiating the parameters, the values are starting to make much more sense, except maybe for the time of the split and bottleneck.

I will keep you updated on the results.

Best,
Mathilde
Reply all
Reply to author
Forward
0 new messages