Running dadi using variant-only vcf as input?

Isaac Linn

Feb 19, 2026, 5:32:15 PM

to dadi-user

Sorry if this is a simple question, but I haven't found a clear answer so far.

I'm going through the process of running dadi on a population dataset for which I have a SNP-only VCF. I can go through the process of variant calling again, but at present, I don't have access to a VCF with variant and invariant sites. I'm wondering whether I can use dadi with this input only, whether I need to pad the dataset with monomorphic sites, or whether it's not an issue. I've tried running dadi so far and population size estimates look way low when I use L=length of callable input sequence, but when I use L= number of sites in VCF (again, mostly polymorphic) the population size estimates are a reasonable order of magnitude (compared with nucleotide diversity).

I think I understand that the monomorphic sites would not make it into the allele frequency spectrum, but I'm concerned that it would impact the underlying model or estimation of theta.

Thanks,

Isaac

Ryan Gutenkunst

Feb 20, 2026, 12:32:45 PM

to dadi...@googlegroups.com

Hello Isaac,

Yes, dadi will work fine with a variant-only vcf; it will not impact the model or estimation of theta compared to an all-sites VCF.

Estimating L is sometimes tricky, depending on how the sequencing was done. It may be that certain regions were uncallable or masked (for example, repetitive regions), so L is rarely the full genome size. Setting L equal to the number of variant sites is a mistake.
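As a rough illustration of why the choice of L matters: theta = 4 * Nref * mu * L, so for a fixed fitted theta the implied Nref is inversely proportional to L. The sketch below uses hypothetical numbers (not values from this thread), and `nref_from_theta` is a made-up helper name, not a dadi function.

```python
def nref_from_theta(theta, mu, L):
    """Ancestral size implied by theta = 4 * Nref * mu * L."""
    return theta / (4.0 * mu * L)

# Hypothetical values, chosen only for illustration.
theta = 100_000.0            # fitted theta for the whole spectrum
mu = 1e-8                    # per-site, per-generation mutation rate
L_callable = 1_000_000_000   # sites that could have been called (variant or not)
L_variants = 50_000_000      # number of variant sites only

nref_callable = nref_from_theta(theta, mu, L_callable)
nref_variants = nref_from_theta(theta, mu, L_variants)

# Using the variant count as L inflates Nref by L_callable / L_variants.
print(nref_callable, nref_variants, nref_variants / nref_callable)
```

A larger-looking Nref from the variant-count L is therefore an artifact of dividing by too small a number of sites, not evidence that it is the right L.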

Best,

Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/1fa08eda-b25d-4c9d-a319-e50cbfe1b6dan%40googlegroups.com.

Isaac Linn

Mar 9, 2026, 5:16:06 PM

to dadi-user

Hi Ryan,

Thank you for your response. That aligns with my expectations, but it seems like there's something wrong in my pipeline. I know that the models I'm using aren't completely matching the data, but I also know that the parameters estimated are orders of magnitude off.

I'm using variant data from 59 individuals (30 and 29) stored in a .vcf, which I project to a SFS of 2n-2, which could be smaller. My base genome (autosomal) has 1.98 Gb mappable sites, and I called variants from ~9x coverage data. Filtering reduced 130,436,888 variants to 66,067,163 sites. I do not have an invariant VCF as it's whole-genome and that file size is unwieldy, but as a base approximation I could assume that 51% of the initial sequence length is my final sequence length (1Gb).

When I look at my model outputs, the population sizes and time intervals look way low. I assume that this is related to how I'm scaling it, but it could be that the model is a poor fit, or that I'm making an error in the process somewhere. But, for my best fit replicates, I'm getting nref=10000, npops=(500,2500), split_T~2700-5600. Some replicates have further split times, but the population sizes are all pretty small across all replicates and models. Nucleotide diversity for the populations is around (0.23%,0.21%) which is a couple of orders of magnitude off what I'd estimate using dadi outputs (.0007 - .004 %).
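The diversity comparison above can be sketched with the neutral equilibrium expectation pi ~ 4 * N * mu. This is a minimal check, assuming the mutation rate of 4e-9 quoted in the calculation later in this post and the rounded population sizes given above; the helper names are hypothetical.

```python
MU = 4e-9  # per-site, per-generation mutation rate used later in this post

def expected_pi(n_e, mu=MU):
    """Expected nucleotide diversity at neutral equilibrium."""
    return 4.0 * n_e * mu

def implied_ne(pi, mu=MU):
    """Effective size implied by an observed diversity."""
    return pi / (4.0 * mu)

# Diversity implied by the inferred sizes (about 500 and 2500), in percent.
pi_from_dadi = [100.0 * expected_pi(n) for n in (500, 2500)]

# Effective sizes implied by the observed diversities (0.23% and 0.21%).
ne_from_pi = [implied_ne(p) for p in (0.0023, 0.0021)]

print(pi_from_dadi)
print(ne_from_pi)
```

The two directions of the calculation make the size of the mismatch explicit: the observed diversity implies effective sizes above 100,000, while the fitted sizes imply diversity well below 0.01%.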

I'm wondering what could be messing with this inference. Obviously, a good estimate for L is important, but it could be other factors; for instance, there is some population structure. I'm considering downprojecting a bit further and increasing the maxiter, and possibly adding a bottleneck before split time, but I don't want an overly complex model. I know that the model itself isn't matching the data, but I haven't figured out exactly what needs to be done. I've also attached the model sfs and residuals for some of the best-fit data.

Additional information:

I'm estimating a split-with-migration, so I have models with and without migration, with a linear and exponential size change, and two phases to approximate secondary contact. I don't think the details of the models are atypical, but I could provide them.

Population sizes: start at 2, from 1e-3 to 100

migration rates: start at 0.2, from 1e-3 to 30

Time intervals: start at 1, from 1e-2 to 15

and I perturb_params with fold=1 before inference with maxiter= 200, and pts [40,50,60]
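For readers unfamiliar with the perturbation step, here is a rough pure-Python sketch of the idea: each starting value is multiplied by a random factor of up to 2**fold in either direction and kept inside the bounds. This is not dadi's code (dadi.Misc.perturb_params should be used in practice); the function name `perturb` and the exact way values are nudged back inside the bounds are assumptions made for illustration.

```python
import random

def perturb(params, fold=1, lower=None, upper=None, rng=random):
    """Randomly scale each parameter by up to 2**fold, up or down."""
    out = []
    for i, p in enumerate(params):
        new = p * 2 ** (fold * (2 * rng.random() - 1))
        # Assumed handling of bounds: nudge values just inside the limits.
        if lower is not None and new <= lower[i]:
            new = lower[i] * 1.01
        if upper is not None and new >= upper[i]:
            new = upper[i] * 0.99
        out.append(new)
    return out

# One parameter of each kind, with the starting values and bounds listed above:
# population size, migration rate, time interval.
p0 = [2.0, 0.2, 1.0]
lower = [1e-3, 1e-3, 1e-2]
upper = [100.0, 30.0, 15.0]

random.seed(0)
start = perturb(p0, fold=1, lower=lower, upper=upper)
print(start)
```

With fold=1, every replicate starts within a factor of two of the nominal starting point, so replicates explore only a modest neighbourhood of parameter space.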

Example calculation with model:

# Excerpt from the model function: exponential size change from nuXA to nuXB over time TA
def nu1_func(t):
    return nu1A * (nu1B / nu1A) ** (t / TA)
def nu2_func(t):
    return nu2A * (nu2B / nu2A) ** (t / TA)
phi = Integration.two_pops(phi, xx, TA, nu1_func, nu2_func, m12=m12, m21=m21)

Outputs:

loglik -13531.50 theta 169586.1

nu1A 1.9405859 nu2A 0.04918462

nu1B 0.05179008 nu2B 0.2568233

TA 0.1520766

m12 0.2254417 m21 0.7371130
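As a sanity check on the direction of the size changes, the exponential size functions from the model can be evaluated at their endpoints with the parameters reported above. This is a standalone sketch that assumes t runs from 0 at the split to TA at the present, as in dadi's forward-in-time integration.

```python
# Best-fit parameters reported above.
nu1A, nu1B = 1.9405859, 0.05179008
nu2A, nu2B = 0.04918462, 0.2568233
TA = 0.1520766

def nu1_func(t):
    return nu1A * (nu1B / nu1A) ** (t / TA)

def nu2_func(t):
    return nu2A * (nu2B / nu2A) ** (t / TA)

# Relative sizes just after the split (t = 0) and at the present (t = TA).
for name, f in (("pop 1", nu1_func), ("pop 2", nu2_func)):
    print(name, f(0.0), f(TA / 2), f(TA))
```

With these values population 1 shrinks from about 1.94 to about 0.05 times the ancestral size, while population 2 grows from about 0.05 to about 0.26.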

Calculation:

sumL= 1985216215 * (66067163 )/130436888 = 1005525395

mu=4*(10^-9)

nref=theta / (4* sumL * mu) = 10539

n1B=nu1B*nref =457

n2B=nu2B*nref =2568

splittime=TA*2*nref =3210

Attachments:
m_200_r_50_Mountain_South_split_exponential_mig.png
m_200_r_50_Coast_South_split_linear_mig.png
m_200_r_50_Mountain_South_split_linear_mig.png
m_200_r_50_Coast_South_split_exponential_mig.png

Ryan Gutenkunst

Mar 23, 2026, 5:08:47 PM

to dadi-user

Hello Isaac,

My apologies for the slow reply. I don’t see any obvious red flags with the approach or calculations you’ve laid out below. 9x is on the verge where the low-coverage correction could be useful, but it shouldn’t be causing distortions of the magnitude you’re reporting. What are your expectations for split times?

The residual plots don’t look amazing, but they aren’t horrible. It’s useful to see them with the full range plotted, in addition to the restricted -3 to 3 range.

If you’re still thinking about this, I’m happy to iterate.

Best,

Ryan

Isaac Linn

Mar 31, 2026, 7:03:26 PM

to dadi-user

Hi Ryan,

Thanks for your help.

I'm expecting split times on the order of magnitude of 50kya if not longer. Populations are about 0.6% diverged, and heterozygosity is about 0.2% within populations; factoring in mammal mutation rate, a split time of roughly 400kya is conceivable. But, with those numbers, theta should be higher.

Dadi has been run on a dataset from this species before (https://doi.org/10.1002/ece3.4129), which estimated a split time between clades at 51kya. In my dataset, I have sampling across a broader range, and whole-genome resequencing instead of ddRAD data, so it's a little different, but I anticipate 51k years will be accurate or low.

One thing I noticed for these models is that the best-fit models have growth for one population (south) but shrinkage for the other population (coast or mountain). This violates my understanding of the biology, so I also intend to initialize them with population growth explicit (instead of initializing n1A and n1B at the same values). I haven't set those up to run yet, though, in case my overall method is flawed.

I've attached residual plots with no restrictions, and with a limit at 20.

Any help is appreciated; it's helpful to know that the assumptions about L are not the cause of this.

Cheers,

Isaac

Attachments:
Coast_South_split_exponential_mig_rep11.png
Mountain_South_split_exponential_mig_rep26.png
Coast_South_split_linear_mig_rep12_resid20.png
Coast_South_split_exponential_mig_rep11_resid20.png
Mountain_South_split_linear_mig_rep26_resid20.png
Mountain_South_split_exponential_mig_rep26_resid20.png
Coast_South_split_linear_mig_rep12.png
Mountain_South_split_linear_mig_rep26.png

Ryan Gutenkunst

Apr 2, 2026, 7:38:27 PM

to dadi...@googlegroups.com

Hello Isaac,

If I’m reading the old paper correctly (and quickly), your ancestral population size is very similar to theirs, so that suggests it’s not a scaling issue. What do the single population models look like, in terms of times and directions of size changes?

Best,

Ryan

Isaac Linn

Apr 8, 2026, 5:17:31 PM

to dadi-user

Hi Ryan,

Thanks for your help and time, I greatly appreciate it.

I hadn't actually run single population models, which was an oversight. Also, I hadn't noticed the similar ancestral pop sizes! Yeah, nref is on a similar order of magnitude.

After running single-population models and a coalescent model (smc++), it looks like the "mountain" population has more likely had a decrease in population size over time, so a bottleneck and recovery is less appropriate for it. The coast and south populations (which better match the old paper) both have best models with an ancestral reduction in population size ~30k generations ago, followed by a population size increase over time. I may edit the models for that population later, but for now I'm focusing on the simpler case, in case I am making errors in my general usage of dadi.

In general, it looks like numerically, the two-pop model has similar parameters, but lower theta by an order of magnitude, and larger pop sizes at bottleneck.

The present-day effective population sizes look more realistic for single-population models, and better match previous dadi results. For instance, current south population size is estimated at 23.8k in the 2018 paper, 26.3k with the single population model, as opposed to 2.2k in the two-population model.

I'm not sure if there's something in the method that's way off, or if I just need to have better parameterization of migration rate? Perhaps I should code in the population size changes explicitly (with independent bottlenecks and parameters set), and just define free parameters for the timeline and rate of migration?

Any advice is appreciated, thank you again,

Isaac

Details for "Coast" and "South" populations:

I ran single-population models with a bottleneck or period of bottleneck followed by exponential growth, linear growth, or instantaneous growth. Both had a best model of instantaneous bottleneck followed by linear population growth:

from dadi import Numerics, PhiManip, Integration, Spectrum

def instant_linear_change(params, ns, pts):
    # Instant change from nref to nuA, which then changes linearly to nuB
    # over time TB until the present.
    nuA, nuB, TB = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    # Linear size trajectory: nuA at t=0, nuB at t=TB.
    def nu_func(t):
        return nuA + (nuB - nuA) * (t / TB)
    phi = Integration.one_pop(phi, xx, TB, nu_func)
    fs = Spectrum.from_phi(phi, ns, (xx,))
    return fs

I also ran two-population models with a split followed by migration (or not) with constant population size, two epochs, exponential size change and linear size change. For the Coast-South comparison, the best model was a split followed by a linear change in population size over time, with continuous migration:

from dadi import Numerics, PhiManip, Integration, Spectrum

def split_linear_mig(params, ns, pts):
    # Split from nref into nu1A and nu2A, which change linearly to nu1B and nu2B
    # over time TB until the present, with continuous asymmetric migration.
    nu1A, nu1B, nu2A, nu2B, TB, m12, m21 = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    def nu1_func(t):
        return nu1A + (nu1B - nu1A) * (t / TB)
    def nu2_func(t):
        return nu2A + (nu2B - nu2A) * (t / TB)
    # Integrate over TB (the original call passed an undefined name, tB).
    phi = Integration.two_pops(phi, xx, TB, nu1_func, nu2_func, m12=m12, m21=m21)
    fs = Spectrum.from_phi(phi, ns, (xx, xx))
    return fs

And here are those model outputs:

Single population:

pop    loglik      theta    nuB         TB          nuA          nref       nB         nA        Tgen
Coast  -222.7535   1798824  0.56951649  0.13469882  0.005873020  111808.69  63676.894  656.6547  30120.997
South  -3503.3312  1461253  0.28979192  0.13221309  0.01044610   90826.44   26320.768  948.7821  24016.887

Two population (Coast=1, South=2)

loglik     theta     nu1A        nu1B       nu2A         nu2B       TB          m12         m21
-16220.55  136357.2  0.04929965  0.5186923  0.020855868  0.2542678  0.13902554  0.28489234  0.76129817

nref       n1B       n2B       n1A      n2A        Tgen
8475.4964  4396.175  2155.046  417.839  176.76383  2356.6209
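The scaled columns in the tables above (nref, sizes, Tgen) follow from the fitted values through theta = 4 * nref * mu * L, with sizes scaled by nref and times by 2 * nref. The sketch below reproduces them, assuming mu = 4e-9 and the L from the earlier calculation; the helper name `scale` is hypothetical.

```python
MU = 4e-9
# Effective sequence length: mappable sites times the fraction of variants kept.
L = 1985216215 * 66067163 / 130436888

def scale(theta, nus, T, mu=MU, L=L):
    """Convert dadi's relative parameters to individuals and generations."""
    nref = theta / (4.0 * mu * L)
    return {
        "nref": nref,
        "sizes": [nu * nref for nu in nus],  # sizes in individuals
        "Tgen": 2.0 * nref * T,              # time in generations
    }

# Single-population fits (nuA, nuB) and the two-population fit
# (nu1A, nu1B, nu2A, nu2B), with values copied from the tables above.
coast = scale(1798824, [0.005873020, 0.56951649], 0.13469882)
south = scale(1461253, [0.01044610, 0.28979192], 0.13221309)
two_pop = scale(136357.2, [0.04929965, 0.5186923, 0.020855868, 0.2542678],
                0.13902554)

for name, res in (("Coast", coast), ("South", south), ("Two-pop", two_pop)):
    print(name, round(res["nref"]), [round(n) for n in res["sizes"]],
          round(res["Tgen"]))
```

Because every scaled quantity is proportional to nref, the roughly 13-fold difference in theta between the 1D and 2D fits carries straight through to the sizes and times.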

Looking at these, the models actually have some similar parameters before scaling.

nuB-Coast: 0.570

nu1B: 0.519

nuB-South: 0.290

nu2B: 0.254

TB-Coast: 0.135

TB-South: 0.132

TB-Two-pop: 0.139

but the bottleneck sizes and theta are different:

nuA-Coast: 0.0059

nu1A: 0.0492

nuA-South: 0.010

nu2A: 0.021

Coast Theta & nref: 1798824 111808.69

South Theta & nref: 1461253 90826.44

Two pop theta & nref: 136357.2 8475.4964

Attachments:
Screenshot 2026-04-06 203643.png
m_200_r_50_Coast_exploration_instant_linear_change.png
Coast_South_split_linear_mig_rep12_resid20.png
m_200_r_50_South_exploration_instant_linear_change.png

Ryan Gutenkunst

Apr 14, 2026, 4:45:41 PM

to dadi...@googlegroups.com

Hello Isaac,

I wonder if something is going on with the data between the 1D and 2D spectra, to explain the very different theta estimates. Were they generated in different data steps? As a check, if you marginalize your 2D spectrum (https://dadi.readthedocs.io/en/latest/user-guide/manipulating-spectra/#marginalizing) you should get back your 1D spectra.
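To make the marginalization check concrete, here is a toy pure-Python version of the idea: summing a 2D spectrum over one population's axis should reproduce the 1D spectrum built from the same sites. The site counts are invented for illustration; with real data the comparison would use the marginalize method of dadi's Spectrum on the 2D spectrum.

```python
# Toy data: derived-allele counts (pop 1, pop 2) at each site, with
# sample sizes of 4 chromosomes per population. Values are made up.
n1, n2 = 4, 4
sites = [(1, 0), (0, 1), (1, 1), (2, 1), (4, 2), (3, 0), (1, 3)]

# Joint spectrum: fs2d[i][j] counts sites with i copies in pop 1, j in pop 2.
fs2d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
for k1, k2 in sites:
    fs2d[k1][k2] += 1

# Marginalize over pop 2 (row sums) and over pop 1 (column sums).
marg1 = [sum(row) for row in fs2d]
marg2 = [sum(fs2d[i][j] for i in range(n1 + 1)) for j in range(n2 + 1)]

# 1D spectra built directly from the same sites.
direct1 = [0] * (n1 + 1)
direct2 = [0] * (n2 + 1)
for k1, k2 in sites:
    direct1[k1] += 1
    direct2[k2] += 1

print(marg1 == direct1, marg2 == direct2)
```

If the marginal of the real 2D spectrum differs substantially from the 1D spectrum, in total count or in shape, the two were built from different site sets or filters. Note that sites segregating in only one population fall into the first or last bin of the other population's marginal, which are typically masked.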

You do have a very strong pattern in the residuals of poor fits to doubletons. I’m not sure what could be causing that. Is inbreeding a possibility in your system?

Best,

Ryan

Isaac Linn

Apr 24, 2026, 6:02:19 PM

to dadi-user

Hi Ryan, thank you for your help. The 1D and 2D spectra were generated in different data steps, and I think the 1D dataset ends up with an order of magnitude more segregating sites; I think this is due to individual missingness.

Taking a step back, it might make sense to think about the low-coverage correction, which I haven't implemented. Here are a few more details:

1. Inbreeding is likely in the system, as this is a population with frequent bottlenecks in population size. I had suspected that this might be an issue, but I think I understand that I can't use the inbreeding models with down-projection. That is a problem because of the next point.

2. Many individuals have a high fraction of missing genotypes. I filter by individual GQ > 20, which leads to many individuals having more than 10% missingness... I hadn't actually looked at that, and I imagine it's not helping. But I would either need to down-project or use a coarser quality filter, in which case I may need to implement the low-coverage correction.

I'm going back to think about the process a bit more, but I was wondering if based on that, you had any advice.

Thanks,

Isaac

Ryan Gutenkunst

May 3, 2026, 4:05:40 PM

to dadi-user

Hello Isaac,

When modeling inbreeding, instead of using down-projection, there is now an option for down-*sampling*. The difference is that down-projection averages, for each site, over all possible samplings of the called haplotypes, whereas down-sampling just randomly chooses a set of called individuals for each site. The result is less smooth, but preserves the inbreeding signal.
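The difference can be illustrated for a single site with a small pure-Python sketch. This is not dadi's implementation; the function names and the toy genotypes are hypothetical. Projection spreads the site over all derived-allele counts using hypergeometric weights on haplotypes, while subsampling keeps whole diploid individuals together.

```python
import random
from math import comb

def project_site(k, n, m):
    """Weights for j = 0..m derived alleles when a site with k derived
    alleles among n called chromosomes is projected down to m."""
    return [comb(k, j) * comb(n - k, m - j) / comb(n, m) for j in range(m + 1)]

def subsample_site(genotypes, n_ind, rng):
    """Derived-allele count among n_ind randomly chosen called individuals."""
    called = [g for g in genotypes if g is not None]
    return sum(rng.sample(called, n_ind))

# Toy site from a strongly inbred sample: every called individual is
# homozygous (0 or 2 derived copies); None marks a missing genotype.
genotypes = [2, 2, 0, 0, 0, None, 0, 2]
called = [g for g in genotypes if g is not None]
n, k = 2 * len(called), sum(called)

# Projection to 4 chromosomes puts weight on odd counts (1 and 3).
proj = project_site(k, n, 4)

# Subsampling 2 individuals (4 chromosomes) only ever yields even counts.
rng = random.Random(1)
draws = [subsample_site(genotypes, 2, rng) for _ in range(100)]

print([round(p, 3) for p in proj])
print(sorted(set(draws)))
```

In this toy case the deficit of odd counts, which is the footprint of inbreeding, survives subsampling but is averaged away by projection.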

Best,

Ryan
