Running dadi using variant-only vcf as input?

37 views
Skip to first unread message

Isaac Linn

unread,
Feb 19, 2026, 5:32:15 PMFeb 19
to dadi-user
Sorry if this is a simple question, but I haven't found a clear answer so far.

I'm going through the process of running dadi on a population dataset for which I have a SNP-only VCF. I can go through the process of variant calling again, but at present, I don't have access to a VCF with variant and invariant sites. I'm wondering whether I can use dadi with this input only, whether I need to pad the dataset with monomorphic sites, or whether it's not an issue. I've tried running dadi so far and population size estimates look way low when I use L=length of callable input sequence, but when I use L= number of sites in VCF (again, mostly polymorphic) the population size estimates are a reasonable order of magnitude (compared with nucleotide diversity). 

I think I understand that the monomorphic sites would not make it into the allele frequency spectrum, but I'm concerned that it would impact the underlying model or estimation of theta. 
Thanks,
Isaac

Ryan Gutenkunst

unread,
Feb 20, 2026, 12:32:45 PMFeb 20
to dadi...@googlegroups.com
Hello Isaac,

Yes, dadi will work fine with a variant-only vcf; it will not impact the model or estimation of theta compared to an all-sites VCF.

Estimating L is sometimes tricky, depending on how the sequencing was done. It may be that certain regions were uncallable or masked (for example, repetitive regions), so L is rarely the full genome size. Setting L equal to the number of variant sites is a mistake.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/1fa08eda-b25d-4c9d-a319-e50cbfe1b6dan%40googlegroups.com.

Isaac Linn

unread,
Mar 9, 2026, 5:16:06 PMMar 9
to dadi-user
Hi Ryan,
Thank you for your response. That aligns with my expectations, but it seems like there's something wrong in my pipeline. I know that the models I'm using aren't completely matching the data, but I also know that the parameters estimated are orders of magnitude off.
I'm using variant data from 59  individuals (30 and 29) stored in a .vcf, which I project to a SFS of 2n-2, which could be smaller. My base genome (autosomal) has 1.98 Gb mappable sites, and I called variants from ~9x coverage data. Filtering reduced 130,436,888 variants to 66,067,163 sites. I do not have an invariant VCF as it's whole-genome and that file size is unwieldy, but as a base approximation I could assume that 51% of the initial sequence length is my final sequence length (1Gb).

When I look at my model outputs, the population sizes and time intervals look way low. I assume that this is related to how I'm scaling it, but it could be that the model is a poor fit, or that I'm making an error in the process somewhere. But, for my best fit replicates, I'm getting nref=10000, npops=(500,2500), split_T~2700-5600. Some replicates have further split times, but the population sizes are all pretty small across all replicates and models. Nucleotide diversity for the populations is around (0.23%,0.21%) which is a couple of orders of magnitude off what I'd estimate using dadi outputs (.0007 - .004 %). 

I'm wondering what could be messing with this inference. Obviously, a good estimate for L is important, but it could be other factors; for instance, there is some population structure. I'm considering downprojecting a bit further and increasing the maxiter, and possibly adding a bottleneck before split time, but I don't want an overly complex model. I know that the model itself isn't matching the data, but I haven't figured out exactly what needs to be done. I've also attached the model sfs and residuals for some of the best-fit data.

Additional information:
I'm estimating a split-with-migration, so I have models with and without migration, with a linear and exponential size change, and two phases to approximate secondary contact. I don't think the details of the models are atypical, but could provide them.
Population sizes: start at 2, from 1e-3 to 100
migration rates: stat at 0.2, from 1e-3 to 30
Time intervals: start at 1, from 1e-2 to 15
and I perturb_params with fold=1 before inference with maxiter= 200, and pts [40,50,60]

Example calculation with model:
    def nu1_func(t):
            return nu1A * (nu1B / nu1A) ** (t / TA)
        def nu2_func(t):
            return nu2A * (nu2B / nu2A) ** (t / TA)
        phi = Integration.two_pops(phi, xx, TA,nu1_func, nu2_func,m12=m12, m21=m21)

Outputs:  
    loglik    -13531.50         theta   169586.1   
    nu1A    1.9405859      nu2A  0.04918462    
    nu1B    0.05179008      nu2B   0.2568233   
    TA         0.1520766
    m12     0.2254417        m21    0.7371130  

Calculation: 
    sumL= 1985216215 * (66067163 )/130436888  = 1005525395
    mu=4*(10^-9)
    nref=theta / (4* sumL * mu)                                   = 10539
    n1B=nu1B*nref                                                         =457
    n2B=nu2B*nref                                                          =2568
    splittime=TA*2*$nref                                             =3210
m_200_r_50_Mountain_South_split_exponential_mig.png
m_200_r_50_Coast_South_split_linear_mig.png
m_200_r_50_Mountain_South_split_linear_mig.png
m_200_r_50_Coast_South_split_exponential_mig.png

Ryan Gutenkunst

unread,
Mar 23, 2026, 5:08:47 PM (13 days ago) Mar 23
to dadi-user
Hello Isaac,

My apologies for the slow reply. I don’t see any obvious red flags with the approach or calculations you’ve laid out below. 9x is on the verge were the low-coverage correction could be useful, but it shouldn’t be causing distortions of the magnitude you’re reporting. What are you expectations for split times?

The residuals plots don’t look amazing, but they aren’t horrible. It’s useful to see them with the full range plotted in addition to restricted -3 to 3 range as well.

If you’re still thinking about this, I’m happy to iterate.

Best,
Ryan

To view this discussion visit https://groups.google.com/d/msgid/dadi-user/8cc4bd2a-0a15-482f-853f-d57a44db3d2dn%40googlegroups.com.
<m_200_r_50_Mountain_South_split_exponential_mig.png><m_200_r_50_Coast_South_split_linear_mig.png><m_200_r_50_Mountain_South_split_linear_mig.png><m_200_r_50_Coast_South_split_exponential_mig.png>

Isaac Linn

unread,
Mar 31, 2026, 7:03:26 PM (5 days ago) Mar 31
to dadi-user
Hi Ryan,
Thanks for your help.
I'm expecting split times on the order of magnitude of 50kya if not longer. Populations are about 0.6% diverged, and heterozygosity is  about 0.2% within populations; factoring in mammal mutation rate, a split time of roughly 400kya is conceivable. But, with those numbers, theta should be higher.
Dadi has been run on a dataset from this species before : https://doi.org/10.1002/ece3.4129 , which estimated a split time between clades at 51kya. In my dataset, I have sampling across a broader range, and whole-genome resequencing instead of ddRAD data, so it's a little different, but I anticipate 51k years will be accurate or low.

One thing I noticed for these models is that the best-fit models have growth for one population (south) but shrinkage for the other population (coast or mountain). This violates my understanding of the biology, so I also intend to initialize them with population growth explicit (instead of initializing n1A and n1B at the same values). I haven't set those up to run yet, though, in case my overall method is flawed.

I've attached residual plots with no restrictions, and with a limit at 20. 

Any help is appreciated; it's helpful to know that the assumptions about L are not the cause of this.
Cheers,
Isaac
Coast_South_split_exponential_mig_rep11.png
Mountain_South_split_exponential_mig_rep26.png
Coast_South_split_linear_mig_rep12_resid20.png
Coast_South_split_exponential_mig_rep11_resid20.png
Mountain_South_split_linear_mig_rep26_resid20.png
Mountain_South_split_exponential_mig_rep26_resid20.png
Coast_South_split_linear_mig_rep12.png
Mountain_South_split_linear_mig_rep26.png

Ryan Gutenkunst

unread,
Apr 2, 2026, 7:38:27 PM (3 days ago) Apr 2
to dadi...@googlegroups.com
Hello Isaac,

If I’m reading the old paper correctly (and quickly), your ancestral population size is very similar to theirs, so that suggests it’s not a scaling issue. What do the single population models look like, in terms of times and directions of size changes?

Best,
Ryan

To view this discussion visit https://groups.google.com/d/msgid/dadi-user/41fc92da-277e-4853-ba00-60d6ae6cb958n%40googlegroups.com.
<Coast_South_split_exponential_mig_rep11.png><Mountain_South_split_exponential_mig_rep26.png><Coast_South_split_linear_mig_rep12_resid20.png><Coast_South_split_exponential_mig_rep11_resid20.png><Mountain_South_split_linear_mig_rep26_resid20.png><Mountain_South_split_exponential_mig_rep26_resid20.png><Coast_South_split_linear_mig_rep12.png><Mountain_South_split_linear_mig_rep26.png>

Reply all
Reply to author
Forward
0 new messages