Running dadi using variant-only vcf as input?

23 views
Skip to first unread message

Isaac Linn

unread,
Feb 19, 2026, 5:32:15 PMFeb 19
to dadi-user
Sorry if this is a simple question, but I haven't found a clear answer so far.

I'm going through the process of running dadi on a population dataset for which I have a SNP-only VCF. I can go through the process of variant calling again, but at present, I don't have access to a VCF with variant and invariant sites. I'm wondering whether I can use dadi with this input only, whether I need to pad the dataset with monomorphic sites, or whether it's not an issue. I've tried running dadi so far and population size estimates look way low when I use L=length of callable input sequence, but when I use L= number of sites in VCF (again, mostly polymorphic) the population size estimates are a reasonable order of magnitude (compared with nucleotide diversity). 

I think I understand that the monomorphic sites would not make it into the allele frequency spectrum, but I'm concerned that it would impact the underlying model or estimation of theta. 
Thanks,
Isaac

Ryan Gutenkunst

unread,
Feb 20, 2026, 12:32:45 PMFeb 20
to dadi...@googlegroups.com
Hello Isaac,

Yes, dadi will work fine with a variant-only vcf; it will not impact the model or estimation of theta compared to an all-sites VCF.

Estimating L is sometimes tricky, depending on how the sequencing was done. It may be that certain regions were uncallable or masked (for example, repetitive regions), so L is rarely the full genome size. Setting L equal to the number of variant sites is a mistake.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/1fa08eda-b25d-4c9d-a319-e50cbfe1b6dan%40googlegroups.com.

Isaac Linn

unread,
Mar 9, 2026, 5:16:06 PM (6 days ago) Mar 9
to dadi-user
Hi Ryan,
Thank you for your response. That aligns with my expectations, but it seems like there's something wrong in my pipeline. I know that the models I'm using aren't completely matching the data, but I also know that the parameters estimated are orders of magnitude off.
I'm using variant data from 59  individuals (30 and 29) stored in a .vcf, which I project to a SFS of 2n-2, which could be smaller. My base genome (autosomal) has 1.98 Gb mappable sites, and I called variants from ~9x coverage data. Filtering reduced 130,436,888 variants to 66,067,163 sites. I do not have an invariant VCF as it's whole-genome and that file size is unwieldy, but as a base approximation I could assume that 51% of the initial sequence length is my final sequence length (1Gb).

When I look at my model outputs, the population sizes and time intervals look way low. I assume that this is related to how I'm scaling it, but it could be that the model is a poor fit, or that I'm making an error in the process somewhere. But, for my best fit replicates, I'm getting nref=10000, npops=(500,2500), split_T~2700-5600. Some replicates have further split times, but the population sizes are all pretty small across all replicates and models. Nucleotide diversity for the populations is around (0.23%,0.21%) which is a couple of orders of magnitude off what I'd estimate using dadi outputs (.0007 - .004 %). 

I'm wondering what could be messing with this inference. Obviously, a good estimate for L is important, but it could be other factors; for instance, there is some population structure. I'm considering downprojecting a bit further and increasing the maxiter, and possibly adding a bottleneck before split time, but I don't want an overly complex model. I know that the model itself isn't matching the data, but I haven't figured out exactly what needs to be done. I've also attached the model sfs and residuals for some of the best-fit data.

Additional information:
I'm estimating a split-with-migration, so I have models with and without migration, with a linear and exponential size change, and two phases to approximate secondary contact. I don't think the details of the models are atypical, but could provide them.
Population sizes: start at 2, from 1e-3 to 100
migration rates: stat at 0.2, from 1e-3 to 30
Time intervals: start at 1, from 1e-2 to 15
and I perturb_params with fold=1 before inference with maxiter= 200, and pts [40,50,60]

Example calculation with model:
    def nu1_func(t):
            return nu1A * (nu1B / nu1A) ** (t / TA)
        def nu2_func(t):
            return nu2A * (nu2B / nu2A) ** (t / TA)
        phi = Integration.two_pops(phi, xx, TA,nu1_func, nu2_func,m12=m12, m21=m21)

Outputs:  
    loglik    -13531.50         theta   169586.1   
    nu1A    1.9405859      nu2A  0.04918462    
    nu1B    0.05179008      nu2B   0.2568233   
    TA         0.1520766
    m12     0.2254417        m21    0.7371130  

Calculation: 
    sumL= 1985216215 * (66067163 )/130436888  = 1005525395
    mu=4*(10^-9)
    nref=theta / (4* sumL * mu)                                   = 10539
    n1B=nu1B*nref                                                         =457
    n2B=nu2B*nref                                                          =2568
    splittime=TA*2*$nref                                             =3210
m_200_r_50_Mountain_South_split_exponential_mig.png
m_200_r_50_Coast_South_split_linear_mig.png
m_200_r_50_Mountain_South_split_linear_mig.png
m_200_r_50_Coast_South_split_exponential_mig.png
Reply all
Reply to author
Forward
0 new messages