Calculating L with a subset of original VCF

Ellie Faber

May 19, 2026, 4:16:48 PM

to dadi-user

Hello,

I know this has been answered multiple times on this forum, but I am still worried that I am not calculating the effective sequence length, L, correctly for my dataset. I have been using the formula L ~ (size of the region SNPs were successfully called from) * (number of SNPs used in dadi / number of SNPs in the original dataset).

I have WGS data where pretty much all of the genome (1.6 Gb) was called. From there, I have a VCF with ~47 M SNPs. After LD pruning my VCF and restricting to biallelic SNPs only, I am left with ~3.1 M SNPs. So, L = 1.6 Gb * (3.1 M / 47 M).
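As a quick check of the arithmetic above, the formula can be evaluated directly. This is only a sketch using the numbers quoted in this post; it reads the genome size as 1.6 billion base pairs, and the variable names are made up for illustration.

```python
# Numbers quoted in the post. Treating the genome size as base pairs
# is an assumption about what "1.6" gigabases means here.
called_region_bp = 1.6e9   # region SNPs were successfully called from
snps_original = 47e6       # SNPs in the original VCF
snps_used = 3.1e6          # SNPs left after LD pruning and biallelic filtering

# L ~ (called region) * (SNPs used in dadi / SNPs in original dataset)
L = called_region_bp * (snps_used / snps_original)
print(f"L = {L:.3e} bp")
```

With these inputs L comes out at roughly 1.06 x 10^8 bp, before any adjustment for masked repeats.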

I am confused because the genomes in my dataset are highly repetitive, so I am not sure where in the calculation I should account for sites that were masked as repeats.

I am also using a subset of samples from the original VCF, so I am not sure if I have to subset the original dataset down to the samples I am interested in to get the number of SNPs in this original dataset.

Any insight is appreciated! Thank you.

Ryan Gutenkunst

May 20, 2026, 10:10:18 PM

to dadi...@googlegroups.com

Hello Ellie,

L should only count the unmasked regions from which SNPs could enter your analysis.

You can compute the ratio of SNP numbers either at the full sample size or at the sample size you use in the dadi analysis. The two ratios should be similar.
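A minimal sketch of this advice, assuming the masked length is known from the repeat mask. The function name `effective_length` and the `masked_bp` value are hypothetical placeholders, not part of dadi and not numbers from this thread. The SNP counts should both come from the same unmasked regions and the same sample set.

```python
def effective_length(unmasked_bp, snps_in_analysis, snps_called):
    """Scale the unmasked, callable length by the fraction of SNPs retained.

    Hypothetical helper for illustration; not a dadi function.
    """
    if snps_called <= 0:
        raise ValueError("snps_called must be positive")
    if snps_in_analysis > snps_called:
        raise ValueError("snps_in_analysis cannot exceed snps_called")
    return unmasked_bp * snps_in_analysis / snps_called


genome_bp = 1_600_000_000   # genome size quoted in the question
masked_bp = 600_000_000     # HYPOTHETICAL placeholder: bases masked as repeats
unmasked_bp = genome_bp - masked_bp

# SNP counts quoted in the question, assumed to be counted in unmasked regions
L = effective_length(unmasked_bp, 3_100_000, 47_000_000)
print(f"L = {L:.3e} bp")
```

Replacing `masked_bp` with the real masked length, for example the total length of the intervals in the repeat mask, gives the L that counts only unmasked sequence.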

Best,

Ryan
