Hello,
I know this has been answered multiple times on this forum, but I am still worried that I am not calculating the effective sequence length, L, correctly for my dataset. I have been using the formula L ~ (size of the region from which SNPs were successfully called) * (number of SNPs used in dadi / number of SNPs in the original dataset).
I have WGS data where pretty much the entire genome (1.6 Gb) was called. From there, I have a VCF with ~47 M SNPs. After LD pruning the VCF and restricting it to biallelic SNPs, I am left with ~3.1 M SNPs. So, L = 1.6 Gb * (3.1 / 47), which comes to roughly 106 Mb.
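To make the arithmetic concrete, here is a minimal sketch of the calculation as I understand it. The function and argument names are my own (they are not part of dadi), and the inputs are the approximate values quoted above, so please treat it as an illustration of the formula rather than a confirmed method.

```python
def effective_sequence_length(callable_bp, snps_used, snps_total):
    """Scale the callable region by the fraction of SNPs kept for dadi.

    callable_bp : size (in bp) of the region SNPs could be called from
    snps_used   : number of SNPs that go into the dadi analysis
    snps_total  : number of SNPs in the original call set
    (All three names are hypothetical, chosen for this example.)
    """
    if snps_total <= 0:
        raise ValueError("snps_total must be positive")
    if not 0 <= snps_used <= snps_total:
        raise ValueError("snps_used must be between 0 and snps_total")
    return callable_bp * snps_used / snps_total


# Approximate values from my dataset: 1.6 Gb callable, 3.1 M of 47 M SNPs kept.
L = effective_sequence_length(1.6e9, 3.1e6, 47e6)
print(f"L ~ {L / 1e6:.1f} Mb")
```

If masked repeats should not count as callable sequence, my guess is that only the first argument (callable_bp) would change, but that is exactly the part I am unsure about.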
I am confused because the genomes in my dataset are highly repetitive, and I am not sure where in the calculation I should account for sites that were masked as repeats.
I am also using a subset of the samples in the original VCF, so I am not sure whether I need to subset the original VCF down to the samples I am interested in before counting the number of SNPs in the original dataset.
Any insight is appreciated! Thank you.