Hi Ryan,
I face with similar issue about how to calculate L. I used GATK software for downstream SNP calling for resequencing data containing two populations. And let GATK output all sites in VCF1, which include SNPs and invariant sites. Suppose VCF1 has 10,000,000 sites, which is the total length of genome. Then, I kept only sites that sequenced in at least 80% of all samples in VCF2, which contain 800,000 sites (200,000 SNPs and 600,000 invariant sites). Next, I used snpEff software annotate these variants. Variants were classified as exonic, intronic, 5’ UTR , 3’UTR, splicing, intergenic and so on. SNPs that were located in exonic regions were classified as synonymous or nonsynonymous SNPs. Only synonymous SNPs were used to perform DADI analysis. Supposing the total length of exonic regions is 400,000 in VCF2 and the total number of synonymous SNPs is 100,000 in VCF2 . Finally, in order to account for missing calls for some individuals in populations, I performed projection down to maximize the number of segregating sites in final dadi analysis, which include 20,000 sites.
Based on your reply to Aude, my understanding is L = 400,000 * (20,000/800,000) = 10,000
I’m not sure whether I have a correct understanding about L. Could you help me make it clear?
Many thanks for your help!
Best,
Qi
Thanks again for your reply. I’m sorry I
muddled up the location of synonymous sites. Now, I understand the L is the total
length of coding regions if we use the rate of mutations that create synonymous
variants.
However, for some organisms, we may know the mutation rate to create any variant, but not know the rate of mutations that create synonymous variants. As you mentioned in another post (https://groups.google.com/g/dadi-user/c/ptGoDYG_iYY/m/L8Bb7HcAAgAJ), “if you’re using synonymous sites, you also need to account for the fact that not all mutations in coding regions will be synonymous.” If we still use synonymous sites in dadi analysis and use the mutation rate to create any variant in formula theta = 4*Ne*mu*L, L is no longer the total length of coding regions. We need to scale L. How to scale? In this case, I think L ~ (the total length of coding regions) * (total number of synonymous sites in final dadi anlysis / total number of SNPs in coding regions). Is it an appropriate way to calculate L or not?
Best
Qi