RAD-seq data and L

Aude Caizergues

unread,

Apr 20, 2023, 6:26:32 PM4/20/23

to dadi-user

Hello !

I am currently in the process of running dadi on a pair of population from which 10 individuals were RAD-seq sequenced. I performed genotype calling using stacks that allows me to print both a VCF with only the SNPS and a VCF with all the positions that passed my filter i.e. for a position to be sequenced in at least 9 of the 10 individuals of a population (the only filter I put, since I don't want to biaise the dataset). After filtration of the sites that are not sequenced in at least 90% of the individuals I obtain 9,021,224 total sites among which 213,628 are SNPs.

After that, I want to keep only the positions that are in intergenic regions or 4-fold degenerate site to make sure they are neutral. So I plan on filtering my vcf with bedtools. For the sake of the example, let's say I will have 1,000,000 of positions (which are the number of lines of my VCF since I have variant + non-variant positions in it) and from which 80,000 are variants (SNPs).

The thing is I am not sure of which number to include in the calculation of L

L~Total_seq_length*(nSNPs_final/nSNPs_total)

I saw in a previous post that you recommend using the numbers after filtration, but I'm not sur if I should use the number after round 1 of filtration (90% of indiv) or round 2 (intergenic regions).

Let's say I'm considering ONLY neutral regions (after round two) so the estimation becomes L= 1,000,000,000*(80,000/nSNPs_total) but I'm not sure what should I put as nSNPs_total...

If i use the numbers, before filtration for only neutral positions, I obtain : L= 9,021,224*(80,000/213,628) but I see a problem here, the total length to find the 80,000 final SNPs was not really 9,021,224 since other positions than intergenic and 4-fold were present in this count.

Could you please help me determine which calculation is the good one, I'm afraid to mess up everything...

Thank you in advance,

Aude

Ryan Gutenkunst

unread,

Apr 23, 2023, 3:13:04 PM4/23/23

to dadi-user

Hello Aude,

In this case, I would first apply intergenic filter to the total sites VCF. Say that gives you a L of 5e6. Then I would reduce that by fraction of 4-fold SNPs divided by total SNPs.

Best,
Ryan

> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/414bd90e-b5df-4401-abe3-5a6dd444eee9n%40googlegroups.com.

Qi Fu

unread,

May 5, 2023, 11:56:34 PM5/5/23

to dadi-user

Hi Ryan,

I face with similar issue about how to calculate L. I used GATK software for downstream SNP calling for resequencing data containing two populations. And let GATK output all sites in VCF1, which include SNPs and invariant sites. Suppose VCF1 has 10,000,000 sites, which is the total length of genome. Then, I kept only sites that sequenced in at least 80% of all samples in VCF2, which contain 800,000 sites (200,000 SNPs and 600,000 invariant sites). Next, I used snpEff software annotate these variants. Variants were classified as exonic, intronic, 5’ UTR , 3’UTR, splicing, intergenic and so on. SNPs that were located in exonic regions were classified as synonymous or nonsynonymous SNPs. Only synonymous SNPs were used to perform DADI analysis. Supposing the total length of exonic regions is 400,000 in VCF2 and the total number of synonymous SNPs is 100,000 in VCF2 . Finally, in order to account for missing calls for some individuals in populations, I performed projection down to maximize the number of segregating sites in final dadi analysis, which include 20,000 sites.

Based on your reply to Aude, my understanding is L = 400,000 * (20,000/800,000) = 10,000

I’m not sure whether I have a correct understanding about L. Could you help me make it clear?

Many thanks for your help!

Best,

Qi

Ryan Gutenkunst

unread,

May 7, 2023, 11:40:08 PM5/7/23

to dadi...@googlegroups.com

Hello Qi,

What matters if the number of sites that could enter your analysis, adjusted for missing data. Your L is the total number of sites (not only variant sites) for which SNPs could enter your analysis. So if you projected down to a sample size of 18, for example. L would be the number of sites in VCF2 for which you had at least 9 individuals (18 chromosomes) called successfully.

Best,
Ryan

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/3df9b959-6ab1-4d87-a496-7fb0c4c55198n%40googlegroups.com.

Qi Fu

unread,

May 8, 2023, 11:41:06 AM5/8/23

to dadi-user

Hi Ryan,

Thank you very much for your answer. If I use synonymous sites from intronic regions to do dadi analysis, the L ~ total length of intronic regions * (total number of synonymous sites in final dadi anlysis / total number of SNPs in intronic regions) ?

Thanks a lot.

Best,

Qi

Ryan Gutenkunst

unread,

May 8, 2023, 3:49:01 PM5/8/23

to dadi...@googlegroups.com

Hello Qi,

I don’t understand your question, since by usual definition synonym sites are in coding sequences, not introns.

If you mean coding sequences instead of intronic sequences, there’s a better way to think of it. You need L for converting between theta and the effective population size, using theta = 4*Ne*mu*L. Since you’re considering only synonymous variants, the mu you want to use is not the mutation rate to create any variant, but rather the rate of mutations that create synonymous variants. (In humans, that’s roughly 1/3.3 of the total mutation rate.) Then L is the total length of coding regions.

Best,
Ryan

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/d25ddcd0-6585-4896-b307-bbffb7f6d4d5n%40googlegroups.com.

Qi Fu

unread,

May 8, 2023, 10:57:38 PM5/8/23

to dadi-user

Hi Ryan,

Thanks again for your reply. I’m sorry I muddled up the location of synonymous sites. Now, I understand the L is the total length of coding regions if we use the rate of mutations that create synonymous variants.

However, for some organisms, we may know the mutation rate to create any variant, but not know the rate of mutations that create synonymous variants. As you mentioned in another post (https://groups.google.com/g/dadi-user/c/ptGoDYG_iYY/m/L8Bb7HcAAgAJ), “if you’re using synonymous sites, you also need to account for the fact that not all mutations in coding regions will be synonymous.” If we still use synonymous sites in dadi analysis and use the mutation rate to create any variant in formula theta = 4*Ne*mu*L, L is no longer the total length of coding regions. We need to scale L. How to scale? In this case, I think L ~ (the total length of coding regions) * (total number of synonymous sites in final dadi anlysis / total number of SNPs in coding regions). Is it an appropriate way to calculate L or not?

Best

Qi

Ryan Gutenkunst

unread,

May 9, 2023, 1:19:38 PM5/9/23

to dadi-user

Hello Qi,

You could scale that way. It might not be quite as good as estimating the synonymous mu (given the genetic code and anything you know about relative mutation rates), because you’re implicitly assuming that purifying selection against the non-synonymous variants is not strong.

Best,
Ryan

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/8b4463a8-9cec-43a2-8b31-5a53096c34d1n%40googlegroups.com.

Qi Fu

unread,

May 9, 2023, 9:01:52 PM5/9/23

to dadi-user

Hi Ryan,

Thank you for your comments. I greatly appreciate your help!

Best,

Qi

Reply all

Reply to author

Forward