Sequence length L in LD-pruned dataset

243 views
Skip to first unread message

arghya dey

unread,
Jan 11, 2023, 1:57:13 PM1/11/23
to dadi-user
Dear Ryan,

As per your previous discussions with other users, I noticed that theta=4*Nref*mu*L .
I am considering a genotype data in VCF format for humans. I have 60 million SNPs in my dataset, which becomes 1 million in number after LD pruning. I am using these 1 million SNPs to create folded frequency spectrum in dadi. In this scenario, what should be my L? 60 million or 1 million?

My next question is, can I consider mutation rate mu as 10^(-8) for humans in equation for theta (theta=4*Nref*mu*L) ?

My third question: which approach is better for overall process and uncertainty analysis? 
a) frequency spectrum for full data, Godambe Information matrix
or b) frequency spectrum for LD-pruned data, Fisher information matrix? 
(Since Godambe information matrix can deal with linked data but Fisher doesn't)

Thanks in advance,
Arghya.
Message has been deleted

arghya dey

unread,
Jan 12, 2023, 9:35:43 AM1/12/23
to dadi-user
Hello Ryan,
Sorry to bother again. I would like to clarify more regarding the first question. I considered 1 million SNPs in LD-pruned dataset spanning over 22 autosomes. Positionally, the first SNP is  1:115746 and the last one is 22:51228910. 

I also have plink files for my dataset. I am thinking of looking at the .bim file to find the difference between maximum and minimum base pair coordinates corresponding to each chromosome. Will I get L if I sum all these differences for all chromosomes?

The next two questions remain the same. 

I will be grateful for your kind help.

Thanks and Regards,
Arghya.

Ryan Gutenkunst

unread,
Jan 16, 2023, 5:40:09 PM1/16/23
to dadi-user
Hello Arghya,

1) L is neither 60 million or 1 million. It is the length of sequence from which data could enter your analysis. If you have enough coverage across the full chromosomes that you could have called SNPs successfully, then you can use the full chromosomes lengths. And you will need to account for the fact that you’ve thinned your data by a factor of 60.

2) mu ~ 1.2 x 10^{-8} is a common mutation rate estimate for humans

3) In general, (a) with the full data will give more precise estimates of parameters. But the Godambe approach can be more finicky.

Best,
Ryan
> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/56ec9e3d-1e94-4c31-9ad9-5702ad334f5cn%40googlegroups.com.

Message has been deleted

arghya dey

unread,
Jan 17, 2023, 2:29:53 AM1/17/23
to dadi-user
Hello Ryan,
Thank you so much for your help!  

Best,
Arghya.

Ryan Gutenkunst

unread,
Jan 17, 2023, 10:58:15 AM1/17/23
to dadi...@googlegroups.com
Yes, if your calling is good enough to get SNPs from all 2.8 billion base pairs. (That’s very unlikely to be true…)

> On Jan 17, 2023, at 12:10 AM, arghya dey <arghya...@gmail.com> wrote:
>
> Hello Ryan,
> Thank you so much for your help!
> The total length of 22 chromosomes is nearly 2.8 billion. So, for LD pruning, my L would be around 2.8 billion * (1 million / 60 million). Am I right?
>
> Best,
> Arghya.
>
> On Tuesday, 17 January 2023 at 04:10:09 UTC+5:30 Ryan Gutenkunst wrote:
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/dd25e885-7f10-4e60-adb0-c1fa350dac27n%40googlegroups.com.

arghya dey

unread,
Jan 18, 2023, 7:00:44 AM1/18/23
to dadi-user
Hello Ryan,

I forgot to mention some intermediate steps before LD pruning. I had around 60 million SNPs in the whole dataset and around 2000 individuals. I did MAF filering (selected those SNPs where MAF > 0.01) and missingness filtering (selected those SNPs for which missing-ness is < 2% ). After these MAF and missing genotype filtering, 60 million SNPs reduced to nearly 8 million SNPs (most SNPs were filtered out for low MAF. The missingness criteria removed only 0.1 million SNPs). The LD pruning was done on these 8 million SNPs which gave final 1 million count. Then I selected individuals from my populations of interest. Should these intermediate steps somehow affect the value of L? 

I saw in some other posts where you suggested that MAF filtering might bias the frequency spectrum and for the neutral case, linkage doesn’t bias demographic estimates. This made me believe that I can find better esmimates if I re-run the analysis on whole 60 million SNPs. I plan to do this eventually. But the MAF filtering, although biasing the spectrum against rare alleles, takes care of the genotype calling errors. So, I was wondering if I should consider whole 60 million SNPs or 8 million for re-running the analysis.

I am extremely grateful for your help to a new user like me.

Thanks and Regards,
Arghya.

Ryan Gutenkunst

unread,
Jan 19, 2023, 11:25:51 AM1/19/23
to dadi-user
Hello Arghya,

Filtering on MAF will dramatically bias your results if you don’t model it. If you really think your low frequency genotypes are poor, you can mask those low-frequency entries in the spectrum, although that will cost you substantial statistical power.

Masking in this way will not affect your L. The other steps do.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/e035cf4e-d97a-459b-8a10-137c8bff0d02n%40googlegroups.com.

arghya dey

unread,
Jan 21, 2023, 1:22:46 AM1/21/23
to dadi-user
Hello Ryan,

Thank you so much for your valuable inputs. I will read regarding masking in dadi and hopefully, understand it. I have a very trivial question. As I am dealing with huge number of variants, it is possible that many of those will have only one allele (MAF=0) in the combined set of populations of interest. In joint frequency spectrum, I think they occupy the [0,0] cell. Should I leave them as it is? Or remove them as they don't have polymorphism for my set of populations?

Best,
Arghya

Ryan Gutenkunst

unread,
Jan 21, 2023, 11:00:35 AM1/21/23
to dadi-user
Hello Arghya,

Those invariant shouldn’t appear in your final vcf final anyways. But if they do, dadi will properly ignore them. So no real need to filter them.

Best,
Ryan
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/a2f06879-2aa9-4aad-8515-71a44857ed1cn%40googlegroups.com.

arghya dey

unread,
Jan 23, 2023, 2:29:11 AM1/23/23
to dadi-user
Thank you so much Ryan.
Best,
Arghya.

Reply all
Reply to author
Forward
0 new messages