Confused about effective sequenced length (L)

钱晓波

Mar 29, 2024, 8:05:37 AM

to dadi-user

Hi Ryan & developers,

I am new to dadi and am using it for my project.

I just tested a very simple model (split_mig) on each pair of Chinese populations and got the following output:

pop1,pop2,Log(likelihood),nu1,nu2,T,m,misid,theta

CN_A,CN_B,-27245.56121,12.81360934,18.73955114,0.015744191,0.000407754,0.053141094,154465.5855

I know theta equals 4*Na*u*L, and I searched the dadi-user Google group for how to get the effective sequenced length (L). I used about 1.82M synonymous SNPs out of ~18.15M for the model test, so is L = 1.82 / 18.15 * 2.8e9? BTW, my SNPs were called from WGS data at 30x depth. Sometimes the total genome length is approximated as 3e9 bp, in which case L would be larger, and that would affect Na.

My question: if L is so ambiguous, how do we decide on L and then calculate Na and the effective population sizes of pop1 and pop2?

Best,

Xiaobo QIAN

Ryan Gutenkunst

Apr 8, 2024, 6:44:37 PM

to dadi-user

Hello Xiaobo,

If you successfully sequenced a total genome length of 2.8e9 bp and you’re using 1.82/18.15 of detected SNPs in your analysis, then your calculation is correct. The L must be what you successfully sequenced, not the total genome length.

Honestly, the difference of 3 vs 2.8 is likely much smaller than other systematic biases in which model you fit, etc.

Best,
Ryan
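For readers who want to see the arithmetic, here is a minimal sketch of converting the reported fit to physical units. It inverts theta = 4*Na*u*L for Na, then rescales the other parameters using dadi's usual units (sizes relative to Na, time in units of 2*Na generations, migration as 2*Na*m). The mutation rate, the generation time, and a callable human genome length on the order of 2.8e9 bp are assumptions made for illustration; they are not values given in this thread.

```python
# Values copied from the split_mig fit reported at the top of the thread.
theta = 154465.5855
nu1, nu2, T, m = 12.81360934, 18.73955114, 0.015744191, 0.000407754

# ASSUMPTIONS (not from the thread): mutation rate and generation time.
MU = 1.25e-8      # mutations per site per generation (assumed)
GEN_TIME = 25.0   # years per generation (assumed)


def effective_length(callable_bp, snps_used, snps_called):
    """Scale the successfully sequenced length by the fraction of SNPs kept."""
    return callable_bp * snps_used / snps_called


def ancestral_size(theta, mu, length):
    """Invert theta = 4 * Na * mu * L for Na."""
    return theta / (4.0 * mu * length)


def to_physical_units(callable_bp):
    """Convert the fitted parameters for a given callable genome length."""
    length = effective_length(callable_bp, 1.82e6, 18.15e6)
    na = ancestral_size(theta, MU, length)
    return {
        "L": length,
        "Na": na,
        "N1": nu1 * na,                    # pop1 size in individuals
        "N2": nu2 * na,                    # pop2 size in individuals
        "T_years": 2 * na * T * GEN_TIME,  # split time in years
        "m_per_gen": m / (2 * na),         # migrant fraction per generation
    }


# Compare the two callable lengths discussed above (assumed order of magnitude).
for callable_bp in (2.8e9, 3.0e9):
    result = to_physical_units(callable_bp)
    print(f"callable = {callable_bp:.1e} bp")
    for key, value in result.items():
        print(f"  {key:10s} {value:.6g}")
```

Because Na scales as 1/L, switching between the two callable lengths changes Na by a factor of 2.8/3, which is the small difference referred to above.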

> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/49b98fdf-e019-4b1b-b081-e584a11c187fn%40googlegroups.com.

Ryan Gutenkunst

Apr 22, 2024, 11:01:07 AM

to Xiaobo Qian (Xb), dadi-user

Hello Xiaobo,

I would go with option 3.

Another (perhaps better) way to think of this is that what matters is L * mu, where mu is the mutation rate to sites that can enter your analysis. So you can think of L as being fixed at the length of the target region, but the effective mutation rate is lower because you only count the rate for sites that also pass the other criteria.

Best,
Ryan

> On Apr 22, 2024, at 7:12 AM, Xiaobo Qian (Xb) <qianxia...@gmail.com> wrote:
>
> Hi Ryan,
>
> Thank you for your reply!
>
> I have one more question that came up while reading another paper. Since the author has not replied, I am bothering you again :(
>
> The paper describes the following:
> "To avoid the bias caused by the coding sequence regions, we selected intergenic, synonymous, and intronic sites from the target region as the neutral sites for analysis." (Unfortunately, the author did not say what the target region is.)
>
> If that makes sense, suppose 10M variants were called from the whole genome (~2800M bp), and 1M variants satisfying the criteria were then selected from the target region or the whole genome (I don't know which). Which of the following would L be?
> 1. (The total length of whole genome) * 1M / 10M
> 2. (The total length of target region) * 1M / 10M
> 3. (The total length of target region) * 1M / (the number of variants called from the target region only, i.e. a subset of the 10M variants)
>
> Best,
> Xiaobo
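A small sketch of option 3 and of the "L * mu" view from the reply above. It shows that scaling L by the fraction of target-region variants kept gives the same product as keeping L at the target-region length and scaling the mutation rate instead. All numbers here (target length, variant counts, mutation rate) are hypothetical placeholders, since the thread does not give them.

```python
import math

# HYPOTHETICAL inputs, chosen only to illustrate the calculation.
TARGET_BP = 5.0e7         # total length of the target region (assumed)
SNPS_IN_TARGET = 200_000  # variants called inside the target region (assumed)
SNPS_SELECTED = 150_000   # variants passing the neutral-site criteria (assumed)
MU = 1.25e-8              # mutations per site per generation (assumed)


def option3_length(target_bp, snps_selected, snps_called_in_target):
    """Option 3: scale the target length by the fraction of its variants kept."""
    return target_bp * snps_selected / snps_called_in_target


def effective_mu(mu, snps_selected, snps_called_in_target):
    """Alternative view: keep L fixed and scale the mutation rate instead."""
    return mu * snps_selected / snps_called_in_target


L_eff = option3_length(TARGET_BP, SNPS_SELECTED, SNPS_IN_TARGET)
mu_eff = effective_mu(MU, SNPS_SELECTED, SNPS_IN_TARGET)

# Only the product enters theta = 4 * Na * mu * L, so both views agree.
product_scaled_length = MU * L_eff
product_scaled_rate = mu_eff * TARGET_BP

print("effective L:", L_eff)
print("effective mu:", mu_eff)
print("products agree:", math.isclose(product_scaled_length, product_scaled_rate))
```

Either bookkeeping choice yields the same Na, as long as the same fraction is applied exactly once.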

Xiaobo Qian (Xb)

Apr 24, 2024, 12:15:01 PM

to dadi-user

Hi Ryan,

Thank you for your help!

I have thought a lot about what you said regarding effective length versus effective mutation rate.

Actually, I only have a joint-called VCF file that includes hundreds of individuals from several populations. Given that the dataset was called from 30x WGS, I'd like to fix the sequence length at 2.8e9 bp. If I want to construct a neutral model for each of these populations, should I limit the genome regions to non-CDS? Gutenkunst et al. (PLoS Genetics, 2009) used only noncoding regions of 219 autosomal genes. Of course, if I limit to non-CDS, the fixed length would need to change.

Best wishes,

Xiaobo

Ryan Gutenkunst

Apr 25, 2024, 1:49:23 PM

to dadi-user

Hello Xiaobo,

For demographic history, you want to minimize effects of selection on your SNPs (direct or indirect). So definitely go non-CDS. I might suggest being more stringent and going, for example, at least 100 kb away from genes if your genome size permits that.

Best,
Ryan
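One way the "stay away from genes" filter could be sketched in plain Python is shown below. It pads each gene by a buffer, merges the padded intervals, drops SNPs that fall inside them, and reports the remaining length, which is the L that would go with the filtered SNPs. The coordinates are toy values on a single hypothetical chromosome, intervals are treated as 0-based and half-open, and the function names are illustrative rather than part of dadi.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def excluded_regions(genes, buffer_bp, chrom_len):
    """Pad each gene by buffer_bp on both sides, clip to the chromosome, merge."""
    padded = [(max(0, s - buffer_bp), min(chrom_len, e + buffer_bp))
              for s, e in genes]
    return merge_intervals(padded)


def filter_snps(positions, regions):
    """Keep SNP positions that fall outside every excluded region."""
    starts = [s for s, _ in regions]
    kept = []
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            continue  # inside a padded gene interval
        kept.append(pos)
    return kept


# Toy example (all values hypothetical).
CHROM_LEN = 1_000_000
GENES = [(300_000, 320_000), (350_000, 360_000)]
BUFFER_BP = 100_000
SNPS = [50_000, 210_000, 459_999, 460_000, 900_000]

regions = excluded_regions(GENES, BUFFER_BP, CHROM_LEN)
kept_snps = filter_snps(SNPS, regions)
kept_length = CHROM_LEN - sum(e - s for s, e in regions)

print("excluded regions:", regions)
print("kept SNPs:", kept_snps)
print("remaining length L:", kept_length)
```

In a real analysis the remaining length would also need to be restricted to sites that were successfully sequenced, in line with the earlier point about L.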


Xiaobo Qian (Xb)

Oct 14, 2024, 12:42:49 PM

to dadi-user

Hi Ryan,

Recently I tried using dadi to infer a demographic model again. I used SNPs from non-CDS regions as input to dadi, and there are ~17 million SNPs. The first step is to make the fs file. However, ~17M SNPs take a lot of RAM. Should I thin the data, for example by using only the non-CDS regions of human genes rather than the whole genome?

Thank you again.

Best,

Xiaobo

Ryan Gutenkunst

Oct 14, 2024, 12:45:27 PM

to dadi-user

Hello Xiaobo,

Once the data are parsed, it doesn’t matter how many SNPs you have. So I would not thin.

As I said before, going non-CDS is definitely better to minimize effects of selection.

Best,

Ryan
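To illustrate why the number of SNPs stops mattering after parsing, here is a minimal sketch that accumulates a two-population frequency spectrum from a stream of per-SNP allele counts. The spectrum has (n1 + 1) x (n2 + 1) entries however many SNPs are fed in, so the memory cost sits in the parsing step and not in the spectrum itself. This is a plain-Python illustration and not dadi's own parser; the allele counts are random placeholders, and processing the input as a stream (or in chunks such as one chromosome at a time) is only one possible way to limit RAM.

```python
import random


def empty_joint_sfs(n1, n2):
    """Joint spectrum with one cell per (count in pop1, count in pop2)."""
    return [[0] * (n2 + 1) for _ in range(n1 + 1)]


def add_snps(sfs, allele_counts):
    """Add SNPs one at a time; allele_counts yields (i, j) derived counts."""
    for i, j in allele_counts:
        sfs[i][j] += 1
    return sfs


def placeholder_counts(n_snps, n1, n2, seed=0):
    """Generate random allele counts lazily (placeholder data, not realistic)."""
    rng = random.Random(seed)
    for _ in range(n_snps):
        yield rng.randint(0, n1), rng.randint(0, n2)


N1, N2 = 20, 20  # hypothetical sample sizes (chromosomes per population)

small = add_snps(empty_joint_sfs(N1, N2), placeholder_counts(1_000, N1, N2))
large = add_snps(empty_joint_sfs(N1, N2), placeholder_counts(100_000, N1, N2))

# Both spectra have the same shape; only the counts inside differ.
print("cells in small spectrum:", sum(len(row) for row in small))
print("cells in large spectrum:", sum(len(row) for row in large))
print("SNPs in large spectrum:", sum(map(sum, large)))
```

Since the generator yields one SNP at a time, the full list of SNPs is never held in memory in this sketch.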
