low theta, strange parameter values

81 views
Skip to first unread message

Linda H

unread,
Mar 19, 2025, 11:49:52 AMMar 19
to dadi-user
Dear Ryan,

I'm using dadi to estimate the demographic history of three sloth populations (20,8,24 alleles per pop). One of them should be quite distant (35% difference in PC1, old split based on phylogenetics etc). I have unlinked 3RAD data (20000 snps) and have projected down to (12,6,16 alleles). I established that a model without migration is the best model, with likelihoods converging to just above -50 and similar AIC values as well. The optimised parameters are 0.126,0.2572,0.1532,0.5201,0.0432,0.1213, with a theta of 4.7.
Here is are the plotted SFS and residuals: 
res1.jpg
I have tried to convert these to real values using: 
popt = [0.126,0.2572,0.1532,0.5201,0.0432,0.1213,4.7]
Nref = popt[6]/(4*mu*L)
N1 = Nref * popt[0]
NA = Nref * popt[1]
N2 = Nref * popt[2]
N3 = Nref * popt[3]
T1 = (popt[4]*popt[6])/(2*mu*L)
T2 = (popt[5]*popt[6])/(2*mu*L)
Here is the issue:
Our mutation rate is unknown but probably between 1.25e-8 and 7.97e-9.
Effective sequencing length L should be ~20000 snps after filtering /~200000 before filtering *2mio bp total sequencing length. This gives me some pretty outrageous values, for example 150 generations (600 years) since the first split. Playing around with different values, I've come to the conclusion that L couldnt be more than 50 for any numbers to make sense, but that seems pretty unlikely.
My thought right now is that theta is just too low, but I can't think of anything that would bias it like that.    
Do you have thoughts or advice on this issue?
Is there maybe anything I have overlooked?

Thanks in advance for your help!
Cheers,
Linda    

Ryan Gutenkunst

unread,
Mar 19, 2025, 6:09:55 PMMar 19
to dadi...@googlegroups.com
Hello Linda,

A few thoughts come to mind looking at your SFS plots. That theta is indeed very small. What is fs.S(), the total segregating sites in your SFS? Based on the scales in your plot, my guess is that it’s much less than the 20k you expect. It’s possible large amounts of SNPs are being missed due to called sample size limitations. So my first check would be that your data processing is actually happening as you expect.

The shape of your frequency spectra are also surprising, with “disjoint” entries. That might simply be caused by the small numbers in your actual SFS. It would be easier to evaluate with subsampling rather than projection.

Best,
Ryan

On Mar 19, 2025, at 8:41 AM, Linda H <linda.h...@gmail.com> wrote:

Dear Ryan,

I'm using dadi to estimate the demographic history of three sloth populations (20,8,24 alleles per pop). One of them should be quite distant (35% difference in PC1, old split based on phylogenetics etc). I have unlinked 3RAD data (20000 snps) and have projected down to (12,6,16 alleles). I established that a model without migration is the best model, with likelihoods converging to just above -50 and similar AIC values as well. The optimised parameters are 0.126,0.2572,0.1532,0.5201,0.0432,0.1213, with a theta of 4.7.
Here is are the plotted SFS and residuals: 
<res1.jpg>
I have tried to convert these to real values using: 
popt = [0.126,0.2572,0.1532,0.5201,0.0432,0.1213,4.7]
Nref = popt[6]/(4*mu*L)
N1 = Nref * popt[0]
NA = Nref * popt[1]
N2 = Nref * popt[2]
N3 = Nref * popt[3]
T1 = (popt[4]*popt[6])/(2*mu*L)
T2 = (popt[5]*popt[6])/(2*mu*L)
Here is the issue:
Our mutation rate is unknown but probably between 1.25e-8 and 7.97e-9.
Effective sequencing length L should be ~20000 snps after filtering /~200000 before filtering *2mio bp total sequencing length. This gives me some pretty outrageous values, for example 150 generations (600 years) since the first split. Playing around with different values, I've come to the conclusion that L couldnt be more than 50 for any numbers to make sense, but that seems pretty unlikely.
My thought right now is that theta is just too low, but I can't think of anything that would bias it like that.    
Do you have thoughts or advice on this issue?
Is there maybe anything I have overlooked?

Thanks in advance for your help!
Cheers,
Linda    

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/4c06a12d-7bab-4dec-855e-8c10412e4cf0n%40googlegroups.com.
<res1.jpg>

Linda H

unread,
Mar 28, 2025, 10:33:01 AMMar 28
to dadi-user
Hi Ryan, 

thanks for pointing me to the data processing! I found some individuals with disproportionate amounts of missing data and a HWE filter that may have messed with the SNPs, so i adjusted my filters and moved forward without projecting. That really changed our estimates, including theta, and the number of segregating sites from 32 to now 151414 and reduced how choppy the SFS was: 
image_720.png
I have two more questions: 
1- Out of curiosity I tried projecting the new dataset again, but it still gave me a very low number of segregating sites. Maybe I'm misunderstanding how it works but i thought the point was to maximize segregating sites by not allowing dadi to throw out snps with missing data. Do you have an idea what could be the reason for that behaviour?
2-  I'm having trouble interpreting these residuals. I see an overestimation of private snps at intermediate frequencies in every population, which I interpret as an effect of drift on the observed data. I guess the broader question is whether this is an okay fit for our data, but is this specifically something I need to adress using aditional simulations?  

Cheers and thanks for all the work you put into answering questions!
Linda

Ryan Gutenkunst

unread,
Apr 1, 2025, 1:16:04 PMApr 1
to dadi-user
Hello Linda,

See responses below.

On Mar 28, 2025, at 7:33 AM, Linda H <linda.h...@gmail.com> wrote:

Hi Ryan, 

thanks for pointing me to the data processing! I found some individuals with disproportionate amounts of missing data and a HWE filter that may have messed with the SNPs, so i adjusted my filters and moved forward without projecting. That really changed our estimates, including theta, and the number of segregating sites from 32 to now 151414 and reduced how choppy the SFS was: 
<image_720.png>
I have two more questions: 
1- Out of curiosity I tried projecting the new dataset again, but it still gave me a very low number of segregating sites. Maybe I'm misunderstanding how it works but i thought the point was to maximize segregating sites by not allowing dadi to throw out snps with missing data. Do you have an idea what could be the reason for that behaviour?

There are two countervailing factors:
1) If you have missing data in some individuals, projecting to a smaller sample size will allow more sites to enter the analysis, increasing the sites in the SFS.
2) But projecting downward reduces your effective sample size, and some alleles that were rare in your original sample will be absent in the projected sample, reducing the sites in the SFS.
The "maximizing segregating sites” rule of thumb is a rough guide. If projecting doesn’t increase segregating sites much or even decreases them, I would avoid projecting.

2-  I'm having trouble interpreting these residuals. I see an overestimation of private snps at intermediate frequencies in every population, which I interpret as an effect of drift on the observed data. I guess the broader question is whether this is an okay fit for our data, but is this specifically something I need to adress using aditional simulations?  

It would be useful to see the residual plot without restricting the range to +/- 3, since that cuts off some of the useful information. The residuals are (model - data), so these plots suggest that your model is underestimating the private low-frequency alleles and over-estimating the intermediate-frequency alleles. A common cause for this would be growth in the populations that your model isn’t capturing.

Best,
Ryan

Linda Hagberg

unread,
Apr 8, 2025, 4:46:14 AMApr 8
to dadi...@googlegroups.com
Hi Ryan, 

Thanks for clarifying down-projection, it really doesn't seem to be useful in this case. 

Population growth makes sense biologically, as we have indications about this from other analyses. 
Here is the plot without the residual cut off:
image.png
I guess the lack of private singletons in "ron" is driving the residual scale.

Thanks a lot for your insights!
Cheers, 
Linda


You received this message because you are subscribed to a topic in the Google Groups "dadi-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dadi-user/HcSycjhW1dI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dadi-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dadi-user/9604FE36-6DBF-44E9-A553-0DB4964BD088%40gmail.com.
Reply all
Reply to author
Forward
0 new messages