making three-population models work: the simplest 3-pop model


c.monster

Apr 21, 2015, 6:58:36 PM
to dadi...@googlegroups.com
Dear Ryan and fellow dadi-ists - 

I seem to have trouble making my three-population models work. They run indefinitely with nothing ever printed out, save for the occasional warning of the kind "Model is masked in some entries where data is not". I wonder: is it my data (specifically, very low divergence between pops, Fst ~0.04) or the way I specify my model?

Here is what I thought would be the simplest-ever 3-pop model, analogous to split_mig. Am I missing something? Maybe I need to integrate between splits? Or could my problem be that I have too many individuals (down-projected to 24 per pop)?


import dadi
from dadi import Numerics, PhiManip, Integration, Spectrum
from numpy import array

def s3m(params, ns, pts):
    nu1, nu2, nu3, T, m121, m131, m232 = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    phi = PhiManip.phi_2D_to_3D_split_1(xx, phi)
    phi = Integration.three_pops(phi, xx, T, nu1, nu2, nu3,
                                 m12=m121, m21=m121, m13=m131, m31=m131,
                                 m23=m232, m32=m232)
    fs = Spectrum.from_phi(phi, ns, (xx, xx, xx))
    return fs


# optimizer:
S3MM = Numerics.make_extrap_log_func(s3m)

pts = [40, 50, 60]
ns = [24, 24, 24]  # = data.sample_sizes

params = array([1, 1, 1, 1, 20, 20, 20])
upper_bound = [5, 5, 5, 5, 100, 100, 100]
lower_bound = [0.1, 0.1, 0.1, 0.1, 1, 1, 1]

poptg = dadi.Inference.optimize_log(params, data, S3MM, pts,
                                    lower_bound=lower_bound,
                                    upper_bound=upper_bound,
                                    verbose=len(params),
                                    maxiter=3)




Jason

Oct 8, 2015, 8:47:20 AM
to dadi-user
Hi Mikhail,

Did you get your 3D model to work? If so, can you post what you did to get it working?

Thanks,
Jason

Mikhail Matz

Oct 8, 2015, 12:31:01 PM
to dadi-user
Hi Jason - 

I made another one work, although exceedingly slowly (it takes literally a week; code attached). What made it run without many warnings (although it must have contributed to slowing it down) was expanding the extrapolation grid from pts=[30,40,50] to pts=[40,60,80]. It also uses a somewhat strange trick of modeling a near-simultaneous split between the three pops, which seems to help with fitting, although I am actually not sure whether it makes sense.

As a result I gave up on more-than-two-population modeling and switched to pairwise analysis. As we discussed in another thread ("does pairwise dadi make sense"), it seems adequate for the case of multiple arbitrarily defined populations that all draw migrants from a more or less common pool, as is typical in a marine system, as long as you are aware that the inferred migration between a pair of populations reflects genetic exchange along all paths (i.e., direct migration between them as well as exchange via other pops as stepping stones).

cheers

Mikhail
s3m_2step_auto.py

Jason

Oct 15, 2015, 8:43:15 AM
to dadi-user


Hi Mikhail,

Ok, thanks for your response. I appreciate knowing what I am getting into if I decide to pursue 3 population modelling in dadi. From your experience it sounds like it may be very challenging and resource intensive. I would also be interested in hearing from other users who have worked with 3 population models.

Thanks,
Jason

Mitra Menon

Mar 29, 2017, 11:28:23 AM
to dadi-user
Hi Mikhail,

I am using a 3 population model and running into this warning while using the optimizer. I am wondering if you had similar issues:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum for unexpected results.

My ns = [52,98,30]
pts_1 = [200,220,240]

Mitra

Gutenkunst, Ryan N - (rgutenk)

Mar 29, 2017, 4:22:30 PM
to dadi...@googlegroups.com
Hi Mitra,

This may also be an issue with optimization bounds, as some parameter settings (very high migration, for example) can make integration very challenging. If you’re only seeing this message a few times, it may mean that there are only problems at the edges of parameter space, which may be far from the important region for your problem. So it may not be so big a deal. Turn on verbose logging of the optimizer to see what parameter sets are causing the issue.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
Ryan Gutenkunst
Assistant Professor of Molecular and Cellular Biology, University of Arizona
phone: (520) 626-0569, office: LSS 325, web: http://gutengroup.mcb.arizona.edu

Latest papers: 
“Selection on network dynamics drives differential rates of protein domain evolution”
PLoS Genetics; http://dx.doi.org/10.1371/journal.pgen.1006132
"Triallelic population genomics for inferring correlated fitness effects of same site nonsynonymous mutations"
Genetics; http://dx.doi.org/10.1534/genetics.115.184812

Mitra Menon

Mar 30, 2017, 9:38:46 PM
to dadi-user, rgu...@email.arizona.edu
Hi Ryan,

Thanks for the reply. I tried a couple of things to get rid of this warning, based on your comment here and previous posts by others. My model is one of pure divergence, so it does not include a migration parameter. My sample sizes are [40, 78, 14] and my grid sizes are [200, 220, 240]. With these settings I have tried using verbose=1 to print out which parameters might be out of bounds.

My parameter bounds are:
upper_bound = [5, 5, 5, 5, 5, 1]
lower_bound = [1e-2,1e-2,1e-2,1e-2,1e-4,1e-7]
p0 = array([1,1,1,1,0.001,0.0001])

Using verbose=1 it seems the warning is always generated, but from what I see the values don't seem to be touching any of the bounds. For instance, this is one parameter set from the optimization: array([ 0.672875, 0.670105, 1.82947, 1.33827, 0.00201976, 0.000104001]). It is well within the limits, so I am not sure what the warning message refers to.

I finished the run anyway and looked at the estimate of theta using dadi.Inference.optimal_sfs_scaling(model1, data); theta is 1923, which seems really unrealistic. I have attached the spectrum file at the end too. Core and periphery belong to the same species, and LP should be more divergent from both of them, but probably slightly closer to periphery because they hybridize.

Here are my questions:
Can you help me understand why I am still getting the warning message and how I could fix it?

I am still not sure my parameter bounds and initial guess are correct. Since we are exploring the parameter space through the optimization, do the input bounds really matter much? If they are really off, then my estimates based on the model should be near the bounds. Could you give me some detail about how to decide on the parameter space to explore (how to set population size and time)?

Finally, is the value of theta so off because my model is bad, or because the parameters were not good to begin with?

Thanks for your time,
Mitra
split_only.png

Gutenkunst, Ryan N - (rgutenk)

Apr 2, 2017, 2:45:55 PM
to Mitra Menon, dadi-user
Hello Mitra,

On Mar 30, 2017, at 6:38 PM, Mitra Menon <men...@mymail.vcu.edu> wrote:

> Hi Ryan,
>
> Thanks for the reply. I tried a couple of things to get rid of this warning, based on your comment here and previous posts by others. My model is one of pure divergence, so it does not include a migration parameter. My sample sizes are [40, 78, 14] and my grid sizes are [200, 220, 240]. With these settings I have tried using verbose=1 to print out which parameters might be out of bounds.
>
> My parameter bounds are:
> upper_bound = [5, 5, 5, 5, 5, 1]
> lower_bound = [1e-2,1e-2,1e-2,1e-2,1e-4,1e-7]
> p0 = array([1,1,1,1,0.001,0.0001])
>
> Using verbose=1 it seems the warning is always generated, but from what I see the values don't seem to be touching any of the bounds. For instance, this is one parameter set from the optimization: array([ 0.672875, 0.670105, 1.82947, 1.33827, 0.00201976, 0.000104001]). It is well within the limits, so I am not sure what the warning message refers to.
>
> I finished the run anyway and looked at the estimate of theta using dadi.Inference.optimal_sfs_scaling(model1, data); theta is 1923, which seems really unrealistic. I have attached the spectrum file at the end too. Core and periphery belong to the same species, and LP should be more divergent from both of them, but probably slightly closer to periphery because they hybridize.
>
> Here are my questions:
> Can you help me understand why I am still getting the warning message and how I could fix it?

I’m not sure what your parameters mean in your model, so it’s hard to diagnose. If you send a small example, I can dig into the code itself.

> I am still not sure my parameter bounds and initial guess are correct. Since we are exploring the parameter space through the optimization, do the input bounds really matter much? If they are really off, then my estimates based on the model should be near the bounds. Could you give me some detail about how to decide on the parameter space to explore (how to set population size and time)?

The parameter bounds are primarily about avoiding really slow fits. Calculation with high migration rates and times, or with small population sizes, is much slower, and the optimizer can explore extreme values before settling down. So typically we set migration rates to be bounded [0, 20ish], times to be [0, 5ish], and population sizes to be [1e-3, 1e6ish].
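For example, for a model with four population sizes and two times, that rule of thumb would translate into something like the following (illustrative starting points, not a prescription):

```python
# sizes bounded [1e-3, 1e6], times bounded (effectively) (0, 5]
upper_bound = [1e6, 1e6, 1e6, 1e6, 5, 5]
lower_bound = [1e-3, 1e-3, 1e-3, 1e-3, 1e-4, 1e-4]
```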

> Finally, is the value of theta so off because my model is bad, or because the parameters were not good to begin with?

I’m not sure why you think this theta is so bad. Remember that it is theta for the entire data set, not a per-site theta. Also note that Watterson’s estimate of theta is only valid for equilibrium demography.

Best,
Ryan

<split_only.png>

Mitra Menon

Apr 2, 2017, 3:09:02 PM
to dadi-user
Thanks Ryan,

You were right, I had misunderstood theta. But I have one question about it: theta in dadi is 4*Na*u*L, so is L the total number of SNPs input, or is it the number we obtain through data.S() after downscaling?

As for the optimization warning, I have tried modifying the bounds and the grid size, but the warning seems always to be generated. Also, I ran the optimizer a couple of times (mostly with maxiter=5) to converge on a parameter estimate as suggested in the manual, but the estimates almost always differ. I have pasted the code below.

##simple model of divergence only##
##############################

dd = dadi.Misc.make_data_dict('dadiIn')
data = dadi.Spectrum.from_data_dict(dd,pop_ids=['core','periphery','LP'],projections=[40,78,14],polarized=False)
ns=data.sample_sizes

pts_1 = [200,220,240]

def split_only(params, ns, pts):
    # nuS: size of SWWP after the first split (T1)
    # nuL: size of LP after the first split (and its present size too)
    # nuC: size of the core pop after splitting from SWWP (T2)
    # nuP: size of the periphery pop after splitting from SWWP (T2)
    # T1: first divergence event/speciation, leading to SWWP and LP
    # T2: second divergence event, SWWP splits into core and periphery
    nuS, nuL, nuC, nuP, T1, T2 = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T1, nu1=nuL, nu2=nuS)
    phi = dadi.PhiManip.phi_2D_to_3D_split_2(xx, phi)
    phi = dadi.Integration.three_pops(phi, xx, T2, nu1=nuL, nu2=nuC, nu3=nuP)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx, xx, xx))
    return fs

func = split_only

upper_bound = [5, 5, 10, 5, 1, 1e-2]
lower_bound = [1e-3,1e-3,1e-1,1e-4,1e-4,1e-7]
p0 = array([0.5,0.5,1.5,0.5,0.08,0.008])

p0 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound,lower_bound=lower_bound)
func_ex = dadi.Numerics.make_extrap_log_func(func)
popt = dadi.Inference.optimize_log(p0, data, func_ex, pts_1,
                                   lower_bound=lower_bound,
                                   upper_bound=upper_bound,
                                   verbose=len(p0), maxiter=5)


Mitra

Gutenkunst, Ryan N - (rgutenk)

Apr 3, 2017, 11:31:22 AM
to dadi...@googlegroups.com
Hi Mitra,

Theta is 4*Na*u*L, where L is the number of bases that entered your analysis. See the archives of the mailing list for many discussions of this issue.
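For example, with an assumed per-site mutation rate and an assumed number of sites (neither taken from your data, purely illustrative), the conversion from theta to an ancestral effective size looks like:

```python
# theta from dadi is 4 * Na * u * L for the whole data set, not per site.
theta = 1923.0   # output of dadi.Inference.optimal_sfs_scaling
u = 1e-8         # assumed per-site mutation rate per generation
L = 5e6          # assumed number of sites that entered the analysis
Na = theta / (4 * u * L)
print(Na)        # about 9.6e3, a perfectly plausible ancestral size
```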

If you’d like more help with your optimization, you’ll need to send a full working example that I can run. (Off-list if you prefer.)

Best,
Ryan



Gutenkunst, Ryan N - (rgutenk)

Apr 6, 2017, 7:11:49 PM
to dadi...@googlegroups.com
Hello Mitra,

Thank you for sending your example offline. I’m responding on-list, so that other folks can learn from the response. (I’ve sanitized your code, so it should be anonymous.)

The warning you were seeing arises when extrapolation leads to values in the extrapolated spectrum that are quite different from those in the input spectra. Typically this happens when using logarithmic extrapolation with models in which many entries are very small (near machine precision). That’s exactly what’s happening here: your model has a second divergence that is very, very short. Consequently, most of the density remains along the diagonal, and so many entries are very small.

So the warnings are likely not impacting your results much at all, because the issues are concentrated in cells of the fs that contribute very little to the likelihood anyway (because there’s likely no data there). If you want to get rid of them, the easiest fix is to use linear extrapolation rather than logarithmic. That is more stable, with the downside that one can get small negative entries in the model frequency spectrum because of precision issues. But usually those aren’t a problem when fitting data, at least if your initial guess is close to reasonable.

The attached code demonstrates the issue and shows that linear and logarithmic extrapolation yield very similar results with the pts you’ve set. Note that you can also go to a coarser (faster) grid of points and still get good accuracy.

Best,
Ryan
test.py
ATT00001.htm

Mitra Menon

Apr 7, 2017, 12:00:31 PM
to dadi-user
Thanks a lot for you help Ryan. 

I ran my model 10 different times to try to get the parameters to converge, but still could not, and the likelihoods also vary quite a bit. I came across this resource to improve global optimization:

I think I am going to try using their optimization routine on my data.

Mitra


Mitra Menon

Apr 9, 2017, 8:58:35 AM
to dadi-user
Hi Ryan,

I tried to use the 3-step optimization mentioned in the previous post, but had trouble installing it. I am not sure why my parameters and likelihoods for any given model vary so much from run to run. Do you have any suggestions or input? I wonder if my data is really uninformative and that is why I am having trouble getting convergence. Is there another optimization method I could try within dadi that might be better suited to convergence problems?

Thanks
Mitra


Gutenkunst, Ryan N - (rgutenk)

Apr 10, 2017, 1:40:47 PM
to dadi...@googlegroups.com
Hello Mitra,

You will have some difficulty inferring parameters for your data with such a short divergence time. If the divergence time is very short, there is essentially no drift happening, so Ne and m will be difficult to infer. I suggest working with some pairwise 2-population models to make sure you’re in the right ballpark with your initial parameter guesses.

Best,
Ryan


Mitra Menon

Apr 12, 2017, 2:10:08 PM
to dadi-user, rgu...@email.arizona.edu
Hello Ryan,

Thanks for the suggestion. I fit pairwise population models under the simplistic (unrealistic) model of divergence with no migration. The parameter estimates from these are in line with those from the 3-population models. Would it make sense, out of the 10 reps I have of the 3-population model, to pick the one with the highest likelihood, do the same for all my models, and then perform further tests?

Mitra Menon

Apr 12, 2017, 2:12:38 PM
to dadi-user, rgu...@email.arizona.edu
Sorry, I have another quick question. If I want to fix certain parameters in my model, is this the right way to do it:

def split_only(params, ns, pts):
    nuL, nuC, T = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T, nu1=nuL, nu2=nuC)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx, xx))
    return fs

upper_bound = [1e2, 1e2,1]
lower_bound = [1e-3,1e-3,1e-6]
p0 = array([0.5,1.04658373e-03,0.087])

popt = dadi.Inference.optimize_log(p0, data, func_ex, pts_1,
                                   lower_bound=lower_bound,
                                   upper_bound=upper_bound,
                                   fixed_params=[None, 1.04658373e-03, None],
                                   verbose=len(p0), maxiter=25)

Thanks

Gutenkunst, Ryan N - (rgutenk)

Apr 12, 2017, 3:59:50 PM
to Mitra Menon, dadi-user
Hello Mitra,

Yes, that is the correct way to fix parameters.

I would not publish results for which I wasn’t confident I had converged parameter estimates. In other words, I want to get the same maximum likelihood several times starting from different initial points in the optimization. Sometimes that is very challenging, in which case I would simplify the model (or fix parameters).

Do check your convergence issues. Is it that many sets of parameters give very similar results, implying a ridge in the likelihood surface? Or is it that you just aren’t getting good fits, implying issues with optimization?

Best,
Ryan