AttributeError: 'MaskedArray' object has no attribute 'folded' in a simple IM model with migration


Pedro

Jan 6, 2015, 10:47:26 PM
to dadi...@googlegroups.com
Hello Ryan,

I'm really new to dadi and to Python, so this issue may well have a simple answer.
I'm trying to get the likelihood for a simple IM model using a dataset of ca. 700 SNPs. My sample sizes are 10 and 9 (two populations), without outgroup information, and both are projected down to 7 (the data are from haploids).
When I fit the data to an IM model without migration, everything works fine: I get optimized parameters and both the model and optimized likelihoods.
The model without migration is defined as:

def IM_no_mig(params, ns, pts):
    s, nu1, nu2, T = params
    xx = Numerics.default_grid(pts)

    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D_to_2D(xx, phi)

    phi = Integration.two_pops(phi, xx, T, nu1, nu2)

    model_sfs = Spectrum.from_phi(phi, ns, (xx, xx))
    return model_sfs


However, when I add migration to this model (I guess this would correspond to continuous migration):

def IM(params, ns, pts):
    s, nu1, nu2, T, m12, m21 = params

    xx = Numerics.default_grid(pts)

    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D_to_2D(xx, phi)

    phi = Integration.two_pops(phi, xx, T, nu1, nu2, m12=m12, m21=m21)

    fs = Spectrum.from_phi(phi, ns, (xx, xx))
    return fs

I get a strange error that I cannot trace back:

Traceback (most recent call last):
  File "./test-dadi.py", line 105, in <module>
    ll_model = dadi.Inference.ll_multinom(model, fs)
  File "/usr/lib64/python2.7/site-packages/dadi-1.6.3-py2.7-linux-x86_64.egg/dadi/Inference.py", line 388, in ll_multinom
    ll_arr = ll_multinom_per_bin(model, data)
  File "/usr/lib64/python2.7/site-packages/dadi-1.6.3-py2.7-linux-x86_64.egg/dadi/Inference.py", line 372, in ll_multinom_per_bin
    return ll_per_bin(theta_opt*model, data)
  File "/usr/lib64/python2.7/site-packages/dadi-1.6.3-py2.7-linux-x86_64.egg/dadi/Inference.py", line 326, in ll_per_bin
    if data.folded and not model.folded:
AttributeError: 'MaskedArray' object has no attribute 'folded'
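Poking around, dadi's Spectrum seems to be a subclass of numpy's MaskedArray that adds metadata such as the folded attribute, so somewhere my model is apparently being handed back as a plain MaskedArray. A minimal numpy-only check of what the error itself means:

```python
import numpy as np

# A plain MaskedArray carries no dadi metadata such as 'folded';
# dadi's Spectrum subclasses MaskedArray and adds that attribute,
# which is exactly what the traceback's AttributeError is saying.
arr = np.ma.masked_array([1.0, 2.0, 3.0], mask=[True, False, False])
print(hasattr(arr, 'folded'))  # False
```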

The code I'm using is the following, and I've also attached the data:

import numpy
from numpy import array
import scipy
import dadi
import models

dd = dadi.Misc.make_data_dict("data.dadi")
nproj = [7, 7]
fs = dadi.Spectrum.from_data_dict(dd, ['Pop1','Pop2'], nproj, polarized=False)
ns = fs.sample_sizes
 
pts_l = [20,30,40]
func = models.IM
params = array([0.5, 5, 5, 5, 10, 10])
upper_bound = [1, 10, 10, 10, 20, 20]
lower_bound = [1e-3, 1e-3, 1e-3, 0, 0, 0]

func_ex = dadi.Numerics.make_extrap_log_func(func)
model = func_ex(params, ns, pts_l)
ll_model = dadi.Inference.ll_multinom(model, fs)

Inspecting the data spectrum in IPython, it looks fine and is folded:

Spectrum([[-- 163.2079089506176 64.08923611111122 33.95335648148154
  28.451504629629685 14.389930555555557 14.689699074074081
  2.9186921296296306]
 [163.31878858024706 26.8812114197531 11.697569444444447 5.653356481481482
  4.36400462962963 4.591319444444446 4.432118055555557 --]
 [58.634143518518535 17.535300925925952 9.499652777777786 7.398495370370377
  6.398032407407413 2.772395833333336 -- --]
 [28.726273148148138 9.94039351851853 11.789930555555568 7.4461805555555625
  3.1258680555555585 -- -- --]
 [13.264467592592599 9.308449074074087 8.897569444444455 3.1258680555555585
  -- -- -- --]
 [7.5723379629629655 8.475578703703716 2.772395833333336 -- -- -- -- --]
 [4.098958333333332 4.432118055555557 -- -- -- -- -- --]
 [2.9186921296296306 -- -- -- -- -- -- --]], folded=True, pop_ids=['Pop1', 'Pop2'])

but the spectrum for the model is unfolded, and, worse, entirely masked (empty):

Spectrum([[-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]
 [-- -- -- -- -- -- -- --]], folded=False, pop_ids=None)

Do you know why this is happening and what I need to do to fix it?
Many thanks,
Pedro
Attachment: data.dadi

Pedro

Jan 7, 2015, 2:11:47 PM
to dadi...@googlegroups.com
Hi again,

I have been trying to understand what was going on and came up with a few improvements to the model. This time no errors are thrown, but some things are still puzzling me, and I'm not sure I'm understanding everything correctly.
The code I'm using now is below. The function is a variation on dadi.Demographics2D.split_mig that allows asymmetrical migration between pops (I still guess this would correspond to continuous migration):

import numpy
from numpy import array
import scipy
import dadi
from dadi import Numerics, PhiManip, Integration
from dadi.Spectrum_mod import Spectrum

def split_asymmigration(params, ns, pts):
    nu1, nu2, T, m12, m21 = params
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    phi = Integration.two_pops(phi, xx, T, nu1, nu2, m12=m12, m21=m21)
    model_sfs = Spectrum.from_phi(phi, ns, (xx, xx))
    return model_sfs

 
nproj = [7, 7]   
dd = dadi.Misc.make_data_dict("data.dadi")
fs = dadi.Spectrum.from_data_dict(dd, ['Pop1','Pop2'], nproj, polarized=False)
ns = fs.sample_sizes
 
pts_l = [20,30,40]
func = split_asymmigration
params = array([5, 5, 5, 0.5, 0.5])

func_ex = dadi.Numerics.make_extrap_log_func(func)
model = func_ex(params, ns, pts_l)
ll_model = dadi.Inference.ll_multinom(model, fs)

Regarding the function from my previous post, I removed the parameter "s" since, for now, I don't want to model population size changes, and I assume the populations diverged from an ancestral population without differences in size.
Another thing I noticed is that s, m12, and m21 (and m) are fractions, so their upper bounds should be at most 1; this is now reflected in params (for m12 and m21).
Are these assumptions correct?
Running this code seems to work fine without errors and I can get a likelihood for the model.

However, when I inspect the fs for the data and the model, the data is folded but the model is unfolded (although no errors were produced):
In [141]: fs
Out[141]: 
Spectrum([[-- 163.2079089506176 64.08923611111122 33.95335648148154
  28.451504629629685 14.389930555555557 14.689699074074081
  2.9186921296296306]
 [163.31878858024706 26.8812114197531 11.697569444444447 5.653356481481482
  4.36400462962963 4.591319444444446 4.432118055555557 --]
 [58.634143518518535 17.535300925925952 9.499652777777786 7.398495370370377
  6.398032407407413 2.772395833333336 -- --]
 [28.726273148148138 9.94039351851853 11.789930555555568 7.4461805555555625
  3.1258680555555585 -- -- --]
 [13.264467592592599 9.308449074074087 8.897569444444455 3.1258680555555585
  -- -- -- --]
 [7.5723379629629655 8.475578703703716 2.772395833333336 -- -- -- -- --]
 [4.098958333333332 4.432118055555557 -- -- -- -- -- --]
 [2.9186921296296306 -- -- -- -- -- -- --]], folded=True, pop_ids=['Pop1', 'Pop2'])

In [142]: model
Out[142]: 
Spectrum([[-- 4.2323162466287885 1.238520455004072 0.4367702510677906
  0.1561644050258667 0.051980644984948796 0.01460974171510141
  0.0027672245208492647]
 [4.23276627435965 1.0657773166443008 0.53915621287854 0.2787035790552109
  0.13644782560461052 0.06030354930983497 0.022362785800930558
  0.005730192991285237]
 [1.2382858352428943 0.5392535294852395 0.3306957670246527
  0.2037891280497767 0.11897835845924991 0.06322387780776599
  0.028752950971828544 0.009475518062233016]
 [0.43656218616342635 0.2787297399375768 0.20379311363119537
  0.14636031908481067 0.09958537450919944 0.062331004533783096
  0.03418692965111405 0.014408507254029054]
 [0.15604649417424643 0.13643488432235482 0.11897078185079053
  0.09958004211331033 0.0789031919526288 0.05812777081372741
  0.038465536755872776 0.02100403598957421]
 [0.05192773977708721 0.06028383172937256 0.0632104537493751
  0.0623203694305088 0.058121291554794466 0.05092057861050144
  0.041100197637313776 0.029919489828415705]
 [0.014591595614763924 0.02235023662212713 0.028741666306823423
  0.034175473788942375 0.0384544803322037 0.04109206680590185
  0.04151758273090359 0.042444072169578975]
 [0.0027634403459188468 0.005726092651528733 0.00947064100855815
  0.014401742136163739 0.020993747971162095 0.02990331594085543
  0.04241829928227183 --]], folded=False, pop_ids=None)

Is this the expected behaviour, or am I still doing something wrong?
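To check my own understanding of folding, I sketched the 1D case in plain numpy: without an outgroup, the frequency classes i and n - i are indistinguishable, so they get summed into the minor-allele class. I assume dadi's Spectrum.fold does the same in 2D, masking the redundant entries rather than zeroing them as this toy version does:

```python
import numpy as np

def fold_1d(fs):
    # Sum indistinguishable classes i and n - i into the minor-allele class.
    n = len(fs) - 1
    folded = fs + fs[::-1]
    if n % 2 == 0:
        folded[n // 2] = fs[n // 2]   # the middle class was added to itself
    folded[(n // 2) + 1:] = 0.0       # dadi masks these entries instead
    return folded

fs = np.array([0.0, 10.0, 6.0, 3.0, 1.0, 0.5, 0.0])  # unfolded, n = 6
print(fold_1d(fs))  # entries 0, 10.5, 7, 3, then zeros
```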
Thank you so much.
Pedro

Gutenkunst, Ryan N - (rgutenk)

Jan 8, 2015, 12:48:18 PM
to dadi...@googlegroups.com
Hello Pedro,

Your code looks good. The only thing is that m12 and m21 aren't bounded below 1. The values dadi takes in are the migration rates scaled by the population size, i.e. 2*Nref*m, which can be greater than 1.
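For instance, with purely illustrative numbers (the Nref and m below are just made up):

```python
# dadi's migration parameter is the per-generation migration rate m
# scaled by the reference population size: M = 2 * Nref * m.
Nref = 10000        # hypothetical reference effective population size
m_per_gen = 1e-4    # hypothetical per-generation migration fraction
M_dadi = 2 * Nref * m_per_gen
print(M_dadi)  # 2.0 -- a perfectly valid value above 1
```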

Best,
Ryan


--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569
http://gutengroup.mcb.arizona.edu

Pedro

Jan 9, 2015, 5:29:04 AM
to dadi...@googlegroups.com
Hello Ryan,

thank you so much for your clarifications. I have now been testing this model with my full dataset and came up with a few more questions regarding optimization.
Running this model with optimize_log from a particular set of initial parameters seems to converge reliably: three independent runs all finished at approximately the same solution (which is actually close to the initial parameters), with a good likelihood, always higher than the initial one.

params = array([0.5, 0.5, 0.5, 0.5, 0.5])
upper_bound = [100, 100, 20, 20, 20]
lower_bound = [1e-3, 1e-3, 0, 0, 0]
p0 = dadi.Misc.perturb_params(params, fold=2, lower_bound=lower_bound, upper_bound=upper_bound)
popt = dadi.Inference.optimize_log(p0, fs, func_ex, pts_l, lower_bound=lower_bound, upper_bound=upper_bound, verbose=1, maxiter=100)
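As a sanity check on the starting points: my understanding of perturb_params(fold=2) is that it multiplies each value by 2**(fold*u), with u uniform in [-1, 1], and then clips to the bounds, so the starts should land between 0.25x and 4x the input. A quick check of that assumed behaviour:

```python
import random

random.seed(1)
params = [0.5, 0.5, 0.5, 0.5, 0.5]
fold = 2
# Assumed behaviour of dadi.Misc.perturb_params: multiply each value by
# 2**(fold * u), u ~ Uniform(-1, 1), then clip to the supplied bounds.
p0 = [p * 2 ** (fold * random.uniform(-1, 1)) for p in params]
print(all(0.5 / 4 <= v <= 0.5 * 4 for v in p0))  # True
```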

However, when trying optimize_log_fmin for three independent runs with the same parameters, only one of them converges to the same solution as optimize_log (with approximately the same likelihood). The other two runs converge to different solutions, also different from each other, and with lower likelihoods (compared with the runs that converged to the shared solution, whether using log or log_fmin). The differences mainly involve the parameter T, although none of the runs even scratched the bounds.

Q1) Should I trust the results of optimize_log (which converged to the same values)? But then why didn't optimize_log_fmin also converge to the same solution? Should I increase maxiter to ensure convergence of all runs?

Another test I did was to push the initial parameters further away by a factor of 10, so the initial parameters are now params = array([5, 5, 5, 5, 5]). The likelihood for these parameters is very low, indicating they are probably far from the optimum. Again, in only one of three runs did the optimized parameters converge to the solution with the highest likelihood (using optimize_log).

Q2) Should I be worried about this lack of convergence for these upward-biased initial parameters? Or, since the initial likelihood is so low, is it expected that several runs will fail to converge to the same optimized parameters, because it is difficult to search a space so far from the optimum?

Thank you once again.
Best,
Pedro

Gutenkunst, Ryan N - (rgutenk)

Jan 12, 2015, 3:27:23 PM
to dadi...@googlegroups.com
Hello Pedro,

The convergence issues you're seeing are not unexpected. Our typical rule of thumb is to start optimizations from multiple initial points until we have 3 optimization runs that all converge to roughly the same likelihood and parameters, and that likelihood is the best among the runs. In general, it's hard to prove that you've found the global optimum, but this seems to be a good heuristic.

The _fmin version is a less sophisticated optimizer, although it seems to be more robust for really tough problems. So I'm not surprised it doesn't always find the optimum that optimize_log finds.
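As a toy illustration of the multi-start heuristic, here is a sketch with a simple quadratic standing in for dadi's negative log-likelihood and scipy's Nelder-Mead standing in for optimize_log / optimize_log_fmin; a real run would call dadi.Inference.optimize_log on the SFS instead:

```python
import random
from scipy.optimize import minimize

# Toy stand-in objective: minimized at params = (1, 1, 1, 1, 1).
def neg_ll(params):
    return sum((p - 1.0) ** 2 for p in params)

random.seed(0)
runs = []
for _ in range(3):
    # Scatter each start up to 2-fold around 0.5, like perturb_params.
    p0 = [0.5 * 2 ** random.uniform(-1, 1) for _ in range(5)]
    runs.append(minimize(neg_ll, p0, method='Nelder-Mead'))

# Accept the fit only if the best runs agree on likelihood and parameters.
best = min(runs, key=lambda r: r.fun)
print(best.fun)
```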

Best,
Ryan

Pedro

Jan 14, 2015, 5:12:35 AM
to dadi...@googlegroups.com
Hi again,

many thanks for all the help. I will continue these tests as you have suggested.

All the best,
Pedro