PyMC 2 Impute performance


Vincent Dubourg
Nov 12, 2014, 11:36:44 AM
to py...@googlegroups.com
Hello,

The PyMC Impute class is great for readability, but it is a pain to sample from (no offence, it's great that it exists!)...

import numpy as np
import pymc as pm

# Mask the NaNs so that Impute knows which entries to fill in:
masked_data = np.ma.masked_invalid(dataset)
data = pm.Impute('data', pm.Gamma, masked_data, alpha=3., beta=1.)

It seems to create as many stochastic nodes as there are observations (i.e. `dataset.size`), resulting in more than 400 nodes in my case. My full model even ends up with more than 800 nodes, since I have two quantities to impute, each with a missing rate around 50%.
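For what it's worth, here is how I counted them (if I read the source right, Impute returns a container with one stochastic per element; 1-d dataset assumed):

print(len(data))  # == dataset.size: one node per element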

I tried to group the missing values into a single stochastic of size n_missing and to combine them with the observed values in a deterministic, so that I get 2 nodes instead of `dataset.size`. It looks like this:

missing = np.isnan(dataset)  # precompute the missing-value mask
missing_data = pm.Gamma('missing_data', size=missing.sum(), alpha=3., beta=1., trace=False)

@pm.deterministic(name='data', trace=True)
def data(obs=dataset, miss=missing_data):
    out = obs.copy()  # fill a copy, or the NaNs are erased in place after the first call
    out[missing] = miss
    return out

But in the end, looking at `data.trace()`, the imputation is not random at all: the chain keeps returning essentially the initial values. My guess is that this comes from PyMC proposing and then rejecting all of missing_data at once (as a whole, in a single block), so the acceptance rate collapses.
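(Side note: I can at least force a different step method on the block, something like the lines below, with M the MCMC object. AdaptiveMetropolis proposes correlated updates, but it is still an all-at-once accept/reject, so I don't expect it to fix this.)

M = pm.MCMC([missing_data, data])  # plus the rest of the model
M.use_step_method(pm.AdaptiveMetropolis, missing_data)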

I think I would need a custom StepMethod that does some sort of element-wise Metropolis there, using numpy broadcasting operations instead of looping over individual nodes.

But reading the doc section entitled "Granularity of step methods: One-at-a-time vs Block-updating" [1], it seems this idea goes against the way PyMC has been designed, as I would need an array-valued log-likelihood (the logp of a node is a scalar).
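To make the idea concrete, here is the kind of rough, untested sketch I have in mind (ElementwiseMetropolis is my own name; I am assuming StepMethod's logp_plus_loglike property gives the log-probability of the node's Markov blanket). It is correct as far as I can tell, but it makes one scalar logp call per element per sweep, which is exactly the loop I would like to broadcast away:

class ElementwiseMetropolis(pm.StepMethod):
    """Propose and accept/reject one element of an array-valued
    stochastic at a time, re-using the scalar logp."""
    def __init__(self, stochastic, proposal_sd=1.):
        pm.StepMethod.__init__(self, [stochastic])
        self.s = stochastic
        self.proposal_sd = proposal_sd

    def step(self):
        for i in range(self.s.value.size):
            logp_before = self.logp_plus_loglike
            current = self.s.value.copy()
            proposal = current.copy()
            proposal.flat[i] += np.random.normal(scale=self.proposal_sd)
            try:
                self.s.value = proposal
                logp_after = self.logp_plus_loglike
            except pm.ZeroProbability:
                self.s.value = current  # e.g. a negative Gamma proposal
                continue
            if np.log(np.random.rand()) > logp_after - logp_before:
                self.s.value = current  # reject this element only

One would register it with M.use_step_method(ElementwiseMetropolis, missing_data). But the Python loop over elements is precisely what kills performance, and vectorizing the accept/reject would need per-element log-likelihood terms that the scalar logp does not expose.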

Any ideas? Other than IPython.parallel... ;)

[1] http://pymc-devs.github.io/pymc/modelfitting.html#granularity-of-step-methods-one-at-a-time-vs-block-updating

Thanks,
Vincent