Surprising behavior when modeling many objects

Todd

Aug 12, 2009, 12:07:40 AM
to PyMC
Hi,

I've come across some surprising (at least to me) behavior when using
PyMC to model source and background counts from astronomical objects
(quasars, in case you're curious). The goal is, not surprisingly, to
estimate the posterior probability distributions for the flux of each
object. (The fluxes of the objects are independent, as are the
backgrounds at the locations of the sources.) For one object, the
model is simple:

import pymc

r_obj = pymc.Uniform('r_obj', 0., tot_data*5., value=r_obj0)
r_sky = pymc.Uniform('r_sky', 0., tot_data*5., value=r_sky0)

@pymc.deterministic
def r_tot(r_obj=r_obj, r_sky=r_sky):
    return r_obj + r_sky

# Observed background-only counts (sky_data) at the source position.
d_sky = pymc.Poisson('d_sky', r_sky, value=sky_data, observed=True)
# Observed total (source + sky) counts.
d_tot = pymc.Poisson('d_tot', r_tot, value=tot_data, observed=True)

M = pymc.MCMC([r_obj, r_sky, r_tot, d_sky, d_tot])
M.sample(100000, 10000, 10)

where tot_data is the observed source plus sky counts at the source
position and sky_data is the observed background-only counts. This model works perfectly.
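
For reference, the 95% HPD intervals I discuss below can be read off the trace with something like:

    # 95% HPD interval for the object rate (alpha = 0.05 gives 95%).
    hpd_95 = pymc.utils.hpd(M.trace('r_obj')[:], 0.05)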

The mystery arises when I try to model a bunch (several hundred)
sources simultaneously. As I add to the list of sources that I'm
modeling, the 95% HPD interval for the first source in the list
shrinks, even though, as mentioned above, the source fluxes and
background fluxes at each source position are independent and modeled
as such. A figure illustrating the shrinkage of the HPD interval is
here:

http://dl.getdropbox.com/u/647515/PyMC/FUVFlux.png

The solid red line in the figure is the flux of the first source, and
the dashed red lines are the +/- 2 sigma range computed from the
square root of the counts (Poisson statistics; the error from the
background counts can be neglected for this object). When there are
~40 or fewer objects in the sample, the 95% HPD interval, shown by the
blue error bars, is correct, but the interval shrinks when there are
more than 40 objects in the sample.
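
(In count units, the dashed band is just the usual Poisson approximation:

    import numpy as np

    # +/- 2 sigma band from Poisson statistics: sigma = sqrt(N)
    sigma = np.sqrt(tot_data)
    lower, upper = tot_data - 2.*sigma, tot_data + 2.*sigma

)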

A script to reproduce my results is here:

http://dl.getdropbox.com/u/647515/PyMC/FUVFlux.py

I'm using Python 2.5 on a Mac with Numpy 1.3 and an SVN version of
PyMC from about a month ago.

Does anyone have any thoughts about what might be going wrong?

Thanks,

Todd

Todd

Aug 18, 2009, 11:41:12 AM
to PyMC
Hi,

I just wanted to bump this. Please let me know if my first post
wasn't clear.

I've run my model in JAGS (model description appended below) with the
exact same data and do not see the same shrinkage of the 95% HPD
interval as the number of objects is increased.

Thanks,

Todd

JAGS model:

model {
    for (i in 1:n.data) {
        FUV.obj.rate[i] ~ dunif(0., FUV.obj.rate0[i]*5.)
        FUV.bkg.rate[i] ~ dunif(0., FUV.bkg.rate0[i]*5.)

        FUV.tot.rate[i] <- FUV.obj.rate[i]*10^(-0.4*(FUV.ext.corr[i] + FUV.aper.corr)) + FUV.bkg.rate[i]

        FUV.bkg.counts[i] ~ dpois(FUV.bkg.rate[i]*FUV.weight[i])
        FUV.tot.counts[i] ~ dpois(FUV.tot.rate[i]*FUV.weight[i])
    }
}

David Huard

Aug 18, 2009, 1:08:22 PM
to py...@googlegroups.com
Hi Todd,

Sorry it took so long...

You are creating variables that all have the same name:

for i1 in np.arange(1, 135, 5):

    FUV_obj_rate = pymc.Uniform('FUV_obj_rate', 0., FUV_obj_rate0[:i1]*5., value=FUV_obj_rate0[:i1])
    FUV_bkg_rate = pymc.Uniform('FUV_bkg_rate', 0., FUV_bkg_rate0[:i1]*5., value=FUV_bkg_rate0[:i1])


The sampler itself can handle this, since it cares only about the objects' references, not their names, but the backends that store the traces index the data by name. My guess is that this is the problem. Try again, giving each node a unique name, e.g.

    FUV_obj_rate = pymc.Uniform('FUV_obj_rate_%d' % i1, 0., FUV_obj_rate0[:i1]*5., value=FUV_obj_rate0[:i1])

HTH,

David

David Huard

Aug 18, 2009, 3:39:51 PM
to py...@googlegroups.com
Todd

I spoke too fast; my answer was off-topic and not relevant. I'll look at it again.

David

Todd Small

Aug 18, 2009, 4:14:08 PM
to py...@googlegroups.com
Hi David,

I appreciate you taking the time to have a look at the problem I've come up against.

Todd
 

David Huard

Aug 18, 2009, 4:27:20 PM
to py...@googlegroups.com
Todd,

As you add parameters to your sampling space, it takes longer to explore the multidimensional posterior. In some of the problems I looked at a couple of years ago, with around 200 variables, I took 2e6 samples to be on the safe side. I ran your script with 200k iterations and a 50k burn-in, and the results show some improvement, in the sense that the declining slope starts at a higher number of objects. You'll probably need more to get good estimates for your 135 objects.
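
For instance, reusing the MCMC object from your script:

    # Same model; just a longer chain and a longer burn-in.
    M.sample(iter=200000, burn=50000, thin=10)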

You might want to provide the AdaptiveMetropolis sampler with a good initial guess for the covariance matrix. The default values might be far from ideal, and you could spend a long time tuning the covariance, wasting iterations in a transient.
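
Something along these lines, with the stochastics from your script (the diagonal Poisson-scale guess here is only an assumption; plug in whatever scale estimates you trust):

    import numpy as np

    # Rough diagonal covariance guess: for Poisson rates, variance ~ rate.
    cov_guess = np.diag(np.concatenate([FUV_obj_rate0[:i1], FUV_bkg_rate0[:i1]]))

    M.use_step_method(pymc.AdaptiveMetropolis,
                      [FUV_obj_rate, FUV_bkg_rate],
                      cov=cov_guess)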

You might also want to try the Metropolis sampler instead of the AdaptiveMetropolis sampler, or to split your objects into blocks of 40 and sample each block with its own AdaptiveMetropolis instance (see the sketch below). If you run some experiments along these lines, I think a lot of people would be interested to see which solution performs best, so don't be shy about adding an entry on the wiki.
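
The block idea, roughly (obj_rates and bkg_rates here are hypothetical lists holding one stochastic per object):

    # One AdaptiveMetropolis instance per block of ~40 objects.
    block_size = 40
    for start in range(0, len(obj_rates), block_size):
        block = (obj_rates[start:start + block_size]
                 + bkg_rates[start:start + block_size])
        M.use_step_method(pymc.AdaptiveMetropolis, block)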

HTH,

David

Todd Small

Aug 18, 2009, 4:36:53 PM
to py...@googlegroups.com
Hi David,

Thanks for your comments!  I'll try running longer chains and your other tips and report back.

It's curious that I didn't see any reduction in the 95% HPD interval using JAGS, even though my JAGS chain was only 100,000 iterations.  I'll have to check what samplers JAGS chose to use...

Todd
 

Todd Small

Aug 19, 2009, 10:57:24 AM
to py...@googlegroups.com
Hi David,

I ran my script with 1,000,000 iterations and 250,000 burn-in, and my problems go away.

I hadn't realized that the default sampler was AdaptiveMetropolis rather than plain ol' Metropolis.  I'll test out the Metropolis sampler next.

Thanks again for your help!

Todd

David Huard

Aug 19, 2009, 11:03:45 AM
to py...@googlegroups.com
On Wed, Aug 19, 2009 at 10:57 AM, Todd Small <todd_...@mac.com> wrote:
> Hi David,
>
> I ran my script with 1,000,000 iterations and 250,000 burn-in, and my problems go away.

Good! Each time a problem springs up on the list, I always fear that it's a bug in the code, and it really feels good when it's not the case!

> I hadn't realized that the default sampler was AdaptiveMetropolis rather than plain ol' Metropolis.  I'll test out the Metropolis sampler next.
>
> Thanks again for your help!

You're welcome,

David

Todd Small

Aug 21, 2009, 11:11:00 AM
to py...@googlegroups.com
Hi David,

I wanted to let you know that I tested out your suggestion of using the Metropolis sampler instead of the AdaptiveMetropolis sampler.  The Metropolis sampler does *not* show the shrinkage of the 95% HPD interval, even with a mere 100,000 iterations.
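
For anyone who finds this later, forcing plain Metropolis looks roughly like this (a sketch; FUV_obj_rate and FUV_bkg_rate are the stochastics from my script):

    for stoch in [FUV_obj_rate, FUV_bkg_rate]:
        M.use_step_method(pymc.Metropolis, stoch)
    M.sample(100000, 10000, 10)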

Thanks,

Todd

Anand Patil

Aug 21, 2009, 11:27:04 AM
to py...@googlegroups.com
Sounds like this might be another instance of issue 269?
http://code.google.com/p/pymc/issues/detail?id=269

Anand