I have a GLM problem involving binned counts of a rarely occurring
event, in small windows: the dependent variable (Y) is therefore
predominantly 0.
Using statsmodels.GLM for the log-linked Poisson family model, I get the error:
"The first guess on the deviance function returned a nan. This could
be a boundary problem and should be reported."
So I'm reporting it. The problem seems to arise in the Poisson
family's deviance function, which takes the log of Y, generating nans.
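For illustration, here is a minimal numpy/scipy sketch (my own toy
code, not the statsmodels internals) of where the nan can come from:
the Poisson deviance contains a term y*log(y/mu), and a naive
evaluation at y = 0 gives 0 * log(0) = nan, whereas
scipy.special.xlogy handles that limit correctly.

```python
import numpy as np
from scipy.special import xlogy

y = np.array([0.0, 1.0, 3.0])   # zero-heavy counts
mu = np.array([0.5, 0.5, 2.0])  # some fitted means

# naive Poisson deviance term: 0 * log(0) -> nan at y == 0
with np.errstate(divide="ignore", invalid="ignore"):
    naive = 2 * (y * np.log(y / mu) - (y - mu))
print(naive)  # first entry is nan

# xlogy(0, anything) is defined as 0, so the y == 0 limit is handled
safe = 2 * (xlogy(y, y / mu) - (y - mu))
print(safe)   # all entries finite
```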
As an additional data point, I'm trying to port some similar MATLAB
code that used their glmfit routine, and I am told that it handles
this sort of data appropriately.
What's my next step?
Many thanks,
Angus.
--
AJC McMorland
Post-doctoral research fellow
Neurobiology, University of Pittsburgh
Thanks for reporting. I think I know what the issue is. Can you file
a bug ticket and/or post here some example data and code, so I can
replicate the problem.
https://bugs.launchpad.net/statsmodels
Skipper
Until the fix is committed, the workaround is to keep the values
slightly larger than zero.
MLE in discretemod gives exactly the same parameter estimate, but I
don't understand the difference between beta in the simulation and the
estimated parameters. Aren't they supposed to be the same (estimate
and true)? I haven't checked the source.
Also, I'm surprised GLM and discretemod agree to such a high degree,
at least 7 decimals.
Josef
-------------------
import numpy as np
import scikits.statsmodels as sm

size = int(1e5)
nbeta = 3
np.random.seed(987654)
beta = np.random.rand(nbeta)
print(beta)
x = np.random.rand(size, nbeta)
# the mean is np.dot(x, beta) itself, i.e. an identity link
y = np.random.poisson(np.dot(x, beta))
# work around the zero-count deviance nan by clipping away from 0
yc = np.clip(y, 1e-14, np.inf)
fam = sm.families.Poisson()
glm = sm.GLM(yc, x, family=fam)
res = glm.fit()
print(res.params)
# compare with MLE
resdm = sm.discretemod.Poisson(y, x).fit()
print(resdm.params)
''' output:
[ 0.18442322 0.43138522 0.69510932]
[-0.52190887 -0.22194198 0.1307167 ]
Optimization terminated successfully.
Current function value: 110271.945166
Iterations 5
[-0.52190887 -0.22194198 0.1307167 ]
'''
Josef
>
> Skipper
>
Poisson should operate on integer count data, though. I committed a
fix; check it and see whether it looks okay. It agrees with the MLE
Poisson.
> MLE in discretemod gives exactly the same parameter estimate, but I
> don't understand the difference between beta in the simulation and the
> estimated parameters. Aren't they supposed to be the same (estimate
> and true)? I haven't checked the source.
>
They are the same if you change
fam = sm.families.Poisson()
to
fam = sm.families.Poisson(link=sm.families.links.identity)
because Y was created with the identity link, ie.,
y = np.random.poisson(np.dot(x, beta))
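To check the link story end to end, here is a hedged sketch (plain
numpy/scipy direct MLE, not the statsmodels fit; all names are my
own): simulate with the log link, i.e. mean exp(x @ beta), maximize
the Poisson log-likelihood, and the estimates recover the true beta.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 50_000, 3
beta_true = np.array([0.2, 0.4, 0.7])
x = rng.random((n, k))
# simulate with the log link: E[y|x] = exp(x @ beta)
y = rng.poisson(np.exp(x @ beta_true))

def negloglike(b):
    eta = x @ b
    # Poisson log-likelihood up to the log(y!) constant
    return -(y @ eta - np.exp(eta).sum())

res = minimize(negloglike, np.zeros(k), method="BFGS")
print(res.x)  # close to beta_true, up to sampling error
```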
> Also, I'm surprised GLM and discretemod agree to such a high degree,
> at least 7 decimals.
>
As far as GLM vs MLE, my understanding is that IRLS is a poor man's
MLE, used before general-purpose solvers were everywhere and before
computing power caught up. I think it's one of the reasons GLM isn't
in the econometrics textbooks, though that's just a guess. All of the
references that I used for GLM said that if an MLE is available then
it should be preferred.
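A rough sketch of the IRLS idea for the log-link Poisson (hypothetical
toy code, not the statsmodels implementation): each iteration solves a
weighted least squares problem on a linearized "working response", and
for this canonical link that is exactly Fisher scoring, which is why
it lands on the same estimates as a direct MLE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20_000, 3
beta_true = np.array([0.3, -0.2, 0.5])
x = rng.random((n, k))
y = rng.poisson(np.exp(x @ beta_true))

# IRLS for Poisson with log link: weights w = mu, working response
# z = eta + (y - mu) / mu; each step is a weighted least squares solve
beta = np.zeros(k)
for _ in range(25):
    eta = x @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu
    w = mu
    beta_new = np.linalg.solve(x.T @ (w[:, None] * x), x.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new
print(beta)  # agrees with the direct MLE up to sampling error
```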
Skipper
Looks good.
If the masked arrays only show up once, they could be put inline, but
it doesn't really make much of a difference:
376 YmuMasked = Ymu[mask]
377 Ymasked = Y[mask]
378 np.putmask(retarr, mask, Ymasked*np.log(YmuMasked)/scale)
np.putmask(retarr, mask, Y[mask]*np.log(Ymu[mask])/scale)
>
>> MLE in discretemod gives exactly the same parameter estimate, but I
>> don't understand the difference between beta in the simulation and the
>> estimated parameters. Aren't they supposed to be the same (estimate
>> and true)? I haven't checked the source.
>>
>
> They are the same if you change
>
> fam = sm.families.Poisson()
>
> to
>
> fam = sm.families.Poisson(link=sm.families.links.identity)
>
> because Y was created with the identity link, ie.,
>
> y = np.random.poisson(np.dot(x, beta))
thanks, I still don't know or remember these details.
>
>> Also, I'm surprised GLM and discretemod agree to such a high degree,
>> at least 7 decimals.
>>
>
> As far as GLM vs MLE, my understanding is that IRLS is a poor man's
> MLE, used before solvers were everywhere and computing power was
> better. I think it's one of the reasons GLM isn't in the econometrics
> textbooks, though that's just a guess. All of the references that I
> used for GLM said that if a MLE is available then it should be
> preferred.
I thought they might be only (asymptotically) equivalent.
If they are exactly the same, apart from using a different
optimization algorithm, then for the basic results there shouldn't be
a reason to choose one or the other. My vague impression is that GLM
is faster than MLE.
Josef
>
> Skipper
>
I don't think integers are enforced; for the Poisson case both GLM and
MLE work.
I checked the implementation in stats.poisson:
cdf uses floor, pmf is defined for real values
>>> stats.poisson._pmf(np.array([0,1,2]), 1.)
array([ 0.36787944, 0.36787944, 0.18393972])
>>> stats.poisson._pmf(np.array([0,1,2])+1e-5, 1.)
array([ 0.36788156, 0.36787789, 0.18393802])
>>> stats.poisson._pmf(np.array([0,1,2])+1e-15, 1.)
array([ 0.36787944, 0.36787944, 0.18393972])
So they look robust to adding small positive noise.
Josef