I have a GLM problem involving binned counts of a rarely occurring
event, in small windows: the dependent variable (Y) is therefore
predominantly 0.
Using statsmodels.GLM for the log-linked Poisson family model, I get the error:
"The first guess on the deviance function returned a nan. This could
be a boundary problem and should be reported."
So I'm reporting it. The problem seems to arise in the Poisson
family's deviance function, which takes the log of Y, generating nans.
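For illustration, here is a minimal numpy/scipy sketch (my own toy
code, not the statsmodels internals) of where the nan can come from:
the Poisson deviance contains a term y*log(y/mu), and a naive
evaluation at y = 0 gives 0 * log(0) = nan, whereas
scipy.special.xlogy handles that limit correctly.

```python
import numpy as np
from scipy.special import xlogy

y = np.array([0.0, 1.0, 3.0])   # zero-heavy counts
mu = np.array([0.5, 0.5, 2.0])  # some fitted means

# naive Poisson deviance term: 0 * log(0) -> nan at y == 0
with np.errstate(divide="ignore", invalid="ignore"):
    naive = 2 * (y * np.log(y / mu) - (y - mu))
print(naive)  # first entry is nan

# xlogy(0, anything) is defined as 0, so the y == 0 limit is handled
safe = 2 * (xlogy(y, y / mu) - (y - mu))
print(safe)   # all entries finite
```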
As an additional data point, I'm trying to port some similar MATLAB
code that used their glmfit routine, and I am told that it handles
this sort of data appropriately.
What's my next step?
Many thanks,
Angus.
--
AJC McMorland
Post-doctoral research fellow
Neurobiology, University of Pittsburgh
Thanks for reporting. I think I know what the issue is. Can you file
a bug ticket and/or post here some example data and code, so I can
replicate the problem.
https://bugs.launchpad.net/statsmodels
Skipper
Until the fix is committed, the workaround is to keep the values
slightly larger than zero.
MLE in discretemod gives exactly the same parameter estimate, but I
don't understand the difference between beta in the simulation and the
estimated parameters. Aren't they supposed to be the same (estimate
and true)? I haven't checked the source.
Also, I'm surprised GLM and discretemod agree to such a high degree,
at least 7 decimals.
Josef
-------------------
import numpy as np
import scikits.statsmodels as sm

size = int(1e5)
nbeta = 3
np.random.seed(987654)
beta = np.random.rand(nbeta)
print(beta)
x = np.random.rand(size, nbeta)
# the mean is np.dot(x, beta) itself, i.e. an identity link
y = np.random.poisson(np.dot(x, beta))
# work around the zero-count deviance nan by clipping away from 0
yc = np.clip(y, 1e-14, np.inf)
fam = sm.families.Poisson()
glm = sm.GLM(yc, x, family=fam)
res = glm.fit()
print(res.params)
# compare with MLE
resdm = sm.discretemod.Poisson(y, x).fit()
print(resdm.params)
''' output:
[ 0.18442322 0.43138522 0.69510932]
[-0.52190887 -0.22194198 0.1307167 ]
Optimization terminated successfully.
Current function value: 110271.945166
Iterations 5
[-0.52190887 -0.22194198 0.1307167 ]
'''
Josef
>
> Skipper
>
Poisson should operate on integer count data, though. I committed a
fix; check it and see whether it looks okay. It agrees with the MLE
Poisson.
> MLE in discretemod gives exactly the same parameter estimate, but I
> don't understand the difference between beta in the simulation and the
> estimated parameters. Aren't they supposed to be the same (estimate
> and true)? I haven't checked the source.
>
They are the same if you change
fam = sm.families.Poisson()
to
fam = sm.families.Poisson(link=sm.families.links.identity)
because Y was created with the identity link, ie.,
y = np.random.poisson(np.dot(x, beta))
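To check the link story end to end, here is a hedged sketch (plain
numpy/scipy direct MLE, not the statsmodels fit; all names are my
own): simulate with the log link, i.e. mean exp(x @ beta), maximize
the Poisson log-likelihood, and the estimates recover the true beta.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 50_000, 3
beta_true = np.array([0.2, 0.4, 0.7])
x = rng.random((n, k))
# simulate with the log link: E[y|x] = exp(x @ beta)
y = rng.poisson(np.exp(x @ beta_true))

def negloglike(b):
    eta = x @ b
    # Poisson log-likelihood up to the log(y!) constant
    return -(y @ eta - np.exp(eta).sum())

res = minimize(negloglike, np.zeros(k), method="BFGS")
print(res.x)  # close to beta_true, up to sampling error
```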
> Also, I'm surprised GLM and discretemod agree to such a high degree,
> at least 7 decimals.
>
As far as GLM vs MLE, my understanding is that IRLS is a poor man's
MLE, used before general-purpose solvers were everywhere and before
computing power caught up. I think it's one of the reasons GLM isn't
in the econometrics textbooks, though that's just a guess. All of the
references that I used for GLM said that if an MLE is available then
it should be preferred.
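A rough sketch of the IRLS idea for the log-link Poisson (hypothetical
toy code, not the statsmodels implementation): each iteration solves a
weighted least squares problem on a linearized "working response", and
for this canonical link that is exactly Fisher scoring, which is why
it lands on the same estimates as a direct MLE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20_000, 3
beta_true = np.array([0.3, -0.2, 0.5])
x = rng.random((n, k))
y = rng.poisson(np.exp(x @ beta_true))

# IRLS for Poisson with log link: weights w = mu, working response
# z = eta + (y - mu) / mu; each step is a weighted least squares solve
beta = np.zeros(k)
for _ in range(25):
    eta = x @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu
    w = mu
    beta_new = np.linalg.solve(x.T @ (w[:, None] * x), x.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new
print(beta)  # agrees with the direct MLE up to sampling error
```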
Skipper
Looks good.
If the masked arrays only show up once, they could be put inline, but
it doesn't really make much of a difference:
376 YmuMasked = Ymu[mask]
377 Ymasked = Y[mask]
378 np.putmask(retarr, mask, Ymasked*np.log(YmuMasked)/scale)
np.putmask(retarr, mask, Y[mask]*np.log(Ymu[mask])/scale)
>
>> MLE in discretemod gives exactly the same parameter estimate, but I
>> don't understand the difference between beta in the simulation and the
>> estimated parameters. Aren't they supposed to be the same (estimate
>> and true)? I haven't checked the source.
>>
>
> They are the same if you change
>
> fam = sm.families.Poisson()
>
> to
>
> fam = sm.families.Poisson(link=sm.families.links.identity)
>
> because Y was created with the identity link, ie.,
>
> y = np.random.poisson(np.dot(x, beta))
thanks, I still don't know or remember these details.
>
>> Also, I'm surprised GLM and discretemod agree to such a high degree,
>> at least 7 decimals.
>>
>
> As far as GLM vs MLE, my understanding is that IRLS is a poor man's
> MLE, used before solvers were everywhere and computing power was
> better. I think it's one of the reasons GLM isn't in the econometrics
> textbooks, though that's just a guess. All of the references that I
> used for GLM said that if a MLE is available then it should be
> preferred.
I thought they might be only (asymptotically) equivalent.
If they are exactly the same, apart from using a different
optimization algorithm, then for the basic results there shouldn't be
a reason to choose one or the other. My vague impression is that GLM
is faster than MLE.
Josef
>
> Skipper
>
I don't think integers are enforced; for the Poisson case both GLM and
MLE work.
I checked the implementation in stats.poisson:
cdf uses floor, pmf is defined for real values
>>> stats.poisson._pmf(np.array([0,1,2]), 1.)
array([ 0.36787944, 0.36787944, 0.18393972])
>>> stats.poisson._pmf(np.array([0,1,2])+1e-5, 1.)
array([ 0.36788156, 0.36787789, 0.18393802])
>>> stats.poisson._pmf(np.array([0,1,2])+1e-15, 1.)
array([ 0.36787944, 0.36787944, 0.18393972])
So they look robust to adding small positive noise.
Josef