Standard errors for Poisson regression

112 views
Skip to first unread message

Thomas Haslwanter

unread,
May 15, 2013, 9:47:59 AM5/15/13
to pystat...@googlegroups.com
By now I have a guilty feeling, costing you guys so much time in going through my comments. Please tell me when to stop!

Problem
This time it is the standard errors for Poisson regression that don't match up. The fitted parameters, as well as the "fittedvalues" are pretty close (I guess that the mis-matches are just differences in when to stop the iteration). But the standard errors are WAY out.

Parameters:
Expected:
2.376, -0.198, 1.441, -0.308

from Python/statsmodels (code see below):
Intercept   -10.727205
agecat        2.353706
agesq        -0.196405
smoke         1.397584
smkage       -0.295187


Standard Errors:
Expected:
0.208, 0.027, 0.372, 0.097

from Python/statsmodels:
Intercept    54.855423
agecat       27.273075
agesq         3.464816
smoke        34.776545
smkage        7.983272


Code
R-code:
res.britdoc <- glm(deaths~agecat + agesq + smoke + smkage + offset(log(personyears)), family=poisson, data=britdoc)

Stata code:
.poisson deaths agecat agesq smoke smkage, exposure(personyears)
.glm deaths agecat agesq smoke smkage, family(poisson) lin(log) lnoffset(personyears)

Python/statsmodels code: (since I did not know how to include an offset, I moved it over to the endog variable)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import urllib2
import zipfile
from StringIO import StringIO

def poisson_regression_tbd():
    '''Poisson Regression
    chapter 9.2, p.170 & 171 '''
   
    inFile = r"GLM_data/Table 9.1 British doctors' smoking and coronary death.xls"
    df = get_data(inFile)   
    print df

    # Generate the required variables
    df['smoke'] = np.zeros(len(df))
    df['smoke'][df['smoking']=='smoker']=1

    df['agecat'] = np.array([1,2,3,4,5,1,2,3,4,5])
    df['agesq'] = df['agecat']**2

    df['smkage'] = df['agecat']
    df['smkage'][df['smoking']=='non-smoker']=0

    df['death_py'] = df['deaths']/df['person-years']

    model = sm.GLM.from_formula('death_py~agecat+agesq+smoke+smkage', family=sm.families.Poisson(), data=df).fit()
    print model.summary()

As the last time, you can either take the data from

def get_data(inFile):
    '''Get data from original Excel-file'''
   
    # get the zip-archive
    url = 'http://cdn.crcpress.com/downloads/C9500/GLM_data.zip'
    GLM_archive = urllib2.urlopen(url).read()
   
    # extract the requested file from the archive, as a pandas XLS-file
   
    zipdata = StringIO()
    zipdata.write(GLM_archive)
    myzipfile = zipfile.ZipFile(zipdata)
    xlsfile = myzipfile.open(inFile)
    xls = pd.ExcelFile(xlsfile)
    df = xls.parse('Sheet1', skiprows=2)   
   
    return df

or you can enter them directly (somehow) from

        age     smoking  deaths  person-years
0  35 to 44      smoker      32         52407
1  45 to 54      smoker     104         43248
2  55 to 64      smoker     206         28612
3  65 to 74      smoker     186         12663
4  75 to 84      smoker     102          5317
5  35 to 44  non-smoker       2         18790
6  45 to 54  non-smoker      12         10673
7  55 to 64  non-smoker      28          5710
8  65 to 74  non-smoker      28          2585
9  75 to 84  non-smoker      31          1462

josef...@gmail.com

unread,
May 15, 2013, 10:08:26 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:47 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> By now I have a guilty feeling, costing you guys so much time in going
> through my comments. Please tell me when to stop!

Please don't stop

This is a good usage code review and we don't get enough of those.
Whenever I did this on some parts of statsmodels, I always found
details that I would like to change to make it more "obvious" or easy.
I don't know what impact this has on the model.

adding offset as a keyword should work

offset=df['person-years']

we also have ``exposure`` which is offset with taking the log internally
Can you try also discrete Poisson to see whether you get the same numbers?

I don't have a clue yet, candidates: treatment of offset, near singular exog

Josef

Skipper Seabold

unread,
May 15, 2013, 10:12:41 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:47 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
It would help if you could also post the data prep steps for R/Stata,
so I don't have to do them.

Skipper

Skipper Seabold

unread,
May 15, 2013, 10:22:05 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:47 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> By now I have a guilty feeling, costing you guys so much time in going
> through my comments. Please tell me when to stop!
>
> Problem
> This time it is the standard errors for Poisson regression that don't match
> up. The fitted parameters, as well as the "fittedvalues" are pretty close (I
> guess that the mis-matches are just differences in when to stop the
> iteration). But the standard errors are WAY out.

I get the same standard errors as Stata. Stata code

insheet using /home/skipper/scratch/try_poisson.csv
encode age, gen(agecat)
gen agesq = agecat^2
gen smkage=0
gen smoke = 0
replace smkage = agecat if smoking=="smoker"
replace smoke = 1 if smoking=="smoker"
glm deaths agecat agesq smoke smkage, family(poisson) lin(log)
lnoffset(personyears)

Statsmodels Code

df['smoke'] = np.zeros(len(df))
df['smoke'][df['smoking']=='smoker']=1

df['agecat'] = np.array([1,2,3,4,5,1,2,3,4,5])
df['agesq'] = df['agecat']**2

df['smkage'] = df['agecat']
df['smkage'][df['smoking']=='non-smoker']=0

from statsmodels.formula.api import glm
model = glm('deaths~agecat+agesq+smoke+smkage',
family=sm.families.Poisson(), data=df,
exposure=df["person-years"]).fit()
print model.summary()

[28]: model.bse
[28]:
Intercept 0.450077
agecat 0.207949
agesq 0.027367
smoke 0.372199
smkage 0.097041
dtype: float64


hth,

Skipper

Skipper Seabold

unread,
May 15, 2013, 10:30:30 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:47 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> or you can enter them directly (somehow) from
>
> age smoking deaths person-years
> 0 35 to 44 smoker 32 52407
> 1 45 to 54 smoker 104 43248
> 2 55 to 64 smoker 206 28612
> 3 65 to 74 smoker 186 12663
> 4 75 to 84 smoker 102 5317
> 5 35 to 44 non-smoker 2 18790
> 6 45 to 54 non-smoker 12 10673
> 7 55 to 64 non-smoker 28 5710
> 8 65 to 74 non-smoker 28 2585
> 9 75 to 84 non-smoker 31 1462

copy-paste

df = pandas.read_clipboard(index_col=0, sep=" +")

Skipper

josef...@gmail.com

unread,
May 15, 2013, 10:48:00 AM5/15/13
to pystat...@googlegroups.com
cool, that's very useful

5 minutes saved for each SAS example

Josef

>
> Skipper

josef...@gmail.com

unread,
May 15, 2013, 11:51:26 AM5/15/13
to pystat...@googlegroups.com
we cannot rescale a poisson estimation like this with offset

as an estimation model it's not scale invariant, for example bse
change a lot with rescaling
df['death_py'] = df['deaths'] * 1./df['person-years'] *
df['person-years'].mean()

The Poisson process itself can be scaled with the exposure (the
expected value, parameter),
but we cannot rescale the observations (as in a linear model)

Josef

>
> Josef

Thomas Haslwanter

unread,
May 15, 2013, 1:16:01 PM5/15/13
to pystat...@googlegroups.com
I don't have Stata available. As for R, I have just done enough of it to decide that I would rather do my stats in Python ;)
Anyway, I'll see if I get it going in R.
Reply all
Reply to author
Forward
Message has been deleted
0 new messages