Offset for Negative Binomial GLM


burakb

Mar 9, 2015, 9:59:20 AM
to pystat...@googlegroups.com
Hi everyone,

Is it possible to use a negative binomial GLM through the formula API and still use offsets?

I am looking for the equivalent of this R code:

library(MASS)
titanicgrp <- read.csv("titanicgrp.csv")
attach (titanicgrp)
lncases <-log(cases)
reslm <- glm.nb(survive ~ age + sex + factor(class) + offset(lncases), data = titanicgrp)

I started with

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
model=smf.glm("...", data=df, family=sm.families.NegativeBinomial()).fit()

and tried to add an offset parameter to glm(...), which did not work.

Any ideas?

Thanks,

josef...@gmail.com

Mar 9, 2015, 10:07:54 AM
to pystatsmodels
What's the error message?
It should work; otherwise there's a bug in from_formula (which is unlikely).

(I got a "hard drive is about to fail" warning, and won't be able to
check for some time.)

Josef



burakb

Mar 9, 2015, 10:57:10 AM
to pystat...@googlegroups.com
Hi again,

import pandas as pd, numpy as np

import statsmodels.formula.api as smf
import statsmodels.api as sm

df = pd.read_csv("titanicgrp.csv",sep=',',index_col=0)
df['lncases'] = df['cases'].map(lambda x:np.log(x))
print df

model=smf.glm("survive ~ age + sex", data=df, offset='lncases',
             family=sm.families.NegativeBinomial()).fit()
print(model.summary())

This works fine when I take out the offset parameter. The data is:

"","survive","cases","age","sex","class"
"1",1,1,0,0,1
"2",13,13,0,0,2
"3",14,31,0,0,3
"4",5,5,0,1,1
"5",11,11,0,1,2
"6",13,48,0,1,3
"7",140,144,1,0,1
"8",80,93,1,0,2
"9",76,165,1,0,3
"10",57,175,1,1,1
"11",14,168,1,1,2
"12",75,462,1,1,3


Thanks,

burakb

Mar 9, 2015, 10:57:44 AM
to pystat...@googlegroups.com
Oh, and the error message is:

Traceback (most recent call last):
  File "t.py", line 10, in <module>
    family=sm.families.NegativeBinomial()).fit()
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.7.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_linear_model.py", line 731, in fit
    cov_kwds=cov_kwds, use_t=use_t, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.7.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_linear_model.py", line 821, in _fit_irls
    - self._offset_exposure)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Compilation exited abnormally with code 1 at Mon Mar  9 15:57:22




josef...@gmail.com

Mar 9, 2015, 11:13:07 AM
to pystatsmodels
On Mon, Mar 9, 2015 at 10:57 AM, burakb <burakb...@gmail.com> wrote:
> Hi again,
>
> import pandas as pd, numpy as np
> import statsmodels.formula.api as smf
> import statsmodels.api as sm
>
> df = pd.read_csv("titanicgrp.csv",sep=',',index_col=0)
> df['lncases'] = df['cases'].map(lambda x:np.log(x))
> print df
>
> model=smf.glm("survive ~ age + sex", data=df, offset='lncases',
> family=sm.families.NegativeBinomial()).fit()
> print(model.summary())

Which version of statsmodels are you using?

Try passing a Series or ndarray for offset; I don't remember since which version names are allowed for offset and similar arguments, or whether that was already added to all models.
The offset and exposure handling has changed recently.

offset=df['lncases']
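
i.e. in the full call, something like this (a sketch only; I can't check anything right now):

model = smf.glm("survive ~ age + sex", data=df, offset=df['lncases'],
                family=sm.families.NegativeBinomial()).fit()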

Josef

burakb

Mar 9, 2015, 11:21:30 AM
to pystat...@googlegroups.com
Worked! Final code and data:


import pandas as pd, numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

df = pd.read_csv("titanicgrp.csv",sep=',',index_col=0)
df['lncases'] = df['cases'].map(lambda x:np.log(x))
print df

model=smf.glm("survive ~ age + sex + C(whichclass)", data=df, offset=df['lncases'],
             family=sm.families.NegativeBinomial()).fit()
print(model.summary())


"","survive","cases","age","sex","whichclass"

"1",1,1,0,0,1
"2",13,13,0,0,2
"3",14,31,0,0,3
"4",5,5,0,1,1
"5",11,11,0,1,2
"6",13,48,0,1,3
"7",140,144,1,0,1
"8",80,93,1,0,2
"9",76,165,1,0,3
"10",57,175,1,1,1
"11",14,168,1,1,2
"12",75,462,1,1,3

The result is

                 Generalized Linear Model Regression Results                 
==============================================================================
Dep. Variable:                survive   No. Observations:                   12
Model:                            GLM   Df Residuals:                        7
Model Family:        NegativeBinomial   Df Model:                            4
Link Function:                    log   Scale:                  0.222676200695
Method:                          IRLS   Log-Likelihood:                -50.130
Date:                Mon, 09 Mar 2015   Deviance:                       1.9976
Time:                        16:18:17   Pearson chi2:                     1.56
No. Iterations:                    13                                        
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.5197      0.340      1.527      0.127        -0.147     1.187
C(whichclass)[T.2]    -0.2573      0.354     -0.728      0.467        -0.950     0.436
C(whichclass)[T.3]    -0.9164      0.352     -2.605      0.009        -1.606    -0.227
age                   -0.6795      0.286     -2.380      0.017        -1.239    -0.120
sex                   -0.8033      0.284     -2.825      0.005        -1.361    -0.246
======================================================================================

which is pretty similar to the R output:

Deviance Residuals:
     Min        1Q    Median        3Q       Max 
-2.43070  -0.57657   0.00483   0.58594   1.67408 

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)      0.6134     0.3141   1.953  0.05085 . 
age             -0.6700     0.2472  -2.711  0.00671 **
sex             -0.9801     0.2301  -4.259 2.06e-05 ***
factor(class)2  -0.3746     0.2967  -1.262  0.20680   
factor(class)3  -0.9071     0.2906  -3.122  0.00180 **
---

(Dispersion parameter for Negative Binomial(9.6122) family taken to be 1)

    Null deviance: 41.065  on 11  degrees of freedom
Residual deviance: 12.479  on  7  degrees of freedom
AIC: 99.434

Number of Fisher Scoring iterations: 1


              Theta:  9.61
          Std. Err.:  5.92

 2 x log-likelihood:  -87.434

burakb

Mar 9, 2015, 11:22:11 AM
to pystat...@googlegroups.com
I am on statsmodels 0.5.0

josef...@gmail.com

Mar 9, 2015, 11:42:46 AM
to pystatsmodels
This is not close enough to the R output.
Check that you have the same fixed parameter for the NegativeBinomial family; GLM doesn't estimate it.

You can also compare with the discrete model NegativeBinomial, which estimates the dispersion parameter of the negative binomial.
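For example (an untested sketch, reusing the names from your script):

# GLM with the NB dispersion held fixed; the family default is alpha=1
fam = sm.families.NegativeBinomial(alpha=0.5)  # set alpha to whatever R used
res_glm = smf.glm("survive ~ age + sex + C(whichclass)", data=df,
                  offset=df['lncases'], family=fam).fit()

# the discrete NegativeBinomial model estimates the dispersion jointly
res_nb = smf.negativebinomial("survive ~ age + sex + C(whichclass)",
                              data=df, offset=df['lncases']).fit()
print(res_nb.summary())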


statsmodels 0.5 is already a bit old.
 
Josef

burakb

Mar 9, 2015, 11:47:08 AM
to pystat...@googlegroups.com
That was the wrong machine, my bad. I am on statsmodels 0.7.0.dev0+5dd2502

I will check the points you mentioned.

josef...@gmail.com

Mar 9, 2015, 3:44:48 PM
to pystatsmodels
On Mon, Mar 9, 2015 at 11:47 AM, burakb <burakb...@gmail.com> wrote:
That was the wrong machine, my bad. I am on statsmodels 0.7.0.dev0+5dd2502


String/name support for offset and exposure is not yet implemented in master for GLM.from_formula.

Only a few models, like GEE, currently have it.

Josef

burakb

Mar 9, 2015, 4:05:30 PM
to pystat...@googlegroups.com
Does it belong in the formula string, though? I think it is better this way, with both the data argument and the offset outside the formula.

In R they are also not in the string, but then R does not even use a string for the formula itself.

josef...@gmail.com

Mar 9, 2015, 10:34:51 PM
to pystatsmodels
On Mon, Mar 9, 2015 at 4:05 PM, burakb <burakb...@gmail.com> wrote:
Does it belong in the formula string, though? I think it is better this way, with both the data argument and the offset outside the formula.

So far we have decided to have only the main formula, "endog" and "exog", in the formula string.
No pipes or multi-equation or generalized formulas yet.

Explicit is better than ... hiding everything in a cryptic R-like formula. ("What does that second part of the formula mean again?")


Besides strings for extra arrays/data, we are also starting to have models with two formulas, but not one string with some kind of separator like `|`.
As far as I have seen, R packages are pretty inconsistent about extra arrays/formula components.

In R they are also not in the string, but then R does not even use a string for the formula itself.

We need strings to have valid Python syntax. (Different historical language origins, IIUC.)

Josef

burakb

Mar 10, 2015, 5:36:37 AM
to pystat...@googlegroups.com

I checked that the fixed parameter is the same; same output. Tried on 0.6.1 and 0.7-dev.

Can I use the discrete model from the formula API? This code is for a tutorial, so it must be "R-like" :)

Thanks for the help so far; at least I know how to use offsets now!

Data


"","survive","cases","age","sex","whichclass"
"1",1,1,0,0,1
"2",13,13,0,0,2
"3",14,31,0,0,3
"4",5,5,0,1,1
"5",11,11,0,1,2
"6",13,48,0,1,3
"7",140,144,1,0,1
"8",80,93,1,0,2
"9",76,165,1,0,3
"10",57,175,1,1,1
"11",14,168,1,1,2
"12",75,462,1,1,3


import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd, numpy as np
df = pd.read_csv("titanicgrp.csv",sep=',',index_col=0)
df['lncases'] = df['cases'].map(lambda x:np.log(x))
model=smf.glm("survive ~ age + sex + C(whichclass)", data=df, offset=df['lncases'],
             family=sm.families.NegativeBinomial()).fit()
print(model.summary())

For R


library(MASS)
titanicgrp <- read.csv("titanicgrp.csv")
attach (titanicgrp)
lncases <-log(cases)
reslm <- glm.nb(survive ~ age + sex + factor(whichclass) + offset(lncases), data = titanicgrp)

Python output


                 Generalized Linear Model Regression Results                 
==============================================================================
Dep. Variable:                survive   No. Observations:                   12
Model:                            GLM   Df Residuals:                        7
Model Family:        NegativeBinomial   Df Model:                            4
Link Function:                    log   Scale:                  0.222676200695
Method:                          IRLS   Log-Likelihood:                -50.130
Date:                Tue, 10 Mar 2015   Deviance:                       1.9976
Time:                        10:32:22   Pearson chi2:                     1.56
No. Iterations:                    13                                        
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.5197      0.340      1.527      0.127        -0.147     1.187
C(whichclass)[T.2]    -0.2573      0.354     -0.728      0.467        -0.950     0.436
C(whichclass)[T.3]    -0.9164      0.352     -2.605      0.009        -1.606    -0.227
age                   -0.6795      0.286     -2.380      0.017        -1.239    -0.120
sex                   -0.8033      0.284     -2.825      0.005        -1.361    -0.246
======================================================================================

R output

glm.nb(formula = as.formula(p1), data = titanicgrp, init.theta = 9.612198415,
    link = log)


Deviance Residuals:
     Min        1Q    Median        3Q       Max 
-2.43070  -0.57657   0.00483   0.58594   1.67408 

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)      0.6134     0.3141   1.953  0.05085 . 
age             -0.6700     0.2472  -2.711  0.00671 **
sex             -0.9801     0.2301  -4.259 2.06e-05 ***
factor(class)2  -0.3746     0.2967  -1.262  0.20680   
factor(class)3  -0.9071     0.2906  -3.122  0.00180 **
---

(Dispersion parameter for Negative Binomial(9.6122) family taken to be 1)

    Null deviance: 41.065  on 11  degrees of freedom
Residual deviance: 12.479  on  7  degrees of freedom
AIC: 99.434

Number of Fisher Scoring iterations: 1


              Theta:  9.61
          Std. Err.:  5.92




burakb

Mar 12, 2015, 4:45:12 AM
to pystat...@googlegroups.com
I meant to say that the output is still different from R's.

If there are any glaring mistakes in my statsmodels code, please do let me know. Otherwise, I need to go to rpy2 land (just for NB).

Cheers,

Kerby Shedden

Mar 12, 2015, 7:27:51 AM
to pystat...@googlegroups.com
Our alpha parameter is apparently the reciprocal of R's theta parameter.

If I run it like this, I get perfect agreement with R in terms of the coefficients.

model=smf.glm("survive ~ age + sex + C(whichclass)", data=df, offset=df['lncases'],
                           family=sm.families.NegativeBinomial(alpha=1/9.6122)).fit()

Apparently R is doing joint ML estimation of the theta parameter and the coefficients, so the standard errors don't line up if you do it this way. But if I run the R function like the following, I get perfect agreement for both estimates and SE's:

reslm <- glm(survive ~ age + sex + factor(whichclass) + offset(lncases),
             family=negative.binomial(theta=9.61), data = titanicgrp)

Kerby Shedden

Mar 12, 2015, 8:16:43 AM
to pystat...@googlegroups.com
Also note about halfway down this page

http://www.ats.ucla.edu/stat/r/dae/nbreg.htm

where they point out that R's parameterization for negative binomial is different from SAS, Stata, and SPSS (and us).
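
For reference, as I understand the two parameterizations, the variance functions are

    statsmodels (NB2):  Var(y) = mu + alpha * mu^2
    R / MASS glm.nb:    Var(y) = mu + mu^2 / theta

so the fits line up when alpha = 1/theta.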

josef...@gmail.com

Mar 12, 2015, 8:44:24 AM
to pystatsmodels
On Thu, Mar 12, 2015 at 8:16 AM, Kerby Shedden <kshe...@umich.edu> wrote:


My guess is that R's glm uses the expected information matrix, which differs in this case from the observed information matrix.
Using the Hessian with discrete.NegativeBinomial, I get the same (inverse) alpha estimate as R's glm, but slightly different standard errors; see below.
(And I found a bug in how start_params are calculated.)

Josef


mod_nb = smf.negativebinomial("survive ~ age + sex + C(whichclass)",
                              data=df, offset=df['lncases'])

# There is a bug in start_params with offset/exposure,
#res_nb = mod_nb.fit(method='lbfgs', start_params=np.concatenate((model.params, [0.5])))
res_nb = mod_nb.fit(start_params=np.concatenate((model.params, [0.5])))
print(res_nb.summary())


Optimization terminated successfully.
         Current function value: 3.643069
         Iterations: 12
         Function evaluations: 14
         Gradient evaluations: 14

                     NegativeBinomial Regression Results                      
==============================================================================
Dep. Variable:                survive   No. Observations:                   12
Model:               NegativeBinomial   Df Residuals:                        7
Method:                           MLE   Df Model:                            4
Date:                Thu, 12 Mar 2015   Pseudo R-squ.:                  0.1306
Time:                        08:39:08   Log-Likelihood:                -43.717
converged:                       True   LL-Null:                       -50.283
                                        LLR p-value:                   0.01065
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.6134      0.328      1.868      0.062        -0.030     1.257
C(whichclass)[T.2]    -0.3746      0.307     -1.220      0.223        -0.977     0.227
C(whichclass)[T.3]    -0.9071      0.288     -3.155      0.002        -1.471    -0.343
age                   -0.6700      0.254     -2.643      0.008        -1.167    -0.173
sex                   -0.9802      0.246     -3.985      0.000        -1.462    -0.498
alpha                  0.1040      0.068      1.521      0.128        -0.030     0.238
======================================================================================



burakb

Mar 13, 2015, 5:28:08 AM
to pystat...@googlegroups.com
I guess it is okay to use the statsmodels version, then? The differences seem to be the result of different implementation choices for NB.