Logistic regression models with patsy, with 2 endogenous variables

177 views
Skip to first unread message

Thomas Haslwanter

unread,
May 9, 2013, 5:33:33 AM5/9/13
to pystat...@googlegroups.com
In GLM with binomial response data, such as in the statsmodels example
http://statsmodels.sourceforge.net/devel/examples/generated/example_glm.html
there are often two columns of endogenous variables.

In R, if those two variables are called "number" and "observation", and the exogenous variable "x", we can handle this with with a formula by using

my.mat = cbind(number, observation)
res.glm = glm(my.mat~x, family=binomial(link="logit"))

My question: how can we generate formulas in patsy with two endogenous variables?

Nathaniel Smith

unread,
May 9, 2013, 7:49:55 AM5/9/13
to pystatsmodels

Are these variables numeric or categorical? If numeric just say

number + observation ~ x

(In R the lhs is treated as a simple expression; in patsy it gets expanded just like the rhs.)

If they're categorical.... Hmm. Then we'll need to think.

-n

josef...@gmail.com

unread,
May 9, 2013, 8:03:02 AM5/9/13
to pystat...@googlegroups.com
for binomial is count1 + count2, the counts for the two possible
cases, so I guess this will work with treating the lhs like the rhs.

MNLogit takes the class labels and creates the dummies internally, AFAICS

Josef

>
> -n
Message has been deleted

Thomas Haslwanter

unread,
May 9, 2013, 9:30:56 AM5/9/13
to pystat...@googlegroups.com, n...@pobox.com
"number + observation" does not make sense here.
I probably should have been more explicit about the description of my variables. The example is taken from example 7.3.1 in Dobson & Barnett "An Introduction to GLMs":

x ... dose of a poison
number ... number of beetles tested
obeservation ... number of beetles killed.

I actually get the correct result for the parameters when I use the formula approach in statsmodels, with "y" the observations, and "n" the numbers" in a pandas dataframe "df":

df['p'] = df['y']/df['n']
model = sm.GLM.from_formula('p~x', data = df, family=sm.families.Binomial())

However, the standard errors, confidence intervals and deviance are wrong (at least compared to the results in Dobson&Barnett).
And using the statsmodels approach with endog/exog gives me even the wrong parameters (see my other post from today).

Skipper Seabold

unread,
May 9, 2013, 10:01:22 AM5/9/13
to pystat...@googlegroups.com
On Thu, May 9, 2013 at 9:28 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> df['p'] = df['y']/df['n']
> model = sm.GLM.from_formula('p~x', data = df, family=sm.families.Binomial())
>
> However, the standard errors, confidence intervals and deviance are wrong
> (at least compared to the results in Dobson&Barnett).
> And using the statsmodels approach with endog/exog gives me even the wrong
> parameters (see my other post from today).
>

If this is still true, could you file an issue that replicates this
with the expected correct answers from the text.

https://github.com/statsmodels/statsmodels/issues

Skipper

josef...@gmail.com

unread,
May 9, 2013, 10:13:49 AM5/9/13
to pystat...@googlegroups.com
I think the problem is the missing case weights.

p = np.repeat(p, df['n'])
x = np.repeat(x, df['n'])

see issues on case weights (MIA)

Josef

>
> Skipper

Skipper Seabold

unread,
May 9, 2013, 10:20:19 AM5/9/13
to pystat...@googlegroups.com
Would be helpful then to have a potential test case on github.

This one?

https://github.com/statsmodels/statsmodels/issues/203

Skipper

josef...@gmail.com

unread,
May 9, 2013, 10:28:54 AM5/9/13
to pystat...@googlegroups.com
I was thinking about a general keyword for all models

https://github.com/statsmodels/statsmodels/issues/481
https://github.com/statsmodels/statsmodels/issues/743

I thought there are more issues, but I don't like the new github search yet.

somewhere there is a discussion pweights (Stata) versus case weights.
SAS allows both under different names, and I think Stata too

Josef

>
> Skipper

josef...@gmail.com

unread,
May 9, 2013, 10:39:22 AM5/9/13
to pystat...@googlegroups.com
On Thu, May 9, 2013 at 10:28 AM, <josef...@gmail.com> wrote:
> On Thu, May 9, 2013 at 10:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Thu, May 9, 2013 at 10:13 AM, <josef...@gmail.com> wrote:
>>> On Thu, May 9, 2013 at 10:01 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>>>> On Thu, May 9, 2013 at 9:28 AM, Thomas Haslwanter
>>>> <thomas.h...@gmail.com> wrote:
>>>>> df['p'] = df['y']/df['n']
>>>>> model = sm.GLM.from_formula('p~x', data = df, family=sm.families.Binomial())
>>>>>
>>>>> However, the standard errors, confidence intervals and deviance are wrong
>>>>> (at least compared to the results in Dobson&Barnett).
>>>>> And using the statsmodels approach with endog/exog gives me even the wrong
>>>>> parameters (see my other post from today).
>>>>>
>>>>
>>>> If this is still true, could you file an issue that replicates this
>>>> with the expected correct answers from the text.
>>>>
>>>> https://github.com/statsmodels/statsmodels/issues
>>>
>>> I think the problem is the missing case weights.
>>>
>>> p = np.repeat(p, df['n'])
>>> x = np.repeat(x, df['n'])
>>>
>>> see issues on case weights (MIA)
>>>
>>
>> Would be helpful then to have a potential test case on github.
>>
>> This one?
>>
>> https://github.com/statsmodels/statsmodels/issues/203

I didn't look at this issue before.

It looks like:
The example above with p as proportion is currently not supported.
Either we have endog as
- binary with repeated observations or the
- (count_success, count_failure)

If endog is a proportion, then we don't know how many observations
this is based on, and there is no way to get the degrees of freedom
and cov_params correct.

Josef

Skipper Seabold

unread,
May 9, 2013, 10:42:36 AM5/9/13
to pystat...@googlegroups.com
IIRC, it's not. That's why the docs say it expects a 2d array now
after we got rid of the weights argument.

Skipper

Thomas Haslwanter

unread,
May 9, 2013, 11:32:07 AM5/9/13
to pystat...@googlegroups.com
I don't completely understand the discussion. But for comparison, I wanted to post a correct (non-formula) code, and a code using patsy, which produces the correct parameters, but the wrong deviances and standard errors:

Correct results from:
------------------------------
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

exog = np.array([[ 1.,      1.6907],
    [ 1.,      1.7242],
    [ 1.,      1.7552],
    [ 1.,      1.7842],
    [ 1.,      1.8113],
    [ 1.,      1.8369],
    [ 1.,      1.861 ],
    [ 1.,      1.8839]])

# number of beetles tested, and numbers killed
data = np.array([[ 59.,   6.],
 [ 60.,  13.],
 [ 62.,  18.],
 [ 56.,  28.],
 [ 63.,  52.],
 [ 59.,  53.],
 [ 62.,  61.],
 [ 60.,  60.]])

endog = data.copy()
endog[:,0] = data[:,0]-data[:,1]

# fit the model
glm_binom = sm.GLM(endog, exog, family=sm.families.Binomial())
res = glm_binom.fit()
print res.summary()

to_be_predicted = endog[:,1]/endog[:,0]
predicted = 1-res.predict(exog)

print 'Data: ' + str(to_be_predicted)
print 'Predictions: ' + str(predicted)


Correct parameters, but wrong standard errors and deviances from:
-------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = np.array([[1.6907, 59, 6],
    [1.7242, 60, 13],
    [1.7552, 62, 18],
    [1.7842, 56, 28],
    [1.8113, 63, 52],
    [1.8369, 59, 53],
    [1.861, 62, 61 ],
    [1.8839, 60, 60]])

df = pd.DataFrame(data)
df.columns = ['x', 'n', 'y']

# fit the model

df['p'] = df['y']/df['n']
model = sm.GLM.from_formula('p~x', data = df, family=sm.families.Binomial())
print model.fit().summary()

Skipper Seabold

unread,
May 9, 2013, 11:33:49 AM5/9/13
to pystat...@googlegroups.com
On Thu, May 9, 2013 at 11:32 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> I don't completely understand the discussion. But for comparison, I wanted
> to post a correct (non-formula) code, and a code using patsy, which produces
> the correct parameters, but the wrong deviances and standard errors:
>

Can you also post the correct deviances and standard errors? I don't
have the textbook.

Thomas Haslwanter

unread,
May 9, 2013, 12:19:52 PM5/9/13
to pystat...@googlegroups.com
The code without patsy produces the correct results:
b1 = -60.72, se = 5.18
b2 = 34.27, se = 2.91
Deviance = 11.23

For completeness, I also give the Stata and R-code provided by Dobson:

Stata code:

.glm y x, family(binomial n) link(logit)
.predict fit, mu

R code:
>y=c(6,13,18,28,52,53,61,60)
>n=c(59,60,62,56,63,59,62,60)
>x=c(1.6907,1.7242,1.7552,1.7842,1.8113,1.8369,1.8610,1.8839)
>n_y=n-y
>beetle.mat=cbind(y,n_y)
>res.glm1=glm(beetle.mat!x, family=binomial(link="logit"))

Since I am just getting started with statistical modeling, I may not be able to contribute here - especially since this involves patsy and formulas, too. On the other hand, I would be keen to try to contribute to statsmodels. What do you think: may this be a good point to start contributing? Or is this too involved?

Nathaniel Smith

unread,
May 9, 2013, 12:22:54 PM5/9/13
to pystatsmodels
Try:

df['tested'] = df['n']
df['killed'] = df['y']
df['survived'] = df['tested'] - df['killed']
sm.GLM.from_formula('survived + killed ~ x', data=df,
family=sm.families.Binomial())

?

(Or equivalently, just
sm.GLM.from_formula('I(n - y) + y ~ x', data=df, family=...)
)

-n

On Thu, May 9, 2013 at 11:32 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:

josef...@gmail.com

unread,
May 9, 2013, 12:50:07 PM5/9/13
to pystat...@googlegroups.com
As Skipper and I discussed this cannot work, because GLM doesn't know
what the number of observations is.

Below is a version that expands to the full number of observations,
which is the only way that Logit knows how to handle (currently).

Josef

-------------------------
# -*- coding: utf-8 -*-
"""

Created on Thu May 09 12:16:40 2013

Author: Josef Perktold
"""

import numpy as np
import statsmodels.api as sm

data = np.array([[1.6907, 59, 6],
[1.7242, 60, 13],
[1.7552, 62, 18],
[1.7842, 56, 28],
[1.8113, 63, 52],
[1.8369, 59, 53],
[1.861, 62, 61 ],
[1.8839, 60, 60]])

count1 = data[:,2]
count2 = data[:,1] - data[:,2]

x1 = np.repeat(data[:,0], count1.astype(int))
y1 = np.ones(len(x1))
x2 = np.repeat(data[:,0], count2.astype(int))
y2 = np.zeros(len(x2))
endog = y = np.concatenate((y1, y2))
x = np.concatenate((x1, x2))
exog = sm.add_constant(x)

print endog.shape, exog.shape

model = sm.GLM.from_formula('y~x', data = {'y':y, 'x':x},
family=sm.families.Binomial())

print sm.GLM(endog, exog, family=sm.families.Binomial()).fit().bse
print model.fit().bse

print sm.Logit(endog, exog).fit().bse

-----------------------

Thomas Haslwanter

unread,
May 9, 2013, 4:46:30 PM5/9/13
to pystat...@googlegroups.com, n...@pobox.com
Nathanial,

both your suggestions worked, and produced the correct results.
Thanks for the quick reply!

thomas

-----------------------------

import numpy as np
import pandas as pd
import statsmodels.api as sm

data = np.array([[1.6907, 59, 6],
    [1.7242, 60, 13],
    [1.7552, 62, 18],
    [1.7842, 56, 28],
    [1.8113, 63, 52],
    [1.8369, 59, 53],
    [1.861, 62, 61 ],
    [1.8839, 60, 60]])

df = pd.DataFrame(data)
df.columns = ['x', 'n', 'y']

# fit the model
df['tested'] = df['n']
df['killed'] = df['y']
df['survived'] = df['tested'] - df['killed']
model = sm.GLM.from_formula('survived + killed ~ x', data=df, family=sm.families.Binomial())

# or equivalently:
#model = sm.GLM.from_formula('I(n - y) + y ~ x', data=df, family=sm.families.Binomial())

print model.fit().summary()

Thomas Haslwanter

unread,
May 9, 2013, 4:52:07 PM5/9/13
to pystat...@googlegroups.com
I might miss something: but thanks to your help, and the comments from Nathaniel, there are now versions of statsmodels, with and without formulas, that are working (see my other posts).
So in my opinion, the conversion to 0/1 is not really necessary.
Thanks again for your help!

josef...@gmail.com

unread,
May 9, 2013, 5:03:24 PM5/9/13
to pystat...@googlegroups.com
On Thu, May 9, 2013 at 4:52 PM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> I might miss something: but thanks to your help, and the comments from
> Nathaniel, there are now versions of statsmodels, with and without formulas,
> that are working (see my other posts).
> So in my opinion, the conversion to 0/1 is not really necessary.

You don't need it for binomial, but none of the other models or
families can handle anything except individual observations.

If you work with experimental or categorical exog, then sooner or
later something like this will be necessary.
In our economics real life data, it's unusual to have two observations
with identical exog.

Josef
Reply all
Reply to author
Forward
0 new messages