errors in for nominal logistic regression?

381 views
Skip to first unread message

Thomas Haslwanter

unread,
May 14, 2013, 2:07:48 PM5/14/13
to pystat...@googlegroups.com
In comparing the output of the statsmodel MNLogit function with results from R and STATA, I ran into two questions/problems:

1) While the absolute values of the fitted parameters and the standard errors match, the sign differs from that in R:
    the R-command "multinom" from the library "nnet"
 res.cars=muninom(c_resp~factor(c_age)+factor(c_sex), weights-freq, data=car)
produces the parameter values
[-0.388, 1.128, 1.588],

the Python/statsmodel code produces the parameters
[-0.388, -1.128, -1.588],

The intercept values also don't match, but that may be a reference problem.

2) Not too surprising, I also could not get the desired/reasonable output from model.fittedvalues.

The Python progam (with all the other functions I have done so far) can be found at
https://github.com/thomas-haslwanter/dobson

or below.

#---------------------------------

import numpy as np
import pandas as pd
import statsmodels.api as sm
import patsy

from statsmodels.stats.api import anova_lm

# for data import
import urllib2
import zipfile
from StringIO import StringIO

def get_data(inFile):
    '''Get data from original Excel-file'''
   
    # get the zip-archive
    url = 'http://cdn.crcpress.com/downloads/C9500/GLM_data.zip'
    GLM_archive = urllib2.urlopen(url).read()
   
    # extract the requested file from the archive, as a pandas XLS-file
   
    zipdata = StringIO()
    zipdata.write(GLM_archive)
    myzipfile = zipfile.ZipFile(zipdata)
    xlsfile = myzipfile.open(inFile)
    xls = pd.ExcelFile(xlsfile)
    df = xls.parse('Sheet1', skiprows=2)   
   
    return df
def nominal_logistic_regression():
    '''Nominal Logistic Regression
    chapter 8.3,  p. 155
   
    At this point, I nominal logistic regression can be done with the formula approach
    '''
   
    # Get the data
    inFile = r'GLM_data/Table 8.1 Car preferences.xls'
    df = get_data(inFile)   
   
    # Generate the design matrices using patsy
    pm = patsy.dmatrices('response~age+sex', data=df)
   
    # Change the first output, representing the endogenous data, into a vector
    # e.g. [0,1,0] -> 1
    # e.g. [0,0,1] -> 2
    endog_ind = np.zeros(len(pm[0]))
    for ii in range(len(pm[0])):
        endog_ind[ii] = np.where(pm[0][ii])[0]

    # Since frequencies cannot be represented explicitly, multiply the entries
    # to correspond to the correct number of occurences
    endog = np.repeat(endog_ind, df['frequency'].values.astype(int), axis=0)
    exog = np.array(np.repeat(pm[1], df['frequency'].values.astype(int), axis=0))
   
    # Fit the model, and print the summary
    model = sm.MNLogit(endog, exog, method='nm').fit()
    print  model.summary()

if __name__ == '__main__':
    nominal_logistic_regression()

produces

Optimization terminated successfully.
         Current function value: 0.967837
         Iterations 5
                          MNLogit Regression Results                         
==============================================================================
Dep. Variable:                      y   No. Observations:                  300
Model:                        MNLogit   Df Residuals:                      292
Method:                           MLE   Df Model:                            6
Date:                Tue, 14 May 2013   Pseudo R-squ.:                  0.1182
Time:                        19:39:52   Log-Likelihood:                -290.35
converged:                       True   LL-Null:                       -329.27
                                        LLR p-value:                 9.966e-15
==============================================================================
       y=1       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.9789      0.256      3.819      0.000         0.477     1.481
x1            -1.1283      0.342     -3.302      0.001        -1.798    -0.459
x2            -1.5877      0.403     -3.941      0.000        -2.377    -0.798
x3            -0.3881      0.301     -1.292      0.197        -0.977     0.201
------------------------------------------------------------------------------
       y=2       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.8732      0.355     -2.462      0.014        -1.568    -0.178
x1             0.3498      0.409      0.856      0.392        -0.452     1.151
x2             1.3290      0.393      3.386      0.001         0.560     2.098
x3             0.4249      0.301      1.414      0.157        -0.164     1.014
==============================================================================



  

Skipper Seabold

unread,
May 14, 2013, 2:18:07 PM5/14/13
to pystat...@googlegroups.com
On Tue, May 14, 2013 at 2:07 PM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> In comparing the output of the statsmodel MNLogit function with results from
> R and STATA, I ran into two questions/problems:
>
> 1) While the absolute values of the fitted parameters and the standard
> errors match, the sign differs from that in R:
> the R-command "multinom" from the library "nnet"
> res.cars=muninom(c_resp~factor(c_age)+factor(c_sex), weights-freq,
> data=car)
> produces the parameter values
> [-0.388, 1.128, 1.588],
>
> the Python/statsmodel code produces the parameters
> [-0.388, -1.128, -1.588],
>
> The intercept values also don't match, but that may be a reference problem.
>

What version of Python are you on? I can't run your data fetching code.

If the intercept values don't match, then the dropped case for the
categorical factors is likely different. You need to explicitly set
the reference case in one or the other. Since the base case is
different, then the parameters will also be different since they're
only identified vs. the base case.

josef...@gmail.com

unread,
May 14, 2013, 2:19:30 PM5/14/13
to pystat...@googlegroups.com
On Tue, May 14, 2013 at 2:07 PM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
try to reverse the labeling, Multinomial Logit use one of the levels
as normalization, we use the last, multinom uses the first, from the
description.

Josef

josef...@gmail.com

unread,
May 14, 2013, 2:50:48 PM5/14/13
to pystat...@googlegroups.com
On Tue, May 14, 2013 at 2:19 PM, <josef...@gmail.com> wrote:
> On Tue, May 14, 2013 at 2:07 PM, Thomas Haslwanter
> <thomas.h...@gmail.com> wrote:
>> In comparing the output of the statsmodel MNLogit function with results from
>> R and STATA, I ran into two questions/problems:

Is the R and Stata code available somewhere?

Josef

josef...@gmail.com

unread,
May 14, 2013, 8:04:49 PM5/14/13
to pystat...@googlegroups.com
From reading the code, we also use the first level in alphabetical
order as base level

you can also use the string response df['response'] directly in the
call to MNLogit
endog = np.repeat(df['response'], df['frequency'].values.astype(int), axis=0)

it produces internally the same endog as the patsy call
both MNLogit and patsy, create the dummy array according to
alphabetical levels (np.unique)

I don't see a way to choose this, what would be possible is to rename
>>> np.unique(df['response'])
10 important
3 no/little
8 very important
Name: response

so no/little is first in alphabetical order. I haven't figured out
where to get the data for R, but SAS (UCLA examples) prints
"Logits modeled use response='no/little' as the reference category."

I guess there should be some more control over converting string
levels to numeric levels, or to dummy coding, or to choose the
reference category.

Josef

Skipper Seabold

unread,
May 14, 2013, 8:14:53 PM5/14/13
to pystat...@googlegroups.com
You can choose the reference category using patsy. I think it would
just be a few changes to from_formula and __init__ to make this work
correctly for C(..., reference=...) and/or using an additional
keyword. If you file a ticket, I'll look at it at some point. (There
may already be a ticket for this.)

https://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment

Skipper

Nathaniel Smith

unread,
May 14, 2013, 8:19:45 PM5/14/13
to pystatsmodels

The third example at that link demonstrates setting the reference from a formula. Not sure what's missing?

-n

Skipper Seabold

unread,
May 14, 2013, 8:21:53 PM5/14/13
to pystat...@googlegroups.com
MNLogit doesn't expect the 1-of-N / dummy encoding in endog. Would
just need to make it so it doesn't do anything if it already gets
preprocessed data.

Skipper

josef...@gmail.com

unread,
May 14, 2013, 8:47:10 PM5/14/13
to pystat...@googlegroups.com
I think "sex" also has the reversed encoding compared to SAS.

Forcing the encoding

endog_ind = np.zeros(len(pm[0]))
for ii, level in enumerate(['no/little', 'important', 'very important']):
endog_ind[df['response']==level] = ii

I get the same slope parameters (SAS has negative coefficients for sex)

I haven't figured out yet why the intercept differs compared to SAS

Josef

>
> Skipper

josef...@gmail.com

unread,
May 14, 2013, 9:08:03 PM5/14/13
to pystat...@googlegroups.com
Treatment coding doesn't help, AFAICS
we need the full dummy matrix, without a constant, is that possible
with rolling one column to the front

pm = patsy.dmatrices('response~sex+age', data=df)

>>> pm[0]
DesignMatrix with shape (18, 3)
response[important] response[no/little] response[very important]
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 0 1
Terms:
'response' (columns 0:3)

The implementation of MNLogit, requires that the reference category
``response[no/little]`` is the first column in the dummy
representation.

any suggestions?

Josef

Skipper Seabold

unread,
May 14, 2013, 9:29:00 PM5/14/13
to pystat...@googlegroups.com
Would this work?

Y = patsy.dmatrix("C(response, Treatment('no/little'))", data=df)[:,1:]

Skipper

josef...@gmail.com

unread,
May 14, 2013, 9:37:24 PM5/14/13
to pystat...@googlegroups.com
no, not from what I have seen

the first column needs to be the dummy for the reference category, for example

Y = patsy.dmatrix("C(response, Treatment('no/little'))", data=df)
Y[:, 0] -= Y[:,1:].any(1) # I assume this works

Josef

>
> Skipper
>

josef...@gmail.com

unread,
May 14, 2013, 9:58:04 PM5/14/13
to pystat...@googlegroups.com
a cheap way to force the reference category to be first in alphabetical order,
prefix with underline, the reference category doesn't show up in the summary()

df['response'][df['response'] == 'no/little'] = '_no/little'

>>> print mod.summary()

                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                      y   No. Observations:                  300
Model:                        MNLogit   Df Residuals:                      292
Method:                           MLE   Df Model:                            6
Date:                Tue, 14 May 2013   Pseudo R-squ.:                  0.1182
Time:                        21:56:28   Log-Likelihood:                -290.35

converged:                       True   LL-Null:                       -329.27
                                        LLR p-value:                 9.966e-15
====================================================================================
     y=important       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           -0.9789      0.256     -3.819      0.000        -1.481    -0.477
sex[T.women]         0.3881      0.301      1.292      0.197        -0.201     0.977
age[T.24-40]         1.1283      0.342      3.302      0.001         0.459     1.798
age[T.> 40]          1.5877      0.403      3.941      0.000         0.798     2.377
------------------------------------------------------------------------------------
y=very important       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           -1.8521      0.331     -5.600      0.000        -2.500    -1.204
sex[T.women]         0.8130      0.321      2.532      0.011         0.184     1.442
age[T.24-40]         1.4781      0.401      3.687      0.000         0.692     2.264
age[T.> 40]          2.9168      0.423      6.897      0.000         2.088     3.746
====================================================================================

Josef

Skipper Seabold

unread,
May 14, 2013, 10:04:54 PM5/14/13
to pystat...@googlegroups.com
Oh, right.

This is kind of a multiple equations model and could probably use a
refactor since I was recently looking at it. I think it's possible to
make it general so it wouldn't matter where it is. But to do it in a
general way would be more work than just fixing MNLogit, which I think
might not be too hard. I don't have any ideas how to do it now though.

Skipper

Skipper Seabold

unread,
May 14, 2013, 10:05:29 PM5/14/13
to pystat...@googlegroups.com
On Tue, May 14, 2013 at 9:58 PM, <josef...@gmail.com> wrote:
> a cheap way to force the reference category to be first in alphabetical
> order,
> prefix with underline, the reference category doesn't show up in the
> summary()
>
> df['response'][df['response'] == 'no/little'] = '_no/little'
>

Good enough hack as any I guess.

Skipper

josef...@gmail.com

unread,
May 14, 2013, 10:19:03 PM5/14/13
to pystat...@googlegroups.com

I was thinking about whether we should change the math to an arbitrary index as reference case, but I think we then need to rearrange the numpy arrays each time inside the optimization loop, which might be a lot more expensive than shuffling the columns in the __init__

I would rather keep the math simple in loglike, score and hessian. (You might have some fun times again coding Hessians :)

maybe it's not that bad for MNLogit

H = np.transpose(H.reshape(J,J,K,K), (0,2,1,3)).reshape(J*K,J*K)
I'm not trying to figure this out

Josef
 

Skipper

Thomas Haslwanter

unread,
May 15, 2013, 8:02:03 AM5/15/13
to pystat...@googlegroups.com
I am running Python 2.7.3, on a Win64 system with Windows 7.
If you have trouble getting the data, you can copy them from here:

      sex    age        response  frequency
0   women  18-23       no/little         26
1   women  18-23       important         12
2   women  18-23  very important          7
3   women  24-40       no/little          9
4   women  24-40       important         21
5   women  24-40  very important         15
6   women   > 40       no/little          5
7   women   > 40       important         14
8   women   > 40  very important         41
9     men  18-23       no/little         40
10    men  18-23       important         17
11    men  18-23  very important          8
12    men  24-40       no/little         17
13    men  24-40       important         15
14    men  24-40  very important         12
15    men   > 40       no/little          8
16    men   > 40       important         15
17    men   > 40  very important         18

Thomas Haslwanter

unread,
May 15, 2013, 9:05:01 AM5/15/13
to pystat...@googlegroups.com
The R-code is given as (library "nnet")
 res.cars=multinom(c_resp~factor(c_age)+factor(c_sex), weights=freq,data=car)

and the Stata code as
  .mlogit c_resp c_sex _Ic_age_1 _Ic_age_2 [fweight=freq]

I don't have the bit of the code that reads in the data.
Message has been deleted

Thomas Haslwanter

unread,
May 15, 2013, 9:16:45 AM5/15/13
to pystat...@googlegroups.com
Changing from
'response~sex+age'
to
'response~age+sex'

changes the sequence of the output, but does not affect the problem that all fitted parameters have the same sign in statsmodels.
I don't completely understand the discussion between you and Skipper. But does it address the sign-problem?

josef...@gmail.com

unread,
May 15, 2013, 9:51:15 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:16 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
>
> Changing from
> 'response~sex+age'
> to
> 'response~age+sex'
>
> changes the sequence of the output, but does not affect the problem that all fitted parameters have the same sign in statsmodels.
> I don't completely understand the discussion between you and Skipper. But does it address the sign-problem?


Yes, the simplest way, if "no/little" is first in alphabetical order
then it is used as reference category, which means parameters are
relative to the "no/little" case.
Currently, "important" is first alphabetically, so that one is used as
the reference category.

There is currently no direct way to override that the first category
by alphabetical order is used as reference.

my current version with renaming "no/little":
(I had an error with the zipfile and didn't want to hit the website
each time I run the code, so I'm using the xls files from an unzipped
directory.)
commented out code are different version, and trying to get the
variable names back into the model

-----------------
# -*- coding: utf-8 -*-
"""

Created on Tue May 14 15:28:23 2013

Author: Thomas Haslwanter
mailing list
changes Josef Perktold

"""

import numpy as np
import pandas as pd
import statsmodels.api as sm
import patsy

from statsmodels.stats.api import anova_lm

# for data import
import urllib2
import zipfile
from StringIO import StringIO

def get_data(inFile):
'''Get data from original Excel-file'''

# # get the zip-archive
# url = 'http://cdn.crcpress.com/downloads/C9500/GLM_data.zip'
# GLM_archive = urllib2.urlopen(url).read()
#
# # extract the requested file from the archive, as a pandas XLS-file
#
# zipdata = StringIO()
# zipdata.write(GLM_archive)
# myzipfile = zipfile.ZipFile(zipdata)
# xlsfile = myzipfile.open(inFile)
xlsfile = inFile
xls = pd.ExcelFile(xlsfile)
df = xls.parse('Sheet1', skiprows=2)

return df
def nominal_logistic_regression():
'''Nominal Logistic Regression
chapter 8.3, p. 155

At this point, I nominal logistic regression can be done with the
formula approach
'''

# Get the data
inFile = r'GLM_data/Table 8.1 Car preferences.xls'
df = get_data(inFile)
df['response'][df['response'] == 'no/little'] = '_no/little'

# Generate the design matrices using patsy
pm = patsy.dmatrices('response~sex+age', data=df)

# # Change the first output, representing the endogenous data, into a vector
# # e.g. [0,1,0] -> 1
# # e.g. [0,0,1] -> 2
# endog_ind = np.zeros(len(pm[0]))
# for ii in range(len(pm[0])):
# endog_ind[ii] = np.where(pm[0][ii])[0]
#
# endog_ind[endog_ind==0] = 3
# endog_ind[endog_ind==1] = 0
# endog_ind[endog_ind==3] = 1
#
# endog_ind = np.zeros(len(pm[0]))
# for ii, level in enumerate(['no/little', 'important', 'very important']):
# endog_ind[df['response']==level] = ii


# Since frequencies cannot be represented explicitly, multiply the entries
# to correspond to the correct number of occurences
#endog = np.repeat(endog_ind, df['frequency'].values.astype(int), axis=0)
endog = np.repeat(np.array(df['response']),
df['frequency'].values.astype(int), axis=0)
exog = np.array(np.repeat(pm[1],
df['frequency'].values.astype(int), axis=0))
#endog = pd.DataFrame(endog)
exog = pd.DataFrame(exog, columns=pm[1].design_info.column_names)
# Fit the model, and print the summary
model = sm.MNLogit(endog, exog, method='nm').fit()
#print model.summary()
return model

if __name__ == '__main__':
mod = nominal_logistic_regression()
print mod.summary()


------------

Skipper Seabold

unread,
May 15, 2013, 10:01:31 AM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 9:16 AM, Thomas Haslwanter
<thomas.h...@gmail.com> wrote:
> Changing from
> 'response~sex+age'
> to
> 'response~age+sex'
>
> changes the sequence of the output, but does not affect the problem that all
> fitted parameters have the same sign in statsmodels.
> I don't completely understand the discussion between you and Skipper. But
> does it address the sign-problem?

I'm almost certain this is just an issue with your choosing of the
base case in R and Stata. I get the same output from Stata when I
explicitly force the same base outcome as us.

Stata code:

insheet using /home/skipper/scratch/try_mnlogit.csv
encode response, gen(resp)
xi: mlogit resp i.sex i.age [fweight=freq], baseoutcome(1)

Our code: (MNLogit has problems with formulas AFAICT, but it's still
helpful to have the endogenous variable names to be sure you're
comparing correctly vs. Stata. There's also a problem holding on to
these exog_names in PatsyData)

endog = pd.Series(np.repeat(df['response'].values,
df['frequency'].values.astype(int), axis=0),
name='response')
exog = pd.DataFrame(np.repeat(df[['sex','age']].values,
df['frequency'].values.astype(int), axis=0),
columns=['sex', 'age'])

# Fit the model, and print the summary
X = patsy.dmatrix("age + sex", exog)
model = sm.MNLogit(endog, X, method='nm').fit()
print model.summary()

Skipper

Skipper Seabold

unread,
May 15, 2013, 10:03:13 AM5/15/13
to pystat...@googlegroups.com
FYI, method is an argument to fit not the model instantiation.

Skipper
Message has been deleted

josef...@gmail.com

unread,
May 15, 2013, 10:44:01 PM5/15/13
to pystat...@googlegroups.com
On Wed, May 15, 2013 at 10:03 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Wed, May 15, 2013 at 10:01 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Wed, May 15, 2013 at 9:16 AM, Thomas Haslwanter
>> <thomas.h...@gmail.com> wrote:
>>> Changing from
>>> 'response~sex+age'
>>> to
>>> 'response~age+sex'
>>>
>>> changes the sequence of the output, but does not affect the problem that all
>>> fitted parameters have the same sign in statsmodels.
>>> I don't completely understand the discussion between you and Skipper. But
>>> does it address the sign-problem?
>>
>> I'm almost certain this is just an issue with your choosing of the
>> base case in R and Stata. I get the same output from Stata when I
>> explicitly force the same base outcome as us.

I opened
https://github.com/statsmodels/statsmodels/issues/837
Reply all
Reply to author
Forward
0 new messages