On Wed, May 15, 2013 at 9:16 AM, Thomas Haslwanter
<
thomas.h...@gmail.com> wrote:
>
> Changing from
> 'response~sex+age'
> to
> 'response~age+sex'
>
> changes the sequence of the output, but does not affect the problem that all fitted parameters have the same sign in statsmodels.
> I don't completely understand the discussion between you and Skipper. But does it address the sign-problem?
Yes, the simplest way, if "no/little" is first in alphabetical order
then it is used as reference category, which means parameters are
relative to the "no/little" case.
Currently, "important" is first alphabetically, so that one is used as
the reference category.
There is currently no direct way to override that the first category
by alphabetical order is used as reference.
my current version with renaming "no/little":
(I had an error with the zipfile and didn't want to hit the website
each time I run the code, so I'm using the xls files from an unzipped
directory.)
commented out code are different version, and trying to get the
variable names back into the model
-----------------
# -*- coding: utf-8 -*-
"""
Created on Tue May 14 15:28:23 2013
Author: Thomas Haslwanter
mailing list
changes Josef Perktold
"""
import numpy as np
import pandas as pd
import statsmodels.api as sm
import patsy
from statsmodels.stats.api import anova_lm
# for data import
import urllib2
import zipfile
from StringIO import StringIO
def get_data(inFile):
'''Get data from original Excel-file'''
# # get the zip-archive
# url = '
http://cdn.crcpress.com/downloads/C9500/GLM_data.zip'
# GLM_archive = urllib2.urlopen(url).read()
#
# # extract the requested file from the archive, as a pandas XLS-file
#
# zipdata = StringIO()
# zipdata.write(GLM_archive)
# myzipfile = zipfile.ZipFile(zipdata)
# xlsfile = myzipfile.open(inFile)
xlsfile = inFile
xls = pd.ExcelFile(xlsfile)
df = xls.parse('Sheet1', skiprows=2)
return df
def nominal_logistic_regression():
'''Nominal Logistic Regression
chapter 8.3, p. 155
At this point, I nominal logistic regression can be done with the
formula approach
'''
# Get the data
inFile = r'GLM_data/Table 8.1 Car preferences.xls'
df = get_data(inFile)
df['response'][df['response'] == 'no/little'] = '_no/little'
# Generate the design matrices using patsy
pm = patsy.dmatrices('response~sex+age', data=df)
# # Change the first output, representing the endogenous data, into a vector
# # e.g. [0,1,0] -> 1
# # e.g. [0,0,1] -> 2
# endog_ind = np.zeros(len(pm[0]))
# for ii in range(len(pm[0])):
# endog_ind[ii] = np.where(pm[0][ii])[0]
#
# endog_ind[endog_ind==0] = 3
# endog_ind[endog_ind==1] = 0
# endog_ind[endog_ind==3] = 1
#
# endog_ind = np.zeros(len(pm[0]))
# for ii, level in enumerate(['no/little', 'important', 'very important']):
# endog_ind[df['response']==level] = ii
# Since frequencies cannot be represented explicitly, multiply the entries
# to correspond to the correct number of occurences
#endog = np.repeat(endog_ind, df['frequency'].values.astype(int), axis=0)
endog = np.repeat(np.array(df['response']),
df['frequency'].values.astype(int), axis=0)
exog = np.array(np.repeat(pm[1],
df['frequency'].values.astype(int), axis=0))
#endog = pd.DataFrame(endog)
exog = pd.DataFrame(exog, columns=pm[1].design_info.column_names)
# Fit the model, and print the summary
model = sm.MNLogit(endog, exog, method='nm').fit()
#print model.summary()
return model
if __name__ == '__main__':
mod = nominal_logistic_regression()
print mod.summary()
------------