Re: [pystatsmodels] logistic regression statsmodels

josef...@gmail.com

Aug 19, 2013, 6:45:02 AM

to pystatsmodels

On Mon, Aug 19, 2013 at 6:06 AM, <hanae...@gmail.com> wrote:

Hello,
I am a beginner in Python and I am trying to do logistic regression with Statsmodels. I tried the code below, but I get an error that I can't resolve. Here is my code:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import logit, probit, poisson, ols
from numpy import genfromtxt

fname="C:/Users/lenovo/Desktop/table.csv"
my_data = genfromtxt(fname,delimiter=',')

my_data_dict = dict(
    x=my_data[:,1],
    y=my_data[:,6]
)

form='x ~ y'
affair_mod = logit(form, my_data_dict).fit()

Here is the error I got:

Thanks

Hi Hanae,

It looks like you don't have enough variation in the data.

Can you show what x and y are (or send the csv file to me offlist, if it's too large or you don't want to have the data public)?

It can be because your generated design matrix is really perfectly collinear (singular), or
because the optimization tries out a parameter combination at which the derivatives (Hessian) are singular.

In the first case, you would need to change the design matrix to make it non-singular.

In the second case, you can try different optimization methods that are less sensitive than the default `newton` method, for example
affair_model = logit(form, my_data_dict)
affair_result = affair_model.fit(method='bfgs')

or
affair_result = affair_model.fit(method='nm')

'nm' (which is Nelder-Mead, scipy.optimize.fmin) is the most robust, but usually the slowest and least precise.

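To check for a singular design matrix, compare the rank of the model's `exog` attribute with its number of columns. If the rank is smaller than the number of columns, the design matrix is singular:

print(np.linalg.matrix_rank(affair_model.exog), affair_model.exog.shape[1])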
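For illustration, here is the same rank check on a small hand-made matrix. The array below is hypothetical (an intercept, a regressor, and a redundant copy of that regressor scaled by 2) and only stands in for a model's `exog`.

```python
import numpy as np

# Hypothetical design matrix: intercept, x, and a redundant column 2*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exog = np.column_stack([np.ones_like(x), x, 2.0 * x])

rank = np.linalg.matrix_rank(exog)
n_cols = exog.shape[1]
print(rank, n_cols)

# Rank below the column count means at least one column is redundant
is_singular = rank < n_cols
```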

Another way to see what's going on is to estimate a linear model:
res_ols = ols(form, my_data_dict).fit()

print(res_ols.summary())

OLS can also handle singular design matrices, and summary() would show some useful results even if some parameters are not identified.
(Not identified: there are many different parameter values that all solve the least squares problem.)
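A small NumPy sketch of what "not identified" means here, using a hypothetical singular design (the numbers are invented). It uses a pseudoinverse, which as far as I know is also what statsmodels' OLS uses by default, and shows two different parameter vectors that produce identical fitted values.

```python
import numpy as np

# Hypothetical data with a singular design: third column is 2 * second
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.9, 7.2, 9.8, 13.1])
X = np.column_stack([np.ones_like(x), x, 2.0 * x])

# Minimum-norm least squares solution via the pseudoinverse
beta_min_norm = np.linalg.pinv(X) @ y

# X @ [0, 2, -1] = 2*x - 2*x = 0, so adding any multiple of this
# vector changes the parameters but not the fitted values.
null_direction = np.array([0.0, 2.0, -1.0])
beta_other = beta_min_norm + 3.0 * null_direction

fitted_a = X @ beta_min_norm
fitted_b = X @ beta_other
print(beta_min_norm, beta_other)
print(np.allclose(fitted_a, fitted_b))
```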

Josef

josef...@gmail.com

Aug 19, 2013, 12:15:51 PM

to pystatsmodels

This looks like a difficult case.

It looks like partial separation to me. That's the case when we can perfectly predict many observations, and some parameters are not identified and would go to infinity.

I'm not sure yet what we can do about this case. It's the first time that we have an actual example.

For (full) perfect separation, we just raise an error (that can be turned off).

The estimated parameters can be used for prediction, but we don't get standard errors on the parameters.
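To illustrate why parameters "go to infinity" under separation, here is a minimal sketch on a hypothetical, perfectly separated toy data set (not the data from this thread). The log-likelihood of a no-intercept logit model keeps increasing toward 0 as the slope grows, so no finite maximum exists.

```python
import numpy as np

# Hypothetical perfectly separated data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def logit_loglike(slope):
    """Log-likelihood of a no-intercept logit model with the given slope."""
    sign = 2.0 * y - 1.0                  # +1 for y=1, -1 for y=0
    # log(sigmoid(z)) = -log(1 + exp(-z)), computed stably
    return -np.sum(np.logaddexp(0.0, -sign * slope * x))

slopes = [1.0, 5.0, 25.0]
loglikes = [logit_loglike(b) for b in slopes]
for b, ll in zip(slopes, loglikes):
    print(b, ll)
```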

Josef

josef...@gmail.com

Aug 19, 2013, 12:35:33 PM

to pystatsmodels

False alarm

Your endog (y) variable has values (1, 2) instead of (0, 1).

I guess Logit only saw values > 0 and treated all y values as 1.

When you use the formula interface, you still need to convert the x variable, which has categories (1 to 6), into a set of dummy variables or, better, tell patsy to do it for you.

(Once I find what the command is.)
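A minimal NumPy sketch of the two conversions described above, on hypothetical values: recoding the response from (1, 2) to (0, 1), and expanding a categorical predictor into dummy columns with the first level dropped as the reference category. This only illustrates the kind of design matrix a categorical formula term is expected to build; it is not patsy's own code.

```python
import numpy as np

# Hypothetical raw columns, coded like the data described in the thread
y_raw = np.array([1, 2, 2, 1, 2, 1])      # response coded (1, 2)
x_raw = np.array([1, 2, 3, 1, 6, 2])      # categorical predictor

y = y_raw - 1                             # now coded (0, 1)

# One indicator column per category, dropping the first level
# as the reference category (treatment coding)
levels = np.unique(x_raw)
dummies = (x_raw[:, None] == levels[1:]).astype(float)

# Intercept plus dummy columns
exog = np.column_stack([np.ones(len(x_raw)), dummies])
print(y)
print(exog)
```

Dropping one level keeps the dummy columns from summing to the intercept column, which would otherwise make the design matrix singular.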

Josef

josef...@gmail.com

Aug 19, 2013, 1:16:16 PM

to pystatsmodels

Here are two versions that work for me:

---------
import numpy as np
#from scipy import stats
#import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import logit, probit, poisson, ols

fname=r"E:\Josef\work-oth2\tableFusion.csv"
my_data = np.genfromtxt(fname,delimiter=',')

y = my_data[:,1]
x = my_data[:,6]
# keep only rows where neither y nor x is missing
mask = ~(np.isnan(y) | np.isnan(x))
# recode y from (1, 2) to (0, 1), as Logit expects
y = my_data[mask,1] - 1
x = my_data[mask,6]

my_data_dict = dict(
    y=y,
    x=x
)

# C(x) tells patsy to treat x as categorical (dummy variables)
form='y ~ C(x)'
affair_model = logit(form, my_data_dict, missing='drop')
affair_result = affair_model.fit()
print(affair_result.summary())

# and now with pandas
import pandas as pd

names=['var0', 'y', 'var2', 'var3', 'var4', 'var5', 'x']
pd_data = pd.read_csv(fname, delimiter=',', names=names)

# drop observations with a missing value in any of the variables
# could drop instead only in the ones used
pd_data_clean = pd_data.dropna()
pd_data_clean['y'] -= 1

form='y ~ C(x)'
affair_model = logit(form, pd_data_clean)
affair_result = affair_model.fit()
print(affair_result.summary())

-------------------------------

I hope this helps

Josef

josef...@gmail.com

Aug 19, 2013, 5:03:21 PM

to pystatsmodels

2013/8/19 <hanae...@gmail.com>:
> Thanks Josef, but please, I didn't understand these lines:
>
> mask = ~(np.isnan(y) | np.isnan(x))

This creates a boolean mask that is True for rows with no nan (missing value)
in either x or y, so indexing with it keeps only the complete rows.

>
> y = my_data[mask, 1] - 1
>
> x = my_data[mask, 6]
>
>
> For the second code, I didn't understand this:
>
>
> pd_data_clean = pd_data.dropna()

This drops all rows (observations) that contain at least one nan (missing value)

>
> pd_data_clean['y'] -= 1

This converts your binary data series from (1, 2) to (0, 1).
statsmodels' Logit model requires that the dependent/response variable
is encoded as 0, 1.

When I loaded your tableFusion.csv, there were several missing values.
If you have patsy 0.2, dropping the missing values in this way is not
necessary anymore in this case, but I was working with patsy 0.1.
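The masking and recoding steps explained above can be checked on a tiny example. The two arrays below are hypothetical stand-ins for columns 1 and 6 of the CSV.

```python
import numpy as np

# Hypothetical columns with missing values
y_raw = np.array([1.0, 2.0, np.nan, 2.0])
x_raw = np.array([3.0, np.nan, 4.0, 6.0])

# True where BOTH values are present
mask = ~(np.isnan(y_raw) | np.isnan(x_raw))

y = y_raw[mask] - 1      # keep complete rows, recode (1, 2) -> (0, 1)
x = x_raw[mask]
print(mask, y, x)
```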

Josef

>
>
> Hanae