Doing Logistic Regression with categorical variables

Toni Cebrián

Feb 11, 2013, 9:41:13 AM
to pystat...@googlegroups.com
Hi,

   At work I'm trying to predict the probability of a click on a banner. The exogenous variables are all categorical (sex, language, country, OS, etc.) and the response is binary (click Yes or No). In the first step of the exploratory phase I planned to use statsmodels.api.Logit, but I'm getting a "LinAlgError: Singular matrix". Of course, that is because some rows in the exogenous are duplicated, and I assume that internally the solver is doing some matrix inversion.
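Roughly what I'm doing, simplified (the file and column names here are made up):

    import pandas as pd
    import statsmodels.api as sm

    # one row per impression; "click" is 0/1, the other columns are categorical strings
    df = pd.read_csv("impressions.csv")

    # expand each categorical variable into 0/1 dummy columns
    X = pd.get_dummies(df[["sex", "language", "country", "os"]]).astype(float)
    X = sm.add_constant(X)
    y = df["click"]

    res = sm.Logit(y, X).fit()   # this is where the LinAlgError shows up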
    So the question is, how should I reformulate the problem in order to use logistic regression with categorical variables in the statsmodels package?

Regards.
Toni.

Skipper Seabold

Feb 11, 2013, 9:52:11 AM
to pystat...@googlegroups.com
I suspect that you are encountering perfect separation in your data. Can you try fitting the Logit model with method = "powell" or method = "bfgs" to confirm?
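Something like this (untested sketch, reusing whatever y and X you already built):

    # same dummy-encoded data, just a different optimizer
    res_bfgs = sm.Logit(y, X).fit(method="bfgs", maxiter=1000)
    res_powell = sm.Logit(y, X).fit(method="powell", maxiter=1000)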

Aside: why do some of these methods raise a PerfectSeparationError and some do not? Are we not being careful enough?

Skipper

josef...@gmail.com

Feb 11, 2013, 9:54:23 AM
to pystat...@googlegroups.com
On Mon, Feb 11, 2013 at 9:41 AM, Toni Cebrián <anc...@gmail.com> wrote:
> Hi,
>
> At work I'm trying to predict the probability of click in a banner. The
> exogenous variables are all categorical (sex, language, country, OS, etc...)
> and the response is binary (click Yes or Not). In the first step of the
> exploratory phase I planned to use statsmodel.api.Logit, but I'm getting a
> "LinAlgError: Singular matrix". Of course, that is because some rows in the
> exogenous are duplicated and I assume that internally the solver is doing
> some matrix inversion.

Since Logit uses nonlinear optimization, it doesn't handle singular
matrices, in contrast to the linear models.
GLM might be able to handle singular exogenous variables, but it would
still leave you with a non-unique parameter representation.
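A rough, untested sketch of the GLM version, with the same y and X you pass to Logit:

    # same model as Logit; IRLS solves a weighted least squares step with a
    # pseudo-inverse, so it may run through even if exog is singular, but the
    # estimated parameters are then not unique
    glm_res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    print(glm_res.summary())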

> So the question is, how should I reformulate the problem in order to use
> logistic regression with categorical variables in the statsmodel package?

I think the easiest might be to build the design matrix, exog, with
patsy (for example with dmatrix), or to use the formula interface to
the model explicitly (requires 0.5dev).

The alternative would be to drop the columns that are linear
combinations of others "by hand", or to use a transform to a
non-singular design matrix.
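For example, something like this (untested, column names made up to match your description):

    import patsy
    import statsmodels.api as sm

    # patsy builds the design matrix and drops the redundant reference level
    # of each categorical variable automatically
    y, X = patsy.dmatrices("click ~ C(sex) + C(language) + C(country) + C(os)",
                           data=df, return_type="dataframe")
    res = sm.Logit(y, X).fit()

    # or, equivalently, through the formula interface (requires 0.5dev)
    import statsmodels.formula.api as smf
    res = smf.logit("click ~ C(sex) + C(language) + C(country) + C(os)",
                    data=df).fit()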

Josef

>
> Regards.
> Toni.
>
> --
> You received this message because you are subscribed to the Google Groups
> "pystatsmodels" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pystatsmodel...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

josef...@gmail.com

Feb 11, 2013, 9:58:51 AM
to pystat...@googlegroups.com
IIRC, perfect separation doesn't result in a linalg error, "just"
non-convergence.
Also IIRC, only Logit and Probit have the perfect separation check so far.

Josef


>
> Skipper

Skipper Seabold

Feb 11, 2013, 10:00:50 AM
to pystat...@googlegroups.com
On Mon, Feb 11, 2013 at 9:58 AM, <josef...@gmail.com> wrote:
> On Mon, Feb 11, 2013 at 9:52 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Mon, Feb 11, 2013 at 9:41 AM, Toni Cebrián <anc...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>>    At work I'm trying to predict the probability of click in a banner. The
>>> exogenous variables are all categorical (sex, language, country, OS, etc...)
>>> and the response is binary (click Yes or Not). In the first step of the
>>> exploratory phase I planned to use statsmodel.api.Logit, but I'm getting a
>>> "LinAlgError: Singular matrix". Of course, that is because some rows in the
>>> exogenous are duplicated and I assume that internally the solver is doing
>>> some matrix inversion.
>>>     So the question is, how should I reformulate the problem in order to
>>> use logistic regression with categorical variables in the statsmodel
>>> package?
>>>
>>
>> I suspect that you are encountering perfect separation in your data. Can you
>> try fitting the Logit model with method = "powell" or method = "bfgs" to
>> confirm?
>>
>> Aside: why do some of these methods raise a PerfectSeparationError and some
>> do not? Are we not being careful enough?
>
> IIRC, perfect separation doesn't result in a linalg error, "just"
> non-convergence.
> Also IIRC, only Logit and Probit have the perfect separation check so far.


I just tried a toy example, which is why I asked. I got a failure to invert the Hessian with Newton and BFGS and a PerfectSeparationError using Powell, so either way there's a problem somewhere.

Skipper

josef...@gmail.com

Feb 11, 2013, 10:24:01 AM
to pystat...@googlegroups.com
open an issue and attach the toy example, so we can look into this,
and see where we can make this more robust.

Newton can get a Hessian that is not invertible independently of perfect
separation, if it evaluates at points that are not nice.
(we might not have a collection of the "not nice" cases)
I don't know when we can get a singular Jacobian with BFGS.

Toni Cebrián

Feb 11, 2013, 10:56:19 AM
to pystat...@googlegroups.com
Thanks for your quick reply.

For the record, with method:

* newton, raised an exception "LinAlgError: Singular matrix"
* powell, Finished OK, but sometimes a warning about inverting a Hessian is seen. When I do the res.summary() I get the following exception "ValueError: need covariance of parameters for computing (unnormalized) covariances"
* bfgs, Finished OK. When I do the res.summary() I get the exception "ValueError: need covariance of parameters for computing (unnormalized) covariances"

Just to give you an idea of the variables, I'm using country, which has about 140 categorical levels and expands into 140 new binary features (most of the observations fall in the US and ES columns). Language is the same, with about 20 different levels.

So from your mails, I assume that the best way to proceed in the meantime would be to:

* add some noise to the features so they aren't linearly dependent
* or, use the GLM module

am I right?

Toni




2013/2/11 <josef...@gmail.com>

josef...@gmail.com

Feb 11, 2013, 11:16:14 AM
to pystat...@googlegroups.com
On Mon, Feb 11, 2013 at 10:56 AM, Toni Cebrián <anc...@gmail.com> wrote:
> Thanks for your quick reply.
>
> For the record, with method:
>
> * newton, raised an exception "LinAlgError: Singular matrix"
> * powell, Finished OK, but sometimes a warning about inverting a Hessian is
> seen. When I do the res.summary() I get the following exception "ValueError:
> need covariance of parameters for computing (unnormalized) covariances"
> * bfgs, Finished OK. When I do the res.summary() I get the exception
> "ValueError: need covariance of parameters for computing (unnormalized)
> covariances"

I get the same with Skipper's toy example: bfgs finishes the parameter
estimates, but I think the Hessian is not invertible, so we cannot get
the cov_params.

You could also check whether you have perfect separation:

res.predict()
gives you the predicted probabilities. If there are values very close
to zero or one, then you have perfect prediction.
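for example (the cutoff is arbitrary):

    import numpy as np

    pred = res.predict()
    # fraction of fitted probabilities that are numerically 0 or 1
    print((pred < 1e-6).mean(), (pred > 1 - 1e-6).mean())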

>
> Just for you to have an idea of the variables, I'm using country that has
> about 140 categorical levels, and expands into 140 new binary features (most
> of them in US and ES columns). Language is the same with about 20 different
> categorical levels.

What's your sample size, do you still have more observations than variables?

>
> So from your mails, I assume that the best way to proceed in the meantime
> would be to:
>
> * add some noise to the features so they aren't linearly dependent
> * or, use the GLM module

my first recommendation is to drop linearly dependent variables, so
you get interpretable identified parameters.

Josef

Toni Cebrián

Feb 11, 2013, 12:16:38 PM
to pystat...@googlegroups.com



2013/2/11 <josef...@gmail.com>
> On Mon, Feb 11, 2013 at 10:56 AM, Toni Cebrián <anc...@gmail.com> wrote:
>
>> Thanks for your quick reply.
>>
>> For the record, with method:
>>
>> * newton, raised an exception "LinAlgError: Singular matrix"
>> * powell, Finished OK, but sometimes a warning about inverting a Hessian is
>> seen. When I do the res.summary() I get the following exception "ValueError:
>> need covariance of parameters for computing (unnormalized) covariances"
>> * bfgs, Finished OK. When I do the res.summary() I get the exception
>> "ValueError: need covariance of parameters for computing (unnormalized)
>> covariances"
>
> I get the same with Skipper's toy example, bfgs finishes the parameter
> estimate, but I think the Hessian is not invertible so we cannot get
> the cov_params.
>
> You could also check whether you have perfect separation also
>
> res.predict()
> gives you the predicted probabilities. If there are values very close
> to zero or one, then you have perfect prediction.

Yes, I'm getting perfect 0s or 1s in at least 1% of the data.

>> Just for you to have an idea of the variables, I'm using country that has
>> about 140 categorical levels, and expands into 140 new binary features (most
>> of them in US and ES columns). Language is the same with about 20 different
>> categorical levels.
>
> What's your sample size, do you still have more observations than variables?

I have around 200 variables and I have as many instances as I'd like, but I'm testing with 4000.

>> So from your mails, I assume that the best way to proceed in the meantime
>> would be to:
>>
>> * add some noise to the features so they aren't linearly dependent
>> * or, use the GLM module
>
> my first recommendation is to drop linearly dependent variables, so
> you get interpretable identified parameters.

How could I do this?

Toni

josef...@gmail.com

Feb 11, 2013, 12:31:35 PM
to pystat...@googlegroups.com
That's good, no need to worry about the degrees of freedom, nobs - k_vars,
being too small or negative.

>
>>
>>
>> >
>> > So from your mails, I assume that the best way to proceed in the
>> > meantime
>> > would be to:
>> >
>> > * add some noise to the features so they aren't linearly dependent
>> > * or, use the GLM module
>>
>> my first recommendation is to drop linearly dependent variables, so
>> you get interpretable identified parameters.
>
>
> How could I do this?

If you are building the 0-1 dummy variables yourself, then you should
drop one of the levels.
With 140 levels, you should only include 139 dummy variables.
If you also include the constant, then you should drop one dummy of
each categorical variable.

np.linalg.matrix_rank can be used to check the rank of the exog.
If exog is still singular after dropping the extra dummies (with
interaction effects for example), then you need to find the extra
linear combinations.

Using patsy and formulas, or sm.tools.categorical, would drop the extra
dummies for you.
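By hand it would look roughly like this (untested, column names made up):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # build dummies per categorical column and drop the first level of each,
    # so that together with the constant the design matrix is not singular
    parts = []
    for col in ["sex", "language", "country", "os"]:
        dummies = pd.get_dummies(df[col], prefix=col)
        parts.append(dummies.iloc[:, 1:])        # drop one reference level
    X = sm.add_constant(pd.concat(parts, axis=1)).astype(float)

    # full column rank means no remaining linear dependence
    print(X.shape[1], np.linalg.matrix_rank(np.asarray(X)))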

Josef