conditional logit


Andrew Marder

Aug 11, 2013, 3:57:41 PM8/11/13
to pystat...@googlegroups.com
Dear Py Stats Modelers,

I tried my hand at implementing a conditional logit model by extending statsmodels.base.model.GenericLikelihoodModel. If you are interested the code is here [1]. Using a simple example I show this implementation yields parameter estimates reasonably close to those found by Stata. Unfortunately, my implementation is more than 400 times slower than Stata. For my needs, I will probably stick with using Stata. But, if anyone's interested in writing a conditional logit model for StatsModels this might be a useful first pass, and I'll be happy to hear how to speed up my code ;)

Thanks for sharing such a wonderful library,

Andrew

josef...@gmail.com

Aug 11, 2013, 4:45:08 PM8/11/13
to pystat...@googlegroups.com
On Sun, Aug 11, 2013 at 3:57 PM, Andrew Marder
<andrew....@gmail.com> wrote:
> Dear Py Stats Modelers,
>
> I tried my hand at implementing a conditional logit model by extending
> statsmodels.base.model.GenericLikelihoodModel. If you are interested the
> code is here [1]. Using a simple example I show this implementation yields
> parameter estimates reasonably close to those found by Stata. Unfortunately,
> my implementation is more than 400 times slower than Stata. For my needs, I
> will probably stick with using Stata. But, if anyone's interested in writing
> a conditional logit model for StatsModels this might be a useful first pass,
> and I'll be happy to hear how to speed up my code ;)

We also currently have a Google Summer of Code project, with Ana
Martinez as the student, whose objective is to implement several
discrete choice models.

The current clogit version is in this branch
https://github.com/AnaMP/statsmodels/compare/clogit

If you are interested, you could review the clogit branch, give us
some feedback, and let us know whether you could use it instead of
Stata and what features we still need to add.

A quick look at your code:

what is "import stata"? I didn't know we can import stata into python.

pandas might be slow in the log-likelihood because it is called very often.
Just a guess: can you define `self.data.groupby('group')` outside of
the loglikelihood, in __init__ for example and reuse it.
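Something like this sketch (hypothetical code, not your clogit.py; it assumes each choice group is a consecutive block of rows with the same number of alternatives, so the grouping can be done once up front with a reshape instead of groupby on every call):

```python
import numpy as np

# Hypothetical sketch: precompute the group structure once (as in
# __init__) instead of calling groupby inside loglike. Assumes each
# choice group is a consecutive block with the same number of
# alternatives.
def make_loglike(X, y, n_alts):
    """X: (n_obs, k) alternative attributes; y: (n_obs,) one-hot choices."""
    k = X.shape[1]
    X3 = X.reshape(-1, n_alts, k)          # (groups, alts, k), built once
    y3 = y.reshape(-1, n_alts)

    def loglike(beta):
        xb = X3 @ beta                     # (groups, alts)
        m = xb.max(axis=1, keepdims=True)  # stabilized log-sum-exp
        lse = np.log(np.exp(xb - m).sum(axis=1)) + m[:, 0]
        # log P(chosen) = xb_chosen - logsumexp(xb), summed over groups
        return ((xb * y3).sum(axis=1) - lse).sum()

    return loglike

# toy data: 2 groups x 3 alternatives, 2 attributes
X = np.array([[1., 0.], [0., 1.], [1., 1.],
              [2., 0.], [0., 2.], [1., 0.]])
y = np.array([1, 0, 0, 0, 1, 0])
ll = make_loglike(X, y, n_alts=3)
print(ll(np.array([0.5, -0.2])))
```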

I don't remember whether fit(method='nm') is still the default in
GenericLikelihoodModel, method='bfgs' should be faster and more
precise at the optimum but might not converge in some cases.

Another issue that Skipper reported for NegativeBinomial is that the
numerical derivatives are much slower than analytical derivatives.
When those are available, then it improves the performance
significantly.

Another possibility: If we can have a fast approximate initial
estimate, then numerical optimization can start from a better position
and will converge faster.

I guess we will have a reasonably fast version by the end of summer or
early fall.

>
> Thanks for sharing such a wonderful library,

Thank you,

Josef

>
> Andrew
>
> [1]: https://github.com/amarder/clogit.py

Ana Martínez Pardo

Aug 12, 2013, 11:32:52 AM8/12/13
to pystat...@googlegroups.com
On 11/08/13 22:45, josef...@gmail.com wrote:
> On Sun, Aug 11, 2013 at 3:57 PM, Andrew Marder
> <andrew....@gmail.com> wrote:
>> Dear Py Stats Modelers,
>>
>> I tried my hand at implementing a conditional logit model by extending
>> statsmodels.base.model.GenericLikelihoodModel. If you are interested the
>> code is here [1]. Using a simple example I show this implementation yields
>> parameter estimates reasonably close to those found by Stata. Unfortunately,
>> my implementation is more than 400 times slower than Stata. For my needs, I
>> will probably stick with using Stata. But, if anyone's interested in writing
>> a conditional logit model for StatsModels this might be a useful first pass,
>> and I'll be happy to hear how to speed up my code ;)
>
> We also currently have a Google Summer of Code project, with Ana
> Martinez as the student, whose objective is to implement several
> discrete choice models.
>
> The current clogit version is in this branch
> https://github.com/AnaMP/statsmodels/compare/clogit
>
> If you are interested, you could review the clogit branch, give us
> some feedback, and let us know whether you could use it instead of
> Stata and what features we still need to add.

Please let me know if you encounter any problems or need any
clarification. The code is still crude and a bit messy.

>
> A quick look at your code:
>
> what is "import stata"? I didn't know we can import stata into python.
>
> pandas might be slow in the log-likelihood because it is called very often.
> Just a guess: can you define `self.data.groupby('group')` outside of
> the loglikelihood, in __init__ for example and reuse it.
>
> I don't remember whether fit(method='nm') is still the default in
> GenericLikelihoodModel, method='bfgs' should be faster and more
> precise at the optimum but might not converge in some cases.

Nelder–Mead is the default method in GenericLikelihoodModel.
I set the method to Newton, because it should find the maximum in a
few iterations: in a model that is linear in the parameters, the
sample log-likelihood is globally concave in the parameters.

>
> Another issue that Skipper reported for NegativeBinomial is that the
> numerical derivatives are much slower than analytical derivatives.
> When those are available, then it improves the performance
> significantly.
>
> Another possibility: If we can have a fast approximate initial
> estimate, then numerical optimization can start from a better position
> and will converge faster.
>
> I guess we will have a reasonably fast version by the end of summer or
> early fall.
>
>>
>> Thanks for sharing such a wonderful library,
>
> Thank you,
>
> Josef
>
>>
>> Andrew
>>
>> [1]: https://github.com/amarder/clogit.py

Ana

Andrew Marder

Aug 13, 2013, 10:40:40 AM8/13/13
to pystat...@googlegroups.com
Hi Josef and Ana,

Thanks for writing back so quickly! I've put a few comments below the fold...

On Sun, Aug 11, 2013 at 4:45 PM, <josef...@gmail.com> wrote:
On Sun, Aug 11, 2013 at 3:57 PM, Andrew Marder
<andrew....@gmail.com> wrote:
> Dear Py Stats Modelers,
>
> I tried my hand at implementing a conditional logit model by extending
> statsmodels.base.model.GenericLikelihoodModel. If you are interested the
> code is here [1]. Using a simple example I show this implementation yields
> parameter estimates reasonably close to those found by Stata. Unfortunately,
> my implementation is more than 400 times slower than Stata. For my needs, I
> will probably stick with using Stata. But, if anyone's interested in writing
> a conditional logit model for StatsModels this might be a useful first pass,
> and I'll be happy to hear how to speed up my code ;)

We also currently have a Google Summer of Code project, with Ana
Martinez as the student, whose objective is to implement several
discrete choice models.

The current clogit version is in this branch
https://github.com/AnaMP/statsmodels/compare/clogit

If you are interested, you could review the clogit branch, give us
some feedback, and let us know whether you could use it instead of
Stata and what features we still need to add.

I am having a hard time understanding / using the code in dcm_clogit.py. As an outsider, it looks like it is implementing a multinomial logit model like R's mlogit package. Is that accurate? Here's a quote from Wooldridge relating conditional and multinomial logit models:

"The conditional logit model is intended specifically for problems where consumer or firm choices are at least partly made based on observable attributes of each alternative. The utility level of each choice is assumed to be a linear function in choice attributes, x_ij, with common parameter vector beta. This turns out to actually contain the multinomial logit model as a special case by appropriately choosing x_ij."


A quick look at your code:

what is "import stata"? I didn't know we can import stata into python.

I wrote a very bare-bones wrapper to call Stata, estimate a model, and bring the results into Python. Feel free to tinker if this looks useful:

https://github.com/amarder/StataPy

pandas might be slow in the log-likelihood because it is called very often.
Just a guess: can you define `self.data.groupby('group')` outside of
the loglikelihood, in __init__ for example and reuse it.

This change sped up the code by about 2.5 seconds (2%), a nice quick win.
 

I don't remember whether fit(method='nm') is still the default in
GenericLikelihoodModel, method='bfgs' should be faster and more
precise at the optimum but might not converge in some cases.

Here are the runtimes in seconds for various maximization methods:
nm (default): 173
bfgs: manually stopped after 240
newton: 114

Ana, great tip on newton, this is a 34% speed up!
 

Another issue that Skipper reported for NegativeBinomial is that the
numerical derivatives are much slower than analytical derivatives.
When those are available, then it improves the performance
significantly.

I think this is where we should get a big speed up. I took a quick stab at taking the derivative, but it looks pretty tough, and I gave up.
 

Another possibility: If we can have a fast approximate initial
estimate, then numerical optimization can start from a better position
and will converge faster.

Another good idea. I think people sometimes use parameter estimates from the standard logit (without fixed effects) as the initial position.
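For example, the pooled-logit starting values could come from something like this rough sketch (a hand-rolled Newton loop on made-up data; in practice sm.Logit(y, X).fit().params would do the same pooled step):

```python
import numpy as np

# Rough sketch: estimate a pooled binary logit (ignoring the group
# structure) with a few Newton steps and use the result as
# start_params for the conditional logit optimizer.
def pooled_logit_start(X, y, steps=25):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # pooled logit probabilities
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        # small ridge keeps the solve stable if p saturates
        beta += np.linalg.solve(hess + 1e-8 * np.eye(len(beta)), grad)
    return beta

# usage sketch: model.fit(start_params=pooled_logit_start(X, y), method='newton')
```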
 

I guess we will have a reasonably fast version by the end of summer or
early fall.

Sounds good to me. Happy to beta test,

Andrew

Ana Martinez Pardo

Aug 13, 2013, 12:29:36 PM8/13/13
to pystat...@googlegroups.com
Sorry, it's still very crude. We are only testing against the mlogit
package so far. In DCM_clogit.py, the class CLogit comes first,
followed by three examples: the first replicates Greene's example
(two alternative-specific variables with generic coefficients and one
alternative-specific variable with an alternative-specific
coefficient), and the next two are variants of it. Each example is
followed by the R results used for testing. R's mlogit package treats
conditional logit as part of the multinomial logit model, not as a
separate model. I think there are two schools of thought on this:
those that separate the two, and those that treat them together. If
all the independent variables are case specific, the end result is
the same: the two models are identical. For now our models handle
only alternative-specific variables, but they should be able to work
with both types (alternative specific and/or individual/case
specific). We have only started to discuss how to handle data entry,
in issue #941.

>
> A quick look at your code:
>
> what is "import stata"? I didn't know we can import stata into python.
>
>
> I wrote a very bare bones wrapper to call Stata, estimate a model, and
> bring the results into Python. Feel free to tinker if this looks useful:
>
> https://github.com/amarder/StataPy
>
>
> pandas might be slow in the log-likelihood because it is called very
> often.
> Just a guess: can you define `self.data.groupby('group')` outside of
> the loglikelihood, in __init__ for example and reuse it.
>
>
> This change sped up the code by about 2.5 seconds (2%), a nice quick win.
>
>
> I don't remember whether fit(method='nm') is still the default in
> GenericLikelihoodModel, method='bfgs' should be faster and more
> precise at the optimum but might not converge in some cases.
>
>
> Here are the runtimes in seconds for various maximization methods:
> nm (default): 173
> bfgs: manually stopped after 240
> newton: 114
>
> Ana, great tip on newton, this is a 34% speed up!
>
>
> Another issue that Skipper reported for NegativeBinomial is that the
> numerical derivatives are much slower than analytical derivatives.
> When those are available, then it improves the performance
> significantly.
>
>
> I think this is where we should get a big speed up. I took a quick stab
> at taking the derivative, but it looks pretty tough, and I gave up.


I have worked out the analytical derivative (I'll put it on GitHub)
and will look into how to include it.
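For the record, the conditional logit score has a standard closed form: d ll / d beta = sum over groups and alternatives of (y_ij - p_ij) x_ij, where p_ij is the within-group softmax of X beta. A rough numpy sketch (illustrative only, not the branch code):

```python
import numpy as np

# Illustrative sketch of the analytical conditional logit score:
# d ll / d beta = sum_ij (y_ij - p_ij) x_ij, with p_ij the
# within-group softmax of X beta.
def score(beta, X3, y3):
    """X3: (groups, alts, k) attributes; y3: (groups, alts) one-hot choices."""
    xb = X3 @ beta                               # (groups, alts)
    e = np.exp(xb - xb.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)         # choice probabilities
    return np.einsum('ga,gak->k', y3 - p, X3)    # gradient, shape (k,)
```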

>
>
> Another possibility: If we can have a fast approximate initial
> estimate, then numerical optimization can start from a better position
> and will converge faster.
>
>
> Another good idea, I think sometimes people use parameter estimates from
> the standard logit (without fixed effects) as a good initial position.

Good idea! I'll try it!

>
>
> I guess we will have a reasonably fast version by the end of summer or
> early fall.
>
>
> Sounds good to me. Happy to beta test,
>
> Andrew
>
Thank you! That would be great!
Ana

Ed Rahn

Oct 27, 2014, 7:54:18 PM10/27/14
to pystat...@googlegroups.com
Hi Andrew,
Do you have any updated code? Can you recommend any changes other than the ones listed here?

thanks,
Ed