What is the status of mixed logit?


Jeff

Mar 31, 2015, 11:26:10 AM
to pystat...@googlegroups.com
The statsmodels website says:

Google Summer of Code 2013: We have had two students accepted to work on statsmodels as part of the Google Summer of Code 2013. The first project will focus on improving the discrete choice models, adding, for example, Conditional Logit, Nested Logit, and Mixed Logit models. 

What is the status of that project? A year ago here on the Google Group Josef said "mixed logit works but is unfinished and doesn't have unit tests." Any developments?

Thanks!

josef...@gmail.com

Mar 31, 2015, 12:49:55 PM
to pystatsmodels
It's the same as a year ago; it's still stalled.

https://github.com/statsmodels/statsmodels/pull/1605
is the branch; it was rebased last year and updated to work on Python 3.

Josef




Jeff

May 2, 2015, 12:57:00 AM
to pystat...@googlegroups.com
What would be the steps for using that code? Assume I'm not that savvy with git, which is true.

Do I understand correctly that the code appears to be working, though there are no unit tests so there are no guarantees? Is there any other way in which it is "unfinished"?

josef...@gmail.com

May 2, 2015, 10:49:04 AM
to pystatsmodels
On Sat, May 2, 2015 at 12:57 AM, Jeff <jeffa...@gmail.com> wrote:
What would be the steps for using that code? Assume I'm not that savvy with git, which is true.

Theoretically it's just checking out or downloading the branch.

However, it's a year behind master and won't have the newer parts of statsmodels. (And we cannot import two different versions of statsmodels at the same time).
It looks like I converted the branch to a common Python 2/Python 3 codebase last year.

Do I understand correctly that the code appears to be working, though there are no unit tests so there are no guarantees? Is there any other way in which it is "unfinished"?

About terminology since we just had a question about Mixed Logit, Poisson, GLM and a new GSOC project:
The project in 2013 was all about variations on Multinomial Logit or discrete choice models. I'm not sure how well it would work for binary Logit, which would be just a special case.

The emphasis of the GSOC project was to get the core estimation parts. I guess there is also work left to get formula integration and more post estimation results.

Based on running the mxlogit example (on python 3.4), it looks like mxlogit still has bugs (e.g. hessian has one row and column of zeros *), and the return class is not properly connected. 

(* After dropping the zero column in hessian, the parameter standard errors, bse, look like they agree at around 2 decimals/digits with R except for the standard deviation of the random component. It looks like a duplicated intercept in the parameter list, specifically I think the random effects distributions should have mean zero.)

clogit looks good, but Ana ran out of time for mxlogit, it was added at the very end of GSOC, AFAIR.

another point: 
Ana worked with 1-d parameter arrays which should make most of the generic postestimation results like t_test and similar work without changes. (discrete.MNLogit has 2-d parameters which are still a headache.)

and one more:
Numerical integration uses Halton sequences, which could be generalized to other distributions than the Gaussian. There is no version using Gaussian quadrature for integration, but it looks like the design is modular enough that only one method needs to be enhanced to support any finite point approximation to a distribution of random effects. 
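For illustration only, here is a rough sketch of the idea (mine, not the PR's actual code): map Halton points to Gaussian draws with the normal ppf and average the logit probability over the draws. The function names and single-coefficient setup are made up for the example.

```python
import numpy as np
from scipy.stats import norm

def halton(n, base=3):
    """First n points of the base-`base` Halton sequence, all in (0, 1)."""
    seq = np.empty(n)
    for i in range(1, n + 1):
        f, x, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i - 1] = x
    return seq

def simulated_prob(x, mu, sigma, ndraws=200):
    """Average a binary logit probability over draws of a random
    coefficient b ~ N(mu, sigma**2); names are illustrative only."""
    b = mu + sigma * norm.ppf(halton(ndraws))  # uniforms -> normal draws
    return np.mean(1.0 / (1.0 + np.exp(-b * x)))
```

Replacing the Halton/ppf step with Gauss-Hermite nodes and weights (e.g. np.polynomial.hermite.hermgauss, with the usual sqrt(2) rescaling) would be the quadrature variant; the point is that only the draw-generation step needs to change.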

Based on this, I don't think it would take long to get mxlogit into a usable, tested state. Some improvements (like an analytical score and hessian, and more features) would be more open-ended.


PS: 
After being puzzled by the random effects integration, I don't think mxlogit has been extended to panel data.
AFAICS by partially skimming the vignette for R's mlogit http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf ,
this is a Logit random utility model with unobserved cross-sectional heterogeneity, not for repeated observations of the choice for each individual.

Josef

josef...@gmail.com

May 2, 2015, 11:58:32 AM
to pystatsmodels
On Sat, May 2, 2015 at 10:49 AM, <josef...@gmail.com> wrote:


PS: 
After being puzzled by the random effects integration, I don't think mxlogit has been extended to panel data.
AFAICS by partially skimming the vignette for R's mlogit http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf ,
this is a Logit random utility model with unobserved cross-sectional heterogeneity, not for repeated observations of the choice for each individual.


partial follow up:
Aside from data handling, adding panel data might be only a change in one line, or a few with proper support.
np.multiply.reduceat(cdf, group_idx) in cdf_average
where group_idx = np.concatenate(([0], np.where(groups[1:] != groups[:-1])[0] + 1))
assuming data is sorted by group indicator.
(and ignoring possible numerical problems with repeatedly multiplying tiny numbers.)


Looks like this pattern would work also for MixedGLM, AFAICS.
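Concretely, the reduceat step sketched above would look something like this (a sketch with made-up numbers, assuming the observations are sorted by the group indicator):

```python
import numpy as np

# per-observation choice probabilities and their (sorted) group labels;
# the numbers here are made up for illustration
cdf = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
groups = np.array([1, 1, 1, 2, 2, 2])

# start index of each group, assuming data is sorted by the group indicator
group_idx = np.concatenate(([0], np.where(groups[1:] != groups[:-1])[0] + 1))

# product of the probabilities within each group
group_prob = np.multiply.reduceat(cdf, group_idx)
# -> approximately [0.504, 0.12]
```

(As the parenthetical above warns, for long panels the repeated product underflows, so a real implementation would work with sums of logs instead.)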

Josef

josef...@gmail.com

May 2, 2015, 10:53:39 PM
to pystatsmodels
On Sat, May 2, 2015 at 11:58 AM, <josef...@gmail.com> wrote:



Based on running the mxlogit example (on python 3.4), it looks like mxlogit still has bugs (e.g. hessian has one row and column of zeros *), and the return class is not properly connected. 

(* After dropping the zero column in hessian, the parameter standard errors, bse, look like they agree at around 2 decimals/digits with R except for the standard deviation of the random component. It looks like a duplicated intercept in the parameter list, specifically I think the random effects distributions should have mean zero.)

Besides redundant random effects mean, there is also a problem with the lower bound for variances.

After changing these two things, I'm getting reasonably close to R's mlogit:
2 to 3 digits on the mean parameters, and statistically significant variances.
bse is close at two digits, maybe a bit larger than R's, but one bse differs by 14%.
maximized loglikelihood llf, ours: -197.687, R: -197.26

I guess we are down to numerical approximation errors in numerical integration of random effects, and numerical derivatives in score and hessian. I don't have R available right now to check how much the results in R change with changing one of the optimization or numerical integration options.
I only checked two examples for which the results are in the PR. It needs more examples to see how robust this is.

MixedGLM will have similar issues, and we will have to look at numerical issues more this summer during GSOC.

Josef

Jeff

May 4, 2015, 5:40:01 AM
to pystat...@googlegroups.com
Thanks, Josef.

The thing that I'm now after is actually the clogit. I poked around in its code, and I think I determined that the code is set up to handle a set of observations that all have the same choices. So, for example:

Observation 1
Choice 1: bus
Choice 2: car
Choice 3: bike (choice selected!)

Observation 2
Choice 1: bus
Choice 2: car (choice selected!)
Choice 3: bike

However, my data looks like this:
Observation 1
Choice 1: bus
Choice 2: car
Choice 3: bike (choice selected!)

Observation 2
Choice 1: bus (choice selected!)
Choice 2: bike 

R's survival package performs clogit and handles this case ok, as described here:

Am I correct in that the current statsmodels code needs every observation to have the same number of alternatives?

Jeff

May 4, 2015, 5:47:27 AM
to pystat...@googlegroups.com
Worth noting: survival's clogit is implemented as a coxph of a specific form, with strata. It looks like statsmodels might be able to pull this off?


Can it handle strata? 

Jeff

May 4, 2015, 8:20:30 AM
to pystat...@googlegroups.com
It can! It totally can! That's great!

Unfortunately, it appears to take around 4x as long to compute. Also, statsmodels keeps telling me my test datasets are singular, while survival tells me they're not. The singular matrix error prevents me from scaling up the dataset to see if that longer runtime remains with a large dataset.

josef...@gmail.com

May 4, 2015, 8:33:13 AM
to pystatsmodels
On Mon, May 4, 2015 at 5:47 AM, Jeff <jeffa...@gmail.com> wrote:
Worth noting: survival's clogit is implemented as a coxph of a specific form, with strata. It looks like statsmodels might be able to pull this off?


Can it handle strata? 

Yes, it has a strata option.

I have no idea how this would work, I've never heard of this.

If you get something that works for this, then I would be interested in including it in statsmodels, but it's too much of a detour for me to dig in myself.
 


On Monday, May 4, 2015 at 5:40:01 PM UTC+8, Jeff wrote:
However, my data looks like this:
Observation 1
Choice 1: bus
Choice 2: car
Choice 3: bike (choice selected!)

Observation 2
Choice 1: bus (choice selected!)
Choice 2: bike 

I never looked into this. The question came up very briefly during the discrete choice GSOC, but we didn't try to find a solution, AFAIR.

Is it possible to add a "fake" option 'car' with, for example, a very high cost component or dummy? 
Maybe not, because the logit function cannot assign exactly zero probability.

Stata's asclogit mentions that alternatives can differ by case, but doesn't say how that's handled.

Josef

Kerby Shedden

May 4, 2015, 8:33:18 AM
to pystat...@googlegroups.com
Do you have an intercept in your Cox model?  There should be no intercept.  If there is one, it will give you the problems you are seeing.

If this solves the singularity problem I would be interested to try to help you with  the performance issues.

Kerby

josef...@gmail.com

May 4, 2015, 8:41:47 AM
to pystatsmodels
On Mon, May 4, 2015 at 8:20 AM, Jeff <jeffa...@gmail.com> wrote:
It can! It totally can! That's great!

Great! Example? Code?
 

Unfortunately, it appears to take at 4x as long to compute. Also, statsmodels keeps telling me test datasets are singular, while survival tells me they're not. The singular matrix error prevents me from scaling up the dataset to see if that 4x longer runtime remains with a large dataset.

Check if you have a constant/intercept in the data matrix. According to Kerby that's not allowed.

We haven't done a performance review of Cox PH, AFAIR. So there might be some slack left.

General recommendation: try to find good starting values, if the defaults are not very good yet.

Josef

Jeff

May 4, 2015, 9:39:45 AM
to pystat...@googlegroups.com
No intercept in the data. It looks like the complaints about it being singular were due to an interaction of two floats. Removing that interaction removed the singular error. I'll have to put it back in later, but that's a problem for later.

Detailed code example forthcoming, but for now consider this: It looks like the computation time issue may also be gone, though only for smaller datasets. With larger datasets I can't actually get statsmodels to finish fitting, and I think this is due to the 'efron' setting for the ties.

In R's survival package, the relevant code is:

analysis <- clogit(myformula, df_sample)

clogit has an option that is relevant for us right now:
"method: use the correct (exact) calculation in the conditional likelihood or one of the approximations"
default = 'exact'. Other options are "approximate", "efron", "breslow"

clogit, in turn, just calls coxph, though with particular settings. Here is coxph's documentation on 'exact' vs 'efron' and 'breslow'

"ties: a character string specifying the method for tie handling. If there are no tied death times all the methods are equivalent. Nearly all Cox regression programs use the Breslow method by default, but not this one. The Efron approximation is used as the default here, it is more accurate when dealing with tied death times, and is as efficient computationally. The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive." Note: So clogit switches the default from 'efron' to 'exact'

If I run my dataset with method='exact' (the default), it takes 30 seconds. That's about 2.5 million data points, ~20,000 strata, and 6 variable columns.
If I run the analysis with method='efron' or method='breslow', it's still running 30 minutes later.

With statsmodels, the only options are 'efron' and 'breslow', and indeed with either the model doesn't finish fitting.

Are the 'efron and 'breslow' fitting methods necessary? Is it possible to use whatever 'exact' methodology the survival package is using? It looks like they use an entirely different fitter for 'exact'.

Here is the R code, and here is the C code that the R code calls.

For further reference, this is what the coxph call looks like:

coxph(formula = Surv(rep(1,n_data_points), classification_youre_trying_to_model) ~ variable1 + variable2 + variable3 + strata(your_strata), data, method='exact')

The call to PHReg is:
PHReg.from_formula('1 ~ variable1 + variable2 + variable3', data, status=classification_youre_trying_to_model, strata=your_strata, ties='efron')

Jeff

May 4, 2015, 10:00:12 AM
to pystat...@googlegroups.com
To be clear:

When using 'efron' or 'breslow', statsmodels is faster than survival. When using 'exact', survival is far faster than statsmodels using 'efron' or 'breslow'.

josef...@gmail.com

May 4, 2015, 10:41:24 AM
to pystatsmodels
On Mon, May 4, 2015 at 10:00 AM, Jeff <jeffa...@gmail.com> wrote:
To be clear:

When using 'efron' or 'breslow', statsmodels is faster than survival. When using 'exact', survival is far faster than statsmodels using 'efron' or 'breslow'.

That statsmodels is faster than survival for the first two is a surprise.
I just wanted to say that it will be difficult to compete with `survival` for quite some time yet, if they do all the heavy lifting in C.

Note: R survival is GPL, so it's license incompatible and we cannot copy or translate code.

 


On Monday, May 4, 2015 at 9:39:45 PM UTC+8, Jeff wrote:
clogit has an option that is relevant for us right now:
"method: use the correct (exact) calculation in the conditional likelihood or one of the approximations"
default = 'exact'. Other options are "approximate", "efron", "breslow"


As far as I understand from the reference to matched case-control studies, this is the same as `clogit` in Stata (or a special case of it). 

Our clogit in the PR corresponds to asclogit in Stata and is a discrete choice model.

My plan is to add a model like Stata's and R's clogit as ConditionalLogit (or LogitConditional, and similarly PoissonConditional). That's how I know them from the econometrics side.

There are no specific plans yet, but it has been on my wishlist for a while, and has an open issue. Skipper has a draft version for ConditionalPoisson, IIUC. (A long time ago I was more familiar with Generalized Method of Moments versions or interpretations of it.)

I don't know how an explicitly coded version for ConditionalLogit would perform relative to the Cox PH exact implementation.

about ties `exact` in Stata stcox  http://www.stata.com/manuals13/ststcox.pdf

"
When we view time as discrete, the exact partial method (option exactp) is the final method available. This approach is equivalent to computing conditional logistic regression where the groups are defined by the risk sets and the outcome is given by the death variable. This is the slowest method to use and can take a significant amount of time if the number of tied failures and the risk sets are large. 
"

so this sounds like we are back to implementing ConditionalLogit or at least some parts (likelihood function) of it.

Josef

Jeff Alstott

May 4, 2015, 11:00:39 AM
to pystat...@googlegroups.com
"That statsmodels is faster than survival for the first two is a surprise. I just wanted to say that it will be difficult to compete with `survival` for quite some time yet, if they do all the heavy lifting in C."
Maybe your code is just good! It does look like they also do 'efron' and 'breslow' in C.

"so this sounds like we are back to implementing ConditionalLogit or at least some parts (likelihood function) of it."
It does. But now there's a bonus of using the same code as an expansion for PHReg, too!

I, personally, am not interested in the asclogit (McFadden's choice model), so count that as one vote for implementing what you're calling ConditionalLogit ahead of the multinomial logit. I look forward to the day that that fast implementation is created for Python, so that I don't need to use R and RPy. Doing this analysis in R with 100GB of data is not my cup of tea.


josef...@gmail.com

May 4, 2015, 12:41:44 PM
to pystatsmodels
An explicit (possibly made up and smaller) example would still be helpful.

I was looking at how difficult the likelihood for the conditional logit is. Both survival's clogit and Stata's clogit seem to mention the same recursion.
The runtime is proportional to the number of periods per individual times the number of ones per individual (summed over individuals).

How many positive events do you have per individual (stratum?)? Is there a large difference in the number of observations between individuals, and a large difference in the number of ones per individual?

My guess is that the recursion can be largely vectorized if the dataset is not too "unbalanced", at the cost of some increase in memory. If the data is very unbalanced or memory becomes an issue, then the recursion might have to be cythonized, or maybe vectorized in batches.

I don't see the connection to multiple categories of the dependent variable (like transport modes) yet.

I'm still working on the MixedLogit detour, but getting the working core of ConditionalLogit looks like an interesting "exercise", and will have many uses.

Josef

Jeff Alstott

May 4, 2015, 2:09:28 PM
to pystat...@googlegroups.com
Just quickly to answer the central questions:
For my specific dataset, there are approximately 120 options (that's the max and most frequent number; the exact amount varies but is generally over 100). Each stratum has exactly one positive event (so 1 one and ~120 zeros). 

I will post an example tomorrow. 

josef...@gmail.com

May 4, 2015, 2:48:42 PM
to pystatsmodels
On Mon, May 4, 2015 at 2:09 PM, Jeff Alstott <jeffa...@gmail.com> wrote:
Just quickly to answer the central questions:
For my specific dataset, there are approximately 120 options (That's the max and most frequent number. The exact amount varies but is generally over 100). Each strata has exactly one positive event (so 1 one and ~120 zeros). 

Good, only one choice is computationally the nicest case, and doesn't even need the recursion, as the R clogit help page also mentions. In this case it's very close to a multinomial logit version with a flexible number of choices (similar to the PR clogit, not to our discrete.MNLogit, since you have choice-specific explanatory variables, IIUC).

I looked into the recursion a bit more. Vectorizing over strata is only easy if it is batched for individuals with the same number of ones and the same number of observations/choices. Whether it's worth it depends on how often loglike is called and how difficult the data handling is.

To continue with a panel data analogy: 
Our MNLogit and discrete choice models in the PR assume a balanced structure and can efficiently use reshaped arrays in wide format. Unbalanced panel or cluster data like in GEE and MixedLM use mostly loops over all individuals or groups or strata.


Josef

josef...@gmail.com

May 4, 2015, 7:43:49 PM
to pystatsmodels
On Mon, May 4, 2015 at 2:48 PM, <josef...@gmail.com> wrote:




One more issue related to the singular matrix and differences across versions:

Chamberlain's conditional Logit was used to remove a strata/individual-specific intercept. I assume, but I'm not sure about the details, that we cannot have any individual-specific but constant explanatory variables in the model; it looks like those need to be interacted with the choice (similar to what's done automatically in asclogit and the PR CLogit).

Also, conditional logit doesn't seem to have a reference category directly specified, in contrast to the Multinomial logit variants.
According to the Stata manual, asclogit can be reparameterized as a special case of clogit, but I don't see how the identification conditions work.

Stata's asclogit allows for a different choice set per individual, but I guess that there needs to be a common reference category; I didn't see any explanation.

From what I can see right now, it would be relatively straightforward to implement the clogit loglikelihood with exactly one 1 without any (long) Python loop, delegating the loops to numpy, but there might be more work to do to get an identifiable parameterization for the different logit variants.
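A sketch of that "exactly one 1 per group" loglikelihood, vectorized over strata with reduceat (my illustration, not code from the PR; all names are made up):

```python
import numpy as np

def clogit_loglike_one(beta, exog, chosen, group_idx):
    """Conditional logit loglikelihood when each group contains exactly
    one chosen row.

    exog: (nobs, k) array; chosen: bool (nobs,) with one True per group;
    group_idx: start index of each group, data sorted by group.

    loglike = sum_g [ x_chosen(g) @ beta - log(sum_{j in g} exp(x_j @ beta)) ]
    """
    xb = exog @ beta
    # per-group max for a numerically stable log-sum-exp
    gmax = np.maximum.reduceat(xb, group_idx)
    sizes = np.diff(np.append(group_idx, len(xb)))
    shifted = np.exp(xb - np.repeat(gmax, sizes))
    lse = gmax + np.log(np.add.reduceat(shifted, group_idx))
    return xb[chosen].sum() - lse.sum()
```

With beta = 0 every alternative in a group of m is equally likely, so the loglikelihood is -sum_g log(m_g), which makes a quick sanity check.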

Josef

Jeff Alstott

May 5, 2015, 7:46:01 AM
to pystat...@googlegroups.com, Giorgio Triulzi
My econometrics-savvy collaborator says:

"As far as I can tell...


>One more issue related to the singular matrix and differences across versions:
>Chamberlain's conditional Logit was used to remove a strata/individual specific intercept. I assume, but I'm not sure about the details, that we cannot have any individual specific but constant explanatory variables in the model, it looks like those need to be interacted with the choice (similar to what's done automatically in asclogit and the PR CLogit)

...the assumption is correct


>Also, conditional logit doesn't seem to have a reference category directly specified, in contrast to the Multinomial logit variants.
>According to the Stata manual, asclogit can be reparameterized as a special case of clogit, but I don't see how the identification conditions work.

...indeed clogit does not have any reference category specified in Stata"

Jeff Alstott

May 5, 2015, 8:02:54 AM
to pystat...@googlegroups.com, Giorgio Triulzi
Here's a minimal example of using PHReg to do the same fitting as survival's clogit in R.

Python
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['a','b','c'], data=np.random.rand(100, 3))
df['my_strata'] = np.repeat([1,2,3,4,5], 20)
df['outcome_to_classify'] = np.tile([1,0,0,0,0], 20)

#Make one of the observations have a different number of choices. 
#We make the last observation have only 4 choices instead of 5 by dropping the last row
df.drop(df.index[-1], inplace=True) 

from statsmodels.duration.hazard_regression import PHReg
mod = PHReg.from_formula('1 ~ a + b + c', 
                         df, 
                         status='outcome_to_classify', 
                         strata='my_strata',
                         ties="efron")
rslt = mod.fit()
print(rslt.summary())

                        Results: PHReg
===============================================================
Model:                    PH Reg          Sample size:       99
Dependent variable:       Intercept       Num. events:       20
Ties:                     Efron                                
---------------------------------------------------------------
           log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
---------------------------------------------------------------
Intercept  0.0000       nan 1.0000     nan    nan    nan    nan
a          0.4416    0.8453 1.5552  0.5224 0.6014 0.2967 8.1520
b         -1.5333    0.9185 0.2158 -1.6694 0.0950 0.0357 1.3060
c         -0.5142    0.8046 0.5980 -0.6391 0.5228 0.1236 2.8942
===============================================================
Confidence intervals are for the hazard ratios


R (called from Python with rmagic)
%load_ext rmagic
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='efron')
%R print(summary(analysis))

Loading required package: splines
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "efron")

  n= 99, number of events= 20 

     coef exp(coef) se(coef)      z Pr(>|z|)  
a  0.4416    1.5552   0.8453  0.522    0.601  
b -1.5333    0.2158   0.9185 -1.669    0.095 .
c -0.5142    0.5980   0.8046 -0.639    0.523  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  exp(coef) exp(-coef) lower .95 upper .95
a    1.5552      0.643   0.29668     8.152
b    0.2158      4.634   0.03566     1.306
c    0.5980      1.672   0.12355     2.894

Concordance= 0.668  (se = 0.181 )
Rsquare= 0.04   (max possible= 0.691 )
Likelihood ratio test= 4.02  on 3 df,   p=0.2595
Wald test            = 3.74  on 3 df,   p=0.2914
Score (logrank) test = 3.89  on 3 df,   p=0.2742



Or just in R:
library(survival)
analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='efron')
print(summary(analysis))





josef...@gmail.com

May 5, 2015, 8:15:17 AM
to pystatsmodels
On Tue, May 5, 2015 at 8:02 AM, Jeff Alstott <jeffa...@gmail.com> wrote:
Here's a minimal example of using PHReg to do the same fitting as survival's clogit in R.

Thanks,
Can you add a seed and show the results from R survival's exact clogit?

I don't have R/rpy2 currently installed, and I'm stuck at the moment on a one-year-old statsmodels branch that doesn't have PHReg (until I clean up some loose ends for dcm MixedLogit).

I have a GenericLikelihoodModel version of ConditionalLogit for the special case of one selected choice, but didn't run it yet on a verified example.

Josef

Jeff Alstott

May 5, 2015, 8:28:42 AM
to pystat...@googlegroups.com
"I have a GenericLikelihoodModel version of ConditionalLogit for the special case of one selected choice, but didn't run it yet on a verified example."
Great!

Python, with seed.

from pylab import *
import pandas as pd
seed(400)
df = pd.DataFrame(columns=['a','b','c'], data=rand(100, 3))
df['my_strata'] = repeat([1,2,3,4,5], 20)
df['outcome_to_classify'] = tile([1,0,0,0,0], 20)

#Make one of the observations have a different number of choices. 
#We make the last observation have only 4 choices instead of 5
df.drop(df.index[-1], inplace=True) 

from statsmodels.duration.hazard_regression import PHReg
mod = PHReg.from_formula('1 ~ a + b + c', 
                         df, 
                         status='outcome_to_classify', 
                         strata='my_strata',
                         ties="efron")
rslt = mod.fit()
print(rslt.summary())


                           Results: PHReg
====================================================================
Model:                       PH Reg           Sample size:        99
Dependent variable:          Intercept        Num. events:        20
Ties:                        Efron                                  
--------------------------------------------------------------------
           log HR   log HR SE     HR      t    P>|t|  [0.025  0.975]
--------------------------------------------------------------------
Intercept -0.0000 33554432.0000 1.0000 -0.0000 1.0000 0.0000     inf
a          1.0154        0.8366 2.7606  1.2138 0.2248 0.5357 14.2265
b          0.2449        0.8817 1.2775  0.2777 0.7812 0.2269  7.1916
c          0.3601        0.8118 1.4334  0.4435 0.6574 0.2920  7.0364
====================================================================
Confidence intervals are for the hazard ratios



R, survival's clogit, 'efron'
%load_ext rmagic
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='efron')
%R print(summary(analysis))

Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "efron")

  n= 99, number of events= 20 

    coef exp(coef) se(coef)     z Pr(>|z|)
a 1.0154    2.7606   0.8366 1.214    0.225
b 0.2449    1.2775   0.8817 0.278    0.781
c 0.3601    1.4334   0.8118 0.444    0.657

  exp(coef) exp(-coef) lower .95 upper .95
a     2.761     0.3622    0.5357    14.227
b     1.277     0.7828    0.2269     7.192
c     1.433     0.6976    0.2920     7.036

Concordance= 0.62  (se = 0.181 )
Rsquare= 0.019   (max possible= 0.691 )
Likelihood ratio test= 1.88  on 3 df,   p=0.5979
Wald test            = 1.84  on 3 df,   p=0.6055
Score (logrank) test = 1.86  on 3 df,   p=0.6012


R, survival's clogit, 'exact'. Note the different results!
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))

Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "exact")

  n= 99, number of events= 20 

    coef exp(coef) se(coef)     z Pr(>|z|)
a 1.1201    3.0651   0.9189 1.219    0.223
b 0.2371    1.2675   0.9563 0.248    0.804
c 0.4339    1.5432   0.8962 0.484    0.628

  exp(coef) exp(-coef) lower .95 upper .95
a     3.065     0.3263    0.5061    18.563
b     1.268     0.7889    0.1945     8.260
c     1.543     0.6480    0.2664     8.938

Rsquare= 0.019   (max possible= 0.574 )
Likelihood ratio test= 1.92  on 3 df,   p=0.5901
Wald test            = 1.85  on 3 df,   p=0.6044
Score (logrank) test = 1.89  on 3 df,   p=0.5956


josef...@gmail.com

unread,
May 5, 2015, 8:59:38 AM5/5/15
to pystatsmodels
On Tue, May 5, 2015 at 8:28 AM, Jeff Alstott <jeffa...@gmail.com> wrote:
"I have a GenericLikelihoodModel version of ConditionalLogit for the special case of one selected choice, but didn't run it yet on a verified example."
Great!

Python, with seed.

from pylab import *
import pandas as pd
seed(400)
df = pd.DataFrame(columns=['a','b','c'], data=rand(100, 3))
df['my_strata'] = repeat([1,2,3,4,5], 20)
df['outcome_to_classify'] = tile([1,0,0,0,0], 20)


I should have asked about the interpretation first, because my current version breaks on this.

Using the terminology: individuals (strata) have an individual-specific choice set and make a selection from that choice set.

'my_strata' is the indicator/index variable for individuals, but then there would be 5 individuals with 20 choice options/events for the first four individuals and 19 for the last individual.

'outcome_to_classify' is the binary indicator for the actual selection. Given the tile, every individual would select 4 choices, not just one as you described for your data.

Is this interpretation correct?

If I change to  df['my_strata'] = np.repeat(np.arange(20), 5)   then the first 5 observations are individual 0, the next 5 individual 1, and so on.

Or is df['my_strata']  supposed to be the indicator for the choice/alternative, i.e. choice 1, choice 2, ...?
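As a quick check of that interpretation (a sketch of my own, not from the thread): with `repeat(arange(20), 5)` each of the 20 individuals has 5 alternatives and selects exactly one of them.

```python
import numpy as np
import pandas as pd

# 20 individuals (strata), 5 alternatives each, one positive selection per individual
strata = np.repeat(np.arange(20), 5)
chosen = np.tile([1, 0, 0, 0, 0], 20)

# count selections per individual
picks = pd.Series(chosen).groupby(strata).sum()
print((picks == 1).all())  # → True
```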

Josef

Jeff Alstott

unread,
May 5, 2015, 9:12:21 AM5/5/15
to pystat...@googlegroups.com
Ah, it's just me who messed up. It should be some number of individuals/strata, and each one makes only one positive choice. This code should be more explicit:

Python
from pylab import *
import pandas as pd
seed(400)
n_strata = 20
n_choices = 5 #We'll reduce some strata's choices later
df = pd.DataFrame(columns=['a','b','c'], data=rand(n_strata*n_choices, 3))
df['my_strata'] = repeat(range(n_strata), n_choices)
df['outcome_to_classify'] = tile([1,0,0,0,0], n_strata)

#Make one of the strata have a different number of choices.
#We give the last stratum only 4 choices instead of 5
df.drop(df.index[-1], inplace=True) 

from statsmodels.duration.hazard_regression import PHReg
mod = PHReg.from_formula('1 ~ a + b + c', 
                         df, 
                         status='outcome_to_classify', 
                         strata='my_strata',
                         ties="efron")
rslt = mod.fit()
print(rslt.summary())

This returns a LinAlgError: Singular matrix, even though survival's clogit/coxph have no trouble.

R, efron
%load_ext rmagic
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='efron')
%R print(summary(analysis))
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "efron")

  n= 99, number of events= 20 

      coef exp(coef) se(coef)     z Pr(>|z|)
a 1.472913  4.361925 0.977688 1.507    0.132
b 1.094598  2.987982 0.954946 1.146    0.252
c 0.003076  1.003081 1.016185 0.003    0.998

  exp(coef) exp(-coef) lower .95 upper .95
a     4.362     0.2293    0.6419     29.64
b     2.988     0.3347    0.4598     19.42
c     1.003     0.9969    0.1369      7.35

Concordance= 0.582  (se = 0.354 )
Rsquare= 0.036   (max possible= 0.476 )
Likelihood ratio test= 3.66  on 3 df,   p=0.3011
Wald test            = 3.35  on 3 df,   p=0.3413
Score (logrank) test = 3.53  on 3 df,   p=0.3174

R, exact
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))

Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "exact")

  n= 99, number of events= 20 

      coef exp(coef) se(coef)     z Pr(>|z|)
a 1.472913  4.361925 0.977688 1.507    0.132
b 1.094598  2.987982 0.954946 1.146    0.252
c 0.003076  1.003081 1.016185 0.003    0.998

  exp(coef) exp(-coef) lower .95 upper .95
a     4.362     0.2293    0.6419     29.64
b     2.988     0.3347    0.4598     19.42
c     1.003     0.9969    0.1369      7.35

Rsquare= 0.036   (max possible= 0.476 )
Likelihood ratio test= 3.66  on 3 df,   p=0.3011
Wald test            = 3.35  on 3 df,   p=0.3413
Score (logrank) test = 3.53  on 3 df,   p=0.3174

josef...@gmail.com

unread,
May 5, 2015, 10:35:51 AM5/5/15
to pystatsmodels
There is still something strange with the example; I don't get the same results. Either I'm still misinterpreting the model or it's because every individual chooses the same alternative and there is no variation in the selection.

I started to replicate a Stata example, example 2 from the documentation http://www.stata.com/manuals13/rclogit.pdf page 7

with data from the r12 version (an older .dta format that I can read with old statsmodels)


my results agree with Stata at about 4 decimals (i.e. within approximately numerical optimization precision)

(I had to rescale the first variable x1 because the bad scaling causes overflows in the optimization with bfgs.)

The example is balanced, 1-1 matched case control, so it still doesn't verify my code in the unbalanced case. (but there is nothing specific in the code for the balanced case).

>>> print(res3.summary())
                           ConditionalLogit Results                           
==============================================================================
Dep. Variable:                    low   Log-Likelihood:                -25.794
Model:               ConditionalLogit   AIC:                             65.59
Method:            Maximum Likelihood   BIC:                             84.62
Date:                Tue, 05 May 2015                                         
Time:                        10:07:26                                         
No. Observations:                 112                                         
Df Residuals:                     105                                         
Df Model:                           6                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.5573      0.306     -1.823      0.068        -1.157     0.042
x2             1.4009      0.628      2.231      0.026         0.170     2.632
x3             1.8078      0.789      2.292      0.022         0.262     3.354
x4             2.3611      1.086      2.174      0.030         0.232     4.490
x5             1.4018      0.696      2.014      0.044         0.037     2.766
x6             0.5711      0.690      0.828      0.408        -0.781     1.923
x7            -0.0251      0.699     -0.036      0.971        -1.396     1.345
==============================================================================
>>> 

results for x1 in original Stata scale

>>> res3.params[0] / dat_lb['lwt'].values.std()
-0.018372710166329135
>>> res3.bse[0] / dat_lb['lwt'].values.std()
0.010080178253112035

Josef
fast prototyping with GenericLikelihoodModel

Jeff Alstott

unread,
May 5, 2015, 10:52:10 AM5/5/15
to pystat...@googlegroups.com

On Tue, May 5, 2015 at 10:35 PM, <josef...@gmail.com> wrote:
There is still something strange with the example, i don't get the same results. Either I'm still misinterpreting the model or it's because every individual chooses the same alternative and there is no variation in the selection.

The fact that both statsmodels' PHReg and your code refuse to run it, while survival's clogit runs it happily, might be informative? Some common cause? Note that survival's clogit calls coxph, and coxph has an option 'singular.ok', which defaults to TRUE. 

"singular.ok logical value indicating how to handle collinearity in the model matrix. If TRUE, the program will automatically skip over columns of the X matrix that are linear combinations of earlier columns. In this case the coefficients for such columns will be NA, and the variance matrix will contain zeros. For ancillary calculations, such as the linear predictor, the missing coefficients are treated as zeros."
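A rough sketch (my own assumption about the behavior described, not coxph's actual implementation) of what singular.ok=TRUE amounts to — skipping columns that are linear combinations of earlier columns:

```python
import numpy as np

def independent_columns(X, tol=1e-10):
    """Keep only columns of X that are not linear combinations
    of the columns already kept (cf. coxph's singular.ok=TRUE)."""
    keep = []
    for j in range(X.shape[1]):
        # a column is kept only if adding it raises the rank
        if np.linalg.matrix_rank(X[:, keep + [j]], tol=tol) == len(keep) + 1:
            keep.append(j)
    return keep

# column 2 is col 0 + 2 * col 1, so it should be skipped
X = np.column_stack([np.arange(5.0), np.ones(5), np.arange(5.0) + 2.0])
print(independent_columns(X))  # → [0, 1]
```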


I played around with the numbers of strata and choices, and it appears that statsmodels will fit the data for some combinations. Here's an example that works:

Python
from pylab import *
import pandas as pd
seed(400)
n_strata = 10
n_choices = 10 #We'll reduce some strata's choices later
df = pd.DataFrame(columns=['a','b','c'], data=rand(n_strata*n_choices, 3))
df['my_strata'] = repeat(range(n_strata), n_choices)

default_outcomes = zeros(n_choices)
default_outcomes[0] = 1
df['outcome_to_classify'] = tile(default_outcomes, n_strata)

#Make one of the strata have a different number of choices.
#We give the last stratum only 9 choices instead of 10
df.drop(df.index[-1], inplace=True) 

from statsmodels.duration.hazard_regression import PHReg
mod = PHReg.from_formula('1 ~ a + b + c', 
                         df, 
                         status='outcome_to_classify', 
                         strata='my_strata',
                         ties="efron")
rslt = mod.fit()
print(rslt.summary())

                           Results: PHReg
====================================================================
Model:                       PH Reg           Sample size:        99
Dependent variable:          Intercept        Num. events:        10
Ties:                        Efron                                  
--------------------------------------------------------------------
           log HR   log HR SE     HR      t    P>|t|  [0.025  0.975]
--------------------------------------------------------------------
Intercept  0.0000 47453132.8121 1.0000  0.0000 1.0000 0.0000     inf
a          1.3262        1.2206 3.7666  1.0865 0.2773 0.3443 41.2036
b         -0.4479        1.3052 0.6389 -0.3432 0.7315 0.0495  8.2499
c          0.2376        1.1903 1.2682  0.1996 0.8418 0.1230 13.0726
====================================================================
Confidence intervals are for the hazard ratios


R
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))


Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "exact")

  n= 99, number of events= 10 

     coef exp(coef) se(coef)      z Pr(>|z|)
a  1.3262    3.7666   1.2206  1.086    0.277
b -0.4479    0.6389   1.3052 -0.343    0.731
c  0.2376    1.2682   1.1903  0.200    0.842

  exp(coef) exp(-coef) lower .95 upper .95
a    3.7666     0.2655   0.34433     41.20
b    0.6389     1.5651   0.04949      8.25
c    1.2682     0.7885   0.12302     13.07

Rsquare= 0.014   (max possible= 0.371 )
Likelihood ratio test= 1.44  on 3 df,   p=0.6953
Wald test            = 1.42  on 3 df,   p=0.7017
Score (logrank) test = 1.45  on 3 df,   p=0.694

josef...@gmail.com

unread,
May 5, 2015, 11:07:48 AM5/5/15
to pystatsmodels
Good, that looks much better.

my result

>>> print(res2.summary())
                           ConditionalLogit Results                           
==============================================================================
Dep. Variable:                      y   Log-Likelihood:                -22.199
Model:               ConditionalLogit   AIC:                             50.40
Method:            Maximum Likelihood   BIC:                             58.18
Date:                Tue, 05 May 2015                                         
Time:                        11:00:34                                         
No. Observations:                  99                                         
Df Residuals:                      96                                         
Df Model:                           2                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.3262      1.221      1.086      0.277        -1.066     3.719
x2            -0.4479      1.305     -0.343      0.731        -3.006     2.110
x3             0.2376      1.190      0.200      0.842        -2.095     2.570
==============================================================================
>>> 

The function to get exp(params) with standard errors is somewhere.
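In the meantime, a first-order delta-method sketch for exp(params) with standard errors (my own addition, using the coefficients from the summary above; the results match R's exp(coef) column and confidence limits):

```python
import numpy as np

# coefficients and standard errors from the ConditionalLogit summary above
b = np.array([1.3262, -0.4479, 0.2376])
se = np.array([1.2206, 1.3052, 1.1903])

hr = np.exp(b)              # exp(coef)
hr_se = hr * se             # delta method: se(exp(b)) ~ exp(b) * se(b)
lo = np.exp(b - 1.96 * se)  # 95% interval for exp(coef)
hi = np.exp(b + 1.96 * se)
print(hr)  # close to R's exp(coef) column: 3.7666, 0.6389, 1.2682
```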

Josef

Jeff Alstott

unread,
May 5, 2015, 11:11:53 AM5/5/15
to pystat...@googlegroups.com
Alright! So now it looks like the only remaining issue is how to handle the singular matrices? You're in the belly of the code, so you have a better shot at interpreting what coxph is doing with that singular.ok option.

josef...@gmail.com

unread,
May 5, 2015, 11:30:47 AM5/5/15
to pystatsmodels
On Tue, May 5, 2015 at 11:11 AM, Jeff Alstott <jeffa...@gmail.com> wrote:
Alright! So now it looks like the only remaining issue is how to handle the singular matrices? You're in the belly of the code, so you have a better shot of interpreting what coxph is doing with that singular.ok option.

I have no idea what R does, and I didn't go through the calculation details of PHReg.

I don't have any linear algebra in the estimation of the parameters, with bfgs. My loglikelihood function is essentially 3 lines https://github.com/statsmodels/statsmodels/issues/2386#issuecomment-98947396

My guess is that without enough variation, the parameters are not well identified and the loglikelihood function is flat or has local optima driven by numerical noise.

I wouldn't trust any results, or maybe R survival does some implicit regularization in this case. (We do it with pinv in OLS.) 

However, both R's and my standard errors are relatively large, though it looks like there is no problem with a singular hessian. (I don't have an analytical hessian yet.)
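A minimal illustration (my own, not from the thread) of the pinv remark: with a singular design matrix, pinv-based OLS returns the minimum-norm solution instead of failing.

```python
import numpy as np

# rank-deficient design: second column is 2 * first column
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# pinv picks the minimum-norm solution among the infinitely many exact fits
beta = np.linalg.pinv(X) @ y
print(beta)  # → [0.2 0.4]
```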

Josef

Jeff Alstott

unread,
May 5, 2015, 11:37:51 AM5/5/15
to pystat...@googlegroups.com
So what does your code (and Stata) do for the example below? This is exactly the same as before, except the number of choices is dropped to 7. PHReg cries that there's a singular error, while survival's clogit gives a response.

Python
from pylab import *
import pandas as pd
seed(400)
n_strata = 10
n_choices = 7 #We'll reduce some strata's choices later
df = pd.DataFrame(columns=['a','b','c'], data=rand(n_strata*n_choices, 3))
df['my_strata'] = repeat(range(n_strata), n_choices)

default_outcomes = zeros(n_choices)
default_outcomes[0] = 1
df['outcome_to_classify'] = tile(default_outcomes, n_strata)

#Make one of the strata have a different number of choices.
#We give the last stratum only 6 choices instead of 7
df.drop(df.index[-1], inplace=True) 

from statsmodels.duration.hazard_regression import PHReg
mod = PHReg.from_formula('1 ~ a + b + c', 
                         df, 
                         status='outcome_to_classify', 
                         strata='my_strata',
                         ties="efron")
rslt = mod.fit()
print(rslt.summary())

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-34-47272558d484> in <module>()
     21                          strata='my_strata',
     22                          ties="efron")
---> 23 rslt = mod.fit()
     24 print(rslt.summary())

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/duration/hazard_regression.py in fit(self, groups, **args)
    373         if 'disp' not in args:
    374             args['disp'] = False
--> 375         fit_rslts = super(PHReg, self).fit(**args)
    376 
    377         if self.groups is None:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    432             Hinv = cov_params_func(self, xopt, retvals)
    433         elif method == 'newton' and full_output:
--> 434             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    435         elif not skip_hessian:
    436             try:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py in inv(a)
    518     signature = 'D->D' if isComplexType(t) else 'd->d'
    519     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 520     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    521     return wrap(ainv.astype(result_t))
    522 

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     88 
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     91 
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix


R
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))

Call:
coxph(formula = Surv(rep(1, 69L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "exact")

  n= 69, number of events= 10 

      coef exp(coef) se(coef)      z Pr(>|z|)  
a  0.35339   1.42389  1.23541  0.286   0.7748  
b  0.44016   1.55295  1.33109  0.331   0.7409  
c -2.94833   0.05243  1.46427 -2.014   0.0441 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  exp(coef) exp(-coef) lower .95 upper .95
a   1.42389     0.7023  0.126441   16.0348
b   1.55295     0.6439  0.114322   21.0954
c   0.05243    19.0741  0.002973    0.9246

Rsquare= 0.072   (max possible= 0.429 )
Likelihood ratio test= 5.13  on 3 df,   p=0.1624
Wald test            = 4.1  on 3 df,   p=0.2505
Score (logrank) test = 4.82  on 3 df,   p=0.1853

josef...@gmail.com

unread,
May 5, 2015, 11:52:14 AM5/5/15
to pystatsmodels
This is during fit and not necessarily an indication that the final result is bad. The optimizer might just try out bad values. But in many cases it's a signal that something "might" be wrong.

Try 'bfgs', which doesn't use the hessian during optimization (assuming the option is available), i.e.

rslt = mod.fit(method='bfgs')

Also, the 'newton' method now has a ridge parameter option, to get around a singular hessian during optimization. I don't remember what the keyword is called.

What I often do when I want more precise final estimates is to use 'bfgs' or similar first, and then use 'newton' with the estimated params as starting values. 'newton' with a good hessian works harder than 'bfgs' to get to the local optimum, roughly 6 to 8 decimals versus 4 to 5 decimals in the score.

 


R
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))

Call:
coxph(formula = Surv(rep(1, 69L), outcome_to_classify) ~ a + 
    b + c + strata(my_strata), data = df, method = "exact")

  n= 69, number of events= 10 

      coef exp(coef) se(coef)      z Pr(>|z|)  
a  0.35339   1.42389  1.23541  0.286   0.7748  
b  0.44016   1.55295  1.33109  0.331   0.7409  
c -2.94833   0.05243  1.46427 -2.014   0.0441 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  exp(coef) exp(-coef) lower .95 upper .95
a   1.42389     0.7023  0.126441   16.0348
b   1.55295     0.6439  0.114322   21.0954
c   0.05243    19.0741  0.002973    0.9246

Rsquare= 0.072   (max possible= 0.429 )
Likelihood ratio test= 5.13  on 3 df,   p=0.1624
Wald test            = 4.1  on 3 df,   p=0.2505
Score (logrank) test = 4.82  on 3 df,   p=0.1853

I get the same as R:

>>> print(res2.summary())
                           ConditionalLogit Results                           
==============================================================================
Dep. Variable:                      y   Log-Likelihood:                -16.739
Model:               ConditionalLogit   AIC:                             39.48
Method:            Maximum Likelihood   BIC:                             46.18
Date:                Tue, 05 May 2015                                         
Time:                        11:41:04                                         
No. Observations:                  69                                         
Df Residuals:                      66                                         
Df Model:                           2                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.3534      1.235      0.286      0.775        -2.068     2.775
x2             0.4402      1.331      0.331      0.741        -2.169     3.049
x3            -2.9483      1.464     -2.014      0.044        -5.818    -0.078
==============================================================================
>>> print(res2.params)
[ 0.35339533  0.44016712 -2.948349  ]
>>> print(res2.bse)
[ 1.23541575  1.33109395  1.46427539]

Jeff Alstott

unread,
May 5, 2015, 11:58:31 AM5/5/15
to pystat...@googlegroups.com
'newton' also cries singular. Trying 'bfgs' gives a different error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-cfffea3f4f0b> in <module>()

     21                          strata='my_strata',
     22                          ties="efron")
---> 23 rslt = mod.fit(method='bfgs')
     24 print(rslt.summary())

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/duration/hazard_regression.py in fit(self, groups, **args)
    376 
    377         if self.groups is None:
--> 378             cov_params = fit_rslts.cov_params()
    379         else:
    380             cov_params = self.robust_covariance(fit_rslts.params)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in cov_params(self, r_matrix, column, scale, cov_p, other)
   1044         if (cov_p is None and self.normalized_cov_params is None and
   1045             not hasattr(self, 'cov_params_default')):
-> 1046             raise ValueError('need covariance of parameters for computing '
   1047                              '(unnormalized) covariances')
   1048         if column is not None and (r_matrix is not None or other is not None):

ValueError: need covariance of parameters for computing (unnormalized) covariances

If you're getting the same results as R, then I'm happy! I understand it's not unit tested and such. But I'd like to try it out, if only to test its runtime vs. R. Is it in a shape that I could access it?

josef...@gmail.com

unread,
May 5, 2015, 12:17:27 PM5/5/15
to pystatsmodels
On Tue, May 5, 2015 at 11:58 AM, Jeff Alstott <jeffa...@gmail.com> wrote:
'newton' also cries singular. Trying 'bfgs' gives a different error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-cfffea3f4f0b> in <module>()
     21                          strata='my_strata',
     22                          ties="efron")
---> 23 rslt = mod.fit(method='bfgs')
     24 print(rslt.summary())

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/duration/hazard_regression.py in fit(self, groups, **args)
    376 
    377         if self.groups is None:
--> 378             cov_params = fit_rslts.cov_params()
    379         else:
    380             cov_params = self.robust_covariance(fit_rslts.params)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in cov_params(self, r_matrix, column, scale, cov_p, other)
   1044         if (cov_p is None and self.normalized_cov_params is None and
   1045             not hasattr(self, 'cov_params_default')):
-> 1046             raise ValueError('need covariance of parameters for computing '
   1047                              '(unnormalized) covariances')
   1048         if column is not None and (r_matrix is not None or other is not None):

ValueError: need covariance of parameters for computing (unnormalized) covariances


That means the PHReg hessian is not invertible at the optimum either.

 
If you're getting the same results as R, then I'm happy! I understand it's not unit tested and such. But I'd like to try it out, if only to test its runtime vs. R. Is it in a shape that I could access it?

I need to clean it up a bit and can publish it later today. Right now it has examples and library code all mixed together, I'm working on an old branch, and I haven't tried using pandas DataFrames or formulas yet.

Based on the code, I don't see a reason why it shouldn't be at least as fast as R. My expectation would be that it's a bit faster because it only treats the 1 versus k_i special case.

josef...@gmail.com

unread,
May 5, 2015, 6:11:57 PM5/5/15
to pystatsmodels
There is a problem with from_formula with categoricals. The same problem appears in ConditionalLogit as in PHReg.

patsy doesn't allow no-constant dummies. We have an open issue in patsy for extending categorical dummies to over- and underparameterized versions. 

>>> mod5 = ConditionalLogit.from_formula('low ~ smoke + ptd + C(race) - 1', data = dat_lb, groups=gr)
>>> mod5.endog = mod5.endog.ravel()
>>> res5 = mod5.fit()
Optimization terminated successfully.
         Current function value: 0.275586
         Iterations: 22
         Function evaluations: 24
         Gradient evaluations: 24
>>> print(res5.summary())
                          ConditionalLogitOne Results                          
===============================================================================
Dep. Variable:                     low   Log-Likelihood:                -30.866
Model:             ConditionalLogitOne   AIC:                             71.73
Method:             Maximum Likelihood   BIC:                             85.32
Date:                 Tue, 05 May 2015                                         
Time:                         18:00:17                                         
No. Observations:                  112                                         
Df Residuals:                      107                                         
Df Model:                            4                                         
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
C(race)[1.0]    -0.3539   3029.432     -0.000      1.000     -5937.932  5937.224
C(race)[2.0]     0.2875   3029.433   9.49e-05      1.000     -5937.291  5937.866
C(race)[3.0]     0.3664   3029.433      0.000      1.000     -5937.213  5937.946
smoke            1.4380      0.537      2.679      0.007         0.386     2.490
ptd              1.6156      0.678      2.384      0.017         0.288     2.944
================================================================================
>>> 
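Until patsy supports this, one workaround (a sketch of my own, not from the thread) is to build the dummy columns by hand and drop a reference level, since a full set of dummies is collinear with the constant that conditional logit differences out:

```python
import pandas as pd

# hypothetical stand-in for the 'race' column in the birth-weight data
df = pd.DataFrame({'race': [1, 2, 3, 1, 2, 3]})

# full dummies, then drop the first level as the reference category
dummies = pd.get_dummies(df['race'], prefix='race').iloc[:, 1:]
print(list(dummies.columns))  # → ['race_2', 'race_3']
```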

Josef

josef...@gmail.com

unread,
May 5, 2015, 6:34:23 PM5/5/15
to pystatsmodels
I put the module with the class into `miscmodels`, which contains GenericLikelihoodModel subclasses (models that are still missing specific bells and whistles).


I will open a PR when I have at least a minimal set of unit tests, so TravisCI will have something to check.
I only ran it on python 3.4 (my current default).

Josef

Jeff Alstott

unread,
May 5, 2015, 11:50:17 PM5/5/15
to pystat...@googlegroups.com
Thanks, Josef!

If I were wanting to run this in the next day or three, what would you recommend?
1. Waiting for this to be pulled into statsmodels, and then using the latest development branch of statsmodels? (seems easiest?)
2. Using your fork? I'm not savvy in simultaneously operating with parallel versions of packages.
3. Something else? E.g. Your code is sufficiently brief that I could just copy/paste it into the IPython Notebook where I'm running my analyses.

josef...@gmail.com

unread,
May 6, 2015, 6:57:21 AM5/6/15
to pystatsmodels
On Tue, May 5, 2015 at 11:49 PM, Jeff Alstott <jeffa...@gmail.com> wrote:
Thanks, Josef!

If I were wanting to run this in the next day or three, what would you recommend?
1. Waiting for this to be pulled into statsmodels, and then using the latest development branch of statsmodels? (seems easiest?)
2. Using your fork? I'm not savvy in simultaneously operating with parallel versions of packages.
3. Something else? E.g. Your code is sufficiently brief that I could just copy/paste it into the IPython Notebook where I'm running my analyses.

None of the above.

It's all in one module, so I would just copy it and put it somewhere where it's on the python path for what you are working with.  I wouldn't copy it into the notebook or any other script file because it mixes library code and specific use case code, and it will make it more difficult to just change the import when this is merged into statsmodels.

For example, I just put an extra module from a different branch in my local (development) copy, but added an extra part to the filename (`xxx_bak.py`) so I remember it's just a local copy, not a proper part of statsmodels, and not under version control.
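The "copy it somewhere on the Python path" approach can be sketched like this. The module name `conditional_logit_bak.py` and its contents are made-up stand-ins; in practice you would copy the real module file into the directory instead:

```python
import os
import sys
import tempfile

# Write a tiny stand-in module to a temp directory. In real use you would
# copy the actual conditional logit module here (file name is hypothetical).
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "conditional_logit_bak.py"), "w") as f:
    f.write("GREETING = 'local copy, not under version control'\n")

sys.path.insert(0, tmpdir)      # make the directory importable
import conditional_logit_bak    # now imports like any installed module
print(conditional_logit_bak.GREETING)
```

Once the code is merged into statsmodels, only the import line needs to change.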

(Based on some comments and examples on the internet, several users did this with another unmerged branch of mine for Johansen's cointegration tests, and years later new users are asking where that module is.)


Please report back on how well it works or doesn't. It still needs more work to turn it into a "proper" Model.

Josef

josef...@gmail.com

unread,
May 6, 2015, 7:35:08 AM5/6/15
to pystatsmodels
Jeff, I forgot to mention for the usage of the current version:

This assumes that the data is sorted by strata/individuals, i.e. the first rows are for the first stratum and so on. The `groups` (strata) index/indicator is assumed to be an integer array with unique values in range(n_groups).

Both are what is used internally for efficient calculations. There are currently no checks or options for this.

Also, endog is used internally as a boolean array. If the provided endog is a 0/1 integer array, it will be converted to boolean; if it's already boolean, no copy needs to be made.
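A minimal numpy sketch of that expected layout, with made-up strata labels, showing how to sort rows by group and recode the labels into range(n_groups):

```python
import numpy as np

raw_groups = np.array([10, 10, 7, 7, 7, 10])        # arbitrary strata labels
order = np.argsort(raw_groups, kind="stable")       # sort rows by stratum
labels = raw_groups[order]                          # [7 7 7 10 10 10]
_, groups = np.unique(labels, return_inverse=True)  # integer codes in range(n_groups)

# endog given as 0/1 integers is converted to boolean
endog = np.array([1, 0, 0, 1, 0, 0])[order].astype(bool)

print(groups)       # [0 0 0 1 1 1]
print(endog.dtype)  # bool
```

The same `order` must of course be applied to the exog rows as well, so all arrays stay aligned.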

Josef

Jeff Alstott

unread,
May 6, 2015, 9:29:55 AM5/6/15
to pystat...@googlegroups.com
Thanks!

I may get to put this into use in the next day or two, but it's possible that for the moment I'll try to save programmer time by using the existing survival/clogit/RPy setup. In the end, though, I will definitely use your Python solution for the production run, to save computer time. I look forward to seeing this Python ConditionalLogit fully vetted. It'll be great to have it!

josef...@gmail.com

unread,
May 6, 2015, 9:42:58 AM5/6/15
to pystatsmodels
On Wed, May 6, 2015 at 9:29 AM, Jeff Alstott <jeffa...@gmail.com> wrote:
Thanks!

I may get to put this into use in the next day or two, but it's possible for the moment I'll try to save programmer time by using the existing survival/clogit/RPy setup. In the end, though, I will definitely use your Python solution for the production run, to save computer time. I look forward to seeing this Python ConditionalLogit being fully vetted. It'll be great to have it!

You could try it out on some larger examples of your data to see whether it is actually fast enough.
Don't trust unverified advertising claims by the author. (I didn't check any timing and only used the simple examples.)

Any help (with, for example, unit tests and extra features) would be useful. 
This was just a two-day unscheduled detour for me, and in contrast to working on mixed logit it doesn't have the advantage of helping prepare for this year's GSOC projects.

Jeff Alstott

unread,
May 6, 2015, 2:12:21 PM5/6/15
to pystat...@googlegroups.com
On Wed, May 6, 2015 at 9:42 PM, <josef...@gmail.com> wrote:
You could try it out on some larger examples of your data to see whether it is actually fast enough.
Don't trust unverified advertising claims by the author. (I didn't check any timing and only used the simple examples.)
Absolutely! 

Any help (with for example unit tests and extra features) would be useful. 
This was just a two day unscheduled detour for me, and in contrast to working on mixed logit it doesn't have the advantage to help prepare for this years GSOC projects.
I'm glad you're doing another GSOC. It'll be great once these choice models are all built out. 


Jeff Alstott

unread,
Jul 6, 2015, 12:33:09 PM7/6/15
to pystat...@googlegroups.com
Hello all,

I'll be moving into large scale production with the ConditionalLogit model within the next month or two, and I wanted to know if I should still be using Josef's code. I heard that there was a GSoC project that would get at least close to this topic. Have there been any developments, or are there expected to be any, that would change which code I should be using?

Thanks!
Jeff

josef...@gmail.com

unread,
Jul 8, 2015, 8:47:44 AM7/8/15
to pystatsmodels
On Mon, Jul 6, 2015 at 12:32 PM, Jeff Alstott <jeffa...@gmail.com> wrote:
Hello all,

I'll be moving into large scale production with the ConditionalLogit model within the next month or two, and I wanted to know if I should still be using Josef's code. I heard that there was a GSoC project that would get at least close to this topic. Have there been any developments, or are there expected to be any, that would change which code I should be using?

Hi Jeff,

I think there was a misunderstanding. This year we have GSOC projects on mixed models but not on discrete choice models directly.

There won't be any further work on this until I can finish up ConditionalLogit or unless someone else contributes some of the necessary unit tests and missing features.

Because we have many projects running at the same time, I doubt I will find time until at least after GSOCs are over and the main backlog of PRs targeted for 0.8 has been merged.

Feedback would still be very welcome.

Josef

Jeff Alstott

unread,
Jul 8, 2015, 9:58:13 AM7/8/15
to pystat...@googlegroups.com
Thanks for the info!

"I think there was a misunderstanding. This year we have GSOC projects on mixed models but not on discrete choice models directly." Among statistical models, those are fairly close. I thought I might have gotten lucky and someone would have wandered from one to the other :-)

Thanks for all your great work!

Jeff Alstott

unread,
Aug 19, 2015, 5:07:16 PM8/19/15
to pystat...@googlegroups.com
Resurrecting this thread to ask a follow-up question: How would I implement interaction terms with this ConditionalLogit code? I would think to use formulas, but using formulas with ConditionalLogitOne.from_formula() yields this error:

IndexError: too many indices for array

That's without even putting in an interaction term.

Thanks!

josef...@gmail.com

unread,
Aug 19, 2015, 7:47:34 PM8/19/15
to pystatsmodels
On Wed, Aug 19, 2015 at 5:06 PM, Jeff Alstott <jeffa...@gmail.com> wrote:
Resurrecting this thread to ask a follow-up question: How would I implement interaction terms with this ConditionalLogit code? I would think to use formulas, but using formulas with ConditionalLogitOne.from_formula() yields this error:

IndexError: too many indices for array

That's without even putting in an interaction term.


Please post the full traceback; this doesn't tell me anything.

I don't think I tried out from_formula. It's just the inherited method which might not be appropriate in subclasses.
You could try to use patsy.dmatrices directly to see whether it produces the correct design matrix and endog.

One possible bug is that patsy converts categorical `endog` which it shouldn't do in this case. We have some related open issues, but I don't think patsy has the option to turn off automatic `endog` categorical transformation yet. endog might have to be integer category levels.

(mostly guessing, I'm in different neighborhoods right now.)

Josef

 


Jeff Alstott

unread,
Aug 19, 2015, 10:58:00 PM8/19/15
to pystat...@googlegroups.com
On Wed, Aug 19, 2015 at 7:47 PM, <josef...@gmail.com> wrote:
Please post the full traceback, this doesn't tell me anything.
> model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1', 
                                         data = df, groups=df['Agent_Entry'].values)
> model.endog = model.endog.ravel()  #I took this line from your gist. It doesn't seem to affect the error at all.
> fit = model.fit()

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-189-beaeee5f31cd> in <module>()
----> 1 fit = model.fit()

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in fit(self, **kwds)
    159         # only used to change default method to bfgs
    160         method = kwds.pop('method', 'bfgs')
--> 161         return super(ConditionalLogitOne, self).fit(method=method, **kwds)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, retall, **kwargs)
    669                             method=method, maxiter=maxiter,
    670                             full_output=full_output,
--> 671                             disp=disp, callback=callback, **kwargs)
    672         genericmlefit = GenericLikelihoodModelResults(self, mlefit)
    673 

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    423                                                        callback=callback,
    424                                                        retall=retall,
--> 425                                                        full_output=full_output)
    426 
    427         #NOTE: this is for fit_regularized and should be generalized

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/optimizer.py in _fit(self, objective, gradient, start_params, fargs, kwargs, hessian, method, maxiter, full_output, disp, callback, retall)
    182                             disp=disp, maxiter=maxiter, callback=callback,
    183                             retall=retall, full_output=full_output,
--> 184                             hess=hessian)
    185 
    186         # this is stupid TODO: just change this to something sane

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/optimizer.py in _fit_bfgs(f, score, start_params, fargs, kwargs, disp, maxiter, callback, retall, full_output, hess)
    290                                  gtol=gtol, norm=norm, epsilon=epsilon,
    291                                  maxiter=maxiter, full_output=full_output,
--> 292                                  disp=disp, retall=retall, callback=callback)
    293     if full_output:
    294         if not retall:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in fmin_bfgs(f, x0, fprime, args, gtol, norm, epsilon, maxiter, full_output, disp, retall, callback)
    780             'return_all': retall}
    781 
--> 782     res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
    783 
    784     if full_output:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in _minimize_bfgs(fun, x0, args, jac, callback, gtol, norm, eps, maxiter, disp, return_all, **unknown_options)
    835     else:
    836         grad_calls, myfprime = wrap_function(fprime, args)
--> 837     gfk = myfprime(x0)
    838     k = 0
    839     N = len(x0)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
    280     def function_wrapper(*wrapper_args):
    281         ncalls[0] += 1
--> 282         return function(*(wrapper_args + args))
    283 
    284     return ncalls, function_wrapper

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in <lambda>(params)
    402         nobs = self.endog.shape[0]
    403         f = lambda params, *args: -self.loglike(params, *args) / nobs
--> 404         score = lambda params: -self.score(params) / nobs
    405         try:
    406             hess = lambda params: -self.hessian(params) / nobs

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in score(self, params)
    629         kwds = {}
    630         kwds.setdefault('centered', True)
--> 631         return approx_fprime(params, self.loglike, **kwds).ravel()
    632 
    633     def score_obs(self, params, **kwds):

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/tools/numdiff.py in approx_fprime(x, f, epsilon, args, kwargs, centered)
    128     n = len(x)
    129     #TODO:  add scaled stepsize
--> 130     f0 = f(*((x,)+args), **kwargs)
    131     dim = np.atleast_1d(f0).shape # it could be a scalar
    132     grad = np.zeros((n,) + dim, float)

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in loglike(self, params)
     86 
     87         xb = self.exog.dot(params)
---> 88         return loglike_one_groups(self.endog, xb, self.group_idx).sum()
     89 
     90     def loglike_obs(self, params):

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in loglike_one_groups(endog, xb, group_idx)
     41     endog = np.asarray(endog, np.bool_)
     42     expxb_groups = np.add.reduceat(np.exp(xb), group_idx)
---> 43     return xb[endog] - np.log(expxb_groups)
     44 
     45 def score_one_groups(params, endog, exog, group_idx):

IndexError: index 39892 is out of bounds for axis 1 with size 39891



I don't think I tried out from_formula. It's just the inherited method which might not be appropriate in subclasses.
The gist you posted previously has some testing of it in line 110:
 
You could try to use patsy.dmatrices directly to see whether it produces the correct design matrix and endog.
I will need to learn about that! 

One possible bug is that patsy converts categorical `endog` which it shouldn't do in this case. We have some related open issues, but I don't think patsy has the option to turn off automatic `endog` categorical transformation yet. endog might have to be integer category levels.
You had previously made mention in this thread of the categorical issue, which I didn't fully understand at the time. In my case I'm not using categorical data, yet. I may in the future, though, so it's good to be aware of possible further issues down the road.

(mostly guessing, I'm in different neighborhoods right now.)
Fair enough! Thanks! 


josef...@gmail.com

unread,
Aug 19, 2015, 11:58:50 PM8/19/15
to pystatsmodels
On Wed, Aug 19, 2015 at 10:57 PM, Jeff Alstott <jeffa...@gmail.com> wrote:

On Wed, Aug 19, 2015 at 7:47 PM, <josef...@gmail.com> wrote:
Please post the full traceback, this doesn't tell me anything.
> model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1', 
                                         data = df, groups=df['Agent_Entry'].values)
> model.endog = model.endog.ravel()  #I took this line from your gist. It doesn't seem to affect the error at all.

Instead of ravel, this should have been squeeze, plus making sure that model.endog doesn't have several columns.
The error is a bit different from before. It would be easier if you could make a minimal non-working example (or, if necessary, send me the real dataset privately).

All my guesses about what could be wrong should raise at an earlier point, at least I would expect them to.

To narrow down the problem

Which version of numpy are you using?

Does this raise an exception?
>>> np.arange(5)[np.arange(10) < 3]
array([0, 1, 2])
>>> np.__version__
'1.9.2rc1'

What's the dtype and what are the unique values of the original data?
df['Entered']

What kind of values are in model.endog, and what is its shape (before calling ravel on it)?

Are there nans, missing values in the data?

Does the same dataset work when not using the from_formula?

My main guess is still that it's a problem with how patsy generates the endog, and that the current version doesn't have enough checking for appropriate or usable inputs.
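Most of those checks can be run in a couple of lines of numpy; the `endog` array below is made-up stand-in data, not Jeff's actual variable:

```python
import numpy as np

# made-up stand-in for model.endog, to illustrate the diagnostic checks
endog = np.array([0., 1., 1., np.nan])

print(np.__version__)                       # numpy version
print(endog.dtype, endog.shape)             # dtype and shape before ravel
print(np.unique(endog[~np.isnan(endog)]))   # unique non-missing values
print(np.isnan(endog).any())                # are there nans / missing values?
```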

Josef

Jeff Alstott

unread,
Aug 26, 2015, 6:10:10 PM8/26/15
to pystat...@googlegroups.com
Sorry for the delay in responding to this. The machine with the relevant data went down, and we are waiting for the sysadmin who normally works in the same country as the machine to return from vacation and restart it. Then we can try your suggestions!

Jeff Alstott

unread,
Sep 7, 2015, 8:41:31 AM9/7/15
to pystat...@googlegroups.com
And we're back! And I found the problem, and I have a workaround, but I don't understand why there's a problem in the first place. 

"and making sure that model.endog doesn't have several columns."
This is the issue: model.endog was two columns, even after .squeeze()ing. This happens only when the endogenous variable is given as a boolean.

df['Entered'] = df['Entered'].astype('bool')
model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1', 
                                         data = df, groups=df['Agent_Entry'].values)
model.endog = model.endog.squeeze()
model.endog.shape
#(357, 2)
model.endog
#array([[ True, False],
#       [ True, False],
#       [ True, False],
#       [ True, False],
#       ....


But if it's an int then everything is fine:
df['Entered'] = df['Entered'].astype('int')
model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1', 
                                         data = df, groups=df['Agent_Entry'].values)
model.endog = model.endog.squeeze()
model.endog.shape
#(357,)
model.endog
#array([False, False, False, False, True,False.....

And the model will indeed fit correctly.

So why this difference? ConditionalLogitOne casts the endogenous variable as a bool in line 80:
self.endog = np.asarray(endog, bool)

However, I can't replicate the problematic behavior when I do this manually.
df['Entered'] = df['Entered'].astype('int') 
np.asarray(df['Entered'], bool)
#array([False, False, False, False, True,False.....

So the problem is presumably in the .from_formula() function, which I know nothing about.

I was initially trying to be clever by giving the endogenous variable as a boolean directly, because I didn't want the model object to create a new copy of the endogenous array (at model.endog). This is because the data is quite large and I'm trying to keep as low a memory profile as possible, lest I run out of RAM. However, I'm having trouble testing whether the endogenous variable is actually duplicated and takes up more space in RAM. Should I be doing something more clever instead?
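One way to test whether an array was actually duplicated is `np.shares_memory` (a small sketch with made-up data; `np.asarray` copies only when the dtype has to change):

```python
import numpy as np

a = np.array([0, 1, 1, 0], dtype=np.uint8)

b = np.asarray(a, np.uint8)  # same dtype: asarray returns the array itself
c = np.asarray(a, bool)      # dtype change: forces a copy

print(np.shares_memory(a, b))  # True  - no duplication
print(np.shares_memory(a, c))  # False - new memory was allocated
```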

Thanks!
Jeff


josef...@gmail.com

unread,
Sep 7, 2015, 9:12:59 AM9/7/15
to pystatsmodels
It's a known issue with patsy that it also transforms categorical endog, which we don't want in many cases.

Can you check if using `I(endog_name)` works to skip the conversion?
I never tried and don't have an example handy right now.

ConditionalLogit.from_formula('I(Entered) ~ Class_Patent_Count_Cumulative ...

What could work in this case is to use `np.uint8` as the dtype of the endog (if pandas supports it and there are no missing values, which would make pandas fall back to object arrays, I guess).

Other workarounds for using formulas would require adjusting the internal code, AFAICS.

Josef

Jeff Alstott

unread,
Sep 7, 2015, 9:41:16 AM9/7/15
to pystat...@googlegroups.com
It's a known issue with patsy in that it also transforms categorical endog, which we don't want in many cases.
What do you mean by "transform"? 

Can you check if using `I(endog_name)` works to skip the conversion?
I never tried and don't have a example handy right now

ConditionalLogit.from_formula('I(Entered) ~ Class_Patent_Count_Cumulative ...
df['Entered'] = df['Entered'].astype('bool')
model = ConditionalLogit.from_formula('I(Entered) ~ Agent_Class_Number*Class_Patent_Count_Cumulative_Previous_Year_Percentile + '
                                      'Agent_Previous_Citations_to_Class_Percentile - 1', 
                                         data = df, groups=df['Agent_Entry'].values)
model.endog = model.endog.squeeze()
model.endog.shape 
#(357, 2)

So that appears to not do it.


What could work in this case is to use `np.uint8` for the type of the endog. (if pandas supports it and there are no missing values which would make pandas use object arrays, I guess.)
That works just fine, just like my previous cast to 'int'. Does this prevent the duplication, or is it just a smaller footprint than int32?

josef...@gmail.com

unread,
Sep 7, 2015, 10:33:07 AM9/7/15
to pystatsmodels
I had to try, and it looks like it still makes a copy even though bool and uint8 have the same binary representation (AFAICS). 
Using `view` would work to avoid the copy but that's too risky in a library function.

if endog.dtype == np.uint8:
    endog = endog.view(np.bool_)

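A tiny demonstration of why the `view` trick avoids the copy: bool and uint8 are both one-byte types, so the view just reinterprets the same buffer (a sketch, not something the library does by default):

```python
import numpy as np

endog_u8 = np.array([0, 1, 1, 0], dtype=np.uint8)
endog_bool = endog_u8.view(np.bool_)  # reinterpret the same buffer as bool

print(np.shares_memory(endog_u8, endog_bool))  # True - no copy was made
print(endog_bool)                              # [False  True  True False]
```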

In general, when I want to avoid overhead in loops or repetitive calculations, I avoid formulas, so I never tried to figure out how to optimize these parts.

Josef

Jeff Alstott

unread,
Sep 7, 2015, 10:38:48 AM9/7/15
to pystat...@googlegroups.com
Fair enough. Thank you very much for the help!