What would be the steps for using that code? Assume I'm not that savvy with git, which is true.
Do I understand correctly that the code appears to be working, though there are no unit tests so there are no guarantees? Is there any other way in which it is "unfinished"?
On Sat, May 2, 2015 at 12:57 AM, Jeff <jeffa...@gmail.com> wrote:
> What would be the steps for using that code? Assume I'm not that savvy with git, which is true.

Theoretically it's just checking out or downloading the branch. However, it's a year behind master and won't have the newer parts of statsmodels. (And we cannot import two different versions of statsmodels at the same time.) It looks like I converted the branch to a common Python 2 / Python 3 codebase last year.

> Do I understand correctly that the code appears to be working, though there are no unit tests so there are no guarantees? Is there any other way in which it is "unfinished"?

About terminology, since we just had a question about Mixed Logit, Poisson, GLM and a new GSOC project: the project in 2013 was all about variations on Multinomial Logit, i.e. discrete choice models. I'm not sure how well it would work for binary Logit, which would be just a special case.

The emphasis of the GSOC project was to get the core estimation parts done. I guess there is also work left on formula integration and more post-estimation results.

Based on running the mxlogit example (on Python 3.4), it looks like mxlogit still has bugs (e.g. the hessian has one row and column of zeros *), and the return class is not properly connected.

(* After dropping the zero column in the hessian, the parameter standard errors, bse, look like they agree to around 2 decimals/digits with R, except for the standard deviation of the random component. It looks like a duplicated intercept in the parameter list; specifically, I think the random effects distributions should have mean zero.)

clogit looks good, but Ana ran out of time for mxlogit; it was added at the very end of GSOC, AFAIR.

Another point: Ana worked with 1-d parameter arrays, which should make most of the generic post-estimation results like t_test and similar work without changes. (discrete.MNLogit has 2-d parameters, which are still a headache.)

And one more: numerical integration uses Halton sequences, which could be generalized to distributions other than Gaussian. There is no version using Gaussian quadrature for integration, but it looks like the design is modular enough that only one method needs to be enhanced to support any finite point approximation to the distribution of random effects.

Based on this, I don't think it would take long to get mxlogit into a usable, tested state. Some improvements (like analytical score and hessian, and more features) would be more open-ended.

PS: After being puzzled by the random effects integration, I don't think mxlogit has been extended to panel data. AFAICS from partially skimming the vignette for R's mlogit, http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf , this is a Logit random utility model with unobserved cross-sectional heterogeneity, not one for repeated observations of the choice for each individual.
On Sat, May 2, 2015 at 10:49 AM, <josef...@gmail.com> wrote:
> [full quote of the earlier messages trimmed]
Worth noting: survival's clogit is implemented as a coxph of a specific form, with strata. It looks like statsmodels might be able to pull this off? Can it handle strata?
On Monday, May 4, 2015 at 5:40:01 PM UTC+8, Jeff wrote:

Thanks, Josef. The thing that I'm now after is actually the clogit. I poked around in its code, and I think I determined that the code is set up to handle a set of observations that all have the same choices. So, for example:

Observation 1
  Choice 1: bus
  Choice 2: car
  Choice 3: bike (choice selected!)
Observation 2
  Choice 1: bus
  Choice 2: car (choice selected!)
  Choice 3: bike

However, my data looks like this:

Observation 1
  Choice 1: bus
  Choice 2: car
  Choice 3: bike (choice selected!)
Observation 2
  Choice 1: bus (choice selected!)
  Choice 2: bike
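For reference, in long format this unbalanced case is just strata of different lengths with exactly one row flagged as chosen. A minimal sketch (column names are made up for illustration):

```python
import pandas as pd

# observation 1 offers three modes, observation 2 only two
df = pd.DataFrame({
    'obs':    [1, 1, 1, 2, 2],
    'mode':   ['bus', 'car', 'bike', 'bus', 'bike'],
    'chosen': [0, 0, 1, 1, 0],
})

# choice-set sizes differ across observations...
sizes = df.groupby('obs').size()           # 3 alternatives vs. 2
# ...but each observation still has exactly one chosen alternative
picks = df.groupby('obs')['chosen'].sum()  # 1 and 1
```

This is the same layout survival's clogit expects, with 'obs' playing the role of the stratum.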
It can! It totally can! That's great!
Unfortunately, it appears to take at least 4x as long to compute. Also, statsmodels keeps telling me test datasets are singular, while survival tells me they're not. The singular matrix error prevents me from scaling up the dataset to see whether that 4x-longer runtime persists with a large dataset.
To be clear: when using 'efron' or 'breslow', statsmodels is faster than survival. But survival using 'exact' is far faster than statsmodels using 'efron' or 'breslow'.
On Monday, May 4, 2015 at 9:39:45 PM UTC+8, Jeff wrote:

No intercept in the data. It looks like the cries of it being singular were due to an interaction of two floats; removing that interaction removed the singular error. I'll have to put it back in later, but that's a problem for later.

A detailed code example is forthcoming, but for now consider this: it looks like the computation time issue may also be gone, though only for smaller datasets. With larger datasets I can't actually get statsmodels to finish fitting, and I think this is due to the 'efron' setting for the ties.

In R's survival package, the relevant code is:

analysis <- clogit(myformula, df_sample)

clogit has an option that is relevant for us right now:

"method: use the correct (exact) calculation in the conditional likelihood or one of the approximations"

The default is 'exact'; the other options are 'approximate', 'efron', and 'breslow'.
Just quickly to answer the central questions: for my specific dataset, there are approximately 120 options (that's the max and most frequent number; the exact amount varies but is generally over 100). Each stratum has exactly one positive event (so one 1 and ~120 zeros).
On Mon, May 4, 2015 at 2:09 PM, Jeff Alstott <jeffa...@gmail.com> wrote:
> Just quickly to answer the central questions: for my specific dataset, there are approximately 120 options (that's the max and most frequent number; the exact amount varies but is generally over 100). Each stratum has exactly one positive event (so one 1 and ~120 zeros).

Good, only one choice is computationally the nicest case and doesn't even need the recursion, as the R clogit help page also mentions. In this case it's very close to a multinomial logit with a flexible number of choices (similar to the PR's clogit, not to our discrete.MNLogit, since you have choice-specific explanatory variables, IIUC).

I looked into the recursion a bit more. Vectorizing over strata is only easy if it is batched for individuals with the same number of ones and the same number of observations/choices. Whether it's worth it depends on how often loglike is called and how difficult the data handling is.

To continue with a panel data analogy: our MNLogit and the discrete choice models in the PR assume a balanced structure and can efficiently use reshaped arrays in wide format. Unbalanced panel or cluster data, as in GEE and MixedLM, mostly use loops over all individuals, groups, or strata.
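For the one-chosen-alternative case, the stratum log-likelihood is just x'b of the chosen row minus the log of the sum of exp(x'b) over the stratum, so no recursion is needed. A minimal numpy sketch (function and variable names here are illustrative, not the PR's API):

```python
import numpy as np

def clogit_loglike(params, endog, exog, group_idx):
    """Conditional logit log-likelihood when each stratum has exactly one
    chosen alternative.  `group_idx` holds the start index of each stratum
    in the row-stacked (long format) data."""
    endog = np.asarray(endog, bool)
    xb = exog.dot(params)
    # denominator: sum of exp(xb) within each stratum
    denom = np.add.reduceat(np.exp(xb), group_idx)
    # one chosen row per stratum, so xb[endog] lines up with denom
    return (xb[endog] - np.log(denom)).sum()

# toy data: two strata with 3 alternatives each, chosen row flagged True
exog = np.array([[0.], [1.], [2.], [1.], [0.], [1.]])
endog = [False, False, True, True, False, False]
ll = clogit_loglike(np.array([1.0]), endog, exog, [0, 3])
```

With mixed stratum sizes, np.add.reduceat handles the unbalanced case without any batching; batching only matters once strata can have more than one chosen alternative.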
"As far as I can tell...
Results: PHReg
===============================================================
Model: PH Reg Sample size: 99
Dependent variable: Intercept Num. events: 20
Ties: Efron
---------------------------------------------------------------
log HR log HR SE HR t P>|t| [0.025 0.975]
---------------------------------------------------------------
Intercept 0.0000 nan 1.0000 nan nan nan nan
a 0.4416 0.8453 1.5552 0.5224 0.6014 0.2967 8.1520
b -1.5333 0.9185 0.2158 -1.6694 0.0950 0.0357 1.3060
c -0.5142 0.8046 0.5980 -0.6391 0.5228 0.1236 2.8942
===============================================================
Confidence intervals are for the hazard ratios
Loading required package: splines
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "efron")
n= 99, number of events= 20
coef exp(coef) se(coef) z Pr(>|z|)
a 0.4416 1.5552 0.8453 0.522 0.601
b -1.5333 0.2158 0.9185 -1.669 0.095 .
c -0.5142 0.5980 0.8046 -0.639 0.523
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
a 1.5552 0.643 0.29668 8.152
b 0.2158 4.634 0.03566 1.306
c 0.5980 1.672 0.12355 2.894
Concordance= 0.668 (se = 0.181 )
Rsquare= 0.04 (max possible= 0.691 )
Likelihood ratio test= 4.02 on 3 df, p=0.2595
Wald test = 3.74 on 3 df, p=0.2914
Score (logrank) test = 3.89 on 3 df, p=0.2742
Here's a minimal example of using PHReg to do the same fitting as survival's clogit in R.
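That example can be sketched end to end as follows (reconstructed from the snippets in this thread; the PHReg.from_formula signature is assumed from current statsmodels, and the unit 'time' column is the standard trick for mapping clogit onto a stratified Cox model):

```python
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg

np.random.seed(400)
df = pd.DataFrame(np.random.rand(100, 3), columns=['a', 'b', 'c'])
df['my_strata'] = np.repeat([1, 2, 3, 4, 5], 20)
df['outcome_to_classify'] = np.tile([1, 0, 0, 0, 0], 20)
df['time'] = 1.0  # all durations tied at 1: only which rows "fail" in each stratum matters

# stratified Cox PH on unit durations approximates clogit (with 'efron' ties);
# '0 +' drops the intercept, which is not identified here and produces the
# nan/inf Intercept rows seen in the summaries above
mod = PHReg.from_formula('time ~ 0 + a + b + c', data=df,
                         status=df['outcome_to_classify'].values,
                         strata=df['my_strata'].values, ties='efron')
rslt = mod.fit()
print(rslt.summary())
```

With ties='efron' this should land close to, but not exactly on, R's method='exact' results, which is the discrepancy discussed below.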
Results: PHReg
====================================================================
Model: PH Reg Sample size: 99
Dependent variable: Intercept Num. events: 20
Ties: Efron
--------------------------------------------------------------------
log HR log HR SE HR t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept -0.0000 33554432.0000 1.0000 -0.0000 1.0000 0.0000 inf
a 1.0154 0.8366 2.7606 1.2138 0.2248 0.5357 14.2265
b 0.2449 0.8817 1.2775 0.2777 0.7812 0.2269 7.1916
c 0.3601 0.8118 1.4334 0.4435 0.6574 0.2920 7.0364
====================================================================
Confidence intervals are for the hazard ratios

R, survival's clogit, 'efron':
%load_ext rmagic
%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='efron')
%R print(summary(analysis))
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "efron")
n= 99, number of events= 20
coef exp(coef) se(coef) z Pr(>|z|)
a 1.0154 2.7606 0.8366 1.214 0.225
b 0.2449 1.2775 0.8817 0.278 0.781
c 0.3601 1.4334 0.8118 0.444 0.657
exp(coef) exp(-coef) lower .95 upper .95
a 2.761 0.3622 0.5357 14.227
b 1.277 0.7828 0.2269 7.192
c 1.433 0.6976 0.2920 7.036
Concordance= 0.62 (se = 0.181 )
Rsquare= 0.019 (max possible= 0.691 )
Likelihood ratio test= 1.88 on 3 df, p=0.5979
Wald test = 1.84 on 3 df, p=0.6055
Score (logrank) test = 1.86 on 3 df, p=0.6012

R, survival's clogit, 'exact'. Note the different results!
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "exact")
n= 99, number of events= 20
coef exp(coef) se(coef) z Pr(>|z|)
a 1.1201 3.0651 0.9189 1.219 0.223
b 0.2371 1.2675 0.9563 0.248 0.804
c 0.4339 1.5432 0.8962 0.484 0.628
exp(coef) exp(-coef) lower .95 upper .95
a 3.065 0.3263 0.5061 18.563
b 1.268 0.7889 0.1945 8.260
c 1.543 0.6480 0.2664 8.938
Rsquare= 0.019 (max possible= 0.574 )
Likelihood ratio test= 1.92 on 3 df, p=0.5901
Wald test = 1.85 on 3 df, p=0.6044
Score (logrank) test = 1.89 on 3 df, p=0.5956

"I have a GenericLikelihoodModel version of ConditionalLogit for the special case of one selected choice, but didn't run it yet on a verified example."

Great!

Python, with seed:

from pylab import *
import pandas as pd
seed(400)
df = pd.DataFrame(columns=['a','b','c'], data=rand(100, 3))
df['my_strata'] = repeat([1,2,3,4,5], 20)
df['outcome_to_classify'] = tile([1,0,0,0,0], 20)
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "efron")
n= 99, number of events= 20
coef exp(coef) se(coef) z Pr(>|z|)
a 1.472913 4.361925 0.977688 1.507 0.132
b 1.094598 2.987982 0.954946 1.146 0.252
c 0.003076 1.003081 1.016185 0.003 0.998
exp(coef) exp(-coef) lower .95 upper .95
a 4.362 0.2293 0.6419 29.64
b 2.988 0.3347 0.4598 19.42
c 1.003 0.9969 0.1369 7.35
Concordance= 0.582 (se = 0.354 )
Rsquare= 0.036 (max possible= 0.476 )
Likelihood ratio test= 3.66 on 3 df, p=0.3011
Wald test = 3.35 on 3 df, p=0.3413
Score (logrank) test = 3.53 on 3 df, p=0.3174

R, 'exact':
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))
Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "exact")
n= 99, number of events= 20
coef exp(coef) se(coef) z Pr(>|z|)
a 1.472913 4.361925 0.977688 1.507 0.132
b 1.094598 2.987982 0.954946 1.146 0.252
c 0.003076 1.003081 1.016185 0.003 0.998
exp(coef) exp(-coef) lower .95 upper .95
a 4.362 0.2293 0.6419 29.64
b 2.988 0.3347 0.4598 19.42
c 1.003 0.9969 0.1369 7.35
Rsquare= 0.036 (max possible= 0.476 )
Likelihood ratio test= 3.66 on 3 df, p=0.3011
Wald test = 3.35 on 3 df, p=0.3413
Score (logrank) test = 3.53 on 3 df, p=0.3174

There is still something strange with the example; I don't get the same results. Either I'm still misinterpreting the model, or it's because every individual chooses the same alternative and there is no variation in the selection.
Results: PHReg
====================================================================
Model: PH Reg Sample size: 99
Dependent variable: Intercept Num. events: 10
Ties: Efron
--------------------------------------------------------------------
log HR log HR SE HR t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept 0.0000 47453132.8121 1.0000 0.0000 1.0000 0.0000 inf
a 1.3262 1.2206 3.7666 1.0865 0.2773 0.3443 41.2036
b -0.4479 1.3052 0.6389 -0.3432 0.7315 0.0495 8.2499
c 0.2376 1.1903 1.2682 0.1996 0.8418 0.1230 13.0726
====================================================================
Confidence intervals are for the hazard ratios

Call:
coxph(formula = Surv(rep(1, 99L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "exact")
n= 99, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
a 1.3262 3.7666 1.2206 1.086 0.277
b -0.4479 0.6389 1.3052 -0.343 0.731
c 0.2376 1.2682 1.1903 0.200 0.842
exp(coef) exp(-coef) lower .95 upper .95
a 3.7666 0.2655 0.34433 41.20
b 0.6389 1.5651 0.04949 8.25
c 1.2682 0.7885 0.12302 13.07
Rsquare= 0.014 (max possible= 0.371 )
Likelihood ratio test= 1.44 on 3 df, p=0.6953
Wald test = 1.42 on 3 df, p=0.7017
Score (logrank) test = 1.45 on 3 df, p=0.694

Alright! So now it looks like the only remaining issue is how to handle the singular matrices? You're in the belly of the code, so you have a better shot at interpreting what coxph is doing with that singular.ok option.
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-34-47272558d484> in <module>()
     21                 strata='my_strata',
     22                 ties="efron")
---> 23 rslt = mod.fit()
     24 print(rslt.summary())

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/duration/hazard_regression.py in fit(self, groups, **args)
    373         if 'disp' not in args:
    374             args['disp'] = False
--> 375         fit_rslts = super(PHReg, self).fit(**args)
    376
    377         if self.groups is None:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    432             Hinv = cov_params_func(self, xopt, retvals)
    433         elif method == 'newton' and full_output:
--> 434             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    435         elif not skip_hessian:
    436             try:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py in inv(a)
    518     signature = 'D->D' if isComplexType(t) else 'd->d'
    519     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 520     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    521     return wrap(ainv.astype(result_t))
    522

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     88
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     91
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix
Call:
coxph(formula = Surv(rep(1, 69L), outcome_to_classify) ~ a +
b + c + strata(my_strata), data = df, method = "exact")
n= 69, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
a 0.35339 1.42389 1.23541 0.286 0.7748
b 0.44016 1.55295 1.33109 0.331 0.7409
c -2.94833 0.05243 1.46427 -2.014 0.0441 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
a 1.42389 0.7023 0.126441 16.0348
b 1.55295 0.6439 0.114322 21.0954
c 0.05243 19.0741 0.002973 0.9246
Rsquare= 0.072 (max possible= 0.429 )
Likelihood ratio test= 5.13 on 3 df, p=0.1624
Wald test = 4.1 on 3 df, p=0.2505
Score (logrank) test = 4.82 on 3 df, p=0.1853

R:

%R -i df
%R library(survival)
%R analysis <- clogit(outcome_to_classify ~ a + b + c + strata(my_strata), df, method='exact') #'exact' is the default
%R print(summary(analysis))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-cfffea3f4f0b> in <module>()
21 strata='my_strata',
22 ties="efron")
---> 23 rslt = mod.fit(method='bfgs')
24 print(rslt.summary())
/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/duration/hazard_regression.py in fit(self, groups, **args)
376
377 if self.groups is None:
--> 378 cov_params = fit_rslts.cov_params()
379 else:
380 cov_params = self.robust_covariance(fit_rslts.params)
/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in cov_params(self, r_matrix, column, scale, cov_p, other)
1044 if (cov_p is None and self.normalized_cov_params is None and
1045 not hasattr(self, 'cov_params_default')):
-> 1046 raise ValueError('need covariance of parameters for computing '
1047 '(unnormalized) covariances')
1048 if column is not None and (r_matrix is not None or other is not None):
ValueError: need covariance of parameters for computing (unnormalized) covariances
'newton' also cries singular. Trying 'bfgs' gives a different error, the ValueError traceback shown above.
If you're getting the same results as R, then I'm happy! I understand it's not unit tested and such. But I'd like to try it out, if only to test its runtime vs. R. Is it in a shape that I could access it?
Thanks, Josef! If I were wanting to run this in the next day or three, what would you recommend?

1. Waiting for this to be pulled into statsmodels, and then using the latest development branch of statsmodels? (seems easiest?)
2. Using your fork? I'm not savvy in simultaneously operating with parallel versions of packages.
3. Something else? E.g. your code is sufficiently brief that I could just copy/paste it into the IPython Notebook where I'm running my analyses.
Thanks! I may get to put this into use in the next day or two, but it's possible that for the moment I'll try to save programmer time by using the existing survival/clogit/RPy setup. In the end, though, I will definitely use your Python solution for the production run, to save computer time. I look forward to seeing this Python ConditionalLogit being fully vetted. It'll be great to have it!
You could try it out on some larger examples of your data to see whether it is actually fast enough.Don't trust unverified advertising claims by the author. (I didn't check any timing and only used the simple examples.)
Any help (with, for example, unit tests and extra features) would be useful. This was just a two-day unscheduled detour for me, and in contrast to working on mixed logit, it doesn't have the advantage of helping prepare for this year's GSOC projects.
Hello all,

I'll be moving into large-scale production with the ConditionalLogit model within the next month or two, and I wanted to know if I should still be using Josef's code. I heard that there was a GSoC project that would get at least close to this topic. Have there been any developments, or are any expected, that would change which code I should be using?
Resurrecting this thread to ask a follow-up question: how would I implement interaction terms with this ConditionalLogit code? I would think to use formulas, but using formulas with ConditionalLogitOne.from_formula() yields this error:

IndexError: too many indices for array

That's without even putting in an interaction term.
Thanks!
Please post the full traceback, this doesn't tell me anything.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-189-beaeee5f31cd> in <module>()
----> 1 fit = model.fit()

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in fit(self, **kwds)
    159         # only used to change default method to bfgs
    160         method = kwds.pop('method', 'bfgs')
--> 161         return super(ConditionalLogitOne, self).fit(method=method, **kwds)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, retall, **kwargs)
    669                        method=method, maxiter=maxiter,
    670                        full_output=full_output,
--> 671                        disp=disp, callback=callback, **kwargs)
    672         genericmlefit = GenericLikelihoodModelResults(self, mlefit)
    673

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    423                        callback=callback,
    424                        retall=retall,
--> 425                        full_output=full_output)
    426
    427         #NOTE: this is for fit_regularized and should be generalized

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/optimizer.py in _fit(self, objective, gradient, start_params, fargs, kwargs, hessian, method, maxiter, full_output, disp, callback, retall)
    182                        disp=disp, maxiter=maxiter, callback=callback,
    183                        retall=retall, full_output=full_output,
--> 184                        hess=hessian)
    185
    186         # this is stupid TODO: just change this to something sane

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/optimizer.py in _fit_bfgs(f, score, start_params, fargs, kwargs, disp, maxiter, callback, retall, full_output, hess)
    290                        gtol=gtol, norm=norm, epsilon=epsilon,
    291                        maxiter=maxiter, full_output=full_output,
--> 292                        disp=disp, retall=retall, callback=callback)
    293     if full_output:
    294         if not retall:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in fmin_bfgs(f, x0, fprime, args, gtol, norm, epsilon, maxiter, full_output, disp, retall, callback)
    780             'return_all': retall}
    781
--> 782     res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
    783
    784     if full_output:

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in _minimize_bfgs(fun, x0, args, jac, callback, gtol, norm, eps, maxiter, disp, return_all, **unknown_options)
    835     else:
    836         grad_calls, myfprime = wrap_function(fprime, args)
--> 837     gfk = myfprime(x0)
    838     k = 0
    839     N = len(x0)

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
    280     def function_wrapper(*wrapper_args):
    281         ncalls[0] += 1
--> 282         return function(*(wrapper_args + args))
    283
    284     return ncalls, function_wrapper

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in <lambda>(params)
    402         nobs = self.endog.shape[0]
    403         f = lambda params, *args: -self.loglike(params, *args) / nobs
--> 404         score = lambda params: -self.score(params) / nobs
    405         try:
    406             hess = lambda params: -self.hessian(params) / nobs

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/base/model.py in score(self, params)
    629         kwds = {}
    630         kwds.setdefault('centered', True)
--> 631         return approx_fprime(params, self.loglike, **kwds).ravel()
    632
    633     def score_obs(self, params, **kwds):

/home/jeffrey_alstott/anaconda3/lib/python3.4/site-packages/statsmodels/tools/numdiff.py in approx_fprime(x, f, epsilon, args, kwargs, centered)
    128     n = len(x)
    129     #TODO: add scaled stepsize
--> 130     f0 = f(*((x,)+args), **kwargs)
    131     dim = np.atleast_1d(f0).shape # it could be a scalar
    132     grad = np.zeros((n,) + dim, float)

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in loglike(self, params)
     86
     87         xb = self.exog.dot(params)
---> 88         return loglike_one_groups(self.endog, xb, self.group_idx).sum()
     89
     90     def loglike_obs(self, params):

/home/jeffrey_alstott/technoinnovation/Inventor_Movement/conditional_logit.py in loglike_one_groups(endog, xb, group_idx)
     41     endog = np.asarray(endog, np.bool_)
     42     expxb_groups = np.add.reduceat(np.exp(xb), group_idx)
---> 43     return xb[endog] - np.log(expxb_groups)
     44
     45 def score_one_groups(params, endog, exog, group_idx):

IndexError: index 39892 is out of bounds for axis 1 with size 39891
I don't think I tried out from_formula. It's just the inherited method which might not be appropriate in subclasses.
You could try to use patsy.dmatrices directly to see whether it produces the correct design matrix and endog.
One possible bug is that patsy converts categorical `endog` which it shouldn't do in this case. We have some related open issues, but I don't think patsy has the option to turn off automatic `endog` categorical transformation yet. endog might have to be integer category levels.
(mostly guessing, I'm in different neighborhoods right now.)
On Wed, Aug 19, 2015 at 7:47 PM, <josef...@gmail.com> wrote:
> Please post the full traceback, this doesn't tell me anything.

model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1',
                                      data = df, groups=df['Agent_Entry'].values)
model.endog = model.endog.ravel() #I took this line from your gist. It doesn't seem to affect the error at all.
model.endog.shape
#(357, 2)
model.endog
#array([[ True, False],
#       [ True, False],
#       [ True, False],
#       [ True, False],
#       ....
But if it's an int then everything is fine:
df['Entered'] = df['Entered'].astype('int')
model = ConditionalLogit.from_formula('Entered ~ Class_Patent_Count_Cumulative_Previous_Year_Percentile - 1',
data = df, groups=df['Agent_Entry'].values)
model.endog = model.endog.squeeze()
model.endog.shape
#(357,)
model.endog
#array([False, False, False, False, True, False.....

And the model will indeed fit correctly. So why this difference? ConditionalLogitOne casts the endogenous variable as a bool in line 80:

self.endog = np.asarray(endog, bool)

However, I can't replicate the problematic behavior if I do the cast manually:

df['Entered'] = df['Entered'].astype('int')
np.asarray(df['Entered'], bool)
#array([False, False, False, False, True, False.....

So the problem is presumably in the .from_formula() function, which I know nothing about.

I was initially trying to be clever by giving the endogenous variable as a boolean directly, because I didn't want the model object to create a new copy of the endogenous array (at model.endog). This is because the data is quite large and I'm trying to keep as low a memory profile as possible, lest I run out of RAM. However, I'm having trouble testing whether the endogenous variable is actually duplicated and takes up more space in RAM. Should I be doing something more clever instead?

Thanks!
Jeff
Can you check if using `I(endog_name)` works to skip the conversion? I never tried and don't have an example handy right now:

ConditionalLogit.from_formula('I(Entered) ~ Class_Patent_Count_Cumulative ...

It's a known issue with patsy in that it also transforms categorical endog, which we don't want in many cases.
What could work in this case is to use `np.uint8` for the type of the endog (if pandas supports it and there are no missing values, which would make pandas use object arrays, I guess).
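For example (a guess, assuming the 'Entered' column has no NaNs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Entered': [True, False, False, True]})
# uint8 keeps the 0/1 coding in a single numeric column, so patsy
# shouldn't treat the endog as a categorical to be dummy-encoded
df['Entered'] = df['Entered'].astype(np.uint8)
```

The column then stays one-dimensional 0/1 data, avoiding the two-column `Entered[False]`/`Entered[True]` design matrix seen above.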