Regularized Likelihood models?

Ian Langmore

unread,

Jul 27, 2012, 2:37:38 PM7/27/12

to pystat...@googlegroups.com

I made a few small additions to Likelihoodmodel.fit() in order to enable L1 regularization. See this fork. Is there any interest in adding something like this to the main statsmodels? It seems pretty easy to do. The only decisions to make involve the fact that LikelihoodModelResults is geared toward maximum likelihood results.

josef...@gmail.com

unread,

Jul 27, 2012, 3:54:26 PM7/27/12

to pystat...@googlegroups.com

Yes, I think this would make a good addition.

I don't know the details how L1 penalized estimation is calculated,
but I was thinking about similar Maximum L2- (or smooth-) penalized
Likelihood estimation.

The way you define `func`, the penalized objective function outside of
the models makes it very separate (positive) from the model code. It
makes it generally available for all MLE models.

On the other hand, I don't see where you can adjust the result
instance itself, generically. Do we need to adjust all result classes
to take advantage of this?

From your current code the only change in the results seems to be the
df_model. Is this all that changes of the statistical results? How are
cov_params and similar defined?

naive question: does it make sense to return the results with the
exog/variables for non-zero parameters?

(quick check: looks like we can ignore the penalization for
cov_params, ... and we get asymptotic results where we can ignore the
penalization parts. needs to be verified.)

aside: I looked for some time at the MLE analog of Ridge regression,
but haven't figured out yet how to attach the penalized likelihood to
the models and don't remember or didn't look at the statistical
details for the result statistics.

Thanks,

Josef

Ian Langmore

unread,

Jul 28, 2012, 6:30:56 PM7/28/12

to pystat...@googlegroups.com

Hello Josef.

Glad to hear there is interest. To answer your questions:

I added a some explanation of how the L1 penalization is done. See the docstring for l1.py here.

I'm not sure how cov_params are used, so I don't know how they should be modified. Are they used for any LikelihoodModel?

Some Results attributes no longer make sense, such at the gradient of the objective, gopt, or the hessian hopt. This happens since the new objective is non-differentiable. AIC/BIC should be defined with the new df_model (that only includes the nonzero parameters). The t and f tests are no longer applicable. The confidence intervals could be replaced by credible intervals, but I don't think people really believe these (the R package "penalized" does not compute them for this reason). So...what makes sense to me is to create a new RegularizedResults that is similar to LikelihoodModelResults, but with less information. What do you think?

Ridge regression should be much easier. The penalty is smooth, so all we have to do is modify the gradient and hessian and then the standard optimization techniques should apply.

-Ian

josef...@gmail.com

unread,

Jul 28, 2012, 7:06:43 PM7/28/12

to pystat...@googlegroups.com

On Sat, Jul 28, 2012 at 6:30 PM, Ian Langmore <ianla...@gmail.com> wrote:
> Hello Josef.
>
> Glad to hear there is interest. To answer your questions:
>
> I added a some explanation of how the L1 penalization is done. See the
> docstring for l1.py here.
>
> I'm not sure how cov_params are used, so I don't know how they should be
> modified. Are they used for any LikelihoodModel?

cov_params is the variance of our parameter estimates in large sample
(asymptotical Normal distribution)
We have it until now for all models, since all of them are smooth.

>
> Some Results attributes no longer make sense, such at the gradient of the
> objective, gopt, or the hessian hopt. This happens since the new objective
> is non-differentiable. AIC/BIC should be defined with the new df_model
> (that only includes the nonzero parameters). The t and f tests are no
> longer applicable. The confidence intervals could be replaced by credible
> intervals, but I don't think people really believe these (the R package
> "penalized" does not compute them for this reason). So...what makes sense
> to me is to create a new RegularizedResults that is similar to
> LikelihoodModelResults, but with less information. What do you think?

I don't know the literature but didn't expect it to be so "bad" in
terms of statistical results.

Yes, this looks like we need a special Results class that doesn't have
some of our usual results.

Is L1 fast enough for some bootstrap results, if we want to get other
statistical results. (If that is even theoretically possible.)

>
> Ridge regression should be much easier. The penalty is smooth, so all we
> have to do is modify the gradient and hessian and then the standard
> optimization techniques should apply.

Conceptually it looks straightforward, how to do it in practice and
what results we get in this case, I also don't know or remember.

Thanks, I will look at your changes (when it's not late night in front
of a dark pool bar that has the only wireless)

Josef

VincentAB

unread,

Jul 28, 2012, 8:15:50 PM7/28/12

to pystat...@googlegroups.com

Thanks, I will look at your changes (when it's not late night in front
of a dark pool bar that has the only wireless)

Josef

So you weren't kidding about the beach-side vacation. Enjoy!

Ian Langmore

unread,

Jul 28, 2012, 11:25:12 PM7/28/12

to pystat...@googlegroups.com

I'm not sure how we could get credibility intervals without MCMC...but I wouldn't be surprised if there was a way. I'll retract my point about "people not believing them." In my opinion though the L1 prior is chosen mostly for mathematical convenience and not because we believe that it's the marginal distribution of the parameter. Others often have a different opinion (I'm a mathematician, not a statistician anyways...).

As for the cov_params...here's my reasoning (which I haven't seen in the literature, but should be out there somewhere): The posterior looks like \exp\{-\ln Likelihood(\beta)\} \exp\{-\alpha \|\beta\|_1\}. If all \beta\neq0, the derivative of the negative log likelihood cancels the derivative of \|\beta\|_1 away from zero (it's differentiable away from zero), and we are justified in approximating the posterior as \exp\{-\beta^T H \beta/2\}, where H is the Hessian of -\ln Likelihood. In that case, the (approximate) covariance of the parameters is H^-1, and we don't need to change anything. If some \beta_j = 0, then, even away from zero, the derivatives do not cancel, and we're unjustified in using the "Hessian approximation." In that case we need to consider the first and second derivatives of the "negative log posterior", separately for both positive and negative parameter values. We could then integrate this against \beta\beta^T to get the covariance. In any case, the approximation isn't Gaussian anymore so cov_params doesn't tell the whole story. Does this make sense? I could tex this up and post it.

Other important additions would be the ability to compute the "evidence" (MacKay's evidence framework), as this can be used to let the data choose the regularization parameter. To approximate the evidence we would have to take both the Hessian and the first gradient into consideration (in the case that some parameters went to zero).

To be fully Bayesian, we would want to pick a hyper-prior for \alpha and then choose \beta and \alpha together. This would require some work.

We could just start un-ambitious and treat the prior as a regularization trick with no Bayesian interpretation.

-Ian

josef...@gmail.com

unread,

Jul 29, 2012, 3:31:17 AM7/29/12

to pystat...@googlegroups.com

I just consulted my in-house theoretical econometrician:

roughly what I understood:
If the model is correct and has only a finite number of non-zero
parameters/variables, then in large samples (asymptotically) the
variable selection has an oracle property, and the estimates of
non-zero parameters have the standard asymptotic covariance
(cov_params) and all (or most) associated results based on asymptotic
normality.

In this case we could return the standard Result instance for the
reduced parameter space plus the extra information that we have from
the L1 penalization.

Josef

Ian Langmore

unread,

Jul 29, 2012, 12:40:53 PM7/29/12

to pystat...@googlegroups.com

I agree. So the results restricted to the nonzero params can be done as you suggest. Results for the zero params could be handled later.

One thing that will be special is as follows: Suppose there are two y values (a.k.a. targets), y1 and y2. Then in general the non-zero params associated with either one could be different. Not sure how this will affect things but I can take a crack at it and see.

-Ian

Ian Langmore

unread,

Jul 29, 2012, 4:04:08 PM7/29/12

to pystat...@googlegroups.com

I see two difficulties with returning results for only the nonzero params. First, the nonzero params differ between targets. Second, some results are naturally computed using the zero params as well (e.g. the log likelihood).

How about this: Some of our previous results are no longer valid, and we have some new results. Let's create l1ModelResults (extending LikelihoodModelResults). It will return float('nan') in place of results that don't make sense. So, e.g. if params[0] = 0, then results.bse[0] = float('nan'). Does this seem like a good first step?

I started (see this commit), but am having a difficulty using the ResultsWrapper and @cache_readonly. Inside l1ModelResults, I have

@cache_readonly

def bse(self):

bse = super(l1ModelResults, self).bse

# self.zero_param_idx is populated during init

bse[self.zero_param_idx] = float('nan')

return bse

The wrapper overrides this by calling MultinomialResults.bse. If I preemptively call l1ModelResults.bse (e.g. inside __init__), then my results are preserved, but they are not shaped right (normally, MultinomialResults creates and shapes bse).

-Ian

josef...@gmail.com

unread,

Jul 29, 2012, 5:05:15 PM7/29/12

to pystat...@googlegroups.com

On Sun, Jul 29, 2012 at 4:04 PM, Ian Langmore <ianla...@gmail.com> wrote:
> I see two difficulties with returning results for only the nonzero params.
> First, the nonzero params differ between targets. Second, some results are
> naturally computed using the zero params as well (e.g. the log likelihood).
>
> How about this: Some of our previous results are no longer valid, and we
> have some new results. Let's create l1ModelResults (extending
> LikelihoodModelResults). It will return float('nan') in place of results
> that don't make sense. So, e.g. if params[0] = 0, then results.bse[0] =
> float('nan'). Does this seem like a good first step?
>
> I started (see this commit), but am having a difficulty using the
> ResultsWrapper and @cache_readonly. Inside l1ModelResults, I have
> @cache_readonly
> def bse(self):
> bse = super(l1ModelResults, self).bse
> # self.zero_param_idx is populated during init
> bse[self.zero_param_idx] = float('nan')
> return bse
>
> The wrapper overrides this by calling MultinomialResults.bse. If I
> preemptively call l1ModelResults.bse (e.g. inside __init__), then my results
> are preserved, but they are not shaped right (normally, MultinomialResults
> creates and shapes bse).

Can you please bottom or inline post? We are used to that convention
in numpy/scipy/... related mailing lists?
Makes it easier to follow the thread.

I think the problem is not the result wrapper, but that discrete
models have their own subclasses for the results.

for example for multinomial in statsmodels/discrete/discrete_model.py
MultinomialResults(DiscreteResults)

We need this to add results that are specific to binary, count or
multinomial response models. That is related to my initial question
whether we have to add L1 specific code to all the results
(sub)classes.

The result sub-classes still inherit most of the statistical results,
however multinomial is a special case for bse because params is 2d
instead of the usual 1d.

I think the direct approach for multinomial would be to overwrite the
MultinomialModel.fit method and return a L1MultinomialResults.
I don't know if there is a more general solution (mixins?).
Code reuse and model specific results have the unfortunate side-effect
that everything got a bit spread out across classes.

Now, I think I understand why you mention target, before I didn't pay
enough attention to realize that your demo example uses multinomial
which is a 2d problem. I see the problem with returning a standard
result instance if the variable selection varies by target/multinomial
choice.
In analogy to system of equation models where we don't necessarily
have the same number of regressors in each equation, the way out would
be to work only with the flattened params and bse. (But it's more work
to change or add this.)

Is there a special reason you picked multinomial as example and not
Logit or Poisson for example?

aside: in l1ModelResults(LikelihoodModelResults) it looks like you
copied the entire super class. Most of it will be inherited and
doesn't need to be copied, isn't it?

Josef

Message has been deleted

Ian Langmore

unread,

Jul 29, 2012, 6:02:55 PM7/29/12

to pystat...@googlegroups.com

On Sun, Jul 29, 2012 at 5:57 PM, Ian Langmore <ianla...@gmail.com> wrote:

I chose multinomial because that's what I need for work. You're right: I did copy over the entire super class. I was going to go through and modify/delete methods as needed--I started with bse. Sorry for the confusion.

I'll try making an L1MultinomialResults class. This is the simplest thing to do. We can decide whether to generalize things later.

-Ian

I chose multinomial because that's what I need for work. You're right: I did copy over the entire super class. I was going to go through and modify/delete methods as needed--I started with bse. Sorry for the confusion.

I'll try making an L1MultinomialResults class. This is the simplest thing to do. We can decide whether to generalize things later.

-Ian

josef...@gmail.com

unread,

Jul 29, 2012, 6:33:10 PM7/29/12

to pystat...@googlegroups.com

very good reason

You're right: I
> did copy over the entire super class. I was going to go through and
> modify/delete methods as needed--I started with bse. Sorry for the
> confusion.
>
> I'll try making an L1MultinomialResults class. This is the simplest thing
> to do. We can decide whether to generalize things later.

I was running your demo a few times to see how well this works.

A few observations:

My impression is that the regularization parameter alpha is too high,
too many zeros. Did you think about how to choose the regularization
parameter?

The sample size needs to be at least a few thousand to get good
results (close to true parameters).

L1 looks much better in an example with 20 regressors, 10 of them with
true_param equal to zero.

question (I have no idea about the answer): Does the penalized
likelihood function have a nice shape (convex) or do we have to worry
about local minima and the choice of starting values?

suggestion: A possible nice feature for practical applications might
be a whitelist of variables, besides the constant, that are not
penalized. (or groups with different regularization parameter ?)
(example: In Ridge regression with polynomial terms, we can make the
penalization stronger for higher order terms.)

I appreciate that we get L1 penalized likelihood (and I don't have to
figure out the technical details :)

Josef

>
> -Ian

Ian Langmore

unread,

Jul 29, 2012, 8:42:41 PM7/29/12

to pystat...@googlegroups.com

For this demo, you're right that the regularization parameter is too high. In this demo, the "real" data is generated from the exact same process as our model, so as N\to\infty, the ML estimate converges to the true value. Therefore, one should use a regularization parameter that is independent of N (so that as you add more data, the MAP solution turns into the ML solution). Instead of following this guideline, for the demo I chose \alpha ( = 0.005 * N) in order to ensure that I get some zeros. You bring up the good point that a better demo would involve some of the "true" parameters actually equal to zero.

The choice of regularization parameter could be done with cross validation, or with BIC or AIC or "evidence." Much has been written about it (although I've only seen incorrect expositions of the evidence framework for this problem). I think if we return AIC and BIC then the user can make their own choice. In my practical problem, the "niceness" of the resultant nonzero parameter set (small but not too small) is an important consideration.

As for the whitelist: Maybe it would be just as easy to allow alpha to be a vector. Then the user can set some components of that vector = 0. I'll have to see if that causes problems with the optimizer (the gradient would have zeros in it).

Since the L1 penalty is strictly convex, and the original negative-log-likelihood is convex, adding them together we get a strictly convex objective function. Therefore there is one unique to the optimization problem. Nonetheless, in my experience with this particular package (I've been using it over the past week), the final solution can depend on the initial guess! We also often get convergence errors. This only happens on larger scale problems. I have an appointment tomorrow with a professor who studies optimization. We will try to find the cause of this issue.

-Ian

Ian Langmore

unread,

Jul 29, 2012, 8:59:40 PM7/29/12

to pystat...@googlegroups.com

Here's a question: What is df_model supposed to represent? The way it is defined it depends on whether or not we are fitting an intercept. This doesn't seem to be standard...or is it? Isn't it just the number of free parameters in the model?

-Ian

Ian Langmore

unread,

Jul 29, 2012, 9:52:37 PM7/29/12

to pystat...@googlegroups.com

I added the L1MultinomialResults class, extending MultinomialResults. I committed. I think things work. What do you think?

-Ian

Ian Langmore

unread,

Jul 30, 2012, 2:11:28 AM7/30/12

to pystat...@googlegroups.com

Correction: The L1 penalty is convex but not strictly convex. The negative log likelihood is convex. So the objective function is convex (but possibly not strict). This means that we could have multiple minima...each giving the same objective function value. It won't have any local minima that are not global minima...so the problem is fairly well posed.

-Ian

josef...@gmail.com

unread,

Jul 30, 2012, 5:04:22 AM7/30/12

to pystat...@googlegroups.com

(just to this one for now)

df_model is the number of parameters minus 1. -1 for the constant as
used in the R_squared or similar, when we calculate measures that
compare with the model that just contains a constant.

It's an early design decision, which was useful in the linear model,
but not so for the other models. I started to use in some cases
df_modelwc ("with constant") to avoid the minus one. Also, whether
models or results know whether there is a constant among the
regressors, if they need to know it, is still a bit inconsistent. And
I think for some calculations we assume that a constant is included.

(We have an open issue for hasconst or const_idx to replace the ad-hoc
detection of a constant.)

Josef

>
> -Ian

josef...@gmail.com

unread,

Jul 30, 2012, 5:13:56 AM7/30/12

to pystat...@googlegroups.com

or better: df_model without counting constants assuming there is a constant
in your multinomial example you have one constant per target.

Josef

Ian Langmore

unread,

Jul 30, 2012, 11:04:18 AM7/30/12

to pystat...@googlegroups.com

I implemented your suggestion to give more control over the regularization parameter. Now alpha can be a scalar or an array (with the same length as the total number of parameters). You can use the demo to pick different nonzero patterns in alpha and consequently zero out different members of results.params. The code is actually much simpler since now I don't deal with the constant/no-constant cases separately. Do you like this implementation of alpha?

Playing around with the demo...l1 regularization doesn't do any better (in a MSE sense) than ML at fitting the true parameters. This is not surprising since the covariates are uncorrelated, hence the problem is well-posed to begin with. Therefore, ML should do just fine (assuming N is large enough). The original motivation of regularization is to deal with ill-posed problems where e.g. the covariates are correlated and hence the fitted params take on "random" values (see e.g. http://ianlangmore.com/quick-introduction-to-bayesian-inference.html). Perhaps I should add some correlation to make ML do worse...

-Ian.

Ian Langmore

unread,

Aug 1, 2012, 11:14:37 AM8/1/12

to pystat...@googlegroups.com

I added a CVXOPT solver. I also cleaned up the code. The CVXOPT solver seems to give more predictable output. However, both l1 solvers often don't converge cleanly. Nonetheless, the output looks good. I'll keep playing around with them and try to understand the convergence issues.

-Ian

Ian Langmore

unread,

Aug 3, 2012, 6:12:40 PM8/3/12

to pystat...@googlegroups.com

The CVXOPT solver works nicely. It sometimes uses the maximum number of iterates, but if the tolerance is tuned down it doesn't. I updated the demo to make it nicer.

I think this package is ready to be integrated into statsmodels. I'm on vacation for the next week, but can answer email. Once I'm back I can do more work. Basically I'm letting you determine what needs to be done, and then I'll do most of the coding.

-Ian

josef...@gmail.com

unread,

Aug 3, 2012, 6:58:51 PM8/3/12

to pystat...@googlegroups.com

Unfortunately CVXOPT is GPL licensed which is not compatible with out BSD license, which limits how much we can use it. For example only as an optional dependency that users need to install.

Is CVXOPT now required or does the slsqp version still work?

I think this package is ready to be integrated into statsmodels. I'm on vacation for the next week, but can answer email. Once I'm back I can do more work. Basically I'm letting you determine what needs to be done, and then I'll do most of the coding.

I'm still for another week on family vacation and holiday schedule.

Josef

-Ian

Jonathan Taylor

unread,

Aug 13, 2012, 4:22:21 PM8/13/12

to pystat...@googlegroups.com

If anyone's interested, a student of mine and I have been playing around with some software for

pretty general regularized regression problems, i.e. LASSO, group LASSO, generalized LASSO, nuclear norm,

non-negative constraints, isotonic regression, etc.

The state of the code is pretty good, though the docs are lagging. The most current branch is "jonathan"

https://github.com/klingebj/regreg

The solver uses FISTA or ISTA and has some capacity for smoothing (a la NESTA: http://www-stat.stanford.edu/~candes/nesta/ TFOCS: http://tfocs.stanford.edu/)

though we can handle smoothing quite general penalties / constraints.

As it is a first order solver, all it needs are the objective function (i.e. negative log-likelihood) and the gradient. I forget if statsmodels have "score" (i.e. gradient) methods. If so, it is trivial to add a lot of different penalties to statsmodels.

Jonathan

--
Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel: 650.723.9230
Fax: 650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo

Ian Langmore

unread,

Aug 13, 2012, 5:30:03 PM8/13/12

to pystat...@googlegroups.com

Thanks Jonathan.

Does your student want to help integrate some of these into statsmodels? The solver is 1/2 of the picture...the rest is how to integrate with the rest of statsmodels. In any case, the branch https://github.com/langmore/statsmodels works but needs unit tests and a few other things...this provides a start that your student could contribute to.

-Ian

Ian Langmore

unread,

Aug 14, 2012, 9:34:48 AM8/14/12

to pystat...@googlegroups.com

Sorry. I didn't see you had a question:

Both slsqp and CVXOPT work. Currently they are accessed by setting method='l1' or method='l1_cvxopt_cp'.

I'm back from my vacation and ready to do whatever needs to be done.

-Ian

Jonathan Taylor

unread,

Aug 14, 2012, 11:44:29 AM8/14/12

to pystat...@googlegroups.com

I'm not sure it fits in as a branch of statsmodels because it handle many more convex optimization problems other than

adding "l1" to likelihood model. But that's certainly a prominent use case.

It is more a package to enable high dimensional first order solvers in the spirit of ISTA, FISTA, NESTA, TFOCS and (eventually) some distributed solvers via ADMM.

josef...@gmail.com

unread,

Aug 16, 2012, 11:42:52 AM8/16/12

to pystat...@googlegroups.com

(mixing top, bottom and inline posting I have no idea where to reply)

Ian,

You could open a WIP (work in progress) pull request, so it will be
easier to add comments.

I'm still too busy for a detailed review,

https://github.com/langmore/statsmodels/compare/d64f74a911...master

Some of the basic things that I think need to be done:

make cvxopt optional, protect the import and usage with try ... except
(like matplotlib)

We can also keep in mind that we could use other optimizers like
Jonathan's, so we can extend the choice from slsqp and cvxopt.

one advantage of your setup is that the L1 optimizer is available for
every likelihood model. So, we need at least some examples of models
other than multinomial. (and maybe give a warning if L1 doesn't work
for a subclass model.)
Using Logit as a second case would have the advantage that we can
compare with sklearn.

missing tests:
I don't know how much we will get benchmark cases for verification.
The optimizer itself needs to be tested/verified for at least one
case. For other results we might be able to rely on regression tests,
or just test that the generic results work in this case.
I think it's worth a look to see if we get a comparison to sklearn for
an overlapping case, my guess is currently Logit would be the only
overlap.

I think the overall integration with the existing model looks good. I
don't see any obvious problems or better solutions.

One extra part will be to add the results wrapper, but it's just a few
lines, and Skipper did all of them so far.
Maybe some cosmetic changes to summary() will be necessary if we want
to avoid the printing of all the "dropped", zero parameters.

another open question: how do we get the penalization parameter?
my experience is only with Ridge regression, L2-penalized linear least
squares. Georg does a lot of cross-validation for bandwidth choice in
nonparametric.

Thank you,

Josef

Josef

Jonathan Taylor

unread,

Aug 16, 2012, 11:11:56 PM8/16/12

to pystat...@googlegroups.com

I added an example of using regreg to solve an L1 minimized problem: all that is needed is for the model to have an objective and a score.
Also added a group LASSO example.

https://github.com/jonathan-taylor/statsmodels/tree/regreg

Uses master of:

https://github.com/klingebj/regreg

Jonathan Taylor

unread,

Aug 16, 2012, 11:15:40 PM8/16/12

to pystat...@googlegroups.com

another open question: how do we get the penalization parameter?
my experience is only with Ridge regression, L2-penalized linear least
squares. Georg does a lot of cross-validation for bandwidth choice in
nonparametric.

Yes, typically this would be done with cross-validation. There are some theoretically guided choices if the noise level is known.

In regreg, I have a pathwise version for LASSO (complete) and group LASSO (almost complete) that yields a solution path \`a la glmnet...

josef...@gmail.com

unread,

Aug 16, 2012, 11:45:51 PM8/16/12

to pystat...@googlegroups.com

On Thu, Aug 16, 2012 at 11:11 PM, Jonathan Taylor <jonatha...@stanford.edu> wrote:

I added an example of using regreg to solve an L1 minimized problem: all that is needed is for the model to have an objective and a score.
Also added a group LASSO example.

https://github.com/jonathan-taylor/statsmodels/tree/regreg

Looks nice and simple.

This is starting to look great. With the integration that Ian already did or is doing, we, suddenly, can have Lasso on all our likelihood models (all have loglike, score and hessian). Just a bit more wrapper work and testing.

I think group lasso will be able to take care of models that have un-penalized parameters different from coefficients on explanatory variables.

Thanks

Josef

Ian Langmore

unread,

Aug 17, 2012, 10:01:52 AM8/17/12

to pystat...@googlegroups.com

I'll send an update later today. As for picking the regularization parameter I agree that cross validation is the way to go. I have seen some theory here, but unfortunately I'm quite sure they cut-and-paste an error from here...rendering the main point of the theory completely false! In any case, even with good asymptotic theory, I'd still use cross validation. Eventually we could build a cross validation module that automatically splits data and fits and computes error...

Ian Langmore

unread,

Aug 17, 2012, 5:24:06 PM8/17/12

to pystat...@googlegroups.com

I just put in the WIP Pull request "WIP: Added l1 regularization to LikelihoodModel"
I protected the import of cvxopt
The demo now has a CLI and can use MNLogit, Logit, or Probit.
We can't easily compare with sklearn since they are using "lasso" and we use "Bayesian regression with a Laplace prior." Lasso carries different meaning sometimes, but to me, lasso means unpenalized regression with an l1 constraint. Bayesian regression with a Laplace prior means the penalty is added explicitly to the standard objective function. They are related by a non-trivial constant.
Tests and results wrapper are still missing.

-Ian

josef...@gmail.com

unread,

Aug 18, 2012, 5:56:59 AM8/18/12

to pystat...@googlegroups.com

As far as I understand, as long as the objective function is the same (log-likelihood or least squares) the only difference is in the scaling of the penalization factor/Lagrange multiplier. So the trace of dropped variables as function of the penalization factor should be the same (after rescaling).

I'm starting two new threads on related questions.

Thanks,

Josef

Jonathan Taylor

unread,

Aug 29, 2012, 2:49:21 AM8/29/12

to pystat...@googlegroups.com

As long as you solve the problem as Ian had written in the docstring here https://github.com/langmore/statsmodels/blob/master/statsmodels/discrete/l1_slsqp.py, then +1 to Josef's comment.

As far as I understand, as long as the objective function is the same (log-likelihood or least squares) the only difference is in the scaling of the penalization factor/Lagrange multiplier. So the trace of dropped variables as function of the penalization factor should be the same (after rescaling).

I'm starting two new threads on related questions.

Thanks,

Josef

Tests and results wrapper are still missing.
-Ian

Ian Langmore

unread,

Sep 4, 2012, 12:07:17 PM9/4/12

to pystat...@googlegroups.com

On Wed, Aug 29, 2012 at 2:49 AM, Jonathan Taylor <jonatha...@stanford.edu> wrote:

As long as you solve the problem as Ian had written in the docstring here https://github.com/langmore/statsmodels/blob/master/statsmodels/discrete/l1_slsqp.py, then +1 to Josef's comment.

As far as I understand, as long as the objective function is the same (log-likelihood or least squares) the only difference is in the scaling of the penalization factor/Lagrange multiplier. So the trace of dropped variables as function of the penalization factor should be the same (after rescaling).

I'm starting two new threads on related questions.

Thanks,

Josef

Tests and results wrapper are still missing.
-Ian

Let's reply at the bottom of posts.

Jonathan. What new threads did you start? I can't find them on the pystatsmodels google group.

I added l1_demo/sklearn_compare.py, which uses either the spector or anes96 datasets to do a comparison of regularization paths. The results look good: The paths match up.

Josef: Are unit tests next? Let me know if I should give you write permissions to my fork or anything else that is standard practice.

-Ian

Ian Langmore

unread,

Sep 6, 2012, 11:28:52 AM9/6/12

to pystat...@googlegroups.com

I updated the code to make it PEP8 compliant.

josef...@gmail.com

unread,

Sep 6, 2012, 12:03:49 PM9/6/12

to pystat...@googlegroups.com

Sorry for the delay, I'm starting to get back to statsmodels "work"

Yes, I think unittests are the main part missing before a merge.

If you haven't looked at our unittests yet, we follow mostly the numpy
testing guidelines. The main structure we have settled on is to use
classes and subclasses to run the same tests for different versions of
the models (or different datasets).
A simple, quickly written set of tests is in
https://github.com/statsmodels/statsmodels/pull/453/files#diff-4 , but
the tests for any models follow a similar structure but more
elaborate.

I didn't run your recent examples yet, but my rough plan for unittests
would be to convert one or more of the examples to tests.

The important tests will be on the number of zeros in the estimate for
a few cases of the penalization parameter, and maybe test that slsqp
and cvxopt get the same result in simple cases.
The rest will mainly be regression tests, that the statistical results
in the results instance work (produce the same result as in the
reduced model?)

I can help with this next week, and look at your latest examples.

Josef

Ian Langmore

unread,

Sep 6, 2012, 2:03:35 PM9/6/12

to pystat...@googlegroups.com

I started writing tests. I think I understand the basic structure. See my latest commit. If you like the structure, I'll do the same with another value of the regularization parameter. Also, tell me what tests to add next.

-Ian

josef...@gmail.com

unread,

Sep 6, 2012, 2:24:00 PM9/6/12

to pystat...@googlegroups.com

I added a quick comment on PR, back in half an hour.

Do you have now Logit and Multinomial Logit as examples?

Josef

>
> -Ian

Ian Langmore

unread,

Sep 6, 2012, 2:59:02 PM9/6/12

to pystat...@googlegroups.com

I took care of the issues on the PR. See my commit.

demo.py has Logit and MNLogit and Probit, all in their "full glory." This demo is quite complicated and it's likely that I'll only use it for myself.

short_demo.py only has Logit. However, you can just change Logit (on line 34) to Probit or MNLogit and see the results. This isn't a good demo of MNLogit since there are only two targets. I would like to copy an existing MNLogit demo. I can't find one in the documentation. Did I miss it?

-Ian

josef...@gmail.com

unread,

Sep 6, 2012, 3:25:05 PM9/6/12

to pystat...@googlegroups.com

looks good

>
> demo.py has Logit and MNLogit and Probit, all in their "full glory." This
> demo is quite complicated and it's likely that I'll only use it for myself.
>
> short_demo.py only has Logit. However, you can just change Logit (on line
> 34) to Probit or MNLogit and see the results. This isn't a good demo of
> MNLogit since there are only two targets. I would like to copy an existing
> MNLogit demo. I can't find one in the documentation. Did I miss it?

example_discrete has MNLogit with anes data

I just checked: the main result specific to L1 to test will be
zero_param_idx and nnz_params

looking at Logit L1BinaryResults
bse is overwritten to include nans.
but I don't see what the inherited cov_params will be

open design question
does cov_params work with nans? used for example in t_test and
f_test, nans might be ok, we get nans back if any of the zeroed
parameters is used
or
should cov_params be defined on the reduced non-zero parameter space?
(outside of MNLogit we wouldn't have a problem creating a reduced
result instance)

a new method/option for L1 results: return_nonzero ?
add ``params`` with reduced set and ``params_full`` with the zeros included?

essentially, we need a design and tests that t_test, f_tests, and
similar give valid results for the nonzero parameters.

another possibility to create test cases: take an existing example and
add noise variables to exog and see whether L1 removes them.

Josef

>
> -Ian

josef...@gmail.com

unread,

Sep 6, 2012, 3:41:49 PM9/6/12

to pystat...@googlegroups.com

another possible variation for a test case:

add almost noise variables with decreasing amount of influence and see
that they are set to zero approximately in sequence.
But I'm not sure how well this will work if the sample size is not
large, noise would dominate sequence.
beta = (np.arange(20, 1, -1)/40.)**2
exog = np.random.randn(nobs, 20)
endog_true = np.dot(exog, beta)
...
or something like this

and test: zero_param_idx.mean() is decreasing function of alpha

Josef

>
> Josef
>
>>
>> -Ian

josef...@gmail.com

unread,

Sep 6, 2012, 4:10:38 PM9/6/12

to pystat...@googlegroups.com

trying to understand what is going on

running short_demo with alpha = 2 * N * np.ones(K)

>>> logit_l1_res.params
array([-0. , -0. , -0.00708798, 0. ])
>>> logit_l1_res.nnz_params
4
>>> logit_l1_res.zero_param_idx
(array([], dtype=int32),)
>>> logit_l1_res.bse
array([ 2.70804171, 0.83732284, 0.10067523, 0.71953571])

why are the parameters that are zero not counted in zero_params?
Is there a range where the L1 shrinkage to zero is not counted yet?

>>> logit_l1_res.params[0]
-4.0713192000732499e-15
>>> logit_l1_res.params[1]
-7.4420513321418466e-15
>>> logit_l1_res.params[-1]
6.1312610219790984e-16

Some of what I was saying or thinking might be incorrect, because I
forgot again that L1 penalization also does shrinking and not just
selection.

with alpha=0, I get small differences
>>> logit_l1_res.params
array([-13.01975671, 2.82582333, 0.09512088, 2.37848333])
>>> logit_res.params
array([-13.02134686, 2.82611259, 0.09515766, 2.37868766])
nothing to worry about, but it might be worth looking at convergence
criteria and whether they are consistent across estimators.

Josef

>
> Josef
>
>
>>
>> Josef
>>
>>>
>>> -Ian

josef...@gmail.com

unread,

Sep 6, 2012, 4:45:51 PM9/6/12

to pystat...@googlegroups.com

storing to think about, before running off

do we need nans in bse?
we are inverting the full hessian, so either we can use bse for all
parameters or they are all messed up?
what's the shortcut to invert only part of the hessian to get
normalized cov_params for the reduced set, treating the rest as
restricted nuisance parameter?
am I missing something again?

>>> np.sqrt(np.diag(np.linalg.inv(-logit_l1_res.model.hessian(logit_l1_res.params))))
array([ 3.37928767, 0.97253199, 0.11728033, 0.8450922 ])
>>> np.sqrt(np.diag(np.linalg.inv(-logit_res.model.hessian(logit_l1_res.params))))
array([ 3.37928767, 0.97253199, 0.11728033, 0.8450922 ])
>>> logit_l1_res.bse
array([ 3.37928767, 0.97253199, 0.11728033, 0.8450922 ])

just the inverse hessian evaluated at the penalized parameters.

Josef

>
> Josef
>
>
>
>>
>> Josef
>>
>>
>>>
>>> Josef
>>>
>>>>
>>>> -Ian

Ian Langmore

unread,

Sep 6, 2012, 6:12:35 PM9/6/12

to pystat...@googlegroups.com

bse gets overwritten since the "basic standard error" (??? is that what bse is?) is no longer valid for the zero'd out params. cov_params is exactly as before, but shouldn't be. I'll change this.

I'll add tests for bse and nnz_params.

My preference is to return all params, with interval estimates = nan for the zero'd params. For my particular use case it is interesting to see which params got zero'd out. I can see however some cases where you have MANY irrelevant params that get zero'd out, and you would want to ignore those. What if we think about adding that later?

The CLI demo (statsmodels/l1_demo/demo.py) allows you to use as many params as possible, and set as many as you want to zero. You can add noise and stuff too. The results are interesting and consistent with what you would want. It's never this beautiful "all the irrelevant params go to zero while the other ones are recovered perfectly" thing, but it seems to work.

Ian Langmore

unread,

Sep 6, 2012, 6:15:17 PM9/6/12

to pystat...@googlegroups.com

I have tested (with data for my job) that nnz_params decreases as alpha increases. The comparison to sklearn (in statsmodels/l1_demo/sklearn_compare.py) also demonstrates this.

Ian Langmore

unread,

Sep 6, 2012, 6:22:24 PM9/6/12

to pystat...@googlegroups.com

The params don't get counted as zero (for nnz_params) unless they are truly == 0. To do this, the fitting should be done with trim_params=True. trim_tol (default = 1e-4) determines the level under which params get trimmed (set to zero). In the old version of the demo, trim_params was set to False. The new demo (just committed) has trim_params=True. I'll change the default to trim_params=True.

The convergence criteria are different for the two methods. For cvxopt it's quite complicated--a combination of primal and dual residual. For slsqp I have no idea (unreadable Fortran combined with almost no documentation). I'll do a test where I decrease tolerance in both and make sure the answers (with alpha = 0) match up to more significant figures.

-Ian

josef...@gmail.com

unread,

Sep 6, 2012, 6:28:40 PM9/6/12

to pystat...@googlegroups.com

I was comparing with the unpenalized discrete_model version. I don't
have cvxopt installed in my development python (2.6).

Josef

>
> -Ian

josef...@gmail.com

unread,

Sep 6, 2012, 6:32:53 PM9/6/12

to pystat...@googlegroups.com

good, I would include an example like that as a test case.
Also, tests will make sure that we get the same (decimals?) results
across python, numpy and scipy versions and across
platforms/computers, besides making future refactorings easier.

(If it gets too slow, we can mark some of the tests as "slow")

Josef

Ian Langmore

unread,

Sep 6, 2012, 6:42:13 PM9/6/12

to pystat...@googlegroups.com

We need nans in bse since some the variables that get zero'd are not asymptotically Gaussian. I'll have to think more about the others. May get a chance tonight.

Ian Langmore

unread,

Sep 7, 2012, 10:51:15 AM9/7/12

to pystat...@googlegroups.com

I think I know what to do. I typed up some notes that can be found here. The upshot is that the non-zero params are approximately Gaussian with covariance given by the inverse restricted Hessian. By "restricted Hessian" I mean the hessian computed using only the non-zero params. The bse for non-zero params is the diagonal of this matrix. I'll think on this for a day and let you take a look. If all seems well then I think that's the direction to go (I'll have to make code changes Monday).

-Ian

josef...@gmail.com

unread,

Sep 7, 2012, 11:09:53 AM9/7/12

to pystat...@googlegroups.com

Sounds reasonable to me.

Essentially what this means is that we have to overwrite normalized
covparams or use the nonzero model for the statistical results.
For Logit we could just use a new instance with the reduced parameter
space (nonzero exog).
For MNLogit it's a bit more difficult: cov_params is defined for the
flattened params, and nonzero parameters will need to select different
exogs for each endog/class.

a vague question related to a point above since I didn't look at this part yet:

Is there a clear indication which parameters are forced to zero
because of the penalization?
Above you mentioned trim_params=True trim_params=False
What if the unpenalized parameter estimate would be 1e-6? Does it get
identified as zeroed parameter?

I'm mainly curious about the behavior at the edge cases. I doubt it is
relevant in practice.

Thanks,

Josef

>
> -Ian
>

Ian Langmore

unread,

Sep 7, 2012, 11:29:06 AM9/7/12

to pystat...@googlegroups.com

In all cases it seems we need a variable to keep track of the params that got zero'd out. This can be used to build the appropriate Hessian, cov_params, bse, and display results. I'll do this.

We also need to be able to correctly identify the zero'd params, since, as you mention, the params could just have a very small estimate (suppose someone is not using standardized covariates. One way to tell would be to look at the gradient. If a param truly has a small, non-zero estimate, then the derivative in that direction should have magnitude equal to alpha. I tested this once and found it to be the case (it is a theorem...). I'll try adding this in as an option: trim_params='auto'.

-Ian

Ian Langmore

unread,

Sep 7, 2012, 12:29:11 PM9/7/12

to pystat...@googlegroups.com

My last two commits added an option: If trim_tol == 'auto' (now the default), then parameters are trimmed if the gradient has magnitude less than alpha - 0.01. This introduces the magic number 0.01, which, I'm not able to fully justify. Nonetheless, if you run the optimization at a reasonable accuracy, then it would be quite pathological if the derivative ended up this close to alpha...do you agree?

I also return the list trimmed. trimmed[i] = True if the ith parameter was trimmed.

-Ian

josef...@gmail.com

unread,

Sep 7, 2012, 1:33:46 PM9/7/12

to pystat...@googlegroups.com

Sorry, I don't have amuch intuition for this without trying to
understand it better.

you are looking at the gradient of the penalized function (including
the penalization term)?
If the gradient of the loglikelihood is in the interval [-alpha,
alpha], then the reported parameter is zero.
If the penalized function gradient is alpha, then the gradient of the
loglike is zero.

In discrete_model we have analytical gradients, when we maximize with
a gradient method then the gradient should be smaller than 1e-4 or so
in absolute value.
So we might be able to reduce the threshold difference, to alpha - 0.001. ?

(I check the changes later today.)

Josef

Ian Langmore

unread,

Sep 7, 2012, 1:38:02 PM9/7/12

to pystat...@googlegroups.com

My latest commit now uses trimmed to compute the covariance of the nonzero params (cov_nz_params).

This method replaces Hinv, which in turn replaces cov_params, with a smaller matrix. Hope that's ok.

I still need to add this for cvxopt and for MNLogit.

-Ian

Ian Langmore

unread,

Sep 7, 2012, 1:50:00 PM9/7/12

to pystat...@googlegroups.com

I am looking at the gradient of the unpenalized term. If the unpenalized gradient is in (-alpha, alpha), then the parameter is zero. If the unpenalized gradient has magnitude alpha, then the parameter will be nonzero, and the penalized gradient will be zero.

In my write up, I used "L" for the full term to be optimized (reminds me of Lagrangian), and f(x) for the negative log likelihood...probably confusing.

The unpenalized gradient is indeed accurate (since you have it analytically), but there is no guarantee that the solution point is exactly optimal. For that reason, the unpenalized gradient could be close but not exactly at alpha. In fact, with magic_tol = 0.01, and acc = 1e-4 (solver accuracy in slsqp), you can end up with the unpenalized gradient larger than alpha by 0.15! For this reason, in the commit I just pushed, I increased the magic_tol to 0.03. Alternatively, we could increase the default solver accuracy.

Time for lunch!

josef...@gmail.com

unread,

Sep 8, 2012, 5:19:39 PM9/8/12

to pystat...@googlegroups.com

I will add some comments later to the PR directly.

Just asking for an opinion about a general idea.

In the mnlogit example in short_demo.py l1 with slsqp doesn't converge
to an optimum, the score at the converged point is still large even
though "Optimization terminated successfully".

cvxopt manages to find the optimum (many abs score parameters around 10)

if I use the unpenalized mnlogit results as starting parameters to
slsqp, then it also converges to the optimum. difference in parameters
to cvxopt in the range of 1e-5 or 1e-6

so, starting values look important to get l1 with slsqp to work.

Here's the question (well, in 3 parts):

If we replace L1 penalization by a smoothed L1 penalization, then we
are in the standard smooth unconstrained optimization case. Do we get
better convergence behavior in this case?
Is it cheap enough to do this before a real L1 optimization?
Are there other approximate models that we can estimate first, to get
good starting values for the L1 penalized optimization?

(aside: cvxopt doesn't have a binary for Windows for py 2.6 for the
latest release, and doesn't compile with easy_install.)

Josef

Ian Langmore

unread,

Sep 8, 2012, 6:51:03 PM9/8/12

to pystat...@googlegroups.com

Yes I agree that slsqp is causing problems. I've been using short_demo with various values of alpha, then printing the results of the slsqp or cvxopt simulation (with trim_tol = 1e-6) with:

In [57]: s = mlogit_mod.score(mlogit_l1_res.params.ravel('F')); p = mlogit_l1_res.params.ravel('F'); a = alpha.ravel('F')

In [58]: for i in xrange(len(a)):

print a[i], s[i], p[i]

You can see that the cvxopt simulation does what it should: Whenever abs(s[i]) is far from a[i], p[i] is zero'd (except when a[i] = 0, in which case there is no trimming for that coefficient). For slsqp we get values of score at e.g. 21 (which is not allowed by theory), or we get score at 19 without the parameter being zero'd (it is however getting close to zero...just never made it). I should note that both of them stop working when we use very high alpha (but not high enough to .

I can't find the documentation on slsqp. The scipy page cites an article, but I cannot find it anywhere. In any case, the slsqp wrapper only has "acc", which is some sort of accuracy adjustment...

To try and answer your questions: If we use some initial smooth approximation, we may speed convergence, but I don't think it will cause us to converge any better (since the last iterations presumably have to be done with the true l1 penalty). The true l1 penalty is needed to trim parameters. In theory, you could use \ell^p with 0<p\leq 1, but for p<1 the problem is not convex (and even less smooth). If p>1, then I'm pretty sure we won't get trimming.

Note that my implementation of l1 regularization converts the non-smooth unconstrained problem to a constrained smooth problem. There are other approaches. sklearn uses a coordinate descent method. This is pretty simple, and could be done efficiently in Cython. However...I'm not too confident on any solver that I make working without a hitch...

-Ian

Ian Langmore

unread,

Sep 8, 2012, 7:03:29 PM9/8/12

to pystat...@googlegroups.com

I almost forgot...we do have Johnathan's solver. The difficulty with cvxopt is that it isn't available for everyone...is Johnathan's solver any more portable?

-Ian

josef...@gmail.com

unread,

Sep 8, 2012, 7:47:48 PM9/8/12

to pystat...@googlegroups.com

In the example slsqp still converged to the correct solution (same as
cvxopt) if the starting values are good enough. So I was still
thinking in terms of letting l1 do the final convergence, and zero the
appropriate values.

If we would get a smoothly penalized optimization close enough, then
slsqp might be able to finish the convergence to trimmed l1. The
smooth penalization wouldn't give us any zeros, but it might get us
into the neighborhood of zero parameters.

My impression from the example was that l1 with slsqp might get stuck
when it has too many zeros at a given point, and doesn't get out of
the zeros situation anymore.

something similar:
In a smooth problem with not very good starting values, I was using
fmin (Nelder-Mead) to get close to the right neighborhood. But
Nelder-Mead stopped with some large values in the score, but it was
close enough so that fmin_ncg didn't break (with non-invertible
hessian or other numerical problems) and finished the convergence
until score was 1e-6, or something like that.

I saw in some paper where they defined a smooth "abs" function, with a
similar smooth optimization argument, but I have no idea which paper
this was and don't remember the topic.

Josef

Ian Langmore

unread,

Sep 8, 2012, 8:18:11 PM9/8/12

to pystat...@googlegroups.com

I found the "Huber function" that can act as a smooth approximation. The version on wikipedia isn't quite what we want. Instead we should use:

x^2 / (2 a), for |x| < a

|x| - a/2, for |x| > a

We could also try fmin, which, as you know, doesn't require any derivatives. This means we could directly solve the unconstrained non-smooth problem. The scipy.optimize writeup of this method isn't very encouraging however..

-Ian

josef...@gmail.com

unread,

Sep 8, 2012, 8:18:46 PM9/8/12

to pystat...@googlegroups.com

just a quick google search, haven't read it

http://pages.cs.wisc.edu/~gfung/GeneralL1/

I think I read something like the supplemental
http://pages.cs.wisc.edu/~gfung/GeneralL1/L1_approx_bounds.pdf before

Josef

josef...@gmail.com

unread,

Sep 8, 2012, 8:29:49 PM9/8/12

to pystat...@googlegroups.com

As far as I have seen it's all pure python.
So I wouldn't expect any portability issues, except some minor things
with version compatibilities across python and numpy versions.

I also think we could distribute it with statsmodels, if it isn't
easy_installable.

And it looks like it will increase our optimization possibilities a lot.

However, it's a large amount of code, where we don't have much idea
about how much support it requires. So it's more a long term
possibility and not for 0.5.

In the short run, I would prefer to get the scipy optimizers, esp.
slsqp, to work.

Josef

>
> -Ian

Ian Langmore

unread,

Sep 8, 2012, 8:48:11 PM9/8/12

to pystat...@googlegroups.com

Their SmoothL1 approximation gives another nice smooth approximation. We can try that. Their projection method is complicated, and we would have to import their code to do that. I agree that importing a large piece of code is trouble. The main hesitation I would have (other than wanting to avoid the work) is, "who maintains it"?

The method I used should, in theory, work. It was recommended by Boyd (optimization superstar), and actually implemented (for 1-D logit) in an "official" cvxopt example. I think there are some issues with slsqp that are preventing this from working.

So I think the following two strategies seem good: 1) Try scipy.optimize.fmin with the non-smooth unconstrained problem. 2) Try a smooth approximation with the other scipy optimizers.

-Ian

josef...@gmail.com

unread,

Sep 8, 2012, 9:17:00 PM9/8/12

to pystat...@googlegroups.com

good, and

try random restarts:

mlogit_l1_res4 = mlogit_mod.fit(start_params=0.01*np.ones(36), method='l1', alpha=alpha, trim_params=True, maxiter=5000)

for n_iter in range(5):

start_params = mlogit_l1_res4.params.copy()

start_params[start_params==0] = 0.1 * np.random.randn((start_params==0).sum())

mlogit_l1_res4 = mlogit_mod.fit(start_params=start_params, method='l1', alpha=alpha, trim_params=True, maxiter=5000)

score = mlogit_l1_res.model.score(mlogit_l1_res4.params)

if np.max(np.abs(score)) < 10.1:

break

print n_iter

print score

1

[ 10.0092709 9.45679474 -10.0146224 -4.61854889 0.86472194

0.00022045 -9.9997222 10.00052563 -9.99873256 10.00445343

10.00161333 0.00049568 -9.99719891 10.00137699 -9.99661082

-9.68574769 10.0018798 0.00003591 -10.00173877 9.9987974

-10.00413148 9.9994498 9.9987431 -0.00004489 -9.9993287

9.99975224 -9.98471059 10.00421636 10.00044245 0.00014181

-9.99884763 9.9995028 -9.99322419 9.99981952 10.00143474

0.00009332]

>>> set(np.nonzero(mlogit_l1_res4.trimmed)[0]) - set(np.nonzero(mlogit_l1_res_.trimmed)[0])
set([])
>>> set(np.nonzero(mlogit_l1_res4.params==0)[0]).symmetric_difference(set(np.nonzero(mlogit_l1_res_.params==0)[0]))
set([])

the default starting values for MultinomialModel is all zeros, which might not be good if the algorithm gets stuck in zeros.

I played mostly in the interpreter, but it looks like after a few random restarts we get to the correct score.

The non-zero values were already before pretty close to the cvxopt version but the score was still huge. So, it might also be that convergence for the score might not be so easy in the slsqp version, convergence of parameters might be closer, except it didn't give all the same zeros.

Is it possible to make changes in the algorithm to avoid getting stuck with extra zeros?

I never looked closer at fmin_slsqp to see if there are any optimization options that can be "tuned".

Josef

>

> -Ian

Ian Langmore

unread,

Sep 8, 2012, 9:36:16 PM9/8/12

to pystat...@googlegroups.com

The random restart test gives some clues. Might scale poorly with lots of params--so I'm not sure if it should be used in production (were you thinking that it should be?). We could perturb only the "troublesome_params" (the ones that have a huge score). Another issue comes up: A large score implies something is wrong...but what if some small scores are also bad? I agree however that zeros seem to be a problem with slsqp.

slsqp only has one tunable parameter, 'acc'. This is "requested accuracy", so who knows what that means.

-Ian

Ian Langmore

unread,

Sep 8, 2012, 10:23:07 PM9/8/12

to pystat...@googlegroups.com

I just pushed a version that uses scipy.optimize.fmin. It seems to have worse trouble than slsqp! It returns ridiculous results even when alpha is moderate.

josef...@gmail.com

unread,

Sep 8, 2012, 10:40:54 PM9/8/12

to pystat...@googlegroups.com

in the mnlogit example all scores look initially pretty bad

array([   73.88537466,    84.05209991, 1236.11367745,   104.79993905,
         362.78022192,    24.14683981, -306.23082178, -457.68987382,
       -6228.49780975, -542.80181628, -2008.69126068, -121.6443737 ,
         -34.26003952,   -30.65480007, -499.90617969,   -50.37285101,
        -141.63447896,    -9.46645929,     5.89881439,    45.60243579,
         534.60555803,    47.49220528,   143.66060989,     8.5651136 ,
         227.26643836,   538.25194267, 5027.62464338,   511.12903298,
        1964.92510846,   107.2506593 ,   -90.25778028, -224.21906149,
       -1965.03951547, -183.30451793, -737.53135338,   -40.73155129])

I'm perturbing the zeros, and then only in a small neighborhood, just to get out of the sink.

So it might not take long to converge again.

How large is your problem in production?

I was briefly going through the comparison of optimizers in the Mark Schmidt, Glenn Fung, Romer Rosales. Optimization Methods for L1-Regularization. UBC Technical Report TR-2009-19, 2009

One point they make is whether an optimizer works with the sparse solution or has to work with an interior solution, all parameters nonzero.

To me it looks like the slsqp version only converges to the correct parameters if we keep it in the interior long enough.

This might limit it's usefulness (in the current version) if it's a really large scale sparse problem.

(In general I'm starting to find random (re)starts more attractive for "messy" optimization problems, especially with lots of local optima.)

What they don't look at in their paper is switching algorithms. They mention that smooth version have slow convergence behavior, but it sounds like that is only in the neighborhood of the zeroed optimum. They might be nice in the medium large neighborhood, but shouldn't be used for the local convergence close to the optimum. (Just guessing.)

Josef

Ian Langmore

unread,

Sep 8, 2012, 11:45:44 PM9/8/12

to pystat...@googlegroups.com

My production problem is three targets and 20 variables. I'll take a look at the paper.

-Ian

josef...@gmail.com

unread,

Sep 9, 2012, 12:22:34 AM9/9/12

to pystat...@googlegroups.com

I looked at the fortran code of slsqp, there is no explanation and I didn't see what the two variables are that are checked with acc

however, setting acc=1e-10 and I get immediate convergence of score, this looks pretty good in the example.

having nonzero starting values leeds to a bit faster convergence

>>> mlogit_l1_res5 = mlogit_mod.fit(start_params=0.01*np.ones(36), method='l1', alpha=alpha, trim_params=True, maxiter=5000, acc=1e-10)
Callback will be ignored with l1
Hessian not used with l1
Positive directional derivative for linesearch    (Exit mode 8)
            Current function value: 1522.19535571
            Iterations: 86
            Function evaluations: 355
            Gradient evaluations: 82

>>> mlogit_l1_res5 = mlogit_mod.fit(start_params=0.0*np.ones(36), method='l1', alpha=alpha, trim_params=True, maxiter=5000, acc=1e-10)
Callback will be ignored with l1
Hessian not used with l1
Positive directional derivative for linesearch    (Exit mode 8)
            Current function value: 1522.1953557
            Iterations: 110
            Function evaluations: 434
            Gradient evaluations: 106
>>> mlogit_l1_res_cvx.llf - mlogit_l1_res5.llf
-1.6862638858583523e-06

I look at fmin tomorrow, I have a bit of a mess now because I was trying out different things.

In the paper they also look at bfgs.

fmin, and optimizers that don't check the gradient, might not be good enough in the neighborhood of the solution.

cvxopt took at some point a very long time for convergence. Did this ever happen to you?

it's possible I messed up something.

-Ian

My production problem is three targets and 20 variables. I'll take a look at the paper.

Doesn't sound so bad, 40 parameters with MNLogit (?)

Josef

-Ian

josef...@gmail.com

unread,

Sep 9, 2012, 12:33:40 AM9/9/12

to pystat...@googlegroups.com

or try to set xtol to a much smaller value (and maybe ftol) to force that fmin works harder.

Josef

Ian Langmore

unread,

Sep 9, 2012, 1:00:07 AM9/9/12

to pystat...@googlegroups.com

I had been using acc=1e-15, but I was using the default maxiter (100). When I switched to 5000 (as you were using), I saw that the results got nice. In fact, the results are still nice even when I use the default acc (keeping maxiter=5000). I think I should increase the default maxiter...what do you think?

In any case, this is good news! I was getting worried that we were going to have to learn a lot more about optimization than I have time for right now.

Yes, 40 parameters isn't too bad. We do want less though---it helps to keep the model easier to interpret. Regularization should also help with some ill-posedness (some variables could take on huge values without affecting the answer much).

-Ian

Ian Langmore

unread,

Sep 9, 2012, 1:11:40 AM9/9/12

to pystat...@googlegroups.com

CVXOPT is slower. It also often reaches the iteration limit (75 I think?). I'm not surprised that you found it slow.

-Ian

Ralf Gommers

unread,

Sep 9, 2012, 3:11:25 AM9/9/12

to pystat...@googlegroups.com

http://abs-5.me.washington.edu/dynOpt/p262-kraft.pdf

Ralf

josef...@gmail.com

unread,

Sep 9, 2012, 7:28:43 AM9/9/12

to pystat...@googlegroups.com

Yes, I think the default maxiter should be much large, I increased it all the time to avoid non-convergence

"Iteration limit exceeded (Exit mode 9)"

I think, the main purpose of it is to avoid endless loop, or to be used if you only want a few steps.

However, I got the non-converging gradient also at the large maxiter with the default acc.

If we don't increase default acc, then at least it should be added to the docstring as main option if the optimization didn't converge.

In any case, this is good news! I was getting worried that we were going to have to learn a lot more about optimization than I have time for right now.

In the paper, QP, their version of slsqp, was an algorithm that converged in a few iterations (although each iteration is more expensive).

So, I guess it should be possible to get a reliable solution with slsqp.

note, acc can also be negative, then

AN EXACT LINESEARCH IS PERFORMED, OTHERWISE AN ARMIJO-TYPE LINESEARCH IS USED.

but I haven't tried it yet.

Yes, 40 parameters isn't too bad. We do want less though---it helps to keep the model easier to interpret. Regularization should also help with some ill-posedness (some variables could take on huge values without affecting the answer much).

40 is not too much, in order to work with an interior path/solution until close to the end - if necessary.

Josef

-Ian

josef...@gmail.com

unread,

Sep 9, 2012, 8:32:32 AM9/9/12

to pystat...@googlegroups.com

Thank you for the link.

storing a few more for future reference (not that I want to get more
into optimization than I have to)

slides with several smoothing approaches
http://www.ee.ucla.edu/~vandenbe/236C/lectures/smoothing.pdf
(we have Huber already in robust,
we need a collection of function with derivatives and second
derivatives, currently genmod links and robust norms.)

background lecture notes (Optimization Methods for Large-Scale Systems):
http://www.ee.ucla.edu/~vandenbe/ee236c.html Vandenberghe
http://www.stanford.edu/class/ee364b/lectures.html Boyd

Josef

>
> Ralf
>
>

Ian Langmore

unread,

Sep 9, 2012, 8:50:36 AM9/9/12

to pystat...@googlegroups.com

Ralf: Thanks for that reference!

Josef: I'm flying today...I'll implement the changes we've been discussing and then push when I get home around 5:30. Then we can discuss next steps: Documentation (currently l1 is sort of hidden), more tests, etc...

-Ian

Ian Langmore

unread,

Sep 10, 2012, 12:36:24 AM9/10/12

to pystat...@googlegroups.com

I was not able to change the default maxiter for slsqp. The calling function, statsmodels/base/model.py:LikelihoodModel.fit() passes in maxiter. I need to pass this value down to _fit_slsqp_cp(). The only way I see to use a new default value would be to change the default in fit(). But that would effect many other models. Is there another way? Can we see if fit() is just passing in the default (in which case we can set a new default), or if fit() is passing in a user defined maxiter (in which case we use the user defined value)? Would we even want to do this? It seems confusing to have one standard maxiter for most fit methods, and then a new one for l1.

-Ian

josef...@gmail.com

unread,

Sep 10, 2012, 6:43:36 AM9/10/12

to pystat...@googlegroups.com

Ii think we *need* to change the default maxiter for l1 to protect the
user from getting non-convergence in many cases in the default
setting.

I don't have a problem with different maxiter for different
optimizers. I think the defaults work in many cases only if they are
used with the default optimizer or a similar one.
For example, when trying out method="nm", I always have to increase
maxfun and maxiter, but nm is not recommended for discrete models.

A thought that has been creeping up on me (but maybe not in proper english):

We might be better off creating a new ``fit_penalized`` method. If we
need more or different options in the signature and given that we
return a different results instance and it's a different estimation
problem, a separate method should be cleaner.
Even, if internally we may at some point just call a super(...).fit as
int the current implementation.

Just a thought for now, I haven't looked at the details yet. What do you think?

Josef
no caffeine yet

>
> -Ian

Ian Langmore

unread,

Sep 10, 2012, 9:59:06 AM9/10/12

to pystat...@googlegroups.com

I think it's a good idea. For one, I don't see any other way we can use different default values. Second, it would make the regularization methods easier for a user to recognize.

Are you thinking of making it a new LikelihoodModel method?

Do you want to take a first hack at it?

-Ian

josef...@gmail.com

unread,

Sep 10, 2012, 10:21:11 AM9/10/12

to pystat...@googlegroups.com

good, let's try it then.

>
> Are you thinking of making it a new LikelihoodModel method?

I would like to push it up as high as possible, but haven't looked yet.
The important part is to see how a new fit method will work for Logit
and MNLogit, which can still reuse the current super().fit
implementation.
Then, we can look at how much of the generic code we can push up.

One problem adding it to LikelihoodModel will be that it is not (yet)
implemented for other subclasses, linear_model, time series.
maybe DiscreteModel is the right location to get it to work for the
current cases.

>
> Do you want to take a first hack at it?

I started to work on bugfixes and test writing, and will not have much
time today. So, if you want, you can get started.

Josef

>
> -Ian

Ian Langmore

unread,

Sep 10, 2012, 10:40:59 AM9/10/12

to pystat...@googlegroups.com

Cool. I've started.

Ian Langmore

unread,

Sep 10, 2012, 11:16:31 AM9/10/12

to pystat...@googlegroups.com

When I'm in super(...).fit(), is there a way to tell who called

josef...@gmail.com

unread,

Sep 10, 2012, 12:10:29 PM9/10/12

to pystat...@googlegroups.com

we can check the class of the "self" model, but I don't think there is
an easy way to check which method called (I'm not a fan of inspect and
similar.).
I don't like to check identities in general, it might be easier to
just let the subclass or caller set a flag (private attribute). But
that might require some care with not getting a "state" spill-over for
repeated calls to fit.

What do you need it for?

Josef

Ian Langmore

unread,

Sep 10, 2012, 12:25:10 PM9/10/12

to pystat...@googlegroups.com

To call certain functions based on if the caller was an "L1 type" fit function. I found a better way though so I no longer need it.

Ian Langmore

unread,

Sep 10, 2012, 3:16:52 PM9/10/12

to pystat...@googlegroups.com

The first attempt is done. See the last two commits. cvxopt and documentation are not up to date, but everything else is (both demos and the tests).

There is one major question: I'm not so sure I handled the calling of a super classes fit function correctly in MultinomialModel.fit_regularized. What I did works, but now fit_regularized and fit call superclass methods at different levels (DiscreteModel and LikelihoodModel respectively).

-Ian

Ian Langmore

unread,

Sep 10, 2012, 3:21:04 PM9/10/12

to pystat...@googlegroups.com

The "theme" is that the instance class uses fit_regularized as a quick wrapper for DiscreteModel.fit_regularized. This in turn wraps up parameters and calls LikelihoodModel.fit. Some of the parameters passed to LikelihoodModel include "fit functions" and an "inverse Hessian function".

josef...@gmail.com

unread,

Sep 10, 2012, 4:05:15 PM9/10/12

to pystat...@googlegroups.com

Just a brief browsing of the change (and I will be offline again soon).

looks good overall
with the long list of new option arguments, I think having a separate
fit_regularized is much better.

The registering of a new optimizer is an interesting idea, I need to
look more closely what further we can make with it.

In terms of module locations, I would prefer if we can keep the l1_xxx
files under statsmodels/base, so they are together with the other
optimizers, and can later be reused outside of discrete.

It's a bit unfortunate in this version, that we lost the generic l1 in
LikelihoodModel, for immediate reuse in other MLE models. I didn't
look yet at how much of the DiscreteModel.fit_regularized is generic
and how much is "discrete".
We can also think about the proper location of this later, since it
works for discrete, and there are other design problems because
everything inherits from LikelihoodModel.
But then I would like to find a "cheap" way of adding fit_regularized
to GenericLikelihoodModel or similar pure MLE models. Mix-in?

Thanks, I think we are getting close.

Josef

Ian Langmore

unread,

Sep 10, 2012, 4:18:16 PM9/10/12

to pystat...@googlegroups.com

I moved the l1 files to base/

I'm not sure how the mix in's work. I do agree that this could be used elsewhere. I don't think there's anything specific to discrete in the current implementation of fit_regularized. What's the obvious continuous candidate to try and make this work with?

-Ian

Ian Langmore

unread,

Sep 10, 2012, 4:54:02 PM9/10/12

to pystat...@googlegroups.com

I took a look at the GLM class. The fit method doesn't seem to be connected to the base.LikelihoodModel.fit. If so it's probably best to leave it out for now. It's hard for me to justify getting paid to extend the GLM class when we don't need it at the moment for my job.

Nonetheless, I don't see any difficulty in putting fit_regularized inside base.LikelihoodModel. What models do use base.LikelhoodModel.fit?

-Ian

josef...@gmail.com

unread,

Sep 10, 2012, 7:54:37 PM9/10/12

to pystat...@googlegroups.com

glm and robust use IRLS and linear model uses LS

We don't have any production ready other MLE models outside discrete
and tsa AR, ARIMA.

So, we can postpone this move until we want to use it for other models.
(regularized Tobit ? never heard of that but I don't see why not)

>
> Nonetheless, I don't see any difficulty in putting fit_regularized inside
> base.LikelihoodModel. What models do use base.LikelhoodModel.fit?

currently besides discrete_model also ARIMA model has it as an option
to use it,
and subclasses of GenericLikelihoodModel (which are more fast
implementation models, than fully worked out models).

Tobit (pull request by Skipper) uses it. I don't know if any other
incoming models are using it.

There is also a github issue to make the optimizers more easily
available for other than pure MLE, for example empirical likelihood
pull requests use scipy.optimize directly.

base.LikelhoodModel.fit get's overwritten by subclasses in the other
models, linear, GLM and RLM, but they wouldn't overwrite
fit_regularized (without implementing that also).

Josef

>
> -Ian

Ian Langmore

unread,

Sep 11, 2012, 4:58:58 AM9/11/12

to pystat...@googlegroups.com

Yeah it's probably best to postpone the move. Potentially every model that has a score method should be able to use l1_slsqp, but I can't really justify spending the time getting things together. We should just get things working nicely here and then maybe it can be someone's project down the line.

What's next?

More demos: I can put together a demo that sweeps alpha and shows parameters going to zero one by one

Tests: A few more tests

Docs: Do last

-Ian

josef...@gmail.com

unread,

Sep 11, 2012, 7:33:47 AM9/11/12

to pystat...@googlegroups.com

tests for some of the options, and tests for the results, does t_test,
f_test, ... work. ??

I will go over it later this morning.

> Docs: Do last

Josef

>
> -Ian

josef...@gmail.com

unread,

Sep 11, 2012, 12:21:03 PM9/11/12

to pystat...@googlegroups.com

>
> tests for some of the options, and tests for the results, does t_test,
> f_test, ... work. ??

git service outage, so here instead

cov_params and params have no different shape

>>> logit_l1_res.params
array([ 0. , 0. , -0.03688929, 0.65497501])
>>> logit_l1_res.bse
array([ nan, nan, 0.02318671, 0.74755651])
>>> logit_l1_res.cov_params()
array([[ 0.00053762, -0.01204074],
[-0.01204074, 0.55884073]])

>>> logit_l1_res.t_test(np.eye(2))
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
logit_l1_res.t_test(np.eye(2))
File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new\statsmodels\statsmodels\base\model.py",
line 1078, in t_test
raise ValueError('r_matrix and params are not aligned')
ValueError: r_matrix and params are not aligned

Instead of f and t tests, we should actually use Wald test with
chisquare and normal distribution, because we only have asymptotic
results in discrete. (NotYetImplementedError)

Josef

Ian Langmore

unread,

Sep 11, 2012, 12:35:11 PM9/11/12

to pystat...@googlegroups.com

I added a regularization path plot to short_demo.py. I also added an option QC_verbose that makes the QC check print out a longer error report. short_demo.py gives warnings with the default options. We can tweak acc and maxiter and get the warnings to occur with only 3 out of 100 alpha values. To make the warnings disappear, I increased QC_tol.

-Ian

I added to short_demo.py some lines plotting a regularization path. For large alpha we do get warnings about the QC check. Increasing maxiter and acc didn't stop it. It's only for the last 3

josef...@gmail.com

unread,

Sep 11, 2012, 1:02:20 PM9/11/12

to pystat...@googlegroups.com

On Tue, Sep 11, 2012 at 12:35 PM, Ian Langmore <ianla...@gmail.com> wrote:
> I added a regularization path plot to short_demo.py. I also added an option
> QC_verbose that makes the QC check print out a longer error report.
> short_demo.py gives warnings with the default options. We can tweak acc and
> maxiter and get the warnings to occur with only 3 out of 100 alpha values.
> To make the warnings disappear, I increased QC_tol.

what does QC stand for? I didn't read it yet.

the short_demo looks good, also the sklearn_compare
I was surprised to see the warnings, but then I saw the comments. I
think it's fine if the options are tuned for some intermediate
difficulty, they don't have to cover all cases since the user can
always change them if the default doesn't work.

Josef

Ian Langmore

unread,

Sep 11, 2012, 1:28:52 PM9/11/12

to pystat...@googlegroups.com

QC stands for "Quality Control." It's typical to talk about QC'ing something. People use QA (Quality Assurance) to mean the same thing.

Ian Langmore

unread,

Sep 11, 2012, 1:36:01 PM9/11/12

to pystat...@googlegroups.com

I changed things around so that now we compute the full cov_params matrix, but with nan inserted into the places that are not asymptotically normal. bse and confidence interval tests pass. t_test() runs, but I'm not sure if the output is right. I'll check after lunch.