Message from discussion
Regularized Likelihood models?
Received: by 10.52.178.166 with SMTP id cz6mr10913698vdc.1.1343532313914;
Sat, 28 Jul 2012 20:25:13 -0700 (PDT)
X-BeenThere: pystatsmodels@googlegroups.com
Received: by 10.221.0.84 with SMTP id nl20ls1953192vcb.0.gmail; Sat, 28 Jul
2012 20:25:12 -0700 (PDT)
Received: by 10.52.23.65 with SMTP id k1mr725355vdf.7.1343532312945;
Sat, 28 Jul 2012 20:25:12 -0700 (PDT)
Date: Sat, 28 Jul 2012 20:25:12 -0700 (PDT)
From: Ian Langmore <ianlangm...@gmail.com>
To: pystatsmodels@googlegroups.com
Message-Id: <420115db-84d4-4d7c-9a46-b2a6ae7316b8@googlegroups.com>
In-Reply-To: <CAMMTP+ASoy=naR8XWrGq7Cy6j6auvqFxFtLgjemwxxJd-Q15Lw@mail.gmail.com>
References: <47ad1a80-8419-4e02-b569-636bcc9ba8ca@googlegroups.com>
<CAMMTP+CjJ_wai0YhhmJV3X4sD5FpkdtPkA+Z1TdNFzAJ2VsDzg@mail.gmail.com>
<d28e15e2-fe82-4297-bdf8-b384f8e6762b@googlegroups.com>
<CAMMTP+ASoy=naR8XWrGq7Cy6j6auvqFxFtLgjemwxxJd-Q15Lw@mail.gmail.com>
Subject: Re: [pystatsmodels] Regularized Likelihood models?
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_605_16317779.1343532312470"
------=_Part_605_16317779.1343532312470
Content-Type: multipart/alternative;
boundary="----=_Part_606_12539044.1343532312470"
------=_Part_606_12539044.1343532312470
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
I'm not sure how we could get credibility intervals without MCMC...but I
wouldn't be surprised if there was a way. I'll retract my point about
"people not believing them." In my opinion though the L1 prior is chosen
mostly for mathematical convenience and not because we believe that it's
the marginal distribution of the parameter. Others often have a different
opinion (I'm a mathematician, not a statistician anyways...).
As for the cov_params...here's my reasoning (which I haven't seen in the
literature, but should be out there somewhere): The posterior looks like
\exp\{-\ln Likelihood(\beta)\} \exp\{-\alpha \|\beta\|_1\}. If all
\beta\neq0, the derivative of the negative log likelihood cancels the
derivative of \|\beta\|_1 away from zero (it's differentiable away from
zero), and we are justified in approximating the posterior as
\exp\{-\beta^T H \beta/2\}, where H is the Hessian of -\ln Likelihood. In
that case, the (approximate) covariance of the parameters is H^-1, and we
don't need to change anything. If some \beta_j = 0, then, even away from
zero, the derivatives do not cancel, and we're unjustified in using the
"Hessian approximation." In that case we need to consider the first and
second derivatives of the "negative log posterior", separately for both
positive and negative parameter values. We could then integrate this
against \beta\beta^T to get the covariance. In any case, the approximation
isn't Gaussian anymore so cov_params doesn't tell the whole story. Does
this make sense? I could tex this up and post it.
Other important additions would be the ability to compute the "evidence"
(MacKay's evidence framework), as this can be used to let the data choose
the regularization parameter. To approximate the evidence we would have to
take both the Hessian and the first gradient into consideration (in the
case that some parameters went to zero).
To be fully Bayesian, we would want to pick a hyper-prior for \alpha and
then choose \beta and \alpha together. This would require some work.
We could just start un-ambitious and treat the prior as a regularization
trick with no Bayesian interpretation.
-Ian
On Saturday, 28 July 2012 19:06:43 UTC-4, josefpktd wrote:
>
> On Sat, Jul 28, 2012 at 6:30 PM, Ian Langmore <ianlangm...@gmail.com>
> wrote:
> > Hello Josef.
> >
> > Glad to hear there is interest. To answer your questions:
> >
> > I added a some explanation of how the L1 penalization is done. See the
> > docstring for l1.py here.
> >
> > I'm not sure how cov_params are used, so I don't know how they should be
> > modified. Are they used for any LikelihoodModel?
>
> cov_params is the variance of our parameter estimates in large sample
> (asymptotical Normal distribution)
> We have it until now for all models, since all of them are smooth.
>
> >
> > Some Results attributes no longer make sense, such at the gradient of
> the
> > objective, gopt, or the hessian hopt. This happens since the new
> objective
> > is non-differentiable. AIC/BIC should be defined with the new df_model
> > (that only includes the nonzero parameters). The t and f tests are no
> > longer applicable. The confidence intervals could be replaced by
> credible
> > intervals, but I don't think people really believe these (the R package
> > "penalized" does not compute them for this reason). So...what makes
> sense
> > to me is to create a new RegularizedResults that is similar to
> > LikelihoodModelResults, but with less information. What do you think?
>
> I don't know the literature but didn't expect it to be so "bad" in
> terms of statistical results.
>
> Yes, this looks like we need a special Results class that doesn't have
> some of our usual results.
>
> Is L1 fast enough for some bootstrap results, if we want to get other
> statistical results. (If that is even theoretically possible.)
>
> >
> > Ridge regression should be much easier. The penalty is smooth, so all
> we
> > have to do is modify the gradient and hessian and then the standard
> > optimization techniques should apply.
>
> Conceptually it looks straightforward, how to do it in practice and
> what results we get in this case, I also don't know or remember.
>
> Thanks, I will look at your changes (when it's not late night in front
> of a dark pool bar that has the only wireless)
>
> Josef
>
> >
> > -Ian
> >
> >
> >
> > On Friday, 27 July 2012 15:54:26 UTC-4, josefpktd wrote:
> >>
> >> On Fri, Jul 27, 2012 at 2:37 PM, Ian Langmore <ianlangm...@gmail.com>
> >> wrote:
> >> > I made a few small additions to Likelihoodmodel.fit() in order to
> enable
> >> > L1
> >> > regularization. See this fork. Is there any interest in adding
> >> > something
> >> > like this to the main statsmodels? It seems pretty easy to do. The
> >> > only
> >> > decisions to make involve the fact that LikelihoodModelResults is
> geared
> >> > toward maximum likelihood results.
> >>
> >> Yes, I think this would make a good addition.
> >>
> >> I don't know the details how L1 penalized estimation is calculated,
> >> but I was thinking about similar Maximum L2- (or smooth-) penalized
> >> Likelihood estimation.
> >>
> >> The way you define `func`, the penalized objective function outside of
> >> the models makes it very separate (positive) from the model code. It
> >> makes it generally available for all MLE models.
> >>
> >> On the other hand, I don't see where you can adjust the result
> >> instance itself, generically. Do we need to adjust all result classes
> >> to take advantage of this?
> >>
> >> From your current code the only change in the results seems to be the
> >> df_model. Is this all that changes of the statistical results? How are
> >> cov_params and similar defined?
> >>
> >> naive question: does it make sense to return the results with the
> >> exog/variables for non-zero parameters?
> >>
> >> (quick check: looks like we can ignore the penalization for
> >> cov_params, ... and we get asymptotic results where we can ignore the
> >> penalization parts. needs to be verified.)
> >>
> >> aside: I looked for some time at the MLE analog of Ridge regression,
> >> but haven't figured out yet how to attach the penalized likelihood to
> >> the models and don't remember or didn't look at the statistical
> >> details for the result statistics.
> >>
> >> Thanks,
> >>
> >> Josef
>
------=_Part_606_12539044.1343532312470
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
I'm not sure how we could get credibility intervals without MCMC...but I wo=
uldn't be surprised if there was a way. I'll retract my point about "=
people not believing them." In my opinion though the L1 prior is chos=
en mostly for mathematical convenience and not because we believe that it's=
the marginal distribution of the parameter. Others often have a diff=
erent opinion (I'm a mathematician, not a statistician anyways...).<div><br=
></div><div>As for the cov_params...here's my reasoning (which I haven't se=
en in the literature, but should be out there somewhere): The posteri=
or looks like \exp\{-\ln Likelihood(\beta)\} \exp\{-\alpha \|\beta\|_1\}. &=
nbsp;If all \beta\neq0, the derivative of the negative log likelihood cance=
ls the derivative of \|\beta\|_1 away from zero (it's differentiable away f=
rom zero), and we are justified in approximating the posterior as \exp\{-\b=
eta^T H \beta/2\}, where H is the Hessian of -\ln Likelihood. In that=
case, the (approximate) covariance of the parameters is H^-1, and we don't=
need to change anything. If some \beta_j =3D 0, then, even away from=
zero, the derivatives do not cancel, and we're unjustified in using the "H=
essian approximation." In that case we need to consider the first and=
second derivatives of the "negative log posterior", separately for both po=
sitive and negative parameter values. We could then integrate this ag=
ainst \beta\beta^T to get the covariance. In any case, the approximat=
ion isn't Gaussian anymore so cov_params doesn't tell the whole story. &nbs=
p;Does this make sense? I could tex this up and post it.</div><div><b=
r></div><div>Other important additions would be the ability to compute the =
"evidence" (MacKay's evidence framework), as this can be used to let the da=
ta choose the regularization parameter. To approximate the evidence w=
e would have to take both the Hessian and the first gradient into considera=
tion (in the case that some parameters went to zero).<br></div><div><br></d=
iv><div>To be fully Bayesian, we would want to pick a hyper-prior for \alph=
a and then choose \beta and \alpha together. This would require some =
work.</div><div><br></div><div>We could just start un-ambitious and treat t=
he prior as a regularization trick with no Bayesian interpretation.</div><d=
iv><br></div><div>-Ian<br><br>On Saturday, 28 July 2012 19:06:43 UTC-4, jos=
efpktd wrote:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-l=
eft: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">On Sat, Jul 28, =
2012 at 6:30 PM, Ian Langmore <<a href=3D"mailto:ianlangm...@gmail.com" =
target=3D"_blank">ianlangm...@gmail.com</a>> wrote:
<br>> Hello Josef.
<br>>
<br>> Glad to hear there is interest. To answer your questions:
<br>>
<br>> I added a some explanation of how the L1 penalization is done. &nb=
sp;See the
<br>> docstring for l1.py here.
<br>>
<br>> I'm not sure how cov_params are used, so I don't know how they sho=
uld be
<br>> modified. Are they used for any LikelihoodModel?
<br>
<br>cov_params is the variance of our parameter estimates in large sample
<br>(asymptotical Normal distribution)
<br>We have it until now for all models, since all of them are smooth.
<br>
<br>>
<br>> Some Results attributes no longer make sense, such at the gradient=
of the
<br>> objective, gopt, or the hessian hopt. This happens since the=
new objective
<br>> is non-differentiable. AIC/BIC should be defined with the ne=
w df_model
<br>> (that only includes the nonzero parameters). The t and f tes=
ts are no
<br>> longer applicable. The confidence intervals could be replace=
d by credible
<br>> intervals, but I don't think people really believe these (the R pa=
ckage
<br>> "penalized" does not compute them for this reason). So...wha=
t makes sense
<br>> to me is to create a new RegularizedResults that is similar to
<br>> LikelihoodModelResults, but with less information. What do y=
ou think?
<br>
<br>I don't know the literature but didn't expect it to be so "bad" in
<br>terms of statistical results.
<br>
<br>Yes, this looks like we need a special Results class that doesn't have
<br>some of our usual results.
<br>
<br>Is L1 fast enough for some bootstrap results, if we want to get other
<br>statistical results. (If that is even theoretically possible.)
<br>
<br>>
<br>> Ridge regression should be much easier. The penalty is smoot=
h, so all we
<br>> have to do is modify the gradient and hessian and then the standar=
d
<br>> optimization techniques should apply.
<br>
<br>Conceptually it looks straightforward, how to do it in practice and
<br>what results we get in this case, I also don't know or remember.
<br>
<br>Thanks, I will look at your changes (when it's not late night in front
<br>of a dark pool bar that has the only wireless)
<br>
<br>Josef
<br>
<br>>
<br>> -Ian
<br>>
<br>>
<br>>
<br>> On Friday, 27 July 2012 15:54:26 UTC-4, josefpktd wrote:
<br>>>
<br>>> On Fri, Jul 27, 2012 at 2:37 PM, Ian Langmore <<a href=3D"m=
ailto:ianlangm...@gmail.com" target=3D"_blank">ianlangm...@gmail.com</a>>=
;
<br>>> wrote:
<br>>> > I made a few small additions to Likelihoodmodel.fit() in =
order to enable
<br>>> > L1
<br>>> > regularization. See this fork. Is there any i=
nterest in adding
<br>>> > something
<br>>> > like this to the main statsmodels? It seems pretty =
easy to do. The
<br>>> > only
<br>>> > decisions to make involve the fact that LikelihoodModelRe=
sults is geared
<br>>> > toward maximum likelihood results.
<br>>>
<br>>> Yes, I think this would make a good addition.
<br>>>
<br>>> I don't know the details how L1 penalized estimation is calcul=
ated,
<br>>> but I was thinking about similar Maximum L2- (or smooth-) pena=
lized
<br>>> Likelihood estimation.
<br>>>
<br>>> The way you define `func`, the penalized objective function ou=
tside of
<br>>> the models makes it very separate (positive) from the model co=
de. It
<br>>> makes it generally available for all MLE models.
<br>>>
<br>>> On the other hand, I don't see where you can adjust the result
<br>>> instance itself, generically. Do we need to adjust all result =
classes
<br>>> to take advantage of this?
<br>>>
<br>>> From your current code the only change in the results seems to=
be the
<br>>> df_model. Is this all that changes of the statistical results?=
How are
<br>>> cov_params and similar defined?
<br>>>
<br>>> naive question: does it make sense to return the results with =
the
<br>>> exog/variables for non-zero parameters?
<br>>>
<br>>> (quick check: looks like we can ignore the penalization for
<br>>> cov_params, ... and we get asymptotic results where we can ign=
ore the
<br>>> penalization parts. needs to be verified.)
<br>>>
<br>>> aside: I looked for some time at the MLE analog of Ridge regre=
ssion,
<br>>> but haven't figured out yet how to attach the penalized likeli=
hood to
<br>>> the models and don't remember or didn't look at the statistica=
l
<br>>> details for the result statistics.
<br>>>
<br>>> Thanks,
<br>>>
<br>>> Josef
<br></blockquote></div>
------=_Part_606_12539044.1343532312470--
------=_Part_605_16317779.1343532312470--