robust nonlinear regression with M-Estimators


josef...@gmail.com

Mar 1, 2014, 11:02:15 PM
to pystatsmodels
I'm still trying to figure out what all these robust estimators are about.

Another try (so I have code and don't forget what I read):
M-estimation as a subclass of GenericLikelihoodModel (*) with scipy's optimizers.
That makes it much easier to convert to non-linear regression.

a simple 2-parameter example with nice pictures

in the plots
`robust` is Huber-Dutter with a single optimization for loc and scale,
using the HuberT norm
`robust_bs` is TukeyBiweight iterated with MAD scale, started at `robust`
`ls` is scipy.optimize.leastsq

4 y-outliers shifted by +5


---
the simple model: Michaelis-Menten
---

import numpy as np
from statsmodels.tools.numdiff import approx_fprime


def func_m(params, x):
    # Michaelis-Menten mean function: a * x / (K + x), with K = exp(b) kept positive
    a, b = params
    return a * x / (np.exp(b) + x)


# MEstimatorHD: the Huber-Dutter M-estimator subclass of GenericLikelihoodModel (defined elsewhere)
class MentenNL(MEstimatorHD):

    def predict(self, params, exog=None):
        if exog is None:
            exog = self.exog
        return np.squeeze(func_m(params[:2], exog))

    def predict_jac(self, params, exog=None):
        # numerical derivative of the mean function with respect to the parameters
        if exog is None:
            exog = self.exog
        return approx_fprime(params[:2], self.predict)
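
A hypothetical usage sketch (the data generation, noise level, and the fit call
are only illustrative, not the exact script behind the plots):

np.random.seed(12345)
nobs = 30
x = np.sort(np.random.uniform(0.5, 20, size=nobs))
y = func_m([10, 1], x) + 0.5 * np.random.randn(nobs)   # params (10, 1)
y[-4:] += 5                                            # 4 y-outliers shifted by +5

mod = MentenNL(y, x)   # endog, exog as in GenericLikelihoodModel
res = mod.fit(start_params=np.array([5.0, 0.0, 1.0]))  # (a, b, scale); scale is estimated jointly
print(res.params)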

--
(*) GenericLikelihoodModel is almost a generic M-Estimator class
except for some incorrect names.

Josef
oh no, not another `jac`, what about `score_obs`?
Attachments: robust_non-linear_outliers.png, robust_non-linear_no-outliers.png

Sturla Molden

Mar 2, 2014, 11:21:37 AM
to pystat...@googlegroups.com
The biweight function is the one most commonly used. Actually, today it
might be the only one in use for robust regression.

Typically the residuals are normalized by the factor MAD/0.6745:

import numpy as np

# r: the vector of residuals from the current fit
s = np.median(np.abs(r - np.median(r))) / 0.6745
biweight = lambda x: (np.abs(x) < 1.0) * (1.0 - x**2)**2
weight = biweight(r / s)

We can fit the M-estimator as a non-linear least-squares problem
(scipy.optimize.leastsq), but IRLS is more commonly used. For
Michaelis-Menten you would need non-linear least squares anyway (well, the
closed-form solution has actually been derived), but personally I prefer to
combine scipy.optimize.leastsq with IRLS for M-estimation of
Michaelis-Menten.

Three robustness steps are enough! You do not need the IRLS algorithm to
converge. The fewer robustness steps you take, the closer the solution will
be to the maximum likelihood fit for normally distributed residuals. You
just need enough steps that outliers are sufficiently penalized (in practice:
one LS fit + 3 IRLS steps). Taking too many can actually worsen the fit.

Which gives us this algorithm for fitting the Michaelis-Menten equation
(a rough sketch in code follows below):

- initial estimate from a Lineweaver-Burk plot
- scipy.optimize.leastsq, no weights
- compute weights
- scipy.optimize.leastsq, weighted residuals
- compute weights
- scipy.optimize.leastsq, weighted residuals
- compute weights
- scipy.optimize.leastsq, weighted residuals
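
Something like this minimal sketch (illustrative names; direct (Vmax, Km)
parameterization; assumes strictly positive x and y for the Lineweaver-Burk
start):

import numpy as np
from scipy.optimize import leastsq

def biweight(u):
    # Tukey biweight weight function, as given above
    return (np.abs(u) < 1.0) * (1.0 - u**2)**2

def fit_mm_robust(x, y, n_steps=3):
    # initial estimate from the Lineweaver-Burk line: 1/v = (Km/Vmax) * (1/x) + 1/Vmax
    slope, intercept = np.polyfit(1.0 / x, 1.0 / y, 1)
    params = np.array([1.0 / intercept, slope / intercept])   # (Vmax, Km)

    def resid(p, w=None):
        vmax, km = p
        r = y - vmax * x / (km + x)
        return r if w is None else np.sqrt(w) * r

    # one unweighted least-squares fit
    params, _ = leastsq(resid, params)

    # a fixed, small number of IRLS steps: recompute weights, refit weighted residuals
    for _ in range(n_steps):
        r = resid(params)
        s = np.median(np.abs(r - np.median(r))) / 0.6745   # MAD scale
        params, _ = leastsq(resid, params, args=(biweight(r / s),))
    return params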

Regards,
Sturla

josef...@gmail.com

Mar 4, 2014, 6:00:03 PM
to pystatsmodels
I never tried to do it this way. I didn't know it could be done so easily.
We had an unfinished GSOC project 2 years ago that was intended to
implement this.

This is essentially what statsmodels RLM does for the linear model,
only spread out over 3 modules and 4 classes (with choice of loss
function, scale estimator and robust standard errors).
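
For reference, the linear-model version with the current RLM (x and y here
stand for some linear-model data, not the non-linear example above):

import statsmodels.api as sm

X = sm.add_constant(x)
res_huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()           # IRLS, MAD scale by default
res_biweight = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()
print(res_huber.params, res_biweight.params)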


a bit of background
Following up on some corner-case behavior of the current RLM,
I did another round of background reading on robust methods. I had never
gone through several details of what's currently implemented.

A few reasons why I wanted to try the version where we just optimize a
loss function:

- Tukey Biweight often has local minima (besides those that might be
in the non-linear function); maybe scipy.optimize.basinhopping can
help with the global minimum.
The advantage of HuberT and similar is that they have a unique minimum
(in the linear case).

- we can estimate location and scale at the same time (which is not
always good).

- for me it's easier to think about general versions and extensions
when looking at it just as an optimization problem, instead of as a
collection of conditions that have to be solved (estimating
equations) or in terms of the optimization algorithm (IRLS).
(For example, the Results class for GenericLikelihoodModel already has the
sandwich form for standard errors built in.)


hopefully to come eventually:
robust M-Estimators or similar with (auto- or spatially) correlated
errors, or heteroscedasticity (unequal variances)
robust estimators for Logit, Poisson, ... GLM?

variations on robust estimators
MM-estimators
tau-estimator
S-estimators


I was mainly trying to understand what the current statsmodels.robust
really does.

What do all the default settings in robust mean?
SAS explicitly states the asymptotic relative efficiency of the
M-estimator settings in the documentation.

Josef

Sturla Molden

Mar 4, 2014, 7:12:27 PM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> - Tukey Biweight often has local minima (besides those that might be
> in the non-linear function); maybe scipy.optimize.basinhopping can
> help with the global minimum.
> The advantage of HuberT and similar is that they have a unique minimum
> (in the linear case).

But in robust regression we don't need to find the global minimum of the
loss function. We just want to penalize outliers, from a purely practical
point of view.

There is no guarantee that optimizing the loss function will give a better
model estimate, given that we start from the maximum likelihood solution.
Biweight is often used because it doesn't take us far away from the ML
estimator in the absence of high-leverage data points. But we are not looking
to optimize the loss function, so it does not really matter if it has local
minima.

I don't know how statsmodels implements this though. I assume you know the
code much better than me.

Sturla

josef...@gmail.com

Mar 4, 2014, 7:43:15 PM
to pystatsmodels
On Tue, Mar 4, 2014 at 7:12 PM, Sturla Molden <sturla...@gmail.com> wrote:
> <josef...@gmail.com> wrote:
>
>> - Tukey Biweight often has local minima (besides those that might be
>> in the non-linear function); maybe scipy.optimize.basinhopping can
>> help with the global minimum.
>> The advantage of HuberT and similar is that they have a unique minimum
>> (in the linear case).
>
> But in robust regression we don't need to find the global minimum of the
> loss function. We just want to penalize outliers, from a purely practical
> point of view.
>
> There is no guarantee that optimizing the loss function will give a better
> model estimate, given that we start from the maximum likelihood solution.
> Biweight is often used because it doesn't take us far away from the ML
> estimator in the absence of high-leverage data points. But we are not looking
> to optimize the loss function, so it does not really matter if it has local
> minima.

That might be true if you just expect to have a few outliers; however,
the masking effect of several outliers or high-leverage points could
still lead to a "non-robust" estimate.

The main reference case for these is high-breakdown estimators, where we
could have up to almost 50% outliers. There it is important to start
with a high-breakdown estimate, not with something close to the least
squares or MLE estimate.

In an old pull request I implemented LTS, least trimmed squares, as a
high-breakdown estimator that can be used as a starting value for MM
estimation. LTS and similar methods start with a large number of random
subsets of minimal size (subset size equal to the number of parameters,
or one larger).

Optimizing the loss function itself doesn't help with local minima,
except that we can improve on it with a global optimizer like
basinhopping, and hopefully scipy will get more of those in the future.
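
A rough sketch of what I have in mind (illustrative names; Michaelis-Menten
mean function in a direct (Vmax, Km) parameterization, with the scale fixed at
the MAD of the residuals from a starting fit):

import numpy as np
from scipy.optimize import basinhopping

def tukey_rho(u, c=4.685):
    # Tukey biweight rho function (bounded loss), usual tuning constant c = 4.685
    return np.where(np.abs(u) <= c,
                    c**2 / 6.0 * (1.0 - (1.0 - (u / c)**2)**3),
                    c**2 / 6.0)

def fit_biweight_global(x, y, start_params):
    # fix the scale at the MAD of the residuals from the starting fit
    r0 = y - start_params[0] * x / (start_params[1] + x)
    scale = np.median(np.abs(r0 - np.median(r0))) / 0.6745

    def objective(p):
        r = y - p[0] * x / (p[1] + x)
        return tukey_rho(r / scale).sum()

    res = basinhopping(objective, start_params, niter=50)
    return res.x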

>
> I don't know how statsmodels implements this though. I assume you know the
> code much better than me.

Currently statsmodels.robust only has the linear M-estimator starting
at the least squares solution, implemented with IRLS.

Josef

>
> Sturla
>

Sturla Molden

Mar 4, 2014, 7:49:46 PM
to pystat...@googlegroups.com
Sturla Molden <sturla...@gmail.com>
wrote:

> But in robust regression we don't need to find the global minimum of the
> loss function. We just want to penalize outliers, from a purely practical
> point of view.

Or put another way:

The purpose of robust regression is to improve the model's ability to
forecast new data. It does this by reducing the variance of the estimators
(which we can find from bootstrapping the data).

In that respect it is similar to ridge-regression, except they deal with
two different sources of increased model variance (multicollinearity and
high-leverage points).

The global minimum of the loss function in robust regression has no
theoretical justification, so we don't need to find it. Instead we are
deviating from the maximum likelihood fit to find an estimator with smaller
variance. As long as the biweight function helps us to reduce the variance
of our estimators, it does not matter that it has local minima.

This is also why I suggested only taking three robustness steps: We don't
need the IRLS to converge, because we don't care about the exact optimum
(local or global). Even if we can find the global optimum with Huber's
loss function, we still need to answer this question: What makes it
inherently better than something between this optimum and the LS optimum?
As it turns out: Nothing (that I know of). We are starting our gradient
descent from the LS optimum. But when the variance is sufficiently reduced,
we can just as well stop. Why go further away from the maximum likelihood
when we don't have to?

:-)

Sturla

Sturla Molden

Mar 4, 2014, 7:55:43 PM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> That might be true if you just expect to have a few outliers,

Exactly. What I described is for the situation where a few outliers are
messing up the fit.

> The main reference case for these is high-breakdown estimators, where we
> could have up to almost 50% outliers. There it is important to start
> with a high-breakdown estimate, not with something close to the least
> squares or MLE estimate.

Yes. That is a very different situation.

Sturla

josef...@gmail.com

Mar 4, 2014, 8:19:48 PM
to pystatsmodels
On Tue, Mar 4, 2014 at 7:49 PM, Sturla Molden <sturla...@gmail.com> wrote:
> Sturla Molden <sturla...@gmail.com>
> wrote:
>
>> But in robust regression we don't need to find the global minimum of the
>> loss function. We just want to penalize outliers, from a purely practical
>> point of view.
>
> Or put another way:
>
> The purpose of robust regression is to improve the model's ability to
> forecast new data. It does this by reducing the variance of the estimators
> (which we can find from bootstrapping the data).

Hypothesis tests can also improve a lot if the variance is reduced by
downweighting outlying observations.

>
> In that respect it is similar to ridge-regression, except they deal with
> two different sources of increased model variance (multicollinearity and
> high-leverage points).
>
> The global minimum of the loss function in robust regression has no
> theoretical justification, so we don't need to find it. Instead we are
> deviating from the maximum likelihood fit to find an estimator with smaller
> variance. As long as the biweight function helps us to reduce the variance
> of our estimators, it does not matter that it has local minima.
>
> This is also why I suggested only taking three robustness steps: We don't
> need the IRLS to converge, because we don't care about the exact optimum
> (local or global). Even if we can find the global optimum with Huber's
> loss function, we still need to answer this question: What makes it
> inherently better than something between this optimum and the LS optimum?
> As it turns out: Nothing (that I know of). We are starting our gradient
> descent from the LS optimum. But when the variance is sufficiently reduced,
> we can just as well stop. Why go further away from the maximum likelihood
> when we don't have to?

In my original example above I had iterated 5 steps.

Below is the change in the estimated parameters over 10 iterations,
updating the biweight parameter estimates and the MAD scale in each iteration.

The data were generated with params (10, 1), using 30 observations.

>>> print np.array(store)
[[ 11.97982588 1.33221022]
[ 11.3324888 1.21928743]
[ 10.85400463 1.12890643]
[ 10.58645995 1.07561884]
[ 10.2743019 1.01122067]
[ 10.26487731 1.009242 ]
[ 10.26240113 1.00872177]
[ 10.26174976 1.0085849 ]
[ 10.26157822 1.00854884]
[ 10.26153332 1.00853941]]

For comparison, the leastsq parameters are [ 18.01143455, 2.06214964].

Why stop if we can iterate to convergence?

Josef

>
> :-)
>
> Sturla
>

josef...@gmail.com

Mar 8, 2014, 10:24:25 AM
to pystatsmodels



>
> a simple 2-parameter example with nice pictures
>
> in the plots
> `robust` is Huber-Dutter with a single optimization for loc and scale,
> using the HuberT norm
> `robust_bs` is TukeyBiweight iterated with MAD scale, started at `robust`
> `ls` is scipy.optimize.leastsq
>

Just another example to show "it works", and to remind myself how to do non-linear estimation.

The summary labels are incorrect: the method is M-estimation, not MLE.
The standard errors in the summary are wrong because GenericLikelihoodModel.fit
doesn't have cov_type yet. The sandwich standard errors

>>> resm2.bsejhj
array([ 0.26043514,  0.63595686,  1.1465242 ,  0.1707081 ])

are the theoretically correct ones for M-estimation.

sigmoid

import numpy as np

def sigmoid(params, x):
    # 4-parameter logistic (sigmoid) mean function
    x0, y0, c, k = params
    y = c / (1 + np.exp(-k * (x - x0))) + y0
    return y
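
A sketch of how the SigmoidNL class used for the results below could look,
following the same pattern as MentenNL above (the MEstimatorHD base class is
the same assumption as before, not necessarily the exact class used here):

class SigmoidNL(MEstimatorHD):

    def predict(self, params, exog=None):
        if exog is None:
            exog = self.exog
        return np.squeeze(sigmoid(params, exog))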


5 y-outliers shifted by +10

>>> print resm2.summary(xname='x0 y0 c k'.split())
                              SigmoidNL Results                              
==============================================================================
Dep. Variable:                      y   Log-Likelihood:                 187.51
Model:                      SigmoidNL   AIC:                            -371.0
Method:            Maximum Likelihood   BIC:                            -367.2
Date:                Sat, 08 Mar 2014                                        
Time:                        10:12:25                                        
No. Observations:                  50                                        
Df Residuals:                      48                                        
Df Model:                           1                                        
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x0             5.1332      0.283     18.120      0.000         4.578     5.688
y0             7.6483      0.625     12.229      0.000         6.422     8.874
c             10.6845      1.221      8.750      0.000         8.291    13.078
k              0.8196      0.198      4.143      0.000         0.432     1.207
==============================================================================

>>> beta_m  # data generating parameters
[5, 8, 10, 1]
Attachment: robust_non-linear_outliers_sigmoid.png