Pros and cons regarding the Least Absolute Deviation Regression

thebluecliffrecord

Nov 13, 2009, 2:51:33 PM

Dear All,

I would appreciate it if anybody could provide information about the
strengths and weaknesses of least absolute deviation (LAD) regression
compared to least squares regression. I have googled several websites
and am looking for additional useful information.

LAD is also called median regression, L1-norm regression, minimum
absolute deviations (MAD) etc (??)

Thanks.

Sincerely,

Sangdon Lee, Ph.D.,
GM Warren Tech Center

Gordon Sande

Nov 13, 2009, 3:49:05 PM

Least squares matches the functional form of Gaussians and has lots of simple
algebra. Least absolute deviations matches the functional form of Laplacians,
which are also often called double exponentials. The algebra is that of linear
programming, so it is not quite as simple. Laplacians are longer tailed than
Gaussians, so they arise often when robustness is a concern.

Ask Google about "The Gaussian Hare and the Laplacian Tortoise" for comments
on issues like computation and robustness.

Brian Borchers

Nov 13, 2009, 9:59:30 PM

If your measurements include some outlier points that are actually
incorrect values (rather than points that really are correct and are
just revealing that your underlying model is wrong), then L1
regression is more robust than least squares. If your data really are
normally distributed, then least squares regression gives a maximum
likelihood estimate and is the way to go.

Ray Koopman

Nov 14, 2009, 3:01:56 AM

LAD regression is a special case of quantile regression. See
http://www.econ.uiuc.edu/~roger/research/rq/QRJEP.pdf

Paul

Nov 14, 2009, 9:50:01 AM

As others have pointed out, the primary advantage of LAD over least
squares is resistance to outliers. (There are variants of OLS that
also make it more resistant to outliers.) The primary disadvantages
are that it is computationally more difficult than OLS (but
contemporary LP solvers can handle pretty large data sets pretty
efficiently), it requires software not commonly included in a stats
package, and (as Ray points out) it lacks many of the theoretical
properties of OLS that statisticians love. I'll also add that finding
confidence intervals for coefficients is not all that easy, even if
you assume normality (and if you're assuming normality, why aren't you
using OLS?). The "nice" theoretical properties of OLS may be a trap,
though; they lure you into the assumption that the noise component of
the response has a Gaussian, or at least near-Gaussian, distribution
(along with some independence assumptions), which is reasonable in
many situations but not ubiquitous.
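
For readers unfamiliar with the LP connection mentioned above, here is a sketch of one standard way to write LAD as a linear program. The symbols (y_i for the response, x_i for the predictor vector, beta for the coefficients, t_i for auxiliary bounds on the absolute residuals) are notation assumed for illustration and do not come from this thread.

```latex
\begin{aligned}
\min_{\beta,\,t}\quad & \sum_{i=1}^{n} t_i \\
\text{subject to}\quad & -t_i \le y_i - x_i^{\top}\beta \le t_i,
\qquad i = 1,\dots,n .
\end{aligned}
```

At an optimum each t_i equals the absolute residual |y_i - x_i' beta|, so the objective is the sum of absolute deviations.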

/Paul

aruzinsky

Nov 14, 2009, 2:23:57 PM

1. LS is the maximum likelihood estimator (MLE) for linear systems
with additive i.i.d. Gaussian noise of zero mean.

2. LAD is the MLE for linear systems with additive i.i.d Laplace noise
of zero median.

3. MLEs are asymptotically efficient, i.e., they asymptotically attain
minimum variance.

4. The sample mean is a special case of LS regression and the sample
median is a special case of LAD regression. As a rule of thumb, LS
regression works well when the sample mean works well and LAD
regression works well when the sample median works well.

5. With minor exceptions, the standard deviation of the sample mean
and median of N iid numbers with standard deviation sigma is

sample mean: sigma/sqrt(N)

sample median: 1/(2*f(median)*sqrt(N)) as N->oo

where f(.) = probability density function

See http://mathworld.wolfram.com/StatisticalMedian.html , Eq. 4

For the Gaussian distribution,

f(median) = 1/(sigma*sqrt(2*PI) ) = 0.3989/sigma

therefore

Standard Deviation of sample median ~= 1.25*sigma/sqrt(N)

For the Laplace Distribution,

f(median) = 1/(sigma*sqrt(2) ) = 0.707/sigma

therefore

Standard Deviation of sample median ~= 0.707*sigma/sqrt(N)

Thus, for Gaussian noise, LS is approximately (1.25)^2 = 1.56X as
efficient as LAD and for Laplace noise, LS is approximately (0.707)^2
= 0.5 X as efficient as LAD. These are small differences, so why
bother?

6. For some, if not all, cases with f(median) = oo, or sigma = oo, the
variance of LAD estimates can converge faster than O(1/N) whereas LS
is always O(sigma^2/N) (I imply there is no m.s. convergence for sigma
= oo).

7. For i.i.d. Cauchy noise of zero median (Cauchy distributions have
undefined mean and infinite variance), LS is useless because estimates
will have infinite variance. However, in my experience, LAD will be
almost as good as Cauchy MLE (I suspect faster than O(1/N) m.s.
convergence). So, maybe, you think that makes LAD robust?

8. For noise, d(x-a)/2 + d(x+a)/2, where d() is Dirac delta (note f(0)
= 0), LAD is not a consistent estimator (useless) whereas LS works
fine. Therefore general robustness allegations for LAD are invalid.
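
The ratios in item 5 can be checked with a small simulation. The following is only a sketch using the Python standard library; the sample size (101), the number of trials (2000), the seed, and the helper names are arbitrary choices made for illustration, not anything from this thread. At finite N the Laplace ratio should come out somewhat above the asymptotic 0.707.

```python
import math
import random
import statistics

def laplace(rng, sigma):
    """Draw one Laplace variate with mean 0 and standard deviation sigma."""
    b = sigma / math.sqrt(2.0)  # Laplace scale parameter, since var = 2*b^2
    # The difference of two i.i.d. Exp(1) variates is Laplace(0, 1).
    return b * (rng.expovariate(1.0) - rng.expovariate(1.0))

def sd_ratio(draw, n=101, trials=2000, seed=0):
    """Return SD(sample median) / SD(sample mean) over repeated samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.stdev(medians) / statistics.stdev(means)

gauss_ratio = sd_ratio(lambda rng: rng.gauss(0.0, 1.0))
laplace_ratio = sd_ratio(lambda rng: laplace(rng, 1.0))
print(f"Gaussian: SD(median)/SD(mean) = {gauss_ratio:.2f} (asymptotic 1.25)")
print(f"Laplace:  SD(median)/SD(mean) = {laplace_ratio:.2f} (asymptotic 0.71)")
```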

thebluecliffrecord

Nov 17, 2009, 10:52:07 AM

Dear All,

I appreciate valuable responses from Mr. Sande, Mr. Borchers, Mr.
Koopman, Mr. Paul and Mr. Aruzinsky.

By the way, the EPA is using LAD to determine the fuel economy (FE) target
(miles/gallon) as a function of vehicle footprint (the area where the
four tires meet the ground). The rationale is the robustness of LAD to
outliers (e.g., a Porsche is the size of a small car but its FE is much
worse). However, I don't consider the Porsche an outlier because it
is designed that way. Also, Porsche is not a sample; we have data for
all vehicles currently being driven (ignoring small differences in
FE caused by various optional features of a vehicle, e.g., power seats).

Is LAD more appropriate than LS regression when FE is regressed on
footprint? Or am I asking the wrong question, because the differences
between LAD and LS regressions are negligible or a discussion of LAD vs.
LS will just get bogged down?

I appreciate any comments.

Sincerely,

Sangdon Lee,

aruzinsky

Nov 17, 2009, 12:37:32 PM

Outliers, shmoutliers. For all you know, LAD should be used only
because f(median) is large and possibly oo.

Do I understand correctly that there is only one independent variable,
"vehicle footprint," modeled by

(1) Yi = A*Xi + Ei

or

(2) Yi = A*Xi + B + Ei

?

It may interest you to know that, in case (1), the LAD estimate of A
is a type of weighted median that can be found by sorting. However,
if mean(Ei) != 0, using LS on (1) would be inappropriate and, if
median(Ei) != 0, using LAD on (1) would be inappropriate.
If it is unknown whether mean(Ei) = 0, or median(Ei) = 0, (2) should
be used because B is such that it forces the mean or median of Ei to
zero.
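
The weighted-median remark for case (1) can be sketched as follows. This is a minimal illustration, not anyone's production code: the function names and the data are made up, and it returns the lower weighted median when the minimizer is not unique.

```python
def lad_slope_through_origin(x, y):
    """LAD estimate of A in y_i = A*x_i + e_i, found by sorting."""
    # sum |y_i - A*x_i| = sum |x_i| * |y_i/x_i - A|, so A is a weighted
    # median of the ratios y_i/x_i with weights |x_i|. Points with
    # x_i == 0 contribute a constant and are skipped.
    pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y) if xi != 0)
    if not pairs:
        raise ValueError("need at least one point with x != 0")
    half = sum(w for _, w in pairs) / 2.0
    running = 0.0
    for ratio, w in pairs:
        running += w
        if running >= half:  # cumulative weight reaches half the total
            return ratio

def ls_slope_through_origin(x, y):
    """Least squares estimate of A in the same model, for comparison."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 50.0]  # made-up data; the last point is a gross outlier
a_lad = lad_slope_through_origin(x, y)
a_ls = ls_slope_through_origin(x, y)
print(f"LAD slope = {a_lad:.3f}, LS slope = {a_ls:.3f}")
```

On this toy data the LS slope is pulled toward the outlier while the LAD slope stays near 2.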

Previously, I forgot to mention that when f(Ei) is asymmetric and its
mean does not coincide with its median, the LS versus LAD estimates
will typically be biased with respect to each other.

To determine whether LAD or LS is better, I suggest that you look at
the empirical distribution of residuals from each method on (2), and
don't forget the possibility that f(median) = oo.

Gordon Sande

Nov 17, 2009, 12:57:02 PM

If the tire pressure is increased the footprint will go down. This illustrates
that the apparent assumption is for equal contact loading. It is well known that
increasing the pressure lowers the fuel consumption and lowers the traction. Any
car intended to have higher traction will have lower contact loading. The extreme
case would be steel wheels on steel rails.

One might ask whether the underlying science is any good before worrying about
whether the statistics is good. For those in the trade it is easy to find examples
of good statistics being applied to bad science. Robustness is a matter of dealing
with poor measurement processes but not bad models.

The effects of chemical treatments where there are unspecified contaminants
offer many ready cases. Think medicine.

aruzinsky

Nov 17, 2009, 1:31:27 PM

On Nov 17, 11:57 am, Gordon Sande <g.sa...@worldnet.att.net> wrote:
> ...
> It is well known that increasing the pressure ... and lowers the traction.
> ...

That's not true. In Physics 101, I learned that frictional force is
independent of area of contact except in cases when the heat from
friction changes the coefficient of friction, for example, when tires
begin to melt by spinning on the pavement. And, then depending on how
much, the coefficient of friction can increase.

thebluecliffrecord

Nov 17, 2009, 2:20:14 PM

Thanks for your response (Mr. Sande and Mr. Aruzinsky)

Mr. Gordon Sande wrote:
> If the tire pressure is increased the footprint will go down. This illustrates
> that the apparent assumption is for equal contact loading. It is well known that
> increasing the pressure lowers the fuel consumption and lowers the traction. Any
> car intended to have higher traction will have lower contact loading. The extreme
> case would be steel wheels on steel rails.

Footprint is defined as track * wheelbase. Therefore there is no need to
worry about tire pressure or about measurement errors.

> One might ask whether the underlying science is any good before worrying about
> whether the statistics is good. For those in the trade it is easy to find examples
> of good statistics being applied to bad science. Robustness is a matter of dealing
> with poor measurement processes but not bad models.

The point is well understood. However, I'm investigating
methodological issues (if any).

> To determine whether LAD or LS is better, I suggest that you look at
> the empirical distribution of residuals from each method on (2) and
> don't forget the possibility that f(median) = oo.

Is it correct to interpret f(median) = oo as meaning that the median value
is very large (i.e., infinite)?

> 8. For noise, d(x-a)/2 + d(x+a)/2, where d() is Dirac delta (note f(0)
> = 0), LAD is not a consistent estimator (useless) whereas LS works
> fine. Therefore general robustness allegations for LAD are invalid.

I would like to have a reference in regards to the (in)consistency of
LAD estimator. I remember reading that the LAD estimation is not
unique contrary to LS estimation.

Thanks in advance.

Sangdon Lee

Gordon Sande

Nov 17, 2009, 2:47:15 PM

On 2009-11-17 15:20:14 -0400, thebluecliffrecord <sangd...@gmail.com> said:

> Thanks for your response (Mr. Sande and Mr. Aruzinsky)
>
> Mr. Gordon Sande wrote:
>> If the tire pressure is increased the footprint will go down. This illustrates
>> that the apparent assumption is for equal contact loading. It is well known that
>> increasing the pressure lowers the fuel consumption and lowers the traction. Any
>> car intended to have higher traction will have lower contact loading. The extreme
>> case would be steel wheels on steel rails.
>
> The definition of footprint = track*wheelbase. Therefore no need to
> worry about tire pressure as well as measurement errors.

Sounds like a surrogate for weight.

Your earlier "vehicle footprint (the area where the four tires meet the
ground)" sure sounded like tire contact area.

>> One might ask whether the underlying science is any good before worrying about
>> whether the statistics is good. For those in the trade it is easy to find examples
>> of good statistics being applied to bad science. Robustness is a matter of dealing
>> with poor measurement processes but not bad models.
>
> The point is well understood. However, I'm investgating
> methodological issues (if any),
>
>> To determine whether LAD or LS is better, I suggest that you look at
>> the empirical distribution of residuals from each method on (2)and
>> don't forget the possibility that f(median) = oo.
>
> Is it correct to interpret f(median)=oo as the median value is very
> large (i.e., infinity)? .
>
>> 8. For noise, d(x-a)/2 + d(x+a)/2, where d() is Dirac delta (note f(0)
>> = 0), LAD is not a consistent estimator (useless) whereas LS works
>> fine. Therefore general robustness allegations for LAD are invalid.
>
> I would like to have a reference in regards to the (in)consistency of
> LAD estimator. I remember reading that the LAD estimation is not
> unique contrary to LS estimation.

The example is just of an interval that never narrows because the original
distribution only has two points. It is more a counter example to the notion
of consistency than anything else. Sometimes apparently sensible definitions
have simple but pathological counter examples. One of my favorite math books
is called Counter Examples in Mathematics and is just a large catalogue of
simple counter examples of short inadequately qualified definitions. Nothing
very profound but nice examples of the need to be careful.
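
The two-point example can be tried numerically. This is a rough sketch using the Python standard library, with a = 1, the sample sizes, and the seed chosen arbitrarily for illustration: the sample mean shrinks toward the true center 0 as N grows, while the sample median (odd N) is always exactly -a or +a.

```python
import random
import statistics

def two_point_sample(rng, n, a=1.0):
    """n i.i.d. draws equal to -a or +a with probability 1/2 each."""
    return [a if rng.random() < 0.5 else -a for _ in range(n)]

rng = random.Random(1)  # fixed seed so the run is repeatable
results = []
for n in (11, 101, 1001, 10001):  # odd sizes, so the median is a data value
    sample = two_point_sample(rng, n)
    results.append((n, statistics.fmean(sample), statistics.median(sample)))
    print(f"N = {n:5d}  mean = {results[-1][1]:+.4f}  median = {results[-1][2]:+.1f}")
```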

LAD is equivalent to linear programming which readily admits degeneracy. This
is just a fancy version of the fact that the median may be an interval when
the sample size is even. In the case of rank deficiency LS is also degenerate.
That is why there is considerable literature on pseudo inverses. Otherwise the
discussion of the minimal norm least squares solution would be vacuous.

Lack of uniqueness is a property of both methods.
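
The remark about the median being an interval for even sample sizes can be illustrated with a few lines of Python. The data values are made up; the point is only that every value between the two middle observations gives the same sum of absolute deviations.

```python
def sum_abs_dev(data, m):
    """Sum of absolute deviations of the data from the location m."""
    return sum(abs(v - m) for v in data)

data = [1.0, 2.0, 4.0, 10.0]         # even sample size; middle values are 2 and 4
candidates = [2.0, 2.5, 3.0, 4.0]    # all lie in the interval [2, 4]
losses = [sum_abs_dev(data, m) for m in candidates]
outside = sum_abs_dev(data, 1.5)     # a location outside the interval
print("losses inside [2, 4]:", losses)
print("loss at 1.5:", outside)
```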

Gordon Sande

Nov 17, 2009, 2:50:54 PM

On 2009-11-17 14:31:27 -0400, aruzinsky <aruz...@general-cathexis.com> said:

> On Nov 17, 11:57 am, Gordon Sande <g.sa...@worldnet.att.net> wrote:
>> ...
>> It is well known that increasing the pressure ... and lowers the traction.
>> ...
>
> That's not true. In Physics 101,

Under the sort of assumptions that are suitable for Physics 101.
Try the presence of gravel, ice for starters.

aruzinsky

Nov 17, 2009, 3:32:45 PM

No, it means that there is probability greater than zero that Ei =
median(Ei). For a symmetric probability density, the
median is zero (if the median is not unique, a median is zero).
Assuming the median is zero, f(0) = oo means that f(Ei) will include
a Dirac delta, multiplied by a coefficient, P, at zero. The rest of
the distribution can resemble anything, even a Gaussian
distribution. This implies there is probability = P > 0 that Ei = 0
exactly. A caveat: in any case, an LAD fit of N parameters will have
at least N residuals equal to zero.
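
The caveat about zero residuals can be seen in a toy case. The sketch below assumes model (2) with two parameters and relies on the linear programming fact that some LAD optimum passes through at least two data points, so it simply tries every such line. It is O(n^3) and meant only for illustration; the function names and data are made up.

```python
from itertools import combinations

def abs_loss(x, y, a, b):
    """Sum of absolute residuals for the line y = a*x + b."""
    return sum(abs(yk - (a * xk + b)) for xk, yk in zip(x, y))

def lad_line_bruteforce(x, y):
    """LAD fit of y = a*x + b by trying every line through two data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not of the form y = a*x + b
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        loss = abs_loss(x, y, a, b)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 12.0]  # made-up data; the last point is a gross outlier
loss, a, b = lad_line_bruteforce(x, y)
zero_residuals = sum(1 for xk, yk in zip(x, y) if abs(yk - (a * xk + b)) < 1e-9)
print(f"a = {a:.3f}, b = {b:.3f}, loss = {loss:.3f}, zero residuals = {zero_residuals}")
```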

"I would like to have a reference"

P. Bloomfield and W.L. Steiger, "Least Absolute Deviations" (book),
Boston-Basel-Stuttgart, Birkhäuser, 1983

T.E. Dielman, "Least Absolute Value Estimation in Regression Models:
An Annotated Bibliography," Communications in Statistics Theor.
Meth., vol 13, No 4, pp. 513-541, 1984

aruzinsky

Nov 17, 2009, 4:03:28 PM

On Nov 17, 1:50 pm, Gordon Sande <g.sa...@worldnet.att.net> wrote:

> > much, the coefficient of friction can increase.

I understand that increased pressure from decreased contact area will
better form a slippery film of water between tires and ice, but I
don't see why friction from gravel would change with contact area.
Personally, I find dry salt on the streets very hazardous but I see no
reason why more contact area between the tires and salt would help.
Anyway, ice, gravel and salt are absent more often than not therefore
Physics 101 prevails more often than not.

aruzinsky

Nov 17, 2009, 4:40:48 PM

On Nov 17, 1:47 pm, Gordon Sande <g.sa...@worldnet.att.net> wrote:
> ...

> One of my favorite math books
> is called Counter Examples in Mathematics and is just a large catalogue of
> simple counter examples of short inadequately qualified definitions.

> ...

I have

Jordan Stoyanov, Counterexamples in Probability, John Wiley & Sons,
1987

Incidentally, there are many papers, mostly in image processing, whose
authors falsely pretend that d/dx abs(x) = sign(x) (the derivative
of abs(x) is undefined at zero) when deriving LAD algorithms that
don't work, and this gets past plenty of referees.
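
As a hedged aside on what can stand in for the derivative at zero: in convex analysis abs(x) has a subdifferential, which is a whole interval at x = 0. The notation below (y_i for the data, \hat{m} for a location estimate) is assumed for illustration and is not taken from any of the papers alluded to above.

```latex
\partial |x| =
\begin{cases}
\{\operatorname{sign}(x)\}, & x \neq 0,\\
[-1,\,1], & x = 0,
\end{cases}
\qquad
\hat{m} \text{ minimizes } \sum_{i=1}^{n} |y_i - m|
\iff
0 \in \sum_{i=1}^{n} \partial |y_i - \hat{m}| .
```

The optimality condition is a set inclusion, not an equation in sign(x), which is one way to see why the minimizer can sit exactly on data points where the ordinary derivative does not exist.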