Bug: lowess with a few discrete x values returns nan


josef...@gmail.com

Jun 15, 2015, 9:44:40 PM
to pystatsmodels
https://github.com/statsmodels/statsmodels/issues/2449


Does anyone know what lowess is supposed to be doing if there are only a few unique x values, like x values in set {0,1,2}, with enough variation in y values?
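
For concreteness, a minimal sketch of the failing setup (the data are made up; on affected statsmodels versions the fitted values come back as nan):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.RandomState(12345)
x = np.repeat([0.0, 1.0, 2.0], 10)   # only three unique x values
y = x + rng.normal(size=x.shape)     # plenty of variation in y

result = lowess(y, x, frac=0.2)      # small frac -> neighborhoods of tied x
print(result[:, 1])                  # on affected versions this prints nan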


How do R or other packages handle this?


I don't have any clue, and no experience with this.

Josef



Sturla Molden

Jun 16, 2015, 6:48:37 PM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> Does anyone know what lowess is supposed to be doing if there are only a
> few unique x values, like x values in set {0,1,2}, with enough variation in
> y values?

There must be something wrong with the implementation. LOWESS is just a
weighted regression. It should not break down due to ties.

For any vector x, y is predicted with a weighted multiple regression in
which the weights are given by a tricube window applied to the Euclidean
distance between x and each row in the data matrix X. The most efficient
implementation would be to use a kd-tree (or a ball tree) to search X for
points in the neighborhood around x for which the tricube weight is nonzero
(e.g. call query_ball_point on an instance of scipy.spatial.cKDTree).
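
A minimal sketch of one such local fit, assuming the window radius is given (in real lowess it is derived from the span); lowess_fit_at and radius are illustrative names, not statsmodels API:

import numpy as np
from scipy.spatial import cKDTree

def lowess_fit_at(x0, X, y, radius):
    """Predict y at the point x0 (shape (d,)) by tricube-weighted linear
    regression on the neighbors of x0 within `radius`."""
    tree = cKDTree(X)                  # in practice, build once and reuse
    idx = tree.query_ball_point(x0, r=radius)
    Xn, yn = X[idx], y[idx]
    d = np.linalg.norm(Xn - x0, axis=1) / radius
    w = (1.0 - d**3)**3                # tricube weights, zero at d >= 1
    # Weighted least squares via sqrt-weight scaling; lstsq returns a
    # minimum-norm fit even when ties make the design rank deficient,
    # which is exactly where naive closed-form slopes divide by zero.
    A = np.column_stack([np.ones(len(idx)), Xn])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], yn * sw, rcond=None)
    return np.concatenate(([1.0], x0)) @ beta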

Sturla

josef...@gmail.com

Jun 16, 2015, 8:39:58 PM
to pystatsmodels
On Tue, Jun 16, 2015 at 6:47 PM, Sturla Molden <sturla...@gmail.com> wrote:
<josef...@gmail.com> wrote:

> Does anyone know what lowess is supposed to be doing if there are only a
> few unique x values, like x values in set {0,1,2}, with enough variation in
> y values?

> There must be something wrong with the implementation. LOWESS is just a
> weighted regression. It should not break down due to ties.

It's a bit messier because it's a robust version, and there is also the problem of the robust variance being zero. But that has been fixed, and I guess it shouldn't affect this case.

I assume there is a problem with how the neighbors are calculated when there are many ties and little or no variation in x. There could also be cases where all neighbors have the same x, so we have only one unique x value in the regression (which would make the regression singular: two constant columns).

I won't try to figure out the code right now, but a quick look shows that we divide by zero if there is no variation in x within the neighbors. As part of the fix we might have to special-case this.
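
A sketch of what that special case could look like (illustrative only, not the statsmodels code; local_fit is a hypothetical helper):

import numpy as np

def local_fit(x_neigh, y_neigh, w, x0):
    """Weighted local linear fit at x0, special-casing a neighborhood
    with no variation in x."""
    xbar = np.average(x_neigh, weights=w)
    ybar = np.average(y_neigh, weights=w)
    sxx = np.sum(w * (x_neigh - xbar) ** 2)
    if sxx == 0.0:
        # All neighbors share the same x: the slope is undefined and the
        # usual formula divides by zero, so fall back to the weighted mean.
        return ybar
    slope = np.sum(w * (x_neigh - xbar) * (y_neigh - ybar)) / sxx
    return ybar + slope * (x0 - xbar)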

 

> For any vector x, y is predicted with a weighted multiple regression in
> which the weights are given by a tricube window applied to the Euclidean
> distance between x and each row in the data matrix X. The most efficient
> implementation would be to use a kd-tree (or a ball tree) to search X for
> points in the neighborhood around x for which the tricube weight is nonzero
> (e.g. call query_ball_point on an instance of scipy.spatial.cKDTree).


It's simpler in the case of lowess because it's just one-dimensional and we sort the x values.
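
For example, with sorted x the k nearest neighbors always form a contiguous window that can be slid along the data, no tree needed (a sketch; k_nearest_window is an illustrative helper, not the statsmodels internals):

import numpy as np

def k_nearest_window(xs, i, k):
    """Indices [lo, hi] (inclusive) of the k nearest neighbors of xs[i]
    in a sorted 1-D array xs, found by growing a contiguous window."""
    lo = hi = i
    while hi - lo + 1 < k:
        if lo == 0:
            hi += 1
        elif hi == len(xs) - 1:
            lo -= 1
        elif xs[i] - xs[lo - 1] <= xs[hi + 1] - xs[i]:
            lo -= 1   # left candidate at least as close; ties go left
        else:
            hi += 1
    return lo, hi

xs = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 2.0])
print(k_nearest_window(xs, 3, 4))   # with many ties the choice is arbitrary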

Josef

 


Sturla Molden

Jun 16, 2015, 9:18:18 PM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> It's a bit messier because it's a robust version, and there is also the
> problem of the robust variance being zero. But that has been fixed and I
> guess shouldn't affect this case.

In the original formulation there were three optional robustness steps
(M-estimation), in which the tricube distance weights were multiplied by
bisquare weights on the residuals. The M-estimation was not run until
convergence; there were just three additional updates.
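
A sketch of those bisquare robustness weights, following the scaling in Cleveland (1979), six times the median absolute residual (my reading of the paper, not the statsmodels code):

import numpy as np

def bisquare_weights(residuals):
    """Robustness weights for one robustifying iteration:
    w = (1 - u**2)**2 for |u| < 1, else 0,
    where u = residual / (6 * median(|residuals|))."""
    s6 = 6.0 * np.median(np.abs(residuals))
    if s6 == 0.0:
        # Degenerate case: at least half the residuals are exactly zero,
        # the "robust variance is zero" problem mentioned above.
        return np.ones_like(residuals)
    u = np.clip(residuals / s6, -1.0, 1.0)
    return (1.0 - u**2) ** 2

These weights then multiply the tricube distance weights in the next pass.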

> It's simpler in the case of lowess because it's just one-dimensional and we
> sort the x values.

Maybe in statsmodels, but lowess was actually formulated as multiple regression.

There was also a formulation of lowess where it was used on timeseries data
by applying the tricube weight to the distance in time between samples.


Sturla

josef...@gmail.com

Jun 16, 2015, 9:36:22 PM
to pystatsmodels
On Tue, Jun 16, 2015 at 9:18 PM, Sturla Molden <sturla...@gmail.com> wrote:
<josef...@gmail.com> wrote:

> It's a bit messier because it's a robust version, and there is also the
> problem of the robust variance being zero. But that has been fixed and I
> guess shouldn't affect this case.

> In the original formulation there were three optional robustness steps
> (M-estimation), in which the tricube distance weights were multiplied by
> bisquare weights on the residuals. The M-estimation was not run until
> convergence; there were just three additional updates.

We have an `it` keyword that defaults to three. I don't think there is a convergence check, so our version always calculates a fixed number of iterations.
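
For reference, a usage sketch (the synthetic data are made up; the keyword names follow the current statsmodels signature):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.RandomState(0)
x = np.linspace(0.0, 10.0, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# `it` sets the number of robustifying iterations; all of them are run,
# with no convergence check, so it=0 disables the reweighting entirely.
smoothed = lowess(y, x, frac=2.0 / 3.0, it=3)   # sorted (x, fitted) pairs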

 

> It's simpler in the case of lowess because it's just one-dimensional and we
> sort the x values.

> Maybe in statsmodels, but lowess was actually formulated as multiple regression.


I initially heard of it only as a tool for graphical analysis in the univariate case (but I don't think I was ever patient or interested enough to read the articles).
We have multivariate kernel regression, but that is not optimized (no Cython) and it's not robust.

The problem with the optimized lowess code is that it is a bit difficult to figure out all the moving parts.

Josef

Sturla Molden

Jun 17, 2015, 8:09:39 AM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> I heard of it initially as just for graphical analysis only for the
> univariate case (but I don't think I was ever patient or interested enough
> to read the articles).

http://www.stat.washington.edu/courses/stat527/s13/readings/Cleveland_Delvin_JASA_1988.pdf

It is a multivariate smoother. It fits a very flexible surface to the data.
If you make stronger assumptions and restrict it to an additive form, you
get the generalized additive model.
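
In symbols (notation mine): the full loess surface is y = g(x1, ..., xp) + error, while the additive restriction fits y = a + f1(x1) + ... + fp(xp) + error, with each fj a univariate smooth.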

http://www.math.wpi.edu/saspdf/stat/chap38.pdf
http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_gam_sect029.htm


> We have multivariate kernel regression, but that is not optimized (no
> cython) and it's not robust.

The main bottlenecks in lowess (loess?) are the kd-tree search and the
LAPACK calls. These are already optimized. There is not much to gain from
cythonization.


Sturla

josef...@gmail.com

Jun 18, 2015, 3:26:32 PM
to pystatsmodels
Sturla,

a general question about kd-trees:

How do they handle ties?
If we have many identical values and I request the k nearest neighbors, what do I get?

Josef

 



Sturla Molden

Jun 19, 2015, 5:43:22 AM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> How do they handle ties?
> If we have many identical values and I request the k nearest neighbors,
> what do I get?

It is not a stable sort. You get the ones encountered first.

In the case of lowess one would probably not do a k-NN search but rather a
range search (query_ball_point in scipy.spatial.cKDTree). Or one could do a
k-NN search first, find the largest distance, and then do a range search to
grab the ties.
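
A sketch of that two-step, tie-safe query (the data are made up to force ties):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(30, 1)).astype(float)   # many tied coordinates
tree = cKDTree(X)
x0, k = np.array([1.0]), 5

dist, _ = tree.query(x0, k=k)          # k-NN drops ties beyond the k-th
r = dist[-1] * (1.0 + 1e-12)           # tiny slack guards float rounding
idx = tree.query_ball_point(x0, r=r)   # range search recovers every tie
print(len(idx), ">=", k)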

Sturla

Sturla Molden

Jun 19, 2015, 6:02:40 AM
to pystat...@googlegroups.com
Sturla Molden <sturla...@gmail.com> wrote:

> It is not a stable sort. You get the ones encountered first.

That is, you get the k nearest neighbors for each queried point. But if the
k-th nearest neighbor has ties, some of them might be omitted.

Sturla
