Translating R's loess/predict functions to pystatsmodels?


yarden
Jun 28, 2013, 5:37:03 PM
to pystat...@googlegroups.com, Yarden Katz
Hi all,

I have the Python code below, which calls R via Rpy2, to get loess-corrected values for a pair of vectors based on MA normalization (as in http://en.wikipedia.org/wiki/MA_plot). The code calls the R function "loess" and then the R function "predict" to get the correction factors. The problem is that it's very slow and I'd prefer to have a faster native Python solution. Is there an equivalent of this function in statsmodels? The code is below for your information (the action is in the 'run_loess' function).

Is this possible? Thanks. Best, --Yarden

===


import numpy as np
import pandas
import rpy2.robjects as robj
from rpy2.robjects import r


def run_ma_loess(x, y):
    # M = log-ratio, A = average log intensity (as in an MA plot)
    M = np.log2(x) - np.log2(y)
    A = 0.5 * (np.log2(x) + np.log2(y))
    # Fit M ~ A and subtract the fitted trend from M
    corrected_m, correction_factor = run_loess(A, M)
    corrected_x = 2**((2*A + corrected_m)/2.)
    corrected_y = 2**((2*A - corrected_m)/2.)
    return corrected_x, corrected_y


def where_na_like(l):
    """
    Return indices where array is NA-like
    """
    bool_index = np.array(map(lambda x: np.isinf(x) or pandas.isnull(x), l))
    return np.where(bool_index)[0]


def run_loess(x, y, span=0.8):
    """
    Predict y as a function of x. Takes two numpy vectors.
    """
    # Substitute Inf/-Inf and NaN values with R's NA
    x[where_na_like(x)] = robj.NA_Real
    y[where_na_like(y)] = robj.NA_Real
    data = robj.DataFrame({"x": x, "y": y})
    loess_fit = r.loess("y ~ x", data=data, span=span)
    correction_factor = np.array(list(r.predict(loess_fit, x)))
    corrected_y = np.array(list(y)) - correction_factor
    return corrected_y, correction_factor

josef...@gmail.com
Jun 28, 2013, 5:48:54 PM
to pystat...@googlegroups.com
On Fri, Jun 28, 2013 at 5:37 PM, yarden <yarde...@gmail.com> wrote:
> Hi all,
>
> I have the Python code below, which calls R via Rpy2, to get loess-corrected
> values for a pair of vectors based on MA normalization (as in
> http://en.wikipedia.org/wiki/MA_plot). The code calls the R function "loess"
> and then the R function "predict" to get the correction factors. The problem
> is that it's very slow and I'd prefer to have a faster native Python
> solution. Is there an equivalent of this function in statsmodels? The code
> is below for your information (the action is in the 'run_loess' function).

can you try our lowess?
http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html

It should be relatively close, but I haven't looked at what the
difference is in R between lowess and loess.

The documentation page is out of date; I think the docs haven't been
rebuilt in about a week.
If you use statsmodels master, then you get the lowess that is
implemented in Cython and is pretty fast, and it can skip observations
like the R function does if the dataset is very large.

Almost all R code is GPL, which is license-incompatible with our BSD
license, so we cannot "translate" any R code, but we can implement our
own functions.

Please report back if statsmodels.nonparametric.lowess is enough or if
some features are different.
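Untested, but something along these lines should replace the loess/predict
round trip for the MA correction (I'm assuming frac plays roughly the role
of R's span; the algorithms aren't identical, so values won't match R exactly):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def run_lowess(x, y, frac=0.8):
    # lowess returns the sorted x values paired with the fitted y values
    fitted = lowess(y, x, frac=frac)
    # interpolate the fitted curve back onto the original x values
    correction_factor = np.interp(x, fitted[:, 0], fitted[:, 1])
    corrected_y = y - correction_factor
    return corrected_y, correction_factor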

Josef

Nathaniel Smith
Jun 28, 2013, 5:54:36 PM
to pystatsmodels
On Fri, Jun 28, 2013 at 10:37 PM, yarden <yarde...@gmail.com> wrote:
> Hi all,
>
> I have the Python code below, which calls R via Rpy2, to get loess-corrected
> values for a pair of vectors based on MA normalization (as in
> http://en.wikipedia.org/wiki/MA_plot). The code calls the R function "loess"
> and then the R function "predict" to get the correction factors. The problem
> is that it's very slow and I'd prefer to have a faster native Python
> solution.

A native Python solution will be better, but as a possible stopgap --
IIRC R's "lowess" function is *much* faster than its "loess" function
for large data sets.

> def where_na_like(l):
>     """
>     Return indices where array is NA-like
>     """
>     bool_index = np.array(map(lambda x: np.isinf(x) or pandas.isnull(x), l))

I think this can just be written:
bool_index = np.isinf(l) | pandas.isnull(l)

-n

yarden
Jun 28, 2013, 6:21:19 PM
to pystat...@googlegroups.com
Thanks very much Josef - but just to clarify, there's no need for a "predict"-like function with the statsmodels version, since lowess already returns the x values and the associated fitted y values -- is that correct?  Thanks, --Yarden

josef...@gmail.com
Jun 28, 2013, 6:29:16 PM
to pystat...@googlegroups.com
On Fri, Jun 28, 2013 at 6:21 PM, yarden <yarde...@gmail.com> wrote:
> Thanks very much Josef - but just to clarify, there's no need for
> "predict"-like function in the pystatsmodels version, since the loess
> version already returns the x and y associated values -- is that correct?
> Thanks, --Yarden

Yes, by default the sorted x and the associated predicted y values are
returned; nans and infs are stripped.

I added some options to deviate from this usage:
return_sorted=False returns the fitted values in the original order of
the observations.

However, lowess is a pure smoother at the existing x values; it doesn't
allow you to predict at other x values.
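For example (untested sketch; return_sorted needs a recent master, and
np.interp is just one way to get values at new points):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

x = np.random.uniform(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.3, size=x.size)

# fitted values in the original observation order
y_fitted = lowess(y, x, frac=0.3, return_sorted=False)

# "predict" at new x values by interpolating the sorted fitted curve
smoothed = lowess(y, x, frac=0.3)            # sorted (x, fitted y) pairs
x_new = np.linspace(0, 10, 50)
y_new = np.interp(x_new, smoothed[:, 0], smoothed[:, 1])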

Josef

William Lane
Jun 10, 2014, 12:41:08 PM
to pystat...@googlegroups.com
I believe a key difference from the R package is that R's loess allows multidimensional models, i.e., multiple predictors. As far as I can see, the Python implementation only accepts a single predictor. Is this correct?

William



josef...@gmail.com
Jun 10, 2014, 1:07:49 PM
to pystatsmodels
On Tue, Jun 10, 2014 at 12:41 PM, William Lane <wilan...@gmail.com> wrote:
> I believe a key difference with the R package is that it allows
> multidimensional models, i.e., multiple predictors. As far as I can see, the
> Python implementation only accepts single predictors. Is this correct?

Yes, the main purpose was as a smoother for graphical analysis or
visualization. I don't know how difficult it would be to extend the
Cython code to handle several predictors.

We do have kernel regression in an almost-finished state that handles
multivariate predictors and is designed for the case where the
predictors can be of different types: continuous, ordered, or
categorical.
It also offers a choice between local linear and local polynomial
regression, and it has options for bandwidth choices and automatic
bandwidth calculation.
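Rough, untested sketch of that interface (the names follow the current
KernelReg code, but the API may still change): two continuous predictors,
local linear fit, cross-validated bandwidths, and prediction at new
points via data_predict.

import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

x1 = np.random.uniform(0, 1, 200)
x2 = np.random.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x1) + x2 ** 2 + np.random.normal(scale=0.1, size=200)

kr = KernelReg(endog=y, exog=np.column_stack([x1, x2]),
               var_type='cc', reg_type='ll', bw='cv_ls')
y_hat, _ = kr.fit()                                   # fit at the sample points
x_new = np.column_stack([np.linspace(0, 1, 20), np.full(20, 0.5)])
y_new, _ = kr.fit(data_predict=x_new)                 # predict at new points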

However, it doesn't do robust estimation like lowess, and it should be
much slower than lowess for large samples.
One alternative for robust estimation would be to use splines (from
patsy) together with RLM; see the sketch below.
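Something like this, untested, for the splines-plus-RLM idea (the degrees
of freedom and the interaction term are just placeholders):

import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

x1 = np.random.uniform(0, 1, 300)
x2 = np.random.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x1) + (x2 - 0.5) ** 2 + np.random.normal(scale=0.1, size=300)

# spline basis for each predictor plus a simple interaction term
X = dmatrix("bs(x1, df=5) + bs(x2, df=5) + x1:x2",
            {"x1": x1, "x2": x2}, return_type='dataframe')
res = sm.RLM(y, X).fit()          # robust (Huber) fit by default
y_hat = res.predict(X)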

What's your usecase?

Josef

William Lane
Jun 11, 2014, 4:30:47 AM
to pystat...@googlegroups.com
I am trying to develop a functional surrogate model to reduce simulations. Essentially I simulated my system over a wide range of parameters (predictors), and now I want to use the results to derive a function based on the predictors. It is really just a case of non-linear multiple regression, where I need the final model, e.g., f = a*x^2 + b*(y - c)^2 + d*z + ...

I have yet to find something that will do this. I have tried a MARS approach (the py-earth and earth packages) but can't get them to produce anything but piecewise linear models (I'd like piecewise quadratic or spline fits at the least). To give you an idea of the trends, I've attached a figure (of 5 predictors: phi, Us, Vs, Dcyl, acyl).

If you have any recommendations, please share them, I am desperate!

Thanks,

William
gp1.pdf

josef...@gmail.com
Jun 11, 2014, 9:50:07 AM
to pystatsmodels
On Wed, Jun 11, 2014 at 4:30 AM, William Lane <wilan...@gmail.com> wrote:
> I am trying to develop a functional surrogate model to reduce simulations.
> Essentially I simulated my system over a wide range of parameters
> (predictors), and now I want to use the results to derive a function based
> on the predictors. It is really just a case of non-linear multiple
> regression, where I need the final model, e.g., f = a*x^2 + b*(y - c)^2 +
> d*z + ...
>
> I have yet to find something that will do this. I have tried using a MARS
> approach (py-earth and earth packages) but can't get them to provide
> anything by piecewise linear models (I'd like to get piecewise quadratic or
> spline at the least). To give you an idea of the trends, I've attached a
> figure (of 5 predictors: phi, Us, Vs, Dcyl, acyl).
>
> If you have any recommendations, please share them, I am desperate!

I have no practical experience with response surface modeling.

Kernel regression or RBF interpolation in scipy would work, but they
don't produce "parameters"; we need the entire sample to predict.

(Penalized) splines should also work, but you will also have to model
interaction effects.

However, I'm a fan of series approaches, and would just try out
different basis functions instead of using piecewise or local function
estimation. Your trends look pretty smooth and don't have clear
breaks, AFAICS.

Since I just saw this yesterday when I looked up what GEP Box was
doing besides time series analysis:
some classes of nonlinear models might provide a more parsimonious
parameterization
http://en.wikipedia.org/wiki/Polynomial_and_rational_function_modeling#Rational_function_models


--
Series expansion (including splines) or polynomials are nice because
we can use linear models and the extra features we have.

I don't know how far numpy got with multivariate polynomials; Charles
had code for multivariate Bernstein polynomials.
If multivariate polynomials are not available, I would just start with
univariate polynomials, for example Chebyshev
(numpy.polynomial.chebyshev.chebvander), and add interaction terms (the
design won't be orthogonal, but that doesn't really matter).
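Untested sketch of what I mean (the degrees and the interaction term are
just placeholders):

import numpy as np
import statsmodels.api as sm
from numpy.polynomial.chebyshev import chebvander

x1 = np.random.uniform(-1, 1, 300)
x2 = np.random.uniform(-1, 1, 300)
y = 2 * x1 ** 2 + (x2 - 0.3) ** 2 + np.random.normal(scale=0.1, size=300)

# univariate Chebyshev bases; drop the duplicate constant column from one
B1 = chebvander(x1, 3)                 # columns T0..T3 in x1
B2 = chebvander(x2, 3)[:, 1:]          # columns T1..T3 in x2
inter = (x1 * x2)[:, None]             # a simple interaction term
X = np.column_stack([B1, B2, inter])

res = sm.OLS(y, X).fit()
y_hat = res.predict(X)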

Another type that I bump into more often recently is fractional
polynomials, which don't just have positive integer powers but also
negative and fractional powers and logs.
See http://www.stata.com/stata13/fractional-polynomials/ and the link
to the manual entry at the bottom. Something I want to try out soon,
since Stata has it and it looks very useful and easy to implement.
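Untested sketch of such a fractional-polynomial style basis for a single
positive predictor (the power set here is the usual FP candidate set, with
the log term standing in for power 0):

import numpy as np

def fracpoly_basis(x, powers=(-2, -1, -0.5, 0.5, 1, 2, 3)):
    # design matrix with x**p columns plus a log(x) column
    x = np.asarray(x, dtype=float)
    cols = [x ** p for p in powers] + [np.log(x)]
    return np.column_stack(cols)

X_fp = fracpoly_basis(np.random.uniform(0.1, 10, 300))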

In terms of estimation, I would heavily overparameterize and penalize.
Maybe some of scikit-learn's methods will help to regularize or select
variables.
But I'm not a big fan of default penalization priors when we have
specific information about the structure of the model.
I would choose stronger penalization for higher-order terms and
interaction effects, and not penalize, or only weakly penalize, the
low-order effects (constant, linear trend and maybe one more).
I have only worked with L2 penalization so far (sparseness is only
really interesting if we have a huge number of regressors, IMO).
https://picasaweb.google.com/106983885143680349926/Joepy?noredirect=1#5688102596988651330
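One way to get those unequal penalty weights, untested and with made-up
numbers, is to rescale the columns and use a plain ridge penalty: a
uniform penalty on the rescaled design is the same as penalty weight w_j
on the original coefficient j.

import numpy as np
from sklearn.linear_model import Ridge

x = np.random.uniform(-1, 1, 300)
X = np.column_stack([x, x ** 2, x ** 3, x ** 4])       # low-order columns first
y = 1.5 * x - 0.5 * x ** 2 + np.random.normal(scale=0.1, size=300)

w = np.array([1e-8, 1e-8, 1.0, 10.0])    # barely penalize the low-order terms
X_scaled = X / np.sqrt(w)
ridge = Ridge(alpha=1.0, fit_intercept=True).fit(X_scaled, y)
coefs = ridge.coef_ / np.sqrt(w)          # coefficients on the original scale
y_hat = ridge.predict(X_scaled)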

Similarly for splines, the common penalization is on the differences
between consecutive parameters, or on the total variation of
derivatives (? I never tried that), not a ridge or L1
uniform/standardized penalization toward zero.

aside 1: Since this is just "curve-fitting", it won't matter that the
statistics of the final results (standard errors and so on) can or will
be heavily biased because of the specification search.


It would be useful to have a response surface category in statsmodels,
but there is currently nothing available out of the box.

aside 2:
In case someone thinks I'm doing Bayesian analysis with an informative
prior: No, I'm not. I'm doing classical analysis that "just happens" to
use "prior" information and that produces the same parameter estimates
as a normal-normal conjugate Bayesian analysis. And I use robust
classical standard errors for the parameter estimates. :)

Josef

Sturla Molden
Jun 13, 2014, 12:49:20 PM
to pystat...@googlegroups.com
<josef...@gmail.com> wrote:

> I have no practical experience with response surface modeling.

SVM regression can be used to model a surface. Basically any manifold
will be a hyperplane in a higher-dimensional space. With an SVM you can
go all the way up to an infinite-dimensional space, and then any
manifold in our 3D world is guaranteed to be a hyperplane.
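For example, an untested sketch with scikit-learn's SVR and an RBF kernel
(the hyperparameters C and epsilon are placeholders that would normally be
tuned, e.g. by cross-validation):

import numpy as np
from sklearn.svm import SVR

X = np.random.uniform(-1, 1, size=(300, 5))        # 5 predictors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + np.random.normal(scale=0.1, size=300)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.01).fit(X, y)
y_hat = svr.predict(X)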

Sturla

Owen Solberg
Jan 15, 2015, 5:43:10 PM
to pystat...@googlegroups.com
William is correct.  For the sake of clarity:

LOWESS is not the same as LOESS.
LOWESS (locally weighted scatterplot smoothing) came first and, as the name suggests, is univariate.
LOESS (local regression) is a later, more powerful generalization, supporting multivariate regression.  We use it a lot for modeling spatial effects.

statsmodels seems to have LOWESS, but not yet LOESS -- bummer.

Owen

josef...@gmail.com
Jan 15, 2015, 6:08:07 PM
to pystatsmodels
On Thu, Jan 15, 2015 at 5:43 PM, Owen Solberg <owen.s...@gmail.com> wrote:
> William is correct.  For the sake of clarity:
>
> LOWESS is not the same as LOESS.
> LOWESS (locally weighted scatterplot smoothing) came first and, as the name suggests, is univariate.
> LOESS (local regression) is a later, more powerful generalization, supporting multivariate regression.  We use it a lot for modeling spatial effects.
>
> statsmodels seems to have LOWESS, but not yet LOESS -- bummer.


As the name indicates, LOWESS (locally weighted scatterplot smoothing) is intended (or was sold to me) as a visual or graphics function for scatterplot smoothing.  It's not a full `model`.

We do have kernel regression (local constant or local polynomial) in statsmodels.nonparametric, which supports explanatory variables that are continuous, binary, or ordered, but it doesn't have the robustness iterations of lowess.

However, the code is currently in need of refactoring; part of it is in the sandbox.
We have several PRs waiting, but I haven't been able to get around to reviewing them in some time.


Josef