Hi:
To expand on Hadley's comments, I would suggest that you consult
?loess and read the chapter in the White book that it cites to gain a
better understanding of how the loess procedure operates. I do not see
why one would want to use a span of 1 to fit a loess model with 40K
observations.
By its nature, loess is a "local regression" algorithm, with emphasis
on the "local". The span argument controls the proportion of the data
that should be used to produce each local fit - the wider the span,
the smoother the fitted function. If you look carefully at the
algorithm itself, you'll discover that it's VERY memory intensive, so
if you insist on a loess fit with a large sample size, at least reduce
the span. Here is an example to illustrate that you can indeed fit a
loess model in ggplot2 with 40K observations.
x <- seq(0, 100, length.out = 40000)
# A periodic function plus standard normal noise
DF <- data.frame(x = x, y = 1 + sin(x) + 0.5 * cos(2 * x) + rnorm(40000))
library(ggplot2)
# Uses the default "auto" method to which Hadley referred:
ggplot(DF, aes(x = x, y = y)) +
  geom_point(alpha = 0.05, shape = 21) +
  geom_smooth(size = 1)
With 1000 or more observations, the "auto" method fits a gam rather
than a loess. The result, which finishes rather quickly, is more or
less equivalent to a loess model with a large span (such as 1), but
far more computationally efficient. The periodicity is almost
completely ignored in the fitted curve because it has been averaged
away.
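
As a rough sketch of what the default is doing behind the scenes,
assuming the "auto" method resolves to mgcv::gam() with the formula
y ~ s(x, bs = "cs") for data of this size (as the ggplot2 documentation
describes), you could fit the same kind of model directly. The seed and
the names gam_fit, fitted_range and signal_range are my own choices for
illustration:

```r
library(mgcv)

set.seed(123)  # arbitrary seed, only for reproducibility
x <- seq(0, 100, length.out = 40000)
DF <- data.frame(x = x, y = 1 + sin(x) + 0.5 * cos(2 * x) + rnorm(40000))

# Penalized cubic regression spline with mgcv's default basis dimension
gam_fit <- gam(y ~ s(x, bs = "cs"), data = DF)

# Spread of the fitted curve versus the spread of the true periodic signal
fitted_range <- diff(range(fitted(gam_fit)))
signal_range <- diff(range(1 + sin(x) + 0.5 * cos(2 * x)))
c(fitted = fitted_range, signal = signal_range)
```

If the fitted range comes out much smaller than the signal range, that
is the "averaged away" behaviour described above: the smooth is too
stiff to follow the oscillations.
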
To capture the periodicity with a local regression algorithm, you need
to reduce the proportion of the data devoted to each local fit. The
following call takes an estimated 1.5-2 minutes to run, but it does
eventually produce a loess fit on my laptop (12 GB RAM, 64-bit R 3.2.0,
i7 chip):
ggplot(DF, aes(x = x, y = y)) +
  geom_point(alpha = 0.05, shape = 21) +
  geom_smooth(method = "loess", span = 0.1, size = 1)
When I ran this in the R GUI, I got the "Not responding" message while
R was cranking away mightily, but eventually a graph did appear.
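
Before committing to a fit on the full data, it can help to time base
R's loess() on smaller samples of the same function to see how the cost
grows. This is only a sketch: the helper time_loess(), the sample sizes
and the seed are made up for illustration, and timings will vary by
machine. Note also that geom_smooth() draws a confidence band by
default, so expect the plot to take longer than the bare fit timed here:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: elapsed seconds for one loess fit on n simulated points
time_loess <- function(n, span) {
  x <- seq(0, 100, length.out = n)
  y <- 1 + sin(x) + 0.5 * cos(2 * x) + rnorm(n)
  system.time(loess(y ~ x, span = span))[["elapsed"]]
}

sizes <- c(1000, 2000, 4000)
elapsed <- sapply(sizes, time_loess, span = 0.1)
data.frame(n = sizes, seconds = elapsed)
```
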
You should be able to get a more accurate local fit by reducing the
span further, since span = 0.1 in this example means that it's using
approximately 4000 points per local fit, which is far more than it
needs for a curve this simple in form. The following call took about
8-10 seconds, with one difference in the specification:
ggplot(DF, aes(x = x, y = y)) +
  geom_point(alpha = 0.05, shape = 21) +
  geom_smooth(method = "loess", span = 0.005, size = 1)
In this call, span = 0.005 means that approximately 200 observations
are used in each local fit, which is still fairly large. I would
recommend experimenting with slightly smaller and larger spans to see
how they affect the fitted loess model. The choice of span should be
informed by the number of rows in the input data frame, the shape of
the noisy input function and the degree of smoothness desired.
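
One hedged way to run that experiment without waiting on the full 40K
rows is to fit base R's loess() on a smaller sample of the same function
and compare each fit against the known curve. The sample size, seed,
candidate spans and the names truth and rmse below are my own
illustrative choices, not part of the original example:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 2000    # far fewer than 40K so each fit is quick
x <- seq(0, 100, length.out = n)
truth <- 1 + sin(x) + 0.5 * cos(2 * x)
y <- truth + rnorm(n)

spans <- c(1, 0.1, 0.02)

# Root mean squared error of each loess fit against the noise-free curve
rmse <- sapply(spans, function(s) {
  fit <- loess(y ~ x, span = s)
  sqrt(mean((fitted(fit) - truth)^2))
})

# Approximate number of observations in each local fit: span * n
data.frame(span = spans, points_per_fit = round(spans * n), rmse = rmse)
```

Keep in mind that the same span covers fewer points when there are
fewer rows (span = 0.02 here is only about 40 points per local fit), so
a span tuned on a small sample will not transfer unchanged to the full
data.
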
The example was deliberately chosen to illustrate why the choice of
span matters in loess. More generally, the error message you received
is an indirect sign that loess is a memory hog, and that you need to
understand how the algorithm works in order to use it productively.
Dennis