Introduction
-------------
Recently there were some questions in this newsgroup about
smoothing with Matlab. In this message I would like to draw
attention to the oldest of smoothing techniques (Whittaker, 1923),
and to show how easily it can be implemented in Matlab. It can be
used for linear smoothing, but also for smoothing of histograms;
in that way it gives us a fast and practical algorithm for non-
parametric density estimation.
Suppose that we want to smooth a discrete (time) series y with m
observations. I use TeX notation below. For each y_i we would like
to find a smoothed value z_i, such that z is a smooth series, but
the z_i do not differ too much from the y_i. The smoothness of z
can be measured by the sum of the squared differences z_i - z_{i-1},
and the faithfulness to the data by the sum of the squared
differences y_i - z_i. These two criteria can be put together in a
weighted combination:
S = \sum (y_i - z_i)^2 + \beta \sum (z_i - z_{i-1})^2.
In the first sum i runs from 1 to m, in the second from 2 to m.
The parameter \beta controls the smoothness of z: the higher
\beta, the smoother z.
This approach to smoothing was proposed by Whittaker. However, he
did not use first differences, z_i - z_{i-1}, but third differences,
z_i - 3z_{i-1} + 3z_{i-2} - z_{i-3}. The principle is the same, and
we will see below that it is very easy to work with any order of
the differences in Matlab.
I will not go into the derivation of the resulting equations; it
can be found in Whittaker's paper, or in (Eilers, 1994), where I
have given implementations in C. (Briefly: setting the derivative
of S with respect to z to zero gives the linear system
(I + \beta D'D) z = y, with D the matrix that forms the
differences of z.)
Linear smoothing of a series
----------------------------
In Matlab, all you have to do is the following:
m = length(y);
p = diff(eye(m), 1);
z = (eye(m) + beta * p' * p) \ y;
That's all! Experiment with beta and see the results.
The above algorithm uses first differences. To work with third
differences, use p = diff(eye(m), 3). Thanks to the generality of
the diff() function in Matlab, it is trivial to work with any
order of the differences.
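As a cross-check on the algebra, the same computation can be sketched in NumPy (the function name whittaker and the example series are illustrative, not part of the original message):

```python
import numpy as np

def whittaker(y, beta, order=1):
    """Whittaker smoother: minimize sum((y - z)^2) + beta * sum(diff(z, order)^2)."""
    m = len(y)
    # Difference matrix, as in Matlab's diff(eye(m), order)
    D = np.diff(np.eye(m), order, axis=0)
    return np.linalg.solve(np.eye(m) + beta * D.T @ D, y)

# A zigzag series; larger beta gives a smoother z
y = np.arange(10.0) + np.array([0.5, -0.5] * 5)
z = whittaker(y, beta=100.0)
```

With beta = 0 the system reduces to z = y, so the amount of smoothing can be dialed continuously.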
There is one drawback: the size of the system of equations is m by m.
For long series this can become impractical. The system has a small
bandwidth, so maybe the provisions for sparse matrix computations in
Matlab can help here; I have no experience with that yet (I'm working
with the Student Edition, version 3.5). In the C implementation I
mentioned earlier, the sparseness is exploited, so that large values
of m (thousands or more) can be handled. That algorithm could easily
be translated to Matlab, but the elegance is lost.
A simple extension is useful. Say that we want to introduce a vector
of weights w. Then S becomes:
S = \sum w_i (y_i - z_i)^2 + \beta \sum (z_i - z_{i-1})^2.
The Matlab algorithm changes slightly:
m = length(y);
p = diff(eye(m), 1);
z = (diag(w) + beta * p' * p) \ (w .* y);
You can use the weights for interpolation: where no value of y_i is
known, put w_i = 0 and give y_i an arbitrary value; take w_i = 1 for
all other observations.
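The weighted variant, with w_i = 0 marking missing observations, can be sketched in NumPy as follows (the function name whittaker_weighted and the example values are illustrative):

```python
import numpy as np

def whittaker_weighted(y, w, beta, order=1):
    """Weighted Whittaker smoother; zero weights let the penalty interpolate."""
    m = len(y)
    D = np.diff(np.eye(m), order, axis=0)
    return np.linalg.solve(np.diag(w) + beta * D.T @ D, w * y)

y = np.arange(10.0)          # a straight line ...
w = np.ones(10)
w[4:6] = 0.0                 # ... with two observations marked missing
y[4:6] = 99.0                # arbitrary placeholder values, ignored by the fit
z = whittaker_weighted(y, w, beta=1.0)
```

The placeholder values do not leak into z, because the corresponding rows of the data term carry zero weight.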
Scatterplot smoothing
---------------------
With some extra work, we can use this algorithm for scatterplot
smoothing, for any number of observations. Let x and y be vectors
that give m observations, with xmin < x_i < xmax for all i.
Now we "bin" the observations into two vectors c and s, each with n
bins, based on x. In c we count the number of observations in each
bin; in s we sum the values of all y_i's with x_i in that bin. A
Matlab script to do this is:
c = zeros(n,1);
s = c;
dx = (xmax - xmin) / n;
for i = 1:length(x)
j = floor((x(i) - xmin) / dx) + 1;
c(j) = c(j) + 1;
s(j) = s(j) + y(i);
end;
Then we smooth in a similar way as above:
n = length(c);
p = diff(eye(n), 1);
z = (diag(c) + beta * p' * p) \ s;
To see the results:
plot(x,y,'o');
hold;
u = xmin - dx/2 + (1:n) * dx;
plot(u, z);
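The binning and smoothing steps translate directly to NumPy; a sketch (the function names bin_xy and smooth_binned are illustrative, and points at exactly xmax are clipped into the last bin):

```python
import numpy as np

def bin_xy(x, y, n, xmin, xmax):
    """Count observations (c) and sum the y values (s) per bin, as in the Matlab loop."""
    dx = (xmax - xmin) / n
    c = np.zeros(n)
    s = np.zeros(n)
    for xi, yi in zip(x, y):
        j = min(int((xi - xmin) / dx), n - 1)   # clip xi == xmax into the last bin
        c[j] += 1
        s[j] += yi
    return c, s

def smooth_binned(c, s, beta, order=1):
    """Smooth the binned sums, weighting each bin by its count."""
    n = len(c)
    D = np.diff(np.eye(n), order, axis=0)
    return np.linalg.solve(np.diag(c) + beta * D.T @ D, s)

x = np.linspace(0.0, 1.0, 200)
y = x ** 2
c, s = bin_xy(x, y, n=20, xmin=0.0, xmax=1.0)
z = smooth_binned(c, s, beta=0.1)
```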
Histogram smoothing
-------------------
Simonoff (1983) used first-order differences for smoothing of
contingency tables with ordered categories, but higher-order
differences are no problem. The idea is that in each cell of the
histogram you estimate the expected value mu_i of the counts, and
assume a Poisson distribution for each cell. This leads to maximum
likelihood estimation with a smoothness penalty.
To guarantee that each mu_i will be non-negative, it is advantageous
to put mu_i = exp(z_i) and use differences of z.
The likelihood equations are non-linear, so iterations are necessary.
Again, I will not go into the details, but give the Matlab script
directly. The data are a vector of counts y, the histogram.
m = length(y);
z = ones(m, 1) * log(sum(y) / m);
z1 = z + 1;
p = diff(eye(m), 3);
b = beta * p' * p;
while max(abs(z - z1)) > 0.001
z1 = z;
mu = exp(z);
z = (diag(mu) + b) \ (y - mu + mu .* z);
end;
My experience is that some 10 iterations are generally needed. In the
last iterations convergence is very fast.
This histogram smoother is attractive. One can prove that
sum(y) == sum(mu),
sum(u .* y) == sum(u .* mu),
sum(u .* u .* y) == sum(u .* u .* mu),
for any linear vector u (e.g. u = 1:m). This means that the mean and
variance of the smoothed histogram are equal to those of the original
histogram, whatever the amount of smoothing.
For high beta, the smoothed histogram approaches a (discretized) normal
distribution.
Another useful property is that the finite support of the histogram is
respected.
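The iteration above, together with the moment-preserving property, can be sketched in NumPy (the function name smooth_histogram and the example counts are illustrative; an iteration cap is added for safety):

```python
import numpy as np

def smooth_histogram(y, beta, order=3, tol=1e-3):
    """Penalized Poisson smoothing of a histogram, with mu_i = exp(z_i)."""
    m = len(y)
    z = np.full(m, np.log(y.sum() / m))     # start from a flat histogram
    D = np.diff(np.eye(m), order, axis=0)
    B = beta * D.T @ D
    z1 = z + 1.0
    for _ in range(50):                     # Newton-type iterations with a cap
        if np.max(np.abs(z - z1)) <= tol:
            break
        z1 = z
        mu = np.exp(z)
        z = np.linalg.solve(np.diag(mu) + B, y - mu + mu * z)
    return np.exp(z)

y = np.array([1, 3, 8, 14, 17, 13, 7, 4, 2, 1], dtype=float)
mu = smooth_histogram(y, beta=1.0)
```

At convergence the sums of y and mu agree, as do their first and second moments, because the third-difference penalty annihilates constant, linear and quadratic vectors.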
Conclusion
----------
I hope that the examples will convince you that Whittaker's smoother
is simple and useful. It is relatively unknown, because it gets much
less advertising than splines and kernel-type smoothers (running
weighted means and lines). But it deserves a place in every Matlab
user's toolbox.
Much more can be said about this smoother. It is easy to compute the
"hat" matrix and cross-validation measures. One can smooth binary or
gamma-distributed data in a similar way to that shown for the
histogram. I will not describe these extensions here, as this message
is already long. Contact me if you are interested in details and more
references.
Paul Eilers (pa...@dcmr.nl)
DCMR
Schiedam
The Netherlands
References
----------
Eilers, P. H. C. (1994) Smoothing and interpolation with finite
differences. In: Graphics Gems IV, P. S. Heckbert (ed.). Academic Press.
Simonoff, J. S. (1983) A penalty function approach to smoothing large
sparse contingency tables. The Annals of Statistics 11, 208 - 218.
Whittaker, E. T. (1923) On a new method of graduation. Proceedings of the
Edinburgh Mathematical Society 41, 63 - 75.