
Density Estimation


cl...@tukey.stanford.edu

Dec 13, 1993, 9:04:39 PM

Herman Rubin writes:
>In article <1993Dec8.0...@EE.Stanford.EDU> cl...@tukey.Stanford.EDU ( ) writes:
>>Herman Rubin writes:
>
>>And what do armed nuclear weapons have to do with density estimation?
>
>The intent was to indicate both the magnitude of the danger and the
>seriousness of the problem. Knowing the mean and variance of the
>distribution of X may or may not be of value, and the one who just
>plugs into a data analysis program is unlikely to know the difference.
>Density estimation is not of much value unless it is going to be used
>for something. And in this field, very little has been done by looking
>at the PROBLEM; finding a solution is much closer to vacuous theory
>than facing the real problem.

Really? I can't speak for those I don't know, but as far as my department
at Bell Labs is concerned, this statement is plainly wrong, and there are
several of us here who have worked in the field of local fitting. Our work
has been almost entirely motivated by real engineering and manufacturing
problems for which other existing methods have been found to be unsatisfactory.

>>Sorry, but these are just cheap cop outs. It is YOUR responsibility,
>>as the one promoting this parametric approach, to provide the tools and
>>methods to convert the vague ideas one has in practice into formal loss
>>functions, sensible models and prior distributions for the parameters.
>
>These are not copouts, any more than it is a copout if a drug company
>does not know how to come up with a quick cure for AIDS.

So, someone walks into your office with a dataset, and says `I want to
estimate the density'. You say `Go away, come up with a large parametric
model, an estimate (from no data) of the prior density on the parameter
space, and a global loss function, then come back'.

For the sake of the story, assume the client comes back with the requested
items a few days later. You say `The estimate is argmin E_{\pi} L(\hat f,f).
If you wish to compute that, consult a numerical analyst'.
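
(For reference, the formal recipe being parodied here is minimization of
posterior expected loss: in the notation above,

    \hat f = argmin_g E_{\pi(\theta|data)} L(g, f_\theta),

where \pi(\theta|data) is the posterior over the parameter space and f_\theta
is the density indexed by \theta.)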

Is this really the complete role you see for a statistician?

>>As I've said in previous posts, coming up with sensible choices is
>>unreasonable in many applications; no single loss function can adequately
>>capture the performance of a function estimate. Local smoothness of an
>>unknown density is often a reasonable assumption to make. Converting this
>>into a global parametric model and prior distribution is difficult. If you
>>wish to promote a large scale parametric approach, this is YOUR problem.
>
>What is local smoothness? There are hundreds of ways of expressing it.

A question, and a reasonable one at that! A welcome change from your usual
attempts to portray posters as stupid and to dismiss others' work as `just
plain bad'.

By local smoothness, I mean existence of a good polynomial approximation
over the smoothing window, which can be measured in a number of ways; for
example, by maximum error of the approximation. I work with log(density)
rather than the density itself. Formal specification of bounds allows us,
for example, to study what types of structure will be missed through systematic
bias of the estimate, and can be used as a basis for deriving `optimal'
bandwidths. I would never expect a user to specify such a bound (any more
than I would expect them to specify priors or global loss functions).
I will use their questions of interest to determine reasonable bandwidths
for the problem, and at least initially try several fits to show the
difference. Density estimation is not about plugging in global priors,
loss functions or bandwidths and taking the first answer that comes out.
It is about finding a reasonable solution to real problems.
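
To make this concrete, here is a minimal one-dimensional sketch of local
fitting of log(density), in the spirit of local likelihood density estimation
(see the references given later in this thread). It is not the netlib code;
the kernel, polynomial degree, optimizer and bandwidth are illustrative
choices only.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def tricube(u):
    # Tricube weight on [-1, 1], the weight function used in loess.
    return np.where(np.abs(u) < 1, (1 - np.abs(u) ** 3) ** 3, 0.0)

def local_log_density(x0, data, h, degree=2):
    # Local likelihood fit of a degree-`degree` polynomial P_a to log f at x0:
    # maximize  sum_i K((X_i - x0)/h) P_a(X_i - x0)
    #           - n * integral K((u - x0)/h) exp(P_a(u - x0)) du
    # over the coefficients a = (a_0, ..., a_degree); a_0 estimates log f(x0).
    n = len(data)
    v = data - x0
    w = tricube(v / h)

    def neg_local_loglik(a):
        fitted = np.polyval(a[::-1], v)  # P_a evaluated at the data
        integrand = lambda u: tricube(u / h) * np.exp(np.polyval(a[::-1], u))
        integral, _ = quad(integrand, -h, h)
        return -(np.sum(w * fitted) - n * integral)

    start = np.zeros(degree + 1)
    start[0] = np.log(max(np.sum(w) / (n * h), 1e-300))  # crude kernel start
    return minimize(neg_local_loglik, start, method="Nelder-Mead").x[0]

# Example: estimated log-density of a standard normal sample at a few points.
x = np.random.standard_normal(500)
fits = {x0: local_log_density(x0, x, h=0.8) for x0 in (-2.0, 0.0, 2.0)}

With degree 0 the solution collapses, up to the kernel's normalization, to the
ordinary kernel estimate; raising the degree is what reduces the systematic
bias discussed above.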

There is of course no reason to restrict ourselves to polynomial approximations;
if there is prior information about singularities or heavy tails and estimating
these features is of interest, then other forms of local approximation may be
appropriate. In some applications I am fortunate enough to have such
information, and use it. In the vast majority of applications, I do not have
such information, and in such cases a local polynomial approximation will
be just as useful as any other.

Under the most widely studied asymptotics (bandwidth -> 0) the analogous
assumptions are existence of derivatives of appropriate order, with perhaps
boundedness or continuity conditions depending on the precise results being
proved.
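
As a familiar concrete instance (the ordinary kernel estimate with a symmetric
second-order kernel K and bandwidth h -> 0), the leading bias term is

    E \hat f(x) - f(x) \approx (h^2/2) f''(x) \int u^2 K(u) du,

so `derivatives of appropriate order' means, at a minimum, a continuous second
derivative near x; local polynomial versions of the argument push the
requirement to correspondingly higher derivatives.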

>In most "real world" density situations, the density is infinitely
>differentiable, usually even analytic.

Questionable, and I wouldn't place too much faith in estimators that rely
on this assumption. Probably not very relevant to our discussion.

> This by itself is useless for
>inference. If you would read the literature, you would know that I
>have suggested classes of infinite-parametric models--the prior distribution
>problem is another matter indeed, but there are some negative results.

I have read hundreds of papers in density estimation and related fields
over the past few years, and would be glad to add yours if only you would
provide the references. I have been asking for specific details of your
methods throughout this thread, and requested you to send (p)reprints over
two weeks ago.

Until you provide references, you can hardly direct others to read the
literature.

>These negative results are of some importance, in that they point out
>that certain types of simplifications do not work. I also have some
>preliminary results on robust procedures here; they should beat any
>of the kernel procedures.

`they should'. Good scientific argument. I also claim my procedures beat
kernel procedures, particularly with regard to moderate tail performance.

>>In the real world, we don't get to brush aside the difficult parts as
>>`someone else's problem'; `will be done in future work' and all the similar
>>excuses used all too frequently by some in academics.
>
>As I said before, the pharmaceutical company cannot solve the problem of
>treating the disease without input from the biologist. The term "panacea"
>is the alchemical term for the compound which could. The statistician
>is no more capable of finding this than the alchemist was, and for much
>the same reason.

Prior distributions, and (at least for some problems) models and loss
functions are mathematical abstractions of little direct interest to
anyone other than the statistician. It is the job of the statistician
to become sufficiently involved in and knowledgeable about the problem
that informed decisions can be made about such items as they are needed;
if informed decisions cannot be made, other approaches should be tried. I have
no idea as to how to construct meaningful prior densities in possibly
hundreds or thousands of dimensions; until you can provide some guidance
as to how I should do so, your approach (or any other that requires me
to do so) can at best be considered of passing theoretical interest.

Leaving a biologist or engineer to make these decisions would in my view
make for a singularly bad statistician.

>>Using function+derivative at 3 or 4 points in each dimension gives as
>>much or more information than just the function on 5 or 6 points. Poor
>>specification? Maybe, but you will need at least as many parameters as
>>I need points to get a reasonable specification.
>
>You might be able to get one decimal place that way, but I doubt it.
>But even 4 points in each of 8 dimensions, function and derivative,
>will give more than 100,000 numbers, and I have seen nothing in the
>numerical analysis literature which indicates that even 3-dimensional
>interpolation is a reasonable problem. In one dimension, both high
>degree and many points are relatively cheap, but not in more.

In numerical analysis one is usually interested in far higher degrees
of accuracy than in statistics. Since 9*4^8 = 589,824, that specification
already involves more numbers than the original 100,000 data points, so the
data can hardly tell us much more. And your trying to model the situation
with "a 1000-parameter model, or even 10000" will provide about as much
information as I get from a 2^8 grid of points. Less, if I use local
quadratic fitting.

I will also point out that I have never claimed my procedures (or anyone
else's) will do anything spectacular for the original problem; I never
even responded to that post. At best, one might find well separated
clusters or heavy skewness. I responded, Herman, to your ignorant
criticisms of the whole field of density estimation, which were not
restricted to the 8-dimensional case. As such, your trying to return to
this problem is just running for cover, of which you will find very little.

>>If your global fitting involves global optimization (as do all global
>>procedures I'm aware of; you have provided no evidence to suggest otherwise)
>>then a large number of parameters makes the computational burden absolutely
>>untenable.
>
>This is exactly what I dispute. Numerical methods are not that difficult
>if one understands the problem, but it is an art, often not easily done in
>a package.

If the problems are `not that difficult', why are you having so much
difficulty producing anything specific? What special features of your
procedure enable significant speedup over standard methods of global
optimization?

Local fitting is linear in the number of points fitted directly, and therefore
linear in the number of parameters. Do you have global optimization procedures
that are better than linear in the number of parameters?
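
In code the point is trivial: each fit point is an independent small
optimization over a handful of local coefficients, so the total cost is the
number of fit points times a constant. Reusing the illustrative
local_log_density sketch from earlier in this post:

# One small optimization per fit point: cost is linear in len(fit_points),
# and hence linear in the total number of local coefficients.
fit_points = np.linspace(x.min(), x.max(), 50)
log_fhat = np.array([local_log_density(x0, x, h=0.8) for x0 in fit_points])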

>>Finally, if the user wishes to visualize the density estimate, one can
>>only look at lower dimensional cross sections and/or projections. With
>>a local approach, one just computes the views of most interest. With a
>>global approach, you get nothing until you've fitted everything.
>
>It depends on what one wants to visualize. Possibly you want those; I
>have encountered situations where what one wants is the behavior in
>the moderate tails, and this is certainly not possible by crude methods.

Oh. You're resorting to calling other people's work `crude methods' again.
Say something intelligent. Show me precisely in what ways your methods are
better than mine in the moderate tails. Surely it is obvious that, to obtain
good estimates in the tails, one needs some form of localization to prevent
bias being induced by incorrect specification elsewhere?

>>These issues have been thrashed out in detail in the literature, as you
>>would quickly realize if you bothered to read any of the references I've
>>given.
>
>Engineering literature or numerical analysis literature?

The best place to start computing local fits is the paper by Cleveland
and Grosse. This was in the statistics literature. Do I sense a problem?

>>>I do not have enough information about the original problem to produce
>>>an algorithm. It is questionable whether there even IS enough information
>>>to produce a simple algorithm.
>
>>Then what is the basis for your claims of computational superiority?
>>Your claims for superiority of your approach and abuse of existing methods
>>for density estimation have stretched far beyond the original 8 dimensional
>>problem; I fail to see why you are now trying to run for cover behind this
>>problem.
>
>The knowledge of problems for which plugging into computer packages does
>not work, plus a modicum of knowledge of numerical methods, which is not
>covered by the standard texts.

So what are you saying, apart from spouting off as to how you think you are
so much smarter than everyone else? Anyone who wants to estimate a density
has to write their own code from scratch? Are we allowed to use high level
languages, or does this have to be done in machine code?

And how does this answer the question:
"Then what is the basis for your claims of computational superiority?"

You have still not provided a scrap of evidence that you have ever implemented
or tried out your method.

>When someone asks for an 8-dimensional object, the assumptions are already
>of great importance. Judging from the little stated in the problem, the
>low-dimensional sections are likely to miss the important points.

As opposed to failing to visualize the fit at all, which will miss just about
everything.

Early in your first post on this topic, you included:

>In my opinion, this is not even the case for density
>estimation in *1* dimension; there are ad hoc procedures, and for some of
>them their properties have been studied, but none from a good analysis of
>the problem.

Now, stop trying to run for cover behind the 8 dimensional problem, and
start trying to justify your claims. You certainly haven't shown your
approach offers anything special for the 8-d problem, or any other problem.

>>Statistics in fact has very little to do with choosing between models.
>>In industry, the problems are to improve quality and productivity.
>>These are complicated problems with no simple model, and considerable advances
>>have been brought about through data analysis, often with no more than some simple
>>data visualization. Saying this is more likely to do damage than to bring
>>understanding simply falls flat in the face of evidence.
>
>If one is doing quality control, or operating a factory, that is the case.

Hey! We agree on something.

>If one is carrying out a scientific investigation, this is not the case.
>In a moderately well-tuned production facility, one may be looking at
>small perturbations, and low degree polynomials can be good fits, or
>it may be the case that most variables do not matter, or one of a
>number of other low-dimensional alternatives.

Local fitting does not restrict one to using local polynomials. In some
of my applications, I don't use local polynomials. There are procedures
for both global and local variable selection; procedures such as MARS do
the latter in the regression problem. There are semiparametric approaches and
conditionally parametric approaches for use where some variables are more
important than others; one can also smooth with different bandwidths in
different directions. You are hardly raising anything original here.

> In higher dimensional
>problems, linear approximations are often used over ranges where linearity
>is overly restrictive, or other methods claimed to be robust, but which are
>only robust under massive assumptions, usually unstated, are produced.

And others propose using "a 1000-parameter model, or even 10000", which is
inadequate except under massive assumptions. Often in high dimensions
there's insufficient data for anything more than a coarse fit.

cl...@tukey.stanford.edu

Dec 13, 1993, 9:26:23 PM

st...@csv.warwick.ac.uk (J E H Shaw) writes:

>
>>>Using function+derivative at 3 or 4 points in each dimension gives as
>>>much or more information than just the function on 5 or 6 points. Poor
>>>specification? Maybe, but you will need at least as many parameters as
>>>I need points to get a reasonable specification.
>I don't understand what you (clive) are saying here. If you want
>"as much information" then surely you need all the mixed derivatives
>at each point (e.g. you can fit a bicubic surface to a 4x4 grid of points,
>or to f, df/dx, df/dy, *d2f/dxdy* at the corners of a rectangle).
>Or else you need to make some VERY heroic assumptions (particularly
>in, say, 8 dimensions).

The bicubic comparison would surely be right for comparing the function
on 7 points/side with interpolation on 3 points/side? I won't try to claim
that much. With 100000 points (as the original problem had), 9 numbers per
grid point (function plus 8 first derivatives) on a grid of 3 or 4 points per
side has fairly well exhausted the data. I haven't (and don't) claim to do
anything great for an 8-d problem; rather, as I emphasize in my accompanying
post, Herman, by trying to return to this problem, is just looking for cover
that does not exist.

It should also be noted that the interpolation scheme does not need to provide
a great approximation to direct estimation; rather, the additional error should
not be large relative to the standard deviation and bias of the estimate.
In 8 dimensions these are inevitably large.
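
For orientation, the standard kernel-estimation rates make that last point
explicit: in d dimensions

    Var \hat f(x) \approx f(x) \int K^2 / (n h^d),    Bias \hat f(x) = O(h^2),

so the best achievable mean squared error is of order n^{-4/(d+4)}; with d = 8
this is n^{-1/3}, and with n = 100000 the noise and bias swamp any modest
interpolation error.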

>Also, in higher dimensions, a Cartesian grid of points is a very
>inefficient set at which to obtain information about a function.
>Conway & Sloane have done some work on "quantisation" and similar
>problems

I agree 100%; Herman brought up the grid and there was already plenty
to argue...

In fact, the problem with a Cartesian grid is even more serious than this:
in density estimation the data are much more spread out in the tails, where
much larger bandwidths need to be used, and many more points need to be fitted
in regions of high density. The k-d tree structures used by Cleveland and
Grosse in loess accomplish this through adaptive choice of knots. This leads
to substantial savings in computation, although I doubt it is the most
efficient scheme possible; some adaptive triangulation ought to do better.
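
Here is a minimal sketch of the adaptive idea: a median-split cell tree driven
by local data counts. It is not the loess k-d tree code itself; the splitting
rule and the count threshold are illustrative choices.

import numpy as np

def grow_cells(points, lo, hi, max_points=200, depth=0, max_depth=20):
    # Recursively split a rectangular cell at the median of its widest side
    # until each leaf holds at most max_points data points.  Dense regions end
    # up with many small cells (hence many fit points); sparse tails with few.
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    pts = points[inside]
    if len(pts) <= max_points or depth >= max_depth:
        return [(lo, hi)]                       # leaf cell: fit at its corners
    dim = int(np.argmax(hi - lo))               # split the widest dimension
    cut = np.median(pts[:, dim])
    upper_of_lower = hi.copy(); upper_of_lower[dim] = cut
    lower_of_upper = lo.copy(); lower_of_upper[dim] = cut
    return (grow_cells(pts, lo, upper_of_lower, max_points, depth + 1, max_depth)
            + grow_cells(pts, lower_of_upper, hi, max_points, depth + 1, max_depth))

# Example: a 2-d normal sample; the cells concentrate where the data do.
data = np.random.standard_normal((10000, 2))
cells = grow_cells(data, data.min(axis=0), data.max(axis=0))

In the actual loess code the splitting rule is tied to the bandwidth, fits are
done at cell vertices, and values elsewhere are obtained by blending polynomial
interpolants across cells; the sketch is only meant to show where the
adaptivity comes from.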

I've seen some of Sloane's work (his office is about 20 feet away from mine...).
Interesting, although a number of additional issues need to be considered
for local fitting:
1) Growth of the tree needs to be fast, as it is used purely as a
computation-saving device (I've heard stories of Sloane grinding Crays to a halt).
2) The growth should be recursive or sequential, rather than iterative.
The decision to terminate growth must be based on, or related to, the local
bandwidth specification, which at least for some schemes means fitting
at a point as soon as it is added to the tree.
3) Fast interpolation: for a new point, one must be able to quickly
find the boundaries of the cell it falls in.

>Please could you repeat the references. Thank you.

The bible, covering everything worth knowing in density estimation up to
a year or so ago:
David W. Scott, Multivariate Density Estimation: Theory, Practice and
Visualization. Wiley, 1992.

The work on loess is a great source of ideas, although it isn't density
estimation. The first reference covers motivation and practice of local
fitting; the second, computation:
Cleveland and Devlin: Locally Weighted Regression ... 1988 JASA 83, 596-610.
Cleveland and Grosse: Computational Methods for Local Regression.
1991 Statistics and Computing 1, 47-62.

Some methodological developments in density estimation since Scott's book:

Stone and Kooperberg - log-spline methods: J. Comp. Graph. Stat., 1993,
and Comp. Stat. & Data Anal., 1991; software from statlib.

Gu and Qiu on penalized likelihood methods; 1993 Ann. Stat., 217-234,
and June 1993 JASA. The latter describes their computational procedure.

Local Likelihood Density Estimation; ftp'able from netlib.att.com, file
netlib/stat/doc/93-31.Z; software netlib/stat/prog/density.shar.Z.
Chris Jones (Open U.) and Nils Hjort (Univ. Oslo) are doing some related
work.

Herman Rubin

Dec 14, 1993, 3:04:46 PM

In article <1993Dec14....@EE.Stanford.EDU> cl...@tukey.Stanford.EDU ( ) writes:
>Herman Rubin writes:
>>In article <1993Dec8.0...@EE.Stanford.EDU> cl...@tukey.Stanford.EDU ( ) writes:
>>>Herman Rubin writes:

....................

>Really? I can't speak for those I don't know, but as far as my department
>at Bell Labs is concerned, this statement is plainly wrong, and there are
>several of us here who have worked in the field of local fitting. Our work
>has been almost entirely motivated by real engineering and manufacturing
>problems for which other existing methods have been found to be unsatisfactory.

If you are concerned only with the behavior in a small region where the
density keeps away from 0, your methods may give good results, or they
may not.

>>>Sorry, but these are just cheap cop outs. It is YOUR responsibility,
>>>as the one promoting this parametric approach, to provide the tools and
>>>methods to convert the vague ideas one has in practice into formal loss
>>>functions, sensible models and prior distributions for the parameters.

>>These are not copouts, any more than it is a copout if a drug company
>>does not know how to come up with a quick cure for AIDS.

>So, someone walks into your office with a dataset, and says `I want to
>estimate the density'. You say `Go away, come up with a large parametric
>model, an estimate (from no data) of the prior density on the parameter
>space, and a global loss function, then come back'.

I would try to get the client to consider the assumptions, and even ask
presumably relevant questions to educe them. If the client knows something
about probability modeling, it may not be too difficult.

>For the sake of the story, assume the client comes back with the requested
>items a few days later. You say `The estimate is argmin E_{\pi} L(\hat f,f).
>If you wish to compute that, consult a numerical analyst'.

I can switch hats and act as a numerical analyst. But some of my
colleagues will send the client to me for these purposes.

Sometimes the appropriate methods are in the packages, but quite often
they are not. Here is an actual problem I solved; the client wanted
tail approximations for the distribution of the sum of a fair number of
absolute normal random variables, in the extreme tails.
I came up with what turned out to be good approximations; I have not
written them up. But to check them, it was necessary to do a reasonable
job of computing the probabilities. Now where is there a statistics
package which will compute the moment generating function of the
absolute normal for complex arguments, and also which can do the
appropriate numerical integration of the inversion formula for the
tail cdf? How many consultants would even come up with this (to me)
quite straightforward procedure?
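
For what it is worth, here is a rough illustration of the kind of computation
involved (not the actual calculation referred to above; the contour and the
quadrature are just the obvious textbook choices). The MGF of |Z| is
M(s) = 2 exp(s^2/2) Phi(s), which continues to complex arguments, and the tail
probability comes from the inversion formula

    P(S_n > x) = (1/(2 pi i)) int_{c-i inf}^{c+i inf} M(s)^n exp(-s x) / s ds,  c > 0.

import numpy as np
from scipy.special import erfc, wofz
from scipy.optimize import brentq
from scipy.integrate import quad

def mgf_absnormal(s):
    # E exp(s|Z|) = 2 exp(s^2/2) Phi(s) = wofz(-i s / sqrt(2)); valid for complex s.
    return wofz(-1j * s / np.sqrt(2.0))

def tail_prob_sum_absnormal(n, x):
    # P(|Z_1| + ... + |Z_n| > x), for x above the mean n*sqrt(2/pi), by numerical
    # inversion along a vertical contour through (roughly) the saddlepoint.
    dlogM = lambda s: s + np.sqrt(2.0 / np.pi) * np.exp(-0.5 * s * s) / erfc(-s / np.sqrt(2.0))
    c = brentq(lambda s: n * dlogM(s) - x, 1e-8, 50.0)   # saddlepoint: n K'(c) = x
    def integrand(t):
        s = c + 1j * t
        return (mgf_absnormal(s) ** n * np.exp(-s * x) / s).real
    val, _ = quad(integrand, 0.0, np.inf, limit=500)
    return val / np.pi                                   # symmetry halves the contour

# Example: an extreme-tail probability for the sum of 20 absolute normals.
p = tail_prob_sum_absnormal(20, 40.0)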

Now suppose I tell you that there is a method for spectral density
estimation which will do a fairly good job of matching the
performance of the current smoothing type estimates using global
loss even if the optimal smoothing rate is not known? And that
I do not know how to come up with procedures like this except by
using the infinite-parameter approach? (The infinite parameterization
which works here is to use the Fourier series.)
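
(To be concrete: the spectral density of a stationary series is

    f(lambda) = (1/(2 pi)) sum_{k=-inf}^{inf} gamma(k) exp(-i k lambda),

with the autocovariances gamma(k) serving as the infinite parameter vector.)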

>> This by itself is useless for
>>inference. If you would read the literature, you would know that I
>>have suggested classes of infinite-parametric models--the prior distribution
>>problem is another matter indeed, but there are some negative results.
>
>I have read hundreds of papers in density estimation and related fields
>over the past few years, and would be glad to add yours if only you would
>provide the references. I have been asking for specific details of your
>methods throughout this thread, and requested you to send (p)reprints over
>two weeks ago.
>
>Until you provide references, you can hardly direct others to read the
>literature.
>
>>These negative results are of some importance, in that they point out
>>that certain types of simplifications do not work. I also have some
>>preliminary results on robust procedures here; they should beat any
>>of the kernel procedures.
>
>`they should'. Good scientific argument. I also claim my procedures beat
>kernel procedures, particularly with regard to moderate tail performance.
>
>>>In the real world, we don't get to brush aside the difficult parts as
>>>`someone else's problem'; `will be done in future work' and all the similar
>>>excuses used all too frequently by some in academics.

Nor do I. I do not expect that I will always be able to take a procedure
off the shelf for a real problem, and even if I can, I do not expect the
packages to know how to carry out the calculations.

>>As I said before, the pharmaceutical company cannot solve the problem of
>>treating the disease without input from the biologist. The term "panacea"
>>is the alchemical term for the compound which could. The statistician
>>is no more capable of finding this than the alchemist was, and for much
>>the same reason.

>Prior distributions, and (at least for some problems) models and loss
>functions are mathematical abstractions of little direct interest to
>anyone other than the statistician. It is the job of the statistician
>to become sufficiently involved and knowledgeable about the problem, so
>that informed decisions can be made about such items as needed, and if
>informed decisions cannot be made, then to try other approaches. I have
>no idea as to how to construct meaningful prior densities in possibly
>hundreds or thousands of dimensions; until you can provide some guidance
>as to how I should do so, your approach (or any other that requires me
>to do so) can at best be considered of passing theoretical interest.

>Leaving a biologist or engineer to make these decisions would in my view
>make for a singularly bad statistician.

This is only because we have taught methods instead of understanding.
Any look at what is involved in the problem of action under uncertainty
comes up with the loss-prior combination, although not necessarily
separated out. The problem of how to approximate this requires the
statistician to understand the problems, and the client to understand
the ideas. The client does not necessarily have to compute anything.
The high dimensional problems very definitely require approximation
and the use of robustness arguments, and as is the case elsewhere,
going to an infinite number of dimensions directly helps.

..........................

>Local fitting is linear in the number of points fitted directly, and therefore
>linear in the number of parameters. Do you have global optimization procedures
>that are better than linear in the number of parameters?

Why? Just because most papers treat it that way?

>>>Finally, if the user wishes to visualize the density estimate, one can
>>>only look at lower dimensional cross sections and/or projections. With
>>>a local approach, one just computes the views of most interest. With a
>>>global approach, you get nothing until you've fitted everything.

Having started out working with people who realized that their problems
required a global approach, you have not scared me. It is looking at
too many non-global approaches which produces artifacts.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907-1399
Phone: (317)494-6054
hru...@snap.stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!snap.stat!hrubin(UUCP)
