
Physics from Fisher Information


tco...@maths.adelaide.edu.au

May 12, 1999

In article <7h73pp$l77$2...@pravda.ucr.edu>,
ba...@galaxy.ucr.edu (john baez) wrote:

> I wish I could help you, but I can't. When people started talking
> about Friedan's work I went down to the library to look it up, but I
> couldn't make much sense out of his papers.

I've only looked at one of his earlier papers (which may not be typical;
it was in the American Journal of Physics, Vol. 57, No. 11, Nov. 1989),
but it seemed reasonably straightforward to me. In the appendix it also
gives a derivation of the Cramer-Rao lower bound (which is just the
reciprocal of the Fisher information) for readers who are not already
familiar with it, so this may be the best place to start.

> Does *anyone* understand this stuff? If so, could they explain it?
>
> I don't even understand what "Fisher information" really is or why
> people (not just Friedan) are interested in it.

The Cramer-Rao lower bound seems mostly to be used in signal and image
processing (which is how I am familiar with it). In general, suppose
there exists a random process R whose probability distribution depends
on a real parameter A. You can only observe samples from R, but from
these samples you can estimate the parameter A (call this estimate A',
which is itself a random variable, since it depends on your samples).
Now if A' is an unbiased estimate of A (i.e. E(A')=A), then it can be
shown that the variance of A' can be no smaller than the Cramer-Rao
lower bound.

For instance, suppose that you have a Gaussian distributed set of
samples. A Gaussian has two parameters (the mean and the variance), so
suppose we wish to estimate the mean mu. One estimate would be to
average all of the samples to produce the sample mean mu'.

Now the variance of mu' is just sigma^2/N (for N samples drawn from a
Gaussian of variance sigma^2), and if you were to calculate the
Cramer-Rao lower bound, you would find that it also comes to sigma^2/N.
This means that the sample mean is the minimum-variance unbiased
estimator of the mean of a Gaussian distribution. (There are refinements
such as the Bhattacharyya lower bound, and related bounds for biased
estimators, but I don't know very much about these and they are probably
messy mathematically.)
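
As a concrete check, here is a small Python sketch (my own illustration,
not from the original post, assuming a Gaussian with known variance
sigma^2) that compares the empirical variance of the sample mean with the
Cramer-Rao bound sigma^2/N:

# Sketch: variance of the sample mean vs. the Cramer-Rao bound for the
# mean of a Gaussian with known variance sigma^2. The Fisher information
# per sample for mu is 1/sigma^2, so with N samples the bound on Var(mu')
# is sigma^2/N.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 3.0, 2.0, 50, 20000

samples = rng.normal(mu, sigma, size=(trials, N))
mu_hat = samples.mean(axis=1)               # sample mean for each trial

print("empirical Var(mu'):", mu_hat.var())
print("Cramer-Rao bound  :", sigma**2 / N)  # the two should agree closely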

Getting back to the Frieden paper... what he does is to assume that our
measurement of the position of a particle is just an estimate based on
some random process which contains the actual position as a parameter.
The very best unbiased estimate that we can get will have a variance at
best equal to the Cramer-Rao lower bound. Murphy's law (the idea that
nature lets us know as little as possible) then corresponds to maximising
the Cramer-Rao lower bound, or equivalently minimising the Fisher
information. If this optimisation is done with respect to some
positivity constraint on the energy (and this is the dodgy part, because
there is no particular reason for his particular choice of constraint...
it just happens to give the correct answer), then Schroedinger's
equation falls out.

Because of the flexibility in the choice of constraint, I've decided
that this is not a particularly rigorous proof. This was an early paper
in the area, however, so I would be interested to know if he has since
come up with a less hand-wavy justification for his assumption.

I hope this explanation is of some use.


--== Sent via Deja.com http://www.deja.com/ ==--
---Share what you know. Learn what you don't.---


Kresimir Kumericki

May 12, 1999, to physics-...@ncar.ucar.edu

In article <7h73pp$l77$2...@pravda.ucr.edu> John Baez wrote:
> In article <01be9aec$b623a020$2897cfa0@sj816bt720500>,
> Philip Wort <phil...@cdott.com> wrote:

>>I find this book is very difficult to follow. It is full of forward
>>references and has a confusing section structure ... anyway any help
>>would be appreciated, since it seems that there is something interesting
>>here beneath the confusion (perhaps just mine).

> I wish I could help you, but I can't. When people started talking about
> Friedan's work I went down to the library to look it up, but I couldn't
> make much sense out of his papers.

I also took a short look then, so I really cannot claim that I
understood much, but I got the impression that all he derives from
these information concepts is the *kinetic* terms in the Lagrangians of
different physical theories (Newtonian mechanics, QM, etc.).
So I'd also like to hear whether there is more to it than that.
(In particular, I remember that the public announcements mentioned
"Lagrangians" without any qualification.)
This also brings us to the following question. Suppose you are trying
to construct the Lagrangian for some system (and you know its degrees
of freedom; I suppose you would also need to know them in order to
calculate this Fisher information thing), and some method gives you the
kinetic term. Would that be of some help to you or not?
I guess the interesting things are usually in those other terms.


--
-------------------------------------------------------------
Kresimir Kumericki kku...@phy.hr http://www.phy.hr/~kkumer/
Theoretical Physics Department, University of Zagreb, Croatia
-------------------------------------------------------------


Chris Hillman

May 13, 1999

On 11 May 1999, john baez wrote:

> In article <01be9aec$b623a020$2897cfa0@sj816bt720500>,
> Philip Wort <phil...@cdott.com> wrote:

> >I find this book is very difficult to follow.

> I wish I could help you, but I can't. When people started talking about
> Friedan's work I went down to the library to look it up, but I couldn't
> make much sense out of his papers. I'd hoped his book would be easier
> to read, but from what you say it sounds like maybe not....
>
> Does *anyone* understand this stuff? If so, could they explain it?

I had the same reaction to Frieden's papers: I couldn't figure out what he
was trying to say in the first few paragraphs of the papers I downloaded,
so I put them aside... (If anyone else wants to have a go, check the
PROLA archive http://prola.aps.org/search.html)

> I don't even understand what "Fisher information" really is or why
> people (not just Friedan) are interested in it.

As it happens, the same question came up in bionet.info-theory recently,
so I quote my reply below. My post was based on what I found in Cover &
Thomas, so if I've gotten anything wrong, someone should correct what I've
said!

As for why people are interested... well, statisticians find everything
Fisher did of enduring interest for one reason or another, it seems :-)

One additional thing, which I'm not sure I remember quite right, but which
physicists will probably find intriguing, is the notion of a "statistical
manifold", where one can actually make a manifold out of a parametrized
family of distributions in such a way that the Riemann curvature turns out
to be an "entropy" related to Shannon's entropy and the corresponding
connection is related to Fisher's information! Something like that,
anyway --- it's been a decade since I looked at this. Now you're probably
thinking what I thought ten years ago, but when I looked at the books on
this stuff, the expected connections were not apparent. No forms in
sight, it didn't even look like differential geometry. In fact, it looked
very ugly :-(

Chris Hillman

=========== BEGIN REPOST [WITH NEW EXAMPLE] =======================

Date: Fri, 7 May 1999 15:23:58 -0700
From: Chris Hillman <hil...@math.washington.edu>
Newsgroups: bionet.info-theory
Subject: Re: Definition of Fisher Information


On Mon, 26 Apr 1999, Stephen Paul King wrote:

> Could someone give a definiton of Fisher Information that a
> mindless philosopher would understand? :)

Let me start with a couple of intuitive ideas which should help to orient
you. Fisher information is related to the notion of information (a kind of
"entropy") developed by Shannon 1948, but not the same. Roughly speaking

1. Shannon entropy is the volume of a "typical set"; Fisher information
is the area of a "typical set",

2. Shannon entropy is allied to "nonparametric statistics"; Fisher
information is allied to "parametric statistics".

Now, for the definition.

Let f(x,t) be a family of probability densities parametrized by t.

[Example (I just made this up for this repost):

f(x,t) = [2 sin(pi t) / (pi t (1-t))] x^t (1-x)^(1-t)

where 0 < x < 1 and 0 < t < 1. (The numerical factor is chosen to ensure
that the integral over 0 < x < 1 is unity.)

If you take appropriate limits, this family can be extended to -1 < t < 2;
e.g. (it's fun to plot these as functions of x)

f(x,-1/2) = 8/(3 pi) x^(-1/2) (1-x)^(3/2)

f(x, 0) = 2 (1-x)

f(x, 1/4) = 16 sqrt(2)/(3 pi) (1-x)^(3/4) x^(1/4)

f(x, 1/3) = 9 sqrt(3)/(2 pi) (1-x)^(2/3) x^(1/3)

f(x, 1/2) = 8 sqrt(x-x^2)/pi

f(x, 2/3) = 9 sqrt(3)/(2 pi) (1-x)^(1/3) x^(2/3)

f(x, 1) = 2 x

f(x, 3/2) = 8/(3 pi) x^(3/2) (1-x)^(-1/2)

End of example]
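
A quick numerical check of this example (my own illustration, not part of
the original post): the densities above should each integrate to 1 over
0 < x < 1. A short Python sketch using scipy:

# Check (illustration only) that f(.,t) integrates to 1 for a few values
# of t strictly between 0 and 1.
import numpy as np
from scipy.integrate import quad

def f(x, t):
    return 2*np.sin(np.pi*t)/(np.pi*t*(1 - t)) * x**t * (1 - x)**(1 - t)

for t in (0.25, 1/3, 0.5, 2/3):
    total, _ = quad(f, 0.0, 1.0, args=(t,))
    print(t, total)   # each should print approximately 1.0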

In parametric statistics, we want to estimate which t gives the best fit
to a finite data set, say of size n. An estimator is a function from
n-tuple data samples to the set of possible parameter values, e.g. (0,1)
in the example above. Given an estimator, its bias, as a function of t,
is the difference between the expected value (as we range over x) of the
estimator, according to the density f(.,t), and the actual value of t.
The variance of the estimator, as a function of t, is the expectation
(as we range over x), according to f(.,t), of the squared difference
between the estimator and its expected value; when the bias vanishes (in
this case the estimator is called unbiased), this is the same as the mean
squared error about t. Even for an unbiased estimator the variance will
usually still be a positive function of t. It is natural to try to
minimize the variance over the set of unbiased estimators defined for a
given family of densities f(.,t).

Given a family of densities, the score is the logarithmic derivative

V(x,t) = d/dt log f(x,t) = [d/dt f(x,t)] / f(x,t)

[In the example (the one I just made up), if I haven't goofed we have

V(x,t) = log(x/(1-x)) + pi cot(pi t) + (2t-1)/(t(1-t))

e.g. V(x,1/2) = log(x/(1-x)).]

(We are tacitly now assuming some differentiability properties of our
parameterized family of densities.)

The mean of the score (as we average over x) is always zero. The Fisher
information is the variance of the score:

J(t) = expected value of the square of V(x,t) as we vary x

Notice that this is a function of t defined in terms of a specific
parametrized family of densities. (Of course, the definition is readily
generalized to more than one parameter.)

[In the example:

J(-1/2) ~ 5.42516

J(0) = pi^2/3 - 1 ~ 2.28987

J(1/3) ~ 1.90947

J(1/2) ~ 1.8696

J(2/3) ~ 1.90947

J(1) = pi^2/3 - 1 ~ 2.28987

J(3/2) ~ 5.42516

if I didn't goof. Note the expected symmetry of these values.]
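
Another illustrative check (mine, not in the original post): if I'm reading
the family right, f(.,t) is just a Beta(1+t, 2-t) density, so the score and
J(t) can be checked by Monte Carlo in a few lines of Python. The sample
mean of the score should be near 0, and its sample variance near the J(t)
values listed above.

# Monte Carlo check (illustration only) of the score V(x,t) and of
# J(t) = Var(V) for the family above, viewed as Beta(1+t, 2-t).
import numpy as np

rng = np.random.default_rng(0)

for t in (1/3, 0.5, 2/3):
    x = rng.beta(1 + t, 2 - t, size=500_000)
    V = np.log(x/(1 - x)) + np.pi/np.tan(np.pi*t) + (2*t - 1)/(t*(1 - t))
    print(t, V.mean(), V.var())   # mean ~ 0; variance ~ 1.909, 1.870, 1.909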

The fundamentally important Cramer-Rao inequality says that

variance of any unbiased estimator >= 1/J(t)

for a single observation (with n independent samples the bound becomes
1/(n J(t))). Thus, in parametric statistics one wants to find estimators
which achieve the optimal variance, the reciprocal of the Fisher
information. From this point of view, the larger the Fisher information,
the more precisely one can (using a suitable estimator) fit a distribution
from the given parametrized family to the data.

(Incidentally: someone has mentioned the work of Roy Frieden, who has
attempted to relate the Cramer-Rao inequality to the Heisenberg
inequality. See the simple "folklore" theorem (with complete proof) I
posted on a generalized Heisenberg inequality in sci.physics.research a
few months ago--- you should be able to find it using Deja News.)

This setup is more flexible than might at first appear. For instance,
given a density f(x), where x is real, define the family of densities
f(x-t); then the Fisher information is

J(t) = expectation of [d/dt log f(x-t)]^2

= int f(x-t) [d/dt log f(x-t)]^2 dx

By a change of variables, we find that for a fixed density f, this is a
constant. In this way, we can change our point of view and define a
(nonlinear) functional on densities f:

J(f) = int f(x) [f'(x)/f(x)]^2 dx

[The idea now is something like this: J(f) is measuring the precision of
fitting f to numerical data, up to translation of the distribution. The
larger J(f) is, the more precisely you can identify a particular
translation which gives the best fit. I think this is the idea, anyway.]
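
For what it's worth, here is a small numerical illustration (mine, not from
the original post) of the functional J(f) for a Gaussian f of width sigma,
where the answer should come out to 1/sigma^2:

# Check (illustration only): J(f) = int f (f'/f)^2 dx for a Gaussian of
# standard deviation sigma equals 1/sigma^2. Integrating over +-10 sigma
# is plenty, and keeps f away from floating-point underflow.
import numpy as np
from scipy.integrate import quad

sigma = 1.7

def f(x):
    return np.exp(-x**2/(2*sigma**2)) / (sigma*np.sqrt(2*np.pi))

def fprime(x):
    return -x/sigma**2 * f(x)

J, _ = quad(lambda x: f(x)*(fprime(x)/f(x))**2, -10*sigma, 10*sigma)
print(J, 1/sigma**2)   # the two numbers should agree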

On the other hand, Shannon's "continuous" entropy is the (nonlinear)
functional:

H(f) = -int f(x) log f(x) dx

Suppose that X is a random variable with finite variance and Z is an
independent normally distributed random variable with zero mean and unit
variance ("standard noise"), so that X + sqrt(t) Z is another random
variable, with density f_t, representing X perturbed by noise.
Then de Bruijn's identity says that

J(f_t) = 2 d/dt H(f_t)

and if the limit t -> 0 exists, we have a formula for the Fisher
information of the density f_0 associated with X.
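
A tiny sanity check of de Bruijn's identity (my own illustration, using a
Gaussian X so that everything is available in closed form): if X is
N(0, s^2) and Z is standard normal, then X + sqrt(t) Z is N(0, s^2 + t),
with H(f_t) = (1/2) log(2 pi e (s^2 + t)) and J(f_t) = 1/(s^2 + t).

# Check (illustration only) that J(f_t) = 2 d/dt H(f_t) for Gaussian X,
# comparing the analytic Fisher information with a numerical derivative
# of the entropy.
import numpy as np

s2, t, dt = 1.0, 0.5, 1e-6

def H(t):                        # Shannon entropy of N(0, s2 + t)
    return 0.5*np.log(2*np.pi*np.e*(s2 + t))

J = 1.0/(s2 + t)                 # Fisher information of N(0, s2 + t)
dHdt = (H(t + dt) - H(t - dt))/(2*dt)

print(J, 2*dHdt)                 # the two should agree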

See Elements of Information Theory, by Cover & Thomas, Wiley, 1991, for
details on the above and for general orientation to the enormous body of
ideas which constitutes modern information theory, including typical sets
and comment (1) above. Then see some of the many other books which cover
Fisher information in more detail. In one of the books by J. N. Kapur on
maximal entropy you will find a particularly simple and nice connection
between the multivariable Fisher information and Shannon's discrete
"information" (arising from the discrete "entropy" -sum p_j log p_j).

(Come to think of it, if you search under my name using Deja News you
should find a previous posting of mine in which I gave considerable detail
on some inequalities which are closely related to the area-volume
interpretations of Fisher information and entropy. If you've ever heard
of Hadamard's inequality on matrices, you should definitely look at the
discussion in Cover & Thomas.)

Hope this helps!

Chris Hillman

John Forkosh

May 14, 1999

Philip Wort (phil...@cdott.com) wrote:
: I am reading "Physics from Fisher information" by Frieden, and wondered if
: anyone else had read it, or was familiar enough with the ideas therein to
: help me follow it!
snip
There was an introductory article about Fisher information
in the American Journal of Physics several months back
which might be of some help. Sorry that I can't recall
the exact citation,
John (for...@panix.com)


Didier A. Depireux

May 14, 1999

john baez (ba...@galaxy.ucr.edu) wrote:
: Philip Wort <phil...@cdott.com> wrote:
: >I find this book is very difficult to follow. It is full of forward
: >references and has a confusing section structure ... anyway any help

: Does *anyone* understand this stuff? If so, could they explain it?

: I don't even understand what "Fisher information" really is or why
: people (not just Friedan) are interested in it.

I started reading the book, being interested in applying information theory
to neural systems (I went from string theory to the auditory pathway!).
Anyway, when I saw the title of that book, I thought I _had_ to read it.
But I couldn't get very far in it; I found it really confusing. The main
points are not very clear.

So I started reading "Elements of Information Theory", a _really_ good book
by Cover and Thomas (as recommended by someone on bionet.info-theory).

For a layman's introduction to some of the terms used in Fisher
information, look at http://www.newscientist.com/ns/19990130/iisthelaw.html

Didier

--
Didier A Depireux did...@isr.umd.edu
Neural Systems Lab http://www.isr.umd.edu/~didier
Institute for Systems Research Phone: 301-405-6557 (off)
University of Maryland -6596 (lab)
College Park MD 20742 USA Fax: 1-301-314-9920


Steven Hall

May 14, 1999

In article <7h73pp$l77$2...@pravda.ucr.edu>, ba...@galaxy.ucr.edu (john baez) wrote:

> In article <01be9aec$b623a020$2897cfa0@sj816bt720500>,
> Philip Wort <phil...@cdott.com> wrote:
>
> >I find this book is very difficult to follow. It is full of forward
> >references and has a confusing section structure ... anyway any help
> >would be appreciated, since it seems that there is something interesting
> >here beneath the confusion (perhaps just mine).
>
> I wish I could help you, but I can't. When people started talking about
> Friedan's work I went down to the library to look it up, but I couldn't
> make much sense out of his papers. I'd hoped his book would be easier
> to read, but from what you say it sounds like maybe not....
>
> Does *anyone* understand this stuff? If so, could they explain it?
>
> I don't even understand what "Fisher information" really is or why
> people (not just Friedan) are interested in it.

I'll take a shot at this, having studied parameter estimation a little.

Consider the problem of estimating a parameter vector, A, from a vector of
random variables, R, whose probability density depends on the parameter
vector. (A is a parameter because it's not a random variable with a known
probabililty density.) That is, we know the conditional density, p(R|A),
and want to estimate A given R. One common way to do this is maximum
likelihood estimation (Choose the A that maximizes p(R|A)), but other
methods are possible.

The question arises, how good is my estimation scheme? If this were a
Bayesian problem, I could tell you, because I can get p(A|R) from p(R|A)
and p(A). However, in this kind of problem, there is no prior information
p(A). So that doesn't work.

Instead, there is a bound on how well we can do. Suppose that we come up
with an unbiased estimator, a(R). This just means that E(a(R)-A)=0, for
all A. Then the covariance of the estimation error satisfies

covariance(a(R)-A) >= J^{-1}

(in the matrix sense: the difference is positive semidefinite), where J is
the Fisher information matrix. The bound above is known as the Cramer-Rao
inequality. When equality holds, the estimate is called "efficient". An
efficient estimate does not always exist, but when it does, it is the
maximum likelihood estimator.

The Fisher information matrix is given by

J_{ij} = E[(d log[p(R|A)]/d A_{i})(d log[p(R|A)]/d A_{j})]
= -E[(d^{2} log[p(R|A)]/d A_{i} d A_{j})]
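
As a concrete illustration (mine, not part of the original post), here is a
quick Monte Carlo check of the first form of J_{ij} for a single
observation drawn from a Gaussian with parameter vector A = (mu, v), where
v is the variance; analytically J = diag(1/v, 1/(2 v^2)):

# Monte Carlo estimate (illustration only) of the Fisher information
# matrix E[score score^T] for one sample from N(mu, v).
import numpy as np

rng = np.random.default_rng(1)
mu, v, n = 0.0, 2.0, 200_000

x = rng.normal(mu, np.sqrt(v), size=n)
score = np.stack([(x - mu)/v,                         # d log p / d mu
                  -1/(2*v) + (x - mu)**2/(2*v**2)],   # d log p / d v
                 axis=1)

J_mc = score.T @ score / n                 # estimate of E[score score^T]
print(J_mc)
print(np.diag([1/v, 1/(2*v**2)]))          # analytic Fisher matrix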

One of my pet peeves is that researchers keep coming up with new, ad
hoc estimation schemes, but few bother to compare their results to the
Cramer-Rao inequality to see if the idea is a good one. Of course, any
unbiased estimator will be at best as good as the bound, but some are much
worse!

In any event, I have no idea how this relates to physics, so maybe I
haven't been much help.

