
Fitting normal distribution to a histogram


John Uebersax

Jun 19, 2008, 12:12:42 PM
Okay, let's say I have a frequency distribution like this:

x f(x)
--------
1 10
2 20
3 50
4 40
5 15

And I want to fit a normal distribution to it.

x is a threshold-based discretization of a (latent) continuous
variable, such that, for example, x = 1 means

1 <= x < 2 .

Clearly one can fit a normal distribution to the data via, say, ML
estimation and an iterative algorithm.

Question: is there an easy way to get at least a good approximation?
For example, could one (1) replace data values with their
corresponding interval mid-points, and then (2) just calculate the
sample mean and SD?

I suppose a follow-up question is if anyone has or can recommend some
free code to perform the ML estimation.

Thanks in advance.

John Uebersax PhD

Ray Koopman

Jun 19, 2008, 12:57:38 PM

What about endpoints? For instance, should we assume that there
could have been an x = 0, representing 0 <= X < 1, or an x = -1,
representing -1 <= X < 0, etc, but that there were no observations
with X < 1 ? Or is the x-scale bounded, so that although x = 2
represents 2 <= X < 3, x = 1 represents -oo < X < 2 ?

Scott Hemphill

Jun 19, 2008, 1:42:29 PM
John Uebersax <jsueb...@gmail.com> writes:

> Okay, let's say I have a frequency distribution like this:
>
> x f(x)
> --------
> 1 10
> 2 20
> 3 50
> 4 40
> 5 15
>
> And I want to fit a normal distribution to it.
>
> x is a threshold-based discretization of a (latent) continuous
> variable, such that, for example, x = 1 means
>
> 1 <= x < 2 .
>
> Clearly one can fit a normal distribution to the data via, say, ML
> estimation and an iterative algorithm.
>
> Question: is there an easy way to get at least a good approximation?
> For example, could one (1) replace data values with their
> corresponding interval mid-points, and then (2) just calculate the
> sample mean and SD?

Sure, why not? I get mean 3.72 and SD 1.07.
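For reference, a minimal sketch of that midpoint shortcut, written in the
same Mathematica style used elsewhere in this thread (an illustration only,
not Scott's actual code; it assumes midpoints 1.5, ..., 5.5 and the
n-denominator SD):

freq = {10, 20, 50, 40, 15};
mid = Range[5] + 0.5;                 (* interval midpoints 1.5, 2.5, ..., 5.5 *)
n = Total[freq];
mean = freq.mid/n                     (* 3.72222 *)
sd = Sqrt[freq.(mid - mean)^2/n]      (* 1.06574, the "SD 1.07" above *)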

> I suppose a follow-up question is if anyone has or can recommend some
> free code to perform the ML estimation.

Sorry, I can't. But I have used Mathematica's NMaximize function to
do the ML estimation and get mean 3.72 and SD 1.03. The "easy way"
wasn't so bad, was it?

Scott
--
Scott Hemphill hemp...@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear

Jack Tomsky

Jun 19, 2008, 1:46:03 PM


What you have is interval data. You can use the approach you suggested, calculating the empirical mean and SD based on replacing each x by its corresponding mid-point between x and x+1. Then use these results as the starting values for the ML estimation algorithm.

The likelihood function is

L(mu, sigma) = Prod[Phi((xj + 1 - mu)/sigma) - Phi((xj - mu)/sigma)]

Here, Phi is the standard normal cumulative distribution function. Then use a numerical algorithm, such as Excel's Solver, to maximize L or, equivalently, ln(L).

If you need help, I'll run it for you on Excel.

Jack

Jack Tomsky

Jun 19, 2008, 3:16:25 PM
> What you have is interval data. You can use the
> approach you suggested, calculating the empirical
> mean and SD based on replacing each x by its
> corresponding mid-point between x and x+1. Then use
> this result as the starting values for the ML
> estimation algorithm.
>
> The likelihood function is
>
> L(mu, sigma) = Prod[Phi((xj + 1 - mu)/sigma) -
> Phi((xj - mu)/sigma)]


That should actually be

L(mu, sigma) = Prod[{Phi((xj + 1 - mu)/sigma) - Phi((xj - mu)/sigma)}^f(xj)]
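
As a sketch of how that corrected likelihood might be maximized numerically
(an illustration in Mathematica rather than Jack's Excel setup; it treats
every category j as the interval [j, j+1) and starts from the midpoint
estimates mentioned above):

freq = {10, 20, 50, 40, 15};
Phi[z_] := CDF[NormalDistribution[0, 1], z];   (* standard normal CDF *)
logL[mu_?NumericQ, sigma_?NumericQ] :=
  Sum[freq[[j]] Log[Phi[(j + 1 - mu)/sigma] - Phi[(j - mu)/sigma]], {j, 5}];
FindMaximum[logL[mu, sigma], {{mu, 3.72}, {sigma, 1.07}}]

This should land on roughly muhat = 3.722, sigmahat = 1.025, the same values
reported further down in the thread.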

Jack

Jack Tomsky

Jun 19, 2008, 3:24:47 PM

I get the same answer as Scott using Excel's Solver and treating the data as interval data.

muhat = 3.72189
sigmahat = 1.02546

Jack

John Uebersax

Jun 19, 2008, 3:49:40 PM
On Jun 19, 6:57 pm, Ray Koopman <koop...@sfu.ca> wrote:

> What about endpoints? For instance, should we assume that there
> could have been an x = 0, representing 0 <= X < 1, or an x = -1,
> representing -1 <= X < 0, etc, but that there were no observations
> with X < 1 ? Or is the x-scale bounded, so that although x = 2
> represents 2 <= X < 3, x = 1 represents -oo < X < 2 ?

Ray, this is right on the mark.

To stick with the actual example, the lowest category, 1, would in
fact mean -inf < X < 2. So perhaps we would call this category
"left-censored" while the others are merely "interval-censored".

The complete set of meanings would be as follows:

x meaning
------------------
1 -inf < X < 2
2 2 <= X < 3
3 3 <= X < 4
4 4 <= X < 5
5 5 <= X < 6

Where X denotes the pre-discretized version of x.

We assume that no "6" values were actually observed.

John Uebersax

John Uebersax

Jun 19, 2008, 4:10:28 PM
Scott, Jack -- thanks for the replies.

Scott wrote:

> But I have used Mathematica's NMaximize function to
> do the ML estimation and get mean 3.72 and SD 1.03. The "easy way"
> wasn't so bad, was it?

But my guess is that these are the ML estimates treating the data at
face value, not treating them as interval-censored (as with Jack's
equation below).

Jack wrote:

> L(mu, sigma) = Prod[{Phi((xj + 1 - mu)/sigma) - Phi((xj - mu)/sigma)}^f(xj)]

Right. Based on my reply to Ray, I think we'd just modify this so
that for the first category it would be only Phi((x2 - mu)/sigma),
rather than a difference of two Phis.
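
In code terms, continuing the hypothetical Mathematica sketch given after
Jack's corrected formula, only the first bin's factor changes, from a
difference of Phis to a single left tail (again just an illustration):

freq = {10, 20, 50, 40, 15};
Phi[z_] := CDF[NormalDistribution[0, 1], z];
logLcens[mu_?NumericQ, sigma_?NumericQ] :=
  freq[[1]] Log[Phi[(2 - mu)/sigma]] +          (* category 1: -inf < X < 2 *)
   Sum[freq[[j]] Log[Phi[(j + 1 - mu)/sigma] - Phi[(j - mu)/sigma]], {j, 2, 5}];
FindMaximum[logLcens[mu, sigma], {{mu, 3.72}, {sigma, 1.07}}]

With the left-censored first category this should agree with the estimates
Ray posts below (muhat about 3.715, sigmahat about 1.045).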

> I get the same answer as Scott using Excel's Solver and treating the data as interval data.

Again, my guess is that "treating the data as interval data" means
taking them at face value, and not as interval-censored.

John Uebersax


Jack Tomsky

Jun 19, 2008, 4:21:41 PM

The way I interpreted the data was that there are 10 measurements between 1 and 2, 20 measurements between 2 and 3, etc.

Jack

Ray Koopman

Jun 19, 2008, 5:58:45 PM

For those meanings, Mathematica's FindMaximum gives
muhat = 3.71464, sigmahat = 1.04489

Scott Hemphill

Jun 19, 2008, 7:10:48 PM
Jack Tomsky <jto...@ix.netcom.com> writes:

As did I. My code looks like:

In[1]:= wp[n_] := SetOptions[NMaximize, WorkingPrecision -> n];

In[2]:= pdf[m_, s_, x_] := Exp[-(x - m)^2/(2*s^2)] / (Sqrt[2*Pi]*s);

In[3]:= cdf[m_, s_, x_] = Integrate[pdf[m, s, t], {t, -Infinity, x},
Assumptions -> s > 0];

In[4]:= p[m_, s_, x_] := cdf[m, s, x + 1] - cdf[m, s, x];

In[5]:= f[m_, s_] := p[m, s, 1]^10*p[m, s, 2]^20*p[m, s, 3]^50*p[m, s, 4]^40*
p[m, s, 5]^15;

(* You need multiple precision arithmetic to deal with small likelihoods *)

In[6]:= wp[100];

In[7]:= N[NMaximize[f[m, s], {m, s}], 16]

Out[7]= {1.205749284816294*10^-87, {m -> 3.721891606793503, s -> 1.025458686006314}}

May Liszt

Jun 19, 2008, 7:12:10 PM
Convert the histogram to an empirical CDF, then find the mean and SD
that give you the smallest KS statistic, or alternatively any other
non-parametric goodness-of-fit statistic.


The idea behind this is a non-parametric test of goodness of fit.
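
A rough sketch of that idea (an illustration only; it assumes the empirical
CDF and the KS distance are evaluated just at the bin boundaries 2 through 6,
and uses Mathematica for consistency with the rest of the thread):

freq = {10, 20, 50, 40, 15};
n = Total[freq];
cuts = Range[2, 6];                      (* upper bin boundaries *)
empCDF = Accumulate[freq]/n;             (* empirical CDF at the cuts *)
ks[mu_?NumericQ, sigma_?NumericQ] :=
  Max[Abs[empCDF - CDF[NormalDistribution[mu, sigma], cuts]]];
NMinimize[{ks[mu, sigma], 0 < sigma}, {{mu, 3, 4}, {sigma, 0.5, 2}}]

This is a minimum-distance fit rather than ML, so the estimates need not
coincide with the ML values above, but for data this regular they should be
in the same neighborhood.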



Ray Koopman

Jun 19, 2008, 8:01:30 PM
On Jun 19, 4:10 pm, Scott Hemphill <hemph...@hemphills.net> wrote:
> Jack Tomsky <jtom...@ix.netcom.com> writes:
>> [...]

>> The way I interpreted the data was that there are 10 measurements
>> between 1 and 2, 20 measurements between 2 and 3, etc.
>
> As did I. My code looks like:
>
> In[1]:= wp[n_] := SetOptions[NMaximize, WorkingPrecision -> n];
>
> In[2]:= pdf[m_, s_, x_] := Exp[-(x - m)^2/(2*s^2)] / (Sqrt[2*Pi]*s);
>
> In[3]:= cdf[m_, s_, x_] = Integrate[pdf[m, s, t], {t, -Infinity, x},
> Assumptions -> s > 0];
>
> In[4]:= p[m_, s_, x_] := cdf[m, s, x + 1] - cdf[m, s, x];
>
> In[5]:= f[m_, s_] := p[m, s, 1]^10*p[m, s, 2]^20*p[m, s, 3]^50*p[m, s, 4]^40*
> p[m, s, 5]^15;
>
> (* You need multiple precision arithmetic to deal with small likelihoods *)
>
> In[6]:= wp[100];
>
> In[7]:= N[NMaximize[f[m, s], {m, s}], 16]
>
> Out[7]= {1.205749284816294*10^-87, {m -> 3.721891606793503, s -> 1.025458686006314}}
>
> Scott
> --
> Scott Hemphill hemph...@alumni.caltech.edu

> "This isn't flying. This is falling, with style." -- Buzz Lightyear

I think you've made things more complicated than they need to be.
Using logs seems to obviate the need for multiple precision.
Here's the code I used, which also is not as clean as it could be.

First assuming x = Floor[X].

In[1]:= k = Length[freq = {10,20,50,40,15}]
Out[1]= 5

In[2]:= {n = Tr@freq, N[m = freq.Range@k/n],
N[s = Sqrt[freq.Range@k^2/n - m^2]]}
Out[2]= {135, 3.22222, 1.06574}

In[3]:= LL[mu_?NumericQ,sigma_?NumericQ] := freq.Log[Erf@@@Transpose[
({Range[1,k],Range[2,k+1]}-mu)*(Sqrt[.5]/sigma)]]

In[4]:= LL[m+.5,s]
Out[4]= -106.732

In[5]:= FindMaximum[LL[mu,sigma],{{mu,m+.49,m+.51},{sigma,.99s,N@s}}]
Out[5]= {-106.563,{mu->3.72189,sigma->1.02546}}

Now assuming x = Floor[Max[X,1]]

In[6]:= xx = Partition[{-Infinity,2,3,4,5,6},2,1]
Out[6]= {{-Infinity,2},{2,3},{3,4},{4,5},{5,6}}

In[7]:= KK[mu_?NumericQ,sigma_?NumericQ] := freq.Log[Erf@@@((xx-mu)*
(Sqrt[.5]/sigma))]

In[8]:= KK[m+.5,s]
Out[8]= -105.675

In[9]:= FindMaximum[KK[mu,sigma],{{mu,m+.49,m+.51},{sigma,.99s,N@s}}]
Out[9]= {-105.63,{mu->3.71464,sigma->1.04489}}

John Uebersax

Jun 20, 2008, 7:23:18 AM
Thanks again Ray, Scott, Jack, and May.

With your help I was able to write my own program in Fortran 90. It
gives exactly the same ML estimates as Ray reports for this model:

In[6]:= xx = Partition[{-Infinity,2,3,4,5,6},2,1]
Out[6]= {{-Infinity,2},{2,3},{3,4},{4,5},{5,6}}

Ray, I get a final LL of about -199.205 (using the natural log), vs.
your -105.63. Are you evaluating the cdf of the normal distribution,
or just something proportional to it?

John Uebersax

Ray Koopman

Jun 20, 2008, 12:54:44 PM

I was using Erf[x/Sqrt@2] = 2*Phi[x] - 1,
so all the interval probabilities were doubled.
n = 135; n*Log[2] = 93.575, which is the difference
between your -199.205 and my -105.630.

Scott Hemphill

Jun 20, 2008, 1:00:58 PM
John Uebersax <jsueb...@gmail.com> writes:

Changing my In[5] from:

In[5]:= f[m_, s_] := p[m, s, 1]^10*p[m, s, 2]^20*p[m, s, 3]^50*p[m, s, 4]^40*
p[m, s, 5]^15;

to:

In[5]:= f[m_, s_] := cdf[m, s, 2]^10*p[m, s, 2]^20*p[m, s, 3]^50*p[m, s, 4]^40*
        p[m, s, 5]^15;

results in:

In[7]:= N[NMaximize[f[m, s], {m, s}], 16]

Out[7]= {3.064427873001987*10^-87, {m -> 3.714636504122876, s -> 1.044891842261940}}

In[8]:= Log[%[[1]]]

Out[8]= -199.205042203117482

Scott
--
Scott Hemphill hemp...@alumni.caltech.edu

S. F. Thomas

Jun 21, 2008, 10:44:01 AM

Very interesting result. The estimate of the fitted sigma parameter
changed between taking the data to be points (SD=1.07 above), and taking
them to be intervals (SD=1.03).

This raises two questions you or someone else might find interesting to
answer:

1) I know the problem context gives a "natural" interval data model
where the data are taken as intervals [1,2), [2,3), etc. But consider
the general case where we start with point data x1, x2, etc. and
consider the corresponding interval model where the data are
[x1-d,x1+d), [x2-d,x2+d), etc. where d is the half-width of a more
general interval model. My first question: what is the analytical
expression for the way in which the estimate of the sigma parameter
varies with d, at least as d ranges from 0 to 1/2, and how is it derived?

2) What is the expression for the standard error in the estimate of the
mean parameter, as d varies from 0 to 1/2?

> Scott

Regards,
S. F. Thomas

Ray Koopman

Jun 21, 2008, 6:26:17 PM
On Jun 21, 7:44 am, "S. F. Thomas" <thomas7...@bellsouth.net> wrote:
> Scott Hemphill wrote:

Here's a suggestion for question 1). First some generalities.
Suppose we have a quadratic function of a vector t.
Let ' and . denote matrix transposition and multiplication.

function: f[t] = a + (t - b)'.C.(t - b)/2, C is positive definite

= a + b'.C.b/2 - t'.C.b + t'.C.t/2

gradient: g[t] = -C.b + C.t

Hessian: H[t] = C

Note that we can always work backwards from f,g,H to get a,b,C.

Partition: t = {u,v}, b = {bu,bv}, C = {{Cuu,Cuv},
{Cvu,Cvv}}

gu[t] = -(Cuu.bu + Cuv.bv) + (Cuu.u + Cuv.v)

gu[t] = 0 <==> (Cuu.u + Cuv.v) = (Cuu.bu + Cuv.bv)

u = (Cuu^-1).(Cuu.bu + Cuv.bv - Cuv.v)

= p - Q.v, p = (Cuu^-1).(Cuu.bu + Cuv.bv)

Q = (Cuu^-1).Cuv

This expresses the conditionally minimizing u|v in terms of v.

Now particularize. Assume that the negative of the log of the
likelihood is sufficiently close to quadratic over the region
of interest for the above theory to apply.

f = -LogLikelihood, u = {mu,sigma}, v = d. Get p and Q.

(It may help to write (x-mu)/sigma as alpha+beta*x and work with
u = {alpha,beta}, converting back to {mu,sigma} only at the end.)
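
The relationship can also be traced numerically rather than analytically.
A hypothetical Mathematica sketch, taking the bin midpoints 1.5, ..., 5.5 of
the data above as the "point data" x1, x2, ... and refitting by ML over a
grid of half-widths d (on this data sigmahat should move from near the
point-data value of about 1.07 at small d toward the roughly 1.03 obtained
at d = 1/2):

freq = {10, 20, 50, 40, 15};
mid = Range[5] + 0.5;
Phi[z_] := CDF[NormalDistribution[0, 1], z];
logL[d_][mu_?NumericQ, sigma_?NumericQ] :=
  freq.Log[Phi[(mid + d - mu)/sigma] - Phi[(mid - d - mu)/sigma]];
Table[{d, FindMaximum[logL[d][mu, sigma], {{mu, 3.72}, {sigma, 1.07}}][[2]]},
  {d, 0.1, 0.5, 0.1}]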

S. F. Thomas

Jun 21, 2008, 11:56:22 PM

Thank you, Ray. Very elegant. But classically cautious I see.

I imagine a Bayesian solution would be simpler, certainly more direct,
since it provides a way in effect to integrate likelihood, therefore to
marginalize to obtain an expression for the marginal posterior
distribution of the variance parameter. This would I think end up being
a family of inverse-gammas, indexed by d. The expression for the mode
would involve d, also the sample size n and the sample variance s^2
calculated using the mid-points of the data. It would seem the effect of
increasing d is to increase the peakedness of the inverse gamma and
shift the mode, also the posterior mean (of the variance parameter), to
the left. That I can work out, I think, although I haven't actually done
it in full detail, yet. But I've sketched enough of it I think to
support my remarks above. (I'm getting ready to travel tomorrow, so I
won't be able to return to it for at least the next two days, depending.)

Of course the Bayesian price in that case must be paid, which is to
"assume facts not in evidence", as they might say in court, in terms of
the prior, for example the Jeffreys prior, which is the practical
equivalent of having two extra data points.

I raised the question not only because I was somewhat surprised to see
the interval data model yielding a smaller estimate of the SD than the
point model, and I wanted to know why, but also because I thought the
problem context was instructive, in terms of the classical vs Bayesian
divide, and in terms of the possibilistic approach which I am
advancing*. In this context, the possibilistic approach would arrive at
the same solution as a Bayesian approach were a flat (and improper)
prior used.** But in "getting there" it is not necessary to pay the
afore-mentioned Bayesian price. Clearly there are some new semantics
involved, but nothing at all that would do violence to the semantics of
the original model understood in a classical sense.

Again, thanks. If there is a classical approach that exploits more
closely the properties of the likelihood function a la Bayes, I'd still
be interested to see it, although the quadratic approximation at the
peak is certainly quite a viable and worthwhile approach.

I would also be interested in seeing a Bayesian answer, especially
perhaps from the esteemed Prof. Herman Rubin, with whom I have
philosophical disagreement, but whose technical mastery I readily
acknowledge is comprehensive.

Regards,
S. F. Thomas

* In which the absolute likelihood function is treated in effect as a
membership function characterizing what the data "say" about the unknown
parameters of interest, allowing, like Bayes, a direct characterization
of uncertainty -- in this case possibilistic -- in the unknown
parameters. (See S. F. Thomas, Fuzziness and Probability, ACG Press, 1995).

** It would also agree with the Bayesian approach in focusing not on the
posterior mode of the marginal, but the posterior mean. These two will
differ in this case because of the inherent skew in the inverse-gamma
form characterizing the marginal uncertainty on the variance parameter,
but come closer together the greater the sample size n. Hence the
estimate would be 1.08 where 1.07 was obtained using MLE, and 1.04 where
1.03 was obtained using MLE.

John Uebersax

Jun 23, 2008, 8:48:03 AM
Again, thanks for all the suggestions.

If anyone is interested, the issue of interval-censoring comes up in a
common assay for detecting influenza virus, the Hemagglutinin
Inhibition (HI) test. This is based on serial dilutions -- one keeps
diluting an original sample until a certain reaction no longer
occurs. The result is then scored as the last dilution (number of
titers) at which the reaction occurred. However, all one actually
knows is that the true value must be somewhere between the kth
and (k+1)th titer of the series.

Here are some related articles.

Nauta JJP, De Bruijn IA. On the bias in HI titers and how to reduce
it. Vaccine 2006;24:6645-6646.

Nauta JJP. Eliminating bias in the estimation of the geometric mean
of HI titres. Biologicals 2006; 34(3):183-186.

Siev D. Reply to 'Nauta JJP, Eliminating bias in the estimation of
the geometric mean of HI titers'. Biologicals 2007;35(2):149-151.

(The abstracts are available on PubMed. If any of the people who've
already posted to this thread would like an eprint of any of these
articles, please let me know.)

John Uebersax PhD
