Definition of "Population" versus "Sample" Variance

Herman Rubin

Dec 2, 1995, 3:00:00 AM

In article <49napk$n...@newsbf02.news.aol.com>,
AaCBrown <aacb...@aol.com> wrote:
>In a recent thread there was a discussion of the correct definitions of
>"Population" variance and "Sample" variance. One divides the sum of the
>squared deviations from the mean by N, the other by N-1.

>I have the following impressions and would be very interested in other
>opinions:

>(1) Neither term has a long history in statistics. Mathematical
>statisticians talked about the "maximum likelihood" or "unbiased" estimate
>of the variance. If the terms "population variance" or "sample variance"
>crept in, they were not assumed to be general terms the reader would
>understand. Which one meant N and which meant N-1 varied from author to
>author, many authors used different conventions entirely or one term but
>not the other.

"Population variance" is a strictly probability concept. It has nothing
to do with the data. The idea, if not the name, is well over two
centuries old.

Until about a century ago, most users of statistics blindly assumed that
they could treat the sample as the population. They ignored most of the
effects of finite sample size. That the t-statistic could not be considered
to be normally distributed was in contradiction to the usage during most
of the nineteenth century.
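
The point about the t-statistic can be checked with a small simulation. This is only a sketch, not part of the original post: the sample size of 5, the seed, and the function name `t_exceedance_rate` are arbitrary choices for illustration. If the t-statistic were normally distributed, |t| would exceed 1.96 about 5% of the time.

```python
import random
import statistics

def t_exceedance_rate(n=5, trials=20000, cutoff=1.96, seed=1):
    """Fraction of simulated t-statistics with |t| > cutoff, normal samples of size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true mean is 0
        mean = statistics.fmean(x)
        s = statistics.stdev(x)                       # N-1 form
        t = mean / (s / n ** 0.5)
        if abs(t) > cutoff:
            hits += 1
    return hits / trials

rate = t_exceedance_rate()
print(f"P(|t| > 1.96) with n=5: about {rate:.3f} (normal theory would give 0.050)")
```

With samples this small the exceedance rate comes out well above 5%, which is the finite-sample effect being described.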

>(2) Computer programs and calculators universally use the term "population
>variance" to mean the N form and "sample variance" to mean the N-1 form.
>Thus in EXCEL the functions are VAR (N-1) and VARP (P for population, N
>form).

IF the N values are equally likely, dividing by N gives the variance.
Assuming independent observations from some distribution, dividing by
N-1 gives an unbiased estimate of the variance of that distribution.
This can be extended to regression models. In the normal case, the
estimate is distributed as the true variance times a chi-squared
variable with N-1 degrees of freedom, divided by N-1.
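
A minimal sketch of the two divisors, assuming Python's standard `statistics` module. The data values, the seed, and the helper `average_estimates` are made up for illustration. The first part computes both forms directly; the second averages them over many simulated samples to show which one is unbiased for the variance of the underlying distribution.

```python
import random
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up values
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

var_n = ss / n          # "N" form: the variance if these N values are equally likely
var_n1 = ss / (n - 1)   # "N-1" form: unbiased estimate for an underlying distribution

# The standard library uses the same two conventions.
print(var_n, statistics.pvariance(data))
print(var_n1, statistics.variance(data))

def average_estimates(n=5, sigma=2.0, trials=20000, seed=2):
    """Average both forms over many simulated samples with true variance sigma**2."""
    rng = random.Random(seed)
    total_n = total_n1 = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n += statistics.pvariance(x)
        total_n1 += statistics.variance(x)
    return total_n / trials, total_n1 / trials

avg_n, avg_n1 = average_estimates()
print(f"true variance 4.0; mean of N form {avg_n:.2f}; mean of N-1 form {avg_n1:.2f}")
```

On average the N form underestimates the true variance by the factor (N-1)/N, while the N-1 form centers on it.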

>(3) Possibly as a result of (2) elementary and popular statistics texts
>have adopted the same convention.

Usually without giving enough of an explanation to show what these mean.
Probability must come first for such an understanding; not the computation
of probabilities, nor the list of distributions, but the ideas. The
variance of the number of successes in n independent trials with constant
probability p of success is np(1-p); no data is used. The major reason
for the great use of variance is that it has good properties.
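
To illustrate that a population variance is computed from the probability model alone, here is a small sketch that derives the binomial variance from its probability mass function and compares it with np(1-p). The values n = 20 and p = 0.3 and the function name are arbitrary choices for the example.

```python
from math import comb

def binomial_variance(n, p):
    """Variance of the number of successes, computed from the pmf (no data involved)."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    return sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

n, p = 20, 0.3
v = binomial_variance(n, p)
print(v, n * p * (1 - p))   # the two numbers agree up to rounding error
```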

>Therefore today, as a result of (2) and (3):

>(4) If someone uses the terms without explanation it is reasonably safe to
>assume that "population variance" means N and "sample variance" means N-1.

>Is this correct?

It is only correct for the "population variance" under the restrictive
conditions as stated above. When using residuals from a regression,
N-1 is not correct, either.
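
A sketch of the regression remark, under the usual assumptions of a straight-line model with independent, equal-variance normal errors. With two fitted coefficients the unbiased divisor for the residual sum of squares is N-2, not N-1. The x values, true coefficients, seed, and function names below are made up for illustration.

```python
import random

def fit_line_rss(x, y):
    """Least-squares line through (x, y); returns the residual sum of squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def average_residual_variance(trials=10000, seed=4):
    """Average RSS/(N-1) and RSS/(N-2) over simulated data with error variance 1."""
    rng = random.Random(seed)
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    n = len(x)
    tot_n1 = tot_n2 = 0.0
    for _ in range(trials):
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
        rss = fit_line_rss(x, y)
        tot_n1 += rss / (n - 1)
        tot_n2 += rss / (n - 2)
    return tot_n1 / trials, tot_n2 / trials

avg_n1_form, avg_n2_form = average_residual_variance()
print(f"true error variance 1.0; N-1 divisor {avg_n1_form:.2f}; N-2 divisor {avg_n2_form:.2f}")
```

More generally, the divisor is N minus the number of fitted coefficients.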
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
hru...@stat.purdue.edu Phone: (317)494-6054 FAX: (317)494-0558

John Ongtooguk

Dec 7, 1995, 3:00:00 AM

Herman Rubin (hru...@b.stat.purdue.edu) wrote:

: Until about a century ago, most users of statistics blindly assumed that
: they could treat the sample as the population. They ignored most of the
: effects of finite sample size. That the t-statistic could not be considered
: to be normally distributed was in contradiction to the usage during most
: of the nineteenth century....

Well, I guess a lot of us are way behind the times, at least in
the SPC/SQC/TQC manufacturing world. One may find t tests and
such but for estimating the limits of a distribution it is very,
very common to assume normal distributions and assume that the
sample is the same as the population. As an example having a Cpk
( |sample mean - nearest tolerance limit| / (3 * sample stdev) ) greater
than say 1.0 is 'good' as one assumes that 99.73% of the
population is within the spec limits. I guess that part of the
reason for the continued practice is that doing it right, at
least as far as I've been able to figure out, is tedious and
hard to implement.
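
For concreteness, a sketch of the naive calculation described above, using the usual two-sided form of Cpk (distance from the sample mean to the nearer spec limit, divided by three sample standard deviations). The measurements and spec limits are invented for the example, and the 99.73% figure is the normal-theory value that the naive approach takes for granted.

```python
import statistics
from statistics import NormalDist

def cpk(data, lsl, usl):
    """Naive Cpk: treats the sample mean and stdev as if they were the population's."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)   # N-1 form
    return min(usl - mean, mean - lsl) / (3 * s)

measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]   # made-up data
value = cpk(measurements, lsl=9.0, usl=11.0)
print(f"Cpk = {value:.3f}")

# Fraction of a normal population within +/- 3 standard deviations of its mean.
within_3_sigma = NormalDist().cdf(3.0) - NormalDist().cdf(-3.0)
print(f"normal-theory fraction within +/- 3 sigma: {within_3_sigma:.4f}")
```

Nothing in this calculation reflects the sample size, which is the weakness raised in the post.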

To do it right, and I'll appreciate any corrections if it isn't,
involves using a non-central t distribution, where a confidence
needs to be stated as well as a population percentage in order
for the correct number of sample standard deviations to be
determined. One also needs to know whether the percentage is
to be held at each tail or whether it applies overall. I've
been able to find some tables listing such 'k factors' but
I haven't been able to figure out how to calculate such values,
which would greatly help in implementing such a practice.
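
On calculating the k factors: for a one-sided limit under normality, the exact factor is the noncentral t quantile t'(confidence; N-1, delta) divided by sqrt(N), with noncentrality delta = z_p * sqrt(N), where z_p is the normal quantile for the population fraction p. A library routine for the noncentral t quantile (for example SciPy's `scipy.stats.nct.ppf`, not used here) gives that directly. The sketch below instead uses a widely quoted closed-form approximation that needs only normal quantiles. It is an approximation, not the exact value, and the function name is made up for illustration.

```python
from statistics import NormalDist

def k_one_sided_approx(n, coverage, confidence):
    """Approximate one-sided normal tolerance factor (closed form, not exact).

    The limit is then mean + k*s (or mean - k*s) for a sample of size n.
    Assumes n is large enough that the term 'a' below stays positive.
    """
    z_p = NormalDist().inv_cdf(coverage)     # quantile for the population fraction
    z_g = NormalDist().inv_cdf(confidence)   # quantile for the confidence level
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + (z_p ** 2 - a * b) ** 0.5) / a

for n in (10, 30, 100):
    k = k_one_sided_approx(n, coverage=0.90, confidence=0.95)
    print(f"n={n:4d}  k ~ {k:.3f}")
```

As N grows the factor shrinks toward z_p, which is the "sample = population" value.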

The 'tolerance interval' seems to be a good way to state what
is expected for a product performance parameter, among many
other things, as one can state an expectation for the product
population and assume that any number of sampling plans will
produce similar long term results. It seems that it should
also be more widely used as we're almost always using a
sample mean and standard deviation, as I think a lot of
people are. Comments, corrections, even flames appreciated.

John Ongtooguk (jo...@vcd.hp.com)

George ZELIGER

Dec 8, 1995, 3:00:00 AM

to jo...@fc.hp.com

To be honest, I have not understood what you wrote about the non-central
t-distribution. However, I agree with you that in the field of technical
applications (which is the ASQC's estate, or at least it looks that
way) there is plenty of muddle in the use of statistical concepts -- and
therefore of tools. This is because manipulative skills ("number
crunching") are always considered much more important than the
understanding of fundamental ideas.
I don't accept the excuse that "doing it right is too tedious". The
least tedious way is not to do anything at all.
You are right when you say that tolerance intervals should play a much more
important role. Actually, it should not be the tolerance intervals themselves.
What should be used stands in the same relationship to a tolerance interval
as the "true" (population) mean does to the corresponding confidence
interval. In other words, in Statistics the tolerance interval is a kind
of estimate (or estimator) of an object which I have never seen
introduced in commonly used textbooks. In the 'Introduction to
Statistics' course I teach for the Boston Chapter of the ASQC I call it
the "Core Interval". It is not something new in Statistics, since everybody
uses the idea at least subconsciously. However, explicit introduction of
the term makes it easier to explain many things -- from the basic idea
of SPC, through what that notorious Cpk is, to the underpinning
idea of Statistical Inference, etc.
For example, a confidence interval is nothing but a core interval for the
distribution of an appropriate statistic. There are four major types of
core intervals: left-sided, right-sided, symmetric in probability, and
the shortest. There are lots of practical examples of the use of each type.
If your location is close to Boston, give me a call at (617) 782-2033 (H)
or (617) 890-001,x200 (W).

Bert Gunter

Dec 8, 1995, 3:00:00 AM

jo...@fc.hp.com (John Ongtooguk) wrote:

... <A Lot of Stuff Omitted >

All of this is technically correct, but for reasonable size samples
(hundreds -- which you need anyway to estimate Cpk with any precision)
the difference between normal theory tolerance intervals and naive
approaches isn't worth a tinker's damn, as they say. The real problem
is that the assumption of normality is no good when working in the
tails -- in fact it's often really rotten (e.g., when working with
naturally skew distributions like runout, or flatness, or
%contamination, or ...). All those normal theory approximations that
you see in the quality control literature for ppm and ppb are mere
flights of fancy from people who appear to know little of statistics
or the real world.
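
A small numerical illustration of the tail problem, not taken from the post: assume the characteristic is actually lognormal (a hypothetical stand-in for a naturally skewed quantity, with an arbitrary shape parameter of 0.5) and compare the true fraction beyond "mean + 3 sigma" with the 0.135% that normal theory would predict.

```python
from math import exp, log, sqrt
from statistics import NormalDist

shape = 0.5   # assumed std dev of log(X); X is lognormal with log-mean 0

# Exact mean and standard deviation of the lognormal distribution.
mean = exp(shape ** 2 / 2)
sd = sqrt((exp(shape ** 2) - 1) * exp(shape ** 2))
limit = mean + 3 * sd

normal_tail = 1 - NormalDist().cdf(3.0)                   # what normal theory predicts
true_tail = 1 - NormalDist(0.0, shape).cdf(log(limit))    # actual lognormal tail

print(f"normal-theory tail beyond mean + 3 sd: {normal_tail * 1e6:.0f} ppm")
print(f"lognormal tail beyond the same limit:  {true_tail * 1e6:.0f} ppm")
```

Even with the mean and standard deviation known exactly, the normal-theory ppm figure is off by a large factor in this example.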

Actually, that's a lie, too. The real problem is that there is NO
distribution (Deming's analytic vs enumerative distinction) -- the
processes are not stable (in statistical control), so any attempt at
prediction from sample to a longer term population is ludicrous
anyway.

Cheers,
Bert

Bert Gunter
bgu...@pluto.njcc.com
(Statistical Consultant)

AaCBrown

Dec 11, 1995, 3:00:00 AM

bgu...@pluto.njcc.com (Bert Gunter) in <4aa0fk$i...@earth.njcc.com>
writes:

> The real problem is that there is NO distribution (Deming's analytic
> vs enumerative distinction) -- the processes are not stable (in
> statistical control), so any attempt at prediction from sample to
> a longer term population is ludicrous anyway.

I think your point is valid but overstated. I would say that there are
lots of problems and pitfalls in predicting the future from the past or a
population from a sample. But statistical methods, for all their
shortcomings, often work surprisingly well. I would trust the opinion of a
good statistician who has looked at the data over the opinion of an expert
in the field of application.

Aaron C. Brown
New York, NY

John Ongtooguk

Dec 11, 1995, 3:00:00 AM

Bert Gunter (bgu...@pluto.njcc.com) wrote:

: All of this is technically correct, but for reasonable size samples
: (hundreds -- which you need anyway to estimate Cpk with any precision)
: the difference between normal theory tolerance intervals and naive
: approaches isn't worth a tinker's damn, as they say.

Hundreds would be nice but it's rarely possible during product
development, hence the tolerance interval approach; one merely
needs to juggle defect rates, confidence, and sample sizes along
with accepting a number of assumptions when working up a test
plan. Assuming a normal distribution and sample = population
versus a tolerance interval based upon a two sided spec when
controlling the center of a distribution one ends up with the
following (as read from a chart):

                      % 'good'   confidence   +/- s   sample size
sample = population    99.73         ?         3.0    hundreds ?
tolerance interval     99.9         95         4.5    24
tolerance interval     99.0         80         3.0    26
tolerance interval     95.0         80         2.5    14

If one buys into the assumptions it seems like a practical
approach.
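
The chart values above can be approximated in a few lines. This is a sketch only: it uses the approximation for two-sided normal tolerance factors commonly attributed to Howe, together with the Wilson-Hilferty approximation for the chi-squared quantile so that nothing beyond the Python standard library is needed. Both are approximations, and the function names are made up for illustration.

```python
from statistics import NormalDist

def chi2_quantile_wh(q, df):
    """Wilson-Hilferty approximation to the lower-tail q quantile of chi-squared(df)."""
    z = NormalDist().inv_cdf(q)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def k_two_sided_howe(n, coverage, confidence):
    """Approximate k so that mean +/- k*s covers `coverage` with the given confidence."""
    df = n - 1
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2_low = chi2_quantile_wh(1.0 - confidence, df)   # exceeded with prob. `confidence`
    return z * (df * (1.0 + 1.0 / n) / chi2_low) ** 0.5

# The three tolerance-interval rows of the table: (% good, confidence, sample size).
for coverage, confidence, n in [(0.999, 0.95, 24), (0.99, 0.80, 26), (0.95, 0.80, 14)]:
    k = k_two_sided_howe(n, coverage, confidence)
    print(f"{coverage:.1%} good, {confidence:.0%} confidence, n={n}: k ~ {k:.2f}")
```

The results come out close to the 4.5, 3.0 and 2.5 read from the chart, which suggests the chart is for two-sided normal tolerance intervals.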

: The real problem
: is that the assumption of normality is no good when working in the
: tails -- in fact it's often really rotten (e.g., when working with
: naturally skew distributions like runout, or flatness, or
: %contamination, or ...). All those normal theory approximations that
: you see in the quality control literature for ppm and ppb are mere
: flights of fancy from people who appear to know little of statistics
: or the real world.

For a single parameter like flatness or others that have a limit
at zero it's true, but for product level parameters a normal
distribution seems to often be a decent assumption. But like
you say working at ppm and ppb levels is a big stretch without
proving out at least a few other assumptions.

: Actually, that's a lie, too. The real problem is that there is NO
: distribution (Deming's analytic vs enumerative distinction) -- the
: processes are not stable (in statistical control), so any attempt at
: prediction from sample to a longer term population is ludicrous
: anyway.

Yes, this is a big problem. We also see that processes can be
in control quite often, but the mean shifts for many reasons.
In any case if one can't work with sample sizes in the hundreds
what's wrong with the example test plans listed above ?

John Ongtooguk (jo...@vcd.hp.com)

George Zeliger

Dec 17, 1995, 3:00:00 AM

to jo...@fc.hp.com

produc...@fc.hp.com (John Ongtooguk) wrote:
>Bert Gunter (bgu...@pluto.njcc.com) wrote:
>
>: All of this is technically correct, but for reasonable size samples
>: (hundreds -- which you need anyway to estimate Cpk with any precision)
>: the difference between normal theory tolerance intervals and naive
>: approaches isn't worth a tinker's damn, as they say.
>
> Hundreds would be nice but it's rarely possible during product
> development, hence the tolerance interval approach; one merely

I don't see the relationship between the large sample sizes that are
necessary in many cases and the tolerance interval approach. Tolerance intervals
are in the same relationship to what I call "core intervals" as
confidence intervals are to the corresponding distribution parameters.
The notorious 'mu'+/-3'sigma' is nothing but a core interval, and the
necessity to use it is caused by life rather than by sample size
considerations.
Imagine that you manufacture contraceptive pills and only guarantee that
the amount of the active ingredient is correct "on average" for your
production, rather than for, say, 99.9% of it. I guess some people would
like to say a couple of not so polite words to you in this case.
Statistical tolerance intervals as they are calculated nowadays
"overestimate" their parent core intervals, which implies losses for the
manufacturer. If you want the tolerance intervals to be close enough to the
core intervals, you will end up needing a number of observations
of the same order as is necessary for the samples discussed above.

>                       % 'good'   confidence   +/- s   sample size
> sample = population    99.73         ?         3.0    hundreds ?
> tolerance interval     99.9         95         4.5    24
> tolerance interval     99.0         80         3.0    26
> tolerance interval     95.0         80         2.5    14

I missed something in your correspondence with Bert, so could you please
explain what the statement "sample=population" means?

>: The real problem
>: is that the assumption of normality is no good when working in the
>: tails -- in fact it's often really rotten (e.g., when working with
>: naturally skew distributions like runout, or flatness, or
>: %contamination, or ...). All those normal theory approximations that
>: you see in the quality control literature for ppm and ppb are mere
>: flights of fancy from people who appear to know little of statistics
>: or the real world. (Bert Gunter)

> For a single parameter like flatness or others that have a limit
> at zero it's true

It all depends on the relationship between the mean of the corresponding
distribution and the characteristic of its dispersion, as well as on the
problem under consideration. Even if the distribution is bounded by
zero, if its mean is equal to many times its standard deviation, the normal
approximation might work.
Sometimes this assumption works even better in the tails than within the
central core interval: imagine a distribution for which 99.7% of the
population lies within +/-3*'sigma' limits and both tails are mutually
symmetric, but which is multimodal, skewed, and everything else in
between. Everything depends on the particular situation, and general
statements of the types "Normal distribution describes everything" and
"Normal distribution is never applicable" are equally dubious.

>: Actually, that's a lie, too. The real problem is that there is NO
>: distribution (Deming's analytic vs enumerative distinction)

Well, if you work with a finite population, and it is "finite enough" that you
can handle it as an entity, then there is no need for probability and all
the headaches connected with its use -- you are in the framework of
Descriptive Statistics.
However, if your population is actually infinite (maybe, in some physical
experiments), or is finite but so large that you cannot handle it (like
the population of molecules even in a small jar, or the population of
fish in the ocean), or doesn't exist at the moment although we know that
more and more of its elements will be generated as time passes -- doesn't
it remind you of a technological process in a company that doesn't plan to
close on the second day after opening? -- you need special tools. So far
we don't know anything better than Probability Theory and, therefore,
Inferential (let's admit, Mathematical) Statistics. Their major concept
is that of distribution.
Of course, only the Lord knows whether there is or is not a
distribution. We use a MODEL, based on more (which is better) or less
(which is worse) plausible assumptions (hopefully) based on careful
analysis of the problem. Why, after all, do we say that the probability of
getting "tails" when flipping a coin is equal to 1/2? What is MIL STD
105?
If there is no distribution (which means that our analysis of the problem
convinces us that we cannot develop a productive, i.e., predictive,
model), then there is no place for SPC and statisticians, so let's go
home. I agree that such situations do exist, but I am not sure it's the
general case.

I basically agree with Bert that people who know little of statistics
contribute considerably to the existing mess. I would only change his
statement a little and rather say "who UNDERSTAND LITTLE of Statistics'
basic ideas and concepts", since sometimes they know a host of
statistical-sounding words, names, titles, etc.; they are especially
successful at using and inventing abbreviations.
I don't agree with him that they don't know the real world. They know very
well that in the real world juggling scientific-sounding words
often leads to personal success. However, nothing else but successes in
applying Statistics form the grounds for this, after all.

John Ongtooguk

Dec 19, 1995, 3:00:00 AM

George Zeliger (ZELIGER....@asqcnet.org) wrote:
: produc...@fc.hp.com (John Ongtooguk) wrote:
: ....
: Imagine that you manufacture contraceptive pills and only guarantee that
: the amount of the active ingredient is correct "on average" for your
: production, rather than for, say, 99.9% of it. I guess some people would
: like to say a couple of not so polite words to you in this case.
: Statistical tolerance intervals as they are calculated nowadays
: "overestimate" their parent core intervals, which implies losses for the
: manufacturer. If you want the tolerance intervals to be close enough to the
: core intervals, you will end up needing a number of observations
: of the same order as is necessary for the samples discussed above.
:
: >                       % 'good'   confidence   +/- s   sample size
: > sample = population    99.73         ?         3.0    hundreds ?
: > tolerance interval     99.9         95         4.5    24
: > tolerance interval     99.0         80         3.0    26
: > tolerance interval     95.0         80         2.5    14
:
: I missed something in your correspondence with Bert, so could you please
: explain what the statement "sample=population" means?

"sample = population" is the assumption that any sample of a
population is the same as the population, which per my perhaps
incorrect understanding is wrong. As I understand it the
statement that '+/- 3 std dev = 99.73%' refers to a known
population standard deviation and that if one wants to compare
tails with a spec limit using the same statement one needs
to know the population mean. Using your example above one
would need to collect all of the pills that are to be made
in order to determine the population mean and standard deviation,
which at the start of a five year production run might prove
to be impractical. A practical solution is to determine a
sample mean and standard deviation, which will require a
statement of confidence based upon sample size and the
portion of the distribution under consideration, process
control issues aside. The overestimation that you refer to
is something that some of us have to live with as we cannot
work with samples of hundreds or more.

Using your example above one would want to minimize the
consumer's risk, so a very large percentage of the population
of pills should have an adequate level of active ingredient
at a high confidence level as determined by an adequate
sample size. The statement 'sample = population' refers
to the assumption that they are the same, and that the
mean and standard deviation of any sample greater than
10 (?) 20 (?) 30 (?), whatever your statistician says is
an acceptable approximation for a normal distribution,
will be adequate for estimating the percentage of the
population that 'meets spec', while ignoring any statement
of confidence.
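
What "ignoring any statement of confidence" costs can be seen in a simulation. This is a sketch under the assumption of a truly normal, stable process; the seed, trial count, and function name are arbitrary. For each simulated sample it computes the fraction of the population that actually lies within mean +/- k*s and counts how often that fraction falls short of the claimed one.

```python
import random
import statistics
from statistics import NormalDist

def shortfall_rate(n=24, k=3.0, target=0.9973, trials=5000, seed=3):
    """Share of samples whose interval mean +/- k*s covers less than `target`."""
    rng = random.Random(seed)
    population = NormalDist()   # true population: standard normal
    short = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = statistics.fmean(x)
        s = statistics.stdev(x)
        coverage = population.cdf(m + k * s) - population.cdf(m - k * s)
        if coverage < target:
            short += 1
    return short / trials

naive = shortfall_rate(n=24, k=3.0, target=0.9973)
tolerance = shortfall_rate(n=24, k=4.5, target=0.999)
print(f"sample = population, k=3.0: falls short in {naive:.0%} of samples")
print(f"tolerance interval,  k=4.5: falls short in {tolerance:.0%} of samples")
```

In this setup the plain 3s interval should fall short of 99.73% more often than not, while the k = 4.5 interval from the chart should miss its 99.9% target only in a small share of samples, consistent with its stated 95% confidence.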

John Ongtooguk (jo...@vcd.hp.com)