ANOVA and measurement repeatability study

voice_o...@australia.edu

Oct 25, 2009, 2:42:26 AM

Greetings:

I am running an experiment to test the repeatability of a parts
measuring system.

The experiment involves taking a part from the assembly line, placing
it in its fixture, and having two inspectors take measurements at
pre-selected points -- each point measured twice -- in random order. The
part is then removed from the fixture, replaced back into it... and the
process repeated 4 more times.

Ho: There is no difference between measurements taken after repeated
mountings in the fixture

My assumption was that by comparing the p-value for the mountings, I
could determine if this null assumption should be rejected.

However, I just read a paper which stated that the repeatability
should be calculated as

REPEAT = 5.15 * SQRT(Mean Square of Error term)

Why is this? How do you use this resulting value to determine YES/NO
of system repeatability?

Is it incorrect to use the p-value?

Explanations appreciated!

Thanx

Ray Koopman

Oct 25, 2009, 3:25:43 AM

On Oct 24, 11:42 pm, voice_of_rea...@australia.edu wrote:
> Greetings:
>
> I am running an experiment to test the repeatability of a parts
> measuring system.
>
> The experiment involves taking a part from the assembly line,
> placing it in its fixture, having two inspectors take measurements at
> pre-selected points -- each point measured twice -- in random order.
> The part is then removed from the fixture, replaced back into it....
> and the process repeated 4 more times.
>
> Ho: There is no difference between measurements taken after repeated
> mountings in the fixture
>
> My assumption was that by comparing the p-value for the mountings,
> I could determine if this null assumption should be rejected.
>
> However, I just read a paper which stated that the repeatability
> should be calculated as
>
> REPEAT = 5.15 * SQRT(Mean Square of Error term)
>
> Why is this? How do you use this resulting value to determine YES/NO
> of system repeatability?
>
> Is it incorrect to use the p-value?
>
> Explanations appreciated!
>
> Thanx

It is not clear exactly what your design is.

How many parts were tested? Were they nominally the same?

Did the same 2 inspectors do all 4 replications on all n parts?
That is, what were the nesting/crossing relations
among inspectors, parts, and replications?

How many preselected points were there?
Do they represent different dependent variables,
or different levels of the same variable?

Rich Ulrich

Oct 25, 2009, 5:50:30 PM

On Sat, 24 Oct 2009 23:42:26 -0700 (PDT),
voice_o...@australia.edu wrote:

>Greetings:
>
>I am running an experiment to test the repeatability of a parts
>measuring system.
>
>The experiment involves taking a part from the assembly line, placing
>it in its fixture, having two inspectors take measurements at
>pre-selected points -- each point measured twice -- in random order. The
>part is then removed from the fixture, replaced back into it....and the
>process repeated 4 more times.
>
>Ho: There is no difference between measurements taken after repeated
>mountings in the fixture
>
>My assumption was that by comparing the p-value for the mountings, I
>could determine if this null assumption should be rejected.

Well, you have not specified what you are testing, or
what excess error might cause you to decide that
there is *too* much irregularity/unreliability.

From what you describe, I think you could perform tests
to see if one rater is systematically different from the other;
or whether the selected points differ systematically; or
whether *order* systematically matters.

However --
a) all of those could differ "significantly" by the tests,
and your measurements *could* still be accurate enough
for your purposes; or,
b) it could happen that none of the tests "reject" while
the inherent accuracy of measurement is too poor to be
acceptable.

Your potential tests will tell you if the means are different.
You have to examine the means in order to decide, on
your own, what to do about it.

If the mean differences are all tiny enough to ignore,
despite being "significant" by a test, then you might be
best served to ignore them. - As a student, I participated
in an experiment that used experienced, paired nurses to
measure blood pressure. One nurse measured systematically
4 mm higher than the others, and that was later determined
to be blamed on her head-cold (muffling her hearing) on that
particular day. In any case, the 4 mm was a trivial difference.
- You can ignore them, or you can act on what you can learn
from them, in order to improve precision in the future.

If the mean differences are *large* enough to be a
serious concern, despite being "not significant", then
you have a problem -- since you therefore have no
confidence that any number given is a good estimate.

>
>However, I just read a paper which stated that the repeatability
>should be calculated as
>
>REPEAT = 5.15 * SQRT(Mean Square of Error term)

This is 5.15 times the "error of measurement" for your
whole system. It gives a number to use as a range (or
maybe a half-range) for any given measurement. It does
not say anything about systematic differences that
might have been detected by the tests.

Using plus-or-minus (2* sqrt(MSE) ) would give approximately
a 95% confidence interval for a single measurement. Is this
small enough? The author of your paper was being more
strict than 95% CI, and was probably accounting for the
N (degrees of freedom) for his particular study... since
I don't recognize the relevance of 5.15. The usual CIs
are derived from tables of the t distribution.

*You* have to make the decision of whether a measurement
is precise enough to be useful. What decision is being made?
Do you want a vague warning when a number seems off-target?
Do you want to be *sure* that something is wrong when
a number seems off? -- The smaller the range, the better
the decision is apt to be.

>
>Why is this? How do you use this resulting value to determine YES/NO
>of system repeatability?
>
>Is it incorrect to use the p-value?

What you learn from the p-values can tell you about
sources of error that do exist, and which *might* be
decreased; but it does not tell you whether those errors
have to matter to you.

--
Rich Ulrich

voice_o...@australia.edu

Oct 26, 2009, 3:58:52 AM

Hi, thanks for your response.....

On Oct 25, 3:25 pm, Ray Koopman <koop...@sfu.ca> wrote:
>
> How many parts were tested?

>> On Oct 24, 11:42 pm, voice_of_rea...@australia.edu wrote:
> > The experiment involves taking a part from the assembly line....(i.e. one part)

>
>Did the same 2 inspectors do all 4 replications on all n parts?
>

>>.....having THE SAME two inspectors take measurements at pre-selected points -- each point measured twice....

> How many preselected points were there?

How does this affect whether or not I can use the p-value in the
manner I originally assumed?

Again, my original approach was....

> > Ho: There is no difference between measurements taken after repeated
> > mountings in the fixture
>
> > My assumption was that by comparing the p-value for the mountings,
> > I could determine if this null assumption should be rejected.

To put it another way:

One part was measured repeatedly. If the mean difference in the
readings obtained is significant, then it seems to me the measurement
system is unstable (non-repeatable). Doesn't the p-value indicate the
probability that the results are explained by the null hypothesis? As
such, doesn't a p-value less than the selected alpha indicate that the
null hypothesis should be rejected? And as such doesn't that imply
that the measuring system is NOT producing repeatable results?

Ray Koopman

Oct 26, 2009, 2:39:10 PM

On Oct 26, 12:58 am, voice_of_rea...@australia.edu wrote:
> Hi, thanks for your response.....
>
> On Oct 25, 3:25 pm, Ray Koopman <koop...@sfu.ca> wrote:
>
>> How many parts were tested?
>>> On Oct 24, 11:42 pm, voice_of_rea...@australia.edu wrote:
>>> The experiment involves taking a part from the assembly line....(i.e. one part)
>
>> Did the same 2 inspectors do all 4 replications on all n parts?
>
>>>.....having THE SAME two inspectors take measurements at pre-selected points -- each point measured twice....
>> How many preselected points were there?
>
> How does this affect whether or not I can use the p-value in the
> manner I originally assumed?

I asked about the procedure because it was not clear what you had done
or what statistical analyses would be justified. "One should aim not
at being possible to understand, but at being impossible to
misunderstand." [Quintilian]

>
> Again, my original approach was....
>
>>> Ho: There is no difference between measurements taken after repeated
>>> mountings in the fixture

That's a trivial hypothesis. The existence of a single observed
difference, of any magnitude, would suffice to reject it.

>
>>> My assumption was that by comparing the p-value for the mountings,
>>> I could determine if this null assumption should be rejected.
>
> To put it another way:
>
> One part was measured repeatedly. If the mean difference in the
> readings obtained is significant, then it seems to me the measurement
> system is unstable (non-repeatable). Doesn't the p-value indicate the
> probability that the results are explained by the null hypothesis? As
> such, doesn't a p-value less than the selected alpha indicate that the
> null hypothesis should be rejected? And as such doesn't that imply
> that the measuring system is NOT producing repeatable results?

You don't care about the mean difference (which almost certainly is
not exactly zero) as much as you care about the tails of the
distribution of measurements. How much disagreement among measurements
can your process tolerate? What proportion of the measurements are
acceptably close to one another? Is the 99% confidence interval for a
single observation narrow enough?
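
One way to tie the 99% question back to the REPEAT formula quoted earlier: for normally distributed measurement error, the central 99% of single readings spans about +/- 2.576 standard deviations, a total width of roughly 5.15 sigma, which is presumably where the 5.15 comes from. A minimal Python sketch, assuming normality and using sqrt(MSE) as the estimate of sigma; the `mse` and `tolerance` values are made up for illustration, and `repeatability_spread` is a hypothetical helper name:

```python
from statistics import NormalDist
import math

# z-value that leaves 0.5% in each tail of a standard normal distribution
z_99 = NormalDist().inv_cdf(0.995)
multiplier = 2 * z_99  # full width of the central 99% band, in units of sigma


def repeatability_spread(mse, k=multiplier):
    """Width of the central 99% band for one reading, given the ANOVA error mean square."""
    return k * math.sqrt(mse)


# Hypothetical numbers, for illustration only
mse = 0.0004       # error mean square from the ANOVA table (units squared)
tolerance = 0.50   # total width of the part's tolerance band (same units)

spread = repeatability_spread(mse)
print(f"multiplier = {multiplier:.2f}")
print(f"99% spread = {spread:.3f}")
print(f"share of tolerance = {spread / tolerance:.1%}")
```

Whether the resulting share of the tolerance is acceptable is still a judgment call for whoever uses the measurements; the calculation only puts the spread on the same scale as the tolerance.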

Rich Ulrich

Oct 26, 2009, 4:27:46 PM

On Mon, 26 Oct 2009 00:58:52 -0700 (PDT),
voice_o...@australia.edu wrote:

>Hi, thanks for your response.....
>
>On Oct 25, 3:25 pm, Ray Koopman <koop...@sfu.ca> wrote:
>>
>> How many parts were tested?
>
>>> On Oct 24, 11:42 pm, voice_of_rea...@australia.edu wrote:
>> > The experiment involves taking a part from the assembly line....(i.e. one part)
>>
>>Did the same 2 inspectors do all 4 replications on all n parts?
>>
>>>.....having THE SAME two inspectors take measurements at pre-selected points -- each point measured twice....
>
>
>> How many preselected points were there?
>
>How does this affect whether or not I can use the p-value in the
>manner I originally assumed?

Did my post of Oct 25 fail to appear on your server?

I think I covered your questions fairly thoroughly.

>
>Again, my original approach was....
>> > Ho: There is no difference between measurements taken after repeated
>> > mountings in the fixture
>>
>> > My assumption was that by comparing the p-value for the mountings,
>> > I could determine if this null assumption should be rejected.
>
>To put it another way:
>
>One part was measured repeatedly. If the mean difference in the
>readings obtained is significant, then it seems to me the measurement
>system is unstable (non-repeatable). Doesn't the p-value indicate the
>probability that the results are explained by the null hypothesis? As
>such, doesn't a p-value less than the selected alpha indicate that the
>null hypothesis should be rejected? And as such doesn't that imply
>that the measuring system is NOT producing repeatable results?

--
Rich Ulrich

voice_o...@australia.edu

Oct 27, 2009, 1:05:49 AM

Thank you again for your response....

On Oct 27, 2:39 am, Ray Koopman <koop...@sfu.ca> wrote:
> That's a trivial hypothesis. The existence of a single observed
> difference, of any magnitude, would suffice to reject it.

Ok fine....
Ho: There is no SIGNIFICANT difference between measurements taken
after repeated mountings in the fixture

> You don't care about the mean difference (which almost certainly is
> not exactly zero) as much as you care about the tails of the
> distribution of measurements.

Ok...let me ask the question this way...and hopefully there is a
simple answer...

For the experiment as outlined above (previous posts), what does a
p-value < alpha in the "parts" row signify?

[Rem: ordinarily I would think this would signify variation BETWEEN
parts....but since I am only using ONE PART in this experiment...such
observed variation must be coming from the measurement system itself.]

voice_o...@australia.edu

Oct 27, 2009, 1:07:08 AM

On Oct 27, 4:27 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:

> Did my post of Oct 25 fail to appear on your server?
>
> I think I covered your questions fairly thoroughly.
>
>

It just showed up. Sorry.

Can you address the question I just asked (above post)?

Rich Ulrich

Oct 27, 2009, 4:53:32 PM

Yes, I did address it. "Significant" means systematic
difference, which may or may not be large enough
to matter to you.

"Non-significant" means that whatever differences
exist are not systematic. However, it is again true
that the apparent size of the differences may or may
not matter to you. "What matters" is how well you
can depend on a given measurement, and how much
(and what) that tells you.

--
Rich Ulrich

voice_o...@australia.edu

Oct 27, 2009, 11:24:13 PM

On Oct 28, 4:53 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:

> "Significant" means systematic

Ok, so that seems to say that my methodology is correct.

A p-value < alpha means there is systematic difference in the
measurements being made. The "between parts" differences are
significant....and since I am in fact only using ONE part, this means
the measuring system is producing significantly different results ->
not repeatable.

Thank you!

Rich Ulrich

Oct 28, 2009, 3:19:49 PM

Uh-oh. It *seems* to me that you are missing the point, in
a couple of ways.

You have "one part". But - in the paradigm that I described -
you have two raters, and you have two locations. Or more.
THAT is what can be tested. "Different" does NOT imply
"not-repeatable"; in fact, one implication may be the opposite
of that.

"Systematic" requires a good degree of being "repeatable",
compared to the other sources of error and variation.

That is why I have said, several times, that the p-value does
not answer the question of whether the SIZE of the effect matters.

The SIZE of the variation, as described by the article that
you cited, is closer to the point -- although, you did not
mention what they may have said further about tested
differences. Re-read what I said.

If one rater is regularly 1 point smaller than the other,
- and thus, significant -
you can be in very good shape, if the relevant criterion
for actually *using* your measurements is something that
involves a difference of 10 points or 20 points or 50 points.
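
To put rough numbers on that, here is a small Python sketch with invented paired readings. The data, the 10-point threshold, and the variable names are all assumptions for illustration; the critical value is the tabled two-sided 5% point of t with 9 df:

```python
from statistics import mean, stdev
import math

# Hypothetical paired readings: rater B minus rater A at ten points
diffs = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 0.9, 1.1, 1.0, 1.0]

n = len(diffs)
d_bar = mean(diffs)                    # average systematic offset between raters
se = stdev(diffs) / math.sqrt(n)       # standard error of the mean difference
t_stat = d_bar / se                    # paired t statistic, n - 1 = 9 degrees of freedom

T_CRIT_9DF = 2.262                     # two-sided 5% critical value of t with 9 df
relevant_difference = 10.0             # hypothetical: smallest difference that would change a decision

print(f"mean difference = {d_bar:.2f}, t = {t_stat:.1f}")
print("statistically significant:", abs(t_stat) > T_CRIT_9DF)
print("large enough to matter:   ", abs(d_bar) >= relevant_difference)
```

The offset is consistent, so the test flags it easily, yet it is a tenth of the difference that would actually change a decision.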

--
Rich Ulrich

voice_o...@australia.edu

Oct 29, 2009, 2:25:49 AM

Thank you again

On Oct 29, 3:19 am, Rich Ulrich <rich.ulr...@comcast.net> wrote

> If one rater is regularly 1 point smaller than the other.....

....then that would show up in ANOVA table as a p-value < alpha in the
RATERS row. I am discussing the p-value in the PARTS row.

> "Different" does NOT imply "not-repeatable".....

If I in fact only have ONE part...yet my measurement system is showing
me significant differences with the measurements of that part...but
consistencies BETWEEN raters, then it seems to me that the difference
is in fact pointing out a lack of repeatability.

Ray Koopman

Oct 29, 2009, 2:25:40 PM

On Oct 26, 10:05 pm, voice_of_rea...@australia.edu wrote:
> [...]
>
> Ok...let me ask the question this way...and hopefully there is a
> simple answer...
>
> For the experiment as outlined above (previous posts), what does a
> p-value < alpha in the "parts" row signify?
>
> [Rem: ordinarily I would think this would signify variation BETWEEN
> parts....but since I am only using ONE PART in this experiment...such
> observed variation must be coming from the measurement system itself.]

If there is only one part, but the program nevertheless gave you a
p-value for the significance of the difference between parts, then
the program did not do the analysis that you think you told it to do.

Rich Ulrich

Oct 29, 2009, 2:52:24 PM

Is Ray right, that you are totally misreading the output?

What I "guessed" about the design on Oct 25 was that you
might have 3 factors: rater, location of measurement, and
order of measuring. You never confirmed that, so I don't
know. What do you mean here by "differences with the
measurements of that part"?

Whatever it is, what I emphasized about small differences
between raters - that they may be trivial (or can be adjusted
for) - is equally true about differences between locations.

Differences in order would need further exploration,
perhaps, to figure out what is going on - but those, too,
must be "systematic" in order to have a significant p-level.
And the actual size of the differences, in the context of
whatever you are using the measures for, is what matters --
NOT the p-level alone.

--
Rich Ulrich

voice_o...@australia.edu

Oct 29, 2009, 10:11:05 PM

On Oct 30, 2:25 am, Ray Koopman <koop...@sfu.ca> wrote:
> If there is only one part, but the program nevertheless gave you a
> p-value for the significance of the difference between parts, then
> the program did not do the analysis that you think you told it to do.

There is one part that is repeatedly mounted and unmounted from its
measuring fixture. Each mounting is entered into the program as a
"new" part.

Since it is in fact the SAME part, in theory the "between parts"
variation should be insignificant (I can verify the assumption that
mounting and unmounting does not affect the dimensions of the part).

If however there is significant difference, this implies that there is
something going on in the measuring system that is causing this same
part to appear to be different...to be producing significantly
different dimensions.

In other words, repeatedly using this fixture does NOT produce
repeatable results. The system is not repeatable.

voice_o...@australia.edu

Oct 29, 2009, 10:12:38 PM

On Oct 30, 2:52 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:

> What I "guessed" about the design on Oct 25 was that you
> might have 3 factors: rater, location of measurement, and
> order of measuring.

I'm not sure how you got this idea.

Please see my response to the previous post......

Ray Koopman

Oct 29, 2009, 11:46:59 PM

All right, now it makes sense -- it was just a labelling problem.
Assuming that the proper error term was used (which depends on the
fixed/random status of the other factors in the design, which have
not been specified), a significant main effect for "Part" (which might
more appropriately be called something like "Trial"), means that the
differences between the means of the 5 trials are bigger than would
usually be expected if there were no differences among the trials.
So either an unusual but accidental event has occurred, or there truly
are trial-to-trial differences among the measurements. However, the
p-value itself establishes only the unusualness of the event and says
nothing about whether the differences are big enough to be concerned
about.
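
A stripped-down illustration of that reading, written as a one-way ANOVA on the five mountings only. It ignores the inspector and point factors, so it is not the full analysis and the error term is an assumption; the readings are invented, `one_way_anova` is a hypothetical helper, and the critical value is the tabled 5% point of F with 4 and 15 df:

```python
from statistics import mean
import math


def one_way_anova(groups):
    """Return (F, MS_between, MS_error) for a one-way layout."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the readings around their own group mean
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_error = ss_error / (len(all_values) - len(groups))
    return ms_between / ms_error, ms_between, ms_error


# Invented readings of one point on one part: 5 mountings ("trials"), 4 readings each
trials = [
    [10.02, 10.01, 10.03, 10.02],
    [10.05, 10.06, 10.04, 10.05],
    [10.01, 10.02, 10.00, 10.01],
    [10.03, 10.04, 10.03, 10.02],
    [10.06, 10.05, 10.07, 10.06],
]

f_stat, ms_between, ms_error = one_way_anova(trials)
F_CRIT_4_15 = 3.06  # 5% critical value of F with (4, 15) df, from standard tables

trial_means = [mean(t) for t in trials]
print(f"F = {f_stat:.1f}, exceeds 5% critical value: {f_stat > F_CRIT_4_15}")
print(f"largest gap between trial means = {max(trial_means) - min(trial_means):.3f}")
print(f"5.15 * sqrt(MSE) = {5.15 * math.sqrt(ms_error):.3f}")
```

With these made-up numbers F is far above the critical value, so the trial means differ by more than chance would usually produce. The output still says nothing about whether a gap of that size matters for the part's tolerance.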

Rich Ulrich

Oct 30, 2009, 4:01:07 PM

On Thu, 29 Oct 2009 19:12:38 -0700 (PDT),
voice_o...@australia.edu wrote:

>On Oct 30, 2:52 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:
>
>> What I "guessed" about the design on Oct 25 was that you
>> might have 3 factors: rater, location of measurement, and
>> order of measuring.
>
>I'm not sure how you got this idea.
>

I got it directly from what you posted in the first post.

VOR>

>The experiment involves taking a part from the assembly line, placing
>it in its fixture, having two inspectors take measurements at
>pre-selected points -- each point measured twice -- in random order.

I point to "two inspectors", I point to "pre-selected points", and
I point to "in random order".

That makes up 3 factors. The conventional analysis would
be a Repeated Measures ANOVA, where "parts" is a term
that is not measured by the ANOVA ... if that denotes
the identity of the different 'parts' that are each replaced
and tested several times.

- Ray seems to accept that you are taking the "order" effect
and mis-labeling it as Parts. I suspect that he is being too
unsuspicious. I think you are doing something else.

The Repeated Measures with 3 factors can also be performed
as a 4-way ANOVA, one that would obtain an additional Sum of
Squares for Parts, which would *properly* be labeled Parts.
The 4-way is usually avoided because the extra SS terms are
not so generally interesting.

However, having *this* definition of Parts as "significant" is a
good thing for reliability. It says that measures *do* discriminate
between parts. That is the F-test that could be translated or
transformed into a "significant" ICC (intraclass correlation) --
though you usually want to know more than whether a
correlation is significant.

Compare the circumstance to one where you give 6 IQ tests
to a bunch of people -- if there is variation among the people,
then you *should* see a significant difference among "persons"
as a test. If there is no difference, *that* is what says
that you have unreliable tests, or else, not much measurable
difference between persons. And that consequence
would be true whether or not one test is systematically
10 points higher or lower than the others. The variability
of the measures is what matters for "how good is the testing,"
so long as you eventually do pay attention to which mean
is being used.

--
Rich Ulrich

voice_o...@australia.edu

Nov 1, 2009, 2:55:24 AM

On Oct 30, 11:46 am, Ray Koopman <koop...@sfu.ca> wrote:
>
> All right, now it makes sense --

GREAT!!

>...... a significant main effect for "Part" (which might
> more appropriately be called something like "Trial"), means that the
> differences between the means of the 5 trials are bigger than....

Yes!

> ...... the p-value itself establishes only the unusualness of the event and says
> nothing about whether the differences are big enough to be concerned
> about.

Agreed....but at this point "unusualness" is enough.....

voice_o...@australia.edu

Nov 1, 2009, 2:57:09 AM

On Oct 31, 4:01 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:
> On Thu, 29 Oct 2009 19:12:38 -0700 (PDT),
> voice_of_rea...@australia.edu wrote:
> >On Oct 30, 2:52 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:
>
> >> What I "guessed" about the design on Oct 25 was that you
> >> might have 3 factors: rater, location of measurement, and
> >> order of measuring.
>
> >I'm not sure how you got this idea.
>
> I got it directly from what you posted in the first post.
>
> VOR>
>
> >The experiment involves taking a part from the assembly line, placing
> >it in its fixture, having two inspectors take measurements at
> >pre-selected points -- each point measured twice -- in random order.
>
> I point to "two inspectors", I point to "pre-selected points", and
> I point to "in random order".

I didn't say anything about "different locations"

>
> - Ray seems to accept that you are taking the "order" effect
> and mis-labeling it as Parts.

Ray's understanding -- per the above post -- seems to be correct. I
am more confident now that my initial approach is viable.

Ray Koopman

Nov 1, 2009, 2:04:32 PM

On Oct 31, 11:57 pm, voice_of_rea...@australia.edu wrote:
> [...]
>
> Ray's understanding -- per the above post -- seems to be correct.
> I am more confident now that my initial approach is viable.

That confidence may be misguided. I do not understand what was done.
My interpretation of the p-value was conditional on the analysis
having been done correctly. It is still not clear what the
experimental design was or how the analysis was done.

After the clarification that there was only one part, the original
post can be interpreted as saying that the data were collected in a
2 (inspectors) x n (points) x 2 (replications) x 5 (trials) factorial
design. On each trial, each inspector measured each point twice.
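
Spelled out as a list of cells, that interpretation looks like the Python sketch below. The number of points n was never stated in the thread, so n = 3 is a placeholder, and the labels are invented:

```python
from itertools import product

inspectors = ["A", "B"]           # 2 inspectors
points = ["P1", "P2", "P3"]       # n points; n = 3 is a placeholder, the thread never gives n
replications = [1, 2]             # each point measured twice
trials = [1, 2, 3, 4, 5]          # initial mounting plus 4 re-mountings

# One tuple per measurement in the fully crossed design
design = list(product(trials, inspectors, points, replications))

print(len(design))                # 5 * 2 * 3 * 2 measurements in total
print(design[0])                  # first cell: (trial, inspector, point, replication)
```

A table like this, with one row per reading, is also the layout in which the randomized order (whichever order it was) could be recorded alongside each measurement.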

Some unspecified order was random. Was it which inspector went first?
Was it the order in which the points were measured? Did all this
change from trial to trial, or were the orders the same on every
trial?

And, again, what was the fixed/random status of each factor in the
analysis?