
MA MCAS statistical fallacy


Gene Gallagher

Jan 10, 2001, 4:32:43 PM
The Massachusetts Dept. of Education committed what appears to be a
howling statistical blunder yesterday. It would be funny if not for the
millions of dollars, thousands of hours of work, and thousands of
students' lives that could be affected.

Massachusetts has implemented a state-wide mandatory student testing
program, called the MCAS. Students in the 4th, 8th and 10th grades are
being tested and next year 12th grade students must pass the MCAS to
graduate.

The effectiveness of school districts is being assessed using average
student MCAS scores. Based on the 1998 MCAS scores, districts were
placed in one of 6 categories: very high, high, moderate, low, very low,
or critically low. Schools were given improvement targets based on the
1998 scores: schools in the highest two categories were expected to
increase their average MCAS scores by 1 to 2 points, while schools in
the lowest two categories were expected to improve their scores by 4-7
points (http://www.doe.mass.edu/ata/ratings00/rateguide00.pdf).

Based on the average of 1999 and 2000 scores, each district was
evaluated yesterday on whether it had met its goals. The report was
posted on the MA Dept. of Education web site:
http://www.doe.mass.edu/news/news.asp?id=174

Those familiar with "regression to the mean" know what's coming next.
The poor schools, many in urban centers like Boston, met their
improvement "targets," while most of the state's top school districts
failed to meet their improvement targets.

The Boston Globe carried the report card and the response as a
front-page story today:
http://www.boston.com/dailyglobe2/010/metro/Some_top_scoring_schools_faulted+.shtml

The Globe article describes how superintendents of high performing
school districts were outraged with their failing grades, while the
superintendent of the Boston school district was all too pleased with
the evaluation that many of his low-performing schools had improved:

[Brookline High School, for example, with 18 National Merit Scholarship
finalists and the highest SAT scores in years, missed its test-score
target - a characterization blasted by Brookline Schools Superintendent
James F. Walsh, who dismissed the report.

"This is not only not helpful, it's bizarre," Walsh said. ''To call
Brookline, Newton, Medfield, Weston, Wayland, Wellesley as failing to
improve means so little, it's not helpful. It becomes absurd when you're
using this formula the way they're using it.''

Boston School Superintendent Thomas W. Payzant, whose district had 52 of
113 schools meet or exceed expectations, was more blunt: "For the
high-flying schools, I say they have a responsibility to not be smug
about the level they have reached and continue to aspire to do better."]

Freedman, Pisani & Purvis (1998, Statistics 3rd edition) describe the
fallacy involved:
"In virtually all test-retest situations, the bottom group on the first
test will on average show some improvement on the second test and the
top group will on average fall back. This is the regression effect.
Thinking that the regression effect must be due to something important,
..., is the regression fallacy."
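
To see how strong the effect can be, here is a minimal simulation - all
numbers are hypothetical, on an MCAS-like 200-280 scale, not real data -
in which nothing about the schools changes between the two tests:

import numpy as np

rng = np.random.default_rng(0)
n_schools = 1500

# Stable "true" school means plus independent year-to-year noise.
true_mean = rng.normal(240, 8, n_schools)
test1 = true_mean + rng.normal(0, 3, n_schools)
test2 = true_mean + rng.normal(0, 3, n_schools)

top = test1 >= np.quantile(test1, 0.9)      # top decile on the first test
bottom = test1 <= np.quantile(test1, 0.1)   # bottom decile on the first test

print("top group:    %.1f -> %.1f" % (test1[top].mean(), test2[top].mean()))
print("bottom group: %.1f -> %.1f" % (test1[bottom].mean(), test2[bottom].mean()))
# The top group's average falls back and the bottom group's rises,
# purely because of the noise in the first measurement.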

I find this really disturbing. I am not a big fan of standardized
testing, but if the state is going to spend millions of dollars
implementing a state-wide testing program, then the evaluation process
must be statistically valid. This evaluation plan, falling prey to the
regression fallacy, could not have been reviewed by a competent
statistician.

I hate to be completely negative about this. I'm assuming that
psychologists and others involved in repeated testing must have
solutions to this test-retest problem.

If I'm missing the boat on this test-retest error, I'd also
appreciate others pointing it out.

--
Dr. Eugene D. Gallagher
ECOS, UMASS/Boston



Bob Hayden

Jan 11, 2001, 12:34:17 AM

A powerful case for competency testing of all public officials!-)


--

_
| | Robert W. Hayden
| | Work: Department of Mathematics
/ | Plymouth State College MSC#29
| | Plymouth, New Hampshire 03264 USA
| * | fax (603) 535-2943
/ | Home: 82 River Street (use this in the summer)
| ) Ashland, NH 03217
L_____/ (603) 968-9914 (use this year-round)
Map of New hay...@oz.plymouth.edu (works year-round)
Hampshire http://mathpc04.plymouth.edu (works year-round)

The State of New Hampshire takes no responsibility for what this map
looks like if you are not using a fixed-width font such as Courier.

"Opportunity is missed by most people because it is dressed in
overalls and looks like work." --Thomas Edison


J. Williams

Jan 11, 2001, 9:46:14 AM
Francis Galton explained it in 1885. Possibly, the Mass. Dept. of
Education missed it! Or, could it be that the same gang who brought
us the exit poll data during the November election were helping them
out? :-)

I am wondering why they did not have a set of objective standards for
ALL students to meet. Of course, it is nice to reward academically
weaker districts for "improving," but the real issue may not be
"improvement"; rather, it might be attainment at a specific level for
all schools as a minimum target. A sliding scale depicting
"improvement" means little if the schools in question are producing
students who fall behind in math, reading comprehension, etc.
Rewarding urban schools for improving probably is a good idea, but
that should not mean entering a zero sum game with the "good"
schools. When a given school is already "good" it naturally can't
"improve" more than schools on the bottom of the achievement ladder.
It seems they really should have prepared a better public announcement
of results. Rather than "knocking" the high achieving schools, they
should praise them justifiably. Then, noting the improvement in the
large urban schools would seem positive as well.

Robert J. MacG. Dawson

Jan 11, 2001, 8:47:08 AM

Gene Gallagher wrote:
>
> Those familiar with "regression to the mean" know what's coming next.
> The poor schools, many in urban centers like Boston, met their
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> improvement "targets," while most of the state's top school districts
> failed to meet their improvement targets.

Wait one... Regression to the mean occurs because of the _random_
component in the first measurement. Being in an urban center is not part
of the random component - those schools' grades didn't improve because
some of them woke up one day and found that their school had moved to a
wealthier district.
If the effect of nonrandom components such as this is large enough
(as I can well believe) to justify the generalization highlighted above,
and if there was a strong pattern of poor-performing schools meeting
their targets and better-performing schools not doing so, we are looking
at something else - what, I'll suggest later.


> The Globe article describes how superintendents of high performing
> school districts were outraged with their failing grades, while the
> superintendent of the Boston school district was all too pleased with
> the evaluation that many of his low-performing schools had improved:
>
> [Brookline High School, for example, with 18 National Merit Scholarship
> finalists and the highest SAT scores in years, missed its test-score
> target - a characterization blasted by Brookline Schools Superintendent
> James F. Walsh, who dismissed the report.

There *is* a problem here, but it's not (entirely) regression to the
mean. If I recall correctly, Brookline High School is internationally
known as an excellent school, on the basis of decades of excellent
teaching. If it couldn't meet its target, it's not because its presence
among the top schools was a fluke in the first measurement - it's
probably because the targets for the top schools were unrealistic.

Was there any justification for the assumption voiced by the Boston
superintendent that the top-performing schools were in fact not
performing at their capacity and would be "smug" if they assumed that
their present performance was acceptable? The targets described seem to
imply that no school in the entire state - not one - was performing
satisfactorily, even the top ones. Perhaps this was felt to be true, or
perhaps it was politically more acceptable to say "you all need to pull
your socks up" than to say "the following schools need to pull their
socks up; the rest of you, steady as she goes."

As a reductio ad absurdum, if this policy were followed repeatedly,
it would be mathematically impossible for any school to meet its target
every year. That - and not regression to the mean - is the problem
here, I
think.

-Robert Dawson

Robert J. MacG. Dawson

Jan 11, 2001, 10:31:42 AM
A couple of additional thoughts I didn't get around to before leaving for
my 8:30 lecture:

(1) The clearest way of looking at the stats side of things is probably
that one would expect a high enough r^2 between schools' performances in
one year and in the next that regression to the mean would be a rather
minor phenomenon.

(2) _Why_ were even the best schools expected to improve, with targets
that seem to have been overambitious? I would hazard a guess that it
might be due to an inappropriate overgeneralization of the philosophy -
appropriate, in an educational context, for individual students - that
progress should constantly be made.

Getting further off-topic, we see the same thing in economics, where
the standard model for the Western economies is one of constant growth,
and our institutions seem unable to adapt to slight shrinkage - or even
slower-than-usual growth - without pain all round.

Our culture seems to concentrate on the idea that the moment an
institution stops growing it starts to die, and does not seem to put
much effort into maintaining "mature" institutions for which the
constant-growth paradigm is no longer appropriate. Maybe this cultural
neoteny is still appropriate and advantageous at this stage in history -
I don't know.


-Robert

Herman Rubin

Jan 11, 2001, 2:37:21 PM
In article <3a5dc2d7...@news.earthlink.net>,

J. Williams <kak2...@excite.com> wrote:
>Francis Galton explained it in 1885. Possibly, the Mass. Dept. of
>Education missed it! Or, could it be that the same gang who brought
>us the exit poll data during the November election were helping them
>out? :-)

>I am wondering why they did not have a set of objective standards for
>ALL students to meet.

There are only two ways this can be done. One is by having
the standards so low as to be useless, and the other is by
not allowing the students who cannot do it to get to that
grade, regardless of age. The second is, at this time,
Politically Incorrect.

................

>When a given school is already "good" it naturally can't
>"improve" more than schools on the bottom of the achievement ladder.

It can, by changing curriculum and speeding things up. This
is also not Politically Correct.

>It seems they really should have prepared a better public announcement
>of results. Rather than "knocking" the high achieving schools, they
>should praise them justifiably. Then, noting the improvement in the
>large urban schools would seem positive as well.

The biggest factor in the performance of schools is in the
native ability of students; but again it is Politically
Incorrect to even hint that this differs between schools.
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

dennis roberts

Jan 11, 2001, 1:46:06 PM
At 11:31 PM 1/10/01 -0500, Bob Hayden wrote:

regression to the mean applies to relative position ... NOT raw scores

let's say we give a test called a final exam at the beginning of a course
... and assume for a moment that there is some spread ... though the mean
necessarily would be rather low ... then, we give an alternate form of this
final exam at the end of the course ... where again, there is reasonable
spread but, obviously, the mean has gone up a lot ...

now, EVERYONE'S SCORES GO UP ... so everyone improves ... and it is not
that the low scores (BECAUSE of regression) will improve more and the
better scoring students on the pretest will improve less ... that is NOT
what regression to the mean is all about ...

so, it depends on how these tests are scored and reported ... if the scores
are reported on something like a percentile score basis ... then there is
necessarily a problem ... but, if the scores are reported on some scale
that reflects that 10th grade scores are higher than 8th grade scores ...
and 8th grade scores are necessarily higher than 4th grade scores ... that
is, the scores reflect an ever increasing general level of knowledge ...
then regression to the mean is not the bugaboo that the "letter" makes it
out to be

now, the post said:

The effectiveness of school districts is being assessed using average
student MCAS scores. Based on the 1998 MCAS scores, districts were
placed in one of 6 categories: very high, high, moderate, low, very low,
or critically low. Schools were given improvement targets based on the
1998 scores: schools in the highest two categories were expected to
increase their average MCAS scores by 1 to 2 points, while schools in
the lowest two categories were expected to improve their scores by 4-7
points

=========
there are a number of ? that this paragraph brings to mind:

1. how are categories of very high, etc. ... translated into 1 to 2 points
... or 4 to 7 points? i don't see any particular connection of one to the other

2. we have a problem here of course that the scores in a district are
averages ... not scores for individual kids in 4th, 8th, and 10th grades

3. what does passing mean in this context?

4. let's say there are 50 districts ... and, for last year ... using 4th
grade as an example ... we line up from highest mean for a district down to
lowest mean for a district .... then, in the adjacent column, we put what
those same districts got as means on the tests for the 4th grade this year
....

we would expect this correlation to be very high ... for two reasons ...
first, means are being used and second, from year to year ... school
district's population does not change much ... so if one district has on
average, a lower scoring group of 4th grade students .... that is what is
going to be the case next year

thus, given this ... we would NOT expect there to be much regression to the
mean ... since the r between these two variables i bet is very high

5. but, whatever the case is in #4 ... what does this have to do with
CHANGE IN MEAN SCORES? or changes in the top group of at least 1-2 points
and in the low groups changes of 4-7 points? the lack of r between the two
years of 4th grade means on these tests just means that their relative
positions change with the higher ones not looking as relatively high ...
and the low ones not looking quite so relatively low BUT, your position
could change up or down relatively speaking regardless of whether your mean
test performance went up or down ... or stayed the same


bottom line: we need a lot more information about exactly what was done ...
and how improvement goals were defined in the first place ... before we can
make any reasonable inference that regression to the mean would have
anything to do with better districts being bad mouthed and poorer
performing districts being praised

Rich Ulrich

Jan 11, 2001, 3:48:52 PM
On Wed, 10 Jan 2001 21:32:43 GMT, Gene Gallagher
<eugen...@my-deja.com> wrote:

> The Massachusetts Dept. of Education committed what appears to be a
> howling statistical blunder yesterday. It would be funny if not for the
> millions of dollars, thousands of hours of work, and thousands of
> students' lives that could be affected.
>

< snip, much detail >

> I find this really disturbing. I am not a big fan of standardized
> testing, but if the state is going to spend millions of dollars
> implementing a state-wide testing program, then the evaluation process
> must be statistically valid. This evaluation plan, falling prey to the
> regression fallacy, could not have been reviewed by a competent
> statistician.
>
> I hate to be completely negative about this. I'm assuming that
> psychologists and others involved in repeated testing must have
> solutions to this test-retest problem.

The proper starting point for a comparison for a school
should be the estimate of the "true score" for the school:
the regressed-predicted value under circumstances of
no-change. "No-change" at the bottom would be satisfied
by becoming only a little better; no-change at the top
would be met by becoming only a little worse. If you are
dumping money into the whole system, then you might
hope to (expect to?) bias the changes into a positive direction.
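
A sketch of that no-change baseline (the numbers are made up for
illustration; the year-to-year correlation r would have to be estimated
from the actual school means):

# Regressed "no-change" prediction: mean + r * (observed - mean).
state_mean = 240.0   # hypothetical statewide mean scaled score
r = 0.85             # assumed year-to-year correlation of school means

def no_change_prediction(observed):
    return state_mean + r * (observed - state_mean)

for score in (255.0, 240.0, 225.0):
    print(score, "->", round(no_change_prediction(score), 1))
# 255.0 -> 252.8 : a top school is "expected" to slip a little
# 225.0 -> 227.2 : a bottom school is "expected" to gain a little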

I thought it was curious that the

"schools in the highest two categories were expected to increase
their average MCAS scores by 1 to 2 points, while schools in the
lowest two categories were expected to improve their scores by 4-7
points."

That sounds rational in form. It appears to me that their model
might have the correct form, but the numbers surprised them.
That is: It looks as if someone was taking into account regression of
a couple of points, then hoping for a gain of 4 or 5 points. That
(probably) under-estimated the regression-to-the-mean, and
over-estimated how much a school could achieve by freshly
swearing to good intentions.

What is needed -- in addition to self-serving excuses -- is an
external source of validation. And it should validate in cases that
are not predicted by regression to the mean.

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

dennis roberts

Jan 11, 2001, 3:00:36 PM
i went to some of the sites given in the urls ... and, quite frankly, it is
kind of difficult to really get a feel for what has transpired ... and how
targets were set ... and how goals were assessed

regardless of whether we like this kind of an approach for accountability
... or not ... we all have to admit that there are a host of problems with
it ... many of these are simply political and policy oriented (in fact,
these might be the largest of the problems ... when the legislature starts
enacting regulations without a real good understanding of the
methodological problems) ... some are measurement related ... and yes,
some are statistical in nature

we do have components of this total process

1. there are tests that are developed/used/administered/scored ... in 4th
and 8th and 10th grades ... these are NOT the same tests of course ... so,
one is never sure what it means to compare "results" from say the 8th grade
to the 4th grade ... etc.

2. then we have the problem that one year ... we have the data on the 4th
graders THAT year ... but, the next year we have data on the 4th graders
for THAT year ... these are not the same students ... so any direct
comparison of the scores ... 4th to 4th ... or 8th to 8th or 10th to 10th
... are not totally comparable ... so, ANY difference in the scores ... up
or down ... cannot be necessarily attributed to improvement or lack of
improvement ... the changes could be related and in fact, totally accounted
for because there are small changes in the abilities of the 4th graders one
year compared to another year ... (or many other reasons)

3. as i said before, we also have the problem of using aggregated
performance ... either averages of schools and/or averages for districts
... when we line them up and then assign these quality names of very high,
high, etc.
there is a necessary disconnect between the real goals of education ...
that is, helping individual kids learn .. and the way schools or districts
are being evaluated ... when averages are being used ...

4. i would like to know how on earth ... these 'standards' for
"dictated" improvement targets were derived ... did these have anything to
do with real data ... or an analysis of data ... or, were just roundtabled
and agreed to? we have to know this to be able to see if there is any
connection between policy targets and statistical problems

5. we have to try to separate relative standing data from actual test score
gain information ... and we don't know how or if the ones setting the
standards and making decisions ... know anything about this problem

so, to summarize ... there are many many issues and problems with
implementing any system whereby you are trying to evaluate the performance
of schools and districts ... and, perhaps the least important of these is a
statistical one ... set in the context of political policy matters ... that
a legislature works with ... and "legislates" targets and practices without
really understanding the full gamut of difficulties when doing so

unfortunately, in approaches like this, one makes an assumption that if a
school ... or district ... gets better (by whatever measure) ... that this
means that individual students get better too ... and we all know of course
that this is NOT NECESSARILY TRUE ... and in fact we know more than that
... we know that it is NOT true ... in many cases

sure, it is important to make some judgements about how schools and
districts are doing ... especially if each gets large sums of money from
taxpayers ... but, the real issue is how we work with students ... and how
each and every one of them do ... how each kid improves or not ... and, all
these approaches to evaluating schools and districts ... fail to keep that
in mind ... thus, in the final analysis, all of these systems are
fundamentally flawed ... (though they still may be useful)

Robert J. MacG. Dawson

Jan 11, 2001, 3:10:12 PM

Paul R Swank wrote:
>
> Regression toward the mean occurs when the pretest is used to form the groups, which it appears is the case here.

Of course it "occurs": - but remember that the magnitude depends on
r^2. In the case where there is strong correlation between the pretest
and the posttest, we do not expect regression to the mean to be
particularly significant.

Now, it is generally acknowledged that there are some schools which
_consistently_ perform better than others. (If that were not the case,
nobody would be much surprised by any one school failing to meet its
goal!) Year-over-year variation for one school is presumably much less
than between-school variation.

Therefore, I would not expect regression to the mean to be sufficient
to explain the observed outcome (in which "practically no" top schools
met expectations); and I conclude that the goals may well have been
otherwise unreasonable. Indeed, requiring every school to improve its
average by at least two points every year is not maintainable in the
long run, and only justifiable in the short term if there is reason to
believe that *all* schools are underperforming.

-Robert Dawson

Ronald Bloom

Jan 11, 2001, 5:18:48 PM
Robert J. MacG. Dawson <Robert...@stmarys.ca> wrote:
>
> (2) _Why_ were even the best schools expected to improve, with targets
> that seem to have been overambitious?? I would hazard a guess that it
> might be due to an inappropriate overgeneralization of the philosophy -
> appropriate, in an eduational context, for individual students - that
> progress should constantly be being made.

Continual Growth is the state religion. If you're not "growing"
you're falling behind. In business, there is no such thing anymore
as a "reasonable return". Nowadays, there is in its stead the
mantra of a reasonable rate of growth of the rate of return.

Ronald Bloom

Jan 11, 2001, 5:31:47 PM
Herman Rubin <hru...@odds.stat.purdue.edu> wrote:
> In article <3a5dc2d7...@news.earthlink.net>,
> J. Williams <kak2...@excite.com> wrote:
>>Francis Galton explained it in 1885. Possibly, the Mass. Dept. of
>>Education missed it! Or, could it be that the same gang who brought
>>us the exit poll data during the November election were helping them
>>out? :-)

>>I am wondering why they did not have a set of objective standards for
>>ALL students to meet.

> There are only two ways this can be done. One is by having
> the standards so low as to be useless, and the other is by
> not allowing the students who cannot do it to get to that
> grade, regardless of age. The second is, at this time,
> Politically Incorrect.

And what alternative do you propose? Sending the underachievers
to work in the fields as soon as signs of promise fail
to manifest?

[...]

> The biggest factor in the performance of schools is in the
> native ability of students; but again it is Politically
> Incorrect to even hint that this differs between schools.

It may be "politically incorrect" to say so. But does that
support the proposition in any way shape or form? So go
on, "hint"; get up on a beer-barrel and "hint" that the
"fit" are languishing from the ignominious condition of
having to suffer the presence of the "unfit". You'll
have plenty of company: Pride is greedier even than mere
Avarice.


-- R. Bloom

Robert J. MacG. Dawson

Jan 11, 2001, 5:21:54 PM

Paul R Swank wrote:
>
> Robert:
>
> Why would you expect a strong correlation here? You're talking about tests done a year apart with some new kids in each school and some kids who have moved on.

Simply because there seems to be general consensus that there are such
things as "good schools" and "poor schools". This may be partially
because some schools have catchment areas in which parents are better
educated/more suppportive/able to afford breakfast for their kids and
others aren't, partially because schools in some areas have better
funding, partially because some schools have better teachers, or for a
host of other reasons.


> Is regression toward the mean causing all of the noted results? Probably not. But it is quite conceivable that it could be partially responsible for the results.

It would certainly contribute - but I think it would be a minor
contribution, in the presence of goals as stated.

Richard A. Beldin

Jan 12, 2001, 5:46:09 AM
In one way or another, we have to give the slower students more time. We
can do it by making courses which allow students to progress (or not) at
their own pace or by flunking them so they can do it all over. The
former is more efficient, but the latter works too. What doesn't work is
pretending that slower learners completed their course in the same time
as the quicker ones.

Gene Gallagher

Jan 12, 2001, 8:12:50 AM
In article <3A5DA78A...@stmarys.ca>,
Robert...@STMARYS.CA (Robert J. MacG. Dawson) wrote:
<Snip>

>
> Wait one... Regression to the mean occurs because of the _random_
> component in the first measurement. Being in an urban center is not
> part of the random component - those schools' grades didn't improve
> because some of them woke up one day and found that their school had
> moved to a wealthier district.
> If the effect of nonrandom components such as this is large enough
> (as I can well believe) to justify the generalization highlighted
> above, and if there was a strong pattern of poor-performing schools
> meeting their targets and better-performing schools not doing so, we
> are looking at something else - what, I'll suggest later.
<Snip>

I do believe that regression to the mean is involved here. Each one of
the 1539 schools in the state is being evaluated. The performance of the
1998 group of 4th graders is being compared to the mean of the 1999 and
2000 4th graders. Every school is expected to improve, with the top
performing schools being expected to improve by an average of 2 points
on the scaled score. The scaled score is an odd duck, ranging from 200
to 280. There is a core group of questions on the 1998 to 2000 exams,
so that different students scoring the same number of correct responses
on these core questions would get the same individual score from year to
year. Schools are being "Failed" on the basis of 1 and 2 point
differences between years on this 200 to 280 point scale. In some
cases, it appears that a 2-point difference might be due to a difference
of just one or two correct answers on the exam.

The school results are presented in a very odd fashion, making it
difficult to assess the patterns.
http://www.doe.mass.edu/ata/ratings00/SPRPDistribTables.html

The results for the 140 high performing schools are excerpted below,
with the target differences between the 1998 exam and the average of the
1999 and 2000 exams.

Evaluation    Failed to meet   Approached   Met      Exceeded   Total
Difference    less than 0      0 to 1       1 to 3   > 3.1
N schools     35               18           36       51         140

The key to this table is found in the DOE rating guide.
http://www.doe.mass.edu/ata/ratings00/rateguide00.pdf

I'm not at all certain that this table isn't what I'd expect to see by
chance alone if there were an improvement of exactly the sort that the
Dept of Education was hoping for (a mean 2 point improvement). Rather
than reporting that the mean had improved, this Dept. of Education
report emphasizes failure. Schools in this high category were expected
to improve by 2 points on the 200 to 280 point MCAS scale. A high
performing school might have an overall score of 250 in 1998 on this
scale. If the mean of the 1999 and 2000 exams was less than the 1998
score, the school failed. If the difference was between 0 and 0.9, then
the school "Approached" its goal. If the score was between 1 and 3,
then the school "Met" its goal. If the difference in scores was greater
than 3, then the school exceeded its MCAS goal. I think it is more than
coincidence that the "Approached" category is half the "Met" category
since the bin size is twice as large for "Met" as "Approached".
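
One rough way to check that hunch - using an assumed spread of school-level
changes, since the DOE publishes no standard deviations - is to simulate 140
schools that all genuinely improve by 2 points on average and see how the
DOE's bins would label them:

import numpy as np

rng = np.random.default_rng(1)
n = 140
# Assumed: a true average gain of 2 points with a school-to-school
# standard deviation of 2.5 points (a guess, not an MCAS estimate).
change = rng.normal(2.0, 2.5, n)

print("Failed (< 0)      :", int(np.sum(change < 0)))
print("Approached (0-1)  :", int(np.sum((change >= 0) & (change < 1))))
print("Met (1-3)         :", int(np.sum((change >= 1) & (change < 3))))
print("Exceeded (> 3)    :", int(np.sum(change >= 3)))
# Even with a genuine 2-point average improvement, a sizable fraction of
# these schools lands in "Failed" or "Approached" by chance alone.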

For the lowest performing schools, their expectation was that they would
improve by 6 points. Any difference from 1998 and the mean of 1999 and
2000 less than 4 points on the 200 to 280 point scale earned an "F."
Schools approached their targets by improving between 4 and 5 points.
Schools met their targets by improving between 5 and 7 points, and
schools exceeded their targets by improving by more than 7 points.

There were 91 critically low schools based on the 1998 exam, and here is
how they met their 6-point increase goal:

Evaluation    Failed   Approached   Met      Exceeded   Total
Difference    < 4      4 to 4.9     5 to 7   > 7.1
N schools     83       4            3        1          91

I frankly don't know what has happened with these MCAS scores. The
Dept. of Education's documents are sorely lacking in anything that looks
like valid statistical analyses, showing changes in means and standard
deviations. Instead of presenting valid, standard statistical analyses
to show what has happened to the mean scores, the state is reporting to
the general public an ordinal evaluation that makes the populace feel
that the students and their educators are failing them at all levels.
The top schools are getting smeared with failing grades when the results
may be consistent with random variation around a general pattern of
improvement, "the regression to the mean fallacy." Real improvements in
poor-performing schools may be masked by this attempt to convert scale
data into a poorly implemented ordinal scale of improvement.

--

Eric Bohlman

Jan 12, 2001, 8:27:00 AM
Robert J. MacG. Dawson <Robert...@stmarys.ca> wrote:
> Therefore, I would not expect regression to the mean to be sufficient
> to explain the observed outcome (in which "practically no" top schools
> met expectations); and I conclude that the goals may well have been
> otherwise unreasonable. Indeed, requiring every school to improve its
> average by at least two points every year is not maintainable in the
> long run, and only justifiable in the short term if there is reason to
> believe that *all* schools are underperforming.

This is the second time in the last two months that I've said to myself "I
wish W. Edwards Deming were still alive." He always railed against
arbitrary numerical performance goals that were set with no understanding
of the system that produced the results and no specific plan for changing
the system to produce better results. He'd probably be quoting Lloyd
Nelson's quip about such goals, to the effect that if you say that you
want to increase, say, revenues by 5% this year and you don't plan to
change the system, then you're admitting that you've been slacking because
if the current system allowed for it, you should already have done it.

(The other context in which I thought of Deming was the election
results. Deming insisted that it was meaningless to talk about the "true
value" of some quantity in the absence of an operational definition of how
to measure it. In this context, I'm sure he would have insisted that it
was nonsense to assert that this method or that method of counting
disputed votes gave a closer answer to the "true vote count" because the
latter doesn't even exist until you specify a particular method of
evaluating ambiguous ballots.)

Gene Gallagher

Jan 12, 2001, 8:30:54 AM
In article <5.0.0.25.2.200101...@email.psu.edu>,

d...@PSU.EDU (dennis roberts) wrote:
>
> 1. how are categories of very high, etc. ... translated into 1 to 2 points
> ... or 4 to 7 points? i don't see any particular connection of one to the other

See the pdf link below. A panel set out cut points based on % of
students failing and % students showing proficiency on the exam. These
cut points were set by a Dept. of Education panel.

>
> 2. we have a problem here of course that the scores in a district are
> averages ... not scores for individual kids in 4th, 8th, and 10th grades

Actually, each school and in fact each class is evaluated. There isn't
much averaging involved. My daughter's 4th grade class, composed of 3
groups of about 25, is being compared with the 1998 crop of 4th graders.
Since the mean of the 1999 and 2000 groups didn't improve by 2 points,
the school "Failed." My local school district's schools failed in
every category, even though the mean MCAS scores are among the highest
in the state.


>
> 3. what does passing mean in this context?

I posted a longer response to Dr. Dawson on another thread, but passing
for a good school means a 2 point increase on the 200 to 280 point
scale. For a poor school not to fail, the score had to increase by more
than 4 points between 1998 and the mean of the 1999-2000 scores. Few of
the poor performing schools met this goal.


>
> 4. let's say there are 50 districts ... and, for last year ... using 4th
> grade as an example ... we line up from highest mean for a district down to
> lowest mean for a district .... then, in the adjacent column, we put what
> those same districts got as means on the tests for the 4th grade this year
> ....
>
> we would expect this correlation to be very high ... for two reasons ...
> first, means are being used and second,

No. Each one of the 1539 schools in the state was evaluated. The
districts were simply sent the results on the percent of their schools
that had failed to meet their targets or that had met their target
increases. All schools had to increase their scores, with the poor
schools being expected to improve by 6 points and the best schools by 2
points.

> from year to year ... school
> district's population does not change much ... so if one district has on
> average, a lower scoring group of 4th grade students .... that is what is
> going to be the case next year
>
> thus, given this ... we would NOT expect there to be much regression to the
> mean ... since the r between these two variables i bet is very high

For the top performing schools, I think regression to the mean does
occur, but it is very difficult to assess with the DOE documents. I
read everything that I could get on their pages, but I couldn't find a
description of correlations or standard deviations anywhere.

>
> 5. but, whatever the case is in #4 ... what does this have to do with
> CHANGE IN MEAN SCORES? or changes in the top group of at least 1-2 points
> and in the low groups changes of 4-7 points? the lack of r between the two
> years of 4th grade means on these tests just means that their relative
> positions change with the higher ones not looking as relatively high ...
> and the low ones not looking quite so relatively low BUT, your position
> could change up or down relatively speaking regardless of whether your mean
> test performance went up or down ... or stayed the same
>
> bottom line: we need a lot more information about exactly what was done ...
> and how improvement goals were defined in the first place ... before we can
> make any reasonable inference that regression to the mean would have
> anything to do with better districts being bad mouthed and poorer
> performing districts being praised

There are a number of documents on the MA Dept of education web site
justifying this evaluation. None do a good job in my opinion, but here
is the one I found to be the most relevant:

http://www.doe.mass.edu/ata/ratings00/rateguide00.pdf

The summary table for the 1539 state schools is presented here:
http://www.doe.mass.edu/ata/ratings00/SPRPDistribTables.html


--

Gene Gallagher

Jan 12, 2001, 9:10:29 AM
In article <5.0.0.25.2.200101...@email.psu.edu>,
d...@PSU.EDU (dennis roberts) wrote:

In the rating guide, they state that they had core questions on the
1998, 1999, and 2000 tests. The scaled score ranged from 200 to 280,
and different students getting the same number of correct responses on
these core questions would get the same scaled score on the 1998 to 2000
tests.


>
> 2. then we have the problem that one year ... we have the data on the 4th
> graders THAT year ... but, the next year we have data on the 4th graders
> for THAT year ... these are not the same students ... so any direct
> comparison of the scores ... 4th to 4th ... or 8th to 8th or 10th to 10th
> ... are not totally comparable ... so, ANY difference in the scores ... up
> or down ... cannot be necessarily attributed to improvement or lack of
> improvement ... the changes could be related and in fact, totally accounted
> for because there are small changes in the abilities of the 4th graders one
> year compared to another year ... (or many other reasons)

Other reasons include changes in class size, or another factor. The
1998 score was being compared to the mean of the 1999 and 2000 classes.
So, if there is random variation around an unchanging mean or even a
moderate increase in the mean, a top performing 1998 school would likely
earn a failing grade in 2000 since they would be expected to show a 2
point increase over their 1998 score.

> 3. as i said before, we also have the problem of using aggregated
> performance ... either averages of schools and/or averages for districts
> ... when we line them up and then assign these quality names of very high,
> high, etc.
> there is a necessary disconnect between the real goals of education ...
> that is, helping individual kids learn .. and the way schools or districts
> are being evaluated ... when averages are being used ...
>
> 4. i would like to know how on earth ... these 'standards' for
> "dictated" improvement targets were derived ... did these have anything to
> do with real data ... or an analysis of data ... or, were just roundtabled
> and agreed to? we have to know this to be able to see if there is any
> connection between policy targets and statistical problems

I don't know how these decisions are reached. On the local level, we
are seeing the consequences of this MCAS testing. Teachers are being
forced to adjust their curriculum to cover the types of questions that
MCAS asked. Despite that, good schools and teachers statewide are being
branded as failures because of this flawed evaluation process.

I agree with you that schools and school districts should be
accountable. I fear that this present evaluation system is not solving
the core problems. It is not doing a good job of identifying and
rewarding top schools. It may be unfairly flogging poor performing
schools. From top to bottom, the MCAS evaluation appears to be giving
the public the feeling that their teachers are failing them. I think a
valid statistical analysis must be done on these data.

There was an incredibly wrong-headed response by the Suffolk
University Beacon Hill Institute to this MCAS evaluation covered in both
the Boston Globe and Herald yesterday. The report can be read at:

http://www.beaconhill.org/BHIStudies/EdStudyexecsum.pdf

These authors argued that the MCAS evaluation didn't do a good job
identifying effective schools. They used a logit regression to account
for district-wide scores. One of their major conclusions is that in
top-performing school districts, large class size increases MCAS scores.

These are the 11 explanatory variables used to explain the 1998 MCAS
scores:

(A) Policy variables:
1) Increase in funding from 1994 to 1998
2) Change in student teacher ratio from 1994 to 1998
3) Number of students per computer
(B) Socioeconomic variables
4) Crime rate
5) % professionals in a district
6) % single parents
7) Dummy variable indicating urban vs. non-urban
(C) Choice variables
8) % students in Charter schools in district
9) % students bussed in by METCO
10) % students in public schools
(D) Previous tests
11) The district performance based on the 1994 statewide MEAPS test

My analysis:
This regression analysis is almost guaranteed to have severe
multicollinearity problems. As such, one cannot trust the magnitude or
even the sign of the coefficients for these explanatory variables.
The authors proceed to discuss how their study indicates how the state
should change funding for schools:

a) Increases in funding don't lead to higher performance! (The authors
are proud that they didn't use actual funding per student as a variable)

b) REDUCING THE STUDENT TO TEACHER RATIO ACTUALLY WORSENED TEST
PERFORMANCE IN THE 8TH AND 10TH GRADE FOR HIGH-PERFORMING SCHOOLS

There are likely strong correlations between many of these explanatory
variables. An analysis of the variance inflation factors for the
explanatory variables would show this. Due to this problem, the authors
should not trust the magnitude or even the sign of the coefficients of
these variables. I greatly doubt whether larger student-teacher ratios
lead to improved student performance in high performing schools, as the
authors state.
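
A variance-inflation check is straightforward once the district-level
predictor matrix is available; the sketch below uses a made-up correlated
matrix in place of the 11 real variables, regresses each predictor on the
others, and reports VIF = 1/(1 - R^2):

import numpy as np

def vifs(X):
    # VIF for each column: regress it on the remaining columns.
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Placeholder data: 200 "districts", 11 highly correlated predictors.
rng = np.random.default_rng(2)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 11)) + 0.3 * rng.normal(size=(200, 11))
print([round(v, 1) for v in vifs(X)])
# VIFs well above 10 are the usual warning sign that individual
# coefficients (magnitude and sign) cannot be trusted.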

The authors base many of their policy recommendations on what are
probably invalid regression coefficients:
* no need to reduce class sizes in top schools
* don't increase funding for top-performing schools
* "we can't improve performance by spending more."
* the state should shift funding from high performing schools to low
performing schools; increasing the class size in high performing schools
and reducing class size in low performing schools will both increase
performance.

Is it really a surprise to anyone familiar with multicollinearity that,
after you've included all 11 variables shown above, you don't see much
of an effect of "increase in funding" on MCAS scores? Is it any surprise
that student-teacher ratio is negatively correlated with performance
after you've included the other 10 variables in the equation?

This Beacon Hill Institute study shows how statistics are being abused
in this MCAS evaluation process.


--

dennis roberts

Jan 12, 2001, 11:20:00 AM
At 01:12 PM 1/12/01 +0000, Gene Gallagher wrote:

>I do believe that regression to the mean is involved here.


i just reiterate that regression in this case ... involves a correlation
between two columns of MEANS ... means for schools OR means for districts
... and means do NOT change that much .... from year to year ... and
certainly ... schools with low or high means one year just CANNOT change
their position much

(this is certainly not true of individual students but ... none of this
discussion has anything directly to do with individual students ... only
means of schools or districts)

does anyone have any information from mass. directly in their reports ...
as to what the correlation is/was between the (for example) 4th grade means
for the schools 1 year and 4th grade means for the same schools the next
year??? or the same r value based on district means?

unless we know either of these two correlations ... we cannot talk
meaningfully about whether regression is important in this discussion ...
or not

dennis roberts

Jan 12, 2001, 11:30:46 AM
At 01:30 PM 1/12/01 +0000, Gene Gallagher wrote:

No, Each one of the 1539 schools in the state was evaluated. The
districts were simply sent the results on the percent of their schools
that had failed to meet their targets or that had met their target
increases. All schools had to increase their scores, with the poor
schools being expected to improve by 6 points and the best schools by 2
points.


=========
surely, some schools are larger than others .... so, some school means are
based on many more classes than others but, in any case ... the mean of a
school will be based on 50 kids ... 100 kids ... or more? and i bet that
these schools ... even when based on 50 kids ... or 100 kids ... their
MEANS will not change much (and even if their mean improves ... that does
not mean their relative position in the overall 1539 will change ... any)
... unless of course, the next year's 4th graders somehow ... are radically
different .. that is, while the current 4th grade looked like X ... the
INcoming 4th grade (last year's 3rd graders) is either much more able ... or
less able

in any case ... unless we have the correlation between the two columns of
1539 means ... for schools ... for the two years ... again, we cannot speak
about regression to the mean ...

and from what you have indicated ... we have scaled scores ... so is the
target a change in scaled scores? or in actual raw scores on tests? if it
is based on scaled scores ... we have a big problem ...

Robert J. MacG. Dawson

Jan 12, 2001, 1:58:09 PM

In my last posting I omitted the "very high" group on the grounds of
small size. In case anybody's curious, here's what the plot looks like
with those data included (coded as "*"; note that the vertical "error
bars" on these would be very wide!)

Proportion of schools improved by fewer than X pts

0 1 2 3 4 5 6 7 8

1 v c
c
.9 c l
* vm
.8
vl
.7
h
.6 lm

.5 *
m
.4 h

.3
h
.2

.1

0 *
0 1 2 3 4 5 6 7 8

Robert J. MacG. Dawson

Jan 12, 2001, 2:03:25 PM
>The school results are presented in a very odd fashion, making it
>difficult to assess the patterns.
>http://www.doe.mass.edu/ata/ratings00/SPRPDistribTables.html

They are that. Let's try.

These data don't look at all like the newspaper story. Here they are,
with outcomes given as proportions of each group.


             Failed   App    Met    Exceeded     N

(Very high      0%     50%    33%      17%        6)
 High          25%     13%    26%      36%      140
 Moderate      43%     17%    23%      17%      471
 Low           60%     14%    18%       8%      545
 Very low      76%      8%    14%       4%      287
 Critical      91%      4%     3%       1%       91

Overall - regression to the mean would cause a NW-SE ridge in this
table - and the newspaper story suggested this. What we see is a NE-SW
ridge. Whatever is causing that ridge is much stronger than regression
to the mean.

It might be just the demands for more improvement from the "low"
schools.

To check this, put all the groups onto one quantile plot (I have
omitted the tiny "very high" group; others are labelled as Critical,
Very low, Low, Medium, or High by initial):

Proportion of schools improved by fewer than X pts

0 1 2 3 4 5 6 7 8

1 v c
c
.9 c l

vm
.8
vl
.7
h
.6 lm

.5
m
.4 h

.3
h
.2

.1

0


0 1 2 3 4 5 6 7 8


Bingo - one common curve, as near as we can tell. We have different
groups in different parts of the plot because of the funny way the data
were presented, but one curve seems to fit nicely.

This suggests that - far from better schools being penalized for
regression to the mean, or poorer schools being rewarded for it - the
ability of schools to improve on a once-off basis was roughly constant
across the spectrum, and historically poorer-performing schools are
being penalized by unreasonably-high goals.

What would be reasonable goals? On a once-off basis, it suggests that
about 50% of schools can improve by 2 points, about 75% can at least
hold their own, and (assuming approximate symmetry in the ogive) very
few schools "in control" would drop by more than about 2 points. A
possible system, then, would be to give - across the board - a major pat
on the back for an improvement of more than 4 points, a minor pat on the
back for an improvement of more than 2 points, and an investigation for
a drop of more than 2 points. Also, a complementary system based on raw
performance.
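
A sketch of how such across-the-board cutpoints could be read off the data
rather than legislated (the change scores below are simulated placeholders,
not the real MCAS changes):

import numpy as np

rng = np.random.default_rng(3)
changes = rng.normal(2.0, 2.0, 1539)   # placeholder year-over-year changes

print("median change             :", round(float(np.median(changes)), 2))
print("share improving >= 2 pts  :", round(float(np.mean(changes >= 2)), 2))
print("share holding their own   :", round(float(np.mean(changes >= 0)), 2))
print("share dropping > 2 pts    :", round(float(np.mean(changes < -2)), 2))
# With this (assumed) distribution, roughly half the schools clear a
# 2-point gain and very few drop by more than 2 points - the kind of
# empirical footing uniform targets would need.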

Gene Gallagher

Jan 12, 2001, 3:40:22 PM
In article <5.0.0.25.2.200101...@email.psu.edu>,
d...@PSU.EDU (dennis roberts) wrote:
> At 01:12 PM 1/12/01 +0000, Gene Gallagher wrote:
>
> >I do believe that regression to the mean is involved here.
>
> i just reiterate that regression in this case ... involves a correlation
> between two columns of MEANS ... means for schools OR means for districts
> ... and means do NOT change that much .... from year to year ... and
> certainly ... schools with low or high means one year just CANNOT change
> their position much
>
The MCAS evaluation is based on means, but means based on a very small
sample size. Each one of the 1500+ schools' 4th grades, 8th grades, and
10th grades is evaluated. In my daughter's 4th grade, there were 3
classrooms of about 25 students. The mean MCAS score for this group of
3 classes in one school in 1998 is compared to the mean of the 1999 and
2000 classes in that school.

The school was expected to show a 2 point increase, and when it didn't
this 4th grade class in this school was graded a failure, as was the 8th
grade class in the junior high across town and the 10th grade class in
the high school.

This case is very much like Galton's regression to mediocrity. Even
with increasing mean test scores (not documented to date, by the way),
the tendency will be for the top performing schools on the 1st test to
fall back closer to the mean and the poorest performers on the 1st
test to increase.

I would like to get all of the 1998 1999 and 2000 MCAS scores so that
the correlation coefficients and sources of error can be calculated. If
the correlation is very high, then regression to mediocrity won't be
much of a factor.

Gene Gallagher

Jan 12, 2001, 3:48:19 PM
In article <3A5F4435...@stmarys.ca>,

Robert...@STMARYS.CA (Robert J. MacG. Dawson) wrote:

What a wonderful analysis with poorly tabulated data. I'll have to
spend some time seeing how you could dig these point improvement
patterns out of the tables published by the DOE. I agree with you that
imposing a 6-point improvement target on the poorest performing schools
is an unrealistic goal.

Werner Wittmann

Jan 12, 2001, 7:12:20 PM
Hi Dennis et al.

The best reference concerning regression to the mean (rtm) is Dave Kenny!
Bookmark his homepage:
http://nw3.nai.net/~dakenny/kenny.htm
(It's an exciting one for still other reasons than rtm.)
Dave has finalized a book about rtm which he had started to write with the
late Don Campbell; rtm is Don's favorite brainchild, despite its origins
with Sir Francis Galton.
See the regression artifact primer at Dave's homepage:
http://nw3.nai.net/~dakenny/rrtm.htm

From the book's front page you'll see when rtm is relevant:
rtm = perfection - correlation.
So whenever the pre/post correlation is less than one, rtm exists - that
is what Don and Dave say.
But you also have to consider the reliability of the pre- and posttest.
Whenever the double correction for pre- and posttest reliability
(correction for attenuation) leads to a correlation of one, rtm vanishes.
Dennis might be right that the pre/post means are relatively reliable.
We would need parallel tests for the means to estimate how reliable they
actually are, and we also need the pre/post correlation for the means of
classes, schools, districts etc. (we have to be clear about the unit of
analysis). My guess is that the attenuation-corrected pre/post correlation
for the means will still be lower than one, so rtm is a serious
consideration.
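
As a minimal sketch of that double correction (all numbers below are
hypothetical, not MCAS estimates), the disattenuated correlation is the
observed pre/post correlation divided by the square root of the product
of the two reliabilities:

from math import sqrt

def disattenuated(r_xy, rel_x, rel_y):
    # Correlation corrected for unreliability of both measures.
    return r_xy / sqrt(rel_x * rel_y)

# Hypothetical values: observed pre/post correlation of school means 0.81,
# reliability of each mean 0.90.
print(round(disattenuated(0.81, 0.90, 0.90), 2))   # prints 0.9
# If this corrected correlation reached 1.0, rtm would vanish at the
# true-score level - the observed regression would be pure measurement error.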
Another point is fan-spread. This happens when the variance of the
posttest is greater than that of the pretest; the correlation coefficient
does not map that effect, it maps only rank order changes. (Fan shrinkage
is also possible, i.e. lower variance at post than at pretest.) Fan spread
would map the Matthew effect. Such an effect can only be detected with a
split-plot design. And finally, the true answer to what happened there can
only be given when you've assessed the causes of the change (whatever it
was, i.e. mean change, variance change, skew, and still higher moment
changes).
So these guys should map the causes, say as a variable z.
If x is pre and y is post and you're able to demonstrate that
R(y_true . x_true, z_true) = 1
(this means that the multiple correlation for predicting the posttest true
score from a combination of the pretest and the causes' true scores equals
one), then rtm has completely disappeared.
Complicating factors to consider are ceiling and floor effects of pre- and
posttests, but these are not very likely with most standardized tests used
(they would lead to nonlinear change effects).
Nonlinear growth effects are also well-known from economics as the law of
diminishing returns.

Werner

Werner W. Wittmann;University of Mannheim; Germany;
e-mail: witt...@tnt.psychologie.uni-mannheim.de


Gene Gallagher

Jan 13, 2001, 10:59:43 AM
In article
<GGEDJKBMGKBJJKNKBO...@tnt.psychologie.uni-mannheim.de>,

<witt...@tnt.psychologie.uni-mannheim.de> wrote:
> Hi Dennis et al.
>
> The best reference concerning regression to the mean (rtm) is Dave Kenny!
> Bookmark his homepage:
> http://nw3.nai.net/~dakenny/kenny.htm
> (It's an exciting one for still other reasons than rtm.)
> Dave has finalized a book about rtm which he had started to write with
> the late Don Campbell,
<Snip>

I've ordered Kenny's book on rtm from Amazon.com (another $32 to
them; if they fail it's not due to me!). Kenny provides a link to a
wonderful description of the regression to the mean phenomenon by Bill
Trochim:

http://trochim.human.cornell.edu/kb/regrmean.htm

I believe that rtm must be happening in the MA MCAS test-retest
evaluation program, but the pattern is difficult to discern because the
Mass. Dept. of Education doesn't provide the scores in a fashion that is
easy to analyze. I don't see how you can plot the 1998 vs 1999-2000
scores with the available data, and this plot is the key to recognizing
rtm. The MA DOE reports only whether the good performing schools met
their 2-point increase and whether the bad schools met the 6-point goal
increase (on a 200 to 280 point scale). So, rtm would make it less
likely that the top schools would meet the 2-point increase, even if
there was an overall 2-point increase in scores state-wide. Evidence for
rtm in the poor performing schools might be masked by the need to
demonstrate a 6-point increase in scores to meet the test-retest goal.
I've requested the raw data from the MA DOE to answer some of these
questions.

Herman Rubin

Jan 13, 2001, 3:46:46 PM
In article <3a5dc2d7...@news.earthlink.net>,
J. Williams <kak2...@excite.com> wrote:
>Francis Galton explained it in 1885. Possibly, the Mass. Dept. of
>Education missed it! Or, could it be that the same gang who brought
>us the exit poll data during the November election were helping them
>out? :-)

>I am wondering why they did not have a set of objective standards for
>ALL students to meet.

Unless the standards are abysmally low, or it is expected that
many will not even come close, this is impossible. People are
GREATLY different, and this is a big part of the problem.

>Of course, it is nice to reward academically
>weaker districts for "improving," but the real issue may not be
>"improvement," rather it might be attainment at a specific level for
>all schools as a minimum target.

Not only are individuals different, but the "law of large
numbers" does not make school populations even remotely close
to equal.

All of this testing of schools is based on flawed assumptions.
The null hypothesis is always false.

Herman Rubin

unread,
Jan 13, 2001, 4:05:04 PM1/13/01
to
In article <3A5EE071...@mail.caribe.net>,

The latter does not work well, and the idea that it is
reasonable slows down what the bright can learn.

One of the current goals is to get all reading by third
grade. Before socialization was put ahead of education, all were
reading by second grade; if they were not, they did not
get to second grade.

Herman Rubin

unread,
Jan 13, 2001, 4:01:39 PM1/13/01
to
In article <93lc8j$fil$1...@news.panix.com>,

Ronald Bloom <rbl...@panix.com> wrote:
>Herman Rubin <hru...@odds.stat.purdue.edu> wrote:
>> In article <3a5dc2d7...@news.earthlink.net>,
>> J. Williams <kak2...@excite.com> wrote:
>>>Francis Galton explained it in 1885. Possibly, the Mass. Dept. of
>>>Education missed it! Or, could it be that the same gang who brought
>>>us the exit poll data during the November election were helping them
>>>out? :-)

>>>I am wondering why they did not have a set of objective standards for
>>>ALL students to meet.

>> There are only two ways this can be done. One is by having
>> the standards so low as to be useless, and the other is by
>> not allowing the students who cannot do it to get to that
>> grade, regardless of age. The second is, at this time,
>> Politically Incorrect.

> And what alternative do you propose? Sending the underachievers
>to work in the fields as soon as signs of promise fail
>to manifest?

The obvious alternative is to adjust the education to
the individual, and completely abandon the idea of age
grouping. I believe that those who SHOULD go to college
should receive a far greater education by their early
teens than they are now allowed to get several years
later. Those who need to take longer should take longer,
and those who cannot manage to learn something should
not be cluttering up classes where others are trying to
do so.

The ones who cannot are generally not underachievers
but those who are not mentally capable. The bright,
including those who do well on the tests, are forced
to be underachievers, as the program is set up to
keep them from achieving what they can.

[...]

>> The biggest factor in the performance of schools is in the
>> native ability of students; but again it is Politically
>> Incorrect to even hint that this differs between schools.

> It may be "politically incorrect" to say so. But does that
>support the proposition in any way shape or form? So go
>on, "hint"; get up on a beer-barrel and "hint" that the
>"fit" are languishing from the ignominious condition of
>having to suffer the presence of the "unfit". You'll
>have plenty of company: Pride is greedier even than mere
>Avarice.

The hyperegalitarians cannot accept the truth; their
fanatic ideas are that all are essentially capable of
the same learning at a given age. As long as these
run the schools, not much learning will occur.

Rich Ulrich

unread,
Jan 15, 2001, 7:47:49 PM1/15/01
to
Concerning the MCAS. There was a discussion last month
in another Usenet group, alt.usage.english, concerning one of
its math questions which was written too loosely.

Here is the start of that thread. The thread has 130+ (not very
interesting) entries in Deja, which is where I recovered this from.

=================== start of Deja message.
Subject: Fix the wording in this test question?
Date: 12/10/2000
Author: Daniel P. B. Smith <dpbs...@bellatlantic.net>

Below is a verbatim question from a standardized math test. The
concepts and mathematics are clear enough, I think. I'm presenting
this as an English puzzle.

Construing the language as precisely as possible, but being careful to
take into account the full language of the question and the
multiple-choice answers, what do you think the correct answer is?

Do you think the question is actually OK? Is the wording good enough
as it stands? Or, as worded, could there be a legitimate uncertainty
about which answer is correct?

BEGIN QUESTION TEXT

37. When Matt's and Damien's broad jumps were measured accurately to
the nearest foot, each measurement was 21 feet. Which statement best
describes the greatest possible difference in the lengths of Matt's
jump and Damien's jump?

A. One jump could be up to 1/4 foot longer than the other.
B. One jump could be up to 1/2 foot longer than the other.
C. One jump could be up to 1 foot longer than the other.
D. One jump could be up to 2 feet longer than the other.

END QUESTION TEXT

ObPuzzle: Assume that the wording needs improvement. Assume that the
concept to be tested is that "the range of real numbers for which the
closest integer is 21 is the interval from 20.5 to 21.5 not including
either endpoint, sometimes notated (20.5, 21.5)." What is a simple,
natural wording in everyday language that would test someone's
understanding of this concept while providing a single, unambiguously
correct choice?

Note: This question was taken verbatim from the Massachusetts
Comprehensive Assessment System (MCAS) Mathematics Grade 8 test. The
test "common questions" are available as PDF files, linked from
http://www.doe.mass.edu/mcas/00release/. 8th grade students are about
13 years old.

The official correct answer to this question is C.

--
Daniel P. B. Smith
Current email address: dpbs...@bellatlantic.net
"Lifetime forwarding" address: dpbs...@alum.mit.edu
Visit alt.books.jack-london!
============= end of Deja message.
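An aside on the arithmetic behind the ObPuzzle: if both jumps round to 21
feet, each true length lies in the open interval (20.5, 21.5), so the
difference between them can be made as close to 1 foot as you like but can
never actually reach it. A tiny Python check (illustrative only):

# Both lengths round to 21, i.e. each lies in the open interval (20.5, 21.5).
for eps in (0.1, 0.01, 0.0001):
    a, b = 20.5 + eps, 21.5 - eps          # two lengths that both round to 21
    print("a = %.4f ft, b = %.4f ft, difference = %.4f ft" % (a, b, b - a))
# The difference, 1 - 2*eps, approaches 1 foot but never equals it.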


On Wed, 10 Jan 2001 21:32:43 GMT, Gene Gallagher
<eugen...@my-deja.com> wrote:

> The Massachusetts Dept. of Education committed what appears to be a
> howling statistical blunder yesterday. It would be funny if not for the
> millions of dollars, thousands of hours of work, and thousands of
> students' lives that could be affected.
>

> Massachusetts has implemented a state-wide mandatory student testing
> program, called the MCAS. Students in the 4th, 8th and 10th grades are
> being tested and next year 12th grade students must pass the MCAS to
> graduate.

J. Williams

unread,
Jan 16, 2001, 9:14:36 AM1/16/01
to
On Mon, 15 Jan 2001 19:47:49 -0500, Rich Ulrich <wpi...@pitt.edu>
wrote:

>Concerning the MCAS. There was a discussion last month
>in another Usenet group, alt.usage.english, concerning one of
>its math questions which was written too loosely.
>
>Here is the start of that thread. The thread has 130+ (not very
>interesting) entries in Deja, which is where I recovered this from.
>
>=================== start of Deja message.
>Subject: Fix the wording in this test question?
>Date: 12/10/2000
>Author: Daniel P. B. Smith <dpbs...@bellatlantic.net>
>
>
>Below is a verbatim question from a standardized math test. The
>concepts and mathematics are clear enough, I think. I'm presenting
>this as an English puzzle.

>37. When Matt's and Damien's broad jumps were measured accurately to
>the nearest foot, each measurement was 21 feet. Which statement best
>describes the greatest possible difference in the lengths of Matt's
>jump and Damien's jump?
>
>A. One jump could be up to 1/4 foot longer than the other.
>B. One jump could be up to 1/2 foot longer than the other.
>C. One jump could be up to 1 foot longer than the other.
>D. One jump could be up to 2 feet longer than the other.
>
> END QUESTION TEXT
>
>ObPuzzle: Assume that the wording needs improvement. Assume that the
>concept to be tested is that "the range of real numbers for which the
>closest integer is 21 is the interval from 20.5 to 21.5 not including
>either endpoint, sometimes notated (20.5, 21.5)." What is a simple,
>natural wording in everyday language that would test someone's
>understanding of this concept while providing a single, unambiguously
>correct choice?

Maybe I am missing something, but I think the original question and
response items are quite clear and concise. I see nothing
particularly "loose" about it. The essentials of a class interval
used in frequency distributions seem apparent although it is subtle.
For me, this appears to be an excellent question. Of course, I was
not an English major either :-)

Robert J. MacG. Dawson

unread,
Jan 16, 2001, 12:01:18 PM1/16/01
to

Rich Ulrich wrote:

> Construing the language as precisely as possible, but being careful to
> take into account the full language of the question and the
> multiple-choice answers, what do you think the correct answer is?
>
> Do you think the question is actually OK? Is the wording good enough
> as it stands? Or, as worded, could there be a legitimate uncertainty
> about which answer is correct?
>
> BEGIN QUESTION TEXT
>
> 37. When Matt's and Damien's broad jumps were measured accurately to
> the nearest foot, each measurement was 21 feet. Which statement best
> describes the greatest possible difference in the lengths of Matt's
> jump and Damien's jump?
>
> A. One jump could be up to 1/4 foot longer than the other.
> B. One jump could be up to 1/2 foot longer than the other.
> C. One jump could be up to 1 foot longer than the other.
> D. One jump could be up to 2 feet longer than the other.
>
> END QUESTION TEXT


I guess I don't see any problem with this. "Accurately to the nearest
foot" does not mean the same thing as "to within one foot" or "to an
accuracy/tolerance of plus or minus one foot" which I suppose yields the
canonical wrong answer. The use of the word "nearest" is crucial.

The choice of answers (and "best" in the rubric) rules out any quibbles
about banker's rounding, etc. Options such as

C'. One jump could be up to 1 foot longer than the other, but not 1 foot
longer.
C". One jump could be up to 1 foot longer than the other, or 1 foot
longer.

would be fiendish.

The question could be made clearer by putting a comma after
"accurately", and simpler by furthermore adding "and then rounded":

When Matt's and Damien's broad jumps were
measured accurately, and then rounded to
the nearest foot, each measurement was 21 feet.

However, while this takes less care to read, I do not think it says
anything that the other question did not - except perhaps that in the
original question "accurately to the nearest foot" *might* mean "with a
tolerance of six inches or less" without requiring it to actually yield
a round number of feet. But since we're given that the results WERE an
even number of feet this is academic.

The question may be criticised for being false to reality - at the
level of competition at which people jump 21 feet (the record is about
29 ft; 21 ft seems to be about male high school championship level, or
perhaps national level for the age group at which the problem was
aimed!) I cannot imagine anybody bothering with such an approximate
measurement. One could put in

"The reporter for the school newspaper, who was only interested in
football and cheerleading, rounded the lengths to the nearest foot and
published both as 21 feet."

-Robert Dawson

Jerry Dallal

unread,
Jan 16, 2001, 2:09:07 PM1/16/01
to
Werner Wittmann wrote:
>

> See the regression artifact primer at Dave's homepage:
> http://nw3.nai.net/~dakenny/rrtm.htm

I looked at
http://nw3.nai.net/~dakenny/primer.htm
and found myself puzzled by the Galton squeeze plot (or is it a pair
link diagram, or are they one and the same?).

It shows the initially extreme subjects converging toward the mean,
but if this were purely regression to the mean, the cross-sectional
post-test SD would be equal to the pretest SD. In the diagram, the
post-test SD is clearly smaller. However, I have not read the book
or the Galton article from which the diagram was extracted, both of
which may contain additional material to clear up my confusion.
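Jerry's puzzle can be checked directly: in a pure regression-to-the-mean
setup with equal pretest and posttest SDs and correlation r < 1, the
initially extreme group does converge toward the mean, yet the overall
posttest SD stays the same, because the extreme group spreads out within
itself. Here is a quick simulation sketch (hypothetical standard-normal
scores, r = 0.6; whether the diagram on Kenny's page was drawn from such a
model is a separate question):

import numpy as np

rng = np.random.default_rng(2)
n, r = 100_000, 0.6
pre = rng.normal(0, 1, n)
post = r * pre + np.sqrt(1 - r**2) * rng.normal(0, 1, n)   # same marginal SD as the pretest

top = pre > 1.5                                            # the initially extreme subjects
print("top group mean: pre %.2f -> post %.2f" % (pre[top].mean(), post[top].mean()))
print("top group SD:   pre %.2f -> post %.2f" % (pre[top].std(), post[top].std()))
print("overall SD:     pre %.2f,   post %.2f" % (pre.std(), post.std()))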

dennis roberts

unread,
Jan 16, 2001, 8:34:20 PM1/16/01
to
At 11:33 AM 1/16/01 -0400, you wrote:

>> 37. When Matt's and Damien's broad jumps were measured accurately to
>> the nearest foot, each measurement was 21 feet. Which statement best
>> describes the greatest possible difference in the lengths of Matt's
>> jump and Damien's jump?
>>
>> A. One jump could be up to 1/4 foot longer than the other.
>> B. One jump could be up to 1/2 foot longer than the other.
>> C. One jump could be up to 1 foot longer than the other.
>> D. One jump could be up to 2 feet longer than the other.
>>
>> END QUESTION TEXT

"nearest" in this context implies ... that if it were closer to 20 than 21
... it would be called 20 ... or, closer to 22 than 21 ... it would be
called 22 ... SO, this has to mean (unless one is doing some weird
extrapolations) that ... there is some dividing point between 20 and 21 ...
and 21 and 22 ... and that dividing point is in the middle between the two ...

so, one person could have been 20.5001 ... rounded to 21 and the other
could have been 21.4999 ... rounded to 21 ... so, the biggest difference
would be 1 foot

of course, this only makes sense if we assume that rounding to the nearest
has meaning ... but if it does ... then i don't know any other choice than
C that could be construed as being correct


>
>
> I guess I don't see any problem with this. "Accurately to the nearest
>foot" does not mean the same thing as "to within one foot" or "to an
>accuracy/tolerance of plus or minus one foot" which I suppose yields the
>canonical wrong answer. The use of the word "nearest" is crucial.

==============================================================
dennis roberts, penn state university
educational psychology, 8148632401
http://roberts.ed.psu.edu/users/droberts/drober~1.htm

Rich Ulrich

unread,
Jan 17, 2001, 3:57:25 PM1/17/01
to
On Tue, 16 Jan 2001 14:14:36 GMT, kak2...@excite.com (J. Williams)
concluded:

> Maybe I am missing something, but I think the original question and
> response items are quite clear and concise. I see nothing
> particularly "loose" about it. The essentials of a class interval
> used in frequency distributions seem apparent although it is subtle.
> For me, this appears to be an excellent question. Of course, I was
> not an English major either :-)

Well, I think anyone with advanced math courses can answer it, because
we jump to conclusions. We know what the only "real" question is
that is likely to be asked in this format. We have seen approximately
the same question a number of times -- and, if it was not asked right,
no one complained.

... back to the question ...
< snip, my intro, other intro ... >

> >37. When Matt's and Damien's broad jumps were measured accurately to
> >the nearest foot, each measurement was 21 feet. Which statement best
> >describes the greatest possible difference in the lengths of Matt's
> >jump and Damien's jump?

- Okay, here is my answer before I repeat the official ones.
The "greatest possible difference" is *at least* one foot.
If this is a dedicated math question, the aspect of roundoff should
give "one foot (minimum)"; and any slightest introduction of realism
implies some *error-in-measurement* to be added on.

Thus, the "greatest possible difference" is one-foot, plus the
amounts on BOTH SIDES of the error distribution -- whether that is
in quarter-inches or angstroms. => something more than 1 foot.
- Now, look at the answers.

> >
> >A. One jump could be up to 1/4 foot longer than the other.
> >B. One jump could be up to 1/2 foot longer than the other.
> >C. One jump could be up to 1 foot longer than the other.
> >D. One jump could be up to 2 feet longer than the other.
> >

- I don't care if you want to obsess about whether "up-to" contains
the exact margin [ even though: that is a separate fault that makes
this an unprofessional question ]. With ANY scope at all for realism
(i.e., measurement error), the exact margin has to be exceeded.
So. What "best describes" the "greatest"?

Choice (C) does a good job of describing "the difference" -- if that
had been asked. But that was not in the question. The focus has been
set on the "greatest." I think that the naive rater might rule out
(C) because it answers the wrong question, and it is logically
inconsistent with the correct answer.

Answer (D) is also pretty poor, but it is not impossible.
If you decide it is a math question, it is the only possible one.

As Robert Dawson pointed out, the whole pretext/description is not
realistic for the report of an actual track meet. Someone *could*
say that, reasonably, if the two jumps were *very* close to the
21-feet; otherwise, the whole comment is just totally stupid. So,
someone who was reading this as a social question might decide that
(A) is the only feasible answer. Realistically speaking.

Those of us who are quite bright don't have any doubts though.
We know, by the internal awkwardness, that this was not a subtle,
sneaky, trick-logical question. We are quite a bit smarter than the
idiot (relatively speaking) with 120 IQ who wrote the question, and we
know the question/answer, even if he failed to ask it.


> >
> >ObPuzzle: Assume that the wording needs improvement. Assume that the
> >concept to be tested is that "the range of real numbers for which the
> >closest integer is 21 is the interval from 20.5 to 21.5 not including
> >either endpoint, sometimes notated (20.5, 21.5)." What is a simple,
> >natural wording in everyday language that would test someone's
> >understanding of this concept while providing a single, unambiguously
> >correct choice?
>

What simple, natural wording... ? Well, it is *not* a wording that
brings in the distraction of "up to." It is not a wording that
implies perfect precision in measurement, or requires great knowledge
of track meets, or requires ignorance of track meets.

dennis roberts

unread,
Jan 17, 2001, 6:35:21 PM1/17/01
to
At 03:57 PM 1/17/01 -0500, Rich Ulrich wrote:

> - Okay, here is my answer before I repeat the official ones.
>The "greatest possible difference" is *at least* one foot.
>If this is a dedicated math question, the aspect of roundoff should
>give "one foot (minimum)"; and any slightest introduction of realism
>implies some *error-in-measurement* to be added on.

ok ... even if so ... what is a reasonable error that could be added on to
come to the conclusion of answer D? this would have to imply that two
errors in the opposite direction occur simultaneously ... the error
overshoots truth one way ... hence you round up ... whereas the error
undershoots truth the other way ... hence you round down ... this is not "
... slightest introduction of realism ... "


>Thus, the "greatest possible difference" is one-foot, plus the
>amounts on BOTH SIDES of the error distribution -- whether that is
>in quarter-inches or angstroms. => something more than 1 foot.
> - Now, look at the answers.
>
> > >
> > >A. One jump could be up to 1/4 foot longer than the other.
> > >B. One jump could be up to 1/2 foot longer than the other.
> > >C. One jump could be up to 1 foot longer than the other.
> > >D. One jump could be up to 2 feet longer than the other.
> > >

rich tries to clarify this problem but ... i don't think he is successful IN
RELATION TO THE GIVEN ANSWERS

now, we are assuming aren't we that someone did the measuring WITH a
"qualified" measuring rule ... we know of course that (just like the "chain"
for measuring first downs in a football game) it could be stretched more or
less taut ... but, what would be some reasonable limit for tautness or
UNtautness without the competitor screaming bloody murder??? thus, we could
say ... if the distance was measured withOUT error ... the greatest difference
would be 1 foot ...

given that there is no indication in this item of anything to do with
MEASUREMENT ERROR ... WE CANNOT ASSUME ANYTHING ABOUT THAT CONCEPT

i would suggest that given the way this item is stated ... that is, what is
said and what is not said ... we have to assume that whatever the
measurement was ... it was correct and, all that would be done given that
it is said "to the nearest foot" ... that either rounding up or down
according to the standard system of rounding ... would have to apply

if you are really arguing that this item is trying to "test" for more ...
then, this item is not a simple math item ... and i think you would have a
hard time saying that a person who answered D really knows more than the
person who answers C ... given what is said and implied in this item ...
however, if you answered A or B ... you do have a problem!

some might call this a classic trick item ... where it hopes that some will
not even consider measurement error ... or, read the item in such a way
that the thought of measurement error never even enters their mind ...

and, should it? not unless this is a test not about math so much but, about
measurement ...

dennis roberts

unread,
Jan 18, 2001, 10:21:28 AM1/18/01
to

> > >37. When Matt's and Damien's broad jumps were measured accurately to
> > >the nearest foot, each measurement was 21 feet. Which statement best
> > >describes the greatest possible difference in the lengths of Matt's
> > >jump and Damien's jump?

> > >A. One jump could be up to 1/4 foot longer than the other.
> > >B. One jump could be up to 1/2 foot longer than the other.
> > >C. One jump could be up to 1 foot longer than the other.
> > >D. One jump could be up to 2 feet longer than the other.
> > >


my question is ... what is the real purpose of this item?

an examinee has to take the stem at face value and, it says, quite
clearly ... that these were "measured accurately" ... to the nearest foot
... so, within that scheme ... they have to assume that for it to be 21 ...
it would have had to have been measured "accurately" at 20.5 up to 21 ...
(rounded up if necessary) or ... from 21.5 down to 21 ... (rounded down if
necessary) ... and any interpretation other than that would therefore ...
have to disregard what is stated in the stem ... that being, the
measurement is done accurately ...

the only way that an answer like D could be acceptable ... is if the stem
had left open to question ... how accurately the measurements were taken
... such as:

an old tape measure was used to measure the jumps and, using that ... they
both had jumps of 21 feet ...

or, a measure made of rubber bands was used ... they both had jumps of 21
feet ....

or .. a tape measure was used but, sometimes it was pulled very taut ...
sometimes not .... they both had jumps of 21 feet

etc.

C is the only acceptable answer GIVEN the stem ... and what it says and implies

again ... i ask ... what is the purpose of this item IF it is other than
what is implied in the stem?

if one wants to argue for D ... then one has to also argue that the
examinee should have not believed what the stem said ... and therefore,
they were justified in answering a DIFFERENT question

if D is the keyed correct response ... then it is a trick ? ... pure and simple

Robert J. MacG. Dawson

unread,
Jan 18, 2001, 10:32:23 AM1/18/01
to

dennis roberts wrote:
>
> > > >37. When Matt's and Damien's broad jumps were measured accurately to
> > > >the nearest foot, each measurement was 21 feet. Which statement best
> > > >describes the greatest possible difference in the lengths of Matt's
> > > >jump and Damien's jump?
> > > >A. One jump could be up to 1/4 foot longer than the other.
> > > >B. One jump could be up to 1/2 foot longer than the other.
> > > >C. One jump could be up to 1 foot longer than the other.
> > > >D. One jump could be up to 2 feet longer than the other.

>

> if one wants to argue for D ... then one has to also argue that the
> examinee should have not believed what the stem said ... and therefore,
> they were justified in answering a DIFFERENT question

As Dennis said. Of course, they were _meant_ to realize that they
should answer the implicit question:

What is the proper Linnaean name for the bullfrog?

A. _Felis_concolor_
B. _Ursa_major_
C. _Nolo_contendere_
D. None of the above.

I dunno, kids these days, never reading the questions properly then
wanting part marks...

-Robert Dawson

Rich Ulrich

unread,
Jan 22, 2001, 3:28:25 PM1/22/01
to
On 16 Jan 2001 09:01:18 -0800, Robert...@STMARYS.CA (Robert J.
MacG. Dawson) wrote:

>
>
> Rich Ulrich wrote:
>
> > Construing the language as precisely as possible, but being careful to
> > take into account the full language of the question and the
> > multiple-choice answers, what do you think the correct answer is?
> >
> > Do you think the question is actually OK? Is the wording good enough
> > as it stands? Or, as worded, could there be a legitimate uncertainty
> > about which answer is correct?
> >
> > BEGIN QUESTION TEXT
> >
> > 37. When Matt's and Damien's broad jumps were measured accurately to
> > the nearest foot, each measurement was 21 feet. Which statement best
> > describes the greatest possible difference in the lengths of Matt's
> > jump and Damien's jump?
> >
> > A. One jump could be up to 1/4 foot longer than the other.
> > B. One jump could be up to 1/2 foot longer than the other.
> > C. One jump could be up to 1 foot longer than the other.
> > D. One jump could be up to 2 feet longer than the other.
> >
> > END QUESTION TEXT
>
>
> I guess I don't see any problem with this. "Accurately to the nearest
> foot" does not mean the same thing as "to within one foot" or "to an
> accuracy/tolerance of plus or minus one foot" which I suppose yields the
> canonical wrong answer. The use of the word "nearest" is crucial.
>

Maybe I won't convince anyone, but I do want to try once more....
One Theme (a little bit relevant) is the wording of questions.
Another Theme is a technical one about how sloppy we all tend to be,
when we make assumptions about accuracy of measurement.

Here are some strings of words... DO these suggest round-off errors?

"If Matt scored 121 on an IQ test and Damien scored 123 on the same
test, what is the maximum difference in their IQs?"
- This reads BAD to me, because I know that IQs are fuzzy numbers.
At the best, Matt-today is apt to be 4 points different from
Matt-tomorrow, so there's 5 or 6 points to expect from 2 people who
measure exactly the same. If you want to know about whether the
testee understands ROUNDOFF, you don't introduce ontological doubts,
i.e., What is reality?
- Folks *do* accept and pretend to understand some numbers like
this. If you are testing for math-content, it is not fair to require
that knowledgeable testees ignore or draw on what else they know.
Now, if you are testing for IQ in the disguise of testing for math, it
*might* be useful to see how well people figure out what the question
is supposed to be (but I do have my doubts about that sort of Q).
- Testing for math, it is fair for you to ask for conclusions about
numbers, NOT about "sizes." (Not every 14-year old will draw the
distinction, but quite a few of them ought to be capable of it.)


"Please measure the child's height accurately to the nearest
millimeter."
- Measuring with that precision can be tough. You might have to be
taught how to do it by protocol; or you might not achieve an accuracy
which is quite as good as the expert's. I think that is a reasonable
request -- or a reasonable description, even if the potential is not
quite met 100%. That is: I think someone could ask for it, and it
could be useful in a close, daily study of growth; replication (for
reliability and validity-testing) would show a VERY frequent error of
1 (or 2) millimeters.


"Please use the tape measure in the usual way and write down the size
of the waist, accurately to the nearest millimeter."
- I find a lot of resistance from people, in trying to get them to
*think* of measurements in units that are shaky. When they are
pushing the limits of what can be measured, experimental physicists
are thoroughly aware that you need to specify both an estimate and its
precision. But ordinary folk seem to (a) think that you can round-off
to a number with no discernable "error", and - consistent with that -
(b) refuse to contemplate measuring the soft-tissue of the waist in
millimeters.

Semantically, I tend to draw a distinction -- which seems
to be, a distinction between the overtones of the ADVERB
and the ADJECTIVE (unless I am being fooled by the -ly).
If you asked for my waist size, "accurate" to a millimeter,
I would have to demur, pleading the lack of definition or rules.
If I had been measuring, and now you asked, measure
"accurately" to a millimeter, I would have no difficulty in agreeing
to read a measurement that finely. But other people seem to choke on
the concept. I'm not claiming that I have found the most appropriate
words, but I don't think this gap is all my fault.

dennis roberts

unread,
Jan 22, 2001, 6:58:16 PM1/22/01
to
At 03:28 PM 1/22/01 -0500, Rich Ulrich wrote:


>"If Matt scored 121 on an IQ test and Damien scored 123 on the same
>test, what is the maximum difference in their IQs?"


the comparable item to the real one shown before would be: "if matt was
measured ACCURATELY (rounded to the nearest single IQ point) to have an IQ
of 121 and damien was also measured ACCURATELY (to the nearest single IQ
point) to have an IQ of 123 ... "

the operative terms here are "measured accurately" ...and the "units"

and of course, the UNITS that are being implied ... for the original data
... it was to the nearest foot ... and, the IQ item would also have to put
the "accuracy to the nearest ... " into some score value context

your rewording does change things and ... would imply perhaps not very
accurate measurement but, the original question DID have that word ...
"accurate" ...

this word cannot be discounted ...

as i said before ... given the stem and the choice C of 1 foot ... i think
any intelligent examinee could argue logically that this is the correct
answer ... or, if the test builder wanted to claim D or 2 feet is the
correct answer ... that C would have to be given equal correct weight ...

there just is no good way to argue against the original choice C ... IN THE
CONTEXT OF THE STEM OF THE QUESTION

P.G.Hamer

unread,
Jan 23, 2001, 7:37:37 AM1/23/01
to
dennis roberts wrote:

> there just is no good way to argue against the original choice C ... IN THE
> CONTEXT OF THE STEM OF THE QUESTION

I am reminded of the joke article that contains many `politically incorrect'
answers to the exam question "given a barometer how do you measure
the height of a tower".

A point I only realised recently is that many of these spoof answers could
give a more accurate answer than the `textbook' method.

Peter

The first few that I remember.

1) Drop the barometer and time its fall.

2) Tie it to a long piece of string and use it as a lead-line, measuring the
length of the string.

3) Tie it to a long piece of string and use it as a lead-line, measuring the
period of the resultant pendulum.


Robert J. MacG. Dawson

unread,
Jan 23, 2001, 9:58:22 AM1/23/01
to

"P.G.Hamer" wrote:
>
> dennis roberts wrote:
>
> > there just is no good way to argue against the original choice C ... IN THE
> > CONTEXT OF THE STEM OF THE QUESTION
>

> I am reminded of the joke article that contains many `politically incorrect'


> answers to the exam question "given a barometer how do you measure

> the height of a tower".


>
> A point I only realised recently is that many of these spoof answers could
> give a more accurate answer than the `textbook' method.
>
> Peter
>
> The first few that I remember.
>
> 1) Drop the barometer and time its fall.
>
> 2) Tie it to a long piece of string and use it as a lead-line, measuring the
> length of the string.
>
> 3) Tie it to a long piece of string and use it as a lead-line, measuring the
> period of the resultant pendulum.

My own favorite: tell the janitor that you will give him a barometer if
he can tell you how tall the building is.

I *presume* that "politically incorrect" above means "nonstandard", not
"involving stereotypes of gender, religion, ethnicity, sexual
orientation, or hair color". The idea of a whole subgenre of "Scottish
Barometer Jokes" or "Blonde Barometer Jokes" is just too mind-boggling.
<grin>

-Robert Dawson

Tony T. Warnock

unread,
Jan 23, 2001, 11:42:16 AM1/23/01
to

"P.G.Hamer" wrote:

Trade the barometer to the super for a look at the building plans.

Rich Ulrich

unread,
Jan 30, 2001, 5:59:10 PM1/30/01
to
On 22 Jan 2001 15:58:16 -0800, d...@PSU.EDU (dennis roberts) wrote:

> At 03:28 PM 1/22/01 -0500, Rich Ulrich wrote:

< snip, details of my alternative examples of statements >

> as i said before ... given the stem and the choice C of 1 foot ... i think
> any intelligent examinee could argue logically that this is the correct
> answer ... or, if the test builder wanted to claim D or 2 feet is the
> correct answer ... that C would have to be given equal correct weight ...
>
> there just is no good way to argue against the original choice C ... IN THE
> CONTEXT OF THE STEM OF THE QUESTION
>

- I wonder if other people got lost in the discussion? So far as I
remember, no one suggested that the actual choice wasn't C.

But here are words from what was originally posted:
========== part of post, extracted


Do you think the question is actually OK? Is the wording good enough
as it stands? Or, as worded, could there be a legitimate uncertainty
about which answer is correct?

BEGIN QUESTION TEXT

37. When Matt's and Damien's broad jumps were measured accurately to
the nearest foot, each measurement was 21 feet. Which statement best
describes the greatest possible difference in the lengths of Matt's
jump and Damien's jump?

A. One jump could be up to 1/4 foot longer than the other.
B. One jump could be up to 1/2 foot longer than the other.
C. One jump could be up to 1 foot longer than the other.
D. One jump could be up to 2 feet longer than the other.

END QUESTION TEXT

ObPuzzle: Assume that the wording needs improvement. Assume that the
concept to be tested is < rounding off ... >
======== end of extract from post.

There are at least 3 distractors in there, which I identified before.
Maybe they make this a better item for General Intelligence (or,
Intelligence and acculturation). (Stupid phraseology of rounding;
rounding of SIZE rather than numbers; "greatest possible difference"
in the Q invokes "at least 1 foot" as the minimum - before presenting
answers without the proper alternative.) (Is it possible to test
"rounding" without using the word "rounding"? - I suspect that the
simple attempt, like this, may be something that breeds scorn and
contempt into the hearts and minds of mathophobics, everywhere.)

They make this a sloppy test of Rounding, i.e.,
Does the pupil understand the concepts of rounding off numbers?
- well, there are rational plus neurotic reasons to resist.
The most obvious reason to answer C, in my opinion, is that the item
is an obvious probe, "Do you understand Rounding"?

Further: I have had trouble making this point to people, but I am
pretty sure that "measured accurately to the nearest foot" is context
dependent, or an idiomatic expression. The 2nd reason to answer C is
that the test-item is a probe, "Do you understand *that* idiom?" -
and if you haven't paid attention in math-class, it's likely you
don't.

dennis roberts

unread,
Jan 30, 2001, 8:47:35 PM1/30/01
to
i think that one thing that math class teaches you about "measurement" is
that there is error ... well maybe they do, now that i think of it, i am
not so sure how clear this notion is taught ... but, let's assume that it
is ...

math would also reinforce meaning of the use of the english ( ____ input
other language if appropriate) language ... the term "accurate" does have
some meaning ...

in this context ... and after all, every item is in some context ... it
says accurate for BOTH ... to the nearest foot

so, i think a perfectly legitimate interpretation of that is ... could be
off 1/2 foot down ... or up ...

i don't see anything mathematically wrong with deducing the answer to be C
given the context of the item ...

so, if you are saying that C is the best in this context ... good ... if
you are arguing for D ...

i disagree

my math is not great ... but, it ain't that bad either

> BEGIN QUESTION TEXT
>
>37. When Matt's and Damien's broad jumps were measured accurately to
>the nearest foot, each measurement was 21 feet. Which statement best
>describes the greatest possible difference in the lengths of Matt's
>jump and Damien's jump?
>
>A. One jump could be up to 1/4 foot longer than the other.
>B. One jump could be up to 1/2 foot longer than the other.
>C. One jump could be up to 1 foot longer than the other.
>D. One jump could be up to 2 feet longer than the other.
>
> END QUESTION TEXT
>

==============================================================
dennis roberts, penn state university
educational psychology, 8148632401
http://roberts.ed.psu.edu/users/droberts/drober~1.htm

dennis roberts

unread,
Jan 30, 2001, 11:04:52 PM1/30/01
to

> BEGIN QUESTION TEXT
>
>37. When Matt's and Damien's broad jumps were measured accurately to
>the nearest foot, each measurement was 21 feet. Which statement best
>describes the greatest possible difference in the lengths of Matt's
>jump and Damien's jump?
>
>A. One jump could be up to 1/4 foot longer than the other.
>B. One jump could be up to 1/2 foot longer than the other.
>C. One jump could be up to 1 foot longer than the other.
>D. One jump could be up to 2 feet longer than the other.
>
> END QUESTION TEXT


having been in the measurement field for more than 1/2 my life ... i have
some feel for and appreciation of ... the notion of measurement error
(whether this is a principle of math ... or not) ...

reliability is all about that ... reliability of the measured value ... the
measured jump of 21 feet ... can we depend on this to be correct and if
not, how "off" could it be from a bad measurement standpoint ...

so, this question interests me not so much from the standpoint of what
specifically IT is measuring ... but, from the standpoint of what makes a
decent question ... or a poor one

this kind of item is one that gives "tests" some of the bad name they get

in a case like this, we have to look at the question being asked ... ie,
the stem ... and first list out what are "facts" of the stem and what are
"logical inferences" that an examinee could make (maybe should make)

let's assume up front that the objective of the item really is ... concept
of measurement error ... that is, when folks take measurements ... they can
be wrong ... and wrong either way but, consider the following in this case

FACTS

1. you are given that the jumps were measured to the nearest foot
2. you are given that the jumps were measured accurately

LOGICAL INFERENCES on the part of the examinee

1. this is a contest ... broad jump ... and a tape measure was used that
had at least inch subdivisions ... or even finer (ever see a tape measure
at a track meet only with FOOT tick marks? ever have in your hand, a tape
measure that is 25 feet or 50 feet ... that did NOT have at least inch or
probably FINER subdivisions?)
2. contests are important so ... the measurers are assumed to be doing
their best to read "jumps" accurately ... if they don't ... they get water
bottles tossed at them by the irate parents
3. typical tapes could be extended in a somewhat slack mode ... when
extended to make the measurement ... but there is a limit to how TAUT or
lengthened they can be made to go ... so, if an error is likely to occur
(forget the fact that "accurate" is given in the stem) ... then it would be
most likely and sensibly in the slackened condition ... which means the
measured jump would be recorded LONGER than it should be ...

given the FACTS and what i believe to be sensible inferences one can and
should make in a broad jump contest which this is assumed to be like ...
would lead me as a measurement person ... to say this:

if the tape were extended in the taut(est) condition ... AND, measurements
were done accurately ... then if the landing mark were really between 20.5
and 21 ... we assume it will be rounded up/reported as 21 ... max error 1/2
foot ... or ... if the landing mark fell between just less than 21.5 and 21
... it would be rounded down or reported to be 21 ... this max gap sensibly
would be 1 foot

however, if the tape were slackened ... either a little or a lot, whether
it be measurer's error or not ... for a particular measurement (which
means it is not an accurate measurement by definition but, lets let that
slide for a moment) ... then the gap (and hence max error) between the tape
mark and the landing mark becomes harder to discern ... perhaps impossible
to discern

because ... we don't know how much slack there might be in the tape

but, regardless, it will be seen by the measurer as being LONGER than it
really is ...

thus, under the slackened condition ... errors could make the measurement
longer than it should be ... but, in the taut condition ... the error is
not likely to make the measurement shorter than it should be ...

the likelihood of a LONGER error (if anything) is much greater than the
likelihood of a SHORTER error

if in the taut condition ... the MINIMUM "max" error could sensibly be
called (rounding of course considered) 1 foot ... between the two ACTUAL
jumps ... BUT WHAT COULD THE SENSIBLE MAXIMUM MAX ERROR BE BETWEEN THE 2
ACTUAL JUMPS?

i say that this canNOT be sensibly determined from the facts and logical
inferences made in the question ... and while i now say that choice C would
be a "possible choice" it should actually read (C: One jump could be up to
1 foot longer than the other <<<< could be 1 foot ... but not could be UP
TO 1 foot) for the MIN MAX ... choice D of 2 feet is NOT a good choice
either (nor can any be deduced) for the MAX MAX error in the actual jumps

thus, i now don't believe C is correctly stated ... and therefore is not
correct ... and D is not correct because we cannot determine what might be
the largest error that could be made ... it might be 1.3 feet or 1.7 ... or
2.1 ... but we do NOT know that the max error could or would be 2 feet

bottom line:

A and B are incorrect for sure ... C is not good ... and D can't be proved
to be correct
none of the choices is correct ... C is probably the BEST choice but still
not a good one


this might be a good question for assessing an inappropriate objective ...
or, an inappropriate question to test a legitimate objective

but as it stands ... it surely is a poor item that fails to keep straight
... appropriateness of the item GIVEN some objective

J. Williams

unread,
Feb 1, 2001, 9:22:21 AM2/1/01
to
On 30 Jan 2001 20:04:52 -0800, d...@PSU.EDU (dennis roberts) wrote:

>if the tape were extended in the taut(est) condition ... AND, measurements
>were done accurately ... t
>

>however, if the tape were slackened ... either a little or a lot, whether
>it be measurer's error or not ...
>

>because ... we don't know how much slack there might be in the tape
>

>thus, under the slackened condition ... errors could make the measurement
>longer than it should be ... but, in the taut condition ...

>if in the taut condition ... the MINIMUM "max" error could sensibly be


>called (rounding of course considered) 1 foot ... between the two ACTUAL
>jumps ...

What if the tape measuring device was metallic? The tape measure need
not be the old fashioned cloth type employed by seamstresses and
tailors. Right? Additionally, the word "accurately" was specified
in the question. The respondent in reading the query must assume the
person doing the measuring is indeed "accurate" and the tape measure
is too.


>>
>thus, i now don't believe C is correctly stated ... and therefore is not
>correct ...

I disagree --- C is correct


>
>A and B are incorrect for sure ... C is not good ... and D can't be proved
>to be correct
>none of the choices is correct ... C is probably the BEST choice but still
>not a good one

C is indeed the best choice. It is the ONLY correct answer. What is
so awful about the correct choice? I don't get it!

>but as it stands ... it surely is a poor item that fails to keep straight
>... appropriateness of the item GIVEN some objective

The question yields a subtle view of a theoretical confidence
interval. Maybe, I'm missing something salient here, but I think it
is a fair question. Of course, I was not an English major either :-)

Rich Ulrich

unread,
Feb 4, 2001, 8:03:00 PM2/4/01
to
I'm still trying to perfect my answer, so I will take another
shot here. I don't know whether J Williams saw what I posted before;
but I am happy that DMR is calling it a bad test item.

On Thu, 01 Feb 2001 14:22:21 GMT, kak2...@excite.com (J. Williams)
wrote:


> On 30 Jan 2001 20:04:52 -0800, d...@PSU.EDU (dennis roberts) wrote:

[ ... snip, much ]

JW >

> C is indeed the best choice. It is the ONLY correct answer. What is
> so awful about the correct choice? I don't get it!

DMR >

> >but as it stands ... it surely is a poor item that fails to keep straight
> >... appropriateness of the item GIVEN some objective

JW >

> The question yields a subtle view of a theoretical confidence
> interval. Maybe, I'm missing something salient here, but I think it
> is a fair question. Of course, I was not an English major either :-)

I think there are three different approaches one can take to
saying what makes "a good item."

(1) There is (something like) "Is the right answer given by someone
with a good IQ?" I think that we are all agreed that (C) should meet
that requirement. Further, I imagine that the item was validated
*statistically* by this standard -- marking "C" goes along with
higher scores on other test items.

(2) There is a narrower approach -- which, indeed, was the question
specified when this item was posted. "Does the item show whether the
student understands rounding?" Will it be answered correctly by
everyone who does, or could naive respondents be led astray?
Since a "broad jump measured accurately to the nearest foot"
is not something that anyone in the Western world has ever heard
of, is it really fair to ask an 8th grader to interpret what it might
mean? (I assume, the 8th grader is suppose to translate this,
immediately, into "This is a ROUNDING problem," and the rest
of us statisticians know what the item's answer is, because we
have overlearned exactly that same response.)


You demonstrate possible difficulties, perhaps, by debriefing
students who missed the item; or by comparing to other, related items;
or by noting that there are unexpected item-loadings in a large scale
factor analysis. But you usually will discover them by careful
face-inspection, which is what I provided (I hope) in earlier posts.

"If you can imagine a way that someone would misread the item,
then someone will." This is a mild version of Murphy's law. It is
practically a truism when you are designing items or forms -- the hard
part of your judgement is, figuring how much "problem" is too-much
problem. In the recent Florida election, we learned that "punched
cards" have an inherent error rate of over 1%. And a "butterfly
ballot" has a rate over 5%. How much does it matter that most of
these errors should befall that 15% of the voters in Florida who were
voting for the first time? - well, it means that our subjective
account should not assume that every voter is cool and experienced.

"Professionally speaking," the butterfly punch-ballot has to be
regarded as awful, no matter how much Jay Leno, etc., make fun
of the Florida voters instead.

Similarly for the test-item. If you are making assumptions about the
pupil's experience, vocabulary, acculturation, IQ, and attitude, then
you may forget to rate the item by how it measures "rounding."

Here is a minor question or observation. In the real world, does
anyone ever perform rounding, and blandly expect for it to be
recognized as such? Or don't we *explicitly* state that "this is
rounding and not truncation or estimation."

(3) The third approach is, "Is the answer technically correct?"
So far, it remains embarrassing and something-to-be-corrected,
when the keyed answer violates physics, or careful logic. Or if,
on close inspection, the question does not make good sense.
This is more important than slightly misleading some students.
Bad logic likely will be reflected in errors of the previous type,
but errors of (#3) need to be corrected, where (#2) do not.
It is harder to show test-makers that *they* are "wrong."

I have not had many people agree with me that, instead of being
purely logical, this item relies on well-understood jargon or idiom.
I 'm trying one more time.

It says, "measured accurately to the nearest foot." People keep
claiming that "accurate" must mean "it's 100% accurate" - so that
this conflation of accuracy (of measurement) and precision (of
reporting) is entirely expected and natural.

What if another item said that a blimp at 1000 feet saw the two jumps,
and "estimated each at 21 feet, accurate only to the nearest foot."
What does this imply about the maximum difference between the two
jumps?
What if it said, "estimated each at 21 feet and 6 inches, measuring
accurately only to the nearest foot"?
- actually, the occasional use of half-units (like 6 inches) is
probably a give-away that someone thinks that their *accuracy* is
about one-unit; they are promising not to err by more than 1/2, so
they refuse to round off, between .40 and .60, say.

- I think that I have just presented, in those last two things,
comments that are much more "real-life" than the statement in the
original test item. And "accurately" is ambiguous to the 14-year-old.

Finally, we round *numbers* if we don't want to fret about
measurement error. And we keep that language clear.

- this is still not perfect, but I hope I am improving it.

Robert J. MacG. Dawson

unread,
Feb 5, 2001, 11:52:28 AM2/5/01
to

Rich Ulrich wrote:

>
> (1) There is (something like) "Is the right answer given by someone
> with a good IQ?" I think that we are all agreed that (C) should meet
> that requirement. Further, I imagine that the item was validated
> *statistically* by this standard -- marking "C" goes along with
> higher scores on other test items.

Unless IQ is what you're trying to test, it's not the IQ, it's the
knowledge and understanding that's important.

>
> (2) There is a narrower approach -- which, indeed, was the question
> specified when this item was posted. "Does the item show whether the
> student understands rounding?" Will it be answered correctly by
> everyone who does, or could naive respondents be led astray?

Does the idea of "a naive respondent who nonetheless understands
rounding" really mean anything? Somebody who is naive *about rounding*
does not truly understand it. Whether somebody is naive about (say)
> taking candy from strangers is irrelevant here.



> "If you can imagine a way that someone would misread the item,
> then someone will." This is a mild version of Murphy's law. It is
> practically a truism when you are designing items or forms -- the hard
> part of your judgement is, figuring how much "problem" is too-much
> problem. In the recent Florida election, we learned that "punched
> cards" have an inherent error rate of over 1%. And a "butterfly
> ballot" has a rate over 5%. How much does it matter that most of
> these errors should befall that 15% of the voters in Florida who were
> voting for the first time? - well, it means that our subjective
> account should not assume that every voter is cool and experienced.
>
> "Professionally speaking," the butterfly punch-ballot has to be
> regarded as awful, no matter how much Jay Leno, etc., make fun
> of the Florida voters instead.

The purposes are very different. The purpose of the ballot is to
determine somebody's intention, not their understanding. If one *wanted*
a government chosen by the most intelligent, a ballot form that would
probably be spoiled by the uneducated voter would be the way to go. Of
course, you would need to use it in all districts and randomize which
candidate gets the easy-to-read spot at the top... rather than giving it
to the Governor's brother.

-Robert Dawson

Rich Ulrich

unread,
Feb 11, 2001, 5:23:21 PM2/11/01
to
I will just drop in a couple of additional comments to Robert's post -

On 5 Feb 2001 08:52:28 -0800, Robert...@STMARYS.CA (Robert J.
MacG. Dawson) wrote:

> Rich Ulrich wrote:

> >
> > (1) There is (something like) "Is the right answer given by someone
> > with a good IQ?" I think that we are all agreed that (C) should meet
> > that requirement. Further, I imagine that the item was validated
> > *statistically* by this standard -- marking "C" goes along with
> > higher scores on other test items.

RJD >

> Unless IQ is what you're trying to test, it's not the IQ, it's the
> knowledge and understanding that's important.

Is this an assent to my eventual point? - validating the precise
content is "important" but it is easy to overlook.
>
me > >

> > (2) There is a narrower approach -- which, indeed, was the question
> > specified when this item was posted. "Does the item show whether the
> > student understands rounding?" Will it be answered correctly by
> > everyone who does, or could naive respondents be led astray?

RJD>

> Does the idea of "a naive respondent who nonetheless understands
> rounding" really mean anything? Somebody who is naive *about rounding*
> does not truly understand it. Whether somebody is naive about (say)
> taking candy fron strangers is irrelevant here.

The "naive respondent" that I have in mind is one who understands the
lessons, but has not over-learned her "rounding" the way that all of
us have: we will trust, that a problem that CAN be a rounding
problem WILL BE a rounding problem.

Or, is that what we are supposed to teach? I do wonder whether
my complaint is fundamentally against bad teaching. I do imagine
that concept-insensitive teachers are using words and examples that
are just as sloppy as the Item. And then they wonder why some
students, who insist on their own poetic or neurotic interpretations,
Don't Get It.

[ snip, rest ]
