Testing & Measurement

Trendler, Guenter

Oct 27, 2010, 9:24:04 AM
to talking-m...@googlegroups.com

D. So what do we do? Do we conclude that chess ability is a continuous quantity like length, measurable by tests and game performance? I would hesitate, given the force of the objections under (1). Do we then conclude that chessometrics is a pathological science, claiming quantitative structure where there is none and deceiving the chess-playing community, leading them by Stevens' primrose path in suggesting that Elo ratings are quantitative? Again, I would hesitate, given the force of the evidence in (2).

G. Why should anybody conclude that chessometrics is pathological science? If the purpose of a rating system is to "predict" results and it accomplishes that task successfully, nobody will object to it. Testing and measurement should not be confused, although, as you point out, they have some features in common. Testing, as the examination or assessment of the test-taker's knowledge, skill, aptitude, physical fitness, etc., has proven a useful means of selection and classification, and as such it is indispensable in the modern education system (1). Of course, as you point out, if someone claimed that, say, the BTL model or the Rasch model leads to interval-scale measurements, some objections would arise. Furthermore, successful testing does not vindicate the use of psychometric methods relying on the calculation of arithmetical means (e.g. SEM). Finally, I can’t imagine that anybody claims that measurement is the only way to make successful predictions or, in general, the only way for psychology to become a successful science.

Regards
Guenter

(1) http://en.wikipedia.org/wiki/Test_%28assessment%29

Denny Borsboom

Oct 27, 2010, 10:54:57 AM
to talking-m...@googlegroups.com
> G. Why should anybody conclude that chessometrics is pathological science?
> If the purpose of a rating system is to "predict" results and if it
> accomplishes the task successfully nobody will object to it. Testing and
> measurement should not be confused, although, as you point out, they have
> some features in common. Testing as the examination or assessment of the
> test-taker's knowledge, skill, aptitude, physical fitness etc. has proven as
> an useful means of selection and classification and as such it is
> indispensable in the modern education system (1). Of course, as you point
> out, if someone would claim that, say, the BTL model or the Rasch model
> leads to interval-scale measurements some objections will arise.
> Furthermore, successful testing does not vindicate the use of psychometric
> methods relying on the calculation of arithmetical means (e.g. SEM).
> Finally, I can’t imagine that anybody claims that measurement is the only
> way to making successful predictions or in general the only way for
> psychology to become a successful science.

I agree with everything here (except for the part about the means).

But my point is stronger: if someone came up to me and said they can
in fact measure - not just test for - chess ability, I think that
person could make a pretty reasonable case, just like Jack can make a
very reasonable case for the Lexile. It's clearly not just curve
fitting or stamp collecting, as there is a control element present
that has a quantitative side to it.

At my university they use an adaptive model based on Elo/BTL/Rasch to
create an adaptive learning environment
(http://www.mathsgarden.com/pages.eng/about/). They keep the kids'
probability of answering an item correctly at 70%. It is quite
impressive how well this works. If you wanted the % correct to rise
from 70 to 80 for your child, they can just tune this easily, on the
spot, as you're standing there. Just turning an IRT-knob. This works.

I find it hard to make sense of how on earth this could be so accurate
if the differences between levels of ability were merely ordered.
There have to be distances, in some way, to give you this kind of
control and there also has to be some kind of correct assessment of
these distances. So there has to be measurement in some form. But how
can that be?
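
The "IRT-knob" described above can be sketched as follows. This is a minimal illustration assuming a plain Rasch model with an Elo-style update; the function names, the step size k and the example ability value are assumptions made here, not details of the Maths Garden system.

```python
import math

def p_correct(theta, b):
    """Rasch probability that a person of ability theta solves an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def target_difficulty(theta, p_target):
    """Item difficulty at which the Rasch success probability equals p_target."""
    return theta - math.log(p_target / (1.0 - p_target))

def elo_update(theta, b, correct, k=0.1):
    """Elo-style update: shift ability and difficulty by k times the prediction error."""
    error = (1.0 if correct else 0.0) - p_correct(theta, b)
    return theta + k * error, b - k * error

theta = 0.5  # hypothetical current ability estimate, in logits
b70 = target_difficulty(theta, 0.70)  # item difficulty for a 70% success rate
b80 = target_difficulty(theta, 0.80)  # item difficulty for an 80% success rate
print(round(p_correct(theta, b70), 2), round(p_correct(theta, b80), 2))
```

Under these assumptions, moving the target from 70% to 80% lowers the selected item difficulty by the same amount (about 0.54 logits) for every child, whatever the ability estimate. That constant shift is where distances on the logit scale, and not merely order, enter the procedure.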

Best
Denny


--
Denny Borsboom
Department of Psychology
University of Amsterdam
Roetersstraat 15
1018 WB Amsterdam
The Netherlands
phone: +31 20 525 6882
email: d.bor...@uva.nl
homepage: http://users.fmg.uva.nl/dborsboom

Trendler, Guenter

Oct 27, 2010, 1:04:49 PM
to talking-m...@googlegroups.com
D. But my point is stronger: if someone came up to me and said they can
in fact measure - not just test for - chess ability, I think that
person could make a pretty reasonable case, just like Jack can make a
very reasonable case for the Lexile. It's clearly not just curve
fitting or stamp collecting, as there is a control element present
that has a quantitative side to it.

At my university they use an adaptive model based on Elo/BTL/Rasch to
create an adaptive learning environment
(http://www.mathsgarden.com/pages.eng/about/). They keep the kids'
probability of answering an item correctly at 70%. It is quite
impressive how well this works. If you wanted the % correct to rise
from 70 to 80 for your child, they can just tune this easily, on the
spot, as you're standing there. Just turning an IRT-knob. This works.

I find it hard to make sense of how on earth this could be so accurate
if the differences between levels of ability were merely ordered.
There have to be distances, in some way, to give you this kind of
control and there also has to be some kind of correct assessment of
these distances. So there has to be measurement in some form. But how
can that be?

G. How about testing some axioms of measurement?

Guenter



Denny Borsboom

Oct 27, 2010, 2:52:31 PM
to talking-m...@googlegroups.com
> G. How about testing some axioms of measurement?

Assuming you'll grant the incorporation of probabilities into the
testing scheme, we can certainly fit an ACM model if that's what you
mean. As far as I can see, the ACM defined on a probability structure
is a Rasch model, which we could fit the frequentist way and look at
the chi-square or, if you like Bayesian statistics, in the Bayesian
way and look at posterior predictives. Alternatively we could check
model implications piecewise; for instance we could search the items x
persons matrix for 3 by 3 tables in which the antecedent inequalities
of the double cancellation axioms are satisfied. If we're lucky, we'll
find some and we can test the double cancellation condition [by
significance testing, Bayesian inference, etc.], assuming that we're
not confounding our tests with departures from other assumptions like
unidimensionality, local independence etc. Most likely outcome is that
there are some departures from the model, not easy to interpret, and
we'll be looking at the choice of either deleting items that don't fit
or finding a looser model that does fit.
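
The piecewise check on a 3 by 3 table can be sketched like this. It is a deterministic check on response proportions only, with made-up numbers; it ignores sampling error, so it is not a substitute for the significance tests or Bayesian inference mentioned above. The function name and both tables are assumptions for illustration.

```python
def double_cancellation(p):
    """Check double cancellation on a 3x3 table p[person_group][item].

    Returns None when the antecedent inequalities do not hold (the table
    is then uninformative), otherwise True or False for the consequent.
    """
    antecedent = p[1][0] >= p[0][1] and p[2][1] >= p[1][2]
    if not antecedent:
        return None
    return p[2][0] >= p[0][2]

# Hypothetical proportions correct: rows are ability groups, columns are items.
# Both tables are monotone in rows and columns.
consistent = [[0.20, 0.30, 0.45],
              [0.35, 0.50, 0.65],
              [0.55, 0.70, 0.80]]
violating = [[0.10, 0.30, 0.60],
             [0.35, 0.50, 0.65],
             [0.55, 0.70, 0.80]]
print(double_cancellation(consistent), double_cancellation(violating))
```

The second table shows that a matrix can be perfectly ordered in both directions and still fail double cancellation, which is why the condition has to be checked separately from monotonicity.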

How exactly does this depart from what my colleagues are doing using
mainstream psychometrics?

Best
Denny


Derek Briggs

Oct 27, 2010, 5:25:12 PM
to talking-m...@googlegroups.com
Denny--

You're kidding right?

OK, I can't speak with any authority of mainstream practices in
Europe. But speaking with respect to my colleagues in the United
States, even among psychometricians that use the Rasch Model, only a
rather small number are aware of (and an even smaller number understand)
the connection between the Rasch Model and ACM, even if it is true that
this connection is overblown as Andrew has suggested. Simply put, no
effort is made to acknowledge or test the quantity hypothesis. Now
testing for model fit might be a start and perhaps that's what you
have in mind. But consider the way fit is established for the Rasch
Model in commercial practice. Here is an example pulled from the
technical report of the large-scale test for a midwestern state:

"Infit is a statistic that assesses the fit of the observed data to
the Rasch IRT model with respect to the parameters that were estimated
for that item. Essentially, it answers the question, 'How closely does
the observed data hold to the values that are predicted by the model?'
The infit statistic is sensitive to unexpected responses for examinees
with abilities near to the difficulty of the item. Its expected value
is 1.0; values greater than 1.5 indicate that the data contains
unexpected response patterns."

I would argue that almost everything in this passage represents a
fundamental misunderstanding about the nature of residual-based fit
statistics. 1) Infit cooks the books by downweighting "outliers"; 2)
It doesn't answer the question they think it answers--only whether it
is reasonable to specify a common slope for the ICCs; 3) There are no
fixed rules of thumb for an "acceptable" value of misfit. For large
samples (>500), a value of 1.1 would typically be strong indication of
misfit; and 4) it ignores negative misfit. Given this, we have many,
many tests being used as though they fit the Rasch Model when upon
closer inspection this is a charade.
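
To make point 1) concrete, here is a minimal sketch of the two residual-based mean squares for a single dichotomous item, using the usual textbook definitions (infit as the information-weighted mean square, outfit as the unweighted one). The function names and the five-person data set are invented for illustration and are not taken from the technical report quoted above.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one dichotomous item.

    responses: 0/1 scores; thetas: matching person abilities; b: item difficulty.
    """
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        w = p * (1.0 - p)               # model variance of the response
        sq_resid.append((x - p) ** 2)
        variances.append(w)
        z_sq.append((x - p) ** 2 / w)   # squared standardized residual
    infit = sum(sq_resid) / sum(variances)   # information-weighted mean square
    outfit = sum(z_sq) / len(z_sq)           # unweighted mean square
    return infit, outfit

# Hypothetical data: the lowest-ability person unexpectedly answers correctly.
thetas = [-3.0, -1.0, 0.0, 1.0, 3.0]
responses = [1, 0, 0, 1, 1]
infit, outfit = item_fit(responses, thetas, b=0.0)
print(round(infit, 2), round(outfit, 2))
```

With these made-up numbers the surprising response from the person at theta = -3 inflates outfit far more than infit, because infit weights each squared residual by the model variance and so gives little weight to persons far from the item's difficulty.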

I do think tests of fit based on the work of Glas and Verhelst are on
theoretically stronger footing and more common in European
applications, but I can't say for sure.

I do have a question that perhaps others could weigh in on. Putting
aside an important conceptual distinction between the RM and ACM
(namely that under ACM one imagines a scenario in which the two
variables M and N in an m by n data matrix could be manipulated, while
in the RM this is largely tautological), suppose a data matrix could be
shown to fit the Rasch Model in the sense that we have plotted
empirical ICCs and we see they are (a) monotonically increasing, (b)
parallel and (c) have no lower asymptote. I think it would still be
possible to find that the data matrix violates double cancellation in
the probabilistic sense suggested by the work of Karabatsos,
Scheiblechner, etc. Is this correct? On the other hand, is it also
the case that if one only uses the expected values assuming the RM to
be true, all cancellation axioms of ACM will hold?
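
On the second question, a short derivation suggests that the model-implied probabilities do satisfy double cancellation. This is only a sketch; the notation is introduced here and is not from the thread: theta_i is the ability of person i, b_j the difficulty of item j, and F the (strictly increasing) logistic function.

```latex
\begin{align*}
P_{ij} &= F(\theta_i - b_j), \qquad F \text{ strictly increasing} \\
P_{21} \ge P_{12} &\iff \theta_2 - b_1 \ge \theta_1 - b_2 \\
P_{32} \ge P_{23} &\iff \theta_3 - b_2 \ge \theta_2 - b_3 \\
\text{adding the two:}\quad \theta_3 - b_1 &\ge \theta_1 - b_3 \iff P_{31} \ge P_{13}
\end{align*}
```

The argument uses only the additive form theta_i - b_j and the monotonicity of F, so it says nothing about observed proportions, which can depart from the expected values through sampling error or misfit.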

Derek
---------------------------------------------------------------------------------------
Derek Briggs
Associate Professor & Program Chair
Research & Evaluation Methodology
School of Education
University of Colorado, Boulder
Boulder, CO 80309
http://www.colorado.edu/education/faculty/derekbriggs/index.html

Denny Borsboom

Oct 27, 2010, 6:09:47 PM
to talking-m...@googlegroups.com
Derek,

There are significant differences in the quality of statistical
modeling undertaken in various areas of science (the so-called exact
sciences not always being ahead of things, in my view). This is also
true within disciplines. I am aware that there are weaker and stronger
versions of psychometric modeling, and I am not claiming insight into
the minds of people who use infit/outfit statistics.

But I am not kidding that testing ACM axioms implies testing a
psychometric model and hence falls squarely within the tasks of
psychometrics and will normally involve statistical modeling.

Your observations on the Rasch-ACM correspondence are, in my view, correct.

Best
Denny

Derek Briggs

Oct 27, 2010, 6:31:17 PM
to talking-m...@googlegroups.com
Hi again Denny,

I think I agree with your second to last statement.

But if you agree that my observations on the Rasch-ACM correspondence are
correct, then you must also agree that existing tests of fit applied
to the RM at best only satisfy a necessary but not sufficient
condition for ACM to hold in the observed data. If mainstream
psychometricians took this issue seriously (i.e. testing the quantity
hypothesis empirically), one would not stop there. I would not for a
second argue that psychometricians are not concerned about "model
fit"--but to what end?

Derek


Denny Borsboom

Oct 28, 2010, 7:40:39 AM
to talking-m...@googlegroups.com
Hi Derek,

yes of course you want to have powerful tests of specific deviations
from the model. Many such tests exist and are being used at least
where I come from. The double cancellation issue is important, and new
tests for it may be a useful addition to the toolkit if they perform
well. But I doubt they are news to those familiar with the IRT
literature and they surely aren't qualitatively different from other
tests [e.g. for monotonicity, for non-intersecting IRFs, for DIF, for
global fit, etc].

In my personal view, there are usually many more problems with the
measurement model than failing to meet double cancellation. There are
often departures from unidimensionality and local independence; in the
case of multiple groups there are often minor, sometimes major,
violations of measurement invariance. In such cases it makes no sense
to start testing for double cancellation because the model fails at
earlier testing stages. So you get into the "improve the items and/or
the model" phase of test development. Many times people get tired of
tweaking the items and model, or they run out of money, and they will
just accept something that isn't entirely off the track. That's life.

I know there are traditions where very poor statistics is being done.
That's a pity but I can't help it. However it really makes no sense to
pretend that the axiomatic route is qualitatively different from a
modeling route. In the end, you're looking at sets of hypotheses and
sets of data and you have to figure out whether you accept your core
hypotheses or not.

In my view it is an enormous mistake to think that the axiomatic stuff
offers something qualitatively different from the statistical modeling
approach, or even that they are necessarily distinct. The axiomatic
route is just another way to impose externally justified conditions
[from statistics, from confirmation theory, from philosophy of science,
from measurement theory in physics, etc etc] to psychological test
scores.

The problem with psychological measurement however is in the
substance, not in the techniques.

Best
Denny

Trendler, Guenter

Oct 28, 2010, 11:47:41 AM
to talking-m...@googlegroups.com

D. However it really makes no sense to pretend that the axiomatic route is qualitatively different from a modeling route. In the end, you're looking at sets of hypotheses and sets of data and you have to figure out whether you accept your core hypotheses or not.

In my view it is an enormous mistake to think that the axiomatic stuff offers something qualitatively different from the statistical modelling approach, or even that they are necessarily distinct. The axiomatic route is just another way to impose externally justified conditions [from statistics, from confirmation theory, from philosophy of science, from measurement theory in physics, etc etc] to psychological test scores.

G. Of course, if one does not accept the critique of psychometrics from a quantitative standpoint as valid one will never see any difference. To the critic confusing the two is like confusing statistics with the requirements of statistics. Just to make sure: do you believe that Rasch and in particular his followers (e.g. Wright, Fischer) are correct in their critique of classical test theory? Are they wrong in pointing out that classical test theory presumes measurement?

Guenter


Denny Borsboom

Oct 28, 2010, 1:14:23 PM
to talking-m...@googlegroups.com
> G. Of course, if one does not accept the critique of psychometrics from a
> quantitative standpoint as valid one will never see any difference.

D: Please point to me exactly where the difference lies - it should be
no trouble to you, as one who does accept "the critique of
psychometrics from a quantitative standpoint" and sees these things
clearly. In an earlier post, some time ago, you criticized a table by
Andrich; that could serve as a fine example; alternatively, you could
take any actual example of testing double cancellation in a 3 x 3
table. Exactly where is the difference between testing the axioms of
quantity as applied to probabilities, and testing hypotheses
concerning the parameters of a probabilistic model?

> To the
> critic confusing the two is like confusing statistics with the requirements
> of statistics. Just to make sure: do you believe that Rasch and in
> particular his followers (e.g. Wright, Fischer) are correct in their
> critique of classical test theory? Are they wrong in pointing out that
> classical test theory presumes measurement?

D: Should classical test theory be taken to "presume measurement"
before or after classical test theory enjoys breakfast? The
personification here precludes the statement from having a truth
value; it's just absurd. I cannot answer this. If you mean: "is it an
axiom or premise or definition of CTT that quantitative measurement
has been carried out?" the answer is clearly no. If you mean "could a
person use CTT profitably without presuming that quantitative
measurement has been carried out?" the answer is clearly yes. If you
mean "do the procedures in CTT make most sense to you if you imagine
them to be carried out on quantitative measures" my personal answer is
yes, but I could easily imagine others who would sensibly disagree.

Best
Denny



Andrew Kyngdon

Oct 28, 2010, 9:26:33 PM
to talking-m...@googlegroups.com
Denny,

It is most certainly an enormous mistake to think that there are no major differences between what you refer to as "axiomatic stuff" and statistical modelling.

Do you seriously believe there are no formal or theoretical differences between the theory of conjoint measurement and structural equation modelling, for example?

I'll assume that what you refer to as "axiomatic stuff" is the field of abstract measurement theory. It is nonsense to suggest that there are no formal and theoretical differences between abstract measurement theory and the collection of statistical data analysis techniques known as "psychometrics". It is also nonsense to argue that there would be no essential empirical differences between an experimental study involving the use of statistical models and abstract measurement theory.

Let us take the theory of conjoint measurement. In no way is it a statistical data reduction device like an IRT model. IRT models are only concerned with persons, items and response probabilities. The theory of conjoint measurement is a theory of quantity, not a model of a dataset, and therefore has a generality which far exceeds that of any mere IRT model.

Scientific measurement is the estimation of ratios between magnitudes of continuous quantities and unit magnitudes of the same kind. As I have mathematically proven on this forum, one can deduce the single and double cancellation axioms from the scientific theory of measurement using some of Hoelder's (1901) axioms of quantity, namely, the solvability, associative and commutative laws.

I know of no way that one can formally, algebraically deduce an IRT model directly from the scientific theory of measurement. Why? Because IRT models are not theories of quantity. They were only created ever to reduce sets of test performance data, never to test the hypothesis of quantitative psychological attributes.

And this is where IRT fit statistics and other indicators fail badly.

Lord (1980) showed that monotone transformations of the parameters of any logistic IRT model create other sets of response curves of equal validity (for any set of items) as those produced by the original model. Hence any dataset to which a logistic IRT model is fitted can be equally well fitted by another such model, provided the form of the item response curves of the second model is a monotonic increasing function of the form specified in the first one (Jones & Applebaum, 1989). As only order is preserved under monotone transformation, an IRT model fit may be indicative of order, but not quantity, in human cognitive abilities. This is borne out by the fact that if both the Rasch model and the three-parameter logistic (3PL) (Birnbaum, 1968) model fit a data set, only the order upon the person ability estimates produced by these models remains invariant (Mislevy, 1987).
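
A small sketch of the transformation argument, using the familiar exponential reparameterization of the Rasch model as the monotone map. The function names and parameter values are assumptions chosen for illustration; this is one instance of the general point, not a reconstruction of Lord's own demonstration.

```python
import math

def rasch_logit(theta, b):
    """Rasch model on the logit scale: P = logistic(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_ratio(xi, eps):
    """The same model after the monotone map xi = exp(theta), eps = exp(b)."""
    return xi / (xi + eps)

thetas = [-1.0, 0.0, 1.0]            # equally spaced on the logit scale
b = 0.3
xis = [math.exp(t) for t in thetas]  # same persons on the transformed scale
eps = math.exp(b)

# Both parameterizations give identical response probabilities ...
same = all(abs(rasch_logit(t, b) - rasch_ratio(x, eps)) < 1e-12
           for t, x in zip(thetas, xis))
# ... but the spacing between persons differs between the two scales.
gaps_theta = [thetas[1] - thetas[0], thetas[2] - thetas[1]]
gaps_xi = [xis[1] - xis[0], xis[2] - xis[1]]
print(same, gaps_theta, [round(g, 3) for g in gaps_xi])
```

The two scales predict exactly the same data and agree on the order of the persons, yet equal gaps on one scale are unequal on the other, so fit alone cannot tell us which spacing, if either, is the right one.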

Global fit statistics associated with statistical models also have another well known problem - the masking of differences in the structure of systems underlying noisy numerical data (Nickerson & McClelland, 1984; Nygren, 1980). Because item discrimination multiplies the difference between person ability and item difficulty, the 2PL model advances a distributive composition rule (Krantz & Tversky, 1971) for individual differences in test performance. The Rasch model advances an additive composition rule as it concerns only the difference between person ability and item difficulty. These composition rules have different structural implications for cognitive abilities, yet the 2PL model will fit test data the Rasch model fits. Moreover, both models will fit synthetic data generated by IRT models in which the item response function is not logistic (García-Pérez, 1999). Which composition rule is the correct one for a real set of test score data, in which the "true" form of the item response function is unknown, cannot be inferred from existing tests of fit.
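
Both halves of this point can be seen in a short sketch: the 2PL contains the Rasch model as the special case of unit discrimination, and once discriminations differ the item response curves can cross. The parameter values below are arbitrary assumptions chosen for illustration.

```python
import math

def p_rasch(theta, b):
    """Additive rule: depends only on theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_2pl(theta, b, a):
    """Distributive rule: discrimination a multiplies (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# With a = 1 the 2PL reproduces the Rasch probabilities exactly.
nested = all(abs(p_2pl(t, 0.4, 1.0) - p_rasch(t, 0.4)) < 1e-12
             for t in (-2.0, 0.0, 2.0))

# Two hypothetical 2PL items with different discriminations.
item_1 = {"b": 0.0, "a": 0.5}
item_2 = {"b": 0.5, "a": 2.0}
low = (p_2pl(-2.0, **item_1), p_2pl(-2.0, **item_2))   # low-ability person
high = (p_2pl(2.0, **item_1), p_2pl(2.0, **item_2))    # high-ability person
print(nested, low[0] > low[1], high[0] > high[1])
```

In this example item 1 is the easier item for the low-ability person and item 2 is the easier item for the high-ability person. Such a reversal cannot occur under the additive Rasch rule, where the ordering of items is the same at every ability level.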

The problem with psychological measurement is that there is no strong evidence which supports the hypothesis that there exist such things as psychological quantities. It is not known if there exist any quantities outside of those of physics. Therefore, to test the hypothesis of psychological quantities, we need theoretical and experimental tools that are sensitive to the presence of continuous quantities. There are a few such methodologies. One is the method of concomitant variation in the investigation of trade off relationships. This has led to the quantification of physical variables, such as in Ohm's (1826) investigations of resistance, voltage and electric current.

Indeed, much has been made of tradeoffs on this forum, as if only such things are needed, and that abstract measurement theory has nothing useful to say. However, if one looks at tradeoffs carefully in theoretical terms, all one is doing when investigating tradeoffs is a special form of conjoint measurement. Rather than testing the cancellation axioms upon all the cells of the conjoint array, one is empirically focussing only upon the diagonal cells and attempting to discover if they are equivalent. This can only be the case if the relevant attributes are quantities. So sorry to disappoint those who feel that investigating tradeoffs is avoiding using the theory of conjoint measurement!

For good reason I do not believe that the heterogeneous collection of statistical models known as "psychometrics" is capable of coming to terms with the hypothesis of psychological quantities, at least not as psychometric models are currently employed and given the current state of fit statistics and other indicators. If psychometric models were capable of dealing with this hypothesis scientifically, they would have done so by now and I would not be writing this post. Indeed, this forum would not exist.

Where I do agree with you Denny is that methodology is only a part of the problem of psychological measurement. Our knowledge of continuous quantities is contingent upon descriptive, substantive theory and empirical evidence of quantity. A limitation of theories of conjoint measurement is that they are useful only in the latter respect. In physics and metrology, descriptive theory is integral to the quantity calculus (Emerson, 2008), the definition of units given by the International System of Units (Bureau International des Poids et Mesures, 2006) and all physical laws. It is therefore integral to all measurement in physics.

The profound lack of descriptive theories of human behaviour is arguably the single most important problem facing the scientific measurement of psychological systems.

Cheers,

Andrew


For example,

Trendler, Guenter

Oct 29, 2010, 3:31:19 AM
to talking-m...@googlegroups.com

A. Where I do agree with you Denny is that methodology is only a part of the problem of psychological measurement. Our knowledge of continuous quantities is contingent upon descriptive, substantive theory and empirical evidence of quantity. A limitation of theories of conjoint measurement is that they are useful only in the latter respect. In physics and metrology, descriptive theory is integral to the quantity calculus (Emerson, 2008), the definition of units given by the International System of Units (Bureau International des Poids et Mesures, 2006) and all physical laws. It is therefore integral to all measurement in physics.

The profound lack of descriptive theories of human behaviour is arguably the single most important problem facing the scientific measurement of psychological systems.

G. Can you please explain this? What is the descriptive, substantive theory in the case of, say, Ohm's law?

Guenter


Stephen Humphry

Oct 29, 2010, 5:02:10 AM
to talking-m...@googlegroups.com

A: The profound lack of descriptive theories of human behaviour is arguably the single most important problem facing the scientific measurement of psychological systems.

G. Can you please explain this? What is the descriptive, substantive theory in the case of, say, Ohm's law?

S: Electromagnetism, including electric charge, voltage, resistors/conductors, the electromotive force. All of the electromagnetic quantities are interrelated by theory. We can describe/state what resistance is in terms of potential difference and electric current. It wasn't until Maxwell that most things were tied together but Ohm had put forward the theoretical and descriptive basis of the law in 1827 (and some a little earlier), drawing upon Fourier's work on heat conduction.

Andrew Kyngdon

Oct 29, 2010, 5:04:49 AM
to talking-m...@googlegroups.com
Easy. How can you measure an attribute of a natural system when you have no theory at all of that system?

Ohm's law is I = V/R, where I is electric current, V is potential difference and R is electrical resistance. Given the object of his studies was electric current, Ohm must have had some idea of Coulomb's work on the electrostatic force of attraction and repulsion. He would have also known theories of electromagnetism of the day, such as Ampere's work done only a few years before him.

Boyle's Law, which relates the pressure and volume of a gas at constant temperature, was discovered in the 1660s. So before accurate instruments of temperature measurement came about in the 19th century, theories concerning the behaviour of gases were already known.

The International System of Units (S.I.) (Bureau International des Poids et Mesures [BIPM], 2006, p. 113) currently defines the second as:

"...the duration of 9 192 631 770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom."

This definition of a unit of time would simply be impossible without strong, descriptive theories of atomic physics.

It's no accident that improvements in physical measurement went hand in glove with developments in physical theory. If I was a physicist I guess I could list more examples.

In psychology we claim to be able to measure various attributes via test scores, but we don't have much in the way of theory that relates test items or their features to the relevant hypothesised psychological quantity. The Lexile Framework is the exception that proves the rule. Almost all of psychometric test construction is atheoretical.

Cheers,

Andrew

Denny Borsboom

Oct 29, 2010, 6:58:15 AM
to talking-m...@googlegroups.com
Hi Andrew,

I haven't espoused any of the nonsense you discuss, agreeing wholly
that the relevant theses are nonsensical. There are certainly
significant differences between the relevant systems, as you correctly
point out. I believe I have in fact written a book about them and I
wouldn't have done that if I thought they were the same.

What I said is that in *psychometric practice* checking double
cancellation *as implied in the ACM interpretation of the Rasch model*
will always involve testing hypotheses on *the parameters of a
probability model*. This says nothing about ACM broadly; all you say
about it is in my view entirely reasonable and I don't object. It
rather says something about how the relevant hypotheses would play out
in a typical psychometric testing situation.

I would agree that insofar as these hypotheses are in fact tested,
they are probably tested poorly. I am also not very enthusiastic about
the performance of fit tests in IRT. Finally, I would agree that the
maximum justification current practice gives is that scales are
(stochastic) orders (this is precisely why I am surprised that things
like adaptive testing, which do not work on orders but on distances,
can work so well even under quite shabby conditions). So, I don't
think that with respect to this particular issue we are in serious
disagreement.

By the way: if there are indeed people who use IRT models as
data-reduction devices, these people must be a little nuts.

Best
Denny

Andrew Kyngdon

Oct 30, 2010, 12:47:32 AM
to talking-m...@googlegroups.com
Hi Denny,

I'm still not sure what it is you are arguing exactly, but if you think that we are not really disagreeing then I guess it doesn't matter.

The primary concern of your book was test validity, or so it seemed to me. You did have one chapter on ACM and representationalism, but it only seemed to argue something that had been argued elsewhere - that the "error problem" militated against the application of anything from abstract measurement theory to psychology. Something tells me though that this wouldn't have been worrying Danny Kahneman when he was walking up to accept his Nobel Economics Prize for prospect theory.

Cheers,

Andrew
