Finally the breakthrough?

Trendler, Guenter

Jun 1, 2011, 4:36:23 PM

to talking-m...@googlegroups.com

In a recently published paper Andrew reports that: "On the basis of these results it can be concluded that the difficulty of reading items, as conceived of in the Lexile theory, and the reading ability of persons are quantitative." (Kyngdon, 2011, p. 11) This contradicts Michell's (2011) recent assessment that: "There is no evidence that the attributes that psychometricians aspire to measure (such as abilities, attitudes and personality traits) are quantitative." (p. 245).

According to the concept of specific objectivity, the measurement values of reading items should remain invariant (within the limits of measurement error) regardless of the test group used to estimate the item parameters. That is, if the data obtained from the 9,909 fourth-grade students from elementary schools in Duval County, North Carolina (p. 8) were randomly split into two (or more) groups, the measurement values of the reading items should be invariant. If Andrew is right, then when the item values from one group are plotted against those from another, the points should lie on a line of slope one through the origin.

It would be great to see the results of such a test.
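A minimal sketch of such a split-sample check, using simulated data only. Everything here is an assumption made for illustration: the data are generated from a dichotomous Rasch model, the item difficulties and person abilities are invented, and the estimator (centered log-odds of an incorrect answer) is a crude stand-in for a proper Rasch calibration, not the method used in the paper.

```python
import math
import random


def simulate_responses(abilities, difficulties, rng):
    """Simulate right/wrong answers with P(correct) = 1 / (1 + exp(-(b - d)))."""
    return [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(b - d))) else 0
         for d in difficulties]
        for b in abilities
    ]


def centered_logit_difficulties(responses):
    """Crude difficulty estimate: log-odds of an incorrect answer, centered at zero."""
    n_persons = len(responses)
    n_items = len(responses[0])
    logits = []
    for j in range(n_items):
        correct = sum(row[j] for row in responses)
        # The 0.5 terms guard against log(0) for items everyone passes or fails.
        logits.append(math.log((n_persons - correct + 0.5) / (correct + 0.5)))
    mean_logit = sum(logits) / n_items
    return [x - mean_logit for x in logits]


def slope_and_intercept(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(2011)
difficulties = [-2.0 + 4.0 * j / 19 for j in range(20)]  # hypothetical items
abilities = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # hypothetical persons
responses = simulate_responses(abilities, difficulties, rng)

# Random split into two groups, then a separate calibration in each group.
rng.shuffle(responses)
half_a, half_b = responses[:2000], responses[2000:]
est_a = centered_logit_difficulties(half_a)
est_b = centered_logit_difficulties(half_b)

slope, intercept = slope_and_intercept(est_a, est_b)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With data that really follow the model, the slope should be close to one and the intercept close to zero. Real data that depart markedly from the identity line would count against invariance.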

GT

Kyngdon, A. (2010) Plausible measurement analogies to some psychometric models of test performance. British Journal of Mathematical and Statistical Psychology.

Michell, J. (2011) Qualitative research meets the ghost of Pythagoras. Theory and Psychology, 21, 2, 241-259.

Andrew Kyngdon

Jun 1, 2011, 8:32:54 PM

to talking-m...@googlegroups.com

Hello Guenter,

I’m not sure about a breakthrough. In the Lexile Framework for Reading, putative measures of text difficulty are calculated directly from continuous prose text, not from reading test scores. Hence it is similar to theories of decision making under risk and uncertainty (e.g., cumulative prospect theory, configural weight theory, rank-sign dependent theory), in which putative measures of utility are calculated directly from the features of risky choices (i.e., outcomes, outcome probabilities and number of outcomes). The idea of deriving measurements from the structural features of stimuli is not new in either utility or psychophysics (indeed, the configural weight utility theories are based on psychophysics models from the early ‘70s), but it is almost unheard of in psychometrics.

What can be done is to estimate so-called “empirical Lexiles” from test score data, then calculate root mean square errors between the “empirical” and “theoretical” Lexiles. This could be done in a similar manner to what you propose below.
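As a rough illustration of the comparison just described, the root mean square error between paired theoretical and empirical values can be computed as below. The four pairs of Lexile values are invented for the example and do not come from any study.

```python
import math


def rmse(theoretical, empirical):
    """Root mean square error between paired theoretical and empirical values."""
    if len(theoretical) != len(empirical) or not theoretical:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(t - e) ** 2 for t, e in zip(theoretical, empirical)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical item difficulties in Lexiles (illustration only).
theoretical_lexiles = [500.0, 700.0, 900.0, 1100.0]
empirical_lexiles = [530.0, 660.0, 910.0, 1120.0]

error = rmse(theoretical_lexiles, empirical_lexiles)
print(f"RMSE = {error:.1f}L")
```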

By the way, in my paper I did say the following:

A comprehensive test of the Lexile theory, however, was not the goal of the current example and therefore any judgment concerning the descriptive adequacy of this theory is premature.

The data used were not obtained from an empirical study specifically devised for the purpose of testing axioms. The example shows that while complex, the testing of an additive conjoint system with real data is not an insurmountably difficult task.

I wrote the first draft of this paper a few years ago now, and it was more of a theoretical note than anything. I submitted it to Psychometrika, where it went out for review, and I was told that applying conjoint measurement to real data posed “insurmountable” difficulties. I had to show in a resubmission that this was not the case, so I used the example of Lexiles and Steve Humphry’s frame of reference Rasch model. I then resubmitted to Psychometrika, only to be told by the editor that the material was not suitable for publication there. One of the reviewers of the first draft did try to tell me that because we are dealing with random Bernoulli variables, cognitive abilities cannot possibly be continuous quantities. Go figure.

You’ll also note that I had to permute the matrix in order for it to be consistent with the cancellation axioms. As I noted, if Lexiles are considered to be genuine measurements (and they have a better claim to this than any other psychometric system I know of), measurements of the difficulty of Lexile reading items have an error of 170L. At the item level I feel that this is a problem. Jack Stenner has the idea that the Lexile item measure is an “ensemble mean” of the distribution of all possible difficulties for that item. So perhaps what needs to be done next is to figure out what causes an item’s difficulty to vary from the ensemble mean and perhaps use that research to develop adjustment factors. But I’m just thinking and speculating out loud here.

The size of the Lexile unit might also be a problem. It works well for books (Stenner et al., 2006) but is too fine for individual differences in reading test performance. Again, just speculating.

By the way, just to touch on Paul Barrett’s earlier post again, when I submitted the paper to the BJMSP, I had two very good reviewers. One of them, however, would not give up his or her demand that I remove Emerson’s (2008) version of the classical/standard definition of measurement from the paper. I ended up having to do this, but I didn’t give up entirely, citing Emerson in the last paragraph.

The paper can be downloaded from my website.

Cheers,

Andrew

Andrew Kyngdon, PhD

MetaMetrics, Inc.

www.lexile.com

My website: https://sites.google.com/site/drandrewkyngdon/home

Measurement Forum: http://groups.google.com/group/talking-measurement

Trendler, Guenter

Jun 2, 2011, 12:02:36 PM

to talking-m...@googlegroups.com

Hi Andrew,

Under which conditions can we safely conclude that an attribute is quantitative? This is what my inquiry aims at. After all, as we have known since Michell, if an attribute is quantitative then it is measurable. Hence, if someone claims that an attribute is quantitative, I also expect that he can measure it.

A crucial criterion for the attainment of measurement is in my view the satisfaction of what Chang (2004) calls “the principle of single value (or, single-valuedness)”. This principle demands that “a real physical property can have no more than one definite value in a given situation.” (p. 90) There can be only one. Measurement values obviously have to satisfy this condition. Chang derives the principle from Brian Ellis's (1968) requirements for a scale of measurement. According to Chang, Ellis stipulates that “we have a scale of measurement only if we have a rule for making numerical assignments that is 'determinative in the sense that, provided sufficient care is exercised the same numerals (or range of numerals) would always be assigned to the same things under the same conditions'.” (p. 90)

However, and from whichever sample, measurement values are derived, they must conform to the principle, within the limits of random error of course. I think the graphical test I suggested is the easiest way to check whether it is violated. Hence, unless it is convincingly demonstrated that “measurement values” are invariant, the claim that quantification has been attained is premature. Show me that item parameters are invariant and I will believe you.

Regards,
Guenter

Chang, H, Inventing Temperature, Oxford, 2004.

Andrew Kyngdon

Jun 2, 2011, 8:36:13 PM

to talking-m...@googlegroups.com

Guenter,

G. "Under which conditions can we safely conclude that an attribute is quantitative?"

The adjective "safely" is the key word in your question. In order to "safely" conclude that an attribute is quantitative, we must be confident in our knowledge of the attribute. It can take decades or even centuries of thought and observation for an attribute to be "safely" considered measurable. The history of mass and weight comes to mind, as does that of speed and velocity. It literally took centuries for the difference between mass and weight to be understood (i.e., that the former is a property of an object and the latter a force). For most of our history we humans thought we were measuring mass, when we were actually measuring the product of an object's mass and its acceleration due to gravity.

The history of science suggests that any "deadline" for the "safe" measurement of an attribute just does not exist.

Andrew

Jack Stenner

Jun 3, 2011, 2:04:39 PM

to talking-m...@googlegroups.com

Hello Guenter:

Yes, I believe that the Lexile Framework for Reading represents a direct and specific refutation of Michell’s (2011) assessment quoted below. But it is far more damaging to his thesis than might appear at first blush. If text complexity and reading ability are both quantitative attributes (as measured) of text and readers respectively, then one of the most complex of human abilities (Hruby and Goswami, 2011) will have been found to be quantitative. What are the odds that English reading ability is somehow unique in the pantheon of human science constructs? I think we will find that other human science attributes can be measured on an interval scale, i.e. are quantitative. But garden variety Rasch applications are not the way to discover these attributes nor to test the quantitivity hypothesis.

I find it useful to distinguish between descriptive and causal Rasch models. The former regresses a measurement outcome (count correct) on the exponentiated difference between a person ability and a measurement mechanism. A causal Rasch model takes the same statistical form as a descriptive model but requires knowledge about how to independently intervene on the person ability and measurement mechanism to produce predictable changes in the measurement outcome. The trade-off property is satisfied when an experimental manipulation on the person ability can be off-set (traded off) by an experimental manipulation on the measurement mechanism to hold the measurement outcome constant. (See Stenner and Stone, JAM, 2010, for an analogy to temperature measurement.) Concretely, a change of 100L in reader ability (caused by reading 500,000 words) is off-set by an increase of 100L in text complexity to hold the success rate (relative raw score) constant at, say, 75%. Such trade-offs are sustainable only when the attributes are measured on an interval scale, i.e. the measurement process produces quantitative data. Ordinal data will not, in general, support such trade-offs.
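The trade-off property can be illustrated with a small sketch of a Rasch-type success function expressed in Lexile units. Two constants are assumptions introduced only for this example and are not stated anywhere in the thread: a scale of 180L per logit, and an offset of ln 3 chosen so that a reader matched to a text succeeds 75% of the time.

```python
import math

# Both constants are assumptions for illustration, not values from the thread.
LEXILES_PER_LOGIT = 180.0
MATCH_OFFSET_LOGITS = math.log(3.0)  # makes the success rate 0.75 when reader == text


def success_rate(reader_lexile, text_lexile):
    """Expected proportion correct under a logistic (Rasch-type) link."""
    logit = (reader_lexile - text_lexile) / LEXILES_PER_LOGIT + MATCH_OFFSET_LOGITS
    return 1.0 / (1.0 + math.exp(-logit))


baseline = success_rate(1000.0, 1000.0)        # matched reader and text
reader_only = success_rate(1100.0, 1000.0)     # reader grows by 100L, text unchanged
traded_off = success_rate(1100.0, 1100.0)      # text raised by 100L as well

print(f"baseline    = {baseline:.3f}")
print(f"reader only = {reader_only:.3f}")
print(f"traded off  = {traded_off:.3f}")
```

Because only the difference between reader and text enters the function, equal increments on both sides cancel exactly. That cancellation is what an interval scale buys; a merely ordinal relabelling of either scale would generally break it.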

Note that there need be no reference to quantitivity, no additive conjoint measurement, no reference to right leaning diagonals, and no reference to Hölder’s axioms. What we have done is “aped” eighteenth and nineteenth century physicists and their earnest attempts to demonstrate that differences are meaningful and useful for predictive inference and that trade-offs amongst attributes of unlike kind (text complexity and reader ability) are similarly useful. Every time an investigator checks to see if a difference of like amount results in a highly similar prediction on the measurement outcome, wherever along the scale that difference is taken, a test for “quantitivity” has been made. Every time an investigator checks for unit invariance a test for quantitivity has been made. Every time an individual centered growth trajectory is shown to be generally objective (independent of items and measurements on other persons) a test for quantitivity has been made.

I second Andrew’s call for more focus on substantive theory. Let’s get about the very difficult task of justifying causal interpretations of the regressions of measurement outcomes on person attributes and measurement mechanisms. And let’s stop pretending that descriptive Rasch models are a low cost substitute for the hard experimental work required to justify causal inference.

The attached plot is for 475 articles. The theoretical text complexity (coming from the Lexile Analyzer) is plotted against the empirical text complexity (based on student responses to at least 1000 unique items per passage). I call your attention to the stunning measurement precision of the empirical text complexity measures and the attendant small standard error of measurement (12L). The reported reliability is the correlation between empirical text complexity measures taken on two independent samples of readers responding to independent samples of machine generated four choice MC items.

Jack Stenner

Chairman & CEO

MetaMetrics, Inc.

Developer of the Lexile and Quantile Frameworks

1000 Park Forty Plaza Drive, Suite 120

Durham, NC 27713

Tel: 919-547-3402 Fax: 919-547-3401

jste...@Lexile.com

Web: www.MetaMetricsinc.com | www.Lexile.com | www.Quantiles.com

image001.emz

Paul Barrett

Jun 3, 2011, 6:25:36 PM

to talking-m...@googlegroups.com

Jack

“What are the odds that English reading ability is somehow unique in the pantheon of human science constructs? ”

It could be higher than you imagine.

Why? Because reading ability is a performance-based measure. It has a technical, very specific definition, and a clear measurable outcome. That is what allowed you to manipulate both attribute and stimulus in order to experimentally demonstrate linearity.

You might be able to do this with something like “working memory”, or other very specific performance attributes with rather precise stimuli available for manipulation (words), and clear “countable” outcomes.

But, what about attributes like personality, temperament, motivation, attitudes, goals, values, emotions?

In my view, you never needed to invoke the Rasch or any kind of stochastic model. For me, these kinds of models impede a measurement hypothesis, not help test it. I could never understand why you ever bothered with IRT-type models, once Andrew had explained the kind of experimental data you were able to collect from the outset.

All you needed was a theory (which you had), some experiments (which you were able to perform), and the establishment of those observations you presented in that graph. Then you fit the theory-suggested deterministic mathematical function. You might even have hypothesised the function in advance, and tested it against the incoming data (which you probably did!). As a by-product of this exercise you would have specified your standard unit, and voila.

As you say, it’s a blueprint for how to go about observing relations between magnitudes experimentally, then fitting the deterministic theory-expected mathematical function.

Re: the lexile graph. The measurement resolution is poor for any individual, but “good enough” to show the underlying function is “near as dammit” linear, and probably “good enough” for practical purposes, e.g. does a potential error of + or - 100L matter to a child who is, say, 700L?

The 95% prediction interval for the correlation reported within the data you present (assuming a 120L standard deviation) is about ±100L, or about ±50L with a 75% interval. I obtained this via simulation (n=475) using generated bivariate normal data (which I know is not quite correct for these data, as your tails are too heavy for normally distributed data).
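A simulation of this kind might look roughly like the sketch below. The correlation is not stated in the thread, so the value 0.90 is purely a placeholder assumption, and the resulting half-widths should not be read as a reproduction of the figures above. The 120L standard deviation and n=475 are taken from the post. The interval uses the normal approximation and ignores the small extra term for uncertainty in the fitted line.

```python
import math
import random
from statistics import NormalDist

ASSUMED_R = 0.90  # placeholder assumption; the thread does not give the correlation


def simulated_half_width(n, sd, r, coverage, rng):
    """Half-width of an approximate prediction interval from simulated bivariate normal data."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [sd * a for a in z1]
    # y shares correlation r with x and has the same standard deviation.
    y = [sd * (r * a + math.sqrt(1.0 - r * r) * b) for a, b in zip(z1, z2)]

    # Ordinary least-squares fit of y on x.
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual standard deviation with n - 2 degrees of freedom.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    resid_sd = math.sqrt(rss / (n - 2))

    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * resid_sd


rng = random.Random(475)
half_95 = simulated_half_width(475, 120.0, ASSUMED_R, 0.95, rng)
half_75 = simulated_half_width(475, 120.0, ASSUMED_R, 0.75, rng)
print(f"95% half-width = {half_95:.0f}L, 75% half-width = {half_75:.0f}L")
```

Heavier-than-normal tails, as noted above, would make the true intervals wider than this normal-theory sketch suggests.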

You have such simple data – showing a linear function overlaid by what looks to be random noise.

That “noise” is a very interesting issue in its own right (i.e. what’s causing it).

I’d now want to know for those individuals not “close” to the expected lexile value in a “one-shot” assessment session whether their observed lexile score randomly varies around the expected theoretical value, or is persistently wrong (over multiple test sessions using different exemplars of Lexile stimuli). That would be a further substantive test of a fixed “quantitative” relation holding for all.

Anyway, none of this detracts from your achievement here.

I presume you have exactly the same experimental data and graph you could show us for the quantile?

Regards .. Paul

Advanced Projects R&D Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: pa...@pbarrett.net

M: +64-(0)21-415625

Paul Barrett

Jun 5, 2011, 5:12:47 PM

to talking-m...@googlegroups.com

OK – having read the excellent:

Stenner, A.J. (1996) Measuring reading comprehension with the lexile framework. Paper presented at the California Comparability Symposium, Burlingame, CA, October 31. (http://d44mopotwxgz0.cloudfront.net/m/resources/materials/Stenner_Measuring_Reading_Comprehension.pdf)

Things are a lot clearer.

Word frequency and sentence length are the two counts which are predicting word comprehension and text-passage difficulty.

Given: Carroll, J. B., Davies, P. and Richman, B. (1971). Word frequency book. Boston: Houghton Mifflin.

and other references regarding sentence length and memory load,

So, try this … (again, I hasten to add I’m not trying to downplay any of this really clever and insightful work, but just querying why so much effort was invested in probabilistic Rasch IRT rather than in more straightforward deterministic measurement methods).

We have the capability to begin experimentation, manipulating two stimuli independently from one another - word frequency (WF) and sentence length (SL), then determining the expected causal relationship and the likely functional relation between WF and SL.

None of this required the Rasch model. Just careful observation and deterministic equation fit (which would have yielded discrepancy errors from the expected vs observed outcomes).

This also gets around the issue of sample-bound calibrations … you just don’t need them at all because you are working with an absolute difficulty scale provided by WF and/or SL, or a joint-functional-relation of both.

The standard unit is whatever you choose it to be, given that it permits computation of ratios of magnitudes of comprehension difficulty and observed comprehension.

There is a theoretical true zero to this scale of comprehension … 0 word frequency – the word which is never used. The theoretical maximum is the word which shows 100% frequency in all length passages or texts within a language (again, this depends upon how WF and SL are combined, if at all, into a single “scale” with a unit).

You would need many hundreds of experiments to provide the necessary observations to confirm the quantitative relationship over the range of WF and SL.

Which brings me to a second very interesting but rhetorical point.

Who owns the cm, the inch, the kg, the volt?

MetaMetrics owns the lexile.

Is this a model of how measurement in psychology will proceed, via private corporation and then sold under license?

Maybe – because if university/government-funded researchers will not invest the time and effort into developing quantitative measures, privately funded scholars/scientists are all that’s left to do “the business”.

Jack Stenner

Jun 6, 2011, 3:25:42 PM

to talking-m...@googlegroups.com

You raise some very insightful points regarding the sociology of science underneath what we are trying to do. Our approach to commercializing the Lexile Framework by opening it up to license by now hundreds of companies, states and countries has worked very well. The playing field is level and all those who desire can pay a modest royalty and use the technology as they see fit. Educators world-wide have a free license to use the Lexile Analyzer and report measures for all non-commercial purposes. We do not favor one licensee over any other. You correctly point out that there is no governmental agency that would have fronted 100 million U.S. dollars to establish the first standard open source human science metric because no governmental agency has the staying power or sustained vision to pull it off. Had we failed I would have paid a dear price. I am to my knowledge the only NIH funded investigator to be barred from publication in all of the International Reading Association’s journals. The ban, now ending its first decade, was confirmed last week to be “still in place” by the head of IRA’s publication committee. The perverse consequence of this ban is that very few studies using Lexiles have been published in IRA journals. All because I have a “significant” financial interest in the Lexile Framework. When the committee was asked about statistical software developers (AMOS, SAS, HLM) they responded that such financial entanglements were not “as consequential” as Lexiles. My good friend William Fisher is trying to help me understand academia’s perspective. If there is a better route to widespread adoption of uniform metrics I am open to listening.

More and more I am of the opinion that causal Rasch models are the royal road to discovering psychological three variable (Ternary) laws. I don’t see how we escape the need for a link function that connects measurement outcomes to attribute measures and measurement mechanisms. Why not use the logit link function and get raw score sufficiency as a consequence of that choice? If you have puzzled about why one should care about raw score sufficiency, consider that machine generated items sampled from an ensemble can yield measures only if it does not matter which sampled item calibration is attached to which right-wrong response. This reality is only possible when there is no information about the parameter of interest in the way we assign item calibrations to the pattern of right-wrong responses. Don Burdick gets credit for this insight. So, it is possible to convert counts correct into measures when there is no one-to-one correspondence between an item calibration and a right-wrong response. Only Rasch models support this little bit of magic.
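Raw score sufficiency can be checked numerically in a small case. The sketch below, with three invented item calibrations, computes the probability of one particular right-wrong pattern given its raw score and shows that the result is the same for a weak and a strong reader. It illustrates the general sufficiency property of the dichotomous Rasch model under these assumed values, not the specific ensemble-sampling argument credited to Don Burdick.

```python
import math
from itertools import product


def p_correct(ability, difficulty):
    """Dichotomous Rasch model with a logit link."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def pattern_probability(pattern, ability, difficulties):
    """Probability of a full right-wrong pattern, assuming local independence."""
    prob = 1.0
    for response, difficulty in zip(pattern, difficulties):
        p = p_correct(ability, difficulty)
        prob *= p if response else 1.0 - p
    return prob


def probability_given_raw_score(pattern, ability, difficulties):
    """P(pattern | raw score): the pattern's share among all patterns with the same count correct."""
    score = sum(pattern)
    same_score_total = sum(
        pattern_probability(other, ability, difficulties)
        for other in product((0, 1), repeat=len(difficulties))
        if sum(other) == score
    )
    return pattern_probability(pattern, ability, difficulties) / same_score_total


difficulties = [-1.0, 0.0, 1.5]  # hypothetical item calibrations in logits
pattern = (1, 0, 1)              # right, wrong, right: raw score of 2

weak_reader = probability_given_raw_score(pattern, -2.0, difficulties)
strong_reader = probability_given_raw_score(pattern, 2.0, difficulties)
print(f"weak reader:   {weak_reader:.6f}")
print(f"strong reader: {strong_reader:.6f}")
```

Once the raw score is known, the pattern carries no further information about ability: the ability terms cancel in the ratio, which is why the two printed values agree.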

Note that in the NexTemp example (Stenner and Stone, JAM, 2010) the Guttman model provides the link function. When we have precise control over the measurement mechanism then we can engineer a Guttman pattern (Guttman pattern violations are infrequent enough – but always present – that we turn our heads and ignore the violations) but our understanding in the reading context is such that we need a stochastic model and Rasch is the best available stochastic model.

Paul, your comment about sample bound item calibrations is also very important. I believe that the solution to sample dependence of measurement mechanism calibrations lies not in the fit of data to the Rasch model but rather in completely divorcing item calibrations from data. Physical science measurement solves this dilemma by not using data in any way to estimate item calibrations. Rather, let’s use substantive theory to calibrate instruments and, thus, completely separate instrument calibration from data. This to me is the solution that yields complete realization of sample free item/measurement calibration. Please see the attached graphic which describes a developing reader’s growth trajectory. Each of =4000 items the student took was generated on the fly by the computer. Text complexity theory is used to provide the instrument calibrations needed to convert counts correct into Lexile reader measures. Note that the measurement precision approaches what we can do with yardsticks in measuring student height.

You are precisely correct regarding the choice of a unit. It takes a careful reading of the philosophy and history of science to appreciate the complete capriciousness of how the units we take for granted today were originally arrived at. For the human sciences, fractions of a logit with meaningful anchors will suffice for now.

This is a real student

He read a Harry Potter-length novel’s worth of informational text - 140,000 words over 30 months.

This is a completely individual centered picture – no data on any other student is needed to make this graphic

Note the college and career context. This is a text based description of our K-12 objective in reading.

The growth trajectory shows good fit

The projected HS graduation Lexile measure is about 1400L.

The monthly fit statistics (expected – observed) are within acceptable bounds.

Jack Stenner

image001.emz

Trendler, Guenter

Jun 6, 2011, 5:53:59 PM

to talking-m...@googlegroups.com

Hi Jack,

I'm all with you about "aping" 19th century science. However, if an item is measurable, its measurement value should vary over repeated measurements only within the limits of random error.

As Rasch (1980) states:

"Apparently the physical case distinguishes itself from the others in that the multiplicative law holds for directly observable quantities, the accelerations, in contrast to the more evasive "parameters in a model" on which we had to rely in the psychological examples. In the latter cases, except for reading time, such a law could not even easily be imagined to hold for the observations themselves, nor for any transformation of them, since both number of misreadings and number of words read are discontinuous variables, not to speak of the answers to items of the intelligence tests, being not at all quantitative. (/) In principle, however, there is hardly any difference. In fact, the acceleration of a body cannot be determined; the observation of it is admittedly liable to, at any rate, so called "errors of measurement", but in the last analysis this admittance is paramount to defining the acceleration per se as a parameter in a 'probability distribution - e.g. the mean value of a Gaussian distribution - and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law. (…) Thus, by way of an example, the reading accuracy of a child – (…) - can be measured with the same kind of objectivity as we tell its weight - though not with the same degree of precision, to be sure, but that is a different matter." (Probabilistic models for some intelligence and attainment tests, p. 115)

Therefore my question: how stable are your measurement values for item difficulty (e.g. text complexity)? Let's say you split the sample into 5 groups and determine item difficulty for item A, B etc. The results could look something like this:

Item A   Item B
1.53     1.63
1.49     1.64
1.51     1.59
1.55     1.55
1.50     1.60   etc.

Do they? How stable are your measurement values for item difficulty (e.g. text complexity measures) over repeated measurements?
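To make the question concrete, the stability of such repeated estimates could be summarised by the mean and standard deviation of each item across the subsamples. The sketch below uses only the illustrative numbers from the table above, which are made-up values and not real Lexile data.

```python
from statistics import mean, stdev

# Illustrative values from the example table (five subsamples per item).
estimates = {
    "Item A": [1.53, 1.49, 1.51, 1.55, 1.50],
    "Item B": [1.63, 1.64, 1.59, 1.55, 1.60],
}

summary = {}
for item, values in estimates.items():
    # The spread across subsamples is the quantity of interest for invariance.
    summary[item] = (mean(values), stdev(values))
    print(f"{item}: mean = {summary[item][0]:.3f}, sd across groups = {summary[item][1]:.3f}")

# The gap between items compared with the within-item spread.
gap = summary["Item B"][0] - summary["Item A"][0]
print(f"difference between item means = {gap:.3f}")
```

If the spread across groups were no larger than the expected random error, and small relative to the differences between items, that would be consistent with invariant item values.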

Regards
Guenter

-----Ursprüngliche Nachricht-----
Von: talking-m...@googlegroups.com im Auftrag von Jack Stenner
Gesendet: Fr 03.06.2011 20:04
An: talking-m...@googlegroups.com
Betreff: [talking-measurement] RE: Finally the breakthrough?

Hello Guenter:
Yes, I believe that the Lexile Framework for Reading represents a direct and specific refutation of Michell's (2011) assessment quoted below. But, it is far more damaging to his thesis than might appear at first blush. If text complexity and reading ability are both quantitative attributes (as measured) of text and readers respectively, then one of the most complex of human abilities (Haruby and Goswami, 2011) will have been found to be quantitative. What are the odds that English reading ability is somehow unique in the pantheon of human science constructs? I think we will find that other human science attributes can be measured on an interval scale, i.e. are quantitative. But, garden variety Rasch applications are not the way to discover these attributes nor to test the quantitivity hypothesis.

I find it useful to distinguish between descriptive and causal Rasch models. The former regresses a measurement outcome (count correct) on the exponentiated difference between a person ability and a measurement mechanism. A causal Rasch model takes the same statistical form as a descriptive model but requires knowledge about how to independently intervene on the person ability and measurement mechanism to produce predictable changes in the measurement outcome. The trade-off property is satisfied when an experimental manipulation on the person ability can be off-set (traded off) for an experimental manipulation on the measurement mechanism to hold the measurement outcome constant. (See JAM Stenner and Stone (2010) for an analogy to temperature measurement.) Concretely, a change of 100L in reader ability (caused by reading 500,000 words) is off-set by an increase of 100L in text complexity to hold the success rate (relative raw score) constant at, say, 75%. Such trade-offs are sustainable only when the attributes are measured on an interval scale, i.e. the measurement process produces quantitative data. Ordinal data will not, in general, support such trade-offs.

Note that there need be no reference to quantitivity, no additive conjoint measurement, no reference to right leaning diagonals, and no reference to Holder's axioms. What we have done is "aped" eighteenth and nineteenth century physicists and their earnest attempts to demonstrate that differences are meaningful and useful for predictive inference and that trade-offs amongst attributes of unlike kind (text complexity and reader ability) are similarly useful. Every time an investigator checks to see if a difference of like amount results in a highly similar prediction on the measurement outcome, wherever along the scale that difference is taken, a test for "quantitivity" has been made. Every time an investigator checks for unit invariance a test for quantitivity has been made. Every time an individual centered growth trajectory is shown to be generally objective (independent of items and measurements on other persons) a test for quantitivity has been made.

I second Andrews call for more focus on substantive theory. Let's get about the very difficult task of justifying causal interpretations of the regressions of measurement outcomes on person attributes and measurement mechanisms. And let's stop pretending that descriptive Rasch models are a low cost substitute for the hard experimental work required to justify causal inference.

The attached plot is for 475 articles. The theoretical text complexity (coming from the Lexile Analyzer) is plotted against the empirical text complexity (based on student responses to at least 1000 unique items per passage). I call your attention to the stunning measurement precision of the empirical text complexity measures and the attendant small standard error of measurement (12L). The reported reliability is the correlation between empirical text complexity measures taken on two independent samples of readers responding to independent samples of machine generated four choice MC items.

[Attached plot: theoretical vs. empirical text complexity for 475 articles]

Jack Stenner
Chairman & CEO
MetaMetrics, Inc.
Developer of the Lexile and Quantile Frameworks
1000 Park Forty Plaza Drive, Suite 120
Durham, NC 27713
Tel: 919-547-3402 Fax: 919-547-3401

jste...@Lexile.com
Web: www.MetaMetricsinc.com | www.Lexile.com | www.Quantiles.com


Paul Barrett

Jun 6, 2011, 6:06:18 PM6/6/11

to talking-m...@googlegroups.com

Hello Jack, thanks for the reply ...

Your first paragraph is a real eye-opener! And ...

“If there is a better route to widespread adoption of uniform metrics I am open to listening”

I don’t see any routes opening up until something rather fundamental changes in the way psychological associations world-wide approach the entire “measurement” issue. Something of the order of an international metrics laboratory would need to be set up and funded by the profession of psychologists worldwide, via levy from members worldwide. Right now I can’t see the APA or APS showing any enthusiasm to even discuss such a matter. One only has to be on the various “quantitative” listservs like SEMNET, Psych-Methods, and APA DIV5 to realize there is no interest at all in anything that deviates from conventional statistically-oriented analyses of data.

More and more I am of the opinion that causal Rasch models are the royal road to discovering psychological three variable (Ternary) laws. I don’t see how we escape the need for a link function that connects measurement outcomes to attribute measures and measurement mechanisms.

OK - say we propose word frequency as being the “measure” of semantic comprehension (with sentence length lurking as a “modifying” factor – I’m just not clear on this).

The unit is arbitrary. We call an integer frequency a “lexon”.

Now we put together “stimuli” of a fixed frequency value (say mid-frequency-range words) which test for the number of words “understood” correctly by several groups of individuals.

They will obtain “numbers of correct responses” from 0 to 100%. Those who score 100% possess comprehension at or above the fixed-frequency value.

We have assigned lexon scores to all those below 100%, using number of words identified correctly as their initial estimates.

Now we do the same again with a fixed-frequency of words say 100 lexons less than our initial study, and 100 lexons more, on the same individuals.

Because we assume word-frequency and human comprehension are linearly related, we can make predictions of expected correct word-identifications for all individuals who have been assigned their lexon values.

And we can do this over a range of lexon-rated stimuli, and many samples, so as to show that the lexon unit is linearly related (or linear after transformation) to comprehension regardless of which lexon value of stimulus is presented to any individual.

Where all this breaks down is if word frequency alone is not causal for comprehension. But we discover that in our first series of experiments. Sentence length becomes a “prime suspect”. But now we have to consider how sentence length might be “conjoined” with word frequency in order to account for the errors we see in our predictions.

That needs a whole new line of experimentation – but the same approach, deterministic with the acquisition of measurement error as error, and not as part of a model-fit exercise.

One issue here is that word-frequency (WF) and sentence length (SL) can only ever be resolved to integer counts. If the relation between whatever derived measure of WF+SL and correct word recognition is linear, then ratios between magnitudes of WF+SL would predict empirical values of word-recognition. But, interestingly, the measurement of word-recognition can only ever be discrete (unless we assume recognition is itself continuous-valued, but a kind of “voting” process takes place in a network within the brain where a word ends up either being recognized or not).

Anyway, this approach, like the Rasch, is based entirely upon theory and empirical observations. But, whereas the Rasch constructs a link function via probabilities and patterns of error (i.e. it won’t fit unless there is error in the responses ... as per Michell, J. (2004) Item Response Models, pathological science, and the shape of error. Theory and Psychology, 14, 1, 121-129.), here we let theory and data provide evidence of that link function directly.

But, I am just armchair musing, and it could be simple-minded musing ... because, as you say .. “our understanding in the reading context is such that we need a stochastic model and Rasch is the best available stochastic model.”

However, the need for a stochastic model suggests reading comprehension may not be quantitative after all

or

that it is quantitative but the functional relation is being masked by other independent disturbances

or

that the functional relation is inherently “fuzzily-linear”, irresolvable beyond a certain degree of error, which is a fundamental property of the system under examination.

Struggling through my probably feeble reasoning, I think you can see why I feel stochastic models are not the way to proceed to establish “quantities”.

I think we need to a priori define what we think are the rules for the instantiation of measurement, then set out to confirm them, not attempt to “discover, then define” them statistically.

Paul Barrett

Jun 6, 2011, 6:36:15 PM6/6/11

to talking-m...@googlegroups.com

Hello again Jack (or Andrew)

The relationship over time between comprehension errors and the lexile scores is interesting. ~30% performance over that expected from a lexile-rated text is reflected in a 50-lexile discrepancy. But then, so is a 5-9% under-performance on occasion.

Jack, do you have many such students’ records like this (hundreds/thousands)?

If so, has your R&D team implemented a feature analysis looking for the relationship-over-time patterns/noise features within data such as these?

Sorry to pester you so much, but this is right at the heart of what this list is all about (for me anyway!).

Regards .. Paul

Andrew Kyngdon

Jun 6, 2011, 8:52:06 PM6/6/11

to talking-m...@googlegroups.com

Paul (and everyone),

I must say I was not aware of this:

I am to my knowledge the only NIH funded investigator to be barred from publication in all International Reading Associations journals. The ban, now ending its first decade, was confirmed last week to be “still in place” by the head of IRA’s publication committee. The perverse consequence of this ban is that very few studies using Lexiles have been published in IRA journals. All because I have a “significant” financial interest in the Lexile Framework. When the committee was asked about statistical software developers (AMOS, SAS, HLM) they responded that such financial entanglements were not “as consequential” as Lexiles.

The logic behind this just beggars belief. Completely and utterly non-scientific. Don’t they realise that publishing articles on the Lexile Framework for Reading will open it up to critical scrutiny? Or do they know this, but fear that Lexiles might survive such criticism?

Unbelievable.


Jack Stenner

Jun 7, 2011, 12:58:58 PM6/7/11

to talking-m...@googlegroups.com

Hello Guenter,

The slide I sent with the first post is a plot of the empirical complexity and the theoretical complexity. The reliability of the empirical complexities for 475 articles is .996 and the associated standard error is 12L. To give a frame of reference for twelve Lexiles consider that a typical fourth grader grows 8L in a month and that a 12L difference would move your comprehension rate for an article matched to your reading ability from 75% to 76.2%. Thanks to the trade off property this increase in comprehension could be realized by either increasing your reading ability by 12L or by lowering the article text complexity by 12L. From either perspective the precision is quite high. If we increased the minimum number of readers encountering an article from 50 to 500 the standard errors would of course be even smaller. Best Jack
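The figures in this post can be checked with a few lines of arithmetic. The sketch below assumes a logistic link with 180L per logit and an offset of 1.1 (both taken from the equations Andrew quotes in this thread). For the second part it assumes the classical test theory relation SEM = SD × sqrt(1 − reliability). Neither assumption is stated in the post itself, and the variable names are hypothetical.

```python
import math

LEXILES_PER_LOGIT = 180.0  # assumption: from the Lexile transformation quoted in the thread
OFFSET_LOGITS = 1.1        # assumption: offset giving roughly 75% for a matched reader


def comprehension_rate(reader_minus_text_l: float) -> float:
    """Expected comprehension rate given (reader ability - text complexity) in Lexiles."""
    logit = reader_minus_text_l / LEXILES_PER_LOGIT + OFFSET_LOGITS
    return 1.0 / (1.0 + math.exp(-logit))


matched = comprehension_rate(0.0)     # reader matched to article
after_12l = comprehension_rate(12.0)  # reader 12L above the article (or article 12L easier)

# Classical test theory (an assumption here): SEM = SD * sqrt(1 - reliability),
# so the spread of article measures implied by the quoted figures is:
reliability = 0.996
sem_l = 12.0
implied_sd_l = sem_l / math.sqrt(1.0 - reliability)

print(f"matched: {matched:.4f}, after 12L shift: {after_12l:.4f}")
print(f"implied SD of empirical article measures: {implied_sd_l:.1f}L")
```

Under these assumptions the matched rate comes out near 75% and the 12L shift near 76%, in line with the figures quoted. The implied spread of article measures is roughly 190L.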

-----Original Message-----
From: talking-m...@googlegroups.com [mailto:talking-m...@googlegroups.com] On Behalf Of Trendler, Guenter
Sent: Monday, June 06, 2011 5:54 PM
To: talking-m...@googlegroups.com
Subject: AW: [talking-measurement] RE: Finally the breakthrough? #1

Hi Jack,

I'm all with you about "aping" 19th century science. However, if an item is measurable its measurement value should vary over repeated measurements only within the limits of random error.

As Rasch (1980) states:

"Apparently the physical case distinguishes itself from the others in that the multiplicative law holds for directly observable quantities, the accelerations, in contrast to the more evasive "parameters in a model" on which we had to rely in the psychological examples. In the latter cases, except for reading time, such a law could not even easily be imagined to hold for the observations themselves, nor for any transformation of them, since both number of misreadings and number of words read are discontinuous variables, not to speak of the answers to items of the intelligence tests, being not at all quantitative. (/) In principle, however, there is hardly any difference. In fact, the acceleration of a body cannot be determined; the observation of it is admittedly liable to, at any rate, so called "errors of measurement", but in the last analysis this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution - e.g. the mean value of a Gaussian distribution - and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law. (.) Thus, by way of an example, the reading accuracy of a child - (.) - can be measured with the same kind of objectivity as we tell its weight - though not with the same degree of precision, to be sure, but that is a different matter." (Probabilistic models for some intelligence and attainment tests, p. 115)

Jack Stenner

Jun 8, 2011, 11:29:00 AM6/8/11

to talking-m...@googlegroups.com

Thanks Paul,

I apologize for running out of gas on the last post and not supplying more than a cryptic description of the attached graphic. It is among my favorite pictures. First, yes, we have the data to make this picture for 1800 students. Second, the spike you see for February 2010 is based on only 3 article encounters out of the 347 across the 29 months of use. No use in the summer months for this student. So this spike could be based on only a few responses on which he was not paying his usual attention. The left axis is a measure of theory fit: how well does the student’s response pattern of rights and wrongs coincide with what the theory says should happen, given the Bayesian updated reader ability going into each article encounter and the theoretical text complexity measure coming out of the Lexile Analyzer? The expected-observed percents in the upper left hand corner are a summary of the control chart’s month-to-month theory-observed fit.

Because the instrument calibrations all come from theory and no other students’ data informed this picture in any way the growth parameters are generally objective and the measurement framework is wholly individual centered. It is also clear that this picture describes an individual centered test of quantitivity because almost 400 predictions on the count correct (measurement outcome) are based on differencing student ability and text complexity. Could we on average obtain this correspondence if text complexity and reader ability were attributes (as measured) merely ordinal?

I would be happy to share data like these if you see some novel ways to look at the noise over time. Thanks again for your interest. Best Jack

Paul Barrett

Jun 8, 2011, 10:07:40 PM6/8/11

to talking-m...@googlegroups.com

Hello Jack

This is what it all comes down to:

Because the instrument calibrations all come from theory and no other students’ data informed this picture in any way the growth parameters are generally objective and the measurement framework is wholly individual centered. It is also clear that this picture describes an individual centered test of quantitivity because almost 400 predictions on the count correct (measurement outcome) are based on differencing student ability and text complexity. Could we on average obtain this correspondence if text complexity and reader ability were attributes (as measured) merely ordinal?

The key phrase here, for me, is “on average”.

I’ve thought long and hard about this issue – and I’m still really uneasy for reasons I’m not sure are valid. But, let me state them here –

This is how I see things from the graphs presented, stated as simply and briefly as I can make the points - and my “seeing” may need correction:

I’m happy for others to answer these points if they feel I am mistaken – because I feel bad pestering you like this but I need to figure out why I am so uneasy about stating that “reading ability is a quantitative variable”.

The entire measurement model is statistical, not deterministic. It cannot account for any individual’s performance to better than about + or - 100 Lexiles. But it can account for the average of such individuals’ performance to about 12 lexiles, which is pretty damn accurate given a 1500 possible lexile range. This is the verbal analogue of the little robot graphics: http://mindhacks.com/2010/06/23/the-scientific-method-lego-robots-edition/ Looking at any one robot’s trace, you wouldn’t necessarily call learning “an additive concatenation of learning units” – yet, when averaged over a group, the expected function is present. So, whatever is being measured can only be seen as a statistical aggregate, which does not apply to any single individual except “on average”. Does this constitute quantitative measurement? All I can say right now is that it does not, IF we expect such measurement to apply to all objects said to possess a quantitative attribute.
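The averaging point can be illustrated with a deliberately artificial sketch. It does not use the robot data from the linked post, and the learners and trial counts are hypothetical. Each simulated learner switches from "cannot" to "can" in a single step, yet the group mean rises as a smooth ramp.

```python
def step_learner(jump_trial: int, n_trials: int) -> list:
    """One learner's trace: 0.0 before the jump trial, 1.0 from the jump trial onward."""
    return [1.0 if t >= jump_trial else 0.0 for t in range(n_trials)]


n_trials = 10
# Hypothetical group: nine learners whose jumps occur at trials 1 through 9.
learners = [step_learner(j, n_trials) for j in range(1, n_trials)]

# Average performance across learners at each trial.
group_mean = [
    sum(trace[t] for trace in learners) / len(learners)
    for t in range(n_trials)
]

print("one learner:", learners[4])
print("group mean: ", [round(v, 2) for v in group_mean])
```

No individual trace here shows gradual, additive growth, although the aggregate does. Whether real readers behave like this is exactly the open question, and the sketch only shows that a linear aggregate does not by itself settle it.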

However, you (Jack) might reasonably claim that the “objects” are inherently noisy, hence the underlying linear relation will always be masked by error. The problem with this is that unless we can show that error is uniform/normal random around an expected performance lexile for a single individual (the repeated estimation of an empirical lexile from multiple same fixed-lexile text stimuli) for all such individuals tested this way, those aggregate errors might be masking systematic individual errors which would cast doubt on the lexile (aka reading ability) as a quantitative variable.

If the lexile is simply unitized word frequency, then we are fitting a function between word frequency value and the capability of an individual to read words of a specific frequency. We explore the functional relation between the two, acquiring confirmatory data of expected vs actual performance for many instances of word frequency and performance of people who can read words above and below that frequency. i.e. word frequency is unitized as lexons (with a precision equal to the minimum divisible frequency spanned by the most frequent and least frequent words in any vocabulary). So, people who can read words with 100% accuracy of a particular lexon (frequency of occurrence) are presented with words say 20 units above, to establish whether their performance is 20 units less than expected. Likewise many such tests. Ratios can be sustained by this scale because it possesses a true zero, and a fixed maximum, and so it accords directly with how we understand continuous-quantity measurement. No Rasch needed.

But, I have a terrible feeling the lexile is not simple word frequency, but some hidden-from-view admixture of word frequency and sentence length. I just can’t fathom it out. Maybe I’m totally wrong here. If not, I can see why you (Jack) need Rasch IRT and a constructed link function, because there is no other way to develop a measure other than implementing huge systematic experimental “performance” evaluations which are attempts to discover the precise deterministic quantitative relations between performance, word frequency, and sentence length over representative conjoined values of word frequency and sentence length.

What constitutes a fixed stimulus? If we just accept that we ask individuals to read text passages which vary in length, average word frequency, and average constituent sentence length, with “target words” at fixed frequencies, then look at whether they choose the correct target word, we can construct a model to fit the data – as Jack did. It’s an aggregate-based model, where reading ability is based upon an expected 75% accuracy rate for a stimulus-item of X lexiles and individuals who possess X lexile ability.

The problem is, the stimuli are “we just don’t know” derived magnitudes of some quantity, because we have no idea what functional relation/s link all the constituent components of a stimulus into what we think is a magnitude of some quantity. What we have done is ‘Rasched’ the data to produce the linearity, in the same way that Wood “Rasched” random coin tosses to produce “coin-tossing-ability”.

So, the lexile is a very practical scale. It accounts for real-world performance of typical text-passage complexity and reading performance. It’s fairly accurate for any individual, and very accurate when considered as an aggregate functional relationship. That average functional relationship is clearly linear in form.

But for me, it’s not a quantity for reasons given above.

However, this may be because I am not understanding certain issues about quantitative measurement,

or

that I simply haven’t seen all the data Jack and others have seen,

or

that I am misinterpreting facts about text-stimuli (such as the constituent composing principles of a text-passage which is assigned a fixed lexile value)

or

I’m just being a “perfectionist”.

Underlying my unease is that if we take a group of items, such as “personality behaviors”, then a group of people who respond to whether they behave this way or not, and Rasch the data, calling the logit an extrav, this seems to be no different in principle to what we’ve done for reading ability, except that for our extravs we have self-rated performance, and for the lexile, actual performance. This is precisely how Klaus Sijtsma views measurement in psychology.

Anyway, the above is probably heavy-going so I’m not looking for any “instant responses” – unless my doubts are so dumb that they can be instantly refuted!

And, does everyone else on the list accept that reading ability is quantitative? I’m beginning to wonder whether it is only Guenter and myself who have any doubts ..!

Ah well .. a fascinating issue regardless ... and thanks for the offer to analyze data Jack; I just don’t have the spare time right now.

Regards .. Paul

W: www.pbarrett.net

E: pa...@pbarrett.net

M: +64-(0)21-415625

Andrew Kyngdon

Jun 9, 2011, 3:55:23 AM6/9/11

to talking-m...@googlegroups.com

Paul,

I’ll make some points which may or may not add to the discussion between you and Jack.

The original core of the Lexile Framework consisted of the imbedded sentence or “inter-sentential” cloze reading item type. As you can see from my BJMSP article, this item consists of professionally edited continuous prose text obtained from a novel, magazine article or textbook. An example is as follows:

Thus did he pray, and Apollo heard his prayer. He came down furious from the summits of Olympus, with his bow and his quiver upon his shoulder, and the arrows rattled on his back with the rage that trembled within him. He sat himself down away from the ships with a face as dark as night, and his silver bow rang death as he shot his arrow in the midst of them. First he smote their mules and their hounds, but presently he aimed his shafts at the people themselves. He was _______.

a) merciless b) qualified c) accommodating d) depressing

As you can see, the “stem” of the item is an excerpt from a continuous prose version of Homer’s Iliad. The test constructor “imbeds” the last sentence which requires the reader to “cloze” it (i.e., select the correct word from the four presented). There are two quantitative features of the prose text in the stem of such items. One is “log mean sentence length” and the other is “mean log word frequency”. Mean sentence length is simply a ratio of two counts – the number of words to the number of sentence endings (i.e., full stops, exclamation marks, question marks). The common logarithm of this ratio is then calculated. Word frequency is obtained from a corpus, such as the Carroll Corpus (Carroll, Davies & Richman, 1971). How often a word appears in the corpus is an estimate of how frequently that word appears in discourse (MetaMetrics now has a 500 million word corpus thanks to all the text that has been scanned over the years). The common logarithm of the frequency for each word is calculated and the arithmetic mean of these log frequencies is taken over all the words in the passage.

The Lexile Framework was originally developed by constructing reading tests consisting of this type of item. The Rasch model was used to analyse the item response data and from this item difficulties were obtained. Jack and Don Burdick calculated the log mean sentence length and the mean log word frequency for all items in the test, and then regressed these against the Rasch item difficulties. What they found was that sentence length predicted about 80% of the variance in Rasch item difficulties, with word frequency accounting for another 10% or so. This simple linear regression equation they called a “construct specification equation”.

This equation is as follows:

where δi is the difficulty of imbedded sentence cloze item i, Si is the mean sentence length of the stem of i, Wi is the word frequency and a, b and c are real valued constants. The current values of these constants are proprietary and cannot be disclosed but previous values have been published. Once the values of the constants in the construct specification equation were obtained, Jack and Don used them to predict the difficulty of such items without using any empirical data. Despite about 25 years of subsequent research, no “third variable” has been discovered which accounts for the remaining 10% of variation.
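One form consistent with the symbol definitions above, offered as an assumption rather than as the published equation, is the following. Here W_i is read as the mean log word frequency described earlier in the post, and the signs are left to the real-valued constants.

```latex
\delta_i = a \,\log_{10}(S_i) + b \, W_i + c
```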

The δi values get transformed into Lexiles via the following equation:

L_i = [(δ_i + 3.3) × 180] + 200.

The unit of the Lexile scale (denoted “L”) is defined as 1/1000th of the difference in difficulty between a sample of basal primer texts and Grolier’s Encyclopaedia (Grolier, 1986). For example, the Iliad item above has a difficulty of 1220L.
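The steps described above (two text features, a linear specification equation, then the Lexile transformation) can be sketched as follows. This is only an illustration under stated assumptions: the three constants are hypothetical placeholders (the real values are proprietary), the linear form and its signs are assumed, the toy corpus is invented, and counting sentence endings with a simple regular expression ignores cases such as ellipses and abbreviations.

```python
import math
import re

# Hypothetical placeholder constants, NOT the Lexile values. The signs assume
# that longer sentences and rarer words both make an item harder.
A_SENTENCE = 9.0
B_FREQUENCY = -2.0
C_INTERCEPT = 0.0

WORD_RE = re.compile(r"[A-Za-z']+")
SENTENCE_END_RE = re.compile(r"[.!?]")


def log_mean_sentence_length(text: str) -> float:
    """Common log of (number of words / number of sentence endings)."""
    n_words = len(WORD_RE.findall(text))
    n_endings = len(SENTENCE_END_RE.findall(text))
    return math.log10(n_words / n_endings)


def mean_log_word_frequency(text: str, corpus_freq: dict, default: float = 1.0) -> float:
    """Arithmetic mean of the common log of each word's corpus frequency."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    logs = [math.log10(corpus_freq.get(w, default)) for w in words]
    return sum(logs) / len(logs)


def item_difficulty_logits(text: str, corpus_freq: dict) -> float:
    """Assumed linear specification equation applied to the two text features."""
    return (A_SENTENCE * log_mean_sentence_length(text)
            + B_FREQUENCY * mean_log_word_frequency(text, corpus_freq)
            + C_INTERCEPT)


def logits_to_lexiles(delta: float) -> float:
    """Transformation quoted in the post: L = (delta + 3.3) * 180 + 200."""
    return (delta + 3.3) * 180.0 + 200.0


def lexiles_to_logits(lexiles: float) -> float:
    """Inverse of the quoted transformation."""
    return (lexiles - 200.0) / 180.0 - 3.3


toy_corpus = {"the": 1000.0, "cat": 10.0, "sat": 10.0}  # invented frequencies
stem = "The cat sat. The cat sat!"
delta = item_difficulty_logits(stem, toy_corpus)
print(f"toy item: {delta:.3f} logits, {logits_to_lexiles(delta):.0f}L")
print(f"1220L corresponds to {lexiles_to_logits(1220.0):.3f} logits")
```

Only the last two functions follow directly from the post. Everything that depends on the placeholder constants produces illustrative numbers with no bearing on real Lexile measures.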

The problem with imbedded sentence cloze items is that the difficulty can be manipulated without altering the prose text in the item stem. Test constructors can use different “imbedded” sentences which affect how difficult the item is. Increasing the vocabulary demand by changing the distracters/foils used in the multiple choice can also affect the difficulty. Such changes have not been experimentally examined save for the study by Stenner, Burdick, Sanford & Burdick (2006). They argue that the Lexile measure of an imbedded sentence cloze item is the “ensemble mean” of all possible variants of that item (i.e., all possible imbedded sentences and foils).

How readers receive a putative Lexile measure is as follows. Readers complete a test consisting of imbedded sentence cloze items and their data is analysed using a modified Rasch model. Instead of empirically estimating item difficulties, these are calculated from applying the “construct specification equation” to the item stems. The ensuing values are “plugged in” to the Rasch model (much like how the estimates of precalibrated linking items are treated) and the data analysed. The thinking behind this is that the raw score sufficiency of the Rasch model enables persons to be measured in the same unit as the items. Readers can also get a Lexile measure via a linked test. In this respect, the Lexile Framework is much like conventional psychometrics.

How monographs get a Lexile is to treat monographs like giant reading tests and “slice” them up into 125 word long passages of text (or thereabouts, so as to avoid cutting any sentences). A monograph, M, has a difficulty which is calculated by iteratively solving a Rasch equation such that the sum of Rasch model probabilities of a correct answer to the n “slices” is equal to a relative raw score of 75% of slices correctly responded to (Stenner et al., 2006). Let such a score be represented by X_M. The Rasch model equation is thus:

where k is a constant of 1.1.
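The iterative solution described above can be sketched with a simple bisection search. The sketch assumes that the monograph measure is the value M at which the mean Rasch probability over the n slices equals 0.75, and that k = 1.1 is added inside the logit. The sign and placement of k are assumptions, as are the function names and the example slice difficulties (in logits).

```python
import math

K = 1.1  # constant quoted in the post; adding it inside the logit is an assumption


def expected_relative_score(measure: float, slice_difficulties: list, k: float = K) -> float:
    """Mean Rasch probability of success over all slices at the given measure."""
    probs = [1.0 / (1.0 + math.exp(-(measure - d + k))) for d in slice_difficulties]
    return sum(probs) / len(probs)


def solve_monograph_measure(slice_difficulties: list, target: float = 0.75,
                            tol: float = 1e-9) -> float:
    """Bisection for the measure at which the expected relative score hits the target."""
    lo = min(slice_difficulties) - 10.0  # expected score is far below the target here
    hi = max(slice_difficulties) + 10.0  # expected score is far above the target here
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_relative_score(mid, slice_difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


slices = [0.5, 1.0, 2.5]  # hypothetical slice difficulties in logits
measure = solve_monograph_measure(slices)
print(f"monograph measure: {measure:.4f} logits")
print(f"expected relative score there: {expected_relative_score(measure, slices):.4f}")
```

Bisection works here because the expected score increases monotonically with the measure. Any other root finder would do equally well.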

Given that the Rasch model was used to obtain item difficulties in the first instance, I would say the Lexile Framework was founded on the assumption that individual differences in reading test performance are quantitative. Hence I would argue that the Lexile framework does not putatively measure reading ability per se, but individual differences in performance upon reading tests. In my BJMSP paper, I applied the theory of conjoint measurement as a limited means of testing the hypothesis of quantitative individual differences.

In my opinion, it would not seem that the cognitive system responsible for reading is a quantitative system like a physical law. For example, consider the work of the famous Australian cognitive scientist Max Coltheart and his dual route cascaded model of “print to speech” reading behaviour (i.e. reading aloud) (Coltheart, Curtis, Atkins & Haller, 1993). Coltheart has presented evidence that his theory can even explain the differences which exist between the reading disabilities of phonological and surface dyslexia (e.g. Coltheart, Rastle, Perry, Langdon & Ziegler, 2001). Coltheart’s influential theory, however, is not one that is explicitly quantitative.

However, it is plausible that individual differences may have quantitative causes, and there is a lot of cognitive science research to back this up. Sentence length has long been hypothesised to be a “proxy” variable for the demands prose text places upon a reader’s Verbal Working Memory (VWM) capacity (e.g., Crain & Shankweiler, 1988; Klare, 1963; Liberman, Mann, Shankweiler & Werfelman, 1982; Shankweiler & Crain, 1986). People with greater VWM capacity have greater facility with syntactic complexity, as they can hold in VWM multiple interpretations of sentences containing syntactical and lexical ambiguities (MacDonald, Just & Carpenter, 1992; Miyake, Just & Carpenter, 1994) and are better able to comprehend sentences with centre-embedded relative clauses (King & Just, 1991). Hence individual differences in reading test performance may be caused in part by individual differences in VWM capacity.

According to Rayner (1998), rarer words require greater lexical processing as new information from prose text is obtained only when the eyes fixate on text. So it could be hypothesised that individual differences in vocabulary contribute to differences in reading test performance. Word frequency is highly predictive of eye fixation times and total gaze duration in eye movement studies of reading (Rayner, 1998). Readers fixate and gaze at low frequency words for several hundred milliseconds longer than for high frequency words in passages of continuous prose text (Just & Carpenter, 1980). This phenomenon is not attenuated when either word length is controlled for (Inhoff & Rayner, 1986; Rayner, Ashby, Pollatsek & Reichle, 2004) or by the presence of other variables, such as number of letters, subjective word familiarity or age of word acquisition (Juhasz & Rayner, 2003). Low frequency words are also skipped less than high frequency words when words consist of six letters or less (O’Regan, 1979; Rayner, Sereno & Raney, 1996). Lexical decision task experiments have found significantly greater word recognition reaction times for low frequency words than for high frequency words (Balota & Chumbley, 1985; Hudson & Bergman, 1985; Jastrzembski, 1981).

So whilst the capacity to read continuous prose may not be a quantitative system like a physical law, it might be the case that individual differences in reading test performance are quantitative. Such differences, in the Lexile Framework, are attributed to individual differences in VWM capacity and vocabulary.

In response to your three points:

1) The Lexile Framework for Reading has a significant stochastic component (viz., the Rasch model). However, putative reading item difficulties are non-stochastically determined by the construct specification equation.

2) The Lexile is not unitised word frequency. Sentence length is also hypothesised to be a causal factor of individual differences in reading test performance.

3) I’m not sure what you mean by “fixed stimulus”.

The graphs Jack has put up concern a new item type with which it is possible to obtain repeated assessments of the one individual. This may enable tests of the Lexile Framework to progress from the “between-subjects” kind used so far.

I hope this helps.

Cheers,

Andrew

Andrew Kyngdon, PhD

MetaMetrics, Inc.

www.lexile.com

My website: https://sites.google.com/site/drandrewkyngdon/home

Measurement Forum: http://groups.google.com/group/talking-measurement

Jack Stenner

Jun 9, 2011, 3:16:10 PM6/9/11

to talking-m...@googlegroups.com

Thanks Paul,

I think you have peeled back some crucial distinctions. The question of what is being averaged and when, in an analysis that purports to sustain the quantitivity hypothesis, is fundamental. Most IRT applications make only a nod at within person variation and attention quickly shifts to studying between person variation. Molenaar in Molenaar and Newell (2010) notes “… most analysis efforts in learning and development still focus on data averaged across subjects. There is growing realization of the limitations of this time honored approach” (pg 3). Elsewhere he proves that only under rare circumstances (where ergodicity conditions are satisfied) can we infer what is going on within persons from between person analyses. A causal Rasch model affords a test that the same causal model holds individually for every person in the sample. This is what we mean by an individually centered analysis. There is no averaging until data fit to the causal Rasch model has been confirmed for each case. Only then are we licensed to average data model fit over, say, gender, ethnicity, text genre, grade level, age etc. And only then can we have confidence that the attribute on which I differ from myself over time is the same attribute on which I differ from my brother at one point in time (thanks to Denny B. for this aphorism). As a generalization, I don’t think that between person analyses (factor analyses, SEM etc.) are informative regarding the quantitivity hypothesis unless more exotic FA models like P mode are employed.

If your particular brand of quantitivity demands a deterministic model then you will never be satisfied that any human science attributes are quantitative. I believe that “Quantitative stochastic models” are realizable. But in addition to the methods advocated by Michell I am willing to admit tests on the meaningfulness of differences. Both because such tests abound and because, for me, such tests are time proven and more easily explained. Physics came of age with just such tests.

With respect to your number 3 the LF is not an “aggregate-based model”. As described above it is an individually centered causal Rasch model. Data on a single reader over time is all the data we need to test the quantitivity hypothesis because experimental manipulation of text complexity and reader ability will either demonstrate that differences are meaningful or not, within the bounds of measurement error. All models are wrong but some are useful (George Box).

With respect to the coin-tossing-ability ditty offered up by Wood and Goldstein it is nothing more than a conjurer’s trick that shifts attention from what matters. But that’s another paper.

Jack Stenner

Chairman & CEO

MetaMetrics, Inc.

Developer of the Lexile and Quantile Frameworks

1000 Park Forty Plaza Drive, Suite 120

Durham, NC 27713

Tel: 919-547-3402 Fax: 919-547-3401

jste...@Lexile.com

Web: www.MetaMetricsinc.com | www.Lexile.com | www.Quantiles.com


--

Paul Barrett

unread,

Jun 12, 2011, 11:19:36 PM6/12/11

to talking-m...@googlegroups.com

Hello Andrew ... Jack ..

Thanks for all your input here. Much has been clarified.

First, Andrew ...

“I’m not sure what you mean by “fixed stimulus”.”

I stated as my #3 “issue”:

What constitutes a fixed stimulus? If we just accept that we ask individuals to read text passages which vary in length, average word frequency, and average constituent sentence length, with “target words” at fixed frequencies, then look at whether they choose the correct target word, we can construct a model to fit the data – as Jack did.

The problem is, the stimuli are “we just don’t know” derived magnitudes of some quantity, because we have no idea what functional relation/s link all the constituent components of a stimulus into what we think is a magnitude of some quantity.

What I meant was that I suspected that the initial stimuli used to construct the Rasch difficulties (which were then predicted by two other count-attributes) were not “fixed” in terms of their constituent properties. I saw these “probe items” as reflecting a mix of constituents, because we have no idea what functional relation/s link all the constituent components of a stimulus into what we propose is a magnitude of some quantity.

If we think of varying an SI unit physical attribute like length, only one property of a stimulus is varied ... that of length. Likewise temperature. For text-passage stimuli with cloze sentences, we have a stimulus which is complex ... as again noted by Andrew:

“The problem with imbedded sentence cloze items is that the difficulty can be manipulated without altering the prose text in the item stem. Test constructors can use different “imbedded” sentences which affect how difficult the item is. Increasing the vocabulary demand by changing the distracters/foils used in the multiple choice can also affect the difficulty. Such changes have not been experimentally examined save for the study by Stenner, Burdick, Sanford & Burdick (2006). They argue that the Lexile measure of an imbedded sentence cloze item is the “ensemble mean” of all possible variants of that item (i.e., all possible imbedded sentences and foils).”

So, I think Jack’s statement is meaningful here:

“I believe that “Quantitative stochastic models” are realizable. But in addition to the methods advocated by Michell I am willing to admit tests on the meaningfulness of differences. Both because such tests abound and because, for me, such tests are time proven and more easily explained. Physics came of age with just such tests.”

For me, the fact that “Test constructors can use different “imbedded” sentences which affect how difficult the item is. Increasing the vocabulary demand by changing the distracters/foils used in the multiple choice can also affect the difficulty” presents a warning that we cannot use a collection of relatively unconstrained stimuli, and allow a model to create difficulties for them, which we then predict. Indeed, I can think of many examples with short but impossibly difficult sentences, or long but very easy to comprehend ones, along with “complex” cloze sentences, high-low frequency cloze items etc. Nevertheless, all would follow the rules and grammar for a language.

Whether these “lab constructed” text stimuli are ‘plausible’, though, is another matter altogether, and I think that “plausibility of occurrence in natural language” is also a factor which provides some degree of “constraint” on the text-passage items originally used to construct the item difficulties against which sentence length and word-frequency were regressed.

I am tempted to ask: why bother with the lexile at all?

If two simple cardinal counts of sentence length and word frequency are considered largely predictive of comprehension, why not just use these two as fundamental SI-type units, with comprehension considered a derived variable formed by some relational function between the two constituents?

If these two are considered causal for comprehension, then simple observations of ANY text passage would yield data from which a derived relation might be discovered. Indeed, one could build test stimuli from first principles:

1. All stem sentences of equal length.

2. All stem sentences contain words of exactly the same frequency (to within a small degree of variation)

3. All cloze sentences are of the same length.

4. All cloze sentences contain “choice” words of exactly the same frequency.

Such “fixed” test stimuli should yield exactly the same frequencies of comprehension among groups of readers who differ in “reading comprehension ability” if our theory is correct i.e. we equate those whose comprehension scores are the same on a variety of such “fixed-stimulus” items.

Then present them with a new set of similarly designed items, which are exemplars of what we construe as increasing (or decreasing) difficulty, and assess the same people.

If the attribute we have in mind is quantitative, relative orders among “comprehension score” groups would remain fixed.
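The ordering check described here can be sketched in a few lines. This is only an illustration: the group labels and comprehension rates below are hypothetical numbers, not data from any Lexile study, and the function names are invented for the sketch. It also tests ordering only, which on its own would not establish that the attribute is quantitative.

```python
def group_order(rates):
    """Return group labels sorted from lowest to highest comprehension rate."""
    return sorted(rates, key=rates.get)


def order_is_invariant(rates_by_item_set):
    """True if every item set ranks the reader groups in the same order."""
    orders = [group_order(rates) for rates in rates_by_item_set]
    return all(order == orders[0] for order in orders)


# Hypothetical comprehension rates for three reader groups (A, B, C)
# on a baseline item set and on a second, harder item set.
baseline = {"A": 0.55, "B": 0.70, "C": 0.85}
harder = {"A": 0.35, "B": 0.50, "C": 0.72}

# A hypothetical outcome in which groups A and B swap places.
crossed = {"A": 0.52, "B": 0.48, "C": 0.72}

same_order = order_is_invariant([baseline, harder])
broken_order = order_is_invariant([baseline, crossed])
```

Ties between groups are not handled specially in this sketch; a real analysis would also need to allow for sampling error before declaring an order reversal.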

We would also have the glimmering of how to conjoin sentence length and word frequency, forming a derived variable with an arbitrary, but integer-only unit, because the maximum possible resolution of the constituent attributes forming our derived variable are themselves integers; i.e. counts. While the derived variable may not be strictly quantitative, it may turn out to possess “good enough resolution” for practical purposes.

But the measurement construction process is not a stochastic process at all. It only becomes stochastic once we begin to talk of models which contain means, averages, difficulties incorporating a 75% accuracy threshold etc.

So, we differ entirely in our view of the relevance of the Rasch or any IRT model. For me, Rasch modeling (or any probability-based modeling) is something which gets in the way of constructing quantitative measurement via observation within carefully structured manipulated experiment, in the manner that Guenter Trendler describes (if such measurement is possible at all in the social sciences). For you, it is an essential component in the measurement construction process.

However, I still take my hat off to you, Jack, for the entire MetaMetrics, Inc. and lexile enterprise. It’s easy to be a critic a posteriori! As you say Andrew, there has been a heavy cost for even proceeding down this route ..

“The Lexile Framework presents perhaps the first sustained attempt to break away from conventional psychometric thinking in regards to cognitive abilities, but what has been the consequence of that? Jack Stenner being banned from publishing in the International Reading Association journals for yet another decade”

One can only hope the sheer practicality and eventual international uptake of the lexile system wins through the wilful myopia of some.

By the way, does anyone know what Joel Michell thinks of the lexile .. I’m assuming given his 2011 statement that he also feels it is not strictly quantitative?


Trendler, Guenter

unread,

Jun 13, 2011, 6:13:45 AM6/13/11

to talking-m...@googlegroups.com

"The slide I sent with the first post is a plot of the empirical complexity and the theoretical complexity. The reliability of the empirical complexities for 475 articles is .996 and the associated standard error is 12L. To give a frame of reference for twelve Lexiles consider that a typical fourth grader grows 8L in a month and that a 12L difference would move your comprehension rate for an article matched to your reading ability from 75% to 76.2%. Thanks to the trade off property this increase in comprehension could be realized by either increasing your reading ability by 12L or by lowering the article text complexity by 12L. From either perspective the precision is quite high. If we increased the minimum number of readers encountering an article from 50 to 500 the standard errors would of course be even smaller. Best Jack"

Ok, but why not argue with real data? Why not construct a table analogous to, let's say, the melting temperature of metals? (1)
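The arithmetic in the quoted passage can be reproduced with a short sketch. It assumes a Rasch-type logistic form in which a reader matched to a text comprehends at 75%, and a scale of 180L per logit. The 180L figure is an assumption of this sketch, chosen because it reproduces the 76.2% quoted above; the function and constant names are likewise illustrative.

```python
import math

LEXILES_PER_LOGIT = 180.0        # assumed scale factor, see lead-in
MATCHED_LOGIT = math.log(3.0)    # log-odds corresponding to 75% comprehension


def comprehension_rate(reader_lexile, text_lexile):
    """Forecast comprehension under the assumed Rasch-type form."""
    logit = (reader_lexile - text_lexile) / LEXILES_PER_LOGIT + MATCHED_LOGIT
    return 1.0 / (1.0 + math.exp(-logit))


matched = comprehension_rate(1000, 1000)        # reader matched to text: 0.75
abler_reader = comprehension_rate(1012, 1000)   # reader 12L higher: about 0.762
easier_text = comprehension_rate(1000, 988)     # text 12L lower: same rate
```

The last two values coincide because only the difference between reader and text measures enters the model, which is the trade-off property mentioned in the quote.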

Indeed, if the model is correct then the ratio of the comprehension odds for any two items is constant independently of reading ability. Alternatively, the measurement values of any item must be invariant over replications independently of the samples (i.e. reading ability) used for repeated measurements (i.e. the estimation of item parameters). Are they?
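A quick numerical check of this invariance, assuming the dichotomous Rasch form: for two items, the ratio of the odds of success is the same at every ability level, whereas the ratio of the raw success rates is not. The item difficulties and ability values below are hypothetical and expressed in logits.

```python
import math


def p_correct(theta, b):
    """Rasch probability of success for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def odds(p):
    return p / (1.0 - p)


b_easy, b_hard = -0.5, 1.0       # hypothetical item difficulties
abilities = [-2.0, 0.0, 2.0]     # hypothetical reader abilities

# Odds ratio between the two items at each ability level.
odds_ratios = [
    odds(p_correct(t, b_easy)) / odds(p_correct(t, b_hard)) for t in abilities
]

# Ratio of the raw success rates at each ability level.
rate_ratios = [p_correct(t, b_easy) / p_correct(t, b_hard) for t in abilities]
```

Every entry of `odds_ratios` equals exp(b_hard - b_easy), whatever the ability, while `rate_ratios` shrinks towards 1 as ability increases. An empirical test of the kind asked for here would compare item estimates obtained from samples of differing ability rather than model-generated probabilities.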

Regards,
Guenter

(1) http://www.engineeringtoolbox.com/melting-temperature-metals-d_860.html

-----Original Message-----
From: talking-m...@googlegroups.com on behalf of Jack Stenner

Sent: Tue 07.06.2011 18:58
To: talking-m...@googlegroups.com
Subject: RE: [talking-measurement] RE: Finally the breakthrough? #1


Andrew Kyngdon

unread,

Jun 16, 2011, 11:24:12 PM6/16/11

to talking-m...@googlegroups.com

Paul,

Sorry for not replying earlier, but I have had this week off.

P. What I meant was that I suspected that the initial stimuli used to construct the Rasch difficulties (which were then predicted by two other count-attributes) were not “fixed” in terms of their constituent properties. I saw these “probe items” as reflecting a mix of constituents, because we have no idea what functional relation/s link all the constituent components of a stimulus into what we propose is a magnitude of some quantity.

OK, I understand you now. Yes, I believe you are correct – I do not think that the imbedded sentence cloze items used in the first analysis were subject to the type of experimental control that you describe. Furthermore, I do not believe that the reading item stems were controlled for multicollinearity (i.e., correlations between word frequency and sentence length). This has always bothered me a bit. Perhaps Jack could elaborate on the initial development of the Lexile Framework?

P. Whether these “lab constructed” text stimuli are ‘plausible’, though, is another matter altogether, and I think that “plausibility of occurrence in natural language” is also a factor which provides some degree of “constraint” on the text-passage items originally used to construct the item difficulties against which sentence length and word-frequency were regressed.

Yes, it is important to note that the stems of all imbedded sentence cloze reading items are extracts from professionally edited continuous prose text. The stem of an imbedded sentence cloze reading item cannot be non-prose text such as poetry, nor can it be prose that is not professionally edited, such as student first drafts of essays. The stems themselves are not manipulated by the test constructor in any way, apart from the initial extraction from the text.

You ask:

I am tempted to ask: why bother with the lexile at all?

If two simple cardinal counts of sentence length and word frequency are considered largely predictive of comprehension, why not just use these two as fundamental SI-type units, with comprehension considered a derived variable formed by some relational function between the two constituents?

If these two are considered causal for comprehension, then simple observations of ANY text passage would yield data from which a derived relation might be discovered. Indeed, one could build test stimuli from first principles:

1. All stem sentences of equal length.

2. All stem sentences contain words of exactly the same frequency (to within a small degree of variation)

3. All cloze sentences are of the same length.

4. All cloze sentences contain “choice” words of exactly the same frequency.

Such “fixed” test stimuli should yield exactly the same frequencies of comprehension among groups of readers who differ in “reading comprehension ability” if our theory is correct i.e. we equate those whose comprehension scores are the same on a variety of such “fixed-stimulus” items. Then present them with a new set of similarly designed items, which are exemplars of what we construe as increasing (or decreasing) difficulty, and assess the same people. If the attribute we have in mind is quantitative, relative orders among “comprehension score” groups would remain fixed.

I’ve had similar thoughts to yours Paul over time. I see no reason why such an experiment should not be conducted. Furthermore, I would add as Point 5 that word frequency and sentence length be orthogonal.

What I think would need to be done first is to experimentally ascertain what sentence length corresponds to one unit (or “chunk”) of verbal working memory (VWM) capacity. Is it three or four words? Say for argument’s sake that it is three. This is our “unit” of VWM capacity. We could then trade off mean word frequencies against unit increases in VWM capacity to keep comprehension (test performance) constant. Comprehension could then be expressed in a “unit” that is a product of other “units”, that is, as a mean word frequency per unit of VWM. As you say, this could not really be described as genuine derived measurement as you are dealing with two counts and counts are not continuous quantities. However, one wonders if you could treat this “derived discrete quantity” as a continuous one and not run into any major problems. For example, money is discrete (e.g., I’ve never held $pi in my hand, but I have held $3.14), yet economists treat it as continuous and it seems to work well enough.

A problem here is what datum corresponds to “comprehension”. Is it the number of correct responses to an item or the proportion of correct responses? If the latter, then I can see why it would be plausible to use a stochastic model. Incidentally, I prefer the terms “stochastic” and “non-stochastic” as opposed to the usual “probabilistic” and “deterministic” nomenclature as I believe the latter is misleading. One can have stochastic models in which something is being determined (such as in utility where choice probabilities are determined by the utilities of the simple gambles being evaluated, which are calculated using non-stochastic theories).

P. So, we differ entirely in our view of the relevance of the Rasch or any IRT model. For me, Rasch modeling (or any probability-based modeling) is something which gets in the way of constructing quantitative measurement via observation within carefully structured manipulated experiment, in the manner that Guenter Trendler describes (if such measurement is possible at all in the social sciences). For you, it is an essential component in the measurement construction process.

I don’t agree here. I do not believe that the Rasch model or an IRT model is an essential component in creating measurement. The study of decision making under risk and uncertainty has proceeded without much input at all from stochastic models, so I am not sure that stochastic models are necessarily needed in the measurement of cognitive abilities (although they may be).

P. One can only hope the sheer practicality and eventual international uptake of the lexile system wins through the wilful myopia of some.

The wilful myopia is very strong. I tried to submit a paper on the Lexile Framework and conjoint measurement to a major psychometric/educational assessment journal, only to have the editor knock it back immediately without explanation. As I had never had this happen to me before (outright rejection of a paper by the editor without going to reviewers) I asked whether the rejection was due to the Lexile Framework or conjoint measurement. The editor apologised and stated that it was the Lexile Framework. So it’s not only the International Reading Association journals that have a problem with the Lexile Framework.

P. By the way, does anyone know what Joel Michell thinks of the lexile .. I’m assuming given his 2011 statement that he also feels it is not strictly quantitative?

I do not believe he has actually expressed any thoughts concerning Lexiles directly. However, in his rejoinder paper in the Michell/Borsboom/Markus double issue of “Measurement” in 2008 he did state that: “Wisely, however, Kyngdon does not conclude from his results that reading comprehension ability is a quantitative attribute, for much more research would need to be done before that hypothesis is made even plausible” (p. 132). So I would imagine that he would not view the Lexile Framework as measuring reading ability.

Jack Stenner

unread,

Jun 17, 2011, 8:55:51 AM6/17/11

to talking-m...@googlegroups.com

I am on vacation and my Blackberry keys are too small. But I am intrigued by the next steps suggested in the exchange between Paul and Andrew. I have been rereading Jim Woodward's "Data and Phenomena" and wonder whether the utility of the Lexile Framework as a theory would be diminished if at a small scale quantitivity broke down, as the quantitivity of time and length does at Planck limits. Jack

Paul Barrett

unread,

Jun 21, 2011, 1:50:05 AM6/21/11

to talking-m...@googlegroups.com

Just a quick note to say I’ve posted that Rasch vs EFA of the EPQ questionnaire data, published back in 1981 with Paul Kline.

Barrett, P. T., Kline, P. (1981) A comparison between Rasch analysis and factor analysis of items in the EPQ. Personality Study and Group Behaviour, 1, 2, 11-28.

It’s only a simple thing ... and not exactly first-class .. but it was “good enough” as a demo of what Paul Kline and I had been suspecting about IRT.

http://www.pbarrett.net/publications.html ... #5


Paul Barrett

unread,

Jul 21, 2011, 8:04:11 PM7/21/11

to talking-m...@googlegroups.com

A new issue of New Ideas in Psychology will be out shortly, focusing on the use of constructs in Psychology. From the online articles available for download, one particularly stood out for me:

Click on the “Articles In Press” link at:

http://www.sciencedirect.com/science/journal/0732118X

Lamiell, J.T. (2011) Statisticism in personality psychologists' use of trait constructs: What is it? How was it contracted? Is there a cure? New Ideas in Psychology (doi:10.1016/j.newideapsych.2011.02.009), In Press, 1-7.

Abstract

‘Statisticism’ is meant to characterize a way of thinking in psychology that invests virtually boundless trust in the aptness of statistical concepts and methods to reveal the ‘lawfulness’ of human psychological functioning and behavior. In the article, I discuss how statisticism came to infect the thinking of mainstream 20th century personality investigators and how – if at all – the discipline might be cured. Unfortunately, mainstream thinking within the sub-discipline of personality psychology has long sanctioned an understanding of the statistical findings issuing from studies of individual differences in personality traits that is faithful to neither of the so called ‘frequentist’ or ‘subjectivist’ traditions. Instead, such findings are widely regarded as a scientifically acceptable warrant for claims to knowledge about objective states of affairs existing for individuals within the samples one has studied. I suggest that the prospects for eradicating dubious fruits of this form of statisticism will hinge importantly on (a) the ability of theoretically and philosophically-oriented psychologists to re-instill within the discipline a healthy respect for the power of conceptual analysis more generally, and, following this, (b) concern within the discipline for the fact that the deep and abiding conceptual problem described above in fact does exist.

This theme was also explored even more fully in the recent book:

Statistical Models and Causal Inference: A Dialogue with the Social Sciences (the late David Freedman et al, 2009)

http://www.amazon.com/Statistical-Models-Causal-Inference-Dialogue/dp/0521123909/ref=pd_rhf_shvl_3

The Editorial to the journal issue and two other absolutely crucial articles are:

Slaney, K.L., & Racine, T.P. (2011) Editorial: Constructing an understanding of constructs. New Ideas in Psychology (doi:10.1016/j.newideapsych.2011.02.010), In Press, 1-3.

Michell, J. (2011) Constructs, inferences, and mental measurement. New Ideas in Psychology (doi:10.1016/j.newideapsych.2011.02.004), In Press, 1-9.

Maraun, M.D., Gabriel, S.M. (2011) Illegitimate concept equating in the partial fusion of construct validation theory and latent variable modeling. New Ideas in Psychology (doi:10.1016/j.newideapsych.2011.02.006), In Press, 1-11.

There is a piece by:

Markus, K., & Borsboom, D.A. (2011) Reflective measurement models, behavior domains, and common causes. New Ideas in Psychology (doi:10.1016/j.newideapsych.2011.02.008), In Press, 1-11.

It’s like watching two people playing an incredibly sophisticated game with words. Mightily impressive but ultimately a game nevertheless.

Regards ... Paul

Paul Barrett

unread,

Aug 23, 2011, 7:50:25 PM8/23/11

to talking-m...@googlegroups.com

Recently, on SEMNET, Stan Mulaik posted a reference to the article:

Pearl, J. (2011a). The causal foundations of structural equation modeling. In press, in R. H. Hoyle (Ed.), Handbook of Structural Equation Modeling. New York: Guilford Press.

http://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf

To which I replied with somewhat less enthusiasm about the entire approach Judea Pearl has taken vis-à-vis his algebra of cause.

James Grice, Liz Schlimgen, and I have now re-analyzed the example in Pearl's article, using Observation Oriented Modeling (OOM) and a very simple model-free approach to detecting cause from data like these.

The results from this example data set show that Pearl's formula may lead to ambiguous conclusions regarding causality. Furthermore, the simple but extraordinarily powerful OOM logic and underlying philosophy asks some very big questions about the capability of any 'SEM' approach as a method of ‘determining cause’.

The draft article is available from the OSU Personality Research Laboratory web-page:

http://psychology.okstate.edu/faculty/jgrice/personalitylab/methods.htm

with a direct article link:

http://psychology.okstate.edu/faculty/jgrice/personalitylab/OOMMedForm_2011A.pdf

A more comprehensive evaluation of the entire "structural equations" approach (both Pearl's algebra of causation and the mechanics of statistical parametric SEM) for detecting cause is now the focus of our attention.

Regards .. Paul

Advanced Projects R&D Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: pa...@pbarrett.net

M: +64-(0)21-415625

Andrew Kyngdon

unread,

Aug 23, 2011, 11:28:49 PM8/23/11

to talking-m...@googlegroups.com

A great post Paul, thank you for sharing it!

As to Pearl (2011), my impression of the paper is one of “statisticism” bordering on the absolute barking mad.

Especially Pearl’s Equation (1):

y = Bx + u_Y, x = u_X,

where x is the “severity of a disease” and y is the “severity of a symptom” and u_Y stands for all factors that could possibly affect Y when X is held constant.
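For anyone who has not opened the paper, here is a small sketch of what Equation (1) asserts if read literally as a recipe for generating data. The symbols B, u_X and u_Y are those of the equation; the independent standard-normal disturbances, the value B = 0.8 and the function names are assumptions made only for this illustration.

```python
import random


def simulate_equation_1(b, n, seed=0):
    """Generate n (x, y) pairs from x = u_X, y = b*x + u_Y.

    u_X and u_Y are drawn as independent standard normals, which is an
    assumption of this sketch rather than part of the equation itself.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u_x = rng.gauss(0.0, 1.0)
        u_y = rng.gauss(0.0, 1.0)
        x = u_x              # "disease severity" is just its own disturbance
        y = b * x + u_y      # "symptom severity" is linear in x plus noise
        xs.append(x)
        ys.append(y)
    return xs, ys


def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


xs, ys = simulate_equation_1(b=0.8, n=10000)
slope = least_squares_slope(xs, ys)   # close to 0.8 under these assumptions
```

Under these assumptions the regression slope simply recovers B, which is all the equation encodes about the relation between "disease" and "symptom".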

Is this guy for real? Does he really believe that disease can be described by such a crude equation? Does he really think that disease is “…a physical process whereby nature examines the values of all variables in the domain [and] assigns a variable Y the value y = Bx + u_Y” (p. 7)? Does he really believe that we can explain the occurrence of disease X by simply writing x = u_X?

My brother Craig Kyngdon is an award-winning parasite immunologist who created vaccines against the “Taenia solium” parasite. Humans infected by T. solium can contract a devastating disease called “Neurocysticercosis”, which “…causes substantial morbidity and mortality in Asia, Africa, and central and south America” (Kyngdon, et al, 2006, p. 191).

I have attached Craig’s paper which was published in “Parasite Immunology”. In it he explains how his vaccines prevented pigs from being infected by T. solium. Now, much of Craig’s paper is intelligible to me, but I can see no trace of SEM in it. In fact, I know Craig is a complete ignoramus about SEM or behavioural science statistics (which is no bad thing). I cannot for the life of me see how Pearl’s paper could have helped Craig in any way develop his vaccines, much less help him understand the highly complex immune systems of humans and pigs. I might be missing something, but I don’t think so.

In fact, if Craig wrote anything like Equation (1) in his paper, I would imagine the Editor of Parasite Immunology would have responded with a simple “WTF?”

Pearl (2011) seems to think that application of SEM to survey data will yield descriptive theories of psychological systems. Luce (1988) caricatured this thinking beautifully when he called it the “Tools to Theory Hypothesis”. Unfortunately, it would seem that Pearl’s claims are taken seriously in psychology and the behavioural sciences.

Andrew

Andrew Kyngdon, PhD

MetaMetrics, Inc.

www.lexile.com

My website: https://sites.google.com/site/drandrewkyngdon/home

Measurement Forum: http://groups.google.com/group/talking-measurement

From: talking-m...@googlegroups.com [mailto:talking-m...@googlegroups.com] On Behalf Of Paul Barrett

Sent: Wednesday, 24 August 2011 9:50 AM
To: talking-m...@googlegroups.com