Balinski & Laraki's "measurement theory" attack on average-based range voting: refuted.

Warren D Smith

May 23, 2013, 10:48:22 PM
to balinski, Michel Balinski, laraki, Steven J. Brams, electionscience

Abd ul-Rahman Lomax

May 24, 2013, 2:22:45 PM
to balinski, Michel Balinski, laraki, Steven J. Brams, electionscience
I thank Warren for this page.

At 09:48 PM 5/23/2013, Warren D Smith wrote:
>See
>http://rangevoting.org/MeasTheory.html


>The British Association for Advancement of
>Science in 1932 tasked a committee to report on
>"quantitative measurement of sensory events." It
>produced its final report in 1940. This was
>stimulated by the "sone scale of loudness"
>purported to measure "objective scale of [subjective] auditory sensation."
>Encyclopedia Britannica, "Sone": Loudness is a
>subjective characteristic of a sound (as opposed
>to the sound-pressure level in decibels, which
>is objective and directly measurable).
>Consequently, the sone scale of loudness is
>based on data obtained from subjects who were
>asked to judge the loudness of pure tones and
>noise. One sone is arbitrarily set equal to the
>loudness of a 1,000-hertz tone at a sound level
>of 40 decibels above the standard reference
>level (i.e., the minimum audible threshold). A
>sound with a loudness of four sones is one that
>listeners perceive to be four times as loud as the reference sound.

Eek! "Four times as loud"? What does that mean?

From Wikipedia, article on "Sone":

>The study of apparent loudness is included in
>the topic of
><http://en.wikipedia.org/wiki/Psychoacoustics>psychoacoustics
>and employs methods of
><http://en.wikipedia.org/wiki/Psychophysics>psychophysics.

Or, really psycho physics. Okay, cheap shot.

The Wikipedia article gives rough equivalence of
sones to sound pressure levels. It looks to me
like sones are an artifact of translation of
perceived sound levels into a numerical scale,
because a perception of one sound as a multiple
of another is largely subjective. It may relate
to a measurable quantity, such as the numbers of
neurons firing, or rates of firing.

What can be studied with objectivity, even though
it's "subjective," is relative loudness, i.e., is
sound A louder than sound B? It is easy if the
sounds are otherwise identical, i.e., same
frequency distribution, timing, context, etc.

>Such scales are highly important, in fact
>crucial, for purposes such as telephony,
>computer speech synthesis, audio compression,
>etc. But one member of the BAAS committee
>claimed any such quantitative scale "is not
>merely false but in fact meaningless unless and
>until a meaning can be given to the concept of
>addition as applied to sensation." (Final report
>p.245.) But other members had extremely opposite views!

If it is useful, it is not meaningless, even if
no clear, objective meaning has been discovered.
The "addition" comment is intrinsic to the
concept of sones, i.e., is a sound at four sones
equivalent to four sounds of one sone
played simultaneously?

Obviously there would be phase relationships,
etc., to consider. (Because the sum of two tones,
as described, could be *silent*. So we must
assume no phase difference for this to be meaningful.)

Then, for tones with different characteristics,
I'd expect that the correspondence of sound
pressure to the perception of loudness would
vary. Hence any correspondence of the sone scale
to a sound pressure scale would be variable, and
it may, indeed, vary with the individual. Again,
this would be tested by playing the two differing
sounds and adjusting them in volume so they are perceived as equally loud.

I'm not at all sure that the sone concept is
needed for those applications. That it is used in
some way, however, does show possible utility.

(A well-tested set of correspondences between
sones and sound pressure could be used, but ...
it seems likely that the real application is
always comparative, i.e., is a particular sound
louder than another. Or enough louder to
distinguish signal from noise, another application?)

Warren commented:

>You can already tell that that quote was
>hogwash. A counterexample is "temperature," an
>apparently obscure and little known concept
>unfamiliar to eminent members of the British
>Association for Advancement of Science.
>Temperature is meaningful and measurable,
>despite the fact that the "sum of two
>temperatures" seems meaningless and you
><http://rangevoting.org/FeynTexts.html#temp>never
> do that ("what is the total temperature of
>this apple and this cup of coffee?")

If temperature is a proxy for thermal energy, as
it is, then the sum of two temperatures could be
meaningful, if they are temperatures of the same
body. "Same" might refer to thermal mass. I don't
recall the specific relationships and am far too
lazy to look them up or derive them. However, if
we "add" an apple to the cup of coffee, they will
reach thermal equilibrium, so the "sum" of A and
B is actually a kind of weighted average of A and
B. "Net combined" temperature, not "total" temperature.
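
The "weighted average" described above can be sketched as follows. This is a minimal illustration, assuming constant heat capacities, no phase change, and no heat lost to the surroundings; the function name and the capacity and temperature figures are made up for the example, not measured values.

```python
def equilibrium_temperature(bodies):
    """Heat-capacity-weighted mean temperature of bodies brought into contact.

    bodies: list of (heat_capacity, temperature) pairs.
    Assumes constant heat capacities and an isolated system.
    """
    total_capacity = sum(capacity for capacity, _ in bodies)
    return sum(capacity * temp for capacity, temp in bodies) / total_capacity


# Hypothetical figures: heat capacity in J/K, temperature in degrees C.
apple = (300.0, 20.0)
cup_of_coffee = (1000.0, 80.0)
barrel_of_coffee = (500000.0, 80.0)

apple_and_cup = equilibrium_temperature([apple, cup_of_coffee])
apple_and_barrel = equilibrium_temperature([apple, barrel_of_coffee])
```

The result always lies between the two starting temperatures and depends on the relative heat capacities, so it behaves like an average rather than a sum.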

Skipping over more details of the debate over
sones and the like, we come to the real question
here, the reason for interest in the topic:

>What does this have to do with Range voting?
>
>The present essay was stimulated by an insanely
>wrong-headed attack on range voting – actually,
>incredibly, an attack on every voting method
>that uses numbers!! – by M.Balinski & R.Laraki.
>They prefer an alternative and more complicated,
>but related voting system – which they invented
>and called majority judgment (MJ) – based not on
>"greatest average score wins" but rather on
>"greatest median score wins, with an additional
>tie-breaking scheme." MJ also uses, not a
>numerical score-set, but rather a set of 6 verbal scores
>Excellent / Tres bien / Bien / Assez Bien /
>Passable / Insuffisant / a Rejeter.
>
>We quote the attack from Balinski & Laraki's paper
>Election by Majority Judgment: Experimental
>Evidence, pages 13-54 in Bernard Dolez, Bernard
>Grofman, Annie Laurent: Studies In Public
>Choice: In Situ and Laboratory Experiments On
>Electoral Law Reform: French Presidential Elections, Springer 2011.
>
>We have numbered their paragraphs for later reference:
>
>1. Is it reasonable to use numerical scales in
>voting? The answer is a resounding no, for several reasons:

Great example of the abuse of the word "reason."
It certainly is possible to use numerical scales
in voting, it's been widely done, and it is
*clearly* reasonable by a relatively objective
standard of reason. That is "reasonable people"
-- not insane -- use such scales. Whether or not
it's *optimal* would require some standard of optimality. Do they define one?

>2. The numbers mean nothing unless they are
>defined: proposals to use weights give them no definition.

Suppose voters vote in an election by tossing
uniform weights into baskets. They are permitted
to toss up to N of these weights into each
labelled basket. What is the definition of a vote
of one weight? It's obvious: it is an *action*
that is defined by its *effect*. That is, tossing
N weights maximally acts to elect that candidate,
and tossing none maximally acts against the
election of that candidate, and tossing an
intermediate number has a proportional effect intermediate between these.
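
The basket picture above can be sketched in a few lines. This is only an illustration: the function name, the ballots, and the choice of N = 4 are assumptions for the example, and ties are not handled.

```python
def basket_tally(ballots, candidates, max_weights):
    """Sum the weights each voter tosses into each candidate's basket.

    ballots: list of dicts mapping candidate -> number of weights (0..max_weights).
    A candidate missing from a ballot receives no weights from that voter.
    Returns (winner, totals). Ties are not resolved here.
    """
    totals = {c: 0 for c in candidates}
    for ballot in ballots:
        for c in candidates:
            weights = ballot.get(c, 0)
            if not 0 <= weights <= max_weights:
                raise ValueError("each basket takes between 0 and N weights")
            totals[c] += weights
    winner = max(candidates, key=lambda c: totals[c])
    return winner, totals


example_ballots = [
    {"A": 4, "B": 0, "C": 2},
    {"A": 0, "B": 4, "C": 3},
    {"A": 4, "B": 1},
]
winner, totals = basket_tally(example_ballots, ["A", "B", "C"], max_weights=4)
```

Nothing in the tally refers to any meaning of a weight outside the election; each weight matters only through its effect on the totals.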

Generally, proposals to use weights have exactly
that "meaning." The error is in assuming that to
be meaningful, it must have meaning *outside of
the election process.* I.e., that there must be
some correspondence between the number of weights
chosen by the voter and some condition
independent of the election. In fact, Balinski
and Laraki have attempted to create that, by
using words that *may* correspond to an
independent judgment. And that leads to election pathologies with Range.

The same error can exist, naturally, if a voter
thinks that the voter should, say, vote for all
candidates according to an independent
assessment, and that "honesty" then requires
equal-rating all "excellent" candidates when, in
fact, the system is asking for a *choice* between
them. "Excellent" is a *category* as Arrow has
discussed in his Center for Election Science
interview. It does not negate the existence of
differences between candidates within the
category (as placed by the voter). It means,
simply, that the voter has made a choice to
suppress the difference as being insignificant *under the election conditions.*

Gad, you'd think that political scientists would
actually engage in discussion with those who have
studied these matters in detail, before
committing themselves to print. But, we have seen
over and over, they don't, and as a result,
"political science" is often decades behind what
is commonly known. It's hubris.

> Their only real "meaning" is found in their
> strategic use. This induces comparisons, which
> immediately leads to
> <http://rangevoting.org/ArrowThm.html>Arrow's
> paradox... E.g. with these actual ballot instructions
>Give a grade to each of the twelve candidates:
>either 0, or 1, or 2 (2 the best grade, 0 the
>worst). To do so, place a cross in the
>corresponding box etc. The candidate elected
>with [this] method is the one who receives the highest number of points.

This does not lead to Arrow's paradox, that's
preposterous and totally misreads Arrow's
Theorem. The voters have not necessarily ranked
all the candidates, they have ranked
*categories.* Analyses of range that propose
violations of Arrovian criteria generally assume
an underlying ranking, and then study the
election from the point of view of those
rankings, which totally neglects the concept of
an insignificant ranking, totally neglects the
action of the voter to *deliberately* rank two
candidates in the same category, as an *exercise of power.*

The analysis completely denies the concept of
"preference strength," when preference strength
is obviously active in real-world social choice.
It is thinking like this that delayed the
development of election science for decades. It was a denial of the *obvious*.

>3. nothing is said concerning the meaning of 0, 1, or 2.

That's directly in contradiction to the example
given. The meaning of the vote was precisely
defined by the result. The meaning *is* the
result, the effect of the voter's voting pattern
on the result. This is *so* embarrassing. With
vote-for-one, what is the "meaning" of the vote?

>The numbers induce relative, so strategic,
>behavior. Other numbers could have been given.
>For example, with {-1,0 +1} mathematically there
>is no difference, but were these numbers used
>the behavior of the voters would almost surely
>have been different. [In fact, this experiment
>later was
><http://rangevoting.org/France2012.html>tried
>and voter behavior was significantly different.]

The behavior of voters under differing conditions
will vary. So? The real issue would be which
system will generate maximized social utility,
(or, symmetrically, minimize social discontent
with the result) and that is a problem that
depends on the definition of social utility.
There are techniques for studying this, and the
denial of any absolute meaning to "social
utility" is useless. It's true, generally, i.e.,
we have no means of measuring "true absolute
social utility," but we can measure proxies for
it, in real elections, and we can measure the
effect of variations in system on results in simulations.
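
The quoted claim that {0,1,2} and {-1,0,+1} are mathematically equivalent can be checked with a small sketch. It assumes every voter scores every candidate (blank entries would break the equivalence); the ballots and the helper name are invented for illustration.

```python
def score_totals(ballots):
    """Add up the scores given to each candidate across all ballots."""
    result = {}
    for ballot in ballots:
        for candidate, score in ballot.items():
            result[candidate] = result.get(candidate, 0) + score
    return result


# Hypothetical ballots on the {0, 1, 2} scale; every voter scores everyone.
ballots_012 = [
    {"A": 2, "B": 1, "C": 0},
    {"A": 0, "B": 2, "C": 1},
    {"A": 2, "B": 0, "C": 1},
]
# The same ballots expressed on the {-1, 0, +1} scale.
ballots_shifted = [{c: s - 1 for c, s in b.items()} for b in ballots_012]

totals_012 = score_totals(ballots_012)
totals_shifted = score_totals(ballots_shifted)
```

Each total drops by exactly the number of ballots, so the ordering of the candidates is unchanged. Any difference in outcome between the two scales therefore has to come from voters marking their ballots differently, which is the behavioral question.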

Simulations cannot address the difference between
the name-system and the numerical-system, but
that can be studied in properly designed
statistical trials. I.e., a population is divided
randomly into two groups, with each group being
given an "election," say as an exit poll. One
group has a name-system and the other group has a numerical-system.
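
The random division described above might look like this. It is a sketch only; the function name and the use of a seed for reproducibility are assumptions, and a real trial would need more care (stratification, sample size, and so on).

```python
import random


def split_into_two_groups(participants, seed=None):
    """Randomly assign participants to two groups of (nearly) equal size.

    The first group would receive the name-system ballot,
    the second the numerical-system ballot.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


name_group, number_group = split_into_two_groups(range(10), seed=1)
```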

However, such a trial could be biased by the
absence of strategic incentive. Balinski and
Laraki seem to assume that "strategic incentives"
somehow distort results. That's true in a sense.
If voters voted absolute utilities *not
normalized*, then we could easily maximize
utility. But there is no way to define these
utilities that is practical. In reality, in
everyday choice, our application of words like
"Excellent" varies with *expectations.* I.e., it is a *strategic choice.*

Strategic choices test preference strength. What
has been called "strategic voting" with Range is
simply a manifestation of how people, real-world,
make choices. Contradictory preference states have been
posited in order to portray this strategic voting as
somehow "dishonest." I.e., a voter supposedly has
a weak preference for A over B, but down-rates B
to min rating, say, because they want A to win.
Uh, that means they have a *strong preference.*
And if the only choice is between A and B, this
is totally rational and to be expected, and is,
in the ordinary meaning of the word, "honest."

Present these voters with a different election
scenario, they will choose differently, and this
is how multiple-round systems *powerfully* test
preference strength. By testing preference
strength, social utility maximization is possible.

>4. When numbers are used, they may well not be
>used in the same way at all: when a 0-100 scale
>is used, some voters may view 80 to be an
>excellent grade, others may see it as merely middling.

Balinski and Laraki completely miss the real
behavior of voters, and the "meaning" of ratings,
and this comment makes it obvious.

I have one full vote to cast for each candidate.
The meaning of the vote is the effect on the
outcome, nothing more, nothing less.

(If a voter thinks something else, they have been
misled, probably by people demanding "honest
votes," i.e., absolute approval, etc.) If I rate
a candidate at 80%, this is a *strong vote* for
the candidate, generally. By the way, I don't
like grading systems, like ABCDF. The purpose of
voting systems is not to "grade" the candidates,
it is to choose one or more. My favorite will get
an A, even if I think he's pretty bad (assuming I
choose to vote at all, I might not.) The worst
will get an F, even if I perceive the worst as
*almost* as good as the favorite. That's in a two
candidate election. What happens with more
candidates is more complex, for sure. But voting
under range is still a matter of deciding where
to place voting power, and that power is
expressed in the pairwise elections, it can vary
from zero (equal rating) to one full vote (max/min rating.)
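
The idea that a range ballot distributes voting power across the pairwise contests can be sketched as below. The function name, the 0-100 scale, and the ballot are assumptions made for illustration.

```python
def pairwise_power(ballot, a, b, min_score=0, max_score=100):
    """Fraction of one full vote this ballot casts for a over b.

    Positive values favor a, negative values favor b,
    and zero means the two candidates were rated equally.
    """
    return (ballot[a] - ballot[b]) / (max_score - min_score)


ballot = {"Favorite": 100, "Compromise": 80, "Worst": 0}

full_vote = pairwise_power(ballot, "Favorite", "Worst")
partial_vote = pairwise_power(ballot, "Favorite", "Compromise")
no_vote = pairwise_power(ballot, "Compromise", "Compromise")
```

Here the ballot casts a full vote in Favorite vs. Worst, a fifth of a vote in Favorite vs. Compromise, and four fifths of a vote in Compromise vs. Worst.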

Voting is an exercise of *power,* not a
sentiment. We *always* choose where to put our
power based on expectations, unless we are
asleep. Do we prefer voters who are asleep or those who are awake?

If I am presented with a ballot with candidates
on it, and I am familiar with the candidates, but
have no specific knowledge of how others are
likely to vote, I am quite likely to cast what is
called a "fully sincere" Range ballot. That is,
it will be unaffected by "strategic
considerations." But in a real-world election,
that circumstance is only present in minor
elections. Strategic considerations will shift
the vote, for most voters, because voters dislike
wasting their vote, sometimes. And the actual
behavior depends on preference strength, so it is
arguable that it *improves* results.
(Simulations, so far, have not clearly addressed
this. *Especially to be studied* are two-round
systems, where voter turnout also tests preference strength.)

>5. Even if the numbers did provide a common
>language, they will almost certainly not be a
>proper interval measure [in the sense of Stevens
>– it is here that Balinski & Laraki invoke
>"measurement theory"] – that depends on who the
>candidates are and how the voters give their
>grades. For example, the 0-20 scale used in
>France is a common language, but an 18, 19, or
>20 is unheard of in philosophy or literature, so
>the scale is not an interval measure. Once the
>distribution of the grades is known – after many
>elections (or many examinations) – it is
>possible to determine whether the scale is an
>interval measure and, if not, to correct it (as
>did the Danes). But then it is too late, since
>the weights must be announced ahead of time.

If a name-scale is used, it should be defined in
terms of the fractional vote assigned (or the
numerator of the fractional vote, same thing). I
still don't like it, though I have proposed that
the ranks in Bucklin be named. (They were
numbered in the original Bucklin, 1, 2, 3).

I have also proposed using a Range ballot for
Bucklin, and, again, names could be attached. But
these are *comparative* names, not absolute
categories or names easily interpreted as such.
So, I'll give the rank, the equivalent rating,
and the name, for a Range-Bucklin implementation.
And I'll assume a two-round system, with some
extra explanation that might not be on the ballot.

INSTRUCTIONS

Categorize each candidate into one of the
following ranks. You may categorize more than one
candidate into the same rank. The first rank will be
counted for all voters, and, if the majority of
voters have, with this counting, approved the
election of the candidate, the election will
complete. If more than one candidate has a
majority, then the candidate with the most votes
will be elected. [Ties not considered here].

If there is no majority approval, then the next
rank votes will be added to those already
counted, and, if necessary, this counting of
lower ranks will be repeated down to the Approved
rank, until a majority is found or all approved ranks have been counted.

If there is still no majority, a runoff election
will be held. On the runoff ballot will be the
two most-approved candidates from this election,
plus any candidate who would, by comparison of
all ranks voted, including the unapproved ranks,
defeat, pairwise, both of those candidates.

1, 4, Favorite
2, 3, Preferred
3, 2, Approved
4, 1, Disliked
5, 0, Rejected

If you do not mark a category for a candidate,
that candidate will be classified as Rejected.

(The range ratings I give above as equivalent are
not used in the method as described. They would
be used in a variant. The 3rd rank has been
defined as an approved rank, which was the case
with original Bucklin. A vote at that rank *can*
elect a candidate. The most that votes below that
rank can do is to select a Condorcet winner and
place that candidate into the runoff. It would
also be possible to define the runoff as being
top two, *or*, if there is a candidate who
defeats both of the two, between the top approved
candidate and the pairwise winner. There is
another contingency I have not addressed. It is
*highly* unlikely, but would need to be
considered in the formal method for logical completeness.)
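
The counting procedure in the instructions above can be sketched as follows. This is a rough illustration under stated assumptions, not a formal definition of the method: the function and variable names are invented, ties are ignored (as in the instructions), and the unaddressed contingency mentioned in the parenthetical is not handled.

```python
APPROVED_RANKS = (1, 2, 3)  # Favorite, Preferred, Approved
REJECTED = 5                # rank assumed for unmarked candidates


def bucklin_count(ballots, candidates):
    """ballots: list of dicts mapping candidate -> rank (1 best .. 5 worst).

    Returns ("elected", winner) if some round produces a majority,
    otherwise ("runoff", list_of_runoff_candidates).
    """
    majority = len(ballots) / 2
    tallies = {c: 0 for c in candidates}
    for rank in APPROVED_RANKS:
        # Add this rank's votes to those already counted.
        for ballot in ballots:
            for c in candidates:
                if ballot.get(c, REJECTED) == rank:
                    tallies[c] += 1
        with_majority = [c for c in candidates if tallies[c] > majority]
        if with_majority:
            return "elected", max(with_majority, key=lambda c: tallies[c])

    # No majority approval: the two most-approved candidates go to a runoff.
    top_two = sorted(candidates, key=lambda c: tallies[c], reverse=True)[:2]

    def beats(x, y):
        # Pairwise comparison using all ranks, including unapproved ones.
        for_x = sum(1 for b in ballots if b.get(x, REJECTED) < b.get(y, REJECTED))
        for_y = sum(1 for b in ballots if b.get(y, REJECTED) < b.get(x, REJECTED))
        return for_x > for_y

    extras = [c for c in candidates
              if c not in top_two and all(beats(c, t) for t in top_two)]
    return "runoff", top_two + extras


majority_case = bucklin_count(
    [{"A": 1, "B": 2, "C": 5}, {"A": 1, "B": 3}, {"B": 1, "A": 4},
     {"C": 1, "B": 2}, {"C": 1, "B": 3, "A": 5}],
    ["A", "B", "C"],
)
runoff_case = bucklin_count(
    [{"A": 1, "C": 4, "B": 5}, {"A": 1, "C": 4, "B": 5},
     {"B": 1, "C": 4, "A": 5}, {"B": 1, "C": 4, "A": 5}, {"C": 1}],
    ["A", "B", "C"],
)
```

In the first example B reaches a majority once second-rank votes are added. In the second, no candidate gets majority approval, and C joins the runoff because the Disliked-over-Rejected votes let C defeat both A and B pairwise.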

The system above could be Range instead of
Bucklin; all that is necessary is to have a
formal approval cutoff. I.e., say, mid-range or
higher is "approved." The descending approval
cutoff counting could be skipped and the range
votes added to find highest range sum, or, if
that candidate does not have majority approval,
then a runoff with top two (which is known to
improve range votes with real voters, i.e.,
"strategic voters,") and the pairwise test still considered as above.

>6. Even if it turned out that the scale did
>approximate an interval measure, the procedure
>depends on irrelevant alternatives, [hence] is
>subject to Arrow's paradox: for if one or
>several candidates drop out, the distribution of
>the remaining grades will almost certainly be
>different, so the scale is no longer an interval
>measure. [For example, in the French 2007
>presidential election, the counts of the number
>of times each of their 6 verbal scores was
>used, changed considerably when all scores for
>the 8 "unimportant" among the 12 candidates were removed.]

It's insane. An election is a *choice*, and
choice depends on context. The context is the
specific ballot, set of options, used by the
voters, not some other hypothetical or
previously-possible ballot. IIA as applied to the
expressed votes is not violated. Arrow knows
this. Of course, Balinski and Laraki did not have
the benefit of the CES interview, but we already
knew that Arrow's theorem did not apply to
systems that categorize candidates into ranked
categories, allowing equal ranking.

Arrow's theorem deliberately did not consider
cardinal voting systems, nor did it consider
"ballots." It assumed individual preference
profiles that ranked all candidates, strictly, no
equal ranking, and no skipped ranks. In the real
world we are often faced with choices where we
have difficulty deciding. That is an expression
of low or no preference strength. Voting systems
that allow this as an expression, therefore,
collect more accurate data from voters.

Bucklin is a Range system, not a ranked system.
The difference has often been overlooked, but
voters could skip ranks with Bucklin, and did,
thus expressing strong preference. In original
Bucklin, they could equal-rank in third rank, and
it's an obvious extension to allow equal ranking
in all ranks. Why not? Forcing the voter to rank
suppresses valuable information. Equal ranking is
*information,* just as ranking is information.

Warren went on to say much the same as I wrote
above, independently. (I hadn't read him through
before writing this commentary, I often do that.)

Warren D Smith

May 24, 2013, 1:30:08 PM
to electio...@googlegroups.com
> If temperature is a proxy for thermal energy, as
> it is, then the sum of two temperatures could be
> meaningful, if they are temperatures of the same
> body. "Same" might refer to thermal mass. I don't
> recall the specific relationships and am far too
> lazy to look them up or derive them. However, if
> we "add" an apple to the cup of coffee, they will
> reach thermal equilibrium, so the "sum" of A and
> B is actually a kind of weighted average of A and
> B. "Net combined" temperature, not "total" temperature.

--the "total temperature" of an apple and cup of coffee would by this
notion be quite different from an apple and a barrel of coffee.
I have no issue with summing "energies," but for summing "temperatures,"
sorry, not accepted.

Clay Shentrup

May 24, 2013, 1:59:44 PM
to electio...@googlegroups.com
Are Balinski and Laraki even aware of the extent to which you've refuted their baloney?

Andy Jennings

May 24, 2013, 3:00:40 PM
to electionscience
A personal definition of "intrinsic meaning":

To me, a grading scale "has more meaning" the easier it is to convert my personal, internal evaluations of the candidates into grades.  When I examine a crowded field of candidates, my evaluations are verbal, e.g. "I agree with most but not all of his positions", "I disagree with her political philosophy completely", "I think his policies would be very bad for the country".  I find it much easier to convert these into the grading scale Excellent/Very Good/Good/Fair/Poor/Reject than into the numeric scale 0-100.

On the other hand, perhaps there are those who make their personal, internal evaluations differently and find it easier to convert into numbers than adjectives.  Perhaps in terms of monetary value or some other utility value that is easier to convert into numbers than words?

A survey of the public would be better than me just explaining my personal thought processes.



I will also admit that there are some natural ways to encourage people to convert their evaluations into numbers:

1. Analyze their positions on several issues, score them from 0-10, and take a weighted average.
2. For legislators, what percentage of their votes would match my personal votes.
3. Estimate how good they would be as a percentile in some sample group.  (All American citizens/All humans who've ever lived/Everyone who I remember that has ever run for President)
4. 100 minus the percentage of the population that would have to support that person before I would be willing to go along with it (Simmons).
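
The first two conversions in the list above are easy to sketch. The function names, issue labels, weights, and vote records here are all hypothetical.

```python
def issue_weighted_score(issue_scores, weights):
    """Weighted average of per-issue scores (each 0-10), as in item 1."""
    total_weight = sum(weights.values())
    return sum(issue_scores[i] * weights[i] for i in weights) / total_weight


def agreement_percentage(their_votes, my_votes):
    """Percentage of roll-call votes matching my own, as in item 2."""
    matches = sum(1 for theirs, mine in zip(their_votes, my_votes) if theirs == mine)
    return 100.0 * matches / len(my_votes)


score = issue_weighted_score(
    {"economy": 8, "foreign policy": 4, "environment": 6},
    {"economy": 3, "foreign policy": 1, "environment": 1},
)
agreement = agreement_percentage(["Y", "N", "Y", "Y"], ["Y", "Y", "Y", "N"])
```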


~ Andy

Jameson Quinn

May 24, 2013, 4:07:45 PM
to electio...@googlegroups.com
Second try at responding; gmail ate the first try.

Warren's original email with the link was cc:ed to Balinski, Laraki, and Brams.

I think that B+L were imprecise and handwavy (with a bit of namedropping; after all, they're French), and that Warren did a good job of pointing that out. But I think that Warren isn't good at being diplomatic about things, and that makes people defensive, which doesn't actually help resolve the issues.

Stripped of B+L's overstatements and Warren's logical dissections thereof, I think there are four valid empirical questions here. In order from easiest to hardest to resolve:
1. Do people like numeric or non-numeric scales better? Andy says he prefers the latter, and I agree; but we're a completely unrepresentative sample.
2. Do numeric or non-numeric scales give better honesty / IIA?
3. Are irrational anchoring effects a bigger deal with numeric or non-numeric scales? How much of an impact does that have on result quality with different systems?
4. Are people's self-reported numeric ratings really interval-like? That is, is there more of a utility difference between (eg) 70 and 100 than between 0 and 20?

I think we need more data to definitively answer any of these questions. My intuition says that non-numeric scales are better but without data to back me up, that's worth approximately nothing.

Jameson

2013/5/24 Clay Shentrup <cl...@electology.org>
Are Balinski and Laraki even aware of the extent to which you've refuted their baloney?

--
You received this message because you are subscribed to the Google Groups "The Center for Election Science" group.
To unsubscribe from this group and stop receiving emails from it, send an email to electionscien...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Andy Jennings

May 24, 2013, 4:11:54 PM
to electionscience, Warren D Smith
I'm no expert on measurement theory in general, but according to Warren's write-up, the field certainly seems to have gone overboard.  Hopefully he's not exaggerating...

My thoughts:

1. I don't treat those four levels of measurement as exhaustive.  Surely I can come up with others.  What about measuring angles around the circle (modular arithmetic)?  It's very nearly an interval measure, but 0 minus 10 is 350, which doesn't fit.  That doesn't make the whole idea of measurement theory useless.  These four levels are still a good general framework for most measurement.

2. I don't think a measurement has to have meaning on the whole real number line to fit into one of these four categories.

3. According to other sources, "ratio measures" are a subset of "interval measures", but Warren's page just says they're a subset of "ordinal measures".  Is there a "ratio measure" that's not an "interval measure"?

4. I don't see how Warren's new category, "range", is that different from "interval".  What would be a range measure that's not an interval measure?

(Warren, I would fix these.  Otherwise it seems like you're fighting a straw man here.)

How this relates to voting:

Whether or not you agree with the whole field of measurement theory, it certainly makes sense, before you do sums and averages, to be sure you're using a grading language where sums and averages make sense.

Just because you have a collection of numbers doesn't mean you can take sums and averages.  You should make sure those numbers come from a realm where it makes sense to take sums and averages.  If they are votes, you should make sure people are treating them as an interval measure, where intervals have meaning.

If you're collecting votes on a 0-100 scale, it certainly makes sense to ask if voters are treating that as an interval measure, not just assume it (even if you told them that you're going to sum or average their votes).

Do people use the 0-100 scale as an interval measure?  Well, Mike Ossipoff recently called for a 0-100 range vote on some issue and his votes were something like:

100, 99, 98, 97, 50, 49, 48, and 0.

Did he think about the options and consciously decide that the utility difference between the things he voted as 100 and 50 was fifty times as much as the utility difference between the things he voted as 100 and 99?  If the voting scale was 0-50, would his votes have been

50, 49.5, 49, 48.5, 25, 24.5, 24, and 0

or would they have been

50, 49, 48, 47, 25, 24, 23, and 0?

I would guess the latter.  It certainly seems to me that he (a very informed voting theorist) was NOT using 0-100 as a proper interval scale, but just using the unit as an infinitesimal to distinguish between things that were almost, but not quite, equivalent.

So I think there is at least some evidence that voting with 0-100 is NOT a proper interval scale.
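
The interval-scale reading of those votes can be made concrete with a short sketch. The vote list is the one recalled above ("something like"); the helper name is an assumption.

```python
votes_0_100 = [100, 99, 98, 97, 50, 49, 48, 0]


def rescale(votes, old_max, new_max):
    """Linearly rescale votes from a 0..old_max scale to a 0..new_max scale."""
    return [v * new_max / old_max for v in votes]


votes_0_50 = rescale(votes_0_100, 100, 50)

# Ratio of the 100-vs-50 gap to the 100-vs-99 gap under an interval reading.
gap_ratio = (votes_0_100[0] - votes_0_100[4]) / (votes_0_100[0] - votes_0_100[1])
```

A voter treating the scale as a true interval measure would be expected to produce the rescaled half-point votes on a 0-50 ballot; a voter using one unit as a mere tie-breaker would keep the one-point steps instead.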

On the other hand, BL would say that "Excellent/Very Good/Good/Fair/Poor/Reject" is an ordinal scale and cannot in any way be considered an interval scale.  I disagree somewhat.  I think the evidence is mixed.  I don't think you can say that there is necessarily a midpoint between "Very Good" and "Reject", much less that it is exactly equal to "Fair".  But I think you CAN say that "Very Good" is much closer to "Excellent" than it is to "Reject".  So, especially once the finite grading language is fixed, I think there is some degree of "interval"-ness about it.  I don't think you can directly convert these grades to 5,4,3,2,1,0, but if you were careful, perhaps it would convert to 100, 90, 80, 60, 30, 0 or something.  The problem with converting these to an interval measure and summing (or averaging) is that the winner could change completely depending on your conversion.
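
A small made-up example shows how the winner can flip with the conversion. The two mappings are the ones mentioned above; the three ballots and the names are invented for illustration.

```python
GRADES = ["Excellent", "Very Good", "Good", "Fair", "Poor", "Reject"]
LINEAR = dict(zip(GRADES, [5, 4, 3, 2, 1, 0]))
CURVED = dict(zip(GRADES, [100, 90, 80, 60, 30, 0]))


def winner_by_sum(ballots, mapping):
    """Convert each verbal grade to a number and elect the highest total."""
    totals = {}
    for ballot in ballots:
        for candidate, grade in ballot.items():
            totals[candidate] = totals.get(candidate, 0) + mapping[grade]
    return max(totals, key=totals.get), totals


# Hypothetical ballots: X is polarizing, Y is uniformly "Good".
ballots = [
    {"X": "Excellent", "Y": "Good"},
    {"X": "Excellent", "Y": "Good"},
    {"X": "Reject", "Y": "Good"},
]

linear_winner, linear_totals = winner_by_sum(ballots, LINEAR)
curved_winner, curved_totals = winner_by_sum(ballots, CURVED)
```

With the 5,4,3,2,1,0 conversion X wins 10 to 9; with the 100,90,80,60,30,0 conversion Y wins 240 to 200, from exactly the same ballots.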

An interesting quote from Wikipedia (http://en.wikipedia.org/wiki/Level_of_measurement):

The use of the mean as a measure of the central tendency for the ordinal type is still debatable among those who accept Stevens' typology. Many behavioural scientists use the mean for ordinal data, anyway. This is often justified on the basis that the ordinal type in behavioural science is in fact somewhere between the true ordinal and interval types; although the interval difference between two ordinal ranks is not constant, it is often of the same order of magnitude. For example, applications of measurement models in educational contexts often indicate that total scores have a fairly linear relationship with measurements across the range of an assessment. Thus, some argue that so long as the unknown interval difference between ordinal scale ranks is not too variable, interval scale statistics such as means can meaningfully be used on ordinal scale variables. Statistical analysis software such as PSPP requires the user to select the appropriate measurement class for each variable. This ensures that subsequent user errors cannot inadvertently perform meaningless analyses (for example correlation analysis with a variable on a nominal level).

~ Andy

Warren D Smith

May 24, 2013, 4:14:39 PM
to electio...@googlegroups.com
> Stripped of B+L's overstatements and Warren's logical dissections thereof,
> I think there are four valid empirical questions here. In order from
> easiest to hardest to resolve:
> 1. Do people like numeric or non-numeric scales better? Andy says he
> prefers the latter, and I agree; but we're a completely unrepresentative
> sample.
> 2. Do numeric or non-numeric scales give better honesty / IIA?

--either is usable. It is an experimental question. There is a tremendous
amount of psychology and market-research research on this stuff over 80
years, which is rather horrible to delve into, but I've recently been
trying. Unfortunately, which works better seems to depend on the use to
which the scale is put (e.g. rating food)... there is no one universally
optimal design...

> 3. Are irrational anchoring effects a bigger deal with numeric or
> non-numeric scales? How much of an impact does that have on result quality
> with different systems?

--and not only that, it may depend on the instructions given. Example: on a
0-100 scale, if instructions explicitly say "50 is neutral, below 50 is
dislike" that might cause different behavior than a 0-100 scale with no
explicit mention of "50" in the instructions.

> 4. Are people's self-reported numeric ratings really interval-like? That
> is, is there more of a utility difference between (eg) 70 and 100 than
> between 0 and 20?


>
> I think we need more data to definitively answer any of these questions. My
> intuition says that non-numeric scales are better but without data to back
> me up, that's worth approximately nothing.
>
> Jameson
>
> 2013/5/24 Clay Shentrup <cl...@electology.org>
>
>> Are Balinski and Laraki even aware of the extent to which you've refuted
>> their baloney?


--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking
"endorse" as 1st step)

Warren D Smith

May 24, 2013, 4:27:24 PM
to electio...@googlegroups.com
> Stripped of B+L's overstatements and Warren's logical dissections thereof,
> I think there are four valid empirical questions here. In order from
> easiest to hardest to resolve:
> 1. Do people like numeric or non-numeric scales better? Andy says he
> prefers the latter, and I agree; but we're a completely unrepresentative
> sample.
> 2. Do numeric or non-numeric scales give better honesty / IIA?

--either is usable. It is an experimental question. There is a
tremendous amount of psychology and market-research literature on
this stuff, going back over 80 years, which is rather horrible to
delve into, but I've recently been trying. Unfortunately, which works
better seems to depend on the use to which the scale is put (e.g.
rating food quality, or films, or toilet paper)... there is no one
universally optimal design...

> 3. Are irrational anchoring effects a bigger deal with numeric or
> non-numeric scales? How much of an impact does that have on result quality
> with different systems?

--and not only that, it may depend on the instructions given. For
example, on a 0-100 scale, if the instructions explicitly say "50 is
neutral, below 50 is dislike," that might cause different behavior
than the same 0-100 scale with no explicit mention of "50" in the
instructions.

> 4. Are people's self-reported numeric ratings really interval-like? That
> is, is there more of a utility difference between (eg) 70 and 100 than
> between 0 and 20?

--on a standard 9-point food quality scale, the old design was the
digits 1-9 associated with words, but more recently it has been
claimed that it is better to use certain intentionally non-equispaced
numbers associated with the words, with the more extreme words spaced
further apart. "Better" here means you get a faster approach to a
normal distribution (which I doubt is the same as "better" for
voting-system purposes; in fact it may be the opposite).

> I think we need more data to definitively answer any of these questions. My
> intuition says that non-numeric scales are better but without data to back
> me up, that's worth approximately nothing.

--it may well be that verbal is better in some senses, BUT there are
some other issues, such as: if you want voting to work well for voters
who do not all speak English well, and want conciseness (short
ballots) and speed, then numbers have advantages; also, since in
average-based range voting the numbers actually are the underlying
truth, there is something to be said for using them. In the Balinski
MJ method the numerical values do not actually matter; only the
ordering of the scale points matters (as they well know).
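
A minimal sketch of that last point, assuming made-up ballots and a
plain median as a stand-in for MJ (real majority judgment adds
tie-breaking rules that are not modeled here). The names `ballots`,
`relabel`, and `winner` are hypothetical. Relabeling the scale points
with different but order-preserving numbers can flip the average-based
winner, while the median-based winner stays put:

```python
from statistics import mean, median

# Hypothetical ballots: each list holds one candidate's grades on a 0-4 scale.
ballots = {
    "A": [4, 4, 2, 2, 0],
    "B": [3, 3, 3, 1, 1],
}

# An order-preserving but non-equispaced relabeling of the five scale points
# (an assumption chosen for illustration, not taken from any real ballot).
relabel = {0: 0, 1: 1, 2: 2, 3: 6, 4: 7}

def winner(scores, stat):
    """Return the candidate with the highest value of the given statistic."""
    return max(scores, key=lambda c: stat(scores[c]))

relabeled = {c: [relabel[g] for g in gs] for c, gs in ballots.items()}

mean_before, mean_after = winner(ballots, mean), winner(relabeled, mean)
median_before, median_after = winner(ballots, median), winner(relabeled, median)
print(mean_before, mean_after)      # average-based winner changes
print(median_before, median_after)  # median-based winner does not
```

With an odd number of ballots the median is always one of the grades
actually cast, which is why only the ordering of the labels matters.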

The vastness of the market-research etc. literature and the huge
number of years and amount of money spent by those guys researching
this stuff mean that we will be unable to compete much with it, hence
are best off just finding out about it. However, there may be some
questions that are specifically voting-oriented and not well addressed
by research oriented toward other goals; that'd be the only area in
which we could compete.

There are journals with titles like "market research," "opinion research,"
and "psychometrics"...

Jameson Quinn

May 24, 2013, 4:39:27 PM
to electio...@googlegroups.com
Minor point:

> --it may well be verbal is better in some senses, BUT there are some
> other issues, such as: if you want voting to work well for voters who
> do not all speak English well,
> and want conciseness (short ballots) and speed, then numbers have
> advantages;

The alphabet is about as good as numbers for that. Even the small minority of people who are not literate in the Latin alphabet can probably figure out a series of bubbles with ABCDF just as easily as one with 01234. And for those who have gone to school with letter grades, there's probably less of a risk of getting things backwards (is 1 better or worse than 4?).

Warren D Smith

May 24, 2013, 4:53:13 PM
to electio...@googlegroups.com
--I would recommend a graphic-aided scale something like
dislike----------------------------like
.........0 1 2 3 4 5 6 7 8 9......

and then it is pretty hard to be confused re direction.
And the "like" could be made a smiley face if you do not want to use a word...

but the thing is, if you do want to use words, then the law must
specify the exact words, since allowing variation would allow biasing
the vote by using word choice X in region X of the country, etc.

So that's a can of worms.

Bruce Gilson

May 24, 2013, 5:23:55 PM
to electionscience Foundation


In fact, if someone is familiar with preferential voting (as in Australia; Cambridge, Mass.; etc.) they may assume 1 is best.

But there are all sorts of interferences possible.

Getting away from voting, but still a relevant issue, I point to grade-point averages. Most colleges convert letter grades to a 0-4 scale to compute GPAs. When I went to the City College of New York (a long time ago; they may well have changed it) the scale was, however, from -2 to 2, because C was the minimum average to get a BA/BS degree, and making that the zero point made some sense. As a result, psychologically to me, when I hear a 1.5 GPA, it sounds good to me, because I think of that as what most schools would call a 3.5. I have to work at it to rescale the numbers.

This different scale even affected the grades professors gave. It was very rare to have a professor give a D grade to a student, because it was a pass for the course but still a negative figure for the GPA. I imagine that where a 0-4 scale is used, D grades are considerably more common.

Somewhat later (still decades ago, so again it may have changed) I taught at Rutgers University. They used an inverted scale (rather than A, B, C, ..., they used 1, 2, 3, ..., though it did not quite correspond to the usual ABCDF because there were 7 grades: 1 to 5 were passing, 6 and 7 failing). Here again, one's mind has to work at evaluating such GPAs, if one's mental picture is accustomed to a more conventional scale.

I suppose the fact that confusion like this can arise is why I prefer words, like "strongly support/support/neutral/oppose/strongly oppose" -- though this will not work if you have more than a small number of options.

Andy Jennings

May 24, 2013, 5:34:12 PM
to Warren D Smith, electionscience
> Do people use the 0-100 scale as an interval measure?  Well, Mike Ossipoff
> recently called for a 0-100 range vote on some issue and his votes were
> something like:
>
> 100, 99, 98, 97, 50, 49, 48, and 0.
>
> Did he think about the options and consciously decide that the utility
> difference between the things he voted as 100 and 50 was fifty times as
> much as the utility difference between the things he voted as 100 and 99?
> If the voting scale was 0-50, would his votes have been
>
> 50, 49.5, 49, 48.5, 25, 24.5, 24, and 0
>
> or would they have been
>
> 50, 49, 48, 47, 25, 24, 23, and 0?
>
> I would guess the latter.  It certainly seems to me that he (a very
> informed voting theorist) was NOT using 0-100 as a proper interval scale,
> but just using the unit as an infinitesimal to distinguish between things
> that were almost, but not quite, equivalent.
>
> So I think there is at least some evidence that voting with 0-100 is NOT a
> proper interval scale.

--it definitely is not an interval scale, since it is bounded.
Incidentally, the name 'interval' selected by Stevens was a very bad
choice of word. I re-use it, but hate it.

There's no need to get overly hung up on the terminology. I'm saying that it doesn't look like Mike was even using 0-100 as a proper range measure, and if voters aren't using the grading scale as a range measure, then it doesn't make sense to average the votes.

That is, I think changing the scale from 0-100 to 0-50 should not change the outcome. And it might change the outcome if voters are not using the grading scale as a proper range measure.
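
A rough sketch of the two behaviors, using the votes quoted above. The function names and the "nearest anchor" model of the unit-step behavior are assumptions made for illustration, not anything a voter necessarily does. Under a linear rescale the 100-to-50 gap stays fifty times the 100-to-99 gap; under the unit-step reading it drops to twenty-five times, so averages computed on the two scales are no longer comparable.

```python
def rescale_linear(votes, old_max, new_max):
    """Interval-style behavior: every score shrinks by the same factor."""
    return [v * new_max / old_max for v in votes]

def rescale_unit_steps(votes, old_max, new_max):
    """Assumed 'unit as infinitesimal' behavior: the anchors (bottom, middle,
    top) scale linearly, but offsets from the nearest anchor stay whole units."""
    anchors = [0, old_max / 2, old_max]
    out = []
    for v in votes:
        nearest = min(anchors, key=lambda a: abs(v - a))
        out.append(nearest * new_max / old_max + (v - nearest))
    return out

def gap_ratio(votes):
    """Ratio of the top-to-middle gap to the top-to-second gap."""
    return (votes[0] - votes[4]) / (votes[0] - votes[1])

votes_100 = [100, 99, 98, 97, 50, 49, 48, 0]
linear_50 = rescale_linear(votes_100, 100, 50)
stepped_50 = rescale_unit_steps(votes_100, 100, 50)

print(linear_50, gap_ratio(linear_50))
print(stepped_50, gap_ratio(stepped_50))
```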

~ Andy

Abd ul-Rahman Lomax

May 24, 2013, 6:39:11 PM
to electionscience
At 02:00 PM 5/24/2013, Andy Jennings wrote:
>A personal definition of "intrinsic meaning":
>
>To me, a grading scale "has more meaning" the easier it is to
>convert my personal, internal evaluations of the candidates into
>grades. When I examine a crowded field of candidates, my
>evaluations are verbal, e.g. "I agree with most but not all of his
>positions", "I disagree with her political philosophy completely",
>"I think his policies would be very bad for the country". I find it
>much easier to convert these into the grading scale Excellent/Very
>Good/Good/Fair/Poor/Reject than into the numeric scale 0-100.

Sure. People will vary. I will assert, however, that this is no
easier than using a numerical scale, with a decent algorithm. And you
will end up doing the same work. The way you decide is strategically
disempowering, and probably not efficient. That's your choice; just
realize that this is *not* standard Range voting strategy, and the
methods using grading scales are simply versions of Range voting,
specifically median range, which does not generally differ greatly
from average range.

>On the other hand, perhaps there are those who make their personal,
>internal evaluations differently and find it easier to convert into
>numbers than adjectives. Perhaps in terms of monetary value or some
>other utility value that is easier to convert into numbers than words?

No, that is *not* easy. What is easiest is pairwise comparison.
Pairwise comparison can use instinct and intuition, and almost
certainly does. I.e., a much larger set of neuronal patterns is used,
including many that are outside of consciousness and that will "know"
things we don't consciously perceive. Do we like A or B better? As we
get more sophisticated, we might notice *reasons* why we prefer one
to the other, that are fundamentally irrelevant, and distinguishing
these is useful. I.e., A has more hair than B. Which might not even
be real! But ... I don't define any of these semi-automatic reactions
as "wrong." After all, if B knows that people will prefer people with
more hair, could B wear a hair piece? I'm not saying he or she
should, but it is a marker of two things: character and social
integration, both of which might actually be important.

Here is the algorithm I propose as being simple to use, for use with
any Range system. There are really two systems, but the first one is
the start. It is a zero-knowledge strategy, it is purely sincere and
does not consider election probabilities. In some situations, it can
be skipped where there are two clear frontrunners.

1. Rank all the candidates into a list with as many positions (or
more) as there are candidates.
2. It may be easiest to start with favorites, but one can also list
"really bad" candidates at the bottom of the list, moving up.
3. If it is difficult to rank two candidates, your preference
strength is low, so consider, then, ranking them as equal. This would
be a sincere vote. You may later distinguish them if that matters to you.
4. Does the list contain more candidates than range ratings available?
5. If so, then start to collapse candidates to single ratings. Pick
the ones with the most difficulty in ranking them. Would you be about
equally pleased if either was elected? Or any one of the set you are
now bringing into the same rating?
6. When you have collapsed the list to the number of ratings, you
will have groups of candidates with equal preference strength between
them. If that seems to understate a preference strength between a
pair, then, spread the candidates apart, which will require more
collapse of others into single categories. Commonly, people may
collapse undesirable candidates to the bottom rating.
7. Where to rank unknown candidates is an issue for some. If it's
true that you would rather have "Anybody but Ralph," then you might
place them a bit higher than Ralph! In a Range system, you can
literally vote "anybody but Ralph" by top-rating all other
candidates. But most people won't do that... (many range systems will
bottom-rank unrated candidates.)
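
The ranking-and-collapsing steps above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
voter is assumed to supply a numeric "gap" (preference strength)
between each pair of adjacent candidates, the function name
`collapse_to_ratings` and the sample names are hypothetical, and the
spreading-out adjustment of step 6 is not modeled.

```python
def collapse_to_ratings(ranked, gaps, levels):
    """ranked: candidates from best to worst.
    gaps[i]: assumed preference strength between ranked[i] and ranked[i+1].
    levels: number of distinct ratings the ballot offers.
    Returns {candidate: rating}, with the top group rated levels - 1."""
    # Start with every candidate in a group of its own (steps 1-3).
    groups = [[c] for c in ranked]
    gaps = list(gaps)
    # Steps 4-5: while there are too many groups, merge the adjacent pair
    # that is hardest to tell apart (smallest gap; first one on ties).
    while len(groups) > levels:
        i = min(range(len(gaps)), key=gaps.__getitem__)
        groups[i] += groups.pop(i + 1)
        gaps.pop(i)
    # Assign ratings from the top down, one level per group.
    return {c: levels - 1 - g for g, group in enumerate(groups) for c in group}

ratings = collapse_to_ratings(
    ranked=["Ann", "Bob", "Cy", "Dee", "Ralph"],
    gaps=[1, 5, 2, 8],   # made-up strengths: Ann~Bob and Cy~Dee are close
    levels=3,            # a 0-2 ballot
)
print(ratings)
```

With fewer candidates than rating levels nothing is merged, and the
bottom candidate then lands above the lowest rating; a voter would
presumably spread the groups out by hand in that case.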

If the system uses names, the number of ratings is the number of
category names, and that list would translate directly. In some
systems, "approval" will be important. The list above may be adjusted
to place all unapproved candidates below the approval cutoff, and all
approved candidates at or above it. It's basically two lists. If this
is a range system where "approved" has meaning, it is probably a
runoff system. So "approved" would mean that the voter prefers
electing the candidate to a runoff being held, it's that simple.
That's a pairwise choice.

Then comes strategic voting. Start with the list generated as above.
Consider the frontrunners. If there are two or more, push the
preferred frontrunner to the top rating, if the frontrunner is not
already there. If your favorite is not a frontrunner, there is a
choice to be made. There will be a cost to maintaining strict
preference instead of equal-rating the preferred frontrunner. That
cost in a basic Range system is small, if there is adequate
resolution. In a Bucklin system using a Range ballot, there is very
little risk if one elevates the frontrunner to the second rank.
Similar concerns and choices can be made with the worst frontrunners.
Other candidates, as well as a *middle* frontrunner, can be "spread
out" or "compressed" across the range, based on the original list.
Essentially, to spread, separate out distinguishable preferences, to
compress, collapse more to equal rating.
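
A small sketch of that adjustment, under the assumption that the
voter simply moves the preferred frontrunner to the top rating and
the least-preferred frontrunner to the bottom, leaving everyone else
where the sincere list put them. The function name
`strategic_ratings` and the sample ratings are hypothetical, and this
takes the equal-rating option rather than keeping a strict preference
for the favorite.

```python
def strategic_ratings(sincere, frontrunners, top, bottom=0):
    """sincere: {candidate: rating}; frontrunners: names considered viable.
    Returns a copy with the best-liked frontrunner at `top` and the
    least-liked frontrunner at `bottom`."""
    ratings = dict(sincere)
    best = max(frontrunners, key=sincere.__getitem__)
    worst = min(frontrunners, key=sincere.__getitem__)
    if sincere[best] > sincere[worst]:  # only act on a real preference
        ratings[best] = top
        ratings[worst] = bottom
    return ratings

sincere = {"C": 9, "A": 6, "B": 3, "D": 0}   # made-up 0-9 ratings
adjusted = strategic_ratings(sincere, frontrunners=["A", "B"], top=9)
print(adjusted)
```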

It doesn't have to be perfect. This is only one vote. All these votes
are *sincere*. If the real choice is between A and B, and I prefer A
to B, and my preference for C is just something that I want to
express, perhaps to encourage C for the future, to influence major
parties, etc., it is not "insincere" to increase the weight tossed in
the A basket and decrease the weight tossed in the B basket. It's
making a realistic choice. Voting "sincerely" -- i.e., purely as in
the zero-knowledge rating system I described -- is simply tossing
away some of my voting power. If I really have low overall preference
strength, I might do that. I'd only vote in such an election, though,
if I happened to be there!

I've described this process to point out how to develop a preference
profile that would give "fully sincere" ratings, as one set, and
"strategic ratings" presumably based on knowledge of probabilities.
At only one point in this process was any absolute *meaning* to the
ratings considered, and that was if "approval" was an issue. It is
not an issue in pure range elections, and people who describe
Approval as being about approving candidates in an absolute sense --
which includes supporters of Approval -- have greatly confused the
issues. As a voting system, approval is not a sentiment, it's a vote,
an exercise of power. I might vote for a candidate whom I don't
approve of, as to my sentiment, and not vote for a candidate whom I
do approve of, in an absolute sense. *That depends on context*, on
the choices I have.

>A survey of the public would be better than me just explaining my
>personal thought processes.

I see the responsibility of CES to both educate the public and to test
public response to the education. I would personally have difficulty
using the letter grade idea, or the French system, precisely because
my associations with those names would confuse the ranking. Ranking
the candidates first, before adjusting votes, requires no specific
category identifications, they are simply ranks. I don't ever have to
decide if a candidate is "Good" or not, but only comparative
rankings. In a single-round system, whether I think the candidate is
"Good" or not is only meaningful by comparison. In a two-round
system, there is a choice to be made, based on a comparison as I
described, and that will push all the ratings so that the approval
cutoff is the election cutoff.

>I will also admit that there are some natural ways to encourage
>people to convert their evaluations into numbers:
>
>1. Analyze their positions on several issues, score them from 0-10,
>and take a weighted average.

People can use systems like this, but most elections are really based
on a complex judgment of character. I've used multiple issue rating
systems in an attempt to make "rational personal decisions," but what
I really do is to use such a system, decide if I like the answer or
not, and then vote what I like, having been informed by the process.
I may vote differently.
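
For reference, the weighted-average idea in item 1 amounts to
something like the sketch below. The issue names, scores, and weights
are invented for illustration, and `weighted_score` is a hypothetical
helper.

```python
def weighted_score(scores, weights):
    """Weighted average of per-issue scores (0-10), weights need not sum to 1."""
    total = sum(weights.values())
    return sum(scores[issue] * w for issue, w in weights.items()) / total

scores = {"economy": 8, "health": 5, "foreign": 2}    # made-up 0-10 scores
weights = {"economy": 2, "health": 1, "foreign": 1}   # made-up importance
overall = weighted_score(scores, weights)
print(overall)
```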

>2. For legislators, what percentage of their votes would match my
>personal votes.

I'm not so convinced that I'm right; after all, for each one of their
legislative votes, they have staff to advise them and they have
themselves studied the issue -- if they are responsible. And
responsibility is a character issue, and also their ability to choose
staff wisely, etc. Actual voting can differ from position based on
many factors. In other words, it's entirely possible that were I in
their shoes, I'd vote as they do, and it only seems different because
I am *not* in their shoes.

Hence, Andy, you can easily guess why I love Asset. There is *one
decision* to make, and it's actually one where I might have the most
reliable personal information. The vote will not be wasted, the
complexities of Range voting (and the Asset equivalent) are not
needed, my sense is that the system will work better without them.
(Warren's analysis of range-Asset essentially assumes the existing
system, and Warren has not considered *personal responsibility* and
*communication* issues.)

>3. Estimate how good they would be as a percentile in some sample
>group. (All American citizens/All humans who've ever lived/Everyone
>who I remember that has ever run for President)

Jeez, you like to make it complicated!

>4. 100 minus the percentage of the population that would have to
>support that person before I would be willing to go along with it (Simmons).

Interesting. I have not examined that. It is a strategic
consideration that could work with many candidates and zero
knowledge. Except that I would find it very difficult to judge this.
My sensor doesn't work with percentages. Too many things to think of at once.

The two-round strategy I mentioned pretty much approximates this.

Warren D Smith

May 24, 2013, 5:40:06 PM
to Andy Jennings, electionscience
There was at least one study where it was shown that a linear
transform worked well to convert responses on one scale (I think 5
points) to another (I think 7), thus proving pollees were reasonably
sane.
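
A sketch of such a linear transform, assuming both scales start at 1
(1-5 and 1-7); the study's actual endpoints are not given here, and
`convert_scale` is a hypothetical helper.

```python
def convert_scale(x, old_min, old_max, new_min, new_max):
    """Map x linearly so old_min -> new_min and old_max -> new_max."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

# Each point of an assumed 1-5 scale mapped onto an assumed 1-7 scale.
five_to_seven = [convert_scale(x, 1, 5, 1, 7) for x in range(1, 6)]
print(five_to_seven)
```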

The behavior AJ just mentioned re using +1 as an "infinitesimal" is, I
would think, a good reason to prefer larger scales. With computer
graphical gizmos it becomes possible to use a slider to get an
"infinite" scale... several recent papers have investigated that...


--
Warren D. Smith

Clay Shentrup

May 24, 2013, 5:52:46 PM
to electio...@googlegroups.com
On Friday, May 24, 2013 1:07:45 PM UTC-7, Jameson Quinn wrote:
> I think that Warren isn't good at being diplomatic about things, and that makes people defensive, which doesn't actually help resolve the issues.

Normally I'd say you're right. But these people are researcher/mathematicians. They should be able to put their egos aside and look at objective facts.

Abd ul-Rahman Lomax

May 24, 2013, 7:00:39 PM
to electio...@googlegroups.com
Warren, as a dualistic thinker, you don't recognize when I agree with
you. What does it *mean* to "sum temperatures"? It is generally undefined.

I did not propose "total temperature" as meaningful with the kind of
example you were using. So what did you "not accept"?

Your general position is correct, it seems to me, though I have not
personally verified it in detail.

The example I gave showed that "summing" the objects by adding one to
the other, literally, would produce a combined temperature that was
intermediate between the two separate temperatures. It's certainly
not numerical summation. So your example of temperature as showing
that there can be a meaningful quantity that is not summable was cogent.
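
To make the physical point concrete, here is a sketch assuming two
bodies with constant heat capacities, no heat lost to the
surroundings, and no phase change; `mixed_temperature` and the sample
numbers are hypothetical. The combined temperature is a
capacity-weighted average, so it always lies between the two starting
temperatures rather than being their sum.

```python
def mixed_temperature(t1, c1, t2, c2):
    """Equilibrium temperature of two bodies with heat capacities c1 and c2
    brought into contact (idealized: no losses, no phase change)."""
    return (c1 * t1 + c2 * t2) / (c1 + c2)

equal_mix = mixed_temperature(20.0, 1.0, 80.0, 1.0)    # equal capacities
unequal_mix = mixed_temperature(20.0, 2.0, 80.0, 1.0)  # cooler body is larger
print(equal_mix, unequal_mix)
```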

Abd ul-Rahman Lomax

May 24, 2013, 7:06:06 PM
to electio...@googlegroups.com
At 12:59 PM 5/24/2013, Clay Shentrup wrote:
>Are Balinski and Laraki even aware of the extent to which you've
>refuted their baloney?

If you had responded to Warren with "Reply All," you'd know the
answer to your question, or, at least, that both have the opportunity
of being informed, assuming they read their email. As to whether or
not they are "aware" of the "extent," that's quite a difficult
psychological question, eh? Might involve mind-reading.

Jameson Quinn

May 24, 2013, 6:08:40 PM
to electio...@googlegroups.com


2013/5/24 Clay Shentrup <cl...@electology.org>

> On Friday, May 24, 2013 1:07:45 PM UTC-7, Jameson Quinn wrote:
>> I think that Warren isn't good at being diplomatic about things, and that makes people defensive, which doesn't actually help resolve the issues.
>
> Normally I'd say you're right. But these people are researcher/mathematicians. They should be able to put their egos aside and look at objective facts.

Sure. They should. But if you think academics, especially French academics, let their ego get in the way any less than the average person, I have an Arc de Triomphe to sell you. 

Jameson