The statistics include the top and bottom scores and the standard
deviation of the scores that each game received, and the number of
votes cast for each game.
Hmm. Convince me of why.
I don't have any problem with a few people being very generous or very
grumpy. That's what averaging is for.
(I wanted to add that if one person gives every game a 1, it doesn't
affect the final scores at all. That's not quite true; it
disproportionately affects games that fewer people voted on. But the
skewing is still small.)
Yes, I believe that's how it's done in sports: they drop the one highest and
the one lowest score and then average the remaining ones together. I'm not
sure if this is always statistically valid, but there's probably some
justification for it (e.g., one judge rating all the competitors from his/her
country higher than anyone else).
>Hmm. Convince me of why.
Well, if you don't mind some book quoting, I'll give an example. :) The
book is _An_Introduction_To_Error_Analysis_ by John R. Taylor, chapter 6.
Actually, this will be a paraphrasing because I just have my notes on the
book, not the actual book (I don't own it). He uses Chauvenet's criterion
for rejection of data points and demonstrates its use by example. I'm going
to assume you understand the basics talked about in here and not explain
everything; if you don't and want more, tell me and I should be able to help.
Say you have six measurements: 3.8, 3.5, 3.9, 3.9, 3.4, 1.8 and all are
legitimate. Then the average/mean is 3.4 and the standard deviation (sigma)
is 0.8. The 1.8 measurement differs from the mean (3.4) by 1.6 or two
standard deviations. Using Gaussians, the probability of a measurement being
outside 2*sigma is P(outside 2*sigma) = 1 - P(inside 2*sigma) or 1-0.95=0.05;
i.e., 5% or 1 in 20 measurements. With only six measurements we expect
0.05*6=0.3 or 1/3 of a measurement as bad as the 1.8 observed. If 1/3 of a
measurement is considered "ridiculously improbable" then we can reject the
1.8 measurement.
Chauvenet's criterion, as normally given, states that if the expected number
of measurements at least as bad as the suspect measurement is less than 1/2,
then the suspect measurement should be rejected. Obviously the choice of
1/2 is arbitrary; but it is also reasonable and can be defended.
No, I don't have how the 1/2 can be defended. :)
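For anyone who wants to try the numbers, here is a minimal Python sketch of the check described above. The function name chauvenet_expected_count is made up for illustration. It assumes Gaussian errors and uses the N-1 (sample) standard deviation, which is what gives the 0.8 quoted above. Since it works with the unrounded mean and sigma, the expected count comes out close to, but not exactly, the 0.3 in the text.

```python
import math
import statistics

def chauvenet_expected_count(data, suspect):
    """Expected number of measurements at least as far from the mean as `suspect`."""
    mean = statistics.mean(data)
    sigma = statistics.stdev(data)           # sample (N-1) standard deviation
    t = abs(suspect - mean) / sigma          # distance from the mean in sigmas
    p_outside = math.erfc(t / math.sqrt(2))  # two-sided Gaussian tail probability
    return len(data) * p_outside

measurements = [3.8, 3.5, 3.9, 3.9, 3.4, 1.8]
expected = chauvenet_expected_count(measurements, 1.8)
reject = expected < 0.5   # Chauvenet's threshold of 1/2
print(round(expected, 2), reject)
```

The 1/2 threshold is hard-coded here only because that is the usual statement of the criterion; it is as arbitrary in code as it is on paper.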
>I don't have any problem with a few people being very generous or very
>grumpy. That's what averaging is for.
True, but good statistics does more than just a straight average. I'm not
claiming to be a statistics master or anything, but if you've got some
suspect deviant points and a valid reason to reject them (e.g., Chauvenet's
criterion), it's probably better to toss them out. But you do need to have
a valid reason for rejecting the data, be it purely mathematical/statistical
as you might do for voting like this or some physical reason in the case of
say some scientific research. You can't just toss data you don't like :)
but if you have a good reason for rejecting it, you probably should.
>(I wanted to add that if one person gives every game a 1, it doesn't
>affect the final scores at all. That's not quite true; it
>disproportionately affects games that fewer people voted on. But the
>skewing is still small.)
I think that would depend on the number of votes you're talking about.
I don't know what normal is for comp games, but the smaller the number of
votes, the greater the skewing will be due to an effect like you mention.
50 votes at 7 points and 1 vote at 1 point: average = 6.882
20 votes at 7 points and 1 vote at 1 point: average = 6.714
10 votes at 7 points and 1 vote at 1 point: average = 6.455
5 votes at 7 points and 1 vote at 1 point: average = 6.000
Ironically, the last choice has 6 data points and 1 suspect point 2*sigma
away from the mean (average) so it fits the example of Chauvenet's criterion
above and can be thrown out. Note that down at that small a number of
votes (6) the one deviant point has dropped the average by one full point
by being included; that could be highly significant in the voting. With high
enough numbers of votes like 51, that one 1 point vote is almost certainly
statistically acceptable and could be kept in that case. For the middle
ranges of votes, you'd probably have to check to see if the one 1 vote is
statistically acceptable. As you may have gathered, I don't know the
details of the competition voting, but hopefully this post is of some use
in the discussion about rejecting bad data. :)
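The averages in the table above are easy to reproduce. This is only a sketch; the function name and the default scores (7 for everyone else, 1 for the lone low voter) are taken from the example, not from the real competition data.

```python
def average_with_one_low_vote(n_high, high=7, low=1):
    """Mean score when n_high voters give `high` and one voter gives `low`."""
    return (n_high * high + low) / (n_high + 1)

for n in (50, 20, 10, 5):
    avg = average_with_one_low_vote(n)
    print(f"{n:2d} votes at 7 plus one 1: average = {avg:.3f}, drop = {7 - avg:.3f}")
```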
The problem I have with this analysis is that it confuses two completely
In the domain of scientific measurement this sort of logic is perfectly
reasonable. There's some single, objective truth out there and you're
trying to find out what it is. Measurements that disagree with the
facts are just plain wrong, and should be discarded.
In this context though, we're talking about artistic judgement, and here
there is no single, objective truth. There is no objective basis for
saying "this opinion is wrong".
Instead, applying your analysis we would end up saying that the sixth
person's opinion is "bad" and not "acceptable" purely and simply because
he strongly disagrees with the other five people. This is a form of
argument that strikes me as being itself unacceptable.
"To summarize the summary of the summary: people are a problem."
James Marshall wrote:
Lots of stuff on statistics.
> (I wanted to add that if one person gives every game a 1, it doesn't
> affect the final scores at all. That's not quite true; it
> disproportionately affects games that fewer people voted on. But the
> skewing is still small.)
Even if all games have the same number of votes, higher rated games will
be proportionately affected more than lower rated games. The order of
the results won't change but the relative scoring will. A game scoring
an 8 is rated (appreciated) twice as high as a game scoring a 4. After
adding the *ones* to the calculation the first game will be rated *less*
than twice as high as the other one. (Naturally, the skewing will again
>Instead, applying your analysis we would end up saying that the sixth
>person's opinion is "bad" and not "acceptable" purely and simply because
>he strongly disagrees with the other five people. This is a form of
>argument that strikes me as being itself unacceptable.
. . .we also have to consider that our goal is to rank the games based on
how much the judges, as a group, liked them. The issue is: how do we arrive at
the group decision, given only the individual decisions. The current answer
is: average them.
I'm still pondering over this myself.
Brendan B. B. (Bren...@aol.com)
(Name in header has spam-blocker, use the address above instead.)
"Do not follow where the path may lead;
go, instead, where there is no path, and leave a trail."
It seems to me that whatever method is used must give equal weight to
every judge's opinion; I can't think of any better way than averaging to
do that. (Doesn't necessarily mean there isn't one, of course.)
I don't think anything *matters* except the order of the results.
I haven't heard anyone say "Yay, I got a 6.43, that's nearly twice as high
as _Pass The Banana_!" Most years, in fact, the numbers haven't even been
released. Only the rankings.
Agree, it's more of a theoretical than a practical argument. On the
other hand, if such a 1 rater emerges during the 99 competition, the
games will rate too low relative to comp 98 games. It may reflect badly on
the quality of comp 99. Worse, it may also (partly) hide the progress in
the writing of a person who entered in both comps. Of course with about
100 voters a game it will be only a minor distortion so perhaps again a
mainly theoretical argument but it will still yield a not desirable
There's always the potential of a disgruntled author or someone who's just
pissed at some of the authors to sabotage their rankings with a 1 or the
opposite case handing out free 10's to friends. These charity scores have much
less effect on the top rated games than the first case, which is more typical
Not that the sky is falling and there's a bunch of sociopathic reviewers out
there... but it's common practice with any subjective sample of decent size to
account for human nature. The farther a vote is from the mean, the more weight
it holds so the top rated games are the most vulnerable to losing places. To
finish with a 7 average you need three 9's or six 8's to make up for a 1 vote.
Either way, that's a lot of weight for the top contenders to have to give up.
It would take spamming to get a bad game into the top echelon, and that should be
very easy to spot.
It would be very interesting to see graphs of the top ten and compare them with
bell curves or the graph of all_votes.
Just because everyone can vote does not mean that all votes are equal.
> Agree, it's more of a theoretical than a practical argument. On the
> other hand, if such a 1 rater emerges during the 99 competition, the
> games will rate too low relative to comp 98 games. It may reflect bad on
> the quality of comp 99.
I suspect most judges don't have an absolute judging scale that they use every
year exactly the same way. For example, I decided to boost several scores
after I'd finished playing this year because my highest score was low enough
that I could split up some of the games I'd played earlier in the judging
and had ranked as (say) 5's, since I didn't know I'd have room to give the
better ones 6. Nothing this year (for me) was competing against, say,
Photopia except that several made me think "Photopia did this sort of thing
better last year, and I wasn't that wild about it then. This isn't going to
score that well." However, comparisons to IF that I'd played in the past
occurred to me every year I've judged the comp, and not just with older comp
games. (I don't even recall what the exact score I gave Photopia was, which
would kill an absolute scale right there).
Kevin Lighton lig...@bestweb.net or shin...@operamail.com
"Townsfolk can get downright touchy over the occasional earth-elemental in
the scullery. Can't imagine why..." Quenten _Winds of Fate_
> least. It's probably because hardly anybody voted on it. The other
> entries tended to get around 100 votes apiece.
Except the MS-DOS ones (mine included) which received around 50 votes (less
than half of what looks like the "average" number of votes). This reinforces
something people have periodically stated: the PC-Only games reach a smaller
audience, at least within the scope of R.*.I-F and the competition.
> Just because everyone can vote does not mean that all votes are equal.
At cgi-resources.com (where a couple of my scripts are high on their
respective lists) they disregard the top 10% and bottom 10% of the votes.
This probably arose at some point from looking at the stats and making some
comparisons between the votes from different people -- or, it might have
just been a safeguard. I don't know what the end result is.
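For what it's worth, dropping the top and bottom 10% is usually called a trimmed mean. Below is a rough Python sketch. I don't know how cgi-resources.com actually rounds the 10%, so rounding down is purely my assumption, and the votes are made up.

```python
def trimmed_mean(votes, fraction=0.10):
    """Average after dropping the lowest and highest `fraction` of the votes."""
    ordered = sorted(votes)
    k = int(len(ordered) * fraction)      # votes dropped at each end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

votes = [1, 5, 6, 6, 7, 7, 7, 8, 8, 10]   # made-up votes for one game
print(sum(votes) / len(votes))   # plain average
print(trimmed_mean(votes))       # average without the single lowest and highest vote
```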
For the IF competition, I don't think it's necessary. I'm willing to accept
extreme high and low votes as part of the overall score, because I have to
think that people casting those votes honestly had that high or low of an
opinion of my game. I could be naive.
If I had wanted to win the competition, it would have been easy enough to
do. My online Lunatix game has 300 players (at present). By posting this
"Hey guys! Get TWO FREE MONTHS of play time. All you have to do is to go to
www.textfire.com and download the IF-Competition games, play at least 5 and
vote on them. Make sure you play "The Insanity Circle" and help support us!
Let us know after you've voted, and we'll add the free time to your
But that would have been asinine and would have defeated the reason I was
entering the contest - to get honest feedback and see how well my game
compares overall to the others. I suspect this is important to the other
authors as well. Many entered under an alias or anonymous for this reason.
The IF competition isn't a Mr./Mrs. Popularity contest, and if it were, it
would be pointless.
I have to think that most (if not all) votes were honest. With that in mind,
I wouldn't want to ignore those opinions, however bad (or good) they might
be. Now, if there actually *was* vote-fixing going on, then that's another
problem entirely and I'm not sure disregarding the bottom/top extremes would
be the right solution (not sure what it *would* be though).
> For the IF competition, I don't think it's necessary. I'm willing to accept
> extreme high and low votes as part of the overall score, because I have to
> think that people casting those votes honestly had that high or low of an
> opinion of my game. I could be naive.
I'd think, from an author's point of view, the most interesting statistics
would be the frequency distribution (i.e. how many people gave the game a
10, how many gave it a 9, etc.).
There's a big difference between "most people thought my game was a 7" and
"half the people thought it was a 10 and half thought it was a 4".
That's what the standard deviation represents, if you're willing to
accept it condensed down to a single number.
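To make the "7 versus half 10s and half 4s" point concrete, here is a small Python sketch of a frequency distribution. The two vote lists are invented to match that example and the function name is just for illustration.

```python
from collections import Counter

def vote_histogram(votes, low=1, high=10):
    """Return {score: number of voters who gave that score} for every allowed score."""
    counts = Counter(votes)
    return {score: counts.get(score, 0) for score in range(low, high + 1)}

consensus = [7] * 10            # everyone thought the game was a 7
split = [10] * 5 + [4] * 5      # half loved it, half did not

for name, votes in (("consensus", consensus), ("split", split)):
    mean = sum(votes) / len(votes)
    print(name, "mean =", mean)
    for score, count in vote_histogram(votes).items():
        print(f"  {score:2d}: {'#' * count}")
```

Both lists average exactly 7, so only the histogram (or the spread) tells them apart.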
> That's what the standard deviation represents, if you're willing to
> accept it condensed down to a single number.
That's something I was trying to figure out. Math isn't my strong suit (at
all). If a game scored "6" and the Standard Deviation was "2" does that mean
the votes were typically 4 to 8 with "6" being the average (2 on either
side) or was it 5 to 7 with a range of 2?
I think it would be interesting (although, maybe not feasible or ethical) to
see detailed scores (30 1's, 10 2's, 15 3's, etc).
Throwing out outliers would be good practice if the distribution of
votes for a game were centred around an expectation value. But that is
an unwarranted assumption. In fact, some games are of the "either you
hate it or you love it" variety, and in that case you could expect a
distribution with two peaks. In the extreme case, a game would receive
only 1's and 10's.
The standard deviation is a bit more complicated than that. Roughly, it
means that 70% of the votes were within 2 of the average (the 4-8 range).
And 95% of the votes were within *4* of the average, and 99% were within 6
of the average. It recognizes that there are outliers.
Of course, this is a pretty rough summary. (For one thing, we know
perfectly well that every vote was in the 1-10 range.) The standard
deviation implicitly assumes that the votes lie on a smooth bell curve,
including fractional values and values outside 1-10.
When they don't, the standard deviation is really fitting the best bell
curve it can to the actual votes.
That's not true: the standard deviation doesn't assume anything about
the distribution, it's just a function of the votes. It's your
interpretation of the standard deviation which assumes a bell
curve. And since the votes aren't distributed on a bell curve, that
interpretation is misleading.
Let's just say that the standard deviation is a measure of how
spread-out the votes were; a standard deviation of zero means that all
the voters gave the game the same score, and a high standard deviation
means that they gave very different scores.
For those who don't mind *a little* maths, an explanation follows:
You know how to compute the mean score for a game, right? Now,
most of the votes probably differ from the mean (unless all
the voters agreed 100%). So let's consider how much the votes
disagree from the mean.
Let's call the mean score M, and a particular vote v. Then the
difference between the vote and the mean is
v - M
But one problem with this is that this is a negative number if the
vote is lower than the mean. So let's square the number, so we
always get a positive result:
(v - M)^2
Now for the trick: take the average of this for all the votes. This is
called the variance, and is a measure of how much the voters disagreed
with each other. If all of them agreed, the variance is 0. If they
disagreed very much, it will be a large number (since most of the
(v - M)^2 numbers will be large).
But, I hear you say, what about the standard deviation? Well, the
variance has one problem, and that's to do with the squares. If,
for example, we measure the variance of the length of IF authors
(rather than the quality of their works), we'll get a variance which
is measured in square meters, i.e. an area. It's more convenient
to work with a number that has the same unit as the original numbers,
so we take the square root of the variance.
And that, dear reader, is the standard deviation.
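The recipe above translates almost line for line into Python. This is just a sketch of the "average of the squares" version described here (no N-1 correction), and the vote lists are made up.

```python
import math

def standard_deviation(votes):
    """Square root of the average squared difference from the mean."""
    mean = sum(votes) / len(votes)                       # M
    squared_diffs = [(v - mean) ** 2 for v in votes]     # (v - M)^2 for each vote
    variance = sum(squared_diffs) / len(votes)           # average of those squares
    return math.sqrt(variance)

print(standard_deviation([6, 6, 6, 6]))    # all voters agree
print(standard_deviation([4, 8, 4, 8]))    # every vote two points away from the mean of 6
```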
> And that, dear reader, is the standard deviation.
Thanks Magnus & Andrew for helping answer this. :)
Neither. Standard deviation is the square root of the average of the
squares of the differences of the scores from the mean. Say that
three times fast.
Matthew T. Russotto russ...@pond.com
"Extremism in defense of liberty is no vice, and moderation in pursuit
of justice is no virtue."
> But one problem with this is that this is a negative number if the
> vote is lower than the mean. So let's square the number, so we
> always get a positive result:
> (v - M)^2
I always wondered, why not just calculate some value X which is the
average of the *absolute* values of [Vi - M], (i = 1,2,3...n). Squaring
(and rooting) isn't necessary to avoid negative values. The resulting
average will be different from the variance but it will say something
about the distribution of the votes.
Also, when calculating variance, as a result of the squaring very high
and low votes have a higher weight than votes close to the mean value.
The value X will avoid this problem. Isn't such a value more honest,
particularly when dealing with opinion values instead of scientific
observations? Is the variance figure more informative?
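A quick way to see the difference is to compute both numbers for the same votes. The sketch below is only an illustration: the function name is mine, the votes are invented, and the standard deviation used is the plain "average of squares" kind.

```python
import statistics

def mean_absolute_deviation(votes):
    """Average of |v - M| over all votes (the X suggested above)."""
    mean = statistics.mean(votes)
    return sum(abs(v - mean) for v in votes) / len(votes)

votes = [7] * 9 + [1]   # made-up votes: nine 7s and a single 1

print(mean_absolute_deviation(votes))   # about 1.08
print(statistics.pstdev(votes))         # about 1.8
```

The single 1 pushes the standard deviation up much more than it pushes X, which is the extra weighting of extreme votes described above.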
Mmf. I grant the point.
I was trying to get at the idea that for a bell curve, the average and
S.D. tell you *exactly* how spread-out the samples are -- in fact, they
tell you everything about the samples, because there's only one normal
curve with a given average and S.D.
For a distribution which isn't a normal curve, the average and S.D. tell
But if that makes no sense to you, never mind. :-)
> Let's just say that the standard deviation is a measure of how
> spread-out the votes were; a standard deviation of zero means that all
> the voters gave the game the same score, and a high standard deviation
> means that they gave very different scores.
(Or, for more detail, Magnus's following post.)
Gah. I'm having flashbacks to my Graph Theory class where the prof never
used the word "graph". It was "set of subsets of size 2 of a set".
What did he say instead of "real number"?
She. And it was a theory course - we never dealt with real numbers at all.
You can, of course, do this though I forget what it's called. But it's
an absolute pig to make use of, arithmetically speaking.
That's true. And the absolute value isn't differentiable at zero, so using
calculus is also a pig.
There are also more theoretical reasons. One is that the variance is just
one of an infinite set of "moments" where you consider the averages of
(vi - M), (vi - M)^2, (vi - M)^3, and so on.
Another is that the variance is additive: if you have two independent
stochastic variables X and Y, then V(X + Y) = V(X) + V(Y).
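The additivity can be checked exactly on a toy case. The sketch below uses two independent fair dice as stand-ins for X and Y (my choice of example, nothing to do with the votes) and exact fractions so there is no rounding.

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Population variance of a list of equally likely outcomes."""
    n = len(outcomes)
    mean = Fraction(sum(outcomes), n)
    return sum((x - mean) ** 2 for x in outcomes) / n

die = [1, 2, 3, 4, 5, 6]
sums = [x + y for x, y in product(die, die)]   # all 36 equally likely pairs

print(variance(die), variance(sums))
```

The variance of the sum comes out as exactly twice the variance of one die. The same check with the average of absolute deviations would not add up this neatly.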
Probably because the absolute value isn't differentiable and thus
gives mathematicians the creeps :-)
Out of curiosity, I just did a quick sort and compiled a list of the games
with the highest standard deviations. The highest SD, strangely, went to
"Hunter, In Darkness" with 2.42. Next was (ahem!) my very own "Halothane"
with 2.35, and "Erehwon" with 2.26. For those who're interested, the first
ten games with high SDs (in descending order) were: Hunter, Halothane,
Erehwon, Beal Street, Lunatix, Exhibition and Pass the Banana (a bizarre and
fortuitous tie..), and finally A Moment Of Hope, The HeBGB Horror! and The
Water Bird (the last three were also tied.) The _lowest_ SDs went to the
following games: Skyranch (a striking 1.10), Guard Duty and Outsided,
strangely enough games that received almost uniform low ratings (though I do
note that Guard Duty has scored at least one 9, and Outsided a whopping 7).
Does anyone want to extrapolate from this?
Quentin.D.Thompson. [The 'D' is a variable.]
Lord High Executioner Of Bleagh
(Formerly A Cheap Coder)
Items with lower averages are forced to have lower SDs.
If you want to compare SDs across multiple items, it may
make more sense to express the SD as a fraction (percentage)
of the total. (But give me a second, and I'll expose that lie.)
For example, the maximum SD for an item receiving a 5.5 average
comes from votes like:
1,1,1,1,1,10,10,10,10,10 (SD=4.5) [*]
An item receiving a 2 average comes from votes like
And of course an item receiving a 1 average must have an SD of 0.
Then again, the maximal SD for an item with a 9 average comes from
So I guess the issue is really that they're near the edge of
the allowed range, not actually just relative to the average
(although that also seems relevant to me). If you think of SD
as the "width" of the bell curve, then scores near the edge of
the range of allowed votes have to have squished bell curves
since they can't trail off the edges of the range.
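The edge effect can be put into a formula. For votes confined to a range [low, high] with a given average, the largest possible "average of squares" variance is (average - low) * (high - average), reached when every vote sits at one end of the range (this is the Bhatia-Davis inequality). A small Python sketch, with a function name and a vote list of my own choosing:

```python
import math
import statistics

def max_sd(mean, low=1, high=10):
    """Largest population SD possible for votes in [low, high] with the given mean."""
    return math.sqrt((mean - low) * (high - mean))

print(max_sd(5.5))   # 4.5, matching the 1,1,1,1,1,10,10,10,10,10 example
print(max_sd(1))     # an average of 1 leaves no room for any spread
print(max_sd(2))

# One set of votes that reaches the bound for an average of 2:
votes = [1] * 8 + [10]
print(statistics.mean(votes), statistics.pstdev(votes))
```

By symmetry an average of 9 has the same ceiling as an average of 2, so it really is distance from the nearest edge that matters.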
[*] I believe SD is square root of the variance and the variance
is technically defined not as the average of the squared errors,
but the sum of the squared errors divided by (number_of_samples - 1);
in the limit as the number of samples goes to infinity this is just
the average, but prior to that, it's just annoying.
However, I am not a statistician. If you feel you need
statistical advice, please get a consultation from a professional.
OK, I lied about this before :-). Or, rather, I didn't mention
it because it would only be confusing.
There are two definitions of variance. One is simply the average of the
squared deviations from the mean. The other definition introduces a
correction factor N/(N-1) where N is the number of samples. The latter
definition is usually used when you take a sampling out of a large
set and want to estimate the variance of the whole set from that of
the sampling. The reasons for this are rather theoretical and I'd
like to refer anybody interested to some textbook in mathematical
For the purposes of counting competition votes, I'd say it's
not a critical choice which definition to use.
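To see how little the choice matters at around 100 votes, here is a sketch using Python's statistics module, which has both definitions built in. The vote list is invented.

```python
import statistics

votes = [7] * 40 + [5] * 30 + [9] * 20 + [2] * 10   # 100 made-up votes
n = len(votes)

plain = statistics.pvariance(votes)      # average of squared deviations
corrected = statistics.variance(votes)   # same sum of squares divided by N - 1

print(plain, corrected, corrected / plain)
```

The ratio of the two is N/(N-1), so with 100 votes they differ by about 1%.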
They wave their hands about degrees of freedom and such. I think it's
just a fudge factor :-)
}For the purposes of counting competition votes, I'd say it's
}not a critical choice which definition to use.
If you're counting all the competition votes, the ordinary uncorrected
population variance is the one to use, I think.
>}For the purposes of counting competition votes, I'd say it's
>}not a critical choice which definition to use.
>If you're counting all the competition votes, the ordinary uncorrected
>population variance is the one to use, I think.
Well, the theoretical standpoint is this: if you're taking a sample out
of a large (possibly infinite) population, and want to estimate the
standard deviation of the whole population, then using the N/(N-1)
formula means that the variance of the sample is an unbiased estimate
of the variance of the whole population (the standard deviation is then
its square root).
In the comp votes example you could argue that we aren't sampling the
whole population of votes; we have the whole population of votes at
hand. On the other hand, you could argue that the votes are a sample of
the opinions of *all* IF players (not just the ones who voted).
But this is unnecessary. We aren't using the standard deviation to
estimate anything. We're using it in a purely descriptive way, as a
measure of how much the votes for each game differed from each
other. It's really immaterial which formula we use as long as
everybody uses the same formula. We could just as well use the mean of
the absolute values of the deviations (rather than that of the squares of the
deviations), as somebody suggested.
And we could just as well use the median score instead of the average
score to determine how well a game fared. That would also address
the outlier problem to some extent. (The median is the "middle" vote -
the one that had just as many votes above it as below it).
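A tiny illustration of the median idea in Python, using the five-7s-and-one-1 case from earlier in the thread. It is only a sketch of the suggestion, not a claim about how the comp scores are computed.

```python
import statistics

votes = [7, 7, 7, 7, 7, 1]   # five 7s and one 1, as in the earlier example

print(statistics.mean(votes))     # pulled down to 6 by the single 1
print(statistics.median(votes))   # stays at 7
```

One thing to watch: with whole-number votes on a 1-10 scale the median will produce a lot of ties between games, so some tie-breaking rule would be needed.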
>However, I am not a statistician. If you feel you need
>statistical advice, please get a consultation from a professional.
Most likely a professional whose profession is psychology. :)
"And if you're the kind of person who parties with a bathtub full of
pasta, I suspect you don't care much about cholesterol anyway."
I've seen psychologists 'doing' statistics, and I've seen the Revd
Ian Paisley doing politics - they both scar(r)ed me!