I have looked a little at the formula for calculating ratings
on FIBS. Of course it's nearly impossible to construct a perfect formula.
What I think is the weakest part of it is the calculation of the
probability that someone wins. The match length is one of the inputs to
the formula, and it seems that when playing 1 or 2 pointers it's possible
to get a much higher rating than otherwise. I don't mind people getting a
high rating by playing short matches. However, I would find it
interesting to know how good I am at different match lengths. Because of
this I would like to see a separate rating system for match lengths 1-2,
and maybe also for 3-4 and 5 to infinity (or just 3 to infinity).
I realise a change needs many shared opinions, so
does anyone else share these suggestions?
/fortuna on fibs
One thing I would observe is this:
In an issue of Inside Backgammon from several years ago, Kit Woolsey
made the statement that he felt he would be lucky to win 55% of his
games against an intermediate player. Now, I suspect that Kit may
have a fairly high standard for "intermediate" - I really don't know.
Does intermediate mean you get some of the quiz problems in IB right,
does it mean you even UNDERSTAND the problems, or does it mean you
know that it's better to be at the edge of a prime than a pip or two
away? (I'm not putting down Kit, understand, I'm saying that I have a
lot of respect for his game!) Anyway, under the FIBS rating formula,
a difference of 175 points translates to a 55% chance of winning a
one-point match. Now, even by his statement, I'm sure that Kit would
win more than 55% of his one-point matches, since his 55% PROBABLY
refers to games - and I suspect that Kit's average win will win more
points than his average loss will lose - if he's just playing to win
the game, he'll probably win more than 55%. I would venture a
wild-a**ed guess that I am probably what Kit would call an
intermediate player, by which I mean that I could play him an
intermediate-length match (7, 9, 11) and not embarrass myself, even win
occasionally. My FIBS rating tends to be about 175 points below
Kit's. So maybe, just maybe, by this theory, the FIBS formula is
I do have to say that the formula, though, doesn't quite seem right in
this sense. The "random walk" theory by which the square root is used
for match length assumes that each game is independent. But a 3-point
match is very very different than playing one-point games, best
3-out-of-5. I would suspect that the difference between players of
different skill levels is much greater at cube handling than at
checker play. Most intermediate players will at least consider a
checker play whose equity is .05 or so greater than that of
another they're considering, but it's very easy to believe that
an intermediate player would misevaluate the overall equity in a
position. The reason is simple - checker plays present a choice
between two positions; cube decisions require considering a whole
range of positions that could result one, two, three, five, ten rolls
later. So the FIBS formula just might favor the weaker player in a
one-point match versus a longer match. But then, I've been led to
understand that some players fatten their ratings by playing one-point
matches against weak players, so what do I know?
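
As a rough check on the figures above, here is a minimal Python sketch of the FIBS win-probability formula as it is usually quoted, P_upset = 1 / (10^(D * sqrt(n) / 2000) + 1). The function names are mine, and the best-3-out-of-5 comparison assumes independent one-point games with no cube and no gammons, which is an illustration only and not a claim about real play.

```python
from math import comb, sqrt


def favourite_win_prob(rating_diff, match_length, c=2000.0):
    """Favourite's match-winning chance under the FIBS-style formula."""
    p_upset = 1.0 / (10 ** (rating_diff * sqrt(match_length) / c) + 1.0)
    return 1.0 - p_upset


def best_of_prob(p_game, wins_needed):
    """Chance of reaching wins_needed wins first in independent games,
    each won with probability p_game (no cube, no gammons)."""
    q = 1.0 - p_game
    # The favourite takes the deciding game after the opponent has won k games.
    return sum(comb(wins_needed - 1 + k, k) * p_game ** wins_needed * q ** k
               for k in range(wins_needed))


p1 = favourite_win_prob(175, 1)   # one-point match, 175 points apart
p3 = favourite_win_prob(175, 3)   # what the formula says for a 3-point match
b5 = best_of_prob(p1, 3)          # best 3-out-of-5 independent one-pointers
print(f"1-pt: {p1:.3f}  3-pt formula: {p3:.3f}  best-of-5: {b5:.3f}")
```

Under these assumptions the square-root rule and the independent best-of-five come out within about a percentage point of each other, which is the random-walk argument in a nutshell. Neither number says anything about what the cube does to a real 3-point match.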
Yes, I think you've hit the nail right on the head. This effect seems to
be observed from time to time, and people make vague comments about what
is causing it without solidly advocating a single conclusion (very unusual
for r.g.b. posters :-). I agree with you that gammons and the cube destroy
independence between games in a match: there is a positive correlation
between each of the "points" won, and fewer than n degrees of freedom in an
n-point match. The theory assumes games are independent and therefore
underestimates the "skill" (i.e. probability of the higher rated player
winning) in shorter matches and/or overestimates the skill in longer
ones (depending which length you assign as "correct").
It is very easy to derive examples or find empirical evidence supporting
this hypothesis. For instance:
- Consider 2 point matches. It should be fairly obvious that a 2-point
match is identical to a 1-point match, because the weaker player can
double immediately and be certain of having a greater probability of
winning the match than by potentially playing it out as 2 or more games.
So, it is clear that the "skill" in a 2-pointer is exactly the same
as that in a 1-pointer. But the Elo system will credit the underdog
with MORE (and the favourite with less) if he/she wins!
- Tom Keith shows that the above example generalises to longer matches
by comparing the probabilities the Elo system estimates against those
in a skill-adjusted match equity table in an article at:
- Looking at real life data, we see the effect occurring in practice.
Peter Fankhauser looks at the matches in the Big Brother database
and summarises the results for different length matches at:
(look at table 7).
- David Montgomery mentions the discrepancies in records of his own
matches of varying lengths at:
- There was a thread at the end of last year with the subject
"rankings and ratings" with some interesting discussion (look for
Don Banks' and Chuck Bower's articles).
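
To make the 2-point example concrete, the sketch below compares what the formula does with a 1-point and a 2-point match between the same two players. It assumes the FIBS adjustment has the form 4 * K * sqrt(n) * P, with P taken as the formula's probability that the winner would have lost (that is my reading of the FIBS formula, so treat it as an assumption), K = 1, and a hypothetical 200-point rating gap.

```python
from math import sqrt


def p_upset(rating_diff, match_length, c=2000.0):
    """Formula probability that the lower-rated player wins the match."""
    return 1.0 / (10 ** (rating_diff * sqrt(match_length) / c) + 1.0)


def winner_gain(rating_diff, match_length, underdog_won, k=1.0):
    """Rating points awarded to the winner: 4 * K * sqrt(n) * P.

    Assumption: P is the formula's probability that the winner loses,
    i.e. 1 - P_upset when the underdog wins and P_upset otherwise.
    """
    pu = p_upset(rating_diff, match_length)
    p = (1.0 - pu) if underdog_won else pu
    return 4.0 * k * sqrt(match_length) * p


for n in (1, 2):
    pu = p_upset(200, n)
    stake = 4.0 * sqrt(n)   # 4 * K * sqrt(n) with K = 1
    print(f"{n}-pt: stake {stake:.2f}, "
          f"underdog's share if he wins {1 - pu:.3f}, "
          f"favourite's share if he wins {pu:.3f}, "
          f"underdog's gain {winner_gain(200, n, True):.2f}")
```

If the two matches really are the same contest, the share of the stake paid out should not change with n. In this sketch it does: the underdog's share goes up and the favourite's goes down when the match is called a 2-pointer.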
I think it's safe to conclude that the Elo system correctly predicts
the probabilities of the favourite winning only when the games in the
match are independent. For backgammon (with cubes and gammons), it is
vaguely adequate but nowhere near perfect. An ideal system would give
any accurately rated player an expected gain of zero when playing any
length match against another accurately rated player; Peter's analysis
(see above) shows that FIBS typically gives expected gains of 0.3
points to the favourite in 1 point matches (this will obviously depend
on the rating difference of the players). Since a 1 point match is
worth about 2 ratings points, this is an error of about 15%.
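
The "expected gain of zero" condition can be written down directly. This is only a sketch: it assumes the winner gains (and the loser drops) 4 * K * sqrt(n) * P with P the formula's probability that the winner would have lost, and the 60% figure below is a made-up example, not data.

```python
from math import sqrt


def expected_gain(true_p_fav, formula_p_fav, match_length, k=1.0):
    """Favourite's average rating change per match.

    true_p_fav    -- how often the favourite actually wins
    formula_p_fav -- how often the rating formula says he wins
    """
    stake = 4.0 * k * sqrt(match_length)
    win = true_p_fav * stake * (1.0 - formula_p_fav)
    loss = (1.0 - true_p_fav) * stake * formula_p_fav
    return win - loss   # algebraically stake * (true_p_fav - formula_p_fav)


# Formula and reality agree: nobody drifts.
print(expected_gain(0.55, 0.55, 1))

# Hypothetical mismatch: the formula says 55%, the favourite really wins 60%.
print(expected_gain(0.60, 0.55, 1))

# The error quoted above: 0.3 points against the 2 points at stake
# (4 * K * sqrt(1) * 0.5) in an evenly rated 1-point match.
print(0.3 / (4.0 * sqrt(1) * 0.5))
```

The expected gain is just the stake times the gap between the real and the predicted win probability, so any systematic gap at a given match length becomes a steady drift in ratings for people who play that length.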
Coming up with a better scheme is pretty tricky. Existing skill-adjusted
match equity tables address the problem of players having different
probabilities of winning each game, but assume efficient cube handling
by both players and do not account for the weaker player making
more cube errors. A truly accurate match equity table would have to
account for the fact that different strength players make different
kinds of cube errors (e.g. the match equity table for a 7 point match
between a 2000 ranked player and an 1800 ranked player will be
different to that between a 1200 and a 1000, because the 1000 player
might be expected to make silly errors like failing to double immediately
trailing -2, -1 post-Crawford. This will throw the match equities
out of whack and violate another assumption of the Elo system, that
the winning probability depends only on the relative ratings of the
players and not the absolute ratings). When it comes down to it,
even attempting to measure someone's skill with a (scalar) rating is
a little bit presumptuous anyway; it fails to reflect the distinction
between cube skill and chequer skill, for instance. Scalars also imply
transitivity which we don't really have (A is a favourite against B
and B is a favourite against C does not necessarily imply A is a
favourite against C). Those are more problems than I'd care to solve!
When it comes down to it, applying the Elo system to backgammon is
trying to do something that can't really be done, so we shouldn't get
too upset if it doesn't get things quite right :-)
Gary Wong, Department of Computer Science, University of Arizona
The key variable in the formula is the factor:
SQR(n) * (rating difference) / 2000
Now - is that 2000 factor empirically derived, or is it an assumption?
It would be relatively easy to test if you had enough data on resolved
Somehow I think that would have more impact than questions of whether
the sqr part of the formula is correct.
It's not 100% clear to me what the impact of the cube is in
multi-point matches. Think of it this way. Every decision in the
game gives the better player an opportunity to gain equity. We could
model a game by saying that on each turn, the favorite (F) gains some
random amount of equity by making better decisions. The amount will
vary - he will gain zero equity when his opponent rolls an opening
3-1, for example. He also usually gains zero equity on cube decisions
for at least the first few rolls.
It's clear that your average cube decision has more equity impact than
your average checker play. But the more games in a match, the more
checker plays, and the more opportunities for F to squeeze a little
extra equity out. Remember, F can find a correct double a weaker
player will never even think of, his opponent can make a foolish take,
and roll a joker to gammon F. Now, all F is left with is the
consolation that he played right, and wistful longing that he'd tried
to grind down his opponent.
But I digress. I'd still like to know whether the 2000 is just an
assumption, or empirically tested. That would seem to be the first
thing to refine.
> What is the effect of the "2000" in "SQR(n) * (rating difference) / 2000"
> (from the FIBS rating formula)? Is it empirically derived?
2000 is actually a constant (call it c). If c is changed to 2000 * t (with
t > 0), the average rating will remain the same (approx. 1500). Ratings which
differ from the population average will be scaled away from the average by a
factor of t, i.e. new_rating = avg_rating + t * (old_rating - avg_rating).
e.g. if avg_rating = 1500, c = 400 (i.e. t = 0.2), then
new_rating = 1500 + 0.2 * (old_rating - 1500)
e.g. 2000 under the old ratings will correspond to 1600 under the new
ratings. 1000 under the old ratings will correspond to 1400 under the
new ratings. Old ratings between 1000 and 2000 have an equivalent new
rating between 1400 and 1600, which can be obtained by linear
interpolation.
Thus, the choice of c = 2000 only affects the spread of the ratings, while not
changing the ordering of "true" ratings. i.e. consider two players
(player1 and player2): if player1_true_rating > player2_true_rating for c =
2000 then player1_true_rating > player2_true_rating for any other c > 0.
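
A minimal Python sketch of the rescaling just described, new_rating = avg_rating + t * (old_rating - avg_rating). The function name and the sample ratings are mine, picked only for illustration.

```python
def rescale(old_rating, t, avg_rating=1500.0):
    """Map a rating under c = 2000 to its equivalent under c = 2000 * t."""
    return avg_rating + t * (old_rating - avg_rating)


old = [1000.0, 1500.0, 1750.0, 2000.0]
new = [rescale(r, t=0.2) for r in old]   # t = 0.2 corresponds to c = 400
print(new)

# The map is increasing for any t > 0, so the ranking is unchanged.
assert new == sorted(new)
```

Only the distances from the average change. Who is rated above whom does not.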
Note that the above discussion is referring to one's "true" rating. For small
values of c there is a large amount of noise in the ratings system (since the
adjustment factor, 4 * K * SQRT (n) * P, is independent of c), i.e. ratings
will move relatively more quickly with a small c than with a large c. E.g.
in an extreme case, if c = 1, then the "true" ratings would likely range from
1499.75 to 1500.25, yet a 9-pt. match between two players of equal rating
would boost the winner's rating by 6 pts. (assume K = 1) and reduce the
loser's rating by 6 pts. Obviously in this case, the rating system would be
too unreliable for use. Many players would be massively (in a relative
sense) over- or under-rated.
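
The c = 1 example can be reproduced in a few lines of Python. This is a sketch under two assumptions of mine: that "true" ratings under c = 2000 run from about 1000 to 2000 (which is what the 1499.75 to 1500.25 range implies), and that P = 0.5 for two equally rated players.

```python
from math import sqrt


def match_adjustment(match_length, p, k=1.0):
    """Size of the rating change, 4 * K * sqrt(n) * P. Note: no c in here."""
    return 4.0 * k * sqrt(match_length) * p


def true_rating_range(c, old_low=1000.0, old_high=2000.0, avg=1500.0):
    """Where ratings of old_low..old_high (under c = 2000) sit under another c."""
    t = c / 2000.0
    return avg + t * (old_low - avg), avg + t * (old_high - avg)


low, high = true_rating_range(c=1.0)
swing = match_adjustment(9, p=0.5)   # equal ratings, so P = 0.5
print(f"true ratings span {low} to {high}; "
      f"one 9-point match moves a rating by {swing}")
```

The whole population fits inside half a rating point while a single match moves a player by several points, which is the noise problem described above.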
I assume that 2000 was chosen so that there would be a reasonable spread
between the high and low ratings.
[Note: below I will define "ELO-style rating formula" to be a rating formula
similar to the FIBS one in which P_upset = 1 / (10^ (D * sqrt(n)/c) + 1), for
c > 0.]
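
For reference, the ELO-style formula in the note can be written as a small Python function. The check underneath is my own illustration with a made-up 300-point gap: scaling the rating difference together with c leaves the predicted probability unchanged.

```python
from math import sqrt


def p_upset(rating_diff, match_length, c=2000.0):
    """P_upset = 1 / (10^(D * sqrt(n) / c) + 1) for an ELO-style formula."""
    return 1.0 / (10 ** (rating_diff * sqrt(match_length) / c) + 1.0)


# Same pair of players, same 7-point match, three choices of c.
# The rating difference is scaled by the same factor t as c.
for t in (0.2, 1.0, 2.0):
    print(t, p_upset(300 * t, 7, c=2000.0 * t))
```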
Assuming that an ELO-style rating formula is appropriate (although there is a
lot of evidence to the contrary), c should be chosen so that the
match-adjustment [4 * K * SQRT (n) * P] moves/changes ratings at a relatively
slow (but not too slow) rate. If c is chosen too low, then ratings will move
too fast, i.e. they will be too volatile and thus unreliable. If c is chosen
too high then the rating system will take a very long time to correct the
ratings of those who are significantly under- or over-rated.
Personally, I think that the FIBS ratings system is a bit too volatile. I
would like to see c = 4000. Better yet, to avoid having to scale everyone's
mean-adjusted rating by 2 overnight (and thus alarming many new users), we
can equivalently just change the match-adjustment factor to [2 * K * SQRT (n)
* P].
As noted by some of the empirical evidence referenced in Gary Wong's recent
post, an ELO-style rating formula is not robust over the possible match
lengths. Perhaps a better solution (requiring more housekeeping) would be to
have separate rating formulas for different match lengths. Perhaps there
could be five different ratings: one for 1-pt. matches, one for 2-pt. matches,
one for 3-6 pt. matches, one for 7-16 pt. matches, and one for 17+ pt matches.
Having a separate category for 2-pt. matches may be a little controversial
since among expert players a 2-pt. match is virtually identical to a 1-pt.
match, however among novices, there is still room for cube strategy. :-)
Even better, the value of t (the scaling factor - see the first line of my
post) can be empirically set (different for each of the 5 rating groups) so
that the spread (high rating - low rating) in each of the 5 ratings is
approximately the same. Perhaps one could even be assigned an "overall
rating" which would be the average of each of the 5 ratings (or maybe with
only 50% weighting on the 1-pt. and 2-pt. ratings, i.e. overall_rating = .125
r(1) + .125 r(2) + .25 r(3-6) + .25 r(7-16) + .25 r(17+)). This would mean
that a player has to be good at both small-length matches as well as
long-length matches in order to have a good overall rating.
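
A sketch of how the five rating groups and the weighted overall rating could be computed in Python. The group labels, function names and the sample player are hypothetical. The weights are the ones given above.

```python
# Weights from the proposal: half weight on the 1-pt. and 2-pt. ratings.
WEIGHTS = {"1": 0.125, "2": 0.125, "3-6": 0.25, "7-16": 0.25, "17+": 0.25}


def rating_group(match_length):
    """Return the rating group a match of this length would count towards."""
    if match_length < 1:
        raise ValueError("match length must be at least 1")
    if match_length <= 2:
        return str(match_length)
    if match_length <= 6:
        return "3-6"
    if match_length <= 16:
        return "7-16"
    return "17+"


def overall_rating(ratings):
    """Weighted average of the five per-group ratings."""
    return sum(WEIGHTS[group] * ratings[group] for group in WEIGHTS)


# Hypothetical player: strong at 1-pointers, weaker once the cube matters.
player = {"1": 1900.0, "2": 1850.0, "3-6": 1600.0, "7-16": 1550.0, "17+": 1500.0}
print(rating_group(5), overall_rating(player))
```

With these weights the two short-match ratings together count for a quarter of the total, so a 1-point specialist cannot carry the overall figure on short matches alone.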
Under most backgammon rating systems that I've seen, a player who plays
perfectly in 1-pt. matches (and who plays only 1-pt. matches) can obtain an
extremely high rating, even if he is awful in cube strategy (i.e. since he
will never have to make a cube decision). My proposal would remedy this
Just my $0.02