From time to time there are questions pertaining to how to decide
who is better than whom. There was a rash of them recently relating
to the BOTS and their performances on FIBS. I'm writing this as an
overview/summary of the topic. Some ideas (including speculations) are
my own and others have already been expressed (in this newsgroup). I'll
try to differentiate, but pardon me if I plagiarize. And, especially I
ask forgiveness for NOT giving credit to the authors of any ideas that
One other disclaimer: I will be discussing some "methods" and
their "keepers". DO NOT FOR ONE SECOND conclude that these people
contend that their methods are anything more than just another piece of
data which can be input into the unanswerable question "who is better
than whom?". They provide a valuable service (for no remuneration) and
if someone were to attempt to lecture, scold, or otherwise chastise any
of them, then such person has completely misunderstood both their efforts
and, in addition, this article!
The BEST way (that I can think of) to determine which of TWO players
is better is to have them play a LONG session against each other. Up
until recently (last 10 years or so) this was about the ONLY way to
answer the question "who is better...". Normally this is done for money
(to "reward" the better player as well as to attempt to ensure that each
is playing at his/her best)! Unfortunately the number of games/matches
required to determine the answer with statistical confidence is so large
that it just takes too much time to reach a reliable answer. As the skill
difference between the players gets small, the number of trials required
becomes HUGE. As an example, after last summer's JF challenge, Fredrik
pointed out (with statistics) that a difference of 58 points in 300 games
isn't nearly enough to draw a conclusion because of the large fluctuations
(from the dice). (Try DejaNews for more specifics.)
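As a rough illustration of the point about fluctuations, here is a minimal sketch in Python. It assumes game results are independent with a standard deviation of about 3 points per game for money play with the cube. That figure is an assumption chosen for illustration (it is not taken from Fredrik's post), and the function names are hypothetical.

```python
import math

# ASSUMPTION: per-game standard deviation of the score (in points) for
# money play with the cube. 3.0 is an illustrative guess, not a measured value.
ASSUMED_SIGMA_PER_GAME = 3.0


def z_score(lead_points, games, sigma=ASSUMED_SIGMA_PER_GAME):
    """How many standard errors the observed lead is away from zero."""
    # The total of `games` independent results has standard deviation
    # sigma * sqrt(games).
    return lead_points / (sigma * math.sqrt(games))


def games_needed(edge_per_game, z_target=1.96, sigma=ASSUMED_SIGMA_PER_GAME):
    """Games needed before the expected lead reaches `z_target` standard errors."""
    # Solve edge * N / (sigma * sqrt(N)) = z_target for N.
    return math.ceil((z_target * sigma / edge_per_game) ** 2)


if __name__ == "__main__":
    z = z_score(58, 300)
    n = games_needed(58 / 300)
    print(f"z = {z:.2f}, games needed at this edge = {n}")
```

Under that assumed sigma, a 58-point lead after 300 games is only about one standard error. Because the required number of games grows with the inverse square of the edge, halving the skill difference roughly quadruples the session length.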
It could be that people who play against each other A LOT (one or
more sessions per week over years, for example) could actually collect
enough data to reach a statistically significant answer. However, you
can always surmise: "Maybe one of the players improved more relative to the
other over the time the data were taken. Who is the better player NOW?"
Currently we have dedicated 24 hour players (like commercial Jellyfish),
and the question can be answered knowing that at least one of the players
isn't improving! I haven't seen much on the newsgroup (that is,
substantiated with numbers) comparing human vs. JF. I keep such tallies
for my own play (and have posted results in the past) but haven't been
getting in enough play-vs-JF time lately to collect sufficient statistics
on the current version. I'm SURE that JFv2.0 level-7 was a better player
than I. Anyone want to take the other side of that argument? Gee, thanks.
What about global measures? I know of three which are currently
available, but none of them is perfect, either. They are surveys,
performance points, and ratings-based methods. All have their pluses
and minuses.
One example of a survey is Yamin Yamin's "Giant 32 of Backgammon".
This is a biennial survey of on the order of 100 persons (responses,
that is). The results are published in the Flint Area Backgammon News.
Actually, this survey was just completed within the last couple of weeks
and I expect to see the results in either the next Flint newsletter or
the issue after that.
The problem with surveys is that they are inherently subjective.
For example, some Europeans have complained that Yamin's survey is biased
toward North American players (and I agree with them). I, for
one, am NOT complaining about this survey. BG (as it is
played today) is regional in nature. IMHO that is an irrefutable fact.
Most of Yamin's survey respondents are North Americans, and even if they
aren't sociologically biased (let's hope that's the case!) their
experience is in this hemisphere and players from other parts of
the world play within their own travel zones. Not very many (in fact,
NO) events give a truly geographically unbiased sampling. Actually,
the online Internet tournaments are probably the least biased from
Performance rankings are another measure. Bill Davis has been
coordinating such a point system--The American Backgammon Tour. How
good is this at determining the best players in "America"? Well, it
is probably a decent (though still statistically insufficient) way of
deciding who is "best" among those who play in a LOT of ABT events!
Problem is, for whatever reason, a lot of strong Western Hemisphere
players don't participate frequently on this tour. It's a fun way of
recognizing players who are doing well, but it's just another piece
in the puzzle.
The third method is one which has really caught on recently
thanks to Internet backgammon. That is ratings systems. Copied from
chess ratings systems, this is an objective method of ranking players
who share a common playground. Kent Goulding (and colleagues) had
been keeping a ratings system for large tournament results over the
past several years. Unfortunately, instigated by lost-data problems,
I believe his effort has been inactive since the summer of 1996.
Still, to my mind, KG deserves much of the credit for the current
popularity of the online ratings systems.
One obvious weakness of any ratings system is that it really
only applies "locally". Only FIBS players get FIBS ratings/rankings.
Only GAMESGRID players get GAMESGRID ratings/rankings. Etc. At best
you only get a reliable ranking among the participants and conditions
of that rating system. Maybe the highest ranked player is just "a small
fish in a small pond", so to speak. And it's really worse than that,
because sometimes the players within a rating system don't intermingle
much. For example, some FIBS players only play within a small cluster of
"friends", so even though they have a FIBS rating, it's not as universal
as it appears. The two ideas I've mentioned in this paragraph have
been discussed previously (multiple times) in this newsgroup. In
addition, they are covered in greater detail, with some nice examples,
(or counterexamples...) in the Jacobs-Trice book "Can a Fish Taste
Twice as Good?". This is recommended reading for anyone wanting to
delve more deeply into the subject.
Online ratings systems can be tricked as well. (This is no secret.)
By carrying the "don't intermingle" idea to an extreme, a person can
play against him/herself (using two or more different ID's) and
artificially inflate his/her rating. This is almost always easily
detected. A very high rating with very low experience is certainly
suspicious (though it's apparently theoretically possible to do this
honestly). There are other low-integrity tactics which have been pointed
out in this newsgroup as well, like preferential dropping, and "fishing"
(searching out weak players whose ratings are higher than deserved, for
one reason or another). I believe these problems are inherent. There
will always be "clever" cheaters who find a way to work around attempts
to prevent such tactics.
One other thing worth mentioning (and covered previously in the
newsgroup) is the observation that the common ratings systems may have
the weakness of overrating players who only compete in 1-point matches.
It seems like a difficult thing to prove, but there does appear to be
circumstantial evidence. Maybe these special players should have their
own (segregated) ratings system.
Now I am going to attempt to break new ground and start speculating.
(Oh, you thought that's what I'd already been doing!) In particular I'm
going to focus on the robots' ratings on FIBS--another hot topic recently.
Do the robots' high ratings really give them the title "best players on
FIBS"? Maybe, but not necessarily. I am going to list (in no particular
order) some reasons why their high ratings could be brought under
suspicion. Note that in case you are new to the newsgroup, I have no
hidden malice towards them. I have a very high respect for these players.
1) Selection effects. (Basically I'm repeating the above problems with
intermingling.) Do the bots take on all comers? Should they? Do the
best humans take on all comers? Should they? Are weak humans more
likely to challenge a highly ranked bot than a highly ranked human? My
guess is "yes". Computers are incapable of sneering when they turn you
down. Even if you argue that human experts don't do this (and my
experience is that they are among the best mannered experts of any kind
in the world!), that doesn't keep the inexperienced player from suspecting
such a thing could happen.
2) Exhaustion. Computers don't get tired. Humans do.
3) Emotion. Computers don't feel emotion. They don't notice bad dice.
I suspect even the best human experts, as hard as they try, still feel
the pain of unlucky dice. I'm sure it doesn't have the same magnitude
of adverse effects as it does for the typical player, but it has to have
a small impact in any case. How about elation? Is that an advantage
for a human player to have? What about embarrassment? Do strong human
players make errors based on ego? (I can't lose to THIS person!) Again,
they know it's a detriment to good play, and thus work on eliminating
such concentration killers, but it still must affect things sometimes.
4) Distractions. Does a computer's spouse interrupt and call it to dinner?
Does it have to break its concentration when the modem rings? Is it
watching a ballgame at the same time it's playing? Does it play better
or worse after having a couple alcoholic drinks?
5) "Giant killing". (I am of the belief that this is potentially a HUGE
advantage for the bots.) Let me start with a (true) story. I had heard
through r.g.bg and also from conversations with other players that there
was a "new kid on the block"--SnowWhite. (This was a while back.) I
was on FIBS and decided to watch this maiden take on one of the seven
dwarves. I watched for all of about two dice rolls. Why? I was
annoyed (make that disgusted). SnowWhite's opponent was in some kind
of SUPER BACKGAME. Three or four points in SnowWhite's board, only one
or two checkers in his/her home board--you get the picture. Gee. This
looked like the typical backgammon game that I play...
So, why do I believe that such tactics are a big advantage to the
bots? Simple. We're not (necessarily) talking about a highly rated
FIBS player trying to outsmart a bot by playing a backgame. We're talking
about Joe-typical-player. Even if it's true that an expert backgammon
player can "make money" using backgame tactics, it is usually done by
getting the bot to indiscriminately elevate the cube in a few games.
The bot wins most of the games (many of which are gammons) with the cube
at a low level. The human expert wins a few games WITH THE CUBE AT SOME
ASTRONOMICAL VALUES. This isn't likely to work at match play due to the
finite match length.
Secondly, backgames are quite tricky. Your Joe-typical-player
is going to be giving away equity by playing sub-optimally, so even if
some experts can outplay a bot by seeking backgames, my guess is that
most FIBS players are going to screw up badly enough to end up becoming
cannon fodder for the bots. Now. Suppose this same Joe-typical-player
is in a match with a human expert. Do you think he is going to steer
into a backgame? I can tell you that there is at least one Chuck-typical-
player who won't!
I realize that there are likely to be some biases that work
against the bots. For example, I wouldn't be surprised if the bots
have a higher percentage of their matches dropped. "Hey, bots have
no feelings, so why should I feel guilty pulling the plug when I'm
losing a match to one of them?" And even if the biases favor the
bots, that certainly doesn't mean they aren't better anyway. My main
point is to read the ratings systems with a skeptical eye, whether
comparing bot vs. bot, bot vs. human, or human vs. human.
c_ray on FIBS
As for the question of variations - well - if a
bot's weakness is analysis of some type of positions and a human's
weakness is frustration doubles after playing for a greedy gammon and
getting a very bad roll that takes him below the doubling window, why
aren't these both factors in evaluating "skill?"
Arguing who is the "best" is not really meaningful. In many
competitive endeavors there will be a group at the top who are
somewhat indistinguishable. I'm not that up on the players who are
the best at backgammon - but in bridge, just in the U.S. there are
four pairs who could all be argued as the best in the country (and
who, not coincidentally, were all in the semifinals of the last world
championship). What matters is not which is the best, but rather,
that these four pairs are all in that very top echelon.
My FIBS rating is 1800. I don't know that it means that I'm better
than a 1750 player or worse than an 1850. I know that I'm not nearly
as good as Kit or KG, and I know that I'm better than LindaRes (rated
about 1600) - whom I only mention because some years back Linda and I
used to play in the same Monday night tournament when we both lived in
the same area. I know that the 'bots as a group are better than any
of the humans.
And I know that we will probably never answer most of these questions,
because there's not enough backgammon played, and a lot of the time
when it is played, money is an issue. Some years ago, I played on the
ImagiNation Network. We had a league that consisted of 10 rounds of
7-point matches (with the truly bizarre scoring format that the winner
got 7 points and the loser got the number of points scored - you could
get into situations where the leader wasn't good enough to double and
the trailer wasn't good enough to take!), and we also had a "Master's
League." This was a full round-robin of about 20 players, playing
11-point matches, and we played it two or three times. The entry fee
was $40, and we were all paying at least $20 a month just to use the
network, so there was really no issue that a good player couldn't
afford to enter. After 40 or 60 matches or whatever, we could have a
reasonable idea who was the best player in the league. (Kent
Goulding. Hands-down.) If we could have something similar at high
levels, then we might know. I'm not saying we ever will, or that we
have to. But there just aren't enough backgammon tournaments that are
geographically convenient to all the top players. In bridge we have
three ten-day tournaments a year at which somewhere around 90+% of the
very top players, and at least half of the good players, will show up.
(Hey, they even let me come!) I just don't see any way we'll be able
to define who the "best" are until there is a way for the top players
to spend many days a year playing against each other.
I'm rambling. Sorry.
After writing 200 lines on this subject, how could I possibly have more
to say? You don't know me very well... After submitting the initiating
post, I remembered that I forgot to remember to mention the following
effect. (Note that this may well work against the bots, but I certainly
think it is a factor in ratings.)
6) "Choking". Allow me to again begin with a (true) story. In September
1992 I happened to be in Dallas on BG night (I think it was Thursday) and
went to the local weekly tournament. In case you are unaware, Dallas has
more than its share of world class players. But on this night, there was
a "stranger" in town. A month previously, the 3rd (biennial) World Cup
had been held in this very city. Among the attendees was Paul Magriel,
(arguably) the biggest name in the history of BG (at least, if you mean
people who made their names playing BG). He had stayed around after the
big tourney, presumably to play for $. This night he fairly routinely (so
it seemed) made it to the finals in a field of about 20. I don't remember
who his opponent was. She would probably have entered the Intermediate
category at a regional/national tournament. But she had worked her way
through her half of the ladder. Having been booted, myself, (by neither
of the two finalists, if that matters) I decided to kibitz. (Unlike FIBS,
that means to watch and keep your mouth SHUT!) Again, I was annoyed...
"Oh, Paul, it's such an honor! I just can't believe... I'll remember
this the rest of my life... blah, blah, blah." And no, she wasn't conning
him. She was awestruck. (Lest this post sound sexist, I don't think the
fact that Magriel's opponent was a woman had much if any bearing. Magriel
is no Valentino! Now, if on the other hand, her opp had been Omar Sharif...)
I don't remember details of the match. But if one of her opening rolls had
been 31, I would have taken some decent odds and bet she would have misplayed
it. That is how poorly she performed. Magriel won in a walk. And he was
ultra-gracious all the way! (In reality, on the few occasions I've talked
with him in the 90's he has seemed to me to be a genuine, nice person. So
I certainly don't accuse him of insincerity. Hell, he was playing this match
for peanuts!) The bottom line--his opponent choked. Plain and simple. And
I think it happens all the time.
Allow me to propose the following "thought experiment". An unnamed
Internet BG server has two "new" members: Kit Woolsey and Wool Kitsey. Now
even the casual player recognizes the former as a gamester extraordinaire. In
BG alone he has a list of accolades as long as a whale's tail. Hell, on the
subject of how to play tournament backgammon, he literally wrote the book!
But who is this Kitsey person? Kit's distant Aussie cousin, known only for
his prowess in shearing sheep? (But, in reality, both players are one and
the same, a fact completely unknown to the other players on the server. And
now you have probably surmised that said server is NOT FIBS, since you're
only allowed one ID per person there!)
Both players start out at the canonical 1500 rating. Both players receive
IDENTICAL dice and play the same opponents. (Hey, this is a thought-
experiment. Lighten up!) Which one climbs the ratings ladder more quickly?
There's little doubt in my (simple) mind that Woolsey outrates Kitsey, at
least initially. Average players are less likely to choke against an unknown.
Let me attempt to clear your minds on a couple suspicions. Do I choke?
Damn right I do! Do I begrudge the experts for this "advantage"? Not as
long as they've earned their reputations honestly (which is the case of
virtually all highly regarded players today). Heck, I still have dreams of
collecting on some of that choking behavior myself someday! How big of an
effect is it? Pretty significant, IMHO.
There's another way to look at it. Take another real life example. In
the 1996 US Open (played in conjunction with the World Cup) the finals pitted
David Montgomery and Jake Jacobs. Jake has a (well deserved) strong
reputation at the BG table, and a dossier to accompany it. (I'm talking BG
dossier!) David was a virtually unknown college student, having gained his
entry into the event by earning his way as a result of winning a BG quiz
contest put on by Inside BG magazine. (No small feat, considering I was
one of the contestants left floundering in his wake! IMVHO, of course!!)
Jake won a close match--the chalk prevailed. But I wonder who had the
tougher route to the finals. Given the variability of the dice, it's hard
to say. But I'm sure there was more boot quaking going on in Jake's matches
than in David's.
: Allow me to propose the following "thought experiment". An unnamed
:Internet BG server has two "new" members: Kit Woolsey and Wool Kitsey. Now
:even the casual player recognizes the former as a gamester extraordinaire. In
:BG alone he has a list of accolades as long as a whale's tail. Hell, on the
:subject of how to play tournament backgammon, he literally wrote the book!
:But who is this Kitsey person? Kit's distant Aussie cousin, known only for
:his prowess in shearing sheep? (But, in reality, both players are one and
:the same, a fact completely unknown to the other players on the server. And
:now you have probably surmised that said server is NOT FIBS, since you're
:only allowed one ID per person there!)
: Both players start out at the canonical 1500 rating. Both players receive
:IDENTICAL dice and play the same opponents. (Hey, this is a thought-
:experiment. Lighten up!) Which one climbs the ratings ladder more quickly?
:There's little doubt in my (simple) mind that Woolsey outrates Kitsey, at
:least initially. Average players are less likely to choke against an unknown.
Your remarks about choking really struck home with me. I can vividly
remember the first time Kit invited me to a match on FIBS. I can
remember him inviting me, and I can remember nothing else. The whole
time I'm thinking "This is Kit Woolsey. I'm playing Kit Woolsey." I
didn't stand a chance. And very quickly, of course, the watchers
start showing up and now my only thought is "Can I just get out of
this match without embarrassing myself too badly?" Well, I must say,
the dice gods were extremely kind to me that day and I managed to win
the match, but I was so aware of how psyched out of that match I had
been that I immediately went to JellyFish, where I confirmed that I
had given up a ton of equity on poor moves -- missing fundamental
things like making my 5 point. I knew what was happening as it
happened, but it still got the best of me.
Of course I've mastered that phenomenon now. Now when I play Kit I
can usually remember to breathe at some point during the match. :)
>If anything, I think the rating system favors players who play weaker
>opponents than themselves.
That could well be the case. Now, does such a flaw favor the
humans or the bots?
>As for the question of variations - well - if a
>bot's weakness is analysis of some type of positions and a human's
>weakness is frustration doubles after playing for a greedy gammon and
>getting a very bad roll that takes him below the doubling window, why
>aren't these both factors in evaluating "skill?"
Good point (I think). Chuck says "...human emotion and the bots'
lack of such characteristics make the bots APPEAR to be better players."
Hank counters: "...not only does it make the bots APPEAR better, but
it actually argues that they ARE better." (paraphrasing, not quoting)
>I'm rambling. Sorry.
And you expect us to forgive you that easily? ;)
Perhaps my biggest choke of all was, strangely, against a bot. I was
playing mloner a couple years ago, and it was horrendously late at night. I
was WAY past tired, in that zombie-like state that won't let you type
"quit" but won't let you effectively use more than 2% of your brain either.
It was late enough that most Americans, even in the western time zones,
were doing what I should have been doing: sleeping. So as was usually the
case, mloner had been kicking my butt, but I'd been coming back, though on
this final game things weren't going so well and my only hope was to go
into a back game. If memory serves correctly, I didn't do too well at that
either, and it deteriorated to the point where I was on the bar during
mloner's bearoff, hoping for a shot. With a hit, I do recall that I was a
healthy favorite to win the match. So I watched. And waited.
Why does one choke against a bot? Well, since mloner was the top player on
FIBS at the time, it attracted a lot of attention. At this hour, it seemed
that the Europeans were all awake and alert. One by one, my match was
drawing watchers. I'd sometimes check to see who they were, and they were
generally 1800+ German players. Now, at the time, I was around 1500 (not
the mighty 1600 player I am now ;-) and the mere presence of an 1800 player
made me VERY self conscious of my game. And I didn't have just ONE of them
watching, I was starting to collect them like grocery store coupons. A more
focused player would have ignored all that, but I was doing the "whois"
command far more than I was concentrating on the game.
So, back to the game. mloner's starting to bear off, the tension is
building, and mloner gets a roll that it uses to clear the 6 point, but it
gives me a blot to hit on the ace point! If I can only roll a 1...
All eyes are on me. I roll a 6 1!!! YES! The crowd no doubt was excited
that this lowly 1500 player was about to defeat the mighty mloner!!!
Hooray for the underdog!!!
So what do I do? I come off the bar with the 6 and run, not hitting the
blot I'd been so eagerly awaiting. My mind was somewhere else, but what
brought me back to FIBS reality was noticing a quick flurry of "so-and-so
stops watching you" messages. I wondered if I'd passed gas or something!
Why did everyone leave suddenly? Then I saw. All of a sudden that queasy
sick feeling came over me... I had finally been handed my roll, and I
didn't even use it. All I could think about on my way to bed was how much
all those German 1800 players must have been shaking their heads at that
idiot KevinB. Why was such a stupid player occupying mloner's time...
Oh well, I probably will never make it to Germany anyway, and if I do, I'll
go under an assumed name...
> On Wed, 24 Dec 1997 04:59:53 GMT, d...@pacificnet.net (Dean Gay) wrote:
> >Your remarks about choking really struck home with me. I can vividly
> >remember the first time Kit invited me to a match on FIBS. I can
> >remember him inviting me, and I can remember nothing else. The whole
> >time I'm thinking "This is Kit Woolsey. I'm playing Kit Woolsey." I
> >didn't stand a chance. And very quickly, of course, the watchers
> >start showing up and now my only thought is "Can I just get out of
> >this match without embarrassing myself too badly?"
> Phew, I'm glad I'm not the only one then!
> _ Remove the "W" from my address to email me
> James Eibisch ('v')
> Reading, U.K. (,_,) N : E : T : A : D : E : L : I : C : A
> ======= -- http://www.revolver.demon.co.uk --
>1) Selection effects. (Basically I'm repeating the above problems with
>2) Exhaustion. Computers don't get tired. Humans do.
>3) Emotion. Computers don't feel emotion. They don't notice bad dice.
>4) Distractions. Does a computer's spouse interrupt and call it to dinner?
>5) "Giant killing". (I am of the belief that this is potentially a HUGE
I appreciate what you're saying but I don't think these five events
happen often enough to be significant. How often do you play
Suppose you're about to beat jellyfish, and your spouse interrupts your
train of thought and you make a bad misplay. BANG! Jellyfish wins a
7 point match and gains 5.35 ratings points, when it should have LOST
the match and, say, 7.65 ratings points.
Question. Do you think that Jellyfish's rating will forever be 13
points higher than it "should" be? Momentarily, yes, its rating
is higher than it deserves. But, 1000 games later (about 1 week for a bot)
this will no longer be true; in fact, its rating will be almost exactly
the same regardless of what your spouse did. That's because FIBS
has a good ratings formula.
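The claim that such a swing washes out can be checked with a small sketch. It assumes the FIBS rating formula in its commonly documented form: an upset probability of 1/(10^(D*sqrt(N)/2000) + 1) and a rating change of 4*K*sqrt(N)*P, with K = 1 for an experienced player. That form, the opponent rating, and the bot's "true" rating are all assumptions for illustration. The sketch tracks only the expected rating change per match rather than simulating dice.

```python
import math


def win_probability(rating, opp_rating, match_length):
    """Win chance implied by the (assumed) FIBS formula for `rating` vs `opp_rating`."""
    diff = rating - opp_rating
    return 1.0 / (1.0 + 10.0 ** (-diff * math.sqrt(match_length) / 2000.0))


def expected_change(rating, true_rating, opp_rating, match_length, k=1.0):
    """Expected rating change per match for a player whose real strength is `true_rating`."""
    p_true = win_probability(true_rating, opp_rating, match_length)
    p_pred = win_probability(rating, opp_rating, match_length)
    scale = 4.0 * k * math.sqrt(match_length)
    gain_if_win = scale * (1.0 - p_pred)   # winner gains more for an "upset"
    loss_if_lose = scale * p_pred          # favorite loses more when beaten
    return p_true * gain_if_win - (1.0 - p_true) * loss_if_lose


def gap_after(matches, swing, true_rating=1900.0, opp_rating=1600.0, match_length=7):
    """Expected excess rating left after `matches` matches, starting `swing` points too high."""
    rating = true_rating + swing
    for _ in range(matches):
        rating += expected_change(rating, true_rating, opp_rating, match_length)
    return rating - true_rating


if __name__ == "__main__":
    swing = 5.35 + 7.65  # win-versus-loss difference using the figures in the post
    for n in (0, 100, 500, 1000):
        print(n, round(gap_after(n, swing), 3))
```

Under these assumptions the excess shrinks by a little under one percent per match, so it is essentially gone after a thousand 7-point matches. With shorter matches the same formula corrects more slowly.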
It would be interesting if two identical copies of jellyfish were
running on fibs, one that played players with less than 1700 rating, and one
that played only those greater than 1700. That would prove something,
one way or the other.
No, I don't think that swing will affect the ratings well down
the road. But is it an isolated incident? Whatever the ratings
system memory time scale is, I predict there will be several
incidents where one or more of 1-5 occurs.
>It would be interesting if two identical copies of jellyfish were
>running on fibs, one that played players with less than 1700 rating, and one
>that played only those greater than 1700. That would prove something,
>one way or the other.
That would be a nice experiment, and could shed some light on
point 1 above. Whether it's worth the effort or not...