I have played thousands of games against JF on levels 5, 6, and 7. I keep
careful records of every game outcome. My statistics show that my results
do not depend on which level JF is playing at.
Brian, I for one would VERY MUCH like you to summarize the data in a
newsgroup post. Specifically, the various game outcomes as a function of the
level you were playing against. You have made some rather strong
statements about your own ability (compared to JF) and the relative
abilities of its different levels with respect to your game. I really
think you owe us (and Fredrik) the raw data so that we can reach our
own conclusions. You have kept careful records (which is good).
Would you PLEASE post them. I would like to see a compilation with
headings such as (WARNING: not real data!):
vs. JF level-5:
Brian wins: 1 2 4 6 8 12 16 ...
total # of games: 131 45 13 ....
JF wins: 1 2 4 6 8 12 16 ...
total # of games: 105 56 9 ....
vs. JF level-6: (etc.)
Now I am going to produce some evidence, analysis, and statistics
which provide some (but probably not conclusive) results backing the
side (my side, BTW) which says that JF-7 plays stronger than JF-5 vs.
typical human players (here the FIBS community). First the raw data
(and thanks to Jason Lee and Matt R.--aka Hacksaw--for doing the FIBS
rating reports over the time period captured below):
  date     jellyfish               JF_level_five           rating
           rating   exper change   rating   exper change   difference

30-Jan-98           29548                   63516
15-Dec-97  2037.68  29548      0   2004.14  63516      0
02-Dec-97  2037.68  29548      0   2004.14  63516      0
19-Nov-97  2037.68  29548      0   2004.14  63516      0
06-Jun-97  2048.75  29485     63   2004.14  63516      0
26-May-97  2043.47  29270    215   2004.14  63516      0
09-May-97  2067.89  26725   2545   2004.14  63516      0
25-Apr-97  2052.86  26197    528   2004.14  63516      0    48.72
11-Apr-97  2033.78  26122     75   1959.12  58052   5464    74.66
21-Mar-97  1975.93  25094   1028   1923.87  52683   5369
08-Mar-97  1975.93  25094      0   1870.71  47345   5338
15-Feb-97  1975.93  25094      0   1893.67  41437   5908
01-Feb-97  1975.93  25094      0   1918.44  36870   4567
17-Jan-97  1975.93  25094      0   1887.97  34447   2423
31-Dec-96  1974.96  25079     15   1853.69  30422   4025
14-Dec-96  1972.03  25077      2   1830.79  24432   5990   141.24
18-Nov-96  2006.16  24927    150   1908.44  17173   7259
17-Sep-96  1903.77  19046   5881   1908.44  17173      0
26-Aug-96  1921.30  17759   1287   1908.44  17173      0    12.86
29-Jul-96  1933.81  17209    550   1894.14  17013    160
15-Jul-96  1933.81  17209      0   1825.27  15982   1031   108.54
03-Jul-96  1919.74  16789    420   1824.66  14093   1889
17-Jun-96  1971.99  15376   1413   1824.66  14093      0
02-Jun-96  1943.78  13372   2004   1824.66  14093      0   119.12
11-Mar-96  1946.17   7559   5813   1816.86   9117   4976
22-Feb-96  1946.17   7559      0   1828.68   5246   3871   117.49
21-Jan-96  1984.16   5661   1898   1901.31   3007   2239

totals                     23887                   60509
games missed                5661                    3007

average rating difference                                   88.95
std dev of rating difference                                45.56
Now the analysis (some of which is actually in the table):
In order to compare the two bot versions, I required that each
played a minimum of 100 games since the last time I did a comparison.
This cut into the raw data considerably, since often at least one of
them was idle. The last column shows seven rating periods when this
requirement was met. (BTW, I came up with this plan BEFORE looking
at the data, just in case someone was wondering...).
The average difference in FIBS ratings over these seven rating
periods was 88.95 rating points. The standard deviation was 45.56
rating points. The result is almost a two standard deviation result.
On the surface, the data shows that it is quite likely that JF level-7
(jellyfish on FIBS) plays better against the typical FIBS opponent
than does JF level-5.
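
(For anyone who wants to check the arithmetic, here is a minimal Python
sketch; the seven values are simply the "rating difference" column from the
table above, and the standard deviation is the sample standard deviation,
i.e. dividing by n-1.)

    # Check of the summary statistics quoted above: the seven entries in
    # the "rating difference" column of the table.
    import statistics

    diffs = [48.72, 74.66, 141.24, 12.86, 108.54, 119.12, 117.49]

    print(f"average rating difference: {statistics.mean(diffs):.2f}")     # ~88.95
    print(f"std dev of rating difference: {statistics.stdev(diffs):.2f}") # ~45.56
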
There are some assumptions, and I will try to list them all (but
almost certainly leave some out).
1) The quality of opponent was typically the same for each bot. (Probably
a decent assumption, but I could imagine that SOME strong players would
only go after level-7 while SOME weak players might steer towards level-5.
Of course if the ratings system is robust--meaning insensitive to level
of opponent--then even this bias, if present, wouldn't matter.)
2) Fredrik was not changing the neural net during this time period, or,
if he in fact was, that he was changing BOTH robots and inserting the
SAME neural net. (Fredrik, could you please comment.)
Note my method is set up so as to be insensitive to variations in
overall ability of the FIBS community as a function of time. (It has been
speculated that "ratings inflation" on FIBS could be due to an overall
change in player abilities.) I only compare the ratings over the
same time period.
I'm sure many have noticed that the FINAL ratings of the two JF versions
differed by a much smaller amount: just over 33 rating points instead of
the 89 of my study. I don't see this as anything more than a single data
point compared to the seven data points I used (which varied from 13 to 141
in difference). Of course if Fredrik was changing the NN's over the time
period above then this conclusion could be way off.
OK, now it's your turn, advocates of the opposite opinion. Please
present some data.
Chuck
bo...@bigbang.astro.indiana.edu
c_ray on FIBS
(snip)
> Now I am going to produce some evidence, analysis, and statistics
>which provide some (but probably not conclusive) results backing the
>side (my side, BTW) which says that JF-7 plays stronger than JF-5 vs.
>typical human players (here the FIBS community). First the raw data
>(and thanks to Jason Lee and Matt R.--aka Hacksaw--for doing the FIBS
>rating reports over the time period captured below):
>
> date jellyfish JF_level_five rating
> rating exper change rating exper change difference
>
>30-Jan-98 29548 63516
>15-Dec-97 2037.68 29548 0 2004.14 63516 0
>02-Dec-97 2037.68 29548 0 2004.14 63516 0
>19-Nov-97 2037.68 29548 0 2004.14 63516 0
>06-Jun-97 2048.75 29485 63 2004.14 63516 0
>26-May-97 2043.47 29270 215 2004.14 63516 0
>09-May-97 2067.89 26725 2545 2004.14 63516 0
>25-Apr-97 2052.86 26197 528 2004.14 63516 0 (48.72) remove
(the above difference doesn't count since jellyfish only played
75 matches in the period)
>11-Apr-97 2033.78 26122 75 1959.12 58052 5464 74.66
>21-Mar-97 1975.93 25094 1028 1923.87 52683 5369
>08-Mar-97 1975.93 25094 0 1870.71 47345 5338
>15-Feb-97 1975.93 25094 0 1893.67 41437 5908
>01-Feb-97 1975.93 25094 0 1918.44 36870 4567
>17-Jan-97 1975.93 25094 0 1887.97 34447 2423
>31-Dec-96 1974.96 25079 15 1853.69 30422 4025
>14-Dec-96 1972.03 25077 2 1830.79 24432 5990 141.24
>18-Nov-96 2006.16 24927 150 1908.44 17173 7259
>17-Sep-96 1903.77 19046 5881 1908.44 17173 0
>26-Aug-96 1921.30 17759 1287 1908.44 17173 0 12.86
>29-Jul-96 1933.81 17209 550 1894.14 17013 160
>15-Jul-96 1933.81 17209 0 1825.27 15982 1031 108.54
>03-Jul-96 1919.74 16789 420 1824.66 14093 1889
>17-Jun-96 1971.99 15376 1413 1824.66 14093 0
>02-Jun-96 1943.78 13372 2004 1824.66 14093 0 119.12
>11-Mar-96 1946.17 7559 5813 1816.86 9117 4976
>22-Feb-96 1946.17 7559 0 1828.68 5246 3871 117.49
>21-Jan-96 1984.16 5661 1898 1901.31 3007 2239 [ 82.85] (include)
(this last--first chronologically--rating period SHOULD have
been included. jellyfish played 433 matches that period and
JF_level_five began during that time, thus playing 3007 matches)
>
> totals 23887 60509
> games missed 5661 3007
> average rating difference 88.95
> std dev of rating difference 45.56
LAST TWO LINES WITH CORRECTIONS DETAILED ABOVE:
average rating difference 93.82
std dev of rating difference 42.24
The changes aren't very large. Also, requiring a minimum of 100
matches was an arbitrary threshold, but it was decided upon before
looking at the data so I should stick with that (and thus I've thrown out
the 25-Apr-97 rating report where jellyfish had only played 75 matches
since the previous report).
BTW, the "games missed" line is just the total experience for
each player minus the sum of the third and sixth columns respectively
("change" columns) and can be seen to be the total number of matches
played by the bots through 21-Jan-96, the first rating period to have
jf_level_five. This is just a "check sum" to ensure I didn't miss
something.
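
(In code, the check is just the following; the numbers are taken straight
from the table above.)

    # "games missed" check: total experience minus the summed "change"
    # column should equal each bot's experience on 21-Jan-96, the first
    # rating report that includes jf_level_five.
    jf_missed  = 29548 - 23887   # = 5661  (jellyfish exper on 21-Jan-96)
    jf5_missed = 63516 - 60509   # = 3007  (JF_level_five exper on 21-Jan-96)
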
(snip)
> In order to compare the two bot versions, I required that each
>played a minimum of 100 games since the last time I did a comparison.
(snip)
This should have said "100 MATCHES", not "100 games".
The higher levels do play better, measured by JF rollouts.
You could verify this quite easily by rolling out positions
that the levels disagree on.
The answer to Brian's results is probably quite easy to explain.
First of all, he has hardly played enough games to eliminate the
randomness of his average results (I don't know this for sure,
as I don't know just how many games he has played).
Secondly, the level 5 really does play a very solid game.
When I lost my university account, it had a rating > 2000 on FIBS.
That was lucky, I think, but still.
Thirdly, people tend to adjust their playing speed to their opponent.
(This is part of the reason why lvl5 did so well.)
Human expert play deteriorates more than the expert thinks, if
he plays even slightly faster than his usual speed.
(My opinion, based on analysis of my own game; not claiming to
be a world-class expert, though.)
Fredrik Dahl.
I would be keen to see that too. In the meantime though, I believe the
total number of games played was around 3000 and that Brian was ahead by
an "insignificant amount" against L7 and behind by an "insignificant amount"
against L5. For lack of real data, I'll assume he played 1000 games against
each of levels 5, 6 and 7, and came out even against each one -- with a lot
of hand waving, you can see that this data set would lead to his conclusion.
(snip)
> The average difference in FIBS ratings over these seven rating
>periods was 93.82 rating points. The standard deviation was 42.24
>rating points. The result is almost a two standard deviation result.
>On the surface, the data shows that it is quite likely that JF level-7
>(jellyfish on FIBS) plays better against the typical FIBS opponent
>than does JF level-5.
I believe you can make a stronger conclusion than this. That standard
deviation is _your estimate of the population standard deviation_,
but we want to know _the error you expect in your measurement of
the population mean_. Since you made several "independent" (not quite
independent; see below) samples to arrive at your mean, you have been
able to use 6 degrees of freedom in your result and hence reduce the
variance by a factor of 6 (i.e. the standard error by a factor of
sqrt(6)). Your data show JF 7 to be nearly 6 sds stronger than JF 5,
and are overwhelming support of your hypothesis. ("Quite" and "very"
are reasonable adjectives for results significant to one and two
standard deviations... I run out of hyperboles before I get to six :-)
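
(A minimal sketch of that arithmetic in Python; the seven corrected
differences are hard-coded from the table, and the standard error of the
mean is taken here as sd/sqrt(n), which is slightly smaller than the sqrt(6)
version above but leads to the same conclusion.)

    # How far is the mean rating difference from zero, measured in
    # standard errors of the mean?
    import math
    import statistics

    # the seven corrected rating differences (25-Apr-97 dropped, 21-Jan-96 added)
    diffs = [74.66, 141.24, 12.86, 108.54, 119.12, 117.49, 82.85]

    mean = statistics.mean(diffs)        # ~93.82
    sd   = statistics.stdev(diffs)       # ~42.24, estimate of the population sd
    sem  = sd / math.sqrt(len(diffs))    # ~15.97, standard error of the mean

    print(f"{mean:.2f} / {sem:.2f} = {mean / sem:.1f} standard errors above zero")  # ~5.9
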
Personally, I would be reluctant to assume that samples that may be as
close as 100 experience points apart are independent. (Refer to an
earlier article of mine arguing that the "half life" of FIBS rating
points is of the order of 200 experience.) I made another analysis
measuring the two populations (L7 and L5) separately at intervals of
at least 400 experience and then compared the results (i.e. I calculated
the difference of the means; Chuck computed the mean difference) but
the eventual conclusion was very similar to Chuck's so I won't bother
repeating it here. My computation shows the (individual) population
standard deviations to be considerably larger than Chuck's measurement,
but my results used greater degrees of freedom so overall our standard
errors were about the same.
Now comes the hard part: reconciling the fact that Chuck and Brian's
experiments yield (apparently) incompatible conclusions. Were either
or both experiments performed incorrectly? Are the data that have been
presented honest and accurate? Let me add that personally I have every
faith in Chuck and Brian's ability and honesty (I know you find judging
articles by the reputation of the author rather than the quality of the
reasoning to be distasteful, Chuck, but bear with me for a while :-) -- I
am willing to accept both sets of results and conclusions as they stand.
Allow me to reword Chuck and Brian's conclusions to see if they really
are incompatible. Chuck finds that JF7 is stronger than JF5 by 94 +/-
34 FIBS rating points (2 sd); Brian finds that JF7 is equal to JF5 (my
interpretation); assuming 1000 money games against each level, the 2 sd
confidence interval is 0.0 +/- 0.27 points per game. (The justification
for this result: the standard deviation of a single money game is
approximately 3 points; therefore the standard deviation after 1000
games is 95 points, or 0.095 points per game. The standard deviation
in the difference between two of these quantities is 0.134 points per
game; therefore the 2 sd confidence interval is +/- 0.27.)
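
(The same propagation spelled out in Python; the 3-points-per-game standard
deviation and the 1000-games-per-level figure are the assumptions stated
above, not measured values.)

    # Confidence interval for the (assumed) 1000-games-per-level comparison.
    import math

    sd_game = 3.0       # assumed sd of a single money game, in points
    n_games = 1000      # assumed games against each level

    sd_total = sd_game * math.sqrt(n_games)   # ~95 points over 1000 games
    sd_ppg   = sd_total / n_games             # ~0.095 points per game
    sd_diff  = math.sqrt(2) * sd_ppg          # difference of two such averages, ~0.134

    print(f"2 sd interval: +/- {2 * sd_diff:.2f} ppg")   # ~0.27
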
How do we convert between FIBS ratings and expected points per game?
To the best of my knowledge this is an open question. However, here's
a simple model. Assume 1-point matches are being played on FIBS.
FIBS expects that between players ranked 94 +/- 34 points apart (as
Chuck found JF 7 to be above JF 5), the favourite will win 52.7% +/-
1.0 of the games. If we assume this constant factor is also correct
in money games, and assume a win in a money game is worth 2 points on
average (see my other article for justification), then this 2.7% +/-
1.0 CPW is worth 0.108 +/- 0.04 points per game. So, the results of
the two experiments are:
Chuck: 0.108 +/- 0.04
Brian: 0.000 +/- 0.27
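
(For concreteness, a sketch of that conversion; the win-probability formula
is the standard FIBS one, P = 1/(1 + 10^(-D*sqrt(n)/2000)), and the
2-points-per-win factor is the assumption above.)

    # Convert a FIBS rating difference into a rough money edge, per the
    # simple model above (1-point-match win probability, 2 points per win).
    def fibs_win_prob(rating_diff, match_length=1):
        """Probability the higher-rated player wins, per the FIBS formula."""
        return 1.0 / (1.0 + 10.0 ** (-rating_diff * match_length ** 0.5 / 2000.0))

    for diff in (60, 94, 128):           # Chuck's 94, plus/minus the 2 sd of 34
        p = fibs_win_prob(diff)          # ~51.7%, 52.7%, 53.7%
        edge = (2 * p - 1) * 2           # assume an average win is worth 2 points
        print(f"{diff:3d} rating points -> {100 * p:.1f}% CPW -> {edge:+.3f} ppg")
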
Note that Brian's confidence interval INCLUDES Chuck's! The conclusions
do not disagree after all! My interpretation is that JF 7 is a little
stronger than JF 5 (by about 2.7% CPW, or 0.11 money ppg) -- this is
only a slight advantage, and NOT significant enough to be detected by
even 1,000 money games (as Brian found).
> OK, now it's your turn, advocates of the opposite opinion. Please
>present some data.
Well, I'm not advocating any opinion, and I don't have any new data, but
I hope both kinds of advocates will accept this sort of statistical olive
branch, and agree that JF 7 appears slightly stronger than JF 5, by about
0.11ppg.
Cheers,
Gary (GaryW on FIBS).
--
Gary Wong, Department of Computer Science, University of Arizona
ga...@cs.arizona.edu http://www.cs.arizona.edu/~gary/
(snip)
Well, there Gary goes again. Instead of just snowing us with his
opinion, he's got to blow us away with statistics! I've read this twice,
and must admit I don't understand it completely, but what I do understand
I can't find fault with. Nice work, Gary. (But I reserve the right to
rescind this compliment if someone shows that it is all a bunch of smoke....
Now, if you had contradicted me, I'm SURE I could have found all kinds of
errors. ;)
There are a couple of numbers which surprise me just a bit: a 100 point
rating difference only gives a 53-47 edge in a one-point match. I looked
at Kevin Bastian's nice writeup on the FIBS rating formula at:
http://www.northcoast.com/~mccool/fibsrate.html
and, sure enough, that's what the formula says. Thought I had you there,
Gary....
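
(A quick Python check of the same formula.)

    # 100 rating points in a 1-point match, per the FIBS formula:
    p = 1.0 / (1.0 + 10.0 ** (-100 / 2000.0))
    print(f"{100 * p:.1f}%")   # ~52.9%, i.e. roughly a 53-47 edge
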
The other surprise (to me) is related: 93 point ratings difference is
only worth about 0.1 ppg at money play. I remember talking about this kind
of thing with David Montgomery a while back. I seem to recall that he had
a different correlation between ratings difference and money play advantage,
so I'm cc'ing him in hopes that he will elaborate.
Also, I suspect that Fredrik has pitted JF-5 vs. JF-7 at one time or
another. I asked him in an e-mail if he would post some results on this
if he has it handy. I'm still hopeful he will comment. It would be
interesting to see how Gary's prediction (~0.1 ppg at money play)
compares to real-life numbers. It may say something about the ratings
formula (but then again, maybe not...).
Chuck wrote:
> The other surprise (to me) is related: 93 point ratings difference is
> only worth about 0.1 ppg at money play. I remember talking about this kind
> of thing with David Montgomery a while back. I seem to recall that he had
> a different correlation between ratings difference and money play advantage,
> so I'm cc'ing him in hopes that he will elaborate.
I looked at this a few months ago, but I used a much more complicated
model than Gary's. Instead of just going with the one point match
win percent, here is what I did:
- set a probability distribution for the points won by each player
when they win. (for example: each player might win 1 point 38%,
2 points 38%, 4 points 20%, 6 points .5%, 8 points 2.5%,
and 16 points 1%)
- set a probability for how likely it is one player will beat the other.
- play a "long" match between these two players, assuming that the
results for each game will follow the money distribution until the
players get "close" to the end of the match.
- at this point, settle the match using a match equity table.
(An important refinement: use a skill-adjusted match equity table,
like those in _Can a Fish Taste Twice as Good_.)
- repeat this many times and determine overall match winning chances
for the two players
Based on the match winning chances, it is easy to get the rating
difference. Based on the probabilities you set, you have the money
points per game.
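
(A much-simplified Python sketch of the procedure: it omits the
match-equity-table settlement near the end of the match and just plays the
money distribution all the way out; the 53% per-game win probability is an
arbitrary illustrative number, and the points distribution is the
hypothetical example from the list above.)

    # Much-simplified sketch: play "long" matches in which every game follows
    # a money-play points distribution, then turn the observed match-win rate
    # into an implied FIBS rating difference.  (The match-equity-table
    # settlement near the end of the match is omitted here.)
    import math
    import random

    P_WIN   = 0.53                                # illustrative per-game edge
    POINTS  = [1, 2, 4, 6, 8, 16]                 # points won by the game's winner
    WEIGHTS = [0.38, 0.38, 0.20, 0.005, 0.025, 0.01]  # the example distribution above

    def play_match(length=25):
        """True if the stronger player wins a match of the given length."""
        a = b = 0
        while a < length and b < length:
            pts = random.choices(POINTS, WEIGHTS)[0]
            if random.random() < P_WIN:
                a += pts
            else:
                b += pts
        return a >= length

    def rating_diff(win_prob, length=25):
        """Invert the FIBS formula to get the implied rating difference."""
        return 2000.0 / math.sqrt(length) * math.log10(win_prob / (1.0 - win_prob))

    trials = 20000
    w = sum(play_match() for _ in range(trials)) / trials

    points_per_win = sum(p * wt for p, wt in zip(POINTS, WEIGHTS))   # ~2.33
    money_edge = (2 * P_WIN - 1) * points_per_win                    # money ppg edge

    print(f"match win prob ~ {w:.3f} -> rating difference ~ {rating_diff(w):.0f}")
    print(f"money edge ~ {money_edge:+.2f} ppg")
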
I did this for a lot of different probability distributions and
edges in probability of winning, along with a lot of different
definitions of "long" and "close" above.
The overall results are:
A rating difference of 40-50 points corresponds to about a .10ppg
money edge.
The key assumption is that play is like money until you get "close"
to the end of the match. This is pretty true most of the time.
When the score gets really lopsided, it's not. Also, there might be
some changes in the low-frequency distributions (8-point and 16-point
wins) even fairly early in a match. I don't think this swings
much.
*Much* more important are many other factors. Some money players
don't play matches. And vice versa. Certain styles of play are
better suited to money or matches. And so forth. So this
result, even if valid, is *only an approximate rule of thumb*.
Data from my own real-life money play conforms to this rule. I think
I have about a 150 point rating edge over my average local opponent,
based on watching FIBS ratings go up and down, and my long-term money
result is about +.30ppg.
David Montgomery
mo...@cs.umd.edu
monty on FIBS