
The SSDF FAQ

Goran Grottling

Jul 23, 1995

Frequently asked questions about SSDF:

Q: Who is responsible for "The Swedish Rating List"?
A: The Swedish Chess Computer Association (in Swedish "Svenska
Schackdatorforeningen", abbreviated SSDF). The rating list is the result of
its members' efforts.

Q: Does a game have to be played with a particular time limit, or is the list
based on games played with various time settings?
A: All games are played at tournament time control (40 moves in 2 hours, i.e.
an average of 3 minutes per move); games played at any other rate are not
counted. However, SSDF publishes a separate rating list for blitz games
(5 min/game or 60 moves/5 min) in PLY, the journal of the Association, a
couple of times per year.

Q: Does SSDF have a lab full of computers playing each other?
A: No. All testing is done in our members' homes and on their own computers.
Also, most vendors are willing to lend us one or two chess computers for
testing when a new model is released.

Q: Can anyone play a few test games and send his results to SSDF?
A: SSDF only accepts results from its members. Furthermore, we do not accept
tests from people having a commercial interest in computer chess. The person
responsible for managing the tests regularly calls all testers, usually once
every few weeks, and collects their latest results. On those occasions, he
also plans upcoming tests and suggests suitable computers and/or programs to
be pitted against each other.

Q: Who is managing the list?
A: Since 1990 Thoralf Karlsson, chairman of SSDF, has been handling the list.
He took over from Goran Grottling who had been in charge since the inception
of SSDF in 1984.

Q: How do you know that you can trust reported results?
A: It's mostly a question of confidence. We have known most of the testers
for many years and don't believe they would try to deceive us. Also, all
testers are required to keep a written record of their activity. In case
there are any doubts, those records will surely be of good help.

Q: But in theory, someone could have sent in false results?
A: Yes, but not on a large scale. Experience has taught us that a series of
20 games, which is our normal test match between two computers, can produce
some rather unexpected results. Still we'd be very suspicious if anyone
reported that, say, Super Constellation outplayed Genius on a Pentium-class
PC by 15 to 5. You must remember that normally a lot of people test the same
program or computer, and we are therefore able to compare results from
different sources. Likewise, a tester who consistently reported low scores
for, say, Richard Lang's programs would raise a few eyebrows.

Q: How many people are involved in the testing?
A: At the end of 1994, SSDF had played well over 40,000 games. This had taken
us eleven years, and all in all 132 testers have been involved, each
contributing anywhere from 1 to 5,770 games. During 1994 alone, 40
people were doing tests. Our most industrious tester played over 700 games
that year; he usually plays three games in parallel. (And yes, that means he
has six computers!)

Q: How are the tests carried out?
A: Our goal is to always play matches of 20 games between two computers/
programs under test. We also consider it important to evaluate a computer
against various kinds of opponents. For instance, a new chess program has
often been tested by the programmer against contemporary products, but we
also try to find some old programs to run it against. The reason for this is
that the programmer may well have optimized its opening books so as to get
maximum performance against the major competitors in the market, but he has
most likely not had access to all the software that's hiding in the closets
of SSDF's members. Which programs or computers actually get to play each other
ultimately depends on what kind of equipment our testers have or are
able to borrow. So in order to let, say, Mephisto Vancouver and MChess Pro
meet, we have to find someone with access to both the Mephisto machine and
an ordinary PC. A third rule of thumb is that matches between opponents
whose ratings differ by more than 400 points are meaningless; the outcome of
such a match provides hardly any statistical information.
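As an illustration of that rule (a sketch using the standard Elo
expected-score formula, written here in Python; this is not SSDF's own
rating program):

    # Sketch only: expected score under the standard Elo formula.
    def expected_score(rating_a, rating_b):
        """Expected score (0..1) for A against B."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # With a 400-point gap the stronger side is expected to score about 91%,
    # so even a 20-game match says very little about either rating.
    print(round(expected_score(2400, 2000), 2))   # 0.91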

Q: Are all games played until mate?
A: No, but we don't accept early, grandmaster-style draws. Normally, a game
is allowed to go on a bit further than human players would have done. We
know that strange things can happen in computer games! Many testers also
follow their own rules. Gunnar Blomstrand, one of our more productive testers,
generally plays on until one of the computers evaluates its position as -10
or below. Super Expert, Hiarcs and some other computers are able to resign
on their own, and of course we accept that if it happens.

Q: What is SSDF's opinion on so called "killer libraries", opening libraries
that are specifically tuned to give good results when playing against certain
other computers?
A: We don't like them, but there is not much we can do. If we disqualified
results from games played with such a library, then surely someone would
protest against that. The best method is likely to be the one we described
above: make sure each program under test gets to play against as many
different opponents as possible, including older programs.

Q: When two computers repeat a game they have played before, are both games
included in the results?
A: Yes, a game is allowed to continue even if the tester can see that it is
going to be a duplicate. Any program that's stupid enough to lose the same
game several times has but itself to blame. Furthermore, from a statistical
point of view this behaviour is not very important, since the program is just
as likely to repeat a win as a loss.

Q: How are the ratings calculated?
A: SSDF uses its own rating program, written by our member Lars Hjorth, but
the basic formulas are derived from Arpad Elo's rating system. Our program
calculates, for each computer, the average rating of its opponents and how
many points it has scored. Given those two numbers, Professor Elo's formulas
produce a rating.
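The gist of that calculation can be sketched as follows (a simplified
illustration using the logistic form of the Elo curve; the function name and
the example numbers are ours, not taken from Lars Hjorth's program):

    import math

    # Simplified sketch of a performance-rating calculation.
    def performance_rating(avg_opponent_rating, points, games):
        # Score fraction; must be strictly between 0 and 1 for the formula.
        p = points / games
        # Elo rating difference corresponding to that score fraction.
        diff = 400.0 * math.log10(p / (1.0 - p))
        return avg_opponent_rating + diff

    # Example: 12.5 points out of 20 games against opponents averaging 2300
    # corresponds to a rating of roughly 2389.
    print(round(performance_rating(2300, 12.5, 20)))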
However, if all computers are only tested against other computers, all we
get is a relative rating that is just valid among those computers. Therefore,
SSDF has played several hundred games between computers and human players in
serious tournaments and used these results to set a "correct" absolute level
for the rating list according to Swedish conditions. Different national
rating systems are not completely in accordance though, and that has to be
taken into account when reading our list. For instance, US ratings seem to
lie approximately 150 points above the corresponding Swedish ratings (perhaps
more below 2000 and less at the other end of the scale). We ourselves
obviously use the Swedish scale.
We firmly believe that our ratings are correct in the sense that if a
computer were to play a sufficient number of games against Swedish humans, it
would end up with a rating close to what it has on our list. Unfortunately,
as programs get better it becomes increasingly difficult to arrange meaningful
games against human players. Reassuringly, we've noted that our ratings are
fairly consistent with the results from the yearly Aegon tournament in
Holland.

Q: SSDF often uses the term "margin of error". What factors influence the
size of this margin?
A: More than anything else, the number of games played determines the margin
of error (= confidence range, see below). Once upon a time, we thought that
40 games between two computers was a lot. Nowadays, we know more about
statistics. After so few games, you can almost never say for sure which
computer is better. Of course, in most situations your result after 40 games
looks similar to what you will see after 1,000 games, but it happens often
enough that the picture is different. Even if the two are of equal strength,
you may well get a result of 28-12 in the first series of games and 12-28 in
the next. The margin of error also depends on the relative strength of the
two computers. A big difference in strength results in a larger margin of
error. From a statistical point of view, the optimal solution is thus to
play a large number of games against opponents of similar strength.
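As a very rough illustration of how the margin shrinks with the number of
games (a back-of-the-envelope binomial approximation in Python; it ignores
draws and the uncertainty in the opponents' ratings, and it is not SSDF's
actual calculation):

    import math

    # Approximate half-width, in rating points, of a 95% confidence interval
    # for a program scoring about 50% after n games. Rough sketch only.
    def approx_margin(n_games, p=0.5):
        def to_elo(q):
            return 400.0 * math.log10(q / (1.0 - q))
        se = math.sqrt(p * (1.0 - p) / n_games)    # standard error of the score
        lo, hi = p - 1.96 * se, p + 1.96 * se      # 95% bounds on the score fraction
        return (to_elo(hi) - to_elo(lo)) / 2.0     # half-width in rating points

    for n in (40, 100, 400, 1000):
        print(n, round(approx_margin(n)))   # about 111, 69, 34 and 22 points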

Q: A typical line in your list looks like this: "Genius 3.0/P90/rating 2440
/+54/-49". How should I interpret all those numbers?
A: They tell you that Genius 3.0, when played on a 90 MHz Pentium PC, with 95
percent probability has a rating between 2391 (2440-49) and 2494 (2440+54).
The fact that we are using a 95 percent confidence interval implies that, on
average, 5 percent of our ratings will indeed lie outside the specified range.
Therefore, in a list with 60 computers, three of them are probably
erroneously rated. But neither we nor anyone else knows which ones.

Q: Can you explain why new computers tend to get a high rating, which then
decreases as more games are played?
A: This claim is simply not true. In early 1994, we studied the change in
rating for the 28 programs that had entered the list since the fall of 1991.
Of those, exactly half increased their rating during the period, while the
other half lost points. Admittedly, most of the programs that lost points
were the ones with high ratings, but we regard that as pure chance. It is
true that CM The King has fallen dramatically (72 points), and so have
Mephisto Risc, Vancouver 68030 and MCPro. But it's equally true that Zarkov
2.5 has gained 65 points and Chessmaster 3000 51 points during the same
period. We have observed that as more games are played, the list seems to
be "squeezed" so that the difference between the top and bottom computers
decreases. We are not sure why this happens, but it is most likely a
deficiency in Prof. Elo's rating system.

Q: But undoubtedly there are cases where SSDF has missed the mark with a
new program?
A: Yes, Mephisto Polgar, for instance. In 1989, after 94 games, Mephisto
Polgar was given the rating 2057 +/- 57. Now it has played 1693 games and
has a rating of 1973 +/- 17. Obviously, the first rating was too high, and
Polgar was thus one of the computers that ended up slightly outside its 95%
interval.

Q: And how about Mephisto Gideon?
A: Gideon was first rated on list number 8/93, after 176 games, and its
rating was given as 2319 (+59, -53). You'll recall that this is to be read in
full as "with 95% probability, the true rating lies between 2266 and 2378".
After 393 games, Gideon's rating is 2280 (+37, -35), and we see no
inconsistency between these two results.

Q: Do the testers use Windows multitasking when playing two PC programs
against each other?
A: No, definitely not! Even if such a solution could be made to work
technically, it would not produce correct results. Among other things, it
would mean that the programs could not think on the opponent's time, the
so-called "permanent brain" function. To test two PC programs according to
SSDF's rules, you must have two computers.

Q: Are these two machines required to have identical memory configurations?
A: They usually do. But even if one of them had 8 Mbyte of RAM and the other
just 4, it would not mean a lot. Kathe Spracklen once estimated the net
effect of doubling the size of the hash tables at about 7 rating points, and
we have not found anything to contradict that. It's a fact of life that not
all PCs are the same, not even if they have identical processors, and we'll
have to live with that. With so many factors besides the processor, such as
the speed of RAM, the size of the cache memory, the type of expansion bus and
the architecture of the motherboard, affecting performance, there is no way
that SSDF can enforce a standard.
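If one extrapolates that 7-point estimate logarithmically (purely our own
illustration; the scaling assumption and the helper function are ours, not an
SSDF formula):

    import math

    # Rough extrapolation of the quoted "about 7 points per doubling of the
    # hash tables" estimate; the logarithmic scaling is our assumption.
    def hash_rating_gain(hash_before_mb, hash_after_mb, points_per_doubling=7.0):
        return points_per_doubling * math.log2(hash_after_mb / hash_before_mb)

    # Going from 4 to 8 Mbyte (one doubling) would thus be worth about
    # 7 points, i.e. hardly noticeable next to other sources of variation.
    print(round(hash_rating_gain(4, 8)))   # 7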

Q: Is automated testing being utilized?
A: Yes, thanks to Dr. Christian Donninger from Vienna, we have been able to
run automatic tests since November 1994. So far only a few testers have been
able to use autotesting, since it requires two computers, but they have
produced many results. Autotesting has brought about a dramatic increase in
testing capacity for PC programs, which means that we are getting new
programs onto the list faster. The other testers will not
be superfluous, however. It still takes humans to maneuver ordinary chess
computers, and we will need them to test those.

Q: How do you set program preferences, and what opening library do you use?
A: We use instructions from the manual or, in some cases, straight from the
source - the programmer. Experimenting with various styles of play is out of
the question, since it would require hundreds of games with each setting to
differentiate between them. We do not have the time to do that. On those
occasions when a program has more than one opening library, we use the
"tournament library". Now, we do not believe that the choice of openings is
as important as the programmers tend to think, but we still have to use
optimal settings. If we didn't, someone would surely come along and blame an
unexpected (bad) result on our failure to do so.

Q: In your rating list it often says that PC-programs are played at "50-66
MHz" and in some cases "25-33 MHz". What does that mean?
A: Some of our testers have PCs with a 486DX processor running at 50 MHz,
others a 486DX2 running at 66 MHz. Many tests have shown
that the difference in speed between these two is not more than 5-10%. The
66 MHz processor is somewhat faster for chess programs, but the difference
in playing strength is not more than about 7 rating points. Similarly,
results obtained with 486 PCs running at 25 and 33 MHz have been lumped
together. Actually, very few games have been played on 25 MHz computers, but
we still want to give as accurate information as possible to our readers.

Q: Why isn't TASC R30 on the list? It is definitely a strong chess computer.
A: SSDF has not had the opportunity to test TASC R30. It is an exclusive
computer that has not sold very well in Sweden, and no retailer has been
willing to lend us a machine. Neither has TASC, although they have lent us a
number of 30 MHz Chess Machine cards. A few members have bought R30s though,
and they have reported some results to SSDF. Those results have been counted
as if they had been played with a Chess Machine 30-32 MHz.

Q: How important is speed?
A: If you double the clock speed, you gain about 70 points. That was true ten
years ago, when we evaluated Constellation, Plymate and others at different
speeds, and it still seems to be true in 1995 when we run PC programs that
have a much higher playing strength. Some say that it varies between
different machines or programmers, that some programs gain more and
others less from an increase in speed, but that has never been proved. You
will certainly find differences if you compare all programs we have tested
at different clock frequencies, but such differences could well be attributed
to statistical inaccuracies.
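If one assumes, as the rule of thumb suggests, that the gain scales with the
logarithm of the effective speed ratio (the scaling assumption and the
effective ratios below are ours, not SSDF figures), the numbers quoted in
this FAQ fit together reasonably well:

    import math

    # Sketch: about 70 rating points per doubling of effective speed,
    # extended logarithmically to other speed ratios (our assumption).
    def speed_rating_gain(speed_ratio, points_per_doubling=70.0):
        return points_per_doubling * math.log2(speed_ratio)

    print(round(speed_rating_gain(2.0)))    # doubling the clock: 70 points
    print(round(speed_rating_gain(1.075)))  # a ~7.5% effective gap, roughly the
                                            # 486DX/50 vs 486DX2/66 case: 7 points
    print(round(speed_rating_gain(1.45)))   # roughly the 486/33 vs 486DX2/66
                                            # case: about 38 points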

Q: You were late in starting to test Fritz 3.0 last autumn. In your comments
you said that this was because Chessbase did not send you the diskettes. But
couldn't you have bought the program yourselves?
A: That we could have, but it is a question of financial resources. We would
have needed to buy programs sufficient for 15-20 testers, which would mean
about 10 original diskettes. That would have been a considerable cost for a
small non-profit association like ours. Fortunately, the other programmers
have not been as slow as Chessbase was last summer (they eventually sent us
two diskettes). Most of them (Lang, Hirsch, Uniacke, Schroeder, de Koning
and Weststrate) are very interested in SSDF's test results and the position
their program will achieve on the list. Nowadays they therefore send us a
number of diskettes as soon as possible. We are grateful for this!

Q: Why did you test some PC-programs at 66 MHz and others at 33 MHz? Surely,
this is not fair.
A: Our test work is governed by our resources. During 1993/94, about half of
the PC-testers had the "slower" 486 and half had the faster one. Had we
tested each program at both speeds, the number of games at a given speed
would have been halved and the statistical uncertainty therefore greater. For
example, instead of having played 300 games with Genius 2.0 at 66 MHz, we
would have had 150 games at 66 MHz and 150 at 33 MHz. Each tester's capacity
is of course limited, and, incidentally, different testers play their test
games at very different paces.
You must remember that SSDF's rating list is not a commercially oriented
sales list. We assume that it is read by people who know to what degree
different processor speeds affect ratings. The difference between the faster
and slower 486 version is theoretically 35-40 points. We have two examples
of this, which confirm the theory: Genius 1.0 and MChess Pro 3.12 have been
tested at both speeds. Check this for yourselves in the rating list.
Almost all our testers have now upgraded to faster 486s, and all PC-programs
can be tested at the higher speed. But of course the problem has started
again as a few testers have now acquired Pentium 90 MHz machines, and it will
take time before we are able to test all new programs on this processor.

Q: But wasn't it strange that you chose to test Hiarcs 2 at the lower speed,
when it had sensationally won the world championship in Munich in 1993? And
by the way, why did you never test Hiarcs 2.1?
A: Towards the end of 1993, several new programs were released at about the
same time. These were Genius 2, MChess Pro 3.5, Chessmaster 4000, Hiarcs 2
and Socrates. As half of our testers at that time had 486/33 MHz computers,
we had to decide which programs to test on the faster 66 MHz computers and
which on the slower 33 MHz computers. Our guiding principle was that the
strongest programs should be tested at the higher speed.
Hiarcs 2 and Socrates were the two programs to be tested at 33 MHz, and the
results showed that we made the right choice. Neither of these programs
turned out to be better than the other three programs in question. In early
1995, Hiarcs 2.0 had 2208 after 229 games. At 50-66 MHz, the program would
have achieved 2250 at the most.
When we received diskettes with Hiarcs 2.1 from Mark Uniacke, we had already
completed 150 of the 229 games with the 2.0 version. We simply could not free
the resources to start all over again with the new version, which according
to Uniacke himself was probably only 10-15 points better. Furthermore, only a
few weeks separated the release of the two versions. For commercial reasons,
it was better to market the exact version that had achieved such a good
result in Munich.

Q: In the list 1/95 it says that 41,088 games have been played by 136
computers, but I can only find 59 on the list. In the long list, which can
be downloaded for free from SSDF's BBS, there are only 127 computers. How
can this be?
A: In order to save space, we have over time taken 68 computers out of the
list. We also feel that it makes the list easier to grasp when old computers,
taken off the market many years ago, are removed. Nine computers don't even
show on the long version of the list, as they haven't played the minimum
number of 100 games required to attain a position. However, all games played
by all computers are included in the calculation of the ratings. The old
games contribute to the stability of the list, and sometimes it is also nice
to study the long list if only for nostalgic reasons.

Q: How is your BBS reached?
A: If you have a modem you can call +46 31 992301, which is the number for
our BBS - "Grottan BBS". The first time you will be asked to answer a few
questions, and you cannot do much more. When the SysOp, Goran Grottling, has
accepted you as a new user, it becomes possible to download files and read
mail. You will be able to freely get the latest rating list, including
results from all individual matches between computers, both in the short and
long version. Grottan BBS is managed by SSDF and contains a large number of
chess-related files. It is possible to choose English as the language in the
BBS. Grottan BBS is connected to FidoNET and has the address 2:203/245.

Q: How can I get the ordinary paper version of "The Swedish Rating List"?
A: By paying to:

Goran Grottling
Diabasvagen 3
S-437 32 Lindome
Sweden

The charge for 8 lists per year is 120 SEK, but to this must be added the
cost for Goran of exchanging cheques or foreign currency into SEK; an
additional 50 SEK should cover these expenses. A Eurocheque is usually a good
alternative. European citizens can send the fee to the postal account of
SSDF: 418772-0.

Q: A final question: Do you at SSDF believe that your rating list represents
"The Absolute Truth"?
A: No, we are quite humble about this. Above all, we want our readers to
realize that the ratings are not exact enough to attach any significance to
rating differences of 10-20 points. You also have to consider the confidence
range for each program. We are also aware that some people believe that
certain programs perform better against humans, while others perform worse.
However, we think that this remains to be confirmed. Of course, the
rating list of SSDF only accurately represents the outcome for machines
playing against each other. Anyway, it must be better to rely on thousands of
computer games than only a few!

