Statistical validity of medium-length match results

37 views

brucemo

Feb 15, 1997, 3:00:00 AM2/15/97
to

Someone told me my program stunk. This is no big deal, except that this
person also told me that my program didn't used to stink. I decided I'd try
to think of a way to assess the strength of my program now versus its
strength a few months ago. I don't know of an easy way to do this well, but
I do know of an easy way to do this badly, and I have mechanisms built into
all versions of my program that allow me to use this easy bad way.

The easy bad way: an autoplay match. One reason it's bad is because there
isn't sufficient book variation to get accurate results (you get repeated
games), but my program's book is pretty broad so I ignored this.

I let various versions of my program play a hundred games or so each versus
an older version that I used as a control, then I tried to make sense out of
the results.

The sample size varied, and the results varied. My newest version did the
worst. It scored 34.7% in 98 games versus the old version. The best was
one from a few weeks before that, it score 48.9% in 88 games, but one from a
few days before the newest version scored 47.8% in 316 games (I let this one
go for a long time).

What does all this mean? Does my newest version really suck?

You can obviously use statistical methods to understand this, but I don't
know enough about statistics to apply the right terms to this.

In the mean time, I figured that I'd do an experiment, since I'm a
programmer and as a consequence experiments like this are easy for me to do.

I figure that the results of this experiment are interesting enough that
it's worth posting them here, even if by doing so I demonstrate my lack of
understanding of statistics. If someone would like to post some information
about statistical methods that can be used to do this work, please feel free
to do so, this is one reason I'm writing this post.

Here's my experiment:

I wrote a program that would simulate matches between programs of equal
strength. Based upon observation, two equal programs should draw about 40%
of the games, and should split the remaining 60% equally.

So I just wrote something that scores "games" based upon a random number,
collects results of "matches" of various lengths, and figures out the
biggest observed ELO point delta that occurred in any of the matches of a
given length.

Here is my data:

Length Score Percent ELO
------ ------ ------- ---
10 8.5 85.0% 301
20 16.5 82.5% 269
30 22.0 73.3% 176
40 29.0 72.5% 168
50 35.0 70.0% 147
75 50.0 66.7% 120
100 61.5 61.5% 81
150 94.5 63.0% 92
200 117.5 58.8% 61
250 148.5 59.4% 66
300 173.5 57.8% 55
400 225.0 56.3% 44
500 284.0 56.8% 48
1000 541.0 54.1% 29
5000 2592.5 51.8% 13
10000 5122.5 51.2% 9
100000 50440.0 50.4% 3

This is output from one run of my little test program. The first column is
match size. I ran 1000 matches of this length, and figured out which of
them had the biggest variation from a 50% result, which is the second and
third columns. So, in a ten-game match, at least one time one of my
supposedly equal players scored 8.5 out of 10. This works out to a 301 ELO
point observed delta between these two players, when in fact the players
have a true ELO point delta of zero points.

In a hundred-game match, the best anyone did in 1000 matches is 61.5%, or 81
ELO points, and in a 100,000-game match, the best was 50.4%, or 3 elo
points.

What this shows me is that short matches have a lot of possibility for
error. But what surprised me was that the definition of "short" was a
little broader than I'd thought. I figured that 100 games would be enough,
but it certainly does not appear to be enough to determine between a few ELO
points.

Perhaps I can be criticized for using doing 1000 matches, since it could be
argued that the odds of a truly strange result could be fairly low, but that
in 1000 matches I'm probably doing enough to catch at least one strange
result. But when I did fewer matches (like 10) I still saw a lot of
deviation from the expected zero ELO points. For instance I just did one
with ten matches and saw a deviation of 56 ELO points in a hundred-game
match (58%), and this is not atypical.

I think it's safe to conclude that it is not possible to determine if a
change you make to your program is worth 20 ELO points, based upon a
100-game match, or even a 300-game match.

bruce

PS. Here's the test program. It compiles under MSVC for Windows NT. If
you are using something else, you will have to take out all reference to
"Reseed" and "windows.h", unless you feel like using some more standard
mechanism for setting your random number seed, in which case you can rewrite

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct tagSTAT {
int iSize;
int iBest;
} BKT;

BKT s_argbkt[] = {
10, 0,
20, 0,
30, 0,
40, 0,
50, 0,
75, 0,
100, 0,
150, 0,
200, 0,
250, 0,
300, 0,
400, 0,
500, 0,
1000, 0,
5000, 0,
10000, 0,
100000, 0,
-1,
};

#define iITER 1000

void Reseed(void)
{
SYSTEMTIME st;

GetSystemTime(&st);
srand((st.wHour ^ st.wMinute ^ st.wSecond ^ st.wMilliseconds) & 0x7FFF);
}

void main(int argc, char *argv[])
{
register i;
register j;
int iIter;

Reseed();
for (iIter = 0; iIter < iITER; iIter++) {
int iScore;

for (i = 0; s_argbkt[i].iSize >= 0; i++) {
iScore = 0;
for (j = 0; j < s_argbkt[i].iSize; j++) {
int iRand;

iRand = rand() % 10;
if (iRand <= 2)
iScore += 2;
else if (iRand <= 5)
iScore += 0;
else
iScore += 1;
}
iScore = abs(iScore - s_argbkt[i].iSize) + s_argbkt[i].iSize;
if (iScore > s_argbkt[i].iBest)
s_argbkt[i].iBest = iScore;
}
}
printf("Length Score Percent ELO\n");
printf("------ ------ ------- ---\n");
for (i = 0; s_argbkt[i].iSize >= 0; i++) {
double rResult;
double rElo;

rResult = (double)s_argbkt[i].iBest /
(double)(s_argbkt[i].iSize * 2);
rElo = 400.0 * log10(1.0 / (1.0 - rResult) - 1.0);
printf("%6d %7.1f %4.1f%% %3.0f\n", s_argbkt[i].iSize,
(double)s_argbkt[i].iBest / 2.0, rResult * 100.0, rElo);
}
}

Robert Hyatt

Feb 15, 1997, 3:00:00 AM2/15/97
to

I'll add notes below, but what you see is what caused me to stop auto-testing
15 years ago. I used to do it *all* the time because (a) that was the only
way to get lots of games before there was an ICS, and (b) it also located
lots of bugs. But trying to decide which version was "better" was a difficult
process, because "better" is vague. Better at beating a version of itself,
or better at playing chess. Games between crafty/ferret would be more
interesting to watch, because you pull for one side or the other, but if you
watch ferret/ferret, you like to see one side win, but you get pissed when the
other side loses, so you always lose... every game... :)

However, even crafty/ferret is flawed, because two programs can have minor
differences that produce major differences. One example was a comment from
Lonnie earlier this week about Mchess Pro 6.0 (I believe). He was playing
it against Crafty and crafty was winning more than it lost by a big percentage,
then he played it against Ferret, and the result was the opposite. However
when Crafty and Ferret plays, Ferret generally comes out well ahead. Now
try to draw conclusions from *that*... :)

: Someone told me my program stunk. This is no big deal, except that this

: person also told me that my program didn't used to stink. I decided I'd try
: to think of a way to assess the strength of my program now versus its
: strength a few months ago. I don't know of an easy way to do this well, but
: I do know of an easy way to do this badly, and I have mechanisms built into
: all versions of my program that allow me to use this easy bad way.

: The easy bad way: an autoplay match. One reason it's bad is because there
: isn't sufficient book variation to get accurate results (you get repeated
: games), but my program's book is pretty broad so I ignored this.

: I let various versions of my program play a hundred games or so each versus
: an older version that I used as a control, then I tried to make sense out of
: the results.

: The sample size varied, and the results varied. My newest version did the
: worst. It scored 34.7% in 98 games versus the old version. The best was
: one from a few weeks before that, it score 48.9% in 88 games, but one from a
: few days before the newest version scored 47.8% in 316 games (I let this one
: go for a long time).

: What does all this mean? Does my newest version really suck?

easy answer. maybe or maybe not. :)

: You can obviously use statistical methods to understand this, but I don't

: know enough about statistics to apply the right terms to this.

: In the mean time, I figured that I'd do an experiment, since I'm a
: programmer and as a consequence experiments like this are easy for me to do.

: I figure that the results of this experiment are interesting enough that
: it's worth posting them here, even if by doing so I demonstrate my lack of
: understanding of statistics. If someone would like to post some information
: about statistical methods that can be used to do this work, please feel free
: to do so, this is one reason I'm writing this post.

<SNIP>

: Perhaps I can be criticized for using doing 1000 matches, since it could be

: argued that the odds of a truly strange result could be fairly low, but that
: in 1000 matches I'm probably doing enough to catch at least one strange
: result. But when I did fewer matches (like 10) I still saw a lot of
: deviation from the expected zero ELO points. For instance I just did one
: with ten matches and saw a deviation of 56 ELO points in a hundred-game
: match (58%), and this is not atypical.

: I think it's safe to conclude that it is not possible to determine if a
: change you make to your program is worth 20 ELO points, based upon a
: 100-game match, or even a 300-game match.

: bruce

I think you are dead right here. lots of games. It's probably better to
watch it and assess its play yourself, rather than watching the rating, or
the win/loss record. I pretty much ignore all of these things. If it is
losing consistently to a person or program, I'll study the games to figure
out why. However, that's my only measure at present...

Amir Ban

Feb 16, 1997, 3:00:00 AM2/16/97
to

brucemo wrote:
>
> Someone told me my program stunk. This is no big deal, except that this
> person also told me that my program didn't used to stink. I decided I'd try
> to think of a way to assess the strength of my program now versus its
> strength a few months ago. I don't know of an easy way to do this well, but
> I do know of an easy way to do this badly, and I have mechanisms built into
> all versions of my program that allow me to use this easy bad way.
>
> The easy bad way: an autoplay match. One reason it's bad is because there
> isn't sufficient book variation to get accurate results (you get repeated
> games), but my program's book is pretty broad so I ignored this.
>

Send over the program. I'll tell you if it stinks :)

> I let various versions of my program play a hundred games or so each versus
> an older version that I used as a control, then I tried to make sense out of
> the results.
>
> The sample size varied, and the results varied. My newest version did the
> worst. It scored 34.7% in 98 games versus the old version. The best was
> one from a few weeks before that, it score 48.9% in 88 games, but one from a
> few days before the newest version scored 47.8% in 316 games (I let this one
> go for a long time).
>
> What does all this mean? Does my newest version really suck?
>
> You can obviously use statistical methods to understand this, but I don't
> know enough about statistics to apply the right terms to this.
>

I've written a program to calculate this a long time ago to do the statistical
calculation. It's attached below.

[snip]

> What this shows me is that short matches have a lot of possibility for
> error. But what surprised me was that the definition of "short" was a
> little broader than I'd thought. I figured that 100 games would be enough,
> but it certainly does not appear to be enough to determine between a few ELO
> points.
>

Welcome to the real world. Didn't you ever look at the SSDF lists ? Don't you see the
depressingly wide error margins with hundreds of games played ? Didn't you ever get an
excellent result from an autoplay match, only to discover at the end that by oversight
the two program versions are identical ?

Testing, to my mind, is the real question. Here's a brash statement:

Give me a dependable and time-efficient method to check which of two programs versions
is better, and in six months I will beat anybody.

If anyone wants to challenge this, please send me the method.

[snip]

> I think it's safe to conclude that it is not possible to determine if a
> change you make to your program is worth 20 ELO points, based upon a
> 100-game match, or even a 300-game match.
>

A 20 ELO points improvement is quite an improvement. Practically, you need methods to
detect smaller improvements.

Here's some free advice on how to do this. Take it or leave it:

- Try to do no harm. Use testing procedures which are more inclined to reject than to
accept changes.

- Accept the fact that some of your "tested improvements" are at best random noise.
Probability is on your side, and over time, the program will move in the right
direction.

- Use independent benchmarks to keep track of "objective" strength. I guess everyone
uses one or more test suites for that. DON'T use short suites, and don't use WAC. Any
suite on which your program scores above 90% is useless. Resist the temptation to
optimize your program for the suite. You must also realize that suite results may have
enough noise to swamp your improvement, and that many real improvements will score
worse not better, but then you must justify them with more testing.

- Beware of those "obvious-improvement-no-need-to-test" things. You will be surprised.

- Intellectual honesty is a great quality. It's easy to deceive yourself.

- At the final count, use common sense. I have done some fine things when I decided to
ignore every testing rule I swear by.

Amir

> bruce
>

-------------------------------------------------------------------------
/* This program calculates the significance of a match result. You
enter the result for both sides as parameters (e.g. SIG 5 3), and
it will print the significance of the result as a value between 1
and -1. If the scores are equal, the result is always 0. It's positive if the 1st
side is ahead, negative if behind.

The calculation is based on this: Suppose the sides are of equal strength.
What is the probability of getting this result or better ?
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define DRAW_PROB 0.3

double prob(int nGames, int result, double pDraw)
{
int comp = 0;
if (result > nGames) {
result = nGames * 2 - result;
comp = 1;
}

double pWin = (1 - pDraw) /2;
double p = 0;

for (int nWin = 0; 2 * nWin <= result; nWin++)
for (int nDraw = 0; 2 * nWin + nDraw <= result; nDraw++) {
int nLoss = nGames - nWin - nDraw;
double q = pow(pWin,nWin) * pow(pDraw,nDraw) * pow(pWin,nLoss);
int i = nWin, j = nDraw, k = nGames;
while (i > 0) {
q *= k--;
q /= i--;
}
while (j > 0) {
q *= k--;
q /= j--;
}
if (2 * nWin + nDraw == nGames)
p += q / 2;
else
p += q;
}

return comp ? (1 - 2 * p) : (2 * p - 1);
}

void cdecl main(int argc, char *argv[])
{
if (argc != 3) {
printf("Usage SIG <score1> <score2>");
return;
}

int score1 = atof(argv[1]) * 2 + 0.5;
int score2 = atof(argv[2]) * 2 + 0.5;

printf("Significance is %8.1f\%\n",prob((score1 + score2)/2,score1,DRAW_PROB) * 100);
}

Jean Peter Fendrich

Feb 16, 1997, 3:00:00 AM2/16/97
to

As you told us this is a bad way but is *a* way. Therefor you
shouldn't use sofisticated statistical methods. However there is a
fairly easy way to get some meaning of your results.

1) Use number of wins, loss, and draws respectivelly instead of
percentage.
W = number of wins, L = number of lost, D = number of draws
n = number of games (that means W + L + D)
m = mean value of wins (that means W/n)

2) Apply the following formulas to compute s
( SQRT means square root of. )
( y**2 means y*y )

x = W*(1-m)*(1-m) + D*(0.5-m)*(0.5-m) + L*(0-m)*(0-m)
s = SQRT( x/(n-1) )

x is just for clearness a temporary value to compute s.
s is messuring how much your data is spread around the m value.

3) Compute error margin A (with 95% confidence)

A = 1.96 * s / SQRT(n)

4) State with 95% confidence:
The 'real' result should be somewhere in between m-A to m+A

5) If you want some other confidence, instead of 95% you have to
replace the 1.96 in 3) above.
2.81 will give you 99.5% confidence.
Just mail me for others...

6) You shouldn't use the formulas with a few games. At least
play 30-50 games to be sure that the formulas are valid.
You can't use them if you have results like 100 - 0.

Example
-------
Let's say you got 47.8% in 316 games with W=100, L=114 and D=102
With the above formulas you will get:
N=316, m=0.4778
x= 100*(0.5222)*(0.5222) + 102*(0.0222)*(0.222) + 114*(0.4778)*(0.4778)
x=26.0254
s = SQRT(26.0254 / 315) = 0.2874
A = 1.96 * 0.2874 / SQRT(316) = 0.032

Now we can say with 95% confidence:
The result should be somewhere between 0.4778-0.032 and 0.4778+0.032
(If I made my math's allright, it's late you know ...)

Because 0.4778+0.032 is 0.510 (51%) we are not really sure which one
of the programs is the better.
But of cource this statement is wrong 5 times out of 100.... :-)

It is possible to use m-A and m+A to look in the ELO tables if you
prefer
ratings over match results. But now this begins to be really weird
because
ELO computed with 300 games between the same two opponents will not be
*really ELO*. It will only reflect the difference between the two
opponents and nothing more. ELO is by defintion trying to predict your
results against all possible opponents.
This 'weird' ELO would predict the result between your two opponents.

Well, the number of games needed depends on how your results are
spread out as in 2) above.
and ...
20 ELO points is a very small number. It really means nothing!
We will never be able to meassure the ELO of humans with that precision,
even if some people think so.
Between two different versions of the same program however it might mean
something but it doesn't say a bit about the performance between other
programs or humans.

- snip -

--
J-P Fendrich

- You can't be allmost pregnant -

Rolf Tueschen

Feb 16, 1997, 3:00:00 AM2/16/97
to

>I let various versions of my program play a hundred games or so each versus
>an older version that I used as a control, then I tried to make sense out of
>the results.

...

>What does all this mean? Does my newest version really suck?

>You can obviously use statistical methods to understand this, but I don't
>know enough about statistics to apply the right terms to this.

>Here's my experiment:

>I wrote a program

-

>Here is my data:

>Length Score Percent ELO
>------ ------ ------- ---

**** 3 *************** ?? **** SSDF case

> 10 8.5 85.0% 301
> 20 16.5 82.5% 269
> 30 22.0 73.3% 176
> 40 29.0 72.5% 168

> 50 35.0 70.0% 147
> 75 50.0 66.7% 120
> 100 61.5 61.5% 81
> 150 94.5 63.0% 92
> 200 117.5 58.8% 61
> 250 148.5 59.4% 66
> 300 173.5 57.8% 55
> 400 225.0 56.3% 44
> 500 284.0 56.8% 48
> 1000 541.0 54.1% 29
> 5000 2592.5 51.8% 13
> 10000 5122.5 51.2% 9
>100000 50440.0 50.4% 3

-

>What this shows me is that short matches have a lot of possibility for
>error. But what surprised me was that the definition of "short" was a
>little broader than I'd thought. I figured that 100 games would be enough,
>but it certainly does not appear to be enough to determine between a few ELO
>points.

-

>bruce
---------------------------------------------------------------------------------------------------------------------

Last year there was rumour in rgcc. I heard of critics against SSDF that
pointed exactly at s.th. you found out with your experiment.

You remember the SSDF method?
They had lots of 5, 10 or 20, sorry also 40 games matches.
And it's true NOT between *as thought* equally strong prgs. Ok. The deviation
error would only be higher therefore
.
Everybody should look upon your table of results to get an impression of what
SSDF simply might have overlooked over the years.
Because they summed up all those small match results against always DIFFERENT
prgs. And each of these tiny matches had this huge *error*.
SSDF thought was that by adding all this *the* overall error would almost
vanish to a normal statistical deviation. But that was an illusion.

Picture would have been another if they had distributed only 4 or 6 actual prgs
under their testers. Then some adding could have equalized some errors of
testing. But they always told us that was not possible. And it really would
cost much too much money. OR the companies provided them with enough test prgs.
Only solution.

bruce, you didn't know all that before? Where were you last year? You could

Rolf Tueschen

----*Man is unable not to know what he knows.* Leibowitz-----

brucemo

Feb 17, 1997, 3:00:00 AM2/17/97
to

Rolf Tueschen wrote:

> Last year there was rumour in rgcc. I heard of critics against SSDF that
> pointed exactly at s.th. you found out with your experiment.

Let's not turn this into another political thread, please. What I would
like is help from people who understand statistics, and comments from
programmers who have tried this sort of thing, or might want to try this
sort of thing.

I would rather not have to wade though charges and countercharges to get
it.

bruce

Feb 17, 1997, 3:00:00 AM2/17/97
to

Bruce,

here are some of my own methods to test if version X of a program is
stronger than version Y. Maybe I'm wrong in one or more aspects but I
had some good results in the last years.

First of all, I think that statistical methods are not appropriate in
this case. Maybe version X is clearly better than version Y in 1000+
games, but this doesn't mean that version X does ANY better against
Rebel or Chess Genius! If you are unlucky your new version will become
stronger in aspects where it is already strong and weaker is aspects it
is weak.

So what can you do?

- A lot of test positions, approx. equal number of early midgame,
midgame and endgame positions, tactical and positional positions, and so
on and so on. The more positions the better. Now implement a routine
that tests this positions automatically. You can run the position test
over night or in one or two days. You get a first picture of the "new"
strength and even more important you see in which aspects the program
has become stronger or weaker. (You can make nice graphics with values
for positional play, tactical play, endgame, etc.)

- Of course test positions are not enough. If the first impression
concerning the test positions is good, you have to play games, but NOT
against earlier versions! Take three or four of the best programs (e.g.
Chess Genius, M-Chess, Rebel and my beloved Nimzo :-) and play games
against them! If you only take one program (or an earlier version of
your own program) there is a high possibility that you "optimize" your
program to get good results only against this one opponent. Most likely
you will get strange results against programs you didn't play against.
(Therefore it is important that SSDF prefers to play 20 game matches
against a couple of opponents and not big 100+ game matches against two
or three oponents.)

- I don't think that it is very important to play tournament games at
first, approx. 30 seconds per move will do it. If the new version still
gets better results than the old one there should be a smile on your
face :-). You can now start to play tournament games to be sure that
your program gets stronger with every ply....

- The opening book is very important. Sad but true. When e.g. Fritz
plays with the Genius book, it gets better results against other
computer programs. You can make your program stronger "simply" by
modifying the book. A good book for program A is not automatically a
good book for program B so you have to create a special book. (BTW: I do
not want to talk about killer openings...) So there should be a special
"opening book test" against a couple of oponents. What is the first
evaluation after the last book move? Does the version X "understand" the
position better than version Y?

I hope there are at least a few things you didn't know yet...

Best wishes

Rolf Tueschen

Feb 17, 1997, 3:00:00 AM2/17/97
to

Jean Peter Fendrich <j...@vd.volvo.se> wrote:

-------many snip

> But now this begins to be really weird
>because
>ELO computed with 300 games between the same two opponents will not be
>*really ELO*. It will only reflect the difference between the two
>opponents and nothing more. ELO is by defintion trying to predict your
>results against all possible opponents.
>This 'weird' ELO would predict the result between your two opponents.

-----------------------------------------

rating floors. Bloodgood a prisoner with >2700 ELO. *Real* or *weird*?
Predictions out of this?

IMHO there's no natural born ELO. It always depends.

Could you give me a short answer on my reply to bruce as well?

Thanks

Rolf Tueschen

Jean Peter Fendrich

Feb 17, 1997, 3:00:00 AM2/17/97
to

Rolf Tueschen wrote:

- snipped -

Rolf,
I will not go through the same boring discussion with you once again.
Try a little harder yourself. Go back to the old answers to your same
questions. It's all there. Easy to read, easy to understand.
Your stupid claims will not be more true by repeating them over and
over again.

--
J-P Fendrich

Tom C. Kerrigan

Feb 17, 1997, 3:00:00 AM2/17/97
to

I've been thinking of making a test suite of, say, 500 positions that I
rip randomly from, say, Kasparov games and using that as a sort of
positional barometer. I would go through these at maybe 1 second per move
and that might be useful in finding appropriate weights for evaluation
terms. If my score went up on this suite, I would be fairly certain that
my program didn't get a lot worse.

As for tactics, I'd like to see something like the BT2630, only where all
the problems have tactical solutions that are found by most programs in 5
minutes or so. Then you could score it a similar way and get an exact
number to see if the program got faster.

Unfortunately, I have no idea how to generate the latter...

Cheers,
Tom

brucemo

Feb 17, 1997, 3:00:00 AM2/17/97
to

Tom C. Kerrigan wrote:

> As for tactics, I'd like to see something like the BT2630, only where all
> the problems have tactical solutions that are found by most programs in 5
> minutes or so. Then you could score it a similar way and get an exact
> number to see if the program got faster.
>
> Unfortunately, I have no idea how to generate the latter...

Easy. Take all the tactical suites we have so far, run them through your
program, and save the ones that take a while to solve.

This would probably be specific to your program, which is not exactly what
you are asking for, but if you are willing to accept this problem, you're
done.

bruce

Rolf Tueschen

Feb 17, 1997, 3:00:00 AM2/17/97
to

>Rolf Tueschen wrote:

>bruce
----------------------------------------------------

Don't know what you are talking about.

I give 1000\$\$US (sniff sniff, bruce) that you mist my point.

It was a very serious question. Read my critical post on statistics in SSDF
last year.

Don't always try to be sort of president, bruce. Don't try to censor us.

I also tried to differentiate bruce ICCA from bruce serious on cc. Ok?
Try to be more relaxed regarding people who are forced to write more often on
*politics* because they don't have any program running. You are not at all more
valuable just because you wrote Ferret. And vice versa. :)

rolf

Moritz Berger

Feb 18, 1997, 3:00:00 AM2/18/97
to

On 17 Feb 1997 20:30:36 +0100, kerr...@merlin.pn.org (Tom C.
Kerrigan) wrote:

>I've been thinking of making a test suite of, say, 500 positions that I
>rip randomly from, say, Kasparov games and using that as a sort of
>positional barometer. I would go through these at maybe 1 second per move
>and that might be useful in finding appropriate weights for evaluation

>terms. If my score went up on this suite, I would be fairly certain that
>my program didn't get a lot worse.

>
>As for tactics, I'd like to see something like the BT2630, only where all
>the problems have tactical solutions that are found by most programs in 5
>minutes or so. Then you could score it a similar way and get an exact
>number to see if the program got faster.
>
>Unfortunately, I have no idea how to generate the latter...
>

>Cheers,
>Tom

I use the following approach: From computer vs. computer games, I
choose positions with eval-jumps, either immediately or fast
increasing/decreasing evaluations by both programs. Sometimes a
program sees a few moves earlier the importance of a particular move.
Among equally strong programs, there often are single key moves that
eventually decide the outcome of a game. It's quite easy to spot them
by looking at the analysis recorded during the game, a very important
feature in Rebel 8, Genius 5 and Hiarcs (of course Crafty with its
full log really shines in this respect, also W-Chess offers nice
insights with its similar capability).

After finding such positions, I have a perfect test suite for
different relevant positions (i.e. they are taken from real computer
vs. computer games, so both the problem and its solution are likely to
be identified by means of a chess program).

Here's one example (1st 2 lines are in .EPD format, you will have to
join them into one line to make it readable by some programs). The
theme is king safety at shorter time controls, after the name of the
program I give the solution time in seconds, in brackets the search
depth needed for the solution, in some cases alternative moves that
include the solution in the PV:

r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b
- - id=Human - REBEL 8.0 P5/133; bm=e8g7;

Hiarcs 4 (Fritz 4 engine): 1"(3): Ne8g7
Genius 3: 1"(4): Ne8g7
Hiarcs 5: 3"(3): Ne8g7
Genius 5: 4"(3): Ne8g7
Hiarcs 3: 6"(4): Ne8g7
Fritz 3.1 (Fritz 4 engine): 6"(8): Ne8g7
Fritz 4.01: 7"(8): Ne8g7
Mchess 6: 17"(5): Ne8g7
Mchess 5: 35"(5): Bc6xg2 53"(5): Ne8g7
Rebel 7: 57"(7): Ne8g7
Rebel 8: 89"(7): Bc6xg2 104"(7): Ne8g7
Fritz 1.2: 404"(10): Ne8g7

Please note that the programs without Null-move (Rebel 7,8 and Fritz
1.2) have more problems with this position ... M-Chess 6 is mucho
faster than M-Chess 5, this shows the huge progress that Marty Hirsch

Moritz

-------------
Moritz...@msn.com

mclane

Feb 18, 1997, 3:00:00 AM2/18/97
to

>Bruce,

>Best wishes
I hope I did not hurt you Andreas, but I can subscribe almost any
point of your WAY OF TESTING.

Despite the fact that I don't like to see fast games (because I am
more the slow guy, sitting with a cup of tea, and joining the game)
I do exactly the same way.

I have never found out much about a version by letting it play against
the predecessors.

This is obviously silly.

You have to let it play against ALL OPPONENT programs, different
timecontrols and also do test-positions you know much about, e.g.
mail-chess-games you studied for months.

For me the most important way of finding out WHICH VERSION IS BETTER
is the feeling. If I look on the main-lines and the evaluations of a
program, I can find out about which version is stronger in the best
case within 10 moves of a 60/60 game.
Normally I try to proof my feelings then with more games on 60/60 and
lately I proof it again with 40/120.

Opening books is important !!!! Therefore I do not TEST opening book
when I want to find out about engine. If you play WITH opening book
and you want to find out about total-playing strength you get to
different results, concerning which opening lines you have in your
book, and which ones the opponent has.
I kill the openings, or input symmetrical positions. Or switch sides
or whatever. But I do not test the books when I want to find out about
playing strength.

AFTER i have found out about playing strength I let the program play
WITH its book.

If I like a program I find out about it very easy. If I don't like the
program, work is much more difficult.
Also it is very easy to work with some programmers, and more difficult
to work with others. You can't give a guaranty in forward.

But I have no commercial interest. It is just fun. So I can test out
if the relationship works. If not - it is a pity - but I will survive
it.

With some programs it is like a drug.
You can't live without them.

Thanks Andreas that you are back.

Dennis Breuker

Feb 18, 1997, 3:00:00 AM2/18/97
to brucemo

Hi Bruce!

Have you read the article "Confidently selecting a search heuristic",
from Thomas Anantharaman in ICCA Journal, Vol 14., No. 1 pp. 3-16?
[This is an edited version of Chapter 3 of A's PhD thesis, May 1990,
Carnegie-Mellon University (Report CMU-CS-90-173).]
There he describes how the Deep-Thought team evaluates how well
a new search heuristic works in reasonable time. Here is the
abstract:

of proposed selective-search heuristics in computer chess. The erea is
shown to be beset by problems, which may be succinctly expressed by the
vast amount of conmputations required to get standard deviations down to
reasonable level. We consider this level to be achieved when, say, an
estimate of playing strength in USCF rating points with a confidence of
95% is within 20 rating points of computation. A total of 15 techniques
have been investigated, leading to a selection of a preferred technique
for its accessory advantages. The results suggest that what human
Masters and Grand Masters do is not, or no longer, a good yardstick for
the evaluation of a computer's move.

Hope this helps,
Dennis.

Robert Hyatt

Feb 18, 1997, 3:00:00 AM2/18/97
to

Tom C. Kerrigan (kerr...@merlin.pn.org) wrote:
: I've been thinking of making a test suite of, say, 500 positions that I
: rip randomly from, say, Kasparov games and using that as a sort of
: positional barometer. I would go through these at maybe 1 second per move
: and that might be useful in finding appropriate weights for evaluation
: terms. If my score went up on this suite, I would be fairly certain that
: my program didn't get a lot worse.

: As for tactics, I'd like to see something like the BT2630, only where all
: the problems have tactical solutions that are found by most programs in 5
: minutes or so. Then you could score it a similar way and get an exact
: number to see if the program got faster.

: Unfortunately, I have no idea how to generate the latter...

: Cheers,
: Tom

Also the "former" is difficult. You can tune your eval to make the
correlation between your eval and the moves actually played go up,
and it's not hard. However, you will likely play worse. Because
you will be making the right move for the wrong reasons. If a GM
plays a move because it fixes a weak pawn on a weak square for the
opponent, and you don't understand weak pawns and weak squares, you
can conclude that the move was good because it advanced a pawn. In
general *every* pawn move is bad, because it weakens one or more
squares, and they can never be "unweakened". You then start pushing
pawns, wreck your pawn structure, create a dozen weaknesses, and lose
every game, while still matching the moves the GM made in your test
suite...

Not a pretty thought...

The opposite can also happen. A GM doesn't move a pawn in any of the
500 positions you choose, so you decide that all pawn pushes are bad,
and never make one, even if the weakness it produces in your position
is nothing compared to the weakness it produces in the opponent's
position.

Logic says emulating a GM by computing a least-squares fit of your
your (or my) eval to what the GM played is going to produce pretty
much random numbers, unless your eval is sophisticated enough that
it somehow incorporates most of the GM's positional knowledge into
what it evaluates. This is a case where making the right move for
the wrong reason might be much worse than making the wrong move...

Robert Hyatt

Feb 18, 1997, 3:00:00 AM2/18/97
to

Moritz Berger (Moritz...@msn.com) wrote:
: On 17 Feb 1997 20:30:36 +0100, kerr...@merlin.pn.org (Tom C.
: Kerrigan) wrote:

: >I've been thinking of making a test suite of, say, 500 positions that I
: >rip randomly from, say, Kasparov games and using that as a sort of
: >positional barometer. I would go through these at maybe 1 second per move
: >and that might be useful in finding appropriate weights for evaluation
: >terms. If my score went up on this suite, I would be fairly certain that
: >my program didn't get a lot worse.
: >
: >As for tactics, I'd like to see something like the BT2630, only where all
: >the problems have tactical solutions that are found by most programs in 5
: >minutes or so. Then you could score it a similar way and get an exact
: >number to see if the program got faster.
: >
: >Unfortunately, I have no idea how to generate the latter...
: >
: >Cheers,
: >Tom

: I use the following approach: From computer vs. computer games, I

: faster than M-Chess 5, this shows the huge progress that Marty Hirsch

: Moritz

: -------------
: Moritz...@msn.com

Two points: First, Crafty relies heavily on null move, as we all
know. output:

9 13.03 0.230 Qxc4 Bxg6 fxg6 Qxg6+ Kf8 Rg4 Bxg2+
Rxg2 Qf4 Qg8+ Ke7 a5
9 27.83 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
Rc1+ Rxc1 Rxc1+ Bf1
9-> 1:00 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
Rc1+ Rxc1 Rxc1+ Bf1
10 1:09 1.093 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qxd4 Rxd4
a5 Rad8 b3 Rd2 Rxd2 Rxd2

so it takes 27 secs with very small hash table to hit this.

However, I didn't think that genius nor Mchess used null-move, rather, that
they were simply more selective at shallow plies to get their speed

I had thought, that except for Morsch, most didn't use null move.. Dave is

Bob

Robert Hyatt

Feb 18, 1997, 3:00:00 AM2/18/97
to

Moritz Berger (Moritz...@msn.com) wrote:
: On 18 Feb 1997 14:52:10 GMT, hy...@crafty.cis.uab.edu (Robert Hyatt)
: wrote:
: >Moritz Berger (Moritz...@msn.com) wrote:
: <snip>
: >: Here's one example (1st 2 lines are in .EPD format, you will have to

: >: join them into one line to make it readable by some programs). The
: >: theme is king safety at shorter time controls, after the name of the
: >: program I give the solution time in seconds, in brackets the search
: >: depth needed for the solution, in some cases alternative moves that
: >: include the solution in the PV:

: >: r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b
: >: - - id=Human - REBEL 8.0 P5/133; bm=e8g7;

: >: Hiarcs 4 (Fritz 4 engine): 1"(3): Ne8g7
: >: Genius 3: 1"(4): Ne8g7
: >: Hiarcs 5: 3"(3): Ne8g7
: >: Genius 5: 4"(3): Ne8g7
: >: Hiarcs 3: 6"(4): Ne8g7
: >: Fritz 3.1 (Fritz 4 engine): 6"(8): Ne8g7
: >: Fritz 4.01: 7"(8): Ne8g7
: >: Mchess 6: 17"(5): Ne8g7
: >: Mchess 5: 35"(5): Bc6xg2 53"(5): Ne8g7
: >: Rebel 7: 57"(7): Ne8g7
: >: Rebel 8: 89"(7): Bc6xg2 104"(7): Ne8g7
: >: Fritz 1.2: 404"(10): Ne8g7
: >
: >: Please note that the programs without Null-move (Rebel 7,8 and Fritz
: >: 1.2) have more problems with this position ... M-Chess 6 is mucho

: >: faster than M-Chess 5, this shows the huge progress that Marty Hirsch

: <snip>
: >Two points: First, Crafty relies heavily on null move, as we all
: >know. output:

: > 9 13.03 0.230 Qxc4 Bxg6 fxg6 Qxg6+ Kf8 Rg4 Bxg2+
: > Rxg2 Qf4 Qg8+ Ke7 a5
: > 9 27.83 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
: > Rc1+ Rxc1 Rxc1+ Bf1
: > 9-> 1:00 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
: > Rc1+ Rxc1 Rxc1+ Bf1
: > 10 1:09 1.093 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qxd4 Rxd4
: > a5 Rad8 b3 Rd2 Rxd2 Rxd2
:
: >so it takes 27 secs with very small hash table to hit this.

: your problem, Bob ;-) maybe you need a mainframe for this ;-))))) not
: enough nps on Intel hardware ...

: >However, I didn't think that genius nor Mchess used null-move, rather, that

: >they were simply more selective at shallow plies to get their speed

: Ok, I didn't want to become a second Diep or start a "Null move
: skandal" here ;-)

: My statement was mainly related to the vast improvement among the
: Fritz versions: Fritz 1 didn't have null-move, Fritz 2 and later used
: null-move and gained very much from it ... So Fritz 3.1 is 60-70 times
: faster in this position, this might also be due to increased
: selectivity, maybe I was simply wrong ... But I also noticed that
: Rebel didn't score well on this position, and Rebel doesn't use null
: move ... M-Chess is also among the slower programs (although it is
: heavily selective), so according to my logic it might not have been
: among the null movers ... Maybe I'm completely wrong, all recent
: programs maybe do well enough on this position, so I hope that no
: programmer hits me with a null move for this :-)))

I don't think Marty uses it either... been a long while since I've talked
with him, but just looking at how it searches, and the PV's it produces, leads
me to believe it doesn't...

: My example was just that: An example how I try to draw conclusions
: from real computer vs. computer positions ... If you're testing your
: own program, you can hopefully find out which factor influences your
: performance most in a particular position.

I really rely more on intuition as I can tell reasonably well whether the
quality of the moves Crafty produces is higher or lower over time... not
always of course, but in general...

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

I read it a few times this morning. It is a difficult article. There are
lots of ideas in the article but I have a hard time understanding them.

It seems that the DT guys settled upon a method that involves a large test
suite that does not appear to have been selected with much care (not a
criticism). They ran a version of their program over the test suite,
allowing quite a bit of time per move.

They then ran several other versions of their program over the test suite,
allowing much less time per move, and checked the agreement between these
scores and what they got using longer time. They tried several different
ways of comparing scores. The one they chose had to do with "scaled move
score". I am unable to figure out what this is, since I am being yelled at
from upstairs at the moment.

I think the main point of all this was an attempt to save time over the use
of a bunch of autoplay matches.

One possible criticism is that the manner in which they evaluated their
results is that they compared against ratings assigned to the programs via
a series of 220-game autoplay matches. 220 games seems like a lot, but you
don't get an incredibly high degree of precision in 220 games do you?

bruce

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

Jean Peter Fendrich wrote:

> As you told us this is a bad way but is *a* way. Therefor you
> shouldn't use sofisticated statistical methods. However there is a
> fairly easy way to get some meaning of your results.
>
> 1) Use number of wins, loss, and draws respectivelly instead of
> percentage.
> W = number of wins, L = number of lost, D = number of draws
> n = number of games (that means W + L + D)

> m = mean value of wins (that means W/n)[lots more snipped]

I am curious: What is this? Is this the SSDF method, or something
from a statistics book, or an independent invention, or what?

bruce

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

Moritz Berger wrote:

> r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b
> - - id=Human - REBEL 8.0 P5/133; bm=e8g7;
>
> Hiarcs 4 (Fritz 4 engine): 1"(3): Ne8g7
> Genius 3: 1"(4): Ne8g7
> Hiarcs 5: 3"(3): Ne8g7
> Genius 5: 4"(3): Ne8g7
> Hiarcs 3: 6"(4): Ne8g7
> Fritz 3.1 (Fritz 4 engine): 6"(8): Ne8g7
> Fritz 4.01: 7"(8): Ne8g7
> Mchess 6: 17"(5): Ne8g7
> Mchess 5: 35"(5): Bc6xg2 53"(5): Ne8g7
> Rebel 7: 57"(7): Ne8g7
> Rebel 8: 89"(7): Bc6xg2 104"(7): Ne8g7
> Fritz 1.2: 404"(10): Ne8g7

What hardware?

bruce

Moritz Berger

Feb 18, 1997, 3:00:00 AM2/18/97
to

On 18 Feb 1997 14:52:10 GMT, hy...@crafty.cis.uab.edu (Robert Hyatt)
wrote:
>Moritz Berger (Moritz...@msn.com) wrote:
<snip>
>: Here's one example (1st 2 lines are in .EPD format, you will have to
>: join them into one line to make it readable by some programs). The
>: theme is king safety at shorter time controls, after the name of the
>: program I give the solution time in seconds, in brackets the search
>: depth needed for the solution, in some cases alternative moves that
>: include the solution in the PV:

>: r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b

My example was just that: An example how I try to draw conclusions

from real computer vs. computer positions ... If you're testing your
own program, you can hopefully find out which factor influences your
performance most in a particular position.

>I had thought, that except for Morsch, most didn't use null move.. Dave is

>Bob

-------------
Moritz...@msn.com

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

>
> Bruce,
>
> here are some of my own methods to test if version X of a program is
> stronger than version Y. Maybe I'm wrong in one or more aspects but I
> had some good results in the last years.
>
> First of all, I think that statistical methods are not appropriate in
> this case. Maybe version X is clearly better than version Y in 1000+
> games, but this doesn't mean that version X does ANY better against
> Rebel or Chess Genius! If you are unlucky your new version will become
> stronger in aspects where it is already strong and weaker is aspects it
> is weak.

Ok! There has been enough reponse to my original article that perhaps I should
make it more clear what I'm getting at. There are a few ideas that are getting
muddled up:

1) The idea of testing different versions of the same program against each other.
There seems to be a lot of opposition to doing this, but I think that for what I
was trying to do, the results are valid. Also, some of the stuff that I've been
talking about here pertains to the case where you test ANY two programs against
each other, not just different versions of the same program.

2) How to determine strength of a program in general. People have proposed using
suites, have given me suggestions about how the validity of any sort of testing
using matches of any length, etc. There have been a few posts that include
formulas, and I'm grateful for this, although I haven't digested what I've read
yet (I still have it all marked unread).

I'd be delighted to discuss any of this stuff in this thread, they are both
related points, and are very interesting, but I would like to get back to what I'm
actually trying to do for a moment.

What I'm trying do do here is trace a big dropoff in strength that may have
occurred somewhere within a hundred or so versions of my program.

"Strength" means strength as measured by outside people. But I can't easily
quantify this. One guy says the program feels weaker, but I can't look at ICC
logs and pinpoint it, since his opinion is subjective.

I decided to do some experiments with autoplay matches, in an effort to find a
smoking gun. This may be bogus because of an "inbreeding" problem (although I
would like to see more evidence that this does cause problems). I decided to
ignore this problem and do the experiment anyway. The reason I decided to do the
experiment anyway is that the version I was testing against was a pretty old
version, and so there might be some significant variation between that version and
the versions I was comparing (perhaps this might render some of the inbreeding
objection moot), and because I figured that the strength decrease might be large
enough that it would be evident even if there is an inbreeding problem. I figure
that if an experiment shows that a new version is significantly weaker than the
version immediately before it, when compared with a very old version, that there
is probably at least some weakness when compared with other programs.

So there are some sources of error in my experiments. There are other possible
sources of problem, as well. For instance, the time control I used was
five-minute blitz with no pondering. This is kind of fast, but I was running on a
fast machine, and the games my program plays are often kind of fast anyway. But I
acknowledge that this is a possible source of problem, and choose to ignore it.

I make no particular claims of scientific rigor here. I'm trying fairly hard, but
this experiment was not conducted in a vacuum, it was conducted in my basement.

Here is what I did. I ran several versions of my program versus an old version.
Here are the results:

Ver 1 Ver 2 + - = Games Score Elo
97 39 39 47 48 134 47.0% 21
118 39 22 24 42 88 48.9% 8
130 39 22 25 31 78 48.1% 13
130 39 63 74 101 238 47.7% 16
132 39 46 48 82 176 49.4% 4
133 39 35 55 62 152 43.4% 46
134 39 28 42 52 122 44.3% 40
136 39 13 43 42 98 34.7% 110
136 39 30 52 72 154 42.9% 50

The scores are from the point of view of the first program listed. As you can
see, I ran a couple of these (130 v 39 and 136 v 39) more than once, but didn't
sum the values for the distinct experiments. It doesn't change much if you sum
them, you just get numbers that are a little better.

I think that without doing any kind of statistical stuff at all, you'd think that
version 133 has a problem. In fact, I suspect this is correct. The difference
between version 132 and version 133 is not some trivial positional evaluation
change, I changed a search extension so that it would avoid doing the extension in
cases where I didn't think the extension was necessary. I think it is likely that

I have become interested in statistics, so I figured that I'd try to determine how
bad the mistake was. Perhaps someone will find this discussion interesting.

Imagine that there are no draws in chess, that one side either wins or loses. If
you play two programs of identical strength when compared with each other (same
program, or two programs that are coincidentally as strong, it makes no difference
in my opinion), a match between these programs will have exactly the same
characteristics as a match where you flip a coin to see which program has won.

If you have a coin you are flipping, and the question is, does the coin has a
tendency to land on heads more than it lands on tails, you can flip as often as
you want, for as long as you want, and you can never "know", because any result is
possible.

But you can begin to develop a healthy suspicion, if you determine that the odds
of observing, by chance, what you in fact did observe, are low enough.

When you are flipping a coin it is easy to determine the probability of any given
result -- you can graph it onto a bell-curve. If you do a coin flipping match,
and you get a result that falls within the 95% (arbitrarily chosen) of the
bell-curve that's centered around 50%, your coin is probably OK. But if you get a
result that's outside of this region, you may consider that something might be
wrong with the way you are flipping the coin, or with the coin itself.

It seems to me that a chess match between equal programs is like a series of coin
flips, with one difference -- the coin can easily land on its edge, in fact, this
seems to be the most likely result. I have not done the math, but it seems to me
that this doesn't change the characteristics of the probability curve, it just
changes the "pointiness" of the curve. So I believe the coin-flipping example
pertains.

I made a program that would generate this bell curve experimentally, given the
draw percentage. You tell it what the rate of drawing is, which affects the
pointiness of the probability curve it generates, and you get back the likelihood
that two equal programs would generate an Elo delta that you observed (It'd
probably be easy to do the math to generate this curve, rather than using
experimentation, but I haven't tried to do it).

I figure that if I observe an Elo delta that I'm unlikely to see, that it is
likely that the best explanation is that one of the programs is stronger than the
other one, when played against each other at least.

For example, I performed 50,000 176-game matches with my chessified coin-flipping
program (assuming a draw percentage of 40%). 176 games is the length of the match
between version 132 and version 39. In that one match I observed a score of
49.4%, for an Elo delta of 4 points between these versions. In my 50,000-match
run, I observed that an Elo delta of MORE than 4 points was observed between two
EQUAL programs approximately 40% of the time. So if I were to conclude that
version 39 was better than version 132, I'd likely be wrong about 40% of the time.
It's probably does better than this new version does, but I can't know for
certain. And it probably doesn't do a lot better. And in any case, this is OK,
since a lot of what I did between 39 and 132 is directed at humans, so I'd expect
to lose a little computer vs computer strength.

I then performed 50,000 152-game matches. This is the same length as the match
between version 133 and version 39. In this case I observed an Elo difference of
46 points. In my 50,000-match run, I observed an Elo delta in excess of 46 points
something less than 1.5% of the time.

So I conclude that version 133 has a rock in it when compared to version 39. I
might be wrong, but apparently the odds are 98.5 to 1.5 against this :-)

I believe I can take this a little further. If I munge my test program so that it
is not simulating equal programs, it is simulating programs that are about ten Elo
points apart, I will still only get a reported Elo delta of in excess of 46 points

So I can conclude that version 133 is at least 10 Elo points worse than version
39, and I'd only be wrong perhaps 5% of the time. It's will probably continue to
perform more than 10 points worse than version 39, but I can't be as sure of this.

The whole point of this experiment was to find version 133, so I can check the
code to try to find a bug, and if I really do have a bug, to fix it. I think that
what I did was a fairly proper way of doing this, right?

bruce

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

mclane wrote:

> For me the most important way of finding out WHICH VERSION IS BETTER
> is the feeling. If I look on the main-lines and the evaluations of a
> program, I can find out about which version is stronger in the best
> case within 10 moves of a 60/60 game.

It might be interesting to blind test you. I wonder if you could
distinguish Fritz from Genius from Rebel from CSTal, or rank them in
the order you would rank them if you knew what you were testing?

Just a friendly idea.

bruce

brucemo

Feb 18, 1997, 3:00:00 AM2/18/97
to

> - A lot of test positions, approx. equal number of early midgame,
> midgame and endgame positions, tactical and positional positions, and so
> on and so on. The more positions the better. Now implement a routine
> that tests this positions automatically. You can run the position test
> over night or in one or two days. You get a first picture of the "new"
> strength and even more important you see in which aspects the program
> has become stronger or weaker. (You can make nice graphics with values
> for positional play, tactical play, endgame, etc.)

The first question in response to this is: Where in the heck is this suite?

I'd like to run against this suite, but I don't have it. Does anyone have
it?

A nicely tuned positional suite would seem like it must be "easy" to
generate, but I've never heard of anyone actually doing it.

> - Of course test positions are not enough. If the first impression
> concerning the test positions is good, you have to play games, but NOT
> against earlier versions! Take three or four of the best programs (e.g.
> Chess Genius, M-Chess, Rebel and my beloved Nimzo :-) and play games
> against them! If you only take one program (or an earlier version of
> your own program) there is a high possibility that you "optimize" your
> program to get good results only against this one opponent. Most likely
> you will get strange results against programs you didn't play against.
> (Therefore it is important that SSDF prefers to play 20 game matches
> against a couple of opponents and not big 100+ game matches against two
> or three oponents.)

This is correct and it is of course one reason my program plays on ICC,
where it gets to play against everything, and I don't have to do any work
to make this happen.

> - I don't think that it is very important to play tournament games at
> first, approx. 30 seconds per move will do it. If the new version still
> gets better results than the old one there should be a smile on your
> face :-). You can now start to play tournament games to be sure that
> your program gets stronger with every ply....

It is hard to determine if a program is getting better, based upon a
realistic number of games at this time control. You can make a change that
drops your program by 70 Elo points, and assuming a draw percentage of 40%,
you'll still win or draw a 20-game match against an equal program about 15%
of the time.

I think you'd have to play (and win or draw) a 45-game match in order to be
sure with 95% certainty that you hadn't decreased your program's strength by
more than 70 Elo points :-)

> - The opening book is very important. Sad but true. When e.g. Fritz
> plays with the Genius book, it gets better results against other
> computer programs. You can make your program stronger "simply" by
> modifying the book. A good book for program A is not automatically a
> good book for program B so you have to create a special book. (BTW: I do
> not want to talk about killer openings...) So there should be a special
> "opening book test" against a couple of oponents. What is the first
> evaluation after the last book move? Does the version X "understand" the
> position better than version Y?

This is a good idea.

bruce

mclane

Feb 19, 1997, 3:00:00 AM2/19/97
to

hy...@crafty.cis.uab.edu (Robert Hyatt) wrote:

>Tom C. Kerrigan (kerr...@merlin.pn.org) wrote:
>: I've been thinking of making a test suite of, say, 500 positions that I
>: rip randomly from, say, Kasparov games and using that as a sort of
>: positional barometer. I would go through these at maybe 1 second per move
>: and that might be useful in finding appropriate weights for evaluation
>: terms. If my score went up on this suite, I would be fairly certain that
>: my program didn't get a lot worse.

>: As for tactics, I'd like to see something like the BT2630, only where all
>: the problems have tactical solutions that are found by most programs in 5
>: minutes or so. Then you could score it a similar way and get an exact
>: number to see if the program got faster.

>: Unfortunately, I have no idea how to generate the latter...

>: Cheers,
>: Tom

>Also the "former" is difficult. You can tune your eval to make the

>correlation between your eval and the moves actually played go up,
>and it's not hard. However, you will likely play worse. Because
>you will be making the right move for the wrong reasons. If a GM
>plays a move because it fixes a weak pawn on a weak square for the
>opponent, and you don't understand weak pawns and weak squares, you
>can conclude that the move was good because it advanced a pawn. In
>general *every* pawn move is bad, because it weakens one or more
>squares, and they can never be "unweakened". You then start pushing
>pawns, wreck your pawn structure, create a dozen weaknesses, and lose
>every game, while still matching the moves the GM made in your test
>suite...

>Not a pretty thought...

>The opposite can also happen. A GM doesn't move a pawn in any of the
>500 positions you choose, so you decide that all pawn pushes are bad,
>and never make one, even if the weakness it produces in your position
>is nothing compared to the weakness it produces in the opponent's
>position.

>Logic says emulating a GM by computing a least-squares fit of your
>your (or my) eval to what the GM played is going to produce pretty
>much random numbers, unless your eval is sophisticated enough that
>it somehow incorporates most of the GM's positional knowledge into
>what it evaluates. This is a case where making the right move for
>the wrong reason might be much worse than making the wrong move...

Bob is right. From all my experiences with chess-programs I came to
the same conclusion. A good finder is mainly a bad chess program.

There are exceptions. The King is a big exception. It place really
nice chess and finds almost anything.

It sacs and plays attractive chess.

Feb 19, 1997, 3:00:00 AM2/19/97
to

brucemo wrote:
>
>
> > - A lot of test positions, approx. equal number of early midgame,
> > midgame and endgame positions, tactical and positional positions, and so
> > on and so on. The more positions the better. Now implement a routine
> > that tests this positions automatically. You can run the position test
> > over night or in one or two days. You get a first picture of the "new"
> > strength and even more important you see in which aspects the program
> > has become stronger or weaker. (You can make nice graphics with values
> > for positional play, tactical play, endgame, etc.)
>
> The first question in response to this is: Where in the heck is this suite?
>
> I'd like to run against this suite, but I don't have it. Does anyone have
> it?
>
> A nicely tuned positional suite would seem like it must be "easy" to
> generate, but I've never heard of anyone actually doing it.
>

This is a VERY good question!

I'm sorry, but you have to collect this suite on your own. Everyone has
different priorities. Some peolpe prefer positional suites, other people
prefer tactical ones. It's just like the "perfect" opening book. It
doesn't exist, because you have to adapt it in a way that it fits to

BTW: I don't think that the BT tests are a good thing for testing
programs. Often programs find the right moves for the wrong reasons, or
they find the first move but not the second and so on and so on. The
tests are VERY simple and so many people use them, but the truth isn't
that simple! You have to be carful when you chose the positions. Nobody
said it is easy! :-)

> > - I don't think that it is very important to play tournament games at
> > first, approx. 30 seconds per move will do it. If the new version still
> > gets better results than the old one there should be a smile on your
> > face :-). You can now start to play tournament games to be sure that
> > your program gets stronger with every ply....
>
> It is hard to determine if a program is getting better, based upon a
> realistic number of games at this time control. You can make a change that
> drops your program by 70 Elo points, and assuming a draw percentage of 40%,
> you'll still win or draw a 20-game match against an equal program about 15%
> of the time.
>
> I think you'd have to play (and win or draw) a 45-game match in order to be
> sure with 95% certainty that you hadn't decreased your program's strength by
> more than 70 Elo points :-)
>

I am a real fan of statistics (When I called my professor years after I
finished the university and I asked him about a statistical problem, he
remembered me well!) I always supported the Swedish ELO list and I will
do so in the future, because the Swedes do it in a (statistical) correct
way. BUT I think that statistics are not the appropriate way in this
particular case! You would have to play too many games to be (95
percent!!) sure that your new version is truly better than the old one.
And if you play 1000 or more games you can't be sure that the new
version scores better against Genius or M-Chess in 1000 games. I believe
that you can detect a difference of >=80 ELO without playing hundrets of
games, and I believe you cannot detect a difference of about 20 ELO at
all (just take a look at the margin of error in the SSDF list!).

Best wishes

Martin Borriss

Feb 19, 1997, 3:00:00 AM2/19/97
to

>
>> - A lot of test positions, approx. equal number of early midgame,
>> midgame and endgame positions, tactical and positional positions, and so
>> on and so on. The more positions the better. Now implement a routine
>> that tests this positions automatically. You can run the position test
>> over night or in one or two days. You get a first picture of the "new"
>> strength and even more important you see in which aspects the program
>> has become stronger or weaker. (You can make nice graphics with values
>> for positional play, tactical play, endgame, etc.)
>
>The first question in response to this is: Where in the heck is this suite?
>
>I'd like to run against this suite, but I don't have it. Does anyone have
>it?
>
>A nicely tuned positional suite would seem like it must be "easy" to
>generate, but I've never heard of anyone actually doing it.
>

I was considering making a suite but didn't get any response from other
people, so I figured except for me there is no need ;)

Martin

Moritz Berger

Feb 19, 1997, 3:00:00 AM2/19/97
to

On Tue, 18 Feb 1997 13:27:57 -0800, brucemo <bru...@nwlink.com>
wrote:

>Moritz Berger wrote:
>> r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b
>> - - id=Human - REBEL 8.0 P5/133; bm=e8g7;
<snip>
>What hardware?

>bruce
P5/133 Toshiba Tecra 720 CDT 48 MB RAM

-------------
Moritz...@msn.com

Vincent Diepeveen

Feb 19, 1997, 3:00:00 AM2/19/97
to

In <5ecfmq\$k...@juniper.cis.uab.edu> hy...@crafty.cis.uab.edu (Robert Hyatt) writes:

>Moritz Berger (Moritz...@msn.com) wrote:
>: On 17 Feb 1997 20:30:36 +0100, kerr...@merlin.pn.org (Tom C.

>: Kerrigan) wrote:
>
>: >I've been thinking of making a test suite of, say, 500 positions that I
>: >rip randomly from, say, Kasparov games and using that as a sort of
>: >positional barometer. I would go through these at maybe 1 second per move
>: >and that might be useful in finding appropriate weights for evaluation
>: >terms. If my score went up on this suite, I would be fairly certain that
>: >my program didn't get a lot worse.
>: >
>: >As for tactics, I'd like to see something like the BT2630, only where all
>: >the problems have tactical solutions that are found by most programs in 5
>: >minutes or so. Then you could score it a similar way and get an exact
>: >number to see if the program got faster.
>: >
>: >Unfortunately, I have no idea how to generate the latter...
>: >
>: >Cheers,
>: >Tom
>

>: I use the following approach: From computer vs. computer games, I
>: choose positions with eval-jumps, either immediately or fast
>: increasing/decreasing evaluations by both programs. Sometimes a
>: program sees a few moves earlier the importance of a particular move.
>: Among equally strong programs, there often are single key moves that
>: eventually decide the outcome of a game. It's quite easy to spot them
>: by looking at the analysis recorded during the game, a very important
>: feature in Rebel 8, Genius 5 and Hiarcs (of course Crafty with its
>: full log really shines in this respect, also W-Chess offers nice
>: insights with its similar capability).
>
>: After finding such positions, I have a perfect test suite for
>: different relevant positions (i.e. they are taken from real computer
>: vs. computer games, so both the problem and its solution are likely to
>: be identified by means of a chess program).
>

>: Here's one example (1st 2 lines are in .EPD format, you will have to
>: join them into one line to make it readable by some programs). The
>: theme is king safety at shorter time controls, after the name of the
>: program I give the solution time in seconds, in brackets the search
>: depth needed for the solution, in some cases alternative moves that
>: include the solution in the PV:
>

>: r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b

>: - - id=Human - REBEL 8.0 P5/133; bm=e8g7;
>

>: Hiarcs 4 (Fritz 4 engine): 1"(3): Ne8g7
>: Genius 3: 1"(4): Ne8g7
>: Hiarcs 5: 3"(3): Ne8g7
>: Genius 5: 4"(3): Ne8g7
>: Hiarcs 3: 6"(4): Ne8g7
>: Fritz 3.1 (Fritz 4 engine): 6"(8): Ne8g7
>: Fritz 4.01: 7"(8): Ne8g7
>: Mchess 6: 17"(5): Ne8g7
>: Mchess 5: 35"(5): Bc6xg2 53"(5): Ne8g7
>: Rebel 7: 57"(7): Ne8g7
>: Rebel 8: 89"(7): Bc6xg2 104"(7): Ne8g7
>: Fritz 1.2: 404"(10): Ne8g7
>
>: Please note that the programs without Null-move (Rebel 7,8 and Fritz
>: 1.2) have more problems with this position ... M-Chess 6 is mucho
>: faster than M-Chess 5, this shows the huge progress that Marty Hirsch
>

>: Moritz
>
>: -------------
>: Moritz...@msn.com

>
>Two points: First, Crafty relies heavily on null move, as we all
>know. output:
>
> 9 13.03 0.230 Qxc4 Bxg6 fxg6 Qxg6+ Kf8 Rg4 Bxg2+
> Rxg2 Qf4 Qg8+ Ke7 a5
> 9 27.83 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
> Rc1+ Rxc1 Rxc1+ Bf1
> 9-> 1:00 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
> Rc1+ Rxc1 Rxc1+ Bf1
> 10 1:09 1.093 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qxd4 Rxd4
> a5 Rad8 b3 Rd2 Rxd2 Rxd2
>
>so it takes 27 secs with very small hash table to hit this.
>

>However, I didn't think that genius nor Mchess used null-move, rather, that
>they were simply more selective at shallow plies to get their speed

Last ply pruning they probably use, but as it seems all use nullmove
somehow.

You can use nullmove first x ply and not the last 4 plies (near the leaves)
for example and forward prune last 3/4 ply, which seems a commonly used
idea.

Most programs like Genius/Nimzo/Hiarcs use a different approach to plydepth,
they give a value to plydepth, for example 16 points = 1 ply, which
more easily prevents extensions far over the horizon. You can then use
reduction factors of 1/16 ply, for example 1.5 ply, or 2.5 ply instead of
1,2,3,4 etc.

Wouldn't know how to implement nullmove last n-ply using this type of search.

>I had thought, that except for Morsch, most didn't use null move.. Dave is

Ed Schroeder says he doesn't use it. Still there are some mates,
which typically are mate in n, where nullmove can only find mate in n+i, i>0,
where almost all commercial programs find a mate in n+i.

Nullmove seems the best invention after alfabeta to reduce branching factor.

It seems possible to forward prune last 5 plies, but suppose you
search 13 ply, no way one can forward prune without search replacement
last 12 ply, that would give too much tactical weaknesses for the programs.

Furthermore, nullmove is cheap.

>Bob

Vincent

--
+----------------------------------------------------+
| Vincent Diepeveen email: vdie...@cs.ruu.nl |
| http://www.students.cs.ruu.nl/~vdiepeve/ |
+----------------------------------------------------+

Vincent Diepeveen

Feb 19, 1997, 3:00:00 AM2/19/97
to

>On 18 Feb 1997 14:52:10 GMT, hy...@crafty.cis.uab.edu (Robert Hyatt)
>wrote:
>>Moritz Berger (Moritz...@msn.com) wrote:
><snip>

>>: Here's one example (1st 2 lines are in .EPD format, you will have to
>>: join them into one line to make it readable by some programs). The
>>: theme is king safety at shorter time controls, after the name of the
>>: program I give the solution time in seconds, in brackets the search
>>: depth needed for the solution, in some cases alternative moves that
>>: include the solution in the PV:
>
>>: r1r1n1k1/5p2/p1b1p1p1/4P2Q/PpNq4/6R1/1PB3PP/R6K b
>>: - - id=Human - REBEL 8.0 P5/133; bm=e8g7;
>
>>: Hiarcs 4 (Fritz 4 engine): 1"(3): Ne8g7
>>: Genius 3: 1"(4): Ne8g7
>>: Hiarcs 5: 3"(3): Ne8g7
>>: Genius 5: 4"(3): Ne8g7
>>: Hiarcs 3: 6"(4): Ne8g7
>>: Fritz 3.1 (Fritz 4 engine): 6"(8): Ne8g7
>>: Fritz 4.01: 7"(8): Ne8g7
>>: Mchess 6: 17"(5): Ne8g7
>>: Mchess 5: 35"(5): Bc6xg2 53"(5): Ne8g7
>>: Rebel 7: 57"(7): Ne8g7
>>: Rebel 8: 89"(7): Bc6xg2 104"(7): Ne8g7
>>: Fritz 1.2: 404"(10): Ne8g7

/users/vdiepeve/> dir
..
-rwxr-xr-x 1 vdiepeve student 167936 May 23 1996 diep*
..

Old version of Diep running under unix (hardware comparable to that of a 486
after hashtables have been swapped from disk to RAM):

black timeleft=277:46.39
r = r = n = k =
= - = - = p = -
p = b = p = p =
= - = - P - = Q
P p N q - = - =
= - = - = - R -
- P B = - = P P
R - = - = - = K
white timeleft=277:46.39
black to move

command : m
input sucks!
command : move
white : Human black : Computer

00:00:00 0 g6xh5
00:00:00 0 d4xc4
1:d4c4 n52 t1 3.345
00:00:00 0 d4xc4 c2xg6
2:d4c4 n189 t1 2.270
00:00:01 0 d4xc4 c2xg6 f7xg6 h5xg6 g8-f8 g6-h7
00:00:02 0 3 (10/48) Bc6xg2 research
00:00:02 0 c6xg2 h1xg2 c8xc4 c2xg6
00:00:03 0 c6xg2 g3xg2 e8-g7 a1-d1
00:00:04 0 3 (23/48) Ne8-g7 research
00:00:04 0 e8-g7 h5-e2 c6xg2 e2xg2
00:00:04 0 e8-g7 h5-e2 c6xg2 e2xg2
3:e8g7 n4301 t4 0.805
00:00:05 0 e8-g7 h5-e2 c6xg2 e2xg2 d4xc4 c2-e4
4:e8g7 n6606 t4 0.760
00:00:09 0 e8-g7 g3xg6 f7xg6 c4-d2
00:00:11 0 e8-g7 h5-g4 c6xg2 g3xg2 c8xc4 a1-e1
5:e8g7 n24917 t11 0.975
00:00:23 0 e8-g7 h5-g4 c6xg2 g3xg2 c8xc4 g4xd4 c4xd4 a1-d1
6:e8g7 n49287 t27 0.975
00:00:55 0 e8-g7 h5-g4 d4xc4 g4xc4 c6xg2 g3xg2 c8xc4 c2-d3 c4-d4
7:e8g7 n233179 t107 0.985
00:03:03 0 e8-g7 h5-g4 c6xg2 g3xg2 c8xc4 a1-d1 d4xg4 c2-b1

So 3 ply and 4 seconds. At PC this first seconds will be done within
a second (no swapping). So 1 second on a PC.

>>: Please note that the programs without Null-move (Rebel 7,8 and Fritz
>>: 1.2) have more problems with this position ... M-Chess 6 is mucho
>>: faster than M-Chess 5, this shows the huge progress that Marty Hirsch

><snip>

>>Two points: First, Crafty relies heavily on null move, as we all
>>know. output:
>
>> 9 13.03 0.230 Qxc4 Bxg6 fxg6 Qxg6+ Kf8 Rg4 Bxg2+
>> Rxg2 Qf4 Qg8+ Ke7 a5
>> 9 27.83 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
>> Rc1+ Rxc1 Rxc1+ Bf1
>> 9-> 1:00 0.816 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qf3 Rac8 Bd3
>> Rc1+ Rxc1 Rxc1+ Bf1
>> 10 1:09 1.093 Ng7 Qg4 Bxg2+ Rxg2 Rxc4 Qxd4 Rxd4
>> a5 Rad8 b3 Rd2 Rxd2 Rxd2
>
>>so it takes 27 secs with very small hash table to hit this.

>your problem, Bob ;-) maybe you need a mainframe for this ;-))))) not

>enough nps on Intel hardware ...

>>However, I didn't think that genius nor Mchess used null-move, rather, that

>>they were simply more selective at shallow plies to get their speed
>

>Ok, I didn't want to become a second Diep or start a "Null move
>skandal" here ;-)
>
>My statement was mainly related to the vast improvement among the
>Fritz versions: Fritz 1 didn't have null-move, Fritz 2 and later used
>null-move and gained very much from it ... So Fritz 3.1 is 60-70 times
>faster in this position, this might also be due to increased
>selectivity, maybe I was simply wrong ... But I also noticed that
>Rebel didn't score well on this position, and Rebel doesn't use null
>move ... M-Chess is also among the slower programs (although it is
>heavily selective), so according to my logic it might not have been
>among the null movers ... Maybe I'm completely wrong, all recent
>programs maybe do well enough on this position, so I hope that no
>programmer hits me with a null move for this :-)))
>
>My example was just that: An example how I try to draw conclusions
>from real computer vs. computer positions ... If you're testing your
>own program, you can hopefully find out which factor influences your
>performance most in a particular position.
>

>>I had thought, that except for Morsch, most didn't use null move.. Dave is
>

>>Bob
>
>-------------
>Moritz...@msn.com

Robert Hyatt

Feb 19, 1997, 3:00:00 AM2/19/97
to

Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:

: Last ply pruning they probably use, but as it seems all use nullmove
: somehow.

: You can use nullmove first x ply and not the last 4 plies (near the leaves)
: for example and forward prune last 3/4 ply, which seems a commonly used
: idea.

Hard to say. Kittinger says no null move in his search. I believe Ed
said the same, but he can respond for himself. Richard? who knows...

: Most programs like Genius/Nimzo/Hiarcs use a different approach to plydepth,

: they give a value to plydepth, for example 16 points = 1 ply, which
: more easily prevents extensions far over the horizon. You can then use
: reduction factors of 1/16 ply, for example 1.5 ply, or 2.5 ply instead of
: 1,2,3,4 etc.

I've done this forever, for different reasons. Crafty and Ferret (to name
two I know about) use fractional ply extensions. Ken Thompson used something
like this in 1981 belle, counting a capture as .5 ply for example. In Crafty
I use ply=60, so I can decrement by 1/60'th of a ply and a lot of multiples
in between. Bruce is using ply=120 which gives more even divisions for odd
fractional numbers like 1/8th, that I can't "quite" get.

: Wouldn't know how to implement nullmove last n-ply using this type of search.

I've done it in the past, but I'm a believer in either using null-move,
period, or throwing it out. Every attempt to restrict it seems to keep most
of the problems, but loses a big part of the benefit...

: >I had thought, that except for Morsch, most didn't use null move.. Dave is

: Ed Schroeder says he doesn't use it. Still there are some mates,
: which typically are mate in n, where nullmove can only find mate in n+i, i>0,
: where almost all commercial programs find a mate in n+i.

: Nullmove seems the best invention after alfabeta to reduce branching factor.

: It seems possible to forward prune last 5 plies, but suppose you
: search 13 ply, no way one can forward prune without search replacement
: last 12 ply, that would give too much tactical weaknesses for the programs.

: Furthermore, nullmove is cheap.

*AND* easy to implement... maybe even more important. :)

: >Bob

Robert Hyatt

Feb 19, 1997, 3:00:00 AM2/19/97
to

: Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:
: : It seems possible to forward prune last 5 plies, but suppose you
: : search 13 ply, no way one can forward prune without search replacement
: : last 12 ply, that would give too much tactical weaknesses for the programs.

: I'm not sure. In a 13 ply search, Chess System Tal forward prunes the last
: _13_ plies, I believe. Is this correct, Chris?

: Tord

As he's explained it, yes. He makes tactical mistakes. He finds deep
tactical lines. I assume he believes that the deep lines offset the
occasional mistakes. Time will tell. It's certainly interesting. I
hope he's right... because *if* (make that *IFF* for theory-conscious
people) he's right, then we have hope to tackle DB. If the heavily
selective search doesn't pan out... we aren't going to get to
15 plies (average, max much higher.)

BTW, this 15 ply thing comes up from time to time, and I get asked about
it all the time. Here's where *I* come up with that number. The last
time I ran Cray Blitz on a T90 (very rare as machine is difficult to get
time on) it was capable of searching to depth 12 in most middlegame
positions. Yes, it would stop at 9 if a king gets exposed, because it
would extend like crazy. that was searching at 3M-5M nodes per second.
They are about 200-300 times faster than that, so I conservatively give
them 3 more plies, because CB didn't use singular extensions and they do...
and it has a high computational cost when done correctly. the 15 is
speculation... it could be more. I doubt it is less however...

Feb 19, 1997, 3:00:00 AM2/19/97
to

Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:
: It seems possible to forward prune last 5 plies, but suppose you
: search 13 ply, no way one can forward prune without search replacement
: last 12 ply, that would give too much tactical weaknesses for the programs.

I'm not sure. In a 13 ply search, Chess System Tal forward prunes the last

Jean-Peter Fendrich

Feb 19, 1997, 3:00:00 AM2/19/97
to

brucemo wrote:
>
- very large snip -

I didn't have the energy to read it all in a very focused way, sorry...

But I think that you did forget one important thing and that's how much
it doesn't describe the spreading (I hope this is an adequate english
term). If you place one foot in very hot water and the other foot in
the freezer, you will find a very good mean value of the temperature
but you don't feel that good! In chess: If you play 100 games which all
ends up in a draw you will be pretty sure that theese programs are very
alike. If they win 50 games each with no draws then the result is much
more unstable. In another post in this thread you could find a formula
to compute s which is really trying to meassure how 'spread out' your
data is.

--
J-P Fendrich

mclane

Feb 20, 1997, 3:00:00 AM2/20/97
to

>mclane wrote:
>
>> For me the most important way of finding out WHICH VERSION IS BETTER
>> is the feeling. If I look on the main-lines and the evaluations of a
>> program, I can find out about which version is stronger in the best
>> case within 10 moves of a 60/60 game.

>It might be interesting to blind test you. I wonder if you could

>distinguish Fritz from Genius from Rebel from CSTal, or rank them in
>the order you would rank them if you knew what you were testing?

>Just a friendly idea.

>bruce

Chris has send me a version a day. And sometimes 2.
They all looked the same.

Believe me. After this torture I was able to know not only which
version is better, I knew it's move before it played it, I knew which
games it could win and which ones not, and more.

But you should not think that I want to impress anybody here.
I think this can be done by anybody.
Any worker is able to do this. E.g. if you work in a big store, after
some months you feel if something is not at the right place although
you have not count anything. Feelings are much better than
autoplayers. Of course feelings only work if you LIKE something.
I don't think I am good in testing programs I do not like ! I would
fail then. So a blind test would maybe only show which programs I like
and which not...

Ask Chris. Normally I am faster and more precise than his debugger.

Vincent Diepeveen

Feb 21, 1997, 3:00:00 AM2/21/97
to

In <5ef8rc\$t...@juniper.cis.uab.edu> hy...@crafty.cis.uab.edu (Robert Hyatt) writes:

>: Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:

>: : It seems possible to forward prune last 5 plies, but suppose you

>: : search 13 ply, no way one can forward prune without search replacement
>: : last 12 ply, that would give too much tactical weaknesses for the programs.
>

>: I'm not sure. In a 13 ply search, Chess System Tal forward prunes the last

>: _13_ plies, I believe. Is this correct, Chris?
>
>: Tord
>

>As he's explained it, yes. He makes tactical mistakes. He finds deep
>tactical lines. I assume he believes that the deep lines offset the
>occasional mistakes. Time will tell. It's certainly interesting. I
>hope he's right... because *if* (make that *IFF* for theory-conscious
>people) he's right, then we have hope to tackle DB. If the heavily
>selective search doesn't pan out... we aren't going to get to
>15 plies (average, max much higher.)
>
>BTW, this 15 ply thing comes up from time to time, and I get asked about
>it all the time. Here's where *I* come up with that number. The last
>time I ran Cray Blitz on a T90 (very rare as machine is difficult to get
>time on) it was capable of searching to depth 12 in most middlegame

Just to compare depth of Cray programs to
Kallisto on MMX200 : Kallisto at a MMX200 searches 12/13 ply already,
with an 15 MB hashtable. That is not far from 15 ply. Kallisto does not
forward prune, but does use nullmove.

>positions. Yes, it would stop at 9 if a king gets exposed, because it
>would extend like crazy. that was searching at 3M-5M nodes per second.
>They are about 200-300 times faster than that, so I conservatively give
>them 3 more plies, because CB didn't use singular extensions and they do...
>and it has a high computational cost when done correctly. the 15 is
>speculation... it could be more. I doubt it is less however...
>
>

Robert Hyatt

Feb 21, 1997, 3:00:00 AM2/21/97
to

Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:

: In <5ef8rc\$t...@juniper.cis.uab.edu> hy...@crafty.cis.uab.edu (Robert Hyatt) writes:

: >Tord Kallqvist Romstad (tor...@ifi.uio.no) wrote:
: >: Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:

: >: : It seems possible to forward prune last 5 plies, but suppose you

: >: : search 13 ply, no way one can forward prune without search replacement
: >: : last 12 ply, that would give too much tactical weaknesses for the programs.

: >
: >: I'm not sure. In a 13 ply search, Chess System Tal forward prunes the last

: >: _13_ plies, I believe. Is this correct, Chris?
: >
: >: Tord
: >
: >As he's explained it, yes. He makes tactical mistakes. He finds deep
: >tactical lines. I assume he believes that the deep lines offset the
: >occasional mistakes. Time will tell. It's certainly interesting. I
: >hope he's right... because *if* (make that *IFF* for theory-conscious
: >people) he's right, then we have hope to tackle DB. If the heavily
: >selective search doesn't pan out... we aren't going to get to
: >15 plies (average, max much higher.)
: >
: >BTW, this 15 ply thing comes up from time to time, and I get asked about
: >it all the time. Here's where *I* come up with that number. The last
: >time I ran Cray Blitz on a T90 (very rare as machine is difficult to get
: >time on) it was capable of searching to depth 12 in most middlegame

: Just to compare depth of Cray programs to
: Kallisto on MMX200 : Kallisto at a MMX200 searches 12/13 ply already,
: with an 15 MB hashtable. That is not far from 15 ply. Kallisto does not
: forward prune, but does use nullmove.

Not a good comparison. In Jakarta, Crafty was doing 12-13 plies on every
search, occasionally dropping back a ply or two if things got wild and
extensions kicked in. But CB will still crush Crafty in a match, because
the speeds are too different.

: >positions. Yes, it would stop at 9 if a king gets exposed, because it

: >would extend like crazy. that was searching at 3M-5M nodes per second.
: >They are about 200-300 times faster than that, so I conservatively give
: >them 3 more plies, because CB didn't use singular extensions and they do...
: >and it has a high computational cost when done correctly. the 15 is
: >speculation... it could be more. I doubt it is less however...

: >
: >
: --

Roger Davis

Feb 21, 1997, 3:00:00 AM2/21/97
to

Maybe this has been brought up before, I don't know... But what about
using GM annotations of other GM games to construct a test suite? Every
time that the annotator awards a exclamation point to a move, then that's a
very good move that even a GM wouldn't ordinarily find, and that becomes a
potential test position. You would want to select good annotators that
actually seem to put some thought into their analysis, of course. And, you
could select annotators that seem to award exclamation points for different
reasons, so that it would be possible to systematically construct a Karpov
test suite, a Kasparov test suite, and so on. So...if you had a GM that
plays a positional game, then perhaps a knowledge-based program would score
higher on his suite, while if you had a speed-based program, it might score
higher on a tactical GM's suite...

Good idea or no?

Roger Davis

Robert Hyatt

Feb 21, 1997, 3:00:00 AM2/21/97
to

Roger Davis (raj...@ix.netcom.com) wrote:
: Maybe this has been brought up before, I don't know... But what about

: Good idea or no?

: Roger Davis

That sounds like a good idea and a worthy project. It might take some analysis
to be sure that all of the ! moves are really good, but it might be a good
way to get started... now for some volunteers. :)

Harald Faber

Feb 21, 1997, 3:00:00 AM2/21/97
to

quoting a mail from mclane # prima.ruhr.de

Hello mclane,

m> From: mcl...@prima.ruhr.de (mclane)
m> Subject: Re: Statistical validity of medium-length match results
m> Organization: Prima e.V. Dortmund

m> Bob is right. From all my experiences with chess-programs I came to
m> the same conclusion. A good finder is mainly a bad chess program.
m>
m> There are exceptions. The King is a big exception. It place really
m> nice chess and finds almost anything.
m> It sacs and plays attractive chess.

Right.

Is The King 2.54 in the testing-suite of the SSDF? I still wonder where he
will be placed. We have impressive results til now (4.5 pts out of 6
against Fritz4 [1 game, 1-0, the other one wil be won by the King too],
Rebel8 [2, 2 wins], MCP6 [2, 1.5pts for King] and Genius5 [1, only loss])
but again the restriction that we played offline (->no use of permanent
brain).

Harald
--

mclane

Feb 22, 1997, 3:00:00 AM2/22/97
to

>Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:
>: It seems possible to forward prune last 5 plies, but suppose you

>: search 13 ply, no way one can forward prune without search replacement
>: last 12 ply, that would give too much tactical weaknesses for the programs.

>I'm not sure. In a 13 ply search, Chess System Tal forward prunes the last

>_13_ plies, I believe. Is this correct, Chris?

>Tord

I think chris is again in holiday !!

I have problems to get him too.

Hm - we will never see CSTal come on the market.

Have you noticed that the predecessor is evaluated with 1984 (George
Orwell ELO) in the swedish-rating-list on rank 48 ????!!!!!!

So how big is the programming progress we have made using intelligence
(the nodes per second are LESSSSSSSSSSS in the latest version than we

In the old versions we searched arround 2 times more nodes/second than
we have now. Chris is one very rare programmer who reduces NPS and
increases playing strength .

How much progress ???

100 ELO ?
200 ELO ?
300 ELO ?

Whatever.
IMO the latest version is not the best. I think I have better versions
in my big archive.

mclane

Feb 23, 1997, 3:00:00 AM2/23/97
to

Harald...@p21.f2.n1.z1001.fidonet.org (Harald Faber) wrote:

>Hello mclane,

>Right.

>Harald
>--
As The King 2.5 they may have it in the TASC R30.
I am not sure if they test 2.54 as a pc - version.
I am not sure which version CM5000 has. Maybe an own version.

Moritz Berger

Feb 23, 1997, 3:00:00 AM2/23/97
to

On Sun, 23 Feb 1997 15:54:51 GMT, mcl...@prima.ruhr.de (mclane) wrote:
<czub>

>I am not sure which version CM5000 has. Maybe an own version.
CM 5000 is The King 2.55

-------------
Moritz...@msn.com

Vincent Diepeveen

Feb 25, 1997, 3:00:00 AM2/25/97
to

In <01bc2043\$1caf9d20\$efc4...@Rajjar.ix.netcom.com> "Roger Davis" <raj...@ix.netcom.com> writes:

>Maybe this has been brought up before, I don't know... But what about
>using GM annotations of other GM games to construct a test suite? Every
>time that the annotator awards a exclamation point to a move, then that's a
>very good move that even a GM wouldn't ordinarily find, and that becomes a
>potential test position. You would want to select good annotators that
>actually seem to put some thought into their analysis, of course. And, you
>could select annotators that seem to award exclamation points for different
>reasons, so that it would be possible to systematically construct a Karpov
>test suite, a Kasparov test suite, and so on. So...if you had a GM that
>plays a positional game, then perhaps a knowledge-based program would score
>higher on his suite, while if you had a speed-based program, it might score
>higher on a tactical GM's suite...
>
>Good idea or no?

This is exactly what i do. If i see a position that my program plays wrong,
then i add it to the test sets. If i see a grandmaster make a good move,
i add it to the library.

Diep for this huge knowledge reason does very well on these positions.
Better than all other programs i have. Something they pointed out in the
newsgroup, which is a disadvantage of this testset is that it doesn't
say that you win games with it.

You only win games when preventing a program from doing a bad move.
In positions that usually are derived from auto232 play.

So that are 2 contradictions. Of course Diep plays better and better in
tournament levels, but still it is not very good comparable.

in our case we select:

analysis rating = max(good moves)

to measure tournament performance you need:

There is of course a relation, if a program does well in tournament
games, then it probably is not the worst analysis program and vice versa.

Last months Diep is playing better and better at tournament level, but it
hardly plays better at analysis level! I fixed a lot of evaluation faults/bugs,
which sometimes cause strange moves in games. Just 1 of such a move every
game is enough.

>Roger Davis

Robert Hyatt

Feb 25, 1997, 3:00:00 AM2/25/97
to

Vincent Diepeveen (vdie...@cs.ruu.nl) wrote:

: In <01bc2043\$1caf9d20\$efc4...@Rajjar.ix.netcom.com> "Roger Davis" <raj...@ix.netcom.com> writes:

: >Maybe this has been brought up before, I don't know... But what about
: >using GM annotations of other GM games to construct a test suite? Every
: >time that the annotator awards a exclamation point to a move, then that's a
: >very good move that even a GM wouldn't ordinarily find, and that becomes a
: >potential test position. You would want to select good annotators that
: >actually seem to put some thought into their analysis, of course. And, you
: >could select annotators that seem to award exclamation points for different
: >reasons, so that it would be possible to systematically construct a Karpov
: >test suite, a Kasparov test suite, and so on. So...if you had a GM that
: >plays a positional game, then perhaps a knowledge-based program would score
: >higher on his suite, while if you had a speed-based program, it might score
: >higher on a tactical GM's suite...
: >
: >Good idea or no?

: This is exactly what i do. If i see a position that my program plays wrong,
: then i add it to the test sets. If i see a grandmaster make a good move,
: i add it to the library.

: Diep for this huge knowledge reason does very well on these positions.

: Better than all other programs i have. Something they pointed out in the
: newsgroup, which is a disadvantage of this testset is that it doesn't
: say that you win games with it.

: You only win games when preventing a program from doing a bad move.
: In positions that usually are derived from auto232 play.

: So that are 2 contradictions. Of course Diep plays better and better in
: tournament levels, but still it is not very good comparable.

: in our case we select:

: analysis rating = max(good moves)

: to measure tournament performance you need:

: tournament rating = min(bad moves)

: There is of course a relation, if a program does well in tournament
: games, then it probably is not the worst analysis program and vice versa.

: Last months Diep is playing better and better at tournament level, but it
: hardly plays better at analysis level! I fixed a lot of evaluation faults/bugs,
: which sometimes cause strange moves in games. Just 1 of such a move every
: game is enough.

Don't forget, tuning a program to emulate a move a GM made can be very good,
the move, then tuning to make that move is going to produce horrendous bugs
in other games. If the GM was trying to advance a passed pawn, most programs
understand passed pawns pretty well and adjusting terms to advance the pawn,
or control squares in front of the pawn, or remove pieces that are hindering
its advance, and so forth would be good. On the other hand, if the GM made
a move to deny a key square to an opponent's bishop, and your eval doesn't
understand it, then it will make that same sort of move in positions where
denying the bishop that square is not important. And the move that denies
the bishop that square might be really awful.

That's the real problem I see in trying to tune against GM moves. Now if a
GM explains *why* he made the move, in terms I can follow on the board, I can
either tweak terms in crafty, or add new ones if needed. But to just take
64K blind positions, and do a least-squares fit of your eval to those positions
is likely going to produce some really amusing moves at time... amusing if you
are on the other side of course... :)

Tom C. Kerrigan

Feb 26, 1997, 3:00:00 AM2/26/97
to

mclane (mcl...@prima.ruhr.de) wrote:

> So how big is the programming progress we have made using intelligence
> (the nodes per second are LESSSSSSSSSSS in the latest version than we

It's blatantly obvious that you get off on programs searching slower. This
has been baffling me for a while now.

Maybe I could give you my first attempt at a chess program. It used
incrementally updated atkto/atkfr bitboards and was insanely slow even
when the evaluation function was disconnected. It also had a tendancy to
drop its queen. You would love it. :)

Cheers,
Tom

mclane

Feb 27, 1997, 3:00:00 AM2/27/97
to

Moritz...@msn.com (Moritz Berger) wrote:

>On Sun, 23 Feb 1997 15:54:51 GMT, mcl...@prima.ruhr.de (mclane) wrote:
><czub>
>>I am not sure which version CM5000 has. Maybe an own version.
>CM 5000 is The King 2.55

Really ? Is the King 2.55 Search limited to 6 plies selective ?
When I am considering about the differnece between Chessmaster 4000
and The King 2.0, it was mainly, that Chessmasters selective component
was forced to stay 4 or 6 (don't have it in mind accurate enough)
plies above the other component.
I think my chessmachine with The King 2.0 was different. There the
difference between the components was variable and switched by
The King automatically. In difficult positions it was low, in normal
positions it was high, I think.

>-------------
>Moritz...@msn.com