
Chess Program Too Slow -- Part II


Joe Stella

Nov 29, 1994, 4:06:05 AM
Many thanks to all those who have responded to my previous post. All
replies are greatly appreciated.

I have gathered some of the responses here, with my replies. I hope this
will be of interest to other amateur chess programmers.

>>In article <joes.4....@ultranet.com>, I (Joe Stella) wrote:

>>I am writing a chess program in C++ on my 486/33 PC, using Microsoft Visual
>>C++. The program runs under Windows.
>>...
>>It takes an average of about 25 seconds to do a 5-ply search on my 486/33.
>>Moves that get out of check don't count as a ply. The node count is about 2000
>>nodes per second, with a "node" defined as any *legal* move made in the tree
>>search. Nodes visited more than once by PVS are counted as two nodes.

>>It seems to me that 2000 nodes per second is too slow for a 486/33. My Novag
>>Super Expert C (6 MHz 6502) is almost this fast, and it has a much more
>>complicated evaluator.

In article <3aqvd8$6...@rational.rational.com> Jon Dart responds:

>Actually, 2000 nodes per second is pretty good, about on the order of what
>I'd expect given your hardware and implementation language. You also have
>to remember that under Windows, you don't get all of the CPU. During a
>GetMessage or PeekMessage call, the system will yield to other tasks
>(e.g. the Clock, Program Manager, etc.) and they will get some CPU cycles.

I have commented out the GetMessage and PeekMessage calls in order to time this
effect. The search is only about 5% faster. (I always make sure the system is
"quiet" when I time a search.)

>Remember, the Novag Super Expert is coded in assembly language. You can
>get a several times speed improvement by coding yours in assembler (not
>that I'd recommend it).

Yes, but the Novag Super Expert has a *much* more complicated evaluator, which
should more than make up for the assembly language advantage.

>Bob Hyatt and several others have recently discussed using bitboards for
>move generation, which appears to be more efficient than the Gnu Chess
>algorithm.

I have considered doing this also, but I thought it would only be
advantageous on a 64-bit processor. I would be glad to learn otherwise.
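
For what it's worth, bitboards do not strictly require a 64-bit processor: a
64-square set can be kept as two 32-bit words, so each bitwise operation costs
roughly two instructions instead of one. A minimal sketch (all names are
illustrative, not from any program mentioned in this thread):

    /* A 64-square bitboard emulated with two 32-bit halves, as was
       common on 32-bit CPUs.  Sketch only. */
    #include <cstdio>

    struct BitBoard {
        unsigned long lo;   /* squares a1..h4 (bits 0..31)  */
        unsigned long hi;   /* squares a5..h8 (bits 32..63) */
    };

    /* Bitwise AND costs two 32-bit instructions instead of one 64-bit one. */
    inline BitBoard bb_and(BitBoard a, BitBoard b) {
        BitBoard r;
        r.lo = a.lo & b.lo;
        r.hi = a.hi & b.hi;
        return r;
    }

    inline int bb_nonempty(BitBoard a) {
        return (a.lo | a.hi) != 0;
    }

    /* Example: do any of our rooks stand on an enemy-attacked square? */
    int main() {
        BitBoard rooks   = { 0x00000081UL, 0x00000000UL };  /* rooks on a1, h1 */
        BitBoard attacks = { 0x00000080UL, 0x00000000UL };  /* enemy attacks h1 */
        BitBoard hit = bb_and(rooks, attacks);
        std::printf("rook attacked: %s\n", bb_nonempty(hit) ? "yes" : "no");
        return 0;
    }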

>Move ordering is also extremely important. It sounds like you have a
>reasonably good move ordering mechanism, but more effort in this area
>might pay off. Also, I believe Visual C++ has a profiler, so you
>might try running that over your code to see where the hot spots are.

Yes, these are both very valid points. I tried the profiler some time
ago and found it very cumbersome to use.


Thorsten Greiner responds:

>First of all, in my opinion C++ isn't the perfect language when it comes
>to performance. From my experience, programs in C++ that use classes etc.
>typically run a factor of 2 to 10 slower than their C counterparts. I must say,
>though, that I don't have any experience with Visual C or Visual C++.

Yes, I have noticed this too. This is why I no longer have any C++ classes
or objects in the tree search code. Classes are used only in the user interface
portion.

>Second you could try to improve your move ordering. My chess program Amy uses
>the following ordering:

> 1. Null move
> 2. Move from the transposition table
> 3. Gaining captures
> 4. Killer moves
> 5. Losing captures
> 6. All other moves (with some static ordering)

>A gaining capture is defined as a capture of an undefended piece, or, if the
>piece is defended, a capture of a piece of higher value (e.g. pawn takes queen).
>Since I am using bitboards for move generation, the test for being defended can
>be made fairly quickly. In your case this test is probably much harder (i.e.
>more time consuming) to make.

More bitboards! I guess I really should look into this...

Improving my move ordering will make my search faster for the same
nodes-per-second speed, because fewer nodes will be searched. I still think
that in addition to this, I need to improve my nodes per second speed
because my evaluator is so *simple*. If I try to add any real chess
knowledge to my program right now, I am afraid this will bring it
to its knees!
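
To make the ordering above concrete, here is one way to turn such a scheme into
sort keys. This is only a sketch in modern C++; the names, the constants and the
is_defended() stub are illustrative assumptions, not code from Amy or from any
other program mentioned here. (The null move itself is not a generated move, so
it belongs in the search rather than in this list.)

    #include <algorithm>
    #include <vector>

    struct Move {
        int  from, to;
        int  captured_value;   // value of the captured piece, 0 if none
        int  attacker_value;   // value of the moving piece
        bool is_hash_move;     // came from the transposition table
        bool is_killer;        // matches a killer slot at this ply
        int  score;            // sort key filled in below
    };

    // Placeholder: is square 'sq' defended by the opponent?  With
    // bitboards this is one AND against an attack map; without them
    // it usually means a slower scan.
    bool is_defended(int /*sq*/) { return true; }

    void order_moves(std::vector<Move>& moves) {
        for (Move& m : moves) {
            if (m.is_hash_move)
                m.score = 1000000;                      // hash move first
            else if (m.captured_value > 0 &&
                     (!is_defended(m.to) ||
                      m.captured_value > m.attacker_value))
                m.score = 900000 + m.captured_value;    // "gaining" capture
            else if (m.is_killer)
                m.score = 800000;                       // killer moves
            else if (m.captured_value > 0)
                m.score = 700000 + m.captured_value;    // "losing" capture
            else
                m.score = m.to;                         // some static ordering
        }
        std::sort(moves.begin(), moves.end(),
                  [](const Move& a, const Move& b) { return a.score > b.score; });
    }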

>I have divided (the evaluation function) into two parts, a slow and
>a fast evaluation function. The fast eval score is maintained in the
>DoMove/UndoMove routine. It covers things like material balance, piece
>placement, pawn advancement etc.

>The slow eval deals with things like mobility, pawn structure etc. This
>routine is called only if the fast eval score is in the range
>[alpha-MaxPos, beta+MaxPos], where MaxPos is something like 3/4 of a pawn
>(in material terms). This alone makes the program a factor of 1.5 to 2 faster in terms
>of nodes/sec.

Every time I try to do something like this, moves get cut off in the tree
search that shouldn't be, and the program makes very bad moves. It appears
the problem I have is that the value of beta gets set according to the
"fast" evaluation function; then when a position has a fast_score > beta it
gets cut off, even though it also has (fast_score + slow_score) < beta, meaning
that the position should be searched further. If I use a MaxPos "window" as
suggested above, then the number of beta cutoffs drops sharply and the program
is terribly slow. I must be missing something here.
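
The scheme is only sound if the margin really bounds the slow terms: a fast
score may be trusted for a cutoff only when it lies so far outside the window
that no possible slow score could pull it back inside. A minimal sketch,
assuming |slow_eval()| can never exceed MAX_POS (all names and values here are
illustrative, not from any particular program):

    const int MAX_POS = 75;                  // ~3/4 pawn, in centipawns

    int fast_eval() { return 0; }            // stub: material, piece placement, ...
    int slow_eval() { return 0; }            // stub: mobility, pawn structure, ...

    int evaluate(int alpha, int beta) {
        int score = fast_eval();
        if (score > beta + MAX_POS)
            return score;                    // far enough above beta: cutoff is safe
        if (score < alpha - MAX_POS)
            return score;                    // far enough below alpha: cutoff is safe
        return score + slow_eval();          // near the window: pay for the full eval
    }

If the slow terms can in fact swing the score by more than MAX_POS, cutoffs
become unsound, and the bad moves described above are exactly what results.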


Vincent Diepeveen writes:

>Kallisto and Fritz3 are evaluating 8000-15000 nodes a second on the hardware
>they are using (486 dx 66). They are reaching 9-12 ply brute force in the
>middlegame.

>Suppose I have the same program they do, but now I'm evaluating
>1200 times a second. This means that I'm about 8 times slower.
>8 times is about 1.2 ply deeper brute force.

Yes, this is true *if* they are doing traditional brute force. If these
programs really search this deeply, I have to think that they do
some form of forward pruning. This means either a very smart "pruning"
algorithm in the main search, or a smart quiescence algorithm which
includes more than just captures and checks. Obviously, the distinction
between these two is fuzzy. I wonder if someone out there who has a good
algorithm would be willing to share it (I can *hope*, can't I?).
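
As a rough check of the "8 times is about 1.2 ply" figure quoted above: with a
well-ordered alpha-beta search the effective branching factor is roughly the
square root of the raw figure of 38, and the small calculation below (which is
only illustrative) does come out near 1.2:

    #include <cmath>
    #include <cstdio>

    int main() {
        double ebf     = std::sqrt(38.0);   /* effective branching factor, ~6.2 */
        double speedup = 8.0;               /* eight times more nodes per second */
        double extra   = std::log(speedup) / std::log(ebf);
        std::printf("extra depth: %.2f plies\n", extra);   /* about 1.1-1.2 */
        return 0;
    }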


>To give you an example of another program with a slow evaluator:
>The King. The King 1 implemented in Chessmaster 4000
>evaluates about 2000 nodes a second on a 486 dx 66.
>It always achieves about 8 ply brute force in very complex middlegames...

I'm sure that The King has a *much* more complex evaluation function than
my program. In spite of this, it still runs at more than half the speed
of mine (taking the hardware into account). There is more to this
than just the "assembly language" speedup factor.

Joe Stella
<jo...@ultranet.com>

Robert Hyatt

Nov 29, 1994, 11:37:09 PM
In article <joes.3....@ultranet.com>,

If they are searching 8-15K nodes per second, they have never, in their
wildest dreams, reached 9-12 plies in the middlegame. Absolute, pure
hogwash. To reach 9, they will need 10X the speed you quoted.

Don't be too concerned about speed. The easiest way to make it fast
is to make it dumb, and that is a poor payoff. Ignore nodes per second and
work on driving search time down by shrinking the tree with good move ordering.

You are trying to make your car go faster by watching the tachometer.
That only works if everything else is equal. Can't you visualize blowing
by a car running 8000 RPM through a 6.10 final drive ratio, while you cruise
by at 3000 RPM with a 2.22 final drive? Forget the tach (nodes per second)
and watch the speedometer (clock).



--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170

Urban Koistinen

Nov 30, 1994, 2:53:28 PM
In <joes.3....@ultranet.com> jo...@ultranet.com (Joe Stella) writes:
[text deleted]
->I'm sure that The King has a *much* more complex evaluation function than
->my program. In spite of this, it still runs at more than half the speed
->of mine (taking the hardware into account). There is more to this
->than just the "assembly language" speedup factor.

I think you should tell your compiler to compile to assembly language.
That way you can see how stupid your compiler is.
Reading assembly language is not that hard, as you don't have to
understand it all.
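
For reference, most compilers can emit the assembly listing directly; the exact
switches below are typical examples and may differ by compiler version:

    cl /c /Fa search.cpp      (Microsoft C/C++: writes search.asm)
    gcc -S -O2 search.c       (gcc: writes search.s)
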
--
Urban Koistinen - md85...@nada.kth.se
Stop software patents, interface copyrights: contact l...@uunet.uu.net

Gijsb. Wiesenekker

Dec 1, 1994, 2:49:42 AM
Robert Hyatt (hy...@willis.cis.uab.edu) wrote:
: If they are searching 8-15K nodes per second, they have never, in their
: wildest dreams, reached 9-12 plies in the middlegame. Absolute, pure
: hogwash. To reach 9, they will need 10X the speed you quoted.

I have participated in both Paderborn (the fourth IPC) and
the Dutch Computer Chess Championships 1994, and I have noticed
that the 'better' programs (that is, the programs that end up in the
top half of final ranking) search about 3 ply deeper than last
year (that is, last year their displays showed things like
'time limit, depth = 8, time = 120 secs' and now their displays show
things like 'time limit, depth = 11, time = 120 secs').
Schach 3.0 (who won the fourth IPC) searches about 100K nodes/sec on a P90,
and reaches 10-11 plies in the middlegame. As far as I know, Schach 3.0
uses alpha-beta with null moves.
The search depth seems to be 'real' in the sense that their principal
variations 'fit' the given search depth.
I do not know (yet) how they reach these depths. The author
of Quest (that won the Dutch CCC, and outsearched ZZZZZZ in the
final round by 6-7 plies in the middle game, ZZZZZZ searching
about 6 ply. Needless to say I lost that game..) told me that
it has to do with 'using the material balance in the null-move search', but
he did not tell me any more than that. I have tried a few things
(reducing the null move search depth if there is a material advantage
for example) but have not found the trick so far.

Gijsbert Wiesenekker

Robert Hyatt

Dec 1, 1994, 9:55:35 AM
In article <1994Dec1.0...@cca.vu.nl>,


I don't know what their "depth" means, but I can guarantee you that it doesn't
mean "full-width" to that depth. Why? Compute the square root of 38^10, which
is 38^5, which is *BIG*. That's the absolute minimum number of nodes you have
to search to do a 10 ply search, not counting any quiescence or extensions.
38^2 is roughly 1500 and 38^4 is roughly 2 million, so around 80 million nodes
are required to do a 10 ply search (exhaustive 10 plies) with *no* capture
search, etc. added on. At a few minutes per move that is well over 20 million
nodes per minute, or several hundred thousand per second. And what about the
capture search, which can easily double, triple or quadruple this?
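
For the curious, the figure above can be checked against the Knuth/Moore bound
for a perfectly ordered alpha-beta tree, b^ceil(d/2) + b^floor(d/2) - 1, which
comes out about twice the simple square-root estimate. A small sketch (the
branching factor 38 and depth 10 follow the post above; everything else is
illustrative):

    #include <cmath>
    #include <cstdio>

    int main() {
        double b = 38.0;    /* rough middlegame branching factor */
        int    d = 10;      /* nominal full-width depth in plies */
        double sqrt_estimate = std::pow(b, d / 2.0);
        double knuth_moore   = std::pow(b, (d + 1) / 2) + std::pow(b, d / 2) - 1.0;
        std::printf("sqrt(b^d)   ~ %.2e leaves\n", sqrt_estimate);  /* ~7.9e7 */
        std::printf("Knuth/Moore ~ %.2e leaves\n", knuth_moore);    /* ~1.6e8 */
        return 0;
    }

Either way, well over 100,000 nodes per second is needed to finish such a
search in tournament time, before any quiescence search is added.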

If they are playing games to reduce the search depth along certain branches,
which the null-move search does, then they *aren't* doing a full 10 plies,
and will overlook things that a real 10-ply program will find.

Bob

Peter Gillgasch

Dec 3, 1994, 11:22:51 AM
Note: This is clearly off topic for r.g.c. If you are just a chess
player, don't bother to continue reading...

In article <3bil7o$3...@news.kth.se>, md85...@hemul.nada.kth.se (Urban Koistinen) writes:
|> In <joes.3....@ultranet.com> jo...@ultranet.com (Joe Stella) writes:
|> [text deleted]
|> ->I'm sure that The King has a *much* more complex evaluation function than
|> ->my program. In spite of this, it still runs at more than half the speed
|> ->of mine (taking the hardware into account). There is more to this
|> ->than just the "assembly language" speedup factor.
|>
|> I think you should tell your compiler to compile to assembly language.
|> That way you can see how stupid your compiler is.
|> Reading assembly language is not that hard, as you don't have to
|> understand it all.
|> --

I don't think that this makes too much sense. I am amazed that this
old story of the superiority of assembly language over compiler-generated
code is still believed.

IMHO it is essential to have a good understanding of the CPU that
executes the program. Then try to implement data structures that can be
easily mapped onto the given hardware. Read your compiler manual, e.g. the
"Linking with libraries from other languages and assembler" part. Turn
all code optimizations on.

Use a language that maps well onto the CPU. In other words use "C" (^8.

Then you can take a look at the assembler output. If the generated code
is *bad* then look for another compiler and kick the a** of the
manufacturer... If you can read this, you can email the developers. Send
them some code snippets and the generated assembly language output. If
you don't get an answer, then kick 'em and never buy any of their
products. And tell your friends about that...

Profile your code. If some routines take a lot of time, do the following:

1. Think about better algorithms/data structures.
2. Hand-optimize the routines, but keep an unoptimized copy of the code
for documentation and maintenance.
3. If everything else fails, code in assembly those time-consuming *small*
functions that are in the "top 10" in terms of runtime % *and* that are
clearly handled suboptimally by your compiler, but ... (see 2.)

-- Peter

+---------------------------+------------------------------+
| Peter W. Gillgasch | Personal Speed Demon Award: |
| Klosterweg 28/I-113 | |
| 76131 Karlsruhe/Germany | PowerMacintosh |
| Phone ++49/(0)721/6904255 | |
| email gil...@ira.uka.de | The power to beat other kids |
+---------------------------+------------------------------+

Peter Gillgasch

Dec 3, 1994, 12:01:40 PM
In article <1994Dec1.0...@cca.vu.nl>, wie...@iodine.chem.vu.nl (Gijsb. Wiesenekker) writes:
|> Robert Hyatt (hy...@willis.cis.uab.edu) wrote:
|> : If they are searching 8-15K nodes per second, they have never, in their
|> : wildest dreams, reached 9-12 plies in the middlegame. Absolute, pure
|> : hogwash. To reach 9, they will need 10X the speed you quoted.

In 1989 DT was searching 700,000 - 1,000,000 nps and had a brute-force
lookahead of 10 ply... I don't know where they stand *now*...

|> I have participated in both Paderborn (the fourth IPC) and
|> the Dutch Computer Chess Championships 1994, and I have noticed
|> that the 'better' programs (that is, the programs that end up in the
|> top half of final ranking) search about 3 ply deeper than last
|> year (that is, last year their displays showed things like
|> 'time limit, depth = 8, time = 120 secs' and now their displays show
|> things like 'time limit, depth = 11, time = 120 secs').

Probably they just changed the printf() (^8... Seriously, can you elaborate on that
a bit further? Which programs are you talking about exactly? If I recall the Thompson
experiment correctly, such a program should have an ELO rating of about 2750... I suppose
you are talking about some Dutch programs, which ran on Pentiums. The claim of >>2600 ELO
on such a CPU (or any other in this price segment) is clearly a joke.

|> Schach 3.0 (who won the fourth IPC) searches about 100K nodes/sec on a P90,
|> and reaches 10-11 plies in the middlegame. As far as I know, Schach 3.0
|> uses alpha-beta with null moves.

There you have it. "Null moves". Probably recursively applied. You are talking
about a *highly* selective (but quite efficient) search algorithm.

|> The search depth seems to be 'real' in the sense that their principal
|> variations 'fit' the given search depth.

You have been there, so probably you can answer the Q: Wasn't there pv output
*much* longer than the search depth? That should be the case in a modern *brute
force* program with selective deepening heuristics.

|> I do not know (yet) how they reach these depths. The author
|> of Quest (that won the Dutch CCC, and outsearched ZZZZZZ in the
|> final round by 6-7 plies in the middle game, ZZZZZZ searching
|> about 6 ply. Needless to say I lost that game..) told me that
|> it has to do with 'using the material balance in the null-move search', but
|> he did not tell me any more than that. I have tried a few things
|> (reducing the null move search depth if there is a material advantage
|> for example) but have not found the trick so far.

Frans Moersch (I hope I spelled this correctly) used recursive null move pruning
in Fritz/Quest. There is a nice article by Chrilly Donninger in the ICCA Journal about
that (he re-invented the algorithm, or some variant of it). I will think about the
material balance thing and how it fits into that picture...
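
For readers who have not seen it, the basic shape of recursive null-move
pruning is easy to sketch. Everything below (the names, the reduction R and
the stub routines) is illustrative; it is not code from Fritz, Quest or any
other program discussed here:

    const int R = 2;                      /* typical depth reduction */

    int  quiesce(int alpha, int beta);    /* capture search (not shown)       */
    bool in_check();                      /* is the side to move in check?    */
    void make_null_move();                /* swap the side to move only       */
    void undo_null_move();
    int  search_moves(int depth, int alpha, int beta);  /* ordinary move loop */

    int search(int depth, int alpha, int beta) {
        if (depth <= 0)
            return quiesce(alpha, beta);

        /* Null move: hand the opponent a free move.  If even then the
           reduced-depth search cannot bring the score below beta, the
           real moves will not either, so the subtree is pruned.  Because
           the reduced call re-enters search(), the null move is applied
           recursively throughout the tree. */
        if (!in_check() && depth > R) {
            make_null_move();
            int score = -search(depth - 1 - R, -beta, -beta + 1);
            undo_null_move();
            if (score >= beta)
                return beta;              /* fail-high cutoff */
        }
        return search_moves(depth, alpha, beta);
    }

One natural place for the 'material balance' idea mentioned above is the
condition guarding the null move, for example only trying it when the static
material score is already at or above beta; that is only a guess at what was
meant, though.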

The important Q is: They *print* larger numbers than last year. But do they play
*much* better? Put it this way: If they are looking 3 plies deeper into the tree
(and I mean *everywhere* in the tree) they should play about 400 ELO points stronger
than last year... OK, they crushed ZZZZZZ, which was running on an RS/6000 I suppose.
What is ZZZZZZ's rating? Do you think that their rating is at least 800 points higher?

It would be interesting to test one of those programs against an excellent brute force
opponent (e.g. CB or DT) to see what price they pay with their selective search schemes
in terms of accuracy... My personal opinion is that they will be crushed, but maybe
they crush humans better than the full width machines...
