Speed of Move Generator

MANOHARARAJAH VALAVAN

unread,

Aug 15, 1995, 3:00:00 AM8/15/95

to

I want to compare the performance of my move generator to the other programs
out there. The program was compiled using visual c++ for win95, and running
on a Pentium 66 it takes around 0.05 milliseconds per move generation. Now
the moves come out un-ordered. So if we add move ordering, it would add
another 0.02 milliseconds to the tally. Making it around 0.07 to produce
a fully ordered move list.
--
-------------------------------------------------------------------------------
man...@ecf.utoronto.ca | 2nd year Comp Eng., University of Toronto
Valavan Manohararajah | OS/2 Warp "Operate at a higher level"
-----------------------------Team OS/2 Member----------------------------------

Robert Hyatt

unread,

Aug 15, 1995, 3:00:00 AM8/15/95

to

In article <DDBzC...@ecf.toronto.edu>,

A better test is the following:

1. pick a position.

2. generate all legal moves from this position N times, where N is
large enough to get an accurate time (10K for example). divide
10K*number of moves generated by the time used to get a number of
moves per second generated.

3. do the same thing, but toss in a make_move() (and an unmake()
if it's necessary in your implementation) to produce a rough
approximation of the number of positions per second you can
generate/make.

4. Set the same position up in Crafty, if you have it, using the
setboard or read or whatever command. type "perf" and it'll
produce its numbers for the above.

Caution: comparing moves generated per second is not a good test
since crafty is very fast here (it has to do no work to pull the
move generation off, all the work is done in make_move())

Hope this helps. However, different compilers, different optimization
levels, different processors, different L2 caches, different RAM
sizes, will all make a difference in performance.

--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170

MANOHARARAJAH VALAVAN

unread,

Aug 15, 1995, 3:00:00 AM8/15/95

to

In article <DDBzC...@ecf.toronto.edu>,
MANOHARARAJAH VALAVAN <man...@ecf.toronto.edu> wrote:
>I want to compare the performance of my move generator to the other programs
>out there. The program was compiled using visual c++ for win95, and running
>on a Pentium 66 it takes around 0.05 milliseconds per move generation. Now
>the moves come out un-ordered. So if we add move ordering, it would add
>another 0.02 milliseconds to the tally. Making it around 0.07 to produce
>a fully ordered move list.

Sorry.....the number was wrong. Move generation takes 0.2 milliseconds in my
program. With ordering, that number would come out to be around 0.22 millisecs.

MANOHARARAJAH VALAVAN

unread,

Aug 16, 1995, 3:00:00 AM8/16/95

to

In article <40qisq$e...@pelham.cis.uab.edu>,

Robert Hyatt <hy...@willis.cis.uab.edu> wrote:
>In article <DDBzC...@ecf.toronto.edu>,
>MANOHARARAJAH VALAVAN <man...@ecf.toronto.edu> wrote:
>>I want to compare the performance of my move generator to the other programs
>>out there. The program was compiled using visual c++ for win95, and running
>>on a Pentium 66 it takes around 0.05 milliseconds per move generation. Now
>>the moves come out un-ordered. So if we add move ordering, it would add
>>another 0.02 milliseconds to the tally. Making it around 0.07 to produce
>>a fully ordered move list.

>>--
>>-------------------------------------------------------------------------------
>> man...@ecf.utoronto.ca | 2nd year Comp Eng., University of Toronto
>> Valavan Manohararajah | OS/2 Warp "Operate at a higher level"
>>-----------------------------Team OS/2 Member----------------------------------
>
>

>A better test is the following:
>
>1. pick a position.
>
>2. generate all legal moves from this position N times, where N is
>large enough to get an accurate time (10K for example). divide
>10K*number of moves generated by the time used to get a number of
>moves per second generated.

Ok...I did this one..... Lets take the opening position! I know this isn't
a really good position for testing out the speed of the generator....but
anyway it takes 1.5 millisecs to run 10000 GenerateMoves().

>
>3. do the same thing, but toss in a make_move() (and an unmake()
>if it's necessary in your implementation) to produce a rough
>approximation of the number of positions per second you can
>generate/make.

This would be the better way....but alas, I just started writing the program
so I don't have the make/unmake yet. This Make() and Unmake() would be the
costly part for me as well, since I use a Bit Board Move Generation scheme
that updates data as it makes+unmakes moves.

>
>4. Set the same position up in Crafty, if you have it, using the
>setboard or read or whatever command. type "perf" and it'll
>produce its numbers for the above.
>
>Caution: comparing moves generated per second is not a good test
>since crafty is very fast here (it has to do no work to pull the
>move generation off, all the work is done in make_move())
>
>Hope this helps. However, different compilers, different optimization
>levels, different processors, different L2 caches, different RAM
>sizes, will all make a difference in performance.

B

>
>
>--
>Robert Hyatt Computer and Information Sciences
>hy...@cis.uab.edu University of Alabama at Birmingham
>(205) 934-2213 115A Campbell Hall, UAB Station
>(205) 934-5473 FAX Birmingham, AL 35294-1170

Steven J. Edwards

unread,

Aug 16, 1995, 3:00:00 AM8/16/95

to

I suggest using a simple enumeration benchmark for testing move
generation, execution/retraction, legality testing, and detection of
checkmates and stalemates. For some of my tests, I have Spector start
with the initial position and have it enumerate all distinct pathways
to successive ply depths starting at one. Note that for ply depths of
seven or more, 32 bit signed integers are insufficient to hold the
sums.

This also tests the basic move logic as the sums can be compared with
those generated by other programs.

Using a simple offset generator, Spector running on a 66 MHz 486DX2
can process move generation, execution, legality testing, and
retraction at a node frequency of just under 30 KHz. Using the
bitboard generation mechanism, which also does a great deal of
incremental updating, is about six times slower. Most of the slowdown
is due to the bitboard code and data being too large to fit into the
L1 processor cache.

Here are the numbers for the pathway count for the first four ply from
the starting position:

depth count
----- ------
1 20
2 400
3 8902
4 197281

-- Steven (s...@mv.mv.com)

Bruce Moreland

unread,

Aug 16, 1995, 3:00:00 AM8/16/95

to

It is hard to know how to evaluate the speed of a given chess engine. A
sexy number is nodes traversed per second (nps). On a P5/66 I would
expect to see a hobbyist program doing 10K nps as an approximate high
end, although it is certainly possible to do at least double this without
sacrificing strength for speed. The problem with nodes/second as a
measure of program strength is that it doesn't measure program strength,
it measures nodes/second. Fritz is the fastest PC program, but it is not
on top of the Swedish rating list, Genius is. Genius is a lot slower,
but is significantly stronger, according to the Swedish list.

It is possible to generate huge numbers of nodes/second, if you aren't
particularly interested in whether they are the right nodes to generate.
If you order your moves poorly you can increase nodes/second
tremendously, but you won't search as many plies.

Another measure of program strength is search depth reached in a given
time. This is more a measure of how well the parts of the program work
together. When you are writing a program, increased depth reached is
certainly a good indication that your program is improving. But once
again it is misleading to compare distinct programs on this basis,
because the stronger programs use forward pruning to increase depth, and
selective extensions that increase depth in localized areas of the tree,
at the cost of a little depth throughout the rest of the tree. Null-move
forward pruning is a well-known technique that can be used to increase
depth significantly, at cost of a little accuracy (sometimes a lot).
There are other forward pruning techniques that the commercial programs
use that we can only speculate about, since they haven't been published.

So you can't know how fast you go until you've finished your program, and
once you know how fast you go, you might not really know how fast you go,
and in any case what matters is not how fast you go, or how fast you
think you go, but whether you win or lose.

Now that I've said all of that, it sounds like you're going pretty slow.
You give a figure of 0.05 milliseconds per move generated, which is
20,000 moves generated per second. If this was a whole chess program,
20,000 nps would be great, but it's not a whole program, you still have
to write the rest of it, and in the case of my program Ferret, moves
generated per second is about 8x the value for nodes/second, so as a
wild guess you're getting like 3K nps. So my guess is you need to speed
up, but I'd urge you to just write the rest of the program, get it
playing on a regular basis (fics or ICC) then refine it as necessary.

Also, I don't understand the 0.02 millisecond number. What exactly is
this?

bruce

In article <DDBzC...@ecf.toronto.edu>, man...@ecf.toronto.edu says...

>
>I want to compare the performance of my move generator to the other
programs
>out there. The program was compiled using visual c++ for win95, and
running
>on a Pentium 66 it takes around 0.05 milliseconds per move generation.
Now
>the moves come out un-ordered. So if we add move ordering, it would add
>another 0.02 milliseconds to the tally. Making it around 0.07 to
produce
>a fully ordered move list.
>--
>------------------------------------------------------------------------
-------
> man...@ecf.utoronto.ca | 2nd year Comp Eng., University of Toronto
> Valavan Manohararajah | OS/2 Warp "Operate at a higher level"
>-----------------------------Team OS/2
Member----------------------------------

--
The opinions expressed in this message are my own personal views
and do not reflect the official views of Microsoft Corporation.

Peter Gillgasch

unread,

Aug 16, 1995, 3:00:00 AM8/16/95

to

In article <40qisq$e...@pelham.cis.uab.edu>, hy...@willis.cis.uab.edu (Robert Hyatt) writes:
|> In article <DDBzC...@ecf.toronto.edu>,
|> MANOHARARAJAH VALAVAN <man...@ecf.toronto.edu> wrote:

|> >I want to compare the performance of my move generator to the other programs
|> >out there. The program was compiled using visual c++ for win95, and running
|> >on a Pentium 66 it takes around 0.05 milliseconds per move generation. Now
|> >the moves come out un-ordered. So if we add move ordering, it would add
|> >another 0.02 milliseconds to the tally. Making it around 0.07 to produce
|> >a fully ordered move list.

< automatic os divided by two sig snip procedure applied >

For the record: This speed was later corrected to be around 0.2 millisec
unordered, 0.22 millisec ordered.

This test is quite artificial (I had a *hard* time to swallow it, really).

My old DarkThought code (table driven move generator) generated on the
PowerMacintosh @80 mhz around 5,000,000 moves per sec (unordered). Assuming
a position with 40 pseudo legal moves we get a speed of 125,000 move lists
per sec (unordered). That is 0.008 millisec, written in "C" only. That is a
factor of 25. Sorry, no cigar.

*But* this move generator was only useful for move generation, and it
was so incredible fast that it was really *sad* to see the performance
drop every additional step introduced... Hence I followed Bob and Steven
and ... and after a lot of thought I designed and implemented a bit board
engine (functional, but still under construction) without any "design for
change" ideals in mind (notable exception: evaluation)...

This code searches on the same hardware around 18,000 (+/- 4000) legal
non futile positions per second. That is 0.055 milliseconds per node, a
number which is much more interesting. Of course I will make it even
faster (well, I hope so).

Now, Bob suggests a new speed game...

|> A better test is the following:
|>
|> 1. pick a position.

Will post results on BK tomorrow...

|> 2. generate all legal moves from this position N times, where N is
|> large enough to get an accurate time (10K for example). divide
|> 10K*number of moves generated by the time used to get a number of
|> moves per second generated.

Ordered or unordered ? I don't have a routine that spits out unordered
moves, although I could write a capture generation routine in less then
10 lines... Hm, that would be nice to get an idea of the mvv/lva overhead.
This would be useful for our latest discussion and would help myself
to justify why I entered this race (^8

|> 3. do the same thing, but toss in a make_move() (and an unmake()
|> if it's necessary in your implementation) to produce a rough
|> approximation of the number of positions per second you can
|> generate/make.

With legality checking I assume.

|> 4. Set the same position up in Crafty, if you have it, using the
|> setboard or read or whatever command. type "perf" and it'll
|> produce its numbers for the above.

Question is, can we learn somthing from this or is it only
for the fun of it? Hm...

More later.

-- Peter

------------------------------------------------------------
Peter W. Gillgasch
Klosterweg 28/I-113 email gil...@ira.uka.de
76131 Karlsruhe/Germany Phone ++49/(0)721/6904255

Peter Gillgasch

unread,

Aug 16, 1995, 3:00:00 AM8/16/95

to

Here are the numbers I promised. After reading Bob's proposal how to benchmark
the move generation code of a chess program I added a command to my current
DarkThought code to test the speed of some basic functions of the new move
generator (which is much more than a move generator).

Here is what each test does:

Capture generation - computes captures (including en passant) and pawn promotions
one by one ordered by mvv/lva scheme. Note that this test
is the worst case scenario, all captures are generated, the
mvv/lva overhead is simply wasted here.

Move generation - Produces a list of all moves. Captures are generated first
using the capture generation routine. To compute the time for
the generation of non capturing moves only you have to subtract
the capture generation time.

Note that I didn't implement a routine that tries to generate
all pseudo legal moves with minimal effort - that would be
probably a lot of faster but has absolutely no impact on
search performance...

DoMove - Executes *all* moves maintaining castling and en passant rights,
some evaluation stuff and some hash keys. Checks the
legality of the move and returns a bitboard denoting all squares
that attack the opponent's king. Keeps the set of pseudo legal
moves for both players up to date. It does a *lot* of work.
No UndoMove operation is needed. To compare with other move
generator designs it would be fair to run a sequence of calls

Note that this test executes the moves only (they have been
generated beforehand). If you are interested in testing move
generation *and* move execution simply add the time of the
move generation test.

Note that those routines are very sophisticated: in a real tree search much less
than 1% of the produced moves are illegal, often less than 0.1 %.

For this test I actually called user interface glue/interactive debugging routines,
that test some conditions and call the "right" routines. During a tree search the
search code *is* the glue routine and automatically dispatches to the right routine,
if it is not already completely hand-inlined...

So it should be slightly faster "in reality".

The following lines (one for each Bratko-Kopec position) contain the data for
each test. The "total" columns state the total time in seconds for the test,
the "each" columns state the time in milliseconds for each call.

The column "Capture generation each" states the time for generating a complete capture
list, the column "Move generation each" states the time for generating a complete
move list and the column "DoMove call each" states the *average* time of a *single*
move execution.

The data was obtained running each test 10,000 times. Increasing the number of
iterations by a factor of 10 showed no effect on the "each" data, so it should
be pretty accurate.

Position Capture generation Move generation DoMove call

total (sec) each (ms) total (sec) each (ms) total (sec) each (ms)

BK #01 0.1333 0.013333 0.2333 0.023333 9.5500 0.024487
BK #02 0.1667 0.016667 0.2833 0.028333 7.1000 0.022187
BK #03 0.2667 0.026667 0.3667 0.036667 6.0833 0.023397
BK #04 0.2000 0.020000 0.3333 0.033333 8.7833 0.023739
BK #05 0.3500 0.035000 0.4500 0.045000 12.1833 0.024367
BK #06 0.1333 0.013333 0.2167 0.021667 4.7667 0.020725
BK #07 0.2833 0.028333 0.4000 0.040000 7.6167 0.023081
BK #08 0.1000 0.010000 0.1833 0.018333 2.5333 0.021111
BK #09 0.2833 0.028333 0.4000 0.040000 8.4167 0.022748
BK #10 0.2667 0.026667 0.3500 0.035000 8.2500 0.022917
BK #11 0.2333 0.023333 0.3500 0.035000 8.2333 0.024216
BK #12 0.2000 0.020000 0.3000 0.030000 8.7000 0.023514
BK #13 0.2333 0.023333 0.3500 0.035000 10.2500 0.023295
BK #14 0.2500 0.025000 0.3333 0.033333 7.2167 0.024056
BK #15 0.2833 0.028333 0.3833 0.038333 10.0333 0.023889
BK #16 0.2833 0.028333 0.4000 0.040000 8.1333 0.023922
BK #17 0.2667 0.026667 0.3667 0.036667 5.8167 0.023267
BK #18 0.2667 0.026667 0.3833 0.038333 9.5833 0.022817
BK #19 0.2167 0.021667 0.2833 0.028333 7.6500 0.023906
BK #20 0.2000 0.020000 0.3000 0.030000 9.4167 0.024145
BK #21 0.3000 0.030000 0.4167 0.041667 10.7167 0.023297
BK #22 0.2833 0.028333 0.4000 0.040000 7.5500 0.023594
BK #23 0.1833 0.018333 0.2833 0.028333 8.0667 0.023725
BK #24 0.2667 0.026667 0.3833 0.038333 9.1500 0.023462

Machine & development software spec:

Tests were run on my PowerMacintosh 6100 using a PowerPC 601 clocked at 80 mhz.
This is a 32 bit implementation of the PowerPC architecture. The system has 256
kb 2nd level cache and 16 mb of ram. 1st level cache bandwidth is 110 mb/sec.

The program is compiled with metrowerks ANSI-C (CodeWarrior 6). About 516 bytes
are handcoded in PowerPC assembly language (less than 1 % of the object code).

The execution time for DoMove() suggests that the maximum speed of the current
code is around 43.5 knps. During tree searches sustained performance is around
18 knps, peak performance of 25 knps was observed on this system.

I am interested in comparable data of other software only programs, especially
those that are using bitboards for everything like I do now or other programs
that run on a PowerPC based system.

Anybody faster? Bob is probably rlogin' to a T90 right now... (^8

Peter Gillgasch

unread,

Aug 17, 1995, 3:00:00 AM8/17/95

to

Hi Bruce, nice to see you here!

In article <40tmqk$f...@news2100.microsoft.com>, brucemo (Bruce Moreland) writes:
|> It is hard to know how to evaluate the speed of a given chess engine. A
|> sexy number is nodes traversed per second (nps). On a P5/66 I would
|> expect to see a hobbyist program doing 10K nps as an approximate high
|> end, although it is certainly possible to do at least double this without
|> sacrificing strength for speed. The problem with nodes/second as a
|> measure of program strength is that it doesn't measure program strength,
|> it measures nodes/second. Fritz is the fastest PC program, but it is not
|> on top of the Swedish rating list, Genius is. Genius is a lot slower,
|> but is significantly stronger, according to the Swedish list.

To put this into perspective: Morsch claimed to do 80 knps on a P90.
From somebody who is not Richard Lang but is closely associated with him
(hint: he wears a handy all the time (^8) I heard something about 20 knps
for Genius. Of course there is always the problem what they consider to
be a "node".

|> It is possible to generate huge numbers of nodes/second, if you aren't
|> particularly interested in whether they are the right nodes to generate.
|> If you order your moves poorly you can increase nodes/second
|> tremendously, but you won't search as many plies.

Correct, move ordering is the key problem for starters. I have now a move
generator that has the properties that are mentioned in the books as
needed for a really useful movegenerator:

(a) it generates moves one by one (well, most of the time)
(b) the moves are ordered by merit
(c) nearly no illegal moves are generated
(d) a move can be tested quickly wether it "fits" the current board position

And it is fast. Some of these properties are really hard to achive when
you don't have bitboards. For beginning chess programmers (who are probably
not using bitboards to start with) I'd suggest that they should ignore
(a) and (b) - but sort after capture generation! (c) and (d) are both
important and relatively trivial to do.

|> Another measure of program strength is search depth reached in a given
|> time. This is more a measure of how well the parts of the program work
|> together. When you are writing a program, increased depth reached is
|> certainly a good indication that your program is improving.

That is indeed the best measure of overall efficiency. But as you noted
"depth reached" is becoming more and more fuzzy.

|> But once
|> again it is misleading to compare distinct programs on this basis,
|> because the stronger programs use forward pruning to increase depth, and
|> selective extensions that increase depth in localized areas of the tree,
|> at the cost of a little depth throughout the rest of the tree. Null-move
|> forward pruning is a well-known technique that can be used to increase
|> depth significantly, at cost of a little accuracy (sometimes a lot).
|> There are other forward pruning techniques that the commercial programs
|> use that we can only speculate about, since they haven't been published.

At the WCCC you could clearly see this: some teams were *really* open minded
and talked about close to anything. The guys from *Socrates are probably
the best example. The commercial teams tell you close to nothing of course
and some "scientific" teams and hobbiests are even worse... I made the
interesting observation that the deeper the search, the more talkative
are the programmers. Well, maybe they search so deep because they are
talkative (^8

|> So you can't know how fast you go until you've finished your program, and
|> once you know how fast you go, you might not really know how fast you go,
|> and in any case what matters is not how fast you go, or how fast you
|> think you go, but whether you win or lose.

Welcome to the real world. And then there is the opening book - I really
hate it.

|> Now that I've said all of that, it sounds like you're going pretty slow.
|> You give a figure of 0.05 milliseconds per move generated, which is
|> 20,000 moves generated per second. If this was a whole chess program,
|> 20,000 nps would be great, but it's not a whole program, you still have
|> to write the rest of it, and in the case of my program Ferret, moves
|> generated per second is about 8x the value for nodes/second, so as a
|> wild guess you're getting like 3K nps.

It is really hard to estimate, but in the old DarkThought code the
nodes per sec rate was about 120 times slower than the peak of the
moves per sec rate. I was really looking __hard__ for something to eliminate
to cut this performance loss down by, say, a factor of 10. But I couldn't
find something. So, no 250,000 - 600,000 nps on a workstation class
machine...

The move generator design was simply great, but unfortunately
it had little to do with the needs of a real chess program - it did
recompute everything from scratch, so having a good mobility evaluation
needed a modified move generator loop (which was to slow for me to
be acceptable) or some hack that recomputed mobility as needed.

Now everything is done in the incremental style, at the cost of moves
per second, but everything fits together to meet the needs of a
chess engine. Sometime in the future I will think about a combination
of the two designs, but as long as I don't have an idea how to avoid major
kludges I won't start this.

|> So my guess is you need to speed
|> up, but I'd urge you to just write the rest of the program, get it
|> playing on a regular basis (fics or ICC) then refine it as necessary.
|>
|> Also, I don't understand the 0.02 millisecond number. What exactly is
|> this?

I think he generated captures en bloc (completely unordered) and needs
about 0.02 ms to order them.

|> The opinions expressed in this message are my own personal views
|> and do not reflect the official views of Microsoft Corporation.

Hm. I am glad that I don't need any disclaimers yet (^8

Robert Hyatt

unread,

Aug 17, 1995, 3:00:00 AM8/17/95

to

In article <40rk4c$3...@nz12.rz.uni-karlsruhe.de>,

Peter Gillgasch <gil...@ira.uka.de> wrote:
>
>Now, Bob suggests a new speed game...
>
>|> A better test is the following:
>|>
>|> 1. pick a position.
>
>Will post results on BK tomorrow...
>
>|> 2. generate all legal moves from this position N times, where N is
>|> large enough to get an accurate time (10K for example). divide
>|> 10K*number of moves generated by the time used to get a number of
>|> moves per second generated.
>
>Ordered or unordered ? I don't have a routine that spits out unordered
>moves, although I could write a capture generation routine in less then
>10 lines... Hm, that would be nice to get an idea of the mvv/lva overhead.
>This would be useful for our latest discussion and would help myself
>to justify why I entered this race (^8
>

>-- Peter
>

I think *all* these perf measures are "for the fun of it" as measuring
any particular component of a chess program is interesting, but useless.

For the bitboard discussion, make_move() is complex (I'm currently
working on eliminating the attack bitboards altogether and compute
them on the fly using the rotated stuff) but comparing it to another
approach doesn't work, since part of the overhead in make_move makes
evaluations faster, capture order better, etc.

I don't even really care about nodes per second, because that can be
affected by poor move ordering, the worse the order, the faster the
nps.

We really can't even compare search times for known problems very
easily, since different extensions produce different times and
numbers of nodes. In short, it's *really* hard to compare two
programs like this. Sort of like comparing two automobiles by
watching the tachometer. Interesting, but not useful. Different
transmissions, numbers of gears, final drive rations, wheel sizes,
....

Peter Gillgasch

unread,

Aug 17, 1995, 3:00:00 AM8/17/95

to

Even more data... This round of the move generation experiment deals with
the overhead introduced by generating capture moves "one by one" instead
of generating captures "en bloc" and sort 'em later to get them into the
correct mvv/lva sequence (Bob, I am putting the king first since it is the
least valuable piece. It cannot be captured - kind of using a static
exchange evaluator for king captures. Seems reasonable to me). The "sort"
algorithm is a simple maximum selection sort (scanning the list in a couple
of passes) since the input is so small.

Same test set as before (Bratko-Kopec), same machine (PPC 601-80), but this
time with 100,000 iterations since everything is so quick. Note that my
last numbers were not very exact to this (except of the DoMove() ms data)
- I apologize for that. I have added yet another column stating the kilo
captures per second rate.

Position One by one capture generation All and sort capture generation

total (sec) each (ms) kcps total (sec) each (ms) kcps

BK #01 1.2333 0.012333 0.000 0.7000 0.007000 0.000
BK #02 1.7167 0.017167 116.505 0.9333 0.009333 214.286
BK #03 2.6000 0.026000 76.923 1.2500 0.012500 160.000
BK #04 1.9333 0.019333 0.000 1.1500 0.011500 0.000
BK #05 3.3000 0.033000 151.515 1.3833 0.013833 361.446
BK #06 1.2667 0.012667 157.895 0.7833 0.007833 255.319
BK #07 2.7000 0.027000 111.111 1.2667 0.012667 236.842
BK #08 1.0333 0.010333 0.000 0.6833 0.006833 0.000
BK #09 2.7500 0.027500 109.091 1.3333 0.013333 225.000
BK #10 2.4833 0.024833 80.537 1.0000 0.010000 200.000
BK #11 2.2500 0.022500 44.444 1.2000 0.012000 83.333
BK #12 1.9833 0.019833 50.420 0.8667 0.008667 115.385
BK #13 2.3333 0.023333 85.714 1.0667 0.010667 187.500
BK #14 2.2833 0.022833 87.591 1.2167 0.012167 164.384
BK #15 2.6333 0.026333 113.924 1.1333 0.011333 264.706
BK #16 2.6167 0.026167 114.650 1.2667 0.012667 236.842
BK #17 2.6500 0.026500 150.943 1.3833 0.013833 289.157
BK #18 2.5333 0.025333 118.421 1.2500 0.012500 240.000
BK #19 1.9333 0.019333 103.448 0.9500 0.009500 210.526
BK #20 1.9167 0.019167 104.348 1.1333 0.011333 176.471
BK #21 2.9167 0.029167 171.429 1.3833 0.013833 361.446
BK #22 2.8167 0.028167 106.509 1.2500 0.012500 240.000
BK #23 1.8000 0.018000 0.000 0.9333 0.009333 0.000
BK #24 2.5833 0.025833 154.839 1.4833 0.014833 269.663

Some observations: In this synthetical test "all and sort" is about twice
as fast as "one by one". Both routines generate illegal moves very rarely,
so they can be both tested in real tree searches.

I made some preliminary tests with "all and sort" searches in tactical positions.
Much to my surprise the results are unclear. Sometimes the "all and sort"
code is slightly faster, sometimes the "one by one" code. The only reason
why "one by one" can hold itself against "all and sort" is that most of
the time not more than one capture is needed and the capture scan is
immediately stopped as soon as we reach a futile victim class.

This benchmark suggests that both algorithms have something to offer. If
I find the time I'll think about what to learn from this...

Peter Gillgasch

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

In article <4111nf$i...@pelham.cis.uab.edu>, hy...@willis.cis.uab.edu (Robert Hyatt) writes:
|> In article <40rk4c$3...@nz12.rz.uni-karlsruhe.de>,
|> Peter Gillgasch <gil...@ira.uka.de> wrote:
|> >
|> >Now, Bob suggests a new speed game...
|> >
|> >|> A better test is the following:
|> >|>
|> >|> 1. pick a position.
|> >
|> >Will post results on BK tomorrow...

Done that.

|> I think *all* these perf measures are "for the fun of it" as measuring
|> any particular component of a chess program is interesting, but useless.

Hm. I wouldn't say "useless". At least it gives both the author of the
program as well as other programmers an idea what is achievable with
a certain method. Hence everybody can jump in and say "oops, look at these
numbers, you got a performance problem here". For example I was really
surprised to see that ordered capture generation *alone* accounts for
2/3 of the time that is needed to generate a complete move list - an
observation that triggered further investigations.

|> For the bitboard discussion, make_move() is complex (I'm currently
|> working on eliminating the attack bitboards altogether and compute
|> them on the fly using the rotated stuff) but comparing it to another
|> approach doesn't work, since part of the overhead in make_move makes
|> evaluations faster, capture order better, etc.

Correct, as I stated I am interested in comparing with other programs
that do either use the same approach or use similar hardware.

The idea to remove the attack bitboards is interesting but not new - I
know at least one programmer who has done that and has seen a speed
of >100,000 nps on a P90, but his program was not really useful at
that time (bad ordering for non captures, close to nil evaluation). It is
hard to say how much of that speed will be lost when doing things "right",
but an educated guess is that he will loose at least 2/3 of the speed. If
this guess is correct then removing the attack bitboards is probably not
very helpful.

After looking at the profiling results of my code on a variety of machines
I can say that it is not really DoMove() that is expensive, it is the copy
operation to generate a new state that can be modified to reflect the changes
of the board position.

Hence it is useful to reduce the size of the state that needs copying.
I have tried to reduce the size of the board database without touching
the attack/attacked bitboards, which produced a substantial speed increase
immediately. But the biggest chunk of the game state are the 128 attack
bitboards (1 kb). Since the cache bandwidth of the machine I am developing
on is about 110 mb/sec this amount of data copying introduces an absolute
speed limit of about 87 knps on that machine.

I am interested in your results if you manage to remove the attack bitboards.
As I said I am thinking about this as well but I am afraid that this will

(a) require major kludges (and hence will need a lot of work to do
by the programmer and probably a lot of debugging).

(b) will not be faster in "usual" over the board positions, although
it will be a lot of faster on tactical benchmarks and in endgames.

|> I don't even really care about nodes per second, because that can be
|> affected by poor move ordering, the worse the order, the faster the
|> nps.

Well, on "easy" tactical benchmarks I see very little difference in nps
when I remove *all* move ordering after killers.

|> We really can't even compare search times for known problems very
|> easily, since different extensions produce different times and
|> numbers of nodes.

I suggest to turn all extensions off (except check extensions) and to
turn off null moves. Then I think that we get a better idea of the
search speed. I even suggest to turn off the slow evaluation.

I admit that this "sounds" foolish but the name of the thread is
"Speed of move generator" and it would be pretty interesting to see
what speeds are achievable in software. Why? Most chess programs
written by programmers with little "commercial interests" use table
driven move generators similar to GNU chess. There is one commercial
program that is more or less "officialy" rumoured to deliver a speed
of 80,000 nps *and* is playing at least "well" - ask the DBT and the
*Socrates teams. (^8

Such a speed is not achievable with table driven move generators. So
how does it work? We could find out by disassembling, but finding it
out by "rediscovery" is much more interesting... I asked the programmer
of this program about that and I *think* that I have some ideas how
he does it - he looked at me and said "I really can't remember how
I did that." Now, if he had only said "I won't tell you"...

|> In short, it's *really* hard to compare two
|> programs like this. Sort of like comparing two automobiles by
|> watching the tachometer. Interesting, but not useful. Different
|> transmissions, numbers of gears, final drive rations, wheel sizes,
|> ....

Ok, nice car, so what does your tachometer say... (^8

MANOHARARAJAH VALAVAN

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

In article <4111nf$i...@pelham.cis.uab.edu>,

Robert Hyatt <hy...@willis.cis.uab.edu> wrote:
>In article <40rk4c$3...@nz12.rz.uni-karlsruhe.de>,
>Peter Gillgasch <gil...@ira.uka.de> wrote:
>>
>>Now, Bob suggests a new speed game...
>>
>>|> A better test is the following:
>>|>
>>|> 1. pick a position.
>>

>>|> 2. generate all legal moves from this position N times, where N is
>>|> large enough to get an accurate time (10K for example). divide
>>|> 10K*number of moves generated by the time used to get a number of
>>|> moves per second generated.

Ok....now I figured out why my move generator was so slow. I was doing wrong
computation. I quoted a figure of about 1.5 seconds per 10000
MoveGeneration()s.

That is the whole routine that does the move generation. If I look
at the number of moves then, I get 1.5 seconds per 280000 move generations.

This gives 187000mvs/sec...... Ok...that sounds more reasonable on my
machine. Sheez....I was banging my head against the wall trying to find some
part of the move generator that was slow!

--
-------------------------------------------------------------------------------

MANOHARARAJAH VALAVAN

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

I was examining the code for crafty the other day.....Can I suggest a small
speedup, Bob?

How about using a precomputed table for generating the list of squares....
rather than calling FirstOne() / LastOne() gazillions of times. The table
would be quite big though: square *squarelist[65536]

This table would give you square lists for the entire set of 16 bit numbers.
To get a square list for the 64 bit number, read the square list four times.

If we are talking an average of about 8 squares that need to be stored per
squarelist[] entry, that gives us about 65536*8=524288 bytes.

Also the pointers would take 4*65536=256k.

This gave me a speed up of about 40% in my move generator as opposed to calling
FirstOne() several times. The squarelist[] method cuts down on function
calling overhead.

Robert Hyatt

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

In article <DDICt...@ecf.toronto.edu>,

Problem is, Generate_Moves() is an insignificant part of the total
time right now. eval is first, make_move is second, nothing else is
above 5%.

One *could* take the rotated bitboards, and simply "look up" the
set of legal moves for each diagonal. Think about this: for a
diagonal or rank or file (depending on the kind of piece) you can
take the appropriate rotated bitboard, to get the bits of that
rank/file/diag into adjacent bits, and then use this as an index
into a pre-computed set of moves, including the captures at the end
of the diagonals (you would have to fill in the captured piece,
of course).

BTW, I compile crafty as one large source file for ICC, so there
is no function calling overhead for first_one, etc, as they get
inlined. In any case, with movgen taking practically no time, it's
not going to win much at present, until I get eval under control.
It's got too much knowledge, with poorly scaled values, and with
some components that are in exact opposition to others. I'm fixing
to skinny it down and try some long ICC tests with different pieces
of knowledge, one at a time, to see what helps. It's currently
doing mobility, space, pawn structure, piece placement, king
safety, passed pawn analysis, weak square attacks, and who knows
what else. Some of these neutralize others. The search itself has
progressed to a point where it seems to be doing o.k. Now it's
a matter of getting the eval tuned. It shows a lot of promise at
times, and then plays a really horrible endgame to offset that.

I'm also going to try and construct a test mechanism to evaluate
the difference between doing all captures and non-losing captures
in quiescence. As usual, Hsu has me thinking. It still looke right
to me, but just because I've done it for 15 years doesn't make it
right. I need some concrete way to evaluate this, however.

On a different note, I've tried (and still am) playing around with
the idea of eliminating the attack bitboards completely, and then
computing them "on the fly" as needed. So far, the speed is about
the same as incrementally updating the bitboards, and it's not clear
yet that I can optimize much more. It would be nice to get rid of
them, as they cause a lot of memory traffic as they are copied from
ply to ply (of course, this traffic is *much* less than what would
be needed to "unmake" a move and maintain only one copy of the
attack maps, which we did in Cray Blitz). The rotated bitboards make
this easy and fast, but the penalty is that every time you need one
of the attack maps, it turns into a function call, rather than a 64bit
load from memory. The drawback then comes each time you add a *new*
reference to the attack maps, because the old program would have *no*
overhead to get it, but the new one has to compute them again. I
haven't yet decided whether the new approach is worthwhile or not,
but the code is *much* simpler (Make_Move()) now, well under 1,000
lines with lots of comments, as opposed to nearly 3,000 before.

If you see any optimization ideas, by all means let me know. I'm
now after speed.

Robert Hyatt

unread,

Aug 19, 1995, 3:00:00 AM8/19/95

to

In article <411kc6$7...@nz12.rz.uni-karlsruhe.de>,
Peter Gillgasch <gil...@ira.uka.de> wrote:

<snip>

>
>Hm. I wouldn't say "useless". At least it gives both the author of the
>program as well as other programmers an idea what is achievable with
>a certain method. Hence everybody can jump in and say "oops, look at these
>numbers, you got a performance problem here". For example I was really
>surprised to see that ordered capture generation *alone* accounts for
>2/3 of the time that is needed to generate a complete move list - an
>observation that triggered further investigations.

<snip>

>
>Correct, as I stated I am interested in comparing with other programs
>that do either use the same approach or use similar hardware.
>
>The idea to remove the attack bitboards is interesting but not new - I
>know at least one programmer who has done that and has seen a speed
>of >100,000 nps on a P90, but his program was not really useful at
>that time (bad ordering for non captures, close to nil evaluation). It is
>hard to say how much of that speed will be lost when doing things "right",
>but an educated guess is that he will loose at least 2/3 of the speed. If
>this guess is correct then removing the attack bitboards is probably not
>very helpful.

Note that I'm removing the attack bitmaps, but *not* the capabilities
of them, since I can calculate them (now) faster than the incremental
make_move() code was doing. I don't think we are talking about the same
thing with your above comment, as if he had the attack bitmaps, why in
the world wasn't he using them? They are *integeral* in Crafty, so I
have to have something similar. However, the rotated bitboards make
producing these bitmaps awfully fast. So fast, in fact, that I have
passed the break-even point and the new version is faster. It might
end up significantly faster.

>
>After looking at the profiling results of my code on a variety of machines
>I can say that it is not really DoMove() that is expensive, it is the copy
>operation to generate a new state that can be modified to reflect the changes
>of the board position.
>
>Hence it is useful to reduce the size of the state that needs copying.
>I have tried to reduce the size of the board database without touching
>the attack/attacked bitboards, which produced a substantial speed increase
>immediately. But the biggest chunk of the game state are the 128 attack
>bitboards (1 kb). Since the cache bandwidth of the machine I am developing
>on is about 110 mb/sec this amount of data copying introduces an absolute
>speed limit of about 87 knps on that machine.

The memory bandwidth on *my* machine is about 8 gigawords / second,
so I don't have *that* problem. Yes, that is 64 gb per *second*.
However, I too, was interested in eliminating this big copy operation
for the pc/sparc version of crafty. It was costing me about 10% of
total search time to copy this data.

>
>I am interested in your results if you manage to remove the attack bitboards.
>As I said I am thinking about this as well but I am afraid that this will
>
>(a) require major kludges (and hence will need a lot of work to do
> by the programmer and probably a lot of debugging).
>
>(b) will not be faster in "usual" over the board positions, although
> it will be a lot of faster on tactical benchmarks and in endgames.

Currently, it's now a little faster in *all* positions, and has one new
characteristic: Crafty's NPS was nearly constant for the entire game,
with the old version which (on ICC is still running) maintained the
full set of attack bitmaps. The new version actually speeds up as the
game material "thins out." However, in *no* position yet run is the
new code slower, although it has taken a lot of "tricks" to get it up
to where it is. Nice thing is, the new code is shorter and make_move
is easier to read, and is now ready for manual inlining again, since
it's so short.

Yes, *lots* of debugging. However, the code actually simplifies so
your second concern wasn't a huge problem.

<snip>

>
>I suggest to turn all extensions off (except check extensions) and to
>turn off null moves. Then I think that we get a better idea of the
>search speed. I even suggest to turn off the slow evaluation.

It's not quite so simple. Some are doing "futility" cutoffs, some
do recapture extensions differently than others, and so forth.
Another point is that turning off some of these things might penalize
a program that does some "advance work" to save time later, only to
find out that with the extension turned off, the advance work was
wasted. Result: my favorite expression -> "catch-22".

>
>I admit that this "sounds" foolish but the name of the thread is
>"Speed of move generator" and it would be pretty interesting to see
>what speeds are achievable in software. Why? Most chess programs
>written by programmers with little "commercial interests" use table
>driven move generators similar to GNU chess. There is one commercial
>program that is more or less "officialy" rumoured to deliver a speed
>of 80,000 nps *and* is playing at least "well" - ask the DBT and the
>*Socrates teams. (^8

By the way, I spit in your milk. *I* don't use a "gnu-like" anything
in Crafty. :^) (ICC players will know what I mean...)

>
>Such a speed is not achievable with table driven move generators. So
>how does it work? We could find out by disassembling, but finding it
>out by "rediscovery" is much more interesting... I asked the programmer
>of this program about that and I *think* that I have some ideas how
>he does it - he looked at me and said "I really can't remember how
>I did that." Now, if he had only said "I won't tell you"...

I don't agree with your speed assessment. Ferret (Bruce Moreland) is
searching 50-100K nodes per sec on a pentium-100, using a cute table
driven generator. I think my new "approach" can beat this, but the
move generator has never been a major bottleneck in my programs, all
the way back to early Cray Blitz versions. Worst case ever was maybe
25% after rewriting a lot of other code (SEE ripoff(), etc.) in
Cray Assembly to make things run faster.

After I get the current crafty functional, I have an idea of how to
write a "blazingly fast" move generator, assuming you are willing to
"bite off" and import the rotated bit boards from Crafty. It will
end up being nothing more than memory copies of pre-computed move
lists. look up the set of legal moves for a specific diagonal by
taking the adjacent bits for that diagonal as an index. No checking
for ends of ranks/files, etc... just copying moves to the search
move list. DON'T start a discussion on this just yet. I want to
get the current code tweaked a little first, before I get distracted
on yet another interesting issue. If anyone wants to think about
it, feel free, just postpone the discussion until I can work my way
to that point.

By the way, current crafty is evolving in interesting ways. Swap()
is now a pretty large fraction (beyond 10% now and maybe higher)
since the cost of computing the bitmaps is now assessed directly to
the "user" of those bitmaps. I'm probably going to try (again) to
play with mvv/lva ordering of captures to eliminate the calls to
swap, knowing that I'm going to lose the ability to forward prune
the losing captures. My "tool" for the next week or two is going
to be gprof, as I'm learning a lot.

>
>|> In short, it's *really* hard to compare two
>|> programs like this. Sort of like comparing two automobiles by
>|> watching the tachometer. Interesting, but not useful. Different
>|> transmissions, numbers of gears, final drive rations, wheel sizes,
>|> ....
>
>Ok, nice car, so what does your tachometer say... (^8

Don't know, but my boat is turning a 28" pitch chopper prop at a
powerhead RPM of 5900, for a top speed of 80+ MPH. Howzzat?

MANOHARARAJAH VALAVAN

unread,

Aug 19, 1995, 3:00:00 AM8/19/95

to

In article <412lcv$i...@pelham.cis.uab.edu>,

Robert Hyatt <hy...@willis.cis.uab.edu> wrote:
>
>Problem is, Generate_Moves() is an insignificant part of the total
>time right now. eval is first, make_move is second, nothing else is
>above 5%.

Oops... I didn't think about that one.... I hadn't finished most of the
code for my porgram, so I was going on just Move Generations. I guess the
movegen code is pretty insignificant compared to the other code.

>
>One *could* take the rotated bitboards, and simply "look up" the
>set of legal moves for each diagonal. Think about this: for a
>diagonal or rank or file (depending on the kind of piece) you can
>take the appropriate rotated bitboard, to get the bits of that
>rank/file/diag into adjacent bits, and then use this as an index
>into a pre-computed set of moves, including the captures at the end
>of the diagonals (you would have to fill in the captured piece,
>of course).

Could you elaborate a little bit more on the rotated bit board technique?
I don't see why you need to use the rotated bit boards if you are going to
probe into a pre-computed table anyways...

Robert Hyatt

unread,

Aug 19, 1995, 3:00:00 AM8/19/95

to

In article <DDJID...@ecf.toronto.edu>,

MANOHARARAJAH VALAVAN <man...@ecf.toronto.edu> wrote:
>
>Could you elaborate a little bit more on the rotated bit board technique?
>I don't see why you need to use the rotated bit boards if you are going to
>probe into a pre-computed table anyways...
>

Having them is the only way to probe into this table. What I have
is three bitboards, rotated left 45 degrees, right 45 degrees and
right 90 degrees (plus the normal non-rotated bitboard, which I
call rotated right zero degrees for consistency.)

These bitboards have a 1 for every occupied square and a 0 for
each empty square. Lets take just one for clarity: rotated left
45 degrees. This bitboard (rl45) has the diagonals a1, a2-b1, a3-b2-c1,
a4-b3-c2-d1, etc. in adjacent bits. This lets me get the "state"
of one of these diagonals, and then look up the resulting attacks
on that diagonal by probing into a pre-computed table. Without
these diagonals, kept in this fashion, I'll have to scan the
diagonal in the usual iterative methodology to figure out which
squares are empty and which are occupied by enemy pieces, so I can
enumerate the move list. It's actually pretty cute.

I've just barely broken past the "break even" point of computing the
attack bit vectors "on the fly" but am beginning to see some real
opportunity for speed improvements now. I'll make the source available
as soon as I get it thoroughly tested.

Robert Hyatt

unread,

Aug 20, 1995, 3:00:00 AM8/20/95

to

In article <418qhk$p...@hpscit.sc.hp.com>,
David Wong <da...@hpsgcsgb.sgp.hp.com> wrote:

<snip, thanks to a picky pnews>

>
>Could you elaborate a little further on how to avoid generating illegal
>moves in the move generator? I have just started in chess programming and
>so far have written a simple program which has a move generator, an
>alpha-beta search with windows, and an SEE evaluation function. In order to
>check whether a generated move is legal, my MoveGen() will have to call
>MakeMove(), follow by IsKingInCheck(), then UnMakeMove(). (Other illegal
>moves like moving pieces off board or onto own pieces are avoided.)
>
>My move generator currently generates move 'one by one' (generate one move,
>search, generate another, search, and so on) using move vectors and by
>mvv/lva ordering. The evaluation function calls Attacks() which generates
>a list of attackers/defenders (along with the mobility and x-ray attacks of
>each piece) for each squares and performs a static exchange at the terminal
>nodes. The search function has only the killer heuristics but does not have
>any extensions nor transposition tables.
>
>I do not have time control yet, but for a fix-depth of 4-ply, the program
>takes about 30-100 seconds (depending on stage) to complete the search. I've
>heard of many methods on speeding up the search like using trans/ref table,
>null moves, history ordering, etc. Could you make a recommendation in
>their order of importance/effectiveness so that I know which to try next?
>

First, generating "legal moves" is not particularly important. It
has a computational cost that is wasted most of the time, since most
moves are legal, with one important exception. If you are in check,
then most moves are illegal, and culling them is important, and is
also fairly easy. You can capture the checking piece if there is
only one, capture either one with the king (only) if there are
two (as a double-discovered check), move the king to a non-attacked
square, and interpose if the checking piece is a sliding piece.

In the world of bitboards, the above are easy, with the exception
of interposing. It takes a little more work to pick out legal
interposes, since some interposing candidates might be pinned
on the king (absolute pin) making their interposition illegal.

If you spend time checking for legality when not in check, the
work is wasted. It's faster to make illegal moves and reject
them at the next ply, either by the in-check test (again, in
bitboard programs this test costs nothing) or by generating
moves, discovering that you can capture the opponent's king, and
rejecting that way. I do the former now, but did the latter in
Cray Blitz.

The general rule of thumb in computer chess is to "procrastinate" --
never do *now* what you can put off until *later*. Because, with
alpha/beta, *later* might never come.

First, get your capture moves right. Then, any reasonable ordering
strategy like killers, history, trans/ref moves, etc. will reduce
the size of the tree. 4 plies in a minute indicates either your
alpha/beta is not working correctly, or else move ordering is way
out of line. Killers are a simple to add ordering mechanism that
gives good results.

David Wong

unread,

Aug 21, 1995, 3:00:00 AM8/21/95

to

In article <40ubev$7...@nz12.rz.uni-karlsruhe.de>, gil...@ira.uka.de says...
[snip]

>
>Correct, move ordering is the key problem for starters. I have now a move
>generator that has the properties that are mentioned in the books as
>needed for a really useful movegenerator:
>
>(a) it generates moves one by one (well, most of the time)
>(b) the moves are ordered by merit
>(c) nearly no illegal moves are generated
>(d) a move can be tested quickly wether it "fits" the current board position
>
>And it is fast. Some of these properties are really hard to achive when
>you don't have bitboards. For beginning chess programmers (who are probably
>not using bitboards to start with) I'd suggest that they should ignore
>(a) and (b) - but sort after capture generation! (c) and (d) are both
>important and relatively trivial to do.
>

[snip]

Could you elaborate a little further on how to avoid generating illegal
moves in the move generator? I have just started in chess programming and
so far have written a simple program which has a move generator, an
alpha-beta search with windows, and an SEE evaluation function. In order to
check whether a generated move is legal, my MoveGen() will have to call
MakeMove(), follow by IsKingInCheck(), then UnMakeMove(). (Other illegal
moves like moving pieces off board or onto own pieces are avoided.)

My move generator currently generates move 'one by one' (generate one move,
search, generate another, search, and so on) using move vectors and by
mvv/lva ordering. The evaluation function calls Attacks() which generates
a list of attackers/defenders (along with the mobility and x-ray attacks of
each piece) for each squares and performs a static exchange at the terminal
nodes. The search function has only the killer heuristics but does not have
any extensions nor transposition tables.

I do not have time control yet, but for a fix-depth of 4-ply, the program
takes about 30-100 seconds (depending on stage) to complete the search. I've
heard of many methods on speeding up the search like using trans/ref table,
null moves, history ordering, etc. Could you make a recommendation in
their order of importance/effectiveness so that I know which to try next?

Any help is appreciated.

David Wong
da...@hpsgcsgb.sgp.hp.com

Joe Stella

unread,

Aug 21, 1995, 3:00:00 AM8/21/95

to

In article <418qhk$p...@hpscit.sc.hp.com>
da...@hpsgcsgb.sgp.hp.com (David Wong) writes:

>[...]

>The evaluation function calls Attacks() which generates
>a list of attackers/defenders (along with the mobility and x-ray attacks of
>each piece) for each squares and performs a static exchange at the terminal
>nodes.

>[...]

This is going to cause lots of evaluation errors. I think you are better
off searching all captures in a quiescense search. It takes more time,
but the increase in accuracy is worth it.

>I do not have time control yet, but for a fix-depth of 4-ply, the program
>takes about 30-100 seconds (depending on stage) to complete the search.

On what system? 30 seconds on a 286 is not the same as 30 seconds on
a Pentium-133!

>I've
>heard of many methods on speeding up the search like using trans/ref table,
>null moves, history ordering, etc. Could you make a recommendation in
>their order of importance/effectiveness so that I know which to try next?

The history heuristic is fairly easy to program and results in a noticable
improvement in search speed.

After this, I would suggest adding the null move search. This can yield
an extra ply of search depth in the midgame. Be careful applying this
to the endgame, because it has a certain "blindness" with regard to
zugzwang.

I think you can add the transposition tables at any time; just be
sure to put in a switch so you can easily shut them off. You will
need to do this occasionally when debugging.

Joe S.

Bruce Moreland

unread,

Aug 26, 1995, 3:00:00 AM8/26/95

to

Nodes/second is widely variable according to position. I just did an auto-play game at 5sec/move and I got 25-40K nps in a
full-board middlegame, 45-55K when the queens came off (it was a RNN vs RBB endgame at that point), and as many as 70K nps
in the N vs B ending that ensued. In another case I saw 60K nps in a QBB vs QBN middlegame. It's usually somewhere in ply
7 at the end of 5 seconds (I use recursive null move with R=2).

I ran Bratko-Kopec #2 for a minute (RRN vs RRN) and got an average of 50.2K (finished 10 plies in 45 seconds), and
Bratko-Kopec #3 for a minute and got an average of 31.9K (finished 9 plies in 46 seconds).

The most I have ever seen is 99,998 nps, so I can't quite claim 100K :). 25K is probably a practical lower bound, but in
any given middlegame position I'm probably doing 30-45K.

I don't claim any of this is "fast". If you're going faster on equivalent hardware (P5/100 with Windows '95) I'd like to
hear how though. I'm sure it's possible to go faster, but I don't know of anyone who is going faster than I am who will
explain how they do it.

Ferret has a fairly minimal eval function (a little 2nd order stuff but not a lot), static exchange evaluation move ordering
stuff approximately like Hyatt uses (losers are pruned), recapture extension, a single-response-to-check extension, and
quite a bit of 2nd order king safety stuff. I try not to sacrifice depth for nodes/second, and I try not to sacrifice
strength for depth.

For the record, on the "Win at Chess" suite Ferret scores 255 in 5 seconds, 268 in 10 seconds, 282 in 30 seconds, 289 in 60
seconds, and 296 in five minutes. It misses #31, #100, #141, and #230. The single response to check heuristic makes a lot
of difference in WAC. Obviously WAC is not the world's greatest test, but it's a start.

A few months ago I did roughly the same SEE vs MVV/LVA test as Hyatt did, and got roughly the same result: reduced depth in
a given time. Can anyone present a practical case where MVV/LVA finds something that SEE overlooks? And if so, are these
cases common?

I wrote my move generator after reading the little "how the gnu move generator works" article that comes with gnu (to make
one thing clear, my program is not based on gnu, I built it from scratch, I used gnu as one of many references, it contains
no gnu source code). I have made some enhancements to this move generation technique, however. I use pointers rather than
indexes, and I store several precomputed fields in each move table element. Each move table element is 20 bytes, so my move
table is something like 100K bytes, which is big.

I have tweaked my iterative in-check and static exchange evaluator functions until I think they're fast enough. I use some
bitboards that I generate at compile time to help rule out some obvious non-attacks, for instance a bishop on a1 can NEVER
attack a king on e8, I have an easily accessed bitboard that records this relationship, so most of the time I can avoid
traversing the vector between a sliding attacker and the defender's square. Also, I don't use the move table when I have to
traverse a vector between an attacker and a defender, all I use the move table for is generating moves.

If I were to do it over again, I'd try to avoid the move table approach. I've milked it pretty well, but I think other
approaches may be faster. I'm speaking of an "offset" approach rather than a bitboard approach. I haven't seen any
proof that bitboards can go as fast as other approaches, although the extra information they generate as by-products may
make the time spent worth it.

bruce

In article <415e9p$k...@pelham.cis.uab.edu>, hy...@willis.cis.uab.edu says...
>[snip]

>
>I don't agree with your speed assessment. Ferret (Bruce Moreland) is
>searching 50-100K nodes per sec on a pentium-100, using a cute table
>driven generator. I think my new "approach" can beat this, but the
>move generator has never been a major bottleneck in my programs, all
>the way back to early Cray Blitz versions. Worst case ever was maybe
>25% after rewriting a lot of other code (SEE ripoff(), etc.) in
>Cray Assembly to make things run faster.
>

>[snip]

>--
>Robert Hyatt Computer and Information Sciences
>hy...@cis.uab.edu University of Alabama at Birmingham
>(205) 934-2213 115A Campbell Hall, UAB Station
>(205) 934-5473 FAX Birmingham, AL 35294-1170

--

John Stanback

unread,

Aug 28, 1995, 3:00:00 AM8/28/95

to

Here is some information on Zarkov running on a Pentium 100.
I may try Steven's enumeration test one of these days, but
I don't think that will change these numbers much.
Compared to other programs of about the same strength, Zarkov has
a slow search speed, but makes up for this with a pretty good
evaluator and search extensions. All statistics are taken from the
starting position.

Move generation only: 550,000 moves/second
(offset technique, includes generation and adding
scores based on mvv-lva, 3 killers, and simple piece-sq
"history" like table)

Move generate, make, unmake: 105,000 moves/second
(make-move tests for legality, updates hash indexes, but
does no incremental evaluation.)

Move generate, make, update attack table, unmake: 37,000 moves/second
(attack table update routine computes mobility for
sweep pieces and records all attacks from each
side to each square)

Actual search from starting position: 8,000 nodes/second
(position evaluation is done by a slow, but fairly
thourough evaluation of each piece. usually about 30%
of all nodes are evaluated. i'd like to get this down
to the 15-20% Hsu mentions is possible, but I haven't
found a way to do this and not miss some promotion and
mate threats.

For move ordering in quiescence I use mvv-lva.
Based on a few simple experiments I've run I don't
understand how SEE pruning could be good. But then again,
since Bob finds it very difficult to find positions where
SEE pruning fails and he gets a factor of 2 speedup....

By the way, Don Dailey told me that he's going all out for
speed with his new program -- it searches about 60,000 nps on a
60 Mhz Pentium! I bet it goes really fast on a Paragon.

John Stanback

Robert Hyatt

unread,

Aug 28, 1995, 3:00:00 AM8/28/95

to

In article <41sm07$2...@tadpole.fc.hp.com>, John Stanback <j...@fc.hp.com> wrote:
>Here is some information on Zarkov running on a Pentium 100.
>

>For move ordering in quiescence I use mvv-lva.
>Based on a few simple experiments I've run I don't
>understand how SEE pruning could be good. But then again,
>since Bob finds it very difficult to find positions where
>SEE pruning fails and he gets a factor of 2 speedup....

Remember, *all* the pruning is based on SEE to eliminate those
captures that look hopeless, like QxP with pawn defended, RxN with
N defended, etc. It's *very* accurate, if you only consider captures
on the target square (it understands bearing order, etc.) but it does
not consider pinned or overloaded pieces at all. I see savings for
two reasons: (1) more accurate move ordering reduced the size of the
tree by more than 10% in the kopec problem test, but over a larger
problem set, this appears to be closer to 20%; (2) reducing the
branching factor in the q-search due to not following lines with
losing (apparently) captures. To me, it's not "new" since we've
done it for many years in CB. However, others have apparently tried
it and had results similar to my own. A big part of the question is
obviously based on hardware, since SEE is improbable in a VLSI
machine (too slow) and MVV/LVA is natural there. However, it's
not the first time that hardware has to sacrifice *something*,
otherwise it would have an impossible advantage. Thank goodness
it only has an overwhelming advantage... :^)

>
>By the way, Don Dailey told me that he's going all out for
>speed with his new program -- it searches about 60,000 nps on a
>60 Mhz Pentium! I bet it goes really fast on a Paragon.
>
>

Sounds faster than the fastest, Fritz. Again, I know how to make 'em
fast, but the NPS figure isn't the measure that's important. your
overall search speed might be slower, but if your tree is smaller, or
your knowledge better, it's a more than even trade.
>John Stanback

Thomas Kerrigan

unread,

Aug 29, 1995, 3:00:00 AM8/29/95

to

Wow, all these big names. Can I post here so I can be cool too? :)

As you probably know, I'm the author of Stobor. I use the offset method
to generate moves. I save the offsets in my piece records, which improves
performance by 15% or so. I also use the 0x88 method for bounds checking.
This probably makes the approach just as fast as movetables, if not
faster, with the extra 64k RAM bonus. ;) Another "cool" thing I do is
inline takeback in my makemove. This significantly improves performance.
I use a check extension, recapture extension that's carefully confined,
and a one-response-to-check extension.

The result of this is that I'm almost exactly as fast as Ferret (the
speeds I get for BK 2 and 3 are *exactly* as fast), only my searches are
a few plies more shallow. I'm working on it. :)

Cheers,
Tom

------------------------------------------------------------------------------
"You can bring any calculator you like to the midterm, as long as it
doesn't dim the lights when you turn it on." -Hepler, Systems Design 182

Thomas C. Kerrigan
'kerr...@alpha.pr1.k12.co.us'

Robert Hyatt

unread,

Aug 29, 1995, 3:00:00 AM8/29/95

to

In article <41vifu$u...@alpha.pr1.k12.co.us>,

Thomas Kerrigan <kerr...@alpha.pr1.k12.co.us> wrote:
>Wow, all these big names. Can I post here so I can be cool too? :)
>

<snip>

>The result of this is that I'm almost exactly as fast as Ferret (the
>speeds I get for BK 2 and 3 are *exactly* as fast), only my searches are
>a few plies more shallow. I'm working on it. :)
>

I've said it before, but it bears saying one more time, you are looking
at the *wrong* numbers. If you are "a few plies more shallow" then you
are a *long* way from ferret's speed. Turn off alpha/beta and you can
"blow his nps doors off". Of course, 3-4 plies is all you are going to
search, but you *are* going faster. :^)