FPGA move generator

Ties Bos

unread,

Sep 6, 2000, 7:16:40 AM9/6/00

to

I've spent some time coding a move generator for chess on a bunch of FPGAs.
I hope some of the chess engines can benefit from its speed (couple of
million psuedo-legal moves per second).

I wonder if anyone can tell me about the performance of software move
generators. I've found it pretty hard to benchmark them (from the ones I
found source code for) separate from the rest of the engines.

Pointers to papers on the subject are welcome too!

Thanks,
Ties.

Dan Andersson

unread,

Sep 6, 2000, 12:10:37 PM9/6/00

to

Most programs use an incremental move generation approach and generate just
a one or a few moves per node. That makes it necessary to know how long it
takes for the FPGA to generate all moves for one position, to complicate
matters a program often need only an array lookup to find the correct move
as the position is already in the transposition table (thats really fast).
And the FPGA should order the moves in the order of decreasing material
gain. That could be easily accomplished by making two move generators, one
for caputues and one for non captures (and that should simplify the design).
The move generator could be useful when implementing some kinds of enhanced
a-b pruning techniques as ETC, where one looks in the transposition table
for the best child node and thus generate all moves in a position, but the
search would be memory bound and therefore it would be counterproductive to
generate pseudolegal moves as the non-legal lookups are wasted work. In
short, move generation is not the critical part of a modern chess program,
the interface between a move generator and the cpu could even make the
software slower. Evaluation of leaf positions on the other hand ... fast
pattern recognition. Keep up the good work and feel free to share your
designs with others.

Regards Dan Andersson

"Ties Bos" <tb...@huygens.org> skrev i meddelandet
news:8p5930$46s$1...@ares.cs.utwente.nl...

Robert Hyatt

unread,

Sep 6, 2000, 1:34:22 PM9/6/00

to

Alexandra Schubert <al...@redneck.gacracker.org> wrote:
> Hi Ties,

>> I've spent some time coding a move generator for chess on a bunch of
> FPGAs.
>> I hope some of the chess engines can benefit from its speed (couple
> of
>> million psuedo-legal moves per second).

To give you a quick comparison, from the opening position, crafty can
generate about 5M moves per second on a PII/400 notebook (66mhz FSB).
In a typical middlegame position, it can generate about 7M moves per
second on that same machine.

> What are FPGAs? CPUs? Did you publish the code? I would be very
> interested, if it is written in C. Could you email it or post it here?

Field Programmable Gate Arrays. They are a sort-of special-purpose piece
of hardware that can be configured to do lots of things... they aren't
based on the CPU concept, nor do they execute instructions as you would
think of them.

>> I wonder if anyone can tell me about the performance of software
> move
>> generators. I've found it pretty hard to benchmark them (from the
> ones I
>> found source code for) separate from the rest of the engines.

For crafty, run it on the platform of your choice and type "perf"
to have it tell you how fast it can generate, and then generate/make/unmake
moves.

> Even moves per second are difficult to compare, that's true.

>> Pointers to papers on the subject are welcome too!

> See the code for the bitmap move generator in the chess program
> CRAFTY.

> --
> Alexa

--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170

Rjpawlak

unread,

Sep 6, 2000, 7:08:46 PM9/6/00

to

>What are FPGAs? CPUs? Did you publish the code? I would be very
>interested, if it is written in C. Could you email it or post it here?

FPGA=Field Programmable Gate Array. It's a device usually programmed in VHDL,
not C. It's a collection of gates and other low level logic that is very
popular for many types of signal processing applications (amoung other things).

Bob Pawlak
Chess Program Reviews
http://members.aol.com/rjpawlak/ch_widow.html
http://www.chessopolis.com

Ties Bos

unread,

Sep 7, 2000, 1:24:24 AM9/7/00

to

Thank you all for your replies. I guess I should explain a little more....

Ok, I tried 'perf' with crafty on my PIII 500 MHz. It says it can
generate/make/unmake about 2 Million moves/second.
From my FPGA I can do the same (except for castling and enpassent captures,
although generated, these need a little more CPU to actually make/unmake
since more than 2 squares are involved) at a rate of about 5 Million
moves/second. Besides generating the moves, the FPGA also in parallel keeps
track of material counts and generates 64-bit hashes for the current
position of all pieces and also of just the pawns.
Does the 'perf' test in crafty include some of the material counts or hashes
too?

The FPGA doesn't really generate all moves at once of course, it generates
one at a time such that high-value captures come first and non-captures come
last as I hope is useful for most a-b based search algorithms. It is also
possible to check whether a particular move is valid from the current
position such that a move coming from the transposition table can be
validated.

Currently I am awaiting a new PCB that closely connects the FPGA and a CPU.
Indeed the PCI interface does not allow me to run the FPGA at full speed. It
is idling most of the time since the interface is too slow.

I am in the process of writing down what I built, and will put things
including the VHDL code onto the web once I am done.

Thanks for your input,
Ties.

"Dan Andersson" <rhq...@tninet.se> wrote in message
news:8p5qft$fji$1...@cubacola.tninet.se...

Ties Bos

unread,

Sep 7, 2000, 1:54:39 AM9/7/00

to

Dan Andersson

unread,

Sep 7, 2000, 5:37:27 AM9/7/00

to

Alteras Excaliber SDK would be my first choice if I were building a chess
machine. A FPGA w. an on die 200 MHz ARM processor or a NIOS soft core CPU.

Robert Hyatt

unread,

Sep 7, 2000, 9:18:48 AM9/7/00

to

Ties Bos <tb...@huygens.org> wrote:
> Thank you all for your replies. I guess I should explain a little more....

> Ok, I tried 'perf' with crafty on my PIII 500 MHz. It says it can
> generate/make/unmake about 2 Million moves/second.
> From my FPGA I can do the same (except for castling and enpassent captures,
> although generated, these need a little more CPU to actually make/unmake
> since more than 2 squares are involved) at a rate of about 5 Million
> moves/second. Besides generating the moves, the FPGA also in parallel keeps
> track of material counts and generates 64-bit hashes for the current
> position of all pieces and also of just the pawns.
> Does the 'perf' test in crafty include some of the material counts or hashes
> too?

The Make/UnMake part of perf does _all_ of that. Make/UnMake updates all
the material counts, the hash keys, chess board, 50 move rule counter,
etc.

Tony Werten

unread,

Sep 8, 2000, 8:48:44 AM9/8/00

to

Ties Bos <tb...@huygens.org> wrote in message
news:8p7aj6$t20$1...@ares.cs.utwente.nl...

> Thank you all for your replies. I guess I should explain a little more....
>
> Ok, I tried 'perf' with crafty on my PIII 500 MHz. It says it can
> generate/make/unmake about 2 Million moves/second.
> From my FPGA I can do the same (except for castling and enpassent
captures,
> although generated, these need a little more CPU to actually make/unmake
> since more than 2 squares are involved) at a rate of about 5 Million
> moves/second. Besides generating the moves, the FPGA also in parallel
keeps
> track of material counts and generates 64-bit hashes for the current
> position of all pieces and also of just the pawns.
> Does the 'perf' test in crafty include some of the material counts or
hashes
> too?
>
> The FPGA doesn't really generate all moves at once of course, it generates
> one at a time such that high-value captures come first and non-captures
come
> last as I hope is useful for most a-b based search algorithms. It is also
> possible to check whether a particular move is valid from the current
> position such that a move coming from the transposition table can be
> validated.

Actually you can make it generate all moves when you're in evaluation. These
moves can be put to good use. (specially if you get them for free )

Tony

Ties Bos

unread,

Sep 8, 2000, 9:36:14 AM9/8/00

to

> >
> > The FPGA doesn't really generate all moves at once of course, it
generates
> > one at a time such that high-value captures come first and non-captures
> come
> > last as I hope is useful for most a-b based search algorithms. It is
also
> > possible to check whether a particular move is valid from the current
> > position such that a move coming from the transposition table can be
> > validated.
>
> Actually you can make it generate all moves when you're in evaluation.
These
> moves can be put to good use. (specially if you get them for free )
>
> Tony
>

If I can write to memory independent of the main CPU on my next board, I
could actually do so. Right now I only have a PCI slave interface which
really is a bottleneck. With DMA, these moves could indeed be almost 'free'
as you would only need a memory-read per move to get them as the FPGA could
put them in memory while the main CPU is evaluating, as you suggested.

I'll definitely keep that in mind.

Ties.

Urban Koistinen

unread,

Sep 9, 2000, 5:09:19 AM9/9/00

to

tb...@huygens.org (Ties Bos) wrote in <8paq0v$bgv$1...@ares.cs.utwente.nl>:

What is the cost of your FPGA board?

Urban Koistinen - now into electronics

Anders Thulin

unread,

Sep 11, 2000, 6:21:28 AM9/11/00

to

Ties Bos wrote:

> From my FPGA I can do the same (except for castling and enpassent captures,
> although generated, these need a little more CPU to actually make/unmake
> since more than 2 squares are involved) at a rate of about 5 Million
> moves/second.

Rather disappointing -- it's only on the order of 2 times faster than
PIII 500 MHz, which I suspect is a standard PC today.

What is it that takes so much time?

> Does the 'perf' test in crafty include some of the material counts or hashes
> too?

Yes, everything that's done for an ordinary move.

--
Anders Thulin Anders....@telia.se 040-10 50 63
Telia Prosoft AB, Box 85, S-201 20 Malmö, Sweden

Ties Bos

unread,

Sep 11, 2000, 8:58:43 AM9/11/00

to

> > From my FPGA I can do the same (except for castling and enpassent
captures,
> > although generated, these need a little more CPU to actually make/unmake
> > since more than 2 squares are involved) at a rate of about 5 Million
> > moves/second.
>
> Rather disappointing -- it's only on the order of 2 times faster than
> PIII 500 MHz, which I suspect is a standard PC today.
>
> What is it that takes so much time?
>

The Move Generator is primarily based on Belle's hardware from 1981. About
equal time goes into finding a victim and an aggressor.

All victims are found at once and about 1/3 of the time goes into the
compare tree that selects the highest-value victim.

From timing analyses we found that the design is mostly slowed down by
high-fanout circuits and lots of routing delays.

I can note here that my 5M moves/second are full moves, including both a
move and an unmove. On current Xilinx Virtex chips pre-routing timing
analyses says we can expect the speed to go up to about 12M moves/second.

I expect the Pentium figures also include full moves and that moves are
prioritized such that ordering is failry good for Alpha-Beta searching.
Right? Maybe we did this the wrong way, but I can't really see how to speed
up the circuit another order of magnitude or so.

Please let me know what you think,

Ties.

Anders Thulin

unread,

Sep 11, 2000, 9:58:28 AM9/11/00

to

Ties Bos wrote:

> From timing analyses we found that the design is mostly slowed down by
> high-fanout circuits and lots of routing delays.

I don't recall the Belle algorithm, so I can't comment well on this.
But it suggests some kind of structure problem, I think.

> On current Xilinx Virtex chips pre-routing timing
> analyses says we can expect the speed to go up to about 12M moves/second.

A factor 6 ... that's more like it. Se below.

> I expect the Pentium figures also include full moves and that moves are
> prioritized such that ordering is failry good for Alpha-Beta searching.
> Right?

Full moves certainly -- but that's more a question of representation
than of any fundmental algorithm problem. Don't know about ordering -- I've
never studied Crafty that closely.

> Maybe we did this the wrong way, but I can't really see how to speed
> up the circuit another order of magnitude or so.

Wrong way ... don't know about that. Perhaps it's just that I expected
a hardware implementation to be faster on at least an order of magnitude,
preferrably two. It used to be, but perhaps it's not that simple anymore.
Perhaps speed is not an important factor to you.

I assume you know about the bitboard-based move generation that Crafty
uses (essentially a series of 64-bit binary operations, followed by
a conversion from bitboard to 'move' format), and I assume that you can
compare that approach with both the Belle algorithm and the limitations
of FPGA.

Robert Hyatt

unread,

Sep 11, 2000, 6:29:58 PM9/11/00

to

Anders Thulin <Anders....@telia.se> wrote:

> I don't recall the Belle algorithm, so I can't comment well on this.
> But it suggests some kind of structure problem, I think.

It uses a 64-square state... and the find victim / find attacker operations
have to look through squares for sliding pieces. That is one delay. The
other is that you have to (using combinatorics) select the most valuable
victim and least valuable attacker, which turns into a tree.

> Full moves certainly -- but that's more a question of representation
> than of any fundmental algorithm problem. Don't know about ordering -- I've
> never studied Crafty that closely.

One thing that can't be done in the "belle" approach is things like
"killer moves", "hash table move", "history moves" and so forth. That
hurts. For the same ply depth, (comparing Belle with Cray Blitz, where
Ken and I used very similar search extensions, etc) his tree was about 8-10x
bigger than mine. But he was 8-10x faster too, so it was a "wash" in that
regard.

> Wrong way ... don't know about that. Perhaps it's just that I expected
> a hardware implementation to be faster on at least an order of magnitude,
> preferrably two. It used to be, but perhaps it's not that simple anymore.
> Perhaps speed is not an important factor to you.

FPGAs aren't great devices for doing this. Belle was based on something
similar (PLAs). DT/Deep Blue used the same approach but used an ASIC
device which gives way more flexibility in hiding some of the delays.

> I assume you know about the bitboard-based move generation that Crafty
> uses (essentially a series of 64-bit binary operations, followed by
> a conversion from bitboard to 'move' format), and I assume that you can
> compare that approach with both the Belle algorithm and the limitations
> of FPGA.

Belle didn't use bitmaps. It had a 64-word board, where each square on
the board was one complete circuit that could enumerate the squares
attacked by the piece on that square, and it could find all the pieces
that were attacking that square.

Ties Bos

unread,

Sep 12, 2000, 2:18:01 AM9/12/00

to

> > I don't recall the Belle algorithm, so I can't comment well on this.
> > But it suggests some kind of structure problem, I think.
>

> One thing that can't be done in the "belle" approach is things like
> "killer moves", "hash table move", "history moves" and so forth. That
> hurts. For the same ply depth, (comparing Belle with Cray Blitz, where
> Ken and I used very similar search extensions, etc) his tree was about
8-10x
> bigger than mine. But he was 8-10x faster too, so it was a "wash" in that
> regard.
>
>

It is based on Belle, it isn't just Belle in an FPGA.
The circuit does allow trying any move you like, and you can also check
whether a particular move is valid from the current position. So when I find
a move in the hash table, I check whether it is valid from the current
position (the hash might have failed) and then try that move first. The same
works for killer moves and the like.
Unlike Belle there is no 64-bit deep move-disable stack, but moves already
tried are masked of by the move last made, as I think is done in Deep Blue.

Ties.