A parallel processing chess program for the 'Wintel' platform


Ian Kennedy

Mar 9, 1997, 3:00:00 AM

I have just completed the first port of my program Psycho to parallel
processing, running on an SMP-compliant dual Pentium PC under Windows
NT Workstation version 4.0. The program is a Win32 console application
(compiled with Visual C++ 4.0), using two threads and a processing model
loosely based on the paper describing parallel searching in Cray
Blitz by Bob Hyatt et al. A parent thread first generates an ordered list
of ply-1 moves. A child thread is then created on the second processor,
and the two threads take it in turns to pick moves off the list. All
cut-offs and scores are unique to the thread/processor, except at ply 1,
where the code enters a critical section to access shared values for
cut-offs and backed-up scores. When no moves are left, the parent
thread goes into a wait state if the child thread is still searching
the last move.
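
In outline the scheme looks like this (much simplified, with invented
names; a sketch rather than Psycho's actual code):

#include <windows.h>

#define MAX_MOVES 256

typedef int Move;

static Move root_moves[MAX_MOVES];   /* ordered list of ply-1 moves    */
static int  num_moves;               /* filled in by the parent        */
static int  next_move;               /* index of the next untried move */
static int  best_score;              /* shared backed-up score         */
static CRITICAL_SECTION root_lock;   /* guards the shared ply-1 values */

/* Stand-in for the real search below ply 1; everything it touches is
   private to the calling thread. */
static int search_root_move(Move m, int alpha)
{
    (void)m;
    return alpha;
}

/* Both threads run this loop, taking it in turns to pick the next
   untried move off the shared list. */
static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    for (;;) {
        Move m;
        int score, alpha;

        EnterCriticalSection(&root_lock);
        if (next_move >= num_moves) {
            LeaveCriticalSection(&root_lock);
            break;                     /* no moves left */
        }
        m = root_moves[next_move++];
        alpha = best_score;            /* read the shared bound */
        LeaveCriticalSection(&root_lock);

        score = search_root_move(m, alpha);

        EnterCriticalSection(&root_lock);
        if (score > best_score)
            best_score = score;        /* back up an improved score */
        LeaveCriticalSection(&root_lock);
    }
    return 0;
}

void parallel_root_search(void)
{
    HANDLE child;

    InitializeCriticalSection(&root_lock);
    next_move  = 0;
    best_score = -32767;
    /* ... generate and order root_moves, set num_moves ... */

    child = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    worker(NULL);                          /* the parent searches too  */
    WaitForSingleObject(child, INFINITE);  /* then waits for the child */
    CloseHandle(child);
    DeleteCriticalSection(&root_lock);
}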

Initial testing indicates a surprisingly high efficiency this early on,
with most test positions actually being searched in 50% (or less) of the
time compared with the same code running on a single thread. Quite why
this has been achieved so quickly is hard to explain; possibly the
original pruning was under-efficient.

The design of the single-threaded model forced me to duplicate the search
code for the child thread rather than execute multiple instances of one
function. This was due to the need to avoid the costly use of Win32
thread-local storage (TLS) for certain key variables. This mechanism
typically produces 3 times as many machine-code instructions to access
each TLS variable as for a normal variable in memory.
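
To illustrate what is being avoided, here is a toy comparison of the
dynamic TLS API against a plain global (32-bit casts, in keeping with
the platform):

#include <windows.h>

static DWORD tls_index;      /* one TLS slot, allocated at startup */
static int   plain_counter;  /* ordinary global, for comparison    */

void tls_init(void)
{
    tls_index = TlsAlloc();
    TlsSetValue(tls_index, (LPVOID)0);
}

void bump(void)
{
    /* TLS access: two API calls plus casts, i.e. several machine
       instructions per read or write... */
    DWORD v = (DWORD)TlsGetValue(tls_index);
    TlsSetValue(tls_index, (LPVOID)(v + 1));

    /* ...versus a single instruction for the plain variable. */
    plain_counter++;
}

Visual C++ also has __declspec(thread) static TLS, which is cheaper to
access, though it comes with restrictions of its own.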

Apart from the maintainability problems, i.e. having to apply mods to both
copies of the code, the biggest disadvantage of this approach is that it
does not scale up beyond 2 processors without extra work. However, the NT
Workstation version only supports 2 processors. Further, most tests
indicate that Windows NT does not scale successfully beyond 4 processors.


Ian Kennedy

Peter W. Gillgasch

Mar 10, 1997, 3:00:00 AM

"Ian Kennedy" <iken...@cix.compulink.co.uk> wrote:

> I have just completed the first port of my program Psycho to parallel
> processing, running on an SMP-compliant dual Pentium PC under Windows
> NT Workstation version 4.0.

Quite remarkable. Congrats!

[ snip ]

> Initial testing indicates a surprisingly high efficiency this early on,
> with most test positions actually being searched in 50% (or less) of the
> time compared with the same code running on a single thread. Quite why
> this has been achieved so quickly is hard to explain; possibly the
> original pruning was under-efficient.

Actually I think that seeing a 2x speedup suggests that the
implementation of the search code is somewhat inefficient... How did you
measure that?

If I understand Bob's original algorithm (designed for a small number of
CPUs) correctly, and you use basically the same algorithm, then the 1st
move is searched with an "open" window (with aspiration bounds of course)
while your 2nd CPU sits idle. Usually the 1st move takes about 50% of the
time of the complete search, so splitting the remaining zero-window
searches should yield a total perfect speedup of 1.5, since you see no
gain for the 1st move (50% of the work) and a speedup of 2 for the
remaining moves (50% of the work).
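
(Strictly, charging the whole first move as serial time gives an
Amdahl-style figure a bit below that:

    speedup <= 1 / (0.5 + 0.5/2) = 1 / 0.75 = ~1.33

under the 50/50 split assumed above; either way, 2 is out of reach.)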

So I cannot see how a speedup of 2 is possible at all with this
algorithm. Bob?

Have you considered a shared trans/ref table?

[ snip ]

> Apart from the maintainability problems, i.e. having to apply mods to both
> copies of the code, the biggest disadvantage of this approach is that it
> does not scale up beyond 2 processors without extra work. However, the NT
> Workstation version only supports 2 processors. Further, most tests
> indicate that Windows NT does not scale successfully beyond 4 processors.

A simple suggestion: use cpp to create two modules out of one source
file. It will look somewhat ugly, but it is better than maintaining two
source files and keeping them in sync.
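
Something along these lines (file and macro names invented for
illustration):

/* search.c: one source, compiled twice into two objects:
 *
 *   cl /c /DSEARCH_FN=search_parent /Fosearch0.obj search.c
 *   cl /c /DSEARCH_FN=search_child  /Fosearch1.obj search.c
 */

/* The key per-thread variables become file-scope statics, so each
   compiled copy automatically gets its own set, and no TLS lookups
   are needed to find them. */
static int node_count;
static int current_ply;

int SEARCH_FN(int alpha, int beta, int depth)
{
    node_count++;
    current_ply = depth;
    /* ... identical recursive search body in both copies,
       recursing via SEARCH_FN(-beta, -alpha, depth - 1) ... */
    return alpha;
}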

-- Peter

May God grant me the serenity to accept the things I cannot change,
courage to choke the living shit out of those who piss me off,
and wisdom to know where I should hide the bodies...

Robert Hyatt

Mar 10, 1997, 3:00:00 AM

Peter W. Gillgasch (gil...@ilk.de) wrote:
: "Ian Kennedy" <iken...@cix.compulink.co.uk> wrote:

: > I have just completed the first port of my program Psycho to parallel
: > processing, running on an SMP-compliant dual Pentium PC under Windows
: > NT Workstation version 4.0.

: Quite remarkable. Congrats!

: [ snip ]

Damn the snips. :) This probably has something important that you refer
to below... :)


: > Initial testing indicates a surprisingly high efficiency this early on,
: > with most test positions actually being searched in 50% (or less) of the
: > time compared with the same code running on a single thread. Quite why
: > this has been achieved so quickly is hard to explain; possibly the
: > original pruning was under-efficient.

: Actually I think that seeing a 2x speedup suggests that the
: implementation of the search code is somewhat inefficient... How did you
: measure that?

: If I understand Bob's original algorithm (designed for a small number of
: CPUs) correctly, and you use basically the same algorithm, then the 1st
: move is searched with an "open" window (with aspiration bounds of course)
: while your 2nd CPU sits idle. Usually the 1st move takes about 50% of the
: time of the complete search, so splitting the remaining zero-window
: searches should yield a total perfect speedup of 1.5, since you see no
: gain for the 1st move (50% of the work) and a speedup of 2 for the
: remaining moves (50% of the work).

Hmm... let me give you 4 algorithms that evolved thru Cray Blitz:

1. Search all moves at ply=1 in parallel, period. Obviously inefficient:
it occasionally finds the solution 2x faster, typically is about 1.5x
faster, and the number of processors doesn't matter... same results with
2, or 8.

2. The PVS algorithm. Search down the left-hand side of the tree (8 nodes
in a depth=8 search) and then search all the branches at depth=8 in parallel.
This fixes alpha, so now back up one ply and search the rest in parallel.
This fixes alpha one ply back, so you repeat again, and again, until you
search the moves at the root. Typical speedup with 4 processors = 3, with
8 = 4-5.

3. The EPVS algorithm. Same as above, but when a processor runs out of work
at a node (notice how they all work together), the others stop, and we
go down one of those branches two plies and parallel-search there. A little
better than 2, but not a tremendous improvement, especially at larger
numbers of processors.

4. DTS, similar (somewhat) to Young Brothers Wait. Here processors simply
search, in parallel, parts of the tree that are reasonably certain to be
searched by a serial search. This is appearing (I hope) in the March JICCA,
if the final editing changes were clean. This did pretty well on a Cray C90,
producing 3.9x faster on 4 processors, 7.1x faster on 8 processors and 11.1x
faster on 16 (don't hold me to those numbers... it's from memory).
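
(For anyone trying to picture algorithm 2, here is a serial skeleton over
a toy game tree, a sketch rather than Cray Blitz code. The marked loop is
the part that ran on all processors at once, since by then alpha has been
fixed by the leftmost branch:)

/* Toy game tree: a node is either a leaf with a score, or has children. */
typedef struct Node {
    int nchild;
    struct Node **child;
    int score;                     /* used only when nchild == 0 */
} Node;

int pvs(const Node *n, int alpha, int beta)
{
    int i, best, s;

    if (n->nchild == 0)
        return n->score;

    /* Leftmost branch: searched by itself, establishing a real alpha. */
    best = -pvs(n->child[0], -beta, -alpha);
    if (best > alpha)
        alpha = best;

    /* Remaining branches: once alpha is fixed these iterations are
       independent, so this is the loop that was spread across all the
       processors, one ply at a time on the way back up the tree.     */
    for (i = 1; i < n->nchild && best < beta; i++) {
        s = -pvs(n->child[i], -beta, -alpha);
        if (s > best) {
            best = s;
            if (s > alpha)
                alpha = s;
        }
    }
    return best;
}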

So yes, 2x is possible, but it takes a good algorithm. If move ordering is
bad, then big speedups are possible. Jonathan, Feldman, Hsu, I, and probably
others that I've neglected to mention discovered that the better the search,
the worse the parallel performance...


: So I cannot see how a speedup of 2 is possible at all with this
: algorithm. Bob?

I agree... but I didn't see where that was what he did. Hold on to your
questions for a month; the JICCA should answer them all, as I gave results
for all of the above and a fairly detailed description of each of the 4
algorithms I used in CB over the years...

Crafty's next, of course... Don't be shocked to see it hitting some big
numbers over the next year on the chess servers. I know how to do the search
with little difficulty; I just need to find a machine... I doubt the right
platform is a PC, however, until the memory bandwidth goes up... unless I can
fit all of Crafty into cache, so that the only memory traffic is the hashing.


: Have you considered a shared trans/ref table?

That's all I've ever used. It's critical...

:
: [ snip ]

: > Apart from the maintainability problems, i.e. having to apply mods to both
: > copies of the code, the biggest disadvantage of this approach is that it
: > does not scale up beyond 2 processors without extra work. However, the NT
: > Workstation version only supports 2 processors. Further, most tests
: > indicate that Windows NT does not scale successfully beyond 4 processors.
:
: A simple suggestion: use cpp to create two modules out of one source
: file. It will look somewhat ugly, but it is better than maintaining two
: source files and keeping them in sync.

Is this splitting the tree at the root (alg #1 above)? If so, then I don't
see how it can get 2x faster, except for favorable positions. The worst-case
test is a position where it *never* changes its mind. Those almost have to
be < 1.5x faster...

Robert Hyatt

Mar 10, 1997, 3:00:00 AM

Ian Kennedy (iken...@cix.compulink.co.uk) wrote:
: I have just completed the first port of my program Psycho to parallel
: processing, running on an SMP-compliant dual Pentium PC under Windows
: NT Workstation version 4.0. The program is a Win32 console application
: (compiled with Visual C++ 4.0), using two threads and a processing model
: loosely based on the paper describing parallel searching in Cray
: Blitz by Bob Hyatt et al. A parent thread first generates an ordered list
: of ply-1 moves. A child thread is then created on the second processor,
: and the two threads take it in turns to pick moves off the list. All
: cut-offs and scores are unique to the thread/processor, except at ply 1,
: where the code enters a critical section to access shared values for
: cut-offs and backed-up scores. When no moves are left, the parent
: thread goes into a wait state if the child thread is still searching
: the last move.

: Initial testing indicates a surprisingly high efficiency this early on,
: with most test positions actually being searched in 50% (or less) of the
: time compared with the same code running on a single thread. Quite why
: this has been achieved so quickly is hard to explain; possibly the
: original pruning was under-efficient.

I responded to Peter already, because his queries to you arrived at my
news server before this original did.

Splitting at the root really should give almost no speedup with a current
null-move search program. The first branch (in Crafty, for example)
takes way more than 50% of the time. Since each processor searches a move
at the root without knowing the value of alpha from the "best" root move,
they both will take about the same time to search the first move, and then
will fly through the rest. As a result, the speedup should be very minimal.

In the first parallel Cray Blitz, I did this as well, but back then I was
not using the null-move search at all. As a result, the first move took
maybe 50% of the time, and the rest took the remaining 50%. What this means
is that 2x is the max speedup you can ever get, with any number of
processors, because you can never drive the overall time below that 1/2
limit, even if you search the rest of the moves instantly. We found the
speedup to be between 1.5 and 2.0, depending, with 1.5 the typical number,
whether we used 2 or 4 processors in fact...
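
(In symbols: if the first move costs a fraction f of the serial search
time and gains nothing from the extra processors, then

    speedup = T(serial) / T(parallel) <= 1 / f

so f = 1/2 caps the speedup at 2, no matter how many processors you add.)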


: The design of the single-threaded model forced me to duplicate the search
: code for the child thread rather than execute multiple instances of one
: function. This was due to the need to avoid the costly use of Win32
: thread-local storage (TLS) for certain key variables. This mechanism
: typically produces 3 times as many machine-code instructions to access
: each TLS variable as for a normal variable in memory.

: Apart from the maintainability problems, i.e. having to apply mods to both
: copies of the code, the biggest disadvantage of this approach is that it
: does not scale up beyond 2 processors without extra work. However, the NT
: Workstation version only supports 2 processors. Further, most tests
: indicate that Windows NT does not scale successfully beyond 4 processors.

I can't say more without knowing more about your program. In the past,
we've generally seen really good results for programs whose move ordering
had not yet been refined, so that the search was not very efficient. As
search efficiency goes up, parallel search difficulties go up faster...

In my Ph.D. thesis, I ran tests on "worst-first" ordering, which is pure
minimax, and there it is easy to produce good speedups with almost any type
of parallel algorithm. But as the ordering becomes better and better, the
efficiency of the parallel search goes in the other direction... <sigh>...


: Ian Kennedy

John Sargeant

Mar 18, 1997, 3:00:00 AM

Suppose you could exploit very fine-grain parallelism efficiently (without
building special-purpose hardware). Is there potential for a lot of that in
position evaluation?

John


Marcel van Kervinck

Mar 18, 1997, 3:00:00 AM

Yes, pattern matching is inherently fine-grained parallelism.
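
The standard illustration (a sketch, not any particular program's code):
a bitboard pattern test evaluates all 64 squares in one pass of shifts
and masks.

#include <stdint.h>

typedef uint64_t Bitboard;       /* one bit per square, a1 = bit 0 */

/* White pawns with no black pawn anywhere ahead of them on the same
   file. The downward smear tests every file at once: 64-way
   fine-grained parallelism out of ordinary integer operations.    */
Bitboard unopposed_white_pawns(Bitboard wp, Bitboard bp)
{
    Bitboard ahead = bp;
    ahead |= ahead >> 8;         /* smear black pawns toward rank 1 */
    ahead |= ahead >> 16;
    ahead |= ahead >> 32;
    return wp & ~ahead;
}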

Marcel
-- _ _
_| |_|_|
|_ |_ Marcel van Kervinck
|_| mar...@stack.nl
