
parallel fraud - Robert Hyatt's creative bookkeeping


Vincent Diepeveen
Sep 3, 2002, 12:12:11 PM

Hello all,

We all know how many failures parallel programs developed by scientists
have been over the past years. This year's DIEP showing at the Teras
(1024 processors) was no exception. I had only 3 days of preparation
time on the machine (and until 5 days before the tournament I wasn't
sure whether I would get system time at *all*).

However, sponsors want to hear how well your program did. It is a
1024-processor machine (maximum allocation 512 processors within one
shared-memory partition), of which you actually get 60, with memory
bandwidth 2 times slower than local RAM, and let's not even *start* to
discuss the latency, otherwise you will never come to fear DIEP on that
machine, despite it being a great machine compared to others.

I'm working hard now to get a DIEP DTS NUMA version ready.

DTS it is, because it splits dynamically wherever it wants to.

More than a month of fulltime work has gone into it now. Tests on a
dual K7 as well as on the supercomputer's processors have been very
positive up to 32 processors.

Nevertheless I worried about how to report on it. So I checked out the
article from Robert Hyatt again. Already in 1999, when I had implemented
a PC DTS version, I wondered why I never got near the speedups of Bob
when I was not forward pruning with anything other than nullmove. The
1999 world championship version gave me great speedups, but I could
explain them all by the forward pruning I was using at the time.

I never got close, even on a dual Xeon or quad Xeon, to the speedups
reported by Bob for the DTS version described in 1997. I concluded that
it had to do with a number of things, encouraged by Bob's statements.
In '99 Bob explained that splitting was very cheap on the Cray: he
copied a block with all the split data, 64KB, from processor 0 to P1
within 1 clock on the Cray.

I didn't know much about Crays or supercomputers at the time, except
that they were out of my budget, so I believed it. However, I have a
good memory for certain numbers, so I have remembered his statement
very well.

In 2002 Bob explained that the Cray could copy 16 bytes each clock. A
BIG contradiction to his 1999 statement. No one here will be surprised
by that, because regarding Deep Blue we have already seen hundreds of
contradicting statements from Bob. Anyway, that of course makes
splitting on the Cray very expensive, considering Bob copied 64KB of
data for each split. Crafty is no exception here.

I never believed the 2.0 speedup for 2 processors in his table on page
16, because if I do a similar test I sometimes also get > 2.0, but
usually less.

Singular extensions hurt DIEP's speedup incredibly, but even today, at
searches of a few minutes per position, I cannot get to the speedup Bob
achieved in his 1997 article.

In 1999 I wondered why his speedup was so good. When I asked, Bob
concluded that he split in a smarter way. So obviously I asked how he
split in Cray Blitz, because what Bob does in Crafty is too horrible
for DIEP to get a speedup much above 1.5 anyway.

The answer was: "Do some statistical analysis yourself on game trees to
find a way to split well, it can't be hard; I could do it too in Cray
Blitz, but my source code is gone. No one has it anymore."

So you can imagine my surprise when, after 1999, he suddenly had data
of Crafty versus Cray Blitz, which Bob quotes on CCC to this day to
prove how good his program was.

Anyway, as an FM I can analyze games, so I already knew a bit about how
good this Cray Blitz really was. I never paid much attention to Bob's
lies here.

I thought he was doing this in order to save himself the time of
digging up old source code.

Now, after a month of fulltime work on DIEP at the supercomputer,
having it work great on a dual (with very little overhead) but still
with a bad speedup, I started worrying about my speedup and about the
future article to write about it.

So a possible explanation for the bad speedup of today's software,
compared to what Bob ran in 1993 and wrote about in 1997, is nullmove.
Bob still denies this, despite a lot of statistical data at loads of
positions (150 positions in total tried), even with CRAFTY.

Bob doesn't find those results significant. He also says that not a
single one of MY tests is valid, because I have a stupid PC with 2
processors and bad RAM; a dual would hurt Crafty's performance too much.

This because I also concluded that the speedup Crafty gets here is
between 1.01 and 1.6, not 1.7.

Data suggests that Crafty's speedup on his own quad is about 2.8, where
he claims 3.1.

Then Bob referred back to his 1997 article, saying the test method
wasn't good, because to get that 2.8 we used cleared hashtables, while
in the article he cheats a little by not clearing the tables at all. To
simulate a game-playing environment that's OK, of course.

However, there is a small problem with his article. The search times
and speedup numbers are complete fraud. If I divide the 1-cpu times by
the speedups Bob claims, I get nearly perfect numbers.

Here is the result for the first 10 positions, based upon Bob's article
in ICCA Journal issue #1 of March 1997; the tables with the results are
on page 16.

When DIEP searches a position, the speedup is always an odd number. If
I claim a speedup of 1.8, then it is usually 1.7653 or 1.7920 or 1.8402
and so on. Not with Bob. Bob knows nothing about statistical analysis
of data (I must plead ignorance here too, but at least I am not STUPID
like Bob here):

pos        2        4        8        16
 1    2.0000     3.40     6.50      9.09
 2    2.00       3.60     6.50     10.39
 3    2.0000     3.70     7.01     13.69
 4    2.0000     3.90     6.61     11.09
 5    2.0000     3.6000   6.51      8.98876
 6    2.0000     3.70     6.40      9.50000
 7    1.90       3.60     6.91     10.096
 8    2.000      3.700    7.00     10.6985
 9    2.0000     3.60     6.20      9.8994975 = 9.90
10    2.000      3.80     7.300    13.000000000000000

This clearly PROVES that he cheated completely about all the search
times from 1 processor to 8 processors. Of course, now that I am
running on supercomputers myself, I know what the problem is. A month
ago I only needed a 30-minute look to see what the problem is in
Crafty, and most likely that was the problem in Cray Blitz as well. The
problem is that Crafty copies 44KB of data or so for each split (Cray
Blitz 64KB), and while doing that it is holding its smp_lock. That's
too costly with more than 2 cpu's.

This shows he completely lied about his speedups. All times from 1-8
cpu's are complete fraud.

There is, however, also evidence that he didn't compare the same
versions. The Cray Blitz node counts are also weird.

The more processors you use, the more overhead you obviously have.
Please don't get mad at me for calculating it in the following simple
but very convincing way. I will do it only for the node counts of his
first position at 1..16 cpu's; the formula for the nodes predicted at
the next cpu count is:

  predicted nodes = (nodes at i cpu's / speedup at i cpu's) * speedup at next cpu count

From 1 to 2 cpu's we don't need the math. If you need exactly 2 times
less time, but at the same time you need more nodes with more cpu's
(where you need expensive splits), then that's already weird of course,
though not impossible.

2 to 4 cpu's:
3.4 * (89,052,012 / 2.0) = 151,388,420.4 nodes predicted.
Bob needed: 105,025,123, which in itself is possible.
Simply something like 40% extra overhead for 4 processors which 2 do
not have. This is very well possible.

4 to 8 cpu's:
6.5 * (105,025,123 / 3.4) = 200,783,323 nodes predicted.
Bob needed: 109 million nodes.
That means at 8 cpu's the overhead is already rapidly approaching
100%. This is very well possible. The more cpu's, the bigger the
overhead.

8 to 16 cpu's:
9.1 * (109,467,495 / 6.5) = 153,254,493 nodes predicted.
Bob needed: 155,514,410.
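
For anyone who wants to redo this arithmetic, here is a quick Python
sketch; the node counts and speedups are the ones quoted above for
position 1, and the formula is the one I just gave, nothing more:

nodes   = {2: 89052012, 4: 105025123, 8: 109467495, 16: 155514410}
speedup = {2: 2.0, 4: 3.4, 8: 6.5, 16: 9.1}

# predicted nodes at the next cpu count =
#   (reported nodes at i cpus / speedup at i cpus) * speedup at next count
for i, j in ((2, 4), (4, 8), (8, 16)):
    predicted = nodes[i] / speedup[i] * speedup[j]
    print("%2d -> %2d cpus: predicted %11.0f  reported %11d  ratio %.2f"
          % (i, j, predicted, nodes[j], predicted / nodes[j]))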

My dear fellow programmers, this is impossible.

Where is the overhead?

The at least 100% overhead?

More likely a factor of 3 in overhead.

The only explanation I can come up with is that the node counts for
2..8 processors were created by a different version of Cray Blitz than
the 16-processor version.

For the single-cpu version we already know the node count has to be odd
anyway, because it used a smaller hashtable (see section 4.1 in the
article, the second line after 'testing methodology').

We are talking about massive fraud here.

Of course the article is 5 years old, and I do not know whether he
created the table back in 1993.

How am I going to tell my sponsor that my speedup won't be the same as
that in the 1997 article? To whom do I compare, Zugzwang? It 'only'
had, on paper, a 50% speedup out of 512 processors. Of course that is
also not realistic. However, Feldmann documented most of the things he
did in order to cripple Zugzwang to get a better speedup.

A well-known trick is to kick out nullmove and use plain alphabeta
instead of PVS or other forms of search. Even Deep Blue did that :)

But what do you guys think of this alternative bookkeeping from Bob?

Best regards,
Vincent

Gian-Carlo Pascutto
Sep 3, 2002, 12:45:07 PM

< @ . > wrote in message news:un9pg42...@corp.supernews.com...
> Vincent Diepeveen <di...@xs4all.nl> wrote:
>
> [snip]
>
> We all know Dr Hyatt's accomplishments and qualifications.

That has nothing to do with the claim at hand.

--
GCP


Vincent Diepeveen
Sep 3, 2002, 12:52:50 PM

On Tue, 03 Sep 2002 09:40:31 -0700,  @ .  wrote:

>Vincent Diepeveen <di...@xs4all.nl> wrote:
>
>[snip]
>
>We all know Dr Hyatt's accomplishments and qualifications.
>

>How about posting yours?

You can do the calculations yourself too, with the numbers published in
Advances in Computer Chess. That clearly gives the following table:

pos        2        4        8        16
 1    2.0000     3.40     6.50      9.09
 2    2.00       3.60     6.50     10.39
 3    2.0000     3.70     7.01     13.69
 4    2.0000     3.90     6.61     11.09
 5    2.0000     3.6000   6.51      8.98876
 6    2.0000     3.70     6.40      9.50000
 7    1.90       3.60     6.91     10.096
 8    2.000      3.700    7.00     10.6985
 9    2.0000     3.60     6.20      9.8994975 = 9.90
10    2.000      3.80     7.300    13.000000000000000

There is a chance smaller than 1 in 10^30 that such numbers happen 'by
accident'. That's 0.0000000000000000000000000000001, with about 30
zeros before the 1.

In short, statistical analysis very clearly shows his fraud. I hope you
realize that in court statistical analysis is a legal method to prove
you are right. It clearly proves here that his numbers are a big fraud
and a setup.

Simon Waters
Sep 3, 2002, 1:52:38 PM

Vincent Diepeveen wrote:
>
> ... the speedups reported by Bob for the DTS version described in 1997.
> I concluded that it had to do with a number of things, encouraged by
> Bob's statements. In '99 Bob explained that splitting was very cheap
> on the Cray: he copied a block with all the split data, 64KB, from
> processor 0 to P1 within 1 clock on the Cray.
>
> I didn't know much about Crays or supercomputers at the time, except
> that they were out of my budget, so I believed it. However, I have a
> good memory for certain numbers, so I have remembered his statement
> very well.

I can't comment offhand on the 64KB copy between processors, but I
have first-hand experience of programming, and Cray training, on their
YMP vector processor architecture.

In our case, whilst running even modest loops on the vector
architecture, the Cray could dynamically reassign spare vector capacity
as processors became free from other tasks, fast enough to justify the
set-up cost. That always amazed me. To do that, Crays obviously could
punt large amounts of information between processors in short order!

Cray compilers had some fairly sophisticated techniques in this area.
I forget the details, but memory had a job keeping up with processor
performance, so certain memory-accessing tasks would take more than one
vector clock cycle; for loop optimisation the compiler would
effectively aim to arrange code so as to avoid hitting the same memory
location too frequently. Similarly, loops were farmed out between
processors in chunks, so one might do iterations 0-31 and another
32-63 of a loop of 64.
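
Purely as an illustration of that chunked farming (a Python sketch
rather than anything Cray-specific; the 0-31 / 32-63 split is just the
example above):

from concurrent.futures import ThreadPoolExecutor

def chunk_sum(data, lo, hi):
    # one worker handles iterations lo..hi-1 of the loop
    return sum(data[lo:hi])

data = list(range(64))
chunks = [(0, 32), (32, 64)]          # one does 0-31, another 32-63
with ThreadPoolExecutor(max_workers=2) as pool:
    partial = pool.map(lambda c: chunk_sum(data, *c), chunks)
print(sum(partial))                   # same result as the serial loop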

> In 2002 Bob explained that the Cray could copy 16 bytes each clock. A
> BIG contradiction to his 1999 statement.

What a Cray did each clock cycle is rather hard to describe
without a diagram of all the vector and scalar registers. The
C90 (maybe slightly later than Bob's work) had a number of
different vector registers that could be chained together.

This chaining allowed you to effectively compute simple arithmetic
results involving 4 or 5 64-bit words every clock cycle (once the
set-up cost was paid; for most of our stuff the set-up cost justified
vector processing for loops of 3 iterations or larger), which is
substantially more than 16 bytes being copied around every clock cycle,
on every processor.

The Crays were a peculiar architecture: everything was very elegantly
arranged to ensure that at each step you had just enough resource to
avoid bottlenecks. This level of elegant design was a pleasure to see
when you were actually worrying about optimisation. I have never since
had the feeling that my hardware was so carefully engineered.

> How am I going to tell my sponsor that my speedup won't be the same
> as that in the 1997 article? To whom do I compare, Zugzwang?

Surely Bob isn't the only person to publish parallel statistics from a
chess program?

What really counts is results - Bob's program beat the world (no doubt
in large part down to the best hardware), but then why does Schumacher
get to drive the Ferraris?

> But what do you guys think of this alternative bookkeeping from
> Bob?

The numbers you quote are not inconsistent with the speed-ups Crays
gave on real-world numerical processing tasks, although they tended to
be a little erratic depending on loop sizes as the number of processors
varied. Chess presumably scales well on powers of 2 in processors for
other reasons.

I'd expect chess to scale substantially worse than the number crunching
I was involved with, but without a copy of the various reports of
Bob's you refer to, I - and I suspect most people in the group - will
think you are ranting again, and certainly can't use the numbers you
quote without more context to say whether your analysis is right or
wrong.

I would have thought a less confrontational approach might get
you more information out of Dr Hyatt.

Vincent Diepeveen
Sep 3, 2002, 2:41:48 PM

On Tue, 03 Sep 2002 18:52:38 +0100, Simon Waters
<Si...@wretched.demon.co.uk> wrote:

Page 13 of the ICCA Journal, in Bob's DTS article, says:

4. DTS performance results

The DTS algorithm was tested using a Cray C916/1024. This machine has
16 processors with a cycle time of 4.166 nanoseconds and also has 1024
million words of memory (8 gigabytes).

4.1 testing methodology

In producing these results, all testing used the machine in a
dedicated mode so that all of the machine's memory was used, regardless
of the number of processors utilized in each test (except the
one-processor test, which is explained later).

[snip]

==> regarding the 1-processor result: he had to share the machine

As a result, memory conflicts were much higher (bank conflicts, to
those who are Cray-savvy), as well as swapping overhead, which gets
charged to the user. While this is well below the .1% level of noise,
it is noticeable, and should be remembered.

All the math I did was based upon Bob's search-times table, table 3 in
the ICCA Journal of March 1997. I'll give a few numbers for you to
verify it.

procs:      1      2      4      8     16
pos
  1      2830   1415    832    435    311
  2      2849   1424    791    438    274
  3      3274   1637    884    467    239
  4      2308   1154    591    349    208
  5      1584    792    440    243    178
  6      4294   2147   1160    670    452
  7      1888    993    524    273    187
  8      7275   3637   1966   1039    680
  9      3940   1790   1094    635    398
 10      2431   1215    639    333    187
 11      3062   1531    827    425    247
and so on; see the article for all the other numbers. You'll see I
didn't make mistakes in my table. The fact is that if a position shows
a speedup of 3.7, then the speedup you actually calculate, by dividing
the 1-processor search time by the 2-, 4-, 8- or 16-processor time, is
somewhere in the range 3.65 - 3.74.
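
If you don't want to do the divisions by hand, here is a small Python
sketch that recomputes time(1 cpu) / time(N cpus) for the first ten
rows above; the times are the only input, nothing else is assumed:

times = {                      # seconds, from table 3: 1, 2, 4, 8, 16 cpus
     1: (2830, 1415,  832,  435, 311),
     2: (2849, 1424,  791,  438, 274),
     3: (3274, 1637,  884,  467, 239),
     4: (2308, 1154,  591,  349, 208),
     5: (1584,  792,  440,  243, 178),
     6: (4294, 2147, 1160,  670, 452),
     7: (1888,  993,  524,  273, 187),
     8: (7275, 3637, 1966, 1039, 680),
     9: (3940, 1790, 1094,  635, 398),
    10: (2431, 1215,  639,  333, 187),
}
for pos, (t1, t2, t4, t8, t16) in times.items():
    print("%2d  %8.4f %8.4f %8.4f %8.4f"
          % (pos, t1 / t2, t1 / t4, t1 / t8, t1 / t16))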

However, if your speedup is exactly 3.70, then the chance that this
happens the next time, again and again and again, is not so big.
Because speedups are what matter and can only be based upon search
times, only this table matters IMHO.

The other tables are interesting to see, but this is THE important
table.

If you want to make your 16-processor run (which is the tournament
configuration) look better, you obviously have to modify your results
for 1..8 processors.

That's exactly what Bob did. If you divide the above numbers in the
manner described, you'll get the table I posted, which shows he is a
big fraud.

Simon Waters
Sep 3, 2002, 4:45:47 PM

Vincent Diepeveen wrote:
>
> All the math I did was based upon Bob's search-times table, table 3
> in the ICCA Journal of March 1997. I'll give a few numbers for you to
> verify it.
>
> procs:      1      2      4      8     16
> pos
>   1      2830   1415    832    435    311
>   2      2849   1424    791    438    274
>   3      3274   1637    884    467    239
>   4      2308   1154    591    349    208
>   5      1584    792    440    243    178
>   6      4294   2147   1160    670    452
>   7      1888    993    524    273    187
>   8      7275   3637   1966   1039    680
>   9      3940   1790   1094    635    398
>  10      2431   1215    639    333    187
>  11      3062   1531    827    425    247
>
> and so on; see the article for all the other numbers.
>
> If you want to make your 16-processor run (which is the tournament
> configuration) look better, you obviously have to modify your results
> for 1..8 processors.
>
> That's exactly what Bob did. If you divide the above numbers in the
> manner described, you'll get the table I posted, which shows he is a
> big fraud.

It seems consistent with the data for 2, 4, and 8 CPUs having been
transformed into a speedup rounded to 1 decimal place and then
transformed back into seconds, whereas the 16-CPU data would appear to
be raw data.
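
That is easy to check mechanically. Here is a small Python sketch over
the first three rows of the times Vincent quotes; the truncation to
whole seconds is only my guess at what the table-generating program
might have done:

times = {
    1: (2830, 1415, 832, 435, 311),
    2: (2849, 1424, 791, 438, 274),
    3: (3274, 1637, 884, 467, 239),
}
for pos, (t1, t2, t4, t8, t16) in times.items():
    for n, tn in zip((2, 4, 8, 16), (t2, t4, t8, t16)):
        s = round(t1 / tn, 1)        # speedup rounded to 1 decimal place
        regenerated = int(t1 / s)    # ...and turned back into whole seconds
        print("pos %d, %2d cpus: published %4d  regenerated %4d"
              % (pos, n, tn, regenerated))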

I'm not sure I'd accuse someone of fraud on this basis, although it
might be worth querying if it were more important data, like mortality
figures in a drug trial.

I'm not sure what you think Bob would hope to achieve if it were fraud;
I don't suppose Cray were really that interested in the parallelism
inherent in computer chess searches. I expect they wanted to win an
event like the WCCC and thus promote their technology in a favourable
light.

Robert Hyatt
Sep 3, 2002, 4:47:52 PM

Vincent Diepeveen <di...@xs4all.nl> wrote:

I am not going to further respond to this. But a little background might
shed _some_ light.

Apparently vincent contacted some vendor that builds a NUMA architecture,
and I suppose they were aware of Cray Blitz somehow. He asked them to
sponsor his program by providing machine time. They apparently wanted him
to guarantee that he could get a better speedup than anyone had in the
past.

Vincent then started to work on his port to the NUMA machine and could not
get good results. He approached me and said:

"I can't match your Cray speedups." It _must_ be because of one of the
following reasons. Which would you prefer me to use:

(a) null-move search as used today makes it harder to get a good speedup
than back in the 1992 days when null-move was hardly used. I responded
that (1) I used null-move in 1993 when I produced the data for the DTS
article he is questioning, even though it was R=1 with no recursion,
rather than today's R=2 or 3 values, and (2) I was convinced that
null-move was not a significant influence on parallel speedup. He ranted
and raved about how that was wrong and convinced me to run the 24 DTS
positions through Crafty with normal null-move using 4 cpus, and without
null-move using 4 cpus. I did. Normal null-move produced a speedup of
3.0, and non-null-move produced a speedup of 3.1... I gave him the data,
and told him that making that claim would look foolish, as _anybody_
could run the same test and show the conclusion was flawed.

(b) my DTS paper data was faked. My original PhD dissertation produced a
speedup of 9.0 (roughly) on a 16-cpu machine searching fewer than 100
nodes per second per processor, going to depth=5 plies on the Kopec
positions. I did a different sort of test using a Cray C90 with 16
processors, and rather than using random positions, I replayed the same
game several times, using different numbers of processors. The speedup
was 11.0 (roughly). He somehow concluded that, because of the way some
of the numbers "look", they must have been faked; based on what, I still
don't understand.

The funny part is that I didn't consider 11.0 very good, because when I
ran tests later on a 32-cpu machine, I hoped to get 22, but did not. The
speedup was getting worse and worse, as it did across the 1, 2, 4, 8 and
16 cpu tests I ran in the DTS article.

So apparently, it is a case of "I must be able to show I am better, by
discrediting something I can't beat, so that I can get this sponsor to
give me machine time for the next WCCC event."

Completely overlooking the fact that anyone familiar with high-performance
computing would _never_ expect a NUMA machine (using 16 processors) to come
anywhere near the parallel performance of a Cray with 16 processors, simply
because the Cray memory system is so much better. Of course, you won't find
a Cray with 128 processors, so that is the advantage of the NUMA architecture,
that it scales better than a full crossbar.

But you can believe what you want. I see no way to "win" this argument,
because I can (using Crafty) run the same position 5 times using 8 cpus
and get five different times. The non-deterministic behavior of parallel
chess search is well known and makes this kind of "discussion" difficult.

I'll leave you to believe what you like. Vincent says that Crafty can't
possibly produce a speedup of > 1.6 on a dual. For those with duals, you
might try a few test positions and then decide whom you want to believe.
I know where _I_ stand. On raw data...

Bob

--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170

Robert Hyatt
Sep 3, 2002, 4:57:48 PM



The main issue here is how the data was produced. I had a program I
wrote for my dissertation analysis so that it could "eat" log files and
extract the important stuff. However, I wrote it in 1986-1987 and I
_really_ don't remember what it did internally. I might have done
something ugly since it was a 32-bit program, and using floating point
might have caused some quantization errors, as could using integer math.

But I really don't remember the "insides" of it and can't comment on
how it might have modified the node counts or times in converting them
from characters to internal form and back to output...

The program (as well as all the old Crafty versions) was lost somewhere
in the 1996-97-98 time frame due to a disk failure on my machine; we
then discovered that all our backup tapes had been written just fine but
were absolutely unreadable... If I had access to the program, I might
see whether I had done something ugly between reading the data and
displaying it as a table.

However, the actual speedup results are quite consistent with lots of
testing Harry and I did on the C90... Speedups between 10 and 12 were
the norm, with some around 16 and some lower, around 8. 11.0 seems to be
a good average number. I.e., Crafty seems to hit around 3.1 on my quad
(Vincent claims this is impossible, but I have run tests for him that
continue to show this). Cray Blitz was significantly better than that
with 4 processors, but not enough better to make 11.0 out of reach for
Crafty today on the right set of test positions...

Note that the DTS paper was written well after the last appearance of
Cray Blitz. Our last ACM event was 1994, the last ACM tournament held.
However, I had already started to work on the paper, using a 1993 game
as the "test positions", and wrote the paper long after Cray Blitz had
been officially "retired" from competition. I thought that I might one
day write DTS-2, but when I did a recursive search version of Crafty,
that pretty well eliminated the DTS approach from the get-go. I chose
to do something simpler that still works pretty well, IMHO. I'm not
sure I would want to publish the results of the speedups now, of
course, unless I get a third party to run the test results for me. :)

Simon Waters
Sep 3, 2002, 5:26:13 PM

Robert Hyatt wrote:
>
> I'm not sure I would want to publish the results of the speedups now, of
> course, unless I get a third party to run the test results for me. :)

For as long as I get to keep the quad Xeon, running the tests is no
problem ;)

Robert Hyatt
Sep 3, 2002, 5:41:51 PM


I have been known to give people access to my quad Xeons in the cluster.

Once upon a time, I even let Vincent use the quad in my office.

Pascutto is using one of the quad 550's to develop "deep sjeng" at the
moment, so access is possible. Of course "some" probably won't be using
them again. :)

Kym
Sep 3, 2002, 8:05:08 PM

I've come from a Digital/VMS VAX and later Alpha background. //lel
computing and clustering has always been an interest. The architecture
is the issue from my perspective.

1. Intel 32-bit is not the best way to build //lel systems.

2. You must have a very good backbone (crossbar switch) so all CPUs and
memory banks can talk simultaneously (you also need coherency and cache
co-ordination - write a paper on that).

3. Program architecture and quality compilers are also required.

Look at http://www.compaq.com/alphaserver/gs320/ for a good //lel
implementation, also
http://www.compaq.com/alphaserver/gs/gs_whitepapers.html .

If items 2 and 3 are implemented you should get near-linear scaling of
memory/compute-intensive applications. As I don't believe that the
Intel 32-bit architecture meets either 2 or 3, I am not surprised at
the results.

Most of the work I have done in //lel environments was crunching very
large arrays and vector processing. Chess is a more difficult problem
due to the move tree, Alpha/Beta, etc.

"Robert Hyatt" <hy...@crafty.cis.uab.edu> wrote in message
news:al375o$i7n$1...@juniper.cis.uab.edu...

Michael Byrne
Sep 3, 2002, 9:06:58 PM

di...@xs4all.nl (Vincent Diepeveen) wrote in message news:<3d74dec6...@news.xs4all.nl>...
> Hello all,
>
> <big snip of trash talk>

You're now posting this crap on Usenet? The truth is that you can't
duplicate what Bob Hyatt has done, and now your sponsors are getting
antsy and you have to try to save face. The only way you can save face
is by saying what Bob did was "creative bookkeeping".

Give me a break!

You are scum and don't bother opening the door when you leave. You
should be able to slime your way out underneath.

Michael Byrne

JamesGE
Sep 4, 2002, 1:28:41 AM

Aside from a lot of what you said, which I didn't understand, have you
thought about assigning processors to compute wildly differing
parameters of the position? That might be a lot more effective than
just trying to compute as many variations as possible.

For example, you have a chess position. In the position, the computer
can find 2 advantageous lines almost immediately. But say you assign
one processor to only look at gambit lines. That is, even if the
computer sees lines that give it, say, the equivalent of a pawn ahead,
make sure the program still looks at gambit lines, to see if down the
road a gambit might be positionally better, even if in the short run
it's tactically inferior.

Just a thought.


"Vincent Diepeveen" <di...@xs4all.nl> wrote in message
news:3d74dec6...@news.xs4all.nl...

Simon Waters
Sep 4, 2002, 4:11:38 AM

Robert Hyatt wrote:
>
> Pascutto is using one of the quad 550's to develop "deep sjeng"

"Deep GNU" is too good a 'pun' to pass Stuart or RMS by for
long.

Robert Hyatt
Sep 4, 2002, 10:49:00 AM

"JamesGE" <jjamesge> wrote:
> Aside from a lot of what you said, which I didn't understand, have you
> thought about assigning processors to compute wildly differing parameters
> of the position? That might be a lot more effective than just trying to
> compute as many variations as possible.

That has been suggested, and tried by others. The problem is that a
chess engine searches a "serial search space" defined by the alpha/beta
algorithm. The goal is to traverse that search space as quickly as possible,
without doing extra (and unnecessary) work while doing so.

Hence the challenge of a parallel chess engine.


> For example, you have a chess position. In the position, the computer can
> find 2 advantageous lines almost immediately. But say you assign one
> processor to only look at gambit lines. That is, even if the computer sees
> lines that give it, say, the equivalent of a pawn ahead, make sure the
> program still looks at gambit lines, to see if down the road a gambit
> might be positionally better, even if in the short run it's tactically
> inferior.


This has been done, but most chess positions are non-tactical, which means
the processors looking at wild stuff will be wasted.


> Just a thought.

Kenneth Sloan
Sep 4, 2002, 11:10:43 AM

Robert Hyatt <hy...@crafty.cis.uab.edu> writes:

> "JamesGE" <jjamesge> wrote:
> > Aside from a lot of what you said, that I didn't understand, have you
> > thought about assigning processors to compute wildly differing paramaters of
> > the position? That might be alot more effective than just trying to compute
> > as many variations as possible.
>
> That has been suggested, and tried by others. The problem is that a
> chess engine searches a "serial search space" defined by the alpha/beta
> algorithm. The goal is to traverse that search space as quickly as possible,
> without doing extra (and unnecessary) work while doing so.
>
> Hence the challenge of a parallel chess engine.

Well...let's be more accurate. As Bob very well knows, there is a long
history of chess programs which tried to use methods *other than*
alpha-beta search. It's just that (so far) the serial alpha-beta
methods have proven to be better, in the arena.

Once you go down the serial path, parallelizing it becomes very
difficult. This is not really a surprise. Almost by definition, any
attempt to parallelize requires a step BACK from the algorithms that are
optimal on single-processor machines.

The step back from serial algorithms is happening in some fields, with
some success. So far, it hasn't been terribly successful in chess. My
theory is that the computer chess world is very focussed on short term
performance, and the current model is very advanced, very sophisticated,
and very difficult to improve on.

Almost by definition, in order to make gains (by going parallel) over a
highly optimized serial algorithm, you must do more work, in absolute
terms. The hope is that you can throw enough processors at the problem
so that you gain more (by adding resources) than you lose (by being
"inefficient").

Some progress has been made with small numbers of processors (1<n<32),
but this looks to be reaching the point of diminishing returns. I expect
the next great leap forward to come from massively parallel
architectures (100000 < n). The current problem for chess folk is that
these still tend to be a bit expensive, and not available for
experimentation. Also, they are being targeted at "embarrassingly
parallel" problems (e.g., image processing - which can easily suck up a
processor per pixel). Putting a chess program on such a machine will
require a complete re-thinking of what a chess program ought to look
like. And so, it will inevitably be a failure (in the arena), at first.

--
Kenneth Sloan sl...@uab.edu
Computer and Information Sciences (205) 934-2213
University of Alabama at Birmingham FAX (205) 934-5473
Birmingham, AL 35294-1170 http://www.cis.uab.edu/info/faculty/sloan/

Robert Hyatt
Sep 4, 2002, 1:23:22 PM

Kenneth Sloan <sl...@uab.edu> wrote:
> Robert Hyatt <hy...@crafty.cis.uab.edu> writes:

>> "JamesGE" <jjamesge> wrote:
>> > Aside from a lot of what you said, that I didn't understand, have you
>> > thought about assigning processors to compute wildly differing paramaters of
>> > the position? That might be alot more effective than just trying to compute
>> > as many variations as possible.
>>
>> That has been suggested, and tried by others. The problem is that a
>> chess engine searches a "serial search space" defined by the alpha/beta
>> algorithm. The goal is to traverse that search space as quickly as possible,
>> without doing extra (and unnecessary) work while doing so.
>>
>> Hence the challenge of a parallel chess engine.

> Well...let's be more accurate. As Bob very well knows, there is a long
> history of chess programs which tried to use methods *other than*
> alpha-beta search. It's just that (so far) the serial alpha-beta
> methods have proven to be better, in the arena.

I agree.

Schaeffer decided to try to use some of the processors to do a
tactical-refutation search, since tactics without positional evaluation
can go very quickly. He in essence said: "These extra processors are not
really helping the normal search, so I am going to send them off to do
tactical searches only and hopefully refute a positional move that loses
tactically." It worked OK, but seemed like an admission of defeat in
getting all the processors to help on the normal search. This has been
called "speculative computing" and is like running two separate engines,
one searching normally and one only looking for tactical shots, and
using an "oracle" to somehow combine the results into a single move.

Other approaches include offbeat ideas like conspiracy number search, proof
number search, and other "almost alpha/beta" type searches. They seem to be
promising, but they aren't beating raw alpha/beta parallel searches yet.

mtd(f) is another alpha/beta variant that some have reported good
parallel results with. I could not reproduce them and didn't want to
spend a year working on something that did not appear to be an
improvement in my initial testing, but who knows? It took me a good
while to get into "bitmap mode" to get good performance there, and
mtd(f) seems to be the same sort of thing that takes significant time to
get right. However, it does have some issues that Crafty's evaluation
seems to run afoul of, so after three months I put it away for future
testing, if that ever happens.
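
For anyone who hasn't seen it, the mtd(f) driver itself is tiny. Here is
a toy Python sketch of the published algorithm (Plaat's description,
nothing from Crafty), run on a hand-built two-ply tree:

import math

TREE = [[3, 17, 2], [15, 30, 4], [2, 5, 12]]   # toy 2-ply tree, leaf scores
tt = {}                                        # "memory": node -> (lower, upper)

def ab_with_memory(node, alpha, beta, maximizing, key):
    lo, hi = tt.get(key, (-math.inf, math.inf))
    if lo >= beta:
        return lo
    if hi <= alpha:
        return hi
    alpha, beta = max(alpha, lo), min(beta, hi)
    if isinstance(node, int):                  # leaf: static score
        g = node
    elif maximizing:
        g, a = -math.inf, alpha
        for i, child in enumerate(node):
            g = max(g, ab_with_memory(child, a, beta, False, key + (i,)))
            a = max(a, g)
            if g >= beta:
                break
    else:
        g, b = math.inf, beta
        for i, child in enumerate(node):
            g = min(g, ab_with_memory(child, alpha, b, True, key + (i,)))
            b = min(b, g)
            if g <= alpha:
                break
    if g <= alpha:                             # fail low: g is an upper bound
        tt[key] = (lo, g)
    elif g >= beta:                            # fail high: g is a lower bound
        tt[key] = (g, hi)
    else:
        tt[key] = (g, g)
    return g

def mtdf(root, first_guess=0):
    g, lower, upper = first_guess, -math.inf, math.inf
    while lower < upper:                       # a series of zero-window searches
        beta = g + 1 if g == lower else g
        g = ab_with_memory(root, beta - 1, beta, True, ())
        if g < beta:
            upper = g
        else:
            lower = g
    return g

print(mtdf(TREE))                              # 4, the minimax value of TREE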


> Once you go down the serial path, parallelizing it becomes very
> difficult. This is not really a surprise. Almost by definition, any
> attempt to parallelize requires a step BACK from the algorithms that are
> optimal on single-processor machines.

And unfortunately, the very formulation of the alpha/beta algorithm is
inherently serial by definition. You first search one branch to establish
a bound for searching the others. That first search is serial at that point,
by definition, and that hurts.
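
In sketch form (a toy negamax in Python, not Crafty's search), the
serial dependency is the single line that feeds the first child's score
back into alpha before the remaining children are searched:

def alphabeta(node, alpha, beta):
    # toy negamax over a nested-list tree: an int is a leaf score (for
    # the side to move), a list is a position whose elements are the
    # successor positions
    if isinstance(node, int):
        return node
    best = -10**9
    for child in node:
        score = -alphabeta(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)   # the bound established by the first child...
        if alpha >= beta:           # ...is what lets the later children cut off,
            break                   # so that first child has to finish first
    return best

print(alphabeta([[3, 17, 2], [15, 30, 4], [2, 5, 12]], -10**9, 10**9))  # -> 4

Until that first child returns, there is no useful alpha for its
siblings, which is exactly the serialization being described.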

> The step back from serial algorithms is happening in some fields, with
> some success. So far, it hasn't been terribly successful in chess. My
> theory is that the computer chess world is very focussed on short term
> performance, and the current model is very advanced, very sophisticated,
> and very difficult to improve on.

And it might be good enough to rise above all humans at some point in the
future. Whether there is a better algorithm or not is an open question. With
the current success of chess engines, it might actually be a moot question,
because when is "good enough" really "good enough"? :)


> Almost by definition, in order to make gains (by going parallel) over a
> highly optimized serial algorithm, you must do more work, in absolute
> terms. The hope is that you can throw enough processors at the problem
> so that you can more (by adding resources) than you lose (by being
> "inefficient").

> Some progress has been made with small (1<n<32) processors, but this
> looks to be reaching the point of diminishing returns. I expect the
> next great leap forward to come from massively parallel architectures
> (100000 < n). The current problem for chess folk is that these still
> tend to be a bit expensive, and not available for experimentation.
> Also, they are being targeted at "embarassingly parallel" problems
> (e.g., image processing - which can easily suck up a processor per
> pixel). Putting a chess program on such a machine will require a
> complete re-thinking of what a chess program ought to look like. And
> so, it will inevitably be a failure (in the arena), at first.

> --
> Kenneth Sloan sl...@uab.edu
> Computer and Information Sciences (205) 934-2213
> University of Alabama at Birmingham FAX (205) 934-5473
> Birmingham, AL 35294-1170 http://www.cis.uab.edu/info/faculty/sloan/

Kenneth Sloan
Sep 4, 2002, 1:41:44 PM

Robert Hyatt <hy...@crafty.cis.uab.edu> writes:

> ...


> And unfortunately, the very formulation of the alpha/beta algorithm is
> inherently serial by definition. You first search one branch to establish
> a bound for searching the others. That first search is serial at that point,
> by definition, and that hurts.

Yes, of course.

My point was that there's no reason to restrict your attention to
alpha-beta - other than the fact that it has always turned out to be
better...so far.

In the land of massively parallel architectures, perhaps chaos is better
than deductive reasoning.

Robert Hyatt
Sep 4, 2002, 2:20:30 PM

Kenneth Sloan <sl...@uab.edu> wrote:
> Robert Hyatt <hy...@crafty.cis.uab.edu> writes:

>> ...
>> And unfortunately, the very formulation of the alpha/beta algorithm is
>> inherently serial by definition. You first search one branch to establish
>> a bound for searching the others. That first search is serial at that point,
>> by definition, and that hurts.

> yes, of course.

> My point was that there's no reason to restrict your attention to
> alpha-beta - other than the fact that it has always turned out to be
> better...so far.

> In the land of massively parallel architectures, perhaps chaos is better
> than deductive reasoning.


Of course nothing says that a classic depth-first (without alpha/beta)
or a classic breadth-first search won't be better on massively parallel
machines.

Just like lousy sorting algorithms do very well on large numbers of
processors while things like heap-sort/quick-sort don't do so well...


> --
> Kenneth Sloan sl...@uab.edu
> Computer and Information Sciences (205) 934-2213
> University of Alabama at Birmingham FAX (205) 934-5473
> Birmingham, AL 35294-1170 http://www.cis.uab.edu/info/faculty/sloan/


Russell Reagan
Sep 4, 2002, 2:53:55 PM

"Robert Hyatt" <hy...@crafty.cis.uab.edu> wrote

> Of course nothing says that a classic depth-first (without alpha/beta)
> or a classic breadth-first search won't be better on massively parallel
> machines.

Do you know of any chess programs (or tic-tac-toe, checkers, or
whatever) that use a breadth-first search? I understand how the BFS
algorithm works, but I can't figure out how you would get a best move
or principal variation from it. Could you explain?

Russell

Robert Hyatt
Sep 4, 2002, 3:35:31 PM


> Russell

Sure. You put the root position in a list called "open".

Now you take any position in the open list and generate all the possible
successor positions. The original position is now "closed", since it can
generate no further successors. You have a tree with one move of search
depth. You pick any one of the new positions you choose and generate all
its moves. Then you do this for all the other positions. Now you have a
tree two levels deep. You continue this until you run out of time. You
simply follow the pathway from the current position with the best score
back to the root, which gives you your PV...

No alpha/beta of course, as that is depth-first only...

But the point is that this is a tree, with the top-most position as the
"root", and the deepest positions (all are at the same depth btw) as the
terminal positions. There is an obvious connectivity from the best
terminal position, to its parent, to its parent's parent, etc...
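
Here is a toy Python sketch of that open/closed bookkeeping, using a
made-up "game" where a position is just a number and its successors are
computed arithmetically; it only shows the mechanics, not a chess
search:

from collections import deque

def successors(pos):
    return [pos * 3 + i for i in range(1, 4)]   # made-up move generator

def evaluate(pos):
    return (pos * 37) % 100                     # made-up static score

root = 1
open_list = deque([root])
parent = {root: None}
closed = []

for _ in range(2):                              # expand two full plies
    frontier = []
    while open_list:
        pos = open_list.popleft()               # take a position from "open"
        closed.append(pos)                      # it is now "closed"
        for nxt in successors(pos):             # generate all its successors
            parent[nxt] = pos
            frontier.append(nxt)
    open_list = deque(frontier)                 # the new level becomes "open"

best = max(open_list, key=evaluate)             # best-scoring terminal position
pv = []
while best is not None:                         # walk parent links back to root
    pv.append(best)
    best = parent[best]
print(list(reversed(pv)))                       # the pathway root -> terminal

In a real game tree you would back the scores up minimax-fashion rather
than just taking the single best terminal position, but the open list,
the closed positions and the parent links are the parts being
illustrated.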

Kenneth Sloan
Sep 4, 2002, 3:38:15 PM

Robert Hyatt <hy...@crafty.cis.uab.edu> writes:

> ...


> Just like lousy sorting algorithms do very well on large numbers of
> processors while things like heap-sort/quick-sort don't do so well...

> ...

They are not lousy algorithms...on that hardware.

Russell Reagan
Sep 4, 2002, 4:04:21 PM

"Robert Hyatt" <hy...@crafty.cis.uab.edu> wrote

> Sure. You put the root position in a list called "open".
>
> Now you take any position in the open list and generate all the possible
> successor positions. The original position is now "closed", since it can
> generate no further successors. You have a tree with one move of search
> depth. You pick any one of the new positions you choose and generate all
> its moves. Then you do this for all the other positions. Now you have a
> tree two levels deep. You continue this until you run out of time. You
> simply follow the pathway from the current position with the best score
> back to the root, which gives you your PV...
>
> No alpha/beta of course, as that is depth-first only...
>
> But the point is that this is a tree, with the top-most position as the
> "root", and the deepest positions (all are at the same depth btw) as the
> terminal positions. There is an obvious connectivity from the best
> terminal position, to its parent, to its parent's parent, etc...

So you have to save the entire tree you are searching?

Russell


Kenneth Sloan
Sep 4, 2002, 5:12:56 PM

"Russell Reagan" <rre...@attbi.com> writes:

Yes, well...that's one reason that depth-first search (and alpha-beta
search) have become so popular on processors with limited resources, eh?

But consider that any program that does "iterative deepening" is doing
something which might be called "breadth-first search". Think about
where "the entire tree" is stored in these cases. Where else do current
chess programs keep the knowledge accumulated by previous searches?

Pure depth-first search and pure breadth-first search are extreme
positions. There is an entire spectrum of choices in between.

Robert Hyatt
Sep 4, 2002, 8:53:17 PM


Yes. That is the "issue" for breadth-first. You have to store the tree.
With depth-first, you don't...


> Russell

Robert Hyatt
Sep 4, 2002, 8:55:55 PM

Guy Macon < Guy Macon > wrote:

> Robert Hyatt <hy...@crafty.cis.uab.edu> wrote:


>>
>>Russell Reagan <rre...@attbi.com> wrote:
>>
>>> Do you know of any chess programs (or tic-tac-toe, checkers, or whatever)
>>> that use a breadth first search? I understand how the BFS algorithm works,
>>> but I can't figure out how you would get a best move or principle variation
>>> from it. Could you explain?
>>
>>> Russell
>>
>>Sure. You put the root position in a list called "open".
>>
>>Now you take any position in the open list, and generate all the possible
>>successor positions. The original position is now "closed" since it can
>>generate no further successors. You have a tree with one-move-depth of
>>search. You pick any one you choose and generate all the moves. Then you
>>do this for all the other moves. Now you have a tree two-levels deep. you
>>continue this until you run out of time. You simply follow the pathway
>>from the current position with the best score, back to the root, which gives
>>you your PV...
>>
>>No alpha/beta of course, as that is depth-first only...
>>
>>But the point is that this is a tree, with the top-most position as the
>>"root", and the deepest positions (all are at the same depth btw) as the
>>terminal positions. There is an obvious connectivity from the best
>>terminal position, to its parent, to its parent's parent, etc...

> Please forgive me if this is a clueless newbie question, but would it
> be possible to take all of those possible successor positions and
> assign them to nodes in a Beowulf cluster, each of which is running
> a normal copy of Crafty? With lots of nodes or few legal moves, you
> could even assign all of the possible successor positions in the next
> ply to nodes. Clearly this would be quite inefficient, but it seems
> like it would also be easy to do, and those of us who run Beowulf
> clusters have a severe lack of fun programs that use all of the nodes.
> How hard would what I describe above be to implement?


The problem with that is that you are going to search a much larger
tree. Alpha/Beta demands that the first branch at a node be searched
before the other branches. Violating this means you search without a
good "bound" and the tree grows.

Of course, if the tree were to grow at a slower rate than the search
speed produced by a large group of machines, then it would "win".

You'll get a cluster Crafty one day...
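
Here is a toy way to see the cost of giving up that bound (plain
alpha/beta on a synthetic tree in Python, nothing Crafty- or
cluster-specific): search the root moves once serially, carrying alpha
across them, and once as independent full-window searches the way
separate cluster nodes would, then compare the node counts:

def leaf_value(path):
    return hash(path) % 101 - 50               # deterministic toy evaluation

def alphabeta(path, depth, alpha, beta, counter):
    counter[0] += 1                            # count every node visited
    if depth == 0:
        return leaf_value(path)
    best = -10**9
    for move in range(8):                      # 8 moves in every position
        score = -alphabeta(path + (move,), depth - 1, -beta, -alpha, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best

serial = [0]                                   # all root moves on one machine,
alpha = -10**9                                 # the bound from move 1 helping 2..8
for move in range(8):
    alpha = max(alpha, -alphabeta((move,), 4, -10**9, -alpha, serial))

split = [0]                                    # each root move on its own machine:
for move in range(8):                          # full window, no shared bound
    alphabeta((move,), 4, -10**9, 10**9, split)

print("with the shared bound:   ", serial[0], "nodes")
print("independent full windows:", split[0], "nodes")

The independent searches can never be cheaper on the same tree, and the
gap between the two counts is the extra work that comes from searching
without the bound.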

> Guy Macon http://www.guymacon.com
