Opteron optimizations

Fortknight

unread,

Nov 21, 2003, 1:19:28 PM11/21/03

to

Ok engine makers...

Am I off base here? With the advent of the Opteron with a 64 bit
word (Think bit boards here), lots of registers (very fast operation),
and 3 megs of L1 Cache (extremely fast memory loads for a pretty large
swath of the search space), that there should be room for at least a
magnitude improvement in move generation if not also evaluation at a
given clock speed compared to I86-32 hardware?

Would not that magnitude of improvement provide a significant increase
in elo strengthm beyond the 40 pts a year we are at now?

Granted this path would want a dedicated processor so as to not have
significant register and cache thrashing from the operating system,
but if you were looking to *win* the next years computer chamionship
it would seem to be a quick way to get an "unfair" advantage...

Cheers

Tommy

unread,

Nov 21, 2003, 4:48:18 PM11/21/03

to

"Fortknight" <X...@x.com> wrote in

> Am I off base here? With the advent of the Opteron with a 64 bit
> word (Think bit boards here), lots of registers (very fast operation),
> and 3 megs of L1 Cache (extremely fast memory loads for a pretty large
> swath of the search space), that there should be room for at least a
> magnitude improvement in move generation if not also evaluation at a
> given clock speed compared to I86-32 hardware?

I thought it might be about twice as fast, are you sure it would have an
improvement of an order of magnitude?

> Granted this path would want a dedicated processor so as to not have
> significant register and cache thrashing from the operating system,
> but if you were looking to *win* the next years computer chamionship
> it would seem to be a quick way to get an "unfair" advantage...

Are you talking about scheduling and context switching? Is this the reason
why you want a second processor, to have your engine run on a separate
processor? That is probably not a good idea, much better would be to have
your engine running in parallel on both....

Tommy

Fortknight

unread,

Nov 21, 2003, 5:19:22 PM11/21/03

to

On Fri, 21 Nov 2003 21:48:18 +0000 (UTC), "Tommy"
<ltk_RE...@libero.it> wrote:

>
>"Fortknight" <X...@x.com> wrote in
>
>> Am I off base here? With the advent of the Opteron with a 64 bit
>> word (Think bit boards here), lots of registers (very fast operation),
>> and 3 megs of L1 Cache (extremely fast memory loads for a pretty large
>> swath of the search space), that there should be room for at least a
>> magnitude improvement in move generation if not also evaluation at a
>> given clock speed compared to I86-32 hardware?
>
>I thought it might be about twice as fast, are you sure it would have an
>improvement of an order of magnitude?

I think that you could implement a routine to have an entire bit board
representation solely in registers, and do move generations solely in
cpu.... (This would be an ASM only kind of thing)... concievably you
could do move generation in a single clock... Which would be
3000000k nodes... But lets assume order of magnitude less for all
sorts of other things... so 300000k, but lets make it even one more
magnitude less... 30000k... I think if you could get 30 million
nodes a second, I think you are hitting that magnitude improvement....

>
>> Granted this path would want a dedicated processor so as to not have
>> significant register and cache thrashing from the operating system,
>> but if you were looking to *win* the next years computer chamionship
>> it would seem to be a quick way to get an "unfair" advantage...
>
>Are you talking about scheduling and context switching? Is this the reason
>why you want a second processor, to have your engine run on a separate
>processor? That is probably not a good idea, much better would be to have
>your engine running in parallel on both....

If you are searching for maximum performance and doing all sorts of
specialized register stuff and maximizing your use of cache, you don't
want the operating system coming in several times a second, saving off
your registers, polluting the cache etc... The idea would be to have
selfish control of the processor... (This would mean that the other
processor would be available for other tasks)

>Tommy
>
>

Tommy

unread,

Nov 21, 2003, 8:37:04 PM11/21/03

to

"Fortknight" <X...@x.com> wrote

> I think that you could implement a routine to have an entire bit board
> representation solely in registers,

Umh, this is not my strongest field, but I doubt it.... with rotated
bitboards you need at least 14 bitboards, the AMD64 architecture has 16
registers R0 - R15 (not considering special purposes registers). I do not
think you could do much with only 2 available registers :-(

Finally by using the keyword "register" you are only giving a hint to the
compiler, compiler does what he think is best anyway.

> (This would be an ASM only kind of thing)...

In the old days you could produce much faster code by writing your own ASM
than by using a compiler. Now this is still true, but it is not that easy
anymore.....
Compilers are much better than in the past. Also, architectures are much
more complicated, with some RISC processors it is *really hard* to write
efficient ASM! If you have lots of time to red documentation and perform
tests etc... you might acheive better performance, but when you do so a new
and improved version of that processor might come out and your assembly code
might not be the best anymore (maybe they made an instruction slower and
another one faster).
What is more code you produce is not portable.

Finally, sometimes it could more convenient to have other variables in
registers and *not* the bitboards... So, even if you are using assembly (and
not a compiler) you will probably drop the idea of having bitboards
permanently in registers.

> concievably you
> could do move generation in a single clock... Which would be
> 3000000k nodes...

With a 3GHz processor you have 3000000K cycles per second, in other words:
3.000.000.000 cycles/sec
How could you possibly do the whole move generation in a single clock cycle
?!?!?
You can do one OR operation in a clock cycle, but not the whole move
generation ;-)

So if you consider that for the move generation you need many ORs, ANDs, a
few loops, you also need reading from the pre-computed attack arrays (which
are *not* in registers by the way....) etc.... then you realise that you
need quite a few clock cycles... certainly not only 1.

> But lets assume order of magnitude less for all
> sorts of other things... so 300000k, but lets make it even one more
> magnitude less... 30000k...

What is this?You are making something 100 times slower for apparently no
reason....

> I think if you could get 30 million
> nodes a second, I think you are hitting that magnitude improvement....

Well, doing 30 millions move generations per second, does *not* mean doing
30 millions nodes per second.
It might be less it might also be more.... (my program can do up to 150K
generations per second, but it can do 1M nodes/sec).
In a node you need to do some other things apart from move-generation and
move-generation does not have be done all the times.

Anyway, I hope you are right..... but I doubt it.

Tommy

Fortknight

unread,

Nov 21, 2003, 10:20:04 PM11/21/03

to

You can get the bitboards down to 8, (6 for each of the piece types,
and a white piece mask, and a black piece mask), and 1 register for
state flags (castleing, enpassant, check) and for eval score (and
possibly other things that you need), giving you 7 registers for doing
various other things...

And this would have to be ASM, because you are in some sense not doing
something any compiler is going to understand, and you are going to be
manipulating register states in ways that a compiler is not going to
happy with. (This is not a speed thing, just a compiler thing). You
may likely be able to compile this with a lot of macro's vs just c.

So, on 3 GHz machine, I give you 100 clock cycles (and that should be
quite alot for both move generation and eval) to get to 30M nodes per
second.

This would seem to be a worthwhile "optimization" as you should be
able to get a magnitude increase in search depth against other
computer opponents. I will grant you that this is not portable, but
that is fine, the goal is to create a program to blow away the
computer opponents (And if you can do that, riches *will* follow as
the best program always attracts the most dollars, making it
worthwhile).

Robert Hyatt

unread,

Nov 22, 2003, 12:14:32 AM11/22/03

to

Fortknight <X...@x.com> wrote:
> You can get the bitboards down to 8, (6 for each of the piece types,
> and a white piece mask, and a black piece mask), and 1 register for
> state flags (castleing, enpassant, check) and for eval score (and
> possibly other things that you need), giving you 7 registers for doing
> various other things...

> And this would have to be ASM, because you are in some sense not doing
> something any compiler is going to understand, and you are going to be
> manipulating register states in ways that a compiler is not going to
> happy with. (This is not a speed thing, just a compiler thing). You
> may likely be able to compile this with a lot of macro's vs just c.

> So, on 3 GHz machine, I give you 100 clock cycles (and that should be
> quite alot for both move generation and eval) to get to 30M nodes per
> second.

> This would seem to be a worthwhile "optimization" as you should be
> able to get a magnitude increase in search depth against other
> computer opponents. I will grant you that this is not portable, but
> that is fine, the goal is to create a program to blow away the
> computer opponents (And if you can do that, riches *will* follow as
> the best program always attracts the most dollars, making it
> worthwhile).

A couple of nit-picks. "magnitude" as in "order of magnitude" is
generally 10x. Going 10x faster will _not_ give you 10x the depth.
You will be lucky to get 2 more plies. This is an exponential growth
issue which means logarithmic growth in depth. 100 cycles is _very_
fast for the search overhead, making/unmaking moves, hash probing,
move ordering, generating moves, and positional evaluation. Most programs
are way more than 20x slower than that. More like 2000 clocks _minimum_
per node... Crafty, for example, does 1M nps on a 3ghz processor, about
3000 cycles per node approximately.

--
Robert M. Hyatt, Ph.D. Computer and Information Sciences
hy...@uab.edu University of Alabama at Birmingham
(205) 934-2213 136A Campbell Hall
(205) 934-5473 FAX Birmingham, AL 35294-1170

Gian-Carlo Pascutto

unread,

Nov 22, 2003, 3:44:13 AM11/22/03

to

> Granted this path would want a dedicated processor so as to not have
> significant register and cache thrashing from the operating system,
> but if you were looking to *win* the next years computer chamionship
> it would seem to be a quick way to get an "unfair" advantage...

Opterons (Deep Sjeng and ParSOS), and dedicated processors are already
being used (Brutus).

--
GCP

Tommy

unread,

Nov 22, 2003, 6:25:54 AM11/22/03

to

"Fortknight" <X...@x.com> wrote

> You can get the bitboards down to 8, (6 for each of the piece types,
> and a white piece mask, and a black piece mask),

Yes I know that, but if you do so you then use bitboards and *not* rotated
bitboards. Traditional bitboards are not as fast, especially as far as move
generation is concerned.....

> So, on 3 GHz machine, I give you 100 clock cycles (and that should be
> quite alot for both move generation and eval)

100 clock cycles seems nothing to me. Obviously I am new to chess
programming so I might be wrong, but I think you are looking at a higher
figure here.

> to get to 30M nodes per second.

I have told you about this in my last post ....
being able to do evaluation and move generation in 100 clock cycles does
*not* mean being able to do 30M nodes per second!
I know the math you are using: you are thinking that in a second you have
3.000.000.000 clock cycles and that each node takes 100 cc,so you get you
get 30.000.000 nodes per second. This calculation is ok, *but* being able to
do evaluation and move generation in 100 clock cycles does *not* mean being
able to do each node in 100 cc. This is true because not all the node
require move generation, for example leaves do not require move generation
and, by the way, leaves are the majority of you nodes. In leaves you might
only want to see if you are in check and then, if you are, see if this a
check mate (and this might involve something similar to move generation),
but this is not done all the times.

So, you do not need to generate moves in all nodes... but in many nodes (the
ones in which you generate moves..) you need to do other things like placing
and unplacing the move... this requires some clock cycles as well by the
way.
Also, if you write a "state of the art" chess program (and you need to do so
to be fast and strong) then you have to do many other things like, hash
tables, move ordering etc....

My program for example can do 150K move generations per second but 1M nodes
per second, so almost 10 times more.
Actually I have just solved a problem..... I was by mistake generating moves
at all nodes and getting a speed of about 60K nodes per second !!!
There have been quite a few posts on this ;-)

> This would seem to be a worthwhile "optimization" as you should be
> able to get a magnitude increase in search depth against other
> computer opponents.

Going at 3Mnodes/sec and going at 30Mnodes/sec is clearly an order of
magnitude increase, but.... in speed *not* in depth!!!
To go 10 times deeper you need *much more* than an order of magnitude more
in speed!

> I will grant you that this is not portable, but
> that is fine, the goal is to create a program to blow away the
> computer opponents (And if you can do that, riches *will* follow as
> the best program always attracts the most dollars, making it
> worthwhile).

Yes, if you could go 10 times deeper that would be fantastic ;-)
I hope you do....

Tommy

unread,

Nov 22, 2003, 6:36:42 AM11/22/03

to

"Gian-Carlo Pascutto" <nat...@hotmail.com> wrote

> > Granted this path would want a dedicated processor so as to not have
> > significant register and cache thrashing from the operating system,
> > but if you were looking to *win* the next years computer chamionship
> > it would seem to be a quick way to get an "unfair" advantage...

> Opterons (Deep Sjeng and ParSOS),

Could you please tell the performance improvement you experience when
running Deep Sjeng on a 32bit with a certain clock and then on a 64 bit
machine with the same clock? (I know they are faster also in 32 application,
so it does not really help, but still gives me an idea...)
Also, does Deep Sjeng use rotated bitboards? If so, what have you done to
optimise it for opteron? I think you should get a "natural" improvement by
compiling for 64bit machine.....

> and dedicated processors are already
> being used (Brutus).

Yes, but it is worth saying that Brutus does not use a dedicated processor
in 2 way system, but a dedicated processor on a separate card.
With a 2-way opteron system I would rather use both processors in parallel
then try and get one dedicated for my application ;-)

Tommy

Gian-Carlo Pascutto

unread,

Nov 23, 2003, 3:12:14 AM11/23/03

to

> Could you please tell the performance improvement you experience when
> running Deep Sjeng on a 32bit with a certain clock and then on a 64 bit
> machine with the same clock?

The Opteron is 70% faster, clock for clock, than the Athlon XP, and 100%
faster than the Pentium 4.

--
GCP

David Richerby

unread,

Nov 23, 2003, 8:12:58 AM11/23/03

to

Nanga Parbat <m...@privacy.net> wrote:
> Have you any idea what type of benchmark comes closest to a chess
> program?

The SPEC 2000 CPU benchmarks include Crafty:

http://www.spec.org/cpu2000/CINT2000/186.crafty/docs/186.crafty.html

Dave.

--
David Richerby Permanent Unholy Monk (TM): it's like
www.chiark.greenend.org.uk/~davidr/ a man of God but it's also a crime
against nature and it'll be there
for ever!

David Richerby

unread,

Nov 23, 2003, 12:31:19 PM11/23/03

to

Nanga Parbat <m...@privacy.net> wrote:

> David Richerby wrote:
>> The SPEC 2000 CPU benchmarks include Crafty:
>

> Nice to know :-) Thanks.
>
> But this benchmark is not often used in reviews.

It's not often used in computer magazine reviews but it's the most widely-
respected CPU-power benchmark in the industry, as far as I'm aware. All
the major manufacturers put a lot of effort into scoring well in the SPEC
suite.

Dave.

--
David Richerby Psychotic Newspaper (TM): it's like
www.chiark.greenend.org.uk/~davidr/ a daily broadsheet but it wants to
kill you!

Russell Reagan

unread,

Nov 24, 2003, 12:58:00 AM11/24/03

to

"Fortknight" <X...@x.com> wrote

> lots of registers (very fast operation),
> and 3 megs of L1 Cache

The Opteron doesn't have a lot of registers compared to some other
processors. It has 16 64-bit general purpose registers, 16 128-bit xmm
registers, and 8 64-bit mmx registers. That is more than x86, but saying
that a processor has more registers than an x86 processors isn't saying much
;-)

Also, it does not have 3 MB of L1 cache. It has 1 MB of L2 cache, 128 KB of
L1 cache (64 KB data, 64Kb instructions), and no L3 cache. The Itanium has
L3 cache, and 3 MB of it sounds about right. Maybe that's what you were
thinking of.

> Would not that magnitude of improvement provide a significant increase
> in elo strengthm beyond the 40 pts a year we are at now?

All the tests I have seen of bitboard engines show that an Opteron is about
60-70% faster than an equally clocked 32-bit Athlon. Crafty was 60% faster,
and Sjeng was 70% faster, when both were compiled for the Opteron.

Using an Opteron doesn't provide more ELO improvements each year. Going from
32-bit to 64-bit is a one time boost. As the cpus get faster, that will give
more benefits, but cpus would get faster whether they were 32-bit or 64-bit.