Gmail Calendar Documents Reader Web more »
Recently Visited Groups | Help | Sign in
Google Groups Home
Message from discussion Fast Life code (Was:Re: FPGA-based CPUs (was Re: Minimal ALU instruction set))
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Tom Rokicki  
View profile  
 More options May 25 1998, 3:00 am
Newsgroups: comp.arch, comp.arch.fpga, comp.arch.embedded
From: roki...@cello.hpl.hp.com (Tom Rokicki)
Date: 1998/05/25
Subject: Re: [++] Fast Life code (Was:Re: FPGA-based CPUs (was Re: Minimal ALU instruction set))

>     ones = up ^ upleft;
>     twos = up & upleft;
>     carry = ones & upright;
>   . . .

You can beat this (in terms of number of logical operations and shifts)
by quite a bit.  Here, `g' is the original, and only input.

sl3=(sl2=(a=left(g))^(b=right(g)))^g
sh3=(sh2=a&b)|(sl2&g)

sll=(a=up(sl3)^(b=down(sl3)))^sl2
slh=(a|(b^sl2))^sll
a=up(sh3)^(b=down(sh3))
g=(a^sh2^slh)&((a|(b^sh2))^slh)&(sll|g)

I believe that's 19 logical operations, one left, one right, two ups
and two downs (this assumes the ups and downs are cheap).

> Actually, it should be quite easy to get close to 2 IPC, because there's
> a lot of independent operations all the way to the end.

You bet!

> As you've just discovered, to get max speed you must maintain a back
> buffer in RAM, and then write updated blocks to the display.

Yep, and you need to block the algorithm appropriately so it fits in cache.
This is pretty easy to do; just do the above algorithm in appropriately
sized strips.  Further, it's pretty easy to block out (not process)
areas that are static or oscillating with period 2 (which are terribly
common in Life); I generally use two alternating buffers and keep a
`superbitmap' of those chunks that are changing with period >2.

> The 120 MB/sec required write speed (for 60 fps) will definitely
> overload a PCI bus, which has a (very) theoretical max speed of 133
> MB/sec on a long burst.

Which is why you do the delta.  Indeed, what I did is `stupider' than
that.  There's no sense updating the display at greater than the
frame rate but it's easy to calculate at greater than frame rate.
So I don't update on every generation, just on every frame.  And then
I only update the deltas, which are often quite small compared to the
real data.

> When you've optimized the code, then you'll discover that the problem
> really is memory bandwidth and nothing else.

I'm not so sure about this; it's pretty easy to make the loads/stores
overlap pretty well.  Of course, I did it on the 68000 where there are
enough registers; I'm not sure about the x86 world.

The above code was completely designed by me, although I'm sure others
have found a similar solution.

(I actually implemented the above on an HP calculator in user-RPL, and
on the Amiga, both with the blitter and in assembler.  I keep meaning
to get around to speeding up xlife but never can seem to find the time.)

Here's 48G code for anyone who cares; just put a GROB on the stack and
hit `GEN':

GEN << WHILE 1 REPEAT DUP ->LCD GEN1 END >>
GEN1 << {#0 #1} DUP2 SH OVER LX ROT REVLIST
  SWAP OVER SH 5 ROLLD 4 ROLLD SH 4 PICK LX
  ROT 3 PICK + NEG + 4 ROLLD NEG + LX + NEG >>
SH << DUP2 OVER DUP ROT {#FFFh #FFFh} SUB
  LX 3 DUPN 7 ROLLD GXOR 5 ROLLD GXOR + >>
LX << {#0 #0} SWAP GXOR >>

-tom


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google