Newsgroups: comp.arch, comp.arch.fpga, comp.arch.embedded
From: roki...@cello.hpl.hp.com (Tom Rokicki)
Date: 1998/05/25
Subject: Re: [++] Fast Life code (Was:Re: FPGA-based CPUs (was Re: Minimal ALU instruction set))
> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> . . .

You can beat this (in terms of number of logical operations and shifts)
by quite a bit. Here, `g' is the original, and only input.

sl3=(sl2=(a=left(g))^(b=right(g)))^g
sll=(a=up(sl3)^(b=down(sl3)))^sl2

I believe that's 19 logical operations, one left, one right, two ups

> Actually, it should be quite easy to get close to 2 IPC, because there's
> a lot of independent operations all the way to the end.

You bet!

> As you've just discovered, to get max speed you must maintain a back
> buffer in RAM, and then write updated blocks to the display.

Yep, and you need to block the algorithm appropriately so it fits in cache.
This is pretty easy to do; just do the above algorithm in appropriately
sized strips. Further, it's pretty easy to block out (not process) areas
that are static or oscillating with period 2 (which are terribly common in
Life); I generally use two alternating buffers and keep a `superbitmap' of
those chunks that are changing with period >2.

> The 120 MB/sec required write speed (for 60 fps) will definitely
> overload a PCI bus, which has a (very) theoretical max speed of 133
> MB/sec on a long burst.

Which is why you do the delta. Indeed, what I did is `stupider' than
that. There's no sense updating the display at greater than the frame
rate but it's easy to calculate at greater than frame rate. So I don't
update on every generation, just on every frame. And then I only update
the deltas, which are often quite small compared to the real data.

> When you've optimized the code, then you'll discover that the problem
> really is memory bandwidth and nothing else.

I'm not so sure about this; it's pretty easy to make the loads/stores
overlap pretty well. Of course, I did it on the 68000 where there are
enough registers; I'm not sure about the x86 world.
The above code was completely designed by me, although I'm sure others
(I actually implemented the above on an HP calculator in user-RPL, and

Here's 48G code for anyone who cares; just put a GROB on the stack and
GEN

<< WHILE 1 REPEAT DUP ->LCD GEN1 END >>

-tom
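
For anyone who wants to experiment with the bitwise approach discussed above, here is a minimal C sketch of one Life generation computed 64 cells at a time. It is not the 19-operation formulation from the post, which is cut off in this copy; it is a plain carry-save adder version and uses more operations. The names `hsum` and `life_step`, the fixed 64-cell row width, and the dead border around the grid are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For every column of one row, count the live cells among the cell itself
   and its two horizontal neighbours (0..3). The count is returned as two
   bit planes: s0 holds the ones bit, s1 holds the twos bit. */
static void hsum(uint64_t r, uint64_t *s0, uint64_t *s1)
{
    uint64_t l = r << 1;   /* bits shifted past the word edge are dropped */
    uint64_t rt = r >> 1;
    *s0 = l ^ rt ^ r;                   /* parity of the three cells */
    *s1 = (l & rt) | ((l ^ rt) & r);    /* majority: at least two alive */
}

/* One generation on a grid of `rows` rows, 64 cells per row, one bit per
   cell. Cells outside the grid are treated as dead (assumption). src and
   dst must be different buffers, as with two alternating buffers. */
void life_step(const uint64_t *src, uint64_t *dst, size_t rows)
{
    assert(src != dst);
    for (size_t y = 0; y < rows; y++) {
        uint64_t cur = src[y];
        uint64_t u0, u1, m0, m1, d0, d1;
        hsum(y > 0 ? src[y - 1] : 0, &u0, &u1);
        hsum(cur, &m0, &m1);
        hsum(y + 1 < rows ? src[y + 1] : 0, &d0, &d1);

        /* Add the three row counts. total = t0 + 2 * (u1 + m1 + d1 + carry)
           is the number of live cells in the 3x3 block, centre included. */
        uint64_t t0 = u0 ^ m0 ^ d0;
        uint64_t carry = (u0 & m0) | ((u0 ^ m0) & d0);

        /* Sum the four weight-2 bits; only "equals 1" and "equals 2" matter. */
        uint64_t lo_a = u1 ^ m1, hi_a = u1 & m1;
        uint64_t lo_b = d1 ^ carry, hi_b = d1 & carry;
        uint64_t p = lo_a ^ lo_b;                     /* low bit of that sum */
        uint64_t twos = hi_a ^ hi_b ^ (lo_a & lo_b);  /* next bit of that sum */

        uint64_t eq3 = t0 & p & ~twos;    /* block total == 3 */
        uint64_t eq4 = ~t0 & ~p & twos;   /* block total == 4 */

        /* Total 3: birth, or survival with 2 neighbours.
           Total 4 with a live centre: survival with 3 neighbours. */
        dst[y] = eq3 | (cur & eq4);
    }
}
```

To run it, keep two arrays of `uint64_t` and swap their roles each generation, calling `life_step(a, b, rows)` and then `life_step(b, a, rows)`. Comparing `dst[y]` with `src[y]` after each call is one cheap way to find the rows that changed, which is the information a delta display update needs.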
