
New Life implementation


Tony Finch

Nov 28, 1994, 12:42:22 PM
I wrote this life program (as described below) and it went pleasingly
quickly. I wonder if anyone else has come across the same algorithm?

Tony.

[If anyone has an Acorn Archimedes and wants to try it out they should be
able to get the program from micros.hensa.ac.uk -- look at the file index
in /micros/arch/riscos to find out what number they've given it]

This file contains technical details of ListLife, a fast (1000
gens/sec!) implementation of Conway's Game of Life by Tony Finch.

It is called ListLife because it stores the Life arena as a list of the
positions of the live cells (rather than in a bitmap). This means the
arena can be large (2e9 by 2e9) whilst taking up little space and time.
Memory limits the number of live cells to about 2 million (if you can
get a 16MB wimpslot). Thanks to Olly Betts for this idea. I couldn't
work out how his program worked, so I wrote this one, which turned out
to be three times faster!

The list of live cells is stored in order by rows. Since each row has
the same Y coordinate throughout, that is stored once per row. To
distinguish Y coordinates from X coordinates, Ys are negative and Xs are
positive. This saves time as well as space since the program does not
have to deal with Y coordinates when looping along a line.
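
As a concrete illustration of the encoding (my reconstruction from the
description, not taken from the program itself): a horizontal blinker
occupying cells (10,20), (11,20) and (12,20) would be stored as the
word sequence

    -20, 12, 11, 10

i.e. one negative Y for the row followed by its X coordinates, here
written in the decreasing order described below.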

The inner loop scans along 3 adjacent lines together, maintaining a
small (3x3) bitmap in one of the registers. This bitmap is the
neighbourhood of the cell that is being examined. Because this bitmap
takes only 512 different values, the operations of counting up the live
neighbours and deciding whether the cell should live or die can be done
by a simple table lookup and compare with 0.
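
A minimal C sketch of such a rule table (my own illustration; the real
program is in ARM assembler, and the bit layout here, with the centre
cell at bit 4, is an assumption):

    unsigned char rule[512];   /* next state, indexed by 3x3 bitmap */

    void init_rule(void)
    {
        int bits, n, i;
        for (bits = 0; bits < 512; bits++) {
            n = 0;
            for (i = 0; i < 9; i++)         /* count live neighbours, */
                if (i != 4 && (bits & (1 << i)))
                    n++;                    /* skipping the centre bit */
            if (bits & (1 << 4))
                rule[bits] = (n == 2 || n == 3);   /* survival */
            else
                rule[bits] = (n == 3);             /* birth */
        }
    }

With the table built, the inner loop's decision is a single load and a
compare with 0, as described.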

Blank areas are skipped by comparing the bitmap with 0, then seeing
which coordinate from the 3 lines is next and continuing from there.
This is when the program checks for the end of the line. Because the X
coordinates are sorted into decreasing order, the program just finds the
maximum of the next coordinates on the three lines (which values are
cached in the registers). If this value is less than 0, then the line is
complete.
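
In C the skip step might look like this (a sketch under the stated
conventions; xa, xb and xc stand for the cached next coordinates on
the three lines):

    int next = xa;                 /* maximum of the three caches */
    if (xb > next) next = xb;
    if (xc > next) next = xc;
    if (next < 0)
        /* next value is a Y coordinate: this line is complete */;

Because Ys are negative, an exhausted line's cache automatically loses
the comparison, so end-of-line falls out of the same test.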

The algorithm does special things to handle X loops where fewer than
three lines are occupied. These cases could be handled by the 3-line
loop by leaving the relevant next-coordinate-cache register negative,
but special-case code is quicker.

The plot routine is worth a mention. The coordinates of each point
have to be checked against the screen limits; the inner loop is
carefully crafted so that this check is combined with calculating the
point's offset from the edge of the screen, so saving an instruction.
This loop could probably do with being unrolled. [Note: Direct screen
access is used so this loop is a subtract, a store byte and a branch.]
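
One way to get this effect in C (my reading of the description; the
actual routine is ARM assembler, and xmin, ymin, WIDTH, HEIGHT and
screen are invented names):

    unsigned int dx = (unsigned int)(x - xmin);
    unsigned int dy = (unsigned int)(y - ymin);
    if (dx < WIDTH && dy < HEIGHT)
        screen[dy * WIDTH + dx] = colour;   /* direct screen store */

The unsigned comparison folds the "below the lower limit" and "past
the upper limit" checks into one, and dx and dy are exactly the
offsets needed for the store.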

The larger structure is interesting too. The previous generation is
removed from the screen (clearing the whole screen is slower) and then
the new generation is plotted. If these are done sequentially, the
display becomes very flickery, so the display is unplotted and replotted
line by line. You can see this quite well on the Maximum Growth pattern.

My usual benchmark is the Rabbit, which starts with 9 cells and grows
for 17331 generations, growing to a population of about 2400 cells.
On my 25MHz A5000 [based on the ARM3 RISC processor similar to that
used in the Apple Newton] it averages 169 gens/sec. On a Risc PC 600
[ARM 610 33MHz I think] of unknown details it did 222 gens/sec. To
justify my claim of 1000 gens/sec, the program can do an R pentomino
(1103 generations, c. 200 cells) in 0.82 secs. These timings were made
in mode 0 [a low-bandwidth screen mode] without plotting (whoever said
benchmarks had to reflect reality?).

Tony.

Micah Beck

Nov 28, 1994, 6:55:03 PM
In article <fanf2-28119...@selkis.trin.cam.ac.uk> fa...@cam.ac.uk (Tony Finch) writes:

> My usual benchmark is the Rabbit, which starts with 9 cells and grows
> for 17331 generations, growing to a population of about 2400 cells.
> On my 25MHz A5000 [based on the ARM3 RISC processor similar to that
> used in the Apple Newton] it averages 169 gens/sec. On a Risc PC 600
> [ARM 610 33MHz I think] of unknown details it did 222 gens/sec. To
> justify my claim of 1000 gens/sec, the program can do an R pentomino
> (1103 generations, c. 200 cells) in 0.82 secs. These timings were made
> in mode 0 [a low-bandwidth screen mode] without plotting (whoever said
> benchmarks had to reflect reality?).

1000 gens/sec doesn't mean much for a sparse implementation, where a
generation is not a well-defined unit of computation.

The statistic I use for characterizing the speed of dense CA programs is
seconds per cell-generation, computed by dividing the time of a run by
the total number of cells in the grid and the number of generations computed.

Could you give a similar statistic for your sparse implementation (perhaps
using the size grid necessary to contain the entire run if it were implemented
in a dense manner)? Another useful statistic would be seconds per non-zero
cell-generation, which is obtained by dividing sec/cell-gens by the average
density of the grid.
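
As a worked example of the metric (numbers invented for illustration):
a dense program that computes 1000 generations of a 512 x 512 grid in
10 seconds scores 10 / (512 * 512 * 1000), i.e. roughly 3.8e-8
sec/cell-gen; dividing by an average live density of 4% gives about
9.5e-7 seconds per non-zero cell-generation.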

Micah Beck
Assistant Professor
be...@cs.utk.edu Department of Computer Science
(615) 974 - 4398 University of Tennessee
- 4404 (FAX) Knoxville, TN 37996-1301

Tony Finch

Nov 29, 1994, 10:27:44 AM
In article <BECK.94No...@beck.cs.utk.edu>, be...@beck.cs.utk.edu
(Micah Beck) wrote:

> The statistic I use for characterizing the speed of dense CA programs is
> seconds per cell-generation, computed by dividing the time of a run by
> the total number of cells in the grid and the number of generations computed.
>
> Could you give a similar statistic for your sparse implementation (perhaps
> using the size grid necessary to contain the entire run if it were implemented
> in a dense manner)? Another useful statistic would be seconds per non-zero
> cell-generation, which is obtained by dividing sec/cell-gens by the average
> density of the grid.

The space occupied by a pattern is a hard thing to gauge sensibly (because
of things like gliders wandering off towards infinity etc.), but the
number of live cells is easy since it is close to the memory occupied by
the pattern data (1 word for each live cell + 1 word for each line + 2).
Totalling this over a run and timing it gives roughly live-cell-generations
per second, which is the reciprocal of your last suggestion.

The values I measure on my Acorn A5000 (25MHz ARM3) are:

                 total cells   live cells/sec   live cells/sec   generations
                                (no plotting)       (plotting)

    rabbit           104 M         1.01 M           0.705 M          17331
    r pentomino    0.985 M         1.20 M           0.814 M           1103
    max growth      50.5 M         1.58 M           0.967 M            512
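
(As a cross-check on the earlier figure: 104 M total cells at 1.01 M
live cells/sec is about 103 seconds for the rabbit's 17331 generations,
i.e. roughly 168 gens/sec, in line with the 169 gens/sec quoted in the
original posting.)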

The average density of the rabbit is close to the density of the scatterings
left when a random field settles, i.e. 4 percent or so. This means that there
are many small gaps to be skipped, which slows down the throughput. The r
pentomino has a higher density since more of the cells are actively evolving
throughout the 1103 generations. Max growth has the highest density of all
(nearly 50%) so has the best throughput.

For comparison, I have a 32 bit parallel adder implementation that does a
512*480 field 30 times a second which is 7.4 M cells/sec. At 4% density this
is 0.3 M live cells per second.

I didn't include detailed numbers in the original posting since I am mostly
interested in the existence or not of similar algorithms. Also few people
will be able to make direct comparisons with these numbers since Acorn
machines are not very common outside Britain.

John Harper

Dec 1, 1994, 5:15:46 AM
> I wrote this life program (as described below) and it went pleasingly
> quickly. I wonder if anyone else has come across the same algorithm?
>
> Tony.
>

My implementation works in a basically similar way, keeping
track of a state which represents one of the 512 possible
combinations of live/dead neighbours plus self, then using
a lookup table to decide what to do about it. The idea of
just keeping track of the X-coordinates was one that I used
in my PDP-8 implementation which I wrote in 1973. A slight
wrinkle is to terminate the row with an impossibly HIGH value,
which avoids some special casing.
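
A minimal C sketch of the sentinel idea (my own illustration, assuming
rows sorted in increasing X, the opposite convention to Tony's):

    #include <limits.h>

    #define ROW_END INT_MAX    /* the "impossibly high" terminator */

    /* Next occupied column across three rows: an exhausted row ends
       with ROW_END, so it never wins the comparison, and "all rows
       done" falls out as next == ROW_END with no extra test. */
    int next_column(int xa, int xb, int xc)
    {
        int next = xa;
        if (xb < next) next = xb;
        if (xc < next) next = xc;
        return next;
    }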

Mine (the current one, not the PDP-8 one) is written in C++.
Using MS VC++ in "full optimization" mode, it takes about
4 seconds to grow the R-pentomino to 1105 generations on a
486/66. No doubt hand-coding would speed this up, but there
seems little point since in reality the performance is completely
dominated by updating the display. On Windows this is fairly
painful; at some point I guess I'll make it use WinG, but even
so there are a heck of a lot of bits to be moved. It is also
very sensitive to the quirks of the video card.

John

Paul Callahan

Dec 9, 1994, 9:58:11 PM
fa...@cam.ac.uk (Tony Finch) writes:

>I wrote this life program (as described below) and it went pleasingly
>quickly. I wonder if anyone else has come across the same algorithm?

Certainly, much work has been done on trying to get Life to run as
fast as possible, and the idea of exploiting sparsity (i.e., using
a list rather than a matrix) has been rediscovered on many occasions.
I'm a little surprised that you managed to attain the speeds you
mention without exploiting bit parallelism as well (i.e., noting that
a 32-bit machine is actually a vector processor that can do 32
logical ANDs, ORs, XORs, etc. simultaneously, and can even add vectors of
eight numbers if each sum is guaranteed to be in the range [0-15]).
The fastest implementations I've seen (and even written a non-optimized
variation of) tend to exploit both sparsity and parallelism by storing
the pattern as a list of boxes, each of which is a bitmap representing
a non-empty patch of the pattern.
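
A small C sketch of that trick (my own illustration, not taken from
any of the implementations mentioned):

    typedef unsigned int u32;          /* assumed 32 bits */

    /* Spread eight cell bits (0 or 1) into eight 4-bit lanes. */
    u32 spread(u32 bits8)
    {
        u32 r = 0;
        int i;
        for (i = 0; i < 8; i++)
            r |= ((bits8 >> i) & 1u) << (4 * i);
        return r;
    }

    /* One ordinary add now performs eight lane-wise additions,
       with no carry between lanes as long as every 4-bit sum
       stays in the range [0-15]. */
    u32 add8(u32 a, u32 b) { return a + b; }

Summing the spread forms of a cell row's eight neighbouring rows and
shifts gives eight neighbour counts per word; the maximum count of 8
fits comfortably in a 4-bit lane.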

>The inner loop scans along 3 adjacent lines together, maintaining a
>small (3x3) bitmap in one of the registers. This bitmap is the
>neighbourhood of the cell that is being examined. Because this bitmap
>takes only 512 different values, the operations of counting up the live
>neighbours and deciding whether the cell should live or die can be done
>by a simple table lookup and compare with 0.

This might be giving you a speedup above other sparse-list implementations.
I suppose you can shift the neighborhood window in fewer operations than
it would take to compute a neighborhood from scratch. I still say that
if it's running fast now and bits 9-31 of your words are usually zero,
then there is a way to make it run still faster.

>My usual benchmark is the Rabbit, which starts with 9 cells and grows
>for 17331 generations, growing to a population of about 2400 cells.

If you haven't already, you may want to look at Alan Hensel's PC implementation
in /pub/complex_systems/alife/life/life16.zip at life.anu.edu.au (there
are also patterns provided in lifep.zip and lifepw.zip). It's highly
optimized, and is the fastest implementation I've seen for the PC. I
just ran rabbit for 17331 generations and timed it with my watch at 219
seconds on a 386DXL 33MHz laptop, which averages to 79 generations/sec.
This was with plotting, too, since I don't think there's a way to turn
it off, and all I have is the executable. Obviously, there are faster PCs
out there. Since I don't know anything about your machines, I don't know how
to compare the numbers.

>To
>justify my claim of 1000 gens/sec, the program can do an R pentomino
>(1103 generations, c. 200 cells) in 0.82 secs.

I can do 10^8 generations per second in my head. To justify this
claim, I'll start with a glider at the origin, and in four seconds,
I'll tell you where it is after 4x10^8 generations. (Not intended as
a flame or anything. I think in general the most useful benchmark is
whichever one shows your code in the worst light, because it gives
you an idea of what you have to fix).
--
Paul Callahan
call...@biffvm.cs.jhu.edu
