Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Fast implementation of core algorithm for Conway's Game of Life

3 views
Skip to first unread message

Rolf Wilms

unread,
Feb 18, 2003, 7:16:44 PM2/18/03
to
I think I've found a very fast way to implement the core algorithm for
Conway's Game of Life. It deals with deciding which cells should live or be
dead only, no optimizations for empty space, oscillators or gliders.

A Java implementation can be found here:
http://home.arcor.de/rolf_wilms/gameoflife/gameoflife.html

Comments are welcome.

Ciao,
Rolf
Features:
a.. no LUT
b.. very compact implementation
c.. operates directly on bit-mapped cells (no transformation)
d.. uses only one conditional
e.. easily adaptable to use SIMD instruction sets such as from MMX and SSE
Performance:
Language Environment Cell Generations per Second
Java Bea JRockit 7.0 (2) 790123000
Java Sun JRE 1.4.1 434082000
Java Sun JRE 1.4.1 -server 541798000
C MS VC++ 6.0 full opt. (1) 620000000
C Intel C++ full opt. (PGO) (1) 1800000000

Performance data measured on a P4, 2000 Mhz.

(1) Recalled from memory

(2) Interestingly, this shows that JRockit 7.0 is faster than MS VC++ 6.0

How it works:
The algorithm counts the number of living cells in a 3x3 field to determine
if the center cell should live or be dead. There are always 32 counters
(sizeof int) working in parallel. Each counter is represented by one bit it
the int machine word. Three machine words are used to represent the state of
32 counters. The algorithm also exploits the fact, that overlapping 3x3
fields share a number of cells to be counted.

The Code:
This is an implementation in Java. The interesting class is
rwi.ca23_3.CA23_3. Its main method executes a benchmark. The method
computeNextGeneration and the inner class CellCounter are key.


Paul Chapman

unread,
Feb 18, 2003, 8:15:50 PM2/18/03
to
Rolf,

Sounds exciting! I will have a look at your implementation later.

Meanwhile, can you try some larger patterns (say 10,000 live cells) and compare with Life32 (http://psoup.math.wisc.edu/Life32.html)?

Cheers, Paul

Paul Chapman

unread,
Feb 18, 2003, 10:43:41 PM2/18/03
to
Rolf,

Oh, and I see you're a fellow Smalltalker. Did you prototype the algorithm in Smalltalk? Do you have vanilla Smalltalk source to look at?

Cheers, Paul

Rolf Wilms

unread,
Feb 19, 2003, 6:00:49 PM2/19/03
to
Paul,

I've tried to make a comparison to Life32. However, such a comparison is
somewhat problemeatic. This is why:

- Life32 displays cells, my implementation does not. In order to minimize
the effect, I've set framedropping to the max in Life32.

- Life32 has optimizations for stable and empty regions, my implementation
has not (yet). Depending on the choice of life pattern, this may speed up
Life32 (when optimizations are possible) or slow it down (overhead when no
optimizations possible). My implementation only depends on the size of the
universe. Since the optimizations would make for an arbitrary difference, I
chose a pattern packed full with period-4 oscillators. Life32 will not
benefit from its optimizations this way, so the speed of the core algorithm
will be put under test.

- Life32 is implemented in Delphi, my implementation is Java. One could
theoretically compare my implementation to Alan Hensel's Life Applet, which
is in Java too and AFAIK Life32 uses a 1:1 Delphi port of Alan's algorithm.

Ok, now for the results:

Life32 version: 2.15
Universe size: 268x212
Number of living cells: 13824
Cell generations per second: 331,200,279

My implementation
Universe size: 256x256
Number of living cells: doesn't matter
Cell generations per second: 759,397,000

Thus my implementation is about 2.3 times faster than Life32. I'd
extrapolate that using the Intel C++ compiler for my implementation a factor
of 5 could be reached and using SSE2 maybe a factor of 7.

Ciao,
Rolf

"Paul Chapman" <pa...@CHOPigblanCHOP.com> schrieb im Newsbeitrag
news:oYA4a.4220$Vx2.381386@wards...

Rolf Wilms

unread,
Feb 19, 2003, 6:08:55 PM2/19/03
to
Paul,

I didn't prototype it in Smalltalk, but there's some chocolate Java source
to look at on my web page... ;-)

Ciao,
Rolf


"Paul Chapman" <pa...@CHOPigblanCHOP.com> schrieb im Newsbeitrag

news:sgD4a.4235$Vx2.383426@wards...

Paul Chapman

unread,
Feb 19, 2003, 6:50:36 PM2/19/03
to
Rolf,

> Life32 has optimizations for stable and empty regions, my implementation
> has not (yet).

Yup, I looked at your code and saw that it was so far just a raw implementation of the basic algorithm.

> Thus my implementation is about 2.3 times faster than Life32. I'd
> extrapolate that using the Intel C++ compiler for my implementation a factor
> of 5 could be reached and using SSE2 maybe a factor of 7.

I did a quick speed check with Life32 and did a VERY rough mental calculation, taking into account the speed of my CPU etc. I came up with a factor of 2. However, adding the extra Life32-style logic to deal with sparse patterns may slow down your implementation further.

Life32 works on 16x16 blocks containing 8x8 subblocks IIRC. Your algorithm obviously works at its best when all 32 bits of the machine word are used, which would require the use of 32xN blocks for some N, say 8, 16 or 32. A larger block size would also make direct comparisons more difficult. Denser patterns might run faster in RolfLife, but sparser patterns, eg those which utilise thin glider streams, might not be so good.

I'll give some thought to putting some kind of Life32 framework around your algorithm to check it out. But don't hold your breath. :)

Cheers, Paul

Rolf Wilms

unread,
Feb 20, 2003, 5:09:12 PM2/20/03
to
Paul,

you're right, the granularity of at least 32 columns is really an issue.
With 32x32 tiles RolfLife would have to compute 16 times more (empty) cells
than Life32 in the worst case of a sparse universe. In this case RolfLife
would probably fall way behind Life32.

Maybe the RolfLife algorithm could be of value to someone implementing life
i.e. in the vertex shader of a (DirectX 9) graphics card, where few lines of
code and no LUT are favoured. Also, graphics cards usually have a richer set
of boolean ops than ordinary processors, which could also be exploited for
the algorithm.

Anyway, thanks for your analyisis. If you find anything more, please let me
know.

Ciao,
Rolf


0 new messages