Since very long I have an implementation of the Sieve of Eratosthenes
that has a bitmap of 30 numbers per byte.
Of each sequence of 30 numbers only half is coprime with 2,
2/3 is coprime with 3 and 4/5 is coprime with 5, leaving
1/2 * 2/3 * 4/5 * 30 = 8 numbers.
For a prime say p, where 11*p = 1 mod(30) we must flip bit 1
in byte (11*p)/30 and then going on with a stride of p.
The sieve is filled using wheels. This means the first 7 bytes
are filled with a patter where 7-folds are missing. This pattern
can then repeated 11 times using an unintelligent move and then
the 11 folds are removed etc.
With 32 Gbyte we then have a sieve holding 10^12 numbers, and all
primes until 31 or so taken care off. 1]
Then for each prime we must remove the multiples using mask
0x01 0x02 0x04 .. 0x80 .
This is a sure cache miss every time.
I've 8 processors, and I could deal out each of the masks to one
processor, running in the same Forth with 8 preemptive threads
and one sieve.
In view of the fact that cache misses remain the same, do I get
- no speedup at all, just lower processor load
- some speed up, but nowhere near a factor 8
- a decent speedup approaching 8.
(There are only a few critical loops. Assume they are written in
assembler or a good optimizing compiler.)
The basic operation is
get a byte from address N
AND it with a mask
store it back
This is one instruction for the usual Intel/AMD processor fortunately.
I think that means that two processors updating the same byte
do not interfere. (that is rare, but anyway)
Any insights are welcome.
1]
In recent post someone wanted to fill the sieve first with zeroes,
that is several seconds of computer time wasted.
Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&
c.xs4all.nl &=n
http://home.hccnet.nl/a.w.m.van.der.horst