Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Sieve of Atkin

314 views
Skip to first unread message

Marcel Hendrix

unread,
May 20, 2022, 8:53:26 AM5/20/22
to
Apparently the fastest possible Sieve (in the limit).
https://www.baeldung.com/cs/prime-number-algorithms .

Unfortunately, the pseudo-code (?) is somewhat ambiguous.

-marcel

Jan Coombs

unread,
May 20, 2022, 11:48:44 AM5/20/22
to
The explanation is good, thanks. This seems to work, how's your python?

This is the optimized implementation proposed by Zsolt KOVACS:
from:
https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python


Marcel Hendrix

unread,
May 20, 2022, 9:41:30 PM5/20/22
to
On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:
> On Fri, 20 May 2022 05:53:24 -0700 (PDT)
[..]
> This is the optimized implementation proposed by Zsolt KOVACS:
> from:
> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

Optimized? Maybe according to Python standards...

(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : 02:27:42, May 21, 2022, Marcel Hendrix
*)


NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.01 ---"

PRIVATES

0 VALUE P
0 VALUE sz
#16384 VALUE limit
CREATE sieve PRIVATE limit 1+ ALLOT

: SieveOfAtkin ( -- )
sieve limit 1+ ERASE
HERE TO P 2 , 3 , 2 TO sz
limit SQRT 2+
1 DO limit SQRT 2+
1 DO J SQR 4 * I SQR + >S
S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
J SQR 3 * I SQR + >S
S limit <= S #12 MOD 7 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
J SQR 3 * I SQR - >S
J I > S limit <= AND S #12 MOD #11 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
LOOP
LOOP
limit SQRT 1+ 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP
limit 1+ 5 DO sieve I + C@ IF I , 1 +TO sz ENDIF LOOP
P sz ;

: PSHOW ( addr u -- ) CR 0 ?DO @+ . LOOP DROP ;

:ABOUT CR ." Try: SieveOfAtkin PSHOW " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

-marcel

dxforth

unread,
May 20, 2022, 10:10:20 PM5/20/22
to
On 21/05/2022 11:41, Marcel Hendrix wrote:
> On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:
>> On Fri, 20 May 2022 05:53:24 -0700 (PDT)
> [..]
>> This is the optimized implementation proposed by Zsolt KOVACS:
>> from:
>> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
>
> Optimized? Maybe according to Python standards...
>
> (*
> * LANGUAGE : ANS Forth with extensions
> *)

"ANS Forth" ? Perhaps according to your standards :)

Marcel Hendrix

unread,
May 21, 2022, 4:00:26 AM5/21/22
to
On Saturday, May 21, 2022 at 4:10:20 AM UTC+2, dxforth wrote:
> On 21/05/2022 11:41, Marcel Hendrix wrote:
[..]
> > (*
> > * LANGUAGE : ANS Forth with extensions
> > *)
>
> "ANS Forth" ? Perhaps according to your standards :)

What is there about "with extensions" that you don't understand?

(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
*)


NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.02 ---"

PRIVATES

#1000 =: #times
#16384 VALUE limit
limit SQRT 2+ =: SQRT(limit+2) PRIVATE
limit SQRT 1+ =: SQRT(limit+1) PRIVATE
CREATE sieve PRIVATE limit 1+ ALLOT

: toggle ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
: f1: ( I J -- ) SQR 4 * SWAP SQR + ; PRIVATE
: f2: ( I J -- ) SQR 3 * SWAP SQR + ; PRIVATE
: f3: ( I J -- ) SQR 3 * SWAP SQR - ; PRIVATE

: SieveOfAtkin ( -- u )
sieve limit 1+ ERASE
SQRT(limit+2)
1 DO SQRT(limit+2)
1 DO I J f1: >S S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S toggle ENDIF -S
I J f2: >S S limit <= S #12 MOD 7 = AND IF S toggle ENDIF -S
I J f3: >S J I > S limit <= AND S #12 MOD #11 = AND IF S toggle ENDIF -S
LOOP
LOOP
SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP
2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ( sz) ;

: PRIMES CR #times DEC. ." iterations." TIMER-RESET
0 #times 0 DO DROP SieveOfAtkin LOOP
MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

:ABOUT CR ." Try: SieveOfAtkin . "
CR ." PRIMES " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)


FORTH> in
Creating --- Sieve of Atkin Version 0.02 ---

Try: SieveOfAtkin .
PRIMES ok
FORTH> SieveOfAtkin . 1900 ok
FORTH> PRIMES
1000 iterations.
1900 primes found, 94 ms elapsed. ok

FORTH> in sieve
Creating --- SIEVE benchmark Version 1.00 ---
Redefining `#times`
Redefining `PRIMES`

Sieve (100 iterations)
Compile time Run time Primes
Par.C 38.19 8.732 1899
Inmos ICC 83.98 9.578 1899
3L C 45.25 5.955 1899
tForth (WS) 1.00 7.244 1899
tForth (main) 1.00 9.253 1899
iForth 386/33 0.3 5.767 1899
iForth 486/50 0.1 1.595 1899
iForth 486/66 0.1 1.559 1899
iForth P5/200 0.1 0.195 1899
iForth PIV/3G 0.1 0.076 1899

Enter PRIMES to run the benchmark.
ok
FORTH> PRIMES
1000 iterations.
1899 primes found, 12 ms elapsed. ok
FORTH>

Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
The standard algorithm needs no division, so it will work for limit > 2^64, while
Atkin must be extended for that.

-marcel

Marcel Hendrix

unread,
May 21, 2022, 4:59:55 AM5/21/22
to
On Saturday, May 21, 2022 at 10:00:26 AM UTC+2, Marcel Hendrix wrote:
[..]

Probably easier to understand. Speed unchanged.

(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
*)


NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.03 ---"

PRIVATES

#1000 =: #times
#16384 VALUE limit PRIVATE
limit SQRT 2+ =: SQRT(limit+2) PRIVATE
limit SQRT 1+ =: SQRT(limit+1) PRIVATE
CREATE sieve PRIVATE limit 1+ ALLOT

: init ( -- ) sieve limit 1+ ERASE ; PRIVATE
: flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
: fi1 ( I J -- ) SQR 4 * SWAP SQR + >S ; PRIVATE
: fi2 ( I J -- ) SQR 3 * SWAP SQR + >S ; PRIVATE
: fi3 ( I J -- ) SQR 3 * SWAP SQR - >S ; PRIVATE
: test1 ( -- ) S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S flip ENDIF -S ; PRIVATE
: test2 ( -- ) S limit <= S #12 MOD 7 = AND IF S flip ENDIF -S ; PRIVATE
: +test3 ( ? -- ) S limit <= AND S #12 MOD #11 = AND IF S flip ENDIF -S ; PRIVATE
: no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE
: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )
init
SQRT(limit+2)
1 DO SQRT(limit+2)
1 DO I J fi1 test1
I J fi2 test2
I J fi3 J I > +test3
LOOP
LOOP
no-p^2
cnt ( sz) ;

: PRIMES CR #times DEC. ." iterations." TIMER-RESET
0 #times 0 DO DROP SieveOfAtkin LOOP
MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

:ABOUT CR ." Try: SieveOfAtkin . "
CR ." PRIMES " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

-marcel

Jali Heinonen

unread,
May 21, 2022, 12:28:54 PM5/21/22
to
Is it really faster than sieve implementation that is wheel factorized to the max?

Marcel Hendrix

unread,
May 21, 2022, 5:02:26 PM5/21/22
to
Well, so much for CS.

I adapted both Atkin and the standard Sieve so that I could run them with
array sizes between 1k and 10GByte, with the following results:

Standard Sieve:
n runtime (ms) primes
1,000 0 302
10,000 0 2,261
100,000 0 17,983
1,000,000 3 148,932
10,000,000 29 1,270,606
100,000,000 962 11,078,936
1,000,000,000 12695 98,222,286
10,000,000,000 149099 882,206,715

Atkin Sieve
n runtime (ms) primes
1,000 0 168
10,000 0 1,229
100,000 0 9,592
1,000,000 4 78,498
10,000,000 60 664,579
100,000,000 1728 5,761,455
1,000,000,000 20900 50,847,534
10,000,000,000 301417 455,052,511

Atkin is about 4 times slower than the Standard Sieve (50% of the
results in twice the time). It might slowly increase for even larger
sieves, but with 20 GBytes there is not much evidence for that yet.
Maybe somebody with a TB of RAM has more luck.

The jump in runtime between 0.1GB and 1GB might have something to do
with the working-set default of a Windows program (I see no evidence
of disk-activity but ALLOCATE takes noticeably longer).

Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's
logic operators apparently taking shortcuts without being told.

(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
* CATEGORY : Example
* AUTHOR : Marcel Hendrix
* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
*)


NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.04 ---"

PRIVATES

DOC
(*
n runtime (ms) primes
1,000 0 168
10,000 0 1,229
100,000 0 9,592
1,000,000 4 78,498
10,000,000 60 664,579
100,000,000 1728 5,761,455
1,000,000,000 20900 50,847,534
10,000,000,000 301417 455,052,511
*)
ENDDOC

DEPTH 0= [IF] CR .( Stack Empty) ABORT [THEN]
( #16384 ) =: limit PRIVATE
#1 =: #times PRIVATE
limit SQRT 2+ =: SQRT(limit+2) PRIVATE
limit SQRT 1+ =: SQRT(limit+1) PRIVATE
limit 1+ ALLOCATE ?ALLOCATE =: sieve PRIVATE
limit 1+ ALLOCATE ?ALLOCATE =: mods PRIVATE

: set# ( -- ) limit 2+ 0 DO I #12 MOD mods I + C! LOOP ; PRIVATE set#
: init ( -- ) sieve limit 1+ ERASE ( set# ) ; PRIVATE
: flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
: fi1 ( I J -- u ) SQR 4 * SWAP SQR + ; PRIVATE
: fi2 ( I J -- u ) SQR 3 * SWAP SQR + ; PRIVATE
: fi3 ( I J -- u ) SQR 3 * SWAP SQR - ; PRIVATE
: #MOD ( s -- u ) mods + C@ ; PRIVATE
: test1 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD DUP 1 = SWAP 5 = OR IF S flip ENDIF -S ENDIF ; PRIVATE
: test2 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD 7 = IF S flip ENDIF -S ENDIF ; PRIVATE
: +test3 ( s ? -- ) OVER limit > OR IF DROP ELSE >S S #MOD #11 = IF S flip ENDIF -S ENDIF ; PRIVATE
: no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE
: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )
init
SQRT(limit+2)
1 DO SQRT(limit+2)
1 DO I J fi1 test1
I J fi2 test2
I J fi3 J I <= +test3

none albert

unread,
May 22, 2022, 4:57:09 AM5/22/22
to
In article <4efed8c0-d736-42dd...@googlegroups.com>,
This the deception for those who inspect the O(n) behaviour.
Maybe Atkin uses less operations for n to infinity, there is a
constant C in front, that can count instructions needed for
a single operation.
That may mean -- as you have demonstrated -- that over a practical
range the O(n) doesn't tell the whole story.

>
>-marcel

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Jali Heinonen

unread,
May 22, 2022, 9:10:20 AM5/22/22
to
I adapted some code using wheel factorization for testing primes from the Inferno operating system sources for the 8th programming language.
Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

I have not done any timings as I currently have only RPI 4B board available for testing. Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

Marcel Hendrix

unread,
May 22, 2022, 10:53:46 AM5/22/22
to
On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
[..]
> I adapted some code using wheel factorization for testing primes from the Inferno
> operating system sources for the 8th programming language.
> Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

9007199254740992 constant bigx
...

> I have not done any timings as I currently have only RPI 4B board available for testing.
> Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

Well, let us know when you get to it. Generating between 10M or 100M primes
should provide some relevant timings, 10G would be better though.

-marcel

Marcel Hendrix

unread,
May 22, 2022, 11:18:01 AM5/22/22
to
On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:
[..]
> >Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
> >The standard algorithm needs no division, so it will work for limit >
> >2^64, while
> >Atkin must be extended for that.
> This the deception for those who inspect the O(n) behaviour.
> Maybe Atkin uses less operations for n to infinity, there is a
> constant C in front, that can count instructions needed for
> a single operation.
> That may mean -- as you have demonstrated -- that over a practical
> range the O(n) doesn't tell the whole story.

It was a bit of a surprise to me that it was so slow, given that the reference
to this method came from a prime thread on the GMP mailing list.

It is not easy to see which instructions are slow, given that the operation
count is indeed minimal. The division operation is replaced by a lookup table,
(building it leads to an imperceptible delay when loading the source file)
as I assume that a fast MOD was possible. Multiplication is nowadays
1 or 2 cycles, so I guess it must there is an ugly memory access problem,
a stalled pipeline, or serious data dependencies.

-marcel

Jali Heinonen

unread,
May 22, 2022, 1:11:10 PM5/22/22
to
I tested with generating first million primes on my RPI 4B and redirecting output into text file. Resulting time is over eight times faster than the time mentioned in your source listing and that time includes program startup and writing primes into file:

root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

real 0m57.158s
user 0m45.213s
sys 0m11.268s

Anton Ertl

unread,
May 22, 2022, 1:43:01 PM5/22/22
to
Marcel Hendrix <m...@iae.nl> writes:
>Multiplication is nowadays
>1 or 2 cycles

More like 4.

> so I guess it must there is an ugly memory access problem,

Numbers from a Skylake:

[~/forth:130297] perf stat -e instructions -e cycles -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .
PRIMES
1 iterations.
50847534 primes found, 19671 ms elapsed.

[~/forth:130300] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .
PRIMES
1 iterations.
50847534 primes found, 19662 ms elapsed.

Performance counter stats for 'iforth 1000000000 include /home/anton/forth/sieveofatkin.4th primes bye':

91673986439 instructions # 0.82 insn per cycle
111163950818 cycles
23659587244 L1-dcache-loads
1907344675 L1-dcache-load-misses # 8.06% of all L1-dcache accesses
1340041914 LLC-load-misses

27.797359314 seconds time elapsed

27.607640000 seconds user
0.184024000 seconds sys

So 91 instructions, 111 cycles and 1.34 main memory accesses per
number under consideration; yes, it could be all these cache misses
(actually, with this many cache misses, it could be quite a bit worse;
there seems to be some memory-level parallelism going on). Have you
considered doing it piecewise, in cache-sized pieces? That made
Erathostenes quite a bit faster.

I also notice that you apparently don't measure the time for erasing
the memory.

Let's see about scalability:

[~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .
PRIMES
1 iterations.
5761455 primes found, 1900 ms elapsed.

Performance counter stats for 'iforth 100000000 include /home/anton/forth/sieveofatkin.4th primes bye':

9516883875 instructions # 0.85 insn per cycle
11136639151 cycles
2477917529 L1-dcache-loads
197185830 L1-dcache-load-misses # 7.96% of all L1-dcache accesses
119105185 LLC-load-misses

2.791729605 seconds time elapsed

2.754480000 seconds user
0.032028000 seconds sys

10 times smaller, 10 times less time, just as O(n) predicts.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Marcel Hendrix

unread,
May 22, 2022, 2:39:12 PM5/22/22
to
On Sunday, May 22, 2022 at 7:11:10 PM UTC+2, Jali Heinonen wrote:
> sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:
> > On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
[..]
> I tested with generating first million primes on my RPI 4B and redirecting output
> into text file. Resulting time is over eight times faster than the time mentioned in
> your source listing and that time includes program startup and writing primes into file:
>
> root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt
>
> real 0m57.158s
> user 0m45.213s
> sys 0m11.268s

That is 57 *seconds*, with n=15,485,863, and finds 1,000,000 primes?
Note that Forth's Atkin (above in the thread) does that in ~120 *milliseconds*,
on an AMD Ryzen 7 5800X. According to Geekbench V, single-core 64bit, the
expected difference with an RPI 4B is about 8x. Something must be wrong.

It would be better to compare to the Standard Sieve?

-marcel

Marcel Hendrix

unread,
May 22, 2022, 2:58:47 PM5/22/22
to
On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:
[..]
> So 91 instructions, 111 cycles and 1.34 main memory accesses per
> number under consideration; yes, it could be all these cache misses
> (actually, with this many cache misses, it could be quite a bit worse;
> there seems to be some memory-level parallelism going on). Have you
> considered doing it piecewise, in cache-sized pieces? That made
> Erathostenes quite a bit faster.

That would be real work, I would need to sieve in batches. Maybe
I don't understand what you mean.

> I also notice that you apparently don't measure the time for erasing
> the memory.

That is included, but I don't count the time to build the MOD lookup.

> Let's see about scalability:
>
> [~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"
>
> Try: SieveOfAtkin .
> PRIMES
> 1 iterations.
> 5761455 primes found, 1900 ms elapsed.
[..]
> 10 times smaller, 10 times less time, just as O(n) predicts.

On my AMD 5800X the interesting gap is between 10,000,000
and 100,000,000, but thanks anyway.

-marcel

Anton Ertl

unread,
May 22, 2022, 3:12:28 PM5/22/22
to
Marcel Hendrix <m...@iae.nl> writes:
>Here is the final SieveOfAtkin.

I have made it portable to Gforth, lxf, sf, vfx64, and posted the
result on
<https://www.complang.tuwien.ac.at/forth/programs/sieveofatkin.fs>.
Basically all things that I changed, except for the timing code, are
gratuitious incompatibilities.

Results on a Ryzen 5800X (LLC-dcache-load-miss not supported) for n=1e8:

cyc inst load L1 miss
gforth-fast 33,784,978,330 50,096,698,991 14,716,155,398 157,613,972
iforth 8,949,528,121 9,643,635,252 1,459,076,700 163,122,475
lxf 17,879,785,896 15,849,776,266 4,578,122,141 157,093,923
sf 22,653,189,337 20,410,579,910 6,135,891,074 158,402,472
vfx64 8,896,377,480 9,533,502,029 1,212,120,001 155,895,135

Cache misses don't seem to limit the performance, otherwise the cycle
differences between the Forth systems would be much smaller.

Command lines:
LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses gforth-fast -e "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses iforth "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses lxf "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses sf "include /nfs/nfstmp/anton/SwiftForth-3.11.0/lib/options/fpmath.f 100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses vfx64 "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

Jali Heinonen

unread,
May 22, 2022, 3:19:57 PM5/22/22
to
Can your Forth version really find all the 1,000,000 prime numbers between number range of 0 - 15,485,863, output all the prime numbers as formatted line of text and redirect output into text file in about 120 milliseconds? Interesting as running empty program on my RPI 4B takes about 177 milliseconds.

Marcel Hendrix

unread,
May 22, 2022, 3:46:43 PM5/22/22
to
On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:
[..]
> Can your Forth version really find all the 1,000,000 prime numbers between number
> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and
> redirect output into text file in about 120 milliseconds? Interesting as running empty
> program on my RPI 4B takes about 177 milliseconds.

I don't know about "and redirect output into text file", I would write-file to stdout if that
is an allowed scenario? Bare numbers are independently duplicated by Anton
in this thread for several Forth systems, including iForth.

-marcel

Jali Heinonen

unread,
May 22, 2022, 4:07:41 PM5/22/22
to
Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

[2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

There should 50 primes, not 13 primes?

Marcel Hendrix

unread,
May 22, 2022, 4:24:19 PM5/22/22
to
On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:
[..]
> Can your Forth version really find all the 1,000,000 prime numbers between number
> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and
> redirect output into text file in about 120 milliseconds? Interesting as running empty
> program on my RPI 4B takes about 177 milliseconds.

Outputting a million primes is indeed slower than the original ~120 ms (about 25 times).
Other Forths can probably do this much faster as iForth is made to flush the output file.

FORTH> in sieveofatkin
Creating --- Sieve of Atkin Version 0.04 ---

Try: SieveOfAtkin .
PRIMES ok
FORTH> PRIMES
1 iterations.
1000000 primes found, 3204 ms elapsed. ok
FORTH>

Here's how:
: write ( -- )
S" primes.txt" R/W BIN CREATE-FILE ?FILE out-FD !
FILE-I//O SET-IODEVICE
limit 1+ 2 DO sieve I + C@ IF I . ENDIF LOOP
out-FD @ CLOSE-FILE -1 out-FD ! ?FILE
STD-I//O SET-IODEVICE ;

.. where "write" is made the last word of SieveOfAtkin

-marcel

Marcel Hendrix

unread,
May 22, 2022, 4:32:53 PM5/22/22
to
On Sunday, May 22, 2022 at 10:07:41 PM UTC+2, Jali Heinonen wrote:
[..]
> > https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
> Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:
>
> [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]
>
> There should 50 primes, not 13 primes?

From primes.txt:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83
89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167
173 179 181 191 193 197 199 211 223 227 229

I guess that will up the runtime a few notches :--)

-marcel

Anton Ertl

unread,
May 22, 2022, 4:38:37 PM5/22/22
to
Marcel Hendrix <m...@iae.nl> writes:
>I adapted both Atkin and the standard Sieve so that I could run them with
>array sizes between 1k and 10GByte, with the following results:

I have now tried to run it with 100GB on a machine with 128GB RAM that
is not running anything significant. However, the out-of-memory
killer reaped it after a while, despite "free" showing 125GB free; ok,
100GB for the sieve and 100GB for MODS does not fit. Does MODS buy
anything?

I prepared a version without MODS
<http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,
and it is actually faster:

On the 5800X with n=1e8, using vfx64:

MODS no MODS
8,896,377,480 4,943,212,037 cycles
9,533,502,029 8,432,685,132 instructions
1,212,120,001 998,289,650 L1-dcache-loads
155,895,135 43,060,075 L1-dcache-load-misses

Ok, let's see about the 100G. Our 5800X currently has only 96GB, so I
used a Rocket Lake machine (Xeon W-1370P), where perf unfortunately
does not report loads or misses. I used vfx64, because the version of
iforth installed apparently does not like to allocate 10GB.

cycles instructions n primes
4,958,261,806 8,431,214,174 1e8 5761455
56,409,350,162 84,181,549,986 1e9 50847534
597,442,687,159 891,687,473,921 1e10 455052511
6,068,869,349,719 8,916,633,565,847 1e11 4118054813

The number of instructions is superlinear between n=1e9 and 1e10, but
linear for 1e8->1e9 and 1e10->1e11. Hmm. Cycles rise slightly
superlinearly, possibly because caches help less and less.

Anton Ertl

unread,
May 22, 2022, 4:48:04 PM5/22/22
to
Marcel Hendrix <m...@iae.nl> writes:
>On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:
>[..]
>> So 91 instructions, 111 cycles and 1.34 main memory accesses per
>> number under consideration; yes, it could be all these cache misses
>> (actually, with this many cache misses, it could be quite a bit worse;
>> there seems to be some memory-level parallelism going on). Have you
>> considered doing it piecewise, in cache-sized pieces? That made
>> Erathostenes quite a bit faster.
>
>That would be real work,

Yes, it's more work for Erathostenes, don't know how amenable Atkin is
for that.

>On my AMD 5800X the interesting gap is between 10,000,000
>and 100,000,000, but thanks anyway.

For time results you will see cache effects, and complexity theory
usually does not model that. The instruction counts are not affected
by cache misses, and I would have expected the theoretically predicted
linear behaviour there, but as reported in another posting, I saw an
anomaly.

Anton Ertl

unread,
May 23, 2022, 3:02:11 AM5/23/22
to
Marcel Hendrix <m...@iae.nl> writes:
>On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:
>[..]
>> >Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
>> >The standard algorithm needs no division, so it will work for limit >
>> >2^64, while
>> >Atkin must be extended for that.
>> This the deception for those who inspect the O(n) behaviour.
>> Maybe Atkin uses less operations for n to infinity, there is a
>> constant C in front, that can count instructions needed for
>> a single operation.
>> That may mean -- as you have demonstrated -- that over a practical
>> range the O(n) doesn't tell the whole story.
>
>It was a bit of a surprise to me that it was so slow, given that the reference
>to this method came from a prime thread on the GMP mailing list.
>
>It is not easy to see which instructions are slow, given that the operation
>count is indeed minimal.

I looked up <https://en.wikipedia.org/wiki/Sieve_of_Atkin> and it
discusses performance relative to Erathostenes quite a bit. I won't
reproduce that here, but I note that Erathostenes is O(N log log N)
compared to O(N) for Atkin. log log N grows very slowly:

1e3 fln fln f. 1.93264473391607 ok
1e6 fln fln f. 2.62579191447601 ok
1e9 fln fln f. 3.03125702258418 ok
1e12 fln fln f. 3.31893909503596 ok
1e15 fln fln f. 3.54208264635017 ok

That can be easily compensated for practically relevant N by a better
constant factor for Erathostenes, and the page discusses various
reasons for such a better constant factors for various implementation
variations.

As for "a prime thread on the GMP mailing list", even
usually-competent people are not immune to fallacies, and giving more
weight to mathematically proven results than they actually deserve,
and interpreting things into mathematical results that they actually
don't state are common fallacies.

Marcel Hendrix

unread,
May 23, 2022, 4:11:23 AM5/23/22
to
On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:
> I won't
> reproduce that here, but I note that Erathostenes is O(N log log N)
> compared to O(N) for Atkin. log log N grows very slowly:
>
> 1e3 fln fln f. 1.93264473391607 ok
> 1e6 fln fln f. 2.62579191447601 ok
> 1e9 fln fln f. 3.03125702258418 ok
> 1e12 fln fln f. 3.31893909503596 ok
> 1e15 fln fln f. 3.54208264635017 ok

I learned a lot more from your level-headed experiment than from the
Wikipedia page...

Next, and probably last, step will be to limit memory use.
The MOD hack should obviously go (as you've shown) and I can
make sieve a bit array. That will indicate if is advantageous to compress
more aggressively.

-marcel

Jali Heinonen

unread,
May 23, 2022, 5:32:14 AM5/23/22
to
Thanks, does your SieveOfAtkin use one huge buffer for primes? Version I posted uses currently just 1000 element array and needs a shit loads of iterations to find and output millions of primes. It also needs to zero out the primes array on every iteration. I should probably increase the buffer size to eliminate the overhead. Up side is that it uses currently very little memory.

none albert

unread,
May 23, 2022, 6:08:52 AM5/23/22
to
In article <74043ddb-6fd4-4389...@googlegroups.com>,
Marcel Hendrix <m...@iae.nl> wrote:
<SNIP>
>Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's
>logic operators apparently taking shortcuts without being told.

If you mean this:
https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not
Short circuit operation in Python is a feature, not a bug.

This is actually widespread e.g. this is c-idiom:
if ( pc || *pc ) { ... }
Short circuit evaluation prevents accessing address NULL if pc
has not been filled in.

<SNIP>

none albert

unread,
May 23, 2022, 6:24:06 AM5/23/22
to
In article <2022May2...@mips.complang.tuwien.ac.at>,
I've seen code in this thread that made the sieve of Atkin a
test of the speed of printing code.
Long ago the original byte benchmark turned into measuring the
speed of `` 1899 . '' instead of the speed of sieving.

The difference between quicksort N.logN and bubble sort N^2
should wave a red flag. That is a big difference.

However logN missing or adding here or there -- O(n) versus O(N.logn) --
doesn't mask if the algorithm needs a MOD versus pure addition,
or the effects or accessing main memory.

The wikipedia Atking page needs a caveat of this sort,
who are going to add it?

>
>- anton

Jan Coombs

unread,
May 23, 2022, 6:55:03 AM5/23/22
to
jan@mypc:$ python SieveOfAtkin.py 229
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,
67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137,
139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199,
211, 223, 227]

Email me for source, it was beyond my ability/patience to change it.

Jan Coombs
--

Anton Ertl

unread,
May 23, 2022, 8:05:38 AM5/23/22
to
So?

>Long ago the original byte benchmark turned into measuring the
>speed of `` 1899 . '' instead of the speed of sieving.

Let's test it on a Ryzen 5800X:

perf stat -e cycles gforth-fast ../gforth/siev.fs -e ": foo 1000 0 do 1899 . loop ; ..."

cycles ...
31203365 bye
300973318 main bye
35385551 foo bye

MAIN runs the sieve (made slightly faster than the Byte version) 1000
times, without printing the number of primes.

FOO prints 1899 1000 times.

As you can see, the printing is a lot faster.

>However logN missing or adding here or there

ld N is 20 for N=1e6, so it's a lot less than N, but not negligible.

>-- O(n) versus O(N.logn) --

It's about O(N) vs. O(N log log N), though.

>The wikipedia Atking page needs a caveat of this sort,
>who are going to add it?

It states repeatedly that Erathostenes is faster and discusses the
reasons. What makes you think that it needs an additional caveat?

Marcel Hendrix

unread,
May 23, 2022, 2:05:20 PM5/23/22
to
On Monday, May 23, 2022 at 10:11:23 AM UTC+2, Marcel Hendrix wrote:
> On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:
[..]
> Next, and probably last, step will be to limit memory use.
> The MOD hack should obviously go (as you've shown) and I can
> make sieve a bit array. That will indicate if is advantageous to compress
> more aggressively.

Using a bit array uncovered a likely reason for the discontinuity in the
speed results.

The bit results start off slower than the ones using bytes. The Intel bit
manipulation instructions are not very fast -- extracting the bits
'by hand' might be faster -- but iForth has the words, so why not.

Contrary to the byte results, from n = 1e3 to 1e8 the runtime increases
linearly with n, and for n =1e8 the result with bits is 1237/738 = 1.7 times
faster than with bytes. Definitely a worthwhile improvement.

Note that the AMD 5800X has 512KB L2 cache per core and 32MEG
L3 in total. It could be that for n > 0.1Gbytes or 12.5Mbits it runs out
of L3 cache and loses 100 - 300% performance that way. The faster
but much smaller L1 and L2 caches won't help here.

n runtime (ms) primes runtime (ms) (bits iso bytes)
1,000 0 168 0
10,000 0 1,229 0
100,000 0 9,592 0
1,000,000 4 78,498 6
10,000,000 44 664,579 72
100,000,000 1237 5,761,455 738
1,000,000,000 16044 50,847,534 16902
10,000,000,000 206919 455,052,511 215561

-marcel

Jali Heinonen

unread,
May 23, 2022, 3:48:37 PM5/23/22
to
Thanks, I found a working Python implementation. Seems to be faster than my wheel sieve conversion for the 8th but luckily I can easily beat it's performance by parallelizing my code. It's easy as wheel sieve allows finding primes between specific number range, so I can for an example find primes between 10000000000 and 10000011050. It takes just a few milliseconds to find that first prime is 10000000019, last prime is 10000011037 and there are 455 primes. I think Sieve of Atkin can't do that? This makes parallelizing easy as each thread can be given different number range to handle and no locking is necessary.

Hans Bezemer

unread,
May 25, 2022, 8:15:50 AM5/25/22
to
On Sunday, May 22, 2022 at 10:38:37 PM UTC+2, Anton Ertl wrote:
> I prepared a version without MODS
> <http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,
> and it is actually faster:
BTW, clean code - getting it to run under 4tH was a breeze! Thanks for publishing this!

Hans Bezemer

Marcel Hendrix

unread,
Jun 5, 2022, 8:25:09 AM6/5/22
to
On Monday, May 23, 2022 at 8:05:20 PM UTC+2, Marcel Hendrix wrote:
[..]
> n runtime (ms) primes runtime (ms) (bits iso bytes)
> 1,000 0 168 0
> 10,000 0 1,229 0
> 100,000 0 9,592 0
> 1,000,000 4 78,498 6
> 10,000,000 44 664,579 72
> 100,000,000 1237 5,761,455 738
> 1,000,000,000 16044 50,847,534 16902
> 10,000,000,000 206919 455,052,511 215561

Slight update: I had a look in the BIOS and saw that the memory timing
was not according to the XMP profile. Adjusting that increased the
run-speed of the Sieve of Atkin by 121%.

n runtime (ms) primes runtime (ms) (bits) (bits+mem timing)
1,000 0 168 0 0
10,000 0 1,229 0 0
100,000 0 9,592 0 0
1,000,000 4 78,498 6 6
10,000,000 44 664,579 72 65
100,000,000 1237 5,761,455 738 729
1,000,000,000 16044 50,847,534 16902 13920 ( -121%)
10,000,000,000 206919 455,052,511 215561 177643 ( -121%)

BTW, I learned that the segmented sieve of Atkin exists but I have no idea
if it is faster or not. The Sieve guys love to throw numbers around and bicker about them.

-marcel
0 new messages