309 views

Skip to first unread message

May 20, 2022, 8:53:26 AMMay 20

to

Apparently the fastest possible Sieve (in the limit).

https://www.baeldung.com/cs/prime-number-algorithms .

Unfortunately, the pseudo-code (?) is somewhat ambiguous.

-marcel

https://www.baeldung.com/cs/prime-number-algorithms .

Unfortunately, the pseudo-code (?) is somewhat ambiguous.

-marcel

May 20, 2022, 11:48:44 AMMay 20

to

This is the optimized implementation proposed by Zsolt KOVACS:

from:

https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

May 20, 2022, 9:41:30 PMMay 20

to

On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:

> On Fri, 20 May 2022 05:53:24 -0700 (PDT)

[..]
> On Fri, 20 May 2022 05:53:24 -0700 (PDT)

> This is the optimized implementation proposed by Zsolt KOVACS:

> from:

> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

Optimized? Maybe according to Python standards...
> from:

> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

(*

* LANGUAGE : ANS Forth with extensions

* PROJECT : Forth Environments

* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

* LAST CHANGE : 02:27:42, May 21, 2022, Marcel Hendrix

*)

NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.01 ---"

PRIVATES

0 VALUE P

0 VALUE sz

#16384 VALUE limit

CREATE sieve PRIVATE limit 1+ ALLOT

: SieveOfAtkin ( -- )

sieve limit 1+ ERASE

HERE TO P 2 , 3 , 2 TO sz

limit SQRT 2+

1 DO limit SQRT 2+

1 DO J SQR 4 * I SQR + >S

S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S

J SQR 3 * I SQR + >S

S limit <= S #12 MOD 7 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S

J SQR 3 * I SQR - >S

J I > S limit <= AND S #12 MOD #11 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S

LOOP

LOOP

limit SQRT 1+ 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP

limit 1+ 5 DO sieve I + C@ IF I , 1 +TO sz ENDIF LOOP

P sz ;

: PSHOW ( addr u -- ) CR 0 ?DO @+ . LOOP DROP ;

:ABOUT CR ." Try: SieveOfAtkin PSHOW " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

-marcel

May 20, 2022, 10:10:20 PMMay 20

to

On 21/05/2022 11:41, Marcel Hendrix wrote:

> On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:

>> On Fri, 20 May 2022 05:53:24 -0700 (PDT)

> [..]

>> This is the optimized implementation proposed by Zsolt KOVACS:

>> from:

>> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

>

> Optimized? Maybe according to Python standards...

>

> (*

> * LANGUAGE : ANS Forth with extensions

> *)
> On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:

>> On Fri, 20 May 2022 05:53:24 -0700 (PDT)

> [..]

>> This is the optimized implementation proposed by Zsolt KOVACS:

>> from:

>> https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

>

> Optimized? Maybe according to Python standards...

>

> (*

> * LANGUAGE : ANS Forth with extensions

"ANS Forth" ? Perhaps according to your standards :)

May 21, 2022, 4:00:26 AMMay 21

to

On Saturday, May 21, 2022 at 4:10:20 AM UTC+2, dxforth wrote:

> On 21/05/2022 11:41, Marcel Hendrix wrote:

[..]
> On 21/05/2022 11:41, Marcel Hendrix wrote:

> > (*

> > * LANGUAGE : ANS Forth with extensions

> > *)

>

> "ANS Forth" ? Perhaps according to your standards :)

What is there about "with extensions" that you don't understand?
> > * LANGUAGE : ANS Forth with extensions

> > *)

>

> "ANS Forth" ? Perhaps according to your standards :)

(*

* LANGUAGE : ANS Forth with extensions

* PROJECT : Forth Environments

* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

*)

NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.02 ---"

PRIVATES

#1000 =: #times

#16384 VALUE limit

limit SQRT 2+ =: SQRT(limit+2) PRIVATE

limit SQRT 1+ =: SQRT(limit+1) PRIVATE

CREATE sieve PRIVATE limit 1+ ALLOT

: toggle ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
: f1: ( I J -- ) SQR 4 * SWAP SQR + ; PRIVATE

: f2: ( I J -- ) SQR 3 * SWAP SQR + ; PRIVATE

: f3: ( I J -- ) SQR 3 * SWAP SQR - ; PRIVATE

: SieveOfAtkin ( -- u )

sieve limit 1+ ERASE

SQRT(limit+2)

1 DO SQRT(limit+2)

1 DO I J f1: >S S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S toggle ENDIF -S

I J f2: >S S limit <= S #12 MOD 7 = AND IF S toggle ENDIF -S

I J f3: >S J I > S limit <= AND S #12 MOD #11 = AND IF S toggle ENDIF -S

LOOP

LOOP

SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP

2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ( sz) ;

: PRIMES CR #times DEC. ." iterations." TIMER-RESET

0 #times 0 DO DROP SieveOfAtkin LOOP

MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

:ABOUT CR ." Try: SieveOfAtkin . "

CR ." PRIMES " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

Creating --- Sieve of Atkin Version 0.02 ---

Try: SieveOfAtkin .

PRIMES ok

FORTH> SieveOfAtkin . 1900 ok

FORTH> PRIMES

1000 iterations.

1900 primes found, 94 ms elapsed. ok

FORTH> in sieve

Creating --- SIEVE benchmark Version 1.00 ---

Redefining `#times`

Redefining `PRIMES`

Sieve (100 iterations)

Compile time Run time Primes

Par.C 38.19 8.732 1899

Inmos ICC 83.98 9.578 1899

3L C 45.25 5.955 1899

tForth (WS) 1.00 7.244 1899

tForth (main) 1.00 9.253 1899

iForth 386/33 0.3 5.767 1899

iForth 486/50 0.1 1.595 1899

iForth 486/66 0.1 1.559 1899

iForth P5/200 0.1 0.195 1899

iForth PIV/3G 0.1 0.076 1899

Enter PRIMES to run the benchmark.

ok

FORTH> PRIMES

1000 iterations.

1899 primes found, 12 ms elapsed. ok

FORTH>

Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.

The standard algorithm needs no division, so it will work for limit > 2^64, while

Atkin must be extended for that.

-marcel

May 21, 2022, 4:59:55 AMMay 21

to

On Saturday, May 21, 2022 at 10:00:26 AM UTC+2, Marcel Hendrix wrote:

[..]

Probably easier to understand. Speed unchanged.

(*

* LANGUAGE : ANS Forth with extensions

* PROJECT : Forth Environments

* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;

*)

NEEDS -miscutil

REVISION -sieveofatkin "--- Sieve of Atkin Version 0.03 ---"

PRIVATES

#1000 =: #times

#16384 VALUE limit PRIVATE

: flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE

: fi1 ( I J -- ) SQR 4 * SWAP SQR + >S ; PRIVATE

: fi2 ( I J -- ) SQR 3 * SWAP SQR + >S ; PRIVATE

: fi3 ( I J -- ) SQR 3 * SWAP SQR - >S ; PRIVATE

: test1 ( -- ) S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S flip ENDIF -S ; PRIVATE

: test2 ( -- ) S limit <= S #12 MOD 7 = AND IF S flip ENDIF -S ; PRIVATE

: +test3 ( ? -- ) S limit <= AND S #12 MOD #11 = AND IF S flip ENDIF -S ; PRIVATE

: no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE

: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )

init

SQRT(limit+2)

1 DO SQRT(limit+2)

1 DO I J fi1 test1

I J fi2 test2

I J fi3 J I > +test3

LOOP

LOOP

no-p^2

cnt ( sz) ;

: PRIMES CR #times DEC. ." iterations." TIMER-RESET

0 #times 0 DO DROP SieveOfAtkin LOOP

MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

:ABOUT CR ." Try: SieveOfAtkin . "

CR ." PRIMES " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

-marcel

[..]

Probably easier to understand. Speed unchanged.

(*

* LANGUAGE : ANS Forth with extensions

* PROJECT : Forth Environments

* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;

*)

NEEDS -miscutil

PRIVATES

#1000 =: #times

#16384 VALUE limit PRIVATE

limit SQRT 2+ =: SQRT(limit+2) PRIVATE

limit SQRT 1+ =: SQRT(limit+1) PRIVATE

CREATE sieve PRIVATE limit 1+ ALLOT

: init ( -- ) sieve limit 1+ ERASE ; PRIVATE
limit SQRT 1+ =: SQRT(limit+1) PRIVATE

CREATE sieve PRIVATE limit 1+ ALLOT

: flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE

: fi1 ( I J -- ) SQR 4 * SWAP SQR + >S ; PRIVATE

: fi2 ( I J -- ) SQR 3 * SWAP SQR + >S ; PRIVATE

: fi3 ( I J -- ) SQR 3 * SWAP SQR - >S ; PRIVATE

: test1 ( -- ) S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S flip ENDIF -S ; PRIVATE

: test2 ( -- ) S limit <= S #12 MOD 7 = AND IF S flip ENDIF -S ; PRIVATE

: +test3 ( ? -- ) S limit <= AND S #12 MOD #11 = AND IF S flip ENDIF -S ; PRIVATE

: no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE

: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )

SQRT(limit+2)

1 DO SQRT(limit+2)

1 DO I J fi1 test1

I J fi2 test2

I J fi3 J I > +test3

LOOP

LOOP

no-p^2

cnt ( sz) ;

: PRIMES CR #times DEC. ." iterations." TIMER-RESET

0 #times 0 DO DROP SieveOfAtkin LOOP

MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

:ABOUT CR ." Try: SieveOfAtkin . "

CR ." PRIMES " ;

NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

DEPRIVE

(* End of Source *)

May 21, 2022, 12:28:54 PMMay 21

to

May 21, 2022, 5:02:26 PMMay 21

to

I adapted both Atkin and the standard Sieve so that I could run them with

array sizes between 1k and 10GByte, with the following results:

Standard Sieve:

n runtime (ms) primes

1,000 0 302

10,000 0 2,261

100,000 0 17,983

1,000,000 3 148,932

10,000,000 29 1,270,606

100,000,000 962 11,078,936

1,000,000,000 12695 98,222,286

10,000,000,000 149099 882,206,715

Atkin Sieve

n runtime (ms) primes

1,000 0 168

10,000 0 1,229

100,000 0 9,592

1,000,000 4 78,498

10,000,000 60 664,579

100,000,000 1728 5,761,455

1,000,000,000 20900 50,847,534

10,000,000,000 301417 455,052,511

Atkin is about 4 times slower than the Standard Sieve (50% of the

results in twice the time). It might slowly increase for even larger

sieves, but with 20 GBytes there is not much evidence for that yet.

Maybe somebody with a TB of RAM has more luck.

The jump in runtime between 0.1GB and 1GB might have something to do

with the working-set default of a Windows program (I see no evidence

of disk-activity but ALLOCATE takes noticeably longer).

Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's

logic operators apparently taking shortcuts without being told.

(*

* LANGUAGE : ANS Forth with extensions

* PROJECT : Forth Environments

* DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)

* CATEGORY : Example

* AUTHOR : Marcel Hendrix

* LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;

*)

NEEDS -miscutil

PRIVATES

DOC

(*

n runtime (ms) primes

1,000 0 168

10,000 0 1,229

100,000 0 9,592

1,000,000 4 78,498

10,000,000 60 664,579

100,000,000 1728 5,761,455

1,000,000,000 20900 50,847,534

10,000,000,000 301417 455,052,511

*)

ENDDOC

DEPTH 0= [IF] CR .( Stack Empty) ABORT [THEN]

( #16384 ) =: limit PRIVATE

#1 =: #times PRIVATE

limit SQRT 2+ =: SQRT(limit+2) PRIVATE

limit SQRT 1+ =: SQRT(limit+1) PRIVATE

limit 1+ ALLOCATE ?ALLOCATE =: sieve PRIVATE
limit SQRT 1+ =: SQRT(limit+1) PRIVATE

limit 1+ ALLOCATE ?ALLOCATE =: mods PRIVATE

: set# ( -- ) limit 2+ 0 DO I #12 MOD mods I + C! LOOP ; PRIVATE set#

: init ( -- ) sieve limit 1+ ERASE ( set# ) ; PRIVATE

: flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE

: fi1 ( I J -- u ) SQR 4 * SWAP SQR + ; PRIVATE
: fi2 ( I J -- u ) SQR 3 * SWAP SQR + ; PRIVATE

: fi3 ( I J -- u ) SQR 3 * SWAP SQR - ; PRIVATE

: #MOD ( s -- u ) mods + C@ ; PRIVATE

: test1 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD DUP 1 = SWAP 5 = OR IF S flip ENDIF -S ENDIF ; PRIVATE

: test2 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD 7 = IF S flip ENDIF -S ENDIF ; PRIVATE

: +test3 ( s ? -- ) OVER limit > OR IF DROP ELSE >S S #MOD #11 = IF S flip ENDIF -S ENDIF ; PRIVATE

: no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE

: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )

init

SQRT(limit+2)

1 DO SQRT(limit+2)

1 DO I J fi1 test1

I J fi2 test2

I J fi3 J I <= +test3
: cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

: SieveOfAtkin ( -- u )

init

SQRT(limit+2)

1 DO SQRT(limit+2)

1 DO I J fi1 test1

I J fi2 test2

May 22, 2022, 4:57:09 AMMay 22

to

In article <4efed8c0-d736-42dd...@googlegroups.com>,

This the deception for those who inspect the O(n) behaviour.

Maybe Atkin uses less operations for n to infinity, there is a

constant C in front, that can count instructions needed for

a single operation.

That may mean -- as you have demonstrated -- that over a practical

range the O(n) doesn't tell the whole story.

>

>-marcel

Groetjes Albert

--

"in our communism country Viet Nam, people are forced to be

alive and in the western country like US, people are free to

die from Covid 19 lol" duc ha

albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Maybe Atkin uses less operations for n to infinity, there is a

constant C in front, that can count instructions needed for

a single operation.

That may mean -- as you have demonstrated -- that over a practical

range the O(n) doesn't tell the whole story.

>

>-marcel

Groetjes Albert

--

"in our communism country Viet Nam, people are forced to be

alive and in the western country like US, people are free to

die from Covid 19 lol" duc ha

albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

May 22, 2022, 9:10:20 AMMay 22

to

Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

I have not done any timings as I currently have only RPI 4B board available for testing. Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

May 22, 2022, 10:53:46 AMMay 22

to

On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:

[..]

...

> I have not done any timings as I currently have only RPI 4B board available for testing.

> Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

Well, let us know when you get to it. Generating between 10M or 100M primes

should provide some relevant timings, 10G would be better though.

-marcel

[..]

> I adapted some code using wheel factorization for testing primes from the Inferno

> operating system sources for the 8th programming language.

> Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

9007199254740992 constant bigx
> operating system sources for the 8th programming language.

> Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

...

> I have not done any timings as I currently have only RPI 4B board available for testing.

> Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

should provide some relevant timings, 10G would be better though.

-marcel

May 22, 2022, 11:18:01 AMMay 22

to

On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:

[..]

to this method came from a prime thread on the GMP mailing list.

It is not easy to see which instructions are slow, given that the operation

count is indeed minimal. The division operation is replaced by a lookup table,

(building it leads to an imperceptible delay when loading the source file)

as I assume that a fast MOD was possible. Multiplication is nowadays

1 or 2 cycles, so I guess it must there is an ugly memory access problem,

a stalled pipeline, or serious data dependencies.

-marcel

[..]

> >Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.

> >The standard algorithm needs no division, so it will work for limit >

> >2^64, while

> >Atkin must be extended for that.

> This the deception for those who inspect the O(n) behaviour.

> Maybe Atkin uses less operations for n to infinity, there is a

> constant C in front, that can count instructions needed for

> a single operation.

> That may mean -- as you have demonstrated -- that over a practical

> range the O(n) doesn't tell the whole story.

It was a bit of a surprise to me that it was so slow, given that the reference
> >The standard algorithm needs no division, so it will work for limit >

> >2^64, while

> >Atkin must be extended for that.

> This the deception for those who inspect the O(n) behaviour.

> Maybe Atkin uses less operations for n to infinity, there is a

> constant C in front, that can count instructions needed for

> a single operation.

> That may mean -- as you have demonstrated -- that over a practical

> range the O(n) doesn't tell the whole story.

to this method came from a prime thread on the GMP mailing list.

It is not easy to see which instructions are slow, given that the operation

count is indeed minimal. The division operation is replaced by a lookup table,

(building it leads to an imperceptible delay when loading the source file)

as I assume that a fast MOD was possible. Multiplication is nowadays

1 or 2 cycles, so I guess it must there is an ugly memory access problem,

a stalled pipeline, or serious data dependencies.

-marcel

May 22, 2022, 1:11:10 PMMay 22

to

root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

real 0m57.158s

user 0m45.213s

sys 0m11.268s

May 22, 2022, 1:43:01 PMMay 22

to

More like 4.

> so I guess it must there is an ugly memory access problem,

Numbers from a Skylake:

[~/forth:130297] perf stat -e instructions -e cycles -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

50847534 primes found, 19671 ms elapsed.

[~/forth:130300] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

50847534 primes found, 19662 ms elapsed.

Performance counter stats for 'iforth 1000000000 include /home/anton/forth/sieveofatkin.4th primes bye':

91673986439 instructions # 0.82 insn per cycle

111163950818 cycles

23659587244 L1-dcache-loads

1907344675 L1-dcache-load-misses # 8.06% of all L1-dcache accesses

1340041914 LLC-load-misses

27.797359314 seconds time elapsed

27.607640000 seconds user

0.184024000 seconds sys

So 91 instructions, 111 cycles and 1.34 main memory accesses per

number under consideration; yes, it could be all these cache misses

(actually, with this many cache misses, it could be quite a bit worse;

there seems to be some memory-level parallelism going on). Have you

considered doing it piecewise, in cache-sized pieces? That made

Erathostenes quite a bit faster.

I also notice that you apparently don't measure the time for erasing

the memory.

Let's see about scalability:

[~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

5761455 primes found, 1900 ms elapsed.

Performance counter stats for 'iforth 100000000 include /home/anton/forth/sieveofatkin.4th primes bye':

9516883875 instructions # 0.85 insn per cycle

11136639151 cycles

2477917529 L1-dcache-loads

197185830 L1-dcache-load-misses # 7.96% of all L1-dcache accesses

119105185 LLC-load-misses

2.791729605 seconds time elapsed

2.754480000 seconds user

0.032028000 seconds sys

10 times smaller, 10 times less time, just as O(n) predicts.

- anton

--

M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html

comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html

New standard: http://www.forth200x.org/forth200x.html

EuroForth 2021: https://euro.theforth.net/2021

> so I guess it must there is an ugly memory access problem,

[~/forth:130297] perf stat -e instructions -e cycles -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

50847534 primes found, 19671 ms elapsed.

[~/forth:130300] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

50847534 primes found, 19662 ms elapsed.

Performance counter stats for 'iforth 1000000000 include /home/anton/forth/sieveofatkin.4th primes bye':

91673986439 instructions # 0.82 insn per cycle

111163950818 cycles

23659587244 L1-dcache-loads

1907344675 L1-dcache-load-misses # 8.06% of all L1-dcache accesses

1340041914 LLC-load-misses

27.797359314 seconds time elapsed

27.607640000 seconds user

0.184024000 seconds sys

So 91 instructions, 111 cycles and 1.34 main memory accesses per

number under consideration; yes, it could be all these cache misses

(actually, with this many cache misses, it could be quite a bit worse;

there seems to be some memory-level parallelism going on). Have you

considered doing it piecewise, in cache-sized pieces? That made

Erathostenes quite a bit faster.

I also notice that you apparently don't measure the time for erasing

the memory.

Let's see about scalability:

[~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

Try: SieveOfAtkin .

PRIMES

1 iterations.

5761455 primes found, 1900 ms elapsed.

Performance counter stats for 'iforth 100000000 include /home/anton/forth/sieveofatkin.4th primes bye':

9516883875 instructions # 0.85 insn per cycle

11136639151 cycles

2477917529 L1-dcache-loads

197185830 L1-dcache-load-misses # 7.96% of all L1-dcache accesses

119105185 LLC-load-misses

2.791729605 seconds time elapsed

2.754480000 seconds user

0.032028000 seconds sys

10 times smaller, 10 times less time, just as O(n) predicts.

- anton

--

M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html

comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html

New standard: http://www.forth200x.org/forth200x.html

EuroForth 2021: https://euro.theforth.net/2021

May 22, 2022, 2:39:12 PMMay 22

to

On Sunday, May 22, 2022 at 7:11:10 PM UTC+2, Jali Heinonen wrote:

> sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:

> > On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:

[..]

> sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:

> > On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:

[..]

> I tested with generating first million primes on my RPI 4B and redirecting output

> into text file. Resulting time is over eight times faster than the time mentioned in

> your source listing and that time includes program startup and writing primes into file:

>

> root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

>

> real 0m57.158s

> user 0m45.213s

> sys 0m11.268s

That is 57 *seconds*, with n=15,485,863, and finds 1,000,000 primes?
> into text file. Resulting time is over eight times faster than the time mentioned in

> your source listing and that time includes program startup and writing primes into file:

>

> root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

>

> real 0m57.158s

> user 0m45.213s

> sys 0m11.268s

Note that Forth's Atkin (above in the thread) does that in ~120 *milliseconds*,

on an AMD Ryzen 7 5800X. According to Geekbench V, single-core 64bit, the

expected difference with an RPI 4B is about 8x. Something must be wrong.

It would be better to compare to the Standard Sieve?

-marcel

May 22, 2022, 2:58:47 PMMay 22

to

On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:

[..]

I don't understand what you mean.

> I also notice that you apparently don't measure the time for erasing

> the memory.

That is included, but I don't count the time to build the MOD lookup.

> Let's see about scalability:

>

> [~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

>

> Try: SieveOfAtkin .

> PRIMES

> 1 iterations.

> 5761455 primes found, 1900 ms elapsed.

[..]

and 100,000,000, but thanks anyway.

-marcel

[..]

> So 91 instructions, 111 cycles and 1.34 main memory accesses per

> number under consideration; yes, it could be all these cache misses

> (actually, with this many cache misses, it could be quite a bit worse;

> there seems to be some memory-level parallelism going on). Have you

> considered doing it piecewise, in cache-sized pieces? That made

> Erathostenes quite a bit faster.

That would be real work, I would need to sieve in batches. Maybe
> number under consideration; yes, it could be all these cache misses

> (actually, with this many cache misses, it could be quite a bit worse;

> there seems to be some memory-level parallelism going on). Have you

> considered doing it piecewise, in cache-sized pieces? That made

> Erathostenes quite a bit faster.

I don't understand what you mean.

> I also notice that you apparently don't measure the time for erasing

> the memory.

> Let's see about scalability:

>

> [~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

>

> Try: SieveOfAtkin .

> PRIMES

> 1 iterations.

> 5761455 primes found, 1900 ms elapsed.

> 10 times smaller, 10 times less time, just as O(n) predicts.

On my AMD 5800X the interesting gap is between 10,000,000
and 100,000,000, but thanks anyway.

-marcel

May 22, 2022, 3:12:28 PMMay 22

to

I have made it portable to Gforth, lxf, sf, vfx64, and posted the

result on

<https://www.complang.tuwien.ac.at/forth/programs/sieveofatkin.fs>.

Basically all things that I changed, except for the timing code, are

gratuitious incompatibilities.

Results on a Ryzen 5800X (LLC-dcache-load-miss not supported) for n=1e8:

cyc inst load L1 miss

gforth-fast 33,784,978,330 50,096,698,991 14,716,155,398 157,613,972

iforth 8,949,528,121 9,643,635,252 1,459,076,700 163,122,475

lxf 17,879,785,896 15,849,776,266 4,578,122,141 157,093,923

sf 22,653,189,337 20,410,579,910 6,135,891,074 158,402,472

vfx64 8,896,377,480 9,533,502,029 1,212,120,001 155,895,135

Cache misses don't seem to limit the performance, otherwise the cycle

differences between the Forth systems would be much smaller.

Command lines:

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses gforth-fast -e "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses iforth "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses lxf "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses sf "include /nfs/nfstmp/anton/SwiftForth-3.11.0/lib/options/fpmath.f 100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses vfx64 "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

result on

<https://www.complang.tuwien.ac.at/forth/programs/sieveofatkin.fs>.

Basically all things that I changed, except for the timing code, are

gratuitious incompatibilities.

Results on a Ryzen 5800X (LLC-dcache-load-miss not supported) for n=1e8:

cyc inst load L1 miss

gforth-fast 33,784,978,330 50,096,698,991 14,716,155,398 157,613,972

iforth 8,949,528,121 9,643,635,252 1,459,076,700 163,122,475

lxf 17,879,785,896 15,849,776,266 4,578,122,141 157,093,923

sf 22,653,189,337 20,410,579,910 6,135,891,074 158,402,472

vfx64 8,896,377,480 9,533,502,029 1,212,120,001 155,895,135

Cache misses don't seem to limit the performance, otherwise the cycle

differences between the Forth systems would be much smaller.

Command lines:

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses gforth-fast -e "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses iforth "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses lxf "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses sf "include /nfs/nfstmp/anton/SwiftForth-3.11.0/lib/options/fpmath.f 100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses vfx64 "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

May 22, 2022, 3:19:57 PMMay 22

to

May 22, 2022, 3:46:43 PMMay 22

to

On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:

[..]

[..]

> Can your Forth version really find all the 1,000,000 prime numbers between number

> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and

> redirect output into text file in about 120 milliseconds? Interesting as running empty

> program on my RPI 4B takes about 177 milliseconds.

I don't know about "and redirect output into text file", I would write-file to stdout if that
> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and

> redirect output into text file in about 120 milliseconds? Interesting as running empty

> program on my RPI 4B takes about 177 milliseconds.

is an allowed scenario? Bare numbers are independently duplicated by Anton

in this thread for several Forth systems, including iForth.

-marcel

May 22, 2022, 4:07:41 PMMay 22

to

[2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

There should 50 primes, not 13 primes?

May 22, 2022, 4:24:19 PMMay 22

to

On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:

[..]

[..]

> Can your Forth version really find all the 1,000,000 prime numbers between number

> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and

> redirect output into text file in about 120 milliseconds? Interesting as running empty

> program on my RPI 4B takes about 177 milliseconds.

Outputting a million primes is indeed slower than the original ~120 ms (about 25 times).
> range of 0 - 15,485,863, output all the prime numbers as formatted line of text and

> redirect output into text file in about 120 milliseconds? Interesting as running empty

> program on my RPI 4B takes about 177 milliseconds.

Other Forths can probably do this much faster as iForth is made to flush the output file.

FORTH> in sieveofatkin

Creating --- Sieve of Atkin Version 0.04 ---

Try: SieveOfAtkin .

PRIMES ok

1 iterations.

1000000 primes found, 3204 ms elapsed. ok

FORTH>

Here's how:

: write ( -- )

S" primes.txt" R/W BIN CREATE-FILE ?FILE out-FD !

FILE-I//O SET-IODEVICE

limit 1+ 2 DO sieve I + C@ IF I . ENDIF LOOP

out-FD @ CLOSE-FILE -1 out-FD ! ?FILE

STD-I//O SET-IODEVICE ;

.. where "write" is made the last word of SieveOfAtkin

-marcel

May 22, 2022, 4:32:53 PMMay 22

to

On Sunday, May 22, 2022 at 10:07:41 PM UTC+2, Jali Heinonen wrote:

[..]

[..]

> > https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

> Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

>

> [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

>

> There should 50 primes, not 13 primes?

From primes.txt:
> Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

>

> [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

>

> There should 50 primes, not 13 primes?

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83

89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167

173 179 181 191 193 197 199 211 223 227 229

I guess that will up the runtime a few notches :--)

-marcel

May 22, 2022, 4:38:37 PMMay 22

to

Marcel Hendrix <m...@iae.nl> writes:

>I adapted both Atkin and the standard Sieve so that I could run them with

>array sizes between 1k and 10GByte, with the following results:

I have now tried to run it with 100GB on a machine with 128GB RAM that
>I adapted both Atkin and the standard Sieve so that I could run them with

>array sizes between 1k and 10GByte, with the following results:

is not running anything significant. However, the out-of-memory

killer reaped it after a while, despite "free" showing 125GB free; ok,

100GB for the sieve and 100GB for MODS does not fit. Does MODS buy

anything?

I prepared a version without MODS

<http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,

and it is actually faster:

On the 5800X with n=1e8, using vfx64:

MODS no MODS

8,896,377,480 4,943,212,037 cycles

9,533,502,029 8,432,685,132 instructions

1,212,120,001 998,289,650 L1-dcache-loads

155,895,135 43,060,075 L1-dcache-load-misses

Ok, let's see about the 100G. Our 5800X currently has only 96GB, so I

used a Rocket Lake machine (Xeon W-1370P), where perf unfortunately

does not report loads or misses. I used vfx64, because the version of

iforth installed apparently does not like to allocate 10GB.

cycles instructions n primes

4,958,261,806 8,431,214,174 1e8 5761455

56,409,350,162 84,181,549,986 1e9 50847534

597,442,687,159 891,687,473,921 1e10 455052511

6,068,869,349,719 8,916,633,565,847 1e11 4118054813

The number of instructions is superlinear between n=1e9 and 1e10, but

linear for 1e8->1e9 and 1e10->1e11. Hmm. Cycles rise slightly

superlinearly, possibly because caches help less and less.

May 22, 2022, 4:48:04 PMMay 22

to

Marcel Hendrix <m...@iae.nl> writes:

>On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:

>[..]

>> So 91 instructions, 111 cycles and 1.34 main memory accesses per

>> number under consideration; yes, it could be all these cache misses

>> (actually, with this many cache misses, it could be quite a bit worse;

>> there seems to be some memory-level parallelism going on). Have you

>> considered doing it piecewise, in cache-sized pieces? That made

>> Erathostenes quite a bit faster.

>

>That would be real work,

Yes, it's more work for Erathostenes, don't know how amenable Atkin is
>On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:

>[..]

>> So 91 instructions, 111 cycles and 1.34 main memory accesses per

>> number under consideration; yes, it could be all these cache misses

>> (actually, with this many cache misses, it could be quite a bit worse;

>> there seems to be some memory-level parallelism going on). Have you

>> considered doing it piecewise, in cache-sized pieces? That made

>> Erathostenes quite a bit faster.

>

>That would be real work,

for that.

>On my AMD 5800X the interesting gap is between 10,000,000

>and 100,000,000, but thanks anyway.

usually does not model that. The instruction counts are not affected

by cache misses, and I would have expected the theoretically predicted

linear behaviour there, but as reported in another posting, I saw an

anomaly.

May 23, 2022, 3:02:11 AMMay 23

to

Marcel Hendrix <m...@iae.nl> writes:

>On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:

>[..]

>> >Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.

>> >The standard algorithm needs no division, so it will work for limit >

>> >2^64, while

>> >Atkin must be extended for that.

>> This the deception for those who inspect the O(n) behaviour.

>> Maybe Atkin uses less operations for n to infinity, there is a

>> constant C in front, that can count instructions needed for

>> a single operation.

>> That may mean -- as you have demonstrated -- that over a practical

>> range the O(n) doesn't tell the whole story.

>

>It was a bit of a surprise to me that it was so slow, given that the reference

>to this method came from a prime thread on the GMP mailing list.

>

>It is not easy to see which instructions are slow, given that the operation

>count is indeed minimal.

I looked up <https://en.wikipedia.org/wiki/Sieve_of_Atkin> and it
>On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:

>[..]

>> >Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.

>> >The standard algorithm needs no division, so it will work for limit >

>> >2^64, while

>> >Atkin must be extended for that.

>> This the deception for those who inspect the O(n) behaviour.

>> Maybe Atkin uses less operations for n to infinity, there is a

>> constant C in front, that can count instructions needed for

>> a single operation.

>> That may mean -- as you have demonstrated -- that over a practical

>> range the O(n) doesn't tell the whole story.

>

>It was a bit of a surprise to me that it was so slow, given that the reference

>to this method came from a prime thread on the GMP mailing list.

>

>It is not easy to see which instructions are slow, given that the operation

>count is indeed minimal.

discusses performance relative to Erathostenes quite a bit. I won't

reproduce that here, but I note that Erathostenes is O(N log log N)

compared to O(N) for Atkin. log log N grows very slowly:

1e3 fln fln f. 1.93264473391607 ok

1e6 fln fln f. 2.62579191447601 ok

1e9 fln fln f. 3.03125702258418 ok

1e12 fln fln f. 3.31893909503596 ok

1e15 fln fln f. 3.54208264635017 ok

That can be easily compensated for practically relevant N by a better

constant factor for Erathostenes, and the page discusses various

reasons for such a better constant factors for various implementation

variations.

As for "a prime thread on the GMP mailing list", even

usually-competent people are not immune to fallacies, and giving more

weight to mathematically proven results than they actually deserve,

and interpreting things into mathematical results that they actually

don't state are common fallacies.

May 23, 2022, 4:11:23 AMMay 23

to

On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:

> I won't

> reproduce that here, but I note that Erathostenes is O(N log log N)

> compared to O(N) for Atkin. log log N grows very slowly:

>

> 1e3 fln fln f. 1.93264473391607 ok

> 1e6 fln fln f. 2.62579191447601 ok

> 1e9 fln fln f. 3.03125702258418 ok

> 1e12 fln fln f. 3.31893909503596 ok

> 1e15 fln fln f. 3.54208264635017 ok

I learned a lot more from your level-headed experiment than from the
> I won't

> reproduce that here, but I note that Erathostenes is O(N log log N)

> compared to O(N) for Atkin. log log N grows very slowly:

>

> 1e3 fln fln f. 1.93264473391607 ok

> 1e6 fln fln f. 2.62579191447601 ok

> 1e9 fln fln f. 3.03125702258418 ok

> 1e12 fln fln f. 3.31893909503596 ok

> 1e15 fln fln f. 3.54208264635017 ok

Wikipedia page...

Next, and probably last, step will be to limit memory use.

The MOD hack should obviously go (as you've shown) and I can

make sieve a bit array. That will indicate if is advantageous to compress

more aggressively.

-marcel

May 23, 2022, 5:32:14 AMMay 23

to

May 23, 2022, 6:08:52 AMMay 23

to

In article <74043ddb-6fd4-4389...@googlegroups.com>,

Marcel Hendrix <m...@iae.nl> wrote:

<SNIP>

https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not

Short circuit operation in Python is a feature, not a bug.

This is actually widespread e.g. this is c-idiom:

if ( pc || *pc ) { ... }

Short circuit evaluation prevents accessing address NULL if pc

has not been filled in.

<SNIP>

Marcel Hendrix <m...@iae.nl> wrote:

<SNIP>

>Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's

>logic operators apparently taking shortcuts without being told.

If you mean this:
>logic operators apparently taking shortcuts without being told.

https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not

Short circuit operation in Python is a feature, not a bug.

This is actually widespread e.g. this is c-idiom:

if ( pc || *pc ) { ... }

Short circuit evaluation prevents accessing address NULL if pc

has not been filled in.

<SNIP>

May 23, 2022, 6:24:06 AMMay 23

to

In article <2022May2...@mips.complang.tuwien.ac.at>,

I've seen code in this thread that made the sieve of Atkin a

test of the speed of printing code.

Long ago the original byte benchmark turned into measuring the

speed of `` 1899 . '' instead of the speed of sieving.

The difference between quicksort N.logN and bubble sort N^2

should wave a red flag. That is a big difference.

However logN missing or adding here or there -- O(n) versus O(N.logn) --

doesn't mask if the algorithm needs a MOD versus pure addition,

or the effects or accessing main memory.

The wikipedia Atking page needs a caveat of this sort,

who are going to add it?

>

>- anton

test of the speed of printing code.

Long ago the original byte benchmark turned into measuring the

speed of `` 1899 . '' instead of the speed of sieving.

The difference between quicksort N.logN and bubble sort N^2

should wave a red flag. That is a big difference.

However logN missing or adding here or there -- O(n) versus O(N.logn) --

doesn't mask if the algorithm needs a MOD versus pure addition,

or the effects or accessing main memory.

The wikipedia Atking page needs a caveat of this sort,

who are going to add it?

>

>- anton

May 23, 2022, 6:55:03 AMMay 23

to

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,

67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137,

139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199,

211, 223, 227]

Email me for source, it was beyond my ability/patience to change it.

Jan Coombs

--

May 23, 2022, 8:05:38 AMMay 23

to

>Long ago the original byte benchmark turned into measuring the

>speed of `` 1899 . '' instead of the speed of sieving.

perf stat -e cycles gforth-fast ../gforth/siev.fs -e ": foo 1000 0 do 1899 . loop ; ..."

cycles ...

31203365 bye

300973318 main bye

35385551 foo bye

MAIN runs the sieve (made slightly faster than the Byte version) 1000

times, without printing the number of primes.

FOO prints 1899 1000 times.

As you can see, the printing is a lot faster.

>However logN missing or adding here or there

>-- O(n) versus O(N.logn) --

>The wikipedia Atking page needs a caveat of this sort,

>who are going to add it?

reasons. What makes you think that it needs an additional caveat?

May 23, 2022, 2:05:20 PMMay 23

to

On Monday, May 23, 2022 at 10:11:23 AM UTC+2, Marcel Hendrix wrote:

> On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:

[..]
> On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:

> Next, and probably last, step will be to limit memory use.

> The MOD hack should obviously go (as you've shown) and I can

> make sieve a bit array. That will indicate if is advantageous to compress

> more aggressively.

Using a bit array uncovered a likely reason for the discontinuity in the
> The MOD hack should obviously go (as you've shown) and I can

> make sieve a bit array. That will indicate if is advantageous to compress

> more aggressively.

speed results.

The bit results start off slower than the ones using bytes. The Intel bit

manipulation instructions are not very fast -- extracting the bits

'by hand' might be faster -- but iForth has the words, so why not.

Contrary to the byte results, from n = 1e3 to 1e8 the runtime increases

linearly with n, and for n =1e8 the result with bits is 1237/738 = 1.7 times

faster than with bytes. Definitely a worthwhile improvement.

Note that the AMD 5800X has 512KB L2 cache per core and 32MEG

L3 in total. It could be that for n > 0.1Gbytes or 12.5Mbits it runs out

of L3 cache and loses 100 - 300% performance that way. The faster

but much smaller L1 and L2 caches won't help here.

n runtime (ms) primes runtime (ms) (bits iso bytes)

1,000 0 168 0

10,000 0 1,229 0

100,000 0 9,592 0

1,000,000 4 78,498 6

10,000,000 44 664,579 72

100,000,000 1237 5,761,455 738

1,000,000,000 16044 50,847,534 16902

10,000,000,000 206919 455,052,511 215561

-marcel

May 23, 2022, 3:48:37 PMMay 23

to

May 25, 2022, 8:15:50 AMMay 25

to

On Sunday, May 22, 2022 at 10:38:37 PM UTC+2, Anton Ertl wrote:

> I prepared a version without MODS

> <http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,

> and it is actually faster:

BTW, clean code - getting it to run under 4tH was a breeze! Thanks for publishing this!
> I prepared a version without MODS

> <http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,

> and it is actually faster:

Hans Bezemer

Jun 5, 2022, 8:25:09 AMJun 5

to

On Monday, May 23, 2022 at 8:05:20 PM UTC+2, Marcel Hendrix wrote:

[..]

was not according to the XMP profile. Adjusting that increased the

run-speed of the Sieve of Atkin by 121%.

n runtime (ms) primes runtime (ms) (bits) (bits+mem timing)

1,000 0 168 0 0

10,000 0 1,229 0 0

100,000 0 9,592 0 0

1,000,000 4 78,498 6 6

10,000,000 44 664,579 72 65

100,000,000 1237 5,761,455 738 729

1,000,000,000 16044 50,847,534 16902 13920 ( -121%)

10,000,000,000 206919 455,052,511 215561 177643 ( -121%)

BTW, I learned that the segmented sieve of Atkin exists but I have no idea

if it is faster or not. The Sieve guys love to throw numbers around and bicker about them.

-marcel

[..]

> n runtime (ms) primes runtime (ms) (bits iso bytes)

> 1,000 0 168 0

> 10,000 0 1,229 0

> 100,000 0 9,592 0

> 1,000,000 4 78,498 6

> 10,000,000 44 664,579 72

> 100,000,000 1237 5,761,455 738

> 1,000,000,000 16044 50,847,534 16902

> 10,000,000,000 206919 455,052,511 215561

Slight update: I had a look in the BIOS and saw that the memory timing
> 1,000 0 168 0

> 10,000 0 1,229 0

> 100,000 0 9,592 0

> 1,000,000 4 78,498 6

> 10,000,000 44 664,579 72

> 100,000,000 1237 5,761,455 738

> 1,000,000,000 16044 50,847,534 16902

> 10,000,000,000 206919 455,052,511 215561

was not according to the XMP profile. Adjusting that increased the

run-speed of the Sieve of Atkin by 121%.

n runtime (ms) primes runtime (ms) (bits) (bits+mem timing)

1,000 0 168 0 0

10,000 0 1,229 0 0

100,000 0 9,592 0 0

1,000,000 4 78,498 6 6

10,000,000 44 664,579 72 65

100,000,000 1237 5,761,455 738 729

1,000,000,000 16044 50,847,534 16902 13920 ( -121%)

10,000,000,000 206919 455,052,511 215561 177643 ( -121%)

BTW, I learned that the segmented sieve of Atkin exists but I have no idea

if it is faster or not. The Sieve guys love to throw numbers around and bicker about them.

-marcel

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu