Sieve of Zakiya (SoZ) improved

jzakiya

unread,

Jul 23, 2010, 2:10:12 PM7/23/10

to

In 2008 I released my original Sieve of Zakiya (SoZ) family of prime
generators, and subsequently improved them in 2009. This is to
announce a further improvement which reduces memory usage to a
fundamental minimum while increasing performance. In addition, the
coding implementation has been factored to simplify the software and
algorithm understanding.

All code, papers, and tutorials are available here:

www.4shared.com/dir/TcMrUvTB/sharing.html

There are 2 reference codings: SoZGens.f and SoZGens.1.f

Primary development was done on a Toshiba Satelite laptop, using an
Intel P4 2.8GHz cpu, with 1GB ram.

I would really like to see what kind of results are achieved on
different (PPC, AMD, Sparc,,etc), and more modern (multicore, 64-bit),
platforms.

Software development was done with Gforth 0.7.0 on PC LinuxOS (PCLOS)
2010. It was also tested, and runs, on SwiftForth 3.2.2 on Windows XP.

On most system, you must specifically allocate more dictionary and
stack memory than the defaults to load/run P13. Otherwise, load the
generators that work within the default memory allocations.

Below are benchmark times for various values of N prime.
Best times (secs) using the 'gforth-fast' interpreter are shown.
Values in () are the number of primes upto N.

N -- Prime P13 P11 P7 P5 P3

2,000,000,011 143 178 189
(98,222,288)

1,750,000,027 120 153 159 179
(86,514,021)

1,500,000,001 99 128 134 151 145
(74,726,528)

1,250,000,027 77 106 113 133 130
(62,843,677)

1,000,000,007 59 82 86 96 92
(50,847,535)

750,000,007 43 60 64 72 70
(38,703,182)

500,000,003 28 38 41 47 46
(26,355,868)

250,000,013 13 16 20 22 21
(13,679,319)

100,000,007 6 6 8 9 9
(5,761,456)

I would encourage reading the SoZtutorial.pdf which explains the
operation and implementation of the improved algorithm.

These are 'generic' implementations. Higher performance
implementations can be achieved by performing the inner sieve looping
with multiple threads/cores.

For those who can, I would love to see the performance benchmarks on
other cpus and architectures, to see which reference versions works
best on what systems.

An interesting tidbit on the gforth-fast benchmarks is they are only
3/4 secs slower per generator for comparable C++ versions, but are
much easier to use because of the interactive nature of Forth. Also at
the above url are Ruby and Python versions.

OK Marcel, what can you show me!? :-)

Jabari Zakiya

Marcel Hendrix

unread,

Jul 24, 2010, 2:32:21 AM7/24/10

to

jzakiya <jza...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved
[..]
> www.4shared.com/dir/TcMrUvTB/sharing.html

> There are 2 reference codings: SoZGens.f and SoZGens.1.f

> Primary development was done on a Toshiba Satelite laptop, using an
> Intel P4 2.8GHz cpu, with 1GB ram.

> I would really like to see what kind of results are achieved on
> different (PPC, AMD, Sparc,,etc), and more modern (multicore, 64-bit),
> platforms.

[..]

> OK Marcel, what can you show me!? :-)

Single-processor 64-bit:

For N = 100,000,007 times in secs for these SoZ prime generators
SoZP7 0.585 seconds elapsed. ( 5,761,456 )
SoZP5 0.733 seconds elapsed. ( 5,761,456 )
SoZP3 0.694 seconds elapsed. ( 5,761,456 )
SoE 1.325 seconds elapsed. ( 5,761,456 ) ok

These times must be wrong, because they are almost 20 times
faster than you published for the Toshiba (I am using an Intel
Core-2 quad 2.66 GHz under Windows 7 with, of course, iForth64).

I can't show the results for the higher generators because of
the way you're pre-initializing the arrays there (put all their
data on the stack and then use a do ... ! loop). I'll make a MOVE
later this day.

-marcel

Marcel Hendrix

unread,

Jul 24, 2010, 5:10:23 AM7/24/10

to

jzakiya <jza...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved
[..]

> Primary development was done on a Toshiba Satelite laptop, using an

> Intel P4 2.8GHz cpu, with 1GB ram.

[..]

> Software development was done with Gforth 0.7.0 on PC LinuxOS (PCLOS)
> 2010. It was also tested, and runs, on SwiftForth 3.2.2 on Windows XP.

N -- Prime P13 P11 P7 P5 P3

2,000,000,011 143 178 189 (98,222,288)
1,750,000,027 120 153 159 179 (86,514,021)

1,500,000,001 99 128 134 151 145 (74,726,528)
1,250,000,027 77 106 113 133 130 (62,843,677)
1,000,000,007 59 82 86 96 92 (50,847,535)
750,000,007 43 60 64 72 70 (38,703,182)
500,000,003 28 38 41 47 46 (26,355,868)
250,000,013 13 16 20 22 21 (13,679,319)
100,000,007 6 6 8 9 9 ( 5,761,456)

[..]

> OK Marcel, what can you show me!? :-)

Single-processor 64-bit, using an Intel Core-2 quad 2.66 GHz under Windows 7
with, of course, iForth64.

FORTH> benchmarks

N -- Prime P13 P11 P7 P5 P3

2,000,000,011 11.771 11.423 13.063 ( 98,222,288 )
1,750,000,027 9.631 10.094 11.107 12.923 ( 86,514,021 )

1,500,000,001 8.137 8.333 9.381 11.092 12.280 ( 74,726,529 )
1,250,000,027 6.669 7.498 8.999 9.309 9.514 ( 62,843,677 )
1,000,000,007 5.973 5.570 6.648 7.334 7.811 ( 50,847,535 )
500,000,003 2.799 2.552 3.266 3.463 3.527 ( 26,355,868 )
250,000,013 1.091 1.147 1.580 1.853 1.762 ( 13,679,319 )
100,000,007 0.408 0.391 0.414 0.596 0.627 ( 5,761,456 ) ok

This seems to be about twice faster than results I got
( <3530351...@frunobulax.edu> ) in Nov 2008.

FORTH> benchmarks (iForth64, Intel Core-Duo E8200, 2.66 GHz, XP64 PRO)
For N = 1000000001 times in secs for these SoZ prime generators
SoZP11 8.995 seconds elapsed.
SoZP7 9.868 seconds elapsed.
SoZP5 11.543 seconds elapsed.
SoZP60 11.522 seconds elapsed.
SoZP3 14.318 seconds elapsed.
SoE 19.310 seconds elapsed. ok

-marcel

jzakiya

unread,

Jul 24, 2010, 3:00:36 PM7/24/10

to

On Jul 24, 4:10 am, m...@iae.nl (Marcel Hendrix) wrote:
> jzakiya <jzak...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved

> ( <35303513123...@frunobulax.edu> ) in Nov 2008.

>
> FORTH> benchmarks (iForth64, Intel Core-Duo E8200, 2.66 GHz, XP64 PRO)
> For N = 1000000001 times in secs for these SoZ prime generators
> SoZP11 8.995 seconds elapsed.
> SoZP7 9.868 seconds elapsed.
> SoZP5 11.543 seconds elapsed.
> SoZP60 11.522 seconds elapsed.
> SoZP3 14.318 seconds elapsed.
> SoE 19.310 seconds elapsed. ok
>
> -marcel

Thanks Marcel!

These times are almost hard to believe.

For 2,000,000,011 for P13, from 143 second to 11.7, a factor of 12
speedup. I would love to see the disassembled code you produce.

But I CAN believe them, because the actual sozsieve is really
computationally short and sweet. If it was done in pure assembly it
shouldn't be more than a couple of hundred bytes long, guessing.

You didn't say, but which version did you use. SoZGens.f of *.1.f?

To run with gforth I did: $ gforth-fast -m 1500M -d 6000
to extend the dictionary and stack amount to load P13.

But you can always load the residues in an appropriate chunksize so as
to not exceed your default stack length. Again, the code shows generic
implementations to show how the algorithms work. I've done many
different coding implementations to test how they perform. I don't
have the hardware to do threaded versions, where you break up the
sieve to do the residues in parallel, since they act as independent
vectors into the compressed prime candidates number space that exists
for a given generator.

From an architectural (versus coding) perspective, I think this code
represents the optimal technique for algorithm implementation.
Again, parallel threading represent the next implementation speedup.

Using a bitmapped representation of the primes candidates array 'prms'
will also create a 8x compression in memory usage, where every bit in
a byte represents a candidate value versus using the whole byte (8-
bits) address to represent one value. Of course, the memory management
to do that will slow the algorithm down a little, but if you want to
maximize the size of N you can use within a fixed memory space that's
how to do it.

After I take a break from this for a while (however long that is) I'll
do P17, the next Strictly Prime (SP) SoZ generator. It has ONLY 92,160
residue values. WhY do it? Because it's there!! ;-)

FYI, the SoZ may be incorporated into the SAGE symbolic math app.

http://sagemath.org/

I sent them the Python version and they are playing with it, and have
said they are looking to include it in their library.

We'll see.

Jabari

Marcel Hendrix

unread,

Jul 24, 2010, 4:17:33 PM7/24/10

to

jzakiya <jza...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved
[..]

> These times are almost hard to believe.

> For 2,000,000,011 for P13, from 143 second to 11.7, a factor of 12
> speedup. I would love to see the disassembled code you produce.

Be careful what you ask.

I don't think the code is that fabulous. I am sure Vfx will produce
about the same results, maybe even better, be it for outdated
architectures only.

-marcel

-- -----------------------------------------------------------------------------
: sozsieve ( prmslimit -) \ find/mark 'False' nonprime addresses in prms array
0 TO modn 0 TO r \ used to compute real value of each prms address
( prmslimit) 0 prms do \ loop thru prms to address close TO >= sqrt(N) value
r rescnt < IF r 1+ TO r else 1 TO r modn mdz + to modn ENDIF
I c@ IF \ if prms adrs value True its a prime
r residues @ modn + DUP TO prime \ compute/store prime value
( prime) rescnt * TO prmstep \ compute/store prmstep value
rescnt 1+ 1 do \ loop over residues for this prime
I residues @ modn + prime * ( product) \ compute (prime*residue[i])
( product) DUP num > IF drop leave ENDIF \ abort loop if product > num
\ compute product address in prms; mark False, then each prmstep adrs
( product) mdz /MOD ( rr qq) \ calc kth group & remainder
( qq) rescnt * swap ( rr) pos @ + ( nonprmndx) prms ( nonprmpos)
begin ( nonprmpos) DUP lastprms < \ mark prime multiples False
while false OVER c! ( nonprmpos) prmstep + repeat drop
loop
THEN
loop ;

--- ---------------------------------------------------
FORTH> ' sozsieve idis
$01251380 : [trashed]
$0125138A mov $01250D00 qword-offset, 0 d#
$01251395 mov $01250D20 qword-offset, 0 d#
$012513A0 mov rax, $01250360 qword-offset
$012513A7 lea rbx, [rax] qword
$012513AA pop rcx
$012513AB call (DO) offset NEAR
$012513B5 mov rax, rax
$012513B8 mov rcx, $01250D20 qword-offset
$012513BF cmp rcx, $01250CA0 qword-offset
$012513C6 jge $012513E3 offset NEAR
$012513CC mov rax, $01250D20 qword-offset
$012513D3 lea rcx, [rax 1 +] qword
$012513D7 mov $01250D20 qword-offset, rcx
$012513DE jmp $01251403 offset NEAR
$012513E3 mov $01250D20 qword-offset, 1 d#
$012513EE mov rdi, $01250D00 qword-offset
$012513F5 add rdi, $01250C80 qword-offset
$012513FC mov $01250D00 qword-offset, rdi
$01251403 mov rdi, [rbp 0 +] qword
$01251407 cmp [rdi] byte, 0 b#
$0125140A je $0125151B offset NEAR
$01251410 mov rdi, $01250D20 qword-offset
$01251417 lea rdi, [rdi*8 0 +] qword
$0125141F add rdi, $01250320 qword-offset
$01251426 mov rdi, [rdi] qword
$01251429 add rdi, $01250D00 qword-offset
$01251430 mov $01250D40 qword-offset, rdi
$01251437 imul rdi, $01250CA0 qword-offset
$0125143F mov $01250D60 qword-offset, rdi
$01251446 mov rax, $01250CA0 qword-offset
$0125144D push rbx
$0125144E lea rbx, [rax 1 +] qword
$01251452 push rbx
$01251453 mov rbx, 1 d#
$0125145A pop rcx
$0125145B call (DO) offset NEAR
$01251465 mov rax, rax
$01251468 mov rdi, [rbp 0 +] qword
$0125146C lea rdi, [rdi*8 0 +] qword
$01251474 add rdi, $01250320 qword-offset
$0125147B mov rdi, [rdi] qword
$0125147E add rdi, $01250D00 qword-offset
$01251485 imul rdi, $01250D40 qword-offset
$0125148D mov rax, $01250D80 qword-offset
$01251494 cmp rax, rdi
$01251497 push rbx
$01251498 mov rbx, rax
$0125149B mov rcx, rdi
$0125149E mov rbx, rcx
$012514A1 jge $012514AB offset NEAR
$012514A7 pop rbx
$012514A8 jmp [rbp #16 +] qword
$012514AB mov rdi, $01250C80 qword-offset
$012514B2 mov rcx, rbx
$012514B5 mov rbx, rdi
$012514B8 mov rax, rcx
$012514BB cqo
$012514BD idiv rbx
$012514C0 mov rcx, rdx
$012514C3 mov rbx, rax
$012514C6 imul rbx, $01250CA0 qword-offset
$012514CE lea rdi, [rcx*8 0 +] qword
$012514D6 add rdi, $01250340 qword-offset
$012514DD add rbx, [rdi] qword
$012514E0 add rbx, $01250360 qword-offset
$012514E7 nop
$012514E8 mov rax, $01250CE0 qword-offset
$012514EF cmp rax, rbx
$012514F2 jle $01251506 offset NEAR
$012514F8 mov [rbx] byte, 0 b#
$012514FB add rbx, $01250D60 qword-offset
$01251502 jmp $012514E8 offset SHORT
$01251504 push rbx
$01251505 pop rbx
$01251506 pop rbx
$01251507 add [rbp 0 +] qword, 1 b#
$0125150C add [rbp 8 +] qword, 1 b#
$01251511 jno $01251468 offset NEAR
$01251517 add rbp, #24 b#
$0125151B add [rbp 0 +] qword, 1 b#
$01251520 add [rbp 8 +] qword, 1 b#
$01251525 jno $012513B8 offset NEAR
$0125152B add rbp, #24 b#
$0125152F push rbx
$01251530 ;

jzakiya

unread,

Jul 24, 2010, 5:30:25 PM7/24/10

to

On Jul 24, 3:17 pm, m...@iae.nl (Marcel Hendrix) wrote:
> jzakiya <jzak...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved

Oh, that's nice and compact. Not alot of nops. Good Job!

I tried MPE Vfx for linux and it wouldn't load the file, and I
couldn't figure out what flags to use to increase the stack/dictionary
size, so I said forget it. More work than I was willing to put it.
Maybe you've inspired me to mess with it again...maybe.

However, I could load SoZGens.(1).f as is with SwiftForth (3.2.2) on
XP (and Linux under Wine) and run all the generators, until I pushed
the memory requirements for 'prm' and 'primes' out of bounds for N.

I see you ran your test with SoZGens.f. Try running SozGens.1.f and
see what you get. If may be faster for that Intel cpu/architecture.

I gotta get me some new wheels.
I feel like I'm in a stockcar race with a bicycle. :-)

Jabari

jzakiya

unread,

Jul 24, 2010, 8:26:45 PM7/24/10

to

As an experiment in coding efficiency, I modified sozsieve to hold
(r,modn) on the stack instead of in Values. See what difference it
makes with your (h/s)ware.
You only have to modify 6 lines of code in the original sozsieve.

: sozsieve ( prmslimit -) \ find/mark 'False' nonprime addresses in
prms array

0 ( modn) 0 ( r) \ used to compute real value of each prms
address
rot ( prmslimit) 0 prms do \ loop thru prms to address close to >=
sqrt(N) value
( r) dup rescnt < if 1+ else drop mdz + ( modn) 1 ( r) then
( modn r)
I c@ if \ if prms adrs value True
its a prime
2dup residues @ + dup to prime \ compute/store prime value
( prime) rescnt * to prmstep \ compute/store prmstep

value
rescnt 1+ 1 do \ loop over residues for
this prime

over I residues @ + prime * ( product) \ compute
(prime*residue[i])
( product) dup num > if drop leave then \ abort loop if

product > num
\ compute product address in prms; mark False, then each
prmstep adrs

( product) mdz /mod ( rr qq) \ calc kth group &

remainder
( qq) rescnt * swap ( rr) pos @ + ( nonprmndx) prms
( nonprmpos)

begin ( nonprmpos) dup lastprms < \ mark prime
multiples False
while false over c! ( nonprmpos) prmstep + repeat drop
loop
then
loop 2drop
;

Your original sozsieve assembly code was 432 bytes. Is this smaller/
faster(?).
On my system it seemed (?) to be a little faster, but I don't have the
time resolution to really measure the level of difference that may
exist.

But again, the code I posted are generic 'reference' versions, whose
purpose is to show more WHAT needs to be done versus (for a given
system) HOW best to do it.

Jabari

Marcel Hendrix

unread,

Jul 25, 2010, 3:39:14 AM7/25/10

to

jzakiya <jza...@gmail.com> writes Re: Sieve of Zakiya (SoZ) improved

> On Jul 24, 4:30 pm, jzakiya <jzak...@gmail.com> wrote:
>> On Jul 24, 3:17 pm, m...@iae.nl (Marcel Hendrix) wrote:

[..]

> As an experiment in coding efficiency, I modified sozsieve to hold
> (r,modn) on the stack instead of in Values. See what difference it
> makes with your (h/s)ware.
> You only have to modify 6 lines of code in the original sozsieve.

[..]

It makes a difference, but please don't draw heavy conclusions. The
amount of memory needed is close to the limit my machine seems to be
happy with. On machines with far more than 2 GBytes RAM the difference
may show better.

FORTH> benchmarks (old, values)

N -- Prime P13 P11 P7 P5 P3

2,000,000,011 11.771 11.423 13.063 0.000 0.000 ( 98,222,288 )
1,750,000,027 9.631 10.094 11.107 12.923 0.000 ( 86,514,021 )

1,500,000,001 8.137 8.333 9.381 11.092 12.280 ( 74,726,529 )
1,250,000,027 6.669 7.498 8.999 9.309 9.514 ( 62,843,677 )
1,000,000,007 5.973 5.570 6.648 7.334 7.811 ( 50,847,535 )
500,000,003 2.799 2.552 3.266 3.463 3.527 ( 26,355,868 )
250,000,013 1.091 1.147 1.580 1.853 1.762 ( 13,679,319 )
100,000,007 0.408 0.391 0.414 0.596 0.627 ( 5,761,456 ) ok

FORTH> benchmarks (new, stack)

N -- Prime P13 P11 P7 P5 P3

2,000,000,011 13.702 11.780 13.121 ( 98,222,288 )
1,750,000,027 9.607 9.679 13.035 14.045 ( 86,514,021 )
1,500,000,001 8.007 10.274 9.934 11.176 11.490 ( 74,726,529 )
1,250,000,027 7.766 7.597 8.872 9.477 9.443 ( 62,843,677 )
1,000,000,007 5.224 5.216 6.349 7.545 7.415 ( 50,847,535 )
500,000,003 2.438 2.487 2.735 3.447 3.559 ( 26,355,868 )
250,000,013 1.479 1.394 1.331 1.974 2.055 ( 13,679,319 )
100,000,007 0.549 0.484 0.560 0.765 0.847 ( 5,761,456 ) ok

> Your original sozsieve assembly code was 432 bytes. Is this smaller/faster(?).
> On my system it seemed (?) to be a little faster, but I don't have the
> time resolution to really measure the level of difference that may
> exist.

The new code is 378 bytes. As the innermost loop (begin .. while.. repeat)
didn't change, I don't believe there should be speed difference (in theory :-)

-marcel

Stephen Pelc

unread,

Jul 26, 2010, 5:55:08 AM7/26/10

to

On Sat, 24 Jul 2010 14:30:25 -0700 (PDT), jzakiya <jza...@gmail.com>
wrote:

>I tried MPE Vfx for linux and it wouldn't load the file, and I
>couldn't figure out what flags to use to increase the stack/dictionary
>size, so I said forget it. More work than I was willing to put it.
>Maybe you've inspired me to mess with it again...maybe.

In my copy of the PDF manual, there's section 23.2
"Saving to an ELF file"
See the words
set-size ( size -- )
set-stacks ( size -- )

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

jzakiya

unread,

Aug 6, 2010, 1:19:45 PM8/6/10

to

On Jul 26, 4:55 am, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Sat, 24 Jul 2010 14:30:25 -0700 (PDT), jzakiya <jzak...@gmail.com>

> wrote:
>
> >I tried MPE Vfx for linux and it wouldn't load the file, and I
> >couldn't figure out what flags to use to increase the stack/dictionary
> >size, so I said forget it. More work than I was willing to put it.
> >Maybe you've inspired me to mess with it again...maybe.
>
> In my copy of the PDF manual, there's section 23.2
> "Saving to an ELF file"
> See the words
> set-size ( size -- )
> set-stacks ( size -- )
>
> Stephen
>
> --

> Stephen Pelc, stephen...@mpeforth.com

> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
> web:http://www.mpeforth.com- free VFX Forth downloads

I created a file with just enough stuff to compile
these different versions of the sieve algorithms below
to get around different forths not being able to load
the full file because of the size of the arrays.

There are 2 reference versions: 1 and 2.
1 and 2 store the parameters [modn,r] in VALUES.
1a and 2a store [modn,r] on the stack.

Below is the difference in bytesize the different
versions compile to for different forths.

It should be noted, though, that the differences in the
bytesizes don't create any measurable difference in
performance between the versions (at least with gforth).
Would love to measure performance on Vfxlin but still
can't get all arrays to compile to allow for large N.
If someone can get results and post I would appreciate.

For Toshiba Satelite laptop,
Intel P4, PCLinuxOS 2010, 2.6.33.4, gcc 4.4.1

Byte sizes of words: sozsieve(1,1a,2,2a)

1 1a 2 2a

Vfxlin 4.4.0 356 322 396 354
# instructions 85 90 98 99

gforth-fast 0.7.0 501 417 536 453

Swiftforth 3.2.2 562 536 609 583

Win32Forth-STC 448 412 472 436

IForth 443

: sozsieve1 ( prmslimit -)
0 to modn 0 to r

( prmslimit) 0 prms do

r rescnt <
if r 1+ to r else 1 to r modn mdz + to modn then
I c@ if
r residues @ modn + dup to prime

( prime) rescnt * to prmstep

rescnt 1+ 1 do

I residues @ modn + prime * ( product)

( product) dup num > if drop leave then

( product) mdz /mod ( rr qq)

( qq) rescnt * swap ( rr) pos @ + prms

begin ( nonprmpos) dup lastprms <

while false over c! prmstep + repeat drop
loop
then
loop
;

: sozsieve1a ( prmslimit -)
0 0 ( modn r)
rot ( prmslimit) 0 prms do

( r) dup rescnt <
if 1+ else drop mdz + ( modn) 1 ( r) then

I c@ if

2dup residues @ + dup to prime

( prime) rescnt * to prmstep

rescnt 1+ 1 do ( modn r)
over I residues @ + prime * ( product)

( product) dup num > if drop leave then

( product) mdz /mod ( rr qq)

( qq) rescnt * swap ( rr) pos @ + prms

begin ( nonprmpos) dup lastprms <

while false over c! prmstep + repeat drop
loop
then
loop 2drop
;

: sozsieve2 ( lastprms -)
0 to modn 0 to r
( lastprms) 0 prms do
r rescnt <
if r 1+ to r else 1 to r modn mdz + to modn then
I c@ if
r residues @ modn + dup to prime
( prime) dup limit > if drop leave then

( prime) rescnt * to prmstep

rescnt 1+ 1 do

I residues @ modn + prime * ( product)

( product) dup num > if drop leave then

( product) mdz /mod ( rr qq)

( qq) rescnt * swap ( rr) pos @ + prms

begin ( nonprmpos) dup lastprms <

while false over c! prmstep + repeat drop
loop
then
loop
;

: sozsieve2a ( lastprms -)
0 0 ( modn r)
rot ( lastprms) 0 prms do

( r) dup rescnt <
if 1+ else drop mdz + ( modn) 1 ( r) then

I c@ if

2dup residues @ + dup to prime

( prime) dup limit > if drop leave then

( prime) rescnt * to prmstep

rescnt 1+ 1 do ( modn r)
over I residues @ + prime * ( product)

( product) dup num > if drop leave then

( product) mdz /mod ( rr qq)

( qq) rescnt * swap ( rr) pos @ + prms

begin ( nonprmpos) dup lastprms <

while false over c! prmstep + repeat drop
loop
then
loop 2drop
;