[ANN] primegen.go: Sieve of Atkin prime number generator


John Barham

unread,
Feb 4, 2011, 10:27:25 PM
to golang-nuts
Hello,

I have ported D. J. Bernstein's C primegen library
(http://cr.yp.to/primegen.html) to Go.

Primegen calculates prime numbers in order and uses the Sieve of Atkin
instead of the traditional Sieve of Eratosthenes.

The Go source is at https://github.com/jbarham/primegen.go and can be
installed via goinstall (i.e., "goinstall
github.com/jbarham/primegen.go").

It's a fairly straightforward port so preserves Bernstein's
characteristic terse style, but I do make use of multiple goroutines
when $GOMAXPROCS is set so it's competitive w/ the original C
implementation and is thus very fast: on my 2.4 GHz Intel Core i5,
primegen.go generates the first 50,847,534 primes up to 1,000,000,000
in ~0.7 seconds.
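For readers unfamiliar with the algorithm, here is a minimal single-threaded Sieve of Atkin in Go — a sketch illustrating only the quadratic-form tests the sieve is built on, not primegen's heavily optimized bit-packed code:

package main

import "fmt"

// atkin returns a flag slice where index i is true iff i is prime, built
// with the Sieve of Atkin: candidates are toggled by counting solutions
// to three quadratic forms, then squares of primes are culled.
func atkin(limit int) []bool {
    isPrime := make([]bool, limit+1)
    for x := 1; x*x <= limit; x++ {
        for y := 1; y*y <= limit; y++ {
            n := 4*x*x + y*y
            if n <= limit && (n%12 == 1 || n%12 == 5) {
                isPrime[n] = !isPrime[n]
            }
            n = 3*x*x + y*y
            if n <= limit && n%12 == 7 {
                isPrime[n] = !isPrime[n]
            }
            n = 3*x*x - y*y
            if x > y && n <= limit && n%12 == 11 {
                isPrime[n] = !isPrime[n]
            }
        }
    }
    for p := 5; p*p <= limit; p++ {
        if isPrime[p] {
            for k := p * p; k <= limit; k += p * p {
                isPrime[k] = false // squarefree filter
            }
        }
    }
    if limit >= 2 {
        isPrime[2] = true
    }
    if limit >= 3 {
        isPrime[3] = true
    }
    return isPrime
}

func main() {
    for i, p := range atkin(100) {
        if p {
            fmt.Print(i, " ") // 2 3 5 7 11 ... 97
        }
    }
    fmt.Println()
}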

John Barham

gordo...@gmail.com

unread,
Jun 14, 2016, 8:31:18 AM
to golang-nuts
That's interesting, but your time of 0.7 seconds on your specified machine isn't very informative: you should at least specify how long the original version of primegen (which is not multithreaded) takes, which we assume is about 0.7 seconds, and also how long your golang version takes without multithreading (or with "maxprocs := 1" set). Also, one should really convert Bernstein's "eratspeed" to golang for comparison.
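(For a like-for-like single-threaded run of the Go port, the parallelism can also be pinned from code rather than via $GOMAXPROCS — a minimal sketch using the standard runtime package; the port's own "numprocs"/"maxprocs" knob may differ:)

package main

import "runtime"

func init() {
    // Pin the scheduler to one processor so the Go port's timing is
    // comparable to Bernstein's single-threaded C primegen.
    runtime.GOMAXPROCS(1)
}

func main() {
    // ... run the single-threaded benchmark here ...
}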

Current versions of golang are not as fast as GCC C/C++ optimized code by at least a factor of two, and you will find that if one applies maximum wheel factorization (ie a 2/3/5/7 wheel plus pre-culling for primes of 11/13/17/19), rather than the limited 2/3/5 wheel factorization Bernstein used in eratspeed, that the Sieve of Eratosthenes will beat "primegen" for every range when comparing like for like (multithreading or not for both, written in the same language), and that golang will currently always be a factor slower than the C/C++ code due to less efficient compilation.

alb.do...@gmail.com

unread,
Jun 14, 2016, 9:11:49 AM
to golang-nuts, gordo...@gmail.com
> Current versions of golang are not as fast as GCC C/C++ optimized code by at least a factor of two

This is, in general, not true at all: I've written numeric code as fast as C code compiled by GCC -O3

gordo...@gmail.com

unread,
Jun 15, 2016, 8:37:59 AM
to golang-nuts
While I am sure you are correct that some numeric code in golang can run as fast as C/C++, it seems that bit-packed tight loops such as those used for culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for that matter) are not as well optimized. Of course one of the problems golang has is compulsory array bounds checks, but it doesn't run as fast as even DotNet or JVM code, which also have those compulsory checks.

To be specific, following is a simple Sieve of Eratosthenes program in golang that runs about twice as slow as the same program in C/C++ and about half again as slow as in C# or Java. The range of 262146 is chosen so that the created bit-packed array is smaller than the size of most CPU L1 caches (16 Kilobytes), so that memory access time isn't a bottleneck for most modern desktop CPU's, and the program is run 1000 times to get a reasonable length of time for more accurate timing. The commented-out lines were an experiment to use unsafe pointers (as in a C style array program) to try to save time by avoiding the automatic array bounds checks, but the resulting time was about the same:

package main

import (
    "fmt"
    "math"
    "time"
    // "unsafe"
)

// mkCLUT builds a lookup table mapping any 16-bit word of composite flags
// to the count of zero (i.e., prime) bits it contains.
func mkCLUT() [65536]byte {
    var arr [65536]byte
    for i := 0; i < 65536; i++ {
        var cnt byte = 0
        for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 {
            cnt++
        }
        arr[i] = cnt
    }
    return arr
}

var cnstCLUT [65536]byte = mkCLUT()

// primesTest counts the primes up to top with an odds-only bit-packed
// Sieve of Eratosthenes, where bit i represents the odd number 2*i + 3.
func primesTest(top uint) int {
    lmtndx := (top - 3) >> 1
    lstw := lmtndx >> 5
    topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1
    cmpsts := make([]uint32, lstw+1)
    // start := uintptr(unsafe.Pointer(&cmpsts[0]))
    // step := unsafe.Sizeof(cmpsts[0])
    // limit := start + uintptr(len(cmpsts))*step
    for i := 0; i <= topsqrtndx; i++ {
        // test bit i&31 of word i>>5; the &31 matters since i can exceed 31
        if cmpsts[i>>5]&(uint32(1)<<(uint(i)&31)) == 0 {
            p := (uint(i) << 1) + 3
            for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
                cmpsts[j>>5] |= 1 << (j & 31)
            }
            // p := uintptr((uint(i) << 1) + 3)
            // lmt := uintptr(lmtndx)
            // for j := (p*p - 3) >> 1; j <= lmt; j += p {
            //     *(*uint32)(unsafe.Pointer(start + step*(j>>5))) |= 1 << (j & 31)
            // }
        }
    }
    // mask off the bits beyond the limit in the last word, then count primes
    msk := uint32(0xFFFFFFFE) << (lmtndx & 31)
    cmpsts[lstw] |= msk
    cnt := 1 // starts at 1 to account for the prime 2
    for i := uint(0); i <= lstw; i++ {
        v := cmpsts[i]
        cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)])
    }
    return cnt
}

func main() {
    n := uint(262146)

    strt := time.Now()

    sum := 0
    for i := 0; i < 1000; i++ {
        sum += primesTest(n)
    }

    end := time.Now()

    fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), ".")
}

The assembly listing produced by "go tool compile -S primestest.go > primestest.s" reveals why (I have added comments at the ends of the lines to indicate what is going on):

line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 75 cmpsts[j>>5] |= 1 << (j & 31)
line 76 }

has the following assembly code:

0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check
0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; get address of right element of array; the di register contains the pointer to the array base
0x0132 00306 (primestest.go:75) MOVL (BX), R11 ;; get the contents of that element
0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO ANOTHER ARRAY BOUNDS CHECK!!!!!
0x0138 00312 (primestest.go:75) JCC $1, 539 ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!!
0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!!
0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
0x0145 00325 (primestest.go:75) MOVQ AX, CX
0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
0x014c 00332 (primestest.go:75) MOVL $1, BP
0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
0x0153 00339 (primestest.go:75) MOVQ R8, CX ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
0x0156 00342 (primestest.go:75) ORL R11, BP ;; bp now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;; move it back to the element via the address in BX
0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words
0x015b 00347 (primestest.go:75) NOP
0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add the stored 'p' in r10 to 'j'
0x015e 00350 (primestest.go:74) NOP
0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

The main problem is in the lines I have commented in capitals, where the bounds check is done twice, but also
where the contents of the cx register are saved and restored inside the loop.

A secondary problem as far as speed is concerned is that the C/C++ code also takes advantage of the direct read/modify/write OR instruction,
equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)", so the C/C++ code expressed in this assembly syntax would become:

0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check ONCE IF ONE MUST
0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
0x0145 00325 (primestest.go:75) MOVQ AX, CX
0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
0x014c 00332 (primestest.go:75) MOVL $1, BP
0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words AS NECESSARY
0x015b 00347 (primestest.go:75) NOP
0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4) ;; the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add the stored 'p' in r10 to 'j'
0x015e 00350 (primestest.go:74) NOP
0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

Note that NOP's do not take execution time, as they are optimized away in a modern CPU; they are there to
align the destinations of jump instructions to even word boundaries, which can make a difference in speed.

The immediately above loop will take about 3.5 clock cycles per iteration without the bounds check and about 5.5 clock cycles with the bounds check on a modern desktop CPU,
whereas the former version will take about three clock cycles more.

This analysis is for 64-bit code; the difference is even worse for 32-bit code, as there are fewer registers available and
the golang compiler is even worse at making best use of them.

For most use cases, the C/C++ code will be about twice the speed of the golang code.

C#/Java do the bounds checks, but not twice and are better at reducing register operations within tight loops.

In summary, the golang compiler has a way to go for maximum efficiency in register use, even if the array bounds checks are kept.

alb.do...@gmail.com

unread,
Jun 15, 2016, 9:10:28 AM
to golang-nuts, gordo...@gmail.com
I suggest you to repeat your analysis on go1.7beta, the compiler has a new backend
for amd64 and the generated code should be better. On my machine:

go1.6.2:

Found 23000000 primes up to 262146 in 445.329735ms .
Found 23000000 primes up to 262146 in 447.491918ms .
Found 23000000 primes up to 262146 in 451.530021ms .
Found 23000000 primes up to 262146 in 445.181835ms .
Found 23000000 primes up to 262146 in 447.489657ms .
Found 23000000 primes up to 262146 in 451.568191ms .
Found 23000000 primes up to 262146 in 446.207366ms .
Found 23000000 primes up to 262146 in 449.992793ms .
Found 23000000 primes up to 262146 in 445.746567ms .
Found 23000000 primes up to 262146 in 449.102428ms .

on go1.7beta

Found 23000000 primes up to 262146 in 358.99806ms .
Found 23000000 primes up to 262146 in 358.755634ms .
Found 23000000 primes up to 262146 in 360.231645ms .
Found 23000000 primes up to 262146 in 358.93451ms .
Found 23000000 primes up to 262146 in 359.412598ms .
Found 23000000 primes up to 262146 in 363.33646ms .
Found 23000000 primes up to 262146 in 363.981767ms .
Found 23000000 primes up to 262146 in 366.880261ms .
Found 23000000 primes up to 262146 in 358.761698ms .
Found 23000000 primes up to 262146 in 358.025324ms .

gordo...@gmail.com

unread,
Jun 15, 2016, 9:23:08 PM
to golang-nuts
golang version 1.7beta1 does indeed help, and the time is now not much worse than C#/Java, but still not as good as C/C++ due to the single array bounds check:

Using the same procedure to obtain an assembly listing (go tool compile -S PrimeTest.go > PrimeTest.s):

line 36 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 37 cmpsts[j>>5] |= 1 << (j & 31)
line 38 }

0x00f1 00241 (Main.go:37) MOVQ R8, CX ;; copy 'j' in r8 to cx
0x00f4 00244 (Main.go:37) SHRQ $5, R8 ;; shift r8 right by 5 to get the word address
0x00f8 00248 (Main.go:37) CMPQ R8, DX ;; array bounds check to array length stored in dx
0x00fb 00251 (Main.go:37) JCC $0, 454 ;; panic if fail bounds check
0x0101 00257 (Main.go:37) MOVL (AX)(R8*4), R10 ;; get element to r10 in one step
0x0105 00261 (Main.go:37) MOVQ CX, R11 ;; save 'j' for later in r11
0x0108 00264 (Main.go:37) ANDQ $31, CX ;; leave 'j' & 31 in cx
0x010c 00268 (Main.go:37) MOVL R9, R12 ;; save r9 to r12 to preserve the 1 it contains - WHY NOT JUST MAKE R12 CONTAIN 1 AT ALL TIMES IF USING IT IS QUICKER THAN AN IMMEDIATE LOAD
0x010f 00271 (Main.go:37) SHLL CX, R9 ;; R9 SHOULD JUST BE LOADED WITH 1 ABOVE - now r9 contains 1 << ('j' & 31)
0x0112 00274 (Main.go:37) ORL R10, R9 ;; r9 contains cmpsts[j >> 5] | (1 << ('j' & 31)) - the bit or is done here
0x0115 00277 (Main.go:37) MOVL R9, (AX)(R8*4) ;; element now contains the modified value
0x0119 00281 (Main.go:36) LEAQ 3(R11)(DI*2), R8 ;; tricky way to calculate 'j' + 2*'i' + 3, where 2*'i' + 3 is 'p'; answer to r8, saves a register
0x011e 00286 (Main.go:37) MOVL R12, R9 ;; RESTORE R9 FROM R12 - SHOULD NOT BE NECESSARY, but doesn't really cost in time as CPU is waiting for results of LEAQ operation
0x0121 00289 (Main.go:36) CMPQ R8, BX ;; check if 'j' in r8 is up to limit stored in bx
0x0124 00292 (Main.go:36) JLS $0, 241 ;; loop if not complete

This is much better than the 1.6.2 code in that it no longer does the array bounds check twice, although there is still the minor use of an extra r12 register to store the 1 instead of using an immediate load of 1 into the r9 register; r12 could instead have been used to store 'p' to save a slight amount of time, rather than the tricky code that recalculates 'p' (quickly) every loop (the tricky bit is still about a half cycle slower than just using a pre-calculated 'p' value). The C/C++ code will still be quicker, mainly because the lack of an array bounds check saves a couple of CPU clock cycles, but also because it is more efficient to use the single read/modify/write version of the ORL instruction instead of MOVL from the array element to a register, ORL with the bit modifier, then MOVL from the register back to the array element. It seems the compiler is now almost trying too hard to save registers, at the cost of time in the tricky 'p' calculation, while spending a register for no gain (or an actual loss) in saving the 1.

So it is good to see that golang compiler optimization is taking some steps forward, but it isn't quite there yet.

Nigel Tao

unread,
Jun 15, 2016, 11:11:12 PM
to gordo...@gmail.com, golang-nuts
On Thu, Jun 16, 2016 at 11:22 AM, <gordo...@gmail.com> wrote:
> it no longer does the array bounds check twice

It's not safe, but you might want to try disabling bounds checking by
"go tool compile -B" or "go build -gcflags=-B".

gordo...@gmail.com

unread,
Jun 16, 2016, 12:20:50 AM
to golang-nuts, gordo...@gmail.com
In fact, in this case it is safe, because the loop code has a built-in array check: the 'j' variable can't go past the top of the array (and in the enclosing loop the 'i' index can't go past the index of the square root of the top value), and the array is sized to contain that top array index.

I am not currently at my usual test machine so will wait a few hours to post the results, but I would assume that just the two-instruction array bounds check, which takes about two CPU clock cycles, will be eliminated; thus it is likely that the run time will be reduced to about two thirds for a modern desktop CPU.  This will still likely be slightly slower than optimized C/C++ code for the reasons given in my last post, but not bad at all.

It's too bad one can't disable bounds checks only for certain ranges of code by "pragma"'s in the source code, but I don't suppose that is possible.
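(In lieu of a pragma, one source-level idiom is to hoist a single dominating check so the compiler has a chance to prove the inner accesses in range — whether any particular Go version actually drops the inner check must be verified, e.g. with -gcflags="-d=ssa/check_bce/debug=1" on compilers that support it. A hedged sketch, not the thread's code:)

// cullHinted is a hypothetical variant of the inner culling loop. The
// slicing on the first line performs one bounds check up front and gives
// the compiler a slice whose length provably covers j>>5 for all
// j <= lmtndx; whether the inner check is then eliminated is
// version-dependent.
func cullHinted(cmpsts []uint32, p, start, lmtndx uint) {
    buf := cmpsts[:lmtndx>>5+1]
    for j := start; j <= lmtndx; j += p {
        buf[j>>5] |= 1 << (j & 31)
    }
}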



gordo...@gmail.com

unread,
Jun 16, 2016, 5:13:12 AM
to golang-nuts
No real surprises with the no-bounds-checks option (-B): it just eliminated the array bounds checks, leaving the rest of the code the same (version 1.7beta1):

0x00dd 00221 (main.go:37) MOVQ DI, CX
0x00e0 00224 (main.go:37) SHRQ $5, DI
0x00e4 00228 (main.go:37) MOVL (AX)(DI*4), R9
0x00e8 00232 (main.go:37) MOVQ CX, R10
0x00eb 00235 (main.go:37) ANDQ $31, CX
0x00ef 00239 (main.go:37) MOVL R8, R11
0x00f2 00242 (main.go:37) SHLL CX, R8
0x00f5 00245 (main.go:37) ORL R8, R9
0x00f8 00248 (main.go:37) MOVL R9, (AX)(DI*4)
0x00fc 00252 (main.go:36) LEAQ 3(R10)(SI*2), DI
0x0101 00257 (main.go:37) MOVL R11, R8
0x0104 00260 (main.go:36) CMPQ DI, DX
0x0107 00263 (main.go:36) JLS $0, 221

It is now almost as fast as C/C++ code, and the reasons it isn't quite as fast are the same as explained before: excessively using registers to store things, and not using the read/modify/write instruction (which would also save the use of a register).

The current beta works not too badly for amd64 code but still doesn't use registers efficiently enough to support x86 code, as it uses too many registers. Optimized C/C++ code only uses six or at most seven registers, which the x86 architecture has, but not the nine registers that the above requires.

So for this tight loop, golang is still slower than optimized C/C++ code, but not by very much if array bounds checks are disabled.

Konstantin Khomoutov

unread,
Jun 16, 2016, 7:06:46 AM
to gordo...@gmail.com, golang-nuts
On Thu, 16 Jun 2016 02:12:56 -0700 (PDT)
gordo...@gmail.com wrote:

[...]
> The current beta will work not too badly with amd64 code but still
> doesn't use registers efficiently enough to support x86 code as it
> uses too many register. optimized C/C++ code only uses six or at
> most 7 registers, which the x86 architecture has, but not the nine
> registers that the above requires.
>
> So for this tight loop, golang is still slower than optimized C/C++
> code, but not by very much if array bounds checks are disabled.

It appears you're well versed in the x86 processor instruction set and
code generation. Could you may be offer help to the folks working on
improving the code generation backend of the gc compiler?

gordo...@gmail.com

unread,
Jun 16, 2016, 10:57:11 AM
to golang-nuts
> So for this tight loop, golang is still slower than optimized C/C++
> code, but not by very much if array bounds checks are disabled.

> It appears you're well versed in the x86 process instruction set and
> code generation. Could you may be offer help to the folks working on
> improving the code generation backend of the gc compiler?

I am a machine language and assembler programmer from over 40 years ago, but have little knowledge of writing compiler backends, and while I know what kinds of optimizations they make, I have no experience in actually writing programs to achieve those optimizations. Thus I would not likely be able to make significant contributions to the gc backend other than to make comments on what I see in the generated output, as here.

⚛

unread,
Jun 16, 2016, 11:45:54 AM
to golang-nuts, gordo...@gmail.com
Modern x86 CPUs don't work like that.

In general, optimally scheduled assembly code which uses more registers has higher performance than optimally scheduled assembly code which uses a smaller number of registers, assuming both assembly codes correspond to the same source code.

Register renaming: since Intel Pentium Pro and AMD K5.


Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf

An excerpt from the above PDF document (Section 10 about Haswell and Broadwell pipeline): "... the register file has 168 integer registers and 168 vector registers ..."

⚛

unread,
Jun 16, 2016, 11:50:19 AM
to golang-nuts, gordo...@gmail.com
Will golang or google pay money for such a work?

Konstantin Khomoutov

unread,
Jun 16, 2016, 12:37:29 PM
to ⚛, golang-nuts, gordo...@gmail.com
On Thu, 16 Jun 2016 08:50:19 -0700 (PDT)
⚛ <0xe2.0x...@gmail.com> wrote:

> > > The current beta will work not too badly with amd64 code but
> > > still doesn't use registers efficiently enough to support x86
> > > code as it uses too many register. optimized C/C++ code only
> > > uses six or at most 7 registers, which the x86 architecture has,
> > > but not the nine registers that the above requires.
> > >
> > > So for this tight loop, golang is still slower than optimized C/C+
> > > + code, but not by very much if array bounds checks are disabled.
> >
> > It appears you're well versed in the x86 process instruction set
> > and code generation. Could you may be offer help to the folks
> > working on improving the code generation backend of the gc
> > compiler?
>
> Will golang or google pay money for such a work?

IANAG [*] but I believe it will not. ;-)

My line of reasoning that resulted in the friendly nudge you're discussing
was something like: "the person appears to be well-versed in this
subject", "he is also interested in making Go programs go faster",
"therefore, it would be good to contribute directly to where the speed
is a concern".

[*] I Am Not At Google.

gordo...@gmail.com

unread,
Jun 16, 2016, 8:30:12 PM6/16/16
to golang-nuts
> Modern x86 CPUs don't work like that.

> In general, optimally scheduled assembly code which uses more registers has higher performance than optimally scheduled assembly code which uses a smaller number of registers, assuming both assembly codes correspond to the same source code.

> Register renaming: since Intel Pentium Pro and AMD K5.

> Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf

> An excerpt from the above PDF document (Section 10 about Haswell and Broadwell pipeline): "... the register file has 168 integer registers and 168 vector registers ..."

I am aware of all of the above and have already read Agner Fog's publications. In addition modern CPU's do Out of Order Execution (OOE) so rearrange the instructions to best reduce instruction latencies and increase throughput given that there are parallel execution pipelines and ahead-of-time execution, so the actual execution order is almost certainly not as per the assembly listing.

Yes, both assembly listings are from the same tight loop code, but the "C/C++" one has been converted from another assembly format to the golang assembly format.

Daniel Bernstein, the author of "primegen", wrote for the Pentium 3 in x86 (32-bit) code, as the Pentium Pro processor wasn't commonly available at that time and 64-bit code didn't exist. His hand optimized C code for the Sieve of Eratosthenes ("eratspeed.c" in the "doit()" function for the "while k < B loop") uses six registers for this inner culling loop being discussed, and takes about 3.5 CPU clock cycles per loop on a modern CPU (Haswell).

The number of internal CPU registers actually used by the CPU to effect OOE is beside the point, as they have to do with the CPU's internal optimizations and not compiler optimizations; my point is that the compiler's incorrect use of registers still costs time.

While I don't expect golang, with its philosophy of preserving "safe" paradigms in doing array bounds checks by default, to run as fast as C/C++ code that doesn't have that philosophy, I do expect it to run at least as fast as C#/Java code which are Just In Time (JIT) compiled and do have the "safe" philosophy. The golang compiler version 1.7beta1 is not quite there yet for the indicated reasons: inconsistent use of registers, using one too many in one place in order to avoid an immediate load which doesn't cost any execution time, and saving one register by use of the "trick" which does cost execution time as compared to the use of a single register.

However, there is hope, as version 1.7 has made great advances since version 1.6; surely version 1.8, which is said to be intended to improve this further, will be faster yet. At any rate, version 1.7 speed is "adequate" for many purposes, as it at least comes close (probably to within about 15% to 20% or less) to C#/Java speed in many of the most demanding tight loop algorithms, and thus is quite usable as compared to previous versions. But even the most avid golang protagonists must admit that it isn't the language in which to re-write Kim Walisch's "primesieve", with its extreme loop unrolling that takes an average of about 1.4 CPU clock cycles per composite number cull for small ranges of primes, as even with array bounds checking turned off, golang would still take at least twice and more likely three times as long.
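(For a picture of what that unrolling does, here is an illustrative 4x unroll of the bit-culling loop in Go — a sketch of the technique only, not primesieve's actual code, which additionally hard-codes word and mask patterns per prime residue class:)

// cullUnrolled marks multiples of p in the bit-packed composites array,
// four culls per loop-condition check, with a tail loop for the remainder.
func cullUnrolled(cmpsts []uint32, p, start, lmt uint) {
    j := start
    for ; j+3*p <= lmt; j += 4 * p {
        cmpsts[j>>5] |= 1 << (j & 31)
        cmpsts[(j+p)>>5] |= 1 << ((j + p) & 31)
        cmpsts[(j+2*p)>>5] |= 1 << ((j + 2*p) & 31)
        cmpsts[(j+3*p)>>5] |= 1 << ((j + 3*p) & 31)
    }
    for ; j <= lmt; j += p {
        cmpsts[j>>5] |= 1 << (j & 31)
    }
}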

That is also why I first started posting to this thread: the only reason the golang version of "primegen" is reasonably comparable in speed to C/C++ "primegen" is that it uses multi-threading on a multi-core processor, which weren't available to Daniel Bernstein when he wrote "primegen". My point was one should compare like with like.

Keith Randall

unread,
Jun 16, 2016, 10:23:18 PM
to golang-nuts, gordo...@gmail.com
Looks like something is wrong with immediate loading for the 1 << ... operation.  Could you open a bug with repro instructions?  I can look at it when 1.8 opens.

gordo...@gmail.com

unread,
Jun 17, 2016, 4:26:00 AM
to golang-nuts
> Looks like something is wrong with immediate loading for the 1 << ... operation. Could you open a bug with repro instructions? I can look at it when 1.8 opens.

I have opened a golang issue #16092 as follows: https://github.com/golang/go/issues/16092

I may have over-complicated it, but as well as the failure to use the load immediate instruction, I documented other suggestions that would make the code run faster, as well as a method to eliminate the array bounds check for cases when the check is built in to the loop, as C# (sometimes) does for x86 code when one uses a specific formulation of a loop. If implemented, these changes would then apply to x86 code as well due to the reduction in register use, and could potentially make golang code run as fast as C/C++, as compared in the issue.
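(For readers following along, the shift pattern at issue boils down to something like this minimal repro — my distillation, not necessarily the exact code attached to #16092:)

package main

// cull exhibits the pattern where the constant 1 in 1 << (j & 31) should
// compile to an immediate load rather than being shuffled through registers.
func cull(cmpsts []uint32, p, lmtndx uint) {
    for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
        cmpsts[j>>5] |= 1 << (j & 31)
    }
}

func main() {
    cmpsts := make([]uint32, 1<<12)
    cull(cmpsts, 3, 1<<17-1)
    _ = cmpsts
}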

⚛

unread,
Jun 17, 2016, 9:35:21 AM
to golang-nuts, gordo...@gmail.com
On Friday, June 17, 2016 at 2:30:12 AM UTC+2, gordo...@gmail.com wrote:
> Modern x86 CPUs don't work like that.

> In general, optimally scheduled assembly code which uses more registers has higher performance than optimally scheduled assembly code which uses a smaller number of registers, assuming both assembly codes correspond to the same source code.

> Register renaming: since Intel Pentium Pro and AMD K5.

> Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf

> An excerpt from the above PDF document (Section 10 about Haswell and Broadwell pipeline): "... the register file has 168 integer registers and 168 vector registers ..."

> I am aware of all of the above and have already read Agner Fog's publications.  In addition modern CPU's do Out of Order Execution (OOE) so rearrange the instructions to best reduce instruction latencies and increase throughput given that there are parallel execution pipelines and ahead-of-time execution, so the actual execution order is almost certainly not as per the assembly listing.

> Yes, both assembly listings are from the same tight loop code, but the "C/C++" one has been converted from another assembly format to the golang assembly format.

> Daniel Bernstein, the author of "primegen", wrote for the Pentium 3 in x86 (32-bit) code, as the Pentium Pro processor wasn't commonly available at that time and 64-bit code didn't exist.  His hand optimized C code for the Sieve of Eratosthenes ("eratspeed.c" in the "doit()" function for the "while k < B loop") uses six registers for this inner culling loop being discussed, and takes about 3.5 CPU clock cycles per loop on a modern CPU (Haswell).

The following isn't an argument against your post or in favor of your post. It is just some assembly code that I find interesting.

$ clang-3.9 -S -O3 eratspeed.c -mtune=bdver3
.LBB1_4:
        movl    %edx, %edi
        shrl    $5, %edi
        movl    %edx, %ecx
        andl    $31, %ecx
        movl    two(,%rcx,4), %ecx
        addl    %esi, %edx
        orl     %ecx, a(%rbx,%rdi,4)
        cmpl    $32032, %edx
        jb      .LBB1_4

5 registers: DX DI CX SI BX

$ perf stat --repeat=100 -- ./eratspeed.orig >/dev/null

 Performance counter stats for './eratspeed.orig' (100 runs):

        466.592220      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
                 1      context-switches          #    0.002 K/sec                    ( +- 12.37% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                86      page-faults               #    0.185 K/sec                    ( +-  0.09% )
     1,862,076,369      cycles                    #    3.991 GHz                      ( +-  0.11% )
        74,805,707      stalled-cycles-frontend   #    4.02% frontend cycles idle     ( +-  1.20% )
       272,721,373      stalled-cycles-backend    #  14.65% backend cycles idle       ( +-  0.16% )
     4,116,301,949      instructions              #    2.21  insn per cycle         
                                                  #    0.07  stalled cycles per insn  ( +-  0.00% )
       473,019,237      branches                  # 1013.774 M/sec                    ( +-  0.00% )
         8,443,283      branch-misses             #    1.78% of all branches          ( +-  1.05% )

       0.467167554 seconds time elapsed                                          ( +-  0.11% )

-----

Hand optimization of the above:

.LBB1_4:
        movl    %edx, %edi
        shrl    $5, %edi
        movl    %edx, %ecx
        andl    $31, %ecx
        movl    two(,%rcx,4), %r11d
        addl    %esi, %edx
        orl     %r11d, a(%rbx,%rdi,4)
        cmpl    $32032, %edx
        jb      .LBB1_4

6 registers: DX DI CX SI BX R11

$ perf stat --repeat=100 -- ./eratspeed.opt >/dev/null
 Performance counter stats for './eratspeed.opt' (100 runs):

        444.740487      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.01% )
                 1      context-switches          #    0.002 K/sec                    ( +- 11.68% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                85      page-faults               #    0.191 K/sec                    ( +-  0.10% )
     1,775,283,795      cycles                    #    3.992 GHz                      ( +-  0.00% )
        68,397,472      stalled-cycles-frontend   #    3.85% frontend cycles idle     ( +-  0.04% )
       201,687,390      stalled-cycles-backend    #  11.36% backend cycles idle       ( +-  0.02% )
     4,116,277,783      instructions              #    2.32  insn per cycle         
                                                  #    0.05  stalled cycles per insn  ( +-  0.00% )
       473,014,763      branches                  # 1063.575 M/sec                    ( +-  0.00% )
         8,263,323      branch-misses             #    1.75% of all branches          ( +-  0.00% )

       0.445287360 seconds time elapsed                                          ( +-  0.01% )

gordo...@gmail.com

unread,
Jun 17, 2016, 9:52:27 AM
to golang-nuts
I don't see the point of the exercise other than it proves that not putting the result of an operation back into the same register reduces the latency slightly for your processor (whatever it is?); I suspect that if you used any other register, such as the unused AX register rather than the R11 register, the results would be the same.

What would be interesting is to compare the run time on your processor with the same eratspeed program written in golang using version 1.7beta1 with and without array bounds checks.

⚛

unread,
Jun 17, 2016, 10:18:12 AM
to golang-nuts, gordo...@gmail.com
On Friday, June 17, 2016 at 3:52:27 PM UTC+2, gordo...@gmail.com wrote:
> I don't see the point of the exercise other than it proves that not putting the result of an operation back into the same register reduces the latency slightly for your processor (whatever it is?); I suspect that if you used any other register, such as the unused AX register rather than the R11 register, the results would be the same.

Why do you see no point of the exercise and at the same time you do see the point of optimizing 1.7beta1 code? What is the difference?

gordo...@gmail.com

unread,
Jun 17, 2016, 10:48:06 AM
to golang-nuts
I don't see the point to the exercise as far as optimizing golang is concerned. Your experiment just shows that your compiler (GCC?) missed an optimization as far as reducing backend latency goes.

You may also find that swapping the order of some of the instructions, such as the second and the third in the loop, reduces backend latency further.

I am not on a high end Intel CPU now, but when I was, I found that, with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes), eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

⚛

unread,
Jun 17, 2016, 11:38:20 AM
to golang-nuts, gordo...@gmail.com
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
> I don't see the point to the exercise as far as optimizing golang is concerned.

It is a general rule that using more registers results in faster code.
 
> Your experiment just shows that your compiler (GCC?)

My post containing the example mentions that the compiler is clang 3.9.
 
> missed an optimization as far as reducing backend latency goes.

It is significant because it indicates that the clang compiler might be completely missing the general rule which I mentioned above.

> You may also find that swapping the order of some of the instructions such as the second and the third in the loop may also reduce backend latency further.

Swapping the order of the instructions in the example results in the same or lower performance.

> I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

> My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?

gordo...@gmail.com

unread,
Jun 17, 2016, 6:21:21 PM
to golang-nuts
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
> > I don't see the point to the exercise as far as optimizing golang is concerned.

> It is a general rule that using more registers results in faster code.

Yes, except when they're wasted, as with golang 1.7 using the extra register just to save a load immediate of $1.

> > Your experiment just shows that your compiler (GCC?)

> My post containing the example mentions that the compiler is clang 3.9.

Yes, sorry, I was tired and forgot that.

> > missed an optimization as far as reducing backend latency goes.

> It is significant because it indicates that the clang compiler might be completely missing the general rule which I mentioned above.

It is significant to Clang development, but not to golang development, which I thought was what we were working on in this thread; that's why I said that your work, though important, doesn't really apply here as to golang improvements.

> > You may also find that swapping the order of some of the instructions such as the second and the third in the loop may also reduce backend latency further.

> Swapping the order of the instructions in the example results in the same or lower performance.

I said "may".

> > I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

> > My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

> Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?

One of the originals, an FX8120 (4 core, 8 processor) @ 3.1 GHz.

By the clock frequency, I assume you were running these tests on a high end Intel CPU?

Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?

I am currently working on a golang version of eratspeed and we'll see how it compares; it will be a better comparison than just using my simple PrimesSpeed test...

gordo...@gmail.com

unread,
Jun 18, 2016, 12:15:11 AM
to golang-nuts
Here is a golang version of Daniel Bernstein's "eratspeed", which is a straight translation of his C code to golang without any changes other than as necessary to make it work: https://play.golang.org/p/Sd6qlMQcHF.

It won't run on the golang playground because it takes too long unless one changes the number of loops it runs, but at least one gets a nicely formatted listing.

I haven't played with optimizing this other than to turn off array bounds checks in compiling it (to make it more comparable to C), but it seems that "as-is" it runs at about the same speed as John's "primespeed" code from the beginning of this thread when that code has multi-threading disabled (force numprocs = 1 in the sieve.go file) and both have array bounds checking turned off, and both under these conditions run slightly faster than Bernstein's C "primespeed" in 32-bit mode as compiled with GCC using his "make". Bernstein's C "eratspeed" under those same conditions runs about 17% slower than his "primespeed", which makes it about 20% slower than this golang version. Those are all as compiled and run on my machine, which isn't particularly fast or efficient.

Neither have I yet looked at the machine code generated by this version of eratspeed, but it is likely helped by there being more constants so there is less of a demand for registers.

This isn't the ultimate in Sieve of Eratosthenes algorithms yet, as it could be maximally factorized (use a 2/3/5/7 wheel instead of the 2/3/5 one used, and use a pre-cull pattern which has the 11/13/17/19 prime composites eliminated to initialize the culling array) to save about another 25% to 30%, and multi-threading could be applied for another speed up by a factor of about four on 4-core-plus-HT CPU's; it also needs some work to make the range extendable since, as currently written, it is hard coded to sieve up to this specific range.
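(To make the pre-cull idea concrete — my sketch, not Bernstein's or the posted code: composites of a few small primes are marked once into a pattern whose bit length is the product of those primes, and that pattern is then tiled into each page segment instead of those primes being re-sieved every time. Odds-only packing is assumed here, bit i representing the odd number 2*i + 3:)

// buildPattern marks composites of the given small odd primes into a bit
// pattern of length equal to the product of the primes, so that the
// pattern tiles any segment exactly. Note it also marks each prime p
// itself; a real sieve fixes up the first segment for that.
func buildPattern(primes []uint) (pattern []uint32, patBits uint) {
    patBits = 1
    for _, p := range primes {
        patBits *= p
    }
    pattern = make([]uint32, (patBits+31)/32)
    for _, p := range primes {
        // odd multiples of p sit at index (p-3)/2 + k*p in odds-only space
        for j := (p - 3) / 2; j < patBits; j += p {
            pattern[j>>5] |= 1 << (j & 31)
        }
    }
    return pattern, patBits
}

// fillSegment tiles the pattern into a segment whose bit 0 corresponds to
// overall odd-index 'start'; bit-by-bit for clarity, where real code would
// copy whole words at a rotating offset.
func fillSegment(seg []uint32, segBits uint, pattern []uint32, patBits, start uint) {
    for b := uint(0); b < segBits; b++ {
        i := (start + b) % patBits
        if pattern[i>>5]&(1<<(i&31)) != 0 {
            seg[b>>5] |= 1 << (b & 31)
        }
    }
}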

Anyway, compare apples with apples, both algorithms true to those written by Bernstein.

⚛

unread,
Jun 18, 2016, 6:35:44 AM
to golang-nuts, gordo...@gmail.com
On Saturday, June 18, 2016 at 6:15:11 AM UTC+2, gordo...@gmail.com wrote:
> Here is a golang version of Daniel Bernstein's "eratspeed", which is a straight translation of his C code to golang without any changes other than as necessary to make it work:  https://play.golang.org/p/Sd6qlMQcHF.

$ perf stat --repeat=100 -- ./eratspeed-go1.7tip
 Performance counter stats for './eratspeed-go1.7tip' (100 runs):

        612.731838      task-clock (msec)         #    1.002 CPUs utilized            ( +-  0.06% )
               129      context-switches          #    0.210 K/sec                    ( +-  0.50% )
                 7      cpu-migrations            #    0.012 K/sec                    ( +-  3.95% )
               185      page-faults               #    0.302 K/sec                    ( +-  0.10% )
     2,442,624,802      cycles                    #    3.986 GHz                      ( +-  0.06% )
        33,146,164      stalled-cycles-frontend   #    1.36% frontend cycles idle     ( +-  0.33% )
       611,827,173      stalled-cycles-backend    #  25.05% backend cycles idle       ( +-  0.21% )
     6,851,374,060      instructions              #    2.80  insn per cycle         
                                                  #    0.09  stalled cycles per insn  ( +-  0.00% )
     1,277,112,717      branches                  # 2084.293 M/sec                    ( +-  0.00% )
         2,924,448      branch-misses             #    0.23% of all branches          ( +-  0.09% )

       0.611729451 seconds time elapsed                                          ( +-  0.06% )

⚛

unread,
Jun 18, 2016, 6:54:45 AM
to golang-nuts, gordo...@gmail.com
On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
> > I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

> > My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

> Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?

> One of the originals, an FX8120 (4 core, 8 processor) @ 3.1 GHz.

Your CPU is bdver1. My CPU is bdver3.

> By the clock frequency, I assume you were running these tests on a high end Intel CPU?

bdver3 CPU has some optimizations compared to bdver1, but I don't know whether this affects eratspeed code. Is your IPC lower than 2.80 when you compile https://play.golang.org/p/Sd6qlMQcHF with Go1.7-tip and run "perf stat --repeat=10 -- ./eratspeed-go1.7tip"?

> Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?

gcc-6.1 is slower. clang-3.9 is able to translate "bits = buf[pos]; bits |= data; buf[pos] = bits;" into a single x86 instruction equivalent to "buf[pos] |= data".

gordo...@gmail.com

unread,
Jun 18, 2016, 7:55:07 PM
to golang-nuts
On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
> > I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

> > My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

> Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?

> One of the originals, an FX8120 (4 core, 8 processor) @ 3.1 GHz.

> Your CPU is bdver1. My CPU is bdver3.

> By the clock frequency, I assume you were running these tests on a high end Intel CPU?

> bdver3 CPU has some optimizations compared to bdver1, but I don't know whether this affects eratspeed code. Is your IPC lower than 2.80 when you compile https://play.golang.org/p/Sd6qlMQcHF with Go1.7-tip and run "perf stat --repeat=10 -- ./eratspeed-go1.7tip"?

I am on Windows 7 64-bit, so don't have perf, which is specific to Linux.

However, we can easily estimate the figure you request as follows:

1) The number of instructions are likely to be the same between your machine and mine as we are both using the same source code with the same compiler version with the same compiler settings.

2) The run times over the clock speed for my machine as compared to yours is about one third again, therefore my instruction rate will be 75% of yours or about 2.1 instructions per cycle.

3) The difference is almost certainly mostly the backend stall time, which for your machine is 25%, or 150 ms out of your run time; my run time at your clock frequency would be 200 ms more, therefore a total of 350 ms out of 800 ms, so for my machine backend stall time will be about 43.75%.

4) The reason bdver1 is so bad is it has something called a Write Combining Cache (WCC) to take care of the write-through memory access of bd; which is fine and good; the problem is that WCC is only 4 Kilobytes in size, which means that for a loop like this that mostly does only writes (backend), the effective L1 cache size is only 4 Kilobytes instead of 16 Kilobytes, causing many misses and high cache latency.

5) This is even worse for multi-threading, as the WCC is shared between the two processors per core; thus the effective size per processor is only 2 Kilobytes, and much of the advantage of having so many processors is wasted.

6) Later bd versions use something other than the WCC (AFAIK undisclosed in the literature) which doesn't have as bad a bottleneck, although it is still there, as in your bdver3.

7) Intel processors do not have this problem as they don't have cache write-through (although they have other problems for multi-processing).

8) Without this problem, our CPUs' execution time for eratspeed would be about 75% of the current time for yours and 56% for mine (not quite this good, as there would still be a small amount of latency due to instructions being too adjacent).

9) We AMD loyalists can only wait and hope for AMD Zen due toward the end of this year, which is said to adopt much of the Intel design philosophy but better (as I am sure AMD hopes, too).

The other interesting data point from your comparison of the Clang C results and these golang results for eratspeed is that **on your machine**, Clang C runs in about 75% of the time of golang, or golang is about 33% slower. Part of that gain is Clang minimizing the number of (more complex) instructions, but part of that gain is also reducing backend latency by re-organizing the instructions. I don't have Clang installed on my machine, but I do have Haskell, which can use the LLVM backend that Clang uses and produces comparable instruction rates. Golang eratspeed runs about 20% slower than that on my machine, but that is of course with the severe handicap of the high backend stalling. The difference will get much bigger in percentage terms for more efficient CPU's.

Michael Jones

unread,
Jun 20, 2016, 5:35:09 AM
to gordo...@gmail.com, golang-nuts
Your code (with slightly changed print formatting) on my MacBook Pro:

$ go install
time fasterEratspeed 
  Init:   2.885352ms
Search: 284.849689ms, found 8621475 primes up to 1018429650
 Total: 287.735041ms

real 0m0.302s
user 0m0.287s
sys 0m0.007s

It is pretty complicated to me, so I don’t understand why the values printed are as they are. My sense of prime number counts is that the 8621475th prime is 153339973. (Mathematica thinks so too.)

My old illustrative Go SoE code does this calculation pretty quickly:

BenchmarkNew153339973            5  312174802 ns/op

That’s 312.174802ms to construct and retain a sieve for the first 153339973 integers, count the total number of primes included, and determine the final one in the range. Compiling with “-B” is a big help here...

BenchmarkNew153339973            5  281989575 ns/op

...reducing the total time to 281.989575ms

On Jun 20, 2016, at 3:40 AM, gordo...@gmail.com wrote:

> We have proven with the straight translation of Daniel Bernstein's C code to golang that the Sieve of Eratosthenes (SoE) eratspeed runs about the same speed as the Sieve of Atkin (SoA) primespeed for the same range with the same optimizations and with John Barham's multi-threading for primespeed disabled (set numprocs = 1 in sieve.go); however, golang primespeed is slower with array bounds checks enabled due to the complexity of the array operations for the prime square free operations.  For both I eliminated the use of the "two" shifting look up array table as that caused an extra bounds check, and tuned the inner loops in exactly the same way to minimize tight loop time.

> Now, that already shows that the SoA is inferior for page segmented sieving to 1,000,000,000 (one billion), as it only has about 258,000,000 operations whereas SoE eratspeed has about 405,000,000 operations, meaning that the SoA operations are more complex, which is obvious.  However, Bernstein crippled eratspeed to only 2/3/5 wheel factorization where it is easily possible to use 2/3/5/7 plus pre-culling of the 11/13/17 primes to reduce the operations to about 263,000,000 operations, or about 35% less.

> The golang FasterEratspeed that implements this maximum wheel factorization version is posted at play.golang at:  https://play.golang.org/p/8Bo04bnT5F and runs about 33% faster than the previous standard wheel factorized version, just as the ratio of operations suggests that it should; also, because it minimizes array access it doesn't matter much whether array bounds checking is enabled or not.  This is almost as fast as the highly optimized C++ primesieve (http://primesieve.org/) run single threaded on my cache access speed bottle-necked machine, but on faster machines it will be slower than primesieve, as much as twice as slow on high end Intel desktop CPU's.

> Actually, this implementation isn't maximally factorized as in some languages one is able to also pre-cull by 19 as well, but with golang the overhead of creating and handling the extra size of pattern array is about the same as the few percent gain.

> So, in this implementation we are coming very close to C/C++ speeds and faster than C#/Java; however, I learned much from the exercise and have found that very slight changes in the formulation of the tight inner loop(s) can slow the speed by at least a factor of two and sometimes up to almost three.  These best optimizations aren't always obvious and it is time consuming trying the different optimizations to get the fastest code.  Sometimes adding instructions actually reduces the execution speed, as the compiler will perhaps change the use of registers in a way that there are less stalls or bottle-necks.  For this particular implementation and golang version 1.7beta1, it seems that for this particular loop, it is optimized quite effectively in spite of having the array bounds check and even better without it.

> That should be enough to prove that the SoA is just an interesting intellectual exercise, as for page segmented sieves such as these it has more complexity and is thus slower than the maximally factorized SoE, and also does not support its theoretical O(n) asymptotic complexity as with increasing range there is an ever increasing cost in code overhead, whereas the SoE comes very close to maintaining its theoretical O(n log log n) performance.


gordo...@gmail.com

unread,
Jun 20, 2016, 6:38:34 AM
to golang-nuts
Edit: Corrected, forgot to change the new count routine to 48 from 8...

We have proven with the straight translation of Daniel Bernstein's C code to golang that the Sieve of Eratosthenes (SoE) eratspeed runs about the same speed as the Sieve of Atkin (SoA) primespeed for the same range with the same optimizations and with John Barham's multi-threading for primespeed disabled (set numprocs = 1 in sieve.go); however, golang primespeed is slower with array bounds checks enabled due to the complexity of the array operations for the prime square free operations. For both I eliminated the use of the "two" shifting look up array table as that caused an extra bounds check, and tuned the inner loops in exactly the same way to minimize tight loop time.

Now it has already been shown that the SoA is inferior for page segmented sieving to 1,000,000,000 (one billion): it only has about 258,000,000 operations whereas SoE eratspeed has about 405,000,000 operations yet runs at the same speed, meaning that the SoA operations are more complex, which is obvious. However, Bernstein crippled eratspeed to only 2/3/5 wheel factorization where it is easily possible to use 2/3/5/7 plus pre-culling of the 11/13/17 primes to reduce the operations to about 263,000,000 operations, or about 35% less.

The golang FasterEratspeed that implements this maximum wheel factorization version is posted at play.golang at: https://play.golang.org/p/iKA_s7ocuN and runs about 33% faster than the previous standard wheel factorized version, just as the ratio of operations suggests that it should; also, because it minimizes array access it doesn't matter much whether array bound checking is enabled or not. This is almost as fast as the highly optimized C++ primesieve (http://primesieve.org/) run single threaded on my cache access speed bottle-necked machine, but on faster machines it will likely be slower than primesieve, as much as twice as slow on high end Intel desktop CPU's.

Actually, this implementation isn't maximally factorized as in some languages one is able to also pre-cull by 19 as well, but with golang the overhead of creating/handling the extra size of pattern array is about the same as the few percent gain.

So, in this implementation we are coming very close to C/C++ speeds and faster than C#/Java; however, I learned much from the exercise and have found that very slight changes in the formulation of the tight inner loop(s) can slow the speed by at least a factor of two and sometimes up to almost three. These best optimizations aren't always obvious and it is time consuming trying the different optimizations to get the fastest code. Sometimes adding instructions actually reduces the execution speed, as the compiler will perhaps change the use of registers in a way that there are less stalls or bottle-necks. For this particular implementation and golang version 1.7beta1, it seems that for this particular loop, it is optimized quite effectively in spite of having the array bounds check and even better without it.

That should be enough to prove that the SoA is just an interesting intellectual exercise, as for page segmented sieves such as these it has more complexity and is thus slower than the maximally factorized SoE, and also does not support its theoretical O(n) asymptotic complexity as with increasing range there is an ever increasing cost in code overhead, whereas the SoE comes very close to maintaining its theoretical O(n log log n) performance.

To @Michael Jones: The error was that it was calculating the primes to over a billion but only counting one sixth of them; the culling time (which is almost all of the time) is the same. It wouldn't take any longer to show the last prime. So this program is almost six times faster than yours, as it likely should be, in that it uses the extreme factorization to reduce the operations by about a factor of four from a simple odds-only SoE.

Yes, it is a little complicated doing the maximum wheel factorization, but not that much more complicated than the original C eratspeed; much of the extra code comes from implementing the pre-culling pattern and copying it into the page segment buffers, but about half the speed gain comes from that, so it was worth it. This is a different algorithm than many SoE implementations: rather than putting all wheel residuals in the same word or adjacent words, it uses separate bit-packed arrays for each residual, with one bit across all of the 48 residual arrays representing a range of 210. Thus, instead of the increased bit packing of a factor of two for odds-only sieves, there is a bit packing factor of 48 to 210, or a factor of over four. And it is to processing each residual separately that it owes its speed. It may look complicated, but less complicated than the extreme loop unrolling that Kim Walisch had to resort to in order to get the performance of primesieve.
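(To make the residual-plane indexing concrete — an illustrative sketch, not the posted code: a value n coprime to 210 maps to one of 48 bit-array "planes" by its residue mod 210, and to a bit position within that plane by n/210:)

// residues210 lists the 48 values mod 210 coprime to 2, 3, 5 and 7.
var residues210 = func() []int {
    var r []int
    for i := 1; i < 210; i += 2 {
        if i%3 != 0 && i%5 != 0 && i%7 != 0 {
            r = append(r, i) // 1, 11, 13, 17, 19, 23, ...
        }
    }
    return r
}()

// planeIndex maps n (assumed coprime to 210) to its residual plane number
// and the bit index within that plane's bit-packed array.
func planeIndex(n int) (plane, bit int) {
    res := n % 210
    for i, r := range residues210 {
        if r == res {
            return i, n / 210
        }
    }
    return -1, -1 // n shares a factor with 210
}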

That said, this code was originally written to demonstrate a point and is nowhere near at a state for general use, with the following improvements needed to make it suitable for general use:

1) Instead of providing a "qtab" of pre-calculated base primes less than the square root of the desired range, a secondary stream of primes from the same program should be generated to feed back into the calculations; in that way this could be an "unbounded sieve" and one could draw primes from it almost without limit (other than time of course).

2) Similarly, the pre-initialized "next" array should be generated on-the-fly; otherwise the program can't be multi-threaded, as every thread depends on the state of the "next" array, and it changes for every page segment.

3) In order to calculate the "next" array on the fly, it needs to be made even more efficient to calculate the start addresses per modulus residual.

4) Even with more efficient start address calculations by further Look Up Tables, as the range increases it will take an ever increasing percentage of the time to calculate the "next" array as the number of base primes increase; It would be better to let the page size increase automatically about as the square root of the range in order to keep this manageable.

5) If we let the page segment size increase, then we will lose cache associativity unless we implement a "double buffered" culling scheme where we cull for all the primes for one residual for a sub section of the page seqment which is L1 cache sized, then move up through the entire page segment before culling in the same manner for all residual segment plane arrays.

6) Once that is done, we can apply multi-threading, which will reduce the time on most 4-true-core CPU's by about a factor of 4, so down to about 0.07 seconds on your machine (if it is 4 core; you didn't mention which CPU version of MacBook Pro you have?), maybe slightly less if you hadn't yet compiled it with "-B".
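(A sketch of how item 6 might look with goroutines — under the assumption from item 2 that page segments can be culled independently once each worker derives its own start offsets; hypothetical names throughout:)

// import ( "runtime"; "sync" )
//
// sieveSegments farms page segments out to one worker per CPU; each worker
// reuses its own buffer so the hot data stays within its L1/L2 cache.
func sieveSegments(nSegs, segWords int, cull func(seg []uint32, segIdx int)) {
    work := make(chan int)
    var wg sync.WaitGroup
    for w := 0; w < runtime.NumCPU(); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            seg := make([]uint32, segWords) // per-worker buffer, reused
            for idx := range work {
                for i := range seg {
                    seg[i] = 0 // reset, or copy in a pre-cull pattern
                }
                cull(seg, idx) // mark composites, then count/scan the segment
            }
        }()
    }
    for i := 0; i < nSegs; i++ {
        work <- i
    }
    close(work)
    wg.Wait()
}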

60 to 70 ms for primes to one billion on a notebook isn't bad. That's about as fast as primesieve scaled by probable CPU clock speed. And once I do the above optimizations, it will scale to sieving to 10^14 in a few hours.

I had thought of dropping it at this point, as I've learned most of what I intended, but I'm having fun and may continue. I won't build on the current program format as it isn't so conducive to an "unbounded sieve" and will probably move the good bits to a new structure. And of course the algorithm is applicable to any language, be it Go, Haskell, C/C++, Scheme, F#, C#, Java, Scala, Clojure, Kotlin, whatever...

I have proven that the SoA is nothing compared to a fully optimized SoE, though.

gordo...@gmail.com

unread,
Jun 20, 2016, 9:33:29 AM6/20/16
to golang-nuts
Further to the subject of compiler efficiency, the following is the assembler output, with array bounds checking turned off (-B), for the inner tight composite-culling loop of FasterEratspeed above (generated with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):

0x0051 00081 (main.go:426) MOVL R11, CX
0x0054 00084 (main.go:426) SHRL $5, R11
0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
0x005c 00092 (main.go:430) MOVL $1, R14
0x0062 00098 (main.go:430) SHLL CX, R14
0x0065 00101 (main.go:430) ORL R13, R14
0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
0x0070 00112 (main.go:425) CMPL R11, R8
0x0073 00115 (main.go:425) JCS $0, 81

At 10 instructions, this is about as tight as it gets, short of using the more complex read/modify/write form of the ORL instruction, which doesn't seem to save much if any time given instruction latencies. Note that this code has eliminated the "k & 31" for the shift, recognizing that it isn't necessary since a 32-bit shift can't usefully exceed 31 anyway; that, unlike the simple PrimeSpeed program, it properly uses an immediate load of '1'; and that it cleverly uses the LEAL instruction to add the prime value 'q' in R12 to the unmodified 'k' value in CX, writing the sum back to the original location of 'j' in R11, saving another instruction to move the result from CX to R11.

This uses a total of 7 registers (CX, R8, R9, R11, R12, R13, and R14), so it would be possible to make it work and run just as fast for x86 code. This includes one register to hold the prime value 'q' in R12 and one register to hold the loop limit 'lngthb' in R8. An array bounds check requires another register to hold the array upper bound, which won't be hard for x64 code, but will slow x86 code as there won't be enough available registers, unless the compiler can be made smart enough for certain loop forms to recognize that the check isn't necessary, being inherent in the loop condition (see the sketch below). It would be pretty impressive if all loops could be compiled as well as this, as this will run at very close to C/C++ speed!

When run within the L1 cache on a modern Intel CPU without memory or execution pipeline bottlenecks, this code will run in about 3.5 CPU cycles per loop, or about 2.86 instructions per cycle.
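One loop shape that might help with the bounds-check question is a single explicit index before the loop, so only one check is paid. This is a hedged sketch of my own (whether a given compiler version actually elides the per-iteration checks has to be verified from the -S output):

// cull marks every multiple of q below lngthb in a bit-packed array.
// The up-front index forces one bounds check; if the compiler can
// propagate k < lngthb into the loop body, it may then drop the
// per-iteration checks (not guaranteed on current compilers).
func cull(buf []uint32, q, lngthb uint32) {
	if lngthb == 0 {
		return
	}
	_ = buf[(lngthb-1)>>5] // single explicit bounds check up front
	for k := q * q; k < lngthb; k += q {
		buf[k>>5] |= 1 << (k & 31)
	}
}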

I'm looking for the reason that this code is so efficient while the earlier simple PrimeSpeed code was so poor; one clue might be that I used native uint and int for PrimeSpeed, which are 64-bit with the amd64 version of the compiler, so that code used 'Q'-suffix instructions, which the compiler perhaps doesn't handle as efficiently as the uint32/int32 operations used here.
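A minimal repro sketch for comparing the two cases (my own file, not part of either program; compare the two functions' output from go tool compile -B -S):

package main

var buf [8192]uint32

//go:noinline
func cull32(q, lngthb uint32) { // uint32 indexes: the good case above
	for k := q * q; k < lngthb; k += q {
		buf[k>>5] |= 1 << (k & 31)
	}
}

//go:noinline
func cull64(q, lngthb uint) { // native uint indexes, as in PrimeSpeed
	for k := q * q; k < lngthb; k += q {
		buf[k>>5] |= 1 << (k & 31)
	}
}

func main() {
	cull32(7, 8192*32)
	cull64(7, 8192*32)
}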

Can the compiler guys contribute anything on how to format tight loop code so that it compiles as well as this? If the crucial parts of the compiler and libraries were written to compile to code as efficient as this, I don't think anyone would have problems with golang's speed.

Keith Randall

unread,
Jun 20, 2016, 12:27:49 PM6/20/16
to golang-nuts, gordo...@gmail.com


On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:
Further to the subject of compiler efficiency, the following is the assembler output, with array bounds checking turned off (-B), for the inner tight composite-culling loop of FasterEratspeed above (generated with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):

        0x0051 00081 (main.go:426) MOVL R11, CX
        0x0054 00084 (main.go:426) SHRL $5, R11
        0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
        0x005c 00092 (main.go:430) MOVL $1, R14
        0x0062 00098 (main.go:430) SHLL CX, R14
        0x0065 00101 (main.go:430) ORL R13, R14
        0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
        0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
        0x0070 00112 (main.go:425) CMPL R11, R8
        0x0073 00115 (main.go:425) JCS $0, 81

At 10 instructions, this is about as tight as it gets, short of using the more complex read/modify/write form of the ORL instruction, which doesn't seem to save much if any time given instruction latencies. Note that this code has eliminated the "k & 31" for the shift, recognizing that it isn't necessary since a 32-bit shift can't usefully exceed 31

Getting rid of the &31 is easy and I'll do that in 1.8.
 
anyway; that, unlike the simple PrimeSpeed program, it properly uses an immediate load of '1',

I don't know what the issue is yet, but it shouldn't be hard to fix in 1.8.
 
and that it cleverly uses the LEAL instruction to add the prime value 'q' in R12 to the unmodified 'k' value in CX, writing the sum back to the original location of 'j' in R11, saving another instruction to move the result from CX to R11.


The current SSA backend should do this also.

gordo...@gmail.com

unread,
Jun 20, 2016, 9:39:41 PM6/20/16
to golang-nuts
On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:
Further to the subject of compiler efficiency, the following is the assembler output, with array bounds checking turned off (-B), for the inner tight composite-culling loop of FasterEratspeed above (generated with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):

[assembler listing snipped; the same ten instructions as quoted above]

At 10 instructions, this is about as tight as it gets, short of using the more complex read/modify/write form of the ORL instruction, which doesn't seem to save much if any time given instruction latencies. Note that this code has eliminated the "k & 31" for the shift, recognizing that it isn't necessary since a 32-bit shift can't usefully exceed 31

Getting rid of the &31 is easy and I'll do that in 1.8.

anyway; that, unlike the simple PrimeSpeed program, it properly uses an immediate load of '1',

I don't know what the issue is yet, but it shouldn't be hard to fix in 1.8.

and that it cleverly uses the LEAL instruction to add the prime value 'q' in R12 to the unmodified 'k' value in CX, writing the sum back to the original location of 'j' in R11, saving another instruction to move the result from CX to R11.

The current SSA backend should do this also.

No, Keith, you seem to have misunderstood: I wasn't complaining about the above assembler code as produced by the 1.7beta1 compiler; I was wondering why it isn't always this good, as this is about as good as it gets for this loop. It already properly gets rid of the &31, does a proper immediate load of 1, and makes clever use of the LEA instruction, without the misuse of LEA to continuously recalculate 'p'. The assembler code above is produced by either of the loop variations below:

1) as it is in FasterEratspeed:

for k < lngthb {
	pos := k >> 5
	data := k & 31
	bits := buf[pos]
	k += q
	bits |= 1 << data // two[data]
	buf[pos] = bits
}

2) I get the same assembler code if I change this to the simpler:

for ; k < lngthb; k += q {
	buf[k>>5] |= 1 << (k & 31)
}

where all variables and buffers are uint32.

My question was: why did the compiler produce this very good code for both variations, yet produce something much worse for the same variation-two loop in the simple PrimeSpeed code, where the main difference is that PrimeSpeed uses 64-bit uint for the loop variables and loop limit? Does that give you a clue where the problem might be? Converting PrimeSpeed to use uint32's as here fixed the continuous recalculation of 'p' but not the other problems.

It seems that the compiler sometimes erroneously tries to reduce register use without weighing the cost in execution speed. It is inconsistent, sometimes producing great code as here, and sometimes not so great, as in PrimeSpeed.

I was looking for some general advice on how to format loops so that they compile to code as good as this.

Do you plan to include SSA for the x86 version as well?

alb.do...@gmail.com

unread,
Jun 21, 2016, 6:48:14 AM6/21/16
to golang-nuts, gordo...@gmail.com
>  Do you plan to include SSA for the x86 version as well?

For an answer to this: yes, it seems like the plan is to port every supported architecture
to SSA. It was discussed here:
https://groups.google.com/d/msg/golang-dev/fSIl5Sbr4ek/10sgOsnDEAAJ

gordo...@gmail.com

unread,
Jun 23, 2016, 7:54:05 AM6/23/16
to golang-nuts, gordo...@gmail.com


On Monday, 20 June 2016 16:35:09 UTC+7, Michael Jones wrote:
Your code (with slightly changed print formatting) on my MacBook Pro:

$ go install
time fasterEratspeed 
  Init:   2.885352ms
Search: 284.849689ms, found 8621475 primes up to 1018429650
 Total: 287.735041ms

real 0m0.302s
user 0m0.287s
sys 0m0.007s

It is pretty complicated to me, so I don't understand why the values printed are as they are. My sense of prime number counts is that the 8621475th prime is 153339973. (Mathematica thinks so too.)

My old illustrative Go SoE code does this calculation pretty quickly:

BenchmarkNew153339973            5  312174802 ns/op

That’s 312.174802ms to construct and retain a sieve for the first 153339973 integers, count the total number of primes included, and determine the final one in the range. Compiling with “-B” is a big help here...

BenchmarkNew153339973            5  281989575 ns/op

...reducing the total time to 281.989575ms

I hope you saw my message below showing a new play link correcting the code (somehow one of my computers won't reply to a message, but only add to the thread), which was culling to a billion but only counting about 1/6th of the primes sieved. You'll find that the corrected program really can sieve to a billion in that amount of time, making it quite a bit faster than your (probably odds-only) version.

By your timing results, I'm guessing that you have an almost top-of-the-line late-model 15" MacBook Pro with the 2.5 GHz i7-4770HQ quad-core CPU that can Turbo Boost to 3.7 GHz, which is where this test was probably run, as it is short enough not to heat the CPU. Am I right?

gordo...@gmail.com

unread,
Jun 23, 2016, 8:16:42 AM6/23/16
to golang-nuts, gordo...@gmail.com

On Friday, 17 June 2016 09:23:18 UTC+7, Keith Randall wrote:
Looks like something is wrong with immediate loading for the 1 << ... operation.  Could you open a bug with repro instructions?  I can look at it when 1.8 opens.

@Keith, now that I have found a combination that makes the original loop compile to an efficient assembler loop (avoiding the many discrepancies with the compiler), I have started working to make it even faster using extreme loop unrolling with unsafe.Pointer. The code is working, but I am finding even more discrepancies, as follows:

1) Even when the Go code is written exactly that way, as a read/modify/write using the "|=" operator, the compiler refuses to generate the ORL-to-indexed-destination form, as in "ORL $2, (R12)(R11*4)" or "ORL $2, 3(R12)(R11*4)", which would be much more efficient. Currently, in golang version 1.7beta2, it uses a read instruction to a register, followed by the modification of that register, followed by a write to the same destination; this is wasteful of a register and instructions.

2) Even when the 'R12' base index register is already in a register and isn't modified by the operation, the compiler persists in copying it to another register to use as the base index register; this is wasteful of registers and instructions. I understood that SSA would mean the compiler could recognize when a register is not changed and use it wherever appropriate.

3) When there is a widening conversion just outside a loop (or any kind of quick calculation, for that matter), instead of keeping that conversion lifted outside the loop, the compiler will burn a register and a MOVxZX (zero-extending move) instruction inside the loop, which is wasteful of registers and instructions.

4) While there are a few cases where using the LEA instruction can combine operations and save time, the compiler overuses it, both for simple calculations moved from immediately outside the loop to within it, and for complex index register operations where a simple ADDx is all that is necessary and is faster.

Should I report another issue against 1.8 for this? If so, should I report one issue per point, or combine them all as a single tight-loop performance issue (seemingly not really related to using unsafe.Pointer, though possibly related for some of the points, as in assigning a special register to use as the base index pointer in 1) above)?


Michael Jones

unread,
Jun 23, 2016, 9:48:21 AM6/23/16
to gordo...@gmail.com, golang-nuts
Just so. I believed the computation…just not the printing. ;-)

Michael Jones

gordo...@gmail.com

unread,
Jun 24, 2016, 8:18:20 AM6/24/16
to golang-nuts, gordo...@gmail.com


On Saturday, 18 June 2016 17:54:45 UTC+7, ⚛ wrote:
On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?

GCC 6.1 is slower. clang-3.9 is able to translate "bits = buf[pos]; bits |= data; buf[pos] = bits;" into a single x86 instruction equivalent to "buf[pos] |= data".

You are right: Clang/LLVM is able to turn separate read, modify, and write code into a single read/modify/write instruction; however, we don't have to write the code as Daniel Bernstein did, and can write it as follows, in which case GCC also retains the single read/modify/write instruction in the loop:

                 for ; k <= lngthb; k += q {
                        buf[k>>5] |= 1 << (k & 31)
                 }

In fact, for 32-bit operations on 'k', 'lngthb', and 'q' (and of course the buffer contents), the C version can skip the "& 31", as the CPU itself limits the shift count to a maximum of 31 for 32-bit register ops, for an extra slight saving in time. (The Go source can't drop the mask, since Go defines oversized shifts to yield zero, so there the compiler has to elide it during code generation.)

Golang version 1.7beta2 actually does pretty well with the above format, other than not retaining the read/modify/write in this particular instance, and the loop speed is the same as the C code generated from Bernstein's hand optimizations, which did not use the single instruction.

Thus, this loop in golang 1.7beta2 is as fast as the C Bernstein loop, or maybe a bit faster, but that is just for this one case. When I started writing extreme loop unrolling code to optimize this loop, which in C reduces the time per culling operation to about 1.375 cycles on a non-stalling high-end Intel CPU running in amd64 mode, I couldn't get it down to less than about 4 cycles per loop in golang under the same conditions (x86 code in this case is slower because there aren't enough available registers, requiring register spills and reloads). Note that as the number of cycles is reduced, the number of instructions per cycle also goes down, because the more complex instructions do more, with hardly any of the simple register-to-register instructions, which are very fast to execute.
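For reference, this is roughly the direction I mean; a hedged sketch only (my own names, with no unrolling shown, and the pointer arithmetic is exactly the part where the discrepancies above appear):

package main

import (
	"fmt"
	"unsafe"
)

// cullUnsafe addresses each 32-bit word directly through a raw
// pointer instead of indexing the slice, sidestepping bounds checks;
// an "extreme" version would additionally unroll eight culls per
// iteration, one per bit position of the mask pattern.
func cullUnsafe(buf []uint32, q, lngthb uint32) {
	base := unsafe.Pointer(&buf[0])
	for k := q * q; k < lngthb; k += q {
		p := (*uint32)(unsafe.Pointer(uintptr(base) + uintptr(k>>5)*4))
		*p |= 1 << (k & 31)
	}
}

func main() {
	buf := make([]uint32, 8192)
	cullUnsafe(buf, 7, uint32(len(buf))*32)
	fmt.Printf("%08x\n", buf[1]) // 81020000: bits for k = 49, 56, 63
}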

So golang is inconsistent in its optimizations, sometimes getting about the same speed as C, but in other similar cases completely blowing it by being as much as three times slower or more. The C/C++ compilers are much more consistent in their speed, but then they have had much longer to mature.