I have ported D. J. Bernstein's C primegen library
(http://cr.yp.to/primegen.html) to Go.
Primegen calculates prime numbers in order and uses the Sieve of Atkin
instead of the traditional Sieve of Eratosthenes.
The Go source is at https://github.com/jbarham/primegen.go and can be
installed via goinstall (i.e., "goinstall
github.com/jbarham/primegen.go").
It's a fairly straightforward port and so preserves Bernstein's
characteristically terse style, but I do make use of multiple goroutines
when $GOMAXPROCS is set, so it's competitive with the original C
implementation and thus very fast: on my 2.4 GHz Intel Core i5,
primegen.go generates the 50,847,534 primes up to 1,000,000,000
in ~0.7 seconds.
John Barham
> Modern x86 CPUs don't work like that.
> In general, optimally scheduled assembly code that uses more registers has higher performance than optimally scheduled assembly code that uses a smaller number of registers, assuming both correspond to the same source code.
> Register renaming: since Intel Pentium Pro and AMD K5.
> Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf
> An excerpt from the above PDF document (Section 10 about Haswell and Broadwell pipeline): "... the register file has 168 integer registers and 168 vector registers ..."
I am aware of all of the above and have already read Agner Fog's publications. In addition, modern CPUs do out-of-order execution, rearranging instructions to reduce instruction latencies and increase throughput across the parallel execution pipelines, with speculative ahead-of-time execution, so the actual execution order is almost certainly not the one in the assembly listing.
Yes, both assembly listings are from the same tight-loop code, but the "C/C++" one has been converted from another assembler syntax to the golang assembly format.
Daniel Bernstein, the author of primegen, wrote for the Pentium 3 in x86 (32-bit) code, as the Pentium Pro processor wasn't commonly available at that time and 64-bit code didn't exist. His hand-optimized C code for the Sieve of Eratosthenes (the "while k < B" loop in the doit() function of eratspeed.c) uses six registers for the inner culling loop being discussed, and takes about 3.5 CPU clock cycles per loop on a modern CPU (Haswell).
I don't see the point of the exercise other than that it proves that not putting the result of an operation back into the same register reduces the latency slightly for your processor (whatever it is); I suspect that if you used any other register, such as the unused AX register rather than R11, the results would be the same.
I don't see the point of the exercise as far as optimizing golang is concerned; your experiment just shows that your compiler (GCC?) missed an optimization as far as reducing back-end latency goes.
You may also find that swapping the order of some of the instructions, such as the second and third in the loop, reduces back-end latency further.
I am not on a high-end Intel CPU now, but when I was, I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words, or 32 kilobytes), eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).
My current AMD Bulldozer CPU has a severe cache bottleneck and falls short of this speed by a factor of about two.
Here is a golang version of Daniel Bernstein's eratspeed, a straight translation of his C code to golang with no changes other than those necessary to make it work: https://play.golang.org/p/Sd6qlMQcHF.
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
> > I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).
> > My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.
> Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?
One of the originals, an FX-8120 (4 modules, 8 cores) @ 3.1 GHz.
By the clock frequency, I assume you were running these tests on a high-end Intel CPU?
Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?
On Jun 20, 2016, at 3:40 AM, gordo...@gmail.com wrote:
We have proven with the straight translation of Daniel Bernstein's C code to golang that the Sieve of Eratosthenes (SoE) eratspeed runs at about the same speed as the Sieve of Atkin (SoA) primespeed for the same range, with the same optimizations and with John Barham's multi-threading for primespeed disabled (set numprocs = 1 in sieve.go). However, golang primespeed is slower with array bounds checks enabled due to the complexity of the array operations for the prime square-free operations. For both, I eliminated the use of the "two" shifting look-up array table, as that caused an extra bounds check, and tuned the inner loops in exactly the same way to minimize tight-loop time.
Now that already shows that the SoA is inferior for page-segmented sieving to 1,000,000,000 (one billion): it performs only about 258,000,000 operations whereas SoE eratspeed performs about 405,000,000, yet runs no faster, meaning that the SoA operations are more complex, which is obvious. However, Bernstein crippled eratspeed to only 2/3/5 wheel factorization, where it is easily possible to use a 2/3/5/7 wheel plus pre-culling of the 11/13/17 primes to reduce the count to about 263,000,000 operations, or about 35% fewer.
The golang FasterEratspeed that implements this maximum wheel factorization version is posted at play.golang at https://play.golang.org/p/8Bo04bnT5F and runs about 33% faster than the previous standard wheel factorized version, just as the ratio of operations suggests it should; also, because it minimizes array accesses, it doesn't matter much whether array bounds checking is enabled or not. This is almost as fast as the highly optimized C++ primesieve (http://primesieve.org/) run single-threaded on my cache-access-bottlenecked machine, but on faster machines it will be slower than primesieve, as much as twice as slow on high-end Intel desktop CPUs.
Actually, this implementation isn't maximally factorized, as in some languages one can also pre-cull by 19, but in golang the overhead of creating and handling the larger pattern array roughly cancels the few percent gain.
So, in this implementation we are coming very close to C/C++ speeds and are faster than C#/Java. I learned much from the exercise and found that very slight changes in the formulation of the tight inner loop(s) can slow it down by a factor of at least two and sometimes almost three. The best optimizations aren't always obvious, and it is time-consuming to try the different variations to find the fastest code. Sometimes adding instructions actually reduces the execution time, as the compiler may change its use of registers in a way that produces fewer stalls or bottlenecks. For this particular implementation and golang version 1.7beta1, this loop is optimized quite effectively in spite of the array bounds check, and even better without it.
That should be enough to prove that the SoA is just an interesting intellectual exercise: for page-segmented sieves such as these, it has more complexity and is thus slower than the maximally factorized SoE, and it also does not live up to its theoretical O(n) asymptotic complexity, as with increasing range there is an ever-increasing cost in code overhead, whereas the SoE comes very close to maintaining its theoretical O(n log log n) performance.
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
Further to the subject of compiler efficiency, the following is the assembler output, with array bounds checking turned off (-B), for the tight inner composite-culling loop of FasterEratspeed above (generated with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):
0x0051 00081 (main.go:426) MOVL R11, CX
0x0054 00084 (main.go:426) SHRL $5, R11
0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
0x005c 00092 (main.go:430) MOVL $1, R14
0x0062 00098 (main.go:430) SHLL CX, R14
0x0065 00101 (main.go:430) ORL R13, R14
0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
0x0070 00112 (main.go:425) CMPL R11, R8
0x0073 00115 (main.go:425) JCS $0, 81
At 10 instructions, this is about as tight as it gets, other than using the more complex read/modify/write form of the ORL instruction, which doesn't seem to save much if any time given instruction latencies. Note that this code has eliminated the "k & 31" masking for the shift, seeming to recognize that it isn't necessary since the x86 32-bit shift instructions mask the count to five bits anyway; that, unlike the simpler PrimeSpeed program, it properly uses an immediate load of '1'; and that it cleverly uses the LEAL instruction to add the prime value 'q' in R12 to the unmodified 'k' value in CX, placing the sum in the original location of 'j' in R11, saving another instruction to move the result from CX to R11.
Your code (with slightly changed print formatting) on my MacBook Pro:

$ go install
$ time fasterEratspeed
Init: 2.885352ms
Search: 284.849689ms, found 8621475 primes up to 1018429650
Total: 287.735041ms

real	0m0.302s
user	0m0.287s
sys	0m0.007s

It is pretty complicated to me, so I don't understand why the values printed are as they are. My sense of prime number counts is that the 8621475th prime is 153339973. (Mathematica thinks so too.) My old illustrative Go SoE code does this calculation pretty quickly:

BenchmarkNew153339973	5	312174802 ns/op

That's 312.174802 ms to construct and retain a sieve for the first 153339973 integers, count the total number of primes included, and determine the final one in the range. Compiling with "-B" is a big help here...

BenchmarkNew153339973	5	281989575 ns/op

...reducing the total time to 281.989575 ms.
Looks like something is wrong with immediate loading for the 1 << ... operation. Could you open a bug with repro instructions? I can look at it when 1.8 opens.
On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
> Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?
GCC 6.1 is slower. Clang 3.9 is able to translate "bits = buf[pos]; bits |= data; buf[pos] = bits;" into a single x86 instruction equivalent to "buf[pos] |= data".