Something like that?
https://github.com/golang/image/blob/master/vector/gen.go
-s
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> Something like that?
Using the hardware when it is available, falling back on emulation when not?
Take for instance the PSHUFB instruction, which allows a very fast [16]byte lookup on SSSE3-capable machines. This is helpful in various ways, but if it isn't available, the code has to spill the XMM register to memory and do 16 separate lookups, which is at least an order of magnitude slower than using SIMD. Similarly, the RSQRT instruction (low-precision reciprocal square root) allows a "shortcut", but if it isn't available on your architecture, emulating it will likely be very expensive.
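To make the fallback cost concrete, here is a minimal sketch of what the emulation amounts to in plain Go. The name pshufbEmulated and the hex-digit table are made up for illustration; the per-byte rule (a set high bit zeroes the lane, otherwise the low four bits index the table) is the documented behaviour of the 128-bit PSHUFB.

```go
package main

import "fmt"

// pshufbEmulated mirrors the per-byte semantics of 128-bit PSHUFB in
// plain Go. The name is hypothetical; this is the slow path that runs
// 16 scalar lookups where the instruction needs a single operation.
func pshufbEmulated(table, idx [16]byte) [16]byte {
	var out [16]byte
	for i, x := range idx {
		if x&0x80 != 0 {
			continue // high bit set: this lane stays zero
		}
		out[i] = table[x&0x0f] // low 4 bits select the table entry
	}
	return out
}

func main() {
	// Illustrative table: map each nibble to its ASCII hex digit.
	var table [16]byte
	copy(table[:], "0123456789abcdef")

	idx := [16]byte{0x0, 0x1, 0xa, 0xf, 0x80}
	out := pshufbEmulated(table, idx)
	fmt.Printf("%q\n", out[:5])
}
```

The loop is trivially correct, but it is 16 dependent loads and stores per vector, which is where the slowdown comes from.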
Yes, speeding up an accumulation step, described at
https://medium.com/@raphlinus/inside-the-fastest-font-renderer-in-the-world-75ae5270c445#.qz8jram0o
The generated code consists of SIMD implementations of very simple Go functions.
For example, the fixedAccumulateOpSrcSIMD function in the generated
https://github.com/golang/image/blob/master/vector/acc_amd64.s file is
the SIMD equivalent of the fixedAccumulateOpSrc function in
https://github.com/golang/image/blob/master/vector/raster_fixed.go
There is only one template, instantiated 6 times. There are 2 types of
math (fixed point and floating point), and 3 types of composition
operator (draw over, draw src, and accumulate to mask).
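As a rough illustration of "one template, instantiated 6 times", the sketch below expands a one-line stub over the 2 x 3 combinations using text/template. This is an assumption about the general shape only: it is not a copy of gen.go, the stub stands in for the real assembly template, and apart from fixedAccumulateOpSrcSIMD (mentioned above) the generated names are guesses at the naming scheme.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// instance holds the two axes the template is instantiated over.
type instance struct {
	Math string // "fixed" or "floating"
	Op   string // "OpOver", "OpSrc" or "Mask" (hypothetical spellings)
}

// stub is a hypothetical stand-in for the real assembly template.
var stub = template.Must(template.New("stub").Parse(
	"// func {{.Math}}Accumulate{{.Op}}SIMD(...)"))

// instances enumerates 2 kinds of math x 3 composition operators.
func instances() []instance {
	var out []instance
	for _, m := range []string{"fixed", "floating"} {
		for _, op := range []string{"OpOver", "OpSrc", "Mask"} {
			out = append(out, instance{Math: m, Op: op})
		}
	}
	return out
}

// generate expands the stub once per instance.
func generate() ([]string, error) {
	var lines []string
	for _, in := range instances() {
		var sb strings.Builder
		if err := stub.Execute(&sb, in); err != nil {
			return nil, err
		}
		lines = append(lines, sb.String())
	}
	return lines, nil
}

func main() {
	lines, err := generate()
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The point is only that the six functions differ along two small axes, so a single parameterised body keeps them in sync.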
Another ignorant question from me, but what do you mean exactly by
universal binary?
FWIW, the golang.org/x/image/vector package is portable, in that it
has a pure Go implementation. If compiled for GOARCH=amd64, it also has
SIMD optimizations, guarded at runtime by a very cheap check (look for
haveFixedAccumulateSIMD). But that check is made once per e.g.
accumulate call (a higher level function composed of many SIMD ops),
not once per e.g. 128-bit add.
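A minimal sketch of that dispatch pattern follows. Everything here is simplified and hypothetical: haveSIMD stands in for a flag like haveFixedAccumulateSIMD, the "SIMD" routine just reuses the portable code because pure Go cannot issue CPUID or SIMD instructions itself, and the accumulation is a plain running sum rather than the package's actual math.

```go
package main

import "fmt"

// haveSIMD is a hypothetical stand-in for a flag such as
// haveFixedAccumulateSIMD. Real code would set it once at startup from
// a CPU feature check; it is hard-coded here for the sketch.
var haveSIMD = false

// accumulateGeneric is the portable fallback: a running sum of src
// written into dst. dst must be at least as long as src.
func accumulateGeneric(dst, src []float32) {
	dst = dst[:len(src)] // panics early if dst is too short
	var acc float32
	for i, v := range src {
		acc += v
		dst[i] = acc
	}
}

// accumulateSIMD stands in for an assembly routine. In this sketch it
// simply reuses the portable code so the example stays runnable.
func accumulateSIMD(dst, src []float32) {
	accumulateGeneric(dst, src)
}

// accumulate branches once per call, not once per vector operation,
// so the cost of the check is amortised over the whole slice.
func accumulate(dst, src []float32) {
	if haveSIMD {
		accumulateSIMD(dst, src)
		return
	}
	accumulateGeneric(dst, src)
}

func main() {
	src := []float32{1, 2, 3, 4}
	dst := make([]float32, len(src))
	accumulate(dst, src)
	fmt.Println(dst)
}
```

With the branch at this level, the inner loop of either path contains no feature tests at all.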
That's the sort of cheap check that I had in mind in the very first post when I talked about "I envisaged a call to CPUID and then some bool tests along the way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if available. All in a static, portable package." Thanks for a good example of that.
O.