Float32 math and slice arithmetics using SIMD


Ondrej

Oct 26, 2016, 8:32:05 AM
to golang-nuts
Hi there,
I have two semi-proposals, which are borderline golang-nuts/dev; I'd be happy to flesh out either of them into a full-fledged proposal if you see merit in them. I know SIMD has been discussed at least four times here, but never to a stage where it would result in a proposal. Here are my two main questions:

1) Should there be more support for float32 mathematics in Go? The math package offers a wide variety of functions, but anyone operating on float32 values needs to convert back and forth. I would normally use float64, but that brings me to my second point:

2) Should there be some higher-level support for SIMD? Some SIMD instructions exist in Go assembly and are used in some performance-critical code. What I propose is that these be exposed to the user through arithmetic on slices (meaning element-by-element operations on slices, e.g. adding two slices of equal type and length). I started with float32 in the previous point because by halving the bit size (which is often sufficient precision) you can better utilise your registers in SIMD and thus achieve almost double the speed (one could go even further and utilise half-precision arithmetic for further speed gains). There are three ways I've been looking at the issue; let me present them separately:

2a) The easiest way would be to adopt something akin to gonum's internal package (https://github.com/gonum/internal/tree/master/asm or perhaps its higher-level wrapper, https://github.com/gonum/floats), which essentially covers BLAS level 1 routines in both assembly for some systems and pure Go, so that there's a fallback. I would envisage element-by-element operations (addition, subtraction, multiplication, division) on slices of equal length, saved either to a newly allocated slice, a third supplied slice, or one of the two operands. Maybe scalar-vector operations (arithmetic and comparisons, e.g. quick slice scans in database-like scenarios), but not much beyond that, certainly not anything resembling the full BLAS. This whole thing could live in an x/ package.
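To make 2a concrete, here is a minimal pure-Go sketch of what such a package's element-wise operations could look like. All names are hypothetical (not an existing package), and the plain loop bodies are exactly what per-architecture assembly kernels would replace:

```go
package main

import "fmt"

// Add stores the element-wise sum a[i] + b[i] into dst and returns dst.
// It panics on length mismatch, mirroring gonum/floats' conventions.
func Add(dst, a, b []float32) []float32 {
	if len(a) != len(b) || len(dst) != len(a) {
		panic("floats: slice length mismatch")
	}
	for i, v := range a {
		dst[i] = v + b[i]
	}
	return dst
}

// AddTo is the "newly allocated slice" variant.
func AddTo(a, b []float32) []float32 {
	return Add(make([]float32, len(a)), a, b)
}

func main() {
	fmt.Println(AddTo([]float32{1, 2, 3}, []float32{4, 5, 6}))
	// in-place variant: dst may be one of the operands
	x := []float32{1, 2, 3}
	Add(x, x, x)
	fmt.Println(x)
}
```

The dst-first signature covers all three storage variants from the paragraph above: pass a fresh slice, a third supplied slice, or one of the operands.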

2b) A more drastic way would be to allow arithmetic on slices explicitly, e.g. a := []float64{...}; b := []float64{...}; c := a + b. This, I think, leads to more problems than it solves: little control over memory allocation, likely runtime issues from mismatched lengths, difficult error handling, etc. Some of these issues, however, would not apply if this were allowed on *arrays*, not slices.
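For contrast, here is a sketch of the array-based flavour, where the length is part of the type, so a mismatch is a compile-time error rather than a runtime problem (add4 is a hypothetical name):

```go
package main

import "fmt"

// add4 adds two fixed-size arrays element-wise. Because [4]float64 and,
// say, [3]float64 are distinct types, passing mismatched lengths fails
// to compile, and the constant trip count is something a compiler (or a
// hand-written SIMD kernel) can rely on.
func add4(a, b [4]float64) (c [4]float64) {
	for i := range a {
		c[i] = a[i] + b[i]
	}
	return c
}

func main() {
	fmt.Println(add4([4]float64{1, 2, 3, 4}, [4]float64{10, 20, 30, 40}))
	// prints [11 22 33 44]
}
```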

2c) More aggressive use of SIMD when analysing tight loops might be handy. I don't know how often people loop through a numerical slice and perform a single operation on it. I guess this discussion is more about the compiler and does not affect the user like the other proposals. (And I believe gccgo already does this?)

---

I started thinking about this once the multidimensional slice proposal was fully developed. While it's not impossible to write SIMD-aware slice arithmetic routines, it becomes a whole lot tougher once you introduce multiple dimensions, subslicing, contiguous memory, strides and other fun stuff. It would be great if this was resolved at the language level.

Currently, if one wants fast math in Go, it usually means linking against OpenBLAS or some other implementation via cgo, sometimes wrapping slices in structs; there are different implementations of matrices, etc. Not to mention that these libraries tend to be compiled natively, which breaks portability. I envisaged a call to CPUID and then some bool tests along the way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if available. All in a static, portable package.
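The dispatch I have in mind could be sketched like this. Names are hypothetical, and the feature flag is stubbed to false so the example runs anywhere; in real code it would be set once from CPUID (the later golang.org/x/sys/cpu package exposes such flags, e.g. cpu.X86.HasAVX2):

```go
package main

import "fmt"

// hasAVX2 would be set once at startup from a CPUID query; stubbed here
// so the sketch stays dependency-free.
var hasAVX2 = false

func addGeneric(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// addAVX2 stands in for an assembly kernel; here it just reuses the
// generic code so the example runs on any machine.
func addAVX2(dst, a, b []float32) { addGeneric(dst, a, b) }

// add is chosen once at init, so each call pays a single indirect jump
// rather than repeated feature checks: the "CPUID once, then bool tests
// along the way" scheme.
var add = func() func(dst, a, b []float32) {
	if hasAVX2 {
		return addAVX2
	}
	return addGeneric
}()

func main() {
	dst := make([]float32, 3)
	add(dst, []float32{1, 2, 3}, []float32{4, 5, 6})
	fmt.Println(dst) // [5 7 9]
}
```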

I don't mean to talk about specific implementations; I just want to gauge whether this is something within the language's scope, something that would fit in the x/ packages, or something that should be left to 3rd-party package writers.

Thanks,
Ondrej

Ian Lance Taylor

Oct 26, 2016, 10:18:07 AM
to Ondrej, golang-nuts
My opinions.

Different processors have different SIMD operations. I don't see how
to usefully add SIMD support to the language proper.

Adding SIMD support to the compiler is fine in principle, as long as
there is no real effect on compilation time. That seems
uncontroversial.

Packages for SIMD support sound appropriate for third party packages.
Since they would be processor-specific by nature, I think it would be
premature to add them to x/.

Ian

simon place

Oct 26, 2016, 6:24:39 PM
to golang-nuts
i was playing with SIMD last year,

the approach i took was to try to minimise the M/C, so;

no attempt to support general formulas; let people combine pre-made versions of the most common/expensive functions, like the SIMD designers did, and only later up the complexity of formulas supported and make it x-platform.
make each call work on just one SIMD instruction sized array, so no looping or conditions in the M/C.

i only tried 4-way 32-bit x86 SIMD; performance was as you might expect: ~5ns for 4 x Sqrt(n+1).

i wanted to put up some code, but only once neon was working as well; i could go back to this since i have the h/w to try it on now.

example of 4 x Sqrt(n+1) using address of array.

// func f40pca(i *[4]float32)
TEXT ·f40pca+0(SB),$16-8
MOVQ    i+0(FP),AX   // get 64 bit address from first parameter
MOVAPS (AX),X0       // load 128bit, 4xfloat32,  from memory
MOVSS   $(1.0),X1       // load single precision var, 1.0 , into lower 32 bits
SHUFPS  $0x00,X1,X1  // duplicate it 4 times across the register
ADDPS   X1,X0    // parallel add 1
SQRTPS  X0,X0    // parallel sqrt in-place
MOVAPS   X0,(AX)  // put 128bit back to same address
RET  
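a pure-go stand-in for this kernel, plus the chunking wrapper that keeps all looping outside the machine code, might look like this (a sketch: sqrtP1 here is ordinary go standing in for the assembly routine above, and the slice-to-array-pointer conversion needs go 1.17+):

```go
package main

import (
	"fmt"
	"math"
)

// sqrtP1 computes Sqrt(n+1) on exactly one 128-bit register's worth of
// data, mirroring the ADDPS/SQRTPS example: no loops or conditions
// inside the kernel. In the real version this body is the assembly.
func sqrtP1(v *[4]float32) {
	for i, n := range v {
		v[i] = float32(math.Sqrt(float64(n) + 1))
	}
}

// sqrtP1Slice applies the kernel chunk by chunk; the looping and bounds
// logic stay in Go. (A full version would also handle a tail shorter
// than four elements.)
func sqrtP1Slice(s []float32) {
	for i := 0; i+4 <= len(s); i += 4 {
		sqrtP1((*[4]float32)(s[i : i+4]))
	}
}

func main() {
	s := []float32{0, 1, 2, 3}
	sqrtP1Slice(s)
	fmt.Println(s[0], s[3]) // 1 2
}
```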


ideally you might be able to use 'go generate' to support a more expansive range of functions, essentially making an extremely simple go compiler in go; i have the feeling you could get a large proportion of the possible performance increase with only a simple, template implementation.

also, support for SIMD is not signalled by architecture alone; you need to check with a CPU instruction to find out what's supported. see "math/floor_amd64.s" and "math/floor_asm.go" to see this happen in the std.lib.

Seb Binet

Oct 27, 2016, 6:39:21 AM
to simon place, golang-nuts

Something like that?

https://github.com/golang/image/blob/master/vector/gen.go

-s

sent from my droid



simon place

Oct 27, 2016, 3:54:41 PM
to golang-nuts, simon...@googlemail.com

> Something like that?


short answer, Yes.

however, from looking at it (i couldn't find documentation), that code is specific to speeding up graphics overlays? maybe? (accumulate)

but it's confusing me that it's using templates, when there seems to be only one template.

i was thinking of one, very simple, template per function, per h/w feature; 

so one each for;
Sqrt(X+k) [4]float32 on SSE4,
Sqrt(X+k) [4]float32 on NEON,
Sqrt(X+k) [4]float64 on AVX2
Sqrt(X+k) [8]float32 on AVX2
k1/Sqrt(X+k2) on SSE4
...

which leads to a big, but i think a maintainable, collection.

maintainable because, used in linear combinations without adding much overhead, it stops the number of functions rising at a high order; only functions with an element that has parallel support in the CPU have any point in being added, and since this would be open source, functions someone has added themselves could be contributed back.

which is why i was wanting NEON support to begin with, so there could be a general outline onto which contributions could be made; most of the time it would just be a simple modification/extension of a basic pool.



 


Nigel Tao

Oct 27, 2016, 8:16:56 PM
to simon place, golang-nuts
On Fri, Oct 28, 2016 at 6:54 AM, 'simon place' via golang-nuts
<golan...@googlegroups.com> wrote:
> however, from looking at it, couldn’t find documentation, that code is
> specific to speeding up graphics overlays? maybe? (accumulate)

Yes, speeding up an accumulation step, described at
https://medium.com/@raphlinus/inside-the-fastest-font-renderer-in-the-world-75ae5270c445#.qz8jram0o

The generated functions are SIMD implementations of very simple Go functions.

For example, the fixedAccumulateOpSrcSIMD function in the generated
https://github.com/golang/image/blob/master/vector/acc_amd64.s file is
the SIMD equivalent of the fixedAccumulateOpSrc function in
https://github.com/golang/image/blob/master/vector/raster_fixed.go
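Stripped of the SIMD and the fixed-point details, the accumulation idea is roughly the following (a simplified, hypothetical sketch of the floating-point variant; the quantization here is illustrative, not the package's exact code). It is a running prefix sum over per-pixel winding deltas, clamped and quantized to 8-bit alpha:

```go
package main

import "fmt"

// accumulateOpSrc sketches the accumulate step: a running sum over
// per-pixel area deltas, with |sum| clamped to [0, 1] and rounded to 8
// bits. The SIMD version computes the same prefix sum several lanes at
// a time.
func accumulateOpSrc(dst []uint8, src []float32) {
	acc := float32(0)
	for i, delta := range src {
		acc += delta
		a := acc
		if a < 0 {
			a = -a
		}
		if a > 1 {
			a = 1
		}
		dst[i] = uint8(a*255 + 0.5) // round to nearest
	}
}

func main() {
	dst := make([]uint8, 4)
	accumulateOpSrc(dst, []float32{0.5, 0.5, 0.25, -1.5})
	fmt.Println(dst) // [128 255 255 64]
}
```

The data dependency on acc is what makes a SIMD prefix sum non-obvious, and is exactly the part the generated assembly handles with shuffles.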


> but it’s confusing me that its using templates, when there seems to only be
> one template.

There is only one template, instantiated 6 times. There are 2 types of
math (fixed point and floating point), and 3 types of composition
operator (draw over, draw src, and accumulate to mask).

Raph's original C prototype, using intrinsics, is at
https://github.com/google/font-rs/blob/master/src/accumulate.c but
that does only 1 of those 6: floating point and draw src. IIRC it does
assume that the input slice's length is a multiple of 4, unlike the Go
code, but the C code is a proof of concept, not a production quality
library.

Nigel Tao

Oct 27, 2016, 8:23:53 PM
to simon place, golang-nuts
On Thu, Oct 27, 2016 at 9:24 AM, 'simon place' via golang-nuts
<golan...@googlegroups.com> wrote:
> the approach i took was to try to minimise the M/C, so;

Sorry for the ignorant question, but what does M/C stand for?

Erwin Driessens

Oct 27, 2016, 8:37:38 PM
to golang-nuts, simon...@googlemail.com
I'd love to see SIMD intrinsics in the Go compiler(s), even if it would mean separate packages for all the architectures. I'm not experienced enough to tell how far one could get with designing a cross-platform set of intrinsics. Could one use the hardware when it is available, falling back on emulation when not?

Klaus Post

Oct 28, 2016, 4:08:47 AM
to golang-nuts, simon...@googlemail.com
On Friday, 28 October 2016 02:37:38 UTC+2, Erwin Driessens wrote:

> I'd love to see SIMD intrinsics in the Go compiler(s), even if it would mean separate packages for all the architectures.  I'm not experienced enough to tell how far one could get with designing a cross-platform set of intrinsics instructions?

There are attempts at generalized SIMD out there (google "web simd" for instance), but while they may be good for the web, I think they fall short for a compiled language. The problem is that they either support only a lowest common denominator or have to fall back to slow reference implementations.

Take for instance the PSHUFB instruction, which allows a very fast [16]byte lookup on SSSE3-capable machines. This is helpful in various ways, but if it isn't available, the fallback has to commit the XMM register to memory and do 16 lookups, which is at least an order of magnitude slower than using the SIMD instruction. Similarly, the RSQRT (low-precision reciprocal square root) instruction allows a "shortcut", but if it isn't available on your architecture, emulating it will likely be very expensive.
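The scalar fallback for PSHUFB looks roughly like this: a sketch of the instruction's per-byte semantics in plain Go (each output byte is table[index&15], or zero when the index's high bit is set):

```go
package main

import "fmt"

// pshufbGeneric emulates 128-bit PSHUFB byte by byte. This is the
// "spill the register and do 16 lookups" fallback described above,
// roughly an order of magnitude slower than the single instruction.
func pshufbGeneric(table, idx *[16]byte) (out [16]byte) {
	for i, ix := range idx {
		if ix&0x80 == 0 { // high bit set selects zero
			out[i] = table[ix&0x0f]
		}
	}
	return out
}

func main() {
	table := [16]byte{'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
		'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p'}
	idx := [16]byte{0, 2, 1, 15, 0x80}
	out := pshufbGeneric(&table, &idx)
	fmt.Println(string(out[:4]), out[4]) // acbp 0
}
```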

Furthermore, keeping it close to the C intrinsics would also make porting existing code easier, which I can only see as a positive.

Adding it is not a trivial task, but with the recent compiler rewrite it has become much more feasible. There are still issues to be worked out, like forced constant parameters, switching between 2- and 3-parameter (VEX) instructions with compiler flags, intrinsic types, etc., and of course the whole matter of defining the intrinsics and supplying that information to the compiler.


> Using the hardware when it is available, falling back on emulation when not?

As you might be able to tell, I am not a big fan of emulation for other than testing purposes. While it might be reasonable in some cases, I find that a "clean" Go version is mostly better than an emulated intrinsics version.

I could go on. If you haven't already seen it, there are some good ideas here: https://groups.google.com/forum/#!topic/golang-nuts/yVOfeHYCIT4


/Klaus

simon place

Oct 28, 2016, 1:03:59 PM
to golang-nuts, simon...@googlemail.com
just, Machine Code, may be a less common term now.

strictly you might say the 'text' being generated here is assembly, and it becomes m/c 'numbers' after the assembler, but since it's just a one-to-one relationship, there isn't really much of a conceptual difference.

simon place

Oct 28, 2016, 1:20:12 PM
to golang-nuts, simon...@googlemail.com
> Take for instance the PSHUFB instruction, which allows a very fast [16]byte lookup in SSSE3 capable machines. This is helpful in various ways, but if it isn't available, it will have to commit the XMM register to memory and do 16 lookups, which is at least an order of magnitude slower than using the SIMD. Similarly, RSQRT (low precision reciprocal of the square root) instruction allows a "shortcut", but if it isn't available on your architecture, it will likely be very expensive.

this kind of thing is what led me to wonder if using a lot of separate templates might not be better.

because then you wouldn't be using the compiler's ability to optimise through all this complexity, but actually just using a record of what people have found optimal on a case-by-case basis.

with this being used on the most critical code, any slight advantage becomes worthwhile, particularly if all you have to do is 'generate' each time you change the critical part of the code.

simon place

Oct 28, 2016, 1:22:28 PM
to golang-nuts, simon...@googlemail.com

> Yes, speeding up an accumulation step, described at
> https://medium.com/@raphlinus/inside-the-fastest-font-renderer-in-the-world-75ae5270c445#.qz8jram0o
>
> The generated functions are SIMD implementations of very simple Go functions.
>
> For example, the fixedAccumulateOpSrcSIMD function in the generated
> https://github.com/golang/image/blob/master/vector/acc_amd64.s file is
> the SIMD equivalent of the fixedAccumulateOpSrc function in
> https://github.com/golang/image/blob/master/vector/raster_fixed.go


thanks for the explanation.
 
> There is only one template, instantiated 6 times. There are 2 types of
> math (fixed point and floating point), and 3 types of composition
> operator (draw over, draw src, and accumulate to mask).



i did see that, but was just really pointing out the difference from what i was talking about.
 

Ondrej

Nov 1, 2016, 5:58:17 AM
to golang-nuts, simon...@googlemail.com
Klaus, that's a great thread; I'd completely missed it.

It seems that a universal binary, as Go requires it, would be slow on dispatch, because there would be too much checking for individual intrinsics support. Do I understand correctly that, to overcome this, people either compile natively (which we cannot do [and don't want to allow]) or JIT it (which we cannot do, as there's no runtime assembly allowed)?

Thanks for all the input, much appreciated.

Nigel Tao

Nov 2, 2016, 9:40:29 PM
to Ondrej, golang-nuts, simon place
On Tue, Nov 1, 2016 at 8:58 PM, Ondrej <ondrej...@gmail.com> wrote:
> It seems that a universal binary, as Go requires it, would be slow on
> dispatch, because there would be too much checking for individual intrinsics
> support. Do I understand it correctly, that to overcome this, people either
> compile natively (which we cannot do [and don't want to allow]) or JIT it
> (which we cannot do as there's no runtime assembly allowed)?

Another ignorant question from me, but what do you mean exactly by
universal binary?

FWIW, the golang.org/x/image/vector package is portable, in that it
has an pure Go implementation. If compiled on GOARCH=amd64, it has
SIMD optimizations, guarded at runtime by a very cheap check (look for
haveFixedAccumulateSIMD). But that check is made once per e.g.
accumulate call (a higher level function composed of many SIMD ops),
not once per e.g. 128-bit add.
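That once-per-call pattern can be sketched as follows (hypothetical names; the SIMD path is a pure-Go stand-in so the example runs anywhere, and the flag is stubbed rather than set from a real CPU check like haveFixedAccumulateSIMD):

```go
package main

import "fmt"

// haveAccumulateSIMD would be initialized once from a CPU feature check;
// stubbed to false here.
var haveAccumulateSIMD = false

func accumulateGeneric(dst []float32) {
	var acc float32
	for i, d := range dst {
		acc += d
		dst[i] = acc // running prefix sum
	}
}

// accumulateSIMD stands in for the assembly implementation.
func accumulateSIMD(dst []float32) { accumulateGeneric(dst) }

// accumulate pays the feature check once per call, not once per
// element: the branch sits outside the hot loop, so its cost is
// amortized over the whole slice.
func accumulate(dst []float32) {
	if haveAccumulateSIMD {
		accumulateSIMD(dst)
		return
	}
	accumulateGeneric(dst)
}

func main() {
	buf := []float32{1, 2, 3}
	accumulate(buf)
	fmt.Println(buf) // [1 3 6]
}
```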

Ondrej

Nov 3, 2016, 5:00:54 AM
to golang-nuts, ondrej...@gmail.com, simon...@googlemail.com
On Thursday, 3 November 2016 01:40:29 UTC, Nigel Tao wrote:

> Another ignorant question from me, but what do you mean exactly by
> universal binary?

Apologies for the confusing and nonsensical term. What I meant was a binary that works for a number of CPUs within an architecture, with or without SSE[n], AVX[n] etc.

> FWIW, the golang.org/x/image/vector package is portable, in that it
> has a pure Go implementation. If compiled on GOARCH=amd64, it has
> SIMD optimizations, guarded at runtime by a very cheap check (look for
> haveFixedAccumulateSIMD). But that check is made once per e.g.
> accumulate call (a higher level function composed of many SIMD ops),
> not once per e.g. 128-bit add.

That's the sort of cheap check I had in mind in the very first post when I talked about "I envisaged a call to CPUID and then some bool tests along the way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if available. All in a static, portable package." Thanks for a good example of that.

O.

simon place

Nov 3, 2016, 5:21:48 PM
to golang-nuts, ondrej...@gmail.com, simon...@googlemail.com

> That's the sort of cheap check I had in mind in the very first post when I talked about "I envisaged a call to CPUID and then some bool tests along the way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if available. All in a static, portable package." Thanks for a good example of that.
>
> O.

AFAICT, here;

> see "math/floor_amd64.s" and "math/floor_asm.go" to see this happen in the std.lib.

uses the stored result from a single call to CPUID.
