"Intrinsics for Go" Branstorm

Klaus Post

Sep 4, 2015, 9:00:30 AM

to golang-nuts

# DISCLAIMERS

* This is not a finished proposal.

* This is a brainstorm to see what is feasible, and measure general interest.

* This is not a simple project; it will take a lot of work to implement, by people who are better at this than I am.

# WHY INTRINSICS?

Go has conquered the server space because of its elegant language and high-performing implementation. However, if you work with images, sound or video, Go doesn't have a big presence, and most projects rely on external libraries/executables. This is natural, since these types of work gain hugely from SIMD code.

Go has an assembler. However, in my subjective experience, writing intrinsics takes only a small fraction of the *time investment* of writing assembler. My personal estimate is that intrinsics are 5x more time-effective to write, and performance is rarely more than 10% below handcrafted assembler.

Furthermore, intrinsics are much safer than pure assembler. You do not have to manage registers, so there is no risk of reading uninitialized data. I have an experimental implementation where we even maintain memory safety, since we remove pointer math.

Also, if intrinsics are done using the existing language, we get all the tools of the language. This will enable things like ensuring that CPU features are detected correctly, refactoring and all the good things we know and love.

# EXPERIMENTS

I have hacked together two programs that generate function signatures for all existing Intel x86 intrinsics, as well as ARM NEON intrinsics.

To get the generated code, use:

go get github.com/klauspost/intrinsics

If you just want to browse the godoc for the generated code, go to https://godoc.org/github.com/klauspost/intrinsics

To see an example, see https://github.com/klauspost/intrinsics/blob/master/x86/m128_test.go

This is actual working code (if the intrinsics are implemented). TestAddEpi8, TestAddPs, TestMulPs and TestMulAddPs will actually pass if you put in the instruction in the assembler for them.

Function names are simplified, but the aim is to make it easy to convert existing intrinsic code.

* '_mm_' prefix is dropped.

* Underscore -> CamelCase.

This means that '_mm_and_si128(...)' -> AndSi128(...), '_mm_add_epi8' -> AddEpi8, etc.
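The renaming rule above can be sketched in a few lines of Go. This is only an illustration: goName is a hypothetical helper written for this post, not the code the generator actually uses, and it only handles the plain '_mm_' prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// goName is a hypothetical helper: it drops the "_mm_" prefix and turns
// the remaining underscore-separated parts into CamelCase.
func goName(cName string) string {
	name := strings.TrimPrefix(cName, "_mm_")
	var b strings.Builder
	for _, part := range strings.Split(name, "_") {
		if part == "" {
			continue // skip empty parts from doubled underscores
		}
		b.WriteString(strings.ToUpper(part[:1]))
		b.WriteString(part[1:])
	}
	return b.String()
}

func main() {
	for _, n := range []string{"_mm_and_si128", "_mm_add_epi8", "_mm_mul_ps"} {
		fmt.Printf("%s -> %s\n", n, goName(n))
	}
}
```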

All intrinsics are separated into packages based on the CPUID features they require. This has the advantage that you can see your cpu requirements in your imports, and it gives reasonably sized packages.

All instructions that receive a pointer are skipped. Grep sources for '// Skipped:' to find these.

Instructions that have an immediate parameter, or return a value in a parameter pointer, are marked with a "FIXME:".

I generate a rather crude stub assembler for each function that loads each parameter from the stack and writes a return value, but no "operation" code is executed. However for basic instructions it will actually work if you uncomment the "proposed" instruction.

MMX has been included for completeness. Maybe that should just be ignored.

# TYPES

The typical x86 assembler uses only a few types. Here are the 128 bit ones:

* M128i (__m128i), which is a 128 bit register with integer content (8x16, 16x8, 32x4, 64x2 bits packed).

* M128 (__m128), 128 bit register containing [4]float32

* M128d (__m128d), 128 bit register containing [2]float64

These all refer to a REGISTER, they are traditionally not settable directly. In traditional intrinsics you need to call a function to cast between them, set and get values. I think we maybe can do that better in Go.

In my opinion it should be allowed to do `var m M128 = M128(M128i{})`.
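As a rough sketch of why that conversion could work with today's rules: if both types are declared on the same underlying type, Go already permits the explicit conversion. The vec128 name and the [16]byte representation below are assumptions for illustration only; a real implementation would use a compiler-provided register type.

```go
package main

import "fmt"

// vec128 is a hypothetical stand-in for a compiler-provided 128-bit type.
type vec128 [16]byte

// M128 and M128i share the same underlying type, so explicit conversion
// between them is allowed, but they cannot be mixed up by accident.
type M128 vec128
type M128i vec128

func main() {
	i := M128i{0: 0x01, 15: 0xff}
	var m M128 = M128(i) // same bits, different static type
	fmt.Println(m[0], m[15])
}
```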

Proposed type: []M128, []M128i, []M128d

This references a *safe* piece of memory. These reference the content of an underlying slice. There is no alignment guarantee. Here is an example conversion function:

// FloatToM128 converts a slice into M128 array.
// The number of elements is len(src) / 4.
// Will be a pointer to the original data, not a copy
func FloatToM128(src []float32) []M128

There should be similar functions for []float64 -> []M128d, int8, uint8, int16, etc -> []M128i.

It behaves as a traditional slice, and looking up an entry gives a "M128x" as you would expect. Since the size is rounded down, only valid memory can be referenced. Writing to the []M128 will write to the original slice.
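For illustration, here is how such an aliased slice can be approximated in plain Go with package unsafe. This is a sketch under assumptions: M128 is modelled as [4]float32 rather than a register type, and unsafe.Slice needs Go 1.17 or later. It is exactly the kind of "hackish" pointer adjustment that compiler support could hide.

```go
package main

import (
	"fmt"
	"unsafe"
)

// M128 is a stand-in for the proposed type, modelled here as four float32 lanes.
type M128 [4]float32

// FloatToM128 returns a view of src as groups of four floats.
// The length is rounded down, so the view never extends past src.
// The result aliases src; it is not a copy.
func FloatToM128(src []float32) []M128 {
	n := len(src) / 4
	if n == 0 {
		return nil
	}
	return unsafe.Slice((*M128)(unsafe.Pointer(&src[0])), n)
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	v := FloatToM128(src)
	v[1][0] = 50 // writes through to src[4]
	fmt.Println(len(v), src[4])
}
```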

All of the above is on 128 bit registers (SSEx), there are similar types for 64, 256 and 512 bits.

# ISSUES UNCOVERED

- Immediates

Some intrinsics have "immediate" values, which need to be compiled into the opcode. Without compiler support there is no real way of achieving this. Example of this: https://godoc.org/github.com/klauspost/intrinsics/x86/sse#Pshufw

- VEX prefixed 3+ register instructions

This is a conceptual issue. A very nice feature of intrinsics is that the same code can switch from SSE encodings (MULPS x1, x2 // x2=x2*x1) to VEX encoded equivalents (MULPS x1, x2, x3 // x3=x2*x1) without needing to be rewritten. In GCC this is done with compiler flags. IMO we don't want that.

The best solution I have been able to come up with is to duplicate "sse" package into a "sse.vex" package. That will allow enabling VEX encoding by simply changing an import.

However, this only partially solves the problem. The compiler also needs to know that it is ok to use the extra registers (XMM16->XMM31) on AVX-512.

- Destination parameters

There are some intrinsics that return multiple values, and where you supply destination in a parameter. For instance "sse.IdivremEpi32". These should IMO be reworked to return multiple values instead.
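A sketch of what the reworked signature could look like, written as a pure Go emulation. The M128i representation as four int32 lanes and the function body are assumptions for illustration; only the idea of returning quotient and remainder together comes from the text above.

```go
package main

import "fmt"

// M128i is a stand-in for the register type, viewed as four int32 lanes.
type M128i [4]int32

// IdivremEpi32 returns both results instead of writing one of them
// through a destination pointer. Like any Go integer division, this
// emulation panics if a lane of b is zero.
func IdivremEpi32(a, b M128i) (quot, rem M128i) {
	for i := range a {
		quot[i] = a[i] / b[i]
		rem[i] = a[i] % b[i]
	}
	return quot, rem
}

func main() {
	q, r := IdivremEpi32(M128i{7, -7, 9, 10}, M128i{2, 2, 3, 5})
	fmt.Println(q, r)
}
```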

- AVX Gather/Scatter

Currently all Gather/Scatter intrinsics are skipped, since they have a pure pointer parameter. Gather/Scatter is very powerful, and not having access to them would be a big annoyance.

- ARM NEON notes

ARM NEON is very crude. I have not yet written ARM intrinsics, so I cannot give any real examples for that.

Maybe someone here has tried it and can give feedback on it. The current intrinsics are generated based on the GCC "arm_neon.h" file.

- M256 / M512 function names

I considered replacing "M256" -> "V" and "M512" -> "VP" function prefixes in AVX/AVX512 to make the function names shorter.

# COMPILER ADDITIONS

This is of course where the bulk of the work lies, and ultimately this is where the tradeoff between complexity and added features lies. Our aim should be to make the implementation as light as possible on the compiler, so we gain the flexibility we need, but don't need to change the compiler every time we need to add/change an intrinsic.

- New type(s)

This is the biggest issue I see, but I currently cannot see any way to avoid having to put in a new type per register size (64, 128, 256, 512).

On x86 they would represent MM/XMM/YMM/ZMM, on ARM they would represent other registers. This should act as the base types that M128, M128i, etc are created from. They don't need to have a name the user ever sees, so "_vector_register_64_bits_", or something similar that is very unlikely to cause collisions is perfectly fine.

- Inlining

It is a must that intrinsic functions are inlined. The compiler must recognize an intrinsic function and know how to inline it. The usual register assignment optimizations can of course still be applied.

- Specifying intrinsics

There should be a "neat" way to specify intrinsics, so the compiler knows what to do with it, encode registers & immediates. Maybe someone has some input on this?

- Creating aliased slices?

I don't know if the proposed aliased slices []M128, etc would require compiler support. It could be done with a simple assembler function that just copies the pointer and adjusts the size, but that seems a bit "hackish".

# OTHER

* Emulation: We could offer pure Go emulation on other platforms. It will be slow, take a lot of work but it is feasible as a long-term goal.

* CPU Feature insurance: It will be possible to print all used CPU features.

* 32/64 Bits: Intrinsics mostly don't care if they are on a 32 or 64 bit platform. Especially on ARM (which is transitioning now) that could be an advantage.
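To give an idea of what the pure Go emulation mentioned above could look like, here is a minimal sketch for a single intrinsic. The [16]byte representation of M128i and this AddEpi8 body are assumptions for illustration, not the generated code.

```go
package main

import "fmt"

// M128i is a stand-in for the register type, viewed as 16 byte lanes.
type M128i [16]byte

// AddEpi8 emulates a lane-wise 8-bit add. Byte addition in Go wraps
// around on overflow, which gives the same bit pattern as a wrapping
// packed 8-bit add.
func AddEpi8(a, b M128i) M128i {
	var r M128i
	for i := range r {
		r[i] = a[i] + b[i]
	}
	return r
}

func main() {
	var a, b M128i
	for i := range a {
		a[i] = 250
		b[i] = 10
	}
	fmt.Println(AddEpi8(a, b))
}
```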

I would personally love to see intrinsics in Go, so that is why I took out the time to research before this brainstorm. Now I would like to hear from you!

* Would this help you?

* Would you like to see it (eventually)?

* Would you be annoyed that Go developers used time on this?

* What have I overlooked?

* What am I doing that is silly?

* Is there a showstopper?

* Should it be less/more like C intrinsics?

Thanks for your time!

/Klaus

Manlio Perillo

Sep 4, 2015, 9:39:01 AM

to golang-nuts

On Friday, 4 September 2015 15:00:30 UTC+2, Klaus Post wrote:

> # DISCLAIMERS
> * This is not a finished proposal.
> * This is a brainstorm to see what is feasible, and measure general interest.
> * This is not a simple project, it will take a lot of work to implement by people that are better than me.
>
> # WHY INTRINSICS?
>
> Go has conquered the server space because of the elegant language, and the high performing implementation. However if you work with images, sound or video, Go doesn't have a big presence, and most rely on external libraries/executables. This is natural, since these types of work have huge gains from SIMD code.
>
> [...]

My suggestion is not to add intrinsics to the Go language, but instead to replace the current assembler with a high-level assembler; something like C--:

http://www.cs.tufts.edu/~nr/c--/index.html

Regards Manlio

Klaus Post

Sep 4, 2015, 9:54:15 AM

to golang-nuts

Yes I noticed the title spelling error. Luckily it is in the middle of the word, so I hope your bra_i_n skipped it.

I forgot to write about build restrictions: It should be fine to use the same build tags as we currently do, so 'X_arm.go' contains ARM intrinsics, etc.

/Klaus

Klaus Post

Sep 4, 2015, 10:24:39 AM

to golang-nuts

On Friday, 4 September 2015 15:39:01 UTC+2, Manlio Perillo wrote:

> My suggestion is not to add intrinsics to the Go language, but instead to replace the current assembler with a high-level assembler; something like C--:
> http://www.cs.tufts.edu/~nr/c--/index.html
>
> Regards Manlio

Could you outline the advantages of that? I haven't seen it, and the assembler feels fine for the "I want to control everything" approach, so I am unsure how this fits in.

/Klaus

Manlio Perillo

Sep 4, 2015, 12:17:09 PM

to golang-nuts

On Friday, 4 September 2015 16:24:39 UTC+2, Klaus Post wrote:

> On Friday, 4 September 2015 15:39:01 UTC+2, Manlio Perillo wrote:
>
>> My suggestion is not to add intrinsics to the Go language, but instead to replace the current assembler with a high-level assembler; something like C--:
>> http://www.cs.tufts.edu/~nr/c--/index.html
>
> Could you outline the advantages of that?

A language like C-- makes it very convenient to write assembly code.

It naturally gives you access to "intrinsics", but with a C-like syntax.

> I haven't seen it, and the assembler feels fine for the "I want to control everything" approach, so I am unsure how this fits in.

IMHO, it is not a matter of "I want to control everything" but a matter of "I want to write low level code with access to some hardware registers and instructions".

You cannot write such code in Go, and writing it in the Go assembler is not very convenient.

C-- is like C for a language like Python.

Regards Manlio

Tieson Molly

Sep 4, 2015, 1:37:59 PM

to golang-nuts

I would love to see this happen.

I saw this other post about a Python server outperforming a Go server. It came down to the Python one using SSE.

https://www.reddit.com/comments/3jhv80/_/cupel2c?context=3

Nice article on how hand coded assembly still has an upper hand over intrinsics

http://danluu.com/assembly-intrinsics/

Klaus Post

Sep 4, 2015, 4:55:09 PM

to golang-nuts

For simple instructions you can often do better than a compiler, but if you go into more complex tasks, like the "16 bit linear rgb to gamma corrected srgb" example, the compiler can do the register allocation and maintain registers across multiple conditional branches much better than you can easily do yourself.

Egon

Sep 5, 2015, 12:44:23 AM

to golang-nuts

All proposals should have real-life examples attached to them; it's even better when they show how the proposal would simplify Go compiler or std pkg code.

Note having it under "builtin/intrinsic/sse2".

* Would this help you?

In a few projects, yes, although I can also manage writing asm.

* Would you like to see it (eventually)?

I haven't made my mind up.

Jason Playne

Sep 5, 2015, 3:51:11 AM

to Klaus Post, golang-nuts

First Up, Thanks for a well written and researched post/question.

* Would this help you?

TBH - Probably not, I tend to not need this sort of thing

* Would you like to see it (eventually)?

Yes. Yes I would. There could be some very nice data processing speed ups

* Would you be annoyed that Go developers used time on this?

Not at all

* What have I overlooked?

- Golang was conceived to be simple so that programmers would get up to speed quickly. This may impact that

- There is the argument of "If you want low level then write low level and then include it in your program"

- Cross Platform capabilities become a little less clear cut, unless you make sure that anything in the std lib has a fallback

* What am I doing that is silly?

Not silly at all!

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Klaus Post

Sep 5, 2015, 5:40:26 AM

to golang-nuts

> - Golang was conceived to be simple so that programmers would get up to speed quickly. This may impact that

(I assume you mean that it could be seen as more complex)

I tried to accommodate that by making it standard functions, that can be looked up by standard tools. Compare this to assembly, where you need a cpu reference guide to understand it. Another thing is that you would probably have Go reference functions for all your functions containing intrinsics.

> - There is the argument of "If you want low level then write low level and then include it in your program"

You could also argue that you make the low level code more accessible. All code looks like standard Go code, although with a lot of function calls.

> - Cross Platform capabilities become a little less clear cut, unless you make sure that anything in the std lib has a fallback

My initial thought is that intrinsics are written for a specific platform, since emulating will likely be slower than straight up Go. That is just an assumption on my side though, I may be wrong on that.

Daniel Eloff

Sep 5, 2015, 5:37:13 PM

to golang-nuts

I have a lot of experience using the Go assembler. And while I've written a lot of low-level code that would benefit from intrinsics, I don't think they should be added to Go.

The bottom line is intrinsics don't offer enough advantage over assembler.

- It's not that much easier to write code with intrinsics, certainly not 5x.

- It's performance critical code, so being precise about register allocations and instruction ordering is important. If you write it in assembler it will usually run faster. See: "Hand Coded Assembly Beats Intrinsics in Speed and Simplicity" http://danluu.com/assembly-intrinsics/

- Portability is not usually very important. You either write for servers or desktops which are x64 (and x86 if you want, but then you're restricting yourself to SSE2 or older.) Or you (rarely) write for mobile (ARM). So typically you have to write it once or twice, and intrinsics won't help you very much there.

- You inevitably need specialized SIMD instructions which are CPU model specific, intrinsics don't save you anything there

What I would like to see is full assembler support for the SIMD instructions in modern processors. Skylake has been released with AVX-512, but the Go assembler can't even do AVX instructions yet. I'd also like to see all the SSE instructions included; currently most of the more specialized ones that Go itself doesn't use are missing. I'd also like to see jump table support (use of labels in data sections). That's a much lower bar than intrinsics and probably just as useful.

The main thing is to give most of the power (benefit) without complicating the Go language or placing an unreasonable burden on Go programmers who need to delve deeper than most. Assembler fits those criteria; intrinsics, IMHO, don't.

Sorry to rain on your parade,

Dan

Caleb Spare

Sep 5, 2015, 6:06:11 PM

to Klaus Post, golang-nuts

When implementing some optimizations in assembly I've encountered the
following situation several times:

- The crucial speedup comes from using a handful of specific
instructions in a critical section
- If you write a small function in assembly for just that critical
section, the function call overhead dominates and kills performance
- So you write the whole inner loop in assembly, which is far bigger
and more error-prone. Runtime and GC interactions can get very tricky.

I don't have experience with intrinsics in C/C++ and I have a hard
time seeing a scenario in which they would be added to Go...but it
would solve this problem, which for me was the biggest issue when
implementing optimizations in assembly.

To think out loud a little, perhaps there's some other solution that
the Go toolchain could implement, such as the ability to inline
particular restricted types of assembly functions.

-Caleb

Dan Eloff

Sep 6, 2015, 12:19:21 PM

to Caleb Spare, Klaus Post, golang-nuts

Caleb's experience mirrors my own. Because of the function call overhead I often have to write much more in assembly than just the critical part. So I too would like the ability to inline some types of assembly functions. Doesn't this already exist in Go? I have difficulty believing that e.g. atomics are implemented as function calls. Surely the runtime must have a mechanism for this already, just for implementing itself.

Caleb Spare

Sep 6, 2015, 8:41:32 PM

to Dan Eloff, Klaus Post, golang-nuts

> Caleb's experience mirrors my own. Because of the function call overhead I often have to write much more in assembly than just the critical part. So I too would like the ability to inline some types of assembly functions. Doesn't this already exist in Go? I have difficulty believing that e.g. atomics are implemented as function calls. Surely the runtime must have a mechanism for this already, just for implementing itself.

Unfortunately no; even the runtime's atomic operations are implemented as normal asm functions. (See e.g. src/runtime/asm_amd64.s; look at objdump -d any-go-binary and search for 'atomicstore' to confirm that there's no inlining.)

There's a closed issue related to this discussing inlined atomic ops for the C compiler: https://github.com/golang/go/issues/4947

In a very quick'n'dirty benchmark on my laptop, calling atomic.AddUint64 was about 30% slower than the manually inlined equivalent. But atomics are relatively slow operations so they're not the worst case; I tried a similar benchmark with the function

func BSF(x uint64) int

(which simply uses the BSF instruction) and the function call overhead appeared to be about 135%.
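For reference, a pure Go stand-in with the same signature could look like the sketch below. This is an illustration, not the benchmarked code: the assembly version replaces the whole body with a single instruction, and returning 64 for a zero input is an assumption made here (the BSF instruction itself leaves the result undefined in that case).

```go
package main

import "fmt"

// BSF returns the index of the lowest set bit of x.
// Returning 64 for x == 0 is a choice made for this sketch.
func BSF(x uint64) int {
	if x == 0 {
		return 64
	}
	n := 0
	for x&1 == 0 {
		x >>= 1
		n++
	}
	return n
}

func main() {
	fmt.Println(BSF(1), BSF(8), BSF(1<<63))
}
```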

-Caleb

Nigel Tao

Sep 6, 2015, 9:24:02 PM

to Klaus Post, golang-nuts

On Fri, Sep 4, 2015 at 11:00 PM, Klaus Post <klau...@gmail.com> wrote:
> * Would you like to see it (eventually)?

Thanks for the write-up. I haven't put a lot of thought into
intrinsics specifically, but I'm wary of adding complexity to the
compiler, and possibly overfitting for a specific problem domain and
similarly to an architectural family.

I have also previously said this about the "table proposal" as another
possible approach:

One could define a computational language a la halide-lang, and write
a program that worked with "go generate". This program would parse the
specialized code and generate Go 1.x code (which possibly uses package
unsafe for pointer arithmetic), or generate C code, or generate
6a-compatible assembly code, or generate GPU-specific code. Of course,
this still requires finding someone to do the work, but that person or
group of people don't have to be familiar with the runtime and
compilers, blocked on Go's release cycles, or bound by the Go 1
compatibility promise.

Caleb Spare

Sep 6, 2015, 9:42:44 PM

to Nigel Tao, Klaus Post, golang-nuts

> One could define a computational language a la halide-lang, and write
> a program that worked with "go generate". This program would parse the
> specialized code and generate Go 1.x code (which possibly uses package
> unsafe for pointer arithmetic), or generate C code, or generate
> 6a-compatible assembly code, or generate GPU-specific code. Of course,
> this still requires finding someone to do the work, but that person or
> group of people don't have to be familiar with the runtime and
> compilers, blocked on Go's release cycles, or bound by the Go 1
> compatibility promise.

I'm probably missing something, but I don't see a way to do the sorts of things we're discussing in this thread without being *very* familiar with the runtime and compilers.

Your suggestion makes a lot of sense for code that is possible to write using unsafe, or where C/asm code could fill the gaps yet be compiled using the current gc toolchain. But we've been discussing ways to incorporate particular instructions inline in Go code. AFAICT there's no Go/C/asm code that anyone could generate to accomplish this in today's gc toolchain.

So it seems to me that your suggestion of implementing this entirely outside of gc implies a separate Go compiler.

Brendan Tracey

Sep 6, 2015, 9:43:33 PM

to golang-nuts, klau...@gmail.com

I would love to see Go code be vectorized. For numeric code there can be significant speedups, and this will only become larger as the vector lengths get larger (a la AVX-512). Vectorized Go and an improved scheduler would make an amazing language for the upcoming manycore chips.

That said, I don't think inline intrinsics built into the language are the way to go. Let's say you want to do the dot product of two slices

var dot float64
for i, v := range x {
	dot += v * y[i]
}

Assuming x and y don't overlap, what size intrinsic do you use? Ideally the largest one available that fits. Today this may be the 128 bit, but the next day it may be 512 and the next it may be 128. I'd rather see effort invested in the compiler to help it identify places where it can vectorize. I'm under the impression that this is difficult, but not impossible, and that the upcoming SSA compiler should make it easier to do such optimizations (and I imagine operations involving [4]float64 will be even easier to vectorize). This keeps the language legible and less likely to be overfit to the architecture.
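To make the question concrete, here is the same loop restructured by hand into four independent accumulators, which is roughly the shape a 4-wide vectorized version would take. This is a plain Go sketch with hypothetical function names. Note that it sums in a different order than the scalar loop, so floating-point results can differ in the last bits; a compiler would have to be allowed to make that change.

```go
package main

import "fmt"

// dotScalar is the straightforward loop from above.
func dotScalar(x, y []float64) float64 {
	var dot float64
	for i, v := range x {
		dot += v * y[i]
	}
	return dot
}

// dotUnrolled4 keeps four partial sums, one per "lane", and handles the
// leftover elements with a scalar tail loop. Like dotScalar, it panics
// if y is shorter than x.
func dotUnrolled4(x, y []float64) float64 {
	var acc [4]float64
	n := len(x) &^ 3 // largest multiple of 4 not exceeding len(x)
	for i := 0; i < n; i += 4 {
		acc[0] += x[i] * y[i]
		acc[1] += x[i+1] * y[i+1]
		acc[2] += x[i+2] * y[i+2]
		acc[3] += x[i+3] * y[i+3]
	}
	dot := (acc[0] + acc[1]) + (acc[2] + acc[3])
	for i := n; i < len(x); i++ {
		dot += x[i] * y[i]
	}
	return dot
}

func main() {
	x := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	y := []float64{2, 2, 2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(dotScalar(x, y), dotUnrolled4(x, y))
}
```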

Dan Eloff

Sep 6, 2015, 10:07:09 PM

to Caleb Spare, Klaus Post, golang-nuts

That's horrifying. Atomics are slow, and they are also required by Go's memory model when you read or write values that other goroutines may access without synchronization. The equivalent of relaxed atomics in C++ or Java, which compile into a standard mov on x86/x64, will require a function call. The performance difference as a relative % will be insane. This has quite a lot of performance implications for core parts of the Go runtime that use relaxed atomics, possibly including the scheduler, GC, and channels.

I have quite a lot of generated code using atomics to comply with the Go memory model. That's going to be unacceptably slow. I could generate assembly instead of Go, but many of the generated functions do little outside of a load or store, and are candidates for inlining. Besides which, generated Go code is readable, generated assembly code is nasty. The only other option is generating regular data-racey loads and stores and hoping for the best while dancing barefoot across the broken glass of undefined behavior. None of these seem like acceptable solutions to me, but I'm going to have to choose one.
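To illustrate the kind of generated function meant here, a sketch of two tiny accessors that do nothing beyond an atomic load or store. The type and method names are made up for the example; only the sync/atomic calls are real.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical generated type. The uint64 is the first
// field so that it is 64-bit aligned on 32-bit platforms as well.
type counter struct {
	v uint64
}

// Load reads the value with an atomic load.
func (c *counter) Load() uint64 { return atomic.LoadUint64(&c.v) }

// Store writes the value with an atomic store.
func (c *counter) Store(x uint64) { atomic.StoreUint64(&c.v, x) }

func main() {
	var c counter
	c.Store(42)
	fmt.Println(c.Load())
}
```

Each of these is a natural inlining candidate, since the wrapper adds nothing but the call.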

Nigel Tao

Sep 6, 2015, 10:10:25 PM

to Caleb Spare, Klaus Post, golang-nuts

On Mon, Sep 7, 2015 at 11:42 AM, Caleb Spare <ces...@gmail.com> wrote:
> I'm probably missing something, but I don't see a way to do the sorts of
> things we're discussing in this thread without being *very* familiar with
> the runtime and compilers.

Ah, the unsafe thing was more relevant to the table proposal. For
intrinsics, I would imagine that this new tool would emit either asm
code (in "go tool asm" format), or emit binary code directly that
followed the Go calling convention.

It's a separate compiler, but not a separate Go compiler.

Having said that, that's assuming that you can factor out the
SIMD-rich code from the other code. For image processing, I would
generally assume that to be the case. Yes, that means that this new
tool has to write the whole loop, not just one line in the inner loop,
but if you really care about performance, I'd expect that you (via
this tool) want to consider e.g. registerization and other state
over the whole of the loop.

james4k

Sep 6, 2015, 10:28:46 PM

to golang-nuts, nige...@golang.org, klau...@gmail.com

I agree with Nigel that code generation could make a great compromise between assembler and intrinsics.

Now, I would personally love intrinsics as it's currently the most practical way of making use of all of your compute units, but the hesitation is completely understandable.

Michael Jones

Sep 7, 2015, 12:51:45 AM

to Nigel Tao, Klaus Post, golang-nuts

For the last few years I’ve been using a “personal” flavor of Go that allowed assembly coding of methods. The performance gain was quite measurable for those places where I used it. With the rewrite I gave that up and changed my code (at least so far ;-)

It is similar with light-weight intrinsics. I have wanted to use CTZ many times, but always in the middle of something complex. If I could code “b := CTZ(x)” or get to carry/borrow bits it would be fantastic.

—
Michael Jones, CEO • mic...@wearality.com • +1 650 656-6989
Wearality Corporation • 289 S. San Antonio Road • Los Altos, CA 94022

Klaus Post

Sep 7, 2015, 10:04:39 AM

to golang-nuts, nige...@golang.org, klau...@gmail.com

Hi!

Thanks for the great feedback! Constructive criticism is great! I really appreciate a community that neither accepts everything at face value nor rejects everything that changes anything in its current "world". Thanks!

To avoid spamming, I put replies to Dan, Caleb, Nigel, Brendan and Michael in a single post.

@Dan Eloff:

> "Hand Coded Assembly Beats Intrinsics in Speed and Simplicity"

I have seen that article referenced multiple times, and to use it as a basis of evaluation is a mistake. In my opinion, using the timing of a single instruction on a single platform with a single compiler is similar to judging an entire language based on the execution speed of a for loop. Mind you, C has a lot of "baggage" it has to account for.

> It's not that much easier to write code with intrinsics, certainly not 5x.

I intentionally wrote that it was *my* experience after working with both for more than 10 years. The biggest difference isn't the initial time it takes to write it, but all the time spent adjusting it. Any kind of test/change/adjustment takes a lot of time for even moderately sized functions, since you have to deal with registers manually. With intrinsics you can insert a conditional jump, a function call, move code and the compiler will adjust registers for you.

> You inevitably need specialized SIMD instructions which are CPU model specific, intrinsics don't save you anything there

Could you elaborate on that? In my experience this is exactly what intrinsics help you to do.

@Caleb Spare:

> The crucial speedup comes from using a handful of specific instructions in a critical section

Exactly. Recently I encountered this when optimizing the standard library deflate. Most of the time is spent comparing bytes and getting match lengths [1]. Compare an intrinsic version [2] with the assembler version [3]. In my mind:

* The intrinsic version is *much* easier to read.

* The intrinsic version is memory safe. The assembler relies on len(a) <= max && len(b) <= max, which is up to the assembler author to check.

* The intrinsic version can (in theory) be inlined by the compiler, assembler (currently) cannot.

* The intrinsic version can easily call other functions to avoid duplicate functionality. That is much more tedious in asm.

[1] https://github.com/klauspost/compress/blob/master/flate/deflate.go#L241

[2] http://play.golang.org/p/dPotG_e2FD

[3] https://github.com/klauspost/compress/blob/master/flate/crc32_amd64.s#L97
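For comparison, a pure Go reference version of a match-length function might look like the sketch below. The name and signature are assumptions for illustration and are not copied from the linked code. It shows the memory-safety point: the limit is clamped to both slice lengths, so no caller-side precondition is needed.

```go
package main

import "fmt"

// matchLen returns the number of leading bytes that a and b have in
// common, up to max. Hypothetical reference version: max is assumed to
// be non-negative and is clamped so indexing can never go out of bounds.
func matchLen(a, b []byte, max int) int {
	if max > len(a) {
		max = len(a)
	}
	if max > len(b) {
		max = len(b)
	}
	for i := 0; i < max; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	return max
}

func main() {
	fmt.Println(matchLen([]byte("hello world"), []byte("hello there"), 32))
}
```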

> To think out loud a little, perhaps there's some other solution that the Go toolchain could implement, such as the ability to inline particular restricted types of assembly functions.

That would add things like Bswap, BitScanXxxx [4], Bit Manipulation instructions [5] and similar, where a function call is more expensive than the operation it uses. It seems a little like an "in-between" solution, but it could very well be a stepping stone towards "full" intrinsics.

[4] https://godoc.org/github.com/klauspost/intrinsics/x86/misc

[5] https://godoc.org/github.com/klauspost/intrinsics/x86/bmi1

@Nigel Tao:

> One could define a computational language a la halide-lang, and write

> a program that worked with "go generate". [...]

[...]

> Yes, that means that this new tool has to write the whole loop, not just one line in the inner loop,

I have tried to imagine what could be done without some sort of compiler "knowledge" of intrinsics, and I've always ended up with something that is as complex as the current compiler, no matter if you go for "inline assembler" approach, or intrinsics that are "intercepted" and changed in assembler.

My approach shifted a bit to see how little I could get away with in terms of compiler changes. And if we could get types that map to register types (XMM/YMM/ZMM), and some way of marking and replacing intrinsic functions with their assembly equivalents it shouldn't be entirely impossible.

Another reason I went the "compiler" route is that we can benefit from all the nice optimizations that are currently being worked on. Almost all I have seen on golang-codereview also apply to intrinsics in some form.

@Brendan Tracey:

>> I'd rather see effort invested in the compiler to help it identify places where it can vectorize

You have a perfect example, where the compiler is extremely unlikely to be able to vectorize it, unless this specific case is added.

I have created various intrinsic versions in SSE3/SSE4/FMA/ARM NEON [6]. I think it is pretty clear, and you only have to look up the functions to see what they do.

>> what size intrinsic do you use?

That is a question that only depends on your use. That is another reason automatic vectorization is difficult. If you test very small slices, you likely get very little SIMD speedup, if any. Big ones are helped by unrolling. With this, you can design it as you like, and libraries like Agner Fog's vector class library [7] can be built to handle these types of problems.

>> less likely to be overfit to the architecture.

Intrinsics don't have specific targets in mind: both ARM and x86 have intrinsics, and others can of course be added. So any overfitting would come from the package authors, which doesn't really change the current situation. "Automatic" vectorization would also be tied to specific platforms.

[6] http://play.golang.org/p/33OZyxqtmj

[7] http://www.agner.org/optimize/#vectorclass

@Michael Jones

Yes, that would be something like b := bmi1.TzcntU32(x) - depending on your type. [8]
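For reference, the behaviour of such a call can be emulated in plain Go. This is only a sketch: the body below is an assumption about what a fallback could look like, not the generated package's code, and it leans on math/bits, a standard-library package that was added after this thread was written.

```go
package main

import (
	"fmt"
	"math/bits"
)

// TzcntU32 is a hypothetical pure-Go stand-in for the TZCNT instruction:
// it counts the trailing zero bits of x and returns 32 when x is zero.
func TzcntU32(x uint32) uint32 {
	return uint32(bits.TrailingZeros32(x))
}

func main() {
	for _, x := range []uint32{0, 1, 8, 0x80000000} {
		fmt.Printf("TzcntU32(%#x) = %d\n", x, TzcntU32(x))
	}
}
```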

I tried messing about with the 1.5 compiler, but that is beyond my current capabilities, although it would be nice for an *actual* proof-of-concept. Right now XMM registers are passed to separate assembler functions as byte arrays, so for obvious reasons I cannot give any performance indicators, which is very frustrating.

Forking is of course not an option. I guess a logical next step would be to see if there is a not-too-intrusive way to specify instructions for the compiler that it understands and can manage, but I'm not sure I am quite there yet.

[8] https://godoc.org/github.com/klauspost/intrinsics/x86/bmi1#TzcntU32

Best regards

/Klaus

Marat Dukhan

Sep 7, 2015, 8:11:02 PM9/7/15

to golang-nuts

My suggestion is not to add intrinsics to the Go language, but instead to replace the current assembler with a high-level assembler; something like C--:
http://www.cs.tufts.edu/~nr/c--/index.html

I am working on such an assembler - https://github.com/Maratyszcza/PeachPy

Some features:

* x86-64 instruction set up to AVX2 and SHA (including 3DNow!, XOP and FMA4, excluding x87 FPU and system instructions)
* Takes care of register allocation and ABI specifics
* Supports Python-based metaprogramming
* Can convert programs to Go assembly, Go syso, or write conventional ELF/Mach-O/MS-COFF object files.

Regards,

Marat

Scott Pakin

Sep 8, 2015, 4:13:12 PM9/8/15

to golang-nuts

On Friday, September 4, 2015 at 7:00:30 AM UTC-6, Klaus Post wrote:

> - New type(s)
>
> This is the biggest issue I see, but I currently cannot see any way to avoid having to put in a new type per register size (64, 128, 256, 512).
>
> On x86 they would represent MM/XMM/YMM/ZMM, on ARM they would represent other registers. This should act as the base types that M128, M128i, etc are created from. They don't need to have a name the user ever sees, so "_vector_register_64_bits_", or something similar that is very unlikely to cause collisions is perfectly fine.

Most intrinsics could be represented with an array ([16]float32, [8]float64, etc.), couldn't they?

> - Inlining
>
> It is a must that intrinsic functions are inlined. The compiler must recognize an intrinsic function and know how to inline them. The usual register assignment optimizations can of course still be applied.

It would seem to me that once the compiler provides a way to mark a foreign function call as "safe" (known stack-memory requirements, incapable of throwing an exception, etc.) and has the ability to inline such functions, the rest of your proposal would fall into place rather cleanly. That is, you or anyone else would be able to produce a 3rd party intrinsics package for your favorite architectures.

> * Would this help you?
> * Would you like to see it (eventually)?

That depends on the quality of the Go compiler's autovectorization support. If the compiler learns to autovectorize and does it well, I would have relatively little need for intrinsics: just the occasional thing that I know but the compiler doesn't. If autovectorization remains nonexistent or it fails to handle many common cases, intrinsics are a good stopgap.

— Scott

snes...@gmail.com

Sep 8, 2015, 5:28:16 PM9/8/15

to golang-nuts

The main thing is to give most of the power (benefit) without complicating the Go language or placing an unreasonable burden on Go programmers who need to delve deeper than most. Assembler fits those criteria; intrinsics, IMHO, don't.

Best of both worlds: functions written in assembler that can be inlined. That would make it possible to write a package that offers intrinsics-like functionality.

snes...@gmail.com

Sep 8, 2015, 5:28:34 PM9/8/15

to golang-nuts

Dan Eloff

Sep 8, 2015, 6:51:56 PM9/8/15

to snes...@gmail.com, golang-nuts

On Tue, Sep 8, 2015 at 4:28 PM <snes...@gmail.com> wrote:

> The main thing is give most of the power (benefit) without complicating the Go language or placing an unreasonable burden on Go programmers who need to delve deeper than most. Assembler fits that criteria, intrinsics, IMHO, don't.
>
> Best of both worlds : functions written in assembler that can be inlined. Would make it possible to write a package that offers intrinsics-like functionality.

I agree that inlineable assembly solves this problem. It also solves the problem of slow atomics and the overhead they introduce in lock-free algorithms, as you might find in the Go runtime implementation.

However, Russ Cox had this to say about inlineable assembly:

> Re #29 Everyone talks about inlining assembly routines, but it just doesn't work, unless you teach the compiler's optimizer about every possible assembly instruction. That's madness.

It comes from issue https://code.google.com/p/go/issues/detail?id=4947, which is near and dear to my heart.

I suspect this means we won't see a solution from the Go team anytime soon. The good news is that eventually optimizing Go further will cause the Go team to run up against these problems too, and then they will have to do something about it.

However, something worth noting is that GCCGO supports both inlineable assembly and intrinsics, in a roundabout way, according to Ian Lance Taylor:

> Answering specifically about gccgo. Gccgo is of course just a
> frontend to GCC. GCC can not inline functions written in pure
> assembly. However, GCC provides CPU-specific builtin functions
> usable in C/C++ for many things that people want to do (e.g., vector
> instructions) and it also provides a sophisticated asm expression as a
> C/C++ extension. This means that you can write your assembly code in
> extended C/C++ instead, and a function written that way can be
> inlined. It can even be inlined into Go code if you use LTO
> (link-time optimization, see GCC's -flto options).

So if you write a C function using inline asm and/or intrinsics, and call that from Go, and enable LTO, it can be inlined. Which is a nifty trick indeed.

The original thread is here: https://groups.google.com/forum/#!topic/golang-nuts/kGgkcOFCBtc

Scott Pakin

Sep 8, 2015, 7:51:53 PM9/8/15

to golang-nuts, snes...@gmail.com

On Tuesday, September 8, 2015 at 4:51:56 PM UTC-6, Daniel Eloff wrote:

> However, something worth noting is that GCCGO supports both inlineable assembly and intrinsics, in a roundabout way, according to Ian Lance Taylor:
>
>> Answering specifically about gccgo. Gccgo is of course just a
>> frontend to GCC. GCC can not inline functions written in pure
>> assembly. However, GCC provides CPU-specific builtin functions
>> usable in C/C++ for many things that people want to do (e.g., vector
>> instructions) and it also provides a sophisticated asm expression as a
>> C/C++ extension. This means that you can write your assembly code in
>> extended C/C++ instead, and a function written that way can be
>> inlined. It can even be inlined into Go code if you use LTO
>> (link-time optimization, see GCC's -flto options).

I don't see why the Go assembler couldn't support an asm-like construct in .s files. (I'm not proposing adding asm to Go; that's a separate argument.) That is, an assembly language function could specify some meta-information that specifies its inputs and outputs (e.g., "I need Go variable foo to be in a register, which I'll then call %1") and CPU registers that will be overwritten (e.g., "The caller had better save %xmm0 'cause I'm gonna clobber it"). Does anyone know if Go developers like Ian and Russ have ever considered an approach like that?

— Scott

Dan Eloff

Sep 8, 2015, 8:46:31 PM9/8/15

to Scott Pakin, golang-nuts, snes...@gmail.com

That would be a great solution: it gives the benefits of inline assembly without the costs of "teaching the compiler every assembly instruction". Granted, it's not that easy to use, but people have been doing that in C and C++ for a long time. If you're using .S files anyway, you're already in the realm of not easy. It would also allow the Go team to optimize critical assembly functions in the runtime so they can be inlined (e.g. atomics). I also think that, from an implementation perspective, it's the low-hanging fruit. There's even an example of prior art.

Ian Lance Taylor

Sep 9, 2015, 1:14:10 AM9/9/15

to Scott Pakin, golang-nuts, Erwin Driessens

This is approximately how asm statements work in GCC
(https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html). Experience
with them shows that it's very hard for developers to get them right.
It's not enough to say that a value should be in a register; you need
to say what kind of register it should be in. For example, on x86 you
need to indicate whether you want a regular register, an 80-bit
floating point register, a 32-bit or 64-bit SSE register, an XMM
register, an AVX YMM register, or whatever they come up with in the
future. If your arguments don't precisely match what the assembler
code does, the results are incomprehensible. I don't think it's a
good model for Go.

I do personally think that compiler intrinsics are a more viable
approach. I would imagine a compiler implemented package like
intrinsics/amd64 or something. Importing that package would give you
a set of amd64.XXX functions that were automatically inlined. These
could be modeled on Intel's ?mmintrin.h header files. Similar
implementations could be written for other processors as appropriate.
Then people who wanted to write efficient processor-specific code
could do so. And, of course, such a package could be a normal package
too, written mostly in assembler--it would not actually require
compiler intrinsics for functionality, only for speed. This is just
my own thought, though, not something anybody else has bought into.
It's also not going to happen any time soon, certainly not until after
the SSA work is integrated and stable on all targets.

Ian

Klaus Post

Sep 9, 2015, 10:58:05 AM9/9/15

to golang-nuts, scot...@pakin.org, snes...@gmail.com

@Marat Dukhan:

Yes - someone also pointed me to that on reddit. Mostly, I would consider that an assembler alternative. But the "Opcodes" collection is really great. I have only looked briefly at it, but the information in it is great, and maybe it could even be adjusted to generate the "rules" used by the compiler to generate code.

@Scott Pakin:

> Most intrinsics could be represented with an array ([16]float32, [8]float64, etc.), couldn't they?

Indeed, and this is how the "test" is implemented [1]. However it is only good for emulating intrinsics, since the compiler needs to know what register types they map to.

[1] https://github.com/klauspost/intrinsics/blob/master/x86/types.go
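As a rough illustration of the array-based emulation idea, here is a minimal sketch. The M128 type below is a hypothetical four-lane float32 view chosen for readability; it is not the definition from the linked types.go, and the function bodies are plain Go loops rather than anything a compiler would map to a register.

```go
package main

import "fmt"

// M128 here is a hypothetical stand-in for a 128-bit register viewed as
// four float32 lanes; it is not the definition used in the linked package.
type M128 [4]float32

// AddPs emulates a packed single-precision add: lane i of the result is
// a[i] + b[i].
func AddPs(a, b M128) M128 {
	var r M128
	for i := range r {
		r[i] = a[i] + b[i]
	}
	return r
}

// MulPs emulates a packed single-precision multiply, lane by lane.
func MulPs(a, b M128) M128 {
	var r M128
	for i := range r {
		r[i] = a[i] * b[i]
	}
	return r
}

func main() {
	a := M128{1, 2, 3, 4}
	b := M128{10, 20, 30, 40}
	fmt.Println(AddPs(a, b), MulPs(a, b))
}
```

This gives the right answers, but nothing tells the compiler that an M128 should live in an XMM register, which is the limitation described above.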

> That is, you or anyone else would be able to produce a 3rd party intrinsics package for your favorite architectures.

That could be a very reasonable goal, and it is why I was looking for feedback on how to make the compiler aware of the new instructions. We need it at compile time to handle immediate values, register allocation, etc.

> Best of both worlds : functions written in assembler that can be inlined. Would make it possible to write a package that offers intrinsics-like functionality.

Yes. But that would make the compiler (or probably the linker) much, much more complex, since it would need to "dissect" the assembler and rewrite the stack/registers used by the assembler part. Really complex stuff, even without considering flags, partial register writes, etc. For instance, if your calling function already uses XMM1, and your assembly does as well, you would have to move it out of the way before you could use the inlined code, or change the inlined function to use something else.

@Ian Lance Taylor:

> I would imagine a compiler implemented package like intrinsics/amd64 or something.

Yes. My intention with the currently generated packages was to show that this was feasible in the first place, and figure out just how much help we would need from the compiler.

I don't know if a simple comment hint like "//go:intrinsic paddb" could help. That would be a lot nicer than the GCC "__builtin_ia32_paddusb128(a, b)", and allow an "emulation" implementation which would act nicely as additional documentation.
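To sketch what that could look like: the "//go:intrinsic" directive below is purely hypothetical (no Go compiler recognizes it), and the M128i type and AddEpi8 name are assumptions for illustration. Today the function body is simply what runs, which is the "emulation doubles as documentation" part of the idea.

```go
package main

import "fmt"

// M128i is a hypothetical 128-bit integer vector type, modeled as 16 bytes.
type M128i [16]byte

// AddEpi8 adds the 16 byte lanes of a and b with wraparound, which is what
// PADDB does. The directive below is the hypothetical hint from the post;
// no Go compiler recognizes it, so today this body is simply what runs.
//
//go:intrinsic paddb
func AddEpi8(a, b M128i) M128i {
	var r M128i
	for i := range r {
		r[i] = a[i] + b[i] // uint8 arithmetic wraps modulo 256
	}
	return r
}

func main() {
	a := M128i{250, 1, 2}
	b := M128i{10, 2, 3}
	fmt.Println(AddEpi8(a, b))
}
```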

> It's also not going to happen any time soon [...]

That was never the intention. I know things like this take a *long* time to materialize, which is why I asked for feedback on the feasibility of the changes. I've kept an eye on the SSA-work, and it makes modifications so much easier. Once it lands I hope to try out implementing a small set of intrinsics and see how intrusive it is to the current code.

@Egon:

To give a more "fictive real-world" example, here is a direct conversion of some assembler I submitted as a CL. If you compare the intrinsic [2] to the assembler [3] version, I think it is pretty clear how much more readable intrinsics are. Add that it is memory-safe, and I think the advantage is pretty clear.

[2] https://gist.github.com/klauspost/64b36e9904d76d6fc122#file-crc32-intrin-go-L60

[3] https://go-review.googlesource.com/#/c/14080/7/src/hash/crc32/crc32_amd64.s

/Klaus

Egon

Sep 9, 2015, 3:08:19 PM9/9/15

to golang-nuts, scot...@pakin.org, snes...@gmail.com

On Wednesday, 9 September 2015 17:58:05 UTC+3, Klaus Post wrote:

> @Egon:
>
> To give more "fictive real-world" examples here is a direct conversion of some assembler I submitted for CL. If you compare the intrinsic [2] to the assembler [3] version I think it is pretty clear how much more readable intrinsics are. Add to that that it is memory-safe, I think the advantage is pretty clear.
>
> [2] https://gist.github.com/klauspost/64b36e9904d76d6fc122#file-crc32-intrin-go-L60
> [3] https://go-review.googlesource.com/#/c/14080/7/src/hash/crc32/crc32_amd64.s

Sure, I agree that intrinsics are safer and a little bit more readable. My main concern is whether there is a better approach than intrinsics. Having real-world examples means you have things to (over)-engineer and see how it will look with different approaches.

The main readability improvement comes from having temporary variables and for/if statements in the CRC32 case. Currently the intrinsics part still suffers from poor naming and manually unrolled parts that make code harder to read.

I also started wondering whether there are "more advanced assembly languages". I wrote a quick asm prototype by trying to reduce duplication [1]. I've experimented with #define-s as well [2] in another case. But I'm not satisfied with them. I'm guessing there are other projects that have tried to make a "more expressive assembler" (that isn't a high-level language), but a quick Google search didn't come up with anything useful.

[1] https://gist.github.com/egonelbre/64b7a4afac085530b48f

[2] https://github.com/egonelbre/exp/blob/master/sse/sse_amd64.s#L63

+ Egon

Nigel Tao

Sep 9, 2015, 8:50:33 PM9/9/15

to Egon, golang-nuts, Scott Pakin, Erwin Driessens

On Thu, Sep 10, 2015 at 5:08 AM, Egon <egon...@gmail.com> wrote:
> On Wednesday, 9 September 2015 17:58:05 UTC+3, Klaus Post wrote:
>> @Egon:
>>
>> To give more "fictive real-world" examples here is a direct conversion of
>> some assembler I submitted for CL. If you compare the intrinsic [2] to the
>> assembler [3] version I think it is pretty clear how much more readable
>> intrinsics are. Add to that that it is memory-safe, I think the advantage is
>> pretty clear.
>>
>> [2]
>> https://gist.github.com/klauspost/64b36e9904d76d6fc122#file-crc32-intrin-go-L60
>> [3]
>> https://go-review.googlesource.com/#/c/14080/7/src/hash/crc32/crc32_amd64.s
>
> Sure, I agree that intrinsics are safer and a little bit more readable. My
> main concern is whether there is a better approach than intrinsics.

Like Egon, I don't doubt that the intrinsics version is easier to read
than the asm version. My concern is whether the *Go* compiler should
be responsible for intrinsics, or whether another tool, along the
lines of Halide, be responsible for consuming the code in
crc32-intrin.go, and e.g. outputting a .s file.

For example, there is more than one Go compiler, and their number is
only growing. There was another golang-nuts thread just a few days ago
about a Go interpreter. Will adding intrinsics into the language make
it harder for other people to write Go interpreters?

You may be able to patch the gc compiler to emit great SSE code, but
I'm guessing you'd want your .go code to also work on other Go
compilers - ones that implement the Go language specification. You
could use build tags, I suppose, in your intrinsics/x86/sse2 package
to provide fallback implementations. But the overall fallback
performance might be better if, again, a Halide-like tool can take
your crc32-intrin.go code and output pure (but ugly, e.g.
loop-unrolled) Go. It could possibly optimize better if that tool is
focused exclusively on SSE-like computation instead of a general
purpose Go compiler making inlining and other decisions based on
heuristics that have to work across the whole spectrum of Go code.

This tool doesn't have to be as complicated as a general purpose
compiler. It could implement a significantly reduced subset of the Go
language, such as not allowing function or method calls (other than to
intrinsics like sse2.XorSi128). AFAICT your ieeeCLMUL function in
crc32-intrin.go falls into this subset.
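A fallback of the kind discussed above, for an intrinsic like sse2.XorSi128, might look roughly like this. It is a sketch only: the M128i type is a hypothetical 16-byte stand-in, and in a real package the file would be guarded by a build constraint such as `//go:build !amd64` (written `// +build !amd64` in the Go versions current at the time of this thread), with the fast version in a separate amd64-only file.

```go
// In a real package this file would carry a build constraint that excludes
// amd64, with an assembler-backed or compiler-recognized version living in
// a separate amd64-only file. The constraint is left out here so the sketch
// runs on any platform.
package main

import "fmt"

// M128i is a hypothetical 128-bit vector type, modeled as 16 bytes.
type M128i [16]byte

// XorSi128 is a pure-Go fallback for a 128-bit bitwise XOR.
func XorSi128(a, b M128i) M128i {
	var r M128i
	for i := range r {
		r[i] = a[i] ^ b[i]
	}
	return r
}

func main() {
	a := M128i{0xFF, 0x0F, 0xAA}
	b := M128i{0x0F, 0x0F, 0x55}
	fmt.Printf("% x\n", XorSi128(a, b))
}
```

Any compiler that implements the language spec can build this version; only the build-tagged sibling file would depend on a particular toolchain.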

That's all my personal opinion, though, and I don't speak for the Go
team on this.

Caleb Spare

Sep 9, 2015, 9:29:45 PM9/9/15

to Nigel Tao, golang-nuts

Hey Nigel,

I like your Halide-like language idea. However, even if you take your
suggested [subset of Go + intrinsics] language and compile .s files,
the functions in those files still can't be inlined, right? That seems
to be the core issue discussed in this thread requiring compiler
support.

While this approach could work for logic that can be divided into
simple, expensive work units (like calculating the CRC of a large
stream), it won't help with other cases we've been mentioning, where
you really want to use a few particular instructions in the middle of
some more complicated code.

To put it another way, I think your idea is a vision of a much more
convenient way of writing functions in .s files, but it only really
helps with use cases that could be accomplished with asm functions
today.

OTOH, perhaps most people's uses for intrinsics fit this description.

-Caleb

Nigel Tao

Sep 10, 2015, 12:49:47 AM9/10/15

to Caleb Spare, golang-nuts

On Thu, Sep 10, 2015 at 11:29 AM, Caleb Spare <ces...@gmail.com> wrote:
> To put it another way, I think your idea is a vision of a much more
> convenient way of writing functions in .s files, but it only really
> helps with use cases that could be accomplished with asm functions
> today.
>
> OTOH, perhaps most people's uses for intrinsics fit this description.

Well, yes, the examples I have in mind fit that description.

Do you have concrete examples of the contrary? I skimmed the mail
thread again but didn't see complete examples. (Sorry if I missed
them).

If the issue is that atomic.AddUint64 or BSF are function calls and
not inlineable, see Dan Eloff previously quoting Russ Cox on
inlineable assembly. What does the code around the BSF call look like?
Would a separate tool not work?
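For a concrete picture of the atomic.AddUint64 case, here is a minimal sketch of the kind of hot loop people have in mind. The function name countEvents and the worker counts are made up for illustration; whether the atomic call is inlined depends on the compiler and version.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countEvents has several goroutines bump one shared counter. The hot
// path is a single atomic.AddUint64 per iteration, so any per-call
// overhead is paid on every increment.
func countEvents(workers, perWorker int) uint64 {
	var counter uint64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddUint64(&counter, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadUint64(&counter)
}

func main() {
	fmt.Println(countEvents(4, 1000))
}
```

Here the interesting instruction sits inside ordinary Go control flow, so moving the whole loop into a separate .s file would not be a natural fit.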

james4k

Sep 13, 2015, 2:15:15 AM9/13/15

to golang-nuts, ces...@gmail.com

A language designed specifically for wide computation in Go could be very neat, but it kind of depends on what the language is like and if anyone has a strong vision for such a project. Intrinsics at least have a fairly clear path forward.
