Unsafe bool->byte conversion slower than if construct? Branching and

1,233 views
Skip to first unread message

Florian Uekermann

unread,
May 26, 2014, 4:13:20 PM5/26/14
to golan...@googlegroups.com
I recently stumbled over the discussion about possibly adding conversions from bool to various numerical types in go to avoid branches.
In my own code I solved this by using the unsafe package to convert bool to byte, but I noticed that this seems to be slower than just using an if statement in many cases.

So I wrote a very simple test case to investigate the behavior:

#####
package main
import "unsafe"
func main(){
s:=byte(0)
for i:=0;i<10000000000;i++{
if i<100 { s++ }
b:=i<100; s+=*(*byte)(unsafe.Pointer(&b))
}
}
#####

which includes minimal examples of equivalent branching and (supposedly) non-branching solutions inside the for loop (upper line branching, lower line no branching).
The corresponding assembly is the following (see end of message for the whole main):

--- branching part?
0011 (bool.go:8) CMPQ    AX,$100 //Comparison (first instruction of branching part)
0012 (bool.go:8) JGE     ,17 //Conditional jump to avoid increment instruction
0013 (bool.go:8) INCQ    ,CX //Increment
0014 (bool.go:8) JMP     ,17 //Unconditional jump to non-branching part (17)

--- non branching part?
0015 (bool.go:9) MOVB    $1,b+-1(SP) //Store true on stack
0016 (bool.go:9) JMP     ,20 //Unconditional jump to 
0017 (bool.go:9) CMPQ    AX,$100 //Comparison (first instruction of non-branching part)
0018 (bool.go:9) JLT     ,15 //Conditional jump to creation of true value
0019 (bool.go:9) MOVB    $0,b+-1(SP) //Store false on stack
0020 (bool.go:9) LEAQ    b+-1(SP),BX //Get pointer to address on stack where true/false is stored
0021 (bool.go:9) NOP     , //(?) Aligning jump target for end of loop? (RET is at 25 and target of 10) Or timing?
0022 (bool.go:9) MOVBQZX (BX),BX //Dereferencing pointer to true false
0023 (bool.go:9) ADDQ    BX,CX //Addition

I commented the assembly as best as I could while trying to understand it (this is the first time I ever dealt with any assembly, so even nitpicky corrections are very welcome, I couldn't even find some instructions on google and had to wildly guess).
To me it looks like there are two issues with the non-branching part.
The first is that the chain of referencing and dereferencing is not optimized (20-22) and could be replaced by "MOVB b+-1(SP), BX" (I made that up and it might be ridiculously wrong or invalid).
The second issue would be that the true and false value are created by jumping around. Is there no way get it directly by copying a flag (I read somewhere that CMP does a signed subtract and which stores stuff in flags but throws away the result). The flag we would want would  be ZF if I understood what CMP does correctly and there are two ways to do that (LAHF and something about pushing something onto the stack which I didn't understand at all). 
LAHF can't be used because we might be running on an early AMD64 or EMT64 CPU that doesn't have it? What about the other option?
If both of these are impractical, can't we just actually do a subtraction and a few shift and & operations to convert everything non-zero to 1, or am I overestimating how much jumps can cost? There must be some way to do a comparison and utilize the results somehow, right?

At least the first optimization issue (reference, dereference) could be interesting, since this kind of casting is probably the most popular usecase of the unsafe package (even the math package does it), or would it be appropriate to make a direct conversions of values of a certain length available in unsafe (unsafe.Value or Value8, Value16, Value32, Value64 which are convertible to values of the same length [similar to unsafe.Pointer]) and not optimize this?

As I mentioned this is the first time I saw assembly in my life, so have mercy. I am fully expecting to be wrong about most stuff I wrote.

Best,
Florian

--- prog list "main" ---
0000 (bool.go:5) TEXT    main+0(SB),$8-0
0001 (bool.go:5) FUNCDATA $0,gcargs·0+0(SB)
0002 (bool.go:5) FUNCDATA $1,gclocals·0+0(SB)
0003 (bool.go:5) TYPE    b+-1(SP){bool},$1
0004 (bool.go:6) MOVQ    $0,CX
0005 (bool.go:7) MOVQ    $0,AX
0006 (bool.go:7) JMP     ,8
0007 (bool.go:7) INCQ    ,AX
0008 (bool.go:7) MOVQ    $10000000000,BP
0009 (bool.go:7) CMPQ    AX,BP
0010 (bool.go:7) JGE     $0,25
0011 (bool.go:8) CMPQ    AX,$100
0012 (bool.go:8) JGE     ,17
0013 (bool.go:8) INCQ    ,CX
0014 (bool.go:8) JMP     ,17
0015 (bool.go:9) MOVB    $1,b+-1(SP)
0016 (bool.go:9) JMP     ,20
0017 (bool.go:9) CMPQ    AX,$100
0018 (bool.go:9) JLT     ,15
0019 (bool.go:9) MOVB    $0,b+-1(SP)
0020 (bool.go:9) LEAQ    b+-1(SP),BX
0021 (bool.go:9) NOP     ,
0022 (bool.go:9) MOVBQZX (BX),BX
0023 (bool.go:9) ADDQ    BX,CX
0024 (bool.go:7) JMP     ,7
0025 (bool.go:11) RET     ,

Taru Karttunen

unread,
May 26, 2014, 6:44:09 PM5/26/14
to Florian Uekermann, golan...@googlegroups.com
On 26.05 13:13, Florian Uekermann wrote:
> I recently stumbled over the discussion about possibly adding conversions
> from bool to various numerical types in go to avoid branches.
> In my own code I solved this by using the unsafe package to convert bool to
> byte, but I noticed that this seems to be slower than just using an if
> statement in many cases.
>
> So I wrote a very simple test case to investigate the behavior:
>
> #####
> package main
> import "unsafe"
> func main(){
> s:=byte(0)
> for i:=0;i<10000000000;i++{
> if i<100 { s++ }
> b:=i<100; s+=*(*byte)(unsafe.Pointer(&b))
> }
> }

Branching can be faster as in this example branch prediction in the
CPU is almost always right. Also the unsafe.Pointer way will probably
generate quite a lot of non-optimal assembly.

- Taru Karttunen

Kevin Gillette

unread,
May 26, 2014, 10:20:07 PM5/26/14
to golan...@googlegroups.com
On Monday, May 26, 2014 2:13:20 PM UTC-6, Florian Uekermann wrote:
I recently stumbled over the discussion about possibly adding conversions from bool to various numerical types in go to avoid branches.

What discussions? I'd be very surprised if any of the seasoned gophers are pushing for that. 

Dan Kortschak

unread,
May 26, 2014, 10:23:03 PM5/26/14
to Kevin Gillette, golan...@googlegroups.com
He me be referring to a discussion I started about a month ago about
ways of using unsafe etc to do this kind of thing - was asking about the
degree of unsafety, not for adding features though.

Florian Uekermann

unread,
May 27, 2014, 5:38:59 AM5/27/14
to golan...@googlegroups.com
Branching can be faster as in this example branch prediction in the 
CPU is almost always right. Also the unsafe.Pointer way will probably 
generate quite a lot of non-optimal assembly. 
- Taru Karttunen 

Yes, but this is only a stupid example which is close to optimal for the branch predictor.
I read somewhere that you can expect penalties of well over 10 cycles on recent CPUs
for a misprediction. I am fairly certain that for example binary searches (one of my original
testcases) generate about 50% mispredictions.

And if you look closely at the assembly you will notice that the case I assumed to be non-
branching actually also contains a conditional jump, which is one of the two reasons why I
asked for clarification about possible solutions on this list.

And you may be correct that the unsafe.Pointer creates suboptimal assembly. I am wondering
if this could theoretically be solved by the compiler.

Best regards,
Florian

Florian Uekermann

unread,
May 27, 2014, 5:54:22 AM5/27/14
to golan...@googlegroups.com, Kevin Gillette
I was referring to one of the requests to add int(x==y) or float64(x==y) etc. to the language,
which comes up every now and then because people want to use the result as coefficient
in a some calculation (which is a nice way to save a few cycles). There is even an issue
on the tracker about it (6011).

Florian Uekermann

unread,
May 27, 2014, 5:56:07 AM5/27/14
to golan...@googlegroups.com
I just leaned about the existence of CMOV instructions, which could be a possibility to
get rid of the jump, or am I misunderstanding how this works.
Reply all
Reply to author
Forward
0 new messages