WebAssembly: Converting conditional branches to predicated execution and loop unrolling

27 views
Skip to first unread message

Immanuel Haffner

unread,
Apr 20, 2020, 4:55:27 AM4/20/20
to v8-dev
Hi guys,

is TurboFan able to eliminate conditional branches by using some means of predication? I have a particular example in mind. Consider the following C++ code snippet:
std::size_t count = 0;
for (int32_t *p = begin; p != end; ++p) {
    if (*p < 42)
        ++count;
}

A simple loop that counts the number of values less than 42. I compiled this to the following WebAssembly code:
(loop $filter_i32
 (block $filter_i32.body
  (if
   (i32.lt_s
    (i32.load
     (local.get $5)
    )
    (i32.const 42)
   )
   (block $filter.accept
    (local.set $3
     (i32.add
      (local.get $3)
      (i32.const 1)
     )
    )
   )
  )
  (local.set $5
   (i32.add
    (local.get $5)
    (i32.const 4)
   )
  )
  (local.set $4
   (i32.add
    (local.get $4)
    (i32.const 1)
   )
  )
  (br_if $filter_i32
   (i32.lt_u
    (local.get $4)
    (global.get $size)
   )
  )
 )
)

$5 is  the address of the next i32 value, $3 is the count of values less than 42, and $4 is the induction variable of the loop and used in the loop header.

When I execute this WASM code in V8 using TurboFan (Liftoff is disabled) and let V8 print the produced assembly I get the following code for the loop:
0x2bdc0466b2f0    30  83c704         addl rdi,0x4
0x2bdc0466b2f3    33  4c8b5e23       REX.W movq r11,[rsi+0x23]
0x2bdc0466b2f7    37  493b23         REX.W cmpq rsp,[r11]
0x2bdc0466b2fa    3a  0f862d000000   jna 0x2bdc0466b32d  <+0x6d>
0x2bdc0466b300    40  448bdf         movl r11,rdi
0x2bdc0466b303    43  4c3bda         REX.W cmpq r11,rdx
0x2bdc0466b306    46  0f835c000000   jnc 0x2bdc0466b368  <+0xa8>
0x2bdc0466b30c    4c  42833c1b2a     cmpl [rbx+r11*1],0x2a
0x2bdc0466b311    51  0f8d04000000   jge 0x2bdc0466b31b  <+0x5b>
0x2bdc0466b317    57  4183c101       addl r9,0x1
0x2bdc0466b31b    5b  4183c001       addl r8,0x1
0x2bdc0466b31f    5f  44394108       cmpl [rcx+0x8],r8
0x2bdc0466b323    63  77cb           ja 0x2bdc0466b2f0  <+0x30>

The jump in 0x3a implements the loop header. The jump in 0x46 is V8's oob check. The jump in 0x51 implements the if-statement. If I compile the above C++ code with clang -O2 I get the following code for the loop:
.LBB0_8:                                # =>This Inner Loop Header: Depth=1
xorl %esi, %esi
cmpl %ebp, (%rcx)
setl %sil
addq %rbx, %rsi
xorl %edi, %edi
cmpl %ebp, 4(%rcx)
setl %dil
addq %rsi, %rdi
xorl %esi, %esi
cmpl %ebp, 8(%rcx)
setl %sil
addq %rdi, %rsi
xorl %ebx, %ebx
cmpl %ebp, 12(%rcx)
setl %bl
addq %rsi, %rbx
addq $16, %rcx
addq $-4, %rdx
jne .LBB0_8

The loop has been unrolled 4 times. (I omitted the code that covers the remainder of size % 4.) Further, the conditional branch has been eliminated and replaced by `setl` and `addq`, which is effectively an optimized form of predication.

When I compare the performance of clang's code to that of TurboFan, clang is around 10x faster. My question is: what can I do to improve the performance of that loop? Is loop unrolling or conversion of conditional branches supported in TurboFan? I must add that using SIMD is not possible, as the body of the if-statement inside the loop can be anything and enforcing SIMD is not always possible.

Regards,
Immanuel

Clemens Backes

unread,
Apr 22, 2020, 7:48:05 AM4/22/20
to v8-dev
Hi Immanuel,

I don't know about the exact set of optimizations that TurboFan implements, but I wouldn't be surprised if clang produces significantly better code for such small and hot loops. 10x is a surprisingly big difference though. Did you profile where all that time is spent? Is it the branch for the if statement mostly? That should be the only unpredictable one.

Two things you could try to improve the numbers (both wouldn't remove the branch for the if though):
1) You can eliminate the OOB check by enabling trap handlers (see EnableWebAssemblyTrapHandler in the API). They will be implemented via signal handling instead then, which only works correctly if the memory was allocated properly for WebAssembly. We previously discussed a hack where to you map your own memory for wasm, in that case the signal handlers would not trigger.
1b) If you can't enable trap handlers, and you fully trust your Wasm code (i.e. you know that it will never access memory out of bounds), you can disable bounds checks via "--no-wasm-bounds-checks". It should be clear that this is not an option if the Wasm code is user-provided.
2) You could try passing "--wasm-opt" which enables some more TurboFan optimizations. The exact set of optimizations is not fine-tuned yet, since the option is off by default, but maybe it improves the code for your example.

Cheers,
Clemens


--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/7b47cbac-f020-4af3-81f7-132e04e3dd8d%40googlegroups.com.


--

Clemens Backes

Software Engineer

clem...@google.com

Google Germany GmbH

Erika-Mann-Straße 33

80636 München


Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg


Diese E-Mail ist vertraulich. Falls sie diese fälschlicherweise erhalten haben sollten, leiten Sie diese bitte nicht an jemand anderes weiter, löschen Sie alle Kopien und Anhänge davon und lassen Sie mich bitte wissen, dass die E-Mail an die falsche Person gesendet wurde.


This e-mail is confidential. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it has gone to the wrong person.

Immanuel Haffner

unread,
Apr 22, 2020, 8:11:19 AM4/22/20
to v8-dev
Hi Clemens,

in the example, the conditional branch is 50/50 true/false (per iteration) and hence a worst-case for branch prediction. Therefore, it seams reasonable that the machine code with a conditional branch is so much slower than the code with predication.

Since we produce the WASM code ourselves we can fully trust it. So I disabled boundary checks with --no-wasm-bounds-checks and the check in the hot loop disappeard but this had no performance impact. As you already said, this branch is perfectly predictable.

If there is an optimization pass that can perform predication, how would I instruct TurboFan to run it? In fact, in our system we know whether a branch will be predictable. So it would be great if we could opt-in to an optimization pass to perform predication if we deem it worthwhile. Directly generating predicated code is hypothetically possible, but a considerable development effort on our side and has the potential for code duplication. (This needs further investigation on my side.)

Thanks a lot, Clemens :)
Regards, Immanuel

Clemens Backes

unread,
Apr 23, 2020, 3:30:30 AM4/23/20
to v8-dev
Hi Immanuel,

I talked to our TurboFan experts. As suspected, TurboFan currently does not do any predication. In fact, the backend is lacking support for conditional moves, so it would be a bigger undertaking to add such an optimization. Also, other experiments have shown that branches often perform even better than conditional moves, as in most cases, branches are pretty predictable (see the pointer compression blog post).

Also loop unrolling is an optimization we don't currently support. We do some loop peeling, but that does not help in your case.
Loop unrolling of course would be an optimization you could do on the wasm code you generate. But without conditional moves, that probably does not give a significant gain.

So I am afraid there is no way currently to make that loop significantly faster in TurboFan.

Cheers,
Clemens


--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.

Immanuel Haffner

unread,
Apr 23, 2020, 6:36:26 AM4/23/20
to v8-dev
Hi Clemens,

thanks for the information. Unrolling would definitely be possible on our side. I thought about predication, and that can also be done on our side. However, if i understand this correctly, TurboFan does not produce any CMOVcc instructions? I know that clang likes to compile things like a += b * true_or_false to branchfree code with CMOVcc. So a developer can perform "predication" via multiplication. (I really don't like it, but it works...) How could I do something similar to force TurboFan to produce branchfree and efficient code? I mean, using a multiplication for the purpose of predication is not efficient; it only helps when the compiler replaces it by a conditional move.

Regards,
Immanuel

Clemens Backes

unread,
Apr 23, 2020, 7:04:45 AM4/23/20
to v8-dev
Hi Immanuel,

as I wrote, TurboFan does not support conditional moves in the backend currently, so any kind of predication will always be compiled to control flow. This is independent of the input you provide.

Cheers,
Clemens


--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages