WebAssembly: Performance issue with conditional branch on < 0 - 200x slowdown


Immanuel Haffner

Apr 7, 2020, 6:18:49 AM
to v8-dev
Hi all,

I have the following WebAssembly code snippet to scan through an array of i32 values and count the number of values less than 0 (zero).

(loop $scan_i32
 (block $scan_i32.body
  (if
   (i32.lt_s
    (i32.load
     (i32.add
      (global.get $data)
      (i32.mul
       (local.get $4)
       (i32.const 44)
      )
     )
    )
    (i32.const 0)
   )
   (block $filter.accept
    (local.set $3
     (i32.add
      (local.get $3)
      (i32.const 1)
     )
    )
   )
  )
  (local.set $4
   (i32.add
    (local.get $4)
    (i32.const 1)
   )
  )
  (br_if $scan_i32
   (i32.lt_u
    (local.get $4)
    (global.get $size)
   )
  )
 )
)

The local $4 is the induction variable of the loop and counts from 0 to $size. The local $3 counts the number of values less than zero. $data holds the location of the i32 array.
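
To make the loop's semantics concrete, here is a minimal reference model in Python. It is only a sketch: the names `count_less_than`, `memory`, and `threshold` are illustrative assumptions and not part of the original module. It assumes a little-endian linear memory and ignores 32-bit address wrap-around. Because the index is multiplied by 44, consecutive loads are 44 bytes apart, and because the condition is checked at the bottom of the loop, the body runs at least once.

```python
import struct

ROW_STRIDE = 44  # bytes between consecutive loads, from (i32.const 44) in the snippet


def count_less_than(memory, data, size, threshold=0):
    """Mirror the Wasm loop: load an i32 at data + i*44 and count values < threshold."""
    count = 0  # plays the role of local $3
    i = 0      # plays the role of local $4
    while True:
        # Signed 32-bit little-endian load, like i32.load followed by i32.lt_s.
        (value,) = struct.unpack_from("<i", memory, data + i * ROW_STRIDE)
        if value < threshold:
            count += 1
        i += 1
        # br_if jumps back only while i < size, so the test sits at the bottom.
        if not i < size:
            break
    return count
```

Such a model can serve as a correctness check for a reproducer, for example by comparing its count with the value returned by the compiled function on the same buffer.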

The i32 array contains 1 million (1e6) values chosen uniformly at random from the entire i32 range (-2^31 to 2^31-1).

The condition < 0 holds for half the values, i.e. 50%. When I execute this snippet on an AMD Ryzen Threadripper 1900X CPU (on one of our servers), it takes ~2000 ms. Using a different constant in the condition, e.g. < 1, results in a drastically different running time of ~9 ms. That is a difference of roughly 200x! I performed the same experiment on my notebook (Intel Skylake) and desktop (Intel Broadwell), and there I get ~10 ms for both < 0 and < 1 (so no weird performance loss on less than zero).

Nearly the same number of values satisfy < 1 as < 0, so I rule out branch misprediction as the root of the problem. (Using much larger constants, such as < 2^30, also results in running times of around 10 ms.)
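
The claim that < 0 and < 1 select nearly the same share of values can be checked with exact arithmetic. The sketch below assumes values drawn uniformly from the full signed 32-bit range; the helper name `selectivity` is made up for illustration.

```python
from fractions import Fraction

I32_MIN = -2**31     # smallest signed 32-bit value
I32_COUNT = 2**32    # number of distinct i32 values


def selectivity(threshold):
    """Exact fraction of all i32 values v that satisfy v < threshold (signed)."""
    matching = min(max(threshold - I32_MIN, 0), I32_COUNT)
    return Fraction(matching, I32_COUNT)


if __name__ == "__main__":
    for c in (0, 1, 2**30):
        print(c, float(selectivity(c)))
```

The selectivities for thresholds 0 and 1 differ by exactly 1/2^32, so among 1e6 values the expected difference in matches is far below one element.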

How can I further investigate that issue and how can I help you in reproducing and resolving it?

Regards,
Immanuel

Immanuel Haffner

Apr 7, 2020, 6:30:44 AM
to v8-dev
I added a chart to show the performance for different selectivities. Except for 0.5, the times are all around 10 ms.
[Attachment: times_i32.png]

Immanuel Haffner

Apr 7, 2020, 6:31:38 AM
to v8-dev
And I should add that I don't see that behaviour for i64, float, or double.

Clemens Backes

Apr 7, 2020, 7:20:32 AM
to v8-dev
Hi Immanuel,

this sounds like we should look into the generated code and look for missed optimizations. Can you open a bug about this via crbug.com/v8/new, and attach a reproducer? Ideally you could extract a small reproducer which contains the Wasm module and a JS snippet that shows the performance difference (you can read the wasm module via the `readbuffer` function in d8).

If you want to start investigating yourself, then a first step would be printing the generated machine code and looking at the disassembly. That requires a build with the "v8_enable_disassembler" gn arg set to "true" (default in a debug build). You can then run the reproducer with --print-wasm-code. To avoid also printing (and executing) the Liftoff code, which would not be interesting for peak performance, you can pass --no-liftoff. And to avoid concurrent compilation, which can mess up the output, you can pass --predictable.

Cheers,
Clemens


--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/53ea45d3-0c46-4a02-8d72-6e0bb14cf9f4%40googlegroups.com.


--

Clemens Backes

Software Engineer

clem...@google.com

Google Germany GmbH


Jakob Kummerow

Apr 7, 2020, 7:21:19 AM
to v8-...@googlegroups.com
Looks like a case of one particular CPU architecture experiencing surprising slowdowns for some particular instruction or pattern. Depending on how widespread the situation is, and on availability+cost of software-side workarounds, it's not immediately clear whether we want to do anything about it on the V8 side.

To narrow it down further:
- you could inspect the generated machine code (--print-wasm-code should do the trick) and/or use low-level profiling (such as Linux's perf tool) to figure out what exactly the trigger is. If the hypothesis is right that it's some particular pattern, then it should be possible to write a small standalone C/C++ program that exhibits the same behavior (godbolt.org is very helpful for crafting exactly the right code for such things).
- if you can't do that work yourself, then you could significantly help anyone trying to help you by providing a complete example (i.e., complete program text plus instructions on how to compile/run it, detailed and specific enough that one can simply copy/paste them to reproduce what you're seeing).

It might be possible for us to include a small compiler-side change to avoid some pattern on some hardware -- if there is a feasible alternative (e.g. we obviously can't just replace "< 0" comparisons with "< 1" comparisons).
It might also be the case that the right fix is a microcode update, or to buy new hardware, or to hope that real applications won't experience the same impact as a microbenchmark. Only learning more about the issue will tell. Escalating this to AMD, once you have a small self-contained example, definitely makes sense.



Immanuel Haffner

Apr 7, 2020, 7:43:10 AM
to v8-dev
Thanks for the quick responses.

@Clemens What does such a reproducer look like? The thing is that I use V8 embedded, with an underlying memory shared by the host and the WebAssembly module. How/where would I provide the --print-wasm-code flag? I use Binaryen for WASM codegen and I can easily produce the compiled WASM code as an array of bytes. I think this is suitable for instantiating the module from JS.

@Jakob Since this is part of a closed-source project, I must see how much information I can share. I will definitely try to provide a MWE.

Once I know what a reproducer is and how to write one, I will gladly file a bug report on the site Clemens mentioned. I will also share my findings here, in particular the compiled assembly once I have it.

Regards,
Immanuel

Clemens Backes

Apr 7, 2020, 7:58:29 AM
to v8-dev
You can change V8 flags via the public API: V8::SetFlagsFromString("--print-wasm-code --predictable --no-liftoff").

A Binaryen-produced Wasm module is a good start. I would guess that the shared memory does not make a difference to the performance, so maybe you can just instantiate your Wasm module from JS and call the respective methods?


Immanuel Haffner

Apr 8, 2020, 3:50:58 AM
to v8-dev
@Clemens I added the flags and reran the experiment. Now I get this output:

--- WebAssembly code ---
index: 0
kind: wasm function
compiler: TurboFan
Body (size = 192 = 178 + 14 padding)
--- End code ---

Doesn't seem really helpful. What am I missing?

Immanuel Haffner

Apr 8, 2020, 3:57:33 AM
to v8-dev
My bad. I did this in a release build without `v8_enable_disassembler`.  I will rebuild V8 with the flag set and rerun the experiment.

Immanuel Haffner

Apr 8, 2020, 5:46:35 AM
to v8-dev
Ok, I have added the flags Clemens suggested and got the disassembly that TurboFan produced. I will append it for both < 0 and < 1 below. However, the slowdown for < 0 disappeared! Maybe this is related to Liftoff? I will recompile V8 with Liftoff enabled and rerun the experiment once more.

Disassembly for < 0 on AMD Ryzen Threadripper 1900X:
--- WebAssembly code ---
index: 0
kind: wasm function
compiler: TurboFan
Body (size = 192 = 178 + 14 padding)
Instructions (size = 156)
0x39bc600062c0     0  55             push rbp
0x39bc600062c1     1  4889e5         REX.W movq rbp,rsp
0x39bc600062c4     4  6a0a           push 0xa
0x39bc600062c6     6  56             push rsi
0x39bc600062c7     7  4883ec28       REX.W subq rsp,0x28
0x39bc600062cb     b  488b5e4f       REX.W movq rbx,[rsi+0x4f]
0x39bc600062cf     f  488b560b       REX.W movq rdx,[rsi+0xb]
0x39bc600062d3    13  488b4e13       REX.W movq rcx,[rsi+0x13]
0x39bc600062d7    17  4883e903       REX.W subq rcx,0x3
0x39bc600062db    1b  33ff           xorl rdi,rdi
0x39bc600062dd    1d  4c8bc7         REX.W movq r8,rdi
0x39bc600062e0    20  4c8b4e23       REX.W movq r9,[rsi+0x23]
0x39bc600062e4    24  493b21         REX.W cmpq rsp,[r9]
0x39bc600062e7    27  0f8633000000   jna 0x39bc60006320  <+0x60>
0x39bc600062ed    2d  446bcf2c       imull r9,rdi,0x2c
0x39bc600062f1    31  448b5b0c       movl r11,[rbx+0xc]
0x39bc600062f5    35  4503cb         addl r9,r11
0x39bc600062f8    38  4c3bc9         REX.W cmpq r9,rcx
0x39bc600062fb    3b  0f8352000000   jnc 0x39bc60006353  <+0x93>
0x39bc60006301    41  42833c0a00     cmpl [rdx+r9*1],0x0
0x39bc60006306    46  0f8d04000000   jge 0x39bc60006310  <+0x50>
0x39bc6000630c    4c  4183c001       addl r8,0x1
0x39bc60006310    50  83c701         addl rdi,0x1
0x39bc60006313    53  397b08         cmpl [rbx+0x8],rdi
0x39bc60006316    56  77c8           ja 0x39bc600062e0  <+0x20>
0x39bc60006318    58  498bc0         REX.W movq rax,r8
0x39bc6000631b    5b  488be5         REX.W movq rsp,rbp
0x39bc6000631e    5e  5d             pop rbp
0x39bc6000631f    5f  c3             retl
0x39bc60006320    60  48895de8       REX.W movq [rbp-0x18],rbx
0x39bc60006324    64  48897de0       REX.W movq [rbp-0x20],rdi
0x39bc60006328    68  4c8945d8       REX.W movq [rbp-0x28],r8
0x39bc6000632c    6c  488955d0       REX.W movq [rbp-0x30],rdx
0x39bc60006330    70  48894dc8       REX.W movq [rbp-0x38],rcx
0x39bc60006334    74  e887feffff     call 0x39bc600061c0     ;; wasm stub: WasmStackGuard
0x39bc60006339    79  488b5de8       REX.W movq rbx,[rbp-0x18]
0x39bc6000633d    7d  488b7de0       REX.W movq rdi,[rbp-0x20]
0x39bc60006341    81  4c8b45d8       REX.W movq r8,[rbp-0x28]
0x39bc60006345    85  488b55d0       REX.W movq rdx,[rbp-0x30]
0x39bc60006349    89  488b4dc8       REX.W movq rcx,[rbp-0x38]
0x39bc6000634d    8d  488b75f0       REX.W movq rsi,[rbp-0x10]
0x39bc60006351    91  eb9a           jmp 0x39bc600062ed  <+0x2d>
0x39bc60006353    93  e8f8fcffff     call 0x39bc60006050     ;; wasm stub: ThrowWasmTrapMemOutOfBounds
0x39bc60006358    98  90             nop
0x39bc60006359    99  0f1f00         nop

Source positions:
 pc offset  position
        60        13
        93        23

Safepoints (size = 22)
0x39bd600062bfffffffff  000000000 (sp -> fp)

RelocInfo (size = 4)
0x39bc60006335  wasm stub call
0x39bc60006354  wasm stub call

--- End code ---

Disassembly for < 1 on AMD Ryzen Threadripper 1900X:
--- WebAssembly code ---
index: 0
kind: wasm function
compiler: TurboFan
Body (size = 192 = 178 + 14 padding)
Instructions (size = 156)
0x18d7ed9202c0     0  55             push rbp
0x18d7ed9202c1     1  4889e5         REX.W movq rbp,rsp
0x18d7ed9202c4     4  6a0a           push 0xa
0x18d7ed9202c6     6  56             push rsi
0x18d7ed9202c7     7  4883ec28       REX.W subq rsp,0x28
0x18d7ed9202cb     b  488b5e4f       REX.W movq rbx,[rsi+0x4f]
0x18d7ed9202cf     f  488b560b       REX.W movq rdx,[rsi+0xb]
0x18d7ed9202d3    13  488b4e13       REX.W movq rcx,[rsi+0x13]
0x18d7ed9202d7    17  4883e903       REX.W subq rcx,0x3
0x18d7ed9202db    1b  33ff           xorl rdi,rdi
0x18d7ed9202dd    1d  4c8bc7         REX.W movq r8,rdi
0x18d7ed9202e0    20  4c8b4e23       REX.W movq r9,[rsi+0x23]
0x18d7ed9202e4    24  493b21         REX.W cmpq rsp,[r9]
0x18d7ed9202e7    27  0f8633000000   jna 0x18d7ed920320  <+0x60>
0x18d7ed9202ed    2d  446bcf2c       imull r9,rdi,0x2c
0x18d7ed9202f1    31  448b5b0c       movl r11,[rbx+0xc]
0x18d7ed9202f5    35  4503cb         addl r9,r11
0x18d7ed9202f8    38  4c3bc9         REX.W cmpq r9,rcx
0x18d7ed9202fb    3b  0f8352000000   jnc 0x18d7ed920353  <+0x93>
0x18d7ed920301    41  42833c0a01     cmpl [rdx+r9*1],0x1
0x18d7ed920306    46  0f8d04000000   jge 0x18d7ed920310  <+0x50>
0x18d7ed92030c    4c  4183c001       addl r8,0x1
0x18d7ed920310    50  83c701         addl rdi,0x1
0x18d7ed920313    53  397b08         cmpl [rbx+0x8],rdi
0x18d7ed920316    56  77c8           ja 0x18d7ed9202e0  <+0x20>
0x18d7ed920318    58  498bc0         REX.W movq rax,r8
0x18d7ed92031b    5b  488be5         REX.W movq rsp,rbp
0x18d7ed92031e    5e  5d             pop rbp
0x18d7ed92031f    5f  c3             retl
0x18d7ed920320    60  48895de8       REX.W movq [rbp-0x18],rbx
0x18d7ed920324    64  48897de0       REX.W movq [rbp-0x20],rdi
0x18d7ed920328    68  4c8945d8       REX.W movq [rbp-0x28],r8
0x18d7ed92032c    6c  488955d0       REX.W movq [rbp-0x30],rdx
0x18d7ed920330    70  48894dc8       REX.W movq [rbp-0x38],rcx
0x18d7ed920334    74  e887feffff     call 0x18d7ed9201c0     ;; wasm stub: WasmStackGuard
0x18d7ed920339    79  488b5de8       REX.W movq rbx,[rbp-0x18]
0x18d7ed92033d    7d  488b7de0       REX.W movq rdi,[rbp-0x20]
0x18d7ed920341    81  4c8b45d8       REX.W movq r8,[rbp-0x28]
0x18d7ed920345    85  488b55d0       REX.W movq rdx,[rbp-0x30]
0x18d7ed920349    89  488b4dc8       REX.W movq rcx,[rbp-0x38]
0x18d7ed92034d    8d  488b75f0       REX.W movq rsi,[rbp-0x10]
0x18d7ed920351    91  eb9a           jmp 0x18d7ed9202ed  <+0x2d>
0x18d7ed920353    93  e8f8fcffff     call 0x18d7ed920050     ;; wasm stub: ThrowWasmTrapMemOutOfBounds
0x18d7ed920358    98  90             nop
0x18d7ed920359    99  0f1f00         nop

Source positions:
 pc offset  position
        60        13
        93        23

Safepoints (size = 22)
0x18d8ed9202bfffffffff  000000000 (sp -> fp)

RelocInfo (size = 4)
0x18d7ed920335  wasm stub call
0x18d7ed920354  wasm stub call

--- End code ---

If you do a diff and ignore the addresses, you can see that the only difference is the immediate in the comparison: cmpl [rdx+r9*1],0x0 versus cmpl [rdx+r9*1],0x1.
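
For anyone who wants to repeat that comparison, an address-insensitive diff could be sketched as follows. The helper names `normalize` and `differing_lines` are hypothetical, and the sketch assumes both listings contain the same number of instruction lines in the `--print-wasm-code` layout shown above (address, pc offset, bytes, instruction).

```python
import re

# Absolute code addresses, e.g. the target printed after a jump or call.
ADDRESS = re.compile(r"0x[0-9a-f]{8,}")


def normalize(line):
    """Drop the leading address column and mask absolute addresses in operands."""
    parts = line.split(None, 1)
    rest = parts[1] if len(parts) == 2 else ""
    return ADDRESS.sub("ADDR", " ".join(rest.split()))


def differing_lines(listing_a, listing_b):
    """Return pairs of normalized lines that differ between two equally long listings."""
    a = [normalize(l) for l in listing_a.splitlines() if l.strip()]
    b = [normalize(l) for l in listing_b.splitlines() if l.strip()]
    if len(a) != len(b):
        raise ValueError("listings have different lengths; use a real diff tool instead")
    return [(x, y) for x, y in zip(a, b) if x != y]
```

Small immediates such as 0x28 are left untouched because the pattern only matches hex literals with at least eight digits.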

I have two configurations of V8 on the server now, one default release build and one release build with the flags --print-wasm-code --predictable --no-liftoff. Running the experiment in the former takes about 2000 ms, while running it in the latter takes 11 ms.

Clemens Backes

Apr 8, 2020, 5:58:09 AM
to v8-dev
That's interesting data. Do you only run that function once? In that case, it can indeed happen that TurboFan code is not ready when you enter the function, so you end up running Liftoff exclusively. This should not depend on that one constant though.

Also, being 200x slower would be an extreme case for Liftoff, and it would still be interesting to compare the generated code and check for optimization potential.


Immanuel Haffner

Apr 8, 2020, 6:17:37 AM
to v8-dev
Yes, I run the function only once. In fact, the module is JIT-compiled and the function is called once.

I did some more experimenting and the problem is related to either `--predictable` or `--no-liftoff`. Here are the times I get with different configurations:
  • `--predictable` : 13 ms
  • `--no-liftoff` : 11 ms
  • `--predictable --no-liftoff` : 11 ms
  • default (no additional flags) : 2158 ms
The disassembly in the default configuration is:

Liftoff:
0x3835c41b52c0     0  55             push rbp
0x3835c41b52c1     1  4889e5         REX.W movq rbp,rsp
0x3835c41b52c4     4  6a0a           push 0xa
0x3835c41b52c6     6  4881ec28000000 REX.W subq rsp,0x28
0x3835c41b52cd     d  488975f0       REX.W movq [rbp-0x10],rsi
0x3835c41b52d1    11  48c745dc00000000 REX.W movq [rbp-0x24],0x0
0x3835c41b52d9    19  48c745e400000000 REX.W movq [rbp-0x1c],0x0
0x3835c41b52e1    21  488b4df0       REX.W movq rcx,[rbp-0x10]
0x3835c41b52e5    25  488b4923       REX.W movq rcx,[rcx+0x23]
0x3835c41b52e9    29  483b21         REX.W cmpq rsp,[rcx]
0x3835c41b52ec    2c  0f86bb000000   jna 0x3835c41b53ad  <+0xed>
0x3835c41b52f2    32  488b4df0       REX.W movq rcx,[rbp-0x10]
0x3835c41b52f6    36  488b494f       REX.W movq rcx,[rcx+0x4f]
0x3835c41b52fa    3a  8b11           movl rdx,[rcx]
0x3835c41b52fc    3c  488b4df0       REX.W movq rcx,[rbp-0x10]
0x3835c41b5300    40  488b494f       REX.W movq rcx,[rcx+0x4f]
0x3835c41b5304    44  8b5904         movl rbx,[rcx+0x4]
0x3835c41b5307    47  8945ec         movl [rbp-0x14],rax
0x3835c41b530a    4a  8955e8         movl [rbp-0x18],rdx
0x3835c41b530d    4d  895de4         movl [rbp-0x1c],rbx
0x3835c41b5310    50  488b45f0       REX.W movq rax,[rbp-0x10]
0x3835c41b5314    54  488b4023       REX.W movq rax,[rax+0x23]
0x3835c41b5318    58  483b20         REX.W cmpq rsp,[rax]
0x3835c41b531b    5b  0f8698000000   jna 0x3835c41b53b9  <+0xf9>
0x3835c41b5321    61  488b45f0       REX.W movq rax,[rbp-0x10]
0x3835c41b5325    65  488b404f       REX.W movq rax,[rax+0x4f]
0x3835c41b5329    69  8b480c         movl rcx,[rax+0xc]
0x3835c41b532c    6c  8b45dc         movl rax,[rbp-0x24]
0x3835c41b532f    6f  ba2c000000     movl rdx,0x2c
0x3835c41b5334    74  0fafc2         imull rax,rdx
0x3835c41b5337    77  03c8           addl rcx,rax
0x3835c41b5339    79  488b55f0       REX.W movq rdx,[rbp-0x10]
0x3835c41b533d    7d  488b5213       REX.W movq rdx,[rdx+0x13]
0x3835c41b5341    81  b803000000     movl rax,0x3
0x3835c41b5346    86  48f7d8         REX.W negq rax
0x3835c41b5349    89  4803c2         REX.W addq rax,rdx

TurboFan:
0x3835c41b5357    97  488b45f0       REX.W movq rax,[rbp-0x10]
0x3835c41b535b    9b  488b400b       REX.W movq rax,[rax+0xb]
0x3835c41b53e0     0  55             push rbp0x3835c41b535f    9f  8b1408         movl rdx,[rax+rcx*1]
0x3835c41b5362    a2  33c0           xorl rax,rax

0x3835c41b5364    a4  3bd0           cmpl rdx,rax
0x3835c41b5366    a6  0f9cc2         setll dl
0x3835c41b53e1     1  4889e5         REX.W movq rbp,rsp
0x3835c41b5369    a9  0fb6d2         movzxbl rdx,rdx
0x3835c41b53e4     4  6a0a           push 0xa
0x3835c41b53e6     6  56             push rsi
0x3835c41b536c    ac  85d2           testl rdx,rdx0x3835c41b53e7     7  4883ec28       REX.W subq rsp,0x28

0x3835c41b536e    ae  0f840e000000   jz 0x3835c41b5382  <+0xc2>
0x3835c41b53eb     b  488b5e4f       REX.W movq rbx,[rsi+0x4f]
0x3835c41b5374    b4  8b45e0         movl rax,[rbp-0x20]
0x3835c41b53ef     f  488b560b       REX.W movq rdx,[rsi+0xb]
0x3835c41b53f3    13  488b4e13       REX.W movq rcx,[rsi+0x13]
0x3835c41b53f7    17  4883e903       REX.W subq rcx,0x3
0x3835c41b5377    b7  83c001         addl rax,0x10x3835c41b53fb    1b  33ff           xorl rdi,rdi
0x3835c41b53fd    1d  4c8bc7         REX.W movq r8,rdi
0x3835c41b537a    ba  8945e0         movl [rbp-0x20],rax
0x3835c41b537d    bd  e900000000     jmp 0x3835c41b5382  <+0xc2>
0x3835c41b5382    c2  8b45dc         movl rax,[rbp-0x24]
0x3835c41b5385    c5  83c001         addl rax,0x1
0x3835c41b5388    c8  488b4df0       REX.W movq rcx,[rbp-0x10]

0x3835c41b5400    20  4c8b4e23       REX.W movq r9,[rsi+0x23]
0x3835c41b538c    cc  488b494f       REX.W movq rcx,[rcx+0x4f]
0x3835c41b5404    24  493b21         REX.W cmpq rsp,[r9]
0x3835c41b5390    d0  8b5108         movl rdx,[rcx+0x8]
0x3835c41b5407    27  0f8633000000   jna 0x3835c41b5440  <+0x60>
0x3835c41b5393    d3  3bc2           cmpl rax,rdx
0x3835c41b540d    2d  446bcf2c       imull r9,rdi,0x2c
0x3835c41b5395    d5  0f8308000000   jnc 0x3835c41b53a3  <+0xe3>
0x3835c41b5411    31  448b5b0c       movl r11,[rbx+0xc]
0x3835c41b539b    db  8945dc         movl [rbp-0x24],rax
0x3835c41b539e    de  e96dffffff     jmp 0x3835c41b5310  <+0x50>
0x3835c41b5415    35  4503cb         addl r9,r11
0x3835c41b5418    38  4c3bc9         REX.W cmpq r9,rcx0x3835c41b53a3    e3  8b4de0         movl rcx,[rbp-0x20]

0x3835c41b53a6    e6  8bc1           movl rax,rcx
0x3835c41b541b    3b  0f8352000000   jnc 0x3835c41b5473  <+0x93>0x3835c41b53a8    e8  488be5         REX.W movq rsp,rbp
0x3835c41b53ab    eb  5d             pop rbp

0x3835c41b53ac    ec  c3             retl
0x3835c41b5421    41  42833c0a00     cmpl [rdx+r9*1],0x0
0x3835c41b53ad    ed  50             push rax0x3835c41b5426    46  0f8d04000000   jge 0x3835c41b5430  <+0x50>

0x3835c41b53ae    ee  e80dfeffff     call 0x3835c41b51c0     ;; wasm stub: WasmStackGuard
0x3835c41b53b3    f3  58             pop rax
0x3835c41b53b4    f4  e939ffffff     jmp 0x3835c41b52f2  <+0x32>
0x3835c41b542c    4c  4183c001       addl r8,0x1
0x3835c41b53b9    f9  e802feffff     call 0x3835c41b51c0     ;; wasm stub: WasmStackGuard
0x3835c41b5430    50  83c701         addl rdi,0x10x3835c41b53be    fe  e95effffff     jmp 0x3835c41b5321  <+0x61>
0x3835c41b53c3   103  e888fcffff     call 0x3835c41b5050     ;; wasm stub: ThrowWasmTrapMemOutOfBounds
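
The listing above interleaves Liftoff and TurboFan lines, presumably because both compilers printed concurrently (Clemens noted earlier that concurrent compilation can mess up the output). The following is a rough sketch of how such output could be untangled, assuming every record has the shape "address, pc offset, bytes, instruction" and that address minus pc offset identifies the function a record belongs to. The names `split_records` and `group_by_function` are made up for illustration.

```python
import re
from collections import defaultdict

# One record: absolute address, pc offset (hex), raw instruction bytes.
RECORD = re.compile(r"0x([0-9a-f]{8,})\s+([0-9a-f]+)\s+((?:[0-9a-f]{2})+)\s+")


def split_records(text):
    """Return (address, offset, instruction) tuples, even if two records share a line."""
    matches = list(RECORD.finditer(text))
    records = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Keep only the rest of the current line so labels on later lines are ignored.
        instr = " ".join(text[m.end():end].split("\n", 1)[0].split())
        records.append((int(m.group(1), 16), int(m.group(2), 16), instr))
    return records


def group_by_function(text):
    """Group records by function start (address - offset), sorted by pc offset."""
    groups = defaultdict(list)
    for addr, off, instr in split_records(text):
        groups[addr - off].append((off, instr))
    return {base: sorted(recs) for base, recs in groups.items()}
```

Records that were lost entirely in the interleaved output cannot be recovered this way; gaps in the pc offsets of a group indicate where that happened.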

The disassembly in the `--predictable` configuration is:

Liftoff:
0x1bef76d8f2c0     0  55             push rbp
0x1bef76d8f2c1     1  4889e5         REX.W movq rbp,rsp
0x1bef76d8f2c4     4  6a0a           push 0xa
0x1bef76d8f2c6     6  4881ec28000000 REX.W subq rsp,0x28
0x1bef76d8f2cd     d  488975f0       REX.W movq [rbp-0x10],rsi
0x1bef76d8f2d1    11  48c745dc00000000 REX.W movq [rbp-0x24],0x0
0x1bef76d8f2d9    19  48c745e400000000 REX.W movq [rbp-0x1c],0x0
0x1bef76d8f2e1    21  488b4df0       REX.W movq rcx,[rbp-0x10]
0x1bef76d8f2e5    25  488b4923       REX.W movq rcx,[rcx+0x23]
0x1bef76d8f2e9    29  483b21         REX.W cmpq rsp,[rcx]
0x1bef76d8f2ec    2c  0f86bb000000   jna 0x1bef76d8f3ad  <+0xed>
0x1bef76d8f2f2    32  488b4df0       REX.W movq rcx,[rbp-0x10]
0x1bef76d8f2f6    36  488b494f       REX.W movq rcx,[rcx+0x4f]
0x1bef76d8f2fa    3a  8b11           movl rdx,[rcx]
0x1bef76d8f2fc    3c  488b4df0       REX.W movq rcx,[rbp-0x10]
0x1bef76d8f300    40  488b494f       REX.W movq rcx,[rcx+0x4f]
0x1bef76d8f304    44  8b5904         movl rbx,[rcx+0x4]
0x1bef76d8f307    47  8945ec         movl [rbp-0x14],rax
0x1bef76d8f30a    4a  8955e8         movl [rbp-0x18],rdx
0x1bef76d8f30d    4d  895de4         movl [rbp-0x1c],rbx
0x1bef76d8f310    50  488b45f0       REX.W movq rax,[rbp-0x10]
0x1bef76d8f314    54  488b4023       REX.W movq rax,[rax+0x23]
0x1bef76d8f318    58  483b20         REX.W cmpq rsp,[rax]
0x1bef76d8f31b    5b  0f8698000000   jna 0x1bef76d8f3b9  <+0xf9>
0x1bef76d8f321    61  488b45f0       REX.W movq rax,[rbp-0x10]
0x1bef76d8f325    65  488b404f       REX.W movq rax,[rax+0x4f]
0x1bef76d8f329    69  8b480c         movl rcx,[rax+0xc]
0x1bef76d8f32c    6c  8b45dc         movl rax,[rbp-0x24]
0x1bef76d8f32f    6f  ba2c000000     movl rdx,0x2c
0x1bef76d8f334    74  0fafc2         imull rax,rdx
0x1bef76d8f337    77  03c8           addl rcx,rax
0x1bef76d8f339    79  488b55f0       REX.W movq rdx,[rbp-0x10]
0x1bef76d8f33d    7d  488b5213       REX.W movq rdx,[rdx+0x13]
0x1bef76d8f341    81  b803000000     movl rax,0x3
0x1bef76d8f346    86  48f7d8         REX.W negq rax
0x1bef76d8f349    89  4803c2         REX.W addq rax,rdx
0x1bef76d8f34c    8c  8bc9           movl rcx,rcx
0x1bef76d8f34e    8e  483bc8         REX.W cmpq rcx,rax
0x1bef76d8f351    91  0f836c000000   jnc 0x1bef76d8f3c3  <+0x103>
0x1bef76d8f357    97  488b45f0       REX.W movq rax,[rbp-0x10]
0x1bef76d8f35b    9b  488b400b       REX.W movq rax,[rax+0xb]
0x1bef76d8f35f    9f  8b1408         movl rdx,[rax+rcx*1]
0x1bef76d8f362    a2  33c0           xorl rax,rax
0x1bef76d8f364    a4  3bd0           cmpl rdx,rax
0x1bef76d8f366    a6  0f9cc2         setll dl
0x1bef76d8f369    a9  0fb6d2         movzxbl rdx,rdx
0x1bef76d8f36c    ac  85d2           testl rdx,rdx
0x1bef76d8f36e    ae  0f840e000000   jz 0x1bef76d8f382  <+0xc2>
0x1bef76d8f374    b4  8b45e0         movl rax,[rbp-0x20]
0x1bef76d8f377    b7  83c001         addl rax,0x1
0x1bef76d8f37a    ba  8945e0         movl [rbp-0x20],rax
0x1bef76d8f37d    bd  e900000000     jmp 0x1bef76d8f382  <+0xc2>
0x1bef76d8f382    c2  8b45dc         movl rax,[rbp-0x24]
0x1bef76d8f385    c5  83c001         addl rax,0x1
0x1bef76d8f388    c8  488b4df0       REX.W movq rcx,[rbp-0x10]
0x1bef76d8f38c    cc  488b494f       REX.W movq rcx,[rcx+0x4f]
0x1bef76d8f390    d0  8b5108         movl rdx,[rcx+0x8]
0x1bef76d8f393    d3  3bc2           cmpl rax,rdx
0x1bef76d8f395    d5  0f8308000000   jnc 0x1bef76d8f3a3  <+0xe3>
0x1bef76d8f39b    db  8945dc         movl [rbp-0x24],rax
0x1bef76d8f39e    de  e96dffffff     jmp 0x1bef76d8f310  <+0x50>
0x1bef76d8f3a3    e3  8b4de0         movl rcx,[rbp-0x20]
0x1bef76d8f3a6    e6  8bc1           movl rax,rcx
0x1bef76d8f3a8    e8  488be5         REX.W movq rsp,rbp
0x1bef76d8f3ab    eb  5d             pop rbp
0x1bef76d8f3ac    ec  c3             retl
0x1bef76d8f3ad    ed  50             push rax
0x1bef76d8f3ae    ee  e80dfeffff     call 0x1bef76d8f1c0     ;; wasm stub: WasmStackGuard
0x1bef76d8f3b3    f3  58             pop rax
0x1bef76d8f3b4    f4  e939ffffff     jmp 0x1bef76d8f2f2  <+0x32>
0x1bef76d8f3b9    f9  e802feffff     call 0x1bef76d8f1c0     ;; wasm stub: WasmStackGuard
0x1bef76d8f3be    fe  e95effffff     jmp 0x1bef76d8f321  <+0x61>
0x1bef76d8f3c3   103  e888fcffff     call 0x1bef76d8f050     ;; wasm stub: ThrowWasmTrapMemOutOfBounds



TurboFan:
0x1bef76d8f3e0     0  55             push rbp
0x1bef76d8f3e1     1  4889e5         REX.W movq rbp,rsp
0x1bef76d8f3e4     4  6a0a           push 0xa
0x1bef76d8f3e6     6  56             push rsi
0x1bef76d8f3e7     7  4883ec28       REX.W subq rsp,0x28
0x1bef76d8f3eb     b  488b5e4f       REX.W movq rbx,[rsi+0x4f]
0x1bef76d8f3ef     f  488b560b       REX.W movq rdx,[rsi+0xb]
0x1bef76d8f3f3    13  488b4e13       REX.W movq rcx,[rsi+0x13]
0x1bef76d8f3f7    17  4883e903       REX.W subq rcx,0x3
0x1bef76d8f3fb    1b  33ff           xorl rdi,rdi
0x1bef76d8f3fd    1d  4c8bc7         REX.W movq r8,rdi
0x1bef76d8f400    20  4c8b4e23       REX.W movq r9,[rsi+0x23]
0x1bef76d8f404    24  493b21         REX.W cmpq rsp,[r9]
0x1bef76d8f407    27  0f8633000000   jna 0x1bef76d8f440  <+0x60>
0x1bef76d8f40d    2d  446bcf2c       imull r9,rdi,0x2c
0x1bef76d8f411    31  448b5b0c       movl r11,[rbx+0xc]
0x1bef76d8f415    35  4503cb         addl r9,r11
0x1bef76d8f418    38  4c3bc9         REX.W cmpq r9,rcx
0x1bef76d8f41b    3b  0f8352000000   jnc 0x1bef76d8f473  <+0x93>
0x1bef76d8f421    41  42833c0a00     cmpl [rdx+r9*1],0x0
0x1bef76d8f426    46  0f8d04000000   jge 0x1bef76d8f430  <+0x50>
0x1bef76d8f42c    4c  4183c001       addl r8,0x1
0x1bef76d8f430    50  83c701         addl rdi,0x1
0x1bef76d8f433    53  397b08         cmpl [rbx+0x8],rdi
0x1bef76d8f436    56  77c8           ja 0x1bef76d8f400  <+0x20>
0x1bef76d8f438    58  498bc0         REX.W movq rax,r8
0x1bef76d8f43b    5b  488be5         REX.W movq rsp,rbp
0x1bef76d8f43e    5e  5d             pop rbp
0x1bef76d8f43f    5f  c3             retl
0x1bef76d8f440    60  48895de8       REX.W movq [rbp-0x18],rbx
0x1bef76d8f444    64  48897de0       REX.W movq [rbp-0x20],rdi
0x1bef76d8f448    68  4c8945d8       REX.W movq [rbp-0x28],r8
0x1bef76d8f44c    6c  488955d0       REX.W movq [rbp-0x30],rdx
0x1bef76d8f450    70  48894dc8       REX.W movq [rbp-0x38],rcx
0x1bef76d8f454    74  e867fdffff     call 0x1bef76d8f1c0     ;; wasm stub: WasmStackGuard
0x1bef76d8f459    79  488b5de8       REX.W movq rbx,[rbp-0x18]
0x1bef76d8f45d    7d  488b7de0       REX.W movq rdi,[rbp-0x20]
0x1bef76d8f461    81  4c8b45d8       REX.W movq r8,[rbp-0x28]
0x1bef76d8f465    85  488b55d0       REX.W movq rdx,[rbp-0x30]
0x1bef76d8f469    89  488b4dc8       REX.W movq rcx,[rbp-0x38]
0x1bef76d8f46d    8d  488b75f0       REX.W movq rsi,[rbp-0x10]
0x1bef76d8f471    91  eb9a           jmp 0x1bef76d8f40d  <+0x2d>
0x1bef76d8f473    93  e8d8fbffff     call 0x1bef76d8f050     ;; wasm stub: ThrowWasmTrapMemOutOfBounds
0x1bef76d8f478    98  90             nop
0x1bef76d8f479    99  0f1f00         nop



What further information can I provide?

Immanuel Haffner

Apr 8, 2020, 6:27:58 AM
to v8-dev
When running in the default configuration but comparing < 1 instead of < 0, this is the disassembly I get (and it runs for 11 ms).

Liftoff:
0xfad0d8782c0     0  55             push rbp
0xfad0d8782c1     1  4889e5         REX.W movq rbp,rsp
0xfad0d8782c4     4  6a0a           push 0xa
0xfad0d8782c6     6  4881ec28000000 REX.W subq rsp,0x28
0xfad0d8782cd     d  488975f0       REX.W movq [rbp-0x10],rsi
0xfad0d8782d1    11  48c745dc00000000 REX.W movq [rbp-0x24],0x0
0xfad0d8782d9    19  48c745e400000000 REX.W movq [rbp-0x1c],0x0
0xfad0d8782e1    21  488b4df0       REX.W movq rcx,[rbp-0x10]
0xfad0d8782e5    25  488b4923       REX.W movq rcx,[rcx+0x23]
0xfad0d8782e9    29  483b21         REX.W cmpq rsp,[rcx]
0xfad0d8782ec    2c  0f86be000000   jna 0xfad0d8783b0  <+0xf0>
0xfad0d8782f2    32  488b4df0       REX.W movq rcx,[rbp-0x10]
0xfad0d8782f6    36  488b494f       REX.W movq rcx,[rcx+0x4f]
0xfad0d8782fa    3a  8b11           movl rdx,[rcx]
0xfad0d8782fc    3c  488b4df0       REX.W movq rcx,[rbp-0x10]
0xfad0d878300    40  488b494f       REX.W movq rcx,[rcx+0x4f]
0xfad0d878304    44  8b5904         movl rbx,[rcx+0x4]
0xfad0d878307    47  8945ec         movl [rbp-0x14],rax
0xfad0d87830a    4a  8955e8         movl [rbp-0x18],rdx
0xfad0d87830d    4d  895de4         movl [rbp-0x1c],rbx
0xfad0d878310    50  488b45f0       REX.W movq rax,[rbp-0x10]
0xfad0d878314    54  488b4023       REX.W movq rax,[rax+0x23]
0xfad0d878318    58  483b20         REX.W cmpq rsp,[rax]
0xfad0d87831b    5b  0f869b000000   jna 0xfad0d8783bc  <+0xfc>

TurboFan:
0xfad0d878321    61  488b45f0       REX.W movq rax,[rbp-0x10]
0xfad0d878400     0  55             push rbp
0xfad0d878325    65  488b404f       REX.W movq rax,[rax+0x4f]
0xfad0d878401     1  4889e5         REX.W movq rbp,rsp
0xfad0d878404     4  6a0a           push 0xa
0xfad0d878406     6  56             push rsi
0xfad0d878329    69  8b480c         movl rcx,[rax+0xc]
0xfad0d87832c    6c  8b45dc         movl rax,[rbp-0x24]
0xfad0d878407     7  4883ec28       REX.W subq rsp,0x280xfad0d87832f    6f  ba2c000000     movl rdx,0x2c
0xfad0d878334    74  0fafc2         imull rax,rdx

0xfad0d87840b     b  488b5e4f       REX.W movq rbx,[rsi+0x4f]
0xfad0d878337    77  03c8           addl rcx,rax0xfad0d87840f     f  488b560b       REX.W movq rdx,[rsi+0xb]

0xfad0d878413    13  488b4e13       REX.W movq rcx,[rsi+0x13]0xfad0d878339    79  488b55f0       REX.W movq rdx,[rbp-0x10]

0xfad0d87833d    7d  488b5213       REX.W movq rdx,[rdx+0x13]
0xfad0d878417    17  4883e903       REX.W subq rcx,0x30xfad0d878341    81  b803000000     movl rax,0x3

0xfad0d87841b    1b  33ff           xorl rdi,rdi0xfad0d878346    86  48f7d8         REX.W negq rax
0xfad0d87841d    1d  4c8bc7         REX.W movq r8,rdi

0xfad0d878349    89  4803c2         REX.W addq rax,rdx
0xfad0d87834c    8c  8bc9           movl rcx,rcx
0xfad0d87834e    8e  483bc8         REX.W cmpq rcx,rax
0xfad0d878351    91  0f836f000000   jnc 0xfad0d8783c6  <+0x106>
0xfad0d878357    97  488b45f0       REX.W movq rax,[rbp-0x10]
0xfad0d87835b    9b  488b400b       REX.W movq rax,[rax+0xb]
0xfad0d87835f    9f  8b1408         movl rdx,[rax+rcx*1]
0xfad0d878362    a2  b801000000     movl rax,0x1
0xfad0d878367    a7  3bd0           cmpl rdx,rax
0xfad0d878420    20  4c8b4e23       REX.W movq r9,[rsi+0x23]
0xfad0d878424    24  493b21         REX.W cmpq rsp,[r9]
0xfad0d878369    a9  0f9cc2         setll dl
0xfad0d878427    27  0f8633000000   jna 0xfad0d878460  <+0x60>
0xfad0d87836c    ac  0fb6d2         movzxbl rdx,rdx
0xfad0d87842d    2d  446bcf2c       imull r9,rdi,0x2c
0xfad0d87836f    af  85d2           testl rdx,rdx0xfad0d878431    31  448b5b0c       movl r11,[rbx+0xc]

0xfad0d878435    35  4503cb         addl r9,r11
0xfad0d878371    b1  0f840e000000   jz 0xfad0d878385  <+0xc5>0xfad0d878438    38  4c3bc9         REX.W cmpq r9,rcx

0xfad0d878377    b7  8b45e0         movl rax,[rbp-0x20]
0xfad0d87843b    3b  0f8352000000   jnc 0xfad0d878493  <+0x93>
0xfad0d87837a    ba  83c001         addl rax,0x1
0xfad0d878441    41  42833c0a01     cmpl [rdx+r9*1],0x10xfad0d87837d    bd  8945e0         movl [rbp-0x20],rax

0xfad0d878446    46  0f8d04000000   jge 0xfad0d878450  <+0x50>0xfad0d878380    c0  e900000000     jmp 0xfad0d878385  <+0xc5>

0xfad0d878385    c5  8b45dc         movl rax,[rbp-0x24]0xfad0d87844c    4c  4183c001       addl r8,0x1

0xfad0d878388    c8  83c001         addl rax,0x1
0xfad0d878450    50  83c701         addl rdi,0x10xfad0d87838b    cb  488b4df0       REX.W movq rcx,[rbp-0x10]
0xfad0d87838f    cf  488b494f       REX.W movq rcx,[rcx+0x4f]
0xfad0d878393    d3  8b5108         movl rdx,[rcx+0x8]

0xfad0d878396    d6  3bc2           cmpl rax,rdx
0xfad0d878453    53  397b08         cmpl [rbx+0x8],rdi
0xfad0d878456    56  77c8           ja 0xfad0d878420  <+0x20>
0xfad0d878458    58  498bc0         REX.W movq rax,r8
0xfad0d87845b    5b  488be5         REX.W movq rsp,rbp
0xfad0d87845e    5e  5d             pop rbp0xfad0d878398    d8  0f8308000000   jnc 0xfad0d8783a6  <+0xe6>
0xfad0d87845f    5f  c3             retl
0xfad0d878460    60  48895de8       REX.W movq [rbp-0x18],rbx
0xfad0d878464    64  48897de0       REX.W movq [rbp-0x20],rdi
0xfad0d878468    68  4c8945d8       REX.W movq [rbp-0x28],r8

0xfad0d87846c    6c  488955d0       REX.W movq [rbp-0x30],rdx
0xfad0d878470    70  48894dc8       REX.W movq [rbp-0x38],rcx0xfad0d87839e    de  8945dc         movl [rbp-0x24],rax

0xfad0d8783a1    e1  e96affffff     jmp 0xfad0d878310  <+0x50>
0xfad0d878474    74  e847fdffff     call 0xfad0d8781c0       ;; wasm stub: WasmStackGuard0xfad0d8783a6    e6  8b4de0         movl rcx,[rbp-0x20]
0xfad0d8783a9    e9  8bc1           movl rax,rcx
0xfad0d8783ab    eb  488be5         REX.W movq rsp,rbp
0xfad0d8783ae    ee  5d             pop rbp
0xfad0d8783af    ef  c3             retl
0xfad0d8783b0    f0  50             push rax
0xfad0d878479    79  488b5de8       REX.W movq rbx,[rbp-0x18]
0xfad0d87847d    7d  488b7de0       REX.W movq rdi,[rbp-0x20]0xfad0d8783b1    f1  e80afeffff     call 0xfad0d8781c0       ;; wasm stub: WasmStackGuard

0xfad0d878481    81  4c8b45d8       REX.W movq r8,[rbp-0x28]0xfad0d8783b6    f6  58             pop rax

0xfad0d878485    85  488b55d0       REX.W movq rdx,[rbp-0x30]0xfad0d8783b7    f7  e936ffffff     jmp 0xfad0d8782f2  <+0x32>

0xfad0d8783bc    fc  e8fffdffff     call 0xfad0d8781c0       ;; wasm stub: WasmStackGuard0xfad0d878489    89  488b4dc8       REX.W movq rcx,[rbp-0x38]

0xfad0d87848d    8d  488b75f0       REX.W movq rsi,[rbp-0x10]0xfad0d8783c1   101  e95bffffff     jmp 0xfad0d878321  <+0x61>
0xfad0d878491    91  eb9a           jmp 0xfad0d87842d  <+0x2d>

0xfad0d878493    93  e8b8fbffff     call 0xfad0d878050       ;; wasm stub: ThrowWasmTrapMemOutOfBounds
0xfad0d878498    98  90             nop
0xfad0d8783c6   106  e885fcffff     call 0xfad0d878050       ;; wasm stub: ThrowWasmTrapMemOutOfBounds0xfad0d878499    99  0f1f00         nop
0xfad0d8783cb   10b  90             nop


Is it possible that the two compilers write their disassembly concurrently? -.-

Clemens Backes

Apr 8, 2020, 6:36:58 AM
to v8-dev
Yes, without --predictable we compile concurrently, and we also print the code concurrently. On the other hand, --predictable should not change the generated code, so I would suggest always passing --predictable when printing code and omitting it otherwise.

The disassembly looks as expected. In Liftoff, the code is longer and there is some spilling in the loop and around the if block. That shouldn't cause a 200x slowdown though.

So I would propose two things:
1) For your project, just disable Liftoff ("--no-liftoff") to get reliable peak performance.
2) If you are interested in digging deeper, please prepare a standalone reproducer and create a V8 bug. I could then try to reproduce it, but there might not be much to do here, since the disassembly looks as expected.
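
For a standalone reproducer as suggested in 2), an independent reference result can help confirm that both tiers return the same count. The following is a minimal sketch in plain Python that models the loop from the original snippet; it is not V8 code. The names `make_memory` and `count_below` are made up for illustration. The 44-byte stride is taken from the `(i32.const 44)` in the snippet, and the model assumes `size >= 1`, because the Wasm loop checks its condition only at the end of each iteration.

```python
import struct

# Byte stride between consecutive values, taken from the
# (i32.mul (local.get $4) (i32.const 44)) in the original snippet.
STRIDE = 44


def make_memory(values, data=0, stride=STRIDE):
    """Lay out i32 values in a byte buffer the way the snippet reads them."""
    memory = bytearray(data + len(values) * stride)
    for i, value in enumerate(values):
        # WebAssembly linear memory is little-endian, hence "<i".
        struct.pack_into("<i", memory, data + i * stride, value)
    return memory


def count_below(memory, data, size, threshold=0, stride=STRIDE):
    """Count the values that are less than `threshold` (signed compare).

    Assumption: size >= 1. The Wasm loop always runs its body once,
    whereas this model runs zero iterations for size == 0.
    """
    count = 0
    for i in range(size):
        (value,) = struct.unpack_from("<i", memory, data + i * stride)
        if value < threshold:
            count += 1
    return count


values = [-5, 0, 3, -1, 1]
memory = make_memory(values, data=8)
print(count_below(memory, 8, len(values), threshold=0))
print(count_below(memory, 8, len(values), threshold=1))
```

The `threshold` parameter mirrors the two cases compared in this thread (`< 0` and `< 1`), so the same buffer can be used to check both variants of the module.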

--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.

Immanuel Haffner

Apr 8, 2020, 7:51:02 AM
to v8-dev
Disabling Liftoff is definitely a viable option for us right now. Related to that, why am I not seeing a slowdown with `--predictable` and Liftoff enabled? What is happening in predictable mode?

Regarding 2), since we now have a workaround we can live with, I have more important matters to tend to right now. I created an issue in our project and maybe I will revisit it when I have some spare time. (Maybe we will extend our system to produce minimal working examples so we can easily report them to V8, who knows...?)

Thanks a lot for the support. If there are any more questions from your side I will gladly try to answer them.

Clemens Backes

Apr 8, 2020, 8:26:28 AM
to v8-dev
--predictable tries to eliminate all non-determinism. It's an umbrella flag that influences several parts of the system. In particular, it implies --single-threaded, which makes us compile all code on a single thread only. This of course changes timing significantly.
In the default configuration, both Liftoff and TurboFan run concurrently. As soon as Liftoff is done, we start execution. TurboFan code is hot-swapped when it becomes ready (but will only be used for new function calls, i.e. no on-stack-replacement). Without --predictable, this makes it non-deterministic whether you execute Liftoff or TurboFan. You should actually see that when executing your test multiple times.

I am not 100% sure in which order things happen in predictable mode - but at least it should be a deterministic order. If you don't observe a slowdown, then it seems like TurboFan compilation always finishes before code is being executed.
So maybe enabling predictable mode would also work for you, but just disabling Liftoff is the better solution, since it's way more specific.
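
The tier-up behaviour described above can be illustrated with a toy model. This is only a sketch of the rule stated in this message (a call uses whatever code is installed when it starts, and there is no on-stack replacement); it does not reflect how V8 implements it. The function name `tiers_used` and all timings are made up.

```python
def tiers_used(call_starts_ms, turbofan_ready_ms):
    """Return which tier serves each call in this toy model.

    A call picks its code when it is entered and keeps it until it
    returns, because there is no on-stack replacement.
    """
    return [
        "TurboFan" if start >= turbofan_ready_ms else "Liftoff"
        for start in call_starts_ms
    ]


# One long-running call that is entered before TurboFan has finished:
# the whole loop runs in Liftoff code.
single_call = tiers_used([0.0], turbofan_ready_ms=5.0)

# The same work split over several calls: later calls pick up the
# optimized code once it is ready.
many_calls = tiers_used([0.0, 4.0, 8.0, 12.0], turbofan_ready_ms=5.0)

# Optimized code already available when execution starts, which is
# what the observations with --predictable suggest.
ready_first = tiers_used([0.0], turbofan_ready_ms=0.0)

print(single_call)
print(many_calls)
print(ready_first)
```

In this model the outcome for a single long call depends entirely on whether TurboFan finishes before the call is entered, which matches the non-determinism described for the default configuration.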

Immanuel Haffner

Apr 9, 2020, 6:29:14 AM
to v8-dev
All right, thanks for the explanation. We will stick to `--no-liftoff` then.