performance considerations for well-typed Wasm binaries

Wolfgang

Mar 24, 2024, 2:41:45 PM
to v8-dev
Hello,
I'm a master's student at Aarhus University.

We built a provably correct compiler (using Coq) targeting Wasm: Link
This includes a proof that the modules our compiler generates instantiate
according to the spec, i.e. they are well-typed.

We are evaluating performance with Node.js. Is there a way to improve it,
e.g. by disabling some runtime checks, given that we have these stronger correctness guarantees?

Node.js doesn't seem to have (publicly documented) flags to allow that.

I included some of our numbers below.

Best,
Wolfgang

All times in ms, average of 10 runs:
startup = load binary + instantiate
main = main function
pp = pretty-printing the result as an S-expression; we call an imported function
      for every character, so it is somewhat slow, but not that relevant for my question

For a description of the benchmarks, see: Chapter 8.2, https://zoep.github.io/thesis_final.pdf

Node.js v18.19.0:
   demo1-opt_coalesce-locals : startup:   6, main:   0, pp: 28, sum:  34
   demo2-opt_coalesce-locals : startup:   3, main:   0, pp:  8, sum:  11
list_sum-opt_coalesce-locals : startup:   3, main:   0, pp:  2, sum:   5
 vs_easy-opt_coalesce-locals : startup:  10, main:  33, pp:  1, sum:  44
 vs_hard-opt_coalesce-locals : startup:  10, main: 101, pp:  1, sum: 112
   binom-opt_coalesce-locals : startup:  18, main:  10, pp: 24, sum:  52
sha_fast-opt_coalesce-locals : startup:  65, main:  70, pp:  7, sum: 142
   color-opt_coalesce-locals : startup: 132, main:  44, pp:  2, sum: 178

Node.js v20.11.1:
   demo1-opt_coalesce-locals : startup:   1, main:   3, pp: 24, sum:  28
   demo2-opt_coalesce-locals : startup:   2, main:   0, pp: 12, sum:  14
list_sum-opt_coalesce-locals : startup:   2, main:   0, pp:  2, sum:   4
 vs_easy-opt_coalesce-locals : startup:   4, main:  38, pp:  4, sum:  46
 vs_hard-opt_coalesce-locals : startup:   3, main: 110, pp:  3, sum: 116
   binom-opt_coalesce-locals : startup:   3, main:  26, pp: 23, sum:  52
sha_fast-opt_coalesce-locals : startup:   4, main: 228, pp: 10, sum: 242
   color-opt_coalesce-locals : startup:  12, main: 332, pp:  2, sum: 346

 

Andreas Haas

Mar 25, 2024, 6:16:00 PM
to v8-...@googlegroups.com
Hi Wolfgang,

The command "nodejs --help --v8-options" seems to print all V8 command line options. It's not clear what runtime checks you would like to adjust. Here are some flags I can think of:

You could use --wasm-lazy-validation to avoid function validation during module compilation and module instantiation, but function validation would still happen lazily when a function gets executed for the first time. If you want to prevent even the lazy function validation, then you have to adjust the code at [1].

You can use `--no-wasm-stack-checks` if you can guarantee that there will not be a stack overflow, e.g. because of an unbounded recursion.

You can use `--no-wasm-bounds-checks` if you can guarantee that memory accesses are always in-bounds.

Cheers, Andreas

Wolfgang

Mar 30, 2024, 6:15:48 PM
to v8-dev
Hi Andreas,
thanks a lot for the quick response!
This is precisely what I was looking for.

We do indeed know that memory accesses are always in bounds, so we can safely use the flag `--no-wasm-bounds-checks`. This speeds up one of our benchmarks by ~5%, thanks! (For the others, the difference is not really measurable.)

Here is the Coq definition for instantiation [0], which we proved. It includes that the module is well-typed [1], so omitting the function validation check you suggested (ValidateFunctionBody) should be safe.
However, doing that didn't affect the performance in a measurable way: function validation seems to be quite cheap.
(I compared the current master of Node.js (V8 version 11.9.169.7) with and without the function validation check, by commenting out lines 87-101 and 103 [2].
 Our modules have <150 functions with around 10-1000 instructions each, and a main function with typically >50k instructions.)

Andreas Haas

Apr 2, 2024, 4:26:40 AM
to v8-...@googlegroups.com
Hi Wolfgang,

Your module seems to be quite small, so I guess disabling validation saves you less than 1ms. Do you run your experiments on a MacBook?

It is a bit surprising that you can save even 5% by disabling bounds checks. The reason is that we don't have explicit bounds checks, but instead use a signal handler. This works as follows: The WebAssembly memory is surrounded by guard pages which are marked as not readable and not writable. The WebAssembly memory and the guard pages together cover 8GB, which means that any 32-bit memory address + 32-bit offset will always hit either the WebAssembly memory or a guard page. WebAssembly code then executes memory accesses unconditionally. If the memory access is inside the WebAssembly memory, then everything works just fine. If it hits a guard page, then a segfault is triggered. The signal handler catches the segfault, creates a JavaScript exception, and then continues execution at a JavaScript exception handler.

Naturally this only works on 64-bit architectures, because otherwise you cannot reserve 8GB of address space. Also, on arm64 we cannot do this optimization for writes. Assume a write that is partially out of bounds, e.g. the first two bytes are in-bounds, but the last two bytes are out of bounds. On x64, a segfault is triggered before any of the four bytes are written, but on some arm64 CPUs, the two in-bounds bytes get written before the segfault gets triggered.

Therefore, I assume you did your experiment on a MacBook or some other arm64 device, because on x64 I would assume that you cannot measure any performance improvements by skipping bounds checks.
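
The guard-page argument can be checked with a little arithmetic. The following is a minimal sketch, not V8's implementation: `classify_access`, `RESERVATION`, and the 16MB example memory are hypothetical names and values chosen for illustration. It models only where the first byte of an access lands and whether the access straddles the end of the memory.

```python
# Illustrative model only; it does not reproduce V8's actual implementation.
WASM_PAGE = 64 * 1024          # WebAssembly page size in bytes
RESERVATION = 8 * 1024**3      # 8GB of address space: memory plus guard pages


def classify_access(index: int, offset: int, width: int, mem_size: int) -> str:
    """Classify a load/store of `width` bytes starting at index + offset.

    `index` and `offset` are unsigned 32-bit values; `mem_size` is the
    current size of the linear memory in bytes.
    """
    assert 0 <= index < 2**32 and 0 <= offset < 2**32
    start = index + offset
    # The first byte of the access always lands inside the reserved region.
    assert start < RESERVATION
    if start + width <= mem_size:
        return "in-bounds"
    if start >= mem_size:
        return "guard page (segfault, turned into a trap)"
    # The first bytes are in bounds and the rest are not: this is the case
    # where some arm64 CPUs may already have written the in-bounds bytes.
    return "straddles the boundary"


mem_size = 256 * WASM_PAGE  # a 16MB memory, as an example
print(classify_access(0, 0, 4, mem_size))
print(classify_access(2**32 - 1, 2**32 - 1, 4, mem_size))
print(classify_access(mem_size - 2, 0, 4, mem_size))
```

The largest possible start address is (2^32 - 1) + (2^32 - 1) = 2^33 - 2, which is just below 8GB = 2^33, so no explicit comparison against the memory size is needed for the access to be caught.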

Cheers, Andreas

Wolfgang

Apr 5, 2024, 4:27:42 PM
to v8-...@googlegroups.com
Hi Andreas,

Yes, the binaries are small. (Unrelated note: they seem to be unusual in some way: wasmtime takes 2 seconds to compile/instantiate color, and is in general much slower for all our other benchmarks as well. We are currently investigating this too.)

Thanks for the explanation, interesting!
No, we don't use ARM; we measured the improvement on two standard x64 Linux laptops.

Some more background, not sure if relevant:
Our memory model is quite simplistic: memory is never freed. A global points to the next free segment in the linear memory; the pointer is increased after every allocation and never decreased. We grow the linear memory regularly if needed.
In this model, color uses just 16MB of linear memory, sha_fast 25MB.
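
As a rough illustration of this memory model, here is a minimal sketch in Python. It is not our compiler's actual code: `BumpAllocator`, `alloc`, and the grow-on-demand policy are assumptions made for the example (the real runtime may grow memory on a different schedule).

```python
WASM_PAGE = 64 * 1024  # WebAssembly page size in bytes


class BumpAllocator:
    """Hypothetical model of a bump allocator that never frees memory."""

    def __init__(self, initial_pages: int = 1) -> None:
        self.pages = initial_pages  # current size of the linear memory, in pages
        self.next_free = 0          # models the global holding the next free address

    def alloc(self, size: int) -> int:
        """Return the address of a fresh block of `size` bytes."""
        addr = self.next_free
        end = addr + size
        # Grow the linear memory (the analogue of memory.grow) if the block
        # does not fit into the pages allocated so far.
        needed = -(-end // WASM_PAGE)  # ceiling division
        if needed > self.pages:
            self.pages = needed
        self.next_free = end  # the pointer only ever increases
        return addr


heap = BumpAllocator()
a = heap.alloc(12)
b = heap.alloc(100_000)
print(a, b, heap.next_free, heap.pages)
```

Because the pointer never moves backwards, every address handed out lies below `next_free`, which is the kind of invariant that makes it plausible to prove all accesses in bounds.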

In the end I'm happy it's a bit faster :) Sure, it would be cool to know why...
Best,
Wolfgang

The setup of the following two is exactly the same, except in the second one we call node with "--no-wasm-bounds-checks".

Without --no-wasm-bounds-checks:
Running avg. of 20 runs in node (v20.10.0).
       demo1-opt_coalesce-locals : startup:    3, main:   2, pp: 20, sum:   25
       demo2-opt_coalesce-locals : startup:    3, main:   0, pp:  9, sum:   12
    list_sum-opt_coalesce-locals : startup:    3, main:   0, pp:  2, sum:    5
     vs_easy-opt_coalesce-locals : startup:    4, main:  22, pp:  3, sum:   29
     vs_hard-opt_coalesce-locals : startup:    4, main:  91, pp:  3, sum:   98
       binom-opt_coalesce-locals : startup:    3, main:  15, pp: 18, sum:   36
       color-opt_coalesce-locals : startup:    7, main: 164, pp:  2, sum:  173
    sha_fast-opt_coalesce-locals : startup:    4, main: 112, pp:  7, sum:  123
  even_10000-opt_coalesce-locals : startup:    3, main:   1, pp:  1, sum:    5
     ack_3_9-opt_coalesce-locals : startup:    3, main: 112, pp: 28, sum:  143
sm_gauss_nat-opt_coalesce-locals : startup:    3, main:  58, pp:  1, sum:   62
  sm_gauss_N-opt_coalesce-locals : startup:    3, main:  22, pp:  1, sum:   26

With --no-wasm-bounds-checks:
Running avg. of 20 runs in node (v20.10.0).
       demo1-opt_coalesce-locals : startup:    3, main:   2, pp: 19, sum:   24
       demo2-opt_coalesce-locals : startup:    3, main:   0, pp:  9, sum:   12
    list_sum-opt_coalesce-locals : startup:    3, main:   0, pp:  3, sum:    6
     vs_easy-opt_coalesce-locals : startup:    4, main:  21, pp:  3, sum:   28
     vs_hard-opt_coalesce-locals : startup:    4, main:  91, pp:  2, sum:   97
       binom-opt_coalesce-locals : startup:    3, main:  14, pp: 18, sum:   35
       color-opt_coalesce-locals : startup:    7, main: 157, pp:  2, sum:  166
    sha_fast-opt_coalesce-locals : startup:    4, main: 109, pp:  7, sum:  120
  even_10000-opt_coalesce-locals : startup:    3, main:   1, pp:  1, sum:    5
     ack_3_9-opt_coalesce-locals : startup:    3, main: 112, pp: 28, sum:  143
sm_gauss_nat-opt_coalesce-locals : startup:    3, main:  58, pp:  1, sum:   62
  sm_gauss_N-opt_coalesce-locals : startup:    3, main:  22, pp:  1, sum:   26
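
To make the difference easier to read, the relative change in the `main` column can be computed from the two tables above. This is a small sketch; `improvement_percent` is a hypothetical helper, and only the benchmarks whose `main` time differs between the two runs are included.

```python
# "main" times in ms, copied from the two tables above (node v20.10.0, avg. of 20 runs).
main_default = {"vs_easy": 22, "binom": 15, "color": 164, "sha_fast": 112}
main_no_checks = {"vs_easy": 21, "binom": 14, "color": 157, "sha_fast": 109}


def improvement_percent(before: float, after: float) -> float:
    """Relative reduction in run time, in percent."""
    return (before - after) / before * 100


for name in main_default:
    pct = improvement_percent(main_default[name], main_no_checks[name])
    print(f"{name:>8}: {pct:.1f}%")
```

Since the timings are rounded to whole milliseconds, a 1 ms change on the shorter benchmarks (vs_easy, binom) is within rounding error; color and sha_fast are the entries where the change is larger than that.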

Andreas Haas

Apr 8, 2024, 1:07:43 AM
to v8-...@googlegroups.com
Hi Wolfgang,

If you use an x64 machine, then I'm really wondering why you see performance improvements. Maybe you are even just measuring a performance artifact. E.g. disabling bounds checks makes the code objects slightly smaller, so maybe what you are measuring is just a code alignment issue?

You can measure your benchmark with "--wasm-enforce-bounds-checks", which adds explicit bounds checks to the code. The bounds checks that you see in that configuration are the bounds checks that can either be eliminated due to your analysis, or due to the use of the signal handling I described before.

On 32-bit builds of V8, signal handlers cannot be used for bounds checks, so your analysis would be the only way to eliminate bounds checks, and you would see a much bigger impact.
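
A comparison across the three configurations could be set up roughly as follows. This is only a sketch of how the command lines might be assembled: the flags are the ones named in this thread, while `run_benchmark.js`, `CONFIGS`, and `node_command` are hypothetical names. The block only prints the commands and does not run node.

```python
import shlex

# Flags named in this thread; "run_benchmark.js" is a hypothetical script name.
CONFIGS = {
    "default": [],
    "no-bounds-checks": ["--no-wasm-bounds-checks"],
    "enforced-bounds-checks": ["--wasm-enforce-bounds-checks"],
}


def node_command(flags: list[str], script: str = "run_benchmark.js") -> list[str]:
    """Build the argv for one benchmark configuration."""
    return ["node", *flags, script]


for name, flags in CONFIGS.items():
    print(f"{name}: {shlex.join(node_command(flags))}")
```

Each resulting argv could then be passed to `subprocess.run` and timed over several runs, so that the default, unchecked, and explicitly checked configurations are measured under the same conditions.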

Cheers, Andreas

Ben Noordhuis

Apr 8, 2024, 2:40:14 AM
to v8-...@googlegroups.com
Does --no-wasm-bounds-checks imply no guard pages? That could explain
the difference when the wasm module is small and executes quickly; the
overhead of mmap/mprotect/munmap probably starts to dominate.
Profiling with perf(1) should be able to confirm that.

Andreas Haas

Apr 8, 2024, 3:25:26 AM
to v8-...@googlegroups.com
No, I don't think --no-wasm-bounds-checks disables guard pages.
