V8 WebAssembly codegen questions - 64 bit instructions and Vectorization?


si...@zappar.com

Sep 3, 2019, 4:42:51 AM
to v8-dev
Hi,

I'm currently optimizing a codebase for WebAssembly. The code includes some hand-written SIMD implementations for ARM NEON and Intel SSE. WebAssembly's own SIMD instructions are not yet stable and are therefore not available in browsers today without a feature flag.

I've come across some nifty bit twiddling methods [1] that allow some SIMD operations to be performed using standard non-SIMD registers and instructions. 64-bit architectures obviously provide better opportunities for speedups using these techniques than the 32-bit examples shown on that old page from 1997.

I implemented a particular function (half-sampling a greyscale image by averaging 2x2 blocks of input pixels) using these techniques with uint64_t registers. Native benchmarking with an iPhone XR (-O3, Xcode's LLVM) gets these timings relative to my NEON implementation:

               |  -O3    |  -O3 -fno-vectorize -fno-slp-vectorize
         NEON  |  1.00x  |  1.00x
      Plain C  |  1.36x  | 10.71x
Bit twiddling  |  3.36x  |  5.19x

LLVM's vectorizers obviously found opportunities to vectorize both the plain C and the bit-twiddling implementations, as both were slower when those optimizations were disabled. The vectorizer preferred the simplicity of the plain C version, finding very good vectorization opportunities (an almost 8x speedup) that then easily outperformed the bit-twiddling approach. Without vectorization, however, the bit-twiddling approach is a 2x improvement over the plain C variant.
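For reference, the core of the bit-twiddling ("SWAR") idea can be sketched as below. This is a simplified, truncating-average version of the technique rather than my exact implementation, the names are mine, and it assumes a little-endian target and a width that is a multiple of 16:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Per-byte truncating average of 8 packed grey pixels: (x + y) / 2 in each
   byte lane, with no carries leaking between lanes. */
static inline uint64_t avg8(uint64_t x, uint64_t y) {
    return (x & y) + (((x ^ y) >> 1) & 0x7f7f7f7f7f7f7f7full);
}

/* Half-sample two greyscale rows of width w (multiple of 16) into w/2
   output pixels, averaging each 2x2 input block. Little-endian assumed. */
void half_sample_swar(const uint8_t *row0, const uint8_t *row1,
                      uint8_t *out, size_t w) {
    for (size_t x = 0; x + 16 <= w; x += 16) {
        uint64_t a0, a1, b0, b1;
        memcpy(&a0, row0 + x,     8);   /* top row, pixels x..x+7     */
        memcpy(&a1, row0 + x + 8, 8);   /* top row, pixels x+8..x+15  */
        memcpy(&b0, row1 + x,     8);
        memcpy(&b1, row1 + x + 8, 8);
        uint64_t v0 = avg8(a0, b0);     /* vertical averages          */
        uint64_t v1 = avg8(a1, b1);
        uint64_t h0 = avg8(v0, v0 >> 8); /* pair averages in even bytes */
        uint64_t h1 = avg8(v1, v1 >> 8);
        /* Compact the even bytes of h0/h1 into 8 output pixels. */
        for (int i = 0; i < 4; i++) {
            out[x / 2 + i]     = (uint8_t)(h0 >> (16 * i));
            out[x / 2 + 4 + i] = (uint8_t)(h1 >> (16 * i));
        }
    }
}
```

Note that two chained truncating averages round slightly differently from a true (sum + 2) / 4, which may or may not matter for a given use case.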

I will do some benchmarking of the same code through Emscripten, but given the additional layers involved (C -> wasm -> V8 codegen (with multi-level JIT?)) I thought it would also be worth a couple of direct questions on this:

1) Does v8 codegen emit 64-bit machine instructions for 64-bit wasm instructions on 64-bit architectures (specifically Android)? I imagine the speedup from SWAR techniques will be significantly reduced using just 32-bit registers, perhaps to the level that doesn't make this worthwhile.

2) Does v8's codegen currently do any vectorization? If not, are there plans to add it? In this case the plain C version might be best to stick with as it would be easier for auto-vectorization to detect and optimize.

3) Can anyone provide tips / links to help with investigating and optimizing this kind of thing? Any way of flagging wasm functions for maximum optimization for benchmarking purposes?

Thanks!

Simon

Clemens Hammacher

Sep 3, 2019, 5:09:28 AM
to v8-dev
Hi Simon,

that's an interesting project, please keep us updated :)

1) Does v8 codegen emit 64-bit machine instructions for 64-bit wasm instructions on 64-bit architectures (specifically Android)? I imagine the speedup from SWAR techniques will be significantly reduced using just 32-bit registers, perhaps to the level that doesn't make this worthwhile.

Yes, of course we use 64-bit instructions for 64-bit operations (if the binary was compiled for a 64-bit platform).

2) Does v8's codegen currently do any vectorization? If not, are there plans to add it? In this case the plain C version might be best to stick with as it would be easier for auto-vectorization to detect and optimize.

No, we do not do any auto-vectorization, and I am not aware of any plan to add this.

3) Can anyone provide tips / links to help with investigating and optimizing this kind of thing? Any way of flagging wasm functions for maximum optimization for benchmarking purposes?

For benchmarking, you should disable the baseline compiler (Liftoff). Chrome has a feature flag for that (chrome://flags/#enable-webassembly-baseline); if you experiment with d8 directly, add "--no-liftoff --no-wasm-tier-up". If you want to inspect the generated machine code, you can add the "--print-wasm-code" command line flag. This requires a build with the "v8_enable_disassembler" gn flag enabled, which is the default for debug builds. Since compilation happens concurrently, you need to add "--predictable" (or "--single-threaded" or "--wasm-num-compilation-tasks=0") to get readable output.
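Putting those flags together, a typical invocation would look something like this (bench.js stands in for your own test driver):

```shell
# All functions go straight to TurboFan (no Liftoff, no tier-up), and the
# generated machine code is printed deterministically on a single thread.
d8 --no-liftoff --no-wasm-tier-up \
   --print-wasm-code --predictable \
   bench.js
```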

Hope that helps,
Clemens

--

Clemens Hammacher

Software Engineer

clem...@google.com


Google Germany GmbH

Erika-Mann-Straße 33

80636 München


Managing Directors: Paul Manicle, Halimah DeLaine Prado

Registration court and number: Hamburg, HRB 86891

Registered office: Hamburg



This e-mail is confidential. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it has gone to the wrong person.

si...@zappar.com

Sep 10, 2019, 4:20:24 AM
to v8-dev
Hi Clemens,

Thanks for those really helpful pointers.

I have continued digging into this for the last week or so, and have put my current code up on GitHub for anyone interested:

Results are that the bit-twiddling approaches do offer a pretty decent speed-up on mobile platforms (and 64-bit desktop), so this is a promising route :)

On Android it seems the version of Chrome distributed through Google Play is still the 32-bit one, even on 64-bit devices (tested on a Google Pixel 2 - chrome://version states 32-bit).

Despite that, there is still a speedup using the packed implementations. The fastest on the Pixel 2 is the half_sample_uint32x2_blocks implementation, which gives something like a 2.4x speedup (2.04ms to 0.84ms for half-sampling a 720p image).

Safari in iOS 12.4 is 64-bit, and there the half_sample_uint64_blocks implementation is fastest and gives more than a 3x speedup on an iPod Touch 7 (1.0ms to 0.3ms for 720p input).

Each benchmark run does 10 iterations with different but overlapping input data (so both input and output are likely to be at least in L2 cache, which is the case I expect in practice). The timing numbers printed by the test code show the total time for all 10 iterations, so the numbers above are divided by 10 from typical outputs over multiple runs. Safari only offers 1 ms resolution on Performance.now(), so it is harder to get accurate measurements there, but the numbers above look pretty consistent over multiple runs.

Out of interest I've also tried to write an implementation targeting WebAssembly SIMD. I was able to get it to compile with emcc from latest-upstream but it doesn't run in my self-built d8 7.7 with the --experimental-wasm-simd flag. More details in the README of the repo linked above. I'd appreciate any help to get that one running for a further comparison datapoint.

So the questions that arise:

1) Is there any way for a page to detect if the browser is 32-bit or 64-bit? navigator.platform reports Linux armv8l on the Pixel 2, so that doesn't help unfortunately. I have 32 and 64 bit "busy loops" that report approximately equal counts on 64-bit platforms (and not on 32-bit), but it would be nice if there was a more direct way to determine this!

2) Are there plans to transition Play Store Chrome releases to 64-bit for 64-bit Android? I did some searching but couldn't find any official information about the reasons for sticking with 32-bit there (though I assume memory usage, apk size, or both).

3) Any hints on how I can get the SIMD version to run or is this likely just a bug / spec instability between emcc and d8?

Cheers,

Simon

Ross McIlroy

Sep 10, 2019, 6:12:36 AM
to v8-dev
To answer one of the questions here (inline):

On Tue, 10 Sep 2019 at 09:20, <si...@zappar.com> wrote:
[...]
2) Are there plans to transition Play Store Chrome releases to 64-bit for 64-bit Android? I did some searching but couldn't find any official information about the reasons for sticking with 32-bit there (though I assume memory usage, apk size, or both).

There is a plan to transition Play Store Chrome releases to 64-bit for some 64-bit Android devices (those over a certain memory threshold, yet to be determined). You are right that the reason for sticking with 32-bits is for memory reasons.  We don't have a timeline for this yet, but it is unlikely to hit the stable channel until sometime early next year.

--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/09853b85-6d5d-4726-ad90-c11d566396d1%40googlegroups.com.

Clemens Hammacher

Sep 11, 2019, 5:47:50 AM
to v8-dev
On Tue, Sep 10, 2019 at 10:20 AM <si...@zappar.com> wrote:
[...]
3) Any hints on how I can get the SIMD version to run or is this likely just a bug / spec instability between emcc and d8?

This indeed sounds like a bug, probably on the emscripten side. Can you open an emscripten bug about this?
 

si...@zappar.com

Sep 16, 2019, 6:41:21 AM
to v8-dev
Thanks for the responses.


On Tuesday, 10 September 2019 11:12:36 UTC+1, Ross McIlroy wrote:
To answer one of the questions here (inline):
[...]
2) Are there plans to transition Play Store Chrome releases to 64-bit for 64-bit Android? I did some searching but couldn't find any official information about the reasons for sticking with 32-bit there (though I assume memory usage, apk size, or both).
There is a plan to transition Play Store Chrome releases to 64-bit for some 64-bit Android devices (those over a certain memory threshold, yet to be determined). You are right that the reason for sticking with 32-bits is for memory reasons.  We don't have a timeline for this yet, but it is unlikely to hit the stable channel until sometime early next year.

Thanks for the information, good to know.

Are there any public builds of an arm64-v8a android Chrome/Chromium for testing purposes or will I have to self-build? I've found the CI logs but can't see any download links for the build artefacts.

On Wednesday, 11 September 2019 10:47:50 UTC+1, Clemens Hammacher wrote:
On Tue, Sep 10, 2019 at 10:20 AM <si...@zappar.com> wrote:
[...]
3) Any hints on how I can get the SIMD version to run or is this likely just a bug / spec instability between emcc and d8?

This indeed sounds like a bug, probably on the emscripten side. Can you open an emscripten bug about this?

I opened an issue for this, see here:

A reply over there suggested v8 had only recently implemented support for non-immediate SIMD shifts, which my code required.

Using d8 from master I was able to run this on my MacBook Pro (x64). The plain C implementation had a minimum time of 3.16ms, the uint64_t one was 1.60ms and the SIMD one 0.64ms. Mac Chrome Canary could also run the code and unsurprisingly reported similar timings.

Chrome Canary for Android on Pixel 2 unfortunately just gets an "Aw, Snap" page when trying to load the test (with simd enabled in chrome://flags).

Simon

Clemens Hammacher

Sep 16, 2019, 7:19:13 AM
to v8-dev
Wow, those are nice numbers! Good to hear that the issue was resolved by using the latest versions. Sorry about that, but the SIMD spec is still evolving.


Chrome Canary for Android on Pixel 2 unfortunately just gets an "Aw, Snap" page when trying to load the test (with simd enabled in chrome://flags).

Oh, that's unfortunate. Can you send a crash ID (from chrome://crashes)?
If you have a local reproducer, it would also be really helpful to open a v8 bug.
 



si...@zappar.com

Sep 17, 2019, 4:58:54 AM
to v8-dev
Hi Clemens,

On Monday, 16 September 2019 12:19:13 UTC+1, Clemens Hammacher wrote:

On Mon, Sep 16, 2019 at 12:41 PM <si...@zappar.com> wrote:

Chrome Canary for Android on Pixel 2 unfortunately just gets an "Aw, Snap" page when trying to load the test (with simd enabled in chrome://flags).

Oh, that's unfortunate. Can you send a crash ID (from chrome://crashes)?
If you have a local reproducer, it would also be really helpful to open a v8 bug.

I've opened https://bugs.chromium.org/p/v8/issues/detail?id=9746 to describe the issue, and have uploaded the test page too. I'm not that comfortable with the sharing of memory contents I'm afraid but hopefully that's enough for you to reproduce at your end.

Also, two bugs for the price of one: my very simple SIMD shift test code from the emscripten bug report produces the wrong output in Chrome Canary on Android. That one is reported as https://bugs.chromium.org/p/v8/issues/detail?id=9748 .

Simon

si...@zappar.com

Sep 22, 2019, 7:04:36 AM
to v8-dev
I couldn't find any official public 64-bit Android Chrome builds, but I did come across the handy docs about cross-compiling just v8 for Android and running d8 through adb shell.
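For anyone following along, the process boils down to roughly the following (a sketch from memory; the exact gn args are in the docs, bench.js is a placeholder for your own test, and depending on the build you may also need to push a snapshot blob alongside d8):

```shell
# Generate and build an arm64 release configuration targeting Android.
gn gen out/arm64.release \
  --args='target_os="android" target_cpu="arm64" is_debug=false'
ninja -C out/arm64.release d8

# Push the binary and the test to the device and run it via adb shell.
adb push out/arm64.release/d8 /data/local/tmp/
adb push bench.js /data/local/tmp/
adb shell /data/local/tmp/d8 /data/local/tmp/bench.js
```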

So I've now got self-built d8 binaries (master branch on Wednesday) for the arm_release and arm64_release configs. An additional benefit of using d8 for this is Performance.now() returns higher precision results than those in Chrome.

The arm64 build correctly runs the wasm simd implementation too, so I thought I would share some final benchmark numbers.

I've taken the 25th percentile value for each of the implementations, measured over 100 iterations. Each iteration measures the time to do 10 1280x720 greyscale half sample operations. Numbers are given as milliseconds, with speedup factors vs the plain C implementation in brackets.

The uint64_t implementation beats the uint32_t ones with arm64 d8, as expected. The SIMD approach gives a bigger win on arm64 than x64 for this function.

Simon

                       | plain | uint32_blocks | uint32x2_blocks | uint64_blocks |    wasm_simd
-----------------------------------------------------------------------------------------------
MacBook Pro, x64       |  3.16 |  2.37 (1.33x) |    2.06 (1.54x) |  1.60 (1.98x) | 0.64 (4.91x)
Pixel 2, armv7, 32 bit | 20.25 |  9.95 (2.04x) |    8.31 (2.44x) |  9.68 (2.09x) |     ---
Pixel 2, arm64, 64 bit | 16.95 |  9.08 (1.87x) |    7.08 (2.39x) |  4.98 (3.40x) | 2.72 (6.23x)