Issue 643 in webp: libwebp WASM decoding performance

w… via monorail

Jun 25, 2024, 10:41:14 AM
to webp-d...@webmproject.org
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643

Howdy folks!

This issue centralizes the ongoing discussions on a collection of patches that improve libwebp WASM decoding performance. Together, these patches speed up decoding by 48-78% for lossy images and 24-53% for lossless images.

Here are the changes and their rationale, ordered from most to least impactful.

## Lossy and lossless SIMD SSE2/SSE4.1 intrinsics via SIMD-everywhere (SIMDe)
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593964

Single Instruction, Multiple Data (SIMD) instructions are CPU vector instructions that apply one operation to several data elements at once. Different architectures, such as ARM, Intel, and MIPS, each have their own SIMD extensions (ARM NEON/SVE, Intel SSE/AVX, and MIPS MSA). A developer can take advantage of SIMD instructions by writing architecture-specific assembly, using compiler-provided intrinsics, or letting the compiler autovectorize. Libwebp uses a combination of all three, and disabling SIMD results in a 50% slowdown on an amd64 machine.

The WASM platform has its own 128-bit SIMD instruction set, which is roughly the common ground between SSE and NEON, and compilers like wasi-clang expose WASM-specific SIMD intrinsics.

Currently, libwebp does not have WASM-specific SIMD routines. To get around this, Emscripten provides its own translation from architecture-specific SIMD intrinsics to WASM intrinsics (see https://emscripten.org/docs/porting/simd.html). This support is specific to Emscripten, though, and does not apply to other WASM toolchains.

Our key contribution is the inclusion of SIMD-everywhere (SIMDe; https://github.com/simd-everywhere/simde) to libwebp to improve decoding performance. SIMDe is a header-only library that translates between architecture-specific intrinsics, and here we are using it to translate from SSE2/SSE4.1 to WASM SIMD. This provided a speedup of 32-57% on lossy images, and 1-20% on lossless images.

Applying this patch and compiling with -DWEBP_USE_SIMDE will use the SIMDe header instead of *mmintrin.h.
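
A minimal sketch of how such a flag could select the intrinsics header. The `HAVE_SSE2_API` macro and the `Add4_U32` helper are hypothetical names used only for illustration, and the patch's actual preprocessor logic may differ. `SIMDE_ENABLE_NATIVE_ALIASES` is SIMDe's opt-in for exposing the `_mm_*` names, and the sketch assumes SIMDe is on the include path when the flag is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(WEBP_USE_SIMDE)
/* Assumption: SIMDe is on the include path. Native aliases make the
 * _mm_* names available on non-x86 targets such as WASM. */
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/sse2.h"
#define HAVE_SSE2_API 1 /* hypothetical helper macro */
#elif defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_API 1
#else
#define HAVE_SSE2_API 0 /* no SSE2 API available: fall back to plain C */
#endif

/* Adds four 32-bit lanes. The SSE2 path is the same source code whether
 * the intrinsics come from the native header or from SIMDe. */
static void Add4_U32(const uint32_t* a, const uint32_t* b, uint32_t* out) {
  assert(a != NULL && b != NULL && out != NULL);
#if HAVE_SSE2_API
  const __m128i va = _mm_loadu_si128((const __m128i*)a);
  const __m128i vb = _mm_loadu_si128((const __m128i*)b);
  _mm_storeu_si128((__m128i*)out, _mm_add_epi32(va, vb));
#else
  for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

The point of the sketch is that the kernel body stays untouched; only the header that supplies the intrinsics changes.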

## Lossy and lossless direct function calls
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593965

This patch removes indirect calls to the DSP functions in lossy and lossless decoding. Because this is in the hot path, this single change improved performance by 3-9% on lossy and 14-26% on lossless test images and reduced the number of indirect calls from ~100,000 to ~4,000.

Libwebp uses function pointers to call the DSP functions, and has a dynamic dispatch to decide if SIMD-enhanced functions should be used (see the functions VP8DspInit and VP8LDspInit). If so, it updates the DSP function pointers to the SIMD-enhanced version. When compiled directly to native, this is a simple indirect call. In WASM though, indirect calls require a bounds check, a table lookup, and then a jump. Because the DSP functions are on the hot path, this introduces a noticeable slowdown.
To illustrate the point, here are assembly snippets showing the calls in the native and WASM builds. The WASM build has been compiled to native code by first translating it to C with wasm2c.
Native: a call into a value stored at a global pointer.
```
call *0xcc527(%rip) # 4de748 <VP8LPredictorsAdd+0x68>
```
WASM: multiple checks before the indirect call is performed.
```
; Get the function table
452baf: 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
; Look up into the table
452bb4: 48 8b 49 38 mov 0x38(%rcx),%rcx
452bb8: 48 c1 e0 05 shl $0x5,%rax
452bbc: 48 8b 6c 01 08 mov 0x8(%rcx,%rax,1),%rbp
; Ensure it's not zero
452bc1: 48 85 ed test %rbp,%rbp
452bc4: 0f 84 ac 07 00 00 je 453376 <TRAP>
; Bounds check
452bca: 48 8b 14 01 mov (%rcx,%rax,1),%rdx
452bce: 48 8d 35 0f f2 0f 00 lea 0xff20f(%rip),%rsi
452bd5: 48 39 f2 cmp %rsi,%rdx
452bd8: 74 33 je 452c0d
452bda: 48 85 d2 test %rdx,%rdx
452bdd: 0f 84 93 07 00 00 je 453376 <TRAP>
; Type comparison
452be3: f3 0f 6f 02 movdqu (%rdx),%xmm0
452be7: f3 0f 6f 4a 10 movdqu 0x10(%rdx),%xmm1
452bec: 66 0f ef 8c 24 90 00 pxor 0x90(%rsp),%xmm1
452bf3: 00 00
452bf5: 66 0f ef 84 24 a0 00 pxor 0xa0(%rsp),%xmm0
452bfc: 00 00
452bfe: 66 0f eb c1 por %xmm1,%xmm0
452c02: 66 0f 38 17 c0 ptest %xmm0,%xmm0
452c07: 0f 85 69 07 00 00 jne 453376 <TRAP>
; Load arguments for our indirect call
452c0d: 48 8b 7c 01 18 mov 0x18(%rcx,%rax,1),%rdi
452c12: 8b 74 24 40 mov 0x40(%rsp),%esi
452c16: 44 89 ea mov %r13d,%edx
452c19: 48 8b 8c 24 b0 00 00 mov 0xb0(%rsp),%rcx
452c20: 00
452c21: 45 89 e0 mov %r12d,%r8d
; Finally make the call
452c24: ff d5 call *%rbp
TRAP:
453376: bf 06 00 00 00 mov $0x6,%edi
45337b: e8 60 04 01 00 call 4637e0 <wasm_rt_trap>
```
This change removes the extra checks by calling the DSP functions directly, including the SIMD-enhanced ones (both behind feature flags).
The entire WASM snippet above now becomes:
```
call 433f00 <w2c_decode__webp__wasmsimd_PredictorAdd1_SSE2>
```
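
The difference between the two call styles can be sketched in plain C. All names here (`AddRow_C`, `DspInit`, `ADD_ROW_DIRECT`, `HYPOTHETICAL_HAVE_SIMD`) are illustrative; they are not libwebp's identifiers and this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*AddRowFunc)(const uint8_t* in, uint8_t* out, int n);

/* Plain-C kernel standing in for a DSP function. */
static void AddRow_C(const uint8_t* in, uint8_t* out, int n) {
  for (int i = 0; i < n; ++i) out[i] = (uint8_t)(out[i] + in[i]);
}

/* Indirect style: the target is a pointer chosen at runtime by an init
 * function. In WASM a call through it goes via the function table. */
static AddRowFunc AddRow = NULL;
static void DspInit(void) { AddRow = AddRow_C; }

static void ProcessIndirect(const uint8_t* in, uint8_t* out, int n) {
  assert(AddRow != NULL); /* DspInit() must have run first */
  AddRow(in, out, n);
}

/* Direct style: the target is fixed at compile time behind a feature flag,
 * so the compiler can emit a plain call. */
#if defined(HYPOTHETICAL_HAVE_SIMD)
#define ADD_ROW_DIRECT AddRow_SIMD /* a SIMD kernel; not defined in this sketch */
#else
#define ADD_ROW_DIRECT AddRow_C
#endif

static void ProcessDirect(const uint8_t* in, uint8_t* out, int n) {
  ADD_ROW_DIRECT(in, out, n);
}
```

Both paths compute the same result; the direct one gives up runtime selection in exchange for a call the compiler can resolve statically.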
### Lossless: VP8L_USE_FAST_LOAD
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593966

This patch enables VP8L_USE_FAST_LOAD for WASM.

Enabling lossless fast load improves performance by 2-5% across test images.
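
As a rough model of what a fast load buys (a sketch, not libwebp's bit-reader code; `Load32LE`, `FillBytewise`, and `FillFast` are hypothetical names): refilling a 64-bit window with one 32-bit load leaves it in the same state as four byte-sized steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads 4 bytes as a little-endian 32-bit value using a single memcpy,
 * which compilers typically lower to one (possibly unaligned) load. */
static uint32_t Load32LE(const uint8_t* p) {
  uint32_t v;
  const uint16_t probe = 1;
  uint8_t first;
  assert(p != NULL);
  memcpy(&v, p, sizeof(v));
  memcpy(&first, &probe, 1);
  if (first != 1) { /* big-endian host: swap bytes (WASM is little-endian) */
    v = (v >> 24) | ((v >> 8) & 0x0000FF00u) |
        ((v << 8) & 0x00FF0000u) | (v << 24);
  }
  return v;
}

/* Slow path: shift one byte at a time into the top of the window. */
static uint64_t FillBytewise(uint64_t window, const uint8_t* p) {
  for (int i = 0; i < 4; ++i) {
    window >>= 8;
    window |= (uint64_t)p[i] << 56;
  }
  return window;
}

/* Fast path: shift 32 bits into the top of the window with one load. */
static uint64_t FillFast(uint64_t window, const uint8_t* p) {
  window >>= 32;
  window |= (uint64_t)Load32LE(p) << 32;
  return window;
}
```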


### Lossy: don't use USE_GENERIC_TREE
https://chromium-review.googlesource.com/c/webm/libwebp/+/5595408

This patch sets USE_GENERIC_TREE to 0 for WASM, alongside the existing ARM checks.

We found it is 2-4% faster to use the hard-coded tree on WASM.

As an aside, we also noticed a slight improvement in the native performance, so it may be worth revisiting this change for all architectures.
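
To make the trade-off concrete, here is a toy comparison of a table-driven tree walk against a hard-coded one. The four-symbol tree, the `BitSource` type, and the function names are invented for illustration; libwebp's real mode trees are larger and take their bits from a probability-driven boolean decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy bit source: one bit per array element. */
typedef struct {
  const uint8_t* bits;
  size_t pos, size;
} BitSource;

static int GetBit(BitSource* s) {
  assert(s->pos < s->size);
  return s->bits[s->pos++] != 0;
}

/* Codes: 0 -> symbol 0, 10 -> 1, 110 -> 2, 111 -> 3.
 * An entry > 0 is the index of the next node pair; an entry <= 0 is
 * the negated symbol. */
static const int8_t kToyTree[6] = {0, 1, -1, 2, -2, -3};

/* Generic walk: data-dependent table loads inside a loop. */
static int ParseGeneric(BitSource* s) {
  int i = kToyTree[GetBit(s)];
  while (i > 0) i = kToyTree[2 * i + GetBit(s)];
  return -i;
}

/* Hard-coded walk: the same tree unrolled into branches, no table loads. */
static int ParseHardCoded(BitSource* s) {
  if (!GetBit(s)) return 0;
  if (!GetBit(s)) return 1;
  return GetBit(s) ? 3 : 2;
}
```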

### Lossy: Enable 64-bit BITS caching
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593967

This patch sets BITS to 56 when compiling to WASM, enabling the use of uint64_t types. Note that the default for the lossless bitstream decoder is uint64_t, so this will bring both decoders to parity.

This change alone improves performance by 0-5%, having less of an impact when combined with direct calls.
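
A simplified model of why a wider cache helps: the reader below refills `BITS` bits at a time and counts how often it has to refill. `BitCache`, `Refill`, and `ReadBits` are hypothetical names, and the byte loop in `Refill` stands in for the single wide load a real bit reader would use; this is not libwebp's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef BITS
#define BITS 56 /* bits loaded per refill; must be a multiple of 8 */
#endif

#if BITS > 32
typedef uint64_t bit_t; /* a 56-bit refill needs a 64-bit cache */
#else
typedef uint32_t bit_t;
#endif

typedef struct {
  const uint8_t* buf;
  size_t len, pos;
  bit_t value; /* unread bits live in the low `count` bits */
  int count;
  int refills; /* bookkeeping for this sketch only */
} BitCache;

/* Appends BITS fresh bits below the unread ones; pads with zeros at the end. */
static void Refill(BitCache* br) {
  bit_t chunk = 0;
  for (int i = 0; i < BITS / 8; ++i) {
    const uint8_t byte = (br->pos < br->len) ? br->buf[br->pos++] : 0;
    chunk = (bit_t)(chunk << 8) | byte;
  }
  br->value = (bit_t)(br->value << BITS) | chunk;
  br->count += BITS;
  ++br->refills;
}

/* Reads n bits, most significant first. Fewer than 8 bits remain before any
 * refill, so the cache never overflows its type. */
static uint32_t ReadBits(BitCache* br, int n) {
  assert(n >= 1 && n <= 8);
  if (br->count < n) Refill(br);
  br->count -= n;
  return (uint32_t)((br->value >> br->count) & ((1u << n) - 1));
}
```

Reading 14 bytes through this model takes 2 refills with `BITS` set to 56, versus 5 when it is compiled with `-DBITS=24`.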

## Other questions

Answering some questions that have been asked on the commits:

__Testing on 32-bit architectures?__
I have not tested the performance numbers on a 32-bit machine. I don’t have a 32-bit machine, but I can test a 32-bit executable.

__Testing on ARM and enabling ARM NEON?__
I have not tested on an ARM device yet.

__Comparing Emscripten's intrinsics?__
This change would not replace Emscripten’s translation – SIMDe would be off by default.

We ran an experiment using Emscripten’s intrinsic translations (found here https://github.com/emscripten-core/emscripten/tree/main/system/include/compat ) in our pipeline (C code -> WASMSIMD -> C code (via wasm2c) -> x86), where the first arrow used Emscripten’s translation instead of SIMDe. We found that Emscripten's translation of the _mm_sad_epu8 intrinsic to WASMSIMD is better than SIMDe’s, so we would like to upstream that improvement to SIMDe. This would bring the translation on par for other WASM toolchains.

Regarding Emscripten’s NEON intrinsics, they also rely on SIMDe to provide this translation: https://emscripten.org/docs/porting/simd.html#compiling-simd-code-targeting-arm-neon-instruction-set

## Test Environment

The performance numbers were taken on an Intel i7-9850 machine running Debian 12.5. CPU frequency scaling was disabled and libwebp was pinned to a single core. We decoded the Lossy and Lossless images found in the WebP gallery https://developers.google.com/speed/webp/gallery . You can see our collected numbers here: https://docs.google.com/spreadsheets/d/1l9gamytAp5QAMD2mgq-uMgDm_yxvDdWQTfOo-WO-BwU/edit#gid=0

I’m happy to provide any other information to help get these patches landed.

jz… via monorail

Jun 26, 2024, 7:58:34 PM
to webp-d...@webmproject.org
Updates:
Cc: gde...@google.com pasca...@gmail.com

Comment #1 on issue 643 by jz...@google.com: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c1

Thanks for filing the bug. There's a lot here, so let's focus on the code generation. 'Lossless: VP8L_USE_FAST_LOAD', 'Lossy: don't use USE_GENERIC_TREE', and 'Lossy: Enable 64-bit BITS caching' are all fine, let's keep the discussion on the changes themselves.

I think the main question Deepti was asking was what kind of behavior you were seeing when enabling the sse and neon intrinsics in libwebp and letting the wasm compiler translate those (i.e., with `-msimd128`), rather than relying on the C code and auto-vectorization. The project doesn't have any native wasm intrinsics, but we'd like to rely on the intrinsics for the native platforms rather than having something custom for wasm.

w… via monorail

Jun 28, 2024, 10:58:30 AM
to webp-d...@webmproject.org

Comment #2 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c2

By default, WASM compilers do not support compiling SSE or NEON intrinsics, with the notable exception of Emscripten. So compiling a C file that includes an SSE header with stock clang results in a compiler error. For example:

```
> cat file.c
#include <emmintrin.h>

int add(int a, int b) {
  return a + b;
}

> clang --target=wasm32 -msimd128 -Wl,--no-entry -o file.wasm file.c
In file included from file.c:1:
/usr/lib/llvm-14/lib/clang/14.0.6/include/emmintrin.h:14:2: error: "This header is only meant to be used on x86 and x64 architecture"
#error "This header is only meant to be used on x86 and x64 architecture"
^
...
```

Emscripten gets around this by providing its own versions of the `*mmintrin.h` and `arm_neon.h` headers, which automatically translate SSE/NEON intrinsics to WASM intrinsics during compilation, something other WASM compilers do not do. With this SIMDe change, however, we are moving the translation directly into libwebp's WASM build so that it works for all WASM compilers.

The translation offered by SIMDe and Emscripten is more or less the same. See, for example, the translation of `_mm_mul_epu32` by Emscripten at https://github.com/emscripten-core/emscripten/blob/main/system/include/compat/emmintrin.h#L701 which is the same translation by SIMDe here https://github.com/simd-everywhere/simde/blob/master/simde/x86/sse2.h#L4289-L4292. Emscripten has one extra minor optimization in one intrinsic translation that I'm planning on upstreaming to SIMDe over the next few weeks to ensure there is parity.
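
For reference, the behavior that any translation of `_mm_mul_epu32` has to reproduce can be written in scalar C. This is a sketch of the intrinsic's documented semantics, not the code from either SIMDe or Emscripten, and `MulEpu32_Ref` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of _mm_mul_epu32: view each 128-bit input as four 32-bit
 * lanes, multiply the low 32 bits of each 64-bit element (lanes 0 and 2)
 * as unsigned values, and produce two full 64-bit products. */
static void MulEpu32_Ref(const uint32_t a[4], const uint32_t b[4],
                         uint64_t out[2]) {
  assert(a != NULL && b != NULL && out != NULL);
  out[0] = (uint64_t)a[0] * b[0]; /* lanes 1 and 3 are ignored */
  out[1] = (uint64_t)a[2] * b[2];
}
```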

w… via monorail

Jun 28, 2024, 10:59:06 AM
to webp-d...@webmproject.org

Comment #3 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c3

@deepti: For clarity, did you mean "how does Emscripten's translation layer compare to SIMDe's translation layer"? If so, for SSE, we found the translation to be nearly the same, barring the optimization I plan to upstream. For NEON, it turns out Emscripten actually uses SIMDe's translation layer.

jz… via monorail

Jul 1, 2024, 3:41:23 PM
to webp-d...@webmproject.org

Comment #4 on issue 643 by jz...@google.com: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c4


> By default WASM compilers do not support compiling with SSE or NEON intrinsics with the notable exception of Emscripten.

I wasn't aware of Wasm compilers aside from Emscripten. Do you have a use case for another compiler and an application where webp isn't natively embedded?

w… via monorail

Jul 2, 2024, 4:43:11 PM
to webp-d...@webmproject.org

Comment #5 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c5


> Do you have a use case for another compiler and an application where webp isn't natively embedded?

Yup! Currently there are a large number of use cases that compile their C/C++/Rust code to WASM using stock clang/rustc together with the WebAssembly System Interface (WASI). WASI is a spec that describes how WASM modules can interact with the outside world; see https://github.com/WebAssembly/wasi-sdk.

Two production use cases that use "stock clang/rustc + WASI" to compile to WASM are:
1. Server-side computations, such as those provided by Fastly's Compute platform or Cloudflare's Workers. These are WASM server platforms where clients can deploy and run their code as WASM modules.
2. Software sandboxing, as done by RLBox (https://rlbox.dev/) in Firefox. Firefox ships some of its browser code as compiled WASM modules to end users.

When libwebp is compiled via the "stock clang + WASI" approach, it has unacceptably high overhead compared to native, which has prevented the use of libwebp in use case 2 above. With the changes in my patches, the performance of WASM libwebp is much closer to native, which would unblock use case 2 (and also benefit anyone using libwebp in use case 1).

jz… via monorail

Jul 9, 2024, 9:47:56 PM
to webp-d...@webmproject.org

Comment #6 on issue 643 by jz...@google.com: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c6

Thanks for the background. I'd still like to land the changes outside of simde first and allow Deepti to respond. There are a few open comments on the simpler changes.

As for the rest, we can look at minimizing them or restructuring things so the changes are easy to carry as a patch if we decide not to incorporate them into the tree.

gdee… via monorail

Jul 11, 2024, 7:42:57 PM
to webp-d...@webmproject.org

Comment #7 on issue 643 by gde...@chromium.org: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c7

I'm wondering if there's an external translation layer that you can introduce in your compile toolchain instead of building dependencies into libwebp? Took a quick look at the rustc SIMD support, and compat headers there likely will not be too intensive, with the added bonus that they'll also be beneficial to other libraries outside of libwebp as well.

The LLVM backend for Wasm through stock clang should work the same way as it does from emscripten for generating wasm-simd opcodes - perhaps you could elaborate more on how you're using the WASI + stock clang toolchain that makes this challenging? Overall I think that the translation layer should live in the toolchain, and not libraries as you're proposing here.

w… via monorail

Jul 17, 2024, 4:36:50 PM
to webp-d...@webmproject.org

Comment #8 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c8

It seems the general pattern in libwebp and other media libraries is to specialize code for the target platform. Targeting Wasm would therefore mean rewriting libwebp functions to use intrinsics from the `<wasm_simd128.h>` header, which would require a large dedicated engineering effort. Using SIMD-everywhere as a dependency to translate SSE intrinsics to Wasm SIMD intrinsics instead offers a minimal-effort way to extract significant performance gains from Wasm SIMD.

We did consider improving the toolchain to translate SSE/NEON to Wasm SIMD as well. However, from what we’ve gathered, the maintainers of wasi-clang/LLVM are pushing developers to use the Wasm SIMD intrinsics directly rather than auto-translating intrinsics. Our impression is that such toolchain-level auto-translation for Wasm would be similar to asking armv8-clang to translate SSE intrinsics directly to NEON, rather than having developers write code using NEON intrinsics.

Fwiw, we would love to use wasi-clang to auto-translate if you could convince the wasi-clang developers to add this support. However, in the absence of this support, we believe translation in the library source/build is the best viable option.