Status: New
Owner: ----
Labels: Type-Defect Priority-Medium
New issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643
Howdy folks!
This issue centralizes the ongoing discussions on a collection of patches that improve libwebp WASM decoding performance. Together these patches speed up decoding by 48-78% for lossy images and 24-53% for lossless images.
The changes and their rationale follow, ordered from most impactful to least.
## Lossy and lossless SIMD SSE2/SSE4.1 intrinsics via SIMD-everywhere (SIMDe)
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593964
Single-Instruction Multiple Data (SIMD) instructions are CPU vector instructions that parallelize computations. Different architectures, such as ARM, Intel, and MIPS, each have their own SIMD extensions (ARM NEON/SVE, Intel SSE/AVX, and MIPS MSA). When compiling, a developer can take advantage of SIMD instructions by writing architecture-specific instructions, using compiler-provided intrinsics, or letting the compiler perform autovectorization. Libwebp has a combination of all three, and disabling SIMD shows a 50% slowdown on an amd64 machine.
The WASM platform has its own 128-bit SIMD extension that is roughly a middle ground between SSE and NEON, and compilers like wasi-clang expose WASM-specific SIMD intrinsics.
Currently, libwebp does not have WASM-specific SIMD routines. To work around this, Emscripten provides its own translation from architecture-specific SIMD intrinsics to WASM intrinsics (see
https://emscripten.org/docs/porting/simd.html). This support is specific to Emscripten though, and is not available to other WASM toolchains.
Our key contribution is adding SIMD-everywhere (SIMDe;
https://github.com/simd-everywhere/simde) to libwebp to improve decoding performance. SIMDe is a header-only library that translates between architecture-specific intrinsics, and here we use it to translate from SSE2/SSE4.1 to WASM SIMD. This gave a speedup of 32-57% on lossy images and 1-20% on lossless images.
Applying this patch and compiling with -DWEBP_USE_SIMDE will use the SIMDe header instead of *mmintrin.h.
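To make the header switch concrete, here is a minimal sketch of how such a selection could look. It is not the code from the patch: `HAVE_SSE2_API` and `AddSat16` are hypothetical names, and we assume SIMDe is on the include path when `WEBP_USE_SIMDE` is defined. `SIMDE_ENABLE_NATIVE_ALIASES` is SIMDe's own switch for exposing the usual `_mm_*` names.

```c
#include <assert.h>
#include <stdint.h>

#if defined(WEBP_USE_SIMDE)
// Assumption: SIMDe headers are available; native aliases expose _mm_* names.
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse2.h>
#define HAVE_SSE2_API 1  // hypothetical macro, for this sketch only
#elif defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_API 1
#else
#define HAVE_SSE2_API 0
#endif

// Hypothetical helper: saturating add of 16 unsigned bytes.
// The SSE2 body is identical whether the intrinsics come from
// emmintrin.h or from SIMDe.
static void AddSat16(const uint8_t* a, const uint8_t* b, uint8_t* out) {
  assert(a != 0 && b != 0 && out != 0);
#if HAVE_SSE2_API
  const __m128i va = _mm_loadu_si128((const __m128i*)a);
  const __m128i vb = _mm_loadu_si128((const __m128i*)b);
  _mm_storeu_si128((__m128i*)out, _mm_adds_epu8(va, vb));
#else
  // Scalar fallback with the same semantics.
  for (int i = 0; i < 16; ++i) {
    const int s = a[i] + b[i];
    out[i] = (s > 255) ? 255 : (uint8_t)s;
  }
#endif
}
```

The point of the sketch is that the existing SSE2 code paths stay untouched; only the header that provides the intrinsics changes.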
## Lossy and lossless direct function calls
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593965
This patch removes indirect calls to the DSP functions in lossy and lossless decoding. Because this is in the hot path, this single change improved performance by 3-9% on lossy and 14-26% on lossless test images and reduced the number of indirect calls from ~100,000 to ~4,000.
Libwebp uses function pointers to call the DSP functions, and has a dynamic dispatch to decide if SIMD-enhanced functions should be used (see the functions VP8DspInit and VP8LDspInit). If so, it updates the DSP function pointers to the SIMD-enhanced version. When compiled directly to native, this is a simple indirect call. In WASM though, indirect calls require a bounds check, a table lookup, and then a jump. Because the DSP functions are on the hot path, this introduces a noticeable slowdown.
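The checks can be modeled in plain C. This is a simplified sketch with hypothetical names (`TableEntry`, `CallIndirectBinary`, and so on), not actual wasm2c output or the WASM runtime's real data layout; it only mirrors the three steps described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified model of one WASM function table slot.
typedef void (*GenericFunc)(void);
typedef struct {
  uint32_t type_id;  // stands in for the function signature
  GenericFunc func;  // NULL for an uninitialized slot
} TableEntry;

typedef int (*BinaryOp)(int, int);
enum { TYPE_BINARY_OP = 1 };
enum { CALL_OK = 0, CALL_TRAP = 1 };

static int Add(int a, int b) { return a + b; }

static const TableEntry kTable[2] = {
  { 0, NULL },                           // empty slot
  { TYPE_BINARY_OP, (GenericFunc)Add },  // valid slot
};

// Every indirect call pays for these checks before the jump.
static int CallIndirectBinary(const TableEntry* table, size_t table_size,
                              uint32_t index, int a, int b, int* result) {
  assert(table != NULL && result != NULL);
  if (index >= table_size) return CALL_TRAP;           // bounds check
  if (table[index].func == NULL) return CALL_TRAP;     // null check
  if (table[index].type_id != TYPE_BINARY_OP) return CALL_TRAP;  // type check
  *result = ((BinaryOp)table[index].func)(a, b);       // the actual call
  return CALL_OK;
}
```

A native indirect call is only the last line; the three `if` statements are the extra cost per call in this model.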
To illustrate the point, here are assembly snippets showing the calls in Native and WASM. The WASM has been compiled to native by first translating to C with wasm2c.
Native: a call into a value stored at a global pointer.
```
call *0xcc527(%rip) # 4de748 <VP8LPredictorsAdd+0x68>
```
WASM: multiple checks before the indirect call is performed.
```
; Get the function table
452baf: 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
; Look up into the table
452bb4: 48 8b 49 38 mov 0x38(%rcx),%rcx
452bb8: 48 c1 e0 05 shl $0x5,%rax
452bbc: 48 8b 6c 01 08 mov 0x8(%rcx,%rax,1),%rbp
; Ensure it's not zero
452bc1: 48 85 ed test %rbp,%rbp
452bc4: 0f 84 ac 07 00 00 je 453376 <TRAP>
; Bounds check
452bca: 48 8b 14 01 mov (%rcx,%rax,1),%rdx
452bce: 48 8d 35 0f f2 0f 00 lea 0xff20f(%rip),%rsi
452bd5: 48 39 f2 cmp %rsi,%rdx
452bd8: 74 33 je 452c0d
452bda: 48 85 d2 test %rdx,%rdx
452bdd: 0f 84 93 07 00 00 je 453376 <TRAP>
; Type comparison
452be3: f3 0f 6f 02 movdqu (%rdx),%xmm0
452be7: f3 0f 6f 4a 10 movdqu 0x10(%rdx),%xmm1
452bec: 66 0f ef 8c 24 90 00 pxor 0x90(%rsp),%xmm1
452bf3: 00 00
452bf5: 66 0f ef 84 24 a0 00 pxor 0xa0(%rsp),%xmm0
452bfc: 00 00
452bfe: 66 0f eb c1 por %xmm1,%xmm0
452c02: 66 0f 38 17 c0 ptest %xmm0,%xmm0
452c07: 0f 85 69 07 00 00 jne 453376 <TRAP>
; Load arguments for our indirect call
452c0d: 48 8b 7c 01 18 mov 0x18(%rcx,%rax,1),%rdi
452c12: 8b 74 24 40 mov 0x40(%rsp),%esi
452c16: 44 89 ea mov %r13d,%edx
452c19: 48 8b 8c 24 b0 00 00 mov 0xb0(%rsp),%rcx
452c20: 00
452c21: 45 89 e0 mov %r12d,%r8d
; Finally make the call
452c24: ff d5 call *%rbp
TRAP:
453376: bf 06 00 00 00 mov $0x6,%edi
45337b: e8 60 04 01 00 call 4637e0 <wasm_rt_trap>
```
This change removes the extra checks by calling the DSP functions directly, including the SIMD-enhanced ones (both behind feature flags).
The entire WASM snippet above now becomes:
```
call 433f00 <w2c_decode__webp__wasmsimd_PredictorAdd1_SSE2>
```
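At the source level, the idea can be sketched as follows. The names here (`SKETCH_DIRECT_CALLS`, `AddPixels_C`, `ADD_PIXELS`) are hypothetical and are not libwebp's real symbols or build flags; the sketch only shows how a compile-time flag can swap a function-pointer call for a direct one.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical names throughout; not libwebp's real symbols or flags.
typedef void (*AddPixelsFunc)(const uint8_t* in, int n, uint8_t* out);

static void AddPixels_C(const uint8_t* in, int n, uint8_t* out) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i) out[i] = (uint8_t)(out[i] + in[i]);
}

// Dynamic dispatch: a pointer that an init function could retarget
// to a SIMD implementation at run time.
static AddPixelsFunc AddPixelsPtr = AddPixels_C;

#if defined(SKETCH_DIRECT_CALLS)
// Direct call: the target is fixed at compile time, so no table lookup.
#define ADD_PIXELS(in, n, out) AddPixels_C((in), (n), (out))
#else
// Indirect call through the function pointer.
#define ADD_PIXELS(in, n, out) AddPixelsPtr((in), (n), (out))
#endif

static void ProcessRow(const uint8_t* in, int n, uint8_t* out) {
  ADD_PIXELS(in, n, out);  // hot-path call site
}
```

The trade-off in this sketch is that the direct variant gives up run-time selection of the implementation, which is why it would sit behind a build flag.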
### Lossless: VP8L_USE_FAST_LOAD
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593966
This patch enables VP8L_USE_FAST_LOAD for WASM.
Enabling lossless fast load improves performance by 2-5% across test images.
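As we understand it, the fast path refills the bit window with one 32-bit load instead of shifting in bytes one at a time. The following is a simplified sketch of that idea with hypothetical names (`BitWindow`, `RefillBytewise`, `RefillFast`); it is not libwebp's actual bit reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical, simplified bit window; not libwebp's VP8LBitReader.
typedef struct {
  uint64_t val;        // prefetched bits, consumed from the low end
  int bit_pos;         // number of bits already consumed from val
  const uint8_t* buf;  // input bytes
  size_t len;          // input size in bytes
  size_t pos;          // next byte to load
} BitWindow;

// Little-endian 32-bit load; memcpy usually compiles to a single load.
static uint32_t LoadLE32(const uint8_t* p) {
  uint32_t v;
  const uint16_t probe = 1;
  memcpy(&v, p, sizeof(v));
  if (*(const uint8_t*)&probe == 0) {  // big-endian host: swap bytes
    v = (v >> 24) | ((v >> 8) & 0xff00u) | ((v << 8) & 0xff0000u) | (v << 24);
  }
  return v;
}

// Slow path: shift in one byte per iteration.
static void RefillBytewise(BitWindow* w) {
  assert(w != NULL);
  while (w->bit_pos >= 8 && w->pos < w->len) {
    w->val >>= 8;
    w->val |= (uint64_t)w->buf[w->pos] << 56;
    ++w->pos;
    w->bit_pos -= 8;
  }
}

// Fast path: shift in four bytes at once when possible.
static void RefillFast(BitWindow* w) {
  assert(w != NULL);
  if (w->bit_pos >= 32 && w->pos + 4 <= w->len) {
    w->val >>= 32;
    w->val |= (uint64_t)LoadLE32(w->buf + w->pos) << 32;
    w->pos += 4;
    w->bit_pos -= 32;
    return;
  }
  RefillBytewise(w);  // near the end of the buffer, fall back
}
```

Both functions leave the window in the same state; the fast one does it with one load and one shift instead of four of each.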
### Lossy: don't use USE_GENERIC_TREE
https://chromium-review.googlesource.com/c/webm/libwebp/+/5595408
This patch sets USE_GENERIC_TREE to 0 for WASM, alongside the existing ARM checks.
We found it is 2-4% faster to use the hard-coded tree on WASM.
As an aside, we also noticed a slight improvement in the native performance, so it may be worth revisiting this change for all architectures.
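For readers unfamiliar with the two approaches, here is a toy comparison of a table-driven tree walk and a hard-coded one. The tree, the raw bit source, and all names (`kToyTree`, `DecodeGeneric`, `DecodeHardCoded`) are made up for illustration; the real decoder reads arithmetic-coded bits and uses a larger tree.

```c
#include <assert.h>
#include <stdint.h>

// Toy bit source that returns raw bits, lowest bit first.
typedef struct {
  uint32_t bits;
  int pos;
} ToyBitSource;

static int GetBit(ToyBitSource* s) {
  assert(s->pos < 32);
  return (int)((s->bits >> s->pos++) & 1u);
}

// Toy 4-symbol tree. Node i has its children at [2*i] and [2*i+1];
// positive entries are node indices, zero or negative entries are
// leaves holding the negated symbol.
static const int8_t kToyTree[6] = { 0, 1, -1, 2, -2, -3 };

// Generic approach: walk the table, one data-dependent load per bit.
static int DecodeGeneric(ToyBitSource* s) {
  int i = kToyTree[GetBit(s)];
  while (i > 0) i = kToyTree[2 * i + GetBit(s)];
  return -i;
}

// Hard-coded approach: the same tree written out as branches.
static int DecodeHardCoded(ToyBitSource* s) {
  return !GetBit(s) ? 0 : !GetBit(s) ? 1 : !GetBit(s) ? 2 : 3;
}
```

Both decoders return the same symbols; which one is faster depends on the target, which is what the measurement above is about.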
### Lossy: Enable 64-bit BITS caching
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593967
This patch sets BITS to 56 when compiling to WASM, enabling the use of uint64_t types. Note that the default for the lossless bitstream decoder is uint64_t, so this will bring both decoders to parity.
This change alone improves performance by 0-5%, having less of an impact when combined with direct calls.
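A sketch of what such a selection could look like is below. The exact condition used in the patch is not shown here, so the `__wasm__` test (a macro clang predefines for WASM targets), the fallback value of 24, and the names `bit_window_t` and `BytesPerRefill` are assumptions for illustration. WASM has a native 64-bit integer type even on 32-bit address spaces, which is why a 64-bit window is reasonable there.

```c
#include <assert.h>
#include <stdint.h>

// Assumption: the branch conditions below are illustrative only.
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__) || \
    defined(__wasm__)
#define BITS 56  // bits prefetched per refill; needs a 64-bit window
#else
#define BITS 24  // fits in a 32-bit window
#endif

#if (BITS > 32)
typedef uint64_t bit_window_t;  // hypothetical type name
#else
typedef uint32_t bit_window_t;
#endif

// Number of whole bytes loaded per refill (hypothetical helper).
static int BytesPerRefill(void) {
  assert(BITS % 8 == 0 && BITS < 8 * (int)sizeof(bit_window_t));
  return BITS / 8;
}
```

With BITS at 56, each refill brings in seven bytes instead of three, so the refill code runs less often.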
## Other questions
Answering some questions that have been asked on the commits:
__Testing on 32-bit architectures?__
I have not tested the performance numbers on a 32-bit machine. I don’t have a 32-bit machine, but I can test a 32-bit executable.
__Testing on ARM and enabling ARM NEON?__
I have not tested on an Arm device yet.
__Comparing Emscripten's intrinsics?__
This change would not replace Emscripten’s translation – SIMDe would be off by default.
We ran an experiment using Emscripten’s intrinsic translations (found here
https://github.com/emscripten-core/emscripten/tree/main/system/include/compat ) and running the result through our pipeline (C code -> WASMSIMD -> C code (via wasm2c) -> x86). The first arrow used Emscripten’s translation instead of SIMDe. We found that Emscripten’s translation of the _mm_sad_epu8 intrinsic to WASMSIMD is better than SIMDe’s, so we would like to upstream that improvement to SIMDe. This would bring the translation on par for other WASM toolchains.
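For context, this is what _mm_sad_epu8 computes, written as a scalar reference. The helper name `SadEpu8Reference` is ours, and this is not either library's translation; it only states the result any translation has to produce.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Scalar reference for _mm_sad_epu8: for each 8-byte half of the two
// 16-byte inputs, sum the absolute differences of the unsigned bytes.
// The two sums land in the two 64-bit lanes of the result.
static void SadEpu8Reference(const uint8_t a[16], const uint8_t b[16],
                             uint64_t out[2]) {
  assert(a != NULL && b != NULL && out != NULL);
  for (int half = 0; half < 2; ++half) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i) {
      sum += (uint64_t)abs((int)a[8 * half + i] - (int)b[8 * half + i]);
    }
    out[half] = sum;
  }
}
```

WASM SIMD has no single instruction for this, so a translation has to build it from several operations, which leaves room for one sequence to be faster than another.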
Regarding Emscripten’s NEON intrinsics, they also rely on SIMDe to provide this translation:
https://emscripten.org/docs/porting/simd.html#compiling-simd-code-targeting-arm-neon-instruction-set
## Test Environment
The performance numbers were taken on an Intel i7-9850 machine running Debian 12.5. CPU frequency scaling was disabled and libwebp was pinned to a single core. We decoded the Lossy and Lossless images found in the WebP gallery
https://developers.google.com/speed/webp/gallery . You can see our collected numbers here:
https://docs.google.com/spreadsheets/d/1l9gamytAp5QAMD2mgq-uMgDm_yxvDdWQTfOo-WO-BwU/edit#gid=0
I’m happy to provide any other information to help get these patches landed.