Issue 643 in webp: libwebp WASM decoding performance

w… via monorail

Jun 25, 2024, 10:41:14 AM

to webp-d...@webmproject.org

Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643

Howdy folks!

This issue centralizes the ongoing discussions on a collection of patches that improve libwebp WASM decoding performance. Together, these patches speed up decoding by 48-78% for lossy images and 24-53% for lossless images.

The changes and their rationale are listed below, ordered from most to least impactful.

## Lossy and lossless SIMD SSE2/SSE4.1 intrinsics via SIMD-everywhere (SIMDe)
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593964 

Single-Instruction Multiple Data (SIMD) instructions are CPU vector instructions that parallelize computations. Different architectures, such as ARM, Intel, and MIPS, each have their own SIMD extensions (ARM NEON/SVE, Intel SSE/AVX, and MIPS MSA). A developer can take advantage of SIMD instructions by writing architecture-specific assembly, by using compiler-provided intrinsics, or by letting the compiler autovectorize. Libwebp uses a combination of all three, and disabling SIMD results in a 50% slowdown on an amd64 machine.

The WASM platform has its own 128-bit SIMD instruction set, which is roughly the common ground between SSE and NEON, and compilers like wasi-clang expose WASM-specific SIMD intrinsics (through the `wasm_simd128.h` header).

Currently, libwebp does not have WASM-specific SIMD routines. To work around this, Emscripten provides its own translation from architecture-specific SIMD intrinsics to WASM intrinsics (see https://emscripten.org/docs/porting/simd.html). This support is specific to Emscripten, though, and does not apply to other WASM toolchains.

Our key contribution is the inclusion of SIMD-everywhere (SIMDe; https://github.com/simd-everywhere/simde) to libwebp to improve decoding performance. SIMDe is a header-only library that translates between architecture-specific intrinsics, and here we are using it to translate from SSE2/SSE4.1 to WASM SIMD. This provided a speedup of 32-57% on lossy images, and 1-20% on lossless images.

Applying this patch and compiling with -DWEBP_USE_SIMDE will use the SIMDe header instead of *mmintrin.h.
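
As a rough illustration of what such a header switch can look like, here is a minimal sketch. Only the `WEBP_USE_SIMDE` flag comes from the patch description; the function name `AddSat16`, the `HAVE_SSE2_INTRINSICS` macro, and the scalar fallback are assumptions made for this example and are not taken from the patch. `SIMDE_ENABLE_NATIVE_ALIASES` is SIMDe's documented switch for exposing the usual `_mm_*` names.

```c
#include <assert.h>
#include <stdint.h>

/* WEBP_USE_SIMDE is the flag named above; the rest is illustrative. */
#if defined(WEBP_USE_SIMDE)
#define SIMDE_ENABLE_NATIVE_ALIASES /* SIMDe then provides the _mm_* names */
#include <simde/x86/sse2.h>
#define HAVE_SSE2_INTRINSICS 1
#elif defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_INTRINSICS 1
#else
#define HAVE_SSE2_INTRINSICS 0 /* no SSE2 API available: use plain C */
#endif

/* Adds two rows of 16 bytes with unsigned saturation (hypothetical helper). */
static void AddSat16(const uint8_t* a, const uint8_t* b, uint8_t* out) {
  assert(a != 0 && b != 0 && out != 0);
#if HAVE_SSE2_INTRINSICS
  /* Same source for native SSE2 and for SIMDe's translated version. */
  const __m128i va = _mm_loadu_si128((const __m128i*)a);
  const __m128i vb = _mm_loadu_si128((const __m128i*)b);
  _mm_storeu_si128((__m128i*)out, _mm_adds_epu8(va, vb));
#else
  for (int i = 0; i < 16; ++i) {
    const int sum = a[i] + b[i];
    out[i] = (uint8_t)(sum > 255 ? 255 : sum);
  }
#endif
}
```

The point of the sketch is that the SSE2 code path itself stays unchanged; only the header that supplies the intrinsics differs between a native build and a WASM build.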

## Lossy and lossless direct function calls
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593965 

This patch removes indirect calls to the DSP functions in lossy and lossless decoding. Because this is in the hot path, this single change improved performance by 3-9% on lossy and 14-26% on lossless test images and reduced the number of indirect calls from ~100,000 to ~4,000.

Libwebp calls the DSP functions through function pointers and uses dynamic dispatch to decide whether SIMD-enhanced functions should be used (see the functions VP8DspInit and VP8LDspInit). If so, it updates the DSP function pointers to point at the SIMD-enhanced versions. When compiled directly to native code, this is a simple indirect call. In WASM, though, indirect calls require a bounds check, a table lookup, and then a jump. Because the DSP functions are on the hot path, this introduces a noticeable slowdown.

To illustrate the point, here are assembly snippets showing the calls in native code and in WASM. The WASM has been compiled to native code by first translating it to C with wasm2c.

Native: a call through a value stored at a global pointer.
``` 
call   *0xcc527(%rip)        # 4de748 <VP8LPredictorsAdd+0x68>
```
WASM: multiple checks before the indirect call is performed.
``` 
; Get the function table
  452baf:   48 8b 4c 24 20          mov    0x20(%rsp),%rcx
  ; Look up into the table
  452bb4:   48 8b 49 38             mov    0x38(%rcx),%rcx
  452bb8:   48 c1 e0 05             shl    $0x5,%rax
  452bbc:   48 8b 6c 01 08          mov    0x8(%rcx,%rax,1),%rbp
  ; Ensure it's not zero
  452bc1:   48 85 ed                test   %rbp,%rbp
  452bc4:   0f 84 ac 07 00 00       je     453376 <TRAP>
  ; Type check, fast path: does the entry's type pointer equal the expected one?
  452bca:   48 8b 14 01             mov    (%rcx,%rax,1),%rdx
  452bce:   48 8d 35 0f f2 0f 00    lea    0xff20f(%rip),%rsi
  452bd5:   48 39 f2                cmp    %rsi,%rdx
  452bd8:   74 33                   je     452c0d
  452bda:   48 85 d2                test   %rdx,%rdx
  452bdd:   0f 84 93 07 00 00       je     453376 <TRAP>
  ; Type comparison
  452be3:   f3 0f 6f 02             movdqu (%rdx),%xmm0
  452be7:   f3 0f 6f 4a 10          movdqu 0x10(%rdx),%xmm1
  452bec:   66 0f ef 8c 24 90 00    pxor   0x90(%rsp),%xmm1
  452bf3:   00 00
  452bf5:   66 0f ef 84 24 a0 00    pxor   0xa0(%rsp),%xmm0
  452bfc:   00 00
  452bfe:   66 0f eb c1             por    %xmm1,%xmm0
  452c02:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
  452c07:   0f 85 69 07 00 00       jne    453376 <TRAP>
  ; Load arguments for our indirect call
  452c0d:   48 8b 7c 01 18          mov    0x18(%rcx,%rax,1),%rdi
  452c12:   8b 74 24 40             mov    0x40(%rsp),%esi
  452c16:   44 89 ea                mov    %r13d,%edx
  452c19:   48 8b 8c 24 b0 00 00    mov    0xb0(%rsp),%rcx
  452c20:   00
  452c21:   45 89 e0                mov    %r12d,%r8d
  ; Finally make the call
  452c24:   ff d5                   call   *%rbp
TRAP:
  453376:   bf 06 00 00 00          mov    $0x6,%edi
  45337b:   e8 60 04 01 00          call   4637e0 <wasm_rt_trap>
```

This change removes the extra checks by calling the DSP functions directly, including the SIMD-enhanced ones (both behind feature flags).

The entire WASM snippet above now becomes:
```
call   433f00 <w2c_decode__webp__wasmsimd_PredictorAdd1_SSE2>
```
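
At the source level, the difference between the two call styles can be sketched as follows. This is a simplified illustration rather than the patch itself: the names `AddOne_C`, `AddOnePtr`, `ToyDspInit`, `ADD_ONE`, and the `USE_DIRECT_CALLS` flag are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*AddRowFunc)(const uint8_t* in, uint8_t* out, int n);

/* A stand-in for one DSP routine. */
static void AddOne_C(const uint8_t* in, uint8_t* out, int n) {
  for (int i = 0; i < n; ++i) out[i] = (uint8_t)(in[i] + 1);
}

/* Indirect style: a function pointer that an init routine fills in. */
static AddRowFunc AddOnePtr = NULL;
static void ToyDspInit(void) { AddOnePtr = AddOne_C; }

/* Direct style: the callee is fixed at compile time (hypothetical flag). */
#if defined(USE_DIRECT_CALLS)
#define ADD_ONE(in, out, n) AddOne_C((in), (out), (n))
#else
#define ADD_ONE(in, out, n) AddOnePtr((in), (out), (n))
#endif

/* Hot loop: one call per row, so the call style is paid `rows` times. */
static void ProcessRows(const uint8_t* in, uint8_t* out, int n, int rows) {
  assert(n > 0 && rows > 0);
  for (int r = 0; r < rows; ++r) {
    ADD_ONE(in + (size_t)r * n, out + (size_t)r * n, n);
  }
}
```

With the direct style, the WASM compiler sees an ordinary `call` to a known function, so the table lookup and the checks that guard it are not needed.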

### Lossless: VP8L_USE_FAST_LOAD
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593966 

This patch enables VP8L_USE_FAST_LOAD for WASM.

Enabling lossless fast load improves performance by 2-5% across test images.


### Lossy: don't use USE_GENERIC_TREE
https://chromium-review.googlesource.com/c/webm/libwebp/+/5595408 

This patch sets USE_GENERIC_TREE to 0 for WASM, alongside the existing ARM checks.

We found it is 2-4% faster to use the hard-coded tree on WASM. 

As an aside, we also noticed a slight improvement in the native performance, so it may be worth revisiting this change for all architectures.
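
For readers unfamiliar with the two styles, here is a toy comparison of a table-driven ("generic") tree walk and a hard-coded one. It is only a sketch: the four-symbol tree, the `ToyBitSource` type, and the function names are invented for this example, and the real decoder reads probability-coded bits rather than bits from an array.

```c
#include <assert.h>
#include <stdint.h>

/* Toy bit source: each array entry is one already-decoded bit (0 or 1). */
typedef struct {
  const uint8_t* bits;
  int pos;
} ToyBitSource;

static int ToyGetBit(ToyBitSource* src) {
  assert(src->bits[src->pos] <= 1);
  return src->bits[src->pos++];
}

/* Generic form: a positive entry is the index of the next node pair,
   a zero or negative entry is a leaf holding the negated symbol. */
static const int8_t kToyTree[6] = { 0, 2, -1, 4, -2, -3 };

static int ReadSymbolGeneric(ToyBitSource* src) {
  int i = 0;
  do {
    i = kToyTree[i + ToyGetBit(src)];
  } while (i > 0);
  return -i;
}

/* Hard-coded form: the same tree written out as branches, no table loads. */
static int ReadSymbolHardCoded(ToyBitSource* src) {
  if (!ToyGetBit(src)) return 0;
  if (!ToyGetBit(src)) return 1;
  return ToyGetBit(src) ? 3 : 2;
}
```

Both functions decode the same symbols; the hard-coded form trades a data-dependent table load per step for fixed branches, which is one plausible reason for the difference measured above.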

### Lossy: Enable 64-bit BITS caching
https://chromium-review.googlesource.com/c/webm/libwebp/+/5593967 

This patch sets BITS to 56 when compiling to WASM, enabling the use of uint64_t types. Note that the default for the lossless bitstream decoder is uint64_t, so this will bring both decoders to parity.

This change alone improves performance by 0-5%, with less of an impact when combined with direct calls.
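
To show why a wider cache can help, here is a minimal sketch of a bit reader that refills a `uint64_t` cache 56 bits at a time. It is a plain bit reader, not libwebp's arithmetic decoder, and the `ToyBitReader` type and its functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS 56 /* bits loaded per refill: 7 whole bytes fit in a uint64_t */

typedef struct {
  uint64_t value;     /* cached input; the low `bits` bits are unread */
  int bits;           /* number of unread bits in `value` */
  const uint8_t* buf; /* next input byte */
  const uint8_t* end; /* one past the last input byte */
} ToyBitReader;

static void ToyInit(ToyBitReader* br, const uint8_t* data, size_t size) {
  br->value = 0;
  br->bits = 0;
  br->buf = data;
  br->end = data + size;
}

/* Loads up to BITS bits in one go. It is only called with fewer than
   8 bits cached, so the cache never holds more than 7 + 56 = 63 bits. */
static void ToyRefill(ToyBitReader* br) {
  for (int i = 0; i < BITS / 8 && br->buf < br->end; ++i) {
    br->value = (br->value << 8) | *br->buf++;
    br->bits += 8;
  }
}

/* Reads n bits (1..8), most significant bit first; returns 0 past the end. */
static uint32_t ToyReadBits(ToyBitReader* br, int n) {
  assert(n >= 1 && n <= 8);
  if (br->bits < n) ToyRefill(br);
  if (br->bits < n) return 0;
  br->bits -= n;
  return (uint32_t)(br->value >> br->bits) & ((1u << n) - 1u);
}
```

With a smaller value of `BITS` (for example 24), the refill path runs more than twice as often for the same input, which is the cost the patch aims to reduce on WASM.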

## Other questions

Answering some questions that have been asked on the commits:

__Testing on 32-bit architectures?__
I have not tested the performance numbers on a 32-bit machine. I don’t have a 32-bit machine, but I can test a 32-bit executable.

__Testing on ARM and enabling ARM NEON?__
I have not tested on an ARM device yet.

__Comparing Emscripten's intrinsics?__
This change would not replace Emscripten’s translation – SIMDe would be off by default.

We ran an experiment using Emscripten’s intrinsic translations (found here https://github.com/emscripten-core/emscripten/tree/main/system/include/compat ) in our pipeline (C code -> WASM SIMD -> C code (via wasm2c) -> x86), where the first arrow used Emscripten’s translation instead of SIMDe. We found that Emscripten’s translation of the _mm_sad_epu8 intrinsic to WASM SIMD is better than SIMDe’s, so we would like to upstream that improvement to SIMDe. This would bring the translation on par for other WASM toolchains.
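
For context on what is being translated, the following is a scalar reference for the result that `_mm_sad_epu8` computes: for each 64-bit half of the 128-bit inputs, the sum of absolute differences of the eight unsigned bytes. The function name `SadEpu8_Ref` is hypothetical, and the sketch says nothing about how either Emscripten or SIMDe implements the translation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for _mm_sad_epu8: out[0] covers bytes 0..7 and
   out[1] covers bytes 8..15. Each sum is at most 8 * 255 = 2040. */
static void SadEpu8_Ref(const uint8_t a[16], const uint8_t b[16],
                        uint64_t out[2]) {
  assert(a != 0 && b != 0 && out != 0);
  for (int half = 0; half < 2; ++half) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i) {
      const int d = (int)a[half * 8 + i] - (int)b[half * 8 + i];
      sum += (uint64_t)(d < 0 ? -d : d);
    }
    out[half] = sum;
  }
}
```

WASM SIMD has no single instruction for this operation, so any translation has to build it from several steps, which leaves room for one translation to be more efficient than another.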

Regarding Emscripten’s NEON intrinsics, they also rely on SIMDe to provide this translation: https://emscripten.org/docs/porting/simd.html#compiling-simd-code-targeting-arm-neon-instruction-set 

## Test Environment

The performance numbers were taken on an Intel i7-9850 machine running Debian 12.5, with CPU frequency scaling disabled and libwebp pinned to a single core. We decoded the lossy and lossless images found in the WebP gallery https://developers.google.com/speed/webp/gallery . You can see our collected numbers here: https://docs.google.com/spreadsheets/d/1l9gamytAp5QAMD2mgq-uMgDm_yxvDdWQTfOo-WO-BwU/edit#gid=0 

I’m happy to provide any other information to help get these patches landed.

jz… via monorail

Jun 26, 2024, 7:58:34 PM

to webp-d...@webmproject.org

Updates:
	Cc: gde...@google.com pasca...@gmail.com

Comment #1 on issue 643 by jz...@google.com: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c1

Thanks for filing the bug. There's a lot here, so let's focus on the code generation. 'Lossless: VP8L_USE_FAST_LOAD', 'Lossy: don't use USE_GENERIC_TREE', and 'Lossy: Enable 64-bit BITS caching' are all fine; let's keep the discussion of those on the changes themselves.

I think the main question Deepti was asking was what kind of behavior you were seeing when enabling the sse and neon intrinsics in libwebp and letting the wasm compiler translate those (i.e., with `-msimd128`), rather than relying on the C code and auto-vectorization. The project doesn't have any native wasm intrinsics, but we'd like to rely on the intrinsics for the native platforms rather than having something custom for wasm.

w… via monorail

Jun 28, 2024, 10:58:30 AM

to webp-d...@webmproject.org

Comment #2 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c2

By default, WASM compilers do not support compiling SSE or NEON intrinsics, with the notable exception of Emscripten. As a result, compiling a C file that uses SSE with clang produces a compiler error. For example:

```
> cat file.c
#include <emmintrin.h>

int add(int a, int b) {
    return a + b;
}

> clang --target=wasm32 -msimd128 -Wl,--no-entry -o file.wasm file.c
In file included from file.c:1:
/usr/lib/llvm-14/lib/clang/14.0.6/include/emmintrin.h:14:2: error: "This header is only meant to be used on x86 and x64 architecture"
#error "This header is only meant to be used on x86 and x64 architecture"
 ^
...
```
 
Emscripten gets around this by providing its own versions of the `*mmintrin.h` and `arm_neon.h` headers, which automatically translate SSE/NEON intrinsics to WASM intrinsics during compilation, something that other WASM compilers do not do. With this SIMDe change, however, we are moving the translation directly into libwebp's WASM build, so it would work for all WASM compilers.

The translations offered by SIMDe and Emscripten are more or less the same. See, for example, Emscripten's translation of `_mm_mul_epu32` at https://github.com/emscripten-core/emscripten/blob/main/system/include/compat/emmintrin.h#L701 which matches SIMDe's translation here https://github.com/simd-everywhere/simde/blob/master/simde/x86/sse2.h#L4289-L4292. Emscripten has one extra minor optimization in one intrinsic translation that I'm planning to upstream to SIMDe over the next few weeks to ensure parity.

w… via monorail

Jun 28, 2024, 10:59:06 AM

to webp-d...@webmproject.org

Comment #3 on issue 643 by w...@utexas.edu: libwebp WASM decoding performance
https://bugs.chromium.org/p/webp/issues/detail?id=643#c3

@deepti: For clarity, did you mean "how does Emscripten's translation layer compare to SIMDe's translation layer"? If so, for SSE, we found the translation to be nearly the same, barring the optimization I plan to upstream. For NEON, it turns out Emscripten actually uses SIMDe's translation layer.
