Sharing some of my thoughts on the original CL here; perhaps others have ideas on how to address this.
The change introduces an optimization for a specific pattern: a set of i32.add instructions that together sum a bunch of i32 SIMD extract lane operations, where the lane indices 0, 1, 2 and 3 each appear exactly once.
We'd like to test inputs that either fit this pattern or come very close to it (the near misses could catch cases where the optimization wrongly applies).
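To make the matching condition concrete, here is a minimal sketch of such a check on a toy expression tree. The `Node` type, the helper names and `IsFullLaneSum` are all hypothetical and are not V8 or Turboshaft APIs. The sketch also assumes that all four lanes must come from the same S128 value and that the add tree has exactly four leaves; the real reducer may have different or additional requirements.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical mini-IR for illustration only; this is not Turboshaft.
struct Node {
  enum Kind { kAdd, kExtractLane, kOther };
  Kind kind;
  int vector_id;  // which S128 value a lane is extracted from (-1 if unused)
  int lane;       // lane index for kExtractLane (-1 if unused)
  const Node* lhs;
  const Node* rhs;
};

Node Lane(int vector_id, int lane) {
  return Node{Node::kExtractLane, vector_id, lane, nullptr, nullptr};
}
Node Add(const Node* lhs, const Node* rhs) {
  return Node{Node::kAdd, -1, -1, lhs, rhs};
}
Node Other() { return Node{Node::kOther, -1, -1, nullptr, nullptr}; }

// Flattens a tree of adds into its non-add leaves, in any nesting order.
void CollectLeaves(const Node* n, std::vector<const Node*>& leaves) {
  assert(n != nullptr);
  if (n->kind == Node::kAdd) {
    CollectLeaves(n->lhs, leaves);
    CollectLeaves(n->rhs, leaves);
  } else {
    leaves.push_back(n);
  }
}

// True if `root` is an add tree whose leaves are exactly the lanes 0, 1, 2
// and 3 of one vector, each appearing once (assumed matching condition).
bool IsFullLaneSum(const Node* root) {
  if (root->kind != Node::kAdd) return false;
  std::vector<const Node*> leaves;
  CollectLeaves(root, leaves);
  if (leaves.size() != 4) return false;
  std::array<bool, 4> seen{};
  for (const Node* leaf : leaves) {
    if (leaf->kind != Node::kExtractLane) return false;
    if (leaf->vector_id != leaves[0]->vector_id) return false;
    if (leaf->lane < 0 || leaf->lane > 3 || seen[leaf->lane]) return false;
    seen[leaf->lane] = true;
  }
  return true;
}
```

In this model, (lane0 + lane1) + (lane2 + lane3) matches, while a tree with a duplicated lane, a lane from a second vector, or a non-lane input does not. Those near misses are exactly the inputs where a wrongly applied optimization would show up.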
Based on my assumptions about our random module generation capabilities, I think it is extremely unlikely that the fuzzer will emit anything that fully meets the optimization's requirements, so running our current fuzzers won't help imo. We can't seed the fuzzer with such a pattern either, as we can't go from a wasm module back to fuzzer input. We could add something that emits exactly the pattern, but what we really want to fuzz is something close to it that in most cases has small differences in the pattern or a different order of instructions and inputs.
I think the easiest solution is to use a fuzzer that can emit the correct pattern and can perform meaningful mutations on that pattern.
The wasm compile-fuzzers can't perform mutations (they're generative only) and the code fuzzers only interpret input bytes as wasm code, so correct mutations will be exceedingly rare.
This leaves us with Fuzzilli, which can do both. But Fuzzilli doesn't support differential fuzzing. While there is ongoing work on differential fuzzing ("Dumpling"), it is centered around deopt points in JavaScript and would only compare different JS tiers, so it can't detect correctness bugs in Wasm. Furthermore, the V8 part of the integration only supports x64, and this is an arm64-only optimization.
This is why I'm leaning towards a small fuzztest that emits a bunch of i32 and S128 operations, including some extract lanes. It would also emit a bunch of i32.add instructions consuming these extract lanes (and potentially some of the other i32 inputs). The fuzztest would build all of this into a single function that produces an i32 result, run it with both Liftoff and TurboFan, and compare the results. This will only cover basic bugs, but it should sometimes produce patterns that match the optimization's requirements.
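A rough model of the compare-two-executions idea, as a sketch only: it does not use V8 or the fuzztest library, and every name in it (`Leaf`, `EvalBaseline`, `EvalOptimized`, `RandomLeaves`, `TiersAgree`) is hypothetical. `EvalBaseline` stands in for the baseline tier and `EvalOptimized` for the optimizing tier with the lane-sum rewrite applied. It also assumes a single S128 input and a flat list of add operands.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// A leaf is either lane `lane` (0..3) of the one S128 input, or a plain i32
// constant when lane == -1. Hypothetical model, not V8 code.
struct Leaf {
  int lane;
  uint32_t constant;
};

using Vec = std::array<uint32_t, 4>;

// Stand-in for the baseline tier: add all leaves in order (wrapping add).
uint32_t EvalBaseline(const Vec& v, const std::vector<Leaf>& leaves) {
  uint32_t sum = 0;
  for (const Leaf& l : leaves) {
    assert(l.lane >= -1 && l.lane <= 3);
    sum += (l.lane >= 0) ? v[l.lane] : l.constant;
  }
  return sum;
}

// Stand-in for the optimizing tier: if lanes 0..3 all occur, fold one
// occurrence of each into a single horizontal sum, then add the rest.
uint32_t EvalOptimized(const Vec& v, const std::vector<Leaf>& leaves) {
  std::array<int, 4> count{};
  for (const Leaf& l : leaves) {
    if (l.lane >= 0) ++count[l.lane];
  }
  bool matches = count[0] && count[1] && count[2] && count[3];
  if (!matches) return EvalBaseline(v, leaves);
  uint32_t sum = v[0] + v[1] + v[2] + v[3];
  std::array<bool, 4> folded{};
  for (const Leaf& l : leaves) {
    if (l.lane >= 0 && !folded[l.lane]) {
      folded[l.lane] = true;  // already covered by the horizontal sum
      continue;
    }
    sum += (l.lane >= 0) ? v[l.lane] : l.constant;
  }
  return sum;
}

// Generates 2 to 8 leaves; lanes are drawn from a small range so that exact
// matches and near misses both show up regularly.
std::vector<Leaf> RandomLeaves(std::mt19937& rng) {
  std::uniform_int_distribution<int> num_leaves(2, 8);
  std::uniform_int_distribution<int> lane_or_const(-1, 3);
  std::vector<Leaf> leaves(static_cast<size_t>(num_leaves(rng)));
  for (Leaf& l : leaves) {
    l.lane = lane_or_const(rng);
    l.constant = static_cast<uint32_t>(rng());
  }
  return leaves;
}

// The property to check: both "tiers" compute the same i32 result.
bool TiersAgree(uint32_t seed) {
  std::mt19937 rng(seed);
  Vec v = {static_cast<uint32_t>(rng()), static_cast<uint32_t>(rng()),
           static_cast<uint32_t>(rng()), static_cast<uint32_t>(rng())};
  std::vector<Leaf> leaves = RandomLeaves(rng);
  return EvalBaseline(v, leaves) == EvalOptimized(v, leaves);
}
```

In an actual fuzztest, a property function like `TiersAgree` would presumably be registered with the `FUZZ_TEST` macro and take the fuzzer-provided inputs directly instead of a seed, with the two evaluators replaced by compiling and running the generated wasm function in each tier.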
If at some point we end up with dozens of such small fuzzers for dozens of different Turboshaft optimizations, this should still scale well, as fuzztest is designed for having dozens or hundreds of small fuzzers (AFAIK).
Let me know what you think.
Matthias