wasm simd pairwise matching

Sam Parker-Haynes

Jun 5, 2024, 9:59:45 AM
to v8-dev
Hi,

I'd like to add some pattern matching to Turboshaft to recognise add + shuffle patterns which correspond to a horizontal pairwise reduction. I've started doing this with the wasm::SimdShuffle helpers during arm64 instruction selection, but it feels like the pattern matching should be done in a generic place too... So, I was thinking about adding four more kinds (I32x4, I64x2, F32x4 and F64x2 PairwiseReduction) to Simd128UnaryOp and then performing the combining in the machine-optimization-reducer.
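
To make the shape concrete, here's a rough scalar model of the add + shuffle sequence I have in mind for an f32x4 horizontal sum (the function name and lane choices are just illustrative, not lifted from real codegen):

#include <array>

// Scalar model of the add + shuffle pattern: each round shuffles lanes so
// that neighbouring pairs line up, adds element-wise, and halves the number
// of live lanes until one lane holds the result.
float AddShuffleSumF32x4(const std::array<float, 4>& v) {
  // Round 1: shuffle the odd lanes onto the even lanes and add; the pair
  // sums end up in lanes 0 and 2.
  std::array<float, 4> shuf = {v[1], v[0], v[3], v[2]};
  std::array<float, 4> add1 = {v[0] + shuf[0], v[1] + shuf[1],
                               v[2] + shuf[2], v[3] + shuf[3]};
  // Round 2: shuffle lane 2 down to lane 0 and add; lane 0 now holds
  // ((v0 + v1) + (v2 + v3)), i.e. the pairwise reduction.
  return add1[0] + add1[2];
}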

Does this sound reasonable enough...? Or is plumbing this into the TS IR likely to be significantly more complicated than backend pattern matching?

Thanks,
Sam 

Matthias Liedtke

Jun 5, 2024, 10:56:56 AM
to v8-...@googlegroups.com
Hi,

I quickly synced with Darius:
1) In general it makes sense to do the matching on the graph itself (i.e. in a reducer) assuming this is a generic pattern for which there might also be specialized / optimized instructions on other architectures.
2) Intel is working on a re-vectorization pass to replace 128-bit SIMD operations with 256-bit SIMD operations. So, if these optimized "add + shuffle" operations exist on Intel as well, there would be a clear benefit in doing it in a reducer that could then potentially run prior to the revectorization (which would require additional modifications to the revectorizer).

In general it's advisable to have as few architecture-specific code paths in the reducers as possible, so the operations shouldn't be overfitted to some arm64-only instructions.
Still, having some SIMD operations with clear semantics in the graph that only exist on some architectures is fine.

I don't think the overhead of pattern matching on the graph is likely to be more effort or slower than pattern matching during instruction selection.
Given the complexity of arm64 and x64 ISel code, I'm happy about anything that isn't added on top of that. :)

Cheers,
Matthias

dmerc...@google.com

Jun 5, 2024, 11:04:36 AM
to v8-dev
And one more thing that will be nicer in a Reducer than in the instruction selector: you don't have to worry about CanCover :o :o :o

Btw, as far as I can tell, there is no corresponding Intel operation for vaddvq (which I guess is what you want to generate), but I think that it's still better in a reducer than in the ISel directly. Maybe add an #ifdef V8_TARGET_ARCH_ARM64 around the arm64-specific opcodes that you define.
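
Something along these lines is what I mean, though the enum and kind names here are made up just to show the shape of the guard, not the real Turboshaft definitions:

#include <cstdint>

// Hypothetical sketch: kinds that only have an efficient lowering on arm64
// are fenced off so other backends never see them.
#ifdef V8_TARGET_ARCH_ARM64
enum class Simd128PairwiseReduceKind : uint8_t {
  kI32x4,  // e.g. addv on arm64
  kI64x2,  // e.g. addp (scalar form) on arm64
  kF32x4,  // e.g. a faddp chain on arm64
  kF64x2,  // e.g. faddp on arm64
};
#endif  // V8_TARGET_ARCH_ARM64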

Cheers,
Darius

Sam Parker-Haynes

Jun 5, 2024, 11:34:50 AM
to v8-dev
Okay, good!!

So, although I want to generate horizontal reduction operations, I'm currently thinking about lowering these to pairwise instructions, such as SSE/AVX haddp and Neon faddp. The semantics of the TS op will be that of a recursively pairwise operation, so targets should be able to lower it to a variety of optimised sequences, which does mean we'd be able to use addv for ints on aarch64.
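
To spell out what I mean by "recursively pairwise", here's a scalar sketch of the intended semantics (the name is mine, not anything in the tree): each round adds adjacent lanes and halves the lane count, which is the association a chain of faddp computes for floats, and for integers any re-association such as addv gives the same value.

#include <cstddef>
#include <utility>
#include <vector>

// Scalar model of a recursively pairwise reduction over a power-of-two
// number of lanes: repeatedly add adjacent pairs until one lane remains.
// For four lanes the association is (a0 + a1) + (a2 + a3).
template <typename T>
T PairwiseReduce(std::vector<T> lanes) {
  while (lanes.size() > 1) {
    std::vector<T> next;
    for (size_t i = 0; i + 1 < lanes.size(); i += 2) {
      next.push_back(lanes[i] + lanes[i + 1]);
    }
    lanes = std::move(next);
  }
  return lanes[0];
}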

Thanks again,
Sam
