loading and transposing unaligned input for a hash function using the "V" extension

Jack O'Connor

May 6, 2023, 10:30:50 PM

to RISC-V SW Dev

I'm experimenting with a RISC-V vector implementation of a hash function called BLAKE3, which is designed to use a lot of SIMD parallelism. I'm trying to figure out the best way to do vector loads of unaligned user input from memory.

The implementation wants to process a long series of input blocks in a stripmining style. Each block is 64 bytes, which the hash interprets as 16x 32-bit words. These blocks are laid out in memory with a 1024-byte stride (the "chunk size" in BLAKE3 terms). It seems like this is almost an ideal use case for a pair of "strided load segment" instructions, vlsseg8e32.v, which could load and transpose 8 words from each of VLEN/32 blocks into 8 vector registers. The problem is that the input isn't guaranteed to be 4-byte aligned, and in my reading of the "V" spec and my experience with the Spike simulator, that makes it illegal to use these instructions.
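To make the intended layout concrete, here is a rough scalar C model of where the data would end up. It is only a sketch of the destination layout, not of the instruction's alignment rules, and every name in it (`model_seg8_e32`, `model_seg8_e8`, `MAX_VL`, and so on) is made up for illustration. The second function shows the same 8-field segment shape with 8-bit elements, where each field is a single byte rather than a word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_STRIDE 1024 /* bytes between blocks: the BLAKE3 chunk size */
#define NFIELDS 8         /* fields per segment, as in vlsseg8e32.v */
#define MAX_VL 16         /* hypothetical cap on blocks per pass, for the model only */

/* Read one little-endian 32-bit word byte by byte, so the model itself
   has no alignment requirement. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Model of a strided segment load with 8 fields of 32 bits:
   regs[f][i] = word f of block i, for i in [0, vl). */
static void model_seg8_e32(uint32_t regs[NFIELDS][MAX_VL],
                           const uint8_t *in, size_t vl) {
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < NFIELDS; f++)
            regs[f][i] = load_le32(in + i * BLOCK_STRIDE + 4 * f);
}

/* Same segment shape with 8-bit elements: field f is now byte f of block i,
   so the four bytes of word 0 land in four different registers. */
static void model_seg8_e8(uint8_t regs[NFIELDS][MAX_VL],
                          const uint8_t *in, size_t vl) {
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < NFIELDS; f++)
            regs[f][i] = in[i * BLOCK_STRIDE + f];
}
```

In the 32-bit model, register `f` holds word `f` of every block, which is the transposed form the compression function wants. In the 8-bit model, the same word is spread across registers 0 through 3.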

If I were using regular vector loads, I think I would use EEW=8 to work around the alignment issue. But if I understand correctly, using EEW=8 for a segment load would mess up the transposition and scatter the bytes of each word across four registers. I can think of a few other workarounds, but I'm worried each of these would entail a lot of overhead:

- Use vector-indexed loads of 8-bit elements, with the index vector accounting for both the stride and the word grouping.

- Memcpy unaligned user input to aligned local memory, and then use vlsseg8e32.v or similar to transpose to registers.

- Do unaligned/EEW=8 loads of whole, untransposed 64-byte blocks, and then do the transposition myself in registers. (This is what implementations tend to look like on x86, but this isn't length-agnostic.)
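As an illustration of the first workaround, the index vector for one destination register could be computed roughly like this. This is a scalar C sketch under my own assumptions: `build_byte_offsets` and `gather_bytes` are hypothetical helpers, and `gather_bytes` is only a stand-in for an indexed load of 8-bit elements. The offsets exceed 255 as soon as there is a second block, so the index elements need to be wider than the data elements (32-bit here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_STRIDE 1024u /* bytes between blocks: the BLAKE3 chunk size */
#define WORD_BYTES 4u      /* bytes per 32-bit message word */

/* Fill offsets[0 .. 4*nblocks) with the byte offsets that select message
   word `word` (0..15) from each block: 4 consecutive bytes per block. */
static void build_byte_offsets(uint32_t *offsets, size_t nblocks, unsigned word) {
    assert(word < 16);
    for (size_t j = 0; j < WORD_BYTES * nblocks; j++)
        offsets[j] = (uint32_t)((j / WORD_BYTES) * BLOCK_STRIDE /* which block */
                                + word * WORD_BYTES             /* which word  */
                                + j % WORD_BYTES);              /* which byte  */
}

/* Scalar stand-in for an indexed load of 8-bit elements. */
static void gather_bytes(uint8_t *dst, const uint8_t *base,
                         const uint32_t *offsets, size_t n) {
    for (size_t j = 0; j < n; j++)
        dst[j] = base[offsets[j]];
}
```

On a little-endian target, reading the gathered bytes back four at a time yields word `word` of each block in order. The cost concern is that this touches one byte per element and needs 16 such gathers per group of blocks.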

Are there other workarounds I might've missed? I have very little experience with the ISA, so I wouldn't be at all surprised :) Thanks for your help.

- Jack

Krste Asanovic

May 6, 2023, 11:53:49 PM

to Jack O'Connor, RISC-V SW Dev

The option with memcpy realignment followed by 8x32b segment loads is probably the fastest portable way to do this, with a check to skip the memcpy if the input is fortuitously aligned.

All implementations should be fast at memcpy - less clear on the other alternatives.
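A minimal C sketch of that check-then-copy step might look like the following. The helper name `realign_blocks`, the packed 64-byte scratch layout, and the returned stride are assumptions for illustration, not something specified in this thread. The scratch buffer is typed as `uint32_t` so that it is suitably aligned for 32-bit loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 64      /* bytes per block */
#define BLOCK_STRIDE 1024 /* bytes between blocks in the caller's input */

/* Returns the base pointer to hand to the 32-bit segment loads and stores
   the byte stride between blocks in *stride_out. Aligned input is used in
   place; otherwise the blocks are copied, packed back to back, into
   `scratch`, which must hold at least nblocks * BLOCK_LEN bytes. */
static const uint8_t *realign_blocks(const uint8_t *in, size_t nblocks,
                                     uint32_t *scratch, size_t *stride_out) {
    assert(nblocks > 0);
    if ((uintptr_t)in % 4 == 0) {
        /* Fortuitously aligned: skip the copy, keep the original stride. */
        *stride_out = BLOCK_STRIDE;
        return in;
    }
    uint8_t *dst = (uint8_t *)scratch;
    for (size_t i = 0; i < nblocks; i++)
        memcpy(dst + i * BLOCK_LEN, in + i * BLOCK_STRIDE, BLOCK_LEN);
    *stride_out = BLOCK_LEN;
    return dst;
}
```

The caller would then issue the strided segment loads from the returned pointer with the returned stride, so the same load code serves both the aligned and the copied case.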

Krste

