I'm experimenting with a RISC-V vector implementation of a hash function called BLAKE3, which is designed to use a lot of SIMD parallelism. I'm trying to figure out the best way to do vector loads of unaligned user input from memory.
The implementation wants to process a long series of input blocks in a stripmining style. Each block is 64 bytes, which the hash interprets as sixteen 32-bit words. These blocks are laid out in memory with a 1024-byte stride (the "chunk size" in BLAKE3 terms). This seems like almost an ideal use case for a pair of "strided segment load" instructions (vlsseg8e32.v), each of which could load and transpose 8 words from each of VLEN/32 blocks into 8 vector registers.
The problem is that the input isn't guaranteed to be 4-byte aligned, and in my reading of the "V" spec and my experience with the Spike simulator, that makes it illegal to use these instructions.
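To make the desired result concrete, here is a rough scalar model in C of what I'd like one of those segment loads to produce. This is only a sketch: the names (transpose_load_model, load_le32, BLOCK_STRIDE) are made up for illustration, the flat out[j * vl + i] layout stands in for "element i of vector register j", and the byte-at-a-time load is there so that the model itself works on unaligned input.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_STRIDE 1024u     /* bytes between consecutive blocks */
#define WORDS_PER_SEGMENT 8u   /* fields loaded by one vlsseg8e32.v */

/* Read a little-endian 32-bit word from a possibly unaligned address. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical scalar model of the load-and-transpose step.
 * out[j * vl + i] models element i of the j-th destination register:
 * it receives word j of block i. `out` must hold WORDS_PER_SEGMENT * vl words.
 */
static void transpose_load_model(const uint8_t *input, size_t vl, uint32_t *out) {
    for (size_t i = 0; i < vl; i++) {                  /* one block per element */
        const uint8_t *block = input + i * BLOCK_STRIDE;
        for (size_t j = 0; j < WORDS_PER_SEGMENT; j++) {   /* one word per register */
            out[j * vl + i] = load_le32(block + 4 * j);
        }
    }
}
```

Calling it a second time with input + 32 would model the other half of each 64-byte block.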
If I were using regular vector loads, I think I would use EEW=8 to work around the alignment issue. But if I understand correctly, using EEW=8 for a segment load would mess up the transposition and scatter the bytes of each word across four registers. I can think of a few other workarounds, but I'm worried each of these would entail a lot of overhead:
- Use vector-indexed loads of 8-bit elements, with the index vector accounting for both the stride and the word grouping.
- Memcpy unaligned user input to aligned local memory, and then use vlsseg8e32.v or similar to transpose to registers.
- Do unaligned/EEW=8 loads of whole, untransposed 64-byte blocks, and then do the transposition myself in registers. (This is what implementations tend to look like on x86, but this isn't length-agnostic.)
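For the first option, this is roughly the index vector I have in mind, written as a scalar C sketch rather than real vector code. The function names and the 32-bit index type are my own assumptions for illustration. The indices would need to be wider than 8 bits in any case, since the second block already starts at offset 1024. Destination byte e belongs to block e/4 and is byte e%4 of that block's word, and the word number is selected by adding 4*j to the base address.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_STRIDE 1024u   /* bytes between consecutive blocks */

/*
 * Hypothetical helper: fill idx[0 .. 4*num_blocks) with byte offsets so that
 * destination byte e comes from block e/4, byte e%4 of a 32-bit word.
 */
static void build_byte_indices(uint32_t *idx, size_t num_blocks) {
    for (size_t e = 0; e < 4 * num_blocks; e++) {
        idx[e] = (uint32_t)((e / 4) * BLOCK_STRIDE + (e % 4));
    }
}

/* Scalar stand-in for an indexed load of 8-bit elements: dst[e] = base[idx[e]]. */
static void gather_bytes_model(const uint8_t *base, const uint32_t *idx,
                               size_t n, uint8_t *dst) {
    for (size_t e = 0; e < n; e++) {
        dst[e] = base[idx[e]];
    }
}
```

With this layout the same index vector could be reused for all 16 words of a block, with only the base address changing. My worry is the cost of gathering one byte at a time.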
Are there other workarounds I might've missed? I have very little experience with the ISA, so I wouldn't be at all surprised :) Thanks for your help.
- Jack