Hi,
I was looking into the GNU MP Bignum library (gmp) and noticed that it contains a large amount of hand-optimised assembly, including some optimised bit-manipulation functions. To say the least, gmp ships many heavily optimised assembly routines, often with several versions of the same routine tuned for different models of the same architecture.
mclark@monty:~/src/gmp-5.1.3$ find . -name '*.s' -o -name '*.asm' | wc -l
735
One of the functions that caught my attention was clz (count leading zeros). I decided to have a go at porting GNU MP's 32-bit clz to RISC-V inline assembly and testing its performance. Having ported one of Knuth's division routines in my own bignum library, I'm aware that clz is commonly used in bignum code, usually for renormalization, and GNU MP's 32-bit clz has some nice properties: it is completely branch-free, takes a constant 22 instructions plus the return, plus a possible cache miss to look up a 129-entry table, so it is 'almost' constant time.
Here is the asm:
fast_clz(unsigned int):
li t0,0x1000000
sltu t0,a0,t0
neg a5,t0
li t0,0x10000
sltu t0,a0,t0
sub a5,a5,t0
li t0,0x100
sltu t0,a0,t0
sub a5,a5,t0
addiw a5,a5,3
slliw a5,a5,3
addiw a4,a5,1
srlw a0,a0,a4
slli a0,a0,32
lui a4,%hi(.LANCHOR0)
addi a4,a4,%lo(.LANCHOR0)
srli a0,a0,32
add a0,a4,a0
lbu a0,0(a0)
not a5,a5
addiw a5,a5,33
subw a0,a5,a0
ret
It would be useful for 32-bit limbs on RV32. I imagine 64-bit implementations would use 64-bit limbs, which would need a 64-bit clz.
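For what it's worth, the same table-based trick extends to 64 bits. Here is a rough, untested C sketch (plain C rather than inline asm; the function name is mine, and it reuses the 129-entry table from the benchmark program at the end of this mail):

/* Untested sketch of a branch-free 64-bit clz following the same scheme:
   sum the byte-threshold compares to find the leading non-zero byte,
   shift it down, then look its bit width up in the 129-entry table.
   As with the 32-bit version, the result for x == 0 is not meaningful. */
static inline int fast_clz64(unsigned long long x)
{
    unsigned shamt = 1 + 8 *
        ((x >= (1ULL <<  8)) + (x >= (1ULL << 16)) + (x >= (1ULL << 24)) +
         (x >= (1ULL << 32)) + (x >= (1ULL << 40)) + (x >= (1ULL << 48)) +
         (x >= (1ULL << 56)));
    return 65 - shamt - __fast_clz_tab[x >> shamt];
}

The compares should each compile to a couple of ALU ops, so it stays branch-free.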
I thought I would compare this to GCC's __builtin_clz(unsigned), which oddly emits calls to __clzdi2 (the 64-bit version) rather than using a dedicated __clzsi2 (a 32-bit version). The generic GCC __clzdi2 routine scans one byte per iteration from the top of the double word down, so it iterates up to 8 times, and at least 5 times for any 32-bit value, since the first 4 passes only ever see zero bytes in the upper 32 bits. That is roughly 21 cycles before the function starts any useful work. Add an additional 10 cycles for an average of two more iterations and 10 cycles for the remainder, and it adds up to approximately 40 cycles per word, plus cache latency for the table lookup (which would be the same as fast_clz).
__clzdi2:
li a5,56
srl a4,a0,a5
andi a4,a4,255
bnez a4,10300 <__clzdi2+0x12>
addi a5,a5,-8
bnez a5,102f2 <__clzdi2+0x4>
li a4,64
sub a4,a4,a5
srl a5,a0,a5
auipc a0,0x7
ld a0,-210(a0) # 17238 <_GLOBAL_OFFSET_TABLE_+0x48>
add a5,a5,a0
lbu a0,0(a5) # 0 <main-0x100e8>
subw a0,a4,a0
ret
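In C, that generic path is roughly the following (my paraphrase of the fallback count_leading_zeros in longlong.h, not the verbatim source; the real version looks the leading byte up in libgcc's own 256-entry __clz_tab, which the small helper below stands in for so the snippet is self-contained):

/* Rough paraphrase of the generic count_leading_zeros path behind __clzdi2. */
static unsigned byte_width(unsigned v)      /* bit width of a value < 256 */
{
    unsigned w = 0;
    while (v) { w++; v >>= 1; }
    return w;
}

int generic_clzdi(unsigned long long x)
{
    unsigned a;
    /* Scan down one byte per iteration until a non-zero byte is found;
       for a zero-extended 32-bit value the first four probes see zero. */
    for (a = 64 - 8; a > 0; a -= 8)
        if (((x >> a) & 0xff) != 0)
            break;
    return 64 - (byte_width(x >> a) + a);
}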
I wrote a quick benchmark program. From my initial tests in a JIT, the branch-free version appears to be almost twice as fast. It's not exactly an apples-to-apples comparison, as GCC inlines fast_clz; however, I still think it might be worth considering a custom __clzsi2. A constant-time clz is especially useful for cryptographic code, which is one of gmp's application domains (according to its website).
https://gmplib.org/
We could consider adding a custom __clzsi2 to GCC, or alternatively we could add inline asm to the RISC-V port of the GNU MP Bignum library, as the table is already defined in gmp (albeit with a libgcc symbol collision to make things a little more fun, though that isn't noticeable on platforms that don't use the generic libgcc C-based implementation).
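If the gmp route were preferred, the inline asm could be wrapped in the usual longlong.h macro form, along these lines (a rough, untested sketch; the #if guard is just illustrative, and, if I remember the convention correctly, COUNT_LEADING_ZEROS_NEED_CLZ_TAB is what pulls in gmp's __clz_tab):

/* Sketch of a possible count_leading_zeros for a RISC-V section of
   gmp's longlong.h, reusing the existing 129-entry __clz_tab.  Untested. */
#if defined (__riscv) && W_TYPE_SIZE == 32
#define count_leading_zeros(count, x) \
  do { \
    UWtype __x = (x), __a; \
    __asm__ ("li t0,0x1000000\n\t" \
             "sltu t0,%1,t0\n\t" \
             "neg %0,t0\n\t" \
             "li t0,0x10000\n\t" \
             "sltu t0,%1,t0\n\t" \
             "sub %0,%0,t0\n\t" \
             "li t0,0x100\n\t" \
             "sltu t0,%1,t0\n\t" \
             "sub %0,%0,t0" \
             : "=&r" (__a) : "r" (__x) : "t0"); \
    __a = __a * 8 + 25; \
    (count) = 33 - __a - __clz_tab[__x >> __a]; \
  } while (0)
#define COUNT_LEADING_ZEROS_NEED_CLZ_TAB
#endif

For reference, the full benchmark program I used follows.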
/* RISC-V fast_clz implementation derived from GNU MP Library longlong.h */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
const
unsigned char __fast_clz_tab[129] =
{
1,2,3,3,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
9
};
int fast_clz(unsigned x)
{
    unsigned shamt;
    /* shamt = -(x < 2^24) - (x < 2^16) - (x < 2^8), i.e. 0 to -3 depending
       on which byte holds the most significant set bit */
    __asm__(
        "li t0,0x1000000\n\t"
        "sltu t0,%1,t0\n\t"
        "neg %0,t0\n\t"
        "li t0,0x10000\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0\n\t"
        "li t0,0x100\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0"
        : "=&r" (shamt) : "r" (x) : "t0");
    /* map to a shift of 25, 17, 9 or 1, then look up the bit width of the
       leading non-zero byte in the 129-entry table */
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __fast_clz_tab[x >> shamt];
}
int main(int argc, char **argv)
{
    int use_builtin_clz = 0, use_fast_clz = 0;
    if (argc != 2 ||
        !((use_builtin_clz = (strcmp(argv[1], "builtin_clz") == 0)) ||
          (use_fast_clz = (strcmp(argv[1], "fast_clz") == 0))))
    {
        fprintf(stderr, "usage: %s [builtin_clz|fast_clz]\n", argv[0]);
        exit(1);
    }
    /* unsigned so the wrap back to zero that ends each loop is well-defined */
    unsigned sum = 0;
    if (use_builtin_clz) {
        for (unsigned x = 1; x != 0; x++) {
            sum += __builtin_clz(x);
        }
    }
    if (use_fast_clz) {
        for (unsigned x = 1; x != 0; x++) {
            sum += fast_clz(x);
        }
    }
    printf("sum=%u\n", sum);
    return 0;
}
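The benchmark takes a single argument selecting the implementation. Something along these lines should build and run it (the toolchain, emulator and file names here are just examples, not what I used for the numbers above):

riscv64-unknown-linux-gnu-gcc -O2 -static -o clz-bench clz-bench.c
time qemu-riscv64 ./clz-bench fast_clz
time qemu-riscv64 ./clz-bench builtin_clz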