RISC-V port of GNU MP Bignum library fast_clz vs libgcc's generic __builtin_clz


Michael Clark

Aug 23, 2017, 5:33:51 PM
to RISC-V SW Dev
Hi,

I was looking into the GNU MP Bignum library (gmp) and noticed that it contains a large amount of hand-optimised assembly, including some optimised bit manipulation functions. To say the least, gmp ships many heavily optimised assembly routines, often with several versions of the same routine for different models of the same architecture.

mclark@monty:~/src/gmp-5.1.3$ find . -name '*.s' -o -name '*.asm' | wc -l
735

One of the functions that caught my attention was clz (count leading zeros). I decided to have a go at porting GNU MP’s 32-bit clz inline assembly to RISC-V and test its performance. Having ported one of Knuth’s division routines in my own bignum library, I’m aware that clz is commonly used in bignum code, usually for renormalization, and GNU MP’s 32-bit clz has some nice properties: the logic is completely branch-free, takes a constant 22 instructions plus return, and its only variable cost is a possible cache miss when looking up a 129-entry table. It is ‘almost’ constant time.
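
For context, here is a minimal sketch of the renormalization step I mean (a hypothetical helper of my own, not GMP’s actual code): Knuth’s Algorithm D wants the divisor’s top bit set, so both operands are shifted left by the divisor’s leading-zero count.

/* Hypothetical sketch of renormalization for Knuth-style division:
   shift the divisor left so its most significant bit is set, which
   the quotient-digit estimate in Algorithm D requires. Assumes d != 0. */
static unsigned normalize(unsigned d, unsigned *shift_out)
{
    unsigned s = (unsigned)__builtin_clz(d); /* leading zero bits in d */
    *shift_out = s;   /* the caller shifts the dividend by s as well */
    return d << s;    /* top bit of the result is now set */
}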

Here is the asm:

fast_clz(unsigned int):
li t0,0x1000000
sltu t0,a0,t0
neg a5,t0
li t0,0x10000
sltu t0,a0,t0
sub a5,a5,t0
li t0,0x100
sltu t0,a0,t0
sub a5,a5,t0
addiw a5,a5,3
slliw a5,a5,3
addiw a4,a5,1
srlw a0,a0,a4
slli a0,a0,32
lui a4,%hi(.LANCHOR0)
addi a4,a4,%lo(.LANCHOR0)
srli a0,a0,32
add a0,a4,a0
lbu a0,0(a0)
not a5,a5
addiw a5,a5,33
subw a0,a5,a0
ret

It would be useful for 32-bit limbs on RV32. I imagine 64-bit implementations would use 64-bit limbs, which would need a 64-bit clz.
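
For what it’s worth, the same technique extends to 64 bits with four more range comparisons. Here is an untested sketch of my own (not from GMP), reusing the 129-entry __fast_clz_tab from the test program below:

/* Untested 64-bit sketch of the same branch-free technique. The seven
   comparisons select which byte holds the highest set bit; unsigned
   wraparound makes shamt*8 + 56 + 1 come out to 57, 49, ... or 1.
   As with the 32-bit version, the result for x == 0 is not meaningful. */
int fast_clz64(unsigned long long x)
{
    unsigned shamt;
    shamt  = -(x < 0x100000000000000ULL);
    shamt -= (x < 0x1000000000000ULL);
    shamt -= (x < 0x10000000000ULL);
    shamt -= (x < 0x100000000ULL);
    shamt -= (x < 0x1000000ULL);
    shamt -= (x < 0x10000ULL);
    shamt -= (x < 0x100ULL);
    shamt = shamt*8 + 56 + 1;
    return 64 + 1 - shamt - __fast_clz_tab[x >> shamt];
}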

I thought I would compare this to GCC’s __builtin_clz(unsigned), which oddly emits calls to __clzdi2 (the 64-bit version) rather than a dedicated __clzsi2 (32-bit version). The generic libgcc __clzdi2 routine iterates up to 8 times, scanning one byte per iteration from the most significant end, so it takes at least 5 iterations for any 32-bit word: the upper 4 bytes of the double word are always zero, costing roughly 21 cycles before the function finds a non-zero byte. Add roughly 10 cycles for an average of two more iterations and another 10 for the tail, and it adds up to approximately 40 cycles per word, plus cache latency (which would be the same as for fast_clz).


__clzdi2:
li a5,56
srl a4,a0,a5
andi a4,a4,255
bnez a4,10300 <__clzdi2+0x12>
addi a5,a5,-8
bnez a5,102f2 <__clzdi2+0x4>
li a4,64
sub a4,a4,a5
srl a5,a0,a5
auipc a0,0x7
ld a0,-210(a0) # 17238 <_GLOBAL_OFFSET_TABLE_+0x48>
add a5,a5,a0
lbu a0,0(a5) # 0 <main-0x100e8>
subw a0,a4,a0
ret
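
For reference, the loop in that disassembly corresponds to the generic count_leading_zeros logic in libgcc’s longlong.h, which paraphrased into plain C looks something like this (my paraphrase, not the verbatim source):

/* Paraphrase of libgcc's generic 64-bit clz: scan down from the most
   significant byte until a non-zero byte is found, then finish with a
   table lookup. __clz_tab maps a byte value to 1 + the index of its
   highest set bit. In practice __clz_tab is hidden inside libgcc, so a
   standalone build would need its own copy of the table. */
extern const unsigned char __clz_tab[];

int clzdi2_sketch(unsigned long long x)
{
    unsigned a;
    for (a = 64 - 8; a > 0; a -= 8)   /* 4 wasted passes for x < 2^32 */
        if (((x >> a) & 0xff) != 0)
            break;
    return 64 - (__clz_tab[x >> a] + a);
}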


I wrote a quick benchmark program. From my initial tests in a JIT, the branch-free version appears to be almost twice as fast. It’s not exactly an apples-to-apples comparison, as GCC inlines fast_clz; however, I still think it might be worth considering a custom __clzsi2. A constant-time clz is especially useful for cryptographic code, which is one of the application domains of gmp (according to its website). https://gmplib.org/

We could consider adding a custom __clzsi2 to GCC, or alternatively we could add inline asm to the RISC-V port of the GNU MP Bignum library? (The table is already defined in gmp, albeit with a libgcc symbol collision to make things a little more fun; this is not noticeable on platforms that don’t use the generic libgcc C-based implementation.)


/* RISC-V fast_clz implementation derived from GNU MP Library longlong.h */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

const unsigned char __fast_clz_tab[129] =
{
    1,2,3,3,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
    8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
    9
};

int fast_clz(unsigned x)
{
    unsigned shamt;
    __asm__(
        "li   t0,0x1000000\n\t"
        "sltu t0,%1,t0\n\t"
        "neg  %0,t0\n\t"
        "li   t0,0x10000\n\t"
        "sltu t0,%1,t0\n\t"
        "sub  %0,%0,t0\n\t"
        "li   t0,0x100\n\t"
        "sltu t0,%1,t0\n\t"
        "sub  %0,%0,t0"
        : "=&r" (shamt) : "r" (x) : "t0");
    /* shamt is 0, -1, -2 or -3; *8 + 25 maps it to a shift of 25, 17, 9 or 1 */
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __fast_clz_tab[x >> shamt];
}
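
/* Worked example: for x = 0x00012345 the three comparisons give
   shamt = -1, so shamt = -1*8 + 24 + 1 = 17; then x >> 17 = 0,
   __fast_clz_tab[0] = 1, and the result is 33 - 17 - 1 = 15,
   matching __builtin_clz(0x00012345). */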

int main(int argc, char **argv)
{
    int use_builtin_clz = 0, use_fast_clz = 0;

    if (argc != 2 ||
        !((use_builtin_clz = (strcmp(argv[1], "builtin_clz") == 0)) ||
          (use_fast_clz = (strcmp(argv[1], "fast_clz") == 0))))
    {
        fprintf(stderr, "usage: %s [builtin_clz|fast_clz]\n", argv[0]);
        exit(1);
    }

    /* unsigned so the loop counter and sum wrap without undefined behaviour */
    unsigned sum = 0;
    if (use_builtin_clz) {
        for (unsigned x = 1; x != 0; x++) {
            sum += __builtin_clz(x);
        }
    }
    if (use_fast_clz) {
        for (unsigned x = 1; x != 0; x++) {
            sum += fast_clz(x);
        }
    }
    printf("sum=%u\n", sum);
    return 0;
}

Michael Clark

Aug 23, 2017, 5:43:16 PM
to RISC-V SW Dev

> On 24 Aug 2017, at 9:33 AM, Michael Clark <michae...@mac.com> wrote:
>
> int fast_clz(unsigned x)
> {
> unsigned shamt;
> __asm__(
> "li t0,0x1000000\n\t"
> "sltu t0,%1,t0\n\t"
> "neg %0,t0\n\t"
> "li t0,0x10000\n\t"
> "sltu t0,%1,t0\n\t"
> "sub %0,%0,t0\n\t"
> "li t0,0x100\n\t"
> "sltu t0,%1,t0\n\t"
> "sub %0,%0,t0"
> : "=&r" (shamt) : "r" (x) : "t0");
> shamt = shamt*8 + 24 + 1;
> return 33 - shamt - __fast_clz_tab[x >> shamt];
> }

Here it is in Compiler Explorer, including a vanilla C version:

https://cx.rv8.io/g/2qauDE

Here is the vanilla C version:

int fast_clz(unsigned x)
{
    unsigned shamt;
    shamt = -(x < 0x1000000);
    shamt -= (x < 0x10000);
    shamt -= (x < 0x100);
    shamt = shamt*8 + 24 + 1;
    return 32 + 1 - shamt - __fast_clz_tab[x >> shamt];
}

The compiler can’t compete with the hand-coded assembly; it appears not to lower the predicate logic quite as compactly as it could.

There is a template for this asm in the GNU MP Bignum longlong.h header:

- https://github.com/ryepdx/gmp/blob/master/longlong.h

Michael Clark

Aug 28, 2017, 6:49:11 AM
to RISC-V SW Dev
There is an interesting anomaly for 64-bit GCC targets that use the default libgcc __builtin_clz. The problem doesn’t appear to affect targets that have bit manipulation instructions. The compiler expands __builtin_clz into a libcall to very inefficient code for 32-bit clz on 64-bit targets, even though there is already code for clzsi2 in libgcc’s longlong.h (albeit with a larger table than GMP’s version of longlong.h). It seems GCC is limited to emitting two width variants.

- https://cx.rv8.io/g/8fEK4F

riscv32 libgcc.a has clzsi2 (32-bit) and clzdi2 (64-bit) variants (likewise for ctz)
riscv64 libgcc.a has clzdi2 (64-bit) and clzti2 (128-bit) variants (likewise for ctz)
riscv32 and riscv64 libgcc.a are consistent and both have bswapsi2 (32-bit) and bswapdi2 (64-bit) variants

If clzsi2 were generated in libgcc.a on RV64 so that calls to the 32-bit __builtin_clz expanded to clzsi2, it would more than double the speed, removing 4 redundant loop iterations and a subtract. We’d have to figure out how to get libgcc.a to emit all 3 widths of clz on RV64 (SI, DI and TI). It appears to be target-independent code that generates the clz libcalls on targets that don’t have explicit clz pattern matches, yet for some reason the 32-bit variants of clz/ctz are not emitted on 64-bit targets. The performance hit is of course biggest on clz, which is what is commonly used in bignum and softfloat code for renormalization. This is also inconsistent with bswap.

BTW - the libgcc.a __clz_tab table is in the .relro section with .hidden visibility. Remember we have a patch for a faster bswap64 that has two 64-bit .relro constants. I’m now thinking it might be worthwhile to speed these up, e.g. 32-bit clz and 64-bit bswap can both be improved.

Here are the two routines generated from GCC’s longlong.h (with clzsi2 absent from RV64’s libgcc):

__clzdi2:
li a5,56
1: srl a4,a0,a5
andi a4,a4,255
bnez a4, 2f
addi a5,a5,-8
bnez a5, 1b
2: li a4,64
sub a4,a4,a5
srl a5,a0,a5
la a0, __clz_tab
add a5,a5,a0
lbu a0,0(a5)
subw a0,a4,a0
ret

__clzsi2:
lui a5,0x10
bleu a5,a0,2f
li a5,255
sltu a5,a5,a0
slli a5,a5,0x3
1: li a4,32
sub a4,a4,a5
srl a5,a0,a5
la a0, __clz_tab
add a5,a5,a0
lbu a0,0(a5)
sub a0,a4,a0
ret
2: lui a4,0x1000
li a5,16
bltu a0,a4,1b
li a5,24
j 1b



Michael Clark

Aug 28, 2017, 6:52:07 AM
to RISC-V SW Dev
s/codeine/codegen/

autocorrect issues :-D

Michael Clark

Aug 28, 2017, 7:09:46 AM
to RISC-V SW Dev
Oh I see: it does a shift left and a shift right before calling clz, followed by a subtract of 32 after the call, in addition to the four redundant loop iterations in __clzdi2. In fact the compiler could just emit a shift left by 32 and use the existing __clzdi2, and the redundant iterations and the subtraction would not be required. That could perhaps be done in target-independent code, i.e. the simplest fix.
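
In C terms the suggested lowering would be something like this (my sketch, not actual GCC output): for non-zero 32-bit x, clz32(x) equals clz64((u64)x << 32), because the shift puts x’s leading zeros at the top of the double word.

/* Sketch of the proposed lowering: shift the 32-bit value into the
   upper half of a double word and call the existing 64-bit routine.
   No -32 adjustment is needed, and __clzdi2 finds a non-zero byte
   without the four redundant iterations. __builtin_clz is undefined
   for x == 0, so zero needs no special handling. */
extern int __clzdi2(unsigned long long x);  /* libgcc's 64-bit clz */

static inline int clz32_via_clzdi2(unsigned int x)
{
    return __clzdi2((unsigned long long)x << 32);
}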

Minimal reproducer. I probably should raise a “performance” bug for this.

$ cat clz.c
typedef unsigned int u32;
u32 clz(u32 x) { return __builtin_clz(x); }

$ riscv64-unknown-elf-gcc -O3 -fdump-rtl-all --save-temps -c clz.c

$ cat clz.c.229r.expand
<snip rtl which shows SI mode expression>

$ cat clz.s
.file "clz.c"
.option nopic
.globl __clzdi2
.text
.align 1
.globl clz
.type clz, @function
clz:
sll a0,a0,32
add sp,sp,-16
srl a0,a0,32
sd ra,8(sp)
call __clzdi2
ld ra,8(sp)
addw a0,a0,-32
add sp,sp,16
jr ra
.size clz, .-clz
.ident "GCC: (GNU) 7.1.1 20170509"

Michael Clark

Aug 28, 2017, 3:04:54 PM
to RISC-V SW Dev
Sorry for spamming the list with asm, but I was thinking that some of the issues I am finding are target-independent issues and was not sure where to raise them, e.g. this one appears to be target-independent also:

“Bit-field sign extension pattern results in redundant shifts”

- https://github.com/riscv/riscv-gcc/issues/89

Perhaps I should start logging bugs in GCC bugzilla? …

Andrew Waterman

Aug 28, 2017, 4:35:30 PM
to Michael Clark, RISC-V SW Dev
For target-independent issues, the GCC mailing lists and/or bugzilla
are probably the best place.

For target-dependent ones (or those that haven't yet been proven
target-independent), the riscv-gcc github issue tracker is probably a
better place than sw-dev.