Arithmetic in nonbinary radix


Fredrik Johansson

Feb 27, 2026, 6:23:00 AM
to flint-devel
Hi all,

Here is a writeup about the new radix module in FLINT for multiprecision arithmetic in bases other than 2^64: https://fredrikj.net/blog/2026/fast-nonbinary-arithmetic/

Fredrik

David Sparks

Feb 28, 2026, 3:58:38 AM
to flint...@googlegroups.com, Fredrik Johansson
> sub_ddmmss(hi, lo, 0, a[i] + (cy + 1), 0, B - b[i]);
> res[i] = lo + (hi & B);
> cy = hi;

Obviously, the critical path is the carry-propagation chain.

There are two slight tweaks to this which might shorten the path.

One (I'll call this #2) is to see whether avoiding the 3-way add
shortens the dependency chain through the carry:

sub_ddmmss(hi, lo, 0, a[i] + cy, 0, B - 1 - b[i]);
res[i] = lo + (hi & B);
cy = hi;

Another (#3) is to see if using addition works better:

add_ssaaaa(hi, lo, 0, a[i] + cy, 0, b[i] - B);
res[i] = lo + ((hi-1) & B);
cy = hi;

For asm implementations, the goal is to keep the carry word in the
carry flag, but this is hampered by the fact that we need to use it twice:
in the following addition, and to correct the current result limb.

This is straightforward on ARM, which allows fine-grained control over
flag setting, even for operations like ADC which *use* the flags, but
is a PITA on x86.

It's possible to add without touching the carry flag using LEA, and
INC and DEC update a counter while leaving CF alone. Also, NOT doesn't
set the flags, so B - 1 - b[i] can be computed as B + ~b[i].

But on x86, logical operations like AND and XOR all clear the carry flag.
This makes the correction using logical operations apparently impossible.

AFAICT, the only x86 instructions which depend on the carry
flag without setting it are:
1) Direct flags manipulation like LAHF or PUSHF
2) Jcc conditional jumps
3) SETcc
4) CMOVcc

The first two categories are non-starters for obvious
performance reasons, and SETcc is problematic because
converting the 0/1 bit to 0 or B without corrupting
carry is impractical. But CMOVcc has possibilities.

Subtracting without corrupting carry is a pain, so I'll try
option #3. This translates as:

loop:
mov b[i],%tmp1
lea (%tmp1, %negB), %tmp1
adc a[i], %tmp1
lea (%tmp1, %B), %tmp2
cmovc %tmp1, %tmp2
mov %tmp2,res[i]
dec %len
jnz loop

This has only two loop-carried dependencies (on the carry, and %len),
each a single cycle long.

This comes close to fitting into the Zen's 8-wide integer execution
unit, but not quite. The Zen has
- 4 ALUs (adc, lea, cmov, dec)
- 3 AGUs (2 loads and 1 store)
- 1 conditional branch
The problem is that, without unrolling, we have *five* ALU operations
per iteration. (The adc with a memory operand counts as both a load
and an ALU op.)

But with unrolling, it might be possible to get it down to 1 cycle.

I mention in passing that a useful trick to economize on registers
on x86 is to negate the length, index relative to the end of the
buffers, and increment up to zero. E.g.

lea (%a,%len,8), %a
lea (%b,%len,8), %b
lea (%res,%len,8), %res
neg %len
loop:
mov (%b,%len,8), %tmp1
...
inc %len
jnz loop

Just for completeness, let me see what #2
would look like in asm.

loop:
mov b[i],%tmp1
not %tmp1
lea (%tmp1, %B), %tmp1
mov a[i], %tmp2
sbb %tmp1, %tmp2
lea (%tmp2, %B), %tmp1
cmovnc %tmp2, %tmp1
mov %tmp1,res[i]
dec %len
jnz loop
That ends up as 10 instructions. We have to use separate load
and subtract instructions because there's no reg = mem - reg
form of SBB, and B - 1 - b[i] takes two instructions (NOT + LEA)
vs. one LEA for b[i] - B.
