David Sparks
Feb 28, 2026, 3:58:38 AM
to flint...@googlegroups.com, Fredrik Johansson
> sub_ddmmss(hi, lo, 0, a[i] + (cy + 1), 0, B - b[i]);
> res[i] = lo + (hi & B);
> cy = hi;
Obviously, the critical path is the carry-propagation chain.
There are two slight tweaks to this which might shorten the path.
One (I'll call this #2) is to see if avoiding the 3-way add reduces
the dependency length on the carry:
sub_ddmmss(hi, lo, 0, a[i] + cy, 0, B - 1 - b[i]);
res[i] = lo + (hi & B);
cy = hi;
Another (#3) is to see if using addition works better:
add_ssaaaa(hi, lo, 0, a[i] + cy, 0, b[i] - B);
res[i] = lo + ((hi-1) & B);
cy = hi;
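The baseline and tweak #3 can be cross-checked in portable C. A sketch, with the GMP-style sub_ddmmss/add_ssaaaa double-word macros expanded by hand; B is assumed to be any limb base below 2^64, with a[i], b[i] reduced mod B:

```c
#include <assert.h>
#include <stdint.h>

/* Baseline: carry word cy is 0 after a carry-out, all-ones otherwise. */
static uint64_t add_limb_v1(uint64_t a, uint64_t b, uint64_t B, uint64_t *cy)
{
    uint64_t x = a + (*cy + 1);                /* a[i] + carry_in */
    uint64_t y = B - b;
    uint64_t hi = (x < y) ? ~(uint64_t)0 : 0;  /* borrow word of sub_ddmmss */
    uint64_t lo = x - y;                       /* a + b + carry_in - B mod 2^64 */
    *cy = hi;
    return lo + (hi & B);                      /* add B back if no carry-out */
}

/* Tweak #3: carry word cy is 1 after a carry-out, 0 otherwise. */
static uint64_t add_limb_v3(uint64_t a, uint64_t b, uint64_t B, uint64_t *cy)
{
    uint64_t x = a + *cy;
    uint64_t y = b - B;                        /* wraps mod 2^64 */
    uint64_t lo = x + y;
    uint64_t hi = (lo < x) ? 1 : 0;            /* carry out of the 64-bit add */
    *cy = hi;
    return lo + ((hi - 1) & B);                /* add B back if no carry-out */
}
```

Exhaustively comparing the two over a small base (every a, b below B in both carry states) confirms they produce the same digit and the same carry decision.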
For asm implementations, the goal is to keep the carry word in the
carry flag, but it's hampered by the fact that we need to use it twice:
in the following addition, and to correct the current result limb.
This is straightforward on ARM, which allows fine-grained control over
flag setting, even for operations like ADC which *use* the flags, but
is a PITA on x86.
It's possible to add without touching the carry flag using LEA, INC, and
DEC. Also, NOT doesn't set the flags, so B - 1 - b[i] can be computed as
B + ~b[i].
But on x86, logical operations like AND and XOR all clear the carry flag.
This makes the correction using logical operations apparently impossible.
AFAICT, the only x86 instructions which depend on the carry
flag without setting it are:
1) Direct flags manipulation like LAHF or PUSHF
2) Jcc conditional jumps
3) SETcc
4) CMOVcc
The first two categories are non-starters for obvious
performance reasons, and SETcc is problematic because
converting its 0/1 result to 0 or B without corrupting
carry is impractical. But CMOVcc has possibilities.
Subtracting without corrupting carry is a pain, so I'll try
option #3. This translates as:
loop:
mov b[i],%tmp1
lea (%tmp1, %negB), %tmp1
adc a[i], %tmp1
lea (%tmp1, %B), %tmp2
cmovc %tmp1,%tmp2
mov %tmp2,res[i]
dec %len
jnz loop
This has only two loop-carried dependencies (on the carry, and %len),
each a single cycle long.
This comes close to fitting into the Zen's 8-wide integer execution
unit, but not quite. The Zen has
- 4 ALUs (adc, lea, cmov, dec)
- 3 AGUs (2 loads and 1 store)
- 1 conditional branch
The problem is that, without unrolling, we have *five* ALU operations
per iteration. (The adc with a memory operand counts as both a load
and an ALU op.)
But with unrolling, it might be possible to get it down to 1 cycle per limb.
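The flag flow of the #3 loop above can be modeled in C. A sketch: cf plays the role of the x86 carry flag across iterations, and the conditional select stands in for the CMOV; limbs are assumed reduced below B:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the #3 asm loop: res[i] = a[i] + b[i] + carry mod B,
   with the carry chain tracked as an explicit 0/1 flag. */
static void add_base_B(uint64_t *res, const uint64_t *a,
                       const uint64_t *b, size_t len, uint64_t B)
{
    unsigned cf = 0;                 /* no carry into the lowest limb */
    for (size_t i = 0; i < len; i++) {
        uint64_t t = b[i] - B;       /* lea (%tmp1,%negB),%tmp1 */
        uint64_t u = a[i] + cf;      /* can't wrap: a[i] < B <= 2^64-1 */
        uint64_t s = t + u;          /* adc a[i],%tmp1 ... */
        cf = (s < t);                /* ... which also sets CF */
        uint64_t r = s + B;          /* lea (%tmp1,%B),%tmp2 */
        res[i] = cf ? s : r;         /* cmov: keep s if CF set */
    }
}
```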
I mention in passing that a useful trick to economize on registers
on x86 is to negate the length, index relative to the end of the
buffers, and increment up to zero. E.g.
lea (%a,%len,8), %a
lea (%b,%len,8), %b
lea (%res,%len,8), %res
neg %len
loop:
mov (%b,%len,8), %tmp1
...
inc %len
jnz loop
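In C the same trick reads as follows. A sketch: point one past the end of the buffer and run a negative index up to zero, so a single value serves as both index and loop counter; sum_limbs is just a hypothetical stand-in loop body:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t sum_limbs(const uint64_t *a, size_t len)
{
    const uint64_t *end = a + len;   /* lea (%a,%len,8),%a */
    uint64_t total = 0;
    /* neg %len; then inc/jnz counts -len up to zero */
    for (ptrdiff_t i = -(ptrdiff_t)len; i != 0; i++)
        total += end[i];             /* mov (%a,%len,8),... */
    return total;
}
```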
Just for completeness, let me see what #2
would look like in asm.
loop:
mov b[i],%tmp1
not %tmp1
lea (%tmp1, %B), %tmp1
mov a[i], %tmp2
sbb %tmp1, %tmp2
lea (%tmp2, %B), %tmp1
cmovnc %tmp2, %tmp1
mov %tmp1,res[i]
dec %len
jnz loop
That ends up as 10 instructions. We have to use separate load
and subtract instructions because a memory-source sbb subtracts the
memory operand, while here a[i] is the minuend; and B-1-b[i] is
two instructions (not+lea) vs. one lea for b[i]-B.