
Implementation of condition code on OoO CPUs


Anton Ertl

Mar 30, 2023, 12:19:52 PM
A while ago I saw a web page that said that some OOO core implements
the condition codes by hanging them on general-purpose registers; this
allows renaming the condition codes; and given that a condition code
is generated with many GPR results, managing them together makes some
sense; they are overwritten (i.e., freed) separately, which may speak
for managing them separately.

Anyway, I don't remember where I found this; does anybody have an
idea? It would be even better if I could find other references (best
scientific papers, but web sites are also of interest) that describe
the implementation of condition codes on OoO cores. Or if you know
(and can reveal) unpublished information on the topic, I am also
interested in that.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Mar 30, 2023, 1:34:33 PM
On Thursday, March 30, 2023 at 11:19:52 AM UTC-5, Anton Ertl wrote:
> A while ago I saw a web page that said that some OOO core implements
> the condition codes by hanging them on general-purpose registers; this
> allows renaming the condition codes; and given that a condition code
> is generated with many GPR results, managing them together makes some
> sense; they are overwritten (i.e., freed) separately, which may speak
> for managing them separately.
>
> Anyway, I don't remember where I found this; does anybody have an
> idea? It would be even better if I could find other references (best
> scientific papers, but web sites are also of interest) that describe
> the implementation of condition codes on OoO cores. Or if you know
> (and can reveal) unpublished information on the topic, I am also
> interested in that.
<
Athlon and Opteron kept track of Zero, deNorm, Finite, Infinite, NaN attached
to each FP register. Not a true condition code, but enough to sort out which
algorithm is appropriate on this set of Operands.
<
I have seen another microarchitecture that kept track of the fact that this
result was accompanied by an exception in the register file. But I forgot
which.

Quadibloc

Mar 30, 2023, 8:09:55 PM
On Thursday, March 30, 2023 at 10:19:52 AM UTC-6, Anton Ertl wrote:
> A while ago I saw a web page that said that some OOO core implements
> the condition codes by hanging them on general-purpose registers;

To me, this seems odd.

What do you want to do with the condition codes in a pipelined computer
that uses out-of-order architecture? You want to ensure you associate
each condition code value with the _instruction_ after which it was set.

So I don't think it would be useful to stick the condition code on a register
so that register renaming could affect it.

However, in an architecture that uses *reservation stations* instead, in
line with the original Tomasulo algorithm, then I think it would make sense
to have the condition codes moving along with the operands, since this is
what mirrors the instruction flow.

John Savard

MitchAlsup

Mar 30, 2023, 9:26:54 PM
On Thursday, March 30, 2023 at 7:09:55 PM UTC-5, Quadibloc wrote:
> On Thursday, March 30, 2023 at 10:19:52 AM UTC-6, Anton Ertl wrote:
> > A while ago I saw a web page that said that some OOO core implements
> > the condition codes by hanging them on general-purpose registers;
<
> To me, this seems odd.
>
> What do you want to do with the condition codes in a pipelined computer
> that uses out-of-order architecture? You want to ensure you associate
> each condition code value with the _instruction_ after which it was set.
<
The thing about renaming is that you have to rename it to SomeThing!
Renaming it to a GPR is no harder than renaming it to something unique,
and you already have to have a GPR to rename it to, so why not lob on a few
more bits: 64+5 is not much bigger than 64, and you get to reuse all the
other renaming HW............
<
The alternative is to rename it to its own file (of some sort) and then you
need read and write ports to it, and all those complications.
<
No, it seems easier to just use a GPR.
>
> So I don't think it would be useful to stick the condition code on a register
> so that register renaming could affect it.
<
The condition-code-generating calculation is already writing the register;
now you want the ability to write another 5 bits somewhere else
(say, a CC ROB). Here, the tag you need to write the CC ROB is BIGGER
than the data being written into the CC ROB.
<
And don't get me talking about x86 condition codes and having to track
3 different groups (C, O, and ZAPS) independently. x86 uses more data
tracking logic in condition code management than in register management.
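Mitch's 64+5 arrangement can be sketched as a toy model. All names here
(PhysReg, RenameTable) are invented for illustration and do not come from any
real core's rename logic; the point is just that one tag covers both fields:

```python
# toy sketch of "rename the CC alongside the GPR result": each physical
# register holds the 64-bit value plus ~5 CC bits, so a single rename
# action (one tag, one free-list entry) covers both
from dataclasses import dataclass

@dataclass
class PhysReg:
    value: int = 0   # the 64-bit result
    cc: int = 0      # the few extra condition-code bits, renamed for free

class RenameTable:
    def __init__(self, nphys):
        self.free = list(range(nphys))          # free physical registers
        self.phys = [PhysReg() for _ in range(nphys)]
        self.map = {}                           # arch reg name -> phys index

    def write(self, arch, value, cc):
        p = self.free.pop()                     # one allocation for both fields
        self.phys[p] = PhysReg(value, cc)
        self.map[arch] = p
        return p

rt = RenameTable(8)
tag = rt.write("r1", 42, 0b00010)               # result and CC share the tag
assert rt.phys[rt.map["r1"]].value == 42
assert rt.phys[rt.map["r1"]].cc == 0b00010
```

No second register file, and no extra read/write ports: freeing the tag frees
the CC bits along with the data.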

luke.l...@gmail.com

Apr 2, 2023, 9:28:29 PM
On Thursday, March 30, 2023 at 5:19:52 PM UTC+1, Anton Ertl wrote:
> A while ago I saw a web page that said that some OOO core implements
> the condition codes by hanging them on general-purpose registers; this
> allows renaming the condition codes; and given that a condition code
> is generated with many GPR results, managing them together makes some
> sense; they are overwritten (i.e., freed) separately, which may speak
> for managing them separately.
>
> Anyway, I don't remember where I found this; does anybody have an
> idea? It would be even better if I could find other references (best
> scientific papers, but web sites are also of interest) that describe
> the implementation of condition codes on OoO cores. Or if you know
> (and can reveal) unpublished information on the topic, I am also
> interested in that.

small / outside possibility your memory dredged up the Libre-SOC
design concept, where i place Zero-Overhead-Looping Prefixing with
register-number-offsetting onto a Suffix-instruction, where some
of those suffix-instructions in the Power ISA have "Rc=1" modes and
thus the Condition Codes also get Vectorised alongside the
corresponding Vector of results.

implementing this on a Multi-Issue OoO architecture you would
indeed get condition codes "hung off of" general-purpose registers
(in sequence) and would thus be able to cheat and do co-register
renaming of both the GPRs and their associated CC.

the big advantage of 4-bit Condition Codes is that you get effectively
4 Predicate Mask bits with which to play, on the following instructions.
Vectorise/Loop-i-fy the Condition Code instructions *as well*
(in a naturally orthogonal way) and you start to feel like the floor
opened up into a whole new world. just don't jump in feet-first:
it's a 1.5 *million* opcodes world if viewed naively.

l.

luke.l...@gmail.com

Apr 2, 2023, 9:31:24 PM
On Friday, March 31, 2023 at 1:09:55 AM UTC+1, Quadibloc wrote:

> However, in an architecture that uses *reservation stations* instead, in
> line with the original Tomasulo algorithm, then I think it would make sense
> to have the condition codes moving along with the operands, since this is
> what mirrors the instruction flow.

indeed. only problem being that Tomasulo requires multi-ported CAMs
if you want Multi-Issue execution. those get fantastically expensive *real*
fast. alternatively you can stripe them (4-way issue ==> 4 completely
independent sets of RSes and 4 completely independent sets of pipelines
connected to them)

l.

luke.l...@gmail.com

Apr 3, 2023, 5:29:40 AM
On Monday, April 3, 2023 at 2:28:29 AM UTC+1, luke.l...@gmail.com wrote:

> small / outside possibility your memory dredged up the Libre-SOC
> design concept, where i place Zero-Overhead-Looping Prefixing with
> register-number-offsetting onto a Suffix-instruction, where some
> of those suffix-instructions in the Power ISA have "Rc=1" modes and
> thus the Condition Codes also get Vectorised alongside the
> corresponding Vector of results.

clarifying:

# based on Loop-Prefixing of Scalar Power ISA instructions
# implementation of "sv.add" (Rc=0) and "sv.add." (Rc=1)
# (scalar add being a degenerate case where VL=1)
for i in range(VL):
    ra = GPR(RA+i)        # looping increments *scalar* register number
    rb = GPR(RB+i)        # looping increments *scalar* register number
    result = ra + rb      # actual add (a scalar add) here
    GPR(RT+i) = result    # but looping makes it look like a Vector Processor
    if Rc == 0: continue  # skip Condition Codes if not "sv.add."
    cr = (result == 0) | ((result > 0) << 1) | ((result < 0) << 2)  # Condition Code
    CR.field[i] = cr      # also store the Vector of CR Fields
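the same loop as self-contained runnable Python (the GPR array, the register
numbers and the CR bit order below are illustrative, following the post's
encoding rather than the real Power ISA CR field layout):

```python
# runnable rendering of the sv.add / sv.add. loop: a flat GPR array stands in
# for the register file, and Rc=1 produces one CR field per element
M = 1 << 64

def sv_add(gpr, rt, ra, rb, vl, rc=0):
    """Loop-prefixed scalar add over vl consecutive scalar registers."""
    crs = []
    for i in range(vl):
        a, b = gpr[ra + i], gpr[rb + i]   # looping increments *scalar* regs
        result = (a + b) % M              # the actual (scalar) 64-bit add
        gpr[rt + i] = result
        if rc:                            # Rc=1: also produce a CR field
            signed = result - M if result >> 63 else result
            cr = (signed == 0) | ((signed > 0) << 1) | ((signed < 0) << 2)
            crs.append(cr)
    return crs

gpr = [0] * 32
gpr[8:10] = [1, 2]
gpr[16:18] = [3, M - 2]
crs = sv_add(gpr, rt=0, ra=8, rb=16, vl=2, rc=1)
assert gpr[0:2] == [4, 0]     # 1+3 and the wrapped 2+(2^64-2)
assert crs == [2, 1]          # "greater than zero", then "equal to zero"
```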

l.

luke.l...@gmail.com

Apr 3, 2023, 6:55:12 AM
On Friday, March 31, 2023 at 2:26:54 AM UTC+1, MitchAlsup wrote:

> The thing about renaming is that you have to rename it to SomeThing!
> Renaming it to a GPR is no harder than renaming it to something unique,
> and you already have to have a GPR to rename it to, so why not lob on a few
> more bits (64+5) is not much bigger than 64 and you get to reuse all the
> other renaming HW............

nice thing about that is that the "problem" of hazards for CR fields
goes away. however...

> <
> The alternative is to rename it to its own file (of some sort) and then you
> need read and write ports to it, and all those complications.

indeed we need that (for SVP64).

reason being: the GPR regfiles are actually sub-dividable
into smaller elements (2x32, 4x16, 8x8) for which CR fields
*are still applicable*.

thus it is no longer just 64+5 bits that are needed but 64+(5*8) bits,
and which ones are valid depends on whether the last arithmetic
operation was a 64-bit one or a 2x32 or 4x16 or 8x8?

hmmm... :)

i mean i don't actually mind that - it's quite a nice idea,
but the programming model is a lot less straightforward
(fragile)

l.

robf...@gmail.com

Apr 3, 2023, 8:51:21 AM
I have used condition codes with an ooo design, and I just give the condition code
registers their own register tag just like the regular registers, and reuse the same
renaming hardware. Tags 0 to 31 are for GPRs, and tags 32 to 63 are for SPRs
including condition codes, vector masks, and a few other select SPRs. It is twice
as many tags to deal with, but it is the same hardware. CCs do have their own
register file though. The interesting part is being able to treat the CCs as a group
which also has its own tag.

The next step is to just put the condition codes in a general purpose register,
which is what Thor, and some others do. Branching on cc is just a
branch-on-bit-set then.

luke.l...@gmail.com

Apr 3, 2023, 9:12:02 AM
On Monday, April 3, 2023 at 1:51:21 PM UTC+1, robf...@gmail.com wrote:
> I have used condition codes with an ooo design, and I just give the condition code
> registers their own register tag just like the regular registers, and reuse the same
> renaming hardware. Tags 0 to 31 are for GPRs, and tags 32 to 63 are for SPRs
> including condition codes, vector masks, and a few other select SPRs. It is twice
> as many tags to deal with, but it is the same hardware. CCs do have their own
> register file though. The interesting part is being able to treat the CCs as a group
> which also has its own tag.

indeed. what i like about the CDC 6600, 68000, 66000, PDP8/11 etc. is
the fact that there are completely separate register files with very little
in the way of cross-over.

CDC 6600 has A B and X, where (i think) X was auto-increment which
of course avoids having every X-instruction have to also depend on
A or B, thus reducing the size of that Dependency Matrix Hazard Cell

if the only cross-over instructions are 1-in 1-out (mv operations in effect)
then the entire DM remains extremely sparse, each group (A-related,
B-related, X-related) having a *localised* group of multi-entry DM Cells,
which at O(N^2) per sparse group is a big reduction.

we just went over this on the Libre-SOC mailing list with one of the
hardware-inexperienced team members who proposed complex
multi-in-from-one-multi-regfiles multi-out-to-multiple-regfiles
instructions and it took several hours to get across to them that
this was not okay.

as long as you do not go above 3-in 2-out (where one of those
outs is the result and the other is the Condition Code) then it is
just about manageable from a DM Hazard perspective.

bottom line it's not whether you do a full OoO Hazard Management
system, it's whether the instructions are properly designed to
not blow up the DM Cells to completely unmanageable proportions.

l.

MitchAlsup

Apr 3, 2023, 5:27:30 PM
On Sunday, April 2, 2023 at 8:31:24 PM UTC-5, luke.l...@gmail.com wrote:
> On Friday, March 31, 2023 at 1:09:55 AM UTC+1, Quadibloc wrote:
>
> > However, in an architecture that uses *reservation stations* instead, in
> > line with the original Tomasulo algorithm, then I think it would make sense
> > to have the condition codes moving along with the operands, since this is
> > what mirrors the instruction flow.
<
> indeed. only problem being that Tomasulo requires multi-ported CAMs
<
That has been the lore for 30 years, but Shebanow and I solved this in 1991.
<
> if you want Multi-Issue execution. those get fantastically expensive *real*
> fast. alternatively you can stripe them (4-way issue ==> 4 completely
> independent sets of RSes and 4 completely independent sets of pipelines
> connected to them)
<
So goes the lore........
>
> l.

MitchAlsup

Apr 3, 2023, 5:40:44 PM
On Monday, April 3, 2023 at 8:12:02 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, April 3, 2023 at 1:51:21 PM UTC+1, robf...@gmail.com wrote:
> > I have used condition codes with an ooo design, and I just give the condition code
> > registers their own register tag just like the regular registers, and reuse the same
> > renaming hardware. Tags 0 to 31 are for GPRs, and tags 32 to 63 are for SPRs
> > including condition codes, vector masks, and a few other select SPRs. It is twice
> > as many tags to deal with, but it is the same hardware. CCs do have their own
> > register file though. The interesting part is being able to treat the CCs as a group
> > which also has its own tag.
> indeed. what i like about the CDC 6600, 68000, 66000, PDP8/11 etc. is
> the fact that there are completely separate register files with very little
> in the way of cross-over.
>
> CDC 6600 has A B and X, where (i think) X was auto-increment which
> of course avoids having every X-instruction have to also depend on
> A or B, thus reducing the size of that Dependency Matrix Hazard Cell
<
Not quite:: When you wrote to A[1..5] you got an address which was then
used to load X[1..5]. When you wrote to A[6..7] you got an address and
X[6..7] was stored at that address. This was all done in the INC unit
which calculated an 18-bit address in a single cycle, but had a latency
of 2 when successive INC instructions were executed.
>
> if the only cross-over instructions are 1-in 1-out (mv operations in effect)
> then the entire DM remains extremely sparse, each group (A-related,
> B-related, X-related) having a *localised* group of multi-entry DM Cells
> which on O(N^2) for each sparse group is a big reduction.
<
There was the SHIFT instruction which stripped the exponent off of
an X[register], delivering a FP value back to an X[register] and delivering
the exponent to a B[register]. There was also a put-it-back-together
instruction that took a B[register] and an X[register] and created
a new X[register] containing X[source]×2^B[register].
>
> we just went over this on the Libre-SOC mailing list with one of the
> hardware-inexperienced team members who proposed complex
> multi-in-from-one-multi-regfiles multi-out-to-multiple-regfiles
> instructions and it took several hours to get across to them that
> this was not okay.
<
Would have loved to have been a fly on the wall.........
>
> as long as you do not go above 3-in 2-out (where one of those
> outs is the result and the other is the Condition Code) then it is
> just about manageable from a DM Hazard perspective.
<
I managed to get N+1 in and 2-out with my CARRY strategy.......
>
> bottom line it's not whether you do a full OoO Hazard Management
> system, it's whether the instructions are properly designed to
> not blow up the DM Cells to completely unmanageable proportions.
<
Exactly--the OoO Hazard prevention is merely a tool. The less you
use/need it, the better the ISA.
>
> l.

luke.l...@gmail.com

Apr 6, 2023, 6:00:06 AM
On Monday, April 3, 2023 at 10:27:30 PM UTC+1, MitchAlsup wrote:
> On Sunday, April 2, 2023 at 8:31:24 PM UTC-5, luke.l...@gmail.com wrote:
> > indeed. only problem being that Tomasulo requires multi-ported CAMs
> <
> That has been the lore for 30 years, but Shebanow and I solved this in 1991.

care to tell? wouldn't happen to have involved splitting the ROB# into
a partial unary- partial binary- encoding (6600-style unary DMs combined
with traditional CAM) would it?

l.

luke.l...@gmail.com

Apr 6, 2023, 6:12:30 AM
On Monday, April 3, 2023 at 10:40:44 PM UTC+1, MitchAlsup wrote:
> On Monday, April 3, 2023 at 8:12:02 AM UTC-5, luke.l...@gmail.com wrote:
> > we just went over this on the Libre-SOC mailing list with one of the
> > hardware-inexperienced team members who proposed complex
> > multi-in-from-one-multi-regfiles multi-out-to-multiple-regfiles
> > instructions and it took several hours to get across to them that
> > this was not okay.
> <
> Would have loved to have been a fly on the wall.........

ngggh if you like "Eastenders" or other high-stress "addictive"
conversations, it would have indeed been entertaining. honestly
it wasn't fun for me, being pulled out of heavy-concentration mode
from other much higher-priority tasks. sigh.

> >
> > as long as you do not go above 3-in 2-out (where one of those
> > outs is the result and the other is the Condition Code) then it is
> > just about manageable from a DM Hazard perspective.
> <
> I managed to get N+1 in and 2-out with my CARRY strategy.......

yes we're going to have to do micro-coding to avoid the
DMs getting overloaded: the 1st in the chain has to be
3-in 1-out (to read the existing reg-used-as-carry-in) and
the last in the chain has to be 2-in 2-out (to write the
reg-used-as-carry-out), but everything in between can
be 2-in 1-out *if* and only if an operand-forwarding-bus
exists.

i'm reasonably confident that CARRY would be similar?
(certainly i'd expect you to have a Carry-Op-Fwd-Bus for sure)

unless you don't store CARRY in Architectural State (SPR/CSRs),
in which case it wouldn't be possible to service an interrupt
in the middle of a chain-set, nor would it be possible to
start with an incoming CARRY, and you'd need one
extra "digit" (one extra mul-add) with zeros in it in order
to receive the carry-out into (actual) registers?

i do really like the CARRY-has-more-than-one-bit concept.
Power ISA version 1 had it (back when it was a 32-bit-only
ISA) but they removed it. i think that was a mistake. we're
now having to propose 5 new instructions (all 3-in 2-out)
which "overload" one additional 64-bit register on RD
and one with WR, with the implicit meaning of "64-bit carry-in/out",
sigh.

intriguingly they are all remarkably similar to Intel x86
instructions that have existed for 12+ years, but are
juuust that little bit subtly different, having been designed
for biginteger arbitrary-length vector chaining.
https://libre-soc.org/openpower/sv/biginteger/analysis/

l.

Terje Mathisen

Apr 6, 2023, 9:15:44 AM
luke.l...@gmail.com wrote:
[snip]
> i do really like the CARRY-has-more-than-one-bit concept.
> Power ISA version 1 had it (back when it was a 32-bit-only
> ISA) but they removed it. i think that was a mistake. we're
> now having to propose 5 new instructions (all 3-in 2-out)
> which "overload" one additional 64-bit register on RD
> and one with WR, with the implicit meaning of "64-bit carry-in/out",
> sigh.
>
> intriguingly they are all remarkably similar to Intel x86
> instructions that have existed for 12+ years, but are
> juuust that little bit subtly different, having been designed
> for biginteger arbitrary-length vector chaining.
> https://libre-soc.org/openpower/sv/biginteger/analysis/

Those 12+ years must refer to MULX in 2012, but in reality it goes all
the way back to 1978:

x86 have always had 2-in/2-out as the standard MUL opcode, the only
problem being that most of the (register) operands were implied: MULX
provided a way to encode more register names.

For bigint applications I've usually found that the latency of the MUL
provides enough time to do all the required ADD/ADC ops, but it does
become significantly easier, and probably much more compiler friendly
after they added both MULX and allowed a second flag to be used in a
carry chain, separately from the standard CARRY.
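the second-flag trick Terje mentions is the ADCX/ADOX idea: two add-with-carry
chains, each using its own flag, can be interleaved without clobbering each
other. A toy Python model (each flag is just a 0/1 variable here, not real x86
flag semantics):

```python
# toy model of two independent carry chains over 64-bit limbs
M = 1 << 64

def adcx(a, b, cf):            # models ADCX: reads/writes only "CF"
    s = a + b + cf
    return s % M, s >> 64

def adox(a, b, of):            # models ADOX: reads/writes only "OF"
    s = a + b + of
    return s % M, s >> 64

# two interleaved chains, each carrying through its own flag
x, cf = adcx(M - 1, 1, 0)      # chain 1: wraps, sets CF
y, of = adox(M - 1, 2, 0)      # chain 2: wraps, sets OF (CF untouched)
x2, cf = adcx(5, 7, cf)        # chain 1 resumes with its own carry intact
assert (x, cf) == (0, 0) or True  # see asserts below for the actual values
assert x == 0 and y == 1
assert x2 == 13 and of == 1
```

With a single flag, the second chain's add would destroy the first chain's
carry between iterations; with two, the multiply-accumulate inner loop of a
bigint product needs no flag save/restore.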

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

luke.l...@gmail.com

Apr 6, 2023, 10:53:35 AM
On Thursday, April 6, 2023 at 2:15:44 PM UTC+1, Terje Mathisen wrote:
> luke.l...@gmail.com wrote:

> > https://libre-soc.org/openpower/sv/biginteger/analysis/
> Those 12+ years must refer to MULX in 2012,

yeah there's a really good app-note about MULX from Intel.
but we also added 3-in 2-out variants of the shift instructions
*and* a div-mod instruction that, by a really nice not-coincidence,
is the exact mathematical opposite of the 3-in 2-out mul
instruction.

> but in reality it goes all the way back to 1978:
>
> x86 have always had 2-in/2-out as the standard MUL opcode, the only
> problem being that most of the (register) operands were implied: MULX
> provided a way to encode more register names.

and dropped OV i think? because even A*B+C still does not overflow
(maximum value 0xfffe0001+0x0000ffff does not overflow)
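that claim is easy to sanity-check mechanically (the hex figures above are the
16-bit worst case):

```python
# sanity check: for N-bit operands, a*b + c cannot overflow 2N bits
for N in (16, 32, 64):
    m = (1 << N) - 1
    assert m * m + m < 1 << (2 * N)          # worst case still fits

assert 0xFFFF * 0xFFFF == 0xFFFE0001         # the post's 16-bit product
assert 0xFFFE0001 + 0xFFFF == 0xFFFF0000     # adding C does not overflow
```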

> For bigint applications I've usually found that the latency of the MUL
> provides enough time to do all the required ADD/ADC ops, but it does
> become significantly easier, and probably much more compiler friendly
> after they added both MULX and allowed a second flag to be used in a
> carry chain, separately from the standard CARRY.

the kicker here (compared to SVP64) *is* that you had to do
2 instructions at all (per RADIX64 digit), whereas by having 3-in 2-out
it can be Vector-chain-looped by setting RC as a scalar and
(implicitly) having it as the 2nd destination for the top-half
of the 128-bit result.
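a minimal sketch of that chained 3-in 2-out loop, assuming an invented Python
name (madd_chain) and little-endian base-2^64 digit vectors; each step is
RT = low64(RA*RB + RC), with the high 64 bits feeding the next step's carry:

```python
# chained multiply-add: multiply a base-2^64 bigint by a 64-bit scalar
M = 1 << 64

def madd_chain(digits, scalar, carry_in=0):
    """Little-endian digit vector times 64-bit scalar, carry chained."""
    out, carry = [], carry_in
    for d in digits:
        prod = d * scalar + carry   # at most 128 bits: (2^64-1)^2 + (2^64-1)
        out.append(prod % M)        # low half is the result digit (RT)
        carry = prod >> 64          # high half chains into the next element
    return out, carry

digits, top = madd_chain([M - 1, M - 1], M - 1)   # (2^128-1) * (2^64-1)
value = digits[0] + (digits[1] << 64) + (top << 128)
assert value == (2**128 - 1) * (M - 1)
```

with RC scalar-initialised and then implicitly carried element to element, the
whole loop needs only one vectorised instruction rather than a mul/adc pair
per digit.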

l.

MitchAlsup

Apr 6, 2023, 11:31:54 AM
Nope, much more straightforward than that.
But it does require knowing which function unit (or slot) delivers the result;
which Tomasulo did not, but Thornton did.
>
> l.

MitchAlsup

Apr 6, 2023, 11:35:35 AM
On Thursday, April 6, 2023 at 5:12:30 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, April 3, 2023 at 10:40:44 PM UTC+1, MitchAlsup wrote:

> > > as long as you do not go above 3-in 2-out (where one of those
> > > outs is the result and the other is the Condition Code) then it is
> > > just about manageable from a DM Hazard perspective.
> > <
> > I managed to get N+1 in and 2-out with my CARRY strategy.......
<
> yes we're going to have to do micro-coding to avoid the
> DMs getting overloaded: the 1st in the chain has to be
> 3-in 1-out (to read the existing reg-used-as-carry-in) and
> the last in the chain has to be 2-in 2-out (to write the
> reg-used-as-carry-out), but everything in between can
> be 2-in 1-out *if* and only if an operand-forwarding-bus
> exists.
>
> i'm reasonably confident that CARRY would be similar?
> (certainly i'd expect you to have a Carry-Op-Fwd-Bus for sure)
<
Ins and Outs are similar (likely identical) but dealing with CARRY
was done by an independent result-operand bus--like programmable
forwarding.
>
> unless you don't store CARRY in Architectural State (SPR/CSRs)
> in which case it wouldn't be possible to service an interrupt
> in the middle of a chain-set, nor would it be possible to
> either start with an incoming CARRY, and you'd need one
> extra "digit" (one extra mul-add) with zeros in it in order
> to receive the carry-out into (actual) registers?
<
It does make it to architectural state when needed.
>
> i do really like the CARRY-has-more-than-one-bit concept.
> Power ISA version 1 had it (back when it was a 32-bit-only
> ISA) but they removed it. i think that was a mistake. we're
> now having to propose 5 new instructions (all 3-in 2-out)
> which "overload" one additional 64-bit register on RD
> and one with WR, with the implicit meaning of "64-bit carry-in/out",
> sigh.
<
Sins of the past catching up.....

Anton Ertl

Apr 7, 2023, 3:39:29 AM
Terje Mathisen <terje.m...@tmsw.no> writes:
>x86 have always had 2-in/2-out as the standard MUL opcode, the only
>problem being that most of the (register) operands were implied: MULX
>provided a way to encode more register names.

Given that register-register moves are almost for free in Intel's
performance cores, that's only a minor benefit. The major benefit of
MULX for the intended application is that it does not destroy the
carry and overflow flags.

>For bigint applications I've usually found that the latency of the MUL
>provides enough time to do all the required ADD/ADC ops,

What kind of bigint applications do you have in mind? For
multiprecision multiplication, all the component multiplications are
independent of each other, and every addition is dependent on a
multiplication; e.g., in the simplest case RSTU = AB * CD (where A, B,
C, D, R, S, T, U are 64-bit values):

BD=B*D || AD=A*D || BC=B*C || AC=A*C
U=lo(BD) || Tc0,T0=hi(BD)+lo(AD) || Sc0,S0=hi(AD)+hi(BC)
Tc,T=T0+lo(BC) || Sc1,S1=S0+lo(AC)+Tc0 || R0=hi(AC)+Sc0
Sc,S=S1+Tc || R1=R0+Sc1
R=R1+Sc

Each line contains operations that can be performed in parallel.
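the decomposition behind that schedule can be checked mechanically in Python
(mul_2x2 is an invented name; the carry-propagation lines are collapsed into
exact integer arithmetic rather than modelled step by step):

```python
# check that the 2x2-limb schoolbook schedule computes RSTU = AB * CD,
# with 64-bit limbs named as in the post (A,B and C,D high/low halves)
import random

M = 1 << 64

def mul_2x2(A, B, C, D):
    # the four independent component multiplications (first line of the post)
    BD, AD, BC, AC = B * D, A * D, B * C, A * C
    # the add/carry lines, collapsed into one exact sum
    total = (AC << 128) + ((AD + BC) << 64) + BD
    return ((total >> 192) % M, (total >> 128) % M,
            (total >> 64) % M, total % M)       # R, S, T, U

random.seed(42)
for _ in range(100):
    A, B, C, D = (random.randrange(M) for _ in range(4))
    R, S, T, U = mul_2x2(A, B, C, D)
    assert (R << 192) + (S << 128) + (T << 64) + U == (A * M + B) * (C * M + D)
```

the structure also shows Anton's point: the four multiplies are mutually
independent, while every addition sits behind at least one multiply.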

luke.l...@gmail.com

Apr 7, 2023, 4:02:55 AM
On Friday, April 7, 2023 at 8:39:29 AM UTC+1, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
> >x86 have always had 2-in/2-out as the standard MUL opcode, the only
> >problem being that most of the (register) operands were implied: MULX
> >provided a way to encode more register names.
> Given that register-register moves are almost for free in Intel's
> performance cores, that's only a minor benefit. The major benefit of
> MULX for the intended application is that it does not destroy the
> carry and overflow flags.

ahh yes that was it, i remember that in the whitepaper
oink, 404 not found
https://www.google.com/search?q=ia-large-integer-arithmetic-paper.pdf

ah HA!
https://web.archive.org/web/20150131061304/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

it wasn't just the addition of MULX, it was the addition of ADCX/ADOX
that did the trick.

the diagrams in that paper are cute, make it really clear what's
going on.

l.

Michael S

Apr 7, 2023, 6:22:45 AM
They forgot to make timing comparisons for implementations of multiplication
of a particular size with and without MULX/ADCX/ADOX.
I'd like to see the numbers not just for 512x512, which does not strike me as
something used commonly, but also for smaller sizes, down to 128x128, which
looks to me like the most frequently used case of extended precision multiplication.

Anton Ertl

Apr 7, 2023, 8:37:00 AM
Michael S <already...@yahoo.com> writes:
>They forgot to make timing comparisons for implementation of multiplication
>of particular size with and without MULX/ADCX/ADOX.
>I'd like to see the numbers not just for 512x512, which does not strike me as
>something used commonly, but also for smaller sizes, down to 128x128 which
>looks to me as most frequently used case of extended precision multiplication.

What makes you think that 128x128 is most-frequently used? And do you
have any numbers on usage frequency that I can cite?

My thinking is that the biggest use of multiprecision arithmetics is
in cryptography, where RSA works with 1024-bit (deemed too small by
many these days), 2048-bit, and 4096-bit keys, and taking the p-th
power of those involves a number of squarings and multiplications (and
Intel has a paper on squaring in addition to one on multiplication).

However,

1) RSA is only used for exchanging a symmetric key, and then the rest
of the connection uses a symmetric key; nevertheless, key exchange is
relevant for the startup overhead of a connection.

2) Many users seem to replace RSA by elliptic curve cryptography
(e.g., Ed25519); I don't know how important multi-precision
multiplication is for that.

Terje Mathisen

Apr 7, 2023, 8:52:16 AM
I had forgotten about how MULX skips all the flag modifications, that
does make it much easier to employ, at least for a compiler.

My own usage has typically been 4-wide, i.e. implementing 128-bit
operations with 32-bit regs. I did that both during FDIV/FPATAN2
verification and for DFC, the Cern AES candidate.

As I've noted here previously, an IMACC operation doing (sumhi, sumlo) =
a * b + c + d is pretty much perfect, except for needing far too many
register specifiers.

The operation cannot overflow, making it even superior to IMAC with a
single addend, and as Mitch has taught us, any extra addends feeding
into the last stage of the MUL are almost free.
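that no-overflow property is a one-line identity: with N-bit inputs, the worst
case of a*b + c + d is exactly the largest 2N-bit value, never more:

```python
# check of the claim that (sumhi, sumlo) = a*b + c + d cannot overflow:
# with N-bit operands the worst case exactly fills, but never exceeds, 2N bits
for N in (8, 16, 32, 64):
    m = (1 << N) - 1
    assert m * m + m + m == (1 << (2 * N)) - 1   # fills 2N bits exactly
```

this is why a second addend is "free": (2^N-1)^2 leaves room for exactly two
more maximal N-bit addends below 2^(2N).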

Terje

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> x86 have always had 2-in/2-out as the standard MUL opcode, the only
>> problem being that most of the (register) operands were implied: MULX
>> provided a way to encode more register names.
>
> Given that register-register moves are almost for free in Intel's
> performance cores, that's only a minor benefit. The major benefit of
> MULX for the intended application is that it does not destroy the
> carry and overflow flags.
>
>> For bigint applications I've usually found that the latency of the MUL
>> provides enough time to do all the required ADD/ADC ops,
>
> What kind of bigint applications do you have in mind? For
> multiprecision multiplication, all the component multiplications are
> independent of each other, and every additions is dependent on a
> multiplication; e.g., in the simplest case RSTU = AB * CD (where A, B,
> C, D, R, S, T, U are 64-bit values):
>
> BD=B*D || AD=A*D || BC=B*C || AC=A*C
> U=lo(BD) || Tc0,T0=hi(BD)+lo(AD) || Sc0,S0=hi(AD)+hi(BC)
> Tc,T=T0+lo(BC) || Sc1,S1=S0+lo(AC)+Tc0 || R0=hi(AC)+Sc0
> Sc,S=S1+Tc || R1=R0+Sc1
> R=R1+Sc
>
> Each line contains operations that can be performed in parallel.
>
> - anton
>



Terje Mathisen

Apr 7, 2023, 9:17:07 AM
Intermediate sizes (512-2048 bits) are important for SSL/TLS setup, right?

I do agree that 128-bit operations are probably more common in
application code, but you can get reasonable performance for those with
just a regular N*N->2N mul and some add with carry intrinsics.

Michael S

Apr 7, 2023, 9:28:11 AM
On Friday, April 7, 2023 at 3:37:00 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >They forgot to make timing comparisons for implementation of multiplication
> >of particular size with and without MULX/ADCX/ADOX.
> >I'd like to see the numbers not just for 512x512, which does not strike me =
> >as
> >something used commonly, but also for smaller sizes, down to 128x128 which
> >looks to me as most frequently used case of extended precision multiplicati=
> >on.
>
> What makes you think that 128x128 is most-frequently used? And do you
> have any numbers on usage frequency that I can cite?
>

May be, it's just me.
As you likely know, I have a soft spot for extended precision floating point.
Esp. for variants with 128-bit and 112-bit mantissas, because they
are much faster than the rest and are wide enough to test the numeric stability
of algorithms implemented in IEEE binary64, which, at least for me, happens
to be the main use of extended precision floating point.

> My thinking is that the biggest use of multiprecision arithmetics is
> in cryptography, where RSA works with 1024-bit (deemed too small by
> many these days), 2048-bit, and 4096-bit keys, and taking the p-th
> power of those involves a number of squarings and multiplications (and
> Intel has a paper on squaring in addition to one on multiplication).
>
> However,
>
> 1) RSA is only used for exchanging a symmetric key, and then the rest
> of the connection uses a symmetric key; nevertheless, key exchange is
> relevant for the startup overhead of a connection.
>
> 2) Many users seem to replace RSA by elliptic curve cryptography
> (e.g., Ed25519); I don't know how important multi-precision
> multiplication is for that.

I did ECC a few years ago. In my case it was an implementation of ECDSA
signature checks on a resource-limited 32-bit embedded platform.
Unfortunately, by now I have forgotten the majority of the details.
From memory, on the critical path there was a 256x256 multiplication with
only the low 256 bits of the result in use. But I could be wrong about that.
Maybe 160 bits rather than 256? Or 320?
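A multiply that keeps only the low half of the product can drop the high halves of the cross products and the whole high product. Here is a two-limb illustration of that idea (the helper name `mullo128` is made up, and this is not the ECDSA code being recalled):

```c
#include <stdint.h>

/* Truncated multiply:
 * (a:b)*(c:d) mod 2^128 = lo(b*d)
 *                       + ((hi(b*d) + a*d + b*c) mod 2^64) << 64 */
static void mullo128(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                     uint64_t out[2])
{
    unsigned __int128 bd = (unsigned __int128)b * d;
    out[0] = (uint64_t)bd;
    /* a*d and b*c are taken mod 2^64; a*c vanishes entirely. */
    out[1] = (uint64_t)(bd >> 64) + a * d + b * c;
}
```

With n limbs the same trick saves roughly half the partial products, which is why a low-256-only 256x256 multiply is noticeably cheaper than the full one.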

Terje Mathisen

Apr 7, 2023, 10:03:03 AM
Michael S wrote:
> On Friday, April 7, 2023 at 3:37:00 PM UTC+3, Anton Ertl wrote:
>> Michael S <already...@yahoo.com> writes:
>>> They forgot to make timing comparisons for implementation of multiplication
>>> of particular size with and without MULX/ADCX/ADOX.
>>> I'd like to see the numbers not just for 512x512, which does not strike me as
>>> something used commonly, but also for smaller sizes, down to 128x128 which
>>> looks to me as most frequently used case of extended precision multiplication.
>>
>> What makes you think that 128x128 is most-frequently used? And do you
>> have any numbers on usage frequency that I can cite?
>>
>
> Maybe it's just me.
> As you likely know, I have a soft spot for extended precision floating point,
> especially for the variants with 128-bit and 112-bit mantissas, because they
> are much faster than the rest and are wide enough to test the numeric stability
> of algorithms implemented in IEEE binary64, which, at least for me, happens
> to be the main use of extended precision floating point.

<BG>

The fp128 library I wrote over a weekend during the FDIV saga had
exactly the same use case. :-)

I don't remember exactly, but I suspect my FDIV128 was doing a very slow
bit-per-iteration cmp/subtract loop, since it was very non-speed-critical at
that point.
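That kind of bit-per-iteration cmp/subtract (restoring) divide loop looks roughly like this when shown for a 64/64 -> 64 integer divide; a sketch, not the fp128 code being described:

```c
#include <stdint.h>

/* One-bit-per-iteration restoring divide.  Assumes d != 0;
 * for d > 2^63 the shift of r would need a carry-out check. */
static uint64_t div_restoring(uint64_t n, uint64_t d)
{
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* shift in the next bit */
        if (r >= d) {                   /* compare ...           */
            r -= d;                     /* ... and subtract      */
            q |= 1ULL << i;
        }
    }
    return q;  /* r ends up holding the remainder */
}
```

The mantissa version is the same loop with wider limbs, which is why it is so slow: one result bit per compare/subtract.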

FADD128/FMUL128 was where I cared enough for hand-optimizing my asm.

Terje

>
>> My thinking is that the biggest use of multiprecision arithmetics is
>> in cryptography, where RSA works with 1024-bit (deemed too small by
>> many these days), 2048-bit, and 4096-bit keys, and taking the p-th
>> power of those involves a number of squarings and multiplications (and
>> Intel has a paper on squaring in addition to one on multiplication).
>>
>> However,
>>
>> 1) RSA is only used for exchanging a symmetric key, and then the rest
>> of the connection uses a symmetric key; nevertheless, key exchange is
>> relevant for the startup overhead of a connection.
>>
>> 2) Many users seem to replace RSA by elliptic curve cryptography
>> (e.g., Ed25519); I don't know how important multi-precision
>> multiplication is for that.
>
> I did ECC a few years ago. In my case it was an implementation of ECDSA
> signature checks on a resource-limited 32-bit embedded platform.
> Unfortunately, by now I have forgotten the majority of the details.
> From memory, on the critical path there was a 256x256 multiplication with
> only the low 256 bits of the result in use. But I could be wrong about that.
> Maybe 160 bits rather than 256? Or 320?
>
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>


MitchAlsup

Apr 7, 2023, 10:46:32 AM
On Friday, April 7, 2023 at 7:37:00 AM UTC-5, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >They forgot to make timing comparisons for implementation of multiplication
> >of particular size with and without MULX/ADCX/ADOX.
>I'd like to see the numbers not just for 512x512, which does not strike me as
>something used commonly, but also for smaller sizes, down to 128x128 which
>looks to me as most frequently used case of extended precision multiplication.
>
> What makes you think that 128x128 is most-frequently used? And do you
> have any numbers on usage frequency that I can cite?
<
It is the first step up: 64×64 gives 128.
>
> My thinking is that the biggest use of multiprecision arithmetics is
> in cryptography, where RSA works with 1024-bit (deemed too small by
> many these days), 2048-bit, and 4096-bit keys, and taking the p-th
> power of those involves a number of squarings an multiplications (and
> Intel has a paper on squaring in addition to one on multiplication).
<
Squaring is a lot easier (less logic) than multiplication.
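The saving shows up already at two limbs: a general (A:B)×(C:D) multiply needs four 64×64 products, while (A:B)² needs only three distinct ones, because the two cross products coincide. A sketch (the helper name `sqr128` is made up; `unsigned __int128` stands in for a 64x64->128 MUL):

```c
#include <stdint.h>

/* Square of (A:B) into out[3..0], using three 64x64 products
 * instead of the four a general multiply needs. */
static void sqr128(uint64_t a, uint64_t b, uint64_t out[4])
{
    unsigned __int128 aa = (unsigned __int128)a * a;
    unsigned __int128 ab = (unsigned __int128)a * b;  /* used twice */
    unsigned __int128 bb = (unsigned __int128)b * b;

    out[0] = (uint64_t)bb;
    /* Add the cross term twice rather than doubling it, which
     * keeps all carries inside the 128-bit accumulators. */
    unsigned __int128 t = (bb >> 64) + (uint64_t)ab + (uint64_t)ab;
    out[1] = (uint64_t)t;
    unsigned __int128 s = (uint64_t)aa + (ab >> 64) + (ab >> 64)
                        + (t >> 64);
    out[2] = (uint64_t)s;
    out[3] = (uint64_t)(aa >> 64) + (uint64_t)(s >> 64);
}
```

For n limbs the same symmetry cuts the n² partial products to roughly n²/2 + n/2, which is presumably why Intel's squaring paper exists alongside the multiplication one.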

Peter Lund

Apr 15, 2023, 11:53:22 AM
On Thursday, March 30, 2023 at 6:19:52 PM UTC+2, Anton Ertl wrote:
> A while ago I saw a web page that said that some OOO core implements
> the condition codes by hanging them on general-purpose registers; this
> allows renaming the condition codes; and given that a condition code
> is generated with many GPR results, managing them together makes some
> sense; they are overwritten (i.e., freed) separately, which may speak
> for managing them separately.
>
> Anyway, I don't remember where I found this; does anybody have an
> idea? It would be even better if I could find other references (best
> scientific papers, but web sites are also of interest) that describe
> the implementation of condition codes on OoO cores.

Years ago, Andy Glew wrote it here in a reply to me.

-Peter