
Prefix instructions and their usage?


JimBrakefield

Apr 13, 2020, 11:48:19 PM
Much of the discussion on the "Condition Code-less Programming" topic can be placed under a "Usage of Prefix instructions" category. It's also a good master's exam question, so I won't enumerate the known and implemented usages of instruction prefixes.

However "Condition Code-less Programming" adds a new possible usage: prefixing a conditional branch to the following instruction.

So I would like to ask: what are the trade-offs and practicalities of using prefixes? The x86-64 is perhaps the premier example of prefix usage, and the Itanium an example of a blank-sheet design that increased the instruction size rather than using prefixes.

Terje Mathisen

Apr 14, 2020, 4:52:13 AM
You probably need a byte granularity instruction set for this to make
much sense, unless you are going to do something really complicated.

Any really size-efficient instruction set needs to at least in some way
look like a Huffman compression, with the most common
instructions/operations getting the shorter forms. The Mill takes this
more or less to the logical conclusion, right Ivan?
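The size-efficiency point can be made concrete with a toy calculation (the frequencies here are invented for illustration, not measured from any real ISA): give the most common operations a 16-bit form, everything else a 32-bit form, and compare the expected size against a fixed 32-bit encoding.

```c
#include <assert.h>

/* Illustrative sketch (numbers invented): expected bits per executed
 * instruction when a fraction p_short of dynamic ops get a 16-bit
 * short form and the rest take the full 32-bit form. */
double avg_insn_bits(double p_short) {
    return p_short * 16.0 + (1.0 - p_short) * 32.0;
}
```

With 60% of ops in the short form this gives 22.4 bits/instruction, about a 30% size win over fixed 32-bit; a real Huffman-like assignment with more length tiers would do better still.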

The prefix branch would in fact work well as a full-size prefix since
you need the branch offset just as in a regular branch.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

lkcl

Apr 14, 2020, 5:22:27 AM
On Tuesday, April 14, 2020 at 9:52:13 AM UTC+1, Terje Mathisen wrote:
> JimBrakefield wrote:
> > Much of the discussion on the "Condition Code-less Programming" topic
> > can be placed under a "Usage of Prefix instructions" category. It's
> > also a good masters exam question so won't enumerate the known and
> > implemented usages of instruction prefixes.
> >
> > However "Condition Code-less Programming" adds a new possible usage:
> > prefixing a conditional branch to the following instruction.
> >
> > So would like to ask the question: What are the trade-offs and
> > practicalities of using prefixes? The x86-64 is perhaps the premier
> > example of prefix usage and the Itanium an example of a blank sheet
> > design that increased the instruction size rather than use prefixes.
> >
> You probably need a byte granularity instruction set for this to make
> much sense, unless you are going to do something really complicated.
>
> Any really size-efficient instruction set needs to at least in some way
> look like a huffman compression,

you remember someone pointed out last year that the ISA i designed in 1990 looked like a compression algorithm?

> with the most common
> instructions/operations getting the shorter forms. The Mill takes this
> more or less to the logical conclusion, right Ivan?

it's a polymorphic tagged architecture, from what i remember. the "tag" comes from the LD.

thus there is no need to have ADD.W or ADD.S etc, just ADD (and narrow and widen to change the tag).

the only downside is that the tag context is critical state information. it *must* be saved on context switch.

and the more context there is, the longer the latency on switching.

this i feel is quite likely to be why such schemes have not seen extensive adoption.

i like prefixing. SV is based around it. prefixes for vector length, predication, and redirection of the regfile all allow a standard scalar instruction set to leapfrog into becoming a full vector one... *with zero vector opcodes added*.

l.

EricP

Apr 14, 2020, 11:18:51 AM
There are no prefixes, there are just instructions.

Back in the 8086, when it decoded 1 byte per clock,
it would decode the prefix into a single *bit*,
save that prefix bit in some internal register, and later
feed that bit into the instruction decode for the main byte.
This saves decode transistors at the expense of clocks.

Prefixes today exist only in the instruction manual.
They are just different instructions, encoded in an
inefficient and often redundant manner.

If FF is an instruction, and AA and BB are prefixes, then

FF
AA FF
BB FF
AA BB FF
BB AA FF

are 5 encodings but only 4 separate instructions, as the last 2 are duplicates.
The prefixes use a whole byte to provide a single bit of instruction.

It would be better to think about suffixes as at least they
leave the primary byte in the first position, and suffixes
are lexically dependent on specific primary bytes.
But actually they both come down to the same thing.

If you want 2 different instructions, just encode 2 and call it that.
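The counting argument above can be sketched in code. This is a hypothetical one-byte toy ISA (AA, BB, and FF are just the example bytes from the post, not real opcodes); because each prefix only contributes an order-insensitive modifier bit, two distinct byte strings decode to the identical instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Toy decoder for the example: AA and BB are prefix bytes, each worth
 * one modifier bit, applied to a following primary byte.  The decoder
 * ORs the bits together, so "AA BB FF" and "BB AA FF" come out the
 * same -- a byte spent per bit, with redundant encodings. */
enum { PFX_AA = 0xAA, PFX_BB = 0xBB };

uint32_t decode_pfx(const uint8_t *p, int n) {
    uint32_t mods = 0;
    int i = 0;
    while (i < n - 1) {                /* gather any prefix bytes */
        if (p[i] == PFX_AA)      mods |= 1;
        else if (p[i] == PFX_BB) mods |= 2;
        else break;
        i++;
    }
    return (mods << 8) | p[i];         /* modifier bits + primary opcode */
}
```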

EricP

Apr 14, 2020, 11:37:01 AM
EricP wrote:
>
> It would be better to think about suffixes as at least they
> leave the primary byte in the first position, and suffixes
> are lexically dependent on specific primary bytes.
> But actually they both come down to the same thing.

I should add that one can't actually encode these as suffixes, as that
would be an ambiguous parse. But it does force one to think through
which prefixes are legal in which contexts.


BGB

Apr 14, 2020, 12:44:05 PM
On 4/14/2020 3:52 AM, Terje Mathisen wrote:
> JimBrakefield wrote:
>> Much of the discussion on the "Condition Code-less Programming" topic
>> can be placed under a "Usage of Prefix instructions" category.  It's
>> also a good masters exam question so won't enumerate the known and
>> implemented usages of instruction prefixes.
>>
>> However "Condition Code-less Programming" adds a new possible usage:
>> prefixing a conditional branch to the following instruction.
>>
>> So would like to ask the question: What are the trade-offs and
>> practicalities of using prefixes?  The x86-64 is perhaps the premier
>> example of prefix usage and the Itanium an example of a blank sheet
>> design that increased the instruction size rather than use prefixes.
>>
> You probably need a byte granularity instruction set for this to make
> much sense, unless you are going to do something really complicated.
>

I shoe-horned in predicated instructions after the fact, though in my
case it required eating a little more of the 16-bit encoding space
(effectively doubling the encoding space used for 32-bit instruction
forms from what it was originally).

My encoding scheme is conceptually similar to the one in ARM32, just
eating 1 bit for this rather than 4 bits. This was in contrast to using
prefixes.


In my original BJX1 design, the idea was that the 32b ops would be
done via prefix instructions (modifying the following instructions), but
in the implementation I didn't go this route because it would have been
slower (instead decoding them as larger variable-length instructions).

In BJX2 I went directly for variable-length instructions, initially
without prefixes.

More recently, I ended up adding a "jumbo prefix" which allows larger
immediate values, although it works by decoding into a separate lane and
operating in parallel. It is possible that if Jumbo ops were supported
in a scalar implementation, they would be accomplished via a prefix.

However, as can be noted, the jumbo prefixes effectively produce 64-bit
or 96-bit instruction encodings.


> Any really size-efficient instruction set needs to at least in some way
> look like a huffman compression, with the most common
> instructions/operations getting the shorter forms. The Mill takes this
> more or less to the logical conclusion, right Ivan?
>

It sort of works this way in my case, except in
performance-optimized code with WEX enabled, where it ends up being
mostly 32-bit ops (since the whole "OP | OP" thing doesn't work with the
16-bit encodings).

Sadly, getting good code density comes at a performance penalty (of
mostly only running scalar code).

So, between the two modes (16b/32b percentage), size-optimized code ends
up being around 70%/30% and speed optimized code ends up being around
20%/80%. The 48b ops still exist, but have mostly fallen into disuse.


In some prior analysis, I noted that it seemed none of the instructions
were common enough to justify having 8-bit encodings (but plenty were
common enough to justify 16-bit encodings).

While one could argue that x86 has 8-bit instructions, the vast majority
of 8-bit opcodes are either followed immediately by a ModRM byte or
Imm/Disp (so these instructions typically end up as 16 or 24 bits).

Likewise, a fair chunk of the encoding space used for (actual) 8b ops
was carved off and reused for REX prefixes in x86-64 (implying that
their use as REX prefixes was more valuable).


My past conclusion was that 16/32 ISAs would appear to be
"near-optimal" on this front. A slight gain would be possible via also
having 24 and 48 bit instructions, with 16b remaining as the shortest
encoding, at the main costs of added complexity and needing to deal with
a BYTE-aligned instruction stream rather than one that is WORD or DWORD aligned.

Meanwhile, x86 decoding is a lot more complicated than ideal; I have
serious doubts about the possibility of executing x86 code efficiently
or affordably on a lower-end FPGA (at least, excluding some form of
dynamic translation to an internal VLIW ISA with the x86 decoder
implemented in firmware or similar).



In my case, I had considered it could be possible to entropy-encode the
instruction stream ("on disk") using a specialized codec; but my initial
thinking on it implied it may not be worth the added cost and complexity.

This would likely be via a DWORD oriented LZ coder, which would split up
the instruction bits and feed them through different entropy contexts
(say, four 8-bit contexts; possibly with special cases for dealing with
immediate values, which could make more sense if encoded as a VLN than
by "polluting" the Huffman contexts, ...).

As is though, I am using LZ4.

I had also experimented with Deflate (1), but Deflate doesn't gain much
compression advantage over LZ4 to justify its higher cost. The
specialized codec would likely be both slower and more complicated than
Deflate (a lot more Huffman tables), so at best it seems debatable (more
so as the 5MHz SDcard interface isn't quite *that* slow). It would also
need an option to detect and fall-back to a Deflate-like encoding when
it encounters data that doesn't map nicely to DWORD's (DWORD oriented
encoding fares rather poorly against text strings).

1: This was along vaguely similar lines to the PE/COFF image having been
fed through gzip, but with a different file header.
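One common concrete form of the context-splitting idea above (a generic byte-planing transform, not BGB's actual codec) deinterleaves the stream of 32-bit instruction words into four byte planes, so each plane can be fed to its own entropy context or compressor:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: split n 32-bit words into 4 byte planes of n
 * bytes each.  Bytes at the same position within a word tend to be
 * more self-similar (opcode bits vs. register bits vs. immediates),
 * which is what separate entropy contexts exploit. */
void split_planes(const uint32_t *insns, size_t n, uint8_t *planes /* 4*n bytes */) {
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 4; b++)
            planes[(size_t)b * n + i] = (uint8_t)(insns[i] >> (8 * b));
}
```

A real codec would still need the fall-back path mentioned above for data that isn't DWORD-shaped, such as string tables.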


> The prefix branch would in fact work well as a full-size prefix since
> you need the branch offset just as in a regular branch.
>

I am not sure how prefix branches would have much benefit over normal
conditional branches.

Maybe you could have them predicate the following instructions rather
than branch over them. But one could also potentially do this as a
decoder special case, e.g.: if the conditional branch has a displacement
less than N, decode it as predicating the following N instructions (with
N being a certain fraction of the effective pipeline length).
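A rough sketch of that decoder special case (the names, struct layout, and window size here are invented for illustration): a short forward conditional branch is rewritten into a predicate mask over the instructions it would have skipped, so fetch is never redirected.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed max shadow length, some fraction of the pipeline depth. */
#define PRED_WINDOW 4

typedef struct {
    int is_branch;       /* still a real branch after decode? */
    int disp;            /* branch displacement in instructions */
    uint8_t pred_mask;   /* bit i set => instruction i is predicated */
} decoded_t;

/* Decode a conditional branch; if it only skips 1..PRED_WINDOW
 * instructions, convert it into predication of those instructions. */
decoded_t decode_cond_branch(int disp) {
    decoded_t d = { 1, disp, 0 };
    if (disp > 0 && disp <= PRED_WINDOW) {
        d.is_branch = 0;                            /* fall through... */
        d.pred_mask = (uint8_t)((1u << disp) - 1);  /* ...under predicate */
    }
    return d;
}
```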

Ivan Godard

Apr 14, 2020, 1:03:12 PM
On 4/14/2020 1:52 AM, Terje Mathisen wrote:
> JimBrakefield wrote:
>> Much of the discussion on the "Condition Code-less Programming" topic
>> can be placed under a "Usage of Prefix instructions" category.  It's
>> also a good masters exam question so won't enumerate the known and
>> implemented usages of instruction prefixes.
>>
>> However "Condition Code-less Programming" adds a new possible usage:
>> prefixing a conditional branch to the following instruction.
>>
>> So would like to ask the question: What are the trade-offs and
>> practicalities of using prefixes?  The x86-64 is perhaps the premier
>> example of prefix usage and the Itanium an example of a blank sheet
>> design that increased the instruction size rather than use prefixes.
>>
> You probably need a byte granularity instruction set for this to make
> much sense, unless you are going to do something really complicated.
>
> Any really size-efficient instruction set needs to at least in some way
> look like a huffman compression, with the most common
> instructions/operations getting the shorter forms. The Mill takes this
> more or less to the logical conclusion, right Ivan?


Close. We split the various fields of all instructions resident in a
slot into three kinds, based on when the value of the field is needed in
the decode process.

/pinned/ fields are present, in the same bit field, in all instructions,
and typically are used to decode the rest, such as format indicators.

/direct/ fields are only present in the instructions that use the
particular field, and are typically numeric values like belt numbers of
runtime arguments.

/merged/ fields are everything else, and are encoded collectively as a
single compressed field that approximates Huffman. The specification
machinery figures out all possible meaningful combinations of merged
field values; the the ordinal number of a particular combination (called
a "model number" in the software) stands for that combination, and is
encoded as a numeric value in a single field of the instruction. What
would be major and minor opcodes in a legacy format are part of the
combinations represented by the model number. This is within a fraction
of a bit of true Huffman.

The spec notation used lets you say things like "all of add/sub/mul/shl
have forms with two belt args and any of wrap/saturate/except/widen
overflow behavior" and get out 16 combinations (and two direct fields)
to include in the pile that eventually becomes model numbers.

Decode hardware dials out the merged field at the same time it dials out
the direct fields (and all under control of the formatting info in the
pinned fields) and uses the model number obtained as an index into a ROM
to get what is in effect a horizontal microcode version of the
instruction. There is no field-level parse.

There is a bundle-level parse that separates the bundle into three
blocks (on each of two sides), which takes a cycle. Instructions in each
block are decoded in parallel by slot. While the encoding of each slot
is packed and not byte or word aligned, it is fixed size so the position
of each slot's instruction in the block is fixed for the dialers and
shifters.

Put together, this can decode 30+ instructions per cycle for family
members with that many slots. More detail at:
https://millcomputing.com/docs/encoding/

MitchAlsup

Apr 14, 2020, 1:22:38 PM
The My 66000 instruction set has a few prefixes::

A) Predication is done with a PRED instruction which carries a shadow.
In this shadow is a bit for each subsequent instruction that indicates
it should execute (1) or be skipped (0). The shadow has a maximal length
of 8 instructions. And the shadow can be "cast" in the THEN-clause sense
or in the ELSE-clause sense--which allows for construction of && and ||
constructs that guard short then and else statements.

B) The Virtual Vector Machine implements loop-vectorization by casting
a vector shadow over instructions in the loop. Here the VEC instruction
identifies the top of the loop (Next instruction) and which registers
carry loop dependencies. Down at the end of the loop is a LOOP instruction
which performs the typical ADD/CMP/Bcc of counted loops. Bundling of these
3 instructions together enables them to be executed in 1-cycle {we know
the top-of-loop address, we know the index register and increment, and we
know the loop increment:: a 3-input adder does both ADD and CMP.} By
using casts, every (small enough) instruction loop can be vectorized
with access to the whole of the instruction set (except branches).
Flow control inside a vectorized loop must be in the PREDicate form.

MitchAlsup

Apr 14, 2020, 1:24:28 PM
On Tuesday, April 14, 2020 at 3:52:13 AM UTC-5, Terje Mathisen wrote:
> JimBrakefield wrote:
> > Much of the discussion on the "Condition Code-less Programming" topic
> > can be placed under a "Usage of Prefix instructions" category. It's
> > also a good masters exam question so won't enumerate the known and
> > implemented usages of instruction prefixes.
> >
> > However "Condition Code-less Programming" adds a new possible usage:
> > prefixing a conditional branch to the following instruction.
> >
> > So would like to ask the question: What are the trade-offs and
> > practicalities of using prefixes? The x86-64 is perhaps the premier
> > example of prefix usage and the Itanium an example of a blank sheet
> > design that increased the instruction size rather than use prefixes.
> >
> You probably need a byte granularity instruction set for this to make
> much sense, unless you are going to do something really complicated.

My 66000 did not need byte granularity--the trick is that the ISA is
rich enough that prefixes are rare. And then you use the prefix instruction
to modify several subsequent instructions (which utilizes the entropy of
using large containers as the prefix).

MitchAlsup

Apr 14, 2020, 1:29:01 PM
On Tuesday, April 14, 2020 at 10:18:51 AM UTC-5, EricP wrote:
> JimBrakefield wrote:
> > Much of the discussion on the "Condition Code-less Programming" topic can be placed under a "Usage of Prefix instructions" category. It's also a good masters exam question so won't enumerate the known and implemented usages of instruction prefixes.
> >
> > However "Condition Code-less Programming" adds a new possible usage: prefixing a conditional branch to the following instruction.
> >
> > So would like to ask the question: What are the trade-offs and practicalities of using prefixes? The x86-64 is perhaps the premier example of prefix usage and the Itanium an example of a blank sheet design that increased the instruction size rather than use prefixes.
>
> There are no prefixes, there are just instructions.
>
> Back in the 8086 when it decoded 1 byte per clock,
> it would decode the prefix into a single *bit*,
> save the prefix bit in some internal register, and later
> feeds that bit into the instruction decode for the main byte.
> This saves decode transistors at the expense of clocks.
>
> Prefixes today exist only in the instruction manual.
> They are just different instructions, encoded in an
> inefficient and often redundant manner.
>
> If FF is an instruction, and AA and BB are prefixes, then
>
> FF
> AA FF
> BB FF
> AA BB FF
> BB AA FF

I softly want to disagree--having built an x86-64 decoder, I see it
slightly differently::

There is a main opcode byte that is optionally preceded by a number
of prefix bytes. The main opcode byte is used to index a table of
instructions (or microcode entry points) while the prefixes are used
to control what table is indexed by the main opcode.

The indexed table, then, controls whether there is a second opcode byte,
R/M, or SIB or both; and those control the generation of displacements and
address modes.
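A minimal sketch of that table-selection view (the tables and the 0x66-style prefix here are illustrative, not a real x86 decoder): the prefix byte decodes to nothing by itself; it only picks which table the main opcode byte indexes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-table decoder: an operand-size-style prefix flips
 * the decoder from the base table to an alternate table, and the main
 * opcode byte then indexes whichever table was selected. */
enum { PFX_ALT = 0x66 };

static const char *table_base[256] = { [0x01] = "add32" };
static const char *table_alt[256]  = { [0x01] = "add16" };

const char *decode_opc(const uint8_t *p) {
    const char **tab = table_base;
    while (*p == PFX_ALT) { tab = table_alt; p++; } /* prefixes pick table */
    return tab[*p];                                 /* main byte indexes it */
}
```

In a real decoder the selected entry would also say whether a second opcode byte, ModRM, or SIB follows, as described above.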

EricP

Apr 14, 2020, 2:20:12 PM
Yes, but whether those table entries encode variations of the same
instruction or different ones, it takes the same number of transistors.
Whether you call the first byte a prefix or the first byte of a
2 byte instruction, you wind up in the same place.

By calling it a prefix, people think there is some kind of saving.
In fact what happens is that the ISA design fails to define a
specific order and applicability for those prefix bytes,
so you in fact wind up with a much larger decoder than otherwise.

Instead of a lock prefix, if x86 had said "we have 2 add-to-memory
instructions: Add [mem],reg and AtomicAdd [mem],reg",
there could be no confusion about what LOCK XOR reg,reg means,
nor any decode table space assigned to encode it.


EricP

Apr 14, 2020, 3:03:36 PM
MitchAlsup wrote:
> On Monday, April 13, 2020 at 10:48:19 PM UTC-5, JimBrakefield wrote:
>> Much of the discussion on the "Condition Code-less Programming" topic can be placed under a "Usage of Prefix instructions" category. It's also a good masters exam question so won't enumerate the known and implemented usages of instruction prefixes.
>>
>> However "Condition Code-less Programming" adds a new possible usage: prefixing a conditional branch to the following instruction.
>>
>> So would like to ask the question: What are the trade-offs and practicalities of using prefixes? The x86-64 is perhaps the premier example of prefix usage and the Itanium an example of a blank sheet design that increased the instruction size rather than use prefixes.
>
> The My 66000 instruction set has a few prefixes::
>
> A) Predication is done wiht a PRED instruction which carries a shadow.
> In this shadow is a bit for each subsequent instruction that indicates
> it should execute (1) or be skipped (0). The shadow has a maximal length
> of 8 instructions. And the shadow can be "cast" in the THEN-clause sense
> or in the ELSE-clause sense--which allows for construction of && and ||
> constructs that guard short then and else statements.

Calling PRED a prefix is a bit of a stretch.
If one of those guarded instructions traps,
you do not set the PC back to the PRED, right?

So PRED is a separate instruction in its own right.

> B) The Virtual Vector Machine implements loop-vectorization by casting
> a vector shadow over instructions in the loop. Here the VEC instruction
> identifies the top of the loop (Next instruction) and which registers
> carry loop dependencies. Down at the end of the loop is a LOOP instruction
> which performs the typical ADD/CMP/Bcc of counted loops. Bundling of these
> 3 instructions together enables them to be executed in 1-cycle {we know
> the top-of-loop address, we know the index register and increment, and we
> know the loop increment:: a 3-input adder does both ADD and CMP.} By
> using casts, every (small enough) instruction loop can be vectorized
> with access to the whole of the instruction set (except branches).
> Flow control inside a vectorized loop must be in the PREDicate form.

Here too, VEC is a separate instruction in its own right.

JimBrakefield

Apr 14, 2020, 3:14:45 PM
It's definitely a prefix if traps are disallowed between the prefix and the rest of the instruction. It probably is a prefix if branching to the middle (and skipping the prefix) is not a good idea. Of course branching into the middle of an instruction is not a good idea anyway (not that it can't be done).

I'm more interested in the question of encoding: is it better to have prefixes or to increase the variety of instruction sizes and formats?

MitchAlsup

Apr 14, 2020, 4:12:41 PM
On Tuesday, April 14, 2020 at 2:03:36 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, April 13, 2020 at 10:48:19 PM UTC-5, JimBrakefield wrote:
> >> Much of the discussion on the "Condition Code-less Programming" topic can be placed under a "Usage of Prefix instructions" category. It's also a good masters exam question so won't enumerate the known and implemented usages of instruction prefixes.
> >>
> >> However "Condition Code-less Programming" adds a new possible usage: prefixing a conditional branch to the following instruction.
> >>
> >> So would like to ask the question: What are the trade-offs and practicalities of using prefixes? The x86-64 is perhaps the premier example of prefix usage and the Itanium an example of a blank sheet design that increased the instruction size rather than use prefixes.
> >
> > The My 66000 instruction set has a few prefixes::
> >
> > A) Predication is done wiht a PRED instruction which carries a shadow.
> > In this shadow is a bit for each subsequent instruction that indicates
> > it should execute (1) or be skipped (0). The shadow has a maximal length
> > of 8 instructions. And the shadow can be "cast" in the THEN-clause sense
> > or in the ELSE-clause sense--which allows for construction of && and ||
> > constructs that guard short then and else statements.
>
> Calling PRED a prefix is a bit of a stretch.

I will give you that.

> If one of those guarded instructions traps,
> you do not set the PC back to the PRED, right?
>
> So PRED is a separate instruction in its own right.

PRED evaluates and decides to execute the then-clause or the else-clause.
PRED casts a shadow over the (up to 8) following instructions, placing
them in the then-clause or the else-clause in a way that still allows
code scheduling between the instructions. A shadow of 01010101 puts every
other instruction in the then-clause and the remaining ones in the
else-clause.

It is the cast shadow which is the prefix--bits not carried by the
actual instructions but which modify them in any event.
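A sketch of the shadow semantics as described (an illustration of the idea only, not the actual My 66000 encoding): each shadow bit assigns one following instruction to the then-clause or else-clause, and the evaluated condition selects which of those actually issue.

```c
#include <assert.h>
#include <stdint.h>

/* Shadow bit i = 1 puts instruction i in the then-clause, 0 in the
 * else-clause.  Returns a mask of which of the next n instructions
 * execute, given the condition PRED evaluated. */
uint8_t exec_mask(uint8_t shadow, int n, int cond_true) {
    uint8_t live = cond_true ? shadow : (uint8_t)~shadow;
    return live & (uint8_t)((1u << n) - 1);   /* clip to shadow length */
}
```

The 01010101 example falls out directly: with the condition true the odd-position instructions issue, with it false the even-position ones do, so then- and else-clause instructions can be interleaved for scheduling.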
>
> > B) The Virtual Vector Machine implements loop-vectorization by casting
> > a vector shadow over instructions in the loop. Here the VEC instruction
> > identifies the top of the loop (Next instruction) and which registers
> > carry loop dependencies. Down at the end of the loop is a LOOP instruction
> > which performs the typical ADD/CMP/Bcc of counted loops. Bundling of these
> > 3 instructions together enables them to be executed in 1-cycle {we know
> > the top-of-loop address, we know the index register and increment, and we
> > know the loop increment:: a 3-input adder does both ADD and CMP.} By
> > using casts, every (small enough) instruction loop can be vectorized
> > with access to the whole of the instruction set (except branches).
> > Flow control inside a vectorized loop must be in the PREDicate form.
>
> Here too, VEC is a separate instruction in its own right.

Yes, but its immediate field is used to define which registers are
loop-carried dependencies. This information was not present in the actual
instructions "in" the loop but is needed in order to set up the loop for
vector evaluation.

A prefix adds bits not carried by the actual instruction itself.

MitchAlsup

Apr 14, 2020, 4:16:45 PM
If the instruction is executed (dynamically) more than 0.1% of the time
it should be encoded as an instruction. Less than 0.001% of the time,
it should have modifier bits pasted on by prefixes (as the word prefix
is being used in this thread). {Notice the middle ground}

I, also after thinking about this for a couple of days, think I can use
some sort of casting (prefix) in order to establish carry propagation
across a number of instructions! Rather than inventing a bunch of
instruction variants, throw (i.e., cast) shadows over instructions.

Brett

Apr 14, 2020, 7:06:29 PM
BGB <cr8...@gmail.com> wrote:
>
> My past conclusion was that 16/32 ISA's would appear to be
> "near-optimal" on this front. A slight gain would be possible via also
> having 24 and 48 bit instructions;

24 bits looks optimal to me, with 48 bits for constants and jumps--at
least if you want something RISC-like instead of x86.
Less compression than 16+16, but simpler decode and fewer instructions,
so slightly faster.

Also, with 48 bits, vector instructions actually fit, instead of the
insane issues today or the alternative of 64 bit SPARC instructions.
64 vector registers and four operands are possible.

A 16 bit variable length would end up with 16+16+16+16 to get there, and
then you are into x86 decode insanity.

The window for 16 bit opcodes is closed, and fixed 32 bit is both too
small and too large. Vector instructions dominate new architectures, so
supporting them is critical today.

lkcl

Apr 14, 2020, 7:39:21 PM
On Tuesday, April 14, 2020 at 9:12:41 PM UTC+1, MitchAlsup wrote:

> A prefix adds bits not carried by the actual instruction itself.

it has often taken discussions of over 150 messages to get this simple concept across to people.

i call it by several names because there simply does not exist an industry standard term:

* ISAMUX - because hidden contextual bits NOT IN THE CURRENT OPCODE AT PC actually go into the ISA decode MUX phase
* ISANS for "namespaces" where due to the extra hidden bits you are effectively flipping some opcodes to a new "namespace" (a la c++ "using namespace")
* escape sequencing similar to AT RIL/GSM embedded channels, and many others.

examples include:

* FP Rounding modes. these are usually 3 bits which clearly change the meaning of FP opcodes.
* LE/BE dynamic switching. these clearly change the meaning of LD and ST.
* 32/64 bit compatibility modes. these clearly and radically alter many operations.
* POWER9's PCR (compatibility register) which can switch off entire swathes of instructions and drop back to 2.07B or earlier subsets

some prefixes therefore last a little longer than just "one instruction".

some may have a time limit (as done in the My 66000) although crossing over a conditional branch or into a subroutine i would say is inadvisable.

others like the fused-branch concept a few days ago only last *one* instruction.

in SV we created two schemes:

* SV Prefix. a 16 bit prefix with a highly compact reduced-functionality Vectorisation Context

* SV VBLOCK, which is sort-of like the My 66000's "time limited" predication except covering up to 32 instructions and also specifying register redirects, swizzle contexts, and much more. the general idea being to crush the instructions down to the bare minimum.

l.

lkcl

Apr 14, 2020, 7:46:34 PM
On Wednesday, April 15, 2020 at 12:06:29 AM UTC+1, Brett wrote:

> The window for 16 bit opcodes is closed

not necessarily, brett.

if you are happy to use escape sequencing, it becomes possible to apply it more than once (like the LZ compression algorithm).

you can then "page in" relevant and ENTIRELY DIFFERENT 16 bit opcodes for a particular specialist algorithm (video decode, 3d texturisation) and, once that subroutine is completed, drop out of that particular prefixing / ISAMUX / ISANS / escape-sequencing mode, back to "normal" instruction decoding.

don't let anyone tell you that 16 bit is overloaded and cannot be useful.

caveat: disassembly, unless you know the current ISAMUX context, is impossible. example: try decoding instructions where you do not know if they are LE or BE encoded.

l.

Terje Mathisen

Apr 15, 2020, 3:28:00 AM
MitchAlsup wrote:
> On Tuesday, April 14, 2020 at 2:03:36 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Monday, April 13, 2020 at 10:48:19 PM UTC-5, JimBrakefield wrote:
>>>> Much of the discussion on the "Condition Code-less Programming" topic can be placed under a "Usage of Prefix instructions" category. It's also a good masters exam question so won't enumerate the known and implemented usages of instruction prefixes.
>>>>
>>>> However "Condition Code-less Programming" adds a new possible usage: prefixing a conditional branch to the following instruction.
>>>>
>>>> So would like to ask the question: What are the trade-offs and practicalities of using prefixes? The x86-64 is perhaps the premier example of prefix usage and the Itanium an example of a blank sheet design that increased the instruction size rather than use prefixes.
>>>
>>> The My 66000 instruction set has a few prefixes::
>>>
>>> A) Predication is done wiht a PRED instruction which carries a shadow.
>>> In this shadow is a bit for each subsequent instruction that indicates
>>> it should execute (1) or be skipped (0). The shadow has a maximal length
>>> of 8 instructions. And the shadow can be "cast" in the THEN-clause sense
>>> or in the ELSE-clause sense--which allows for construction of && and ||
>>> constructs that guard short then and else statements.
>>
>> Calling PRED a prefix is a bit of a stretch.
>
> I will give you that.
>
>> If one of those guarded instructions traps,
>> you do not set the PC back to the PRED, right?
>>
>> So PRED is a separate instruction in its own right.
>
> PRED evaluates and decides to execute the then-clause or the else-clause,
> PRED casts a shadow over the (up to 8) instructions placing them in the
> then-clause or the else clause in a way that still allows code scheduling
> between the instructions. A shadow of 01010101 puts every other instruction
> in the then-clause and every other other one in the else-clause.

I have many, many times wanted to have a conditional CALL to handle
special cases, is PRED general enough to do this? I.e. can you have a
CALL instruction in the PRED shadow?

Terje Mathisen

unread,
Apr 15, 2020, 3:33:23 AM4/15/20
to
That's actually quite easy in most instruction sets, simply because
code bytes and immediate data bytes are both relatively sparsely used.

I.e. I bet you could show me some x86 binary data and I would be able to
quickly determine if it was shown correctly (i.e. LE) or byte flipped.

OTOH, if we pretend to be Mel, then I would happily make sure that most
of my immediate data would also be valid opcodes, and jump into the
middle of instructions etc. :-)

Terje Mathisen

unread,
Apr 15, 2020, 3:58:22 AM4/15/20
to
You just need to add the ADD1 instruction which Itanium used; then you
could use the outgoing carry to control the PRED that would either do
ADD or ADD1. Lacking ADD1, you just need to PRED the extra increment, but
you also need to cast a shadow over the two different ways to
determine the outgoing carry:

if (incoming_carry) {
sum[i] = a[i] + b[i] + 1;
outgoing_carry = sum[i] <= a[i];
}
else {
sum[i] = a[i] + b[i];
outgoing_carry = sum[i] < a[i];
}

which simplifies to

sum[i] = a[i] + b[i];
if (carry) {
sum[i]++;
carry = sum[i] <= a[i];
}
else {
carry = sum[i] < a[i];
}

which I think you can compile to

ai = a[i]; bi = b[i];
si = ai + bi;
PRED(carry, True, both, True, False)

si++; // Only done when an incoming carry
cmp = COMP(si, ai); // Done for both paths, sets all compare bits
carry = TEST(cmp, LE); // incoming carry tests for <=
carry = TEST(cmp, LT); // no carry tests for <

sum[i] = si;

I really wanted to generate the next shadow directly instead of the TEST
operations, but having overlapping PREDs gets really complicated . :-)

It looks like 4 cycles would be the minimum possible latency for each
iteration? Not bad for branchless code, but having an ADD1 would save
one of those cycles:

ai = a[i]; bi = b[i];
PRED(carry, True, False, both, True, False)

si = ADD1(ai + bi); // Only done when an incoming carry
si = ai + bi;
cmp = COMP(si, ai); // Done for both paths, sets all compare bits
carry = TEST(cmp, LE); // incoming carry tests for <=
carry = TEST(cmp, LT); // no carry tests for <

sum[i] = si;
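For reference, the carry recurrence being compiled above can be modeled directly in portable C. This is only a sketch of the semantics (add_n is an invented helper name), not My 66000 code:

```c
#include <stdint.h>
#include <stddef.h>

/* Multi-word add following the recurrence above: after
   s = a[i] + b[i] + carry, the outgoing carry is (s <= a[i])
   when the incoming carry was 1, and (s < a[i]) otherwise. */
static unsigned add_n(uint64_t *sum, const uint64_t *a,
                      const uint64_t *b, size_t n, unsigned carry)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i] + carry;
        carry = carry ? (s <= a[i]) : (s < a[i]);
        sum[i] = s;
    }
    return carry;
}
```

Adding an all-ones word plus 1 exercises both wrap-around paths of the recurrence.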

lkcl

unread,
Apr 15, 2020, 5:20:11 AM4/15/20
to
On Wednesday, April 15, 2020 at 8:33:23 AM UTC+1, Terje Mathisen wrote:
> > impossible. example: try decoding instructions where you do not know
> > if they are LE or BE encoded.
>
> That's actually quite easy in most instruction sets, simply because
> code bytes and immediate data bytes are both relatively sparsely used.

that's a heuristic, isn't it?

> I.e. I bet you could show me some x86 binary data and I would be able to
> quickly determine if it was shown correctly (i.e. LE) or byte flipped.

yes - *you* could... now transfer that to modifying gnu disasm to do it.

now identify all other types of prefixes by similarly guessing....

now write code to do that for gnu disasm...

the more state information needed, the less likely (and in some cases impossible) that that will succeed.


> OTOH, if we pretend to be Mel, then I would happily make sure that most
> of my immediate data would also be valid opcodes, and jump into the
> middle of instructions etc. :-)
>

:)

l.

Terje Mathisen

unread,
Apr 15, 2020, 5:39:39 AM4/15/20
to
lkcl wrote:
> On Wednesday, April 15, 2020 at 8:33:23 AM UTC+1, Terje Mathisen wrote:
>>> impossible. example: try decoding instructions where you do not know
>>> if they are LE or BE encoded.
>>
>> That's actually quite easy in most instruction sets, simply because
>> code bytes and immediate data bytes are both relatively sparsely used.
>
> that's a heuristic, isn't it?

Sure!

Mitch's My 66000 architecture has intentionally reserved the opcodes
corresponding to most of the small/common constants (integer and fp!) so
that it is very easy to detect an attempt to execute or disassemble
invalid code.
>
>> I.e. I bet you could show me some x86 binary data and I would be able to
>> quickly determine if it was shown correctly (i.e. LE) or byte flipped.
>
> yes - *you* could... now transfer that to modifying gnu disasm to do it.

The best disassemblers are very good at determining what is most likely
code and what looks like embedded data tables, they have to use similar
heuristics to do that. They do allow you to override it since user
feedback driven approaches work best of all.
>
> now identify all other types of prefixes by similarly guessing....
>
> now write code to do that for gnu disasm...

Simply running disasm blindly on both LE and BE versions of the binary,
then a perl script to compare the frequency distribution of recognized
instructions would probably do it.
>
> the more state information needed, the less likely (and in some cases impossible) that that will succeed.
>
>
>> OTOH, if we pretend to be Mel, then I would happily make sure that most
>> of my immediate data would also be valid opcodes, and jump into the
>> middle of instructions etc. :-)
>>
>
> :)

BT, DT: Back in the 486 days my fastest code for some common kernels
would start by skipping one or two initial instructions in the
processing loop, and it would do that by making those instructions the
immediate data of a dummy CMP instruction.

This broke down on the Pentium since it would delay fast (dual-issue)
processing until one iteration later.

Przemysław Cieszyński

unread,
Apr 15, 2020, 8:05:10 AM4/15/20
to
On Wednesday, April 15, 2020 at 11:39:39 AM UTC+2, Terje Mathisen wrote:

> Mitch's My 66000 architecture has intentionally reserved the opcodes
> corresponding to most of the small/common constants (integer and fp!) so
> that it is very easy to detect an attempt to execute or disassemble
> invalid code.

I see no point; the "problem" does not exist in practice.

BTW, Mitch should incorporate code compression instead; code density is much more important than a small increase in decoder complexity. 16K short opcodes are enough to get 20%.

Stephen Fuld

unread,
Apr 15, 2020, 11:27:07 AM4/15/20
to
I may be mis-remembering this, and Mitch certainly knows, but I thought
the Pred instruction didn't have a "both" option.


>
>  si = ADD1(ai + bi);    // Only done when an incoming carry
>  si = ai + bi;
>  cmp = COMP(si, ai);    // Done for both paths, sets all compare bits
>  carry = TEST(cmp, LE);    // incoming carry tests for <=
>  carry = TEST(cmp, LT); // no carry tests for <
>
>  sum[i] = si;
>


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

unread,
Apr 15, 2020, 12:00:41 PM4/15/20
to
In my case, 0000..DFFF were used for 16-bit opcodes.

A few 12-bit blocks remain (one unused, one is from the original /
no-longer-used FPU).


So, say, we have 0000..3FFF, and 4b registers, picking the most-used 16b
blocks from my ISA:
0znd: MOV.x (SP, disp4) + misc
1znm: 2R ALU ops ("ADD Rm, Rn" and like)
2znd: Branch ops, a few "Imm4, Reg" ops, ...
3znz: 1R (and 0R) ops.

Immediate-form instructions would be limited to 4b, as there isn't
really enough encoding space here to afford *any* Imm8 cases.

This would omit most Load/Store ops, but in my ISA LD/ST tends to be
dominated by 32-bit encodings (with the main exception being SP-rel forms).

I suspect this is because of a lack of effective 16b (Rm, disp) or
(Rm, Ri) forms; and my compiler won't usually choose (Rm, R0) if the
operation can be encoded directly (more so since adding jumbo prefixes,
which had previously been the main source of (Rm, R0) loads/stores).


Partial layout (reorg / pruning / ...):
* 00nd MOV.L Rn, (SP, disp4u) //Stack-Relative Store
* 01nd MOV.Q Rn, (SP, disp4u) //Stack-Relative Store
* 02nd MOV.L Rk, (SP, disp4u) //Stack-Relative Store
* 03nd MOV.Q Rk, (SP, disp4u) //Stack-Relative Store
* 04nd MOV.L (SP, disp4u), Rn //Stack-Relative Load
* 05nd MOV.Q (SP, disp4u), Rn //Stack-Relative Load
* 06nd MOV.L (SP, disp4u), Rk //Stack-Relative Load
* 07nd MOV.Q (SP, disp4u), Rk //Stack-Relative Load
* 08nm MOV Rm, Cn //MOV to Control Reg
* 09nm MOV Cm, Rn //MOV from Control Reg
* 0Anm ADD Rm, R0, Rn //Rn=Rm+R0
* 0Bnm -
* 0Cdd ? LEA.Q (SP, disp8s), SP //Stack-Pointer Adjust
* 0Dnd MOV.X Rx, (SP, disp4u) //128b Stack-Rel Store
* 0End MOVU.L (SP, disp4u), Rn //Stack-Relative ZX Load
* 0Fnd MOV.X (SP, disp4u), Rx //128b Stack-Rel Load

* 10nm ADD Rm, Rn //Rn=Rn+Rm
* 11nm SUB Rm, Rn //Rn=Rn-Rm
* 12nm -
* 13nm -
* 14nm TST Rm, Rn //SR.T=!(Rm&Rn)
* 15nm AND Rm, Rn
* 16nm OR Rm, Rn
* 17nm XOR Rm, Rn
* 18nm MOV Rm, Rn //Rn=Rm
* 19nm MOV Rj, Rn
* 1Anm MOV Rm, Rk
* 1Bnm MOV Rj, Rk
* 1Cnm CMPEQ Rm, Rn //Rn==Rm (Low 32 Bits)
* 1Dnm CMPHI Rm, Rn //Unsigned Rn GT Rm (Low 32 Bits)
* 1Enm CMPGT Rm, Rn //Signed Rn GT Rm (Low 32 Bits)
* 1Fnm -

* 20dd BRA (PC, disp8s) //Branch, PC=PC+(disp8s*2)
* 21dd BSR (PC, disp8s) //Branch Subroutine
* 22dd BT (PC, disp8s) //Branch True
* 23dd BF (PC, disp8s) //Branch False
* 24jj -
* 25jj -
* 26jj LDISH imm8u, R0 //R0=(R0 << 8)|Imm8u
* 27ii LDI imm8s, R0 //R0=Imm8s (Collapsed from Imm12)
* 28nj ADD imm4u, Rn //Rn=Rn+Imm8u (Collapsed from Imm8)
* 29nj ADD imm4n, Rn //Rn=Rn+Imm8n (Collapsed from Imm8)
* 2Anj LDIZ imm4u, Rn //Rn=Imm4u (Collapsed from Imm8)
* 2Bnj LDIN imm4n, Rn //Rn=Imm4n (Collapsed from Imm8)
* 2Cnj CMPEQ imm4u, Rn //Rn==Imm4u
* 2Dnj CMPEQ imm4n, Rn //Rn==Imm4n
* 2Enj CMPGT imm4u, Rn //Rn> Imm4u
* 2Fnj CMPGE imm4u, Rn //Rn>=Imm4u

* 3zzz ...

Not going to layout the 1R block, but in my case it encodes 5-bit regs
directly (R0..R31), so could do similar here. A lot of random "odds and
ends" can be fit in here.


As-is, this would not be sufficient for a standalone ISA, but this is
mostly based on culling down to some of the most frequently used parts
of the 16-bit part of my ISA as it exists.

This somewhat reduces the ops which have access to R16..R31, but these
end up less often used with 16-bit encodings in my case.


...

Stefan Monnier

unread,
Apr 15, 2020, 12:22:35 PM4/15/20
to
> It looks like 4 cycles would be the minimum possible latency for each
> iteration? Not bad for branchless code, but having an ADD1 would save
> one of those cycles:

Compared to the `addc` version using condition codes, it still looks
pretty bad.


Stefan

MitchAlsup

unread,
Apr 15, 2020, 12:48:54 PM4/15/20
to
Yes !

But you are going to have to evaluate the arguments in any event.

MitchAlsup

unread,
Apr 15, 2020, 1:00:59 PM4/15/20
to
which simplifies to::

ADD Rsum,Ra,Rb
CMP Rcmp,Rsum,Ra
ADD Rsum,Rsum,Rcarry
PRED Rcarry,{2:T E}
EXT Rcarry,Rcmp<1:3>
EXT Rcarry,Rcmp<1:4>

MitchAlsup

unread,
Apr 15, 2020, 1:14:25 PM4/15/20
to
Which should ALSO simplify to:

ADD Rsum,Ra,Rb
ADD Rsum,Rsum,Rcarry
CMP Rt,Rsum,Ra
EXT Rcarry,Rt<1:4>

Terje Mathisen

unread,
Apr 15, 2020, 1:24:10 PM4/15/20
to
The best alternative would be to implement ADD3 with three inputs and
two outputs, i.e. a full-register-width version of a binary full adder.

The second output would always be 0, 1 or 2. If you require the third
input to be just 0 or 1 (i.e. incoming carry) then you get a single-bit
outgoing carry in the second return reg.

Mitch already has at least one set of 3-input ops; if he could also
generate a pair of results (which you need for the IEEE 754-2019
augmented operations) then he would be home free.

AFAIK, the general ADD3 adder tree is less than 50% larger than regular ADD?
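A C model of the ADD3 being proposed here, with the third input restricted to an incoming carry of 0 or 1. The struct-return shape and names are illustrative only, not a claim about any ISA's definition, and the sketch assumes a compiler with the unsigned __int128 extension (GCC/Clang):

```c
#include <stdint.h>

typedef struct { uint64_t sum; uint64_t carry; } add3_t;

/* Full-register-width binary full adder: three inputs, two outputs.
   The second output is 0, 1, or 2 in general; when c is restricted
   to 0 or 1 (an incoming carry), it is the single-bit outgoing carry. */
static add3_t add3(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 wide = (unsigned __int128)a + b + c;
    add3_t r = { (uint64_t)wide, (uint64_t)(wide >> 64) };
    return r;
}
```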

EricP

unread,
Apr 15, 2020, 1:33:28 PM4/15/20
to
MitchAlsup wrote:
> On Wednesday, April 15, 2020 at 2:58:22 AM UTC-5, Terje Mathisen wrote:
>>
>> which simplifies to
>>
>> sum[i] = a[i] + b[i];
>> if (carry) {
>> sum[i]++;
>> carry = sum[i] <= a[i];
>> }
>> else {
>> carry = sum[i] < a[i];
>
> which simplifies to::
>
> ADD Rsum,Ra,Rb
> CMP Rcmp,Rsum,Ra
> ADD Rsum,Rsum,Rcarry
> PRED Rcarry,{2:T E}
> EXT Rcarry,Rcmp<1:3>
> EXT Rcarry,Rcmp<1:4>

which OoO executes concurrently as

ADD Rsum,Ra,Rb
CMP Rcmp,Rsum,Ra ADD Rsum,Rsum,Rcarry PRED Rcarry,{2:T E}
EXT Rcarry,Rcmp<1:3> EXT Rcarry,Rcmp<1:4>

but ADD3 and AddGenerateCarry3 can execute concurrently in 1 clock.

EricP

unread,
Apr 15, 2020, 1:38:17 PM4/15/20
to
MitchAlsup wrote:
>
> which simplifies to::
>
> ADD Rsum,Ra,Rb
> CMP Rcmp,Rsum,Ra
> ADD Rsum,Rsum,Rcarry
> PRED Rcarry,{2:T E}
> EXT Rcarry,Rcmp<1:3>
> EXT Rcarry,Rcmp<1:4>

BTW, what does EXT do?


Terje Mathisen

unread,
Apr 15, 2020, 1:40:19 PM4/15/20
to
I was specifically thinking about things like very rare fixups, i.e.
like a pre-normalization which is not expected to be needed. In that
case you would have the arguments in known registers and return with the
new values before you go on. I.e. a sw version of a hw trap & fixup.

More common would be a buffer handler where you need to call a fillup
function when it is empty. I.e. I'm hoping that this will be lower cost
than a branch around the fillup call which then fails each time the
buffer is empty.

BGB

unread,
Apr 15, 2020, 1:46:50 PM4/15/20
to
On 4/14/2020 6:46 PM, lkcl wrote:
> On Wednesday, April 15, 2020 at 12:06:29 AM UTC+1, Brett wrote:
>
>> The window for 16 bit opcodes is closed
>
> not necessarily, brett.
>
> if you are happy to use escape sequencing it becomes possible to apply it more than once (like the LZ compression algorithm).
>
> you can then "page in" relevant and ENTIRELY DIFFERENT 16 bit opcodes for a particular specialist algorithm (video decode, 3d texturisation) and once that subroutine is completed, drop out of that particular prefixing / ISAMUX / ISANS / escape-sequencing mode, back to "normal" instruction decoding.
>
> don't let anyone tell you that 16 bit is overloaded and cannot be useful.
>

If one sets reasonable expectations, and doesn't completely squander the
encoding space, then 16-bit instructions can be pretty useful.

Using 16-bit ops as the primary or sole ISA is fairly limiting, as one
may need more instructions to express the same operation, or may have
various cases turn out to be non-encodable. But for space-saving in a
variable-length ISA, 16-bit encodings work well.


If I were designing a VL ISA clean, I might go with:
0zzz..Bzzz: 16-bit space (slightly smaller than BJX2)
Czzz..Fzzz: 32-bit space

Probably:
Czzz: Execute if True
Dzzz: Execute if False
Ezzz: Wide-Execute
Fzzz: Scalar or End-of-Bundle

Unclear which is better between BJX2 style XXnm or SH-style XnmX.

Say I went with SH-style for this one:
XnmX //16b ops (2R)
XniX //16b ops (Imm4)
Xnii //16b ops (Imm8)
XnXX //16b ops (1R)
Xnm0-XeoX //2R and 3R ops
Xnm1-Xeii //Disp9 (Load/Store)
Xnm2-Xeii //Rm, Imm9, Rn (ALU)
XnX3-Xeii //Imm10, Rn
Xni4-Xiii //Imm16, Rn (R0..R15)
Xni5-Xiii //Imm16, Rk (R16..R31)
Xii6-iiii //J1 (Imm24)
Xii7-iiii //J2 (Imm24)
...

The J1 and J2 block would be one of several ops depending on mode:
Czz6: BT disp24
Czz7: PRWEX?T (Predicated Wide-Execute Ops ?)
Dzz6: BF disp24
Dzz7: PRWEX?F (Predicated Wide-Execute Ops ?)
Ezz6: ?
Ezz7: JX imm24 (Jumbo Prefix)
Fzz6: BRA disp24
Fzz7: BSR disp24

JX Forms:
Eii7-iiii-Xnm1-Xeii: "MOV.x (Rm, Disp33s), Rn" (Load/Store)
Eii7-iiii-Xnm2-Xeii: "OP Rm, Imm33s, Rn" (ALU)
Eii7-iiii-Xnm3-Xeii: "OP Imm33s, Rn" (ALU)
Eii7-iiii-Eii7-iiii-Xni4-Xiii: "OP Imm64, Rn"
Eii7-iiii-Eii7-iiii-Xni5-Xiii: "OP Imm64, Rk"


Assuming a similar implementation, the JX forms will likely eat multiple
lanes and thus may not be in a bundle with other ops.


Possible: The code stream may be kept nominally 32b aligned, and thus
lone 16-bit ops will not be allowed. If a lone 16-bit op would result in
a misaligned instruction stream for a subsequent 32-bit op (maybe only
required for Jumbo and Wide-Execute ops), it is to be replaced with a
32-bit equivalent.

This will cost slightly in terms of code density, but should reduce the
cost of the instruction cache mechanism.

As with BJX2, most likely 8/16/32/64 bit Load/Store may be misaligned,
but 128-bit could require a 64-bit alignment; and likewise SP would
still have a mandatory 64-bit alignment; ...


This is at least a little cleaner than BJX2 is at present...


> caveat: disassembly, unless you know the current ISAMUX context, is impossible. example: try decoding instructions where you do not know if they are LE or BE encoded.
>

However, contextually encoded instructions are kind of evil IMO.

Besides just decoding concerns, there is also now the need to insert
instructions into the code-stream to change which encodings are active
(wastes clock cycles and space), as well as pipeline issues (one either
has to schedule the mode change several instructions prior to it being
used; or have it do an explicit pipeline flush and waste clock cycles).

...

Terje Mathisen

unread,
Apr 15, 2020, 1:46:59 PM4/15/20
to
MitchAlsup wrote:
> On Wednesday, April 15, 2020 at 2:58:22 AM UTC-5, Terje Mathisen wrote:
>> which simplifies to
>>
>> sum[i] = a[i] + b[i];
>> if (carry) {
>> sum[i]++;
>> carry = sum[i] <= a[i];
>> }
>> else {
>> carry = sum[i] < a[i];
>
> which simplifies to::
>
> ADD Rsum,Ra,Rb
> CMP Rcmp,Rsum,Ra
> ADD Rsum,Rsum,Rcarry
> PRED Rcarry,{2:T E}
> EXT Rcarry,Rcmp<1:3>
> EXT Rcarry,Rcmp<1:4>
>


Does that work when the second ADD (adding the incoming carry) is the
one that causes an outgoing carry?

I.e. you don't seem to have any test which checks the result of adding
incoming carry to the preliminary sum?

Terje Mathisen

unread,
Apr 15, 2020, 1:57:25 PM4/15/20
to
Similar problem:

if Rb == (unsigned) (-1) and Rcarry = 1, then the result of those two
additions will be to set Rsum == Ra, _exactly_ the same you get from Rb
== Rcarry == 0.

You really do need to handle all the ways this can wrap around, on
either the first or the second ADD. When you don't have an actual carry
flag or ADD3 with dual outputs, then you must have separate code for the
incoming carry/no-carry paths.

sum = a + b;

carry_out = (carry_in) ? (++sum <= a) : (sum < a);
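The wrap-around case can be checked directly with a small C model of the expression above (carry_out is an invented helper name, shown only as a sketch):

```c
#include <stdint.h>

/* Witness for the wrap-around case: with b == (uint64_t)-1 and an
   incoming carry of 1, sum comes back equal to a, exactly as for
   b == 0 with no carry; only the <= vs < test tells them apart. */
static unsigned carry_out(uint64_t a, uint64_t b, unsigned carry_in,
                          uint64_t *sum)
{
    uint64_t s = a + b + carry_in;
    *sum = s;
    return carry_in ? (s <= a) : (s < a);
}
```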

MitchAlsup

unread,
Apr 15, 2020, 2:47:15 PM4/15/20
to
On Wednesday, April 15, 2020 at 12:24:10 PM UTC-5, Terje Mathisen wrote:
> Stefan Monnier wrote:
> >> It looks like 4 cycles would be the minimum possible latency for each
> >> iteration? Not bad for branchless code, but having an ADD1 would save
> >> one of those cycles:
> >
> > Compared to the `addc` version using condition codes, it still looks
> > pretty bad,
>
> The best alternative would be to implement ADD3 with three inputs and
> two outputs, i.e. a full-register-width version of a binary full adder.
>
> The second output would always be 0, 1 or 2. If you require the third
> input to be just 0 or 1 (i.e. incoming carry) then you get a single-bit
> outgoing carry in the second return reg.
>
> Mitch already have at least one set of 3-input ops, if he could also
> generate a pair of results (which you need for the ieee 754-2019
> augmented operations) then he would be home free.
>
> AFAIK, the general ADD3 adder tree is less than 50% larger than regular ADD?

It is less than 12% larger.

MitchAlsup

unread,
Apr 15, 2020, 2:48:21 PM4/15/20
to
Extract <Length:Offset>

MitchAlsup

unread,
Apr 15, 2020, 2:53:25 PM4/15/20
to
On Wednesday, April 15, 2020 at 12:46:59 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Wednesday, April 15, 2020 at 2:58:22 AM UTC-5, Terje Mathisen wrote:
> >> which simplifies to
> >>
> >> sum[i] = a[i] + b[i];
> >> if (carry) {
> >> sum[i]++;
> >> carry = sum[i] <= a[i];
> >> }
> >> else {
> >> carry = sum[i] < a[i];
> >
> > which simplifies to::
> >
> > ADD Rsum,Ra,Rb
> > CMP Rcmp,Rsum,Ra
> > ADD Rsum,Rsum,Rcarry
> > PRED Rcarry,{2:T E}
> > EXT Rcarry,Rcmp<1:3>
> > EXT Rcarry,Rcmp<1:4>
> >
>
>
> Does that work when the second ADD (adding the incoming carry) is the
> one that causes an outgoing carry?

I was assuming that carry has states {0, 1}; so:

> >> sum[i] = a[i] + b[i];
> >> if (carry) {
> >> sum[i]++; //carry had to be 1
> >> carry = sum[i] <= a[i];
> >> }
> >> else { // carry was 0
> >> carry = sum[i] < a[i];

Can be written:
> >> sum[i] = a[i] + b[i] + carry;
> >> if (carry) {
> >> carry = sum[i] <= a[i];
> >> }
> >> else {
> >> carry = sum[i] < a[i];
>

Terje Mathisen

unread,
Apr 15, 2020, 4:07:15 PM4/15/20
to
That means that you are extracting either 3 or 4 bits of compare
results, right?

I don't think you are compressing that down to a logic 0/1, are you?

If you are not, then you cannot add this 3 (or 4)-bit value to the
preliminary sum, but you should be able to use it in a PRED which
shadows the required increment, or you can compress it to a single bit:


ADD Rsum, Ra, Rb
TEST Rcarry, Rcarry, Rcarry ;; Sets Rcarry to 1 if non-zero?

ADD Rsum, Rsum, Rcarry

CMP Rcmp, Rsum, Ra
PRED Rcarry,{2:T E}

EXT Rcarry, Rcmp<1:3>
EXT Rcarry, Rcmp<1:4>

Would that work? It does look like 4 cycles of latency?

MitchAlsup

unread,
Apr 15, 2020, 5:03:53 PM4/15/20
to
On Wednesday, April 15, 2020 at 3:07:15 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Wednesday, April 15, 2020 at 12:38:17 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>>
> >>> which simplifies to::
> >>>
> >>> ADD Rsum,Ra,Rb
> >>> CMP Rcmp,Rsum,Ra
> >>> ADD Rsum,Rsum,Rcarry
> >>> PRED Rcarry,{2:T E}
> >>> EXT Rcarry,Rcmp<1:3>
> >>> EXT Rcarry,Rcmp<1:4>
> >>
> >> BTW, what does EXT do?
> >
> > Extract <Length:Offset>
> >
> That means that you are extracting either 3 or 4 bits of compare
> results, right?

Length = 1 ..... offset = 3 or 4

1 bit in length at position 3 or 4 (unsigned)
>
> I don't think you are compressing that down to a logic 0/1 are you?
>
> If you are not, then you cannot add this 3 (or 4)-bit value to the
> preliminary sum, but you should be able to use it in a PRED which
> shadows the required increment, or you can compress it to a single bit:
>
>
> ADD Rsum, Ra, Rb
> TEST Rcarry, Rcarry, Rcarry ;; Sets Rcarry to 1 if non-zero?
>
> ADD Rsum, Rsum, Rcarry
>
> CMP Rcmp, Rsum, Ra
> PRED Rcarry,{2:T E}
>
> EXT Rcarry, Rcmp<1:3>
> EXT Rcarry, Rcmp<1:4>
>
> Would that work? It does look like 4 cycles of latency?

We are getting close.

Paul A. Clayton

unread,
Apr 15, 2020, 9:33:43 PM4/15/20
to
On Monday, April 13, 2020 at 11:48:19 PM UTC-4, JimBrakefield wrote:
> Much of the discussion on the "Condition Code-less Programming" topic
> can be placed under a "Usage of Prefix instructions" category. It's
> also a good masters exam question so won't enumerate the known and
> implemented usages of instruction prefixes.

It would also make a nice article for a computer architecture wiki.

Fujitsu's SPARC64 VIIIfx added the Set XAR (Extended Arithmetic
Register) to add to the next two instructions three bits to each
register source operand and an SIMD indicator. (I think Mitch Alsup
might have had been involved with that; I seem to recall him posting
on comp.arch that he was consulted on how to extend the SPARC
architecture and that he proposed a prefix instruction.) The register
field extension has obvious similarities to x86-64's REX extension,
which adds one bit to two register operand fields.

> However "Condition Code-less Programming" adds a new possible usage:
> prefixing a conditional branch to the following instruction.

John Savard (Quadibloc)'s proposal is "merely" a delayed branch with
the condition setting in the delay slot. Ordinary delayed branches
could be considered double-word instructions if architected as
interrupt-atomic, but they are mostly semantically independent
(most ISAs do not allow branches in delay slots or backward data
dependencies). A prepare-to-branch instruction might be considered a
disconnected prefix if the following branch instruction did not repeat
the target or condition specification.

(If one squints hard enough — really, *REALLY* hard —, all stack- and
queue-based operand instructions could be considered *post*fix
instructions, depending on the instruction history for which value is at
the top (and the rest of the stack/queue sequence).)

> So would like to ask the question: What are the trade-offs and
> practicalities of using prefixes? The x86-64 is perhaps the premier
> example of prefix usage and the Itanium an example of a blank sheet
> design that increased the instruction size rather than use prefixes.

Hint semantics are an obvious use for prefixes. Intel's use of the REP
prefix to provide lock elision is an example. Such has the advantage
of backward compatibility and even allows future implementations to
ignore the hint or change the hint's meaning.

Since simple prefixes essentially just add bits to the next instruction,
adding register operand and immediate extensions are probably fairly
straightforward in terms of decoding. Extending the operation (as
opposed to changing it completely) might be the next most straightforward;
e.g., transforming an addition to an add-with-carry or applying SIMD or
size specification might not interfere with basic decoding and operation
routing. On the other hand, if one uses table-based decoding, any prefix
provided opcode extension could be just handled as a table specification,
similar to mode bits. In fact, prefix-provided bits could be considered
highly transient mode bits set by an immediate value in the prefix.
SPARC64 VIIIfx's SXAR makes this more obvious, though it consumes the
mode bits over two instructions instead of just one.
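The table-specification point can be sketched as a toy decoder where a prefix simply swaps in a different decode table for the next instruction, i.e. acts as a one-shot mode bit. The opcode values and mnemonics here are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* A prefix byte selects an alternate opcode table for the next
   instruction only -- a transient, immediate-set "mode bit". */
#define PREFIX 0xFF

static const char *base_table[4] = { "ADD",  "SUB",  "LD",  "ST"  };
static const char *ext_table[4]  = { "ADDC", "SUBC", "LDX", "STX" };

static const char *decode(const uint8_t *code, size_t *pc)
{
    const char **table = base_table;
    uint8_t op = code[(*pc)++];
    if (op == PREFIX) {            /* one-shot table switch */
        table = ext_table;
        op = code[(*pc)++];
    }
    return table[op & 3];
}
```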

Terje Mathisen

unread,
Apr 16, 2020, 4:08:36 AM4/16/20
to
MitchAlsup wrote:
> On Wednesday, April 15, 2020 at 3:07:15 PM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> On Wednesday, April 15, 2020 at 12:38:17 PM UTC-5, EricP wrote:
>>>> MitchAlsup wrote:
>>>>>
>>>>> which simplifies to::
>>>>>
>>>>> ADD Rsum,Ra,Rb
>>>>> CMP Rcmp,Rsum,Ra
>>>>> ADD Rsum,Rsum,Rcarry
>>>>> PRED Rcarry,{2:T E}
>>>>> EXT Rcarry,Rcmp<1:3>
>>>>> EXT Rcarry,Rcmp<1:4>
>>>>
>>>> BTW, what does EXT do?
>>>
>>> Extract <Length:Offset>

I missed that part. :-(
>>>
>> That means that you are extracting either 3 or 4 bits of compare
>> results, right?
>
> Length = 1 ..... offset = 3 or 4
>
> 1 bit in length at position 3 or 4 (unsigned)
>>
>> I don't think you are compressing that down to a logic 0/1 are you?
>>
>> If you are not, then you cannot add this 3 (or 4)-bit value to the
>> preliminary sum, but you should be able to use it in a PRED which
>> shadows the required increment, or you can compress it to a single bit:
>>
>>
>> ADD Rsum, Ra, Rb
>> TEST Rcarry, Rcarry, Rcarry ;; Sets Rcarry to 1 if non-zero?
>>
>> ADD Rsum, Rsum, Rcarry
>>
>> CMP Rcmp, Rsum, Ra
>> PRED Rcarry,{2:T E}
>>
>> EXT Rcarry, Rcmp<1:3>
>> EXT Rcarry, Rcmp<1:4>
>>
>> Would that work? It does look like 4 cycles of latency?
>
> We are getting close.

I realized last night that your EXT was using <length:offset> instead of
<offset:length> which is the canonical order for substr() operations. :-)
It works because you have individual single bit flags for each common
comparison operation, right?

This means that an OoO cpu can overlap the loading and adding of the
next pair with the carry propagation from the previous, so what remains
is the following 3-cycle chain:

ADD Rsum, Rsum, Rcarry

CMP Rcmp, Rsum, Ra
PRED Rcarry, {2: T E}

EXT Rcarry, Rcmp<1:3>
EXT Rcarry, Rcmp<1:4>

Right?

If you had an ADD1(a, b) which sets the incoming carry in your adder,
then you could possibly do it this way:

PRED Rcarry,{6: TTT EEE}

ADD1 Rsum, Ra, Rb
CMP Rcmp, Rsum, Ra
EXT Rcarry, Rcmp<1:3>

ADD Rsum, Ra, Rb
CMP Rcmp, Rsum, Ra
EXT Rcarry, Rcmp<1:4>

but that would be both larger code and probably slower, so forget about
it. :-)

Brian G. Lucas

unread,
Apr 16, 2020, 4:33:01 PM4/16/20
to
On 4/15/20 2:27 AM, Terje Mathisen wrote:
[ snip ]
>
> I have many, many times wanted to have a conditional CALL to handle special
> cases, is PRED general enough to do this? I.e. can you have a CALL instruction
> in the PRED shadow?

In my understanding of the My66000, a change of flow instruction terminates
predication. So if the CALL is the last instruction of a predicated "bundle",
it should work. But only Mitch knows for sure.

brian

MitchAlsup

unread,
Apr 16, 2020, 6:25:01 PM4/16/20
to
Yes to both;

Predication terminates when a branch is taken.
A Call or Table Transfer can happen (or not) under predication.
A Return can also happen under predication.
>
> brian

MitchAlsup

unread,
Apr 16, 2020, 6:42:13 PM4/16/20
to
Having given this some thought:: It seems wise to invent an instruction
that throws the ability to generate and consume carries over a handful
of instructions, so that the 256-bit wide add with carry would smell like::

CARRY {I,IO,IO,O}
ADD R1,R11,R21 // generate carry out
ADD R2,R12,R22 // carry in and out
ADD R3,R13,R23 // carry in and out
ADD R4,R14,R24 // consume last carry

Bing we are done--but wait!

Now (NOW) since I define 'all bits that do not fit in the destination
container' as "carry", this also works for multiplies! 256-bit = 256-bit
* 64-bit multiply::

CARRY {I,IO,IO,O}
UMUL R1,R11,R21 // generate carry out
UMUL R2,R12,R21 // carry in and out
UMUL R3,R13,R21 // carry in and out
UMUL R4,R14,R21 // consume last carry

And if you want to check for overflows 320-bit = 256-bit*64-bit::

CARRY {I,IO,IO,O}
UMUL R1,R11,R21 // generate carry out
UMUL R2,R12,R21 // carry in and out
UMUL R3,R13,R21 // carry in and out
UMUL R4,R14,R21 // carry in and out
UMUL R5,R21,0 // consume last carry

now I need to think about Divides to see if they can fit in this framework
by defining the remainder as the carry.

{I hope I have not opened Pandora's box here.}
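The UMUL chain above can be modeled in portable C to check the carry-threading (umul_n is an invented name; assumes a compiler with unsigned __int128):

```c
#include <stdint.h>
#include <stddef.h>

/* C model of the UMUL chain: multiply an n-word number (word 0 least
   significant) by a 64-bit scalar, threading the high half of each
   product in as the next instruction's "carry", as CARRY {I,IO,IO,O}
   would. Returns the final carry-out (the extra high word). */
static uint64_t umul_n(uint64_t *prod, const uint64_t *a,
                       uint64_t m, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * m + carry;
        prod[i] = (uint64_t)p;
        carry   = (uint64_t)(p >> 64);
    }
    return carry;
}
```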

Stephen Fuld

unread,
Apr 16, 2020, 7:29:43 PM4/16/20
to
I like this for another reason. When you PRED rCarry, I think you need
to have interrupts locked out between the ADD preceding the PRED and the
PRED instructions. By putting the CARRY instruction before the ADDs,
you can use the roll-back mechanism you use for VEC to restart in the
event of an interrupt.

MitchAlsup

unread,
Apr 16, 2020, 7:59:20 PM4/16/20
to
Which is why I like casting shadows forward across instructions, while
avoiding the creation of interrupt state.

MitchAlsup

unread,
Apr 16, 2020, 8:08:59 PM4/16/20
to
After a bit of thought, it seems I can do Division as:

CARRY {I,IO,IO,O}
UDIV R4,R14,R21 // generate remainder out
UDIV R3,R13,R21 // remainder in and out
UDIV R2,R12,R21 // remainder in and out
UDIV R1,R11,R21 // remainder in and out
UDIV R5,0,R21 // consume last remainder

Where the remainder of one DIV ends up as the higher order bits of the
next lower DIV instruction. {This may add some rules to the DIV
instructions with respect to remainder (C99 style).}
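The UDIV chain can be modeled the same way in C (udiv_n is an invented name; assumes unsigned __int128). Because the remainder is always less than the divisor, each 128/64 partial quotient is guaranteed to fit in 64 bits:

```c
#include <stdint.h>
#include <stddef.h>

/* C model of the UDIV chain: divide an n-word number (word 0 least
   significant) by a 64-bit divisor, high word first, with each
   remainder fed back as the high 64 bits of the next 128-bit
   dividend. Returns the final remainder. */
static uint64_t udiv_n(uint64_t *quot, const uint64_t *a,
                       uint64_t d, size_t n)
{
    uint64_t rem = 0;
    for (size_t i = n; i-- > 0; ) {
        unsigned __int128 num = ((unsigned __int128)rem << 64) | a[i];
        quot[i] = (uint64_t)(num / d);   /* fits: num < d * 2^64 */
        rem     = (uint64_t)(num % d);
    }
    return rem;
}
```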

Ivan Godard

unread,
Apr 16, 2020, 8:30:02 PM4/16/20
to
On 4/16/2020 3:42 PM, MitchAlsup wrote:
Nope, not Pandora. What you are doing is backing yourself into wide
instructions. Welcome to VLIW World!

MitchAlsup

unread,
Apr 16, 2020, 9:19:18 PM4/16/20
to
I beg to differ: what I am doing is adding bits to control instruction execution without paying for the bits in instructions that don't use the
capabilities in those bits.

a) Thus every instruction can be predicated, but no instruction pays to
encode what kind of predication this instruction is subject to. That
is provided by the PRED instruction.

b) similarly for Carry.

c) probably similarly for VVM.

Ivan Godard

unread,
Apr 16, 2020, 9:30:18 PM4/16/20
to
And how is this different from bundle encoding?

MitchAlsup

unread,
Apr 16, 2020, 9:36:28 PM4/16/20
to
Having never seen a clear definition of "bundle encoding" I am in no
position to say.

Ivan Godard

unread,
Apr 16, 2020, 9:51:44 PM4/16/20
to
Bundle encoding is when several operations are collected together into a
group (a "bundle") and the bundle carries additional semantic content
applying to the group as a whole or parts thereof. Itanium and Mill are
examples of bundle-encoded ISAs.

MitchAlsup

unread,
Apr 16, 2020, 9:57:56 PM4/16/20
to
So the {1,2,3}-operand sub-groups are each a bundle? They each share
their major OpCode, and share the field widths and positions.

If that is a proper definition, then the MC 88100 was bundled. One could
even claim the IBM 360 was a bundled-OpCode ISA,.....

Otherwise I don't see the distinction you are trying to make.

EricP

unread,
Apr 16, 2020, 11:27:55 PM4/16/20
to
MitchAlsup wrote:
>
> I beg to differ, what I am doing is adding bits to control instruction execution without paying for the bits in instructions that don't use the
> capabilities in those bits.
>
> a) Thus every instruction can be predicated, but no instruction pays to
> encode what kind of predication this instruction is subject to. That
> is provided by the PRED instruction.

It's not free.
Every uOp pays internally for predicate data fields, renaming,
tracking, and control logic that are likely very rarely used.

> b) similarly for Carry.

I haven't thought this through but it smells similar.
Particularly when you mention blocking interrupts.

I'm not saying it's Itanium complexity,
but I am saying "Danger Will Robinson"!

lkcl

unread,
Apr 17, 2020, 12:27:13 AM4/17/20
to
and SV's VBLOCK, which is easily mistaken for VLIW.

i believe the CARRY idea is another form of counter-sensitive prefixing. VLIW is different in that the length of each block is fixed, and thus need not be part of context for context switching.

for the CARRY idea, and PRED, and SV's VBLOCK, it is critical that state be recorded in the form of a countdown, how long the prefix condition has left.

just as with the branch idea that spawned this discussion, if you do not have such state, you absolutely cannot permit interrupts or exceptions to occur in between, and are forced instead to treat the entire group as a macroop fused atomic operation, no matter how many actual opcodes that group spans.
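a toy interpreter makes the countdown point concrete. everything here (the {mask, countdown} record, the register model) is hypothetical, purely to show that saving the countdown is enough to resume mid-shadow:

```python
def run(insns, shadow, regs, interrupt_at=None):
    # Toy model: `shadow` = {'mask': bits, 'left': countdown} is the whole
    # prefix context.  Each insn is (dest, fn); while the countdown runs,
    # the low mask bit decides whether the insn executes.  On "interrupt"
    # we return the surviving countdown state instead of finishing.
    for pc, (dest, fn) in enumerate(insns):
        if pc == interrupt_at:
            return pc, dict(shadow)    # context switch: save the countdown
        execute = True
        if shadow['left'] > 0:
            execute = bool(shadow['mask'] & 1)
            shadow['mask'] >>= 1
            shadow['left'] -= 1
        if execute:
            regs[dest] = fn(regs)
    return None, None
```

resuming with `run(insns[pc:], saved, regs)` reproduces the uninterrupted result, which is the whole point of recording the shadow as a countdown.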

VBLOCK has a considerable amount of state. it works, but it was necessary to ban the use of branches: the only location allowed to be jumped to being the very start of the VBLOCK.

Mitch, allowing any kind of persistence of PRED context over branches or any change to the PC has me concerned.

even unconditional jumps, if there are multiple paths and one has PRED context still running and the other not, yet they both jump to the same location, the meaning of the instructions at that same location have changed.

this is not something that compiler writers will be happy about.

l.

BGB

unread,
Apr 17, 2020, 2:34:44 AM4/17/20
to
On 4/16/2020 10:27 PM, EricP wrote:
> MitchAlsup wrote:
>>
>> I beg to differ, what I am doing is adding bits to control instruction
>> execution without paying for the bits in instructions that don't use
>> the capabilities in those bits.
>>
>> a) Thus every instruction can be predicated, but no instruction pays to
>> encode what kind of predication this instruction is subject to. That
>> is provided by the PRED instruction.
>
> It's not free.
> Every uOp pays internally for predicate data fields, renaming,
> tracking, and control logic that are likely very rarely used.
>

Also, an instruction takes up space, along with any other costs related to
it "existing and being there".

I like my scheme a little more; where granted it is a small constant
cost for every instruction, for a feature which is only sometimes used.

However, when used, it has a lower constant cost, which matters a fair
bit for a feature which is (by design) only really useful for short
instruction sequences.


>> b) similarly for Carry.
>
> I haven't thought this through but it smells similar.
> Particularly when you mention blocking interrupts.
>
> I'm not saying it's Itanium complexity,
> but I am saying "Danger Will Robinson"!
>

Yeah, there are cases where either explicit status flags, or passing
state in a register, kind of make sense.

Granted, things are a compromise; say, having a status flag is worse
than not having a status flag, but still better IMO than needing to use
a register port for boolean IO.


Meanwhile, both modal encodings and "submarine state" (*) are not things
which sit well with me. Trying to do these things may come back and bite
later.

*: Where part of the execution state "dives beneath the surface" and
effects subsequent execution without otherwise being observable.



Meanwhile, I have recently been trying to figure out ways to reduce costs.
A recent (hypothetical) ISA redesign would drop PUSH/POP as well as
eliminate the DLR/DHR/SP side-channel stuff (allowing these to revert
back to being mostly-normal GPRs); as well as cleaning up a few things
in the encoding, ...

Though, it seems doubtful it would be worth it. Huge effort and breaking
everything probably aren't justified unless I could make a substantial
cost reduction, which doesn't seem all that likely (a majority of the
total resource cost goes into things which would not be affected by such
a change; such as the cache system; and the costs associated with things
like the ALU or FPU will still be present regardless).

Otherwise, recent fiddling has mostly been more on the software side of
things...

lkcl

unread,
Apr 17, 2020, 6:16:45 AM4/17/20
to
On Friday, April 17, 2020 at 7:34:44 AM UTC+1, BGB wrote:

> Also, an instruction takes up space, and also any other costs related to
> it "existing and being there".
>
> I like my scheme a little more; where granted it is a small constant
> cost for every instruction, for a feature which is only sometimes used.
>
> However, when used, it has a lower constant cost, which matters a fair
> bit for a feature which is (by design) only really useful for short
> instruction sequences.

i found this trade-off unexpected and very difficult to judge, in the
design of VBLOCK, which is an "extreme" prefixing system, equivalent
to a hardware compression algorithm.

how far do you go with the idea? anyone reading assembler can spot wasteful
patterns (such as the long LD-immediates problem) and i think it's basically
down to how often the instruction is intended to be used (a measure of its
effectiveness)

for example, jacob tells me (if i recall it correctly) that swizzle is
really quite common in 3D, both for picking x/y/z/w coordinates as well as
for picking RGBA out of colour vectors. therefore it makes sense for us
to investigate putting swizzle "prefixes" into a VBLOCK.

this is because a full swizzle prefix can be a whopping eight bits *PER REGISTER*! that's 24 bits if you have a 2-in 1-out operation, and 32 if it's 3-in 1-out!
just to specify the order selection of data on vec4 registers!

yes, we could use two (or three) MV operations to swizzle input data prior to the operation, followed by a further swizzle to output the correct vec4 answer; however, that is even worse.

or we can expand all opcodes to a whopping 64 bits just on the off-chance that swizzling *might* be needed, punishing the entire ISA as a result.
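for concreteness, a 2-bits-per-lane selector (my guess at what such an 8-bit swizzle field would encode, not SV's actual format) models as:

```python
def swizzle(vec4, sel):
    # bits [2i+1:2i] of the 8-bit selector pick which source lane
    # feeds output lane i
    return [vec4[(sel >> (2 * i)) & 3] for i in range(4)]

IDENTITY = 0b11100100   # lane selectors 3,2,1,0: everything stays in place
```

eight bits covers every one of the 4**4 = 256 lane mappings, which is why the field cannot be made any smaller.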

these are not nice choices :)

l.

lkcl

unread,
Apr 17, 2020, 6:26:34 AM4/17/20
to
On Friday, April 17, 2020 at 7:34:44 AM UTC+1, BGB wrote:

> Yeah, there are cases where either explicit status flags, or passing
> state in a register, kind of make sense.

these involve not-very-commonly-used operations which require temporarily
shutting down execution (shutting down instruction issue) whilst results
critical to the change in status work their way through.

for a typical in-order system there is really no other choice but to take this "stall stall stall stall stall stall" err i can't call it a strategy.

an out-of-order design has, if the infrastructure is there already, the *opportunity* to not stall, however it requires separate read/write hazard protection for the *status* registers, and for them to effectively be treated *as* registers (in their own separate regfile):

https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI


> Granted, things are a compromise; say, having a status flag is worse
> than not having a status flag, but still better IMO than needing to use
> a register port for boolean IO.

unfortunately... we are gearing up to do pretty much exactly that. actually, 3-bit rather than 1-bit register(s) - CR0 is 3 bits.

consequently i'm in the middle of a maaasssiively intrusive rewiring of the libre-soc logic, adding multiple read-write signal-and-acknowledgement.

l.

Terje Mathisen

unread,
Apr 17, 2020, 8:47:57 AM4/17/20
to
If I read that from left to right, it should possibly be O,IO,IO,I?
I.e. the first ADD generates an outgoing carry, the next two consume
and generate, and the last one just consumes the final carry.

Your comments indicate the same, so was this a typo or did you intend to
read the shadow from right to left?

> ADD R1,R11,R21 // generate carry out
> ADD R2,R12,R22 // carry in and out
> ADD R3,R13,R23 // carry in and out
> ADD R4,R14,R24 // consume last carry
>
> Bing we are done--but wait!
>
> Now (NOW) since I define 'all bits that do not fit in the destination
> container' as "carry" this also works for multiplies ! 256-bit = 256-bit
> * 64-bit multiply::
>
> CARRY {I,IO,IO,O}
Same comment.

> UMUL R1,R11,R21 // generate carry out
> UMUL R2,R12,R21 // carry in and out

This is a really interesting instruction since it does an integer MAC,
and it could in theory start the multiplier part as soon as possible,
only delaying the accumulation of all the partial products until the
carry from the previous instruction makes it available.

> UMUL R3,R13,R21 // carry in and out
> UMUL R4,R14,R21 // consume last carry
>
> And if you want to check for overflows 320-bit = 256-bit*64-bit::
>
> CARRY {I,IO,IO,O}
> UMUL R1,R11,R21 // generate carry out
> UMUL R2,R12,R21 // carry in and out
> UMUL R3,R13,R21 // carry in and out
> UMUL R4,R14,R21 // carry in and out
> UMUL R5,R21,0 // consume last carry

Shouldn't the last one be "UMUL R5,0,0" or possibly "ADD R5,0,0", i.e.
just instantiate the carry bits?

The key is of course that such a CARRY shadow defines an uninterruptible
chain, so it needs to be of limited length, maybe 8 instructions? This
should take less than or equal to the time for 8 regular chained MULs, so 24
to 40 cycles, or ~10 ns?

In order to do a really long operation you would have to regularly save
and reinsert the carries, right?

This seems to be a problem in that it is easy to extract it (just do a
dummy ADD of zero which eats the shadow carries), but inserting that
value into the bottom of a new chain is significantly harder because you
don't have an explicit ADD3 which allows you to pass in a visible third
parameter, or simply a SETCARRY reg instruction which lives under the
CARRY shadow but always runs with O rules?

SETCARRY would work for both UMUL and ADD chains.
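The "high product bits as carry" rule is easy to model (a Python sketch; the trailing append stands in for the carry-consuming final instruction):

```python
MASK = (1 << 64) - 1

def umul_chain(words, m):
    # Multiply a little-endian multi-word number by a 64-bit value:
    # each step's high 64 product bits are the carry into the next
    # step, and a final step consumes the last carry (overflow word).
    out, carry = [], 0
    for w in words:
        t = w * m + carry          # 64x64 multiply plus carried high half
        out.append(t & MASK)
        carry = t >> 64
    out.append(carry)              # consume the final carry
    return out
```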

> now I need to think about Divides to see if they can fit in this framework
> by defining the remainder as the carry.
>
> {I hope I have not opened Pandora's box here.}

You have. :-)

I would not worry too much about chained divides though, because as soon
as you have multiple divides using the same divisor, it is always faster
to implement it with reciprocal multiplications, even when you then have
to do more operations.
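The reciprocal trick is standard; here is a sketch using a deliberately oversized 64-fraction-bit constant so the correctness argument stays simple (real compilers compute tighter 32/33-bit magic numbers):

```python
def reciprocal64(d):
    # ceil(2**64 / d): with 64 fraction bits the shifted product below
    # is exact for every 32-bit dividend and divisor
    return ((1 << 64) + d - 1) // d

def div_by_invariant(x, m):
    # one widening multiply and a shift replace each divide
    return (x * m) >> 64
```

The divisor-dependent work happens once in reciprocal64; every subsequent divide by the same divisor is just a multiply.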

EricP

unread,
Apr 17, 2020, 11:52:37 AM4/17/20
to
Internally the ADD and UMUL uOps are 3-input, 2-output.
Essentially the 3rd input and 2nd output are specified as
anonymous, implied, virtual registers. By specifying
them this way you cannot stop part way through because
there is no place to park intermediate values.
So each sequence is all or nothing.

In an OoO uArch those implied carry registers need actual registers
to back them. If one tries to make those carry values exist only
in the forwarding network, then it can get dispatch order dependence
sensitivity: you don't want to issue a uOp before its dependent
is listening on the forwarding network in case it completes
too soon and you miss the result. This might imply that you
have to stall execution of the front uOp until the tail is ready,
the net result being worse performance.

So it winds up with the same cost as an explicit 3-in,2-out
instruction, but is more complex and has caveats like blocked interrupts.

But it does use prefixes creatively.

Terje Mathisen

unread,
Apr 17, 2020, 1:12:35 PM4/17/20
to
Right, I also noted that you must have a way to manually stop and
restart a long sequence, otherwise the interrupt lockout shadow could
become too large.
>
> In an OoO uArch those implied carry registers need actual registers
> to back them. If one tries to make those carry values exist only
> in the forwarding network, then it can get dispatch order dependence
> sensitivity: you don't want to issue a uOp before its dependent
> is listening on the forwarding network in case it completes
> too soon and you miss the result. This might imply that you
> have to stall execution of the front uOp until the tail is ready,
> the net result being worse performance.
>
> So it winds up with the same cost as an explicit 3-in,2-out
> instruction but more complex and caveats like blocked interrupts.
>
> But it does use prefixes creatively.
>
I would in fact prefer to do this half-way, by encoding it as 3-in,
1-out but with the third input reused as the implied second output:

This way you can encode the operation with the same resources as the
FMAC, and the carry register will stay the same all the way through the
chain. No need for a CARRY shadow prefix.
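That 3-in/1-out shape, with the carry register as both the third input and the implied second output, models like this (word size and chain length are illustrative):

```python
MASK = (1 << 64) - 1

def addc(a, b, carry):
    # 3-in/1-out add: the third input is the carry register and the
    # returned carry is the implied second output written back to it
    t = a + b + carry
    return t & MASK, t >> 64

def add_multiword(x, y):
    # multi-word add as a chain of addc steps (little-endian words);
    # the same carry register threads through the whole chain
    out, c = [], 0
    for a, b in zip(x, y):
        w, c = addc(a, b, c)
        out.append(w)
    return out, c
```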

Brian G. Lucas

unread,
Apr 17, 2020, 1:41:24 PM4/17/20
to
On 4/16/20 11:27 PM, lkcl wrote:
[ snip ]
>> When several operations are collected together in a group ("bundle")
>> and the bundle carries additional semantic content applying to the group
>> as a whole or parts thereof. Itanium and Mill are examples of
>> bundle-encoded ISAs.
>
> and SV's VBLOCK, which is easily mistaken for VLIW.
>
> i believe the CARRY idea is another form of counter-sensitive prefixing. VLIW is different in that the length of each block is fixed, and thus need not be part of context for context switching.
>
> for the CARRY idea, and PRED, and SV's VBLOCK, it is critical that state be recorded in the form of a countdown, how long the prefix condition has left.
>
> just as with the branch idea that spawned this discussion, if you do not have such state, you absolutely cannot permit interrupts or exceptions to occur in between, and are forced instead to treat the entire group as a macroop fused atomic operation, no matter how many actual opcodes that group spans.
>
> VBLOCK has a considerable amount of state. it works, but it was necessary to ban the use of branches: the only location allowed to be jumped to being the very start of the VBLOCK.
>
> Mitch, allowing any kind of persistence of PRED context over branches or any change to the PC has me concerned.
>
> even unconditional jumps, if there are multiple paths and one has PRED context still running and the other not, yet they both jump to the same location, the meaning of the instructions at that same location have changed.
>
> this is not something that compiler writers will be happy about.

You can say that again!

brian

BGB

unread,
Apr 17, 2020, 1:51:12 PM4/17/20
to
On 4/17/2020 5:26 AM, lkcl wrote:
> On Friday, April 17, 2020 at 7:34:44 AM UTC+1, BGB wrote:
>
>> Yeah, there are cases where either explicit status flags, or passing
>> state in a register, kind of make sense.
>
> these involve not-very-commonly-used operations which require temporarily
> shutting down execution (shutting down instruction issue) whilst results
> critical to the change in status work their way through.
>

This is less of an issue in my case, as the results of things like
TEST/CMPxx are visible at the next clock-edge (routed fairly directly
from the ALU's EX2 stage into EX1, *1).

In contrast, most of the normal ALU ops which output to GPRs need 2
cycles internally, meaning that using a GPR output from a CMPxx could
not be used immediately without causing an interlock stall.


*1: For timing reasons, EX1 doesn't really "control" the behavior of
most of the other units, but rather most of the units operate as-if
their instructions were being executed, and EX1/EX2 mostly serve to
select the desired results.

Instruction predication in this case is basically handled by determining
whether to perform (or select the results of) the requested operation,
or an alternate operation (typically NOP). The EX2 stage is similar, but
its SR bits are forwarded from the prior EX1 stage (so that EX2's
decisions match those of the prior EX1 stage).


> for a typical in-order system there is really no other choice but to take this "stall stall stall stall stall stall" err i can't call it a strategy.
>
> an out-of-order design has, if the infrastructure is there already, the *opportunity* to not stall, however it requires separate read/write hazard protection for the *status* registers, and for them to effectively be treated *as* registers (in their own separate regfile):
>
> https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI
>

OK. In my case, my CPU core is essentially an in-order VLIW which only
allows ops which modify the SR bits from Lane 1, so in this area it
effectively behaves like a scalar core.

Though, FWIW, even as ineffective and pointless as the VLIW /
Wide-Execute stuff seems sometimes, forcing things back to scalar
operation does still have a fairly obvious impact on performance.


>
>> Granted, things are a compromise; say, having a status flag is worse
>> than not having a status flag, but still better IMO than needing to use
>> a register port for boolean IO.
>
> unfortunately... we are gearing up to do pretty much exactly that. actually, 3-bit rather than 1-bit register(s) - CR0 is 3 bits.
>
> consequently i'm in the middle of a maaasssiively intrusive rewiring of the libre-soc logic, adding multiple read-write signal-and-acknowledgement.
>

I once considered possibly allowing modification to the SR bits from
other lanes, but ended up not doing so as it would have added complexity.


I seem to now be at the stage of running into needing obscure
bit-twiddly ops for special cases which come up in tight loops, for example:
An operation to transpose the bits in a register;
An operation to interleave various bits in a register;
yyyyyyyyxxxxxxxx -> yxyxyxyxyxyxyxyx (AKA: Morton Order)
...

Cases which, while obscure, are expensive to emulate and thus are "make
or break" for the performance of some of these loops.
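For reference, the interleave described above is the classic Morton-order bit spread; the usual shift-and-mask ladder for the 8+8 -> 16-bit case looks like:

```python
def morton16(x, y):
    # interleave 8-bit x and y into yxyxyxyxyxyxyxyx (y in the odd
    # bit positions); each spread step doubles the gap between bits
    def spread(v):
        v = (v | (v << 4)) & 0x0F0F
        v = (v | (v << 2)) & 0x3333
        v = (v | (v << 1)) & 0x5555
        return v
    return (spread(y) << 1) | spread(x)
```

In software this is six shifts, masks, and ORs per operand, which is exactly why a single hardware op can be "make or break" for such loops.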

But, I am now at the stage of having very few LUTs remaining, so what
features I add need to be offset by trying to shave LUTs off somewhere
else (preferably without removing features).


Though recently did end up disabling the full-duplex L1<->L2 bus
mechanism, as it added a bit of LUT cost but had relatively little
effect on performance (at least when using larger L1 caches).

Some LUTs were also saved recently by fiddling with the logic for the
AGU (saved ~ 3k LUT here by fiddling with how the shift-and-add
mechanism worked; less sure about effects on timing, ...).



Also possibly ironically, recently observed cases where adding logic to
check/prevent dereferencing NULL pointers and similar, and eliminating
some instances of dangling pointers by explicitly setting them to NULL
(in an otherwise fairly buggy piece of software, *), actually improved
performance somewhat (apparently by reducing the number of cache misses).


*: Setting the dangling pointers to NULL generally resulting in other
pieces of code trying to deref the NULL pointer, meaning also a need to
add more NULL checks, ... But, also reducing the number of cases where
it tries to access objects which had since been removed from various
linked lists (where the apparent cost of accessing these objects was
higher than the added cost of first checking whether or not the pointer
was NULL).


But, I am also finding a lot of cases of needing to rewrite things like:
for(i=0; i<foo->bar[x][y]->baz; i++)
...
To:
l=foo->bar[x][y]->baz;
for(i=0; i<l; i++)
...

Because this saves needing to do a bunch of pointer deref's each time
around the loop ( would something like Watcom C from 26 years ago have
been able to optimize stuff like this ?... )

( Then looks up stuff to try to figure out the whole "reflexively avoid
integer multiply in favor of using lookup tables" thing; apparently IMUL
on a 386 was pretty slow, almost as slow as IDIV )
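( The other 386-era reflex, besides lookup tables, was strength reduction: rewriting a multiply by a constant as shifts and adds. A sketch of the ubiquitous VGA row-offset case; illustrative, not a claim about what any particular compiler emitted: )

```python
def mul320(x):
    # x * 320 == x * (256 + 64): two shifts and an add in place of
    # a multi-cycle IMUL
    return (x << 8) + (x << 6)
```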

...

Stephen Fuld

unread,
Apr 17, 2020, 2:22:33 PM4/17/20
to
On 4/16/2020 3:42 PM, MitchAlsup wrote:

snip

> Having given this some thought:: It seems wise to invent an instruction
> that throws the ability to generate and consume carries over a handful
> of instructions. SO that the 256-bit wide add with carry would smell like::
>
> CARRY {I,IO,IO,O}
> ADD R1,R11,R21 // generate carry out
> ADD R2,R12,R22 // carry in and out
> ADD R3,R13,R23 // carry in and out
> ADD R4,R14,R24 // consume last carry

Thinking more about this, I have a question, and it seems to also apply
to the PRED instruction. Are interrupts locked out between the
PRED/CARRY instruction and the end of its shadow?

If not, where do you save the state across an interrupt? For PRED, I
think you need to save the mask, the address of the PRED and whether the
condition was true in order to be able to resume. CARRY might have
different requirements. Is all of that right?

If interrupts are locked out, how do you prevent a long chain of PRED
instructions from preventing interrupts for an arbitrarily long time?

MitchAlsup

unread,
Apr 17, 2020, 2:50:31 PM4/17/20
to
On Thursday, April 16, 2020 at 10:27:55 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> >
> > I beg to differ, what I am doing is adding bits to control instruction execution without paying for the bits in instructions that don't use the
> > capabilities in those bits.
> >
> > a) Thus every instruction can be predicated, but no instruction pays to
> encode what kind of predication this instruction is subject to. That
> > is provided by the PRED instruction.
>
> It's not free.
> Every uOp pays internally for predicate data fields, renaming,
> tracking, and control logic that are likely very rarely used.

Yes, it is NOT free in the microarchitecture, but it IS free in the ISA
container representation.
>
> > b) similarly for Carry.
>
> I haven't thought this through but it smells similar.
> Particularly when you mention blocking interrupts.
>
> I'm not saying it's Itanium complexity,
> but I am saying "Danger Will Robinson"!

Yes, there is a lot of "Danger Will Robinson" hanging around; this is
not usefully different from my comment on Pandora's box.

MitchAlsup

unread,
Apr 17, 2020, 2:55:36 PM4/17/20
to
Cannot permit interrupts--correct
Cannot permit Exceptions--incorrect
List of instructions that smell Atomic--minor quibble:
ATOMICs have memory order rules that op-fusion does not require.

>
> VBLOCK has a considerable amount of state. ot works, but it was necessary to ban the use of branches: the only location allowed to be jumped to being the very start of the VBLOCK.
>
> Mitch, allowing any kind of persistence of PRED context over branches or any change to the PC has me concerned.

Any taken branch under the shadow of a PRED terminates the shadow.
Any taken branch under the shadow of a VEC/LOOP terminates the loop.

MitchAlsup

unread,
Apr 17, 2020, 3:05:32 PM4/17/20
to
L I T T L E E N D I A N

You read Little Endian from right to left!
The first instruction takes no carry input and generates a carry output (O)
The second (and third) consume and generate (IO)
The last only consumes (I)


I am NOT A Little Endian by nature for exactly this reason, I am a
Little Endian because LE code suffers a lower surprise coefficient.
Lower bits are consumed before higher bits.
a) The ISA has no way of encoding "UMUL R5,0,0"
b) ADD might very well be a different Function Unit than UMUL/UDIV and
I don't want to have to move invisible state around.

CARRY is not a magic wand that works universally. It works in restricted
contexts (that I am still figuring out how to document).
>
> The key is of course that such a CARRY shadow defines an uninterruptible
> chain, so it needs to be of limited length, maybe 8 instructions? This
> should take less or equal to the time for 8 regular chained MULs, so 24
> to 40 cycles or ~10 us?

c) Fixed maximal length--correct
d) uninterruptible--correct
e) 8 instructions--still under advisement

MitchAlsup

unread,
Apr 17, 2020, 3:11:06 PM4/17/20
to
Now: where did I leave my violin ?????
>
> In an OoO uArch those implied carry registers need actual registers

F L I P - F L O P s, not registers. Registers are something software can use;
flip-flops are what hardware uses to sequence data through the pipeline(s).

> to back them. If one tries to make those carry values exist only
> in the forwarding network,

The current plan is for these to exist only within the function unit and
not be passed from unit to unit on the result bus and forwarding logic.

> then it can get dispatch order dependence
> sensitivity: you don't want to issue a uOp before its dependent
> is listening on the forwarding network in case it completes
> too soon and you miss the result. This might imply that you
> have to stall execution of the front uOp until the tail is ready,
> the net result being worse performance.
>
> So it winds up with the same cost as an explicit 3-in,2-out
> instruction but more complex and caveats like blocked interrupts.

But you don't have to have register fields specifying that data-flow;
and the restrictions of staying within the same function unit make
most of the data-flow sequencing problems vanish.
>
> But it does use prefixes creatively.

Thanks, I thought this one up on my last walk to the local bar.

MitchAlsup

unread,
Apr 17, 2020, 3:17:00 PM4/17/20
to
In My 66000, the "systems" part of the hardware is capable of recognizing
an interrupt and fetching the thread header and registers before the
executing thread is curtailed.

> >
> > In an OoO uArch those implied carry registers need actual registers
> > to back them. If one tries to make those carry values exist only
> > in the forwarding network, then it can get dispatch order dependence
> > sensitivity: you don't want to issue a uOp before its dependent
> > is listening on the forwarding network in case it completes
> > too soon and you miss the result. This might imply that you
> > have to stall execution of the front uOp until the tail is ready,
> > the net result being worse performance.
> >
> > So it winds up with the same cost as an explicit 3-in,2-out
> > instruction but more complex and caveats like blocked interrupts.
> >
> > But it does use prefixes creatively.
> >
> I would in fact prefer to do this half-way, by encoding it as 3-in,
> 1-out but with the third input reused as the implied second output:
>
> This way you can encode the operation with the same resources as the
> FMAC, and the carry register will stay the same all the way through the
> chain. No need for a CARRY shadow prefix.

When it is your machine, you may do as you choose.

MitchAlsup

unread,
Apr 17, 2020, 3:18:39 PM4/17/20
to
Any taken branch eliminates what remains of any shadow.
This includes Vector Shadow, Carry Shadow, and PRED shadow.

Stefan Monnier

unread,
Apr 17, 2020, 3:45:50 PM4/17/20
to
> d) uninterruptible--correct

How 'bout exceptions (e.g. LD from an invalid location)?


Stefan

MitchAlsup

unread,
Apr 17, 2020, 3:56:50 PM4/17/20
to
The Page Fault exception is raised.

If the Page Fault exception is enabled, a control transfer to the 3rd entry
in the Exception vector table occurs, while copying the current header,
instruction, and address operands into the registers of the page fault
handler, along with the referenced TLB.

The IP copied is that of the shadow casting instruction. The results of
instructions past the shadow casting instruction are squashed.

So you go off and fix the memory problem, and come back to execute
the whole shadow over again.

Stefan Monnier

unread,
Apr 17, 2020, 4:11:06 PM4/17/20
to
> The IP copied is that of the shadow casting instruction.
> The results of instructions past the shadow casting instruction
> are squashed.

So it's treated like a transaction? Stores to memory are kept in store
buffers until the end of the shadow so they can be squashed if needed?

> So you go off and fix the memory problem, and come back to execute
> the whole shadow over again.

Do you have a rule like "no shadow-casting instruction within a shadow"
to avoid overlapping shadows ending up creating too large of a "somewhat
atomic" block?


Stefan

lkcl

unread,
Apr 17, 2020, 5:04:03 PM4/17/20
to
On Friday, April 17, 2020 at 9:11:06 PM UTC+1, Stefan Monnier wrote:

> Do you have a rule like "no shadow-casting instruction within a shadow"
> to avoid overlapping shadows ending up creating too large of a "somewhat
> atomic" block?

shadows can be cast across other shadows, and the shadowing is ORed.
in other words, only when *all* shadows are "safe" can the instruction(s)
proceed with committing [otherwise-] irreversible actions.

if you were to create a rule that prevented and prohibited shadow-type-1
across shadow-type-2 then you would get nowhere, fast, for no good reason,
preventing speculative result pathways that had no reason not to be
attempted.

l.

lkcl

unread,
Apr 17, 2020, 5:10:27 PM4/17/20
to
On Friday, April 17, 2020 at 7:22:33 PM UTC+1, Stephen Fuld wrote:
> On 4/16/2020 3:42 PM, MitchAlsup wrote:
> > CARRY {I,IO,IO,O}
> > ADD R1,R11,R21 // generate carry out
> > ADD R2,R12,R22 // carry in and out
> > ADD R3,R13,R23 // carry in and out
> > ADD R4,R14,R24 // consume last carry
>
> Thinking more about this, I have a question, and it seems to also apply
> to the PRED instruction. Are interrupts locked out between the
> PRED/CARRY instruction and the end of its shadow?

i did answer this: you may have missed it. if you do not have a means
to capture, save and restore state, you absolutely must prevent and prohibit
all and any interrupts, exceptions - everything.

this will have a detrimental effect on latency, so is undesirable.

> If not, where do you save the state across an interrupt?

using a standard Control Status Register mechanism. saved alongside all
other context state (the majority of which is the INT and FP regfile(s)).

> For PRED, I
> think you need to save the mask, the address of the PRED and whether the
> condition was true in order to be able to resume. CARRY might have
> different requirements. Is all of that right?
>
> If interrupts are locked out, how do you prevent a long chain of PRED
> instructions from preventing interrupts for an arbitrarily long time?

you don't. you take the latency / performance penalty. or throw away all
intermediate results (all of them) and, on return, start the whole lot again.
this would require a loooong shadow, right back from the PRED/CARRY.

if you have a real-time system with a high probability of interrupts occurring,
clearly the constant need to "rewind" would be bad.

l.

Stephen Fuld

unread,
Apr 17, 2020, 5:47:35 PM4/17/20
to
On 4/17/2020 2:10 PM, lkcl wrote:
> On Friday, April 17, 2020 at 7:22:33 PM UTC+1, Stephen Fuld wrote:
>> On 4/16/2020 3:42 PM, MitchAlsup wrote:
>>> CARRY {I,IO,IO,O}
>>> ADD R1,R11,R21 // generate carry out
>>> ADD R2,R12,R22 // carry in and out
>>> ADD R3,R13,R23 // carry in and out
>>> ADD R4,R14,R24 // consume last carry
>>
>> Thinking more about this, I have a question, and it seems to also apply
>> to the PRED instruction. Are interrupts locked out between the
>> PRED/CARRY instruction and the end of its shadow?
>
> i did answer this: you may have missed it.

Correct. Your response and my question "crossed in the mail". :-(

Note that I was specifically asking Mitch. I don't know if your
implementation is the same as his in this respect.



> if you do not have a means
> to capture, save and restore state, you absolutely must prevent and prohibit
> all and any interrupts, exceptions - everything.


Sure. Well, at least sort of. Mitch's VVM allows interrupts, and when one
is taken, execution goes back to the VEC instruction. Instructions in the
current iteration (up to seven of them) are redone.


> this will have a detrimental effect on latency, so is undesirable.
>
>> If not, where do you save the state across an interrupt?
>
> using a standard Control Status Register mechanism. saved alongside all
> other context state (the majority of which is the INT and FP regfile(s)).

Yes. There is additional state required if you have the PRED/CARRY
implementation.


>
>> For PRED, I
>> think you need to save the mask, the address of the PRED and whether the
>> condition was true in order to be able to resume. CARRY might have
>> different requirements. Is all of that right?
>>
>> If interrupts are locked out, how do you prevent a long chain of PRED
>> instructions from preventing interrupts for an arbitrarily long time?
>
> you don't. you take the latency / performance penalty. or throw away all
> intermediate results (all of them) and, on return, start the whole lot again.
> this would require a loooong shadow, right back from the PRED/CARRY.

No, even longer than that.  Suppose I have a PRED instruction, followed
by 6 "normal" instructions, followed by another PRED.  After the first
PRED, you have locked out interrupts.  When you encounter the second
PRED, interrupts are still locked out.  After that second PRED,
interrupts are still locked out (you need this for the nested PRED
capability).  If the second PRED starts a new 8 instruction shadow, and
you keep interrupts locked out, they could be locked out for 16
instructions.  Now repeat that, and you have an arbitrarily long extra
latency.  I am not suggesting this is useful, but it does provide a
mechanism for a DOS attack.

EricP

unread,
Apr 17, 2020, 6:35:08 PM4/17/20
to
An alternative interrupt implementation that would not suffer this
would inject an interrupt uOp in the front end, just like it was
a jump instruction and redirects Fetch to the ISR.
When it reaches Retire, you officially take the interrupt.

You can choose to drain the pipeline or not.
If not, lots of i's and t's to dot and cross -
like you have to carry the User/Super flag with each uOp,
or if a DisableInterrupt instruction was in the pipeline.

MitchAlsup

unread,
Apr 17, 2020, 6:58:50 PM4/17/20
to
On Friday, April 17, 2020 at 3:11:06 PM UTC-5, Stefan Monnier wrote:
> > The IP copied is that of the shadow casting instruction.
> > The results of instructions past the shadow casting instruction
> > are squashed.
>
> So it's treated like a transaction? Stores to memory are kept in store
> buffers until the end of the shadow so they can be squashed if needed?

It is fair to state that "there will be additional rules" with respect to
casting of shadows. Some of these rules will always be violated by random
code generators. But I have not sorted all this through, yet.

lkcl

unread,
Apr 17, 2020, 7:12:10 PM4/17/20
to
On Friday, April 17, 2020 at 10:47:35 PM UTC+1, Stephen Fuld wrote:

> Correct. Your response and my question "crossed in the mail". :-(

occasionally this newsgroup is very much like jumping into a venturi

> Note that I was specifically asking Mitch. I don't know if your
> implementation is the same as his in this respect.

i try to answer generally.

> > this would require a loooong shadow, right back from the PRED/CARRY.
>
> No, even longer than that. Suppose I have a PRED instruction, followed
> by 6 "normal" instructions, followed by another PRED. After the first
> PRED, you have locked out instructions. When you encounter the second
> PRED, instructions are still locked out. After that second PRED,
> instructions are still locked out (you need this for the nested PRED
> capability). If the second PRED starts a new 8 instruction shadow, and
> you keep interrupts locked out, they could be locked out for 16
> instructions. Now repeat that, and you have an arbitrarily long extra
> latency. I am not suggesting this is useful, but it does provide a
> mechanism for a DOS attack.

it's a good example why i decided not to go with "lockouts" and instead
chose, in all of the prefixing concepts i've done, to go with "saveable
state" and (as mentioned yesterday) to treat carry-state as "actual
registers" (protected by DepMatrices) with their own separate read/write
ports.

in this way, an interrupt may be serviced in the middle, immediately,
even if the carry-state were to be part of the prefix-continuance
concept Mitch evaluated.

whether the increased size of the DMs is worth it remains to be seen.

l.

BGB

unread,
Apr 17, 2020, 9:54:04 PM4/17/20
to
This is a reason IMO why it is a good idea to avoid any sort of
"submarine state" when possible; since it can't really be saved or
restored easily by interrupt handlers (and may require a mechanism to
either complete the operation or roll back).


Also another factor is how to design a reasonably cheap interrupt
mechanism. What I have thus far is basically:
Save off SR, SP, PC, ... into special registers;
Bit-twiddle SR, load SP from another register;
Branch to an address relative to VBR
This location generally holds a branch to another location.
The relative address depends on the interrupt/exception type.

Returning from an exception restores these registers, and will then
branch back to the saved PC.
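The entry/return sequence described above can be modeled in a few lines of C. This is a toy sketch under assumptions of my own: the struct layout, the register names (SR, SP, PC, VBR), and the SR bit positions are illustrative, not BGB's actual encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the interrupt-entry sequence described above. */
typedef struct {
    uint64_t sr, sp, pc, vbr;
    uint64_t saved_sr, saved_sp, saved_pc;  /* special save registers */
    uint64_t isr_sp;                        /* interrupt-mode stack pointer */
} CpuState;

#define SR_SUPER    (1ull << 0)  /* assumed supervisor-mode bit */
#define SR_BLOCKIRQ (1ull << 1)  /* assumed interrupt-disable bit */

static void enter_interrupt(CpuState *c, uint64_t vector_offset) {
    /* Save off SR, SP, PC into special registers */
    c->saved_sr = c->sr;
    c->saved_sp = c->sp;
    c->saved_pc = c->pc;
    /* Bit-twiddle SR, load SP from another register */
    c->sr |= SR_SUPER | SR_BLOCKIRQ;
    c->sp  = c->isr_sp;
    /* Branch to an address relative to VBR; the offset depends on
     * the interrupt/exception type */
    c->pc = c->vbr + vector_offset;
}

static void return_from_interrupt(CpuState *c) {
    /* Restore the saved registers and branch back to the saved PC */
    c->sr = c->saved_sr;
    c->sp = c->saved_sp;
    c->pc = c->saved_pc;
}
```

Note that because the saved state lives in ordinary (special) registers, a handler can spill it and re-enable nesting; that is the property the "submarine state" discussion below is about.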

I realized after a prior post that even if I eliminated PUSH/POP, as-is
I still couldn't fully eliminate SP's side-channel due to the ISR
mechanism needing to mess with the register. Instead, the ISR mechanism
would need to leave SP alone (probably leaving this part of the process
up to the interrupt handler).

FWIW: As with many architectures, I initially used an interrupt as the
system call mechanism. However, I since ended up switching to a
different mechanism (namely a call through a special/magic function
pointer which the kernel sets in the task-state object).
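A minimal sketch of that call-through-a-pointer system call mechanism, with names (`TaskState`, `kernel_syscall`) invented here for illustration:

```c
#include <assert.h>

/* Sketch of a system call done as a call through a function pointer
 * that the kernel plants in the task-state object, rather than via a
 * trap/interrupt. All names here are illustrative assumptions. */
typedef struct TaskState {
    long (*syscall_entry)(long num, long a0, long a1);
} TaskState;

static long kernel_syscall(long num, long a0, long a1) {
    /* dispatch on the syscall number; trivial stand-in behavior */
    return (num == 1) ? a0 + a1 : -1;
}

static long do_syscall(TaskState *ts, long num, long a0, long a1) {
    /* user code never hard-codes the kernel entry address;
     * it just calls through the pointer the kernel set up */
    return ts->syscall_entry(num, a0, a1);
}
```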


> l.
>

Terje Mathisen

unread,
Apr 18, 2020, 7:08:47 AM4/18/20
to
<BG>

Mitch, you know that I have the utmost respect for you. When I post a
question, it is because you almost always have very good reasons for
what you decide, and you are often willing to explain your reasoning.

I.e. I am genuinely interested in why you think a carry shadow register
(with no task state to be saved/restored) is better than a named reg, or
just having a way to (under the carry shadow) manually load/unload that
shadow pipeline.

Naively I would have guessed that a CARRY + ADD would have to be tracked
internally just like a full 3-in 2-out adder, but I guess the reality
could be different, i.e. everything under the CARRY shadow is a big
chain with no possible exits which would result in architecturally
visible state, and this makes the hw implementation easier?

Terje Mathisen

unread,
Apr 18, 2020, 7:19:35 AM4/18/20
to
Having PRED within a PRED shadow, which would be needed in order to
enable a thread DOS attack, seems quite iffy, I would rather disallow it
so that you always have an interrupt possibility just before each PRED.

The easiest way to do this would to disallow PRED within PRED, right?

If there are good/useful scenarios which really do need nested PRED then
it would seem that you would be forced to have PRED in both the THEN and
ELSE branch at the same time, otherwise the code would be really
complicated/strange since the PRED-less path would subsequently execute
both the THEN and the ELSE instructions in whatever order they are
listed in the program!

It would at least be a wonderful way to make really obfuscated code,
almost impossible to disassemble, so good for malware (Mel-ware?)
:-)

Stephen Fuld

unread,
Apr 18, 2020, 10:52:20 AM4/18/20
to
Perhaps, but Mitch has said that PRED within the shadow of another PRED
is allowed. Presumably, this is for nested if statements, and possibly
to make compound tests easier.


>
> If there are good/useful scenarios which really do need nested PRED then
> it would seem that you would be forced to have PRED in both the THEN and
> ELSE branch at the same time, otherwise the code would be really
> complicated/strange since the PRED-less path would subsequently execute
> both the THEN and the ELSE instructions in whatever order they are
> listed in the program!
>
> It would at least be a wonderful way to make really obfuscated code,
> almost impossible to disassemble, so good for malware (Mel-ware?)
> :-)

Since the shadow of a PRED is limited to eight instructions, perhaps you
don't increase that range if you encounter a PRED instruction within the
shadow of a previous PRED, though this could force extra useless work if
you really just wanted say two PRED shadows adjacent to each other.

Another possible solution is to do what others have done, and just have
a timer that prevents interrupts from being locked out for more than a
fixed amount of time, or it throws an exception.

MitchAlsup

unread,
Apr 18, 2020, 1:53:29 PM4/18/20
to
Without PRED under PRED you can't make && and || shadows (easily).
Of the cases I am interested in, these PREDs are essentially back to back
(which is good due to the short shadow.)

Just to be sure, PRED state is carried in the thread header so there is
no issue interrupting or taking exceptions from a PRED; and returning
later.

Second, PREDs are the only legal way to perform flow control under
Vectorization; and vectors HAVE to be interruptable.
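As a concrete sketch of why PRED under PRED helps with &&: for `if (a != 0 && b != 0) x = y + z;` the second test itself sits in the shadow of the first. The assembly below is my own guess at a plausible syntax; the mnemonics and mask notation are illustrative, not the actual My 66000 encoding:

```
    PRED  NE, Ra, {T,T}     ; if a != 0, execute the next 2 instructions
    PRED  NE, Rb, {T}       ; nested: if b != 0, execute the next 1
    ADD   Rx, Ry, Rz        ; runs only when both conditions held
```

If the inner PRED is squashed by the outer one, the ADD is squashed with it, which is exactly the short-circuit behavior && needs.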

MitchAlsup

unread,
Apr 18, 2020, 1:55:31 PM4/18/20
to
On Saturday, April 18, 2020 at 6:08:47 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, April 17, 2020 at 12:12:35 PM UTC-5, Terje Mathisen wrote:
> >> EricP wrote:
> >>> So it winds up with the same cost as an explicit 3-in,2-out
> >>> instruction but more complex and caveats like blocked interrupts.
> >>>
> >>> But it does use prefixes creatively.
> >>>
> >> I would in fact prefer to do this half-way, by encoding it as 3-in,
> >> 1-out but with the third input reused as the implied second output:
> >>
> >> This way you can encode the operation with the same resources as the
> >> FMAC, and the carry register will stay the same all the way through the
> >> chain. No need for a CARRY shadow prefix.
> >
> > When it is your machine, you may do as you choose.
>
> <BG>
>
> Mitch, you know that I have the utmost respect for you. When I post a
> question, it is because you almost always have very good reasons for
> what you decide, and you are often wiling to explain your reasoning.
>
> I.e. I am genuinely interested in why you think a carry shadow register
> (with no task state to be saved/restored) is better than a named reg, or
> just having a way to (under the carry shadow) manually load/unload that
> shadow pipeline.

The basic problem is that I have a 4 doubleword header and not enough
space left to fit a 64-bit carry. Expanding the header would "cause
other problems".

MitchAlsup

unread,
Apr 18, 2020, 2:02:18 PM4/18/20
to
On Saturday, April 18, 2020 at 6:08:47 AM UTC-5, Terje Mathisen wrote:
One can visualize ADD with Carry as a 3-in:2-out adder, but I draw the box
differently: I see a 2-in:1-out box that has an internal feedback loop
to make the 2-in into 3-in. The Carry from an ADDER is not available
as the CARRY into the MULTIPLIER (or DIVIDER). These things are internal
to the box performing the calculations; they only have a register when
someone coaxes the remaining carry out of the calculation unit.

Providing state is tricky: where does one place a 64-bit carry?
Note MUL uses a 64-bit 'carry', DIV uses a 64-bit carry, and floating
point operations should be able to use a 64-bit carry for exact FP
calculations, Kahan summation, ...
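To make "the bits that did not make it into the result" concrete for the FP case: Knuth's TwoSum recovers the exact rounding error of a double addition in software, which is the quantity a 64-bit hardware FP carry could hand back directly. A small sketch:

```c
#include <assert.h>
#include <math.h>

/* Knuth's TwoSum: s is the rounded sum, *err is exactly the
 * low-order part that rounding discarded, so a + b == s + *err
 * holds exactly (barring overflow). This is the building block
 * of exact/compensated summation such as Kahan's. */
static double two_sum(double a, double b, double *err) {
    double s  = a + b;
    double a1 = s - b;          /* reconstructed contribution of a */
    double b1 = s - a1;         /* reconstructed contribution of b */
    *err = (a - a1) + (b - b1); /* exact bits lost by the rounding */
    return s;
}
```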

MitchAlsup

unread,
Apr 18, 2020, 2:03:25 PM4/18/20
to
On Saturday, April 18, 2020 at 9:52:20 AM UTC-5, Stephen Fuld wrote:
> On 4/18/2020 4:19 AM, Terje Mathisen wrote:
> > Stephen Fuld wrote:
> >
> > Having PRED within a PRED shadow, which would be needed in order to
> > enable a thread DOS attack, seems quite iffy, I would rather disallow it
> > so that you always have an interrupt possibility just before each PRED.
> >
> > The easiest way to do this would to disallow PRED within PRED, right?
>
>
> Perhaps, but Mitch has said that PRED within the shadow of another PRED
> is allowed. Presumably, this is for nested if statements, and possibly
> to make compound tests easier.
>
Yes, but as above; PRED causes no lockout.

Brian G. Lucas

unread,
Apr 18, 2020, 2:06:45 PM4/18/20
to
I think having the rule that "a PRED issued under a PRED cannot lengthen the
remaining shadow" is sufficient to meet those needs -- if one wants to hold
off interrupts.

> Just to be sure, PRED state is carried in the thread header so there is
> no issue interrupting or taking exceptions from a PRED; and returning
> later.
>
In this case there doesn't seem to be a need to have any such rule.

Stephen Fuld

unread,
Apr 18, 2020, 4:04:02 PM4/18/20
to
Presumably, CARRY state is also carried in the thread header. Does it
reuse the same space as PRED state or does it have its own space. If it
reuses some PRED state space, mustn't there be some rules about CARRY
within PRED shadow, or vice versa.

Terje Mathisen

unread,
Apr 18, 2020, 4:04:30 PM4/18/20
to
Thank you Mitch, this is exactly the kind of stuff I was hoping for. :-)

I can see how you really don't want to mix up the UMUL and the ADD
pipelines, at least not in an exposed way, so that means you cannot use
the same method to export the internal carry at the end.

For ADD it is easy, you just do one more ADD with a zeroed source reg:

ADD Rexposed_carry, Rzero,Rzero

and for the 64 bits of UMUL carry you can do one more UMUL of (0*0)

UMUL Rexposed_umul_hi, Rzero, Rzero

which would also export those bits.

The main issue that I still haven't seen you explain and that I've been
unable to figure out is how you intend to restart a long ADD or UMUL
carry chain?

I.e. how do we implement the initial

(carry, *sum++) = ADD3(*a++, *b++, saved_carry);

before we can use subsequent ADDs under the same CARRY shadow to do the
next blocks?

To me it seems like I could do something like this, since the old carry
is just (1/0):

ai = (uint64_t) (-1);

CARRY ( I IO IO IO IO O )
ADD Rdummy, ai, Rsaved_carry ;; Will set the internal carry
ADD Rs0, Ra0, Rb0 ;; Use and set carry
ADD Rs1, Ra1, Rb1
ADD Rs2, Ra2, Rb2
ADD Rs3, Ra3, Rb3
ADD Rsaved_carry, Rzero, Rzero ;; Export final carry

but I haven't been able to figure out a similar pair of operations for
the needed 64 bits of UMUL carry. :-(

BTW, even on a LE machine I have always read lists in lexical (BE) order!
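The carry-injection sequence above can be modeled in portable-ish C (using GCC/Clang's `unsigned __int128` as a stand-in for the hidden carry path); this is my reading of the sketch, not a verified My 66000 semantics:

```c
#include <assert.h>
#include <stdint.h>

/* C model of the CARRY-shadow sequence above: a 4-limb add where the
 * saved carry (0 or 1) is injected up front by adding it to an
 * all-ones limb, and the final carry is exported at the end. */
static uint64_t add4_with_carry(uint64_t s[4], const uint64_t a[4],
                                const uint64_t b[4], uint64_t saved_carry) {
    unsigned __int128 t;
    /* ADD Rdummy, ai, Rsaved_carry : ai = ~0, so this overflows
     * (sets the internal carry) exactly when saved_carry == 1 */
    t = (unsigned __int128)UINT64_MAX + saved_carry;
    uint64_t c = (uint64_t)(t >> 64);
    for (int i = 0; i < 4; i++) {       /* ADD Rs_i, Ra_i, Rb_i */
        t = (unsigned __int128)a[i] + b[i] + c;
        s[i] = (uint64_t)t;
        c = (uint64_t)(t >> 64);
    }
    return c;  /* ADD Rsaved_carry, Rzero, Rzero : export final carry */
}
```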

BGB

unread,
Apr 18, 2020, 4:04:48 PM4/18/20
to
On 4/17/2020 5:58 PM, MitchAlsup wrote:
> On Friday, April 17, 2020 at 3:11:06 PM UTC-5, Stefan Monnier wrote:
>>> The IP copied is that of the shadow casting instruction.
>>> The results of instructions past the shadow casting instruction
>>> are squashed.
>>
>> So it's treated like a transaction? Stores to memory are kept in store
>> buffers until the end of the shadow so they can be squashed if needed?
>
> It is fair to state that "there will be additional rules" with respect to
> casting of shadows. Some of these rules will always be violated by random
> code generators. But I have not sorted all this through, yet.


Not sure about this case, but in my case one mechanism for rules
enforcement is to make my emulator check for the various rules and
generate a breakpoint exception in any code-sequences which violate the
ISA rules (such as due to compiler bugs, or failing to take something
into account when writing ASM).

Since I generally run code in the emulator first, this helps detect a
lot of these sorts of issues.

A lot of the rules checked are for things like making sure bundles are
well formed, and trying to verify instructions are not being used in
contexts where they are not allowed (such as whether they may be allowed in
prefix or suffix position within a bundle, whether they are allowed in a
given execute lane, ... along with things like register-register
dependency rules and similar).

EricP

unread,
Apr 18, 2020, 4:17:21 PM4/18/20
to
MitchAlsup wrote:
> On Saturday, April 18, 2020 at 6:19:35 AM UTC-5, Terje Mathisen wrote:
>> Having PRED within a PRED shadow, which would be needed in order to
>> enable a thread DOS attack, seems quite iffy, I would rather disallow it
>> so that you always have an interrupt possibility just before each PRED.
>>
>> The easiest way to do this would to disallow PRED within PRED, right?
>
> Without PRED under PRED you can't make && and || shadows (easily).
> Of the cases I am interested in, these PREDs are essentially back to back
> (which is good due to the short shadow.)
>
> Just to be sure, PRED state is carried in the thread header so there is
> no issue interrupting or taking exceptions from a PRED; and returning
> later.
>
> Second, PREDs are the only legal way to perform flow control under
> Vectorization; and vectors HAVE to be interruptable.

I looked at having PRED in my experimental OoO ISA a while back.
In my case PRED had either 2 8-bit or 2 16-bit masks.
The first mask indicated which instructions were guarded,
the second indicated whether a guarded instruction was enabled
if the Predicate Value (PV) was True or False.

I eventually deferred it as it affected the entire uArch from
fetch to retire, and it has so many moving parts.

Interrupts were not a problem because as you suggested the control
masks were saved and restored across interrupts & exceptions.
There was a future mask pair in Decode, and committed pair in Retire,
and the committed pair would be saved and restored.

Some design issues/complications that I recall:

- There are 2 kinds of instructions which PRED controls:
some, like branch, change between normal and NOP.
some, like ADD, change between normal and MOV, as CMOV does,
because they may have to copy the old destination register value.

- If the instruction window is 100 long, then worst case
there can be 50 PRED instructions with 50 guarded instructions.
That means there can be 50 PV's and 50 masks in flight.

- Those PV are anonymous 1-bit registers that are allocated for
each PRED by Rename, and assigned to each dependent uOp to watch.

- If the PV is unresolved then each dependent uOp watches
a wake-up matrix for it to resolve. When the PV resolves
the uOp has to adjust its source operand watch matrix accordingly.

For example if ADD rd,rs1,rs2 is guarded, if it is enabled then
the sources rs1 and rs2 need to be watched in the wake-up matrix.
However if it is disabled then the old physical register for rd
must be watched for and copied when available, and stop watching
for rs1 and rs2. And while PV is unresolved all the sources
old_rd, rs1 and rs2 are potential so they must be watched.

- A PRED instruction can be passed from Dispatch to the ALU,
execute and complete immediately, then retire, all while
the tail of its mask was still in the front end.

So the PV resolves and its effects have to propagate backwards
to all dependent instructions, and possibly into the front end.
Since the PV is resolved, all its effects are now resolved
and any uOps still in the front end might be statically
enabled or disabled while still in the front end.

Also if the PRED retires the PV register can be deallocated
before all of its dependents have even been decoded.
So the PV can be copied into the current mask in Decode
to enable or disable following instructions not yet decoded.
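The two-mask scheme described above reduces to a small bit calculation: which of the next 8 instructions execute, given the guard mask, the polarity mask, and the resolved Predicate Value. A sketch of my reading of it:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the two-mask PRED described above: `guarded` marks which of
 * the next 8 instructions are under the predicate; for a guarded slot,
 * `polarity` says whether it runs when the Predicate Value (PV) is
 * True (bit set) or False (bit clear). Unguarded slots always run. */
static uint8_t enabled_mask(uint8_t guarded, uint8_t polarity, int pv) {
    uint8_t run_when_true  = guarded & polarity;
    uint8_t run_when_false = guarded & (uint8_t)~polarity;
    return (uint8_t)(~guarded | (pv ? run_when_true : run_when_false));
}
```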



MitchAlsup

unread,
Apr 18, 2020, 4:29:03 PM4/18/20
to
PRED is an 8-bit field,
CARRY as used in MUL, DIV, and floating point is a 64-bit field.
No amount of compression can stick the larger in the smaller.

CARRY is "all the bits that did not make it into the result".

CARRY in a PRED shadow will not be able to cross between THEN-clause and
ELSE-clause.

PRED should not be used under a CARRY shadow.

MitchAlsup

unread,
Apr 18, 2020, 4:40:26 PM4/18/20
to
Since I cannot foresee ADD, MUL, FADD, FMUL NOT being fully pipelined,
I can accept a few of these in a row without crippling interrupt latency.
{Note: Even my first machine in 1985 had pipelined integer multiply.}

Thus a string of 5 MULs takes 8 = 3+5 cycles, only 2 more than a string
of 5 ADDs. Floating point might be 1 or 2 more cycles for the set.

The DIVs provide a harder cookie to crumble.
>
> I.e. how do we implement the initial
>
> (carry, *sum++) = ADD3(*a++, *b++, saved_carry);
>
> before we can use subsequent ADDs under the same CARRY shadow to do the
> next blocks?

My intent is that you capture the end carry with a dummy op, still in
the CARRY shadow, and then deal with it however you want in the subsequent code.
>
> To me it seems like I could do something like this, since the old carry
> is just (1/0):
>
> ai = (uint64_t) (-1);
>
> CARRY ( I IO IO IO IO O )
> ADD Rdummy, ai, Rsaved_carry ;; Will set the internal carry
> ADD Rs0, Ra0, Rb0 ;; Use and set carry
> ADD Rs1, Ra1, Rb1
> ADD Rs2, Ra2, Rb2
> ADD Rs3, Ra3, Rb3
> ADD Rsaved_carry, Rzero, Rzero ;; Export final carry
>
> but I haven't been able to figure out a similar pair of operations for
> the needed 64 bits of UMUL carry. :-(

UMUL Rsaved_carry, Rany, #0

Multiply by zero and addition of the carry results in the carry.
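That flush-by-multiplying-by-zero trick can be modeled directly; this is my interpretation of the semantics (each UMUL under the shadow computes a*b plus the incoming 64-bit carry, keeps the low half, and holds the high half as the new carry):

```c
#include <assert.h>
#include <stdint.h>

/* One UMUL step under a CARRY shadow: full = a*b + carry-in;
 * the low 64 bits are the result, the high 64 bits become the
 * new carry. Multiplying by zero therefore flushes the carry:
 * 0*anything + carry == carry. */
static uint64_t umul_step(uint64_t a, uint64_t b, uint64_t *carry) {
    unsigned __int128 t = (unsigned __int128)a * b + *carry;
    *carry = (uint64_t)(t >> 64);
    return (uint64_t)t;
}
```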
>
> BTW, even on a LE machine I have always read lists in lexical (BE) order!

I will probably invent some kind of syntactic sugar to invert the fields.

MitchAlsup

unread,
Apr 18, 2020, 5:00:48 PM4/18/20
to
On Saturday, April 18, 2020 at 3:17:21 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, April 18, 2020 at 6:19:35 AM UTC-5, Terje Mathisen wrote:
> >> Having PRED within a PRED shadow, which would be needed in order to
> >> enable a thread DOS attack, seems quite iffy, I would rather disallow it
> >> so that you always have an interrupt possibility just before each PRED.
> >>
> >> The easiest way to do this would to disallow PRED within PRED, right?
> >
> > Without PRED under PRED you can't make && and || shadows (easily).
> > Of the cases I am interested in, these PREDs are essentially back to back
> > (which is good due to the short shadow.)
> >
> > Just to be sure, PRED state is carried in the thread header so there is
> > no issue interrupting or taking exceptions from a PRED; and returning
> > later.
> >
> > Second, PREDs are the only legal way to perform flow control under
> > Vectorization; and vectors HAVE to be interruptable.
>
> I looked at having PRED in my experimental OoO ISA a while back.
> In my case PRED had either 2 8-bit or 2 16-bit masks.
> The first mask indicated which instructions were guarded,
> the second indicated whether a guarded instruction was enabled
> if the Predicate Value (PV) was True or False.

I started with 2-bits per instruction and went back to 1-bit.
I initially wanted to be able to share code between THEN-clause
and ELSE-clause (execute both ways) marking, but generating
code for the MORE important && and || conditionals led me back to
a single 8-bit mask and I will eat the extra instructions (most of
which can be hoisted or pushed out the bottom of THEN and ELSE
clauses).
>
> I eventually deferred it as affected the entire uArch from
> fetch to retire, and it has so many moving parts.

It touches almost everything in the pipe, a bit like multi-threading
touches everything in the pipe.....
>
> Interrupts were not a problem because as you suggested the control
> masks were saved and restored across interrupts & exceptions.
> There was a future mask pair in Decode, and committed pair in Retire,
> and the committed pair would be saved and restored.
>
> Some design issues/complications that I recall:
>
> - There are 2 kinds of instructions which PRED controls:
> some, like branch, change between normal and NOP.
Not usefully different from prediction.
> some, like ADD, change between normal and MOV, as CMOV does,
> because they may have to copy the old destination register value.
A cost of the OoO implementation.
That is: more in order models have lower cost here.
>
> - If the instruction window is 100 long, then worst case
> there can be 50 PRED instructions with 50 guarded instructions.
> That means there can be 50 PV's and 50 masks in flight.

Yep,
But if the implementation model is 1-wide in order, you only have
pipe-depth DIV 2 = about 3. This is why I need VVM to work well even
in the small machines.
>
> - Those PV are anonymous 1-bit registers that are allocated for
> each PRED by Rename, and assigned to each dependent uOp to watch.

Only if you rename in space (pipelines rename in time).
>
> - If the PV is unresolved then each dependent uOp watches
> a wake-up matrix for it to resolve. When the PV resolves
> the uOp has to adjust its source operand watch matrix accordingly.

This is the key point--knowing who to wait for, and knowing when to
proceed (or die).
>
> For example if ADD rd,rs1,rs2 is guarded, if it is enabled then
> the sources rs1 and rs2 need to be watched in the wake-up matrix.
> However if it is disabled then the old physical register for rd
> must be watched for and copied when available, and stop watching
> for rs1 and rs2. And while PV is unresolved all the sources
> old_rd, rs1 and rs2 are potential so they must be watched.

Unguarded instructions are pointed at a condition that has always been
resolved and will always be resolved and thus wait only for their
operands.

Guarded instructions are pointed at the resolving instruction along
with their operands. Most guarded instructions can begin execution
when the operands are ready and hold result delivery until the guard
has become available (if the microarchitecture provides the mechanism
to not deliver results until *).
>
> - A PRED instruction can be passed from Dispatch to the ALU,
> execute and complete immediately, then retire, all while
> the tail of its mask was still in the front end.

But it will have seeded the control registers (header) such that those
shadowed instructions end up knowing whether to execute or die. Dying
in the front end is cheaper than dying in the pipeline.
>
> So the PV resolves and its effects have to propagate backwards
> to all dependent instructions, and possibly into the front end.
> Since the PV is resolved, all its effects are now resolved
> and any uOps still in the front end might be statically
> enabled or disabled while still in the front end.
>
> Also if the PRED retires the PV register can be deallocated
> before all of its dependents have even been decoded.
> So the PV can be copied into the current mask in Decode
> to enable or disable following instructions not yet decoded.

If you are doing this, your definition of what PV is doing is
not well thought out.

But (the BIG BUT); everything above is discussed in microarchitecture
not architecture. Architecture only has the vaguest notion of there being
a pipeline, decoding, execution, caches, ....

MitchAlsup

unread,
Apr 18, 2020, 7:26:28 PM4/18/20
to
Just thought of a way around the problem of state persistence.

Since the CARRY "instruction" only needs a few bits (which are constant),
it can specify a destination register that ends up with the final carry
(bits). Conceptually, the register is used as both an operand and a
destination in each instruction under the shadow. In practice, the register
would only be read or written when unexpected things happen. Thus, without
"blowing" a specification field in every instruction, we, in effect,
cast an additional IN and OUT register onto all of the instructions under
the shadow. So Terje gets his:

(carry, *sum++) = ADD3(*a++, *b++, saved_carry);

What I would like to be able to do is to support up to 256-bit integer
arithmetic; this means 8 registers (4 each) holding A or B and 4 instructions
performing the arithmetic. So the shadow is "not very large". Register
count-wise I (or the compiler) could expand this to 512 bits, but sooner
or later one starts running out of registers and resorts to using memory.
The point at which memory starts getting used is precisely where the
length of the CARRY shadow is limited.

I suspect the iterative nature of MUL and DIV are such that a pass of
one operand across the other is the size of the shadow: that is there
are a number of shadows, something like:

for( i = 0; i < 4; i++ )
{   // Multiply CARRY cast
    {carry,m3,m2,m1,m0} = {A[3],A[2],A[1],A[0]} * B[i];
    // a different Addition CARRY cast
    {prod[i+3],prod[i+2],prod[i+1],prod[i]} += {carry,m3,m2,m1,m0};
}
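For reference, a runnable C version of that sketch (using GCC/Clang's `unsigned __int128` where the hardware would use the hidden carry): a 256-bit x 256-bit -> 512-bit schoolbook multiply, where each pass of A across one limb of B plays the role of one "Multiply CARRY cast" plus the accumulate:

```c
#include <assert.h>
#include <stdint.h>

/* Schoolbook multiply of two 4-limb (256-bit) numbers, little-endian
 * limbs, into an 8-limb (512-bit) product. The `carry` variable is
 * the software analogue of the 64-bit CARRY state in each pass. */
static void mul256(const uint64_t A[4], const uint64_t B[4], uint64_t prod[8]) {
    for (int i = 0; i < 8; i++) prod[i] = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {
            /* A[j]*B[i] plus running carry plus the partial product;
             * the total always fits in 128 bits */
            unsigned __int128 t = (unsigned __int128)A[j] * B[i]
                                + carry + prod[i + j];
            prod[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        prod[i + 4] = carry;  /* top limb of this pass */
    }
}
```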

Stephen Fuld

unread,
Apr 18, 2020, 10:09:22 PM4/18/20
to
On 4/18/2020 1:29 PM, MitchAlsup wrote:
> On Saturday, April 18, 2020 at 3:04:02 PM UTC-5, Stephen Fuld wrote:
>> On 4/18/2020 10:53 AM, MitchAlsup wrote:
>>> On Saturday, April 18, 2020 at 6:19:35 AM UTC-5, Terje Mathisen wrote:
>
>>>> Having PRED within a PRED shadow, which would be needed in order to
>>>> enable a thread DOS attack, seems quite iffy, I would rather disallow it
>>>> so that you always have an interrupt possibility just before each PRED.
>>>>
>>>> The easiest way to do this would to disallow PRED within PRED, right?
>>>
>>> Without PRED under PRED you can't make && and || shadows (easily).
>>> Of the cases I am interested in, these PREDs are essentially back to back
>>> (which is good due to the short shadow.)
>>>
>>> Just to be sure, PRED state is carried in the thread header so there is
>>> no issue interrupting or taking exceptions from a PRED; and returning
>>> later.
>>
>>
>> Presumably, CARRY state is also carried in the thread header. Does it
>> reuse the same space as PRED state or does it have its own space. If it
>> reuses some PRED state space, mustn't there be some rules about CARRY
>> within PRED shadow, or vice versa.
>
> PRED is an 8-bit field,

That raises another question. When you resume after an interrupt, do
you restart after the last instruction executed, or do you restart at
the PRED?

I ask because, ISTM that if you restart after the last instruction
executed, you need more than the 8 bits of the PRED mask. Specifically,
you need to know how far you are into the PRED shadow, so you know which
bits of the mask to apply,and whether the PRED condition was met or not.

On the other hand, if you restart at the PRED instruction, you end up
doing more work as you have had to throw away any instructions after the
PRED, but before the interrupt. Also, you wouldn't need the PRED mask,
as you are going to re-execute the instruction anyway.

Clearly I am missing something. :-(