On 4/14/2020 3:52 AM, Terje Mathisen wrote:
> JimBrakefield wrote:
>> Much of the discussion on the "Condition Code-less Programming" topic
>> can be placed under a "Usage of Prefix instructions" category. It's
>> also a good masters exam question so won't enumerate the known and
>> implemented usages of instruction prefixes.
>>
>> However "Condition Code-less Programming" adds a new possible usage:
>> prefixing a conditional branch to the following instruction.
>>
>> So would like to ask the question: What are the trade-offs and
>> practicalities of using prefixes? The x86-64 is perhaps the premier
>> example of prefix usage and the Itanium an example of a blank sheet
>> design that increased the instruction size rather than use prefixes.
>>
> You probably need a byte granularity instruction set for this to make
> much sense, unless you are going to do something really complicated.
>
I shoe-horned in predicated instructions after the fact, though in my
case it required eating a little more of the 16-bit encoding space
(effectively doubling the encoding space used for 32-bit instruction
forms from what it was originally).
My encoding scheme is conceptually similar to the one in ARM32, just
eating 1 bit for this rather than 4 bits. This was in contrast to using
prefixes.
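To illustrate the encoding-space cost (a toy calculation, not the
actual BJX2 or ARM32 layout): each bit reserved for predication
doubles the raw encoding space a given instruction form occupies.

```python
# Toy illustration only: how many encoding slots one logical
# instruction form occupies when k bits are reserved for
# predication/condition info (not any real ISA's layout).

def space_multiplier(pred_bits: int) -> int:
    # Each reserved bit doubles the number of raw encodings
    # consumed per logical instruction form.
    return 1 << pred_bits

# One predicate bit (as described above): each 32-bit form now
# occupies 2 slots (predicated and plain variants).
print(space_multiplier(1))   # -> 2

# An ARM32-style 4-bit condition field: 16 slots per form.
print(space_multiplier(4))   # -> 16
```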
In my original BJX1 design, the idea was that the 32b ops would have
been done via prefix instructions (modifying the following
instruction), but in the implementation I didn't go this route because
it would have been slower (instead decoding them as larger
variable-length instructions).
In BJX2 I went directly for variable-length instructions, initially
without prefixes.
More recently, I ended up adding a "jumbo prefix" which allows larger
immediate values, although it works by decoding into a separate lane and
operating in parallel. It is possible that if Jumbo ops were supported
in a scalar implementation, they would be accomplished via a prefix.
However, as can be noted, the jumbo prefixes effectively produce 64-bit
or 96-bit instruction encodings.
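As a sketch of how a jumbo prefix can widen an immediate (the bit
layout below is invented for illustration; it is not the actual BJX2
encoding), the prefix's payload bits are simply concatenated above the
base instruction's own immediate field:

```python
# Hypothetical jumbo-prefix immediate fusion. The field widths here
# are made up for illustration and do not match any real encoding.

JUMBO_IMM_BITS = 24   # assumed payload bits carried by the prefix
BASE_IMM_BITS  = 9    # assumed immediate field in the base 32-bit op

def fuse_immediate(jumbo_payload: int, base_imm: int) -> int:
    """Concatenate the prefix payload (high bits) with the base
    instruction's immediate (low bits), as a decoder lane might."""
    assert 0 <= jumbo_payload < (1 << JUMBO_IMM_BITS)
    assert 0 <= base_imm < (1 << BASE_IMM_BITS)
    return (jumbo_payload << BASE_IMM_BITS) | base_imm

# One 32-bit prefix + one 32-bit op -> a 64-bit encoding carrying a
# 33-bit immediate in this made-up layout.
imm = fuse_immediate(0x123456, 0x1FF)
print(hex(imm))  # -> 0x2468adff
```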
> Any really size-efficient instruction set needs to at least in some way
> look like a huffman compression, with the most common
> instructions/operations getting the shorter forms. The Mill takes this
> more or less to the logical conclusion, right Ivan?
>
It sort of works this way in my case, except in performance-optimized
code with WEX enabled, where it ends up being mostly 32-bit ops (since
the whole "OP | OP" thing doesn't work with the 16-bit encodings).
Sadly, getting good code density comes at a performance penalty (of
mostly only running scalar code).
So, in terms of the 16b/32b instruction mix, size-optimized code ends
up around 70%/30% and speed-optimized code around 20%/80%. The 48b ops
still exist, but have mostly fallen into disuse.
In some prior analysis, I noted that it seemed none of the instructions
were common enough to justify having 8-bit encodings (but plenty were
common enough to justify 16-bit encodings).
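The Huffman-style accounting behind that conclusion can be sketched
with invented numbers (the 5% frequency below is illustrative, not
from my actual measurements): the average-size win from an 8-bit form
scales with the opcode's frequency, while the opcode-space cost is
fixed.

```python
# Toy accounting of when an 8-bit short form pays off. The
# frequencies are invented for illustration.

def avg_bits(freqs, sizes):
    """Average instruction size given occurrence fractions and the
    encoding size (in bits) assigned to each opcode class."""
    return sum(f * s for f, s in zip(freqs, sizes))

# Say the single most common op is 5% of the stream, and everything
# else stays 16-bit:
freqs = [0.05, 0.95]

all_16 = avg_bits(freqs, [16, 16])   # 16.0 bits average
with_8 = avg_bits(freqs, [8, 16])    # 15.6 bits average
print(all_16, with_8)

# Only a 2.5% average saving, bought by burning a large chunk of the
# short opcode space -- a weak trade unless some op is far more
# common than this.
```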
While one could argue that x86 has 8-bit instructions, the vast majority
of 8-bit opcodes are either followed immediately by a ModRM byte or
Imm/Disp (so these instructions typically end up as 16 or 24 bits).
Likewise, a fair chunk of the encoding space used for (actual) 8b ops
was carved off and reused for REX prefixes in x86-64 (implying that
their use as REX prefixes was more valuable).
My past conclusion was that 16/32 ISAs appear to be "near-optimal" on
this front. A slight gain would be possible by also having 24- and
48-bit instructions, with 16b remaining the shortest encoding, at the
main costs of added complexity and needing to deal with a BYTE-aligned
instruction stream rather than a WORD- or DWORD-aligned one.
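For concreteness, the mix percentages given above translate into
average instruction lengths like so (straight weighted averages,
ignoring the residual 48b ops):

```python
# Average instruction length for a 16/32 ISA under the two compiler
# modes described above (70%/30% size-optimized, 20%/80%
# speed-optimized; 48b ops ignored as rare).

def avg_len(pct16, pct32):
    return (pct16 * 16 + pct32 * 32) / 100

size_opt  = avg_len(70, 30)   # -> 20.8 bits per instruction
speed_opt = avg_len(20, 80)   # -> 28.8 bits per instruction
print(size_opt, speed_opt)
```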
Meanwhile, x86 decoding is a lot more complicated than ideal; I have
serious doubts about the possibility of executing x86 code efficiently
or affordably on a lower-end FPGA (at least, excluding some form of
dynamic translation to an internal VLIW ISA with the x86 decoder
implemented in firmware or similar).
In my case, I had considered that it could be possible to
entropy-encode the instruction stream ("on disk") using a specialized
codec, but my initial thinking implied it might not be worth the added
cost and complexity.
This would likely be via a DWORD oriented LZ coder, which would split up
the instruction bits and feed them through different entropy contexts
(say, four 8-bit contexts; possibly with special cases for dealing with
immediate values, which could make more sense if encoded as a VLN than
by "polluting" the Huffman contexts, ...).
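The "separate contexts" part of this can be sketched as byte-plane
splitting: deinterleave the stream of 32-bit words into four byte
planes and compress each plane on its own. Here zlib stands in for the
per-context entropy coders (the actual proposal was an LZ coder with
custom Huffman contexts, not zlib):

```python
# Sketch of splitting a DWORD instruction stream into four byte
# planes, one per "entropy context". zlib is a stand-in coder here.

import struct
import zlib

def compress_planes(words):
    data = b"".join(struct.pack("<I", w) for w in words)
    planes = [data[i::4] for i in range(4)]   # one byte plane each
    return [zlib.compress(p) for p in planes]

def decompress_planes(packed):
    planes = [zlib.decompress(p) for p in packed]
    # Re-interleave the four planes back into the original bytes.
    return bytes(b for quad in zip(*planes) for b in quad)

# Repetitive "instructions": high bytes nearly constant, low byte
# varying -- each plane sees a much more uniform stream.
words = [0x12340000 | i for i in range(256)]
packed = compress_planes(words)
print([len(p) for p in packed])
```

The constant planes collapse to almost nothing, while the varying low
byte carries most of the coded size; real code would behave less
cleanly, which is part of why the idea looked marginal.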
As is though, I am using LZ4.
I had also experimented with Deflate (1), but Deflate doesn't gain
enough of a compression advantage over LZ4 to justify its higher cost.
The specialized codec would likely be both slower and more complicated
than Deflate (a lot more Huffman tables), so at best it seems
debatable (more so as the 5 MHz SD card interface isn't quite *that*
slow). It would also need an option to detect and fall back to a
Deflate-like encoding when it encounters data that doesn't map nicely
to DWORDs (DWORD-oriented encoding fares rather poorly on text
strings).
1: This was along vaguely similar lines to the PE/COFF image having
been fed through gzip, but with a different file header.
> The prefix branch would in fact work well as a full-size prefix since
> you need the branch offset just as in a regular branch.
>
I am not sure how prefix branches would have much benefit over normal
conditional branches.
Maybe you could have it predicate the following instructions rather
than branch over them.
But, one could also potentially do this as a decoder special case, eg:
If the conditional branch has a displacement less than N, decode it as
predicating the following N instructions (with N being a certain
fraction of the effective pipeline length).
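That decoder special case can be sketched roughly as follows; the
instruction representation (tuples, "bcc", "pred-false") is invented
purely for illustration:

```python
# Sketch of the branch-to-predication special case: a conditional
# branch with a small forward displacement is rewritten so the ops
# it would have skipped become predicated instead. The IR here is
# made up for illustration.

MAX_PRED = 3   # assumed: at most 3 following ops may be predicated

def lower_branch(ops, i):
    """If ops[i] is ('bcc', disp) with 0 < disp <= MAX_PRED, rewrite
    the disp skipped ops as predicated on the condition being false
    (they execute exactly when the branch is NOT taken)."""
    op = ops[i]
    if op[0] == "bcc" and 0 < op[1] <= MAX_PRED:
        disp = op[1]
        body = [("pred-false",) + ops[i + 1 + k] for k in range(disp)]
        return ops[:i] + body + ops[i + 1 + disp:]
    return ops

ops = [("add",), ("bcc", 2), ("mov",), ("sub",), ("ret",)]
print(lower_branch(ops, 1))
# -> [('add',), ('pred-false', 'mov'), ('pred-false', 'sub'), ('ret',)]
```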