On 4/14/2020 3:52 AM, Terje Mathisen wrote:
> JimBrakefield wrote:
>> Much of the discussion on the "Condition Code-less Programming" topic
>> can be placed under a "Usage of Prefix instructions" category.  It's
>> also a good masters exam question so won't enumerate the known and
>> implemented usages of instruction prefixes.
>>
>> However "Condition Code-less Programming" adds a new possible usage:
>> prefixing a conditional branch to the following instruction.
>>
>> So would like to ask the question: What are the trade-offs and
>> practicalities of using prefixes?  The x86-64 is perhaps the premier
>> example of prefix usage and the Itanium an example of a blank sheet
>> design that increased the instruction size rather than use prefixes.
>>
> You probably need a byte granularity instruction set for this to make 
> much sense, unless you are going to do something really complicated.
> 
I shoe-horned in predicated instructions after the fact, though in my 
case it required eating a little more of the 16-bit encoding space 
(effectively doubling the encoding space used for 32-bit instruction 
forms from what it was originally).
My encoding scheme is conceptually similar to the one in ARM32, just 
eating 1 bit for this rather than 4 bits. This was in contrast to using 
prefixes.
In my original BJX1 design, the 32b ops would have been done via prefix 
instructions (modifying the following instruction), but in the 
implementation I didn't go this route because it would have been slower 
(instead decoding them as larger variable-length instructions).
In BJX2 I went directly for variable-length instructions, initially 
without prefixes.
More recently, I ended up adding a "jumbo prefix" which allows larger 
immediate values, although it works by decoding into a separate lane and 
operating in parallel. It is possible that if Jumbo ops were supported 
in a scalar implementation, they would be accomplished via a prefix.
However, as can be noted, the jumbo prefixes effectively produce 64-bit 
or 96-bit instruction encodings.
> Any really size-efficient instruction set needs to at least in some way 
> look like a huffman compression, with the most common 
> instructions/operations getting the shorter forms. The Mill takes this 
> more or less to the logical conclusion, right Ivan?
> 
It sort of works this way in my case, except in performance-optimized 
code with WEX enabled, where it ends up being mostly 32-bit ops (since 
the whole "OP | OP" bundling doesn't work with the 16-bit encodings).
Sadly, getting good code density comes at a performance penalty (of 
mostly only running scalar code).
So, between the two modes, size-optimized code ends up around 70% 
16-bit / 30% 32-bit ops, and speed-optimized code around 20% / 80%. The 
48b ops still exist, but have mostly fallen into disuse.
In some prior analysis, I noted that it seemed none of the instructions 
were common enough to justify having 8-bit encodings (but plenty were 
common enough to justify 16-bit encodings).
While one could argue that x86 has 8-bit instructions, the vast majority 
of 8-bit opcodes are either followed immediately by a ModRM byte or 
Imm/Disp (so these instructions typically end up as 16 or 24 bits).
Likewise, a fair chunk of the encoding space used for (actual) 8b ops 
was carved off and reused for REX prefixes in x86-64 (implying that 
their use as REX prefixes was more valuable).
My past conclusion was that 16/32 ISAs appear to be "near-optimal" on 
this front. A slight gain would be possible by also having 24- and 
48-bit instructions (with 16b remaining as the shortest encoding), at 
the main costs of added complexity and of dealing with a BYTE-aligned 
instruction stream rather than one that is WORD or DWORD aligned.
Meanwhile, x86 decoding is a lot more complicated than ideal; I have 
serious doubts about the possibility of executing x86 code efficiently 
or affordably on a lower-end FPGA (at least, excluding some form of 
dynamic translation to an internal VLIW ISA, with the x86 decoder 
implemented in firmware or similar).
In my case, I had considered that it could be possible to entropy-encode 
the instruction stream ("on disk") using a specialized codec; but my 
initial thinking implied it may not be worth the added cost and complexity.
This would likely be via a DWORD oriented LZ coder, which would split up 
the instruction bits and feed them through different entropy contexts 
(say, four 8-bit contexts; possibly with special cases for dealing with 
immediate values, which could make more sense if encoded as a VLN than 
by "polluting" the Huffman contexts, ...).
As is though, I am using LZ4.
I had also experimented with Deflate (1), but Deflate doesn't gain 
enough compression over LZ4 to justify its higher cost. The specialized 
codec would likely be both slower and more complicated than Deflate (a 
lot more Huffman tables), so at best it seems debatable (more so as the 
5MHz SDcard interface isn't quite *that* slow). It would also need an 
option to detect and fall back to a Deflate-like encoding when it 
encounters data that doesn't map nicely to DWORDs (DWORD-oriented 
encoding fares rather poorly against text strings).
1: This was along vaguely similar lines to the PE/COFF image having been 
fed through gzip, but with a different file header.
> The prefix branch would in fact work well as a full-size prefix since 
> you need the branch offset just as in a regular branch.
> 
I am not sure how prefix branches would have much benefit over normal 
conditional branches.
Maybe one could have them predicate the following instructions rather 
than branch over them.
But, one could also potentially do this as a decoder special case, e.g.: 
if the conditional branch has a displacement of less than N 
instructions, decode it as predicating the following N instructions 
(with N being a certain fraction of the effective pipeline length).