On 12/8/2022 1:52 PM, MitchAlsup wrote:
> On Wednesday, December 7, 2022 at 2:26:20 PM UTC-6,
robf...@gmail.com wrote:
>> I think deciding between compare-and-branch as one instruction may have to
>> do with the size of the branch displacement possible in an instruction format.
>> It takes more bits to represent the instruction if there are two register fields in
>> it and that may limit the displacement size. If instructions are 16-bits specifying
>> two registers in the instruction probably is not practical due to the small
>> displacement.
> <
> With the Compare instruction and then branch model, there is 1 branch instruction
> with a myriad of sub instructions {Br ==, Br !=, Br s>, BR s>=, Br s<, BR s<=, Br u<
> Br u<=, Br u>, BR u>=, and more} all as 1 single Major OpCode (front 6 bits). I
> use the bits where the destination register would be to encode the condition.
> <
> In contrast, the Compare-Branch model only has a few compares-and-branches
> due to not wanting to waste too much OpCode space of the compare-branch.
> <
> My compare then branch model uses a bit vector condition, so branch on bit
> falls out for free. Over on the compare-to-zero-and-branch side, I have a complete
> set of compare-to-zero-branch instructions, some with the ability to test for
> situations outside of normal programmatics--for example: test if an ATOMIC
> sequence has suffered from interference.
I ended up defining some RISC-V style compare-and-branch instructions.
The only real reason they are enabled is that the RISC-V mode needs
them. They end up taking up ~ 1/4 of the F1 block (mostly used for
Load/Store Disp9 ops).
IIRC, I did add compiler support for them, but they are still "pretty
meh": "if(a<b)" as a single op is not that big of a win when the branch
target must be within +/- 256 bytes.
Ironically, "if(a<0)" tends to end up being more useful (these ended up
existing as separate ops in BJX2 mostly because of the lack of an
architectural zero register).
In the case of integer compare ops, the majority of cases tend to be
comparisons with a constant, which is on average better served, say, by
"CMPxx Imm10, Rn; BT/BF lbl"
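The CMPxx/BT/BF split above follows the SuperH lineage: a compare sets a single True/False status bit ("T"), and BT/BF branch on it. A toy C model of that mechanism (the struct and function names are mine, not actual BJX2 encodings):

```c
#include <stdint.h>

/* Toy model of a single true/false status bit ("T"), as in the
   SuperH lineage BJX2 follows; names here are illustrative only. */
typedef struct { int64_t r[32]; int t; } cpu_t;

/* CMPGT Imm, Rn: T = (Rn > imm), signed compare against a constant */
void cmpgt_imm(cpu_t *c, int64_t imm, int rn) { c->t = c->r[rn] > imm; }

/* BT: branch taken if T is set; BF: branch taken if T is clear */
int bt_taken(const cpu_t *c) { return  c->t; }
int bf_taken(const cpu_t *c) { return !c->t; }
```

Since only the compare updates T, the branch can sit any distance after it without other instructions stomping the bit.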
>>
>> It is often not necessary to do the compare, in some processors the condition
>> codes are set by many instructions and a branch can take place without needing
>> an explicit compare. In that case the instructions are just as compact as doing a
>> compare-and-branch in one instruction.
> <
> The SPARC experience indicates that 80% of this free work is non-useful.
Most status flag updating is noise.
If one does status flags like on x86, then more often than not the
status flags end up being stomped before one can do anything with the
results.
Status flags are more useful, ironically, when only a very limited
selection of instructions modify them.
Similar for True/False bit predication: to be useful for Then/Else
blocks, it is also implied that no instruction inside the Then/Else
block may update this bit.
>>
>> RISCV has only a 12-bit displacement field which is adequate most of the time,
>> but no doubt on rare occasions extra code needs to be inserted to branch further
>> away.
> <
> My 66000 has 16-bits of word displacement in the OpCode which is inadequate
> a whole lot less than RISC-V. You would need a subroutine with more than 32767
> instructions.
In my case, BT/BF are 20-bit, which is almost always sufficient.
But, yeah, a branch within +/- 32K words will hit a lot more often than
+/- 2K words (which will hit more often than +/- 256 words).
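The displacement figures above can be sanity-checked: an n-bit signed word displacement reaches roughly +/- 2^(n-1) words (strictly, -2^(n-1) to 2^(n-1)-1). A one-liner:

```c
#include <stdint.h>

/* approximate signed reach, in words, of an n-bit word-displacement
   field (the negative side reaches one word further) */
int64_t reach_words(int bits) { return (int64_t)1 << (bits - 1); }
```

So a 16-bit field gives the +/- 32K words, a 12-bit field the +/- 2K words, and a 9-bit field the +/- 256 words quoted above.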
>>
>> Another consideration is that compare instructions are needed in the instruction
>> set anyway, it may be less hardware then to use separate compare and branch
>> instructions. Compares are needed for expressions like: X = a < b. which are
>> sometimes recorded instead of branched on.
> <
> My 66000 compare instructions build a bit-vector of relations. Then my shift
> instructions can extract the signed set {-1,0} or the unsigned set {0,+1}, so
> different languages can access what is their definition of the right result.
In my case, they come in "flavors":
EQ
GT, GE (Signed)
HI, HS (Unsigned)
The BT/BF choice may be used to build the other possibilities.
Strictly speaking, one only needs EQ, GT, and HI; but this would leave
immediate-form instructions unable to directly express some cases (LT
and GE; and if these cases need a separate constant-load, this is lame).
So, one needs, say:
CMPEQ Reg, Reg
CMPEQ Imm, Reg
CMPGT Reg, Reg
CMPGT Imm, Reg
CMPGE Imm, Reg
CMPHI Reg, Reg
CMPHI Imm, Reg
CMPHS Imm, Reg
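The register forms cover the missing relations by swapping operands (a<b is b>a), while the immediate forms cover them by branching on BF rather than BT (a<imm is exactly !(a>=imm)). A minimal sketch of both identities, with names invented for illustration:

```c
#include <stdint.h>

/* register form: LT is just GT with the operands swapped */
int cmpgt(int64_t a, int64_t b) { return a > b; }
int cmplt(int64_t a, int64_t b) { return cmpgt(b, a); }

/* immediate form: no CMPLT Imm exists, but branching with BF on a
   CMPGE result gives the same branch: (a < imm) == !(a >= imm) */
int cmpge_imm(int64_t a, int64_t imm) { return a >= imm; }
int branch_if_lt_imm(int64_t a, int64_t imm) { return !cmpge_imm(a, imm); }
```

This is why GE and HS only need immediate forms: the register cases fall out of the operand swap, but an immediate cannot be "swapped" into the register slot.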
>>
>> My latest core stores the results in condition code registers like the PowerPC
>> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
>> displacement. For branch-to-subroutine 24-bit displacements are possible.
> <
> How often is 21-bits sufficient (where 16-bits would have been insufficient) ??
> Do you know of a subroutine that is bigger than 1,000,000 instructions ?
> <
> As for direct BR and CALL one has 26-bits
Thus far, none of my test programs have exceeded the existing 20-bit
limit (1MB), for the entire ".text" section.
When this happens, things like function calls will start needing to use
a bigger encoding (thus far mostly untested; and will not currently be
predicted by the branch predictor). As-is, there are both Disp33s and
Abs48 encodings as possible fallback cases.
For most sane functions, for internal branches, this is likely in
"probably not going to happen" territory.
Closest to this limit thus far is ROTT, which weighs in at a (relatively
massive) 754K of ".text" (for reference, the x86-64 builds are nearly 2MB).
Though, part of this seems to be that the people at Apogee used
excessive amounts of macros and copy/paste.
Something like a 22-bit branch might be better in the sense of "yeah,
4MB .text section is not implausible".
Though, 26 bits seems a little overkill.
Non-local calls and branches need not use relative branch instructions
though, so this case does not matter.
In other news, my experiment for Huffman coding + LZ with 4 interleaved
bitstreams was a moderate success (on my Ryzen 2700X, MSVC):
Decode speeds ~ 400 to 700 MB/sec with Huffman coding;
Decode speeds ~ 600 to 1200 MB/sec with fixed-length (byte) symbols;
Compression ratios also slightly better than Deflate.
GCC still seems to be doing its "approx 40% faster" thing, so with GCC
I was in a few cases getting Huffman-coded decode speeds of around
1.0 GB/sec.
General approach should also map over to BJX2.
The LZ format is a little wonky though on account of the use of parallel
buffers for encoding/decoding stuff. Other than the parallel buffer
wonkiness, the design is vaguely similar to that of LZ4 (just with tags,
literal, distance, and extra bytes, being spread into 4 parallel buffers).
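The post does not give the exact tag layout, so the following is only a hedged sketch of the parallel-buffer idea, with an invented LZ4-ish format: a 4-bit literal count and 4-bit match length packed in each tag byte, 16-bit little-endian distances, and tags/literals/distances/extras each read from their own stream:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 4-stream layout (field widths are assumptions,
   not the actual format described in the post). */
typedef struct {
    const uint8_t *tags, *lits, *dists, *extra;
} lz_streams_t;

size_t lz_decode(lz_streams_t *s, size_t ntags, uint8_t *dst) {
    uint8_t *out = dst;
    for (size_t i = 0; i < ntags; i++) {
        uint8_t tag = *s->tags++;
        int nlit = tag & 0x0F;          /* literal run length */
        int nmat = tag >> 4;            /* match length, 0 = no match */
        memcpy(out, s->lits, (size_t)nlit);  /* literals: own stream */
        s->lits += nlit; out += nlit;
        if (nmat) {
            unsigned dist = s->dists[0] | ((unsigned)s->dists[1] << 8);
            s->dists += 2;
            uint8_t *src = out - dist;
            for (int j = 0; j < nmat; j++)  /* byte-wise so overlapping */
                *out++ = *src++;            /* matches (RLE) still work */
        }
    }
    return (size_t)(out - dst);
}
```

Splitting the streams this way lets each one be entropy-coded (or fetched) independently, which is what enables the interleaved-bitstream decoding above.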
This is also with a 12-bit symbol length limit (can do 13 bit, which
helps compression at the cost of decoding speed).
Still, technically only a moderate improvement over a codec doing
normal (symbol-at-a-time) decoding with a 12-bit symbol length limit
(vs, say, the speed improvement of a 12-bit limit over 15 bits).
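The point of a capped code length is that decoding one symbol becomes a single lookup: with a 12-bit limit, a 4096-entry table indexed by the next 12 bits gives (symbol, length) directly. The post gives no table details, so this is a generic sketch of that standard technique for canonical Huffman codes, MSB-first:

```c
#include <stdint.h>

#define MAXLEN 12  /* the 12-bit symbol length limit discussed above */

typedef struct { uint16_t sym; uint8_t len; } hentry_t;

/* Build a flat (1 << MAXLEN)-entry lookup table from canonical
   code lengths: each code's entry is replicated over all possible
   trailing bits, so any 12-bit window resolves in one lookup. */
void huff_build(const uint8_t *lens, int nsym, hentry_t tab[1 << MAXLEN]) {
    uint32_t code = 0;
    for (int len = 1; len <= MAXLEN; len++) {
        for (int s = 0; s < nsym; s++) {
            if (lens[s] != len) continue;
            uint32_t base = code << (MAXLEN - len);
            for (uint32_t f = 0; f < (1u << (MAXLEN - len)); f++) {
                tab[base + f].sym = (uint16_t)s;
                tab[base + f].len = (uint8_t)len;
            }
            code++;
        }
        code <<= 1;  /* canonical codes: next length starts doubled */
    }
}

/* minimal MSB-first bit reader: low 'nbits' bits of acc are valid */
typedef struct { const uint8_t *p; uint32_t acc; int nbits; } br_t;

static uint32_t br_peek12(br_t *b) {
    while (b->nbits < MAXLEN) {
        b->acc = (b->acc << 8) | *b->p++;
        b->nbits += 8;
    }
    return (b->acc >> (b->nbits - MAXLEN)) & ((1u << MAXLEN) - 1);
}

int huff_decode(br_t *b, const hentry_t *tab) {
    hentry_t e = tab[br_peek12(b)];
    b->nbits -= e.len;  /* consume only the bits this code used */
    return e.sym;
}
```

Raising the limit to 13 bits doubles the table to 8192 entries, which matches the compression-vs-decode-speed trade-off mentioned above.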
But, at least, it is not being particularly bottlenecked by the entropy
coder (in the profiler), which is something (a lot of time also goes
into handling the LZ format and into the match-copying function).
Had also designed the format to be able to deal with deltas between
buffers (partly inspired by Sierra's video format). Though this part was
not being tested in this case.
So, it would at least appear to be "kinda fast-ish", at least as far as
entropy-coded designs go.
...