I am starting to come around to that. However, as you and Eric both
pointed out the cost of maintaining this state is non-trivial.
>>
>> B) You have to carry state about any remapping across task switch,
>> interrupt etc. boundaries. What about returns?
>
> You have to make this work in general, not in tiny increments.
My point about returns was that there are two possibilities (at least
that I see):
1) the hardware retrieves the map on return/iret etc. Thus the hardware
needs to store it somewhere when control flow diverges (perhaps only on
branch but not on jump?).
2) the compiler need to insert the PREFIX at the return target address.
That is really not feasible for interrupts and similar.
The problem with #2 seems insurmountable. If you have to make the
hardware deal with the PREFIX state for interrupts and exceptions, you
have already solved most of the problems dealing with the normal
function return case.
>> C) It adds definite complexity to the pipeline per Luke's points.
>
> You are forgetting that ALIAS (or MOV) instructions can be "performed" in
> zero cycles! The subsequent instruction can begin when the ALIASed data
> is ready--it does not actually HAVE to be MOVed (or ALIASed!)
>
> This kind of pipeline complexity should easily pay for itself.
>
Hmm... Good point.
>> The rest of this is uninformed brainstorming, so may make little sense:
>> what if you had something like this:
>>
>> * the goal is to make 16-bit instructions more useful by reducing the
>> number of bits needed for register encoding while simultaneously
>> allowing access to at least 64 architectural registers. The overall
>> goal is to compress the instruction stream.
>
> I fundamentally question whether this should be a goal--as there are so many
> OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> are their carries from integer arithmetic calculations? Are there unsigned
> versions of integer instructions? How about saturation arithmetic,...
Fair enough. If you have 32-bit base instructions and then increments
of either 16 or 32-bit words, then you get a lot more bits to play with
for the register specifiers and thus do not need PREFIX-like extensions
anywhere near as much.
>> * instructions are multiples of 16-bit words.
>>
>> * on a branch/interrupt/trap etc. (anything that changes control flow),
>> the first 16-bit word at the target site is a prefix "instruction"
>> except that because of the knowledge that it is a prefix, you decode it
>> differently.
>
> This is going to make switch codes perform worse, and take more instruction
> area.
Oooh. Switches. Right. That is a pretty lethal problem with this
part of the idea. You really would need two kinds of branch or jump:
one kind would reset the PREFIX and would need a PREFIX at the target,
the other would not. That's not pretty.
>> * a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
>> of 16 bits.
>>
>> * each set of two bits is the prefix to following instructions. There
>> is no limit to how long the prefixes apply except as noted above when
>> control flow changes.
>
> Each 2-bit prefix should be assigned to a operand register specifier
> as encountered in the instruction stream, until the PREFIX is depleted.
> {That is--like a shift register.}
I was thinking of this working in parallel for easier decoding, but I am
not a hardware guy (Ivan's IANAHG) nor do I play one on TV. If you
think that it would be possible to do this as a shift register, I am
more than happy to take your word for it!
I did initially think of it like a shift register when I was thinking of
a more general remapping rather than just prefix bits. But then I was
not sure if that could be easily handled in decode.
>> * instructions only reference up to 8 registers via 3-bit register
>> fields. These 3 bits are prefixed by the 2 bits in the PREFIX
>> "instruction" before in the same execution flow. The first pair of
>> bits in a PREFIX are the prefix bits for the register R0, the second
>> pair are the prefix for R1 etc.
>
> I am going to argue, here, that this is like loading gangs of PTEs
> from a single line into the TLB on a PTE miss. It is not going to
> work as well as you envision. It will certainly fail when you need
> that 9th register to properly encode the "loop".
8 registers was a result of trying to fit everything into a 16-bit
instruction word. If that is dropped, then this could be 16 registers
and I think then you probably are fairly safe. Certainly the amount of
push/pop activity in AMD64 code seems significantly lower than in 32-bit
x86 code.
That said, if one of the goals was to lower the impact from
consolidating the FP and integer registers, then even 16 might be light.
>> The quirk here is that the PREFIX does not take any bits away from other
>> instruction encodings because it can only exist (and is the only
>> "instruction" that can exist) at the target address of a control flow
>> change.
>>
>> Note that handling returns requires some saving of the previous PREFIX
>> somewhere in a CSR or on the stack. Same with returning from
>> interrupts/traps/exceptions.
>
> This is one of those areas that it is going to hurt worse than envisioned.
Yep.
>> This does not buy that much. I think that just two bits are
>> insufficient to give that much of an advantage as it does not let you
>> get to more than 32 registers. So this fails the goal of accessing at
>> least 64 architectural registers.
>>
>> I see at least two ways to solve that:
>>
>> 1) Make the PREFIX header more than 16 bits. If it was 32, we could
>> either double the number of bits to 4 thus giving up to 128
>> architectural registers. The overhead is pretty high though as most
>> basic blocks are only ~5 instructions long. So 2 16-bit instructions
>> added to each basic block is a significant amount of overhead.
>
> 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> increments.
>
> But still, if your average basic block is 5 instructions and you are forced
> to use at least 1 PREFIX than you have just made your code 20% larger. On
> the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> only a 20%-30% code disadvantage to RISC to begin with.
>
> This points to a "hard sell" and a steep hill to climb.
Definitely.
Your point about using extended basic blocks is a good one though as it
almost doubles the average number of instructions in a BB.
>>
>> 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
>> map some. For instance if we used 3-bit prefixes, we have enough space
>> for 5 prefixes. Take alias registers A0-A2 as always mapping to real
>> registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
>> use the five prefixes.
>>
>> I like the idea of the PREFIX applying until there is a control flow
>> change.
>
> The actual control flow change (taken) or the presence of a possibility
> of a control flow change (BC) ?
I think it needs to be on actual control flow change. As I mentioned
above, I think it starts to make sense to have two kinds of control flow
instructions: those that terminate the PREFIX and those that don't.
This just adds to the complexity :-(
In some ways this is a bit like changing the mode of a processor. I.e.
sort of like going into Arm Thumb mode on a branch and coming out on a
return.
>> I like the idea of the PREFIX only existing at the target of a
>> control flow change because then it does not take up instruction
>> encoding space. But this seems like a lot of complexity.
>
> Especially for fall-through codes.
Yeah, this is a problematic case. However, if you accept the cost of
having two types of control flow instruction (or perhaps a flag in
jump/branch that controls this) then you can do things like switches
without destroying the aliasing/prefixing.
Alternatively, if you accept larger instructions, then you can use some
of the instruction space with less overall impact.
>>
>> Is the overhead worth it?
>
> In my not so humble Opinion:: No.
>
> But I wish you luck.
At this point I do not see much point in pursuing this much further. It
was an interesting little thought experiment, but the benefits seem
relatively small and the cost relatively high. Either:
1) you use tiny 16-bit instructions then you hit the penalty of not
specifying enough aliased registers and all the other problems of 16-bit
instructions. Extending PREFIX/ALIAS to 32-bit or more increases the
overhead on BBs significantly.
2) if you use 32-bit instructions, you are not gaining much because
32-bit instructions can burn a lot more bits for register fields. So
the whole premise becomes weaker.
Either way you pay a lot of cost on things like branch/return and
exceptions/traps/interrupts. And then there is the problem of either
making switches ugly or doubling the instruction bit cost of
jump/branch. Or both :-(
Perhaps if you had a ISA that had entire BBs or EBBs as single
"instructions" (somewhat like EDGE or TRIPS) perhaps something along
these lines makes more sense.
Or, if you need to maintain compatibility with existing code (i.e.
Luke's case or AMD's 64-bit extensions to x86), then this idea or
similar could be useful.
Best,
Kyle