On 10/28/2011 2:56 PM, MitchAlsup wrote:
> On Friday, October 28, 2011 2:21:09 PM UTC-5, BGB wrote:
>> On 10/28/2011 11:07 AM, MitchAlsup wrote:
>>>> like data values, instructions may be 1/2/4/8 bytes, and are kept
>>>> aligned much like data;
>>>
>>> Bad move.
>>>
>>
>> why so?...
>
> If 1 and 2 and 4 and 8 byte instructions are useful, then why are 3 and 5 and 6 and 7 byte instructions not useful?
>
because 3/5/6/7 are not power-of-2 aligned, and anything which fits in 3
bytes also goes in 4, and anything which fits in 5/6/7 goes in 8.
it is much like short vs int vs long long: although, say, 24-bit
integers would be potentially nifty, they don't really contribute a
whole lot, and one will generally pick 16 or 32 bits instead.
>> I think it could probably simplify the HW some, which was the main
>> reason for considering it.
>
> Getting rid of exceptions ALWAYS simplifies HW; sometimes nearly as much as fixing the architecture so that it is not a problem to begin with.
>
possible, but in this case the goal would be to make a "simple"
architecture, rather than a necessarily "good" one.
>>>> opcodes are 8 or 16 bits;
>>>
>>> Probably a bad move.
>>>
>>
>> not sure why this would be.
>>
>> at least the "two bits 00/01/10=single byte, 11=dual byte" opcode scheme
>> was used effectively in my own VM bytecode designs, and has generally
>> worked fairly well.
>
> So you have 192 single byte instructions and 64*256 = 16384 two byte instructions.
>
yep.
I have a variant that extends the pattern, and can encode a 32-bit
opcode number in 5 bytes.
say
xxxxxxxx
11xxxxxx xxxxxxxx
1111xxxx xxxxxxxx xxxxxxxx
111111xx xxxxxxxx xxxxxxxx xxxxxxxx
11111111 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
but, it is a moot point as, presently, my VM only uses around 540
opcodes (it would likely be a bit more if they were statically-typed,
but most of these opcodes operate on "variant" types, which the
interpreter generally handles as dynamic types and which my JITs have
generally used type-inference on).
some of my other VM/interpreter designs have used a similar scheme,
except they were statically-typed; however, these were never fully
implemented (since statically-typed interpreters with a non-trivial
type-system are considerably more effort to implement).
I also considered it a little moot as:
a statically-typed interpreter is still slow in the naive case (IME,
the dispatch cost of the big "switch()" comes to dominate in most
simple switch-based interpreters).
a less naive interpreter (or a JIT) renders the difference between
static types and inferred types largely a moot point IME (an interpreter
design which internally converts the bytecode into threaded-code can
easily enough infer the types in the process, having little need for
them to be type-specialized in the bytecode).
hence, my bytecode formats tend to have a little more in common with
MSIL / CIL / .NET ByteCode than with Java ByteCode in this regard.
> If you count different address modes as different instructions, x86 has more instructions than what you have provided.
>
yes, but if one doesn't count the addressing modes, IIRC it was something
like 2100 instruction forms (of 770 mnemonics), which includes the base
opcodes + x87 + SSE + partial AVX.
this is going by my assembler listing, where I don't think I really left
out anything significant.
in this case, the arguments for the opcodes are not included (such as
ModRM encodings), since the assembler manages this itself.
parts of my assembler are generated directly from listings, which use a
notation fairly similar to that used in the Intel docs (except for AVX
and XOP where I devised my own more-compact notation).
ARM actually led to considerably more complex logic, because the
encoding rules for ARM and Thumb instructions were far less regular than
for x86 (and led to a reasonably funky listing notation).
I guess it may not have been so bad if people were hand-writing the
logic for emitting each instruction form, but much more awkward with
"generic" logic.
>> the JVM and .NET use 0xFE and 0xFF as escape bytes.
>> x86 uses a bunch of prefix bytes.
>
> In the prefix space, x86 has fairly flexible use of the bits. A prefix that indicates SSE (versus MMX) opcode decoding can be used to bias branch prediction on an instruction that is not MMX class. Thus, overall, each prefix byte adds a bit (or two) to the opcode decoding space. Some prefixes (REX in particular) add bits to the register specifiers, and other opcode decoding thingamabobs. This is simply more flexible, and when approached as if the byte stream were being passed through a LALR HW parser it is a lot more flexible and expandable. In HW implementations it is not usefully harder than a more fixed-format encoding.
yes, however, prefixes could lead to potentially more memory accesses I
would think (for a naive implementation).
I never really understood, though, why access to the 16 x 64-bit
registers has not been added to real mode or 32-bit protected mode.
when I wrote an x86 interpreter, I added them via a special prefix I
called "PREX" which was basically a 2-byte prefix similar to XOP or VEX,
but encoded the same basic information as a REX.
not entirely sure why Intel or AMD couldn't do similar, unless there is
some more subtle architectural reason for why 64-bit regs can't be added
to 32-bit mode (granted, assuming OS support).
>>
>> not sure why an 8 or 16 bit opcode would be a big issue.
>>
>> granted, in my own VM designs, this has often left invalid overlong
>> encodings for some opcodes, and a value-range hole due to the
>> implementation (namely a "recursive switch statement" as the main
>> instruction-handling function).
>
> In my HW LALR parser, I defined tables containing bit patterns. A negative number indicated that more decoding bytes were necessary; part of the value indicated the new state and another part indicated what had just been decoded. A positive value indicated an entry in a different table that fully identified the instruction. A zero value in the table indicated that the opcode was not in the table (thus it is undefined). Each table has 256 entries and a handful of bits. One applies 5 bytes of the byte stream to 5 sets of tables; (some of) the bits falling out of the tables control multiplexers, other bits are operands to the multiplexers. At the end of the tree of multiplexers, the necessary select lines are driven to control HW state concerning this instruction's decode.
>
hmm...
in my x86 interpreter, I had used something more akin to a regex
matcher. some LUTs were built from the listings, and then the
instruction was matched against any patterns, and the first-found match
was assumed to be correct.
> When one has marker bits to denote instruction boundaries, the above paragraph is simply replicated to the decode width and run in complete parallelism.
>
ok.
>> granted, HW need not be equivalent.
>>
>>
>>>> opcodes may access a memory operand and a register at the same time;
>>>> maybe, x86-like memory addressing may be used [base+scale*index+disp];
>>>> there will be 16 registers;
>>>
>>> Probably a good move.
>>>
>>
>> yeah.
>> mostly it is that having to calculate stuff in registers, and thus use
>> up registers, to perform memory-address calculations, is kind of
>> annoying, and may not notably simplify the design (but does likely cost
>> some WRT performance).
>
> The misaligned support and the [base+scale*index+displacement] addressing mode are what changed my mind about RISC overall. Notice the 88K had [base+scale*index] and [base+displacement] only.
>
memory addressing seems nice, but I am not sure what are the
costs/benefits of allowing misaligned access (vs not allowing it).
granted, from a programmer POV, having misaligned access is probably
nicer than not having it.
>>
>>>> unused opcode spots will be set to nop;
>>>> the behavior of a misaligned opcode will be undefined.
>>>
>>> ALL UNUSED opCode space MUST be defined as exception generating. Otherwise you will hate yourself in design #2.
>>>
>>
>> I meant in terms of alignment.
>>
>> say:
>> 8-bit op;
>> nop8; //pad-align
>> 16-bit op.
>
> This worked out poorly in the CDC 6600 but not so badly in the CDC 6400. It worked out so poorly in the S.E.L 32/50 and 32/87 that they finally fixed it in the 32/67.
>
> Even when you can make the decoder eat the noops without creating bubbles in the pipe, they still waste a precious resource--the I cache.
possibly, but I was thinking of the possibility of cores with neither a
pipeline nor (internal) cache (although the external SRAM could likely
be considered cache memory).
granted, the nops would probably eat clock-cycles.
I haven't yet tried to imagine how exactly the MMU and SRAM would work
(I was sort of imagining a bunch of cores on a bus linked up to a common
MMU, which would link IO requests either to SRAM, or somehow pause the
core until the value could be fetched from DRAM).
I am left to wonder some about the possibility of "bus collisions",
where if a common bus is used between a number of cores, two cores could
try to do IO at the same time and then clash.
also possible could be an "IO token", where a line could indicate that a
core wishes to perform IO, and the MMU puts a token on the bus for the
core it is granting the IO request for.
but, sadly, this is a bit far out of my area of expertise...
>>
>> or:
>> 8-bit op;
>> nop8; //pad-align
>> nop16; //pad-align
>> 32-bit op.
>
> I guarantee that you will hate yourself for doing this on the #2 machine.
>
> Just consider the following case:
> 8-bit op;
> nop8
> nop16
> nop32
> 64-bit op;
>
yeah, maybe some sort of multi-stage nop:
8-bit op;
nop56;
64-bit op;
it would be mostly similar to the above, except that the first nop
indicates the number of bytes to be skipped over (or value to add to the
IP/PC).
>> nevermind any clock-cycles eaten by executing nops.
>> presumably, the assembler could try to figure out if it can pack the
>> instructions effectively, or just otherwise insert nops (code producing
>> ASM will not need to worry about producing aligning nops, any more than
>> it will need to worry about how immediates get loaded into registers, or
>> the specifics of "opr reg, imm" arithmetic-instruction forms).
>>
>> note: there may be some "opr reg, imm" immediate forms, but in the
>> "general case" these may likely be implemented internally as "opr reg,
>> [mem]" operations.
>
> In my not so humble opinion, you will not enjoy the results of this choice. It causes extra work in the pipeline (inbound memory ref: which chews up valuable register basepointers), and displaces useful data from the data cache that could be properly positioned in the instruction cache.
>
fair enough, except that the memory-access cases are likely to be less
common.
also, the sequential speed of each core would be left as a lower
priority than simplicity, since if the lower transistor count allows,
say, 2 cores in the same space for a modest per-core slowdown, this may
be a net improvement...
granted, this is all rather hypothetical, as I am a lone hobbyist
programmer with no access to any sort of chip-fab technology...