For quite some time I've been thinking about what my dream CPU would look like. It's got a long way to go, but I finally got around to writing it up. I would be very interested in anyone's opinions or suggestions.
The 64256 is my preliminary design for a maximum-performance CPU. It includes many unique features:
1. While it is a 64-bit CPU, it has a 256-bit data bus.
2. It has 65536 registers (status register, program counter, & 65534 general purpose) on the chip.
3. It has as much main memory as possible on chip.
4. Each 64-bit instruction performs 2 functions simultaneously.
5. Besides testing a bit of a register, conditional instructions can be based on the result of the other function of the same instruction, eliminating the need for flags in the status register.
6. The word length can be selected (using the status register) to be any power of 2 from 2 to 64 bits, providing SIMD capability (e.g., if the word length is set to 8 bits, an add immediate would add the value to each group of 8 bits).
7. If the word length is less than 64, conditional instructions could be based on whether the condition was true for either all or any of the words.
8. Conditional instructions could be based on how many bits of the result are 0 (note that the lowest bit of the zero count is the same as a parity bit & the highest bit is the same as the conventional zero flag (always true if the word length is a power of 2)).
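To make features 6 to 8 a bit more concrete, here is a rough software model in Python. It is only a sketch of how I read the list above: the function names (`lanes`, `pack`, `add_immediate`, `zero_count`, `condition`) are made up for illustration, and I'm assuming the lowest-order word is lane 0.

```python
MASK64 = (1 << 64) - 1

def lanes(value, width):
    """Split a 64-bit value into 64 // width words, lowest word first."""
    lane_mask = (1 << width) - 1
    return [(value >> shift) & lane_mask for shift in range(0, 64, width)]

def pack(lane_values, width):
    """Reassemble words into one 64-bit value, truncating each to width bits."""
    lane_mask = (1 << width) - 1
    result = 0
    for i, lane in enumerate(lane_values):
        result |= (lane & lane_mask) << (i * width)
    return result & MASK64

def add_immediate(value, imm, width):
    """Feature 6: add imm to every word; carries never cross word boundaries."""
    return pack([lane + imm for lane in lanes(value, width)], width)

def zero_count(lane, width):
    """Feature 8: number of 0 bits in one word (ranges from 0 to width)."""
    return width - bin(lane).count("1")

def condition(value, width, predicate, mode):
    """Feature 7: test predicate on every word; mode is 'all' or 'any'."""
    results = [predicate(lane) for lane in lanes(value, width)]
    return all(results) if mode == "all" else any(results)
```

With 8-bit words, the zero count needs 4 bits (0 to 8). Its lowest bit matches the parity of the 1 bits, and its highest bit is set only when the whole word is zero, which is the point made in item 8.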
General description:
The chip would have an address space of 2 to the 64th power 64-bit words (since the data bus is 256 bits, each read would fetch 4 words; writing less than 256 bits would be somewhat complicated, but no more so than for a typical chip with a 64-bit bus that addresses memory as 8-bit bytes).
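As a rough illustration of the addressing, here is a sketch in Python. It assumes the low 2 bits of a word address pick the word within a 256-bit group, and it models a narrow write as read-modify-write of the whole group, which is just one way the "somewhat complicated" case could be handled. The names `split_address`, `read_group`, and `write_word` are hypothetical.

```python
MASK64 = (1 << 64) - 1
WORDS_PER_GROUP = 4  # 256-bit bus / 64-bit words

def split_address(word_address):
    """Return (group index, word position within the 256-bit group)."""
    return word_address >> 2, word_address & 3

def read_group(memory, word_address):
    """A read always returns all 4 words of the group containing the address."""
    group, _ = split_address(word_address)
    return list(memory.get(group, [0] * WORDS_PER_GROUP))

def write_word(memory, word_address, value):
    """Write one 64-bit word: read the group, replace one word, write it back."""
    group, position = split_address(word_address)
    row = read_group(memory, word_address)
    row[position] = value & MASK64
    memory[group] = row
```

Here `memory` is just a dict from group index to a list of 4 words, standing in for the on-chip memory.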
The registers would be a stack, so there would be a separate stack pointer. R0 would be the status register & R1 the program counter. Additional general purpose registers could be created at any time by pushing the stack; registers at the end of the stack would be written into memory (very fast with the 256-bit bus) & R0 & R1 relocated. For a subroutine call the stack pointer would be decremented (creating 1 new register) & R0 relocated, saving the PC in R2. For an interrupt, the SP would be decremented by 2, saving the old status register & PC in R2 & R3. More advanced versions of the design could keep track of which registers have been written into and only save those, or write changed registers into memory when the bus was not otherwise needed, to speed up register creation.
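Here is a small Python model of the register stack as I picture it. It is a sketch under assumptions: register Rn is taken to live at physical slot SP + n, so that decrementing SP automatically turns the old PC into R2 on a call, and the old status register & PC into R2 & R3 on an interrupt. Spilling to memory is left out (the model just raises an error when the file is full), new registers are cleared to 0, and the class and method names are made up.

```python
class RegisterStack:
    def __init__(self, size=65536, status=0, pc=0):
        self.phys = [0] * size
        self.sp = size - 2            # start with only R0 and R1
        self.phys[self.sp] = status   # R0 = status register
        self.phys[self.sp + 1] = pc   # R1 = program counter

    def read(self, n):
        return self.phys[self.sp + n]

    def write(self, n, value):
        self.phys[self.sp + n] = value

    def _grow(self, count):
        if self.sp - count < 0:
            raise OverflowError("hardware would spill old registers to memory")
        self.sp -= count

    def push(self, count=1):
        """Create count new registers at R2 onward; relocate R0 and R1."""
        status, pc = self.read(0), self.read(1)
        self._grow(count)
        self.write(0, status)
        self.write(1, pc)
        for n in range(2, 2 + count):
            self.write(n, 0)

    def call(self, target):
        """SP - 1: R0 is relocated, the old PC slot becomes R2."""
        status = self.read(0)
        self._grow(1)
        self.write(0, status)
        self.write(1, target)

    def interrupt(self, handler):
        """SP - 2: old status and PC slots become R2 and R3."""
        status = self.read(0)
        self._grow(2)
        self.write(0, status)  # assumption: status is carried over unchanged
        self.write(1, handler)
```

Note that a call only has to write two registers; the return address ends up in R2 without being copied.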
Memory would be accessed by special LOAD & STORE instructions like in most RISC designs.
The typical instruction format would have eight 8-bit fields, from right to left: destination1, source1, modifier, operation1, destination2, source2, addressing modes, & operation2. The addressing modes field would be further divided into 2 bits for each destination & source. The modes for the sources would be number (immediate data, allowing 8-bit constants to be used in a standard instruction), register, register indirect, & register indirect postincrement; for the destinations, register indirect predecrement, register, register indirect, & register indirect postincrement. Each operation field would be subdivided into a 6-bit opcode & a 2-bit modifier indicating what the modifier field does. For operation1 the modifier could be second source (register mode only), extend source (giving a 16-bit address), extend destination, or extend source & destination. For operation2, no modification, same modifier, same source, or something else (not fully defined yet) would be specified.
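A decoder for this format might look roughly like the Python sketch below. The field order (right to left, so destination1 is bits 0-7) comes from the description above; everything else is an assumption on my part, since it isn't pinned down yet: the opcode sitting in the upper 6 bits of an operation field, the order of the four 2-bit mode groups, and the numbering of the modes in the order listed. All names are hypothetical.

```python
FIELD_NAMES = (
    "destination1", "source1", "modifier", "operation1",
    "destination2", "source2", "addressing_modes", "operation2",
)

SOURCE_MODES = ("immediate", "register", "register indirect",
                "register indirect postincrement")
DEST_MODES = ("register indirect predecrement", "register",
              "register indirect", "register indirect postincrement")

def decode(instruction):
    """Split a 64-bit instruction into its eight 8-bit fields."""
    return {name: (instruction >> (8 * i)) & 0xFF
            for i, name in enumerate(FIELD_NAMES)}

def split_operation(operation):
    """Assumed layout: 6-bit opcode in the high bits, 2-bit modifier in the low bits."""
    return operation >> 2, operation & 3

def split_modes(modes):
    """Assumed layout, low bits first: destination1, source1, destination2, source2."""
    return {
        "destination1": DEST_MODES[modes & 3],
        "source1": SOURCE_MODES[(modes >> 2) & 3],
        "destination2": DEST_MODES[(modes >> 4) & 3],
        "source2": SOURCE_MODES[(modes >> 6) & 3],
    }
```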
Standard format instructions would include AND, OR, XOR, COMPARE, ADD, ADD with carry, SUBTRACT, SUBTRACT with borrow, MULTIPLY, DIVIDE, FLOATING ADD, FLOATING SUBTRACT, FLOATING MULTIPLY, FLOATING DIVIDE, MULTIPLY FLOAT BY INTEGER, & DIVIDE FLOAT BY INTEGER (floating point would not be defined for a word length of less than 16). Other instructions such as shifts & rotates would have different formats. Conditional operations (skip or branch) could be based on the result of the other part of the instruction (perhaps if operation2) or on any bit of any register (perhaps if operation1). Operations with immediate data longer than 8 bits (or 16 bits using the modifier) would just use R1 indirect autoincrement (which would have to be the same value for both operations of a single instruction).
One possible extension would be extended operations that function out to the end of the 256-bit group of 4 words (i.e., if extended operations were enabled (which could be set independently for operation1 & operation2), ADD R4 to R12 would add R4 to R12, R5 to R13, R6 to R14, & R7 to R15; ADD R6 to R14 would add R6 to R14 & R7 to R15; while ADD R7 to R15 would be the same regardless of whether extended operations were enabled). I suppose if the lowest 2 bits of the source & destination were not identical no extended operation would be done so extended & regular operations could be mixed (just thought of that one - this is definitely a work still in progress).
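The rule for which register pairs an extended operation touches is easy to write down. Here is a sketch in Python (the function name `register_pairs` is made up); it includes the "lowest 2 bits must match" idea from the end of the paragraph above.

```python
def register_pairs(source, destination, extended):
    """List the (source, destination) register pairs one operation acts on."""
    # Regular operation, or misaligned registers: just the one pair.
    if not extended or (source & 3) != (destination & 3):
        return [(source, destination)]
    # Extended: continue to the end of the 4-register (256-bit) group.
    count = 4 - (source & 3)
    return [(source + i, destination + i) for i in range(count)]
```

For example, `register_pairs(4, 12, True)` gives the four pairs R4/R12 through R7/R15, while `register_pairs(7, 15, True)` gives only R7/R15.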
This seems to be an extremely powerful design. The wide data bus, 65536 registers, & on-chip main memory would hopefully avoid the need for cache (which I find very awkward, especially if a lot of data is written to memory), & simultaneous execution of two operations provides the advantage of superscalar operation without complex dependency-checking circuitry (but given that modern chips execute more than 2 instructions at once this would only cut the complexity in half - maybe a 128-bit chip that executes 4 operations at once should be considered for the future).