The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) works
by turning three-address instructions into two-address instructions,
reducing literal fields, and/or reducing the number of addressable
registers.
Here's another idea: How about compressing by using the destination
register of the previous instruction as one of the sources?
+ Smaller code.
+ You directly know that you can forward from the previous instruction
to this one. And extending from this, you can directly encode stuff
for an ILDP architecture [kim&smith02].
- The main disadvantage I see is that the meaning of an instruction
is no longer encoded in the instruction alone. This is a problem when jumping to
such an instruction (just don't allow that, i.e., have an illegal
instruction exception if you jump to such an instruction). A bigger
problem is when an interrupt or exception returns to such an
instruction; a way to deal with that may be to allow this encoding
only for instructions that cannot cause exceptions, and to delay
interrupts until the next self-contained instruction.
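
To make the idea concrete, here is a minimal Python sketch of how a decoder
might expand such instructions. Everything in it is an assumption made up for
illustration (the Insn record, the use of None to mean "take the previous
destination", the names expand and interrupt_resume_point); it only models
straight-line code and is not a proposal for a real encoding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record; src1=None means 'previous destination'."""
    op: str
    dest: int
    src1: Optional[int]
    src2: int

    @property
    def self_contained(self) -> bool:
        # An instruction is self-contained if it names all of its sources itself.
        return self.src1 is not None


class IllegalInstruction(Exception):
    """Raised when control enters at an instruction that is not self-contained."""


def expand(code: List[Insn], entry: int = 0) -> List[Insn]:
    """Expand straight-line code from `entry` into full three-address form."""
    if not code[entry].self_contained:
        # Jumping to a dependent instruction is simply disallowed.
        raise IllegalInstruction(f"entry {entry} depends on a previous instruction")
    out: List[Insn] = []
    prev_dest: Optional[int] = None
    for insn in code[entry:]:
        src1 = insn.src1 if insn.self_contained else prev_dest
        out.append(Insn(insn.op, insn.dest, src1, insn.src2))
        prev_dest = insn.dest
    return out


def interrupt_resume_point(code: List[Insn], pc: int) -> int:
    """Delay an interrupt arriving at `pc` until the next self-contained instruction."""
    while pc < len(code) and not code[pc].self_contained:
        pc += 1
    return pc


# Example: r5 = r1 + r2; r6 = <prev> - r3; r7 = <prev> ^ r4; r8 = r1 | r2
program = [
    Insn("add", 5, 1, 2),
    Insn("sub", 6, None, 3),
    Insn("xor", 7, None, 4),
    Insn("or", 8, 1, 2),
]
expanded = expand(program)
```

The sketch also shows the forwarding point: every instruction with src1=None
is known at decode time to consume the result of its immediate predecessor.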
@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems of wide
superscalars with communication across the chip and
the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators
(with instructions only accessing one accumulator
(read and/or write) and one register (read or
write)); for the accumulators their death is
specified explicitly in the instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists of
mainly one ALU data path (but also a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}
@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <
c17fcd89-f024-40e7...@googlegroups.com>