IME, in terms of code density, x86-64 seems to be pretty weak. Though
it varies some; for example, "gcc -Os" on Linux does somewhat better
than MSVC.
In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
A64 seems to do a bit worse here than Thumb2.
As a small example, I had recently hacked together a small voxel based
3D engine along vaguely similar lines to Minecraft Classic for the BJX2
ISA (*1).
A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
build is 175K. Not strictly apples-to-apples, but still...
*1: Its renderer basically sweeps across the screen doing ray-casts,
building up a list of any blocks hit by a ray-cast, and then drawing the
list of blocks (via software-rasterized OpenGL). It has minimal overdraw
(because a raycast will not pass through a wall), but only really works
effectively at small draw distances.
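A minimal sketch of the per-ray inner loop described above, written as a 2D DDA walk over a small voxel grid (the grid, names, and the 2D simplification are all made up for illustration; the actual engine works in 3D per screen position). Each ray records every cell it enters until it hits a solid block, which is why there is minimal overdraw:

```c
#include <math.h>

#define GRID_W 8
#define GRID_H 8

/* 1 = solid block, 0 = empty; wall segment on row 0 at x=5 */
static const int grid[GRID_H][GRID_W] = {
    {0,0,0,0,0,1,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {1,1,1,1,1,1,1,1},
};

/* Walks the grid from (ox,oy) along direction (dx,dy); fills
 * hits_x/hits_y with the cells visited (the "list of blocks hit")
 * and returns the count. Stops at the first solid cell. */
int raycast(double ox, double oy, double dx, double dy,
            int hits_x[], int hits_y[], int max_hits)
{
    int cx = (int)ox, cy = (int)oy;
    int step_x = (dx < 0) ? -1 : 1;
    int step_y = (dy < 0) ? -1 : 1;
    double tdx = (dx != 0) ? fabs(1.0 / dx) : 1e30;  /* t-advance per cell in x */
    double tdy = (dy != 0) ? fabs(1.0 / dy) : 1e30;
    double tx = (dx < 0) ? (ox - cx) * tdx : (cx + 1 - ox) * tdx;
    double ty = (dy < 0) ? (oy - cy) * tdy : (cy + 1 - oy) * tdy;
    int n = 0;

    while (n < max_hits && cx >= 0 && cx < GRID_W && cy >= 0 && cy < GRID_H) {
        hits_x[n] = cx; hits_y[n] = cy; n++;
        if (grid[cy][cx])   /* solid: the ray stops, nothing behind is drawn */
            break;
        if (tx < ty) { tx += tdx; cx += step_x; }
        else         { ty += tdy; cy += step_y; }
    }
    return n;
}
```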
Its performance still manages to be somehow less awful than I originally
imagined (framerates are a little better than Quake, albeit at a 24
block draw-distance). Ended up using a color-fill sky, mostly as this
improves framerate somewhat compared with drawing a skybox.
This is still with me discovering and occasionally fixing "crazy bad"
compiler bugs, eg:
Turns out the compiler was very-frequently trying to cast-convert
operands of binary operators to the destination type even when they were
the same (resulting in a lot of extra register MOVs, spills, ...).
Eg, if you did something like:
int a, b, c;
c=a+b;
It often tended to compile it as if it were:
c=(int)a+(int)b;
Which at the ASM level would, instead of, say:
ADDS.L R8, R9, R14
Result in something like:
MOV R8, R25
MOV R9, R28
ADDS.L R25, R28, R14
And would also result in higher register pressure and a larger number of
spills.
Fixing this bug gave a roughly 4% reduction in the size of binaries, and
a roughly 20% increase in performance for Doom and similar. This also
caused Dhrystone score to increase from ~51.3k to ~57.1k.
It seemed this was also related to a lot of cases where, say:
c=a+imm;
Was resulting in things like:
MOV Imm, R9
MOV R12, R7
ADDS.L R7, R9, R13
Rather than, say:
ADDS.L R12, Imm, R13
...
Then, relatedly, I noted that, eg:
y=x&255;
Was being compiled sorta like:
MOV R11, R7
EXTU.B R7, R7
MOV R7, R28
MOV R28, R14
Vs, say:
EXTU.B R11, R14
Turns out this was stumbling on some logic for a "stale" code-path,
where early on, the compiler would handle operators more like:
Allocate scratch registers;
Load frame variables into scratch registers;
Apply operator to scratch registers;
Store result back to call frame;
Free said scratch registers.
But, this was later replaced with:
Fetch the variables as registers;
Operate on these registers;
Release the registers.
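As a toy sketch of the difference between the two strategies (made-up helper names, not the actual compiler's code; "emitting" here just counts instructions, with the intended assembly text as documentation), consider what each path produces for a single `c=a+b;`:

```c
/* Emission is modeled as a bare counter; the strings show the
 * hypothetical BJX2-style instructions each step would produce. */
typedef struct { int n_insns; } Emit;

static void emit(Emit *e, const char *txt) { (void)txt; e->n_insns++; }

/* Old ("stale") path: load frame variables into scratch registers,
 * apply the operator, store the result back to the call frame. */
void add_old_path(Emit *e) {
    emit(e, "MOV.L (SP,a), R16");    /* load a into a scratch reg */
    emit(e, "MOV.L (SP,b), R17");    /* load b into a scratch reg */
    emit(e, "ADDS.L R16, R17, R16"); /* operate on scratch regs */
    emit(e, "MOV.L R16, (SP,c)");    /* store result back to frame */
}

/* Newer path: a, b, and c already live in registers, so the whole
 * statement is a single instruction. */
void add_new_path(Emit *e) {
    emit(e, "ADDS.L R8, R9, R14");
}
```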
After I had switched over, trying to load/store a frame variable in
this way would typically result in register MOVs rather than an actual
memory load/store. These older paths have not been entirely eliminated,
though.
But, yeah, fixing these appears to have slightly reduced the level of
"general awfulness" in my C compiler output.
Then added another slight compiler tweak which got it up to ~57.9k
(namely, caching and reusing struct-field loads in certain cases).
Was able to push it up to ~59.0k by assuming less-conservative
semantics ("strict aliasing"), but I decided against enabling this by
default as it seems unsafe. This mostly affects the conditions under
which the cached struct field is discarded.
At present, it has certain restrictions:
* Does not cross a basic-block boundary;
* Discarded if either of the cached variables is modified;
* Discarded if any sort of explicit memory store happens;
* ...
But, what it will do, is compile an expression like:
y=foo->x*foo->x;
As if it were:
t0=foo->x;
y=t0*t0;
Though, this optimization does not appear to have any real effect on
Doom and similar.
It is possible a similar trick could be used for array loads or pointer
derefs.
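Written out at the C source level, the analogous transformation for an array load would look something like this (illustrative only; function names made up, and as noted the compiler does not currently do this):

```c
/* Without the optimization: 'arr[i]' is loaded twice. */
int square_elem_naive(const int *arr, int i) {
    return arr[i] * arr[i];
}

/* With the same caching trick applied: one load, reused from a
 * register, exactly like the struct-field case above. */
int square_elem_cached(const int *arr, int i) {
    int t0 = arr[i];    /* single load, cached */
    return t0 * t0;
}
```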
I also went and recently optionally re-added the "FMOV.S" instruction
(Memory Load/Store combined with a Single<->Double conversion), since
this should be able to help some for code which works with
single-precision floating point values (avoids some common penalty cases).
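For context, the kind of code this helps is C that keeps single-precision values in memory but does arithmetic at double precision, so every element access is a load plus a widening conversion (example function and data are made up):

```c
/* Each loop iteration loads a float from memory and widens it to
 * double for the accumulate. Without a combined load+convert
 * instruction like FMOV.S, the conversion costs a separate
 * instruction per element. */
double sum_floats_as_double(const float *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];   /* load single, convert to double, add */
    return s;
}
```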
...
>>> A variable width instruction set can support chaining today by just adding
>>> the instructions, I am perplexed as to why no one has.
>> <
>> Why don't you give it a go and see what comes out ?
>
> Chaining opcodes is complex homework that every company should have taken a
> look at, including you, results should be somewhere on the internet.
>
> Getting the RISC guys to go variable width was worse than pulling teeth.
> Threats of firings and resignations were involved at ARM, and the MIPS
> founder did fire people for such suggestions, though most were weeded out
> at hiring interviews leading to brain dead group think that killed the
> company when the market changed.
>
> Adding a chaining register dependency is maybe 10 times worse in these
> peoples minds.
>
It is a balance, somewhat.
16/32, by looking at a few bits, is OK.
Decoding a bundle based on also looking at a few bits and daisy-chaining
is also OK.
Fully variable length encodings which depend on looking at lots of
different bits are less OK.
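As a concrete illustration of the "look at a few bits" case, here is a sketch using the RISC-V "C" extension rule as the example (low two bits of the first halfword equal to 11 means 32-bit, anything else means 16-bit); BJX2's actual encoding differs, this just shows the shape of the check:

```c
#include <stdint.h>

/* Length determination from a few bits: the decoder only needs the
 * low two bits of the first halfword, so it can scan halfword by
 * halfword without examining the rest of the instruction. This is
 * the RISC-V RVC convention, used here purely as an example rule. */
int insn_length_bytes(uint16_t first_halfword) {
    return ((first_halfword & 0x3) == 0x3) ? 4 : 2;
}
```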
>>> A new architecture needs a hook to get noticed and dominating instruction
>>> density is one way to get that notice.
>> <
>> My guess is that lower context switch overhead would garner more wins
>> than instruction density; for example, an ADA call to an entry accept point
>> in a different address space costing only 12 cycles.
>
> Sounds good.
>
>> <
>>>> RISC only made sense for a decade back in the ancient history of the
>>>> 1980’s.
>> <
>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
>> to all be on 1 chip. By shedding the area of the microcode, one got the space
>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
>> got 1 instructions every clock (less cache miss latency).
>
> Agree.
>
FWIW: In theory, my BJX2 core can do ~ 2 or 3 instructions per cycle.
Though, actual "real-world" results tend to be closer to 0.3 to 0.5 ...
A lot of this is due to cache misses, interlock penalties, and my
compiler mostly failing to bundle instructions.
Can generally get better results with ASM though.
>>>> Today if I wanted to build a better 16 or 32 bit processor the first step
>>>> would be to find what micro coded instructions I could add to reduce
>>>> instruction density, and thus win the lowest cost war.
>> <
>> In My case (My 66000) the biggest code density benefit was in creating
>> ENTER and EXIT instructions, second best was giving every instruction
>> access to any width immediate.
>> <
>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>
> The 16 bit market has the worst choices to pick from, the problem is that
> these devices are cheap, so everyone ignores this market. Definition of
> opportunity.
>
> In the 32 bit market you have to compete with RISC-V which is free. Sure it
> is crap compared to the new architectures discussed here, but it’s free.
>
> Pick your poison. ;)
>
For 16-bit, one mostly wants "as cheap as possible".
Though, at least for off-the-shelf microcontrollers, it is hard to
really beat out something like "just use an MSP430 or similar".
One can design a "better" 16-bit ISA, and then, say, run it on an ICE40
or similar, but then unless one has a strong use-case to justify needing
FPGA logic, the ICE40 costs more and is more complicated to use than an
MSP430.
Then there is a certain amount of "use a Cortex-M but treat it like it
is a 16-bit ISA".
Or, if one does custom silicon, how do they get enough "volume" and
"momentum" to make it cost-effective vs existing options? ...
Then again, I am doing my existing project more because I found it
interesting than because it necessarily makes sense.
For many things, a Cortex-M would be both faster and cheaper.
Though, it isn't 1:1, because while a (higher end) Cortex-M dev-board
can do a pretty decent job running Doom or similar, if trying to do
something like an OpenGL style software rasterizer or similar on it, it
falls on its face.
Seemingly, the Thumb ISA does rather poorly on workloads that end up
consisting almost entirely of memory loads and stores.