Re: [sw-dev] [ANN] RISC-V J Extension Working Group


Rangeen Basu Roy Chowdhury

Feb 6, 2018, 6:58:09 PM
to David Chisnall, RISC-V HW Dev, RISC-V SW Dev, isa...@groups.riscv.org
I would like to participate in the working group.

On Tue, Nov 28, 2017 at 1:23 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
Hello RISC-V Developers,

We are pleased to announce the formation of the J Extension Working Group (charter attached), chaired by David Chisnall, with Martin Maas as vice chair. This group will be responsible for proposing RISC-V extensions for managed, interpreted and JIT-ed languages whose requirements extend beyond those of ahead-of-time compiled Algol-family languages (such as C/C++).

The J working group solicits contributions from hardware implementors, language designers, as well as experts on language-runtime systems, interpreters, compilers, memory management and garbage collection. Contributions in related areas such as dynamic binary translation, memory consistency models and transactional memory are strongly encouraged as well.  If you wish to participate, please reply to this email.

David Chisnall and Martin Maas



RISC-V J Extension Working Group Charter
----------------------------------------

The RISC-V J extension aims to make RISC-V an attractive target for languages that are traditionally interpreted or JIT compiled, or which require large runtime libraries or language-level virtual machines. Examples include (but are not limited to) C#, Go, Haskell, Java, JavaScript, OCaml, PHP, Python, R, Ruby, Scala or WebAssembly.

Typical features of these languages include garbage collection, dynamic typing and dynamic dispatch, transparent boxing of primitive values, and reflection. This provides a very wide scope for possible approaches and, as such, the working group will follow a two-pronged strategy investigating both immediate gains and longer-term more experimental ideas concurrently. Existing attempts to implement JIT-compiled languages on RISC-V have highlighted some places where better instruction density is possible, and these should fall into an early version of the specification.

Instructions intended to accelerate common JIT’d instruction sequences may be optional within the J extension, with the expectation that software will test for their presence before determining which code sequence to generate. This also provides scope for additions that are only appropriate for a subset of microarchitectures. For example, there is increasing interest in running JavaScript on IoT devices, but acceleration for simple low-power in-order pipelines with constrained memory may be wholly inappropriate for large application cores.

Among other topics, the group expects to work within the following areas and collaborate with several existing RISC-V extension working groups:

- Dynamic languages often require efficient overflow-checked addition for promotion between integer representations. The M standard describes overflow-checking multiplication, and the J and M extension work groups will work together to unify these.

- A significant amount of research has explored hardware support for garbage collection (GC), including hardware read/write barriers and using transactional memory for GC. The J extension group will consider these options and work with a potential future T extension working group to use transactional memory support for this purpose. It is important that the J extension does not propose specialised garbage-collection acceleration support when similar performance can be achieved in software simply by using the T extension.

- The memory model working group is refining the core specification’s atomicity and ordering guarantees. Environments containing JIT compilers have stronger requirements with regard to ordering of data writes to instruction reads than traditional ahead-of-time environments (particularly on multicore systems). The J extension may propose a stronger memory model in this regard, but must not propose anything that contradicts the memory model working group’s design.

- User-level interrupts have significant potential for side-exits from JIT-compiled code for deoptimisation, certain garbage collection algorithms, and potentially other VM features. User-level interrupts are currently defined in the privileged specification and may be supported by a future N extension. The J working group must coordinate designs with a potential future N working group to ensure that such mechanisms are reusable.
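
As an illustration of the first bullet, overflow-checked addition for promotion between integer representations might be modelled as below. This is a Python sketch and not part of the charter: the 63-bit small-integer width (one bit of a 64-bit word assumed to hold a tag) and the names `checked_add` and `add_with_promotion` are assumptions made for the example.

```python
# Model of a runtime that keeps "small" integers in a 63-bit signed field
# and promotes to an arbitrary-precision value when an addition overflows.
SMALL_BITS = 63
SMALL_MIN = -(1 << (SMALL_BITS - 1))
SMALL_MAX = (1 << (SMALL_BITS - 1)) - 1


def checked_add(a, b):
    """Return (wrapped_result, overflowed) for a 63-bit signed addition.

    The wrapped result is what a plain add would leave behind; `overflowed`
    is the condition an overflow-checked add would report.
    """
    exact = a + b
    overflowed = not (SMALL_MIN <= exact <= SMALL_MAX)
    # Wrap to 63-bit two's complement to mimic the unchecked result.
    wrapped = (exact - SMALL_MIN) % (1 << SMALL_BITS) + SMALL_MIN
    return wrapped, overflowed


def add_with_promotion(a, b):
    """Fast path stays on small integers; slow path promotes."""
    result, overflowed = checked_add(a, b)
    if not overflowed:
        return ("small", result)
    # Python ints are arbitrary precision, so the exact sum stands in for
    # a heap-allocated big integer here.
    return ("big", a + b)
```

The point of hardware support would be to make the `overflowed` test free on the fast path, so that only the rare promotion pays for the check.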

The J working group solicits contributions from hardware implementors, language designers, as well as experts on language-runtime systems, interpreters, compilers, memory management and garbage collection. Contributions in related areas such as dynamic binary translation, memory consistency models and transactional memory are strongly encouraged as well.




--
Thanks!
Rangeen Basu

Swami

Feb 6, 2018, 7:54:42 PM
to Rangeen Basu Roy Chowdhury, David Chisnall, RISC-V HW Dev, RISC-V SW Dev, isa...@groups.riscv.org
Welcome!!


Jecel Assumpção Jr

Mar 21, 2018, 3:29:54 PM
to RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org, David.C...@cl.cam.ac.uk
On Tuesday, November 28, 2017 at 7:23:59 AM UTC-2, David Chisnall wrote:
> We are pleased to announce the formation of the J Extension Working Group

Though I am late to the party, I would like to join this group.

I have been working on a processor called SiliconSqueak which is optimized
for the OpenSmalltalk VM (http://opensmalltalk.org/) which powers Squeak,
Pharo, Cuis and Newspeak, though my design should be interesting for other
bytecode-based languages. I am currently redesigning SiliconSqueak to be a
RISC-V extension instead of a completely custom design. Avoiding any needless
incompatibility with the J extension would be a very good thing and I will be
more than happy to share any results that I have.

Though my focus is on helping adaptive compilation, I also made a serious
effort to speed up bytecode interpretation. The idea is that while a high
performance system might interpret a method once (or not at all) before
compiling, it would be nice to have a "knob" to change that if needed to
reduce memory use.

Some of what I do contradicts the famous ECOOP 95 paper:
"Do Object-Oriented Languages Need Special Hardware Support?"
by Urs Hölzle and David Ungar
ECOOP '95 Proceedings, pages 253-282
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.4796&rep=rep1&type=pdf

Their conclusion was that the hardware extensions in RISC-III (SOAR:
Smalltalk On A RISC) seemed to help due to the poor compiler. With
better compilation technology only larger instruction caches made any
difference in performance. My own conclusion was that if you are going
to use traps for infrequent, but not really rare, events like tag mismatches
or register window overflow/underflow then don't make these traps take
thousands of clock cycles like in Sparc 8 and Solaris.

One of their conclusions was that special object oriented caches like
in Mushroom and which Mario Wolczko tried to get adopted by Sparc
without success wouldn't help. Sun was on a patent binge so we can't
use stuff published after 2001, but a lot of good ideas are older than that:
https://labs.oracle.com/pls/apex/f?p=labs:bio:0:134

-- Jecel

Tommy Thorn

Mar 21, 2018, 5:15:11 PM
to Jecel Assumpção Jr, RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org, David.C...@cl.cam.ac.uk
Hi Jecel,

> Though I am late to the party, I would like to join this group.

I think you have to be a member of the RISC-V Foundation to join formally,
but otherwise you're most welcome.

> Though my focus is on helping adaptive compilation, I also made a serious
> effort to speed up bytecode interpretation. The idea is that while a high
> performance system might interpret a method once (or not at all) before
> compiling, it would be nice to have a "knob" to change that if needed to
> reduce memory use.

I'm not sure how widely applicable *byte* dispatch is, but efficiently monitoring
interpretation and triggering compilation is important for a JIT.

> ...difference in performance. My own conclusion was that if you are going
> to use traps for infrequent, but not really rare, events like tag mismatches
> or register window overflow/underflow then don't make these traps take
> thousands of clock cycles like in Sparc 8 and Solaris.

SPARC's branch delay slots and register windows are IMO the antithesis of
RISC-V.  Traps are never going to be cheap in a superscalar implementation,
but a good implementation can make them no more expensive than a
mispredicted branch.


> One of their conclusions was that special object oriented caches like
> in Mushroom and which Mario Wolczko tried to get adopted by Sparc
> without success wouldn't help. Sun was on a patent binge so we can't
> use stuff published after 2001, but a lot of good ideas are older than that:
> https://labs.oracle.com/pls/apex/f?p=labs:bio:0:134

I think the real challenge will be finding options that benefit
more than one paradigm (eg. JIT, Smalltalk, Java, JavaScript,
Haskell, Prolog, Lisp, Erlang, real-time GC, etc).

Tommy


Jecel Assumpcao Jr.

Mar 21, 2018, 7:46:20 PM
to Tommy Thorn, RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
Tommy Thorn wrote on Wed, 21 Mar 2018 14:15:07 -0700
> I think you have to be a member of the RISC-V Foundation to join
> formally, but otherwise you're most welcome.

Thanks, I am looking at the membership documents.

> > [...] serious effort to speed up bytecode interpretation
>
> I'm not sure how widely applicable *byte* dispatch is, but efficiently
> monitoring interpretation and triggering compilation is important for a JIT.

I meant like the ARM original Jazelle extension optimized for
interpretation (later renamed to DBX - Direct Bytecode eXecution) and
their later version optimized for JIT (RCT - Runtime Compilation
Target).

https://en.wikipedia.org/wiki/Jazelle

Any technology that can help trigger a recompilation in an adaptive
system could probably be adapted to help switch from interpretation to
the first compilation. I haven't worked very much on making execution
counters more efficient, but it is an interesting project.
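
A minimal sketch of counter-driven tier-up, to make the idea concrete. It is a Python model under stated assumptions: the class name `TieredMethod`, the `threshold` parameter and the stand-in `compile_fn` are hypothetical and do not describe SiliconSqueak or any existing VM.

```python
# Toy method wrapper that counts invocations and switches from the
# interpreter to a "compiled" version once a threshold is reached.
# Raising the threshold keeps more code interpreted, trading speed for
# memory, which is the kind of knob discussed earlier in the thread.
class TieredMethod:
    def __init__(self, name, interpret, compile_fn, threshold=2):
        self.name = name
        self.interpret = interpret      # slow path: interpret the method
        self.compile_fn = compile_fn    # returns a faster callable
        self.threshold = threshold      # invocations before compiling
        self.count = 0                  # execution counter
        self.compiled = None

    def call(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)
        self.count += 1
        if self.count >= self.threshold:
            # Counter tripped: compile now and use the result from here on.
            self.compiled = self.compile_fn()
            return self.compiled(*args)
        return self.interpret(*args)
```

The increment and compare in `call` are the per-invocation cost that a hardware counter could hide.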

> > [..] costly traps in Sparc 8 + Solaris
>
> SPARC's branch delay slots and register windows are IMO the
> antithesis of RISC-V.  Traps are never going to be cheap in a
> superscalar implementation, but a good implementation can
> make them no more expensive than a mispredicted branch.

In my original SiliconSqueak design tag mismatches actually were just
branches instead of traps. But the way I did it wouldn't work too well
in RISC-V (unless you want to use instructions larger than 32 bits,
which I don't).

> > [...] object oriented caches
>
> I think the real challenge will be finding options that benefit
> more than one paradigm (eg. JIT, Smalltalk, Java, JavaScript,
> Haskell, Prolog, Lisp, Erlang, real-time GC, etc).

Exactly. It isn't something I have worried about in my project, though I
have kept the requirements for all of these in the back of my mind.
Except for Haskell - I don't know what hardware support it needs.

-- Jecel

Tommy Thorn

Mar 21, 2018, 9:43:41 PM
to Jecel Assumpcao Jr., RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
>> I'm not sure how widely applicable *byte* dispatch is, but efficiently
>> monitoring interpretation and triggering compilation is important for a JIT.
>
> I meant like the ARM original Jazelle extension optimized for
> interpretation (later renamed to DBX - Direct Bytecode eXecution) and
> their later version optimized for JIT (RCT - Runtime Compilation
> Target).
>
> https://en.wikipedia.org/wiki/Jazelle

I'd let others speak to that.  I know nothing about it, but had the impression
that it failed [in the market place / got no adoption].


>> I think the real challenge will be finding options that benefit
>> more than one paradigm (eg. JIT, Smalltalk, Java, JavaScript,
>> Haskell, Prolog, Lisp, Erlang, real-time GC, etc).
>
> Exactly. It isn't something I have worried about in my project, though I
> have kept the requirements for all of these in the back of my mind.
> Except for Haskell - I don't know what hardware support it needs.

To a first approximation, lazy FP is very much like strict FP
with the addition of (once) updatable closures.  Of course, an
updated closure usually causes an indirection that we rely on the
GC to eliminate, eventually.  (Efficient support of partial application,
aka "currying", is also an issue.)
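
For readers who have not met these terms, a once-updatable closure and partial application can be modelled roughly as follows. This is an illustrative Python sketch, not how GHC or any particular Haskell implementation lays things out; `Thunk` and `partial_apply` are names invented for the example.

```python
# A once-updatable closure ("thunk"): the first force() runs the suspended
# computation, then the thunk is updated in place with the value, so later
# forces only follow an indirection to the result.
class Thunk:
    def __init__(self, compute):
        self._compute = compute   # suspended computation (a closure)
        self._value = None
        self._evaluated = False

    def force(self):
        if not self._evaluated:
            self._value = self._compute()
            self._evaluated = True
            self._compute = None  # drop the closure so it can be collected
        return self._value


def partial_apply(f, *fixed):
    """Partial application: capture some arguments now, take the rest later."""
    return lambda *rest: f(*fixed, *rest)
```

Every `force()` pays for the "already evaluated?" test and the extra hop to the value, which is the indirection that the garbage collector is relied on to remove eventually.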

The longer answer is that there are a lot of different ways to do
this efficiently, but layering it on top of a mechanism for something
else never works out well (the converse isn't true, as strict is a subset
of lazy).  GHC is very good and the de facto standard, but
there is an interesting, and IMHO more diverse, design space than
in most paradigms (with the exception of perhaps Prolog).

To get back to the Java/JavaScript world that seems to be the primary
motivation, even statically typed languages deploy tag bits in the pointers
(for GC and others).  Old 32-bit SPARC had limited support for tags in the lower
two bits and ARM, I'm told, has support for ignoring the top 8 bits of pointers.
Though masking out bits is just a single instruction, it's a dependent operation
and it might be worthwhile having support for that.  I might even suggest
that

  and rY, rX, rM
  ld rZ, d(rY)

be treated as a "load-under-mask" macrofusion-pair (Suggestion #1)

There, the first concrete suggestion in the J group :)
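
What the two-instruction sequence computes can be modelled in a few lines. This is a Python sketch under assumptions of my own: a two-bit tag in the low bits of a 64-bit pointer, a dictionary standing in for memory, and the hypothetical names `tag_pointer` and `load_under_mask`.

```python
# Model of a tagged pointer: the low TAG_BITS bits of an aligned address
# carry a tag, and the tag must be masked off before the address is used.
TAG_BITS = 2                                  # assumption: 4-byte alignment
TAG_MASK = (1 << TAG_BITS) - 1
ADDR_MASK = ~TAG_MASK & 0xFFFFFFFFFFFFFFFF    # plays the role of rM


def tag_pointer(addr, tag):
    assert addr & TAG_MASK == 0, "address must be aligned"
    assert 0 <= tag <= TAG_MASK
    return addr | tag


def load_under_mask(memory, tagged_ptr, offset=0):
    """Equivalent of the pair: mask the pointer, then load through it."""
    addr = tagged_ptr & ADDR_MASK     # the masking `and`
    return memory[addr + offset]      # the dependent `ld`
```

The load cannot start until the mask has been applied; that dependency is what fusing the pair would remove.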

Related, I was a bit regretful that the conditional branch didn't have a quick
way to test bits (other than the sign bit), so I'd similarly propose fusing a masked
branch pair

  and rY, rX, rM
  bne rY, r0, target

and likewise for beq (Suggestion #2).  (Alternatively, there's space for a
new branch instruction).
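
To show what a decoder would have to recognise for the two suggestions, here is a toy pattern matcher over an instruction list. It is a Python sketch only: the tuple layout and the function name `find_fusion_pairs` are invented for the example, and a real front end would also have to consider whether the temporary register is still live after the pair.

```python
# Toy decoder pass that marks adjacent instruction pairs as candidates for
# macro-op fusion. Instructions are tuples:
#   ("and", rd, rs1, rs2)
#   ("ld", rd, base, offset)
#   ("bne" or "beq", rs1, rs2, target)
def find_fusion_pairs(instrs):
    """Return (index, kind) for each `and` whose result feeds the next op."""
    pairs = []
    for i in range(len(instrs) - 1):
        first, second = instrs[i], instrs[i + 1]
        if first[0] != "and":
            continue
        rd = first[1]                      # destination of the and
        if second[0] == "ld" and second[2] == rd:
            # The load uses the masked value as its base address.
            pairs.append((i, "load-under-mask"))
        elif (second[0] in ("bne", "beq")
              and second[1] == rd and second[2] == "r0"):
            # The branch compares the masked value with zero.
            pairs.append((i, "masked-branch"))
    return pairs
```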

Tommy





Samuel Falvo II

Mar 22, 2018, 12:28:43 AM
to Tommy Thorn, Jecel Assumpção Jr, RISC-V SW Dev, RISC-V HW Dev, RISC-V ISA Dev, David Chisnall
On Wed, Mar 21, 2018 at 2:15 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
> I think the real challenge will be finding options that benefit
> more than one paradigm (eg. JIT, Smalltalk, Java, JavaScript,
> Haskell, Prolog, Lisp, Erlang, real-time GC, etc).

If anyone is at all interested in Forth, I can help here for sure.
The original Kestrel-2 CPU was a stack architecture CPU, and I have
extensive experience programming on Forth hardware.

--
Samuel A. Falvo II

David Chisnall

Mar 22, 2018, 6:00:51 AM
to Jecel Assumpção Jr, RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
On 21 Mar 2018, at 19:29, Jecel Assumpção Jr <je...@merlintec.com> wrote:
>
> On Tuesday, November 28, 2017 at 7:23:59 AM UTC-2, David Chisnall wrote:
>> We are pleased to announce the formation of the J Extension Working Group
>
> Though I am late to the party, I would like to join this group.

As others have mentioned, you will need to join the Foundation and can then request membership via Kavi.

> I have been working on a processor called SiliconSqueak which is optimized
> for the OpenSmalltalk VM (http://opensmalltalk.org/) which powers Squeak,
> Pharo, Cuis and Newspeak, though my design should be interesting for other
> bytecode-based languages. I am currently redesigning SiliconSqueak to be a
> RISC-V extension instead of a completely custom design. Avoiding any needless
> incompatibility with the J extension would be a very good thing and I will be
> more than happy to share any results that I have.

That sounds very interesting, thank you. We are currently largely blocked by not having mature software implementations to prototype, although we can do some analysis based on overheads on other platforms. We won’t propose any extensions until we can validate that they actually do provide an improvement.

> Though my focus is on helping adaptive compilation, I also made a serious
> effort to speed up bytecode interpretation. The idea is that while a high
> performance system might interpret a method once (or not at all) before
> compiling, it would be nice to have a "knob" to change that if needed to
> reduce memory use.

High-performance interpreters are definitely in scope for J. These are still very important on low-memory systems (for example, Samsung’s JerryScript is intended to interpret JavaScript on systems with a few tens of KBs of RAM).

> Some of what I do contradicts the famous ECOOP 95 paper:
> "Do Object-Oriented Languages Need Special Hardware Support?"
> by Urs Hölzle and David Ungar
> ECOOP '95 Proceedings, pages 253-282
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.4796&rep=rep1&type=pdf
>
> Their conclusion was that the hardware extensions in RISC-III (SOAR:
> Smalltalk On A RISC) seemed to help due to the poor compiler. With
> better compilation technology only larger instruction caches made any
> difference in performance. My own conclusion was that if you are going
> to use traps for infrequent, but not really rare, events like tag mismatches
> or register window overflow/underflow then don't make these traps take
> thousands of clock cycles like in Sparc 8 and Solaris.

I believe that four important things have changed since this paper:

1. Multicore has become common. A lot of the techniques (such as polymorphic inline caching) that make performance a lot better on single-threaded implementations have comparatively high overhead if they require synchronisation. They are acceptable for languages such as Java, where cache invalidations are infrequent (and so can be very expensive), but less applicable to more Smalltalk-like languages.

2. Web-based deployment has made implementations a lot more sensitive to startup latency. This is also true for short-lived command-line tools (a number of Python tools, for example, spend longer starting Python than they spend actually running), but in the context of a web browser it is essential to start executing in under 100ms from first access to source code. Compilers that rely on large quantities of profiling data are great for the third or fourth tiers in such environments but are unacceptable for early startup code. This means that the interpreter or low-tier JITs are often critical for user-visible performance. A large proportion of JavaScript programs never make it to the higher-tier JITs.

3. Resource-constrained systems have started to use high-level languages. The Internet of Insecure Things increasingly needs memory-safe languages if it wants to become the Internet of Less Insecure Things. This means that there’s still a place for hardware acceleration for simple in-order pipelines with small and shallow memory hierarchies.

4. The relationships between memory throughput, latency, and size have changed a lot. This makes some form of hardware acceleration for garbage collection interesting, because you typically have enough memory bandwidth available to scan at higher than the allocation rate, but doing anything that impacts cache usage can hurt the performance of mutator threads.
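
As background for point 1, a polymorphic inline cache at a single call site can be modelled as below. This is a single-threaded Python sketch; `CallSite`, `max_entries` and the miss counter are assumptions for illustration and do not describe any specific VM.

```python
# A polymorphic inline cache (PIC) for one call site: a short list of
# (receiver class, method) pairs checked before falling back to a full
# method lookup. In a multithreaded VM, updates to `entries` would need
# synchronisation, which is the overhead referred to in point 1.
class CallSite:
    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries
        self.entries = []          # [(class, function), ...]
        self.misses = 0

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_cls, fn in self.entries:
            if cached_cls is cls:              # cache hit: cheap type test
                return fn(receiver, *args)
        # Cache miss: do the slow lookup and remember the result.
        self.misses += 1
        fn = getattr(cls, self.selector)
        if len(self.entries) < self.max_entries:
            self.entries.append((cls, fn))
        return fn(receiver, *args)
```

In a language where methods can be redefined at any time, every redefinition has to invalidate such caches across all threads, which is why the cost depends so much on how often that happens.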

> One of their conclusions was that special object oriented caches like
> in Mushroom and which Mario Wolczko tried to get adopted by Sparc
> without success wouldn't help. Sun was on a patent binge so we can't
> use stuff published after 2001, but a lot of good ideas are older than that:
> https://labs.oracle.com/pls/apex/f?p=labs:bio:0:134

Mario is a member of the group, so can steer us away from patents.

David

David Chisnall

Mar 22, 2018, 6:09:04 AM
to Tommy Thorn, Jecel Assumpcao Jr., RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
On 22 Mar 2018, at 01:43, Tommy Thorn <tommy...@esperantotech.com> wrote:
>> I meant like the ARM original Jazelle extension optimized for
>> interpretation (later renamed to DBX - Direct Bytecode eXecution) and
>> their later version optimized for JIT (RCT - Runtime Compilation
>> Target).
>>
>> https://en.wikipedia.org/wiki/Jazelle
>
> I'd let others speak to that. I know nothing about it, but had the impression
> that it failed [in the market place / got no adoption].
>

DBX was a commercial success and was used in a huge number of mobile phones (pretty much all Symbian phones used it). It achieved performance comparable with a JIT (though not with a well-optimised JIT) in an interpreter, while requiring a lot less memory. As I recall, the break-even point for the JIT was around 4MB: with more memory than that, a JIT began to outperform Jazelle.

This kind of approach is still very interesting for usage scenarios where you have constrained amounts of RAM, and it might also be interesting for first-tier interpreters in a multi-tier JIT scenario, though defining a format that wasn’t tied to a specific bytecode format would be difficult. The original Smalltalk implementation implemented the bytecode interpreter in Alto microcode.

David

Tommy Thorn

Mar 22, 2018, 2:23:53 PM
to David Chisnall, Jecel Assumpcao Jr., RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org

> This kind of approach is still very interesting for usage scenarios where you have constrained amounts of RAM, and it might also be interesting for first-tier interpreters in a multi-tier JIT scenario, though defining a format that wasn’t tied to a specific bytecode format would be difficult. The original Smalltalk implementation implemented the bytecode interpreter in Alto microcode.

I'm well aware of the many bytecode VMs (Smalltalk, Java, Camllight, P-Code, etc.), but I expect most modern implementations will, like Camllight, expand bytecodes to threaded code upon load or simply JIT it. I don't think this level of constrained memory is relevant for "J" (my personal opinion).
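
For reference, expanding bytecodes to threaded code at load time looks roughly like this. It is a Python sketch over a made-up three-opcode stack machine; the opcode numbers and the function names `expand` and `run` are assumptions and are not taken from any of the VMs named above.

```python
# Each opcode byte is replaced once, at load time, by a direct reference to
# its handler, so the run loop no longer decodes or dispatches on opcodes.
PUSH, ADD, MUL = 0, 1, 2


def op_add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def op_mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


def expand(bytecode):
    """Translate opcode bytes (PUSH takes one operand byte) into callables."""
    handlers = {ADD: op_add, MUL: op_mul}
    threaded, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        if op == PUSH:
            value = bytecode[pc + 1]
            # Bind the operand now so nothing is decoded at run time.
            threaded.append(lambda stack, v=value: stack.append(v))
            pc += 2
        else:
            threaded.append(handlers[op])
            pc += 1
    return threaded


def run(threaded):
    stack = []
    for step in threaded:
        step(stack)
    return stack.pop()
```

The expanded form is several times larger than the bytecode it came from, which is the memory cost at issue for the small systems discussed in this thread.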

Tommy

Jecel Assumpcao Jr.

Mar 22, 2018, 4:35:20 PM
to David Chisnall, RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
This discussion is being cross-posted to three lists due to me replying
to the original announcement that was (very properly) on all three.
Sorry about that. Which would be the best list for this?

David Chisnall wrote on Thu, 22 Mar 2018 10:09:01 +0000
> DBX was a commercial success and was used in a huge number of mobile
> phones (pretty much all Symbian phones used it). It achieved performance
> comparable with a JIT (though not with a well optimised JIT), in an interpreter,
> requiring a lot less memory. As I recall, the break even point for the JIT
> was around 4MB - if you have more memory than that then a JIT began to
> outperform Jazelle.

One metric I care very much about is pJ per bytecode which is important
for mobile applications. It also helps you get better performance when
you are limited by power. Being able to interpret efficiently for a
while before compiling can help.

> This kind of approach is still very interesting for usage scenarios where
> you have constrained amounts of RAM, and it might also be interesting for
> first-tier interpreters in a multi-tier JIT scenario, though defining a format
> that wasn't tied to a specific bytecode format would be difficult.

The DBX hardware was a variation of Thumb (the equivalent of the RISC-V C
extension), which was a dedicated block of hardware that could translate
a single byte or 16 bit word into a single 32 bit instruction. If you
have some kind of hardware register remapping to make it easy to push
and pop stuff on the stack, then most of the 256 possible bytecodes can
be translated to a single ARM (or RISC-V) instruction. These also
happen to be the most frequent bytecodes. The ones that need two or
more instructions to interpret can be translated to a single JUMP to the
correct code fragment in the interpreter.

While the Jazelle DBX used random logic to do its translation, you could
have a 256 by 32 bit RAM do the exact same job. This would allow
swapping to different bytecodes at runtime so the same processor could
run Java in one thread and Python in another. This RAM could even be
implemented as a helper "level 0" to the normal instruction cache so
changing a CSR would flush it and reload the new bytecode set on demand.
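
The 256-entry table idea can be sketched in software as follows. This is a Python model with placeholder contents: the instruction strings, the handler labels and the class name `TranslationTable` are assumptions, and nothing here reflects the actual Jazelle encoding.

```python
# Model of a 256-entry translation RAM: each bytecode maps either to one
# native instruction or to a jump into the software interpreter. Swapping
# the table switches the bytecode set being executed.
class TranslationTable:
    def __init__(self):
        # Default: every bytecode jumps to a generic interpreter entry.
        self.entries = [("jump", "interp_slow_path")] * 256

    def define_single(self, opcode, instruction):
        """Bytecode that translates to exactly one native instruction."""
        self.entries[opcode] = ("inst", instruction)

    def define_handler(self, opcode, label):
        """Bytecode that needs several instructions: jump to a handler."""
        self.entries[opcode] = ("jump", label)

    def translate(self, bytecodes):
        return [self.entries[b] for b in bytecodes]
```

Two threads running different languages would each point at their own `TranslationTable`; in hardware that corresponds to reloading the RAM when the selecting CSR changes.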

For previous SiliconSqueak designs I used a 1024 by 32 bit cache so each
bytecode could execute up to four instructions (one bit in each
instruction indicated if it was the last) or three instructions and a
jump to the rest of the code. But it is often the case in computer
science that we only have to worry about zero, one and many.
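
The four-slot variant with a "last" marker can be modelled like this. It is a Python sketch in which the marker is kept as a separate tuple field rather than as a bit inside a 32 bit word; `build_table` and `fetch` are hypothetical names and the instruction strings are placeholders.

```python
# Model of a 1024 x 32 bit table: four slots per bytecode, with a "last"
# flag on each instruction marking the end of that bytecode's sequence.
SLOTS_PER_BYTECODE = 4


def build_table(definitions):
    """definitions: {opcode: [instruction, ...]} with 1 to 4 instructions."""
    table = [None] * (256 * SLOTS_PER_BYTECODE)
    for opcode, instrs in definitions.items():
        if not 1 <= len(instrs) <= SLOTS_PER_BYTECODE:
            raise ValueError("each bytecode gets one to four instructions")
        for slot, instr in enumerate(instrs):
            last = slot == len(instrs) - 1
            table[opcode * SLOTS_PER_BYTECODE + slot] = (instr, last)
    return table


def fetch(table, opcode):
    """Return the instructions for one bytecode, stopping at the last flag."""
    out = []
    for slot in range(SLOTS_PER_BYTECODE):
        entry = table[opcode * SLOTS_PER_BYTECODE + slot]
        if entry is None:
            raise KeyError(f"bytecode {opcode:#x} is not defined")
        instr, last = entry
        out.append(instr)
        if last:
            break
    return out
```

A bytecode that needs more than four instructions would use its fourth slot for a jump to the remainder of its code, as described above.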

> The original Smalltalk implementation implemented the bytecode interpreter
> in Alto microcode.

Almost - it was actually in Data General Nova assembly language that was
implemented in Alto microcode. As time went on several critical
"kernels" were rewritten directly in Alto microcode.

The Dorado Smalltalk did use quite a bit more microcode, though it still
had some parts in Nova assembly due to the size limits of microcode
memory. One interesting feature of the Dorado computer was that it had
special hardware for bytecodes which was very similar to what I
described above. But looking at the published microcode for Dorado
Smalltalk, it painfully and slowly decodes the bytecodes explicitly
(it is awkward to extract bytes in a 16 bit word addressed machine). I
have no idea why it simply ignored the special hardware.

http://bitsavers.trailing-edge.com/pdf/xerox/dorado/microcode/DoradoSmalltalkMicrocode.pdf

-- Jecel

Martin Schoeberl

Mar 22, 2018, 7:36:12 PM
to Tommy Thorn, Jecel Assumpcao Jr., RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org

On 22 Mar, 2018, at 2:43, Tommy Thorn <tommy...@esperantotech.com> wrote:

>>> I'm not sure how widely applicable *byte* dispatch is, but efficiently
>>> monitoring interpretation and triggering compilation is important for a JIT.
>>
>> I meant like the ARM original Jazelle extension optimized for
>> interpretation (later renamed to DBX - Direct Bytecode eXecution) and
>> their later version optimized for JIT (RCT - Runtime Compilation
>> Target).
>>
>> https://en.wikipedia.org/wiki/Jazelle
>
> I'd let others speak to that.  I know nothing about it, but had the impression
> that it failed [in the market place / got no adoption].

It was basically killed by Apple not supporting Java on the iPhone. End of Java applets, for the good or the bad, I don’t know.

Cheers,
Martin

Jecel Assumpcao Jr.

Mar 22, 2018, 9:33:35 PM
to David Chisnall, RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
Same comment as in the other thread about cross-posting. And I hope my
editing the subject to reflect the topic drift is the proper protocol of
these lists.

David Chisnall wrote on Thu, 22 Mar 2018 10:00:48 +0000
> > [...] SiliconSqueak [...]
>
> That sounds very interesting, thank you. We are currently largely blocked
> by not having mature software implementations to prototype, although
> we can do some analysis based on overheads on other platforms. We
> won't propose any extensions until we can validate that they actually do
> provide an improvement.

This is an extremely valid point. I have been mostly using other
people's data to drive the designs, though I fully agree with the
importance of doing my own experiments for a "quantitative approach"
(especially given this year's Turing award ;-) ).

Besides the OpenSmalltalk VM (written in a subset of Smalltalk that can
be translated to C) I can use the Self VM (written in C++ and was the
original adaptive compilation system) and I have looked at StrongTalk
(from which HotSpot evolved, though I have no idea how directly), PyPy
(written in RPython) and Graal + Truffle.

What other options do we have?

And for OpenSmalltalk VM the Bochs simulator is used for testing and
development of the x86 and x86-64 compilers and gdbarm for the ARM
compiler. I know there are lots of options for RISC-V (Spike, Qemu,
RiscvEmu, Gem5, etc) but I am only a bit familiar with Qemu.

> > "Do Object-Oriented Languages Need Special Hardware Support?"
> > by Urs Hölzle and David Ungar
> I believe that four important things have changed since this paper:
>
> 1. Multicore has become common. A lot of the techniques (such as
> polymorphic inline caching) that make performance a lot better on
> single-threaded implementations have comparatively high overhead
> if they require synchronisation. They are acceptable for languages
> such as Java, where cache invalidations are infrequent (and so can
> be very expensive), but less applicable to more Smalltalk-like languages.

David Ungar did a multicore implementation of the Squeak VM a while ago.
Called the RoarVM, it ran on a Tilera chip and used 56 cores (8 were
reserved for Linux). This is actually something I worked on myself
starting in 1992 (I built a machine with 64 nodes, but it didn't have a
shared memory).

> 2. Web-based deployment has made implementations a lot more
> sensitive to startup latency. This is also true for short-lived command-line
> tools (a number of Python tools, for example, spend longer starting
> Python than they spend actually running), but in the context of a web
> browser it is essential to start executing in under 100ms from first
> access to source code. Compilers that rely on large quantities of
> profiling data are great for the third or fourth tiers in such environments
> but are unacceptable for early startup code. This means that the
> interpreter or low-tier JITs are often critical for user-visible performance.
> A large proportion of JavaScript programs never make it to the higher-tier
> JITs.

Exactly. Self 1 was very nice to use interactively but while Self 2
vastly improved benchmarks the GUI became too jerky to be practical.
That was an important motive to introduce adaptive compilation in Self
3. This was taken into account in the hardware paper, but adding an
interpreter was only experimented with after that.

> 3. Resource-constrained systems have started to use high-level languages.
> The Internet of Insecure Things increasingly needs memory-safe
> languages if it wants to become the Internet of Less Insecure Things.
> This means that there's still a place for hardware acceleration for
> simple in-order pipelines with small and shallow memory hierarchies.

It is impressive that people are using Lua to program the small ESP32 /
ESP8266 chips, though these are huge compared to old machines that ran
nice languages.

> 4. The relationships between memory throughput, latency, and size have
> changed a lot. This makes some form of hardware acceleration for
> garbage collection interesting, because you typically have enough
> memory bandwidth available to scan at higher than the allocation rate,
> but doing anything that impacts cache usage can hurt the performance
> of mutator threads.

If you can have many objects be created in cache and then collected
before they ever touch main memory you can reduce gc overhead quite a
bit. Of course, eventually you have to scan the whole memory and then
the problem you mention will happen (sort of like doing gc in a virtual
memory system and touching the whole address space).

-- Jecel

David Chisnall

Mar 23, 2018, 4:29:21 AM
to Tommy Thorn, Jecel Assumpcao Jr., RISC-V SW Dev, hw-...@groups.riscv.org, isa...@groups.riscv.org
On 22 Mar 2018, at 18:23, Tommy Thorn <tommy...@esperantotech.com> wrote:
>
>> This kind of approach is still very interesting for usage scenarios where you have constrained amounts of RAM, and it might also be interesting for first-tier interpreters in a multi-tier JIT scenario, though defining a format that wasn’t tied to a specific bytecode format would be difficult. The original Smalltalk implementation implemented the bytecode interpreter in Alto microcode.
>
> I'm well aware of the many bytecode VMs (Smalltalk, Java, Camllight, P-Code, etc.), but I expect most modern implementations will, like Camllight, expand bytecodes to threaded code upon load or simply JIT them. I don't think this level of constrained memory is relevant for "J" (my personal opinion).

This kind of environment is explicitly in scope for the J working group (see our charter).
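
For anyone not familiar with the technique in the quoted text, expanding bytecode to threaded code at load time can be sketched roughly as below. The three-opcode stack machine is hypothetical and does not correspond to any real VM's bytecode format.

```python
# Hypothetical three-opcode stack bytecode; not any real VM's format.
PUSH, ADD, MUL = 0, 1, 2


def op_push(stack, arg):
    stack.append(arg)


def op_add(stack, _arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def op_mul(stack, _arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


HANDLERS = {PUSH: op_push, ADD: op_add, MUL: op_mul}


def load(bytecode):
    """Expand (opcode, operand) pairs into direct handler references, once."""
    return [(HANDLERS[op], arg) for op, arg in bytecode]


def run(threaded):
    stack = []
    for handler, arg in threaded:   # no opcode decode in the hot loop
        handler(stack, arg)
    return stack.pop()


program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None)]
result = run(load(program))         # computes (2 + 3) * 4
```

The expanded form holds one handler reference per instruction and so is larger than the compact bytecode, which is the trade-off that matters when RAM is constrained.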

As this thread is being cross-posted to three of the wrong lists, I suggest moving any further discussion to the J extension list.

David

lkcl .

Mar 25, 2018, 10:12:25 AM
to Jecel Assumpcao Jr., Tommy Thorn, RISC-V SW Dev, RISC-V HW Dev, RISC-V ISA Dev
On Wed, Mar 21, 2018 at 9:15 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:

> I think the real challenge will be finding options that benefit
> more than one paradigm (eg. JIT, Smalltalk, Java, JavaScript,
> Haskell, Prolog, Lisp, Erlang, real-time GC, etc).

and if it's going to be general-purpose enough, can i also advocate
adding foreign assembly architectures (whatever they may be) to that
list as well? bearing in mind, ICT, the developers of the MIPS64
Loongson Architecture, managed to achieve a staggering 70% of the
clockrate of native x86 execution by identifying and implementing
translation of the top 200 most commonly-used / highest-efficiency x86
instructions, and letting the rest fall through to a
specially-modified version of qemu.

the point being of mentioning that story, it's not totally necessary
to JIT / emulate / translate absolutely *all* of the foreign
architecture's instructions.
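
a rough sketch of that "fast path plus fall-through" structure, in software only. the mini instruction set, the handler names and the stand-in slow emulator are all made up for illustration; this does not model x86, MIPS64 or qemu.

```python
# Hypothetical mini "foreign ISA": mnemonics and semantics are invented
# for illustration and do not model x86, MIPS64 or qemu.

def fast_add(regs, dst, src):
    regs[dst] += regs[src]


def fast_mov(regs, dst, src):
    regs[dst] = regs[src]


# Fast path: only the instructions assumed to dominate real workloads.
FAST_PATH = {"add": fast_add, "mov": fast_mov}


def slow_emulate(regs, op, dst, src):
    """Stand-in for a full software emulator covering every instruction."""
    if op == "xchg":
        regs[dst], regs[src] = regs[src], regs[dst]
    else:
        raise NotImplementedError(op)


def execute(program, regs):
    stats = {"fast": 0, "slow": 0}
    for op, dst, src in program:
        handler = FAST_PATH.get(op)
        if handler is not None:
            handler(regs, dst, src)
            stats["fast"] += 1
        else:
            slow_emulate(regs, op, dst, src)   # rare instructions fall through
            stats["slow"] += 1
    return stats


regs = {"a": 1, "b": 2}
program = [("add", "a", "b"), ("xchg", "a", "b"), ("add", "a", "b")]
stats = execute(program, regs)
```

the fast table can stay small as long as the fall-through path is complete; correctness never depends on which instructions made it into the fast set.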

l.

Tommy Thorn

Mar 25, 2018, 1:40:06 PM
to lkcl ., Jecel Assumpcao Jr., RISC-V SW Dev, RISC-V HW Dev, RISC-V ISA Dev

> and if it's going to be general-purpose enough, can i also advocate
> adding foreign assembly architectures (whatever they may be) to that
> list as well?

Thanks, I forgot that. I believe it was mentioned at the inaugural "J" meeting.

> bearing in mind, ICT, the developers of the MIPS64
> Loongson Architecture, managed to achieve a staggering 70% of the
> clockrate of native x86 execution by identifying and implementing
> translation of the top 200 most commonly-used / highest-efficiency x86
> instructions, and letting the rest fall through to a
> specially-modified version of qemu.

Is there independent verification of the 70% number? Mind you,
even if it's true, it's presumably at most one instruction per clock,
like a 486.

> the point being of mentioning that story, it's not totally necessary
> to JIT / emulate / translate absolutely *all* of the foreign
> architecture's instructions.

I think you got that backwards. Implementing a (subset of a) foreign
instruction set is a significant investment (in silicon). A JIT is "just"
software and likely has a higher ROI. A JIT has a higher peak-perf
_potential_ but the challenge is cold code, which is where even a
1 IPC hardware decoder can help (NVIDIA Denver?).

Tommy

lkcl .

Mar 25, 2018, 1:56:04 PM
to Tommy Thorn, Jecel Assumpcao Jr., RISC-V SW Dev, RISC-V HW Dev, RISC-V ISA Dev
On Sun, Mar 25, 2018 at 6:40 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
>
>> and if it's going to be general-purpose enough, can i also advocate
>> adding foreign assembly architectures (whatever they may be) to that
>> list as well?
>
> Thanks, I forgot that. I believe it was mentioned at the inaugural "J" meeting.

oh cool!

>> bearing in mind, ICT, the developers of the MIPS64
>> Loongson Architecture, managed to achieve a staggering 70% of the
>> clockrate of native x86 execution by identifying and implementing

> Is there independent verification of the 70% number?

of a China state-sponsored CPU that was designed specifically for use
in high-security China Govt secret supercomputing research where they
didn't want the U.S. Govt's NSA spying co-processor intel ME backdoors
to report on what they're doing? yyyeah.... :) but seriously: it's on
the wikipedia page [1] so it _must_ be true.... :)

> Mind you,
> even if it's true, it's presumably at most one instruction per clock,
> like a 486.

honestly don't know. there may be some links to sources from the
wikipedia article in order to investigate further [1], hth?

>> the point being of mentioning that story, it's not totally necessary
>> to JIT / emulate / translate absolutely *all* of the foreign
>> architecture's instructions.
>
> I think you got that backwards.

probably. i do that. then write unit tests and iterate through the
permutations of possible algorithms and hope nobody notices i can't do
boolean logic...

one thing that would be absolutely awesome to have would be
general-purpose hardware-emulation sufficient to cover RISC-V
instructions. aside from potentially being a better gateway to
implement (emulate) missing instructions, it might also potentially be
a way to fix silicon errata and/or cater for critical revisions in the
specification.

l.


[1] https://en.wikipedia.org/wiki/Loongson#Hardware-assisted_x86_emulation