Same comment as in the other thread about cross-posting. And I hope
that editing the subject to reflect the topic drift is the proper
protocol for these lists.
David Chisnall wrote on Thu, 22 Mar 2018 10:00:48 +0000
> > [...] SiliconSqueak [...]
>
> That sounds very interesting, thank you. We are currently largely blocked
> by not having mature software implementations to prototype, although
> we can do some analysis based on overheads on other platforms. We
> won't propose any extensions until we can validate that they actually do
> provide an improvement.
This is an extremely valid point. I have mostly been using other
people's data to drive the designs, though I fully agree with the
importance of doing my own experiments for a "quantitative approach"
(especially given this year's Turing Award ;-) ).
Besides the OpenSmalltalk VM (written in a subset of Smalltalk that can
be translated to C) I can use the Self VM (written in C++, and the
original adaptive compilation system). I have also looked at Strongtalk
(from which HotSpot evolved, though I have no idea how directly), PyPy
(written in RPython) and Graal + Truffle.
What other options do we have?
For the OpenSmalltalk VM, the Bochs simulator is used for testing and
development of the x86 and x86-64 compilers, and the GDB ARM simulator
(gdbarm) for the ARM compiler. I know there are lots of options for
RISC-V (Spike, QEMU, RiscvEmu, gem5, etc.) but I am only a bit familiar
with QEMU.
> > "Do Object-Oriented Languages Need Special Hardware Support?
> > by Urs Hölzle and David Ungar
> I believe that four important things have changed since this paper:
>
> 1. Multicore has become common. A lot of the techniques (such as
> polymorphic inline caching) that make performance a lot better on
> single-threaded implementations have comparatively high overhead
> if they require synchronisation. They are acceptable for languages
> such as Java, where cache invalidations are infrequent (and so can
> be very expensive), but less applicable to more Smalltalk-like languages.
David Ungar did a multicore implementation of the Squeak VM a while
ago. Called the RoarVM, it ran on a Tilera chip and used 56 of its
cores (the other 8 were reserved for Linux). This is actually something
I worked on myself starting in 1992 (I built a machine with 64 nodes,
but it didn't have shared memory).
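To make the synchronisation issue concrete, here is a minimal sketch in
C of a polymorphic inline cache at a single call site. The names and
layout are my own for illustration, not any particular VM's:

/* A minimal sketch of a polymorphic inline cache (PIC); the names and
   layout are invented for illustration, not any particular VM's. */
#include <stddef.h>

#define PIC_SLOTS 4

typedef struct Class Class;               /* opaque receiver class */
typedef void (*Method)(void *receiver);   /* compiled method entry */

typedef struct {
    Class  *klass;    /* receiver class seen at this call site */
    Method  method;   /* method that was found for that class */
} PicEntry;

typedef struct {
    PicEntry slots[PIC_SLOTS];
    size_t   used;
} CallSiteCache;

/* Placeholder for the slow path (hashed method dictionary search),
   assumed to exist elsewhere in the VM. */
extern Method full_lookup(Class *klass);

Method pic_lookup(CallSiteCache *pic, Class *klass)
{
    for (size_t i = 0; i < pic->used; i++)
        if (pic->slots[i].klass == klass)
            return pic->slots[i].method;   /* hit: one compare per slot */

    Method m = full_lookup(klass);         /* miss: full lookup */
    if (pic->used < PIC_SLOTS) {
        /* Extending the table is the multicore trouble spot: another
           thread may be reading these fields right now, so a real VM
           must publish the entry atomically or take a lock. */
        pic->slots[pic->used].klass  = klass;
        pic->slots[pic->used].method = m;
        pic->used++;
    }
    return m;
}

A single-threaded VM can extend the table with plain stores; once
several cores share the same call sites, that update has to be made
atomic or locked, which is exactly the overhead mentioned above.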
> 2. Web-based deployment has made implementations a lot more
> sensitive to startup latency. This is also true for short-lived command-line
> tools (a number of Python tools, for example, spend longer starting
> Python than they spend actually running), but in the context of a web
> browser it is essential to start executing in under 100ms from first
> access to source code. Compilers that rely on large quantities of
> profiling data are great for the third or fourth tiers in such environments
> but are unacceptable for early startup code. This means that the
> interpreter or low-tier JITs are often critical for user-visible performance.
> A large proportion of JavaScript programs never make it to the higher-tier
> JITs.
Exactly. Self 1 was very nice to use interactively, but while Self 2
vastly improved the benchmark results, its GUI became too jerky to be
practical. That was an important motivation for introducing adaptive
compilation in Self 3. This was taken into account in the hardware
paper, but adding an interpreter was only experimented with after
that.
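To illustrate why the low tiers dominate startup, here is a minimal
sketch in C of counter-based tier-up in a two-tier VM; the threshold
and names are invented for illustration, not taken from any real VM:

/* A minimal sketch of counter-based tier-up in a two-tier VM
   (interpreter + baseline JIT); threshold and names are invented. */
#include <stdint.h>

#define TIER_UP_THRESHOLD 1000   /* invocations before compiling */

typedef struct {
    uint32_t invocations;
    void   (*compiled)(void);    /* NULL until the JIT fills it in */
} MethodInfo;

extern void interpret(MethodInfo *m);    /* assumed: bytecode interpreter */
extern void jit_compile(MethodInfo *m);  /* assumed: sets m->compiled */

void invoke(MethodInfo *m)
{
    if (m->compiled) {                   /* hot path: run native code */
        m->compiled();
        return;
    }
    if (++m->invocations >= TIER_UP_THRESHOLD)
        jit_compile(m);                  /* cheap compile, no profile data */
    interpret(m);                        /* early calls never wait on the
                                            compiler, so startup stays fast */
}

The first thousand calls run in the interpreter with essentially zero
compilation latency; only methods that survive that long pay for the
JIT, and a profiling tier could sit behind a second, larger threshold.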
> 3. Resource-constrained systems have started to use high-level languages.
> The Internet of Insecure Things increasingly needs memory-safe
> languages if it wants to become the Internet of Less Insecure Things.
> This means that there's still a place for hardware acceleration for
> simple in-order pipelines with small and shallow memory hierarchies.
It is impressive that people are using Lua to program the small ESP32
and ESP8266 chips, though even these are huge compared to the old
machines that ran nice languages.
> 4. The relationships between memory throughput, latency, and size have
> changed a lot. This makes some form of hardware acceleration for
> garbage collection interesting, because you typically have enough
> memory bandwidth available to scan at higher than the allocation rate,
> but doing anything that impacts cache usage can hurt the performance
> of mutator threads.
If you can have many objects be created in the cache and then collected
before they ever touch main memory, you can reduce GC overhead quite a
bit. Of course, eventually you have to scan the whole memory, and then
the problem you mention will appear (sort of like doing a full GC in a
virtual memory system and touching the whole address space).
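As a minimal sketch in C of that idea, here is a bump-pointer nursery
sized to stay inside the cache; the size and names are assumptions for
illustration, not measurements:

/* A minimal sketch of a bump-pointer nursery sized to stay in cache;
   the size and names are assumptions for illustration. */
#include <stddef.h>
#include <stdint.h>

#define NURSERY_BYTES (256 * 1024)   /* assumption: fits in L2 */

static _Alignas(8) uint8_t nursery[NURSERY_BYTES];
static uint8_t *bump = nursery;

/* Assumed to exist elsewhere: evacuates the few survivors to an older
   generation (touching main memory only for them), then the nursery
   can be reused. Objects bigger than the nursery would be allocated
   elsewhere in a real VM. */
extern void minor_collect(void);

void *gc_alloc(size_t bytes)
{
    bytes = (bytes + 7) & ~(size_t)7;              /* 8-byte alignment */
    if (bump + bytes > nursery + NURSERY_BYTES) {  /* nursery full */
        minor_collect();                           /* scan nursery only */
        bump = nursery;                            /* reuse same cache lines */
    }
    void *obj = bump;                              /* objects that die young
                                                      never leave the cache */
    bump += bytes;
    return obj;
}

As long as most objects die before a minor collection, allocation and
reclamation both stay within the same few hundred kilobytes of cache,
and only the occasional full-heap scan hits the problem you describe.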
-- Jecel