This is going to be extremely light on details with respect to the current state of the Parrot interpreter.
It is also going to be expressed in terms of Win32 APIs.
For both of these I apologise in advance. Time, and the "or forever hold your peace" imperative has overridden my desire to do otherwise.
My attempts to get up speed enough on the sources to work out how to apply the ideas I am going to express, and to become conversant with the *nix
pthreads implementation and terminology, have moved too slowly to afford me the opportunity to do more.
Hoping this stimulates discussion rather than a flame war.
Regards, Nigel Sandever.
PS. Sorry if this gets posted twice.
THE BASIS OF THE IDEA
Modern OSs succeed in having multiple threads of execution share a single copy of process memory without the operations of one thread being able to
interfere with the state of another. The state of the code running in those threads may be corrupted through mismanagement. But the state of the
threads themselves, their program counters, registers and stacks cannot. The mechanisms for this incorruptibility are:
Each operation (opcode) performed by the thread is atomic.
The scheduler can never interrupt a thread whilst an operation is in progress. Only between operations.
Before the start of an operation, and after the end of one, the state of the thread is entirely encapsulated within the registers and stack.
By swapping the entire state of the CPU register set, when switching from one thread to another, the state of each thread is preserved
and reconstituted. No other mechanisms or interlocks are required.
By analogy, a virtual machine that wishes to have multiple threads of execution must achieve the same level of atomicity for each operation it performs.
VIRTUAL MACHINE INTERPRETER
At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state for that thread.
Once an opcode has started execution on a given thread, no other thread of execution within that interpreter much be allowed to start an operation until
the first thread completes its opcode.
Note that this does not mean that no other thread in the process can take a timeslice, only that any thread that is allowed to run should not be able to
affect the state of the VM in question. A thread charged with performing asynchronous reads on behalf of the user program running within the VM
interpreter can go ahead so long as it doesn't directly modify the VMI state.
EVENT MANAGER THREAD
Equally, an event thread can also be run concurrently in the background to receive asynchronous notifications (signals, messages, asynchronous read
completions etc.). It can then queue these events and set a flag that the VMI can inspect between each iteration of the "opcode execution loop" and action
appropriately. This gives the benefit of "safe signals" along with safe and timely processing of other, similarly asynchronous events.
The garbage collector would need to run *synchronously* with the interpreter opcode loop. Effectively, it would be another (possibly long running) opcode.
An analogy of this are all the long running syscalls that operate within a multi-tasking OS. Eg. synchronous IO, virtual memory allocation, swapping etc.
Just as virtual memory operations suspend the affected processes until the operations are complete, so garbage collection can be see as a virtual memory
operation for the virtual machine that requires the VMI to be suspended until the operation is complete.
The key is for the VM to be reentrant, and the use of (in win32 terms) a "critical section".
Not only must the VMI be coded in a reentrant fashion, with all state addressed through pointers (references) loaded into it's Virtual register set. All
the code underlying it, including syscalls and CRT must equally be reentrant. Many APIs within many CRTs *are not reentrant* (eg. strtok()). All state
must be on a per-thread not a per-process basis.
To this end, I would propose that no CRT APIs are used directly from the main code!
Instead, a full set of CRT-like macros would be inherited from a header file, where either the real CRT API would be called, or an alternative
implementation. This header file would be conditionally included on the basic of the target platform. This concentrates the bulk, if not all, platform
specific code into a single file (or set of files).
EXPANDING THE VIRTUAL MACHINE
What I am suggesting here is that the virtual machine analogy be extended to encompass not just the VHLL-user-program view of the registers and stack,
but also it's view of the entire machine and underlying OS.
By so doing, not only does it allow those working on the core of the VMI to be isolated from the underlying machine and OS architecture. It allows them
to extend the VMI's view of the world, beyond the concepts of any single architecture. It also serves the very practical purpose of avoiding the morass
of platform-conditional compilation directives in the bulk of the code, by concentrating them in single (groups of) files, by platform.
ATOMICITY AND CRITICAL SECTIONS
Atomicity of the VMI opcodes can be achieved by having the core loop of the interpreter enter and exit a critical section each time it processes an
opcode. For those not familiar with critical sections, they are a (Win32) OS construct that prevents any cooperating thread within a process, from having
timeslice until the thread that entered the critical section has exited.
Unlike semaphores and events, critsecs only operate within a single process. They are relatively lightweight as there is no need to enter kernel mode for
their operation. They are, in effect a CPU specific "test and set" operation applied to a piece of user mode memory. Their lightweight nature means that
they are faster than inter-process semaphore mechanisms. When used in a process that currently has only a single thread in operation, they have almost
negligible performance effect upon that thread. They also operate correctly on SMP machines.
PROCESS GLOBAL STATE
Unavoidably process global state, such as filehandles etc., should only be usable (from the user program being run by the VMI) through the use of opcodes.
Done this way, the serialisation of the opcodes at the VMI level protects the global state.
THIS IS ONLY PART OF THE SOLUTION
None of the above addresses the problems of synchronising user program state. This will still need to be addressed. This could either be done by the user
code through the use of something like the P5 :shared and lock() mechanisms. Or preferably, transparently by the VMI itself. Possibly, by having each PMC
contain an 'in-use' flag that would be set each time an opcode (method) was dispatched on a PMC (object) and cleared when the operation completes.
If two threads attempt concurrent operations on the same PMC, the first thread accessing the PMC sets the flag. When a second thread attempt to
access it, it finds the flag set and blocks (relinquishing it's timeslice) until the first thread has completed and cleared the flag. This doesn't solve the
potential for deadlock problems arising from the user-level code, though it may be possible for the VMI to detect and diagnose these at runtime in a
similar way that deep-recursion detection is done in P5.
The cost to the single-threaded user application is for a test&set operation (a few cycles), per object-method, plus the flag, which need only be one bit,
but maybe more efficiently coded as a 32-bit entity.
CONCLUSION (TENTATIVE, NO PROOF)
As all VHLL entities are PMCs at the VMI level, the sharing of data between threads at the VHLL level is done entirely through those PMCs. If no single
PMC can be the subject of an opcode on two threads concurrently, there /should/ be no opportunity for conflict.
As all VMI internal state are encapsulated within the VMI register set, and each thread has it's own set of registers, the internal state of the VMI(s)
should be equally secure.
Other internal housekeeping operations, memory allocation, garbage collection etc. are performed as "sysopcodes", performed by the VMI within the
auspices of the critical section, and thus secured.
All asynchronous operations are performed by one or more non-VMI threads and do not adjust the state of the VMI directly. Instead, they queue
notifications of events, and results to the VMI, and these are detected by the VMI within the body of the main loop. Once detected, an appropriate
sysopcode is dispatched within the critical section in the normal way.
SOME ADDITIONAL IDEAS
I've been looking at (glancing through) the memory management used by Parrot. The basic concept of memory pools, if I have understood them correctly, is
to allocate similarly size object from the same pool. The idea behind this is that it will be easier to find a suitable size piece of memory available for
reallocation having been GC'd, from a pool dedicated to allocation of a similar size to that needed. This should reduce the fragmentation of originally large
allocations into lots of small allocations, thereby reducing the need for coalescence, and overall memory allocation.
The thought struck me that with the exception of strings and (other) compound data types, all the instances of any given class are always the same size.
Where an existing class is dynamically extended in some way, the result is the creation of a new (anonymous) class as pre-existing instances of the
original class would not be modified. As such, the class would seem to be the ideal point at which to pool the allocation and deallocation of memory.
If each class is given it's own memory pool that is a multiple of its instance size. That pool effectively becomes an (C-style) array of instances. As each
element is exactly the right size, the only time the pool would need to grow, is when all the existing elements are in-use and another is required. Once an
instance has been freed, it's slot in the array would be available and exactly the right size for the next instantiation. There would never be a requirement
to coalesce allocations. Whether the free slots are found through a free space chain, or even a linear search, the GC would only need to process this pool
when a new instance of this class is being allocated. It would then only need to scan a single pool of (same-sized) reference objects to GC the existing
If all references to this type where allocated from another pool dedicated to references-to-this-class, that also hung off the parent class, then the
volume of references to be scanned would also be minimised.
If the larger segments of memory allocated for each pool, were allocated as virtual memory segments, rather than as arbitrary chunks of heap memory, the
size of each individual pool can be grown in-place, at runtime, without the need to copy then existing space to the new larger allocation. And without the
need to shuffle any other memory pools around to accommodate the increase in size. The OS's Virtual Memory Management takes care of the whole
process. It also becomes possible to reserve a sizeable pre-allocation for important pools and those known to be likely to grow quickly, without actually
consuming the entire reservation from the physical memory pool before it is actually needed. The OS will then take care of notification of an individual
pool's need to grow through page faults and the growth becomes a process of simply committing another (few) pages from the pre-allocation.
The memory requirements of variable size objects can be managed through Global Heap and sub-allocation APIs as now.
Thread specific local storage required for the VMI's register stacks etc. can be handled using Thread Local Storage APIs.
COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.
Some applications benefit from the programmer being able to have much tighter control over the scheduling of threads of execution within the process.
The extreme example of this are co-routines where control is passed back and forth between two (or more) pieces of code as the demands of the algorithm
and data dictate rather than on the basis of arbitrary time-slicing. Win32 has a concept of Fibres. The are manually dispatched threads. A given thread
(fibre) receives it's timeslice in the normal way. However, it can transfer the rest of it's current timeslice to another fibre on demand. The first fibre
remains suspended until the second fibre transfers control back. The state of the first fibre is retained just as it would be if the thread had been
swapped by the scheduler. In this way two (or more) fibres can cooperatively share their timeslices whilst other threads (and processes) continue to
receive their timeslices in the normal way. This, I think, is the essence of co-routines?
I have no understanding (yet) of how the co-routines that Parrot is promising are implemented on *nix systems. But unless the concept of co-routines is
encapsulated at the higher level with the implementation being pushed down to platform specific code, unless *nix also supports a concept very similar to
Win32 fibres, it will be impossible to utilise this "tailor made" OS facility. Thus, the implementation of co-routines designed to operate well on *nix
platforms is likely to need to be emulated on Win32 with high level requirements that won't fit well with the underlying OS. This could result in a
similarly sub-optimal emulation of a *nix concept on Win32 as currently exists for the fork emulation. Other non-*nix platforms would likely also suffer
from this force fit as well.