
Threads Design. A Win32 perspective.


Nigel Sandever

Jan 1, 2004, 2:14:23 PM1/1/04
to perl6-i...@perl.org
This is going to be extremely light on details with respect to the current state of the Parrot interpreter.

It is also going to be expressed in terms of Win32 APIs.

For both of these I apologise in advance. Time, and the "or forever hold your peace" imperative, have overridden my desire to do otherwise.

My attempts to get up to speed enough on the sources to work out how to apply the ideas I am going to express, and to become conversant with the *nix
pthreads implementation and terminology, have moved too slowly to afford me the opportunity to do more.

Hoping this stimulates discussion rather than a flame war.

Regards, Nigel Sandever.

PS. Sorry if this gets posted twice.
------------------------

THE BASIS OF THE IDEA

Modern OSs succeed in having multiple threads of execution share a single copy of process memory without the operations of one thread being able to
interfere with the state of another. The state of the code running in those threads may be corrupted through mismanagement. But the state of the
threads themselves, their program counters, registers and stacks cannot. The mechanisms for this incorruptibility are:

Each operation (opcode) performed by the thread is atomic. The scheduler can never interrupt a thread whilst an operation is in progress,
only between operations.

Before the start of an operation, and after the end of one, the state of the thread is entirely encapsulated within the registers and stack.
By swapping the entire state of the CPU register set when switching from one thread to another, the state of each thread is preserved
and reconstituted. No other mechanisms or interlocks are required.

By analogy, a virtual machine that wishes to have multiple threads of execution must achieve the same level of atomicity for each operation it performs.

VIRTUAL MACHINE INTERPRETER

At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state for that thread.
Once an opcode has started execution on a given thread, no other thread of execution within that interpreter must be allowed to start an operation until
the first thread completes its opcode.

NON-VMI THREADS

ASYNCHRONOUS IO

Note that this does not mean that no other thread in the process can take a timeslice, only that any thread that is allowed to run should not be able to
affect the state of the VM in question. A thread charged with performing asynchronous reads on behalf of the user program running within the VM
interpreter can go ahead so long as it doesn't directly modify the VMI state.

EVENT MANAGER THREAD

Equally, an event thread can also be run concurrently in the background to receive asynchronous notifications (signals, messages, asynchronous read
completions etc.). It can then queue these events and set a flag that the VMI inspects between each iteration of the "opcode execution loop", acting
on them appropriately. This gives the benefit of "safe signals" along with safe and timely processing of other, similarly asynchronous events.
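
As a concrete illustration, here is a minimal sketch of that flag-and-queue pattern in C. All of the names (Event, queue_event, check_events)
are invented for the example; only the critical-section and Interlocked calls are real Win32 APIs.

    #include <windows.h>

    typedef struct Event { struct Event *next; int kind; } Event;

    static Event *event_queue;            /* guarded by queue_cs                      */
    static LONG   events_pending;         /* flag tested between opcodes              */
    static CRITICAL_SECTION queue_cs;     /* InitializeCriticalSection() at startup   */

    /* Called from the event-manager thread. */
    void queue_event(Event *ev)
    {
        EnterCriticalSection(&queue_cs);
        ev->next    = event_queue;
        event_queue = ev;
        LeaveCriticalSection(&queue_cs);
        InterlockedExchange(&events_pending, 1);   /* raise the flag */
    }

    /* Called by the VMI between iterations of the opcode loop. */
    void check_events(void)
    {
        if (InterlockedExchange(&events_pending, 0)) {
            EnterCriticalSection(&queue_cs);
            /* ... drain event_queue and dispatch "safe signal" handlers ... */
            LeaveCriticalSection(&queue_cs);
        }
    }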

GARBAGE COLLECTION

The garbage collector would need to run *synchronously* with the interpreter opcode loop. Effectively, it would be another (possibly long running) opcode.
An analogy is the set of long-running syscalls within a multi-tasking OS, e.g. synchronous IO, virtual memory allocation, swapping etc.
Just as virtual memory operations suspend the affected processes until the operations are complete, so garbage collection can be seen as a virtual memory
operation for the virtual machine, requiring the VMI to be suspended until the operation is complete.

PREREQUISITES

The keys are reentrancy of the VM, and the use of (in win32 terms) a "critical section".

REENTRANCY

Not only must the VMI be coded in a reentrant fashion, with all state addressed through pointers (references) loaded into its virtual register set;
all the code underlying it, including syscalls and the CRT, must equally be reentrant. Many APIs within many CRTs *are not reentrant* (eg. strtok()).
All state must be on a per-thread, not a per-process, basis.

To this end, I would propose that no CRT APIs be used directly from the main code!

Instead, a full set of CRT-like macros would be inherited from a header file, where either the real CRT API would be called, or an alternative
implementation. This header file would be conditionally included on the basis of the target platform. This concentrates the bulk, if not all, of the
platform-specific code into a single file (or set of files).
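
A rough sketch of what such a layer might look like. Every file and macro name here is hypothetical; strtok_s is one real reentrant
alternative offered by some Win32 CRTs.

    /* vcrt.h -- the only CRT-ish header the main code ever includes */
    #if defined(WIN32)
    #  include "vcrt_win32.h"
    #elif defined(__linux__)
    #  include "vcrt_linux.h"
    #else
    #  error "no virtual CRT mapping for this platform"
    #endif

    /* vcrt_win32.h -- per-platform expansions */
    #define VCRT_MALLOC(n)        malloc(n)               /* CRT is adequate here */
    #define VCRT_STRTOK(s, d, c)  strtok_s((s), (d), (c)) /* reentrant variant    */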

EXPANDING THE VIRTUAL MACHINE

What I am suggesting here is that the virtual machine analogy be extended to encompass not just the VHLL-user-program view of the registers and stack,
but also its view of the entire machine and underlying OS.

By so doing, not only are those working on the core of the VMI isolated from the underlying machine and OS architecture; they are also able
to extend the VMI's view of the world beyond the concepts of any single architecture. It also serves the very practical purpose of avoiding a morass
of platform-conditional compilation directives in the bulk of the code, by concentrating them in single (groups of) files, by platform.


ATOMICITY AND CRITICAL SECTIONS

Atomicity of the VMI opcodes can be achieved by having the core loop of the interpreter enter and exit a critical section each time it processes an
opcode. For those not familiar with critical sections, they are a (Win32) OS construct that prevents any cooperating thread within a process from
receiving a timeslice until the thread that entered the critical section has exited.

Unlike semaphores and events, critsecs only operate within a single process. They are relatively lightweight, as there is no need to enter kernel mode for
their operation. They are, in effect, a CPU-specific "test and set" operation applied to a piece of user-mode memory. Their lightweight nature means that
they are faster than inter-process semaphore mechanisms. When used in a process that currently has only a single thread in operation, they have an almost
negligible performance effect upon that thread. They also operate correctly on SMP machines.
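
A minimal sketch of that loop, assuming a per-interpreter critical section shared by all of its VM threads. The Interpreter type and
dispatch function are stand-ins; the critical-section calls are real Win32.

    #include <windows.h>

    typedef struct Interpreter { int pc; } Interpreter;   /* stand-in            */
    extern int dispatch(Interpreter *);                   /* executes one opcode */

    static CRITICAL_SECTION vm_cs;   /* InitializeCriticalSection(&vm_cs) at startup */

    void vm_run(Interpreter *interp)
    {
        while (interp->pc) {
            EnterCriticalSection(&vm_cs);    /* begin the atomic opcode      */
            interp->pc = dispatch(interp);   /* execute exactly one opcode   */
            LeaveCriticalSection(&vm_cs);    /* other VM threads may now run */
            /* check the event flag here, between opcodes */
        }
    }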

PROCESS GLOBAL STATE

Unavoidably process-global state, such as filehandles etc., should only be usable (from the user program being run by the VMI) through the use of opcodes.
Done this way, the serialisation of the opcodes at the VMI level protects the global state.

THIS IS ONLY PART OF THE SOLUTION

None of the above addresses the problems of synchronising user program state. This will still need to be addressed, either by the user code, through
something like the P5 :shared and lock() mechanisms, or, preferably, transparently by the VMI itself. Possibly, by having each PMC
contain an 'in-use' flag that would be set each time an opcode (method) was dispatched on a PMC (object) and cleared when the operation completes.

If two threads attempt concurrent operations on the same PMC, the first thread accessing the PMC sets the flag. When a second thread attempts to
access it, it finds the flag set and blocks (relinquishing its timeslice) until the first thread has completed and cleared the flag. This doesn't solve the
potential for deadlock problems arising from the user-level code, though it may be possible for the VMI to detect and diagnose these at runtime, in a
similar way to the deep-recursion detection done in P5.

The cost to the single-threaded user application is a test&set operation (a few cycles) per object-method, plus the flag, which need only be one bit,
but may be more efficiently coded as a 32-bit entity.

CONCLUSION (TENTATIVE, NO PROOF)

As all VHLL entities are PMCs at the VMI level, the sharing of data between threads at the VHLL level is done entirely through those PMCs. If no single
PMC can be the subject of an opcode on two threads concurrently, there /should/ be no opportunity for conflict.

As all VMI internal state is encapsulated within the VMI register set, and each thread has its own set of registers, the internal state of the VMI(s)
should be equally secure.

Other internal housekeeping operations, memory allocation, garbage collection etc. are performed as "sysopcodes", performed by the VMI within the
auspices of the critical section, and thus secured.

All asynchronous operations are performed by one or more non-VMI threads and do not adjust the state of the VMI directly. Instead, they queue
notifications of events, and results to the VMI, and these are detected by the VMI within the body of the main loop. Once detected, an appropriate
sysopcode is dispatched within the critical section in the normal way.

SOME ADDITIONAL IDEAS

MEMORY MANAGEMENT

I've been looking at (glancing through) the memory management used by Parrot. The basic concept of memory pools, if I have understood them correctly, is
to allocate similarly sized objects from the same pool. The idea behind this is that it will be easier to find a suitably sized piece of memory, freed by
GC, in a pool dedicated to allocations of a similar size to that needed. This should reduce the fragmentation of originally large
allocations into lots of small allocations, thereby reducing the need for coalescence, and overall memory allocation.

The thought struck me that with the exception of strings and (other) compound data types, all the instances of any given class are always the same size.
Where an existing class is dynamically extended in some way, the result is the creation of a new (anonymous) class as pre-existing instances of the
original class would not be modified. As such, the class would seem to be the ideal point at which to pool the allocation and deallocation of memory.

If each class is given its own memory pool that is a multiple of its instance size, that pool effectively becomes a (C-style) array of instances. As each
element is exactly the right size, the only time the pool would need to grow is when all the existing elements are in-use and another is required. Once an
instance has been freed, its slot in the array would be available and exactly the right size for the next instantiation. There would never be a requirement
to coalesce allocations. Whether the free slots are found through a free space chain, or even a linear search, the GC would only need to process this pool
when a new instance of this class is being allocated. It would then only need to scan a single pool of (same-sized) reference objects to GC the existing
instances.

If all references to this type were allocated from another pool dedicated to references-to-this-class, that also hung off the parent class, then the
volume of references to be scanned would also be minimised.

If the larger segments of memory allocated for each pool were allocated as virtual memory segments, rather than as arbitrary chunks of heap memory, the
size of each individual pool can be grown in-place, at runtime, without the need to copy the existing space to a new larger allocation, and without the
need to shuffle any other memory pools around to accommodate the increase in size. The OS's Virtual Memory Management takes care of the whole
process. It also becomes possible to reserve a sizeable pre-allocation for important pools and those known to be likely to grow quickly, without actually
consuming the entire reservation from the physical memory pool before it is actually needed. The OS will then take care of notification of an individual
pool's need to grow through page faults and the growth becomes a process of simply committing another (few) pages from the pre-allocation.
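
A sketch of that reserve-then-commit scheme. The Pool structure and function names are invented; VirtualAlloc with
MEM_RESERVE/MEM_COMMIT is the real Win32 mechanism described.

    #include <windows.h>

    typedef struct Pool {
        BYTE  *base;        /* start of the reserved region       */
        SIZE_T committed;   /* bytes currently committed          */
        SIZE_T reserved;    /* total reservation (address space)  */
    } Pool;

    int pool_init(Pool *p, SIZE_T reserve_bytes)
    {
        p->base = VirtualAlloc(NULL, reserve_bytes, MEM_RESERVE, PAGE_NOACCESS);
        p->committed = 0;
        p->reserved  = reserve_bytes;
        return p->base != NULL;
    }

    /* Grow in place: commit the next few pages of the reservation. */
    int pool_grow(Pool *p, SIZE_T extra_bytes)
    {
        if (p->committed + extra_bytes > p->reserved)
            return 0;                        /* reservation exhausted */
        if (!VirtualAlloc(p->base + p->committed, extra_bytes,
                          MEM_COMMIT, PAGE_READWRITE))
            return 0;
        p->committed += extra_bytes;
        return 1;
    }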

The memory requirements of variable size objects can be managed through Global Heap and sub-allocation APIs as now.

Thread specific local storage required for the VMI's register stacks etc. can be handled using Thread Local Storage APIs.
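
For completeness, a sketch of the TLS side. VMRegisters is a stand-in; TlsAlloc, TlsSetValue and TlsGetValue are the real APIs.

    #include <windows.h>

    typedef struct VMRegisters { void *regs[32]; } VMRegisters;   /* stand-in */

    static DWORD vm_tls;           /* vm_tls = TlsAlloc(); once, at startup */

    void vm_thread_init(void)      /* run at the start of each VM thread */
    {
        VMRegisters *r = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeof *r);
        TlsSetValue(vm_tls, r);    /* this thread's private register set */
    }

    VMRegisters *vm_regs(void)     /* fetch the current thread's registers */
    {
        return (VMRegisters *)TlsGetValue(vm_tls);
    }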

COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

Some applications benefit from the programmer being able to have much tighter control over the scheduling of threads of execution within the process.
The extreme example of this is co-routines, where control is passed back and forth between two (or more) pieces of code as the demands of the algorithm
and data dictate, rather than on the basis of arbitrary time-slicing. Win32 has a concept of Fibres. They are manually dispatched threads. A given thread
(fibre) receives its timeslice in the normal way. However, it can transfer the rest of its current timeslice to another fibre on demand. The first fibre
remains suspended until the second fibre transfers control back. The state of the first fibre is retained just as it would be if the thread had been
swapped by the scheduler. In this way two (or more) fibres can cooperatively share their timeslices whilst other threads (and processes) continue to
receive their timeslices in the normal way. This, I think, is the essence of co-routines?
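
A toy demonstration of the mechanism. The coro body is invented; ConvertThreadToFiber, CreateFiber, SwitchToFiber and DeleteFiber are
the real Win32 calls.

    #include <windows.h>

    static LPVOID main_fiber, coro_fiber;

    static VOID CALLBACK coro_body(LPVOID arg)
    {
        (void)arg;
        /* ... produce a value ... */
        SwitchToFiber(main_fiber);      /* hand the timeslice back      */
        /* ... resumes here on the next switch ... */
        SwitchToFiber(main_fiber);
    }

    int main(void)
    {
        main_fiber = ConvertThreadToFiber(NULL);
        coro_fiber = CreateFiber(0, coro_body, NULL);
        SwitchToFiber(coro_fiber);      /* run the coro until it yields */
        SwitchToFiber(coro_fiber);      /* resume it a second time      */
        DeleteFiber(coro_fiber);
        return 0;
    }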

I have no understanding (yet) of how the co-routines that Parrot is promising are implemented on *nix systems. But unless the concept of co-routines is
encapsulated at a higher level, with the implementation pushed down to platform-specific code, or unless *nix also supports a concept very similar to
Win32 fibres, it will be impossible to utilise this "tailor made" OS facility. Thus, an implementation of co-routines designed to operate well on *nix
platforms is likely to need to be emulated on Win32, with high-level requirements that won't fit well with the underlying OS. This could result in a
similarly sub-optimal emulation of a *nix concept on Win32 as currently exists for the fork emulation. Other non-*nix platforms would likely also suffer
from this force fit as well.

Uri Guttman

Jan 1, 2004, 9:32:22 PM1/1/04
to Nigel Sandever, perl6-i...@perl.org
>>>>> "NS" == Nigel Sandever <nigels...@btconnect.com> writes:

NS> REENTRANCY

NS> Not only must the VMI be coded in a reentrant fashion, with all state
NS> addressed through pointers (references) loaded into its virtual
NS> register set. All the code underlying it, including syscalls and CRT
NS> must equally be reentrant. Many APIs within many CRTs *are not
NS> reentrant* (eg. strtok()). All state must be on a per-thread not a
NS> per-process basis.

NS> To this end, I would propose that no CRT APIs are used directly from the
NS> main code!

NS> Instead, a full set of CRT-like macros would be inherited from a header
NS> file, where either the real CRT API would be called, or an alternative
NS> implementation. This header file would be conditionally included on the
NS> basis of the target platform. This concentrates the bulk, if not all,
NS> platform specific code into a single file (or set of files).

this is true for c level threads but not necessarily true for VM level
threads. if the VM is atomic at its operation level and can't be
preempted (i.e. it is not using kernel threads with time slicing), then
it could use thread unsafe calls (as long as it keeps those silly static
buffers clean). parrot will (according to dan) use one interpreter per
VM thread and those may run on kernel threads. it may be possible to
disable preemption and/or time slicing so the VM threads will be atomic
at the VM operation level and then we don't have to worry as much about
thread unsafe libs. but i gather that people want real preemption and
priorities and time slicing so that idea may be moot. but on most
platforms that support kernel threads there are thread safe versions of
most/all the c lib stuff. now, external libs that get linked in under
nci is a different story.


NS> ATOMICITY AND CRITICAL SECTIONS

NS> Atomicity of the VMI opcodes can be achieved by having the core
NS> loop of the interpreter enter and exit a critical section each
NS> time it processes an opcode. For those not familiar with critical
NS> sections, they are a (Win32) OS construct that prevents any
NS> cooperating thread within a process, from having timeslice until
NS> the thread that entered the critical section has exited.

that is what i mentioned above, disabling timeslicing/preemption when
desired. it is not just a win32 concept. hell, turning off interrupts
during interrupt handlers goes way back! redmond just likes to rename
stuff and act like they invented it. :)

NS> Unlike semaphores and events, critsecs only operate within a
NS> single process. They are relatively lightweight as there is no
NS> need to enter kernel mode for their operation. They are, in effect
NS> a CPU specific "test and set" operation applied to a piece of user
NS> mode memory. Their lightweight nature means that they are faster
NS> than inter-process semaphore mechanisms. When used in a process
NS> that currently has only a single thread in operation, they have
NS> almost negligible performance effect upon that thread. They also
NS> operate correctly on SMP machines.

in effect it sounds like a thread shared mutex. it could be implemented
in kernel or process space.

NS> If two threads attempt concurrent operations on the same PMC, the
NS> first thread accessing the PMC sets the flag. When a second thread
NS> attempts to access it, it finds the flag set and blocks
NS> (relinquishing its timeslice) until the first thread has
NS> completed and cleared the flag. This doesn't solve the potential
NS> for deadlock problems arising from the user-level code, though it
NS> may be possible for the VMI to detect and diagnose these at
NS> runtime in a similar way that deep-recursion detection is done in
NS> P5.

that flag setting needs to be atomic or a mutex or similar. plain flag
setting won't work. also the blocking has to be kernel level (so a
kernel mutex/semaphore is needed) so the kernel slicing will work.

deadlock detection is a known problem in the DB world and we can do
similar things. but are those locks VM level or kernel level? if the
coder wants to prevent deadlocks they have to do a higher level lock
(analogous to a full DB lock or transaction). we can't stop stupid
coders from themselves.

NS> The cost to the single-threaded user application is for a test&set
NS> operation (a few cycles), per object-method, plus the flag, which
NS> need only be one bit, but may be more efficiently coded as a 32-bit
NS> entity.

more than test/set is needed as it has to be atomic. so cpu's have that
instruction but we can't get at it directly from c and we have to
emulate it on platforms which don't have it. assuming it is short and
fast is not a good idea. now the atomic semaphore problem was solved
long ago by dijkstra and it is a two phase test/set gizmo. it needs
several machine instructions or a few lines of c. this is not a trivial
amount of overhead for each access to a shared object. also this type
of lock is in user space and won't block the kernel thread running this
VM thread. so we really need kernel semaphores and that means more
storage on each PMC that could be locked.

this is part of why dan said that the unthreaded version will be
designed to be faster. if you don't have all those locks and don't
compile in the storage for them, it will be faster. i would find it
amazing if someone could design a threaded system that actually ran
faster than the same thing with threads ripped out (even in a single
threaded application).

NS> CONCLUSION (TENTATIVE, NO PROOF)

NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

then you have no sharing. the issue is when you do have sharing. the
ithreads architecture has no IMPLICIT sharing so you can start out like
that. variables (and their underlying PMCs) can be declared shared at
compile or runtime (and with perl, who can tell the difference?
:). those shared things must each have a lock (preferably kernel level
so the locking thread can block and not spin) and that requires storage
and extra code wrapping access to those things.

NS> As all VMI internal state is encapsulated within the VMI register set,
NS> and each thread has its own set of registers, the internal state of the
NS> VMI(s) should be equally secure.

you can't lock internal state nor do you need to as only the interpreter
can see them. the possibly shared things are variables and data.

NS> Other internal housekeeping operations, memory allocation, garbage
NS> collection etc. are performed as "sysopcodes", performed by the VMI
NS> within the auspices of the critical section, and thus secured.

there may be times when a GC run needs to be initiated DURING a VM
operation. if the op requires an immediate large chunk of ram it can
trigger a GC pass or allocation request. you can't force those things to
only happen between normal ops (which is what making them into ops
does). so GC and allocation both need to be able to lock all shared
things in their interpreter (and not just do a process global lock) so
those things won't be modified by the other threads that share them.

this is what dan is all so scared about. you can't hide gc and
allocation under a VM carpet. it has to be out there and visible at all
times and it needs total access to an interpreter's stack and other
stuff.

NS> All asynchronous operations are performed by one or more non-VMI
NS> threads and do not adjust the state of the VMI directly. Instead,
NS> they queue notifications of events, and results to the VMI, and
NS> these are detected by the VMI within the body of the main
NS> loop. Once detected, an appropriate sysopcode is dispatched within
NS> the critical section in the normal way.

there is no real main loop anymore. there are multiple main loops (one
in each interpreter). remember, each interpreter is mapped to a real
kernel thread (if possible). you can have a single thread dedicated to
handling signals and events but each of the others must check that
process global event queue. and the classic problem is there too, which
thread gets what events, which is particularly nasty with signals.

NS> MEMORY MANAGEMENT

NS> The thought struck me that with the exception of strings and
NS> (other) compound data types, all the instances of any given class
NS> are always the same size. Where an existing class is dynamically
NS> extended in some way, the result is the creation of a new
NS> (anonymous) class as pre-existing instances of the original class
NS> would not be modified. As such, the class would seem to be the
NS> ideal point at which to pool the allocation and deallocation of
NS> memory.

the problem is too many different memory pools. you will waste enormous
amounts of unused space in each different sized pool as they all will
allocate slabs and break them up to their own granularity. so you will
have much of the allocated ram just sitting in pools and not being used
unless each class creates enough objects to use them.

NS> If each class is given its own memory pool that is a multiple of
NS> its instance size. That pool effectively becomes an (C-style)
NS> array of instances. As each element is exactly the right size, the
NS> only time the pool would need to grow, is when all the existing
NS> elements are in-use and another is required. Once an instance has
NS> been freed, its slot in the array would be available and exactly
NS> the right size for the next instantiation. There would never be a
NS> requirement to coalesce allocations. Whether the free slots are
NS> found through a free space chain, or even a linear search, the GC
NS> would only need to process this pool when a new instance of this
NS> class is being allocated. It would then only need to scan a single
NS> pool of (same-sized) reference objects to GC the existing
NS> instances.

see above. this is a fine scheme for a few queues of special size but
not for supporting hundreds of queues of differing sizes. and then you
get the real nasty problem of a class adding attributes at run time
which will necessitate changing the storage needed for all its instances
(this is the notification stuff dan has mentioned). this would require a
massive copy of all instances and an allocation of a completely new ram
pool of the new size and all the overhead of setting it up, etc.

NS> If the larger segments of memory allocated for each pool, were
NS> allocated as virtual memory segments, rather than as arbitrary
NS> chunks of heap memory, the size of each individual pool can be
NS> grown in-place, at runtime, without the need to copy the existing
NS> space to the new larger allocation. And without the need to
NS> shuffle any other memory pools around to accommodate the increase
NS> in size. The OS's Virtual Memory Management takes care of the
NS> whole process. It also becomes possible to reserve a sizeable
NS> pre-allocation for important pools and those known to be likely to
NS> grow quickly, without actually consuming the entire reservation
NS> from the physical memory pool before it is actually needed. The OS
NS> will then take care of notification of an individual pool's need
NS> to grow through page faults and the growth becomes a process of
NS> simply committing another (few) pages from the pre-allocation.

how do you know who will grow quickly? larry hasn't published the specs
for PSI::ESP yet. :) also applications never see page faults so we can't
use that.

NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

NS> I have no understanding (yet) of how the co-routines that Parrot is
NS> promising are implemented on *nix systems. But unless the concept of
NS> co-routines is encapsulated at the higher level with the implementation
NS> being pushed down to platform specific code, unless *nix also supports a
NS> concept very similar to Win32 fibres, it will be impossible to utilise
NS> this "tailor made" OS facility. Thus, the implementation of co-routines
NS> designed to operate well on *nix platforms is likely to need to be
NS> emulated on Win32 with high level requirements that won't fit well with
NS> the underlying OS. This could result in a similarly sub-optimal
NS> emulation of a *nix concept on Win32 as currently exists for the fork
NS> emulation. Other non-*nix platforms would likely also suffer from this
NS> force fit as well.

you seem to be crossing the kernel and VM boundaries here. parrot will
have coroutines but at the VM level. they won't map easily onto any form
of kernel coros since parrot needs to keep a reference to a parrot level
stack and all sorts of VM level info in the coro object/thing. this
problem is similar to the problem of mapping VM threads onto kernel
threads. there are many ways to do it and none are perfect. a VM just
doesn't have the internal granularity of a kernel and the ability to
truly block threads when waiting. VM coros also can't utilize the kernel
for switching since a kernel stack only needs a pointer and a few
registers to be swapped vs a larger and more complex swap with VM coros.

you have addressed many issues here and i think that is what dan
wanted. i hope i clarified some of them and explained why there are no
simple answers. we have to bite some performance bullets somewhere due
to the very lofty functional goals we have set. the key is to keep the
design and api clean and as elegant as possible while keeping up the
performance. and almost all of this is moot in a pure event system which
is why i like them better than threads. :)

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Nigel Sandever

Jan 3, 2004, 12:33:07 AM1/3/04
to perl6-i...@perl.org
On Thu, 01 Jan 2004 21:32:22 -0500, u...@stemsystems.com (Uri Guttman) wrote:

UG> Uri Guttman
NS> Nigel Sandever. (Mostly not reproduced here!)

NS> REENTRANCY

UG> this is true for c level threads but not necessarily true for VM level
UG> threads. if the VM is atomic at its operation level and can't be
UG> preempted (i.e. it is not using kernel threads with time slicing),

One of the major points of what I am (was?) proposing, is that there
would be a 1:1 mapping between VM threads and kernel threads.

The second point is that atomic operation of the VM thread does not imply
that it cannot be preempted. It only means that when one VM thread is
preempted by the scheduler, no other thread waiting to enter the same
critical section that the first thread is in will be scheduled
(by the scheduler), until the first thread exits that critical section.

UG> then
UG> it could use thread unsafe calls (as long as it keeps those silly static
UG> buffers clean).

The point about avoiding calling the CRT directly from the main bulk of the
code does not preclude the use of (suitably thread-aware) CRT calls completely.

The idea is to acknowledge that
A) Not all CRT's are equally well crafted or complete.
B) Thread-aware CRTs are not available with all compilers for all platforms.

By writing the main bulk of the code in terms of a virtual CRT using macros,
it becomes platform independent. The only change required when porting the
code to different platforms is to write a set of suitable expansions for the
macros for each platform/compiler. Where the available CRT is up to the job,
the macros expand directly to the underlying CRT call. Where it is not, a
suitable alternative can be written that bypasses the CRT and goes directly
to the OS.

As an example: compiling P5 with Borland 5.5 for NT, I encountered the restriction
that the Borland runtime didn't support fseek() or ftell() on files > 4GB. The
underlying platform does, and it was relatively trivial to replace the calls to
the CRT functions with appropriate NT syscalls and so enable both USE_LARGE_FILES
and PERLIO.

The only difficulty was that instead of being able to make the changes in a single
place, it was necessary to make essentially the same change in 3 or 4 different files.
To compound matters, the modifications required conditional compilation directives at
all 4 places, and these had to be interspersed with other #ifdefs for other platforms
and other purposes.

The idea was to avoid this morass of conditional compilation and copy&paste coding by
pushing the conditional directives in the main code to a single

#if defined( THIS_PLATFORM )
#include "this_platform_crt.h"
#endif

Within that header would be a simple set of #ifdefs to include compiler specific
header files. And within those, all the CRT macro expansions. Thus, I sought to
ease both porting and maintenance.

It would also become possible to add virtual syscall definitions to the high level
code, and expand them differently, perhaps radically so, on a platform by platform
basis. In essence, creating a Virtual OS for the VM.
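
As a sketch, the large-file case described above might be spelled like this. VCRT_FSEEK64 and the helper are invented names;
SetFilePointerEx is the real NT-level API that bypasses the CRT.

    /* vcrt_win32_borland.h: this CRT lacks 64-bit fseek/ftell, so bypass it */
    #include <windows.h>

    #define VCRT_FSEEK64(h, off)  vcrt_fseek64((h), (off))

    static __int64 vcrt_fseek64(HANDLE h, __int64 offset)
    {
        LARGE_INTEGER in, out;
        in.QuadPart = offset;
        if (!SetFilePointerEx(h, in, &out, FILE_BEGIN))
            return -1;
        return out.QuadPart;       /* the new position doubles as ftell() */
    }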

UG> parrot will (according to dan) use one interpreter per
UG> VM thread and those may run on kernel threads.

From what I have read, the decision as to whether each VM thread also got a
separate interpreter was one of the things Dan was opening up to discussion?

Personally, with the perspective of (the forkless) NT platform, and my experience
with trying to make good use of both p5's 5005thread and ithread implementations,
I am completely against the idea of VM threads having a 1:1 relationship with
interpreters. The runtime costs of duplicating all shared data between interpreters,
tying all shared variables, added to the cost of locking, throw away almost all of
the benefit of using threads. Combine that overhead with the restrictions of

a) No shared references therefore:
b) No shared objects, therefore:
c) No shared methods.
d) No shared compound data structures.

And the only hope for multiprocessing becomes forking, which Win32 doesn't support
natively. And which, without the benefits of process level COW performed by the
kernel, results in a kludged emulation of a non-native facility that will never
be usable in any real sense of the word.

UG> it may be possible to
UG> disable preemption and/ or time slicing so the VM threads will be atomic
UG> at the VM operation level and then we don't have to worry as much about
UG> thread unsafe libs.

I was trying to demonstrate that there is no need to disable the scheduler in order
for VM threads to achieve atomic VM-level operations.

UG> but i gather that people want real preemption and
UG> priorities and time slicing so that idea may be moot.

Yes. I am one of them:)

UG> but on most
UG> platforms that support kernel threads there are thread safe versions of
UG> most/ all the c lib stuff.

As I said above, this is true, but it is also compiler dependent. And even
where thread-safe CRTs are available, they aren't always reliable.

Insulating the bulk of the sources from the vagaries of different CRTs was
my aim.

UG> now, external libs that get linked in under
UG> nci is a different story.

I can't address this as I have no idea what nci is :), and nothing applicable
to this context came up in a google for the term.

NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/ preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

I cannot emphasise enough that Critical Sections are different to both Events
and Semaphores.

The distinction is that a critical section does not disable time slicing/ preemption.
It prevents any thread attempting to enter an owned critical section from receiving
a timeslice until the current owner relinquishes it. This means that other threads in
the same process and other processes continue to timeslice preemptively.
Only the thread(s) waiting to enter the specified critical section are blocked.
There is no 'spinning' involved.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

They are indeed a form of mutex, but the process-space state, and the
avoidance of a switch to kernel mode, are crucial, as it is this that makes
them lighter weight than a conventional IPC Mutex.

A CRITICAL_SECTION is a DWORD allocated in user space. It must be allocated
by the user space code, and then initialised through a syscall. Once initialised,
it cannot be moved, copied or modified (directly) by the user code.

If you have the time and inclination, there is a brief (25 line) description
of them at
[http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/ semaphore is needed) so the kernel slicing will work.

Indeed. A simple flag *is* all that is necessary to protect shared (fat) data
structures, *if* the VMs are the only code that could achieve concurrent access
to the contents of the PMC, and these are already interlocked at the operational
level.

A PMC is an object, and the only code that can affect the state of that object
is its methods. The methods can only be invoked by the VM. Each method
constitutes an (atomic) operation at the VM level.

All that is required to protect an object from corruption through concurrent
access and state change is to prevent two (or more) VMs trying to access the
same object simultaneously. In order for the VM to address the PMC and load
it into a VM register, it must know where it is. Ie. it must have a pointer.
It can use this pointer to access the PMC with a Bit Test and Set operation
(x86 BTS) [http://www.itis.mn.it/linux/quarta/x86/bts.htm].
This is a CPU-atomic operation, and it occurs before the VM enters the VM
operation's critical section.

Once the bit has been set in the PMC header by one VM, if the scheduler intervened
exactly before the next instruction in that thread, and scheduled a second VM thread
to run, and that VM also attempts to access that same PMC, the first operation it
tries to perform is the same BTS instruction. The next instruction then tests the CF
(Carry Flag) and knows that the PMC is already in use. Hence, it then relinquishes
(yields) the rest of its timeslice.

It will continue to do so until the result of the BTS is that the CF is clear.
This will only happen when the original VM that caused the bit to be set enters
the critsec, performs and completes the operation, and exits the critsec,
at which point it does a Bit Test & Reset (BTR) on the bit
[http://www.itis.mn.it/linux/quarta/x86/btr.htm].

Every time the second (or subsequent) VM attempts to use a PMC that is already
in-use, it finds the bit set and immediately yields. This will only happen during
the interval between the first VM issuing the BTS, and that same VM entering the
critsec. Once the critsec has been entered, no other VM that could load that PMC
would even be scheduled.
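
Putting the whole sequence together as a sketch: the PMC layout and vm_cs are stand-ins, and InterlockedCompareExchange is the
portable Win32 spelling of the atomic test-and-set, rather than a literal BTS.

    #include <windows.h>

    typedef struct PMC { LONG in_use; /* ... */ } PMC;
    static CRITICAL_SECTION vm_cs;        /* the per-interpreter critsec */

    void vm_op_on_pmc(PMC *p)
    {
        /* the "BTS": claim the PMC, yielding our timeslice while it is held */
        while (InterlockedCompareExchange(&p->in_use, 1, 0) != 0)
            SwitchToThread();

        EnterCriticalSection(&vm_cs);     /* the atomic VM operation */
        /* ... perform the opcode against p ... */
        LeaveCriticalSection(&vm_cs);

        InterlockedExchange(&p->in_use, 0);   /* the "BTR": clear the bit */
    }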

UG> we can't stop stupid coders from themselves.

Agreed:)

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

I think this is all covered above.

UG> this is part of why dan said that the unthreaded version will be
UG> designed to be faster. if you don't have all those locks and don't
UG> compile in the storage for them, it will be faster. i would find it
UG> amazing if someone could design a threaded system that actually ran
UG> faster than the same thing with threads ripped out (even in a single
UG> threaded application).

If my description above is to be believed, then the BTS opcode, plus a JCXZ
[http://www.itis.mn.it/linux/quarta/x86/jcc.htm]
to skip over the yield, are the only penalty of leaving the VM
interlocking code in place for non-threaded versions. As the flag would
never test set on entry, the jump would always be taken and the yield code
would never come into play.

The costs associated with entering and exiting the critsec are minimal, but this
code could be bypassed until a flag is set in the interpreter to indicate that
a second VM thread has been created.

NS> CONCLUSION (TENTATIVE, NO PROOF)

NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture has no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

Again. I think everything here was dealt with above.

NS> As all VMI internal state is encapsulated within the VMI register set,
NS> and each thread has its own set of registers, the internal state of the
NS> VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the interpreter
UG> can see them. the possibly shared things are variables and data.

By "internal state of the VMI" (interpreter), I was referring to such things as
the VHLL program that the interpreter is running. If a subroutine comes into
existence at runtime (through eval for example) then the user program changes,
and with it, the state of the interpreter.
If the garbage collector runs, again the internal state of the interpreter is modified.

If these operations, like all user-program directed operations, are run within
the VM as atomic operations, then the state of the interpreter is protected.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

This is a matter of design. The suggestion that memory allocation be managed
at the class level was central to the proposal. If memory needs to be allocated
as part of a VM operation, then it is associated with a single PMC: the subject
of the operation. If the pool of memory used by that PMC is entirely local to the
PMC and allocated by an operation within the PMC, then a failure to obtain memory
requires that only *that* pool of memory need be garbage collected. As such,
the GC cycle can be run on that pool, as a (nested) VM-level operation. The
limited scope of the pool to be GC'd makes this a much cheaper increment than
global GC would be in the same circumstances. If the GC doesn't succeed in
locating and freeing space, another (few) pages of the reserved allocation
are then committed.

No other pool is affected. No shuffling or copying is required. See below.

UG> if the op requires an immediate lare chunk of ram it can
UG> trigger a GC pass or allocation request. you can't force those things to
UG> only happen between normal ops (which is what making them into ops
UG> does). so GC and allocation both need to be able to lock all shared
UG> things in their interpreter (and not just do a process global lock) so
UG> those things won't be modified by the other threads that share them.

UG> this is what dan is all so scared about. you can't hide gc and
UG> allocation under a VM carpet. it has to be out there and visible at all
UG> times and it needs total access to an interpreter's stack and other
UG> stuff.

By moving away from a single central pool, or even many central pools of similarly
sized objects, to a much finer-grained set of pools, each managed directly by the
consumer class of that pool, allocation and GC become operations of the PMC
(class) itself.

Other pool(s) of memory used to satisfy the interpreter's needs for its own use,
as opposed to those done on behalf of the user program, would also be required.
But these will tend to have relatively longer life and will be allocated outside
of the VM operations. The separation of interpreter memory requirements from those
of the user program also translates into less work to do, at both allocation and
GC times, for either.

UG> there is no real main loop anymore. there are multiple main loops (one
UG> in each interpreter). remember, each interpreter is mapped to a real
UG> kernel thread (if possible).

Again, I think I have covered this above. I know there will be multiple
"main loops". In the scheme outlined, there would be one "main loop"
per VM/kernel thread. And these would share a single interpreter.

Indeed, multiple interpreters within the same process would be possible.
And each interpreter would be able to have multiple VM threads. Two or
more interpreters (and the VM(s) that run them) would never, in fact could
never, interfere with each other: their state, including their memory pools,
garbage collectors and everything else, would be entirely separate.

The only exceptions would be unavoidably process-global OS entities,
like files, sockets, Events etc., but these are already managed by
multi-threaded CRTs and there should be no reason to make extra
provision to handle them.

UG> you can have a single thread dedicated to
UG> handling signals and events but each of the others must check that
UG> process global event queue.

Agreed. This is the way I would implement asynchronous events.
In fact, every Win32 process already has a separate event queue by
default. This is used by native GUI applications to manage events for
single and multi-threaded applications, including the message passing
mechanisms for communicating between windows running on different threads.
In part, this is the source of some of the inspiration for the model I am
suggesting. I don't have a great deal of VM interpreter knowledge, but
many of the same problems arise in threaded GUI applications, and I have
a lot of experience with those.

In fact, the Win32 port of P5 utilises this event queue to provide its
(very limited) emulation of *nix signals. It would theoretically be
possible to extend that current emulation to cover more than the 2 or 3
signals now supported. By extending this mechanism, it would be possible
to enable SIG_ALRM and SIG_CHLD for example, though the latter would be
limited to interoperation with other pseudo-forked children.

UG> and the classic problem is there too, which
UG> thread gets what events which is particularly nasty with signals.

This only becomes a problem if the process can be considered to be more
than a single program -- as with pseudo-forked children.

However, if the raiser of the signal can somehow identify the thread it
wishes to signal, then there is no problem tagging the signal with its
destination and having each thread only respond to signals tagged for it.
Where a signal is raised by the OS that has no way to target individual
threads/interpreters, the signal would be acted upon by ALL threads
in the process.

NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

Virtual memory allocation
[http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp]
can be done in terms of reservation rather than committal.

Each class would reserve some sized lump of virtual memory, but
only commit a small amount (in 4k pages) of the total reserved.


UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

Theoretically, each VirtualAlloc call can reserve as much virtual memory
as it likes (up to the limit of the address space) and then commit
this in 4K chunks on demand. The fact that no two such 'full-memory'
reservations could ever be completely committed doesn't matter. The
OS/hardware takes care of mapping the committed virtual memory to
physical memory. As far as the program is concerned, there is no
shuffling of memory required to expand any single allocation.

UG> how do you know who will grow quickly? larry hasn't published the specs
UG> for PSI::ESP yet. :) also applications never see page faults so we can't
UG> use that.

There are some indications (even without ESP :) of what classes are likely to
grow faster than others. The SCALAR class is likely to see considerably more
activity than the FATAL_ERROR_DUMP class, for example.

The former might pre-commit 1MB on start up, and extend in 1MB increments,
whereas the latter might opt to have no pre-committal at all and expand only
when allocation is requested.

It might even be possible to have the reservation, pre-committal and extension
values configurable at run time on a class-by-class basis. These might be
optional class creation/load-time parameters.

NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

UG> you seem to be crossing the kernel and VM boundaries here.

I don't believe I have.

UG> parrot will
UG> have co-routines but at the VM level.

In the model I describe, a VM thread *is* a kernel thread.

UG> they won't map easily onto any form
UG> of kernel coros since parrot needs to keep a reference to a parrot level
UG> stack and all sorts of VM level info in the coro object/thing.

Agreed. This is why each VM thread, in addition to having its own set of
VM registers, also needs its own VM stack(s), just as kernel threads have
at the CPU level.

UG> this
UG> problem is similar to the problem of mapping VM threads onto kernel
UG> threads. there are many ways to do it and none are perfect.

I doubt this mechanism gets any closer to perfect than any other, but
I do think that it
a) is feasible, on win32 at least, and
b) has some unique features worthy of consideration.

I don't know if all the requisite (win32) OS facilities are available on
all other threaded platforms. There is no doubt in my mind that equivalent
facilities are either available (probably under different names), or could
be made available. Linux runs on the same hardware as Win32, and the rest
is just software. Whether anything that doesn't exist could be fabricated
at a level that wouldn't mean a kernel rebuild I simply don't know.

UG> a VM just
UG> doesn't have the internal granularity of a kernel and the ability to
UG> truly block threads when waiting.

The scheme of using a single critsec to arbitrate and interlock between
VM threads is the basis of the mechanism for doing exactly this. When a
thread attempts to enter a critsec that has already been entered by another
thread, it is effectively removed from the scheduler's tables and will
not be dispatched until the thread holding the critsec exits it. No other
threads are affected unless they too attempt to enter the same critsec.



UG> VM coros also can't utilize the kernel

UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM coros.

This can be done by having each thread of the co-process run within its
own VM thread, and converting that VM thread to a fibre (in win32 terms)
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp].

As each VM thread already has the wherewithal to retain *all* state
across scheduler switches, once the thread is converted to a fibre, it
retains these same abilities. The only change is that now, instead of the
switching between VMs occurring at the scheduler's behest, it occurs as a
result of a user program directive.

I'm not sure what the Parrot, IMCC or P6 language-level terms for the
transfer of control are or will be, but (at the win32 level) it would
just translate to SwitchToFiber( lpfibre );
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

That's not a complete spec., but it's the root of the idea.

UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there are no
UG> simple answers. we have to bite some performance bullets somewhere due
UG> to the very lofty functional goals we have set.

Likewise, I hope that the above, in conjunction with the offsite links, will
serve as a clarification of my ideas, and as evidence that I did not arrive
at the proposals without considerable thought and research.

That's no guarantee that they are ultimately workable, and that, in part,
was the source of my reluctance to post them here earlier. I am working
on a trivial implementation of most of the core ideas by way of a
proof-of-concept, but it will be some time before this is demonstrable.

I fully appreciate that I am late to the party and it may be too late to
consider altering the design of Parrot to accommodate these ideas.

It may well be that some or all of the facilities upon which the concept
is based are simply too Win32-specific to translate into a cross-platform
implementation.

It's also more than just possible that there are gaping holes that I
haven't addressed, or even considered.

My purpose in posting, was prompted by Dan's opening of the discussion, and
with the hope that the skills and expertise of the denizens of this list
would both scrutinise the ideas (as you did) and perhaps be able to relate
my win32 view of the world into *nix (and other) views. I have no doubts
that even if some of the ideas I have postulated could be used in Parrot
to good effect, that they would require (perhaps considerable) refinement
and translation along the way by all those people here that are far more
experienced with interpreter and VM design than I.

Whilst I have tried to cover everything I can foresee, there is so much
that I do not yet understand about Parrot in particular, and VM interpreters
in general, that I fully accept that none of this might fly here.

But as someone recently pointed out, there is no point in having ideas
unless you are prepared to voice them, and support them against summary
dismissal. If Dan speaks up and says

"It too much, and too little, too late"
or
"it is simply too platform specific and too speculative to even consider"

I accept that.

UG> the key is to keep the
UG> design and api clean and as elegant as possible while keeping up the
UG> performance. and almost all of this is moot in a pure event system which
UG> is why i like them better than threads. :)

Once again, thanks for taking the time to read and digest my original post,
and for your comprehensive feedback. Without the feedback, it is very
difficult to know which parts of what I wrote need clarification. I had
several attempts at writing the original up. They varied from a 12,000
word diatribe that tried (and probably failed) to explain every detail
of the ideas, which would never have been read and would have patronised
anyone brave enough to read it, to the other extreme: a version that talked
totally in terms of win32 APIs with no explanation at all.

That would have been like I was talking Greek to most people here.

UG> uri

Regards,
and forever holding my peace :),

Nigel Sandever.

Uri Guttman

Jan 3, 2004, 1:48:07 AM1/3/04
to Nigel Sandever, perl6-i...@perl.org
>>>>> "NS" == Nigel Sandever <nigels...@btconnect.com> writes:

NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/ preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

NS> I cannot emphasise enough that Critical Sections are different to
NS> both Events and Semaphores.

NS> The distinction is that a critical section does not disable time
NS> slicing/ preemption. It prevents any thread attempting to enter an
NS> owned critical section from receiving a timeslice until the current
NS> owner relinquishes it. This means that other threads in the same process
NS> and other processes continue to timeslice preemptively. Only the
NS> thread(s) waiting to enter the specified critical section are blocked.
NS> There is no 'spinning' involved.

it has to disable slicing/preemption if there are multiple kernel
threads running that have access to that data.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

NS> They are indeed a form of mutex, but the process space state and
NS> avoiding the need to switch to kernel mode, is crucial, as it is
NS> this that makes them lighter weight than a conventional IPC Mutex.

NS> A CRITICAL_SECTION is a DWORD allocated in user space. It must be
NS> allocated by the user space code, and then initialised through a
NS> syscall. Once initialised, it cannot be moved, copied or modified
NS> (directly) by the user code.

NS> If you have the time and inclination, there is a brief (25 line)
NS> description of them at
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/ semaphore is needed) so the kernel slicing will work.

NS> Indeed. A simple flag *is* all that is necessary to protect shared (fat)
NS> data structures, *if* the VMs are the only code that could achieve
NS> concurrent access to the contents of the PMC, and these are already
NS> interlocked at the operational level.

NS> A PMC is an object, and the only code that can affect the state of
NS> that object is its methods. The methods can only be invoked by
NS> the VM. Each method constitutes an (atomic) operation at the VM
NS> level.

NS> All that is required to protect an object from corruption through
NS> concurrent access and state change is to prevent two (or more) VMs
NS> trying to access the same object simultaneously. In order for the VM to
NS> address the PMC and load it into a VM register, it must know where
NS> it is. Ie. It must have a pointer. It can use this pointer to access
NS> the PMC with a Bit Test and Set operation. (x86 BTS)
NS> [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
NS> operation. This occurs before the VM enters the VM operations critical
NS> section.
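
[ A sketch of the interlock just described, as GCC inline assembly.
  BTS is the documented x86 instruction; the flag-word layout and the
  SwitchToThread() yield (from <windows.h>) are illustrative
  assumptions, not Parrot code: ]

    /* atomically set bit 0 of the PMC's flag word and return its
       previous value: 1 means another VM already owns the PMC */
    static inline int pmc_try_claim(volatile long *flags)
    {
        unsigned char was_set;
        __asm__ __volatile__("lock btsl $0, %1\n\tsetc %0"
                             : "=q"(was_set), "+m"(*flags)
                             : /* no inputs */
                             : "cc", "memory");
        return was_set;
    }

    static void pmc_claim(volatile long *flags)
    {
        while (pmc_try_claim(flags))
            SwitchToThread();   /* yield the rest of the timeslice */
    }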

ding! ding! ding! you just brought in a cpu specific instruction which
is not guaranteed to be on any other arch. in fact many have such a
beast but again, it is not accessible from c. and still some archs don't
have it so it has to be emulated by dijkstra's algorithm. and that
requires two counters and a little chunk of code.

you can't bring x86 centrism into this. the fact that redmond/intel
threads can make use of this instruction to do 'critical sections' is a
major point why it is bad for a portable system with many architectures
to be supported.

NS> Once the bit has been set in the PMC header by one VM, if the scheduler
NS> intervened exactly before the next instruction in that thread, and
NS> scheduled a second VM thread to run, and that VM also attempts to access
NS> that same PMC, the first operation it tries to perform is the same BTS
NS> instruction. The next instruction then tests the CF (Carry flag) and
NS> knows that the PMC is already in use. Hence, it then relinquishes
NS> (yields) the rest of its timeslice.

i am well aware about test/set and atomic instructions. my point in my
previous reply was that it can't be assumed to be on a particular
arch. so it has to be assumed to be on none and it needs a pure software
solution. sure a platform specific macro/assembler lib could be done but
that is a pain as well.

NS> Every time the second (or subsequent) VM attempts to use a PMC
NS> that is already in-use, it finds the bit set and immediately
NS> yields. This will only happen during the interval between the
NS> first VM issuing the BTS, and that same VM entering the
NS> critsec. Once the critsec has been entered, no other VM that could
NS> load that PMC, would even be scheduled.

yield is also a kernel specific thing and what will wake up the thread
after that? it has to block on some kernel thingy and the test/set flag
is not one. you are conflating the vm/kernel boundaries here with
intel specific solutions. the only sure way to block a kernel thread
is a kernel level mutex/lock/semaphore. but that is not a good solution
to add to every PMC that could be shared or it has to be added at
runtime to a PMC that is being shared. we want fine grained locking
which means locks around each shared thing. but that would mean a lock
inside each shared thing and that is a lot of wasted overhead for all
those things that aren't shared. and it is a slower solution as well.

and above all of this is any GC or event handling threads that might
need to lock everything in the entire process so they may need a global
lock that all threads have to obey.

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

NS> I think this is all covered above.

i disagree. it is covered for the intel case only and that is not good
enough.
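
[ For reference, the shape of the pure-software fallback being argued
  over. This is Peterson's algorithm, a two-thread descendant of the
  dijkstra-era solutions: two flags plus a turn variable. Note that
  the loser spins rather than blocks, which is exactly the objection
  raised here, and that real hardware would also need memory barriers: ]

    volatile int flag[2];   /* flag[i]: thread i wants the lock */
    volatile int turn;      /* which thread must defer */

    void sw_lock(int self)  /* self is 0 or 1 */
    {
        int other = 1 - self;
        flag[self] = 1;     /* announce intent */
        turn = other;       /* offer to go second */
        while (flag[other] && turn == other)
            ;               /* spin until the other side is done */
    }

    void sw_unlock(int self)
    {
        flag[self] = 0;
    }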

NS> If my description above is to be believed, then the BTS opcode + a JCXZ
NS> [http://www.itis.mn.it/linux/quarta/x86/jcc.htm]
NS> to skip over the yield are the only penalty of leaving the VM
NS> interlocking code in place for non-threaded versions. As the flag would
NS> never be found set on entry, the jump would always be taken and the yield code
NS> would never come into play.

NS> The costs associated with entering and exiting critsec are
NS> minimal, but this code could be bypassed until a flag is set in the
NS> interpreter to indicate that a second VM thread has been created.

again, intel specific and it has to be tossed out.

NS> CONCLUSION (TENTATIVE, NO PROOF)

NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture has no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

NS> Again. I think everything here was dealt with above.

again, i disagree.

NS> As all VMI internal state is encapsulated within the VMI register
NS> set, and each thread has its own set of registers, the internal
NS> state of the VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the
UG> interpreter can see them. the possibly shared things are variables
UG> and data.

NS> By "internal state of the VMI" (interpreter), I was referring to
NS> such things as the VHLL program that the interpreter is
NS> running. If a subroutine comes into existence at runtime (through
NS> eval for example) then the user program changes, and with it, the
NS> state of the interpreter. If the garbage collector runs, again
NS> the internal state of the interpreter is modified.

NS> If these operations, like all user-program directed operations,
NS> are run within the VM as atomic operations, then the state of the
NS> interpreter is protected.

but atomic operations are not available on all platforms. so you need
more work for that and since you have kernel threads you need kernel
locks. test/set in user space is not good enough as you can't block in
user space nor yield and have the kernel wake you up later. the kernel
has to be aware of the test/set location and that means it can't all be
done in user space.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

NS> This is a matter of design. The suggestion that memory allocation
NS> was managed at the class level was central to the proposal. If
NS> memory needs to be allocated as part of a VM operation, then it is
NS> associated with a single PMC; the subject of the operation. If the
NS> pool of memory used by that PMC is entirely local to the PMC and
NS> allocated by an operation within the PMC, then a failure to obtain
NS> memory requires that only *that* pool of memory need be garbage
NS> collected. As such, the GC cycle can be run on that pool, as a
NS> (nested) VM level operation. The limited scope of the pool to be
NS> GC'd makes this a much cheaper increment than global GC would be
NS> in the same circumstances. If the GC doesn't succeed in locating
NS> and freeing space, another (few) pages of the reserved allocation
NS> are then committed.
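
[ A sketch of the per-class pool scheme as described. pool_gc() and
  pool_commit_pages() are hypothetical helpers, named here purely for
  illustration; nothing below is Parrot's actual allocator: ]

    #include <stddef.h>

    typedef struct Pool {
        size_t obj_size;    /* every object in a pool is one size */
        char  *next, *end;  /* bump region within committed pages */
        void  *free_list;   /* slots recovered by the pool's own GC */
    } Pool;

    int  pool_gc(Pool *p);              /* nested VM-level op; returns
                                           the number of slots freed */
    void pool_commit_pages(Pool *p, size_t bytes);  /* extends p->end */

    void *pool_alloc(Pool *p)
    {
        void *obj;
        if (!p->free_list && p->next + p->obj_size > p->end) {
            if (!pool_gc(p))                /* GC only *this* pool */
                pool_commit_pages(p, 4096); /* else grow by a page */
        }
        if (p->free_list) {                 /* reuse a freed slot */
            obj = p->free_list;
            p->free_list = *(void **)obj;
        } else {
            obj = p->next;                  /* bump allocate */
            p->next += p->obj_size;
        }
        return obj;
    }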

but what about shared PMCs? which ram pool gets GC'ed or allocated when
one of the sharing threads needs to deal with it?

NS> No other pool is affected. No shuffling or copying is required. See below.

again, sharing breaks that. thread specific pools still need to be aware
of each other so threads need to lock them with kernel locks.

NS> By moving away from a single central pool, or even many central
NS> pools of similarly sized objects, to a much finer grained set of
NS> pools, each managed directly by the consumer class of that
NS> pool, allocation and GC become operations of the PMC (class)
NS> itself.

they are already operations of the PMC for most cases. but sharing
breaks that and so does GC which can happen inside an operation and not
just between them.

NS> Indeed, multiple interpreters within the same process would be possible.
NS> And each interpreter would be able to have multiple VM threads. Two or
NS> more interpreters (and the VM(s) that run them) would never, in fact,
NS> could never, interfere with each other. Their state, including their
NS> memory pools, garbage collectors and everything else, would be kept
NS> entirely separate.

there is one VM to a kernel thread and one interpreter per VM. the
parrot VM is effectively an interpreter per parrot thread. you can't get
away from that.

UG> and the classic problem is there too, which
UG> thread gets what events which is particularly nasty with signals.

NS> This only becomes a problem if the process can be considered to be more
NS> than a single program -- as with pseudo-forked children.

not in a unix environment. real multi-threaded processes already have
this problem.

NS> However, if the raiser of the signal can somehow identify the
NS> thread it wishes to signal, then there is no problem tagging the
NS> signal with its destination and having each thread only respond
NS> to signals tagged for it. Where a signal is raised by the OS, which
NS> has no way to target individual threads/interpreters, then the
NS> signal would be acted upon by ALL threads in the process.

that is a kernel issue and most signalling systems don't have thread
specific arguments AFAIK.

NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

NS> Virtual memory allocation
NS> [http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp]
NS> can be done in terms of reservation rather than committal.

NS> Each class would reserve some sized lump of virtual memory, but
NS> only commit a small amount (in 4k pages) of the total reserved.

virtual ram is what counts on unix. you can't request some large amount
without using real swap space. it may not allocate real ram until later
(page faulting on demand) but it is swap space that counts. it is used
up as soon as you allocate it.

UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

NS> Theoretically, each VirtualAlloc call can reserve as much Virtual memory
NS> as it likes (up to the limit of physical memory size) and then commit
NS> this in 4K chunks on demand. The fact that no two such 'full-memory'
NS> reservations could ever be completely committed doesn't matter. The
NS> OS/hardware takes care of mapping the committed virtual memory to
NS> physical memory. As far as the program is concerned, there is no
NS> shuffling of memory required to expand any single allocation.
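
[ The reserve/commit split in code, using the documented Win32 calls;
  the 256MB figure and the helper names are arbitrary illustrations: ]

    #include <windows.h>

    static char  *pool_base;
    static size_t pool_committed;

    void pool_reserve(void)
    {
        /* claims address space only; no RAM or pagefile is used yet */
        pool_base = VirtualAlloc(NULL, 256 * 1024 * 1024,
                                 MEM_RESERVE, PAGE_NOACCESS);
    }

    void pool_grow(void)
    {
        /* commit the next 4K page, on demand */
        VirtualAlloc(pool_base + pool_committed, 4096,
                     MEM_COMMIT, PAGE_READWRITE);
        pool_committed += 4096;
    }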

again, redmond specific. physical ram is never the issue and virtual ram
is used as you allocate it. there is no way to separate the two on unix.

NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

UG> you seem to be crossing the kernel and VM boundaries here.

NS> I don't believe I have.



UG> parrot will
UG> have co-routines but at the VM level.

NS> In the model I describe, a VM thread *is* a kernel thread.

which means a kernel level lock and not a user level lock. this is the
disagreement we have. this is where you are crossing the kernel/VM
boundary (even if you think not).

UG> they won't map easily onto any form of kernel coros since parrot
UG> needs to keep a reference to a parrot level stack and all sorts of
UG> VM level info in the coro object/thing.

NS> Agreed. This is why each VM thread, in addition to having its own
NS> set of VM registers, also needs its own VM stack(s), just as
NS> kernel threads have at the CPU level.

that is expected. it is the locking issue where we disagree.

UG> this problem is similar to the problem of mapping VM threads onto
UG> kernel threads. there are many ways to do it and none are perfect.

NS> I doubt this mechanism gets any closer to perfect than any other, but
NS> I do think that it:
NS> a) is feasible, on win32 at least.
NS> b) has some unique features worthy of consideration.

what win32 does is not portable and not useful at a VM level. kernels
and their threads can work well together. portable VMs and their threads
are another beast that can't rely on a particular architecture
instruction or kernel feature.

NS> I don't know if all the requisite (win32) OS facilities are
NS> available on all other threaded platforms. There is no doubt in my
NS> mind that equivalent facilities are either available (probably
NS> under different names), or could be made available. Linux runs on
NS> the same hardware as Win32, and the rest is just software. Whether
NS> anything that doesn't exist could be fabricated at a level that
NS> wouldn't mean a kernel rebuild I simply don't know.

hardware is tossed out with portable VM design. we have a user space
process written in c with a VM and VM threads that are to be based on
kernel threads. the locking issues for shared objects is the toughest
nut to crack right now. there is no simple or fast solution to this
given the known constraints. intel/redmond specific solutions are not
applicable (though we can learn from them).

UG> a VM just doesn't have the internal granularity of a kernel and
UG> the ability to truly block threads when waiting.

NS> The scheme of using a single critsec to arbitrate and interlock
NS> between VM threads is the basis of the mechanism for doing exactly
NS> this. When a thread attempts to enter a critsec that is already
NS> entered by another thread, it is effectively removed from the
NS> scheduler's tables and will not be dispatched until the thread
NS> holding the critsec exits it. No other threads are affected unless
NS> they too attempt to enter the same critsec.

this is where i feel you are contradicting yourself. you just did a
kernel level block in this critical section but above you claim it can
be done with just a user level test/set instruction. the kernel handles
scheduling but that means a thread has to block on a lock or yield on
some kernel level thing. a user level test/set will never get the
kernel's attention and the thread will have to spin on the test.

UG> VM coros also can't utilize the kernel
UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM
UG> coros.

NS> By having both (each) thread in the co-process running within
NS> its own VM thread, and converting that VM thread to a fibre (in
NS> win32 terms)
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

NS> As each VM thread already has the wherewithal to retain *all*
NS> state across scheduler switches, once the thread is converted to a
NS> fibre, it retains these same abilities. The only change is that
NS> now, instead of the switching between VMs occurring at the
NS> scheduler's behest, it occurs as a result of a user program
NS> directive.

ok, i can see where you got the test/set and yield stuff from now.
fibres seem to be manually scheduled threads. this means user code has
to decide to yield to another thread blocking on the same lock. this
means all locks must have a list of threads/fibres blocking on that
lock. and manually scheduled stuff is not appropriate for
parrot. finally, it is redmond specific again and very unportable. i
have never heard of this fibre concept on any unix flavor.

NS> I'm not sure what the Parrot, IMCC or P6 language level terms for
NS> the transfer of control are or will be, but at the (win32 level)
NS> it would just translate to SwitchToFiber( lpfibre );
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]
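
[ A minimal fibre round trip, to show how explicit the switching is.
  The calls are the documented Win32 ones; the empty coro body is an
  illustration, not Parrot code: ]

    #include <windows.h>

    static LPVOID main_fibre;

    static VOID CALLBACK coro(LPVOID arg)
    {
        /* ... run one VM thread's slice of work ... */
        SwitchToFiber(main_fibre);   /* user-directed, never preempted */
    }

    int main(void)
    {
        LPVOID f;
        main_fibre = ConvertThreadToFiber(NULL);
        f = CreateFiber(0, coro, NULL);
        SwitchToFiber(f);            /* control passes to coro ... */
        DeleteFiber(f);              /* ... and comes back here */
        return 0;
    }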

and that is not possible on other platforms. also it means parrot would
have to have its own scheduler, with all the pain that brings.

UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there
UG> are no simple answers. we have to bite some performance bullets
UG> somewhere due to the very lofty functional goals we have set.

NS> Likewise, I hope that the above, in conjunction with the offsite
NS> links, will serve as a clarification of my ideas, and as evidence
NS> that I did not arrive at the proposals without considerable
NS> thought and research.

NS> That's no guarantee that they are ultimately workable, and that,
NS> in part, was the source of my reluctance to post them here
NS> earlier. I am working on a trivial implementation of most of the
NS> core ideas by way of a proof-of-concept, but it will be some time
NS> before this is demonstrable.

NS> I fully appreciate that I am late to the party and it may be too
NS> late to consider altering the design of Parrot to accommodate
NS> these ideas.

NS> It may well be that some or all of the facilities upon which the
NS> concept is based are simply too Win32 specific to translate into a
NS> cross-platform implementation.

that is a big point and one that i don't see as possible. redmond can do
what they want with their kernel and user procs. parrot can only use
kernel concepts that are reasonably portable across most platforms.
kernel threads are reasonably portable (FSDO of reasonable) but anything
beyond that such as fibres, test/set in user space, etc is not. locks
have to be in kernel space since we can't do a fibre yield in user space
on any other platform. so this rules out user space test/set as well
since that would need a thread to spin instead of blocking.

your ideas make sense but only on redmond/intel which is not the target
space for parrot.

thanx,

Nigel Sandever

Jan 3, 2004, 2:20:39 AM1/3/04
to perl6-i...@perl.org
On Sat, 03 Jan 2004 01:48:07 -0500, u...@stemsystems.com (Uri Guttman) wrote:
> ding! ding! ding! you just brought in a cpu specific instruction which
> is not guaranteed to be on any other arch. in fact many have such a
> beast but again, it is not accessible from c.

> you can't bring x86 centrism into this. the fact that redmond/intel
> threads can make use of this instruction to do 'critical sections' is a
> major point why it is bad for a portable system with many architectures
> to be supported.

> i disagree. it is covered for the intel case only and that is not good
> enough.

> again, intel specific and it has to be tossed out.

> but atomic operations are not available on all platforms.

> not in a unix environment. real multi-threaded processes already have
> this problem.

> that is a kernel issue and most signalling systems don't have thread
> specific arguments AFAIK.

> virtual ram is what counts on unix. you can't request some large amount
> without using real swap space.

> again, redmond specific.

> what win32 does is not portable and not useful at a VM level.

> ok, i can see where you got the test/set and yield stuff from now....

> finally, it is redmond specific again and very unportable. i
> have never heard of this fibre concept on any unix flavor.

> and that is not possible on other platforms.

> that is a big point and one that i don't see as possible. redmond can do
> what they want with their kernel and user procs. parrot can only use
> kernel concepts that are reasonably portable across most platforms.
> kernel threads are reasonably portable (FSDO of reasonable) but anything
> beyond that such as fibres, test/set in user space, etc is not. locks
> have to be in kernel space since we can't do a fibre yield in user space
> on any other platform. so this rules out user space test/set as well
> since that would need a thread to spin instead of blocking.
>
> your ideas make sense but only on redmond/intel which is not the target
> space for parrot.

That's pretty much the crux. I don't know what is available (in detail) on
other platforms. Hence I needed to express the ideas in terms I understand
and explain them sufficiently that they could be viewed, interpreted and
either related to similar concepts on other platforms, or shot down.

I accept your overall judgement, though not necessarily all the specifics.

Maybe it would be possible (for me + others) to write the core of a win32 specific,
threaded VM interpreter that would take parrot byte code and run it. Thereby,
utilising all the good stuff that precedes the VM interpreter, plus probably large
chunks of the parrot VM, but provides it with a more native compatible target.

That's something that is obviously not a simple project and is beyond the scope of
this list.

Thanks for taking the time to review this.

Nigel Sandever.


Luke Palmer

Jan 3, 2004, 2:51:33 AM1/3/04
to Nigel Sandever, perl6-i...@perl.org
Nigel Sandever writes:
> Maybe it would be possible (for me + others) to write the core of a win32 specific,
> threaded VM interpreter that would take parrot byte code and run it. Thereby,
> utilising all the good stuff that precedes the VM interpreter, plus probably large
> chunks of the parrot VM, but provides it with a more native compatible target.

You want to write a parrot? Um, good luck.

One thing Uri mentioned was writing platform specific macro code, and he
dismissed it as "also a pain". While I agree that it's a
pain, and it's about as maintenance-friendly as JIT, I don't think it
has to be ruled out.

Parrot is platform-independent, but that doesn't mean we can't take
advantage of platform-specific instructions to make it faster on certain
machines. Indeed, this is precisely what JIT is.

But a lock on every PMC is still pretty heavy for those non-x86
platforms out there, and we should avoid it if we can.

Luke

Leopold Toetsch

Jan 3, 2004, 5:35:37 AM1/3/04
to Nigel Sandever, perl6-i...@perl.org
Nigel Sandever <nigels...@btconnect.com> wrote:

[ Line length adjusted for readability ]

> VIRTUAL MACHINE INTERPRETER

> At any given point in the running of the interpreter, the VM register
> set, program counter and stack must represent the entire state for
> that thread.

That's exactly, what a ParrotInterpreter is: "the entire state for a
thread".

> I am completely against the idea of VM threads having a 1:1
> relationship with interpreters.

While these can be separated it's not efficient. Please note that Parrot
is a register-based VM. Switching state is *not* switching a
stack-pointer and a PC thingy like in stack-based VMs.

> ... The runtime costs of duplicating all


> shared data between interpreters, tying all shared variables, added to
> the cost of locking, throw away almost all of the benefit of using
> threads.

No, shared resources are not duplicated, nor tied. The locking of course
remains, but that's it.

> The distinction is that a critical section does not disable time
> slicing/ preemption. It prevents any thread attempting to enter an
> owned critical section from receiving a timeslice until the current
> owner relinquishes it.

These are platform specific details. We will use whatever the
platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
The LOCK() can be any atomic operation and doesn't need to call the
kernel, if the lock is acquired.
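
[ One plausible shape for that pair, sketched with GCC-style atomic
  builtins standing in for the platform layer; mutex_wait() and
  mutex_wake() are hypothetical kernel-level helpers, and none of
  this is Parrot's actual macro code: ]

    #define LOCK(m)                                              \
        do {                                                     \
            /* fast path: atomic test-and-set, no kernel call */ \
            if (__sync_lock_test_and_set(&(m)->held, 1))         \
                mutex_wait(m);  /* slow path: sleep in kernel */ \
        } while (0)

    #define UNLOCK(m)                                            \
        do {                                                     \
            __sync_lock_release(&(m)->held);                     \
            mutex_wake(m);  /* wake any kernel-level waiter */   \
        } while (0)

( mutex_wait() would itself have to retry the flag before sleeping;
  the fast path only wins when locks are mostly uncontended. )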

[ Lot of Wintel stuff snipped ]

> Nigel Sandever.

leo

Leopold Toetsch

Jan 3, 2004, 5:43:19 AM1/3/04
to Nigel Sandever, perl6-i...@perl.org
Nigel Sandever <nigels...@btconnect.com> wrote:
> On Sat, 03 Jan 2004 01:48:07 -0500, u...@stemsystems.com (Uri Guttman) wrote:

>> your ideas make sense but only on redmond/intel which is not the target
>> space for parrot.

s/not the/by far not the only/

> Maybe it would be possible (for me + others) to write the core of a
> win32 specific, threaded VM interpreter that would take parrot byte
> code and run it. Thereby, utilising all the good stuff that preceeds
> the VM interpreter, plus probably large chunks of the parrot VM, but
> provides it with a more native compatible target.

I'd be glad, if someone fills the gaps in platform code. There is no
need to rewrite the interpreter or such: Defining the necessary macros
in an efficient way should be enough to start with.

> Nigel Sandever.

leo

Uri Guttman

Jan 3, 2004, 12:15:41 PM1/3/04
to l...@toetsch.at, Nigel Sandever, perl6-i...@perl.org
>>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:

LT> These are platform specific details. We will use whatever the
LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
LT> The LOCK() can be any atomic operation and doesn't need to call the
LT> kernel, if the lock is acquired.

if it doesn't call the kernel, how can a thread be blocked? you can't
have user level locks without spinning. at some point (even with fibres)
you need to make a kernel call so other threads can run. the macro layer
will make the mainline source look better but you still need kernel
calls in their definition.

Elizabeth Mattijsen

Jan 3, 2004, 1:08:26 PM1/3/04
to Uri Guttman, l...@toetsch.at, Nigel Sandever, perl6-i...@perl.org
At 12:15 -0500 1/3/04, Uri Guttman wrote:
> >>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:
> LT> These are platform specific details. We will use whatever the
> LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
> LT> The LOCK() can be any atomic operation and doesn't need to call the
> LT> kernel, if the lock is acquired.
>if it doesn't call the kernel, how can a thread be blocked?

Think out of the box!

Threads can be blocked in many ways. My forks.pm module uses sockets
to block threads ;-).

It sucks performance wise, but it beats the current perl ithreads
implementation on many platforms in many situations.

Therefore my motto: whatever works, works.
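
[ The trick in miniature, with POSIX calls; the gate_* names are
  illustrative, not forks.pm internals: ]

    #include <sys/socket.h>
    #include <unistd.h>

    static int gate[2];   /* [0] = waiting end, [1] = releasing end */

    void gate_init(void)
    {
        socketpair(AF_UNIX, SOCK_STREAM, 0, gate);
    }

    void gate_wait(void)      /* blocks in the kernel, no spinning */
    {
        char c;
        read(gate[0], &c, 1);
    }

    void gate_release(void)   /* lets one blocked waiter through */
    {
        write(gate[1], "x", 1);
    }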


>...you can't
>have user level locks without spinning. at some point (even with fibres)
>you need to make a kernel call so other threads can run.

Possibly. I don't know enough of the specifics to say whether this is true or not.


>the macro layer
>will make the mainline source look better but you still need kernel
>calls in their definition.

Possibly again. But remember, since Parrot started, hardware has
roughly grown by a factor of 4 in power (if I'm to believe in Moore's
law).

Liz

Nigel Sandever

Jan 3, 2004, 1:20:15 PM1/3/04
to perl6-i...@perl.org
On Sat, 3 Jan 2004 11:35:37 +0100, l...@toetsch.at (Leopold Toetsch) wrote:
> Nigel Sandever <nigels...@btconnect.com> wrote:
>
> > VIRTUAL MACHINE INTERPRETER
>
> > At any given point in the running of the interpreter, the VM register
> > set, program counter and stack must represent the entire state for
> > that thread.
>
> That's exactly, what a ParrotInterpreter is: "the entire state for a
> thread".

This is only true if a thread == interpreter.
If a single interpreter can run 2 threads then that single interpreter
cannot represent the state of both threads safely.

>
> > I am completely against the idea of VM threads having a 1:1
> > relationship with interpreters.
>

With 5005threads, multiple threads exist in a single interpreter.

All VHLL level data is shared without duplication, but locking has
to be performed on each entity. This model is more efficient than
ithreads.
However, it was extremely difficult to prevent unwanted interaction
between the threads corrupting the internal state of the interpreter
given the internal architecture of P5 and so it was abandoned.

With ithreads, each thread is also a separate interpreter.

This insulates the internal state of one interpreter from the other
but also insulates *all* perl level program data in one interpreter
from the perl level data in the other.

Spawning a new thread becomes a process of duplicating everything.
The interpreter, the perl program, and all its existing data.

Sharing data between the threads/interpreters is implemented by
tieing the two copies of the variables to be shared and each time
a STORE is performed in one thread, the same STORE has to be
performed on the copy of that var held in every other thread's
dataspace.

If 2 threads need to share a scalar, but the program has 10 other
threads, then each write to the shared scalar requires the update
of all 12 copies of that scalar. There is no way to indicate that
you only need to share it between threads x & y.

With ithreads, there can be no shared references, so no shared
objects and no shared compound data structures.

> While these can be separated its not efficient. Please note that Parrot
> is a register-based VM. Switching state is *not* switching a
> stack-pointer and a PC thingy like in stack-based VMs.
> > ... The runtime costs of duplicating all
> > shared data between interpreters, tying all shared variables, added to
> > the cost of locking, throw away almost all of the benefit of using
> > threads.
> No, shared resources are not duplicated, nor tied. The locking of course
> remains, but that's it.

The issues above are what make p5 ithreads almost unusable.

If Parrot has found a way of avoiding these costs and limitations
then everything I offered is a waste of time, because these are
the issues I was attempting to address.

However, I have seen no indication here, in the sources or anywhere
else that this is the case. I assume that the reason Dan opened the
discussion up in the first place is because the perception by those
looking on was that the p5 ithreads model was being suggested as the
way Parrot was going to go.

And the reaction from those who have tried to make use of ithreads
under p5 are all too aware that replicating them for Parrot would
be ..... [phrase deleted as too emotionally charged:)]

> leo

Nigel.

Elizabeth Mattijsen

Jan 3, 2004, 2:05:13 PM1/3/04
to Nigel Sandever, perl6-i...@perl.org
At 18:20 +0000 1/3/04, Nigel Sandever wrote:
> Sharing data between the threads/interpreters is implemented by
> tieing the two copies of the variables to be shared and each time
> a STORE is performed in one thread, the same STORE has to be
> performed on the copy of that var held in every other thread's
> dataspace.

Hmmm.... is that true? My impression was (and that's the way I've
implemented it in Thread::Tie) that each STORE actually stores the
value in a hidden background thread, and each FETCH obtains the
current value from the background thread. I don't think each STORE
is actually cascading through all of the threads. Not until they try
to fetch the shared value, anyway.


> If 2 threads need to share a scalar, but the program has 10 other
> threads, then each write to the shared scalar requires the update
> of all 12 copies of that scalar. There is no way to indicate that
> you only need to share it between threads x & y.

That's true. But there is some implicit grouping. You can only
create newly shared variables for the current thread and any thread
that is started from the current thread. This is actually a pity, as
that precludes you from starting a "bare" (minimal number of Perl modules
loaded, or with exactly the modules that _you_ want) thread at
compile time, from which you could start other threads.


> With ithreads, there can be no shared references, so no shared
> objects and no shared compound data structures

Actually, you can bless a reference to a shared variable, but you
can't share a blessed object (the sharing will let you lose the
content of the object). I think shared compound data structures
_are_ possible, but very tricky to get right (because the CLONE sub
is called as a package method, rather than as an object method: see
Thread::Bless for an attempt at getting around this).


>> While these can be separated it's not efficient. Please note that Parrot
>> is a register-based VM. Switching state is *not* switching a
>> stack-pointer and a PC thingy like in stack-based VMs.
>> > ... The runtime costs of duplicating all
>> > shared data between interpreters, tying all shared variables, added to
>> > the cost of locking, throw away almost all of the benefit of using
>> > threads.
>> No, shared resources are not duplicated, nor tied. The locking of course
>> remains, but that's it.
>The issues above are what make p5 ithreads almost unusable.

For more about this, see my article on Perl Monks:

http://www.perlmonks.org/index.pl?node_id=288022


>And the reaction from those who have tried to make use of ithreads
>under p5 are all too aware that replicating them for Parrot would
>be ..... [phrase deleted as too emotionally charged:)]

What can I say? ;-)


Liz

Elizabeth Mattijsen

Jan 3, 2004, 2:19:08 PM1/3/04
to Uri Guttman, Nigel Sandever, perl6-i...@perl.org
I'm trying to be constructive here. Some passages may appear to be
blunt. Read at your own risk ;-)

At 01:48 -0500 1/3/04, Uri Guttman wrote:
> >>>>> "NS" == Nigel Sandever <nigels...@btconnect.com> writes:
> NS> All that is required to protect an object from corruption through
> NS> concurrent access and state change is to prevent two (or more) VMs
> NS> trying to access the same object simultaneously. In order for the VM to
> NS> address the PMC and load it into a VM register, it must know where
> NS> it is. Ie. It must have a pointer. It can use this pointer to access
> NS> the PMC with a Bit Test and Set operation. (x86 BTS)
> NS> [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
> NS> operation. This occurs before the VM enters the VM operations critical
> NS> section.
>ding! ding! ding! you just brought in a cpu specific instruction which
>is not guaranteed to be on any other arch. in fact many have such a
>beast but again, it is not accessible from c.

I just _can't_ believe I'm hearing this. So what if it's not
accessible from C? Could we just not build a little C program that
would create a small routine in whatever loadable library? Or have a
post-processing run through the binary image inserting the right
machine instructions in the right places? Not being from a *nix
background, but more from a MS-DOS background, I've been used to
inserting architecture specific machine codes from higher level
languages into executable streams since 1983! Don't tell me that's
not "done" anymore? ;-)


>.. and still some archs don't
>have it so it has to be emulated by dijkstra's algorithm. and that
>requires two counters and a little chunk of code.

Well, I guess those architectures will have to live with that. If
that's what it takes?


>you can't bring x86 centrism into this. the fact that redmond/intel
>threads can make use of this instruction to do 'critical sections' is a
>major point why it is bad for a portable system with many architectures
>to be supported.

I disagree. I'm not a redmond fan, so I agree with a lot of your
_sentiment_, but you should also realize that a _lot_ of intel
hardware is running Linux. Heck, even some Solarises run on it.

I'm not saying that the intel CPU's are superior to others. They're
probably not. But it's just as with cars: most of the cars get
people from A to B. They're not all Maserati's. Or Rolls Royce's.
Most of them are Volkswagen, Fiat and whatever compacts you guys
drive in the States. ;-)

I don't think we're making Parrot to run well on Maserati's and Rolls
Royce's only. We want to reach the Volkswagens. And not even
today's Volkswagens: by the time Perl 6 comes around, CPU's will have
doubled in power yet _again_!

The portability is in Parrot itself: not by using the lowest common
denominator of C runtime systems out there _today_! It will take a
lot of trouble to create a system that will run everywhere, but
that's just what makes it worthwhile. Not that it offers the same
limited capabilities on all systems!


>i am well aware about test/set and atomic instructions. my point in my
>previous reply was that it can't be assumed to be on a particular
>arch. so it has to be assumed to be on none and it needs a pure software
>solution. sure a platform specific macro/assembler lib could be done but
>that is a pain as well.

Indeed. A pain, probably. Maybe not so much. I can't believe that
Sparc processors are so badly designed that they don't offer
something similar to what Nigel suggested for Intel platforms.


>again, intel specific and it has to be tossed out.

Again, I think you're thinking too much inside the non-Wintel box.
You stated yourself just now "...in fact many have such beast...", so
maybe it should first be investigated which platforms do suppport
this and which don't, and then decide whether it is a good idea or an
idea to be tossed?


>virtual ram is what counts on unix. you can't request some large amount
>without using real swap space. it may not allocate real ram until later
>(page faulting on demand) but it is swap space that counts. it is used
>up as soon as you allocate it.

So, maybe a wrapper is needed, either for *nix, or for Win32, or maybe both.


>what win32 does is not portable and not useful at a VM level. kernels
>and their threads can work well together. portable VMs and their threads
>are another beast that can't rely on a particular architecture
>instruction or kernel feature.

This sounds too much like dogma to me. Why? Isn't Parrot about
borgifying all good things from all OS's and VM's, now and in the
future? ;-)


>hardware is tossed out with portable VM design. we have a user space
>process written in c with a VM and VM threads that are to be based on
>kernel threads. the locking issues for shared objects is the toughest
>nut to crack right now. there is no simple or fast solution to this
>given the known contraints. intel/redmond specific solutions are not
>applicable (though we can learn from them).

Then please lets. Instead of tossing them without further investigation.


>ok, i can see where you got the test/set and yield stuff from now.
>fibres seem to be manually scheduled threads. this means user code has
>to decide to yield to another thread blocking on the same lock. this
>means all locks must have a list of threads/fibres blocking on that
>lock. and manually scheduled stuff is not appropriate for
>parrot. finally, it is redmond specific again and very unportable. i
>have never heard of this fibre concept on any unix flavor.

It's never too late. Maybe Parrot can learn from it.


>and that is not possible on other platforms. also it means parrot would
>have to have its own scheduler with that the pain that brings.

Ah, but the joy when it _does_ come together! ;-)


>your ideas make sense but only on redmond/intel which is not the target
>space for parrot.

Wouldn't that need to read "Intel", and therefore include most of
Linuxes running out there? Are we talking CPU specific or OS
specific here? If OS specific, but still on Intel, it still is some
set of machine instructions at some level that could be borgified.
Don't tell me Redmond has the monopoly on using some of the Intel x86
machine instructions?


Liz

Jeff Clites

Jan 3, 2004, 2:41:56 PM1/3/04
to Elizabeth Mattijsen, Uri Guttman, P6I Internals

Yes, you are correct--we are already using bits of assembly in parrot.
C compilers tend to allow you to insert bits of inline assembly, and we
are taking advantage of that--for instance, look for "__asm__" in the
following files:

jit/arm/jit_emit.h
jit/ppc/jit_emit.h
src/list.c
src/malloc.c

Also, JIT is all about generating platform-specific machine
instructions at runtime. So it's certainly do-able, and right along the
lines of what we are already doing.

JEff

Leopold Toetsch

Jan 3, 2004, 2:25:50 PM1/3/04
to Uri Guttman, perl6-i...@perl.org
Uri Guttman <u...@stemsystems.com> wrote:
> >>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:

> LT> These are platform specific details. We will use whatever the
> LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
> LT> The LOCK() can be any atomic operation and doesn't need to call the
> LT> kernel, if the lock is acquired.

> if it doesn't call the kernel, how can a thread be blocked?

I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
or of the described Win32 behavior. The slow path is always a kernel
call (or some rounds of spinning before ...)
But anyway, we don't reinvent these locking primitives.
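
[ The fast/slow split being described, sketched after the well-known
  futex idiom (see Drepper, "Futexes Are Tricky"). Linux-specific,
  and the __sync_* builtins stand in for platform atomics: ]

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* lock word: 0 = free, 1 = held, 2 = held with waiters */

    static void vm_lock(int *f)
    {
        if (__sync_val_compare_and_swap(f, 0, 1) == 0)
            return;                     /* fast path: no kernel call */
        do {                            /* slow path: sleep in kernel */
            __sync_val_compare_and_swap(f, 1, 2);
            syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, NULL);
        } while (__sync_val_compare_and_swap(f, 0, 2) != 0);
    }

    static void vm_unlock(int *f)
    {
        if (__sync_fetch_and_sub(f, 1) != 1) {   /* waiters queued */
            *f = 0;
            syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, NULL);
        }
    }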

> uri

leo

Jeff Clites

Jan 3, 2004, 3:08:08 PM1/3/04
to Elizabeth Mattijsen, Uri Guttman, P6I Internals
On Jan 3, 2004, at 10:08 AM, Elizabeth Mattijsen wrote:

> At 12:15 -0500 1/3/04, Uri Guttman wrote:
>> >>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:
>> LT> These are platform specific details. We will use whatever the
>> LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
>> LT> The LOCK() can be any atomic operation and doesn't need to call the
>> LT> kernel, if the lock is acquired.
>> if it doesn't call the kernel, how can a thread be blocked?
>
> Think out of the box!
>
> Threads can be blocked in many ways. My forks.pm module uses sockets
> to block threads ;-).

IO operations which block like that end up calling into the kernel.

But I believe it is usually possible to acquire an uncontested lock
without calling into the kernel. When you do need to block (when trying
to acquire a lock which is already held by another thread) you may need
to enter the kernel. But I think that Leo's point was that in the
common case of a successful lock operation, it may not be necessary.

JEff

Leopold Toetsch

Jan 3, 2004, 3:00:31 PM1/3/04
to Nigel Sandever, perl6-i...@perl.org
Nigel Sandever <nigels...@btconnect.com> wrote:
> On Sat, 3 Jan 2004 11:35:37 +0100, l...@toetsch.at (Leopold Toetsch) wrote:
>> That's exactly, what a ParrotInterpreter is: "the entire state for a
>> thread".

> This is only true if a thread == interpreter.
> If a single interpreter can run 2 threads then that single interpreter
> cannot represent the state of both threads safely.

Yep. So if a single interpreter (which is almost a thread state) should
run two threads, you have to allocate and swap all. What should the
advantage of such a solution be?

> With 5005threads, multiple threads exist in a single interpreter.

These are obsolete.

> With ithreads, each thread is also a seperate interpreter.

> Spawning a new thread becomes a process of duplicating everything.
> The interpreter, the perl program, and all its existing data.

Partly yes. A new interpreter is created, the program, i.e. the opcode
stream is *not* duplicated, but JIT or prederef information has to be
rebuilt (on demand, if that run-core is running), and existing
non-shared data items are cloned.

> Sharing data between the threads/interpreters is implemented by
> tieing

Parrot != perl5.ithreads

> If Parrot has found a way of avoiding these costs and limitations
> then everything I offered is a waste of time, because these are
> the issues was attempting to address.

I posted a very premature benchmark result, where an unoptimized Parrot
build is 8 times faster than the equivalent perl5 code.

> And the reaction from those who have tried to make use of ithreads
> under p5 are all too aware that replicating them for Parrot would
> be ..... [phrase deleted as too emotionally charged:)]

I don't know how ithreads work internally WRT the relevant issues
like object allocation and such. But threads at the OS level provide
shared code and data segments. So at the VM level you have to unshare
non-shared resources at thread creation. You can copy objects lazily and
make 2 distinct items when writing, or you copy them in the first
place. But you have these costs at thread start - and not later.

> Nigel.

leo

Uri Guttman

Jan 3, 2004, 3:23:10 PM1/3/04
to Elizabeth Mattijsen, Nigel Sandever, perl6-i...@perl.org
>>>>> "EM" == Elizabeth Mattijsen <l...@dijkmat.nl> writes:

>> ding! ding! ding! you just brought in a cpu specific instruction which
>> is not guaranteed to be on any other arch. in fact many have such a
>> beast but again, it is not accessible from c.

EM> I just _can't_ believe I'm hearing this. So what if it's not
EM> accessible from C? Could we just not build a little C program that
EM> would create a small routine in whatever loadable library? Or have a
EM> post-processing run through the binary image inserting the right
EM> machine instructions in the right places? Not being from a *nix
EM> background, but more from a MS-DOS background, I've been used to
EM> inserting architecture specific machine codes from higher level
EM> languages into executable streams since 1983! Don't tell me that's
EM> not "done" anymore? ;-)

it is not that it isn't done anymore but the effect has to be the same
on machines without test/set. and on top of that, it still needs to be a
kernel level operation so a thread can block on the lock. that is the
more important issue that makes using a test/set in user space a moot
problem.

EM> I disagree. I'm not a redmond fan, so I agree with a lot of your
EM> _sentiment_, but you should also realize that a _lot_ of intel
EM> hardware is running Linux. Heck, even some Solarises run on it.

we are talking maybe 10-20 architectures out there that we would want
parrot to run on. maybe more. how many does p5 run on now?

EM> The portability is in Parrot itself: not by using the lowest common
EM> denominator of C runtime systems out there _today_! It will take a
EM> lot of trouble to create a system that will run everywhere, but that's
EM> just it what makes it worthwhile. Not that it offers the same limited
EM> capabilities on all systems!

but we need a common denominator of OS features more than one for cpu
features. the fibre/thread stuff is redmond only. and they still require
system calls. so as i said the test/set is not a stopping point
(dijkstra) but the OS support is. how and where and when we lock is the
only critical factor and that hasn't been decided yet. we don't want to
lock at global thread levels and we are not sure we can lock at PMC or
object levels (GC and alloc can break that). we should be focusing on
that issue. think about how DBs did it. sybase used to do page locking
(coarse grained) since it was faster (this was 15 years ago) and they
had the fastest engine. but when multithreading and multicpu designs
came in a finer grained row locking was faster (oracle). sybase fell
behind and has not caught up. we have the same choices to make so we
need to study locking algorithms and techniques from that perspective
and not how to do a single lock (test/set vs kernel). but i will keep
reiterating that it has to be a kernel lock since we must block threads
and GC and such without spinning or manual scheduling (fibres).

>> virtual ram is what counts on unix. you can't request some large amount
>> without using real swap space. it may not allocate real ram until later
>> (page faulting on demand) but it is swap space that counts. it is used
>> up as soon as you allocate it.

EM> So, maybe a wrapper is needed, either for *nix, or for Win32, or maybe both.

this is very different behavior IMO and not something that can be
wrapped easily. i could be wrong but separating virtual allocation from
real allocation can't be emulated without kernel support. and we need
the same behavior on all platforms. this again brings up how we lock so
that GC/alloc will work properly with threads. do we lock a thread pool
but not the thread when we access a shared thingy? that is a medium
grain lock. can the GC/alloc break the lock if it is called inside that
operation? or could only the pool inside the active thread do that? what
about a shared object alloced from thread A's pool that triggers an
alloc when being accessed in thread B. these are the questions that need
to be asked and answered. i was just trying to point out to nigel that
the intel/redmond solutions are not portable as they require OS
support and that all locks need to be kernel level. given that
requirement, we need to decide how to do the locks so those questions
can be answered with reasonable efficiency. of course a single global
lock would work but that stinks and we all know it. so what is the lock
granularity? how do we handle GC/alloc across shared objects?

EM> This sounds too much like dogma to me. Why? Isn't Parrot about
EM> borgifying all good things from all OS's and VM's, now and in the
EM> future? ;-)

but parrot can only use a common set of features across OS's. we can't
use a redmond feature that can't be emulated on other platforms. and my
karma ran over my dogma :( :-)

>> hardware is tossed out with portable VM design. we have a user space
>> process written in c with a VM and VM threads that are to be based on
>> kernel threads. the locking issues for shared objects is the toughest
>> nut to crack right now. there is no simple or fast solution to this
>> given the known contraints. intel/redmond specific solutions are not
>> applicable (though we can learn from them).

EM> Then please lets. Instead of tossing them without further investigation.

i didn't toss them out as much as dismiss them as a portable solution.
i have asked a bunch of questions. i would like to see some suggested
solutions.

>> ok, i can see where you got the test/set and yield stuff from now.
>> fibres seem to be manually scheduled threads. this means user code has
>> to decide to yield to another thread blocking on the same lock. this
>> means all locks must have a list of threads/fibres blocking on that
>> lock. and manually scheduled stuff is not appropriate for
>> parrot. finally, it is redmond specific again and very unportable. i
>> have never heard of this fibre concept on any unix flavor.

EM> It's never too late. Maybe Parrot can learn from it.

manual scheduling is not a good solution IMO. it means all locks must
have a good amount of code executed to yield the thread and figure out
what thread to be run next, etc. the kernel already does this with its
scheduler so we would be reinventing a big wheel. also the kernel still
does scheduling underneath our scheduling so it would be even slower.

>> and that is not possible on other platforms. also it means parrot would
>> have to have its own scheduler with that the pain that brings.

EM> Ah, but the joy when it _does_ come together! ;-)

heh. having done low level rtos stuff with its own 'scheduler' (really
event queues) i can remember the joy. but it works best on raw iron. in
user space it is much trickier. proper event loop code is in effect a
manual scheduler as you use 'return' from event handler code instead of
yield and that gets you back to the main loop. you also have to either
avoid blocking operations (i/o, etc.) or shunt them off to
threads/processes where they can block and not block the main
thread/process. to do manual scheduling of thread would be even
harder. we would need to work around every blocking operation and either
let the kernel reschedule another thread or do a test/set lock and yield
(this is from the fibre stuff i read) and wait until we get rescheduled
and do the same thing again. this is effectively spinning with kernel
yields in each cycle. not a nice solution IMO.

>> your ideas make sense but only on redmond/intel which is not the target
>> space for parrot.

EM> Wouldn't that need to read "Intel", and therefore include most of
EM> Linuxes running out there? Are we talking CPU specific or OS specific
EM> here? If OS specific, but still on Intel, it still is some set of
EM> machine instructions at some level that could be borgified. Don't tell
EM> me Redmond has the monopoly on using some of the Intel x86 machine
EM> instructions?

a combination of both IMO. the test/set instruction is intel (but many
cpus have something similar) but the fibre stuff (manual thread
scheduling) seems to be redmond specific. using a kernel lock is
portable but again brings up the lock granularity issue. i suggest we
focus on that problem and not about how we do the locking.

Uri Guttman

Jan 3, 2004, 3:26:27 PM1/3/04
to l...@toetsch.at, perl6-i...@perl.org
>>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:

LT> Uri Guttman <u...@stemsystems.com> wrote:
>> >>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:

LT> These are platform specific details. We will use whatever the
LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
LT> The LOCK() can be any atomic operation and doesn't need to call the
LT> kernel, if the lock is acquired.

>> if it doesn't call the kernel, how can a thread be blocked?

LT> I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
LT> or of the described Win32 behavior. The slow path is always a kernel
LT> call (or some rounds of spinning before ...)
LT> But anyway, we don't reinvent these locking primitives.

ok, i missed the 'if' there. :)

that could be workable and might be faster. it does mean that locks are
two step as well, user space test/set and fallback to kernel lock. we
can do what nigel said and wrap the test/set in macros and use assembler
to get at it on platforms that have it or fallback to dijkstra on those
that don't.

Uri Guttman

Jan 3, 2004, 3:32:04 PM1/3/04
to Elizabeth Mattijsen, l...@toetsch.at, Nigel Sandever, perl6-i...@perl.org
>>>>> "EM" == Elizabeth Mattijsen <l...@dijkmat.nl> writes:

EM> At 12:15 -0500 1/3/04, Uri Guttman wrote:
>> >>>>> "LT" == Leopold Toetsch <l...@toetsch.at> writes:
LT> These are platform specific details. We will use whatever the
LT> platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
LT> The LOCK() can be any atomic operation and doesn't need to call the
LT> kernel, if the lock is acquired.
>> if it doesn't call the kernel, how can a thread be blocked?

EM> Think out of the box!

EM> Threads can be blocked in many ways. My forks.pm module uses sockets
EM> to block threads ;-).

i used that design as well. a farm of worker threads blocked on a pipe
(socketpair) to the same process. the main event loop handled the other
side. worked very well.

EM> It sucks performance wise, but it beats the current perl ithreads
EM> implementation on many platforms in many situations.

i can believe that.

EM> Therefore my motto: whatever works, works.

but i discussed that solution with dan and he shot it down for speed
reasons IIRC. i still think it is an interesting solution. it could also
be used for the main event queue and/or loop as i mention above. we are
assuming some form of sockets on all platforms IIRC, so we can use
socketpair for that. i even use socketpair on win32 to test a
(pseudo)fork thing for file::slurp.

>> ...you can't
>> have user level locks without spinning. at some point (even with fibres)
>> you need to make a kernel call so other threads can run.

EM> Possibly. I don't know enough of the specifics to say whether this
EM> is true or not.

i looked at the docs for fibres and they say you do a manual reschedule
by selecting the fibre to run next or i think a yield. but something has
to go to the kernel since even fibres are kernel thingys.

Dave Mitchell

Jan 3, 2004, 4:11:25 PM1/3/04
to Elizabeth Mattijsen, Nigel Sandever, perl6-i...@perl.org
On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote:
> At 18:20 +0000 1/3/04, Nigel Sandever wrote:
> > Sharing data between the threads/interpreters is implemented by
> > tieing the two copies of the variables to be shared and each time
> > a STORE is performed in one thread, the same STORE has to be
> > performed on the copy of that var held in every other thread's
> > dataspace.
>
> Hmmm.... is that true? My impression was (and that's the way I've
> implemented it in Thread::Tie) that each STORE actually stores the
> value in a hidden background thread, and each FETCH obtains the
> current value from the background thread. I don't think each STORE
> is actually cascading through all of the threads. Not until they try
> to fetch the shared value, anyway.

Sharing consists of the real SV living in a shared interpreter, with each
individual thread having a lightweight proxy SV that causes the
appropriate real SV to be accessed/updated by a mixture of magic and/or
tied-ish access. A particular access by one thread does not involve any of
the other threads or their proxies.
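
In outline, the arrangement looks something like this - a generic C
sketch with invented names, not the actual perl5 internals:

#include <pthread.h>

typedef struct {                /* the one real SV, in the shared space */
    pthread_mutex_t lock;
    int             value;
} shared_sv;

typedef struct {                /* the per-thread lightweight proxy */
    shared_sv *real;
} proxy_sv;

static int proxy_fetch(proxy_sv *p)
{
    int v;
    pthread_mutex_lock(&p->real->lock);
    v = p->real->value;         /* no other thread or proxy is consulted */
    pthread_mutex_unlock(&p->real->lock);
    return v;
}

static void proxy_store(proxy_sv *p, int v)
{
    pthread_mutex_lock(&p->real->lock);
    p->real->value = v;         /* the STORE happens once, in one place */
    pthread_mutex_unlock(&p->real->lock);
}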

> > With ithreads, there can be no shared references, so no shared
> > objects and no shared compound data structures
>
> Actually, you can bless a reference to a shared variable, but you
> can't share a blessed object (the sharing will let you lose the
> content of the object). I think shared compound data structures
> _are_ possible, but very tricky to get right (because the CLONE sub
> is called as a package method, rather than as an object method: see
> Thread::Bless for an attempt at getting around this).

Nested shared structures work just fine, and there's no need to mess with
CLONE for plain data.

> >And the reaction from those who have tried to make use of ithreads
> >under p5 are all too aware that replicating them for Parrot would
> >be ..... [phrase deleted as too emotionally charged:)]

It's the implementation of ithreads in p5 that sucks, not the concept
itself. The use of COW makes new thread creation cheap, and the use of
lock PMCs interposed on the real PMCs makes sharing easy.

Dave.

--
O Unicef Clearasil!
Gibberish and Drivel!
- "Bored of the Rings"

Jeff Clites

unread,
Jan 3, 2004, 4:24:44 PM1/3/04
to Uri Guttman, P6I Internals
On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote:

> LT> These are platform specific details. We will use whatever the
> LT> platform/OS provides. In the source code its a LOCK() UNLOCK() pair.
> LT> The LOCK() can be any atomic operation and doesn't need to call the
> LT> kernel, if the lock is acquired.
>
>>> if it doesn't call the kernel, how can a thread be blocked?
>
> LT> I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
> LT> or of the described Win32 behavior. The slow path is always a kernel
> LT> call (or some rounds of spinning first ...)
> LT> But anyway, we don't reinvent these locking primitives.
>
> ok, i missed the 'if' there. :)
>
> that could be workable and might be faster. it does mean that locks are
> two step as well, user space test/set and fallback to kernel lock. we
> can do what nigel said and wrap the test/set in macros and use assembler
> to get at it on platforms that have it

This is probably already done inside of pthread (and Win32) locking
implementations--a check in userspace before a kernel call. It's also
important to remember that it's not a two-step process most of the
time, since most of the locking we are talking about is likely to be
uncontested (ie, the lock acquisition will succeed most of the time,
and no blocking will be needed). If we _do_ have major lock contention
in our internal locking, those will be areas calling for a redesign.

JEff

Nigel Sandever

unread,
Jan 3, 2004, 4:32:13 PM1/3/04
to perl6-i...@perl.org
On Sat, 3 Jan 2004 21:00:31 +0100, l...@toetsch.at (Leopold Toetsch) wrote:
> >> That's exactly, what a ParrotInterpreter is: "the entire state for a
> >> thread".
>
> > This is only true if a thread == interpreter.
> > If a single interpreter can run 2 threads then that single interpreter
> > cannot represent the state of both threads safely.
>
> Yep. So if a single interpreter (which is almost a thread state) should
> run two threads, you have to allocate and swap all.

When a kernel level thread is spawned, no duplication of application memory
is required, only a set of registers, program counter and stack. These
represent the entire state of that thread.

If a VM thread mirrors this, by duplicating the VM program counter,
VM registers and VM stack, then this VM thread context can also
avoid the need to replicate the rest of the program data (interpreter).
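
Concretely, the per-thread context would be something like this
(a hypothetical C sketch; none of these names are Parrot's):

#include <stddef.h>

typedef long vm_value;              /* placeholder register cell      */
struct vm_interp;                   /* the shared interpreter: bytecode,
                                       globals, GC arenas, ...        */
typedef struct vm_thread {
    size_t            pc;           /* VM program counter             */
    vm_value          regs[32];     /* VM register set                */
    vm_value         *stack;        /* per-thread VM stack            */
    size_t            sp;
    struct vm_interp *interp;       /* shared, NOT duplicated         */
} vm_thread;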

> What should the
> advantage of such a solution be?

The avoidance of duplication.
Transparent interlocking of VHLL fat structures performed automatically
by the VM itself. No need for :shared or lock().


>
> > With 5005threads, multiple threads exist in a single interpreter.
>
> These are obsolete.

ONLY because they couldn't be made to work properly. The reasons
for that are entirely due to the architecture of P5.

Dan Sugalski suggested on this list back in 2001 that he would prefer
pthreads to ithreads.

I've used both in p5, and pthreads are vastly more efficient, but flaky and
difficult to use well. These limitations are due to the architecture upon
which they were built. My interest is in seeing the Parrot architecture
not exclude them.

>
> > With ithreads, each thread is also a seperate interpreter.
>
> > Spawning a new thread becomes a process of duplicating everything.
> > The interpreter, the perl program, and all it existing data.
>
> Partly yes. A new interpreter is created, the program, i.e. the opcode
> stream is *not* duplicated, but JIT or prederef information has to be
> rebuilt (on demand, if that run-core is running), and existing
> non-shared data items are cloned.
>

Only duplicating shared data on demand (COW) may work well on systems
that support COW in the kernel. But on systems that don't, this has to be
emulated in user space, with all the inherent overhead that implies.

My desire was that the VM_Spawn_Thread, VM_Share_PMC and
VM_Lock_PMC opcodes could be coded such that, on those platforms
where the presence of kernel level COW and other native features means
the ithreads-style model of VMthread == kernel thread + interpreter
is the best way to go, that would be the underlying implementation.

On those platforms where VMthread == kernel thread + VMthread context
is the best way, that would be the underlying implementation.

In order for this to be possible, a certain level of support for
both must be ingrained in the design of the interpreter.

My (long) original post, with all the subjects covered and details given,
was my attempt to describe the support required in the design for the
latter. It would be necessary to consider all the elements, and the way
they interact, and take these into consideration when implementing
Parrot's threading in order that this would be achievable.

Each element (the separation of the VM state from the interpreter state,
the atomisation of VM operations, the automated detection and locking of
concurrent access attempts, and the serialisation of the VM threads when
contention is detected) needs support at the highest level before it can
be implemented at the lowest (platform specific) levels.

It simply isn't possible to implement them on one platform at the lowest
levels unless the upper levels of the design are constructed with the
possibilities in mind.

> > Sharing data between the threads/interpreters is implemented by
> > tieing
>
> Parrot != perl5.ithreads
>
> > If Parrot has found a way of avoiding these costs and limitations
> > then everything I offered is a waste of time, because these are
> > the issues was attempting to address.
>
> I posted a very premature benchmark result, where an unoptimized Parrot
> build is 8 times faster than the equivalent perl5 code.
>
> > And the reaction from those who have tried to make use of ithreads
> > under p5 are all too aware that replicating them for Parrot would
> > be ..... [phrase deleted as too emotionally charged:)]
>
> I don't know how ithreads are working internally WRT the relevant issues
> like object allocation and such. But threads at the OS level provide
> shared code and data segments. So at the VM level you have to unshare
> non-shared resources at thread creation.

You only need to copy them if the two threads can attempt to modify
the contents of the objects concurrently. By precluding this possibility
(atomising VM-thread level operations, preventing a new VM thread
from being scheduled until any other VM thread completes its current
operation, and ensuring that each VM thread's state is complete and
coherent before another VM thread can run) you can avoid the need
for the duplication.
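
As a sketch, the atomised run loop would look like this (hypothetical
names throughout - this is the proposal, not Parrot's actual loop):

#include <pthread.h>

typedef struct vm_thread vm_thread;    /* the per-thread context above */
extern pthread_mutex_t vm_big_lock;    /* one lock per interpreter     */
extern int  fetch_op(vm_thread *t);    /* read the opcode at t->pc     */
extern void dispatch(vm_thread *t, int op);

static void run_loop(vm_thread *t)
{
    for (;;) {
        pthread_mutex_lock(&vm_big_lock);
        dispatch(t, fetch_op(t));      /* thread state coherent on exit */
        pthread_mutex_unlock(&vm_big_lock);
        /* only here may another VM thread start its next opcode */
    }
}

In effect it is one giant lock per interpreter, paid once per opcode
rather than once per data access.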

> You can copy objects lazily and

This works well if the kernel will take care of the COW, but on
kernels not supporting this, you have to add costly extra code to
detect the writes (or the reads) and perform the duplication (and
possibly the allocation) in user code. This is what p5 ithreads do,
and the result is less than satisfactory.

> make 2 distinct items when writing, or you copy them in the first
> place. But you have these costs at thread start - and not later.

Duplicating everything at thread start is effectively how win32
emulates forking. The problem is that win32 doesn't support forking
natively, so complicated steps have to be taken in user level code
to introspect the OS to find and duplicate the appropriate data
segments. Stuff that *nix kernels do as a result of a single syscall,
with access to all the appropriate kernel tables and data, costs
hugely when performed from user space.

In addition to that, there are several other *nix kernel facilities
associated with forking (SIGCHLD etc.) that the win32 kernel doesn't
support. These have never been suitably emulated, and so forking as a
concept is still pretty unusable on win32.

My desire is that *all* OS concepts used from the interpreter be
virtualised, so that where they are native, they can be macro'd
directly to the underlying OS syscalls. But where they are not so
supported, they can be implemented in isolation without the need to
make wholesale changes throughout the sources.
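
For example (an illustrative sketch only - the VM_ macro names here
are invented):

#ifdef _WIN32
#  include <windows.h>
#  define VM_ATOMIC_INC(p)  InterlockedIncrement(p)
#  define VM_YIELD()        SwitchToThread()
#else
#  include <sched.h>
#  define VM_ATOMIC_INC(p)  __sync_add_and_fetch((p), 1)
#  define VM_YIELD()        sched_yield()
#endif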

> leo

Nigel.

Uri Guttman

unread,
Jan 3, 2004, 4:44:28 PM1/3/04
to Jeff Clites, P6I Internals
>>>>> "JC" == Jeff Clites <jcl...@mac.com> writes:

JC> On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote:

>> that could be workable and might be faster. it does mean that locks
>> are two step as well, user space test/set and fallback to kernel
>> lock. we can do what nigel said and wrap the test/set in macros and
>> use assembler to get at it on platforms that have it

JC> This is probably already done inside of pthread (and Win32)
JC> locking implementations--a check in userspace before a kernel
JC> call. It's also important to remember that it's not a two-step
JC> process most of the time, since most of the locking we are taking
JC> about is likely to be uncontested (ie, the lock acquisition will
JC> succeed most of the time, and no blocking will be needed). If we
JC> _do_ have major lock contention in our internal locking, those
JC> will be areas calling for a redesign.

i meant in the coding and not necessarily at runtime. we still need to
address when and where locking happens.

Elizabeth Mattijsen

unread,
Jan 3, 2004, 4:45:07 PM1/3/04
to Dave Mitchell, Nigel Sandever, perl6-i...@perl.org
At 21:11 +0000 1/3/04, Dave Mitchell wrote:
>On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote:
> > Actually, you can bless a reference to a shared variable, but you
>> can't share a blessed object (the sharing will let you lose the
>> content of the object). I think shared compound data structures
>> _are_ possible, but very tricky to get right (because the CLONE sub
>> is called as a package method, rather than as an object method: see
> > Thread::Bless for an attempt at getting around this).
>Nested shared structures work just fine, and there's no need to mess with
>CLONE for plain data.

Indeed. But as soon as there is something special such as a
datastructure external to Perl between threads (which becomes
"shared" automatically, because Perl doesn't know about
the datastructure, so the cloned objects point to the same memory
address), then you're in trouble. Simply because you now have
multiple DESTROYs called on the same external data-structure. If the
function of the DESTROY is to free the memory of the external
data-structure, you're in trouble as soon as the first thread is
done. ;-(


> > >And the reaction from those who have tried to make use of ithreads
>> >under p5 are all too aware that replicating them for Parrot would
> > >be ..... [phrase deleted as too emotionally charged:)]
>It's the implementation of ithreads in p5 that sucks, not the concept
>itself. The use of COW makes new thread creation cheap, and the use of
>lock PMCs interposed on the real PMCs makes sharing easy.

I agree that Perl ithreads as *a* concept are ok.

The same could be said about what are now referred to as 5.005
threads. Ok as *a* concept. And closer to what many non-Perl people
consider to be "threads".

Pardon my French, but both suck in the implementation. And it is not
for lack of effort by the people who developed them. It is for lack of
a good foundation to build on. And since we're talking foundations here
now, we all want to make sure it is the best, earthquake-proofed,
rocking foundation we can get!


Liz

Leopold Toetsch

unread,
Jan 3, 2004, 6:24:48 PM1/3/04
to Uri Guttman, perl6-i...@perl.org
Uri Guttman <u...@stemsystems.com> wrote:

> ok, i missed the 'if' there. :)

> that could be workable and might be faster. it does mean that locks are
> two step as well, user space test/set and fallback to kernel lock.

Yep, that is what the OS provides. I really don't like to reinvent
wheels here - ehem, or anywhere else.

> uri

leo

Leopold Toetsch

unread,
Jan 3, 2004, 6:21:57 PM1/3/04
to Uri Guttman, perl6-i...@perl.org
Uri Guttman <u...@stemsystems.com> wrote:
> ... this again brings up how we lock so
> that GC/alloc will work properly with threads. do we lock a thread pool
> but not the thread when we access a shared thingy?

This is the major issue, how to continue. Where are shared objects
living (alloc) and where and how are these destroyed (DOD/GC).

But first I'd like to hear some requirements defined, e.g.:

A spawns B, C threads with shared $a
B spawns D, E threads with shared $b

- is $b shared in A or C
- is $a shared in D or E
- are there any (HLL) decisions to decide what is shared
- and so on

If that isn't laid out we don't need to talk about locking a certain
thread pool.

> ...how do we handle GC/alloc across shared objects?

I posted a proposal for that :)

> uri

leo

Leopold Toetsch

unread,
Jan 3, 2004, 6:49:58 PM1/3/04
to Elizabeth Mattijsen, perl6-i...@perl.org
Elizabeth Mattijsen <l...@dijkmat.nl> wrote:

> Indeed. But as soon as there is something special such as a
> datastructure external to Perl between threads (which becomes
> "shared" automatically, because Perl doesn't know about
> the datastructure,

Why is it shared automatically? Do you have an example for that?
But anyway an interesting problem, which I didn't consider until now - thanks :)

> ... so the cloned objects point to the same memory
> address), then you're in trouble. Simply because you now have
> multiple DESTROYs called on the same external data-structure. If the
> function of the DESTROY is to free the memory of the external
> data-structure, you're in trouble as soon as the first thread is
> done. ;-(

Maybe that DOD/GC can help here. A shared object can and will be
destroyed only when the last holder of that object has released it.

[ perl5 thread concepts ]

> Pardon my French, but both suck in the implementation. And it is not
> for lack of effort by the people who developed it.

The problem, for sure, was putting threads on top of a working
interpreter and a commonly used language. Parrot was designed with
threads, events and async IO in mind. It was surprisingly simple to
implement these first steps that are running now.

Separating the HLL layer from the engine probably helps a lot for such
rather major design changes.

> now, we all want to make sure it is the best, earth-quake proofed,
> rocking foundation we can get!

Yep. So again, your input is very welcome,

> Liz

leo

Leopold Toetsch

unread,
Jan 3, 2004, 5:59:37 PM1/3/04
to Nigel Sandever, perl6-i...@perl.org
Nigel Sandever <nigels...@btconnect.com> wrote:
> On Sat, 3 Jan 2004 21:00:31 +0100, l...@toetsch.at (Leopold Toetsch) wrote:
>> Yep. So if a single interpreter (which is almost a thread state) should
>> run two threads, you have to allocate and swap all.

> When a kernel level thread is spawned, no duplication of application memory
> is required, only a set of registers, program counter and stack. These
> represent the entire state of that thread.

Here is the current approach, I've implemented partly: The state of a
thread is basically the interpreter structure - that's it. In terms of
Parrot, a thread (a ParrotThread PMC) is derived from an interpreter (a
ParrotInterpreter PMC).

Please remember, Parrot is a register based VM and has a lot of
registers. The whole representation of a VM thread is bigger than and
different from that of a kernel thread.

While scheduling a kernel thread only swaps the above items, a VM
level thread scheduler would have to swap much more.

> If a VM thread mirrors this, by duplicating the VM program counter,
> VM registers and VM stack, then this VM thread context can also
> avoid the need to replicate the rest of the program data (interpreter).

You are again missing something here: the interpreter *is* the above VM
state - the rest is almost nothing. So the interpreter := thread approach
holds. You can't run even a single thread - the one and only - without
this necessary data, and that's just called an interpreter in Parrot speak.

> ... No need for :shared or lock().

That's the - let's say - type 4 of Dan's layout of different threading
models. Everything is shared by default. That's similar to the shared
PMC type 3 model - except that no objects have to be copied. Whether
one or the other model will have better performance surely depends on
the user code, so the user can choose. We will provide both.

> Only duplicating shared data on demand (COW) may work well on systems
> that support COW in the kernel.

No, we are dealing with VM objects and structures here - no kernel is
involved for COWed copies of e.g. strings.

[ snips ]

> Each element, the separation of the VM state from the interpreter state,

VM = Virtual machine = interpreter

These can't be separated as they are the same.

> the atomisation of VM operations,

Different VMs can run on different CPUs. Why should we make atomic
instructions out of these? We have a JIT runtime performing at 1 Parrot
instruction per CPU instruction for native integers. Why should we slow
that down by a factor of many tens?

Where we have to lock shared data, you have to pay the penalty - but not
for each piece of code.

> You only need to copy them if the two threads can attempt to modify
> the contents of the objects concurrently.

I think that you are missing multiprocessor systems totally.

leo

Jeff Clites

unread,
Jan 3, 2004, 8:39:47 PM1/3/04
to P6I Internals
On Jan 3, 2004, at 2:59 PM, Leopold Toetsch wrote:

> Nigel Sandever <nigels...@btconnect.com> wrote:
>> Only duplicating shared data on demand (COW) may work well on systems
>> that support COW in the kernel.
>
> No, we are dealing with VM objects and structures here - no kernel is
> involved for COWed copies of e.g. strings.

And also, COW at the OS level (that is, of memory pages) doesn't help,
because we have complex data structures filled with pointers, so
copying them involves more than just duplicating a block of memory. We
can use an approach similar to what we do for strings to make a COW
copy of, for instance, the globals stash, but overall that will only be
a speed improvement if the data structure is rarely modified. (That is,
once it's modified, we will have paid the price. Unless we have a clever
data structure which can be COWed in sections.)
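
For strings, the VM-level COW Leo mentions amounts to something like
this (a hypothetical C sketch, not Parrot's actual string internals):

#include <stdlib.h>
#include <string.h>

typedef struct { int refs; size_t len; char *data; } str_buf;
typedef struct { str_buf *buf; } vm_str;

static vm_str cow_copy(vm_str s)    /* O(1): just bump the refcount */
{
    s.buf->refs++;                  /* would need to be atomic with threads */
    return s;
}

static void cow_write(vm_str *s, size_t i, char c)
{
    if (s->buf->refs > 1) {         /* shared: pay the copy price now */
        str_buf *nb = malloc(sizeof *nb);
        nb->refs = 1;
        nb->len  = s->buf->len;
        nb->data = malloc(nb->len);
        memcpy(nb->data, s->buf->data, nb->len);
        s->buf->refs--;
        s->buf = nb;
    }
    s->buf->data[i] = c;            /* now safely unshared */
}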

Just adding to what Leo already said.

JEff

Elizabeth Mattijsen

unread,
Jan 4, 2004, 5:01:36 AM1/4/04
to l...@toetsch.at, perl6-i...@perl.org
At 00:49 +0100 1/4/04, Leopold Toetsch wrote:
>Elizabeth Mattijsen <l...@dijkmat.nl> wrote:
> > Indeed. But as soon as there is something special such as a
>> datastructure external to Perl between threads (which becomes
>> "shared" automatically, because Perl doesn't know about
> > the datastructure,
>Why is it shared automatically? Do you have an example for that?

When you use an external library in Perl, such as e.g. libxml, you
have Perl data-structures and libxml data-structures. The Perl
data-structures contain pointers to the libxml data-structures.

In comes the starting of an ithread and Perl clones all of the Perl
data-structures. But it copies _only_ the things it knows about.
And thus leaves the pointers to the libxml data-structures untouched.
Now you have 2 Perl data-structures that point to the _same_ libxml
data-structures. Voila, instant sharing.

With disastrous results. Because as soon as the thread ends, the
cloned Perl object in the thread goes out of scope. Perl then calls
the DESTROY method on the object, which then frees up the libxml
data-structures. That's what it's supposed to do. Meanwhile, back
in the original thread, the pointers in the Perl object now point at
freed memory, rather than a live libxml data-structure. And chaos
ensues sooner or later. Of course, chaos could well ensue before the
thread is ended, because both threads think they have exclusive
access to the libxml data-structure.

Hope this explanation made sense.
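
In C terms, the hazard - and the refcounting fix Leo suggests below -
looks roughly like this (a hypothetical sketch, not the actual perl5
or libxml code):

#include <stdlib.h>

typedef struct { void *xml_doc; } wrapper;  /* stands in for the Perl object */

static void destroy(wrapper *w)
{
    free(w->xml_doc);   /* the clone's DESTROY frees the libxml data,
                           and the original's pointer now dangles     */
}

/* the fix: only the last holder really frees */
typedef struct { int refs; void *xml_doc; } rc_wrapper;

static void rc_destroy(rc_wrapper *w)
{
    if (--w->refs == 0)             /* atomically, in a threaded build */
        free(w->xml_doc);
}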


> > ... so the cloned objects point to the same memory
>> address), then you're in trouble. Simply because you now have
>> multiple DESTROYs called on the same external data-structure. If the
>> function of the DESTROY is to free the memory of the external
>> data-structure, you're in trouble as soon as the first thread is
> > done. ;-(
>Maybe that DOD/GC can help here. A shared object can and will be
>destroyed only, when the last holder of that object has released it.

But do you see now how complicated this can become if thread === interpreter?

Liz

Gordon Henriksen

unread,
Jan 3, 2004, 11:59:50 PM1/3/04
to Nigel Sandever, perl6-i...@perl.org
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

> Transparent interlocking of VHLL fat structures performed automatically
> by the VM itself. No need for :shared or lock().

Completely specious, and repeatedly proven unwise. Shouldn't even be
pursued.

Atomic guarantees on collections (or other data structures) are rarely
meaningful; providing them is simply a waste of time. Witness the
well-deserved death of Java's synchronized Vector class in favor of
ArrayList. The interpreter absolutely shouldn't crash due to threading
errors—it should protect itself using standard techniques—but it would
be a mistake for parrot to mandate that all ops and PMCs be thread-safe.

The details of threaded programming cannot be hidden from the
programmer. It's tempting to come up with clever ways to try, but the
engine really has to take a back seat here. Smart programmers will
narrow the scope of potential conflicts by reducing sharing of data
structures in their threaded programs. Having done so, any atomicity
guarantees on individual objects proves to be wasted effort: It will be
resented by parrot's users as needless overhead, not praised. Consider
the potential usage cases.

1. All objects in a non-threaded program.
2. Unshared objects in a threaded program.
3. Shared objects in a threaded program.

The first two cases will easily comprise 99% of all usage. In only the
third case are synchronized objects even conceivably useful, and even
then the truth of the matter is that they are of extremely limited
utility: Their guarantees are more often than not too fine-grained to
provide the high-level guarantees that the programmer truly needs. In
light of this, the acquisition of a mutex (even a mutex that's
relatively cheap to acquire) to push an element onto an array, or to
access a string's data—well, it stops looking so good.

That said, the interpreter can't be allowed to crash due to threading
errors. It must protect itself. But should a PerlArray written to
concurrently from 2 threads guarantee that its state makes sense at the end of
the program? I say no based upon precedent; the cost is too high.

Gordon Henriksen
mali...@mac.com

Elizabeth Mattijsen

unread,
Jan 4, 2004, 9:41:27 AM1/4/04
to l...@toetsch.at, perl6-i...@perl.org
At 14:47 +0100 1/4/04, Leopold Toetsch wrote:
>Elizabeth Mattijsen <l...@dijkmat.nl> wrote:
> > When you use an external library in Perl, such as e.g. libxml, you
>> have Perl data-structures and libxml data-structures. The Perl
> > data-structures contain pointers to the libxml data-structures.
> > In comes the starting of an ithread and Perl clones all of the Perl
>> data-structures. But it copies _only_ the things it knows about.
>> And thus leaves the pointers to the libxml data-structures untouched.
>> Now you have 2 Perl data-structures that point to the _same_ libxml
> > data-structures. Voila, instant sharing.
>I see. Our library loading code should take care of that. On thread
>creation we call again the _init code, so that the external lib can
>prepare itself to be used from multiple threads. But don't ask me about
>details ;)

What you need, is basically being able to:

- register a class method to be called on cloning
- register an object method that is called whenever an _object_ is cloned

The CLONE sub that Perl5 has, is the class method. The object method
is missing from Perl (Thread::Bless is a way to remedy this problem).

I don't know what the _init code does, but judging by its name, it's
not giving enough info to be able to properly clone an object with
external data structures.


Liz

Leopold Toetsch

unread,
Jan 4, 2004, 8:47:48 AM1/4/04
to Elizabeth Mattijsen, perl6-i...@perl.org
Elizabeth Mattijsen <l...@dijkmat.nl> wrote:

> When you use an external library in Perl, such as e.g. libxml, you
> have Perl data-structures and libxml data-structures. The Perl
> data-structures contain pointers to the libxml data-structures.

> In comes the starting of an ithread and Perl clones all of the Perl
> data-structures. But it copies _only_ the things it knows about.
> And thus leaves the pointers to the libxml data-structures untouched.
> Now you have 2 Perl data-structures that point to the _same_ libxml
> data-structures. Voila, instant sharing.

I see. Our library loading code should take care of that. On thread
creation we call again the _init code, so that the external lib can
prepare itself to be used from multiple threads. But don't ask me about
details ;)

> Liz

leo

Dan Sugalski

unread,
Jan 4, 2004, 1:43:53 PM1/4/04
to Gordon Henriksen, Nigel Sandever, perl6-i...@perl.org
At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote:
>On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:
>
>>Transparent interlocking of VHLL fat structures performed
>>automatically by the VM itself. No need for :shared or lock().
>
>Completely specious, and repeatedly proven unwise. Shouldn't even be pursued.

Erm... that turns out not to be the case. A lot. (Yeah, I know, I
said I wasn't paying attention)

An interpreter *must* lock any shared data structure, including PMCs,
when accessing them. Otherwise they may be in an inconsistent state
when being accessed, which will lead to data corruption or process
crashing, which is unacceptable.

These locks do not have to correspond to HLL-level locks, though it'd
be a reasonable argument that they share the same mutex.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Jeff Clites

unread,
Jan 4, 2004, 3:05:19 PM1/4/04
to l...@toetsch.at, perl6-i...@perl.org, Elizabeth Mattijsen

But I think we'll never be able to make this work as the user would
initially expect. For instance if we have a DBI implementation, and
some PMC is holding an external reference to a database cursor for an
open transaction, then we can't properly duplicate the necessary state
to make the copy of the PMC work correctly (that is, independently).
(And I'm not saying just that we can't do it from within parrot, I'm
saying the native database libraries can't do this.)

So some objects such as this would always have to end up shared (or
else non-functional in the new thread), which is bad for users because
they have to be concerned with what objects are backed by native
libraries and which ones cannot be made to conform to each of our
thread styles.

That seems like a major caveat.

JEff

Dan Sugalski

unread,
Jan 4, 2004, 3:08:35 PM1/4/04
to Jeff Clites, l...@toetsch.at, perl6-i...@perl.org, Elizabeth Mattijsen
At 12:05 PM -0800 1/4/04, Jeff Clites wrote:
>On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote:
>
>>Elizabeth Mattijsen <l...@dijkmat.nl> wrote:
>>
>>>When you use an external library in Perl, such as e.g. libxml, you
>>>have Perl data-structures and libxml data-structures. The Perl
>>>data-structures contain pointers to the libxml data-structures.
>>
>>>In comes the starting of an ithread and Perl clones all of the Perl
>>>data-structures. But it copies _only_ the things it knows about.
>>>And thus leaves the pointers to the libxml data-structures untouched.
>>>Now you have 2 Perl data-structures that point to the _same_ libxml
>>>data-structures. Voila, instant sharing.
>>
>>I see. Our library loading code should take care of that. On thread
>>creation we call again the _init code, so that the external lib can
>>prepare itself to be used from multiple threads. But don't ask me about
>>details ;)
>
>But I think we'll never be able to make this work as the user would
>initially expect. For instance if we have a DBI implementation, and
>some PMC is holding an external reference to a database cursor for
>an open transaction, then we can't properly duplicate the necessary
>state to make the copy of the PMC work correctly (that is,
>independently). (And I'm not saying just that we can't do it from
>within parrot, I'm saying the native database libraries can't do
>this.)

And don't forget the libraries that are picky about which thread
calls into them -- there are some that require that the thread that
created the handle for the library be the thread that calls into the
library with that handle. (Though luckily those are pretty rare) And
of course the non-reentrant libraries that require a global library
lock for all calls otherwise the library state gets corrupted.

Aren't threads fun? :)

Jeff Clites

unread,
Jan 4, 2004, 3:17:33 PM1/4/04
to Gordon Henriksen, P6I Internals
On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote:

> On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:
>
>> Transparent interlocking of VHLL fat structures performed
>> automatically by the VM itself. No need for :shared or lock().
>
> Completely specious, and repeatedly proven unwise. Shouldn't even be
> pursued.
>
> Atomic guarantees on collections (or other data structures) are rarely
> meaningful; providing them is simply a waste of time. Witness the
> well-deserved death of Java's synchronized Vector class in favor of
> ArrayList. The interpreter absolutely shouldn't crash due to threading
> errors—it should protect itself using standard techniques—but it would
> be a mistake for parrot to mandate that all ops and PMCs be
> thread-safe.

What are these standard techniques? The JVM spec does seem to guarantee
that even in the absence of proper locking by user code, things won't
go completely haywire, but I can't figure out how this is possible
without actual locking. (That is, I'm wondering if Java is doing
something clever.) For instance, inserting something into a collection
will often require updating more than one memory location (especially
if the collection is out of space and needs to be grown), and I can't
figure out how this could be guaranteed not to completely corrupt
internal state in the absence of locking. (And if it _does_ require
locking, then it seems that the insertion method would in fact then be
synchronized.)

So my question is, how do JVMs manage to protect internal state in the
absence of locking? Or do they?

JEff

Uri Guttman

unread,
Jan 4, 2004, 3:17:22 PM1/4/04
to Dan Sugalski, Jeff Clites, l...@toetsch.at, perl6-i...@perl.org, Elizabeth Mattijsen
>>>>> "DS" == Dan Sugalski <d...@sidhe.org> writes:

DS> And don't forget the libraries that are picky about which thread calls
DS> into them -- there are some that require that the thread that created
DS> the handle for the library be the thread that calls into the library
DS> with that handle. (Though luckily those are pretty rare) And of course
DS> the non-reentrant libraries that require a global library lock for all
DS> calls otherwise the library state gets corrupted.

DS> Aren't threads fun? :)

hence my love for events and forked procs. i even have a solution to
that very problem by forking the DBI (or other nonthreaded lib) process
and communicating to that via messages. but we still have to support
threads. just gonna be messy and i will prolly rarely use them
(especially since we will have a core event loop and async i/o (which
will prolly use kernel threads (according to dan) but not be parrot
threads ) ) (end of lisp text) :-/

Sam Vilain

unread,
Jan 4, 2004, 3:27:01 PM1/4/04
to Luke Palmer, perl6-i...@perl.org
On Sat, 03 Jan 2004 20:51, Luke Palmer wrote;

> Parrot is platform-independent, but that doesn't mean we can't
> take advantage of platform-specific instructions to make it faster
> on certain machines. Indeed, this is precisely what JIT is.
> But a lock on every PMC is still pretty heavy for those non-x86
> platforms out there, and we should avoid it if we can.

So implement threading on architectures that don't support interrupt
masking with completely user-space threading (ie, runloop round-robin)
like Ruby does. *That* is available on *every* platform.
--
Sam Vilain, s...@vilain.net

Seeing a murder on television... can help work off one's antagonisms.
And if you haven't any antagonisms, the commercials will give you
some.
-- Alfred Hitchcock

Dan Sugalski

unread,
Jan 4, 2004, 3:48:25 PM1/4/04
to Uri Guttman, Jeff Clites, l...@toetsch.at, perl6-i...@perl.org, Elizabeth Mattijsen
At 3:17 PM -0500 1/4/04, Uri Guttman wrote:
> >>>>> "DS" == Dan Sugalski <d...@sidhe.org> writes:
>
> DS> And don't forget the libraries that are picky about which thread calls
> DS> into them -- there are some that require that the thread that created
> DS> the handle for the library be the thread that calls into the library
> DS> with that handle. (Though luckily those are pretty rare) And of course
> DS> the non-reentrant libraries that require a global library lock for all
> DS> calls otherwise the library state gets corrupted.
>
> DS> Aren't threads fun? :)
>
>hence my love for events and forked procs.

Forks make some of this worse. There are more libraries that don't
work with connections forked across processes than libraries that
don't work with calls from a different thread.

Dan Sugalski

unread,
Jan 4, 2004, 3:51:32 PM1/4/04
to Sam Vilain, Luke Palmer, perl6-i...@perl.org
At 9:27 AM +1300 1/5/04, Sam Vilain wrote:
>On Sat, 03 Jan 2004 20:51, Luke Palmer wrote;
>
> > Parrot is platform-independent, but that doesn't mean we can't
> > take advantage of platform-specific instructions to make it faster
> > on certain machines. Indeed, this is precisely what JIT is.
> > But a lock on every PMC is still pretty heavy for those non-x86
> > platforms out there, and we should avoid it if we can.
>
>So implement threading on architectures that don't support interrupt
>masking with completely user-space threading (ie, runloop round-robin)
>like Ruby does. *That* is available on *every* platform.

Interrupt masking and a proper threading interface can be considered
a prerequisite for threads of any sort under Parrot, the same way an
ANSI C89-compliant compiler is a requirement. Platforms that can't
muster at least thread spawning, mutexes, and condition variables
don't get threads, and don't have to be considered. (You can, it's
just not required, and you'd be hard-pressed to find anything outside
the embedded realm that doesn't support at least that level of
functionality, and I'm OK if there are no threads on the Gameboy port)

Damien Neil

unread,
Jan 4, 2004, 7:03:33 PM1/4/04
to P6I Internals
On Sun, Jan 04, 2004 at 12:17:33PM -0800, Jeff Clites wrote:
> What are these standard techniques? The JVM spec does seem to guarantee
> that even in the absence of proper locking by user code, things won't
> go completely haywire, but I can't figure out how this is possible
> without actual locking. (That is, I'm wondering if Java is doing
> something clever.) For instance, inserting something into a collection
> will often require updating more than one memory location (especially
> if the collection is out of space and needs to be grown), and I can't
> figure out how this could be guaranteed not to completely corrupt
> internal state in the absence of locking. (And if it _does_ require
> locking, then it seems that the insertion method would in fact then be
> synchronized.)

My understanding is that Java Collections are generally implemented
in Java. Since the underlying Java bytecode does not permit unsafe
operations, Collections are therefore safe. (Of course, unsynchronized
writes to a Collection will probably result in exceptions--but it
won't crash the JVM.)

For example, insertion into a list might be handled something like
this (apologies for rusty Java skills):
    void append(Object new_entry) {
        if (a.length <= size) {
            Object[] new_a = new Object[size * 2];  // grow the backing array
            for (int i = 0; i < size; i++) {
                new_a[i] = a[i];                    // copy the existing entries
            }
            a = new_a;                              // install the new array
        }
        a[size++] = new_entry;
    }

If two threads call this function at the same time, they may well leave
the list object in an inconsistent state--but there is no way that the
above code can cause JVM-level problems.

The key decision in Java threading is to forbid modification of all
bytecode-level types that cannot be atomically modified. For example,
the size of an array cannot be changed, and strings are constant.
If it WERE possible to resize arrays, the above code would require locks
to avoid potential JVM corruption--every access to 'a' would need a lock
against the possibility that another thread was in the process of resizing
it.

It's my understanding that Parrot has chosen to take the path of using
many mutable data structures at the VM level; unfortunately, this is
pretty much incompatible with a fast or elegant threading model.

- Damien

Dan Sugalski

unread,
Jan 5, 2004, 11:41:31 AM1/5/04
to P6I Internals
At 4:03 PM -0800 1/4/04, Damien Neil wrote:
>It's my understanding that Parrot has chosen to take the path of using
>many mutable data structures at the VM level; unfortunately, this is
>pretty much incompatible with a fast or elegant threading model.

Yep, generally true. The choice was made on purpose, and I was aware
of most of the ramifications of it, both good and ill.

Gordon Henriksen

unread,
Jan 5, 2004, 11:01:46 PM1/5/04
to Jeff Clites, P6I Internals

The JVM won't crash, but individual objects can easily become corrupt.
Java makes no guarantees of the level you're proposing. All
synchronization in Java is explicit, using either synchronized methods,
or synchronized (...) { ... } blocks; the runtime doesn't implicitly
burn cycles synchronizing objects for the user. It does, however, give
the user the threading primitives which allow them to write thread-safe
code. Thread-safety is, indeed, left entirely to the user. This is
crucial to allowing code to execute quickly and to letting the optimizer
function. The JVM protects itself, ensures that its core is thread-safe,
and provides a secure operating environment for user code; it is the
responsibility of the authors of classes to protect instances of their
classes. Remember, Java's libraries are implemented atop these
primitives *in Java.* It would be analogous to propose that parrot-based
languages be primarily implemented atop its primitives in PBC, and that
the sections which are not so-implemented ("the core") must be
scrutinized carefully for threadsafety.

What are these threadsafe primitives which Java provides? For example:

1. Fixed-size arrays are safe to access concurrently from multiple
threads.
2. ints, chars, pointers, etc. can be updated atomically.
3. Java's memory allocator is thread-safe.
4. Strings are immutable (and thus sharable without threadsafety
violations).
5. The JVM does not allow memory to be resized, making bounds-checking
threadsafe.
6. No objects can be moved (while retaining their identity) except by
the garbage collector.
7. All threads are suspended during GC.
8. All pointers are traceable by GC.

That parrot is largely implemented in C, and is designed to operate in
terms of fat data structures, makes it much more difficult to guarantee
in parrot that segfaults will not occur. I have some ideas which I'll
share after I sync up with this fast-moving thread (of conversation).
But one of my premises is that implicitly guaranteeing atomic PMC
accesses will render threaded parrot dead-on-arrival. Just like the last
2 abortive attempts at creating a threading Perl implementation. (And
"threaded parrot" would be a separate executable, because the
performance hit will be so staggering.) The sheer overhead of locking is
an obvious deal-killer, but I think that deadlocks and prohibition of
optimizations would quickly put a few extra nails in the coffin. I don't
want to see the Perl community left adrift with another toy threading
implementation unsuitable for any serious work.

Gordon Henriksen
mali...@mac.com

Gordon Henriksen

unread,
Jan 5, 2004, 11:10:45 PM1/5/04
to Dan Sugalski, Nigel Sandever, perl6-i...@perl.org
On Sunday, January 4, 2004, at 01:43 , Dan Sugalski wrote:

> At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote:
>> On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:
>>
>>> Transparent interlocking of VHLL fat structures performed
>>> automatically by the VM itself. No need for :shared or lock().
>>
>> Completely specious, and repeatedly proven unwise. Shouldn't even be
>> pursued.
>
> Erm... that turns out not to be the case. A lot. (Yeah, I know, I said
> I wasn't paying attention)
>
> An interpreter *must* lock any shared data structure, including PMCs,
> when accessing them. Otherwise they may be in an inconsistent state
> when being accessed, which will lead to data corruption or process
> crashing, which is unacceptable.
>
> These locks do not have to correspond to HLL-level locks, though it'd
> be a reasonable argument that they share the same mutex.

Process crashing unacceptable? Absolutely. Complete agreement.

Data corruption unacceptable? I disagree. It depends on the contract put
forward by the language in question. Notably, Perl makes no such
guarantees thus far, being as how it doesn't (any longer) run in a
traditional threaded model. Successfully threaded languages and virtual
machines explicitly make NO such guarantees. I'm completely okay with
the potential for inconsistent data where the user doesn't take
precautions. If the alternative is my code bogging down to half of its
ideal speed because parrot is wasting time: acquiring three mutexes with
deadlock detection so that it can execute add $P10, $P11, $P12;
acquiring a mutex every time it indexes a string register; acquiring a
mutex so it can examine cache->int_val—if those are the alternatives,
I'm for the potential for inconsistent data. (And I say wasting because
it's completely unnecessary in a good 99% of cases; the vast majority of
data is not shared between threads at all.)

I'm not okay with parrot crashing for any reason, ever, though, so where
locking is not provided, PMC classes *must* be coded *very* carefully so
as to avoid crashes.

Gordon Henriksen
mali...@mac.com

Garrett Goebel

unread,
Jan 6, 2004, 11:34:50 AM1/6/04
to Gordon Henriksen, Dan Sugalski, perl6-i...@perl.org
Gordon Henriksen wrote:

> Dan Sugalski wrote:
> >
> > An interpreter *must* lock any shared data structure,
> > including PMCs, when accessing them. Otherwise they may
> > be in an inconsistent state when being accessed, which will
> > lead to data corruption or process crashing, which is
> > unacceptable.
>
> Process crashing unacceptable? Absolutely. Complete agreement.
>
> Data corruption unacceptable? I disagree. It depends on the
> contract put forward by the language in question. Notably,
> Perl makes no such guarantees thus far, being as how it
> doesn't (any longer) run in a traditional threaded model.
> Successfully threaded languages and virtual machines explicitly
> make NO such guarantees

Dan's non-negotiable "Thread notes" didn't say the VM's garauntees were
non-negotiable ;)

Are there differences in the contracts put forth by existing languages for
which we hope to have parrot implementations?

Ironically, wouldn't setting the guarantees against data corruption this
high pretty much rule out a pure parrot implementation of an SQL-92
compliant database? SQL-92 defines 4 transaction isolation levels in terms
of the phenomena they're meant to allow/avoid: dirty reads, non-repeatable
reads, and phantoms.

Allowing language level variation in concurrency vs. correctness guarantees?
Can of worms? I can already hear Dan: "Nothing to be seen here. Move
along."

On the off chance that anyone's interested, here's a critique of the ANSI
SQL-92 isolation levels: http://citeseer.nj.nec.com/berenson95critique.html

--
Garrett Goebel
IS Development Specialist

ScriptPro Direct: 913.403.5261
5828 Reeds Road Main: 913.384.1008
Mission, KS 66202 Fax: 913.384.2180
www.scriptpro.com garrett at scriptpro dot com

Elizabeth Mattijsen

unread,
Jan 7, 2004, 8:30:42 AM1/7/04
to Gordon Henriksen, Dan Sugalski, Nigel Sandever, perl6-i...@perl.org
At 23:10 -0500 1/5/04, Gordon Henriksen wrote:
>Data corruption unacceptable? I disagree. It depends on the contract
>put forward by the language in question. Notably, Perl makes no such
>guarantees thus far, being as how it doesn't (any longer) run in a
>traditional threaded model. Successfully threaded languages and
>virtual machines explicitly make NO such guarantees. I'm completely
>okay with the potential for inconsistent data where the user doesn't
>take precautions. If the alternative is my code bogging down to half
>of its ideal speed because parrot is wasting time: acquiring three
>mutexes with deadlock detection so that it can execute add $P10,
>$P11, $P12; acquiring a mutex every time it indexes a string
>register; acquiring a mutex so it can examine cache->int_val--if
>those are the alternatives, I'm for the potential for inconsistent
>data. (And I say wasting because it's completely unnecessary in a
>good 99% of cases; the vast majority of data is not shared between
>threads at all.)

I agree with most of what is said here.

However, I think there should be some type of debugging mode that
_would_ do all of the locking and deadlock detection (and whatever
other debugging capabilities we can think of). Occasional,
hard-to-reproduce data corruption is the single most problematic
thing to debug in threaded applications. It allows you to create
Heisenbugs that will only show up under specific load conditions
and/or specific platforms. And when that happens, it doesn't matter
that your threaded application doesn't crash: for the users it does
the equivalent, namely produce the wrong results occasionally, and
therefore not production ready.

Liz

Dan Sugalski

unread,
Jan 7, 2004, 10:44:31 AM1/7/04
to Gordon Henriksen, Nigel Sandever, perl6-i...@perl.org
At 23:10 -0500 1/5/04, Gordon Henriksen wrote:
>Data corruption unacceptable? I disagree.

I get the feeling people just aren't reading what's been written, or
aren't keeping it all straight.

*User* and *program* data integrity is not our problem -- not only
are we not guaranteeing that, I'd be fine with parrot actively
corrupting the contents of shared data structures that are accessed
without synchronization.

*Interpreter* data corruption is absolutely unacceptable and will not be
tolerated; the interpreters will do whatever synchronization they need
to do to make sure it doesn't happen.
