
Threads: Time to get the terminology straight


Dan Sugalski

Jan 4, 2004, 3:47:35 PM
to perl6-i...@perl.org
I think some of the massive back and forth that's going on is in part
due to terminology problems, which are in part causing some
conceptual problems. So, for the moment, let's agree on the following
things:

*) MUTEX - this is a low level, under the hood, not exposed to users,
thing that can be locked. They're non-recursive, non-read/write,
exclusive things. When a thread gets a mutex, any other attempt to
get that mutex will block until the owning thread releases the mutex.
The platform-native lock construct will be used for this.

*) LOCK - This is an exposed-to-HLL-code thing that can be locked.
Only PMCs can be locked, and the lock may or may not be recursive or
read/write.

*) CONDITION VARIABLE - the "sleep until something pings me"
construct. Useful for queue construction, always associated with a
MUTEX.

*) RENDEZVOUS POINT - A HLL version of a condition variable. *not*
associated with a lock -- these are standalone.

Note that the mutex/condition association's a POSIX limitation, and
POSIX threads is what we have on some platforms. If you want to
propose abstracting it away, go for it. The separation doesn't buy us
anything, though it's useful in other circumstances.
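For illustration, the classic mutex/condition-variable pairing for
queue construction looks something like this in POSIX C (a sketch;
the queue type and names are made up for the example, not Parrot API):

    #include <pthread.h>
    #include <stddef.h>

    /* Illustrative queue guarded by a MUTEX, with the CONDITION
     * VARIABLE supplying "sleep until something pings me". */
    typedef struct node { struct node *next; void *payload; } node;

    static node            *queue_head  = NULL;
    static pthread_mutex_t  queue_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t   queue_cond  = PTHREAD_COND_INITIALIZER;

    node *queue_get(void) {
        pthread_mutex_lock(&queue_mutex);
        while (queue_head == NULL)          /* re-test: wakeups may be spurious */
            pthread_cond_wait(&queue_cond, &queue_mutex);
        node *n = queue_head;
        queue_head = n->next;
        pthread_mutex_unlock(&queue_mutex);
        return n;
    }

    void queue_put(node *n) {
        pthread_mutex_lock(&queue_mutex);
        n->next = queue_head;
        queue_head = n;
        pthread_cond_signal(&queue_cond);   /* ping one sleeper */
        pthread_mutex_unlock(&queue_mutex);
    }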

*) INTERPRETER - those bits of the Parrot_Interp structure that are
absolutely required to be thread-specific. This includes the current
register sets and stack pointers, as well as security context
information. Basically if a continuation captures it, it's the
interpreter.

*) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp
structure that aren't required to be thread-specific (though I'm not
sure there are any) *PLUS* anything pointed to that doesn't have to
be thread-specific.

The environment includes the global namespaces, pads, stack chunks,
memory allocation regions, arenas, and whatnots. Just because the
pointer to the current pad is thread-specific doesn't mean the pad
*itself* has to be. It can be shared.

*) INDEPENDENT THREAD - A thread that has no contact *AT ALL* with
the internal data of any other thread in the current process.
Independent threads need no synchronization for anything other than
what few global things we have. And the fewer the better, though alas
we can't have none at all.

Note that independent threads may still communicate back and forth by
passing either atomic things (ints, floats, and pointers) or static
buffers that can become the property of the destination thread.

*) SHARED THREAD - A thread that's part of a group of threads sharing
a common interpreter environment.

Anyway, there's some terminology. It doesn't solve the design
problem, but hopefully it'll help everyone talk the same language.

Remember that everything from the wrapped OS interface on up is up
for grabs -- while we're not going to build our own mutexes or thread
scheduler, everything that's been implemented or designed to date can
be changed with sufficient good reason. (Though, again, the more you
want to change the more spectacular the design has to be)
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Nigel Sandever

Jan 4, 2004, 6:58:20 PM
to perl6-i...@perl.org
On Sun, 4 Jan 2004 15:47:35 -0500, d...@sidhe.org (Dan Sugalski) wrote:

> *) INTERPRETER - those bits of the Parrot_Interp structure that are
> absolutely required to be thread-specific. This includes the current
> register sets and stack pointers, as well as security context
> information. Basically if a continuation captures it, it's the
> interpreter.
>
> *) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp
> structure that aren't required to be thread-specific (though I'm not
> sure there are any) *PLUS* anything pointed to that doesn't have to
> be thread-specific.
>
> The environment includes the global namespaces, pads, stack chunks,
> memory allocation regions, arenas, and whatnots. Just because the
> pointer to the current pad is thread-specific doesn't mean the pad
> *itself* has to be. It can be shared.
>

> *) SHARED THREAD - A thread that's part of a group of threads sharing
> a common interpreter environment.

Ignoring the implementation of the synchronisation required, the basic
premise of my long post was that each SHARED THREAD should have its
own INTERPRETER (a VM in my terms), and that these should share a
common INTERPRETER ENVIRONMENT.

Simplistically, 5005threads shared an INTERPRETER ENVIRONMENT
and a single INTERPRETER. Synchronising threaded access to the shared
INTERPRETER (rather than its environment) was the biggest headache.
(I *think*.)

With ithreads, each SHARED THREAD has its own INTERPRETER *and*
INTERPRETER ENVIRONMENT. This removes the contention for, and the
need to synchronise access to, the INTERPRETER, but requires the
duplication of shared elements of the INTERPRETER ENVIRONMENT and
the copy_on_read, with the inherent costs of the duplication at
start-up, and slow, indirect access to shared entities across the
duplicated INTERPRETER ENVIRONMENTS.

My proposal was that each SHARED THREAD
should have a separate copy of the INTERPRETER,
but share a copy of the INTERPRETER ENVIRONMENT.

Everything else was my attempt at solving the requirements of
synchronisation that this would impose, whilst minimising the cost
of that synchronisation, by avoiding the need for a mutex on every
shared entity, and the costs of attempting to acquire a mutex except
when two SHARED THREADS attempted concurrent access to a shared
entity.

I think that by having SHARED THREAD == INTERPRETER, sharing
a common INTERPRETER ENVIRONMENT, you can avoid (some of) the
problems associated with 5005threads but retain direct access
to shared entities.

This imposes its own set of requirements and costs, but I believe
that the ideas that underlie the mechanisms I offered as solutions
are sound. The specific implementation is a platform-specific detail
that could be pushed down to a lower level.
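As a rough structural sketch of that split -- C with invented names,
not the real Parrot_Interp layout:

    /* Hypothetical layout, purely illustrative. */
    typedef struct InterpEnv {      /* INTERPRETER ENVIRONMENT: shared */
        void *global_namespaces;
        void *pads;
        void *stack_chunks;
        void *arenas;               /* allocation regions, etc. */
    } InterpEnv;

    typedef struct Interp {         /* INTERPRETER: one per SHARED THREAD */
        void *registers;            /* current register sets */
        void *stack_pointers;
        void *security_context;
        InterpEnv *env;             /* one copy, common to the thread group */
    } Interp;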

> ...those bits of the Parrot_Interp

> structure that aren't required to be thread-specific (though I'm not
> sure there are any)

This is where I have a different (and quite possibly incorrect) view.
My mental picture of the INTERPRETER ENVIRONMENT includes
both the implementation of all the classes in the process *plus* all
the memory of every instance of those classes.

I think your statement above implies that these would not be a part
of the INTERPRETER ENVIRONMENT per se, but would be allocated
from the global heap and only referenced from the bytecode that
would live in the INTERPRETER ENVIRONMENT?

I realise that this is possible, and maybe even desirable, but the cost
of the GC walking a global heap, especially in the situation of a single
process that contains two entirely separate instances of the
INTERPRETER ENVIRONMENT, would be (I *think*) rather high.

I realise that this is a fairly rare occurrence on most platforms,
but in the win32 situation of emulated forks, each pseudo-process
must have an entirely separate INTERPRETER ENVIRONMENT,
potentially with each having multiple SHARED THREADS.

If the memory for all entities in all pseudo-processes is allocated
from a (real) process-global heap, then the multiple GCs required by
the multiple pseudo-processes are going to be walking the same heap.
Possibly concurrently. I realise that this problem (if it is such)
does not occur on platforms that have real forks available, but it
would be useful if the high-level design would allow for the use of
separate (virtual) heaps tied to the INTERPRETER ENVIRONMENTs,
which win32 has the ability to do.


>
> Dan
>

Nigel.

Sam Vilain

Jan 4, 2004, 8:22:32 PM
to perl6-i...@perl.org
On Mon, 05 Jan 2004 12:58, Nigel Sandever wrote;

> Everything else was my attempt at solving the requirements of
> synchronisation that this would impose, whilst minimising the
> cost of that synchronisation, by avoiding the need for a mutex on
> every shared entity, and the costs of attempting to acquire a mutex
> except when two SHARED THREADS attempted concurrent access to a
> shared entity.

This paragraph sounds like you're trying to solve an intractable problem.
Try posting some pseudocode to explain what you mean.

But it has given me an idea that could minimise the number of locks,
ie not require a mutex on each PMC, just each shared PMC. 10 points
to the person who finds a flaw in this approach :)

Each object you share could create a new (virtual) shared memory
segment with its own semaphore. This virtual memory segment is
considered its own COW domain (ie, its own thread to the GC);
references inserted back to non-shared memory will pull the
structures into that virtual COW thread.

Access to the entire structure is controlled via a multiple
reader/single writer lock (close to what a semaphore is IIRC); locks
for a thread are released when references to places inside the
shared segment are no longer anywhere in any @_ on the locking
thread's call stack, or in use by any opcode (is that good enough?),
and are acquired for writing when anything needs to be changed.

Virtual shared memory segments can then easily be cleaned up by
normal GC.

The major problem I can see is that upgrading a lock from reading to
writing can't work if there are concurrent writes (and the read lock
to be upgraded cannot sensibly be released). But that should be OK,
since operation signatures will mark variables that need changing as
read-write as early as possible.

For example, in this sort of code (sorry for P5 code):

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = "baz";
    }

A read lock on the segment $shared_object is in is acquired, then
released when it is `shifted' off. As the next instruction has a
writable lvalue, it acquires a write lock. But this code:

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = &somefunc();
    }

Will hold the write lock on $shared_object open until &somefunc
runs.
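In POSIX terms the upgrade constraint looks like this (a sketch, with
a process-wide rwlock standing in for the per-segment lock; POSIX
offers no atomic read-to-write upgrade, so it's release, reacquire,
revalidate):

    #include <pthread.h>

    static pthread_rwlock_t seg_lock = PTHREAD_RWLOCK_INITIALIZER;

    void update_shared(void) {
        pthread_rwlock_rdlock(&seg_lock);
        /* ... inspect the shared segment ... */

        /* WRONG: taking the write lock while still holding the read
         * lock deadlocks (or is undefined) under POSIX:
         *     pthread_rwlock_wrlock(&seg_lock);
         */

        /* The portable move: release, reacquire for writing, and
         * revalidate everything read above before mutating. */
        pthread_rwlock_unlock(&seg_lock);
        pthread_rwlock_wrlock(&seg_lock);
        /* ... re-check state, then write ... */
        pthread_rwlock_unlock(&seg_lock);
    }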

My 2¢ :). This discussion will certainly reach a dollar soon ;).
--
Sam Vilain, s...@vilain.net

Start every day with a smile and get it over with.
W C FIELDS


Nigel Sandever

Jan 4, 2004, 9:43:29 PM
to perl6-i...@perl.org, Sam Vilain
05/01/04 01:22:32, Sam Vilain <s...@vilain.net> wrote:

[STUFF] :)

In another post you mentioned Intel hyperthreading.
Essentially, duplicate sets of registers within a single CPU.

Do these need to apply a lock on every machine-level entity that
they access?
No.

Why not?

Because they can only modify an entity if it is loaded into a register
and the logic behind hyperthreading won't allow both register sets
to load the same entity concurrently.

(I know this is a gross simplification of the interactions
between the on-board logic and L1/L2 caching!)

--- Not an advert or glorification of Intel. Just an example ---------

Hyper-Threading Technology provides thread-level-parallelism (TLP)
on each processor resulting in increased utilization of processor
execution resources. As a result, resource utilization yields higher
processing throughput. Hyper-Threading Technology is a form of
simultaneous multi-threading technology (SMT) where multiple
threads of software applications can be run simultaneously on one
processor.

This is achieved by duplicating the *architectural state* on each
processor, while *sharing one set of processor execution resources*.
------------------------------------------------------------------------------

The last paragraph is the salient one as far as I am concerned.

The basic premise of my original proposal was that multi-threaded,
machine-level applications don't have to interlock on machine-level
entities, because each operation they perform is atomic.

Whilst higher-level objects, of which the machine-level objects
are a part, may have their state corrupted by two threads modifying
things concurrently, the state of the threads (register sets +
stack) themselves cannot be corrupted.

This is because they have their own internally consistent state,
that only changes atomically, and that is completely separated,
each from the other. They only share common data (code is data
to the CPU, just as bytecode is data to a VM).

So, if you are going to emulate a (hyper)threaded CPU in a
register-based virtual machine interpreter, and allow for
concurrent threads of execution within that VMI, then one way
of ensuring that the internal state of the VMI is never
corrupted would be to have each thread keep its own copy of
the *architectural state* of the VM, whilst sharing *one set
of processor execution resources*.

For this to work, you would need to achieve the same opcode
atomicity at the VMI level. Interlocking the threads so that
one shared thread cannot start an opcode until another shared
thread has completed its own gives this atomicity. The penalty
is that if the interlocking is done for every opcode, then
shared threads end up with very long virtual timeslices. To
prevent that being the case (most of the time), the interlocking
should only come into effect *if* concurrent access to a
VM-level entity is imminent.

As the VMI cannot access (modify) the state of a VM-level
entity (PMC) until it has loaded it into a VM register, the
interlocking need only come into effect *if* the entity
whose reference is being loaded into a PMC register is
currently in use by (another) thread.

A PMC's in-useness can be flagged by a single bit in its
header. This can be detected by a shared thread when the
reference to it is loaded into the PMC register; when the
bit is found set, that shared thread then waits on the
single, shared mutex before proceeding.

It is only when the combination of atomised VM opcodes
and lightweight in-use detection come together that the
need for a mutex per entity can be avoided.

If the mutex used is capable of handling SMP, NUMA,
clusters etc., then the mechanism will work.

If a lightweight bit test-and-set opcode isn't available,
then a heavyweight equivalent could be used, though the
advantages would be reduced.
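A sketch of one way this could look, with C11's atomic_flag standing
in for the bit test-and-set and a POSIX mutex/condvar pair for the
single shared mutex -- the PMC layout and every name here is invented
for illustration:

    #include <stdatomic.h>
    #include <pthread.h>

    typedef struct PMC {
        atomic_flag in_use;            /* the single header bit */
        /* ... vtable, data, ... */
    } PMC;

    static pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  shared_cond  = PTHREAD_COND_INITIALIZER;

    /* Run when a PMC reference is loaded into a PMC register. */
    void pmc_acquire(PMC *pmc) {
        if (!atomic_flag_test_and_set(&pmc->in_use))
            return;                    /* fast path: bit was clear */
        pthread_mutex_lock(&shared_mutex);
        while (atomic_flag_test_and_set(&pmc->in_use))
            pthread_cond_wait(&shared_cond, &shared_mutex);
        pthread_mutex_unlock(&shared_mutex);
    }

    /* Run when the opcode completes and the register is released. */
    void pmc_release(PMC *pmc) {
        atomic_flag_clear(&pmc->in_use);
        pthread_mutex_lock(&shared_mutex);    /* taken so a wakeup cannot be
                                                 lost between a waiter's test
                                                 and its cond_wait */
        pthread_cond_broadcast(&shared_cond);
        pthread_mutex_unlock(&shared_mutex);
    }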


>Sam Vilain, s...@vilain.net

I hope that clarifies my thinking and how I arrived at it.

I accept that it may not be possible on all platforms, and
it may be too expensive on some others. It may even be
undesirable in the context of Parrot, but I have seen no
argument that goes to invalidate the underlying premise.

Regards, Nigel

Luke Palmer

Jan 4, 2004, 11:34:15 PM
to perl6-i...@perl.org, Sam Vilain
Nigel Sandever writes:
> Whilst higher-level objects, of which the machine-level objects
> are a part, may have their state corrupted by two threads modifying
> things concurrently, the state of the threads (register sets +
> stack) themselves cannot be corrupted.

I'm going to ask for some clarification here.

I think you're saying that each thread gets its own register set and
register stacks (the call chain is somewhat hidden within the register
stacks). Dan has been saying we'll do this all along.

But you're also saying that a particular PMC can only be loaded into one
register at a time, in any thread. So if thread A has a string in its
P17, then if thread B tries to load that string into its, say, P22, it
will block until thread A releases it. Is this correct?

This is a novel idea. It reduces lock acquisition/release from every
opcode to only the C<set> opcodes. That's a win.

Sadly, it greatly increases the chances of deadlocks. Take this
example:

    my ($somestr, $otherstr);
    sub A {
        $somestr  .= "foo";
        $otherstr .= "bar";
    }
    sub B {
        $otherstr .= "foo";
        $somestr  .= "bar";
    }
    Thread->new(\&A);
    Thread->new(\&B);

This is a fairly trivial example, and it should work smoothly if we're
automatically locking for the user.

But consider your scheme: A loads $somestr into its P17, and performs
the concatenation. B loads $otherstr into its P17, and performs the
concatenation. A tries to load $otherstr into its P18, but blocks
because it's in B's P17. B then tries to load $somestr into its P18,
but blocks because it's in A's P17. Deadlock.
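In C terms (two pthread mutexes standing in for the two held PMCs'
in-use state; names made up), it's the classic lock-ordering deadlock:

    #include <pthread.h>

    static pthread_mutex_t somestr_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t otherstr_lock = PTHREAD_MUTEX_INITIALIZER;

    void *thread_A(void *arg) {
        pthread_mutex_lock(&somestr_lock);    /* $somestr into A's P17 */
        pthread_mutex_lock(&otherstr_lock);   /* blocks: B holds it    */
        /* ... both concatenations ... */
        pthread_mutex_unlock(&otherstr_lock);
        pthread_mutex_unlock(&somestr_lock);
        return arg;
    }

    void *thread_B(void *arg) {
        pthread_mutex_lock(&otherstr_lock);   /* $otherstr into B's P17 */
        pthread_mutex_lock(&somestr_lock);    /* blocks: A holds it     */
        /* ... both concatenations ... */
        pthread_mutex_unlock(&somestr_lock);
        pthread_mutex_unlock(&otherstr_lock);
        return arg;
    }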

Did I accurately portray your scheme? If not, could you explain what
yours does in terms of this example?

Luke

> Regards, Nigel

Sam Vilain

Jan 4, 2004, 11:51:20 PM
to Nigel Sandever, perl6-i...@perl.org
On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;

> I accept that it may not be possible on all platforms, and it may
> be too expensive on some others. It may even be undesirable in the
> context of Parrot, but I have seen no argument that goes to
> invalidate the underlying premise.

I think you missed this:

LT> Different VMs can run on different CPUs. Why should we make atomic
LT> instructions out of these? We have a JIT runtime performing at 1
LT> Parrot instruction per CPU instruction for native integers. Why
LT> should we slow down that by a magnitude of many tenths?

LT> We have to lock shared data, then you have to pay the penalty, but
LT> not for each piece of code.

and this:

LT> I think, that you are missing multiprocessor systems totally.

You are effectively excluding true parallelism by blocking other
processors from executing Parrot ops while one has the lock. You may
as well skip the thread libraries altogether and multi-thread the ops
in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems,
with hyperthreading switched off, and running Win32. Is it even true
that masking interrupts is enough on these systems?

Win32 `Critical Sections' must be giving the scheduler hints not to
run other pending threads whilst a critical section is running. Maybe
it uses the CPU sti/cli flags for that, to avoid the overhead of
setting a memory word somewhere (bad enough) or calling the system
(crippling). In that case, setting STI/CLI might only incur a ~50%
performance penalty for integer operations.

but then there's this:

NS> Other internal housekeeping operations, memory allocation, garbage
NS> collection etc. are performed as "sysopcodes", performed by the VMI
NS> within the auspices of the critical section, and thus secured.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation. if the op requires an immediate large chunk of ram it
UG> can trigger a GC pass or allocation request. you can't force those
UG> things to only happen between normal ops (which is what making
UG> them into ops does). so GC and allocation both need to be able to
UG> lock all shared things in their interpreter (and not just do a
UG> process global lock) so those things won't be modified by the
UG> other threads that share them.

I *think* this means that even if we *could* use critical sections for
each op, where this works and isn't terribly inefficient, GC throws a
spanner in the works. This could perhaps be worked around.

In any case, it won't work on the fastest known threading
implementations (Solaris, Linux NPTL, etc), as they won't know to
block all the other threads in a given process just because one of
them set a CPU flag cycles before it was pre-empted.

So, in summary - it won't work on MP, and on UP, it couldn't possibly
be as overhead-free as the other solutions.

Clear as mud ? :-)

[back to processors]


> Do these need to apply a lock on every machine-level entity that
> they access?

Yes, but the only resource that matters here is memory. Locking
*does* take place inside the processor, but the locks are all close
enough to be inspected in under a cycle. And misses incur a penalty
of several cycles - maybe dozens, depending on who has the memory
locked.

Registers are also "locked" by virtue of the fact that the
out-of-order execution and pipelining logic will not schedule/allow an
instruction to proceed until its data is ready. Any CPU with
pipelining has this problem.

There is an interesting comparison to be drawn between the JIT
assembly happening inside the processor from the bytecode being
executed (x86) into a RISC core machine language (µ-ops) on
hyperthreading systems, and Parrot's compiling PASM to native machine
code. In each case it is the µ-ops that are ordered to maximize
performance and fed into the execution units.

On a hyperthreading processor, it has the luxury of knowing how long
it will take to check the necessary locks for each instruction,
probably under a cycle, so that µ-ops may scream along.

With Parrot, it might have to contact another host over an ethernet
controller to acquire a lock (eg, threads running in an OpenMOSIX
cluster). This cannot happen for every instruction!
--
Sam Vilain, s...@vilain.net

The golden rule is that there are no golden rules
GEORGE BERNARD SHAW


Nigel Sandever

Jan 5, 2004, 2:51:35 AM
to perl6-i...@perl.org

05/01/04 04:34:15, Luke Palmer <fibo...@babylonia.flatirons.org> wrote:

LP> I think you're saying that each thread gets its own register set and
LP> register stacks (the call chain is somewhat hidden within the register
LP> stacks). Dan has been saying we'll do this all along.
.
That has been my interpretation, but there are others that appear to be
contradicting that view.

LP> But you're also saying that a particular PMC can only be loaded into one
LP> register at a time, in any thread. So if thread A has a string in its
LP> P17, then if thread B tries to load that string into its, say, P22, it
LP> will block until thread A releases it. Is this correct?
.
Yes!

LP> This is a novel idea. It reduces lock acquisition/release from every
LP> opcode to only the C<set> opcodes. That's a win.
.
That is the intent.

LP> Sadly, it greatly increases the chances of deadlocks. Take this
LP> example:
.


    my ($somestr, $otherstr);
    sub A {
        $somestr  .= "foo";
        $otherstr .= "bar";
    }
    sub B {
        $otherstr .= "foo";
        $somestr  .= "bar";
    }
    Thread->new(\&A);
    Thread->new(\&B);

.
LP> This is a fairly trivial example, and it should work smoothly if we're
LP> automatically locking for the user.

LP> But consider your scheme: A loads $somestr into its P17, and performs
LP> the concatenation. B loads $otherstr into its P17, and performs the
LP> concatenation. A tries to load $otherstr into its P18, but blocks
LP> because it's in B's P17. B then tries to load $somestr into its P18,
LP> but blocks because it's in A's P17. Deadlock.

LP> Did I accurately portray your scheme? If not, could you explain what
LP> yours does in terms of this example?
.
Where your description falls down is in considering each sub as an
opcode. I *think* that each statement within the sub is an opcode.

Therefore the sequence of opcodes in A (simplified) is:

A sets the in-use flag on $somestr when it loads it into P17.
If the in-use flag was set, it acquires the mutex.
A executes the CONCAT opcode against $somestr,
constant 'foo' as operand.
Releases the mutex.
A then clears the in-use flag for $somestr.

A then sets the in-use flag for $otherstr.
If the in-use flag was set, it acquires the mutex.
A executes the CONCAT opcode against P18,
constant 'bar' as operand.
Releases the mutex.
A then clears the in-use flag for $otherstr.

And the sequence of opcodes in B (simplified) is:

B sets the in-use flag on $otherstr when it loads it into P17.
If the in-use flag was set, it acquires the mutex.
B executes the CONCAT opcode against $otherstr,
constant 'foo' as operand.
Releases the mutex.
B then clears the in-use flag for $otherstr.

B then sets the in-use flag for $somestr.
If the in-use flag was set, it acquires the mutex.
B executes the CONCAT opcode against P18,
constant 'bar' as operand.
Releases the mutex.
B then clears the in-use flag for $somestr.

The deadlock doesn't occur (if I'm right) because however you overlap
those two sequences, only one of the two in-use flags is set at any
given time. Blocking only occurs if the two threads are attempting
simultaneous operations on the same PMC. When this occurs, the second
thread blocks until the first completes its operation, whereupon it
immediately releases the lock and clears the in-use flag.

I.e. (in this example) only one in-use flag (per thread) will be set
at any given time, so the possibility of deadlock cannot arise.
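Put in the same C terms as your sketch (names hypothetical;
pmc_acquire/pmc_release being the flag-then-mutex steps above), the
difference is that each opcode releases its PMC before the next
begins:

    typedef struct PMC PMC;
    void pmc_acquire(PMC *pmc);    /* set in-use bit; block on the shared
                                      mutex if it was already set */
    void pmc_release(PMC *pmc);    /* clear in-use bit; wake any waiters */

    extern PMC *somestr_pmc, *otherstr_pmc;

    void *thread_A(void *arg) {
        pmc_acquire(somestr_pmc);  /* load $somestr into P17 */
        /* CONCAT P17, "foo" */
        pmc_release(somestr_pmc);  /* flag cleared before the next opcode */

        pmc_acquire(otherstr_pmc); /* load $otherstr into P18 */
        /* CONCAT P18, "bar" */
        pmc_release(otherstr_pmc);
        return arg;
    }
    /* thread_B is the mirror image. Neither thread ever holds two in-use
       flags at once, so the wait-for cycle of Luke's example cannot form. */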

If the situation requires a single opcode to manipulate two PMCs, then
other threads will be prevented from accessing either of those PMCs
until the single opcode on the first thread clears both in-use flags.

It may be possible to set up a scenario where loops of dual-PMC,
single-opcode operations could deadlock. I haven't convinced myself of
this one way or the other. But notionally, opcodes with dual PMC
operands should be fairly rare, if they exist at all. Mostly, perl-level
statements which involve multiple PMCs will resolve down to sequences of
opcodes that involve a single PMC and intermediate results from previous
opcodes held in non-PMC registers.

NOTE: That last paragraph is *highly* speculative on my part. It's how
I think it should work, not how it does or must.


LP> Luke

Does it work? If it does, can it be implemented, cross-platform, within
Dan's constraints? I do not know the answer to this. I'm trying to
work that out now. I would have preferred to have had categorical answers
before bringing it here, but time ran out and my knowledge of Parrot and
other platforms lets me down.

Cheers, Nigel.


Nigel Sandever

Jan 5, 2004, 2:55:38 AM
to perl6-i...@perl.org

05/01/04 04:51:20, Sam Vilain <s...@vilain.net> wrote:

>On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;
>
> > I accept that it may not be possible on all platforms, and it may
> > be too expensive on some others. It may even be undesirable in the
> > context of Parrot, but I have seen no argument that goes to
> > invalidate the underlying premise.
>
>I think you missed this:
>
>LT> Different VMs can run on different CPUs. Why should we make atomic
>LT> instructions out of these? We have a JIT runtime performing at 1
>LT> Parrot instruction per CPU instruction for native integers. Why
>LT> should we slow down that by a magnitude of many tenths?
>
>LT> We have to lock shared data, then you have to pay the penalty, but
>LT> not for each piece of code.

.
So far, I have only suggested using the mechanism in conjunction with
PMCs and PMC registers.

You can't add an in-use flag to a native integer. But then, native integers
are not a part of the VHLLs (perl/Python/Ruby). They are a constituent part
of scalars, but they use a different register set and opcodes. Copying the
integer value of a scalar into an I register would require locking the
scalar's PMC. Once the value is in the I register, operations performed on
it would not need to be synchronised. Once the resultant is calculated, it
needs to be moved back to the PMC and the lock cleared. There should be no
need to interlock on most opcodes dealing with the I and R register sets.

The S registers are a different kettle of fish, and I haven't worked through
the implications for these. My gut feeling is that the C-style strings
pointed at by S registers would be protected by the in-use flag set on the
PMCs for the scalars from which they are derived.

This means that when a single PMC opcode results in a sequence of non-PMC
operations, then other shared threads would be blocked from operations
until the sequence of non-PMC ops in the first shared thread were complete.
But ONLY if they attempt access to the same PMC.

If they are processing PMC or non-PMC operations that do not involve the
in-use PMC, then they will not be blocked and will be scheduled for their
timeslices in the normal way.

>
>and this:
>
>LT> I think, that you are missing multiprocessor systems totally.
>

If the mutex mechanism that is used to block the shared threads
is SMP, NUMA, AMP etc. safe, then the mechanism I describe is also
safe in these environments.

>You are effectively excluding true parallelism by blocking other
>processors from executing Parrot ops while one has the lock.

.
The block only occurs *IF* concurrent operations on the same data
are attempted.

> You may
>as well skip the thread libraries altogether and multi-thread the ops
>in a runloop like Ruby does.
>
>But let's carry the argument through, restricting it to UP systems,
>with hyperthreading switched off, and running Win32. Is it even true
>that masking interrupts is enough on these systems?

.
No masking of interrupts is involved anywhere!
I don't know where the idea arises, but it wasn't from me.

>
>Win32 `Critical Sections' must be giving the scheduler hints not to
>run other pending threads whilst a critical section is running. Maybe
>it uses the CPU sti/cli flags for that, to avoid the overhead of
>setting a memory word somewhere (bad enough) or calling the system
>(crippling). In that case, setting STI/CLI might only incur a ~50%
>performance penalty for integer operations.

.
I don't have access to the sources, but I do know that when one
thread has entered a critical section, all other threads and processes
continue to be scheduled in the normal way except those that also try
to enter the critical section.

Scheduling is only disabled for those threads that *ask* for it to be,
and no others, either within the process or in other processes. How the
mechanism works I can only speculate, but no CLI/STI instructions are
involved.

<total speculation>
When the first thread enters the critsec, a flag is set in the
critsec memory.

When a second thread attempts to enter the critsec, a flag is
set in the corresponding scheduler table to indicate that it should
not be scheduled again until the flag is cleared.

When the first thread leaves the critsec, the flag in the critsec
memory is cleared and the flag(s) in the scheduler tables for any
thread(s) blocking on the critsec are also cleared.

Whichever of the blocked threads is next scheduled, it acquires the
critsec, sets the flag in the critsec memory and the process repeats.

</total speculation>

No masking of interrupts is involved.
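For reference, the API in question -- these are the real Win32 calls,
with a made-up counter as the protected resource. Threads that never
enter this critical section are scheduled normally throughout:

    #include <windows.h>

    static CRITICAL_SECTION cs;
    static long counter;             /* stand-in shared resource */

    void counter_init(void) { InitializeCriticalSection(&cs); }

    void counter_bump(void) {
        EnterCriticalSection(&cs);   /* blocks only if another thread
                                        is currently inside */
        counter++;
        LeaveCriticalSection(&cs);
    }

    void counter_done(void) { DeleteCriticalSection(&cs); }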

>
>but then there's this:
>
> NS> Other internal housekeeping operations, memory allocation, garbage
> NS> collection etc. are performed as "sysopcodes", performed by the VMI
> NS> within the auspices of the critical section, and thus secured.
>
>UG> there may be times when a GC run needs to be initiated DURING a VM
>UG> operation. if the op requires an immediate large chunk of ram it
>UG> can trigger a GC pass or allocation request. you can't force those
>UG> things to only happen between normal ops (which is what making
>UG> them into ops does). so GC and allocation both need to be able to
>UG> lock all shared things in their interpreter (and not just do a
>UG> process global lock) so those things won't be modified by the
>UG> other threads that share them.

.
I did answer that.

However, if the GC needs to run globally, then all threads must
be stopped or blocked regardless. Nothing in this mechanism
changes that fact for better or worse.

Personally, I think that a non-global GC is possible and preferable
but that is an entirely unrelated discussion.

>
>I *think* this means that even if we *could* use critical sections for
>each op, where this works and isn't terribly inefficient, GC throws a
>spanner in the works. This could perhaps be worked around.

.
I disagree that it would be inefficient. I'm open to being proved wrong,
but no proof has been forthcoming. Only opinion.

>
>In any case, it won't work on the fastest known threading
>implementations (Solaris, Linux NPTL, etc), as they won't know to
>block all the other threads in a given process just because one of
>them set a CPU flag cycles before it was pre-empted.

.
You cannot divorce one part of the mechanism from the other.
It requires both the critsec (or equivalent mutex) between the threads
*plus* the in-use flags to achieve its purpose.

>
>So, in summary - it won't work on MP, and on UP, it couldn't possibly
>be as overhead-free as the other solutions.
>
>Clear as mud ? :-)

.
Your summary does not fit with mine. :-)

>
>[back to processors]
>> Do these need to apply lock on every machine level entity that
>> they access?
>
>Yes, but the only resource that matters here is memory. Locking
>*does* take place inside the processor, but the locks are all close
>enough to be inspected in under a cycle. And misses incur a penalty
>of several cycles - maybe dozens, depending on who has the memory
>locked.
>
>Registers are also "locked" by virtue of the fact that the
>out-of-order execution and pipelining logic will not schedule/allow an
>instruction to proceed until its data is ready. Any CPU with
>pipelining has this problem.

.
Essentially you are saying that there is a mechanism of interlocks
that works.

That these are implemented at the silicon/microcode level is pretty
irrelevant. If they can be implemented there, they could also be
implemented at the VM level.

Whether they would be worthwhile is a different argument.

>
>There is an interesting comparison to be drawn between the JIT
>assembly happening inside the processor from the bytecode being
>executed (x86) into a RISC core machine language (µ-ops) on
>hyperthreading systems, and Parrot's compiling PASM to native machine
>code. In each case it is the µ-ops that are ordered to maximize
>performance and fed into the execution units.
>
>On a hyperthreading processor, it has the luxury of knowing how long
>it will take to check the necessary locks for each instruction,
>probably under a cycle, so that µ-ops may scream along.
>
>With Parrot, it might have to contact another host over an ethernet
>controller to acquire a lock (eg, threads running in an OpenMOSIX
>cluster).
>This cannot happen for every instruction!

.
By instruction, I assume you mean VM level operation.

Which is why you need the in-use flag, to ensure that the lock
is only acquired when it is needed.

Again, if there is a mutex available on the target system that
is able to operate correctly within that system, then this can be
used in place of the critsec on win32.

The combination of that mutex *plus* the in-use flags (however they are
best achieved on the target system) is the mechanism.

The particular terms 'critical section' and BTS are only the terms I
used because they are what I am familiar with, and perceive as the
best choices (amongst many others) for use on win32.

>--
>Sam Vilain, s...@vilain.net

Regards, Nigel.

Dan Sugalski

Jan 5, 2004, 11:45:49 AM
to Nigel Sandever, perl6-i...@perl.org, Sam Vilain
At 2:43 AM +0000 1/5/04, Nigel Sandever wrote:
>05/01/04 01:22:32, Sam Vilain <s...@vilain.net> wrote:
>
>[STUFF] :)
>
>In another post you mentioned Intel hyperthreading.
>Essentially, duplicate sets of registers within a single CPU.
>
>Do these need to apply a lock on every machine-level entity that
>they access?
> No.
>
>Why not?
>
>Because they can only modify an entity if it is loaded into a register
>and the logic behind hyperthreading won't allow both register sets
>to load the same entity concurrently.

This seems generally common for SMT core CPUs (though some of them
may allow multiple references in different threads on-chip and just
keep the data sync'd up--I've not looked), but unfortunately it is
untenable for Parrot. The hardware gets to throw massive amounts of
parallelism at the problem, a luxury we, alas, don't have. (I'm
pretty sure they also play some serious register remapping games and
deferred & speculative execution tricks to not deadlock.)

To do this would require that every time a PMC register was loaded
that all the PMC registers of any other active interpreter in the
interpreter group be scanned. Kind of a pricey operation.

0 new messages