
JVM as a threading example (threads proposal)


Jeff Clites

Jan 12, 2004, 4:29:47 AM1/12/04
to Internals List
This is a threading proposal, in the form of a collection of notes.

Rationale:

We need to constantly compare parrot to the JVM, and in particular have
a deep understanding of cases where the JVM performs well (or better
than parrot). The reason for this is obvious: the JVM is the product of
several years of R&D, and we need to benefit from that where we can. Of
course, there are places where the two will necessarily differ (due to
differences such as static v. dynamic typing of their target
languages), but these necessary differences don't account for
everything, and it's important for us to understand if individual
performance differences are fundamental, or incidental. In particular,
it's instructive to look at Java's threading strategy, and its impact
on GC.

I have 2 fundamental assumptions:

1) As stated by others before, Parrot needs to guarantee that its
internal state is protected, even in the absence of proper
synchronization by user code, though user-visible data structures may
become corrupted from the user's perspective. (That is, a hash might
suddenly lose some entries, but it can't become corrupted in such a way
that a crash or memory leak might result.) To put it formally, no
sequence of byte code can have the possibility of causing an attempt to
read from, or write to, unallocated memory. [That may not cover all
cases of corruption, but it gets at the main point.] This is
particularly important in an embedded context--we want to be able to
guarantee that bad bytecode can't crash the process in which parrot is
embedded. Note that Java seems to provide similar guarantees (more on
this below).

2) My second assumption is that if we can provide a safe and performant
implementation which allows "full-access" threads (that is, without
data-access restrictions imposed between threads), then
access-restricted threads (such as Perl 5's ithreads) can be
implemented on top of this. Note that thread creation in the
access-restricted case will necessarily be slower, due to the need to
copy (or COW-copy) additional data structures.


So the theme is taking lessons from the JVM wherever we can. The focus
is on how we can provide safe and high-performance threading without
data-access restrictions between threads (in light of assumption 2
above).


Notes and considerations:

1) As mentioned above, Java seems to guarantee robustness in the
absence of proper user-level synchronization. In particular, section
2.19 of the JVM spec
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#33308> states the following:

"When a thread uses the value of a variable, the value it obtains is in
fact a value stored into the variable by that thread or by some other
thread. This is true even if the program does not contain code for
proper synchronization. For example, if two threads store references to
different objects into the same reference value, the variable will
subsequently contain a reference to one object or the other, not a
reference to some other object or a corrupted reference value. (There
is a special exception for long and double values; see §8.4.)"

(It also further states, "In the absence of explicit synchronization,
an implementation is free to update the main memory in an order that
may be surprising. Therefore, the programmer who prefers to avoid
surprises should use explicit synchronization.")

(Section 8.1
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html#22197> also clarifies, "A variable is any location
within a program that may be stored into. This includes not only class
variables and instance variables, but also components of arrays.")

2) It seems that Java provides these guarantees in part by relying on
the atomicity of stores of 32-bit sized data. (This is "atomic" in the
sense of "no word tearing", meaning that when such a value is written
to a memory location, any thread reading that location will see either
the old or the new value, and not some other value.) Section 8.4 of the
JVM spec
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html#22244> implies this, albeit indirectly. Due to the
ubiquity of the JVM, it seems reasonable for Parrot to similarly rely
on atomic stores, without compromising the range of platforms on which
parrot can run. In practice, modern processors likely provide such a
guarantee natively, without the need for explicit locking (though
explicit locking could be employed in the seemingly rare case of a
platform which provides threads but not atomic memory updates). If Java
is relying on some other mechanism here, we need to understand what
that is.

3) There seem to be 3 basic categories of things which need
thread-safety: data structures internal to the VM (eg, opcode arrays,
register stacks), variables (globals, registers, etc.; most
importantly, those which hold references to aggregate types), and the
internals of aggregate types (PMCs and strings). Things in these
categories which we don't need to worry about are things which are
logically local to a thread, no matter what the thread
semantic--registers, lexical pads, etc. Of the three categories, I
think that the aggregate types pose the biggest challenge. Most of the
internals of the VM are likely not logically shared between threads
(and those which are should probably be protected by explicit locking,
and again this is likely true whether or not we actually support
full-access threads), and global variables probably rely on a
thread-safe hash implementation (and in particular, may be slower to
access than lexical variables). So aggregate types are the key "problem
to be solved".

4) On a related note, access to the global stash should probably happen
via a vtable-like function pointer hanging off of the interpreter
struct, to allow data-access (sharing) restrictions to be imposed when
necessary. (That is, using different functions here allows restricted v.
full access to globals.)

5) Java seems to use a check-in/check-out model for access to global
data, in which global data "lives" in a central store, but is copied
back-and-forth to thread-local storage for modification. I don't fully
understand the performance benefit which I assume this provides, but
this is something else we need to understand. It's discussed in detail
in the threads section of the JVM spec,
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html>.

6) A key aspect of the safety guarantees provided by Java (in addition
to atomicity of 32-bit stores) seems to be non-shrinking of allocated
memory blocks. For example, if an object needs to acquire additional
memory in order to expand the backing store of a hash, it has to
allocate a new (larger) memory block, copy over the values from the
previous block, and leave the previous block to be garbage-collected.
In practice, parrot already does this, in the case of the compacting
collector. (It's realloc(), without immediately deallocating the old
block.) This is part of what allows safety in the light of concurrent
modification (because stale pointers and offsets can still refer to
valid structures).

7) Java 1.4 provides several different GC strategies, but most (if not
all) involve a stop-the-world stage, in which all threads are stopped.
(Note that this doesn't mean that the whole GC process needs to happen
with computation stopped, merely that some part of it may need to
happen with threads stopped.) So a strategy in which data is not owned
per-thread (from the point of view of GC) should be reasonable, since
GC may necessarily be a global endeavor. In parrot, if we have devised
a way to provide timely delivery of events, then we should be able to
use the event infrastructure to suspend all threads. (That is, we don't
need or want to suspend threads at the pthreads level.) See
<http://developers.sun.com/techtopics/mobility/midp/articles/garbagecollection2/>
and the previous version
<http://developers.sun.com/techtopics/mobility/midp/articles/garbage/>
for a discussion of Java's GC strategies.

8) It should be possible to provide per-thread allocation pools, such
that header or string body allocations can occur without locking,
despite the fact that the allocated items are not considered to be
"owned" by a thread. (Since GC will occur globally, as discussed in
item 7, the deallocation will not be logically per-thread.)

9) As mentioned above, PMC thread-robustness seems to be the main
challenge. (Where "thread-robustness" means non-crashing in the light
of concurrent modification.) I think that we can provide this in 3
ways: (a) minimize the number of custom PMCs that need to be written;
that is, emphasize use of the object infrastructure rather than custom
PMCs to provide HLL-level data types. (b) provide tools and guidelines
for implementing thread-robust PMCs; for instance, the memory-expansion
policy mentioned in item 6 above (Java provides a native
System.arraycopy() to assist with this, for example), favoring the use
of offsets/indexes rather than pointers into data blocks, and
out-of-band creation of data structures (initialization prior to storage
in a globally-accessible location). (c) use actual locking internally
when necessary, and only when running multithreaded. Simple PMCs such
as PerlInt are probably already thread-robust. Notably, PerlString
should be straightforward to make thread-robust as well, as long as its
flags are only changed at create time or during GC, and a new string
header is put into place when the backing store is expanded.
(Specifically, if headers can be guaranteed to be allocated from
zero-filled memory, then bufstart and buflen will either be consistent,
or one of them will be zero/NULL.) Thread-robustness is probably
easiest in the case of leaf structures such as these (ie, structures
which hold few pointers).

10) For PMC-internal use, the lock-only-when-threaded approach mentioned
in item 9 can easily be provided via a LOCK_THIS_PMC macro which expands
to something like "if (INTERP->is_threaded) {
pthread_mutex_lock(SELF->mutex) }". (It's safe to check an
interpreter->is_threaded flag without locking, because its value will
only change when single-threaded.)

11) I don't think it's particularly relevant that Java's strings are
immutable. StringBuffers are not, nor are collections such as HashMap,
so Java still needs to deal with thread-robustness of data structures.
Also, parrot has facilities for immutability, which HLLs can take
advantage of as an optimization (though probably Perl6 will not, except
for things such as string literals).

A few benchmarks (based on testing on Mac OS X, single CPU):

1) A pthread_mutex_lock/pthread_mutex_unlock pair on an uncontested
mutex seems to have an overhead equivalent to about 7 function calls.
(Or, since it _is_ 2 function calls, you could say the overhead is 5
function calls above that.) Also, pthread_mutex_trylock is only slightly
faster.

2) In a pure-computation context with no locking, there isn't a penalty
to being multithreaded. That is, counting to a large number in each of
40 threads takes no more time (seemingly, slightly less) than counting
up to 40*large-number in a single thread. This is likely because the
context-switch overhead is already being paid in context switching
between processes.

3) In Perl 5.8.1, it is much faster to perform 1000 forks (and waits)
than it is to create (and join) 1000 threads. In C, the opposite is
true.

I'll publish some actual benchmarking numbers, with source code,
separately. (They're just sort of interesting to have on hand.)

JEff

Elizabeth Mattijsen

Jan 12, 2004, 4:46:50 AM1/12/04
to Jeff Clites, Internals List
At 01:29 -0800 1/12/04, Jeff Clites wrote:
>I'll publish some actual benchmarking numbers, with source code,
>separately. (They're just sort of interesting to have on hand.)

If you're benchmarking Perl 5 ithreads for memory usage, you might
want to have a look at Benchmark::Thread::Size.


Liz

Gordon Henriksen

Jan 12, 2004, 1:03:56 PM1/12/04
to Jeff Clites, perl6-i...@perl.org
On Monday, January 12, 2004, at 04:29 , Jeff Clites wrote:

> 5) Java seems to use a check-in/check-out model for access to global
> data, in which global data "lives" in a central store, but is copied
> back-and-forth to thread-local storage for modification. I don't fully
> understand the performance benefit which I assume this provides, but
> this is something else we need to understand. It's discussed in detail
> in the threads section of the JVM spec,
> <http://java.sun.com/docs/books/vmspec/2nd-edition/html/
> Threads.doc.html>.

The passage in question being:

Every thread has a working memory in which it keeps its own working copy
of variables that it must use or assign. As the thread executes a
program, it operates on these working copies. The main memory contains
the master copy of every variable. There are rules about when a thread
is permitted or required to transfer the contents of its working copy of
a variable into the master copy or vice versa.

This is very obliquely phrased, but it refers to the register file and
stack frame. Or, since the Java bytecode is stack-based, it refers to
the stack.

Gordon Henriksen
mali...@mac.com

Jeff Clites

Jan 15, 2004, 1:28:25 AM1/15/04
to Gordon Henriksen, perl6-i...@perl.org
On Jan 12, 2004, at 10:03 AM, Gordon Henriksen wrote:

> On Monday, January 12, 2004, at 04:29 , Jeff Clites wrote:
>
>> 5) Java seems to use a check-in/check-out model for access to global
>> data, in which global data "lives" in a central store, but is copied
>> back-and-forth to thread-local storage for modification. I don't
>> fully understand the performance benefit which I assume this
>> provides, but this is something else we need to understand. It's
>> discussed in detail in the threads section of the JVM spec,
>> <http://java.sun.com/docs/books/vmspec/2nd-edition/html/
>> Threads.doc.html>.
>
> The passage in question being:
>
> Every thread has a working memory in which it keeps its own working
> copy of variables that it must use or assign. As the thread executes a
> program, it operates on these working copies. The main memory contains
> the master copy of every variable. There are rules about when a thread
> is permitted or required to transfer the contents of its working copy
> of a variable into the master copy or vice versa.

There's that passage, but actually the following sections go into depth
about the semantic constraints on how this is supposed to work.

> This is very obliquely phrased, but it refers to the register file and
> stack frame. Or, since the Java bytecode is stack-based, it refers to
> the stack.

You might be right, but that's not exactly how I read it, because later
it says, "A use action (by a thread) transfers the contents of the
thread's working copy of a variable to the thread's execution engine.
This action is performed whenever a thread executes a virtual machine
instruction that uses the value of a variable."

I assume that the stack would correspond to the "execution engine"
here, rather than to the "thread's working copy", but I could read it
either way.

In any case, I thought this sounded like an interesting model, but
based on the rules in that section about how accesses to data in the
thread-local store have to correspond to accesses of the main store, it
wasn't apparent to me how there was an advantage over just accessing
things directly out of the main store. But I assume I'm overlooking
some subtlety, and that there may be an idea here that we could use to
our advantage.

JEff

Leopold Toetsch

Jan 15, 2004, 3:31:39 AM1/15/04
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

[ threaded var access: read -> load -> use ... ]

> In any case, I thought this sounded like an interesting model,

I don't see any advantage in such a model, all the more so as it doesn't
guarantee atomic access to e.g. longs or doubles. Atomic access to
ints and pointers seems to rely on the architecture but is of course
reasonable.

> ... But I assume I'm overlooking
> some subtlety, and that there may be an idea here that we could use to
> our advantage.

Yep. Could be GC related. If there is either incremental or generational
GC, all reads or writes have to be tracked to update GC state, which is
probably simpler when done in one place.

> JEff

leo

Gordon Henriksen

Jan 15, 2004, 1:34:44 PM1/15/04
to Jeff Clites, perl6-i...@perl.org

The two-level memory model in the spec seems to address, as a group, the
register file, stack (JVM stack and/or system call stack; same
difference), and the cache hierarchy in an SMP system. All in all, it's
a very oblique way of saying that changes might not be immediately
visible to all threads. For example, on the PPC, a "sync" instruction is
required to ensure that all the processor caches have a consistent view
of main memory. sync and its equivalents are far too expensive to
execute after every change to any variable, so the JVM's spec is written
to allow implementations to avoid it most of the time.

--

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
gor...@iclub.com


Damien Neil

Jan 15, 2004, 2:32:05 PM1/15/04
to perl6-i...@perl.org
On Thu, Jan 15, 2004 at 09:31:39AM +0100, Leopold Toetsch wrote:
> I don't see any advantage of such a model. The more as it doesn't
> guarantee any atomic access to e.g. long or doubles. The atomic access to
> ints and pointers seems to rely on the architecture but is of course
> reasonable.

You *can't* guarantee atomic access to longs and doubles on some
architectures, unless you wrap every read or write to one with a
lock. The CPU support isn't there.

(Why the "e.g."? Longs and doubles are explicitly the only core
data types which the JVM does not guarantee atomic access to.)

- Damien

Damien Neil

Jan 15, 2004, 2:28:05 PM1/15/04
to perl6-i...@perl.org
On Wed, Jan 14, 2004 at 10:28:25PM -0800, Jeff Clites wrote:
> You might be right, but that's not exactly how I read it, because later
> it says, "A use action (by a thread) transfers the contents of the
> thread's working copy of a variable to the thread's execution engine.
> This action is performed whenever a thread executes a virtual machine
> instruction that uses the value of a variable."

There are three locations where a variable's data may be present:
- Main memory, which all threads may access.
- Working memory, which is thread-private. (The thread's stack
and/or registers, depending on VM design.)
- The execution engine--the low-level code (probably C or C++),
which implements the VM's opcodes.

The "use" action refers to the execution of a JVM opcode which uses
data from a thread's stack.


> In any case, I thought this sounded like an interesting model, but
> based on the rules in that section about how accesses to data in the
> thread-local store have to correspond to accesses of the main store, it
> wasn't apparent to me how there was an advantage over just accessing
> things directly out of the main store. But I assume I'm overlooking
> some subtlety, and that there may be an idea here that we could use to
> our advantage.

The JVM disallows accesses from main store for the same reason
Parrot does--bytecode efficiency. JVM opcodes operate on the stack,
as Parrot opcodes operate on registers. Accesses from main store
(which require a symbol table lookup) are performed by special-purpose
opcodes.

- Damien

Leopold Toetsch

Jan 16, 2004, 1:55:37 AM1/16/04
to Damien Neil, perl6-i...@perl.org
Damien Neil <ne...@misago.org> wrote:
> On Thu, Jan 15, 2004 at 09:31:39AM +0100, Leopold Toetsch wrote:
>> I don't see any advantage of such a model. The more as it doesn't
>> guarantee any atomic access to e.g. long or doubles. The atomic access to
>> ints and pointers seems to rely on the architecture but is of course
>> reasonable.

> You *can't* guarantee atomic access to longs and doubles on some
> architectures, unless you wrap every read or write to one with a
> lock. The CPU support isn't there.

Yes, that's what I'm saying. I don't see an advantage in the JVM's
multi-step variable access, because it doesn't even provide such atomic
access.

Parrot deals with PMCs, which can contain (let's consider scalars only)
e.g. a PerlInt or a PerlNum. Now we would have atomic access
(normally) to the former and very likely non-atomic access to the latter,
just depending on the value which happened to be stored in the PMC.

This implies, that we have to wrap almost[1] all shared write *and* read
PMC access with LOCK/UNLOCK.

[1] except plain ints and pointers on current platforms

> - Damien

leo

Jeff Clites

Jan 16, 2004, 2:58:22 AM1/16/04
to l...@toetsch.at, perl6-i...@perl.org, Damien Neil
On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote:
> Damien Neil <ne...@misago.org> wrote:
>> On Thu, Jan 15, 2004 at 09:31:39AM +0100, Leopold Toetsch wrote:
>>> I don't see any advantage of such a model. The more as it doesn't
>>> guarantee any atomic access to e.g. long or doubles. The atomic
>>> access to
>>> ints and pointers seems to rely on the architecture but is of course
>>> reasonable.
>
>> You *can't* guarantee atomic access to longs and doubles on some
>> architectures, unless you wrap every read or write to one with a
>> lock. The CPU support isn't there.
>
> Yes, that's what I'm saying. I don't see an advantage of JVMs
> multi-step
> variable access, because it even doesn't provide such atomic access.

What I was expecting the Java model to be doing (though I
didn't find this) was something along these lines: "Accessing the main
store involves locking, so by copying things to a thread-local store we
can perform several operations on an item before we have to move it
back to the main store (again, with locking). If we worked directly
from the main store, we'd have to lock for each and every use of the
variable."

The reason I'm not finding it is that the semantic rules spelled out in
the spec _seem_ to imply that every local access implies a
corresponding access to the main store, one-to-one. On the other hand,
maybe the point is that it can "save up" these accesses--that is, lock
the main store once, and push back several values from the thread-local
store. If it can do this, then it is saving some locking.

> Parrot deals with PMCs, which can contain (lets consider scalars only)
> e.g. a PerlInt or a PerlNumer. Now we would have atomic access
> (normally) to the former and very likely non-atomic access to the
> latter
> just depending on the value which happened to be stored in the PMC.
>
> This implies, that we have to wrap almost[1] all shared write *and*
> read
> PMC access with LOCK/UNLOCK.
>
> [1] except plain ints and pointers on current platforms

Ah, but this misses a key point: We know that user data is allowed to
get corrupted if the user isn't locking properly--we only have to
protect VM-internal state. The key point is that it's very unlikely
that there will be any floats involved in VM-internal state--it's going
to be all pointers and ints (for offsets and lengths). That is, a
corrupted float won't crash the VM.

JEff

Damien Neil

Jan 16, 2004, 4:01:44 PM1/16/04
to Jeff Clites, l...@toetsch.at, perl6-i...@perl.org
On Thu, Jan 15, 2004 at 11:58:22PM -0800, Jeff Clites wrote:
> On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote:
> >Yes, that's what I'm saying. I don't see an advantage of JVMs
> >multi-step
> >variable access, because it even doesn't provide such atomic access.

You're missing the point of the multi-step access. It has nothing to
do with threading or atomic access to variables.

The JVM is a stack machine. JVM opcodes operate on the stack, not on
main memory. The stack is thread-local. In order for a thread to operate
on a variable, therefore, it must first copy it from main store to thread-
local store (the stack).

Parrot, so far as I know, operates in exactly the same way, except that
the thread-local store is a set of registers rather than a stack.

Both VMs separate working-set data (the stack and/or registers) from
main store to reduce symbol table lookups.


> What I was expecting that the Java model was trying to do (though I
> didn't find this) was something along these lines: "Accessing the main
> store involves locking, so by copying things to a thread-local store we
> can perform several operations on an item before we have to move it
> back to the main store (again, with locking). If we worked directly
> from the main store, we'd have to lock for each and every use of the
> variable."

I don't believe accesses to main store require locking in the JVM.

This will all make a lot more sense if you keep in mind that Parrot--
unthreaded as it is right now--*also* copies variables to working store
before operating on them. This isn't some odd JVM strangeness. The
JVM threading document is simply describing how the stack interacts
with main memory.

- Damien

Jeff Clites

Jan 16, 2004, 8:38:07 PM1/16/04
to Damien Neil, perl6-i...@perl.org, l...@toetsch.at
On Jan 16, 2004, at 1:01 PM, Damien Neil wrote:

> On Thu, Jan 15, 2004 at 11:58:22PM -0800, Jeff Clites wrote:
>> On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote:
>>> Yes, that's what I'm saying. I don't see an advantage of JVMs
>>> multi-step
>>> variable access, because it even doesn't provide such atomic access.
>
> You're missing the point of the multi-step access. It has nothing to
> do with threading or atomic access to variables.
>
> The JVM is a stack machine. JVM opcodes operate on the stack, not on
> main memory. The stack is thread-local. In order for a thread to
> operate
> on a variable, therefore, it must first copy it from main store to
> thread-local store (the stack).
>
> Parrot, so far as I know, operates in exactly the same way, except that
> the thread-local store is a set of registers rather than a stack.
>
> Both VMs separate working-set data (the stack and/or registers) from
> main store to reduce symbol table lookups.

...


> This will all make a lot more sense if you keep in mind that Parrot--
> unthreaded as it is right now--*also* copies variables to working store
> before operating on them. This isn't some odd JVM strangeness. The
> JVM threading document is simply describing how the stack interacts
> with main memory.

I think the JVM spec is actually implying something beyond this. For
instance section 8.3 states, "A store operation by T on V must
intervene between an assign by T of V and a subsequent load by T of V."
Translating this to parrot terms, this would mean that the following is
illegal, which it clearly isn't:

find_global P0, "V"
set P0, P1          # "assign by T of V"
find_global P0, "V" # "a subsequent load by T of V" w/o an intervening
                    # "store operation by T on V"

I think it is talking about something below the Java-bytecode
level--remember, this is the JVM spec, and constrains how an
implementation of the JVM must behave when executing a sequence of
opcodes, not the rules a Java compiler must follow when generating a
sequence of opcodes from Java source code.

What I think it's really saying, again translated into Parrot terms, is
this:

store_global "foo", P0 # internally, may cache value and not push to
                       # main memory
find_global P0, "foo"  # internally, can't pull value from main memory
                       # if above value was not yet pushed there

and I think the point is this:

find_global P0, "foo" # internally, also caches value in thread-local
                      # storage
find_global P0, "foo" # internally, can use cached thread-local value
And, as mentioned in section 8.6, any time a lock is taken, cached
values need to be pushed back into main memory, and the local cache
emptied. This doesn't make any sense if the "thread's working memory"
is interpreted as the stack.

JEff

Gordon Henriksen

Jan 16, 2004, 10:33:47 PM1/16/04
to Jeff Clites, perl6-i...@perl.org
On Friday, January 16, 2004, at 02:58 , Jeff Clites wrote:

> On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote:
>> Damien Neil <ne...@misago.org> wrote:
>>> On Thu, Jan 15, 2004 at 09:31:39AM +0100, Leopold Toetsch wrote:
>>>> I don't see any advantage of such a model. The more as it doesn't
>>>> guarantee any atomic access to e.g. long or doubles. The atomic
>>>> access to
>>>> ints and pointers seems to rely on the architecture but is of course
>>>> reasonable.
>>
>>> You *can't* guarantee atomic access to longs and doubles on some
>>> architectures, unless you wrap every read or write to one with a
>>> lock. The CPU support isn't there.
>>
>> Yes, that's what I'm saying. I don't see an advantage of JVMs
>> multi-step
>> variable access, because it even doesn't provide such atomic access.
>
> What I was expecting that the Java model was trying to do (though I
> didn't find this) was something along these lines: "Accessing the main
> store involves locking, so by copying things to a thread-local store we
> can perform several operations on an item before we have to move it
> back to the main store (again, with locking). If we worked directly
> from the main store, we'd have to lock for each and every use of the
> variable."

I think the real purpose of the model was to say "thread-local values
may be committed to main memory (perhaps significantly) after the local
copy is logically assigned." Thus: In the absence of explicit
synchronization, threads may manipulate potentially inconsistent local
copies of variables. This model addresses: Copies of variables in
registers, copies on the JVM stack, copies in the stack frame, thread
preemption prior to store (which occurs on uniprocessors and
multiprocessors alike), and delayed write-back caches in SMP systems.[*]

In short, this portion of the spec provides bounds for the undefinedness
of the behavior that occurs when programs do not use Java's
synchronization primitives. It does so realistically, in a manner that
contemporary computer systems can implement efficiently. (In fact, the
spec is far more descriptive than it is prescriptive.)

Or, as an example, it allows the natural thing to happen here:

; PPC[**] implementation of:
; var = var + var;
lwz r30, 0(r29)    ; load var (1 JVM "load")
add r30, r30, r30  ; double var (2 JVM "uses", 1 JVM "assign")
; if your thread is preempted here
stw r30, 0(r29)    ; store var (1 JVM "store")

And allows obvious optimizations like this:

; PPC implementation of:
; var = var + var;
; var = var + var;
lwz r30, 0(r29)
add r30, r30, r30
; imagine your thread is preempted here
add r30, r30, r30
stw r30, 0(r29)

But it explicitly disallows that same optimization for a case like this:

var = var + var;
synchronized (other) { other++; }
var = var + var;

That--that, and the whole cache coherency/delayed write thing:

; CPU 1                     ; CPU 2
li r29, 0xFFF8              li r29, 0xFFF8
li r30, 0xDEAD              li r30, 0xBEEF
stw r30, 0(r29)             stw r30, 0(r29)
lwz r28, 0(r29)             lwz r28, 0(r29)
; r28 is probably 0xDEAD    ; r28 is probably 0xBEEF
; (but could be 0xBEEF)     ; (but could be 0xDEAD)
sync                        noop
lwz r28, 0(r29)             lwz r28, 0(r29)
; r28 matches on both CPUs now, either 0xDEAD or
; 0xBEEF (but not 0xDEEF or 0xBEAD or 0x9999).


[* - On many SMP systems, processors do not have coherent views of main
memory (due to their private data caches) unless the program executes
explicit memory synchronization operations, which are at least expensive
enough that you don't want to execute them on every opcode.]

[** - Forgive my rusty assembler.]

> The reason I'm not finding it is that the semantic rules spelled out in
> the spec _seem_ to imply that every local access implies a
> corresponding access to the main store, one-to-one. On the other hand,
> maybe the point is that it can "save up" these accesses--that is, lock
> the main store once, and push back several values from the thread-local
> store. If it can do this, then it is saving some locking.

The spec doesn't require locking unless the architecture is unable to
provide atomic loads and stores for 32-bit quantities.

>> Parrot deals with PMCs, which can contain (lets consider scalars only)
>> e.g. a PerlInt or a PerlNumer. Now we would have atomic access
>> (normally) to the former and very likely non-atomic access to the
>> latter
>> just depending on the value which happened to be stored in the PMC.
>>
>> This implies, that we have to wrap almost[1] all shared write *and*
>> read
>> PMC access with LOCK/UNLOCK.
>>
>> [1] except plain ints and pointers on current platforms
>
> Ah, but this misses a key point: We know that user data is allowed to
> get corrupted if the user isn't locking properly--we only have to
> protect VM-internal state. The key point is that it's very unlikely
> that there will be any floats involved in VM-internal state--it's going
> to be all pointers and ints (for offsets and lengths). That is, a
> corrupted float won't crash the VM.

On one point with respect to "VM-internal state," morph scares me quite
a bit. Type-stable memory is vital to lock-free (efficient) data access.
Consider:

a_pmc is of type A
--- thread 1 is executing ---
Thread 1: load a_pmc->vtable->whatever
Thread 1: call to a_vtable_whatever begins...
--- thread 2 preempts thread 1 ---
Thread 2: morph PMC to type B
Thread 2: Allocate new B-type pmc_ext structure
Thread 2: update a_pmc->vtable
Thread 2: adjust a_pmc->pmc_ext
--- thread 1 resumes ---
Thread 1: ... finish setting up the call
Thread 1: a_vtable_whatever reads ((A_ext *)a_pmc->pmc_ext)->foo; B_ext
happens to keep some float at the same offset in the struct
Thread 1: chaos

Keeping in mind the reality of delayed writes on SMP systems, there's
just no way to code around this except to acquire a lock on the entire
PMC for every read or write. Bye-bye performance: Acquire 2-3 locks to
add 2 PerlInts together? ... Or bad performance and poor concurrency:
Acquire 1 global lock whenever performing any PMC operation.

I'm tired....

Gordon Henriksen
mali...@mac.com

Gordon Henriksen
Jan 17, 2004, 12:53:29 AM
to Jeff Clites, perl6-i...@perl.org
On Friday, January 16, 2004, at 08:38 , Jeff Clites wrote:

> On Jan 16, 2004, at 1:01 PM, Damien Neil wrote:
>
>> On Thu, Jan 15, 2004 at 11:58:22PM -0800, Jeff Clites wrote:
>>
>>> On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote:
>>>
>>>> Yes, that's what I'm saying. I don't see an advantage of JVMs
>>>> multi-step variable access, because it even doesn't provide such
>>>> atomic access.
>>
>> You're missing the point of the multi-step access. It has nothing to
>> do with threading or atomic access to variables.

... it has everything to do with allowing multiprocessors to operate
without extraneous synchronization.

>> The JVM is a stack machine. JVM opcodes operate on the stack, not on
>> main memory. The stack is thread-local. In order for a thread to
>> operate on a variable, therefore, it must first copy it from main
>> store to thread-local store (the stack).
>>
>> Parrot, so far as I know, operates in exactly the same way, except
>> that the thread-local store is a set of registers rather than a stack.
>>
>> Both VMs separate working-set data (the stack and/or registers) from
>> main store to reduce symbol table lookups.
> ...
>> This will all make a lot more sense if you keep in mind that
>> Parrot--unthreaded as it is right now--*also* copies variables to
>> working store before operating on them. This isn't some odd JVM
>> strangeness. The JVM threading document is simply describing how the
>> stack interacts with main memory.
>
> I think the JVM spec is actually implying something beyond this. For
> instance section 8.3 states, "A store operation by T on V must
> intervene between an assign by T of V and a subsequent load by T of V."
> Translating this to parrot terms, this would mean that the following is
> illegal, which it clearly isn't:
>
> find_global P0, "V"
> set P0, P1 # "assign by T of V"
> find_global P0, "V" # "a subsequent load by T of V" w/o an
> intervening "store operation by T on V"

This rule addresses aliasing. It says that this (in PPC assembly):

; presume &(obj->i) == &obj+12
lwz r29, 12(r30) ; read, load
addi r29, r29, 1 ; use, assign
lwz r28, 12(r30) ; read, load
addi r28, r28, 1 ; use, assign
stw r29, 12(r30) ; store, eventual write
stw r28, 12(r30) ; store, eventual write

... is an invalid implementation of this:

j.i = j.i + 1;
k.i = k.i + 1;

... where the JVM cannot prove j == k to be false. The rule states that
the stw of r29 must precede the stw of r28. Why this is under
threading... beyond me.


Let me briefly highlight the operations as discussed and digress a
bit as to why all the layers:

main memory
--read+load-->
working copy (register file, stack frame, etc.)
--use-->
execution engine (CPU core)
<--assign--
working copy (register file, stack frame, etc.)
<--write+store--
main memory

(I paired the read+load and write+store due to the second set of rules
in 8.2.)

The spec never says where a read puts something that a load can use it,
or where a store puts something that a write can use it. A store with
its paired write pending is simply an in-flight memory transaction (and
the same for a read+load pair). Possible places the value could be:
In-flight on the system bus; queued by the memory controller; on a dirty
line in a write-back cache; somewhere in transit on a NUMA
architecture. Store-write and read-load are just different ends of the
underlying ISA's load and store memory transactions. The read and write
operations specify the operations from the memory controller's
perspective; load and store specify them from the program's perspective.

Note that reads and writes are performed "by main memory," not "by a
thread." That distinction is crucial to reading the following section
from the spec:

"8.2 EXECUTION ORDER AND CONSISTENCY
"The rules of execution order constrain the order in which certain
events may occur. There are four general constraints on the
relationships among actions:

* "The actions performed by any one thread are totally ordered; that
is, for any two actions performed by a thread, one action precedes the
other.
* "The actions performed by the main memory for any one variable are
totally ordered; that is, for any two actions performed by the main
memory on the same variable, one action precedes the other.
* "..."

The extra read/write step essentially allows main memory (the memory
controller) to order its operations with bounded independence of any
particular thread. Careful reading of the other rules will show that
this is only a useful abstraction in the case of true concurrency (e.g.,
SMP), as the other rules ensure that a single processor will always load
variables in a state consistent with what it last stored.

> I think it is talking about something below the Java-bytecode
> level--remember, this is the JVM spec, and constrains how an
> implementation of the JVM must behave when executing a sequence of
> opcodes, not the rules a Java compiler must follow when generating a
> sequence of opcodes from Java source code.
>
> What I think it's really saying, again translated into Parrot terms, is
> this:

What it's really saying is that JVM must be wary of aliasing, a bane of
optimizing compilers.

> store_global "foo", P0 # internally, may cache value and not push to
> main memory

Well, it might not immediately push to main memory. But, yes, a dirty write
cache would qualify as an in-flight memory transaction (an uncommitted
store operation).

> find_global P0, "foo" # internally, can't pull value from main memory
> if above value was not yet pushed there

True, but it's enforced by a different rule:

"Let action A be a load or store by thread T on variable V, and let
action P be the corresponding read or write by the main memory on
variable V. Similarly, let action B be some other load or store by
thread T on that same variable V, and let action Q be the corresponding
read or write by the main memory on variable V. If A precedes B, then P
must precede Q. (Less formally: operations on the master copy of any
given variable on behalf of a thread are performed by the main memory in
exactly the order that the thread requested.)"

> and I think the point is this:
>
> find_global P0, "foo" # internally, also caches value in thread-local
> storage
> find_global P0, "foo" # internally, can use cached thread-local value

Don't think of working storage as a cache. Think of it as a necessity:
The processor can't operate on data unless it has been brought into the
working storage. The true complexities of caching are encompassed by the
read and write operations and the orderings imposed on them--particularly
with respect to the fact that only explicit locking imposes an ordering
on reads and writes between threads. Caches, the stack, and registers
all fall under the rubric of working storage in this spec.

> And, as mentioned in section 8.6, any time a lock is taken, cached
> values need to be pushed back into main memory, and the local cache
> emptied. This doesn't make any sense if the "thread's working memory"
> is interpreted as the stack.

It does make sense. The rule says that all thread-local changes must be
flushed completely to main memory before the thread blocks. It ensures
that other threads will see changes made by the thread, and vice versa,
so that data protected by the lock is always viewed in a consistent
state. On a PowerPC, the JVM just needs to store all
assigned-but-unstored variables and then execute the PPC sync
instruction.

Gordon Henriksen
mali...@mac.com


[Vocab note: A write-back cache only stores changes to main memory when
the variable is evicted from the cache (it optimizes both loads and
stores). Its corollary is a write-through cache, which stores changes to
main memory on every update (it only optimizes loads). A write-back
cache prevents implicit SMP cache coherency, because it gives the other
processor caches no sign that the memory word has been changed until the
word is evicted naturally or a cache-coherency instruction (like sync)
forces the issue.]

Leopold Toetsch
Jan 17, 2004, 5:58:32 AM
to Damien Neil, perl6-i...@perl.org
Damien Neil <ne...@misago.org> wrote:

> The JVM is a stack machine. JVM opcodes operate on the stack, not on
> main memory. The stack is thread-local. In order for a thread to operate
> on a variable, therefore, it must first copy it from main store to thread-
> local store (the stack).

Silly me, yes of course, that's it.

> Parrot, so far as I know, operates in exactly the same way, except that
> the thread-local store is a set of registers rather than a stack.

Except that we don't need this multi-step variable access, because a
P-register can refer to a shared PMC living in a different thread--that
is, in a different thread's interpreter->arena_base memory.

leo

Jeff Clites
Jan 17, 2004, 2:48:47 PM
to l...@toetsch.at, perl6-i...@perl.org, Damien Neil

But just like a P-register, a Java stack entry can hold a reference to
an object, and in Java there's no thread-based ownership of objects.

So if this parallel is correct, then Java wouldn't need 2-step access
either. This is part of why I think this is the wrong interpretation of
the JVM spec.

Jeff
