Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion JVM as a threading example (threads proposal)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Jeff Clites  
View profile  
 More options Jan 12 2004, 4:48 am
Newsgroups: perl.perl6.internals
From: jcli...@mac.com (Jeff Clites)
Date: Mon, 12 Jan 2004 01:29:47 -0800
Local: Mon, Jan 12 2004 4:29 am
Subject: JVM as a threading example (threads proposal)
This is a threading proposal, in the form of a collection of notes.

Rationale:

We need to constantly compare parrot to the JVM, and in particular have  
a deep understanding of cases where the JVM performs well (or better  
than parrot). The reason for this is obvious: the JVM is the product of  
several years of R&D, and we need to benefit from that where we can. Of  
course, there are places where the two will necessarily differ (due to  
differences such as static v. dynamic typing of their target  
languages), but these necessary differences don't account for  
everything, and it's important for us to understand if individual  
performance differences are fundamental, or incidental. In particular,  
it's instructive to look at Java's threading strategy, and its impact  
on GC.

I have 2 fundamental assumptions:

1) As stated by others before, Parrot needs to guarantee that its  
internal state is protected, even in the absence of proper  
synchronization by user code, though user-visible data structures may  
become corrupted from the user's perspective. (That is, a hash might  
suddenly lose some entries, but it can't become corrupted in such a way  
that a crash or memory leak might result.) To put it formally, no  
sequence of byte code can have the possibility of causing an attempt to  
read from, or write to, unallocated memory. [That may not cover all  
cases of corruption, but it gets at the main point.] This is  
particularly important in an embedded context--we want to be able to  
guarantee that bad bytecode can't crash the process in which parrot is  
embedded. Note that Java seems to provide similar guarantees (more on  
this below).

2) My second assumption is that if we can provide a safe and performant  
implementation which allows "full-access" threads (that is, without  
data-access restrictions imposed between threads), then  
access-restricted threads (such as Perl 5's ithreads) can be  
implemented on top of this. Note that thread creation in the  
access-restricted case will necessarily be slower, due to the need to  
copy (or COW-copy) additional data structures.

So the theme is taking lessons from the JVM wherever we can. The focus  
is on how we can provide safe and high-performance threading without  
data-access restrictions between threads (in light of assumption 2  
above).

Notes and considerations:

1) As mentioned above, Java seems to guarantee robustness in the  
absence of proper user-level synchronization. In particular, section  
2.19 of the JVM spec  
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Concepts.doc.html#33308> states the following:

"When a thread uses the value of a variable, the value it obtains is in  
fact a value stored into the variable by that thread or by some other  
thread. This is true even if the program does not contain code for  
proper synchronization. For example, if two threads store references to  
different objects into the same reference value, the variable will  
subsequently contain a reference to one object or the other, not a  
reference to some other object or a corrupted reference value. (There  
is a special exception for long and double values; see §8.4.)"

(It also further states, "In the absence of explicit synchronization,  
an implementation is free to update the main memory in an order that  
may be surprising. Therefore, the programmer who prefers to avoid  
surprises should use explicit synchronization.")

(Section 8.1  
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html#22197> also clarifies, "A variable is any location  
within a program that may be stored into. This includes not only class  
variables and instance variables, but also components of arrays.")

2) It seems that Java provides these guarantees in part by relying on  
the atomicity of stores of 32-bit sized data. (This is "atomic" in the  
sense of "no word tearing", meaning that when such a value is written  
to a memory location, any thread reading that location will see either  
the old or the new value, and not some other value.) Section 8.4 of the  
JVM spec  
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html#22244> implies this, albeit indirectly. Due to the  
ubiquity of the JVM, it seems reasonable for Parrot to similarly rely  
on atomic stores, without compromising the range of platforms on which  
parrot can run. In practice, modern processors likely provide such a  
guarantee natively, without the need for explicit locking (though  
explicit locking could be employed in the seemingly rare case of a  
platform which provides threads but not atomic memory updates). If Java  
is relying on some other mechanism here, we need to understand what  
that is.

3) There seem to be 3 basic categories of things which need  
thread-safety: data structures internal to the VM (eg, opcode arrays,  
register stacks), variables (globals, registers, etc.; most  
importantly, those which hold references to aggregate types), and the  
internals of aggregate types (PMCs and strings). Things in these  
categories which we don't need to worry about are things which are  
logically local to a thread, no matter what the thread  
semantic--registers, lexical pads, etc. Of the three categories, I  
think that the aggregate types pose the biggest challenge. Most of the  
internals of the VM are likely not logically shared between threads  
(and those which are should probably be protected by explicit locking,  
and again this is likely true whether or not we actually support  
full-access threads), and global variables probably rely on a  
thread-safe hash implementation (and in particular, may be slower to  
access than lexical variables). So aggregate types are the key "problem  
to be solved".

4) On a related note, access to the global stash should probably happen  
via a vtable-like function pointer hanging off of the interpreter  
struct, to allow data-access (sharing) restrictions to be imposed when  
necessary. (That is, this use of different functions here will allow  
restricted access v. full access to globals.)

5) Java seems to use a check-in/check-out model for access to global  
data, in which global data "lives" in a central store, but is copied  
back-and-forth to thread-local storage for modification. I don't fully  
understand the performance benefit which I assume this provides, but  
this is something else we need to understand. It's discussed in detail  
in the threads section of the JVM spec,  
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html>.

6) A key aspect of the safety guarantees provided by Java (in addition  
to atomicity of 32-bit stores) seems to be non-shrinking of allocated  
memory blocks. For example, if an object needs to acquire additional  
memory in order to expand the backing store of a hash, it has to  
allocate a new (larger) memory block, copy over the values from the  
previous block, and leave the previous block to be garbage-collected.  
In practice, parrot already does this, in the case of the compacting  
collector. (It's realloc(), without immediately deallocating the old  
block.) This is part of what allows safety in the light of concurrent  
modification (because stale pointers and offsets can still refer to  
valid structures).

7) Java 1.4 provides several different GC strategies, but most (if not  
all) involve a stop-the-world stage, in which all threads are stopped.  
(Note that this doesn't mean that the whole GC process needs to happen  
with computation stopped, merely that some part of it may need to  
happen with threads stopped.) So a strategy in which data is not owned  
per-thread (from the point of view of GC) should be reasonable, since  
GC may be necessarily a global endeavor. In parrot, if we have devised  
a way to provide timely delivery of events, then we should be able to  
use the event infrastructure to suspend all threads. (That is, we don't  
need or want to suspend threads at the pthreads level.) See  
<http://developers.sun.com/techtopics/mobility/midp/articles/
garbagecollection2/> and the previous version  
<http://developers.sun.com/techtopics/mobility/midp/articles/garbage/>  
for a discussion of Java's GC strategies.

8) It should be possible to provide per-thread allocation pools, such  
that header or string body allocations can occur without locking,  
despite the fact that the allocated items are not considered to be  
"owned" by a thread. (Since GC will occur globally, as discussed in  
item 7, the deallocation will not be logically per-thread.)

9) As mentioned above, PMC thread-robustness seems to be the main  
challenge. (Where "thread-robustness" means non-crashing in the light  
of concurrent modification.) I think that we can provide this in 3  
ways: (a) minimize the number of custom PMCs that need to be written;  
that is, emphasize use of the object infrastructure rather than custom  
PMCs to provide HLL-level data types. (b) provide tools and guidelines  
for implementing thread-robust PMCs; for instance, the memory-expansion  
policy mentioned in item 6 above (Java provides a native  
System.arraycopy() to assist with this, for example), favoring the use  
of offsets/indexes rather than pointers into data blocks, and  
out-of-band creation of data structure (initialization prior to storage  
in a globally-accessible location). (c) use actual locking internally  
when necessary, and only when running multithreaded. Simple PMCs such  
as PerlInt are probably already thread-robust. Notably, PerlString  
should be straightforward to make thread-robust as well, as long as its  
flags are only changed at create time or during GC, and a new string  
header is put into place when the backing store is expanded.  
(Specifically, if headers can be guaranteed to be allocated from  
zero-filled memory, then bufstart and buflen will either be consistent,  
or one of them will be zero/NULL.) Thread-robustness is probably  
easiest in the case of leaf structures such as these (ie, structures  
which hold few pointers).

10) For PMC internal use, the lock-only-when-threaded mentioned in item  
9 can be easily provide via a LOCK_THIS_PMC macro which expands to  
something like "if(INTERP->is_threaded) {  
pthread_mutex_lock(SELF->mutex) }". (It's safe to check an  
interpreter->is_threaded flag without locking, because its value will  
only change when single-threaded.)

11) I don't think it's particularly relevant that Java's strings are  
immutable. StringBuffers are not, nor are collections such as HashMap,  
so Java still needs to deal with thread-robustness of data structures.  
Also, parrot has facilities for immutability, which HLLs can take  
advantage of as an optimization (though probably Perl6 will not, except  
for things such as string literals).

A few benchmarks (based on testing on Mac OS X, single CPU):

1) A pthread_lock/pthead_unlock pair on an uncontested mutex seems to  
have an overhead equivalent to about 7 function calls. (Or, since it  
_is_ 2 function calls, you could say the overhead is 5 function calls  
above that.) Also, pthead_trylock is only slightly faster.

2) In a pure-computation context with no locking, there isn't a penalty  
to being multithreaded. That is, counting to a large number in each of  
40 threads takes no more time (seemingly, slightly less) than counting  
up to 40*large-number in a single thread. This is likely because the  
context-switch overhead is already being paid in context switching  
between processes.

3) In Perl 5.8.1, it is much faster to perform 1000 forks (and waits)  
than it is to create (and join) 1000 threads. In C, the opposite is  
true.

I'll publish some actual benchmarking numbers, with source code,  
separately. (They're just sort of interesting to have on hand.)

JEff


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.