1) We are relying on OS services for all threading constructs
We are not going to count on 'atomic' operations, low-level
processor guarantees in assembly, and we are definitely *not* rolling
our own threading constructs of any sort. They break far too often in
the face of SMP, newer processors with different write ordering, and
odd hardware issues.
Yes, I know there's a "But..." for each of these. Please don't, I'll
just have to get cranky. We shall use the system thread primitives
and functions.
2) The only thread constructs we are going to count on are:
*) Abstract, non-recursive, simple locks
*) Rendezvous points (Things threads go to sleep on until another
thread pings the condition)
*) Semaphores (in the "I do a V and P operation, with a count" sense)
We are *not* counting on being able to kill or freeze a thread from
any other thread. Nor are we counting on recursive locks, read/write
locks, or any other such things. Unfortunately.
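For concreteness, here's roughly what a thin layer over those OS
primitives might look like, assuming pthreads and POSIX semaphores
underneath. (A sketch only -- the names are illustrative, not a
proposed API.)

    /* Minimal sketch of the three primitives over POSIX threads.
     * Illustrative names only, not Parrot's actual API. */
    #include <pthread.h>
    #include <semaphore.h>

    typedef pthread_mutex_t Parrot_lock;    /* abstract, non-recursive lock */

    typedef struct {                        /* rendezvous point */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } Parrot_rendezvous;

    #define LOCK(l)   pthread_mutex_lock(&(l))
    #define UNLOCK(l) pthread_mutex_unlock(&(l))

    /* Go to sleep until another thread pings the condition.  (Real
     * code would re-check a predicate in a loop, since condition
     * waits are subject to spurious wakeups.) */
    static void rendezvous_wait(Parrot_rendezvous *r) {
        pthread_mutex_lock(&r->m);
        pthread_cond_wait(&r->c, &r->m);
        pthread_mutex_unlock(&r->m);
    }

    static void rendezvous_ping(Parrot_rendezvous *r) {
        pthread_mutex_lock(&r->m);
        pthread_cond_signal(&r->c);
        pthread_mutex_unlock(&r->m);
    }

    /* Semaphores: P and V, with a count. */
    #define SEM_P(s) sem_wait(&(s))
    #define SEM_V(s) sem_post(&(s))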
3) I'm still not paying much attention.
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
I'm in accord with all of the above, but: I'm not sure yet, though
/me thinks that we need CLEANUP_PUSH and CLEANUP_POP handler
functionality too. These are basically macros (simple in the absence
of pthread_kill or the like) and are already in use :)
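Something with the shape of the POSIX pair, say. A sketch, assuming
pthreads (release_lock and big_lock are just illustrative):

    /* CLEANUP_PUSH/_POP in terms of pthreads.  Without pthread_cancel
     * or pthread_kill in the picture, these could collapse to trivial
     * macros instead. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    static void release_lock(void *arg) {
        pthread_mutex_unlock((pthread_mutex_t *)arg);
    }

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&big_lock);
        /* If the thread were cancelled in the critical section, the
         * pushed handler would still run and release the lock. */
        pthread_cleanup_push(release_lock, &big_lock);
        puts("critical section");
        pthread_cleanup_pop(1);   /* 1 = also run the handler now */
        return NULL;
    }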
> 3) I'm still not paying much attention.
May I ask why?
-- Yes :)
Why?
leo
I have a naive question:
Why must each thread have its own interpreter?
I understand that this suggestion will likely be disregarded because of
the answer to the above question. But here goes anyway...
Why not have the threads that share everything share interpreters? We
can have these threads live within a single interpreter, thus
eliminating the need for complicated GC locking and resource-sharing
complexity. Because all of these threads will be one kernel-level
thread, they will not actually run concurrently, and there will be no
need to lock them. We will have to implement a rudimentary scheduler in
the interpreter, but I don't think that is actually that hard.
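To sketch what I mean (hypothetical names; the real work would happen
in the interpreter's op loop, stubbed out here):

    #include <stddef.h>

    typedef struct {
        void *pc;       /* saved bytecode position */
        int   alive;
    } GreenThread;

    enum { TIMESLICE_OPS = 1000 };

    /* Run up to `budget` ops for one thread; return 0 when the thread
     * has finished.  (A stub -- really this is the op dispatch loop.) */
    static int run_some_ops(GreenThread *t, int budget) {
        (void)t; (void)budget;
        return 0;
    }

    /* Round-robin: each thread gets a fixed budget of ops, and we
     * switch only at these safe points.  No locks are needed, since
     * everything runs in a single kernel thread. */
    static void schedule(GreenThread *threads, size_t n) {
        size_t live = n;
        while (live > 0) {
            for (size_t i = 0; i < n; i++) {
                if (threads[i].alive
                    && !run_some_ops(&threads[i], TIMESLICE_OPS)) {
                    threads[i].alive = 0;
                    live--;
                }
            }
        }
    }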
Threads that do not share state can be implemented as separate
interpreters that send events to each other to synchronize.
This allows threads to have completely shared state, at the cost of not
being quite as efficient on SMP (they might be more efficient on single
processors as there are fewer kernel traps necessary).
Programs that want to run faster on an SMP machine will use threads
without shared state that use events to communicate (which probably
provides better performance anyway, as there will be fewer trips to
main memory caused by cache misses on shared data).
I understand if this suggestion is dismissed for violating the rules,
but I would like an answer to the question simply because I do not know
the answer.
Thanks,
Matt
> I have a naive question:
>
> Why must each thread have its own interpreter?
~handwavy, high-level answer~
For the same reason each thread in C, for example, needs its own stack
pointer.
Since Parrot's a register machine, each thread needs its own set of
registers so it can go off and do its own thing without whomping all
over the other threads. Those registers live in each interpreter.
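Schematically, something like this (a gross simplification -- the
real structure holds much more):

    /* Why the register file forces per-thread interpreters: each
     * thread needs all of this to itself. */
    typedef struct {
        long    int_regs[32];   /* I0..I31 */
        double  num_regs[32];   /* N0..N31 */
        void   *str_regs[32];   /* S0..S31 */
        void   *pmc_regs[32];   /* P0..P31 */
        void   *pc;             /* current opcode pointer */
        /* ... stacks, globals pointer, GC state, ... */
    } Interp_sketch;
    /* Two threads sharing one of these would clobber each other's
     * registers on every op. */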
-- c
> All~
>
> I have a naive question:
>
> Why must each thread have its own interpreter?
The short answer is that the bulk of the state of the virtual machine
(including, and most importantly, its registers and register stacks)
needs to be per-thread, since it represents the "execution context"
which is logically thread-local. Stuff like the globals stash may or
may not be shared (depending on the thread semantics we want), but as I
understand it the "potentially shared" stuff is actually only a small
part of the bits making up the VM.
That said, I do think we have a terminology problem, since I initially
had the same question you did, and I think my confusion mostly stems
from there being no clear terminology to distinguish between 2
"interpreters" which are completely independent, and 2 "interpreters"
which represent 2 threads of the same program. In the latter case,
those 2 "interpreters" are both part of one "something", and we don't
have a name for that "something". It would be clearer to say that we
have two "threads" in one "interpreter", and just note that almost all
of our state lives in the "thread" structure. (That would mean that the
thing which is being passed into all of our API would be called the
thread, not the interpreter, since it's a thread which represents an
execution context.) It's just (or mostly) terminology, but it's causing
confusion.
> Why not have the threads that share everything share interpreters? We
> can have these threads live within a single interpreter, thus
> eliminating the need for complicated GC locking and resource-sharing
> complexity. Because all of these threads will be one kernel-level
> thread, they will not actually run concurrently, and there will be no
> need to lock them. We will have to implement a rudimentary scheduler
> in the interpreter, but I don't think that is actually that hard.
There are 2 main problems with trying to emulate threads this way:
1) It would likely kill the performance gains of JIT.
2) Calls into native libraries could block the entire VM. (We can't
manually timeslice external native code.) Even things such as regular
expressions can take an unbounded amount of time, and the internals of
the regex engine will be in C--so we couldn't timeslice without slowing
them down.
And basically, people are going to want "real" threads--they'll want
access to the full richness and power afforded by an API such as
pthreads, and the "real" threading libraries (and the OS) have already
done all of the really hard work.
> This allows threads to have completely shared state, at the cost of
> not being quite as efficient on SMP (they might be more efficient on
> single processors as there are fewer kernel traps necessary).
Not likely to be more efficient even on a single processor, since even
if a process is single threaded it is being preempted by other
processes. (On Mac OS X, I've not been able to create a case where
being multithreaded is a slowdown in the absence of locking--even for
pure computation on a single processor machine, being multithreaded is
actually a slight performance gain.)
> Programs that want to run faster on an SMP machine will use threads
> without shared state that use events to communicate.
It's nice to have speed gains on MP machines without having to redesign
your application, especially as MP machines are quickly becoming the
norm.
> (which probably provides better performance, as there will be fewer
> faults to main memory because of cache misses and shared data).
Probably not faster actually, since you'll end up with more data
copying (and more total data).
> I understand if this suggestion is dismissed for violating the rules,
> but I would like an answer to the question simply because I do not
> know the answer.
I hope my answers are useful. I think it's always okay to ask questions.
JEff
I wasn't listing anything we can build ourselves -- arguably we only
need two of the three things I listed, since with semaphores you can
do the rendezvous things (POSIX condition variables, but I'm sure
Windows has something similar) and vice versa.
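For instance, a rendezvous point is basically just a semaphore whose
count starts at zero (a sketch, assuming POSIX semaphores):

    #include <semaphore.h>

    static sem_t rendezvous;

    static void rendezvous_init(void) { sem_init(&rendezvous, 0, 0); }

    /* The sleeping side: P blocks until the count goes positive. */
    static void wait_for_ping(void)   { sem_wait(&rendezvous); }

    /* The pinging side: V wakes exactly one sleeper. */
    static void ping(void)            { sem_post(&rendezvous); }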
> > 3) I'm still not paying much attention.
>
>May I ask why?
>-- Yes :)
>Why?
Got a killer deadline at work. Things ease up after the 9th if I make
it, but then I owe someone else at home a few days of time. :)
[Dan getting cranky snipped]
>And that was that! Sorry I spoke.
I'm not trying to shut anyone down. What I wanted to do was stop
folks diving down to too low a level. Yes, we could roll our own
mutexes, condition variables, and semaphores, but we're not going to;
it's far too system-specific--not just architecture- or OS-specific,
but system-setup-specific. Single-processor systems want to context
switch on mutex acquisition failures, SMP systems want to use adaptive
spinlocks, atomic test-and-set operations aren't necessarily available
on some NUMA systems, and ordering operations are somewhat fuzzy on
some of the more advanced processors--and that's all just on x86
systems.
All this stuff is best left to the OS, which presumably has a better
idea of what the right and most efficient thing to do is, and
certainly has more resources behind it than we do. It's definitely in
a position to stay up-to-date in ways that we aren't. (You can
guarantee that the OS on a system is sufficiently up-to-date to run
properly, but not the user executables, which can be years old.)
I really don't want folks to get distracted by trying to get down to
the metal--it'll just get folks all worked up over something we're
not going to be doing because it's not prudent. I'd prefer everyone
get worked up over the higher-level stuff and just assume we have the
simple stuff at hand; since the simple stuff is all we can safely
assume, that's the prudent thing.
(This is one of those cases where I'd really prefer to force
everyone doing thread work to work on 8-processor Alpha boxes
(your choice of OS, I don't care), one of the most vicious threading
environments ever devised, but alas that's not going to happen. Pity,
though)
DS> (This is one of those cases where I'd really prefer to force
DS> everyone doing thread work to work on 8-processor Alpha
DS> boxes (your choice of OS, I don't care), one of the most vicious
DS> threading environments ever devised, but alas that's not going to
DS> happen. Pity, though)
single cpu lsi-11's running FG/BG rt-11 don't count? :)
it was a dec product too! :)
uri
--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
Given that it's not an SMP, massively out-of-order NUMA system with
delayed writes... no. 'Fraid not.
DS> At 11:49 PM -0500 1/3/04, Uri Guttman wrote:
>> >>>>> "DS" == Dan Sugalski <d...@sidhe.org> writes:
>>
DS> (This is one of those cases where I'd really prefer to force
DS> everyone doing thread work to work on 8-processor Alpha
DS> boxes (your choice of OS, I don't care), one of the most vicious
DS> threading environments ever devised, but alas that's not going to
DS> happen. Pity, though)
>>
>> single cpu lsi-11's running FG/BG rt-11 doesn't count? :)
DS> Given that it's not an SMP, massively out-of-order NUMA system with
DS> delayed writes... no. 'Fraid not.
bah, humbug. then dec lied in their marketing crap.
actually i think there were SMP pdp/lsi-11 systems but i never had one.
tonight i happened to drive by the apartment where 20 years ago i lived
alone with an lsi-11 box that my employer lent me (cost $10k!!). did my
thesis on it. times have changed a little.
The most admirable reason for asking a question, and I doubt it will
be dismissed.
H
Why on earth would they be all one kernel-level thread?
--
Monto Blanco... scorchio!
> Why not have the threads that share everything share interpreters? We
> can have these threads live within a single interpreter, thus
> eliminating the need for complicated GC locking and resource-sharing
> complexity. Because all of these threads will be one kernel-level
> thread, they will not actually run concurrently, and there will be no
> need to lock them. We will have to implement a rudimentary scheduler in
> the interpreter, but I don't think that is actually that hard.
Jeff already answered that. The above model is implemented in Ruby,
for example. But we want preemptive threads that can take advantage of
multiple processors.
> Matt
leo
>> Why must each thread have its own interpreter?
> The short answer is that the bulk of the state of the virtual machine
> (including, and most importantly, its registers and register stacks)
> needs to be per-thread, since it represents the "execution context"
> which is logically thread-local.
Yep. A struct Parrot_Interp has all the information needed to run one
thread of execution. When you start a new VM thread, you need a new
Parrot_Interp to run the code.
But how this new Parrot_Interp gets created depends on the thread
type. The range runs from everything new except the opcode stream
(type 1 - the nothing-shared thread) to only registers + stacks + a
bit more being distinct (type 4 - the shared-everything case).
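Schematically, something like this (hypothetical names, not the real
code):

    #include <stdlib.h>

    typedef enum { SHARE_NOTHING = 1, SHARE_ALL = 4 } thread_type;

    typedef struct Interp {
        void *registers, *stacks;   /* always fresh per thread */
        void *bytecode;             /* opcode stream -- always shared */
        void *globals;              /* shared or copied, per type */
    } Interp;

    /* Stand-ins for the real allocation/copy routines. */
    static void *fresh(void)        { return calloc(1, 64); }
    static void *copy_of(void *src) { (void)src; return calloc(1, 64); }

    static Interp *new_thread_interp(Interp *parent, thread_type t) {
        Interp *i = calloc(1, sizeof *i);
        i->registers = fresh();                  /* distinct for all types */
        i->stacks    = fresh();
        i->bytecode  = parent->bytecode;         /* shared even for type 1 */
        i->globals   = (t == SHARE_ALL)
                     ? parent->globals           /* alias: type 4 */
                     : copy_of(parent->globals); /* private copy: type 1 */
        return i;
    }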
Perl5 doesn't have a real interpreter structure; it's mainly a bunch
of globals. But when perl is compiled with threads enabled, tons of
macros convert these globals into accesses on the thread context,
which is then passed around as the first argument of API calls -
mostly (that's at least how I understand the src).
This thread context is our interpreter structure, with all the
necessary information or state to run a piece of code, whether as the
only thread or as one of many.
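The trick has roughly this shape (a simplified illustration, not
perl's actual macro definitions):

    /* One function signature serves threaded and unthreaded builds:
     * with threads, every API call grows the context as its first
     * argument; without, the macros vanish and a global is used. */
    #ifdef USE_THREADS
    typedef struct interp { int error_count; /* ...tons more... */ } Interp;
    #  define pTHX   Interp *my_interp          /* parameter declaration  */
    #  define aTHX   my_interp                  /* argument at call sites */
    #  define ERRCNT (my_interp->error_count)   /* "global" becomes field */
    #else
    static struct interp { int error_count; } the_interp;
    #  define pTHX   void
    #  define aTHX
    #  define ERRCNT (the_interp.error_count)
    #endif

    static void bump_errors(pTHX) { ERRCNT++; }
    static void do_stuff(pTHX)    { bump_errors(aTHX); }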
> That said, I do think we have a terminology problem, ...
> ... It would be clearer to say that we
> have two "threads" in one "interpreter", and just note that almost all
> of our state lives in the "thread" structure. (That would mean that the
> thing which is being passed into all of our API would be called the
> thread, not the interpreter,
Yep. But the thing passed around happens to be named interpreter, so
that's our thread state; whether you run single-threaded or not doesn't
matter. A thread-enabled interpreter is created by filling in one
additional structure, "thread_data", with thread-specific items like
the thread handle or thread ID. But either way, the state is called the
interpreter.
> JEff
leo
> Given that it's not an SMP, massively out-of-order NUMA system with
> delayed writes... no. 'Fraid not.
Sorry to be pedantic, but I always thought that the NU in NUMA implied
a contradiction of the S in SMP!
"NUMA MP" or "SMP", what does it mean to have *both* ?
--
Sam Vilain, s...@vilain.net
What would life be if we had no courage to attempt anything ?
VINCENT van GOGH
It means you've got loosely coupled clusters of SMP things. For
example, if you go buy an Alpha GS3200 32-processor system (assuming
DEC^WCompaq^HP still knows how to sell the things), you have one of
these things. It's a set of 8 4-processor nodes with a fast
interconnect between them which functions as a 32-CPU system. The
four processors in each node are in a traditional SMP setup with a
shared memory bus, tightly coupled caches, and fight-for-the-bus
access to the memory on that node. Access to memory on another node
goes over a slower bus, though it still looks and acts like local
memory.
Nearly all of the NUMA systems I know of act like this, because it's
still feasible to have tightly coupled 2- or 4-CPU SMP systems. The
global slowdown generally kicks in past that point, so NUMA systems
usually group 2- or 4-CPU SMP systems together this way.
Given the increases in processor vs. memory vs. bus speeds, this
setup may not hold for much longer, as it's only really workable when
a single CPU doesn't saturate the memory bus with any regularity,
which is getting harder and harder to ensure. (Backplane and memory
speeds can be increased pretty significantly with a sufficient
application of cash, which is why the mini and mainframe systems can
actually do it, but there are limits beyond which cash just won't get
you.)
Truth to tell, I got the idea from Ruby. As I said, it makes
synchronization easier, because the interpreter can dictate when
threads context switch, allowing them to switch only at safe points.
There are some tradeoffs to this, though. I had forgotten about
threads calling into C code. The example of regular expressions
doesn't quite work, as I think those are supposed to compile to
bytecode...
Matt
> Dave Mitchell wrote:
>
>> Why on earth would they be all one kernel-level thread?
>>
>>
> Truth to tell, I got the idea from Ruby. As I said, it makes
> synchronization easier, because the interpreter can dictate when
> threads context switch, allowing them to switch only at safe points.
> There are some tradeoffs to this, though. I had forgotten about
> threads calling into C code. The example of regular expressions
> doesn't quite work, as I think those are supposed to compile to
> bytecode...
Ah yes, I think you are right about regexes, judging from 'perldoc
ops/rx.ops'. I was thinking of a Perl5-style regex engine, in which
regex application is a call into compiled code (I believe...).
JEff
[...]
> these things. It's a set of 8 4-processor nodes with a fast
> interconnect between them which functions as a 32 CPU system. The
> four processors in each node are in a traditional SMP setup with a
> shared memory bus, tightly coupled caches, and fight-for-the-bus
[...]
I know what a NUMA system is, I was just a little worried by the
combination of the terms SMP and NUMA in the same sentence :).
Normally "SMP" means "Shared Everything" - meaning Uniform Memory
Access. If compared to the term MPP or AMP (in which different CPUs
are put to different tasks), it is true that each node in a NUMA
system could be put to any task. So, the term "SMP" would seem to fit
partially; but the implication is with NUMA that there are clear
benefits to *not* having each processor doing *exactly* the same
thing, all the time. "CPU affinity" & all that.
Groups of processors share a block of memory in all the NUMA systems
I've seen, too (SGI Origin/Onyx; and the Sun Enterprise servers *must*
be built this way, though they don't mention it!). So I'd say the
"SMP" is practically redundant, bordering on confusing. Maybe a term
like "8 x 4MP NUMA" is better.
I did apologise at the beginning for being pedantic. But hey, didn't
this digression serve to elaborate on the meaning of NUMA ? :)
> Given the increases in processor vs. memory vs. bus speeds, this
> setup may not hold for much longer, as it's only really workable
> when a single CPU doesn't saturate the memory bus with any
> regularity, which is getting harder and harder to ensure.
> (Backplane and memory speeds can be increased pretty significantly
> with a sufficient application of cash, which is why the mini and
> mainframe systems can actually do it, but there are limits beyond
> which cash just won't get you.)
Opteron and Sparc IV (IIRC) both have 3 bi-directional high-speed
(=core speed) interconnects, so these could `easily' be arranged into
NUMA configurations with SMP groups. Also, some high-end processors
are going multicore, which presumably has different characteristics
again (especially if the two cores on the die share a cache!).
Then of course there's single-processor multi-threading (e.g. Intel
HyperThreading). These systems have twice the registers internally
and interleave instructions from each `thread' as the processor can
deal with them; using separate registers for them all helps keep the
execution units busy (kind of like what GCC does with unrolled loops
on a RISC system with more registers in the first place). These
perform like little NUMAs, because the cache is `hotter' (i.e. locks
on those memory pages are held) on the other virtual processor than on
other CPUs on the motherboard. If my understanding is correct, the
Intel implementation is not truly SMP, as the other virtual processor
must share code segments to run threads. If that is true, doing JIT
(or otherwise changing the executable code, e.g. via dlopen()) in a
thread might break HyperThreading. But then again, it might not.
Maybe someone who gives a flying fork() would like to devise a test to
see if this is the case.
Apparently current dual Opteron systems are also effectively NUMA (as
each chip has its own memory controller), but at the moment, NUMA mode
with Linux is slower than straight SMP mode. Presumably because it's
a bitch to code for ;-)
So these fun systems are here to stay! :)
--
Sam Vilain, s...@vilain.net
All things being equal, a fat person uses more soap than a thin
person.
- anon.