Andrey gleaned this bit of interesting text on the O'Reilly website.
Let's have a look at a snippet, shall we?
Under the "Kernel Space and User Space" heading:
| ... Another example of this happened in the Plan 9 operating system.
| They had this really cool system call to do a better process fork--a
| simple way for a program to split itself into two and continue processing
| along both forks. This new fork, which Plan 9 called R-Fork (and SGI later
| called S-Proc) essentially creates two separate process spaces that share
| an address space. This is helpful for threading especially.
| Linux does this too with its clone system call, but it was implemented
| properly. However, with the SGI and Plan9 routines they decided that
| programs with two branches can share the same address space but use
| separate stacks. Normally when you use the same address in both threads,
| you get the same memory location. But you have a stack segment that is
| specific, so if you use a stack-based memory address you actually get
| two different memory locations that can share a stack pointer without
| overriding the other stack.
| While this is a clever feat, the downside is that the overhead in maintaining
| the stacks makes this in practice really stupid to do. They found out too
| late that the performance went to hell. Since they had programs which used
| the interface they could not fix it. Instead they had to introduce an additional
| properly-written interface so that they could do what was wise with the stack
| space. ...
This commentary from Linus Torvalds on rfork proclaims that a single, unified
stack space for multiple threads is not only beneficial but imperative for proper
thread propagation. What we are wondering is ... why?
What scenario would make a unified stack segment beneficial in a multi-threaded
execution environment? Is this something that should be an optional parameter
on-the-fly in a call to rfork()? Does clone() in Linux allow you to decide, or does
it enforce the unified stack segment? Doesn't unification create collisions within
scopes of process namespaces through recursion and non-leaf code vectors?
Since we couldn't figure out the logic behind this Lunix tantra, we thought
trolling the public would be a slick idea. Who has an opinion on (or a defense
for ;-)) this statement?
Also, who is the "They" that "found out too late that the performance went to
hell"? Curious minds are interested.
also, you omitted the conclusion from the chapter, which is (imho) the best:
| In fifteen years, I expect somebody else to come along and say, hey, I can
| do everything that Linux can do but I can be lean and mean about it because
| my system won't have twenty years of baggage holding it back. They'll say
| Linux was designed for the 386 and the new CPUs are doing the really
| interesting things differently. Let's drop this old Linux stuff. This is
| essentially what I did when creating Linux. And in the future, they'll be
| able to look at our code, and use our interfaces, and provide binary
| compatibility, and if all that happens I'll be happy.
andrey "just fanning the flames" mirtchovski
I didn't want to paste *too* much into the email ;-)
Ah, yep. Can't paste over VNC. Thanks, Andrey
on the other hand, having a separate piece of address space with
different contents is a powerful idea. it's called something like
per-thread space, and it's really convenient for many implementation
details in threading. but the feature is (or was; i really don't know
the situation any more) hard to provide on linux and in fact the clone
interface used to be really nasty because the two processes returned
from the syscall sharing the same stack. i believe
something has been done to make the code after a clone on linux
easier to write, but when linus was dyspeptic about rfork, the linux
dance was really tricky.
so it's just a detail, a design choice with advantages and disadvantages
either way. win some, lose some. yet somehow our code
is 'stupid' and linux is 'proper', so i guess the snarkiness endures.
What I am mainly confused about, is... We're talking about user
context threading. In user context threading we've got the abstractions
of a BSS, Data, Text, Stack, (etc in some cases) to maintain a sense
of program-persistent-object-classification. As a coder, we know
and can specify (for the most part) where these "objects" will end up.
So the question to me is, why would you *want* a unified stack?
The only state in which this might be relevant is threads maintained
in ASM where data is passed back and forth via Stack and manipulated
via Registers. Otherwise, things just don't make much sense. In a
land of higher level languages, this semantic seems irrational.
This is mainly what I love about Plan 9's threading implementation, that
it unifies memory in the heap, and not at the stack. I agree that this
creates a powerful tool not only in efficiency, but also in reliability.
I see Linus' logic as more of a negation of both efficiency and reliability.
I'm not interested in starting a flame war, but I *am* interested in
understanding the logic, because I just can't see it clearly from my
point of view.
So I like plan 9 rfork quite a bit. I think Linus/Linux are wrong.
> So the question to me is, why would you *want* a unified stack?
well, you don't. It makes programming a nightmare and you need assembly
goo to separate the stacks after a fork, else you will be stepping on
each others' stack. As the freebsd users of their official version of
rfork() learned, that meant you hit segv walls very quickly. Ouch.
> I see Linus' logic as more of a negation of both efficiency and reliability.
I don't even understand it.
> I see Linus' logic as more of a negation of both efficiency and
i don't understand his logic.
> what happens on MP systems?
not sure what you're driving at.
on an MP system, the forked process/thread has chances to be run on a distinct
CPU underneath. I would have thought the complexity of managing a shared stack
was magnified in that circumstance.
process migration -wise?
> i don't understand his logic.
Well, I interpreted his logic that each process, itself, should be the
manager of how data is abstracted into each thread. Further, that
the kernel should not impose an interface on such userspace
processes. When I think in terms of future scalability and dynamic
code design, this *seems* to be a good thing.
However, over time, we've seen that the lack of a conformed
interface has prompted the design of such an interface: pthreads,
etc. So, in the end, the necessity for POSIX compliance (or whatever
compliance you prefer) is perceived as paramount.
Though, this could have been avoided and deemed unnecessary if
these kernels had only implemented an interface in the first place.
I believe the shortsightedness of not imposing a strict type created
too many possibilities. This creates a bigger problem when dealing
with libraries imported into a system with no defined interface, and
you end up with the desire for one.
So, in the end, Plan 9 looks like the better solution. The foresight was
there to see a necessity for conformity where all possible permutations
of a given set tended to yield outcomes meeting at the same axis:
thus an interface.
I dunno, that's my two cents, anyway. But, that's dependent on me
getting Linus' logic correct, which is why he was CC'd.
On Thu, 26 Feb 2004 dbai...@ameritech.net wrote:
> This commentary from Linus Torvalds on rfork proclaims that a single, unified
> stack space for multiple threads is not only beneficial but imperative for proper
> thread propagation. What we are wondering is ... why?
(1) There are valid reasons to pass pointers between threads, and yes,
they can be pointers to thread stack areas.
(2) No existing common hardware can do "partial TLB flushes" efficiently.
The main performance advantage of threads is that you don't need to flush
the VM state on thread switches. If you have partial VM area sharing, you
lose that whole point.
(3) Implementation sucks. Irix and Plan-9 both get it wrong, and they
_pay_ for it. Heavily. The Linux code is just better.
Do you have code from existing Linux implementation that exemplifies
this scenario? What are your opinions on the benefits? If there is existing
code that exemplifies this scenario, how often does it propagate? Is it
often enough to denounce a standard interface in the kernel?
Also, what are your comments, Linus, on shared library execution altering
the scope of an interface from kernel context to user context? Even if there
isn't a unified stack, there is still an interface defined by such thread projects
as POSIX threads, etc. Doesn't that just create a standard interface in user-
land that could be sped up, if moved to kernel context?
> Do you have code from existing Linux implementation that exemplifies
> this scenario?
Anything that accesses any process arguments would access the stack of the
original thread. That's the most obvious one. But synchronization with
completion events on the stack is also perfectly feasible (the kernel does
exactly that internally, for example - and user mode could do the same).
I'm not sure I can agree here. Arguments are not meant for the scope of an
entire process, but for the initialization of a program's state. In a program that
parses a fully qualified file name path derived from command line arguments,
the initial state should only open a file descriptor and pass that descriptor to
a child thread while performing other duties. This simple example depicts
how a process' arguments should be restricted to initialization. After all, if
the process can't open the file, there is no need to thread.
It seems that allowing such a misuse of resource would be more of a
rationalization for allowing someone's bad programming, rather than a
strong argument against defined thread interface.
Can you elaborate on your synchronization example, please?
Linus Torvalds wrote:
> (1) There are valid reasons to pass pointers between threads, and yes,
> they can be pointers to thread stack areas.
OK. If one of the threads is a "debugger" thread and another
is a "thread to be debugged," then it would be helpful if
the "debugger" thread can access the stack of the "thread to be debugged."
Why? So it can muck with the stack, say, to alter a return address.
But to inspect or modify the stack of another thread requires
some knowledge of that other thread's state. Like, is it
> I'm not sure I can agree here. Arguments are not meant for the scope of an
> entire process, but for the initialization of a program's state.
Wrong. You have zero basis for your assertion, and it's simply bogus.
Would you say that the process environment array is also only applicable
to the first thread?
But it doesn't matter. Regardless, threads should see each others stacks.
> Can you elaborate on your synchronization example, please?
There are tons of examples in the linux kernel. Search for any use of
"struct completion" - most of them are allocated on a thread's stack
(which is obviously distinct between different threads), and other threads
and interrupts will use that data structure to wake up the thread.
But hey, I'm not interested in trying to convince you. Feel free to think
I'm wrong.
Not only. Presumably, the stack is duplicated on rfork(2) which means
that reading the arguments is OK, writing them would be a problem.
But just because the arguments can be changed does not mean that it
can't be deprecated in the case of threads.
PS: I do think this particular one is a quibble, though. My choice
would be along the lines of: "having made choice "A", is there a
mechanism to implement choice "B" from within it?" If there is,
then only efficiency criteria can be used to select "A" over "B".
If there isn't, then either the reverse applies in which case "B"
is the obvious contender, or the two are distinct solutions, each
with its own merits.
> But it doesn't matter. Regardless, threads should see each others
and on plan 9, they do. they just can't see each other's *stack segments*
within their own address space.
there seems to be confusion on this point. plan 9's kernel splits the
stack segment after a fork, but in the normal state, the processes run
with the sp in shared memory. the marvelous properties of the
same-address-different-contents split stack segment is used only
during the fiddly bits of process manipulation and to store per-process
data.
how does linux store per-process data in user space?
Clean, logical programming isn't a sound basis? Or is it only not a sound
basis because you've decided so? Saying that I have zero basis for my
assertion is a bit bogus, in itself, don't you think? I CC'd you my
original email to 9fans because I was hoping to understand your logic
behind the rfork() comments, not to start a flame war.
> Would you say that the process environment array is also only applicable
> to the first thread?
Yes. That's why we have env(3) in Plan 9. So each thread can alter and
read the environment, globally, in a process.
> There are tons of examples in the linux kernel. Search for any use of
> "struct completion" - most of them are allocated on a thread's stack
> (which is obviously distinct between different threads), and other threads
> and interrupts will use that data structure to wake up the thread.
Okay, but is there a reason why it *needs* to be on the stack? What's the
rationale behind this usage.
> But hey, I'm not interested in trying to convince you. Feel free to think
> I'm wrong.
Well, I'm hoping to determine what really is the best thing for OS design,
as a whole. It isn't in my interest to determine who is right or wrong, only
what makes sense universally. If we are really all here to try and design
code that works the best it can, that's all that matters. Ego is irrelevant and
only gets in the way of those discoveries that can really help us materialize
the better code.
Even if there are no more comments on this thread, I still learned a lot
by asking the questions. If that stirs up some controversy, so be it. I'd
rather learn and ask questions than stay silent and experience nothing.
it might have been if Linux hadn't gone through an O(n) scheduler before it did that context switch.
let me help, then: nothing. nothing makes sense "universally".
Linux is Linux. Plan 9 is Plan 9. the two systems have very different
design goals. that's not to defend linux's fork design or any other
particular decision, but asking (by implication) why linux doesn't
just do something like env(3) misses those goals.
personally, i'm less interested in arguing point 1 from linus's post
(there might be valid reasons to do this sometimes), which seems at
least plausible, and much more interested in the other two. on the
second, i think clarification is in order: that's the main advantage
of threads over what, processes? and on that last one (the bit on
performance and code quality) i'm *really* curious what he's talking
about. do we have a particularly heavy fork? i'm confused. saying
"The Linux code is just better." just *begs* for support.
oh, and it's "Plan 9", or sometimes "plan9", but never "Plan-9".
> Okay, but is there a reason why it *needs* to be on the stack? What's the
> rationale behind this usage.
The rationale is that it's incredibly more sane, and it's the logical
place to put something that (a) needs to be allocated thread-specific and
(b) doesn't need any special allocator.
In short, it's an automatic variable.
You are arguing against automatic variables in C. You apparently do so for
some incredibly broken religious reason, and hey, I simply don't care.
> Well, I'm hoping to determine what really is the best thing for OS design,
> as a whole.
I can guarantee you that the broken behaviour of SGI sproc and plan-9
rfork is a major pain in the ass for VM management.
I'm obviously very biased, but I claim that the Linux VM is the best VM
out there. Bar none. It's flexible, it's clearly very portable to every
single relevant hardware base out there (and quite a few that aren't
relevant), and it's efficient.
And at the same time, the Linux VM (from an architecture standpoint) is
simple. There are lots of fundamentally hard problems in the VM stuff, but
they tend to be things that _are_ fundamentally hard for other reasons (ie
page replacement algorithms in the presence of many different caches that
all fight for memory). But the actual virtual mapping side is very simple.
And the plan-9/irix thing isn't. It's an abomination.
And there are real _technical_ reasons why it's an abomination:
- it means that you cannot share hardware page tables between threads
unless you have special hardware (ie it is either fundamentally
unportable, or it is fundamentally inefficient)
- it means that you have to give different TLB contexts to threads,
causing inefficient TLB usage. See above.
- it means that you need to keep track of separate lists of "shared" and
"private" VM mapping lists, and locking of your VM data structures is a
nightmare.
- it almost certainly means a lot of special cases, since on the magic
hardware that does have segmented page tables and where you want to
share the right segments, you now have magic hardware-dependent limits
for which areas can be shared and which can be private.
But yes, I'm biased.
Do you notice that you had to say 'from an architecture standpoint' ?
I don't think it's simple at all. I agree it seems to be more efficient
(although I don't have measures to support this), but it's
contorted code, at best.
Regarding what's broken or not, it mostly depends on bugs, and bugs
depend mostly on complexity.
You've just proven my point. Thread specific. Being Thread specific, it
is data that is reserved to the scope of a single thread. Nothing more.
If you want more scope there are many more usages of memory that
are better utilized.
I'm not arguing against automatic variables. I'm arguing against using
automatic variables anywhere but the scope of the stack in which they are
declared.
I don't know where you're getting religiousness, but, there's nothing
wrong with trying to do the best for all of your users. That's why
companies research their target audience, so they can improve a
product according to the most global needs of a given group.
On Fri, 27 Feb 2004, Lucio De Re wrote:
> Is Torvalds really saying that environment is held _in_ the stack?!
> No wonder he was reluctant to copy it! Specially when using "bash".
> But I must be mistaken.
Take a look if you don't believe me.
There is a _lot_ of state on the stack. That's how C works.
It's perfectly valid behaviour to do something like this:
where you give the thread you created its initial state on your own stack.
You obviously must keep it there until the thread is done with it (either
by waiting for the whole thread as in the above example, or by using some
other synchronization mechanism), but the point is that C programmers are
used to a fundamentally "flat" environment without segments etc.
And what a private stack is, is _nothing_ more than a segment.
And I have not _ever_ met a good C programmer that liked segments.
So in a C/UNIX-like environment, private stacks are wrong. You could
imagine _other_ environments where they might be valid, but even those
other environments would not invalidate my points about efficiency and
simplicity.
I wasn't interested in #2 because #2 didn't make much sense to me.
I wasn't interested in #3 because I had no desire to discuss things
resembling personal attacks.
as i said before, the stacks are not private. you're right, that's a
bad thing.
that's why they're not private.
the segment called 'stack' is private, but that's a different thing.
i stress: stack != stack segment. stack is where your sp is; stack
segment is a figment of the VM system.
i ask again: how does linux create per-thread storage?
the way the plan 9 thread library works is so different from linux's
that they're hard to compare. program design in the two worlds is
radically different. so your claim of 'better' is curious to me. by
'better' you seem to mean 'faster' and 'cleaner'. faster at least can be
measured.
you speak with certainty. have you seen performance comparisons?
i haven't, although it wouldn't surprise me to learn that there are
programs for which linux outperforms plan 9, and vice versa of course.
> > The rationale is that it's incredibly more sane, and it's the logical
> > place to put something that (a) needs to be allocated thread-specific and
> > (b) doesn't need any special allocator.
> You've just proven my point. Thread specific. Being Thread specific, it
> is data that is reserved to the scope of a single thread. Nothing more.
> If you want more scope there are many more usages of memory that
> are better utilized.
A "per-thread allocation" does NOT MEAN that other threads should not
access it. It means that the ALLOCATION is thread-private, not that the
USE is thread-private.
per-thread allocations are quite common, and critical. If you have global
state, you need to protect them with locks, and you need to have nasty
One common per-thread allocation is the "I want to wait for an event". The
data is clearly for that one thread, and using a global allocator would be
WRONG. Not to mention inefficient.
But once the data has been allocated, other threads are what will actually
use the data to wake the original thread up. So while it wants a
per-thread allocator, it simply wouldn't _work_ if other threads couldn't
access the data.
That's what a "completion structure" is in the kernel. It's all the data
necessary to let a thread wait for something to complete. Another thread
will do "complete(xxx)", where "xxx" is that per-thread data.
You don't like it. Fine. I don't care. You're myopic, and have an agenda
to push, so you want to tell others that "you can't do that, it's against
the rules".
While I'm telling you that people _do_ do that, and that it makes sense,
and if you didn't have blinders on, you'd see that.
After all, there is nothing to prevent a coder from altering the
return address (traditionally a stack entry) in one thread and
totally wrecking the behaviour of another on return, right?
I still think the acid test lies with whether one version can model
the other.
PS: and I really hope that the arguments and environment variables
are stored elsewhere and only the pointers appear on the stack :-)
I mean, even MS-DOS got that bit "sane", to use a very loaded word
previously used by Linus in this discussion.
That sharing is achieved by the thread library, not by the rfork
system call, though, right?
I'm not saying you can't, I'm just asking why, and you're not giving
me any sensible reason for it besides "because people do". That isn't
logical to me, and that's got nothing to do with any mysterious
agenda. I'm just interested in the theory.
> While I'm telling you that people _do_ do that, and that it makes sense,
> and if you didn't have blinders on, you'd see that.
Ok, so people do *do* that. That's fine, but that doesn't make it
correct, or sensible. I just wanted some semblance of logic, not
flames that equate to "I do it because I can". I can hop on one leg,
too, but that doesn't make hopping on one leg efficient.
But maybe they should also be shared, to be totally consistent.
Of course, I may be talking out of turn, but I really don't see
how threads can have private space if the stack isn't private.
> I suspect Linus isn't subscribed to 9fans, so we'll have to cc him.
sorry about that.
> Rob writes:
> | the argument about TLB flush times is interesting.
> | > But it doesn't matter. Regardless, threads should see each others
> | > stacks.
> | and on plan 9, they do. they just can't see each other's *stack
> | segments*
> | within their own address space.
> | there seems to be confusion on this point. plan 9's kernel splits the
> | stack segment after a fork, but in the normal state, the processes run
> | with the sp in shared memory. the marvelous properties of the
> | same-address-different-contents split stack segment is used only
> | during the fiddly bits of process manipulation and to store per-process
> | data.
> That sharing is achieved by the thread library, not by the rfork
> system call, though, right?
right. but where it's implemented is in some sense just a detail, as my
message, included below, implies.
another way of looking at it is that the kernel provides some nice
primitives upon which to build a thread library. the plan 9 kernel
leaves out a lot of stuff that you're supposed to do in the kernel
for threads, but we were exploring options.
> | how does linux store per-process data in user space?
> | -rob
here's my other message, with the cc: this time:
> Of course, I may be talking out of turn, but I really don't see
> how threads can have private space if the stack isn't private.
well, perhaps the stack isn't the only place to do it, but it's
certainly an easy one, and one that makes the syscall interface
to fork easy to implement in a threaded environment: longjmp
to the private stack, fork, adjust, longjmp back.
i do. they have forgotten their lister and the old (now quick):
s = splfoo();
/* critical region */
splx(s);
usually simplifies things, unlike the lunacy you have to code
on linux >= 2.2 -- i've coded it; it made me sick.
On Fri, 27 Feb 2004, Rob Pike wrote:
> > So in a C/UNIX-like environment, private stacks are wrong. You could
> > imagine _other_ environments where they might be valid, but even those
> > other environments would not invalidate my points about efficiency and
> > simplicity.
> as i said before, the stacks are not private. you're right, that's a
> bad thing.
> that's why they're not private.
> the segment called 'stack' is private, but that's a different thing.
> i stress: stack != stack segment. stack is where your sp is; stack
> segment is a figment of the VM system.
Well, in another email I already said that "private stack" and "segments"
are really exactly the same thing - some people think of segments as
paging things, others think of them in the x86 sense, but in the end it
all comes down to the fact that a "stack address" ends up making sense
only within a specific context (and that context can sometimes be
partially visible to other threads by using explicit segment registers or
other magic, like special instructions that can take another address
space).
And private/segmented stacks are bad.
They are bad exactly because they magically make automatic variables
fundamentally different from other variables. And they really have no
reason they should be different.
There is absolutely nothing wrong with having a thread take the address of
some automatic variable, and then just pass that address off to another
routine. And if that other routine decides that it is going to create a
hundred threads to solve the problem that the variable described in
parallel, then that should JUST WORK. Anything else would be EVIL.
Having a pointer that sometimes works, and sometimes doesn't, based on who
uses it - that's just crazy talk.
> i ask again: how does linux create per-thread storage?
The same way it creates any other storage: with mmap() and brk(). You just
malloc the thing, and you pass in the new stack as an argument to the
thread creation mechanism (which linux calls "clone()", just to be
different).
And because that storage is just storage, things like the one I described
above "just work". If you pass another thread a pointer to your stack, the
other thread can happily manipulate it, and never even needs to know that
it's an automatic variable somewhere else.
And sure, you can shoot yourself in the foot that way. You can pass off a
pointer to another thread, and then return from the function without
synchronizing with that other thread properly, and now the other thread
will scribble all over your stack. But that's really nothing different
than using "alloca()" and passing off that to something that remembers the
pointer.
> the way the plan 9 thread library works is so different from linux's
> that they're hard to compare. program design in the two worlds is
> radically different. so your claim of 'better' is curious to me. by
> 'better' you seem to mean 'faster' and 'cleaner'. faster at least can
> be measured.
To me, the final decision on "better" tends to be a fairly wide issue.
Performance is part of it - especially infrastructure that everybody
depends on should always strive to at least _allow_ good performance, even
if not everybody ends up caring.
But the concept, to me, is more important. Basically, I do not see how you
can really have a portable and even _remotely_ efficient partial VM
sharing. And if you can't have it, then you shouldn't design the
interfaces around it.
> you speak with certainty. have you seen performance comparisons? i
> haven't, although it wouldn't surprise me to learn that there are useful
> programs for which linux outperforms plan 9, and vice versa of course.
When it comes to threads, I only see three interesting performance
metrics: how fast can you create them, how fast can you synchronize them
(both "join" and locking), and how well do you switch between them.
The locking is pretty much OS-independent, since fast locking has to be
done in user space anyway (with just the contention case falling back to
the OS, and if your app cares about performance it hopefully won't have
much contention).
So we're left with create, tear-down and switch. All of which are
_fundamentally_ faster if you just have a "share everything" model.
Create and tear-down are just increment/decrement a reference counter
(there's a spinlock involved too). Task switch is a no-op from a VM
standpoint (except we have a per-thread lazy TLB invalidate that will
In contrast, partial sharing is a major pain. You definitely don't just do
a reference count increment for your VM.
5M SLOC for the kernel? look at list.h
better? mon cul.
nice one, don.
no, you're wrong.