[9fans] Threads: Sewing badges of honor onto a Kernel

dbai...@ameritech.net

Feb 26, 2004, 11:55:39 PM

A few of us had a discussion this evening on the state of Linux and
bloat in the Operating System development community. This isn't an
event isolated to this evening; tonight, however, had a bit of spice.

Andrey gleaned this bit of interesting text from the O'Reilly[1] website.
Let's have a look at a snippet, shall we?

Under the "Kernel Space and User Space" heading:

| ... Another example of this happened in the Plan 9 operating system.
| They had this really cool system call to do a better process fork--a
| simple way for a program to split itself into two and continue processing
| along both forks. This new fork, which Plan 9 called R-Fork (and SGI later
| called S-Proc) essentially creates two separate process spaces that share
| an address space. This is helpful for threading especially.

| Linux does this too with its clone system call, but it was implemented
| properly. However, with the SGI and Plan9 routines they decided that
| programs with two branches can share the same address space but use
| separate stacks. Normally when you use the same address in both threads,
| you get the same memory location. But you have a stack segment that is
| specific, so if you use a stack-based memory address you actually get
| two different memory locations that can share a stack pointer without
| overriding the other stack.

| While this is a clever feat, the downside is that the overhead in maintaining
| the stacks makes this in practice really stupid to do. They found out too
| late that the performance went to hell. Since they had programs which used
| the interface they could not fix it. Instead they had to introduce an additional
| properly-written interface so that they could do what was wise with the stack
| space. ...

This commentary from Linus Torvalds on rfork[2] proclaims that a single, unified
stack space for multiple threads is not only beneficial but imperative for proper
thread propagation. What we are wondering is ... why?

What scenario would make a unified stack segment beneficial in a multi-threaded
execution environment? Is this something that should be an optional parameter
on-the-fly in a call to rfork() ? Does clone() in Linux allow you to decide, or, does
it enforce the unified stack segment? Doesn't unification create collisions within
scopes of process namespaces through recursion and non-leaf code vectors?

Since we couldn't figure out the logic behind this Lunix tantra, we thought
trolling the public would be a slick idea. Who has an opinion on (or a defense
for ;-)) this statement?

Also, who is the "They" that "found out too late that the performance went to
hell"?

Curious minds are interested.

Don (north_)

References:

[1] http://www.oreilly.com/catalog/opensources/books/linus.html
[2] fork(2)

andrey mirtchovski

Feb 27, 2004, 12:07:41 AM

the url is wrong -- s/books/book/

also, you omitted the conclusion from the chapter, which is (imho) the best:

| In fifteen years, I expect somebody else to come along and say, hey, I can
| do everything that Linux can do but I can be lean and mean about it because
| my system won't have twenty years of baggage holding it back. They'll say
| Linux was designed for the 386 and the new CPUs are doing the really
| interesting things differently. Let's drop this old Linux stuff. This is
| essentially what I did when creating Linux. And in the future, they'll be
| able to look at our code, and use our interfaces, and provide binary
| compatibility, and if all that happens I'll be happy.

andrey "just fanning the flames" mirtchovski

Scott Schwartz

Feb 27, 2004, 12:06:50 AM

Beware that this mailing list doesn't permit postings
with more than a few cc-s.

To: 9f...@cse.psu.edu
cc: torv...@osdl.org, sio...@beefed.org, fa...@beefed.org

dbai...@ameritech.net

Feb 27, 2004, 12:13:51 AM

> also, you omitted the conclusion from the chapter, which is (imho) the best:

I didn't want to paste *too* much into the email ;-)

dbai...@ameritech.net

Feb 27, 2004, 12:14:43 AM

> the url is wrong -- s/books/book/

Ah, yep. Can't paste over VNC. Thanks, Andrey

full URL is:
http://www.oreilly.com/catalog/opensources/book/linus.html

Don (north_)

Rob Pike

Feb 27, 2004, 12:27:48 AM

this is a peculiar distortion of a peculiar episode. the original issue
linus objected to, and rather rudely if i remember right, is that if the
stacks are distinct, there can be confusing bugs if you pass pointers
around that accidentally happen to refer to stack addresses. in practice,
we never do this because the thread library makes it all but impossible.
however, this issue really rankled linus.

on the other hand, having a separate piece of address space with
different contents is a powerful idea. it's called something like
per-thread space, and it's really convenient for many implementation
details in threading. but the feature is (or was; i really don't know
the situation any more) hard to provide on linux, and in fact the clone
interface used to be really nasty because the two processes returned
from the syscall sharing the same stack. i understand something has
been done to make the code after a clone on linux easier to write, but
when linus was dyspeptic about rfork, the linux dance was really tricky.

so it's just a detail, a design choice with advantages and disadvantages
either way. win some, lose some. yet somehow our code
is 'stupid' and linux is 'proper', so i guess the snarkiness endures.

-rob

Rob Pike

Feb 27, 2004, 12:29:31 AM

in case that wasn't clear enough, i have no idea what linus is talking
about when he says we had 'overhead' that made it 'really stupid'
'in practice'.

-rob

dbai...@ameritech.net

Feb 27, 2004, 12:43:51 AM

> so it's just a detail, a design choice with advantages and disadvantages
> either way. win some, lose some. yet somehow our code
> is 'stupid' and linux is 'proper', so i guess the snarkiness endures.

What I am mainly confused about, is... We're talking about user
context threading. In user context threading we've got the abstractions
of a BSS, Data, Text, Stack, (etc in some cases) to maintain a sense
of program-persistent-object-classification. As coders, we know
and can specify (for the most part) where these "objects" will end
up.

So the question to me is, why would you *want* a unified stack?
The only state in which this might be relevant is threads maintained
in ASM where data is passed back and forth via Stack and manipulated
via Registers. Otherwise, things just don't make much sense. In a
land of higher level languages, this semantic seems irrational.

This is mainly what I love about Plan 9's threading implementation, that
it unifies memory in the heap, and not at the stack. I agree that this
creates a powerful tool not only in efficiency, but also in reliability.

I see Linus' logic as more of a negation of both efficiency and reliability.

I'm not interested in starting a flame war, but I *am* interested in
understanding the logic, because I just can't see it clearly from my
point of view.

Don (north_)

ron minnich

Feb 27, 2004, 12:45:35 AM

I did the original rfork for freebsd, and I wrote it so I split the stacks
too. It was lovely, since you just rforked and all was shared save the
stack -- no assembly required. When they did the committed version, after
fork, all was shared. This sucked. It was quite a mess to fix up your
stack so you weren't walking on each other's stack -- required assembly
goo to make it go. I hated it and stopped using it.

So I like plan 9 rfork quite a bit. I think Linus/Linux are wrong.

ron

ron minnich

Feb 27, 2004, 12:48:58 AM

On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:

> So the question to me is, why would you *want* a unified stack?

well, you don't. It makes programming a nightmare and you need assembly
goo to separate the stacks after a fork, else you will be stepping on
each other's stack. As the freebsd users of their official version of
rfork() learned, that meant you hit segv walls very quickly. Ouch.

> I see Linus' logic as more of a negation of both efficiency and reliability.

I don't even understand it.

ron
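
On the question raised earlier in the thread of whether clone() lets the caller decide about the stack: the glibc clone() wrapper takes the child's stack pointer as an explicit argument, so the caller supplies a separate stack region inside the shared address space. A minimal sketch, assuming Linux with glibc; `run_shared_child`, `child_fn` and the 256 KiB stack size are illustrative choices, not anything taken from the messages above:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (256 * 1024)   /* arbitrary size for the sketch */

/* Runs in the child, on the stack region the parent passed to clone(). */
static int child_fn(void *arg)
{
    int *shared = arg;   /* points at an automatic variable of the parent */
    *shared = 42;        /* visible to the parent because of CLONE_VM */
    return 0;            /* becomes the child's exit status */
}

/*
 * Hypothetical helper: start a child that shares our address space,
 * wait for it, and return the value it stored (or -1 on failure).
 */
int run_shared_child(void)
{
    int result = 0;                      /* lives on the parent's stack */
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Stacks grow downward on common hardware: pass the top of the block. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &result);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return result;
}
```

Here the child's store lands in the parent's automatic variable because both run in one address space, while each runs on its own stack region. No assembly is needed in the caller, but choosing and freeing the child's stack is the caller's job.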

George Michaelson

Feb 27, 2004, 12:51:40 AM

what happens on MP systems?

-George

Rob Pike

Feb 27, 2004, 12:55:49 AM

i agree with you. i think the model, although done for reasons of
expediency, has some advantages. the split stack provides a
unique and powerful memory abstraction: same address, different
value. (cf. the up and mp registers in the kernel.)

> I see Linus' logic as more of a negation of both efficiency and
> reliability.

i don't understand his logic.

-rob

ron minnich

Feb 27, 2004, 1:00:50 AM

On Fri, 27 Feb 2004, George Michaelson wrote:

> what happens on MP systems?
>

not sure what you're driving at.

ron

George Michaelson

Feb 27, 2004, 1:08:50 AM

on an MP system, the forked process/thread has chances to be run on a distinct
CPU underneath. I would have thought the complexity of managing a shared stack
was magnified in that circumstance.

process migration-wise?

-George

dbai...@ameritech.net

Feb 27, 2004, 1:09:02 AM

> > I see Linus' logic as more of a negation of both efficiency and
> > reliability.

> i don't understand his logic.

Well, I interpreted his logic that each process, itself, should be the
manager of how data is abstracted into each thread. Further, that
the kernel should not impose an interface on such userspace
processes. When I think in terms of future scalability and dynamic
code design, this *seems* to be a good thing.

However, over time, we've seen that the lack of a conformed
interface has prompted the design of such an interface: pthreads,
etc. So, in the end, the necessity for POSIX compliance (or whatever
compliance you prefer) is perceived as paramount.

Though, this could have been avoided and deemed unnecessary if
these kernels had only implemented an interface in the first place.

I believe the shortsightedness of not imposing a strict type created
too many possibilities. This creates a bigger problem when dealing
with libraries imported into a system with no defined interface, and
you end up with the desire for one.

So, in the end, Plan 9 looks like the better solution. The foresight was
there to see a necessity for conformity where all possible permutations
of a given set tended to yield outcomes meeting at the same axis:
thus an interface.

I dunno, that's my two cents, anyway. But, that's dependent on me
getting Linus' logic correct, which is why he was CC'd.

Don (north_)

Linus Torvalds

Feb 27, 2004, 1:15:50 AM

On Thu, 26 Feb 2004 dbai...@ameritech.net wrote:
>
> This commentary from Linus Torvalds on rfork[2] proclaims that a single, unified
> stack space for multiple threads is not only beneficial but imperative for proper
> thread propagation. What we are wondering is ... why?

(1) There are valid reasons to pass pointers between threads, and yes,
they can be pointers to thread stack areas.

(2) No existing common hardware can do "partial TLB flushes" efficiently.
The main performance advantage of threads is that you don't need to flush
the VM state on thread switches. If you have partial VM area sharing, you
lose that whole point.

(3) Implementation sucks. Irix and Plan-9 both get it wrong, and they
_pay_ for it. Heavily. The Linux code is just better.

Linus

Scott Schwartz

Feb 27, 2004, 1:20:45 AM

Al Viro once told me that if Plan 9 had shared libraries, rfork wouldn't
work as well. Apparently they're concerned about page tables when there
are lots of different combinations of sharing going on.

dbai...@ameritech.net

Feb 27, 2004, 1:38:59 AM

> (1) There are valid reasons to pass pointers between threads, and yes,
> they can be pointers to thread stack areas.

Do you have code from existing Linux implementation that exemplifies
this scenario? What are your opinions on the benefits? If there is existing
code that exemplifies this scenario, how often does it propagate? Is it
often enough to denounce a standard interface in the kernel?

Also, what are your comments, Linus, on shared library execution altering
the scope of an interface from kernel context to user context? Even if there
isn't a unified stack, there is still an interface defined by such thread projects
as POSIX threads, etc. Doesn't that just create a standard interface in user-
land that could be sped up, if moved to kernel context?

Don (north_)

Linus Torvalds

Feb 27, 2004, 1:45:49 AM

On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
>

> Do you have code from existing Linux implementation that exemplifies
> this scenario?

Anything that accesses any process arguments would access the stack of the
original thread. That's the most obvious one. But synchronization with
completion events on the stack is also perfectly feasible (the kernel does
exactly that internally, for example - and user mode could do the same).

Linus

dbai...@ameritech.net

Feb 27, 2004, 1:54:38 AM

> Anything that accesses any process arguments would access the stack of the
> original thread.

I'm not sure I can agree here. Arguments are not meant for the scope of an
entire process, but for the initialization of a program's state. In a program that
parses a fully qualified file name path derived from command line arguments,
the initial state should only open a file descriptor and pass that descriptor to
a child thread while performing other duties. This simple example depicts
how a process' arguments should be restricted to initialization. After all, if
the process can't open the file, there is no need to thread.

It seems that allowing such a misuse of resources would be more of a
rationalization for allowing someone's bad programming, rather than a
strong argument against a defined thread interface.

Can you elaborate on your synchronization example, please?

Don (north_)

Donald Brownlee

Feb 27, 2004, 1:57:39 AM

Linus Torvalds wrote:

>
> (1) There are valid reasons to pass pointers between threads, and yes,
> they can be pointers to thread stack areas.
>

OK. If one of the threads is a "debugger" thread and another
is a "thread to be debugged," then it would be helpful if
the "debugger" thread can access the stack of the "thread to be debugged."

Why? So it can muck with the stack, say, to alter a return
address. Whatever!

But to inspect or modify the stack of another thread requires
some knowledge of that other thread's state. Like, is it
stopped?

D.

Linus Torvalds

Feb 27, 2004, 2:00:56 AM

On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
>

> I'm not sure I can agree here. Arguments are not meant for the scope of an
> entire process, but for the initialization of a program's state.

Wrong. You have zero basis for your assertion, and it's simply bogus.

Would you say that the process environment[] array is also only applicable
to the first thread?

But it doesn't matter. Regardless, threads should see each other's stacks.

> Can you elaborate on your synchronization example, please?

There are tons of examples in the linux kernel. Search for any use of
"struct completion" - most of them are allocated on a thread's stack
(which is obviously distinct between different threads), and other threads
and interrupts will use that data structure to wake up the thread.

But hey, I'm not interested in trying to convince you. Feel free to think
I'm wrong.

Linus

Lucio De Re

Feb 27, 2004, 2:08:46 AM

On Fri, Feb 27, 2004 at 01:48:09AM -0500, dbai...@ameritech.net wrote:
>
> > Anything that accesses any process arguments would access the stack of the
> > original thread.
>

> I'm not sure I can agree here. Arguments are not meant for the scope of an

> entire process, but for the initialization of a program's state. In a program that
> [ ... ]

Not only. Presumably, the stack is duplicated on rfork(2) which means
that reading the arguments is OK, writing them would be a problem.

But just because the arguments can be changed does not mean that it
can't be deprecated in the case of threads.

++L

PS: I do think this particular one is a quibble, though. My choice
would be along the lines of: "having made choice "A", is there a
mechanism to implement choice "B" from within it?" If there is,
then only efficiency criteria can be used to select "A" over "B".
If there isn't, then either the reverse applies in which case "B"
is the obvious contender, or the two are distinct solutions, each
requiring implementation.

Rob Pike

Feb 27, 2004, 2:13:50 AM

the argument about TLB flush times is interesting.

> But it doesn't matter. Regardless, threads should see each other's
> stacks.

and on plan 9, they do. they just can't see each other's *stack segments*
within their own address space.

there seems to be confusion on this point. plan 9's kernel splits the
stack segment after a fork, but in the normal state, the processes run
with the sp in shared memory. the marvelous properties of the
same-address-different-contents split stack segment are used only
during the fiddly bits of process manipulation and to store per-process
data.

how does linux store per-process data in user space?

-rob
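
The idea of running with the sp in shared memory can be imitated with POSIX threads by handing each thread a stack carved out of the ordinary heap. This is only a sketch under stated assumptions, not a description of the Plan 9 thread library: it uses `pthread_attr_setstack` and C11 `aligned_alloc`, and the helper names, the 4096-byte alignment and the 256 KiB size are made up for the example.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE (256 * 1024)   /* arbitrary size for the sketch */

/* Thread body: report the address of one of its own automatic variables. */
static void *report_local_address(void *arg)
{
    int local = 0;
    *(uintptr_t *)arg = (uintptr_t)&local;
    return NULL;
}

/*
 * Hypothetical helper: run a thread on a heap-allocated stack.
 * Returns 1 if the thread's automatic variable lived inside that heap
 * block, 0 if it did not, and -1 on error.
 */
int thread_stack_is_in_heap_block(void)
{
    void *stack = aligned_alloc(4096, STACK_SIZE);
    if (stack == NULL)
        return -1;

    pthread_attr_t attr;
    pthread_t tid;
    uintptr_t seen = 0;
    int rc = -1;

    if (pthread_attr_init(&attr) != 0) {
        free(stack);
        return -1;
    }
    if (pthread_attr_setstack(&attr, stack, STACK_SIZE) == 0 &&
        pthread_create(&tid, &attr, report_local_address, &seen) == 0) {
        pthread_join(tid, NULL);
        uintptr_t lo = (uintptr_t)stack;
        rc = (seen >= lo && seen < lo + STACK_SIZE);
    }
    pthread_attr_destroy(&attr);
    free(stack);
    return rc;
}
```

Because the stack is plain shared memory, a pointer to any automatic variable on it means the same thing in every thread, which is the property the message above says Plan 9 programs have in their normal state.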

dbai...@ameritech.net

Feb 27, 2004, 2:14:20 AM

> Wrong. You have zero basis for your assertion, and it's simply bogus.

Clean, logical programming isn't a sound basis? Or is it only not a sound
basis because you've decided so? Saying that I have zero basis for my
assertion is a bit bogus, in itself, don't you think? I CC'd you my
original email to 9fans because I was hoping to understand your logic
behind the rfork() comments, not to start a flame war.

> Would you say that the process environment[] array is also only applicable
> to the first thread?

Yes. That's why we have env(3) in Plan 9. So each thread can alter and
read the environment, globally, in a process.

> There are tons of examples in the linux kernel. Search for any use of
> "struct completion" - most of them are allocated on a thread's stack
> (which is obviously distinct between different threads), and other threads
> and interrupts will use that data structure to wake up the thread.

Okay, but is there a reason why it *needs* to be on the stack? What's the
rationale behind this usage?

> But hey, I'm not interested in trying to convince you. Feel free to think
> I'm wrong.

Well, I'm hoping to determine what really is the best thing for OS design,
as a whole. It isn't in my interest to determine who is right or wrong, only
what makes sense universally. If we are really all here to try and design
code that works the best it can, that's all that matters. Ego is irrelevant and
only gets in the way of those discoveries that can really help us materialize
the better code.

Even if there are no more comments on this thread, I still learned a lot
by asking the questions. If that stirs up some controversy, so be it. I'd
rather learn and ask questions than stay silent and experience nothing.

Don (north_)

Charles Forsyth

Feb 27, 2004, 2:16:46 AM

>>the argument about TLB flush times is interesting.

it might have been if Linux hadn't gone through an O(n) scheduler before it did that context switch.

Lucio De Re

Feb 27, 2004, 2:41:41 AM

On Fri, Feb 27, 2004 at 02:06:57AM -0500, dbai...@ameritech.net wrote:
>
> > Would you say that the process environment[] array is also only applicable
> > to the first thread?
>

> Yes. That's why we have env(3) in Plan 9. So each thread can alter and
> read the environment, globally, in a process.
>

Is Torvalds really saying that environment[] is held _in_ the stack?!
No wonder he was reluctant to copy it! Specially when using "bash".
But I must be mistaken.

++L

a...@9srv.net

Feb 27, 2004, 2:42:41 AM

// It isn't in my interest to determine who is right or wrong,
// only what makes sense universally.

let me help, then: nothing. nothing makes sense "universally".

Linux is Linux. Plan 9 is Plan 9. the two systems have very different
design goals. that's not to defend linux's fork design or any other
particular decision, but asking (by implication) why linux doesn't
just do something like env(3) misses those goals.

personally, i'm less interested in arguing point 1 from linus's post
(there might be valid reasons to do this sometimes), which seems at
least plausible, and much more interested in the other two. on the
second, i think clarification is in order: that's the main advantage
of threads over what, processes? and on that last one (the bit on
performance and code quality) i'm *really* curious what he's talking
about. do we have a particularly heavy fork? i'm confused. saying
"The Linux code is just better." just *begs* for support.

oh, and it's "Plan 9", or sometimes "plan9", but never "Plan-9".
thanks.
ア

Linus Torvalds

Feb 27, 2004, 2:42:52 AM

On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
>

> Okay, but is there a reason why it *needs* to be on the stack? What's the
> rationale behind this usage.

The rationale is that it's incredibly more sane, and it's the logical
place to put something that (a) needs to be allocated thread-specific and
(b) doesn't need any special allocator.

In short, it's an automatic variable.

You are arguing against automatic variables in C. You apparently do so for
some incredibly broken religious reason, and hey, I simply don't care.

> Well, I'm hoping to determine what really is the best thing for OS design,
> as a whole.

I can guarantee you that the broken behaviour of SGI sproc and plan-9
rfork is a major pain in the ass for VM management.

I'm obviously very biased, but I claim that the Linux VM is the best VM
out there. Bar none. It's flexible, it's clearly very portable to every
single relevant hardware base out there (and quite a few that aren't
relevant), and it's efficient.

And at the same time, the Linux VM (from an architecture standpoint) is
simple. There are lots of fundamentally hard problems in the VM stuff, but
they tend to be things that _are_ fundamentally hard for other reasons (ie
page replacement algorithms in the presence of many different caches that
all fight for memory). But the actual virtual mapping side is very simple
indeed.

And the plan-9/irix thing isn't. It's an abomination.

And there are real _technical_ reasons why it's an abomination:

- it means that you cannot share hardware page tables between threads
  unless you have special hardware (ie it is either fundamentally
  unportable, or it is fundamentally inefficient)
- it means that you have to give different TLB contexts to threads,
  causing inefficient TLB usage. See above.
- it means that you need to keep track of separate lists of "shared" and
  "private" VM mapping lists, and locking of your VM data structures is a
  complete nightmare.
- it almost certainly means a lot of special cases, since on the magic
  hardware that does have segmented page tables and where you want to
  share the right segments, you now have magic hardware-dependent limits
  for which areas can be shared and which can be private.

But yes, I'm biased.

Linus

Fco.J.Ballesteros

Feb 27, 2004, 2:48:46 AM

> And at the same time, the Linux VM (from an architecture standpoint) is
> simple.

Do you notice that you had to say 'from an architecture standpoint' ?
I don't think it's simple at all. I agree it seems to be more efficient
(although I don't have measures to support this), but it's
contorted code, at best.

Regarding what's broken or not, it mostly depends on bugs, and bugs
depend mostly on complexity.

dbai...@ameritech.net

Feb 27, 2004, 2:52:44 AM

> The rationale is that it's incredibly more sane, and it's the logical
> place to put something that (a) needs to be allocated thread-specific and
> (b) doesn't need any special allocator.

You've just proven my point. Thread specific. Being Thread specific, it
is data that is reserved to the scope of a single thread. Nothing more.
If you want more scope there are many more usages of memory that
are better utilized.

I'm not arguing against automatic variables. I'm arguing against using
automatic variables anywhere but the scope of stack in which they are
defined.

I don't know where you're getting religiousness, but, there's nothing
wrong with trying to do the best for all of your users. That's why
companies research their target audience, so they can improve a
product according to the most global needs of a given group.

Don (north_)

Linus Torvalds

Feb 27, 2004, 2:53:40 AM

On Fri, 27 Feb 2004, Lucio De Re wrote:
>
> Is Torvalds really saying that environment[] is held _in_ the stack?!
> No wonder he was reluctant to copy it! Specially when using "bash".
> But I must be mistaken.

Take a look if you don't believe me.

There is a _lot_ of state on the stack. That's how C works.

It's perfectly valid behaviour to do something like this:

    myfunction()
    {
        mytype myvariable;

        pthread_create(.. &myvariable);
        ...
        pthread_join(..);
    }

where you give the thread you created its initial state on your own stack.
You obviously must keep it there until the thread is done with it (either
by waiting for the whole thread as in the above example, or by using some
other synchronization mechanism), but the point is that C programmers are
used to a fundamentally "flat" environment without segments etc.

And what a private stack is, is _nothing_ more than a segment.

And I have not _ever_ met a good C programmer that liked segments.

So in a C/UNIX-like environment, private stacks are wrong. You could
imagine _other_ environments where they might be valid, but even those
other environments would not invalidate my points about efficiency and
simplicity.

Linus
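
The pattern in the message above, where a new thread gets its initial state from a variable on its creator's stack, can be written out in full with POSIX threads. This is a minimal sketch; `struct job`, `square_job` and `square_in_thread` are hypothetical names chosen for the example.

```c
#include <pthread.h>

/* State shared between the creating thread and the worker. */
struct job {
    int input;
    int output;
};

/* Worker: reads and writes the job that lives on its creator's stack. */
static void *square_job(void *arg)
{
    struct job *j = arg;
    j->output = j->input * j->input;
    return NULL;
}

/* Hypothetical helper: returns n*n computed by a second thread, or -1. */
int square_in_thread(int n)
{
    struct job j = { .input = n, .output = 0 };   /* automatic variable */
    pthread_t tid;

    if (pthread_create(&tid, NULL, square_job, &j) != 0)
        return -1;
    /* j must stay alive until the worker is done with it. */
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return j.output;
}
```

This only works when both threads resolve the address of `j` to the same memory, which is the "flat" view of the address space that the message argues C programmers expect.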

dbai...@ameritech.net

Feb 27, 2004, 2:55:41 AM

> personally, i'm less interested in arguing point 1 from linus's post
> (there might be valid reasons to do this sometimes), which seems at
> least plausible, and much more interested in the other two. on the
> second, i think clarification is in order: that's the main advantage
> of threads over what, processes? and on that last one (the bit on
> performance and code quality) i'm *really* curious what he's talking
> about. do we have a particularly heavy fork? i'm confused. saying
> "The Linux code is just better." just *begs* for support.

I wasn't interested in #2 because #2 didn't make much sense to me.
I wasn't interested in #3 because I had no desire to discuss things
resembling personal attacks.

Don (north_)

Rob Pike

Feb 27, 2004, 3:01:52 AM

> So in a C/UNIX-like environment, private stacks are wrong. You could
> imagine _other_ environments where they might be valid, but even those
> other environments would not invalidate my points about efficiency and
> simplicity.

as i said before, the stacks are not private. you're right, that's a
bad thing. that's why they're not private.

the segment called 'stack' is private, but that's a different thing.
i stress: stack != stack segment. stack is where your sp is; stack
segment is a figment of the VM system.

i ask again: how does linux create per-thread storage?

the way the plan 9 thread library works is so different from linux's
that they're hard to compare. program design in the two worlds is
radically different. so your claim of 'better' is curious to me. by
'better' you seem to mean 'faster' and 'cleaner'. faster at least can
be measured. you speak with certainty. have you seen performance
comparisons? i haven't, although it wouldn't surprise me to learn that
there are useful programs for which linux outperforms plan 9, and vice
versa of course.

-rob
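
One way to get per-thread storage when the whole address space is shared is compiler-supported thread-local storage. The sketch below uses C11 `_Thread_local` with POSIX threads; it is an assumption picked for illustration, not a description of how any system discussed in this thread did it in 2004. Unlike a private stack segment, each thread's copy sits at a different address, so this gives "same name, different value" rather than "same address, different value". The names `per_thread_counter`, `bump_counter` and `counter_is_per_thread` are hypothetical.

```c
#include <pthread.h>

/* One instance of this variable exists per thread (C11). */
static _Thread_local int per_thread_counter = 0;

/* Worker: updates its own copy and reports the value it ended up with. */
static void *bump_counter(void *arg)
{
    per_thread_counter += 5;            /* touches only this thread's copy */
    *(int *)arg = per_thread_counter;
    return NULL;
}

/*
 * Hypothetical helper: returns 1 if the worker's update left the caller's
 * copy untouched, 0 otherwise, and -1 on error.
 */
int counter_is_per_thread(void)
{
    pthread_t tid;
    int seen_by_worker = 0;

    per_thread_counter = 1;             /* the caller's copy */
    if (pthread_create(&tid, NULL, bump_counter, &seen_by_worker) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    /* The worker started from the initializer (0), not from our 1. */
    return seen_by_worker == 5 && per_thread_counter == 1;
}
```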

Linus Torvalds

Feb 27, 2004, 3:04:46 AM

On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
>

> > The rationale is that it's incredibly more sane, and it's the logical
> > place to put something that (a) needs to be allocated thread-specific and
> > (b) doesn't need any special allocator.
>
> You've just proven my point. Thread specific. Being Thread specific, it
> is data that is reserved to the scope of a single thread. Nothing more.
> If you want more scope there are many more usages of memory that
> are better utilized.

NO!

A "per-thread allocation" does NOT MEAN that other threads should not
access it. It means that the ALLOCATION is thread-private, not that the
USE is thread-private.

per-thread allocations are quite common, and critical. If you have global
state, you need to protect them with locks, and you need to have nasty
global allocators.

One common per-thread allocation is the "I want to wait for an event". The
data is clearly for that one thread, and using a global allocator would be
WRONG. Not to mention inefficient.

But once the data has been allocated, other threads are what will actually
use the data to wake the original thread up. So while it wants a
per-thread allocator, it simply wouldn't _work_ if other threads couldn't
access the data.

That's what a "completion structure" is in the kernel. It's all the data
necessary to let a thread wait for something to complete. Another thread
will do "complete(xxx)", where "xxx" is that per-thread data.

You don't like it. Fine. I don't care. You're myopic, and have an agenda
to push, so you want to tell others that "you can't do that, it's against
my agenda".

While I'm telling you that people _do_ do that, and that it makes sense,
and if you didn't have blinders on, you'd see that.

Linus
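
The "completion structure" described above lives in the kernel, but the same shape can be sketched in user space: the waiter allocates the structure as an automatic variable, and another thread uses a pointer to it to wake the waiter. The `struct completion` below is a hypothetical stand-in built from a POSIX mutex and condition variable, not the kernel's definition, and the `value` field is added only so the example has something to return.

```c
#include <pthread.h>

/* Hypothetical user-space stand-in for a completion. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
    int             value;   /* payload, added for the example */
};

/* Runs in the other thread: publish a value and wake the waiter. */
static void *complete_it(void *arg)
{
    struct completion *c = arg;   /* points into the waiter's stack */
    pthread_mutex_lock(&c->lock);
    c->value = 123;
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Waiter: the completion is an automatic variable on this thread's stack. */
int wait_for_completion_on_stack(void)
{
    struct completion c;
    pthread_t tid;

    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.cond, NULL);
    c.done = 0;
    c.value = 0;

    if (pthread_create(&tid, NULL, complete_it, &c) != 0)
        return -1;

    pthread_mutex_lock(&c.lock);
    while (!c.done)                       /* guards against spurious wakeups */
        pthread_cond_wait(&c.cond, &c.lock);
    pthread_mutex_unlock(&c.lock);

    pthread_join(tid, NULL);              /* c must outlive the other thread */
    pthread_cond_destroy(&c.cond);
    pthread_mutex_destroy(&c.lock);
    return c.value;
}
```

The allocation is per-thread (no allocator and no global state are involved), while the use is shared, which is the distinction the message draws.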

Lucio De Re

Feb 27, 2004, 3:06:42 AM

On Thu, Feb 26, 2004 at 11:57:52PM -0800, Linus Torvalds wrote:
>
> On Fri, 27 Feb 2004, Lucio De Re wrote:
> >
> > Is Torvalds really saying that environment[] is held _in_ the stack?!
> > No wonder he was reluctant to copy it! Specially when using "bash".
> > But I must be mistaken.
>
> Take a look if you don't believe me.
>
> There is a _lot_ of state on the stack. That's how C works.
>

Hm. Yes, C allows it. And your point is valid that an arbitrary
data segment may justifiably be copied. But the stack is traditionally
the repository of special, let's call it "holy" information and
having a thread stomping all over it when another thread gets to
find out the hard way is not a thrilling prospect.

After all, there is nothing to prevent a coder from altering the
return address (traditionally a stack entry) in one thread and
totally wrecking the behaviour of another on return, right?

I still think the acid test lies with whether one version can model
the other.

++L

PS: and I really hope that the arguments and environment variables
are stored elsewhere and only the pointers appear on the stack :-)
I mean, even MS-DOS got that bit "sane", to use a very loaded word
previously used by Linus in this discussion.

Scott Schwartz

Feb 27, 2004, 3:07:33 AM

I suspect Linus isn't subscribed to 9fans, so we'll have to cc him.

That sharing is achieved by the thread library, not by the rfork
system call, though, right?

dbai...@ameritech.net

Feb 27, 2004, 3:11:17 AM

> You don't like it. Fine. I don't care. You're myopic, and have an agenda
> to push, so you want to tell others that "you can't do that, it's against
> my agenda".

I'm not saying you can't, I'm just asking why, and you're not giving
me any sensible reason for it besides "because people do". That isn't
logical to me, and that's got nothing to do with any mysterious
agenda. I'm just interested in the theory.

> While I'm telling you that people _do_ do that, and that it makes sense,
> and if you didn't have blinders on, you'd see that.

Ok, so people do *do* that. That's fine, but that doesn't make it
correct, or sensible. I just wanted some semblance of logic, not
flames that equate to "I do it because I can". I can hop on one leg,
too, but that doesn't make hopping on one leg efficient.

Don (north_)

Lucio De Re

Feb 27, 2004, 3:12:43 AM

On Fri, Feb 27, 2004 at 12:08:38AM -0800, Linus Torvalds wrote:

>
> On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
> >
> > > The rationale is that it's incredibly more sane, and it's the logical
> > > place to put something that (a) needs to be allocated thread-specific and
> > > (b) doesn't need any special allocator.
> >
> > You've just proven my point. Thread specific. Being Thread specific, it
> > is data that is reserved to the scope of a single thread. Nothing more.
> > If you want more scope there are many more usages of memory that
> > are better utilized.
>

> NO!
>
> A "per-thread allocation" does NOT MEAN that other threads should not
> access it. It means that the ALLOCATION is thread-private, not that the
> USE is thread-private.
>

Wait a minute! If the stack is not thread private, what is? I
think this answers my question: stack duplication allows per-thread
private space, whereas stack sharing doesn't. This is a one-way
path. Unless you drop into register access, where it's the platform
that decides whether registers are duplicated or not.

But maybe they should also be shared, to be totally consistent.

Of course, I may be talking out of turn, but I really don't see
how threads can have private space if the stack isn't private.

++L

Rob Pike

Feb 27, 2004, 3:16:41 AM

On Feb 27, 2004, at 12:06 AM, Scott Schwartz wrote:

> I suspect Linus isn't subscribed to 9fans, so we'll have to cc him.

sorry about that.

>
> Rob writes:
> | the argument about TLB flush times is interesting.
> |
> | > But it doesn't matter. Regardless, threads should see each other's
> | > stacks.
> |
> | and on plan 9, they do. they just can't see each other's *stack
> | segments* within their own address space.
> |
> | there seems to be confusion on this point. plan 9's kernel splits the
> | stack segment after a fork, but in the normal state, the processes run
> | with the sp in shared memory. the marvelous properties of the
> | same-address-different-contents split stack segment are used only
> | during the fiddly bits of process manipulation and to store per-process
> | data.
>
> That sharing is achieved by the thread library, not by the rfork
> system call, though, right?

right. but where it's implemented is in some sense just a detail, as my
other message, included below, implies.

another way of looking at it is that the kernel provides some nice
primitives upon which to build a thread library. the plan 9 kernel
leaves out a lot of stuff that you're supposed to do in the kernel
for threads, but we were exploring options.

>
> | how does linux store per-process data in user space?
> |
> | -rob

here's my other message, with the cc: this time:

Rob Pike

Feb 27, 2004, 3:18:49 AM

On Feb 27, 2004, at 12:11 AM, Lucio De Re wrote:

> Of course, I may be talking out of turn, but I really don't see
> how threads can have private space if the stack isn't private.

well, perhaps the stack isn't the only place to do it, but it's
certainly an easy one, and one that makes the syscall interface
to fork easy to implement in a threaded environment: longjmp
to the private stack, fork, adjust, longjmp back.

-rob

Geoff Collyer

Feb 27, 2004, 3:20:36 AM

I think we're talking past each other due to different terminologies.
Linus seems to use `thread' to mean `a process sharing address space
(other than normal text segment sharing)', whereas in Plan 9, that's
just a process; some share address space, some don't. A Plan 9 thread
is entirely a user-mode creation of the Plan 9 thread library, which
doesn't implement POSIX threads.

Lucio De Re

Feb 27, 2004, 3:32:46 AM

But I can't think of even one possible alternative. After all, the
stack is the only storage being duplicated (ignoring registers) so
where does one keep pointers to the private space?

++L

boyd, rounin

Feb 27, 2004, 3:46:40 AM

> > So the question to me is, why would you *want* a unified stack?
>
> well, you don't. It makes programming a nightmare and you need assembly
> goo to separate the stacks after a fork, ...

yup

boyd, rounin

Feb 27, 2004, 3:50:37 AM

> > what happens on MP systems?
> >
> not sure what you're driving at.

i do. they have forgotten their lister and the old (now quick):

s = splfoo();

/* critical region */

splx(s);

usually simplifies things, unlike the lunacy you have to code
on linux >= 2.2 -- i've coded it; it made me sick.

Linus Torvalds

Feb 27, 2004, 3:53:36 AM

On Fri, 27 Feb 2004, Rob Pike wrote:
>
> > So in a C/UNIX-like environment, private stacks are wrong. You could
> > imagine _other_ environments where they might be valid, but even those
> > other environments would not invalidate my points about efficiency and
> > simplicity.
>
> as i said before, the stacks are not private. you're right, that's a
> bad thing.
> that's why they're not private.
>
> the segment called 'stack' is private, but that's a different thing.
> i stress: stack != stack segment. stack is where your sp is; stack
> segment is a figment of the VM system.

Well, in another email I already said that "private stack" and "segments"
are really exactly the same thing - some people think of segments as
paging things, others think of them in the x86 sense, but in the end it
all comes down to the fact that a "stack address" ends up making sense
only within a specific context (and that context can sometimes be
partially visible to other threads by using explicit segment registers or
other magic, like special instructions that can take another address
space).

And private/segmented stacks are bad.

They are bad exactly because they magically make automatic variables
fundamentally different from other variables. And they really have no
reason they should be different.

There is absolutely nothing wrong with having a thread take the address of
some automatic variable, and then just pass that address off to another
routine. And if that other routine decides that it is going to create a
hundred threads to solve the problem that the variable described in
parallel, then that should JUST WORK. Anything else would be EVIL.

Having a pointer that sometimes works, and sometimes doesn't, based on who
uses it - that's just crazy talk.

> i ask again: how does linux create per-thread storage?

The same way it creates any other storage: with mmap() and brk(). You just
malloc the thing, and you pass in the new stack as an argument to the
thread creation mechanism (which linux calls "clone()", just to be
different).

And because that storage is just storage, things like the one I described
above "just work". If you pass another thread a pointer to your stack, the
other thread can happily manipulate it, and never even needs to know that
it's an automatic variable somewhere else.

And sure, you can shoot yourself in the foot that way. You can pass off a
pointer to another thread, and then return from the function without
synchronizing with that other thread properly, and now the other thread
will scribble all over your stack. But that's really nothing different
than using "alloca()" and passing off that to something that remembers the
address.
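
A minimal sketch of the point above, not taken from the thread: a function keeps its shared state in an automatic variable and hands the address to several threads. It uses POSIX threads rather than raw clone(), and the names struct job, worker and count_with_threads, along with both limits, are made up for illustration. The join before returning is the synchronization whose absence the last paragraph warns about.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* arbitrary limit for this sketch */
#define INCREMENTS  1000        /* work done by each thread */

/* Shared state; the only instance lives on the creator's stack. */
struct job {
    pthread_mutex_t lock;
    long total;
};

static void *worker(void *arg)
{
    struct job *j = arg;        /* points into another thread's stack */

    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&j->lock);
        j->total++;
        pthread_mutex_unlock(&j->lock);
    }
    return NULL;
}

/* Returns the final counter, or -1 if a thread could not be created. */
long count_with_threads(int nthreads)
{
    struct job j;               /* automatic variable */
    pthread_t tid[MAX_THREADS];
    int started;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    j.total = 0;
    if (pthread_mutex_init(&j.lock, NULL) != 0)
        return -1;

    for (started = 0; started < nthreads; started++)
        if (pthread_create(&tid[started], NULL, worker, &j) != 0)
            break;

    /* Join before returning: the frame holding j must outlive its users. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&j.lock);
    return started == nthreads ? j.total : -1;
}
```

Calling count_with_threads(4) should give 4 * INCREMENTS, since every worker updates the same stack-resident counter under the lock.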

> the way the plan 9 thread library works is so different from linux's
> that they're hard to compare. program design in the two worlds is
> radically different. so your claim of 'better' is curious to me. by
> 'better' you seem to mean 'faster' and 'cleaner'. faster at least can
> be measured.

To me, the final decision on "better" tends to be a fairly wide issue.
Performance is part of it - especially infrastructure that everybody
depends on should always strive to at least _allow_ good performance, even
if not everybody ends up caring.

But the concept, to me, is more important. Basically, I do not see how you
can really have a portable and even _remotely_ efficient partial VM
sharing. And if you can't have it, then you shouldn't design the
interfaces around it.

> you speak with certainty. have you seen performance comparisons? i
> haven't, although it wouldn't surprise me to learn that there are useful
> programs for which linux outperforms plan 9, and vice versa of course.

When it comes to threads, I only see three interesting performance
metrics: how fast can you create them, how fast can you synchronize them
(both "join" and locking), and how well do you switch between them.

The locking is pretty much OS-independent, since fast locking has to be
done in user space anyway (with just the contention case falling back to
the OS, and if your app cares about performance it hopefully won't have
much contention).
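
As a rough illustration of "fast path in user space, slow path in the OS", here is a sketch of a lock built on a C11 atomic flag. It is an assumption-laden toy: the fastlock names are invented, and sched_yield() stands in for the futex-based sleep that a real Linux thread library would use on contention.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_flag held;
} fastlock;

#define FASTLOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns 1 if the lock was taken, 0 if somebody else holds it. */
int fastlock_trylock(fastlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void fastlock_lock(fastlock *l)
{
    /* Uncontended case: one atomic operation, no system call. */
    while (!fastlock_trylock(l))
        sched_yield();          /* contended case: hand the CPU to the OS */
}

void fastlock_unlock(fastlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The kernel is only involved when trylock fails, which is the property the paragraph relies on when it calls locking "pretty much OS-independent".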

So we're left with create, tear-down and switch. All of which are
_fundamentally_ faster if you just have a "share everything" model.
Create and tear-down are just increment/decrement a reference counter
(there's a spinlock involved too). Task switch is a no-op from a VM
standpoint (except we have a per-thread lazy TLB invalidate that will
trigger).

In contrast, partial sharing is a major pain. You definitely don't just do
a reference count increment for your VM.

Linus

boyd, rounin

Feb 27, 2004, 3:54:05 AM

> (3) Implementation sucks. Irix and Plan-9 both get it wrong, and they
> _pay_ for it. Heavily. The Linux code is just better.

5M SLOC for the kernel? look at list.h

better? mon cul.

boyd, rounin

Feb 27, 2004, 3:57:48 AM

> Can you elaborate on your synchronization example, please?

FOTFLMAO :)

nice one, don.

boyd, rounin

Feb 27, 2004, 4:07:06 AM

> But yes, I'm biased.

no, you're wrong.

boyd, rounin

Feb 27, 2004, 4:06:43 AM

> it might have been if Linux hadn't gone through an O(n) scheduler before
> it did that context switch.

yes, the comments in the 2.4 scheduler are immediate:

danger, danger, will robinson.

boyd, rounin

Feb 27, 2004, 4:08:47 AM

> And I have not _ever_ met a good C programmer that liked segments.

err, the PDP-11?

Linus Torvalds

Feb 27, 2004, 4:26:37 AM

Well, what Linux has is really what I privately call a "context of
execution".

The "clone()" system call in Linux just creates a new such "context of
execution", and you can choose to arbitrarily share pretty much any OS
state, by just saying which state you want to share in a bitmap. In
addition to the bitmap there are a few pointers you pass around, the full
required state is actually

clone_flags: bitmap of how to create the new context of execution
newsp: stack pointer of new context
parent tidptr: pointer (in the parent) to the thread ID information
child tidptr: pointer (in the child) to the thread ID information
tls pointer: pointer to TLS (thread-local-storage) for the context

but not all of them are necessarily used (ie if you don't want to set TLS
or TID information, those pointers are obviously unused).

The bits you can control the context copy with are:

CSIGNAL /* signal mask to be sent at exit */

CLONE_VM /* set if VM shared between processes */
CLONE_FS /* set if fs info shared between processes */
CLONE_FILES /* set if open files shared between processes */
CLONE_SIGHAND /* set if signal handlers and blocked signals shared */
CLONE_IDLETASK /* set if new pid should be 0 (kernel only)*/
CLONE_PTRACE /* set if we want to let tracing continue on the child too */
CLONE_VFORK /* set if the parent wants the child to wake it up on mm_release */
CLONE_PARENT /* set if we want to have the same parent as the cloner */
CLONE_THREAD /* Same thread group? */
CLONE_NEWNS /* New namespace group? */
CLONE_SYSVSEM /* share system V SEM_UNDO semantics */
CLONE_SETTLS /* create a new TLS for the child */
CLONE_PARENT_SETTID /* set the TID in the parent */
CLONE_CHILD_CLEARTID /* clear the TID in the child */
CLONE_DETACHED /* Unused, ignored */
CLONE_UNTRACED /* set if the tracing process can't force CLONE_PTRACE on this clone */
CLONE_CHILD_SETTID /* set the TID in the child */
CLONE_STOPPED /* Start in stopped state */

(CSIGNAL isn't a bit - it's the low 8 bits, and it specifies the signal
you want to send to your parent when you die).

So a "fork()" is literally really just a "clone(SIGCHLD)". We're saying
that we don't want to share anything, and that we want to send a SIGCHLD
at exit.

Setting the CLONE_VM bit says that the VM gets shared. That means that
instead of copying the page tables, we just copy the pointer to the
"struct mm_struct", which describes everything in the VM, and we increment
its reference count.
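
A hedged sketch of what the CLONE_VM bit changes, using the glibc clone() wrapper on Linux (not portable elsewhere). The helper names child_fn and value_after_clone and the 64 kB stack size are assumptions for illustration. The child stores into an automatic variable of the parent; the parent sees the store only when the VM is shared.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* clone() and CLONE_VM are Linux/glibc extensions */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef CLONE_VM                /* fallback if feature macros were fixed earlier */
#define CLONE_VM 0x00000100
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);
#endif

#define CHILD_STACK_SIZE (64 * 1024)    /* arbitrary size for this sketch */

static int child_fn(void *arg)
{
    *(int *)arg = 42;           /* reaches the parent only if the VM is shared */
    return 0;                   /* becomes the child's exit status */
}

/* Returns the value the parent sees after the child exits, or -1 on error. */
int value_after_clone(int share_vm)
{
    int value = 0;              /* automatic variable in the parent */
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);
    int status;
    pid_t pid;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (stack == NULL)
        return -1;

    /* The stack grows down on x86, so pass the top of the allocation. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE, flags, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;              /* child state unknown: do not free its stack */

    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    free(stack);
    return value;
}
```

With share_vm set, the parent reads back 42; with plain SIGCHLD (the fork-like case) the child writes to its own copy and the parent still reads 0.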

There is no "partial copy". If you say that you want to share the VM, you
get the WHOLE VM. Or you will get a totally private VM. Similarly, if you
say that you want to share the file descriptors (CLONE_FILES), they will
all be shared: one context doing an "open()" will have that fd be valid in
all other contexts that share it.

(The difference between CLONE_FILES and a regular fork() is that a
CLONE_FILES will increment just _one_ reference count: the reference count
for the whole array of pointers to files. In contrast, a fork-like
non-shared case will create a whole new array of pointers to files, and
then for each file increment the reference count for that file).
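
The difference can be modelled in a few lines of user-space C. This is a toy model, not kernel code: struct file, struct files_table, share_files and copy_files are hypothetical names, the table size is arbitrary, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NFILES 4                /* toy table size, assumed */

struct file {
    int refs;                   /* per-file reference count */
};

struct files_table {
    int refs;                   /* one count for the whole table */
    struct file *fd[NFILES];
};

/* CLONE_FILES-like: share the table and bump a single counter. */
struct files_table *share_files(struct files_table *t)
{
    assert(t != NULL && t->refs > 0);
    t->refs++;
    return t;
}

/* fork-like: build a new table and take a reference on every open file. */
struct files_table *copy_files(const struct files_table *t)
{
    struct files_table *n;

    assert(t != NULL);
    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;

    n->refs = 1;
    for (int i = 0; i < NFILES; i++) {
        n->fd[i] = t->fd[i];
        if (n->fd[i] != NULL)
            n->fd[i]->refs++;
    }
    return n;
}
```

Sharing costs one increment regardless of how many files are open; copying costs an allocation plus one increment per open file.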

What most "unix people" call threads is something that is created with
pretty much all flags set - we share pretty much everything except for the
register state and the kernel stack between the contexts. And when I say
"share", I really mean share: most of the bits end up being copying a
kernel pointer and incrementing the reference count for that object.

Some of the bits are "administrative": the VFORK bit isn't about sharing,
it's about the parent waiting until the child releases the VM back to it
(btw, that uses a "completion" structure on the parent's stack). Similarly,
the SETTID/CLEARTID bits are about writing the TID ("thread ID" as opposed
to "process ID") to the VM space atomically with the creation (or in the
case of CLEARTID, teardown) of the thread. That ends up helping the thread
management (from user space) a _lot_.

(Tangential to this discussion is the TLS or "thread local storage" bit -
some architecture-specific way of indicating a small thread-specific
storage area. It's not the stack, it's just a regular allocation, and
different architectures have different ways of pointing to it. Usually
there's some architected register set aside for it).

And you can mix and match things. You can literally create a new context
that shares the file descriptors (so that one process doing an "open()"
will open files in the other one), but doesn't share the VM space.

Although some of them are interdependent - CLONE_THREAD (which is really
just "all signal state" despite the name - it has nothing to do with VM
per se) depends on CLONE_SIGHAND (which is just the set of signal handlers
associated with the context), which in turn depends on CLONE_VM (because
it doesn't make sense to be able to take a signal in different contexts
unless they share the same VM).
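
The dependency chain described above amounts to a small validity check. The sketch below is illustrative only: the DEMO_ constants carry the same bit values as the Linux flags but are renamed to avoid clashing with <sched.h>, and check_clone_flags is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_CLONE_VM      0x00000100UL
#define DEMO_CLONE_SIGHAND 0x00000800UL
#define DEMO_CLONE_THREAD  0x00010000UL

static_assert((DEMO_CLONE_VM & DEMO_CLONE_SIGHAND) == 0 &&
              (DEMO_CLONE_SIGHAND & DEMO_CLONE_THREAD) == 0 &&
              (DEMO_CLONE_VM & DEMO_CLONE_THREAD) == 0,
              "flag bits must not overlap");

/* Returns 0 if the combination is consistent, -EINVAL otherwise. */
int check_clone_flags(unsigned long flags)
{
    /* Sharing all signal state requires sharing the signal handlers. */
    if ((flags & DEMO_CLONE_THREAD) && !(flags & DEMO_CLONE_SIGHAND))
        return -EINVAL;

    /* Shared handlers only make sense when the handler code is shared. */
    if ((flags & DEMO_CLONE_SIGHAND) && !(flags & DEMO_CLONE_VM))
        return -EINVAL;

    return 0;
}
```

Independent bits such as the file table have no such constraint, which is why sharing descriptors without sharing the VM is allowed.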

This has gotten fairly far off the notion of stacks and VM.. But I hope
it's clear to everybody that I heartily agree with rfork-like
functionality. It's just segmented/private stacks I can't understand.

Linus

boyd, rounin

Feb 27, 2004, 4:35:46 AM

> The bits you can control the context copy with are:

what? only ~20. surely you need some more?

Linus Torvalds

Feb 27, 2004, 4:41:50 AM

Think it through. You should _not_ duplicate the stack (because that
wreaks havoc with your TLB and normal usage), so what do you have left?

Once you've eliminated the impossible, what you have left, however
improbable, is the truth.

Registers.

Why are you ignoring registers? That's what you _should_ use.

For example, inside the Linux kernel, we tend to use the stack POINTER as
the thread-local state. When we allocate a new context of execution, we
allocate (depending on architecture) 8kB of memory, and it's aligned so
that if the architecture doesn't have any other registers free, we can get
at the "thread_info" structure by just doing bit masking on the stack
pointer.

That ends up being quite powerful - and it's cheap too, exactly because it
is a register, and thus fast to access. The stack itself is by no means
private - other threads can access the stack.

In fact, we used to put the whole "struct task_struct" (which is the thing
that defines a context of execution in Linux) that way, but it ends up
doing nasty things to caches when important global data structures are all
aligned on powers-of-two boundaries, so we ended up getting rid of that.

In user space, that doesn't tend to work too well, because the stack isn't
as well bounded as in the kernel, but most architectures either have lots
of registers (and then one is just used for the thread-local pointer) or
even an architected register that user space can read but not write.

One of the most problematic architectures is the x86, which doesn't have
lots of general-purpose registers (so using one of them to point to TLS
would be bad), and doesn't have any nice architected register either.
There we ended up using a segment register, however much I hate them.

We could have just made a trivial system call ("get the thread-local
pointer" from the kernel stack), but obviously there are performance
issues here.

In short: there is absolutely no reason to make the stack be private. The
only thing you need for thread-local-storage is literally just one
register, to indirect through. And it can be a fairly strange one at that,
ie it doesn't need to be able to hold a full 32-bit value.

Linus

boyd, rounin

Feb 27, 2004, 4:47:47 AM

> We could have just made a trivial system call ("get the thread-local
> pointer" from the kernel stack), but obviously there are performance
> issues here.

just one? surely another 32 or 64 ...

Linus Torvalds

Feb 27, 2004, 4:48:41 AM

On Fri, 27 Feb 2004, boyd, rounin wrote:
>
> > The bits you can control the context copy with are:
>
> what? only ~20. surely you need some more?

So far, no. It's been growing over the years, but not quickly.

Realize that you don't want to even control this with a very fine
granularity. Every single new thing that you allow being copied or shared
ends up being more state that you have to reference-count and keep track
of. You want to have as little as possible of that kind of state.

So keep them to big fundamental things. The VM. The file table. The signal
state. The namespace. (And those f*ing horrible SysV IPC things).

Splitting it up more would just hurt. Just out of interest, what else
could you possibly want, and why?

After all, it's not like the kernel keeps track of all that much else than
VM, files and signals.

Linus

Lucio De Re

Feb 27, 2004, 4:54:42 AM

On Fri, Feb 27, 2004 at 01:46:27AM -0800, Linus Torvalds wrote:
>
> Why are you ignoring registers? That's what you _should_ use.
>

Because they are not in the base language? Because they impair
portability?

Because in my C code, I can't instantiate them as variables without
running the risk of my colleagues doing the same in a conflicting
manner?

Just out of curiosity, how do I get to the private space without
a lock? It's been a long time since I studied these things and lots
of water has flown under bridges, so I could be missing something.

++L

Linus Torvalds

Feb 27, 2004, 4:55:43 AM

On Fri, 27 Feb 2004, boyd, rounin wrote:
>

Why?

You only need one thread-local pointer. You just put all your stuff offset
from that one. Realize that the kernel doesn't do anything at all with
that value - it's purely a random value that the user can set, and the
kernel doesn't even need to know that it is a pointer. You could use it as
an index into other pointers if you wanted to.

So anything but one doesn't make any sense. Where would you stop?

Am I missing something here? Linux certainly happily makes do with just
one TLS pointer.

Linus

Charles Forsyth

Feb 27, 2004, 5:00:59 AM

>>Just out of curiosity, how do I get to the private space without
>>a lock? It's been a long time since I studied these things and lots
>>of water has flown under bridges, so I could be missing something.

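static __inline Proc *getup(void) {
        Proc *p;

        /* read the current stack pointer */
        __asm__( "movl %%esp, %%eax\n\t"
                : "=a" (p)
        );
        /* round down to the KSTACK-aligned stack base, where the Proc* lives */
        return *(Proc **)((unsigned long)p & ~(KSTACK - 1));
}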

where the process that makes the new process's stack puts a pointer to
the private space (a Proc* in this case) at the base of that stack, and where KSTACK is
a power-of-2. i don't think it can be accessed portably in Linux,
but you don't need a lock.

Linus Torvalds

Feb 27, 2004, 5:06:37 AM

On Fri, 27 Feb 2004, Lucio De Re wrote:

> On Fri, Feb 27, 2004 at 01:46:27AM -0800, Linus Torvalds wrote:
> >
> > Why are you ignoring registers? That's what you _should_ use.
>
> Because they are not in the base language? Because they impair
> portability?

You literally need _one_ operation: you need the operation of "give me the
TLS pointer" (well, your thread setup code obviously needs a way to set
the pointer when creating a thread too).

The rest _is_ in the language - although if you want nice syntax, you
sometimes want more explicit language support.

Have you ever looked at those system header files? They contain a lot of
magic cruft that is specific to your compiler and your particular
architecture.

> Because in my C code, I can't instantiate them as variables without
> running the risk of my colleagues doing the same in a conflicting
> manner?
>
> Just out of curiosity, how do I get to the private space without
> a lock? It's been a long time since I studied these things and lots
> of water has flown under bridges, so I could be missing something.

In the kernel, the x86 implementation of the thread-local pointer is
literally:

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
        struct thread_info *ti;
        __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
        return ti;
}

That's it. It compiles to _two_ instructions, eg something like

movl $-8192,%eax
andl %esp,%eax

and it's all done. You literally _cannot_ do it faster or smaller. No
locking, no memory accesses, no TLB games, no NOTHING.

On some other architectures, you have

register struct thread_info *__current_thread_info __asm__("$8");
#define current_thread_info() __current_thread_info

which just tells the compiler that the thread_info pointer is in hardware
register 8. And you're done. The compiler will just automatically use %r8
whenever you do a "current_thread_info()".

In user space, there are similar things going on. These days there is more
compiler support to make it easier to create these things, but it all
boils down to having a thread-local pointer somewhere.
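
For the user-space side, a small sketch of that compiler support, assuming a C11 compiler and POSIX threads. The names thread_data, tp, entry and id_seen_by_new_thread are illustrative. Each thread gets its own copy of tp, so the same variable name resolves to different storage per thread; on x86 Linux the access typically goes through a segment register, though that detail is up to the toolchain.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct thread_data {
    int id;
};

/* One thread-local pointer; everything else hangs off it. */
static _Thread_local struct thread_data *tp;

static void *entry(void *arg)
{
    struct thread_data mine = { .id = (int)(intptr_t)arg };

    assert(tp == NULL);         /* a new thread starts with a fresh copy */
    tp = &mine;                 /* sets this thread's copy only */
    return (void *)(intptr_t)tp->id;
}

/* Returns the id the new thread read through its own tp, or -1 on error. */
int id_seen_by_new_thread(int id)
{
    struct thread_data me = { .id = -1 };
    pthread_t t;
    void *ret = NULL;
    int ok;

    tp = &me;                   /* the caller's copy */
    ok = pthread_create(&t, NULL, entry, (void *)(intptr_t)id) == 0 &&
         pthread_join(t, &ret) == 0;

    assert(tp == &me);          /* the other thread did not touch our copy */
    tp = NULL;                  /* do not leave a pointer to this frame */
    return ok ? (int)(intptr_t)ret : -1;
}
```

The thread-specific data itself sits in ordinary shared memory (here, each thread's stack); only the pointer used to find it is per-thread.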

Linus

Lucio De Re

Feb 27, 2004, 5:08:34 AM

On Fri, Feb 27, 2004 at 10:00:32AM +0000, Charles Forsyth wrote:
>
> >>Just out of curiosity, how do I get to the private space without
> >>a lock? It's been a long time since I studied these things and lots
> >>of water has flown under bridges, so I could be missing something.
>

> static __inline Proc *getup(void) {
> Proc *p;
> __asm__( "movl %%esp, %%eax\n\t"
> : "=a" (p)
> );
> return *(Proc **)((unsigned long)p & ~(KSTACK - 1));
> };
>
> where the process that makes the new process's stack puts a pointer to
> the private space (a Proc* in this case) at the base of that stack, and where KSTACK is
> a power-of-2. i don't think it can be accessed portably in Linux,
> but you don't need a lock.

Yeow! And where do I put the returned address, so that no other thread
can stomp on it? While I attempt to use it?

++L

Douglas A. Gwyn

Feb 27, 2004, 5:11:25 AM

dbai...@ameritech.net wrote:
> Under the "Kernel Space and User Space" heading:
[very confused description omitted]
> | While this is a clever feat, the downside is that the overhead in maintaining
> | the stacks makes this in practice really stupid to do. They found out too
> | late that the performance went to hell. Since they had programs which used
> | the interface they could not fix it. Instead they had to introduce an additional
> | properly-written interface so that they could do what was wise with the stack
> | space. ...
I don't think any of that is correct.

Lucio De Re

Feb 27, 2004, 5:14:51 AM

On Fri, Feb 27, 2004 at 02:11:27AM -0800, Linus Torvalds wrote:
>
> On Fri, 27 Feb 2004, Lucio De Re wrote:
>
> > On Fri, Feb 27, 2004 at 01:46:27AM -0800, Linus Torvalds wrote:
> > >
> > > Why are you ignoring registers? That's what you _should_ use.
> >
> > Because they are not in the base language? Because they impair
> > portability?
>
> You literally need _one_ operation: you need the operation of "give me the
> TLS pointer" (well, your thread setup code obviously needs a way to set
> the pointer when creating a thread too).
>

Yes, but...

> > Because in my C code, I can't instantiate them as variables without
> > running the risk of my colleagues doing the same in a conflicting
> > manner?
> >

Unless I'm missing the wood for the trees, this returns a pointer.
Where do I put _it_ so no other thread, choosing to do likewise,
decides to stomp all over it?

++L

Charles Forsyth

Feb 27, 2004, 5:14:54 AM

>>Yeow! And where do I put the returned address, so that no other thread
>>can stomp on it? While I attempt to use it?

the process is running by then on its own stack, so the result is either in
a register (EAX say) or in a temporary on the stack but either way, it's private.

how does it get to its own stack in the first place? that's the bit
that's been causing some of the fuss here. at least for a few years,
a variant of `clone' is available that takes a pointer to the new top-of-stack for the
new process, so the creating process allocates a stack somewhere somehow
and passes that to clone. thus the new process is running on a private
stack once it starts.

Lucio De Re

Feb 27, 2004, 5:25:41 AM

On Fri, Feb 27, 2004 at 10:14:32AM +0000, Charles Forsyth wrote:
>
> >>Yeow! And where do I put the returned address, so that no other thread
> >>can stomp on it? While I attempt to use it?
>

> the process is running by then on its own stack, so the result is either in
> a register (EAX say) or in a temporary on the stack but either way, it's private.
>

Sure, but I thought the idea was that all storage was shared, except
any that was privately allocated later. Did I read wrong Torvalds'
comments about the stack being nothing special? And then a new
private stack comes along? How is it initialised? Isn't it a
duplicate?

I think I'm out of my depth.

++L

Linus Torvalds

Feb 27, 2004, 5:32:42 AM

On Fri, 27 Feb 2004, Lucio De Re wrote:
> >
> > You literally need _one_ operation: you need the operation of "give me the
> > TLS pointer" (well, your thread setup code obviously needs a way to set
> > the pointer when creating a thread too).
> >
> Yes, but...
>

> Unless I'm missing the wood for the trees, this returns a pointer.
> Where do I put _it_ so no other thread, choosing to do likewise,
> decides to stomp all over it?

Your register state? Or you save it to the (shared) VM, by using another
private pointer (ie the stack pointer).

It all boils down to one thing: the _only_ state that is always truly
private in your thread is your register state. But exactly because the
register state is private, you now have ways of turning the _rest_ of your
state (that isn't private) into your own private areas.

So while the VM is shared across all different threads, all threads
obviously have their own stack pointer registers, and they'd better be
pointing to different _parts_ of that shared VM, or your threads won't
work at all.

"Any problem in computer science can be solved with another layer
of indirection."
David Wheeler

In other words, your registers fundamentally _are_ your private space, and
everything else follows from that.

So your stack is "private" in the sense that nobody else uses it unless
you explicitly give a pointer to it (modulo bugs, of course ;).

Linus

Fco.J.Ballesteros

Feb 27, 2004, 7:15:57 AM

>> what? only ~20. surely you need some more?
>
> So far, no. It's been growing over the years, but not quickly.

I think boyd was kidding, wasn't he?

Dave Lukes

Feb 27, 2004, 7:25:53 AM

> > Do you have code from existing Linux implementation that exemplifies
> > this scenario?

... and I'm _still_ waiting for an answer.

> Anything that accesses any process arguments would access the stack of the
> original thread.

... which is copied, so what's the problem?
Do you want all the threads to be able to _update_ the process args?
That's harder, and also more dubious.

> But synchronization with
> completion events on the stack is also perfectly feasible (the kernel does
> exactly that internally, for example - and user mode could do the same).

The kernel using it is an implementation detail.
User mode using it: show me the example!

Cheers,
Dave.

C H Forsyth

Feb 27, 2004, 7:30:41 AM

sorry, when i said `private' i meant `for that process's exclusive use as a stack',
not that it was in a per-process address space.

it's per-process only in that by agreement the process to which
it is given is the only one that uses it as a stack (the others can
still access its contents if they like, but often they don't).

they are all still in the same address space.
they could be allocated (essentially) by malloc, except that
you need to ensure they are all aligned appropriately (eg,
at an address that is 0 mod KSTACK in that previous example).
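
A portable sketch of that arrangement, assuming C11's aligned_alloc and a KSTACK of 8192 bytes. struct proc_info, stack_alloc and owner_of are hypothetical names. The stack is an ordinary heap allocation aligned to KSTACK, a pointer to the owner's private data sits at its base, and any address inside the stack can be masked down to find it. getup() does the same with the live stack pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define KSTACK 8192u            /* must be a power of two; value assumed */

struct proc_info {
    int id;
};

/* Allocate a KSTACK-aligned stack and record its owner at the base. */
void *stack_alloc(struct proc_info *owner)
{
    void *base = aligned_alloc(KSTACK, KSTACK);

    if (base == NULL)
        return NULL;
    assert(((uintptr_t)base & (KSTACK - 1)) == 0);
    *(struct proc_info **)base = owner;
    return base;
}

/* Recover the owner from any address that lies inside the stack. */
struct proc_info *owner_of(const void *addr)
{
    uintptr_t base = (uintptr_t)addr & ~((uintptr_t)KSTACK - 1);

    return *(struct proc_info **)base;
}
```

Nothing here is private in the address-space sense: any thread that knows an address inside the block can run owner_of() on it and reach the same data.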

Rob Pike

Feb 27, 2004, 10:29:49 AM

> I think we're talking past each other due to different terminologies.
> Linus seems to use `thread' to mean `a process sharing address space
> (other than normal text segment sharing)', whereas in Plan 9, that's
> just a process; some share address space, some don't. A Plan 9 thread
> is entirely a user-mode creation of the Plan 9 thread library, which
> doesn't implement POSIX threads.

i was speaking using linux's terminology.

> Having a pointer that sometimes works, and sometimes doesn't, based on who
> uses it - that's just crazy talk.

put in those terms, it sounds weird, but it's not. consider the old u.
area in unix. that was a piece of address space with the same virtual
address in all processes but a different content. the system used the
fact that the addresses aliased that way. plan 9's thread model does a
similar thing by constructing a special storage class for data private
to each process. for instance, one can have a variable with the same
address in each process, call it tp, that points to the thread-specific
data, so you can write code like
        printf("my id is %d\n", tp->id);

> > i ask again: how does linux create per-thread storage?

> The same way it creates any other storage: with mmap() and brk(). You just
> malloc the thing, and you pass in the new stack as an argument to the
> thread creation mechanism (which linux calls "clone()", just to be
> different).

that wasn't what i was asking. i was referring to this special storage
class. how does a thread identify who it is? when we were porting inferno to
linux (back in 1996; things have likely changed) we resorted to using
unused user-visible MMU bits to store enough state to hold on with our
fingernails and claw back to private storage. another option we
considered was playing with magic address bits in the sp.

ah, i see in later mail that you answered this. there are now pointers
created in the user space (i think) to thread-local storage. how is it
accessed, that is, how does the user process derive the pointer to it?
this state stuff did not exist when we did the inferno port.

oh, and now i see you answering that later. it is 'cruft', as you say,
but it will work; it's the magic address bits hack. which kernel version
introduced this stuff? i've heard people say that 2.6 is the first one
with the default thread model being 'efficient' and 'good', but i don't
know the specifics. i've also heard that they can be retrofitted to
2.4.

i think a big part of my confusion is that my criticisms of linux
threads are based on an older view of how they worked. and a big part
of the commentary that started the discussion was incorrect about plan 9
history. i hope we're on the same page now.

it's interesting you advocate using registers for the magic storage
class. it's a great trick when you can do it - plan 9 uses it in the
kernel on machines with lots of registers - but it's not so great on a
machine with too few registers, like the x86.

-rob

Linus Torvalds

Feb 27, 2004, 11:04:46 AM

On Fri, 27 Feb 2004, Dave Lukes wrote:
>
> > > Do you have code from existing Linux implementation that exemplifies
> > > this scenario?
>
> ... and I'm _still_ waiting for an answer.

Hmm.. I answered it the first time around. Look in the kernel for "struct
completion", and you will see it. It's common behaviour there to do (this
example is from the fork() case):

    struct completion vfork;

    if (clone_flags & CLONE_VFORK) {
        p->vfork_done = &vfork;
        init_completion(&vfork);
    }
    ...
    wake_up_forked_process(p); /* do this last */
    ...
    if (clone_flags & CLONE_VFORK) {
        wait_for_completion(&vfork);
        ...

and then the mm_release function (in the child) does:

    struct completion *vfork_done = tsk->vfork_done;

    /* Get rid of any cached register state */
    deactivate_mm(tsk, mm);

    /* notify parent sleeping on vfork() */
    if (vfork_done) {
        tsk->vfork_done = NULL;
        complete(vfork_done);
    }

ie the other thread reads and updates the "struct completion" that is on
another process's stack.

The above is neither insane nor bad in any other way. It's in my opinion
_the_ most readable way to do things, and it certainly is the most
efficient one too.

In user space, I don't work with threaded programs, but if I were to write
one, I'd do this all the time. With threaded programs - even more so than
with normal programs - you need to avoid global state, which means that
quite often the stack is the only easy place to put something.

For example, most thread interfaces only allow passing in one single data
structure to a thread (eg the pthread_create() "void * arg" argument). So
whenever you have an issue with passing down data to describe what you
want done to a helper thread, you commonly end up passing in a pointer to
a structure. And if you end up waiting for the result (imagine doing
threaded video encoding, for example - where the subthreads exist as
computational things to do encoding over many CPU's), then it makes 100%
sense to do exactly the above.
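
A user-space sketch of that pattern, assuming POSIX threads. The struct job type and the function names are hypothetical, and pthread_join stands in for the kernel's wait_for_completion(). The job description lives on the creating thread's stack, and the helper both reads its input from it and writes its result back into it.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical job description; it lives on the caller's stack. */
struct job {
    int  n;     /* input: sum the integers 1..n */
    long sum;   /* output: filled in by the helper thread */
};

static void *helper(void *arg)
{
    struct job *j = arg;   /* points into the creating thread's stack */
    long s = 0;
    for (int i = 1; i <= j->n; i++)
        s += i;
    j->sum = s;            /* result is written back through the same pointer */
    return NULL;
}

/* Hand a stack-allocated job to a helper thread and wait for the result. */
long sum_in_helper(int n)
{
    struct job j = { n, 0 };
    pthread_t t;
    int rc = pthread_create(&t, NULL, helper, &j);
    assert(rc == 0);
    pthread_join(t, NULL);   /* j must stay alive until the helper is done */
    return j.sum;
}
```

This only works because the helper can see the creator's stack; with a private stack segment at the same address the pointer would name different memory in the helper.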

> > Anything that accesses any process arguments would access the stack of the
> > original thread.
>
> ... which is copied, so what's the problem?

The problem is that communication is almost never one-way, except
apparently in the thread we're involved right now.

> Do you want all the threads to be able to _update_ the process args?
> That's harder, and also more dubious.

Hey, I don't know about you, but pretty much every _single_ argument
parser I've ever done tends to fill in some data structure with the
pointer to the argument in question.

So for example, when I get a filename argument, I don't do

filename = strdup(argv[i]);

but I do

filename = argv[i];

and then I just pass that filename around.

And nobody - and I mean NOBODY - should be in the position that they
should remember how exactly that "filename" pointer was created to know
what the semantics of trying to edit it are. It's a pointer.

> The kernel using it is an implementation detail.
> User mode using it: show me the example!

I'm not into user-mode - user mode is for wimps who can't handle the
truth (yeah yeah, I know, I'm crazy, but I just find it more interesting
to interact with the hardware).

The thing is, I don't see why you are even arguing. There are zero
advantages to a private stack, and there are tons of disadvantages. So
what's your beef?

Linus

Dave Lukes

Feb 27, 2004, 11:41:42 AM

> Hmm.. I answered it the first time around.

I said: the kernel using it is an implementation detail.

You said:
> Look in the kernel for "struct
> completion", and you will see it.

i.e. you've built the code to fit the thread implementation.
I stand by what I said.

I said: user mode using it: show me the example!

You said:
> In user space, I don't work with threaded programs, but if I were to
> write
> one, I'd do this all the time.

i.e. no example.

> The problem is that communication is almost never one-way, except
> apparently in the thread we're involved right now.

Ha ha very funny: let me just recover from my ruptured abdomen ...
Now let me point out that, in spite of your copious code inclusion,
you have not yet provided any real examples.

Also, communication in programs is often one-way
(that's why ANSI invented "const":-).

> Hey, I don't know about you, but pretty much every _single_ argument
> parser I've ever done tends to fill in some data structure with the
> pointer to the argument in question.

Oh, yes, of course, silly me:
Real Programmers always parse their arguments in a separate thread.

> I'm not into user-mode - user mode is for wimps who can't handle the
> truth (yeah yeah, I know, I'm crazy, but I just find it more interesting
> to interact with the hardware).

If it's PC hardware, that _is_ crazy:-).

> The thing is, I don't see why you are even arguing. There are zero
> advantages to a private stack, and there are tons of disadvantages. So
> what's your beef?

"There are zero disadvantages to incorporating an RSX-11M emulator
in the kernel, and there are tons of advantages. What's your beef?"

Also, as a mantra:
the more shared data you have, the more problems you have.

One other detail: as far as I can see,
your examples all use shared memory as a cheap substitute
for message passing: why?

Cheers,
Dave.

Linus Torvalds

Feb 27, 2004, 11:53:53 AM

On Fri, 27 Feb 2004, Rob Pike wrote:
>
> > Having a pointer that sometimes works, and sometimes doesn't, based on who
> > uses it - that's just crazy talk.
>
> put in those terms, it sounds weird, but it's not. consider the old u.
> area in unix. that was a piece of address space with the same virtual
> address in all processes but a different content.

I hate to put it to you, Rob, but that sucks. It sucks so hard that it's
not even funny.

Playing tricks with the VM is a cool hack, and we've considered it
multiple times, but in the end sanity has always prevailed. Playing VM
tricks is _expensive_. It looks cheap because the access code "just
works", but the expense is in

- wasting TLB entries (all sane CPU's have big "fixed" areas that are
better on the TLB for system mode - large pages, BAT registers, or just
architected 1:1 mappings). The kernel should use them, because the
fewer TLB entries the kernel uses, the more there are for user space to
waste. But that implies that the kernel shouldn't play any VM tricks
with its own data - the VM is for user space (yeah, the kernel ends up
wanting to use the VM hardware for some things, but it's discouraged)

- CPU's with hardware TLB lookup have to have separate page tables for
different CPU's. Which means not only that you're wasting memory on
having <n> copies of the thing, it also means that now you have to have
code to maintain coherency between those separate page tables, and have
to have locking.

Having just one copy is just better. Two copies of the same thing is
bad.

- TLB invalidates are just a lot more expensive than changing a register
around. It's bad even on "good" hardware, it's totally unacceptable on
anything with a virtual cache, for example.

So the kernel internally has always had the stack pointer register as the
main "thread pointer". It did that even before SMP, just because it was
simpler and faster. With SMP, doing anything else becomes unacceptable.

> the system used the fact that the addresses aliased that way. plan 9's
> thread model does a similar thing by constructing a special storage
> class for data private to each process. for instance, one can have a
> variable with the same address in each process, call it tp, that points
> to the thread-specific data, so you can write code like
>
> printf("my id is %d\n", tp->id);

Yes. And you pay the price. For no good reason, I may add, since you
traditionally have been able to do the same by just having to add some
explicit thread tracking (it wouldn't be "tp->id", it would be
"mythread()->tp->id") or by adding compiler support to make the syntax be
easier.

These days, if you want to avoid the syntax of carrying that per-thread
pointer around, we have compiler support, ie you can do

__thread struct mystruct *tp = NULL;

and now "tp" is a per-thread variable. Behind the scenes the compiler
will change it to be an offset off the TLS pointer, pretty much exactly
the same way it does position-independent code.
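
A small sketch of what that buys you, assuming GCC or Clang (for the __thread extension) and POSIX threads. The variable and function names are made up. Each thread sees its own copy of counter under the same name.

```c
#include <assert.h>
#include <pthread.h>

static __thread int counter = 0;   /* one copy per thread */

static void *bump(void *out)
{
    for (int i = 0; i < 5; i++)
        counter++;                 /* touches only this thread's copy */
    *(int *)out = counter;         /* report what the helper saw */
    return NULL;
}

/* Returns 1 if the helper's copy and the caller's copy stayed independent. */
int tls_copies_are_independent(void)
{
    int helper_value = -1;
    pthread_t t;
    int rc;

    counter = 100;                 /* the calling thread's copy */
    rc = pthread_create(&t, NULL, bump, &helper_value);
    assert(rc == 0);
    pthread_join(t, NULL);

    /* The helper started from 0 and counted to 5; ours is still 100. */
    return helper_value == 5 && counter == 100;
}
```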

> > The same way it creates any other storage: with mmap() and brk(). You just
> > malloc the thing, and you pass in the new stack as an argument to the
> > thread creation mechanism (which linux calls "clone()", just to be
> > different).
>
> that wasn't what i was asking. i was referring to this special storage
> class. how does a thread identify who it is?

A long time ago, it was literally a "gettid()" system call. If you wanted
the thread-local space, you followed that by an index lookup.

It's not insanely expensive if you avoid re-generating the thread-local
pointer all the time, and pass it down as a regular argument, but it is
obviously syntactically not pretty.

These days - mostly thanks to compiler and library advances, not so much
any real kernel changes - the thread infrastructure sets up its local
pointers in registers, so that you can use the above "__thread"
specifier in the compiler, and when you access

tp->id

the compiler will actually generate

movl %gs:tp@NTPOFF, %eax
movl (%eax), %eax

for you (on other platforms that have compiler support of thread-local
storage it usually would end up being an indirect access through a regular
register).

The linker fixes these things up, the same way it does things like GOT
tables etc.

> ah, i see in later mail that you answered this. there are now pointers
> created in the user space (i think) to thread-local storage. how is it
> accessed, that is, how does the user process derive the pointer to it?
> this state stuff did not exist when we did the inferno port.

See above. If you control your environment (ie you don't have to worry
about having arbitrary TLS space), you can do better with the stack
register trick the kernel uses, but indirection will handle the general
case.

> it will work; it's the magic address bits hack. which kernel version
> introduced this stuff? i've heard people say that 2.6 is the first one
> with the default thread model being 'efficient' and 'good', but i don't
> know the specifics. i've also heard that they can be retrofitted to
> 2.4.

The new threading model in 2.6.x is really more about signal handling than
anything else. The _real_ problem with the original clone() implementation
had nothing to do with the VM, and had everything to do with insane POSIX
shared signal semantics. It's really hard to get the POSIX thread signal
semantics right, since the whole pthreads thing really was designed for
having all threads run within one master process, and Linux never had the
notion of "process vs threads".

The signal case that is hard to get right in POSIX is the fact that signal
masks are thread-local, yet their effect is "process global" (ie when you
change the signal mask of your thread, that means that you suddenly now
potentially start accepting pending signals that were shared process
global). I still don't like that POSIX model, and I didn't see any sane
way to do it efficiently with truly independent threads that don't have
the notion of a "process" that encompasses them.

What 2.6.x (and the 2.4.x back-port) does is to just accept the fact that
there is a "thread group leader" (that's what the CLONE_THREAD flag does:
if it is set, you share the thread group leader, if it is clear you create
a new thread group), and that pending signal state really is shared in the
thread group.
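
One way to observe the thread group from user space, sketched under the assumption of Linux with a pthreads library that creates its threads with CLONE_THREAD (as the 2.6-era implementation does). The helper names are hypothetical. Threads in one group report the same getpid() but different kernel thread ids.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids { pid_t pid; pid_t tid; };

/* Kernel thread id of the caller, fetched with the raw system call. */
static pid_t my_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();   /* id of the thread group */
    out->tid = my_tid();   /* id of this particular thread */
    return NULL;
}

/* Returns 1 if a helper thread shares our pid but has its own tid. */
int threads_share_group(void)
{
    struct ids helper = { 0, 0 };
    pthread_t t;
    int rc = pthread_create(&t, NULL, record_ids, &helper);
    assert(rc == 0);
    pthread_join(t, NULL);
    return helper.pid == getpid() && helper.tid != my_tid();
}
```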

The VM side has always been the same: if you share the VM, you share
everything. There literally isn't any thread-local storage from a VM
standpoint, there are only thread-local registers that point to different
areas of memory.

> it's interesting you advocate using registers for the magic storage
> class. it's a great trick when you can do it - plan 9 uses it in the
> kernel on machines with lots of registers - but it's not so great on a
> machine with too few registers, like the x86.

Well, even in the absence of a register, you can always just have a system
call to ask what the pointer should be. That really does work very well,
as long as your programming model is about explicit thread pointers (which
pthreads is) so that you don't have to do it all the time.

And the x86 really is the worst possible case, in this situation, because
it is so register-starved anyway. But happily, it has some (very ugly)
legacy registers that have to be user-visible, and have to be saved and
restored anyway, and that nobody sane really wants to use, so the thread
model can use them.

Making the threaded stuff explicit helps avoid confusion. Now, if somebody
takes an address of a per-thread variable, it is clear that that address
is the address of the variable IN THAT THREAD. You can pass it along, but
when you pass it along to another thread, it doesn't change value - it
still points to the exact same thread-local variable in the _original_
thread.

(Obviously you can pass around offsets to the thread-local space if you
want to, although I can't really see why you'd do it).
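
A sketch of that address-passing behaviour, again assuming GCC or Clang __thread and POSIX threads, with hypothetical names. The creator passes &slot to a helper. The helper has its own slot at a different address, and a write through the passed pointer lands in the creator's copy.

```c
#include <assert.h>
#include <pthread.h>

static __thread int slot;          /* per-thread variable */

struct probe {
    int *owner_slot;               /* address of slot taken in the creating thread */
    int  same_address;             /* did the helper's &slot match it? */
};

static void *poke(void *arg)
{
    struct probe *p = arg;
    slot = 2;                                    /* the helper's own copy */
    p->same_address = (p->owner_slot == &slot);  /* expected to differ */
    *p->owner_slot = 42;                         /* writes the creator's copy */
    return NULL;
}

/* Returns 1 if the passed address kept pointing at the creator's variable. */
int tls_address_stays_with_owner(void)
{
    struct probe p = { &slot, -1 };
    pthread_t t;
    int rc;

    slot = 1;
    rc = pthread_create(&t, NULL, poke, &p);
    assert(rc == 0);
    pthread_join(t, NULL);
    return slot == 42 && p.same_address == 0;
}
```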

And I hope it's clear by now that because the thing is entirely in
registers, that "thread model" is pretty much all in user space. It needs
no kernel support, although some of the interfaces are obviously done
certain ways to make it easier to do (ie the kernel does know about a TLS
pointer at thread setup, even if the kernel doesn't actually _use_ it for
anything, it just sets up the register state as indicated by the parent of
the thread).

Linus

Linus Torvalds

Feb 27, 2004, 12:01:04 PM

On Fri, 27 Feb 2004, Dave Lukes wrote:
>
> I said: user mode using it: show me the example!

Hey. Ask why Plan-9 doesn't use the shared stack either. I bet it's
because tons of programs broke.

[ Hint: Dave just explained the Plan9 thread model to me: the "private
stack" ends up not being used as a stack at all. The real stacks end up
being malloc'ed for each thread (in _shared_ space), and the "private
stack" area only ends up being a TLS segment. ]

I wouldn't know about whatever apps a private stack would break, because I
was never crazy enough to think that a private stack was a good idea.

> i.e. no example.

Hey, _you're_ the one with the crazy idea. As such, the burden of proof is
on you to show that it isn't crazy.

> One other detail: as far as I can see,
> your examples all use shared memory as a cheap substitute
> for message passing: why?

Because message passing is idiotic, when the real hardware just passes
pointers around?

Because you can't put complex data structures in a message without silly
encodings that make performance plummet like a stone?

Methinks you have read a few too many papers about microkernels, without
actually seeing the real world.

Hint: you can't message-pass a hash table that describes 200 megabytes
worth of filesystem names.

Welcome to the real world, Neo.

Stop playing around with those examples your professors showed you. They
had no relevance.

Linus

Fco.J.Ballesteros

Feb 27, 2004, 12:05:46 PM

> Hint: you can't message-pass a hash table that describes 200 megabytes
> worth of filesystem names.
>
> Welcome to the real world, Neo.

You can. You can pass a pointer to it.
If you're in the mood for using locks, you can put a proc
in charge of your hash table, and avoid locks. If you're not,
you can use non-preemptive threads and avoid locks too.
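
For concreteness, a sketch of "pass a pointer to it" between two threads that share an address space, assuming POSIX threads and a pipe as the transport. The struct table type is a tiny hypothetical stand-in for the large hash table. Only the pointer travels through the pipe; the table itself is never copied.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Tiny stand-in for a large table of filesystem names. */
struct table {
    int nentries;
    const char *names[3];
};

struct reader { int fd; int seen; };

static void *consume(void *arg)
{
    struct reader *r = arg;
    struct table *t = NULL;

    /* The "message" is sizeof(pointer) bytes, whatever the table's size. */
    if (read(r->fd, &t, sizeof t) == (ssize_t)sizeof t)
        r->seen = t->nentries;
    return NULL;
}

/* Send a table pointer to a helper thread; return the entry count it saw. */
int send_table_pointer(void)
{
    static struct table big = { 3, { "bin", "lib", "usr" } };
    struct table *msg = &big;
    struct reader r = { -1, -1 };
    int fds[2];
    pthread_t t;
    ssize_t w;
    int rc;

    rc = pipe(fds);
    assert(rc == 0);
    r.fd = fds[0];
    rc = pthread_create(&t, NULL, consume, &r);
    assert(rc == 0);

    w = write(fds[1], &msg, sizeof msg);
    assert(w == (ssize_t)sizeof msg);

    pthread_join(t, NULL);
    close(fds[0]);
    close(fds[1]);
    return r.seen;
}
```

The sketch takes no position on what to call this; it only shows that the receiver ends up working on the sender's memory.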

Linus Torvalds

Feb 27, 2004, 12:22:41 PM

On Fri, 27 Feb 2004, Fco. J. Ballesteros wrote:
>
> > Hint: you can't message-pass a hash table that describes 200 megabytes
> > worth of filesystem names.
> >
> > Welcome to the real world, Neo.
>
> You can. You can pass a pointer to it.

Absolutely. But don't call it "message passing" then, because it damn well
isn't.

Sure, "message passing" is a wonderful idea if you are allowed to redefine
it to say "pass pointers around". In fact, let's go one step further, and
just call a function call "message passing", since what it does is to make
a message out of the arguments, and then let the CPU do a "message pass"
operation to the target.

That proves that "message passing" is as efficient as traditional code,
while still having the obvious advantage of being about "messages", which
we all know are much better than those horrible "function calls".

I can speak the newspeak as well as anybody else.

But when I speak it, I realize when I'm full of shit.

Linus

C H Forsyth

Feb 27, 2004, 12:29:08 PM

>>I said: user mode using it: show me the example!

my impression, and it's only that, is that relatively few user mode
applications use clone-for-threads at all on Linux, as yet. thus,
it's a little unfair, or at least pointless, to ask Torvalds for
specific Linux user mode examples since there aren't many generally.
(i am curious to know what applications there are that do use it,
though.)

examples seem few partly because Unix programs aren't typically
structured as cooperating threads (however they cooperate). that's
partly because the most obviously portable way to do it, pthreads, has
several awful or incomplete implementations. you don't know what
you're going to get (or rather, you know only too well). (i don't
refer here to the pthreads specification, just its implementation in
practice, because that's what is relevant here.) for instance, some in
the past tried to do it all with coroutines and an internal scheduler
that (say) intercepted operations on file descriptors and multiplexed
coroutines using select/poll.

>>One other detail: as far as I can see,
>>your examples all use shared memory as a cheap substitute
>>for message passing: why?

in practice, you often end up sharing something even for message passing
(for instance a Channel or a mailbox if those are your models), unless you
meant reading and writing pipe file descriptors, in which case the sharing
(if it's a pipe) is tucked away inside the kernel but it's still there.

>>how does a thread identify who it is? when we were porting inferno to
>>linux (back in 1996; things have likely changed) we resorted to using
>>unused user-visible MMU bits to store enough state to hold on with our
>>fingernails and claw back to private storage. another option we
>>considered was playing with magic address bits in the sp.

inferno on linux currently conditions the stack and sp as previously discussed,
not least because the place used by the original scheme wasn't any longer visible.
that does require machine-specific code to access the sp, although that's
in a linux-386-dependent include file. fortunately clone accepts the stack
pointer for the new process (although that's machine-specific as to whether
it's top or bottom), so there's no need for assembly-language to bounce
to the new stack (as there was in FreeBSD for a few years, until they extended
the rfork interface).

Linus Torvalds

Feb 27, 2004, 12:39:43 PM

On Fri, 27 Feb 2004, C H Forsyth wrote:
>
> my impression, and it's only that, is that relatively few user mode
> applications use clone-for-threads at all on Linux, as yet.

No, there's tons of them. But they almost always use the pthreads
interfaces. Few applications use "clone()" directly, since there is seldom
any point to going to that low level.

So libc does a mapping (pretty trivial these days thanks to the new signal
interfaces) of pthreads to clone() and other low-level kernel
functionality.

There's a lot more to threads than just creating them, btw. Linux has a
whole "futex" infrastructure for doing fast user-level mutual exclusion
(which in a pthreads world get mapped to pthread_mutex calls etc). Again,
the kernel interfaces are _not_ the pthreads interfaces (since it would be
crazy to call into the kernel just for a simple lock - 99.99% of the time
you can do it with a simple atomic sequence in user space). But they are
there to support _any_ kind of thread synchronization, whether pthreads-
based or anything else.
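
A rough sketch of the idea, not the pthread_mutex implementation: a three-state lock in the style commonly used to introduce futexes, assuming Linux, C11 atomics, and that atomic_int has the same representation as the 32-bit word the futex system call expects. All names are hypothetical. The uncontended lock and unlock are single atomic operations in user space; the kernel is entered only when somebody has to sleep or be woken.

```c
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked, 2 = locked and someone may be waiting */
typedef struct { atomic_int state; } sketch_lock;

static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock_acquire(sketch_lock *l)
{
    int c = 0;

    /* Fast path: 0 -> 1 with one atomic instruction, no system call. */
    if (atomic_compare_exchange_strong(&l->state, &c, 1))
        return;

    /* Slow path: mark the lock contended and sleep until it is released. */
    if (c != 2)
        c = atomic_exchange(&l->state, 2);
    while (c != 0) {
        futex(&l->state, FUTEX_WAIT, 2);   /* sleeps only while state == 2 */
        c = atomic_exchange(&l->state, 2);
    }
}

static void sketch_lock_release(sketch_lock *l)
{
    /* Fast path: state was 1, nobody is waiting, no system call. */
    if (atomic_fetch_sub(&l->state, 1) != 1) {
        atomic_store(&l->state, 0);
        futex(&l->state, FUTEX_WAKE, 1);   /* wake one sleeper */
    }
}

static sketch_lock lock;
static long counter;

static void *add_many(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sketch_lock_acquire(&lock);
        counter++;
        sketch_lock_release(&lock);
    }
    return NULL;
}

/* Four threads each add 100000 to a shared counter under the lock. */
long count_with_four_threads(void)
{
    pthread_t t[4];
    counter = 0;
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&t[i], NULL, add_many, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```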

Linus

Dave Lukes

Feb 27, 2004, 12:53:15 PM

> Hey, _you're_ the one with the crazy idea. As such, the burden of proof is
> on you to show that it isn't crazy.

Firstly,
an old saying: "Everyone is insane but yourself".

Secondly,
I'm not the one who wants to put 200Mb on the stack (see below):
who's crazy, again?

> Because message passing is idiotic, when the real hardware just passes
> pointers around?

Ohh, yea, and pointers scale really well to NUMAs, right?

> Methinks you have read a few too many papers about microkernels, without
> actually seeing the real world.

Methinks you don't know much about me.

> Hint: you can't message-pass a hash table that describes 200 megabytes
> worth of filesystem names.

No shit, Sherlock! Thanks for the enlightenment!
Next question: are you _seriously_ suggesting having 200Mb on a stack?

> Welcome to the real world

I _am_ in the real world.
Check out who we are: we don't sit in labs playing with our pointers:
we build real systems that we sell to real customers for real money.

> , Neo.

Hmmm ...
Someone who fought against the bland uniformity of a regimented world:
I'll tolerate that epithet.

> Stop playing around with those examples your professors showed you.

This attitude is beginning to _seriously_ piss me off:
you know _nothing_ about my knowledge, abilities or opinions.
To be specific: I have _never_ attended a CompSci lecture in my life.

> They
> had no relevance.

... and what _does_ have relevance?

Cheers,
Dave.

Linus Torvalds

Feb 27, 2004, 1:21:52 PM

On Fri, 27 Feb 2004, Dave Lukes wrote:
>
> I'm not the one who wants to put 200Mb on the stack (see below):
> who's crazy, again?

Neither am I.

I'm saying that you CANNOT "message pass" a hash table. You pass its
address, and that has NOTHING to do with message passing. You use a
function call that passes the address to a data structure around.

> > Because message passing is idiotic, when the real hardware just passes
> > pointers around?
>
> Ohh, yea, and pointers scale really well to NUMAs, right?

Hey, when did you last write an operating system that worked on NUMA
machines?

Trust me. Passing pointers around scales a _hell_ of a lot better than
copying data around as messages.

> > Hint: you can't message-pass a hash table that describes 200 megabytes
> > worth of filesystem names.
>
> No shit, Sherlock! Thanks for the enlightenment!
> Next question: are you _seriously_ suggesting having 200Mb on a stack?

No.

I'm suggesting passing the pointer around, and not messing with messages
at all. You're the one who complained about me using pointers:

"your examples all use shared memory as a cheap substitute
for message passing: why?"

and I'm telling you that pointers are NOT a "cheap substitute for message
passing". Pointers are fundamentally MORE POWERFUL than message passing
is, and anybody who calls them a "cheap substitute" is a moron.

That was my point: a pointer can point to hundreds of megabytes of complex
data structures with lots of interdependencies and interesting locking
rules. THAT is the real world. A message it is NOT.

A message is a way to pass data by value. It has its place too, of course,
especially in networks, but comparing it to a pointer is just misguided.

The only people who confuse pointers and messages are the microkernel
people who noticed that real messages are too expensive to use, so they
started calling pointers "messages", and play other semantic games.

Linus

andrey mirtchovski

Feb 27, 2004, 1:40:52 PM

> I can't add much to the technical discussion but I did make a Linus "face"
>
> m

hehe, you should've used this one:

http://pages.infinit.net/rave/FOTO/linus.gif

source:

http://www.robotwisdom.com/linux/timeline.html

the source is worth spending some time with, especially in the pre-'91
sections...

Donald Brownlee

Feb 27, 2004, 2:09:45 PM

Is "completion" in Linux like VMS' fork routines?

Linus Torvalds wrote:
>
> On Fri, 27 Feb 2004 dbai...@ameritech.net wrote:
>
>>>The rationale is that it's incredibly more sane, and it's the logical
>>>place to put something that (a) needs to be allocated thread-specific and
>>>(b) doesn't need any special allocator.
>>
>>You've just proven my point. Thread specific. Being Thread specific, it
>>is data that is reserved to the scope of a single thread. Nothing more.
>>If you want more scope there are many more usages of memory that
>>are better utilized.
>
>
> NO!
>
> A "per-thread allocation" does NOT MEAN that other threads should not
> access it. It means that the ALLOCATION is thread-private, not that the
> USE is thread-private.
>
> per-thread allocations are quite common, and critical. If you have global
> state, you need to protect them with locks, and you need to have nasty
> global allocators.
>
> One common per-thread allocation is the "I want to wait for an event". The
> data is clearly for that one thread, and using a global allocator would be
> WRONG. Not to mention inefficient.
>
> But once the data has been allocated, other threads are what will actually
> use the data to wake the original thread up. So while it wants a
> per-thread allocator, it simply wouldn't _work_ if other threads couldn't
> access the data.
>
> That's what a "completion structure" is in the kernel. It's all the data
> necessary to let a thread wait for something to complete. Another thread
> will do "complete(xxx)", where "xxx" is that per-thread data.
>
> You don't like it. Fine. I don't care. You're myopic, and have an agenda
> to push, so you want to tell others that "you can't do that, it's against
> my agenda".
>
> While I'm telling you that people _do_ do that, and that it makes sense,
> and if you didn't have blinders on, you'd see that.
>
> Linus
>

boyd, rounin

Feb 27, 2004, 7:23:52 PM

> Because message passing is idiotic, when the real hardware just passes
> pointers around?

pass pointers around? err, dump core or take a kernel mode protection
fault.

boyd, rounin

Feb 27, 2004, 7:25:43 PM

> But when I speak it, I realize when I'm full of shit.

you said it.

boyd, rounin

Feb 27, 2004, 7:45:52 PM

> I'm saying that you CANNOT "message pass" a hash table.

you can. i've done it.

Martin C.Atkins

Feb 28, 2004, 12:24:55 AM

If I understand right, I could summarise (one of) Linus's arguments,
as follows:

The cost of sharing some of VM, but not all (in terms of performance, due
to less-visible things like TLB flushes, etc), out-weighs having to
write a little bit of assembler wrapping the clone system call (which
only has to be got right once, however horrible it might be).

I haven't seen any argument rebutting this in any way. If it is true,
then surely a performance hit on all (forked?) processes, is more
important than having to shim a system call? If it is not true, then it
would be nice to know!

Tell me where I'm wrong, please?

Martin

--
Martin C. Atkins mar...@parvat.com
Parvat Infotech Private Limited http://www.parvat.com{/,/martin}

Bruce Ellis

Feb 28, 2004, 4:38:56 AM

i had trouble believing that the screeds of banter were from linus.
maybe he has a ghost writer. when he did the "i don't write
user programs with threads" and threw in some kernel examples,
with obligatory assembly language results - well i thought
what are these guys discussing? gee, i just write in limbo.
threads are fun, linus, when you don't have to do all that crap
to manage them. i just use "spawn", the only language support
for threads. the rest is easy - i'm happy.

correct or challenge on any presumption.

brucee
----- Original Message -----
From: "Rob Pike" <r...@mightycheese.com>
To: <9f...@cse.psu.edu>
Sent: Friday, February 27, 2004 4:28 PM
Subject: Re: [9fans] Threads: Sewing badges of honor onto a Kernel

> in case that wasn't clear enough, i have no idea what linus is talking
> about when he says we had 'overhead' that made it 'really stupid'
> 'in practice'.
>
> -rob

Nigel Roles

Feb 28, 2004, 4:45:48 AM

9fans...@cse.psu.edu wrote:
> If I understand right, I could summarise (one of) Linus's arguments,
> as follows:
>
> The cost of sharing some of VM, but not all (in terms of performance,
> due to less-visible things like TLB flushes, etc), out-weighs having
> to write a little bit of assembler wrapping the clone system call
> (which only has to be got right once, however horrible it might be).
>
> I haven't seen any argument rebutting this in any way. If it is true,
> then surely a performance hit on all (forked?) processes, is more
> important than having to shim a system call? If it is not true, then
> it would be nice to know!
>
> Tell me where I'm wrong, please?
>

The performance argument may well still be regarded by Linus as
stronger, but there are other differences. One is that the stack
used by the clone, being allocated on the heap, is fixed in size,
and unprotected from overflow.

Bruce Ellis

Feb 28, 2004, 4:58:49 AM

it was much more fun porting inferno to linux a few years ago.
no privilege of private data after a clone so the only way we
could find "up" was to grab the TSS, which linux actually used,
and hash it to get a up. good stuff.

brucee
----- Original Message -----
From: "Charles Forsyth" <for...@terzarima.net>
To: <9f...@cse.psu.edu>

Sent: Friday, February 27, 2004 10:00 AM
Subject: Re: [9fans] Re: Threads: Sewing badges of honor onto a Kernel

> >>Just out of curiosity, how do I get to the private space without
> >>a lock? It's been a long time since I studied these things and lots
> >>of water has flown under bridges, so I could be missing something.
>
> static __inline Proc *getup(void) {
> Proc *p;
> __asm__( "movl %%esp, %%eax\n\t"
> : "=a" (p)
> );
> return *(Proc **)((unsigned long)p & ~(KSTACK - 1));
> };
etc ...

Charles Forsyth

Feb 28, 2004, 5:11:50 AM

i didn't say this at the time, but now that you mention inferno, i will.
if someone is seriously worried about TLB flushing on context switches,
likes setting up the MMU pretty much once-for-all,
thinks all threads should share all the address space, and absolutely
disdains user mode, he should be writing in limbo for
native inferno, where all of that is the usual state,
and indeed there is no other. we're all in it together.

David Tolpin

Feb 28, 2004, 5:14:52 AM

Novice question: do I need Inferno to program in limbo?
That is, I know I can if I have (I did on FreeBSD-hosted one).
But there is no limbo without Inferno under Plan9 too, right?

David

Geoff Collyer

Feb 28, 2004, 5:23:54 AM

I believe that you still need Inferno. There's been talk of
freestanding limbo but I don't believe it exists yet.

Bruce Ellis

Feb 28, 2004, 5:28:44 AM

yeah, it made no sense. both the froggie and the ps2 ports
have very static page tables. and the tlb is always right
after a first access. private stacks? every thread.

brucee
----- Original Message -----
From: "Charles Forsyth" <for...@terzarima.net>
To: <9f...@cse.psu.edu>

vi...@parcelfarce.linux.theplanet.co.uk

Feb 28, 2004, 6:09:45 AM

On Sat, Feb 28, 2004 at 09:44:24AM -0000, Nigel Roles wrote:

> The performance argument may well still be regarded by Linus as
> stronger, but there are other differences. One is that the stack
> used by the clone, being allocated on the heap, is fixed in size,
> and unprotected from overflow.

clone() uses whatever you pass to it; man 2 mmap for further inspiration...
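
A minimal sketch of that point, assuming Linux with glibc: the caller chooses the child's stack, so it can come from mmap() rather than the heap, and an inaccessible page below it turns an overflow into a fault instead of silent corruption. STACK_SIZE, shared_value, child_fn and run_clone_on_mmap_stack are names and values invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)    /* arbitrary size chosen for the example */

static volatile int shared_value; /* visible to both sides because of CLONE_VM */

static int child_fn(void *arg) {
    shared_value = *(int *)arg;   /* write through the shared address space */
    return 0;
}

/* Run child_fn on an mmap()ed stack with a guard page underneath.
 * Returns the value the child stored, or -1 on error. */
static int run_clone_on_mmap_stack(int value) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = STACK_SIZE + (size_t)page;
    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (map == MAP_FAILED)
        return -1;
    /* The lowest page becomes the guard: running off the stack faults here. */
    if (mprotect(map, (size_t)page, PROT_NONE) != 0) {
        munmap(map, len);
        return -1;
    }
    /* Stacks grow down on the usual Linux targets, so pass the top address. */
    pid_t pid = clone(child_fn, map + len, CLONE_VM | SIGCHLD, &value);
    int ok = pid != -1 && waitpid(pid, NULL, 0) == pid;
    munmap(map, len);
    return ok ? shared_value : -1;
}
```

Calling run_clone_on_mmap_stack(42) should hand back 42 once the child has run; the stack here is still fixed in size, only protected.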

David Presotto

Feb 28, 2004, 8:42:33 AM

The overhead on fork/exec of having to copy the stack descriptors
into the forked process is pretty minimal compared to the exec.

If you are just forking without execing then getting a new
stack is what you want.

If you are forking and sharing memory twixt the two forked processes,
we do indeed take a hit every time the processes context switch
especially on the x86 where we lose the previous tlb context as
soon as we putcr3(). If the two processes shared the TLB state
(pid on mips, page table on x86) we'd be able to avoid that.
Not having any private segments would allow us to do it.

If you are serious about caring, throw another bit into rfork
that says dump the stack segment. You'll also have to find
someplace in all architectures to hide the pointer to the
thread private memory. Add to the kernel support for sharing
the TLB state. Then measure the two and tell us how much you
saved in various programs. If it's non-negligible, you'll
have a reasoned argument instead of this endless whining back
and forth.
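
The "measure the two" step could start from something like the sketch below, which is a Linux stand-in and not the rfork experiment described above: it times a one-byte ping-pong over pipes between a parent and a clone()d child, with and without a shared address space. The names echo_args, echo_child and pingpong_ns are hypothetical. The sketch does not pin both sides to one CPU, which a real measurement of address-space switch cost would need, so the numbers it gives are only a starting point.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct echo_args { int in_fd, out_fd, rounds; };

/* Child side: echo each byte back to the parent, a fixed number of times. */
static int echo_child(void *p) {
    struct echo_args *a = p;
    char c;
    for (int i = 0; i < a->rounds; i++)
        if (read(a->in_fd, &c, 1) != 1 || write(a->out_fd, &c, 1) != 1)
            return 1;
    return 0;
}

/* Mean nanoseconds per round trip, or -1.0 on error.
 * share_vm != 0 adds CLONE_VM so parent and child share one address space. */
static double pingpong_ns(int rounds, int share_vm) {
    const size_t stack_len = 64 * 1024;   /* arbitrary child stack size */
    int down[2], up[2];
    double result = -1.0;
    assert(rounds > 0);
    if (pipe(down) != 0)
        return -1.0;
    if (pipe(up) != 0) {
        close(down[0]); close(down[1]);
        return -1.0;
    }
    char *stack = malloc(stack_len);
    if (stack != NULL) {
        struct echo_args args = { down[0], up[1], rounds };
        int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);
        pid_t pid = clone(echo_child, stack + stack_len, flags, &args);
        if (pid != -1) {
            struct timespec t0, t1;
            char c = 'x';
            int ok = 1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; ok && i < rounds; i++)
                ok = write(down[1], &c, 1) == 1 && read(up[0], &c, 1) == 1;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            waitpid(pid, NULL, 0);
            if (ok)
                result = ((t1.tv_sec - t0.tv_sec) * 1e9
                          + (t1.tv_nsec - t0.tv_nsec)) / rounds;
        }
        free(stack);
    }
    close(down[0]); close(down[1]);
    close(up[0]); close(up[1]);
    return result;
}
```

Comparing pingpong_ns(n, 0) against pingpong_ns(n, 1) over many runs, on the hardware and workloads of interest, is the kind of evidence being asked for.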

Nigel Roles

Feb 28, 2004, 8:41:46 AM

To paraphrase Captain Mainwaring, "I wondered if you'd spot that".
Whilst writing the email, it occurred to me that you could
probably pull stunts with mmap/mremap and catching signals,
and get what is wanted.

We are now composing several system calls in some moderately
clever ways to get a behaviour which, whilst not equivalent to
rfork(), is as flexible. It's not exactly obvious though, is it?
I think I would expect to see a helper function.

Nice to see you on the list again Al.

vi...@parcelfarce.linux.theplanet.co.uk

Feb 28, 2004, 9:15:51 AM

On Sat, Feb 28, 2004 at 01:40:48PM -0000, Nigel Roles wrote:
> Whilst writing the email, it occurred to me that you could
> probably pull stunts with mmap/mremap and catching signals,
> and get what is wanted.

mmap(2)
MAP_GROWSDOWN
Used for stacks. Indicates to the kernel VM system that the
mapping should extend downwards in memory.

See mm/mmap.c:expand_stack() for details.

So no signals involved and that's precisely what is used for stack anyway -
anonymous VM_GROWSDOWN mapping.
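
For concreteness, a hedged sketch of requesting such a mapping from user space on Linux. The helper name growsdown_stack_top is made up for this example, and how far the region can actually grow depends on what else is mapped below it and on the kernel in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous, private region flagged MAP_GROWSDOWN.
 * Returns the address just past its top, which is what a downward-growing
 * stack would start from, or NULL if the mapping fails. */
static void *growsdown_stack_top(size_t initial_len) {
    assert(initial_len > 0);
    void *low = mmap(NULL, initial_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    if (low == MAP_FAILED)
        return NULL;
    return (char *)low + initial_len;
}
```

The returned address is the value one would pass as the child stack argument to clone().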

Note: from my reading of the calc_vm_flag_bits() there is a problem on
parisc, what with the stack growing up - no way to get VM_GROWSUP from
mmap() and that's what would be needed to act as stacks there. For that
matter, parisc do_page_fault() looks fishy in that area, but I'm not
familiar enough with that platform.

boyd, rounin

Feb 28, 2004, 9:31:12 AM

> correct or challenge on any presumption.

any number between 1 and 9 is a pint!!

boyd, rounin

Feb 28, 2004, 9:34:42 AM

> The performance argument may well still be regarded by Linus as
> stronger, but there are other differences.

get real, we're no longer using 1 MIP VAXes, where TLB flushes
etc were a real problem. 128 users on an 11/780, anyone?

boyd, rounin

Feb 28, 2004, 9:38:40 AM

> To paraphrase Captain Mainwaring, "I wondered if you'd spot that".

just "don't panic, don't panic"!!