
Thread Stacks


Andy Glew

Nov 28, 2005, 3:22:15 PM
> On platforms where Fortran is commonly used, the stack limit isn't
> that low. Low stack limits are e-vile.
>
> -- greg

Hey, let's make a comp.arch topic!

How would you allocate stacks if you wanted, say, 1000 threads.
A megabyte apiece? A gigabyte apiece?

Contiguous stacks are a pain.

See stackless python etc.

Greg Lindahl

Nov 28, 2005, 3:55:17 PM
In article <peypk6esc...@pxpl2829.amr.corp.intel.com>,
Andy Glew <andy...@intel.com> wrote:

>How would you allocate stacks if you wanted, say, 1000 threads.

Doctor, it hurts when I do *this*.

Seriously, stick with 64-bit OSes, they cure both this issue and the
issue of mmapped segments with satisfyingly brute force.

-- greg

Joe Seigh

Nov 28, 2005, 4:14:52 PM

IBM's standard linkage didn't use a contiguous stack. You could
take a contiguous stack and dummy it up to look like standard
linkage. I used to have a scheme where I used chunks of contiguous
stack. When one stack segment ran out, I allocated a new stack segment,
switched to that, and inserted a dummy return to deallocate the stack
segment on return. You can pick the granularity of stack segments
to get the trade-off you want on overhead vs. memory usage.
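A toy version of the chunked scheme, using an explicit data stack rather than the machine stack (real machine-stack switching needs the dummy-return trick and some assembly); the names and the 64-slot granularity are made up:

```c
#include <stdlib.h>

/* Segmented stack: fixed-size chunks linked together. SEG_WORDS is
   the granularity knob: bigger chunks mean less chaining overhead but
   more wasted memory per thread. Error handling omitted. */
#define SEG_WORDS 64

typedef struct seg {
    struct seg *prev;           /* previously active segment */
    size_t used;                /* slots occupied in this segment */
    long slots[SEG_WORDS];
} seg_t;

typedef struct { seg_t *top; } seg_stack_t;

static void seg_push(seg_stack_t *s, long v) {
    if (!s->top || s->top->used == SEG_WORDS) {   /* segment full: chain a new one */
        seg_t *n = malloc(sizeof *n);
        n->prev = s->top;
        n->used = 0;
        s->top = n;
    }
    s->top->slots[s->top->used++] = v;
}

static long seg_pop(seg_stack_t *s) {            /* precondition: stack not empty */
    long v = s->top->slots[--s->top->used];
    if (s->top->used == 0) {                     /* segment drained: the "dummy return" */
        seg_t *dead = s->top;
        s->top = dead->prev;
        free(dead);
    }
    return v;
}
```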

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

George Neuner

Nov 29, 2005, 5:08:45 AM
On 28 Nov 2005 22:22:15 +0200, Andy Glew <andy...@intel.com> wrote:

>> On platforms where Fortran is commonly used, the stack limit isn't
>> that low. Low stack limits are e-vile.
>>
>> -- greg
>
>Hey, let's make a comp.arch topic!
>
>How would you allocate stacks if you wanted, say, 1000 threads.
>A megabyte apiece? A gigabyte apiece?

On 64-bit systems you just allocate it.

On 32-bit Intel or AMD you could (ab)use MMU segments ... the logical
address space is 64TB IIRC. Of course the physical address space is
only 4GB so you would also have to maintain alternate page table sets
and swap them as necessary.

George
--
for email reply remove "/" from address

Ian Rogers

Nov 29, 2005, 5:33:11 AM
George Neuner wrote:
> On 32-bit Intel or AMD you could (ab)use MMU segments ... the logical
> address space is 64TB IIRC. Of course the physical address space is
> only 4GB so you would also have to maintain alternate page table sets
> and swap them as necessary.

The logical space is only 4GB. The base address of the selector gets
added to the address and truncated to 4GB to form the logical address
(which is a shame, as it'd be nice sometimes to address more than 4GB of
RAM in a single process). The logical address then gets mapped through
the page tables onto a 36-bit physical memory address. Hence 64TB of
physical memory addressable, but only 4GB addressable in a single process.

Ian Rogers

Peter Dickerson

Nov 29, 2005, 6:13:55 AM
"Ian Rogers" <ian.r...@manchester.ac.uk> wrote in message
news:dmhapg$6iu$1...@wapping.cs.man.ac.uk...

That's 64GB, not TB. 64TB needs a 46-bit address, which is more like the
physical address range of a 64-bit processor.

Peter


Mikael Pettersson

Nov 29, 2005, 5:21:28 AM
>> On platforms where Fortran is commonly used, the stack limit isn't
>> that low. Low stack limits are e-vile.
>>
>> -- greg
>
>Hey, let's make a comp.arch topic!
>
>How would you allocate stacks if you wanted, say, 1000 threads.
>A megabyte apiece? A gigabyte apiece?

1000 threads is peanuts. Erlang does 10000s of threads on 32-bitters.
We start each thread with a small stack and grow/relocate it on demand
via stack overflow checks in function prologues.

Erlang doesn't have intra-stack data pointers, so stack relocation works.
For languages like C/C++ you should probably just use segmented/linked stacks.
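A rough sketch of the grow/relocate idea in C, using an explicit stack (all names invented). The prologue check is `gstack_reserve`; relocation is safe only because, as in Erlang, no raw pointer into the stack survives across the check:

```c
#include <stdlib.h>

/* Start each "thread" with a tiny stack and grow it on demand. The
   stack may move when it grows, so callers must index through `base`
   rather than hold pointers into it. Error handling omitted. */
typedef struct {
    long  *base;   /* may move on growth */
    size_t size;   /* capacity in slots  */
    size_t sp;     /* next free slot     */
} gstack_t;

static void gstack_init(gstack_t *s, size_t slots) {
    s->base = malloc(slots * sizeof *s->base);
    s->size = slots;
    s->sp   = 0;
}

/* The "function prologue" check: ensure `need` free slots, doubling
   the stack (which may relocate it) until there is room. */
static void gstack_reserve(gstack_t *s, size_t need) {
    if (s->sp + need > s->size) {
        while (s->sp + need > s->size)
            s->size *= 2;
        s->base = realloc(s->base, s->size * sizeof *s->base);
    }
}

static void gstack_push(gstack_t *s, long v) {
    gstack_reserve(s, 1);
    s->base[s->sp++] = v;
}
```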
--
Mikael Pettersson (mi...@csd.uu.se)
Computing Science Department, Uppsala University

Bernd Paysan

Nov 29, 2005, 10:41:36 AM
Andy Glew wrote:

>> On platforms where Fortran is commonly used, the stack limit isn't
>> that low. Low stack limits are e-vile.
>>
>> -- greg
>
> Hey, let's make a comp.arch topic!
>
> How would you allocate stacks if you wanted, say, 1000 threads.
> A megabyte apiece? A gigabyte apiece?

My rant about stack usage in typical C compilers (e.g. GCC): This is a waste
of memory!

Let's take Gforth as an example: GCC allocates 2904 bytes for the engine()
function on x86_64. Yes, if I count them through, I get that many local
variables to assign. However: only half a dozen variables *live* for the
entire life of this function; all the others are temporaries, most of them
small! If GCC applied a stack-like allocation rule for all variables
with limited scope, it would get away with 10 or 20 times less space. And
the register allocation algorithm would probably have a much easier time
figuring out which registers to choose. Now we have to assign the most
important ones by hand.

This is an extreme example, but C functions often have locally scoped
temporary variables, just fewer of them. It all adds up to more stack use
than necessary, to wasted memory.
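One way to check what a given compiler actually does is to compare the addresses of two arrays whose scopes never overlap; whether they share a slot is entirely compiler- and flag-dependent, so no particular answer is guaranteed:

```c
#include <stdint.h>

/* Returns 1 if the compiler gave these two non-overlapping scoped
   arrays the same stack slot, 0 if it allocated them separately.
   `volatile` keeps the arrays from being optimized away entirely. */
static int scopes_share_slot(void) {
    uintptr_t p1, p2;
    {
        volatile char a[64];
        a[0] = 1;
        p1 = (uintptr_t)a;
    }
    {
        volatile char b[64];
        b[0] = 2;
        p2 = (uintptr_t)b;
    }
    return p1 == p2;
}
```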

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Eric P.

Nov 29, 2005, 9:36:12 AM

Segmentation just creates a 48-bit address with screwy non-linear
behavior. If you are going to have a 48-bit address then just make
it flat. It costs the same in gates and memory but behaves better.

Also, this would make it impossible for one thread to access objects
allocated on another thread's stack, thereby greatly complicating
interface considerations. With the proper synchronization,
threads should be able to pass stack data between them.

Eric


Eric P.

Nov 29, 2005, 10:40:24 AM

I disagree.
The potential amount of space a thread might use is a property
only of its specified starting routine.

For threads to operate reliably you need to ensure that their
worst case allocation is available (reserved but not allocated).
Unless you can prove that all your threads cannot use the
worst case at once, you must assume that it might happen.
Otherwise the application will randomly fail.

That means there is no advantage to dynamic stack chaining as
you must ensure that the heap contains sufficient space anyway,
and chaining becomes just a more expensive way to manage stacks.

I conclude therefore that the correct solution is to specify the
reserved linear stack size for each individual thread at creation,
which means it is an argument to the ThreadCreate function.
Note that this is NOT how Win32 works. As an optimization one could
maintain a pool of previously used stack ranges to save continuously
reserving and releasing the same sized memory sections.

As the stack grows down, each thread should track its low water mark,
and bounds check and commit new space efficiently, as a single system
request, only when below the mark. This is also not how Win32 works.
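For what it's worth, POSIX threads already work roughly the way proposed above: the creator picks each thread's stack reservation through an attribute. A small sketch (sizes arbitrary, helper name invented):

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial worker: record that it ran. */
static void *worker(void *arg) {
    *(int *)arg = 42;
    return NULL;
}

/* Spawn a thread with a caller-chosen stack reservation and wait for
   it. Returns 0 on success. stack_bytes must be at least
   PTHREAD_STACK_MIN (and on some systems page-aligned). */
static int spawn_with_stack(size_t stack_bytes, int *out) {
    pthread_attr_t attr;
    pthread_t tid;
    if (pthread_attr_init(&attr) != 0) return -1;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) return -1;
    if (pthread_create(&tid, &attr, worker, out) != 0) return -1;
    pthread_attr_destroy(&attr);
    return pthread_join(tid, NULL);
}
```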

Eric

dg...@barnowl.research.intel-research.net

Nov 29, 2005, 11:52:03 AM
Bernd Paysan <bernd....@gmx.de> writes:

> Andy Glew wrote:
>
> >> On platforms where Fortran is commonly used, the stack limit isn't
> >> that low. Low stack limits are e-vile.
> >>
> >> -- greg
> >
> > Hey, let's make a comp.arch topic!
> >
> > How would you allocate stacks if you wanted, say, 1000 threads.
> > A megabyte apiece? A gigabyte apiece?
>
> My rant about stack usage in typical C compilers (e.g. GCC): This is a waste
> of memory!
>
> Let's take Gforth as example: GCC allocates 2904 bytes for the engine()
> function on x86_64. Yes, if I count it through, I get that many local
> variables to assign. However: Only half a dozen variables *live* for the
> entire life of this function, all the others are temporaries, most of them
> small! If GCC would apply a stack-like allocation rule for all variables
> with limited scope, it would get away with 10 or 20 times less space. And
> probably the register allocation algorithm would have a much easier time to
> figure out which registers to choose. Now, we have to assign the most
> important by hand.

The strange thing is that it isn't very hard to reuse a graph-colouring
register allocator to colour (allocate) local variables efficiently (been
there, done that, etc)... (though, IIRC, it wouldn't really help register
allocation, as the obvious approach is to apply it afterwards, to spilled
variables)

--
David Gay
dg...@acm.org

Dave Hansen

Nov 29, 2005, 12:06:31 PM
On Tue, 29 Nov 2005 10:40:24 -0500 in comp.arch, "Eric P."
<eric_p...@sympaticoREMOVE.ca> wrote:

[...]


>I conclude therefore that the correct solution is to specify the
>reserved linear stack size for each individual thread at create
>and that means it is as an argument to the ThreadCreate function.
>Note that this is NOT how Win32 works. As an optimization one could

What, then, is the purpose of the dwStackSize parameter to the Win32
function CreateThread?

>maintain a pool of previously used stack ranges to save continuously
>reserving and releasing the same sized memory sections.
>
>As the stack grows down, each thread should track its low water mark,
>and bounds check and commit new space efficiently, as a single system
>request, only when below the mark. This is also not how Win32 works.

Windoze commits only a page at a time (by default) to a thread's stack
as each page boundary is exceeded. How is this different from what
you describe?

I'm not saying you are wrong. I'm trying to understand how your
solution differs from Win32.

Regards,
-=Dave

--
Change is inevitable, progress is not.

Ken Hagan

Nov 29, 2005, 12:25:31 PM
Eric P. wrote:
>
> I disagree.
> The potential amount of space a thread might use is a property
> only of its specified starting routine.

...and its input, at least for some obvious implementations of (say)
a recursive descent parser.

George Neuner

Nov 29, 2005, 12:49:58 PM

You're confusing the *logical* address space - 16000 4GB segments -
with the *physical* address space.

The extended 64GB address space is only available on P6 or later
processors - earlier ones were limited to 4GB. And the extended space
is banked and not all accessible simultaneously.

George Neuner

Nov 29, 2005, 12:52:10 PM

No. The *physical* space is 4GB - the P6 and later added support for
banked access to up to 64GB.

The *logical* address space is ~16000 4GB segments [I don't remember
the exact number - some of the selector values are illegal].

Eric P.

Nov 29, 2005, 2:44:51 PM
Dave Hansen wrote:
>
> On Tue, 29 Nov 2005 10:40:24 -0500 in comp.arch, "Eric P."
> <eric_p...@sympaticoREMOVE.ca> wrote:
>
> [...]
> >I conclude therefore that the correct solution is to specify the
> >reserved linear stack size for each individual thread at create
> >and that means it is as an argument to the ThreadCreate function.
> >Note that this is NOT how Win32 works. As an optimization one could
>
> What, then, is the purpose of the dwStackSize parameter to the Win32
> function CreateThread?

It sets the Commit size. I have no idea why MS thinks I would want
to, since the commit pages are automatically expanded.

> >maintain a pool of previously used stack ranges to save continuously
> >reserving and releasing the same sized memory sections.
> >
> >As the stack grows down, each thread should track its low water mark,
> >and bounds check and commit new space efficiently, as a single system
> >request, only when below the mark. This is also not how Win32 works.
>
> Windoze commits only a page at a time (by default) to a thread's stack
> as each page boundary is exceeded. How is this different from what
> you describe?
>
> I'm not saying you are wrong. I'm trying to understand how your
> solution differs from Win32.

Currently, for every allocation > 4 KB or by alloca, it touches every
page one by one. That not only wastes time in a loop touching pages
over and over; as far as I can tell, it also extends the stack only one
page at a time. Dumb. Dumb. Dumb.

To do this optimally is simple.
It requires 3 TEB values: Stack Top, Bottom and Low Water Mark.
Top and Bottom cover the reserved range.
Mark is the low water commit point, between top and bottom.

When a large block is required, check against the Mark.
If NewSP >= Mark then it has already been probed so nothing to do.
If NewSP >= Bottom, touch ONE reserved page and WNT should extend
the committed stack up to that point, and update low water mark.
If NewSP < Bottom, throw exception.

The vast majority of checks will be picked off by the low water test.
To extend the commit stack an arbitrary amount requires one page fault.
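The three-value check might look like this in C, with addresses modeled as plain integers and the page-touch reduced to a comment (the names are made up, not real TEB fields):

```c
/* Probe bookkeeping for a stack that grows down through a reserved
   address range. All values are addresses within that range. */
typedef struct {
    unsigned long top;     /* highest address of the reserved range */
    unsigned long bottom;  /* lowest address of the reserved range  */
    unsigned long mark;    /* low-water commit point                */
} teb_stack_t;

enum probe_result { PROBE_OK, PROBE_EXTENDED, PROBE_OVERFLOW };

/* Called when a large frame or alloca would move SP to new_sp. */
static enum probe_result stack_probe(teb_stack_t *t, unsigned long new_sp) {
    if (new_sp >= t->mark)
        return PROBE_OK;           /* already probed: the common fast path */
    if (new_sp >= t->bottom) {
        /* A real implementation would touch ONE reserved page here so
           the OS extends the committed region down to new_sp in a
           single fault, then record the new low-water mark. */
        t->mark = new_sp;
        return PROBE_EXTENDED;
    }
    return PROBE_OVERFLOW;         /* would run off the reserved range */
}
```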

Eric


Chris Gray

Nov 29, 2005, 9:36:56 PM
Bernd Paysan <bernd....@gmx.de> writes:

> Let's take Gforth as example: GCC allocates 2904 bytes for the engine()
> function on x86_64. Yes, if I count it through, I get that many local
> variables to assign. However: Only half a dozen variables *live* for the
> entire life of this function, all the others are temporaries, most of them
> small! If GCC would apply a stack-like allocation rule for all variables
> with limited scope, it would get away with 10 or 20 times less space. And
> probably the register allocation algorithm would have a much easier time to
> figure out which registers to choose. Now, we have to assign the most
> important by hand.
>
> This is an extreme example, but C functions often have locally scoped
> temporary variables, just less. All this adds up to more stack use than
> necessary, to waste of memory.
>
> --
> Bernd Paysan
> "If you want it done right, you have to do it yourself"
> http://www.jwdt.com/~paysan/

Put the variables in inner scopes, and gcc will re-use registers at least.
I don't have that many in my byte-code machine so I haven't noticed if it
re-uses stack space or not. It definitely re-uses processor registers. E.g.

    CASE(bc_usub) {
        z_word_t ul1, ul2;

        ul2 = *sp++;
        ul1 = *sp;
        if (ul2 > ul1) {
            ex->ex_PC = pc;
            runError(ex, "run-time uint sub underflow");
            return;
        }
        *sp = ul1 - ul2;
        BREAK;
    }

(CASE and BREAK are macros that will do switch statement stuff or gcc's
computed goto stuff, depending on compile-time flags.)

--
Experience should guide us, not rule us.

Chris Gray c...@ami-cg.GraySage.COM
http://www.Nalug.ORG/ (Lego)
http://www.GraySage.COM/cg/ (Other)

Peter Dickerson

Nov 30, 2005, 2:34:11 AM
"George Neuner" <gneuner2/@comcast.net> wrote in message
news:614po1hhnliu3bjvk...@4ax.com...

OK, I wondered where you got 46 bits from. Presumably you're allowing
traps to change the physical memory map every time a selector is loaded.
But in that case why does the selector have to be ring 3 (say)? Such an
OS could use the ring number as part of the logical address - of course
the actual selector value loaded would be different, so you can't save it
and reload it safely... In this case loading a selector into a segment
register can be viewed as an OS call to change the memory bank-switching
logic.

Peter


Bernd Paysan

Nov 30, 2005, 5:35:55 AM
Chris Gray wrote:
> Put the variables in inner scopes, and gcc will re-use registers at least.

We do that, and GCC does re-use registers. It doesn't reuse the spilled
stuff.

Ian Rogers

Nov 30, 2005, 5:57:22 AM
George Neuner wrote:
> You're confusing the *logical* address space - 16000 4GB segments -
> with the *physical* address space.

You're right, I was confusing things (GB not TB), but I think my point
has been lost. A logical address is a segment selector plus a 32-bit
offset. The segment selector chooses a segment descriptor, which has a
base address that gets added to the offset. This combination forms the
linear address, which can only be 32 bits long. The linear address gets
mapped through the page tables to a potentially 36-bit physical address.
What's not clear when you say 16000 4GB segments is that these must all
be within the same 4GB! You can't, say, have DS set up with a base
address of 0 and FS set up with a base address of 4GB, as the linear
address gets truncated to 32 bits.
There are occasions where you would like to do this trick and it's
annoying that you can't, e.g.:
A debugger or emulator could be in the same address space as an
application but not visible within its 4GB window.
A user space kernel (such as user mode linux) could have a different
application mapped to each of the 16000 segments, and then just copy
data out of them using a segment over-ride.

Regards,

Ian Rogers

Message has been deleted

Bernd Paysan

Nov 30, 2005, 11:35:07 AM
Andi Kleen wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>
>> Chris Gray wrote:
>> > Put the variables in inner scopes, and gcc will re-use registers at
>> > least.
>>
>> We do that, and GCC does re-use registers. It doesn't reuse the spilled
>> stuff.
>

> 4.0 got some changes in that area. Did you try it? It should work now.

Yes, with GCC 4.0.2 (as in SuSE 10.0), stack allocation is reduced to ~900
bytes (one third).

Eric P.

Nov 30, 2005, 12:22:59 PM

Sure, but that still needs an upper nesting limit or it would allow
stack overflows, given sufficiently complex external input.

For example, an RDB server with a thread managing each connection,
and a recursive descent SQL parser in each. You can't allow a
complex SQL statement to cause a stack overflow, so the parser must
check its stack depth and abort.

(I would not design such a database server that way though.
I would use async IO for disk and network, work queues of callbacks
using state machine logic, and a small pool of worker threads.
So the 1000 thread stacks problem does not occur.)
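The depth check might look like this; the grammar (nested parentheses) and the limit are invented for illustration:

```c
/* A recursive-descent parser that counts its own nesting and refuses
   input that would recurse too deeply, instead of overflowing the
   machine stack on hostile input. */
#define MAX_DEPTH 1000

/* Parses nested parentheses like "((()))". Returns the nesting depth,
   or -1 on malformed input or when the depth limit is exceeded.
   Advances *p past what it consumes. */
static int parse_parens(const char **p, int depth) {
    if (depth > MAX_DEPTH)
        return -1;                   /* abort rather than overflow */
    if (**p != '(')
        return 0;                    /* empty nesting */
    (*p)++;                          /* consume '(' */
    int inner = parse_parens(p, depth + 1);
    if (inner < 0 || **p != ')')
        return -1;                   /* propagate error / missing ')' */
    (*p)++;                          /* consume ')' */
    return inner + 1;
}
```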

Eric

Jeremy Linton

Nov 30, 2005, 2:41:54 PM

Eric P. wrote:

> Dave Hansen wrote:
>
>>On Tue, 29 Nov 2005 10:40:24 -0500 in comp.arch, "Eric P."
>><eric_p...@sympaticoREMOVE.ca> wrote:
>>>I conclude therefore that the correct solution is to specify the
>>>reserved linear stack size for each individual thread at create
>>>and that means it is as an argument to the ThreadCreate function.
>>>Note that this is NOT how Win32 works. As an optimization one could
>>
>>What, then, is the purpose of the dwStackSize parameter to the Win32
>>function CreateThread?
>
>
> It sets the Commit size. I have no idea why MS thinks I would want
> to since the commit pages are automatically expanded.

For performance, of course... That way you don't have to worry about
taking individual page faults for each 4k page in the stack. But you're
partly wrong about the commit vs. reserve issue (reserve should be
thought of as the MAX stack size ever). The SDK documentation says:

"To change the initially committed stack space, use the dwStackSize
parameter of the CreateThread, CreateRemoteThread, or CreateFiber
function. This value is rounded up to the nearest page. Generally, the
reserve size is the default reserve size specified in the executable
header. However, if the initially committed size specified by
dwStackSize is larger than the default reserve size, the reserve size is
this new commit size rounded up to the nearest multiple of 1 MB. "

So basically the linker option sets the default, which is always used
unless someone specifies a stack size larger, in which case it is grown.

Now 1 MB seems a tiny amount of stack today, but it's been that way for
a long time, and I'm sure that back in '93 or so, when everyone had
32-bit machines with 8 megs or so, it seemed like a lot of space for a
stack. It's easy to change: just flip the linker switch and give
yourself a few gigs with windows-64. But it also directly influences the
number of threads that can be created, so it is sort of a trade-off:
how big is your max stack vs. how many threads you can create. You can't
grow past the reserve amount because there is a good chance some other
thread has placed its stack right before yours. On 32-bit linux, it
seems the decision (not true anymore) had been made to limit the system
to 256 threads so the stack size could be bigger. The application I
currently work on needs a lot of threads (large SMP hardware, with lots
of blocked threads just sitting around), so it would have been nice to
have more, since we consume very little stack space. So the problem can
go both ways. At least in windows, it's easy to mix and match the stack
sizes based on the thread requirements rather than having them fixed
(unless there is a way to get linux to dynamically set the thread stack
size that I don't know about). So this is an instant win for windows in
my book....

>>Windoze commits only a page at a time (by default) to a thread's stack
>>as each page boundary is exceeded. How is this different from what
>>you describe?
>>
>>I'm not saying you are wrong. I'm trying to understand how your
>>solution differs from Win32.
>
>
> Currently, for every allocation > 4 KB or by alloca, it touches every
> page one by one. That not only wastes time in a loop touching pages
> over and over, as far as I can tell it only extends the stack one
> page at a time. Dumb. Dumb. Dumb.

Huh? That is what the commit flags to ZwAllocateVirtualMemory() (exposed
through VirtualAllocEx()) do for you... The NT kernel doesn't really
seem to know anything about user space stacks. That is all done through
the win32 subsystem. So in that regard, it's sort of silly to complain
about the OS just demand paging the one page you ask for.


>
> To do this optimally is simple.
> It requires 3 TEB values: Stack Top, Bottom and Low Water Mark.
> Top and Bottom cover the reserved range.
> Mark is the low water commit point, between top and bottom.

How well this works is probably application dependent. You could
probably implement your own version using SetThreadStackGuarantee(). The
other option is to call ZwWriteWatch() on your stack region and check to
see which pages have been accessed, and allocate/commit extra space as
you see necessary.

Eric P.

Nov 30, 2005, 4:44:08 PM
Jeremy Linton wrote:
>
> Eric P. wrote:
>
> > Dave Hansen wrote:
> >
> >>On Tue, 29 Nov 2005 10:40:24 -0500 in comp.arch, "Eric P."
> >><eric_p...@sympaticoREMOVE.ca> wrote:
> >>>I conclude therefore that the correct solution is to specify the
> >>>reserved linear stack size for each individual thread at create
> >>>and that means it is as an argument to the ThreadCreate function.
> >>>Note that this is NOT how Win32 works. As an optimization one could
> >>
> >>What, then, is the purpose of the dwStackSize parameter to the Win32
> >>function CreateThread?
> >
> >
> > It sets the Commit size. I have no idea why MS thinks I would want
> > to since the commit pages are automatically expanded.
>
> For performance of course... That way you don't have to worry about
> taking individual page faults for each 4k page in the stack.

Which wouldn't be an issue if they didn't expand the commit
space one page at a time in the first place.

> But your
> partially wrong about the commit vs reserve issue (which should be
> thought of as the MAX stack size ever). The SDK documentation says:
>
> "To change the initially committed stack space, use the dwStackSize
> parameter of the CreateThread, CreateRemoteThread, or CreateFiber
> function. This value is rounded up to the nearest page. Generally, the
> reserve size is the default reserve size specified in the executable
> header. However, if the initially committed size specified by
> dwStackSize is larger than the default reserve size, the reserve size is
> this new commit size rounded up to the nearest multiple of 1 MB. "

You are correct. My apologies.
The CreateThread docs have been updated and I missed that.
That info was previously buried at the end of a knowledge base article.

I was going on what CreateThread used to say:

"dwStackSize: Specifies the size, in bytes, of the stack for the new
thread. If 0 is specified, the stack size defaults to the same size
as that of the primary thread of the process. <...> CreateThread tries
to commit the number of bytes specified by dwStackSize, and fails if
the size exceeds available memory."

Even so, I believe my concerns below still apply.

> So basically the linker option sets the default, which is always used
> unless someone specifies a stack size larger, in which case it is grown.
>
> Now 1Meg seems a tiny amount of stack today, but its been that way for a
> long time, and I'm sure that back in '93 or so when everyone had 32-bit
> machines with 8 megs or so it seemed like a lot of space for a stack.
> Its easy to change, just flip the linker switch and give yourself a few
> gigs with windows-64, but it also directly influences the number of
> threads that can be created. So, it is sort of a trade off, how big is
> your max stack vs how many threads you can create. You can't grow past
> the reserve amount because there is a good chance some other thread has
> placed its stack right before yours. On 32-bit linux, it seems the
> decision (not true anymore) had been made to limit the system to 256
> threads so the stack size could be bigger. The application I currently
> work on needs a lot of threads (large SMP hardware, with lots of blocked
> threads just sitting around) so it would have been nice to have more,
> since we consume very little stack space. So, the problem can go both
> ways. At least in windows, its easy to mix and match the stack sizes
> based on the thread requirements rather than having them fixed (unless
> there is a way to get linux to dynamically set the thread size, that I
> don't know about). So, this is an instant win for windows in my book....

My concern is what happens when you make the default smaller
so you can run more threads. Any CreateThread calls in library
functions that specified a commit size of 0 will also change.

I have had situations where I know that my threads do not require
1 MB. But I cannot lower the stack size for just those threads,
and if I change the global default it may affect all threads,
even ones my code did not create.

Based on this experience I conclude that it would have been a better
design for CreateThread to require the caller to specify both
commit & reserve as arguments for each thread.

The global process defaults look easy, but seem just plain dangerous
to me because they allow stacks to be adjusted without regard to usage.

> >>Windoze commits only a page at a time (by default) to a thread's stack
> >>as each page boundary is exceeded. How is this different from what
> >>you describe?
> >>
> >>I'm not saying you are wrong. I'm trying to understand how your
> >>solution differs from Win32.
> >
> >
> > Currently, for every allocation > 4 KB or by alloca, it touches every
> > page one by one. That not only wastes time in a loop touching pages
> > over and over, as far as I can tell it only extends the stack one
> > page at a time. Dumb. Dumb. Dumb.
> Hu? That is what the commit flags to ZwAllocateVirtualMemory() (exposed
> through VirtualAllocEx() does for you... The NT kernel doesn't really
> seem to know anything about user space stacks. That is all done through
> the win32 subsystem. So in that regard, its sort of silly to complain
> about the OS just demand paging the one page you ask for.

I know that, but I am not referring to the commit space.
I am referring to how the committed stack grows into the reserve space.
It was not actually clear, to me anyway, which code is doing this.
The documentation does not say, but tends to imply that it should
be the Win32 user mode code. However I have looked for exceptions
generated into user mode that would trigger the stack expansion and
find none. That is why I put the 'kernel(?)' in my previous message.
(It was long winded enough as it was.)

Anyway, I think you should try single stepping through the assembler
for 'alloca' or a _chkstk call. Alloca calls __alloca_probe, which is
almost identical to the code of __chkstk, which is also called whenever
you declare a local object larger than 4 KB on the stack.

Both just loop touching pages one by one. Every time it is invoked.
If a routine ever declared a 20 MB or 50 MB array...

void Foo (void)
{
    int vec[20*(1<<20)];
}

every call to Foo () does the same check.

> >
> > To do this optimally is simple.
> > It requires 3 TEB values: Stack Top, Bottom and Low Water Mark.
> > Top and Bottom cover the reserved range.
> > Mark is the low water commit point, between top and bottom.
> How well this works is probably application dependent. You could
> probably implement your own version using SetThreadStackGuarantee(). The
> other option is to call ZwWriteWatch() on your stack region and check to
> see which pages have been accessed, and allocate/commit extra space as
> you see necessary.

That is great, but SetThreadStackGuarantee is only in Win64.
I also notice an update to CreateThread for XP that allows
you to explicitly specify the reserve size in CreateThread with
the STACK_SIZE_PARAM_IS_A_RESERVATION, somewhat similar to my rants.

These don't help all the billions of lines of existing code.

Eric

Jeremy Linton

Nov 30, 2005, 7:56:18 PM
Eric P. wrote:
> My concern is what happens when you make the default smaller
> so you can run more threads. Any CreateThread calls in library
> functions that specified a commit size of 0 will also change.
Well, it's probably a valid concern, and one of the reasons why the
default reserve stack size is as small as it is. It's always going to be
easier to increase the reserve (max) stack size than to decrease it.
Budding OS writers should take note of this!

>
> I have had situations where I know that my threads do not require
> 1 MB. But I cannot lower the stack size for just those threads,
> and if I change the global default it may affect all threads,
> even ones my code did not create.

This is basically the same point. My understanding is that windows gets
really close to 2k threads per process with the standard 2/2 memory
split on the 32-bit kernel. That is a lot of threads.... Might be time
to consider a different paradigm for your application, or upgrading to
64-bit.


>
> Based on this experience I conclude that it would have been better
> design for CreateThread to require the caller to specify the both
> commit & reserve as arguments for each thread.

Chuckle - look at the native api reference; this is exactly what
ZwCreateThread does. Of course this doesn't help you if all your
libraries are already win32. On the other hand, people were complaining
that CreateThread already takes too many parameters.


>
>
> I know that, but I am not referring to the commit space.
> I am referring to how the committed stack grows into the reserve space.
> It was not actually clear, to me anyway, which code is doing this.

There probably isn't any code as such. It's simply the way the OS
handles all virtually allocated but unused space. It's reserved in the
virtual memory map and against the total ram+page file size, and it is
committed to physical ram when it's touched - same as any other demand
loaded page, except in this case it's demand created. Only it doesn't
tend to be as fast on page faults, because the caching and page fault
algorithms are probably all forward biased for mmaped sections. If you
touch curstackoffset-somebigsize you probably can speed up the demand
commit operations. Of course you could just as well tell the OS to
commit the virtual range - might be faster than the page fault, might
not.

> generated into user mode that would trigger the stack expansion and
> find none. That is why I put the 'kernel(?)' in my previous message.
> (It was long winded enough as it was.)

That is because it's not really "stack expansion", it's simply a VM commit.

>
> Anyway, I think you should try single stepping through the assembler
> for 'alloca' or a _chkstk call. Alloca calls __alloca_probe which is
> almost identical to the code of __chkstk that is also called whenever
> you declare a local object larger than 4 KB on the stack.
>
> Both just loop touching pages one by one. Every time it is invoked.
> If a routine ever declared a 20 MB or 50 MB array...
>
> void Foo (void)
> { int vec[20*(1<<20)];
> }
>
> every call to Foo () does the same check.

This is not a Win32 or OS issue. That is a Visual Studio/CRT issue.
The stack check seems to be there primarily to force the stack
overflow exception during the allocation rather than during the
reference. It's easy to disable for particular functions with the
#pragma check_stack(off)
directive. You can also control how big a stack allocation causes
this to fire with the /Gs command line switch to the compiler.
I also think calling alloca is not recommended practice anymore;
check the documentation for more information.
If you know enough about your stack allocations it's unlikely you
really need this code anyway. You could also replace it with something
a little smarter. It would be easy to hide the stack base/length
somewhere and simply check the current stack pointer and the requested
size against those values. I'm sort of surprised Visual Studio isn't
doing something similar.

>>>To do this optimally is simple.
>>>It requires 3 TEB values: Stack Top, Bottom and Low Water Mark.
>>>Top and Bottom cover the reserved range.
>>>Mark is the low water commit point, between top and bottom.
>>
>> How well this works is probably application dependent. You could
>>probably implement your own version using SetThreadStackGuarantee(). The
>>other option is to call ZwWriteWatch() on your stack region and check to
>>see which pages have been accessed, and allocate/commit extra space as
>>you see necessary.
>
>
> That is great, but SetThreadStackGuarantee is only in Win64.
> I also notice an update to CreateThread for XP that allows
> you to explicitly specify the reserve size in CreateThread with
> the STACK_SIZE_PARAM_IS_A_RESERVATION, somewhat similar to my rants.
>
> These don't help all the billions of lines of existing code.

Well they seem to be working just fine right now, so they probably
don't need it! Half of them were probably written back when everyone
had 16M of RAM and a 1M stack space seemed like a lot.

Eric P.

Nov 30, 2005, 10:48:38 PM
Jeremy Linton wrote:
>
> Eric P. wrote:
> > My concern is what happens when you make the default smaller
> > so you can run more threads. Any CreateThread calls in library
> > functions that specified a commit size of 0 will also change.
> Well its probably a valid concern, and one of the reasons why the
> default reserve stack size is as small as it is. Its always going to be
> easier to increase the reserve (max) stack size rather than decrease it.
> Budding OS writers should take note of this!

That is why I brought it up - so others may avoid this cow pie.

> > Based on this experience I conclude that it would have been better
> > design for CreateThread to require the caller to specify the both
> > commit & reserve as arguments for each thread.
> Chuckle, look at the native api reference, this is exactly what
> ZwCreateThread does. Of course this doesn't help you if all your
> libraries are already win32. On the other hand people were complaining
> that CreateThread already takes to many parameters.

Yes, I see. This does not surprise me. Often I think that the NT Kernel
did the right thing, but that the Win32 wrapper screwed it up.
(My other big beefs are full reentrancy, thread rundown, function
return status codes, and arg data types, also screwed up in Win32)

> > I know that, but I am not referring to the commit space.
> > I am referring to how the committed stack grows into the reserve space.
> > It was not actually clear, to me anyway, which code is doing this.
> Probably isn't any code as such. It simply the way the OS handles all
> virtually allocated but unused space. Its reserved in the virtual memory
> map and against the total ram+page file size, and it is committed to
> physical ram when its touched. Same as any other demand loaded page.

Yes, but read on.

The reserve stack space is recorded in a b-tree of reserved address
ranges, in entries called Virtual Address Descriptors, and reserving
space does not require any changes to the page table.

Committing stack requires changing the PTE's to be demand zero pages,
but does not actually allocate physical memory. When the pages are
next touched a zero'ed page frame is assigned.

Touching a reserved but not committed page causes an access violation.
According to the documentation and experiment, changing reserved into
committed pages **requires** the _chkstk probes. The OS, Win32 and
the C-RTL work together to make it function correctly.

See below for more on _chkstk.

> Except in this case its demand created. Only it doesn't tend to be as
> fast on page faults because the caching and page fault algorithms are
> probably all forward biased for mmaped sections. If you touch
> curstackoffset-somebigsize you probably can speed up the demand commit
> operations. Of course you could just as well tell the OS to commit the
> virtual range, might be faster than the page fault, might not.

One call to VirtualAlloc would be faster than hundreds of page faults.

> > generated into user mode that would trigger the stack expansion and
> > find none. That is why I put the 'kernel(?)' in my previous message.
> > (It was long winded enough as it was.)
> That is because it's not really "stack expansion"; it's simply a VM commit.
>
> >
> > Anyway, I think you should try single stepping through he assembler
> > for 'alloca' or a _chkstk call. Alloca calls __alloca_probe which is
> > almost identical to the code of __chkstk that is also called whenever
> > you declare a local object larger than 4 KB on the stack.
> >
> > Both just loop touching pages one by one. Every time it is invoked.
> > If a routine ever declared a 20 MB or 50 MB array...
> >
> > void Foo (void)
> > { int vec[20*(1<<20)];
> > }
> >
> > every call to Foo () does the same check.
> This is not a Win32 or OS issue. That is a Visual Studio/CRT issue. The
> check stack seems to be there primary to force the stack overflow
> exception during the allocation rather than during the reference. Its
> easy to disable for particular functions with the
> #pragma check_stack(off)
> directive. You can also control how big of a stack allocation causes
> this to fire with the /Gs command line switch to the compiler.
> I also think calling alloca is not recommended practice anymore, check
> the documentation for more information.

No.
#pragma check_stack(off) and /Gs only work in DOS & Win3.1.
In Win32 these switches are [mostly] ignored and you always get
stack checks when the local vars exceed 4 KB.

See Q100775 Stack Checking for Windows NT-based Applications.

It basically says that _chkstk is *required* for NT to work correctly
and if you just subtract from ESP and touch memory, your code breaks.

For proof, try the following. It will get an access violation:

void foo (void)
{ // Move down 5 pages in stack and write
_asm { sub esp, 0x5000 };
_asm { push eax }
}

int main (int argc, const char *argv[])
{ foo ();
return 0;
}

It says that the /Gs threshold value can be set to anything, but
"For a user mode application to run correctly in Windows NT, the
default threshold 4096 is required."

> If you know enough about your stack allocations its unlikely you really
> need this code anyway. You could also replace it with something a little
> smarter. It would be easy to hide the stack base/len somewhere and
> simply check the current stack pointer and the requested size against
> those values. I'm sort of surprised visual studio isn't doing something
> similar.

Yes, that is what I said. Well, you were more polite.
I called it Dumb Dumb Dumb.

> > These don't help all the billions of lines of existing code.
> Well they seem to be working just fine right now, so they probably
> don't need it! Half of them were probably written back when everyone had
> 16M of RAM and 1M stack spaces seem like a lot.

The whole thing gives me an ugly feeling that it works by luck
rather than design.

Eric

George Neuner

Dec 1, 2005, 1:01:57 AM
On Wed, 30 Nov 2005 07:34:11 GMT, "Peter Dickerson"
<first{dot}sur...@ukonline.co.uk> wrote:

>"George Neuner" <gneuner2/@comcast.net> wrote in message
>news:614po1hhnliu3bjvk...@4ax.com...
>

>> The *physical* space is 4GB - the P6 and later added support for
>> banked access to up to 64GB.
>>
>> The *logical* address space is ~16000 4GB segments [I don't remember
>> the exact number - some of the selector values are illegal].
>
>OK, I wondered where you got 46 bits from. If you are allowing traps often
>to change the physical memory map every time a selector is loaded. But in
>that case why does the selector have to be ring 3 (say). Such an OS could
>use the ring number as part of the logical address - of course the actual
>selector value loaded would be different, so you can't save it and reload it
>safely... In this case loading a selector into a segment register can be
>viewed as an OS call to change the memory bank switching logic.

On 32-bit Intel designs, loading a segment register from any ring
other than 0 causes a trap to ring 0 so the selector can be validated.
This segment load trap was very expensive on the 286 ... it was much
faster on the 386, but the 386 introduced paging as well, so virtually
all the OS developers chose to put all of memory into a single segment
and use paging instead.

IIRC, the segment TLB was removed or greatly reduced in size on the
Pentiums [because no one used it] making protected mode segment
register loads even more costly because validation results were not
being cached. IA64 has dropped support for hardware segmentation
entirely.

Tandem had a secure 386 OS that used segmentation but I'm not aware of
any others. All the popular systems ignored them entirely. It's sad
because the MMU did segment checking anyway prior to page checking.
Things like "data execution prevention" have been available since 1985
as segment privileges - it's just that no one cared.

Given the cycle cost of validation [even on the earlier CPUs that
expected to do it], I think it was a sensible choice not to abuse
segmentation. But IMO the developers went overboard the other way. A
reasonable compromise, I think, would have been to restrict each
application to just a few segments - code/data or even code/data/stack
... you have 16000+ values to use - and to partition applications from
the OS and from each other. The whole line of IA32 operating systems
could have been a lot more robust if more sensible choices had been
made early.

George Neuner

Dec 1, 2005, 1:14:44 AM
On Wed, 30 Nov 2005 10:57:22 +0000, Ian Rogers
<ian.r...@manchester.ac.uk> wrote:

>George Neuner wrote:
>> You're confusing the *logical* address space - 16000 4GB segments -
>> with the *physical* address space.
>
>Your right I was confusing things (GB not TB), but I think my point has
>been lost. A logical address is a segment selector plus a 32bit offset.
>The segment selector chooses a segment descriptor which has a base
>address which gets added to the offset. This combination forms the
>linear address which can only be 32bits long. The linear address gets
>mapped through the page tables to a potentially 36bit physical address.
>What's not clear when you say 16000 4GB segments is that these must all
>be within the same 4GB! You can't say have DS set up with a base address
>of 0 and FS set up with a base address of 4GB, as the linear address
>gets truncated to 32bits.

Right. The page tables have to be swapped manually if you want to
exceed the physical limit. It can be done but is a PITA, as you can
imagine. Tandem had a secure segmented 386 OS but I'm not sure whether
it also overcommitted the physical address space.

>There are occassions where you would like to do this trick and its
>annoying you can't, e.g.:
>A debugger or emulator could be in the same address space as an
>application but not visible within its 4GB window.
>A user space kernel (such as user mode linux) could have a different
>application mapped to each of the 16000 segments, and then just copy
>data out of them using a segment over-ride.

Those are great ideas ... you should have had them 20 years ago when
it mattered. Nobody much used the segment unit so Intel deprecated it
and in later IA32 models removed some of the cache hardware that
supported it. It's very expensive now to change a segment register in
32-bit protected mode. The IA64 MMU has completely done away with
the segment unit.

Ian Rogers

Dec 1, 2005, 5:50:39 AM
George Neuner wrote:
> Those are great ideas ... you should have had them 20 years ago when
> it mattered. Nobody much used the segment unit so Intel deprecated it
> and in later IA32 models removed some of the cache hardware that
> supported it. It's very expensive now to change a segment register in
> 32-bit protected mode. The IA64 MMU has completely done away with
> the segment unit.

Linux was using the fs selector to address thread local storage (TLS)
for a while, but now it does it through the paging system, which has
many advantages. Quite a few systems (well, I'm thinking of JVMs) need
places to hold onto thread-local variables; committing them to a
register wastes the register, so putting them at constant offsets
addressed via a selector is something that has been talked about
(you could just hijack an existing TLS, except if you need TLS for
green threads).

PowerPC has segmentation on the top 4 bits of the address (IIRC). You
can disable this for either instruction or data accesses. I think this
makes it conceivable to have separate 4GB instruction and data
regions, so 8GB total, which is potentially useful in emulation. It's
odd that PowerPC can address more in a 32-bit user process than IA32,
which seems to commit more resources to the job. I still think 32-bit
linear addresses are a shame :-)

Ian Rogers

Peter "Firefly" Lund

Dec 1, 2005, 12:43:05 PM
On Thu, 1 Dec 2005, Ian Rogers wrote:

> Linux was using fs selector to address thread local storage (TLS) for a
> while, but now it does it through the paging system, which has many
> advantages.

Linux uses the gs register for TLS, at least in NPTL (Native POSIX
Thread Library). It does NOT use the paging system. TLS in Linux is
documented in this document:

http://people.redhat.com/drepper/tls.pdf

I also made a quick little test:

$ cat test1.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

__thread int x;

int f()
{
return x;
}

int main()
{
printf("x: %d\n", f());
return EXIT_SUCCESS;
}
$ gcc -fomit-frame-pointer test1.c
$ objdump -M intel -d a.out | grep '<f>:' -A2
08048398 <f>:
8048398: 65 a1 fc ff ff ff mov eax,gs:0xfffffffc
804839e: c3 ret
$

(Ubuntu Breezy Badger)


Windows does something similar, except that it uses fs. I think gs was
chosen on Linux in order to make wine/libwine work with NPTL TLS with as
little fuss as possible.

> Quite a few systems (well I'm thinking JVMs) need places to hold
> onto thread local variables, commiting them to a register wastes the register
> so putting them at constant offsets addressed via a selector is something
> that has been talked about being done (you could just hijack an existing TLS
> except if you need TLS for green threads).

You could always allocate a selector value of your own and load it into
fs/gs/es when necessary. Linux lets you allocate a few selectors with
per-thread private base/limit values. I think three are available in all
2.5+ versions (but some have a few more).

In NPTL, there is a 1:1 mapping between userspace threads and kernel-level
"execution contexts" (call them threads or processes). As part of each
context switch, a small part of the global descriptor table gets copied
from the process control block and the descriptor table gets reloaded.

-Peter

Brian Hurt

Dec 1, 2005, 4:01:43 PM
George Neuner <gneuner2/@comcast.net> writes:

>On 32-bit Intel or AMD you could (ab)use MMU segments ... the logical
>address space is 64TB IIRC. Of course the physical address space is
>only 4GB so you would also have to maintain alternate page table sets
>and swap them as necessary.

You still have a limit of 4G of address space per process. Worse yet,
every segment has to be contiguous.

Brian

Jeremy Linton

Dec 1, 2005, 4:50:50 PM

Eric P. wrote:
> The reserve stack space is recorded a b-tree of reserved address
> ranges in entries called Virtual Address Descriptors, and does
> not require any changes the page table to reserve space.

Yah, 'cause it's just being reserved... Check out the kernel section objects.


>
> Committing stack requires changing the PTE's to be demand zero pages,
> but does not actually allocate physical memory. When the pages are
> next touched a zero'ed page frame is assigned.

Yup..


>
> Touching a reserved but not committed page causes an access violation.
> According to the documentation and experiment, changing reserved into
> committed pages **requires** the _chkstk probes. The OS, Win32 and
> the C-RTL work together to make it function correctly.

Yah, you should in theory be able to trap the guard page exceptions
and write your own version... I just wasted an hour looking for the
code that is trapping the stack guard and issuing the VirtualAlloc()
or whatever it's doing. Either I'm missing something or I didn't look
long enough. Not only that, but the 30-second program I just wrote to
trap the guard page exceptions isn't working. Not sure if it's me, or
something I'm missing. M$'s example for working with guard pages
doesn't have code to trap the exception.


> No.
> #pragma check_stack(off) and /Gs only work in DOS & Win3.1
> In Win32 these switches are [mostly] ignored and you always get
> stack checks when the local vars > 4 KB.
>
> See Q100775 Stack Checking for Windows NT-based Applications.

Yah, my mistake; that's what I get for just remembering it was there
instead of looking it up.

>
> It basically says that _chkstk is *required* for NT to work correctly
> and if you just subtract from ESP and touch memory, your code breaks.
>
> For proof, try the following. It will get Access Violation:
>
> void foo (void)
> { // Move down 5 pages in stack and write
> _asm { sub esp, 0x5000 };
> _asm { push eax }
> }

Actually, this works fine on my machine, bumping it to 50 pages does
not, so I get your point. On the other hand if you do a CreateThread()
and commit all the stack memory you shouldn't have this problem, and it
will just work. So, it should really only be a problem for the main thread.

> Yes, that is what I said. Well, you were more polite.
> I called it Dumb Dumb Dumb.

It does seem a little more complicated and slow than it needs to be.
I just compared the code between VS6's 32-bit and VS2005's x64 chkstk
routines. At least in the new version someone is squirreling away
(probably the exception handler I can't find) the current stack bottom
address in gs:[0x10] and reading it before doing the page touching.
This is better than what VS6 does, which is always touching all the
pages in the allocated space every time an allocation occurs.


> The whole thing gives me an ugly feeling that it works by luck
> rather than design.

Well, that is probably true of a lot of code out there, they simply
haven't stressed it enough to overflow the stack. This is probably true
on nearly every operating system out there written in C. Which brings us
back full circle to the question of using non linear stacks, discussed
elsewhere in this thread.

BTW: I considered writing some code to try to switch the stack to a
normal VirtualAlloc'ed region, but I don't really see the point since
CreateThread allows you to commit all the memory during the initial
call.

Jeremy Linton

Dec 1, 2005, 5:14:12 PM

Ian Rogers wrote:

> George Neuner wrote:
>
>> Those are great ideas ... you should have had them 20 years ago when
>> it mattered. Nobody much used the segment unit so Intel deprecated it
>> and in later IA32 models removed some of the cache hardware that
>> supported it. It's very expensive now to change a segment register in
>> 32-bit protected mode. The IA64 MMU has completely done away with
>> the segment unit.
>

> PowerPC has segmentation on the top 4 bits of the address (IIRC). You
> can disable this for either instruction or data accesses. I think this
> makes it conceivable to have separate 4GB instruction and data regions,
> so 8GB total, which is potentially useful in emulation. Its odd that
> PowerPC can address more in a 32bit user process than IA32 which seems
> to commit more resources to the job. I still think 32bit linear
> addresses are a shame :-)

First, PPC had a funny naming convention: they called the virtual
address (used by user programs) the effective address, and the linear
address the virtual address. I'm going to ignore their names.

As for accessing more than 4G in 32-bit mode, I don't think so. The
32-bit PPC had 16 segment registers which held the translation to the
linear address; the segment register was selected by the top 4 bits of
the virtual address. It had nothing to do with code vs. data: a
virtual address simply had its top 4 bits removed to select the
segment register. The segment register returned a 24-bit Virtual
Segment ID (VSID) which replaced the upper 4 bits for purposes of page
table lookups, which were done with the 24-bit VSID and the 16 bits of
the virtual address below the 4-bit segment selector.

This is still a 32-bit virtual address space, same as the x86. The
64-bit version of the PPC behaves almost the same way, except that the
top 36 bits of a 64-bit address are translated through the SLB instead
of the segment registers. The SLB returns a 52-bit VSID which gets
concatenated to the same 16 bits as before.

Two google hits for reference.
http://www.cs.wisc.edu/~remzi/Classes/537/Fall2005/Lectures/lecture14.ppt
http://www.cs.cmu.edu/~412/projects/9mac/PowerPC_Memory.ppt

Jeremy Linton

Dec 1, 2005, 11:54:05 PM

Jeremy Linton wrote:


> Ian Rogers wrote:
>> PowerPC has segmentation on the top 4 bits of the address (IIRC). You
>> can disable this for either instruction or data accesses. I think this
>> makes it conceivable to have separate 4GB instruction and data
>> regions, so 8GB total, which is potentially useful in emulation. Its
>> odd that PowerPC can address more in a 32bit user process than IA32
>> which seems to commit more resources to the job. I still think 32bit
>> linear addresses are a shame :-)
>

> This is still a 32-bit virtual address space, same as the x86. The
> 64-bit version of the PPC behaves almost the same way except that the
> top 36 bits of a 64-bit address are translated through the SLB instead

Oh, I guess there are a couple of really ugly hacks to get more... You
could just run with either instruction or data address translation
turned off (which is what you said, but I read it differently the
first time around). In other words, you run with MSR:IR and MSR:DR in
different states. This could give you more than 4G in a funky kind of
way. If you ran with data translation off it would be really hard to
protect your memory from another process. On the other hand, you might
be able to get away with running with instruction address translation
turned off, but you would have the limitation of not being able to
have more than 4G of total instruction space across all processes. It
still wouldn't look like more than 4G, just that instruction fetches
would go directly to physical memory. It would be an absolute
nightmare to maintain; plus you would have to dedicate the first 4G of
physical RAM, and you wouldn't be able to demand-page those addresses.
It might be possible, but it hurts to think about.

It's possible you could get a similar effect by just changing the
IBATs for each process and so be able to access something other than
the first 4G. Still, it would be really strange to deal with. You
could be executing code at one virtual address while fetching data
from the same virtual address and getting different data returned.

Eric P.

Dec 2, 2005, 1:14:06 PM
Jeremy Linton wrote:
>
> Eric P. wrote:
> >
> > Touching a reserved but not committed page causes an access violation.
> > According to the documentation and experiment, changing reserved into
> > committed pages **requires** the _chkstk probes. The OS, Win32 and
> > the C-RTL work together to make it function correctly.
> Yah, you should in theory be able to trap the guard page exceptions and
> write your own version.... I just wasted an hour looking for the code
> that is trapping the stack guard and issuing the VirtualAlloc() or
> whatever its doing. Either i'm missing something or I didn't look long
> enough. Not only that but the 30 second program I just wrote to trap the
> guard page exceptions isn't working. Not sure if its me, or something
> i'm missing. M$'s example for working with guard pages doesn't have code
> to trap the exception.

I tried the same thing, and also could not find it. That was why I
had that 'kernel(?)' reference in my original description.
Like I said earlier, despite the documentation hinting that it is
handled in Win32, I suspect it is picked off inside the kernel.

> > I called it Dumb Dumb Dumb.
> It does seem a little more complicated and slow than it needs to be. I
> just compared the code between vs6 32-bit and vs2005's x64 chkstk
> routine. At least in the new version someone is squirreling (probably
> the exception handler I can't find) the current stack bottom address in
> gs:[0x10] and reading it before doing the page touching. This is better
> than what vs6 is doing which is always touching all the pages in
> allocated space every time an allocation occurred.

Yeah. It is good that it at least checks the low water mark now,
as I suggested earlier. But it is still very inefficient because
it touches every page one by one. This causes a real demand zero
page fault, and a real frame allocation, for every page, when it
should just be committing the whole range instead.

Just imagine if you had a really large array, say hundreds of MB,
allocated on the stack so your code would be reentrant.
The current approach kills large application performance.

It would at least be vastly more efficient if chkstk just did a
bounds check against the stack lower bound and touched the lower
guard page if out of bounds. Takes 3 instructions.
To perform automatic commit stack expansion, when a Reserved
page is touched it causes an access violation. A First Chance
Exception Handler would catch it and use VirtualAlloc to expand
the whole commit range into reserve space up to the touched page.

Ideally, for maximum efficiency the above is done automatically by
the OS's Page Fault Handler code and skip the unnecessary exception
and subsequent system service call. For that to work, the stack
memory would need to be distinguished from other by having its own
memory section type, a Stack Section, with its own set of behavior.
I believe this is easily justified because thread stacks are so
central to application execution and critical to performance.

Eric

Eric P.

Dec 2, 2005, 1:38:47 PM
"Eric P." wrote:
>
> Dave Hansen wrote:
> >
> > What, then, is the purpose of the dwStackSize parameter to the Win32
> > function CreateThread?
>
> It sets the Commit size. I have no idea why MS thinks I would want
> to since the commit pages are automatically expanded.

BTW, I thought of one reason why having Commit size in the
CreateThread arg list is important, beyond the efficiency issue.

It is also a reliability issue. It ensures that resources,
specifically page file backing store, are allocated at thread
create time. If CreateThread succeeds, you are sure that you
have enough stack space to run.

A reserved allocation will not be checked until it is converted to
commit, and then it might be denied. That could cause unexpected
failures.

So for reliable execution, in an ideal world both Reserve and
Commit space would be specified in the CreateThread arg list
(as they apparently are, internally).

Eric

robert...@yahoo.com

Dec 2, 2005, 3:16:37 PM

Eric P. wrote:
> I have had situations where I know that my threads do not require
> 1 MB. But I cannot lower the stack size for just those threads,
> and if I change the global default it may affect all threads,
> even ones my code did not create.
>
> Based on this experience I conclude that it would have been better
> design for CreateThread to require the caller to specify the both
> commit & reserve as arguments for each thread.
>
> The global process defaults look easy, but seem just plain dangerous
> to me because they allow stacks to be adjusted without regard to usage.


If you can live with a WinXP dependency, you can specify
STACK_SIZE_PARAM_IS_A_RESERVATION on CreateThread(), which causes the
stack size parameter to be treated as, ahem, a reservation request and
not a commit request.

Interestingly, CreateFiberEx() allows you to specify both the reserve
and commit sizes for the stack.

Andy Glew

Dec 3, 2005, 6:00:54 AM
> Erlang doesn't have intra-stack data pointers, so stack relocation works.
> For languages like C/C++ you should probably just use segmented/linked stacks.

So, what does the CALL sequence look like for a segmented/linked stack?

High-level CALL:
  If there is room for another stack frame in the current segment:
    Frame <- StackPointer
    StackPointer <- StackPointer - FrameSize
    OldFrame.NextFrameLink <- Frame
  Else:
    Allocate NextStackSegment (e.g. get one off a linked freelist)
    CurrentStackSegment.next <- NextStackSegment
    Frame <- NextStackSegment.start
    StackPointer <- Frame - FrameSize
    OldFrame.NextFrameLink <- Frame

Similarly for a high-level RETURN, assuming that pointers to stack
variables are handled elsewhere.

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?


George Neuner

Dec 3, 2005, 6:26:38 AM
On Thu, 01 Dec 2005 10:50:39 +0000, Ian Rogers
<ian.r...@manchester.ac.uk> wrote:

>George Neuner wrote:
>> Those are great ideas ... you should have had them 20 years ago when
>> it mattered. Nobody much used the segment unit so Intel deprecated it
>> and in later IA32 models removed some of the cache hardware that
>> supported it. It's very expensive now to change a segment register in
>> 32-bit protected mode. The IA64 MMU has completely done away with
>> the segment unit.
>
>Linux was using fs selector to address thread local storage (TLS) for a
>while, but now it does it through the paging system, which has many
>advantages. Quite a few systems (well I'm thinking JVMs) need places to
>hold onto thread local variables, commiting them to a register wastes
>the register so putting them at constant offsets addressed via a
>selector is something that has been talked about being done (you could
>just hijack an existing TLS except if you need TLS for green threads).

Using multiple segments is fine so long as you are not constantly
changing them - you take a hit on each selector register load.

Even reloading the same value into a segment register takes a small
hit - the value is already known to be a valid selector so it doesn't
need to be validated again, but the limits and privileges have to be
reloaded from the cache and applied to the MMU. If you change the
register value and the selector isn't in the cache you take a huge hit
to do the validation check and application.

George Neuner

Dec 3, 2005, 6:29:22 AM

Only logically contiguous. The segments are still paged and you can
virtualize the page tables if you really want to. I'm certainly not
going to claim it's easy to do though.

Mikael Pettersson

unread,
Dec 3, 2005, 6:44:25 AM12/3/05
to
In article <peyphd9q...@pxpl2829.amr.corp.intel.com>,

You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any). If you ran out of space, allocate a new segment,
push a trap continuation (current return address and previous segment's
stack pointer), then reinvoke yourself with a trap return address.
At return the trap takes care of switching back to the previous segment.

Since the caller and callee may run in different stack segments,
the compiler should use an argument pointer register for addressing
actual parameters in the caller's frame, or the switch to a new
segment should copy the stacked parameters to the new segment.

[1] Callees must have a few words available for storing return
address + argument registers before invoking the stack overflow
handler.
--
Mikael Pettersson (mi...@csd.uu.se)
Computing Science Department, Uppsala University

Joe Seigh

unread,
Dec 3, 2005, 8:41:19 AM12/3/05
to
Mikael Pettersson wrote:
> In article <peyphd9q...@pxpl2829.amr.corp.intel.com>,
> Andy Glew <andy...@intel.com> wrote:
[...]

>>
>>This seems a lot more expensive than existing CALL/RET instructions.
>>
>>Should there be instructions that accomplish this?
>>Or should it all be software convention?
>
>
> You do call/ret as usual.
>
> In a function's prologue you check if you have enough space for
> your own frame plus the minimal guarantee[1] you give other functions
> you call (if any). If you ran out of space, allocate a new segment,
> push a trap continuation (current return address and previous segment's
> stack pointer), then reinvoke yourself with a trap return address.
> At return the trap takes care of switching back to the previous segment.
>
> Since the caller and callee may run in different stack segments,
> the compiler should use an argument pointer register for addressing
> actual parameters in the caller's frame, or the switch to a new
> segment should copy the stacked parameters to the new segment.

z/OS (s360) linkage uses register 1 for the parameter list so I didn't
have to worry about that when I used segmented stacks for it.

You can get rid of most of the need for stack segment allocation and
deallocation from the free pool. Just have a per processor stack
segment free pool and append that to the current stack segment on
thread dispatch. On thread suspend, detach the unused stack segments
from the current stack segment and use those for the next dispatched
thread. No need to use dummy trap returns to free stack segments.
You'd still have to check if you ran out of free stack segments. The
dispatcher can check if free pool gets too big and stack segments need
to be deallocated.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Andy Glew

unread,
Dec 3, 2005, 1:21:45 PM12/3/05
to
> "Eric P." <eric_p...@sympaticoREMOVE.ca> writes:
>
> I have had situations where I know that my threads do not require
> 1 MB. But I cannot lower the stack size for just those threads,
> and if I change the global default it may affect all threads,
> even ones my code did not create.
>
> Based on this experience I conclude that it would have been better
> design for CreateThread to require the caller to specify the both
> commit & reserve as arguments for each thread.

You talk as if you come from an embedded world, where the programmer
knows what each thread will do and need.

I think, in the general purpose world, this is NOT true.

E.g. Joe Programmer writes a piece of code. He knows that most of the
time is spent in his code, and that his code is parallelizable, so he
wants to use threads.

However, in several places Joe is calling a library routine. Maybe
printf; maybe some other. Maybe an STL function He doesn't think that
the library routine matters much. It isn't really worth the trouble to
collapse all the threads via a barrier, and serialize, at every
library routine.

But Joe doesn't know how much stack the library routine uses; maybe
typically, but certainly not in all cases.

I think that it is unreasonable for Joe to have to specify the amount
of memory for the thread stacks (or any other property of the library
routine).

Making that requirement basically means that Joe should not use
library routines in his threads. If he does, his code may break if the
library routine changes its stack usage.

---

I was inspired by a talk Jack Dennis gave at PACT in Paris in 1997(?):

Jack said that the problem with parallel programming is that it
violates all of the rules of software engineering. E.g. modularity:
you're NOT SUPPOSED to know how much stack a library routine is using.

I suggest that, so long as parallel programming requires the top level
programmer to know all sorts of intimate details about the code he is
calling, parallel programming will remain a niche. Maybe an important
niche, but a niche nevertheless.

So: how do you make parallel programming compatible with software
engineering principles like modularity?


Rupert Pigott

unread,
Dec 3, 2005, 1:55:16 PM12/3/05
to
Andy Glew wrote:

[SNIP]

> I suggest that, so long as parallel programming requires the top level
> programmer to know all sorts of intimate details about the code he is
> calling, parallel programming will remain a niche. Maybe an important
> niche, but a niche nevertheless.

Parallel programming hasn't required that kind of knowledge for
donkey's years. Threads will *always* require that kind of knowledge,
and that is why they suck.

Fine grain parallelism just doesn't scale well, it is a dead end.

> So: how do you make parallel programming compatible with software
> engineering principles like modularity?

Do I need to say CSP again ? :/

CSP enforces modularity, that is one of the main points behind it.

Cheers,
Rupert

Andy Glew

unread,
Dec 3, 2005, 2:34:25 PM12/3/05
to
Ian Rogers <ian.r...@manchester.ac.uk> writes:
> Linux was using fs selector to address thread local storage (TLS) for
> a while, but now it does it through the paging system, which has many
> advantages.

And disadvantages... In particular, that the threads are no longer
"completely shared address space". Making it harder to share the page
tables and TLB entries and TLB ASIDs and TLB PIDs and ...

Further, beware of the bug that can occur when you pass a pointer
between threads in such an environment. You can nearly always do
this, *except* when the pointer is to the TLS. If you pass a TLS
pointer, in the receiving process it points to a different TLS.

Further... Typically you will have both the TLS mapped at a constant address,
and the TLS for all of the threads visible at some other address. Which
means that there will be at least one page that has aliased virtual addresses.
The TLS alias is probably most often accessed, but if you ever access the other,
and you are on a system that uses a virtually indexed cache, you may incur
expensive cache flushes when aliasing is detected.


Andy Glew

unread,
Dec 3, 2005, 2:36:06 PM12/3/05
to
"Peter \"Firefly\" Lund" <fir...@diku.dk> writes:

> On Thu, 1 Dec 2005, Ian Rogers wrote:
>
> > Linux was using fs selector to address thread local storage (TLS)
> > for a while, but now it does it through the paging system, which has
> > many advantages.
>
> Linux uses the gs register for TLS, at least in NPTL (Native POSIX
> Thread Library). It does NOT use the paging system. TLS in Linux is
> documented in this document:

That's what I thought.

Nevertheless, I replied to Ian Rogers, pointing out some of the
problems of using paging for TLS.

Back in the old days various UNIXes used the same trick for the "u",
the per-process area. Same problems.

Andy Glew

unread,
Dec 3, 2005, 2:40:48 PM12/3/05
to
mi...@harpo.csd.uu.se (Mikael Pettersson) writes:

> >So, what does the CALL sequence look like for a segmented/linked stack:
> >

> >This seems a lot more expensive than existing CALL/RET instructions.
> >
> >Should there be instructions that accomplish this?
> >Or should it all be software convention?
>
> You do call/ret as usual.
>
> In a function's prologue you check if you have enough space for
> your own frame plus the minimal guarantee[1] you give other functions
> you call (if any).

So, what is that minimal guarantee? For functions it's
straightforward enough... but you better make sure that you are not
executing any instructions that get trapped and emulated via user code
(not kernel code, that has its own stack) in that prologue. Better
make sure that the compiler isn't scheduling code into that
prologue. I.e. make sure that the compiler treats that prologue as an
atomic unit.

What about signals? Can signals run on the user stack? Probably best
to make sure that they run on a different stack.

What about "events" in general? Must all event classes have dedicated
stacks?


Eric P.

unread,
Dec 3, 2005, 2:36:12 PM12/3/05
to
Joe Seigh wrote:

>
> Mikael Pettersson wrote:
> >
> > You do call/ret as usual.
> >
> > In a function's prologue you check if you have enough space for
> > your own frame plus the minimal guarantee[1] you give other functions
> > you call (if any). If you ran out of space, allocate a new segment,
> > push a trap continuation (current return address and previous segment's
> > stack pointer), then reinvoke yourself with a trap return address.
> > At return the trap takes care of switching back to the previous segment.
> >
> > Since the caller and callee may run in different stack segments,
> > the compiler should use an argument pointer register for addressing
> > actual parameters in the caller's frame, or the switch to a new
> > segment should copy the stacked parameters to the new segment.
>
> z/OS (s360) linkage uses register 1 for the paramater list so I didn't
> have to worry about that when I used segmented stacks for it.
>
> You can get rid of most of the need for stack segment allocation and
> deallocation from the free pool. Just have a per processor stack
> segment free pool and append that to the current stack segment on
> thread dispatch. On thread suspend, detach the unused stack segments
> from the current stack segment and use those for the next dispatched
> thread. No need to use dummy trap returns to free stack segments.
> You'd still have to check if you ran out of free stack segments. The
> dispatcher can check if free pool gets too big and stack segments need
> to be deallocated.

Off the top of my head, some other issues for consideration:

- Large allocation, larger than your pre allocated segment size.
You still need to decide whether to use a standard or custom
segment and release it to the appropriate heap/pool.
Just more things for the prolog and epilog to consider.

- Dynamic stack allocation, alloca, including multiple allocations
in a single routine, and their recovery upon return.
Not a difficult issue, but can't be forgotten.

- Non C or Fortran languages like Ada which allow variable
size function return values.

IOW functions that do not restore the stack pointer on return
so as to effect an allocation in the caller's frame.

- User mode interrupt or exception delivery, including imprecise
exceptions, delivered at the most inconvenient instant during
a stack expansion.

This would effect the decisions on how to chain the segments
(ie probably not a double linked list).

- And of course space recovery during exceptions' stack unwind
(including a partially linked segment that was being chained
just as the exception was delivered.)

- Setjmp & longjmp, and space recovery.

Not that these are insoluble.
Eric

Eric P.

unread,
Dec 3, 2005, 3:16:53 PM12/3/05
to
Andy Glew wrote:
>
> > "Eric P." <eric_p...@sympaticoREMOVE.ca> writes:
> >
> > I have had situations where I know that my threads do not require
> > 1 MB. But I cannot lower the stack size for just those threads,
> > and if I change the global default it may affect all threads,
> > even ones my code did not create.
> >
> > Based on this experience I conclude that it would have been better
> > design for CreateThread to require the caller to specify the both
> > commit & reserve as arguments for each thread.
>
> You talk as if you come from an embedded world, where the programmer
> knows what each thread will do and need.
>
> I think, in the general purpose world, this is NOT true.

Actually, I did start in embedded hardware and software 30 years ago.
But in this case I was intentionally being a slight hyperbolist
in order to make a point about reliability.

> E.g. Joe Programmer writes a piece of code. He knows that most of the
> time is spent in his code, and that his code is parallelizable, so he
> wants to use threads.
>
> However, in several places Joe is calling a library routine. Maybe
> printf; maybe some other. Maybe an STL function He doesn't think that
> the library routine matters much. It isn't really worth the trouble to
> collapse all the threads via a barrier, and serialize, at every
> library routine.
>
> But Joe doesn't know how much stack the library routine uses; maybe
> typically, but certainly not in all cases.
>
> I think that it is unreasonable for Joe to have to specify the amount
> of memory for the thread stacks (or any other property of the library
> routine).
>
> Making that requirement basically means that Joe should not use
> library routines in his threads. If he does, his code may break if the
> library routine changes its stack usage.

All true. Before multi-threading the OS just allocated a huge
swath of virtual space, on the order of 1 GB, to the program stack.
This was so huge that it allowed programmers to ignore all stack space
considerations. You are more likely to get a "Page File Exhausted"
error than a stack overflow.

Then along comes 32 bit multi-threading and we can't have 1 GB
per thread any more. Many of the previous development practices,
specifically not giving *any* consideration to worst case stack use,
come back to haunt one when there is only 1 MB.

So the answer would seem to be: use a 64 bit cpu, commit 1 GB
of virtual space (and page file space) to every thread stack,
and stick the page file on a great whacking 1 TB disk.

Then we can all get back to writing 'error free' code.

> I was inspired by a talk Jack Dennis gave at PACT in Paris in 1997(?):
>
> Jack said that the problem with parallel programming is that it
> violates all of the rules of software engineering. E.g. modularity:
> you're NOT SUPPOSED to know how much stack a library routine is using.
>
> I suggest that, so long as parallel programming requires the top level
> programmer to know all sorts of intimate details about the code he is
> calling, parallel programming will remain a niche. Maybe an important
> niche, but a niche nevertheless.
>
> So: how do you make parallel programming compatible with software
> engineering principles like modularity?

Then the above approach would seem to fill the bill. :-)

I never agreed with that philosophy, that modularity implied absolute
ignorance, so I do not get disappointed. Furthermore, sometimes people
then use such ignorance to imply lack of personal responsibility
for the result.

I use modularity to lower the number of variables under
consideration below my grossly stupid error point.
Reuse is a bonus.

Eric

Mikael Pettersson

unread,
Dec 4, 2005, 5:18:34 AM12/4/05
to
In article <peypmzji...@pxpl2829.amr.corp.intel.com>,

Andy Glew <andy...@intel.com> wrote:
>mi...@harpo.csd.uu.se (Mikael Pettersson) writes:
>
>> >So, what does the CALL sequence look like for a segmented/linked stack:
>> >
>> >This seems a lot more expensive than existing CALL/RET instructions.
>> >
>> >Should there be instructions that accomplish this?
>> >Or should it all be software convention?
>>
>> You do call/ret as usual.
>>
>> In a function's prologue you check if you have enough space for
>> your own frame plus the minimal guarantee[1] you give other functions
>> you call (if any).
>
>So, what is that minimal guarantee? For functions it's
>straightforward enough... but you better make sure that you are not
>executing any instructions that get trapped and emulated via user code
>(not kernel code, that has its own stack) in that prologue. Better
>make sure that the compiler isn;t scheduling code into that
>prologue. I.e. make sure that the compiler treats that prologue as an
>atomic unit.

The minimal guarantee must include enough space so your callees can
invoke the stack overflow handler. Typically the guarantee also
includes enough to allow most leaf procedures to run without any
stack overflow checks.

>What about signals? Can signals run on the user stack? Probably best
>to make sure that they run on a different stack.

Signals are a pain because the Unix model with signals running on
your current stack by default is horribly broken. On x86/amd64
our Erlang runtime system uses sigaltstack() to force all signal
handlers to a separate stack, but that requires libc-specific code
since we also have to catch and override sigaction() calls in libc
and other libraries. It's doable but very icky.

The only alternative is to either waste a GPR for a simulated stack
(which kills performance on x86/amd64 since you then also lose
return stack branch prediction), or to always have another page
or so of "slack" at the bottom of the stack for signal handlers.

Joe Seigh

unread,
Dec 4, 2005, 10:19:38 AM12/4/05
to
Andy Glew wrote:
>
> What about signals? Can signals run on the user stack? Probably best
> to make sure that they run on a different stack.

To say that signals (in unix) are problematic is a wild understatement.
Unix signals are an anachronism that predate threads and are all but
impossible to have interact correctly with threads assuming that you
care about niggling little details like program correctness. If you
look at them as a really primitive form of user level threading, the
fact that all the major unix vendors have abandoned user level threads
in favor of kernel level threads should tell you something.

So whatever stack solution you consider, its effect on signal handling
efficiency should not be a major consideration. In fact if you can
degrade signal handling somewhat, it would be a good thing if it
discouraged the use of signals.

Someone will point out that there is a need for low latency dispatching
of high priority "tasks" but signals are not the only way to accomplish
that and are a really bad way to do it.


>
> What about "events" in general? Must all event classes have dedicated
> stacks?
>

What do you mean by "events"?
