An alternative memory management technique

Piotr Wyderski

unread,

Feb 28, 2003, 5:24:17 AM2/28/03

to

Hello,

here's my latest idea: each processor in the system
has the PAE extension enabled. There's one PDPT
table (it has 4 entries) per processor. Virtual address
space switching is performed by remapping the page
directory bases (i.e. one needs to replace just three
memory entries, one entry is reserved for the kernel).
It allows:

1. Easy implementation of per-CPU private memory
areas, each processor has a separate kernel page
directory.

2. Constant stack base address: each kernel stack
is mapped at the same address, to switch a thread
we need only to remap that mapping and invalidate
it using invlpg. It simplifies many things, for example
it allows to connect fast address space switching
(through the code/data segment base address updates)
and kernel-mode preemption. The other approaches
don't have this feature, because there's not enough
virtual address space to map all the kernel stacks.
In my OS each stack has 16KiB, there could be
up to 2^20 threads, so it would need 16 GiB of
virtual memory -- pure nonsense on IA-32. Kernel
mode promotions are supported too (i.e. it's when
a thread placed in a small address space is switched
to use it's full address space, through the PDBR
updates).

3. It allows to use 36-bit addresses as a side effect.

4. User-mode memory management remains unchanged,
becaue PDE-s/PTE-s are shared, just the PDPT is CPU-private.

Could you please comment this approach, especially
why I should not use this approach? :-)

Best regards
Piotr Wyderski

KVP

unread,

Feb 28, 2003, 9:08:34 AM2/28/03

to

"Piotr Wyderski" <piotr.wyde...@hoga.pl> wrote in message

This is the PAE based implementation of my 4 process idea. At least it is
very similar. My idea was to divide the address space into 4 areas, and
map different processes into them. This way, one could speed up message
copying, by mapping two processes and the kernel in at the same time.
I tought to implement it with the old pd setup, but I had to modify 256
pd entries to map in a process, so it wasn't good enough.

Btw: Piotr, what kind of kernel architecture are you using? Because if you
plan to implement message passing with a microkernel, then you don't need the
preemptable kernel threads... (because the kernel can use cooperative
sheduling, and you can save the state of the copy process in 2 dwords) Or you
have given up the idea, and you are making a monolithic kernel?

Viktor

w m r

unread,

Feb 28, 2003, 9:21:45 AM2/28/03

to

"Piotr Wyderski" <piotr.wyde...@hoga.pl> wrote in message news:<b3nddg$1ak$1...@nemesis.news.tpi.pl>...

> Hello,
>
> here's my latest idea: each processor in the system
> has the PAE extension enabled. There's one PDPT
> table (it has 4 entries) per processor. Virtual address
> space switching is performed by remapping the page
> directory bases (i.e. one needs to replace just three
> memory entries, one entry is reserved for the kernel).

I think you'd still have to reload CR3 so as to invalidate
the old bad entries.

> It allows:
>
> 1. Easy implementation of per-CPU private memory
> areas, each processor has a separate kernel page
> directory.
>
> 2. Constant stack base address: each kernel stack
> is mapped at the same address, to switch a thread
> we need only to remap that mapping and invalidate
> it using invlpg. It simplifies many things, for example
> it allows to connect fast address space switching
> (through the code/data segment base address updates)

I don't know what you mean by 'connect fast address space switching'

> and kernel-mode preemption. The other approaches

I do that with kernel stacks at different VA's.
I just have to set the TSS entry for the top of the
thread's kernel stack each time I switch threads.

> don't have this feature, because there's not enough
> virtual address space to map all the kernel stacks.
> In my OS each stack has 16KiB, there could be
> up to 2^20 threads, so it would need 16 GiB of
> virtual memory -- pure nonsense on IA-32.

The day I need 1,000,000 threads, I'll splurge and buy
a real 64-bit cpu. Then, of course, I'll have to buy
the 16GiB memory just for the stacks, let alone memory
for any applications (at up to 1MB per user stack per
thread).

> Kernel
> mode promotions are supported too (i.e. it's when
> a thread placed in a small address space is switched
> to use it's full address space, through the PDBR
> updates).

> 3. It allows to use 36-bit addresses as a side effect.
>
> 4. User-mode memory management remains unchanged,
> becaue PDE-s/PTE-s are shared, just the PDPT is CPU-private.

But they're 8-bytes long instead of 4.

> Could you please comment this approach, especially
> why I should not use this approach? :-)

Now you're manipulating quadword instead of longword pagetable entries.
I don't know if the CPU has to do more work, I would think it would
have to read twice as much memory for the same number of pagetable entries.

You can always try it and see how it performs!

> Best regards
> Piotr Wyderski

Colin LeMahieu

unread,

Feb 28, 2003, 12:07:04 PM2/28/03

to

>It simplifies many things, for example
>it allows to connect fast address space switching
>(through the code/data segment base address updates)
>and kernel-mode preemption.

I'm going to have to disagree that this is "Fast" switching. When
people talk about reloading cr3 and how long it takes, it isn't
because it takes a lot of time to move a number in to that register.
It isn't because it takes a long time to flush tlb entries, it's
because it takes a "long time" to read page table entires out of
memory and put them in the tlb. As far as I understand this process
doesn't change that at all. When you're switching out 1/4 or 3/4 of
your address space, you're invalidating a huge portion of it, which by
the way would all have to be invalidated by invlpg, which in itself
would take a long time to execute.

Piotr Wyderski

unread,

Feb 28, 2003, 4:21:53 PM2/28/03

to

w m r wrote:

> I think you'd still have to reload CR3 so as to invalidate
> the old bad entries.

Of course, but the main goal is to provide processor-private
page directory, not to speed up process switching.

> I don't know what you mean by 'connect fast address space switching'

Fast address space switching is not related with virtual memory management,
it's done by the segment base/limit remapping -- this way you can keep
several
different processes inside a single address space. Complete address space
switching is necessary only when you can't cover your address space by
smaller segments. L4 uses this approach. "Connect" = keep these two features
together; "kill two birds with one stone".

> I do that with kernel stacks at different VA's.
> I just have to set the TSS entry for the top of the
> thread's kernel stack each time I switch threads.

Yes, this is classical approach. But this way I can't provide
the necessary address space for all stacks -- 16 GiB is needed.

> The day I need 1,000,000 threads, I'll splurge and buy
> a real 64-bit cpu.

My OS is designed to support 2^10 processes with 2^10
threads each. Fast address space switching with kernel-mode
preemption need all the kernel-mode stacks to be mapped
within a single address space.

> But they're 8-bytes long instead of 4.

Yes, but everything but representation remains unchanged.

Best regards
Piotr Wyderski

Piotr Wyderski

unread,

Feb 28, 2003, 4:27:07 PM2/28/03

to

Colin LeMahieu wrote:

> I'm going to have to disagree that this is "Fast" switching. When
> people talk about reloading cr3 and how long it takes, it isn't
> because it takes a lot of time to move a number in to that register.

No, fast switching is not related with paging -- it's a special
segmentation scheme, see my answer for Mike. Pardon, if
my text was vague in this respect. The problem is because
of that fast switching approach and kernel-mode preemption
need all stacks to be accessible from each address space.

Best regards
Piotr Wyderski

Piotr Wyderski

unread,

Feb 28, 2003, 5:05:09 PM2/28/03

to

KVP wrote:

Hm, I'll explain problem. First of all, I need:

1. Kernel-mode preemption
2. Fast address space switching (based on segments)
3. Support for SMP machines

(1) and (2) needs all the kernel-mode stacks are accessible
within every address space. 4 GiB is not enough for this
purpose, so I can't use the well-known approach based on
an array of stacks -- in other words, this way is not scalable.
When you map the stack on the fly, you have virtually
unlimited number of kernel stacks mapped the same time.
This would be a good way for uniprocessors, but (3) kills
it -- several processors could map different stacks at the
same address (a simple corollary from Dirichlet's compartment
theorem) the same time, causing system panic. It's possible to
break this limitation using two similar techniques:

1. Through PAE; remapping 3 entries of PDPT on the fly
and then invalidating TLB via CR3.

2. By providing one page directory per cpu and per process.
Almost all memory is mapped identically (i.e. all these page
directories share page tables) except of a small portion, used
for stack mapping purposes (and, as a side effect, for processor
-private data, so fs/gs PEB segments are not needed). There's
no difference at user level (because page tables are shared),
but the switcher must be extended and choose an appropriate
page directory for a given CPU. This thread is about exactly
this problem, not about message passing/thread switching
optimizations. :-)

To Mike: as you can see, there are two ways of thread
switching. The first one uses an array of stacks and you
need to patch the TSS entry. The second one uses the
same _virtual_ address for each stack (no patching is needed),
but you need to update the _mapping_ when switching
threads. I was using the first approach, but it seems it's too
weak for my requirements. Imagine an SMP machine and
draw this situation on a sheet of paper -- you'll see where's
the problem with mapping collisions.

> Btw: Piotr, what kind of kernel architecture are you using?

Still the same old good microkernel with message-based
communication scheme and user-mode drivers.

> Because if you plan to implement message passing with a microkernel,
> then you don't need the preemptable kernel threads...

Then how are you going to implement fully parallel message passing
with one-/zero-copy optimizations? This way needs accessing PDE-s
and PTE-s, so user-mode code would be able to bypass protection
domains. That's the reason my message passing subsystem resides
in the kernel.

> (because the kernel can use cooperative sheduling, and you can save
> the state of the copy process in 2 dwords)

And what's with its real-time features? You can't lock the kernel
for a long time, so your kernel-mode code must be preemptable.

> Or you have given up the idea, and you are making a monolithic kernel?

No way... ;-)

Best regards
Piotr Wyderski

Colin LeMahieu

unread,

Feb 28, 2003, 6:52:31 PM2/28/03

to

I think the problem with asking your question here is that people
don't know what you're talking about. I'd have to say I'm well versed
in how the x86 architecture works yet I can't figure out what you're
saying and I don't think the other two people who responded do either.

Perhaps if you re-think your question to us, restate it, and state
your goals we would be able to give good feedback.

KVP

unread,

Mar 1, 2003, 9:03:40 AM3/1/03

to

"Piotr Wyderski" <piotr.wyde...@hoga.pl> wrote in message

Your idea seem very similar to light weight processes in some unix
implementations. When you create a new thread, you just duplicate
you kernel stack, and everything is set up. The execution will continue
at the same address in the same function. Even the user level return
frame is set up correctly. I assume, that you share the per process
resources on a per process base, and the per thread resouces are private.

This is a good idea, and in this case, mapping the kernel stacks seem to
be the best solution.

However, I have to share all kernel data, (except per cpu ones) since my
kernel uses the process and thread datas of different processes at the
same time, so I must have them mapped in. This is done to handle various
cross process queues in my message passer.

> To Mike: as you can see, there are two ways of thread
> switching. The first one uses an array of stacks and you
> need to patch the TSS entry. The second one uses the
> same _virtual_ address for each stack (no patching is needed),
> but you need to update the _mapping_ when switching
> threads. I was using the first approach, but it seems it's too
> weak for my requirements. Imagine an SMP machine and
> draw this situation on a sheet of paper -- you'll see where's
> the problem with mapping collisions.

This is very simple, and when you use partly separate kernel spaces for
every process, then this will happen on a process switch too. Actually,
this way the overhead is the same.

> > Btw: Piotr, what kind of kernel architecture are you using?
> Still the same old good microkernel with message-based
> communication scheme and user-mode drivers.

Ok, but a few years ago we discussed the message passing model, and there
were two main techniques: copy based and remapped. Which one did you choose?

> > Because if you plan to implement message passing with a microkernel,
> > then you don't need the preemptable kernel threads...
> Then how are you going to implement fully parallel message passing
> with one-/zero-copy optimizations? This way needs accessing PDE-s
> and PTE-s, so user-mode code would be able to bypass protection
> domains. That's the reason my message passing subsystem resides
> in the kernel.

Actually, I use page remapping (zero copy), and my kernel has maximal
execution times defined. The accuracy of the kernel timing will be the
worst case time for a given function. This is a guaranteed precision.

About the message pass times: The worst case is when you pass your whole
user mode address space in one call. This is 2 Gb on the x86, and 512 page
tables. The required memory movements for the case when you pass all but 2
pages are: 2046 ptes and 510 pdes. (the first and last not fully passed
page tables and the rest between them) This is not even 12 Kb... this is
the amount of data that have to be moved to pass approx. 2 GB around. Imho,
this can be done under a very short time, and it can be done with irqs
disabled... (for smaller messages, this time is even smaller)

> > (because the kernel can use cooperative sheduling, and you can save
> > the state of the copy process in 2 dwords)
> And what's with its real-time features? You can't lock the kernel
> for a long time, so your kernel-mode code must be preemptable.

Currently my kernel is used in a way that only one thread can enter at a
time, but the trick is that the kernel has only finite time functions, so
every iterative operation is placed outside. I designed the code and the
data structures in a way, that it can use locking, but I found out, that
the per structure locking takes as long as the whole entry/exec/exit
sequence, so it's simplier to just leave them out and lock on entry.
(supported by hw. on the single cpu x86)

> > Or you have given up the idea, and you are making a monolithic kernel?
> No way... ;-)

Good to hear this!!! Viktor

Piotr Wyderski

unread,

Mar 1, 2003, 10:02:07 AM3/1/03

to

KVP wrote:

> I assume, that you share the per process resources on a per
> process base, and the per thread resouces are private.

Generally yes, but fast switching can cause this information to be even
more condensed -- one process can access control data structures of
several other processes.

> This is very simple, and when you use partly separate kernel spaces for
> every process, then this will happen on a process switch too.

Hm, _what_ will will happen on a process switch? The
problem with the kernel-stack allocation? -- it's impossible.

> Actually, this way the overhead is the same.

Almost, invlpg uses ~100 cycles, while movl $...,%esp takes 0,5.

> Ok, but a few years ago we discussed the message passing model, and there
> were two main techniques: copy based and remapped. Which one did you
choose?

Both of them. :-) There are two main cases:

1. Short messages (up to 8 bytes), they are transferred through the thread
control block; they don't need any memory access operations.

2. Longer messages (8 bytes ... 3 GiB). Each transferred block has
known base address S and the end address E. So, if S and E are
page-aligned, fast mapping + some privilege adjustment can be used
(this is my implicit shared memory). The "marigins" between S, E
and their aligned representations S` and E` are passed using the
alloc&copy method. As you can see, this method uses two schemes,
zero- and single copy, accordingly to the user-mode data layout.

There's also an orthogonal set of transfer qualifiers, {in, in-out, out}.
A message sent in "out" mode is the most intuitive -- the receiver
can only read it. "In" is a complementary mode -- you send an empty
buffer and the receiver must fill it with some data. "In-out" is
a compilation of these two: you send something and expect the
answer in the same buffer (e.g. the receiver must increment each
byte of data -- not realistic, but a clear example). Performed
optimizations depends on the chosen transfer mode -- for IN and
OUT the COW technique can be easily used, OUT also supports
on-the-fly memory allocation. But, of course, they're only technical
details. BTW, there's no possibility to send something behind the
system's back (my specification marks several special cases as
'undefined'), so it could be easily scaled to a distributed system.

> Actually, I use page remapping (zero copy), and my kernel has maximal
> execution times defined. The accuracy of the kernel timing will be the
> worst case time for a given function. This is a guaranteed precision.

This way simplifies your design a lot, but IMO kernel-mode preemption
is a very powerful fetaure, so I want to keep it. I was even thinking about
lock-free data structures inside the kernel, but they are economically
unpermitted, so I still use the classic spinlock-based ones...

Best regards
Piotr Wyderski

Maxim S. Shatskih

unread,

Mar 1, 2003, 5:34:38 PM3/1/03

to

Looks good.

"Piotr Wyderski" <piotr.wyde...@hoga.pl> wrote in message

news:b3nddg$1ak$1...@nemesis.news.tpi.pl...

Maxim S. Shatskih

unread,

Mar 1, 2003, 5:42:11 PM3/1/03

to

> 1. Kernel-mode preemption
> 2. Fast address space switching (based on segments)
> 3. Support for SMP machines
>
> (1) and (2) needs all the kernel-mode stacks are accessible
> within every address space. 4 GiB is not enough for this
> purpose, so I can't use the well-known approach based on
> an array of stacks -- in other words, this way is not scalable.
> When you map the stack on the fly, you have virtually
> unlimited number of kernel stacks mapped the same time.

Do you want to manipulate TLB cache on each thread switch, even within
the same process? Can be slow.

Max

Piotr Wyderski

unread,

Mar 1, 2003, 6:17:32 PM3/1/03

to

Maxim S. Shatskih wrote:

> Do you want to manipulate TLB cache on each thread switch, even within
> the same process?

Yes, but I need to invalidate just one entry (the stack mapping page).
Remaining 12 KiB of thread-private storage don't need to be updated,
one invlpg will do.

> Can be slow.

Even in this case?

Best regards
Piotr Wyderski

KVP

unread,

Mar 1, 2003, 6:43:34 PM3/1/03

to

"Piotr Wyderski" <piotr.wyde...@hoga.pl> wrote in message

> KVP wrote:
> > I assume, that you share the per process resources on a per
> > process base, and the per thread resouces are private.
> Generally yes, but fast switching can cause this information to be even
> more condensed -- one process can access control data structures of
> several other processes.

This means, that you have the kernel stacks as private data, but other
datas are shared. (at least on the process level)

> > This is very simple, and when you use partly separate kernel spaces for
> > every process, then this will happen on a process switch too.
> Hm, _what_ will will happen on a process switch? The
> problem with the kernel-stack allocation? -- it's impossible.

No. I meant, that you have to invalidate pages for a thread switch and
a process switch too. This makes the thread switch as fast as a process
switch.

> > Actually, this way the overhead is the same.
> Almost, invlpg uses ~100 cycles, while movl $...,%esp takes 0,5.

See above... I was speaking about the problems with per thread private
datas. And yes, the shared space kernel stack method is much faster.
I have all kernel data (incl. pagetables) mapped in all the time. This
simplifies message passing and the descriptor structures can be accessed
from any context. (even from the irq level)

> > Ok, but a few years ago we discussed the message passing model, and there
> > were two main techniques: copy based and remapped. Which one did you
> > choose?
> Both of them. :-) There are two main cases:
> 1. Short messages (up to 8 bytes), they are transferred through the thread
> control block; they don't need any memory access operations.

Ok, these are pulses. I have them too.

I have two ways to pass messages:
-send them as messages through connections (architeture independent)
-send them as pulses to local threads (only one unsigned int is passed)

The second works only on the local machine, and requires access (supervisor
or parent process) rights to send a pulse to a thread in another process.

> 2. Longer messages (8 bytes ... 3 GiB). Each transferred block has
> known base address S and the end address E. So, if S and E are
> page-aligned, fast mapping + some privilege adjustment can be used
> (this is my implicit shared memory). The "marigins" between S, E
> and their aligned representations S` and E` are passed using the
> alloc&copy method. As you can see, this method uses two schemes,
> zero- and single copy, accordingly to the user-mode data layout.

I just remap the pages belonging to a message. The start offset is
adjusted to the target page offset, but the intra page offset is kept.
This happens, because both send() and recv() need an offset and a length.
If the receive area is too small, the error is reported according to the
settings. (fail/notify sender/notify receiver/partial pass)

> There's also an orthogonal set of transfer qualifiers, {in, in-out, out}.
> A message sent in "out" mode is the most intuitive -- the receiver
> can only read it. "In" is a complementary mode -- you send an empty
> buffer and the receiver must fill it with some data. "In-out" is
> a compilation of these two: you send something and expect the
> answer in the same buffer (e.g. the receiver must increment each
> byte of data -- not realistic, but a clear example). Performed
> optimizations depends on the chosen transfer mode -- for IN and
> OUT the COW technique can be easily used, OUT also supports
> on-the-fly memory allocation. But, of course, they're only technical
> details. BTW, there's no possibility to send something behind the
> system's back (my specification marks several special cases as
> 'undefined'), so it could be easily scaled to a distributed system.

I just move the pages. (unmap from the sender and map to the receiver)
It's easy, and you can insert any number of proxies, since the message
is really travelling through the system. (when you send something, you
loose all control over its memory, but you will get a response in exchange)

> > Actually, I use page remapping (zero copy), and my kernel has maximal
> > execution times defined. The accuracy of the kernel timing will be the
> > worst case time for a given function. This is a guaranteed precision.
> This way simplifies your design a lot, but IMO kernel-mode preemption
> is a very powerful fetaure, so I want to keep it. I was even thinking about
> lock-free data structures inside the kernel, but they are economically
> unpermitted, so I still use the classic spinlock-based ones...

I have three main methods:
-bitmap based allocation (for physical memory) => can be lockfree (bts)
-handle based descriptor strutures => can be lockfree (cmpxchg)
-double linked list based message/thread/process/connection/etc. queues
=> needs locking

Viktor