After reading the book "Operating System Concepts" I decided my
physical allocator would be a stack of free frames. So whenever I
needed a new page I'd call phys_aloc() and it would return the address
of a free frame.
When I started to read about malloc, I discovered the numerous
algorithms and different ways to allocating memory. And I ended up
reading that I would need continuous chunks of physical memory in
order to talk to some hardware. So I really shouldn't let the physical
memory get fragmented.
I thought about implementing the buddy system for that. So, in order
to keep the simple interface of the physical_allocation module, I
decided to pre-allocate (by using a static array) the buddy
allocator's metadata separately from the memory blocks being
allocated. The minimum size for a block will be the size of a page
(4KB). That way I don't have to mess with virtual address mapping in
the physical allocation module (which I feel would overly complicate
matters).
Using buddy allocator doesn't seem to work as smoothly as I thought it
would. First, the page replacement algorithms don't seem to be aware
of the buddy allocator. But I guess that wouldn't be too much of a
problem since the replacement algorithm only has to be called when the
buddy allocator can't do anything anymore (the whole memory is full).
Second, I've been reading the paper "Page Placement Algorithms
Real-Indexed Caches" by R. E. KESSLER and MARK D. HILL. It talks about
page coloring and it also seems to assume I'll have something like a
stack of free pages instead of a buddy allocator dealing with physical
allocation.
That makes me think I'm probably going on the wrong track. Does anyone
have suggestions on what path to follow? Perhaps give me some tips
regarding the design of the memory management module. Bear in mind
that I'm still in a very early stage of development, but I'm trying to
understand the whole context of things so I can get the most things
right as I can from the beginning.
[]'s
Rafael
So far so good.
> After reading the book "Operating System Concepts" I decided my
> physical allocator would be a stack of free frames. So whenever I
> needed a new page I'd call phys_aloc() and it would return the address
> of a free frame.
Yep, very simple and fast: push, pop a number, physical address of a
4KB block of RAM.
> When I started to read about malloc, I discovered the numerous
> algorithms and different ways to allocating memory.
Surely. Different applications have different requirements (speed vs
space).
> And I ended up
> reading that I would need continuous chunks of physical memory in
> order to talk to some hardware. So I really shouldn't let the physical
> memory get fragmented.
Wait a minute. Unless the system is horribly broken or awkwardly
designed, there shouldn't be several devices (including RAM)
addressable by the same physical address, they shouldn't share the
same address range unless it's a scarce resource, which it's not on
x86. Don't you think so?
> I thought about implementing the buddy system for that.
For what exactly? I thought you wanted to use a stack to keep track of
the physical memory.
You should separate the two concepts: the physical address space and
the virtual address space. Devices and physical memory live in the
former, but can be made to appear anywhere in the latter.
Usually there're some restrictions on the physical address space (e.g.
hardwired addresses) but fewer on the virtual address space (unless
the CPU itself treats different subranges of it differently, e.g. here
lies the kernel and here lie applications and at this magical virtual
address some CPU's onchip device is).
For example, there're special ranges in the physical address space for
devices using the old 16-bit DMA chip for DMA transfers (accessible
only below 16 MB physical address), there's the fixed BIOS ROM
location and the video buffer. Right now I can't remember any
practical restriction on the x86 CPU virtual addresses other than the
v86 mode code must be below ~1MB virtual address.
So, basically, you have to take care of the old DMA -- you want to be
able to allocate some memory below 16MB physical, not much, though.
You can even preallocate it for this very purpose. The rest of the
physical memory (I really mean RAM, not ROM or device range) can be
mapped to anywhere in the virtual address space you like or need. How
exactly you carve up the virtual address space is up to you.
AFAIR, page coloring is simply a way to use the associative cache more
efficiently/predictably, to minimize the cache contention/unnecessary
loads and flushes.
The caching in question happens at the physical address level in L2
cache. With an N-way associative cache every memory location can be
cached in one of only N cache locations (which N ones depends on log2
(N) bits of the physical address). That is, the CPU is trying to find
cached data corresponding to some physical address in at most N
locations in the cache (the bigger N is, the more expensive is the
cache and slower search).
Such cache organization implies that if an application tries to use
use in a short succession more than N physical locations that share
the same log2(N)-bit value in their address (color) that is used
during the cache search, there will be constant cache contention since
N+1 or more locations won't fit into N. Therefore you want to make
uniform use of the colors to make sure the contention doesn't occur
when it can be avoided.
Alex
Yup. Memory for hardware should be physically mapped contiguously: no
virtual or paging addressing. You might also want to physically map below
1MB for BIOS, video BIOS, RM, DMA, BDA, IVT, etc. Also, languages, like C,
require their objects to be contiguous chunks of memory. These don't have
to be physically mapped. They just have to appear to be contiguous (using
virtual or paging addressing etc.).
> Does anyone
> have suggestions on what path to follow?
Nope. I've got some links to memory allocators and links to some .pdf's on
memory allocators. (below my .sig)
Rod Pemberton
--pdfs--
"Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel"
http://docs.freebsd.org/44doc/papers/kernmalloc.pdf
"A Scalable Concurrent malloc(3) Implementation for FreeBSD"
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
"Memory Management for High-performance Applications"
http://www.cs.utexas.edu/ftp/pub/techreports/tr02-52.pdf
"Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory
Multiprocessors"
http://www.cs.utexas.edu/ftp/pub/emery/papers/UTCS-TR99-22.pdf
"Dynamic Storage Allocation: A Survey and Critical Review"
http://citeseer.comp.nus.edu.sg/246520.html (click on PDF in upper right
cornert)
"Understanding the Linux Virtual Memory Manager"
http://www.csn.ul.ie/~mel/projects/vm/guide/pdf/understand.pdf
"Fast and Efficient Memory Management for Non-FIFO Routers"
http://research.microsoft.com/en-us/um/people/chguo/psdmm.pdf
"Memory Management in Modern Kernels"
http://www.cse.iitb.ac.in/~nadesai/seminar/report.pdf
--allocators--
Doug Lea's dlmalloc
http://gee.cs.oswego.edu/dl/html/malloc.html
John Walker's bget
http://www.fourmilab.ch/bget/
Dynamic Storage Allocator, Richard Harter, comp.lang.c, Nov. 11, 1990.
http://groups.google.com/group/comp.lang.c/msg/7da27dcbc6e2ace1
Only for DMA common buffers and bounce buffers.
Also, ISA DMA buffers must not only be physically mapped contiguously, but also below 16MB, and buffers for 32bit PCI devices without Dual Address Cycle feature must be below 4GB and physically contiguous.
--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com
On Jan 18, 11:51 pm, "Maxim S. Shatskih"
<ma...@storagecraft.com.no.spam> wrote:
> > Yup. Memory for hardware should be physically mapped contiguously:
>
> Only for DMA common buffers and bounce buffers.
>
> Also, ISA DMA buffers must not only be physically mapped contiguously, but also below 16MB, and buffers for 32bit PCI devices without Dual Address Cycle feature must be below 4GB and physically contiguous.
Yes (although ISA DMA also can't cross 64 KiB boundaries either).
However, this doesn't mean an OS needs to be able to allocate
contiguous pages from all of RAM.
For example, I usually use a free page bitmap to manage RAM in the
first 16 MiB of the physical address space (and use that for all
allocations that need to be contiguous), then use a set of free page
stacks to manage RAM from 16 MiB to 4 GiB, and another set of free
page stacks to manage RAM above 4 GiB; so that most of the time I'm
using (faster, more scalable) free page stacks instead of the (slower,
less scalable) free page bitmap.
Cheers,
Brendan
Hi Rod,
Added to the faq list at
http://aodfaq.wikispaces.com/#toc16
Thanks Rod.
Ben
Ben or James (or someone who is currently motivated in updating our FAQ...),
there are two invalid "geezer" links in the AOD FAQ. Unfortunately, Chris
Giese's website is no longer with us, but some mirrors are.
Two old links in AOD FAQ:
http://my.execpc.com/~geezer/johnfine/segments.htm
http://my.execpc.com/~geezer/index.htm
This site seems to be up:
http://redir.no-ip.org/mirrors/my.execpc.com/~geezer/johnfine/segments.htm
http://redir.no-ip.org/mirrors/my.execpc.com/~geezer/index.htm
Rod Pemberton
PS. I hope someone corrects "cornert" someday... ;-)
RP
IIRC this is true for 8-bit but 16-bit channels must stay within 128k
ranges.
>
> However, this doesn't mean an OS needs to be able to allocate
> contiguous pages from all of RAM.
>
> For example, I usually use a free page bitmap to manage RAM in the
> first 16 MiB of the physical address space (and use that for all
> allocations that need to be contiguous), then use a set of free page
> stacks to manage RAM from 16 MiB to 4 GiB, and another set of free
> page stacks to manage RAM above 4 GiB; so that most of the time I'm
> using (faster, more scalable) free page stacks instead of the (slower,
> less scalable) free page bitmap.
Why a "set" of free page stacks rather than just one? Page colouring?
James
Got these too.
Ben
Sorry about that.
Ben
It's easily updated but why the reticence to make the change yourself?
Perhaps politeness? In case it's due to unfamiliarity all that's
needed is to click on the "Edit This Page" button at the top of the
page, change the misspelled word and click "Save." What could be
easier? That said, it's a good idea to add a comment about the update
in the box at the bottom of the edit screen. "Fixed typo" or something
similar, perhaps.
James
Correct.
On Jan 19, 10:04 pm, James Harris <james.harri...@googlemail.com>
wrote:
There's several reasons (including page colouring), but I'll only
mention one more... Have you seen Intel's roadmap?
Imagine a 4-way motherboard with four 8-core CPUs, with 2 logical CPUs
per core. Do you really want to have 64 logical CPUs fighting for the
same "free page stack" re-entrancy lock? IMHO this just the beginning
(double the number of CPUs every 18 to 24 months to see what I'm
expecting in 10 years time).
If each free page stack had it's own separate re-entrancy lock, then
increasing the number of free page stacks will decrease the chance of
lock contention (which decreases the chance of starvation, etc, and
can help with cache line trashing if it's done right).
Note: For lock-free algorithms (instead of spinlocks) replace "lock
contention" with "repeated retries" (and then realize that nothing was
really gained). Wait-free algorithms would be perfect but in practice
they're impossibly difficult to implement.
Cheers,
Brendan
Well, I have question for you too. Anyway, I wasn't motivated... I
previously made a few additions and corrections. (IIRC, it'd log my current
IP... I don't have problem with the server or FAQ logging it. I have a
slight issue with it being publicly seen in the logs.) Besides, constant
snow removal here is draining my mental energy and psychological tenacity -
til February at least... :-) (Pfft! Global warming is when it's too hot
to get snow... Have you ever noticed that the liberal news media never
covers global warming in the middle of a harsh snowy winter? It seems to be
true even in places where they don't get snow! Even they realize the effect
is poorly named. At least the snow makes it easy, er... they're not easy to
see, um... easier to watch a few deer in the woods... Unfortunately,
they're skittish and can pinpoint a whisper at 100+ yards... behind
glass... motorized camera...) Well, I do have to thank you guys. Thanks
guys! A large percentage of the links there were from posts by me and added
by you or Ben or others... So, I have a question for you:
If it's easily updated, why the reticence from you guys to find and add
(your own) good links? :-) :-D
PS. I haven't forgotten the thread on the RTC. I'd like to finish it up.
I've been meaning to get back to it, but I just feel like taking my sweet
time right now... I'm still planning on seeing if I dl'd the schematic
somewhere. Ah, been playing a little Q3 lately.
Rod Pemberton
Haha, touché!
Understood re. the IP. As far as I know it is not recorded for anyone
who uses an ID. I know that's another small motvational hurdle to
overcome but you never know. One day those snows will clear. :-)
>
> PS. I haven't forgotten the thread on the RTC. I'd like to finish it up.
> I've been meaning to get back to it, but I just feel like taking my sweet
> time right now... I'm still planning on seeing if I dl'd the schematic
> somewhere. Ah, been playing a little Q3 lately.
Cool. Looking forward to hear your comments.
James
After reading the posts here (and some of the reference PDF suggested
on aodfaq), I really liked the way you managed the problem. I'm
curious about a few details, though.
Should 16MiB be enough? Is that a huge amount of memory for most cases
or could it be too little in certain situations? I'm developing my
first kernel, so I'm not sure what is the average size of those
buffers or how many of them are usually active at each instant. I'm
sure that, for my particular case, I could get away with even less
space, but I'm curious into knowing what is to be expected here.
How do you handle that hole in memory at 640KiB boundry? Are you
developing for ia-32? From what I've read, it seems that linux maps
most of the physical RAM it can on the upper 1GB of virtual addresses,
setting a small pool of virtual address that will point to unmapped
parts of physical memory as needed. That design looked weird to me.
Wouldn't that make linux's heap have kernel pages interleaved with
user pages it shouldn't touch?
Finally, isn't accessing contiguous pages on memory faster than
non-contiguous access due to pre-fetching?
[]'s
Rafael
On Jan 22, 12:23 pm, Rafael Cunha de Almeida <no-re...@example.com>
wrote:
> Brendan <btrot...@gmail.com> wrote:
> > For example, I usually use a free page bitmap to manage RAM in the
> > first 16 MiB of the physical address space (and use that for all
> > allocations that need to be contiguous), then use a set of free page
> > stacks to manage RAM from 16 MiB to 4 GiB, and another set of free
> > page stacks to manage RAM above 4 GiB; so that most of the time I'm
> > using (faster, more scalable) free page stacks instead of the (slower,
> > less scalable) free page bitmap.
>
> After reading the posts here (and some of the reference PDF suggested
> on aodfaq), I really liked the way you managed the problem. I'm
> curious about a few details, though.
>
> Should 16MiB be enough? Is that a huge amount of memory for most cases
> or could it be too little in certain situations? I'm developing my
> first kernel, so I'm not sure what is the average size of those
> buffers or how many of them are usually active at each instant. I'm
> sure that, for my particular case, I could get away with even less
> space, but I'm curious into knowing what is to be expected here.
16 MiB may be too small (OS unable to supply physically contiguous
pages) or it might be too large (OS using the slow bitmap instead of
faster free page stacks when it's not necessary). There's no good way
to accurately estimate the best size (the perfect size would be
different for different computers, and different for different OSs/
device drivers on the same computer). The following is my "not so
good" way of inaccurately estimating it... :-)
For ISA DMA there's four 8-bit channels (limited to 64 KiB transfers)
and three 16-bit channels (limited to 128 KiB transfers). If the
device drivers use one bounce buffer each then the maximum RAM you
could need for ISA DMA buffers is 640 KiB (or "4*64 + 3*128 KiB"). If
some device drivers use many DMA buffers or allocate/freee DMA buffers
dynamically then you could need a lot more. However, for modern
computers the ISA DMA isn't used by much (floppy drive controllers,
parallel ports and old sound cards?) and it's unlikely that all DMA
channels would be used. Managing the first 2 MiB of RAM with a bitmap
is probably plenty for ISA DMA alone (even though about 400 KiB of
this isn't usable - ROMs, etc).
In addition there may be some PCI devices that require contiguous
pages for bus-mastering/DMA. Most PCI devices don't (they have some
sort of scatter/gather controller and only need access to individual
pages), and to be honest I've never come across a PCI device that does
need contiguous pages. I allow some extra just in case, even though
I'm not sure if it's necessary or not.
The CPU itself never needs contiguous pages - the only thing that uses
physical pages is data structures used for paging (which are 1 page
each anyway), and everything that uses mutliple pages (e.g. the GDT)
uses linear addresses (you need contiguous linear pages, not
contiguous physical pages).
I allow 16 MiB (which is probably 10 times more than needed for most
computers), but it's better to have too much than not enough.
I'm currently rewriting my code though - for the new version of my OS
I'm planning to make the size of the bitmap variable. It should be
easy to allow the OS to reduce the size of the bitmap at any time, and
allow the initial size of the bitmap to be determined during boot.
> How do you handle that hole in memory at 640KiB boundry? Are you
> developing for ia-32? From what I've read, it seems that linux maps
> most of the physical RAM it can on the upper 1GB of virtual addresses,
> setting a small pool of virtual address that will point to unmapped
> parts of physical memory as needed. That design looked weird to me.
IMHO the way Linux does physical memory management (especially the
area of the physical address space that's mapped into the virtual
address space) sucks for a number of reasons (but these reasons didn't
apply when Linux was originally designed, and once you've got working
code it's tempting not to break it).
> Wouldn't that make linux's heap have kernel pages interleaved with
> user pages it shouldn't touch?
Yes, but that only really matters if the kernel is buggy, and there's
very little you can do to properly protect against kernel bugs (except
shifting millions of lines of potentially buggy code into user-space).
> Finally, isn't accessing contiguous pages on memory faster than
> non-contiguous access due to pre-fetching?
The CPU's "hardware prefetcher" (for Pentium 4, Xeon and Pentium M -
not sure about other CPUs) never crosses a 4 KiB boundary anyway,
regardless of what you do with paging (e.g. even if paging is
disabled, AFAIK). For explicit prefetch the CPU prefetches what you
tell it to prefetch (regardless of where it is).
In theory, in some situations (depending on the exact hardware being
used) the CPU might be able to access RAM faster if the pages are
physically contiguous due to the way RAM chips are arranged in rows
and columns (e.g. instead of changing the row and the column, just
change column and skip the delays associated with changing the row).
There's lots of reasons that this is unlikely to matter though (things
that prevent it from happening and things that prevent it from
mattering if it does happen), and I've never seen it mentioned
anywhere.
If an OS doesn't support page colouring then using physically
contiguous pages can improve cache efficiency (which AFAIK is one of
the reasons Linux might support physically contiguous allocations),
but it's better to support page colouring instead.
Cheers,
Brendan
No, the kmalloc allocator guarantees that the physical page allocated for kernel pool will never be used in a VM subsystem for user-mode memory.
Windows before Vista (or even before 2003) used the same approach, BTW.
> Finally, isn't accessing contiguous pages on memory faster than
> non-contiguous access due to pre-fetching?
Prefetching is not allowed across page boundaries. So, not so.
With SDRAM, the memory controller can run burst transfers in this mode - RAS once, CAS once, then a burst of consecutive words.
With DDR, the data rate clock within the burst is 2/4/8 times higher then the memory's main clock.
> There's lots of reasons that this is unlikely to matter though (things
From CPU side - yes (it is too hard for a CPU to generate bursts). From DMA busmaster side - easily.