Hmm, AFAIK, QNX does not support demand paging...
> According to the book, with demand paging, when you
> load a program it throws page faults as different
> parts of it are executed. What I don't understand is
> how the page fault handler knows where to load the
> image from. I've seen it uses a field called vnode to
> read from, but does that mean the file is already open?
Yes.
> What happens if you delete the image and the program
> has been running for a while? This situation should be detected,
You can't delete a file that is open.
Rob
Surely.
> What happens if you delete the image and the program
> has been running for a while?
You cannot delete the memory-mapped file. Under some circumstances,
like a network failure, the file can disappear. In this case, the
program will sooner or later catch an unresolved page fault with an
"inpage error" (Win32) or SIGBUS (UNIX).
Max
No? Then when it loads a program, it loads the whole program?
>> According to the book, with demand paging, when you
>> load a program it throws page faults as different
>> parts of it are executed. What I don't understand is
>> how the page fault handler knows where to load the
>> image from. I've seen it uses a field called vnode to
>> read from, but does that mean the file is already open?
>
> Yes.
Oh, so when you execute the file, you have an Open("file.exe");
and then when the process is finished, you have the close?
Where could I get detailed information about this, particularly
in microkernels?
TIA.
Yes (again: AFAIK). Keep in mind that QNX is a real-time system.
With demand paging, how could it possibly guarantee deterministic
response times?
> Oh, so when you execute the file, you have an Open("file.exe");
> and then when the process is finished, you have the close?
Basically, yes.
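To make that concrete, here is a user-level sketch of the same idea
(illustrative, not any particular kernel's exec path): it is the
mapping, not the file descriptor, that keeps the vnode referenced
until the process goes away.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("file.exe", O_RDONLY);
        if (fd < 0) return 1;
        char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                 /* the fd can go away at once... */
        if (p == MAP_FAILED) return 1;
        /* ...but this touch faults the page in from the vnode */
        printf("%02x\n", (unsigned)(unsigned char)p[0]);
        munmap(p, 4096);           /* last reference dropped here */
        return 0;
    }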
> Where could I get detailed information about this, particularly
> in microkernels?
Not sure. Demand paging is IMHO not really a feature a microkernel
should provide (though I believe that Mach, for example, does). The
microkernel I'm most familiar with is L4 and it certainly does not
implement anything along those lines. Instead it provides primitives
allowing servers to implement demand paging. *How* this is done in
the server is very much like how it works with monolithic kernels, so
if you are looking for algorithms for demand paging, you might as well
look at Linux.
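For a feel of what such a server (or a monolithic kernel) does on a
fault, here is a toy sketch; all types and helpers are made up for
illustration, not L4's or anyone's real API:

    struct vma {
        uintptr_t start, end;    /* mapped virtual range            */
        struct vnode *vn;        /* backing object, held open       */
        off_t file_off;          /* file offset of 'start'          */
    };

    int handle_fault(struct vma *v, uintptr_t addr)
    {
        uintptr_t page = addr & ~(PAGE_SIZE - 1);
        off_t off = v->file_off + (page - v->start);
        void *frame = alloc_frame();              /* physical page   */
        vnode_read(v->vn, frame, PAGE_SIZE, off); /* fill from file  */
        map_page(page, frame, PROT_READ);         /* install mapping */
        return 0;                                 /* retry the access */
    }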
Rob
New programs (after deletion) could no longer open the file, but the
currently running program would still be able to fully use the file
contents (even including extending the file.) When the program exits,
the VM reference (for mmap) would be decremented to zero, along with
any file descriptors being released.
This problem becomes 'interesting' in the case of near-stateless
filesystems like NFS. The answer to this problem will be left to the
'reader' :-). I do seem to remember things like SIGBUS or somesuch in
cases where files just go away (for mmapped NFS stuff.) The various
notions of coherency (as implemented in FreeBSD) would also go away
when using filesystems like NFS.
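From user level this is easy to see; a minimal sketch (error
handling mostly omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[6] = "hello";
        int fd = open("tmpfile", O_RDWR | O_CREAT, 0600);
        if (fd < 0) return 1;
        unlink("tmpfile");       /* dirent gone, inode lives on    */
        write(fd, buf, 5);       /* still usable, even growable    */
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 5);
        printf("%.5s\n", buf);   /* prints "hello"                 */
        close(fd);               /* now the blocks are freed       */
        return 0;
    }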
John
Surely, this leads us to the more general question of the different
semantics of file deletion in UNIX and NT.
In UNIX, you can (generally) delete an opened file: unlink() will
succeed and the dirent will be deleted, but the inode and file blocks
will be freed only when all apps have closed the file.
In NT, deleting a file means: open it "for deletion", set the "delete
on close" flag, then close it.
DeleteFile() is just a shortcut around this. The usual Win32
semantics will just fail deletion of an opened file; even the dirent
will not be deleted. You can also mimic the UNIX semantics by passing
different sharing flags to the ZwCreateFile used for deletion.
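From Win32 this looks roughly like the sketch below; the key is
FILE_SHARE_DELETE on the first open, without which DeleteFile()
fails with a sharing violation:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("tmpfile", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE |
                               FILE_SHARE_DELETE,
                               NULL, OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;
        /* Sets the "delete on close" disposition; the name stays
           visible until the last handle is closed. */
        if (!DeleteFileA("tmpfile"))
            printf("delete failed: %lu\n", GetLastError());
        CloseHandle(h);          /* file actually disappears here */
        return 0;
    }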
Max
Another historical (interesting) difference would be the NFS vs. RFS
type remote filesystem design choices. Of course, there is a vast
spectrum of these kinds of choices.
One thing that I just found out about (between WinNT and FreeBSD) is
that FreeBSD tends NOT to map unused cache memory into kernel or
user space, while WinNT seems to do so. Apparently, the
cache memory in WinNT is also counted against mappable memory as
an allocated resource. I chose to avoid requiring cached memory to
be mapped because on X86 type machines (yes, this shows my bias), you
only have a 4GB address space, and other design decisions had defaulted
to sharing the kernel and user space in that 4GB. This meant that
caching could cost significant address space (if I hadn't decided to
keep the majority of cached memory unmapped.)
The disadvantage of my approach is that cached memory would sometimes
(not always) have to be remapped before use, EVEN for read/write
type operations. However, the most often used cache space would retain
its mappings, so the remap operations wouldn't always have to happen.
In a way, there was a 'local' cache memory that was mapped, and an
'extended' cache memory that might replace some of the local memory.
On the X86, it is much cheaper to remap memory than to incur the I/O
transfers that an address space limitation would otherwise require.
John
NT's cache manager (Cc) way (a rough C sketch follows the list):
- there is a large region of kernel virtual addresses for the cache
- it is divided into 128KB chunks
- file chunks - 128KB in size and 128KB aligned - are mapped to these
address chunks (VACBs). VACBs are allocated on demand.
- the CcCopyRead/Write routines - the same idea as Linux's
generic_file_read()/_write(), dunno the FreeBSD names - just map the
necessary region to a VACB and then memcpy() to/from the user
buffer.
- the copy can incur page faults in the cache's region, which are
serviced by the usual page fault path.
- there is no block-level disk cache at all. Only file streams are
cached, not disk volumes. There is a way of creating your own virtual
file streams.
- there is support for record-oriented metadata files (not user data,
sorry - NT is not VMS and does not support that yet, and neither
does UNIX), where a "record" is a part of a file which must be
contiguous in memory but can be discontiguous on disk (it can span
several fragmented clusters). Record size can be both > allocation
unit size and <= allocation unit size. NTFS relies on these
record-oriented files heavily - the log, the MFT and directories are
such.
- this record-oriented interface is very similar to UNIX's buffer
cache - CcSetDirtyPinnedData and such. You map some on-disk structure
into cache memory, work with it directly, possibly set it dirty,
then unmap it.
- nevertheless, PAGE_SIZE is an upper limit for Cc's record-oriented
support.
- to implement a mapped file - both user mappings and cache mappings -
an in-memory array called a "segment" is allocated by the MM. Its
entries are similar in structure to PTEs and are called "prototype
PTEs". Linux uses something similar for anonymous shared memory.
- prototype PTEs contain the physical page addresses of the pages
holding the file data. So, array indexing of the segment implements
the "find the page containing this part of this file" functionality.
Linux's way (again, a rough sketch follows the list):
- traditional buffer cache for metadata.
- the disk IO subsystem is implemented in terms of "buffers", so
artificial fake buffer heads are attached to a page to do page-based
disk IO. A historical lameness :-)
- no record-oriented support. It is anyway not needed for UNIX's FSs,
though NTFS/Linux can suffer from this a lot.
- for the file cache, there is a hash collection of "physical pages
hanging off this vnode". The physical page descriptor contains
"vnode" and "offset" fields.
- no VACBs.
- generic_file_read() works like this: take the page for this part of
the file by hash lookup; if found, map it to a virtual address and
memcpy() to the user. If not found, go through the full inpage path.
Then the next page.
- no prototype PTEs for mapped files (only for anonymous shared
memory); the "find the page containing this part of this file"
functionality is a hash table lookup.
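The same loop in Linux style, equally rough (names invented;
KSEG_BASE stands for the kernel's direct physical mapping):

    void generic_read_sketch(struct vnode *vn, off_t off, size_t len,
                             char *ubuf)
    {
        while (len > 0) {
            off_t page_off = off & ~((off_t)PAGE_SIZE - 1);
            struct page *pg = page_hash_lookup(vn, page_off);
            if (!pg)
                pg = do_inpage(vn, page_off);   /* full inpage path */
            /* "mapping" is just arithmetic: no PTE update, and
               unmapping is a no-op */
            char *kva = (char *)(pg->phys_addr + KSEG_BASE);
            size_t room = PAGE_SIZE - (off - page_off);
            size_t n = len < room ? len : room;
            memcpy(ubuf, kva + (off - page_off), n);
            ubuf += n; off += n; len -= n;
        }
    }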
Both OSes support clustered inpages and read-ahead, though NT is
limited to 64KB in this.
I do not remember the policy details of either OS - like the
percentage of physical memory allocated for the cache - off the top
of my head.
Now for NT's problems.
First, the PPTEs. They disallow creation of a huge and sparse file,
since the segment must be large enough to cover all the sparse spaces
too. So you cannot create a huge virtual file describing the whole
huge disk. This means problems with caching UNIX's indirection
blocks.
Second, the VACBs. Map/unmap is time-consuming. Linux's way requires
mapping a single page at a time, which is easily done by (phys_addr +
KSEG_BASE) and does not require PTE updates. Unmap is a no-op.
The only reason for VACBs to exist is supporting record-oriented
files with a record > PAGE_SIZE, but NTFS does not use those anyway.
Personally, I do not like NT's cache management and consider Linux's
to be better in the aforementioned aspects. Given my general pro-MS
bias, this means something :-)
According to some information from MS employees on forums, Cc just
does not fit at all for some filesystems, namely UNIX-style
filesystems with indirection blocks like UDF, so the UDF designer was
forced to code a huge workaround for Cc's lameness - namely the
dynamically patched "virtual file stream" describing all known
indirection blocks.
On the other hand, the lack of record-oriented support in Linux's
cache manager means problems implementing NTFS. Dunno about XFS,
ReiserFS and ext3 - maybe they use record-oriented metadata files
too.
Max
Never mind, John!
You can visit one of the NT kernel forums:
- NTDEV and NTFSD mailing lists on www.osr.com
- comp.os.ms-windows.programmer.nt.kernel-mode
- microsoft.public.development.device.drivers
On MS's newsgroups, you can see several MS people answering
questions. Also, three high-level developers from the NT kernel team
are on NTDEV.
Max
> Yes (again: AFAIK). Keep in mind that QNX is a real-time system.
> With demand paging, how could it possibly guarantee deterministic
> response times?
By only specifying/guaranteeing response times for code that's
executing from "locked" memory that can't be paged out. Real-time
guarantees don't have to apply to every piece of code in the system -
only the ones that actually need real-time response.
I don't know what the actual situation is with QNX - but I don't think
real-time and demand paging are mutually exclusive.
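On a POSIX-ish system, "locking" is just mlock()/mlockall(); a
minimal sketch:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* pin all current and future pages of this process;
           the rest of the system stays pageable */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ... deterministic-latency work happens here ... */
        munlockall();
        return 0;
    }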
--
Mike
The swappable regions of an address space should be considered as
cached _slow_ memory. Its speed equals the speed at which the pager
can swap pages in. This is the worst-case assumption, based on a
working set size of 1. Part of your code can be real-time, part of it
not. All you have to do is calculate the time you will need for an
operation to finish.
Example: a wave player that uses a memory image (a rough sketch in C
follows the list):
- the player's code is locked (unswappable)
- the data is swappable
- the current sound block is paged in (by locking it)
- the already-played blocks are unlocked
- this method guarantees that the player will always have a locked
buffer, but the whole data file can be larger than the available
memory
- with this scheme, you manage your own working set manually, which
gives you very precise control over page access delays
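Something like this (illustrative only: no error handling, sizes
assumed page-aligned, and the output call is hypothetical):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLOCK (64 * 1024)

    void play(const char *path, off_t filesize)
    {
        int fd = open(path, O_RDONLY);
        char *data = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE,
                          fd, 0);
        close(fd);
        for (off_t off = 0; off < filesize; off += BLOCK) {
            mlock(data + off, BLOCK);   /* page in and pin the block */
            /* play_block(data + off, BLOCK);  <- hypothetical       */
            if (off >= BLOCK)           /* release the played block  */
                munlock(data + off - BLOCK, BLOCK);
        }
        munmap(data, filesize);
    }

A real player would lock a block or two ahead of the play cursor so
the mlock() latency never lands on the audio path, but the sliding
window idea is the same.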
Viktor
I guess you can, iff secondary storage is a flash ROM or something
like that... I think in these cases demand paging would provide a
good "optimization".
> Not sure. Demand paging is IMHO not really a feature a microkernel
> should provide (though I believe that Mach, for example, does). The
Why not? Could you expand? (Yes, Mach provides support for
external pagers)
Regards,
David.
BTW, this whole discussion should be in 'comp.os.research'...
>> Not sure. Demand paging is IMHO not really a feature a microkernel
>> should provide (though I believe that Mach, for example, does). [
>> ... ]
> Why not? Could you expand?
One very good argument is that paging is a policy (that is, the
availability of paging itself, not the specific replacement algorithm),
not a mechanism.
There are several rather different uses of the label "microkernel",
and under some fairly popular ones Mach is not a microkernel, for one
thing.
Under one definition of the concept, a microkernel should only
provide mechanisms that allow the definition of different memory
occupation policies (fully resident, swapped, paged, paged and
swapped). KeyKOS (a descendant of which is called EROS, and there is
a delightful site with lots of documents about it), consistently with
this, did not even have process/task/thread scheduling in the
microkernel; it just provided mechanisms to define different
scheduling systems. I like that.
nospam> (Yes, Mach provides support for external pagers)
The external pagers are optional. By default it uses an internal
pager. Moreover, because of rather high overheads, Mach external
pagers tend to be used to provide ``virtual'' data space, leaving to
the built-in pager the nitty-gritty of demand paging data space
to/from disk (the usual form of backing store).