
mapping user space buffer to kernel address space


mdal...@in.ibm.com

Oct 13, 2000
to linux-...@vger.kernel.org
Hi,

I have a user buffer and I want to map it to kernel address space.
Can anyone tell me how to do this? In AIX we have xmattach.

thanks
daljeet

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Petr Vandrovec

Oct 13, 2000
to mdal...@in.ibm.com
On 13 Oct 00 at 15:31, mdal...@in.ibm.com wrote:

> I have a user buffer and I want to map it to kernel address space.
> Can anyone tell me how to do this? In AIX we have xmattach.

Look at mm/memory.c:map_user_kiobuf. It is used by drivers/char/raw.c
and by drivers/media/video/bttv-driver.c, for example. It is the 2.4 solution.

For a 2.2 solution, you have to:
(1) page the page into memory (get_user(c, (char *)addr)),
(2) walk the page tables yourself (p[gm]d_offset and pte_offset are 2.3 things),
(3) increment the page reference count (get_page(mem_map + pagenr)).
I believe that it is implemented by 2.2 drivers/char/bttv.c, and by
vmware's vmmon module.
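As a rough sketch, the three steps might look like the following (a hypothetical helper, not taken from any real driver; for readability it uses the p[gm]d_offset/pte_offset helpers, which on 2.2 you would have to open-code as noted above; error handling and page-table locking are omitted):

```c
/*
 * Hypothetical sketch of the three 2.2-era steps above; not from any
 * real driver. Error handling and page-table locking omitted.
 */
static unsigned long pin_user_page(unsigned long addr)
{
	char c;
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long pagenr;

	/* (1) touch the page so it is paged into memory */
	if (get_user(c, (char *) addr))
		return 0;

	/* (2) walk the page tables down to the pte */
	pgd = pgd_offset(current->mm, addr);
	pmd = pmd_offset(pgd, addr);
	pte = pte_offset(pmd, addr);
	if (!pte_present(*pte))
		return 0;

	/* (3) pin the page by bumping its reference count */
	pagenr = MAP_NR(pte_page(*pte));
	get_page(mem_map + pagenr);

	return pagenr;
}
```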
Petr Vandrovec
vand...@vc.cvut.cz

P.S.: Last week I wrote that incrementing page->count can cause trouble
under 2.4 kernels. I found that you do not have to increment the page
count; it is just that shared mappings are not correctly truncated at
ftruncate at all... So you can use the 2.2 solution in 2.4 too. But
kiobuf is not i386-specific, so it is the preferred solution.

Malcolm Beattie

Oct 13, 2000
to mdal...@in.ibm.com
mdal...@in.ibm.com writes:
> I have a user buffer and I want to map it to kernel address space.
> Can anyone tell me how to do this? In AIX we have xmattach.

In 2.2, you're better off providing a fake character device driver
which allocates the memory in kernel space and lets the user mmap it.
In 2.4, you could try out kiobufs which, if my reading of the section
marked "#ifdef HACKING", "case BTTV_JUST_HACKING" and also
/* playing with kiobufs and dma-to-userspace */
is correct, goes (modulo error handling) roughly like:

struct kiobuf *iobuf;

alloc_kiovec(1, &iobuf);               /* allocate a(n array of one) kiobuf */
map_user_kiobuf(READ, iobuf, va, len); /* userland vaddr and length;
                                          s/READ/WRITE/ for write */
/* now you have an iobuf containing pinned-down user pages */
...
lock_kiovec(1, &iobuf, 1);             /* lock pages down for I/O; the first
                                          1 is the vector count, the second
                                          means wait for the lock */
... /* do I/O on it */
free_kiovec(1, &iobuf);                /* does an implicit unlock_kiovec */

It doesn't do an unmap_kiobuf(iobuf) so I don't understand where
the per-page map->count that map_user_kiobuf incremented gets
decremented again. Anyone? Lowlevel I/O on a kiovec can be done
with something like an ll_rw_kiovec which sct said was going to get
put in but since I haven't read anything more recent than
2.4.0-test5 at the moment, I can't say if it's there or what it
looks like.

--Malcolm

--
Malcolm Beattie <mbea...@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services

Eric Lowe

Oct 13, 2000
to Petr Vandrovec
Hello,

> On 13 Oct 00 at 15:31, mdal...@in.ibm.com wrote:
>

> > I have a user buffer and I want to map it to kernel address space.
> > Can anyone tell me how to do this? In AIX we have xmattach.
>

> Look at mm/memory.c:map_user_kiobuf. It is used by drivers/char/raw.c,
> or by drivers/media/video/bttv-driver.c, for example. It is 2.4 solution.
>
> For a 2.2 solution, you have to:
> (1) page the page into memory (get_user(c, (char *)addr)),
> (2) walk the page tables yourself (p[gm]d_offset and pte_offset are 2.3 things),
> (3) increment the page reference count (get_page(mem_map + pagenr)).
> I believe that it is implemented by 2.2 drivers/char/bttv.c, and by
> vmware's vmmon module.
>

You can also apply the raw I/O patches to 2.2 and export the kiobuf
functions. I have a patch against 2.2.17 (with bugfixes); anyone who
needs it can e-mail me and I'll send it along. This provides a *nearly*
compatible solution to 2.4, and it's much cleaner than driver-specific
hacks.

--
Eric Lowe
Software Engineer, Systran Corporation
el...@systran.com

Linus Torvalds

Oct 13, 2000
to linux-...@vger.kernel.org
In article <CA256977.0...@d73mta05.au.ibm.com>,

<mdal...@in.ibm.com> wrote:
>
>I have a user buffer and I want to map it to kernel address space.
>Can anyone tell me how to do this? In AIX we have xmattach.

Note that it is usually MUCH better to do this the other way around:
having a kernel-side buffer, and mapping that into user space. I don't
understand why so many people always want to do it the wrong way around..
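For concreteness, the direction Linus recommends is what a character driver's mmap file operation provides; a hypothetical 2.4-era sketch (the driver name and buffer setup are made up, and error handling is omitted) might be:

```c
/*
 * Hypothetical sketch of "the other way around": the driver owns a
 * kernel buffer and lets user space mmap it. buf_phys is assumed to
 * be the physical address of a buffer allocated (and marked reserved)
 * at driver init; error handling is omitted.
 */
static unsigned long buf_phys;

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* map the kernel buffer's pages into the caller's VMA */
	if (remap_page_range(vma->vm_start, buf_phys, size,
			     vma->vm_page_prot))
		return -EAGAIN;
	return 0;
}
```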

Linus

Rogier Wolff

Oct 14, 2000
to Linus Torvalds
Linus Torvalds wrote:
> In article <CA256977.0...@d73mta05.au.ibm.com>,
> <mdal...@in.ibm.com> wrote:
> >
> >I have a user buffer and I want to map it to kernel address space.
> >Can anyone tell me how to do this? In AIX we have xmattach.
>
> Note that it is usually MUCH better to do this the other way around:
> having a kernel-side buffer, and mapping that into user space. I don't
> understand why so many people always want to do it the wrong way around..

Provided this wasn't a rhetorical question, I'll answer it for you.

It seems easier to people. Allocing megabytes of memory is "easy" from
userspace, and there are all kinds of restrictions on the kernel side.

Also, if you could lock the userspace buffer, the interface simplifies: the
"read" system call can lock the userspace buffer given, then
initiate the DMA and return when done. Easy.

If you do it the "right" way, the userspace application will have to
mmap the device and play signalling tricks with the device to
synchronize when there is data.

So people perceive it as easier to lock the userspace memory for the DMA
into it than to map the kernel buffer into userspace.
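From user space, the "right" way looks roughly like this (a sketch only; the device node and the synchronization ioctl are invented names, since the actual mechanism is driver-specific):

```c
/*
 * Hypothetical user-space side of the mmap-the-device pattern.
 * "/dev/mydev" and MYDEV_WAIT_DATA are made up for illustration;
 * a real driver defines its own node and synchronization mechanism.
 */
int fd = open("/dev/mydev", O_RDWR);
char *buf = mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);

ioctl(fd, MYDEV_WAIT_DATA);	/* wait for the driver to fill the buffer */
/* buf now holds the DMA'd data; no copy was needed */
```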

Roger.

--
** R.E....@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* Common sense is the collection of *
****** prejudices acquired by age eighteen. -- Albert Einstein ********

Eric Lowe

Oct 14, 2000
to Linus Torvalds
Hello,

> Note that it is usually MUCH better to do this the other way around:
> having a kernel-side buffer, and mapping that into user space. I don't
> understand why so many people always want to do it the wrong way around..

Well, I totally agree. Unfortunately, my fellow driver developers
seem to think "because it's in UNIX" means "it must be a good idea".
Pinning pages in memory leaves big holes in processes that can't
be swapped out on systems that really need the VM to be able
to work its magic under load.

Fortunately, VIA and similar schemes try to address the problem
in the application, OS, _and_ hardware simultaneously.. In the
future we will probably tell our customers (Fibre Channel) that if
they really want good performance, they need to use VI instead of
IP, and to avoid such hacks as pinning user pages for proprietary
protocols.. My experience has been that systems that do a lot
of page pinning in UNIX for I/O don't gain much in performance
anyway, but it _really_ puts a lot of stress on the system compared
to doing it the right way.


Linus Torvalds

Oct 15, 2000
to Rogier Wolff

On Sat, 14 Oct 2000, Rogier Wolff wrote:
> > Note that it is usually MUCH better to do this the other way around:
> > having a kernel-side buffer, and mapping that into user space. I don't
> > understand why so many people always want to do it the wrong way around..
>

> Provided this wasn't a rhetorical question, I'll answer it for you.
>
> It seems easier to people. Allocing megabytes of memory is "easy" from
> userspace, and there are all kinds of restrictions on the kernel side.

Yeah.

It's "easier".

Until you want to actually _map_ the thing.

At which point it turns a hell of a lot harder.

In short, it's a classic case of being STUPID. Taking something that looks
easy, and staying with the wrong approach long after it turns out that the
"easy" case wasn't the easy one after all.

Linus

Rogier Wolff

Oct 15, 2000
to Linus Torvalds
Linus Torvalds wrote:
>
>
> On Sat, 14 Oct 2000, Rogier Wolff wrote:
> > > Note that it is usually MUCH better to do this the other way around:
> > > having a kernel-side buffer, and mapping that into user space. I don't
> > > understand why so many people always want to do it the wrong way around..
> >
> > Provided this wasn't a rhetorical question, I'll answer it for you.
> >
> > It seems easier to people. Allocing megabytes of memory is "easy" from
> > userspace, and there are all kinds of restrictions on the kernel side.
>
> Yeah.
>
> It's "easier".
>
> Until you want to actually _map_ the thing.
>
> At which point it turns a hell of a lot harder.

Right. So YOU, the kernel god, have the overview to see in a flash that
this is the "hard" approach.

A way to design a good interface is to first start thinking about (or
even coding) the USER of the interface. If you start with the library
or "support functions" that you think you'll need, you're likely to
offload stuff to the user that could've been done "behind the
scenes" in the library or whatever.

Thus these people start by thinking about their user-level program.
They would like to do

buf = malloc (BUFSIZE);
read (dev, buf, BUFSIZE);

but due to kernel considerations this is not as easy as you'd like it
to be.

> In short, it's a classic case of being STUPID. Taking something that
> looks easy, and staying with the wrong approach long after it turns
> out that the "easy" case wasn't the easy one after all.

This IS the umpteenth time that you have answered this question. However,
it's a different person every time. If I or Alan or David were to attempt
this, it'd be STUPID.

People try to design the interface, and when they get to implementing
it, the first hurdle they can't clear by themselves is the first one
they post a question about on linux-kernel. That's always the same
question. It's not that they keep on trying after you tell them it's the
wrong approach.

I'm just saying that I understand how different people bump into the
same problem every time, and ask the same question. If you don't
understand it, that's fine with me. You understand the kernel better
than I do.


Roger.


Andrea Arcangeli

Oct 15, 2000
to Linus Torvalds
On Sat, Oct 14, 2000 at 07:17:57PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 14 Oct 2000, Rogier Wolff wrote:
> > > Note that it is usually MUCH better to do this the other way around:
> > > having a kernel-side buffer, and mapping that into user space. I don't
> > > understand why so many people always want to do it the wrong way around..
> >
> > Provided this wasn't a retorical question, I'll answer it for you.
> >
> > It seems easier to people. Allocing megabytes of memory is "easy" from
> > userspace, and there are all kinds of restrictions on the kernel side.
>
> Yeah.
>
> It's "easier".
>
> Until you want to actually _map_ the thing.
>
> At which point it turns a hell of a lot harder.

Just as proof that it's a lot harder: the mechanism implemented in
2.4.0-test10-pre3 in map_user_kiobuf to pin userspace pages has always been
buggy, and any driver using it (like rawio) will lead to memory corruption
or errors right now. I started looking into this after I got reports of
2.2.18pre15aa1 getting -EFAULT errors using Oracle with rawio.

The basic problem is that map_user_kiobuf tries to map those pages by
calling handle_mm_fault on their virtual addresses, assuming that when
handle_mm_fault returns 1 the page is mapped. That's wrong. handle_mm_fault
may have generated a readonly mapping, and it may want to give up and wait
for a new write-protection fault to happen as soon as we return from the
current page fault handler. Or the pte may be changed under us while we
slept, and so handle_mm_fault gives up only to see whether the fault happens
again or not (not because it finished its work).

In short the fact handle_mm_fault returns 1 doesn't mean the page
is currently mapped.

Furthermore, in 2.4.x SMP (this couldn't happen in 2.2.x SMP at least) the
page can also go away from under us, even assuming handle_mm_fault did its
work right by luck. And reading the code, it knows the page table lock was
necessary, but in case the race happens it returns -EFAULT, and that's buggy.

	spin_lock(&mm->page_table_lock);
	map = follow_page(ptr);
	if (!map) {
		spin_unlock(&mm->page_table_lock);
		dprintk (KERN_ERR "Missing page in map_user_kiobuf\n");
		goto out_unlock;
	}

It should instead try again until it succeeds in pinning the page in memory
(or, at worst, return -EAGAIN after many tries).

I'm also not convinced that only increasing the page count in the critical
section in map_user_kiobuf is enough, because swap_out doesn't care about
the page count (in 2.2.x rawio it's taking the page lock). The page count is
significant as a "pin" only for the page cache, and not for anonymous or shm
memory, for example. swap_out can mess with anonymous memory with a page
count > 1, for example. To avoid VM activity on the page we should grab the
lock atomically in the critical section protected by the page table lock. So
I think map_user_kiobuf is buggy in not taking the user page lock atomically
while pinning the page in memory. I left the lock/unlock_kiovec operations
only for everything that doesn't deal with userspace pages (for example skbs
or, more generally, kernel memory not visible to userspace or mapped as
reserved); they're not necessary while pinning userspace memory.

And since we must lock down the page to pin it in map_user_kiobuf, we hit
the same swapin_readahead deadlock that is now fixed in the 2.2.x rawio
patch. Really, that deadlock is a side effect of a silly thing that
swapin_readahead was doing all the time (in fact such a deadlock doesn't
happen with the readahead of the pagecache, only of the swapcache):
swapin_readahead wastes time on locked swap cache pages while doing
readahead, while it should only start up the I/O on the not-yet-present
ones. This potentially hurts swapin performance too, because we end up
waiting on the wrong page (potentially waiting many times more than
necessary).

Assume the user is doing rawio to write to disk. We'll generate a _read_
swapin fault in memory, and the page will remain in the swap cache because
it hasn't been changed and the pte is read only.

1) swapin page 0

2) swapin completes and page is mapped so we pin it down setting PG_locked
while we hold the pagetable lock

3) swapin page 1

4) swapin_readaround(page 1) does a lookup_swap_cache(page 0)
that hangs in find_lock_page(page 0) because we recursed on the
PG_locked that we set in point 2)

5) task unkillable in D state

I also noticed shm_swap is buggy because it adds shm pages to the swap
cache without the big kernel lock held (I added a FIXME in the patch; the
fix is trivial).

The patch fixes also some silly 2.4.x specific bug (missing wakeup and wrong
place for wait_on_page).

The patch is against 2.4.0-test10-pre3, and I checked that the rawio
regression test works correctly with the patch applied, as with
2.2.15pre18aa1 (it ran for one hour without showing any trouble). Then I
backed out the patch and ran the testsuite again on a clean
2.4.0-test10-pre3, and as I expected it reported memory corruption while
reading from disk after a few minutes, exactly as was happening in
2.2.18pre2aa2:

root@laser:/home/andrea > ./rawio-test
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Setting data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
Writing data to /dev/raw1
Checking written data
Invalidating data
Checking invalidated data
Reading from /dev/raw1
Checking read data
NOT MATCH (read Z) 472000
^^^^^^^^^^^^^^^^^^^^^^^^ memory corruption


Binary files 2.4.0-test10-pre3/ID and rawio-1/ID differ
diff -urN 2.4.0-test10-pre3/drivers/char/raw.c rawio-1/drivers/char/raw.c
--- 2.4.0-test10-pre3/drivers/char/raw.c Thu Oct 12 03:04:42 2000
+++ rawio-1/drivers/char/raw.c Sun Oct 15 17:00:07 2000
@@ -310,11 +310,6 @@
err = map_user_kiobuf(rw, iobuf, (unsigned long) buf, iosize);
if (err)
break;
-#if 0
- err = lock_kiovec(1, &iobuf, 1);
- if (err)
- break;
-#endif

for (i=0; i < blocks; i++)
b[i] = blocknr++;
diff -urN 2.4.0-test10-pre3/fs/buffer.c rawio-1/fs/buffer.c
--- 2.4.0-test10-pre3/fs/buffer.c Sat Oct 14 19:00:17 2000
+++ rawio-1/fs/buffer.c Sun Oct 15 16:32:22 2000
@@ -1923,6 +1923,7 @@
}

spin_unlock(&unused_list_lock);
+ wake_up(&buffer_wait);

return iosize;
}
@@ -2061,6 +2062,7 @@
__put_unused_buffer_head(bh[i]);
}
spin_unlock(&unused_list_lock);
+ wake_up(&buffer_wait);
goto finished;
}

diff -urN 2.4.0-test10-pre3/include/linux/mm.h rawio-1/include/linux/mm.h
--- 2.4.0-test10-pre3/include/linux/mm.h Sun Oct 15 17:49:26 2000
+++ rawio-1/include/linux/mm.h Sun Oct 15 17:51:11 2000
@@ -408,7 +408,7 @@
extern void mem_init(void);
extern void show_mem(void);
extern void si_meminfo(struct sysinfo * val);
-extern void swapin_readahead(swp_entry_t);
+extern void swapin_readaround(swp_entry_t);

/* mmap.c */
extern void merge_segments(struct mm_struct *, unsigned long, unsigned long);
diff -urN 2.4.0-test10-pre3/include/linux/pagemap.h rawio-1/include/linux/pagemap.h
--- 2.4.0-test10-pre3/include/linux/pagemap.h Sun Oct 15 17:38:48 2000
+++ rawio-1/include/linux/pagemap.h Sun Oct 15 17:49:28 2000
@@ -77,6 +77,11 @@
#define find_lock_page(mapping, index) \
__find_lock_page(mapping, index, page_hash(mapping, index))

+extern int __check_page(struct address_space * mapping, unsigned long index,
+ struct page **hash);
+#define check_page(mapping, index) \
+ __check_page(mapping, index, page_hash(mapping, index))
+
extern void __add_page_to_hash_queue(struct page * page, struct page **p);

extern void add_to_page_cache(struct page * page, struct address_space *mapping, unsigned long index);
diff -urN 2.4.0-test10-pre3/ipc/shm.c rawio-1/ipc/shm.c
--- 2.4.0-test10-pre3/ipc/shm.c Thu Oct 12 03:04:50 2000
+++ rawio-1/ipc/shm.c Sun Oct 15 17:51:15 2000
@@ -1414,7 +1414,7 @@
page = lookup_swap_cache(entry);
if (!page) {
lock_kernel();
- swapin_readahead(entry);
+ swapin_readaround(entry);
page = read_swap_cache(entry);
unlock_kernel();
if (!page)
diff -urN 2.4.0-test10-pre3/mm/filemap.c rawio-1/mm/filemap.c
--- 2.4.0-test10-pre3/mm/filemap.c Sat Oct 14 19:00:18 2000
+++ rawio-1/mm/filemap.c Sun Oct 15 17:47:41 2000
@@ -254,10 +254,23 @@
* up kswapd.
*/
age_page_up(page);
+
+ /* Oh, I feel bad with this horror. */
if (inactive_shortage() > inactive_target / 2 && free_shortage())
wakeup_kswapd(0);
not_found:
return page;
+}
+
+int __check_page(struct address_space * mapping, unsigned long index, struct page ** hash)
+{
+ struct page * page;
+
+ spin_lock(&pagecache_lock);
+ page = __find_page_nolock(mapping, index, *hash);
+ spin_unlock(&pagecache_lock);
+
+ return page != NULL;
}

/*
diff -urN 2.4.0-test10-pre3/mm/memory.c rawio-1/mm/memory.c
--- 2.4.0-test10-pre3/mm/memory.c Thu Oct 12 03:04:50 2000
+++ rawio-1/mm/memory.c Sun Oct 15 17:51:20 2000
@@ -385,7 +385,7 @@
/*
* Do a quick page-table lookup for a single page.
*/
-static struct page * follow_page(unsigned long address)
+static struct page * follow_page(unsigned long address, int write)
{
pgd_t *pgd;
pmd_t *pmd;
@@ -395,7 +395,8 @@
if (pmd) {
pte_t * pte = pte_offset(pmd, address);
if (pte && pte_present(*pte))
- return pte_page(*pte);
+ if (!write || pte_write(*pte))
+ return pte_page(*pte);
}

return NULL;
@@ -427,8 +428,12 @@
struct mm_struct * mm;
struct vm_area_struct * vma = 0;
struct page * map;
+ int doublepage = 0;
+ int repeat = 0;
int i;
- int datain = (rw == READ);
+ int write = (rw == READ); /* if we read from disk
+ it means we write
+ to memory */

/* Make sure the iobuf is not already mapped somewhere. */
if (iobuf->nr_pages)
@@ -443,10 +448,11 @@
if (err)
return err;

+ repeat:
down(&mm->mmap_sem);

err = -EFAULT;
- iobuf->locked = 0;
+ iobuf->locked = 1;
iobuf->offset = va & ~PAGE_MASK;
iobuf->length = len;

@@ -457,8 +463,9 @@
*/
while (ptr < end) {
if (!vma || ptr >= vma->vm_end) {
- vma = find_vma(current->mm, ptr);
- if (!vma)
+ refind:
+ vma = find_vma(mm, ptr);
+ if (!vma)
goto out_unlock;
if (vma->vm_start > ptr) {
if (!(vma->vm_flags & VM_GROWSDOWN))
@@ -466,25 +473,35 @@
if (expand_stack(vma, ptr))
goto out_unlock;
}
- if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||
- (!(vma->vm_flags & VM_READ))) {
- err = -EACCES;
- goto out_unlock;
+ err = -EACCES;
+ if (write) {
+ if (!(vma->vm_flags & VM_WRITE))
+ goto out_unlock;
+ } else {
+ if (!(vma->vm_flags & VM_READ))
+ goto out_unlock;
}
+ err = -EFAULT;
}
- if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0)
- goto out_unlock;
spin_lock(&mm->page_table_lock);
- map = follow_page(ptr);
- if (!map) {
+ if (!(map = follow_page(ptr, write))) {
+ char * user_ptr = (char *) ptr;
+ char c;
spin_unlock(&mm->page_table_lock);
- dprintk (KERN_ERR "Missing page in map_user_kiobuf\n");
- goto out_unlock;
+ up(&mm->mmap_sem);
+ if (get_user(c, user_ptr))
+ goto out;
+ if (write && put_user(c, user_ptr))
+ goto out;
+ down(&mm->mmap_sem);
+ goto refind;
}
map = get_page_map(map);
- if (map)
+ if (map) {
+ if (TryLockPage(map))
+ goto retry;
atomic_inc(&map->count);
- else
+ } else
printk (KERN_INFO "Mapped page missing [%d]\n", i);
spin_unlock(&mm->page_table_lock);
iobuf->maplist[i] = map;
@@ -499,9 +516,39 @@

out_unlock:
up(&mm->mmap_sem);
+ out:
unmap_kiobuf(iobuf);
dprintk ("map_user_kiobuf: end %d\n", err);
return err;
+
+ retry:
+
+ /*
+ * Undo the locking so far, wait on the page we got to, and try again.
+ */
+ unmap_kiobuf(iobuf);
+ up(&mm->mmap_sem);
+
+ /*
+ * Did the release also unlock the page we got stuck on?
+ */
+ if (!PageLocked(map)) {
+ /* If so, we may well have the page mapped twice in the
+ * IO address range. Bad news. Of course, it _might_
+ * just be a coincidence, but if it happens more than
+ * once, chances are we have a double-mapped page. */
+ if (++doublepage >= 3) {
+ return -EINVAL;
+ }
+ }
+
+ /*
+ * Try again...
+ */
+ wait_on_page(map);
+ if (++repeat < 16)
+ goto repeat;
+ return -EAGAIN;
}


@@ -594,9 +641,9 @@
if (++doublepage >= 3)
return -EINVAL;

+ } else
/* Try again... */
wait_on_page(page);
- }

if (++repeat < 16)
goto repeat;
@@ -1027,11 +1074,12 @@
* because it doesn't cost us any seek time. We also make sure to queue
* the 'original' request together with the readahead ones...
*/
-void swapin_readahead(swp_entry_t entry)
+void swapin_readaround(swp_entry_t entry)
{
int i, num;
struct page *new_page;
unsigned long offset;
+ swp_entry_t readaround_entry;

/*
* Get the number of handles we should do readahead io to. Also,
@@ -1046,8 +1094,17 @@
swap_free(SWP_ENTRY(SWP_TYPE(entry), offset++));
break;
}
- /* Ok, do the async read-ahead now */
- new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0);
+ readaround_entry = SWP_ENTRY(SWP_TYPE(entry), offset);
+ new_page = NULL;
+ /*
+ * Nobody can insert a page in the swap_cache without
+ * holding the big kernel lock. FIXME: swap_shm will SMP
+ * race because it runs add_to_swap_cache without the big
+ * kernel lock held.
+ */
+ if (!check_page(&swapper_space, readaround_entry.val))
+ /* Ok, do the async read-ahead now */
+ new_page = read_swap_cache_async(readaround_entry, 0);
if (new_page != NULL)
page_cache_release(new_page);
swap_free(SWP_ENTRY(SWP_TYPE(entry), offset));
@@ -1064,7 +1121,7 @@

if (!page) {
lock_kernel();
- swapin_readahead(entry);
+ swapin_readaround(entry);
page = read_swap_cache(entry);
unlock_kernel();
if (!page)

The patch is also here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test10-pre3/rawio-1

Andrea

apr...@in.ibm.com

Oct 16, 2000
to Rogier Wolff
Linus Torvalds wrote:
>> In article <CA256977.0...@d73mta05.au.ibm.com>,
>> <mdal...@in.ibm.com> wrote:
>> >
>> >I have a user buffer and I want to map it to kernel address space.
>> >Can anyone tell me how to do this? In AIX we have xmattach.
>
>> Note that it is usually MUCH better to do this the other way around:
>> having a kernel-side buffer, and mapping that into user space. I don't
>> understand why so many people always want to do it the wrong way around..

>if you could lock the userspace buffer, the interface simplifies: the
>"read" system call can lock the userspace buffer given, then
>initiate the DMA and return when done. Easy.

So is it the case that every buffer you pass to the read system call gets
pinned down in (user space) memory?


thanks,

Stephen Tweedie

Oct 16, 2000
to Malcolm Beattie
Hi,

On Fri, Oct 13, 2000 at 12:30:49PM +0100, Malcolm Beattie wrote:

> free_kiovec(1, &iobuf); /* does an implicit unlock_kiovec */
>
> It doesn't do an unmap_kiobuf(iobuf) so I don't understand where
> the per-page map->count that map_user_kiobuf incremented gets
> decremented again. Anyone?

The 2.4 raw code I'm looking at does an explicit unmap_kiobuf after
each brw_kiovec().
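Putting this together with Malcolm's fragment, the full balanced 2.4-era lifecycle would then be roughly (a sketch, modulo error handling; the I/O call is only indicated):

```c
/*
 * Sketch of the balanced kiobuf lifecycle, with the explicit
 * unmap_kiobuf() that drops the per-page references taken by
 * map_user_kiobuf(). Error handling omitted.
 */
struct kiobuf *iobuf;

alloc_kiovec(1, &iobuf);	       /* allocate one kiobuf          */
map_user_kiobuf(READ, iobuf, va, len); /* pin down the user pages     */
/* ... do the I/O, e.g. via brw_kiovec() ... */
unmap_kiobuf(iobuf);		       /* drop the per-page references */
free_kiovec(1, &iobuf);		       /* release the kiobuf itself    */
```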

> Lowlevel I/O on a kiovec can be done
> with something like an ll_rw_kiovec which sct said was going to get
> put in but since I haven't read anything more recent than
> 2.4.0-test5 at the moment, I can't say if it's there or what it
> looks like.

It's being maintained inside the SGI XFS tree right now. They've got
it pretty stable under XFS load so I'll probably put together the
ll_rw_kio functionality using their low level code the next time I do
a kiovec release. I believe that the XFS tree has 2.4.0-test9 support
for this code.

Cheers,
Stephen

Linus Torvalds

unread,
Oct 16, 2000, 3:00:00 AM10/16/00
to Andrea Arcangeli

On Mon, 16 Oct 2000, Andrea Arcangeli wrote:
>
> I'm also not convinced that only increasing the page count in the critical
> section in map_user_kiobuf is enough because swap_out doesn't care about the
> page count (in 2.2.x rawio it's taking the page lock). The page count is
> significant as a "pin" only for the page cache and not for anonymous or shm
> memory for example.

The page count is (or should be) sufficient, and if it weren't sufficient
that would be a bug in the swap-out handling of anonymous or shm memory. I
refuse to see page locking or similar for this kind of pinning (counts are
recursive and re-entrant, a "lock bit" is not).

Anonymous pages that get swapped out use the page cache too, so it's all
safe (it's _not_ safe in 2.0.x or possibly even 2.2.x for shm, but 2.4.x
should be ok - if it's not we need to fix it in the shm_swap function, not
in the kiobuf code).

Linus

Stephen Tweedie

Oct 16, 2000
to Andrea Arcangeli
Hi,

On Mon, Oct 16, 2000 at 12:08:54AM +0200, Andrea Arcangeli wrote:

> The basic problem is that map_user_kiobuf tries to map those pages by
> calling handle_mm_fault on their virtual addresses, assuming that when
> handle_mm_fault returns 1 the page is mapped. That's wrong.

Good point --- good work.

> Furthermore, in 2.4.x SMP (this couldn't happen in 2.2.x SMP at least) the
> page can also go away from under us, even assuming handle_mm_fault did its
> work right by luck.

Hmm --- we had the BKL to protect against this when this code was
first done for 2.3. That's definitely a regression, agreed.

> I'm also not convinced that only increasing the page count in the critical
> section in map_user_kiobuf is enough because swap_out doesn't care about the
> page count (in 2.2.x rawio it's taking the page lock). The page count is
> significant as a "pin" only for the page cache and not for anonymous or shm
> memory for example. swap_out can mess with anonymous memory with a page
> count > 1 for example.

This bit I think we have to talk about --- I'm not sure I agree. I
dropped that automatic-page-locking from the map_user_kiobuf code
quite deliberately.

Basically, we don't care at all about pinning the pte in the page
tables during the IO. All we really care about is preserving the
mapping between the user's VA and the physical page. If the VM
chooses to unmap the page temporarily, then as long as the page
remains in physical memory, then a subsequent page fault will
reestablish the mapping. The swap cache and page cache are sufficient
to make this guarantee.

It's an important distinction, because we really would rather avoid
taking the page lock. If you happen to have the same page mapped into
multiple VAs within the region being written, should the kernel
object? If you're writing that region to disk, then it really
shouldn't matter. For some other applications, it might be important,
which is why I wanted to keep the map_user_kiobuf() and lock_kiobuf()
functions distinct.

Note that if you have a threaded application and another thread goes
messing with the MM while your IO is in progress, then it's possible
that the pages in the user's VA at the end of the IO are not the same
as the ones that were there at the start. No big deal, that's no
different to the situation if you have any other read or write going
on in parallel with other MM activity.

One final point: the new code,

> + if (!(map = follow_page(ptr, write))) {
> + char * user_ptr = (char *) ptr;
> + char c;
> spin_unlock(&mm->page_table_lock);

> + up(&mm->mmap_sem);
> + if (get_user(c, user_ptr))
> + goto out;
> + if (write && put_user(c, user_ptr))
> + goto out;
> + down(&mm->mmap_sem);
> + goto refind;

looks unnecessarily messy. We don't need the get_user() in ptrace:
why do we need it here? Similarly, the put_user should be dealt with
by the handle_mm_fault. We already absolutely rely on the fact that
handle_mm_fault never continually fails to make progress for ever. If
it did, then a page fault could produce an infinite loop in the
kernel.

Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
to call the existing access_one_page() from ptrace. You're right,
this stuff is racy, so we should avoid duplicating it in memory.c.

Cheers,
Stephen

Andrea Arcangeli

unread,
Oct 16, 2000, 3:00:00 AM10/16/00
to Linus Torvalds
On Mon, Oct 16, 2000 at 11:29:27AM -0700, Linus Torvalds wrote:
> The page count is (or should be) sufficient, and if it weren't sufficient
> that would be a bug in the swap-out handling of anonymous or shm memory. I

If the page isn't locked, swap_out will unmap it from the pte and anybody
will be able to start any kind of regular VM I/O on the page. This isn't
what the programmer expects, and that's not what I consider pinning. It
just looks much saner to completely pin it and to keep it mapped, as if it
were kernel memory marked as reserved and then mapped to userspace via
remap_page_range. I don't see any particular benefit in allowing swap_out
to mess with pinned pages that would make it worthwhile not to lock them.

That said (and the above arguments are more a design issue), there's also a
practical problem that I can see in not taking the page locked. What will
happen without the lock on the page is that swapout will unmap the pte while
adding the page to the swap cache and while writing it to disk. Then the page
will be clean and in the swap cache with reference count 2 because we take it
pinned and because it's part of the swap cache. Then we do the DMA from disk
to the page but the page is still considered clean from the VM
because it's an unlocked swap cache page. But instead the contents of the page
aren't in sync anymore with the contents of the on-disk counterpart of the swap
cache after DMA completed. So that will again generate corruption because it
breaks VM assumptions on the swap cache (even if it's more subtle to trigger
than the other problems). Maybe even more subtle issues could arise from
not taking the lock on the page while it's pinned.

> refuse to see page locking or similar for this kind of pinning (counts are
> recursive and re-entrant, a "lock bit" is not).

In fact swapin_readahead was deadlocking exactly because the lock bit isn't
reentrant. But the fix for that deadlock looks worthwhile regardless of the
pinning-user-memory-with-lock-bit issue.

Andrea

Linus Torvalds

Oct 16, 2000, 3:00:00 AM
to Andrea Arcangeli

On Mon, 16 Oct 2000, Andrea Arcangeli wrote:
>
> If the page isn't locked swap_out will unmap it from the pte and anybody will
> be able to start any kind of regular VM I/O on the page.

Doesn't matter.

If you have increased the page count, the page _will_ stay in the page
cache. So everybody who wants to see the page that was mapped, will see it
with no possibility of getting an alias.

> This isn't what the
> programmer expects

If so, the programmer has shit for brains. That simple.

> and that's not what I consider pinning.

No. Because "pinning" is _stupid_.

Imagine having two threads that both do direct IO from overlapping pages.

YOU NEED TO COUNT THE PINNING.

A "lock" bit is useless, and anybody who uses a lock bit is stupid and
ends up having to serialize things for no good reason.

Instead, the Linux answer is to say: pinning is bad, pinning is stupid,
pinning is useless - so don't do it.

Instead, we have the notion of "I have a reference to this page, don't let
it go away". Sure, the page can be _unmapped_ (ie it is not pinned), but
WHO CARES?

Nobody.

Linus

Andrea Arcangeli

Oct 16, 2000, 3:00:00 AM
to Stephen Tweedie
On Mon, Oct 16, 2000 at 10:14:01PM +0100, Stephen C. Tweedie wrote:
> [..] If the VM
> chooses to unmap the page temporarily, then as long as the page
> remains in physical memory, then a subsequent page fault will
> reestablish the mapping. [..]

Correct. But the problem is that the page won't stay in physical memory after
we finished the I/O because swap cache with page count 1 will be freed by the
VM. And even if it stays in physical memory because a read fault happened on the
swap cache before we freed it, we could have swap cache that isn't coherent
with the on-disk data (that could lead to the same problem but later).

And anyways from a design standpoint it looks much better to really pin the
page in the pte too (just like kernel reserved pages are pinned after a
remap_page_range).

> It's an important distinction, because we really would rather avoid
> taking the page lock. If you happen to have the same page mapped into
> multiple VAs within the region being written, should the kernel
> object? [..]

I see your point, but if you want to allow that we should simply check whether the
page that we can't lock is already locked in an earlier part of the kiobuf (just
walking the iobuf->maplist backwards). I just don't think that not locking the
page is the right way to provide that feature.

I think it's not horribly wrong if people can't do map_user_kiobuf on virtual
pages pointing to the same physical page because that functionality is quite
special, just like rawio is quite special in requiring people to use
hardblocksize aligned buffers. And yes, we could also allow people to do
rawio without that userspace alignment constraint but with that constraint
we force them to do zero copy.

> shouldn't matter. For some other applications, it might be important,
> which is why I wanted to keep the map_user_kiobuf() and lock_kiobuf()
> functions distinct.

I'm not sure which are those apps but if we need we can easily handle that case
by browsing backwards the maplist in the TryLockPage == 1 slow path.

> Note that if you have a threaded application and another thread goes
> messing with the MM while your IO is in progress, then it's possible
> that the pages in the user's VA at the end of the IO are not the same
> as the ones that were there at the start. No big deal, that's no
> different to the situation if you have any other read or write going
> on in parallel with other MM activity.

Yep, that looks ok.

> > +	if (!(map = follow_page(ptr, write))) {
> > +		char * user_ptr = (char *) ptr;
> > +		char c;
> > 		spin_unlock(&mm->page_table_lock);
> > +		up(&mm->mmap_sem);
> > +		if (get_user(c, user_ptr))
> > +			goto out;
> > +		if (write && put_user(c, user_ptr))
> > +			goto out;
> > +		down(&mm->mmap_sem);
> > +		goto refind;
>
> looks unnecessarily messy. We don't need the get_user() in ptrace:
> why do we need it here? Similarly, the put_user should be dealt with
> by the handle_mm_fault. We already absolutely rely on the fact that
> handle_mm_fault never continually fails to make progress for ever. If
> it did, then a page fault could produce an infinite loop in the
> kernel.

Replacing the get_user/put_user with handle_mm_fault _after_ changing
follow_page to check the dirty bit too in the write case should be ok.

If you want I can do this change plus the backwards maplist check for allowing
mapping the same physical page multiple times in a kiobuf (the unmap_kiobuf
will unlock the same physical page multiple times but that's ok).

> Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> to call the existing access_one_page() from ptrace. You're right,

access_one_page is missing the pagetable lock too, but that seems the only
problem. I'm not convinced mixing the internal of access_one_page and
map_user_kiobuf is a good thing since they need to do very different things
in the critical section.

Andrea

Linus Torvalds

Oct 16, 2000, 3:00:00 AM
to Andrea Arcangeli

On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
>
> And anyways from a design standpoint it looks much better to really pin the
> page in the pte too (just like kernel reserved pages are pinned after a
> remap_page_range).

No.

Read my emails.

"Pinning" is stupid. There are no ifs, buts, or maybes about it.

Pinning will not happen.

(And remap_page_range() has nothing to do with pinning - they are just
pages that cannot be swapped out because they are not normal pages at
all).

Linus

Andrea Arcangeli

Oct 16, 2000, 3:00:00 AM
to Linus Torvalds
On Mon, Oct 16, 2000 at 02:59:59PM -0700, Linus Torvalds wrote:
> No. Because "pinning" is _stupid_.

Pinning via map_user_kiobuf looks just like the reverse of what we usually do
with the mmap callback of a device driver and remap_page_range. I considered
"pinning" to mean converting the user page into one of the pages mapped with
remap_page_range, so the driver can treat them the same way internally (and
the ones mapped with remap_page_range don't get unmapped and then swapped
out, of course).

> Instead, the Linux answer is to say: pinning is bad, pinning is stupid,
> pinning is useless - so don't do it.

So just delete map_user_kiobuf from your tree if people shouldn't do it. That
would fix the mm corruption too indeed. I don't see why you provide that
functionality and then keep telling people not to use it. Also rawio
and networking could use the other approach. I think that's mainly a
matter of API. People are used to thinking in read/write terms. And in fact rawio
doesn't provide a completely transparent read/write because it has constraints
on the buffer alignment.

If you won't delete map_user_kiobuf from your tree, I think I've just provided a
real world MM corruption case where the user sends the bug report back to us if
we only increase the reference count of the page to pin it.

Andrea

Andrea Arcangeli

Oct 16, 2000, 3:00:00 AM
to Linus Torvalds
On Mon, Oct 16, 2000 at 03:21:11PM -0700, Linus Torvalds wrote:
> Pinning will not happen.

Pinning happens every day on my box while I use rawio.

If you want to avoid pinning _userspace_ pages then we should delete
map_user_kiobuf and define a new functionality and API to replace RAWIO for
DBMS that must bypass the kernel cache while doing I/O to disk and possibly
providing zero copy below highmem as rawio does. (later on we should
do the same for networkng)

IMHO rawio is providing the above necessary functionality with a
sane userspace interface.

> (And remap_page_range() has nothing to do with pinning - they are just
> pages that cannot be swapped out because they are not normal pages at
> all).

They are _normal_ pages allocated by a device driver and made temporarily
visible to userspace by mapping them into the virtual address space of the
process _after_ "pinning" them using the PG_reserved bitflag. If we didn't pin
them too, they would be unmapped just as soon as they became visible in the
virtual address space of the process.

I don't think the thing is much different. The main difference I can see is the
one that was buggy: that is, remap_page_range doesn't have to care that the page
stays there while pinning it, because before pinning it it's still private and
not visible to the VM (that's why it's much simpler). map_user_kiobuf instead
is more complex because it must make sure that the page stays there while we
pin it (and that should be fixed now).

I hope we're not getting confused by the term "pin". With "pin" I always meant
to avoid the userspace-visible page to go away from under us while we use it
from kernel space because of underlying VM activities. I don't see any
other possible meaning for "pin" in the context of map_user_kiobuf.

(in journaling we instead use "pin" to mean a page that can't be freed but that
also can't be written before some other transaction is committed to disk)

Linus Torvalds

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli

On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
>
> If you won't delete map_user_kiobuf from your tree, I think I've just provided a
> real world MM corruption case where the user sends the bug report back to us if
> we only increase the reference count of the page to pin it.

Oh. So to fix a bug, you say "either delete the code, or do something else
that is completely idiotic instead"?

Sure, that's sensible. NOT.

Andrea, explain to me how pinning _could_ work? Explain to me how you'd
lock down pages in virtual address space with multiple threads, and how
you'd handle the cases of:

- two threads doing direct IO from different parts of the same page
- one thread starting IO from a page, another thread unmapping the range

Basically, you can't handle it sanely, because the notion of virtual
pinning really isn't a sane notion. The first case would need a special
"pinning count". Which is too expensive to be an option, although I've
seen patches that seemed to do something like that - I consider the whole
notion idiotic.

The second case would require that unmap() synchronize completely with the
IO. Which is wasteful, and doesn't make any sense: what's the point, when
you can avoid it by just not pinning?

Neither option is, quite frankly, acceptable.

So we're left with your suggestion to remove direct IO completely.
Something that I wouldn't mind horribly much, but too many people seem to
consider it worthwhile - and while I've stubbornly fought the direct-IO
patches a long time, every single technical argument I've had has been
successfully addressed over time.

I'm sure this bug will get fixed too. And the fix probably won't end up
even being all that painful - it's probably a question of marking the page
dirty after completing IO into it and making sure the swap-out logic does
the right thing (ie try to write it out again - which is exactly the same
thing that happens right now if a user dirties a page while it's busy
doing write-out).

In fact, the code may do this already, I'll let sct look into the exact
details and fix it.

Linus

Stephen Tweedie

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli
Hi,

On Tue, Oct 17, 2000 at 12:13:49AM +0200, Andrea Arcangeli wrote:
>
> Correct. But the problem is that the page won't stay in physical memory after
> we finished the I/O because swap cache with page count 1 will be freed by the
> VM.

Rik has been waiting for an excuse to get deferred swapout into the
mainline. Sounds like we've got the excuse.

> And anyways from a design standpoint it looks much better to really pin the
> page in the pte too (just like kernel reserved pages are pinend after a
> remap_page_range).

Unfortunately, there is one common case where we want to do exactly
that. "dd < /dev/zero > something_using_raw_io" maps a whole series
of identical readonly ZERO_PAGE pages into the kiobuf. One of the
reasons I removed the automatic page locking was that otherwise we're
forced to special-case things like ZERO_PAGE in the locking code.

Even ignoring that, users _will_ submit multiple IOs in the same page.
Pinning the physical page with page->count is clean. Doing the
locking with the page lock makes no sense if you have adjacent IOs or
if you want to maintain the kiobuf mapping for any length of time.
The point of kiobufs was to avoid VM hacks so that IO can be done at
physical page level. Pinning ptes should not have anything to do with
the IO or we've lost that abstraction.

> Replacing the get_user/put_user with handle_mm_fault _after_ changing
> follow_page to check the dirty bit too in the write case should be ok.

Right.

> > Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> > to call the existing access_one_page() from ptrace. You're right,
>
> access_one_page is missing the pagetable lock too, but that seems the only
> problem. I'm not convinced mixing the internal of access_one_page and
> map_user_kiobuf is a good thing since they needs to do a very different thing
> in the critical section.

Not the whole of access_one_page, but the pagetable-locked
follow-page / handle_mm_fault loop should be common code. That's
where we're having the problem, so let's avoid having to maintain it
in two places.

Cheers,
Stephen

Stephen Tweedie

Oct 17, 2000, 3:00:00 AM
to Linus Torvalds
Hi,

On Mon, Oct 16, 2000 at 08:11:55PM -0700, Linus Torvalds wrote:

> I'm sure this bug will get fixed too. And the fix probably won't end up
> even being all that painful - it's probably a question of marking the page
> dirty after completing IO into it and making sure the swap-out logic does
> the right thing (ie try to write it out again - which is exactly the same
> thing that happens right now if a user dirties a page while it's busy
> doing write-out).
>
> In fact, the code may do this already, I'll let sct look into the exact
> details and fix it.

I've looked, and it doesn't yet, but Rik has already marked the places
where this would be implemented and it looks simple (famous last
words): mark the swap cache page dirty at writeout and flush to disk
during refill_inactive. Rik is sitting a few rows behind me right now
so we'll look at the code later today. He's already been talking
about doing this anyway to improve the write clustering of the swapout
code.

--Stephen

Andrea Arcangeli

Oct 17, 2000, 3:00:00 AM
to Linus Torvalds
On Mon, Oct 16, 2000 at 08:11:55PM -0700, Linus Torvalds wrote:
> Oh. So to fix a bug, you say "either delete the code, or do something else
> that is completely idiotic instead"?

I'm not saying this because the "something else" doesn't look completely
idiotic to me.

> Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> lock down pages in virtual address space with multiple threads, and how
> you'd handle the cases of:
>
> - two threads doing direct IO from different parts of the same page
> - one thread starting IO from a page, another thread unmapping the range

I don't see the problem with those two cases.

In the first case the first task will sleep in wait_on_page until the
second task unmaps the kiobuf.

In the second case the I/O will run even while the page is no longer visible
to userspace. Then when unmap_kiobuf runs after the I/O has completed, the page
will be put back on the freelist.

The only "problematic" case that I can see is the one mentioned by SCT that is
the same physical page present in two entries of the maplist in the kiobuf (in
my patch the user_map_kiobuf will return -EINVAL gracefully in that case and
that behaviour doesn't look horribly wrong to me). And that case can be
trivially handled by checking the list backwards in the trylock slowpath as I
just said.

> The second case would require that unmap() synchronize completely with the
> IO. [..]

unmap doesn't need to synchronize at all. If it works without the locked
bit, it will work with the locked bit too.

NOTE: in the patch I'm definitely also increasing the page count (that is
certainly necessary). I never suggested to replace get_page with
the TryLock. I'm suggesting to _add_ the TryLock to avoid having to deal
with the VM for something that is just under I/O and unfreeable anyways.
Maybe you understood I was suggesting to remove the get_page? In that case I
explained it wrong, but you can check the patch to see I never removed the
get_page, as confirmation of my current words.

> [..] Which is wasteful, and doesn't make any sense: what's the point, when
> you can avoid it by just not pinning?

The point is to avoid making the driver deal with the VM and the VM know
about new special cases invoked by the driver. Exactly like a driver that uses
remap_page_range never knew about the VM, because it always kept the pages
pinned in the pte via the reserved bit just for that purpose.

If you want to put VM knowledge inside device drivers and also put new
conditions in the VM in order to avoid pinning the pages mapped via
map_user_kiobuf in the pte (so with no gain), then you have my disagreement
about the design.

Andrea

Linus Torvalds

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli

On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
>
> > Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> > lock down pages in virtual address space with multiple threads, and how
> > you'd handle the cases of:
> >
> > - two threads doing direct IO from different parts of the same page
> > - one thread starting IO from a page, another thread unmapping the range
>
> I don't see the problem with those two cases.
>
> In the first case the first task will sleep in wait_on_page until the
> second task will unmap the kiobuf.

Ehh.. Remind me again what the whole _point_ of physical IO was, in the
first place?

Was it maybe PERFORMANCE?

Yeah. I thought so.

Bye, bye, performance. You might as well remove the whole thing
completely.

Also, note that one of my requirements to accept the direct IO patches in
the first place from Stephen was that I wanted them to NOT mess at all
with virtual memory mappings. Basically, the way kio buffers work is that
they are 100% based on only physical pages. There are no virtual issues at
all in the IO, and that's exactly how I want it. There is no reason to
confuse virtual addresses into this, because the thing should be usable
even in the complete absense of virtual mappings (ie the kernel can do
direct IO purely based on pages - think sendfile() etc).

What does this mean? It means that the notion of writing directly from
user space doesn't actually exist in the Linux implementation of direct IO
at all. What Linux direct IO does is really a two-phase operation:

- look up the physical pages.
- do IO based on physical pages

The two are completely and entirely unrelated. Think "good design". Think
"modularity". Think "containment".

Which means, btw, that "virtual pinning" does not make sense from a
conceptual standpoint at all. Because the "virtual" part does not even
_exist_ by the time we do IO.

> The only "problematic" case that I can see is the one mentioned by SCT that is
> the same physical page present in two entries of the maplist in the kiobuf (in
> my patch the user_map_kiobuf will return -EINVAL gracefully in that case and
> that behaviour doesn't look horribly wrong to me). And that case can be
> trivially handled by checking the list backwards in the trylock slowpath as I
> just said.

And WHY do you want to pin the damn things virtually in the first place?

Your bug is a non-issue: it can obviously be fixed other ways, and has
nothing to do with virtually pinning the page. Yes, pinning the page in
the virtual map would also fix it, but at the cost of completely breaking
the whole idea of what direct IO is all about.

Linus

Andrea Arcangeli

Oct 17, 2000, 3:00:00 AM
to Linus Torvalds
On Tue, Oct 17, 2000 at 11:36:22AM -0700, Linus Torvalds wrote:
> On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> >
> > > Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> > > lock down pages in virtual address space with multiple threads, and how
> > > you'd handle the cases of:
> > >
> > > - two threads doing direct IO from different parts of the same page
> > > - one thread starting IO from a page, another thread unmapping the range
> >
> > I don't see the problem with those two cases.
> >
> > In the first case the first task will sleep in wait_on_page until the
> > second task will unmap the kiobuf.
>
> Ehh.. Remind me again what the whole _point_ of physical IO was, in the
> first place?
>
> Was it maybe PERFORMANCE?

It should certainly run fast :).

> Yeah. I thought so.
>
> Bye, bye, performance. You might as well remove the whole thing
> completely.

I don't think that is a common case relevant for performance. I see it only as
a case that we must handle without losing stability.

For example if both threads are reading different parts of disk using the same
buffer, that's also a wrong condition that will produce unpredictable results (or
if they're reading the same part of disk, why are they doing it twice?). If both
threads are writing different parts of disk using the same buffer then you're
right, we could push more I/O at the same time by avoiding the locked bit, but
that case doesn't look very interesting to me.

> Also, note that one of my requirements to accept the direct IO patches in
> the first place from Stephen was that I wanted them to NOT mess at all
> with virtual memory mappings. Basically, the way kio buffers work is that
> they are 100% based on only physical pages. There are no virtual issues at
> all in the IO, and that's exactly how I want it. There is no reason to

That's kiobuf.

Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
deals with physical pages. It never walks pagetables. It never cares about the
VM. Those pages never become visible from userspace. All right.

But in map_user_kiobuf we can't avoid playing with the virtual memory,
just as we can't avoid doing so in the remap_page_range case.

> conceptual standpoint at all. Because the "virtual" part does not even
> _exist_ by the time we do IO.

The fact is that the "virtual" part _exists_, since the page that we are doing
I/O from/to is mapped in the virtual address space of the task, and so the VM
will play with it from under us.

> And WHY do you want to pin the damn things virtually in the first place?

In short, for the same reason all drivers mapping kernel memory into userspace
via remap_page_range first call mem_map_reserve on the physical
page.

Why should we make the VM and rawio more complex when we can simply pin the
page in the pte as mem_map_reserve has always done so far? I can see that you
can't write the same buffer to two different parts of disk from two threads at
the same time, but that looks like an uninteresting case, and if we need to
optimize it I'd prefer to do it in another way than to put further VM stuff
into rawio.

In fact the reason I don't like to put VM stuff into rawio is that I like
the clean design that you described:

o lookup the physical page
o do I/O on the physical page

and not:

o lookup the physical page
o do I/O on the physical page
o check what the physical page has become; if it's swap cache
  then set this bit, otherwise set this other bit, and then
  the VM will write us to disk again later because we put
  some more checks in the VM to handle us

Andrea

David S. Miller

Oct 17, 2000, 3:00:00 AM
to and...@suse.de
Date: Tue, 17 Oct 2000 21:32:43 +0200
From: Andrea Arcangeli <and...@suse.de>

> Bye, bye, performance. You might as well remove the whole thing
> completely.

I don't think that is a common case relevant for performance. I
see it only as a case that we must handle without losing
stability.

Hint: smp_flush_tlb_page()

Current kiobufs never need to do that, under any circumstances.
This is not by accident.

I don't see what the big deal is in requiring that user applications
do their own locking to protect page contents being modified in one
thread while a direct I/O from it is happening in another thread. I
also don't see why any bug with kiobufs can't be fixed without the
expensive and complex pinning.

Later,
David S. Miller
da...@redhat.com

Linus Torvalds

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli

On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
>
> For example if both threads are reading different part of disk using the same
> buffer that's also a wrong condition that will provide impredictable result (or
> if they're reading the same part of disk why are they doing it twice?).

I'm not talking about users that do things that are obviously
meaningless.

But what about things like:
- linearized circular buffers (where "linearized" means that the buffer
is mapped twice or more consecutively virtually in memory, so that the
user doesn't need to worry about the boundary conditions that normal
circular buffers have)
- multiple buffers per page.

And note in particular that you may not be able to page-align everything:
and it can be extremely costly (both in memory use and in cache footprint)
to always pad out your buffers etc.

> But in map_user_kiobuf we can't avoid to play with the virtual memory,
> as we can't avoid to do that in the remap_page_range case.

map_user_kiobuf() isn't "playing" with the virtual memory.

It's a simple lookup function, looking up the physical pages that
correspond to a specific range of virtual memory.

Nothing more.

> In fact the reason I don't like to put VM stuff into rawio is that I like
> the clean design that you described:
>
> o lookup the physical page
> o do I/O on the physical page

If you like it, why do you want to break it?

You _say_ that you like it, but yet you want to add extra conditions like

o while I/O is pending the virtual mapping is fixed

Which basically means that you suddenly have lots of interactions with the
VM layer - even though everybody agrees that is a bad thing.

Your solution would be that you'd mark the _virtual_ page dirty, then lock
it so that the dirty bit cannot go away (and the page cannot be unmapped),
and then keep it locked until the IO is complete, and basically have all
dirty handling based on virtual addresses. Even though those virtual
addresses have _nothing_ to do with the IO itself, and you say that you
want to do IO on the physical page.

In contrast, I'm saying that direct IO that reads into a (physical) page
should obviously mark that (physical) page dirty. In the meantime, others
may choose to do IO to the same page. They could even use the same buffer,
because we don't lock things: this has extremely well-defined meaning in
the case of "write this page to these 100 devices asynchronously".

How would you suggest we handle the "write to 100 clients simulataneously"
issue?

Are you suggesting something like: if it is reading from a page (ie
writing the contents of that page somewhere else), we don't lock it, but
if it is writing to a page, we lock it so that the dirty bit won't get
lost.

Sure, that works (modulo the fact that it still has the issues with
serializing mmap's and accesses to other areas in the same page). But do
you really claim that it's the clean solution?

Linus

Andrea Arcangeli

Oct 17, 2000, 3:00:00 AM
to David S. Miller
On Tue, Oct 17, 2000 at 12:27:30PM -0700, David S. Miller wrote:
> Hint: smp_flush_tlb_page()
>
> Current kiobufs never need to do that, under any circumstances.
> This is not by accident.

I don't understand. flush_tlb_page() done in the context of a thread won't care
about the state of the physical page. I don't see how it's related to kiobufs.

> I don't see what the big deal is in requiring that user applications
> do their own locking to protect page contents being modified in one
> thread while a direct I/O from it is happening in another thread. I

We never talked about this issue in the previous emails. That is required
regardless of how we solve the MM corruption.

> also don't see why any bug with kiobufs can't be fixed without the
> expensive and complex pinning.

IMHO pinning the page in the pte is less expensive and less complex than making
rawio and the VM aware of those issues. (remap_page_range is such a clean
implementation exactly because it pins the page into the pte.)

Andrea

Eric Lowe

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli
Hello,

> For example if both threads are reading different parts of disk using the same
> buffer, that's also a wrong condition that will produce unpredictable results (or
> if they're reading the same part of disk, why are they doing it twice?). If both
> threads are writing different parts of disk using the same buffer then you're
> right, we could push more I/O at the same time by avoiding the locked bit, but
> that case doesn't look very interesting to me.

*nod*

> > Also, note that one of my requirements to accept the direct IO patches in
> > the first place from Stephen was that I wanted them to NOT mess at all
> > with virtual memory mappings. Basically, the way kio buffers work is that
> > they are 100% based on only physical pages. There are no virtual issues at
> > all in the IO, and that's exactly how I want it. There is no reason to
>
> That's kiobuf.
>
> Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
> deals with physical pages. It never walks pagetables. It never cares about the
> VM. Those pages never become visible from userspace. All right.
>
> But in map_user_kiobuf we can't avoid to play with the virtual memory,
> as we can't avoid to do that in the remap_page_range case.

I've been down that road, and I don't think we want to go there.
Most research shows that fast multiprocessors simply suffer too much
from the TLB shootdown costs to make up for the difference between
copying the page and not copying it. Twiddling with pte's to make them
COW during the I/O is a waste of cycles and it doesn't gain you much
except in the threaded case, in which case the application was brain-dead
to begin with.

Is that what you're suggesting, making the pages COW? Or am I missing
the point?

>
> > conceptual standpoint at all. Because the "virtual" part does not even
> > _exist_ by the time we do IO.
>
> The fact is that the "virtual" part _exists_ since the page that we are doing
> I/O from/to is mapped in the virtual address of the task and so the VM will play
> with it from under us.

Doesn't incrementing the page count mean we still have the page? I mean,
w/o twiddling with the mapping, the buffer may change while we're doing
I/O on it, but I really don't care if that happens.. if it means I can
avoid 75% of the overhead in the page locking. Since we're not doing
in-place I/O in the general case (e.g. read/write), I think any sane
people can deal with these changed semantics (fbuf-like, that is).

> In short for the same reason all drivers mapping in userspace kernel memory
> via remap_page_range are first calling mem_map_reserve on the physical
> page.
>
> Why should we make the VM and rawio more complex when we can simply pin the
> page in the pte as mem_map_reserve always did so far? I can see the fact you
> can't write to two different part of disk the same buffer from two threads at
> the same time but that looks a non interesting case and if we need to optimize
> it I'd prefer to do it in another way than to put further VM stuff into rawio.

Again, because you're adding overhead for the brain-dead semantics.

I think this is Linus' point here -- twiddling with the VM doesn't really
gain you anything at all. The I/O still happens either way, but by
doing it based ONLY on physical pages (as it is now), there's a LOT
less overhead and it still works correctly as long as the application
and driver are well behaved.

And I don't agree that we should ever try to do kiobuf() things in
the read/write general case. I think that mapping files into
kernel buffers and copying the pages to/from the page cache
would suffice, and with the benefit that all page flushing could
then be done through the VM and bdflush eliminated.

--
Eric Lowe
Software Engineer, Systran Corporation
el...@systran.com

Linus Torvalds

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli

On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
>

> IMHO pinning the page in the pte is less expensive and less complex than making
> rawio and the VM aware of those issues. (remap_page_range is so clean
> implementation exactly because it pins the page into the pte)

You keep on bringing up remap_page_range(), and it does nothing of the
sort.

remap_page_range() does one thing, and one thing only: it populates the
page tables with physical page mappings.

It has nothing to do with pinning.

It so happens that the vmscan stuff won't ever remove a physical page
mapping, but that's simply because such a page CANNOT be swapped out. How
would you swap out the PCI space?

I don't see why you consider that to have anything to do with kiobuf's.

Linus

Stephen Tweedie

Oct 17, 2000, 3:00:00 AM
to Andrea Arcangeli
Hi,

On Tue, Oct 17, 2000 at 10:06:35PM +0200, Andrea Arcangeli wrote:

> > also don't see why any bug with kiobufs can't be fixed without the
> > expensive and complex pinning.
>

> IMHO pinning the page in the pte is less expensive and less complex than making
> rawio and the VM aware of those issues.

raw IO will never be aware of the issues at all. It's all done when
we do the page lookup.

What are the VM issues? There are only two left that I know about.
One is the swap code --- we are evicting swap cache pages without
caring that there may be a user of the page who has written to it
since we did the unmap. We got away with it until now because of the
assumption that shared swap cache was always readonly (anonymous
shared memory cannot swap through the swap cache for precisely this
reason). In current 2.4, this is a trivial assumption to eliminate.

The other VM gotcha is threads plus fork --- a kiobuf mapping is
basically equivalent to a temporary shared mapping as far as the page
is concerned, but the vma is not marked shared. So we're left with
the possibility of another thread forking, both MMs getting marked as
COW, and then if the page gets touched by the parent thread first, COW
results in the parent getting a new page and the child process
inheriting the page in which the raw IO is still happening.

Does this matter? Maybe, if you want to play kiobuf tricks with
things like pipes and you care about preserving POSIX semantics in all
cases. Not, if you're satisfied with "well, don't do that, it's
stupid!" (Although it _would_ be nice to get the doubled pipe
bandwidth of davem's kiobuf pipes to thoroughly stomp on anybody
else's lmbench numbers...)

--Stephen

Andrea Arcangeli

Oct 17, 2000, 3:00:00 AM
to Linus Torvalds
On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> It so happens that the vmscan stuff won't ever remove a physical page
> mapping, but that's simply because such a page CANNOT be swapped out. How

So if I write a mechanism that allows those driver-private-pages that are used
for DMA of a soundcard to be swapped out, then you would remove the
PG_referenced bitflag from them too?

What sense does it make to start writing out to disk a page that is under DMA when
you don't know if this page is getting changed under the swapout? Furthermore,
despite the swapout I/O those pages can't be freed until we run unmap_kiobuf.

> would you swap out the PCI space?

(minor note: that's not PCI space (the pci space case is handled by the
!VALID_PAGE(page) check). That's normal kernel memory exported to userspace
handled by the PageReserved(page) check.)

Andrea

Andrea Arcangeli

Oct 18, 2000, 3:00:00 AM
to Linus Torvalds
On Tue, Oct 17, 2000 at 01:02:01PM -0700, Linus Torvalds wrote:
> But what about things like:
> - linearized circular buffers (where "linearized" means that the buffer
> is mapped twice or more consecutively virtually in memory, so that the
> user doesn't need to worry about the boundary conditions that normal
> circular buffers have)

This is exactly the case pointed out by SCT and this should be trivial to
handle by just walking the maplist backwards in the trylock slow path.

> - multiple buffers per page.
>
> And note in particular that you may not be able to page-align everything:
> and it can be extremely costly (both in memory use and in cache footprint)
> to always pad out your buffers etc.

See below for this one.

> It's a simple lookup function, looking up the physical pages that
> correspond to a specific range of virtual memory.
>
> Nothing more.

Yep.

> > In fact the reason I don't like to put VM stuff into rawio is that I like
> > the clean design that you described:
> >
> > o lookup the physical page
> > o do I/O on the physical page
>
> If you like it, why do you want to break it?
>
> You _say_ that you like it, but yet you want to add extra conditions like
>
> o while I/O is pending the virtual mapping is fixed

That's not the condition that I want to add. Everybody can do a mremap of the
vma while the I/O runs or as you pointed out everybody can munmap the
virtual memory area while the physical page is referenced by the kiobuf.

What I want to add is basically only:

o swap_out doesn't unmap the pages mapped by a kiobuf from their pte
and in turn it doesn't swapout them while they could be under I/O

> Are you suggesting something like: if it is reading from a page (ie
> writing the contents of that page somewhere else), we don't lock it, but
> if it is writing to a page, we lock it so that the dirty bit won't get
> lost.

That wasn't what I suggested, but I also like the way you describe it above.
It makes sense.

It also seems to handle the case of multiple buffers in the same page.
When we do the pagein (if a pagein is necessary) all users of the page have to
wait anyways. For the pageout they will be able to start writes at the same
time this way.

> Sure, that works (modulo the fact that it still has the issues with
> serializing mmap's and accesses to other areas in the same page). But do
> you really claim that it's the clean solution?

It looks cleaner than swapping out a page while a device is writing
to it in DMA under the swapout.

Stephen Tweedie

Oct 18, 2000, 3:00:00 AM
to Andrea Arcangeli
Hi,

On Wed, Oct 18, 2000 at 01:00:48AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> > It so happens that the vmscan stuff won't ever remove a physical page
> > mapping, but that's simply because such a page CANNOT be swapped out. How
>
> So if I write a mechanism that allows those driver-private-pages that are used
> for DMA of a soundcard to be swapped out, then you would remove the
> PG_referenced bitflag from them too?

Are you sure you don't mean PG_Reserved?

If they get removed from the process's working set, then sure --- why
not? Why should the driver care? As long as the page is pinned in
physical memory (ie. as long as it has a raised page count), the
driver won't mind at all. If your vma has a nopage method to
reestablish the ptes on page fault, then swap would work fine, just as
long as we fix the problem we've already identified about swapout
writing pages early rather than once they are finally leaving the swap
cache.

It's a completely different matter if you're dealing with ioremap of
PCI-aperture pages, but that's mostly because they don't have valid
struct page *'s (at least not unless you have 4GB ram).

> What sense it has to start to writeout to disk a page that is under DMA when
> you don't know if this page is getting changed under the swapout?

Umm, we've already said that we want to fix the raw IO problem by
setting the page dirty on initial swapout but doing the IO once we
know it's only in swap cache. Do that and we won't ever write a swap
page that is mapped by any IO subsystem.

--Stephen

Linus Torvalds

Oct 18, 2000, 3:00:00 AM
to Andrea Arcangeli

On Wed, 18 Oct 2000, Andrea Arcangeli wrote:

> On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> > It so happens that the vmscan stuff won't ever remove a physical page
> > mapping, but that's simply because such a page CANNOT be swapped out. How
>
> So if I write a mechanism that allows those driver-private-pages that are used
> for DMA of a soundcard to be swapped out, then you would remove the
> PG_referenced bitflag from them too?

PG_referenced?

Maybe you mean PG_reserved?

Anyway, I didn't realize you were talking about the sound drivers use of
remap_page_range(). That's not the original reason for remap_page_range()
at all, and in fact it's the _ugly_ way to do things. It's simple and it
works, but it's not pretty.

Quite frankly, the way I'd conceptually prefer people do these kinds of
DMA buffers etc is to just have a "nopage()" function, and let that
nopage() function just fill in the page from the DMA buffer directly. This
would work fine, and would get rid of all the playing with PG_reserved
etc.

Obviously, such a page would never really get swapped out: you'd have a
"swapout()" function that would be a no-op (as the page would not truly be
swapped out: the page is actually held by the driver inside the kernel,
and the user mapping is just a _window_ into that kernel buffer).

Now, remap_page_range() ends up giving this same functionality without
sound drivers needing to have nopage() or swapout() functions at all, and
obviously it's trivial to do this way. But you should realize that this
use of remap_page_range() is very much a hack - the whole point of
remap_page_range() was always to do the PCI (and legacy ISA region)
mapping for things like the X server etc that access hardware directly.

I hated people mis-using it the way it's being done by the sound drivers,
but because I also realize that it allows for some simplifications I do
accept it - it's basically an ugly hack that doesn't really matter because
the exact same code _is_ needed for the real IO mapping case anyway, and
as far as the kernel is concerned the whole thing with PG_reserved is all
just a game to make the kernel VM subsystem think that some real RAM is
actually IO space and shouldn't be touched.

But for going the other way (ie moving a user space page into kernel
space, a la kiobuf mapping) is rather different, and I think it would be a
big ugly hairy lie to mark such a user space page be "PCI memory", because
that user space page might in fact be a page cache page etc. It's not at
all the same kind of controlled lie that the sound driver ugly hack is..

Linus

Linus Torvalds

Oct 18, 2000, 3:00:00 AM
to Andrea Arcangeli

On Wed, 18 Oct 2000, Andrea Arcangeli wrote:
>

> > Are you suggesting something like: if it is reading from a page (ie
> > writing the contents of that page somewhere else), we don't lock it, but
> > if it is writing to a page, we lock it so that the dirty bit won't get
> > lost.
>
> That wasn't what I suggested, but I also like the way you describe it above.
> It makes sense.

I don't think it really makes sense - I can see that it works, but I don't
like the way it ties in the dirty bit handling with the lock bit handling.
I personally think they are (and should be) unrelated.

> > Sure, that works (modulo the fact that it still has the issues with
> > serializing mmap's and accesses to other areas in the same page). But do
> > you really claim that it's the clean solution?
>
> It looks cleaner than swapping out a page while a device is writing
> to it in DMA under the swapout.

Note that _that_ is something I'd much rather handle another way entirely:
something I've long long wanted to do is to handle all swap-outs from the
"struct page *" rather than based on the VM scan.

Now, the way I've always envisioned this to work is that the VM scanning
function basically always does the equivalent of just

- get PTE entry, clear it out.
- if PTE was dirty, add the page to the swap cache, and mark it dirty,
but DON'T ACTUALLY START THE IO!
- free the page.

Basically, we removed the page from the virtual mapping, and it's now in
the LRU queues, and marked dirty there.

Then, we'd move the "writeout" part into the LRU queue side, and at that
point I agree with you 100% that we probably should just delay it until
there are no mappings available - ie we'd only write out a swap cache
entry if the count == 1 (ie it only exists in the swap cache), because
before that is true there are other people marking it dirty.

What are the advantges of this approach? Never mind the kiobuf issues at
this point - I wanted to do this long before kiobuf's happened.

Basically, it means that we'd write out shared pages only once, instead of
initiating write-back multiple times for each mapping that the page exists
in. Right now this isn't actually much of a problem, simply because it's
actually fairly hard to get shared dirty pages that would get written out
twice, but I think you see what I'm talking about on a conceptual level.

It also makes the kiobuf dirty issues a _completely_ separate issue, and
makes it very clean to handle: what kiobuf's become is just a kind of
virtual "pseudo-address-space". A "kiobuf address space" doesn't have a
page table, but it ends up being basically equivalent to a virtual address
space without the TLB overhead. Like a "real" address space attached to a
process, it has dirty pages in it, and like the real address space it
informs the VM layer of that through the page dirty bit.

See? THAT, in my opinion, is the clean way to handle this all.

Justin Schoeman

Oct 18, 2000, 3:00:00 AM
to Linus Torvalds, linux-...@vger.kernel.org
Linus Torvalds wrote:
...snip...

Slightly off topic (and a bit newbie too, but hey...). When I was
rewriting the bttv driver for v4l2, I used the v4l2 memory management,
which uses nopage() to increase page->count and then return the page (the
memory is vmalloc()ed). This worked, but there was occasional memory
corruption, as though the physical memory was being moved. Switching to
the old style bttv memory management (basically the same as the sound
drivers) fixed this problem. Is there any extra "catch" to using the
nopage method, or did I just make a stupid mistake somewhere. (If
anybody actually wants to look at the code, you can get it from
http://bttv-v4l2.sourceforge.net/download/driver091000.tgz - both
methods are there delimited by #if(n)def BTTV_MM lines. When BTTV_MM
is defined, it works fine, otherwise I get occasional memory
corruption.)

If anybody could give a comment on why one works, and the other not, I
would appreciate it!

TIA

-justin

Andrea Arcangeli

Oct 18, 2000, 3:00:00 AM
to Linus Torvalds
On Tue, Oct 17, 2000 at 09:26:07PM -0700, Linus Torvalds wrote:
> Maybe you mean PG_reserved?

Yes of course. (sorry for the typo)

> Quite frankly, the way I'd conceptually prefer people do these kinds of
> DMA buffers etc is to just have a "nopage()" function, and let that
> nopage() function just fill in the page from the DMA buffer directly. This
> would work fine, and would get rid of all the playing with PG_reserved
> etc.

And it would generate useless page faults as well. What's the point
of unmapping a page that can't be freed? Once the page gets freed
the mapping has already gone away via munmap anyway. What's the point of
introducing that performance hit?

Andrea

Andrea Arcangeli

Oct 18, 2000, 3:00:00 AM
to Linus Torvalds
On Tue, Oct 17, 2000 at 09:42:36PM -0700, Linus Torvalds wrote:
> - get PTE entry, clear it out.
> - if PTE was dirty, add the page to the swap cache, and mark it dirty,
> but DON'T ACTUALLY START THE IO!
> - free the page.
>
> Basically, we removed the page from the virtual mapping, and it's now in
> the LRU queues, and marked dirty there.
>
> Then, we'd move the "writeout" part into the LRU queue side, and at that
> point I agree with you 100% that we probably should just delay it until
> there are no mappings available - is we'd only write out a swap cache
> entry if the count == 1 (ie it only exists in the swap cache), because
> before that is true there are other people marking it dirty.

This change makes sense and I agree it would cover the problem. However I
prefer to clarify that doing it for the swap cache as described is not nearly
enough to cover the mm corruption (everything that gets written via a memory
balancing mechanism should do the same).

Said that I think it would be possible to do it for SHM and shared mappings too.

However this still makes me wonder why we should unmap the pte of a page that
we can't free until we unmap_kiobuf? That's not as bad as having a
nopage+swapout dummy operations in the sound driver DMA page case, because
usually user-kiobufs are temporary just for the time of the DMA I/O, though.

> twice, but I think you see what I'm talking about on a conceptual level.

I see.

> See? THAT, in my opinion, is the clean way to handle this all.

Ok. I'm still not completely convinced that it's right to unmap a page "pinned"
(get_page()) at the physical layer, but I think the above is conceptually a good
idea regardless of rawio, and if done everywhere it will let us avoid pinning
the page in the pte. The only argument left for pinning the page in the pte is
performance, and as said that's probably no big deal in real life, since user
kiobufs usually stay around only for the duration of the I/O.

Stephen Tweedie

Oct 18, 2000, 3:00:00 AM
to Andrea Arcangeli
Hi,

On Wed, Oct 18, 2000 at 03:23:17PM +0200, Andrea Arcangeli wrote:

> This change makes sense and I agree it would cover the problem. However I
> prefer to clarify that doing it for the swap cache as described is not nearly
> enough to cover the mm corruption (everything that gets written via a memory
> balancing mechanism should do the same).
>
> Said that I think it would be possible to do it for SHM and shared mappings too.

shm already does it: shm_swap_core returns RETRY unless page count is
1. file maps are close: file swapping already uses the address space
a_ops, but currently uses that at the wrong time: unmap time rather
than last-release. Rik?

> However this still makes me wonder why should we unmap the pte of a page that
> we can't free until we unmap_kiobuf? That's not as bad as having a
> nopage+swapout dummy operations in the sound driver DMA page case, because
> usually user-kiobufs are temporary just for the time of the DMA I/O, though.

It's no different from mlock(): if you've got some reason why one
specific mapping can't be unmapped (either it's locked or it is
inherently non-swappable like a kiobuf), then there's no reason to
make that per-mapping information visible at the per-page level.

We allow multiple processes to mlock() the same page, after all.

Cheers,
Stephen

Andrea Arcangeli

Oct 18, 2000, 3:00:00 AM
to Stephen Tweedie
On Wed, Oct 18, 2000 at 03:40:39PM +0100, Stephen C. Tweedie wrote:
> shm already does it: [..]

Right. Only the shared mappings and anonymous memory need to be changed.

Andrea

Linus Torvalds

Oct 18, 2000, 3:00:00 AM
to Andrea Arcangeli

On Wed, 18 Oct 2000, Andrea Arcangeli wrote:
>

> > Quite frankly, the way I'd conceptually prefer people do these kinds of
> > DMA buffers etc is to just have a "nopage()" function, and let that
> > nopage() function just fill in the page from the DMA buffer directly. This
> > would work fine, and would get rid of all the playing with PG_reserved
> > etc.
>
> And it would generate useless page faults as well. What's the point
> of unmapping a page that can't be freed? Once the page gets freed
> the mapping just gone away via munmap anyways. What's the point of
> introducing that performance hit?

As I explained in the part you snipped, this is why I _do_ accept
remap_page_range().

But to answer your question: the point is cleanliness.

For example, remap_page_range() cannot do many things. Because a reserved
page does not maintain any counts (that's the definition of "reserved" as
far as the Linux MM is concerned - the pages never count anything), you
cannot do proper memory management with it.

In the specific case of braindead hardware that needs a 1:1 direct mapping
due to DMA limitations etc, that's exactly what you want. You can't let
people move pages around anyway or do things like that. You can never let
people swap these pages out, because there is no point.

But this is NOT how the MM layer should work in general. And it's sure as
h*ll not how the direct IO layer is going to work. You can argue until you
get blue in the face, I don't care. I'm _not_ going to accept some
braindamaged setup where you can only have one concurrent write into a
page at the same time. I'm _not_ going to accept some braindamaged setup
where you can't unmap such a page in another process or thread.

In short, I'm _not_ willing to paper over bugs by just saying "don't touch
this page".

Linus

Gerd Knorr

Oct 18, 2000, 3:00:00 AM
to linux-...@vger.kernel.org
> I've got kiobuf diffs which allow you to (a) map any kernel memory
> (including vmalloced areas) into a kiobuf, and then (b) mmap any
> kiobuf into user space using sane, faulting vmas (with the option of
> prefaulting for performance if you want it). Nobody is using it right
> now so I wasn't planning on resending it to you until 2.5, but if you
> want to give people a chance to start using it earlier I can send it
> in.

Would be great. /me plans to use this for bttv ...

Gerd

--
Protecting the children is a good way to get a lot of adults who cant
stand up for themselves. -- seen in some sig on /.

Jeff Garzik

Oct 19, 2000, 3:00:00 AM
to Linus Torvalds
Linus Torvalds wrote:
> Anyway, I didn't realize you were talking about the sound drivers use of
> remap_page_range(). That's not the original reason for remap_page_range()
> at all, and in fact it's the _ugly_ way to do things. It's simple and it
> works, but it's not pretty.
>
> Quite frankly, the way I'd conceptually prefer people do these kinds of
> DMA buffers etc is to just have a "nopage()" function, and let that
> nopage() function just fill in the page from the DMA buffer directly. This
> would work fine, and would get rid of all the playing with PG_reserved
> etc.

Well coolio. Would somebody be up for sanity checking my audio mmap
code (attached)? It doesn't look too hard at all to get the audio
drivers doing the correct thing.

Converting to this was easy... my driver hardcodes the buffer size at
PAGE_SIZE. Since Via audio uses scatter-gather DMA, all I need to do in
'nopage' is get the virtual page number for the vma, and use that as a
direct index into the scatter-gather array.

Since this code works in my local tests, my two concerns at this point
are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

Jeff

--
Jeff Garzik | The difference between laziness and
Building 1024 | prioritization is the end result.
MandrakeSoft |

foo.txt

Andrea Arcangeli

Oct 19, 2000, 3:00:00 AM
to Jeff Garzik
On Wed, Oct 18, 2000 at 10:30:45PM -0400, Jeff Garzik wrote:
> Well coolio. Would somebody be up for sanity checking my audio mmap
> code (attached)? It doesn't look too hard at all to get the audio

Nice patch ;).

> vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */

Since you do vm_flags |= VM_LOCKED you won't need to implement swapout since
it will never run. (and the VM_SHM there is wrong)

The driver will work also this way but it will be slower because it will
generate pagefaults that would be otherwise avoided.

A case where the above would make tons of sense is if the mapped area were
of the order of gigabytes (or at least hundreds of megabytes), where lazy
mapping would avoid a big mmap(/dev/dsp) latency and save precious pagetables
at startup.

I just don't think the additional complexity is worthwhile for a sound driver.

However I see it's fun :).

Andrea

Linus Torvalds

Oct 19, 2000, 3:00:00 AM
to Jeff Garzik

On Wed, 18 Oct 2000, Jeff Garzik wrote:
>
> Well coolio. Would somebody be up for sanity checking my audio mmap
> code (attached)? It doesn't look too hard at all to get the audio
> drivers doing the correct thing.

Looks reasonable - although your "max_size" checks are wrong at mmap time.
You should check for something like

if (size > max_size)
goto out;
if (offset > max_size - size)
goto out;

instead of your current (offset >= max_size) check. As it stands, I think
you can mmap past the end by just having size and offset < max_size, but
the _sum_ being bigger..

> Since this code works in my local tests, my two concerns at this point
> are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

You should drop the

vma->vm_file = file;

as the VM layer will have done that already.

Also, the VM_LOCKED | VM_SHM thing is wrong, because it will cause
miscalculation of the mm->vm_locked fields, and it's also unnecessary.
Sure, it will cause the VM layer to not swap, but the no-op "swapping"
shouldn't much hurt anyway, as it's easy enough to re-populate the page
again.

So either just drop it, or make sure that the locked page count doesn't
get corrupted.

Linus

Jeff Garzik

Oct 19, 2000, 3:00:00 AM
to Linus Torvalds
Linus Torvalds wrote:
> Looks reasonable - although your "max_size" checks are wrong at mmap time.
> You should check for something like

thanks


> > Since this code works in my local tests, my two concerns at this point
> > are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

Can I completely ignore vm_pgoff in nopage()? Currently I just do the
following, to get the page number relative to the start of the vma:

pgoff = (address - vma->vm_start) >> PAGE_SHIFT;

where 'address' is the value passed to nopage().


> You should drop the
>
> vma->vm_file = file;
>
> as the VM layer will have done that already.
>
> Also, the VM_LOCKED | VM_SHM thing is wrong, because it will cause
> mis-calulcation of the mm->vm_locked fields, and it's also unnecessary.
> Sure, it will cause the VM layer to not swap, but the no-op "swapping"
> shouldn't much hurt anyway, as it's easy enough to re-populate the page
> again.
>
> So either just drop it, or make sure that the locked page count doesn't
> get corrupted.

The first draft of code had only

vma->vm_ops = &via_mm_ops;
vma->vm_private_data = card;

and userspace apps segfaulted. I then added the two objectionable
lines, and things started working:

vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */

vma->vm_file = file;

Have not yet checked any combinations in between...

Regards,

Jeff

--
Jeff Garzik | The difference between laziness and
Building 1024 | prioritization is the end result.
MandrakeSoft |

Linus Torvalds

Oct 19, 2000, 3:00:00 AM
to Jeff Garzik

On Thu, 19 Oct 2000, Jeff Garzik wrote:
>
> > > Since this code works in my local tests, my two concerns at this point
> > > are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.
>
> Can I completely ignore vm_pgoff in nopage()? Currently I just do the
> following, to get the page number relative to the start of the vma:
>
> pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>
> where 'address' is the value passed to nopage().

Oh, no, you can't do that. It should be something like

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);

and quite frankly we should just change the damn "nopage()" arguments to
be

struct page * (*nopage)(struct vm_area_struct * area, unsigned long pgoff, int write_access);

because the nopage() routine really shouldn't care about "address" at all
(the VM layer cares, but "nopage()" should just return the proper page at
the proper offset into the mapping).

I didn't notice..

>
> The first draft of code had only
>
> vma->vm_ops = &via_mm_ops;
> vma->vm_private_data = card;
>
> and userspace apps segfaulted. I then added the two objectionable
> lines, and things started working:
>
> vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
> vma->vm_file = file;
>
> Have not yet checked any combinations in between...

Ahh, I think I see why. You should do a "get_page()" on the page before
you return it from your nopage() - otherwise your page counts will be
completely wrong.

Actually, as far as I can tell your page counts ended up being wrong
anyway, but probably only at the end of the munmap() rather than at
runtime.

Linus

Jeff Garzik

Oct 19, 2000, 3:00:00 AM
to Linus Torvalds
Linus Torvalds wrote:
> and quite frankly we should just change the damn "nopage()" arguments to
> be
>
> struct page * (*nopage)(struct vm_area_struct * area, unsigned long pgoff, int write_access);
>
> because the nopage() routine really shouldn't care about "address" at all
> (the VM layer cares, but "nopage()" should just return the proper page at
> the proper offset into the mapping).

fine with me, it makes my code even shorter. ;-) This new mmap
solution is really elegant. Excluding all the debug code and assertions
I stick in there, the guts of via audio mmap support went from ~50 lines
to ~10.


> > The first draft of code had only
> >
> > vma->vm_ops = &via_mm_ops;
> > vma->vm_private_data = card;
> >
> > and userspace apps segfaulted. I then added the two objectionable
> > lines, and things started working:
> >
> > vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
> > vma->vm_file = file;
> >
> > Have not yet checked any combinations in between...
>
> Ahh, I think I see why. You should do a "get_page()" on the page before
> you return it from your nopage() - otherwise your page counts will be
> completely wrong.
>
> Actually, as far as I can tell your page counts ended up being wrong
> anyway, but probably only at the end of the munmap() rather than at
> runtime.

Thanks, fixed.

I stole the last two lines from drivers/char/drm/vm.c, which might need
to be fixed up also.. He uses the vm_flags above and nevers calls
get_page, at the very least.

Jeff

--
Jeff Garzik | The difference between laziness and
Building 1024 | prioritization is the end result.
MandrakeSoft |

Andrea Arcangeli

Oct 19, 2000, 3:00:00 AM
to Jeff Garzik
On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> solution is really elegant. Excluding all the debug code and assertions
> I stick in there, the guts of via audio mmap support went from ~50 lines
> to ~10.

Was it 50 lines with remap_page_range?

What is the advantage of introducing pagefaults that we can avoid (and
that we are used to avoiding)?

Andrea

Linus Torvalds

Oct 19, 2000, 3:00:00 AM
to Jeff Garzik

On Thu, 19 Oct 2000, Jeff Garzik wrote:
>

> I stole the last two lines from drivers/char/drm/vm.c, which might need
> to be fixed up also. He uses the vm_flags above and never calls
> get_page, at the very least.

The DRM code does

atomic_inc(&virt_to_page(physical)->count);

which is basically what get_page() expands into. The DRM code looks ugly,
but correct, at least as far as this thing is concerned.

But you're right about the mmap vm_flags/vm_file things.

Linus

Linus Torvalds

Oct 19, 2000, 3:00:00 AM
to Andrea Arcangeli

On Fri, 20 Oct 2000, Andrea Arcangeli wrote:

> On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > solution is really elegant. Excluding all the debug code and assertions
> > I stick in there, the guts of via audio mmap support went from ~50 lines
> > to ~10.
>
> Was it 50 lines with remap_page_range?

You cannot use remap_page_range() if the pages aren't physically
contiguous.

Face it, Andrea, remap_page_range() is a very limited hack that only works
for a very limited subset of the things that real people want to do. It is
_not_ a generic solution, while the nopage() approach _is_ generic.

Jeff Garzik

Oct 19, 2000, 3:00:00 AM
to Andrea Arcangeli
Andrea Arcangeli wrote:
>
> On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > solution is really elegant. Excluding all the debug code and assertions
> > I stick in there, the guts of via audio mmap support went from ~50 lines
> > to ~10.
>
> Was it 50 lines with remap_page_range?

Yeah. Via audio is scatter-gather, so I had a function via_mmap_chan
which had to walk the scatter-gather list, calling remap_page_range for
each PAGE_SIZE'd dma buffer.

Now with nopage(), the need to walk the scatter-gather list is
completely avoided. The scatter-gather info is stored as an array, so
finding the physical address for a virtual page is a direct lookup.
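A user-space sketch of the difference being described (toy model, hypothetical names — this is not the via audio source): with remap_page_range() every scatter-gather entry has to be mapped up front, while with nopage() the faulting page offset indexes the array directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a scatter-gather table: one physical
 * address per PAGE_SIZE'd DMA buffer. Plain user-space C. */
struct toy_sg {
    unsigned long phys;
};

static unsigned long last_mapped;   /* records the last "mapped" page */

static void record_one(unsigned long phys)
{
    last_mapped = phys;
}

/* remap_page_range() era: walk the whole table at mmap() time,
 * mapping every entry up front -- O(n) in the buffer size. */
size_t toy_map_all(const struct toy_sg *sg, size_t nr,
                   void (*map_one)(unsigned long phys))
{
    size_t i;

    for (i = 0; i < nr; i++)
        map_one(sg[i].phys);
    return i;
}

/* nopage() era: the faulting page offset indexes the table
 * directly -- O(1) per fault, no list walk. */
unsigned long toy_lookup(const struct toy_sg *sg, size_t nr,
                         unsigned long pgoff)
{
    return pgoff < nr ? sg[pgoff].phys : 0;
}
```

The trade-off matches the thread: the eager walk avoids faults entirely, the lazy lookup keeps mmap() itself trivial.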


> What is the advantage of introducing pagefaults that we can avoid (and
> that we are used to avoiding)?

Well, I don't know the VM well, so I can't say how badly the lack of
VM_LOCKED will affect latency, if at all.

I prefer this way because it seems the cleanest -- let the system
pagefault if it wants to; we're not really losing the buffer, only a
reference to it.

Jeff

--
Jeff Garzik | The difference between laziness and
Building 1024 | prioritization is the end result.
MandrakeSoft |

Stephen Tweedie

Oct 18, 2000, 3:00:00 AM
to Linus Torvalds
Hi,

On Tue, Oct 17, 2000 at 09:26:07PM -0700, Linus Torvalds wrote:

> I hated people mis-using it the way it's being done by the sound drivers,
> but because I also realize that it allows for some simplifications I do
> accept it - it's basically an ugly hack that doesn't really matter because
> the exact same code _is_ needed for the real IO mapping case anyway, and
> as far as the kernel is concerned the whole thing with PG_reserved is all
> just a game to make the kernel VM subsystem think that some real RAM is
> actually IO space and shouldn't be touched.

I've got kiobuf diffs which allow you to (a) map any kernel memory
(including vmalloced areas) into a kiobuf, and then (b) mmap any
kiobuf into user space using sane, faulting vmas (with the option of
prefaulting for performance if you want it). Nobody is using it right
now so I wasn't planning on resending it to you until 2.5, but if you
want to give people a chance to start using it earlier I can send it
in.

--Stephen

Andrea Arcangeli

Oct 20, 2000, 3:00:00 AM
to Linus Torvalds
On Thu, Oct 19, 2000 at 03:45:43PM -0700, Linus Torvalds wrote:

>
>
> On Fri, 20 Oct 2000, Andrea Arcangeli wrote:
>
> > On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > > solution is really elegant. Excluding all the debug code and assertions
> > > I stick in there, the guts of via audio mmap support went from ~50 lines
> > > to ~10.
> >
> > Was it 50 lines with remap_page_range?
>
> You cannot use remap_page_range() if the pages aren't physically
> contiguous.

Yep.

And what about the other question?

> Face it, Andrea, remap_page_range() is a very limited hack that only works
> for a very limited subset of the things that real people want to do. It is
> _not_ a generic solution, while the nopage() approach _is_ generic.

But without keeping those pages pinned in the ptes, the nopage approach is
_suboptimal_, and adding a dummy swapout that returns "progress" where no
progress has been made will also _break_ the VM with relatively large device
mmapped vmas. It would at least _certainly_ break 2.2.18pre15aa1 (if 2.4.x has
an oom deadlock in GFP it won't break, of course).

I'm fine with dropping remap_page_range and the PG_reserved bit, but to do that I'd
suggest adding a new per-VMA VM_RESERVED bitflag. This will also avoid walking
pagetables where it makes no sense to unmap the pages (not only avoiding
senseless page faults) and it will fix the VM callback problem of returning
"success" when no progress is made. (SHM returns a dummy 0 too, but progress has
been made in that case.)

The only reason it could make sense to do that unmapping is if we were also able
to free the _pagetables_, but we do that only via munmap, if a whole pgd
entry gets unmapped (and after munmap the page has just been unmapped anyway).
So once we free pagetables via swap_out (maybe never), some driver with
hundreds of mbytes of virtual mapping may choose to drop the VM_RESERVED
bitflag from its vmas if pagefaults aren't a performance problem for that
driver.

I just find it hard to widely propose something that ends up doing the wrong
thing (OK, I see it's not a big deal, but it's still the _wrong_ thing to do, and
the VM_RESERVED way will fix it at _zero_ cost: it will improve performance
compared to PG_reserved and it will also avoid the ugliness of having all the
SG drivers out there implement a dummy swapout callback :).

This introduces VM_RESERVED against test10-pre4:

--- 2.4.0-test10-pre4/include/linux/mm.h.~1~ Fri Oct 20 17:56:09 2000
+++ 2.4.0-test10-pre4/include/linux/mm.h Fri Oct 20 18:23:47 2000
@@ -95,6 +95,7 @@

#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
+#define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out */

#define VM_STACK_FLAGS 0x00000177

--- 2.4.0-test10-pre4/include/linux/wrapper.h.~1~ Tue Aug 8 06:01:36 2000
+++ 2.4.0-test10-pre4/include/linux/wrapper.h Fri Oct 20 18:26:02 2000
@@ -29,8 +29,17 @@
#define vma_get_end(v) v->vm_end
#define vma_get_page_prot(v) v->vm_page_prot

+/*
+ * mem_map_reserve()/unreserve() are going to be obsoleted by
+ * vma_reserve(). (unreserve shouldn't be necessary)
+ *
+ * Instead of marking the pages as reserved, just mark the vma as reserved
+ * this will improve performance (it's zero cost unlike the PG_reserved check)
+ * and it will be trivial for not physically contiguous mappings too.
+ */
#define mem_map_reserve(p) set_bit(PG_reserved, &p->flags)
#define mem_map_unreserve(p) clear_bit(PG_reserved, &p->flags)
#define mem_map_inc_count(p) atomic_inc(&(p->count))
#define mem_map_dec_count(p) atomic_dec(&(p->count))
+#define vma_reserve(vma) ((vma)->vm_flags |= VM_RESERVED)
#endif
--- 2.4.0-test10-pre4/mm/vmscan.c.~1~ Fri Oct 20 17:56:10 2000
+++ 2.4.0-test10-pre4/mm/vmscan.c Fri Oct 20 18:11:09 2000
@@ -318,7 +318,7 @@
unsigned long end;

/* Don't swap out areas which are locked down */
- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
return 0;

pgdir = pgd_offset(mm, address);


And this shows a practical (but untested :) demonstration of the VM_RESERVED
usage:

--- 2.4.0-test10-pre4/drivers/media/video/bttv-driver.c.~1~ Thu Oct 12 03:04:42 2000
+++ 2.4.0-test10-pre4/drivers/media/video/bttv-driver.c Fri Oct 20 18:32:43 2000
@@ -39,7 +39,6 @@
#include <linux/sched.h>
#include <asm/segment.h>
#include <linux/types.h>
-#include <linux/wrapper.h>
#include <linux/interrupt.h>
#include <linux/kmod.h>
#include <linux/vmalloc.h>
@@ -181,40 +180,17 @@
static void * rvmalloc(signed long size)
{
void * mem;
- unsigned long adr, page;

mem=vmalloc_32(size);
if (mem)
- {
memset(mem, 0, size); /* Clear the ram out, no junk to the user */
- adr=(unsigned long) mem;
- while (size > 0)
- {
- page = kvirt_to_pa(adr);
- mem_map_reserve(virt_to_page(__va(page)));
- adr+=PAGE_SIZE;
- size-=PAGE_SIZE;
- }
- }
return mem;
}

static void rvfree(void * mem, signed long size)
{
- unsigned long adr, page;
-
if (mem)
- {
- adr=(unsigned long) mem;
- while (size > 0)
- {
- page = kvirt_to_pa(adr);
- mem_map_unreserve(virt_to_page(__va(page)));
- adr+=PAGE_SIZE;
- size-=PAGE_SIZE;
- }
vfree(mem);
- }
}


--- 2.4.0-test10-pre4/drivers/media/video/videodev.c.~1~ Wed Jul 19 07:35:33 2000
+++ 2.4.0-test10-pre4/drivers/media/video/videodev.c Fri Oct 20 18:32:38 2000
@@ -27,6 +27,7 @@
#include <linux/errno.h>
#include <linux/videodev.h>
#include <linux/init.h>
+#include <linux/wrapper.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -228,6 +229,7 @@
int ret = -EINVAL;
struct video_device *vfl=video_device[MINOR(file->f_dentry->d_inode->i_rdev)];
if(vfl->mmap) {
+ vma_reserve(vma);
lock_kernel();
ret = vfl->mmap(vfl, (char *)vma->vm_start,
(unsigned long)(vma->vm_end-vma->vm_start));

bttv-driver.c | 24 ------------------------
videodev.c | 2 ++
2 files changed, 2 insertions(+), 24 deletions(-)

Once VM_RESERVED is in place, using nopage instead of remap_page_range looks
fine to me (we'll have an additional startup page fault, but that can be
explained as a mmap latency reduction, and after all people can keep
using remap_page_range to avoid that first page fault, as the bttv
driver would still do with the above patch applied even though it's just
using the new vma_reserve way).

(In theory we could use VM_LOCKED for that, but as you noticed we would
end up corrupting the locked-page stats, and I just find it cleaner
not to mix mlock and reserved-kernel-page issues.)
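The objection can be condensed into a few lines of user-space toy C (flag values mirror the patch above, but nothing here is kernel code): with the VM_RESERVED check in swap_out, a device mapping honestly reports zero progress, instead of a dummy swapout callback pretending pages were reclaimed.

```c
#include <assert.h>
#include <stddef.h>

/* User-space toy of the swap_out check in the patch above;
 * flag values copied from the patch, names hypothetical. */
#define VM_LOCKED   0x00002000UL
#define VM_RESERVED 0x00080000UL   /* don't unmap from swap_out */

struct toy_vma {
    unsigned long vm_flags;
    size_t nr_pages;
};

/*
 * Reserved (and locked) areas are skipped up front and contribute
 * zero progress, so the VM never mistakes a no-op for reclaim.
 */
size_t toy_swap_out_vma(const struct toy_vma *vma)
{
    if (vma->vm_flags & (VM_LOCKED | VM_RESERVED))
        return 0;                /* skip: no fake "progress" */
    return vma->nr_pages;        /* toy: everything reclaimable */
}
```

A dummy per-driver swapout callback would instead let the scan run and report success with nothing freed, which is the misaccounting the patch avoids.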

Andrea

Linus Torvalds

Oct 20, 2000, 3:00:00 AM
to Stephen Tweedie

On Thu, 19 Oct 2000, Stephen Tweedie wrote:
> >
> > Then, we'd move the "writeout" part into the LRU queue side, and at that
> > point I agree with you 100% that we probably should just delay it until
> > there are no mappings available
>

> I've just been talking about this with Ben LaHaise and Rik van Riel,
> and Ben brought up a nasty problem --- NFS, which because of its
> credentials requirements needs to have the struct file available in
> its writepage function. Of course, if we defer the write then we
> don't necessarily have the file available when we come to flush the
> page from cache.

Yes. But that doesn't mean that swapping couldn't do it (swapping
fundamentally doesn't have credentials).

And note that this is not about "NFS is broken" - any remote filesystem
will have some issues like this, and shared mappings will always have to
handle this case.

So basically I agree that shared mappings cannot be converted to this
setup, I was only talking about the specific case of the swapping (and
anonymous shared memory, which along with SysV IPC shm is basically the
same thing and already uses the swap cache).

So what I was thinking of was the very end of try_to_swap_out(), where we
have noticed that we do not have a "swapout()" function, and we need to
add the page to the swap cache. I would suggest moving _that_ code to the
LRU queue, and handling it conceptually together with the stuff that
handles the buffer cache writeout.

--

And no, I haven't forgotten about the case of direct IO into a shared
mapping. That _is_ going to be different in many ways, and I suspect that
a solution to that particular issue may be to move the "vm_file"
information from when we do the virtual kiobuf lookup into the kiobuf's,
because otherwise we'd basically lose that information.

(We _already_ lose that information, in fact. Keeping the page in the
virtual mapping doesn't really even fix it - because the page can be in
multiple virtual mappings with different vm_file's and thus different
credentials. And the kiobuf's do not really contain any information of
_which_ of the credentials we looked up. It happens to work, but it's
conceptually not very correct).

Linus

fa...@valinux.com

Oct 20, 2000, 3:00:00 AM
to Linus Torvalds
In article <Pine.LNX.4.10.100101...@penguin.transmeta.com>,

you write:
>
>
>On Thu, 19 Oct 2000, Jeff Garzik wrote:
>>
>> I stole the last two lines from drivers/char/drm/vm.c, which might need
>> to be fixed up also.. He uses the vm_flags above and nevers calls
>> get_page, at the very least.
>
>The DRM code does
>
> atomic_inc(&virt_to_page(physical)->count);
>
>which is basically what get_page() expands into. The DRM code looks ugly,
>but correct, at least as far as this thing is concerned.
>
>But you're right about the mmap vm_flags/vm_file things.

We'll look at this and submit the changes with the next patch set.

Stephen Tweedie

Oct 20, 2000, 3:00:00 AM
to Linus Torvalds
Hi,

On Tue, Oct 17, 2000 at 09:42:36PM -0700, Linus Torvalds wrote:

> Now, the way I've always envisioned this to work is that the VM scanning
> function basically always does the equivalent of just
>

> - get PTE entry, clear it out.
> - if PTE was dirty, add the page to the swap cache, and mark it dirty,
> but DON'T ACTUALLY START THE IO!
> - free the page.
>

> Then, we'd move the "writeout" part into the LRU queue side, and at that
> point I agree with you 100% that we probably should just delay it until
> there are no mappings available

I've just been talking about this with Ben LaHaise and Rik van Riel,
and Ben brought up a nasty problem --- NFS, which because of its
credentials requirements needs to have the struct file available in
its writepage function. Of course, if we defer the write then we
don't necessarily have the file available when we come to flush the
page from cache.

One answer is to say "well then NFS is broken, fix it". It's not too
hard --- NFS mmaps need a wp_page function which registers the
caller's credentials against the page when we dirty it so that we can
use those credentials on flush. That means that writes to a
multiply-mapped file essentially get random credentials, but I don't
think we care --- the credentials eventually used will be enough to
avoid the root_squash problems and the permissions at open make sure
we're not doing anything illegal.

(Changing permissions on an already-mmaped file and causing the NFS
server to refuse the write raises problems which are ... interesting,
but I'm not convinced that that is a new problem; I suspect we can
fabricate such a failure today.)

--Stephen
