
[patch 1/3] split mmap


Miklos Szeredi

Mar 24, 2007, 6:10:07 PM
From: Miklos Szeredi <msze...@suse.cz>

This is a straightforward split of do_mmap_pgoff() into two functions:

- do_mmap_pgoff() checks the parameters, and calculates the vma
flags. Then it calls

- mmap_region(), which does the actual mapping
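
In outline, the resulting shape is (a condensed sketch of the split, not
the literal patched code):

    unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
                                unsigned long len, unsigned long prot,
                                unsigned long flags, unsigned long pgoff)
    {
            unsigned int vm_flags;
            int accountable = 1;

            /* ... check parameters, calculate vm_flags, security checks ... */

            return mmap_region(file, addr, len, flags, vm_flags, pgoff,
                               accountable);
    }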

Signed-off-by: Miklos Szeredi <msze...@suse.cz>
---

Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2007-03-24 21:00:40.000000000 +0100
+++ linux/mm/mmap.c 2007-03-24 22:28:52.000000000 +0100
@@ -893,14 +893,11 @@ unsigned long do_mmap_pgoff(struct file
unsigned long flags, unsigned long pgoff)
{
struct mm_struct * mm = current->mm;
- struct vm_area_struct * vma, * prev;
struct inode *inode;
unsigned int vm_flags;
- int correct_wcount = 0;
int error;
- struct rb_node ** rb_link, * rb_parent;
int accountable = 1;
- unsigned long charged = 0, reqprot = prot;
+ unsigned long reqprot = prot;

/*
* Does the application expect PROT_READ to imply PROT_EXEC?
@@ -1025,7 +1022,25 @@ unsigned long do_mmap_pgoff(struct file
error = security_file_mmap(file, reqprot, prot, flags);
if (error)
return error;
-
+
+ return mmap_region(file, addr, len, flags, vm_flags, pgoff,
+ accountable);
+}
+EXPORT_SYMBOL(do_mmap_pgoff);
+
+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ unsigned int vm_flags, unsigned long pgoff,
+ int accountable)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma, *prev;
+ int correct_wcount = 0;
+ int error;
+ struct rb_node **rb_link, *rb_parent;
+ unsigned long charged = 0;
+ struct inode *inode = file ? file->f_path.dentry->d_inode : NULL;
+
/* Clear old maps */
error = -ENOMEM;
munmap_back:
@@ -1174,8 +1189,6 @@ unacct_error:
return error;
}

-EXPORT_SYMBOL(do_mmap_pgoff);
-
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2007-03-24 21:00:40.000000000 +0100
+++ linux/include/linux/mm.h 2007-03-24 22:28:52.000000000 +0100
@@ -1035,6 +1035,10 @@ extern unsigned long get_unmapped_area(s
extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff);
+extern unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ unsigned int vm_flags, unsigned long pgoff,
+ int accountable);

static inline unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,

Miklos Szeredi

Mar 24, 2007, 6:20:08 PM
From: Miklos Szeredi <msze...@suse.cz>

Dirty page accounting/limiting doesn't work for nonlinear mappings, so
for non-ram backed filesystems emulate with linear mappings. This
retains ABI compatibility with previous kernels at minimal code cost.

All known users of nonlinear mappings actually use tmpfs, so this
shouldn't have any negative effect.
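
For reference, a nonlinear mapping is set up from userspace roughly as in
this hypothetical snippet (file name and sizes are arbitrary); with this
patch, on a non-ram backed filesystem such a request is satisfied with an
ordinary linear vma instead:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            int fd = open("somefile", O_RDWR);
            char *win;

            if (fd < 0)
                    return 1;
            win = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
            if (win == MAP_FAILED)
                    return 1;
            /* show file page 2 at window page 0; prot must be 0 here */
            if (remap_file_pages(win, psz, 0, 2, 0) != 0)
                    return 1;
            return 0;
    }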

Signed-off-by: Miklos Szeredi <msze...@suse.cz>
---

Index: linux-2.6.21-rc4-mm1/mm/fremap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/fremap.c 2007-03-24 22:30:05.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/fremap.c 2007-03-24 22:37:59.000000000 +0100
@@ -181,6 +181,24 @@ asmlinkage long sys_remap_file_pages(uns
goto retry;
}
mapping = vma->vm_file->f_mapping;
+ /*
+ * page_mkclean doesn't work on nonlinear vmas, so if dirty
+ * pages need to be accounted, emulate with linear vmas.
+ */
+ if (mapping_cap_account_dirty(mapping)) {
+ unsigned long addr;
+
+ flags &= MAP_NONBLOCK;
+ addr = mmap_region(vma->vm_file, start, size, flags,
+ vma->vm_flags, pgoff, 1);
+ if (IS_ERR_VALUE(addr))
+ err = addr;
+ else {
+ BUG_ON(addr != start);
+ err = 0;
+ }
+ goto out;
+ }
spin_lock(&mapping->i_mmap_lock);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;

Miklos Szeredi

Mar 24, 2007, 6:20:12 PM
From: Miklos Szeredi <msze...@suse.cz>

Changes:
v3:
o rename is_page_modified to test_clear_page_modified
v2:
o set AS_CMTIME flag in clear_page_dirty_for_io() too
o don't clear AS_CMTIME in file_update_time()
o check the dirty bit in the page tables
v1:
o moved check from __fput() to remove_vma(), which is more logical
o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

The st_ctime and st_mtime fields of a file that is mapped with
MAP_SHARED and PROT_WRITE shall be marked for update at some point
in the interval between a write reference to the mapped region and
the next call to msync() with MS_ASYNC or MS_SYNC for that portion
of the file by any process. If there is no such call and if the
underlying file is modified as a result of a write reference, then
these fields shall be marked for update at some time after the
write reference.
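
Concretely, the guarantee covers a sequence like the following
hypothetical snippet (file name and length are arbitrary; the file is
assumed to be at least one page long):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("somefile", O_RDWR);
            char *p;

            if (fd < 0)
                    return 1;
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;
            p[0] = 'x';               /* write reference to the mapping */
            msync(p, 4096, MS_ASYNC); /* times must be marked for update
                                         by this point */
            return 0;
    }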

A new address_space flag is introduced: AS_CMTIME. This is set each
time a page is dirtied through a userspace memory mapping. This
includes write accesses via get_user_pages().

Note, the flag is set unconditionally, even if the page is already
dirty. This is important, because the page might have been dirtied
earlier by a non-mmap write.

This flag is checked in msync() and munmap()/mremap(), and if set, the
file times are updated and the flag is cleared.

Msync also needs to check the dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write. This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page. Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes. This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each. This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <msze...@suse.cz>
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h 2007-03-24 19:03:11.000000000 +0100
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h 2007-03-24 19:34:30.000000000 +0100
@@ -19,6 +19,7 @@
*/
#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
#define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
+#define AS_CMTIME (__GFP_BITS_SHIFT + 2) /* ctime/mtime update needed */

static inline void mapping_set_error(struct address_space *mapping, int error)
{
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h 2007-03-24 19:04:15.000000000 +0100
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h 2007-03-24 19:34:30.000000000 +0100
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
struct page *page);
int FASTCALL(set_page_dirty(struct page *page));
int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
int clear_page_dirty_for_io(struct page *page);

extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c 2007-03-24 19:03:11.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/memory.c 2007-03-24 19:34:30.000000000 +0100
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
anon_rss--;
else {
if (pte_dirty(ptent))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
if (pte_young(ptent))
SetPageReferenced(page);
file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
mark_page_accessed(page);
}
unlock:
@@ -1519,6 +1519,15 @@ static inline void cow_user_page(struct
copy_user_highpage(dst, src, va, vma);
}

+static void set_page_dirty_mapping_balance(struct page *page)
+{
+ if (set_page_dirty_mapping(page)) {
+ struct address_space *mapping = page_mapping(page);
+ if (mapping)
+ balance_dirty_pages_ratelimited(mapping);
+ }
+}
+
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
@@ -1678,7 +1687,7 @@ unlock:
* do_no_page is protected similarly.
*/
wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page);
+ set_page_dirty_mapping_balance(dirty_page);
put_page(dirty_page);
}
return ret;
@@ -2328,7 +2337,7 @@ out:
if (anon)
page_cache_release(faulted_page);
else if (dirty_page) {
- set_page_dirty_balance(dirty_page);
+ set_page_dirty_mapping_balance(dirty_page);
put_page(dirty_page);
}

Index: linux-2.6.21-rc4-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/page-writeback.c 2007-03-24 19:27:26.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/page-writeback.c 2007-03-24 19:34:30.000000000 +0100
@@ -290,16 +290,6 @@ static void balance_dirty_pages(struct a
pdflush_operation(background_writeout, 0);
}

-void set_page_dirty_balance(struct page *page)
-{
- if (set_page_dirty(page)) {
- struct address_space *mapping = page_mapping(page);
-
- if (mapping)
- balance_dirty_pages_ratelimited(mapping);
- }
-}
-
/**
* balance_dirty_pages_ratelimited_nr - balance dirty memory state
* @mapping: address_space which was dirtied
@@ -848,17 +838,42 @@ EXPORT_SYMBOL(redirty_page_for_writepage
* If the mapping doesn't provide a set_page_dirty a_op, then
* just fall through and assume that it wants buffer_heads.
*/
+static inline int __set_page_dirty(struct address_space *mapping,
+ struct page *page)
+{
+ int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+ if (!spd)
+ spd = __set_page_dirty_buffers;
+#endif
+ return (*spd)(page);
+}
+
int fastcall set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ if (likely(mapping))
+ return __set_page_dirty(mapping, page);
+ if (!PageDirty(page)) {
+ if (!TestSetPageDirty(page))
+ return 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(set_page_dirty);
+
+/*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping. In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
if (likely(mapping)) {
- int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
- if (!spd)
- spd = __set_page_dirty_buffers;
-#endif
- return (*spd)(page);
+ set_bit(AS_CMTIME, &mapping->flags);
+ return __set_page_dirty(mapping, page);
}
if (!PageDirty(page)) {
if (!TestSetPageDirty(page))
@@ -866,7 +881,6 @@ int fastcall set_page_dirty(struct page
}
return 0;
}
-EXPORT_SYMBOL(set_page_dirty);

/*
* set_page_dirty() is racy if the caller has no reference against
@@ -936,7 +950,7 @@ int clear_page_dirty_for_io(struct page
* threads doing their things.
*/
if (page_mkclean(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
Index: linux-2.6.21-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/rmap.c 2007-03-24 19:03:11.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/rmap.c 2007-03-24 19:34:30.000000000 +0100
@@ -507,6 +507,43 @@ int page_mkclean(struct page *page)
EXPORT_SYMBOL_GPL(page_mkclean);

/**
+ * test_clear_page_modified - check and clear the dirty bit for all mappings of a page
+ * @page: the page to check
+ */
+bool test_clear_page_modified(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ bool modified = false;
+
+ BUG_ON(!mapping);
+ BUG_ON(!page_mapped(page));
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ if (vma->vm_flags & VM_SHARED) {
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = vma_address(page, vma);
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ if (addr != -EFAULT &&
+ (pte = page_check_address(page, mm, addr, &ptl))) {
+ if (ptep_clear_flush_dirty(vma, addr, pte))
+ modified = true;
+ pte_unmap_unlock(pte, ptl);
+ }
+ }
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ if (page_test_and_clear_dirty(page))
+ modified = true;
+ return modified;
+}
+
+/**
* page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
@@ -657,7 +694,7 @@ void page_remove_rmap(struct page *page,
* faster for those pages still in swapcache.
*/
if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}
@@ -702,7 +739,7 @@ static int try_to_unmap_one(struct page

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);

/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -836,7 +873,7 @@ static void try_to_unmap_cluster(unsigne

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);

page_remove_rmap(page, vma);
page_cache_release(page);
Index: linux-2.6.21-rc4-mm1/include/linux/writeback.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/writeback.h 2007-03-24 19:27:26.000000000 +0100
+++ linux-2.6.21-rc4-mm1/include/linux/writeback.h 2007-03-24 19:34:30.000000000 +0100
@@ -129,7 +129,6 @@ int sync_page_range(struct inode *inode,
loff_t pos, loff_t count);
int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count);
-void set_page_dirty_balance(struct page *page);
void writeback_set_ratelimit(void);

/* pdflush.c */
Index: linux-2.6.21-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/mmap.c 2007-03-24 19:04:15.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/mmap.c 2007-03-24 19:34:30.000000000 +0100
@@ -222,12 +222,16 @@ void unlink_file_vma(struct vm_area_stru
static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *next = vma->vm_next;
+ struct file *file = vma->vm_file;

might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
- if (vma->vm_file)
- fput(vma->vm_file);
+ if (file) {
+ if (test_and_clear_bit(AS_CMTIME, &file->f_mapping->flags))
+ file_update_time(file);
+ fput(file);
+ }
mpol_free(vma_policy(vma));
kmem_cache_free(vm_area_cachep, vma);
return next;
Index: linux-2.6.21-rc4-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/hugetlb.c 2007-03-24 19:03:11.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/hugetlb.c 2007-03-24 19:34:30.000000000 +0100
@@ -407,7 +407,7 @@ void __unmap_hugepage_range(struct vm_ar

page = pte_page(pte);
if (pte_dirty(pte))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
list_add(&page->lru, &page_list);
}
spin_unlock(&mm->page_table_lock);
Index: linux-2.6.21-rc4-mm1/mm/msync.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/msync.c 2007-02-04 19:44:54.000000000 +0100
+++ linux-2.6.21-rc4-mm1/mm/msync.c 2007-03-24 19:34:30.000000000 +0100
@@ -12,6 +12,85 @@
#include <linux/mman.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/pagevec.h>
+
+/*
+ * Update ctime/mtime on msync().
+ *
+ * POSIX requires that the times are updated between a modification
+ * of the file through a memory mapping and the next msync for a
+ * region containing the modification. The wording implies that this
+ * must be done even if the modification was through a different
+ * address space. Ugh.
+ *
+ * Non-linear vmas are too hard to handle and they are non-standard
+ * anyway, so they are ignored for now.
+ *
+ * The "file modified" info is collected from two places:
+ *
+ * - AS_CMTIME flag of the mapping
+ * - the dirty bit of the ptes
+ *
+ * For memory backed filesystems all the pages in the range need to be
+ * examined. In other cases, since dirty pages are accurately
+ * tracked, it is enough to look at the pages with the dirty tag.
+ */
+static void msync_update_file_time(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct address_space *mapping;
+ struct pagevec pvec;
+ pgoff_t index;
+ pgoff_t end_index;
+ bool modified;
+
+ if (!vma->vm_file || !(vma->vm_flags & VM_SHARED) ||
+ (vma->vm_flags & VM_NONLINEAR))
+ return;
+
+ mapping = vma->vm_file->f_mapping;
+ modified = test_and_clear_bit(AS_CMTIME, &mapping->flags);
+
+ pagevec_init(&pvec, 0);
+ index = linear_page_index(vma, start);
+ end_index = linear_page_index(vma, end);
+ while (index < end_index) {
+ int i;
+ int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
+
+ if (mapping_cap_account_dirty(mapping))
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY, nr_pages);
+ else
+ nr_pages = pagevec_lookup(&pvec, mapping, index,
+ nr_pages);
+ if (!nr_pages)
+ break;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ /* Skip pages which are just being read */
+ if (!PageUptodate(page))
+ continue;
+
+ lock_page(page);
+ index = page->index + 1;
+ if (page->mapping == mapping &&
+ test_clear_page_modified(page)) {
+ set_page_dirty(page);
+ modified = true;
+ }
+ unlock_page(page);
+ }
+ pagevec_release(&pvec);
+ }
+
+ if (modified)
+ file_update_time(vma->vm_file);
+}

/*
* MS_SYNC syncs the entire file - including mappings.
@@ -75,6 +154,9 @@ asmlinkage long sys_msync(unsigned long
error = -EBUSY;
goto out_unlock;
}
+ if (flags & (MS_SYNC | MS_ASYNC))
+ msync_update_file_time(vma, start,
+ min(end, vma->vm_end));
file = vma->vm_file;
start = vma->vm_end;
if ((flags & MS_SYNC) && file &&
Index: linux-2.6.21-rc4-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/rmap.h 2007-03-24 19:03:11.000000000 +0100
+++ linux-2.6.21-rc4-mm1/include/linux/rmap.h 2007-03-24 19:34:30.000000000 +0100
@@ -100,6 +100,8 @@ unsigned long page_address_in_vma(struct
*/
int page_mkclean(struct page *);

+bool test_clear_page_modified(struct page *page);
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)

Peter Zijlstra

Mar 25, 2007, 8:20:05 AM
On Sat, 2007-03-24 at 23:07 +0100, Miklos Szeredi wrote:
> From: Miklos Szeredi <msze...@suse.cz>
>
> This is a straightforward split of do_mmap_pgoff() into two functions:
>
> - do_mmap_pgoff() checks the parameters, and calculates the vma
> flags. Then it calls
>
> - mmap_region(), which does the actual mapping
>
> Signed-off-by: Miklos Szeredi <msze...@suse.cz>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Peter Zijlstra

Mar 25, 2007, 8:20:04 AM

A few comments..

> Index: linux-2.6.21-rc4-mm1/mm/rmap.c
> ===================================================================
> --- linux-2.6.21-rc4-mm1.orig/mm/rmap.c 2007-03-24 19:03:11.000000000 +0100
> +++ linux-2.6.21-rc4-mm1/mm/rmap.c 2007-03-24 19:34:30.000000000 +0100
> @@ -507,6 +507,43 @@ int page_mkclean(struct page *page)
> EXPORT_SYMBOL_GPL(page_mkclean);
>
> /**
> + * test_clear_page_modified - check and clear the dirty bit for all mappings of a page
> + * @page: the page to check
> + */
> +bool test_clear_page_modified(struct page *page)
> +{
> + struct address_space *mapping = page->mapping;

page_mapping(page)? Otherwise that BUG_ON(!mapping) a few lines down
isn't of much use.

> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + bool modified = false;
> +
> + BUG_ON(!mapping);
> + BUG_ON(!page_mapped(page));
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> + if (vma->vm_flags & VM_SHARED) {
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long addr = vma_address(page, vma);
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + if (addr != -EFAULT &&
> + (pte = page_check_address(page, mm, addr, &ptl))) {
> + if (ptep_clear_flush_dirty(vma, addr, pte))
> + modified = true;
> + pte_unmap_unlock(pte, ptl);
> + }

It's against coding style to do assignments in conditionals.

> + }
> + }
> + spin_unlock(&mapping->i_mmap_lock);
> + if (page_test_and_clear_dirty(page))
> + modified = true;
> + return modified;
> +}

Why not parametrize page_mkclean() to conditionally wrprotect clean
pages? Something like:

--- mm/rmap.c~ 2007-03-11 17:52:20.000000000 +0100
+++ mm/rmap.c 2007-03-25 14:01:55.000000000 +0200
@@ -432,7 +432,8 @@
return referenced;
}

-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int
+page_mkclean_one(struct page *page, struct vm_area_struct *vma, int prot)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -448,12 +449,13 @@
if (!pte)
goto out;

- if (pte_dirty(*pte) || pte_write(*pte)) {
+ if (pte_dirty(*pte) || (prot && pte_write(*pte))) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
- entry = pte_wrprotect(entry);
+ if (prot)
+ entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
lazy_mmu_prot_update(entry);
@@ -465,7 +467,8 @@
return ret;
}

-static int page_mkclean_file(struct address_space *mapping, struct page *page)
+static int
+page_mkclean_file(struct address_space *mapping, struct page *page, int prot)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
struct vm_area_struct *vma;
@@ -477,13 +480,13 @@
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED)
- ret += page_mkclean_one(page, vma);
+ ret += page_mkclean_one(page, vma, prot);
}
spin_unlock(&mapping->i_mmap_lock);
return ret;
}

-int page_mkclean(struct page *page)
+int page_mkclean(struct page *page, int prot)
{
int ret = 0;

@@ -492,7 +495,7 @@
if (page_mapped(page)) {
struct address_space *mapping = page_mapping(page);
if (mapping)
- ret = page_mkclean_file(mapping, page);
+ ret = page_mkclean_file(mapping, page, prot);
}
if (page_test_and_clear_dirty(page))
ret = 1;

page_mkclean(page, 0)

> + set_page_dirty(page);

set_page_dirty_mapping() ?

> + modified = true;
> + }
> + unlock_page(page);
> + }
> + pagevec_release(&pvec);
> + }
> +
> + if (modified)
> + file_update_time(vma->vm_file);
> +}
>
> /*
> * MS_SYNC syncs the entire file - including mappings.

Peter Zijlstra

Mar 25, 2007, 8:20:05 AM
On Sat, 2007-03-24 at 23:09 +0100, Miklos Szeredi wrote:
> From: Miklos Szeredi <msze...@suse.cz>
>
> Dirty page accounting/limiting doesn't work for nonlinear mappings, so
> for non-ram backed filesystems emulate with linear mappings. This
> retains ABI compatibility with previous kernels at minimal code cost.
>
> All known users of nonlinear mappings actually use tmpfs, so this
> shouldn't have any negative effect.
>
> Signed-off-by: Miklos Szeredi <msze...@suse.cz>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Matt Mackall

Mar 25, 2007, 12:10:08 PM
On Sun, Mar 25, 2007 at 02:12:32PM +0200, Peter Zijlstra wrote:
> On Sat, 2007-03-24 at 23:09 +0100, Miklos Szeredi wrote:
> > From: Miklos Szeredi <msze...@suse.cz>
> >
> > Dirty page accounting/limiting doesn't work for nonlinear mappings, so
> > for non-ram backed filesystems emulate with linear mappings. This
> > retains ABI compatibility with previous kernels at minimal code cost.
> >
> > All known users of nonlinear mappings actually use tmpfs, so this
> > shouldn't have any negative effect.

They do? I thought the whole point of nonlinear mappings was for
mapping files bigger than the address space (eg. databases). Is Oracle
instead using this to map >3G files on a tmpfs??

--
Mathematics is the supreme nostalgia of our time.

Miklos Szeredi

Mar 25, 2007, 5:10:11 PM
> A few comments..

Thanks for reviewing.

> > Index: linux-2.6.21-rc4-mm1/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.21-rc4-mm1.orig/mm/rmap.c 2007-03-24 19:03:11.000000000 +0100
> > +++ linux-2.6.21-rc4-mm1/mm/rmap.c 2007-03-24 19:34:30.000000000 +0100
> > @@ -507,6 +507,43 @@ int page_mkclean(struct page *page)
> > EXPORT_SYMBOL_GPL(page_mkclean);
> >
> > /**
> > + * test_clear_page_modified - check and clear the dirty bit for all mappings of a page
> > + * @page: the page to check
> > + */
> > +bool test_clear_page_modified(struct page *page)
> > +{
> > + struct address_space *mapping = page->mapping;
>
> page_mapping(page)? Otherwise that BUG_ON(!mapping) a few lines down
> isn't of much use.

OK, removed BUG_ON(). This is called with page locked and mapping
checked, so there's no need to use page_mapping().

> > + spin_lock(&mapping->i_mmap_lock);
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > + if (vma->vm_flags & VM_SHARED) {
> > + struct mm_struct *mm = vma->vm_mm;
> > + unsigned long addr = vma_address(page, vma);
> > + pte_t *pte;
> > + spinlock_t *ptl;
> > +
> > + if (addr != -EFAULT &&
> > + (pte = page_check_address(page, mm, addr, &ptl))) {
> > + if (ptep_clear_flush_dirty(vma, addr, pte))
> > + modified = true;
> > + pte_unmap_unlock(pte, ptl);
> > + }
>
> It's against coding style to do assignments in conditionals.

OK, cleaned up.

> > + if (page_test_and_clear_dirty(page))
> > + modified = true;
> > + return modified;
> > +}
>
> Why not parametrize page_mkclean() to conditionally wrprotect clean
> pages? Something like:

Well, I don't really like this, because for msync, there's really no
need to do the complex ptep operations. Using ptep_clear_flush_dirty()
can save a couple of cycles.

That would cause the file times to be updated twice, which is wrong.

But if AS_CMTIME were tested/cleared after walking the pages, this
would work. Fixed.

Also realized that setting AS_CMTIME from the page fault can also
cause a double update, since the PTE dirty bit is also set there.

And also realized that doing the file update from munmap can also
cause a double update, if this is not the last mapping of the inode.

Fixed all these in the updated patch.

Thanks,
Miklos

Miklos Szeredi

Mar 25, 2007, 5:20:06 PM
From: Miklos Szeredi <msze...@suse.cz>

Changes:
v4:
o rename test_clear_page_modified to page_mkclean_noprot
o clean up page_mkclean_noprot
o don't set AS_CMTIME from fault handler, since that also sets the PTE dirty
o only update c/mtime in munmap, if file is not mapped any more
o cleanups


v3:
o rename is_page_modified to test_clear_page_modified
v2:
o set AS_CMTIME flag in clear_page_dirty_for_io() too
o don't clear AS_CMTIME in file_update_time()
o check the dirty bit in the page tables
v1:
o moved check from __fput() to remove_vma(), which is more logical
o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

The st_ctime and st_mtime fields of a file that is mapped with
MAP_SHARED and PROT_WRITE shall be marked for update at some point
in the interval between a write reference to the mapped region and
the next call to msync() with MS_ASYNC or MS_SYNC for that portion
of the file by any process. If there is no such call and if the
underlying file is modified as a result of a write reference, then
these fields shall be marked for update at some time after the
write reference.

A new address_space flag is introduced: AS_CMTIME. This is set each
time dirtiness from the page table is transferred to the page or if
the page is dirtied without the page table being set dirty.

Note, the flag is set unconditionally, even if the page is already
dirty. This is important, because the page might have been dirtied
earlier by a non-mmap write.

Msync checks this flag and also the dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write. This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page. Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes. This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each. This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

In munmap if there are no more memory mappings of this file, and the
AS_CMTIME flag has been set, the file times are updated.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <msze...@suse.cz>
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h 2007-03-25 19:00:35.000000000 +0200
@@ -19,6 +19,7 @@
*/
#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
#define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
+#define AS_CMTIME (__GFP_BITS_SHIFT + 2) /* ctime/mtime update needed */

static inline void mapping_set_error(struct address_space *mapping, int error)
{
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h 2007-03-25 19:00:36.000000000 +0200
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
struct page *page);
int FASTCALL(set_page_dirty(struct page *page));
int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
int clear_page_dirty_for_io(struct page *page);

extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/memory.c 2007-03-25 19:00:36.000000000 +0200
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
anon_rss--;
else {
if (pte_dirty(ptent))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
if (pte_young(ptent))
SetPageReferenced(page);
file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
mark_page_accessed(page);
}
unlock:

Index: linux-2.6.21-rc4-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/page-writeback.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/page-writeback.c 2007-03-25 19:00:36.000000000 +0200
@@ -848,17 +848,42 @@ EXPORT_SYMBOL(redirty_page_for_writepage
* If the mapping doesn't provide a set_page_dirty a_op, then
* just fall through and assume that it wants buffer_heads.
*/
+static inline int __set_page_dirty(struct address_space *mapping,
+ struct page *page)
+{
+ int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+ if (!spd)
+ spd = __set_page_dirty_buffers;
+#endif
+ return (*spd)(page);
+}
+
int fastcall set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ if (likely(mapping))
+ return __set_page_dirty(mapping, page);
+ if (!PageDirty(page)) {
+ if (!TestSetPageDirty(page))
+ return 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(set_page_dirty);
+
+/*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping. In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
if (likely(mapping)) {
- int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
- if (!spd)
- spd = __set_page_dirty_buffers;
-#endif
- return (*spd)(page);
+ set_bit(AS_CMTIME, &mapping->flags);
+ return __set_page_dirty(mapping, page);
}
if (!PageDirty(page)) {
if (!TestSetPageDirty(page))
@@ -866,7 +891,6 @@ int fastcall set_page_dirty(struct page
}
return 0;
}
-EXPORT_SYMBOL(set_page_dirty);

/*
* set_page_dirty() is racy if the caller has no reference against
@@ -936,7 +960,7 @@ int clear_page_dirty_for_io(struct page
* threads doing their things.
*/
if (page_mkclean(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty

Index: linux-2.6.21-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/rmap.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/rmap.c 2007-03-25 21:38:18.000000000 +0200
@@ -506,6 +506,49 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);

+static int page_mkclean_one_noprot(struct page *page,
+ struct vm_area_struct *vma)
+{
+ int modified = 0;
+ unsigned long address = vma_address(page, vma);
+ if (address != -EFAULT) {
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pte_t *pte = page_check_address(page, mm, address, &ptl);
+ if (pte) {
+ if (ptep_clear_flush_dirty(vma, address, pte))
+ modified = 1;
+ pte_unmap_unlock(pte, ptl);
+ }
+ }
+ return modified;
+}
+
+/**
+ * page_mkclean_noprot - check and clear the dirty bit for all mappings of a page
+ * @page: the page to check
+ */
+int page_mkclean_noprot(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int modified = 0;
+
+ BUG_ON(!page_mapped(page));
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ if (vma->vm_flags & VM_SHARED)
+ modified |= page_mkclean_one_noprot(page, vma);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ if (page_test_and_clear_dirty(page))
+ modified = 1;
+ return modified;
+}
+
/**
* page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
@@ -657,7 +700,7 @@ void page_remove_rmap(struct page *page,
* faster for those pages still in swapcache.
*/
if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}

@@ -702,7 +745,7 @@ static int try_to_unmap_one(struct page

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);

/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

@@ -836,7 +879,7 @@ static void try_to_unmap_cluster(unsigne

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);

page_remove_rmap(page, vma);
page_cache_release(page);

Index: linux-2.6.21-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/mmap.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/mmap.c 2007-03-25 19:00:36.000000000 +0200
@@ -222,12 +222,30 @@ void unlink_file_vma(struct vm_area_stru
static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *next = vma->vm_next;
+ struct file *file = vma->vm_file;

might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
- if (vma->vm_file)
- fput(vma->vm_file);
+ if (file) {
+ struct address_space *mapping = file->f_mapping;
+ int update_cmtime = 0;
+ /*
+ * Only update c/mtime if there are no more memory maps
+ * referring to this inode. Otherwise it would be possible,
+ * that some modification info remains in page tables of
+ * other mappings, and the times would be updated again,
+ * even though the file wasn't modified after this
+ */
+ spin_lock(&mapping->i_mmap_lock);
+ if (prio_tree_empty(&mapping->i_mmap) &&
+ test_and_clear_bit(AS_CMTIME, &file->f_mapping->flags))
+ update_cmtime = 1;
+ spin_unlock(&mapping->i_mmap_lock);
+ if (update_cmtime)
+ file_update_time(file);
+ fput(file);
+ }
mpol_free(vma_policy(vma));
kmem_cache_free(vm_area_cachep, vma);
return next;
Index: linux-2.6.21-rc4-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/hugetlb.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/hugetlb.c 2007-03-25 19:00:36.000000000 +0200
@@ -407,7 +407,7 @@ void __unmap_hugepage_range(struct vm_ar

page = pte_page(pte);
if (pte_dirty(pte))
- set_page_dirty(page);
+ set_page_dirty_mapping(page);
list_add(&page->lru, &page_list);
}
spin_unlock(&mm->page_table_lock);

Index: linux-2.6.21-rc4-mm1/mm/msync.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/msync.c 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/msync.c 2007-03-25 19:00:36.000000000 +0200
@@ -12,6 +12,81 @@
#include <linux/mman.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/pagevec.h>
+
+/*
+ * Update ctime/mtime on msync().
+ *
+ * POSIX requires that the times are updated between a modification
+ * of the file through a memory mapping and the next msync for a
+ * region containing the modification. The wording implies that this
+ * must be done even if the modification was through a different
+ * address space. Ugh.
+ *
+ * Non-linear vmas are too hard to handle and they are non-standard
+ * anyway, so they are ignored for now.
+ *
+ * The "file modified" info is collected from two places:
+ *
+ * - AS_CMTIME flag of the mapping
+ * - the dirty bit of the ptes
+ *
+ * For memory backed filesystems all the pages in the range need to be
+ * examined. In other cases, since dirty pages are accurately
+ * tracked, it is enough to look at the pages with the dirty tag.
+ */
+static void msync_update_file_time(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct address_space *mapping;
+ struct pagevec pvec;
+ pgoff_t index;
+ pgoff_t end_index;
+
+ if (!vma->vm_file || !(vma->vm_flags & VM_SHARED) ||
+ (vma->vm_flags & VM_NONLINEAR))
+ return;
+
+ mapping = vma->vm_file->f_mapping;
+
+ pagevec_init(&pvec, 0);
+ index = linear_page_index(vma, start);
+ end_index = linear_page_index(vma, end);
+ while (index < end_index) {
+ int i;
+ int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
+
+ if (mapping_cap_account_dirty(mapping))
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY, nr_pages);
+ else
+ nr_pages = pagevec_lookup(&pvec, mapping, index,
+ nr_pages);
+ if (!nr_pages)
+ break;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ /* Skip pages which are just being read */
+ if (!PageUptodate(page))
+ continue;
+
+ lock_page(page);
+ index = page->index + 1;
+ if (page->mapping == mapping &&
+ page_mkclean_noprot(page))
+ set_page_dirty_mapping(page);
+ unlock_page(page);
+ }
+ pagevec_release(&pvec);
+ }
+
+ if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
+ file_update_time(vma->vm_file);
+}

/*
* MS_SYNC syncs the entire file - including mappings.
@@ -75,6 +150,9 @@ asmlinkage long sys_msync(unsigned long
error = -EBUSY;
goto out_unlock;
}
+ if (flags & (MS_SYNC | MS_ASYNC))
+ msync_update_file_time(vma, start,
+ min(end, vma->vm_end));
file = vma->vm_file;
start = vma->vm_end;
if ((flags & MS_SYNC) && file &&
Index: linux-2.6.21-rc4-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/rmap.h 2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/rmap.h 2007-03-25 19:00:36.000000000 +0200
@@ -100,6 +100,11 @@ unsigned long page_address_in_vma(struct
*/
int page_mkclean(struct page *);

+/*
+ * Similar to the above, but doesn't write protect the PTEs
+ */
+int page_mkclean_noprot(struct page *page);
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)

Andrew Morton

Mar 25, 2007, 8:10:07 PM
On Sat, 24 Mar 2007 23:09:19 +0100 Miklos Szeredi <mik...@szeredi.hu> wrote:

> Dirty page accounting/limiting doesn't work for nonlinear mappings,

Doesn't it? iirc the problem is that we don't correctly re-clean the ptes
while starting writeout. And the dirty-page accounting is in fact correct
(it'd darn well better be).

> so
> for non-ram backed filesystems emulate with linear mappings. This
> retains ABI compatibility with previous kernels at minimal code cost.
>
> All known users of nonlinear mappings actually use tmpfs, so this
> shouldn't have any negative effect.

Unless someone is using remap_file_pages() against an ext3 file, in which
case their application stops working?

That would be a problem. These guys:
http://www.technovelty.org/code/linux/fremap.html, for example, will be in
for a little surprise.

Peter Zijlstra

Mar 26, 2007, 3:00:15 AM
On Sun, 2007-03-25 at 16:00 -0800, Andrew Morton wrote:
> On Sat, 24 Mar 2007 23:09:19 +0100 Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > Dirty page accounting/limiting doesn't work for nonlinear mappings,
>
> Doesn't it? iirc the problem is that we don't correctly re-clean the ptes
> while starting writeout. And the dirty-page accounting is in fact correct
> (it'd darn well better be).

If we do not re-protect the pages on writeout, we'll decrement the dirty
count but not get a fault on re-dirty. Hence the dirty count will
actually skew.

In order to make page_mkclean() work for nonlinear vmas we need to do a
full pte scan for each invocation (we could perhaps only scan 1 in n
times to try and limit the damage) and that hurts. This will basically
render it useless.

The other solution is adding rmap information to nonlinear vmas but
doubling the memory overhead for nonlinear mappings was not deemed a
good idea.

> > so
> > for non-ram backed filesystems emulate with linear mappings. This
> > retains ABI compatibility with previous kernels at minimal code cost.
> >
> > All known users of nonlinear mappings actually use tmpfs, so this
> > shouldn't have any negative effect.
>
> Unless someone is using remap_file_pages() against an ext3 file, in which
> case their application stops working?

it'll work up to a certain point (when you hit the max vma count). When
this issue first came up a few weeks ago, nobody knew of any non-tmpfs
users.

> That would be a problem. These guys:
> http://www.technovelty.org/code/linux/fremap.html, for example, will be in
> for a little surprise.

It's an example, it doesn't show if there are actually users of this, but
point taken. We actually could make this very example work by noting
that they map PROT_READ only.

Andrew Morton

Mar 26, 2007, 5:10:06 PM
On Sun, 25 Mar 2007 23:10:21 +0200
Miklos Szeredi <mik...@szeredi.hu> wrote:

> This patch makes writing to shared memory mappings update st_ctime and
> st_mtime as defined by SUSv3:

Boy this is complicated.

Is there a simpler way of doing all this? Say, we define a new page flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness. Then, when performing writeback we look to see if any of
the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
current-time.

Or something like that - I'm just thinking out loud and picking holes in
the above doesn't shut me up ;) We're adding complexity and some overhead
and we're losing our recent msync() simplifications and this all hurts. Is
there some other way? I think burning a page flag to avoid this additional
complexity would be worthwhile.
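
For illustration only, the hook points of that idea might look like this
sketch (entirely hypothetical: PG_dirtiedbywrite and these helpers do not
exist, and the placement is approximate):

    /* write() path, where the timestamps are updated anyway: */
    SetPageDirtiedByWrite(page);

    /* wherever pte dirtiness is propagated into page dirtiness: */
    ClearPageDirtiedByWrite(page);

    /* at writeback: a dirty page never flagged by write() must have
       been dirtied through a mapping, so update the times: */
    if (!PageDirtiedByWrite(page))
            file_update_time(file);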

Matt Mackall

Mar 26, 2007, 5:30:10 PM
On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> On Sun, 25 Mar 2007 23:10:21 +0200
> Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
>
> Boy this is complicated.
>
> Is there a simpler way of doing all this? Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness. Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.
>
> Or something like that - I'm just thinking out loud and picking holes in
> the above doesn't shut me up ;) We're adding complexity and some overhead
> and we're losing our recent msync() simplifications and this all hurts. Is
> there some other way? I think burning a page flag to avoid this additional
> complexity would be worthwhile.

Aren't we basically out of those?

--
Mathematics is the supreme nostalgia of our time.

Miklos Szeredi

Mar 26, 2007, 5:50:10 PM
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
>
> Boy this is complicated.

You tell me?

> Is there a simpler way of doing all this? Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness. Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.

I don't think a page flag gains anything over the address_space flag
that this patch already has.

The complexity is not about keeping track of the "data modified
through mmap" state, but about msync() guarantees, that POSIX wants.

And these requirements do in fact make some sense: msync() basically
means:

"I want the data written through mmaps to be visible to the world"

And that obviously includes updating the timestamps.

So how do we know if the data was modified between two msync()
invocations? The only sane way I can think of is to walk the page
tables in msync() and test/clear the pte dirty bit.

Yes, this will make msync(MS_ASYNC) more heavyweight again. But if an
application doesn't want to update the timestamps, it should just omit
this call, since it does nothing else.

There shouldn't be any other side effect.

Miklos

Andrew Morton

Mar 26, 2007, 6:30:09 PM
On Mon, 26 Mar 2007 16:10:09 -0500
Matt Mackall <m...@selenic.com> wrote:

> On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> > On Sun, 25 Mar 2007 23:10:21 +0200
> > Miklos Szeredi <mik...@szeredi.hu> wrote:
> >
> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> >
> > Boy this is complicated.
> >
> > Is there a simpler way of doing all this? Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness. Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> >
> > Or something like that - I'm just thinking out loud and picking holes in
> > the above doesn't shut me up ;) We're adding complexity and some overhead
> > and we're losing our recent msync() simplifications and this all hurts. Is
> > there some other way? I think burning a page flag to avoid this additional
> > complexity would be worthwhile.
>
> Aren't we basically out of those?

Rafael has liberated a couple of the flags which swsusp was using.

Andrew Morton

Mar 26, 2007, 6:40:09 PM
On Mon, 26 Mar 2007 23:43:08 +0200
Miklos Szeredi <mik...@szeredi.hu> wrote:

> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> >
> > Boy this is complicated.
>
> You tell me?
>
> > Is there a simpler way of doing all this? Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness. Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
>
> I don't think a page flag gains anything over the address_space flag
> that this patch already has.
>
> The complexity is not about keeping track of the "data modified
> through mmap" state, but about msync() guarantees, that POSIX wants.
>
> And these requirements do in fact make some sense: msync() basically
> means:
>
> "I want the data written through mmaps to be visible to the world"
>
> And that obviously includes updating the timestamps.
>
> So how do we know if the data was modified between two msync()
> invocations? The only sane way I can think of is to walk the page
> tables in msync() and test/clear the pte dirty bit.

clear_page_dirty_for_io() already does that.

So we should be able to test PageDirtiedByWrite() after running
clear_page_dirty_for_io() to discover whether this page was dirtied via
MAP_SHARED, and then update the inode times if so.

William Lee Irwin III

Mar 26, 2007, 8:10:07 PM
On Sat, 2007-03-24 at 23:09 +0100, Miklos Szeredi wrote:
>>> Dirty page accounting/limiting doesn't work for nonlinear mappings, so
>>> for non-ram backed filesystems emulate with linear mappings. This
>>> retains ABI compatibility with previous kernels at minimal code cost.
>>> All known users of nonlinear mappings actually use tmpfs, so this
>>> shouldn't have any negative effect.

On Sun, Mar 25, 2007 at 02:12:32PM +0200, Peter Zijlstra wrote:

On Sun, Mar 25, 2007 at 10:51:27AM -0500, Matt Mackall wrote:
> They do? I thought the whole point of nonlinear mappings was for
> mapping files bigger than the address space (eg. databases). Is Oracle
> instead using this to map >3G files on a tmpfs??

It's used for > 3GB files on tmpfs and also ramfs, sometimes
substantially larger than 3GB.

It's not used for the database proper. It's used for the buffer pool,
which is the in-core destination and source of direct I/O, the on-disk
source and destination of the I/O being the database.


-- wli

Miklos Szeredi

Mar 27, 2007, 3:00:25 AM

What do you need the page flag for? The "modified through mmap" info
is there in the ptes. And from the ptes it can be transferred to a
per-address_space flag. Nobody is interested through which page was
the file modified.

Anyway, that's just MS_SYNC. MS_ASYNC doesn't walk the pages, yet it
should update the timestamp. That's the difficult one.

Miklos

Andrew Morton

Mar 27, 2007, 3:30:19 AM

To work out whether the page was dirtied via write (in which case the
timestamps were updated) or via mmap (in which case they were not).

> The "modified through mmap" info
> is there in the ptes.

It might not be there any more - the ptes could have got taken down by, for
example, reclaim.

I dunno - I'm not trying very hard. I'm trying to encourage you to come up
with something less costly and less complex than this thing, but you appear
to be resisting.

> And from the ptes it can be transferred to a
> per-address_space flag. Nobody is interested through which page was
> the file modified.
>
> Anyway, that's just MS_SYNC. MS_ASYNC doesn't walk the pages, yet it
> should update the timestamp. That's the difficult one.
>

We can treat MS_ASYNC as we treat MS_SYNC. Then, MS_ASYNC *does* walk the
pages. It does it in generic_writepages(). It also even walks the ptes
for you, in clear_page_dirty_for_io().

There is surely no need to duplicate all that.

Miklos Szeredi

Mar 27, 2007, 3:40:10 AM
> > > clear_page_dirty_for_io() already does that.
> > >
> > > So we should be able to test PageDirtiedByWrite() after running
> > > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > > MAP_SHARED, and then update the inode times if so.
> >
> > What do you need the page flag for?
>
> To work out whether the page was dirtied via write (in which case the
> timestamps were updated) or via mmap (in which case they were not).
>
> > The "modified through mmap" info
> > is there in the ptes.
>
> It might not be there any more - the ptes could have got taken down by, for
> example, reclaim.

Yes, but then the "modified through mmap" is transferred to the
per-address_space flag. All this is already done by this patch.

> I dunno - I'm not trying very hard. I'm trying to encourage you to come up
> with something less costly and less complex than this thing, but you appear
> to be resisting.

No, I'm just arguing that your suggestion is actually a complication,
not a simplification ;)

> > And from the ptes it can be transferred to a
> > per-address_space flag. Nobody is interested through which page was
> > the file modified.
> >
> > Anyway, that's just MS_SYNC. MS_ASYNC doesn't walk the pages, yet it
> > should update the timestamp. That's the difficult one.
> >
>
> We can treat MS_ASYNC as we treat MS_SYNC. Then, MS_ASYNC *does* walk the
> pages. It does it in generic_writepages(). It also even walks the ptes
> for you, in clear_page_dirty_for_io().

Yes. But that's not a very useful semantic for MS_ASYNC vs. file time
update.

It would basically say:

"if you cann MS_ASYNC, and the file was modified then sometime in
the future you will get an updated c/mtime".

But this is not what POSIX says, and it's not what applications want.

For example "make" would want to know if a file was modified or not,
and with your suggestion only msync(MS_SYNC) would reliably provide
that info. But msync(MS_SYNC) is unnecessary in many cases.

> There is surely no need to duplicate all that.

Yeah, we could teach generic_writepages() to conditionally not submit
for IO, just test/clear pte dirtiness.

Maybe that would be somewhat cleaner, dunno.

Then there are the ram backed filesystems, which don't have dirty
accounting and radix trees, and for which this pte walking is still
needed to provide semantics consistent with normal filesystems.

Miklos

Andrew Morton

Mar 27, 2007, 4:00:21 AM
On Tue, 27 Mar 2007 09:36:50 +0200 Miklos Szeredi <mik...@szeredi.hu> wrote:

> > There is surely no need to duplicate all that.
>
> Yeah, we could teach generic_writepages() to conditionally not submit
> for IO, just test/clear pte dirtiness.
>
> Maybe that would be somewhat cleaner, dunno.
>
> Then there are the ram backed filesystems, which don't have dirty
> accounting and radix trees, and for which this pte walking is still
> needed to provide semantics consistent with normal filesystems.

hm.

I don't know how important all this is, really - we've had this bug for
ever and presumably we've already trained everyone to work around it.

What usage scenarios are people actually hurting from? Is there anything
interesting in the mysterious Novell Bugzilla #206431?

Perhaps we can get away with doing something half-assed which covers most
requirements...

Miklos Szeredi

Mar 27, 2007, 4:10:21 AM
> > > There is surely no need to duplicate all that.
> >
> > Yeah, we could teach generic_writepages() to conditionally not submit
> > for io just test/clear pte dirtyness.
> >
> > Maybe that would be somewhat cleaner, dunno.
> >
> > Then there are the ram backed filesystems, which don't have dirty
> > accounting and radix trees, and for which this pte walking is still
> > needed to provide semantics consistent with normal filesystems.
>
> hm.
>
> I don't know how important all this is, really - we've had this bug for
> ever and presumably we've already trained everyone to work around it.
>
> What usage scenarios are people actually hurting from? Is there anything
> interesting in the mysterious Novell Bugzilla #206431?

That's just a failing LTP testcase, not quite real life ;)

But Peter Staubach says a RH customer has files written through mmap,
which are not being backed up.

> Perhaps we can get away with doing something half-assed which covers most
> requirements...

OK. At least I can split the patch into two half asses.

The big question is tmpfs and friends. Those won't get any timestamp
update without additional page table walking.

Miklos

Andrew Morton

Mar 27, 2007, 4:20:08 AM
On Tue, 27 Mar 2007 10:03:41 +0200 Miklos Szeredi <mik...@szeredi.hu> wrote:

> But Peter Staubach says a RH customer has files written through mmap,
> which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising from
this bug.

But I expect we could adequately plug that problem at munmap()-time. Or,
better, do_wp_page(). As I said - half-assed.

It's a question of whether the backup problem is the only thing which is hurting
in the real world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()? The timestamp could
be up to 30 seconds too early, but that's heaps better than what we have
now..)

Miklos Szeredi

Mar 27, 2007, 4:30:09 AM
> > But Peter Staubach says a RH customer has files written through mmap,
> > which are not being backed up.
>
> Yes, I expect the backup problem is the major real-world hurt arising from
> this bug.
>
> But I expect we could adequately plug that problem at munmap()-time. Or,
> better, do_wp_page(). As I said - half-assed.
>
> It's a question of whether the backup problem is the only thing which is hurting
> in the real world, or if people have other problems.
>
> (In fact, what's wrong with doing it in do_wp_page()?

It's rather more expensive than just toggling a bit.

Let me work on it a bit more. I think I can make the current patch
more palatable.

Miklos

Andrew Morton

Mar 27, 2007, 5:00:19 AM
On Tue, 27 Mar 2007 10:28:16 +0200 Miklos Szeredi <mik...@szeredi.hu> wrote:

> > > But Peter Staubach says a RH customer has files written through mmap,
> > > which are not being backed up.
> >
> > Yes, I expect the backup problem is the major real-world hurt arising from
> > this bug.
> >
> > But I expect we could adequately plug that problem at munmap()-time. Or,
> > better, do_wp_page(). As I said - half-assed.
> >
> > It's a question if whether the backup problem is the only thing which is hurting
> > in the real-world, or if people have other problems.
> >
> > (In fact, what's wrong with doing it in do_wp_page()?
>
> It's rather more expensive than just toggling a bit.

It shouldn't be, especially for filesystems which have one-second timestamp
granularity.

Filesystems which have s_time_gran=1 might hurt a bit, but no more than
they will with write().

Actually, no - we'd only update the mctime once per page per writeback
period (30 seconds by default) so the load will be small. It'll be more
often if the user is doing a lot of pte-cleaning via msync() or fsync(),
but then the m/ctime writes will be the least of their problems.

I'd have thought there were more substantial problems with something that
crude?

Miklos Szeredi

Mar 27, 2007, 5:30:23 AM
> > > > But Peter Staubach says a RH customer has files written through mmap,
> > > > which are not being backed up.
> > >
> > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > this bug.
> > >
> > > But I expect we could adequately plug that problem at munmap()-time. Or,
> > > better, do_wp_page(). As I said - half-assed.
> > >
> > > It's a question of whether the backup problem is the only thing which is hurting
> > > in the real world, or if people have other problems.
> > >
> > > (In fact, what's wrong with doing it in do_wp_page()?
> >
> > It's rather more expensive than just toggling a bit.
>
> It shouldn't be, especially for filesystems which have one-second timestamp
> granularity.
>
> Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> they will with write().
>
> Actually, no - we'd only update the mctime once per page per writeback
> period (30 seconds by default) so the load will be small.

Why? For each faulted page the times will be updated, no?

Maybe it's acceptable, I don't really know the cost of
file_update_time().

Tried this patch, and it seems to work. It will even randomly update
the time for tmpfs files (on initial fault, and on swapins).

Miklos

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2007-03-27 11:04:40.000000000 +0200
+++ linux/mm/memory.c 2007-03-27 11:08:19.000000000 +0200
@@ -1664,6 +1664,8 @@ gotten:
 unlock:
         pte_unmap_unlock(page_table, ptl);
         if (dirty_page) {
+                if (vma->vm_file)
+                        file_update_time(vma->vm_file);
                 set_page_dirty_balance(dirty_page);
                 put_page(dirty_page);
         }
@@ -2316,6 +2318,8 @@ retry:
 unlock:
         pte_unmap_unlock(page_table, ptl);
         if (dirty_page) {
+                if (vma->vm_file)
+                        file_update_time(vma->vm_file);
                 set_page_dirty_balance(dirty_page);
                 put_page(dirty_page);
         }

Andrew Morton

Mar 27, 2007, 2:00:20 PM
On Tue, 27 Mar 2007 11:23:06 +0200 Miklos Szeredi <mik...@szeredi.hu> wrote:

> > > > > But Peter Staubach says a RH customer has files written through mmap,
> > > > > which are not being backed up.
> > > >
> > > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > > this bug.
> > > >
> > > > But I expect we could adequately plug that problem at munmap()-time. Or,
> > > > better, do_wp_page(). As I said - half-assed.
> > > >
> > > > It's a question of whether the backup problem is the only thing which is hurting
> > > > in the real world, or if people have other problems.
> > > >
> > > > (In fact, what's wrong with doing it in do_wp_page()?
> > >
> > > It's rather more expensive than just toggling a bit.
> >
> > It shouldn't be, especially for filesystems which have one-second timestamp
> > granularity.
> >
> > Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> > they will with write().
> >
> > Actually, no - we'd only update the mctime once per page per writeback
> > period (30 seconds by default) so the load will be small.
>
> Why? For each faulted page the times will be updated, no?

Yes, but only at pagefault-time. And

- the faults are already "slow": we need to pull the page contents in
from disk, or memset or cow the page

- we need to take the trap

compared to which, the cost of the timestamp update will (we hope) be
relatively low.

That's simpler ;) Is it correct enough, though? The place where it will
become inaccurate is for repeated modification via an established map. ie:

p = mmap(..., MAP_SHARED);
for ( ; ; )
*p = 1;

in which case I think the timestamp will only get updated once per
writeback interval (ie: 30 seconds).


tmpfs files have an s_time_gran of 1, so benchmarking some workload on
tmpfs with this patch will tell us the worst-case overhead of the change.

I guess we should arrange for multiple CPUs to perform write faults against
multiple pages of the same file in parallel. Of course, that'd be a pretty
darn short benchmark because it'll run out of RAM. Which reveals why we
probably won't have a performance problem in there.
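
(For anyone wanting to try that, a rough sketch of such a microbenchmark —
the file path, page count and thread count are arbitrary choices for the
example, /dev/shm is assumed to be tmpfs, and error handling is omitted:)

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS 4
#define NPAGES   1024   /* keep it modest: tmpfs pages are unreclaimable RAM */

static char *map;
static long pagesize;

/* Each thread takes write faults on its own interleaved set of pages. */
static void *fault_pages(void *arg)
{
    long id = (long) arg;
    long i;

    for (i = id; i < NPAGES; i += NTHREADS)
        map[i * pagesize] = 1;
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    long i;
    int fd = open("/dev/shm/faultbench", O_RDWR | O_CREAT | O_TRUNC, 0600);

    pagesize = sysconf(_SC_PAGESIZE);
    ftruncate(fd, NPAGES * pagesize);
    map = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, fault_pages, (void *) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}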

Miklos Szeredi

Mar 27, 2007, 2:40:15 PM
> that's simpler ;) Is it correct enough though? The place where it will
> become inaccurate is for repeated modification via an established map. ie:
>
> p = mmap(..., MAP_SHARED);
> for ( ; ; )
> *p = 1;
>
> in which case I think the timestamp will only get updated once per
> writeback interval (ie: 30 seconds).

Which is perfectly OK, we really can't do any better in any sane way.

My concern is only about MS_ASYNC, it would be really nice to know
what other OSs do. I've checked on a Solaris 5.7 (not exactly a
modern OS) and that is as good as current Linux.

If someone has access to others, please find my test program at the end.

The logical way to handle msync is IMO:

write to memory + msync(MS_ASYNC) == write()
write to memory + msync(MS_SYNC) == write() + fdatasync()

Yes, it would add some overhead for MS_ASYNC, but that's what the user
wants, isn't it? If the user doesn't want correctly updated
timestamps, they should not call msync().

Miklos

=== msync_time.c ==============================================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

static const char *filename;

static void print_times(const char *msg)
{
    struct stat stbuf;
    stat(filename, &stbuf);
    printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
           stbuf.st_atime);
}

static void usage(const char *progname)
{
    fprintf(stderr, "usage: %s filename [-s]\n", progname);
    exit(1);
}

int main(int argc, char *argv[])
{
    int res;
    char *addr;
    int msync_flag = MS_ASYNC;
    int fd;

    if (argc < 2)
        usage(argv[0]);

    filename = argv[1];
    if (argc > 2) {
        if (argc > 3)
            usage(argv[0]);
        if (strcmp(argv[2], "-s") == 0)
            msync_flag = MS_SYNC;
        else
            usage(argv[0]);
    }

    fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
    if (fd == -1) {
        perror(filename);
        return 1;
    }
    print_times("begin");
    sleep(1);
    write(fd, "aaaa\n", 4);
    print_times("write");
    sleep(1);
    addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == (char *) -1) {
        perror("mmap");
        return 1;
    }
    print_times("mmap");
    sleep(1);

    addr[1] = 'b';
    print_times("b");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync b");
    sleep(1);

    addr[2] = 'c';
    print_times("c");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync c");
    sleep(1);

    addr[3] = 'd';
    print_times("d");
    sleep(1);
    res = munmap(addr, 4);
    if (res == -1) {
        perror("munmap");
        return 1;
    }
    print_times("munmap");
    sleep(1);

    res = close(fd);
    if (res == -1) {
        perror("close");
        return 1;
    }
    print_times("end");
    return 0;
}

li...@horizon.com

Mar 27, 2007, 2:50:23 PM
> Yes, this will make msync(MS_ASYNC) more heavyweight again. But if an
> application doesn't want to update the timestamps, it should just omit
> this call, since it does nothing else.

Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
and doesn't care about timestamps. (In fact, sometimes it's configured
to write to a raw device and there are no timestamps.)

It's used as a poor man's portable async I/O. The application logs
data to disk, and sometimes needs to sync it to disk to ensure it has
all been written.

To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
as soon as a page is filled up to prompt asynchronous writeback.
"I'm done writing this page and don't intend to write it again.
Please start committing it to stable storage, but don't block me."

Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
is synced to disk. This caused annoying hiccups before the MS_ASYNC
calls were added.


I agree that msync(MS_ASYNC) has no semantics if time is ignored.
But it's a useful way to tell the OS that the page is not going
to be dirtied again.
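
(For illustration, a minimal sketch of the logging pattern described above —
the function names and page arithmetic are invented for the example, and
error handling is omitted:)

#include <sys/mman.h>
#include <unistd.h>

/* Called once the writer has filled one page of the mapped log:
 * "start committing it to stable storage, but don't block me". */
static void log_page_full(char *map, size_t page_index)
{
    size_t pg = (size_t) sysconf(_SC_PAGESIZE);

    msync(map + page_index * pg, pg, MS_ASYNC);
}

/* Called occasionally, when the data written so far must be on disk. */
static void log_checkpoint(char *map, size_t len)
{
    msync(map, len, MS_SYNC);
}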

Miklos Szeredi

Mar 27, 2007, 3:00:22 PM
> > Yes, this will make msync(MS_ASYNC) more heavyweight again. But if an
> > application doesn't want to update the timestamps, it should just omit
> > this call, since it does nothing else.
>
> Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
> and doesn't care about timestamps. (In fact, sometimes it's configured
> to write to a raw device and there are no timestamps.)
>
> It's used as a poor man's portable async I/O. The application logs
> data to disk, and sometimes needs to sync it to disk to ensure it has
> all been written.
>
> To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
> as soon as a page is filled up to prompt asynchronous writeback.
> "I'm done writing this page and don't intend to write it again.
> Please start committing it to stable storage, but don't block me."
>
> Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
> is synced to disk. This caused annoying hiccups before the MS_ASYNC
> calls were added.
>
>
> I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> But it's a useful way to tell the OS that the page is not going
> to be dirtied again.

Just to clarify, here's the header comment for sys_msync():

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
* Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
* The application may now run fsync() to
* write out the dirty pages and wait on the writeout and check the result.
* Or the application may run fadvise(FADV_DONTNEED) against the fd to start
* async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/

It's actually wrong about FADV_DONTNEED, which I think doesn't start
writeout either. So there you have it ;)

Miklos

Andrew Morton

Mar 27, 2007, 3:10:15 PM
On Tue, 27 Mar 2007 20:55:51 +0200
Miklos Szeredi <mik...@szeredi.hu> wrote:

> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either.

case POSIX_FADV_DONTNEED:
        if (!bdi_write_congested(mapping->backing_dev_info))
                filemap_flush(mapping);

Miklos Szeredi

Mar 27, 2007, 3:10:17 PM
> > I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> > But it's a useful way to tell the OS that the page is not going
> > to be dirtied again.
>
> Just to clarify, here's the header comment for sys_msync():
>
> /*
> * MS_SYNC syncs the entire file - including mappings.
> *
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> * Now it doesn't do anything, since dirty pages are properly tracked.
> *
> * The application may now run fsync() to
> * write out the dirty pages and wait on the writeout and check the result.
> * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
> * async writeout immediately.
> * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
> * applications.
> */
>
> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either. So there you have it ;)

Sorry, FADV_DONTNEED _does_ seem to start writeout, but it does it
indiscriminately, not just on the specified range. That should be
easy to fix though.

li...@horizon.com

Mar 27, 2007, 3:30:20 PM
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).

Yes, I noticed. See
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
for a bug report on the subject from February 2006.

That's why this application is still running on 2.4.

As I mentioned at the time, the SUS says:
(http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
"When MS_ASYNC is specified, msync() returns immediately once all the
write operations are initiated or queued for servicing."

You can argue that putting it on the dirty list constitutes "queued for
servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
to start the I/O. Although strict standards-ese parsing says that
either branch of an or is acceptable, it is a common English language
convention that the first alternative is preferred and the second
is a fallback.

It makes sense in this case: start the write or, if that's not possible
(the disk is already busy), queue it for service as soon as the disk
is available.

They perhaps didn't mandate it this strictly, but that's clearly the
intent.

Andrew Morton

Mar 27, 2007, 3:40:15 PM
On 27 Mar 2007 15:24:52 -0400
li...@horizon.com wrote:

> > * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
>
> Yes, I noticed. See
> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
> for a bug report on the subject February 2006.

Suggest you use msync(MS_ASYNC), then
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

The new (post 2.6.17) MAP_SHARED dirty-page semantics mean that the msync()
isn't actually needed.
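
(A minimal sketch of that combination — sync_file_range() and its flags are
the real post-2.6.17 interface, while the wrapper name and the -ENOSYS
fallback are illustrative; it also needs a glibc that exposes the syscall:)

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Start writeback of a dirty file range without waiting for completion.
 * SYNC_FILE_RANGE_WAIT_BEFORE first waits on any writeout already in
 * flight for the range, so a clean "write everything dirty" is queued. */
static int start_writeback(int fd, off_t off, off_t len)
{
    int res = sync_file_range(fd, off, len,
                              SYNC_FILE_RANGE_WAIT_BEFORE |
                              SYNC_FILE_RANGE_WRITE);
    if (res == -1 && errno == ENOSYS)
        return 0;   /* older kernel: dirty pages still get written later */
    return res;
}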

> That's why this application is still running on 2.4.
>
> As I mentioned at the time, the SUS says:
> (http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
> "When MS_ASYNC is specified, msync() returns immediately once all the
> write operations are initiated or queued for servicing."
>
> You can argue that putting it on the dirty list constitutes "queued for
> servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
> to start the I/O. Although strict standards-ese parsing says that
> either branch of an or is acceptable, it is a common English language
> convention that the first alternative is preferred and the second
> is a fallback.
>
> It makes sense in this case: start the write or, if that's not possible
> (the disk is already busy), queue it for service as soon as the disk
> is available.
>
> They perhaps didn't mandate it this strictly, but that's clearly the
> intent.

We can fix your application, and we'll break someone else's.

I don't think it's solveable, really - the range of applications is so
broad, and the "standard" is so vague as to be useless. This is why we've
been extending these things with linux-specific goodies which permit
applications to actually tell the kernel what they want to be done in a
more finely-grained fashion.

li...@horizon.com

Mar 27, 2007, 4:20:08 PM
> Suggest you use msync(MS_ASYNC), then
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

Thank you; I didn't know about that. And I can handle -ENOSYS by falling
back to the old behavior.

> We can fix your application, and we'll break someone else's.

If you can point to an application that it'll break, I'd be a lot more
understanding. Nobody did, last year.

> I don't think it's solveable, really - the range of applications is so
> broad, and the "standard" is so vague as to be useless.

I agree that standards are sometimes vague, but that one seemed about
as clear as it's possible to be without imposing unreasonably on
the file system and device driver layers.

What part of "The msync() function writes all modified data to
permanent storage locations [...] For mappings to files, the msync()
function ensures that all write operations are completed as defined
for synchronised I/O data integrity completion." suggests that it's not
supposed to do disk I/O? How is that uselessly vague?

It says to me that msync's raison d'ĂȘtre is to write data from RAM to
stable storage. If an application calls it too often, that's
the application's fault just as if it called sync(2) too often.

> This is why we've
> been extending these things with linux-specific goodies which permit
> applications to actually tell the kernel what they want to be done in a
> more finely-grained fashion.

Well, I still think the current Linux behavior is a bug, but there's a
usable (and run-time compatible) workaround that doesn't unreasonably
complicate the code, and that's good enough.

Miklos Szeredi

Mar 27, 2007, 4:40:13 PM
> > Suggest you use msync(MS_ASYNC), then
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).
>
> Thank you; I didn't know about that. And I can handle -ENOSYS by falling
> back to the old behavior.
>
> > We can fix your application, and we'll break someone else's.
>
> If you can point to an application that it'll break, I'd be a lot more
> understanding. Nobody did, last year.
>
> > I don't think it's solveable, really - the range of applications is so
> > broad, and the "standard" is so vague as to be useless.
>
> I agree that standards are sometimes vague, but that one seemed about
> as clear as it's possible to be without imposing unreasonably on
> the file system and device driver layers.
>
> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O? How is that uselessly vague?

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync(). Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.

Miklos

Andrew Morton

Mar 27, 2007, 4:50:08 PM
On 27 Mar 2007 16:09:33 -0400
li...@horizon.com wrote:

> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O? How is that uselessly vague?
>

Because for MS_ASYNC, "msync() shall return immediately once all the write
operations are initiated or queued for servicing".

ie: the writes can complete one millisecond or one week later. We chose 30
seconds.

And this is not completely fatuous - before 2.6.17, MAP_SHARED pages could
float about in memory in a dirty state for arbitrarily long periods -
potentially for the entire application lifetime. It was quite reasonable
for our MS_ASYNC implementation to do what it did: tell the VM about the
dirtiness of these pages so they get written back soon.

Post-2.6.17 we preserved that behaviour.

li...@horizon.com

Mar 27, 2007, 9:50:09 PM
> Linux _will_ write all modified data to permanent storage locations.
> Since 2.6.17 it will do this regardless of msync(). Before 2.6.17 you
> do need msync() to enable data to be written back.
>
> But it will not start I/O immediately, which is not a requirement in
> the standard, or at least it's pretty vague about that.

As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
of the MS_* flags. Only the time of the return to user space varies.
Thus, whatever the delay between calling msync() and the data being
written, it should be the same whether MS_SYNC or MS_ASYNC is used.

The implementation intended is:
- Start all I/O
- If MS_SYNC, wait for I/O to complete
- Return to user space

2) "all the write operations are initiated or queued for servicing".
It is a common convention in English (and most languages, I expect)
that in the "or" is a preference for the first alternative. The second
is a permitted alternative if the first is not possible.

And "queued for servicing", especially "initiated or queued for
servicing", to me imples queuing waiting for some resource. To have
the resource being waited for be a timer expiry seems like rather a
cheat to me. It perhaps doesn't break the letter of the standard,
but it definitely bends it. It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
"I don't expect to write this page any more, so now would be a good time
to clean it."
It would just make my life easier if the kernel procrastinated less.

Nick Piggin

Mar 28, 2007, 4:00:10 AM
li...@horizon.com wrote:
>>Linux _will_ write all modified data to permanent storage locations.
>>Since 2.6.17 it will do this regardless of msync(). Before 2.6.17 you
>>do need msync() to enable data to be written back.
>>
>>But it will not start I/O immediately, which is not a requirement in
>>the standard, or at least it's pretty vague about that.
>
>
> As I've said before, I disagree, but I'm not going to start a major
> flame war about it.
>
> The most relevant paragraph is:
>
> # When MS_ASYNC is specified, msync() returns immediately once all the
> # write operations are initiated or queued for servicing; when MS_SYNC is
> # specified, msync() will not return until all write operations are
> # completed as defined for synchronised I/O data integrity completion.
> # Either MS_ASYNC or MS_SYNC is specified, but not both.
>
> Note two things:
> 1) In the paragraphs before, what msync does is defined independently
> of the MS_* flags. Only the time of the return to user space varies.
> Thus, whatever the delay between calling msync() and the data being
> written, it should be the same whether MS_SYNC or MS_ASYNC is used.
>
> The intended implementation is:
> - Start all I/O
> - If MS_SYNC, wait for I/O to complete
> - Return to user space

set_page_dirty queues pages for IO, and the writeout daemon will service
that queue when it sees fit. IMO it is sufficient that you cannot say the
implementation does not meet the standard.


> 2) "all the write operations are initiated or queued for servicing".
> It is a common convention in English (and most languages, I expect)
> that in the "or" is a preference for the first alternative. The second
> is a permitted alternative if the first is not possible.
>
> And "queued for servicing", especially "initiated or queued for
> servicing", to me imples queuing waiting for some resource. To have
> the resource being waited for be a timer expiry seems like rather a
> cheat to me. It perhaps doesn't break the letter of the standard,
> but it definitely bends it. It feels like a fiddle.
>
> Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
> "I don't expect to write this page any more, so now would be a good time
> to clean it."
> It would just make my life easier if the kernel procrastinated less.

But if you didn't notice until now, then the current implementation
must be pretty reasonable for your use as well.

--
SUSE Labs, Novell Inc.

li...@horizon.com

Mar 28, 2007, 6:00:24 AM
> But if you didn't notice until now, then the current implementation
> must be pretty reasonable for your use as well.

Oh, I definitely noticed. As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year. The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.

I've now got the hint on how to make it work on 2.6 (sync_file_range()),
so I can try again. But the pressure to upgrade is not strong, so it
might be a while.

You may recall, this subthread started when I responded to "the
only reason to use msync(MS_ASYNC) is to update timestamps" with a
counterexample. I still think the purpose of the call is a hint to the
kernel that writing to the specified page(s) is complete and now would be
a good time to clean them. Which has very little to do with timestamps.

Now, my application, which leaves less than a second between the MS_ASYNC
and a subsequent MS_SYNC to check whether it's done, broke, but I can
imagine similar cases where MS_ASYNC would remain a useful hint to reduce
the sort of memory hogging generally associated with "dd if=/dev/zero"
type operations.

Reading between the lines of the standard, that seems (to me, at least)
to obviously be the intended purpose of msync(MS_ASYNC). I wonder if
there's any historical documentation describing the original intent
behind creating the call.

Nick Piggin

Mar 29, 2007, 1:10:07 AM
li...@horizon.com wrote:
>>But if you didn't notice until now, then the current implementation
>>must be pretty reasonable for your use as well.
>
>
> Oh, I definitely noticed. As soon as I tried to port my application
> to 2.6, it broke - as evidenced by my complaints last year. The
> current solution is simple - since it's running on dedicated boxes,
> leave them on 2.4.

Well I didn't know that was a change in behaviour vs 2.4 (or maybe I
did and forgot). That was probably a bit silly, unless there was a
good reason for it.

--
SUSE Labs, Novell Inc.
