[PATCH 00/16] expand mmap_prepare functionality, port more users

Lorenzo Stoakes

Sep 8, 2025, 7:11:13 AM
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), The f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook, which is highly problematic as it allows
drivers and filesystems raw access to a VMA that is not yet correctly
initialised.

This hook also introduces complexity for the memory mapping operation, as
we must correctly unwind what we have done should an error arise.

Overall, this interface being so open has caused significant problems for
us, including security issues, so it is important that we simply eliminate
it as a source of problems.

This series therefore continues that work, extending the functionality
further to permit more drivers and filesystems to use mmap_prepare.

After updating some areas that can simply use mmap_prepare as-is, and
performing some housekeeping, we then introduce two new hooks:

f_op->mmap_complete - this is invoked once the VMA has been correctly
inserted, though with the VMA write lock still held. mmap_prepare must also
be specified.

This expands the use of mmap_prepare to those callers which need to
prepopulate mappings, as well as any which genuinely require access to the
VMA.

It's simple - we let the caller access the VMA, but only once it's
established. At this point unwinding on error is simple - we just unmap the
VMA.

The VMA is also fully initialised by this stage, so no issues can arise
from a not-fully-initialised VMA at this point.

The other newly added hook is:

f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
mmap_complete. This is called should an error arise between mmap_prepare
and mmap_complete (not as a result of mmap_prepare but rather some other
part of the mapping logic).

This is required in case mmap_prepare establishes state or takes locks
which need to be cleaned up on completion. Without mmap_abort this could not
be permitted, as that cleanup would otherwise not occur should the mapping
fail between the two calls.
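
As an illustration of how the three hooks fit together, below is a minimal
sketch of a hypothetical driver (all foo_* names are invented for the
example and are not part of this series): it takes a driver lock in
mmap_prepare and releases it in either mmap_complete or mmap_abort, similar
in spirit to the resctrl conversion later in the series:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(foo_lock);
static const struct vm_operations_struct foo_vm_ops;

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	/* Held until the mapping either completes or is aborted. */
	mutex_lock(&foo_lock);
	desc->vm_ops = &foo_vm_ops;
	return 0;
}

static int foo_mmap_complete(struct file *file, struct vm_area_struct *vma,
			     const void *context)
{
	/* The VMA is now fully inserted - prepopulate it here if needed. */
	mutex_unlock(&foo_lock);
	return 0;
}

static void foo_mmap_abort(const struct file *file,
			   const void *vm_private_data, const void *context)
{
	/* The mapping failed between prepare and complete - clean up. */
	mutex_unlock(&foo_lock);
}

static const struct file_operations foo_fops = {
	.mmap_prepare	= foo_mmap_prepare,
	.mmap_complete	= foo_mmap_complete,
	.mmap_abort	= foo_mmap_abort,
};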

We then add split remap_pfn_range*() functions which allow a PFN remap (a
typical mapping prepopulation operation) to be split between prepare/complete
steps, as well as io_remap_pfn_range_prepare(), _complete() for a similar
purpose.

From there we update various mm-adjacent logic to use this functionality as
a first set of changes, as well as the resctrl and cramfs filesystems to
round off the non-stacked filesystem instances.


REVIEWER NOTE:
~~~~~~~~~~~~~~

I considered putting the complete and abort callbacks in vm_ops; however,
this won't work, because we would then be unable to adjust helpers like
generic_file_mmap_prepare() (which provides vm_ops) to provide the correct
complete and abort callbacks.

Conceptually it also makes more sense to have these in f_op as they are
one-off operations performed at mmap time to establish the VMA, rather than
a property of the VMA itself.

Lorenzo Stoakes (16):
mm/shmem: update shmem to use mmap_prepare
device/dax: update devdax to use mmap_prepare
mm: add vma_desc_size(), vma_desc_pages() helpers
relay: update relay to use mmap_prepare
mm/vma: rename mmap internal functions to avoid confusion
mm: introduce the f_op->mmap_complete, mmap_abort hooks
doc: update porting, vfs documentation for mmap_[complete, abort]
mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
mm: introduce io_remap_pfn_range_prepare, complete
mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
mm: update mem char driver to use mmap_prepare, mmap_complete
mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
mm: update cramfs to use mmap_prepare, mmap_complete
fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
kcov: update kcov to use mmap_prepare, mmap_complete

Documentation/filesystems/porting.rst | 9 ++
Documentation/filesystems/vfs.rst | 35 +++++++
arch/csky/include/asm/pgtable.h | 5 +
arch/mips/alchemy/common/setup.c | 28 +++++-
arch/mips/include/asm/pgtable.h | 10 ++
arch/s390/kernel/crash_dump.c | 6 +-
arch/sparc/include/asm/pgtable_32.h | 29 +++++-
arch/sparc/include/asm/pgtable_64.h | 29 +++++-
drivers/char/mem.c | 80 ++++++++-------
drivers/dax/device.c | 32 +++---
fs/cramfs/inode.c | 134 ++++++++++++++++++--------
fs/hugetlbfs/inode.c | 86 +++++++++--------
fs/ntfs3/file.c | 2 +-
fs/proc/inode.c | 13 ++-
fs/proc/vmcore.c | 53 +++++++---
fs/resctrl/pseudo_lock.c | 56 ++++++++---
include/linux/fs.h | 4 +
include/linux/mm.h | 53 +++++++++-
include/linux/mm_types.h | 5 +
include/linux/proc_fs.h | 5 +
include/linux/shmem_fs.h | 3 +-
include/linux/vmalloc.h | 10 +-
kernel/kcov.c | 40 +++++---
kernel/relay.c | 32 +++---
mm/memory.c | 128 +++++++++++++++---------
mm/secretmem.c | 2 +-
mm/shmem.c | 49 +++++++---
mm/util.c | 18 +++-
mm/vma.c | 96 +++++++++++++++---
mm/vmalloc.c | 16 ++-
tools/testing/vma/vma_internal.h | 31 +++++-
31 files changed, 810 insertions(+), 289 deletions(-)

--
2.51.0

Lorenzo Stoakes

shmem_mmap() simply assigns the vm_ops, so it is easily updated - do so.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
mm/shmem.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 29e1eb690125..cfc33b99a23a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2950,16 +2950,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
return retval;
}

-static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int shmem_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *file = desc->file;
struct inode *inode = file_inode(file);

file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
- vma->vm_ops = &shmem_vm_ops;
+ desc->vm_ops = &shmem_vm_ops;
else
- vma->vm_ops = &shmem_anon_vm_ops;
+ desc->vm_ops = &shmem_anon_vm_ops;
return 0;
}

@@ -5229,7 +5230,7 @@ static const struct address_space_operations shmem_aops = {
};

static const struct file_operations shmem_file_operations = {
- .mmap = shmem_mmap,
+ .mmap_prepare = shmem_mmap_prepare,
.open = shmem_file_open,
.get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
--
2.51.0

Lorenzo Stoakes

The devdax driver does nothing special in its f_op->mmap hook, so
straightforwardly update it to use the mmap_prepare hook instead.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
drivers/dax/device.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bb40a6060af..c2181439f925 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -13,8 +13,9 @@
#include "dax-private.h"
#include "bus.h"

-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
- const char *func)
+static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
+ unsigned long start, unsigned long end, struct file *file,
+ const char *func)
{
struct device *dev = &dev_dax->dev;
unsigned long mask;
@@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return -ENXIO;

/* prevent private mappings from being established */
- if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
dev_info_ratelimited(dev,
"%s: %s: fail, attempted private mapping\n",
current->comm, func);
@@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
}

mask = dev_dax->align - 1;
- if (vma->vm_start & mask || vma->vm_end & mask) {
+ if (start & mask || end & mask) {
dev_info_ratelimited(dev,
"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
- current->comm, func, vma->vm_start, vma->vm_end,
+ current->comm, func, start, end,
mask);
return -EINVAL;
}

- if (!vma_is_dax(vma)) {
+ if (!file_is_dax(file)) {
dev_info_ratelimited(dev,
"%s: %s: fail, vma is not DAX capable\n",
current->comm, func);
@@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return 0;
}

+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+ const char *func)
+{
+ return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
+ vma->vm_file, func);
+}
+
/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
@@ -285,8 +293,9 @@ static const struct vm_operations_struct dax_vm_ops = {
.pagesize = dev_dax_pagesize,
};

-static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
+static int dax_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *filp = desc->file;
struct dev_dax *dev_dax = filp->private_data;
int rc, id;

@@ -297,13 +306,14 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
* fault time.
*/
id = dax_read_lock();
- rc = check_vma(dev_dax, vma, __func__);
+ rc = __check_vma(dev_dax, desc->vm_flags, desc->start, desc->end, filp,
+ __func__);
dax_read_unlock(id);
if (rc)
return rc;

- vma->vm_ops = &dax_vm_ops;
- vm_flags_set(vma, VM_HUGEPAGE);
+ desc->vm_ops = &dax_vm_ops;
+ desc->vm_flags |= VM_HUGEPAGE;
return 0;
}

@@ -377,7 +387,7 @@ static const struct file_operations dax_fops = {
.open = dax_open,
.release = dax_release,
.get_unmapped_area = dax_get_unmapped_area,
- .mmap = dax_mmap,
+ .mmap_prepare = dax_mmap_prepare,
.fop_flags = FOP_MMAP_SYNC,
};

--
2.51.0

Lorenzo Stoakes

It's useful to be able to determine the size of a VMA descriptor range used
in f_op->mmap_prepare, expressed both in bytes and in pages, so add helpers
for both and update code that can make use of them to do so.
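
For example, a hypothetical mmap_prepare hook (foo_* and FOO_BUF_SIZE are
invented for illustration) might validate the requested range using them:

#include <linux/mm.h>
#include <linux/sizes.h>

#define FOO_BUF_SIZE	SZ_1M	/* hypothetical backing buffer size */

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	/* Reject mappings larger than the backing buffer. */
	if (vma_desc_size(desc) > FOO_BUF_SIZE)
		return -EINVAL;

	pr_debug("mapping %lu pages\n", vma_desc_pages(desc));
	return 0;
}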

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/ntfs3/file.c | 2 +-
include/linux/mm.h | 10 ++++++++++
mm/secretmem.c | 2 +-
3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index c1ece707b195..86eb88f62714 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)

if (rw) {
u64 to = min_t(loff_t, i_size_read(inode),
- from + desc->end - desc->start);
+ from + vma_desc_size(desc));

if (is_sparsed(ni)) {
/* Allocate clusters for rw map. */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a6bfa46937a8..9d4508b20be3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3560,6 +3560,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}

+static inline unsigned long vma_desc_size(struct vm_area_desc *desc)
+{
+ return desc->end - desc->start;
+}
+
+static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
+{
+ return vma_desc_size(desc) >> PAGE_SHIFT;
+}
+
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
unsigned long vm_start, unsigned long vm_end)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 60137305bc20..62066ddb1e9c 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)

static int secretmem_mmap_prepare(struct vm_area_desc *desc)
{
- const unsigned long len = desc->end - desc->start;
+ const unsigned long len = vma_desc_size(desc);

if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
return -EINVAL;
--
2.51.0

Lorenzo Stoakes

It is relatively trivial to update this code to use the f_op->mmap_prepare
hook in favour of the deprecated f_op->mmap hook, so do so.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
kernel/relay.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 8d915fe98198..8866054104fe 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -72,17 +72,17 @@ static void relay_free_page_array(struct page **array)
}

/**
- * relay_mmap_buf: - mmap channel buffer to process address space
- * @buf: relay channel buffer
- * @vma: vm_area_struct describing memory to be mapped
+ * relay_mmap_prepare_buf: - mmap channel buffer to process address space
+ * @desc: describing what to map
*
* Returns 0 if ok, negative on error
*
* Caller should already have grabbed mmap_lock.
*/
-static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
+static int relay_mmap_prepare_buf(struct rchan_buf *buf,
+ struct vm_area_desc *desc)
{
- unsigned long length = vma->vm_end - vma->vm_start;
+ unsigned long length = vma_desc_size(desc);

if (!buf)
return -EBADF;
@@ -90,9 +90,9 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
if (length != (unsigned long)buf->chan->alloc_size)
return -EINVAL;

- vma->vm_ops = &relay_file_mmap_ops;
- vm_flags_set(vma, VM_DONTEXPAND);
- vma->vm_private_data = buf;
+ desc->vm_ops = &relay_file_mmap_ops;
+ desc->vm_flags |= VM_DONTEXPAND;
+ desc->private_data = buf;

return 0;
}
@@ -749,16 +749,16 @@ static int relay_file_open(struct inode *inode, struct file *filp)
}

/**
- * relay_file_mmap - mmap file op for relay files
- * @filp: the file
- * @vma: the vma describing what to map
+ * relay_file_mmap_prepare - mmap file op for relay files
+ * @desc: describing what to map
*
- * Calls upon relay_mmap_buf() to map the file into user space.
+ * Calls upon relay_mmap_prepare_buf() to map the file into user space.
*/
-static int relay_file_mmap(struct file *filp, struct vm_area_struct *vma)
+static int relay_file_mmap_prepare(struct vm_area_desc *desc)
{
- struct rchan_buf *buf = filp->private_data;
- return relay_mmap_buf(buf, vma);
+ struct rchan_buf *buf = desc->file->private_data;
+
+ return relay_mmap_prepare_buf(buf, desc);
}

/**
@@ -1006,7 +1006,7 @@ static ssize_t relay_file_read(struct file *filp,
const struct file_operations relay_file_operations = {
.open = relay_file_open,
.poll = relay_file_poll,
- .mmap = relay_file_mmap,
+ .mmap_prepare = relay_file_mmap_prepare,
.read = relay_file_read,
.release = relay_file_release,
};
--
2.51.0

Lorenzo Stoakes

Now we have the f_op->mmap_prepare() hook, having a static function called
__mmap_prepare() that has nothing to do with it is confusing, so rename the
function.

Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
provide a f_op->mmap_complete() callback.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
mm/vma.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index abe0da33c844..0efa4288570e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2329,7 +2329,7 @@ static void update_ksm_flags(struct mmap_state *map)
}

/*
- * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ * __mmap_prelude() - Prepare to gather any overlapping VMAs that need to be
* unmapped once the map operation is completed, check limits, account mapping
* and clean up any pre-existing VMAs.
*
@@ -2338,7 +2338,7 @@ static void update_ksm_flags(struct mmap_state *map)
*
* Returns: 0 on success, error code otherwise.
*/
-static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
+static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
{
int error;
struct vma_iterator *vmi = map->vmi;
@@ -2515,13 +2515,13 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
}

/*
- * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
+ * __mmap_epilogue() - Unmap any VMAs we overlap, account memory mapping
* statistics, handle locking and finalise the VMA.
*
* @map: Mapping state.
* @vma: Merged or newly allocated VMA for the mmap()'d region.
*/
-static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+static void __mmap_epilogue(struct mmap_state *map, struct vm_area_struct *vma)
{
struct mm_struct *mm = map->mm;
vm_flags_t vm_flags = vma->vm_flags;
@@ -2649,7 +2649,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

map.check_ksm_early = can_set_ksm_flags_early(&map);

- error = __mmap_prepare(&map, uf);
+ error = __mmap_prelude(&map, uf);
if (!error && have_mmap_prepare)
error = call_mmap_prepare(&map);
if (error)
@@ -2675,11 +2675,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
if (have_mmap_prepare)
set_vma_user_defined_fields(vma, &map);

- __mmap_complete(&map, vma);
+ __mmap_epilogue(&map, vma);

return addr;

- /* Accounting was done by __mmap_prepare(). */
+ /* Accounting was done by __mmap_prelude(). */
unacct_error:
if (map.charged)
vm_unacct_memory(map.charged);
--
2.51.0

Lorenzo Stoakes

We have introduced the f_op->mmap_prepare hook to allow for setting up a
VMA far earlier in the process of mapping memory, reducing problematic
error handling paths, but this does not provide what all
drivers/filesystems need.

In order to supply this, and to be able to move forward with removing
f_op->mmap altogether, introduce f_op->mmap_complete.

This hook is called once the VMA is fully mapped and everything is done,
though with the mmap and VMA write locks still held.

The hook is then provided with a fully initialised VMA which it can operate
upon as it needs, though the mmap and VMA write locks must remain held
throughout.

It is not intended that the VMA be modified at this point; attempts to do
so will end in tears.

This allows for operations such as pre-population (typically via a remap),
or really anything that requires access to the VMA once it is initialised.

In addition, a caller may need to take a lock in mmap_prepare, when it is
possible to modify the VMA, and release it on mmap_complete. In order to
handle errors which may arise between the two operations, f_op->mmap_abort
is provided.

This hook should be used to drop any lock and clean up anything before the
VMA mapping operation is aborted. After this point the VMA will not be
added to any mapping and will not exist.

We also add a new mmap_context field to the vm_area_desc type which can be
used to pass information pertinent to any locks which are held or any state
which is required for mmap_complete, abort to operate correctly.
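
For instance (a sketch only, with hypothetical foo_* names), per-mapping
state allocated in mmap_prepare can be handed to the later hooks via this
field:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct foo_state {
	unsigned long flags;	/* hypothetical per-mapping state */
};

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	struct foo_state *state = kzalloc(sizeof(*state), GFP_KERNEL);

	if (!state)
		return -ENOMEM;
	desc->mmap_context = state;
	return 0;
}

static int foo_mmap_complete(struct file *file, struct vm_area_struct *vma,
			     const void *context)
{
	const struct foo_state *state = context;

	/* ... use state to set up the now-inserted VMA ... */
	kfree(state);
	return 0;
}

static void foo_mmap_abort(const struct file *file,
			   const void *vm_private_data, const void *context)
{
	/* The VMA was never inserted - just free the state. */
	kfree(context);
}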

We also update the compatibility layer for nested filesystems which
currently still specify only an f_op->mmap() handler, so that it correctly
invokes f_op->mmap_complete as necessary (note that no error can occur
between mmap_prepare and mmap_complete, so mmap_abort will never be called
in this case).

Also update the VMA tests to account for the changes.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
include/linux/fs.h | 4 ++
include/linux/mm_types.h | 5 ++
mm/util.c | 18 +++++--
mm/vma.c | 82 ++++++++++++++++++++++++++++++--
tools/testing/vma/vma_internal.h | 31 ++++++++++--
5 files changed, 129 insertions(+), 11 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 594bd4d0521e..bb432924993a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,10 @@ struct file_operations {
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
int (*mmap_prepare)(struct vm_area_desc *);
+ int (*mmap_complete)(struct file *, struct vm_area_struct *,
+ const void *context);
+ void (*mmap_abort)(const struct file *, const void *vm_private_data,
+ const void *context);
} __randomize_layout;

/* Supports async buffered reads */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf759fe08bb3..052db1f31fb3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -793,6 +793,11 @@ struct vm_area_desc {
/* Write-only fields. */
const struct vm_operations_struct *vm_ops;
void *private_data;
+ /*
+ * A user-defined field, value will be passed to mmap_complete,
+ * mmap_abort.
+ */
+ void *mmap_context;
};

/*
diff --git a/mm/util.c b/mm/util.c
index 248f877f629b..f5bcac140cb9 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1161,17 +1161,26 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
err = f_op->mmap_prepare(&desc);
if (err)
return err;
+
set_vma_from_desc(vma, &desc);

- return 0;
+ /*
+ * No error can occur between mmap_prepare() and mmap_complete so no
+ * need to invoke mmap_abort().
+ */
+
+ if (f_op->mmap_complete)
+ err = f_op->mmap_complete(file, vma, desc.mmap_context);
+
+ return err;
}
EXPORT_SYMBOL(__compat_vma_mmap_prepare);

/**
* compat_vma_mmap_prepare() - Apply the file's .mmap_prepare() hook to an
- * existing VMA.
+ * existing VMA and invoke .mmap_complete() if provided.
* @file: The file which possesss an f_op->mmap_prepare() hook.
- * @vma: The VMA to apply the .mmap_prepare() hook to.
+ * @vma: The VMA to apply the hooks to.
*
* Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
* stacked filesystems invoke a nested mmap hook of an underlying file.
@@ -1188,6 +1197,9 @@ EXPORT_SYMBOL(__compat_vma_mmap_prepare);
* establishes a struct vm_area_desc descriptor, passes to the underlying
* .mmap_prepare() hook and applies any changes performed by it.
*
+ * If the relevant hooks are provided, it also invokes .mmap_complete() upon
+ * successful completion.
+ *
* Once the conversion of filesystems is complete this function will no longer
* be required and will be removed.
*
diff --git a/mm/vma.c b/mm/vma.c
index 0efa4288570e..a0b568fe9e8d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -22,6 +22,7 @@ struct mmap_state {
/* User-defined fields, perhaps updated by .mmap_prepare(). */
const struct vm_operations_struct *vm_ops;
void *vm_private_data;
+ void *mmap_context;

unsigned long charged;

@@ -2343,6 +2344,23 @@ static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
int error;
struct vma_iterator *vmi = map->vmi;
struct vma_munmap_struct *vms = &map->vms;
+ struct file *file = map->file;
+
+ if (file) {
+ /* f_op->mmap_complete requires f_op->mmap_prepare. */
+ if (file->f_op->mmap_complete && !file->f_op->mmap_prepare)
+ return -EINVAL;
+
+ /*
+ * It's not valid to provide an f_op->mmap_abort hook without also
+ * providing the f_op->mmap_prepare and f_op->mmap_complete hooks it is
+ * used with.
+ */
+ if (file->f_op->mmap_abort &&
+ (!file->f_op->mmap_prepare ||
+ !file->f_op->mmap_complete))
+ return -EINVAL;
+ }

/* Find the first overlapping VMA and initialise unmap state. */
vms->vma = vma_find(vmi, map->end);
@@ -2595,6 +2613,7 @@ static int call_mmap_prepare(struct mmap_state *map)
/* User-defined fields. */
map->vm_ops = desc.vm_ops;
map->vm_private_data = desc.private_data;
+ map->mmap_context = desc.mmap_context;

return 0;
}
@@ -2636,16 +2655,61 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
return false;
}

+/*
+ * Invoke the f_op->mmap_complete hook, providing it with a fully initialised
+ * VMA to operate upon.
+ *
+ * The mmap and VMA write locks must be held prior to and after the hook has
+ * been invoked.
+ */
+static int call_mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+{
+ struct file *file = map->file;
+ void *context = map->mmap_context;
+ int error;
+ size_t len;
+
+ if (!file || !file->f_op->mmap_complete)
+ return 0;
+
+ error = file->f_op->mmap_complete(file, vma, context);
+ /* The hook must NOT drop the write locks. */
+ vma_assert_write_locked(vma);
+ mmap_assert_write_locked(current->mm);
+ if (!error)
+ return 0;
+
+ /*
+ * If an error occurs, unmap the VMA altogether and return an error. We
+ * only clear the newly allocated VMA, since this function is only
+ * invoked if we do NOT merge, so we only clean up the VMA we created.
+ */
+ len = vma_pages(vma) << PAGE_SHIFT;
+ do_munmap(current->mm, vma->vm_start, len, NULL);
+ return error;
+}
+
+static void call_mmap_abort(struct mmap_state *map)
+{
+ struct file *file = map->file;
+ void *vm_private_data = map->vm_private_data;
+
+ VM_WARN_ON_ONCE(!file || !file->f_op);
+ file->f_op->mmap_abort(file, vm_private_data, map->mmap_context);
+}
+
static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = NULL;
- int error;
bool have_mmap_prepare = file && file->f_op->mmap_prepare;
+ bool have_mmap_abort = file && file->f_op->mmap_abort;
+ struct mm_struct *mm = current->mm;
VMA_ITERATOR(vmi, mm, addr);
MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
+ struct vm_area_struct *vma = NULL;
+ bool allocated_new = false;
+ int error;

map.check_ksm_early = can_set_ksm_flags_early(&map);

@@ -2668,8 +2732,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
/* ...but if we can't, allocate a new VMA. */
if (!vma) {
error = __mmap_new_vma(&map, &vma);
- if (error)
+ if (error) {
+ if (have_mmap_abort)
+ call_mmap_abort(&map);
goto unacct_error;
+ }
+ allocated_new = true;
}

if (have_mmap_prepare)
@@ -2677,6 +2745,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

__mmap_epilogue(&map, vma);

+ if (allocated_new) {
+ error = call_mmap_complete(&map, vma);
+ if (error)
+ return error;
+ }
+
return addr;

/* Accounting was done by __mmap_prelude(). */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07167446dcf4..566cef1c0e0b 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -297,11 +297,20 @@ struct vm_area_desc {
/* Write-only fields. */
const struct vm_operations_struct *vm_ops;
void *private_data;
+ /*
+ * A user-defined field, value will be passed to mmap_complete,
+ * mmap_abort.
+ */
+ void *mmap_context;
};

struct file_operations {
int (*mmap)(struct file *, struct vm_area_struct *);
int (*mmap_prepare)(struct vm_area_desc *);
+ void (*mmap_abort)(const struct file *, const void *vm_private_data,
+ const void *context);
+ int (*mmap_complete)(struct file *, struct vm_area_struct *,
+ const void *context);
};

struct file {
@@ -1471,7 +1480,7 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
{
struct vm_area_desc desc = {
.mm = vma->vm_mm,
- .file = vma->vm_file,
+ .file = file,
.start = vma->vm_start,
.end = vma->vm_end,

@@ -1485,13 +1494,21 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
err = f_op->mmap_prepare(&desc);
if (err)
return err;
+
set_vma_from_desc(vma, &desc);

- return 0;
+ /*
+ * No error can occur between mmap_prepare() and mmap_complete so no
+ * need to invoke mmap_abort().
+ */
+
+ if (f_op->mmap_complete)
+ err = f_op->mmap_complete(file, vma, desc.mmap_context);
+
+ return err;
}

-static inline int compat_vma_mmap_prepare(struct file *file,
- struct vm_area_struct *vma)
+static inline int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
{
return __compat_vma_mmap_prepare(file->f_op, file, vma);
}
@@ -1548,4 +1565,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
return vm_flags;
}

+static inline int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+ struct list_head *uf)
+{
+ return 0;
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.51.0

Lorenzo Stoakes

We have introduced the mmap_complete() and mmap_abort() callbacks, which
work in conjunction with mmap_prepare(), so describe what they are used for.

We update both the VFS documentation and the porting guide.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
Documentation/filesystems/porting.rst | 9 +++++++
Documentation/filesystems/vfs.rst | 35 +++++++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..abc1b8c95d24 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1285,3 +1285,12 @@ rather than a VMA, as the VMA at this stage is not yet valid.
The vm_area_desc provides the minimum required information for a filesystem
to initialise state upon memory mapping of a file-backed region, and output
parameters for the file system to set this state.
+
+In nearly all cases, this is all that is required for a filesystem. However,
+should there be a need to operate on the newly inserted VMA, the mmap_complete()
+can be specified to do so.
+
+Additionally, if mmap_prepare() and mmap_complete() are specified, mmap_abort()
+may also be provided which is invoked if the mapping fails between mmap_prepare
+and mmap_complete(). It is only valid to specify mmap_abort() if both other
+hooks are provided.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..172d36a13e13 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1114,6 +1114,10 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
int (*mmap_prepare)(struct vm_area_desc *);
+ int (*mmap_complete)(struct file *, struct vm_area_struct *,
+ const void *context);
+ void (*mmap_abort)(const struct file *, const void *vm_private_data,
+ const void *context);
};

Again, all methods are called without any locks being held, unless
@@ -1236,6 +1240,37 @@ otherwise noted.
file-backed memory mapping, most notably establishing relevant
private state and VMA callbacks.

+``mmap_complete``
+ If mmap_prepare is provided, will be invoked after the mapping is fully
+ established, with the mmap and VMA write locks held.
+
+ It is useful for prepopulating VMAs before they may be accessed by
+ users.
+
+ The hook MUST NOT release either the VMA or mmap write locks. This is
+ asserted by the mmap logic.
+
+ If an error is returned by the hook, the VMA is unmapped and the
+ mmap() operation fails with that error.
+
+ It is not valid to specify this hook if mmap_prepare is not also
+ specified, doing so will result in an error upon mapping.
+
+``mmap_abort``
+ If mmap_prepare() and mmap_complete() are provided, then mmap_abort
+ may also be provided, which will be invoked if the mapping operation
+ fails between the two calls.
+
+ This is important, because mmap_prepare may succeed, but some other part
+ of the mapping operation may fail before mmap_complete can be called.
+
+ This allows a caller to acquire locks in mmap_prepare with certainty
+ that the locks will be released by either mmap_abort or mmap_complete no
+ matter what happens.
+
+ It is not valid to specify this unless mmap_prepare and mmap_complete
+ are both specified, doing so will result in an error upon mapping.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
--
2.51.0

Lorenzo Stoakes

We need the ability to split PFN remap between updating the VMA and
performing the actual remap, in order to do away with the legacy f_op->mmap
hook.

To do so, update the PFN remap code to provide shared logic, and also make
remap_pfn_range_notrack() static, as its one user, io_mapping_map_user(),
was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").

Then, introduce remap_pfn_range_prepare(), which accepts a VMA descriptor
and PFN parameters, and remap_pfn_range_complete(), which accepts the same
parameters as remap_pfn_range().

remap_pfn_range_prepare() will set the CoW vma->vm_pgoff if necessary, so
it must be supplied with the correct PFN to do so. If the caller must hold
locks to be able to do this, those locks should be held across the
operation, and mmap_abort() should be provided to release them should an
error arise.

While we're here, also clean up the duplicated #ifdef
__HAVE_PFNMAP_TRACKING check and put it into a single #ifdef/#else block.
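
As a rough sketch of how a caller might use the split pair (foo_dev and
base_pfn are invented for illustration; the real conversions follow in later
patches in this series):

#include <linux/fs.h>
#include <linux/mm.h>

struct foo_dev {
	unsigned long base_pfn;		/* hypothetical device PFN base */
};

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	struct foo_dev *foo = desc->file->private_data;

	/* Sets desc->pgoff for the CoW case and ORs in VM_REMAP_FLAGS. */
	remap_pfn_range_prepare(desc, foo->base_pfn);
	return 0;
}

static int foo_mmap_complete(struct file *file, struct vm_area_struct *vma,
			     const void *context)
{
	struct foo_dev *foo = file->private_data;

	/* Populates the page tables; the VMA itself is left unmodified. */
	return remap_pfn_range_complete(vma, vma->vm_start, foo->base_pfn,
					vma->vm_end - vma->vm_start,
					vma->vm_page_prot);
}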

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
include/linux/mm.h | 25 +++++++--
mm/memory.c | 128 ++++++++++++++++++++++++++++-----------------
2 files changed, 102 insertions(+), 51 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d4508b20be3..0f59bf14cac3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -489,6 +489,21 @@ extern unsigned int kobjsize(const void *objp);
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

+/*
+ * Physically remapped pages are special. Tell the
+ * rest of the world about it:
+ * VM_IO tells people not to look at these pages
+ * (accesses can have side effects).
+ * VM_PFNMAP tells the core MM that the base pages are just
+ * raw PFN mappings, and do not have a "struct page" associated
+ * with them.
+ * VM_DONTEXPAND
+ * Disable vma merging and expanding with mremap().
+ * VM_DONTDUMP
+ * Omit vma from core dump, even when VM_IO turned off.
+ */
+#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
+
/* This mask prevents VMA from being scanned with khugepaged */
#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)

@@ -3611,10 +3626,12 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,

struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
unsigned long addr);
-int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t);
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot);
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t pgprot);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t pgprot);
+
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
struct page **pages, unsigned long *num);
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..f6234c54047f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,8 +2900,27 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
return 0;
}

+static int get_remap_pgoff(vm_flags_t vm_flags, unsigned long addr,
+ unsigned long end, unsigned long vm_start, unsigned long vm_end,
+ unsigned long pfn, pgoff_t *vm_pgoff_p)
+{
+ /*
+ * There's a horrible special case to handle copy-on-write
+ * behaviour that some programs depend on. We mark the "original"
+ * un-COW'ed pages by matching them up with "vma->vm_pgoff".
+ * See vm_normal_page() for details.
+ */
+ if (is_cow_mapping(vm_flags)) {
+ if (addr != vm_start || end != vm_end)
+ return -EINVAL;
+ *vm_pgoff_p = pfn;
+ }
+
+ return 0;
+}
+
static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
pgd_t *pgd;
unsigned long next;
@@ -2912,32 +2931,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
return -EINVAL;

- /*
- * Physically remapped pages are special. Tell the
- * rest of the world about it:
- * VM_IO tells people not to look at these pages
- * (accesses can have side effects).
- * VM_PFNMAP tells the core MM that the base pages are just
- * raw PFN mappings, and do not have a "struct page" associated
- * with them.
- * VM_DONTEXPAND
- * Disable vma merging and expanding with mremap().
- * VM_DONTDUMP
- * Omit vma from core dump, even when VM_IO turned off.
- *
- * There's a horrible special case to handle copy-on-write
- * behaviour that some programs depend on. We mark the "original"
- * un-COW'ed pages by matching them up with "vma->vm_pgoff".
- * See vm_normal_page() for details.
- */
- if (is_cow_mapping(vma->vm_flags)) {
- if (addr != vma->vm_start || end != vma->vm_end)
- return -EINVAL;
- vma->vm_pgoff = pfn;
+ if (set_vma) {
+ err = get_remap_pgoff(vma->vm_flags, addr, end,
+ vma->vm_start, vma->vm_end,
+ pfn, &vma->vm_pgoff);
+ if (err)
+ return err;
+ vm_flags_set(vma, VM_REMAP_FLAGS);
+ } else {
+ VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);
}

- vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
-
BUG_ON(addr >= end);
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
@@ -2957,11 +2961,10 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
* Variant of remap_pfn_range that does not call track_pfn_remap. The caller
* must have pre-validated the caching bits of the pgprot_t.
*/
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
- int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
-
+ int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, set_vma);
if (!error)
return 0;

@@ -2974,6 +2977,18 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
return error;
}

+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+ /*
+ * We set addr=VMA start, end=VMA end here, so this won't fail, but we
+ * check it again on complete and will fail there if specified addr is
+ * invalid.
+ */
+ get_remap_pgoff(desc->vm_flags, desc->start, desc->end,
+ desc->start, desc->end, pfn, &desc->pgoff);
+ desc->vm_flags |= VM_REMAP_FLAGS;
+}
+
#ifdef __HAVE_PFNMAP_TRACKING
static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
unsigned long size, pgprot_t *prot)
@@ -3002,23 +3017,9 @@ void pfnmap_track_ctx_release(struct kref *ref)
pfnmap_untrack(ctx->pfn, ctx->size);
kfree(ctx);
}
-#endif /* __HAVE_PFNMAP_TRACKING */

-/**
- * remap_pfn_range - remap kernel memory to userspace
- * @vma: user vma to map to
- * @addr: target page aligned user address to start at
- * @pfn: page frame number of kernel physical memory address
- * @size: size of mapping area
- * @prot: page protection flags for this mapping
- *
- * Note: this is only safe if the mm semaphore is held when called.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-#ifdef __HAVE_PFNMAP_TRACKING
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_track(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
struct pfnmap_track_ctx *ctx = NULL;
int err;
@@ -3044,7 +3045,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
return -EINVAL;
}

- err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+ err = remap_pfn_range_notrack(vma, addr, pfn, size, prot, set_vma);
if (ctx) {
if (err)
kref_put(&ctx->kref, pfnmap_track_ctx_release);
@@ -3054,11 +3055,44 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
return err;
}

+/**
+ * remap_pfn_range - remap kernel memory to userspace
+ * @vma: user vma to map to
+ * @addr: target page aligned user address to start at
+ * @pfn: page frame number of kernel physical memory address
+ * @size: size of mapping area
+ * @prot: page protection flags for this mapping
+ *
+ * Note: this is only safe if the mm semaphore is held when called.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range_track(vma, addr, pfn, size, prot,
+ /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ /* With set_vma = false, the VMA will not be modified. */
+ return remap_pfn_range_track(vma, addr, pfn, size, prot,
+ /* set_vma = */false);
+}
#else
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t prot)
{
- return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+ return remap_pfn_range_notrack(vma, addr, pfn, size, prot, /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range_notrack(vma, addr, pfn, size, prot,
+ /* set_vma = */false);
}
#endif
EXPORT_SYMBOL(remap_pfn_range);
--
2.51.0

Lorenzo Stoakes

We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
remap_pfn_range_complete() to allow for I/O remapping utilising
f_op->mmap_prepare and f_op->mmap_complete hooks.

We have to make some architecture-specific changes for those architectures
which define customised handlers.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
arch/csky/include/asm/pgtable.h | 5 +++++
arch/mips/alchemy/common/setup.c | 28 +++++++++++++++++++++++++---
arch/mips/include/asm/pgtable.h | 10 ++++++++++
arch/sparc/include/asm/pgtable_32.h | 29 +++++++++++++++++++++++++----
arch/sparc/include/asm/pgtable_64.h | 29 +++++++++++++++++++++++++----
include/linux/mm.h | 18 ++++++++++++++++++
6 files changed, 108 insertions(+), 11 deletions(-)

diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 5a394be09c35..c83505839a06 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -266,4 +266,9 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
remap_pfn_range(vma, vaddr, pfn, size, prot)

+/* default io_remap_pfn_range_prepare can be used. */
+
+#define io_remap_pfn_range_complete(vma, addr, pfn, size, prot) \
+ remap_pfn_range_complete(vma, addr, pfn, size, prot)
+
#endif /* __ASM_CSKY_PGTABLE_H */
diff --git a/arch/mips/alchemy/common/setup.c b/arch/mips/alchemy/common/setup.c
index a7a6d31a7a41..a4ab02776994 100644
--- a/arch/mips/alchemy/common/setup.c
+++ b/arch/mips/alchemy/common/setup.c
@@ -94,12 +94,34 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t phys_addr, phys_addr_t size)
return phys_addr;
}

-int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
{
phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);

- return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
+ return phys_addr >> PAGE_SHIFT;
+}
+
+int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, vaddr, calc_pfn(pfn, size), size, prot);
}
EXPORT_SYMBOL(io_remap_pfn_range);
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_pfn(pfn, size));
+}
+EXPORT_SYMBOL(io_remap_pfn_range_prepare);
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_pfn(pfn, size),
+ size, prot);
+}
+EXPORT_SYMBOL(io_remap_pfn_range_complete);
+
#endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index ae73ecf4c41a..6a8964f55a31 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -607,6 +607,16 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t addr, phys_addr_t size);
int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
unsigned long pfn, unsigned long size, pgprot_t prot);
#define io_remap_pfn_range io_remap_pfn_range
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size);
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot);
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
#else
#define fixup_bigphys_addr(addr, size) (addr)
#endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 7c199c003ffe..cfd764afc107 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -398,9 +398,7 @@ __get_iospace (unsigned long addr)
int remap_pfn_range(struct vm_area_struct *, unsigned long, unsigned long,
unsigned long, pgprot_t);

-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
- unsigned long from, unsigned long pfn,
- unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
{
unsigned long long offset, space, phys_base;

@@ -408,10 +406,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
space = GET_IOSPACE(pfn);
phys_base = offset | (space << 32ULL);

- return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+ return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
}
#define io_remap_pfn_range io_remap_pfn_range

+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+ size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
({ \
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 669cd02469a1..b8000ce4b59f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1084,9 +1084,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
return 0;
}

-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
- unsigned long from, unsigned long pfn,
- unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
{
unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;
int space = GET_IOSPACE(pfn);
@@ -1094,10 +1092,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,

phys_base = offset | (((unsigned long) space) << 32UL);

- return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+ return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
}
#define io_remap_pfn_range io_remap_pfn_range

+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+ size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
static inline unsigned long __untagged_addr(unsigned long start)
{
if (adi_capable()) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f59bf14cac3..d96840262498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3673,6 +3673,24 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
}
#endif

+#ifndef io_remap_pfn_range_prepare
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, pfn);
+}
+#endif
+
+#ifndef io_remap_pfn_range_complete
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, pfn, size,
+ pgprot_decrypted(prot));
+}
+#endif
+
static inline vm_fault_t vmf_error(int err)
{
if (err == -ENOMEM)
--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:11:59 AM
We can now update hugetlb to make use of the new .mmap_prepare() hook,
deferring the reservation of pages until the VMA is fully established and
handling this in the f_op->mmap_complete() hook.

We hold the VMA write lock throughout so we can't race with faults. rmap
can discover the VMA, but this should not cause a problem.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/hugetlbfs/inode.c | 86 ++++++++++++++++++++++++--------------------
1 file changed, 47 insertions(+), 39 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3cfdf4091001..46d1ddc654c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -96,39 +96,14 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
#define PGOFF_LOFFT_MAX \
(((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))

-static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+static int hugetlb_file_mmap_complete(struct file *file, struct vm_area_struct *vma,
+ const void *context)
{
struct inode *inode = file_inode(file);
- loff_t len, vma_len;
- int ret;
struct hstate *h = hstate_file(file);
- vm_flags_t vm_flags;
-
- /*
- * vma address alignment (but not the pgoff alignment) has
- * already been checked by prepare_hugepage_range. If you add
- * any error returns here, do so after setting VM_HUGETLB, so
- * is_vm_hugetlb_page tests below unmap_region go the right
- * way when do_mmap unwinds (may be important on powerpc
- * and ia64).
- */
- vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
- vma->vm_ops = &hugetlb_vm_ops;
-
- /*
- * page based offset in vm_pgoff could be sufficiently large to
- * overflow a loff_t when converted to byte offset. This can
- * only happen on architectures where sizeof(loff_t) ==
- * sizeof(unsigned long). So, only check in those instances.
- */
- if (sizeof(unsigned long) == sizeof(loff_t)) {
- if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
- return -EINVAL;
- }
-
- /* must be huge page aligned */
- if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
- return -EINVAL;
+ vm_flags_t vm_flags = vma->vm_flags;
+ loff_t len, vma_len;
+ int ret = 0;

vma_len = (loff_t)(vma->vm_end - vma->vm_start);
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -139,9 +114,6 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
inode_lock(inode);
file_accessed(file);

- ret = -ENOMEM;
-
- vm_flags = vma->vm_flags;
/*
* for SHM_HUGETLB, the pages are reserved in the shmget() call so skip
* reserving here. Note: only for SHM hugetlbfs file, the inode
@@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
vm_flags |= VM_NORESERVE;

if (hugetlb_reserve_pages(inode,
- vma->vm_pgoff >> huge_page_order(h),
- len >> huge_page_shift(h), vma,
- vm_flags) < 0)
+ vma->vm_pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h), vma,
+ vm_flags) < 0) {
+ ret = -ENOMEM;
goto out;
+ }

- ret = 0;
if (vma->vm_flags & VM_WRITE && inode->i_size < len)
i_size_write(inode, len);
+
out:
inode_unlock(inode);
-
return ret;
}

+static int hugetlbfs_file_mmap_prepare(struct vm_area_desc *desc)
+{
+ struct file *file = desc->file;
+ struct hstate *h = hstate_file(file);
+
+ /*
+ * vma address alignment (but not the pgoff alignment) has
+ * already been checked by prepare_hugepage_range. If you add
+ * any error returns here, do so after setting VM_HUGETLB, so
+ * is_vm_hugetlb_page tests below unmap_region go the right
+ * way when do_mmap unwinds (may be important on powerpc
+ * and ia64).
+ */
+ desc->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+ desc->vm_ops = &hugetlb_vm_ops;
+
+ /*
+ * page based offset in vm_pgoff could be sufficiently large to
+ * overflow a loff_t when converted to byte offset. This can
+ * only happen on architectures where sizeof(loff_t) ==
+ * sizeof(unsigned long). So, only check in those instances.
+ */
+ if (sizeof(unsigned long) == sizeof(loff_t)) {
+ if (desc->pgoff & PGOFF_LOFFT_MAX)
+ return -EINVAL;
+ }
+
+ /* must be huge page aligned */
+ if (desc->pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
+ return -EINVAL;
+
+ return 0;
+}
+
/*
* Called under mmap_write_lock(mm).
*/
@@ -1219,7 +1226,8 @@ static void init_once(void *foo)

static const struct file_operations hugetlbfs_file_operations = {
.read_iter = hugetlbfs_read_iter,
- .mmap = hugetlbfs_file_mmap,
+ .mmap_prepare = hugetlbfs_file_mmap_prepare,
+ .mmap_complete = hugetlb_file_mmap_complete,
.fsync = noop_fsync,
.get_unmapped_area = hugetlb_get_unmapped_area,
.llseek = default_llseek,
--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:04 AM
Update the mem char driver (backing /dev/mem and /dev/zero) to use
f_op->mmap_prepare, f_op->mmap_complete hooks rather than the deprecated
f_op->mmap hook.

The /dev/zero implementation has a unique and rather concerning
characteristic in that it marks MAP_PRIVATE mmap() mappings as anonymous
when they are, in fact, not.

The new f_op->mmap_prepare() can support this, but rather than introducing
a helper function to perform this hack (and risk introducing other users),
simply set desc->vm_ops to NULL here and add a comment describing what's
going on.

We also introduce shmem_zero_setup_desc() to allow for the shared mapping
case via an f_op->mmap_prepare() hook, and generalise the code between this
and shmem_zero_setup().

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
drivers/char/mem.c | 80 +++++++++++++++++++++++-----------------
include/linux/shmem_fs.h | 3 +-
mm/shmem.c | 40 ++++++++++++++++----
3 files changed, 81 insertions(+), 42 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 34b815901b20..b57ed104d302 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -304,13 +304,13 @@ static unsigned zero_mmap_capabilities(struct file *file)
}

/* can't do an in-place private mapping if there's no MMU */
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
{
- return is_nommu_shared_mapping(vma->vm_flags);
+ return is_nommu_shared_mapping(desc->vm_flags);
}
#else

-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
{
return 1;
}
@@ -322,46 +322,54 @@ static const struct vm_operations_struct mmap_mem_ops = {
#endif
};

-static int mmap_mem(struct file *file, struct vm_area_struct *vma)
+static int mmap_mem_complete(struct file *file, struct vm_area_struct *vma,
+ const void *context)
{
size_t size = vma->vm_end - vma->vm_start;
- phys_addr_t offset = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+
+ if (remap_pfn_range_complete(vma,
+ vma->vm_start,
+ vma->vm_pgoff,
+ size,
+ vma->vm_page_prot))
+ return -EAGAIN;
+
+ return 0;
+}
+
+static int mmap_mem_prepare(struct vm_area_desc *desc)
+{
+ size_t size = vma_desc_size(desc);
+ phys_addr_t offset = (phys_addr_t)desc->pgoff << PAGE_SHIFT;

/* Does it even fit in phys_addr_t? */
- if (offset >> PAGE_SHIFT != vma->vm_pgoff)
+ if (offset >> PAGE_SHIFT != desc->pgoff)
return -EINVAL;

/* It's illegal to wrap around the end of the physical address space. */
if (offset + (phys_addr_t)size - 1 < offset)
return -EINVAL;

- if (!valid_mmap_phys_addr_range(vma->vm_pgoff, size))
+ if (!valid_mmap_phys_addr_range(desc->pgoff, size))
return -EINVAL;

- if (!private_mapping_ok(vma))
+ if (!private_mapping_ok(desc))
return -ENOSYS;

- if (!range_is_allowed(vma->vm_pgoff, size))
+ if (!range_is_allowed(desc->pgoff, size))
return -EPERM;

- if (!phys_mem_access_prot_allowed(file, vma->vm_pgoff, size,
- &vma->vm_page_prot))
+ if (!phys_mem_access_prot_allowed(desc->file, desc->pgoff, size,
+ &desc->page_prot))
return -EINVAL;

- vma->vm_page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
- size,
- vma->vm_page_prot);
-
- vma->vm_ops = &mmap_mem_ops;
+ desc->page_prot = phys_mem_access_prot(desc->file, desc->pgoff,
+ size,
+ desc->page_prot);
+ desc->vm_ops = &mmap_mem_ops;

/* Remap-pfn-range will mark the range VM_IO */
- if (remap_pfn_range(vma,
- vma->vm_start,
- vma->vm_pgoff,
- size,
- vma->vm_page_prot)) {
- return -EAGAIN;
- }
+ remap_pfn_range_prepare(desc, desc->pgoff);
return 0;
}

@@ -501,14 +509,18 @@ static ssize_t read_zero(struct file *file, char __user *buf,
return cleared;
}

-static int mmap_zero(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_zero(struct vm_area_desc *desc)
{
#ifndef CONFIG_MMU
return -ENOSYS;
#endif
- if (vma->vm_flags & VM_SHARED)
- return shmem_zero_setup(vma);
- vma_set_anonymous(vma);
+ if (desc->vm_flags & VM_SHARED)
+ return shmem_zero_setup_desc(desc);
+ /*
+ * This is a highly unusual situation where we mark a MAP_PRIVATE mapping
+ * of /dev/zero anonymous, despite it not being one.
+ */
+ desc->vm_ops = NULL;
return 0;
}

@@ -526,10 +538,11 @@ static unsigned long get_unmapped_area_zero(struct file *file,
{
if (flags & MAP_SHARED) {
/*
- * mmap_zero() will call shmem_zero_setup() to create a file,
- * so use shmem's get_unmapped_area in case it can be huge;
- * and pass NULL for file as in mmap.c's get_unmapped_area(),
- * so as not to confuse shmem with our handle on "/dev/zero".
+ * mmap_prepare_zero() will call shmem_zero_setup() to create a
+ * file, so use shmem's get_unmapped_area in case it can be
+ * huge; and pass NULL for file as in mmap.c's
+ * get_unmapped_area(), so as not to confuse shmem with our
+ * handle on "/dev/zero".
*/
return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
}
@@ -632,7 +645,8 @@ static const struct file_operations __maybe_unused mem_fops = {
.llseek = memory_lseek,
.read = read_mem,
.write = write_mem,
- .mmap = mmap_mem,
+ .mmap_prepare = mmap_mem_prepare,
+ .mmap_complete = mmap_mem_complete,
.open = open_mem,
#ifndef CONFIG_MMU
.get_unmapped_area = get_unmapped_area_mem,
@@ -668,7 +682,7 @@ static const struct file_operations zero_fops = {
.write_iter = write_iter_zero,
.splice_read = copy_splice_read,
.splice_write = splice_write_zero,
- .mmap = mmap_zero,
+ .mmap_prepare = mmap_prepare_zero,
.get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0e47465ef0fd..5b368f9549d6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -94,7 +94,8 @@ extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
const char *name, loff_t size, unsigned long flags);
-extern int shmem_zero_setup(struct vm_area_struct *);
+int shmem_zero_setup(struct vm_area_struct *vma);
+int shmem_zero_setup_desc(struct vm_area_desc *desc);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
diff --git a/mm/shmem.c b/mm/shmem.c
index cfc33b99a23a..7f402e438af0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5905,14 +5905,9 @@ struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name,
}
EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt);

-/**
- * shmem_zero_setup - setup a shared anonymous mapping
- * @vma: the vma to be mmapped is prepared by do_mmap
- */
-int shmem_zero_setup(struct vm_area_struct *vma)
+static struct file *__shmem_zero_setup(unsigned long start, unsigned long end, vm_flags_t vm_flags)
{
- struct file *file;
- loff_t size = vma->vm_end - vma->vm_start;
+ loff_t size = end - start;

/*
* Cloning a new file under mmap_lock leads to a lock ordering conflict
@@ -5920,7 +5915,17 @@ int shmem_zero_setup(struct vm_area_struct *vma)
* accessible to the user through its mapping, use S_PRIVATE flag to
* bypass file security, in the same way as shmem_kernel_file_setup().
*/
- file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
+ return shmem_kernel_file_setup("dev/zero", size, vm_flags);
+}
+
+/**
+ * shmem_zero_setup - setup a shared anonymous mapping
+ * @vma: the vma to be mmapped is prepared by do_mmap
+ */
+int shmem_zero_setup(struct vm_area_struct *vma)
+{
+ struct file *file = __shmem_zero_setup(vma->vm_start, vma->vm_end, vma->vm_flags);
+
if (IS_ERR(file))
return PTR_ERR(file);

@@ -5932,6 +5937,25 @@ int shmem_zero_setup(struct vm_area_struct *vma)
return 0;
}

+/**
+ * shmem_zero_setup_desc - same as shmem_zero_setup, but determined by VMA
+ * descriptor for convenience.
+ * @desc: Describes VMA
+ * Returns: 0 on success, or error
+ */
+int shmem_zero_setup_desc(struct vm_area_desc *desc)
+{
+ struct file *file = __shmem_zero_setup(desc->start, desc->end, desc->vm_flags);
+
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ desc->vm_file = file;
+ desc->vm_ops = &shmem_anon_vm_ops;
+
+ return 0;
+}
+
/**
* shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
* @mapping: the folio's address_space
--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:09 AM
resctrl uses remap_pfn_range(), but holds a mutex across the
operation. Therefore, acquire the mutex in mmap_prepare(), release it in
mmap_complete(), and release it in mmap_abort() should the operation fail.

Otherwise, we simply make use of the remap_pfn_range_[prepare/complete]()
variants in the ordinary way.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/resctrl/pseudo_lock.c | 56 +++++++++++++++++++++++++++++++---------
1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index 87bbc2605de1..6d18ffde6a94 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -995,7 +995,8 @@ static const struct vm_operations_struct pseudo_mmap_ops = {
.mremap = pseudo_lock_dev_mremap,
};

-static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
+static int pseudo_lock_dev_mmap_complete(struct file *filp, struct vm_area_struct *vma,
+ const void *context)
{
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
@@ -1004,6 +1005,40 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
unsigned long physical;
unsigned long psize;

+ rdtgrp = filp->private_data;
+ plr = rdtgrp->plr;
+
+ physical = __pa(plr->kmem) >> PAGE_SHIFT;
+ psize = plr->size - off;
+
+ memset(plr->kmem + off, 0, vsize);
+
+ if (remap_pfn_range_complete(vma, vma->vm_start, physical + vma->vm_pgoff,
+ vsize, vma->vm_page_prot)) {
+ mutex_unlock(&rdtgroup_mutex);
+ return -EAGAIN;
+ }
+
+ mutex_unlock(&rdtgroup_mutex);
+ return 0;
+}
+
+static void pseudo_lock_dev_mmap_abort(const struct file *filp,
+ const void *vm_private_data,
+ const void *context)
+{
+ mutex_unlock(&rdtgroup_mutex);
+}
+
+static int pseudo_lock_dev_mmap_prepare(struct vm_area_desc *desc)
+{
+ unsigned long vsize = vma_desc_size(desc);
+ unsigned long off = desc->pgoff << PAGE_SHIFT;
+ struct file *filp = desc->file;
+ struct pseudo_lock_region *plr;
+ struct rdtgroup *rdtgrp;
+ unsigned long psize;
+
mutex_lock(&rdtgroup_mutex);

rdtgrp = filp->private_data;
@@ -1031,7 +1066,6 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
return -EINVAL;
}

- physical = __pa(plr->kmem) >> PAGE_SHIFT;
psize = plr->size - off;

if (off > plr->size) {
@@ -1043,7 +1077,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* Ensure changes are carried directly to the memory being mapped,
* do not allow copy-on-write mapping.
*/
- if (!(vma->vm_flags & VM_SHARED)) {
+ if (!(desc->vm_flags & VM_SHARED)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
@@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
return -ENOSPC;
}

- memset(plr->kmem + off, 0, vsize);
+ /* No CoW allowed, so no need to specify the PFN. */
+ remap_pfn_range_prepare(desc, 0);
+ desc->vm_ops = &pseudo_mmap_ops;

- if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
- vsize, vma->vm_page_prot)) {
- mutex_unlock(&rdtgroup_mutex);
- return -EAGAIN;
- }
- vma->vm_ops = &pseudo_mmap_ops;
- mutex_unlock(&rdtgroup_mutex);
+ /* mutex will be released in mmap_complete or mmap_abort. */
return 0;
}

@@ -1071,7 +1101,9 @@ static const struct file_operations pseudo_lock_dev_fops = {
.write = NULL,
.open = pseudo_lock_dev_open,
.release = pseudo_lock_dev_release,
- .mmap = pseudo_lock_dev_mmap,
+ .mmap_prepare = pseudo_lock_dev_mmap_prepare,
+ .mmap_complete = pseudo_lock_dev_mmap_complete,
+ .mmap_abort = pseudo_lock_dev_mmap_abort,
};

int rdt_pseudo_lock_init(void)
--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:09 AM
Update cramfs to use the f_op->mmap_prepare and f_op->mmap_complete hooks
rather than the deprecated f_op->mmap hook.

We thread state through mmap_context, allowing for both PFN-mapped and
mixed-mapped pre-population.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/cramfs/inode.c | 134 +++++++++++++++++++++++++++++++---------------
1 file changed, 92 insertions(+), 42 deletions(-)

diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index b002e9b734f9..11a11213304d 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -59,6 +59,12 @@ static const struct address_space_operations cramfs_aops;

static DEFINE_MUTEX(read_mutex);

+/* How should the mapping be completed? */
+enum cramfs_mmap_state {
+ NO_PREPOPULATE,
+ PREPOPULATE_PFNMAP,
+ PREPOPULATE_MIXEDMAP,
+};

/* These macros may change in future, to provide better st_ino semantics. */
#define OFFSET(x) ((x)->i_ino)
@@ -342,34 +348,89 @@ static bool cramfs_last_page_is_shared(struct inode *inode)
return memchr_inv(tail_data, 0, PAGE_SIZE - partial) ? true : false;
}

-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_complete(struct file *file, struct vm_area_struct *vma,
+ const void *context)
{
struct inode *inode = file_inode(file);
struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
- unsigned int pages, max_pages, offset;
unsigned long address, pgoff = vma->vm_pgoff;
- char *bailout_reason;
- int ret;
+ unsigned int pages, offset;
+ enum cramfs_mmap_state mmap_state = (enum cramfs_mmap_state)context;
+ int ret = 0;

- ret = generic_file_readonly_mmap(file, vma);
- if (ret)
- return ret;
+ if (mmap_state == NO_PREPOPULATE)
+ return 0;
+
+ offset = cramfs_get_block_range(inode, pgoff, &pages);
+ address = sbi->linear_phys_addr + offset;

/*
* Now try to pre-populate ptes for this vma with a direct
* mapping avoiding memory allocation when possible.
*/

+ if (mmap_state == PREPOPULATE_PFNMAP) {
+ /*
+ * The entire vma is mappable. remap_pfn_range() will
+ * make it distinguishable from a non-direct mapping
+ * in /proc/<pid>/maps by substituting the file offset
+ * with the actual physical address.
+ */
+ ret = remap_pfn_range_complete(vma, vma->vm_start, address >> PAGE_SHIFT,
+ pages * PAGE_SIZE, vma->vm_page_prot);
+ } else {
+ /*
+ * Let's create a mixed map if we can't map it all.
+ * The normal paging machinery will take care of the
+ * unpopulated ptes via cramfs_read_folio().
+ */
+ int i;
+
+ for (i = 0; i < pages && !ret; i++) {
+ vm_fault_t vmf;
+ unsigned long off = i * PAGE_SIZE;
+
+ vmf = vmf_insert_mixed(vma, vma->vm_start + off,
+ address + off);
+ if (vmf & VM_FAULT_ERROR)
+ ret = vm_fault_to_errno(vmf, 0);
+ }
+ }
+
+ if (!ret)
+ pr_debug("mapped %pD[%lu] at 0x%08lx (%u/%lu pages) "
+ "to vma 0x%08lx, page_prot 0x%llx\n", file,
+ pgoff, address, pages, vma_pages(vma), vma->vm_start,
+ (unsigned long long)pgprot_val(vma->vm_page_prot));
+ return ret;
+}
+
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
+{
+ struct file *file = desc->file;
+ struct inode *inode = file_inode(file);
+ struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+ unsigned int pages, max_pages, offset, mapped_pages;
+ unsigned long address, pgoff = desc->pgoff;
+ enum cramfs_mmap_state mmap_state;
+ char *bailout_reason;
+ int ret;
+
+ ret = generic_file_readonly_mmap_prepare(desc);
+ if (ret)
+ return ret;
+
/* Could COW work here? */
bailout_reason = "vma is writable";
- if (vma->vm_flags & VM_WRITE)
+ if (desc->vm_flags & VM_WRITE)
goto bailout;

max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
bailout_reason = "beyond file limit";
if (pgoff >= max_pages)
goto bailout;
- pages = min(vma_pages(vma), max_pages - pgoff);
+ mapped_pages = vma_desc_pages(desc);
+ pages = min(mapped_pages, max_pages - pgoff);

offset = cramfs_get_block_range(inode, pgoff, &pages);
bailout_reason = "unsuitable block layout";
@@ -391,41 +452,23 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
goto bailout;
}

- if (pages == vma_pages(vma)) {
- /*
- * The entire vma is mappable. remap_pfn_range() will
- * make it distinguishable from a non-direct mapping
- * in /proc/<pid>/maps by substituting the file offset
- * with the actual physical address.
- */
- ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
- pages * PAGE_SIZE, vma->vm_page_prot);
+ if (mapped_pages == pages)
+ mmap_state = PREPOPULATE_PFNMAP;
+ else
+ mmap_state = PREPOPULATE_MIXEDMAP;
+ desc->mmap_context = (void *)mmap_state;
+
+ if (mmap_state == PREPOPULATE_PFNMAP) {
+ /* No CoW allowed, so no need to provide PFN. */
+ remap_pfn_range_prepare(desc, 0);
} else {
- /*
- * Let's create a mixed map if we can't map it all.
- * The normal paging machinery will take care of the
- * unpopulated ptes via cramfs_read_folio().
- */
- int i;
- vm_flags_set(vma, VM_MIXEDMAP);
- for (i = 0; i < pages && !ret; i++) {
- vm_fault_t vmf;
- unsigned long off = i * PAGE_SIZE;
- vmf = vmf_insert_mixed(vma, vma->vm_start + off,
- address + off);
- if (vmf & VM_FAULT_ERROR)
- ret = vm_fault_to_errno(vmf, 0);
- }
+ desc->vm_flags |= VM_MIXEDMAP;
}

- if (!ret)
- pr_debug("mapped %pD[%lu] at 0x%08lx (%u/%lu pages) "
- "to vma 0x%08lx, page_prot 0x%llx\n", file,
- pgoff, address, pages, vma_pages(vma), vma->vm_start,
- (unsigned long long)pgprot_val(vma->vm_page_prot));
- return ret;
+ return 0;

bailout:
+ desc->mmap_context = (void *)NO_PREPOPULATE;
pr_debug("%pD[%lu]: direct mmap impossible: %s\n",
file, pgoff, bailout_reason);
/* Didn't manage any direct map, but normal paging is still possible */
@@ -434,9 +477,15 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)

#else /* CONFIG_MMU */

-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
{
- return is_nommu_shared_mapping(vma->vm_flags) ? 0 : -ENOSYS;
+ return is_nommu_shared_mapping(desc->vm_flags) ? 0 : -ENOSYS;
+}
+
+static int cramfs_physmem_mmap_complete(struct file *file,
+ struct vm_area_struct *vma, const void *context)
+{
+ return 0;
}

static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
@@ -474,7 +523,8 @@ static const struct file_operations cramfs_physmem_fops = {
.llseek = generic_file_llseek,
.read_iter = generic_file_read_iter,
.splice_read = filemap_splice_read,
- .mmap = cramfs_physmem_mmap,
+ .mmap_prepare = cramfs_physmem_mmap_prepare,
+ .mmap_complete = cramfs_physmem_mmap_complete,
#ifndef CONFIG_MMU
.get_unmapped_area = cramfs_physmem_get_unmapped_area,
.mmap_capabilities = cramfs_physmem_mmap_capabilities,
--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:14 AM
Now that we are able to use the mmap_prepare and mmap_complete callbacks for
procfs implementations, update the vmcore implementation accordingly.

As part of this change, we must also update remap_vmalloc_range_partial()
to optionally not update VMA flags. Other than the remap_vmalloc_range()
wrapper, vmcore is the only user of this function so we can simply go ahead
and add a parameter.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
arch/s390/kernel/crash_dump.c | 6 ++--
fs/proc/vmcore.c | 53 +++++++++++++++++++++++++----------
include/linux/vmalloc.h | 10 +++----
mm/vmalloc.c | 16 +++++++++--
4 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index d4839de8ce9d..44d7902f7e41 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -186,7 +186,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,

if (pfn < oldmem_data.size >> PAGE_SHIFT) {
size_old = min(size, oldmem_data.size - (pfn << PAGE_SHIFT));
- rc = remap_pfn_range(vma, from,
+ rc = remap_pfn_range_complete(vma, from,
pfn + (oldmem_data.start >> PAGE_SHIFT),
size_old, prot);
if (rc || size == size_old)
@@ -195,7 +195,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,
from += size_old;
pfn += size_old >> PAGE_SHIFT;
}
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
@@ -220,7 +220,7 @@ static int remap_oldmem_pfn_range_zfcpdump(struct vm_area_struct *vma,
from += size_hsa;
pfn += size_hsa >> PAGE_SHIFT;
}
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index f188bd900eb2..5e4e19c38d5e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -254,7 +254,7 @@ int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
unsigned long size, pgprot_t prot)
{
prot = pgprot_encrypted(prot);
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
@@ -308,7 +308,7 @@ static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
tsz = min(offset + (u64)dump->size - start, (u64)size);
buf = dump->buf + start - offset;
if (remap_vmalloc_range_partial(vma, dst, buf, 0,
- tsz))
+ tsz, /* set_vma= */false))
return -EFAULT;

size -= tsz;
@@ -588,24 +588,40 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
return ret;
}

-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
{
- size_t size = vma->vm_end - vma->vm_start;
- u64 start, end, len, tsz;
- struct vmcore_range *m;
+ size_t size = vma_desc_size(desc);
+ u64 start, end;

- start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+ start = (u64)desc->pgoff << PAGE_SHIFT;
end = start + size;

if (size > vmcore_size || end > vmcore_size)
return -EINVAL;

- if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+ if (desc->vm_flags & (VM_WRITE | VM_EXEC))
return -EPERM;

- vm_flags_mod(vma, VM_MIXEDMAP, VM_MAYWRITE | VM_MAYEXEC);
- vma->vm_ops = &vmcore_mmap_ops;
+ desc->vm_flags |= VM_MIXEDMAP | VM_REMAP_FLAGS;
+ desc->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+ desc->vm_ops = &vmcore_mmap_ops;
+
+ /*
+ * No need for remap_pfn_range_prepare() as we ensure non-CoW by
+ * clearing VM_MAYWRITE.
+ */
+
+ return 0;
+}
+
+static int mmap_complete_vmcore(struct file *file, struct vm_area_struct *vma,
+ const void *context)
+{
+ size_t size = vma->vm_end - vma->vm_start;
+ u64 start, len, tsz;
+ struct vmcore_range *m;

+ start = (u64)vma->vm_pgoff << PAGE_SHIFT;
len = 0;

if (start < elfcorebuf_sz) {
@@ -613,8 +629,8 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)

tsz = min(elfcorebuf_sz - (size_t)start, size);
pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
- if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
- vma->vm_page_prot))
+ if (remap_pfn_range_complete(vma, vma->vm_start, pfn, tsz,
+ vma->vm_page_prot))
return -EAGAIN;
size -= tsz;
start += tsz;
@@ -664,7 +680,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
- kaddr, 0, tsz))
+ kaddr, 0, tsz, /* set_vma =*/false))
goto fail;

size -= tsz;
@@ -701,7 +717,13 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
return -EAGAIN;
}
#else
-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
+{
+ return -ENOSYS;
+}
+
+static int mmap_complete_vmcore(struct file *file, struct vm_area_struct *vma,
+ const void *context)
{
return -ENOSYS;
}
@@ -712,7 +734,8 @@ static const struct proc_ops vmcore_proc_ops = {
.proc_release = release_vmcore,
.proc_read_iter = read_vmcore,
.proc_lseek = default_llseek,
- .proc_mmap = mmap_vmcore,
+ .proc_mmap_prepare = mmap_prepare_vmcore,
+ .proc_mmap_complete = mmap_complete_vmcore,
};

static u64 get_vmcore_size(size_t elfsz, size_t elfnotesegsz,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index eb54b7b3202f..588810e571aa 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -215,12 +215,12 @@ extern void *vmap(struct page **pages, unsigned int count,
void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot);
extern void vunmap(const void *addr);

-extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
- unsigned long uaddr, void *kaddr,
- unsigned long pgoff, unsigned long size);
+int remap_vmalloc_range_partial(struct vm_area_struct *vma,
+ unsigned long uaddr, void *kaddr, unsigned long pgoff,
+ unsigned long size, bool set_vma);

-extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
- unsigned long pgoff);
+int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
+ unsigned long pgoff);

int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
struct page **pages, unsigned int page_shift);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4249e1e01947..877b557b2482 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4528,6 +4528,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
* @kaddr: virtual address of vmalloc kernel memory
* @pgoff: offset from @kaddr to start at
* @size: size of map area
+ * @set_vma: If true, update VMA flags
*
* Returns: 0 for success, -Exxx on failure
*
@@ -4540,7 +4541,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
*/
int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
void *kaddr, unsigned long pgoff,
- unsigned long size)
+ unsigned long size, bool set_vma)
{
struct vm_struct *area;
unsigned long off;
@@ -4566,6 +4567,10 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
return -EINVAL;
kaddr += off;

+ /* If we shouldn't modify VMA flags, vm_insert_page() mustn't. */
+ if (!set_vma && !(vma->vm_flags & VM_MIXEDMAP))
+ return -EINVAL;
+
do {
struct page *page = vmalloc_to_page(kaddr);
int ret;
@@ -4579,7 +4584,11 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
size -= PAGE_SIZE;
} while (size > 0);

- vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+ if (set_vma)
+ vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+ else
+ VM_WARN_ON_ONCE((vma->vm_flags & (VM_DONTEXPAND | VM_DONTDUMP)) !=
+ (VM_DONTEXPAND | VM_DONTDUMP));

return 0;
}
@@ -4603,7 +4612,8 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
{
return remap_vmalloc_range_partial(vma, vma->vm_start,
addr, pgoff,
- vma->vm_end - vma->vm_start);
+ vma->vm_end - vma->vm_start,
+ /* set_vma= */ true);
}
EXPORT_SYMBOL(remap_vmalloc_range);

--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:17 AM
By adding these hooks we enable procfs implementations to use the
.mmap_prepare and .mmap_complete hooks rather than the deprecated .mmap
hook.

We treat this as if it were any other nested mmap hook and utilise the
.mmap_prepare compatibility layer if necessary.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/proc/inode.c | 13 ++++++++++---
include/linux/proc_fs.h | 5 +++++
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 129490151be1..d031267e2e4a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -414,9 +414,16 @@ static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned

static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma)
{
- __auto_type mmap = pde->proc_ops->proc_mmap;
- if (mmap)
- return mmap(file, vma);
+ const struct file_operations f_op = {
+ .mmap = pde->proc_ops->proc_mmap,
+ .mmap_prepare = pde->proc_ops->proc_mmap_prepare,
+ .mmap_complete = pde->proc_ops->proc_mmap_complete,
+ };
+
+ if (f_op.mmap)
+ return f_op.mmap(file, vma);
+ else if (f_op.mmap_prepare)
+ return __compat_vma_mmap_prepare(&f_op, file, vma);
return -EIO;
}

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index f139377f4b31..3573192f813d 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -47,6 +47,11 @@ struct proc_ops {
long (*proc_compat_ioctl)(struct file *, unsigned int, unsigned long);
#endif
int (*proc_mmap)(struct file *, struct vm_area_struct *);
+ int (*proc_mmap_prepare)(struct vm_area_desc *);
+ int (*proc_mmap_complete)(struct file *, struct vm_area_struct *,
+ const void *context);
+ void (*proc_mmap_abort)(const struct file *, const void *vm_private_data,
+ const void *context);
unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
} __randomize_layout;

--
2.51.0

Lorenzo Stoakes

Sep 8, 2025, 7:12:18 AM
Now that we have the ability to set up the VMA in f_op->mmap_prepare and
then later, once the VMA is established, insert a mixed mapping in
f_op->mmap_complete, do so for kcov.

We utilise the desc->mmap_context field to pass context between mmap_prepare
and mmap_complete, conveniently providing the size over which the mapping is
performed.

Also note that we intentionally set VM_MIXEDMAP ahead of time so that, upon
mmap_complete being invoked, vm_insert_page() does not adjust VMA flags.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
kernel/kcov.c | 40 ++++++++++++++++++++++++++++------------
1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/kcov.c b/kernel/kcov.c
index 1d85597057e1..53c8bcae54d0 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -484,23 +484,40 @@ void kcov_task_exit(struct task_struct *t)
kcov_put(kcov);
}

-static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
+static int kcov_mmap_prepare(struct vm_area_desc *desc)
{
- int res = 0;
- struct kcov *kcov = vma->vm_file->private_data;
- unsigned long size, off;
- struct page *page;
+ struct kcov *kcov = desc->file->private_data;
+ unsigned long size;
unsigned long flags;
+ int res = 0;

spin_lock_irqsave(&kcov->lock, flags);
size = kcov->size * sizeof(unsigned long);
- if (kcov->area == NULL || vma->vm_pgoff != 0 ||
- vma->vm_end - vma->vm_start != size) {
+ if (kcov->area == NULL || desc->pgoff != 0 ||
+ vma_desc_size(desc) != size) {
res = -EINVAL;
goto exit;
}
spin_unlock_irqrestore(&kcov->lock, flags);
- vm_flags_set(vma, VM_DONTEXPAND);
+
+ desc->vm_flags |= VM_DONTEXPAND | VM_MIXEDMAP;
+ desc->mmap_context = (void *)size;
+
+ return 0;
+exit:
+ spin_unlock_irqrestore(&kcov->lock, flags);
+ return res;
+}
+
+static int kcov_mmap_complete(struct file *file, struct vm_area_struct *vma,
+ const void *context)
+{
+ struct kcov *kcov = file->private_data;
+ unsigned long size = (unsigned long)context;
+ struct page *page;
+ unsigned long off;
+ int res;
+
for (off = 0; off < size; off += PAGE_SIZE) {
page = vmalloc_to_page(kcov->area + off);
res = vm_insert_page(vma, vma->vm_start + off, page);
@@ -509,10 +526,8 @@ static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
return res;
}
}
+
return 0;
-exit:
- spin_unlock_irqrestore(&kcov->lock, flags);
- return res;
}

static int kcov_open(struct inode *inode, struct file *filep)
@@ -761,7 +776,8 @@ static const struct file_operations kcov_fops = {
.open = kcov_open,
.unlocked_ioctl = kcov_ioctl,
.compat_ioctl = kcov_ioctl,
- .mmap = kcov_mmap,
+ .mmap_prepare = kcov_mmap_prepare,
+ .mmap_complete = kcov_mmap_complete,
.release = kcov_close,
};

--
2.51.0

Jason Gunthorpe

Sep 8, 2025, 8:51:12 AM
On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> {
> - const unsigned long len = desc->end - desc->start;
> + const unsigned long len = vma_desc_size(desc);
>
> if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> return -EINVAL;

I wonder if we should have some helper for this shared check too; it
is a bit tricky with the two flags. Forced-shared checks are pretty
common.

vma_desc_must_be_shared(desc) ?

Also 'must not be exec' is common too.

Jason

Jason Gunthorpe

Sep 8, 2025, 8:55:35 AM
On Mon, Sep 08, 2025 at 12:10:37PM +0100, Lorenzo Stoakes wrote:
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
>
> In order to supply this, and to be able to move forward with removing
> f_op->mmap altogether, introduce f_op->mmap_complete.
>
> This hook is called once the VMA is fully mapped and everything is done,
> however with the mmap write lock and VMA write locks held.
>
> The hook is then provided with a fully initialised VMA which it can do what
> it needs with, though the mmap and VMA write locks must remain held
> throughout.
>
> It is not intended that the VMA be modified at this point, attempts to do
> so will end in tears.

The commit message should call out if this has fixed the race
condition with unmap mapping range and prepopulation in mmap().

> @@ -793,6 +793,11 @@ struct vm_area_desc {
> /* Write-only fields. */
> const struct vm_operations_struct *vm_ops;
> void *private_data;
> + /*
> + * A user-defined field, value will be passed to mmap_complete,
> + * mmap_abort.
> + */
> + void *mmap_context;

Seems strange, private_data and mmap_context? Something actually needs
both?

Jason

Jason Gunthorpe

Sep 8, 2025, 9:00:26 AM
On Mon, Sep 08, 2025 at 12:10:39PM +0100, Lorenzo Stoakes wrote:
> remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> it must be supplied with a correct PFN to do so. If the caller must hold
> locks to be able to do this, those locks should be held across the
> operation, and mmap_abort() should be provided to revoke the lock should an
> error arise.

It seems very strange to me that callers have to provide locks.

Today once mmap is called the vma priv should be allocated and access
to the PFN is allowed - access doesn't stop until the priv is
destroyed.

So whatever refcounting the driver must do to protect PFN must already
be in place and driven by the vma priv.

When split I'd expect the same thing the prepare should obtain the vma
priv and that locks the pfn. On complete the already affiliated PFN is
mapped to PTEs.

Why would any driver need a lock held to complete?

Arguably we should store the remap pfn in the desc and just make
complete a fully generic helper that fills the PTEs from the prepared
desc.

Jason

Jason Gunthorpe

Sep 8, 2025, 9:11:30 AM
On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> vm_flags |= VM_NORESERVE;
>
> if (hugetlb_reserve_pages(inode,
> - vma->vm_pgoff >> huge_page_order(h),
> - len >> huge_page_shift(h), vma,
> - vm_flags) < 0)
> + vma->vm_pgoff >> huge_page_order(h),
> + len >> huge_page_shift(h), vma,
> + vm_flags) < 0) {

It was split like this because vma is passed here right?

But hugetlb_reserve_pages() doesn't do much with the vma:

hugetlb_vma_lock_alloc(vma);
[..]
vma->vm_private_data = vma_lock;

Manipulates the private which should already exist in prepare:

Check non-share a few times:

if (!vma || vma->vm_flags & VM_MAYSHARE) {
if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
if (!vma || vma->vm_flags & VM_MAYSHARE) {

And does this resv_map stuff:

set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
[..]
set_vma_private_data(vma, (unsigned long)map);

Which is also just manipulating the private data.

So it looks to me like it should be refactored so that
hugetlb_reserve_pages() returns the priv pointer to set in the VMA
instead of accepting vma as an argument. Maybe just pass in the desc
instead?

Then no need to introduce complete. I think it is probably better to
try to avoid using complete except for filling PTEs.

Jason

Lorenzo Stoakes

Sep 8, 2025, 9:12:10 AM
On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > {
> > - const unsigned long len = desc->end - desc->start;
> > + const unsigned long len = vma_desc_size(desc);
> >
> > if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > return -EINVAL;
>
> I wonder if we should have some helper for this shared check too, it
> is a bit tricky with the two flags. Forced-shared checks are pretty
> common.

Sure can add.

>
> vma_desc_must_be_shared(desc) ?

Maybe _could_be_shared()?

>
> Also 'must not be exec' is common too.

Right, will have a look! :)
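
Roughly what I have in mind is something like the below - a sketch only, the
names (vma_desc_could_be_shared(), vma_desc_not_exec()) purely illustrative
and not final, assuming the vm_area_desc introduced by this series:

static inline bool vma_desc_could_be_shared(const struct vm_area_desc *desc)
{
        return (desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) != 0;
}

static inline bool vma_desc_not_exec(const struct vm_area_desc *desc)
{
        return !(desc->vm_flags & VM_EXEC);
}

Then the secretmem check above would become something like:

        if (!vma_desc_could_be_shared(desc))
                return -EINVAL;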

>
> Jason

Lorenzo Stoakes

Sep 8, 2025, 9:19:25 AM
On Mon, Sep 08, 2025 at 09:55:26AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:37PM +0100, Lorenzo Stoakes wrote:
> > We have introduced the f_op->mmap_prepare hook to allow for setting up a
> > VMA far earlier in the process of mapping memory, reducing problematic
> > error handling paths, but this does not provide what all
> > drivers/filesystems need.
> >
> > In order to supply this, and to be able to move forward with removing
> > f_op->mmap altogether, introduce f_op->mmap_complete.
> >
> > This hook is called once the VMA is fully mapped and everything is done,
> > however with the mmap write lock and VMA write locks held.
> >
> > The hook is then provided with a fully initialised VMA which it can do what
> > it needs with, though the mmap and VMA write locks must remain held
> > throughout.
> >
> > It is not intended that the VMA be modified at this point, attempts to do
> > so will end in tears.
>
> The commit message should call out if this has fixed the race
> condition with unmap mapping range and prepopulation in mmap()..

To be clear, this isn't the intent of the series - the intent is to make it
possible for mmap_prepare to replace mmap. This is just a bonus :)

Looking at the discussion in [0] it seems the issue was that .mmap() is
called before the vma is actually correctly inserted into the maple tree.

This is no longer the case, we call .mmap_complete() once the VMA is fully
established, but before releasing the VMA/mmap write lock.

This should, presumably, resolve the race as stated?

I can add some blurb about this yes.


[0]:https://lore.kernel.org/linux-mm/20250801162...@nvidia.com/


>
> > @@ -793,6 +793,11 @@ struct vm_area_desc {
> > /* Write-only fields. */
> > const struct vm_operations_struct *vm_ops;
> > void *private_data;
> > + /*
> > + * A user-defined field, value will be passed to mmap_complete,
> > + * mmap_abort.
> > + */
> > + void *mmap_context;
>
> Seem strange, private_data and mmap_context? Something actually needs
> both?

We are now doing something _new_ - we're splitting an operation that was
never split before.

Before a hook implementor could rely on there being state throughout the
_entire_ operation. But now they can't.

And they may already be putting context into private_data, which then gets
put into vma->vm_private_data for a VMA added to the maple tree and made
accessible.

So it is appropriate and convenient to allow for the transfer of state
between the two, and I already implement logic that does this.
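
e.g. (sketch only - the hook prototypes are approximated and the foo_*()
bits are entirely made up):

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	struct foo_state *state = foo_setup(desc);

	if (IS_ERR(state))
		return PTR_ERR(state);

	desc->private_data = state->priv;	/* becomes vma->vm_private_data */
	desc->mmap_context = state;		/* handed to mmap_complete/mmap_abort */
	return 0;
}

static int foo_mmap_complete(struct file *file, struct vm_area_struct *vma,
			     const void *context)
{
	const struct foo_state *state = context;

	return foo_prepopulate(vma, state);
}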

>
> Jason

Cheers, Lorenzo

Jason Gunthorpe

unread,
Sep 8, 2025, 9:24:54 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> resctl uses remap_pfn_range(), but holds a mutex over the
> operation. Therefore, establish the mutex in mmap_prepare(), release it in
> mmap_complete() and release it in mmap_abort() should the operation fail.

The mutex can't do anything relative to remap_pfn, no reason to hold it.

> @@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> return -ENOSPC;
> }
>
> - memset(plr->kmem + off, 0, vsize);
> + /* No CoW allowed so don't need to specify pfn. */
> + remap_pfn_range_prepare(desc, 0);

This would be a good place to make a more generic helper..

ret = remap_pfn_no_cow(desc, phys);

And it can consistently check for !shared internally.

Store phys in the desc and use common code to trigger the PTE population
during complete.
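
Sketch of what I mean (the helper name and stashing the PFN in mmap_context
are made up here):

static inline int remap_pfn_no_cow(struct vm_area_desc *desc, unsigned long pfn)
{
	if (is_cow_mapping(desc->vm_flags))
		return -EINVAL;

	/* stash the PFN so a generic complete helper can fill the PTEs */
	desc->mmap_context = (void *)pfn;
	remap_pfn_range_prepare(desc, pfn);
	return 0;
}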

Jason

Lorenzo Stoakes

unread,
Sep 8, 2025, 9:27:24 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 10:00:15AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:39PM +0100, Lorenzo Stoakes wrote:
> > remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> > it must be supplied with a correct PFN to do so. If the caller must hold
> > locks to be able to do this, those locks should be held across the
> > operation, and mmap_abort() should be provided to revoke the lock should an
> > error arise.
>
> It seems very strange to me that callers have to provide locks.
>
> Today once mmap is called the vma priv should be allocated and access
> to the PFN is allowed - access doesn't stop until the priv is
> destroyed.
>
> So whatever refcounting the driver must do to protect PFN must already
> be in place and driven by the vma priv.
>
> When split I'd expect the same thing the prepare should obtain the vma
> priv and that locks the pfn. On complete the already affiliated PFN is
> mapped to PTEs.
>
> Why would any driver need a lock held to complete?


In general, again we're splitting an operation that didn't used to be split.

A hook implementor may need to hold the lock in order to stabilise whatever
is required to be stabilised across the two (of course, with careful
consideration of the fact we're doing stuff between the two!)

It's not only remap that is a concern here, people do all kinds of weird
and wonderful things in .mmap(), sometimes in combination with remap.

This is what makes this so fun to try to change ;)

An implementor may also update state somehow which would need to be altered
should the operation fail, again something that would not have needed to be
considered previously, as it was all done in one.

>
> Arguably we should store the remap pfn in the desc and just make
> complete a fully generic helper that fills the PTEs from the prepared
> desc.

That's an interesting take actually.

Though I don't think we can _always_ do that, as drivers again do weird and
wonderful things and we need to have maximum flexibility here.

But we could have a generic function that could speed some things up here,
and have that assume desc->mmap_context contains the PFN.

You can see patch 12/16 for an example of mmap_abort in action.

I also wonder if we should add remap_pfn_range_prepare_nocow() - which can
assert !is_cow_mapping(desc->vm_flags) - and then that self-documents the
cases where we don't actually need the PFN on prepare (this is only for the
hideous vm_pgoff hack for arches without a special page table flag).
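
i.e. roughly (sketch):

static inline void remap_pfn_range_prepare_nocow(struct vm_area_desc *desc)
{
	/* No CoW possible, so no PFN needs stashing in vm_pgoff. */
	VM_WARN_ON_ONCE(is_cow_mapping(desc->vm_flags));
	remap_pfn_range_prepare(desc, 0);
}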

>
> Jason

Cheers, Lorenzo

Jason Gunthorpe

unread,
Sep 8, 2025, 9:27:29 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
It would be nicer to have different ops than this, the normal op could
just call the generic helper and then there is only the mixed map op.

Makes me wonder if putting the op in the fops was right, a
mixed/non-mixed vm_ops would do this nicely.

Jason

Jan Kara

unread,
Sep 8, 2025, 9:28:01 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Hi Lorenzo!

On Mon 08-09-25 12:10:31, Lorenzo Stoakes wrote:
> Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
> callback"), The f_op->mmap hook has been deprecated in favour of
> f_op->mmap_prepare.
>
> This was introduced in order to make it possible for us to eventually
> eliminate the f_op->mmap hook which is highly problematic as it allows
> drivers and filesystems raw access to a VMA which is not yet correctly
> initialised.
>
> This hook also introduces complexity for the memory mapping operation, as
> we must correctly unwind what we do should an error arises.
>
> Overall this interface being so open has caused significant problems for
> us, including security issues, it is important for us to simply eliminate
> this as a source of problems.
>
> Therefore this series continues what was established by extending the
> functionality further to permit more drivers and filesystems to use
> mmap_prepare.
>
> After updating some areas that can simply use mmap_prepare as-is, and
> performing some housekeeping, we then introduce two new hooks:
>
> f_op->mmap_complete - this is invoked at the point of the VMA having been
> correctly inserted, though with the VMA write lock still held. mmap_prepare
> must also be specified.
>
> This expands the use of mmap_prepare to those callers which need to
> prepopulate mappings, as well as any which does genuinely require access to
> the VMA.
>
> It's simple - we will let the caller access the VMA, but only once it's
> established. At this point unwinding issues is simple - we just unmap the
> VMA.
>
> The VMA is also then correctly initialised at this stage so there can be no
> issues arising from a not-fully initialised VMA at this point.
>
> The other newly added hook is:
>
> f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> mmap_complete. This is called should an error arise between mmap_prepare
> and mmap_complete (not as a result of mmap_prepare but rather some other
> part of the mapping logic).
>
> This is required in case mmap_prepare wishes to establish state or locks
> which need to be cleaned up on completion. If we did not provide this, then
> this could not be permitted as this cleanup would otherwise not occur
> should the mapping fail between the two calls.

So seeing these new hooks makes me wonder: shouldn't we rather implement
mmap(2) in a way more similar to how other f_op hooks like ->read or
->write behave? I.e., a hook called at a rather high level - something like
from vm_mmap_pgoff() or a similar level - which would just call library
functions from MM for the stuff it needs to do. Filesystems would just do
their checks and call the generic mmap function with the vm_ops they want
to use; more complex users could then fill in the VMA before releasing
mmap_lock or do cleanup in case of failure... This would seem like a more
understandable API than several hooks with rules about when what gets called.

Honza

>
> We then add split remap_pfn_range*() functions which allow for PFN remap (a
> typical mapping prepopulation operation) split between a prepare/complete
> step, as well as io_mremap_pfn_range_prepare, complete for a similar
> purpose.
>
> From there we update various mm-adjacent logic to use this functionality as
> a first set of changes, as well as resctl and cramfs filesystems to round
> off the non-stacked filesystem instances.
>
>
> REVIEWER NOTE:
> ~~~~~~~~~~~~~~
>
> I considered putting the complete, abort callbacks in vm_ops, however this
> won't work because then we would be unable to adjust helpers like
> generic_file_mmap_prepare() (which provides vm_ops) to provide the correct
> complete, abort callbacks.
>
> Conceptually it also makes more sense to have these in f_op as they are
> one-off operations performed at mmap time to establish the VMA, rather than
> a property of the VMA itself.
>
> Lorenzo Stoakes (16):
> mm/shmem: update shmem to use mmap_prepare
> device/dax: update devdax to use mmap_prepare
> mm: add vma_desc_size(), vma_desc_pages() helpers
> relay: update relay to use mmap_prepare
> mm/vma: rename mmap internal functions to avoid confusion
> mm: introduce the f_op->mmap_complete, mmap_abort hooks
> doc: update porting, vfs documentation for mmap_[complete, abort]
> mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
> mm: introduce io_remap_pfn_range_prepare, complete
> mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
> mm: update mem char driver to use mmap_prepare, mmap_complete
> mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
> mm: update cramfs to use mmap_prepare, mmap_complete
> fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
> fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
> kcov: update kcov to use mmap_prepare, mmap_complete
>
> Documentation/filesystems/porting.rst | 9 ++
> Documentation/filesystems/vfs.rst | 35 +++++++
> arch/csky/include/asm/pgtable.h | 5 +
> arch/mips/alchemy/common/setup.c | 28 +++++-
> arch/mips/include/asm/pgtable.h | 10 ++
> arch/s390/kernel/crash_dump.c | 6 +-
> arch/sparc/include/asm/pgtable_32.h | 29 +++++-
> arch/sparc/include/asm/pgtable_64.h | 29 +++++-
> drivers/char/mem.c | 80 ++++++++-------
> drivers/dax/device.c | 32 +++---
> fs/cramfs/inode.c | 134 ++++++++++++++++++--------
> fs/hugetlbfs/inode.c | 86 +++++++++--------
> fs/ntfs3/file.c | 2 +-
> fs/proc/inode.c | 13 ++-
> fs/proc/vmcore.c | 53 +++++++---
> fs/resctrl/pseudo_lock.c | 56 ++++++++---
> include/linux/fs.h | 4 +
> include/linux/mm.h | 53 +++++++++-
> include/linux/mm_types.h | 5 +
> include/linux/proc_fs.h | 5 +
> include/linux/shmem_fs.h | 3 +-
> include/linux/vmalloc.h | 10 +-
> kernel/kcov.c | 40 +++++---
> kernel/relay.c | 32 +++---
> mm/memory.c | 128 +++++++++++++++---------
> mm/secretmem.c | 2 +-
> mm/shmem.c | 49 +++++++---
> mm/util.c | 18 +++-
> mm/vma.c | 96 +++++++++++++++---
> mm/vmalloc.c | 16 ++-
> tools/testing/vma/vma_internal.h | 31 +++++-
> 31 files changed, 810 insertions(+), 289 deletions(-)
>
> --
> 2.51.0
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR

Jason Gunthorpe

unread,
Sep 8, 2025, 9:30:22 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 12:10:47PM +0100, Lorenzo Stoakes wrote:
> Now we have the capacity to set up the VMA in f_op->mmap_prepare and then
> later, once the VMA is established, insert a mixed mapping in
> f_op->mmap_complete, do so for kcov.
>
> We utilise the context desc->mmap_context field to pass context between
> mmap_prepare and mmap_complete to conveniently provide the size over which
> the mapping is performed.

Why?

+ vma_desc_size(desc) != size) {
+ res = -EINVAL;

Just call some vma_size()?

Jason

Jason Gunthorpe

unread,
Sep 8, 2025, 9:32:30 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 02:12:00PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > > static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > > {
> > > - const unsigned long len = desc->end - desc->start;
> > > + const unsigned long len = vma_desc_size(desc);
> > >
> > > if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > > return -EINVAL;
> >
> > I wonder if we should have some helper for this shared check too, it
> > is a bit tricky with the two flags. Forced-shared checks are pretty
> > common.
>
> Sure can add.
>
> >
> > vma_desc_must_be_shared(desc) ?
>
> Maybe _could_be_shared()?

It is not could, it is must.

Perhaps

!vma_desc_cowable()

Is what many drivers are really trying to assert.

Jason

Jason Gunthorpe

unread,
Sep 8, 2025, 9:35:44 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:

> It's not only remap that is a concern here, people do all kinds of weird
> and wonderful things in .mmap(), sometimes in combination with remap.

So it should really not be split this way; complete is a badly named
prepopulate and it should only fill the PTEs, which shouldn't need
more locking.

The only example in this series didn't actually need to hold the lock.

Jason

Lorenzo Stoakes

unread,
Sep 8, 2025, 9:37:54 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Well hugetlb_vma_lock_alloc() does:

vma_lock->vma = vma;

Which we cannot do in prepare.

This is checked in hugetlb_dup_vma_private(), and obviously desc is not a stable
pointer to be used for comparing anything.

I'm also trying to do the minimal changes I can here, I'd rather not majorly
refactor things to suit this change if possible.

>
> Then no need to introduce complete. I think it is probably better to
> try to avoid using complete except for filling PTEs..

I'd rather do that yes. hugetlbfs is the exception to many rules, unfortunately.

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 9:44:34 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Right, but I can't stop to refactor everything I change, or this effort will
take even longer.

I do have to compromise a _little_ on that as there's ~250 odd callsites to
go...

>
> Makes me wonder if putting the op in the fops was right, a
> mixed/non-mixed vm_ops would do this nicely.

I added a reviewers note just for you in 00/16 :) I guess you missed it:

REVIEWER NOTE:
~~~~~~~~~~~~~~

I considered putting the complete, abort callbacks in vm_ops,
however this won't work because then we would be unable to adjust
helpers like generic_file_mmap_prepare() (which provides vm_ops)
to provide the correct complete, abort callbacks.

Conceptually it also makes more sense to have these in f_op as they
are one-off operations performed at mmap time to establish the VMA,
rather than a property of the VMA itself.

Basically, existing generic code sets vm_ops to something already, now we'd
need to somehow also vary it on this as well or nest vm_ops? I don't think
it's workable.

I found this out because I started working on this series with the complete
callback as part of vm_ops then hit this stumbling block as a result.

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 9:47:36 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Ah yeah we can do, you're right - as we assert vma_desc_size() == size, will
fix that, thanks!

There is no vma_size() though, which is weird to me. There is vma_pages() <<
PAGE_SHIFT though...

Maybe one to add!
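
i.e. (trivial sketch, mirroring vma_pages()):

static inline unsigned long vma_size(const struct vm_area_struct *vma)
{
	return vma->vm_end - vma->vm_start;
}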

>
> Jason

Cheers, Lorenzo

Jason Gunthorpe

unread,
Sep 8, 2025, 9:52:48 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Okay, just doing that in commit would be appropriate then

> This is checked in hugetlb_dup_vma_private(), and obviously desc is not a stable
> pointer to be used for comparing anything.
>
> I'm also trying to do the minimal changes I can here, I'd rather not majorly
> refactor things to suit this change if possible.

It doesn't look like a big refactor: pass the vma desc into
hugetlb_reserve_pages(), lift the vma_lock set out.

Jason

Lorenzo Stoakes

unread,
Sep 8, 2025, 9:54:55 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 10:24:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> > resctl uses remap_pfn_range(), but holds a mutex over the
> > operation. Therefore, establish the mutex in mmap_prepare(), release it in
> > mmap_complete() and release it in mmap_abort() should the operation fail.
>
> The mutex can't do anything relative to remap_pfn, no reason to hold it.
>
> > @@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> > return -ENOSPC;
> > }
> >
> > - memset(plr->kmem + off, 0, vsize);
> > + /* No CoW allowed so don't need to specify pfn. */
> > + remap_pfn_range_prepare(desc, 0);
>
> This would be a good place to make a more generic helper..
>
> ret = remap_pfn_no_cow(desc, phys);

Ha, funny, I suggested a _no_cow() thing earlier :) seems we're agreed on that
then!

Presumably you mean remap_pfn_no_cow_prepare()?

>
> And it can consistently check for !shared internally.
>
> Store phys in the desc and use common code to trigger the PTE population
> during complete.

We can use mmap_context for this, I guess it's not a terrible idea to set .pfn
but I just don't want to add any confusion as to what doing that means in
the non-generic mmap_complete case.

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:09:55 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 10:32:24AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 02:12:00PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > > > static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > > > {
> > > > - const unsigned long len = desc->end - desc->start;
> > > > + const unsigned long len = vma_desc_size(desc);
> > > >
> > > > if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > > > return -EINVAL;
> > >
> > > I wonder if we should have some helper for this shared check too, it
> > > is a bit tricky with the two flags. Forced-shared checks are pretty
> > > common.
> >
> > Sure can add.
> >
> > >
> > > vma_desc_must_be_shared(desc) ?
> >
> > Maybe _could_be_shared()?
>
> It is not could, it is must.

I mean VM_MAYSHARE is a nonsense anyway, but _in theory_ VM_MAYSHARE &&
!VM_SHARED means we _could_ share it.

But in reality of course this isn't a real thing.

Perhaps vma_desc_is_shared() or something, I obviously don't want to get stuck
on semantics here :) [he says, while getting obviously stuck on semantics] :P

>
> Perhaps
>
> !vma_desc_cowable()
>
> Is what many drivers are really trying to assert.

Well no, because:

static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

Read-only means !CoW.
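
To spell out the combinations (per the check above):

/*
 * VM_SHARED | VM_MAYWRITE  -> shared, not CoW
 * VM_MAYWRITE only         -> private writable, CoW
 * neither                  -> private read-only (!VM_MAYWRITE), not CoW
 */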

Hey we've made a rod for our own backs! Again!

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:19:02 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
>
> > It's not only remap that is a concern here, people do all kinds of weird
> > and wonderful things in .mmap(), sometimes in combination with remap.
>
> So it should really not be split this way; complete is a badly named

I don't understand, you think we can avoid splitting this in two? If so, I
disagree.

We have two stages, _intentionally_ split so as to avoid the issues the
original mmap_prepare series was designed to address:

1. 'Hey, how do we configure this VMA we have _not yet set up_?'
2. 'OK it's set up, now do you want to do something else?'

I'm sorry but I'm not sure how we could otherwise do this.

Keep in mind re: point 1, we _need_ the VMA to be established enough to check
for merge etc.

Another key aim of this change was to eliminate the need for a merge re-check.

> prepopulate and it should only fill the PTEs, which shouldn't need
> more locking.
>
> The only example in this series didn't actually need to hold the lock.

There's ~250 more mmap callbacks to work through. Can you guarantee that:

- All 250 absolutely only need access to the VMAs to perform prepopulation of
  this nature?

- Absolutely none will set up state in the prepopulate step that might need
  to be unwound should an error arise?

Keeping in mind I must remain practical re: refactoring each caller.

I mean, let me go check what you say re: the resctl lock, if you're right I
could drop mmap_abort for now and add it later if needed.

But re: calling mmap_complete prepopulate, I don't really think that's sensible.

mmap_prepare is invoked at the point of preparation of the mapping, and
mmap_complete is invoked once that preparation is complete to allow further
actions.

I'm obviously open to naming suggestions, but I think it's safer to consistently
refer to where we are in the lifecycle rather than presuming what the caller
might do.

(I'd _prefer_ they always did just prepopulate, but I just don't think we
necessarily can).

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:20:06 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
OK, I'll take a look at refactoring this.

>
> Jason

Cheers, Lorenzo

Jason Gunthorpe

unread,
Sep 8, 2025, 10:20:22 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > Perhaps
> >
> > !vma_desc_cowable()
> >
> > Is what many drivers are really trying to assert.
>
> Well no, because:
>
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
>
> Read-only means !CoW.

What drivers want when they check SHARED is to prevent COW. It is COW
that causes problems for whatever the driver is doing, so calling the
helper 'cowable' and making the test actually correct for that is a good thing.

COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
something that is COW in future.

Drivers commonly do various things with VM_SHARED to establish !COW,
but if that isn't actually right then let's fix it to be clear and
correct.

Jason

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:28:30 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 10:24:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> > resctl uses remap_pfn_range(), but holds a mutex over the
> > operation. Therefore, establish the mutex in mmap_prepare(), release it in
> > mmap_complete() and release it in mmap_abort() should the operation fail.
>
> The mutex can't do anything relative to remap_pfn, no reason to hold it.

Sorry I missed this bit before...

Yeah I guess my concern was that the original code very intentionally holds the
mutex _over the remap operation_.

But I guess given we release the lock on failure this isn't necessary, and of
course obviously the lock has no bearing on the actual remap.

Will drop it and drop mmap_abort for now as it's not yet needed.

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:47:46 AMSep 8
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > Perhaps
> > >
> > > !vma_desc_cowable()
> > >
> > > Is what many drivers are really trying to assert.
> >
> > Well no, because:
> >
> > static inline bool is_cow_mapping(vm_flags_t flags)
> > {
> > return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > }
> >
> > Read-only means !CoW.
>
> What drivers want when they check SHARED is to prevent COW. It is COW
> that causes problems for whatever the driver is doing, so calling the
> helper cowable and making the test actually right for is a good thing.
>
> COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
> something that is COW in future.

But you can't do that if !VM_MAYWRITE.

I mean probably the driver's just wrong and should use is_cow_mapping() tbh.

>
> Drivers have commonly various things with VM_SHARED to establish !COW,
> but if that isn't actually right then lets fix it to be clear and
> correct.

I think we need to be cautious of scope here :) I don't want to accidentally
break things this way.

OK, I think a sensible way forward - how about I add desc_is_cowable() or
vma_desc_cowable() and only use it where I'm confident it's correct?

That way I can achieve both aims at once.
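
i.e. something like (sketch, just wrapping the existing is_cow_mapping()):

static inline bool vma_desc_cowable(const struct vm_area_desc *desc)
{
	return is_cow_mapping(desc->vm_flags);
}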

>
> Jason

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 8, 2025, 10:48:55 AMSep 8
to Jan Kara, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 03:27:52PM +0200, Jan Kara wrote:
> Hi Lorenzo!

Hey! :)
We can't just do everything at this level, because we need:

a. Information to actually know how to map the VMA before putting it in the
maple tree.
b. Once it's there, anything else we need to do (typically - prepopulate).

The crux of this change is to avoid the horrors of a VMA being passed
around before it is properly initialised, yet being accessible for
drivers to do 'whatever' with.

Ideally we'd have only one case, and for _nearly all_ filesystems this is
actually how it is.

But sadly some _do need_ to do extra work afterwards, most notably,
prepopulation.

Cheers, Lorenzo

David Hildenbrand

unread,
Sep 8, 2025, 10:59:55 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---

Reviewed-by: David Hildenbrand <da...@redhat.com>

--
Cheers

David / dhildenb

David Hildenbrand

unread,
Sep 8, 2025, 11:04:03 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> The devdax driver does nothing special in its f_op->mmap hook, so
> straightforwardly update it to use the mmap_prepare hook instead.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---
> drivers/dax/device.c | 32 +++++++++++++++++++++-----------
> 1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 2bb40a6060af..c2181439f925 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -13,8 +13,9 @@
> #include "dax-private.h"
> #include "bus.h"
>
> -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> - const char *func)
> +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> + unsigned long start, unsigned long end, struct file *file,
> + const char *func)

In general

Acked-by: David Hildenbrand <da...@redhat.com>

The only thing that bugs me is that __check_vma() does not check a vma.

Maybe something along the lines of

"check_vma_properties"

Not sure.

Jason Gunthorpe

unread,
Sep 8, 2025, 11:04:13 AMSep 8
to Lorenzo Stoakes, Jan Kara, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> But sadly some _do need_ to do extra work afterwards, most notably,
> prepopulation.

I think Jan is suggesting something more like

mmap_op()
{
struct vma_desc desc = {};

desc.[..] = x
desc.[..] = y
desc.[..] = z
vma = vma_alloc(desc);

ret = remap_pfn(vma)
if (ret) goto err_vma;

return vma_commit(vma);

err_vma:
vma_dealloc(vma);
return ERR_PTR(ret);
}

Jason

David Hildenbrand

unread,
Sep 8, 2025, 11:08:05 AMSep 8
to Lorenzo Stoakes, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
I'll note that the naming is bad.

Why?

Because the vma_desc is not cowable. The underlying mapping may be.

David Hildenbrand

unread,
Sep 8, 2025, 11:10:28 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> It's useful to be able to determine the size of a VMA descriptor range used
> on f_op->mmap_prepare, expressed both in bytes and pages, so add helpers
> for both and update code that could make use of it to do so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---
> fs/ntfs3/file.c | 2 +-
> include/linux/mm.h | 10 ++++++++++
> mm/secretmem.c | 2 +-
> 3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
> index c1ece707b195..86eb88f62714 100644
> --- a/fs/ntfs3/file.c
> +++ b/fs/ntfs3/file.c
> @@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)
>
> if (rw) {
> u64 to = min_t(loff_t, i_size_read(inode),
> - from + desc->end - desc->start);
> + from + vma_desc_size(desc));
>
> if (is_sparsed(ni)) {
> /* Allocate clusters for rw map. */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a6bfa46937a8..9d4508b20be3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3560,6 +3560,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
> return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> }
>
> +static inline unsigned long vma_desc_size(struct vm_area_desc *desc)
> +{
> + return desc->end - desc->start;
> +}
> +
> +static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
> +{
> + return vma_desc_size(desc) >> PAGE_SHIFT;
> +}

"const struct vm_area_desc *" in both cases?

> +
> /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
> static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> unsigned long vm_start, unsigned long vm_end)
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 60137305bc20..62066ddb1e9c 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)
>
> static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> {
> - const unsigned long len = desc->end - desc->start;
> + const unsigned long len = vma_desc_size(desc);
>
> if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> return -EINVAL;

We really want to forbid any private mappings here, independent of cow.

Maybe an is_private_mapping() helper

or a

vma_desc_is_private_mapping()

helper if we really need it
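
e.g. (sketch - name/polarity up for debate):

static inline bool vma_desc_is_private_mapping(const struct vm_area_desc *desc)
{
	return !(desc->vm_flags & VM_MAYSHARE);
}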

David Hildenbrand

unread,
Sep 8, 2025, 11:15:23 AMSep 8
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> It is relatively trivial to update this code to use the f_op->mmap_prepare
> hook in favour of the deprecated f_op->mmap hook, so do so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---

Reviewed-by: David Hildenbrand <da...@redhat.com>

Lorenzo Stoakes

unread,
Sep 8, 2025, 11:15:36 AMSep 8
to Jason Gunthorpe, Jan Kara, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Right, unfortunately the locking and the subtle issues around memory mapping
really preclude something like this I think. We really do need to keep control
over that.

And since partly the motivation here is 'drivers do insane things when given too
much freedom', I feel this would not improve that :)

If you look at do_mmap() -> mmap_region() -> __mmap_region() etc. you can see a
lot of that.

We also had a security issue arise as a result of incorrect error path handling;
I don't think letting a driver writer handle that is wise.

It's a nice idea, but I just think this stuff is too sensitive for that. And in
any case, it wouldn't likely be tractable to convert legacy code to this.

Cheers, Lorenzo

Jason Gunthorpe

Sep 8, 2025, 11:16:51 AM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 03:47:34PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > > Perhaps
> > > >
> > > > !vma_desc_cowable()
> > > >
> > > > Is what many drivers are really trying to assert.
> > >
> > > Well no, because:
> > >
> > > static inline bool is_cow_mapping(vm_flags_t flags)
> > > {
> > > return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > > }
> > >
> > > Read-only means !CoW.
> >
> > What drivers want when they check SHARED is to prevent COW. It is COW
> > that causes problems for whatever the driver is doing, so calling the
> > helper cowable and making the test actually right for is a good thing.
> >
> > COW of this VMA, and no possibilty to remap/mprotect/fork/etc it into
> > something that is COW in future.
>
> But you can't do that if !VM_MAYWRITE.

See this is my fear, the drivers are wrong and you are talking about
edge cases nobody actually knows about.

The need is that the created VMA, and its dups, never, ever become
COWable. This is what drivers actually want. We need to give them a
clear test for that.

Anything using remap and checking for SHARED almost certainly falls
into this category as COWing remapped memory is rare and weird.

> I mean probably the driver's just wrong and should use
> is_cow_mapping() tbh.

Maybe.

> I think we need to be cautious of scope here :) I don't want to
> accidentally break things this way.

IMHO it is worth doing; when you get into more driver places it is far
more obvious why VM_SHARED is being checked.

> OK I think a sensible way forward - How about I add desc_is_cowable() or
> vma_desc_cowable() and only set this if I'm confident it's correct?

I'm thinking of calling it vma_desc_never_cowable() as that makes the
purpose much, much clearer.

I think anyone just checking VM_SHARED should be changed over..
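
A minimal sketch of what that could boil down to, given the is_cow_mapping() definition quoted above (the name is the one proposed here; treat everything else as an assumption, not code from the series):

/*
 * True if neither this mapping nor any dup of it can ever become CoW.
 * VM_MAYWRITE cannot be gained later via mprotect() or fork(), so the
 * result is stable for the lifetime of the mapping.
 */
static inline bool vma_desc_never_cowable(const struct vm_area_desc *desc)
{
	return !is_cow_mapping(desc->vm_flags);
}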

Jason

David Hildenbrand

Sep 8, 2025, 11:19:28 AM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> Now we have the f_op->mmap_prepare() hook, having a static function called
> __mmap_prepare() that has nothing to do with it is confusing, so rename the
> function.
>
> Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
> provide a f_op->mmap_complete() callback.

Isn't prologue the opposite of epilogue? :)

I guess I would just have done a

__mmap_prepare -> __mmap_setup()

and left the __mmap_complete() as is.

David Hildenbrand

Sep 8, 2025, 11:24:32 AM
to Jason Gunthorpe, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
>
>> I think we need to be cautious of scope here :) I don't want to
>> accidentally break things this way.
>
> IMHO it is worth doing when you get into more driver places it is far
> more obvious why the VM_SHARED is being checked.
>
>> OK I think a sensible way forward - How about I add desc_is_cowable() or
>> vma_desc_cowable() and only set this if I'm confident it's correct?
>
> I'm thinking to call it vma_desc_never_cowable() as that is much much
> clear what the purpose is.

Secretmem wants no private mappings. So we should check exactly that,
not whether we might have a cow mapping.

David Hildenbrand

Sep 8, 2025, 11:27:46 AM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 13:10, Lorenzo Stoakes wrote:
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
>
> In order to supply this, and to be able to move forward with removing
> f_op->mmap altogether, introduce f_op->mmap_complete.
>
> This hook is called once the VMA is fully mapped and everything is done,
> however with the mmap write lock and VMA write locks held.
>
> The hook is then provided with a fully initialised VMA which it can do what
> it needs with, though the mmap and VMA write locks must remain held
> throughout.
>
> It is not intended that the VMA be modified at this point, attempts to do
> so will end in tears.
>
> This allows for operations such as pre-population typically via a remap, or
> really anything that requires access to the VMA once initialised.
>
> In addition, a caller may need to take a lock in mmap_prepare, when it is
> possible to modify the VMA, and release it on mmap_complete. In order to
> handle errors which may arise between the two operations, f_op->mmap_abort
> is provided.
>
> This hook should be used to drop any lock and clean up anything before the
> VMA mapping operation is aborted. After this point the VMA will not be
> added to any mapping and will not exist.
>
> We also add a new mmap_context field to the vm_area_desc type which can be
> used to pass information pertinent to any locks which are held or any state
> which is required for mmap_complete, abort to operate correctly.
>
> We also update the compatibility layer for nested filesystems which
> currently still only specify an f_op->mmap() handler so that it correctly
> invokes f_op->mmap_complete as necessary (note that no error can occur
> between mmap_prepare and mmap_complete so mmap_abort will never be called
> in this case).
>
> Also update the VMA tests to account for the changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---
> include/linux/fs.h | 4 ++
> include/linux/mm_types.h | 5 ++
> mm/util.c | 18 +++++--
> mm/vma.c | 82 ++++++++++++++++++++++++++++++--
> tools/testing/vma/vma_internal.h | 31 ++++++++++--
> 5 files changed, 129 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 594bd4d0521e..bb432924993a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2195,6 +2195,10 @@ struct file_operations {
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> int (*mmap_prepare)(struct vm_area_desc *);
> + int (*mmap_complete)(struct file *, struct vm_area_struct *,
> + const void *context);
> + void (*mmap_abort)(const struct file *, const void *vm_private_data,
> + const void *context);

Do we have a description somewhere of what these things do, when they are
called, and what a driver may be allowed to do with a VMA?

In particular, the mmap_complete() looks like another candidate for
letting a driver just go crazy on the vma? :)

Lorenzo Stoakes

Sep 8, 2025, 11:28:31 AM
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 04:59:46PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > This simply assigns the vm_ops so is easily updated - do so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> > ---
>
> Reviewed-by: David Hildenbrand <da...@redhat.com>

Cheers!

Lorenzo Stoakes

Sep 8, 2025, 11:28:57 AM
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 05:03:54PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > The devdax driver does nothing special in its f_op->mmap hook, so
> > straightforwardly update it to use the mmap_prepare hook instead.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> > ---
> > drivers/dax/device.c | 32 +++++++++++++++++++++-----------
> > 1 file changed, 21 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index 2bb40a6060af..c2181439f925 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -13,8 +13,9 @@
> > #include "dax-private.h"
> > #include "bus.h"
> > -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> > - const char *func)
> > +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> > + unsigned long start, unsigned long end, struct file *file,
> > + const char *func)
>
> In general
>
> Acked-by: David Hildenbrand <da...@redhat.com>

Thanks!

>
> The only thing that bugs me is __check_vma() that does not check a vma.

Ah yeah, you're right.

>
> Maybe something along the lines of
>
> "check_vma_properties"

maybe check_vma_desc()?

>
> Not sure.
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Lorenzo Stoakes

Sep 8, 2025, 11:29:43 AM
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Thanks!

David Hildenbrand

Sep 8, 2025, 11:31:59 AM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On 08.09.25 17:28, Lorenzo Stoakes wrote:
Would also work, although it might imply that we are passing in a vma desc.

Well, you could let check_vma() construct a vma_desc and pass that to
check_vma_desc() ...
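
As a sketch only (the field names and the elided checks are assumptions; only the shape of the split matters):

static int check_vma_desc(struct dev_dax *dev_dax,
			  const struct vm_area_desc *desc, const char *func)
{
	/* ... the existing flag/alignment checks, performed against desc ... */
	return 0;
}

static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
		     const char *func)
{
	/* Legacy path: build a descriptor from the VMA and reuse the
	 * desc-based checker. */
	struct vm_area_desc desc = {
		.start		= vma->vm_start,
		.end		= vma->vm_end,
		.vm_flags	= vma->vm_flags,
		.file		= vma->vm_file,
	};

	return check_vma_desc(dev_dax, &desc, func);
}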

Lorenzo Stoakes

Sep 8, 2025, 11:32:20 AM
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 05:19:18PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > Now we have the f_op->mmap_prepare() hook, having a static function called
> > __mmap_prepare() that has nothing to do with it is confusing, so rename the
> > function.
> >
> > Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
> > provide a f_op->mmap_complete() callback.
>
> Isn't prologue the opposite of epilogue? :)

:) well indeed, the prologue comes _first_ and epilogue comes _last_. So we
rename the bit that comes first

>
> I guess I would just have done a
>
> __mmap_prepare -> __mmap_setup()

Sure will rename to __mmap_setup().

>
> and left the __mmap_complete() as is.

But we are adding a 'mmap_complete' hook :)

I can think of another sensible name then if I'm being too abstract here...

__mmap_finish() or something.

>
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Jason Gunthorpe

Sep 8, 2025, 11:33:52 AM
to David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
secretmem is checking shared for a different reason than many other places..

Jason

Lorenzo Stoakes

Sep 8, 2025, 11:33:55 AM
to David Hildenbrand, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
> >
Well then :)

Probably in most cases what Jason is saying is valid for drivers.

So I can add a helper for both.

Maybe vma_desc_is_private() for this one?

Lorenzo Stoakes

Sep 8, 2025, 11:35:51 AM
to David Hildenbrand, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Right, but the vma_desc describes a VMA being set up.

I mean is_cow_mapping(desc->vm_flags) isn't too egregious anyway, so maybe
just use that for that case?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

David Hildenbrand

Sep 8, 2025, 11:46:12 AM
to Jason Gunthorpe, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
I think many cases just don't want any private mappings.

After all, you need a R/O file (VM_MAYWRITE cleared) mapped MAP_PRIVATE
to make is_cow_mapping() == false.

And at that point, you just mostly have a R/O MAP_SHARED mapping IIRC.

David Hildenbrand

Sep 8, 2025, 11:50:29 AM
to Jason Gunthorpe, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Sorry, was confused there. R/O file does not matter with MAP_PRIVATE. I
think we default to VM_MAYWRITE with MAP_PRIVATE unless someone
explicitly clears it.

So in practice there is indeed not a big difference between a private
and cow mapping.

Jason Gunthorpe

Sep 8, 2025, 11:57:02 AM
to David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:

> So in practice there is indeed not a big difference between a private and
> cow mapping.

Right and most drivers just check SHARED.

But if we are documenting why they check shared, it is because the
driver cannot tolerate COW.

I think if someone is cargo culting a driver and sees
'vma_never_cowable' they will have a better understanding of the
driver side issues.

Drivers don't actually care about private vs shared, except that this
indirectly implies something about cow.

Jason

Jason Gunthorpe

Sep 8, 2025, 12:03:13 PM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 03:18:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> >
> > > It's not only remap that is a concern here, people do all kinds of weird
> > > and wonderful things in .mmap(), sometimes in combination with remap.
> >
> > So it should really not be split this way, complete is a badly name
>
> I don't understand, you think we can avoid splitting this in two? If so, I
> disagree.

I'm saying to the greatest extent possible complete should only
populate PTEs.

We should refrain from trying to use it for other things, because it
shouldn't need to be there.

> > The only example in this series didn't actually need to hold the lock.
>
> There's ~250 more mmap callbacks to work through. Do you provide a guarantee
> that:

I'd be happy if only a small few need something weird and everything
else was aligned.

Jason

Lorenzo Stoakes

Sep 8, 2025, 12:07:26 PM
to Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 01:03:06PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:18:46PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > It's not only remap that is a concern here, people do all kinds of weird
> > > > and wonderful things in .mmap(), sometimes in combination with remap.
> > >
> > > So it should really not be split this way, complete is a badly name
> >
> > I don't understand, you think we can avoid splitting this in two? If so, I
> > disagree.
>
> I'm saying to the greatest extent possible complete should only
> populate PTEs.
>
> We should refrain from trying to use it for other things, because it
> shouldn't need to be there.

OK, that sounds sensible. I will refactor to try to do only this in the
mmap_complete hook as far as is possible, and see if I can use a generic
function also.

>
> > > The only example in this series didn't actually need to hold the lock.
> >
> > There's ~250 more mmap callbacks to work through. Do you provide a guarantee
> > that:
>
> I'd be happy if only a small few need something weird and everything
> else was aligned.

Ack!

>
> Jason

Cheers, Lorenzo

David Hildenbrand

Sep 8, 2025, 1:30:48 PM
to Lorenzo Stoakes, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Yes, I don't think we would need another wrapper.

David Hildenbrand

Sep 8, 2025, 1:37:15 PM
to Jason Gunthorpe, Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
I recall some corner cases, but yes, most drivers don't clear VM_MAYWRITE so
is_cow_mapping() would just rule out what they wanted to rule out (no anon
pages / cow semantics).

FWIW, I recalled some VM_MAYWRITE magic in memfd, but it's really just for
!cow mappings, so the following should likely work:

diff --git a/mm/memfd.c b/mm/memfd.c
index 1de610e9f2ea2..2a3aa26444bbb 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -346,14 +346,11 @@ static int check_write_seal(vm_flags_t *vm_flags_ptr)
 	vm_flags_t vm_flags = *vm_flags_ptr;
 	vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);
 
-	/* If a private mapping then writability is irrelevant. */
-	if (!(mask & VM_SHARED))
+	/* If a CoW mapping then writability is irrelevant. */
+	if (is_cow_mapping(vm_flags))
 		return 0;
 
-	/*
-	 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
-	 * write seals are active.
-	 */
+	/* New PROT_WRITE mappings are not allowed when write-sealed. */
 	if (mask & VM_WRITE)
 		return -EPERM;

David Hildenbrand

Sep 8, 2025, 1:39:06 PM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
LGTM. I guess it would all be clearer if we could just describe less
abstractly what is happening. But that would likely imply a bigger rework.
So setup/finish sounds good.

Lorenzo Stoakes

Sep 8, 2025, 4:24:39 PM
to David Hildenbrand, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
On Mon, Sep 08, 2025 at 07:36:59PM +0200, David Hildenbrand wrote:
> On 08.09.25 17:56, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:
> >
> > > So in practice there is indeed not a big difference between a private and
> > > cow mapping.
> >
> > Right and most drivers just check SHARED.
> >
> > But if we are being documentative why they check shared is because the
> > driver cannot tolerate COW.
> >
> > I think if someone is cargo culting a diver and sees
> > 'vma_never_cowable' they will have a better understanding of the
> > driver side issues.
> >
> > Driver's don't actually care about private vs shared, except this
> > indirectly implies something about cow.
>
> I recall some corner cases, but yes, most drivers don't clear MAP_MAYWRITE so
> is_cow_mapping() would just rule out what they wanted to rule out (no anon
> pages / cow semantics).
>
> FWIW, I recalled some VM_MAYWRITE magic in memfd, but it's really just for
> !cow mappings, so the following should likely work:

I was involved in these dark arts :)

Since we gate the check_write_seal() function (which is the one that removes
VM_MAYWRITE) on the mapping being shared, obviously we can't remove
VM_MAYWRITE in the first place.

The only other way VM_MAYWRITE could be got rid of is if it is already a
MAP_SHARED or MAP_SHARED_VALIDATE mapping without write permission, and then
it'd fail this check anyway.

So I think the below patch is fine!

>
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 1de610e9f2ea2..2a3aa26444bbb 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -346,14 +346,11 @@ static int check_write_seal(vm_flags_t *vm_flags_ptr)
> vm_flags_t vm_flags = *vm_flags_ptr;
> vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);
> - /* If a private mapping then writability is irrelevant. */
> - if (!(mask & VM_SHARED))
> + /* If a CoW mapping then writability is irrelevant. */
> + if (is_cow_mapping(vm_flags))
> return 0;
> - /*
> - * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
> - * write seals are active.
> - */
> + /* New PROT_WRITE mappings are not allowed when write-sealed. */
> if (mask & VM_WRITE)
> return -EPERM;

>
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Randy Dunlap

Sep 8, 2025, 7:17:44 PM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Hi--

On 9/8/25 4:10 AM, Lorenzo Stoakes wrote:
> We have introduced the mmap_complete() and mmap_abort() callbacks, which
> work in conjunction with mmap_prepare(), so describe what they used for.
>
> We update both the VFS documentation and the porting guide.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---
> Documentation/filesystems/porting.rst | 9 +++++++
> Documentation/filesystems/vfs.rst | 35 +++++++++++++++++++++++++++
> 2 files changed, 44 insertions(+)
>

> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 486a91633474..172d36a13e13 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst

> @@ -1236,6 +1240,37 @@ otherwise noted.
> file-backed memory mapping, most notably establishing relevant
> private state and VMA callbacks.
>
> +``mmap_complete``
> + If mmap_prepare is provided, will be invoked after the mapping is fully

s/mmap_prepare/mmap_complete/ ??

> + established, with the mmap and VMA write locks held.
> +
> + It is useful for prepopulating VMAs before they may be accessed by
> + users.
> +
> + The hook MUST NOT release either the VMA or mmap write locks. This is

You could also do **bold** above:

The hook **MUST NOT** release ...


> + asserted by the mmap logic.
> +
> + If an error is returned by the hook, the VMA is unmapped and the
> + mmap() operation fails with that error.
> +
> + It is not valid to specify this hook if mmap_prepare is not also
> + specified, doing so will result in an error upon mapping.

--
~Randy

Baolin Wang

Sep 8, 2025, 11:19:28 PM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe


On 2025/9/8 19:10, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baoli...@linux.alibaba.com>

> mm/shmem.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 29e1eb690125..cfc33b99a23a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2950,16 +2950,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
> return retval;
> }
>
> -static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> +static int shmem_mmap_prepare(struct vm_area_desc *desc)
> {
> + struct file *file = desc->file;
> struct inode *inode = file_inode(file);
>
> file_accessed(file);
> /* This is anonymous shared memory if it is unlinked at the time of mmap */
> if (inode->i_nlink)
> - vma->vm_ops = &shmem_vm_ops;
> + desc->vm_ops = &shmem_vm_ops;
> else
> - vma->vm_ops = &shmem_anon_vm_ops;
> + desc->vm_ops = &shmem_anon_vm_ops;
> return 0;
> }
>
> @@ -5229,7 +5230,7 @@ static const struct address_space_operations shmem_aops = {
> };
>
> static const struct file_operations shmem_file_operations = {
> - .mmap = shmem_mmap,
> + .mmap_prepare = shmem_mmap_prepare,
> .open = shmem_file_open,
> .get_unmapped_area = shmem_get_unmapped_area,
> #ifdef CONFIG_TMPFS

Alexander Gordeev

Sep 9, 2025, 4:31:41 AM
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo,

I am getting this warning with this series applied:

[Tue Sep 9 10:25:34 2025] ------------[ cut here ]------------
[Tue Sep 9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420
[Tue Sep 9 10:25:34 2025] Modules linked in: diag288_wdt(E) watchdog(E) ghash_s390(E) des_generic(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) pkey(E) autofs4(E) overlay(E) squashfs(E) loop(E)
[Tue Sep 9 10:25:34 2025] Unloaded tainted modules: hmac_s390(E):1
[Tue Sep 9 10:25:34 2025] CPU: 0 UID: 0 PID: 563 Comm: makedumpfile Tainted: G E 6.17.0-rc4-gcc-mmap-00410-g87e982e900f0 #288 PREEMPT
[Tue Sep 9 10:25:34 2025] Tainted: [E]=UNSIGNED_MODULE
[Tue Sep 9 10:25:34 2025] Hardware name: IBM 8561 T01 703 (LPAR)
[Tue Sep 9 10:25:34 2025] Krnl PSW : 0704d00180000000 00007fffe07f5ef2 (remap_pfn_range_internal+0x372/0x420)
[Tue Sep 9 10:25:34 2025] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Tue Sep 9 10:25:34 2025] Krnl GPRS: 0000000004044400 001c0f000188b024 0000000000000000 001c0f000188b022
[Tue Sep 9 10:25:34 2025] 000078000c458120 000078000a0ca800 00000f000188b022 0000000000000711
[Tue Sep 9 10:25:34 2025] 000003ffa6e05000 00000f000188b024 000003ffa6a05000 0000000004044400
[Tue Sep 9 10:25:34 2025] 000003ffa7aadfa8 00007fffe2c35ea0 001c000000000000 00007f7fe0faf000
[Tue Sep 9 10:25:34 2025] Krnl Code: 00007fffe07f5ee6: 47000700 bc 0,1792
00007fffe07f5eea: af000000 mc 0,0
#00007fffe07f5eee: af000000 mc 0,0
>00007fffe07f5ef2: a7f4ff11 brc 15,00007fffe07f5d14
00007fffe07f5ef6: b904002b lgr %r2,%r11
00007fffe07f5efa: c0e5000918bb brasl %r14,00007fffe0919070
00007fffe07f5f00: a7f4ff39 brc 15,00007fffe07f5d72
00007fffe07f5f04: e320f0c80004 lg %r2,200(%r15)
[Tue Sep 9 10:25:34 2025] Call Trace:
[Tue Sep 9 10:25:34 2025] [<00007fffe07f5ef2>] remap_pfn_range_internal+0x372/0x420
[Tue Sep 9 10:25:34 2025] [<00007fffe07f5fd4>] remap_pfn_range_complete+0x34/0x70
[Tue Sep 9 10:25:34 2025] [<00007fffe019879e>] remap_oldmem_pfn_range+0x13e/0x1a0
[Tue Sep 9 10:25:34 2025] [<00007fffe0bd3550>] mmap_complete_vmcore+0x520/0x7b0
[Tue Sep 9 10:25:34 2025] [<00007fffe077b05a>] __compat_vma_mmap_prepare+0x3ea/0x550
[Tue Sep 9 10:25:34 2025] [<00007fffe0ba27f0>] pde_mmap+0x160/0x1a0
[Tue Sep 9 10:25:34 2025] [<00007fffe0ba3750>] proc_reg_mmap+0xd0/0x180
[Tue Sep 9 10:25:34 2025] [<00007fffe0859904>] __mmap_new_vma+0x444/0x1290
[Tue Sep 9 10:25:34 2025] [<00007fffe085b0b4>] __mmap_region+0x964/0x1090
[Tue Sep 9 10:25:34 2025] [<00007fffe085dc7e>] mmap_region+0xde/0x250
[Tue Sep 9 10:25:34 2025] [<00007fffe08065fc>] do_mmap+0x80c/0xc30
[Tue Sep 9 10:25:34 2025] [<00007fffe077c708>] vm_mmap_pgoff+0x218/0x370
[Tue Sep 9 10:25:34 2025] [<00007fffe080467e>] ksys_mmap_pgoff+0x2ee/0x400
[Tue Sep 9 10:25:34 2025] [<00007fffe0804a3a>] __s390x_sys_old_mmap+0x15a/0x1d0
[Tue Sep 9 10:25:34 2025] [<00007fffe29f1cd6>] __do_syscall+0x146/0x410
[Tue Sep 9 10:25:34 2025] [<00007fffe2a17e1e>] system_call+0x6e/0x90
[Tue Sep 9 10:25:34 2025] 2 locks held by makedumpfile/563:
[Tue Sep 9 10:25:34 2025] #0: 000078000a0caab0 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x16e/0x370
[Tue Sep 9 10:25:34 2025] #1: 00007fffe3864f50 (vmcore_cb_srcu){.+.+}-{0:0}, at: mmap_complete_vmcore+0x20c/0x7b0
[Tue Sep 9 10:25:34 2025] Last Breaking-Event-Address:
[Tue Sep 9 10:25:34 2025] [<00007fffe07f5d0e>] remap_pfn_range_internal+0x18e/0x420
[Tue Sep 9 10:25:34 2025] irq event stamp: 19113
[Tue Sep 9 10:25:34 2025] hardirqs last enabled at (19121): [<00007fffe0391910>] __up_console_sem+0xe0/0x120
[Tue Sep 9 10:25:34 2025] hardirqs last disabled at (19128): [<00007fffe03918f2>] __up_console_sem+0xc2/0x120
[Tue Sep 9 10:25:34 2025] softirqs last enabled at (4934): [<00007fffe021cb8e>] handle_softirqs+0x70e/0xed0
[Tue Sep 9 10:25:34 2025] softirqs last disabled at (3919): [<00007fffe021b670>] __irq_exit_rcu+0x2e0/0x380
[Tue Sep 9 10:25:34 2025] ---[ end trace 0000000000000000 ]---

Thanks!

Lorenzo Stoakes

Sep 9, 2025, 5:00:15 AM
to Alexander Gordeev, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Tue, Sep 09, 2025 at 10:31:24AM +0200, Alexander Gordeev wrote:
> On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:
>
> Hi Lorenzo,
>
> I am getting this warning with this series applied:
>
> [Tue Sep 9 10:25:34 2025] ------------[ cut here ]------------
> [Tue Sep 9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420

OK yeah this is a very silly error :)

I'm asserting:

VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);

So err.. this should be:

VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);

This was a super late addition to the code and obviously I didn't test this as
well as I did the remap code in general, apologies.

Will fix on respin! :)

Cheers, Lorenzo

Lorenzo Stoakes

Sep 9, 2025, 5:02:48 AM
to Randy Dunlap, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Mon, Sep 08, 2025 at 04:17:16PM -0700, Randy Dunlap wrote:
> Hi--
>
> On 9/8/25 4:10 AM, Lorenzo Stoakes wrote:
> > We have introduced the mmap_complete() and mmap_abort() callbacks, which
> > work in conjunction with mmap_prepare(), so describe what they used for.
> >
> > We update both the VFS documentation and the porting guide.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> > ---
> > Documentation/filesystems/porting.rst | 9 +++++++
> > Documentation/filesystems/vfs.rst | 35 +++++++++++++++++++++++++++
> > 2 files changed, 44 insertions(+)
> >
>
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 486a91633474..172d36a13e13 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
>
> > @@ -1236,6 +1240,37 @@ otherwise noted.
> > file-backed memory mapping, most notably establishing relevant
> > private state and VMA callbacks.
> >
> > +``mmap_complete``
> > + If mmap_prepare is provided, will be invoked after the mapping is fully
>
> s/mmap_prepare/mmap_complete/ ??

Yes indeed sorry! Will fix on respin.

>
> > + established, with the mmap and VMA write locks held.
> > +
> > + It is useful for prepopulating VMAs before they may be accessed by
> > + users.
> > +
> > + The hook MUST NOT release either the VMA or mmap write locks. This is
>
> You could also do **bold** above:
>
> The hook **MUST NOT** release ...
>
>

Ack will do!

> > + asserted by the mmap logic.
> > +
> > + If an error is returned by the hook, the VMA is unmapped and the
> > + mmap() operation fails with that error.
> > +
> > + It is not valid to specify this hook if mmap_prepare is not also
> > + specified, doing so will result in an error upon mapping.
>
> --
> ~Randy
>

Cheers, Lorenzo

Lorenzo Stoakes

Sep 9, 2025, 5:04:23 AM
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Ack will fix on respin!

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Lorenzo Stoakes

Sep 9, 2025, 5:08:45 AM
to Baolin Wang, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Tue, Sep 09, 2025 at 11:19:16AM +0800, Baolin Wang wrote:
>
>
> On 2025/9/8 19:10, Lorenzo Stoakes wrote:
> > This simply assigns the vm_ops so is easily updated - do so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
> > ---
>
> LGTM.
> Reviewed-by: Baolin Wang <baoli...@linux.alibaba.com>

Thanks!

Lorenzo Stoakes

unread,
Sep 9, 2025, 5:14:13 AMSep 9
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Yeah there's a doc patch that follows this.

>
> In particular, the mmap_complete() looks like another candidate for letting
> a driver just go crazy on the vma? :)

Well there's only so much we can do. In an ideal world we'd treat VMAs as
entirely internal data structures and pass some sort of opaque thing around, but
we have to keep things real here :)

So the main purpose of these changes is not so much to be as ambitious as
_that_, but to only provide the VMA _when it's safe to do so_.

Before we were providing a pointer to an incompletely-initialised VMA that was
not yet in the maple tree, with which the driver could do _anything_, and then
afterwards have:

a. a bunch of stuff left to do with a VMA that might be in some broken state due
to drivers.
b. (the really bad case) have error paths to handle because the driver returned
an error, but did who-knows-what with the VMA and page tables.

So we address this by:

1. mmap_prepare being done _super early_ and _not_ providing a VMA. We
essentially ask the driver 'hey what do you want these fields that you are
allowed to change in the VMA to be?'

2. mmap_complete being done _super_ late, essentially just before we release the
VMA/mmap locks. If an error arises - we can just unmap it, easy. And then
there's a lot less damage the driver can do.

I think it's probably the most sensible means of doing something about the
legacy we have where we've been rather too 'free and easy' with allowing drivers
to do whatever.
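
To make this concrete, here is a minimal sketch of a driver using the pair of
hooks as proposed here - the my_drv_* names and struct are hypothetical, only
the hook signatures follow this series:

struct my_drv_state {
        unsigned long pfn;      /* hypothetical: base PFN the device exposes */
};

static const struct vm_operations_struct my_drv_vm_ops = {};

static int my_drv_mmap_prepare(struct vm_area_desc *desc)
{
        /* No VMA exists yet - we only describe the mapping we want. */
        desc->vm_flags |= VM_DONTEXPAND;
        desc->vm_ops = &my_drv_vm_ops;
        /* Assumed to point at a struct my_drv_state. */
        desc->mmap_context = desc->file->private_data;
        return 0;
}

static int my_drv_mmap_complete(struct file *file, struct vm_area_struct *vma,
                                const void *context)
{
        const struct my_drv_state *state = context;

        /* The VMA is fully initialised and in the maple tree - safe to prepopulate. */
        return remap_pfn_range_complete(vma, vma->vm_start, state->pfn,
                                        vma->vm_end - vma->vm_start,
                                        vma->vm_page_prot);
}

static const struct file_operations my_drv_fops = {
        .mmap_prepare   = my_drv_mmap_prepare,
        .mmap_complete  = my_drv_mmap_complete,
};

If mmap_complete() returns an error the core just unmaps the VMA, so there is
nothing for the driver to unwind.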

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Lorenzo Stoakes

unread,
Sep 9, 2025, 5:21:35 AMSep 9
to David Hildenbrand, Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com
Ack will use this in favour of a wrapper.

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

David Hildenbrand

unread,
Sep 9, 2025, 5:26:30 AMSep 9
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Yeah, spotted that afterwards.

>
>>
>> In particular, the mmap_complete() looks like another candidate for letting
>> a driver just go crazy on the vma? :)
>
> Well there's only so much we can do. In an ideal world we'd treat VMAs as
> entirely internal data structures and pass some sort of opaque thing around, but
> we have to keep things real here :)

Right, we'd pass something around that cannot be easily abused (like
modifying random vma flags in mmap_complete).

So I was wondering if most operations that a driver would perform during
the mmap_complete() could be abstracted, and only those then be
called with whatever opaque thing we return here.

But I have no feeling about what crazy things a driver might do. Just
calling remap_pfn_range() would be easy, for example, and we could
abstract that.

Lorenzo Stoakes

unread,
Sep 9, 2025, 5:37:43 AMSep 9
to David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
On Tue, Sep 09, 2025 at 11:26:21AM +0200, David Hildenbrand wrote:
> > >
> > > In particular, the mmap_complete() looks like another candidate for letting
> > > a driver just go crazy on the vma? :)
> >
> > Well there's only so much we can do. In an ideal world we'd treat VMAs as
> > entirely internal data structures and pass some sort of opaque thing around, but
> > we have to keep things real here :)
>
> Right, we'd pass something around that cannot be easily abused (like
> modifying random vma flags in mmap_complete).
>
> So I was wondering if most operations that a driver would perform during the
> mmap_complete() could be abstracted, and only those then be called with
> whatever opaque thing we return here.

Well there's 2 issues at play:

1. I might end up having to rewrite _large parts_ of kernel functionality all of
which relies on there being a vma parameter (or might find that to be
intractable).

2. There's always the 'odd ones out' :) so there'll be some drivers that
absolutely do need to have access to this.

But as I was writing this I thought of an idea - why don't we have something
opaque like this, perhaps with accessor functions, but then _give the ability to
get the VMA if you REALLY have to_.

That way we can handle both problems without too much trouble.

Also Jason suggested generic functions that can just be assigned to
.mmap_complete for instance, which would obviously eliminate the crazy
factor a lot too.

I'm going to refactor to try to put ONLY prepopulate logic in
.mmap_complete where possible which fits with all of this.
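
As a rough illustration of what such a generic helper might look like (the
helper name and context layout below are my own invention, not something in
the series as posted):

struct mmap_pfn_remap_ctx {
        unsigned long pfn;
        pgprot_t prot;
};

/*
 * Illustrative only: a driver would assign this directly to .mmap_complete
 * and never touch the VMA itself.
 */
static int mmap_complete_remap_pfn(struct file *file,
                                   struct vm_area_struct *vma,
                                   const void *context)
{
        const struct mmap_pfn_remap_ctx *ctx = context;

        return remap_pfn_range_complete(vma, vma->vm_start, ctx->pfn,
                                        vma->vm_end - vma->vm_start,
                                        ctx->prot);
}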

>
> But I have no feeling about what crazy things a driver might do. Just
> calling remap_pfn_range() would be easy, for example, and we could abstract
> that.

Yeah, I've obviously already added some wrappers for these.

BTW I really really hate that STUPID ->vm_pgoff hack, if not for that, life
would be much simpler.

But instead now we need to specify PFN in the damn remap prepare wrapper in
case of CoW. God.

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Suren Baghdasaryan

unread,
Sep 9, 2025, 12:43:42 PMSep 9
to Lorenzo Stoakes, David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Thinking along these lines, do you have a case when mmap_abort() needs
vm_private_data? I was thinking if VMA mapping failed, why would you
need vm_private_data to unwind prep work? You already have the context
pointer for that, no?

Suren Baghdasaryan

unread,
Sep 9, 2025, 12:45:15 PMSep 9
to Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
> } __randomize_layout;
>
> /* Supports async buffered reads */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cf759fe08bb3..052db1f31fb3 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -793,6 +793,11 @@ struct vm_area_desc {
> /* Write-only fields. */
> const struct vm_operations_struct *vm_ops;
> void *private_data;
> + /*
> + * A user-defined field, value will be passed to mmap_complete,
> + * mmap_abort.
> + */
> + void *mmap_context;
> };
>
> /*
> diff --git a/mm/util.c b/mm/util.c
> index 248f877f629b..f5bcac140cb9 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1161,17 +1161,26 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
> err = f_op->mmap_prepare(&desc);
> if (err)
> return err;
> +
> set_vma_from_desc(vma, &desc);
>
> - return 0;
> + /*
> + * No error can occur between mmap_prepare() and mmap_complete so no
> + * need to invoke mmap_abort().
> + */
> +
> + if (f_op->mmap_complete)
> + err = f_op->mmap_complete(file, vma, desc.mmap_context);
> +
> + return err;
> }
> EXPORT_SYMBOL(__compat_vma_mmap_prepare);
>
> /**
> * compat_vma_mmap_prepare() - Apply the file's .mmap_prepare() hook to an
> - * existing VMA.
> + * existing VMA and invoke .mmap_complete() if provided.
> * @file: The file which possesss an f_op->mmap_prepare() hook.

nit: possesss seems to be misspelled. Maybe we can fix it here as well?

> - * @vma: The VMA to apply the .mmap_prepare() hook to.
> + * @vma: The VMA to apply the hooks to.
> *
> * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
> * stacked filesystems invoke a nested mmap hook of an underlying file.
> @@ -1188,6 +1197,9 @@ EXPORT_SYMBOL(__compat_vma_mmap_prepare);
> * establishes a struct vm_area_desc descriptor, passes to the underlying
> * .mmap_prepare() hook and applies any changes performed by it.
> *
> + * If the relevant hooks are provided, it also invokes .mmap_complete() upon
> + * successful completion.
> + *
> * Once the conversion of filesystems is complete this function will no longer
> * be required and will be removed.
> *
> diff --git a/mm/vma.c b/mm/vma.c
> index 0efa4288570e..a0b568fe9e8d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -22,6 +22,7 @@ struct mmap_state {
> /* User-defined fields, perhaps updated by .mmap_prepare(). */
> const struct vm_operations_struct *vm_ops;
> void *vm_private_data;
> + void *mmap_context;
>
> unsigned long charged;
>
> @@ -2343,6 +2344,23 @@ static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
> int error;
> struct vma_iterator *vmi = map->vmi;
> struct vma_munmap_struct *vms = &map->vms;
> + struct file *file = map->file;
> +
> + if (file) {
> + /* f_op->mmap_complete requires f_op->mmap_prepare. */
> + if (file->f_op->mmap_complete && !file->f_op->mmap_prepare)
> + return -EINVAL;
> +
> + /*
> + * It's not valid to provide an f_op->mmap_abort hook without also
> + * providing the f_op->mmap_prepare and f_op->mmap_complete hooks it is
> + * used with.
> + */
> + if (file->f_op->mmap_abort &&
> + (!file->f_op->mmap_prepare ||
> + !file->f_op->mmap_complete))
> + return -EINVAL;
> + }
>
> /* Find the first overlapping VMA and initialise unmap state. */
> vms->vma = vma_find(vmi, map->end);
> @@ -2595,6 +2613,7 @@ static int call_mmap_prepare(struct mmap_state *map)
> /* User-defined fields. */
> map->vm_ops = desc.vm_ops;
> map->vm_private_data = desc.private_data;
> + map->mmap_context = desc.mmap_context;
>
> return 0;
> }
> @@ -2636,16 +2655,61 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> return false;
> }
>
> +/*
> + * Invoke the f_op->mmap_complete hook, providing it with a fully initialised
> + * VMA to operate upon.
> + *
> + * The mmap and VMA write locks must be held prior to and after the hook has
> + * been invoked.
> + */
> +static int call_mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> +{
> + struct file *file = map->file;
> + void *context = map->mmap_context;
> + int error;
> + size_t len;
> +
> + if (!file || !file->f_op->mmap_complete)
> + return 0;
> +
> + error = file->f_op->mmap_complete(file, vma, context);
> + /* The hook must NOT drop the write locks. */
> + vma_assert_write_locked(vma);
> + mmap_assert_write_locked(current->mm);
> + if (!error)
> + return 0;
> +
> + /*
> + * If an error occurs, unmap the VMA altogether and return an error. We
> + * only clear the newly allocated VMA, since this function is only
> + * invoked if we do NOT merge, so we only clean up the VMA we created.
> + */
> + len = vma_pages(vma) << PAGE_SHIFT;
> + do_munmap(current->mm, vma->vm_start, len, NULL);
> + return error;
> +}
> +
> +static void call_mmap_abort(struct mmap_state *map)
> +{
> + struct file *file = map->file;
> + void *vm_private_data = map->vm_private_data;
> +
> + VM_WARN_ON_ONCE(!file || !file->f_op);
> + file->f_op->mmap_abort(file, vm_private_data, map->mmap_context);
> +}
> +
> static unsigned long __mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> {
> - struct mm_struct *mm = current->mm;
> - struct vm_area_struct *vma = NULL;
> - int error;
> bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> + bool have_mmap_abort = file && file->f_op->mmap_abort;
> + struct mm_struct *mm = current->mm;
> VMA_ITERATOR(vmi, mm, addr);
> MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
> + struct vm_area_struct *vma = NULL;
> + bool allocated_new = false;
> + int error;
>
> map.check_ksm_early = can_set_ksm_flags_early(&map);
>
> @@ -2668,8 +2732,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> /* ...but if we can't, allocate a new VMA. */
> if (!vma) {
> error = __mmap_new_vma(&map, &vma);
> - if (error)
> + if (error) {
> + if (have_mmap_abort)
> + call_mmap_abort(&map);
> goto unacct_error;
> + }
> + allocated_new = true;
> }
>
> if (have_mmap_prepare)
> @@ -2677,6 +2745,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
>
> __mmap_epilogue(&map, vma);
>
> + if (allocated_new) {
> + error = call_mmap_complete(&map, vma);
> + if (error)
> + return error;
> + }
> +
> return addr;
>
> /* Accounting was done by __mmap_prelude(). */
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 07167446dcf4..566cef1c0e0b 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -297,11 +297,20 @@ struct vm_area_desc {
> /* Write-only fields. */
> const struct vm_operations_struct *vm_ops;
> void *private_data;
> + /*
> + * A user-defined field, value will be passed to mmap_complete,
> + * mmap_abort.
> + */
> + void *mmap_context;
> };
>
> struct file_operations {
> int (*mmap)(struct file *, struct vm_area_struct *);
> int (*mmap_prepare)(struct vm_area_desc *);
> + void (*mmap_abort)(const struct file *, const void *vm_private_data,
> + const void *context);
> + int (*mmap_complete)(struct file *, struct vm_area_struct *,
> + const void *context);
> };
>
> struct file {
> @@ -1471,7 +1480,7 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
> {
> struct vm_area_desc desc = {
> .mm = vma->vm_mm,
> - .file = vma->vm_file,
> + .file = file,
> .start = vma->vm_start,
> .end = vma->vm_end,
>
> @@ -1485,13 +1494,21 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
> err = f_op->mmap_prepare(&desc);
> if (err)
> return err;
> +
> set_vma_from_desc(vma, &desc);
>
> - return 0;
> + /*
> + * No error can occur between mmap_prepare() and mmap_complete so no
> + * need to invoke mmap_abort().
> + */
> +
> + if (f_op->mmap_complete)
> + err = f_op->mmap_complete(file, vma, desc.mmap_context);
> +
> + return err;
> }
>
> -static inline int compat_vma_mmap_prepare(struct file *file,
> - struct vm_area_struct *vma)
> +static inline int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
> {
> return __compat_vma_mmap_prepare(file->f_op, file, vma);
> }
> @@ -1548,4 +1565,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> return vm_flags;
> }
>
> +static inline int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> + struct list_head *uf)
> +{
> + return 0;
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0
>

Lorenzo Stoakes

unread,
Sep 9, 2025, 1:37:06 PMSep 9
to Suren Baghdasaryan, David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Actually I have removed mmap_abort in the latest respin :) The new version
will be a fairly substantial rewrite based on feedback.

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:22:48 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
This simply assigns the vm_ops, so it is easily updated - do so.

Reviewed-by: Baolin Wang <baoli...@linux.alibaba.com>
Reviewed-by: David Hildenbrand <da...@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
mm/shmem.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 45e7733d6612..990e33c6a776 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2938,16 +2938,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
return retval;
}

-static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int shmem_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *file = desc->file;
struct inode *inode = file_inode(file);

file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
- vma->vm_ops = &shmem_vm_ops;
+ desc->vm_ops = &shmem_vm_ops;
else
- vma->vm_ops = &shmem_anon_vm_ops;
+ desc->vm_ops = &shmem_anon_vm_ops;
return 0;
}

@@ -5217,7 +5218,7 @@ static const struct address_space_operations shmem_aops = {
};

static const struct file_operations shmem_file_operations = {
- .mmap = shmem_mmap,
+ .mmap_prepare = shmem_mmap_prepare,
.open = shmem_file_open,
.get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:22:49 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
The devdax driver does nothing special in its f_op->mmap hook, so
straightforwardly update it to use the mmap_prepare hook instead.

Acked-by: David Hildenbrand <da...@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
drivers/dax/device.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bb40a6060af..c2181439f925 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -13,8 +13,9 @@
#include "dax-private.h"
#include "bus.h"

-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
- const char *func)
+static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
+ unsigned long start, unsigned long end, struct file *file,
+ const char *func)
{
struct device *dev = &dev_dax->dev;
unsigned long mask;
@@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return -ENXIO;

/* prevent private mappings from being established */
- if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
dev_info_ratelimited(dev,
"%s: %s: fail, attempted private mapping\n",
current->comm, func);
@@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
}

mask = dev_dax->align - 1;
- if (vma->vm_start & mask || vma->vm_end & mask) {
+ if (start & mask || end & mask) {
dev_info_ratelimited(dev,
"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
- current->comm, func, vma->vm_start, vma->vm_end,
+ current->comm, func, start, end,
mask);
return -EINVAL;
}

- if (!vma_is_dax(vma)) {
+ if (!file_is_dax(file)) {
dev_info_ratelimited(dev,
"%s: %s: fail, vma is not DAX capable\n",
current->comm, func);
@@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return 0;
}

+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+ const char *func)
+{
+ return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
+ vma->vm_file, func);
+}
+
/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
@@ -285,8 +293,9 @@ static const struct vm_operations_struct dax_vm_ops = {
.pagesize = dev_dax_pagesize,
};

-static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
+static int dax_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *filp = desc->file;
struct dev_dax *dev_dax = filp->private_data;
int rc, id;

@@ -297,13 +306,14 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
* fault time.
*/
id = dax_read_lock();
- rc = check_vma(dev_dax, vma, __func__);
+ rc = __check_vma(dev_dax, desc->vm_flags, desc->start, desc->end, filp,
+ __func__);
dax_read_unlock(id);
if (rc)
return rc;

- vma->vm_ops = &dax_vm_ops;
- vm_flags_set(vma, VM_HUGEPAGE);
+ desc->vm_ops = &dax_vm_ops;
+ desc->vm_flags |= VM_HUGEPAGE;
return 0;
}

@@ -377,7 +387,7 @@ static const struct file_operations dax_fops = {
.open = dax_open,
.release = dax_release,
.get_unmapped_area = dax_get_unmapped_area,
- .mmap = dax_mmap,
+ .mmap_prepare = dax_mmap_prepare,
.fop_flags = FOP_MMAP_SYNC,
};

--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:22:50 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), The f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook which is highly problematic as it allows
drivers and filesystems raw access to a VMA which is not yet correctly
initialised.

This hook also introduces complexity for the memory mapping operation, as
we must correctly unwind what we do should an error arise.

Overall, this interface being so open has caused significant problems for
us, including security issues, so it is important for us to simply eliminate
it as a source of problems.

Therefore this series continues what was established by extending the
functionality further to permit more drivers and filesystems to use
mmap_prepare.

We start by updating some existing users who can use the mmap_prepare
functionality as-is.

We then introduce the concept of an mmap 'action', which a user, on
mmap_prepare, can request to be performed upon the VMA:

* Nothing - default, we're done
* Remap PFN - perform PFN remap with specified parameters
* Insert mixed - Insert a linear PFN range as a mixed map
* Insert mixed pages - Insert a set of specific pages as a mixed map
* Custom action - Should rarely be used, for operations that are truly
custom. A hook is invoked.

Setting the action in mmap_prepare allows us to dynamically decide what to
do next, so if a driver/filesystem needs to determine whether to e.g. remap
or use a mixed map, it can do so and then change which action is performed.

This significantly expands the capabilities of the mmap_prepare hook, while
maintaining as much control as possible in the mm logic.
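
As a rough sketch of the intent (my_drv_* is hypothetical and, since the
helper signatures aren't reproduced here, the arguments to
vma_desc_set_remap() are illustrative):

static int my_drv_mmap_prepare(struct vm_area_desc *desc)
{
        struct my_drv_state *state = desc->file->private_data;

        desc->vm_ops = &my_drv_vm_ops;

        /*
         * Ask the mm core to perform the PFN remap once the VMA has been
         * established - the driver never touches the VMA at all.
         */
        vma_desc_set_remap(desc, state->pfn, vma_desc_size(desc),
                           desc->page_prot);
        return 0;
}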

In the custom hook case, which unfortunately we have to provide for the
obstinate drivers which insist on doing 'interesting' things, we make it
possible for them to invoke mmap actions themselves via
mmap_action_prepare() (to be called in mmap_prepare as necessary) and
mmap_action_complete() (to be called in the custom hook).

This way, we keep as much logic in generic code as possible even in the
custom case.

At the point at which the VMA becomes accessible it is safe for it to be
manipulated, as it will already be fully established in the maple tree, and
error handling can be simplified to unmapping the VMA.

We add split remap_pfn_range*() functions which allow a PFN remap (a typical
mapping prepopulation operation) to be split between prepare/complete steps,
as well as io_remap_pfn_range_prepare()/io_remap_pfn_range_complete() for a
similar purpose.

From there we update various mm-adjacent logic to use this functionality as
a first set of changes, as well as resctl and cramfs filesystems to round
off the non-stacked filesystem instances.

We also add success and error hooks for post-action processing, e.g.
outputting a debug log on success and filtering error codes.

v2:
* Propagated tags, thanks everyone! :)
* Refactored resctl patch to avoid assigned-but-not-used variable.
* Updated resctl change to not use .mmap_abort as discussed with Jason.
* Removed .mmap_abort as discussed with Jason.
* Removed references to .mmap_abort from documentation.
* Fixed silly VM_WARN_ON_ONCE() mistake (asserting opposite of what we mean
to) as per report from Alexander.
* Fixed relay kerneldoc error.
* Renamed __mmap_prelude to __mmap_setup, keep __mmap_complete the same as
per David.
* Fixed docs typo in mmap_complete description + formatted bold rather than
capitalised as per Randy.
* Eliminated mmap_complete and reworked it into actions specified in mmap_prepare
(via vm_area_desc), which therefore eliminates the driver's ability to do
anything crazy and allows us to keep control in the generic logic.
* Added helper functions for these - vma_desc_set_remap(),
vma_desc_set_mixedmap().
* However we unfortunately had to add post-action hooks to vm_area_desc, as
hugetlbfs, for instance, already needs to access the VMA to function
correctly. It is at least the smallest possible means of doing this.
* Updated VMA test logic, the stacked filesystem compatibility layer and
documentation to reflect this.
* Updated hugetlbfs implementation to use new approach, and refactored to
accept desc where at all possible and to do as much as possible in
.mmap_prepare, and the minimum required in the new post_hook callback.
* Updated /dev/mem and /dev/zero mmap logic to use the new mechanism.
* Updated cramfs, resctl to use the new mechanism.
* Updated proc_mmap hooks to only have proc_mmap_prepare.
* Updated the vmcore implementation to use the new hooks.
* Updated kcov to use the new hooks.
* Added hooks for success/failure for post-action handling.
* Added custom action hook for truly custom cases.
* Abstracted actions to separate type so we can use generic custom actions in
custom handlers when necessary.
* Added callout re: lock issue raised in
https://lore.kernel.org/linux-mm/20250801162...@nvidia.com/ as per
discussion with Jason.

v1:
https://lore.kernel.org/all/cover.1757329751.g...@oracle.com/

Lorenzo Stoakes (16):
mm/shmem: update shmem to use mmap_prepare
device/dax: update devdax to use mmap_prepare
mm: add vma_desc_size(), vma_desc_pages() helpers
relay: update relay to use mmap_prepare
mm/vma: rename __mmap_prepare() function to avoid confusion
mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
mm: introduce io_remap_pfn_range_[prepare, complete]()
mm: add ability to take further action in vm_area_desc
doc: update porting, vfs documentation for mmap_prepare actions
mm/hugetlbfs: update hugetlbfs to use mmap_prepare
mm: update mem char driver to use mmap_prepare
mm: update resctl to use mmap_prepare
mm: update cramfs to use mmap_prepare
fs/proc: add the proc_mmap_prepare hook for procfs
fs/proc: update vmcore to use .proc_mmap_prepare
kcov: update kcov to use mmap_prepare

Documentation/filesystems/porting.rst | 5 +
Documentation/filesystems/vfs.rst | 4 +
arch/csky/include/asm/pgtable.h | 5 +
arch/mips/alchemy/common/setup.c | 28 ++++-
arch/mips/include/asm/pgtable.h | 10 ++
arch/s390/kernel/crash_dump.c | 6 +-
arch/sparc/include/asm/pgtable_32.h | 29 ++++-
arch/sparc/include/asm/pgtable_64.h | 29 ++++-
drivers/char/mem.c | 75 ++++++------
drivers/dax/device.c | 32 +++--
fs/cramfs/inode.c | 46 ++++----
fs/hugetlbfs/inode.c | 30 +++--
fs/ntfs3/file.c | 2 +-
fs/proc/inode.c | 12 +-
fs/proc/vmcore.c | 54 ++++++---
fs/resctrl/pseudo_lock.c | 22 ++--
include/linux/hugetlb.h | 9 +-
include/linux/hugetlb_inline.h | 15 ++-
include/linux/mm.h | 83 ++++++++++++-
include/linux/mm_types.h | 61 ++++++++++
include/linux/proc_fs.h | 1 +
include/linux/shmem_fs.h | 3 +-
include/linux/vmalloc.h | 10 +-
kernel/kcov.c | 42 ++++---
kernel/relay.c | 33 +++---
mm/hugetlb.c | 77 +++++++-----
mm/memory.c | 128 ++++++++++++--------
mm/secretmem.c | 2 +-
mm/shmem.c | 49 ++++++--
mm/util.c | 150 ++++++++++++++++++++++-
mm/vma.c | 74 ++++++++----
mm/vmalloc.c | 16 ++-
tools/testing/vma/vma_internal.h | 164 +++++++++++++++++++++++++-
33 files changed, 1002 insertions(+), 304 deletions(-)

--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:22:53 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
It's useful to be able to determine the size of the VMA descriptor range
passed to f_op->mmap_prepare, expressed both in bytes and in pages, so add
helpers for both and update code that could make use of them to do so.
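
For instance, a (hypothetical) driver bounds check in an mmap_prepare hook
becomes:

static int my_drv_mmap_prepare(struct vm_area_desc *desc)
{
        /* MY_DRV_MAX_PAGES is a hypothetical driver limit. */
        if (vma_desc_pages(desc) > MY_DRV_MAX_PAGES)
                return -EINVAL;

        return 0;
}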

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/ntfs3/file.c | 2 +-
include/linux/mm.h | 10 ++++++++++
mm/secretmem.c | 2 +-
3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index c1ece707b195..86eb88f62714 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)

if (rw) {
u64 to = min_t(loff_t, i_size_read(inode),
- from + desc->end - desc->start);
+ from + vma_desc_size(desc));

if (is_sparsed(ni)) {
/* Allocate clusters for rw map. */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 892fe5dbf9de..0b97589aec6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3572,6 +3572,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}

+static inline unsigned long vma_desc_size(struct vm_area_desc *desc)
+{
+ return desc->end - desc->start;
+}
+
+static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
+{
+ return vma_desc_size(desc) >> PAGE_SHIFT;
+}
+
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
unsigned long vm_start, unsigned long vm_end)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 60137305bc20..62066ddb1e9c 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)

static int secretmem_mmap_prepare(struct vm_area_desc *desc)
{
- const unsigned long len = desc->end - desc->start;
+ const unsigned long len = vma_desc_size(desc);

if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
return -EINVAL;
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:22:56 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
It is relatively trivial to update this code to use the f_op->mmap_prepare
hook in favour of the deprecated f_op->mmap hook, so do so.

Reviewed-by: David Hildenbrand <da...@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
kernel/relay.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 8d915fe98198..e36f6b926f7f 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -72,17 +72,18 @@ static void relay_free_page_array(struct page **array)
}

/**
- * relay_mmap_buf: - mmap channel buffer to process address space
- * @buf: relay channel buffer
- * @vma: vm_area_struct describing memory to be mapped
+ * relay_mmap_prepare_buf: - mmap channel buffer to process address space
+ * @buf: the relay channel buffer
+ * @desc: describing what to map
*
* Returns 0 if ok, negative on error
*
* Caller should already have grabbed mmap_lock.
*/
-static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
+static int relay_mmap_prepare_buf(struct rchan_buf *buf,
+ struct vm_area_desc *desc)
{
- unsigned long length = vma->vm_end - vma->vm_start;
+ unsigned long length = vma_desc_size(desc);

if (!buf)
return -EBADF;
@@ -90,9 +91,9 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
if (length != (unsigned long)buf->chan->alloc_size)
return -EINVAL;

- vma->vm_ops = &relay_file_mmap_ops;
- vm_flags_set(vma, VM_DONTEXPAND);
- vma->vm_private_data = buf;
+ desc->vm_ops = &relay_file_mmap_ops;
+ desc->vm_flags |= VM_DONTEXPAND;
+ desc->private_data = buf;

return 0;
}
@@ -749,16 +750,16 @@ static int relay_file_open(struct inode *inode, struct file *filp)
}

/**
- * relay_file_mmap - mmap file op for relay files
- * @filp: the file
- * @vma: the vma describing what to map
+ * relay_file_mmap_prepare - mmap file op for relay files
+ * @desc: describing what to map
*
- * Calls upon relay_mmap_buf() to map the file into user space.
+ * Calls upon relay_mmap_prepare_buf() to map the file into user space.
*/
-static int relay_file_mmap(struct file *filp, struct vm_area_struct *vma)
+static int relay_file_mmap_prepare(struct vm_area_desc *desc)
{
- struct rchan_buf *buf = filp->private_data;
- return relay_mmap_buf(buf, vma);
+ struct rchan_buf *buf = desc->file->private_data;
+
+ return relay_mmap_prepare_buf(buf, desc);
}

/**
@@ -1006,7 +1007,7 @@ static ssize_t relay_file_read(struct file *filp,
const struct file_operations relay_file_operations = {
.open = relay_file_open,
.poll = relay_file_poll,
- .mmap = relay_file_mmap,
+ .mmap_prepare = relay_file_mmap_prepare,
.read = relay_file_read,
.release = relay_file_release,
};
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:00 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Now that we have the f_op->mmap_prepare() hook, having a static function called
__mmap_prepare() that has nothing to do with it is confusing, so rename the
function to __mmap_setup().

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
mm/vma.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index abe0da33c844..36a9f4d453be 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2329,7 +2329,7 @@ static void update_ksm_flags(struct mmap_state *map)
}

/*
- * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ * __mmap_setup() - Prepare to gather any overlapping VMAs that need to be
* unmapped once the map operation is completed, check limits, account mapping
* and clean up any pre-existing VMAs.
*
@@ -2338,7 +2338,7 @@ static void update_ksm_flags(struct mmap_state *map)
*
* Returns: 0 on success, error code otherwise.
*/
-static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
+static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
{
int error;
struct vma_iterator *vmi = map->vmi;
@@ -2649,7 +2649,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

map.check_ksm_early = can_set_ksm_flags_early(&map);

- error = __mmap_prepare(&map, uf);
+ error = __mmap_setup(&map, uf);
if (!error && have_mmap_prepare)
error = call_mmap_prepare(&map);
if (error)
@@ -2679,7 +2679,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

return addr;

- /* Accounting was done by __mmap_prepare(). */
+ /* Accounting was done by __mmap_setup(). */
unacct_error:
if (map.charged)
vm_unacct_memory(map.charged);
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:02 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
We need the ability to split PFN remap between updating the VMA and
performing the actual remap, in order to do away with the legacy f_op->mmap
hook.

To do so, update the PFN remap code to provide shared logic, and also make
remap_pfn_range_notrack() static, as its one user, io_mapping_map_user()
was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").

Then, introduce remap_pfn_range_prepare(), which accepts VMA descriptor and
PFN parameters, and remap_pfn_range_complete(), which accepts the same
parameters as remap_pfn_range().

remap_pfn_range_prepare() will set the CoW vma->vm_pgoff if necessary, so
it must be supplied with a correct PFN to do so. If the caller must hold
locks to be able to do this, those locks should be held across the
operation, and mmap_abort() should be provided to revoke the lock should an
error arise.

While we're here, also clean up the duplicated #ifdef
__HAVE_PFNMAP_TRACKING check and put it into a single #ifdef/#else block.

We would prefer to define these functions in mm/internal.h, however we will
do the same for io_remap*() and these have arch defines that require access
to the remap functions.
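
For illustration, a caller would use the pair roughly as follows
(my_drv_base_pfn() is hypothetical; the remap_pfn_range_*() signatures match
this patch):

static int my_drv_mmap_prepare(struct vm_area_desc *desc)
{
        /* Sets VM_REMAP_FLAGS and, for CoW mappings, desc->pgoff = pfn. */
        remap_pfn_range_prepare(desc, my_drv_base_pfn(desc->file));
        return 0;
}

/* Later, once the VMA has been inserted into the maple tree: */
static int my_drv_prepopulate(struct vm_area_struct *vma, unsigned long pfn)
{
        return remap_pfn_range_complete(vma, vma->vm_start, pfn,
                                        vma->vm_end - vma->vm_start,
                                        vma->vm_page_prot);
}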

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
include/linux/mm.h | 25 +++++++--
mm/memory.c | 128 ++++++++++++++++++++++++++++-----------------
2 files changed, 102 insertions(+), 51 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b97589aec6d..0e256823799d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -489,6 +489,21 @@ extern unsigned int kobjsize(const void *objp);
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

+/*
+ * Physically remapped pages are special. Tell the
+ * rest of the world about it:
+ * VM_IO tells people not to look at these pages
+ * (accesses can have side effects).
+ * VM_PFNMAP tells the core MM that the base pages are just
+ * raw PFN mappings, and do not have a "struct page" associated
+ * with them.
+ * VM_DONTEXPAND
+ * Disable vma merging and expanding with mremap().
+ * VM_DONTDUMP
+ * Omit vma from core dump, even when VM_IO turned off.
+ */
+#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
+
/* This mask prevents VMA from being scanned with khugepaged */
#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)

@@ -3623,10 +3638,12 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,

struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
unsigned long addr);
-int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t);
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot);
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t pgprot);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t pgprot);
+
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
struct page **pages, unsigned long *num);
diff --git a/mm/memory.c b/mm/memory.c
index 3e0404bd57a0..5c4d5261996d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2903,8 +2903,27 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
return 0;
}

+static int get_remap_pgoff(vm_flags_t vm_flags, unsigned long addr,
+ unsigned long end, unsigned long vm_start, unsigned long vm_end,
+ unsigned long pfn, pgoff_t *vm_pgoff_p)
+{
+ /*
+ * There's a horrible special case to handle copy-on-write
+ * behaviour that some programs depend on. We mark the "original"
+ * un-COW'ed pages by matching them up with "vma->vm_pgoff".
+ * See vm_normal_page() for details.
+ */
+ if (is_cow_mapping(vm_flags)) {
+ if (addr != vm_start || end != vm_end)
+ return -EINVAL;
+ *vm_pgoff_p = pfn;
+ }
+
+ return 0;
+}
+
static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
pgd_t *pgd;
unsigned long next;
@@ -2915,32 +2934,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
return -EINVAL;

- /*
- * Physically remapped pages are special. Tell the
- * rest of the world about it:
- * VM_IO tells people not to look at these pages
- * (accesses can have side effects).
- * VM_PFNMAP tells the core MM that the base pages are just
- * raw PFN mappings, and do not have a "struct page" associated
- * with them.
- * VM_DONTEXPAND
- * Disable vma merging and expanding with mremap().
- * VM_DONTDUMP
- * Omit vma from core dump, even when VM_IO turned off.
- *
- * There's a horrible special case to handle copy-on-write
- * behaviour that some programs depend on. We mark the "original"
- * un-COW'ed pages by matching them up with "vma->vm_pgoff".
- * See vm_normal_page() for details.
- */
- if (is_cow_mapping(vma->vm_flags)) {
- if (addr != vma->vm_start || end != vma->vm_end)
- return -EINVAL;
- vma->vm_pgoff = pfn;
+ if (set_vma) {
+ err = get_remap_pgoff(vma->vm_flags, addr, end,
+ vma->vm_start, vma->vm_end,
+ pfn, &vma->vm_pgoff);
+ if (err)
+ return err;
+ vm_flags_set(vma, VM_REMAP_FLAGS);
+ } else {
+ VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);
}

- vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
-
BUG_ON(addr >= end);
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
@@ -2960,11 +2964,10 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
* Variant of remap_pfn_range that does not call track_pfn_remap. The caller
* must have pre-validated the caching bits of the pgprot_t.
*/
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
- int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
-
+ int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, set_vma);
if (!error)
return 0;

@@ -2977,6 +2980,18 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
return error;
}

+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+ /*
+ * We set addr=VMA start, end=VMA end here, so this won't fail, but we
+ * check it again on complete and will fail there if specified addr is
+ * invalid.
+ */
+ get_remap_pgoff(desc->vm_flags, desc->start, desc->end,
+ desc->start, desc->end, pfn, &desc->pgoff);
+ desc->vm_flags |= VM_REMAP_FLAGS;
+}
+
#ifdef __HAVE_PFNMAP_TRACKING
static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
unsigned long size, pgprot_t *prot)
@@ -3005,23 +3020,9 @@ void pfnmap_track_ctx_release(struct kref *ref)
pfnmap_untrack(ctx->pfn, ctx->size);
kfree(ctx);
}
-#endif /* __HAVE_PFNMAP_TRACKING */

-/**
- * remap_pfn_range - remap kernel memory to userspace
- * @vma: user vma to map to
- * @addr: target page aligned user address to start at
- * @pfn: page frame number of kernel physical memory address
- * @size: size of mapping area
- * @prot: page protection flags for this mapping
- *
- * Note: this is only safe if the mm semaphore is held when called.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-#ifdef __HAVE_PFNMAP_TRACKING
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_track(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
{
struct pfnmap_track_ctx *ctx = NULL;
int err;
@@ -3047,7 +3048,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
return -EINVAL;
}

- err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+ err = remap_pfn_range_notrack(vma, addr, pfn, size, prot, set_vma);
if (ctx) {
if (err)
kref_put(&ctx->kref, pfnmap_track_ctx_release);
@@ -3057,11 +3058,44 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
return err;
}

+/**
+ * remap_pfn_range - remap kernel memory to userspace
+ * @vma: user vma to map to
+ * @addr: target page aligned user address to start at
+ * @pfn: page frame number of kernel physical memory address
+ * @size: size of mapping area
+ * @prot: page protection flags for this mapping
+ *
+ * Note: this is only safe if the mm semaphore is held when called.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range_track(vma, addr, pfn, size, prot,
+ /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ /* With set_vma = false, the VMA will not be modified. */
+ return remap_pfn_range_track(vma, addr, pfn, size, prot,
+ /* set_vma = */false);
+}
#else
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t prot)
{
- return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+ return remap_pfn_range_notrack(vma, addr, pfn, size, prot, /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range_notrack(vma, addr, pfn, size, prot,
+ /* set_vma = */false);
}
#endif
EXPORT_SYMBOL(remap_pfn_range);
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:04 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
remap_pfn_range_complete() to allow for I/O remapping via mmap_prepare.

We have to make some architecture-specific changes for those architectures
which define customised handlers.

It doesn't really make sense to make these internal-only, as architectures
specify their own versions of these functions, so we declare them in mm.h.
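
As an illustration of the intended split, here is a minimal sketch (the
foo_* names and foo_pfn state are assumptions for illustration only, and how
the completion side gets invoked is left out): the prepare step operates
purely on the VMA descriptor before any VMA exists, while the complete step
performs the actual I/O remap once the VMA has been established.

static unsigned long foo_pfn;	/* assumed: PFN of the device region */

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	/* Descriptor-stage setup only; no VMA exists yet. */
	io_remap_pfn_range_prepare(desc, foo_pfn, vma_desc_size(desc));
	return 0;
}

static int foo_mmap_complete(struct vm_area_struct *vma)
{
	/* The VMA is now established, so it is safe to populate it. */
	return io_remap_pfn_range_complete(vma, vma->vm_start, foo_pfn,
			vma->vm_end - vma->vm_start, vma->vm_page_prot);
}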

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
arch/csky/include/asm/pgtable.h | 5 +++++
arch/mips/alchemy/common/setup.c | 28 +++++++++++++++++++++++++---
arch/mips/include/asm/pgtable.h | 10 ++++++++++
arch/sparc/include/asm/pgtable_32.h | 29 +++++++++++++++++++++++++----
arch/sparc/include/asm/pgtable_64.h | 29 +++++++++++++++++++++++++----
include/linux/mm.h | 18 ++++++++++++++++++
6 files changed, 108 insertions(+), 11 deletions(-)

diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 5a394be09c35..c83505839a06 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -266,4 +266,9 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
remap_pfn_range(vma, vaddr, pfn, size, prot)

+/* default io_remap_pfn_range_prepare can be used. */
+
+#define io_remap_pfn_range_complete(vma, addr, pfn, size, prot) \
+ remap_pfn_range_complete(vma, addr, pfn, size, prot)
+
#endif /* __ASM_CSKY_PGTABLE_H */
diff --git a/arch/mips/alchemy/common/setup.c b/arch/mips/alchemy/common/setup.c
index a7a6d31a7a41..a4ab02776994 100644
--- a/arch/mips/alchemy/common/setup.c
+++ b/arch/mips/alchemy/common/setup.c
@@ -94,12 +94,34 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t phys_addr, phys_addr_t size)
return phys_addr;
}

-int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
{
phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);

- return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
+ return phys_addr >> PAGE_SHIFT;
+}
+
+int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, vaddr, calc_pfn(pfn, size), size, prot);
}
EXPORT_SYMBOL(io_remap_pfn_range);
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_pfn(pfn, size));
+}
+EXPORT_SYMBOL(io_remap_pfn_range_prepare);
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_pfn(pfn, size),
+ size, prot);
+}
+EXPORT_SYMBOL(io_remap_pfn_range_complete);
+
#endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index ae73ecf4c41a..6a8964f55a31 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -607,6 +607,16 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t addr, phys_addr_t size);
int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
unsigned long pfn, unsigned long size, pgprot_t prot);
#define io_remap_pfn_range io_remap_pfn_range
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size);
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot);
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
#else
#define fixup_bigphys_addr(addr, size) (addr)
#endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 7c199c003ffe..cfd764afc107 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -398,9 +398,7 @@ __get_iospace (unsigned long addr)
int remap_pfn_range(struct vm_area_struct *, unsigned long, unsigned long,
unsigned long, pgprot_t);

-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
- unsigned long from, unsigned long pfn,
- unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
{
unsigned long long offset, space, phys_base;

@@ -408,10 +406,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
space = GET_IOSPACE(pfn);
phys_base = offset | (space << 32ULL);

- return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+ return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
}
#define io_remap_pfn_range io_remap_pfn_range

+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+ size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
({ \
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 669cd02469a1..b8000ce4b59f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1084,9 +1084,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
return 0;
}

-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
- unsigned long from, unsigned long pfn,
- unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
{
unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;
int space = GET_IOSPACE(pfn);
@@ -1094,10 +1092,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,

phys_base = offset | (((unsigned long) space) << 32UL);

- return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+ return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
}
#define io_remap_pfn_range io_remap_pfn_range

+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+ size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
static inline unsigned long __untagged_addr(unsigned long start)
{
if (adi_capable()) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e256823799d..cca149bb8ef1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3685,6 +3685,24 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
}
#endif

+#ifndef io_remap_pfn_range_prepare
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+ unsigned long size)
+{
+ remap_pfn_range_prepare(desc, pfn);
+}
+#endif
+
+#ifndef io_remap_pfn_range_complete
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range_complete(vma, addr, pfn, size,
+ pgprot_decrypted(prot));
+}
+#endif
+
static inline vm_fault_t vmf_error(int err)
{
if (err == -ENOMEM)
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:37 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Now that we have introduced the ability to specify actions to be taken after
a VMA is established, via the vm_area_desc->action field set in mmap_prepare,
update both the VFS documentation and the porting guide to describe this.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
Documentation/filesystems/porting.rst | 5 +++++
Documentation/filesystems/vfs.rst | 4 ++++
2 files changed, 9 insertions(+)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..6743ed0b9112 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1285,3 +1285,8 @@ rather than a VMA, as the VMA at this stage is not yet valid.
The vm_area_desc provides the minimum required information for a filesystem
to initialise state upon memory mapping of a file-backed region, and output
parameters for the file system to set this state.
+
+In nearly all cases, this is all that is required for a filesystem. However, if
+a filesystem needs to perform an operation such as pre-population of page tables,
+then that action can be specified in the vm_area_desc->action field, which can
+be configured using the mmap_action_*() helpers.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..9e96c46ee10e 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1236,6 +1236,10 @@ otherwise noted.
file-backed memory mapping, most notably establishing relevant
private state and VMA callbacks.

+ If further action such as pre-population of page tables is required,
+ this can be specified by the vm_area_desc->action field and related
+ parameters.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:37 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Some drivers/filesystems need to perform additional tasks after the VMA is
set up. This is typically in the form of pre-population.

The forms of pre-population most likely to be performed are a PFN remap or
insertion of a mixed map, so we provide this functionality, ensuring that
we perform the appropriate actions at the appropriate time - that is
setting flags at the point of .mmap_prepare, and performing the actual
remap at the point at which the VMA is fully established.

This prevents the driver from doing anything too crazy with a VMA at any
stage, and we retain complete control over how the mm functionality is
applied.

Unfortunately, callers do still often require some kind of custom action, so
we add optional success/error hooks to allow the caller to do something
after the action has succeeded or failed.

This is done at the point when the VMA has already been established, so the
harm that can be done is limited.

The error hook can be used to filter errors if necessary.

We implement actions as abstracted from the vm_area_desc, so custom hooks
are able to invoke actions distinct from those specified in the VMA
descriptor.

If any error arises on these final actions, we simply unmap the VMA
altogether.

Also update the stacked filesystem compatibility layer to utilise the
action behaviour, and update the VMA tests accordingly.

For drivers which perform truly custom logic, we provide a custom action
hook which is invoked at the point of action execution.

This can then, in turn, update the desc object and perform other actions,
such as partially remapping ranges, for instance. We export
mmap_action_prepare() and mmap_action_complete() for drivers to do
this.

This is performed at a stage where the VMA is already established,
immediately prior to mapping completion, so it is considerably less
problematic than a general mmap hook.

Note that at the point of the action being taken, the VMA is visible via
the rmap and only the VMA write lock is held, so anything which needs to
access the VMA is able to do so.

Essentially the action is taken as if it were performed after the mapping,
but is kept atomic with VMA state.
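
As a concrete illustration, a driver's .mmap_prepare might request a PFN
remap action together with an error filter roughly as follows. This is a
sketch only: the foo_* names are assumptions, and the resctrl and /dev/mem
conversions later in this series are the real instances of this pattern.

static int foo_filter_error(int err)
{
	/* For instance, normalise any remap failure to -EAGAIN. */
	return -EAGAIN;
}

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	desc->vm_ops = &foo_vm_ops;	/* assumed vm_operations_struct */

	/* Ask the mm to perform the remap once the VMA is established. */
	mmap_action_remap(&desc->action, desc->start, foo_pfn,
			  vma_desc_size(desc), desc->page_prot);
	desc->action.error_hook = foo_filter_error;

	return 0;
}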

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
include/linux/mm.h | 30 ++++++
include/linux/mm_types.h | 61 ++++++++++++
mm/util.c | 150 +++++++++++++++++++++++++++-
mm/vma.c | 70 ++++++++-----
tools/testing/vma/vma_internal.h | 164 ++++++++++++++++++++++++++++++-
5 files changed, 447 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cca149bb8ef1..2ceead3ffcf0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3597,6 +3597,36 @@ static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
return vma_desc_size(desc) >> PAGE_SHIFT;
}

+static inline void mmap_action_remap(struct mmap_action *action,
+ unsigned long addr, unsigned long pfn, unsigned long size,
+ pgprot_t pgprot)
+{
+ action->type = MMAP_REMAP_PFN;
+
+ action->remap.addr = addr;
+ action->remap.pfn = pfn;
+ action->remap.size = size;
+ action->remap.pgprot = pgprot;
+}
+
+static inline void mmap_action_mixedmap(struct mmap_action *action,
+ unsigned long addr, unsigned long pfn, unsigned long num_pages)
+{
+ action->type = MMAP_INSERT_MIXED;
+
+ action->mixedmap.addr = addr;
+ action->mixedmap.pfn = pfn;
+ action->mixedmap.num_pages = num_pages;
+}
+
+struct page **mmap_action_mixedmap_pages(struct mmap_action *action,
+ unsigned long addr, unsigned long num_pages);
+
+void mmap_action_prepare(struct mmap_action *action,
+ struct vm_area_desc *desc);
+int mmap_action_complete(struct mmap_action *action,
+ struct vm_area_struct *vma);
+
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
unsigned long vm_start, unsigned long vm_end)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a441f78340d..ae6c7a0a18a7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -770,6 +770,64 @@ struct pfnmap_track_ctx {
};
#endif

+/* What action should be taken after an .mmap_prepare call is complete? */
+enum mmap_action_type {
+ MMAP_NOTHING, /* Mapping is complete, no further action. */
+ MMAP_REMAP_PFN, /* Remap PFN range based on desc->remap. */
+ MMAP_INSERT_MIXED, /* Mixed map based on desc->mixedmap. */
+ MMAP_INSERT_MIXED_PAGES, /* Mixed map based on desc->mixedmap_pages. */
+ MMAP_CUSTOM_ACTION, /* User-provided hook. */
+};
+
+struct mmap_action {
+ union {
+ /* Remap range. */
+ struct {
+ unsigned long addr;
+ unsigned long pfn;
+ unsigned long size;
+ pgprot_t pgprot;
+ } remap;
+ /* Insert mixed map. */
+ struct {
+ unsigned long addr;
+ unsigned long pfn;
+ unsigned long num_pages;
+ } mixedmap;
+ /* Insert specific mixed map pages. */
+ struct {
+ unsigned long addr;
+ struct page **pages;
+ unsigned long num_pages;
+ /* kfree pages on completion? */
+ bool kfree_pages :1;
+ } mixedmap_pages;
+ struct {
+ int (*action_hook)(struct vm_area_struct *vma);
+ } custom;
+ };
+ enum mmap_action_type type;
+
+ /*
+ * If specified, this hook is invoked after the selected action has been
+ * successfully completed. Note that the VMA write lock is still held.
+ *
+ * The absolute minimum ought to be done here.
+ *
+ * Returns 0 on success, or an error code.
+ */
+ int (*success_hook)(struct vm_area_struct *vma);
+
+ /*
+ * If specified, this hook is invoked when an error occurs while
+ * attempting the selected action.
+ *
+ * The hook can return an error code in order to filter the error, but
+ * it is not valid to clear the error here.
+ */
+ int (*error_hook)(int err);
+};
+
/*
* Describes a VMA that is about to be mmap()'ed. Drivers may choose to
* manipulate mutable fields which will cause those fields to be updated in the
@@ -793,6 +851,9 @@ struct vm_area_desc {
/* Write-only fields. */
const struct vm_operations_struct *vm_ops;
void *private_data;
+
+ /* Take further action? */
+ struct mmap_action action;
};

/*
diff --git a/mm/util.c b/mm/util.c
index 248f877f629b..11752d67b89c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1155,15 +1155,18 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
.vm_file = vma->vm_file,
.vm_flags = vma->vm_flags,
.page_prot = vma->vm_page_prot,
+
+ .action.type = MMAP_NOTHING, /* Default */
};
int err;

err = f_op->mmap_prepare(&desc);
if (err)
return err;
- set_vma_from_desc(vma, &desc);

- return 0;
+ mmap_action_prepare(&desc.action, &desc);
+ set_vma_from_desc(vma, &desc);
+ return mmap_action_complete(&desc.action, vma);
}
EXPORT_SYMBOL(__compat_vma_mmap_prepare);

@@ -1279,6 +1282,149 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
}
}

+struct page **mmap_action_mixedmap_pages(struct mmap_action *action,
+ unsigned long addr, unsigned long num_pages)
+{
+ struct page **pages;
+
+ pages = kmalloc_array(num_pages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return NULL;
+
+ action->type = MMAP_INSERT_MIXED_PAGES;
+
+ action->mixedmap_pages.addr = addr;
+ action->mixedmap_pages.num_pages = num_pages;
+ action->mixedmap_pages.kfree_pages = true;
+ action->mixedmap_pages.pages = pages;
+
+ return pages;
+}
+EXPORT_SYMBOL(mmap_action_mixedmap_pages);
+
+/**
+ * mmap_action_prepare - Perform preparatory setup for an VMA descriptor
+ * action which need to be performed.
+ * @desc: The VMA descriptor to prepare for @action.
+ * @action: The action to perform.
+ *
+ * Other than internal mm use, this is intended to be used by mmap_prepare code
+ * which specifies a custom action hook and needs to prepare for another action
+ * it wishes to perform.
+ */
+void mmap_action_prepare(struct mmap_action *action,
+ struct vm_area_desc *desc)
+{
+ switch (action->type) {
+ case MMAP_NOTHING:
+ case MMAP_CUSTOM_ACTION:
+ break;
+ case MMAP_REMAP_PFN:
+ remap_pfn_range_prepare(desc, action->remap.pfn);
+ break;
+ case MMAP_INSERT_MIXED:
+ case MMAP_INSERT_MIXED_PAGES:
+ desc->vm_flags |= VM_MIXEDMAP;
+ break;
+ }
+}
+EXPORT_SYMBOL(mmap_action_prepare);
+
+/**
+ * mmap_action_complete - Execute VMA descriptor action.
+ * @action: The action to perform.
+ * @vma: The VMA to perform the action upon.
+ *
+ * Similar to mmap_action_prepare(), other than internal mm usage this is
+ * intended for mmap_prepare users who implement a custom hook - with this
+ * function being called from the custom hook itself.
+ *
+ * Return: 0 on success, or error, at which point the VMA will be unmapped.
+ */
+int mmap_action_complete(struct mmap_action *action,
+ struct vm_area_struct *vma)
+{
+ int err = 0;
+
+ switch (action->type) {
+ case MMAP_NOTHING:
+ break;
+ case MMAP_REMAP_PFN:
+ VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) !=
+ VM_REMAP_FLAGS);
+
+ err = remap_pfn_range_complete(vma, action->remap.addr,
+ action->remap.pfn, action->remap.size,
+ action->remap.pgprot);
+
+ break;
+ case MMAP_INSERT_MIXED:
+ {
+ unsigned long pgnum = 0;
+ unsigned long pfn = action->mixedmap.pfn;
+ unsigned long vaddr = action->mixedmap.addr;
+
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MIXEDMAP));
+
+ for (; pgnum < action->mixedmap.num_pages;
+ pgnum++, pfn++, vaddr += PAGE_SIZE) {
+ vm_fault_t vmf;
+
+ vmf = vmf_insert_mixed(vma, vaddr, pfn);
+ if (vmf & VM_FAULT_ERROR) {
+ err = vm_fault_to_errno(vmf, 0);
+ break;
+ }
+ }
+
+ break;
+ }
+ case MMAP_INSERT_MIXED_PAGES:
+ {
+ struct page **pages = action->mixedmap_pages.pages;
+ unsigned long nr_pages = action->mixedmap_pages.num_pages;
+
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MIXEDMAP));
+
+ err = vm_insert_pages(vma, action->mixedmap_pages.addr,
+ pages, &nr_pages);
+ if (action->mixedmap_pages.kfree_pages)
+ kfree(pages);
+ break;
+ }
+ case MMAP_CUSTOM_ACTION:
+ err = action->custom.action_hook(vma);
+ break;
+ }
+
+ /*
+ * If an error occurs, unmap the VMA altogether and return an error. We
+ * only clear the newly allocated VMA, since this function is only
+ * invoked if we do NOT merge, so we only clean up the VMA we created.
+ */
+ if (err) {
+ const size_t len = vma_pages(vma) << PAGE_SHIFT;
+
+ do_munmap(current->mm, vma->vm_start, len, NULL);
+
+ if (action->error_hook) {
+ /* We may want to filter the error. */
+ err = action->error_hook(err);
+
+ /* The caller should not clear the error. */
+ VM_WARN_ON_ONCE(!err);
+ }
+ return err;
+ }
+
+ if (action->success_hook)
+ err = action->success_hook(vma);
+
+ return err;
+}
+EXPORT_SYMBOL(mmap_action_complete);
+
#ifdef CONFIG_MMU
/**
* folio_pte_batch - detect a PTE batch for a large folio
diff --git a/mm/vma.c b/mm/vma.c
index 36a9f4d453be..a1ec405bda25 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2328,17 +2328,33 @@ static void update_ksm_flags(struct mmap_state *map)
map->vm_flags = ksm_vma_flags(map->mm, map->file, map->vm_flags);
}

+static void set_desc_from_map(struct vm_area_desc *desc,
+ const struct mmap_state *map)
+{
+ desc->start = map->addr;
+ desc->end = map->end;
+
+ desc->pgoff = map->pgoff;
+ desc->vm_file = map->file;
+ desc->vm_flags = map->vm_flags;
+ desc->page_prot = map->page_prot;
+}
+
/*
* __mmap_setup() - Prepare to gather any overlapping VMAs that need to be
* unmapped once the map operation is completed, check limits, account mapping
* and clean up any pre-existing VMAs.
*
+ * As a result it sets up the @map and @desc objects.
+ *
* @map: Mapping state.
+ * @desc: VMA descriptor
* @uf: Userfaultfd context list.
*
* Returns: 0 on success, error code otherwise.
*/
-static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
+static int __mmap_setup(struct mmap_state *map, struct vm_area_desc *desc,
+ struct list_head *uf)
{
int error;
struct vma_iterator *vmi = map->vmi;
@@ -2395,6 +2411,7 @@ static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
*/
vms_clean_up_area(vms, &map->mas_detach);

+ set_desc_from_map(desc, map);
return 0;
}

@@ -2567,34 +2584,26 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
*
* Returns 0 on success, or an error code otherwise.
*/
-static int call_mmap_prepare(struct mmap_state *map)
+static int call_mmap_prepare(struct mmap_state *map,
+ struct vm_area_desc *desc)
{
int err;
- struct vm_area_desc desc = {
- .mm = map->mm,
- .file = map->file,
- .start = map->addr,
- .end = map->end,
-
- .pgoff = map->pgoff,
- .vm_file = map->file,
- .vm_flags = map->vm_flags,
- .page_prot = map->page_prot,
- };

/* Invoke the hook. */
- err = vfs_mmap_prepare(map->file, &desc);
+ err = vfs_mmap_prepare(map->file, desc);
if (err)
return err;

+ mmap_action_prepare(&desc->action, desc);
+
/* Update fields permitted to be changed. */
- map->pgoff = desc.pgoff;
- map->file = desc.vm_file;
- map->vm_flags = desc.vm_flags;
- map->page_prot = desc.page_prot;
+ map->pgoff = desc->pgoff;
+ map->file = desc->vm_file;
+ map->vm_flags = desc->vm_flags;
+ map->page_prot = desc->page_prot;
/* User-defined fields. */
- map->vm_ops = desc.vm_ops;
- map->vm_private_data = desc.private_data;
+ map->vm_ops = desc->vm_ops;
+ map->vm_private_data = desc->private_data;

return 0;
}
@@ -2642,16 +2651,24 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
- int error;
bool have_mmap_prepare = file && file->f_op->mmap_prepare;
VMA_ITERATOR(vmi, mm, addr);
MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
+ struct vm_area_desc desc = {
+ .mm = mm,
+ .file = file,
+ .action = {
+ .type = MMAP_NOTHING, /* Default to no further action. */
+ },
+ };
+ bool allocated_new = false;
+ int error;

map.check_ksm_early = can_set_ksm_flags_early(&map);

- error = __mmap_setup(&map, uf);
+ error = __mmap_setup(&map, &desc, uf);
if (!error && have_mmap_prepare)
- error = call_mmap_prepare(&map);
+ error = call_mmap_prepare(&map, &desc);
if (error)
goto abort_munmap;

@@ -2670,6 +2687,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
error = __mmap_new_vma(&map, &vma);
if (error)
goto unacct_error;
+ allocated_new = true;
}

if (have_mmap_prepare)
@@ -2677,6 +2695,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

__mmap_complete(&map, vma);

+ if (have_mmap_prepare && allocated_new) {
+ error = mmap_action_complete(&desc.action, vma);
+ if (error)
+ return error;
+ }
+
return addr;

/* Accounting was done by __mmap_setup(). */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07167446dcf4..c21642974798 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -170,6 +170,28 @@ typedef __bitwise unsigned int vm_fault_t;
#define swap(a, b) \
do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

+enum vm_fault_reason {
+ VM_FAULT_OOM = (__force vm_fault_t)0x000001,
+ VM_FAULT_SIGBUS = (__force vm_fault_t)0x000002,
+ VM_FAULT_MAJOR = (__force vm_fault_t)0x000004,
+ VM_FAULT_HWPOISON = (__force vm_fault_t)0x000010,
+ VM_FAULT_HWPOISON_LARGE = (__force vm_fault_t)0x000020,
+ VM_FAULT_SIGSEGV = (__force vm_fault_t)0x000040,
+ VM_FAULT_NOPAGE = (__force vm_fault_t)0x000100,
+ VM_FAULT_LOCKED = (__force vm_fault_t)0x000200,
+ VM_FAULT_RETRY = (__force vm_fault_t)0x000400,
+ VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800,
+ VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
+ VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
+ VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+ VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
+};
+#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | \
+ VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON | \
+ VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)
+
+#define FOLL_HWPOISON (1 << 6)
+
struct kref {
refcount_t refcount;
};
@@ -274,6 +296,92 @@ struct mm_struct {

struct vm_area_struct;

+/* What action should be taken after an .mmap_prepare call is complete? */
+enum mmap_action_type {
+ MMAP_NOTHING, /* Mapping is complete, no further action. */
+ MMAP_REMAP_PFN, /* Remap PFN range based on desc->remap. */
+ MMAP_INSERT_MIXED, /* Mixed map based on desc->mixedmap. */
+ MMAP_INSERT_MIXED_PAGES, /* Mixed map based on desc->mixedmap_pages. */
+ MMAP_CUSTOM_ACTION, /* User-provided hook. */
+};
+
+struct mmap_action {
+ union {
+ /* Remap range. */
+ struct {
+ unsigned long addr;
+ unsigned long pfn;
+ unsigned long size;
+ pgprot_t pgprot;
+ } remap;
+ /* Insert mixed map. */
+ struct {
+ unsigned long addr;
+ unsigned long pfn;
+ unsigned long num_pages;
+ } mixedmap;
+ /* Insert specific mixed map pages. */
+ struct {
+ unsigned long addr;
+ struct page **pages;
+ unsigned long num_pages;
+ /* kfree pages on completion? */
+ bool kfree_pages :1;
+ } mixedmap_pages;
+ struct {
+ int (*action_hook)(struct vm_area_struct *vma);
+ } custom;
+ };
+ enum mmap_action_type type;
+
+ /*
+ * If specified, this hook is invoked after the selected action has been
+ * successfully completed. Note that the VMA write lock is still held.
+ *
+ * The absolute minimum ought to be done here.
+ *
+ * Returns 0 on success, or an error code.
+ */
+ int (*success_hook)(struct vm_area_struct *vma);
+
+ /*
+ * If specified, this hook is invoked when an error occurs while
+ * attempting the selected action.
+ *
+ * The hook can return an error code in order to filter the error, but
+ * it is not valid to clear the error here.
+ */
+ int (*error_hook)(int err);
+};
+
/*
* Describes a VMA that is about to be mmap()'ed. Drivers may choose to
* manipulate mutable fields which will cause those fields to be updated in the
@@ -297,6 +405,9 @@ struct vm_area_desc {
/* Write-only fields. */
const struct vm_operations_struct *vm_ops;
void *private_data;
+
+ /* Take further action? */
+ struct mmap_action action;
};

struct file_operations {
@@ -1466,12 +1577,23 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
static inline void set_vma_from_desc(struct vm_area_struct *vma,
struct vm_area_desc *desc);

+static inline void mmap_action_prepare(struct mmap_action *action,
+ struct vm_area_desc *desc)
+{
+}
+
+static inline int mmap_action_complete(struct mmap_action *action,
+ struct vm_area_struct *vma)
+{
+ return 0;
+}
+
static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
struct file *file, struct vm_area_struct *vma)
{
struct vm_area_desc desc = {
.mm = vma->vm_mm,
- .file = vma->vm_file,
+ .file = file,
.start = vma->vm_start,
.end = vma->vm_end,

@@ -1479,15 +1601,18 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
.vm_file = vma->vm_file,
.vm_flags = vma->vm_flags,
.page_prot = vma->vm_page_prot,
+
+ .action.type = MMAP_NOTHING, /* Default */
};
int err;

err = f_op->mmap_prepare(&desc);
if (err)
return err;
- set_vma_from_desc(vma, &desc);

- return 0;
+ mmap_action_prepare(&desc.action, &desc);
+ set_vma_from_desc(vma, &desc);
+ return mmap_action_complete(&desc.action, vma);
}

static inline int compat_vma_mmap_prepare(struct file *file,
@@ -1548,4 +1673,37 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
return vm_flags;
}

+static inline void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+}
+
+static inline int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t pgprot)
+{
+ return 0;
+}
+
+static inline vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn)
+{
+ return 0;
+}
+
+static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
+{
+ if (vm_fault & VM_FAULT_OOM)
+ return -ENOMEM;
+ if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+ return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+ if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
+ return -EFAULT;
+ return 0;
+}
+
+static inline int do_munmap(struct mm_struct *, unsigned long, size_t,
+ struct list_head *uf)
+{
+ return 0;
+}
+

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:38 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Since we can now perform actions after the VMA is established via
mmap_prepare, use desc->action.success_hook to set up the hugetlb VMA lock
once the VMA is set up.

We also make changes throughout hugetlbfs to make this possible.
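
The general pattern, of which the hugetlbfs change below is an instance, is
roughly as follows (a sketch only; the foo_* names are assumptions):

static int foo_post_mmap(struct vm_area_struct *vma)
{
	/* Runs once the VMA is established, with the VMA write lock held. */
	return foo_alloc_per_vma_state(vma);	/* assumed helper */
}

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	/* ... usual descriptor setup ... */
	desc->action.success_hook = foo_post_mmap;
	return 0;
}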

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/hugetlbfs/inode.c | 30 +++++++------
include/linux/hugetlb.h | 9 +++-
include/linux/hugetlb_inline.h | 15 ++++---
mm/hugetlb.c | 77 ++++++++++++++++++++--------------
4 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3cfdf4091001..026bcc65bb79 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -96,8 +96,9 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
#define PGOFF_LOFFT_MAX \
(((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))

-static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+static int hugetlbfs_file_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *file = desc->file;
struct inode *inode = file_inode(file);
loff_t len, vma_len;
int ret;
@@ -112,8 +113,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
* way when do_mmap unwinds (may be important on powerpc
* and ia64).
*/
- vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
- vma->vm_ops = &hugetlb_vm_ops;
+ desc->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+ desc->vm_ops = &hugetlb_vm_ops;

/*
* page based offset in vm_pgoff could be sufficiently large to
@@ -122,16 +123,16 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
* sizeof(unsigned long). So, only check in those instances.
*/
if (sizeof(unsigned long) == sizeof(loff_t)) {
- if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
+ if (desc->pgoff & PGOFF_LOFFT_MAX)
return -EINVAL;
}

/* must be huge page aligned */
- if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
+ if (desc->pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;

- vma_len = (loff_t)(vma->vm_end - vma->vm_start);
- len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ vma_len = (loff_t)vma_desc_size(desc);
+ len = vma_len + ((loff_t)desc->pgoff << PAGE_SHIFT);
/* check for overflow */
if (len < vma_len)
return -EINVAL;
@@ -141,7 +142,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)

ret = -ENOMEM;

- vm_flags = vma->vm_flags;
+ vm_flags = desc->vm_flags;
/*
* for SHM_HUGETLB, the pages are reserved in the shmget() call so skip
* reserving here. Note: only for SHM hugetlbfs file, the inode
@@ -151,17 +152,20 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
vm_flags |= VM_NORESERVE;

if (hugetlb_reserve_pages(inode,
- vma->vm_pgoff >> huge_page_order(h),
- len >> huge_page_shift(h), vma,
- vm_flags) < 0)
+ desc->pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h), desc,
+ vm_flags) < 0)
goto out;

ret = 0;
- if (vma->vm_flags & VM_WRITE && inode->i_size < len)
+ if ((desc->vm_flags & VM_WRITE) && inode->i_size < len)
i_size_write(inode, len);
out:
inode_unlock(inode);

+ /* Allocate the VMA lock after we set it up. */
+ if (!ret)
+ desc->action.success_hook = hugetlb_vma_lock_alloc;
return ret;
}

@@ -1219,7 +1223,7 @@ static void init_once(void *foo)

static const struct file_operations hugetlbfs_file_operations = {
.read_iter = hugetlbfs_read_iter,
- .mmap = hugetlbfs_file_mmap,
+ .mmap_prepare = hugetlbfs_file_mmap_prepare,
.fsync = noop_fsync,
.get_unmapped_area = hugetlb_get_unmapped_area,
.llseek = default_llseek,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 526d27e88b3b..b39f2b70ccab 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -150,8 +150,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
struct folio **foliop);
#endif /* CONFIG_USERFAULTFD */
long hugetlb_reserve_pages(struct inode *inode, long from, long to,
- struct vm_area_struct *vma,
- vm_flags_t vm_flags);
+ struct vm_area_desc *desc, vm_flags_t vm_flags);
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
@@ -280,6 +279,7 @@ bool is_hugetlb_entry_hwpoisoned(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
void fixup_hugetlb_reservations(struct vm_area_struct *vma);
void hugetlb_split(struct vm_area_struct *vma, unsigned long addr);
+int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);

#else /* !CONFIG_HUGETLB_PAGE */

@@ -466,6 +466,11 @@ static inline void fixup_hugetlb_reservations(struct vm_area_struct *vma)

static inline void hugetlb_split(struct vm_area_struct *vma, unsigned long addr) {}

+static inline int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+ return 0;
+}
+
#endif /* !CONFIG_HUGETLB_PAGE */

#ifndef pgd_write
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..a27aa0162918 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -2,22 +2,27 @@
#ifndef _LINUX_HUGETLB_INLINE_H
#define _LINUX_HUGETLB_INLINE_H

-#ifdef CONFIG_HUGETLB_PAGE
-
#include <linux/mm.h>

-static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+#ifdef CONFIG_HUGETLB_PAGE
+
+static inline bool is_vm_hugetlb_flags(vm_flags_t vm_flags)
{
- return !!(vma->vm_flags & VM_HUGETLB);
+ return !!(vm_flags & VM_HUGETLB);
}

#else

-static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+static inline bool is_vm_hugetlb_flags(vm_flags_t vm_flags)
{
return false;
}

#endif

+static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+{
+ return is_vm_hugetlb_flags(vma->vm_flags);
+}
+
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d812ad8f0b9f..cb6eda43cb7f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -119,7 +119,6 @@ struct mutex *hugetlb_fault_mutex_table __ro_after_init;
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start, unsigned long end, bool take_locks);
@@ -417,17 +416,21 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
}
}

-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+/*
+ * vma specific semaphore used for pmd sharing and fault/truncation
+ * synchronization
+ */
+int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
{
struct hugetlb_vma_lock *vma_lock;

/* Only establish in (flags) sharable vmas */
if (!vma || !(vma->vm_flags & VM_MAYSHARE))
- return;
+ return 0;

/* Should never get here with non-NULL vm_private_data */
if (vma->vm_private_data)
- return;
+ return -EINVAL;

vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
if (!vma_lock) {
@@ -442,13 +445,15 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
* allocation failure.
*/
pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
- return;
+ return -EINVAL;
}

kref_init(&vma_lock->refs);
init_rwsem(&vma_lock->rw_sema);
vma_lock->vma = vma;
vma->vm_private_data = vma_lock;
+
+ return 0;
}

/* Helper that removes a struct file_region from the resv_map cache and returns
@@ -1180,20 +1185,28 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
}
}

-static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
+static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
{
- VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
- VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma);
+ VM_WARN_ON_ONCE_VMA(!is_vm_hugetlb_page(vma), vma);
+ VM_WARN_ON_ONCE_VMA(vma->vm_flags & VM_MAYSHARE, vma);

- set_vma_private_data(vma, (unsigned long)map);
+ set_vma_private_data(vma, get_vma_private_data(vma) | flags);
}

-static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
+static void set_vma_desc_resv_map(struct vm_area_desc *desc, struct resv_map *map)
{
- VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
- VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma);
+ VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+ VM_WARN_ON_ONCE(desc->vm_flags & VM_MAYSHARE);

- set_vma_private_data(vma, get_vma_private_data(vma) | flags);
+ desc->private_data = map;
+}
+
+static void set_vma_desc_resv_flags(struct vm_area_desc *desc, unsigned long flags)
+{
+ VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+ VM_WARN_ON_ONCE(desc->vm_flags & VM_MAYSHARE);
+
+ desc->private_data = (void *)((unsigned long)desc->private_data | flags);
}

static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
@@ -1203,6 +1216,13 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
return (get_vma_private_data(vma) & flag) != 0;
}

+static bool is_vma_desc_resv_set(struct vm_area_desc *desc, unsigned long flag)
+{
+ VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+
+ return ((unsigned long)desc->private_data) & flag;
+}
+
bool __vma_private_lock(struct vm_area_struct *vma)
{
return !(vma->vm_flags & VM_MAYSHARE) &&
@@ -7225,9 +7245,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
*/

long hugetlb_reserve_pages(struct inode *inode,
- long from, long to,
- struct vm_area_struct *vma,
- vm_flags_t vm_flags)
+ long from, long to,
+ struct vm_area_desc *desc,
+ vm_flags_t vm_flags)
{
long chg = -1, add = -1, spool_resv, gbl_resv;
struct hstate *h = hstate_inode(inode);
@@ -7242,12 +7262,6 @@ long hugetlb_reserve_pages(struct inode *inode,
return -EINVAL;
}

- /*
- * vma specific semaphore used for pmd sharing and fault/truncation
- * synchronization
- */
- hugetlb_vma_lock_alloc(vma);
-
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
@@ -7260,9 +7274,9 @@ long hugetlb_reserve_pages(struct inode *inode,
* Shared mappings base their reservation on the number of pages that
* are already allocated on behalf of the file. Private mappings need
* to reserve the full area even if read-only as mprotect() may be
- * called to make the mapping read-write. Assume !vma is a shm mapping
+ * called to make the mapping read-write. Assume !desc is a shm mapping
*/
- if (!vma || vma->vm_flags & VM_MAYSHARE) {
+ if (!desc || desc->vm_flags & VM_MAYSHARE) {
/*
* resv_map can not be NULL as hugetlb_reserve_pages is only
* called for inodes for which resv_maps were created (see
@@ -7279,8 +7293,8 @@ long hugetlb_reserve_pages(struct inode *inode,

chg = to - from;

- set_vma_resv_map(vma, resv_map);
- set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
+ set_vma_desc_resv_map(desc, resv_map);
+ set_vma_desc_resv_flags(desc, HPAGE_RESV_OWNER);
}

if (chg < 0)
@@ -7290,7 +7304,7 @@ long hugetlb_reserve_pages(struct inode *inode,
chg * pages_per_huge_page(h), &h_cg) < 0)
goto out_err;

- if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
+ if (desc && !(desc->vm_flags & VM_MAYSHARE) && h_cg) {
/* For private mappings, the hugetlb_cgroup uncharge info hangs
* of the resv_map.
*/
@@ -7324,7 +7338,7 @@ long hugetlb_reserve_pages(struct inode *inode,
* consumed reservations are stored in the map. Hence, nothing
* else has to be done for private mappings here
*/
- if (!vma || vma->vm_flags & VM_MAYSHARE) {
+ if (!desc || desc->vm_flags & VM_MAYSHARE) {
add = region_add(resv_map, from, to, regions_needed, h, h_cg);

if (unlikely(add < 0)) {
@@ -7378,16 +7392,15 @@ long hugetlb_reserve_pages(struct inode *inode,
hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
chg * pages_per_huge_page(h), h_cg);
out_err:
- hugetlb_vma_lock_free(vma);
- if (!vma || vma->vm_flags & VM_MAYSHARE)
+ if (!desc || desc->vm_flags & VM_MAYSHARE)
/* Only call region_abort if the region_chg succeeded but the
* region_add failed or didn't run.
*/
if (chg >= 0 && add < 0)
region_abort(resv_map, from, to, regions_needed);
- if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ if (desc && is_vma_desc_resv_set(desc, HPAGE_RESV_OWNER)) {
kref_put(&resv_map->refs, resv_map_release);
- set_vma_resv_map(vma, NULL);
+ set_vma_desc_resv_map(desc, NULL);
}
return chg < 0 ? chg : add < 0 ? add : -EINVAL;
}
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:39 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Make use of the ability to specify a remap action within mmap_prepare to
update the resctrl pseudo-lock code to use mmap_prepare in favour of the
deprecated mmap hook.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/resctrl/pseudo_lock.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index 87bbc2605de1..e847df586766 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -995,10 +995,11 @@ static const struct vm_operations_struct pseudo_mmap_ops = {
.mremap = pseudo_lock_dev_mremap,
};

-static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
+static int pseudo_lock_dev_mmap_prepare(struct vm_area_desc *desc)
{
- unsigned long vsize = vma->vm_end - vma->vm_start;
- unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+ unsigned long off = desc->pgoff << PAGE_SHIFT;
+ unsigned long vsize = vma_desc_size(desc);
+ struct file *filp = desc->file;
struct pseudo_lock_region *plr;
struct rdtgroup *rdtgrp;
unsigned long physical;
@@ -1043,7 +1044,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* Ensure changes are carried directly to the memory being mapped,
* do not allow copy-on-write mapping.
*/
- if (!(vma->vm_flags & VM_SHARED)) {
+ if (!(desc->vm_flags & VM_SHARED)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
@@ -1055,12 +1056,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)

memset(plr->kmem + off, 0, vsize);

- if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
- vsize, vma->vm_page_prot)) {
- mutex_unlock(&rdtgroup_mutex);
- return -EAGAIN;
- }
- vma->vm_ops = &pseudo_mmap_ops;
+ desc->vm_ops = &pseudo_mmap_ops;
+
+ mmap_action_remap(&desc->action, desc->start, physical + desc->pgoff,
+ vsize, desc->page_prot);
+
mutex_unlock(&rdtgroup_mutex);
return 0;
}
@@ -1071,7 +1071,7 @@ static const struct file_operations pseudo_lock_dev_fops = {
.write = NULL,
.open = pseudo_lock_dev_open,
.release = pseudo_lock_dev_release,
- .mmap = pseudo_lock_dev_mmap,
+ .mmap_prepare = pseudo_lock_dev_mmap_prepare,
};

int rdt_pseudo_lock_init(void)
--
2.51.0

Lorenzo Stoakes

unread,
Sep 10, 2025, 4:23:39 PMSep 10
to Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, David S . Miller, Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse, Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato, linu...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, linux...@vger.kernel.org, sparc...@vger.kernel.org, nvd...@lists.linux.dev, linu...@vger.kernel.org, linu...@kvack.org, nt...@lists.linux.dev, ke...@lists.infradead.org, kasa...@googlegroups.com, Jason Gunthorpe
Update the mem char driver (backing /dev/mem and /dev/zero) to use
f_op->mmap_prepare hook rather than the deprecated f_op->mmap.

The /dev/zero implementation has a very unique and rather concerning
characteristic in that it marks MAP_PRIVATE mmap() mappings as anonymous
when they are, in fact, not.

The new f_op->mmap_prepare() can support this, but rather than introducing
a helper function to perform this hack (and risk introducing other users),
simply set desc->vm_ops to NULL here and add a comment describing what's
going on.

We also introduce shmem_zero_setup_desc() to allow for the shared mapping
case via an f_op->mmap_prepare() hook, and generalise the code between this
and shmem_zero_setup().

We also use desc->action.error_hook to filter the remap error to -EAGAIN to
keep behaviour consistent.

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
drivers/char/mem.c | 75 ++++++++++++++++++++++------------------
include/linux/shmem_fs.h | 3 +-
mm/shmem.c | 40 ++++++++++++++++-----
3 files changed, 76 insertions(+), 42 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 34b815901b20..23194788ee41 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -304,13 +304,13 @@ static unsigned zero_mmap_capabilities(struct file *file)
}

/* can't do an in-place private mapping if there's no MMU */
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
{
- return is_nommu_shared_mapping(vma->vm_flags);
+ return is_nommu_shared_mapping(desc->vm_flags);
}
#else

-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
{
return 1;
}
@@ -322,46 +322,50 @@ static const struct vm_operations_struct mmap_mem_ops = {
#endif
};

-static int mmap_mem(struct file *file, struct vm_area_struct *vma)
+static int mmap_filter_error(int err)
{
- size_t size = vma->vm_end - vma->vm_start;
- phys_addr_t offset = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+ return -EAGAIN;
+}
+
+static int mmap_mem_prepare(struct vm_area_desc *desc)
+{
+ struct file *file = desc->file;
+ const size_t size = vma_desc_size(desc);
+ const phys_addr_t offset = (phys_addr_t)desc->pgoff << PAGE_SHIFT;

/* Does it even fit in phys_addr_t? */
- if (offset >> PAGE_SHIFT != vma->vm_pgoff)
+ if (offset >> PAGE_SHIFT != desc->pgoff)
return -EINVAL;

/* It's illegal to wrap around the end of the physical address space. */
if (offset + (phys_addr_t)size - 1 < offset)
return -EINVAL;

- if (!valid_mmap_phys_addr_range(vma->vm_pgoff, size))
+ if (!valid_mmap_phys_addr_range(desc->pgoff, size))
return -EINVAL;

- if (!private_mapping_ok(vma))
+ if (!private_mapping_ok(desc))
return -ENOSYS;

- if (!range_is_allowed(vma->vm_pgoff, size))
+ if (!range_is_allowed(desc->pgoff, size))
return -EPERM;

- if (!phys_mem_access_prot_allowed(file, vma->vm_pgoff, size,
- &vma->vm_page_prot))
+ if (!phys_mem_access_prot_allowed(file, desc->pgoff, size,
+ &desc->page_prot))
return -EINVAL;

- vma->vm_page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
- size,
- vma->vm_page_prot);
+ desc->page_prot = phys_mem_access_prot(file, desc->pgoff,
+ size,
+ desc->page_prot);

- vma->vm_ops = &mmap_mem_ops;
+ desc->vm_ops = &mmap_mem_ops;

/* Remap-pfn-range will mark the range VM_IO */
- if (remap_pfn_range(vma,
- vma->vm_start,
- vma->vm_pgoff,
- size,
- vma->vm_page_prot)) {
- return -EAGAIN;
- }
+ mmap_action_remap(&desc->action, desc->start, desc->pgoff, size,
+ desc->page_prot);
+ /* We filter remap errors to -EAGAIN. */
+ desc->action.error_hook = mmap_filter_error;
+
return 0;
}

@@ -501,14 +505,18 @@ static ssize_t read_zero(struct file *file, char __user *buf,
return cleared;
}

-static int mmap_zero(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_zero(struct vm_area_desc *desc)
{
#ifndef CONFIG_MMU
return -ENOSYS;
#endif
- if (vma->vm_flags & VM_SHARED)
- return shmem_zero_setup(vma);
- vma_set_anonymous(vma);
+ if (desc->vm_flags & VM_SHARED)
+ return shmem_zero_setup_desc(desc);
+ /*
+ * This is a highly unusual situation in which we mark a MAP_PRIVATE mapping
+ * of /dev/zero anonymous, despite it not actually being anonymous.
+ */
+ desc->vm_ops = NULL;
return 0;
}

@@ -526,10 +534,11 @@ static unsigned long get_unmapped_area_zero(struct file *file,
{
if (flags & MAP_SHARED) {
/*
- * mmap_zero() will call shmem_zero_setup() to create a file,
- * so use shmem's get_unmapped_area in case it can be huge;
- * and pass NULL for file as in mmap.c's get_unmapped_area(),
- * so as not to confuse shmem with our handle on "/dev/zero".
+ * mmap_prepare_zero() will call shmem_zero_setup() to create a
+ * file, so use shmem's get_unmapped_area in case it can be
+ * huge; and pass NULL for file as in mmap.c's
+ * get_unmapped_area(), so as not to confuse shmem with our
+ * handle on "/dev/zero".
*/
return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
}
@@ -632,7 +641,7 @@ static const struct file_operations __maybe_unused mem_fops = {
.llseek = memory_lseek,
.read = read_mem,
.write = write_mem,
- .mmap = mmap_mem,
+ .mmap_prepare = mmap_mem_prepare,
.open = open_mem,
#ifndef CONFIG_MMU
.get_unmapped_area = get_unmapped_area_mem,
@@ -668,7 +677,7 @@ static const struct file_operations zero_fops = {
.write_iter = write_iter_zero,
.splice_read = copy_splice_read,
.splice_write = splice_write_zero,
- .mmap = mmap_zero,
+ .mmap_prepare = mmap_prepare_zero,
.get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0e47465ef0fd..5b368f9549d6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -94,7 +94,8 @@ extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
const char *name, loff_t size, unsigned long flags);
-extern int shmem_zero_setup(struct vm_area_struct *);
+int shmem_zero_setup(struct vm_area_struct *vma);
+int shmem_zero_setup_desc(struct vm_area_desc *desc);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
diff --git a/mm/shmem.c b/mm/shmem.c
index 990e33c6a776..cb6ff00eb4cb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5893,14 +5893,9 @@ struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name,
}
EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt);

-/**
- * shmem_zero_setup - setup a shared anonymous mapping
- * @vma: the vma to be mmapped is prepared by do_mmap
- */
-int shmem_zero_setup(struct vm_area_struct *vma)
+static struct file *__shmem_zero_setup(unsigned long start, unsigned long end, vm_flags_t vm_flags)
{
- struct file *file;
- loff_t size = vma->vm_end - vma->vm_start;
+ loff_t size = end - start;

/*
* Cloning a new file under mmap_lock leads to a lock ordering conflict
@@ -5908,7 +5903,17 @@ int shmem_zero_setup(struct vm_area_struct *vma)
* accessible to the user through its mapping, use S_PRIVATE flag to
* bypass file security, in the same way as shmem_kernel_file_setup().
*/
- file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
+ return shmem_kernel_file_setup("dev/zero", size, vm_flags);
+}
+
+/**
+ * shmem_zero_setup - setup a shared anonymous mapping
+ * @vma: the vma to be mmapped is prepared by do_mmap
+ */
+int shmem_zero_setup(struct vm_area_struct *vma)
+{
+ struct file *file = __shmem_zero_setup(vma->vm_start, vma->vm_end, vma->vm_flags);
+
if (IS_ERR(file))
return PTR_ERR(file);

@@ -5920,6 +5925,25 @@ int shmem_zero_setup(struct vm_area_struct *vma)
return 0;
}

+/**
+ * shmem_zero_setup_desc - same as shmem_zero_setup(), but takes a VMA descriptor
+ * @desc: Describes VMA
+ * Returns: 0 on success, or error
+ */
+int shmem_zero_setup_desc(struct vm_area_desc *desc)
+{
+ struct file *file = __shmem_zero_setup(desc->start, desc->end, desc->vm_flags);
+
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ desc->vm_file = file;
+ desc->vm_ops = &shmem_anon_vm_ops;
+
+ return 0;
+}
+
/**
* shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
* @mapping: the folio's address_space
--
2.51.0

Lorenzo Stoakes

Sep 10, 2025, 4:23:40 PM
cramfs uses either a PFN remap or a mixed map insertion; we are able to
determine which at the point of mmap_prepare and to select the appropriate
action to perform using the vm_area_desc.

Note that there appears to have been a bug in this code, with the physical
address being specified as the PFN (!!) to vmf_insert_mixed(). This patch
fixes this issue.

Finally, we trivially have to move the pr_debug() message indicating what is
happening so that it is emitted before the remap/mixed map is performed.
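
For illustration, the selection boils down to roughly the following shape (a
minimal sketch assuming the mmap_action_remap()/mmap_action_mixedmap() helpers
introduced earlier in this series; 'phys' and 'pages' are hypothetical
stand-ins for the values cramfs derives from its block layout):

  static int example_mmap_prepare(struct vm_area_desc *desc)
  {
          /* Hypothetical values - cramfs computes these from its block layout. */
          unsigned long phys = 0x10000000;
          unsigned long pages = 4;

          if (pages == vma_desc_pages(desc)) {
                  /* The entire range is directly mappable - request a PFN remap. */
                  mmap_action_remap(&desc->action, desc->start,
                                    phys >> PAGE_SHIFT, pages * PAGE_SIZE,
                                    desc->page_prot);
          } else {
                  /* Otherwise request mixed map insertion of individual PFNs. */
                  mmap_action_mixedmap(&desc->action, desc->start,
                                       phys >> PAGE_SHIFT, pages);
          }

          return 0;
  }

In both cases the helper is handed a PFN (phys >> PAGE_SHIFT) rather than a
physical address, which is what addresses the vmf_insert_mixed() bug noted
above.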

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/cramfs/inode.c | 46 ++++++++++++++++++++--------------------------
1 file changed, 20 insertions(+), 26 deletions(-)

diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index b002e9b734f9..2a41b30753a7 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -342,16 +342,17 @@ static bool cramfs_last_page_is_shared(struct inode *inode)
return memchr_inv(tail_data, 0, PAGE_SIZE - partial) ? true : false;
}

-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
{
+ struct file *file = desc->file;
struct inode *inode = file_inode(file);
struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
unsigned int pages, max_pages, offset;
- unsigned long address, pgoff = vma->vm_pgoff;
+ unsigned long address, pgoff = desc->pgoff;
char *bailout_reason;
int ret;

- ret = generic_file_readonly_mmap(file, vma);
+ ret = generic_file_readonly_mmap_prepare(desc);
if (ret)
return ret;

@@ -362,14 +363,14 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)

/* Could COW work here? */
bailout_reason = "vma is writable";
- if (vma->vm_flags & VM_WRITE)
+ if (desc->vm_flags & VM_WRITE)
goto bailout;

max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
bailout_reason = "beyond file limit";
if (pgoff >= max_pages)
goto bailout;
- pages = min(vma_pages(vma), max_pages - pgoff);
+ pages = min(vma_desc_pages(desc), max_pages - pgoff);

offset = cramfs_get_block_range(inode, pgoff, &pages);
bailout_reason = "unsuitable block layout";
@@ -391,38 +392,31 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
goto bailout;
}

- if (pages == vma_pages(vma)) {
+ pr_debug("mapping %pD[%lu] at 0x%08lx (%u/%lu pages) "
+ "to vma 0x%08lx, page_prot 0x%llx\n", file,
+ pgoff, address, pages, vma_desc_pages(desc), desc->start,
+ (unsigned long long)pgprot_val(desc->page_prot));
+
+ if (pages == vma_desc_pages(desc)) {
/*
* The entire vma is mappable. remap_pfn_range() will
* make it distinguishable from a non-direct mapping
* in /proc/<pid>/maps by substituting the file offset
* with the actual physical address.
*/
- ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
- pages * PAGE_SIZE, vma->vm_page_prot);
+ mmap_action_remap(&desc->action, desc->start,
+ address >> PAGE_SHIFT, pages * PAGE_SIZE,
+ desc->page_prot);
} else {
/*
* Let's create a mixed map if we can't map it all.
* The normal paging machinery will take care of the
* unpopulated ptes via cramfs_read_folio().
*/
- int i;
- vm_flags_set(vma, VM_MIXEDMAP);
- for (i = 0; i < pages && !ret; i++) {
- vm_fault_t vmf;
- unsigned long off = i * PAGE_SIZE;
- vmf = vmf_insert_mixed(vma, vma->vm_start + off,
- address + off);
- if (vmf & VM_FAULT_ERROR)
- ret = vm_fault_to_errno(vmf, 0);
- }
+ mmap_action_mixedmap(&desc->action, desc->start,
+ address >> PAGE_SHIFT, pages);
}

- if (!ret)
- pr_debug("mapped %pD[%lu] at 0x%08lx (%u/%lu pages) "
- "to vma 0x%08lx, page_prot 0x%llx\n", file,
- pgoff, address, pages, vma_pages(vma), vma->vm_start,
- (unsigned long long)pgprot_val(vma->vm_page_prot));
return ret;

bailout:
@@ -434,9 +428,9 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)

#else /* CONFIG_MMU */

-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
{
- return is_nommu_shared_mapping(vma->vm_flags) ? 0 : -ENOSYS;
+ return is_nommu_shared_mapping(desc->vm_flags) ? 0 : -ENOSYS;
}

static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
@@ -474,7 +468,7 @@ static const struct file_operations cramfs_physmem_fops = {
.llseek = generic_file_llseek,
.read_iter = generic_file_read_iter,
.splice_read = filemap_splice_read,
- .mmap = cramfs_physmem_mmap,
+ .mmap_prepare = cramfs_physmem_mmap_prepare,
#ifndef CONFIG_MMU
.get_unmapped_area = cramfs_physmem_get_unmapped_area,
.mmap_capabilities = cramfs_physmem_mmap_capabilities,
--
2.51.0

Lorenzo Stoakes

Sep 10, 2025, 4:23:42 PM
Now that we have the ability to specify a custom hook, we can handle even
highly customised behaviour.

As part of this change, we must also update remap_vmalloc_range_partial()
to optionally not update VMA flags. Other than the remap_vmalloc_range()
wrapper, vmcore is the only user of this function so we can simply go ahead
and add a parameter.
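
The custom action mechanism used here follows this rough pattern (a minimal
sketch with hypothetical names; the real hook added by this patch is
mmap_prepare_action_vmcore()):

  static int example_action(struct vm_area_struct *vma)
  {
          /* The VMA is now established - prepopulate it as required. */
          return 0;
  }

  static int example_mmap_prepare(struct vm_area_desc *desc)
  {
          /* Validate the request and set flags/ops on the descriptor, then: */
          desc->action.type = MMAP_CUSTOM_ACTION;
          desc->action.custom.action_hook = example_action;

          return 0;
  }

Because the custom hook runs only once the VMA has been inserted, the VMA
flags must already be correct at that point, hence remap_vmalloc_range_partial()
gains a set_vma parameter so it can skip updating them.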

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
arch/s390/kernel/crash_dump.c | 6 ++--
fs/proc/vmcore.c | 54 +++++++++++++++++++++++------------
include/linux/vmalloc.h | 10 +++----
mm/vmalloc.c | 16 +++++++++--
4 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index d4839de8ce9d..44d7902f7e41 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -186,7 +186,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,

if (pfn < oldmem_data.size >> PAGE_SHIFT) {
size_old = min(size, oldmem_data.size - (pfn << PAGE_SHIFT));
- rc = remap_pfn_range(vma, from,
+ rc = remap_pfn_range_complete(vma, from,
pfn + (oldmem_data.start >> PAGE_SHIFT),
size_old, prot);
if (rc || size == size_old)
@@ -195,7 +195,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,
from += size_old;
pfn += size_old >> PAGE_SHIFT;
}
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
@@ -220,7 +220,7 @@ static int remap_oldmem_pfn_range_zfcpdump(struct vm_area_struct *vma,
from += size_hsa;
pfn += size_hsa >> PAGE_SHIFT;
}
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index f188bd900eb2..faf811ed9b15 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -254,7 +254,7 @@ int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
unsigned long size, pgprot_t prot)
{
prot = pgprot_encrypted(prot);
- return remap_pfn_range(vma, from, pfn, size, prot);
+ return remap_pfn_range_complete(vma, from, pfn, size, prot);
}

/*
@@ -308,7 +308,7 @@ static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
tsz = min(offset + (u64)dump->size - start, (u64)size);
buf = dump->buf + start - offset;
if (remap_vmalloc_range_partial(vma, dst, buf, 0,
- tsz))
+ tsz, /* set_vma= */false))
return -EFAULT;

size -= tsz;
@@ -588,24 +588,15 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
return ret;
}

-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_action_vmcore(struct vm_area_struct *vma)
{
+ struct mmap_action action;
size_t size = vma->vm_end - vma->vm_start;
u64 start, end, len, tsz;
struct vmcore_range *m;

start = (u64)vma->vm_pgoff << PAGE_SHIFT;
end = start + size;
-
- if (size > vmcore_size || end > vmcore_size)
- return -EINVAL;
-
- if (vma->vm_flags & (VM_WRITE | VM_EXEC))
- return -EPERM;
-
- vm_flags_mod(vma, VM_MIXEDMAP, VM_MAYWRITE | VM_MAYEXEC);
- vma->vm_ops = &vmcore_mmap_ops;
-
len = 0;

if (start < elfcorebuf_sz) {
@@ -613,8 +604,10 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)

tsz = min(elfcorebuf_sz - (size_t)start, size);
pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
- if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
- vma->vm_page_prot))
+
+ mmap_action_remap(&action, vma->vm_start, pfn, tsz,
+ vma->vm_page_prot);
+ if (mmap_action_complete(&action, vma))
return -EAGAIN;
size -= tsz;
start += tsz;
@@ -664,7 +657,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
- kaddr, 0, tsz))
+ kaddr, 0, tsz, /* set_vma= */false))
goto fail;

size -= tsz;
@@ -700,8 +693,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
return -EAGAIN;
}
+
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
+{
+ size_t size = vma_desc_size(desc);
+ u64 start, end;
+
+ start = (u64)desc->pgoff << PAGE_SHIFT;
+ end = start + size;
+
+ if (size > vmcore_size || end > vmcore_size)
+ return -EINVAL;
+
+ if (desc->vm_flags & (VM_WRITE | VM_EXEC))
+ return -EPERM;
+
+ /* This is a unique case where we set both PFN map and mixed map flags. */
+ desc->vm_flags |= VM_MIXEDMAP | VM_REMAP_FLAGS;
+ desc->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+ desc->vm_ops = &vmcore_mmap_ops;
+
+ desc->action.type = MMAP_CUSTOM_ACTION;
+ desc->action.custom.action_hook = mmap_prepare_action_vmcore;
+
+ return 0;
+}
#else
-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
{
return -ENOSYS;
}
@@ -712,7 +730,7 @@ static const struct proc_ops vmcore_proc_ops = {
.proc_release = release_vmcore,
.proc_read_iter = read_vmcore,
.proc_lseek = default_llseek,
- .proc_mmap = mmap_vmcore,
+ .proc_mmap_prepare = mmap_prepare_vmcore,
};

static u64 get_vmcore_size(size_t elfsz, size_t elfnotesegsz,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index eb54b7b3202f..588810e571aa 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -215,12 +215,12 @@ extern void *vmap(struct page **pages, unsigned int count,
void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot);
extern void vunmap(const void *addr);

-extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
- unsigned long uaddr, void *kaddr,
- unsigned long pgoff, unsigned long size);
+int remap_vmalloc_range_partial(struct vm_area_struct *vma,
+ unsigned long uaddr, void *kaddr, unsigned long pgoff,
+ unsigned long size, bool set_vma);

-extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
- unsigned long pgoff);
+int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
+ unsigned long pgoff);

int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
struct page **pages, unsigned int page_shift);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9fc86ddf1711..3dd9d5c441d8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4531,6 +4531,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
* @kaddr: virtual address of vmalloc kernel memory
* @pgoff: offset from @kaddr to start at
* @size: size of map area
+ * @set_vma: If true, update VMA flags
*
* Returns: 0 for success, -Exxx on failure
*
@@ -4543,7 +4544,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
*/
int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
void *kaddr, unsigned long pgoff,
- unsigned long size)
+ unsigned long size, bool set_vma)
{
struct vm_struct *area;
unsigned long off;
@@ -4569,6 +4570,10 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
return -EINVAL;
kaddr += off;

+ /* If we shouldn't modify VMA flags, vm_insert_page() mustn't. */
+ if (!set_vma && !(vma->vm_flags & VM_MIXEDMAP))
+ return -EINVAL;
+
do {
struct page *page = vmalloc_to_page(kaddr);
int ret;
@@ -4582,7 +4587,11 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
size -= PAGE_SIZE;
} while (size > 0);

- vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+ if (set_vma)
+ vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+ else
+ VM_WARN_ON_ONCE((vma->vm_flags & (VM_DONTEXPAND | VM_DONTDUMP)) !=
+ (VM_DONTEXPAND | VM_DONTDUMP));

return 0;
}
@@ -4606,7 +4615,8 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
{
return remap_vmalloc_range_partial(vma, vma->vm_start,
addr, pgoff,
- vma->vm_end - vma->vm_start);
+ vma->vm_end - vma->vm_start,
+ /* set_vma= */ true);
}
EXPORT_SYMBOL(remap_vmalloc_range);

--
2.51.0

Lorenzo Stoakes

Sep 10, 2025, 4:23:42 PM
By adding this hook we enable procfs implementations to use the .mmap_prepare
hook rather than the deprecated .mmap one.

We treat this as if it were any other nested mmap hook and utilise the
.mmap_prepare compatibility layer if necessary.
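
With this in place a procfs implementation can opt in along these lines (a
minimal sketch with hypothetical names such as example_vm_ops):

  static int example_proc_mmap_prepare(struct vm_area_desc *desc)
  {
          /* Only the descriptor is available here - no VMA is exposed yet. */
          desc->vm_ops = &example_vm_ops;

          return 0;
  }

  static const struct proc_ops example_proc_ops = {
          .proc_mmap_prepare      = example_proc_mmap_prepare,
  };

Implementations that still set .proc_mmap continue to work unchanged, since
pde_mmap() invokes .proc_mmap when present and otherwise falls back to the
.mmap_prepare compatibility layer.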

Signed-off-by: Lorenzo Stoakes <lorenzo...@oracle.com>
---
fs/proc/inode.c | 12 +++++++++---
include/linux/proc_fs.h | 1 +
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 129490151be1..609abbc84bf4 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -414,9 +414,15 @@ static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned

static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma)
{
- __auto_type mmap = pde->proc_ops->proc_mmap;
- if (mmap)
- return mmap(file, vma);
+ const struct file_operations f_op = {
+ .mmap = pde->proc_ops->proc_mmap,
+ .mmap_prepare = pde->proc_ops->proc_mmap_prepare,
+ };
+
+ if (f_op.mmap)
+ return f_op.mmap(file, vma);
+ else if (f_op.mmap_prepare)
+ return __compat_vma_mmap_prepare(&f_op, file, vma);
return -EIO;
}

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index f139377f4b31..e5f65ebd62b8 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -47,6 +47,7 @@ struct proc_ops {
long (*proc_compat_ioctl)(struct file *, unsigned int, unsigned long);
#endif
int (*proc_mmap)(struct file *, struct vm_area_struct *);
+ int (*proc_mmap_prepare)(struct vm_area_desc *);
unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
} __randomize_layout;

--
2.51.0
