
[PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm


Dan Williams

Jan 19, 2017, 11:00:06 PM

A couple weeks back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus subtly suggested that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver
level [1]:

"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.

As you point out, it's also fundamentally buggy crap.

Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.

If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."

This feedback also dovetails with another fs/dax.c design wart: the code
is hard-coded to assume the backing device is pmem. We call the
pmem-specific copy, clear, and flush routines even if the backing device
driver is one of the other three dax drivers (axonram, dcssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it uses volatile memory for storage.

Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power failure. The latter capability has appeared in
embedded storage appliances ("legacy" / pre-NFIT nvdimm platforms).

So, this series:

1/ moves what was previously named "the pmem api" out of the global
namespace and into "that *driver*" (libnvdimm / pmem).

2/ arranges for dax to stop abusing copy_user_nocache() and implements a
libnvdimm-local memcpy that uses movnt.

3/ makes cache maintenance optional by arranging for dax to call
driver-specific copy and flush operations only if the driver publishes
them (a minimal sketch of this pattern appears below).

4/ adds a module parameter that can be used to inform libnvdimm of a
platform-level flush-cache-on-power-fail capability.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
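
To make 3/ concrete, here is a minimal, illustrative sketch of the
pattern (cache maintenance performed only when the driver publishes an
operation). The struct layout mirrors the dax_operations added later in
the series; the helper below is an example, not a literal excerpt of the
patches:

struct dax_operations {
	long (*direct_access)(struct block_device *, sector_t, void **,
			pfn_t *, long);
	size_t (*copy_from_iter)(void *, size_t, struct iov_iter *);
	void (*flush)(void *, size_t); /* optional: NULL for volatile media */
};

static void dax_maybe_flush(const struct dax_operations *dax_ops,
		void *addr, size_t size)
{
	/* drivers backed by volatile memory simply leave ->flush unset */
	if (dax_ops->flush)
		dax_ops->flush(addr, size);
}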

These patches have a build success notification from the 0day kbuild robot
and pass the libnvdimm / ndctl unit tests. I am looking to take them
through the libnvdimm tree with acks from x86, block, dm, etc.

---

Dan Williams (13):
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block, dax: introduce dax_operations
x86, dax, pmem: introduce 'copy_from_iter' dax operation
dax, pmem: introduce an optional 'flush' dax operation
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
x86, libnvdimm, dax: stop abusing __copy_user_nocache
libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem: disable dax flushing for 'cache flush on fail' platforms


MAINTAINERS | 2
arch/powerpc/sysdev/axonram.c | 6 +
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 121 ----------------------------
arch/x86/include/asm/string_64.h | 1
drivers/acpi/nfit/core.c | 15 ++-
drivers/block/brd.c | 6 +
drivers/md/dm.c | 6 +
drivers/nvdimm/Kconfig | 5 +
drivers/nvdimm/Makefile | 2
drivers/nvdimm/bus.c | 10 +-
drivers/nvdimm/claim.c | 9 +-
drivers/nvdimm/core.c | 2
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/dimm_devs.c | 4 -
drivers/nvdimm/namespace_devs.c | 9 +-
drivers/nvdimm/nd-core.h | 9 ++
drivers/nvdimm/pfn_devs.c | 4 -
drivers/nvdimm/pmem.c | 46 ++++++++---
drivers/nvdimm/pmem.h | 20 +++++
drivers/nvdimm/region_devs.c | 52 ++++++++----
drivers/nvdimm/x86-asm.S | 71 ++++++++++++++++
drivers/nvdimm/x86.c | 84 +++++++++++++++++++
drivers/s390/block/dcssblk.c | 6 +
fs/block_dev.c | 6 +
fs/dax.c | 35 +++++++-
include/linux/blkdev.h | 10 ++
include/linux/libnvdimm.h | 9 ++
include/linux/pmem.h | 165 --------------------------------------
include/linux/string.h | 8 ++
include/linux/uio.h | 4 +
lib/Kconfig | 6 +
lib/iov_iter.c | 25 ++++++
tools/testing/nvdimm/Kbuild | 2
34 files changed, 405 insertions(+), 358 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
create mode 100644 drivers/nvdimm/x86-asm.S
create mode 100644 drivers/nvdimm/x86.c
delete mode 100644 include/linux/pmem.h

Dan Williams

Jan 19, 2017, 11:00:06 PM

Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_map_atomic() and when writing back dirty radix entries. That
flushing is only required in the pmem case, so add a dax operation to
allow pmem to take this extra action, but skip it for other dax-capable
block_devices like brd.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.
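
For contrast, a driver backed by volatile memory would simply not
publish a flush operation. A hypothetical sketch for brd (its actual
conversion is a separate patch in this series, so treat the details
below as illustrative):

static const struct dax_operations brd_dax_ops = {
	.direct_access	= brd_direct_access,
	/* a plain cached copy is fine for volatile memory */
	.copy_from_iter	= copy_from_iter,
	/* no .flush: dax_writeback_one() skips cache maintenance */
};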

Cc: Jan Kara <ja...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawi...@microsoft.com>
Cc: Ross Zwisler <ross.z...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.w...@intel.com>
---
drivers/nvdimm/pmem.c | 1 +
fs/dax.c | 16 ++++++++++++----
include/linux/blkdev.h | 1 +
3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 71e5e365d3fc..68fc7599a053 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -229,6 +229,7 @@ static size_t pmem_copy_from_iter(void *addr, size_t bytes,
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_direct_access,
.copy_from_iter = pmem_copy_from_iter,
+ .flush = wb_cache_pmem,
};

static const struct block_device_operations pmem_fops = {
diff --git a/fs/dax.c b/fs/dax.c
index 22cd57424a55..160024e403f6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -756,10 +756,19 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
i_mmap_unlock_read(mapping);
}

+static const struct dax_operations *to_dax_ops(struct block_device *bdev)
+{
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ const struct dax_operations *dax_ops = ops->dax_ops;
+
+ return dax_ops;
+}
+
static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
+ const struct dax_operations *dax_ops = to_dax_ops(bdev);
struct blk_dax_ctl dax;
void *entry2, **slot;
int ret = 0;
@@ -830,7 +839,8 @@ static int dax_writeback_one(struct block_device *bdev,
}

dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
- wb_cache_pmem(dax.addr, dax.size);
+ if (dax_ops->flush)
+ dax_ops->flush(dax.addr, dax.size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
@@ -1006,10 +1016,8 @@ static loff_t
dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
{
- struct block_device *bdev = iomap->bdev;
size_t (*dax_copy_from_iter)(void *, size_t, struct iov_iter *);
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- const struct dax_operations *dax_ops = ops->dax_ops;
+ const struct dax_operations *dax_ops = to_dax_ops(iomap->bdev);
struct iov_iter *iter = data;
loff_t end = pos + length, done = 0;
ssize_t ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7ca559d124a3..692bdcd63db6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1867,6 +1867,7 @@ struct dax_operations {
long (*direct_access)(struct block_device *, sector_t, void **, pfn_t *,
long);
size_t (*copy_from_iter)(void *, size_t, struct iov_iter *);
+ void (*flush)(void *, size_t);
};

struct block_device_operations {

Dan Williams

Jan 19, 2017, 11:00:07 PM

With all calls to this routine redirected through the pmem driver, we
can kill the pmem api indirection. arch_wb_cache_pmem() is now
optionally supplied by an arch-specific extension to libnvdimm. As
before, pmem flushing is only defined for x86_64, but it is
straightforward to add other archs in the future.
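
As a rough illustration of what wiring up another architecture would
look like under this scheme (purely hypothetical, this series only
provides the x86_64 implementation), an arch file would supply
arch_wb_cache_pmem() and be added to the libnvdimm Makefile, e.g.
"libnvdimm-$(CONFIG_ARM64) += arm64.o":

/* hypothetical drivers/nvdimm/arm64.c, not part of this series */
#include <linux/export.h>
#include <linux/types.h>
#include <asm/cacheflush.h>

void arch_wb_cache_pmem(void *addr, size_t size)
{
	/*
	 * Assumes the arch's existing clean/invalidate helper is an
	 * acceptable (if stronger than strictly needed) write-back
	 * primitive for the range.
	 */
	__flush_dcache_area(addr, size);
}
EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);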

Cc: <x...@kernel.org>
Cc: Jan Kara <ja...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Matthew Wilcox <mawi...@microsoft.com>
Cc: Ross Zwisler <ross.z...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.w...@intel.com>
---
arch/x86/include/asm/pmem.h | 20 --------------------
drivers/nvdimm/Makefile | 1 +
drivers/nvdimm/pmem.c | 4 ++--
drivers/nvdimm/pmem.h | 9 +++++++++
drivers/nvdimm/x86.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/pmem.h | 19 -------------------
tools/testing/nvdimm/Kbuild | 1 +
7 files changed, 49 insertions(+), 41 deletions(-)
create mode 100644 drivers/nvdimm/x86.c

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index ab4983df7bff..4759a179aa52 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,26 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction.
- */
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
- unsigned long clflush_mask = x86_clflush_size - 1;
- void *vend = addr + size;
- void *p;
-
- for (p = (void *)((unsigned long)addr & ~clflush_mask);
- p < vend; p += x86_clflush_size)
- clwb(p);
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
clflush_cache_range(addr, size);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..9eafb1dd2876 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -24,3 +24,4 @@ libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
libnvdimm-$(CONFIG_BTT) += btt_devs.o
libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += x86.o
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 68fc7599a053..838241bb31d1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -222,14 +222,14 @@ static size_t pmem_copy_from_iter(void *addr, size_t bytes,
{
size_t rc = copy_from_iter_nocache(addr, bytes, i);

- wb_cache_pmem(addr, bytes);
+ arch_wb_cache_pmem(addr, bytes);
return rc;
}

static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_direct_access,
.copy_from_iter = pmem_copy_from_iter,
- .flush = wb_cache_pmem,
+ .flush = arch_wb_cache_pmem,
};

static const struct block_device_operations pmem_fops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index b4ee4f71b4a1..4faacff434ff 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -7,6 +7,15 @@

long pmem_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, pfn_t *pfn, long size);
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
/* this definition is in it's own header for tools/testing/nvdimm to consume */
struct pmem_device {
/* One contiguous memory region per device */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
new file mode 100644
index 000000000000..79d7267da4d2
--- /dev/null
+++ b/drivers/nvdimm/x86.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright(c) 2015 - 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <asm/cacheflush.h>
+#include <asm/cpufeature.h>
+#include <asm/special_insns.h>
+
+/**
+ * arch_wb_cache_pmem - write back a cache range with CLWB
+ * @vaddr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction.
+ */
+void arch_wb_cache_pmem(void *addr, size_t size)
+{
+ u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+ unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vend = addr + size;
+ void *p;
+
+ for (p = (void *)((unsigned long)addr & ~clflush_mask);
+ p < vend; p += x86_clflush_size)
+ clwb(p);
+}
+EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 772bd02a5b52..33ae761f010a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
BUG();
@@ -80,18 +75,4 @@ static inline void invalidate_pmem(void *addr, size_t size)
if (arch_has_pmem_api())
arch_invalidate_pmem(addr, size);
}
-
-/**
- * wb_cache_pmem - write back processor cache for PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back the processor cache range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void wb_cache_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_wb_cache_pmem(addr, size);
-}
#endif /* __PMEM_H__ */
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 405212be044a..7488dfa1309a 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -66,6 +66,7 @@ libnvdimm-$(CONFIG_ND_CLAIM) += $(NVDIMM_SRC)/claim.o
libnvdimm-$(CONFIG_BTT) += $(NVDIMM_SRC)/btt_devs.o
libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += $(NVDIMM_SRC)/x86.o
libnvdimm-y += config_check.o

obj-m += test/

Matthew Wilcox

Jan 21, 2017, 11:30:06 AM

From: Dan Williams [mailto:dan.j.w...@intel.com]
> A couple weeks back, in the course of reviewing the memcpy_nocache()
> proposal from Brian, Linus subtly suggested that the pmem specific
> memcpy_to_pmem() routine be moved to be implemented at the driver
> level [1]:

Of course, there may not be a backing device either! That will depend on the filesystem.
I see two possible routes here:

1. Add a new address_space_operation:

const struct dax_operations *(*get_dax_ops)(struct address_space *);

2. Add two of the dax_operations to address_space_operations:

size_t (*copy_from_iter)(struct address_space *, void *, size_t, struct iov_iter *);
void (*flush)(struct address_space *, void *, size_t);

(We won't need ->direct_access as an address_space op because that'll be handled a different way in the brave new world that supports non-bdev-based filesystems.)

Obviously, in either case we'd have generic bdev versions for ext4, xfs, and other block-based filesystems, but filesystems with a character device or a network protocol behind them would do whatever it is they need to do.

I kind of prefer the second option, but does anyone else have a preference?
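
For illustration, option 2 might look roughly like the following for a
block-backed filesystem. The new address_space_operations members and
the generic bdev helper are hypothetical; to_dax_ops() is the lookup
helper from the fs/dax.c patch in this series:

/* hypothetical new members of struct address_space_operations:
 *	size_t (*copy_from_iter)(struct address_space *, void *, size_t,
 *			struct iov_iter *);
 *	void (*flush)(struct address_space *, void *, size_t);
 */

/* a generic version that bdev-backed filesystems (ext4, xfs) could reuse */
static void bdev_dax_flush(struct address_space *mapping, void *addr,
		size_t size)
{
	const struct dax_operations *dax_ops =
		to_dax_ops(mapping->host->i_sb->s_bdev);

	if (dax_ops->flush)
		dax_ops->flush(addr, size);
}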

Christoph Hellwig

Jan 21, 2017, 1:00:05 PM

On Sat, Jan 21, 2017 at 04:28:52PM +0000, Matthew Wilcox wrote:
> Of course, there may not be a backing device either!

s/backing device/block device/? If so, fully agreed. I like the dax_ops
scheme, but we should go all the way and disentangle it from the block
device. I already brought up this issue with the "fallback to direct I/O
on I/O error" series.

> I see two possible routes here:
>
> 1. Add a new address_space_operation:
>
> const struct dax_operations *(*get_dax_ops)(struct address_space *);
>
> 2. Add two of the dax_operations to address_space_operations:
>
> size_t (*copy_from_iter)(struct address_space *, void *, size_t, struct iov_iter *);
> void (*flush)(struct address_space *, void *, size_t);
> (we won't need ->direct_access as an address_space op because that'll be handled a different way in the brave new world that supports non-bdev-based filesystems)

And both of them are wrong. The write_begin/write_end mistake
notwithstanding, address_space ops are operations the VM can call without
knowing things like fs locking contexts. The above, on the other hand,
are device operations provided by the low-level driver, similar to
block_device operations. So what we need is a way to mount a dax device
as a file system, similar to how we support that for block or MTD
devices, and can then call methods on it. For now this will be a bit
complicated because all current DAX-aware file systems also still need a
block device for the metadata path, so we can't just say you mount
either a DAX or a block device. But I think we should aim for mounting a
DAX device as the primary use case, and then deal with block device
emulation as a generic DAX layer thing, similar to how we implement
(badly, in the rw case) block devices on top of MTD.

Matthew Wilcox

Jan 22, 2017, 10:50:05 AM

From: Christoph Hellwig [mailto:h...@lst.de]
> On Sat, Jan 21, 2017 at 04:28:52PM +0000, Matthew Wilcox wrote:
> > Of course, there may not be a backing device either!
>
> s/backing device/block device/ ? If so fully agreed. I like the dax_ops
> scheme, but we should go all the way and detangle it from the block
> device. I already brought up this issue with the fallback to direct I/O
> on I/O error series.

In the case of a network filesystem being used to communicate with a different VM on the same physical machine, there is no backing device, just a network protocol.

> And both of them are wrong. The write_begin/write_end mistake
> notwithstanding address_space ops are operations the VM can call without
> knowing things like fs locking contexts. The above on the other hand
> are device operations provided by the low-level driver, similar to
> block_device operations. So what we need is to have a way to mount
> a dax device as a file system, similar to how we support that for block
> or MTD devices and can then call methods on it. For now this will
> be a bit complicated because all current DAX-aware file systems also
> still need block device for the metadata path, so we can't just say
> you mount either a DAX or block device. But I think we should aim
> for mounting a DAX device as the primary use case, and then deal
> with block device emulation as a generic DAX layer thing, similarly
> how we implement (bad in the rw case) block devices on top of MTD.

I'm not terribly enthusiastic about creating a fake block device to sit on top of a network filesystem, but I suppose we could go that way if we had to.

Christoph Hellwig

Jan 22, 2017, 11:40:08 AM

On Sun, Jan 22, 2017 at 03:43:09PM +0000, Matthew Wilcox wrote:
> In the case of a network filesystem being used to communicate with
> a different VM on the same physical machine, there is no backing
> device, just a network protocol.

Again, do you mean block device? For a filesystem that does not do any
pagecache writeback we already don't need a backing device, so I don't
really see an issue there to start with.

> I'm not terribly enthusiastic about creating a fake block device to
> sit on top of a network filesystem, but I suppose we could go that
> way if we had to.

I see no need for a new network filesystem to have a fake block device.
We do need a fake block device for an unchanged or partially DAX-aware
file system. And those are the only ones we have at the moment, although
XFS could be converted to do direct calls bypassing the block layer
fairly trivially if needed. For ext2 and ext4 that would be much harder
due to the buffer cache dependency.

Dan Williams

Jan 22, 2017, 12:40:05 PM

So are you saying we need a way to go from a block_device inode to a
dax_device inode and then look up the dax_operations from there?

A filesystem, if it so chooses, could mount on top of the dax_device
inode directly?

I did add a dax_superblock for the device-dax character device
representation; I could refactor that so the block_device presentation
of a namespace and the character device presentation are just different
layers on top of the base-level dax inode.

...or am I not tracking what you are suggesting?

Matthew Wilcox

Jan 22, 2017, 1:20:07 PM

From: Christoph Hellwig [mailto:h...@lst.de]
> On Sun, Jan 22, 2017 at 03:43:09PM +0000, Matthew Wilcox wrote:
> > In the case of a network filesystem being used to communicate with
> > a different VM on the same physical machine, there is no backing
> > device, just a network protocol.
>
> Again, do you mean block device? For a filesystem that does not do any
> pagecache writeback we already don't need a backing device, so I don't
> really see an issue there to start with.

No, I mean a network filesystem like 9p or cifs or nfs. If the memcpy is supposed to be performed by the backing device and there is no backing device, then it's going to need to be done by the network filesystem.

(Also, the network filesystem might have a command, like RDMA has/will have, to ensure that the write has reached persistence)

Matthew Wilcox

Jan 22, 2017, 1:40:05 PM

From: Christoph Hellwig [mailto:h...@lst.de]
> On Sun, Jan 22, 2017 at 06:19:24PM +0000, Matthew Wilcox wrote:
> > No, I mean a network filesystem like 9p or cifs or nfs. If the memcpy
> > is supposed to be performed by the backing device
>
> struct backing_dev has no relation to the DAX code. Even more so what's
> the point of doing a DAXish memcpy in that case? If we buffer in
> memory for network I/O we should just use the page cache.

Oh, I didn't mean a 'struct backing_dev'. I meant that, conceptually, there is no driver for the filesystem to call. Here's the architecture that I'm trying to work with:

Two guests on the same physical machine (or a guest and a host) have access to the same set of physical addresses. This might be an NV-DIMM, or it might just be DRAM (for the purposes of reducing guest overhead). The network filesystem has been enhanced with a call to allow the client to ask the server "What is the physical address for this range of bytes in this file?"

We don't want to use the guest pagecache here. That's antithetical to the second usage (plain DRAM shared to reduce guest overhead), and it's inefficient for the first (an NV-DIMM).

Christoph Hellwig

Jan 22, 2017, 1:40:05 PM

On Sun, Jan 22, 2017 at 06:19:24PM +0000, Matthew Wilcox wrote:
> No, I mean a network filesystem like 9p or cifs or nfs. If the memcpy
> is supposed to be performed by the backing device

struct backing_dev has no relation to the DAX code. Even more so what's
the point of doing a DAXish memcpy in that case? If we buffer in
memory for network I/O we should just use the page cache.

> (Also, the network filesystem might have a command, like RDMA has/will have, to ensure that the write has reached persistence)

I know very well due to my work on a DAX-backed pNFS layout. But that
is mostly transparent to the NFS frontend code and won't use DAX
on the client at all. Just the pagecache as a source for RDMA READ/WRITE.

Christoph Hellwig

Jan 22, 2017, 1:50:05 PM

On Sun, Jan 22, 2017 at 06:39:28PM +0000, Matthew Wilcox wrote:
> Two guests on the same physical machine (or a guest and a host) have access to the same set of physical addresses. This might be an NV-DIMM, or it might just be DRAM (for the purposes of reducing guest overhead). The network filesystem has been enhanced with a call to allow the client to ask the server "What is the physical address for this range of bytes in this file?"
>
> We don't want to use the guest pagecache here. That's antithetical to the second usage, and it's inefficient for the first usage.

And the answer is that you need a dax device for whatever memory is exposed
in this way, as it needs to show up in the memory map, for example.

Matthew Wilcox

Jan 23, 2017, 1:40:05 AM

From: Christoph Hellwig [mailto:h...@lst.de]
Wow, DAX devices look painful and awful. I certainly don't want to be exposing the memory fronted by my network filesystem for userspace to access. That just seems like a world of pain and bad experiences. Absolutely, the filesystem (or, perhaps better, the ACPI tables) needs to mark that chunk of memory as reserved, but it's definitely not available for anyone to access without the filesystem being aware.

Even if we let the filesystem create a DAX device that doesn't show up in /dev (for example), Dan's patches don't give us a way to go from a file on the filesystem to a set of dax_ops. And it does need to be a per-file operation, e.g. to support a file on an XFS volume which might be on an RT device or a normal device. That was why I leaned towards an address_space operation, but I'd be happy to see an inode_operation instead.

Dan Williams

Jan 23, 2017, 2:20:05 AM

How about we solve the copy_from_user() abuse first before we hijack
this thread for some future feature that afaics has no patches posted
yet.

An incremental step towards disentangling filesystem-dax from
block_devices is a lookup mechanism to go from a block_device to a dax
object that holds dax_ops. When this brave new filesystem enabling
appears, it can grow a mechanism to look up, or mount on, the dax object
directly.

One idea is to just hang a pointer to this dax object off of
bdev_inode, set at bdev open() time.
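
A rough sketch of that idea (the object and helper names here are
hypothetical):

/* hypothetical anchor object that holds the driver's dax_ops */
struct dax_object {
	const struct dax_operations *ops;
};

/* fs/block_dev.c already has a private bdev_inode; add a dax pointer */
struct bdev_inode {
	struct block_device bdev;
	struct inode vfs_inode;
	struct dax_object *dax;	/* set at bdev open() time */
};

static const struct dax_operations *bdev_to_dax_ops(struct block_device *bdev)
{
	struct bdev_inode *bi = container_of(bdev, struct bdev_inode, bdev);

	return bi->dax ? bi->dax->ops : NULL;
}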

Christoph Hellwig

Jan 23, 2017, 11:00:06 AM

On Mon, Jan 23, 2017 at 06:37:18AM +0000, Matthew Wilcox wrote:
> Wow, DAX devices look painful and awful. I certainly don't want to be
> exposing the memory fronted by my network filesystem to userspace to
> access. That just seems like a world of pain and bad experiences.

So what is your interest in using DAX for your file system, then,
instead of a private mechanism?

> Absolutely the filesystem (or perhaps better, the ACPI tables) need to
> mark that chunk of memory as reserved, but it's definitely not available
> for anyone to access without the filesystem being aware.

That does sound like a massive special case all over the stack.
But until we see it I think we should simply ignore this case and
concentrate on what we have right now.

> Even if we let the filesystem create a DAX device that doesn't show
> up in /dev (for example), Dan's patches don't give us a way to go
> from a file on the filesystem to a set of dax_ops.

Which doesn't make sense anyway. The entry points into the file system
are read + write and mmap, and the file system might then use libraries
to implement different types of I/O, such as the page cache or DAX.

> And it does need to be a per-file operation, eg to support a file on
> an XFS volume which might be on a RT device or a normal device.
> That was why I leaned towards an address_space operation, but I'd be
> happy to see an inode_operation instead.

Again, no. The layers above the file system have absolutely no business
knowing whether we're using DAX or pagecache access, never mind how they
are used in detail. Assuming you want DAX-like semantics, it's up to the
lower level to expose the correct operations for a given memory region.
Right now these would just be Intel NFIT or legacy e820 + ADR for
regions marked as such in the memory map. If, say, a hypervisor wants to
expose a region that needs a special flush call, or even has
requirements on the type of memcpy, it needs to provide operations for
this memory region. The user of this region (a DAX-native file system,
the pmem driver, or device DAX) then needs to use these methods.

And those pretty much are the methods Dan proposes here - it's
just that we should not tie them to block device operations, at
least not in the long run.

Christoph Hellwig

Jan 23, 2017, 11:10:05 AM

On Sun, Jan 22, 2017 at 11:10:04PM -0800, Dan Williams wrote:
> How about we solve the copy_from_user() abuse first before we hijack
> this thread for some future feature that afaics has no patches posted
> yet.

Solving the copy_from_user abuse first sounds perfectly fine to me. But
please do so without abusing the block layer for persistent memory
access. Given that we don't have use cases for different pmem access
methods in a single OS image yet, let's avoid introducing new ops
for now and just remove the copy_from_user abuse.

In the longer run I like your dax_operations, but they need to be
separate from the block layer.

Christoph Hellwig

Jan 23, 2017, 11:10:06 AM

On Sun, Jan 22, 2017 at 09:30:23AM -0800, Dan Williams wrote:
> So are you saying we need a way to go from a block_device inode to a
> dax_device inode and then look up the dax_operations from there?
>
> A filesystem, if it so chooses, could mount on top of the dax_device
> inode directly?

Sentence 1: maybe if we have to. Sentence 2: absolutely.

> I did add a dax_superblock for the device-dax character device
> representation I could refactor that so the block_device presentation
> of a namespace and a character device presentation are just different
> layers on top of the base-level dax inode.

That's a good start.

Dan Williams

Jan 23, 2017, 12:20:06 PM

On Mon, Jan 23, 2017 at 8:00 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Sun, Jan 22, 2017 at 11:10:04PM -0800, Dan Williams wrote:
>> How about we solve the copy_from_user() abuse first before we hijack
>> this thread for some future feature that afaics has no patches posted
>> yet.
>
> Solving copy_from_user abuse first sounds perfectly fine to me. But
> please do so without abusing the block layer for persistent memory
> access. Given that we don't have use cases for different pmem access
> methods in a single OS image yet let's avoid introducing new ops
> for now and just remove the copy_from_user abuse.

The use case that we have now is distinguishing volatile vs persistent
memory (brd vs pmem).

I took a look at the mtd layering approach, and the main difference is
that layers above the block layer do not appear to know anything about
mtd specifics. For fs/dax.c we currently need some path to retrieve a
dax anchor object through the block device.

> In the longer run I like your dax_operations, but they need to be
> separate from the block layer.

I'll move them from block_device_operations to dax data hanging off of
the bdev_inode, or is there a better way to go from bdev-to-dax?

Christoph Hellwig

Jan 23, 2017, 1:10:06 PM

On Mon, Jan 23, 2017 at 09:14:04AM -0800, Dan Williams wrote:
> The use case that we have now is distinguishing volatile vs persistent
> memory (brd vs pmem).

brd is a development tool, so until we have other reasons for this
abstraction (which I'm pretty sure will show up sooner rather than later)
I would not worry about it too much.

> I took a look at mtd layering approach and the main difference is that
> layers above the block layer do not appear to know anything about mtd
> specifics.

Or the block layer itself for that matter. And that's exactly where
I want DAX to be in the future.

> For fs/dax.c we currently need some path to retrieve a dax
> anchor object through the block device.

We have a need to retrieve the anchor object. We currently do it
through the block layer for historical reasons, but it doesn't have
to be that way.

> > In the longer run I like your dax_operations, but they need to be
> > separate from the block layer.
>
> I'll move them from block_device_operations to dax data hanging off of
> the bdev_inode, or is there a better way to go from bdev-to-dax?

I don't think that's any better. What we really want is a way
to find the underlying persistent memory / DAX / whatever-we-call-it
node without going through a block device. E.g., a library function
that gives us that object for a given path name, where the path name
could be either that of the /dev/pmemN or the /dev/daxN device.

If the file system for now still needs a block device as well, it
will only accept the /dev/pmemN name, and open both the low-level
pmem device and the block device. Once that file system doesn't
need block code (and I think we could do that easily for XFS,
never mind any new FS) it won't have to deal with the block
device at all.

pmem.c then becomes a consumer of the dax_ops just like the file system.

Dan Williams

Jan 23, 2017, 1:40:07 PM

On Mon, Jan 23, 2017 at 10:03 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Mon, Jan 23, 2017 at 09:14:04AM -0800, Dan Williams wrote:
>> The use case that we have now is distinguishing volatile vs persistent
>> memory (brd vs pmem).
>
> brd is a development tool, so until we have other reasons for this
> abstraction (which I'm pretty sure will show up rather sooner than later)
> I would not worry about it too much.

By "volatile" I also meant cases where pmem is fronting volatile
memory, or more importantly when the platform has otherwise arranged
for cpu caches to be flushed on a power loss event like I believe some
existing storage appliances do.
Ah ok, I'll take a look at a dax_by_path() capability.
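
A hypothetical shape for such a lookup (illustrative only, no such API
exists at this point):

/* hypothetical anchor object that holds the driver's dax_ops */
struct dax_object;

/*
 * Resolve "/dev/pmemN" or "/dev/daxN" to the underlying dax object so
 * that a filesystem (or pmem.c itself) can fetch dax_ops without going
 * through the block layer.
 */
struct dax_object *dax_by_path(const char *path);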