[PATCH 0/2] virtio-fs: handle multiple dax mappings

52 views
Skip to first unread message

Fotis Xenakis

unread,
Jun 19, 2020, 12:24:02 PM6/19/20
to osv...@googlegroups.com, Fotis Xenakis
With the current, simplistic implementation of DAX window support, the
window only hosts one file mapping at a time. These patches rectify
that, implementing an initial DAX window manager which handles multiple
file mappings. This way, the DAX window functions as a cache for
virtio-fs file systems.

Please note that the manager's lifetime matches a mount's lifetime,
which means multiple mounts of the same device will *NOT* work (also see
note added in the mount operation in the second patch). I have not yet
come up with an elegant alternative to this. If this (no multiple
mounts) is not acceptable, I will try to switch to an alternative (might
not be elegant). Any suggestions on this matter is of course welcome and
appreciated!

Fotis Xenakis (2):
virtio-fs: implement dax window manager
virtio-fs: use multiple dax mappings in filesystem

Makefile | 39 ++---
drivers/virtio-fs.hh | 2 +-
fs/virtiofs/virtiofs.hh | 19 ++-
fs/virtiofs/virtiofs_dax.cc | 268 +++++++++++++++++++++++++++++++++
fs/virtiofs/virtiofs_dax.hh | 109 ++++++++++++++
fs/virtiofs/virtiofs_i.hh | 3 -
fs/virtiofs/virtiofs_vfsops.cc | 37 ++++-
fs/virtiofs/virtiofs_vnops.cc | 151 +++----------------
8 files changed, 470 insertions(+), 158 deletions(-)
create mode 100644 fs/virtiofs/virtiofs_dax.cc
create mode 100644 fs/virtiofs/virtiofs_dax.hh

--
2.27.0

Fotis Xenakis

unread,
Jun 19, 2020, 12:25:11 PM6/19/20
to osv...@googlegroups.com, Fotis Xenakis
For details on the manager's policy, please see
fs/virtiofs/virtiofs_dax.hh.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
---
Makefile | 39 +++---
fs/virtiofs/virtiofs_dax.cc | 268 ++++++++++++++++++++++++++++++++++++
fs/virtiofs/virtiofs_dax.hh | 109 +++++++++++++++
3 files changed, 397 insertions(+), 19 deletions(-)
create mode 100644 fs/virtiofs/virtiofs_dax.cc
create mode 100644 fs/virtiofs/virtiofs_dax.hh

diff --git a/Makefile b/Makefile
index 20ddf3b1..12366794 100644
--- a/Makefile
+++ b/Makefile
@@ -536,23 +536,23 @@ bsd += bsd/porting/mmu.o
bsd += bsd/porting/pcpu.o
bsd += bsd/porting/bus_dma.o
bsd += bsd/porting/kobj.o
-bsd += bsd/sys/netinet/if_ether.o
-bsd += bsd/sys/compat/linux/linux_socket.o
-bsd += bsd/sys/compat/linux/linux_ioctl.o
-bsd += bsd/sys/net/if_ethersubr.o
-bsd += bsd/sys/net/if_llatbl.o
-bsd += bsd/sys/net/radix.o
-bsd += bsd/sys/net/route.o
-bsd += bsd/sys/net/raw_cb.o
-bsd += bsd/sys/net/raw_usrreq.o
-bsd += bsd/sys/net/rtsock.o
-bsd += bsd/sys/net/netisr.o
-bsd += bsd/sys/net/netisr1.o
-bsd += bsd/sys/net/if_dead.o
-bsd += bsd/sys/net/if_clone.o
-bsd += bsd/sys/net/if_loop.o
-bsd += bsd/sys/net/if.o
-bsd += bsd/sys/net/pfil.o
+bsd += bsd/sys/netinet/if_ether.o
+bsd += bsd/sys/compat/linux/linux_socket.o
+bsd += bsd/sys/compat/linux/linux_ioctl.o
+bsd += bsd/sys/net/if_ethersubr.o
+bsd += bsd/sys/net/if_llatbl.o
+bsd += bsd/sys/net/radix.o
+bsd += bsd/sys/net/route.o
+bsd += bsd/sys/net/raw_cb.o
+bsd += bsd/sys/net/raw_usrreq.o
+bsd += bsd/sys/net/rtsock.o
+bsd += bsd/sys/net/netisr.o
+bsd += bsd/sys/net/netisr1.o
+bsd += bsd/sys/net/if_dead.o
+bsd += bsd/sys/net/if_clone.o
+bsd += bsd/sys/net/if_loop.o
+bsd += bsd/sys/net/if.o
+bsd += bsd/sys/net/pfil.o
bsd += bsd/sys/net/routecache.o
bsd += bsd/sys/netinet/in.o
bsd += bsd/sys/netinet/in_pcb.o
@@ -1769,7 +1769,8 @@ fs_objs += rofs/rofs_vfsops.o \
rofs/rofs_common.o

fs_objs += virtiofs/virtiofs_vfsops.o \
- virtiofs/virtiofs_vnops.o
+ virtiofs/virtiofs_vnops.o \
+ virtiofs/virtiofs_dax.o

fs_objs += pseudofs/pseudofs.o
fs_objs += procfs/procfs_vnops.o
@@ -1976,7 +1977,7 @@ libuutil-objects = $(foreach file, $(libuutil-file-list), $(out)/bsd/cddl/contri

define libuutil-includes
bsd/cddl/contrib/opensolaris/lib/libuutil/common
- bsd/cddl/compat/opensolaris/include
+ bsd/cddl/compat/opensolaris/include
bsd/sys/cddl/contrib/opensolaris/uts/common
bsd/sys/cddl/compat/opensolaris
bsd/cddl/contrib/opensolaris/head
diff --git a/fs/virtiofs/virtiofs_dax.cc b/fs/virtiofs/virtiofs_dax.cc
new file mode 100644
index 00000000..8e612eb5
--- /dev/null
+++ b/fs/virtiofs/virtiofs_dax.cc
@@ -0,0 +1,268 @@
+/*
+ * Copyright (C) 2020 Fotis Xenakis
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#include <algorithm>
+#include <mutex>
+
+#include <osv/debug.h>
+#include <osv/uio.h>
+
+#include "fuse_kernel.h"
+#include "virtiofs.hh"
+#include "virtiofs_dax.hh"
+#include "virtiofs_i.hh"
+
+namespace virtiofs {
+
+int dax_manager::read(virtiofs_inode& inode, uint64_t file_handle, u64 read_amt,
+ struct uio& uio, bool aggressive)
+{
+ std::lock_guard<mutex> guard {_lock};
+
+ // Necessary pre-declarations due to goto below
+ size_t to_map;
+ chunk nchunks;
+ int error;
+ mapping_part mp;
+ chunk fstart = uio.uio_offset / _chunk_size;
+ off_t coffset = uio.uio_offset % _chunk_size; // offset within chunk
+ if (find(inode.nodeid, fstart, mp)) {
+ // Requested data (at least some initial) is already mapped
+ auto read_amt_act = std::min<size_t>(read_amt,
+ (mp.nchunks * _chunk_size) - coffset);
+ virtiofs_debug("inode %lld, found in DAX (foffset=%lld, len=%lld, "
+ "moffset=%lld)\n", inode.nodeid, uio.uio_offset, read_amt_act,
+ (mp.mstart * _chunk_size) + coffset);
+ goto out;
+ }
+
+ // Map file
+ to_map = coffset; // bytes to map
+ if (aggressive) {
+ // Map the rest of the file
+ to_map += inode.attr.size - uio.uio_offset;
+ } else {
+ // Map just enough chunks to satisfy read_amt
+ to_map += read_amt;
+ }
+ nchunks = to_map / _chunk_size;
+ if (to_map % _chunk_size > 0) {
+ nchunks++;
+ }
+ // NOTE: This relies on the fact that requesting a mapping longer than the
+ // remaining file works (see mmap() on the host). If that didn't work, we
+ // would have to request exact mappings (byte-granularity, rather than
+ // chunk-granularity).
+ error = map(inode.nodeid, file_handle, nchunks, fstart, mp, true);
+ if (error) {
+ return error;
+ }
+
+out:
+ auto req_data = _window->addr + (mp.mstart * _chunk_size) + coffset;
+ auto read_amt_act = std::min<size_t>(read_amt,
+ (mp.nchunks * _chunk_size) - coffset);
+ // NOTE: It shouldn't be necessary to use the mmio* interface (i.e. volatile
+ // accesses). From the spec: "Drivers map this shared memory region with
+ // writeback caching as if it were regular RAM."
+ error = uiomove(const_cast<void*>(req_data), read_amt_act, &uio);
+ if (error) {
+ kprintf("[virtiofs] inode %lld, uiomove failed\n", inode.nodeid);
+ }
+ return error;
+}
+
+int dax_manager::map(uint64_t nodeid, uint64_t file_handle, chunk nchunks,
+ chunk fstart, mapping_part& mapped, bool evict)
+{
+ // If necessary, unmap just enough chunks
+ auto empty = _window_chunks - first_empty();
+ if (evict && empty < nchunks) {
+ mapping_part mp;
+ auto error = unmap(nchunks - empty, mp, false);
+ if (error) {
+ return error;
+ }
+ empty += mp.nchunks;
+ }
+ auto to_map = std::min<chunk>(nchunks, empty);
+ if (to_map == 0) {
+ // The window is full and evict is false, or nchunks is 0
+ mapped.mstart = _window_chunks - empty;
+ mapped.nchunks = 0;
+ return (nchunks == 0) ? 0 : ENOBUFS;
+ }
+
+ // Map new chunks
+ auto mstart = _window_chunks - empty;
+ auto error = map_ll(nodeid, file_handle, to_map, fstart, mstart);
+ if (error) {
+ return error;
+ }
+ if (!_mappings.empty()) {
+ auto& m {_mappings.back()};
+ if (m.nodeid == nodeid && m.fstart + m.nchunks == fstart) {
+ // Extend previous mapping
+ m.nchunks += to_map;
+ mapped.mstart = mstart;
+ mapped.nchunks = to_map;
+ return 0;
+ }
+ }
+ _mappings.emplace_back(nodeid, to_map, fstart, mstart);
+ mapped.mstart = mstart;
+ mapped.nchunks = to_map;
+ return 0;
+}
+
+int dax_manager::unmap(chunk nchunks, mapping_part& unmapped, bool deep)
+{
+ // Determine necessary changes
+ chunk to_unmap = 0;
+ auto erase_first {_mappings.cend()};
+ chunk to_unmap_from_last = 0;
+ for (auto it {_mappings.crbegin()};
+ to_unmap < nchunks && it != _mappings.crend(); it++) {
+
+ if (it->nchunks <= nchunks - to_unmap) {
+ // Remove *it
+ erase_first = it.base() - 1;
+ to_unmap += it->nchunks;
+ } else {
+ // Modify *it
+ to_unmap_from_last = nchunks - to_unmap;
+ to_unmap = nchunks;
+ }
+ }
+ if (to_unmap == 0) {
+ // The window is empty, or nchunks is 0
+ unmapped.mstart = first_empty();
+ unmapped.nchunks = 0;
+ return (nchunks == 0) ? 0 : ENODATA;
+ }
+
+ // Apply changes
+ if (deep) {
+ auto mstart = first_empty() - to_unmap;
+ auto error = unmap_ll(to_unmap, mstart);
+ if (error) {
+ return error;
+ }
+ }
+ _mappings.erase(erase_first, _mappings.cend());
+ if (to_unmap_from_last > 0) {
+ _mappings.back().nchunks -= to_unmap_from_last;
+ }
+
+ unmapped.mstart = first_empty();
+ unmapped.nchunks = to_unmap;
+ return 0;
+}
+
+int dax_manager::map_ll(uint64_t nodeid, uint64_t file_handle, chunk nchunks,
+ chunk fstart, chunk mstart)
+{
+ assert(mstart + nchunks <= _window_chunks);
+
+ // NOTE: There are restrictions on the arguments to FUSE_SETUPMAPPING, from
+ // the spec: "Alignment constraints for FUSE_SETUPMAPPING and
+ // FUSE_REMOVEMAPPING requests are communicated during FUSE_INIT
+ // negotiation"):
+ // - foffset: multiple of map_alignment from FUSE_INIT
+ // - len: not larger than remaining file?
+ // - moffset: multiple of map_alignment from FUSE_INIT
+ // In practice, map_alignment is the host's page size, because foffset and
+ // moffset are passed to mmap() on the host. These are satisfied by
+ // _chunk_size being a multiple of map_alignment.
+
+ std::unique_ptr<fuse_setupmapping_in> in_args {
+ new (std::nothrow) fuse_setupmapping_in()};
+ if (!in_args) {
+ return ENOMEM;
+ }
+ in_args->fh = file_handle;
+ in_args->foffset = fstart * _chunk_size;
+ in_args->len = nchunks * _chunk_size;
+ in_args->flags = 0; // Read-only
+ in_args->moffset = mstart * _chunk_size;
+
+ virtiofs_debug("inode %lld, setting up mapping (foffset=%lld, len=%lld, "
+ "moffset=%lld)\n", nodeid, in_args->foffset, in_args->len,
+ in_args->moffset);
+ auto error = fuse_req_send_and_receive_reply(&_drv, FUSE_SETUPMAPPING,
+ nodeid, in_args.get(), sizeof(*in_args), nullptr, 0);
+ if (error) {
+ kprintf("[virtiofs] inode %lld, mapping setup failed\n", nodeid);
+ return error;
+ }
+
+ return 0;
+}
+
+int dax_manager::unmap_ll(chunk nchunks, chunk mstart)
+{
+ assert(mstart + nchunks <= _window_chunks);
+
+ // NOTE: FUSE_REMOVEMAPPING accepts a fuse_removemapping_in followed by
+ // fuse_removemapping_in.count fuse_removemapping_one arguments in general.
+ auto in_args_size = sizeof(fuse_removemapping_in) +
+ sizeof(fuse_removemapping_one);
+ std::unique_ptr<u8> in_args {new (std::nothrow) u8[in_args_size]};
+ if (!in_args) {
+ return ENOMEM;
+ }
+ auto r_in = new (in_args.get()) fuse_removemapping_in();
+ auto r_one = new (in_args.get() + sizeof(fuse_removemapping_in))
+ fuse_removemapping_one();
+ r_in->count = 1;
+ r_one->moffset = mstart * _chunk_size;
+ r_one->len = nchunks * _chunk_size;
+
+ // The nodeid is irrelevant for the current implementation of
+ // FUSE_REMOVEMAPPING. If it needed to be set, would we need to make a
+ // request per inode?
+ uint64_t nodeid = 0;
+
+ virtiofs_debug("inode %lld, removing mapping (moffset=%lld, len=%lld)\n",
+ nodeid, r_one->moffset, r_one->len);
+ auto error = fuse_req_send_and_receive_reply(&_drv, FUSE_REMOVEMAPPING,
+ nodeid, in_args.get(), in_args_size, nullptr, 0);
+ if (error) {
+ kprintf("[virtiofs] inode %lld, mapping removal failed\n", nodeid);
+ return error;
+ }
+
+ return 0;
+}
+
+bool dax_manager::find(uint64_t nodeid, chunk fstart, mapping_part& found) const
+{
+ for (auto& m : _mappings) {
+ if (m.nodeid == nodeid &&
+ m.fstart <= fstart &&
+ m.fstart + m.nchunks > fstart) {
+
+ // m contains fstart
+ auto excess = fstart - m.fstart; // excess contained in m
+ found.nchunks = m.nchunks - excess;
+ found.mstart = m.mstart + excess;
+ return true;
+ }
+ }
+ return false;
+}
+
+dax_manager::chunk dax_manager::first_empty() const
+{
+ if (_mappings.empty()) {
+ return 0;
+ }
+ auto& m {_mappings.back()};
+ return m.mstart + m.nchunks;
+}
+
+}
diff --git a/fs/virtiofs/virtiofs_dax.hh b/fs/virtiofs/virtiofs_dax.hh
new file mode 100644
index 00000000..2b9fa341
--- /dev/null
+++ b/fs/virtiofs/virtiofs_dax.hh
@@ -0,0 +1,109 @@
+/*
+ * Copyright (C) 2020 Fotis Xenakis
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#include <vector>
+
+#include <api/assert.h>
+#include <osv/mutex.h>
+#include <osv/uio.h>
+
+#include "drivers/virtio-fs.hh"
+#include "virtiofs.hh"
+
+namespace virtiofs {
+
+// A manager for the DAX window of a virtio-fs device. This implements a
+// straight-forward scheme for file mappings:
+// - The window is split into equally-sized chunks. Each mapping occupies an
+// integer amount of consecutive chunks.
+// - New mappings are placed on the lowest available chunks in the window.
+// - When there are not enough chunks available for a new mapping, the highest
+// (i.e. most recently mapped) chunks occupied are evicted. Thus, chunks are
+// mapped in a LIFO manner (the window resembles a stack).
+class dax_manager {
+public:
+ static constexpr size_t DEFAULT_CHUNK_SIZE = 1 << 21; // 2MiB
+
+ // Construct a new manager for the DAX window associated with @drv (as
+ // returned by drv.get_dax()). The alignment constraint of the device (as
+ // reported by drv.get_map_alignment()) should be compatible with
+ // @chunk_size.
+ dax_manager(virtio::fs& drv, size_t chunk_size = DEFAULT_CHUNK_SIZE)
+ : _drv {drv},
+ _window {drv.get_dax()},
+ _chunk_size {chunk_size},
+ _window_chunks {_window->len / _chunk_size} {
+
+ assert(_chunk_size % (1ull << _drv.get_map_alignment()) == 0);
+
+ // NOTE: If _window->len % CHUNK_SIZE > 0, that remainder (< CHUNK_SIZE)
+ // is effectively ignored.
+ }
+
+ // Read @read_amt bytes from @inode, using the DAX window. If @aggressive,
+ // try to prefetch as much of the rest of the file as possible.
+ int read(virtiofs_inode& inode, uint64_t file_handle, u64 read_amt,
+ struct uio& uio, bool aggressive = false);
+
+private:
+ // Helper type to better distinguish referring to chunks vs bytes
+ using chunk = size_t;
+
+ struct mapping {
+ mapping(uint64_t _nodeid, chunk _nchunks, chunk _fstart, chunk _mstart)
+ : nodeid {_nodeid},
+ nchunks {_nchunks},
+ fstart {_fstart},
+ mstart {_mstart} {}
+ uint64_t nodeid;
+ chunk nchunks;
+ chunk fstart;
+ chunk mstart;
+ };
+
+ struct mapping_part {
+ chunk nchunks;
+ chunk mstart;
+ };
+
+ // Map up to @nchunks chunks of the file with @nodeid, starting at chunk
+ // @fstart of the file, after all other mappings. If @evict, evict other
+ // chunks if necessary. Returns in @mapped the new mapping and non-zero on
+ // failure. Called with _lock held (for writing).
+ int map(uint64_t nodeid, uint64_t file_handle, chunk nchunks, chunk fstart,
+ mapping_part& mapped, bool evict = false);
+ // Unmap @nchunks last chunks, also doing an actual unmapping on the device
+ // if @deep. Returns in @unmapped what was unmapped and non-zero on failure.
+ // Called with _lock held (for writing).
+ int unmap(chunk nchunks, mapping_part& unmapped, bool deep = false);
+ // Map @nchunks chunks of the file with @nodeid (opened as @fh), starting at
+ // chunk @fstart of the file and chunk @mstart of the window. Returns
+ // non-zero on failure. Called with _lock held (for writing).
+ int map_ll(uint64_t nodeid, uint64_t fh, chunk nchunks, chunk fstart,
+ chunk mstart);
+ // Unmap @nchunks chunks, starting at chunk @mstart of the window. Returns
+ // non-zero on failure. Called with _lock held (for writing).
+ int unmap_ll(chunk nchunks, chunk mstart);
+
+ // Return in @found the largest contiguous existing mapping for @nodeid
+ // starting at @fstart. If none found, returns false. Called with _lock held
+ // (for reading).
+ bool find(uint64_t nodeid, chunk fstart, mapping_part& found) const;
+ // Returns the first empty chunk in the window, or one-past-the-last if the
+ // window is full. Called with _lock held (for reading).
+ chunk first_empty() const;
+
+ virtio::fs& _drv;
+ const virtio::fs::dax_window* const _window;
+ const size_t _chunk_size;
+ const chunk _window_chunks;
+ // TODO OPT: Switch to rwlock
+ mutex _lock;
+ std::vector<mapping> _mappings;
+};
+
+}
--
2.27.0

Fotis Xenakis

unread,
Jun 19, 2020, 12:25:53 PM6/19/20
to osv...@googlegroups.com, Fotis Xenakis
This integrates the dax window manager with the virtio-fs file system
operations:

- Removes the lock from virtio::fs::dax_window, since this is now
handled by the window manager.
- A manager instance has a lifetime matching a virtio-fs mount's
lifetime: it is created at mount time and deleted at unmount time.
- The FUSE_READ fallback as well as the logic of using it when the DAX
window is not available or a read using it fails are untouched.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
---
drivers/virtio-fs.hh | 2 +-
fs/virtiofs/virtiofs.hh | 19 ++++-
fs/virtiofs/virtiofs_i.hh | 3 -
fs/virtiofs/virtiofs_vfsops.cc | 37 ++++++--
fs/virtiofs/virtiofs_vnops.cc | 151 ++++++---------------------------
5 files changed, 73 insertions(+), 139 deletions(-)

diff --git a/drivers/virtio-fs.hh b/drivers/virtio-fs.hh
index d78d7962..a2f7f748 100644
--- a/drivers/virtio-fs.hh
+++ b/drivers/virtio-fs.hh
@@ -10,6 +10,7 @@

#include <osv/mutex.h>
#include <osv/waitqueue.hh>
+
#include "drivers/virtio.hh"
#include "drivers/virtio-device.hh"
#include "fs/virtiofs/fuse_kernel.h"
@@ -49,7 +50,6 @@ public:
struct dax_window {
mmioaddr_t addr;
u64 len;
- mutex lock;
};

explicit fs(virtio_device& dev);
diff --git a/fs/virtiofs/virtiofs.hh b/fs/virtiofs/virtiofs.hh
index 475d5eba..32121686 100644
--- a/fs/virtiofs/virtiofs.hh
+++ b/fs/virtiofs/virtiofs.hh
@@ -8,11 +8,11 @@
#ifndef __INCLUDE_VIRTIOFS_H__
#define __INCLUDE_VIRTIOFS_H__

-#include <osv/vnode.h>
+#include <osv/debug.h>
#include <osv/mount.h>
-#include <osv/dentry.h>
-#include <osv/prex.h>
-#include <osv/buf.h>
+#include <osv/vnode.h>
+
+#include "drivers/virtio-fs.hh"
#include "fuse_kernel.h"

#define VIRTIOFS_DEBUG_ENABLED 1
@@ -23,6 +23,17 @@
#define virtiofs_debug(...)
#endif

+// Necessary pre-declaration because virtiofs::dax depends on virtiofs_inode,
+// declared below.
+namespace virtiofs {
+class dax_manager;
+}
+
+struct virtiofs_mount_data {
+ virtio::fs* drv;
+ virtiofs::dax_manager* dax;
+};
+
struct virtiofs_inode {
uint64_t nodeid;
struct fuse_attr attr;
diff --git a/fs/virtiofs/virtiofs_i.hh b/fs/virtiofs/virtiofs_i.hh
index 76533d74..e879546d 100644
--- a/fs/virtiofs/virtiofs_i.hh
+++ b/fs/virtiofs/virtiofs_i.hh
@@ -8,9 +8,6 @@
#ifndef VIRTIOFS_IO_H
#define VIRTIOFS_IO_H

-#include "fuse_kernel.h"
-#include <osv/mutex.h>
-#include <osv/waitqueue.hh>
#include "drivers/virtio-fs.hh"

int fuse_req_send_and_receive_reply(virtio::fs* drv, uint32_t opcode,
diff --git a/fs/virtiofs/virtiofs_vfsops.cc b/fs/virtiofs/virtiofs_vfsops.cc
index 30daa42b..e2101200 100644
--- a/fs/virtiofs/virtiofs_vfsops.cc
+++ b/fs/virtiofs/virtiofs_vfsops.cc
@@ -5,14 +5,19 @@
* BSD license as described in the LICENSE file in the top-level directory.
*/

+#include <atomic>
+#include <memory>
+#include <new>
#include <sys/types.h>
-#include <osv/device.h>
+
+#include <api/assert.h>
#include <osv/debug.h>
-#include <iomanip>
-#include <iostream>
+#include <osv/device.h>
+
+#include "drivers/virtio-fs.hh"
#include "virtiofs.hh"
+#include "virtiofs_dax.hh"
#include "virtiofs_i.hh"
-#include "drivers/virtio-fs.hh"

using fuse_request = virtio::fs::fuse_request;

@@ -115,7 +120,26 @@ static int virtiofs_mount(struct mount* mp, const char* dev, int flags,

virtiofs_set_vnode(mp->m_root->d_vnode, root_node);

- mp->m_data = drv;
+ // TODO OPT: Coupling the window manager to a mount as done here means that
+ // multiple mounts of the same device will *NOT* work. Ideally, we want to
+ // couple the window manager with the device, but that would require the
+ // device driver layer to be aware of the filesystem layer, thus depending
+ // on it, creating a circular dependency.
+ auto* m_data = new (std::nothrow) virtiofs_mount_data;
+ if (!m_data) {
+ return ENOMEM;
+ }
+ m_data->drv = drv;
+ if (drv->get_dax()) {
+ m_data->dax = new (std::nothrow) virtiofs::dax_manager {*drv};
+ if (!m_data->dax) {
+ return ENOMEM;
+ }
+ } else {
+ m_data->dax = nullptr;
+ }
+
+ mp->m_data = m_data;
mp->m_dev = device;

return 0;
@@ -141,6 +165,9 @@ static int virtiofs_statfs(struct mount* mp, struct statfs* statp)

static int virtiofs_unmount(struct mount* mp, int flags)
{
+ auto* m_data = static_cast<virtiofs_mount_data*>(mp->m_data);
+ delete m_data->dax;
+ delete m_data;
struct device* dev = mp->m_dev;
return device_close(dev);
}
diff --git a/fs/virtiofs/virtiofs_vnops.cc b/fs/virtiofs/virtiofs_vnops.cc
index 38c000ec..822780ad 100644
--- a/fs/virtiofs/virtiofs_vnops.cc
+++ b/fs/virtiofs/virtiofs_vnops.cc
@@ -5,28 +5,27 @@
* BSD license as described in the LICENSE file in the top-level directory.
*/

-#include <sys/stat.h>
+#include <cstdlib>
+#include <cstring>
#include <dirent.h>
-#include <sys/param.h>
-
#include <errno.h>
-#include <string.h>
-#include <stdlib.h>
#include <fcntl.h>
+#include <sys/param.h>
+#include <sys/stat.h>
+#include <sys/types.h>

-#include <osv/prex.h>
-#include <osv/vnode.h>
-#include <osv/file.h>
-#include <osv/mount.h>
+#include <osv/contiguous_alloc.hh>
#include <osv/debug.h>
-
-#include <sys/types.h>
#include <osv/device.h>
-#include <osv/sched.hh>
+#include <osv/file.h>
#include <osv/mmio.hh>
-#include <osv/contiguous_alloc.hh>
+#include <osv/mount.h>
+#include <osv/prex.h>
+#include <osv/sched.hh>
+#include <osv/vnode.h>

#include "virtiofs.hh"
+#include "virtiofs_dax.hh"
#include "virtiofs_i.hh"

static constexpr uint32_t OPEN_FLAGS = O_RDONLY;
@@ -59,7 +58,8 @@ static int virtiofs_lookup(struct vnode* vnode, char* name, struct vnode** vpp)
}
strcpy(in_args.get(), name);

- auto* drv = static_cast<virtio::fs*>(vnode->v_mount->m_data);
+ auto* m_data = static_cast<virtiofs_mount_data*>(vnode->v_mount->m_data);
+ auto* drv = m_data->drv;
auto error = fuse_req_send_and_receive_reply(drv, FUSE_LOOKUP,
inode->nodeid, in_args.get(), in_args_len, out_args.get(),
sizeof(*out_args));
@@ -110,7 +110,8 @@ static int virtiofs_open(struct file* fp)
}
in_args->flags = OPEN_FLAGS;

- auto* drv = static_cast<virtio::fs*>(vnode->v_mount->m_data);
+ auto* m_data = static_cast<virtiofs_mount_data*>(vnode->v_mount->m_data);
+ auto* drv = m_data->drv;
auto error = fuse_req_send_and_receive_reply(drv, FUSE_OPEN,
inode->nodeid, in_args.get(), sizeof(*in_args), out_args.get(),
sizeof(*out_args));
@@ -145,7 +146,8 @@ static int virtiofs_close(struct vnode* vnode, struct file* fp)
in_args->fh = f_data->file_handle;
in_args->flags = OPEN_FLAGS; // need to be same as in FUSE_OPEN

- auto* drv = static_cast<virtio::fs*>(vnode->v_mount->m_data);
+ auto* m_data = static_cast<virtiofs_mount_data*>(vnode->v_mount->m_data);
+ auto* drv = m_data->drv;
auto error = fuse_req_send_and_receive_reply(drv, FUSE_RELEASE,
inode->nodeid, in_args.get(), sizeof(*in_args), nullptr, 0);
if (error) {
@@ -172,7 +174,8 @@ static int virtiofs_readlink(struct vnode* vnode, struct uio* uio)
return ENOMEM;
}

- auto* drv = static_cast<virtio::fs*>(vnode->v_mount->m_data);
+ auto* m_data = static_cast<virtiofs_mount_data*>(vnode->v_mount->m_data);
+ auto* drv = m_data->drv;
auto error = fuse_req_send_and_receive_reply(drv, FUSE_READLINK,
inode->nodeid, nullptr, 0, link_path.get(), PATH_MAX);
if (error) {
@@ -185,107 +188,6 @@ static int virtiofs_readlink(struct vnode* vnode, struct uio* uio)
return uiomove(link_path.get(), strlen(link_path.get()), uio);
}

-// Read @read_amt bytes from @inode, using the DAX window.
-static int virtiofs_read_direct(virtiofs_inode& inode, u64 file_handle,
- u64 read_amt, virtio::fs& drv, struct uio& uio)
-{
- auto* dax = drv.get_dax();
- // Enter the critical path: setup mapping -> read -> remove mapping
- std::lock_guard<mutex> guard {dax->lock};
-
- // Setup mapping
- // NOTE: There are restrictions on the arguments to FUSE_SETUPMAPPING, from
- // the spec: "Alignment constraints for FUSE_SETUPMAPPING and
- // FUSE_REMOVEMAPPING requests are communicated during FUSE_INIT
- // negotiation"):
- // - foffset: multiple of map_alignment from FUSE_INIT
- // - len: not larger than remaining file?
- // - moffset: multiple of map_alignment from FUSE_INIT
- // In practice, map_alignment is the host's page size, because foffset and
- // moffset are passed to mmap() on the host.
- std::unique_ptr<fuse_setupmapping_in> in_args {
- new (std::nothrow) fuse_setupmapping_in()};
- if (!in_args) {
- return ENOMEM;
- }
- in_args->fh = file_handle;
- in_args->flags = 0;
- uint64_t moffset = 0;
- in_args->moffset = moffset;
-
- auto map_align = drv.get_map_alignment();
- if (map_align < 0) {
- kprintf("[virtiofs] inode %lld, map alignment not set\n", inode.nodeid);
- return ENOTSUP;
- }
- uint64_t alignment = 1ul << map_align;
- auto foffset = align_down(static_cast<uint64_t>(uio.uio_offset), alignment);
- in_args->foffset = foffset;
-
- // The possible excess part of the file mapped due to alignment constraints
- // NOTE: map_excess <= alignemnt
- auto map_excess = uio.uio_offset - foffset;
- if (moffset + map_excess >= dax->len) {
- // No usable room in DAX window due to map_excess
- return ENOBUFS;
- }
- // Actual read amount is read_amt, or what fits in the DAX window
- auto read_amt_act = std::min<uint64_t>(read_amt,
- dax->len - moffset - map_excess);
- in_args->len = read_amt_act + map_excess;
-
- virtiofs_debug("inode %lld, setting up mapping (foffset=%lld, len=%lld, "
- "moffset=%lld)\n", inode.nodeid, in_args->foffset,
- in_args->len, in_args->moffset);
- auto error = fuse_req_send_and_receive_reply(&drv, FUSE_SETUPMAPPING,
- inode.nodeid, in_args.get(), sizeof(*in_args), nullptr, 0);
- if (error) {
- kprintf("[virtiofs] inode %lld, mapping setup failed\n", inode.nodeid);
- return error;
- }
-
- // Read from the DAX window
- // NOTE: It shouldn't be necessary to use the mmio* interface (i.e. volatile
- // accesses). From the spec: "Drivers map this shared memory region with
- // writeback caching as if it were regular RAM."
- // The location of the requested data in the DAX window
- auto req_data = dax->addr + moffset + map_excess;
- error = uiomove(const_cast<void*>(req_data), read_amt_act, &uio);
- if (error) {
- kprintf("[virtiofs] inode %lld, uiomove failed\n", inode.nodeid);
- return error;
- }
-
- // Remove mapping
- // NOTE: This is only necessary when FUSE_SETUPMAPPING fails. From the spec:
- // "If the device runs out of resources the FUSE_SETUPMAPPING request fails
- // until resources are available again following FUSE_REMOVEMAPPING."
- auto r_in_args_size = sizeof(fuse_removemapping_in) +
- sizeof(fuse_removemapping_one);
- std::unique_ptr<u8> r_in_args {new (std::nothrow) u8[r_in_args_size]};
- if (!r_in_args) {
- return ENOMEM;
- }
- auto r_in = new (r_in_args.get()) fuse_removemapping_in();
- auto r_one = new (r_in_args.get() + sizeof(fuse_removemapping_in))
- fuse_removemapping_one();
- r_in->count = 1;
- r_one->moffset = in_args->moffset;
- r_one->len = in_args->len;
-
- virtiofs_debug("inode %lld, removing mapping (moffset=%lld, len=%lld)\n",
- inode.nodeid, r_one->moffset, r_one->len);
- error = fuse_req_send_and_receive_reply(&drv, FUSE_REMOVEMAPPING,
- inode.nodeid, r_in_args.get(), r_in_args_size, nullptr, 0);
- if (error) {
- kprintf("[virtiofs] inode %lld, mapping removal failed\n",
- inode.nodeid);
- return error;
- }
-
- return 0;
-}
-
// Read @read_amt bytes from @inode, using the fallback FUSE_READ mechanism.
static int virtiofs_read_fallback(virtiofs_inode& inode, u64 file_handle,
u32 read_amt, u32 flags, virtio::fs& drv, struct uio& uio)
@@ -314,9 +216,6 @@ static int virtiofs_read_fallback(virtiofs_inode& inode, u64 file_handle,
return uiomove(buf.get(), read_amt, &uio);
}

-// TODO: Optimize it to reduce number of exits to host (each
-// fuse_req_send_and_receive_reply()) by reading eagerly "ahead/around" just
-// like ROFS does and caching it
static int virtiofs_read(struct vnode* vnode, struct file* fp, struct uio* uio,
int ioflag)
{
@@ -343,17 +242,17 @@ static int virtiofs_read(struct vnode* vnode, struct file* fp, struct uio* uio,

auto* inode = static_cast<virtiofs_inode*>(vnode->v_data);
auto* file_data = static_cast<virtiofs_file_data*>(fp->f_data);
- auto* drv = static_cast<virtio::fs*>(vnode->v_mount->m_data);
+ auto* m_data = static_cast<virtiofs_mount_data*>(vnode->v_mount->m_data);
+ auto* drv = m_data->drv;
+ auto* dax = m_data->dax;

// Total read amount is what they requested, or what is left
auto read_amt = std::min<uint64_t>(uio->uio_resid,
inode->attr.size - uio->uio_offset);

- if (drv->get_dax()) {
+ if (dax) {
// Try to read from DAX
- if (!virtiofs_read_direct(*inode, file_data->file_handle, read_amt,
- *drv, *uio)) {
-
+ if (!dax->read(*inode, file_data->file_handle, read_amt, *uio, false)) {
return 0;
}
}
--
2.27.0

Fotis Xenakis

unread,
Jun 21, 2020, 5:05:12 AM6/21/20
to osv...@googlegroups.com, Fotis Xenakis
With the current, simplistic implementation of DAX window support, the
window only hosts one file mapping at a time. These patches rectify
that, implementing an initial DAX window manager which handles multiple
file mappings. This way, the DAX window functions as a cache for
virtio-fs file systems.

Changes since v1:
- Made the dax_manager:virtio-fs device relation 1:1 (previously it was
1:1 with a mount, so mounting a device multiple times would not work).
- Renamed virtiofs_mount_data.dax to virtiofs_mount_data.dax_mgr for
clarity.

Fotis Xenakis (2):
virtio-fs: implement dax window manager
virtio-fs: use multiple dax mappings in filesystem

Makefile | 39 ++---
drivers/virtio-fs.hh | 11 +-
fs/virtiofs/virtiofs.hh | 21 ++-
fs/virtiofs/virtiofs_dax.cc | 268 +++++++++++++++++++++++++++++++++
fs/virtiofs/virtiofs_dax.hh | 109 ++++++++++++++
fs/virtiofs/virtiofs_i.hh | 3 -
fs/virtiofs/virtiofs_vfsops.cc | 53 ++++++-
fs/virtiofs/virtiofs_vnops.cc | 151 +++----------------
8 files changed, 497 insertions(+), 158 deletions(-)
create mode 100644 fs/virtiofs/virtiofs_dax.cc
create mode 100644 fs/virtiofs/virtiofs_dax.hh

--
2.27.0

Fotis Xenakis

unread,
Jun 21, 2020, 5:06:28 AM6/21/20
to osv...@googlegroups.com, Fotis Xenakis
For details on the manager's policy, please see
fs/virtiofs/virtiofs_dax.hh.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
---
Makefile | 39 +++---
fs/virtiofs/virtiofs_dax.cc | 268 ++++++++++++++++++++++++++++++++++++
fs/virtiofs/virtiofs_dax.hh | 109 +++++++++++++++
3 files changed, 397 insertions(+), 19 deletions(-)
create mode 100644 fs/virtiofs/virtiofs_dax.cc
create mode 100644 fs/virtiofs/virtiofs_dax.hh

+ return ENOMEM;
+ }
+ return ENOMEM;
+ }

Fotis Xenakis

unread,
Jun 21, 2020, 5:09:53 AM6/21/20
to osv...@googlegroups.com, Fotis Xenakis
This integrates the dax window manager with the virtio-fs file system
operations:

- Removes the lock from virtio::fs::dax_window, since this is now
handled by the window manager.
- A single manager instance is maintained for each virtio-fs device
having at least a singe mount. Multiple, concurrent mounts of the same
device share a common dax_manager.
- The FUSE_READ fallback as well as the logic of using it when the DAX
window is not available or a read using it fails are untouched.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
---
drivers/virtio-fs.hh | 11 ++-
fs/virtiofs/virtiofs.hh | 21 ++++-
fs/virtiofs/virtiofs_i.hh | 3 -
fs/virtiofs/virtiofs_vfsops.cc | 53 ++++++++++--
fs/virtiofs/virtiofs_vnops.cc | 151 ++++++---------------------------
5 files changed, 100 insertions(+), 139 deletions(-)

diff --git a/drivers/virtio-fs.hh b/drivers/virtio-fs.hh
index d78d7962..88a16939 100644
--- a/drivers/virtio-fs.hh
+++ b/drivers/virtio-fs.hh
@@ -8,8 +8,11 @@
#ifndef VIRTIO_FS_DRIVER_H
#define VIRTIO_FS_DRIVER_H

+#include <functional>
+
#include <osv/mutex.h>
#include <osv/waitqueue.hh>
+
#include "drivers/virtio.hh"
#include "drivers/virtio-device.hh"
#include "fs/virtiofs/fuse_kernel.h"
@@ -49,7 +52,13 @@ public:
struct dax_window {
mmioaddr_t addr;
u64 len;
- mutex lock;
+ };
+
+ // Helper enabling the use of fs* as key type in an unordered_* container.
+ struct hasher {
+ size_t operator()(fs* const _fs) const noexcept {
+ return std::hash<int>{}(_fs->_id);
+ }
};

explicit fs(virtio_device& dev);
diff --git a/fs/virtiofs/virtiofs.hh b/fs/virtiofs/virtiofs.hh
index 475d5eba..8a3b06f2 100644
--- a/fs/virtiofs/virtiofs.hh
+++ b/fs/virtiofs/virtiofs.hh
@@ -8,11 +8,13 @@
#ifndef __INCLUDE_VIRTIOFS_H__
#define __INCLUDE_VIRTIOFS_H__

-#include <osv/vnode.h>
+#include <memory>
+
+#include <osv/debug.h>
#include <osv/mount.h>
-#include <osv/dentry.h>
-#include <osv/prex.h>
-#include <osv/buf.h>
+#include <osv/vnode.h>
+
+#include "drivers/virtio-fs.hh"
#include "fuse_kernel.h"

#define VIRTIOFS_DEBUG_ENABLED 1
@@ -23,6 +25,17 @@
#define virtiofs_debug(...)
#endif

+// Necessary pre-declaration because virtiofs::dax depends on virtiofs_inode,
+// declared below.
+namespace virtiofs {
+class dax_manager;
+}
+
+struct virtiofs_mount_data {
+ virtio::fs* drv;
+ std::shared_ptr<virtiofs::dax_manager> dax_mgr;
+};
+
struct virtiofs_inode {
uint64_t nodeid;
struct fuse_attr attr;
diff --git a/fs/virtiofs/virtiofs_i.hh b/fs/virtiofs/virtiofs_i.hh
index 76533d74..e879546d 100644
--- a/fs/virtiofs/virtiofs_i.hh
+++ b/fs/virtiofs/virtiofs_i.hh
@@ -8,9 +8,6 @@
#ifndef VIRTIOFS_IO_H
#define VIRTIOFS_IO_H

-#include "fuse_kernel.h"
-#include <osv/mutex.h>
-#include <osv/waitqueue.hh>
#include "drivers/virtio-fs.hh"

int fuse_req_send_and_receive_reply(virtio::fs* drv, uint32_t opcode,
diff --git a/fs/virtiofs/virtiofs_vfsops.cc b/fs/virtiofs/virtiofs_vfsops.cc
index 30daa42b..b60a766b 100644
--- a/fs/virtiofs/virtiofs_vfsops.cc
+++ b/fs/virtiofs/virtiofs_vfsops.cc
@@ -5,19 +5,32 @@
* BSD license as described in the LICENSE file in the top-level directory.
*/

+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <new>
#include <sys/types.h>
-#include <osv/device.h>
+
+#include <api/assert.h>
#include <osv/debug.h>
-#include <iomanip>
-#include <iostream>
+#include <osv/device.h>
+#include <osv/mutex.h>
+
+#include "drivers/virtio-fs.hh"
#include "virtiofs.hh"
+#include "virtiofs_dax.hh"
#include "virtiofs_i.hh"
-#include "drivers/virtio-fs.hh"

using fuse_request = virtio::fs::fuse_request;

static std::atomic<uint64_t> fuse_unique_id(1);

+static struct {
+ std::unordered_map<virtio::fs*, std::shared_ptr<virtiofs::dax_manager>,
+ virtio::fs::hasher> mgrs;
+ mutex lock;
+} dax_managers;
+
int fuse_req_send_and_receive_reply(virtio::fs* drv, uint32_t opcode,
uint64_t nodeid, void* input_args_data, size_t input_args_size,
void* output_args_data, size_t output_args_size)
@@ -115,7 +128,28 @@ static int virtiofs_mount(struct mount* mp, const char* dev, int flags,

virtiofs_set_vnode(mp->m_root->d_vnode, root_node);

- mp->m_data = drv;
+ auto* m_data = new (std::nothrow) virtiofs_mount_data;
+ if (!m_data) {
+ return ENOMEM;
+ }
+ m_data->drv = drv;
+ if (drv->get_dax()) {
+ // The device supports the DAX window
+ std::lock_guard<mutex> guard {dax_managers.lock};
+ auto found = dax_managers.mgrs.find(drv);
+ if (found != dax_managers.mgrs.end()) {
+ // There is a dax_manager already associated with this device (the
+ // device is already mounted)
+ m_data->dax_mgr = found->second;
+ } else {
+ m_data->dax_mgr = std::make_shared<virtiofs::dax_manager>(*drv);
+ if (!m_data->dax_mgr) {
+ return ENOMEM;
+ }
+ }
+ }
+
+ mp->m_data = m_data;
mp->m_dev = device;

return 0;
@@ -141,6 +175,15 @@ static int virtiofs_statfs(struct mount* mp, struct statfs* statp)

static int virtiofs_unmount(struct mount* mp, int flags)
{
+ auto* m_data = static_cast<virtiofs_mount_data*>(mp->m_data);
+ std::lock_guard<mutex> guard {dax_managers.lock};
+ if (m_data->dax_mgr && m_data->dax_mgr.use_count() == 2) {
+ // This was the last mount of this device. It's safe to delete the
+ // window manager.
+ dax_managers.mgrs.erase(m_data->drv);
+ }
+ delete m_data;
+
struct device* dev = mp->m_dev;
return device_close(dev);
}
diff --git a/fs/virtiofs/virtiofs_vnops.cc b/fs/virtiofs/virtiofs_vnops.cc
index 38c000ec..51be3ee4 100644
--- a/fs/virtiofs/virtiofs_vnops.cc
+++ b/fs/virtiofs/virtiofs_vnops.cc
@@ -5,28 +5,27 @@
* BSD license as described in the LICENSE file in the top-level directory.
+ auto dax_mgr = m_data->dax_mgr;

// Total read amount is what they requested, or what is left
auto read_amt = std::min<uint64_t>(uio->uio_resid,
inode->attr.size - uio->uio_offset);

- if (drv->get_dax()) {
+ if (dax_mgr) {
// Try to read from DAX
- if (!virtiofs_read_direct(*inode, file_data->file_handle, read_amt,
- *drv, *uio)) {
-
+ if (!dax_mgr->read(*inode, file_data->file_handle, read_amt, *uio)) {
return 0;
}
}
--
2.27.0

Waldek Kozaczuk

unread,
Jun 26, 2020, 12:29:22 AM6/26/20
to Fotis Xenakis, OSv Development
Hi,

Very impressive work! 

I cannot really say I fully appreciate all details nor know this part of the spec but your code looks good and makes sense.

How did you actually test it especially the edge conditions - window getting full, etc?

Waldek


--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/AM0PR03MB629260F9F8CC3272E9CF435AA6960%40AM0PR03MB6292.eurprd03.prod.outlook.com.

Fotis Xenakis

unread,
Jun 26, 2020, 5:59:22 AM6/26/20
to OSv Development
Testing is actually a matter I wanted to discuss.
The dax_manager class is modular enough, so I have unit tested it in isolation (mocking the low level methods of course). Then, I have also done basic integration testing of the complete implementation, playing with the window size (artificially limiting its size to test the full window conditions) as well as the chunk size (also limiting it to e.g. 4k to test sane behavior), in single and multi-threaded cases.

On the way, I thought it would be nice to also submit the unit tests, but realized I don't really know if / how they should be integrated in the repo, so I wanted to ask: apart from the tests in the tests directory (which AFAIK are integration tests), are there unit tests for other parts in OSv and if so, how do they work (where do they live, how are they run, do they utilize e.g. boost.Test)?

Thanks,
Fotis

Waldek Kozaczuk

unread,
Jun 26, 2020, 6:11:25 PM6/26/20
to OSv Development
So I have some basic manual tests involving running a simple hello world app with DAX on and off and things look good!

I will commit this code as-is. Please see my thoughts about unit tests.

Thanks for your contribution. It is really a great feature and I am very excited about it. With DAX on we can map 2MB chunk once (couple of exits to the hypervisor) and then simply copy the data without any exists, right? So it should be way faster - even faster than with regular virtio-blk, am I wrong?

Waldek

On Friday, June 26, 2020 at 5:59:22 AM UTC-4 Fotis Xenakis wrote:
Testing is actually a matter I wanted to discuss.
The dax_manager class is modular enough, so I have unit tested it in isolation (mocking the low level methods of course). Then, I have also done basic integration testing of the complete implementation, playing with the window size (artificially limiting its size to test the full window conditions) as well as the chunk size (also limiting it to e.g. 4k to test sane behavior), in single and multi-threaded cases.

On the way, I thought it would be nice to also submit the unit tests, but realized I don't really know if / how they should be integrated in the repo, so I wanted to ask: apart from the tests in the tests directory (which AFAIK are integration tests), are there unit tests for other parts in OSv and if so, how do they work (where do they live, how are they run, do they utilize e.g. boost.Test)?
You are right - most of the tests under /tests directory are integration tests is a sense they use POSIX api in most cases to test various areas of OSv functionality and run on OSv. But there is nothing that prevents us from running a standalone app (a PIE) that would be a pure unit test of dax_manager class on OSv as well. I think we use boost test framework in many cases (please look at some of tests under /tests for an example)

Commit Bot

unread,
Jun 26, 2020, 6:12:51 PM6/26/20
to osv...@googlegroups.com, Fotis Xenakis
From: Fotis Xenakis <fo...@windowslive.com>
Committer: Waldemar Kozaczuk <jwkoz...@gmail.com>
Branch: master

virtio-fs: implement dax window manager

For details on the manager's policy, please see
fs/virtiofs/virtiofs_dax.hh.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
Message-Id: <AM0PR03MB62924F7C32...@AM0PR03MB6292.eurprd03.prod.outlook.com>

---
diff --git a/Makefile b/Makefile
@@ -1771,7 +1771,8 @@ fs_objs += rofs/rofs_vfsops.o \
rofs/rofs_common.o

fs_objs += virtiofs/virtiofs_vfsops.o \
- virtiofs/virtiofs_vnops.o
+ virtiofs/virtiofs_vnops.o \
+ virtiofs/virtiofs_dax.o

fs_objs += pseudofs/pseudofs.o
fs_objs += procfs/procfs_vnops.o
@@ -1978,7 +1979,7 @@ libuutil-objects = $(foreach file, $(libuutil-file-list), $(out)/bsd/cddl/contri

define libuutil-includes
bsd/cddl/contrib/opensolaris/lib/libuutil/common
- bsd/cddl/compat/opensolaris/include
+ bsd/cddl/compat/opensolaris/include
bsd/sys/cddl/contrib/opensolaris/uts/common
bsd/sys/cddl/compat/opensolaris
bsd/cddl/contrib/opensolaris/head
diff --git a/fs/virtiofs/virtiofs_dax.cc b/fs/virtiofs/virtiofs_dax.cc
--- a/fs/virtiofs/virtiofs_dax.cc
--- a/fs/virtiofs/virtiofs_dax.hh

Commit Bot

unread,
Jun 26, 2020, 6:12:52 PM6/26/20
to osv...@googlegroups.com, Fotis Xenakis
From: Fotis Xenakis <fo...@windowslive.com>
Committer: Waldemar Kozaczuk <jwkoz...@gmail.com>
Branch: master

virtio-fs: use multiple dax mappings in filesystem

This integrates the dax window manager with the virtio-fs file system
operations:

- Removes the lock from virtio::fs::dax_window, since this is now
handled by the window manager.
- A single manager instance is maintained for each virtio-fs device
having at least a singe mount. Multiple, concurrent mounts of the same
device share a common dax_manager.
- The FUSE_READ fallback as well as the logic of using it when the DAX
window is not available or a read using it fails are untouched.

Signed-off-by: Fotis Xenakis <fo...@windowslive.com>
Message-Id: <AM0PR03MB629260F9F8...@AM0PR03MB6292.eurprd03.prod.outlook.com>

---
diff --git a/drivers/virtio-fs.hh b/drivers/virtio-fs.hh
--- a/fs/virtiofs/virtiofs_i.hh
+++ b/fs/virtiofs/virtiofs_i.hh
@@ -8,9 +8,6 @@
#ifndef VIRTIOFS_IO_H
#define VIRTIOFS_IO_H

-#include "fuse_kernel.h"
-#include <osv/mutex.h>
-#include <osv/waitqueue.hh>
#include "drivers/virtio-fs.hh"

int fuse_req_send_and_receive_reply(virtio::fs* drv, uint32_t opcode,
diff --git a/fs/virtiofs/virtiofs_vfsops.cc b/fs/virtiofs/virtiofs_vfsops.cc

Waldek Kozaczuk

unread,
Jun 26, 2020, 6:19:40 PM6/26/20
to OSv Development
As you have seen I have applied this patch because the code looks good and works. But I am curious how is it possible, given what you wrote above, that the DAX window mapping can be accessed like this? It seems as if the physical - virtual memory mapping is 1-1 but then it might collide on the virtual side with other mappings OSv is maintaining. Am I missing something? Or are we just lucky it just happens to not collide during the tests we have performed so far?

Fotis Xenakis

unread,
Jun 27, 2020, 8:44:49 AM6/27/20
to OSv Development
Στις Σάββατο, 27 Ιουνίου 2020 στις 1:11:25 π.μ. UTC+3, ο χρήστης jwkoz...@gmail.com έγραψε:
So I have some basic manual tests involving running a simple hello world app with DAX on and off and things look good!

I will commit this code as-is. Please see my thoughts about unit tests.

Thanks for your contribution. It is really a great feature and I am very excited about it. With DAX on we can map 2MB chunk once (couple of exits to the hypervisor) and then simply copy the data without any exists, right? So it should be way faster - even faster than with regular virtio-blk, am I wrong?
A DAX mapping translates to an mmap() on the host [1], with pages shared with the guest. So, after establishing it, the VMM should be out of the way as I understand it (the possible page faults will be handled by the host kernel). To be honest, I can't make a comparison with virtio-blk at this point, but performance should be at least somewhat close to it. For some performance figures on Linux you can look at [2] and [3].

Nevertheless, my next step (already started) is to evaluate performance. The plan is to use fio [4] and probably some custom tests (suggestions always welcome!) and follow up on the list when I have some results of course.


Waldek

On Friday, June 26, 2020 at 5:59:22 AM UTC-4 Fotis Xenakis wrote:
Testing is actually a matter I wanted to discuss.
The dax_manager class is modular enough, so I have unit tested it in isolation (mocking the low level methods of course). Then, I have also done basic integration testing of the complete implementation, playing with the window size (artificially limiting its size to test the full window conditions) as well as the chunk size (also limiting it to e.g. 4k to test sane behavior), in single and multi-threaded cases.

On the way, I thought it would be nice to also submit the unit tests, but realized I don't really know if / how they should be integrated in the repo, so I wanted to ask: apart from the tests in the tests directory (which AFAIK are integration tests), are there unit tests for other parts in OSv and if so, how do they work (where do they live, how are they run, do they utilize e.g. boost.Test)?
You are right - most of the tests under /tests directory are integration tests is a sense they use POSIX api in most cases to test various areas of OSv functionality and run on OSv. But there is nothing that prevents us from running a standalone app (a PIE) that would be a pure unit test of dax_manager class on OSv as well. I think we use boost test framework in many cases (please look at some of tests under /tests for an example)
Thanks for the insight here! In this case, I will slightly refactor dax_manager (making it more testable) and port my unit tests to go with the approach you are suggesting.

Fotis Xenakis

unread,
Jun 27, 2020, 9:12:42 AM6/27/20
to OSv Development
It's not entirely clear to me what the concern is about collisions. The DAX window shows up as device memory, subsequently mapped in the guest's virtual address space (that's taken care of by existing code in the PCI subsystem).
The note above (btw, this has been present since the initial DAX support) touches on why we are using uiomove (i.e. plain memory accesses) instead of mmio_get* [1] (i.e. memory accesses with the volatile specifier, usually seen when touching device memory). I can't claim to be 100% sure this is always correct (and it would be hard to debug if it isn't), but the spec's note and my reasoning (taking into account the underlying implementation of the DAX window with mmap) provide some confidence.

Waldek Kozaczuk

unread,
Jun 29, 2020, 10:06:54 AM6/29/20
to OSv Development
Makes sense. Thanks for your explanation.

Fotis Xenakis

unread,
Mar 6, 2021, 10:57:19 AM3/6/21
to OSv Development
On Saturday, June 27, 2020 at 1:11:25 AM UTC+3 jwkoz...@gmail.com wrote:
So I have some basic manual tests involving running a simple hello world app with DAX on and off and things look good!

I will commit this code as-is. Please see my thoughts about unit tests.

Thanks for your contribution. It is really a great feature and I am very excited about it. With DAX on we can map 2MB chunk once (couple of exits to the hypervisor) and then simply copy the data without any exists, right? So it should be way faster - even faster than with regular virtio-blk, am I wrong?

Waldek

On Friday, June 26, 2020 at 5:59:22 AM UTC-4 Fotis Xenakis wrote:
Testing is actually a matter I wanted to discuss.
The dax_manager class is modular enough, so I have unit tested it in isolation (mocking the low level methods of course). Then, I have also done basic integration testing of the complete implementation, playing with the window size (artificially limiting its size to test the full window conditions) as well as the chunk size (also limiting it to e.g. 4k to test sane behavior), in single and multi-threaded cases.

On the way, I thought it would be nice to also submit the unit tests, but realized I don't really know if / how they should be integrated in the repo, so I wanted to ask: apart from the tests in the tests directory (which AFAIK are integration tests), are there unit tests for other parts in OSv and if so, how do they work (where do they live, how are they run, do they utilize e.g. boost.Test)?
You are right - most of the tests under /tests directory are integration tests is a sense they use POSIX api in most cases to test various areas of OSv functionality and run on OSv. But there is nothing that prevents us from running a standalone app (a PIE) that would be a pure unit test of dax_manager class on OSv as well. I think we use boost test framework in many cases (please look at some of tests under /tests for an example)
A couple of patches have just been posted, containing these unit tests, with the approach you had suggested. Feel free to take a look whenever you the a chance!
Reply all
Reply to author
Forward
0 new messages