[PATCH 1/2] reactor: add passthru functionality to io_uring backend


shw096@snu.ac.kr <shw096@snu.ac.kr>
Aug 29, 2022, 8:17:32 AM
to seastar-dev@googlegroups.com, Heewon Shin, JinyongHa
From: Heewon Shin <shw...@snu.ac.kr>

Passthru is the functionality that enables users to submit manually constructed
NVMe commands. Passthru is also available via IOCTL, but because of IOCTL's
overhead it is not suitable for performance-sensitive usage.

ZNS is a new type of NVMe SSD that only supports sequential writes within a
zone. Because NVMe commands are executed out of order, the host incurs strict
address-sequencing overhead to preserve write order.

The ZNS append command is a new type of NVMe command that aims to relax the
overhead of deciding the block address at which user data is placed. With the
append command, the user only submits the command to a zone, and the ZNS SSD
decides the exact block address of the user data according to its own placement
policy. The assigned block address is returned to the user via the NVMe
completion. However, the append command is not provided through a user-level
library interface like aio_write(). The only way to use the append command is
to submit a manually constructed NVMe command.

The io_uring library provides passthru with low overhead, so it is suitable for
submitting I/O commands such as append or write. We aim to use this passthru
functionality to issue the append command from user level.

By using the append command at user level, users can avoid the synchronization
overhead of sequencing addresses that the legacy write command requires.

This commit contains the implementation of the passthru functionality via the
io_uring library. It also extends the return value of I/O operations in
reactor_backend_uring from size_t to io_result, which contains a size_t and an
ssize_t. The ssize_t carries the block address assigned by the ZNS SSD for the
append command. For other commands such as read or write, it is fixed to -1
for compatibility.
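
As a rough illustration of how a caller might consume the new io_result pair
(the calling code below is only a sketch, not part of this patch), the first
element is the byte count and the second is the address assigned by the device:

    // Sketch only: assumes io_result = std::tuple<size_t, ssize_t> as introduced
    // by this patch; the surrounding call is illustrative.
    return queue_one_request(pc, dnl, std::move(req), intent, {}).then(
            [] (io_result r) {
        size_t  bytes        = std::get<0>(r);  // bytes transferred, as before
        ssize_t assigned_lba = std::get<1>(r);  // byte address assigned by the ZNS SSD; -1 for read/write
        if (assigned_lba >= 0) {
            // append: record where the device actually placed the data
        }
        return make_ready_future<size_t>(bytes);
    });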

The following commit contains the implementation of the append command in the
file-related classes.

Unit testing was done with our ZNS SSD and test code. However, the test is not
suitable for upstreaming because it requires a ZNS block device, which cannot
be emulated.

Signed-off-by: Heewon Shin <shw...@snu.ac.kr>
Signed-off-by: JinyongHa <jy20...@samsung.com>
---
CMakeLists.txt | 8 ++
cmake/FindLibNvme.cmake | 70 +++++++++
cmake/SeastarDependencies.cmake | 1 +
include/seastar/core/internal/io_request.hh | 28 +++-
include/seastar/core/io_queue.hh | 4 +-
include/seastar/core/reactor.hh | 13 +-
install-dependencies.sh | 1 +
src/core/io_queue.cc | 21 +--
src/core/reactor.cc | 4 +-
src/core/reactor_backend.cc | 150 +++++++++++++++++++-
10 files changed, 281 insertions(+), 19 deletions(-)
create mode 100644 cmake/FindLibNvme.cmake

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 0bf0aeb8..4b0778dc 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -937,6 +937,13 @@ if (Seastar_IO_URING)
PRIVATE URING::uring)
endif ()

+set_option_if_package_is_found(Seastar_NVME LibNvme)
+if (Seastar_NVME)
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_NVME)
+ target_link_libraries (seastar
+ PRIVATE NVME::nvme)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
@@ -1260,6 +1267,7 @@ if (Seastar_INSTALL)
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findyaml-cpp.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/SeastarDependencies.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindLibUring.cmake
+ ${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindLibNvme.cmake
DESTINATION ${install_cmakedir})

install (
diff --git a/cmake/FindLibNvme.cmake b/cmake/FindLibNvme.cmake
new file mode 100644
index 00000000..1ee9af0e
--- /dev/null
+++ b/cmake/FindLibNvme.cmake
@@ -0,0 +1,70 @@
+#
+# This file is open source software, licensed to you under the terms
+# of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+# distributed with this work for additional information regarding copyright
+# ownership. You may not use this file except in compliance with the License.
+#
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+#
+# Copyright (C) 2022 ScyllaDB
+#
+
+find_package (PkgConfig REQUIRED)
+
+pkg_search_module (NVME libnvme)
+INCLUDE(CheckIncludeFile)
+
+find_library (NVME_LIBRARY
+ NAMES nvme
+ HINTS
+ ${NVME_PC_LIBDIR}
+ ${NVME_PC_LIBRARY_DIRS})
+
+find_path (NVME_INCLUDE_DIR
+ NAMES libnvme.h
+ HINTS
+ ${NVME_PC_INCLUDE_DIR}
+ ${NVME_PC_INCLUDE_DIRS})
+
+if (NVME_INCLUDE_DIR)
+ include (CheckStructHasMember)
+ include (CMakePushCheckState)
+ cmake_push_check_state (RESET)
+ list(APPEND CMAKE_REQUIRED_INCLUDES ${NVME_INCLUDE_DIR})
+endif ()
+
+mark_as_advanced (
+ NVME_LIBRARY
+ NVME_INCLUDE_DIR)
+
+
+include (FindPackageHandleStandardArgs)
+
+find_package_handle_standard_args (LibNvme
+ REQUIRED_VARS
+ NVME_LIBRARY
+ NVME_INCLUDE_DIR
+ VERSION_VAR NVME_PC_VERSION)
+
+set (NVME_LIBRARIES ${NVME_LIBRARY})
+set (NVME_INCLUDE_DIRS ${NVME_INCLUDE_DIR})
+
+if (NVME_FOUND AND NOT (TARGET NVME::nvme))
+ add_library (NVME::nvme UNKNOWN IMPORTED)
+
+ set_target_properties (NVME::nvme
+ PROPERTIES
+ IMPORTED_LOCATION ${NVME_LIBRARY}
+ INTERFACE_INCLUDE_DIRECTORIES ${NVME_INCLUDE_DIRS})
+endif ()
diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
index 5141ffda..da7882fd 100644
--- a/cmake/SeastarDependencies.cmake
+++ b/cmake/SeastarDependencies.cmake
@@ -92,6 +92,7 @@ macro (seastar_find_dependencies)
Concepts
GnuTLS
LibUring
+ LibNvme
LinuxMembarrier
Sanitizers
SourceLocation
diff --git a/include/seastar/core/internal/io_request.hh b/include/seastar/core/internal/io_request.hh
index 3d43490b..090ba8ef 100644
--- a/include/seastar/core/internal/io_request.hh
+++ b/include/seastar/core/internal/io_request.hh
@@ -35,7 +35,7 @@ namespace internal {

class io_request {
public:
- enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel };
+ enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel, append };
private:
operation _op;
int _fd;
@@ -64,6 +64,8 @@ class io_request {
} _size;

bool _nowait_works;
+ uint32_t _nsid = -1;
+ size_t _block_size = -1;

explicit io_request(operation op, int fd, int flags, ::msghdr* msg)
: _op(op)
@@ -108,6 +110,18 @@ class io_request {
_size.len = size;
}

+ explicit io_request(operation op, int fd, uint64_t pos, char* ptr, size_t size, bool nowait_works, uint32_t nsid, size_t block_size)
+ : _op(op)
+ , _fd(fd)
+ , _nowait_works(nowait_works)
+ , _nsid(nsid)
+ , _block_size(block_size)
+ {
+ _attr.pos = pos;
+ _ptr.addr = ptr;
+ _size.len = size;
+ }
+
explicit io_request(operation op, int fd, uint64_t pos, iovec* ptr, size_t size, bool nowait_works)
: _op(op)
, _fd(fd)
@@ -218,6 +232,14 @@ class io_request {
return _nowait_works;
}

+ uint32_t nsid() const {
+ return _nsid;
+ }
+
+ size_t block_size() const {
+ return _block_size;
+ }
+
static io_request make_read(int fd, uint64_t pos, void* address, size_t size, bool nowait_works) {
return io_request(operation::read, fd, pos, reinterpret_cast<char*>(address), size, nowait_works);
}
@@ -246,6 +268,10 @@ class io_request {
return io_request(operation::write, fd, pos, const_cast<char*>(reinterpret_cast<const char*>(address)), size, nowait_works);
}

+ static io_request make_append(int fd, uint64_t pos, const void* address, size_t size, bool nowait_works, uint32_t nsid, size_t block_size) {
+ return io_request(operation::append, fd, pos, const_cast<char*>(reinterpret_cast<const char*>(address)), size, nowait_works, nsid, block_size);
+ }
+
static io_request make_writev(int fd, uint64_t pos, std::vector<iovec>& iov, bool nowait_works) {
return io_request(operation::writev, fd, pos, iov.data(), iov.size(), nowait_works);
}
diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index fe1046ba..d2936244 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh
@@ -51,6 +51,8 @@ struct iocb;
}
}

+using io_result = std::tuple<size_t, ssize_t>;
+
using shard_id = unsigned;
using stream_id = unsigned;

@@ -121,7 +123,7 @@ class io_queue {
size_t len, internal::io_request req, io_intent* intent, iovec_keeper iovs = {}) noexcept;

future<size_t> queue_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
- future<size_t> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
+ future<io_result> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
void submit_request(io_desc_read_write* desc, internal::io_request req) noexcept;
void cancel_request(queued_io_request& req) noexcept;
void complete_cancelled_request(queued_io_request& req) noexcept;
diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index db2c2bb4..aee0f918 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -170,11 +170,20 @@ class io_intent;
class disk_config_params;

class io_completion : public kernel_completion {
+private:
+ size_t result = 0;
+
public:
- virtual void complete_with(ssize_t res) final override;
+ virtual void complete_with(ssize_t res) override;

virtual void complete(size_t res) noexcept = 0;
virtual void set_exception(std::exception_ptr eptr) noexcept = 0;
+ void set_result(size_t result_) {
+ result = result_;
+ }
+ size_t get_result() {
+ return result;
+ }
};

class reactor {
@@ -687,7 +696,7 @@ class reactor {

future<struct stat> fstat(int fd) noexcept;
future<struct statfs> fstatfs(int fd) noexcept;
- friend future<shared_ptr<file_impl>> make_file_impl(int fd, file_open_options options, int flags) noexcept;
+ friend future<shared_ptr<file_impl>> make_file_impl(int fd, file_open_options options, int flags, std::string name) noexcept;
public:
future<> readable(pollable_fd_state& fd);
future<> writeable(pollable_fd_state& fd);
diff --git a/install-dependencies.sh b/install-dependencies.sh
index 5c5cc597..7b7ca029 100755
--- a/install-dependencies.sh
+++ b/install-dependencies.sh
@@ -57,6 +57,7 @@ debian_packages=(
doxygen
openssl
pkg-config
+ libnvme
)

# seastar doesn't directly depend on these packages. They are
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index ea82551d..a1a12a64 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -208,7 +208,7 @@ class io_desc_read_write final : public io_completion {
const stream_id _stream;
const io_direction_and_length _dnl;
const fair_queue_ticket _fq_ticket;
- promise<size_t> _pr;
+ promise<io_result> _pr;
iovec_keeper _iovs;

public:
@@ -237,7 +237,7 @@ class io_desc_read_write final : public io_completion {
auto now = io_queue::clock_type::now();
_pclass.on_complete(std::chrono::duration_cast<std::chrono::duration<double>>(now - _ts));
_ioq.complete_request(*this);
- _pr.set_value(res);
+ _pr.set_value(std::make_tuple(res, get_result()));
delete this;
}

@@ -254,7 +254,7 @@ class io_desc_read_write final : public io_completion {
_ts = now;
}

- future<size_t> get_future() {
+ future<io_result> get_future() {
return _pr.get_future();
}

@@ -305,7 +305,7 @@ class queued_io_request : private internal::io_request {
_intent.enqueue(cq);
}

- future<size_t> get_future() noexcept { return _desc->get_future(); }
+ future<io_result> get_future() noexcept { return _desc->get_future(); }
fair_queue_entry& queue_entry() noexcept { return _fq_entry; }
stream_id stream() const noexcept { return _stream; }

@@ -833,7 +833,7 @@ io_queue::request_limits io_queue::get_request_limits() const noexcept {
return l;
}

-future<size_t> io_queue::queue_one_request(const io_priority_class& pc, io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept {
+future<io_result> io_queue::queue_one_request(const io_priority_class& pc, io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept {
return futurize_invoke([&pc, dnl = std::move(dnl), req = std::move(req), this, intent, iovs = std::move(iovs)] () mutable {
// First time will hit here, and then we create the class. It is important
// that we create the shared pointer in the same shard it will be used at later.
@@ -857,15 +857,18 @@ future<size_t> io_queue::queue_request(const io_priority_class& pc, io_direction
size_t max_length = _group->_max_request_length[dnl.rw_idx()];

if (__builtin_expect(dnl.length() <= max_length, true)) {
- return queue_one_request(pc, dnl, std::move(req), intent, std::move(iovs));
+ return queue_one_request(pc, dnl, std::move(req), intent, std::move(iovs)).then(
+ [] (auto io_result) {
+ return make_ready_future<size_t>(std::get<0>(io_result));
+ });
}

std::vector<internal::io_request::part> parts;
- lw_shared_ptr<std::vector<future<size_t>>> p;
+ lw_shared_ptr<std::vector<future<io_result>>> p;

try {
parts = req.split(max_length);
- p = make_lw_shared<std::vector<future<size_t>>>();
+ p = make_lw_shared<std::vector<future<io_result>>>();
p->reserve(parts.size());
find_or_create_class(pc).on_split(dnl);
engine()._io_stats.aio_outsizes++;
@@ -888,7 +891,7 @@ future<size_t> io_queue::queue_request(const io_priority_class& pc, io_direction
for (auto&& res : results) {
if (!res.failed()) {
if (prev_ok) {
- size_t sz = res.get0();
+ size_t sz = std::get<0>(res.get0());
total += sz;
prev_ok &= (sz == max_length);
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 100eb80f..e6833635 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -1619,6 +1619,8 @@ sstring io_request::opname() const {
return "poll remove";
case io_request::operation::cancel:
return "cancel";
+ case io_request::operation::append:
+ return "append";
}
std::abort();
}
@@ -1748,7 +1750,7 @@ reactor::open_file_dma(std::string_view nameref, open_flags flags, file_open_opt
return wrap_syscall<int>(fd);
}).then([&options, name = std::move(name), &open_flags] (syscall_result<int> sr) {
sr.throw_fs_exception_if_error("open failed", name);
- return make_file_impl(sr.result, options, open_flags);
+ return make_file_impl(sr.result, options, open_flags, name);
}).then([] (shared_ptr<file_impl> impl) {
return make_ready_future<file>(std::move(impl));
});
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 5d3d8499..fcb91adb 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -34,12 +34,15 @@

#ifdef SEASTAR_HAVE_URING
#include <liburing.h>
+#include <linux/nvme_ioctl.h>
#endif

#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif

+static const unsigned char IORING_OP_URING_CMD = 40;
+
namespace seastar {

using namespace std::chrono_literals;
@@ -1254,7 +1257,9 @@ class reactor_backend_uring final : public reactor_backend {
}
auto sqe = be.get_sqe();
::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
- ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+ auto arg = new passthru_arg;
+ arg->completion = this;
+ ::io_uring_sqe_set_data(sqe, arg);
_armed = true;
be._has_pending_submissions = true;
}
@@ -1316,10 +1321,134 @@ class reactor_backend_uring final : public reactor_backend {
auto sqe = get_sqe();
::io_uring_prep_poll_add(sqe, fd.fd.get(), events);
auto ufd = static_cast<uring_pollable_fd_state*>(&fd);
- ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(ufd->get_desc(events)));
+ auto arg = new passthru_arg;
+ arg->completion = ufd->get_desc(events);
+ ::io_uring_sqe_set_data(sqe, arg);
_has_pending_submissions = true;
return ufd->get_completion_future(events);
}
+ struct block_uring_cmd {
+ __u32 ioctl_cmd;
+ __u32 unused1;
+ __u64 unused2[4];
+ };
+
+ struct nvme_user_io_t {
+ uint32_t cdw00_09[10];
+ uint64_t slba;
+ uint32_t nlb : 16;
+ uint32_t rsvd : 4;
+ uint32_t dtype : 4;
+ uint32_t rsvd2 : 2;
+ uint32_t prinfo : 4;
+ uint32_t fua : 1;
+ uint32_t lr : 1;
+ uint32_t cdw13_15[3];
+ };
+
+ struct nvme_io_command {
+ union {
+#ifdef NVME_IOCTL_IO64_CMD
+ nvme_passthru_cmd64 common;
+#else
+ nvme_passthru_cmd common;
+#endif
+ nvme_user_io_t rw;
+ uint32_t raw[16];
+ };
+
+ enum class opcode {
+ FLUSH = 0x0,
+ WRITE = 0x1,
+ READ = 0x2,
+ APPEND = 0x7D,
+ };
+ };
+ struct passthru_arg;
+ class passthru_completion final : public kernel_completion {
+ public:
+ io_completion* orig_completion = nullptr;
+ passthru_arg* arg = nullptr;
+ size_t block_size = -1;
+
+ void complete_with(ssize_t res) override {
+ seastar_logger.error("complete res {}", res);
+ if (res >= 0) {
+ complete(res);
+ return;
+ }
+ ++engine()._io_stats.aio_errors;
+ try {
+ throw_kernel_error(res);
+ } catch (...) {
+ set_exception(std::current_exception());
+ }
+ }
+
+ void complete(size_t res) noexcept {
+ if (res == 0) {
+ orig_completion->set_result(arg->cmd.common.result * block_size);
+ orig_completion->complete_with(arg->cmd.common.data_len);
+ } else {
+ // TODO pass proper error code
+ orig_completion->complete_with(-1);
+ }
+ delete this;
+ }
+
+ void set_exception(std::exception_ptr eptr) noexcept {
+ orig_completion->set_exception(eptr);
+ }
+
+ ~passthru_completion() {}
+ };
+
+ struct passthru_arg {
+ nvme_io_command cmd;
+ kernel_completion* completion;
+ };
+
+ passthru_arg* prep_passthru_append(struct io_uring_sqe *sqe, int fd, void *buf, unsigned nbytes, off_t offset, uint32_t nsid, size_t block_size, io_completion* io_completion) {
+ // create append nvme command
+ auto arg = new passthru_arg;
+ auto& cmd = arg->cmd;
+ memset(&cmd, 0x0, sizeof(nvme_io_command));
+ cmd.common.opcode = static_cast<uint8_t>(nvme_io_command::opcode::APPEND);
+ cmd.common.nsid = nsid;
+ cmd.common.addr = reinterpret_cast<uint64_t>(buf);
+ cmd.common.data_len = nbytes;
+ cmd.rw.slba = offset / block_size;
+ cmd.rw.nlb = nbytes / block_size - 1;
+
+ // create sqe for passthru command
+ sqe->opcode = IORING_OP_URING_CMD;
+ sqe->addr = 4;
+ sqe->len = cmd.common.data_len;
+ sqe->off = reinterpret_cast<uint64_t>(&(cmd));
+ sqe->flags = 0;
+ sqe->fd = fd;
+ sqe->ioprio = 0;
+ sqe->user_data = reinterpret_cast<uint64_t>(arg);
+ sqe->rw_flags = 0;
+ sqe->__pad2[0] = 0;
+ sqe->__pad2[1] = 0;
+ sqe->__pad2[2] = 0;
+
+ struct block_uring_cmd *blk_cmd =
+ reinterpret_cast<block_uring_cmd *>(&sqe->len);
+
+#ifdef NVME_IOCTL_IO64_CMD
+ blk_cmd->ioctl_cmd = NVME_IOCTL_IO64_CMD;
+#else
+ blk_cmd->ioctl_cmd = NVME_IOCTL_IO_CMD;
+#endif
+ auto completion = new passthru_completion;
+ completion->block_size = block_size;
+ completion->arg = arg;
+ arg->completion = completion;
+ completion->orig_completion = io_completion;
+ return arg;
+ }

void submit_io_request(internal::io_request& req, io_completion* completion) {
auto sqe = get_sqe();
@@ -1340,6 +1469,11 @@ class reactor_backend_uring final : public reactor_backend {
case o::fdatasync:
::io_uring_prep_fsync(sqe, req.fd(), IORING_FSYNC_DATASYNC);
break;
+ case o::append:
+ {
+ auto passthru_arg = prep_passthru_append(sqe, req.fd(), req.address(), req.size(), req.pos(), req.nsid(), req.block_size(), completion);
+ break;
+ }
case o::recv:
case o::recvmsg:
case o::send:
@@ -1355,7 +1489,12 @@ class reactor_backend_uring final : public reactor_backend {
seastar_logger.error("Invalid operation for iocb: {}", req.opname());
abort();
}
- ::io_uring_sqe_set_data(sqe, completion);
+
+ if (req.opcode() != o::append) {
+ auto arg = new passthru_arg;
+ arg->completion = completion;
+ ::io_uring_sqe_set_data(sqe, arg);
+ }

_has_pending_submissions = true;
}
@@ -1374,8 +1513,9 @@ class reactor_backend_uring final : public reactor_backend {
void do_process_ready_kernel_completions(::io_uring_cqe** buf, size_t nr) {
for (auto p = buf; p != buf + nr; ++p) {
auto cqe = *p;
- auto completion = reinterpret_cast<kernel_completion*>(cqe->user_data);
- completion->complete_with(cqe->res);
+ passthru_arg* arg = reinterpret_cast<passthru_arg*>(cqe->user_data);
+ arg->completion->complete_with(cqe->res);
+ delete arg;
::io_uring_cqe_seen(&_uring, cqe);
}
}
--
2.25.1

Avi Kivity <avi@scylladb.com>
Sep 1, 2022, 11:37:26 AM
to shw096@snu.ac.kr, seastar-dev@googlegroups.com, JinyongHa
You should put your own copyright here.

> diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
> index 5141ffda..da7882fd 100644
> --- a/cmake/SeastarDependencies.cmake
> +++ b/cmake/SeastarDependencies.cmake
> @@ -92,6 +92,7 @@ macro (seastar_find_dependencies)
> Concepts
> GnuTLS
> LibUring
> + LibNvme


I hope this is optional? Please test building with the library not present.


> LinuxMembarrier
> Sanitizers
> SourceLocation
> diff --git a/include/seastar/core/internal/io_request.hh b/include/seastar/core/internal/io_request.hh
> index 3d43490b..090ba8ef 100644
> --- a/include/seastar/core/internal/io_request.hh
> +++ b/include/seastar/core/internal/io_request.hh
> @@ -35,7 +35,7 @@ namespace internal {
>
> class io_request {
> public:
> - enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel };
> + enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel, append };


How do appends work with concurrency? If I append something, do I have
to wait until a previous append completed?


> private:
> operation _op;
> int _fd;
> @@ -64,6 +64,8 @@ class io_request {
> } _size;
>
> bool _nowait_works;
> + uint32_t _nsid = -1;
> + size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
What do the different size_ts mean? Better to use a struct.


> +
> using shard_id = unsigned;
> using stream_id = unsigned;
>
> @@ -121,7 +123,7 @@ class io_queue {
> size_t len, internal::io_request req, io_intent* intent, iovec_keeper iovs = {}) noexcept;
>
> future<size_t> queue_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> - future<size_t> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> + future<io_result> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> void submit_request(io_desc_read_write* desc, internal::io_request req) noexcept;
> void cancel_request(queued_io_request& req) noexcept;
> void complete_cancelled_request(queued_io_request& req) noexcept;
> diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
> index db2c2bb4..aee0f918 100644
> --- a/include/seastar/core/reactor.hh
> +++ b/include/seastar/core/reactor.hh
> @@ -170,11 +170,20 @@ class io_intent;
> class disk_config_params;
>
> class io_completion : public kernel_completion {
> +private:
> + size_t result = 0;
> +


Please provide motivation for this change.
Please update fedora_packages too.


>
> # seastar doesn't directly depend on these packages. They are
> diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
> index ea82551d..a1a12a64 100644
> --- a/src/core/io_queue.cc
> +++ b/src/core/io_queue.cc
> @@ -208,7 +208,7 @@ class io_desc_read_write final : public io_completion {
> const stream_id _stream;
> const io_direction_and_length _dnl;
> const fair_queue_ticket _fq_ticket;
> - promise<size_t> _pr;
> + promise<io_result> _pr;
> iovec_keeper _iovs;
>
> public:
> @@ -237,7 +237,7 @@ class io_desc_read_write final : public io_completion {
> auto now = io_queue::clock_type::now();
> _pclass.on_complete(std::chrono::duration_cast<std::chrono::duration<double>>(now - _ts));
> _ioq.complete_request(*this);
> - _pr.set_value(res);
> + _pr.set_value(std::make_tuple(res, get_result()));


This looks strange. Why are there two results here?
Please indent the lambda introducer.


This causes an extra continuation which will have negative performance
impact. So the change to io_result (if we agree it's required) is better
propagated deeper into the I/O stack.
Should not mandate this, can use __has_include.


> #endif
>
> #ifdef HAVE_OSV
> #include <osv/newpoll.hh>
> #endif
>
> +static const unsigned char IORING_OP_URING_CMD = 40;
> +


Isn't this available from <liburing.h>?


Should be protected by #ifndef in case it becomes available later.


> namespace seastar {
>
> using namespace std::chrono_literals;
> @@ -1254,7 +1257,9 @@ class reactor_backend_uring final : public reactor_backend {
> }
> auto sqe = be.get_sqe();
> ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
> - ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
> + auto arg = new passthru_arg;


This should be limited to the case where nvme passthrough is used.


> + arg->completion = this;
> + ::io_uring_sqe_set_data(sqe, arg);
> _armed = true;
> be._has_pending_submissions = true;
> }
> @@ -1316,10 +1321,134 @@ class reactor_backend_uring final : public reactor_backend {
> auto sqe = get_sqe();
> ::io_uring_prep_poll_add(sqe, fd.fd.get(), events);
> auto ufd = static_cast<uring_pollable_fd_state*>(&fd);
> - ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(ufd->get_desc(events)));
> + auto arg = new passthru_arg;
> + arg->completion = ufd->get_desc(events);
> + ::io_uring_sqe_set_data(sqe, arg);
> _has_pending_submissions = true;
> return ufd->get_completion_future(events);
> }
> + struct block_uring_cmd {


All of this should be protected against a newer liburing.h. Or maybe it
should be disabled unless a sufficiently recent liburing is available.
We use 4-space indents.


> + passthru_arg* arg = nullptr;
> + size_t block_size = -1;
> +
> + void complete_with(ssize_t res) override {
> + seastar_logger.error("complete res {}", res);


This seems excessive.
Please add an example application (in demos/).


Heewon Shin / Student / Dept. of Computer Science and Engineering <shw096@snu.ac.kr>
Sep 7, 2022, 5:24:12 AM
to Avi Kivity, seastar-dev@googlegroups.com, JinyongHa
Thanks for your prompt reply.

By the way, I was wondering how to send the revised version of the code.
Do I need to send another new patch, or should I reply on the same mailing thread using the option "git send-email --in-reply-to"?

Can you please let me know?

On Fri, Sep 2, 2022 at 12:37 AM, Avi Kivity <a...@scylladb.com> wrote:
Is it okay to reuse existing copyright? 

> diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
> index 5141ffda..da7882fd 100644
> --- a/cmake/SeastarDependencies.cmake
> +++ b/cmake/SeastarDependencies.cmake
> @@ -92,6 +92,7 @@ macro (seastar_find_dependencies)
>       Concepts
>       GnuTLS
>       LibUring
> +    LibNvme


I hope this is optional? Please test building with the library not present.
We changed the cmake files.


>       LinuxMembarrier
>       Sanitizers
>       SourceLocation
> diff --git a/include/seastar/core/internal/io_request.hh b/include/seastar/core/internal/io_request.hh
> index 3d43490b..090ba8ef 100644
> --- a/include/seastar/core/internal/io_request.hh
> +++ b/include/seastar/core/internal/io_request.hh
> @@ -35,7 +35,7 @@ namespace internal {
>   
>   class io_request {
>   public:
> -    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel };
> +    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel, append };


How do appends work with concurrency? If I append something, do I have
to wait until a previous append completed?
Multiple appends can be in flight concurrently. Users do not have to wait for the completion of a previous append command before submitting another one.
Only if the user wants to ensure that the data of an append command has actually been written to the ZNS SSD does the user need to wait for its completion.
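
For instance, a minimal sketch of issuing several appends to the same zone without waiting in between could look as follows. append_dma() here is a hypothetical wrapper over the append path added by this series, not an existing Seastar call:

    // Illustrative only; assumes append_dma(zone_off, buf, len) returns future<io_result>.
    #include <seastar/core/future.hh>
    #include <seastar/core/when_all.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <vector>
    using namespace seastar;

    future<> append_batch(uint64_t zone_off, std::vector<temporary_buffer<char>> bufs) {
        std::vector<future<io_result>> futs;
        futs.reserve(bufs.size());
        for (auto& b : bufs) {
            // no need to wait for the previous append before submitting the next one
            futs.push_back(append_dma(zone_off, b.get(), b.size()));
        }
        return when_all_succeed(futs.begin(), futs.end()).then(
                [bufs = std::move(bufs)] (std::vector<io_result> results) {
            // each completion carries the block address the device chose for that append
        });
    }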

>   private:
>       operation _op;
>       int _fd;
> @@ -64,6 +64,8 @@ class io_request {
>       } _size;
>   
>       bool _nowait_works;
> +    uint32_t _nsid = -1;
> +    size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
No, the existing unions (_attr, _ptr, _size) are already used by the append command.
The first one is the number of bytes written, like the return value of the legacy write command; the second one is the LBA assigned by the ZNS SSD.
We replaced the tuple with a struct as you commented.


> +
>   using shard_id = unsigned;
>   using stream_id = unsigned;
>   
> @@ -121,7 +123,7 @@ class io_queue {
>               size_t len, internal::io_request req, io_intent* intent, iovec_keeper iovs = {}) noexcept;
>   
>       future<size_t> queue_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> -    future<size_t> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> +    future<io_result> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
>       void submit_request(io_desc_read_write* desc, internal::io_request req) noexcept;
>       void cancel_request(queued_io_request& req) noexcept;
>       void complete_cancelled_request(queued_io_request& req) noexcept;
> diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
> index db2c2bb4..aee0f918 100644
> --- a/include/seastar/core/reactor.hh
> +++ b/include/seastar/core/reactor.hh
> @@ -170,11 +170,20 @@ class io_intent;
>   class disk_config_params;
>   
>   class io_completion : public kernel_completion {
> +private:
> +    size_t result = 0;
> +


Please provide motivation for this change.
Completing the append command requires the assigned block address. The new 'result' field is for passing the assigned block address via io_completion.
We added a comment in the code.
We updated that as well, as you mentioned.


>   
>   # seastar doesn't directly depend on these packages. They are
> diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
> index ea82551d..a1a12a64 100644
> --- a/src/core/io_queue.cc
> +++ b/src/core/io_queue.cc
> @@ -208,7 +208,7 @@ class io_desc_read_write final : public io_completion {
>       const stream_id _stream;
>       const io_direction_and_length _dnl;
>       const fair_queue_ticket _fq_ticket;
> -    promise<size_t> _pr;
> +    promise<io_result> _pr;
>       iovec_keeper _iovs;
>   
>   public:
> @@ -237,7 +237,7 @@ class io_desc_read_write final : public io_completion {
>           auto now = io_queue::clock_type::now();
>           _pclass.on_complete(std::chrono::duration_cast<std::chrono::duration<double>>(now - _ts));
>           _ioq.complete_request(*this);
> -        _pr.set_value(res);
> +        _pr.set_value(std::make_tuple(res, get_result()));


This looks strange. Why are there two results here?
As we explained previously, in addition to the number of bytes written (as with the legacy write command), the assigned block address is needed to complete the append command.
We changed the indentation.


This causes an extra continuation which will have negative performance
impact. So the change to io_result (if we agree it's required) is better
propagated deeper into the I/O stack.
Did you mean that the return value of queue_request() should also be changed to future<io_result>? Can you please explain in more detail?
We changed it so that it is included only if the NVMe library is found.


>   #endif
>   
>   #ifdef HAVE_OSV
>   #include <osv/newpoll.hh>
>   #endif
>   
> +static const unsigned char IORING_OP_URING_CMD = 40;
> +


Isn't this available from <liburing.h>?
It has been available since Linux commit ee692a21e9bf8354bd3ec816f1cf4bff8619ed77 ("fs,io_uring: add infrastructure for uring-cmd").
We changed the code to use the definition from <liburing.h>.


Should be protected by #ifndef in case it becomes available later.


>   namespace seastar {
>   
>   using namespace std::chrono_literals;
> @@ -1254,7 +1257,9 @@ class reactor_backend_uring final : public reactor_backend {
>               }
>               auto sqe = be.get_sqe();
>               ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
> -            ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
> +            auto arg = new passthru_arg;


This should be limited to the case where nvme passthrough is used.
Reflected.


> +            arg->completion = this;
> +            ::io_uring_sqe_set_data(sqe, arg);
>               _armed = true;
>               be._has_pending_submissions = true;
>           }
> @@ -1316,10 +1321,134 @@ class reactor_backend_uring final : public reactor_backend {
>           auto sqe = get_sqe();
>           ::io_uring_prep_poll_add(sqe, fd.fd.get(), events);
>           auto ufd = static_cast<uring_pollable_fd_state*>(&fd);
> -        ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(ufd->get_desc(events)));
> +        auto arg = new passthru_arg;
> +        arg->completion = ufd->get_desc(events);
> +        ::io_uring_sqe_set_data(sqe, arg);
>           _has_pending_submissions = true;
>           return ufd->get_completion_future(events);
>       }
> +    struct block_uring_cmd {


All of this should be protected against a newer liburing.h. Or maybe it
should be disabled unless a sufficiently recent liburing is available.
Changed. 
Reflected indentations. 


> +      passthru_arg* arg = nullptr;
> +      size_t block_size = -1;
> +
> +      void complete_with(ssize_t res) override {
> +        seastar_logger.error("complete res {}", res);


This seems excessive.
We deleted the logger.
We also added and tested demo code that uses the append command and then reads back and verifies the data, as an example (demos/passthru_append_demo.cc).

Avi Kivity <avi@scylladb.com>
Sep 7, 2022, 11:40:28 AM
to Heewon Shin, seastar-dev@googlegroups.com, JinyongHa


On 07/09/2022 12.23, Heewon Shin wrote:
Thanks for your prompt reply.

By the way, I was wondering how to send the revised version of the code.
Do I need to send another new patch, or should I reply on the same mailing thread using the option "git send-email --in-reply-to"?


The convention on this list is to start a thread with new patches. However, it's better to agree on the direction before.


> +#
> +# Copyright (C) 2022 ScyllaDB
> +#


You should put your own copyright here.
Is it okay to reuse existing copyright?


The code is copyrighted by whoever wrote it or their employer. ScyllaDB can't claim copyright to this code.


>       LinuxMembarrier
>       Sanitizers
>       SourceLocation
> diff --git a/include/seastar/core/internal/io_request.hh b/include/seastar/core/internal/io_request.hh
> index 3d43490b..090ba8ef 100644
> --- a/include/seastar/core/internal/io_request.hh
> +++ b/include/seastar/core/internal/io_request.hh
> @@ -35,7 +35,7 @@ namespace internal {
>   
>   class io_request {
>   public:
> -    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel };
> +    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel, append };


How do appends work with concurrency? If I append something, do I have
to wait until a previous append completed?
Multiple appends can be worked concurrently. Users do not have to wait for the completion of the previous append command to submit another one.
Only if the user wants to ensure that data of the append command is written to ZNS SSD, the user should wait for the completion.


But, how does the device know which command has to be executed first? The commands may be reordered by Seastar or possibly the kernel.



>   private:
>       operation _op;
>       int _fd;
> @@ -64,6 +64,8 @@ class io_request {
>       } _size;
>   
>       bool _nowait_works;
> +    uint32_t _nsid = -1;
> +    size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
No. existing unions (_attr, _ptr, _size) are already used by the append command.


What about the offset?


Why is a new block size needed?

So, how am I supposed to use it? For every write, remember the assigned LBA and use it?


Very strange. Can you point me to an application that uses it?




> +
>   using shard_id = unsigned;
>   using stream_id = unsigned;
>   
> @@ -121,7 +123,7 @@ class io_queue {
>               size_t len, internal::io_request req, io_intent* intent, iovec_keeper iovs = {}) noexcept;
>   
>       future<size_t> queue_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> -    future<size_t> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
> +    future<io_result> queue_one_request(const io_priority_class& pc, internal::io_direction_and_length dnl, internal::io_request req, io_intent* intent, iovec_keeper iovs) noexcept;
>       void submit_request(io_desc_read_write* desc, internal::io_request req) noexcept;
>       void cancel_request(queued_io_request& req) noexcept;
>       void complete_cancelled_request(queued_io_request& req) noexcept;
> diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
> index db2c2bb4..aee0f918 100644
> --- a/include/seastar/core/reactor.hh
> +++ b/include/seastar/core/reactor.hh
> @@ -170,11 +170,20 @@ class io_intent;
>   class disk_config_params;
>   
>   class io_completion : public kernel_completion {
> +private:
> +    size_t result = 0;
> +


Please provide motivation for this change.
To make a completion of the append command, an assigned block address is required. This new 'result' field is for passing the assigned block address via io_completion.
We added a comment in code.


I think it's better to use a subclass. Also, name it _assigned_lba_address so it's clear what it is. We have plenty of variables named "result" already.

Ok. I still don't understand how the high-level code can make use of this lba address.

I don't understand it completely myself since I don't understand how assigned_lba_address can be used.


If it turns out that we can't get rid of it, then yes we need to return io_result everywhere.


<snip>




Please add an example application (in demos/).
We added and tested demo code that utilizes append command, read and verify for the example. (demos/passthru_append_demo.cc)


Thanks, this will help me understand. In addition pointers to real life applications will be appreciated.


Heewon Shin / Student / Dept. of Computer Science and Engineering <shw096@snu.ac.kr>
Sep 16, 2022, 5:10:22 AM
to Avi Kivity, seastar-dev@googlegroups.com, JinyongHa


On Thu, Sep 8, 2022 at 12:40 AM, Avi Kivity <a...@scylladb.com> wrote:


On 07/09/2022 12.23, Heewon Shin wrote:
Thanks for your prompt reply.

By the way, I was wondering how to send the revised version of the code.
Do I need to send another new patch, or should I reply on the same mailing thread using the option "git send-email --in-reply-to"?


The convention on this list is to start a thread with new patches. However, it's better to agree on the direction before.

 Okay. We'll send you the new patch after your comments.


> +#
> +# Copyright (C) 2022 ScyllaDB
> +#


You should put your own copyright here.
Is it okay to reuse existing copyright?


The code is copyrighted by whoever wrote it or their employer. ScyllaDB can't claim copyright to this code.

 Thanks. We fixed it. It will be included in the upcoming patch.


>       LinuxMembarrier
>       Sanitizers
>       SourceLocation
> diff --git a/include/seastar/core/internal/io_request.hh b/include/seastar/core/internal/io_request.hh
> index 3d43490b..090ba8ef 100644
> --- a/include/seastar/core/internal/io_request.hh
> +++ b/include/seastar/core/internal/io_request.hh
> @@ -35,7 +35,7 @@ namespace internal {
>   
>   class io_request {
>   public:
> -    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel };
> +    enum class operation { read, readv, write, writev, fdatasync, recv, recvmsg, send, sendmsg, accept, connect, poll_add, poll_remove, cancel, append };


How do appends work with concurrency? If I append something, do I have
to wait until a previous append completed?
Multiple appends can be worked concurrently. Users do not have to wait for the completion of the previous append command to submit another one.
Only if the user wants to ensure that data of the append command is written to ZNS SSD, the user should wait for the completion.


But, how does the device know which command has to be executed first? The commands may be reordered by Seastar or possibly the kernel.

Commands are pushed to the NVMe submission queue, and the device executes them according to its own internal policy.
So the device does not need to care about any reordering done by Seastar or the kernel.



>   private:
>       operation _op;
>       int _fd;
> @@ -64,6 +64,8 @@ class io_request {
>       } _size;
>   
>       bool _nowait_works;
> +    uint32_t _nsid = -1;
> +    size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
No. existing unions (_attr, _ptr, _size) are already used by the append command.


What about the offset?

The append command requires the index of the zone in which to place the data.
Therefore the user needs an offset to select the zone to append to.



Why is a new block size needed?

To prepare the NVMe passthrough command, the start address of the zone and the size of the data must be expressed in numbers of blocks.
Therefore, the block size is required to translate from byte granularity to block granularity.
Yes, the user needs to remember the assigned LBA.
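
As a concrete arithmetic sketch of that translation (4096 bytes is only an example block size; the real value comes from the device/namespace), matching the slba/nlb computation in prep_passthru_append():

    constexpr size_t block_size = 4096;        // example value, not queried here
    uint64_t offset = 256ull << 20;            // byte offset selecting the zone
    size_t   nbytes = 64 * 1024;               // payload size in bytes
    uint64_t slba = offset / block_size;       // 65536: starting LBA of the zone
    uint32_t nlb  = nbytes / block_size - 1;   // 15: NVMe encodes "number of blocks - 1"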


Very strange. Can you point me to an application that uses it?

We want to use the append command in SeaStore, which is one of the object stores in Ceph. Here is our document on why we need the append command (https://docs.google.com/document/d/1DvADmtqVAoXKSmds8ByqPhVTUoexDQvRlkPvB2xzSfc/edit?usp=sharing).
We also attached a presentation that introduces the necessity of the append command (https://www.usenix.org/system/files/vault20_slides_bjorling.pdf).
We moved the 'result' field from the class 'io_completion' to 'io_desc_read_write' and renamed it '_assigned_lba'.
To keep the concrete class 'io_desc_read_write' from being exposed to the 'reactor_backend' layer, we kept a 'set_result()' method and renamed it 'set_additional_result()'.
If there is a better design, please give us a comment.
We wrote and tested demo code that uses the append command, reads back the data at the LBA returned via the append completion, and verifies it, as an example (demos/passthru_append_demo.cc). It will be included in the upcoming patch.
 Please refer to the document mentioned earlier.


If it turns out that we can't get rid of it, then yes we need to return io_result everywhere.

It will be included in the upcoming patch.
However, it will change the legacy read and write functions, so I'm just wondering if that's okay.


<snip>




Please add an example application (in demos/).
We added and tested demo code that utilizes append command, read and verify for the example. (demos/passthru_append_demo.cc)


Thanks, this will help me understand. In addition pointers to real life applications will be appreciated.

Okay, it will be included in the upcoming patch (demos/passthru_append_demo.cc).

Thanks
Heewon

Avi Kivity <avi@scylladb.com>
Sep 18, 2022, 6:57:53 AM
to Heewon Shin, seastar-dev@googlegroups.com, JinyongHa

I'm beginning to understand it. The resulting zone can't be read in-order, so if the application wants the ability to read it in-order, it must keep track of the returned offsets in some index file.
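
To make that concrete, a minimal sketch of such an index follows; append_dma() and the record-id keying are hypothetical and only meant to illustrate how the returned address could be tracked:

    #include <seastar/core/future.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <unordered_map>
    using namespace seastar;

    // Hypothetical: map an application-level record id to the byte address the
    // ZNS SSD reported in the append completion, so the record can be read back later.
    std::unordered_map<uint64_t, uint64_t> record_to_addr;

    future<> append_record(uint64_t record_id, uint64_t zone_off, temporary_buffer<char> buf) {
        auto ptr = buf.get();
        auto len = buf.size();
        return append_dma(zone_off, ptr, len).then(
                [record_id, buf = std::move(buf)] (io_result r) {
            record_to_addr[record_id] = std::get<1>(r);  // where the data actually landed
        });
    }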




>   private:
>       operation _op;
>       int _fd;
> @@ -64,6 +64,8 @@ class io_request {
>       } _size;
>   
>       bool _nowait_works;
> +    uint32_t _nsid = -1;
> +    size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
No. existing unions (_attr, _ptr, _size) are already used by the append command.


What about the offset?

Append command requires an index of a zone where to place data.
Therefore the user needs an offset for selecting a zone to append.


So, io_request._attr.pos is used to select the zone?


Let's add another member to the _attr union, so it's clear it's in use.




Why is a new block size needed?

 To prepare the NVMe passthrough command, the start address of zone and size of data should be expressed in terms of the number of blocks. 
Therefore, block size is required to translate byte granularity to block granularity.


Ok.

Hard for me to see, I'll comment on the next version of the patch.


A different approach is to have a separate concrete class, io_desc_zone_append, but it will be easier to tell whether that is better once I see the patch.
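
For reference, a rough sketch of that alternative (hypothetical; io_desc_read_write is currently declared final, so that would need to be lifted):

    // Hypothetical: keep the assigned address in an append-specific descriptor
    // instead of adding a field to every io_completion / io_desc_read_write.
    class io_desc_zone_append final : public io_desc_read_write {
        ssize_t _assigned_lba = -1;
    public:
        using io_desc_read_write::io_desc_read_write;
        void set_additional_result(size_t addr) noexcept { _assigned_lba = addr; }
        ssize_t assigned_lba() const noexcept { return _assigned_lba; }
    };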

Very good, thanks.

It's not ideal, but it's okay as long as it isn't exposed in a public API.

Heewon Shin / Student / Dept. of Computer Science and Engineering <shw096@snu.ac.kr>
Sep 19, 2022, 4:53:34 AM
to Avi Kivity, seastar-dev@googlegroups.com, JinyongHa, Jinyong Ha


On Sun, Sep 18, 2022 at 7:57 PM, Avi Kivity <a...@scylladb.com> wrote:
Yes, exactly. That's why the assigned LBA should be returned.




>   private:
>       operation _op;
>       int _fd;
> @@ -64,6 +64,8 @@ class io_request {
>       } _size;
>   
>       bool _nowait_works;
> +    uint32_t _nsid = -1;
> +    size_t _block_size = -1;


Can this be moved to one of the unions to reduce size growth?
No. existing unions (_attr, _ptr, _size) are already used by the append command.


What about the offset?

Append command requires an index of a zone where to place data.
Therefore the user needs an offset for selecting a zone to append.


So, io_request._attr.pos is used to select the zone?

 Yes.


Let's add another member to the _attr union, so it's clear it's in use.

Do you mean adding a zone_address member to the _attr union? We think its meaning would be exactly the same as _attr.pos; please explain why you suggest it.
Okay.
There might be some miscommunication here. What is the performance difference between our first patch and your suggestion? If the return value has to be translated from io_result back to size_t somewhere before the public API, that also introduces an additional 'then'.
Anyway, for now we will send the second patch without this part.



<snip>




Please add an example application (in demos/).
We added and tested demo code that utilizes append command, read and verify for the example. (demos/passthru_append_demo.cc)


Thanks, this will help me understand. In addition pointers to real life applications will be appreciated.

 Okay, it would be included in the upcoming patch.(demos/passthru_append_demo.cc)

Thanks
Heewon

Jinyong Ha <jyha200@gmail.com>
Sep 20, 2022, 10:18:28 PM
to Heewon Shin, Myoungwon Oh, Avi Kivity, seastar-dev@googlegroups.com
To Avi:

I have one question.
What do you think about adding a new interface 'passthru_nvme_cmd()' that provides passthru functionality to users?
It is admittedly quite NVMe-specific, but it would be helpful for direct SSD control, such as quickly adopting a newly introduced NVMe command.

Thanks.
Jinyong Ha.


On Mon, Sep 19, 2022 at 5:53 PM, Heewon Shin <shw...@snu.ac.kr> wrote:

Avi Kivity <avi@scylladb.com>
Sep 25, 2022, 11:43:56 AM
to Jinyong Ha, Heewon Shin, Myoungwon Oh, seastar-dev@googlegroups.com
Yes, I think it's a good idea. It doesn't need to be specific to NVMe;
we can just expose uring_cmd as a Seastar API with minimal
transformations.
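
One possible shape for such an API, purely as a sketch (nothing like this exists in Seastar today; the name and signature are invented here):

    // Hypothetical generic pass-through entry point built on IORING_OP_URING_CMD.
    struct uring_cmd_result {
        int32_t  status;   // CQE result / command status
        uint64_t result;   // command-specific result (e.g. assigned address for zone append)
    };

    // Submit an opaque, caller-built command (such as an NVMe passthrough command)
    // on the file's fd and complete with whatever the device returns.
    future<uring_cmd_result> submit_uring_cmd(file& f, const void* cmd, size_t cmd_len);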

On Wed, 2022-09-21 at 11:18 +0900, Jinyong Ha wrote:
> To AVI.
>
> I have one question.
> What do you think about adding a new interface 'passthru_nvme_cmd()'
> that provides passthru functionality to users?
> It is true that it is too nvme-specific, but it will be helpful for
> direct SSD control such as fast adoption of a newly introduced nvme
> command.
>
> Thanks.
> Jinyong Ha.
>
> > > > > If it turns out that we can't get rid of it, then yes we need
> > > > > to return io_result everywhere.
> > > > >
> > > >
> > > > It will be included in the upcoming patch. 
> > > > However, It will change legacy read and write functions, I'm
> > > > just wondering if that's okay.
> > >
> > >
> > > It's not ideal, but it's okay as long as it isn't exposed in a
> > > public API.
> > >
> >
> >   There might be some mis-communication. What is the difference in
> > the aspect of performance between our first patch and your comment?
> >  If  the return value should be translated from io_result to size_t
> > somewhere before the public API, it means that it is also using
> > additional 'then'.
> > Anyway, at this time, we will send the second patch without this
> > part.
> > >
> > >  
> > > >  
> > > > >
> > > > > <snip>
> > > > >
> > > > >  
> > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  Please add an example application (in demos/).
> > > > > > >
> > > > > >
> > > > > > We added and tested demo code that utilizes append command,
> > > > > > read and verify for the example.
> > > > > > (demos/passthru_append_demo.cc)
> > > > >
> > > > >
> > > > > Thanks, this will help me understand. In addition pointers to
> > > > > real life applications will be appreciated.
> > > > >
> > > >
> > > >  Okay, it would be included in the upcoming
> > > > patch.(demos/passthru_append_demo.cc)
> > > >
> > > > Thanks
> > > > Heewon
