[PATCH seastar v1] reactor: add io_uring backend


Avi Kivity

<avi@scylladb.com>
May 1, 2022, 1:38:02 PM
to seastar-dev@googlegroups.com
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.
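
To illustrate why every extra verb costs a capability check plus a fallback path, here is a minimal sketch. It is a toy model, not the liburing probe API: `op`, `probe`, `backend_usable` and `choose_send_path` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <string>

// Hypothetical stand-in for the kernel's opcode probe (not the liburing API).
enum class op { poll_add, read, write, readv, writev, fsync, sendmsg };

struct probe {
    std::set<op> supported;
    bool has(op o) const { return supported.count(o) != 0; }
};

// The backend is usable only if every op in the minimal set is supported.
bool backend_usable(const probe& p) {
    for (op o : {op::poll_add, op::read, op::write,
                 op::readv, op::writev, op::fsync}) {
        if (!p.has(o)) {
            return false;
        }
    }
    return true;
}

// An optional verb needs its own check and its own fallback path.
std::string choose_send_path(const probe& p) {
    return p.has(op::sendmsg) ? "uring-sendmsg" : "poll-then-syscall";
}
```

With only the minimal set, there is a single yes/no decision for the whole backend; each optional verb adds one more branch like `choose_send_path` that has to be tested on kernels with and without it.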

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).
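
The preference rule can be sketched as follows. This is an illustration only: `pick_backend` is a hypothetical helper, and it assumes the default is simply the first entry of the list of available backends, with an explicit request (as from --reactor-backend) overriding it.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// `available` is ordered by preference; `requested` models an explicit
// --reactor-backend choice. Both names are assumptions for this sketch.
std::string pick_backend(const std::vector<std::string>& available,
                         const std::optional<std::string>& requested) {
    if (requested) {
        // An explicit request must name a backend that was detected.
        if (std::find(available.begin(), available.end(), *requested)
                == available.end()) {
            throw std::runtime_error("backend not available: " + *requested);
        }
        return *requested;
    }
    if (available.empty()) {
        throw std::runtime_error("no reactor backend available");
    }
    // No request: fall back to the most preferred detected backend.
    return available.front();
}
```

Because io_uring is appended after linux-aio and epoll, it is never picked by default under this assumption and only takes effect when asked for by name.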

The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operations (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.
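
The two-step "poll, then read" pattern that the backend keeps for now looks roughly like the sketch below. It uses plain synchronous poll(2) on an eventfd as a stand-in for the asynchronous IORING_OP_POLL_ADD; `is_readable` and `poll_then_read` are hypothetical helpers, not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <poll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

// Step 1: ask whether the fd is readable. In the backend this answer
// arrives asynchronously as a poll completion.
bool is_readable(int fd, int timeout_ms) {
    pollfd pfd{fd, POLLIN, 0};
    return ::poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN);
}

// Step 2: once readiness is known, perform the read with a regular syscall.
std::optional<uint64_t> poll_then_read(int fd) {
    if (!is_readable(fd, 0)) {
        return std::nullopt;
    }
    uint64_t value = 0;
    ssize_t n = ::read(fd, &value, sizeof(value));
    if (n != static_cast<ssize_t>(sizeof(value))) {
        return std::nullopt;
    }
    return value;
}
```

A combined operation would fold both steps into one submission, which is the "more elaborate" variant deferred to a later patch.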

The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total number of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.
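
A toy model of that behaviour, under simplifying assumptions: `toy_ring` is a hypothetical class (not liburing), submitted entries "complete" immediately on flush, and the counters exist only to make the cost visible.

```cpp
#include <cassert>
#include <cstddef>

// Toy bounded submission queue; all names are made up for this sketch.
class toy_ring {
    std::size_t _sq_capacity;
    std::size_t _sq_pending = 0;  // prepared but not yet submitted
    std::size_t _cq_ready = 0;    // completed but not yet consumed
public:
    std::size_t syscalls = 0;
    std::size_t completions_processed = 0;

    explicit toy_ring(std::size_t sq_capacity) : _sq_capacity(sq_capacity) {}

    // Fails when the submission queue is full.
    bool try_get_sqe() {
        if (_sq_pending == _sq_capacity) {
            return false;
        }
        ++_sq_pending;
        return true;
    }
    // Hands pending entries to the "kernel" at the cost of one syscall.
    void flush() {
        if (_sq_pending == 0) {
            return;
        }
        ++syscalls;
        _cq_ready += _sq_pending;
        _sq_pending = 0;
    }
    void process_completions() {
        completions_processed += _cq_ready;
        _cq_ready = 0;
    }
    // Always succeeds; occasionally pays for a flush plus completion processing.
    void get_sqe() {
        while (!try_get_sqe()) {
            flush();
            process_completions();
        }
    }
};
```

With a capacity of 4, ten calls to `get_sqe()` hit the slow path twice (on the 5th and 9th call); the other eight are a simple counter bump, which is the "rarely" in the description above.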

In return, the ergonomics are better for the user: it works
well with buffered I/O (--kernel-page-cache 1) and, unlike linux-aio,
doesn't require tuning sysctls (fs.aio-max-nr) to work on large machines.
---

This appears to work well, apart from the cmake integration. I
appreciate any help in this area!

We require liburing 2.0 and above, and it should be optional.


include/seastar/core/reactor.hh | 1 +
src/core/reactor_backend.hh | 2 +
src/core/reactor_backend.cc | 404 ++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 +
cmake/FindLibUring.cmake | 61 +++++
cmake/SeastarDependencies.cmake | 1 +
install-dependencies.sh | 2 +
7 files changed, 473 insertions(+)
create mode 100644 cmake/FindLibUring.cmake

diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index e55bada0f4..0b47d5fd65 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -208,10 +208,11 @@ class reactor {
friend class preempt_io_context;
friend struct hrtimer_aio_completion;
friend struct task_quota_aio_completion;
friend class reactor_backend_epoll;
friend class reactor_backend_aio;
+ friend class reactor_backend_uring;
friend class reactor_backend_selector;
friend struct reactor_options;
friend class aio_storage_context;
friend size_t scheduling_group_count();
public:
diff --git a/src/core/reactor_backend.hh b/src/core/reactor_backend.hh
index 1aef269c75..9bb61050f4 100644
--- a/src/core/reactor_backend.hh
+++ b/src/core/reactor_backend.hh
@@ -361,10 +361,12 @@ class reactor_backend_osv : public reactor_backend {
virtual pollable_fd_state_ptr
make_pollable_fd_state(file_desc fd, pollable_fd::speculation speculate) override;
};
#endif /* HAVE_OSV */

+class reactor_backend_uring;
+
class reactor_backend_selector {
std::string _name;
private:
static bool has_enough_aio_nr();
explicit reactor_backend_selector(std::string name) : _name(std::move(name)) {}
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 4a1bca1afc..3d9b5d7299 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,16 @@
#include <seastar/util/read_first_line.hh>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>

+#define SEASTAR_HAVE_URING // FIXME: convince cmake to do it.
+
+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif

namespace seastar {
@@ -1112,10 +1118,398 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
std::cerr << "reactor_backend_osv does not support file descriptors - make_pollable_fd_state() shouldn't have been called!\n";
abort();
}
#endif

+#ifdef SEASTAR_HAVE_URING
+
+// We want not to throw during detection (to avoid spurious exceptions)
+// but we do want to throw during for-real construction (to have an error
+// message). Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
+ auto params = ::io_uring_params{
+ .flags = 0,
+ };
+ ::io_uring ring;
+ auto err = ::io_uring_queue_init_params(queue_len, &ring, &params);
+ if (err != 0) {
+ if (throw_on_error) {
+ throw std::system_error(-err, std::system_category());
+ }
+ return std::nullopt;
+ }
+ ::io_uring_ring_dontfork(&ring);
+ if (!(params.features & IORING_FEAT_NODROP)
+ || !(params.features & IORING_FEAT_SUBMIT_STABLE)) {
+ ::io_uring_queue_exit(&ring);
+ if (throw_on_error) {
+ throw std::runtime_error("missing IORING_FEAT_NODROP or IORING_FEAT_SUBMIT_STABLE");
+ }
+ return std::nullopt;
+ }
+ return ring;
+}
+
+bool detect_io_uring() {
+ auto required_features = IORING_FEAT_SUBMIT_STABLE;
+ auto required_ops = {
+ IORING_OP_POLL_ADD,
+ IORING_OP_READ,
+ IORING_OP_WRITE,
+ IORING_OP_READV,
+ IORING_OP_WRITEV,
+ IORING_OP_FSYNC,
+ };
+ ::io_uring ring;
+ int r = ::io_uring_queue_init(1, &ring, 0);
+ if (r != 0) {
+ return false;
+ }
+ auto free_ring = defer([&] () noexcept { ::io_uring_queue_exit(&ring); });
+
+ if (~ring.features & required_features) {
+ return false;
+ }
+
+ auto probe = ::io_uring_get_probe_ring(&ring);
+ if (!probe) {
+ return false;
+ }
+ auto free_probe = defer([&] () noexcept { ::free(probe); });
+
+ for (auto op : required_ops) {
+ if (!io_uring_opcode_supported(probe, op)) {
+ return false;
+ }
+ }
+
+ return true;
+}
+
+class uring_pollable_fd_state : public pollable_fd_state {
+ pollable_fd_state_completion _completion_pollin;
+ pollable_fd_state_completion _completion_pollout;
+public:
+ explicit uring_pollable_fd_state(file_desc desc, speculation speculate)
+ : pollable_fd_state(std::move(desc), std::move(speculate)) {
+ }
+ pollable_fd_state_completion* get_desc(int events) {
+ if (events & POLLIN) {
+ return &_completion_pollin;
+ } else {
+ return &_completion_pollout;
+ }
+ }
+ future<> get_completion_future(int events) {
+ return get_desc(events)->get_future();
+ }
+};
+
+class reactor_backend_uring final : public reactor_backend {
+ static constexpr unsigned s_queue_len = 200;
+ reactor& _r;
+ ::io_uring _uring;
+ bool _did_work_while_getting_sqe = false;
+ bool _has_pending_submissions = false;
+ file_desc _hrtimer_timerfd;
+ preempt_io_context _preempt_io_context;
+
+ // Completion for high resolution timerfd, used in wait_and_process_events()
+ // (while running tasks it's waited for in _preempt_io_context)
+ class hrtimer_completion : public fd_kernel_completion {
+ reactor& _r;
+ bool _armed = false;
+ public:
+ explicit hrtimer_completion(reactor& r, file_desc& timerfd)
+ : fd_kernel_completion(timerfd), _r(r) {
+ }
+ void maybe_rearm(reactor_backend_uring& be) {
+ if (_armed) {
+ return;
+ }
+ auto sqe = be.get_sqe();
+ ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
+ ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+ _armed = true;
+ be._has_pending_submissions = true;
+
+ }
+ virtual void complete_with(ssize_t res) override {
+ uint64_t expirations = 0;
+ fd().read(&expirations, sizeof(expirations));
+ _armed = false;
+ _r.service_highres_timer();
+ }
+ };
+
+ class smp_wakeup_completion : public fd_kernel_completion {
+ bool _armed = false;
+ public:
+ explicit smp_wakeup_completion(file_desc& fd) : fd_kernel_completion(fd) {}
+ virtual void complete_with(ssize_t res) override {
+ char garbage[8];
+ auto ret = _fd.read(garbage, 8);
+ assert(ret && *ret == 8);
+ _armed = false;
+ }
+ void maybe_rearm(reactor_backend_uring& be) {
+ if (_armed) [[likely]] {
+ return;
+ }
+ auto sqe = be.get_sqe();
+ ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
+ ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+ _armed = true;
+ be._has_pending_submissions = true;
+ }
+ };
+
+ hrtimer_completion _hrtimer_completion;
+ smp_wakeup_completion _smp_wakeup_completion;
+private:
+ static file_desc make_timerfd() {
+ return file_desc::timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC|TFD_NONBLOCK);
+ }
+
+ bool await_events(int timeout, const sigset_t* active_sigmask);
+
+ // Can fail if the submission queue is full
+ ::io_uring_sqe* try_get_sqe() {
+ return ::io_uring_get_sqe(&_uring);
+ }
+
+ bool do_flush_submission_ring() {
+ if (_has_pending_submissions) {
+ _has_pending_submissions = false;
+ _did_work_while_getting_sqe = false;
+ io_uring_submit(&_uring);
+ return true;
+ } else {
+ return std::exchange(_did_work_while_getting_sqe, false);
+ }
+ }
+
+ ::io_uring_sqe* get_sqe() {
+ ::io_uring_sqe* sqe;
+ while ((sqe = try_get_sqe()) == nullptr) [[unlikely]] {
+ do_flush_submission_ring();
+ do_process_kernel_completions_step();
+ _did_work_while_getting_sqe = true;
+ }
+ return sqe;
+ }
+ future<> poll(pollable_fd_state& fd, int events) {
+ if (events & fd.events_known) {
+ fd.events_known &= ~events;
+ return make_ready_future<>();
+ }
+ fd.events_rw = events == (POLLIN|POLLOUT);
+ auto sqe = get_sqe();
+ ::io_uring_prep_poll_add(sqe, fd.fd.get(), events);
+ auto ufd = static_cast<uring_pollable_fd_state*>(&fd);
+ ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(ufd->get_desc(events)));
+ _has_pending_submissions = true;
+ return ufd->get_completion_future(events);
+ }
+
+ void submit_io_request(internal::io_request& req, io_completion* completion) {
+ auto sqe = get_sqe();
+ using o = internal::io_request::operation;
+ switch (req.opcode()) {
+ case o::read:
+ ::io_uring_prep_read(sqe, req.fd(), req.address(), req.size(), req.pos());
+ break;
+ case o::write:
+ ::io_uring_prep_write(sqe, req.fd(), req.address(), req.size(), req.pos());
+ break;
+ case o::readv:
+ ::io_uring_prep_readv(sqe, req.fd(), req.iov(), req.iov_len(), req.pos());
+ break;
+ case o::writev:
+ ::io_uring_prep_writev(sqe, req.fd(), req.iov(), req.iov_len(), req.pos());
+ break;
+ case o::fdatasync:
+ ::io_uring_prep_fsync(sqe, req.fd(), IORING_FSYNC_DATASYNC);
+ break;
+ case o::recv:
+ case o::recvmsg:
+ case o::send:
+ case o::sendmsg:
+ case o::accept:
+ case o::connect:
+ case o::poll_add:
+ case o::poll_remove:
+ case o::cancel:
+ abort();
+ }
+ ::io_uring_sqe_set_data(sqe, completion);
+
+ _has_pending_submissions = true;
+ }
+
+ // Returns true if any work was done
+ bool queue_pending_file_io() {
+ return _r._io_sink.drain([&] (internal::io_request& req, io_completion* completion) -> bool {
+ submit_io_request(req, completion);
+ return true;
+ });
+ }
+
+ // Process kernel completions already extracted from the ring.
+ // This is needed because we sometimes extract completions without
+ // waiting, and sometimes with waiting.
+ void do_process_ready_kernel_completions(::io_uring_cqe** buf, size_t nr) {
+ for (auto p = buf; p != buf + nr; ++p) {
+ auto cqe = *p;
+ auto completion = reinterpret_cast<kernel_completion*>(cqe->user_data);
+ completion->complete_with(cqe->res);
+ ::io_uring_cqe_seen(&_uring, cqe);
+ }
+ }
+
+ // Returns true if completions were processed
+ bool do_process_kernel_completions_step() {
+ struct ::io_uring_cqe* buf[s_queue_len];
+ auto n = ::io_uring_peek_batch_cqe(&_uring, buf, s_queue_len);
+ do_process_ready_kernel_completions(buf, n);
+ return n != 0;
+ }
+
+ // Returns true if completions were processed
+ bool do_process_kernel_completions() {
+ auto did_work = false;
+ while (do_process_kernel_completions_step()) {
+ did_work = true;
+ }
+ return did_work | std::exchange(_did_work_while_getting_sqe, false);
+ }
+public:
+ explicit reactor_backend_uring(reactor& r)
+ : _r(r)
+ , _uring(try_create_uring(s_queue_len, true).value())
+ , _hrtimer_timerfd(make_timerfd())
+ , _preempt_io_context(_r, _r._task_quota_timer, _hrtimer_timerfd)
+ , _hrtimer_completion(_r, _hrtimer_timerfd)
+ , _smp_wakeup_completion(_r._notify_eventfd) {
+ // Protect against spurious wakeups - if we get notified that the timer has
+ // expired when it really hasn't, we don't want to block in read(tfd, ...).
+ auto tfd = _r._task_quota_timer.get();
+ ::fcntl(tfd, F_SETFL, ::fcntl(tfd, F_GETFL) | O_NONBLOCK);
+ }
+ ~reactor_backend_uring() {
+ ::io_uring_queue_exit(&_uring);
+ }
+ virtual bool reap_kernel_completions() override {
+ return do_process_kernel_completions();
+ }
+ virtual bool kernel_submit_work() override {
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ queue_pending_file_io();
+ did_work |= ::io_uring_submit(&_uring);
+ // io_uring_submit() may have kicked up queued work
+ did_work |= reap_kernel_completions();
+ return did_work;
+ }
+ virtual bool kernel_events_can_sleep() const override {
+ // We never need to spin while I/O is in flight.
+ return true;
+ }
+ virtual void wait_and_process_events(const sigset_t* active_sigmask) override {
+ _smp_wakeup_completion.maybe_rearm(*this);
+ _hrtimer_completion.maybe_rearm(*this);
+ ::io_uring_submit(&_uring);
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ did_work |= std::exchange(_did_work_while_getting_sqe, false);
+ if (did_work) {
+ return;
+ }
+ struct ::io_uring_cqe* cqe = nullptr;
+ sigset_t sigs = *active_sigmask; // io_uring_wait_cqes() wants non-const
+ // io_uring_wait_cqes() returns 0 or -errno (not a count) and fills in a single cqe pointer
+ int r = ::io_uring_wait_cqes(&_uring, &cqe, 1, nullptr, &sigs);
+ if (r < 0) [[unlikely]] {
+ switch (-r) {
+ case EINTR:
+ return;
+ default:
+ abort();
+ }
+ }
+ // At least one completion is ready now;
+ // reap everything through the regular non-blocking path.
+ do_process_kernel_completions();
+ _preempt_io_context.service_preempting_io();
+ }
+ virtual future<> readable(pollable_fd_state& fd) override {
+ return poll(fd, POLLIN);
+ }
+ virtual future<> writeable(pollable_fd_state& fd) override {
+ return poll(fd, POLLOUT);
+ }
+ virtual future<> readable_or_writeable(pollable_fd_state& fd) override {
+ return poll(fd, POLLIN | POLLOUT);
+ }
+ virtual void forget(pollable_fd_state& fd) noexcept override {
+ auto* pfd = static_cast<uring_pollable_fd_state*>(&fd);
+ delete pfd;
+ }
+ virtual future<std::tuple<pollable_fd, socket_address>> accept(pollable_fd_state& listenfd) override {
+ return _r.do_accept(listenfd);
+ }
+ virtual future<> connect(pollable_fd_state& fd, socket_address& sa) override {
+ return _r.do_connect(fd, sa);
+ }
+ virtual void shutdown(pollable_fd_state& fd, int how) override {
+ fd.fd.shutdown(how);
+ }
+ virtual future<size_t> read_some(pollable_fd_state& fd, void* buffer, size_t len) override {
+ return _r.do_read_some(fd, buffer, len);
+ }
+ virtual future<size_t> read_some(pollable_fd_state& fd, const std::vector<iovec>& iov) override {
+ return _r.do_read_some(fd, iov);
+ }
+ virtual future<temporary_buffer<char>> read_some(pollable_fd_state& fd, internal::buffer_allocator* ba) override {
+ return _r.do_read_some(fd, ba);
+ }
+ virtual future<size_t> write_some(pollable_fd_state& fd, net::packet& p) override {
+ return _r.do_write_some(fd, p);
+ }
+ virtual future<size_t> write_some(pollable_fd_state& fd, const void* buffer, size_t len) override {
+ return _r.do_write_some(fd, buffer, len);
+ }
+ virtual void signal_received(int signo, siginfo_t* siginfo, void* ignore) override {
+ _r._signals.action(signo, siginfo, ignore);
+ }
+ virtual void start_tick() override {
+ _preempt_io_context.start_tick();
+ }
+ virtual void stop_tick() override {
+ _preempt_io_context.stop_tick();
+ }
+ virtual void arm_highres_timer(const ::itimerspec& its) override {
+ _hrtimer_timerfd.timerfd_settime(TFD_TIMER_ABSTIME, its);
+ }
+ virtual void reset_preemption_monitor() override {
+ _preempt_io_context.reset_preemption_monitor();
+ }
+ virtual void request_preemption() override {
+ _preempt_io_context.request_preemption();
+ }
+ virtual void start_handling_signal() override {
+ // We don't have anything special wrt. signals
+ }
+ virtual pollable_fd_state_ptr make_pollable_fd_state(file_desc fd, pollable_fd::speculation speculate) override {
+ return pollable_fd_state_ptr(new uring_pollable_fd_state(std::move(fd), std::move(speculate)));
+ }
+};
+
+#endif
+
static bool detect_aio_poll() {
auto fd = file_desc::eventfd(0, 0);
aio_context_t ioc{};
setup_aio_context(1, &ioc);
auto cleanup = defer([&] () noexcept { io_destroy(ioc); });
@@ -1151,10 +1545,15 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}
return true;
}

std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+#ifdef SEASTAR_HAVE_URING
+ if (_name == "io_uring") {
+ return std::make_unique<reactor_backend_uring>(r);
+ }
+#endif
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
return std::make_unique<reactor_backend_epoll>(r);
}
@@ -1169,9 +1568,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
std::vector<reactor_backend_selector> ret;
if (detect_aio_poll() && has_enough_aio_nr()) {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}

}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index dcc064fe47..57f5c1c506 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -736,10 +736,11 @@ target_link_libraries (seastar
Boost::thread
c-ares::c-ares
cryptopp::cryptopp
fmt::fmt
lz4::lz4
+ URING::uring
PRIVATE
${CMAKE_DL_LIBS}
GnuTLS::gnutls
StdAtomic::atomic
lksctp-tools::lksctp-tools
@@ -1247,10 +1248,11 @@ if (Seastar_INSTALL)
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findnumactl.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findragel.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findrt.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findyaml-cpp.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/SeastarDependencies.cmake
+ ${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindLibUring.cmake
DESTINATION ${install_cmakedir})

install (
DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/cmake/code_tests
DESTINATION ${install_cmakedir})
diff --git a/cmake/FindLibUring.cmake b/cmake/FindLibUring.cmake
new file mode 100644
index 0000000000..e541263195
--- /dev/null
+++ b/cmake/FindLibUring.cmake
@@ -0,0 +1,61 @@
+#
+# This file is open source software, licensed to you under the terms
+# of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+# distributed with this work for additional information regarding copyright
+# ownership. You may not use this file except in compliance with the License.
+#
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+#
+# Copyright (C) 2022 ScyllaDB
+#
+
+find_package (PkgConfig REQUIRED)
+
+pkg_search_module (URING liburing)
+
+find_library (URING_LIBRARY
+ NAMES uring
+ HINTS
+ ${URING_PC_LIBDIR}
+ ${URING_PC_LIBRARY_DIRS})
+
+find_path (URING_INCLUDE_DIR
+ NAMES liburing.h
+ HINTS
+ ${URING_PC_INCLUDEDIR}
+ ${URING_PC_INCLUDE_DIRS})
+
+mark_as_advanced (
+ URING_LIBRARY
+ URING_INCLUDE_DIR)
+
+include (FindPackageHandleStandardArgs)
+
+find_package_handle_standard_args (LibUring
+ REQUIRED_VARS
+ URING_LIBRARY
+ URING_INCLUDE_DIR
+ VERSION_VAR URING_PC_VERSION)
+
+set (URING_LIBRARIES ${URING_LIBRARY})
+set (URING_INCLUDE_DIRS ${URING_INCLUDE_DIR})
+
+if (URING_FOUND AND NOT (TARGET URING::uring))
+ add_library (URING::uring UNKNOWN IMPORTED)
+
+ set_target_properties (URING::uring
+ PROPERTIES
+ IMPORTED_LOCATION ${URING_LIBRARY}
+ INTERFACE_INCLUDE_DIRECTORIES ${URING_INCLUDE_DIRS})
+endif ()
diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
index 51a8a65a26..418542e878 100644
--- a/cmake/SeastarDependencies.cmake
+++ b/cmake/SeastarDependencies.cmake
@@ -56,10 +56,11 @@ macro (seastar_find_dependencies)
fmt
lz4
# Private and private/public dependencies.
Concepts
GnuTLS
+ LibUring
LinuxMembarrier
Sanitizers
StdAtomic
hwloc
lksctp-tools # No version information published.
diff --git a/install-dependencies.sh b/install-dependencies.sh
index 7798657093..438223abdb 100755
--- a/install-dependencies.sh
+++ b/install-dependencies.sh
@@ -38,10 +38,11 @@ debian_packages=(
libxml2-dev
xfslibs-dev
libgnutls28-dev
liblz4-dev
libsctp-dev
+ liburing-dev
gcc
make
python3
systemtap-sdt-dev
libtool
@@ -73,10 +74,11 @@ redhat_packages=(
libxml2-devel
xfsprogs-devel
gnutls-devel
lksctp-tools-devel
lz4-devel
+ liburing-devel
gcc
make
python3
systemtap-sdt-devel
libtool
--
2.35.1

Raphael S. Carvalho

<raphaelsc@scylladb.com>
May 1, 2022, 10:02:19 PM
to Avi Kivity, seastar-dev
do you have plans for IORING_SETUP_IOPOLL flag? we want I/O polling
for very fast disk devices. we'd need to make it conditional though as
not all file systems support it (any one other than XFS?) and not all
disk devices are fast enough.

we could have two rings, one for IRQ driven I/O, and another for
polled I/O. the latter one with special logic for reaping completion
events. with a bit of abstraction, we can easily make it happen,
right?

Kefu Chai

<tchaikov@gmail.com>
May 2, 2022, 2:54:09 AM
to seastar-dev@googlegroups.com, Kefu Chai
Signed-off-by: Kefu Chai <tcha...@gmail.com>
---
CMakeLists.txt | 15 ++++++++++++++-
cooking_recipe.cmake | 10 ++++++++++
src/core/reactor_backend.cc | 2 --
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 57f5c1c5..f323c38a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -236,6 +236,10 @@ option (Seastar_HWLOC
"Enable hwloc support."
ON)

+option (Seastar_IO_URING
+ "Enable io_uring support."
+ OFF)
+
set (Seastar_JENKINS
""
CACHE
@@ -738,7 +742,6 @@ target_link_libraries (seastar
cryptopp::cryptopp
fmt::fmt
lz4::lz4
- URING::uring
PRIVATE
${CMAKE_DL_LIBS}
GnuTLS::gnutls
@@ -930,6 +933,16 @@ if (Seastar_HWLOC)
PRIVATE hwloc::hwloc)
endif ()

+if (Seastar_IO_URING)
+ if (NOT LibUring_FOUND)
+ message (FATAL_ERROR "`io_uring` support is enabled but liburing is not available!")
+ endif ()
+
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_URING)
+ target_link_libraries (seastar
+ PRIVATE URING::uring)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
diff --git a/cooking_recipe.cmake b/cooking_recipe.cmake
index 288ac901..9be1e377 100644
--- a/cooking_recipe.cmake
+++ b/cooking_recipe.cmake
@@ -286,6 +286,16 @@ cooking_ingredient (fmt
-DFMT_DOC=OFF
-DFMT_TEST=OFF)

+cooking_ingredient (liburing
+ EXTERNAL_PROJECT_ARGS
+ URL https://github.com/axboe/liburing/archive/liburing-2.1.tar.gz
+ URL_MD5 78f13d9861b334b9a9ca0d12cf2a6d3c
+ CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
+ BUILD_COMMAND <DISABLE>
+ BUILD_BYPRODUCTS "<SOURCE_DIR>/src/liburing.a"
+ BUILD_IN_SOURCE ON
+ INSTALL_COMMAND ${make_command} -C src -s install)
+
cooking_ingredient (lz4
EXTERNAL_PROJECT_ARGS
URL https://github.com/lz4/lz4/archive/v1.8.0.tar.gz
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index ff99ac63..92fbae62 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -30,8 +30,6 @@
#include <sys/poll.h>
#include <sys/syscall.h>

-#define SEASTAR_HAVE_URING // FIXME: convince cmake to do it.
-
#ifdef SEASTAR_HAVE_URING
#include <liburing.h>
#endif
--
2.36.0

tcha...@gmail.com

<tchaikov@gmail.com>
May 2, 2022, 2:59:41 AM
to seastar-dev
the change is based on v1 of "reactor: add io_uring backend", and is also available at g...@github.com:tchaikov/seastar.git io_uring

changes since v1:

- add cooking support
- add "Seastar_IO_URING" cmake option
- define SEASTAR_HAVE_URING by adding it to Seastar_PRIVATE_COMPILE_DEFINITIONS
- move URING::uring to the PRIVATE section of seastar linkage: liburing.h is only included by a non-public header, so it should be a private linkage.

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 4:44:48 AM
to tcha...@gmail.com, seastar-dev

Many thanks, I folded it into my patch (retaining credits).


Is there a reason not to default support to ON?


After changing it to ON, I don't see liburing in seastar.pc, like the other libraries, so linkage using pkgconfig will fail. How do we make it show there?


tcha...@gmail.com

<tchaikov@gmail.com>
May 2, 2022, 4:50:14 AM
to seastar-dev
On Monday, May 2, 2022 at 4:44:48 PM UTC+8 a...@scylladb.com wrote:

> Many thanks, I folded it into my patch (retaining credits).
>
> Is there a reason not to default support to ON?

I thought io_uring might not be available on systems stuck with older kernels. But yeah, since it is optional and is selected at run time, there is no good reason to disable it by default.

> After changing it to ON, I don't see liburing in seastar.pc, like the other libraries, so linkage using pkgconfig will fail. How do we make it show there?

ahh, i missed the .pc file. will come up with another patch on top of the previous one shortly.

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 4:50:14 AM
to Raphael S. Carvalho, seastar-dev

On 02/05/2022 05.02, Raphael S. Carvalho wrote:
>
>> +#ifdef SEASTAR_HAVE_URING
>> +
>> +// We want not to throw during detection (to avoid spurious exceptions)
>> +// but we do want to throw during for-real construction (to have an error
>> +// message). Hence `throw_on_error`.
>> +static
>> +std::optional<::io_uring>
>> +try_create_uring(unsigned queue_len, bool throw_on_error) {
>> + auto params = ::io_uring_params{
>> + .flags = 0,
>> + };
> do you have plans for IORING_SETUP_IOPOLL flag? we want I/O polling
> for very fast disk devices. we'd need to make it conditional though as
> not all file systems support it (any one other than XFS?) and not all
> disk devices are fast enough.
>
> we could have two rings, one for IRQ driven I/O, and another for
> polled I/O. the latter one with special logic for reaping completion
> events. with a bit of abstraction, we can easily make it happen,
> right?
>
>

Newer io_uring supports [2] polling for network [1] too. This is more
important than polling for disk I/O because TCP processing is much more
expensive than disk processing. With that, we can have a single ring and
dedicate some cores to the kernel, with the rest running purely Seastar code.


[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adc8682ec69012b68d5ab7123e246d2ad9a6f94b

[2] actually it was later reverted, but it will come back

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 5:43:48 AM
to seastar-dev@googlegroups.com
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operations (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Cmake improvements by: Kefu Chai <tcha...@gmail.com>

improve cmake support of liburing
Signed-off-by: Kefu Chai <tcha...@gmail.com>
Message-Id: <20220502065359.3...@gmail.com>
---

v2:
- incorporated cmake fixes from Kefu
- seastar.pc still doesn't include -luring, so not functional

include/seastar/core/reactor.hh | 1 +
src/core/reactor_backend.hh | 2 +
src/core/reactor_backend.cc | 396 ++++++++++++++++++++++++++++++++
CMakeLists.txt | 15 ++
cmake/FindLibUring.cmake | 61 +++++
cmake/SeastarDependencies.cmake | 1 +
cooking_recipe.cmake | 10 +
install-dependencies.sh | 2 +
8 files changed, 488 insertions(+)
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 4a1bca1afc..92fbae6244 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,14 @@
#include <seastar/util/read_first_line.hh>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>

+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif

namespace seastar {
@@ -1112,10 +1116,392 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
std::cerr << "reactor_backend_osv does not support file descriptors - make_pollable_fd_state() shouldn't have been called!\n";
abort();
}
#endif

+#ifdef SEASTAR_HAVE_URING
+
+// We want not to throw during detection (to avoid spurious exceptions)
+// but we do want to throw during for-real construction (to have an error
+// message). Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
+ auto params = ::io_uring_params{
+ .flags = 0,
+ };
+ ~reactor_backend_uring() {
+ ::io_uring_queue_exit(&_uring);
+ }
+ virtual bool reap_kernel_completions() override {
+ return do_process_kernel_completions();
+ }
+ virtual bool kernel_submit_work() override {
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ queue_pending_file_io();
+ did_work |= ::io_uring_submit(&_uring);
@@ -1151,10 +1537,15 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}
return true;
}

std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+#ifdef SEASTAR_HAVE_URING
+ if (_name == "io_uring") {
+ return std::make_unique<reactor_backend_uring>(r);
+ }
+#endif
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
return std::make_unique<reactor_backend_epoll>(r);
}
@@ -1169,9 +1560,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
std::vector<reactor_backend_selector> ret;
if (detect_aio_poll() && has_enough_aio_nr()) {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}

}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index dcc064fe47..f323c38a56 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -234,10 +234,14 @@ option (Seastar_EXECUTE_ONLY_FAST_TESTS

option (Seastar_HWLOC
"Enable hwloc support."
ON)

+option (Seastar_IO_URING
+ "Enable io_uring support."
+ OFF)
+
set (Seastar_JENKINS
""
CACHE
STRING
"If non-empty, the prefix for XML files containing the results of running tests (for Jenkins).")
@@ -927,10 +931,20 @@ if (Seastar_HWLOC)

target_link_libraries (seastar
PRIVATE hwloc::hwloc)
endif ()

+if (Seastar_IO_URING)
+ if (NOT LibUring_FOUND)
+ message (FATAL_ERROR "`io_uring` support is enabled but liburing is not available!")
+ endif ()
+
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_URING)
+ target_link_libraries (seastar
+ PRIVATE URING::uring)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
PRIVATE ${Seastar_LD_FLAGS})
endif ()
@@ -1247,10 +1261,11 @@ if (Seastar_INSTALL)
diff --git a/cooking_recipe.cmake b/cooking_recipe.cmake
index 288ac90151..9be1e37782 100644
--- a/cooking_recipe.cmake
+++ b/cooking_recipe.cmake
@@ -284,10 +284,20 @@ cooking_ingredient (fmt
URL_MD5 eaf6e3c1b2f4695b9a612cedf17b509d
CMAKE_ARGS
-DFMT_DOC=OFF
-DFMT_TEST=OFF)

+cooking_ingredient (liburing
+ EXTERNAL_PROJECT_ARGS
+ URL https://github.com/axboe/liburing/archive/liburing-2.1.tar.gz
+ URL_MD5 78f13d9861b334b9a9ca0d12cf2a6d3c
+ CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
+ BUILD_COMMAND <DISABLE>
+ BUILD_BYPRODUCTS "<SOURCE_DIR>/src/liburing.a"
+ BUILD_IN_SOURCE ON
+ INSTALL_COMMAND ${make_command} -C src -s install)
+
cooking_ingredient (lz4
EXTERNAL_PROJECT_ARGS
URL https://github.com/lz4/lz4/archive/v1.8.0.tar.gz
URL_MD5 6247bf0e955899969d1600ff34baed6b
# This is upsetting.

Kefu Chai

<tchaikov@gmail.com>
May 2, 2022, 7:43:45 AM
to seastar-dev@googlegroups.com, Kefu Chai
* enable Seastar_IO_URING by default
* add liburing to seastar.pc as a private linkage. liburing stuff
is added only if Seastar_IO_URING is ON

Signed-off-by: Kefu Chai <tcha...@gmail.com>
---
CMakeLists.txt | 2 +-
pkgconfig/seastar.pc.in | 8 +++++---
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f323c38a..101b48d0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -238,7 +238,7 @@ option (Seastar_HWLOC

option (Seastar_IO_URING
"Enable io_uring support."
- OFF)
+ ON)

set (Seastar_JENKINS
""
diff --git a/pkgconfig/seastar.pc.in b/pkgconfig/seastar.pc.in
index 9cdf41d7..f163114f 100644
--- a/pkgconfig/seastar.pc.in
+++ b/pkgconfig/seastar.pc.in
@@ -28,6 +28,8 @@ fmt_cflags=-I$<JOIN:$<TARGET_PROPERTY:fmt::fmt,INTERFACE_INCLUDE_DIRECTORIES>, -
fmt_libs=$<TARGET_LINKER_FILE:fmt::fmt>
lksctp_tools_cflags=-I$<JOIN:@lksctp-tools_INCLUDE_DIRS@, -I>
lksctp_tools_libs=$<JOIN:@lksctp-tools_LIBRARIES@, >
+liburing_cflags=$<$<BOOL:@Seastar_IO_URING@>:-I$<JOIN:$<TARGET_PROPERTY:URING::uring,INTERFACE_INCLUDE_DIRECTORIES>, -I>>
+liburing_libs=$<$<BOOL:@Seastar_IO_URING@>:$<TARGET_LINKER_FILE:URING::uring>>
numactl_cflags=-I$<JOIN:@numactl_INCLUDE_DIRS@, -I>
numactl_libs=$<JOIN:@numactl_LIBRARIES@, >

@@ -36,8 +38,8 @@ seastar_cflags=${seastar_include_flags} $<JOIN:$<FILTER:$<TARGET_PROPERTY:seasta
seastar_libs=${libdir}/$<TARGET_FILE_NAME:seastar> @Seastar_SPLIT_DWARF_FLAG@ $<JOIN:@Seastar_Sanitizers_OPTIONS@, >

Requires: liblz4 >= 1.7.3
-Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1
+Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1$<$<BOOL:@Seastar_IO_URING@>:, liburing $<ANGLE-R>= 2.0>
Conflicts:
-Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
+Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${liburing_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
Libs: ${seastar_libs} ${boost_program_options_libs} ${boost_thread_libs} ${c_ares_libs} ${cryptopp_libs} ${fmt_libs}
-Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${numactl_libs} ${stdatomic_libs}
+Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${liburing_libs} ${numactl_libs} ${stdatomic_libs}
--
2.36.0

Kefu Chai

<tchaikov@gmail.com>
May 2, 2022, 7:55:09 AM
to seastar-dev@googlegroups.com, Kefu Chai
* enable Seastar_IO_URING by default
* add liburing to seastar.pc as a private linkage. liburing stuff
is added only if Seastar_IO_URING is ON

Signed-off-by: Kefu Chai <tcha...@gmail.com>
---
CMakeLists.txt | 2 +-
pkgconfig/seastar.pc.in | 8 +++++---
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f323c38a..101b48d0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -238,7 +238,7 @@ option (Seastar_HWLOC

option (Seastar_IO_URING
"Enable io_uring support."
- OFF)
+ ON)

set (Seastar_JENKINS
""
diff --git a/pkgconfig/seastar.pc.in b/pkgconfig/seastar.pc.in
index 9cdf41d7..edfa20cb 100644
--- a/pkgconfig/seastar.pc.in
+++ b/pkgconfig/seastar.pc.in
@@ -28,6 +28,8 @@ fmt_cflags=-I$<JOIN:$<TARGET_PROPERTY:fmt::fmt,INTERFACE_INCLUDE_DIRECTORIES>, -
fmt_libs=$<TARGET_LINKER_FILE:fmt::fmt>
lksctp_tools_cflags=-I$<JOIN:@lksctp-tools_INCLUDE_DIRS@, -I>
lksctp_tools_libs=$<JOIN:@lksctp-tools_LIBRARIES@, >
+liburing_cflags=$<$<BOOL:@Seastar_IO_URING@>:-I$<JOIN:$<TARGET_PROPERTY:URING::uring,INTERFACE_INCLUDE_DIRECTORIES>, -I>>
+liburing_libs=$<$<BOOL:@Seastar_IO_URING@>:$<TARGET_LINKER_FILE:URING::uring>>
numactl_cflags=-I$<JOIN:@numactl_INCLUDE_DIRS@, -I>
numactl_libs=$<JOIN:@numactl_LIBRARIES@, >

@@ -36,8 +38,8 @@ seastar_cflags=${seastar_include_flags} $<JOIN:$<FILTER:$<TARGET_PROPERTY:seasta
seastar_libs=${libdir}/$<TARGET_FILE_NAME:seastar> @Seastar_SPLIT_DWARF_FLAG@ $<JOIN:@Seastar_Sanitizers_OPTIONS@, >

Requires: liblz4 >= 1.7.3
-Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1
+Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, $<$<BOOL:@Seastar_IO_URING@>:liburing $<ANGLE-R>= 2.0, >yaml-cpp >= 0.5.1

tcha...@gmail.com

<tchaikov@gmail.com>
May 2, 2022, 7:57:39 AM
to seastar-dev
this change is based on top of v2 of "reactor: add io_uring backend".

change since v1:

- preserve the order of "Requires.private" libraries: they are ordered alphabetically.

Piotr Sarna

<sarna@scylladb.com>
May 2, 2022, 8:19:30 AM
to Avi Kivity, seastar-dev@googlegroups.com
Where can I find docs/commit messages which explain this? Do we just use
a dummy 8-byte write to a descriptor to signal that we want to wake up?

Does it mean that the current implementation of io_uring backend is a
stub and would abort on any network operation, or is this path
unreachable? It would be nice to see a comment explaining what happens here.

Is there a minimum version of liburing that we need to have in order for
the backend to work? I'm aware that the interface is rather young and
has a fast development pace, so maybe we need to specify some minimum
release?

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 11:36:50 AM
to Piotr Sarna, seastar-dev@googlegroups.com
It's an eventfd, so all it supports are 8-byte writes. It's more or less
a semaphore wrapped in an fd.
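
To make the "semaphore wrapped in an fd" description concrete, here is a minimal Linux-only sketch. The helper names `make_wakeup_fd`, `signal_wakeup` and `drain_wakeup` are made up for this illustration and are not Seastar functions; only `eventfd()`, `read()` and `write()` are real system interfaces.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helpers for illustration only, not Seastar code.

// Create a non-blocking eventfd whose counter starts at zero.
inline int make_wakeup_fd() {
    return ::eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

// Add 1 to the counter. An eventfd only accepts 8-byte writes.
inline bool signal_wakeup(int fd) {
    std::uint64_t one = 1;
    return ::write(fd, &one, sizeof(one)) == static_cast<ssize_t>(sizeof(one));
}

// Read and reset the counter. Returns how many signals were posted since
// the last drain, or 0 if none were pending (read() fails with EAGAIN).
inline std::uint64_t drain_wakeup(int fd) {
    std::uint64_t count = 0;
    if (::read(fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count))) {
        return 0;
    }
    return count;
}
```

Two calls to `signal_wakeup()` followed by one `drain_wakeup()` yield a count of 2, after which the descriptor is no longer readable until the next signal.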


commit 51c2917e9bd71bf5f4e39d1a18264a13e3908b37
Author: Avi Kivity <a...@scylladb.com>
Date:   Tue Dec 19 21:14:08 2017 +0200

    reactor: switch smp notifications from signals to eventfds

    Signals are generally nasty, requiring a process-wide lock to send.
    This makes them slow on large machines (although this is mitigated
    by the reactor spinning before going to sleep).

    Switch to eventfd instead, which should also be faster. File
    descriptors should not be used with dpdk, but since with dpdk
    we never sleep, we also never need to wake up.

    The original motivation for this patch was the lack of io_pgetevents(),
    which is needed to atomically wait for both signals and fd
    notifications. Since io_pgetevents() was since made available, this
    patch is not critical, but is still an improvement.
It's unreachable. I'll add a comment. The patch is fully functional
(apart from the cmake stuff), passes the test suite, as well as Scylla's
test suite in all its glory.


The unreachable stuff is due to Glauber's preparation for io_uring,
which encapsulated operations (like sendmsg) in messages. But we don't
yet generate those messages, so no need to handle them yet.


Handling will have some complexity, since support to io_uring was added
over time, and we'll need to detect and dispatch to io_uring or the
existing path based on availability.


>> +
>> +set (URING_LIBRARIES ${URING_LIBRARY})
>> +set (URING_INCLUDE_DIRS ${URING_INCLUDE_DIR})
>> +
>> +if (URING_FOUND AND NOT (TARGET URING::uring))
>> +  add_library (URING::uring UNKNOWN IMPORTED)
>> +
>> +  set_target_properties (URING::uring
>> +    PROPERTIES
>> +      IMPORTED_LOCATION ${URING_LIBRARY}
>> +      INTERFACE_INCLUDE_DIRECTORIES ${URING_INCLUDE_DIRS})
>> +endif ()
> Is there a minimum version of liburing that we need to have in order
> for the backend to work? I'm aware that the interface is rather young
> and has a fast development pace, so maybe we need to specify some
> minimum release?


2.0 (for ring.features, see detect_io_uring()). I'll add a condition if
I can find out how.


Avi Kivity

<avi@scylladb.com>
May 2, 2022, 11:39:07 AM
to Kefu Chai, seastar-dev@googlegroups.com

On 02/05/2022 14.55, Kefu Chai wrote:
> * enable Seastar_IO_URING by default
> * add liburing to seastar.pc as a private linkage. liburing stuff
> is added only if Seastar_IO_URING is ON
>

This doesn't apply on top of v2. I pushed it to
https://github.com/avikivity/seastar/tree/io_uring.


Can we make it auto-detect? e.g. OFF=always off, ON=always on (complain
if not available), DETECT=enable if possible. If that's not cmake best
practice, let's stick with ON/OFF.

Kefu Chai

<tchaikov@gmail.com>
May 2, 2022, 11:51:26 AM
to seastar-dev@googlegroups.com, Kefu Chai
* enable Seastar_IO_URING by default
* add liburing to seastar.pc as a private linkage. liburing stuff
is added only if Seastar_IO_URING is ON

Signed-off-by: Kefu Chai <tcha...@gmail.com>
---
CMakeLists.txt | 2 +-
pkgconfig/seastar.pc.in | 8 +++++---
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f323c38a..101b48d0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
--
2.36.0

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 11:54:47 AM
to Kefu Chai, seastar-dev@googlegroups.com
Thanks, folded into my patch.

kefu chai

<tchaikov@gmail.com>
May 2, 2022, 12:00:27 PM
to Avi Kivity, seastar-dev
On Mon, May 2, 2022 at 11:39 PM Avi Kivity <a...@scylladb.com> wrote:
>
>
> On 02/05/2022 14.55, Kefu Chai wrote:
> > * enable Seastar_IO_URING by default
> > * add liburing to seastar.pc as a private linkage. liburing stuff
> > is added only if Seastar_IO_URING is ON
> >
>
> This doesn't apply on top of v2. I pushed it to
> https://github.com/avikivity/seastar/tree/io_uring.

rebased and resent. pushed to
https://github.com/tchaikov/seastar/tree/io_uring .

>
>
> Can we make it auto-detect? e.g. OFF=always off, ON=always on (complain
> if not available), DETECT=enable if possible. If that's not cmake best
> practice, let's stick with ON/OFF.

sure, we can make "Seastar_IO_URING" a user-settable string variable
which allows one of the tri-states. but IMHO, a repeatable build is
always desirable. in other words, i think the built artifacts should
not rely on the existence of a library or a setting in its building
environment. they should only depend on the configure options and the
version of its dependencies.
--
Regards
Kefu Chai

Kefu Chai

<tchaikov@gmail.com>
May 2, 2022, 12:31:07 PM
to seastar-dev@googlegroups.com, Kefu Chai
Signed-off-by: Kefu Chai <tcha...@gmail.com>
---
cmake/FindLibUring.cmake | 15 ++++++++++++++-
cmake/SeastarDependencies.cmake | 1 +
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/cmake/FindLibUring.cmake b/cmake/FindLibUring.cmake
index e5412631..54ff51db 100644
--- a/cmake/FindLibUring.cmake
+++ b/cmake/FindLibUring.cmake
@@ -36,9 +36,21 @@ find_path (URING_INCLUDE_DIR
${URING_PC_INCLUDEDIR}
${URING_PC_INCLUDE_DIRS})

+if (URING_INCLUDE_DIR)
+ include (CheckStructHasMember)
+ include (CMakePushCheckState)
+ cmake_push_check_state (RESET)
+ list(APPEND CMAKE_REQUIRED_INCLUDES ${URING_INCLUDE_DIR})
+ CHECK_STRUCT_HAS_MEMBER ("struct io_uring" features liburing.h
+ HAVE_IOURING_FEATURES LANGUAGE CXX)
+ cmake_pop_check_state ()
+endif ()
+
mark_as_advanced (
URING_LIBRARY
- URING_INCLUDE_DIR)
+ URING_INCLUDE_DIR
+ HAVE_IOURING_FEATURES)
+

include (FindPackageHandleStandardArgs)

@@ -46,6 +58,7 @@ find_package_handle_standard_args (LibUring
REQUIRED_VARS
URING_LIBRARY
URING_INCLUDE_DIR
+ HAVE_IOURING_FEATURES
VERSION_VAR URING_PC_VERSION)

set (URING_LIBRARIES ${URING_LIBRARY})
diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
index 418542e8..819bcfa9 100644
--- a/cmake/SeastarDependencies.cmake
+++ b/cmake/SeastarDependencies.cmake
@@ -88,6 +88,7 @@ macro (seastar_find_dependencies)
set (_seastar_dep_args_fmt 5.0.0 REQUIRED)
set (_seastar_dep_args_lz4 1.7.3 REQUIRED)
set (_seastar_dep_args_GnuTLS 3.3.26 REQUIRED)
+ set (_seastar_dep_args_LibUring 2.0)
set (_seastar_dep_args_StdAtomic REQUIRED)
set (_seastar_dep_args_hwloc 1.11.2)
set (_seastar_dep_args_lksctp-tools REQUIRED)
--
2.36.0

tcha...@gmail.com

<tchaikov@gmail.com>
May 2, 2022, 12:33:16 PM
to seastar-dev
hi Avi,

this change should apply on top of 6a3b2d3fa33ed89de8288d926e59ad1d4095a489

- check for io_uring.features
- require liburing 2.0 and up

pushed to https://github.com/tchaikov/seastar/tree/io_uring

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 12:37:31 PM
to kefu chai, seastar-dev

On 02/05/2022 19.00, kefu chai wrote:
> On Mon, May 2, 2022 at 11:39 PM Avi Kivity <a...@scylladb.com> wrote:
>>
>> On 02/05/2022 14.55, Kefu Chai wrote:
>>> * enable Seastar_IO_URING by default
>>> * add liburing to seastar.pc as a private linkage. liburing stuff
>>> is added only if Seastar_IO_URING is ON
>>>
>> This doesn't apply on top of v2. I pushed it to
>> https://github.com/avikivity/seastar/tree/io_uring.
> rebased and resent. pushed to
> https://github.com/tchaikov/seastar/tree/io_uring .


Thanks.


>>
>> Can we make it auto-detect? e.g. OFF=always off, ON=always on (complain
>> if not available), DETECT=enable if possible. If that's not cmake best
>> practice, let's stick with ON/OFF.
> sure, we can make "Seastar_IO_URING" a user-settable string variable
> which allows one of the tri-states. but IMHO, a repeatable build is
> always desirable. in other words, i think the built artifacts should
> not rely on the existence of a library or a setting in its building
> environment. they should only depend on the configure options and the
> version of its dependencies.


Ok, agree.


Avi Kivity

<avi@scylladb.com>
May 2, 2022, 12:39:20 PM
to Kefu Chai, seastar-dev@googlegroups.com
Thanks, merged.

Raphael S. Carvalho

<raphaelsc@scylladb.com>
May 2, 2022, 1:22:34 PM
to Avi Kivity, seastar-dev
? )
it seems like try_create_uring() can be potentially reused here, and
given its params, it seems like you intend to do it eventually.

This could leave the reader guessing why you picked 200 for the
submission ring size. Perhaps a few words explaining the reason?

I think it's more intuitive if you do it like in
do_process_kernel_completions() with a single return at the end:
'return did_work | std::exchange(_did_work_while_getting_sqe, false);'

Avi Kivity

<avi@scylladb.com>
May 2, 2022, 1:33:06 PM
to Raphael S. Carvalho, seastar-dev

On 02/05/2022 20.22, Raphael S. Carvalho wrote:
> ? )


!
Actually try_create_uring() need not check those things if
detect_io_uring() does it.

It's just a random number. It doesn't have much significance, beyond
that too small = too many syscalls, too large = larger chance of
failing on not enough mlocked memory.


I'll add a comment.


I don't see how it works, what is did_work here?


>
>> +

Raphael S. Carvalho

<raphaelsc@scylladb.com>
May 2, 2022, 1:49:00 PM
to Avi Kivity, seastar-dev
On Mon, May 2, 2022 at 2:33 PM Avi Kivity <a...@scylladb.com> wrote:
>
>
> On 02/05/2022 20.22, Raphael S. Carvalho wrote:
> > ? )
>
>
> !

Was a typo, sorry.

When reading this, it felt like try_create_uring() could be reused
with throw_on_error set to false, then duplicated checks could be
removed from detect_io_uring(). If a disengaged optional isn't returned,
then the missing checks would be performed on the ring created by
try_create_uring(). But this can be revisited later, if needed.

Thanks. Indeed, larger numbers could potentially lead to ENOMEM, which
IIRC is the reason Glauber had to decrease it in the latest version of
his work, from 1k to 256.

Something like this:

diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 3d9b5d72..be2f91d4 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -1280,14 +1280,13 @@ class reactor_backend_uring final : public
reactor_backend {
}

bool do_flush_submission_ring() {
+ auto did_work = false;
if (_has_pending_submissions) {
_has_pending_submissions = false;
- _did_work_while_getting_sqe = false;
io_uring_submit(&_uring);
- return true;
- } else {
- return std::exchange(_did_work_while_getting_sqe, false);
+ did_work = true;
}
+ return did_work | std::exchange(_did_work_while_getting_sqe, false);
}

did_work refers to whether or not we have flushed any pending submission
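
As a sanity check that the single-return form behaves the same as the two-return form, here is a small standalone sketch. `flush_state` is a toy stand-in invented for this illustration: it mirrors only the two flags from the diff above and replaces the real `io_uring_submit()` call with a counter.

```cpp
#include <utility>

// Toy stand-in for the two flags in reactor_backend_uring (assumption:
// nothing else in the backend observes these flags during the flush).
struct flush_state {
    bool has_pending_submissions = false;
    bool did_work_while_getting_sqe = false;
    int submit_calls = 0;  // counts where io_uring_submit() would be called

    // Shape of the code in the patch: two return paths.
    bool flush_two_returns() {
        if (has_pending_submissions) {
            has_pending_submissions = false;
            did_work_while_getting_sqe = false;
            ++submit_calls;
            return true;
        } else {
            return std::exchange(did_work_while_getting_sqe, false);
        }
    }

    // Shape of the suggestion: a single return at the end.
    bool flush_single_return() {
        bool did_work = false;
        if (has_pending_submissions) {
            has_pending_submissions = false;
            ++submit_calls;
            did_work = true;
        }
        // std::exchange() reads the old flag value and clears it in one step.
        return did_work | std::exchange(did_work_while_getting_sqe, false);
    }
};
```

For all four combinations of the two flags, both variants return the same value and leave both flags cleared, so the change is purely stylistic in this model.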

>
>
> >
> >> +

Pavel Emelyanov

<xemul@scylladb.com>
May 3, 2022, 4:53:01 AM
to Avi Kivity, seastar-dev@googlegroups.com
The throw_on_error is always true despite the comment.

> + auto params = ::io_uring_params{
> + .flags = 0,
> + };
> + ::io_uring ring;
> + auto err = ::io_uring_queue_init_params(queue_len, &ring, &params);
> + if (err != 0) {
> + if (throw_on_error) {
> + throw std::system_error(-err, std::system_category());
> + }
> + return std::nullopt;
> + }
> + ::io_uring_ring_dontfork(&ring);
> + if (!(params.features & IORING_FEAT_NODROP)
> + || !(params.features & IORING_FEAT_SUBMIT_STABLE)) {

Maybe if (~params.features & (IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE)) ?
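
For readers comparing the two spellings of the feature check, the sketch below shows that they agree. The constants `feat_nodrop` and `feat_submit_stable` are stand-ins chosen for this illustration so that the snippet compiles without liburing; real code would use `IORING_FEAT_NODROP` and `IORING_FEAT_SUBMIT_STABLE` from the liburing headers.

```cpp
#include <cstdint>

// Stand-in flag values for illustration only.
constexpr std::uint32_t feat_nodrop = 1u << 1;
constexpr std::uint32_t feat_submit_stable = 1u << 2;

// Form used in the patch: test each required bit separately.
constexpr bool missing_features_long(std::uint32_t features) {
    return !(features & feat_nodrop) || !(features & feat_submit_stable);
}

// Suggested form: invert the word, then test all required bits at once.
// A bit survives the mask only if it is required and absent.
constexpr bool missing_features_short(std::uint32_t features) {
    return (~features & (feat_nodrop | feat_submit_stable)) != 0;
}

static_assert(!missing_features_short(feat_nodrop | feat_submit_stable),
              "both features present means nothing is missing");
static_assert(missing_features_short(feat_nodrop),
              "one absent feature is reported as missing");
```

The short form scales better if more required features are added later, since only the mask grows.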
Likely can be a subclass of the reactor_backend_uring one.

> + pollable_fd_state_completion _completion_pollin;
> + pollable_fd_state_completion _completion_pollout;
> +public:
> + explicit uring_pollable_fd_state(file_desc desc, speculation speculate)
> + : pollable_fd_state(std::move(desc), std::move(speculate)) {
> + }
> + pollable_fd_state_completion* get_desc(int events) {
> + if (events & POLLIN) {
> + return &_completion_pollin;
> + } else {
> + return &_completion_pollout;
> + }
> + }
> + future<> get_completion_future(int events) {
> + return get_desc(events)->get_future();
> + }
> +};
> +
> +class reactor_backend_uring final : public reactor_backend {
> + static constexpr unsigned s_queue_len = 200;

Why only 200? The AIO backend has its ring size larger. (trailing space in this line too)
Is

if (expirations != 0) {

> + _r.service_highres_timer();

}

missing?
Why not call try_get_sqe() here and return false to the sink drainer?
The latter can handle it and postpone the request submission.

The AIO peer code in prepare_iocb() logs this case.

> + abort();
> + }
> + ::io_uring_sqe_set_data(sqe, completion);
> +
> + _has_pending_submissions = true;
> + }
> +
> + // Returns true if any work was done
> + bool queue_pending_file_io() {

Caller ignores the return value and relies on the io_uring_submit() result instead.

As per docs the nr is 0 here.

> + if (nr == s_queue_len) {

Ditto.

Avi Kivity

<avi@scylladb.com>
May 3, 2022, 5:41:33 AM
to Pavel Emelyanov, seastar-dev@googlegroups.com
I need to revamp the detect/create code. One was written months ago, the
other days, but they are doing the same thing.


>> +    auto params = ::io_uring_params{
>> +        .flags = 0,
>> +    };
>> +    ::io_uring ring;
>> +    auto err = ::io_uring_queue_init_params(queue_len, &ring, &params);
>> +    if (err != 0) {
>> +        if (throw_on_error) {
>> +            throw std::system_error(-err, std::system_category());
>> +        }
>> +        return std::nullopt;
>> +    }
>> +    ::io_uring_ring_dontfork(&ring);
>> +    if (!(params.features & IORING_FEAT_NODROP)
>> +            || !(params.features & IORING_FEAT_SUBMIT_STABLE)) {
>
> Maybe if (~params.features & (IORING_FEAT_NODROP |
> IORING_FEAT_SUBMIT_STABLE)) ?


Yes, and move into detect.

Yes.


>
>> +    pollable_fd_state_completion _completion_pollin;
>> +    pollable_fd_state_completion _completion_pollout;
>> +public:
>> +    explicit uring_pollable_fd_state(file_desc desc, speculation
>> speculate)
>> +            : pollable_fd_state(std::move(desc),
>> std::move(speculate)) {
>> +    }
>> +    pollable_fd_state_completion* get_desc(int events) {
>> +        if (events & POLLIN) {
>> +            return &_completion_pollin;
>> +        } else {
>> +            return &_completion_pollout;
>> +        }
>> +    }
>> +    future<> get_completion_future(int events) {
>> +        return get_desc(events)->get_future();
>> +    }
>> +};
>> +
>> +class reactor_backend_uring final : public reactor_backend {
>> +    static constexpr unsigned s_queue_len = 200;
>
> Why only 200? The AIO backend has its ring size larger. (trailing space
> in this line too)
>

There's already a comment in the new version. This is just the
communication ring between userspace and the kernel, 200 ops per task
quota is already an overkill. The amount of in-flight ops is unlimited
(unlike aio which uses the same number).
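
The distinction between ring size and in-flight operations can be sketched with a toy model. `toy_ring` below is invented for this illustration and does not use liburing: "flushing" stands for the point where the real backend would call `io_uring_submit()`, and completions are not modelled.

```cpp
#include <cstddef>
#include <vector>

// Toy model only: a fixed-size submission ring in front of an unbounded
// set of in-flight operations.
class toy_ring {
    std::size_t _capacity;
    std::vector<int> _queued;    // entries prepared but not yet submitted
    std::size_t _in_flight = 0;  // operations already handed to the "kernel"
    std::size_t _flushes = 0;    // how often we had to "enter the kernel"
public:
    explicit toy_ring(std::size_t capacity) : _capacity(capacity) {}

    // Hand all queued entries over; they remain in flight afterwards.
    void flush() {
        if (_queued.empty()) {
            return;
        }
        _in_flight += _queued.size();
        _queued.clear();
        ++_flushes;
    }

    // Queue one operation. If the ring is full, flush first; this is the
    // (rare) case in which getting an entry costs a system call.
    void queue(int op) {
        if (_queued.size() == _capacity) {
            flush();
        }
        _queued.push_back(op);
    }

    std::size_t queued() const { return _queued.size(); }
    std::size_t in_flight() const { return _in_flight; }
    std::size_t flushes() const { return _flushes; }
};
```

With a capacity of 4, queueing 10 operations triggers two internal flushes and leaves 8 operations in flight plus 2 queued, so the in-flight count is not bounded by the ring size.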


This is why get_sqe() can recurse into flushing the submission and
completion rings.

No, we won't be awakened if expirations = 0 (timerfd contract), and
spurious wakeup is no-op anyway.
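
The timerfd behaviour referred to here can be tried out with a short Linux-only sketch. The helpers `make_timer_fd`, `arm_timer_ms` and `read_expirations` are hypothetical names for this illustration, not Seastar code. A timerfd becomes readable only after at least one expiration, and with a non-blocking descriptor a premature read() fails with EAGAIN rather than blocking.

```cpp
#include <sys/timerfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helpers for illustration only, not Seastar code.

// Create an unarmed, non-blocking timerfd.
inline int make_timer_fd() {
    return ::timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
}

// Arm the timer to fire once after `ms` milliseconds (ms must be > 0,
// since an all-zero it_value disarms the timer).
inline bool arm_timer_ms(int fd, long ms) {
    itimerspec its{};
    its.it_value.tv_sec = ms / 1000;
    its.it_value.tv_nsec = (ms % 1000) * 1000000L;
    return ::timerfd_settime(fd, 0, &its, nullptr) == 0;
}

// Returns the number of expirations since the last read. A spurious call
// (nothing expired yet) fails with EAGAIN instead of blocking; report 0.
inline std::uint64_t read_expirations(int fd) {
    std::uint64_t expirations = 0;
    ssize_t n = ::read(fd, &expirations, sizeof(expirations));
    if (n != static_cast<ssize_t>(sizeof(expirations))) {
        return 0;
    }
    return expirations;
}
```

In this sketch a spurious wakeup simply yields 0 from `read_expirations()`, which is why an explicit `expirations != 0` check adds little.
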
But why postpone it?


Let's say we manage to submit some long running requests (poll of idle
connection) and postpone the fast requests. We'll return and run some
CPU task and try to submit again in the next task quota. So we lose a
task quota for those fast requests.


It's unlikely to happen, but I'd rather not pile on latency where we can
avoid it.

It can't happen, and if it does we can't recover. I think it's better to
abort.


>
>> +                abort();
>> +        }
>> +        ::io_uring_sqe_set_data(sqe, completion);
>> +
>> +        _has_pending_submissions = true;
>> +    }
>> +
>> +    // Returns true if any work was done
>> +    bool queue_pending_file_io() {
>
> Caller ignores the return value and relies on the io_uring_submit()
> result instead.


Hmm. There's duplication here because get_sqe() stores state in _ring
that we have unsubmitted stuff (and _did_work_while_getting_sqe
remembers it in case we have an internal flush). It's probably better to
use both and not fall on some edge case.

The doc must be wrong. Probably copied from io_uring_wait_cqe().
Otherwise how would you know how many completions are ready?


Although, trying to follow the code shows that it sometimes returns the
number of entries and sometimes zero. I guess it works because the call
to reap_kernel_completions() later will pick up the completions.


I'll try to clarify what's supposed to happen with the list.
Unfortunately the documentation is quite confusing.

Pavel Emelyanov

<xemul@scylladb.com>
May 3, 2022, 5:57:16 AM
to Avi Kivity, seastar-dev@googlegroups.com
So as not to make a synchronous system call that is, strictly speaking, not guaranteed
to be non-blocking.

I meant it _also_ logs in this case. Abort is of course unavoidable.

This surprises me too, but following the code from Jens' repo it still doesn't return positive
numbers. Maybe the timeout version does, but when the ts is null it follows

io_uring_wait_cqes()
__io_uring_get_cqe()
_io_uring_get_cqe()

and the latter has the number of entries in the nr_available and/or ret variables, but these never
propagate to the return value when they are positive.

Avi Kivity

<avi@scylladb.com>
May 3, 2022, 6:20:12 AM
to Pavel Emelyanov, seastar-dev@googlegroups.com
If io_uring_enter blocks, it blocks, and it doesn't matter if it blocks
here or on another place we call it.


However, it takes great effort not to block, so at least that part is okay.


Doing it immediately is a little problematic in that it can cause
completions to happen where we don't expect them, but I think it works out.

Ok, I can add it.

Yes, very strange code. I'll rework the call sequence (probably just
call reap() unconditionally here).

Pavel Emelyanov

<xemul@scylladb.com>
May 3, 2022, 6:22:49 AM
to Avi Kivity, seastar-dev@googlegroups.com
Yes, makes sense.

Gleb Natapov

<gleb@scylladb.com>
May 3, 2022, 6:50:48 AM
to Avi Kivity, seastar-dev@googlegroups.com
Nit: Why is there a difference between hrtimer_completion::complete_with()
and this one while it looks like they do the same thing? They both read
and discard 8 bytes and set _armed to false, but one uses uint64_t and
the other char[8], and one asserts while the other does not. Also, the _fd vs fd()
usage even in a single class is a little bit annoying.


> + }
> + void maybe_rearm(reactor_backend_uring& be) {
> + if (_armed) [[likely]] {
> + return;
> + }
> + auto sqe = be.get_sqe();
> + ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
> + ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
> + _armed = true;
> + be._has_pending_submissions = true;
> + }
Except for [[likely]] this looks identical to hrtimer_completion. Can
they be de-duped?

--
Gleb.

Nadav Har'El

<nyh@scylladb.com>
May 3, 2022, 7:00:59 AM
to Avi Kivity, seastar-dev
On Sun, May 1, 2022 at 8:38 PM Avi Kivity <a...@scylladb.com> wrote:
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

We have a comment in reactor_config.hh which lists the available backends, and
the new one should be listed as well. But I think we need more than that - we need
some sort of documentation, in the source code or a separate document (but not
just commit messages) explaining the *current* state of each backend (especially
now that we have one that you consider incomplete and we need to remember
somewhere what is incomplete), the advantages and disadvantages of each backend,
and perhaps a general statement (would it be correct?) that the best backend for
the capabilities of the current kernel is chosen automatically, so users should not explicitly
choose the reactor backend.


The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Isn't the preemption notifier just a file descriptor, like the other files?
(I'm very rusty in this code, so I'm probably just misremembering).
 
The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total amount of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.

However, in return the ergonomics are better for the user. It works
well with buffered I/O (--kernel-page-cache 1) and doesn't require
tuning some sysctls for it to work on large machines.
---

This appears to work well, apart from the cmake integration. I
appreciate any help in this area!

We require liburing 2.0 and above, and it should be optional.

I guess this means that this patch is an RFC and we shouldn't commit it yet?
 


 include/seastar/core/reactor.hh |   1 +
 src/core/reactor_backend.hh     |   2 +
 src/core/reactor_backend.cc     | 404 ++++++++++++++++++++++++++++++++
 CMakeLists.txt                  |   2 +
 cmake/FindLibUring.cmake        |  61 +++++
 cmake/SeastarDependencies.cmake |   1 +
 install-dependencies.sh         |   2 +
 7 files changed, 473 insertions(+)
index 4a1bca1afc..3d9b5d7299 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,16 @@

 #include <seastar/util/read_first_line.hh>
 #include <chrono>
 #include <sys/poll.h>
 #include <sys/syscall.h>

+#define SEASTAR_HAVE_URING // FIXME: convince cmake to do it.
+
+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
 #ifdef HAVE_OSV
 #include <osv/newpoll.hh>
 #endif

 namespace seastar {
@@ -1112,10 +1118,398 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati

     std::cerr << "reactor_backend_osv does not support file descriptors - make_pollable_fd_state() shouldn't have been called!\n";
     abort();
 }
 #endif

+#ifdef SEASTAR_HAVE_URING
+
+// We want not to throw during detection (to avoid spurious exceptions)

What's wrong with doing a try/catch around detection, which should just happen
once anyway? It looks like you're complicating this function just for this esoteric
case that doesn't seem an important optimization.

Also, by not throwing exceptions during detection, you miss the *reason* why
it wasn't detected. Maybe this can be important for somebody, perhaps in some
trace logging mode or something (somebody's not getting uring as he expected,
and wondering why)?
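
A rough sketch of the try/catch alternative suggested here, under the assumption that ring creation always throws on failure. All names (`create_ring_or_throw`, `detection_result`, `detect_ring`) are hypothetical, and the `simulated_errno` parameter replaces the real `io_uring_queue_init_params()` result so that the snippet builds without liburing.

```cpp
#include <string>
#include <system_error>

// Hypothetical stand-in for ring creation: always throws on failure.
inline int create_ring_or_throw(int simulated_errno) {
    if (simulated_errno != 0) {
        throw std::system_error(simulated_errno, std::system_category(),
                                "io_uring setup failed");
    }
    return 42;  // placeholder for a ring handle
}

struct detection_result {
    bool available = false;
    std::string reason;  // empty when available
};

// Detection wraps the throwing function, so the failure reason is kept
// (for example for trace logging) rather than discarded.
inline detection_result detect_ring(int simulated_errno) {
    try {
        create_ring_or_throw(simulated_errno);
        return {true, {}};
    } catch (const std::system_error& e) {
        return {false, e.what()};
    }
}
```

The cost is one exception on the detection path when io_uring is unavailable; the benefit is a single creation function and a reason string that can be logged.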
 
+// but we do want to throw during for-real construction (to have an error
+// message). Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {

A thought: There is a large chunk of code here that is 100% specific to uring, so I think it would have been nice to have a separate source file reactor_backend_uring
instead of stuffing all the different backends into one big source file.
Of course, if this will be a mess because we'll also need to put a lot of stuff in header files, then let's just leave it one big source file as you did here.
Nitpick: why is it important to use this "::"? Don't you consider the code uglier with all those extra ::?
 
+        }
+        void maybe_rearm(reactor_backend_uring& be) {
+            if (_armed) [[likely]] {
+                return;
+            }
+            auto sqe = be.get_sqe();
+            ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
+            ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+            _armed = true;
+            be._has_pending_submissions = true;
+        }

Even if this isn't implemented yet, perhaps better to print a more distinctive message?
 
+        // Protect against spurious wakeups - if we get notified that the timer has
+        // expired when it really hasn't, we don't want to block in read(tfd, ...).
+        auto tfd = _r._task_quota_timer.get();
+        ::fcntl(tfd, F_SETFL, ::fcntl(tfd, F_GETFL) | O_NONBLOCK);
+    }
+    ~reactor_backend_uring() {
+        ::io_uring_queue_exit(&_uring);
+    }
+    virtual bool reap_kernel_completions() override {
+        return do_process_kernel_completions();
+    }
+    virtual bool kernel_submit_work() override {
+        bool did_work = false;
+        did_work |= _preempt_io_context.service_preempting_io();
+        queue_pending_file_io();
+        did_work |= ::io_uring_submit(&_uring);
+        // io_uring_submit() may have kicked up queued work
+        did_work |= reap_kernel_completions();
@@ -1151,10 +1545,15 @@ bool reactor_backend_selector::has_enough_aio_nr() {

     }
     return true;
 }

 std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+#ifdef SEASTAR_HAVE_URING
+    if (_name == "io_uring") {
+        return std::make_unique<reactor_backend_uring>(r);
+    }
+#endif

Nitpick: maybe better to anyway support the "io_uring" option but if not compiled in, print an
error message that it's not compiled in?
 
     if (_name == "linux-aio") {
         return std::make_unique<reactor_backend_aio>(r);
     } else if (_name == "epoll") {
         return std::make_unique<reactor_backend_epoll>(r);
     }
@@ -1169,9 +1568,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {

     std::vector<reactor_backend_selector> ret;
     if (detect_aio_poll() && has_enough_aio_nr()) {
         ret.push_back(reactor_backend_selector("linux-aio"));
     }
     ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+    if (detect_io_uring()) {
+        ret.push_back(reactor_backend_selector("io_uring"));
+    }
+#endif
     return ret;
 }

 }
diff --git a/CMakeLists.txt b/CMakeLists.txt
index dcc064fe47..57f5c1c506 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt

@@ -736,10 +736,11 @@ target_link_libraries (seastar
     Boost::thread
     c-ares::c-ares
     cryptopp::cryptopp
     fmt::fmt
     lz4::lz4
+    URING::uring
   PRIVATE
     ${CMAKE_DL_LIBS}
     GnuTLS::gnutls
     StdAtomic::atomic
     lksctp-tools::lksctp-tools
@@ -1247,10 +1248,11 @@ if (Seastar_INSTALL)

tcha...@gmail.com

<tchaikov@gmail.com>
May 15, 2022, 5:54:55 AM5/15/22
to seastar-dev
Comments inlined. Just two nits.

On Monday, May 2, 2022 at 1:38:02 AM UTC+8 Avi Kivity wrote:
+// but we do want to throw during for-real construction (to have an error
+// message. Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {

Might want to drop this function declaration? It looks like a copy-paste leftover from reactor_backend_aio.
The only reader of fd.events_rw is now reactor_backend_epoll::wait_and_process(). Since reactor_backend_uring::wait_and_process_events() works in quite a different way, we could probably dispense with this step?
 

Avi Kivity

<avi@scylladb.com>
May 20, 2022, 12:33:28 PM5/20/22
to tcha...@gmail.com, seastar-dev


On 15/05/2022 12.54, tcha...@gmail.com wrote:
Comments inlined. Just two nits.

+ bool await_events(int timeout, const sigset_t* active_sigmask);

might want to drop this function declaration? looks like a copy-paste leftover from reactor_backend_aio.
 


Done


+ }
+ future<> poll(pollable_fd_state& fd, int events) {
+ if (events & fd.events_known) {
+ fd.events_known &= ~events;
+ return make_ready_future<>();
+ }
+ fd.events_rw = events == (POLLIN|POLLOUT);

The only reader of fd.events_rw is now reactor_backend_epoll::wait_and_process(). Since reactor_backend_uring::wait_and_process_events() works in quite a different way, we could probably dispense with this step?


I think we have to apply the same logic from the aio backend to io_uring, I'll check.


Avi Kivity

<avi@scylladb.com>
May 20, 2022, 12:54:22 PM5/20/22
to Nadav Har'El, seastar-dev


On 03/05/2022 14.00, Nadav Har'El wrote:
On Sun, May 1, 2022 at 8:38 PM Avi Kivity <a...@scylladb.com> wrote:
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

We have a comment in reactor_config.hh which lists the available backends, and
the new one should be listed as well. But I think we need more than that - we need
some sort of documentation, in the source code or a separate document (but not
just commit messages) explaining the *current* state of each backend


I'll do that in a separate patch. I want to limit the scope of this patch to just io_uring, and avoid expanding it to include other backends or general guidance. In general there is no required user documentation since Seastar will select the best backend the system can support.


(especially
now that we have one that you consider incomplete and we need to remember


It's only incomplete in the sense that I want to further improve it. Otherwise it is fully functional.


what exactly is incomplete), the advantages and disadvantages of each backend,
and perhaps a general statement (would it be correct?) that the best backend for
the capabilities of the current kernel is chosen automatically, so users should not explicitly
choose the reactor backend.


It is already chosen automatically.
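
To make the automatic selection concrete, here is a rough, self-contained sketch of the ordering that reactor_backend_selector::available() produces in this patch. The `capabilities` struct and `choose_backend()` are hypothetical names for illustration only, and treating the first available entry as the default is an assumption based on the "Default: linux-aio (if available)" comment in reactor_config.hh.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical capability flags. In the patch these answers come from
// detect_aio_poll(), has_enough_aio_nr() and detect_io_uring().
struct capabilities {
    bool linux_aio = false;
    bool io_uring = false;
};

// Same ordering as the patch: linux-aio first (when usable), epoll always,
// io_uring last, so io_uring is never picked unless asked for by name.
std::vector<std::string> available_backends(const capabilities& caps) {
    std::vector<std::string> ret;
    if (caps.linux_aio) {
        ret.push_back("linux-aio");
    }
    ret.push_back("epoll");
    if (caps.io_uring) {
        ret.push_back("io_uring");
    }
    return ret;
}

// Assumption: with no explicit request, the first available backend wins.
std::string choose_backend(const capabilities& caps, const std::string& requested = "") {
    const auto avail = available_backends(caps);
    assert(!avail.empty());  // epoll is always present
    if (requested.empty()) {
        return avail.front();
    }
    for (const auto& name : avail) {
        if (name == requested) {
            return name;
        }
    }
    throw std::invalid_argument("backend not available: " + requested);
}
```

With both capabilities present, choose_backend() still returns "linux-aio"; io_uring is only returned when requested explicitly, which matches the "has to be enabled manually" statement in the commit message.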




The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Isn't the preemption notifier just a file descriptor, like the other files?
(I'm very rusty in this code, so I'm probably just misremembering).


It's a memory area that is updated when preemption happens. It's carefully arranged to match the memory area used by linux-aio to signal that completions are available.


 
The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total amount of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.
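
As an illustration of why getting an sqe can occasionally cost a flush, here is a toy model with no liburing dependency. `toy_ring` and all of its members are hypothetical; it only mimics the shape of get_sqe() in the patch (retry after flushing and reaping) and simplifies the kernel to "everything submitted completes immediately".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy stand-in for a ring whose queues are shorter than the number of
// operations the application wants in flight.
class toy_ring {
    std::size_t _capacity;
    std::vector<int> _pending;     // prepared but not yet submitted
    std::deque<int> _completions;  // finished, waiting to be reaped
    std::size_t _flushes = 0;
    std::size_t _reaped = 0;

    // Like try_get_sqe(): fails when the queue is full.
    bool try_queue(int op) {
        if (_pending.size() == _capacity) {
            return false;
        }
        _pending.push_back(op);
        return true;
    }
    // Stands in for io_uring_submit(): this is the (rare) syscall.
    void flush() {
        for (int op : _pending) {
            _completions.push_back(op);
        }
        _pending.clear();
        ++_flushes;
    }
    // Stands in for processing completion queue entries.
    void reap() {
        _reaped += _completions.size();
        _completions.clear();
    }
public:
    explicit toy_ring(std::size_t capacity) : _capacity(capacity) {
        assert(capacity > 0);
    }
    // Like get_sqe(): always succeeds, but may flush and reap on the way.
    void queue(int op) {
        while (!try_queue(op)) {
            flush();
            reap();
        }
    }
    std::size_t flushes() const { return _flushes; }
    std::size_t reaped() const { return _reaped; }
    std::size_t pending() const { return _pending.size(); }
};

// Helper: how many forced flushes does queuing `ops` operations cost?
std::size_t flushes_needed(std::size_t capacity, std::size_t ops) {
    toy_ring ring(capacity);
    for (std::size_t i = 0; i < ops; ++i) {
        ring.queue(static_cast<int>(i));
    }
    return ring.flushes();
}
```

With a capacity of 4, queuing 10 operations forces two flushes; queuing 4 or fewer forces none, which is the common case the commit message describes.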

However, in return the ergonomics are better for the user. It works
well with buffered I/O (--kernel-page-cache 1) and doesn't require
tuning some sysctls for it to work on large machines.
---

This appears to work well, apart from the cmake integration. I
appreciate any help in this area!

We require liburing 2.0 and above, and it should be optional.

I guess this means that this patch is an RFC and we shouldn't commit it yet?


New versions have cmake support and library auto-detection.

It's super annoying to issue "catch throw" in gdb and then struggle through some random internal exceptions. I'm not at all trying to optimize for performance.



Also, by not throwing exceptions during detection, you miss the *reason* why
it wasn't detected. Maybe this can be important for somebody, perhaps in some
trace logging mode or something (somebody's not getting uring as he expected,
and wondering why)?


We don't have any mechanism to communicate it. And really, you're turning an "it just works" mechanism that most users aren't aware of into a "you have to understand every detail here" thing.


The effect of Seastar not being able to use a backend is that it will fall back to another implementation with a small performance penalty. It doesn't matter what the cause is, since the fix is always to move to a newer kernel.
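
For readers following the detection-vs-construction argument, a minimal sketch of the `throw_on_error` pattern is below. It does not use liburing: `toy_ring`, `toy_ring_init()` and the `kernel_has_support` parameter are hypothetical stand-ins, chosen so the example compiles on its own. Only the shape (quiet std::nullopt during detection, exception during real construction) is taken from the patch.

```cpp
#include <cassert>
#include <cerrno>
#include <optional>
#include <system_error>

struct toy_ring {
    unsigned queue_len = 0;
};

// Hypothetical stand-in for ring setup: returns 0 on success or a negative
// errno value on failure, the same convention the patch checks for.
int toy_ring_init(unsigned queue_len, toy_ring& ring, bool kernel_has_support) {
    if (!kernel_has_support) {
        return -ENOSYS;
    }
    ring.queue_len = queue_len;
    return 0;
}

// One function serves both callers: detection passes throw_on_error=false
// and gets std::nullopt; real construction passes true and gets an exception
// that carries the error code.
std::optional<toy_ring> try_create_ring(unsigned queue_len, bool throw_on_error,
                                        bool kernel_has_support) {
    toy_ring ring;
    int err = toy_ring_init(queue_len, ring, kernel_has_support);
    if (err != 0) {
        if (throw_on_error) {
            throw std::system_error(std::error_code(-err, std::system_category()),
                                    "trying to create ring");
        }
        return std::nullopt;
    }
    return ring;
}

// Detection path: never throws, so "catch throw" in gdb stays quiet.
bool detect_ring(bool kernel_has_support) {
    return try_create_ring(1, false, kernel_has_support).has_value();
}

// Construction path, wrapped so the outcome is easy to observe.
bool creation_throws(bool kernel_has_support) {
    try {
        try_create_ring(200, true, kernel_has_support);
        return false;
    } catch (const std::system_error&) {
        return true;
    }
}
```

The trade-off discussed above is visible here: the detection path discards the reason for failure, while the construction path keeps it in the exception.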


 
+// but we do want to throw during for-real construction (to have an error
+// message. Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {

A thought: There is a large chunk of code here that is 100% specific to uring, so I think it would have been nice to have a separate source file reactor_backend_uring
instead of stuffing all the different backends into one big source file.
Of course, if this will be a mess because we'll also need to put a lot of stuff in header files, then let's just leave it one big source file as you did here.


I can do that as a follow-up. I don't think it will cause a big mess, but since it's already a single file, I'll defer the change until later.

It's a mnemonic to remind me that it's part of the C namespace, not std:: and not seastar::.
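
A small illustration of what the leading `::` buys beyond the mnemonic: inside a namespace that happens to declare the same name, the qualified form still reaches the C library function. The `demo` namespace and its functions are made up for this example and have nothing to do with Seastar.

```cpp
#include <cassert>
#include <cstddef>
#include <string.h>

namespace demo {

// A deliberately clashing name, standing in for a member or namespace-level
// function that shadows a C library function.
inline std::size_t strlen(const char*) {
    return 0;
}

// The leading :: forces lookup in the global namespace, so this always
// calls the C library's strlen().
inline std::size_t c_length(const char* s) {
    return ::strlen(s);
}

// Unqualified lookup finds demo::strlen first and stops there.
inline std::size_t shadowed_length(const char* s) {
    return strlen(s);
}

} // namespace demo
```

Here demo::c_length("ring") measures the string with the C library, while demo::shadowed_length("ring") silently calls the local function instead.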

I don't see the point. It's impossible to recover from it, and knowing that we got op::cancel here will not make anyone wiser. The code dump has to be inspected.


Coupled with it being unreachable code, I don't think it's helpful.

I can do that.

Avi Kivity

<avi@scylladb.com>
May 20, 2022, 1:49:07 PM5/20/22
to Gleb Natapov, seastar-dev@googlegroups.com
Deduplicated in v3.


>
>> + }
>> + void maybe_rearm(reactor_backend_uring& be) {
>> + if (_armed) [[likely]] {
>> + return;
>> + }
>> + auto sqe = be.get_sqe();
>> + ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
>> + ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
>> + _armed = true;
>> + be._has_pending_submissions = true;
>> + }
> Except for [[likely]] this looks identical to hrtimer_completion. Can
> they be de-duped?


Yes



Avi Kivity

<avi@scylladb.com>
May 20, 2022, 2:41:02 PM5/20/22
to seastar-dev@googlegroups.com
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Cmake improvements by: Kefu Chai <tcha...@gmail.com>

CircleCI support has io_uring disabled until the containers are
updated with the library.

Signed-off-by: Kefu Chai <tcha...@gmail.com>
Message-Id: <20220502065359.3...@gmail.com>
---

v3:
- comment about unused io_request opcodes
- more cmake fixes from Kefu
- added configure.py integration
- comment about s_queue_len choice
- implement detect_io_uring() in terms of try_create_uring()
- drop unimplemented member function await_events()
- drop unneeded pollable_fd::events_rw write (only needed for epoll backend)
- add io_uring to backend list in comment in
- better error if io_uring is selected despite not being compiled in
- deduplicate eventfd/timerfd code (for hrtimers/smp notifications)
- nest uring_pollable_fd_state inside reactor_backend_uring
- log invalid io_request opcodes before aborting
- don't ignore result of queue_pending_file_io()
- don't interpret result of io_uring_wait_cqes() as completion count
(it isn't), instead just call do_process_kernel_completions()
unconditionally.
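
To illustrate the last changelog item: io_uring_wait_cqes() returns 0 on success and a negative errno on failure, so its result says nothing about how many completions are ready. The sketch below models that with hypothetical stand-ins (`toy_cq`, `toy_wait()`, `process_completions()`) rather than liburing itself, so it is only an approximation of the control flow in wait_and_process_events().

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <deque>

// Toy completion queue. `interrupted` simulates a signal arriving
// while waiting.
struct toy_cq {
    std::deque<int> entries;
    bool interrupted = false;
};

// Stand-in for the wait call: a status code, never a completion count.
int toy_wait(const toy_cq& cq) {
    return cq.interrupted ? -EINTR : 0;
}

// Drains whatever is in the queue and reports how many entries it handled.
std::size_t process_completions(toy_cq& cq) {
    std::size_t n = cq.entries.size();
    cq.entries.clear();
    return n;
}

// Returns true if any completion was processed.
bool wait_and_process(toy_cq& cq) {
    int r = toy_wait(cq);
    if (r < 0) {
        if (r == -EINTR) {
            return false;  // interrupted: go around the reactor loop again
        }
        std::abort();      // anything else is unexpected
    }
    // r == 0 only means "the wait succeeded", so reap unconditionally
    // instead of looping r times.
    return process_completions(cq) > 0;
}
```

Treating the return value as a count would process nothing on the success path, since success is reported as 0.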


configure.py | 6 +
include/seastar/core/reactor.hh | 1 +
include/seastar/core/reactor_config.hh | 1 +
src/core/reactor_backend.hh | 2 +
src/core/reactor_backend.cc | 391 +++++++++++++++++++++++++
CMakeLists.txt | 15 +
cmake/FindLibUring.cmake | 74 +++++
cmake/SeastarDependencies.cmake | 2 +
cooking_recipe.cmake | 10 +
install-dependencies.sh | 2 +
pkgconfig/seastar.pc.in | 8 +-
11 files changed, 509 insertions(+), 3 deletions(-)
create mode 100644 cmake/FindLibUring.cmake

diff --git a/configure.py b/configure.py
index 1eb4d6bda2..f1783da9cd 100755
--- a/configure.py
+++ b/configure.py
@@ -108,10 +108,15 @@ add_tristate(
add_tristate(
arg_parser,
name = 'debug-shared-ptr',
dest = "debug_shared_ptr",
help = 'Debug shared_ptr')
+add_tristate(
+ arg_parser,
+ name='io_uring',
+ dest='io_uring',
+ help='Support io_uring via liburing')
arg_parser.add_argument('--allocator-page-size', dest='alloc_page_size', type=int, help='override allocator page size')
arg_parser.add_argument('--without-tests', dest='exclude_tests', action='store_true', help='Do not build tests by default')
arg_parser.add_argument('--without-apps', dest='exclude_apps', action='store_true', help='Do not build applications by default')
arg_parser.add_argument('--without-demos', dest='exclude_demos', action='store_true', help='Do not build demonstrations by default')
arg_parser.add_argument('--split-dwarf', dest='split_dwarf', action='store_true', default=False,
@@ -193,10 +198,11 @@ def configure_mode(mode):
tr(CFLAGS, 'CXX_FLAGS'),
tr(LDFLAGS, 'LD_FLAGS'),
tr(args.dpdk, 'DPDK'),
tr(infer_dpdk_machine(args.user_cflags), 'DPDK_MACHINE'),
tr(args.hwloc, 'HWLOC', value_when_none='yes'),
+ tr(args.io_uring, 'IO_URING', value_when_none='yes'),
tr(args.alloc_failure_injection, 'ALLOC_FAILURE_INJECTION', value_when_none='DEFAULT'),
tr(args.task_backtrace, 'TASK_BACKTRACE'),
tr(args.alloc_page_size, 'ALLOC_PAGE_SIZE'),
tr(args.split_dwarf, 'SPLIT_DWARF'),
tr(args.heap_profiling, 'HEAP_PROFILING'),
diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index c51e9e2bbc..a5bdf192ce 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -208,10 +208,11 @@ class reactor {
friend class preempt_io_context;
friend struct hrtimer_aio_completion;
friend struct task_quota_aio_completion;
friend class reactor_backend_epoll;
friend class reactor_backend_aio;
+ friend class reactor_backend_uring;
friend class reactor_backend_selector;
friend struct reactor_options;
friend class aio_storage_context;
friend size_t scheduling_group_count();
public:
diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh
index 1080ccb133..4098a0871e 100644
--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -127,10 +127,11 @@ struct reactor_options : public program_options::option_group {
/// \brief Internal reactor implementation.
///
/// Available backends:
/// * \p linux-aio
/// * \p epoll
+ /// * \p io_uring
///
/// Default: \p linux-aio (if available).
program_options::selection_value<reactor_backend_selector> reactor_backend;
/// \brief Use Linux aio for fsync() calls.
///
diff --git a/src/core/reactor_backend.hh b/src/core/reactor_backend.hh
index 3edbf771c1..a2bec88d9b 100644
--- a/src/core/reactor_backend.hh
+++ b/src/core/reactor_backend.hh
@@ -361,10 +361,12 @@ class reactor_backend_osv : public reactor_backend {
virtual pollable_fd_state_ptr
make_pollable_fd_state(file_desc fd, pollable_fd::speculation speculate) override;
};
#endif /* HAVE_OSV */

+class reactor_backend_uring;
+
class reactor_backend_selector {
std::string _name;
private:
static bool has_enough_aio_nr();
explicit reactor_backend_selector(std::string name) : _name(std::move(name)) {}
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 9d63e1620e..e5f8a7ad17 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,14 @@
#include <seastar/util/read_first_line.hh>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>

+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif

namespace seastar {
@@ -1112,10 +1116,385 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
std::cerr << "reactor_backend_osv does not support file descriptors - make_pollable_fd_state() shouldn't have been called!\n";
abort();
}
#endif

+#ifdef SEASTAR_HAVE_URING
+
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
+ auto required_features =
+ IORING_FEAT_SUBMIT_STABLE
+ | IORING_FEAT_NODROP;
+ auto required_ops = {
+ IORING_OP_POLL_ADD,
+ IORING_OP_READ,
+ IORING_OP_WRITE,
+ IORING_OP_READV,
+ IORING_OP_WRITEV,
+ IORING_OP_FSYNC,
+ };
+ auto maybe_throw = [&] (auto exception) {
+ if (throw_on_error) {
+ throw exception;
+ }
+ };
+
+ auto params = ::io_uring_params{
+ .flags = 0,
+ };
+ ::io_uring ring;
+ auto err = ::io_uring_queue_init_params(queue_len, &ring, &params);
+ if (err != 0) {
+ maybe_throw(std::system_error(std::error_code(-err, std::system_category()), "trying to create io_uring"));
+ return std::nullopt;
+ }
+ auto free_ring = defer([&] () noexcept { ::io_uring_queue_exit(&ring); });
+ ::io_uring_ring_dontfork(&ring);
+ if (~ring.features & required_features) {
+ maybe_throw(std::runtime_error(fmt::format("missing required io_ring features, required 0x{:x} available 0x{:x}", required_features, ring.features)));
+ return std::nullopt;
+ }
+
+ auto probe = ::io_uring_get_probe_ring(&ring);
+ if (!probe) {
+ maybe_throw(std::runtime_error("unable to create io_uring probe"));
+ return std::nullopt;
+ }
+ auto free_probe = defer([&] () noexcept { ::free(probe); });
+
+ for (auto op : required_ops) {
+ if (!io_uring_opcode_supported(probe, op)) {
+ maybe_throw(std::runtime_error(fmt::format("required io_uring opcode {} not supported", op)));
+ return std::nullopt;
+ }
+ }
+
+ free_ring.cancel();
+
+ return ring;
+}
+
+static
+bool
+detect_io_uring() {
+ auto ring_opt = try_create_uring(1, false);
+ if (ring_opt) {
+ ::io_uring_queue_exit(&ring_opt.value());
+ }
+ return bool(ring_opt);
+}
+
+class reactor_backend_uring final : public reactor_backend {
+ // s_queue_len is more or less arbitrary. Too low and we'll be
+ // issuing too small batches, too high and we require too much locked
+ // memory, but otherwise it doesn't matter.
+ static constexpr unsigned s_queue_len = 200;
+ reactor& _r;
+ ::io_uring _uring;
+ bool _did_work_while_getting_sqe = false;
+ bool _has_pending_submissions = false;
+ file_desc _hrtimer_timerfd;
+ preempt_io_context _preempt_io_context;
+
+ class uring_pollable_fd_state : public pollable_fd_state {
+ pollable_fd_state_completion _completion_pollin;
+ pollable_fd_state_completion _completion_pollout;
+ public:
+ explicit uring_pollable_fd_state(file_desc desc, speculation speculate)
+ : pollable_fd_state(std::move(desc), std::move(speculate)) {
+ }
+ pollable_fd_state_completion* get_desc(int events) {
+ if (events & POLLIN) {
+ return &_completion_pollin;
+ } else {
+ return &_completion_pollout;
+ }
+ }
+ future<> get_completion_future(int events) {
+ return get_desc(events)->get_future();
+ }
+ };
+
+ // eventfd and timerfd both need an 8-byte read after completion
+ class recurring_eventfd_or_timerfd_completion : public fd_kernel_completion {
+ bool _armed = false;
+ public:
+ explicit recurring_eventfd_or_timerfd_completion(file_desc& fd) : fd_kernel_completion(fd) {}
+ virtual void complete_with(ssize_t res) override {
+ char garbage[8];
+ auto ret = _fd.read(garbage, 8);
+ // Note: for hrtimer_completion we can have spurious wakeups,
+ // since we wait for this using both _preempt_io_context and the
+ // ring. So don't assert that we read anything.
+ assert(!ret || *ret == 8);
+ _armed = false;
+ }
+ void maybe_rearm(reactor_backend_uring& be) {
+ if (_armed) {
+ return;
+ }
+ auto sqe = be.get_sqe();
+ ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
+ ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+ _armed = true;
+ be._has_pending_submissions = true;
+ }
+ };
+
+ // Completion for high resolution timerfd, used in wait_and_process_events()
+ // (while running tasks it's waited for in _preempt_io_context)
+ class hrtimer_completion : public recurring_eventfd_or_timerfd_completion {
+ reactor& _r;
+ public:
+ explicit hrtimer_completion(reactor& r, file_desc& timerfd)
+ : recurring_eventfd_or_timerfd_completion(timerfd), _r(r) {
+ }
+ virtual void complete_with(ssize_t res) override {
+ recurring_eventfd_or_timerfd_completion::complete_with(res);
+ _r.service_highres_timer();
+ }
+ };
+
+ using smp_wakeup_completion = recurring_eventfd_or_timerfd_completion;
+
+ hrtimer_completion _hrtimer_completion;
+ smp_wakeup_completion _smp_wakeup_completion;
+private:
+ static file_desc make_timerfd() {
+ return file_desc::timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC|TFD_NONBLOCK);
+ }
+
+ // Can fail if the completion queue is full
+ ::io_uring_sqe* try_get_sqe() {
+ return ::io_uring_get_sqe(&_uring);
+ }
+
+ bool do_flush_submission_ring() {
+ if (_has_pending_submissions) {
+ _has_pending_submissions = false;
+ _did_work_while_getting_sqe = false;
+ io_uring_submit(&_uring);
+ return true;
+ } else {
+ return std::exchange(_did_work_while_getting_sqe, false);
+ }
+ }
+
+ ::io_uring_sqe* get_sqe() {
+ ::io_uring_sqe* sqe;
+ while ((sqe = try_get_sqe()) == nullptr) [[unlikely]] {
+ do_flush_submission_ring();
+ do_process_kernel_completions_step();
+ _did_work_while_getting_sqe = true;
+ }
+ return sqe;
+ }
+ future<> poll(pollable_fd_state& fd, int events) {
+ if (events & fd.events_known) {
+ fd.events_known &= ~events;
+ return make_ready_future<>();
+ }
+ auto sqe = get_sqe();
+ // The reactor does not generate these types of I/O requests yet, so
+ // this path is unreachable. As more features of io_uring are exploited,
+ // we'll utilize more of these opcodes.
+ seastar_logger.error("Invalid operation for iocb: {}", req.opname());
+ abort();
+ }
+ ::io_uring_sqe_set_data(sqe, completion);
+
+ _has_pending_submissions = true;
+ }
+
+ // Returns true if any work was done
+ ~reactor_backend_uring() {
+ ::io_uring_queue_exit(&_uring);
+ }
+ virtual bool reap_kernel_completions() override {
+ return do_process_kernel_completions();
+ }
+ virtual bool kernel_submit_work() override {
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ did_work |= queue_pending_file_io();
+ did_work |= ::io_uring_submit(&_uring);
+ return did_work;
+ }
+ virtual bool kernel_events_can_sleep() const override {
+ // We never need to spin while I/O is in flight.
+ return true;
+ }
+ virtual void wait_and_process_events(const sigset_t* active_sigmask) override {
+ _smp_wakeup_completion.maybe_rearm(*this);
+ _hrtimer_completion.maybe_rearm(*this);
+ ::io_uring_submit(&_uring);
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ did_work |= std::exchange(_did_work_while_getting_sqe, false);
+ if (did_work) {
+ return;
+ }
+ struct ::io_uring_cqe* buf[s_queue_len];
+ sigset_t sigs = *active_sigmask; // io_uring_wait_cqes() wants non-const
+ auto r = ::io_uring_wait_cqes(&_uring, buf, 1, nullptr, &sigs);
+ if (r < 0) [[unlikely]] {
+ switch (-r) {
+ case EINTR:
+ return;
+ default:
+ abort();
+ }
+ }
+ did_work |= do_process_kernel_completions();
+ virtual pollable_fd_state_ptr make_pollable_fd_state(file_desc fd, pollable_fd::speculation speculate) override {
+ return pollable_fd_state_ptr(new uring_pollable_fd_state(std::move(fd), std::move(speculate)));
+ }
+};
+
+#endif
+
static bool detect_aio_poll() {
auto fd = file_desc::eventfd(0, 0);
aio_context_t ioc{};
setup_aio_context(1, &ioc);
auto cleanup = defer([&] () noexcept { io_destroy(ioc); });
@@ -1151,10 +1530,17 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}
return true;
}

std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+ if (_name == "io_uring") {
+#ifdef SEASTAR_HAVE_URING
+ return std::make_unique<reactor_backend_uring>(r);
+#else
+ throw std::runtime_error("io_uring backend not compiled in");
+#endif
+ }
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
return std::make_unique<reactor_backend_epoll>(r);
}
@@ -1169,9 +1555,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
std::vector<reactor_backend_selector> ret;
if (has_enough_aio_nr() && detect_aio_poll()) {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}

}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 0fe2bc6fd6..3bc5fc5e15 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -226,10 +226,14 @@ option (Seastar_EXECUTE_ONLY_FAST_TESTS

option (Seastar_HWLOC
"Enable hwloc support."
ON)

+option (Seastar_IO_URING
+ "Enable io_uring support."
+ ON)
+
set (Seastar_JENKINS
""
CACHE
STRING
"If non-empty, the prefix for XML files containing the results of running tests (for Jenkins).")
@@ -923,10 +927,20 @@ if (Seastar_HWLOC)

target_link_libraries (seastar
PRIVATE hwloc::hwloc)
endif ()

+if (Seastar_IO_URING)
+ if (NOT LibUring_FOUND)
+ message (FATAL_ERROR "`io_uring` supported is enabled but liburing is not available!")
+ endif ()
+
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_URING)
+ target_link_libraries (seastar
+ PRIVATE URING::uring)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
PRIVATE ${Seastar_LD_FLAGS})
endif ()
@@ -1249,10 +1263,11 @@ if (Seastar_INSTALL)
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findnumactl.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findragel.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findrt.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findyaml-cpp.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/SeastarDependencies.cmake
+ ${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindUring.cmake
DESTINATION ${install_cmakedir})

install (
DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/cmake/code_tests
DESTINATION ${install_cmakedir})
diff --git a/cmake/FindLibUring.cmake b/cmake/FindLibUring.cmake
new file mode 100644
index 0000000000..54ff51db2a
--- /dev/null
+++ b/cmake/FindLibUring.cmake
@@ -0,0 +1,74 @@
+if (URING_INCLUDE_DIR)
+ include (CheckStructHasMember)
+ include (CMakePushCheckState)
+ cmake_push_check_state (RESET)
+ list(APPEND CMAKE_REQUIRED_INCLUDES ${URING_INCLUDE_DIR})
+ CHECK_STRUCT_HAS_MEMBER ("struct io_uring" features liburing.h
+ HAVE_IOURING_FEATURES LANGUAGE CXX)
+ cmake_pop_check_state ()
+endif ()
+
+mark_as_advanced (
+ URING_LIBRARY
+ URING_INCLUDE_DIR
+ HAVE_IOURING_FEATURES)
+
+
+include (FindPackageHandleStandardArgs)
+
+find_package_handle_standard_args (LibUring
+ REQUIRED_VARS
+ URING_LIBRARY
+ URING_INCLUDE_DIR
+ HAVE_IOURING_FEATURES
+ VERSION_VAR URING_PC_VERSION)
+
+set (URING_LIBRARIES ${URING_LIBRARY})
+set (URING_INCLUDE_DIRS ${URING_INCLUDE_DIR})
+
+if (URING_FOUND AND NOT (TARGET URING::uring))
+ add_library (URING::uring UNKNOWN IMPORTED)
+
+ set_target_properties (URING::uring
+ PROPERTIES
+ IMPORTED_LOCATION ${URING_LIBRARY}
+ INTERFACE_INCLUDE_DIRECTORIES ${URING_INCLUDE_DIRS})
+endif ()
diff --git a/cmake/SeastarDependencies.cmake b/cmake/SeastarDependencies.cmake
index 2c267e459c..3acaf0788d 100644
--- a/cmake/SeastarDependencies.cmake
+++ b/cmake/SeastarDependencies.cmake
@@ -56,10 +56,11 @@ macro (seastar_find_dependencies)
fmt
lz4
# Private and private/public dependencies.
Concepts
GnuTLS
+ LibUring
LinuxMembarrier
Sanitizers
SourceLocation
StdAtomic
hwloc
@@ -86,10 +87,11 @@ macro (seastar_find_dependencies)
set (_seastar_dep_args_c-ares 1.13 REQUIRED)
set (_seastar_dep_args_cryptopp 5.6.5 REQUIRED)
set (_seastar_dep_args_fmt 5.0.0 REQUIRED)
set (_seastar_dep_args_lz4 1.7.3 REQUIRED)
set (_seastar_dep_args_GnuTLS 3.3.26 REQUIRED)
+ set (_seastar_dep_args_LibUring 2.0)
set (_seastar_dep_args_StdAtomic REQUIRED)
set (_seastar_dep_args_hwloc 1.11.2)
set (_seastar_dep_args_lksctp-tools REQUIRED)
set (_seastar_dep_args_rt REQUIRED)
set (_seastar_dep_args_yaml-cpp 0.5.1 REQUIRED)
diff --git a/cooking_recipe.cmake b/cooking_recipe.cmake
index 4e639679cd..bbf054e0bc 100644
--- a/cooking_recipe.cmake
+++ b/cooking_recipe.cmake
@@ -284,10 +284,20 @@ cooking_ingredient (fmt
URL_MD5 eaf6e3c1b2f4695b9a612cedf17b509d
CMAKE_ARGS
-DFMT_DOC=OFF
-DFMT_TEST=OFF)

+cooking_ingredient (liburing
+ EXTERNAL_PROJECT_ARGS
+ URL https://github.com/axboe/liburing/archive/liburing-2.1.tar.gz
+ URL_MD5 78f13d9861b334b9a9ca0d12cf2a6d3c
+ CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
+ BUILD_COMMAND <DISABLE>
+ BUILD_BYPRODUCTS "<SOURCE_DIR>/src/liburing.a"
+ BUILD_IN_SOURCE ON
+ INSTALL_COMMAND ${make_command} -C src -s install)
+
cooking_ingredient (lz4
EXTERNAL_PROJECT_ARGS
URL https://github.com/lz4/lz4/archive/v1.8.0.tar.gz
URL_MD5 6247bf0e955899969d1600ff34baed6b
# This is upsetting.
diff --git a/install-dependencies.sh b/install-dependencies.sh
index e9f1f51f28..e0f0c43188 100755
diff --git a/pkgconfig/seastar.pc.in b/pkgconfig/seastar.pc.in
index 9cdf41d787..edfa20cbfc 100644
--- a/pkgconfig/seastar.pc.in
+++ b/pkgconfig/seastar.pc.in
@@ -26,18 +26,20 @@ cryptopp_cflags=-I$<JOIN:@cryptopp_INCLUDE_DIRS@, -I>
cryptopp_libs=$<JOIN:@cryptopp_LIBRARIES@, >
fmt_cflags=-I$<JOIN:$<TARGET_PROPERTY:fmt::fmt,INTERFACE_INCLUDE_DIRECTORIES>, -I>
fmt_libs=$<TARGET_LINKER_FILE:fmt::fmt>
lksctp_tools_cflags=-I$<JOIN:@lksctp-tools_INCLUDE_DIRS@, -I>
lksctp_tools_libs=$<JOIN:@lksctp-tools_LIBRARIES@, >
+liburing_cflags=$<$<BOOL:@Seastar_IO_URING@>:-I$<JOIN:$<TARGET_PROPERTY:URING::uring,INTERFACE_INCLUDE_DIRECTORIES>, -I>>
+liburing_libs=$<$<BOOL:@Seastar_IO_URING@>:$<TARGET_LINKER_FILE:URING::uring>>
numactl_cflags=-I$<JOIN:@numactl_INCLUDE_DIRS@, -I>
numactl_libs=$<JOIN:@numactl_LIBRARIES@, >

# Us.
seastar_cflags=${seastar_include_flags} $<JOIN:$<FILTER:$<TARGET_PROPERTY:seastar,INTERFACE_COMPILE_OPTIONS>,EXCLUDE,-Wno-error=#warnings>, > -D$<JOIN:$<TARGET_PROPERTY:seastar,INTERFACE_COMPILE_DEFINITIONS>, -D>
seastar_libs=${libdir}/$<TARGET_FILE_NAME:seastar> @Seastar_SPLIT_DWARF_FLAG@ $<JOIN:@Seastar_Sanitizers_OPTIONS@, >

Requires: liblz4 >= 1.7.3
-Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1
+Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, $<$<BOOL:@Seastar_IO_URING@>:liburing $<ANGLE-R>= 2.0, >yaml-cpp >= 0.5.1
Conflicts:
-Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
+Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${liburing_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
Libs: ${seastar_libs} ${boost_program_options_libs} ${boost_thread_libs} ${c_ares_libs} ${cryptopp_libs} ${fmt_libs}
-Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${numactl_libs} ${stdatomic_libs}
+Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${liburing_libs} ${numactl_libs} ${stdatomic_libs}
--
2.36.1

Avi Kivity

<avi@scylladb.com>
Jun 3, 2022, 4:46:01 AM
to seastar-dev@googlegroups.com
Review ping

Piotr Sarna

<sarna@scylladb.com>
Jun 9, 2022, 11:12:57 AM
to Avi Kivity, seastar-dev@googlegroups.com, Gleb Natapov, Nadav Har'El, Pavel Emelyanov
Well, my minor comments from v2 were addressed, looks good to me,
especially that the support has to be explicitly enabled. /cc-ing a fine
selection of reviewers from v2 and Seastar maintainers in general - does
anyone object to queuing this one?

Pavel Emelyanov

<xemul@scylladb.com>
Jun 10, 2022, 5:37:07 AM
to Piotr Sarna, Avi Kivity, seastar-dev@googlegroups.com, Gleb Natapov, Nadav Har'El
It fails compilation with clang-11 -std=c++17, but I don't know if
we're going to support this mode or not :)

[2022-06-10T08:47:52.428Z] /usr/bin/clang++ -DFMT_LOCALE -DFMT_SHARED -DSEASTAR_API_LEVEL=6 -DSEASTAR_DEFERRED_ACTION_REQUIRE_NOEXCEPT -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_HAS_MEMBARRIER -DSEASTAR_HAVE_ASAN_FIBER_SUPPORT -DSEASTAR_HAVE_HWLOC -DSEASTAR_HAVE_LZ4_COMPRESS_DEFAULT -DSEASTAR_HAVE_NUMA -DSEASTAR_HAVE_URING -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_THREAD_STACK_GUARDS -DSEASTAR_TYPE_ERASE_MORE -I/home/src/include -I/home/src/build/dev/gen/include -I/home/src/src -I/home/src/build/dev/gen/src -O2 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fvisibility=hidden -UNDEBUG -Wall -Werror -Wno-array-bounds -Wno-error=deprecated-declarations -gz -std=gnu++17 -MD -MT CMakeFiles/seastar.dir/src/core/reactor_backend.cc.o -MF CMakeFiles/seastar.dir/src/core/reactor_backend.cc.o.d -o CMakeFiles/seastar.dir/src/core/reactor_backend.cc.o -c /home/src/src/core/reactor_backend.cc
[2022-06-10T08:47:52.428Z] /home/src/src/core/reactor_backend.cc:1286:52: error: unknown attribute 'unlikely' ignored [-Werror,-Wunknown-attributes]
[2022-06-10T08:47:52.428Z] while ((sqe = try_get_sqe()) == nullptr) [[unlikely]] {
[2022-06-10T08:47:52.428Z] ^
[2022-06-10T08:47:52.428Z] /home/src/src/core/reactor_backend.cc:1420:22: error: unknown attribute 'unlikely' ignored [-Werror,-Wunknown-attributes]
[2022-06-10T08:47:52.428Z] if (r < 0) [[unlikely]] {
[2022-06-10T08:47:52.428Z] ^
[2022-06-10T08:47:52.428Z] 2 errors generated.

Piotr Sarna

<sarna@scylladb.com>
Jun 10, 2022, 6:48:58 AM
to Pavel Emelyanov, Avi Kivity, seastar-dev@googlegroups.com, Gleb Natapov, Nadav Har'El
We do support c++17, I think at least until 2023. And since keeping this
code compatible is trivial (__builtin_expect), let's make it so.
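
For reference, a minimal sketch of the substitution suggested here. The macro and function names are made up for illustration and do not come from the patch, and it assumes GCC or Clang, where __builtin_expect is available in C++17 mode.

```cpp
// Hypothetical helper macro (not part of the patch); assumes GCC or Clang.
// Tells the compiler the condition is expected to be false, which is what
// the C++20 [[unlikely]] attribute was being used for.
#define EXAMPLE_UNLIKELY(cond) __builtin_expect(static_cast<bool>(cond), false)

// Returns r unchanged on success, or -1 after taking the rare error path.
int clamp_error(int r) {
    if (EXAMPLE_UNLIKELY(r < 0)) {
        return -1; // cold path: the compiler may move this out of line
    }
    return r;
}
```

The hint only affects code layout and branch prediction heuristics; the behaviour is the same as a plain `if (r < 0)`.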

Avi Kivity

<avi@scylladb.com>
Jun 10, 2022, 8:45:42 AM
to seastar-dev@googlegroups.com
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Cmake improvements by: Kefu Chai <tcha...@gmail.com>

CircleCI support has io_uring disabled until the containers are
updated with the library.

Signed-off-by: Kefu Chai <tcha...@gmail.com>
Message-Id: <20220502065359.3...@gmail.com>
---

v4:
- replace [[unlikely]] with __builtin_expect for C++17 support
- add protection against spurious wakeups for the task quota timer
fd (reactor_backend_uring constructor), similar to reactor_backend_aio.
Without it, the reactor could sometimes hang.
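
The second item can be sketched in isolation as follows. This is a standalone illustration rather than the constructor code: the function name is hypothetical, and it uses a freshly created, unarmed timerfd instead of the reactor's task quota timer. With O_NONBLOCK set, a read() on a timer that has not expired fails with EAGAIN instead of blocking, which is the property the protection relies on.

```cpp
#include <sys/timerfd.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdint>

// Hypothetical helper for illustration only. Returns true if reading an
// unexpired timerfd in non-blocking mode fails with EAGAIN rather than blocking.
bool unexpired_timer_read_is_nonblocking() {
    int tfd = ::timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if (tfd < 0) {
        return false;
    }
    // Same idiom as in the patch: add O_NONBLOCK to the existing flags.
    ::fcntl(tfd, F_SETFL, ::fcntl(tfd, F_GETFL) | O_NONBLOCK);

    // The timer was never armed, so this models a spurious wakeup:
    // we were told to read, but there is no expiration to consume.
    uint64_t expirations = 0;
    ssize_t n = ::read(tfd, &expirations, sizeof(expirations));
    bool would_block = (n == -1 && errno == EAGAIN);

    ::close(tfd);
    return would_block;
}
```

Without the O_NONBLOCK flag the same read() would sleep until the timer fires, which for an unarmed timer is never.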

configure.py | 6 +
include/seastar/core/reactor.hh | 1 +
include/seastar/core/reactor_config.hh | 1 +
src/core/reactor_backend.hh | 2 +
src/core/reactor_backend.cc | 395 +++++++++++++++++++++++++
CMakeLists.txt | 15 +
cmake/FindLibUring.cmake | 74 +++++
cmake/SeastarDependencies.cmake | 2 +
cooking_recipe.cmake | 10 +
install-dependencies.sh | 2 +
pkgconfig/seastar.pc.in | 8 +-
11 files changed, 513 insertions(+), 3 deletions(-)
create mode 100644 cmake/FindLibUring.cmake

diff --git a/configure.py b/configure.py
index 1eb4d6bda..f1783da9c 100755
index c51e9e2bb..a5bdf192c 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -208,10 +208,11 @@ class reactor {
friend class preempt_io_context;
friend struct hrtimer_aio_completion;
friend struct task_quota_aio_completion;
friend class reactor_backend_epoll;
friend class reactor_backend_aio;
+ friend class reactor_backend_uring;
friend class reactor_backend_selector;
friend struct reactor_options;
friend class aio_storage_context;
friend size_t scheduling_group_count();
public:
diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh
index 1080ccb13..4098a0871 100644
--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -127,10 +127,11 @@ struct reactor_options : public program_options::option_group {
/// \brief Internal reactor implementation.
///
/// Available backends:
/// * \p linux-aio
/// * \p epoll
+ /// * \p io_uring
///
/// Default: \p linux-aio (if available).
program_options::selection_value<reactor_backend_selector> reactor_backend;
/// \brief Use Linux aio for fsync() calls.
///
diff --git a/src/core/reactor_backend.hh b/src/core/reactor_backend.hh
index 3edbf771c..a2bec88d9 100644
--- a/src/core/reactor_backend.hh
+++ b/src/core/reactor_backend.hh
@@ -361,10 +361,12 @@ class reactor_backend_osv : public reactor_backend {
virtual pollable_fd_state_ptr
make_pollable_fd_state(file_desc fd, pollable_fd::speculation speculate) override;
};
#endif /* HAVE_OSV */

+class reactor_backend_uring;
+
class reactor_backend_selector {
std::string _name;
private:
static bool has_enough_aio_nr();
explicit reactor_backend_selector(std::string name) : _name(std::move(name)) {}
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 9d63e1620..f27506b19 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,14 @@
#include <seastar/util/read_first_line.hh>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>

+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif

namespace seastar {
@@ -1112,10 +1116,389 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
+ while (__builtin_expect((sqe = try_get_sqe()) == nullptr, false)) {
+ // Protect against spurious wakeups - if we get notified that the timer has
+ // expired when it really hasn't, we don't want to block in read(tfd, ...).
+ auto tfd = _r._task_quota_timer.get();
+ ::fcntl(tfd, F_SETFL, ::fcntl(tfd, F_GETFL) | O_NONBLOCK);
+ if (__builtin_expect(r < 0, false)) {
@@ -1151,10 +1534,17 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}
return true;
}

std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+ if (_name == "io_uring") {
+#ifdef SEASTAR_HAVE_URING
+ return std::make_unique<reactor_backend_uring>(r);
+#else
+ throw std::runtime_error("io_uring backend not compiled in");
+#endif
+ }
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
return std::make_unique<reactor_backend_epoll>(r);
}
@@ -1169,9 +1559,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
std::vector<reactor_backend_selector> ret;
if (has_enough_aio_nr() && detect_aio_poll()) {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}

}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7638ad149..e87ee8ffd 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -228,10 +228,14 @@ option (Seastar_EXECUTE_ONLY_FAST_TESTS

option (Seastar_HWLOC
"Enable hwloc support."
ON)

+option (Seastar_IO_URING
+ "Enable io_uring support."
+ ON)
+
set (Seastar_JENKINS
""
CACHE
STRING
"If non-empty, the prefix for XML files containing the results of running tests (for Jenkins).")
@@ -925,10 +929,20 @@ if (Seastar_HWLOC)

target_link_libraries (seastar
PRIVATE hwloc::hwloc)
endif ()

+if (Seastar_IO_URING)
+ if (NOT LibUring_FOUND)
+ message (FATAL_ERROR "`io_uring` supported is enabled but liburing is not available!")
+ endif ()
+
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_URING)
+ target_link_libraries (seastar
+ PRIVATE URING::uring)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
PRIVATE ${Seastar_LD_FLAGS})
endif ()
@@ -1251,10 +1265,11 @@ if (Seastar_INSTALL)
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findnumactl.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findragel.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findrt.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findyaml-cpp.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/SeastarDependencies.cmake
+ ${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindUring.cmake
DESTINATION ${install_cmakedir})

install (
DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/cmake/code_tests
DESTINATION ${install_cmakedir})
diff --git a/cmake/FindLibUring.cmake b/cmake/FindLibUring.cmake
new file mode 100644
index 000000000..54ff51db2
index 2c267e459..3acaf0788 100644
index 4e639679c..bbf054e0b 100644
--- a/cooking_recipe.cmake
+++ b/cooking_recipe.cmake
@@ -284,10 +284,20 @@ cooking_ingredient (fmt
URL_MD5 eaf6e3c1b2f4695b9a612cedf17b509d
CMAKE_ARGS
-DFMT_DOC=OFF
-DFMT_TEST=OFF)

+cooking_ingredient (liburing
+ EXTERNAL_PROJECT_ARGS
+ URL https://github.com/axboe/liburing/archive/liburing-2.1.tar.gz
+ URL_MD5 78f13d9861b334b9a9ca0d12cf2a6d3c
+ CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
+ BUILD_COMMAND <DISABLE>
+ BUILD_BYPRODUCTS "<SOURCE_DIR>/src/liburing.a"
+ BUILD_IN_SOURCE ON
+ INSTALL_COMMAND ${make_command} -C src -s install)
+
cooking_ingredient (lz4
EXTERNAL_PROJECT_ARGS
URL https://github.com/lz4/lz4/archive/v1.8.0.tar.gz
URL_MD5 6247bf0e955899969d1600ff34baed6b
# This is upsetting.
diff --git a/install-dependencies.sh b/install-dependencies.sh
index e9f1f51f2..e0f0c4318 100755
index 9cdf41d78..edfa20cbf 100644
--- a/pkgconfig/seastar.pc.in
+++ b/pkgconfig/seastar.pc.in
@@ -26,18 +26,20 @@ cryptopp_cflags=-I$<JOIN:@cryptopp_INCLUDE_DIRS@, -I>
cryptopp_libs=$<JOIN:@cryptopp_LIBRARIES@, >
fmt_cflags=-I$<JOIN:$<TARGET_PROPERTY:fmt::fmt,INTERFACE_INCLUDE_DIRECTORIES>, -I>
fmt_libs=$<TARGET_LINKER_FILE:fmt::fmt>
lksctp_tools_cflags=-I$<JOIN:@lksctp-tools_INCLUDE_DIRS@, -I>
lksctp_tools_libs=$<JOIN:@lksctp-tools_LIBRARIES@, >
+liburing_cflags=$<$<BOOL:@Seastar_IO_URING@>:-I$<JOIN:$<TARGET_PROPERTY:URING::uring,INTERFACE_INCLUDE_DIRECTORIES>, -I>>
+liburing_libs=$<$<BOOL:@Seastar_IO_URING@>:$<TARGET_LINKER_FILE:URING::uring>>
numactl_cflags=-I$<JOIN:@numactl_INCLUDE_DIRS@, -I>
numactl_libs=$<JOIN:@numactl_LIBRARIES@, >

# Us.
seastar_cflags=${seastar_include_flags} $<JOIN:$<FILTER:$<TARGET_PROPERTY:seastar,INTERFACE_COMPILE_OPTIONS>,EXCLUDE,-Wno-error=#warnings>, > -D$<JOIN:$<TARGET_PROPERTY:seastar,INTERFACE_COMPILE_DEFINITIONS>, -D>
seastar_libs=${libdir}/$<TARGET_FILE_NAME:seastar> @Seastar_SPLIT_DWARF_FLAG@ $<JOIN:@Seastar_Sanitizers_OPTIONS@, >

Requires: liblz4 >= 1.7.3
-Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1
+Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, $<$<BOOL:@Seastar_IO_URING@>:liburing $<ANGLE-R>= 2.0, >yaml-cpp >= 0.5.1
Conflicts:
-Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
+Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${liburing_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
Libs: ${seastar_libs} ${boost_program_options_libs} ${boost_thread_libs} ${c_ares_libs} ${cryptopp_libs} ${fmt_libs}
-Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${numactl_libs} ${stdatomic_libs}
+Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${liburing_libs} ${numactl_libs} ${stdatomic_libs}
--
2.36.1

Nadav Har'El

<nyh@scylladb.com>
Jun 12, 2022, 10:43:04 AM
to Avi Kivity, seastar-dev
When I run the new cmake on my Fedora 34 installation, I get a cmake failure:

-- Could NOT find LibUring (missing: HAVE_IOURING_FEATURES) (Required is at least version "2.0")
CMake Error at CMakeLists.txt:930 (message):

  `io_uring` supported is enabled but liburing is not available!

Apparently I have liburing-devel version 0.7 installed...

The question is what to do now. I know Fedora 34 isn't super-up-to-date (I really do need to update), but Seastar has been working absolutely fine with it until this patch.
Does the new code really need liburing version 2.0?
And if liburing is missing, can't we just build without it? Isn't it an optional feature?

--
Nadav Har'El
n...@scylladb.com



Avi Kivity

<avi@scylladb.com>
Jun 12, 2022, 10:55:24 AM
to Nadav Har'El, seastar-dev

For older distributions, you can specify


   cmake -DSeastar_IO_URING=OFF


We decided not to auto-enable it to have a more reproducible build.
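
With the option off, SEASTAR_HAVE_URING is not defined, and per the patch a run-time request for the io_uring backend then throws instead of silently falling back. A simplified standalone sketch of that selection logic follows. The backend types are stand-ins invented for illustration (the real classes take a reactor reference), and the exception thrown for an unknown name is my own choice, not taken from the patch.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in types for illustration; the real backends are much larger.
struct backend {
    virtual ~backend() = default;
    virtual std::string name() const = 0;
};
struct aio_backend : backend {
    std::string name() const override { return "linux-aio"; }
};
struct epoll_backend : backend {
    std::string name() const override { return "epoll"; }
};
#ifdef SEASTAR_HAVE_URING
struct uring_backend : backend {
    std::string name() const override { return "io_uring"; }
};
#endif

// Mirrors the shape of reactor_backend_selector::create() in the patch.
std::unique_ptr<backend> create_backend(const std::string& name) {
    if (name == "io_uring") {
#ifdef SEASTAR_HAVE_URING
        return std::make_unique<uring_backend>();
#else
        // Built with -DSeastar_IO_URING=OFF: fail loudly at selection time.
        throw std::runtime_error("io_uring backend not compiled in");
#endif
    }
    if (name == "linux-aio") {
        return std::make_unique<aio_backend>();
    } else if (name == "epoll") {
        return std::make_unique<epoll_backend>();
    }
    // Assumption: the handling of unknown names is not shown in the patch.
    throw std::invalid_argument("unknown reactor backend: " + name);
}
```

So a build configured with the option off still accepts linux-aio and epoll as before, and only an explicit request for io_uring fails.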

kefu chai

<tchaikov@gmail.com>
Jun 12, 2022, 10:59:12 AM
to Nadav Har'El, Avi Kivity, seastar-dev


On Sun, Jun 12, 2022 at 22:43, Nadav Har'El <n...@scylladb.com> wrote:
When I run the new cmake on my Fedora 34 installation, I get a cmake failure:

-- Could NOT find LibUring (missing: HAVE_IOURING_FEATURES) (Required is at least version "2.0")
CMake Error at CMakeLists.txt:930 (message):
  `io_uring` supported is enabled but liburing is not available!

Apparently I have liburing-devel version 0.7 installed...

The question is what to do now. I know Fedora 34 isn't super-up-to-date (I really do need to update), but Seastar has been working absolutely fine with it until this patch.

Currently, io_uring is enabled by default both by the configure.py script and by CMake. We could disable io_uring by default.

Does the new code really need liburing version 2.0?

Yes, the code uses the features field of struct io_uring, and this field was introduced in liburing 2.0.

And if liburing is missing, can't we just build without it? Isn't it an optional feature?

You can also disable it using configure.py.

--
Regards
Kefu Chai

Nadav Har'El

<nyh@scylladb.com>
Jun 12, 2022, 11:30:35 AM
to Avi Kivity, Kefu Chai, seastar-dev
On Sun, Jun 12, 2022 at 5:55 PM Avi Kivity <a...@scylladb.com> wrote:

For older distributions, you can specify


   cmake -DSeastar_IO_URING=OFF


We decided not to auto-enable it to have a more reproducible build.


I guess that after I lost exactly the same argument about DPDK, and got used to running "configure --disable-dpdk", now
I just need to get used to "./configure.py --disable-dpdk --disable-io_uring"...

I still think this is a usability mistake. Seastar isn't a product, it's a library. It sounds to me like the burden of doing reproducible builds
is on the person who builds the library. After all, you can also build Seastar on different compilers and different versions of compilers -
and we deliberately want to support that. A person that wants reproducible builds should build Seastar every time with the same compiler
and with the same libraries installed - including io_uring.

Anyway, I'll commit - again, this is the same argument as we had in the past about DPDK and also about valgrind, so I won't start
it again now.

Kefu, I think the least we can do is that the error message can say how io_uring can be disabled so the user doesn't need to look
through configure.py source code to figure out the name of the option (I couldn't guess that it's "--disable-io_uring" with one dash
and one underscore). Although because of the cmake/configure.py duplication, I'm not sure how to do it cleanly.

Commit Bot

<bot@cloudius-systems.com>
Jun 12, 2022, 11:34:40 AM
to seastar-dev@googlegroups.com, Avi Kivity
From: Avi Kivity <a...@scylladb.com>
Committer: Nadav Har'El <n...@scylladb.com>
Branch: master

reactor: add io_uring backend

io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.

This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.

In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).

The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.

Cmake improvements by: Kefu Chai <tcha...@gmail.com>

CircleCI support has io_uring disabled until the containers are
updated with the library.

Signed-off-by: Kefu Chai <tcha...@gmail.com>
Message-Id: <20220502065359.3...@gmail.com>
Message-Id: <20220610124536...@scylladb.com>

---
diff --git a/CMakeLists.txt b/CMakeLists.txt
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -230,6 +230,10 @@ option (Seastar_HWLOC
"Enable hwloc support."
ON)

+option (Seastar_IO_URING
+ "Enable io_uring support."
+ ON)
+
set (Seastar_JENKINS
""
CACHE
@@ -921,6 +925,16 @@ if (Seastar_HWLOC)
PRIVATE hwloc::hwloc)
endif ()

+if (Seastar_IO_URING)
+ if (NOT LibUring_FOUND)
+ message (FATAL_ERROR "`io_uring` supported is enabled but liburing is not available!")
+ endif ()
+
+ list (APPEND Seastar_PRIVATE_COMPILE_DEFINITIONS SEASTAR_HAVE_URING)
+ target_link_libraries (seastar
+ PRIVATE URING::uring)
+endif ()
+
if (Seastar_LD_FLAGS)
# In newer versions of CMake, there is `target_link_options`.
target_link_libraries (seastar
@@ -1247,6 +1261,7 @@ if (Seastar_INSTALL)
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findrt.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/Findyaml-cpp.cmake
${CMAKE_CURRENT_SOURCE_DIR}/cmake/SeastarDependencies.cmake
+ ${CMAKE_CURRENT_SOURCE_DIR}/cmake/FindUring.cmake
DESTINATION ${install_cmakedir})

install (
diff --git a/cmake/FindLibUring.cmake b/cmake/FindLibUring.cmake
--- a/cmake/FindLibUring.cmake
--- a/cmake/SeastarDependencies.cmake
+++ b/cmake/SeastarDependencies.cmake
@@ -58,6 +58,7 @@ macro (seastar_find_dependencies)
# Private and private/public dependencies.
Concepts
GnuTLS
+ LibUring
LinuxMembarrier
Sanitizers
SourceLocation
@@ -88,6 +89,7 @@ macro (seastar_find_dependencies)
set (_seastar_dep_args_fmt 5.0.0 REQUIRED)
set (_seastar_dep_args_lz4 1.7.3 REQUIRED)
set (_seastar_dep_args_GnuTLS 3.3.26 REQUIRED)
+ set (_seastar_dep_args_LibUring 2.0)
set (_seastar_dep_args_StdAtomic REQUIRED)
set (_seastar_dep_args_hwloc 1.11.2)
set (_seastar_dep_args_lksctp-tools REQUIRED)
diff --git a/configure.py b/configure.py
--- a/configure.py
+++ b/configure.py
@@ -110,6 +110,11 @@ def standard_supported(standard, compiler='g++'):
name = 'debug-shared-ptr',
dest = "debug_shared_ptr",
help = 'Debug shared_ptr')
+add_tristate(
+ arg_parser,
+ name='io_uring',
+ dest='io_uring',
+ help='Support io_uring via liburing')
arg_parser.add_argument('--allocator-page-size', dest='alloc_page_size', type=int, help='override allocator page size')
arg_parser.add_argument('--without-tests', dest='exclude_tests', action='store_true', help='Do not build tests by default')
arg_parser.add_argument('--without-apps', dest='exclude_apps', action='store_true', help='Do not build applications by default')
@@ -195,6 +200,7 @@ def configure_mode(mode):
tr(args.dpdk, 'DPDK'),
tr(infer_dpdk_machine(args.user_cflags), 'DPDK_MACHINE'),
tr(args.hwloc, 'HWLOC', value_when_none='yes'),
+ tr(args.io_uring, 'IO_URING', value_when_none='yes'),
tr(args.alloc_failure_injection, 'ALLOC_FAILURE_INJECTION', value_when_none='DEFAULT'),
tr(args.task_backtrace, 'TASK_BACKTRACE'),
tr(args.alloc_page_size, 'ALLOC_PAGE_SIZE'),
diff --git a/cooking_recipe.cmake b/cooking_recipe.cmake
--- a/cooking_recipe.cmake
+++ b/cooking_recipe.cmake
@@ -286,6 +286,16 @@ cooking_ingredient (fmt
-DFMT_DOC=OFF
-DFMT_TEST=OFF)

+cooking_ingredient (liburing
+ EXTERNAL_PROJECT_ARGS
+ URL https://github.com/axboe/liburing/archive/liburing-2.1.tar.gz
+ URL_MD5 78f13d9861b334b9a9ca0d12cf2a6d3c
+ CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
+ BUILD_COMMAND <DISABLE>
+ BUILD_BYPRODUCTS "<SOURCE_DIR>/src/liburing.a"
+ BUILD_IN_SOURCE ON
+ INSTALL_COMMAND ${make_command} -C src -s install)
+
cooking_ingredient (lz4
EXTERNAL_PROJECT_ARGS
URL https://github.com/lz4/lz4/archive/v1.8.0.tar.gz
diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -210,6 +210,7 @@ private:
friend struct task_quota_aio_completion;
friend class reactor_backend_epoll;
friend class reactor_backend_aio;
+ friend class reactor_backend_uring;
friend class reactor_backend_selector;
friend struct reactor_options;
friend class aio_storage_context;
diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh
--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -129,6 +129,7 @@ struct reactor_options : public program_options::option_group {
/// Available backends:
/// * \p linux-aio
/// * \p epoll
+ /// * \p io_uring
///
/// Default: \p linux-aio (if available).
program_options::selection_value<reactor_backend_selector> reactor_backend;
diff --git a/install-dependencies.sh b/install-dependencies.sh
--- a/install-dependencies.sh
+++ b/install-dependencies.sh
@@ -40,6 +40,7 @@ debian_packages=(
libgnutls28-dev
liblz4-dev
libsctp-dev
+ liburing-dev
gcc
make
python3
@@ -75,6 +76,7 @@ redhat_packages=(
gnutls-devel
lksctp-tools-devel
lz4-devel
+ liburing-devel
gcc
make
python3
diff --git a/pkgconfig/seastar.pc.in b/pkgconfig/seastar.pc.in
--- a/pkgconfig/seastar.pc.in
+++ b/pkgconfig/seastar.pc.in
@@ -28,6 +28,8 @@ fmt_cflags=-I$<JOIN:$<TARGET_PROPERTY:fmt::fmt,INTERFACE_INCLUDE_DIRECTORIES>, -
fmt_libs=$<TARGET_LINKER_FILE:fmt::fmt>
lksctp_tools_cflags=-I$<JOIN:@lksctp-tools_INCLUDE_DIRS@, -I>
lksctp_tools_libs=$<JOIN:@lksctp-tools_LIBRARIES@, >
+liburing_cflags=$<$<BOOL:@Seastar_IO_URING@>:-I$<JOIN:$<TARGET_PROPERTY:URING::uring,INTERFACE_INCLUDE_DIRECTORIES>, -I>>
+liburing_libs=$<$<BOOL:@Seastar_IO_URING@>:$<TARGET_LINKER_FILE:URING::uring>>
numactl_cflags=-I$<JOIN:@numactl_INCLUDE_DIRS@, -I>
numactl_libs=$<JOIN:@numactl_LIBRARIES@, >

@@ -36,8 +38,8 @@ seastar_cflags=${seastar_include_flags} $<JOIN:$<FILTER:$<TARGET_PROPERTY:seasta
seastar_libs=${libdir}/$<TARGET_FILE_NAME:seastar> @Seastar_SPLIT_DWARF_FLAG@ $<JOIN:@Seastar_Sanitizers_OPTIONS@, >

Requires: liblz4 >= 1.7.3
-Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, yaml-cpp >= 0.5.1
+Requires.private: gnutls >= 3.2.26, hwloc >= 1.11.2, $<$<BOOL:@Seastar_IO_URING@>:liburing $<ANGLE-R>= 2.0, >yaml-cpp >= 0.5.1
Conflicts:
-Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
+Cflags: ${boost_cflags} ${c_ares_cflags} ${cryptopp_cflags} ${fmt_cflags} ${liburing_cflags} ${lksctp_tools_cflags} ${numactl_cflags} ${seastar_cflags}
Libs: ${seastar_libs} ${boost_program_options_libs} ${boost_thread_libs} ${c_ares_libs} ${cryptopp_libs} ${fmt_libs}
-Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${numactl_libs} ${stdatomic_libs}
+Libs.private: ${dl_libs} ${rt_libs} ${boost_thread_libs} ${lksctp_tools_libs} ${liburing_libs} ${numactl_libs} ${stdatomic_libs}
diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -30,6 +30,10 @@
#include <sys/poll.h>
#include <sys/syscall.h>

+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif
@@ -1114,6 +1118,385 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
@@ -1153,6 +1536,13 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}

std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+ if (_name == "io_uring") {
+#ifdef SEASTAR_HAVE_URING
+ return std::make_unique<reactor_backend_uring>(r);
+#else
+ throw std::runtime_error("io_uring backend not compiled in");
+#endif
+ }
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
@@ -1171,6 +1561,11 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}

diff --git a/src/core/reactor_backend.hh b/src/core/reactor_backend.hh
--- a/src/core/reactor_backend.hh
+++ b/src/core/reactor_backend.hh
@@ -363,6 +363,8 @@ public:

Avi Kivity

<avi@scylladb.com>
Jun 12, 2022, 12:14:07 PM
to Nadav Har'El, Kefu Chai, seastar-dev


On 12/06/2022 18.30, Nadav Har'El wrote:
On Sun, Jun 12, 2022 at 5:55 PM Avi Kivity <a...@scylladb.com> wrote:

For older distributions, you can specify


   cmake -DSeastar_IO_URING=OFF


We decided not to auto-enable it to have a more reproducible build.


I guess that after I lost exactly the same argument about DPDK, and got used to running "configure --disable-dpdk", now
I just need to get used to "./configure.py --disable-dpdk --disable-io_uring"...

I still think this is a usability mistake. Seastar isn't a product, it's a library. It sounds to me like the burden of doing reproducible builds
is on the person who builds the library. After all, you can also build Seastar on different compilers and different versions of compilers -
and we deliberately want to support that. A person that wants reproducible builds should build Seastar every time with the same compiler
and with the same libraries installed - including io_uring.

Anyway, I'll commit - again, this is the same argument as we had in the past about DPDK and also about valgrind, so I won't start
it again now.

Kefu, I think the least we can do is that the error message can say how io_uring can be disabled so the user doesn't need to look
through configure.py source code to figure out the name of the option (I couldn't guess that it's "--disable-io_uring" with one dash
and one underscore). Although because of the cmake/configure.py duplication, I'm not sure how to do it cleanly.


I made the same argument actually, and Kefu convinced me that autodetection can be more trouble than it's worth (support can silently disappear when Seastar needs a newer library).
