Many thanks, I folded it into my patch (retaining credits).
Is there a reason not to default support to ON?
After changing it to ON, I don't see liburing in seastar.pc, like
the other libraries, so linkage using pkgconfig will fail. How do
we make it show there?
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/ee909ec6-d175-4b33-a288-c03af4543afen%40googlegroups.com.
Many thanks, I folded it into my patch (retaining credits).
Is there a reason not to default support to ON?
After changing it to ON, I don't see liburing in seastar.pc, like the other libraries, so linkage using pkgconfig will fail. How do we make it show there?
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.
This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.
In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).
The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.
The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total amount of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.
However, in return the ergonomics are better for the user. It works
well with buffered I/O (--kernel-page-cache 1) and doesn't require
tuning some sysctls for it to work on large machines.
---
This appears to work well, apart from the cmake integration. I
appreciate any help in this area!
We require liburing 2.0 and above, and it should be optional.
include/seastar/core/reactor.hh | 1 +
src/core/reactor_backend.hh | 2 +
src/core/reactor_backend.cc | 404 ++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 +
cmake/FindLibUring.cmake | 61 +++++
cmake/SeastarDependencies.cmake | 1 +
install-dependencies.sh | 2 +
7 files changed, 473 insertions(+)
index 4a1bca1afc..3d9b5d7299 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -28,10 +28,16 @@
#include <seastar/util/read_first_line.hh>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>
+#define SEASTAR_HAVE_URING // FIXME: convince cmake to do it.
+
+#ifdef SEASTAR_HAVE_URING
+#include <liburing.h>
+#endif
+
#ifdef HAVE_OSV
#include <osv/newpoll.hh>
#endif
namespace seastar {
@@ -1112,10 +1118,398 @@ reactor_backend_osv::make_pollable_fd_state(file_desc fd, pollable_fd::speculati
std::cerr << "reactor_backend_osv does not support file descriptors - make_pollable_fd_state() shouldn't have been called!\n";
abort();
}
#endif
+#ifdef SEASTAR_HAVE_URING
+
+// We want not to throw during detection (to avoid spurious exceptions)
+// but we do want to throw during for-real construction (to have an error
+// message. Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
+ }
+ void maybe_rearm(reactor_backend_uring& be) {
+ if (_armed) [[likely]] {
+ return;
+ }
+ auto sqe = be.get_sqe();
+ ::io_uring_prep_poll_add(sqe, fd().get(), POLLIN);
+ ::io_uring_sqe_set_data(sqe, static_cast<kernel_completion*>(this));
+ _armed = true;
+ be._has_pending_submissions = true;
+ }
+ // Protect against spurious wakeups - if we get notified that the timer has
+ // expired when it really hasn't, we don't want to block in read(tfd, ...).
+ auto tfd = _r._task_quota_timer.get();
+ ::fcntl(tfd, F_SETFL, ::fcntl(tfd, F_GETFL) | O_NONBLOCK);
+ }
+ ~reactor_backend_uring() {
+ ::io_uring_queue_exit(&_uring);
+ }
+ virtual bool reap_kernel_completions() override {
+ return do_process_kernel_completions();
+ }
+ virtual bool kernel_submit_work() override {
+ bool did_work = false;
+ did_work |= _preempt_io_context.service_preempting_io();
+ queue_pending_file_io();
+ did_work |= ::io_uring_submit(&_uring);
+ // io_uring_submit() may have kicked up queued work
+ did_work |= reap_kernel_completions();
@@ -1151,10 +1545,15 @@ bool reactor_backend_selector::has_enough_aio_nr() {
}
return true;
}
std::unique_ptr<reactor_backend> reactor_backend_selector::create(reactor& r) {
+#ifdef SEASTAR_HAVE_URING
+ if (_name == "io_uring") {
+ return std::make_unique<reactor_backend_uring>(r);
+ }
+#endif
if (_name == "linux-aio") {
return std::make_unique<reactor_backend_aio>(r);
} else if (_name == "epoll") {
return std::make_unique<reactor_backend_epoll>(r);
}
@@ -1169,9 +1568,14 @@ std::vector<reactor_backend_selector> reactor_backend_selector::available() {
std::vector<reactor_backend_selector> ret;
if (detect_aio_poll() && has_enough_aio_nr()) {
ret.push_back(reactor_backend_selector("linux-aio"));
}
ret.push_back(reactor_backend_selector("epoll"));
+#ifdef SEASTAR_HAVE_URING
+ if (detect_io_uring()) {
+ ret.push_back(reactor_backend_selector("io_uring"));
+ }
+#endif
return ret;
}
}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index dcc064fe47..57f5c1c506 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -736,10 +736,11 @@ target_link_libraries (seastar
Boost::thread
c-ares::c-ares
cryptopp::cryptopp
fmt::fmt
lz4::lz4
+ URING::uring
PRIVATE
${CMAKE_DL_LIBS}
GnuTLS::gnutls
StdAtomic::atomic
lksctp-tools::lksctp-tools
@@ -1247,10 +1248,11 @@ if (Seastar_INSTALL)
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/20220501173756.3537646-1-avi%40scylladb.com.
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.
This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.
In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).
The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.
The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total amount of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.
However, in return the ergonomics are better for the user. It works
well with buffered I/O (--kernel-page-cache 1) and doesn't require
tuning some sysctls for it to work on large machines.
---
This appears to work well, apart from the cmake integration. I
appreciate any help in this area!
We require liburing 2.0 and above, and it should be optional.
+// but we do want to throw during for-real construction (to have an error
+// message. Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
+ bool await_events(int timeout, const sigset_t* active_sigmask);
might want to drop this function declaration? looks like a copy-paste leftover from reactor_backend_aio.
Done
+ }
+ future<> poll(pollable_fd_state& fd, int events) {
+ if (events & fd.events_known) {
+ fd.events_known &= ~events;
+ return make_ready_future<>();
+ }
+ fd.events_rw = events == (POLLIN|POLLOUT);
the only reader of fd.events_rw is now reactor_backend_epoll::wait_and_process(). since reactor_backend_uring::wait_and_process_events() in quite a different way, probably we could dispense with this step?
I think we have to apply the same logic from the aio backend to
io_uring, I'll check.
On Sun, May 1, 2022 at 8:38 PM Avi Kivity <a...@scylladb.com> wrote:
io_uring is a unified asynchronous I/O interface, supporting
network, buffered disk, and direct disk I/O.
This patch adds a reactor backend using io_uring. It is deliberately
non-ambitious and only implements the minimal number of verbs using
io_uring. We could support many more (sendmsg(), recvmsg(), open(), etc.)
but it is better to start small, as each of those features will require
a separate capability check and fallback path if not available.
In terms of performance, I measured about 4% degradation compared to
linux-aio in httpd/wrk. I am working with the io_uring maintainer to
resolve it, but until it is resolved linux-aio will be preferred to
io_uring and the latter has to be enabled manually (with --reactor-backend).
We have a comment in reactor_config.hh which lists the available backends, andthe new one should be listed as well. But I think we need more than that - we needsome sort of documentation, in the source code or a separate document (but notjust commit messages) explaining the *current* state of each backend
I'll do that in a separate patch. I want to limit the scope of
this patch to just io_uring, and avoid expanding it to include
other backends or general guidance. In general there is no
required user documentation since Seastar will select the best
backend the system can support.
(especiallynow that we have one that you consider incomplete and we need to remember
It's only incomplete in the sense that I want to further improve
it. Otherwise it is fully functional.
somewhat what is incomplete), the advantages and disadvantages of each backend,and perhaps a general statement (would it be correct?) that the best backend forthe capabilities of the current kernel is chosen automatically, so users should not explicitlychoose the reactor backend.
It is already chosen automatically.
The implementation follows the linux-aio backend, using
IORING_OP_POLL_ADD for polls and the read/write/fsync ops for files.
More elaborate socket operation (using sendmsg/recvmsg equivalents to
combine poll+read) will follow later. The preemption notifier still
uses linux-aio (as in Glauber's implementation) as a simple solution.
Isn't the preemption notifier just a file descriptor, like the other files?(I'm very rusty in this code, so I'm probably just misremembering).
It's a memory area that is updated when preemption happens. It's
carefully arranged to match the memory are used by linux-aio to
signal completions are available.
The implementation is more complex internally than linux-aio, since
getting a submission queue entry (sqe) can require flushing pending
sqe:s and consuming completion queue entries (cqe:s). This is because
the queue lengths are smaller than the total amount of in-flight
operations. So getting an sqe can result (rarely) in a syscall
and processing some completions.
However, in return the ergonomics are better for the user. It works
well with buffered I/O (--kernel-page-cache 1) and doesn't require
tuning some sysctls for it to work on large machines.
---
This appears to work well, apart from the cmake integration. I
appreciate any help in this area!
We require liburing 2.0 and above, and it should be optional.
I guess this means that this patch is an RFC and we shouldn't commit it yet?
New versions have cmake support and library auto-detection.
It's super annoying to issue "catch throw" in gdb and then
struggle through some random internal exceptions. I'm not at all
trying to optimize for performance.
Also, by not throwing exceptions during detection, you miss the *reason* whyit wasn't detected. Maybe this can be important for somebody, perhaps in sometrace logging mode or something (somebody's not getting uring as he expected,and wondering why)?
We don't have any mechanism to communicate it. And really, you're
turning an "it just works" mechanism that most users aren't aware
of into a "you have to understand every detail here" thing.
The effect of Seastar not able to use a backend is that it will
fall back to another implementation with a small performance
penalty. It doesn't matter what the cause is, since the fix is
always to move to a newer kernel.
+// but we do want to throw during for-real construction (to have an error
+// message. Hence `throw_on_error`.
+static
+std::optional<::io_uring>
+try_create_uring(unsigned queue_len, bool throw_on_error) {
A thought: There is a large chunk of code here that is 100% specific to uring, so I think it would have been nice to have a separate source file reactor_backend_uringinstead of stuffing all the different backends into one big source file.Of course, if this will be a mess because we'll also need to put a lot of stuff in header files, then let's just leave it one big source file as you did here.
I can do that as a follow-up. I don't think it will cause a big
mess, but since it's already a single file, I'll defer the change
until later.
It's a mnemonic to remind me that it's part of the C namespace,
not std::. and not seatar::.
I don't see the point. It's impossible to recover from it, and knowing that we got op::cancel here will not make anyone wiser. The code dump has to be inspected.
Couple that with it being unreachable code, I don't think it's
helpful.
I can do that.
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/20220610124536.2091828-1-avi%40scylladb.com.
For older distributions, you can specify
cmake -DSeastar_IO_URING=OFF
We decided not to auto-enable it to have a more reproducible build.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/CANEVyjsK8Q0pVrO1_Tobj3KL%2BWGTGzs9NzP-42e-DFri-fS%3DSQ%40mail.gmail.com.
When I run the new cmake on my Fedora 34 installation, I get a cmake failure:-- Could NOT find LibUring (missing: HAVE_IOURING_FEATURES) (Required is at least version "2.0")
CMake Error at CMakeLists.txt:930 (message):
`io_uring` supported is enabled but liburing is not available!Apparently I have liburing-devel version 0.7 installed...The question is what to do now. I know Fedora 34 isn't super-up-to-date (I really do need to update), but Seastar has been working absolutely fine with it until this patch.
Does the new code really need liburing version 2.0?
And If liburing is missing, can't we just build without it? Isn't it an optional feature?
You received this message because you are subscribed to a topic in the Google Groups "seastar-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/seastar-dev/S2sJq-h4VB0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/CANEVyjsK8Q0pVrO1_Tobj3KL%2BWGTGzs9NzP-42e-DFri-fS%3DSQ%40mail.gmail.com.
For older distributions, you can specify
cmake -DSeastar_IO_URING=OFF
We decided not to auto-enable it to have a more reproducible build.
On Sun, Jun 12, 2022 at 5:55 PM Avi Kivity <a...@scylladb.com> wrote:
For older distributions, you can specify
cmake -DSeastar_IO_URING=OFF
We decided not to auto-enable it to have a more reproducible build.
I guess that after I lost exactly the same argument about DPDK, and got used to running "configure --disable-dpdk", now
I just need to get used to "./configure.py --disable-dpdk --disable-io_uring"...
I still think this is a usability mistake. Seastar isn't a product, it's a library. It sounds to me like the burden of doing reproducible buildsis on the person who builds the library. After all, you can also build Seastar on different compilers and different versions of compilers -and we deliberately want to support that. A person that wants reproducible builds should build Seastar every time with the same compilerand with the same libraries installed - including io_uring.
Anyway, I'll commit - again, this is the same argument as we had in the past about DPDK and also about valgrind, so I won't startit again now.
Kefu, I think the least we can do is that the error message can say how io_uring can be disabled so the user doesn't need to lookthrough configure.py source code to figure out the name of the option (I couldn't guess that it's "--disable-io_uring" with one dashand one underscore). Although because of the cmake/configure.py duplication, I'm not sure how to do it cleanly.
I made the same argument actually, and Kefu convinced me that
autodetection can be more trouble than it's worth (support can
silently disappear Seastar needs a newer library).