Seastar performance tuning

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 9, 2018, 4:53:50 AM12/9/18

to seastar-dev

Hi, I had recently evaluated seastar, as a replacement for our current network stack that is using lwip + dpdk.

My benchmark is very simple, it is basically simple protocol parsing of 64-bytes request and 64 + 4096 + 64 bytes response, while requests are pipelined together up-to 32 qdepth.

In some cases i got that seastar solution was significantly better than lwip (on very low and very high qdepth). However in a middle rage, i.e. qd=8 and 16. I got that seastar is much inferior to my lwip implementation.

Currently "server" is running on single core, just to simplify measurement.

Things that i had tried:

1) remove flusher future.

2) increase connection buffer size

3) reduce or increase --task-quota-ms value.

This did not provide virtually any improvement.

Machine specs:

Intel(R) Xeon(R) CPU E5-2648L v4 @ 1.80GHz

network adapter:

Ethernet Controller XL710 for 40GbE QSFP+

Is there are tuning parameters for seastar that I can apply ?

Avi Kivity

<avi@scylladb.com>

unread,

Dec 9, 2018, 5:02:39 AM12/9/18

to Sasha Solganik, seastar-dev

It's hard to say without more input. You can try profiling with "perf"
to see where the bottleneck is. It's also recommended to set up a
Prometheus monitor to look at counters to see what's going on (but
exactly what to look at depends on the results of profiling).

It will be interesting to compare the profiles of the good cases with
the bad cases.

How many continuations per request do you have? That can be determined
by looking at the rate of the tasks_processed counter divided by your
request rate.

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 9, 2018, 5:09:02 AM12/9/18

to seastar-dev

Two more things that i had tried are:

1) I measured with perf and got that "free" is the method that is costly (is there any way to reduce amount of futures that are allocated)?

2) I also tried your Idea from this thread to remove "need_preempt" from then and "then_wrapped" to try to reduce amount of "scheduler" switches, this did not yield any improvement.

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 9, 2018, 11:42:41 AM12/9/18

to seastar-dev

I had measured the amount of continuations:

in scenario where seastar performs worse than lwip: it is roughly 3.3 per request (12 connections queue depth 4)

in scenario where seastar performs better than lwip: it is around 6 continuations per request (12 connections queue depth 1)

which is quite weird.

Avi Kivity

<avi@scylladb.com>

unread,

Dec 10, 2018, 3:52:33 AM12/10/18

to Sasha Solganik, seastar-dev

On 09/12/2018 18.42, Sasha Solganik wrote:

I had measured the amount of continuations:
in scenario where seastar performs worse than lwip: it is roughly 3.3 per request (12 connections queue depth 4)

in scenario where seastar performs better than lwip: it is around 6 continuations per request (12 connections queue depth 1)

which is quite weird.

It's not weird, Seastar has to create more continuations since less data is available in the buffers. You should look closely at the code to see if you can remove needless continuations.

A design goal should be to only create continuations when we have to wait for I/O, but in practice this can be hard.

Example:

return do_io().then([] {

c1();

}).then([] {

c2();

});

Both c1 and c2 will create continuations, since the future returned from do_io() is not available, and the future return from do_io().then() is also not available.

On the other hand

do_io().then([] {

return c1().then([] {

c2();

});

c1 will have a continuation created for it, but if it happens to return a ready future, then c2 will be executed in c1's continuation.

If processing connections that have no pipelining is important to you, perhaps we can add an API that calls a function for all connections that have data:

class connection_monitor {

public:

void monitor(connected_socket cs);

void unmonitor(connected_socket cs);

future<> process(noncopyable_function<future<> (connected_socket cs)>); // runs until stop() is called

void stop();

};

But as soon as your code gets more complex, the benefits of are reduced.

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/a6b08683-2f21-4f38-a3a0-db5dbfa02972%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 10, 2018, 2:00:35 PM12/10/18

to seastar-dev

Hi Avi, thanks for quick response.

Indeed one of my "cases" was not optimized use of continuation, but it did not improve IOPS.

Regards "pipelining", i indeed use pipelining (a.k.a. qdepth or iodepth) so i indeed need to separate "reads" and "writes" continuation.

I did some more testing, and there is one more troubling scenario, lets call it "write"

in this scenario client writes 64 + 4096 bytes and gets a response of 64 bytes. This all happenes in with "pipeline" of 32 requests.

Results I got from our LWIP implementation are >X2 from seastar (on single core in LWIP i got ~480KIOPS, while in seastar i got <200).

I tried to increase DPDK fetch budget, that did not help.

From profiling i saw that packet::share is taking lots of cycles, which is called for every RX packet in the following method:

native_connected_socket_impl<Protocol>::native_data_source_impl::

virtual future<temporary_buffer<char>> get() override {

if (_eof) {

return make_ready_future<temporary_buffer<char>>(temporary_buffer<char>(0));

}

if (_cur_frag != _buf.nr_frags()) {

auto& f = _buf.fragments()[_cur_frag++];

return make_ready_future<temporary_buffer<char>>(

temporary_buffer<char>(f.base, f.size,

make_deleter(deleter(), [p = _buf.share()] () mutable {})));

}

return _conn->wait_for_data().then([this] {

_buf = _conn->read();

_cur_frag = 0;

_eof = !_buf.len();

return get();

});

}

Now from what i understand this is only for "releasing the tmp_buf", however i`m struggling to find a better solution for this.

Any idea on how to improve it ?

Avi Kivity

<avi@scylladb.com>

unread,

Dec 10, 2018, 3:44:32 PM12/10/18

to Sasha Solganik, seastar-dev

It's surprising that this little thing that happens once per packet is so expensive. Perhaps we're confusing the allocator's heuristics (when allocating the deleter). Or maybe something else, worth investigating.

We can also avoid the sharing altogether:

1. increase the fragment size to 8k so the packet has one fragment

2. instead of assigning to _buf, use packet::release_into() to convert it to a vector of temporary_buffers (stored in native_data_source_impl).

3. optimize release_into() to move the last fragment's deleter, instead of calling share(). If there's just one fragment, we'll never call share().

On Sunday, December 9, 2018 at 11:53:50 AM UTC+2, Sasha Solganik wrote:

Hi, I had recently evaluated seastar, as a replacement for our current network stack that is using lwip + dpdk.
My benchmark is very simple, it is basically simple protocol parsing of 64-bytes request and 64 + 4096 + 64 bytes response, while requests are pipelined together up-to 32 qdepth.

In some cases i got that seastar solution was significantly better than lwip (on very low and very high qdepth). However in a middle rage, i.e. qd=8 and 16. I got that seastar is much inferior to my lwip implementation.

Currently "server" is running on single core, just to simplify measurement.

Things that i had tried:

1) remove flusher future.

2) increase connection buffer size

3) reduce or increase --task-quota-ms value.

This did not provide virtually any improvement.

Machine specs:

Intel(R) Xeon(R) CPU E5-2648L v4 @ 1.80GHz

network adapter:

Ethernet Controller XL710 for 40GbE QSFP+

Is there are tuning parameters for seastar that I can apply ?

--

You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/0ffa2be4-a101-4be1-929c-f0fecd74caba%40googlegroups.com.

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 11, 2018, 9:24:14 AM12/11/18

to seastar-dev

Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.

This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).

Also i do not observe packet::share anymore on top of my perf output.

I will try to take a look at improving "release_into" next. see where it takes me.

0001-seastar-core-avoid-sharing-packets-in-tcp-connection.patch

Avi Kivity

<avi@scylladb.com>

unread,

Dec 11, 2018, 9:42:29 AM12/11/18

to Sasha Solganik, seastar-dev

On 11/12/2018 16.24, Sasha Solganik wrote:

Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.
This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).

Also i do not observe packet::share anymore on top of my perf output.

What about increasing the fragment size?

I will try to take a look at improving "release_into" next. see where it takes me.

diff --git a/net/native-stack-impl.hh b/net/native-stack-impl.hh
index f0113ce9..48a0ac9d 100644
--- a/net/native-stack-impl.hh
+++ b/net/native-stack-impl.hh
@@ -124,9 +124,8 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final
     : public data_source_impl {
     typedef typename Protocol::connection connection_type;
     lw_shared_ptr<connection_type> _conn;
-    size_t _cur_frag = 0;
     bool _eof = false;
-    packet _buf;
+    std::deque<temporary_buffer<char>> _buf;

seastar's circular_buffer is likely better for this.

 public:
     explicit native_data_source_impl(lw_shared_ptr<connection_type> conn)
         : _conn(std::move(conn)) {}
@@ -134,16 +133,14 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final
         if (_eof) {
             return make_ready_future<temporary_buffer<char>>(temporary_buffer<char>(0));
         }
-        if (_cur_frag != _buf.nr_frags()) {
-            auto& f = _buf.fragments()[_cur_frag++];
-            return make_ready_future<temporary_buffer<char>>(
-                    temporary_buffer<char>(f.base, f.size,
-                            make_deleter(deleter(), [p = _buf.share()] () mutable {})));
+        if (_buf.size() > 0) {
+		auto result = make_ready_future<temporary_buffer<char>>(std::move(_buf.front()));
+		_buf.pop_front();
+		return result;
         }

We use spaces for indents, not tabs.

         return _conn->wait_for_data().then([this] {
-            _buf = _conn->read();
-            _cur_frag = 0;
-            _eof = !_buf.len();
+            _buf = std::move(_conn->read());
+            _eof = !_buf.size();

This causes an allocation for every packet. It's better to keep _buf's identity and move fragments into it.

Perhaps (much larger change) we should also change packet to keep a deleter per fragment rather than a deleter per packet. That will allow most fragments to have a free_deleter, which is super cheap.

             return get();
         });
     }
diff --git a/net/packet.hh b/net/packet.hh
index 5d626a86..7c25a7b1 100644
--- a/net/packet.hh
+++ b/net/packet.hh
@@ -30,6 +30,7 @@
 #include <iosfwd>
 #include "util/std-compat.hh"
 #include <functional>
+#include <deque>
 
 namespace seastar {
 
@@ -302,6 +303,13 @@ class packet final {
         });
         return ret;
     }
+
+    void release_to(std::deque<temporary_buffer<char>> &dest){
+	    release_into([&dest] (temporary_buffer<char>&& frag) {
+		    dest.push_back(std::move(frag));
+	    });
+    }

You should just open-code this at the call site.

+
     explicit operator bool() {
         return bool(_impl);
     }
diff --git a/net/tcp.hh b/net/tcp.hh
index d080afaf..5cd606fd 100644
--- a/net/tcp.hh
+++ b/net/tcp.hh
@@ -376,7 +376,7 @@ class tcp {
             uint16_t mss;
             tcp_seq urgent;
             tcp_seq initial;
-            std::deque<packet> data;
+            std::deque<temporary_buffer<char>> data;
             // The total size of data stored in std::deque<packet> data
             size_t data_size = 0;
             tcp_packet_merger out_of_order;
@@ -436,7 +436,7 @@ class tcp {
         future<> wait_send_available();
         future<> send(packet p);
         void connect();
-        packet read();
+        std::deque<temporary_buffer<char>> read();
         void close();
         void remove_from_tcbs() {
             auto id = connid{_local_ip, _foreign_ip, _local_port, _foreign_port};
@@ -689,7 +689,7 @@ class tcp {
         future<> wait_for_data() {
             return _tcb->wait_for_data();
         }
-        packet read() {
+        std::deque<temporary_buffer<char>> read() {
             return _tcb->read();
         }
         ipaddr foreign_ip() {
@@ -1478,7 +1478,7 @@ void tcp<InetTraits>::tcb::input_handle_other_state(tcp_hdr* th, packet p) {
             // apporopriate to the current buffer availability.  The total of
             // RCV.NXT and RCV.WND should not be reduced.
             _rcv.data_size += p.len();
-            _rcv.data.push_back(std::move(p));
+            p.release_to(_rcv.data);
             _rcv.next += seg_len;
             auto merged = merge_out_of_order();
             _rcv.window = get_modified_receive_window_size();
@@ -1753,15 +1753,11 @@ void tcp<InetTraits>::tcb::connect() {
 }
 
 template <typename InetTraits>
-packet tcp<InetTraits>::tcb::read() {
-    packet p;
-    for (auto&& q : _rcv.data) {
-        p.append(std::move(q));
-    }
+std::deque<temporary_buffer<char>> tcp<InetTraits>::tcb::read() {
+    auto result = std::move(_rcv.data);

Moving out _rcv.data will force a reallocation here when it is pushed into again. Better to move just the fragments.

     _rcv.data_size = 0;
-    _rcv.data.clear();
     _rcv.window = get_default_receive_window_size();
-    return p;
+    return result;
 }
 
 template <typename InetTraits>
@@ -1872,7 +1868,7 @@ bool tcp<InetTraits>::tcb::merge_out_of_order() {
                 seg_len -= trim;
             }
             _rcv.next += seg_len;
-            _rcv.data.push_back(std::move(p));
+            p.release_to(_rcv.data);
             _rcv.data_size += p.len();
             // Since c++11, erase() always returns the value of the following element
             it = _rcv.out_of_order.map.erase(it);

For a log time I wanted to move tcp (esp. the tx path) to working with temporary_buffers instead of packets, and only use packets when ready to send one to the interface, this is a step in that direction.

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 11, 2018, 12:10:34 PM12/11/18

to seastar-dev

Hi Avi, again thanks a lot for quick response. some comments inline.

On Tuesday, December 11, 2018 at 4:42:29 PM UTC+2, Avi Kivity wrote:

On 11/12/2018 16.24, Sasha Solganik wrote:

Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.
This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).

Also i do not observe packet::share anymore on top of my perf output.

What about increasing the fragment size?

I am not sure i understand what does it mean increasing fragment size, i`m not using LRO, so all RX fragments are at most 1500 bytes.

I will try to take a look at improving "release_into" next. see where it takes me.

diff --git a/net/native-stack-impl.hh b/net/native-stack-impl.hh
index f0113ce9..48a0ac9d 100644
--- a/net/native-stack-impl.hh
+++ b/net/native-stack-impl.hh
@@ -124,9 +124,8 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final
     : public data_source_impl {
     typedef typename Protocol::connection connection_type;
     lw_shared_ptr<connection_type> _conn;
-    size_t _cur_frag = 0;
     bool _eof = false;
-    packet _buf;
+    std::deque<temporary_buffer<char>> _buf;

seastar's circular_buffer is likely better for this.

OK

 public:
     explicit native_data_source_impl(lw_shared_ptr<connection_type> conn)
         : _conn(std::move(conn)) {}
@@ -134,16 +133,14 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final
         if (_eof) {
             return make_ready_future<temporary_buffer<char>>(temporary_buffer<char>(0));
         }
-        if (_cur_frag != _buf.nr_frags()) {
-            auto& f = _buf.fragments()[_cur_frag++];
-            return make_ready_future<temporary_buffer<char>>(
-                    temporary_buffer<char>(f.base, f.size,
-                            make_deleter(deleter(), [p = _buf.share()] () mutable {})));
+        if (_buf.size() > 0) {
+		auto result = make_ready_future<temporary_buffer<char>>(std::move(_buf.front()));
+		_buf.pop_front();
+		return result;
         }

We use spaces for indents, not tabs.

My bad i did not notice, i`ll fix

         return _conn->wait_for_data().then([this] {
-            _buf = _conn->read();
-            _cur_frag = 0;
-            _eof = !_buf.len();
+            _buf = std::move(_conn->read());
+            _eof = !_buf.size();
This causes an allocation for every packet. It's better to keep _buf's identity and move fragments into it.

So read should return "array of fragments"? but this means allocating array on every packet, ot maybe i should use "channel" approach so that "read" will receive circular buffer to which to put fragments ? What do you think.

Perhaps (much larger change) we should also change packet to keep a deleter per fragment rather than a deleter per packet. That will allow most fragments to have a free_deleter, which is super cheap.

Yes i thought about this change also, and if we choose seastar over what we currently are using for sure we will invest in performance improvements. I do see huge value in seastar continuations framework, and how it speeds development of applications.

My issue is that i must provide on par results with what i`m currently having to justify this move.

             return get();
         });
     }
diff --git a/net/packet.hh b/net/packet.hh
index 5d626a86..7c25a7b1 100644
--- a/net/packet.hh
+++ b/net/packet.hh
@@ -30,6 +30,7 @@
 #include <iosfwd>
 #include "util/std-compat.hh"
 #include <functional>
+#include <deque>
 
 namespace seastar {
 
@@ -302,6 +303,13 @@ class packet final {
         });
         return ret;
     }
+
+    void release_to(std::deque<temporary_buffer<char>> &dest){
+	    release_into([&dest] (temporary_buffer<char>&& frag) {
+		    dest.push_back(std::move(frag));
+	    });
+    }

You should just open-code this at the call site.

OK

Again what you suggest return value to be ? fragments vector (it will also cause allocations of vector itself), maybe pass "container" as parameter and fill it? What do you think ?

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 13, 2018, 6:28:15 AM12/13/18

to seastar-dev

Attached 2 patches, Note that i kept move on circular_buffer, as its default contructor does not allocate anything, so we are ok on this front. please do tell me if you have a better idea on how to transfer "buffers" from tcp_connection into traits without it. Initially i though that maybe having a direct access to "_rcv.data" from traits will be a better idea, but i dont see an easy way to "update" amount of data read without completely braking API.

On Sunday, December 9, 2018 at 11:53:50 AM UTC+2, Sasha Solganik wrote:

0002-seastar-net-optimize-release_into-for-rx-path.patch

0001-seastar-core-avoid-sharing-packets-in-tcp-connection.patch

Sasha Solganik

<solganik@gmail.com>

unread,

Dec 13, 2018, 6:28:56 AM12/13/18

to seastar-dev

Just a word on performance, we are up from ~180K to 270K on "write scenario"

On Sunday, December 9, 2018 at 11:53:50 AM UTC+2, Sasha Solganik wrote:

Avi Kivity

<avi@scylladb.com>

unread,

Dec 13, 2018, 7:33:17 AM12/13/18

to Sasha Solganik, seastar-dev

On 12/11/18 7:10 PM, Sasha Solganik wrote:

Hi Avi, again thanks a lot for quick response. some comments inline.

On Tuesday, December 11, 2018 at 4:42:29 PM UTC+2, Avi Kivity wrote:

On 11/12/2018 16.24, Sasha Solganik wrote:

Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.
This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).

Also i do not observe packet::share anymore on top of my perf output.

What about increasing the fragment size?

I am not sure i understand what does it mean increasing fragment size, i`m not using LRO, so all RX fragments are at most 1500 bytes.

Oh. Why not use LRO? It will reduce allocation rates.

read() can return a packet, which is then release_into()ed the circular_buffer.

It's also possible to connect the two classes more strongly, but I can't guess how that will turn out.

Perhaps (much larger change) we should also change packet to keep a deleter per fragment rather than a deleter per packet. That will allow most fragments to have a free_deleter, which is super cheap.

Yes i thought about this change also, and if we choose seastar over what we currently are using for sure we will invest in performance improvements. I do see huge value in seastar continuations framework, and how it speeds development of applications.

My issue is that i must provide on par results with what i`m currently having to justify this move.

Well, it looks like you're not far away.

If the application is simple you will likely achieve the best results with a tightly integrated stack (e.g. LWIP + stuff hard coded).

If the application is complex (doing replication, scheduling, deduplication, multi-tier storage) then a hard-coded application is not feasible IMO and the simplification provided by Seastar becomes dominant. I'm not saying we should strive to achieve the best performance; I'm saying it has no contended because the other approaches lead to complex code.

It can be the packet. Its vector was already allocated.

     _rcv.data_size = 0;
-    _rcv.data.clear();
     _rcv.window = get_default_receive_window_size();
-    return p;
+    return result;
 }
 
 template <typename InetTraits>
@@ -1872,7 +1868,7 @@ bool tcp<InetTraits>::tcb::merge_out_of_order() {
                 seg_len -= trim;
             }
             _rcv.next += seg_len;
-            _rcv.data.push_back(std::move(p));
+            p.release_to(_rcv.data);
             _rcv.data_size += p.len();
             // Since c++11, erase() always returns the value of the following element
             it = _rcv.out_of_order.map.erase(it);

For a log time I wanted to move tcp (esp. the tx path) to working with temporary_buffers instead of packets, and only use packets when ready to send one to the interface, this is a step in that direction.

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/bfed3f81-6905-4816-b962-44aa8c1b9260%40googlegroups.com.

Reply all

Reply to author

Forward