Two more things that i had tried are:
I had measured the amount of continuations:in scenario where seastar performs worse than lwip: it is roughly 3.3 per request (12 connections queue depth 4)in scenario where seastar performs better than lwip: it is around 6 continuations per request (12 connections queue depth 1)
which is quite weird.
It's not weird, Seastar has to create more continuations since
less data is available in the buffers. You should look closely at
the code to see if you can remove needless continuations.
A design goal should be to only create continuations when we have to wait for I/O, but in practice this can be hard.
Example:
return do_io().then([] {
c1();
}).then([] {
c2();
});
Both c1 and c2 will create continuations, since the future returned from do_io() is not available, and the future return from do_io().then() is also not available.
On the other hand
do_io().then([] {
return c1().then([] {
c2();
});
});
c1 will have a continuation created for it, but if it happens to
return a ready future, then c2 will be executed in c1's
continuation.
If processing connections that have no pipelining is important to you, perhaps we can add an API that calls a function for all connections that have data:
class connection_monitor {
public:
void monitor(connected_socket cs);
void unmonitor(connected_socket cs);
future<> process(noncopyable_function<future<> (connected_socket cs)>); // runs until stop() is called
void stop();
};
But as soon as your code gets more complex, the benefits of are
reduced.
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/a6b08683-2f21-4f38-a3a0-db5dbfa02972%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
It's surprising that this little thing that happens once per
packet is so expensive. Perhaps we're confusing the allocator's
heuristics (when allocating the deleter). Or maybe something else,
worth investigating.
We can also avoid the sharing altogether:
1. increase the fragment size to 8k so the packet has one fragment
2. instead of assigning to _buf, use packet::release_into() to
convert it to a vector of temporary_buffers (stored in
native_data_source_impl).
3. optimize release_into() to move the last fragment's deleter,
instead of calling share(). If there's just one fragment, we'll
never call share().
On Sunday, December 9, 2018 at 11:53:50 AM UTC+2, Sasha Solganik wrote:Hi, I had recently evaluated seastar, as a replacement for our current network stack that is using lwip + dpdk.My benchmark is very simple, it is basically simple protocol parsing of 64-bytes request and 64 + 4096 + 64 bytes response, while requests are pipelined together up-to 32 qdepth.In some cases i got that seastar solution was significantly better than lwip (on very low and very high qdepth). However in a middle rage, i.e. qd=8 and 16. I got that seastar is much inferior to my lwip implementation.Currently "server" is running on single core, just to simplify measurement.
Things that i had tried:1) remove flusher future.2) increase connection buffer size3) reduce or increase --task-quota-ms value.
This did not provide virtually any improvement.Machine specs:Intel(R) Xeon(R) CPU E5-2648L v4 @ 1.80GHz
network adapter:Ethernet Controller XL710 for 40GbE QSFP+
Is there are tuning parameters for seastar that I can apply ?
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/0ffa2be4-a101-4be1-929c-f0fecd74caba%40googlegroups.com.
Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).Also i do not observe packet::share anymore on top of my perf output.
What about increasing the fragment size?
I will try to take a look at improving "release_into" next. see where it takes me.
diff --git a/net/native-stack-impl.hh b/net/native-stack-impl.hh index f0113ce9..48a0ac9d 100644 --- a/net/native-stack-impl.hh +++ b/net/native-stack-impl.hh @@ -124,9 +124,8 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final : public data_source_impl { typedef typename Protocol::connection connection_type; lw_shared_ptr<connection_type> _conn; - size_t _cur_frag = 0; bool _eof = false; - packet _buf; + std::deque<temporary_buffer<char>> _buf;
seastar's circular_buffer is likely better for this.
public: explicit native_data_source_impl(lw_shared_ptr<connection_type> conn) : _conn(std::move(conn)) {} @@ -134,16 +133,14 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final if (_eof) { return make_ready_future<temporary_buffer<char>>(temporary_buffer<char>(0)); } - if (_cur_frag != _buf.nr_frags()) { - auto& f = _buf.fragments()[_cur_frag++]; - return make_ready_future<temporary_buffer<char>>( - temporary_buffer<char>(f.base, f.size, - make_deleter(deleter(), [p = _buf.share()] () mutable {}))); + if (_buf.size() > 0) { + auto result = make_ready_future<temporary_buffer<char>>(std::move(_buf.front())); + _buf.pop_front(); + return result; }
We use spaces for indents, not tabs.
return _conn->wait_for_data().then([this] { - _buf = _conn->read(); - _cur_frag = 0; - _eof = !_buf.len(); + _buf = std::move(_conn->read()); + _eof = !_buf.size();
This causes an allocation for every packet. It's better to keep _buf's identity and move fragments into it.
Perhaps (much larger change) we should also change packet to keep
a deleter per fragment rather than a deleter per packet. That will
allow most fragments to have a free_deleter, which is super cheap.
return get(); }); } diff --git a/net/packet.hh b/net/packet.hh index 5d626a86..7c25a7b1 100644 --- a/net/packet.hh +++ b/net/packet.hh @@ -30,6 +30,7 @@ #include <iosfwd> #include "util/std-compat.hh" #include <functional> +#include <deque> namespace seastar { @@ -302,6 +303,13 @@ class packet final { }); return ret; } + + void release_to(std::deque<temporary_buffer<char>> &dest){ + release_into([&dest] (temporary_buffer<char>&& frag) { + dest.push_back(std::move(frag)); + }); + }
You should just open-code this at the call site.
+ explicit operator bool() { return bool(_impl); } diff --git a/net/tcp.hh b/net/tcp.hh index d080afaf..5cd606fd 100644 --- a/net/tcp.hh +++ b/net/tcp.hh @@ -376,7 +376,7 @@ class tcp { uint16_t mss; tcp_seq urgent; tcp_seq initial; - std::deque<packet> data; + std::deque<temporary_buffer<char>> data; // The total size of data stored in std::deque<packet> data size_t data_size = 0; tcp_packet_merger out_of_order; @@ -436,7 +436,7 @@ class tcp { future<> wait_send_available(); future<> send(packet p); void connect(); - packet read(); + std::deque<temporary_buffer<char>> read(); void close(); void remove_from_tcbs() { auto id = connid{_local_ip, _foreign_ip, _local_port, _foreign_port}; @@ -689,7 +689,7 @@ class tcp { future<> wait_for_data() { return _tcb->wait_for_data(); } - packet read() { + std::deque<temporary_buffer<char>> read() { return _tcb->read(); } ipaddr foreign_ip() { @@ -1478,7 +1478,7 @@ void tcp<InetTraits>::tcb::input_handle_other_state(tcp_hdr* th, packet p) { // apporopriate to the current buffer availability. The total of // RCV.NXT and RCV.WND should not be reduced. _rcv.data_size += p.len(); - _rcv.data.push_back(std::move(p)); + p.release_to(_rcv.data); _rcv.next += seg_len; auto merged = merge_out_of_order(); _rcv.window = get_modified_receive_window_size(); @@ -1753,15 +1753,11 @@ void tcp<InetTraits>::tcb::connect() { } template <typename InetTraits> -packet tcp<InetTraits>::tcb::read() { - packet p; - for (auto&& q : _rcv.data) { - p.append(std::move(q)); - } +std::deque<temporary_buffer<char>> tcp<InetTraits>::tcb::read() { + auto result = std::move(_rcv.data);
Moving out _rcv.data will force a reallocation here when it is
pushed into again. Better to move just the fragments.
_rcv.data_size = 0; - _rcv.data.clear(); _rcv.window = get_default_receive_window_size(); - return p; + return result; } template <typename InetTraits> @@ -1872,7 +1868,7 @@ bool tcp<InetTraits>::tcb::merge_out_of_order() { seg_len -= trim; } _rcv.next += seg_len; - _rcv.data.push_back(std::move(p)); + p.release_to(_rcv.data); _rcv.data_size += p.len(); // Since c++11, erase() always returns the value of the following element it = _rcv.out_of_order.map.erase(it);
For a log time I wanted to move tcp (esp. the tx path) to working
with temporary_buffers instead of packets, and only use packets
when ready to send one to the interface, this is a step in that
direction.
On 11/12/2018 16.24, Sasha Solganik wrote:
Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).Also i do not observe packet::share anymore on top of my perf output.
What about increasing the fragment size?
I will try to take a look at improving "release_into" next. see where it takes me.
diff --git a/net/native-stack-impl.hh b/net/native-stack-impl.hh index f0113ce9..48a0ac9d 100644 --- a/net/native-stack-impl.hh +++ b/net/native-stack-impl.hh @@ -124,9 +124,8 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final : public data_source_impl { typedef typename Protocol::connection connection_type; lw_shared_ptr<connection_type> _conn; - size_t _cur_frag = 0; bool _eof = false; - packet _buf; + std::deque<temporary_buffer<char>> _buf;
seastar's circular_buffer is likely better for this.
public: explicit native_data_source_impl(lw_shared_ptr<connection_type> conn) : _conn(std::move(conn)) {} @@ -134,16 +133,14 @@ class native_connected_socket_impl<Protocol>::native_data_source_impl final if (_eof) { return make_ready_future<temporary_buffer<char>>(temporary_buffer<char>(0)); } - if (_cur_frag != _buf.nr_frags()) { - auto& f = _buf.fragments()[_cur_frag++]; - return make_ready_future<temporary_buffer<char>>( - temporary_buffer<char>(f.base, f.size, - make_deleter(deleter(), [p = _buf.share()] () mutable {}))); + if (_buf.size() > 0) { + auto result = make_ready_future<temporary_buffer<char>>(std::move(_buf.front())); + _buf.pop_front(); + return result; }
We use spaces for indents, not tabs.
return _conn->wait_for_data().then([this] { - _buf = _conn->read(); - _cur_frag = 0; - _eof = !_buf.len(); + _buf = std::move(_conn->read()); + _eof = !_buf.size();
This causes an allocation for every packet. It's better to keep _buf's identity and move fragments into it.
Perhaps (much larger change) we should also change packet to keep a deleter per fragment rather than a deleter per packet. That will allow most fragments to have a free_deleter, which is super cheap.
return get(); }); } diff --git a/net/packet.hh b/net/packet.hh index 5d626a86..7c25a7b1 100644 --- a/net/packet.hh +++ b/net/packet.hh @@ -30,6 +30,7 @@ #include <iosfwd> #include "util/std-compat.hh" #include <functional> +#include <deque> namespace seastar { @@ -302,6 +303,13 @@ class packet final { }); return ret; } + + void release_to(std::deque<temporary_buffer<char>> &dest){ + release_into([&dest] (temporary_buffer<char>&& frag) { + dest.push_back(std::move(frag)); + }); + }
You should just open-code this at the call site.
Hi Avi, again thanks a lot for quick response. some comments inline.
On Tuesday, December 11, 2018 at 4:42:29 PM UTC+2, Avi Kivity wrote:
On 11/12/2018 16.24, Sasha Solganik wrote:
Hi Avi, thanks for very useful insight, Please take a look at this patch, it midifies RX patch so that connection stores queue of buffers and not packet.This increased performance approximately from <200K to 240-250K. (Still significantly below my current implementation using LWIP, but a significant progress).Also i do not observe packet::share anymore on top of my perf output.
What about increasing the fragment size?
I am not sure i understand what does it mean increasing fragment size, i`m not using LRO, so all RX fragments are at most 1500 bytes.
Oh. Why not use LRO? It will reduce allocation rates.
read() can return a packet, which is then release_into()ed the circular_buffer.
It's also possible to connect the two classes more strongly, but
I can't guess how that will turn out.
Perhaps (much larger change) we should also change packet to keep a deleter per fragment rather than a deleter per packet. That will allow most fragments to have a free_deleter, which is super cheap.
Yes i thought about this change also, and if we choose seastar over what we currently are using for sure we will invest in performance improvements. I do see huge value in seastar continuations framework, and how it speeds development of applications.My issue is that i must provide on par results with what i`m currently having to justify this move.
Well, it looks like you're not far away.
If the application is simple you will likely achieve the best results with a tightly integrated stack (e.g. LWIP + stuff hard coded).
If the application is complex (doing replication, scheduling,
deduplication, multi-tier storage) then a hard-coded application
is not feasible IMO and the simplification provided by Seastar
becomes dominant. I'm not saying we should strive to achieve the
best performance; I'm saying it has no contended because the other
approaches lead to complex code.
It can be the packet. Its vector was already allocated.
_rcv.data_size = 0; - _rcv.data.clear(); _rcv.window = get_default_receive_window_size(); - return p; + return result; } template <typename InetTraits> @@ -1872,7 +1868,7 @@ bool tcp<InetTraits>::tcb::merge_out_of_order() { seg_len -= trim; } _rcv.next += seg_len; - _rcv.data.push_back(std::move(p)); + p.release_to(_rcv.data); _rcv.data_size += p.len(); // Since c++11, erase() always returns the value of the following element it = _rcv.out_of_order.map.erase(it);
For a log time I wanted to move tcp (esp. the tx path) to working with temporary_buffers instead of packets, and only use packets when ready to send one to the interface, this is a step in that direction.
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/bfed3f81-6905-4816-b962-44aa8c1b9260%40googlegroups.com.