Improvements to native TCP stack and updates to DDPK

Nicolas Le Scouarnec

<nicolas.lescouarnec@broadpeak.tv>

unread,

Mar 26, 2021, 5:57:54 AM3/26/21

to seastar-dev

Dear all,

As we explained at Scylla Summit, at Broadpeak, we've been using Seastar with its DPDK stack to build an alternative to JMeter for video streaming service/CDN benchmarking. While doing this we identified a few issues in the TCP stack, at least for our usage. We would be happy to to upstream those changes/fixes if they seem of interest to you. We would also like to know if you're already aware of ongoing work for any item so that we avoid preparing redundant/conflicting patches.

Regarding the TCP stack, we made the following changes:

1) we turned several promise<> in tcp.hh into optional<promise<>> to avoid reports of broken futures when there is no consumer for the future. This could happen with duplicated reception of FIN, or RST packets for example.

2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).

3) the tcp (native stack) makes the assumption that the user code will always do

co_await write

flush

co_await read

co_await close

destroy

This is the case for protocols such as L7 protocols such as HTTP/1.1 that have request/response cycles. When you issue a GET (write + flush), and wait (reading) the response body, you implicitly waited for the flush (complete send over the network) to complete because there is a happen-before relationship between the query being received (hence the write+flush completing) and the response being sent back. With other protocols (HTTP/2, TLS...) we had issues because the "client" could write something to TCP, flush it and then close or destroy the TCP connection without waiting for a (L7) response.

This can lead to a UAF because the reactor (flush_poller) captures a raw pointer to the corresponding iostream implementation. If the flush does not complete when destruction of the iostream occurs we get a UAF. Our fix is to add a gate.enter() when passing the reference to the flush_poller / gate.leave() when the poll_flush is executed, and on iostream close() we wait on that gate.close().

4) We added a fix, to be discussed, for https://github.com/scylladb/seastar/issues/365

Regarding DPDK, we updated to DPDK20.11 (which uses a new meson-based build system). DPDK 20.11 brings various improvements and support for new hardware such as Mellanox ConnectX6.

1) We updated Seastar cooking rules/finddpdk.cmake accordingly. Seastar uses the pkg-config generated by meson.build to simplify the search for libs. We kept the approach of Seastar through by relinking everyhting into a single .o so that the complex dependency to DPDK, which requires specific link flags, is hidden to applications using Seastar.

2) We took an upstream DPDK 20.11 and dropped some commits that concern parts of code that have changed upstream or where related to the make build system.

We kept

net: i40e: add VLAN tag size to RXMAX …

ixgbevf: ixgbevf_dev_info_get(): return default tx_rs_thresh=1 …

We dropped

ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

which seems related to (or at least conflicts with)

net/ixgbe: fix Tx threshold setup -- 9f5d2352fe45bee575f50f4a7131f3beee52b43f

https://github.com/DPDK/dpdk/commit/9f5d2352fe45bee575f50f4a7131f3beee52b43f

3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload. And we also used Mellanox NIC at some point. For the purpose of zero copy, we added a few lines of codes to have bigger (configurable) DPDK rx_pool allocated on program start, and pass rte_mbuf without copy to seastar by wrapping them into seastar:temporary_buffer with a custom deleter calling rte_mbuf_free in a best effort way (reverting to copy if the pool is exhausted). We found this approach to be efficient on both Intel and Mellanox NICs, and this simplifies the integration of DPDK20.11.

4) The other limit of our integration so far is that we use all NICs in single queue mode, so we don't need to configure RSS. However, the API rte_eth_dev_filter_ctrl has been replaced by rte_flow, so this part will need to be updated before a complete integration. But the feature/configurability is still possible, so it's not a blocking point.

Let me know if these changes are of interest to you,

Best regards,

--

Nicolas Le Scouarnec

Broadpeak

Avi Kivity

<avi@scylladb.com>

unread,

Mar 26, 2021, 8:10:59 AM3/26/21

to Nicolas Le Scouarnec, seastar-dev, Asias He, Vladislav Zolotarov

On 3/26/21 12:57 PM, Nicolas Le Scouarnec wrote:

Dear all,

As we explained at Scylla Summit, at Broadpeak, we've been using Seastar with its DPDK stack to build an alternative to JMeter for video streaming service/CDN benchmarking. While doing this we identified a few issues in the TCP stack, at least for our usage. We would be happy to to upstream those changes/fixes if they seem of interest to you. We would also like to know if you're already aware of ongoing work for any item so that we avoid preparing redundant/conflicting patches.

Please post patches. I'm not aware of conflicting work.

Regarding the TCP stack, we made the following changes:

1) we turned several promise<> in tcp.hh into optional<promise<>> to avoid reports of broken futures when there is no consumer for the future. This could happen with duplicated reception of FIN, or RST packets for example.

That sounds reasonable.

We should add a debug feature that breaks hard when set_value() is called twice on a promise. But of course that is independent of a fix.

2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).

commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date:   Wed Feb 4 10:39:54 2015 +0800

    packet: Linearize more in packet_merger::merge

    This fix tcp_server rxrx test on DPDK. The problem is that when we
    receive out of order packets, we will hold the packet in the ooo queue.
    We do linearize on the incoming packet which will copy the packet and
    thus free the old packet. However, we missed one case where we need to
    linearize. As a result, the original packet will be held in the ooo
    queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk
    buffer are in ooo queue, we will not be able to receive further packets.
    So rx hangs, even ping will not work.

So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.

The deadlock is not limited to out-of-order reception. For example, all packets could be in tcp streams that are not being read.

Potential solutions:

- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.

- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.

3) the tcp (native stack) makes the assumption that the user code will always do

co_await write

flush

co_await read

co_await close

destroy

This is the case for protocols such as L7 protocols such as HTTP/1.1 that have request/response cycles. When you issue a GET (write + flush), and wait (reading) the response body, you implicitly waited for the flush (complete send over the network) to complete because there is a happen-before relationship between the query being received (hence the write+flush completing) and the response being sent back. With other protocols (HTTP/2, TLS...) we had issues because the "client" could write something to TCP, flush it and then close or destroy the TCP connection without waiting for a (L7) response.

This can lead to a UAF because the reactor (flush_poller) captures a raw pointer to the corresponding iostream implementation. If the flush does not complete when destruction of the iostream occurs we get a UAF. Our fix is to add a gate.enter() when passing the reference to the flush_poller / gate.leave() when the poll_flush is executed, and on iostream close() we wait on that gate.close().

4) We added a fix, to be discussed, for https://github.com/scylladb/seastar/issues/365

Regarding DPDK, we updated to DPDK20.11 (which uses a new meson-based build system). DPDK 20.11 brings various improvements and support for new hardware such as Mellanox ConnectX6.

This would be most welcome.

1) We updated Seastar cooking rules/finddpdk.cmake accordingly. Seastar uses the pkg-config generated by meson.build to simplify the search for libs. We kept the approach of Seastar through by relinking everyhting into a single .o so that the complex dependency to DPDK, which requires specific link flags, is hidden to applications using Seastar.

2) We took an upstream DPDK 20.11 and dropped some commits that concern parts of code that have changed upstream or where related to the make build system.

We kept

net: i40e: add VLAN tag size to RXMAX …

ixgbevf: ixgbevf_dev_info_get(): return default tx_rs_thresh=1 …

We dropped

ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

which seems related to (or at least conflicts with)

net/ixgbe: fix Tx threshold setup -- 9f5d2352fe45bee575f50f4a7131f3beee52b43f

https://github.com/DPDK/dpdk/commit/9f5d2352fe45bee575f50f4a7131f3beee52b43f

Maybe Vlad can tell if the patches fix the same problem. I think it's likely and that we can drop the local patch.

3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.

Can you elaborate on this? I was hoping vfio can make deploying dpdk easier and with fewer constraints (it also relates to one proposed solution to the deadlock problem, above), so I'd be sad to lose it. Of course, if vfio makes dpdk's advantages disappear then it's not worth it.

And we also used Mellanox NIC at some point. For the purpose of zero copy, we added a few lines of codes to have bigger (configurable) DPDK rx_pool allocated on program start, and pass rte_mbuf without copy to seastar by wrapping them into seastar:temporary_buffer with a custom deleter calling rte_mbuf_free in a best effort way (reverting to copy if the pool is exhausted). We found this approach to be efficient on both Intel and Mellanox NICs, and this simplifies the integration of DPDK20.11.

This looks similar to the second solution I proposed for the deadlock problem.

4) The other limit of our integration so far is that we use all NICs in single queue mode, so we don't need to configure RSS. However, the API rte_eth_dev_filter_ctrl has been replaced by rte_flow, so this part will need to be updated before a complete integration. But the feature/configurability is still possible, so it's not a blocking point.

Let me know if these changes are of interest to you,

Yes they are. I recommend posting patches individually so they can be applied without interdependencies, when possible.

Vladislav Zolotarov

<vladz@scylladb.com>

unread,

Mar 26, 2021, 9:48:40 AM3/26/21

to Avi Kivity, Nicolas Le Scouarnec, seastar-dev, Asias He

I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?

3) the tcp (native stack) makes the assumption that the user code will always do

co_await write

flush

co_await read

co_await close

destroy

This is the case for protocols such as L7 protocols such as HTTP/1.1 that have request/response cycles. When you issue a GET (write + flush), and wait (reading) the response body, you implicitly waited for the flush (complete send over the network) to complete because there is a happen-before relationship between the query being received (hence the write+flush completing) and the response being sent back. With other protocols (HTTP/2, TLS...) we had issues because the "client" could write something to TCP, flush it and then close or destroy the TCP connection without waiting for a (L7) response.

This can lead to a UAF because the reactor (flush_poller) captures a raw pointer to the corresponding iostream implementation. If the flush does not complete when destruction of the iostream occurs we get a UAF. Our fix is to add a gate.enter() when passing the reference to the flush_poller / gate.leave() when the poll_flush is executed, and on iostream close() we wait on that gate.close().

4) We added a fix, to be discussed, for https://github.com/scylladb/seastar/issues/365

Regarding DPDK, we updated to DPDK20.11 (which uses a new meson-based build system). DPDK 20.11 brings various improvements and support for new hardware such as Mellanox ConnectX6.

This would be most welcome.

1) We updated Seastar cooking rules/finddpdk.cmake accordingly. Seastar uses the pkg-config generated by meson.build to simplify the search for libs. We kept the approach of Seastar through by relinking everyhting into a single .o so that the complex dependency to DPDK, which requires specific link flags, is hidden to applications using Seastar.

2) We took an upstream DPDK 20.11 and dropped some commits that concern parts of code that have changed upstream or where related to the make build system.

We kept

net: i40e: add VLAN tag size to RXMAX …

ixgbevf: ixgbevf_dev_info_get(): return default tx_rs_thresh=1 …

We dropped

ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

which seems related to (or at least conflicts with)

net/ixgbe: fix Tx threshold setup -- 9f5d2352fe45bee575f50f4a7131f3beee52b43f

https://github.com/DPDK/dpdk/commit/9f5d2352fe45bee575f50f4a7131f3beee52b43f

Maybe Vlad can tell if the patches fix the same problem. I think it's likely and that we can drop the local patch.

No, the issue fixed in the local patch is not addressed by the patch you referenced.
And these patches are indeed in conflict.
I started my career in the semiconductors company (Mellanox) and I tend to believe specs more than SW engineers. ;)
And specs of NICs mentioned in my patch are very clear about the flow SW needs to follow, while the original DPDK code and the patch you referenced above are trying to hack something using some undocumented (at least) logic.

I'd recommend not to remove the local patch until Intel specs are updated and till then follow the guidance described in those Specs ("Programmer Manual" to be precise).

My impression was/is that DPDK people are trying to optimize a "small single buffer packets" use case on account the robustness of other use cases which includes a TCP use case as well.

Avi Kivity

<avi@scylladb.com>

unread,

Mar 26, 2021, 10:12:26 AM3/26/21

to Vladislav Zolotarov, Nicolas Le Scouarnec, seastar-dev, Asias He

I don't think Linux has something equivalent to an rx pool (other than the DMA32 zone). It can probably allocate all of memory into skbuffs.

Dropping ooo packets is possible now, but it's not possible if we implement selective acknowledgements, so I don't think it's a good path forward.

Nicolas Le Scouarnec

<nicolas.lescouarnec@broadpeak.tv>

unread,

Mar 26, 2021, 3:54:13 PM3/26/21

to seastar-dev

On Friday, March 26, 2021 at 2:48:40 PM UTC+1 Vladislav Zolotarov wrote:

2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).

commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date:   Wed Feb 4 10:39:54 2015 +0800

    packet: Linearize more in packet_merger::merge

    This fix tcp_server rxrx test on DPDK. The problem is that when we
    receive out of order packets, we will hold the packet in the ooo queue.
    We do linearize on the incoming packet which will copy the packet and
    thus free the old packet. However, we missed one case where we need to
    linearize. As a result, the original packet will be held in the ooo
    queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk
    buffer are in ooo queue, we will not be able to receive further packets.
    So rx hangs, even ping will not work.

So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.

The deadlock is not limited to out-of-order reception. For example, all packets could be in tcp streams that are not being read.

Potential solutions:

- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.

- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.

I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?

Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).

Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""

Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:

template<>

inline std::optional<packet>

dpdk_qp<false>::from_mbuf(rte_mbuf* m)

{

if (!_dev->hw_features_ref().rx_lro || rte_pktmbuf_is_contiguous(m)) {

//

// Try to allocate a buffer for packet's data. If we fail - give the

// application an mbuf itself. If we succeed - copy the data into this

// buffer, create a packet based on this buffer and return the mbuf to

// its pool.

//

auto len = rte_pktmbuf_data_len(m);

char* buf = rte_pktmbuf_mtod(m, char*);

if(rte_mempool_avail_count(_pktmbuf_pool_rx) > mbufs_per_queue_rx) {

// There remains sufficient buffers in the pool to allow zero copy

return packet(fragment{buf, len}, seastar::make_deleter(seastar::deleter(), [m]() {rte_pktmbuf_free(m);}));

}else{

if(_dpdk_zc_pool_size > 0) _stats.rx.bad.missed_zc_opportunities++; /* If zero-copy is enabled, count missed opportunities */

char* buf = (char*)malloc(len);

rte_memcpy(buf, rte_pktmbuf_mtod(m, char*), len);

I'll figure out how to merge it then. However, I don't have ixgbe NICs for testing.

3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.

Can you elaborate on this? I was hoping vfio can make deploying dpdk easier and with fewer constraints (it also relates to one proposed solution to the deadlock problem, above), so I'd be sad to lose it. Of course, if vfio makes dpdk's advantages disappear then it's not worth it.

I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).

Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.

Yes they are. I recommend posting patches individually so they can be applied without interdependencies, when possible.

I'll prepare these.

Best regards,

Avi Kivity

<avi@scylladb.com>

unread,

Mar 27, 2021, 9:24:00 AM3/27/21

to Nicolas Le Scouarnec, seastar-dev

On 3/26/21 10:54 PM, Nicolas Le Scouarnec wrote:

On Friday, March 26, 2021 at 2:48:40 PM UTC+1 Vladislav Zolotarov wrote:

2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).

commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date:   Wed Feb 4 10:39:54 2015 +0800

    packet: Linearize more in packet_merger::merge

    This fix tcp_server rxrx test on DPDK. The problem is that when we
    receive out of order packets, we will hold the packet in the ooo queue.
    We do linearize on the incoming packet which will copy the packet and
    thus free the old packet. However, we missed one case where we need to
    linearize. As a result, the original packet will be held in the ooo
    queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk
    buffer are in ooo queue, we will not be able to receive further packets.
    So rx hangs, even ping will not work.

So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.

The deadlock is not limited to out-of-order reception. For example, all packets could be in tcp streams that are not being read.

Potential solutions:

- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.

- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.

I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?

Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).

Consider a sender that sends packets with one byte each, only even bytes. Each such packet occupies 4k (or 2k?). A 100k window has 50k even bytes, and so needs 200M of memory.

Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""

Well, that is the copy path. We'd like to use the zero-copy path, no?

Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:

100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.

template<>

inline std::optional<packet>

dpdk_qp<false>::from_mbuf(rte_mbuf* m)

{

    if (!_dev->hw_features_ref().rx_lro || rte_pktmbuf_is_contiguous(m)) {

        //

        // Try to allocate a buffer for packet's data. If we fail - give the

        // application an mbuf itself. If we succeed - copy the data into this

        // buffer, create a packet based on this buffer and return the mbuf to

        // its pool.

        //

        auto len = rte_pktmbuf_data_len(m);

        char* buf = rte_pktmbuf_mtod(m, char*);

if(rte_mempool_avail_count(_pktmbuf_pool_rx) > mbufs_per_queue_rx) {

            // There remains sufficient buffers in the pool to allow zero copy

            return packet(fragment{buf, len}, seastar::make_deleter(seastar::deleter(), [m]() {rte_pktmbuf_free(m);}));

        }else{

            if(_dpdk_zc_pool_size > 0) _stats.rx.bad.missed_zc_opportunities++; /* If zero-copy is enabled, count missed opportunities */

            char* buf = (char*)malloc(len);

rte_memcpy(buf, rte_pktmbuf_mtod(m, char*), len);

Alternatively, it may be possible to replenish a pool, and so we can increase a pool size if there is demand, and reduce it when there is not.

We may be able to help. I also think some older AWS instances have them.

3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.

Can you elaborate on this? I was hoping vfio can make deploying dpdk easier and with fewer constraints (it also relates to one proposed solution to the deadlock problem, above), so I'd be sad to lose it. Of course, if vfio makes dpdk's advantages disappear then it's not worth it.

I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).

I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.

The code looks terrible:

phys_addr_t
rte_mem_virt2phy(const void *virtaddr)
{
        int fd, retval;
        uint64_t page, physaddr;
        unsigned long virt_pfn;
        int page_size;
        off_t offset;

        if (phys_addrs_available == 0)
                return RTE_BAD_IOVA;

        /* standard page size */
        page_size = getpagesize();

        fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) {
                RTE_LOG(INFO, EAL, "%s(): cannot open /proc/self/pagemap: %s\n",
                        __func__, strerror(errno));
                return RTE_BAD_IOVA;
        }

        virt_pfn = (unsigned long)virtaddr / page_size;
        offset = sizeof(uint64_t) * virt_pfn;
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1) {
                RTE_LOG(INFO, EAL, "%s(): seek error in /proc/self/pagemap: %s\n",
                                __func__, strerror(errno));
                close(fd);
                return RTE_BAD_IOVA;
        }

        retval = read(fd, &page, PFN_MASK_SIZE);
        close(fd);

It looks like it was just written for it to work with no regard for performance.

Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.

I don't understand, if you have vfio then you don't need hugepages.

Yes they are. I recommend posting patches individually so they can be applied without interdependencies, when possible.

I'll prepare these.

Best regards,

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/7c0182eb-52c3-48af-8c54-a2994d5d9c74n%40googlegroups.com.

Nicolas Le Scouarnec

<nicolas.lescouarnec@broadpeak.tv>

unread,

Mar 30, 2021, 5:34:54 PM3/30/21

to seastar-dev

On Saturday, March 27, 2021 at 2:24:00 PM UTC+1 Avi Kivity wrote:

Potential solutions:

- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.

- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.

I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?

Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).

Consider a sender that sends packets with one byte each, only even bytes. Each such packet occupies 4k (or 2k?). A 100k window has 50k even bytes, and so needs 200M of memory.

I understand your point, I had ignored the fact that the sender could be malicious and deliberately attempt to exploit this. That said, we should account for both a memory-exhaustion attack, such as the one you describe, but also a complexity attack as the current linearization has a complexity in n² for the same workload (copy 1 byte, copy 2 byte, copy 3 bytes, ....) as it reallocates a buffer of the total size and copies everything on each packet reception. In addition it forces the CPU to touch the data (while it may not have accessed the packet data yet / may not use it immediately afterwards, thus resulting in cache pollution).

I had a look at FreeBSD code, and Linux code. As far as I understand the approach is to keep a fragment structure but to collapse "small" buffers into the remaining room of the previous buffer if any, with possibly a limit to avoid a copy if the chunk is already large enough (condition is: m->m_len <= MCLBYTES / 4 && /* XXX: Don't copy too much */ m->m_len <= M_TRAILINGSPACE(n) where m is received packet and n is the already existing chain ).

https://www.leidinger.net/FreeBSD/dox/kern/html/de/d20/uipc__sockbuf_8c_source.html#l01298

https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L5206

This would reduce the memory pressure for your adversarial example, yet avoid memcpys for large enough packets (thus reducing computational cost and preservering zero-copy). However, the current fragment structure does not carry any information on the real size of the underlying buffer to allow such an in-place "merge" to occur, even if the underlying buffer (rte_mbuf) had tailroom (or headroom) initally.

Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""

Well, that is the copy path. We'd like to use the zero-copy path, no?

Sure, but for the zero-copy path since the DPDK pool is not fixed but is refilled with newly seastar-allocated buffers, unless there is no memory at all remaining for the seastar allocator, the deadlock is unlikely also. However, it stress an interesting point: if we want to have "zero-copy", we also need data not to be copied due to a linearize call discussed just above.

Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:

100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.

I saw it evolved, you still need a rx_pool but apparently the dynamic allocation you took should work better, for all NICs/drivers as there is now a "uniform" API to handle this.

I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).

I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.

I think so. Regarding this, as this is a part of the code that'll have to evolve as DPDK API changed, do you know what was the rationale for keeping two code paths (HugeTlbFs / No HugeTlbFs ?) . I think it is related to this where Mellanox card could not be supported: https://groups.google.com/g/seastar-dev/c/fuQHNVE8db0/m/Fsh-ZkY4BgAJ -- but now DPDK provides a uniform approach for both NICs so it should be possible to avoid this issue ( rte_extmem_register, rte_dev_dma_map , https://mails.dpdk.org/archives/dev/2019-February/124690.html ) . Finally, the non-IOVA_VA based-path remain necessary for VMs without vIOMMU support / machines without IOMMU (or IOMMU disabled); or would you like to keep just one variant of the code, dropping support for uncommon/old hardware and assume that going forward everything will support IOVA_VA (or equivalent such as Mellanox) ?

Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.

I don't understand, if you have vfio then you don't need hugepages.

To me DPDK always use(d) hugepages and fail(ed) to start if none are configured in the system. It seems that now it is not necessary in the newest versions (where newest can be more than 2 years old). So it may change too. But seastar ask DPDK to start without huge-pages only if we don't enable the dpdk-pmd. Anyway, as per your recommendation of incremental patches, I think I'm going to try feature parity with DPDK 20.11, and leave as seconds steps optimization (reduction) of fixed memory dedicated to DPDK , or more advanced external memory usage made possible in the newest versions.

} else if (!opts.count("dpdk-pmd")) {

args.push_back(string2vector("--no-huge"));

}

Best regards,

Avi Kivity

<avi@scylladb.com>

unread,

Mar 31, 2021, 8:10:45 AM3/31/21

to Nicolas Le Scouarnec, seastar-dev, Vladislav Zolotarov

On 3/31/21 12:34 AM, Nicolas Le Scouarnec wrote:

On Saturday, March 27, 2021 at 2:24:00 PM UTC+1 Avi Kivity wrote:

Potential solutions:

- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.

- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.

I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?

Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).

Consider a sender that sends packets with one byte each, only even bytes. Each such packet occupies 4k (or 2k?). A 100k window has 50k even bytes, and so needs 200M of memory.

I understand your point, I had ignored the fact that the sender could be malicious and deliberately attempt to exploit this. That said, we should account for both a memory-exhaustion attack, such as the one you describe, but also a complexity attack as the current linearization has a complexity in n² for the same workload (copy 1 byte, copy 2 byte, copy 3 bytes, ....) as it reallocates a buffer of the total size and copies everything on each packet reception. In addition it forces the CPU to touch the data (while it may not have accessed the packet data yet / may not use it immediately afterwards, thus resulting in cache pollution).

Right, we need some kind oh hybrid solution. First optimizing for work/cpu, but when memory limits are reached, optimize for memory.

I had a look at FreeBSD code, and Linux code. As far as I understand the approach is to keep a fragment structure but to collapse "small" buffers into the remaining room of the previous buffer if any, with possibly a limit to avoid a copy if the chunk is already large enough (condition is: m->m_len <= MCLBYTES / 4 && /* XXX: Don't copy too much */ m->m_len <= M_TRAILINGSPACE(n) where m is received packet and n is the already existing chain ).

https://www.leidinger.net/FreeBSD/dox/kern/html/de/d20/uipc__sockbuf_8c_source.html#l01298

https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L5206

This would reduce the memory pressure for your adversarial example, yet avoid memcpys for large enough packets (thus reducing computational cost and preservering zero-copy). However, the current fragment structure does not carry any information on the real size of the underlying buffer to allow such an in-place "merge" to occur, even if the underlying buffer (rte_mbuf) had tailroom (or headroom) initally.

Since our `class packet` (via temporary_buffer) allows splitting and sharing, you couldn't know that this extra space isn't used by another packet.

Maybe we could: if the temporary_buffer tells us its deleter was created with raw_object_tag, and is unshared, we'd know we can expand it to malloc_usable_size(). Perhaps we'd need to adjust the sharing mechanism to make sure we don't have false positives.

Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""

Well, that is the copy path. We'd like to use the zero-copy path, no?

Sure, but for the zero-copy path since the DPDK pool is not fixed but is refilled with newly seastar-allocated buffers, unless there is no memory at all remaining for the seastar allocator, the deadlock is unlikely also.

We still don't want to fill all memory with packets that are waiting for a remote sender. We'll want something like sys.net.ipv4.tcp_mem to limit global memory in tcp.

It may not matter for an application that does pure networking, but an application that is memory intensive would like to limit competition for memory.

However, it stress an interesting point: if we want to have "zero-copy", we also need data not to be copied due to a linearize call discussed just above.

Certainly. The first choice should be to store the packet as is in the reordering buffer.

Linearization should happen if some heuristics apply:

- if the packet is smaller than some threshold, minify it by copying to a new packet. This will ignore normal packets that aren't tiny. This can apply even if the packet doesn't overlap/touch another packet.

- if the total amount of memory in all reordering buffers is too high, start linearizing until it drops off

- if a particular reordering buffer has too many elements in the reordering buffer (i.e. it's under attack), send RST

These are just random ideas, we might come up with better ones, and they don't have to be blockers for your patches.

Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:

100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.

I saw it evolved, you still need a rx_pool but apparently the dynamic allocation you took should work better, for all NICs/drivers as there is now a "uniform" API to handle this.

Well, that would be best. Using hugetlbfs is a deployment headache as it often requires a reboot to reserve on the first install. It's probably not a problem if you're consuming your own code, but it is for ISVs.

I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).

I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.

I think so. Regarding this, as this is a part of the code that'll have to evolve as DPDK API changed, do you know what was the rationale for keeping two code paths (HugeTlbFs / No HugeTlbFs ?) .

Vlad wrote this, copying him.

I think it is related to this where Mellanox card could not be supported: https://groups.google.com/g/seastar-dev/c/fuQHNVE8db0/m/Fsh-ZkY4BgAJ -- but now DPDK provides a uniform approach for both NICs so it should be possible to avoid this issue ( rte_extmem_register, rte_dev_dma_map , https://mails.dpdk.org/archives/dev/2019-February/124690.html ) . Finally, the non-IOVA_VA based-path remain necessary for VMs without vIOMMU support / machines without IOMMU (or IOMMU disabled); or would you like to keep just one variant of the code, dropping support for uncommon/old hardware and assume that going forward everything will support IOVA_VA (or equivalent such as Mellanox) ?

I think we still need to support IOVA_PA since many virtual machines won't have a virtual iommu.

But, I'd like to drop the copying path and hugetlbfs.

Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.

I don't understand, if you have vfio then you don't need hugepages.

To me DPDK always use(d) hugepages and fail(ed) to start if none are configured in the system. It seems that now it is not necessary in the newest versions (where newest can be more than 2 years old). So it may change too. But seastar ask DPDK to start without huge-pages only if we don't enable the dpdk-pmd. Anyway, as per your recommendation of incremental patches, I think I'm going to try feature parity with DPDK 20.11, and leave as seconds steps optimization (reduction) of fixed memory dedicated to DPDK , or more advanced external memory usage made possible in the newest versions.

} else if (!opts.count("dpdk-pmd")) {

args.push_back(string2vector("--no-huge"));

}

Many of the original implementation choices don't make sense now. If we can settle on vfio (with IOVA_VA or IOVA_PA as available) that would make everything easier. I think.

Best regards,

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/ed5c552f-e53f-48e8-909b-3dd63b9f3698n%40googlegroups.com.

Vladislav Zolotarov

<vladz@scylladb.com>

unread,

Apr 1, 2021, 6:25:59 PM4/1/21

to Avi Kivity, Nicolas Le Scouarnec, seastar-dev

RST seems a bit harsh.
This may happen not only in the context of the attack.
Moving the connection into a Slow Start mod would be sufficient IMO.

My understanding that today it's an issue due to us lacking heuristics described in the first bullet above. I think we should start with fixing this and this is likely going to address the vast majority of the use cases since the TCP window should get closed before we run out of buffers.

However before all that - why do we discuss an L2 level buffers pool in the context of TCP (L4) handling.
L2 should have no notion about upper layers specifics and out-of -order in particular.

Namely buffers in the L2 buffers pool should be refilled immediately once pulled.
This means that we should only get our networking stuck if seastar runs our of memory completely and never because the L2 buffers pool is empty.

And my understanding was that his IS the case.
Isn't it, Nicolas?

These are just random ideas, we might come up with better ones, and they don't have to be blockers for your patches.

Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:

100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.

I saw it evolved, you still need a rx_pool but apparently the dynamic allocation you took should work better, for all NICs/drivers as there is now a "uniform" API to handle this.

Well, that would be best. Using hugetlbfs is a deployment headache as it often requires a reboot to reserve on the first install. It's probably not a problem if you're consuming your own code, but it is for ISVs.

I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).

I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.

I think so. Regarding this, as this is a part of the code that'll have to evolve as DPDK API changed, do you know what was the rationale for keeping two code paths (HugeTlbFs / No HugeTlbFs ?) .

Vlad wrote this, copying him.

This was written in order to be able to allow seastar work when DPDK maintains huge pages (NoTLBFS) and when it's maintained externally (by us).
It had nothing to do with a specific vendor.

Reply all

Reply to author

Forward