Dear all,
As we explained at Scylla Summit, at Broadpeak, we've been using Seastar with its DPDK stack to build an alternative to JMeter for video streaming service/CDN benchmarking. While doing this we identified a few issues in the TCP stack, at least for our usage. We would be happy to to upstream those changes/fixes if they seem of interest to you. We would also like to know if you're already aware of ongoing work for any item so that we avoid preparing redundant/conflicting patches.
Please post patches. I'm not aware of conflicting work.
Regarding the TCP stack, we made the following changes:1) we turned several promise<> in tcp.hh into optional<promise<>> to avoid reports of broken futures when there is no consumer for the future. This could happen with duplicated reception of FIN, or RST packets for example.
That sounds reasonable.
We should add a debug feature that breaks hard when set_value()
is called twice on a promise. But of course that is independent of
a fix.
2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).
commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date: Wed Feb 4 10:39:54 2015 +0800
packet: Linearize more in packet_merger::merge
This fix tcp_server rxrx test on DPDK. The problem is that
when we
receive out of order packets, we will hold the packet in the
ooo queue.
We do linearize on the incoming packet which will copy the
packet and
thus free the old packet. However, we missed one case where we
need to
linearize. As a result, the original packet will be held in
the ooo
queue. In DPDK, we have fixed buffer in the rx pool. When all
the dpdk
buffer are in ooo queue, we will not be able to receive
further packets.
So rx hangs, even ping will not work.
So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.
The deadlock is not limited to out-of-order reception. For
example, all packets could be in tcp streams that are not being
read.
Potential solutions:
- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.
- account for zero-copy packets that are "hanging". When the
number of zero-copy packets that haven't been recycled yet exceeds
some threshold, drop the zero-copy optimization and copy packets.
This prevents deadlock at the expense of copying packets when the
system is stressed.
3) the tcp (native stack) makes the assumption that the user code will always doco_await writeflushco_await readco_await closedestroyThis is the case for protocols such as L7 protocols such as HTTP/1.1 that have request/response cycles. When you issue a GET (write + flush), and wait (reading) the response body, you implicitly waited for the flush (complete send over the network) to complete because there is a happen-before relationship between the query being received (hence the write+flush completing) and the response being sent back. With other protocols (HTTP/2, TLS...) we had issues because the "client" could write something to TCP, flush it and then close or destroy the TCP connection without waiting for a (L7) response.This can lead to a UAF because the reactor (flush_poller) captures a raw pointer to the corresponding iostream implementation. If the flush does not complete when destruction of the iostream occurs we get a UAF. Our fix is to add a gate.enter() when passing the reference to the flush_poller / gate.leave() when the poll_flush is executed, and on iostream close() we wait on that gate.close().
4) We added a fix, to be discussed, for https://github.com/scylladb/seastar/issues/365
Regarding DPDK, we updated to DPDK20.11 (which uses a new meson-based build system). DPDK 20.11 brings various improvements and support for new hardware such as Mellanox ConnectX6.
This would be most welcome.
1) We updated Seastar cooking rules/finddpdk.cmake accordingly. Seastar uses the pkg-config generated by meson.build to simplify the search for libs. We kept the approach of Seastar through by relinking everyhting into a single .o so that the complex dependency to DPDK, which requires specific link flags, is hidden to applications using Seastar.
2) We took an upstream DPDK 20.11 and dropped some commits that concern parts of code that have changed upstream or where related to the make build system.
We kept
We droppedwhich seems related to (or at least conflicts with)net/ixgbe: fix Tx threshold setup -- 9f5d2352fe45bee575f50f4a7131f3beee52b43f
Maybe Vlad can tell if the patches fix the same problem. I think
it's likely and that we can drop the local patch.
3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.
Can you elaborate on this? I was hoping vfio can make deploying
dpdk easier and with fewer constraints (it also relates to one
proposed solution to the deadlock problem, above), so I'd be sad
to lose it. Of course, if vfio makes dpdk's advantages disappear
then it's not worth it.
And we also used Mellanox NIC at some point. For the purpose of zero copy, we added a few lines of codes to have bigger (configurable) DPDK rx_pool allocated on program start, and pass rte_mbuf without copy to seastar by wrapping them into seastar:temporary_buffer with a custom deleter calling rte_mbuf_free in a best effort way (reverting to copy if the pool is exhausted). We found this approach to be efficient on both Intel and Mellanox NICs, and this simplifies the integration of DPDK20.11.
This looks similar to the second solution I proposed for the
deadlock problem.
4) The other limit of our integration so far is that we use all NICs in single queue mode, so we don't need to configure RSS. However, the API rte_eth_dev_filter_ctrl has been replaced by rte_flow, so this part will need to be updated before a complete integration. But the feature/configurability is still possible, so it's not a blocking point.
Let me know if these changes are of interest to you,
Yes they are. I recommend posting patches individually so they
can be applied without interdependencies, when possible.
3) the tcp (native stack) makes the assumption that the user code will always doco_await writeflushco_await readco_await closedestroyThis is the case for protocols such as L7 protocols such as HTTP/1.1 that have request/response cycles. When you issue a GET (write + flush), and wait (reading) the response body, you implicitly waited for the flush (complete send over the network) to complete because there is a happen-before relationship between the query being received (hence the write+flush completing) and the response being sent back. With other protocols (HTTP/2, TLS...) we had issues because the "client" could write something to TCP, flush it and then close or destroy the TCP connection without waiting for a (L7) response.This can lead to a UAF because the reactor (flush_poller) captures a raw pointer to the corresponding iostream implementation. If the flush does not complete when destruction of the iostream occurs we get a UAF. Our fix is to add a gate.enter() when passing the reference to the flush_poller / gate.leave() when the poll_flush is executed, and on iostream close() we wait on that gate.close().
4) We added a fix, to be discussed, for https://github.com/scylladb/seastar/issues/365
Regarding DPDK, we updated to DPDK20.11 (which uses a new meson-based build system). DPDK 20.11 brings various improvements and support for new hardware such as Mellanox ConnectX6.
This would be most welcome.
1) We updated Seastar cooking rules/finddpdk.cmake accordingly. Seastar uses the pkg-config generated by meson.build to simplify the search for libs. We kept the approach of Seastar through by relinking everyhting into a single .o so that the complex dependency to DPDK, which requires specific link flags, is hidden to applications using Seastar.
2) We took an upstream DPDK 20.11 and dropped some commits that concern parts of code that have changed upstream or where related to the make build system.
We kept
We droppedwhich seems related to (or at least conflicts with)net/ixgbe: fix Tx threshold setup -- 9f5d2352fe45bee575f50f4a7131f3beee52b43f
Maybe Vlad can tell if the patches fix the same problem. I think it's likely and that we can drop the local patch.
I don't think Linux has something equivalent to an rx pool (other
than the DMA32 zone). It can probably allocate all of memory into
skbuffs.
Dropping ooo packets is possible now, but it's not possible if we
implement selective acknowledgements, so I don't think it's a good
path forward.
2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).
commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date: Wed Feb 4 10:39:54 2015 +0800
packet: Linearize more in packet_merger::merge
This fix tcp_server rxrx test on DPDK. The problem is that when we
receive out of order packets, we will hold the packet in the ooo queue.
We do linearize on the incoming packet which will copy the packet and
thus free the old packet. However, we missed one case where we need to
linearize. As a result, the original packet will be held in the ooo
queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk
buffer are in ooo queue, we will not be able to receive further packets.
So rx hangs, even ping will not work.
So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.
The deadlock is not limited to out-of-order reception. For example, all packets could be in tcp streams that are not being read.
Potential solutions:
- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.
- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.
I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?
3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.
Can you elaborate on this? I was hoping vfio can make deploying dpdk easier and with fewer constraints (it also relates to one proposed solution to the deadlock problem, above), so I'd be sad to lose it. Of course, if vfio makes dpdk's advantages disappear then it's not worth it.
Yes they are. I recommend posting patches individually so they can be applied without interdependencies, when possible.
On Friday, March 26, 2021 at 2:48:40 PM UTC+1 Vladislav Zolotarov wrote:
2) the TCP stack (packet_merger) performs a linearization on each packet reception. While a packet is missing (at the beginning, preventing delivery of data to the upper layers, due to out of order reception/retransmission) this linearization will reallocate/copy the buffer at each packet reception leading to very significant CPU costs as soon as a few packet losses/out-of-order deliveries slip in. We actually removed the linearization for tcp, and deliver the content of packets to the upper layer input stream directly in order to increase zero-copy opportunities (potentially calling multiple times the input_stream with smaller amounts of data but without copying data).
commit 6a468dfd3dadbcd3c29bb684b6bd2c6059bfc5a3
Author: Asias He <as...@cloudius-systems.com>
Date: Wed Feb 4 10:39:54 2015 +0800
packet: Linearize more in packet_merger::merge
This fix tcp_server rxrx test on DPDK. The problem is that when we
receive out of order packets, we will hold the packet in the ooo queue.
We do linearize on the incoming packet which will copy the packet and
thus free the old packet. However, we missed one case where we need to
linearize. As a result, the original packet will be held in the ooo
queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk
buffer are in ooo queue, we will not be able to receive further packets.
So rx hangs, even ping will not work.
So it looks like there is a deadlock opportunity here. Someone could (accidentally or maliciously) exhaust all dedicated packet memory and prevent further reception.
The deadlock is not limited to out-of-order reception. For example, all packets could be in tcp streams that are not being read.
Potential solutions:
- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.
- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.
I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?
Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).
Consider a sender that sends packets with one byte each, only
even bytes. Each such packet occupies 4k (or 2k?). A 100k window
has 50k even bytes, and so needs 200M of memory.
Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""
Well, that is the copy path. We'd like to use the zero-copy path,
no?
Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:
100,000 packets is 400MB (assuming 4k packets). That's quite a
lot of overhead, and per-shard if we have enough queues. I'd like
a more dynamic solution that doesn't reserve so much memory. Maybe
dpdk evolved enough not to require pre-allocation of rx packets.
template<>inline std::optional<packet>dpdk_qp<false>::from_mbuf(rte_mbuf* m){if (!_dev->hw_features_ref().rx_lro || rte_pktmbuf_is_contiguous(m)) {//// Try to allocate a buffer for packet's data. If we fail - give the// application an mbuf itself. If we succeed - copy the data into this// buffer, create a packet based on this buffer and return the mbuf to// its pool.//auto len = rte_pktmbuf_data_len(m);char* buf = rte_pktmbuf_mtod(m, char*);
if(rte_mempool_avail_count(_pktmbuf_pool_rx) > mbufs_per_queue_rx) {// There remains sufficient buffers in the pool to allow zero copyreturn packet(fragment{buf, len}, seastar::make_deleter(seastar::deleter(), [m]() {rte_pktmbuf_free(m);}));}else{if(_dpdk_zc_pool_size > 0) _stats.rx.bad.missed_zc_opportunities++; /* If zero-copy is enabled, count missed opportunities */
char* buf = (char*)malloc(len);rte_memcpy(buf, rte_pktmbuf_mtod(m, char*), len);
Alternatively, it may be possible to replenish a pool, and so we
can increase a pool size if there is demand, and reduce it when
there is not.
We may be able to help. I also think some older AWS instances
have them.
3) The Zero-Copy mechanism uses functions that have been deprecated/removed (part of the code with HugeTLBFSMembackend=true). However, in our previous experiment we also found that the overhead of rte_mem_virt2iova/rte_vfio_dma_map made the approach not worth it on our setup/workload.
Can you elaborate on this? I was hoping vfio can make deploying dpdk easier and with fewer constraints (it also relates to one proposed solution to the deadlock problem, above), so I'd be sad to lose it. Of course, if vfio makes dpdk's advantages disappear then it's not worth it.
I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).
I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is
slow (it's a lookup in a very large array, so cache misses). I was
worried the iommu itself slows down the hardware, but I guess you
used IOVA_PA.
The code looks terrible:
phys_addr_t
rte_mem_virt2phy(const void *virtaddr)
{
int fd, retval;
uint64_t page, physaddr;
unsigned long virt_pfn;
int page_size;
off_t offset;
if (phys_addrs_available == 0)
return RTE_BAD_IOVA;
/* standard page size */
page_size = getpagesize();
fd = open("/proc/self/pagemap", O_RDONLY);
if (fd < 0) {
RTE_LOG(INFO, EAL, "%s(): cannot open
/proc/self/pagemap: %s\n",
__func__, strerror(errno));
return RTE_BAD_IOVA;
}
virt_pfn = (unsigned long)virtaddr / page_size;
offset = sizeof(uint64_t) * virt_pfn;
if (lseek(fd, offset, SEEK_SET) == (off_t) -1) {
RTE_LOG(INFO, EAL, "%s(): seek error in
/proc/self/pagemap: %s\n",
__func__, strerror(errno));
close(fd);
return RTE_BAD_IOVA;
}
retval = read(fd, &page, PFN_MASK_SIZE);
close(fd);
It looks like it was just written for it to work with no regard
for performance.
Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.
I don't understand, if you have vfio then you don't need
hugepages.
Yes they are. I recommend posting patches individually so they can be applied without interdependencies, when possible.
I'll prepare these.
Best regards,
--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/7c0182eb-52c3-48af-8c54-a2994d5d9c74n%40googlegroups.com.
Potential solutions:
- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.
- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.
I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?
Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).Consider a sender that sends packets with one byte each, only even bytes. Each such packet occupies 4k (or 2k?). A 100k window has 50k even bytes, and so needs 200M of memory.
Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""Well, that is the copy path. We'd like to use the zero-copy path, no?
Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.
I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.
Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.I don't understand, if you have vfio then you don't need hugepages.
On Saturday, March 27, 2021 at 2:24:00 PM UTC+1 Avi Kivity wrote:
Potential solutions:
- figure out how to remove the limit on the rx pool and allow any piece of memory to be used as an rx buffer. This would be most in line with how Seastar generally works (no reservations for anything), but requires digging into dpdk. Perhaps newer versions of dpdk make it easier.
- account for zero-copy packets that are "hanging". When the number of zero-copy packets that haven't been recycled yet exceeds some threshold, drop the zero-copy optimization and copy packets. This prevents deadlock at the expense of copying packets when the system is stressed.
I think the problem is on the TCP level: TCP should not allow such a dead lock situation and should have dropped all ooo packets and requests a retransmit.
I wonder how Linux stack addresses the same issue?
Actually, the TCP sender should refrain from sending data beyond the TCP receive window. If the sender sends more data, we can/must just drop it. This provides a bound on the amount of memory that a sender can force seastar to use. In seastar its value is computed by get_modified_receive_window_size. Now, that being said, the "linearize" also free memory in the (unlikely?) case there is some weird overlap between packets in the packet merger (coud happen due to TCP LRO + rentransmissions ?). This bounds the amount of memory used some where near 3737600 per connection. Dropping out of order packets that are within the receive window would be problematic/non-trivial because, the missing packet may just arrive soon (due to reordering in the network).
Consider a sender that sends packets with one byte each, only even bytes. Each such packet occupies 4k (or 2k?). A 100k window has 50k even bytes, and so needs 200M of memory.
I understand your point, I had ignored the fact that the sender could be malicious and deliberately attempt to exploit this. That said, we should account for both a memory-exhaustion attack, such as the one you describe, but also a complexity attack as the current linearization has a complexity in n² for the same workload (copy 1 byte, copy 2 byte, copy 3 bytes, ....) as it reallocates a buffer of the total size and copies everything on each packet reception. In addition it forces the CPU to touch the data (while it may not have accessed the packet data yet / may not use it immediately afterwards, thus resulting in cache pollution).
Right, we need some kind oh hybrid solution. First optimizing for
work/cpu, but when memory limits are reached, optimize for memory.
I had a look at FreeBSD code, and Linux code. As far as I understand the approach is to keep a fragment structure but to collapse "small" buffers into the remaining room of the previous buffer if any, with possibly a limit to avoid a copy if the chunk is already large enough (condition is: m->m_len <= MCLBYTES / 4 && /* XXX: Don't copy too much */ m->m_len <= M_TRAILINGSPACE(n) where m is received packet and n is the already existing chain ).
This would reduce the memory pressure for your adversarial example, yet avoid memcpys for large enough packets (thus reducing computational cost and preservering zero-copy). However, the current fragment structure does not carry any information on the real size of the underlying buffer to allow such an in-place "merge" to occur, even if the underlying buffer (rte_mbuf) had tailroom (or headroom) initally.
Since our `class packet` (via temporary_buffer) allows splitting and sharing, you couldn't know that this extra space isn't used by another packet.
Maybe we could: if the temporary_buffer tells us its deleter was
created with raw_object_tag, and is unshared, we'd know we can
expand it to malloc_usable_size(). Perhaps we'd need to adjust the
sharing mechanism to make sure we don't have false positives.
Now regarding the commit comment, as far as I understand the current seastar (without our zero copy patch, and without HugeTlbFs), as soon as packets are received (dpdk.cc:from_mbuf) a new buffer is allocated (from Seastar) and the rte_mbuf is immediately returned to the pool, thus avoiding the deadlock. I think this got changed later (April 2015) : https://github.com/scylladb/seastar/commit/e9a59e5f8d9c253a2ff721edffbf8845db18ef9f "DPDK: drop incoming packets if allocation fails in the copy-flow: Ensure the "detaching" of rte_mbuf from the received data in the copy path. This patch completes the above in case the allocation of the buffer to copy the newly arrived data to has failed. In the above case we prefer to drop the arrived packet instead of consuming the rte_mbuf from the Rx ring's mempool. Signed-off-by: Vlad Zolotarov <>""Well, that is the copy path. We'd like to use the zero-copy path, no?
Sure, but for the zero-copy path since the DPDK pool is not fixed but is refilled with newly seastar-allocated buffers, unless there is no memory at all remaining for the seastar allocator, the deadlock is unlikely also.
We still don't want to fill all memory with packets that are
waiting for a remote sender. We'll want something like
sys.net.ipv4.tcp_mem to limit global memory in tcp.
It may not matter for an application that does pure networking,
but an application that is memory intensive would like to limit
competition for memory.
However, it stress an interesting point: if we want to have "zero-copy", we also need data not to be copied due to a linearize call discussed just above.
Certainly. The first choice should be to store the packet as is
in the reordering buffer.
Linearization should happen if some heuristics apply:
- if the packet is smaller than some threshold, minify it by
copying to a new packet. This will ignore normal packets that
aren't tiny. This can apply even if the packet doesn't
overlap/touch another packet.
- if the total amount of memory in all reordering buffers is too high, start linearizing until it drops off
- if a particular reordering buffer has too many elements in the reordering buffer (i.e. it's under attack), send RST
These are just random ideas, we might come up with better ones,
and they don't have to be blockers for your patches.
Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:
100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.
I saw it evolved, you still need a rx_pool but apparently the dynamic allocation you took should work better, for all NICs/drivers as there is now a "uniform" API to handle this.
Well, that would be best. Using hugetlbfs is a deployment
headache as it often requires a reboot to reserve on the first
install. It's probably not a problem if you're consuming your own
code, but it is for ISVs.
I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).
I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.
I think so. Regarding this, as this is a part of the code that'll have to evolve as DPDK API changed, do you know what was the rationale for keeping two code paths (HugeTlbFs / No HugeTlbFs ?) .
Vlad wrote this, copying him.
I think it is related to this where Mellanox card could not be supported: https://groups.google.com/g/seastar-dev/c/fuQHNVE8db0/m/Fsh-ZkY4BgAJ -- but now DPDK provides a uniform approach for both NICs so it should be possible to avoid this issue ( rte_extmem_register, rte_dev_dma_map , https://mails.dpdk.org/archives/dev/2019-February/124690.html ) . Finally, the non-IOVA_VA based-path remain necessary for VMs without vIOMMU support / machines without IOMMU (or IOMMU disabled); or would you like to keep just one variant of the code, dropping support for uncommon/old hardware and assume that going forward everything will support IOVA_VA (or equivalent such as Mellanox) ?
I think we still need to support IOVA_PA since many virtual
machines won't have a virtual iommu.
But, I'd like to drop the copying path and hugetlbfs.
Regarding ease of deplyoment, its true that for us, we always create and bind virtual functions of physical NICs so we don't need to "dedicate" phyiscal NICs, and thanks to VFIO do not need to fear bugs corrupting beyond the process. So it's not a hurdle (once everything is activated in the BIOS/kernel). Yet, with our current solution we still need hugepages and to allocate 64 GB of RAM to DPDK just for networking queues but we reuse server with plenty of memory. Again this could probably be eased through the new external memory API from DPDK, but that's almost another topic.I don't understand, if you have vfio then you don't need hugepages.
To me DPDK always use(d) hugepages and fail(ed) to start if none are configured in the system. It seems that now it is not necessary in the newest versions (where newest can be more than 2 years old). So it may change too. But seastar ask DPDK to start without huge-pages only if we don't enable the dpdk-pmd. Anyway, as per your recommendation of incremental patches, I think I'm going to try feature parity with DPDK 20.11, and leave as seconds steps optimization (reduction) of fixed memory dedicated to DPDK , or more advanced external memory usage made possible in the newest versions.
} else if (!opts.count("dpdk-pmd")) {args.push_back(string2vector("--no-huge"));}
Many of the original implementation choices don't make sense now.
If we can settle on vfio (with IOVA_VA or IOVA_PA as available)
that would make everything easier. I think.
--Best regards,
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/ed5c552f-e53f-48e8-909b-3dd63b9f3698n%40googlegroups.com.
These are just random ideas, we might come up with better ones, and they don't have to be blockers for your patches.
Finally, our current implementation of zero copy is actually the second potential solution you describe. When I start my application I create a very large (configurable) rx_pool (e.g., 100 000 packets instead of just twice the NIC rx ring). If the number of still available mbufs is larger than twice the rx ring, we allow the mbuf to be passed to seastar, otherwise we know that we ought to free it immediately and perform a copy to seastar-allocated memory. So as long as packets are consumed fast enough, packet go through the zero copy path -- otherwise we just revert to the safe (previous) behavior. See bellow for the code which is quite simple in the end I just added the part underline:
100,000 packets is 400MB (assuming 4k packets). That's quite a lot of overhead, and per-shard if we have enough queues. I'd like a more dynamic solution that doesn't reserve so much memory. Maybe dpdk evolved enough not to require pre-allocation of rx packets.
I saw it evolved, you still need a rx_pool but apparently the dynamic allocation you took should work better, for all NICs/drivers as there is now a "uniform" API to handle this.
Well, that would be best. Using hugetlbfs is a deployment headache as it often requires a reboot to reserve on the first install. It's probably not a problem if you're consuming your own code, but it is for ISVs.
I'll need to run some benchmark/profiling again but performance was low, and as far as I remember significant time was spend within refill_rx_mbuf, if I look at rte_mem_virt2iova depending on whether mode is RTE_IOVA_VA/PA cost can vary a lot. Maybe it was specific to our configuration (and I lost access to the machine I had doing this test, so I cannot confirm whether RTE_IOVA_VA/PA was used: I remember having issues with vfio setup and no access to BIOS, so maybe something was wrong). Anyway this part also got updated in DPDK to provide something "uniform" between Mellanox & non Mellanox, so it could be possible/necessary to update it (for RTE_IOVA_VA capable setups).
I guess IOVA_VA is fast (it's a 1:1 translation) and IOVA_PA is slow (it's a lookup in a very large array, so cache misses). I was worried the iommu itself slows down the hardware, but I guess you used IOVA_PA.
I think so. Regarding this, as this is a part of the code that'll have to evolve as DPDK API changed, do you know what was the rationale for keeping two code paths (HugeTlbFs / No HugeTlbFs ?) .
Vlad wrote this, copying him.