True zero-copy writes to a socket in Linux


Rajiv Kurian

Nov 17, 2013, 12:29:19 PM
to mechanica...@googlegroups.com
Do you guys have any advice on minimizing the number of copies when writing in-memory data to a TCP socket on Linux? Resorting to a user-space TCP/IP stack is not an option. I am using C/C++.

Recently I found out about the splice/vmsplice family of calls, and they look promising. Apparently sendfile on Linux is implemented using splice. We could obtain a buffer by memory-mapping a file and write our data into it. Calling sendfile() or splice() on the underlying FD seems like it would achieve true zero copy on NICs that support scatter-gather DMA. The problem is that with scatter-gather DMA, only a pointer to the buffer and its length have been handed to the NIC by the time the splice/sendfile call returns. The NIC then asynchronously writes the data from our buffer onto the wire (hence the zero copy). There seems to be no way to know when it is safe to reuse the buffer for new data without explicit application-level acks. This paper and this article demonstrate the problem.

Quote from the article:

Be aware, when splicing data from a mmap'ed buffer to a network socket, it is not possible to say when all data has been sent. Even if splice() returns, the network stack may not have sent all data yet. So reusing the buffer may overwrite unsent data.

So it seems that this is a no-go unless we wait for clients to ack particular messages and only then reuse the buffers.

What have you guys done to minimize the number of copies in such cases? I am trying especially hard because my application processes large images/videos where copies are not cheap.

Peter Lawrey

Nov 17, 2013, 1:32:16 PM
to mechanica...@googlegroups.com

Kernel-bypass network adapters with user-space stacks all use C, AFAIK, so I don't see a problem using them. You don't even need to change your code to use them unless you need the lowest latencies.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Rajiv Kurian

Nov 17, 2013, 2:36:38 PM
to mechanica...@googlegroups.com
Don't they need specialized drivers? I have no control over the deployment environment (hardware and drivers).

Peter Lawrey

Nov 17, 2013, 4:28:10 PM
to mechanica...@googlegroups.com

If you have no control over the hardware, I don't see how you can do low latency. For most low-latency systems, you can gain more by using the right hardware in the right data centre than you can by using C vs Java.


Rajiv Kurian

Nov 17, 2013, 4:48:46 PM
to mechanica...@googlegroups.com
Peter:
I understand your point, but I am writing generic software that anyone can download and run. The best I can do is to incorporate the best practices for minimizing copies. I am writing large amounts of image/video data back to clients, hence the need to minimize copies. If there is nothing better than a write(), then that's all I can do.

Thanks,
Rajiv
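On the fallback to a plain write(): one copy that can often be avoided even there is the user-space step of assembling a header and a payload into one contiguous buffer before writing. The sketch below uses writev() for that. It was not proposed in the thread; `write_header_and_payload` is a hypothetical helper, and the kernel still copies the bytes into the socket buffer, so this is not zero copy.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send a header and a payload in order without first
 * copying them into one contiguous user-space buffer. Returns 0 or -1. */
int write_header_and_payload(int fd, const void *hdr, size_t hdr_len,
                             const void *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdr_len },
        { .iov_base = (void *)payload, .iov_len = payload_len },
    };
    struct iovec *cur = iov;
    int remaining = 2;

    while (remaining > 0) {
        if (cur->iov_len == 0) {        /* skip empty segments */
            cur++;
            remaining--;
            continue;
        }
        ssize_t n = writev(fd, cur, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* Advance past whatever the kernel accepted (short writes happen). */
        size_t left = (size_t)n;
        while (remaining > 0 && left >= cur->iov_len) {
            left -= cur->iov_len;
            cur++;
            remaining--;
        }
        if (remaining > 0) {
            cur->iov_base = (char *)cur->iov_base + left;
            cur->iov_len -= left;
        }
    }
    return 0;
}
```

Because the kernel copies the data during the call, both buffers can be reused as soon as the function returns, which sidesteps the reuse problem at the cost of that one copy.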



Peter Lawrey

Nov 17, 2013, 6:25:02 PM
to mechanica...@googlegroups.com
I am in the same position. I write low-latency software but can only recommend that users consider their options when it comes to hardware.


Juergen Donnerstag

Nov 18, 2013, 12:31:53 PM
to mechanica...@googlegroups.com
Since you mentioned large amounts of image/video data: somebody recently pointed this forum to UDT (http://udt.sourceforge.net/), which is titled "Breaking the Data Transfer Bottleneck".

Rajiv Kurian

Nov 23, 2013, 10:18:28 PM
to mechanica...@googlegroups.com
UDT is interesting. Thanks.