Direct buffers and zero copy


Simone Bordet

unread,
Dec 16, 2015, 10:10:45 AM12/16/15
to mechanica...@googlegroups.com
Hi,

so I was spelunking into direct buffers and also talking with
colleagues about what exactly happens when a direct buffer (allocated
by an application) is written to a SocketChannel (normal ethernet).
I know that heap buffers are copied into a direct buffer first.

A) SocketChannel.write(directBuffer)
My understanding is that when the syscall to write() happens, the
direct buffer is copied from user space to kernel space, and then the
kernel buffer is passed to the lower levels for the actual send.
If this is correct, even for direct buffers there is one data copy.

B) FileChannel.transferTo()
Only the usage of FileChannel.transferTo() is mapped to a sendfile()
syscall, thereby achieving true zero copy, and that does not involve
in the API any buffer.

C) SocketChannel.write(mappedBuffer)
If I map a file via FileChannel.map() and obtain a MappedByteBuffer,
and then I try to write that buffer via SocketChannel.write(), then I
end up again in the write() syscall which involves a data copy.

Am I right for A), B) and C) ?

Seems to me that C) could be optimized by the JVM into a sendfile() call ?
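
For concreteness, here is a minimal sketch of the three call shapes above (host, port and file name are hypothetical; error handling omitted):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class WriteVariants {
        public static void main(String[] args) throws Exception {
            try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("example.com", 9000));
                 FileChannel file = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {

                // A) write a direct buffer: goes through the write() syscall
                ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);
                file.read(direct);
                direct.flip();
                while (direct.hasRemaining()) {
                    socket.write(direct);
                }

                // B) transferTo(): mapped to sendfile() when the source is a file
                //    and the target is a socket (a single call may transfer fewer
                //    bytes than requested)
                file.transferTo(0, file.size(), socket);

                // C) write a mapped buffer: still goes through the write() syscall
                MappedByteBuffer mapped =
                        file.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(file.size(), Integer.MAX_VALUE));
                while (mapped.hasRemaining()) {
                    socket.write(mapped);
                }
            }
        }
    }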

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Vitaly Davidovich

unread,
Dec 16, 2015, 2:13:04 PM12/16/15
to mechanical-sympathy
You're right in A-C.  However, I'd like to point out that the "zero-copy" is user-kernel copying -- there is still copying within the kernel (e.g. kernel buf to socket buf).

Seems to me that C) could be optimized by the JVM into a sendfile() call ?

You mean by JDK I guess? If you mapped from a FileChannel, why not then just retain a reference to the FileChannel and use its transferTo(SocketChannel) directly? 


Martin Thompson

unread,
Dec 16, 2015, 3:41:17 PM12/16/15
to mechanica...@googlegroups.com
On 16 December 2015 at 15:10, Simone Bordet <simone...@gmail.com> wrote:

B) FileChannel.transferTo()
Only the usage of FileChannel.transferTo() is mapped to a sendfile()
syscall, thereby achieving true zero copy, and that does not involve
in the API any buffer.

In my experience you only see a performance benefit with FileChannel.transferTo() when the source is a file and the target is a TCP socket or another file. TCP socket to file does not see any benefit over a normal read/write loop. I've also seen FileChannel.transferTo() with the target as a UDP socket be slower than a normal write to the UDP socket.

Also note that sendfile() only sees benefits on larger transfers. Depending on your version of Linux, buffer alignment, and which way the wind is blowing, you only see benefits on ~8-32KB plus transfers.

Martin...

 

Simone Bordet

unread,
Dec 16, 2015, 5:48:14 PM12/16/15
to mechanica...@googlegroups.com
Hi,

On Wed, Dec 16, 2015 at 8:13 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
> You're right in A-C. However, I'd like to point out that the "zero-copy" is
> user-kernel copying -- there is still copying within the kernel (e.g. kernel
> buf to socket buf).

Just to play devil's advocate here, if I am mapping a 100 GiB file,
and I want to write that 100 GiB mapped buffer via
SocketChannel.write(), is the kernel really *copying* the data from
the mapped buffer to a kernel buffer ? I mean allocating another 100
GiB of virtual memory and then paging in/out all 100 GiB as they're
written ? To be precise, would these two buffers (the one that I
mapped, and the kernel buffer) have two different addresses in memory
? Would it use swap if the RAM is not enough ?

I would think that the kernel would just look at the mapped buffer and
write starting from its address, rather than copying all the data to a
new address ?
What would be the benefit of direct buffers otherwise ?

Greg Young

unread,
Dec 16, 2015, 5:51:16 PM12/16/15
to mechanica...@googlegroups.com
When is the last time you did this? Most protocols will chunk anyways
due to varying network concerns....



--
Studying for the Turing test

Martin Thompson

unread,
Dec 16, 2015, 6:16:56 PM12/16/15
to mechanica...@googlegroups.com
On 16 December 2015 at 22:48, Simone Bordet <simone...@gmail.com> wrote:
Hi,

On Wed, Dec 16, 2015 at 8:13 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
> You're right in A-C.  However, I'd like to point out that the "zero-copy" is
> user-kernel copying -- there is still copying within the kernel (e.g. kernel
> buf to socket buf).

Just to play devil's advocate here, if I am mapping a 100 GiB file,
and I want to write that 100 GiB mapped buffer via
SocketChannel.write(), is the kernel really *copying* the data from
the mapped buffer to a kernel buffer ? I mean allocating another 100
GiB of virtual memory and then paging in/out all 100 GiB as they're
written ? To be precise, would these two buffers (the one that I
mapped, and the kernel buffer) have two different addresses in memory
? Would it use swap if the RAM is not enough ?

I would think that the kernel would just look at the mapped buffer and
write starting from its address, rather than copying all the data to a
new address ?
What would be the benefit of direct buffers otherwise ?

Java direct buffers get copied from user space into kernel space socket send buffers pointed to by an sk_buff.  These sk_buffs then pass down through the QDiscs and the TX ring having their headers added on the way.

With sendfile() the page cache is the buffer pointed to by the sk_buff for passing down the stack.

Without sendfile() you are copying the mapped page cache back down into the kernel managed socket buffer pointed to by an sk_buff. All the while repeatedly switching between user and kernel space.


Simone Bordet

unread,
Dec 16, 2015, 7:05:30 PM12/16/15
to mechanica...@googlegroups.com
Hi,
Thanks for clarifying this !

Do you know of any resource that explains why the kernel copies these
buffers rather than just referencing them ?

Thanks !

Vitaly Davidovich

unread,
Dec 16, 2015, 10:26:39 PM12/16/15
to mechanica...@googlegroups.com
The main issue is that if you only provide a pointer + data length to a scatter-gather device, you don't actually know when the device has sent the data.  So a syscall that hands off your memory to a device may return before the device has actually sent the data to its medium.  The question then is: when can you continue to use the memory mapping safely? If you copy the data, the advantage is that you can immediately return to user space, and user space can be sure the memory is safe to overwrite.
 



--
Sent from my phone

Greg Wilkins

unread,
Dec 17, 2015, 12:30:12 AM12/17/15
to mechanical-sympathy


On Thursday, December 17, 2015 at 9:51:16 AM UTC+11, Greg Young wrote:
When is the last time you did this? Most protocols will chunk anyways
due to varying network concerns....

Simone and I are implementing such protocols.   Ideally even with a large/huge file memory mapped, we'd like to be able to do a gather write of a frame header with a slice of the huge buffer and have minimal copying.

We are trying to work out the optimal thing to do when:
  • we have a large file, but one that fits into user memory and can be shared by many responses (a cached static file).  Is it worthwhile caching this file as a DirectBuffer, or should we just memory map it?
  • we have a huge file that does not fit into user memory. Should we memory map the file and gather write frames with slices of the buffer?
cheers

Greg Wilkins

unread,
Dec 17, 2015, 12:45:21 AM12/17/15
to mechanical-sympathy


On Thursday, December 17, 2015 at 2:26:39 PM UTC+11, Vitaly Davidovich wrote:
> Java direct buffers get copied from user space into kernel space socket send
> buffers pointed to by an sk_buff.  These sk_buffs then pass down through the
> QDiscs and the TX ring having their headers added on the way.


Vitaly,

but I'm struggling to understand exactly what "copy" means in this context.

Say I have a 100GB file mapped to memory, then that file is not really in memory as I don't have that much memory.  It is in virtual memory with some page faulting mechanism that maps it to the data on the file system.
So if the kernel wants to copy this buffer passed to a write, is it really going to copy the whole 100GB?  At worst surely it is just going to copy the paging data structure pointing to the file, perhaps marking them to copy-on-write so that only if the buffers are changed will a copy actually be made?   So surely the data itself is never actually copied, just the metadata for the buffer?  

If the only thing that actually reads the data is the network interface, surely it is the network interface that will generate the page faults and then the data will flow directly from the file system to the network?   Why or how would that data ever be put into java user space?

Or am I missing something?  Is this a case of a little knowledge being dangerous?









Martin Thompson

unread,
Dec 17, 2015, 4:22:02 AM12/17/15
to mechanica...@googlegroups.com
On 17 December 2015 at 05:45, Greg Wilkins <gr...@webtide.com> wrote:


On Thursday, December 17, 2015 at 2:26:39 PM UTC+11, Vitaly Davidovich wrote:
> Java direct buffers get copied from user space into kernel space socket send
> buffers pointed to by an sk_buff.  These sk_buffs then pass down through the
> QDiscs and the TX ring having their headers added on the way.


Vitaly,

but I'm struggling to understand exactly what "copy" means in this context.

Say I have a 100GB file mapped to memory, then that file is not really in memory as I don't have that much memory.  It is in virtual memory with some page faulting mechanism that maps it to the data on the file system.
So if the kernel wants to copy this buffer passed to a write, is it really going to copy the whole 100GB?  At worst surely it is just going to copy the paging data structure pointing to the file, perhaps marking them to copy-on-write so that only if the buffers are changed will a copy actually be made?   So surely the data itself is never actually copied, just the metadata for the buffer?  

If you call the likes of SocketChannel.write(), implemented by Linux write(), with your MappedByteBuffer as the source and the socket channel as the target, then the kernel just sees a pointer to data to be copied to the socket. This must be copied to the SO_SNDBUF. If you call transferTo(), implemented as sendfile(), and the source is a file, it knows the contents are in the page cache. Send and receive buffers get autosized to deal with the Bandwidth Delay Product (BDP) on Linux.

With Java you can only map a 2GB - 1 byte region of a file per mapped buffer so that limits your chunking with writes anyway as it is int indexed. FileChannel.transferTo is long indexed.
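
As an illustration, a rough sketch of streaming an arbitrarily large file with transferTo(), looping because a single call may transfer fewer bytes than requested (channel variables are placeholders):

    static void sendWholeFile(java.nio.channels.FileChannel file,
                              java.nio.channels.SocketChannel socket) throws java.io.IOException {
        long position = 0;
        long size = file.size();
        while (position < size) {
            long transferred = file.transferTo(position, size - position, socket);
            if (transferred <= 0) {
                break; // no progress; caller decides whether to retry or fail
            }
            position += transferred;
        }
    }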
 
If the only thing that actually reads the data is the network interface, surely it is the network interface that will generate the page faults and then the data will flow directly from the file system to the network?   Why or how would that data ever be put into java user space?

How does it know the source when you call write() rather than sendfile()? 

Martin Thompson

unread,
Dec 17, 2015, 4:26:55 AM12/17/15
to mechanica...@googlegroups.com
On 17 December 2015 at 05:45, Greg Wilkins <gr...@webtide.com> wrote:
Have you considered where the data is kept when Nagle's algorithm is enabled? Buffers also need to be kept to support retransmission on loss.
 

Ross Bencina

unread,
Dec 17, 2015, 7:27:35 AM12/17/15
to mechanica...@googlegroups.com
On 17/12/2015 4:45 PM, Greg Wilkins wrote:
> So if the kernel wants to copy this buffer passed to a write, is it
> really going to copy the whole 100GB? At worst surely it is just going
> to copy the paging data structure pointing to the file, perhaps marking
> them to copy-on-write so that only if the buffers are changed will a
> copy actually be made? So surely the data itself is never actually
> copied, just the metadata for the buffer?

I have seen page remapping schemes similar to the above proposed in
research papers about "zero copy" networking. The implication being that
this is not the traditional way that it is done.

Ross.

Martin Thompson

unread,
Dec 17, 2015, 7:45:28 AM12/17/15
to mechanica...@googlegroups.com

If you really want to keep copying to a minimum right down the stack then try a user-space stack like OpenOnload on Solarflare hardware.

Greg Wilkins

unread,
Dec 17, 2015, 8:19:21 AM12/17/15
to mechanical-sympathy


On Thursday, December 17, 2015 at 8:26:55 PM UTC+11, Martin Thompson wrote:

Have you considered where the data is kept when Nagle's algorithm is enabled? Buffers also need to be kept to support retransmission on loss.

I had assumed that the data would be kept exactly where it came from - in the file system and that the only copying would be a clone of the data structures.    Even with a 2GB-1 limit, there is nowhere else physical that the data can be copied to... unless it is swap, which would be silly.   

Looking up sk_buffs, I see the page http://www.linuxfoundation.org/collaborate/workgroups/networking/sk_buff which says:

The struct sk_buff objects themselves are private for every network layer. When a packet is passed from one layer to another, the struct sk_buff is cloned. However, the data itself is not copied in that case.

So that does indicate the kernel is capable of doing this.... I just have to work out who creates the sk_buffs and if they directly reference the data in the memory mapped buffer or not....   so I've got some code to read!

cheers




 

Vitaly Davidovich

unread,
Dec 17, 2015, 9:30:12 AM12/17/15
to mechanica...@googlegroups.com
On the same site is an explanation of the TX flow in the kernel: http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow

The kernel is definitely able to minimize copying if you opt in to that (see e.g. splice/vmsplice, which back sendfile).  So sk_buff can be just a descriptor of where the data starts and its length; sk_buffs are then placed on the driver TX queue for the device to handle.  As the link above states, if the device doesn't support gathering the data pointed to by sk_buff is linearized (copied), but I think that's rare with modern NICs.

The key issue with not copying, as I mentioned upthread, is ownership of the data.  If you don't copy user data but only point to it, it creates a hazard whereby user may end up modifying the data before it's actually sent out, leading to all sorts of possible issues.  The only option then is to block the userland all the way until device acknowledges transmission and the sk_buff is released.  But I don't think many people would prefer this over some amount of copying and letting userland continue doing other things.

As for sendfile, its main purpose is to avoid read/write copying and associated mode switches when sending a file to a socket.  That's much more expensive and wasteful than simply letting kernel handle the transfer itself.

Ross Bencina

unread,
Dec 17, 2015, 9:40:34 AM12/17/15
to mechanica...@googlegroups.com
On 18/12/2015 1:30 AM, Vitaly Davidovich wrote:
> The key issue with not copying, as I mentioned upthread, is ownership of
> the data. If you don't copy user data but only point to it, it creates
> a hazard whereby user may end up modifying the data before it's actually
> sent out, leading to all sorts of possible issues. The only option then
> is to block the userland all the way until device acknowledges
> transmission and the sk_buff is released.

That's not the only option. The kernel could take ownership of the pages
and do copy-on-write if the user does try to modify the data before it's
sent out.

Trent Nelson

unread,
Dec 17, 2015, 10:59:03 AM12/17/15
to mechanical-sympathy
Consider it from the point of view of the device/driver.  It has been told to send 4KB, and been provided with a physical address.  The kernel needs to ensure that the 4KB chunk stays where it is until the device has finished DMA'ing it.  That's why sockets have socket buffers; so that there is a kernel-side send and receive area that isn't susceptible to paging or user modification.  (With the added benefit that a separate buffer allows things like coalescing multiple user space calls into single device driver calls.)

sendfile() and (TransmitFile() on Windows) simply aim to take this common pattern:

    while data := file.read(65536):
        sock.sendall(data)

And remove the need to go from kernel<->user<->kernel to just kernel<->kernel.  There's still a copy happening though.

Registered I/O introduced in Windows 8 is where it gets really interesting though.  Socket buffers are a thing of the past.  Instead, you allocate large chunks of memory (potentially NUMA-cognizant) at startup, slice the memory up into N buffers, and then "register" these buffer arrays with Windows.  Windows then locks the pages into memory *once* (i.e. makes them non-pageable).  You then use request and completion queues to dispatch I/O requests to sockets, leveraging these individual buffers that have already been locked into memory.

Not having to lock/unlock every socket buffer for every single read/write call significantly improves performance at high levels of concurrency.

Note that Registered I/O is very similar to SetFileIoOverlappedRange(), which essentially facilitates the same sort of thing, but on files; locking large byte ranges via one call instead of lots of little ones.  This is advantageous when you're using completion-oriented overlapped I/O (i.e. actual asynchronous file I/O on Windows).


    Trent.

Vitaly Davidovich

unread,
Dec 17, 2015, 11:30:47 AM12/17/15
to mechanical-sympathy
That's not the only option. The kernel could take ownership of the pages and do copy-on-write if the user does try to modify the data before it's sent out.

I think this would be wasteful.  Every send would now change protection on the user pages involved in the transfer to being read only - that alone is a TLB shootdown across all cpus on which the process is running (at best).  If user writes to them, that's an MMU trap with kernel fixup and copying (now we're back to copying :)).  If the user didn't actually modify anything, what do we do? Change the protection again back to being RW? That wouldn't necessarily require a TLB shootdown (I think linux avoids sending IPIs for page protection changes that become more restrictive), but I think it still causes a microfault on access on the remote CPU to reload the TLB.

Now imagine doing the above on every send operation, by default (i.e. the suggestion in this thread is why kernel doesn't avoid copying mmap'd memory automatically).

Look into vmsplice syscall with SPLICE_F_GIFT flag set; that's likely as close as you'll get to moving (vs copying) user accessible pages within the kernel; here you're basically promising the kernel that you won't modify those pages.

 

Greg Wilkins

unread,
Dec 17, 2015, 5:07:35 PM12/17/15
to mechanical-sympathy


On Friday, December 18, 2015 at 3:30:47 AM UTC+11, Vitaly Davidovich wrote:
That's not the only option. The kernel could take ownership of the pages and do copy-on-write if the user does try to modify the data before it's sent out.

I think this would be wasteful.  Every send would now change protection on the user pages involved in the transfer to being read only - that alone is a TLB shootdown across all cpus on which the process is running (at best).  If user writes to them, that's an MMU trap with kernel fixup and copying (now we're back to copying :)).  If the user didn't actually modify anything, what do we do? Change the protection again back to being RW? That wouldn't necessarily require a TLB shootdown (I think linux avoids sending IPIs for page protection changes that become more restrictive), but I think it still causes a microfault on access on the remote CPU to reload the TLB.


Memory Mapped files may already be mapped in the appropriate mode.  They can be mapped as read-only or mapped in private mode where any changes are not copied back to the original file and are only available through the current mapping.    So that sounds to me like the copy-on-write mechanisms of page mapped memory are already being utilized by the mechanism.         For the particular use-case I'm concerned with, the files are mapped read-only mode, so modifications are not a problem.

With regards to the general issue you raise of modifications to buffers during async IO calls, this is something that many APIs do not make very clear.  However, I would think that requiring async APIs to always copy data before returning would risk making them blocking as the write may need to wait until buffers/memory is available to do such a copy.    In the work I've done with the servlet API and HTTP2 we do not do such a copy and any buffers passed are essentially "owned" by the IO layer until notification of completion is sent.






 

Vitaly Davidovich

unread,
Dec 17, 2015, 6:46:54 PM12/17/15
to mechanica...@googlegroups.com


On Thursday, December 17, 2015, Greg Wilkins <gr...@webtide.com> wrote:


On Friday, December 18, 2015 at 3:30:47 AM UTC+11, Vitaly Davidovich wrote:
That's not the only option. The kernel could take ownership of the pages and do copy-on-write if the user does try to modify the data before it's sent out.

I think this would be wasteful.  Every send would now change protection on the user pages involved in the transfer to being read only - that alone is a TLB shootdown across all cpus on which the process is running (at best).  If user writes to them, that's an MMU trap with kernel fixup and copying (now we're back to copying :)).  If the user didn't actually modify anything, what do we do? Change the protection again back to being RW? That wouldn't necessarily require a TLB shootdown (I think linux avoids sending IPIs for page protection changes that become more restrictive), but I think it still causes a microfault on access on the remote CPU to reload the TLB.


Memory Mapped files may already be mapped in the appropriate mode.  They can be mapped as read-only or mapped in private mode where any changes are not copied back to the original file and are only available through the current mapping.    So that sounds to me like the copy-on-write mechanisms of page mapped memory are already being utilized by the mechanism.         For the particular use-case I'm concerned with, the files are mapped read-only mode, so modifications are not a problem.

If you map private you've explicitly opted for COW right from the start; there's no need to change protection on existing pages and intent is clear to the kernel.  As mentioned, vmsplice() can be used to ask kernel to transfer pages without copying, but that's an explicit request.

If you're mapped read only, why not just sendfile()? Is it a convenience thing?

 


With regards to the general issue you raise of modifications to buffers during async IO calls, this is something that many APIs do not make very clear.  However, I would think that requiring async APIs to always copy data before returning would risk making them blocking as the write may need to wait until buffers/memory is available to do such a copy.    In the work I've done with the servlet API and HTTP2 we do not do such a copy and any buffers passed are essentially "owned" by the IO layer until notification of completion is sent.

Async APIs with completion notification provide the ability for caller to manage buffer reuse since they're informed when the operation is complete.  In your server work, suppose user asks you to do something with a buffer they provide to you - the API is otherwise synchronous.  Suppose there's a point at which you could return to the user but the operation isn't fully done yet (but you have reason to believe that this point is safe to return anyway).  However, you still need to read something from the user buffer at a later point.  How do you proceed? If you have no way to notify user of ultimate completion you either copy or don't return until you don't need the buffer anymore.

As for blocking on async submission, yes it's possible.  Even if you don't need to copy user buffer there may be other resources that aren't available so you need to apply back pressure.  Alternatively, you let user query if their operation will block - that's the Linux readiness model, which isn't async but rather non-blocking - and the user is then responsible for providing back pressure.

 







 
 
 



Greg Wilkins

unread,
Dec 18, 2015, 5:36:03 PM12/18/15
to mechanical-sympathy


On Friday, December 18, 2015 at 10:46:54 AM UTC+11, Vitaly Davidovich wrote:


On Thursday, December 17, 2015, Greg Wilkins <gr...@webtide.com> wrote:


If you're mapped read only, why not just sendfile()? Is it a convenience thing?

In the server, we have many, many sources of content that need to be written out over various protocols (HTTP/1 or HTTP/2), which may or may not need fragmentation, may or may not be compressed/encrypted, etc.   The abstraction we work with for this is ByteBuffer, so it is not possible at the lowest level to reverse that to a File that can be used for sendfile.     What we want/believe to happen is that if a slice of a direct/mapped buffer does make it down to a gather write (without being compressed/encrypted by java which fragments and sucks it into user space anyway) then it will be written efficiently.     Putting a sendfile in the middle of all that abstraction would be rather ugly and defeat the purpose of having the ByteBuffer abstraction over mapped files in the first place!
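
As a sketch of that gather-write shape (frame layout, offsets and field values are made up for illustration; whether the kernel copies the mapped pages on write() is exactly the open question in this thread):

    static void writeFrame(java.nio.channels.SocketChannel socket,
                           java.nio.MappedByteBuffer mapped,
                           int offset, int length) throws java.io.IOException {
        java.nio.ByteBuffer header = java.nio.ByteBuffer.allocateDirect(8);
        header.putInt(length);                 // hypothetical frame header: payload length
        header.putInt(42);                     // hypothetical stream id
        header.flip();

        java.nio.ByteBuffer payload = mapped.duplicate();   // independent position/limit
        payload.position(offset).limit(offset + length);    // slice of the mapped file

        java.nio.ByteBuffer[] frame = { header, payload };
        while (header.hasRemaining() || payload.hasRemaining()) {
            socket.write(frame);               // gathering write of header + mapped slice
        }
    }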
 
 


With regards to the general issue you raise of modifications to buffers during async IO calls, this is something that many APIs do not make very clear.  However, I would think that requiring async APIs to always copy data before returning would risk making them blocking as the write may need to wait until buffers/memory is available to do such a copy.    In the work I've done with the servlet API and HTTP2 we do not do such a copy and any buffers passed are essentially "owned" by the IO layer until notification of completion is sent.

Async APIs with completion notification provide the ability for caller to manage buffer reuse since they're informed when the operation is complete.  In your server work, suppose user asks you to do something with a buffer they provide to you - the API is otherwise synchronous.  Suppose there's a point at which you could return to the user but the operation isn't fully done yet (but you have reason to believe that this point is safe to return anyway).  However, you still need to read something from the user buffer at a later point.  How do you proceed? If you have no way to notify user of ultimate completion you either copy or don't return until you don't need the buffer anymore.

The problem with that approach is that we do not place a limit on the size of writes that an application can do.   Thus if we copy, we allow the application to have an unlimited memory debt on the container - not going to scale!;  OR if we don't return until written, then we are blocking not asynchronous.

The only way to have true async IO with bounded resource consumption (in java) is to not copy and to tell the application that they are not free to modify the passed buffers until completion is signalled.      Our completion signal is the ultimate completion (at least from a java sense), in that we wait until told by NIO that the write has completed.    Now in the case of Mapped byte buffers, perhaps that completion is before the OS has really flushed the data and maybe it is using COW paging to copy the data we pass it - which is fine, as that does not break our contract with the app, that after we notify completion they are free to modify/reuse/discard the passed buffer.



As for blocking on async submission, yes it's possible.  Even if you don't need to copy user buffer there may be other resources that aren't available so you need to apply back pressure.  Alternatively, you let user query if their operation will block - that's the Linux readiness model, which isn't async but rather non-blocking - and the user is then responsible for providing back pressure.

To apply back pressure in an async API you should not block an async submission.  Rather you should delay the notification mechanism, which may be a completion (onComplete) or a readiness (onWritePossible) notification depending on the API.  In a server dealing with 100's of thousands of connections you just don't want resource starvation to cause any blocking as that will be a really rapid road to thread starvation.   Also remember that with a readiness model, an app can still write an arbitrary large write, which cannot be written without congestion, nor can it be copied without breaking memory constraints - so we just place that buffer in our Q and asynchronously write chunks (or all of it depending on the protocol) until we are signalled that it is complete, and then we signal the app that their write is complete.

cheers
 




Vitaly Davidovich

unread,
Dec 18, 2015, 6:31:41 PM12/18/15
to mechanical-sympathy
In the server, we have many, many sources of content that need to be written out over various protocols (HTTP/1 or HTTP/2), which may or may not need fragmentation, may or may not be compressed/encrypted, etc.   The abstraction we work with for this is ByteBuffer, so it is not possible at the lowest level to reverse that to a File that can be used for sendfile.

You don't need a File, you'd just need to retain the FileChannel from which you mapped the MBB.  But I understand that you want to deal with only BB, which is a convenient thing to do.

What we want/believe to happen is that if a slice of a direct/mapped buffer does make it down to a gather write (without being compressed/encrypted by java which fragments and sucks it into user space anyway) then it will be written efficiently

I'm not sure why compressing/encrypting or otherwise mutating the MBB would suck it into user space; the point of mmap'd memory is so user and kernel share it.  Touching pages that aren't resident may page fault, but that's not mutation specific.  Perhaps I misunderstood.

The problem with that approach is that we do not place a limit on the size of writes that an application can do.   Thus if we copy, we allow the application to have an unlimited memory debt on the container - not going to scale!;  OR if we don't return until written, then we are blocking not asynchronous.

There's still a limit even if you copy.  If there's no room to copy into, you block anyway.  There are reasons that page cache file writes may block, file backed mmap() writes may block, etc.  Most of those reasons are due to resource exhaustion (or rather prevention thereof).

The only way to have true async IO with bounded resource consumption (in java) is to not copy and to tell the application that they are not free to modify the passed buffers until completion is signalled.      Our completion signal is the ultimate completion (at least from a java sense), in that we wait until told by NIO that the write has completed.    Now in the case of Mapped byte buffers, perhaps that completion is before the OS has really flushed the data and maybe it is using COW paging to copy the data we pass it - which is fine, as that does not break our contract with the app, that after we notify completion they are free to modify/reuse/discard the passed buffer.

Yes, when you have an explicit async API you can specify the terms & agreement and hope the user obliges, but see below.
 
To apply back pressure in an async API you should not block an async submission.  Rather you should delay the notification mechanism, which may be a completion (onComplete) or a readiness (onWritePossible) notification depending on the API.  In a server dealing with 100's of thousands of connections you just don't want resource starvation to cause any blocking as that will be a really rapid road to thread starvation.  

You can delay the notification mechanism, but this is just a gentlemen's agreement between the client and your framework that they will respect this.  If they don't respect this and continue submitting i/o, what happens? If you're the kernel, you can't just have gentlemen's agreements with userland and hope they oblige.  So you specify a model to the user, which may very well be "don't submit more i/o until i tell you i'm ready or else i'll block or reject your request", and then it's up to user to follow that and make good use of resources.  But if they don't do this, you need to protect your resources (as the kernel) the hard way.  I'm sure we've all seen the linux kernel "panic" under some low mem conditions and things slow down to a crawl for a variety of reasons, nevermind OOM killer stepping in and such.  But it's not just kernel, if you're any sort of container for others, you will have resource management policy, including handling tenants that are misbehaving, intentionally or not.

Also remember that with a readiness model, an app can still write an arbitrary large write, which cannot be written without congestion, nor can it be copied without breaking memory constraints - so we just place that buffer in our Q and asynchronously write chunks (or all of it depending on the protocol) until we are signalled that it is complete, and then we signal the app that their write is complete.

Yes, but I'm not following your point here.  With the readiness model, it's simply non-blocking rather than async.  If you attempt to shove more data down the pipe than there's space, you get an EAGAIN.  So instead of blocking you, it just rejects your operation.  So in that case, you end up having to put backpressure yourself, such as the queue you mention.  But presumably, you'd then want to take care to not exhaust your process memory with too much queueing but rejecting more requests or whatever is the appropriate backpressure plan there.


Greg Wilkins

unread,
Dec 22, 2015, 7:16:11 PM12/22/15
to mechanical-sympathy

On Saturday, December 19, 2015 at 10:31:41 AM UTC+11, Vitaly Davidovich wrote:

But I understand that you want to deal with only BB, which is a convenient thing to do.

Indeed we've abstracted around BB on the assumption that there is some advantage of writing Direct and/or memory mapped file buffers.

However, the suggestion here is that such buffers are copied anyway during the write - which is what I'm questioning. Why is there a copy? How can it possibly copy large buffers unless it moves them to swap space? And if it does copy, then what exactly is the benefit of direct buffers and memory mapped files anyway?
 

I'm not sure why compressing/encrypting or otherwise mutating the MBB would suck it into user space; the point of mmap'd memory is so user and kernel share it.  Touching pages that aren't resident may page fault, but that's not mutation specific.  Perhaps I misunderstood.

Ah that is a failing in the Java APIs for compressing/encrypting, both of which require byte[] rather than ByteBuffer.  So using pure Java for encryption/compression means that the content has to be brought into the Java heap just so it can exist in a byte[] - yuck! That is a good reason to offload SSL from Java app servers and to precompress whenever possible.
 
 
Yes, but I'm not following your point here.  With the readiness model, it's simply non-blocking rather than async.  If you attempt to shove more data down the pipe than there's space, you get an EAGAIN.  So instead of blocking you, it just rejects your operation.  So in that case, you end up having to put backpressure yourself, such as the queue you mention.  But presumably, you'd then want to take care to not exhaust your process memory with too much queueing but rejecting more requests or whatever is the appropriate backpressure plan there.

It all depends on your API.  In the servlet async IO model, if you do a write(veryLargeBuffer) and then immediately try to shove more data down the pipe, you will get a WritePendingException.   You have to first poll with an isReady() call before you can write again and if that returns false then an async callback to onWritePossible() is automagically scheduled when it does become ready for the next write.       This API allows us to not copy the veryLargeBuffer and to have a gentleman's agreement with the application that it does not mutate the buffer.   If the app breaks the agreement, it can corrupt writes, but it cannot consume more resources from the server.
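
A minimal sketch of that isReady()/onWritePossible() contract from the application side (Servlet 3.1 style; the content source and chunk size are hypothetical, and the temporary byte[] is only there because ServletOutputStream takes arrays):

    void streamAsync(javax.servlet.http.HttpServletRequest request,
                     javax.servlet.http.HttpServletResponse response,
                     java.nio.ByteBuffer content) throws java.io.IOException {
        javax.servlet.AsyncContext async = request.startAsync();
        javax.servlet.ServletOutputStream out = response.getOutputStream();
        byte[] chunk = new byte[8192];

        out.setWriteListener(new javax.servlet.WriteListener() {
            @Override
            public void onWritePossible() throws java.io.IOException {
                // keep writing while the container promises not to block
                while (out.isReady() && content.hasRemaining()) {
                    int n = Math.min(chunk.length, content.remaining());
                    content.get(chunk, 0, n);
                    out.write(chunk, 0, n);
                }
                if (!content.hasRemaining()) {
                    async.complete();   // only now is the buffer free for reuse
                }
                // otherwise the container calls onWritePossible() again when writable
            }

            @Override
            public void onError(Throwable t) {
                async.complete();
            }
        });
    }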



Anyway, to get back to the original question from Simone.... my thoughts are:


> A) SocketChannel.write(directBuffer)
> My understanding is that when the syscall to write() happens, the
> direct buffer is copied from user space to kernel space, and then the
> kernel buffer is passed to the lower levels for the actual send.
> If this is correct, even for direct buffers there is one data copy.

I think this is wrong.  With a direct buffer the data is copied to kernel space when data is put into the buffer, not when the buffer is written.
Thus if a buffer is created once and written many times, there is a benefit because the data is only copied to kernel space once.




> B) FileChannel.transferTo()
> Only the usage of FileChannel.transferTo() is mapped to a sendfile()
> syscall, thereby achieving true zero copy, and that does not involve
> in the API any buffer.

Sounds about right, other than the word "Only"



> C) SocketChannel.write(mappedBuffer)
> If I map a file via FileChannel.map() and obtain a MappedByteBuffer,
> and then I try to write that buffer via SocketChannel.write(), then I
> end up again in the write() syscall which involves a data copy.

I disagree. 

The data for a memory mapped file buffer does not exist in user space, it is only copied into user space when a buffer get method is used.  Thus there is no need to copy the data from user space to kernel space.
Once in kernel space I do not see how the data can be copied or where it could be copied to?  Swap???  At the worst, I can see the data structures describing the buffers being copied, but not the physical data itself - which exists only in the file system.

Now I'm prepared to believe that perhaps writing such a buffer is not exactly zero copies, because I'm not exactly sure how it gets from the file system to the network card and expect that a page by page copy might be needed... but then it might also be possible for a DMA from the page space to the network?

Either way, surely it has to be fewer copies than writing a byte[] from user space, especially if you include the reads required to fill it up from the file system?

 


Dan Eloff

unread,
Dec 23, 2015, 9:27:04 AM12/23/15
to mechanica...@googlegroups.com
However, the suggestion here is that such buffers are copied anyway during the write - which is what I'm questioning. Why is there a copy? How can it possibly copy large buffers unless it moves them to swap space? And if it does copy, then what exactly is the benefit of direct buffers and memory mapped files anyway?

I think the copy people have been discussing here is the copy from the userspace buffer to the kernel (socket) buffer when calling write(). That will copy at most the size of the available space in the socket buffer and then block (or return the number of bytes written for a non-blocking socket). Direct buffers / memory mapped data avoid a second copy from Java heap memory to unmanaged memory before passing it to write(). If you have an mmap'd file and you use sendfile (or vmsplice+splice, which it's based on), then even the kernel copy can be avoided; it will just send references to the pages in the kernel buffer cache.
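
To make the partial-write behaviour concrete, a small sketch for a non-blocking SocketChannel (names are placeholders): write() copies only what currently fits in the socket send buffer and returns the count, so the caller has to retry the remainder later.

    static int writeSome(java.nio.channels.SocketChannel socket, java.nio.ByteBuffer src)
            throws java.io.IOException {
        int written = 0;
        while (src.hasRemaining()) {
            int n = socket.write(src);   // returns 0 once the send buffer is full
            if (n == 0) {
                break;  // register OP_WRITE with a Selector and retry when writable
            }
            written += n;
        }
        return written;
    }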

Vitaly Davidovich

unread,
Dec 23, 2015, 9:30:55 AM12/23/15
to mechanica...@googlegroups.com


On Tuesday, December 22, 2015, Greg Wilkins <gr...@webtide.com> wrote:

On Saturday, December 19, 2015 at 10:31:41 AM UTC+11, Vitaly Davidovich wrote:

But I understand that you want to deal with only BB, which is a convenient thing to do.

Indeed we've abstracted around BB on the assumption that there is some advantage of writing Direct and/or memory mapped file buffers.

However, the suggestion here is that such buffers are copied anyway during the write - which is what I'm questioning. Why is there a copy? How can it possibly copy large buffers unless it moves them to swap space? And if it does copy, then what exactly is the benefit of direct buffers and memory mapped files anyway?

 
Benefit of DBB is interfacing with native code without needing to copy/marshal Java heap memory to native and vice versa.  It, along with Unsafe.allocateMemory, has since also been used for off-heap data storage to give the GC a break.  But really, it's just a managed wrapper on top of native memory.

Benefit of mmap'd file is sharing the mapping/memory with kernel, which avoids syscalls for read/write operations.  By avoiding read/write syscalls you also don't need buffers to read/write the data into/from, respectively, which avoids some copying.

Large buffers aren't copied all at once, they'll typically be split into chunked transfers.  You don't allocate a 2GB array in Java when reading files but rather use some smaller array (e.g 8-32KB) to do reads in chunks, right? :)

At any rate, Linux kernel does have a way for you to hand off the pages directly to it for transfer, which is vmsplice that was mentioned before.

I should also mention that there are ways to have user space memory be DMA'able from a device - consider the various user space networking libs.  But DMA isn't straightforward; some devices may not support scatter/gather, requiring linearizing scattered physical pages into contiguous physical page range; there may be restrictions on addressable memory by the device; there's typically some cap on how many scatter/gather iops a device can queue/service at a time, etc. But I think we're talking specifically about generic I/O operations.


I'm not sure why compressing/encrypting or otherwise mutating the MBB would suck it into user space; the point of mmap'd memory is so user and kernel share it.  Touching pages that aren't resident may page fault, but that's not mutation specific.  Perhaps I misunderstood.

Ah that is a failing in the Java APIs for compressing/encrypting, both of which require byte[] rather than ByteBuffer.  So using pure Java for encryption/compression means that the content has to be brought into the Java heap just so it can exist in a byte[] - yuck! That is a good reason to offload SSL from Java app servers and to precompress whenever possible.
 
Indeed 