Seems to me that C) could be optimized by the JVM into a sendfile() call?
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
B) FileChannel.transferTo()
Only FileChannel.transferTo() is mapped to a sendfile() syscall, thereby
achieving true zero copy, and its API does not involve any buffer.
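As a reference point, a minimal sketch of option B: streaming a file with FileChannel.transferTo(). When the target is a SocketChannel on Linux, the JDK implements this with sendfile(2), so the bytes never surface in user space. The class and method names below are illustrative, and the target type is widened to WritableByteChannel so the sketch is self-contained:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Stream a whole file through transferTo(). With a real SocketChannel target
// on Linux this takes the sendfile(2) path; no buffer appears in the API.
public final class SendFile {
    static void send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = 0;
            long size = src.size();
            while (pos < size) {
                // transferTo() may transfer fewer bytes than requested, so loop.
                pos += src.transferTo(pos, size - pos, target);
            }
        }
    }
}
```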
Hi,
On Wed, Dec 16, 2015 at 8:13 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
> You're right in A-C. However, I'd like to point out that the "zero-copy" is
> user-kernel copying -- there is still copying within the kernel (e.g. kernel
> buf to socket buf).
Just to play devil's advocate here: if I map a 100 GiB file, and I want to
write that 100 GiB mapped buffer via SocketChannel.write(), is the kernel
really *copying* the data from the mapped buffer to a kernel buffer? I mean,
allocating another 100 GiB of virtual memory and then paging in/out all the
100 GiB as they're written? To be precise, would these two buffers (the one
that I mapped, and the kernel buffer) have two different addresses in memory?
Does it use swap if the RAM is not enough?
I would think that the kernel would just look at the mapped buffer and
write starting from its address, rather than copying all the data to a
new address?
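The scenario being asked about looks roughly like this (names are illustrative; a WritableByteChannel stands in for the real SocketChannel so the sketch is self-contained):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map a (possibly huge) file and hand the mapped buffer straight to a
// channel write. Nothing is copied onto the Java heap; the open question in
// the thread is whether the kernel copies again past the write() syscall.
public final class MappedWrite {
    static void writeMapped(Path file, WritableByteChannel out) throws IOException {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            while (map.hasRemaining()) {
                out.write(map); // a real socket may accept only part per call
            }
        }
    }
}
```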
What would be the benefit of direct buffers otherwise?
Thanks!
--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz
When is the last time you did this? Most protocols will chunk anyway
due to varying network concerns.
On Thursday, December 17, 2015 at 2:26:39 PM UTC+11, Vitaly Davidovich wrote:
> Java direct buffers get copied from user space into kernel space socket send
> buffers pointed to by an sk_buff. These sk_buffs then pass down through the
> QDiscs and the TX ring having their headers added on the way.
Vitaly,
but I'm struggling to understand exactly what "copy" means in this context.
Say I have a 100GB file mapped to memory; then that file is not really in memory, as I don't have that much memory. It is in virtual memory, with some page-faulting mechanism that maps it to the data on the file system.
So if the kernel wants to copy this buffer passed to a write, is it really going to copy the whole 100GB? At worst, surely it is just going to copy the paging data structures pointing to the file, perhaps marking them copy-on-write so that only if the buffers are changed will a copy actually be made? So surely the data itself is never actually copied, just the metadata for the buffer?
If the only thing that actually reads the data is the network interface, surely it is the network interface that will generate the page faults and then the data will flow directly from the file system to the network? Why or how would that data ever be put into java user space?
If you really want to keep copying to a minimum right down the stack then try a user space stack like Open Onload and Solarflare.
Have you considered where the data is kept when Nagle's algorithm is enabled? Buffers also need to be kept to support retransmission on loss.
> That's not the only option. The kernel could take ownership of the pages and
> do copy-on-write if the user does try to modify the data before it's sent out.

I think this would be wasteful. Every send would now change protection on the user pages involved in the transfer to read-only - that alone is a TLB shootdown across all CPUs on which the process is running (at best). If the user writes to them, that's an MMU trap with kernel fixup and copying (now we're back to copying :)). If the user didn't actually modify anything, what do we do? Change the protection back to RW? That wouldn't necessarily require a TLB shootdown (I think Linux avoids sending IPIs for page protection changes that become more restrictive), but I think it still causes a microfault on access on the remote CPU to reload the TLB.
On Friday, December 18, 2015 at 3:30:47 AM UTC+11, Vitaly Davidovich wrote:
> That's not the only option. The kernel could take ownership of the pages
> and do copy-on-write if the user does try to modify the data before it's
> sent out. I think this would be wasteful. Every send would now change
> protection on the user pages involved in the transfer to read-only - that
> alone is a TLB shootdown across all CPUs on which the process is running
> (at best). If the user writes to them, that's an MMU trap with kernel fixup
> and copying (now we're back to copying :)). If the user didn't actually
> modify anything, what do we do? Change the protection back to RW? That
> wouldn't necessarily require a TLB shootdown (I think Linux avoids sending
> IPIs for page protection changes that become more restrictive), but I think
> it still causes a microfault on access on the remote CPU to reload the TLB.
Memory-mapped files may already be mapped in the appropriate mode. They can be mapped as read-only, or mapped in private mode where any changes are not copied back to the original file and are only available through the current mapping. So it sounds to me like the copy-on-write machinery of page-mapped memory is already being utilized by the mapping mechanism. For the particular use-case I'm concerned with, the files are mapped in read-only mode, so modifications are not a problem.
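The two mapping modes mentioned are both exposed through FileChannel.map(); a small sketch (the private mapping's copy-on-write behaviour is what keeps changes out of the underlying file):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// READ_ONLY mappings reject modification outright; PRIVATE mappings are
// copy-on-write, so changes are visible through the buffer but never
// propagate back to the file.
public final class MapModes {
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            return fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        }
    }

    static MappedByteBuffer mapPrivate(Path file) throws IOException {
        // MapMode.PRIVATE requires the channel to be open for read and
        // write, even though changes never reach the file.
        try (FileChannel fc = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            return fc.map(FileChannel.MapMode.PRIVATE, 0, fc.size());
        }
    }
}
```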
With regards to the general issue you raise of modifications to buffers during async IO calls, this is something that many APIs do not make very clear. However, I would think that requiring async APIs to always copy data before returning would risk making them blocking as the write may need to wait until buffers/memory is available to do such a copy. In the work I've done with the servlet API and HTTP2 we do not do such a copy and any buffers passed are essentially "owned" by the IO layer until notification of completion is sent.
On Thursday, December 17, 2015, Greg Wilkins <gr...@webtide.com> wrote:

If you're mapped read only, why not just sendfile()? Is it a convenience thing?
> With regards to the general issue you raise of modifications to buffers
> during async IO calls, this is something that many APIs do not make very
> clear. However, I would think that requiring async APIs to always copy data
> before returning would risk making them blocking as the write may need to
> wait until buffers/memory is available to do such a copy. In the work I've
> done with the servlet API and HTTP2 we do not do such a copy and any buffers
> passed are essentially "owned" by the IO layer until notification of
> completion is sent.

Async APIs with completion notification give the caller the ability to manage buffer reuse, since they're informed when the operation is complete. In your server work, suppose the user asks you to do something with a buffer they provide to you - the API is otherwise synchronous. Suppose there's a point at which you could return to the user but the operation isn't fully done yet (though you have reason to believe that point is safe to return at anyway). However, you still need to read something from the user buffer at a later point. How do you proceed? If you have no way to notify the user of ultimate completion, you either copy, or don't return until you don't need the buffer anymore.
As for blocking on async submission, yes, it's possible. Even if you don't need to copy the user buffer, there may be other resources that aren't available, so you need to apply back pressure. Alternatively, you let the user query whether their operation will block - that's the Linux readiness model, which isn't async but rather non-blocking - and the user is then responsible for providing back pressure.
In the server, we have many, many sources of content that need to be written out via various protocols (HTTP/1 or HTTP/2), which may or may not need fragmentation, and may or may not be compressed/encrypted, etc. The abstraction we work with for this is ByteBuffer, so it is not possible at the lowest level to reverse that to a File that can be used for sendfile.
What we want/believe will happen is that if a slice of a direct/mapped buffer does make it down to a gather write (without being compressed/encrypted by Java, which fragments it and sucks it into user space anyway), then it will be written efficiently.
The problem with that approach is that we do not place a limit on the size of writes that an application can do. Thus, if we copy, we allow the application to run up an unlimited memory debt on the container - not going to scale! Or, if we don't return until written, then we are blocking, not asynchronous.
The only way to have true async IO with bounded resource consumption (in java) is to not copy and to tell the application that they are not free to modify the passed buffers until completion is signalled. Our completion signal is the ultimate completion (at least from a java sense), in that we wait until told by NIO that the write has completed. Now in the case of Mapped byte buffers, perhaps that completion is before the OS has really flushed the data and maybe it is using COW paging to copy the data we pass it - which is fine, as that does not break our contract with the app, that after we notify completion they are free to modify/reuse/discard the passed buffer.
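The ownership contract described above - the buffer belongs to the IO layer until completion is signalled, with no copy and no blocking - can be sketched with NIO.2's AsynchronousSocketChannel. This is an illustration of the contract, not Jetty's actual implementation, and the class/method names are invented:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.function.Consumer;

// The caller hands the buffer over and must not modify/reuse/discard it
// until onDone runs; the IO layer re-issues writes until it drains, never
// copying the data.
public final class OwnedWrite {
    static void writeFully(AsynchronousSocketChannel ch, ByteBuffer buf,
                           Consumer<Throwable> onDone) {
        ch.write(buf, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer n, Void a) {
                if (buf.hasRemaining()) {
                    ch.write(buf, null, this);  // partial write: keep going
                } else {
                    onDone.accept(null);        // buffer is the caller's again
                }
            }
            @Override public void failed(Throwable t, Void a) {
                onDone.accept(t);
            }
        });
    }
}
```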
To apply back pressure in an async API you should not block an async submission. Rather, you should delay the notification mechanism, which may be a completion (onComplete) or a readiness (onWritePossible) notification depending on the API. In a server dealing with hundreds of thousands of connections you just don't want resource starvation to cause any blocking, as that is a really rapid road to thread starvation.
Also remember that with a readiness model, an app can still issue an arbitrarily large write, which cannot be written without congestion, nor copied without breaking memory constraints - so we just place that buffer in our queue and asynchronously write chunks (or all of it, depending on the protocol) until we are signalled that it is complete, and then we signal the app that their write is complete.
But I understand that you want to deal with only BB, which is a convenient thing to do.
I'm not sure why compressing/encrypting or otherwise mutating the MBB would suck it into user space; the point of mmap'd memory is so user and kernel share it. Touching pages that aren't resident may page fault, but that's not mutation specific. Perhaps I misunderstood.
Yes, but I'm not following your point here. With the readiness model, it's simply non-blocking rather than async. If you attempt to shove more data down the pipe than there's space for, you get an EAGAIN. So instead of blocking you, it just rejects your operation. In that case, you end up having to apply backpressure yourself, such as with the queue you mention. But presumably you'd then want to take care not to exhaust your process memory with too much queueing, by rejecting more requests or whatever the appropriate backpressure plan is there.
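The queue-plus-readiness approach under discussion can be sketched as follows (names are illustrative, not from any real framework). In Java the EAGAIN surfaces as a non-blocking write() accepting fewer bytes than remain; the unsent buffer stays queued, the caller registers OP_WRITE interest, and flush() is retried when the selector reports the socket writable:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Queue;

// Buffers are owned by the queue until fully written; nothing is copied and
// nothing blocks. Backpressure falls out of the queue depth.
public final class WriteQueue {
    private final Queue<ByteBuffer> pending = new ArrayDeque<>();

    void enqueue(ByteBuffer buf) {
        pending.add(buf);
    }

    // Returns true once everything queued so far has been flushed.
    boolean flush(WritableByteChannel ch) throws IOException {
        while (!pending.isEmpty()) {
            ByteBuffer head = pending.peek();
            ch.write(head);            // non-blocking: may write 0 bytes
            if (head.hasRemaining()) {
                return false;          // would block: caller waits for OP_WRITE
            }
            pending.remove();
        }
        return true;
    }
}
```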
On Saturday, December 19, 2015 at 10:31:41 AM UTC+11, Vitaly Davidovich wrote:
> But I understand that you want to deal with only BB, which is a convenient
> thing to do.
Indeed we've abstracted around BB on the assumption that there is some advantage of writing Direct and/or memory mapped file buffers.
However, the suggestion here is that such buffers are copied anyway during the write - which is what I'm questioning. Why is there a copy? How can it possibly copy large buffers unless it moves them to swap space? And if it does copy, then what exactly is the benefit of direct buffers and memory mapped files anyway?
> I'm not sure why compressing/encrypting or otherwise mutating the MBB would
> suck it into user space; the point of mmap'd memory is so user and kernel
> share it. Touching pages that aren't resident may page fault, but that's not
> mutation specific. Perhaps I misunderstood.
Ah, that is a failing in the Java APIs for compressing/encrypting, both of which require byte[] rather than ByteBuffer. So using pure Java for encryption/compression means that the content has to be brought into the Java heap just so it can exist in a byte[] - yuck! That is a good reason to offload SSL from Java app servers and to precompress whenever possible.
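The byte[]-only failing can be seen with java.util.zip.Deflater, which (before the ByteBuffer overloads arrived in Java 11) accepts input only as byte[]. A hedged sketch of what that forces on a direct buffer - the class and method names are invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Data sitting in a direct or mapped buffer must be copied onto the Java
// heap purely so it can exist as a byte[] for Deflater.setInput().
public final class HeapCopyDeflate {
    static byte[] deflate(ByteBuffer direct) {
        byte[] heapCopy = new byte[direct.remaining()];
        direct.get(heapCopy);            // the unwanted copy into the Java heap
        Deflater def = new Deflater();
        def.setInput(heapCopy);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!def.finished()) {
            out.write(chunk, 0, def.deflate(chunk));
        }
        def.end();
        return out.toByteArray();
    }
}
```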