How to efficiently upload file chunks


Martin Grotzke

Sep 10, 2017, 9:08:48 AM9/10/17
to mechanical-sympathy
Hi,

TL;DR: my question is about MappedByteBuffer vs. direct ByteBuffers when
uploading chunks of a file from NFS.

Details: I want to upload file chunks to some cloud storage. Input files
are several GB in size (say, somewhere between 1 and 100 GB), accessed via
NFS on 64-bit Linux/CentOS. Each input file has to be split into chunks of
roughly 1 to 10 MB (the split points are given by an index, i.e. I have a
list of byte ranges for each file).

I'm planning to use async-http-client (AHC) to upload file chunks via
`setBody(ByteBuffer)` [1].
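
For illustration, roughly how I'd hand one chunk to AHC (the endpoint,
object name and part numbering below are just placeholders, not a real
storage API):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.asynchttpclient.AsyncHttpClient;
import org.asynchttpclient.Dsl;
import org.asynchttpclient.Response;

public class ChunkUploadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder chunk; in reality this would be a MappedByteBuffer or a
        // pooled direct ByteBuffer filled from the input file (see below).
        ByteBuffer chunk = ByteBuffer.wrap("example chunk".getBytes(StandardCharsets.UTF_8));

        try (AsyncHttpClient client = Dsl.asyncHttpClient()) {
            Response response = client
                    .preparePut("https://storage.example.com/myfile?partNumber=1") // placeholder URL
                    .setBody(chunk) // the setBody(ByteBuffer) overload from [1]
                    .execute()
                    .get();
            System.out.println("upload status: " + response.getStatusCode());
        }
    }
}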

My two favourites for splitting the file into chunks (ByteBuffers) are
1) FileChannel.map -> MappedByteBuffer
2) FileChannel.read(ByteBuffer) -> a (pooled) direct ByteBuffer
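
To make the two variants concrete, here is a rough sketch of how I'd
extract a single byte range (the path, offset and chunk size are made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkingSketch {
    public static void main(String[] args) throws IOException {
        long offset = 0;              // start of one byte range (made up)
        int length = 8 * 1024 * 1024; // ~8 MB chunk (made up)

        try (FileChannel channel = FileChannel.open(
                Paths.get("/mnt/nfs/input.bin"), StandardOpenOption.READ)) { // placeholder path

            // 1) map the byte range; nothing necessarily gets read from NFS yet
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);

            // 2) copy the byte range into a direct buffer (which would come from a pool);
            //    a real version has to loop, since read() may return fewer bytes than requested
            ByteBuffer direct = ByteBuffer.allocateDirect(length);
            channel.read(direct, offset);
            direct.flip();
        }
    }
}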

My understanding of 1) is that the MappedByteBuffer represents a segment
of virtual memory, so the OS does not even have to load the data from NFS
at mmap time, as long as the MappedByteBuffer is not read. When AHC/netty
writes the buffer to the output (socket) channel, the OS/kernel loads the
data from NFS into the page cache and then writes those pages to the
network socket (to be honest, I have no clue how NFS works at that level
and how the kernel actually loads the file chunks).

Is this understanding correct?

My understanding of 2) is that on FileChannel.read(ByteBuffer) the OS
reads the data from NFS and copies it into the memory region backing the
direct ByteBuffer. When AHC/netty writes the ByteBuffer to the output
channel, the OS copies the data from that memory region to the network
socket.

Is this understanding correct?

Based on these assumptions, 1) should be _a bit_ more efficient than 2),
but not significantly. With 1) my concern is that it's not possible to
unmap the memory-mapped file [2] and I have less control over native
memory usage. Therefore my current preference is 2), using pooled direct
ByteBuffers.

What do you think about this concern?

Is there an even better way than 1) and 2) to achieve what I want?

Thanks && cheers,
Martin


[1]
https://github.com/AsyncHttpClient/async-http-client/blob/master/client/src/main/java/org/asynchttpclient/RequestBuilderBase.java#L390
[2] http://bugs.java.com/view_bug.do?bug_id=4724038


Avi Kivity

Sep 10, 2017, 9:25:20 AM9/10/17
to mechanica...@googlegroups.com
mmap() should be avoided unless the workload is fairly static, i.e.:


 1. There's a bunch of setup at the start of the program, not constant
setting up and tearing down of mmaps.

 2. The mmapped data is accessed many times and is unlikely to be paged
out; the total number of mapped pages is significantly smaller than
memory size.


The setup/teardown cost of mmap, as well as of mapping and unmapping
pages, is very high and is likely to result in a net loss unless it is
amortized over a large number of reads of the same pages.


So, for your use case I recommend traditional synchronous reads. I
assume this is option 2) in your terminology.
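
A rough sketch of such a plain positional read, with buffer pooling and
error handling elided (the path and chunk size are just placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SyncReadSketch {
    // Fill the buffer from the given file offset using ordinary positional
    // reads; read() may return fewer bytes than requested, hence the loop.
    static void readChunk(FileChannel channel, long offset, ByteBuffer buffer) throws IOException {
        buffer.clear();
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, offset + buffer.position());
            if (n < 0) {
                break; // end of file
            }
        }
        buffer.flip(); // ready to be handed to the HTTP client
    }

    public static void main(String[] args) throws IOException {
        // The direct buffer can be reused (or pooled) across chunks.
        ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024 * 1024); // placeholder chunk size
        try (FileChannel channel = FileChannel.open(
                Paths.get("/mnt/nfs/input.bin"), StandardOpenOption.READ)) { // placeholder path
            readChunk(channel, 0L, buffer);
            System.out.println("read " + buffer.remaining() + " bytes");
        }
    }
}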

Chet L

Jan 12, 2018, 3:47:21 PM1/12/18
to mechanical-sympathy
Martin,

minor note: in the future, if you decide to checksum the chunks before
writing them out (chunk data plus a checksum as metadata), then you will
end up reading the chunk contents from NFS anyway.
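
For example (purely illustrative, using CRC32 as the checksum):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChunkChecksumSketch {
    // Computing a checksum touches every byte of the chunk, so a mapped
    // buffer gets faulted in from NFS right here, before the upload.
    static long checksum(ByteBuffer chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk.duplicate()); // update() consumes the buffer, so work on a duplicate
        return crc.getValue();
    }

    public static void main(String[] args) {
        ByteBuffer chunk = ByteBuffer.wrap("example chunk".getBytes(StandardCharsets.UTF_8));
        System.out.println("crc32=" + checksum(chunk));
    }
}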

Chetan

Martin Grotzke

Jan 13, 2018, 4:54:48 AM1/13/18
to mechanica...@googlegroups.com

Thanks for your hints, Avi and Chetan!

Starting from your suggestions (Avi), I found some more reading material on mmap costs and went with approach 2 (avoiding mmap). This has been working nicely for quite some time now.

Thanks again,
Martin

