Java, fallocate, mmap, and write performance.

Kevin Burton

Sep 8, 2013, 12:54:29 AM

to mechanica...@googlegroups.com

I'm following up on last week's thread, where we discussed the performance of file writes and the CPU cost involved.

I wanted to benchmark mmap vs write()... mmap should in theory be faster because it doesn't require a system call for each write. write() takes an fd, a pointer, and a length, which is nice in that it's easy, but for lots of SMALL writes the per-call overhead is going to be a bit of a pain.

The alternative approach is to mmap your file, then write your data, then truncate it when you're complete.

The problem is that mmap must map a file which is already allocated.

The JVM doesn't have support for fallocate.

But RandomAccessFile has a setLength() method ...

I traced that down and it's a native method mapped to ftruncate.

My strace showed that the JVM does in fact call ftruncate:

[pid 8357] ftruncate(7, 2147483647) = 0

...
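
A minimal sketch of the mmap approach described above (extend the file, map it, write, then truncate), not the benchmark code from this post. The class name MmapWriteSketch, the method writeViaMmap, and the 1 MiB capacity are made up for illustration. It assumes Linux or OS X; on Windows, truncating or deleting a file while a mapping is still live generally fails.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapWriteSketch {

    /**
     * Extends the file to `capacity`, writes `count` copies of `record`
     * through a memory mapping, then trims the file to the bytes written.
     * The caller must ensure count * record.length <= capacity.
     */
    static long writeViaMmap(Path path, long capacity, byte[] record, int count)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            // setLength is the call the post traced to ftruncate; on file
            // systems with sparse-file support it allocates no blocks.
            raf.setLength(capacity);

            // A single mapping cannot exceed Integer.MAX_VALUE bytes.
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);

            for (int i = 0; i < count; i++) {
                map.put(record); // plain memory copy, no system call per record
            }
            long written = map.position();

            map.force();            // flush dirty pages to the file
            raf.setLength(written); // drop the unused tail
            return written;
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mmap-sketch", ".dat");
        byte[] record = "hello\n".getBytes(StandardCharsets.US_ASCII);
        long written = writeViaMmap(path, 1 << 20, record, 1000);
        // Bytes written and the final file size should match.
        System.out.println(written + " " + Files.size(path));
        Files.delete(path);
    }
}
```

The 2147483647 in the strace line is Integer.MAX_VALUE, the largest size a single MappedByteBuffer can cover; larger files need several mappings.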

The problem is that the behavior of ftruncate is magic and depends on the underlying filesystem.

On my home OS X machine with HFS+, the benchmark I wrote took about 12 seconds to write out 2GB of data to disk.

That's because HFS+ doesn't support sparse files.

I ran the same tests on ext3 and XFS and both executed in 0ms...

The resulting files were correct.

XFS shows that the file was allocated with the correct size.

xfs_bmap reports:

/d0/test.txt: no extents

And stat reports....

root@util0029:/usr/local/peregrine# stat /d0/test.txt

File: `/d0/test.txt'

Size: 2147483647 Blocks: 0 IO Block: 4096 regular file

Device: 805h/2053d Inode: 177 Links: 1

Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)

So calling setLength is efficient... however, I guess I still don't like it. I would rather call fallocate via JNA. The behavior of fallocate is explicit, not magic. I don't have to worry about something magically not working correctly in the future.

Of course it's possible I'm just obsessing over this. One approach is to write a wrapper function that insists on calling fallocate on Linux/Solaris and calls ftruncate on BSD. This way I can still debug code on my Mac, and integration will find any issues on Linux.

Kevin
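
One way the wrapper idea above could be sketched, under stated assumptions. Everything here (PreallocateSketch, Preallocator, choose) is a hypothetical name, and no real native binding is included: the nativeFallocate argument stands in for whatever JNA or JNI call to fallocate a real build would supply. On Linux/Solaris the wrapper insists on that binding; elsewhere it falls back to setLength.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public class PreallocateSketch {

    /** Strategy for reserving space; a native fallocate binding would implement this. */
    interface Preallocator {
        void preallocate(RandomAccessFile file, long size) throws IOException;
    }

    /** Portable fallback: setLength (ftruncate underneath), which may leave the file sparse. */
    static final Preallocator SET_LENGTH = (file, size) -> file.setLength(size);

    /**
     * On Linux and Solaris (os.name "SunOS"), insist on the native strategy
     * and fail loudly if none was supplied; on any other OS use setLength.
     */
    static Preallocator choose(String osName, Preallocator nativeFallocate) {
        String os = osName.toLowerCase(Locale.ROOT);
        boolean wantsFallocate = os.contains("linux") || os.contains("sunos");
        if (!wantsFallocate) {
            return SET_LENGTH;
        }
        if (nativeFallocate == null) {
            throw new IllegalStateException("fallocate binding required on " + osName);
        }
        return nativeFallocate;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder only: a real build would pass a JNA/JNI fallocate binding here.
        Preallocator standIn = SET_LENGTH;
        Preallocator chosen = choose(System.getProperty("os.name"), standIn);

        Path path = Files.createTempFile("prealloc-sketch", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw")) {
            chosen.preallocate(raf, 1 << 20);
        }
        System.out.println(Files.size(path));
        Files.delete(path);
    }
}
```

Keeping the native call behind an interface means the same code path can be debugged on a Mac while integration tests on Linux exercise the real fallocate binding.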

Peter Lawrey

Sep 8, 2013, 1:41:16 AM

to mechanica...@googlegroups.com

For large sustained writes I expect the throughput of either mmap or write() to be bound by the performance of the file system or hardware. When I use mmap it is because I want low latency for each persistence operation and I expect the application to spend up to 90% of its time doing real work rather than writing. Put another way, it is not using more than 50% of the maximum write bandwidth.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Michael Barker

Sep 8, 2013, 3:49:21 AM

to mechanica...@googlegroups.com

I had the need to do something similar previously, i.e. pre-allocating a file in order to speed up disk writes (it makes a big difference if you are syncing to disk). We found that the RandomAccessFile.setLength() call will result in a sparse file on Linux (ext3/4). Instead of dropping to JNI/JNA code we would allocate a "template file" using RandomAccessFile.setLength(). For the actual journal file we would open 2 file channels and perform a templateFile.transferTo(journalFile). This would get the same results as fallocate (certainly the same performance profile when being written to) and was more portable.

Mike.
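
A rough sketch of the template-file technique described above, assuming nothing beyond the standard library. The names TemplateCopySketch, createTemplate, and copyTemplate are invented for illustration; this is not the original journal code. Whether the destination ends up fully allocated when the template is sparse depends on the kernel and file system.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TemplateCopySketch {

    /** Creates the template with setLength, as in the post (sparse on ext3/4). */
    static void createTemplate(Path template, long size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(template.toFile(), "rw")) {
            raf.setLength(size);
        }
    }

    /** Copies the template into the journal file with FileChannel.transferTo. */
    static long copyTemplate(Path template, Path journal) throws IOException {
        try (FileChannel src = FileChannel.open(template, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(journal,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long size = src.size();
            long copied = 0;
            // transferTo may move fewer bytes than requested, so loop.
            while (copied < size) {
                long n = src.transferTo(copied, size - copied, dst);
                if (n <= 0) {
                    break; // nothing more could be transferred
                }
                copied += n;
            }
            return copied;
        }
    }

    public static void main(String[] args) throws IOException {
        Path template = Files.createTempFile("template", ".dat");
        Path journal = Files.createTempFile("journal", ".dat");
        createTemplate(template, 1 << 20);
        long copied = copyTemplate(template, journal);
        System.out.println(copied + " " + Files.size(journal));
        Files.delete(template);
        Files.delete(journal);
    }
}
```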

Kevin Burton

Sep 8, 2013, 12:38:20 PM

to mechanica...@googlegroups.com

I'm not sure I understand. Why is transferTo faster? Because Java wasn't involved in the write?

If the template file was created via setLength, it would be a sparse file on ext3/4, and calling setLength repeatedly would always result in a sparse file.

Kevin

Martin Thompson

Sep 8, 2013, 5:55:09 PM

to

Under the covers FileChannel.transferTo() can copy a file all within the kernel using sendfile().

Kevin Burton

Sep 8, 2013, 12:59:57 PM

to mechanica...@googlegroups.com

Right... so I was saying "Because Java wasn't involved in the write?" meaning Java doesn't perform the write, so this is just done within the kernel.

This still has the magic problem though...

Kevin

On Sunday, September 8, 2013 9:56:09 AM UTC-7, Martin Thompson wrote:
> FileChannel.transferTo() can under the covers copy a file all within the kernel using sendfile().

Michael Barker

Sep 8, 2013, 1:14:25 PM

to mechanica...@googlegroups.com

> I'm not sure I understand. Why is transferTo faster? Because Java wasn't involved in the write?

Not any faster, just a platform-agnostic, non-native way of achieving the same thing as fallocate.

Kirk Pepperdine

Sep 8, 2013, 2:51:39 PM

to mechanica...@googlegroups.com

Not magic: it eliminates the need to move the data from kernel space to user space and then back down to kernel space, so it uses fewer IO buffers and fewer memory copies.

-- Kirk

Martin Thompson

Sep 8, 2013, 4:33:48 PM

to mechanica...@googlegroups.com

The template file needs to be created by filling it with zeros so it is not sparse. This is a one-off operation.
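
A small sketch of creating a non-sparse template by writing real zeros, as suggested above. ZeroFillSketch, zeroFill, and the 64 KiB chunk size are illustrative choices, not anything taken from this thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroFillSketch {

    /** Writes `size` zero bytes to `path` in chunks, then forces them to disk. */
    static void zeroFill(Path path, long size) throws IOException {
        // Newly allocated buffers are zero-filled, and it is only ever read from.
        ByteBuffer zeros = ByteBuffer.allocateDirect(64 * 1024);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            long remaining = size;
            while (remaining > 0) {
                zeros.clear(); // reset position and limit; contents stay zero
                if (remaining < zeros.capacity()) {
                    zeros.limit((int) remaining);
                }
                while (zeros.hasRemaining()) {
                    remaining -= ch.write(zeros);
                }
            }
            ch.force(true); // flush data and metadata
        }
    }

    public static void main(String[] args) throws IOException {
        Path template = Files.createTempFile("zero-template", ".dat");
        zeroFill(template, 1 << 20);
        System.out.println(Files.size(template));
        Files.delete(template);
    }
}
```

Because the zeros are written explicitly, the result does not depend on sparse-file support, at the cost of writing the whole file once up front.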

Remko Popma

Sep 8, 2013, 7:19:28 PM

to mechanica...@googlegroups.com

Martin, is this the same scenario as the one Mike mentioned?

I don't follow... If you fill the RandomAccessFile with zeros first, then why would you need to call setLength?

Is FileChannel.transferTo faster than just asking the OS to copy the journal file?

Martin Thompson

Sep 9, 2013, 1:39:45 AM

to mechanica...@googlegroups.com

No need to set the length when you have filled the template file with zeros.

Michael Barker

Sep 9, 2013, 2:00:06 AM

to mechanica...@googlegroups.com

> The template file needs to be created by filling it with zeros so it is not sparse. This is a one-off operation.

Is this something that you've encountered with a later kernel version or a specific file system? On Linux 2.6.23/Ext3, doing transferTo from a sparse file was sufficient to ensure that you get the necessary physical writes to the destination file that is to be preallocated.

Mike.

Martin Thompson

Sep 9, 2013, 2:08:27 AM

to mechanica...@googlegroups.com

I've not tested with a sparse file. I went with a file filled with zeros to try and be as portable as possible.

Howard Chu

Sep 9, 2013, 10:13:11 AM

to mechanica...@googlegroups.com

BSD supports other filesystems too. The old FFS also supported sparse files. HFS+ is relatively unique in its lack of support; even Windows NTFS supports sparse files. If you want to be platform-agnostic, just write zeroes to the file to fill it out to the desired length.

This is the best approach for HDDs at least. Not so great if you're on a no-overwrite logging filesystem though. Also pretty horrible for SSDs.

We had a debate about a related topic on the linux-kernel list a year or so ago. The problem is that even with fallocate() the filesystem still needs to do metadata updates as you write over the allocated space. This is due to the kernel's insistence on protecting you from seeing stale data. (E.g., if you fallocate a file and the allocation includes blocks that used to reside in some other user's files.) The filesystem has to maintain a bit for every allocated page; if you try to read from the page before writing to it, the FS must give you zeros back instead of whatever is actually on the disk. And when you actually do overwrite the page, this bit must be cleared. It somewhat defeats the purpose of preallocating a file in the first place, because it means every write still has to seek to update FS metadata. (The debate was about adding an FS flag to disable this check, for DB users who just don't care and want max performance. As I recall, those of us who wanted an option to disable this check lost the argument.)

Holger Hoffstätte

Sep 9, 2013, 10:35:33 AM

to mechanica...@googlegroups.com

On 09/09/13 16:13, Howard Chu wrote:
> NTFS supports sparse files. If you want to be platform-agnostic just
> write zeroes to the file to fill it out to the desired length.

This approach is especially effective on zfs or btrfs with compression.
ZOMG speeds!!1

Seriously..

-h

Martin Thompson

Sep 9, 2013, 2:24:07 PM

to mechanica...@googlegroups.com

This is a good call out. Filling the file with zeros first will effectively cause it to be written twice. On HDD this is fine, but SSDs will wear out twice as quickly if this is the major workload.

However, if you have a sufficiently large amount of memory and page cache, then the file may only be written once to real disk if you are not doing synchronous/direct writes.

Martin..
