About atomicity of pmem-aware filesystems (DAX)


Chris Yu

Jun 17, 2019, 6:42:10 AM
to pmem
Hi experts,
I have some questions about DAX filesystems. When using pmem in fsdax mode, I can create a file and read/write it using filesystem APIs, and I can also mmap the file and access it with memory operations such as memcpy and memset.
But I have a few simple questions.
  1. I understand that when I mmap the file, I can bypass the page cache and directly access the pmem. But if I use the filesystem read/write APIs, do they bypass the page cache too?
  2. If the answer to question 1 is that we still bypass the page cache when using filesystem APIs, then when I write a single byte, how much data does the kernel write? A whole block, or just a byte? Can the kernel guarantee the atomicity of the block?
  3. What about filesystem management APIs such as fallocate and ftruncate? When I use these, they modify the contents of the inode. If the kernel writes to the inode block like a normal block and cannot guarantee its atomicity, the filesystem might be corrupted by a crash.

Thanks,
Chris Yu

Andy Rudoff

Jun 17, 2019, 11:48:10 AM
to pmem
Hi Chris,


But I have a few simple questions.

I agree the questions are simple.  My answers are not as simple... sorry about that!
 
1. I understand that when I mmap the file, I can bypass the page cache and directly access the pmem. But if I use the filesystem read/write APIs, do they bypass the page cache too?
 
The Linux file system community has been careful about what promises they make to applications.  They want to provide the semantics expected by applications, but they don't want to prevent file systems from doing optimizations that are transparent to applications.  In this way, you can think of DAX as being a "hint" to the file system, telling it the media allows direct access so the page cache is unnecessary.  Technically, the file system could still decide to use the page cache, some of the time or all of the time, as long as the semantics expected by applications are still met.  In the current implementation, both ext4 and XFS will not use the page cache on successful DAX mounts, even when you use read() and write().  But I'm giving the long answer because I want to make it clear the file systems reserve the right to change how they implement this.

2. If the answer to question 1 is that we still bypass the page cache when using filesystem APIs, then when I write a single byte, how much data does the kernel write? A whole block, or just a byte? Can the kernel guarantee the atomicity of the block?

For user data, there was never a guarantee of block atomicity in POSIX.  This is one of the most misunderstood facts about file systems.  When an application writes a block of data, a system crash can tear that write -- you could see some old data, some new data, or something worse (on an allocating write, where someone is appending data to a file, I've seen the file containing a block of zeros instead of the user data because the crash happened after the file size changed but before the new data was flushed to the media).  Applications should not be depending on write failure atomicity of user data -- POSIX never promised it. Only after a successful sync/fsync/msync, etc. does an application know the write is persistent.  For memory-mapped files where the MAP_SYNC flag was successfully used, Linux extends this to allow flushing to persistence using user space instructions like CLWB.  But any application that takes crash consistency seriously will use techniques like logging, checksumming (or both) to detect torn writes and recover from them after a crash.

Now that I've made that part clear, the answer to your first sentence is that if you are doing a write() to an already-allocated area of the file, the file system is free to use store instructions directly to the persistent data on media, and that's what both ext4 and XFS do.  Of course, if it is an allocating write, like appending to a file or writing to a hole in a file, then a bunch of allocation/metadata logic will also happen as a result.
 
3. What about filesystem management APIs such as fallocate and ftruncate? When I use these, they modify the contents of the inode. If the kernel writes to the inode block like a normal block and cannot guarantee its atomicity, the filesystem might be corrupted by a crash.

Both ext4 and XFS use journaling to provide consistency in the face of a crash.

-andy

Steve Scargall

Jun 17, 2019, 12:35:50 PM
to pmem


On Monday, June 17, 2019 at 9:48:10 AM UTC-6, Andy Rudoff wrote:

3. What about filesystem management APIs such as fallocate and ftruncate? When I use these, they modify the contents of the inode. If the kernel writes to the inode block like a normal block and cannot guarantee its atomicity, the filesystem might be corrupted by a crash.

Both ext4 and XFS use journaling to provide consistency in the face of a crash.

You can create a SECTOR-type namespace, which provides atomic block operations just as an SSD/NVMe drive does.  However, this mode does not support DAX; it is meant for applications and filesystems that lack their own mechanisms, such as logging or journaling, for atomic or crash-consistent updates of data and metadata.

From ndctl-create-namespace(1)

fsdax: Filesystem-DAX mode is the default mode of a namespace when specifying ndctl create-namespace with no options. It creates a block device (/dev/pmemX[.Y]) that supports the DAX capabilities of Linux filesystems (xfs and ext4 to date). DAX removes the page cache from the I/O path and allows mmap(2) to establish direct mappings to persistent memory media. The DAX capability enables workloads / working-sets that would exceed the capacity of the page cache to scale up to the capacity of persistent memory. Workloads that fit in page cache or perform bulk data transfers may not see benefit from DAX. When in doubt, pick this mode.

sector: Use this mode to host legacy filesystems that do not checksum metadata or applications that are not prepared for torn sectors after a crash. Expected usage for this mode is for small boot volumes. This mode is compatible with other operating systems.

Here's a link to the documentation with examples for creating FSDAX and SECTOR namespaces:


- Steve

Chris Yu

Jun 20, 2019, 6:30:40 AM
to pmem
Hi Andy,
Thanks for the answer.
And one more simple question:
Since the filesystem APIs also bypass the page cache, when reading some data from a file into a buffer, how does the efficiency of the filesystem APIs compare with using memcpy after mmap? Are they as fast as mmap?

Thanks,
Chris Yu

On Monday, June 17, 2019 at 11:48:10 PM UTC+8, Andy Rudoff wrote:

Andy Rudoff

Jun 20, 2019, 7:59:56 AM
to pmem
Hi Chris,

For the operation you're describing, copying data from pmem into DRAM, I've found the kernel path performs just as well as the user space version as long as you're copying a big enough chunk of data.  Of course, benchmarking your use case will give you the best information.  I found for small copies (smaller than about 4k), the overhead of trapping into the kernel and any kernel locks that end up being taken makes a measurable difference.  However, as you increase the transfer size to 4k or larger that overhead becomes such a small portion of the time spent copying from the pmem media that it doesn't matter any more and the performance is basically the same as doing it in user space.

-andy