FAdvise/MAdvise for Persistent Memory


Maximilian Böther

Mar 12, 2021, 11:13:10 AM
to pmem
Hello,

we are working on benchmarks that compare various I/O methods on PMem (and NVMe SSDs). For Linux file I/O (read/write), there is the `posix_fadvise` system call, which, among other things, configures how aggressively the kernel prefetches (reads ahead) pages.

1) In a sequential read workload on NVMe, we observe that disabling page prefetching (`fadvise(POSIX_FADV_RANDOM)`) worsens performance, and increasing the prefetching (`fadvise(POSIX_FADV_SEQUENTIAL)`), as expected, improves it. However, on fsdax PMem, we do not see any change at all, no matter which fadvise call we use. Both the PMem and the NVMe device use ext4, but the PMem file system is mounted with the dax option. We were wondering whether the kernel's page prefetching mechanism is disabled for dax ext4, or whether the fadvise calls are currently just ignored.

2) We furthermore tested the behavior of `mmap` and `madvise`. While we are not sure how fadvise and madvise interact when both are used on a memory-mapped file, we again see that on NVMe, calling `madvise(MADV_RANDOM)` disables the mmap prefetching mechanism, resulting in worse sequential read performance. For PMem, `madvise` also seems to have no effect at all. Again, we wonder whether the kernel simply does not prefetch pages for PMem, or whether our madvise hints are ignored due to a missing implementation. (A sketch of how we issue both hints follows after question 3.)

3) Somewhat unrelated, but maybe one of you has an idea: when sequentially reading 4k pages with 8 threads using Linux I/O (read), we get a throughput of 15 GB/s, while with mmap we get 10 GB/s. However, if we increase the page size to 16k, Linux I/O reaches 20 GB/s and mmap 24 GB/s. We would have expected mmap to be faster in all cases. For a random read benchmark, it is; only sequentially reading 4k pages is an outlier with respect to the ratio of mmap to Linux I/O. Does anyone have an idea why this is the case? Our only guess was the copy implementations: `read` on our 5.4 kernel uses the kernel's own `memcpy_mcsafe` function, while `mmap` uses glibc's `memcpy`. However, as the random workload behaves as expected (mmap > Linux I/O), this does not seem to be the reason.
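
For reference, a minimal sketch of how we issue the hints from questions 1 and 2; the path and the length are placeholders, and error handling is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem0/data.bin", O_RDONLY); /* placeholder path */
    size_t len = 1UL << 30;                         /* 1 GiB, for illustration */

    /* Hint for read(): expect sequential access (more readahead),
     * or POSIX_FADV_RANDOM to disable readahead. */
    posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);

    /* The analogous hint for a mapping. */
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    madvise(p, len, MADV_SEQUENTIAL);               /* or MADV_RANDOM */

    munmap(p, len);
    close(fd);
    return 0;
}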

Thank you very much in advance for any pointers, advice, or explanations.

Best,
Maximilian




Jungsik Choi

Mar 12, 2021, 9:39:03 PM
to Maximilian Böther, pmem
Dear Maximilian Böther,

The fadvise() and madvise() system calls use application hints to make paging more efficient. So you cannot get the expected effect with DAX, where paging through the page cache does not happen.

In the past, I wrote a paper about this.

I hope this helps.

Thanks,
Jungsik Choi

On Sat, Mar 13, 2021 at 1:13 AM, Maximilian Böther <maxim...@boether.de> wrote:

Maximilian Böther

Mar 13, 2021, 5:05:43 AM
to pmem
Dear Jungsik Choi,

thank you very much for your answer. The paper was indeed a very relevant and interesting read. However, it may be that I do not fully understand what DAX actually does.

You say that "paging doesn't happen" with DAX. However, in the paper, you compare mmap with and without MAP_POPULATE, which _prefaults_ pages. If there were no paging mechanism at all, that comparison would not make sense, would it? Do you perhaps mean that _page caching_ doesn't happen? If so, is DAX basically equivalent to opening a file in O_DIRECT mode, or is there a conceptual difference between DAX and O_DIRECT?
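
To make sure we mean the same thing, here is a sketch of the two mappings I am comparing (path and size are placeholders, error handling omitted):

#define _GNU_SOURCE            /* for MAP_POPULATE */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem0/data.bin", O_RDONLY); /* placeholder path */
    size_t len = 1UL << 30;

    /* lazy: every first access to a page takes a fault; on DAX the fault
     * maps PMem directly instead of filling the page cache */
    void *lazy = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    /* eager: MAP_POPULATE prefaults the page-table entries at mmap() time */
    void *eager = mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);

    munmap(lazy, len);
    munmap(eager, len);
    close(fd);
    return 0;
}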

Second, I do not yet understand whether DAX only affects the mmap system call, or also read. If I look at the call stack of a read syscall on an ext4 DAX file in VTune, I can see that specialized DAX methods of ext4 are being called.

Thank you very much for your help.

Best,
Maximilian

Jungsik Choi

Mar 13, 2021, 6:03:30 AM
to Maximilian Böther, pmem
I think this page will help.

As you know, the page cache is used to buffer reads and writes to files. It is also used to provide the pages which are mapped into userspace by mmap.

For persistent memories, the page cache pages would be unnecessary copies of the original storage. The DAX code removes the extra copy by performing reads and writes directly to the storage device. For file mappings, the storage device is mapped directly into the userspace.

As you mentioned, DAX and O_DIRECT are conceptually similar. However, O_DIRECT works differently from DAX: for example, all I/O is still done to/from user-space buffers (there are no direct pointers to the media).
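
A small sketch of that difference (paths are placeholders, error handling omitted):

#define _GNU_SOURCE                       /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT: bypasses the page cache, but the data still lands in a
     * (suitably aligned) user-space buffer through an explicit I/O. */
    int fd = open("/mnt/nvme0/data.bin", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);     /* O_DIRECT needs aligned buffers */
    pread(fd, buf, 4096, 0);
    close(fd);

    /* DAX mmap: no buffer and no I/O; the pointer dereferences the media. */
    int dfd = open("/mnt/pmem0/data.bin", O_RDONLY);
    char *base = mmap(NULL, 4096, PROT_READ, MAP_SHARED, dfd, 0);
    char first = base[0];                 /* a load straight from PMem */
    (void)first;
    munmap(base, 4096);
    close(dfd);
    free(buf);
    return 0;
}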

Thanks,
Jungsik Choi

On Sat, Mar 13, 2021 at 7:05 PM, Maximilian Böther <maxim...@boether.de> wrote:

Maximilian Böther

Mar 15, 2021, 8:10:01 AM
to pmem
The kernel documentation indeed helped. However, it did not differentiate between the roles of mmap and read. For mmap, I now understand that we map the _device_ directly into userspace, avoiding system calls for further accesses as well as unnecessary copies from PMem to DRAM or from kernel space to user space.

However, for read, I wonder about the impact. I know that "normal" reads copy the result from kernel space to user space and therefore always require a syscall/mode switch into kernel mode. It is unclear to me how this interacts with DAX. The idea of DAX is to read directly from PMem or other storage devices. If read() still did a kernel-mode to user-mode copy, that would mean copying the PMem data into DRAM, which is exactly what DAX wants to avoid. Therefore, I do not understand what a DAX read implies in comparison to mmap.

Thank you so much again.

Best,
Maximilian 

Jungsik Choi

Mar 15, 2021, 8:38:06 AM
to Maximilian Böther, pmem
The process of performing read() is as follows.

User Buffer  <------------  Page Cache  <-------------  Existing FS
   (user)        memcpy      (kernel)        paging       (kernel)

User Buffer  <------------    DAX-FS
   (user)        memcpy      (kernel)

Thanks,
Jungsik Choi

On Mon, Mar 15, 2021 at 9:10 PM, Maximilian Böther <maxim...@boether.de> wrote:

Eduardo Berrocal

Mar 18, 2021, 3:28:51 PM
to pmem
Maximilian Böther:

" I know that "normal" reads copy the results from kernel space to user space and therefore always require a syscall/mode switch into kernel mode. It is unclear to me how this interacts with DAX. The idea of DAX is to read directly from PMem/other storage devices. If read() would still do a kernel mode -> user mode copy, this would mean a copy of the PMem data into DRAM, which is what DAX wants to avoid. Therefore, I do not understand what a DAX-read implicates, in comparison to mmap. "

When you do a regular read() against PMem with DAX, all you are doing is calling into the kernel, which then issues the memory copies for you (instead of you doing them directly from user space). That is good for retrofitting code that uses the I/O API, but you pay the penalty of the switch into the OS. The copies the kernel performs go directly from PMem into the buffer you pass; there is no intermediate copy, because there is no page cache.
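
In code, the two paths look roughly like this (the path is a placeholder, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem0/data.bin", O_RDONLY); /* placeholder path */
    char buf[4096];

    /* path 1: a syscall; the kernel copies PMem -> buf, no page cache stop */
    pread(fd, buf, sizeof buf, 0);

    /* path 2: map once, then copy (or just dereference) with no syscall */
    char *base = mmap(NULL, sizeof buf, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(buf, base, sizeof buf);

    munmap(base, sizeof buf);
    close(fd);
    return 0;
}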

Hope this helps,

Eduardo.

Mason, William A

Mar 21, 2021, 2:15:46 AM
to Eduardo Berrocal, pmem
It's also worth noting that memcpy in user mode can be considerably faster than it is in the Linux kernel, since the kernel doesn't use SIMD operations by default (recall that on x86 saving the floating-point/vector registers is traditionally very expensive, so the norm is not to do so). I recently did some quick benchmarking of this and found it varied widely by platform, but in no case I measured, across six different machines, was the Linux kernel memcpy faster than user memcpy, and it was usually much slower.
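
A rough sketch of the user-space side of such a measurement (the buffer size is arbitrary and this is not the exact benchmark I ran; the kernel side would go through read() on a DAX file instead):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t len = 1UL << 28;               /* 256 MiB, arbitrary */
    char *src = malloc(len), *dst = malloc(len);
    memset(src, 1, len);                  /* fault the pages in first */
    memset(dst, 0, len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);                /* glibc memcpy: SIMD on x86 */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s\n", len / s / 1e9);

    free(src);
    free(dst);
    return 0;
}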

Tony


On Mar 18, 2021, at 12:29 PM, Eduardo Berrocal <edube...@gmail.com> wrote:


Maximilian Böther

Mar 21, 2021, 5:14:11 AM
to pmem
Hi,

thank you for the additional input. Indeed, we also verified with VTune that the kernel memcpy function does not use SIMD instructions, while std::memcpy does. However, on our machine, we still measure that read() is fast, especially for smaller access sizes and thread counts. In the attached plot, the left side shows a sequential scan of 4k pages and the right side one of 16k pages. For 4k pages, read is always faster than mmap+memcpy; for 16k pages, mmap+memcpy is only faster with 32 or more threads. This is interesting because, as you stated, one would expect SIMD instructions to always win. Both the bandwidth and the instructions used were confirmed with VTune.
[Attachment: bw.png - sequential-scan bandwidth for 4k pages (left) and 16k pages (right)]

Niall Douglas

Mar 22, 2021, 7:08:45 AM
to pmem
On Sunday, March 21, 2021 at 6:15:46 AM UTC fsg...@gatech.edu wrote:
It's also worth noting that memcpy in user mode can be considerably faster than it is in the Linux kernel, since the kernel doesn't use SIMD operations by default (recall that on x86 saving the floating-point/vector registers is traditionally very expensive, so the norm is not to do so). I recently did some quick benchmarking of this and found it varied widely by platform, but in no case I measured, across six different machines, was the Linux kernel memcpy faster than user memcpy, and it was usually much slower.
 
I may be out of date now, but historically an overly aggressive memcpy in the kernel badly hurt whole-system performance. One deliberately wrote memcpy to work in smaller chunks than it could have used, and on heavily loaded systems you got far better performance.

i.e. one trades some single-instance memcpy performance for greatly improved multi-instance memcpy performance.
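
The idea in miniature (CHUNK is a made-up tuning knob, not a value from any real kernel):

#include <string.h>

#define CHUNK (64 * 1024)

static void chunked_copy(char *dst, const char *src, size_t n)
{
    while (n > 0) {
        size_t c = n < CHUNK ? n : CHUNK;
        memcpy(dst, src, c);
        dst += c;
        src += c;
        n -= c;
        /* a kernel would check for pending work / preemption between chunks */
    }
}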

Niall