Hello,
we are working on benchmarks that compare various I/O methods on PMem (and NVMe SSDs). For Linux file I/O (read/write), there is the `posix_fadvise` system call, which, among other things, configures how aggressively the kernel prefetches pages.
1) In a sequential read workload on NVMe, we observe that disabling page prefetching (`posix_fadvise` with `POSIX_FADV_RANDOM`) worsens performance, and increasing the prefetching (`POSIX_FADV_SEQUENTIAL`), as expected, improves it. However, on fsdax PMem we do not see any change at all, no matter which fadvise hint we use. Both the PMem and the NVMe device use ext4, but PMem is mounted with the dax option. We were wondering whether the kernel's page prefetching mechanism is disabled for dax ext4 or whether the fadvise calls are currently just ignored.
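For reference, the kind of calls we mean look roughly like this (file path, block size, and the missing error handling are placeholders/simplifications, not our actual benchmark code):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    int fd = open("/mnt/pmem/testfile", O_RDONLY);   /* placeholder path */
    if (fd < 0)
        return 1;

    /* On the block-device path, SEQUENTIAL roughly doubles the readahead
     * window and RANDOM disables readahead for this file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

    char *buf = malloc(BLOCK_SIZE);
    while (read(fd, buf, BLOCK_SIZE) > 0)
        ;   /* the real benchmark consumes buf here */

    free(buf);
    close(fd);
    return 0;
}
```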
2) We furthermore tested the behavior of `mmap` and `madvise`. While we are not sure how fadvise and madvise interact when both are used on a memory-mapped file, we again see that on NVMe, calling `madvise` with `MADV_RANDOM` disables the mmap prefetching mechanism, resulting in worse sequential read performance. For PMem, `madvise` also seems to have no effect at all. Again, we wonder whether the kernel simply does not prefetch pages for PMem, or whether our madvise hints are just ignored due to a missing implementation.
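The mmap path, again heavily simplified (same placeholder file, single thread, no error handling):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/testfile", O_RDONLY);   /* placeholder path */
    struct stat st;
    fstat(fd, &st);

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    /* MADV_RANDOM should disable readahead/fault-around for the mapping;
     * MADV_SEQUENTIAL is the hint we use for the aggressive case. */
    madvise(map, st.st_size, MADV_RANDOM);

    /* sequential read: copy one block at a time out of the mapping */
    char buf[4096];
    for (off_t off = 0; off < st.st_size; off += sizeof(buf))
        memcpy(buf, map + off, sizeof(buf));

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```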
3) Somewhat unrelated, but maybe one of you has an idea: when sequentially reading 4k pages with 8 threads using Linux I/O (`read`), we get a throughput of 15 GB/s, while with `mmap` we get 10 GB/s. However, if we increase the page size to 16k, Linux I/O reaches 20 GB/s and mmap 24 GB/s. We would have expected mmap to be faster in all cases. For a random read benchmark this is the case; only the sequential 4k read is an outlier with respect to the mmap/Linux I/O ratio. Does anyone have an idea why this happens? Our only guess was the copy implementations: on our 5.4 kernel, `read` uses the kernel's `memcpy_mcsafe`, while `mmap` goes through glibc's `memcpy`. However, since the random workload behaves as expected (mmap > Linux I/O), this does not seem to be the reason.
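To make the comparison concrete, the per-thread inner loops are conceptually like the following (the function names, chunk handling, and BLOCK_SIZE of 4096/16384 are illustrative, not our exact code):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* 4096 or 16384 in the runs above */

/* read() path: one pread per block into a thread-local buffer */
void read_loop(int fd, off_t start, size_t len, char *buf)
{
    for (off_t off = start; off < start + (off_t)len; off += BLOCK_SIZE)
        if (pread(fd, buf, BLOCK_SIZE, off) < 0)
            break;
}

/* mmap path: one memcpy per block out of the mapping (glibc memcpy) */
void mmap_loop(const char *map, off_t start, size_t len, char *buf)
{
    for (off_t off = start; off < start + (off_t)len; off += BLOCK_SIZE)
        memcpy(buf, map + off, BLOCK_SIZE);
}
```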
Thank you very much in advance for any pointers, advice, or explanations.
Best,
Maximilian