POSIX IO vs mmap


Maximilian BΓΆther

5 Jan 2021, 14:47:12
– pmem
Hello,

we are working on a project where we compare PMem to traditional (NVMe) SSDs. Until now, we've used POSIX IO primitives (fwrite, fopen, ...) and fsdax on PMem (ext4) to write data to PMem as well as NVMe.

Our question is whether someone has experience with IO on fsdax. Would mmap be preferred over POSIX IO? We think we saw a paper some time ago that said POSIX IO should not be used with PMem, but we could not find that paper again. Does anyone have practical experience on how to write to fsdax files with the best possible performance?

Thank you very much.

Best regards,
Maximilian

Andy Rudoff

5 Jan 2021, 15:00:03
– pmem
Hi Maximilian,

The pattern of your workload makes all the difference here. DAX means you don't use the page cache, much like opening a file with O_DIRECT. The page cache is there for a reason, so if you really have a workload that runs better without the page cache, then DAX makes sense. But I've seen some people turn off the page cache, causing them to never find their data in DRAM, and then wonder why the performance didn't improve.

For pmem, there's a similar consideration. If you have a program that wants to update a persistent data structure like a tree or hash table, on block storage those updates must go through full block I/O. For example, to update a single 8-byte pointer and make it persistent, software must read a block from storage, make the change, then write it back. For most storage, that means moving 4k blocks just to update 8 bytes. If you use pmem with DAX/mmap/MAP_SYNC, that same update can be made without moving any blocks: the update writes the data to persistence, moving only the cache line that contains it. Furthermore, it does so without calling into the kernel, so the overhead of kernel-based I/O is avoided.

I gave that example of a small update to make a point: as you move larger and larger blocks of data, the kernel overhead becomes less and less of an issue. For sufficiently large blocks of data, you'll find that the media becomes the bottleneck, so accessing the same type of media via DAX/mmap/MAP_SYNC and accessing it in an NVMe device will approach the same performance as the block size gets larger.

Ultimately, the only concise answer to your question is to benchmark your workload to see which data path works better for it.

Hope that helps,

-andy

Maximilian BΓΆther

5 Jan 2021, 15:14:23
– pmem
Dear Andy,

thank you for the quick and insightful response. Please excuse my lack of knowledge, but does using fopen and other POSIX functions imply we are _not_ using DAX, even on a DAX-enabled PMem mount? I was not aware of that. The workload we currently test is a simple external sorting algorithm, where the IO limitation is writing sequentially for a long time at the end (single-threaded due to the nature of the merge). If I understood you correctly, I guess we are bottlenecked more by the single-threadedness of the write workload than by any use of the page cache.

Thank you again.

Best,
Maximilian

Andy Rudoff

5 Jan 2021, 15:24:04
– pmem
No, I never said using fopen means you are not using DAX. Note that mmap is a POSIX function too, by the way. I think you are really asking about the difference between using read(2)/write(2) and using mmap(2). Note that read and write are the system calls -- you mentioned stdio library calls like fread and fwrite, which do their own buffering inside your application, so you should be aware of that.

If you're using the "-o dax" mount option, the file system will attempt to turn off the page cache and provide direct access when mmap is called, but there are some conditions where DAX cannot be provided, and I believe if that happens you'll see a message about it in your system log.

-andy