I think the part of the optimization manual you're referring to is not very clear about what it is saying. I'll add getting that text updated to my todo list :-)
Despite the confusing text and labels on the chart, what it really appears to be comparing is using msync() to flush a large range to pmem versus using a loop of CLWB instructions. I know the chart has the label "PMDK" on it, but PMDK is actually a suite of more than ten libraries, one of which provides a simple, low-level function to loop through a range using CLWB instructions. Let me re-state your question without using the confusing stuff in that document:
When is it better to use msync() to flush changes to pmem, and when is it better to use a loop of CLWB instructions?
Remember that flushing is not a transactional operation -- if you memory map some persistent memory and do some stores to it, then by the time you decide to flush it, some or all of it might be flushed already. If you crash before the flush operation is complete, the stores you did could be persistent or not persistent in any order. This is why we wrote libraries that provide transactions, like libpmemobj, so that a programmer can make transactional updates to pmem-resident data structures and they stay consistent even after a crash. When you're using transactions, the flushing is typically managed at a fine-grained level by the library and the programmer doesn't have to worry about it.
But back to our question, we're just talking about flushing ranges of memory, so my answer will only make sense if that's your use case (i.e. you have some other way of maintaining consistency in the face of failure, since flushes alone will not do that).
Now let's compare flushing with a loop of CLWB instructions, such as that provided by libpmem's pmem_persist() routine, versus calling msync().
The first thing you should know is that they both do the flushing with a loop of CLWB instructions followed by a single SFENCE (the SFENCE is needed because CLWB is not serialized). In many ways these flushes are the same, one implemented in user space and one implemented in kernel space. Here's a list of differences:
- msync() follows POSIX, which means it can only flush ranges whose size is a multiple of the system page size, 4k. If you want to flush a small range, like 64 bytes, msync() requires that you round it up to 4k and flush more than you intend.
- Using CLWB allows you to flush at cache-line granularity.
- msync() imposes the kernel syscall overhead and can sometimes take locks, which can end up serializing threads in a multi-threaded program
- Using CLWB doesn't take any locks and multiple threads can use it concurrently
- msync() can take advantage of things you can only do in the kernel, like looking to see if a page is marked "dirty" in the MMU page tables and skipping the flush for that page if it is clean. This is a mixed bag, since skipping ranges can improve performance, but managing the dirty information means updating page table entries, doing TLB shootdowns, etc.
- msync() could detect very large ranges and decide to use special flushing mechanisms only available in the kernel, like WBNOINVD. We're still researching when it is appropriate to do this and right now, msync() always uses a loop of CLWB instructions.
- msync() currently calls into the driver, giving it a chance to do additional things like waiting for memory controller queues to drain. This adds overhead.
- Using CLWB to flush from user space is only safe if the kernel agrees to allow it. The mmap() flag MAP_SYNC is used to negotiate this. If you do not have a MAP_SYNC mapping, msync() is the only safe way to flush changes.
- msync() always flushes the entire range, even if stores were done to that range using non-temporal stores which bypass the cache and don't need flushing. There's no way to communicate to msync() that you used NT stores.
- Using CLWB allows you to micro-manage the flushes, so that ranges written with NT stores don't get flushed unnecessarily. Tools are available, like Persistence Inspector and pmemcheck, which help ensure code that micro-manages flushing is correct. These tools tell you if you leave changes unflushed.
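To make the granularity difference in the list above concrete, here is a minimal sketch in C. It is not code from PMDK or the kernel: the names flush_range(), msync_span() and flush_line_stub(), and the 64-byte cache-line size, are assumptions made for illustration. The per-line flush is passed in as a callback so the sketch builds on any machine; real code would issue CLWB at that point and a single SFENCE after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64u /* assumed cache-line size; typical for x86 */

/* Hypothetical callback type: flush the one cache line containing p. */
typedef void (*flush_line_fn)(const void *p);

/* Stand-in for a real flush instruction; does nothing. */
static void flush_line_stub(const void *p)
{
    (void)p;
}

/*
 * Walk [addr, addr+len) one cache line at a time, starting from the
 * cache-line boundary at or below addr. Returns the number of lines visited.
 */
static size_t flush_range(const void *addr, size_t len, flush_line_fn flush_line)
{
    if (len == 0)
        return 0;

    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHELINE) {
        flush_line((const void *)p); /* real code: CLWB on this line */
        lines++;
    }
    /* real code: one SFENCE here, after the whole loop */
    return lines;
}

/*
 * Number of bytes a page-granular flush covers for the same range:
 * start rounded down and end rounded up to a page boundary.
 */
static size_t msync_span(uintptr_t addr, size_t len, size_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    if (len == 0)
        return 0;

    uintptr_t mask = (uintptr_t)pagesize - 1;
    uintptr_t start = addr & ~mask;
    uintptr_t end = (addr + len + mask) & ~mask;
    return (size_t)(end - start);
}
```

For a 64-byte store at address 0x1010, flush_range() visits 2 cache lines (128 bytes), while msync_span() with a 4096-byte page reports 4096 bytes for the same store.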
Which one should you use? The point of the text you referred to is that it depends on the workload. If most of the range contains modifications, flushing from user space wins: it avoids the kernel overhead, and calling into the kernel would just execute the same CLWB instructions without being able to skip any pages, because they're all modified. On the other hand, if you have a very large range with sparse modifications, the skipping of clean pages done by msync() can perform better in a single-threaded program, but it can cause a lock bottleneck in a multi-threaded program.
libpmem contains a routine called pmem_persist() and it is our intention to continue to benchmark and maintain this routine so it makes the most sensible choice it can given the information it has. If new, faster ways of flushing the cache are introduced, pmem_persist() will use them. If conditions indicate that calling msync() is better, pmem_persist() will call it. Also, libpmem understands what instructions are available on the CPU (it checks CPUID on startup), and it knows how to use MAP_SYNC and anything else required. libpmem is also designed for future platforms where CPU caches are considered persistent. pmem_persist() knows when flushes are required and when they can be skipped. Programs that use pmem_persist() today will get faster on such a platform in the future. Programs that unconditionally use CLWB will be at a disadvantage. libpmem also contains memcpy/memset routines that know when to use things like NT stores and avoid unnecessary flushes.
My advice is to stick with the simplest API that meets your needs, and only add complexity as necessary. If POSIX APIs meet your needs, sticking with msync() will always work. If not, then I suggest using libpmem (or at the very least stealing the code from the library instead of re-inventing it). That will use MAP_SYNC correctly and choose the best available flushing methods.
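For the POSIX-only path, a minimal sketch might look like the following. It uses only standard calls (mmap, msync, sysconf, tmpfile) on an ordinary scratch file rather than real pmem. The helper names msync_range() and store_and_sync_demo() are made up for this example, and rounding the start address down to a page boundary is my assumption about how a caller would typically satisfy msync()'s alignment requirement.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Flush [addr, addr+len) with msync(). The start is rounded down to a
 * page boundary because msync() expects a page-aligned address.
 * Returns 0 on success, -1 on error.
 */
static int msync_range(const void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    uintptr_t mask = (uintptr_t)ps - 1;
    uintptr_t start = (uintptr_t)addr & ~mask;
    size_t span = (size_t)((uintptr_t)addr - start) + len;

    return msync((void *)start, span, MS_SYNC);
}

/*
 * Map a two-page scratch file, store a short string at an unaligned
 * offset, and flush just that range. Returns 0 on success, -1 on error.
 */
static int store_and_sync_demo(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;

    int fd = fileno(f);
    size_t maplen = 2 * (size_t)ps;
    if (ftruncate(fd, (off_t)maplen) != 0) {
        fclose(f);
        return -1;
    }

    char *base = mmap(NULL, maplen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return -1;
    }

    const char msg[] = "hello, pmem";
    memcpy(base + 100, msg, sizeof msg);          /* the store */
    int rc = msync_range(base + 100, sizeof msg); /* the flush */

    munmap(base, maplen);
    fclose(f);
    return rc;
}
```

Even though only a dozen bytes were stored, the flush here covers at least the whole page containing them, which is the granularity cost described earlier. The same code works whether or not the mapping is MAP_SYNC, which is the main attraction of staying with msync().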
And, of course, if you want changes to be atomic, want malloc-like management of pmem, or want any sort of transactions, then flushing alone won't cut it, and you should look into a library like libpmemobj.
-andy