libpmem2 vs mmap - Read Performance


Maximilian Böther

Feb 2, 2021, 11:22:00 AM2/2/21
to pmem
Hey,

In order to compare different storage technologies and ways to access them, we have implemented a "buffer management" workload. In this workload, we randomly write and read pages to and from disk (e.g. fsdax PMem). If you consider 4k pages, you can basically think of the buffer management workload as a 4k random disk benchmark with variable read/write ratio.

Each thread has access to a buffer file that it reads pages from and writes pages to. For persistent memory, we implemented the workload on top of an abstract IO wrapper in order to compare various means of accessing PMem. We compare standard Linux IO (read/write), mmap, and libpmem2. For mmap, we use `std::memcpy` both for reading from PMem to DRAM and for writing from DRAM to PMem, followed by msync (this should be a generic approach that works for NVMe as well as for PMem). For libpmem2, we use `std::memcpy` for reading from PMem to DRAM and the function obtained via `pmem2_get_memcpy_fn` for writing.

We test in an fsdax setup. We observe that libpmem2 beats mmap for writing to PMem, which is to be expected, as msync is a lot slower than the optimized function chain provided by libpmem2. However, for reading, mmap beats libpmem2. For example, with 4k pages and 32 threads, we observe 25 GB/s bandwidth for mmap, but only 17.5 GB/s for libpmem2. This is especially odd because the read call is basically the same in both cases (a memcpy to DRAM).

We mmap the file as follows:
`(char*)mmap(NULL, map_length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);`

I dug a little into the libpmem2 call stack in `posix_map.c`, but since we do not set any configuration for the libpmem2 mapping, I think the resulting mmap call made by libpmem2 should be the same. So I do not understand where this performance difference comes from.

Additionally, when comparing libpmem2 vs mmap on NVMes instead of PMem in the same setup, we do not observe that effect.

Any input on why that may be the case (e.g. libpmem2 does something different with mmap or we should use another function for the PMem -> DRAM read) is much appreciated. Thank you very much!

KR,
Maximilian 

Łukasz Plewa

Feb 2, 2021, 12:19:53 PM2/2/21
to Maximilian Böther, pmem
Hi,

I guess that the increased bandwidth comes from DRAM. As you didn't map with MAP_SYNC, you don't have direct access to persistent memory (DAX): your writes go to the page cache (DRAM) instead of PMem, which also means that your reads go through the page cache and are cached there :).
If the size of your buffer is smaller than the amount of DRAM on your server, you are getting a good hit/miss ratio, which is why you see increased bandwidth.

Łukasz


Maximilian Böther

Feb 2, 2021, 12:47:19 PM2/2/21
to pmem
Dear Łukasz,

thank you for your answer! Does this mean that opening a file on a DAX-enabled FS with mmap, but without MAP_SHARED_VALIDATE and MAP_SYNC, will not use DAX? This is interesting, because for standard Unix IO (read/write) we do observe DAX access on a DAX-enabled fsdax device even without O_DIRECT (same performance results).

I am going to run the experiments again and see whether it makes a difference.

Best,
Maximilian

Dan Williams

Feb 2, 2021, 12:49:18 PM2/2/21
to Łukasz Plewa, Maximilian Böther, pmem
No, MAP_SYNC has no bearing on directing traffic between DRAM and PMEM on a DAX file. MAP_SYNC only governs when metadata updates for new allocations can be considered stable.

ppbb...@gmail.com

Feb 2, 2021, 1:56:29 PM2/2/21
to pmem
Are you sure you are using either a dax-mounted file system or a file with the dax attribute in both scenarios?
One easy way to check if you are really using PMEM and not the page cache would be to look at the bandwidth numbers (if you don't want to play around with raw counters, VTune is an easy way to get those numbers).

One difference that MAP_SYNC might make is the performance characteristics of the page fault handler. This might reduce the overall benchmark throughput if you haven't made sure that all the pages are faulted prior to time measurements.
We usually write to every single byte of a memory-mapped region before we do any benchmarks - just to eliminate some of the possible kernel overheads.

btw, the number of threads you are using is likely causing degraded performance; see Section 11.2 of this doc for details: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

Piotr

Maximilian Böther

Feb 5, 2021, 4:56:58 PM2/5/21
to pmem
Hey,

my last emails have not reached the group, so I will update you now using the web interface.

I am absolutely sure it's a DAX-mounted filesystem. Basically, the benchmark runs on an fsdax PMem mount (the same mount in both cases) and just switches the IO layer (libpmem2 vs. mmap). So the environment is exactly the same for both variants (`./benchmark /mnt/nvram mmap` vs. `./benchmark /mnt/nvram libpmem` in a nutshell). Of course, we also made sure to node-pin and memory-pin the benchmark to avoid NUMA effects.

We are testing 1-32 threads and various page sizes and observe the same effect in all cases (we get a kind of tensor of configuration options and their results). While I don't think this is a bug in libpmem2 but rather some other issue, if you are interested in exactly what we are doing, I could share a private repository with you and briefly discuss the details (this is part of a research project and we want to publish the findings in the end, so I cannot share it publicly yet).

We tested the MAP_SYNC flag for mmap and it does not make any performance difference. For reference, I have attached a plot. The first row shows our mmap results for 4k, 64k, and 2 MiB pages; the second row shows the exact same benchmark using libpmem2 instead of mmap to create the mapping. Everything else is consistent. The +pmcl bar can be ignored; this flag varies the granularity to CACHE_LINE instead of PAGE, but this is a read-only scenario, where that makes no difference.

For any further comments and/or if somebody would be willing to look at some code, I would be more than glad. Thank you so much.

Best,
Max
map_sync.png

ppbb...@gmail.com

Feb 8, 2021, 3:22:03 AM2/8/21
to pmem
Hi Max,

This is truly odd. There should be no difference if all you do differently is use libpmem2's mapping instead of mmap directly. Try running strace on your benchmark code to confirm that.
Other than that, run a profiler (perf, VTune) and see what's different. Make sure you also allow the profiler to collect kernel traces.

Piotr

Maximilian Böther

Feb 8, 2021, 1:00:34 PM2/8/21
to pmem
Dear Piotr,

we did some VTune and strace analysis. In VTune, we could only see that more time is spent in std::memcpy for libpmem2. In strace, there was one difference we noticed: libpmem2 has logic that reserves address space for the mapping, which can be seen in the following trace:

```
     0.000032 openat(AT_FDCWD, "/mnt/nvrams2/pvn/buffer.bin.ap", O_RDWR|O_CREAT, 0666) = 3 <0.000013>
     0.000033 fadvise64(3, 0, 0, POSIX_FADV_DONTNEED) = 0 <0.000006>
     0.000025 fcntl(3, F_GETFL)         = 0x8002 (flags O_RDWR|O_LARGEFILE) <0.000006>
     0.000026 fstat(3, {st_mode=S_IFREG|0644, st_size=2147483648, ...}) = 0 <0.000006>
     0.000027 fstat(3, {st_mode=S_IFREG|0644, st_size=2147483648, ...}) = 0 <0.000006>
     0.000027 mmap(NULL, 2147487744, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff113edc000 <0.000008>
     0.000027 munmap(0x7ff193edc000, 4096) = 0 <0.000009>
     0.000026 mmap(0x7ff113edc000, 2147483648, PROT_READ|PROT_WRITE, MAP_SHARED_VALIDATE|MAP_FIXED|MAP_SYNC, 3, 0) = 0x7ff113edc000 <0.000010>
```

We can see that our buffer file (file descriptor 3) is mapped to 0x7ff113edc000 using MAP_FIXED, at an address previously obtained by mmapping file descriptor -1. This probably has something to do with the VM reservation implemented in vm_reservation_posix.c (https://github.com/pmem/pmdk/blob/master/src/libpmem2/vm_reservation_posix.c), but I don't quite understand yet why libpmem2 does this and what it uses the anonymous mapping for. Our own mmap solution does _not_ do that:

```
     0.000030 openat(AT_FDCWD, "/mnt/nvrams2/pvn/buffer.bin.ap", O_RDWR|O_CREAT, 0666) = 3 <0.000013>
     0.000033 fadvise64(3, 0, 0, POSIX_FADV_DONTNEED) = 0 <0.000006>
     0.000034 stat("/mnt/nvrams2/pvn/buffer.bin.ap", {st_mode=S_IFREG|0644, st_size=2147483648, ...}) = 0 <0.000007>
     0.000029 mmap(NULL, 2147483648, PROT_READ|PROT_WRITE, MAP_SHARED_VALIDATE|MAP_SYNC, 3, 0) = 0x7fcd29a00000 <0.000008>
```

From our perspective, this is the only thing that could explain the performance difference. My first idea was that maybe libpmem2 somehow ignores the NUMA settings of numactl, but we checked where the mapping is created, and in both cases it was on the correct NUMA node.

Do you have any idea about this? 

Best,
Max

ppbb...@gmail.com

Feb 8, 2021, 1:39:11 PM2/8/21
to pmem
Hi,
libpmem2 creates a private read-only mapping as a virtual address space reservation. This is useful if you want to, for example, map additional data contiguously next to an existing mapping. For code simplicity, this path is always taken, even if no reservation is provided. But we also use it to make sure we get a nicely aligned address... which is where a bug sneaked in: https://github.com/pmem/pmdk/pull/5143

What I've noticed is that with libpmem2 you are only getting a 4k-aligned mapping, whereas with raw mmap you get a 2 MB alignment, which might impact performance in a measurable way. Try the patch I quickly put together to test that theory. You could also work around the problem in your benchmark by explicitly using a VM reservation for the mapping.

Piotr

Mason, William A

Feb 12, 2021, 10:49:20 AM2/12/21
to pmem

We found that shifting from 2MB pages to 4KB pages can have surprisingly high impact on performance, depending upon the data access locality and size of the workload.

https://par.nsf.gov/servlets/purl/10193037

In fact, we’ve been looking at the impact of 1 GB pages; preliminary results are encouraging, even given the paucity of 1 GB TLB entries, because the page table walk cost for a 1 GB TLB miss is quite a bit lower than the cost of the walk for a 2 MB page. Another way to look at it: with 4 KB pages, you have fewer than 2,000 4 KB TLB entries. If your working set is 1 TB with a uniform random access workload, your odds of finding the relevant 4 KB TLB entry present are very close to zero, which means you have to do a full page table walk. Since the cache isn’t big enough to keep a four-level-deep page table resident, you are paying the cost of full DRAM accesses, which typically cost more than the slower access to PMEM itself.

Small page sizes make sense when a miss can involve I/O.  PMEM eliminates the I/O consideration, which suggests that there’s no benefit to using small pages for large workloads.

Tony Mason
