memcpy() slower without subsequent pmem_flush/pmem_persist


Robert Jandow

Jun 19, 2021, 9:01:06 AM6/19/21
to pmem
Hi all,
I'm currently evaluating the performance of pmem_memcpy(). For this, I used the PMDK benchmark tool and modified it slightly.
One modification compares libc memcpy() + pmem_flush() against plain libc memcpy() (GitHub). In the multithreaded runs of this operation, I noticed that plain memcpy() performs much worse than its flushed counterpart.
The attached images show that the plain variant (purple) performs especially badly for larger data sizes. I tested with 64 B, 256 B, and 4 KiB (4096 B).
Does anyone have an idea what causes the slowdown? And why does running pmem_flush() or pmem_persist() afterwards (green, red) improve performance?

Thanks in advance for your help and feedback

Attachments:
- pmem_memcpy_thread_64B.png
- pmem_memcpy_thread_256B.png
- pmem_memcpy_thread_4KB.png
- pmembench_memcpy_thread_4KB.cfg

ppbb...@gmail.com

Jun 19, 2021, 1:37:45 PM6/19/21
to pmem
I'm assuming this is random traffic?
My best guess would be that the libc you are testing with does something that reduces the spatial locality of the data written to the DIMMs; for example, it could be using temporal SSE2 stores such as movdqa. There's a reason PMDK ships with its own memcpy implementation. See https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html, section 12.2.3.

Piotr