NUMA awareness with PMEM DIMMs.


Pradeep Fernando

Mar 26, 2019, 9:30:17 AM
to pmem
Hi all,

I have a PM system with DIMMs connected to multiple NUMA sockets.
Access to the DIMMs is through a DAX-enabled file system, similar to http://pmem.io/2018/05/15/using_persistent_memory_devices_with_the_linux_device_mapper.html

I mmap memory regions for application use, but I lose the NUMA distance details in the process. How can I make use of the PM devices with NUMA-aware mmapping, or some other technique?

thanks
--Pradeep 

Adrian Jackson

Mar 26, 2019, 9:35:53 AM
to Pradeep Fernando, pmem
Do you need to create a single DAX enabled file system across the sockets? Or can you get away with creating multiple file systems, one for each NUMA region?


Anton Gavriliuk

Mar 26, 2019, 9:37:22 AM
to Pradeep Fernando, pmem
> How can I make use of the PM devices with NUMA-aware mmapping, or some other technique?

PM is NUMA aware. You can check the numa_node attribute to identify which socket a device belongs to. I think the best approach is a NUMA-aware DB/app; otherwise you will always lose performance to UPI traffic, which you can measure with the pcm tools. Is your DB/app NUMA aware? How to make a non-NUMA application NUMA aware is a good question.
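For example, reading the numa_node attribute from sysfs (a rough sketch; "region0" is just an example, list /sys/bus/nd/devices/ or run "ndctl list -v" to find the regions on your system):

/* Rough sketch: read the NUMA node of an NVDIMM region from sysfs.
 * "region0" is an assumption; adjust for your system. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("/sys/bus/nd/devices/region0/numa_node", "r");
    int node = -1;

    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    if (fscanf(f, "%d", &node) != 1)
        node = -1;
    fclose(f);

    printf("region0 is on NUMA node %d\n", node);
    return EXIT_SUCCESS;
}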

Anton


Pradeep Fernando

Mar 26, 2019, 9:58:27 AM
to Anton Gavriliuk, pmem
Hi Anton,

>> Do you need to create a single DAX enabled file system across the sockets? Or can you get away with creating multiple file systems, one for each NUMA region?

This is a very good suggestion. I am building an application that uses the PMDK allocator, and I want a way to allocate memory from PM devices on different NUMA nodes.

thanks
--Pradeep

Steve Scargall

Mar 26, 2019, 10:37:36 AM
to pmem
The most common solution is to create pools of worker threads where the pool threads are NUMA bound to the CPU sockets closest to the pmem and DRAM that they will be accessing.  The app then knows which worker thread pool to use to access different data.

We do have an open enhancement/feature request to support cross pool transactions - https://github.com/pmem/issues/issues/988
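A rough sketch of that pattern with libpmemobj and libnuma (the mount points /mnt/pmem0 and /mnt/pmem1, the pool size, and the layout name are placeholders, not a complete implementation):

/* Sketch: one libpmemobj pool per NUMA node on per-node DAX mounts,
 * with each worker restricted to its local node via libnuma.
 * Build with: cc sketch.c -lpmemobj -lnuma */
#include <libpmemobj.h>
#include <numa.h>
#include <stdio.h>

#define POOL_SIZE ((size_t)1 << 30) /* 1 GiB, placeholder */

static const char *pool_paths[] = {
    "/mnt/pmem0/pool.obj",  /* fsdax mount backed by node-0 pmem */
    "/mnt/pmem1/pool.obj",  /* fsdax mount backed by node-1 pmem */
};

static PMEMobjpool *open_pool_for_node(int node)
{
    const char *path = pool_paths[node];
    PMEMobjpool *pop = pmemobj_create(path, "numa_pool", POOL_SIZE, 0666);
    if (pop == NULL)                 /* pool already exists? open it */
        pop = pmemobj_open(path, "numa_pool");
    return pop;
}

static void worker(int node)
{
    /* Keep this thread and its DRAM allocations on the local node. */
    numa_run_on_node(node);
    numa_set_preferred(node);

    PMEMobjpool *pop = open_pool_for_node(node);
    if (pop == NULL) {
        fprintf(stderr, "pool on node %d: %s\n", node, pmemobj_errormsg());
        return;
    }
    /* ... allocate/transact against the node-local pool here ... */
    pmemobj_close(pop);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    worker(0);  /* in a real app these run in per-node thread pools */
    worker(1);
    return 0;
}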
 

Pradeep Fernando

Mar 26, 2019, 11:52:49 AM
to Steve Scargall, pmem
Thanks, Steve! I am going to do something similar: a separate file system for each NUMA socket plus NUMA-aware threads.

Another question along the same lines: is there a way to stitch together a set of PM pages that are both virtually and physically contiguous?

I have seen tricks with DRAM where the starting physical addresses of mmapped pages are queried through sysfs to figure out which pages are physically contiguous.

I could not figure out a similar trick for PM device mappings.
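For reference, the DRAM trick I mentioned is roughly the following, based on /proc/self/pagemap (it needs root/CAP_SYS_ADMIN on recent kernels to expose the PFN, and I have not verified how it behaves for DAX mappings):

/* Sketch of virt-to-phys lookup via /proc/self/pagemap (the mechanism
 * DPDK's physical-address lookup is built on). */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    off_t offset = ((uintptr_t)vaddr / page_size) * sizeof(entry);

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry)) {
        close(fd);
        return 0;
    }
    close(fd);

    if (!(entry & (1ULL << 63)))                 /* bit 63: page present */
        return 0;

    uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54: PFN */
    return pfn * page_size + ((uintptr_t)vaddr % page_size);
}

int main(void)
{
    int x = 42;
    printf("virt %p -> phys 0x%" PRIx64 "\n", (void *)&x, virt_to_phys(&x));
    return 0;
}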

thanks
--Pradeep



Andy Rudoff

Mar 26, 2019, 12:44:27 PM
to pmem
Hi Pradeep,

Can you say a little more about why you want the physical pages to be contiguous?  If it is to use large pages, fs DAX already makes an effort to do contiguous allocations and use large page mappings when possible.  And the "device DAX" mode on Linux will ensure you get large pages, but at the cost of not being able to manage pmem using file operations (so no names, permissions, etc.).  For most use cases I've seen, the opportunistic use of large pages already in the kernel is sufficient with fs DAX and not something the app has to worry about.  But of course each use case is different, so perhaps you have another reason for wanting the pages to be physically contiguous...
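For completeness, using device DAX is just an mmap of the character device; a minimal sketch (assuming /dev/dax0.0 exists and a 2 MiB mapping length):

/* Minimal sketch: map a device DAX character device.  /dev/dax0.0 and the
 * 2 MiB length are assumptions; devdax gives large-page mappings but no
 * file-system namespace (no per-file names or permissions). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2UL << 20;            /* one 2 MiB large page */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/dax0.0");
        return EXIT_FAILURE;
    }

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    /* Loads/stores to base go straight to persistent memory (no page cache);
     * stores still need cache flushes + fence (e.g. libpmem's pmem_persist)
     * to be guaranteed durable. */
    munmap(base, len);
    close(fd);
    return EXIT_SUCCESS;
}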

-andy



Pradeep Fernando

Mar 26, 2019, 1:28:06 PM
to Andy Rudoff, pmem
Hi Andy,

Thanks for the clarification on hugepages.
My use case is a bit different from the above. The initial motivation came from the DPDK network stack's poll-mode driver.

I want to implement a network driver that writes directly to PM. The written data should be accessible to user-space applications without kernel traps. User space works with virtual addresses, while the kernel driver works with physical addresses.

DPDK has an allocator that bridges this virtual/physical difference. The allocator works on memory segments that are both virtually and physically contiguous. It creates hugepages on hugetlbfs and stitches together a contiguous segment by querying the physical address of each page (and then using a remap call).

I can use the above driver with PM via file-system DAX mmaps. However, my segment size is limited to one hugepage (2 MB); I don't have a mechanism to create virtually and physically contiguous segments larger than that.

I hope that clarifies my exact use case.

thanks
--Pradeep



benjami...@intel.com

Mar 26, 2019, 7:43:48 PM
to pmem
Hi Pradeep,

There are a few things to consider here. The first is that you need to guarantee that the virtual to physical memory mappings for the region of PM never change because user space driver frameworks like DPDK (and SPDK) hold the mapping of virtual to physical address in memory for fast lookup. DPDK has two mechanisms to set memory up in this way. The first is to allocate hugepages, which at least on Linux kernels today for regular volatile memory means the pages won't get moved around after allocation. But I don't know enough about the internals of DAX here to know whether we can make these same strong guarantees on PM, or how this interacts with hugepages.

The other mechanism is to enable the IOMMU, which allows a user space process to program a mapping from virtual addresses to I/O virtual addresses (iova) however it wants. The network or storage devices can be programmed using those iovas instead of using physical addresses, and the IOMMU will handle any appropriate translation, including handling changes of virtual to physical mappings. You could use this mechanism to mmap the PM and then program the IOMMU with the desired iovas and it would DMA to the right target. The Linux kernel device driver responsible for this feature is vfio-pci.
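Once the VFIO container and group are set up, the actual IOVA programming is a single ioctl; a sketch only (the /dev/vfio container/group/device setup is omitted, and container_fd, pmem_base, pmem_len, and iova are placeholders):

/* Sketch: program a fixed IOVA for an mmap'd PM region through VFIO.
 * container_fd must already be a VFIO container with VFIO_TYPE1_IOMMU set
 * and the device's group attached (setup not shown). */
#include <linux/vfio.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

int map_pm_for_dma(int container_fd, void *pmem_base, size_t pmem_len,
                   unsigned long long iova)
{
    struct vfio_iommu_type1_dma_map dma_map;

    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    dma_map.vaddr = (unsigned long long)(uintptr_t)pmem_base;
    dma_map.iova  = iova;        /* the device will DMA to this address */
    dma_map.size  = pmem_len;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}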

However, for either mechanism, I'm not aware of any mechanism in DPDK today to allow the user to "register" memory after start up. DPDK can now dynamically allocate additional hugepages during runtime, but there isn't any API to add memory allocated by you - i.e. the mmap'd region of PM. Note that SPDK does have this code path set up (see spdk_mem_register()), so it is possible there. I'd consult with the DPDK people to see what they think (and maybe something is already available!).
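With SPDK, registering the mmap'd PM region looks roughly like this (a sketch from memory, not a complete example):

/* Sketch: make an already-mmap'd PM region visible to SPDK's address
 * translation.  addr/len come from the fsdax mmap done elsewhere and
 * should be 2 MiB-aligned; error handling omitted. */
#include "spdk/env.h"
#include <stddef.h>

int register_pm_region(void *addr, size_t len)
{
    return spdk_mem_register(addr, len);
}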

Beyond the memory mapping stuff there is still one additional concern - when the device performs the DMA it needs to be able to bypass any mechanisms that may place the data into the CPU cache. Otherwise, you'd need to ensure you issue the appropriate flush instructions to guarantee persistence.
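If the data does land in the CPU cache, libpmem's helpers cover the flushing; a minimal sketch assuming the region is mapped with pmem_map_file (path and size are placeholders, build with -lpmem):

/* Minimal sketch: flush a range of a pmem mapping to guarantee durability. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;
    char *addr = pmem_map_file("/mnt/pmem0/rxbuf", 4 << 20,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(addr, "packet payload");          /* data may sit in CPU cache */

    if (is_pmem)
        pmem_persist(addr, mapped_len);      /* flush + fence to media */
    else
        pmem_msync(addr, mapped_len);        /* fall back to msync */

    pmem_unmap(addr, mapped_len);
    return 0;
}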

Thanks,
Ben