pmem setup on Linux


Anton Gavriliuk

Jan 15, 2021, 12:23:34 PM
to pmem
Hi all

I have a 4-socket box with a maxed-out PMem setup (24 x 512 GB modules).

The goal is XFS on fsdax namespaces, mounted with the "dax" option.

I can create an interleaved setup, which gives me 4 namespaces, but how do I concatenate them further without losing the "dax" mount option?

And a second question: can I interleave only part of the 6 modules on a CPU? For example, interleave 4 modules and leave the remaining 2 modules non-interleaved?

Anton

Steve Scargall

Jan 19, 2021, 7:31:57 PM
to pmem
Hi Anton,

> I can create an interleaved setup, which gives me 4 namespaces, but how do I concatenate them further without losing the "dax" mount option?

You would need to use a software RAID solution such as Linux Volume Manager (LVM). There's a document entitled "Storage Redundancy with Intel® Optane™ Persistent Memory Modules" that describes this approach along with some performance numbers. It was based on my blog article "Using Linux Volume Manager (LVM) with Persistent Memory", which didn't include performance numbers.
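Roughly, the flow looks like this (a minimal sketch; device, volume group, and mount names are illustrative, and DAX pass-through via dm-linear/dm-stripe needs a reasonably recent kernel):

    # Create one fsdax namespace per interleaved region (one region per socket)
    ndctl create-namespace --mode=fsdax --region=region0   # repeat for region1..3

    # Concatenate (or stripe) the resulting /dev/pmemN devices with LVM;
    # dm-linear and dm-stripe pass DAX through, so -o dax still works.
    pvcreate /dev/pmem0 /dev/pmem1 /dev/pmem2 /dev/pmem3
    vgcreate pmemvg /dev/pmem0 /dev/pmem1 /dev/pmem2 /dev/pmem3
    lvcreate -l 100%FREE -n pmemlv pmemvg       # linear; add -i 4 to stripe

    # Older XFS requires reflink disabled for DAX
    mkfs.xfs -m reflink=0 /dev/pmemvg/pmemlv
    mount -o dax /dev/pmemvg/pmemlv /mnt/pmem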

I don't recommend the SW RAID approach, though. As the document describes, the overhead of SW RAID plus the CPU UPI traffic latencies means you lose a lot of performance. The best performance comes from a NUMA-aware application that can schedule threads on the CPUs where the in-memory data - PMem and DRAM - resides. I don't know of any UMA platforms that support Optane PMem, which would otherwise be one way to sidestep the question.
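As a rough illustration (the node number, binary, and pool path here are hypothetical), you can approximate that without application changes by pinning a process to the socket that owns its PMem:

    # Run on node 0's CPUs and DRAM, next to the PMem attached to socket 0
    numactl --cpunodebind=0 --membind=0 ./my_app /mnt/pmem0/pool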
 
> And a second question: can I interleave only part of the 6 modules on a CPU? For example, interleave 4 modules and leave the remaining 2 modules non-interleaved?

No. When provisioning PMem, your choice is either AppDirect or AppDirectNotInterleaved, and it applies to all PMem on all sockets in the host. You can't mix and match the configuration on a per-socket basis. If you need something more complicated, choose AppDirectNotInterleaved and use SW RAID or erasure coding in the application across N PMem devices. There's no logic in the memory controller to create complicated memory configurations within a socket or across sockets; interleaved or non-interleaved is all we have on NUMA platforms.
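For reference, the two provisioning goals look like this with ipmctl (a sketch; the goal only takes effect after a reboot):

    # All modules interleaved within each socket (one region per socket)
    ipmctl create -goal PersistentMemoryType=AppDirect
    # ...or one region per module, no interleaving
    ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved
    ipmctl show -region        # verify after the reboot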

When CXL devices appear on the market, they'll be configured and controlled from within the OS, so there is an opportunity to provision more complicated layouts if the tools and firmware allow it. Depending on the requested mode(s) and configuration, the logic no longer needs to live in the memory controller.

/Steve
 

Anton Gavriliuk

Jan 21, 2021, 2:35:07 AM
to Steve Scargall, pmem
Hi Steve

Thank you for the clarification.

Regarding software RAID (mdadm): one year ago I tested a pmem RAID 5 based on mdadm on openSUSE 15.1. Write performance was really awful. A couple of days ago I repeated the same test on openSUSE 15.2 with the latest mdadm version, and I have to say that write performance has become slightly better.

So there is some progress in this area, but it is still bad.  

mdadm still needs pmem-specific optimizations.
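For reference, a RAID-5 layout of this kind looks roughly like the following (device names are illustrative; note that MD devices do not pass DAX through, so the filesystem on top cannot be mounted with -o dax):

    # RAID 5 across four fsdax namespaces
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/pmem0 /dev/pmem1 /dev/pmem2 /dev/pmem3
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/pmem-raid    # no -o dax; I/O goes through the page cache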

Anton



SeanW

Jan 21, 2021, 8:31:28 AM
to pmem
Hey Anton - that's certainly a powerful config you have there!

Perhaps you could share (if possible) some of the broader use cases you are targeting, and the group might have additional ideas? Are you only looking at fast file/block-based access, or are you ultimately targeting byte-addressable access - where Optane PMem really shines, IMHO?

To get around the multi-namespace issue, we ended up striping data objects across pools at the application layer (on top of PMDK and libpmemobj/obj++). That gives us more granular control over which data goes to which socket. We can also pin worker threads closer to their data, although in actual testing we found the scheduler does a decent job of this automatically. We also found that for some use cases, having multiple pool files per namespace gave better performance. If you abstract that away behind RAID or a single large pool, you may lose some of the unique performance gains of Optane PMem.
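As a rough sketch of that layout (paths, layout name, and sizes are illustrative, and option spelling may differ by PMDK version), the pools can be created with PMDK's pmempool tool, one or more per fsdax mount, and the application then decides which objects and worker threads go to which pool:

    # One or more obj pools per socket-local fsdax mount; libpmemobj opens each
    # by path and the application stripes objects across them.
    pmempool create --size=100G --layout=shard obj /mnt/pmem0/pool0
    pmempool create --size=100G --layout=shard obj /mnt/pmem0/pool1
    pmempool create --size=100G --layout=shard obj /mnt/pmem1/pool0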

Cheers

Sean

Anton Gavriliuk

Jan 22, 2021, 1:41:46 AM
to SeanW, pmem
Hi Sean

> Hey Anton - that's certainly a powerful config you have there!

This is the customer's box. They already have many pmem boxes, but this one is in a test environment.

> Perhaps you could share (if possible) some of the broader use cases you are targeting, and the group might have additional ideas? Are you only looking at fast file/block-based access, or are you ultimately targeting byte-addressable access - where Optane PMem really shines, IMHO?

Sure, we are targeting DAX access and all the pmem advantages wherever possible, e.g. in MS SQL Server 2019, Aerospike, and so on. In this particular case, however, the customer is interested in MongoDB. As far as I know, MongoDB doesn't support DAX, so we have to create storage on top of App Direct.

In any case, local pmem protection is required. That is why we are very interested in mdadm optimized for pmem, or other solutions that provide local pmem redundancy.

Anton


Steve Scargall

Jan 22, 2021, 11:32:42 AM
to pmem
Hi Anton,

There is support for memory-mapped files in the MongoDB 'development' branch, but I don't know when it'll be merged into the main branch. WiredTiger was researched with PMDK a long time ago; the more recent work uses memory-mapped files (not PMDK) and is described in "Getting storage engines ready for fast storage devices". The blog was written before they had access to a real PMem system, and the results I saw on PMem were very encouraging. Approaching MongoDB together with your customer could help prioritise the work.



If you don't need the DAX feature of the file system, I recommend using SECTOR namespaces to avoid any potentially torn blocks, at least for the logs at a minimum. There isn't much overhead with SECTOR mode. Alternatively, compare the performance against Optane P4800X SSDs, or perhaps consider a tiered solution where PMem holds the hot data - logs and caches - and Optane SSDs hold the primary data.
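For example (region and mount names are illustrative):

    # BTT-backed (sector) namespace: sector writes are atomic, so no torn
    # blocks after a power failure; the device typically appears with an
    # "s" suffix, e.g. /dev/pmem0s.
    ndctl create-namespace --mode=sector --region=region0
    mkfs.xfs /dev/pmem0s
    mount /dev/pmem0s /mnt/pmem-logs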

As another option, the folks at https://formulusblack.com/ have solved the data redundancy problem using PMem as a block storage device, which avoids all the issues we discussed with LVM/mdadm. They offer common storage features such as redundancy (erasure coding), replication, snapshots, clones, deduplication, smart application thread scheduling, dynamic LUN resizing (they call them LEMs - Logical Extended Memory), and more. It's not an open-source solution, but it would be worth comparing. You can test-drive their product using the FORSA Cloud, or on-premises since you have the hardware available.

/Steve

Schmiegel, Jakub

Jan 22, 2021, 12:23:49 PM
to Scargall, Steve, pmem

Hi Anton,

There has also been some recent WiredTiger development on this branch:

https://github.com/wiredtiger/wiredtiger/tree/wt-6022-nvram-cache

 

Best regards,

Jakub

