Non-Volatile Memory - What does it mean for application design?


Martin Thompson

Oct 18, 2015, 10:08:52 AM
to mechanica...@googlegroups.com
While attending Joker Conf I was asked a number of times about my views on the emerging non-volatile memory that can replace traditional storage devices. It is a great question and a subject I find really interesting.

I was a little surprised that many people have the expectation that "I can have my HashMaps persisted and soon we'll not need Oracle databases again". It occurred to me that people's expectations of what non-volatile memory will bring are probably not aligned with how it is likely to change application design. It also made me realise that our platform providers currently don't have plans to benefit from it on their roadmaps. Well, none that I'm aware of.

Memory like the Intel 3D XPoint will plug directly into our memory channels and thus avoid the indirection of the PCI-e bus, which will give us low'ish latency and high throughput. However, to applications like Java this will not likely appear as "normal" memory. From what I've seen of these technologies, we get an efficient file system implementation for accessing them. They appear as very fast drives. To really benefit from them as storage we need to memory map the files.

I expect that the first wave of applications to benefit from non-volatile memory will be in-memory and more exotic data stores. For example, a write-ahead transaction log would suit this type of storage really well, as would the nodes of the tree near and at the root for a database like LMDB, or the first few levels of a levelled database to avoid the write amplification they suffer on current SSDs.
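
To make the write-ahead log case concrete, here is a minimal append-path sketch in Java. The length-prefixed framing and the class name are invented for illustration, and force() stands in for whatever commit primitive such hardware eventually offers.

  import java.nio.MappedByteBuffer;

  // Sketch: append one entry to a write-ahead log living in a mapped file
  // on NVM. Framing is a simple length prefix followed by the payload.
  final class WalAppender
  {
      private final MappedByteBuffer log;
      private int tail = 0;

      WalAppender(final MappedByteBuffer log)
      {
          this.log = log;
      }

      void append(final byte[] payload)
      {
          log.putInt(tail, payload.length);  // absolute write of the length prefix
          log.position(tail + 4);
          log.put(payload);                  // relative write of the payload bytes
          tail += 4 + payload.length;

          log.force();                       // make the entry durable before acking the TX
      }
  }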

This type of memory will fit in as another layer in our memory hierarchy: faster than SSDs but slower than DRAM. We are also going to get eDRAM between DRAM and our SRAM caches. The eDRAM arriving with Skylake will be fully coherent and thus a first-class cache, unlike the eDRAM in the existing-generation Crystal Well, which is a victim cache. Our memory hierarchy is increasing in complexity with each passing generation.

I cannot see this new non-volatile memory fitting in with write-back or write-through memory in common usage. When programming, our expectations need to be clear as to whether we are using volatile or non-volatile memory. Putting a file system in front of it makes the distinction clear.

So what would this mean for programming in managed runtime languages? For me the standout point is that we are going to have to mature our techniques for working with memory-mapped files to gain significant benefits. For Java this would mean a significant overhaul of MappedByteBuffer. We'd need long indexes and support for memory ordering semantics. Concurrency with our large number of cores and huge memory spaces will need taming. We'll need clear ways of overlaying flyweights/structures that can be compiled to prevent silly mistakes such as misaligned access.
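
To sketch what such a flyweight might look like over today's API (all names here are hypothetical), note how the int-only indexes and the absence of ordering or commit semantics mark exactly where MappedByteBuffer falls short:

  import java.nio.MappedByteBuffer;

  // Hypothetical flyweight that windows over a mapped region rather than
  // deserialising into an object graph.
  final class OrderFlyweight
  {
      private static final int ORDER_ID_OFFSET = 0;
      private static final int QUANTITY_OFFSET = 8;

      private MappedByteBuffer buffer;
      private int offset;

      OrderFlyweight wrap(final MappedByteBuffer buffer, final int offset)
      {
          if ((offset & 7) != 0) // the kind of check a compiler should do for us
          {
              throw new IllegalArgumentException("offset must be 8-byte aligned");
          }
          this.buffer = buffer;
          this.offset = offset;
          return this;
      }

      long orderId()                 { return buffer.getLong(offset + ORDER_ID_OFFSET); }
      void orderId(final long value) { buffer.putLong(offset + ORDER_ID_OFFSET, value); }
      long quantity()                { return buffer.getLong(offset + QUANTITY_OFFSET); }
  }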

Maybe longer term our runtimes could evolve to support a new generational memory region that is non-volatile. How would we make sure our "transactions" are saved to this memory rather than normal DRAM? Would we have persist/save methods that promote the objects? Would some class of objects be non-volatile? Maybe we'll see a resurgence of object databases?

It makes for some interesting thinking about the future. Having the technology available is only the start. The impact it will have on application design is likely to be immense.

I'd love to know what others think about this. Any of the VM friends on here been considering this? 

Martin...



Avi Kivity

Oct 18, 2015, 11:37:37 AM
to mechanical-sympathy
You would not want to use regular objects, because the layout of these objects is not fixed.  Instead you want to use serialized data that can survive an update of the code base (e.g. with some kind of versioning).

Note that to persist data to non-volatile memory you have to issue specialized instructions (PCOMMIT) that have non-trivial cost, so programmer intervention will be required.

Martin Thompson

Oct 18, 2015, 3:43:27 PM
to mechanica...@googlegroups.com

Absolutely on both points. We cannot perform any sort of migration or upgrade without versioning and extension mechanisms.

With a good serialisation mechanism we can also have multi language support.

Howard Chu

Oct 18, 2015, 3:50:26 PM
to mechanical-sympathy
This was a big topic at the last couple of Storage Developer Conferences too. IMO the current approach is misguided and a dead end. It is effectively the same as when we first had PCs with over 640K of RAM back in the 1980s, and people wrote RAMdisk drivers to use up the space above 640K. It's a pathetic waste of memory, better used as filesystem cache than as a dedicated storage device.

Using it as a cache means it is immediately and transparently useful to every app on your system. Using it as a dedicated block device means you have to go out of your way to rewrite your apps to use it.

Logically it makes sense as a new layer in the memory hierarchy:
  CPU -> L1$ -> L2$ -> (L3$ ->) DRAM -> NVRAM -> SSD/HDD/tape

But the currently implemented approaches treat it as a wart on the side of the existing hierarchy:
  CPU -> ... DRAM -> SSD/HDD/tape
                      \
                       --> NVRAM

It's a stupid design. Transitional and short-lived at best.

Avi Kivity

Oct 18, 2015, 4:50:17 PM
to mechanica...@googlegroups.com
A cache means you have to go through the OS kernel to access it, and incur either page fault or system call overhead, and possibly wait for a copy (when using system calls).  To squeeze every last bit of performance out of it you must bypass the OS.


Howard Chu

Oct 18, 2015, 8:59:36 PM
to mechanical-sympathy
I'm not convinced. They're not claiming equal to DRAM performance yet. Within an order of magnitude is all - from what I've seen, 2x slower or so. Syscall overhead will be dwarfed by access speed. Also, these still don't have the endurance of DRAM, so allowing userland processes to write all over them willy-nilly is going to be a major liability.

Vijayant Dhankhar

Oct 18, 2015, 9:08:47 PM
to mechanical-sympathy
Totally agree. What I had heard is they only survive ~1000 writes and are slower than DRAM. Use DRAM as a small write-coalescing cache in front of NVRAM to overcome the DRAM refresh and scalability issues.

Avi Kivity

Oct 19, 2015, 2:24:32 AM
to mechanica...@googlegroups.com
Time will tell.  The existence of PCOMMIT and CLWB and their availability to userspace is evidence that Intel at least believes that raw NVDIMM will be useful outside of caching/fs.

Martin Thompson

Oct 19, 2015, 3:41:45 AM
to mechanica...@googlegroups.com
Interesting debate.

If NVRAM is to fit transparently into the memory hierarchy then it needs to be an order of magnitude more dense in capacity and an order of magnitude lower cost per byte compared to DRAM. This may well happen. Given DRAM trends this is going to be an interesting challenge.

Shorter term, I would expect it to be more useful in the hierarchy as a write buffer to SSDs and HDDs. As a pure read cache, a DRAM-backed page cache feels like a better option for now. Maybe NVRAM becomes a victim cache like the current eDRAM offerings. If it acts as a write buffer to speed up transaction writes then this seems a good fit.

With this write-buffering model it then also fits well with the memory-mapping approach, whereby a syscall can be saved on commit by issuing a user-space PCOMMIT.

Some interesting material I've found on the subject includes:



It is also interesting how some of these technologies will mix and match. For example, the Intel docs state that a PCOMMIT will likely always cause a TSX transaction to abort.

Martin Thompson

Oct 19, 2015, 12:06:16 PM
to mechanical-sympathy
The more I think about this the more I'm convinced that inserting NVRAM between DRAM and SSDs does not work very well when you consider the page cache.

The page cache exists in DRAM and will likely continue to do so, especially for reads. Below the page cache we have block devices. One of the major distinctions of most types of NVRAM is that they are byte addressable, rather than just block addressable. As I mentioned previously, I think the case for NVRAM in this context as a write buffer to SSDs/HDDs makes the most sense, rather than as a full layer.

When considering the syscall overhead, and the required copies, I think this could be more hurtful than helpful. Take a normal write, for example: we copy from user space to the page cache, then the OS would need to copy again from the page cache to another address range for the NVM. These copies need to be coherent and would wash the L3 cache as a result. This will also invalidate private L1 and L2 caches due to the inclusive nature of the Intel x86 cache design. A raw syscall can be fast, but when it involves locking a page in the page cache plus the copy it is many 100s of nanoseconds, which is at least as costly as the NVRAM access. Having byte rather than block level access reduces the potential surface area for contention from pages down to cache lines. O_DIRECT does not help in this context as languages like Java cannot open files with O_DIRECT, and Linus really hates it. Plus, with O_DIRECT we have all the alignment and block issues exposed to the user, to the extent they might as well deal with the NVRAM directly.

A more efficient design would be to have an NVRAM address range that we deal with as mapped memory. "Transactions" are serialised into this memory by fronting it with the CPU cache, which combines the word-level writes, and the TX is committed with PCOMMIT plus fences as appropriate. This makes a strong case for having a flyweight design over the NVRAM that can write or read without copying. Avoiding the copies will be critical to preserving the efficiency of the L3 and other private caches, unless we constrain this memory traffic with CAT (Intel Cache Allocation Technology). By taking a memory-mapped approach and using flyweights we can avoid at least two copies and work with the existing cache sub-system rather than against it. Copies are super fast, but it is easy to forget the impact they have on the efficiency of the cache by evicting other critical code and data.

A flyweight mechanism that validates alignment and other concerns, and that also provides versioning and extension, feels like the best fit. We would need language/compiler/tooling support to do this elegantly. All of this with no user "willy nilly" writing all over it :-)
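
As a rough illustration of the versioning and extension hooks, each record could carry a small header that the flyweight validates on wrap before any field access; the layout below is purely hypothetical.

  import java.nio.MappedByteBuffer;

  // Hypothetical on-media layout: schemaId | version | bodyLength | pad | body...
  // wrap() is where alignment and version mistakes get caught early.
  final class VersionedRecordFlyweight
  {
      private static final int SCHEMA_ID_OFFSET = 0;
      private static final int VERSION_OFFSET = 4;
      private static final int BODY_LENGTH_OFFSET = 8;
      private static final int BODY_OFFSET = 16;          // keeps the body 8-byte aligned
      private static final int MAX_SUPPORTED_VERSION = 2; // older versions read via defaults

      private MappedByteBuffer buffer;
      private int offset;

      VersionedRecordFlyweight wrap(final MappedByteBuffer buffer, final int offset)
      {
          if ((offset & 7) != 0)
          {
              throw new IllegalArgumentException("record must be 8-byte aligned");
          }
          if (buffer.getInt(offset + VERSION_OFFSET) > MAX_SUPPORTED_VERSION)
          {
              throw new IllegalStateException("record written by a newer schema version");
          }
          this.buffer = buffer;
          this.offset = offset;
          return this;
      }

      int bodyLength()
      {
          return buffer.getInt(offset + BODY_LENGTH_OFFSET);
      }
  }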

Martin...

Scott Carey

Oct 19, 2015, 1:09:26 PM
to mechanical-sympathy


On Monday, October 19, 2015 at 9:06:16 AM UTC-7, Martin Thompson wrote:
The more I think about this the more I'm convinced that inserting NVRAM between DRAM and SSDs does not work very well when you consider the page cache.

[snip]

I feel that this new technology does not fit well in the current memory hierarchy.  The better the technology gets, the more like DRAM it will be and the less like block-device storage.  So fossilizing it in a sea of 30-year-old block-device access concepts seems like a bad idea.  But it will never be exactly like DRAM either.  Furthermore, file system abstractions are a mismatch feature-wise in almost every way other than persistence.

Ideas like yours above seem more fruitful.  I feel both the operating system and applications will need to adapt, with tools and APIs, to leverage this most effectively.

Even if future magnetic memory comes to be, and is as fast and endures as well as DRAM, the OS and applications will need to change to take advantage of it. For example, the page cache as-is is the wrong design if it were to live in persistent memory (all sorts of eviction and flushing on sync become unnecessary). The OS might divide its memory space into persistent and ephemeral chunks, and keep a new page cache and other select things in persistent memory only. Intel's new hardware is only halfway there, but this is just the beginning.

Todd Lipcon

Oct 19, 2015, 2:24:06 PM
to mechanica...@googlegroups.com
It seems like there are a few misconceptions in this thread about how persistent memory interacts with applications... I'd encourage people to take a look at the native libraries available on http://pmem.io/ which are geared towards efficiently and easily making use of this type of storage.

The particular thing that folks seem to be missing is the existence of the DAX mount option: https://www.kernel.org/doc/Documentation/filesystems/dax.txt

DAX support in filesystems (already supported in ext4 I believe) means that, if you mmap a file on a pmem mount, the memory mapping will directly access the underlying persistent memory, _not_ the page cache. That is to say, you can read/write from the memory with a normal load/store instruction, and commit with PCOMMIT. This is true cacheline-sized load/store, with no kernel involvement (except for soft page faults as with any other memory mapping), syscall overhead, or extra memcpy.

Making use of this from Java is of course a little messier than native code, since you can't map objects onto arbitrary memory (excepting the use of flyweight wrappers around off-heap access). But, for cases where you're already dealing with mapped files in tmpfs, you should be able to use a dax mount to make that storage persistent with few changes (except for perhaps some additional care around ordering of operations and fences to ensure that the data structure is safely persistent).
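
To sketch that Java path (the mount point and file name are purely illustrative): map a file that lives on a DAX mount and the loads and stores below go straight to persistent memory. Note that force() falls back to an msync-style flush; there is no way to reach CLWB/PCOMMIT from pure Java today.

  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public final class DaxMapExample
  {
      public static void main(final String[] args) throws Exception
      {
          // Assumes /mnt/pmem is an ext4/XFS filesystem mounted with -o dax,
          // so this mapping bypasses the page cache entirely.
          try (RandomAccessFile file = new RandomAccessFile("/mnt/pmem/journal.dat", "rw");
               FileChannel channel = file.getChannel())
          {
              final MappedByteBuffer region =
                  channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);

              region.putLong(0, 42L); // a plain store direct to the NVM-backed mapping

              region.force();         // msync-style flush; CLWB/PCOMMIT would need
                                      // native code or new JVM support
          }
      }
  }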

-Todd

Martin Thompson

Oct 19, 2015, 2:45:38 PM
to mechanica...@googlegroups.com
Thanks Todd, this is useful. I was not aware of DAX. The nice thing is that it fits the model for how I thought it should work :-) It will be interesting to dig into the details.

Andy Rudoff

Oct 19, 2015, 4:22:37 PM
to mechanical-sympathy
Just to add to what Todd said: in the NVM Programming Workgroup, we were acknowledging that new use cases for the emerging NVM would continue to come up, and we couldn't possibly imagine them all now.  So we tried to come up with an architecture that is extensible.  We definitely know applications will want to have direct load/store access ("direct" meaning no page cache or kernel code in the path when it is accessed) and we know that requires naming, so applications can re-connect to their data, and a permission model.  The file permission model lends itself nicely to this because after you've opened the persistent memory you want (using the file naming and permission checks), you can memory map it and virtually never call into the file system again if you don't want to since you have direct access.  This architecture turned into the DAX feature in Linux, supported by both ext4 and XFS in kernel.org now, although there's still a decent TODO list of improvements underway.

But the other side of it is that a filesystem is just one type of kernel module that can use DAX.  We think there will be others, like kernel modules that figure out how to use persistent memory to do what they do better, transparently to applications.  So it is important to note that even if we totally got the programming model wrong, we're pushing an architecture that lets us evolve it to other ideas.

Now the big question for Java, in my mind anyway, is how high up the stack do we expose persistent memory?  Just the JVM, so that the applications don't comprehend persistent memory but it is transparently used to implement some things applications need?  Or maybe just some Java API chooses to use persistent memory or storage, depending on what's available and the app using that API has no idea?  Or maybe it gets exposed all the way into the language via some new type of persistent object or something?  I don't know the answer -- so I'm all ears :-)

-andy

p.s.  Full disclosure: I work at Intel on NVM architecture but of course everything I wrote above is based on public information.

Martin Thompson

Oct 20, 2015, 3:47:04 PM
to mechanica...@googlegroups.com
Thanks for providing some insight Andy.

Opening named files and then mapping them feels right to me. It also provides a good cross process entry point.

From Java, I feel we need more first-class support for working with mapped files. That would be mapped file buffers that support working at a byte/word level, with additional support for memory ordering semantics and the ability to commit changes to be stable in NVM. We could then compile up flyweight structures that window over this data to provide access more safely than raw peeking and poking.
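
Purely as a straw man, such a buffer might have a shape like the following; nothing of this kind exists in the JDK today, so every name here is hypothetical.

  // Straw-man API for the kind of first-class mapped buffer described above.
  public interface PersistentBuffer
  {
      long capacity();                             // long addressing, unlike ByteBuffer's int

      long getLong(long index);                    // plain load
      void putLong(long index, long value);        // plain store

      long getLongVolatile(long index);            // load with acquire semantics
      void putLongOrdered(long index, long value); // store with release semantics

      void commit(long index, long length);        // flush the covering cache lines and
                                                   // make the range stable in NVM
  }

Compiled flyweights could then window over this interface, with alignment and layout verified when the flyweight is generated rather than at every access.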

Have you seen other suggested approaches? Does this seem sane to you?

Andy Rudoff

Oct 20, 2015, 5:26:30 PM
to mechanical-sympathy
Hi Martin,

At the lowest levels, where we're simply solving the problem of how something gets from kernel space exposed to user space, I agree with what you're saying.  But even in system programming languages like C, people have a hard time understanding what the heck to do with a memory-mapped file unless they have previous experience with them.  So after solving the problem of how to expose persistent memory to user space, rather than just tell programmers "Okay, you have a big range of load/store persistence -- have fun!" we started building libraries on top of the mapped areas.  These libraries provide memory allocation, transactions, optimized ways to flush changes to persistence, etc.  So that's a bit better, and yet the APIs we invented still seem too low-level for Java.  I think we need to abstract persistent memory into the Java garbage-collected object model to make it less error prone.  But I am far from a Java expert, so that's why I'm looking for more brainstorming in this area.

Thanks for the comments!

-andy

Martin Thompson

Oct 21, 2015, 10:33:36 AM
to mechanica...@googlegroups.com
Interesting times ahead. It feels like in the short to mid term we have new options for data storage engines to move data more efficiently from user space to storage. We also may have some nice benefits for those comfortable with memory mapped files.

Longer term it will be interesting to see if object databases come back into fashion and integrate with a managed runtime like the JVM so that we have the ability to mark objects as either persistent or ephemeral and scope access in some sort of transaction boundaries. It would be great to see the VM vendors start to consider how the runtime can support this useful new technology.

Many thanks for contributing! It has helped my understanding of the current status.

Martin...

Matt D.

Oct 22, 2015, 12:11:18 PM
to mechanical-sympathy
Hi,


On Monday, October 19, 2015 at 2:59:36 AM UTC+2, Howard Chu wrote:
Also, these still don't have the endurance of DRAM, so allowing userland processes to write all over them willy-nilly is going to be a major liability.

The endurance is definitely going to be an issue with the current state of technology -- at the same time, just wanted to chime in to point out that there's an interesting line of research in lifetime extension (context: hybrid cache; SRAM-NVM) which may help in addressing this.
Here's a recent paper: https://www.researchgate.net/publication/277075597_AYUSH_Extending_Lifetime_of_SRAM-NVM_Way-based_Hybrid_Caches_Using_Wear-leveling
It also includes a brief overview (with references) of the current issues (as well as proposing the solution), which may be informative; a few select quotes:

"with conventional cache management policies, the write endurance limit of NVM may be reached in a few days even with a few (e.g. 3 out of 16) SRAM ways. Our experiments with a 4MB, 16-way cache (designed using 3 SRAM and 13 ReRAM ways) shows that the cache lifetime with bzip2 and zeusmp benchmarks is only 28 and 25 days, respectively (more details of experimental setup are provided in Section IV). Clearly, effective techniques are required for managing the hybrid caches."

"PCM is generally considered less suitable for designing on-chip caches due to its high latency and low write endurance (10^8 − 10^9 writes). However, some researchers study SRAM-PCM or STTRAM-PCM hybrid caches [4, 13], which others study PCM-only caches [6] and propose using PCM as an L4 cache [4]. These proposals, along with the trends of very large LLCs designed with SRAM-alternatives, e.g. 128MB embedded DRAM LLC in Intel’s Haswell processor, indicate that PCM may be used in on-chip caches in near future due to its high density. Thus, AYUSH is expected to be useful for such SRAM-PCM hybrid caches.

Although STT-RAM write endurance has been predicted to be greater than 10^15 , the best endurance value in demonstrated prototypes is only 4 × 10^12 [5]. Since process variations may further reduce this value by even 50× [12] and malicious programs can repeatedly write a block to make the system fail, some researchers believe that endurance may be an issue for STT-RAM also [5, 13, 26]. In such a case, AYUSH may be useful for SRAM-STTRAM hybrid caches."

Best regards,

Matt

Howard Chu

Oct 27, 2015, 6:44:15 AM
to mechanical-sympathy


On Tuesday, October 20, 2015 at 10:26:30 PM UTC+1, Andy Rudoff wrote:
Hi Martin,

At the lowest levels, where we're simply solving the problem of how something gets from kernel space exposed to user space, I agree with what you're saying.  But even in system programming languages like C, people have a hard time understanding what the heck to do with a memory-mapped file unless they have previous experience with them.

Sounds like poor education to me. Contemporary emphasis on types and object-orientation has completely lost sight of the fact that, in the end, everything is just bytes. Programming environments that discourage programmers from knowing this really make it impossible to get full leverage on efficient system designs.

E.g., zero-copy in LMDB is easily used in C code because a well-educated C programmer knows that
  struct foo {
     int x;
     int y;
  }
is exactly some number of bytes and exactly what order they're stored in. So we can allocate structures from a mmap'd file just as easily as from any other memory range, and use it directly. JVM-based environments hide the fact that "everything is just bytes" from the programmer, so you can't simply say "instantiate this object in that memory range" and use it. Heck, Java won't even let you access a String as a sequence of bytes; if you want that style of access you have to explicitly create a byte[] array instead.
 
 So after solving the problem of how to expose persistent memory to user space, rather than just tell programmers "Okay, you have a big range of load/store persistence -- have fun!" we started building libraries on top of the mapped areas.  These libraries provide memory allocation, transactions, optimized ways to flush changes to persistence, etc.  So that's a bit better, and yet the APIs we invented still seem too low-level for Java.  I think we need to abstract persistent memory into the Java garbage-collected object model to make it less error prone.  But I am far from a Java expert, so that's why I'm looking for more brainstorming in this area.

In regards to Java, "solving the problem of how to expose persistent memory to user space" is secondary to solving the problem of how to expose memory at all. A system designed intentionally with a lack of pointers, where all objects are opaque, makes most of this futile.

Howard Chu

Oct 27, 2015, 6:54:09 AM
to mechanical-sympathy
Good info, thanks for posting. It certainly gives reason to be more optimistic.