Partial cache line writebacks

Tobias Häberlein

unread,

Nov 5, 2022, 6:20:26 PM11/5/22

to pmem

Hi all

I recently read this article about a B+-Tree variant called Crab-tree for persistent memory with ARMv8 architecture.

In order to insert a new key-value pair <k,v>, Crab-Tree can choose between a Copy-On-Write and a Shift-In-Place mechanism.

Basically, the Shift-In-Place mechanism moves all larger keys to the right to make space for the new key. The authors try to save some expensive cache line flushes during shifting by only flushing a cache line when they move from one cache line to the left one.

The algorithm is similar to the FAST_insert algorithm in this paper.

My understanding is that only 8 bytes can be moved atomically to the persistent memory at once. However, the DC CVAP (ARM) and CLWB (Intel) instructions flush a whole cache line (64 bytes). From what I could find online, this is (probably?) not done atomically - instead, a burst (JEDEC/AXI) is used, i.e. the cache line reaches memory in 8 byte chunks.

Is it possible that - if a power failure occurs in a very bad moment - only a part of the cache line reaches the persistent memory?

If so, here's a possible situation that could occur in Crab-Tree during insertion:

I am curious about your opinion.

Thanks in advance!

- Tobias Häberlein

Andy Rudoff

unread,

Nov 6, 2022, 11:09:43 AM11/6/22

to pmem

Hi Tobias,

Thanks for the interesting post, complete with interesting links!

One of the goals of the PMem programming model was to make PMem follow the memory semantics of the platform once it is mapped into the application's address space. We didn't want to try to force an unnatural memory model into a platform with a different model. For that reason, the answers your questions could be different for different platforms. I'm not an ARM expert, so I can only provide an answer for x86 -- hope that's useful.

On x86, PMem is supported by creating a "persistence domain" which extends a small amount beyond the memory device -- it includes the write pending queue in the memory controller (or more than that if the platform is built with the CPU caches in the persistence domain, but that case obviously avoids the problem we're discussing, so I'll stick to platforms where the CPU caches are not in the persistence domain). I bring this up because you mention JEDEC and how data is sent to the DIMM, but that's not what matters for persistence -- it is the point at which the store is accepted by the memory subsystem. Again, this may very well be specific to x86 systems -- I'm not speaking to how an ARM system works.

When people talk about an atomic operation, they are typically asking the "all or nothing" question. Although a cache line flush on x86 is very likely to write all 64 bytes atomically to the memory subsystem on modern server CPUs, it is not architecturally guaranteed -- an implementation is free to break up the store into multiple chunks along the way. A recent instruction, MOVDIR64B, does provide an architecturally guaranteed way to send 64-bytes to the memory subsystem without it getting broken up. So one way to ensure the algorithm you mention works correctly is to use MOVDIR64B. However, I think the algorithm you link to will work fine as long as the ordering of writes during the shift operation is done correctly. The paper describes how the ordering of the writes is done carefully so that if a line is evicted at any point, the worst thing that can happen is a duplicate entry is found after reboot, and the algorithm is designed so that those duplicate entries are handled correctly. The issue really isn't whether the flush operation is atomic, in my opinion, it is whether evictions, which can happen any time due to other system activity, can cause an inconsistent state. It seems to me the clever part of the paper is how they avoid the inconsistent state by defining what happens if an eviction creates the duplicate.

Now to answer your question directly, can part of the cache line reach PMem and the rest is dropped due to a power failure? Although it is unlikely, there is not an architectural guarantee that all 64-bytes travel to the persistence domain as a unit unless you use MOVDIR64B. But even if x86 were to make that architectural guarantee in the future, since nothing prevents evictions at any point, SW would still have to deal with the case where it is in the middle of storing the 64-bytes of data and some of it becomes persistent before SW can execute the rest of the stores.

-andy

Tobias Häberlein

unread,

Nov 6, 2022, 12:26:55 PM11/6/22

to pmem

Hi Andy,

very interesting! Thank you for your detailed answer.

After reading through (a few subchapters) of the ARM reference manual, I think that ARM systems operate in a very similar way

- except for the fact that ARM systems are with non-TSO which is why additional barriers are necessary to ensure the order of stores.

I agree that the clever part of the paper is to make sure that the system can handle (specific) duplicate values.

However, assuming that only part of a cache line could be persisted during a power failure:

I still think that a scenario like in the picture above is possible during Shift-In-Place insertion in the Crab-Tree b+-tree. I uploaded a version with more comments (and one small correction) down below.

In the example, the entry <k3, v3> gets lost because it was overriden with the newly inserted pair <k2, v2> and only a part of the cache line (k1, v1, k2, v2) made it back to the persistent memory.

In my opinion, the only way to ensure that this (admittedly very unlikely) scenario will never happen is either to use MOVDIR64B (on x86) or to introduce additional cache flushes (+ barriers).

(Realistically however, this is probably overkill on most systems)

Thanks again for your detailed answer, it made many things clear to me.

Reply all

Reply to author

Forward

Message has been deleted