8 byte atomicity & larger store operations


Christian Schwarz

Oct 30, 2020, 3:10:22 PM
to pmem

Hi all,

The pmem.io docs and several posts on this list state that only aligned 8-byte stores are powerfail atomic.
However, it is still unclear to me whether we have _no_ guarantees for larger stores, or whether each individual 8-byte chunk of a larger store is still powerfail atomic.

Consider the following example:

uint8_t *pmem; // 256byte aligned, 256byte buffer
// zero it out
pmem_memset_persist(pmem, 0xff, 256);

// Variant 1:
for (size_t i = 0; i < 256; i++)
    *(pmem + i) = 0xff; // aligned stores, like mov $1, (%some_reg)
pmem_persist(pmem, 256);

// Variant 2:
void *b = aligned_alloc(256, 256);
memset(b, 0xffffffff, 256);
pmem_memcpy_persist(pmem, b, 256);

For variant 1, it is my understanding that the 8-byte powerfail atomicity guarantees that we are only ever going to read 0x00 or 0xff.

For variant 2, a conservative reading of the docs doesn't guarantee that, i.e., it could be possible to read 0x0a.
However, it would quite surprise me if that were actually the case, given what I have learned about the hardware architecture.

Could someone please clarify
a) whether the guarantee that we will only read 0x00 or 0xff holds for variant 1 and
b) whether we have the same guarantees for variant 2?

Thanks,

Christian

Andy Rudoff

Oct 30, 2020, 3:49:16 PM
to pmem
Hi Christian,

I'm not sure I understand your example code (does memset even pay attention to bytes above 0xff in the second arg?).  But I think I can still answer your question.

Regardless of what causes the 8-byte aligned value to get sent to memory, whether it is eviction due to other system activity, a single cache flush instruction, or a range being flushed, the x86 hardware will not tear it due to power failure. A bunch of stores to pmem without flushes and fences between them may become persistent in any order: perhaps all of them are persistent before you get around to calling flush, perhaps none of them, or any combination. But the 8-byte chunks sent to memory won't be torn.
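
In other words, a minimal sketch of the guarantee (assuming an 8-byte-aligned pmem pointer and libpmem):

#include <stdint.h>
#include <libpmem.h>

/* Publish an 8-byte value; p must be 8-byte-aligned pmem. */
void publish(uint64_t *p, uint64_t v)
{
    *p = v;              /* aligned 8-byte store: never torn by power failure */
    pmem_persist(p, 8);  /* force it into the persistence domain */
}
/* After a crash, a recovery reader sees either the old value or v, never a mix. */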

Hopefully that answered your question.

Christian Schwarz

Oct 30, 2020, 6:52:14 PM
to pmem
Hi Andy,

thanks for the quick response!

I'm not sure I understand your example code (does memset even pay attention to bytes above 0xff in the second arg?).  But I think I can still answer your question.

Sorry about that mess. Here's what I meant to send:

uint8_t *pmem; // 256-byte aligned, 256-byte buffer
// zero it out
pmem_memset_persist(pmem, 0x0, 256);


// Variant 1:
for (size_t i = 0; i < 256; i++)
    *(pmem + i) = 0xff; // aligned 1-byte stores, like movb $0xff, (%some_reg)
pmem_persist(pmem, 256);

// Variant 2:
uint8_t *b = aligned_alloc(256, 256);
for (size_t i = 0; i < 256; i++)
    *(b + i) = 0xff;
// copy b to pmem using a libpmem function
pmem_memcpy_persist(pmem, b, 256);


Regardless of what causes the 8-byte aligned value to get sent to memory, whether it is eviction due to other system activity, a single cache flush instruction, or a range being flushed, the x86 hardware will not tear it due to power failure. A bunch of stores to pmem without flushes and fences between them may become persistent in any order: perhaps all of them are persistent before you get around to calling flush, perhaps none of them, or any combination. But the 8-byte chunks sent to memory won't be torn.

Hopefully that answered your question.

That's exactly what I was hoping to hear!

Christian
 

Abdullah Al Raqibul Islam

Oct 30, 2020, 8:29:18 PM
to Christian Schwarz, pmem
Hi Christian,

I see that Andy Rudoff has already given an explanation of 8-byte atomicity in persistent memory. Let me explain what you should expect from your example code.

The pmem manual says:

“pmem_memcpy_persist(), and pmem_memset_persist(), functions provide the same memory copying as their namesakes memcpy(3) and memset(3), and ensure that the result has been flushed to persistence before returning.”

So, your variant 2 is good. It is confirmed: after your pmem_memcpy_persist() call returns, your data in *pmem is persisted.

On the other hand, your variant 1 is interesting. Again, the pmem manual states:

  • “The pmem_persist() function force any changes in the range [addr, addr+len) to be stored durably in persistent memory.”
  • “Any unwritten stores in the given range will be written, but some stores may have already been written by virtue of normal cache eviction/replacement policies.”

So, if a crash happens within variant 1's for loop, it is possible that only some of the data in *pmem has been persisted (cache lines evicted by virtue of normal cache eviction/replacement policies), which would be an inconsistent state. But it is guaranteed that after your pmem_persist() call returns, all the data within this range is persisted.


Dimitris Stavrakakis

Oct 31, 2020, 5:43:30 PM
to pmem
Hello,
the question that arises from this answer is: what happens if a power failure occurs during the execution of the
pmem_memcpy_persist(pmem, b, 256) call?
Isn't there the possibility that only a part of this range becomes persistent and not the whole 256-byte range?
I suppose yes, and this is where transactions come into play to provide such crash-consistency guarantees.
Could someone please confirm that, or have I misunderstood something completely?

I have another question regarding atomicity, though: since the flush instruction applies to cache lines (64 bytes), if we explicitly
request a cache line to be flushed, the write-back presumably happens in 8-byte parts. Thus, we cannot be sure in what order these parts of the cache line
will become persistent. Therefore, it seems mandatory to use specific 8-byte flushes if we want to leverage the 8-byte PM atomicity for building
crash-consistent applications. Does that sound correct?

Thank you very much in advance,
Dimitris

Andy Rudoff

Oct 31, 2020, 6:01:29 PM
to pmem
Hi Dimitris,

Yes, you are correct that pmem_memcpy_persist() and all the other routines in libpmem are non-transactional.  Use them only if you have your own strategy for dealing with interrupted stores and flushes.  The libpmemobj library is built on top of these primitives and provides fully transactional updates.  You can find lots more information about what's transactional and what isn't in the book at http://pmem.io (it is free to read online).
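
To make that concrete, here is a rough, hedged sketch of what a transactional update with libpmemobj can look like (the root-object layout here is made up for illustration):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <libpmemobj.h>

/* hypothetical layout: a single 256-byte buffer stored in the pool root */
struct my_root {
    uint8_t buf[256];
};

void fill_buf_transactionally(PMEMobjpool *pop)
{
    PMEMoid root = pmemobj_root(pop, sizeof(struct my_root));
    struct my_root *rp = pmemobj_direct(root);

    TX_BEGIN(pop) {
        /* snapshot the range first so an interrupted update rolls back */
        pmemobj_tx_add_range(root, offsetof(struct my_root, buf), 256);
        memset(rp->buf, 0xff, 256);
    } TX_END
    /* after commit, recovery sees either all old bytes or all 0xff */
}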

On your second question, I'm not 100% sure I get the question so I'll give you some facts that hopefully answer it.  Again, let's start out with the assumption that we're talking about normal stores to write-back cached memory on a system without eADR...

If two 8-byte stores to pmem are NOT in the same cache line, and the order of persistence is important to your application, then you'll need a CLWB+SFENCE after each one to make sure the younger store doesn't reach persistence before the older store.  But if the two stores are to the same cache line, the younger store cannot pass the older store so the ordering to persistence is preserved.  That is, after an interruption you might see none of the stores became persistent, or the older store became persistent, or both stores became persistent, but you will not see just the younger store without the older store.
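
As a hedged sketch of that pattern with compiler intrinsics (assuming CLWB support, built with -mclwb, and no eADR):

#include <stdint.h>
#include <immintrin.h> /* _mm_clwb, _mm_sfence */

/* Two 8-byte stores to different cache lines, where the older store
 * must reach persistence before the younger one. */
void ordered_stores(uint64_t *older, uint64_t *younger)
{
    *older = 1;
    _mm_clwb(older);   /* write the line back toward the persistence domain */
    _mm_sfence();      /* the younger store may not pass the flush */
    *younger = 2;
    _mm_clwb(younger);
    _mm_sfence();
}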

Hope that made sense.

-andy

Dimitris Stavrakakis

Oct 31, 2020, 6:33:55 PM
to Andy Rudoff, pmem
Hello Andy,

thanks for your immediate response.
My second question mostly has to do with the process followed in order to be sure that data becomes persistent.
To be more specific:
The pmem_memcpy_persist() function is a pmem_memcpy(), which copies the data,
followed by a pmem_persist() (pmem_flush() & pmem_drain()) to flush the data and make sure it reaches the persistence domain.
Without the call to pmem_persist(), we cannot be sure that our data is durable.
In a simple scenario, we modify a single cache line (64 bytes).
The pmem_memcpy() call completes successfully.
Then, pmem_flush() is called.
My question is: during the flush, can only part of the cache line be written to the PM if a crash occurs (during the flush, before the drain)?
Excuse me if the question makes no sense; I may have misunderstood something.

Thank you very much once again for your time.
Dimitris


Christian Schwarz

Nov 1, 2020, 3:41:40 AM
to pmem
On Saturday, 31 October 2020 at 01:29:18 UTC+1 ais...@uncc.edu wrote:
  • “The pmem_persist() function force any changes in the range [addr, addr+len) to be stored durably in persistent memory.”
  • “Any unwritten stores in the given range will be written, but some stores may have already been written by virtue of normal cache eviction/replacement policies.”

So, if a crash happens within variant 1's for loop, it is possible that only some of the data in *pmem has been persisted (cache lines evicted by virtue of normal cache eviction/replacement policies), which would be an inconsistent state. But it is guaranteed that after your pmem_persist() call returns, all the data within this range is persisted.

Both variants are guaranteed to have persisted all the 0xff's after the call to the respective pmem_*persist function. That's beside the point of my question, though.
All I wanted to know is whether the individual 8-byte-sized, 8-byte-aligned chunks of *pmem are guaranteed not to be torn at any point in time with either variant.
And if I understand correctly, Andy confirmed that.

Christian Schwarz

Nov 1, 2020, 3:48:29 AM
to pmem
On Saturday, 31 October 2020 at 23:33:55 UTC+1 dims...@gmail.com wrote:

My question is: during the flush, can only part of the cache line be written to the PM if a crash occurs (during the flush, before the drain)?
Excuse me if the question makes no sense; I may have misunderstood something.


Jan K

Nov 1, 2020, 9:03:30 AM
to pmem
This raised another question in my head: can 512-bit SIMD writes be
torn as well (I assume they can), and how is that possible?

I mean, e.g., the vmovdqa64 / vmovntdq (_mm512_store_epi64 /
_mm512_store_si512 / _mm512_stream_si512) instructions.

It's not possible to observe torn writes of these instructions from
another CPU (without a power fail), but 8-byte atomicity would mean
that one can see torn writes of these instructions across a power fail
- am I right?

Regards,
Jan

Christian Schwarz

Nov 1, 2020, 9:39:27 AM
to pmem
My understanding is that, as far as the documented guarantees are concerned, runtime cache coherency issues are orthogonal to powerfail atomicity issues.

> It's not possible to observe torn writes of these instructions from another CPU (without a power fail),

That's a runtime cache coherency issue and has nothing to do with power fail atomicity.

> but 8-byte atomicity would mean  that one can see torn writes of these instructions across a power fail 

Yes, there is no guarantee that the entire 512-bit write will persist atomically.
However, the insight I gained in this thread is that the 8-byte chunks of that 512-bit write won't be torn.
That doesn't help someone who needs larger powerfail atomicity guarantees, but since 512-bit SIMD instructions are also used to implement pmem_memcpy(), it's nice to know that pmem_memcpy() won't tear aligned 8-byte chunks.
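
To make the granularity concrete, a hedged sketch (assuming AVX-512 support and a 64-byte-aligned pmem destination):

#include <stdint.h>
#include <immintrin.h>

/* One 64-byte non-temporal store; dst must be 64-byte aligned. */
void store_line(uint64_t *dst)
{
    __m512i v = _mm512_set1_epi64(-1); /* all bytes 0xff */
    _mm512_stream_si512((__m512i *)dst, v);
    _mm_sfence();
    /* Across a power failure, each aligned 8-byte lane of dst is either
     * old or new; the 64-byte write as a whole may still be torn at
     * 8-byte boundaries. */
}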

@andy Would the following sentence be an appropriate addition to the pmem_mem*() man pages?
> pmem_mem*(dst, ...) guarantees that, if dst is 8 byte aligned, the individual 8 byte chunks of dst will be updated powerfail atomically.
Or even:
> pmem_mem*(dst, ...) guarantees that those segments of dst that are 8 byte sized and 8 byte aligned will be updated powerfail atomically.

Andy Rudoff

Nov 1, 2020, 4:47:17 PM
to pmem
Hi Dimitris,

On Saturday, October 31, 2020 at 4:33:55 PM UTC-6 dims...@gmail.com wrote:
Hello Andy,

thanks for your immediate response.
My second question mostly has to do with the process followed in order to be sure that data becomes persistent.
To be more specific:
The pmem_memcpy_persist() function is a pmem_memcpy(), which copies the data,
followed by a pmem_persist() (pmem_flush() & pmem_drain()) to flush the data and make sure it reaches the persistence domain.
Without the call to pmem_persist(), we cannot be sure that our data is durable.

Correct.  The application cannot assume the stores are persistent until the persist step has completed.  Note that we explain the semantics of pmem_memcpy_persist() by saying it is equivalent to pmem_memcpy() followed by pmem_persist() but in practice, libpmem is able to leverage optimizations, such as non-temporal stores, making a call to pmem_memcpy_persist() potentially faster than calling pmem_memcpy() + pmem_persist().
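
As a hedged illustration of that decomposition (assuming the libpmem 1.5+ flags API):

#include <stddef.h>
#include <libpmem.h>

/* one-call form; libpmem may pick non-temporal stores internally */
void copy_fused(void *dst, const void *src, size_t len)
{
    pmem_memcpy_persist(dst, src, len);
}

/* equivalent two-step form: copy without flushing, then persist */
void copy_split(void *dst, const void *src, size_t len)
{
    pmem_memcpy(dst, src, len, PMEM_F_MEM_NOFLUSH);
    pmem_persist(dst, len);
}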
 
In a simple scenario, we modify a single cache line (64 bytes).
The pmem_memcpy() call completes successfully.
Then, pmem_flush() is called.
My question is: during the flush, can only part of the cache line be written to the PM if a crash occurs (during the flush, before the drain)?
Excuse me if the question makes no sense; I may have misunderstood something.

I get what you're asking, but I think it is better to think about it using my previous answer, where I said "if the two stores are to the same cache line, the younger store cannot pass the older store so the ordering to persistence is preserved". I do not want you to think there is anything transactional about cache lines -- you can leverage the ordering property I'm describing, but since cache lines can be evicted at any time, you can see a subset of your stores to the cache line make it to persistence in the face of failure. By the way, a future CPU will introduce a new instruction which will write a full 64 bytes atomically. It is called MOVDIR64B, and the documentation for it is already public if you're interested.
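
For reference, a hedged sketch of what such a write might look like via the corresponding intrinsic (assuming a CPU with MOVDIR64B, built with -mmovdir64b, and a 64-byte-aligned destination):

#include <immintrin.h>

/* MOVDIR64B: a single, non-torn 64-byte direct store.
 * dst must be 64-byte aligned; src is any 64-byte source buffer. */
void store_line_atomic(void *dst, const void *src)
{
    _movdir64b(dst, src);
    _mm_sfence(); /* order it before subsequent stores */
}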

Andy Rudoff

Nov 1, 2020, 4:51:12 PM
to pmem
Hi Jan,

As Christian pointed out, operations that are atomic with respect to visibility are not necessarily atomic with respect to persistence. An AVX-512 store may make it to persistence atomically on some CPUs, but it is not architecturally guaranteed to be atomically persistent on all CPUs. The upcoming MOVDIR64B instruction will provide that. Until then, atomics larger than 8 bytes must be built by software. Libraries like PMDK are designed to detect and leverage the features of a platform to give the best performance, so if you use PMDK, your code will automatically build atomics in software or use instructions like MOVDIR64B as appropriate.

Thanks,

-andy

Andy Rudoff

Nov 1, 2020, 4:58:39 PM
to pmem
@andy Would the following sentence be an appropriate addition to the pmem_mem*() man pages?
> pmem_mem*(dst, ...) guarantees that, if dst is 8 byte aligned, the individual 8 byte chunks of dst will be updated powerfail atomically.
Or even:
> pmem_mem*(dst, ...) guarantees that those segments of dst that are 8 byte sized and 8 byte aligned will be updated powerfail atomically.

Hi Christian,

The fact that 8-byte aligned stores are powerfail atomic is a feature of the platform, not the library. PMDK is designed to be platform neutral, intended to work on ARM, POWER, etc., in addition to x86. (Although we have had some ARM-related pull requests, no platform besides x86 is yet production quality, but it is still our goal to encourage it.)

Also, I have to admit, I don't see the need for the distinction you've mentioned about 8-byte chunks of a larger update.  To me, the statement that 8-byte aligned stores are not torn is the architectural guarantee.

Thanks,

-andy 

Christian Schwarz

Nov 2, 2020, 8:10:05 AM
to pmem
On Sunday, 1 November 2020 at 22:58:39 UTC+1 Andy Rudoff wrote:

Also, I have to admit, I don't see the need for the distinction you've mentioned about 8-byte chunks of a larger update.  To me, the statement that 8-byte aligned stores are not torn is the architectural guarantee.

I find "stores" to be ambiguous, but maybe that's just language barrier.
From this thread I have learned that it means  "any instruction that updates 8-byte aligned memory", i.e., including all the AVX512 instructions used for efficient memcpy.
But originally, I though it meant "only those instructions that take an 8 byte source operand and write it to an 8-byte aligned address", which wouldn't include those AVX instructions.

- Christian