How to use CLWB instructions


Jungsik Choi

unread,
May 22, 2019, 3:00:13 AM5/22/19
to pm...@googlegroups.com
Hello,

I am wondering how to use CLWB instructions correctly. Some papers give the following example about the use of the CLWB instruction.

1. var = new_data
2. CLWB(var)
3. SFENCE

The above code is an example that updates a variable and makes it durable. I checked the source code of some NVM file systems, and they are implemented similarly to the example above.
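For concreteness, the pseudocode above might map to the following compiler intrinsics (a minimal sketch only; it assumes a compiler with CLWB support, e.g. -mclwb, and that var points into a DAX-mapped persistent memory region):

#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stdint.h>

/* 'var' is assumed to point into persistent memory (e.g. a DAX mapping). */
static void persist_update(uint64_t *var, uint64_t new_data)
{
    *var = new_data;        /* 1. store the new value                   */
    _mm_clwb(var);          /* 2. write the cache line back             */
    _mm_sfence();           /* 3. wait until the write-back has drained */
}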

However, I think there is a possibility that instructions #1 and #2 are reordered when the above example is executed, so I think an SFENCE instruction should be added before the CLWB instruction, as shown below.

1. var = new_data
2. SFENCE
3. CLWB(var)
4. SFENCE

I am curious about your opinion.

Thanks,

 Jungsik Choi

 PhD student
 College of Software
 Sungkyunkwan University
 ch...@skku.edu

Andy Rudoff

unread,
May 22, 2019, 9:33:58 AM5/22/19
to pmem
Hi Jungsik,

I believe the answer to your question is in this sentence, which I am pasting from the CLWB description in the Software Developer's Manual:

CLWB is implicitly ordered with older stores executed by the logical processor to the same address.

In your example, the same address "var" is used for both the store and the CLWB, so the CPU will not re-order them.  If your example did a store to var1 and then a CLWB on var2, where var1 and var2 are in different cache lines, then the CPU would be free to reorder them and an SFENCE would be required between them if you wanted to prevent the reordering.
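To make the contrast concrete, here is a rough sketch of the two cases using the intrinsics (the function names and layout are my own assumptions, not from the SDM):

#include <immintrin.h>

/* Same cache line: the CLWB is implicitly ordered after the older store
   to the same address, so no fence is needed between them.              */
void same_line(long *var, long v)
{
    *var = v;
    _mm_clwb(var);
    _mm_sfence();    /* only here to wait for the write-back to complete */
}

/* Different cache lines: the store to var1 and the CLWB of var2 may be
   reordered, so an SFENCE is required between them if that order matters. */
void different_lines(long *var1, long *var2, long v)
{
    *var1 = v;
    _mm_sfence();    /* keep the store to var1 ahead of the CLWB of var2 */
    _mm_clwb(var2);
    _mm_sfence();
}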


-andy

Jungsik Choi

unread,
May 22, 2019, 7:47:00 PM5/22/19
to Andy Rudoff, pmem
Hello Andy,

Thanks for your answer. Now I understand it clearly.

This mailing list always helps me a lot. I am grateful to everyone here as well.

Best regards,
Jungsik Choi



Ziv Dor

unread,
Jul 31, 2019, 10:59:38 AM7/31/19
to pmem
On Wednesday, May 22, 2019 at 16:33:58 UTC+3, Andy Rudoff wrote:
Following up with a more complex case, as the documentation seems to be unclear about it. What if I write to an offset within a cache line and then try to CLWB the start of the line? For example:
char buff[64]
buff[13]=5
CLWB(&buff[0])
SFENCE

Is it possible that the order is swapped (assuming the two addresses are in the same cache line)?

Thanks in advance
Ziv Dor

Andy Rudoff

unread,
Jul 31, 2019, 12:03:10 PM7/31/19
to pmem
char buff[64]
buff[13]=5
CLWB(&buff[0])
SFENCE

Is it possible that the order is swapped (assuming the two addresses are in the same cache line)?

Hi Ziv,

Your example only contains one store, so I'm not sure I understand your question.  Are you asking if the order of the store and CLWB can be swapped?  It can't.

But I think it is more likely you were asking something like this:

/* assume the cache line starts off containing zeros */
buff[0] = 1
buff[8] = 1
CLWB buff
SFENCE

Assuming you have taken steps to prevent the compiler from reordering the stores, a younger store will not pass an older store to the same cache line.  If my example code gets interrupted by a crash, on recovery either buff is all zeros, only buff[0] is 1, or both buff[0] and buff[8] are 1.  It is not possible for buff[8] to be 1 and buff[0] to be zero.

Once again, this is only true because they are in the same cache line.  If buff spans multiple cache lines, the ordering property I described doesn't hold.
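A rough C sketch of that scenario (the compiler barrier macro and helper name are my own; buff is assumed to be a 64-byte-aligned array in persistent memory, starting off all zeros):

#include <immintrin.h>
#include <stdint.h>

#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

void update(uint8_t *buff)           /* buff: 64-byte aligned, all zeros */
{
    buff[0] = 1;
    COMPILER_BARRIER();              /* stop the compiler reordering the stores */
    buff[8] = 1;
    _mm_clwb(buff);
    _mm_sfence();
}

/* On recovery after a crash: all zeros, only buff[0] == 1, or both are 1.
   Never buff[8] == 1 with buff[0] == 0, since both stores hit the same line. */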

Hope that helps,

-andy

Ziv Dor

unread,
Aug 4, 2019, 1:08:51 AM8/4/19
to pmem
Thank you, both topics do help, though I was originally concerned about the first topic.

Ziv

Marcel Köppen

unread,
Sep 17, 2019, 9:29:10 AM9/17/19
to pmem
Hi Andy,


On Wednesday, July 31, 2019 at 18:03:10 UTC+2, Andy Rudoff wrote:
But I think it is more likely you were asking something like this:

/* assume the cache line starts off containing zeros */
buff[0] = 1
buff[8] = 1
CLWB buff
SFENCE

Assuming you have taken steps to prevent the compiler from reordering the stores, a younger store will not pass an older store to the same cache line.  If my example code gets interrupted by a crash, on recovery either buff is all zeros, only buff[0] is 1, or both buff[0] and buff[8] are 1.  It is not possible for buff[8] to be 1 and buff[0] to be zero.

Once again, this is only true because they are in the same cache line.  If buff spans multiple cache lines, the ordering property I described doesn't hold.

does this hold even though buff[0] and buff[8] are 8 bytes apart? From your ;login: article and https://groups.google.com/forum/#!msg/pmem/nXB7VyssN9A/eRfSAT-GBQAJ I got the impression that there are no guarantees for writes larger than 8 bytes.

Best regards,
    Marcel

Andy Rudoff

unread,
Sep 17, 2019, 9:48:01 AM9/17/19
to pmem
Hi Marcel,

The 8-byte atomicity I wrote about in the ;login: article is talking about failure atomicity, or the size of a store that is guaranteed to be atomic in the face of failure.  For example, if you store 8 bytes (that are aligned) and the system experiences a power failure while that store is in-flight, on reboot the contents of that location will contain either the old 8 bytes or the new 8 bytes, but not a partial update.  Anything larger and this is no longer guaranteed, so if you did the two store instructions it would take to update buff[0] and buff[8], it is possible only one of them happens in the face of failure.
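As a small sketch of that distinction (my own example; the pointers are assumed to be suitably aligned locations in persistent memory):

#include <immintrin.h>
#include <stdint.h>

/* A single aligned 8-byte store: after a crash this location holds either
   the old value or the new value, never a torn mix.                       */
void store8(uint64_t *slot, uint64_t v)
{
    *slot = v;
    _mm_clwb(slot);
    _mm_sfence();
}

/* Sixteen bytes take two store instructions, so a crash can leave only one
   half updated even though both are flushed together here.                */
void store16(uint64_t *pair, uint64_t lo, uint64_t hi)
{
    pair[0] = lo;
    pair[1] = hi;
    _mm_clwb(pair);
    _mm_sfence();
}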

What I was talking about in the post below is about ordering of multiple stores to the same cache line, not about atomicity.

Hope that clarified it a bit.

-andy 

Marcel Köppen

unread,
Sep 17, 2019, 10:12:15 AM9/17/19
to pmem


On Tuesday, September 17, 2019 at 15:48:01 UTC+2, Andy Rudoff wrote:
The 8-byte atomicity I wrote about in the ;login: article is talking about failure atomicity, or the size of a store that is guaranteed to be atomic in the face of failure.  For example, if you store 8 bytes (that are aligned) and the system experiences a power failure while that store is in-flight, on reboot the contents of that location will contain either the old 8 bytes or the new 8 bytes, but not a partial update.  Anything larger and this is no longer guaranteed, so if you did the two store instructions it would take to update buff[0] and buff[8], it is possible only one of them happens in the face of failure.

What I was talking about in the post below is about ordering of multiple stores to the same cache line, not about atomicity.

So, do I get this right: when I flush a cache line, it is persisted in chunks of 8 bytes in no particular order, but it is guaranteed that stores to that cache line since the last flush do not get reordered?

Andy Rudoff

unread,
Sep 17, 2019, 10:47:12 AM9/17/19
to pmem
So, do I get this right: when I flush a cache line, it is persisted in chunks of 8 bytes in no particular order, but it is guaranteed that stores to that cache line since the last flush do not get reordered?

Not exactly.  I would state it this way: There's no way to change a full cache line in a failure atomic way.  You might be in the middle of updating the cache line and other activity in the system causes some of it to be flushed to persistence before you're finished.  Then if the system crashes before you finish updating the rest of the cache line, you have a mix of old data and new data in that cache line.  The failure atomic size in the current generation of CPUs is 8 bytes; anything larger can be torn by a failure.

Of course, on a quiet system it is very likely that you can change a full cache line and flush it and the entire cache line travels to the persistent memory in a single chunk.  It just isn't guaranteed.

That's the atomic component of this conversation.  The ordering component is the fact that a younger store won't pass an older store to the same cache line.

-andy

Marcel Köppen

unread,
Sep 17, 2019, 11:22:06 AM9/17/19
to pmem
That should be enough guarantees for our purpose. :)

Thanks for your help!

    Marcel

Marcel Köppen

unread,
Sep 18, 2019, 8:16:32 AM9/18/19
to pmem
Hello again,

after consulting my colleagues, I have another question about the cache flush. I understand that a cache-line is flushed in chunks of 8 bytes. 
In what order are the chunks written? Is it deterministic?

Best regards,

Marcel

Jonathan Halliday

unread,
Sep 18, 2019, 8:49:40 AM9/18/19
to pmem

A cache line is flushed atomically. You just can't completely control when. It may be flushed spontaneously by the hardware between any other instructions if there is cache memory pressure. This is radically different to block I/O, where the memory->storage writeback occurs only when you explicitly tell it to.

An 8 byte write as a single instruction is persistence atomic, because the line may flush before or after, but not during. The same logical write done instead as 2x 4 byte write instructions is not atomic, since the flush may occur between the two.

Intel hardware won't reorder writes to the same cache line, so you can at least reason that the failure cases are a) no write is done, b) first write is done but second isn't and c) both are done. But it's not possible that the second is done whilst the first isn't. Beyond that you don't have determinism. In particular, multiple cache lines may be flushed in any order, so writes spanning cache line boundaries are a pain.

Regards

Jonathan

Marcin Ślusarz

unread,
Sep 18, 2019, 2:08:50 PM9/18/19
to Jonathan Halliday, pmem
On Wed, Sep 18, 2019 at 14:49, Jonathan Halliday <goo...@the-transcend.com> wrote:
Intel hardware won't reorder writes to the same cache line, so you can at least reason that the failure cases are a) no write is done, b) first write is done but second isn't and c) both are done. But it's not possible that the second is done whilst the first isn't. 

Hardware won't do it on its own, but the compiler can still reorder writes (or even optimize some of them out). You have to use a compiler barrier to deal with it.
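For example, on GCC/Clang something like the following is commonly used (a sketch; the macro name is my own, and C11's atomic_signal_fence is a more portable alternative):

#include <stdatomic.h>

/* Pure compiler barrier: emits no instruction, but the compiler may not
   move memory accesses across it.                                       */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Portable C11 spelling with the same intent. */
static inline void compiler_barrier(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}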

Marcin

Marcel Köppen

unread,
Sep 19, 2019, 7:23:21 AM9/19/19
to pmem

On Wednesday, September 18, 2019 at 14:49:40 UTC+2, Jonathan Halliday wrote:

A cache line is flushed atomically. You just can't completely control when. It may be flushed spontaneously by the hardware between any other instructions if there is cache memory pressure. This is radically different to block I/O, where the memory->storage writeback occurs only when you explicitly tell it to.

Does that mean that a cache line reaches the persistence domain either as a whole or not at all, when it is flushed? I couldn't find that in the documentation.

    Marcel

Andy Rudoff

unread,
Sep 19, 2019, 8:13:51 AM9/19/19
to pmem

Does that mean that a cache line reaches the persistence domain either as a whole or not at all, when it is flushed? I couldn't find that in the documentation.



It is important to distinguish between what is *likely* to happen and what is *architecturally guaranteed* to happen.  In normal operation of a system, it is likely that data is delivered to DIMMs as full cache lines that are not torn by powerfail on most CPUs.  Right now, the only architectural guarantees are what I stated, that an 8-byte store is failure atomic and that a younger store won't pass an older store to the same cache line.

In future CPUs, a new instruction called MOVDIR64B will provide a 64-byte failure atomic store to persistent memory.  You can read about this instruction in the SDM on intel.com if you are interested.  Until then, you should only be depending on the 8-byte atomicity.
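For reference, the intrinsic looks roughly like this (a sketch only; it assumes a CPU and compiler with MOVDIR64B support, e.g. GCC's -mmovdir64b, and the helper name is my own):

#include <immintrin.h>   /* _movdir64b, _mm_sfence */
#include <stdint.h>

/* Write a full 64-byte line to persistent memory as one direct store.
   dst must be 64-byte aligned; src is an ordinary 64-byte buffer.      */
void store_line(void *dst, const uint8_t src[64])
{
    _movdir64b(dst, src);
    _mm_sfence();        /* wait until the direct store has been accepted
                            toward persistence before continuing         */
}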

-andy

Marcel Köppen

unread,
Sep 19, 2019, 8:22:01 AM9/19/19
to pmem
So when Jonathan said that "a cache line is flushed atomically", that was not correct?

Andy Rudoff

unread,
Sep 19, 2019, 8:26:33 AM9/19/19
to pmem

It is important to distinguish between what is *likely* to happen and what is *architecturally guaranteed* to happen.  In normal operation of a system, it is likely that data is delivered to DIMMs as full cache lines that are not torn by powerfail on most CPUs.  Right now, the only architectural guarantees are what I stated, that an 8-byte store is failure atomic and that a younger store won't pass an older store to the same cache line.

In future CPUs, a new instruction called MOVDIR64B will provide a 64-byte failure atomic store to persistent memory.  You can read about this instruction in the SDM on intel.com if you are interested.  Until then, you should only be depending on the 8-byte atomicity.


So when Jonathan said that "a cache line is flushed atomically", that was not correct?

The architectural guarantees are what I summarized above.  A cache line is not failure atomic until MOVDIR64B is available.

-andy 

Marcel Köppen

unread,
Sep 19, 2019, 8:38:09 AM9/19/19
to pmem
OK. And is it guaranteed all the way to the DIMM that a younger store won't pass an older store to the same cache line?

Andy Rudoff

unread,
Sep 19, 2019, 8:44:11 AM9/19/19
to pmem
The architectural guarantees are what I summarized above.  A cache line is not failure atomic until MOVDIR64B is available.

OK. And is it guaranteed all the way to the DIMM that a younger store won't pass an older store to the same cache line?

Yes, the ordering guarantee holds true all the way to the point where the stores are persistent.

-andy
 

Marcel Köppen

unread,
Sep 19, 2019, 8:50:13 AM9/19/19
to pmem
Wonderful! Can I quote you on this? :D Or is this information available somewhere in the documentation?

Thanks for your patience!

    Marcel

Andy Rudoff

unread,
Sep 19, 2019, 8:55:06 AM9/19/19
to pmem
Please feel free to quote me (this is a public mailing list after all, so every future google search will find what I said :-)

The information is in Intel's SDM, but it can be difficult to sift through all the information and get to these precise details.  That's one reason why we've been publishing slides, videos, tutorials, etc. on pmem.io and on Intel's Software Development Zone.  It helps, but there is always room for improvement on how we present these details.

-andy

Jonathan Halliday

unread,
Sep 19, 2019, 9:34:28 AM9/19/19
to pmem
It's my understanding that the path from the L1 to the DIMMs doesn't have any knowledge of the order in which 8 byte chunks of the line were previously written, as the CPU isn't recording that information anywhere. i.e. the path doesn't operate as a serial queue of <addr,8byte-value> change tuples, but rather treats the line as a single 64 byte entity. If that is true, it has to handle the line atomically or it can't maintain the ordering guarantee? If it splits the line at an arbitrary 8 byte boundary it may inadvertently invert the write order. Have I misunderstood the design of the protocol the memory controller is using to move cache lines around?

Jonathan
 
 

Jan K

unread,
Sep 19, 2019, 9:56:47 AM9/19/19
to pmem
I feel that somehow two different kinds of guarantees got mixed up here.

If I get it right, one set of guarantees is the memory model - the
rules for what one CPU may see when another CPU accesses the same data
concurrently.
For instance, when one thread executes "write 1 to x; read y" and another
thread executes "write 1 to y; read x", then x86_64 is fine with
returning the initial value (say 0) in both threads, contrary to
expectations. And the x86_64 memory model disallows observing stores in a
different order than they were issued.
But these guarantees only hold until a failure, and I believe there's no
continuity across crashes.

The other set of guarantees is what will be present in the persistent
memory after a failure. And here, as far as I understand, the only
guarantee is that aligned 8-byte chunks are either written fully into
the persistent memory, or not at all.

Am I correct that after issuing two stores 8 bytes apart, if the
failure does not happen, then one can never see only the second store
alone, but if the failure does happen, then it is allowed that only
the second store is present in pmem?
E.g., when a thread executes "write 1 to x; write 1 to y;", then upon
no failures executing "read y; read x" can yield "0,0", "0,1" or
"1,1", but if a failure occurs, then (after restart obviously) it can
also yield "1,0"?

Regards,
Jan

Andy Rudoff

unread,
Sep 19, 2019, 10:17:43 AM9/19/19
to pmem
Hi Jonathan,

I apologize in advance for acting like a lawyer and picking apart your message.  But that's what it takes to give correct information on topics like this!

>It's my understanding that the path from the L1 to the DIMMs doesn't have any knowledge of the order in which 8 byte chunks of the line were previously written, as the CPU isn't recording that information anywhere. i.e. the path doesn't operate as a serial queue of <addr,8byte-value> change tuples, but rather treats the line as a single 64 byte entity.

I've been talking about what is architecturally guaranteed.  I'm honestly not sure if Intel publishes a document that describes what you just said and promises not to change it -- they might, I'm not just not sure.  But as a software developer, I know what is promised in the SDM and Intel spends quite a bit of energy making sure they don't break those promises by validating the semantics on each new implementation.  That said, I believe what you said above is true on the current Xeon CPUs.

>If that is true, it has to handle the line atomically or it can't maintain the ordering guarantee?

The ordering guarantees you are referring to are for visibility.  Yes, the stores must become visible in the order guaranteed by the memory model.  They can become persistent in a different order and the model described by the SDM still holds true.  This is why we started talking about failure atomicity and ordering to persistence, because they are not the same as the order of visibility specified by the memory model.  That said, the ordering property for persistence I have stated multiple times in this thread (a younger store won't pass an older store to persistence in the same cache line) seems to be the same thing you are saying in the above sentence.

>If it splits the line at an arbitrary 8 byte boundary it may inadvertently invert the write order. Have I misunderstood the design of the protocol the memory controller is using to move cache lines around?

What you say seems to make sense, but it seems to me that different implementations are free to change how this works, as long as they continue to provide the required semantics.  That's why I've been trying to pull the conversation back to what semantics are guaranteed by the spec, and not how the current CPUs implement them.

In summary, I strongly suspect we are agreeing on the current semantics, I'm just focused on using what the spec is promising so that software is more future-proof if the HW implementation changes.

-andy

Andy Rudoff

unread,
Sep 19, 2019, 10:28:36 AM9/19/19
to pmem
Hi Jan,

Yes, I've been talking about only the persistence semantics in this thread and I think some confusion has been because there are also visibility semantics that are not the same.  More comments below...


On Thursday, September 19, 2019 at 7:56:47 AM UTC-6, Jan K wrote:
I feel that somehow two different kinds of guarantees got mixed up here.

If I get it right, one set of guarantees is the memory model - the
rules for what one CPU may see when another CPU accesses the same data
concurrently.
For instance, when one thread executes "write 1 to x; read y" and another
thread executes "write 1 to y; read x", then x86_64 is fine with
returning the initial value (say 0) in both threads, contrary to
expectations. And the x86_64 memory model disallows observing stores in a
different order than they were issued.
But these guarantees only hold until a failure, and I believe there's no
continuity across crashes.

Yes, sounds right to me.
 

The other set of guarantees is what will be present in the persistent
memory after a failure. And here, as far as I understand, the only
guarantee is that aligned 8-byte chunks are either written fully into
the persistent memory, or not at all.

Yes, that's currently the only failure atomicity guarantee, but another point of this thread was to describe an ordering property that also holds true for persistence.
 

Am I correct that after issuing two stores 8 bytes apart, if the
failure does not happen, then one can never see only the second store
alone, but if the failure does happen, then it is allowed that only
the second store is present in pmem?
E.g., when a thread executes "write 1 to x; write 1 to y;", then upon
no failures executing "read y; read x" can yield "0,0", "0,1" or
"1,1", but if a failure occurs, then (after restart obviously) it can
also yield "1,0"?

If the two stores that are 8 bytes apart are in different cache lines, they can become persistent in any order.  But if they are in the same cache line, then SW can depend on the "younger store won't pass an older store in the same cache line" property for persistence.  For example, on recovery from a power failure, SW can see that the write of 1 to y happened in your example, and use that fact to infer that the write to x has also happened.  But again, only if they are in the same cache line.  In SW we sometimes use this property to store a few values to pmem, then store a "valid flag" to indicate the stores are complete; since all the stores are in the same cache line we only have to do a single CLWB+SFENCE, and on recovery the valid flag works because it was stored last.
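A rough sketch of that valid-flag idiom (the field names, alignment attribute, and compiler barrier are my own assumptions, not something from the thread):

#include <immintrin.h>
#include <stdint.h>

#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* The whole record fits in one 64-byte cache line. */
struct record {
    uint64_t x;
    uint64_t y;
    uint64_t valid;                 /* written last */
} __attribute__((aligned(64)));

void publish(struct record *r, uint64_t x, uint64_t y)
{
    r->x = x;
    r->y = y;
    COMPILER_BARRIER();             /* keep the flag store after the data stores */
    r->valid = 1;
    _mm_clwb(r);                    /* one flush covers the whole record */
    _mm_sfence();
}

/* On recovery, valid == 1 implies x and y are persistent too, because a
   younger store does not pass an older store within the same cache line. */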

 -andy

Jonathan Halliday

unread,
Sep 19, 2019, 11:16:42 AM9/19/19
to pmem

Indeed so. My assertion that the cache line persist is atomic is based on the fact that it seems like that's the only reasonably efficient implementation of the guarantees that are provided, not that it's an explicitly provided guarantee in its own right.

In the absence of a way to pin cache lines and prevent writeback, it's hard to distinguish 'the system spontaneously, atomically persisted a line at the worst possible time in the instruction stream, then crashed' from 'the system persisted when expected, but wrote only a partial line, in update order, then crashed part way through the persist operation'. That is, because of the ordering guarantees, a partial persist happening later is semantically equivalent to a full persist happening earlier, and the program needs equivalent safeguards. The persistent state of a single cache line at the DIMMs will only ever be one that actually existed in L1 at some point in time, not one that's torn between partial states from two different points in the instruction stream. You just don't always get to choose which point in time.

Based on my own experience learning pmem programming, the key insight is that it differs from traditional I/O insofar as you are never operating on a transient copy that becomes persistent only if you explicitly ask. Instead you're effectively operating on the persistent data directly, and have to cope with the possibility that changes may rather inconveniently become persistent immediately, even if they are not finished yet. This is particularly hard for people with SQL experience, since it's very much not the ACID transaction model and CLWB is not a transaction-commit instruction.  CPUs whose hardware transaction support interacted with pmem in that way would be really interesting and perhaps more intuitive to program against...

Jonathan

steve

unread,
Sep 19, 2019, 11:22:49 AM9/19/19
to Jonathan Halliday, pmem
From my limited understanding, that is planned for a later iteration.
------------
Steve Heller

Jonathan Halliday

unread,
Sep 19, 2019, 11:31:21 AM9/19/19
to pmem
It does seem like a logical progression. My feeling is that some form of isolation mechanism is going to be particularly helpful, as it's currently possible for other processes to observe changes that are not yet persistent, act on them and persist related updates before the data they are based on is itself persisted. The overhead of locking at the software level to prevent that is painful. Also, whilst multi-word CAS algorithms exist, that would likewise be much better at the hardware level. I'd be very happy if the code I'm writing today was obsolete sooner rather than later!

Jonathan