Libck vs C11


Patrick Julien

Jun 25, 2020, 4:23:29 PM
to Concurrency Kit
Hello,

I just started working on a project that uses libck and I was hoping to get a few clarifications.  If I have the following:

  X *tmp = atomic_load_explicit(&instance_, memory_order_consume);

  if (!tmp) {
    mtx_lock(&mutex);
    tmp = atomic_load_explicit(&instance_, memory_order_consume);

    if (!tmp) {
      tmp = malloc(sizeof(X));
      atomic_store_explicit(&instance_, tmp, memory_order_release);
    }

    mtx_unlock(&mutex);
  }

  return tmp;
  
  
  Would the libck equivalent be the following?

  X *tmp = instance_;  // Don't know how to do this directly
  ck_pr_fence_acquire(); // Use stronger acquire fence, consume does not look to be available

  if (!tmp) {
    mtx_lock(&mutex);
    tmp = instance_;
    ck_pr_fence_acquire();

    if (!tmp) {
      tmp = malloc(sizeof(X));
      instance_ = tmp;
      ck_pr_fence_release();
    }

    mtx_unlock(&mutex);
  }
  
  return tmp;


Is this correct?  

Furthermore, are opaque operations available?  This is equivalent to C++11 memory_order_relaxed.  For spin counts, which we use heavily, using volatile access all the time is expensive.  When adding to the reference count, an opaque add is sufficient.  When decrementing it, a release is also sufficient if followed by an acquire when actually deleting.
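
For reference, the reference counting pattern I have in mind looks roughly like this in C11 (just a sketch; the struct and function names below are made up for illustration):

   #include <stdatomic.h>
   #include <stdlib.h>

   struct object {
     atomic_uint refs;
     /* ... payload ... */
   };

   static void object_ref(struct object *o) {
     /* Opaque/relaxed add: only atomicity is needed, no ordering. */
     atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
   }

   static void object_unref(struct object *o) {
     /* Release on the decrement so prior writes to the object are visible
        to whoever performs the final decrement... */
     if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) == 1) {
       /* ...and acquire before actually deleting. */
       atomic_thread_fence(memory_order_acquire);
       free(o);
     }
   }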

Thanking you in advance for any clarification,

Patrick Julien

Jun 26, 2020, 4:14:50 PM
to Concurrency Kit
After reading more of the source, I think it should be more like this:

   ck_pr_fence_acquire(); // Use stronger acquire fence, consume does not look to be available
   X *tmp = instance_;  // Don't know how to do this directly
  
  if (!tmp) {
    mtx_lock(&mutex);
    ck_pr_fence_acquire();  
    tmp = instance_;

Samy Al Bahra

Jun 28, 2020, 5:41:37 AM
to concurr...@googlegroups.com, Samy Bahra
 Hi Patrick,

Sorry for the late reply!

On fences: I got rid of consume (it was called a depends fence) as it ended up confusing developers. The expectation in CK is that as long as you're carrying through the data dependency, no fence is required. Otherwise, use a stronger fence. In your example, the consume fence would not be used in CK as there is a data dependency. You would use fences to order operations between distinct memory locations that are not carrying a data dependency. The second update is incorrect; the first one is correct but has an unnecessary consume fence. You also have the option of using traditional read / write / full barriers (see https://www.kernel.org/doc/Documentation/memory-barriers.txt and section 7 of http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf). If using R / W / F barriers, in your first example you would use a ck_pr_fence_store instead of release. Perhaps it is simpler to just say there is an implicit load_depends in all ck_pr_load_ptr operations. Even in the Linux kernel, it isn't permitted to use read_depends any more since it got pushed down to READ_ONCE.

On loading and storing safely: You still need to load the pointer atomically, use ck_pr_load_ptr(&instance). To update the pointer, use ck_pr_store_ptr(&instance, tmp). 

On your mutex usage, I'm a bit confused by the example. Once you have acquired the mutex, if it's the case no one else can update the variable then you do not require ck_pr_load_ptr, but you will need to ck_pr_store_ptr if there will be concurrent readers (using ck_pr_load_ptr). Using operations only when needed could help clarify the code.
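
Putting the above together, a sketch of how I would write your first example with ck_pr (not tested; the function name is arbitrary, it uses the same instance_, mutex and X from your post, and it assumes only mutex holders ever set instance_):

   X *
   get_instance(void)
   {
     X *tmp = ck_pr_load_ptr(&instance_);  /* implicit load_depends */

     if (tmp == NULL) {
       mtx_lock(&mutex);

       /* Only mutex holders update instance_, so a plain read is fine here. */
       tmp = instance_;
       if (tmp == NULL) {
         tmp = malloc(sizeof(*tmp));
         /* ... initialize *tmp ... */
         ck_pr_fence_store();  /* order initialization before publication */
         ck_pr_store_ptr(&instance_, tmp);
       }

       mtx_unlock(&mutex);
     }

     return tmp;
   }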

Regarding opaque operations, all the operations in CK are relaxed W.R.T. ordering, with stricter semantics enforced with fences. However, you still have to pay the cost of a compiler barrier if you use ck_pr. In cases where you prefer a plain load or store, make sure to use ck_pr_load/store respectively. We could provide more relaxed variants of ck_pr without a full blown compiler barrier (ck_pr_*_*_relaxed) for use-cases where it is absolutely safe. Would you be up for contributing such a patch set? Happy to shepherd it and it would be fairly small given how the code is structured. I know several users would be happy with this.
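
To be concrete about what a relaxed variant would look like (hypothetical sketch only; none of this exists in CK today), it would keep the single access but drop the surrounding compiler barrier:

   /* Hypothetical ck_pr_load_uint_relaxed: a single access with no compiler
      barrier around it, so the compiler remains free to reorder unrelated
      memory operations across it. */
   static inline unsigned int
   ck_pr_load_uint_relaxed(const unsigned int *target)
   {
     return (*(const volatile unsigned int *)target);
   }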




--
Samy Al Bahra [http://repnop.org]

Patrick Julien

Jun 28, 2020, 9:34:45 AM
to Concurrency Kit
On Sunday, June 28, 2020 at 5:41:37 AM UTC-4 sbahra wrote:
 Hi Patrick,

Sorry for the late reply!

On fences: I got rid of consume (it was called a depends fence) as it ended up confusing developers. The expectation in CK is that as long as you're carrying through the data dependency, no fence is required. Otherwise, use a stronger fence. In your example, the consume fence would not be used in CK as there is a data dependency. You would use fences to order operations between distinct memory locations that are not carrying a data dependency.


That's unfortunate.  I do see this here, it looks like I could replace the acquire in this example with this function.

 
The second update is incorrect,

Yes, I realized this after posting.  I need the fence before dereferencing the object.

 
the first one is correct but has an unnecessary consume fence. You also have the option of using traditional read / write / full barriers (see https://www.kernel.org/doc/Documentation/memory-barriers.txt and section 7 of http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf). If using R / W / F barriers, in your first example you would use a ck_pr_fence_store instead of release. Perhaps it is simpler to just say there is an implicit load_depends in all ck_pr_load_ptr operations. Even in the Linux kernel, it isn't permitted to use read_depends any more since it got pushed down to READ_ONCE.


I am familiar with these documents already.  I'm looking to make the fast path faster; this approach would mean adding a full volatile read on what should be the fast path.

 

On loading and storing safely: You still need to load the pointer atomically, use ck_pr_load_ptr(&instance). To update the pointer, use ck_pr_store_ptr(&instance, tmp). 

I think that's what you're saying, but I'm really not sure: since opaque is not available, both examples I gave are incorrect and should use full volatile reads and writes instead.

These are not only slower, but there is no need for them in this use case; safety can be achieved with acquire/release... if opaque operations are available, that is.  The mutex makes sure that only one instance of the object is ever created. To make sure that any dereference of the object strictly "happens after" creating the instance in another thread, using memory_order_release after creating and initializing the object and memory_order_acquire before dereferencing the object provides this guarantee.  It's OK to use memory_order_acquire instead of memory_order_consume, but this provides a stronger guarantee than is required, since only operations depending on the value of the pointer need to be ordered.
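
Concretely, the pairing I am describing is (C11, sketch only):

   /* Writer, inside the mutex, after fully initializing *tmp: */
   atomic_store_explicit(&instance_, tmp, memory_order_release);

   /* Reader fast path: */
   X *p = atomic_load_explicit(&instance_, memory_order_acquire);
   if (p != NULL) {
     /* Everything reached through p happens after the initialization above. */
   }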

 

On your mutex usage, I'm a bit confused by the example. Once you have acquired the mutex, if it's the case no one else can update the variable then you do not require ck_pr_load_ptr, but you will need to ck_pr_store_ptr if there will be concurrent readers (using ck_pr_load_ptr). Using operations only when needed could help clarify the code.


Yes, concurrent readers.  See the original C11 example and the explanation just before this.  I don't need volatile access if I can use opaque accesses with release/acquire.

 

Regarding opaque operations, all the operations in CK are relaxed W.R.T. ordering, with stricter semantics enforced with fences. However, you still have to pay the cost of a compiler barrier if you use ck_pr. In cases where you prefer a plain load or store, make sure to use ck_pr_load/store respectively.

Not sure why you're saying this in this context; now I'm really confused.  The documentation for ck_pr_load leads me to believe it is a full fence, a volatile read.  This seems to imply that it is not.  The second I see the word "volatile" I stop reading, because I'm assuming I'm dealing with a full fence.

 
We could provide more relaxed variants of ck_pr without a full blown compiler barrier (ck_pr_*_*_relaxed) for use-cases where it is absolutely safe. Would you be up for contributing such a patch set? Happy to shepherd it and it would be fairly small given how the code is structured. I know several users would be happy with this.

I can make an Intel version of those in a jiffy, but the other CPUs will need some research for sure.  Before I attempt to submit anything, however, I need to be thoroughly "unconfused" about libck itself.


 

Samy Al Bahra

Jun 29, 2020, 2:39:33 PM
to concurr...@googlegroups.com
Responses inline. Patrick and I caught up offline; clarifications for folks following along are below.

On Sun, Jun 28, 2020 at 9:34 AM Patrick Julien <pju...@gmail.com> wrote:


On Sunday, June 28, 2020 at 5:41:37 AM UTC-4 sbahra wrote:
 Hi Patrick,

Sorry for the late reply!

On fences: I got rid of consume (it was called a depends fence) as it ended up confusing developers. The expectation in CK is that as long as you're carrying through the data dependency, no fence is required. Otherwise, use a stronger fence. In your example, the consume fence would not be used in CK as there is a data dependency. You would use fences to order operations between distinct memory locations that are not carrying a data dependency.


That's unfortunate.  I do see this here, it looks like I could replace the acquire in this example with this function.

load_depends is in there and we'll keep it in CK as a no-op.
 
On loading and storing safely: You still need to load the pointer atomically, use ck_pr_load_ptr(&instance). To update the pointer, use ck_pr_store_ptr(&instance, tmp). 

I think that's what you're saying, but I'm really not sure: since opaque is not available, both examples I gave are incorrect and should use full volatile reads and writes instead.

These are not only slower, but there is no need for them in this use case; safety can be achieved with acquire/release... if opaque operations are available, that is.  The mutex makes sure that only one instance of the object is ever created. To make sure that any dereference of the object strictly "happens after" creating the instance in another thread, using memory_order_release after creating and initializing the object and memory_order_acquire before dereferencing the object provides this guarantee.  It's OK to use memory_order_acquire instead of memory_order_consume, but this provides a stronger guarantee than is required, since only operations depending on the value of the pointer need to be ordered.

This was a miscommunication. "volatile" here refers to C volatile semantics. ck_pr implies a compiler barrier and an atomic operation (where relevant), but doesn't guarantee memory ordering if it can avoid it.


 

On your mutex usage, I'm a bit confused by the example. Once you have acquired the mutex, if it's the case no one else can update the variable then you do not require ck_pr_load_ptr, but you will need to ck_pr_store_ptr if there will be concurrent readers (using ck_pr_load_ptr). Using operations only when needed could help clarify the code.


Yes, concurrent readers.  See the original C11 example and the explanation just before this.  I don't need volatile access if I can use opaque accesses with release/acquire.

 

Regarding opaque operations, all the operations in CK are relaxed W.R.T. ordering, with stricter semantics enforced with fences. However, you still have to pay the cost of a compiler barrier if you use ck_pr. In cases where you prefer a plain load or store, make sure to use ck_pr_load/store respectively.

Not sure why you're saying this in this context; now I'm really confused.  The documentation for ck_pr_load leads me to believe it is a full fence, a volatile read.  This seems to imply that it is not.  The second I see the word "volatile" I stop reading, because I'm assuming I'm dealing with a full fence.

Clarification above.
 

 
We could provide more relaxed variants of ck_pr without a full blown compiler barrier (ck_pr_*_*_relaxed) for use-cases where it is absolutely safe. Would you be up for contributing such a patch set? Happy to shepherd it and it would be fairly small given how the code is structured. I know several users would be happy with this.

I can make an Intel version of those in a jiffy, but the other CPUs will need some research for sure.  Before I attempt to submit anything, however, I need to be thoroughly "unconfused" about libck itself.

In this case, we're talking about getting rid of the compiler barrier only for a weaker variant of ck_pr, nothing architecture-specific. 
 


 
