
atomic primitives


Ivan Godard

Jun 22, 2020, 10:45:56 PM
<moved from "[OT] Anton Ertl has been proven right" to get away from the
UB wrangling

On 6/22/2020 7:09 PM, Chris M. Thomasson wrote:
> On 6/22/2020 6:26 PM, Ivan Godard wrote:
>> On 6/22/2020 5:35 PM, already...@yahoo.com wrote:
>>> On Tuesday, June 23, 2020 at 2:28:24 AM UTC+3, Chris M. Thomasson
wrote:


>>> Why would anybody do it?
>>> Unaligned accesses are good for many things, but atomicity is
>>> certainly not one of them.
>>>
>>
>> That's one of the reasons why Mill doesn't have hardware locking
>> primitives but uses transactional semantics instead.
>
> Have any problems with live lock? ;^)

That's a really good question, and the answer is we really don't know yet
but don't believe so. The reason we don't think so is that the hardware
primitives automatically do a backoff, so lockout is exponentially
unlikely as the duration increases, as in well-known network protocols.

We realize that a backoff algorithm only gives statistical guarantees,
as opposed to a round robin which can give a fixed guarantee. This may
rule out use in certain hard realtime codes that must have the fixed
guarantee. It's a market choice to abandon such codes, but then any OOO
or machine with a cache has also made the same market choice.
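
To make that shape concrete, here is a minimal C++ sketch of
retry-with-randomized-exponential-backoff. enterAtomic()/commit() are
hypothetical stand-ins for the hardware primitives (stubbed so the
sketch compiles); they are not a published Mill API.

#include <algorithm>
#include <chrono>
#include <random>
#include <thread>

// Trivial stand-ins for the hypothetical hardware primitives.
inline void enterAtomic() {}
inline bool commit() { return true; }   // false would mean a collision

void update_with_backoff()
{
    std::minstd_rand rng{std::random_device{}()};
    for (unsigned attempt = 0; ; ++attempt)
    {
        enterAtomic();
        // ... transactional loads and stores go here ...
        if (commit())
            return;   // success

        // Collision: wait a randomized interval whose bound doubles
        // per failure, as in Ethernet-style protocols, so indefinite
        // lockout becomes exponentially unlikely.
        unsigned bound = 1u << std::min(attempt, 16u);
        std::this_thread::sleep_for(std::chrono::nanoseconds(rng() % bound + 1));
    }
}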


>Btw, I am not necessarily talking
> about atomic ops wrt a RMW. I am taking about the ability to execute a
> remote memory barrier in user space. I assume the Mill is 100%
> sequentially consistent, right? On x86, there is a way to trigger a bus
> lock on purpose.
>
> Think of something like RCU. The readers do not need to execute any
> memory barriers or atomic RMW's at all, no transactions, nothing...

The problem with RCU is that the final U must itself be atomic. If you
arm-bend the data structure enough you can usually get that down to a
pointer store, which is hardware-atomic. But frequently you really don't
want to bend the structure: insert into a doubly linked list is just not
naturally RCU, but it's fine with transactional semantics.
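
As a concrete illustration of "get it down to a pointer store", a
minimal C++11 sketch (the names are mine; a single writer, or a
writer-side lock, is assumed, and reclamation is deliberately omitted):

#include <atomic>

struct Node
{
    int   payload;
    Node* next;
};

std::atomic<Node*> list_head{nullptr};

// Writer: build the node completely, then publish it with one atomic
// release store. That store is the entire "U" of RCU here.
void publish(int value)
{
    Node* n    = new Node;
    n->payload = value;
    n->next    = list_head.load(std::memory_order_relaxed);
    list_head.store(n, std::memory_order_release);
}

// Reader: no RMW, no lock; an acquire load, then ordinary pointer
// chasing. A reader sees either the old list or the fully built node.
int sum_readers_view()
{
    int sum = 0;
    for (Node* n = list_head.load(std::memory_order_acquire); n; n = n->next)
        sum += n->payload;
    return sum;
}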


> Well, DEC Alpha aside for a moment. Its very fast. Its an example of an
> asymmetric algorihtm. The readers are VERY fast, and the writers need to
> sync up. Writers can use mutexes, lock-free algorithms, LL/SC, CAS,
> whatever. Readers are free as a bird.
>
> Think of a special mutex that uses two threads. One of the threads is a
> hot spot, that runs very frequently. The other runs once and a while. We
> want the hot spot to be as free as possible.

Absent collision the cost of the transaction is the same as the
non-atomic code; at most an addressed line may get pushed from the L1 to
the L2. Collision induces backoff of course. So any algorithm that has
free readers with mutexed (etc) writes has free readers in ours too; the
transaction can be just the writes. But you can get a true RMW
transaction at the same cost, which eliminates the "C" in RCU.

But we freely admit that we don't know how well the users will take to
actually using the primitives, as opposed to just treating them as an
implementation of mutexes. IBM has a similar facility, and people seem
happy with it, but Intel's effort in that line was withdrawn.

By the time our micro-kernel is up we should know more :-)

Chris M. Thomasson

Jun 23, 2020, 12:24:01 AM
On 6/22/2020 7:45 PM, Ivan Godard wrote:
> <moved from "[OT] Anton Ertl has been proven right" to get away from the
> UB wrangling
>
> On 6/22/2020 7:09 PM, Chris M. Thomasson wrote:
> > On 6/22/2020 6:26 PM, Ivan Godard wrote:
> >> On 6/22/2020 5:35 PM, already...@yahoo.com wrote:
> >>> On Tuesday, June 23, 2020 at 2:28:24 AM UTC+3, Chris M. Thomasson
> wrote:
>
>
> >>> Why would anybody do it?
> >>> Unaligned accesses are good for many things, but atomicity is
> >>> certainly not one of them.
> >>>
> >>
> >> That's one of the reasons why Mill doesn't have hardware locking
> >> primitives but uses transactional semantics instead.
> >
> > Have any problems with live lock? ;^)
>
> That's a real good question, and the answer is we really don't know yet
> but don't believe so. The reason we don't think so is that the hardware
> primitives automatically do a backoff, so lockout is exponentially
> unlikely as the duration increases, as in well known network protocols.

Clever backoff schemes can help wrt a transactional setup. I have even
seen some elaborate so-called "contention manager" algorithms in various
STMs over the years. Although, I am wondering if there is an upper
bound on how many failures there can be in an LL/SC on the Mill? Let's
call a failure a spin... Can there theoretically be infinite spins?

One possible case of headaches: have you seen a scenario where an LL/SC
was operating on a reservation granule that was "accidentally" falsely
shared with another piece of unrelated data that another thread works
on? If this other thread stores into this granule, it can cause the
unrelated LL/SC to fail at will.

False sharing on the reservation granule is bad mojo. Humm, well, what is
the size of a granule on the Mill? Is it really small? Perhaps even a
single byte?

Then there are just very high contention scenarios, no false sharing,
that can still cause a lot of failures and spins. The system is under
heavy load, oh shi%.


> We realize that a backoff algorithm only gives statistical guarantees,
> as opposed to a round robin which can give a fixed guarantee. This may
> rule out use in certain hard realtime codes that must have the fixed
> guarantee. It's a market choice to abandon such codes, but then any OOO
> or machine with a cache has also made the same market choice.

For sure. Indeed. The problems tend to show up when the system is under
periods of heavy sustained load. Hot spots emerge...



> >Btw, I am not necessarily talking
> > about atomic ops wrt a RMW. I am taking about the ability to execute a
> > remote memory barrier in user space. I assume the Mill is 100%
> > sequentially consistent, right? On x86, there is a way to trigger a bus
> > lock on purpose.
> >
> > Think of something like RCU. The readers do not need to execute any
> > memory barriers or atomic RMW's at all, no transactions, nothing...
>
> The problem with RCU is that the final U must be itself atomic. If you
> arm-bend the data structure enough you can usually get that down to a
> pointer store, which is hardware-atomic. But frequently you really don't
> want to bend the structure: insert into a doubly linked list is just not
> naturally RCU, but it's fine with transactional semantics.

RCU is really nice because it allows for reader threads to not use any
explicit sync, DEC Alpha aside. Think of 10 reader threads constantly
iterating through a dynamic data structure, perhaps performing database
lookups. Then think of three or four writer threads periodically adding
and removing items from this structure. The read activity is much more
frequent than writes. We also want to allow for the writers to
de/reallocate removed items, and still allow the readers to go "full
steam ahead" without hitting any corruption, and keep things coherent.

The writers can use mutexes, or anything they want. Okay, so a writer
removes an item. Can it deallocate it right now? Well, not if any
readers are accessing it! So, it defers its destruction until a
quiescent period is hit. Then it knows that no readers can possibly have
a reference.

It removes an element.

Defers its destruction.

When a quiescent period is detected, the element can be freed or
realloc'd, whatever.

It can be referred to as "epoch" based memory reclamation. Some systems
go through multiple epochs before freeing elements. RCU can be thought
of as a deferred memory reclamation scheme.
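
A deliberately simplified sketch of the epoch idea in C++ (a toy
formulation, not Linux RCU): readers announce the global epoch at their
quiescent points, and the writer frees a retired node once every reader
has announced an epoch newer than the one the node was retired under.
Multi-writer issues and some ordering subtleties are glossed over.

#include <algorithm>
#include <atomic>
#include <vector>

constexpr int kReaders = 4;

std::atomic<unsigned> global_epoch{0};
std::atomic<unsigned> reader_epoch[kReaders] = {};

// Readers call this between traversals (a quiescent point): a promise
// that they hold no references from before this call.
void reader_quiescent(int id)
{
    reader_epoch[id].store(global_epoch.load(std::memory_order_acquire),
                           std::memory_order_release);
}

struct Retired { void* p; unsigned epoch; };
std::vector<Retired> limbo;   // writer-private

// Writer: after unlinking a node, park it in limbo under the current
// epoch, then advance the epoch.
void retire(void* p)
{
    limbo.push_back({p, global_epoch.fetch_add(1)});
}

// Writer: free anything retired strictly before the oldest epoch any
// reader has announced; no reader can still hold a reference to it.
void try_reclaim()
{
    unsigned oldest = ~0u;
    for (auto& r : reader_epoch)
        oldest = std::min(oldest, r.load(std::memory_order_acquire));
    auto it = limbo.begin();
    while (it != limbo.end())
    {
        if (it->epoch < oldest) { ::operator delete(it->p); it = limbo.erase(it); }
        else ++it;
    }
}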

Also, in a lot of RCU schemes, readers do not generally care if they
load out-of-date elements. The reads can go 100% full steam ahead, and
not sync with the writers at all.

It's radically asymmetric. Lots of reads, moderate writes.


> > Well, DEC Alpha aside for a moment. Its very fast. Its an example of an
> > asymmetric algorihtm. The readers are VERY fast, and the writers need to
> > sync up. Writers can use mutexes, lock-free algorithms, LL/SC, CAS,
> > whatever. Readers are free as a bird.
> >
> > Think of a special mutex that uses two threads. One of the threads is a
> > hot spot, that runs very frequently. The other runs once and a while. We
> > want the hot spot to be as free as possible.
>
> Absent collision the cost of the transaction is the same as the
> non-atomic code; at most an addressed line may get pushed from the L1 to
> the L2. Collision induces backoff of course. So any algorithm that has
> free readers with mutexed (etc) writes has free readers in ours too; the
> transaction can be just the writes. But you can get a true RMW
> transaction at the same cost, which eliminates the "C" in RCU.
>
> But we freely admit that we don't know how well the users will take to
> actually using the primitives, as opposed to just treating them as an
> implementation of mutexes. IBM has a similar facility, and people seem
> happy with it, but Intel's effort in that line was withdrawn.
>
> By the time our micro-kernel is up we should know more :-)

Have you played around with asymmetric synchronization algorithms? An
example can be as simple as the Dekker algorithm. No atomic RMW needed,
no LL/SC. Just plain atomic loads and stores with the right memory barriers.

The idea wrt asymmetry in Dekker is that there are two threads: a "hot"
thread that takes the lock very frequently, and another that only takes
it from time to time. Well, wouldn't it be nice to get rid of that damn
nasty memory barrier on the hot thread? It can be done, and can really
improve performance. Big time.
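
For reference, the symmetric (unoptimized) Dekker lock in C++
std::atomic; the default seq_cst operations supply the store-to-load
ordering the algorithm depends on, playing the role of the membar in
question. The asymmetric trick (stripping the barrier from the hot side
and giving the cold side a heavier path, e.g. an OS-assisted barrier)
is not shown.

#include <atomic>

// Classic two-thread Dekker: plain atomic loads and stores only,
// no RMW, no LL/SC. self is 0 or 1.
std::atomic<bool> wants[2] = {{false}, {false}};
std::atomic<int>  turn{0};

void dekker_lock(int self)
{
    int other = 1 - self;
    wants[self].store(true);            // announce intent
    while (wants[other].load())         // contended?
    {
        if (turn.load() != self)
        {
            wants[self].store(false);   // back out and wait our turn
            while (turn.load() != self) { /* spin */ }
            wants[self].store(true);    // try again
        }
    }
}

void dekker_unlock(int self)
{
    turn.store(1 - self);               // hand priority to the other
    wants[self].store(false);
}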

Asymmetric sync is definitely not for everything, no silver bullet
indeed. It's just pretty darn useful when one has an imbalance in access
patterns. Say, a lot more reads than writes. Read-mostly data structure
access patterns.

Chris M. Thomasson

Jun 23, 2020, 12:31:55 AM
On 6/22/2020 7:45 PM, Ivan Godard wrote:
[...]

The Update is accomplished by the writer threads. This can be as simple
as locking a mutex, removing an item, and unlocking the mutex. The main
magic is when this removed item can be safely reclaimed, freed, or
reused, whatever. In a strange sense, a form of GC.

Btw, did you ever hear about this:

https://www.cnet.com/news/sco-suit-now-seeks-3-billion-from-ibm/

The RCU wars! IBM vs SCO...

Ivan Godard

Jun 23, 2020, 2:54:57 AM
There can be infinite spins, but the decay rate is such that rather
rapidly you will be hitting the domain of Schrödinger.

> One possible case of headaches, have you seen a scenario where a LL/SC
> was operating on a reservation granule that was "accidentally" falsely
> shared with another piece of unrelated data that another thread works
> on? If this other thread stores into this granule, it can cause the the
> unrelated LL/SC to fail at will.

The Mill does not have false sharing of any kind, in or out of transactions.

> false sharing on reservation granule, is bad mojo. Humm, well, what is
> the size of a granule on the Mill? Is it really small? Perhaps even a
> single byte?

It's a byte.

> Then there is just very high contention scenarios, no false sharing,
> that can still cause a lot of failures, and spins. The system is under
> heavy load, oh shi%.
>

When the decay rate becomes large enough to be of the order of the
number of requestors then the hardware in effect degenerates into a
(stochastic) round robin. That is, each requester will have backed off
enough to ensure that no more than one is requesting in any transaction
interval. The numbers suggest that whole-system performance will drop
smoothly with increasing contention, and then plateau at a rate no worse
than 2X the execution time of the transaction itself. This prevents
livelock, except briefly due to random chance collision.

The drawback of the approach is that in heavy contention (and lots of
spins) the requestors spend a lot of time idle in backoff.

>> We realize that a backoff algorithm only gives statistical guarantees,
>> as opposed to a round robin which can give a fixed guarantee. This may
>> rule out use in certain hard realtime codes that must have the fixed
>> guarantee. It's a market choice to abandon such codes, but then any
>> OOO or machine with a cache has also made the same market choice.
>
> For sure. Indeed. The problems tend to show up when the system is under
> periods of heavy sustained load. Hot spots emerge...
>

We're not greatly concerned with artificial stressors that do nothing
but endlessly update a location; we expect that the bottleneck will be
in coherence, not in the transaction. Note that Mill has sequential
consistency, not total ordering.

>
>>  >Btw, I am not necessarily talking
>>  > about atomic ops wrt a RMW. I am taking about the ability to execute a
>>  > remote memory barrier in user space. I assume the Mill is 100%
>>  > sequentially consistent, right? On x86, there is a way to trigger a
>> bus
>>  > lock on purpose.
>>  >
>>  > Think of something like RCU. The readers do not need to execute any
>>  > memory barriers or atomic RMW's at all, no transactions, nothing...
>>
>> The problem with RCU is that the final U must be itself atomic. If you
>> arm-bend the data structure enough you can usually get that down to a
>> pointer store, which is hardware-atomic. But frequently you really
>> don't want to bend the structure: insert into a doubly linked list is
>> just not naturally RCU, but it's fine with transactional semantics.
>
> RCU is really nice because it allows for reader threads to not use any
> explicit sync, DEC alpha aside. Think of 10 reader threads constantly
> iterating through a dynamic data structure, perhaps performing database
> lookups. Then think of three or four writer threads periodically adding
> and removing items from this structure. The read activity is much more
> frequent than writes. We also want to allow for the writers to
> de/reallocate removed items, and still allow the readers to go "full
> steam ahead" without hitting any corruption, and keeps things coherent.

Yes, I'm aware of RCU.

> The writers can use mutexes, or anything they want. Okay, so a writer
> removes an item. Can it deallocate it right now? Well, not if any
> readers are accessing it! So, it defers its destruction until a
> quiescent period is hit. Then it knows that no readers can possibly have
> a reference.
>
> It removes an element.
>
> Defers its destruction.
>
> When a quiescent period is detected, the element can be freed or
> realloc'd, whatever.
>
> It can be referred to as "epoch" based memory reclamation. Some systems
> go through multiple epochs before freeing elements. RCU can be thought
> of as a deferred memory reclamation scheme.
>
> Also, in a lot of RCU schemes, readers do not generally care if they
> load out of date elements. The reads can go 100% full steam ahead, and
> not sync with the writers at all.
>
> Its radically asymmetric. Lots of reads, moderate writes.
>

Yes, it works well when the elements can be detached without moving. It
does not work well when that is not possible; you wind up having to copy
the whole structure and that may be too expensive to bear.

RCU is useful, but not a panacea.

Mill is not OOO; it has no memory barriers, and no need of them. It is
strictly sequentially consistent (and explicitly not TSO).

> The idea wrt asymmetry in Dekker is that there are two threads: A "hot"
> thread that takes the lock a lot of times, very frequently, and another
> that only takes it from time to time, not so frequent. Well, wouldn't it
> be nice to get rid of that damn nasty memory barrier on the hot thread?
> It can be done, and can really improve performance. Big time.
>
> asymmetric sync is definitely not for everything, no silver bullet
> indeed. It just pretty darn useful when one has an imbalance in access
> patterns. Say, a lot more reads than writes. Read mostly data structure
> access patterns.

No membar needed; both threads do their transaction at whatever rate
they choose. If the hot thread is so hot that the cold thread collides
then it will get backed off to make room.

Ivan Godard

Jun 23, 2020, 2:58:10 AM
No mutex needed: enterAtomic; store(s); commit. Repeat if collided.
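
Taking those names at face value (sketch names, not a published API;
trivially stubbable as elsewhere in this thread), the doubly-linked
insert called out earlier as "not naturally RCU" comes out as:

struct DNode { DNode* prev; DNode* next; };

void insert_after(DNode* pos, DNode* n)
{
    for (;;)
    {
        enterAtomic();
        DNode* succ = pos->next;   // all four pointer stores become
        n->prev     = pos;         //   visible together, or not at all
        n->next     = succ;
        pos->next   = n;
        succ->prev  = n;
        if (commit())              // repeat if collided
            return;
    }
}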

> Btw, did you ever hear about this:
>
> https://www.cnet.com/news/sco-suit-now-seeks-3-billion-from-ibm/

No. I don't pay much attention to what other companies are doing; the
signal-to-noise ratio is too low.

Chris M. Thomasson

Jun 23, 2020, 5:00:39 PM
On 6/22/2020 11:58 PM, Ivan Godard wrote:
> On 6/22/2020 9:31 PM, Chris M. Thomasson wrote:
>> On 6/22/2020 7:45 PM, Ivan Godard wrote:
>>> <moved from "[OT] Anton Ertl has been proven right" to get away from
>>> the UB wrangling
[...]
>>> The problem with RCU is that the final U must be itself atomic.
>> [...]
>>
>> The Update is accomplished by the writer threads. This can be as
>> simple a locking a mutex, removing an item, and unlocking the mutex.
>> The main magic is when this removed item can be safely reclaimed,
>> freed, or reused, whatever. Ina strange sense, a form of GC.
>
> No mutex needed: enterAtomic; store(s); commit. Repeat if collided.

No problem! The writer threads can basically use whatever they want to
remove and/or add an item to the data-structure. The readers just don't
care. They can run full steam ahead. A lot of times they don't care if
they are reading "stale" data.

There are multiple ways to remove a node from a linked data structure.
A mutex is easy. XCHG and CAS can be used directly. Fwiw, the easy way to
use XCHG is to flush all of the items in a lock-free stack at once,
using a simple atomic RMW. No need for loops.

There are more exotic methods, to even use XCHG for adding an item.

I have an older scheme that is a hybrid work queue/stack thing. Need to
find it. I can remember around 80'ish% of it off the top of my head.
Iirc, it uses XCHG for push and pop.



>
>> Btw, did you ever hear about this:
>>
>> https://www.cnet.com/news/sco-suit-now-seeks-3-billion-from-ibm/
>
> No. I don't pay much attention to what other companies are doing;  the
> s2n is too low.

I always found this interesting... Did IBM rip off RCU? ;^)

MitchAlsup

Jun 23, 2020, 5:52:07 PM
On Tuesday, June 23, 2020 at 4:00:39 PM UTC-5, Chris M. Thomasson wrote:
> On 6/22/2020 11:58 PM, Ivan Godard wrote:
> > On 6/22/2020 9:31 PM, Chris M. Thomasson wrote:
> >> On 6/22/2020 7:45 PM, Ivan Godard wrote:
> >>> <moved from "[OT] Anton Ertl has been proven right" to get away from
> >>> the UB wrangling
> [...]
> >>> The problem with RCU is that the final U must be itself atomic.
> >> [...]
> >>
> >> The Update is accomplished by the writer threads. This can be as
> >> simple a locking a mutex, removing an item, and unlocking the mutex.
> >> The main magic is when this removed item can be safely reclaimed,
> >> freed, or reused, whatever. Ina strange sense, a form of GC.
> >
> > No mutex needed: enterAtomic; store(s); commit. Repeat if collided.
>
> No problem! The writer threads can basically use whatever they want to
> remove and/or add an item to the data-structure. The readers just don't
> care. They can run full steam ahead. A lot of times they don't care if
> they are reading "stale" data.

In the real world, less than great data early is often better than perfect
data late.
>
> There are multiple ways to remove a node from a linked data structure.
> Mutex is easy. XCHG and CAS can be used directly. Fwiw, the easy way to
> use XCHG is to flush all of the items in a lock free stack at once,
> using a simple atomic RMW. No need for loops.

My 66000 ISA has what appears to be a pipelined version of LDL STC where
up to 8 cache lines can participate in a single ATOMIC event with <soft>
guarantees of forward progress!

And programmed means to avoid 'future' interference on some kinds of
data structures.

Here one can move an item from a doubly linked list in one part
of a concurrent data structure to a doubly linked list in another
part of the CDS in a single ATOMIC event.

Also note the worst case of typical ATOMIC events (where a
timer goes off and every CPU tries to take stuff off of the run queues)
is BigO( n**3 ) {or worse}, where My 66000 scheme ends up BigO( 3 ) {no 'n'}
when properly programmed. We talked about this several years ago.

Chris M. Thomasson

Jun 24, 2020, 6:27:19 PM
Nice!

>
>> Then there is just very high contention scenarios, no false sharing,
>> that can still cause a lot of failures, and spins. The system is under
>> heavy load, oh shi%.
>>
>
> When the decay rate becomes large enough to be of the order of the
> number of requestors then the hardware in effect degenerates into a
> (stochastic) round robin. That is, each requester will have backed off
> enough to ensure that no more than one is requesting in any transaction
> interval. The numbers suggest that whole-system performance will drop
> smoothly with increasing contention, and then plateau at a rate no worse
> than 2X the execution time of the transaction itself. This prevents
> livelock, except briefly due to random chance collision.

It kind of reminds me of a lock convoy. Instead of having to back off,
pessimistic schemes can just wait for their turn in the lock queue,
assuming a FIFO ordering. Then there are hybrid schemes that can try to
be optimistic, then fall back to locks when things get hot, as sketched below.
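
A sketch of that hybrid shape in C++, with std::mutex as the
pessimistic fallback and the same stand-in transaction primitives
sketched earlier in the thread. (Real lock elision also makes the
transaction subscribe to the lock word so it aborts if someone actually
holds the lock; that detail is omitted.)

#include <mutex>

std::mutex fallback_lock;

void hybrid_update()
{
    // Optimistic: a few transactional attempts first...
    for (int tries = 0; tries < 8; ++tries)
    {
        enterAtomic();
        // ... speculative stores ...
        if (commit())
            return;   // won without ever touching the lock
    }
    // ...then fall back to the lock when things get hot, so progress
    // is guaranteed even under sustained contention.
    std::lock_guard<std::mutex> guard(fallback_lock);
    // ... same update, now serialized ...
}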

Pessimistic or opportunistic, they both meet their maker, so to speak,
when exposed to very high contention scenarios. Especially when the user
algorithm is not distributed. Think of a single location getting
hammered by a shi% load of threads.

Although, using a single atomic XCHG to modify a shared location still
beats a mutex protecting it in these high contention situations. The
mutex falls apart; might as well bail the test because it takes too long.


> The drawback of the approach is that in heavy contention (and lots of
> spins) the requestors spend a lot of time idle in backoff.

Right. I was always wondering about a scheme where, instead of backing
off, the thread tries to do some other work. Think of the following thing
wrt a mutex with a try_lock:
______________________________
while (! try_lock(mutex))
{
    if (! try_to_do_something_else())
    {
        lock(mutex);
        break;
    }
}

// critical section

unlock(mutex);
______________________________


If it spins, it means the thread did something else. So, it was not
wasted in a sense.


>>> We realize that a backoff algorithm only gives statistical
>>> guarantees, as opposed to a round robin which can give a fixed
>>> guarantee. This may rule out use in certain hard realtime codes that
>>> must have the fixed guarantee. It's a market choice to abandon such
>>> codes, but then any OOO or machine with a cache has also made the
>>> same market choice.
>>
>> For sure. Indeed. The problems tend to show up when the system is
>> under periods of heavy sustained load. Hot spots emerge...
>>
>
> We're not greatly concerned with artificial stressors that do nothing
> but endlessly update a location; we expect that the bottleneck will be
> in coherence, not in the transaction. Note that Mill has sequential
> consistency, not total ordering.

It can be interesting to create programs that are designed to stress the
heck out of another program. Sometimes, it can show hot spots...

RCU works well with dynamic linked data structures. If you have a massive
single array, well, that's not ideal for RCU.


>>>  > Well, DEC Alpha aside for a moment. Its very fast. Its an example
>>> of an
>>>  > asymmetric algorihtm. The readers are VERY fast, and the writers
>>> need to
>>>  > sync up. Writers can use mutexes, lock-free algorithms, LL/SC, CAS,
>>>  > whatever. Readers are free as a bird.
>>>  >
>>>  > Think of a special mutex that uses two threads. One of the threads
>>> is a
>>>  > hot spot, that runs very frequently. The other runs once and a
>>> while. We
>>>  > want the hot spot to be as free as possible.
>>>
>>> Absent collision the cost of the transaction is the same as the
>>> non-atomic code; at most an addressed line may get pushed from the L1
>>> to the L2. Collision induces backoff of course. So any algorithm that
>>> has free readers with mutexed (etc) writes has free readers in ours
>>> too; the transaction can be just the writes. But you can get a true
>>> RMW transaction at the same cost, which eliminates the "C" in RCU.

The readers in RCU should not care if the writers are using
transactional memory or not. The reads are not part of any transaction;
they are free.



>>>
>>> But we freely admit that we don't know how well the users will take
>>> to actually using the primitives, as opposed to just treating them as
>>> an implementation of mutexes. IBM has a similar facility, and people
>>> seem happy with it, but Intel's effort in that line was withdrawn.
>>>
>>> By the time our micro-kernel is up we should know more :-)
>>
>> Have you played around with asymmetric synchronization algorithms? An
>> example can be as simple as the Dekker algorihtm. No atomic RMW
>> needed, no LL/SC. Just plain atomic loads and stores with the right
>> memory barriers.
>
> Mill is not OOO; it has no memory barriers, and no need of them. It is
> strictly sequential consistent (and explicitly not TSO).

That makes things easier. So, a Dekker algorithm on a Mill just uses
plain atomic loads and stores, no need for any explicit memory
barriers... Right? That makes it easier to code for sure.

Can it compete with the read side of an RCU-protected algorithm that does
not need to use anything on a system that has very weak memory ordering?
Say, SPARC in RMO mode? DEC Alpha aside for a moment?


>> The idea wrt asymmetry in Dekker is that there are two threads: A
>> "hot" thread that takes the lock a lot of times, very frequently, and
>> another that only takes it from time to time, not so frequent. Well,
>> wouldn't it be nice to get rid of that damn nasty memory barrier on
>> the hot thread? It can be done, and can really improve performance.
>> Big time.
>>
>> asymmetric sync is definitely not for everything, no silver bullet
>> indeed. It just pretty darn useful when one has an imbalance in access
>> patterns. Say, a lot more reads than writes. Read mostly data
>> structure access patterns.
>
> No membar needed; both threads do their transaction at whatever rate
> they choose. If the hot thread is so hot that the cold thread collides
> then it will get backed off to make room.

Well, wrt RCU, the reads never need to be part of any transaction.

Chris M. Thomasson

Jun 24, 2020, 6:38:52 PM
On 6/22/2020 11:54 PM, Ivan Godard wrote:
> On 6/22/2020 9:23 PM, Chris M. Thomasson wrote:
>> On 6/22/2020 7:45 PM, Ivan Godard wrote:
>>> <moved from "[OT] Anton Ertl has been proven right" to get away from
>>> the UB wrangling
[...]
>>> Absent collision the cost of the transaction is the same as the
>>> non-atomic code; at most an addressed line may get pushed from the L1
>>> to the L2. Collision induces backoff of course.

[...]

>>> So any algorithm that
>>> has free readers with mutexed (etc) writes has free readers in ours
>>> too; the transaction can be just the writes.

Indeed. In a lot of scenarios, the reads do not care about the writes.
Totally asymmetric.


>>> But you can get a true
>>> RMW transaction at the same cost, which eliminates the "C" in RCU.
[...]

Humm... How? Think of a scenario in which the readers are free, free
enough to never have to retry anything. If the reads are part of a
transaction, well, that seems to imply that failures can be encountered
by a read?

Also, another aspect of RCU is the ability to allow one to free some
memory without fear of a reader accessing it in any way, shape, or form.

Ivan Godard

Jun 24, 2020, 7:06:00 PM
On 6/24/2020 3:27 PM, Chris M. Thomasson wrote:
> On 6/22/2020 11:54 PM, Ivan Godard wrote:
>> On 6/22/2020 9:23 PM, Chris M. Thomasson wrote:

<snip>

>> The drawback of the approach is that in heavy contention (and lots of
>> spins) the requestors spend a lot of time idle in backoff.
>
> Right. I was always wondering about a scheme where instead of backing
> off, perhaps try to do some other work. Think of the following thing wrt
> a mutex with a try_lock:
> ______________________________
> while (! try_lock(mutex))
> {
>     if (! try_to_do_something_else())
>     {
>         lock(mutex);
>         break;
>     }
> }
>
>    // critical section
>
> unlock(mutex);
> ______________________________
>
>
> If it spins, it means the thread did something else. So, it was not
> wasted in a sense.

That's actually what Mill hardware does. The backoff is not at the
(failed) commit, it's at the next retry. If you do other work that lasts
longer than the backoff time then the retry starts with no stall.

<snip>

>> Mill is not OOO; it has no memory barriers, and no need of them. It is
>> strictly sequential consistent (and explicitly not TSO).
>
> That makes things easier. So, a Dekker algorithm on a Mill just uses
> plain atomic loads and stores, no need for any explicit memory
> barriers... Right? That makes things easier to code for sure.
>
> Can it compete with the read side of a RCU protected algorihtm that does
> not need to use anything on a system that has very weak memory ordering?
> Say, SPARC in RMO mode? DEC Alpha aside for a moment?
>

Sorry, no idea, and no measurements yet. Frankly, the whole memory
system is so different that I think it would be impossible to tease out
that effect from all the other things that are different, like having no
need for owned state, nor to read lines before writing into them.


Ivan Godard

Jun 24, 2020, 7:21:46 PM
That implies a structure where each object is separately allocated and
only the link to reach that object is updated (which is of course the
usual RCU structure). But that also means that updating the object
itself (as opposed to updating the link) requires building a new object,
usually by copying the old, because the structure needs updates in
several places and they all have to form a transaction. In other words,
RCU is a device to convert a many-object transaction to a single-object
(the link) transaction, where the single-object update is atomic in the
hardware.

Mill, and IBM, can do the many-object transaction in place. Of course, a
reader that reads part of a many, which then gets updated, and then
reads the rest will see inconsistent data. But that's because it hasn't
treated the collection of reads like the transaction it is. If RCU is
used, the key (link) update is atomic, and the object is immutable so it
gets transaction semantics for the read by implication. Not so with
update in place.

Hence it is meaningful to use Mill atomic primitives on the read side as
well as the write side. You may get a collision, which will never happen
in RCU, but you see the data in place so there's no allocation or copy
overhead. And read transactions don't collide with each other, so
collision can only happen when a writer is concurrent with a reader.

As for costs, it's a matter of balancing the cost of infrequent retries
vs. the cost of copy on every update. That will obviously be very
algorithm and application dependent and will require measurement. Which
we don't have yet :-)

Chris M. Thomasson

Jun 24, 2020, 7:49:12 PM
Right. RCU loves linked lists. Readers love to iterate nodes... ;^)

Each object might contain a smallish array, like a deque or something.

Its used in various places throughout the Linux Kernel.


> But that also means that updating the object
> itself (as opposed to updating the link) requires building a new object,
> usually by copying the old, because the structure needs updates in
> several places and they all have to form a transaction.

Inserting/Removing nodes is easy, updating an existing node involves
creating a copy of parts or all of it, in a lot of cases. Usually, a
common theme is that readers either see the old copies, or the new one
during iteration of the data structure. No intermediate intermixed
state. Readers do not generally care if they are reading an old node
that was already removed. Although I have seen some special schemes that
try to mitigate this by simply setting a flag in the node. If the reader
decides to read the flag, it can tell if the node has been removed.


> In other words,
> RCU is a device to convert a many-object transaction to a single-object
> (the link) transaction, where the single is hardware atomic on the
> hardware.

I always thought of RCU as more of a "deferred memory reclamation"
mechanism. It allows a writer to defer the freeing of an object until
it's sure that no readers are accessing it. It uses quiescent states to
determine this. This is a key aspect.


> Mill, and IBM, can do the many-object transaction in place. Of course, a
> reader that reads part of a many, which then gets updated, and then
> reads the rest will see inconsistent data. But that's because it hasn't
> treated the collection of reads like the transaction it is. If RCU is
> used, the key (link) update is atomic, and the object is immutable so it
> gets transaction semantics for the read by implication. Not so with
> update in place.

Ideally, the readers in RCU should be 100% decoupled from writers.


> Hence it is meaningful to use Mill atomic primitives on the read side as
> well as the write side. You may get a collision, which will never happen
> in RCU, but you see the data in place so there's no allocation or copy
> overhead. And read transactions don't collide with each other, so
> collision can only happen when a writer is concurrent with a reader.

In normal RCU setups, a collision can never happen when a writer is
concurrent with a reader. For a lot of scenarios, readers act as if
there are no writers around at all. The readers just iterate the
data-structure. If they read some stale nodes, so be it.


> As for costs, it's a matter of balancing the cost of infrequent retries
> vs. the cost of copy on every update. That will obviously be very
> algorithm and application dependent and will require measurement. Which
> we don't have yet :-)

Humm... Still, try to think in terms of RCU as a deferred memory
reclamation system.

Chris M. Thomasson

Jun 24, 2020, 7:51:59 PM
On 6/24/2020 4:21 PM, Ivan Godard wrote:
> On 6/24/2020 3:38 PM, Chris M. Thomasson wrote:
>> On 6/22/2020 11:54 PM, Ivan Godard wrote:
>>> On 6/22/2020 9:23 PM, Chris M. Thomasson wrote:
>>>> On 6/22/2020 7:45 PM, Ivan Godard wrote:
>>>>> <moved from "[OT] Anton Ertl has been proven right" to get away
>>>>> from the UB wrangling
[...]
> As for costs, it's a matter of balancing the cost of infrequent retries
> vs. the cost of copy on every update. That will obviously be very
> algorithm and application dependent and will require measurement. Which
> we don't have yet :-)

Check this out:

https://youtu.be/qcD2Zj9GgI4

Its basically exactly what we are discussing here.

Chris M. Thomasson

Jun 24, 2020, 7:57:34 PM
On 6/22/2020 7:45 PM, Ivan Godard wrote:
> <moved from "[OT] Anton Ertl has been proven right" to get away from the
> UB wrangling
>
> On 6/22/2020 7:09 PM, Chris M. Thomasson wrote:
> > On 6/22/2020 6:26 PM, Ivan Godard wrote:
> >> On 6/22/2020 5:35 PM, already...@yahoo.com wrote:
> >>> On Tuesday, June 23, 2020 at 2:28:24 AM UTC+3, Chris M. Thomasson
> wrote:
[...]

Try to complete the Issaquah Challenge on the Mill:

https://youtu.be/Qxduz-CRnDw

RCU can help out quite a bit here... ;^)

MitchAlsup

Jun 24, 2020, 8:05:23 PM
The key part of RCU (it seems to me) is having all intermediate states
point at a single value such that when the value has one pattern the original
ordering of a multiplicity of nodes is visible, and when it has another
pattern the new ordering of a multiplicity of nodes is visible.

Ivan Godard

Jun 24, 2020, 8:16:50 PM
Not relevant: the Challenge explicitly disallows use of transactions,
for which the Challenge is trivial. As on Mill hardware.

Chris M. Thomasson

Jun 24, 2020, 8:42:04 PM
Is it impossible to implement Dekker's algorithm on a Mill without using
transactions?

Chris M. Thomasson

Jun 24, 2020, 8:44:34 PM
So the Challenge is trivial to implement on the Mill, and still get all
of the requirements? I must be misunderstanding you here. Need to think.

Ivan Godard

Jun 24, 2020, 9:33:55 PM
Oh I suppose any non-atomic algorithm for anything would work on a Mill.
But figuring out such an algorithm, or answering the Challenge, requires
Dijkstra's (or Terje's) skills - that's why they are puzzles. Such skills
are in short supply, so why would you bother?

Chris M. Thomasson

Jun 25, 2020, 1:48:13 AM
Why do I have the very strong feeling that Terje knows all about this,
and possibly more? ;^) He is a very smart individual indeed. Big time.
It's a privilege to be able to converse with him, Mitch, yourself, and
others on this group. I remember Nick. Thanks Ivan.

Fwiw, let's see if I can create some pseudo-code on the fly right here of
a _highly_ experimental atomic "mostly" lock-free stack I did a while
back... Pardon any typos... So, this stack should be "different" in that
it uses a loopless atomic exchange operation, well on the x86 anyway,
for the push and pop, or flush to be more accurate. LOCK XCHG - and yes,
LOCK is implied for XCHG with a memory operand on x86. Atomic
exchange... Here we go!

Let's create a special address that comes in handy later...

Membars aside for a moment:

// hackish!
#define SPECIAL ((node*)0xDEADBEEF)


Let's create a node:

struct node
{
    node* next;
    // user defined payload
};


Okay. Let's create a stack:

struct stack
{
    node* head; // initialize with a nullptr
};


Fine. Let's create a function that adds a new node to this stack:

void stack_push(stack* self, node* n)
{
    n->next = SPECIAL;                      // mark the link as not-yet-ready
    node* previous = XCHG(&self->head, n);  // atomically swap in the new head
    n->next = previous;                     // now publish the real link
}


Okay! That's it for push. Nice and simple, basically wait-free, and gets
LIFO order: okay for a stack. Now, let's allow one to remove all items
from this stack at once:


node* stack_flush(stack* self)
{
    // one atomic swap detaches the entire LIFO chain
    return XCHG(&self->head, nullptr);
}


Perfect! A thread can grab all of the nodes at once. Pretty fast, and
wait-free. Now... What can we do with the list of nodes returned from
stack_flush, assuming it's not nullptr? Well... We can process it! Let's
define something called user_process_node. It's a function that takes a
pointer to a node as input, and does whatever a user wants with its user-
defined payload. I do not even need to define it, just declare its
interface. So, user_process_node(node*) is it.

Now, let's create a "driver" that eventually calls into
user_process_node; call it node_process. It takes a list of nodes and
executes each one of them... So, something like...

Humm, well, before I show the driver, perhaps I should show a very
important function first; call it node_next_wait. Perhaps the most
important. It's going to turn people off for a moment, but it can be sort
of "amortized" across the complexity of user processing in a sense.

node* node_next_wait(node* n)
{
    node* next = nullptr;

    // spin until the pushing thread has linked in the real next
    // pointer (it is SPECIAL only for a brief window)
    while ((next = n->next) == SPECIAL) backoff();

    return next;
}

Well, there ya go. This is blocking after all; however, it's still
interesting to me, and it returns the next pointer. Okay, let's define the
driver function node_process:


void node_process(node* head)
{
    node* n = head;

    while (n)
    {
        // user processing, _before_ we wait...
        // should soak up some time...
        user_process_node(n);

        // wait for the next work
        n = node_next_wait(n);
    }
}



Okay, this can process a list of nodes using wait-free atomics for the
stack, and use some blocking, spinning on nodes; however, the user
processing should act as a sort of backoff anyway, minimizing actual
spins doing nothing.


Think of something like:

node* work = stack_flush(work_stack);

node_process(work);


Well, if work is not nullptr, then actual user work is going to be
performed.

Well, what do ya think Ivan? Crap, or kind of interesting? ;^)

Perhaps a little crappy, kind of decent in certain areas? lol. :^)

Terje Mathisen

Jun 25, 2020, 1:59:26 AM
Chris M. Thomasson wrote:
> On 6/24/2020 4:21 PM, Ivan Godard wrote:
>> That implies a structure where each object is separately allocated and
>> only the link to reach that object is updated (which is of course the
>> usual RCU structure).
>
> Right. RCU loves linked lists. Readers love to iterate nodes... ;^)

A single-linked list is trivially updated without the need for RCU, you
just need CAS (or DCAS if you can't otherwise guarantee avoidance of the
A-B-A problem when memory is quickly reused.)
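
A minimal C++ sketch of that point: inserting after a known node takes
one CAS, and plain CAS is ABA-safe here exactly when nodes cannot be
freed and reused out from under a concurrent traversal, i.e. under
RCU-style deferred reclamation.

#include <atomic>

struct LNode
{
    int                 value;
    std::atomic<LNode*> next;
};

// Insert n after pos. compare_exchange_weak refreshes succ on failure,
// so the loop just re-aims n->next and retries.
void insert_after(LNode* pos, LNode* n)
{
    LNode* succ = pos->next.load(std::memory_order_acquire);
    do {
        n->next.store(succ, std::memory_order_relaxed);
    } while (!pos->next.compare_exchange_weak(
                 succ, n,
                 std::memory_order_release,
                 std::memory_order_acquire));
}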

>
> Each object might contain a smallish array, like a deque or something.
>
> Its used in various places throughout the Linux Kernel.
>
>
>> But that also means that updating the object itself (as opposed to
>> updating the link) requires building a new object, usually by copying
>> the old, because the structure needs updates in several places and
>> they all have to form a transaction.
>
> Inserting/Removing nodes is easy, updating an existing node involves
> creating a copy of parts or all of it, in a lot of cases. Usually, a
> common theme is that readers either see the old copies, or the new one
> during iteration of the data structure. No intermediate intermixed
> state. Readers do not generally care if they are reading an old node
> that was already removed. Although I have seen some special schemes that
> try to mitigate this by simply setting a flag in the node. If the reader
> decides to read the flag, it can tell if the node has been removed.
>
>
>> In other words, RCU is a device to convert a many-object transaction
>> to a single-object (the link) transaction, where the single is
>> hardware atomic on the hardware.
>
> I always thought of RCU as more of a "deferred memory reclamation"
> mechanism. It allows a writer to defer the freeing of an object until
> its sure that no readers are accessing it. It uses quiescent states to
> determine this. This is a key aspect.

Right.

As soon as every possible user of the object has passed some kind of
sync point, outside the object, then it can be safely reclaimed.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Chris M. Thomasson

Jun 25, 2020, 2:03:33 AM
Imvvho, this deserves its own thread...? from the following topic:

https://groups.google.com/d/topic/comp.arch/k1Qr520dcDk/discussion

It's a special, highly experimental, atomic stack that uses some
wait-free techniques and blocking in the form of spinning. I wrote it
out in pseudo-code from memory. I think I got it... Actually, I have some
C++ code around here somewhere. Well, if interested, there is a working
model that can be tested. Think I posted about it over on comp.lang.c++.

Part of the magic is that user processing can act as the main backoff,
thus radically reducing the number of spins.

A caveat! If a thread up and dies in the right place, it can deadlock.
Try to find that place as a challenge... ;^)

Chris M. Thomasson

Jun 25, 2020, 2:14:47 AM
On 6/24/2020 10:59 PM, Terje Mathisen wrote:
> Chris M. Thomasson wrote:
>> On 6/24/2020 4:21 PM, Ivan Godard wrote:
>>> That implies a structure where each object is separately allocated
>>> and only the link to reach that object is updated (which is of course
>>> the usual RCU structure).
>>
>> Right. RCU loves linked lists. Readers love to iterate nodes... ;^)
>
> A single-linked list is trivially updated without the need for RCU, you
> just need CAS (or DCAS if you can't otherwise guarantee avoidance of the
> A-B-A problem when memory is quickly reused.)

Correct. IBM Appendix A something, Principles of Operation. I forget for
now, but it's there. I always called it DWCAS, for double-width
compare-and-swap. The DW part would be a 64-bit atomic update on a
system with 32-bit words, CMPXCHG8B. For 64-bit, 128-bit atomic updates.
Then there was a strange one from Itanium, cmp8xchg16b or something!
Damn, I cannot remember.

Imvho, RCU is "more" about allowing readers to freely fly/iterate
through the nodes of the singly-linked list, without any fear. RCU
allows the nodes to be freed, reused, whatever, because it can tell when
a node is in a quiescent state. So, imvvvho, RCU is more of a deferred
reclamation scheme? Btw, are you familiar with SMR? AKA, hazard pointers?

Perfect! 100% perfect. Thank you.

Ivan Godard

Jun 25, 2020, 2:30:36 AM
On 6/24/2020 11:14 PM, Chris M. Thomasson wrote:
> On 6/24/2020 10:59 PM, Terje Mathisen wrote:
>> Chris M. Thomasson wrote:
>>> On 6/24/2020 4:21 PM, Ivan Godard wrote:
>>>> That implies a structure where each object is separately allocated
>>>> and only the link to reach that object is updated (which is of
>>>> course the usual RCU structure).
>>>
>>> Right. RCU loves linked lists. Readers love to iterate nodes... ;^)
>>
>> A single-linked list is trivially updated without the need for RCU,
>> you just need CAS (or DCAS if you can't otherwise guarantee avoidance
>> of the A-B-A problem when memory is quickly reused.)

DCAS can avoid the ABA problem by pairing the location to be updated
with a location carrying a counter. The problem is that (in most
versions of DCAS) the two locations must be adjacent, which means that
the natural single-thread data structure must be modified if it is to be
used concurrently.

This is enough of a nuisance that programmers don't bother, and just use a
mutex to lock the whole thing. The mutex itself of course uses DCAS, but
it is special-purpose rather than inherited code.

Chris M. Thomasson

Jun 25, 2020, 2:44:55 AM
Think of avoiding ABA with CAS... on a 32-bit word system.... 64 CAS
works wonders... ;^) The CAS needs to work on sizes that are greater
than a work to be able to include the aba, or version counter if you
will... Something like:

struct double_width
{
uintptr_t ptr; // 32 bits
uint32_t aba; // 32 bits
};

32*2 = 64 bits, okay. We need a cas that can handle this...

Well, DWCAS (Double Width CAS), or CMPXCHG8B on 32-bit Intel, can mutate
a struct double_width atomically, its size is 64 bits, perfect for an
aba counter! Incrementing aba for each pop does the trick. LL/SC is a
different beast. Think of CAS for a moment...
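
Here is a sketch of that struct in action, with the pair packed into one
uint64_t so an ordinary 64-bit CAS plays the CMPXCHG8B role. It assumes
32-bit pointers and type-stable nodes (a popped node's memory must stay
readable, which is where deferred reclamation comes back into the picture).

#include <atomic>
#include <cstdint>

struct node { node* next; };

std::atomic<std::uint64_t> top{0};   // low 32 bits: ptr, high 32: aba

static std::uint64_t pack(std::uint32_t ptr, std::uint32_t aba)
{
    return (std::uint64_t(aba) << 32) | ptr;
}

node* pop()
{
    std::uint64_t old = top.load();
    for (;;)
    {
        node* n = reinterpret_cast<node*>(
            static_cast<std::uintptr_t>(old & 0xffffffffu));
        if (!n) return nullptr;
        std::uint32_t aba = std::uint32_t(old >> 32);

        // Bump the counter on every pop: if n was popped, recycled and
        // pushed back between our load and the CAS, the counter
        // mismatch makes the CAS fail instead of corrupting the stack.
        std::uint64_t desired = pack(
            std::uint32_t(reinterpret_cast<std::uintptr_t>(n->next)),
            aba + 1);
        if (top.compare_exchange_weak(old, desired))
            return n;
    }
}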

Chris M. Thomasson

Jun 25, 2020, 2:46:30 AM
On 6/24/2020 11:44 PM, Chris M. Thomasson wrote:
> On 6/24/2020 11:30 PM, Ivan Godard wrote:
>> On 6/24/2020 11:14 PM, Chris M. Thomasson wrote:
>>> On 6/24/2020 10:59 PM, Terje Mathisen wrote:
>>>> Chris M. Thomasson wrote:
>>>>> On 6/24/2020 4:21 PM, Ivan Godard wrote:
[...]
> Think of avoiding ABA with CAS... on a 32-bit word system.... 64 CAS
> works wonders... ;^) The CAS needs to work on sizes that are greater
> than a work to be able to include the aba, or version counter if you
^^^^^^^^^

word!

The CAS needs to work on sizes that are greater than a _word_, to be
able to include the aba, or version counter if you

> will... Something like:
>
> struct double_width
> {
>    uintptr_t ptr; // 32 bits
>    uint32_t aba;  // 32 bits
> };
>
> 32*2 = 64 bits, okay. We need a cas that can handle this...
>
> Well, DWCAS (Double Width CAS), or CMPXCHG8B on 32-bit Intel, can mutate
> a struct double_width atomically, its size is 64 bits, perfect for an
> aba counter! Incrementing aba for each pop does the trick. LL/SC is a
> different beast. Think of CAS for a moment...

Sorry for the typo!

Ivan Godard

Jun 25, 2020, 3:13:51 AM
Guess you didn't read the material above before writing. Call it DWCAS
or DCAS, it required modifying the data structure to add a counter.
There's nothing wrong with D(W)CAS, it's using it that's the problem.
It's not sufficient to offer a facility that can be used for what you
need to get done; it is also necessary for it to be humanely usable and
not require other things to be pretzeled. D(W)CAS fails.

Chris M. Thomasson

Jun 25, 2020, 3:35:02 AM
Fwiw, DWCAS operates on two adjacent locations in memory; DCAS can
operate on two disjoint locations. They are different things.

Chris M. Thomasson

Jun 25, 2020, 3:38:26 AM
Assume DWCAS can only operate on a struct double_width. DCAS can take two
pointers to two different locations. It can have a pointer to A, and a
pointer to B, that are very far apart from each other. This does not
work at all with DWCAS.

Chris M. Thomasson

Jun 25, 2020, 3:58:31 AM
I failed to create a new thread for this. Well, here is the original topic:

https://groups.google.com/d/topic/comp.arch/k1Qr520dcDk/discussion

Here is the message:

https://groups.google.com/forum/#!original/comp.arch/k1Qr520dcDk/l8konN9KBQAJ

Wrt RCU:

[...]

Ivan Godard

Jun 25, 2020, 4:39:30 AM
On 6/25/2020 12:34 AM, Chris M. Thomasson wrote:

> Fwiw, DWCAS operates on two adjacent locations in memory, DCAS can
> operate on two disjoint locations. They are different things.

I am aware of the difference; I had long forgotten the mnemonics. Either
way, both require altering the data structure to avoid ABA.

Chris M. Thomasson

Jun 25, 2020, 4:53:26 AM
On 6/25/2020 1:39 AM, Ivan Godard wrote:
> On 6/25/2020 12:34 AM, Chris M. Thomasson wrote:
>
>> Fwiw, DWCAS operates on two adjacent locations in memory, DCAS can
>> operate on two disjoint locations. They are different things.
>
> I am aware of the difference; I had long forgotten the mnemonics.

I don't think DWCAS or DCAS were ever "standard" mnemonics to begin
with. Iirc, I heard about calling it DWCAS from Joe Seigh, who worked for
IBM in the 80's and early 90's. On the IBM z/arch, iirc, normal CAS was
called CS. Then DWCAS was called CDS? I think. On x86 for 32-bit
systems, it's called cmpxchg8b; for 64-bit, it's cmpxchg16b. Though, of
course, none of these act like DCAS...


> Either way, both require altering the data structure to avoid ABA.

Iirc, there was a highly experimental way to use DCAS that got around
ABA. It was expensive, but worked. I am most likely mistaken here; the
paper must be from around 2001, or earlier. Humm... I do remember a
special algorithm called KCSS, or something. K-compare single-swap? Ever
heard of it?

https://groups.google.com/d/topic/comp.arch/shshLdF1uqs/discussion

The damn pdf link is broken in the message:

http://research.sun.com/people/moir/Papers/p004-luchangco.pdf

It has to be somewhere on the web.

Chris M. Thomasson

Jun 25, 2020, 4:54:02 AM
On 6/25/2020 1:53 AM, Chris M. Thomasson wrote:
> On 6/25/2020 1:39 AM, Ivan Godard wrote:
>> On 6/25/2020 12:34 AM, Chris M. Thomasson wrote:
>>
>>> Fwiw, DWCAS operates on two adjacent locations in memory, DCAS can
>>> operate on two disjoint locations. They are different things.
>>
>> I am aware of the difference; I had long forgotten the mnemonics.
>
> I don't think DWCAS or DCAS were ever "standard" mnemonics to begin
> with. Iirc, I heard about calling it DWCAS from Joe Seigh, worked for
> IBM in the 80's and early 90's. On the IBM z/arch, iirc, normal CAS was
> called CS. Then DWCAS was called CDS? I think. On x86 for 32-bit
> systems, its called cmpxchg8b. for 64-bit, its cmpxchg16b. Though, of
> course none of these act like DCAS...
>
>
>> Either way, both require altering the data structure to avoid ABA.
>
> Iirc, there was a highly experimental way to use DCAS that got around
> ABA. It was expensive, but worked. I am most likely mistaken here, the
> paper must be from around 2001, or earlier. Humm... I do remember a
> special algorithm called KCSS, or something. K compare single swap? Ever
> heard of it?
>
> https://groups.google.com/d/topic/comp.arch/shshLdF1uqs/discussion

Wow! Way back in 2005. How time flies!

Chris M. Thomasson

Jun 25, 2020, 5:34:10 AM
On 6/25/2020 12:58 AM, Chris M. Thomasson wrote:
> I failed to create a new thread for this. Well, here is the original topic:
[...]
> Fine. Lets create a function that adds a new node to this stack:
>
> void stack_push(stack* self, node* n)
> {
>     n->next = SPECIAL;
>     node* previous = XCHG(&self->head, n);
>     n->next = previous;
> }
>
>
> Okay! That's it for push. Nice and simple, basically wait-free and gets
> LIFO order: Okay for a stack. Now, lets allow one to remove all items
> from this stack at once:
[...]
A funny thing: one can actually get FIFO if they "simply" reverse the
nodes they get from stack_flush... ;^)

Chris M. Thomasson

Jun 25, 2020, 5:43:36 AM
Well, it would not be "ideal" to get FIFO in this experiment. But it can
work. An alternative would be to modify the stack_push function to use
CAS, and eliminate the SPECIAL hack pointer thing. So, something like,
just typing this in, memory barriers aside for a moment:

void stack_push(stack* self, node *n)
{
    node* head = nullptr;

    do
    {
        head = self->head;
        n->next = head;

    } while (! CAS(&self->head, head, n));
}


Imvvho, it's not as pretty as the fast XCHG version; it loops! Damn it...

However, it gets around having to use SPECIAL.

Chris M. Thomasson

Jun 25, 2020, 5:59:45 AM
On 6/24/2020 10:59 PM, Terje Mathisen wrote:
These sync points can be a lot of things. They just need to act as a
memory barrier.

Terje Mathisen

Jun 25, 2020, 8:08:31 AM
When used in the suggested way, i.e. no data items are reclaimed until
all users/threads have passed a sync point, then it is perfectly safe to
do it with regular CAS:

The current thread is currently inside the linked list, since it is about
to update it: this means that it is impossible for either the node which
we will update or the following node to be deleted, reclaimed and then
reused in the same location while we are doing our update. The first two
steps are possible but not the last one, so the CAS will fail when we
try it.

Bonita Montero

Jun 25, 2020, 9:59:11 AM
> DCAS can avoid the ABA problem by pairing the location to be updated
> with a location carrying a counter. The problem is that (in most
> versions of DCAS) the two locations must be adjacent, which means that
> the natural single-thread data structure must be modified if it is to be
> used concurrently.

There's no problem with that.

MitchAlsup

Jun 25, 2020, 12:43:04 PM
On Thursday, June 25, 2020 at 7:08:31 AM UTC-5, Terje Mathisen wrote:
> Ivan Godard wrote:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );      // mark the lines touched by the
    fp = fr->prev;                 //   event as participants
    esmLOCK( tn = to->next );
    esmLOCK( fn );
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )       // nobody else touched them?
    {
        fp->next = fn;             // unlink fr from its old list
        fn->prev = fp;
        to->next = fr;             // splice fr in after to
        tn->prev = fr;
        fr->prev = to;
        esmLOCK( fr->next = tn );  // final locked store commits the event
        return TRUE;
    }
    return FALSE;                  // interfered with: caller retries
}

Ivan Godard

Jun 25, 2020, 2:39:14 PM
There is if the objects are also used by un-related non-threaded
software, for example if they come from a file and some of the programs
using the file are threaded and some are not, or they are all threaded
but different programs update different parts of the object.

It's true that if performance is unimportant then the programs can
marshal in the objects and then marshal them out again. But it can be
important: think of the network stack of a router. There the code may need
the ability to update any of several different individual fields in the
object, sometimes alone and sometimes in groups, and object copy costs
will dominate the throughput and resulting bandwidth.

It is impractical to explode the object to put a sequence field next to
each updatable field; it is unattractive to copy the object to do RCU.
The best solution is to update-in-place. On a legacy machine that means
using a mutex to lock the object. On a Mill or IBM it means a
transaction. I feel that the transaction is more natural and less
distorting of the program.

Bonita Montero

Jun 25, 2020, 2:44:48 PM
>> There's no problem with that.

> There is if the objects are also used by un-related non-threaded
> software, for example if they come from a file and some of the programs
> using the fie are threaded and some are not, or they are all threaded
> but different programs update different parts of the object.

It is very abstract. Or in other words: you are just turning phrases.
Where lock-free operations make sense, there aren't any restrictions
like non-alignedness which would make them impossible.

Alex McDonald

Jun 25, 2020, 2:50:06 PM

Chris M. Thomasson

Jun 25, 2020, 6:30:40 PM
It's not KCSS, it's not the ABA-avoiding DCAS thing. However, there is a
familiar name: Luchangco. Wrt KCSS, let's see here:

Ahh, this is it:

http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/K-Compare.pdf

I need to cross-check it against the data I have in the following thread:

https://groups.google.com/d/topic/comp.arch/shshLdF1uqs/discussion

but it looks right!