On 6/22/2020 7:45 PM, Ivan Godard wrote:
> <moved from "[OT] Anton Ertl has been proven right" to get away from the
> UB wrangling>
>
> On 6/22/2020 7:09 PM, Chris M. Thomasson wrote:
> > On 6/22/2020 6:26 PM, Ivan Godard wrote:
> >> On 6/22/2020 5:35 PM,
already...@yahoo.com wrote:
> >>> On Tuesday, June 23, 2020 at 2:28:24 AM UTC+3, Chris M. Thomasson
> wrote:
>
>
> >>> Why would anybody do it?
> >>> Unaligned accesses are good for many things, but atomicity is
> >>> certainly not one of them.
> >>>
> >>
> >> That's one of the reasons why Mill doesn't have hardware locking
> >> primitives but uses transactional semantics instead.
> >
> > Have any problems with live lock? ;^)
>
> That's a real good question, and the answer is we really don't know yet
> but don't believe so. The reason we don't think so is that the hardware
> primitives automatically do a backoff, so lockout is exponentially
> unlikely as the duration increases, as in well known network protocols.
Clever backoff schemes can help wrt a transactional setup. I have even
seen some elaborate so-called "contention manager" algorithms in
various STMs over the years. Still, I am wondering: is there an upper
bound on how many failures an LL/SC can suffer on the Mill? Let's
call a failure a spin... Can there theoretically be infinite spins?
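For what it's worth, here is a minimal sketch of the kind of bounded
exponential backoff a contention manager might wrap around a retry
loop, with C++11 atomics and CAS standing in for LL/SC (the names and
the numbers are mine, not the Mill's):

#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical sketch: each failed attempt doubles the wait, so a
// persistent loser becomes exponentially unlikely, but note that
// nothing here puts a hard upper bound on the number of spins.
void add_with_backoff(std::atomic<long>& counter, long delta)
{
    long expected = counter.load(std::memory_order_relaxed);
    int failures = 0;
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed))
    {
        // Back off 1, 2, 4, ... ns, capped so the wait stays bounded
        // even under pathological contention.
        int shift = std::min(failures++, 16);
        std::this_thread::sleep_for(std::chrono::nanoseconds(1L << shift));
    }
}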
One possible source of headaches: have you seen a scenario where an
LL/SC was operating on a reservation granule that was "accidentally"
falsely shared with another piece of unrelated data that another
thread works on? If this other thread stores into this granule, it can
cause the unrelated LL/SC to fail at will.
False sharing on a reservation granule is bad mojo. Hmm, well, what is
the size of a granule on the Mill? Is it really small? Perhaps even a
single byte?
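I don't know the Mill's granule size, but on conventional hardware the
usual defense is to pad and align unrelated hot data so it cannot land
in the same granule. A sketch, assuming a 64-byte granule (the typical
cache-line size; the Mill's actual figure is exactly my question):

#include <atomic>
#include <cstddef>

// Assumption: the reservation granule is one 64-byte cache line, as
// on most mainstream cores. The Mill's real figure is unknown to me.
constexpr std::size_t kGranule = 64;

// Without this alignment, `target` (hit by LL/SC or CAS) and
// `unrelated` (plain stores from another thread) could share a
// granule, and the unrelated stores would keep killing the
// reservation.
struct alignas(kGranule) ReservedSlot {
    std::atomic<long> target;  // the word the atomic op works on
    // the rest of the granule is deliberately left as padding
};

ReservedSlot reserved;                 // gets a granule to itself
alignas(kGranule) long unrelated = 0;  // lives in a different granule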
Then there are the plain very-high-contention scenarios, no false
sharing at all, that can still cause a lot of failures and spins. The
system is under heavy load, oh shi%.
> We realize that a backoff algorithm only gives statistical guarantees,
> as opposed to a round robin which can give a fixed guarantee. This may
> rule out use in certain hard realtime codes that must have the fixed
> guarantee. It's a market choice to abandon such codes, but then any OOO
> or machine with a cache has also made the same market choice.
For sure. Indeed. The problems tend to show up when the system is under
periods of heavy sustained load. Hot spots emerge...
> > Btw, I am not necessarily talking
> > about atomic ops wrt an RMW. I am talking about the ability to execute a
> > remote memory barrier in user space. I assume the Mill is 100%
> > sequentially consistent, right? On x86, there is a way to trigger a bus
> > lock on purpose.
> >
> > Think of something like RCU. The readers do not need to execute any
> > memory barriers or atomic RMW's at all, no transactions, nothing...
>
> The problem with RCU is that the final U must be itself atomic. If you
> arm-bend the data structure enough you can usually get that down to a
> pointer store, which is hardware-atomic. But frequently you really don't
> want to bend the structure: insert into a doubly linked list is just not
> naturally RCU, but it's fine with transactional semantics.
RCU is really nice because it allows reader threads to avoid any
explicit sync, DEC Alpha aside. Think of 10 reader threads constantly
iterating through a dynamic data structure, perhaps performing database
lookups. Then think of three or four writer threads periodically adding
and removing items from this structure. The read activity is much more
frequent than the writes. We also want the writers to be able to
de/reallocate removed items, and still allow the readers to go "full
steam ahead" without hitting any corruption, keeping everything
coherent.
The writers can use mutexes, or anything they want. Okay, so a writer
removes an item. Can it deallocate it right now? Well, not if any
readers are accessing it! So, it defers its destruction until a
quiescent period is hit. Then it knows that no readers can possibly have
a reference.
It removes an element.
Defers its destruction.
When a quiescent period is detected, the element can be freed or
realloc'd, whatever.
This can be referred to as "epoch"-based memory reclamation. Some
systems go through multiple epochs before freeing elements. RCU can be
thought of as a deferred memory reclamation scheme.
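A minimal single-writer sketch of the deferral idea, in C++11 atomics;
the names and the two-bucket epoch scheme are mine, and the hard part
(actually detecting the quiescent period) is deliberately left out:

#include <atomic>
#include <vector>

struct Node { std::atomic<Node*> next{nullptr}; int payload{0}; };

std::atomic<Node*> head{nullptr};       // readers traverse this list
std::atomic<unsigned> global_epoch{0};  // bumped at each quiescent period
std::vector<Node*> limbo[2];            // retired nodes, one bucket per epoch

// Reader: no barriers, no atomic RMW, just dependent loads. (consume
// is promoted to acquire by today's compilers; on anything but DEC
// Alpha the dependent load is free anyway.)
int sum_list() {
    int sum = 0;
    for (Node* n = head.load(std::memory_order_consume); n != nullptr;
         n = n->next.load(std::memory_order_consume))
        sum += n->payload;
    return sum;
}

// Writer side: after unlinking a node, park it in the current bucket.
void retire(Node* n) {
    limbo[global_epoch.load(std::memory_order_relaxed) & 1].push_back(n);
}

// Writer side: once every reader is known to have passed a quiescent
// point (detection omitted), the previous epoch's bucket is safe.
void on_quiescent_period() {
    unsigned e = global_epoch.fetch_add(1, std::memory_order_acq_rel);
    for (Node* n : limbo[(e + 1) & 1]) delete n;
    limbo[(e + 1) & 1].clear();
}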
Also, in a lot of RCU schemes, readers generally do not care if they
load out-of-date elements. The reads can go 100% full steam ahead, and
not sync with the writers at all.
It's radically asymmetric. Lots of reads, moderate writes.
> > Well, DEC Alpha aside for a moment. It's very fast. It's an example of an
> > asymmetric algorithm. The readers are VERY fast, and the writers need to
> > sync up. Writers can use mutexes, lock-free algorithms, LL/SC, CAS,
> > whatever. Readers are free as a bird.
> >
> > Think of a special mutex that uses two threads. One of the threads is a
> > hot spot, that runs very frequently. The other runs once in a while. We
> > want the hot spot to be as free as possible.
>
> Absent collision the cost of the transaction is the same as the
> non-atomic code; at most an addressed line may get pushed from the L1 to
> the L2. Collision induces backoff of course. So any algorithm that has
> free readers with mutexed (etc) writes has free readers in ours too; the
> transaction can be just the writes. But you can get a true RMW
> transaction at the same cost, which eliminates the "C" in RCU.
>
> But we freely admit that we don't know how well the users will take to
> actually using the primitives, as opposed to just treating them as an
> implementation of mutexes. IBM has a similar facility, and people seem
> happy with it, but Intel's effort in that line was withdrawn.
>
> By the time our micro-kernel is up we should know more :-)
Have you played around with asymmetric synchronization algorithms? An
example can be as simple as the Dekker algorithm. No atomic RMW needed,
no LL/SC. Just plain atomic loads and stores with the right memory
barriers.
The idea wrt asymmetry in Dekker is that there are two threads: a "hot"
thread that takes the lock very frequently, and another that only takes
it from time to time. Well, wouldn't it be nice to get rid of that damn
nasty memory barrier on the hot thread? It can be done, and it can
really improve performance. Big time.
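For concreteness, here is classic two-thread Dekker with C++11 atomics;
the seq_cst fence right after the flag store is exactly the store->load
barrier in question. The asymmetric variants strip it from the hot
thread and make the cold thread pay instead, e.g. via an OS-provided
process-wide barrier (Linux's membarrier() is one such facility):

#include <atomic>

std::atomic<bool> flag[2] = {{false}, {false}};
std::atomic<int>  turn{0};

void lock(int id) {                     // id is 0 or 1
    int other = 1 - id;
    flag[id].store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // the nasty one
    while (flag[other].load(std::memory_order_acquire)) {
        if (turn.load(std::memory_order_relaxed) != id) {
            // Not our turn: step back and wait politely.
            flag[id].store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != id)
                ;  // spin
            flag[id].store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
}

void unlock(int id) {
    turn.store(1 - id, std::memory_order_relaxed);
    flag[id].store(false, std::memory_order_release);
}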
Asymmetric sync is definitely not for everything, no silver bullet
indeed. It's just pretty darn useful when one has an imbalance in
access patterns. Say, a lot more reads than writes: read-mostly data
structure access patterns.