About SC and TSO and RMO hardware memory models..

Horizon68

Feb 4, 2019, 4:19:24 PM

Hello..

About SC and TSO and RMO hardware memory models..

I have just read the following webpage about the performance difference
between the SC, TSO and RMO hardware memory models.

I think TSO is better: it gives only around 3% to 6% less performance
than RMO, and it is a simpler programming model than RMO. So I think ARM
should support TSO to be compatible with x86, which is TSO.

Read more here to notice it:

https://infoscience.epfl.ch/record/201695/files/CS471_proj_slides_Tao_Marc_2011_1222_1.pdf

About memory models and sequential consistency:

As you may have noticed, I am working with the x86 architecture.

Even though x86 gives up on sequential consistency, it’s among the most
well-behaved architectures in terms of the crazy behaviors it allows.
Most other architectures implement even weaker memory models.

ARM memory model is notoriously underspecified, but is essentially a
form of weak ordering, which provides very few guarantees. Weak ordering
allows almost any operation to be reordered, which enables a variety of
hardware optimizations but is also a nightmare to program at the lowest
levels.

Read more here:

https://homes.cs.washington.edu/~bornholt/post/memory-models.html

Memory Models: x86 is TSO, TSO is Good

Essentially, the conclusion is that x86 in practice implements the old
SPARC TSO memory model.

The big take-away from the talk for me is that it confirms the
observation made many times before that SPARC TSO seems to be the optimal
memory model. It is sufficiently understandable that programmers can
write correct code without having barriers everywhere. It is
sufficiently weak that you can build fast hardware implementation that
can scale to big machines.

Read more here:

https://jakob.engbloms.se/archives/1435

Thank you,
Amine Moulay Ramdane.

MitchAlsup

Feb 4, 2019, 6:17:23 PM

On Monday, February 4, 2019 at 3:19:24 PM UTC-6, Horizon68 wrote:
> Hello..
>
>
> About SC and TSO and RMO hardware memory models..
>
> I have just read the following webpage about the performance difference
> between: SC and TSO and RMO hardware memory models
>
> I think TSO is better, it is just around 3% ~ 6% less performance
> than RMO and it is a simpler programming model than RMO. So i think ARM
> must support TSO to be compatible with x86 that is TSO.

Err, x86s, at least back when I worked on them, were not TSO.

>
> Read more here to notice it:
>
> https://infoscience.epfl.ch/record/201695/files/CS471_proj_slides_Tao_Marc_2011_1222_1.pdf
>
> About memory models and sequential consistency:
>
> As you have noticed i am working with x86 architecture..
>
> Even though x86 gives up on sequential consistency, it’s among the most
> well-behaved architectures in terms of the crazy behaviors it allows.

You should see what happens when a misaligned store crosses an MTRR boundary,
leaving the first half of the store in cacheable memory space and
the other half in non-cacheable memory space! I have; it's not pretty.

> Most other architectures implement even weaker memory models.
>
> ARM memory model is notoriously underspecified, but is essentially a
> form of weak ordering, which provides very few guarantees. Weak ordering
> allows almost any operation to be reordered, which enables a variety of
> hardware optimizations but is also a nightmare to program at the lowest
> levels.
>
> Read more here:
>
> https://homes.cs.washington.edu/~bornholt/post/memory-models.html
>
>
> Memory Models: x86 is TSO, TSO is Good
>
> Essentially, the conclusion is that x86 in practice implements the old
> SPARC TSO memory model.

The SPARC guys invented and pushed TSO. All I ever found was it slowing
everything down.

>
> The big take-away from the talk for me is that it confirms the
> observation made many times before that SPARC TSO seems to be the optimal
> memory model. It is sufficiently understandable that programmers can
> write correct code without having barriers everywhere. It is
> sufficiently weak that you can build fast hardware implementation that
> can scale to big machines.

Nievé at best.

lkcl

Feb 6, 2019, 12:27:12 AM

Indeed.

MIT added TSO to RISC-V (the experience was deeply unpleasant for the students, but that is another story).

The advantage of TSO is that it provides ordering guarantees that allow formal mathematical correctness proofs to be carried out, thus providing security guarantees about programs that are otherwise impossible to obtain.

Paradoxically, in the Intelligence Community, the nightmare scenario is when you DON'T know if something is secure or not. If you *do* know something is insecure, you can use that to your advantage, for a double-cross or a honeypot.

L.

Kent Dickey

Feb 6, 2019, 6:01:14 PM

In article <36aa431c-c0f6-4642...@googlegroups.com>,

First, the academic literature on ordering models is terrible. My eyes
glaze over and it's just so boring.

I'm going to guess "niev" means naive. I find that surprising since x86
is basically TSO. TSO is a good idea. I think weakly ordered CPUs are a
bad idea.

TSO is just a handy name for the Sparc and x86 effective ordering for
writeback cacheable memory: loads are ordered, and stores are buffered and
will complete in order but drain separately from the main CPU pipeline. TSO
can allow loads to hit stores in the buffer and see the new value, but this
doesn't really matter for general ordering purposes.
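
The buffered-store behaviour described above is usually illustrated with the store-buffering litmus test. The following is only a sketch in C++11 atomics; the function name count_both_zero and the fenced flag are invented for this illustration. Each thread stores to one variable and then loads the other. Without the fence, both loads may return 0 because each store can still be sitting in its store buffer. With a sequentially consistent fence in both threads (which on x86 is typically an mfence or a locked instruction), the C++ memory model rules that outcome out.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering litmus test `iterations` times and returns how
// often both threads read 0. `fenced` selects whether a full fence is
// placed between the store and the load in each thread.
int count_both_zero(int iterations, bool fenced)
{
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);   // may wait in the store buffer
            if (fenced) std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);  // may complete before the store drains
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            if (fenced) std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero(n, true) must return 0. With fenced set to false the result may or may not be 0 on a given run, since thread start-up cost makes the reordering hard to observe in such a simple harness.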

TSO lets you write basic producer/consumer code with no barriers. In fact,
about the only type of code that doesn't just work with no barriers on TSO
is Lamport's Bakery Algorithm since it relies on "if I write a location and
read it back and it's still there, other CPUs must see that value as well",
which isn't true for TSO.
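
As a rough illustration of the producer/consumer case, here is a minimal sketch in C++11. The Mailbox type and the produce, consume and run_once names are made up for this example. The source still has to state the ordering (release on the flag store, acquire on the flag load) so that the compiler does not reorder the accesses, but on x86 compilers typically emit ordinary MOV instructions for both, with no fence instruction. On a weakly ordered target the same source needs barrier or acquire/release instructions.

```cpp
#include <atomic>
#include <thread>

// Hypothetical one-shot mailbox: a plain payload guarded by an atomic flag.
struct Mailbox {
    int payload = 0;
    std::atomic<bool> ready{false};
};

void produce(Mailbox& box, int value)
{
    box.payload = value;                               // write the data first
    box.ready.store(true, std::memory_order_release);  // then publish the flag
}

int consume(Mailbox& box)
{
    // Spin until the flag is set; the acquire load pairs with the release store.
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;                                // sees the producer's value
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_once(int value)
{
    Mailbox box;
    int seen = -1;
    std::thread consumer([&] { seen = consume(box); });
    std::thread producer([&] { produce(box, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

run_once(42) returns 42 on any conforming implementation; what differs between TSO and weakly ordered machines is only the instructions generated for the two atomic accesses.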

Lock free programming "just works" with TSO or stronger ordering guarantees,
and it's extremely difficult to automate putting in barriers for complex
algorithms for weakly ordered systems. So code for weakly ordered systems
tends to either toss in lots of barriers, or use explicit locks (with
barriers). And extremely weakly ordered systems are very hard to reason
about, and especially hard to program since many implementations are not as
weakly ordered as the specification says they could be, so just running your
code and having it work is insufficient. Alpha was terrible in this regard,
and I'm glad its silliness died with it.

HP PA-RISC was documented as weakly ordered, but all implementations
guaranteed full system sequential consistency (this was tested and
enforced, though not for things like cache flushing, which did need
barriers). No one wanted to risk breaking software from the original in-order
fully sequential machines that might have relied on it. It wasn't really a
performance issue, especially once OoO was added.

Weakly ordered CPUs are a bad idea in much the same way in-order VLIW is a bad
idea. Certain niche applications might work out fine, but not for a general
purpose CPU. It's better to throw some hardware at making TSO perform well,
and keep the software simple and easy to get right.

Kent

MitchAlsup

Feb 6, 2019, 7:08:33 PM

Somehow you missed the <alt>0233 (an e with a backwards accent)

> I find that surprising since x86
> is basically TSO. TSO is a good idea. I think weakly ordered CPUs are a
> bad idea.

I also think weakly ordered memory models are an idea to avoid, but x86s
smell a lot like TSO without really being strictly TSO. A lot of this
falls through the cracks due to misaligned-access support and how MTRRs slice
up the memory address space too far from the CPU to be taken into account
during store processing.

I suspect if you have a strongly conforming program with no misaligned
accesses, it might smell like TSO to any synchronization scheme. Still,
this does not mean it fits the strict definition of TSO. Strongly
conforming programs only touch cacheable memory.

already...@yahoo.com

Feb 7, 2019, 4:28:31 AM

The latest (and not so latest) Intel memory ordering documents do not regard unaligned accesses as a "single memory access".
In the latest doc set it's specifically mentioned in paragraph 8.2.3.1 of Volume 3.

The AMD manual says the same in a less formal manner in the 2nd paragraph of 7.2 of Volume 2.
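
To make the distinction concrete, here is a small sketch of a check for accesses that fall under the "single memory access" guarantee. It assumes the rule being referred to is that only naturally aligned accesses of 1, 2, 4 or 8 bytes are covered; that reading, and the helper name is_naturally_aligned, are assumptions of this example and not text from either manual.

```cpp
#include <cstddef>
#include <cstdint>

// True if an access of `size` bytes at address `addr` is naturally aligned:
// the size is 1, 2, 4 or 8 and the address is a multiple of the size.
constexpr bool is_naturally_aligned(std::uintptr_t addr, std::size_t size)
{
    return (size == 1 || size == 2 || size == 4 || size == 8)
        && (addr % size == 0);
}

// Convenience overload (hypothetical) for a typed pointer.
template <typename T>
bool is_naturally_aligned(const T* p)
{
    return is_naturally_aligned(reinterpret_cast<std::uintptr_t>(p), sizeof(T));
}
```

For example, an 8-byte access at 0x1000 passes the check, while an 8-byte access at 0x1004 does not and may be performed as more than one memory access.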

Chris M. Thomasson

Feb 14, 2019, 10:04:02 PM

On 2/6/2019 3:01 PM, Kent Dickey wrote:
> In article <36aa431c-c0f6-4642...@googlegroups.com>,
> lkcl <luke.l...@gmail.com> wrote:
>> On Tuesday, February 5, 2019 at 7:17:23 AM UTC+8, MitchAlsup wrote:
>>> On Monday, February 4, 2019 at 3:19:24 PM UTC-6, Horizon68 wrote:
>>
>>> The SPARC guys invented and pushed TSO. All I ever found was it slowing
>>> everything down.
>>>>
>>>> The big take-away from the talk for me is that it confirms the
>>>> observation made many times before that SPARC TSO seems to be the optimal
>>>> memory model. It is sufficiently understandable that programmers can
>>>> write correct code without having barriers everywhere. It is
>>>> sufficiently weak that you can build fast hardware implementation that
>>>> can scale to big machines.
>>>

>>> Nievé at best.

[...]

Iirc, hazard pointers need a #StoreLoad style barrier when one loads a
pointer. On x86, mfence or locked atomic rmw. Again, iirc, the load was
something like:
____________________________
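#include <atomic>

// Publish a hazard pointer: copy *global into *local, then confirm that
// *global did not change after the hazard pointer became visible.
void foo(std::atomic<void*>* global, std::atomic<void*>* local)
{
    void* global0 = nullptr;
    void* global1 = nullptr;

    do
    {
        global0 = global->load(std::memory_order_relaxed); // load

        local->store(global0, std::memory_order_relaxed);  // store

        // #StoreLoad ordering; on x86 typically mfence or a locked RMW
        std::atomic_thread_fence(std::memory_order_seq_cst);

        global1 = global->load(std::memory_order_acquire); // load

    } while (global0 != global1); // check for consistency
}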
____________________________

x86 can reorder a store followed by a load from another location. So,
the store to local can be reordered with the following load from global.
The barrier is needed. Btw, this type of barrier is fairly expensive.
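
To see why the reordering matters, consider the reclaiming side. The following is only a sketch; the function name is_protected and the flat array of hazard slots are assumptions made for this example, not part of the code above. A thread that has already unlinked a node scans every hazard slot before freeing it. If a reader's store to its slot could become visible after the reader's second load of the global pointer, the scan could miss the slot, and the node could be freed while the reader still believes it is protected.

```cpp
#include <atomic>
#include <cstddef>

// Reclaimer-side check (hypothetical helper): returns true if any of the
// `count` hazard slots currently holds `node`, in which case the node
// must not be freed yet.
bool is_protected(const std::atomic<void*>* slots, std::size_t count, void* node)
{
    // Full fence so the scan is ordered after the earlier unlink of `node`;
    // this pairs with the reader's fence between its store and re-load.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    for (std::size_t i = 0; i < count; ++i) {
        if (slots[i].load(std::memory_order_acquire) == node) {
            return true;
        }
    }
    return false;
}
```

A reclaimer would typically keep the node on a private retire list while is_protected returns true and free it on a later scan.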

Chris M. Thomasson

Feb 15, 2019, 5:54:19 PM

Fwiw, This is called SMR (Safe Memory Reclamation):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.395.378&rep=rep1&type=pdf

Noob

Feb 25, 2019, 3:35:07 PM

On 05/02/2019 00:17, MitchAlsup wrote:

> Nievé at best.

Who's Nievé ?

MitchAlsup

Feb 25, 2019, 3:45:28 PM

Consider a memory system like the one found on a CRAY Y-MP, where the memory fabric can
support 256 reads and writes happening simultaneously and continuously, in
order to keep 8 vector machines running at full performance. A TSO model would
completely obliterate any hope of performance out of the system.

Chris M. Thomasson

Feb 25, 2019, 4:21:35 PM

On 2/14/2019 7:04 PM, Chris M. Thomasson wrote:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I really need to implement this in Relacy Race Detector.

http://www.1024cores.net/home/relacy-race-detector

https://github.com/dvyukov/relacy

A pure C++11 implementation of SMR might be interesting, even though I
believe it is patented. Hmm...

Chris M. Thomasson

Feb 25, 2019, 4:35:15 PM

Imvho, a weaker memory model can allow for more opportunities for
optimization...