
Fenceless atomic RMW


Dmitriy Vyukov

Sep 26, 2010, 7:25:07 AM
Hi,

As far as I understand, atomic RMW operations on Intel IA-32/Intel64
are implemented along the lines of:
1. let's assume that the cache line is already in cache in E state
2. initial fetch marks the cache line as locked
3. RMW operation is performed
4. resulting value is placed into a store buffer
[above steps should take just a few cycles]
5. core stalls waiting for store buffers to drain
[this part is costly]
6. when store buffers are drained and the store reaches the cache, the
cache line is unlocked
7. at this point the core can proceed with subsequent instructions in
normal mode

Is my understanding correct? At least in broad strokes (I'm a software
developer, so forgive my terminology and details).

Then, are there any fundamental obstacles to removing the memory fence
(the store buffer drain)? Well, not actually removing it, but performing
it lazily. That is:
[first 4 steps are the same]
1. let's assume that the cache line is already in cache in E state
2. initial fetch marks the cache line as locked
3. RMW operation is performed
4. resulting value is placed into a store buffer
5. at this point the core can proceed with subsequent instructions in
normal mode
6. when the store reaches the cache, the cache line is unlocked

I understand that this will change semantics, so this should be
implemented with a separate instruction or a new instruction prefix. As
far as I see, the atomicity of the operation is preserved, the effective
cost is basically equal to that of a normal instruction, and the
semantics are enough to implement a lot of useful algorithms.
Nevertheless, atomic RMW operations in most (all?) commodity hardware
are combined with a full memory fence and are thus costly. So, are the
reasons for this historical, or are there some fundamental obstacles to
making atomic RMW virtually zero overhead?
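
In C++0x atomics terms (a sketch only, assuming the draft std::atomic
interface; the function names are mine), what I'm asking for is roughly
the second of these two:

#include <atomic>

std::atomic<int> counter(0);

void fenced_inc() {
    // what x86 LOCK ADD gives today: an atomic RMW plus a full fence
    // (the store buffer drain)
    counter.fetch_add(1, std::memory_order_seq_cst);
}

void relaxed_inc() {
    // the proposed instruction: still atomic wrt counter, but with no
    // store buffer drain implied
    counter.fetch_add(1, std::memory_order_relaxed);
}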

TIA
--
Dmitriy V'jukov

MitchAlsup

Sep 26, 2010, 2:47:57 PM
On Sep 26, 6:25 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Hi,
>
> As far as I understand, atomic RMW operations on Intel IA-32/Intel64
> are implemented along the lines of:
> 1. let's assume that the cache line is already in cache in E state
> 2. initial fetch marks the cache line as locked
> 3. RMW operation is performed
> 4. resulting value is placed into a store buffer
> [above steps should take just a few cycles]
> 5. core stalls waiting for store buffers to drain
> [this part is costly]
> 6. when store buffers are drained and the store reaches the cache, the
> cache line is unlocked
> 7. at this point the core can proceed with subsequent instructions in
> normal mode

The store buffer is drained before the RMW is performed. Were one to
wait to drain the buffer in the middle, then the Snoop/Probe operation
that hits on that line might (have to be) delayed until the buffer
drains and the store is performed. This can be "lots" of cycles, and the
snoop/probe response cannot be delayed that long. In any event, if
there is anything uncacheable in the store buffer it needs to be
pushed out to memory before the RMW makes much forward progress (past
Agen). {Consider the case of using cacheable locks to guard uncacheable
data. You DO NOT want the cacheable locks to be faster than the
uncacheable data. Otherwise the new lock holder can read the old stale
uncacheable data!}

There is low probability that the original line is a) in the cache,
and b) in an immediately writeable state. The probability (in both
cases) rapidly approaches zero under contention for the atomic
variable.

> Is my understanding correct?

One must still maintain memory order in the face of uncacheable locks.

Mitch

Andy Glew

Sep 26, 2010, 5:08:01 PM
On 9/26/2010 4:25 AM, Dmitriy Vyukov wrote:
> Hi,
>
> As far as I understand, atomic RMW operations on Intel IA-32/Intel64
> are implemented along the lines of:
> 1. let's assume that the cache line is already in cache in E state
> 2. initial fetch marks the cache line as locked
> 3. RMW operation is performed
> 4. resulting value is placed into a store buffer
> [above steps should take just a few cycles]
> 5. core stalls waiting for store buffers to drain
> [this part is costly]
> 6. when store buffers are drained and the store reaches the cache, the
> cache line is unlocked
> 7. at this point the core can proceed with subsequent instructions in
> normal mode
>
> Is my understanding correct? At least in broad strokes (I'm a software
> developer, so forgive my terminology and details).

First, your understanding is not correct. There has been quite a lot
of change, evolution, and improvement in this area in the last few
years, more than can be discussed here.

On old P6 systems, the load-locked part was not even started until the
load retired. At that point it drained the store buffers, acquired a
cache lock or address lock (depending) (or even a bus lock for
uncached), did the modification, put the store-unlock into the buffer,
and waited until the store unlock was done before starting anyone else.

Can't talk about all of the flavors of improvement.

But I can talk about what I think is the ultimate, approaching the
limit, where atomic RMWs are almost the same cost as INC mem unlocked,
since I created this agenda prior to joining Intel:

First, we are allowed to have other cache coherent operations pass the
atomic RMW. So long as we continue snooping them for Freye's rule
enforcement of memory ordering right up to the point where the
store-unlock is committed. (In other contexts you can stop snooping for
the oldest load in the buffer. Actually, there are contexts where you
can stop snooping for locks, but won't discuss those here.)

So, the question is, how do you do the load part of the atomic RMWs?
Do you let it go speculative, or not? A: which do you prefer? You can
do the load speculatively, but you must do Freye's rule snooping right
up to the bitter end - right up to the point where the load-lock and
store-unlock are about to commit, all previous stores having drained,
etc.

Whether you speculate the load-locked or not is a performance issue. If
you speculate on a contended lock, you may get misspeculations, the
infamous memory ordering nukes. If you do not speculate, you lose
performance. (1) Can I hear a predictor here? (2) An efficient repair
mechanism is desirable. Often atomic RMWs are part of spinloops, and
have limited numbers of values. Why bother making the spinloop fast?
Perhaps you should not nuke, but mark the speculated instructions as
nuke required if value is different from the "I got the lock" value.

I believe that I have mentioned before how Freye's rule snooping is not
always the best thing, how I was drawn to the "re-execute the load at
commit and nuke if different" approach. Locked RMWs may be a good place
to apply this.

So, anyway: you can execute the load-locked speculatively, without
actually acquiring the address lock (the thing that makes it atomic,
that says to other processors "I can't let you read the cache line now").

But, y'know, you *can* acquire the lock early, if you are careful about
deadlock. Or you could acquire the lock, hold it for a bit, but if the
cache line is contended release it before the locked RMW retires - which
probably indicates contention, and the nuke options above apply.

In any case, you must acquire the address lock when the load-lock is
about to commit. Preferably also when guaranteed that the store-unlock
will also commit.

What about the M and W parts of the atomic RMW? Well, we can do those
speculatively, based on the speculative load data - all except the
store-unlock part. We can even do store-unlock forwarding to younger
instructions. But, of course, if we do this we may have more to repair
on contention.

If comp.arch allowed drawing, I would draw a graph showing the partial
ordering of operations. The operations are:

* R: load-lock getting the data unlocked
* R: load-lock getting an address lock, optional
* R: load-lock optionally releasing the address lock if too much
contention or deadlock
* R: load-lock irreversibly acquiring the address lock
* M: the modify operation - which may be done whenever load-lock
returns valid data
* W: store-unlock - placing address and data in store buffer
* W: store-unlock: forwarding to younger loads
* RW: draining the store buffer so that the RMW is sequentially consistent
* RW: Freye's rule snooping for possible changes to older loads
* RW: telling other processors to go away, the RMW is in flight, and
finally, committing the store-unlock


Anyway, my point is that atomic RMWs can be made very, very, cheap. We
have known how to do this since before P6 was started.

Certainly if no contention, as cheap as unlocked INC mem, except for the
minor perturbation due to effective store buffer size while draining.

Even if contention, pretty darned fast. Especially if the address lock
holdoff isn't just a retry, but is instead an "I will send you the line
in half a mo when I have finished with the RMW." In fact, you don't
need to reply when the address lock is held; you can just delay the reply.

Intel has made a lot of progress on this trajectory, but I don't think
that they have quite arrived at the ultimate.

And we have not even talked about external optimizations for locked
RMWs: doing things like sending the RMW outside, where it can be
combined. I am quite interested to see that the GPUs seem to be doing
something like this. Not quite all the way back to the NYU
Ultracomputer fetch-and-op combining, but getting there.


We are approaching the point where we will have all the mechanisms in
place, and will have to work on tuning predictors and/or providing
instruction modifiers (much as Dmitry proposes) to indicate what
strategy is expected to do best. We want to effortlessly transition
between locked atomic RMWs that are inside the processor and those
that are external; from low to high contention.

> Then, are there any fundamental obstacles to removing the memory fence
> (the store buffer drain)? Well, not actually removing it, but performing
> it lazily. That is:
> [first 4 steps are the same]
> 1. let's assume that the cache line is already in cache in E state
> 2. initial fetch marks the cache line as locked
> 3. RMW operation is performed
> 4. resulting value is placed into a store buffer
> 5. at this point the core can proceed with subsequent instructions in
> normal mode
> 6. when the store reaches the cache, the cache line is unlocked
>
> I understand that this will change semantics, so this should be
> implemented with a separate instruction or a new instruction prefix. As
> far as I see, the atomicity of the operation is preserved, the effective
> cost is basically equal to that of a normal instruction, and the
> semantics are enough to implement a lot of useful algorithms.
> Nevertheless, atomic RMW operations in most (all?) commodity hardware
> are combined with a full memory fence and are thus costly. So, are the
> reasons for this historical, or are there some fundamental obstacles to
> making atomic RMW virtually zero overhead?

No reason at all. This is a very good observation: the atomic RMW part
is separate from the memory ordering part.

IIRC Itanium did this, with their compare-and-swap operation. You had
to apply the appropriate memory fences before and after if you wanted it
to be sequentially consistent.

x86 atomic RMW has legacy semantics.

However, a warning: I have observed that there are at least two
different interpretations of "atomic". I use the narrow definition,
"atomic RMW wrt a single memory location" - it is either all done, or
all not done, from any single observer. However, many people,
especially software people, implicitly assume that atomic means globally
ordered - i.e. it is all done, or all not done, from all observers.

Apart from this: no reason why we can't separate the location atomicity
from the global memory ordering. However, the challenge lies in showing
the advantage of doing so: what codes work better with this new
semantic (approximately the Itanium semantic)?

The real gotcha is that often code examples are put forward that
*almost* work, that work on most practical systems - but not all. I.e.
codes that would break if the memory ordering for these
location-atomic-RMWs were relaxed to the maximum extent permitted.

Facets:

* What codes benefit when atomic RMWs are weakly ordered - location
ordered, but may appear in different orders on different processors?

* ... processor ordered?

* ... TSO ordered?

Since there is a definite trend for shared memory consistency models to
become stronger over time, there is a chance that you might add this
operation, and then implement sequential consistency for all loads and
stores.

Andy Glew

Sep 27, 2010, 1:38:46 AM
On 9/26/2010 2:08 PM, Andy Glew wrote:
> On 9/26/2010 4:25 AM, Dmitriy Vyukov wrote:
> In any case, you must acquire the address lock when the load-lock is
> about to commit. Preferably also when guaranteed that the store-unlock
> will also commit.

By the way: the concept of "address lock" is really only required on a
P6-like machine, where the load-locked and the store-unlocked are
separate uops, with a potentially non-zero time between them. In P6's
case, coming from separate places.

Another artifact of this RISC-like separation into load-locked and
store-unlocked is that the TLB translation used for the load-locked is
not necessarily the same as it would be for the store-unlock. Which
would be bad. On P6, this badness was prevented by serializing. More
recent machines jump through hoops to lock the TLB entry.

I think that the RISC mindset which led to me proposing separate
load-lock and store-unlock uops was one of my worst mistakes on P6.

Now, my preferred design is to have a single uop that performs the
address calculation for both the load and store parts of an RMW. If you
have a combined load/store queue, one address entry, but both read and
write data slots. If you have separate load and store queues, the same
uop allocates in both.

Something like

tmp := load-and-store-address( addressing-mode ) LINKED-STA-STD
tmp := tmp + 1
store-data( tmp ) LINKED-STA-STD

Now, the load-and-store-address is prevented from retiring until it has
gotten the cache line in an exclusive (e.g. M or E) state.

You can do away with the address lock of yore by ensuring that, if the
address is snooped, you delay HITM returning snoop data until the
store-data has been merged. Which is just another flavor of address
lock, I suppose.

Dmitriy Vyukov

Sep 27, 2010, 5:25:56 AM


Thank you, Mitch!
I see the problem with uncacheable data. What do you mean by "snoop/
probe response cannot be delayed that long"? In what sense can it not?
What if it's delayed for that long occasionally?
I think that it's Ok if the new instruction won't respect uncacheable
memory accesses at all. It's like each store is implicitly a store-
release, *unless* the preceding store is non-temporal, and then it's the
responsibility of the developer who issued the non-temporal store to
issue an SFENCE after it. However, I'm not sure whether it buys us
anything or not, because as far as I understand a store-unlock can't
outrun an uncacheable store in the store buffers.


> There is low probability that the original line is a) in the cache,
> and b) in an immediately writeable state. The probability (in both
> cases) rapidly approaches zero under contention for the atomic
> variable.

Well, the expected use case is that the line is in M state 99.9999% of
the time.
Contended shared data can't be made fast and scalable anyway, even if
it's implemented with plain MOVs. So who cares? Processor-local store
buffer drain is less of a problem here.


> > Is my understanding correct?
>
> One must still maintain memory order in the face of uncacheable locks.

Since I am talking about a potentially new hypothetical instruction, I
am ready to sacrifice it. Think that all memory is WB (if not, then that
guy must execute an M/SFENCE) and all cache lines are in E/M state (if
not, sorry, we can't help you).

--
Dmitriy V'jukov

MitchAlsup

Sep 27, 2010, 4:50:00 PM
On Sep 27, 12:38 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> Another artifact of this RISC-like separation into load-locked and
> store-unlocked is that the TLB translation used for the load-locked is
> not necessarily the same as it would be for the store-unlock.   Which
> would be bad. On P6, this badness was prevented by serializing. More
> recent machines jump through hoops to lock the TLB entry.

Or capture the physical address and reuse the physical address at
store time. {Which is a LOT easier than locking the TLB for more than
a cycle....}

> I think that the RISC mindset which led to me proposing separate
> load-lock and store-unlock uops was one of my worst mistakes on P6.

Interesting.....

> Now, my preferred design is to have a single uop that performs the
> address calculation for both the load and store parts of an RMW. If you
> have a combined load/store queue, one address entry, but both read and
> write data slots.  If you have separate load and store queues, the same
> uop allocates in both.

So, how do you propose doing a DCAS() in 1 µOP ?

Mitch

MitchAlsup

Sep 27, 2010, 4:57:10 PM
On Sep 27, 4:25 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Sep 26, 10:47 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> I see the problem with uncacheable data. What do you mean by "snoop/
> probe response cannot be delayed that long"? In what sense can it not?
> What if it's delayed for that long occasionally?

The bus/fabric protocol does not allow one to arbitrarily delay the
response to Snoops/Probe operations (or bad things like bus lock up
occur). A lot of this RMW stuff operates between the windows of
deadlock and livelock on the parts of the machine that the programmers
don't really see.

> Contended shared data can't be made fast and scalable anyway,

You are not aware of my ASF (sub)architecture, where it can be made
both fast and scalable, and with guarantees of forward progress under
contention. Proper use of ASF could enable most concurrent data
structure access methods to decrease the growth of latency wrt
contention from the typical BigO(n**2) to BigO(log(n)).

> Since I am talking about a potentially new hypothetical instruction, I
> am ready to sacrifice it. Think that all memory is WB (if not, then that
> guy must execute an M/SFENCE) and all cache lines are in E/M state (if
> not, sorry, we can't help you).

Too bad nobody implemented the real ASF. I think it would have solved
your problems.

Mitch

Paul A. Clayton

Sep 27, 2010, 7:49:17 PM
On Sep 27, 4:50 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> On Sep 27, 12:38 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
>
> > Another artifact of this RISC-like separation into load-locked and
> > store-unlocked is that the TLB translation used for the load-locked is
> > not necessarily the same as it would be for the store-unlock.   Which
> > would be bad. On P6, this badness was prevented by serializing. More
> > recent machines jump through hoops to lock the TLB entry.
>
> Or capture the physical address and reuse the physical address at
> store time. {Which is a LOT easier than locking the TLB for more than
> a cycle....}

Why is translation memoization not broadly implemented? Unlike
cache way memoization, translations usually do not change
(especially for a writable [non-COW] page), so invalidations
would be rare (a simpler non-exact but conservative method
would probably be adequate). I suspect that most objects
reside within a single page and have more than one access
while the base address is in a register, so enough work might
be avoided to justify the overhead.

It seems odd when the advantages of complex instructions are
not exploited.


Paul A. Clayton
just a technophile

Paul A. Clayton

Sep 27, 2010, 7:53:42 PM
On Sep 27, 4:57 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> On Sep 27, 4:25 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On Sep 26, 10:47 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> > I see the problem with uncacheable data. What do you mean by "snoop/
> > probe response cannot be delayed that long"? In what sense can it not?
> > What if it's delayed for that long occasionally?
>
> The bus/fabric protocol does not allow one to arbitrarily delay the
> response to Snoops/Probe operations (or bad things like bus lock up
> occur). A lot of this RMW stuff operates between the windows of
> deadlock and livelock on the parts of the machine that the programmers
> don't really see.

Would adding NAK to the protocol help with/solve this problem?

Andy Glew

Sep 27, 2010, 8:43:53 PM
On 9/27/2010 1:50 PM, MitchAlsup wrote:
> On Sep 27, 12:38 am, Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> Another artifact of this RISC-like separation into load-locked and
>> store-unlocked is that the TLB translation used for the load-locked is
>> not necessarily the same as it would be for the store-unlock. Which
>> would be bad. On P6, this badness was prevented by serializing. More
>> recent machines jump through hoops to lock the TLB entry.
>
> Or capture the physical address and reuse the physical address at
> store time. {Which is a LOT easier than locking the TLB for more than
> a cycle....}

That's what I have been wanting to do ever since I realized this was a
problem. However, there was resistance.

You can capture the translation (not just the physical address, but
also the permissions, minor quibble) in two ways:

In microcode:

saved_translation_in_register := translate_address(...)
tmp := load_locked(saved_translation)
tmp := tmp + 1 // for example
store_address_unlocking(saved_translation)
store_data(tmp)

with the usual implicit link between the store address and the store data.

By the way, I want translate_address() as a uop, and maybe an
instruction, in any case for other reasons.

Or you can rely on a side effect - load_locked placing the translation
in a register that store_unlock will pick up.

The side effect works, but I dislike side effects in general. Especially
if they are to non-renamed registers. As I explained in my first reply
to Dimitry (sp?), I can see at least two different ways to do even the
RMW speculatively - so I don't want to use a side effect register at
retirement.

Optimizing that microcode:

a) merging the translate_address and the store_address:

saved_translation_in_register :=
translate_address_and_use_as_store_address(...)
tmp := load_locked(saved_translation)
tmp := tmp + 1 // for example
store_data(tmp)

b) but since the saved translation is only used in one place:
tmp := translate_address_
__do_load_locked
__and_place_address_in_store_address_buffer(...)
tmp := tmp + 1 // for example
store_data(tmp)

c) for grins, since the store_data only has one input

tmp := translate_address_
__do_load_locked
__and_place_address_in_store_address_buffer(...)
store_data_with_alu_op( tmp + 1 )

>> Now, my preferred design is to have a single uop that performs the
>> address calculation for both the load and store parts of an RMW. If you
>> have a combined load/store queue, one address entry, but both read and
>> write data slots. If you have separate load and store queues, the same
>> uop allocates in both.
>

> So, how do you propose doing a DCAS() in 1 µOP ?


I don't even propose doing a store in a single uop. I'm the guy who
gave P6 separate store_address and store_data, and they still make sense:
* store_address often has two or more inputs,
seg + basereg + indexreg<<scale + offset
so to have a general purpose single input store uop
(I've got to learn how you typed the micro)
needs yet another input - and certainly puts you over the top
of a two input limit
* I'm fine on optimizing special cases, like store( M[reg] := const )
to be single uop, or taking advantage of uarchs with lots of inputs
* but, in general, most uarchs with single uop store are actually
2 uops from the point of view of execution. Double firing from the
scheduler, etc.
* you certainly do not want to delay the store address until the
store data is ready, or vice versa. They need to execute independently.

Anyway, CAS:

tmp := translate_address_
__do_load_locked
__and_place_address_in_store_address_buffer(...)
tmp := if( matches(tmp,compare_value) ) then store_value
else tmp
store_data( tmp )
dest_reg := tmp

Noting that the compare-and-swap often can fit within the uop template
for the store_data_with_alu_op. Especially nice, because then you can
cancel the store, avoiding dirtying, rather than write the old value.

---

Ah, but you asked about DCAS, didn't you, Mitch?

Yes, Mitch, you need something like ASF. Both of us have designed stuff
like ASF, me at least three times - one instance of which was full TM.
However, your ASF saw the light of day, got published and built,
right? I'm jealous of that.

The low hanging fruit flavor takes advantage of the load and store
buffer. You load up the entries for at least two loads and two stores
(or two entries, used for both loads and stores). You rely on Freye's
rule snooping to ensure that when the oldest arrives at retirement, the
other is also ready to go. You have acquired ownership for both.

If you want ordering, you drain the store buffers, as we started
discussing. If you don't want ordering, you don't need to.

We get the location atomicity because we have arrived at retirement with
two locations, both in M or E state. We now guarantee that both can be
merged into cache, guaranteed hitting, holding off snoops and evictions
of the affected lines while so doing. Short term holdoff, since all are
guaranteed to be hitting.

You can do this for any number of operations, so long as they fit in
your MOB (Memory Ordering Buffer) or equivalent.

(You're going to point out forward progress, right, Mitch?)

Note that I want this even without instruction set extensions. If I
have this "multi-line atomic commit" I can eliminate 20-50% of memory
traffic by extending write combining. The WC buffers for WB memory must
be evicted after switching to a different cache line because of memory
ordering. But with multi-line atomic commit, they don't need to be.

Heck, you don't even need to have the data in your cache. You just need
to "own" the data, guaranteeing that nobody else can access it without
giving you a chance to fix things up. You can own data that lives in
DRAM. There's a paper by Sun on this.

MitchAlsup

Sep 27, 2010, 9:55:04 PM
On Sep 27, 6:53 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:

> Would adding NAK to the protocol help with/solve this problem?

NAK goes a long way to resolving this problem only to open the
opportunities for other problems.

The right way to think about this is that "under tightly restricted
sets of circumstances" NAKs are very valuable. Outside of this
envelope NAKs can cause livelock and deadlock circumstances.

For the set of circumstances pertinent to this thread, microcode
execution of RMW can easily be considered "within" the tightly
controlled set of circumstances.

Mitch

MitchAlsup

Sep 27, 2010, 10:01:25 PM
On Sep 27, 6:49 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> Why is translation memoization not broadly implemented?  

If a user (or now supervisor) is allowed to hang onto a physical
address for more than "just a few cycles" some other supervisor/
hypervisor could have changed the MMU tables in the meantime, done a
TLB shootdown, and then the translation is no longer valid. Since the
time between any two unprivileged instructions can include
context switches to higher privilege levels, this is just "inherently
dangerous".

Sometimes, even allowing microcode to hang onto a physical address
"for a while" can end up being dangerus.

Then there are various security threats that can be expressed if there
is access at the architectural level (ISA) of real physical addresses.

Mitch

MitchAlsup

Sep 27, 2010, 10:11:05 PM
On Sep 27, 7:43 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> Heck, you don't even need to have the data in your cache.  You just need
> to "own" the data, guaranteeing that nobody else can access it without
> giving you a chance to fix things up.  You can own data that lives in
> DRAM.   There's a paper by Sun on this.

This was THE key insight I came to. Ownership (writability) is the
important aspect, not the actual presence in the cache! As long as you
can prevent anyone else from writing the cache lines, the rest is just
optimizing latency of the "event", and expressing the intent at the
instruction set level.

Mitch

A DMA device is not in a position (i.e. smart enough) to be able to
take a NAK. A side effect is that since you cannot prevent an
uncacheable DMA device from writing the line in memory, you cannot
extend this synchronization feature into the uncacheable domain, or
deal with uncacheable data. So, we defined ASF such that if you (the
programmer) NEED memory order other than that (somewhat weakly)
specified in ASF, you use an SFENCE or MFENCE as desired. The
[MS]FENCE could be placed as far down the code stream as just prior to
the RELEASE instruction (now renamed to something I forgot). So, you
don't have to push the write buffer until you know that the atomic
event will succeed. This helps the latency aspect in dealing with
uncacheables and RMWs.

Dmitriy Vyukov

Sep 28, 2010, 4:03:37 PM
On Sep 28, 12:57 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > On Sep 26, 10:47 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> > I see the problem with uncacheable data. What do you mean by "snoop/
> > probe response cannot be delayed that long"? In what sense can it not?
> > What if it's delayed for that long occasionally?
>
> The bus/fabric protocol does not allow one to arbitrarily delay the
> response to Snoops/Probe operations (or bad things like bus lock up
> occur). A lot of this RMW stuff operates between the windows of
> deadlock and livelock on the parts of the machine that the programmers
> don't really see.

:)
Well, but at least I want to form in my head some crude coarse-grained
performance model of how all that stuff operates at that level.


> > Contended shared data can't be made fast and scalable anyway,
>
> You are not aware of my ASF (sub)architecture, where it can be made
> both fast and scalable, and with guarantees of forward progress under
> contention. Proper use of ASF could enable most concurrent data
> structure access methods to decrease the growth of latency wrt
> contention from the typical BigO(n**2) to BigO(log(n)).
>
> Since I am talking about a potentially new hypothetical instruction, I
> am ready to sacrifice it. Think that all memory is WB (if not, then that
> guy must execute an M/SFENCE) and all cache lines are in E/M state (if
> not, sorry, we can't help you).
>
> Too bad nobody implemented the real ASF. I think it would have solved
> your problems.


Actually I have been aware of your ASF for some time now:
http://groups.google.com/group/comp.arch/browse_frm/thread/68a9fda9386048d7
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c1c6c6327aed79b6

I think it's a great thing! And it indeed would solve most of my
problems.


--
Dmitriy V'jukov

Dmitriy Vyukov

Sep 28, 2010, 4:15:58 PM

Hi Andy,

Thank you!
I think I need to re-read it several times more :) But I get the main
point: my understanding is not quite correct to say the least, but
atomic RMWs can still be made very cheap (so cheap that there will be
no need to economize on them).

> So, the question is, how do you do the load part of the atomic RMWs?

Personally I am Ok with both variants :) I was thinking about an eager
lock (I assume that the cache line is already in E/M state), but it's
actually irrelevant for me.

By atomicity I mean only atomicity - i.e. if several threads increment
a variable N times, the resulting value must be N, no more and no less. I
understand that then it's not LOCK ADD, but something along the lines
of LOCK_RELAXED ADD.

As for code examples that would benefit from it: is not atomic
reference counting more than enough?

--
Dmitriy V'jukov

Dmitriy Vyukov

Sep 28, 2010, 4:24:15 PM
On Sep 27, 1:08 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> We are approaching the point where we will have all the mechanisms in
> place, and will have to work on tuning predictors and/or providing
> instruction modifiers (much as Dmitry proposes) to indicate what
> strategy is expected to do best.  We want to effortlessly transition
> between locked atomic RMWs that are inside the processor and those
> that are external; from low to high contention.


Humm... for me it sounds quite counter-intuitive...
How can anything centralized be scalable in a distributed system with a
dozen processors, each with a dozen cores? It should create a
single point of contention. Moreover, each core will have to send
requests over some distance and then wait for results...
On the other hand, with how it's done now, each core is able to
process atomic RMWs purely locally, thus it scales to a great number of
processors/cores.


--
Dmitriy V'jukov

Dmitriy Vyukov

Sep 28, 2010, 4:39:59 PM
On Sep 28, 4:43 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> Yes, Mitch, you need something like ASF. Both of us have designed stuff
> like ASF, me  at least three times - one instance of which was full TM.
> However, your ASF saw the light of day, got published and built,
> right? I'm jealous of that.

I just want to let you know that we, people on this side of the
barricades, are with you!

--
Dmitriy V'jukov

Paul A. Clayton

Sep 28, 2010, 5:21:44 PM
On Sep 27, 10:01 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> On Sep 27, 6:49 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>
> > Why is translation memoization not broadly implemented?  
>
> If a user (or now supervisor) is allowed to hang onto a physical
> address for more than "just a few cycles" some other supervisor/
> hypervisor could have changed the MMU tables in the meantime, done a
> TLB shootdown, and then the translation is no longer valid. Since the
> time between any two unprivileged instructions can include
> context switches to higher privilege levels, this is just "inherently
> dangerous".

Thanks for the reply. As I wrote: "so invalidations would be rare (a
simpler non-exact but conservative method would probably be adequate)".
For large systems, permission removals and translation changes might be
frequent enough that the _really_ simple approach--invalidate all
memoizations on any such PTE update--makes memoization a power cost.
(Changing an accessed bit should not matter; clearing a dirty bit could
be problematic--but I assume such requires more extensive guards
anyway--but not setting a dirty bit; adding permissions should not be a
problem.) A bit for clear-all-memoizations-on-exit for each TLB entry
might be okay (??), perhaps with a generation id to allow old
memoizations to fade out (rather than using a reference count) (??).

(Even just a strong prediction could be useful, it seems, for
early cache tag comparison and perhaps reduction in the
number of TLB ways checked.)

> Sometimes, even allowing microcode to hang onto a physical address
> "for a while" can end up being dangerus.
>
> Then there are various security threats that can be expressed if there
> is access at the architectural level (ISA) of real physical addresses.

I can understand the desirability of maintaining the virtual
memory abstraction layer, but how is knowing a physical address
a security risk? Stale permissions are an obvious problem, but
I get the impression that you are referring to a much more
clever exploit.

Paul A. Clayton

Sep 28, 2010, 5:25:54 PM

Thanks (yet again).


Paul A. Clayton
--ignorance[Paul_A_Clayton];

Paul A. Clayton

Sep 28, 2010, 6:07:08 PM
On Sep 27, 10:11 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]

> A DMA device is not in a position (i.e. smart enough) to be able to
> take a NAK. A side effect is that since you cannot prevent an
> uncacheable DMA device from writing the line in memory, you cannot
> extend this synchronization feature into the uncacheable domain, or
> deal with uncacheable data. So, we defined ASF such that if you (the
> programmer) NEED memory order other than that (somewhat weakly)
> specified in ASF, you use an SFENCE or MFENCE as desired. The
> [MS]FENCE could be placed as far down the code stream as just prior to
> the RELEASE instruction (now renamed to something I forgot). So, you
> don't have to push the write buffer until you know that the atomic
> event will succeed. This helps the latency aspect in dealing with
> uncacheables and RMWs.

Could a really smart IOTLB help with this? If the target of
a write is a virtual address, the IOTLB might translate it to
an IO Hub local memory address (and/or cache it, perhaps using
virtual tags). (It seems it might be useful to distinguish
between different purposes of non-cacheable storage. I would
guess that the DMA kind is primarily meant to avoid cache
pollution, not to ensure that side-effects occur.)

Along similar lines, I wondered if a smart IOTLB could be
used to make page-pinning only a 'kiss of cowpox' not a
'Kiss of Death'. If an IOTLB could dynamically assign
pages from a free list, a huge number of virtual pages
could be 'locked'. (It would still be possible for a
write to page-fault--if the system software could not
provide pages to the free list fast enough to meet the
demand by the IOTLB--and read page-faults would be
possible; but software might be able to retry the IO
requests.)

(I also wonder if processor TLB COW support would be
worthwhile. Aside from COW, such might be used by a
user-level memory allocator to free and allocate pages.
A shared page free list might allow tighter memory
usage. [ISTR the BSD malloc tried to free pages back
to the OS. The above mechanism would simply put a
hardware managed buffer between the user memory
management and the OS.])

(Even further off-topic, could 'cache' pages be useful?
I.e., the software handles re-generation/fill and only
needs a low-overhead exception when the page has been
reclaimed for other uses. Rather than having system
software save and then restore the cache page, it
could just be dropped. Even if the cost of restoration
is greater than the cost of a save and restore, this
sort of caching could be a win if the probability of
reuse is low enough.)

Noob

Sep 29, 2010, 7:04:41 AM
Dmitriy Vyukov wrote:

> As far as I understand, atomic RMW operations on Intel IA-32/Intel64
> are implemented along the lines of: [...]

In 2007, Intel published a document titled
"Intel 64 Architecture Memory Ordering White Paper"

You might find it insightful. (I can provide a link
if you cannot locate it.)

Regards.

Tim McCaffrey

Sep 29, 2010, 4:30:50 PM
In article
<2a76b121-33d6-4eeb...@k13g2000vbq.googlegroups.com>,
Mitch...@aol.com says...
>
>On Sep 27, 6:49 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>> Why is translation memoization not broadly implemented?

>
>If a user (or now supervisor) is allowed to hang onto a physical
>address for more than "just a few cycles" some other supervisor/
>hypervisor could have changed the MMU tables in the meantime, done a
>TLB shootdown, and then the translation is no longer valid. Since the
>time between any two unprivileged instructions can include
>context switches to higher privilege levels, this is just "inherently
>dangerous".
>
>Sometimes, even allowing microcode to hang onto a physical address
>"for a while" can end up being dangerus.
>
>Then there are various security threats that can be expressed if there
>is access at the architectural level (ISA) of real physical addresses.
>


I would envision this done as a side car register to the base register (not
visible to the ISA).

So, for all registers that can be used as base registers:
add a "physical address of page that register points to"
add a "side car is valid" bit

the bit is set valid only when the register is used as a base register & the
offset is within the same page (this actually happens a lot).

The valid bit is reset on:
TLB invalidates (shoot downs, etc.)
Context switches.
Interrupts.
When the base register is modified (this could be relaxed to be: reset
the valid bit if bits outside the page offset are modified).

You could think of the sidecar registers like an L0 TLB.
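
Roughly, in C++ (a sketch; the register count and names are mine,
not part of a real design):

#include <cstdint>

struct SideCar {
    uint64_t phys_page;  // translation captured on last base-register use
    bool     valid;      // cleared on the events listed above
};

struct Core {
    SideCar sidecar[16];  // one per architectural base register

    // TLB invalidates, context switches, interrupts:
    void invalidate_all_sidecars() {
        for (int i = 0; i < 16; ++i)
            sidecar[i].valid = false;
    }

    // base register modified: drop just that register's entry
    void on_base_register_write(int reg) {
        sidecar[reg].valid = false;
    }
};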

- Tim

Paul A. Clayton

Sep 29, 2010, 7:07:50 PM
On Sep 29, 4:30 pm, timcaff...@aol.com (Tim McCaffrey) wrote:
[snip]

> The valid bit is reset on:
>         TLB invalidates (shoot downs, etc.)
>         Context switches.
>         Interrupts.
>         When the base register is modified (this could be relaxed to be: reset
> the valid bit if bits outside the page offset are modified).

I think resetting the valid bit could be the main
problem. Invalidating all translations/permissions
whenever any change is made to the page table could
be a problem in a larger system. I made some off-hand
suggestions for dealing with this, but I suspect no
simple solution exists.

By the way, 'context switches' could be handled as the
'base register is modified', especially for the
translation (the only time--that I can think of--that
an address would be passed between contexts would be
if the address had the same translation, e.g., in a
microkernel OS). Clearing just the permissions,
however, might not be worth the bother.

'Interrupts' would not need to clear the valid bits.
Interrupt handlers typically use the same PTEs (if
a Permission Lookaside Buffer was used in which
different handlers used different permission tables,
one would probably not want to try pre-loading
permissions, but with typical systems only a few
rings of privilege are supported and the permissions
are attached to the translation). One might not
want to bother keeping permissions for the other
privilege levels, but this would not require
invalidating the translation or even invalidating
the permission bits--the hardware would merely
have to know to which level the permissions applied.
(Execution path internal PTE changes would have to
appear as TLB invalidates--as they presumably have
to appear anyway.)

Andy Glew

Oct 1, 2010, 1:05:43 AM
On 9/27/2010 7:01 PM, MitchAlsup wrote:
> On Sep 27, 6:49 pm, "Paul A. Clayton"<paaronclay...@gmail.com> wrote:
>> Why is translation memoization not broadly implemented?
>
> If a user (or now supervisor) is allowed to hang onto a physical
> address for more than "just a few cycles" some other supervisor/
> hypervisor could have changed the MMU tables in the meantime, done a
> TLB shootdown, and then the translation is no longer valid. Since the
> time between any two unprivileged instructions can include
> context switches to higher privilege levels, this is just "inherently
> dangerous".

It depends on who is doing the memoization.

Hardware or microcode can put a physical address into a register and use
it for an arbitrary number of cycles ... up to the point where a TLB
shootdown may have occurred.

On current x86, that is the next interrupt, or the next instruction that
invalidates the TLB.

But it can be many, many, cycles.

I call this pseudo-pinning.
http://semipublic.comp-arch.net/wiki/Pseudo-pinning

---

On current x86, you can only do an MP TLB shootdown via inter-processor
interrupts.

On machines that have special instructions and bus cycles to cause TLB
invalidations, like Itanium and PowerPC, these also would have to
invalidate any pseudo-pinned translation.

--

User code can't easily do this, because there aren't standard ways of
arranging to get a user value invalidated by an interrupt.

OS code *could* do this, because it could hook to all IPIs.

But at the moment I am only aware of people having used pseudo-pinning
in microcode subsystems.

===

Why pseudo-pinning works: it's based on a property of modern processors
that I called speculative cacheability when I described it in the P6 EAS.

Basically, if any memory is marked WB, a memory type that allows
speculative accesses, software must assume that it *may* have been
pulled into the cache, even though the actual Von Neumann retirement
instruction stream has never accessed it.

Similarly wrt TLBs and the page table: if an address is in the page
table, the translation may have been installed in the TLBs.

But whereas WB memory is usually kept cache coherent, the TLBs are not snooped.

So if a translation is in the page tables and you change it, you must do
a TLB shootdown, using IPIs and/or INVLPGs and/or other TLB invalidation.

> Sometimes, even allowing microcode to hang onto a physical address
> "for a while" can end up being dangerus.

Pseudo-pinning, as I describe above, has been extensively used via
microcode patch and other prototyping mechanisms.

Andy Glew

Oct 1, 2010, 1:07:09 AM
On 9/29/2010 1:30 PM, Tim McCaffrey wrote:
> In article
> <2a76b121-33d6-4eeb...@k13g2000vbq.googlegroups.com>,
> Mitch...@aol.com says...
>>
>> On Sep 27, 6:49 pm, "Paul A. Clayton"<paaronclay...@gmail.com> wrote:
>>> Why is translation memoization not broadly implemented?
>>
>> If a user (or now supervisor) is allowed to hang onto a physical
>> address for more than "just a few cycles" some other supervisor/
>> hypervisor could have changed the MMU tables in the meantime, done a
>> TLB shootdown, and then the translation is no longer valid. Since the
>> time between any two unprivileged instructions can include
>> context switches to higher privilege levels, this is just "inherently
>> dangerous".
>>
>> Sometimes, even allowing microcode to hang onto a physical address
>> "for a while" can end up being dangerus.
>>
>> Then there are various security threats that can be expressed if there
>> is access at the architectural level (ISA) of real physical addresses.
>>
>
>
> I would envision this done as a side car register to the base register (not
> visible to the ISA).
>
> So, for all registers that can be used as base registers:
> add a "physical address of page that register points to"
> add a "side car is valid" bit
>
> the bit is set valid only when the register is used as a base register & the
> offset is within the same page (this actually happens a lot).
>
> The valid bit is reset on:
> TLB invalidates (shoot downs, etc.)
> Context switches.
> Interrupts.
> When the base register is modified (this could be relaxed to be: reset
> the valid bit if bits outside the page offset are modified).
>
> You could think of the sidecar registers like a L0 TLB.


Yep.


IIRC Todd Austin's PhD thesis proposed this.

Andy Glew

Oct 1, 2010, 1:15:32 AM


DMA devices more and more must be able to tolerate IOMMU TLB misses.
Maybe not a NACK, but they sure can be a big delay.

If you can't cut off the flow, then you can use an overflow bin:

DMA is streaming into IOMMU TLB hits.

Suddenly there is an IOMMU TLB miss.

If you can't wait while walking the page tables, just write the DMA
stuff streaming in into a fresh, blank, page of memory. You keep a
queue of such "catch pages" available.

When the DMA is done:

(1) If you have written a whole page, you have the option of updating
the IOMMU (and other) page tables with the catch page address.

(2) If not a full page DMA, or maybe just because, you may have to copy
from the catch page to the proper page returned by the IOMMU
TLB miss handler.

---

Depending on the memory ordering model for your DMA I/O - yes, they have
to worry about memory ordering - you may have to switch all pages to the
catch page list after an IOMMU TLB miss. It may not be acceptable to
make page 1 visible, not page 2, and then page 3 before page 2's IOMMU
TLB miss is handled.

---

This amounts to hardware DMA copy-on-write.


Andy Glew

Oct 1, 2010, 1:16:26 AM

Paul, I love it - great minds think alike!

Andy Glew

Oct 1, 2010, 1:39:51 AM

Your understanding is pretty good. Not 100% accurate, but, heck, it is
hard for guys who work in this stuff to remember exactly what
synchronization feature has been added to Penryn, then Nhm, then Wsm,
then Snb... Lots of small incremental changes, converging on the
ultimate, but not quite there yet, IIRC.

By the way, Mike Haertel says one of his patents in this area came
through. Might google.

> As for code examples that would benefit from it: is not atomic
> reference counting more than enough?

I think so. But you might get caught up in issues wrt RC versus GC.

By the way, reference counting, if done naively, DOES require ordering.

Consider:

P1:
creates another pointer to object Obj,
increments Obj.RefCnt++

P2: deletes a pointer to Obj, Obj.RefCnt--
if( Obj.RefCnt == 0 ) then free Obj.

I know, you have to be more careful with races between the decrement and
free. But even forgetting this, there is a problem:

If the P1 pointer creation has occurred,
but the Obj.RefCnt++ has not been globally observed,
then the P2 if( --Obj.RefCnt == 0 ) free(Obj)
may trigger - and then only later would Obj.RefCnt++ occur.

Basically, you need what I call a receiver fence before freeing an object:

instead of saying "force all of my stores to be globally visible",

you need
"force any pending ...atomic relaxed RMWs... from other processorrs to
be globally visible."

Now, a receiver fence can be a net performance win. But, it's a bit
harder to implement: you basically need to send a token to anybody who
you need to synchronize with. Because this can be even slower than a
sender store fence, you may want to split it - as you may want to split
any fence - into a "start fence" and "end fence" part, with other code
in between.

Anyway... I'm not aware of anyone who actually currently implements a
receiver fence. Even the supposedly worst case MF memory fence is not
strong enough.

So, the path of least resistance even for reference counting is to use
sender fences.

You can get around it, but at this point I would need to call up the
guys who coded it.
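
For reference, the sender-fence version that people actually code (a
sketch in C++0x atomics; the type is mine, and note that it relies on
the caller already holding a reference when incrementing - it is not
the receiver fence described above):

#include <atomic>

struct Obj {
    std::atomic<int> refcnt;

    void acquire() {
        // atomic wrt refcnt; no ordering needed - the caller already
        // holds a reference, so the count cannot hit zero here
        refcnt.fetch_add(1, std::memory_order_relaxed);
    }

    void release() {
        // release: order this thread's prior writes to the object
        // before the count can reach zero ...
        if (refcnt.fetch_sub(1, std::memory_order_release) == 1) {
            // ... acquire: order the free after all those writes
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;
        }
    }
};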

Andy Glew

Oct 1, 2010, 1:44:53 AM

Well, that's why we implemented the inside-the-processor local RMW
first on P6.

But the GPU guys have been getting quite a lot of mileage from remote
RMWs, saying that it scales better than the CPU atomics.

Basically, it amounts to contention:

If contention is low, use local atomics.

If contention is high, remote atomics can be better. Especially since
they allow fetch-and-op combining: e.g. in your reference counting
example, you can have 32 processors increment the counter
simultaneously, and they all see a serializable value.
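
Schematically, the combining looks like this (a toy sketch, names mine;
real combining happens in the fabric, not in software):

#include <atomic>
#include <vector>

// A combining node batches k simultaneous fetch_add(1) requests into
// one RMW on the far counter, then hands each requester a distinct
// value - serializable results, with O(1) traffic at the hot location.
std::vector<int> combined_fetch_add(std::atomic<int>& far_counter, int k) {
    int base = far_counter.fetch_add(k);  // one RMW instead of k
    std::vector<int> replies(k);
    for (int i = 0; i < k; ++i)
        replies[i] = base + i;
    return replies;
}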


By the way, remote atomics are not necessarily centralized. You can do
them in a decentralized manner, e.g. at each memory controller, or in
shared switching nodes or caches distributed in the fabric. People have
been dreaming of doing them in the DRAM.

Dmitriy Vyukov

Oct 1, 2010, 6:12:48 AM


Ah, I see, so you are talking about high contention.
And I get your point on decentralization. Potentially it can be
coupled with NUMA, right? So that RMWs on local memory are faster, and
RMWs on remote memory are slower.

For me, the ultimate way is to ensure low contention and then get some
support from hardware for exactly low contention (I think that it will
be more scalable anyway). But I understand that you (hardware guys)
have to support not only perfectly designed and carefully crafted
programs (which I too write on a daily basis in real-life :) ).

However, I do not think that sacrificing performance/scalability of
low-contention in the name of high-contention is reasonable. For
example, OS kernels extensively use per-CPU locks, and some server
software is designed with the same patterns. And if uncontended RMW
performance/scalability observably decreases, well, $hit hits the
fan.

--
Dmitriy V'jukov

Dmitriy Vyukov

Oct 1, 2010, 6:17:09 AM
On Sep 26, 3:25 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> So, are the reasons for this historical, or are there some fundamental obstacles to making atomic RMW virtually zero overhead?

I've found a reference that uncontended CAS on the Azul hardware takes
just a few cycles:
http://www.azulsystems.com/blog/cliff-click/2009-02-25-and-now-some-hardware-transactional-memory-comments

And they have quite a nice HTM. It's a pity that it's so narrowly
distributed.

--
Dmitriy V'jukov

Andy Glew

Oct 1, 2010, 11:03:50 AM


This is why I envisage it as dynamic:

If hitting the local cache in M or E state, use local atomic RMWs.

If missing the cache, send out a request that allows remote RMW
optimizations such as combining.

The big question is when to do the remote RMW completely externally,
and when to bring the data back into the local cache.

MitchAlsup

Oct 1, 2010, 12:22:10 PM
On Oct 1, 12:05 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> User code can't easily do this, because there aren't standard ways of
> arranging to get a user value invalidated by an interrupt.

Signal(InterruptRecoveryPoint);

// That is, there easily could be.

Mitch

EricP

Oct 1, 2010, 1:23:50 PM
Andy Glew wrote:
>
> Well, that's why we implemented the inside-the-processor local RMW
> first on P6.
>
> But the GPU guys have been getting quite a lot of mileage from remote
> RMWs, saying that it scales better than the CPU atomics.
>
> Basically, it amounts to contention:
>
> If contention is low, use local atomics.

But isn't the problem with contention due to the bus protocols
using try-time_out-retry which results in O(N^2) cost and bus flooding?

I was thinking that a smarter protocol and a directory cache
controller could give an O(N) cost and no flooding.

Specifically it could use "request forwarding" to create an
ordered chain of cpus waiting for a particular cache line.

The home directory controller, P1, coordinates setting up
the forwarding chain.
A cache line is owned by P2.
The cpu wanting to do an atomic, P3, sends a special read request
for a location to P1, indicating this is a read for an atomic op.
P1 sends a message to P2 telling it to forward the cache line to P3 when done.
If another cpu, P4, contends for the same line, P1 tells P3
to forward for P4 when it is done.

So the controller sets up a bucket brigade of cpus who
want the cache line.

Since each cpu can only be doing one atomic op at a time
there can only be one pending forwarding request each.
It costs only a single message from the home controller to the
current owner, so scales O(N) where N is the number of contenders.
The worst case delay for a read reply is O(N) also.
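
A toy model of the directory's bookkeeping (illustrative C++, names
mine; a real controller does this in hardware, and forward-progress
details are omitted):

#include <cstdio>
#include <map>

struct Directory {
    // per cache line: the cpu at the end of the forwarding chain
    std::map<long, int> tail;

    void atomic_read(long line, int cpu) {
        std::map<long, int>::iterator it = tail.find(line);
        if (it == tail.end()) {
            std::printf("grant line %lx to cpu %d\n", line, cpu);
        } else {
            // one message: tell the current tail to pass the line on
            std::printf("tell cpu %d: forward line %lx to cpu %d\n",
                        it->second, line, cpu);
        }
        tail[line] = cpu;  // cpu joins the end of the bucket brigade
    }

    void rmw_done(long line, int cpu) {
        // if nobody queued behind this cpu, the chain is empty again
        std::map<long, int>::iterator it = tail.find(line);
        if (it != tail.end() && it->second == cpu)
            tail.erase(it);
    }
};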

Eric


Andy Glew

Oct 1, 2010, 11:00:10 PM
On 10/1/2010 10:23 AM, EricP wrote:
> Andy Glew wrote:
>>
>> Well, that's why we implemented the inside-the-processor local RMW
>> first on P6.
>>
>> But the GPU guys have been getting quite a lot of mileage from remote
>> RMWs, saying that it scales better than the CPU atomics.
>>
>> Basically, it amounts to contention:
>>
>> If contention is low, use local atomics.
>
> But isn't the problem with contention due to the bus protocols
> using try-time_out-retry which results in O(N^2) cost and bus flooding?
>
> I was thinking that a smarter protocol and a directory cache
> controller could give an O(N) cost and no flooding.

That was my MS thesis.

(All of my research was on OOO CPUs, but when my then-wife graduated,
and said "we must move away", I got a quick and dirty multiprocessor
synchronization MSEE. I used an update cache protocol, but anticipated
queues.)

> So the controller sets up a bucket brigade of cpus who
> want the cache line.

Alain Kagi, by the way, accomplished the same thing by adding
appropriate delays.

---

Anyway, back to contention:

Local RMWs with test-and-test-and-set loops have O(N^2) bursts on a lock
release. With present protocols.

(Let's skip the discussion of why test-and-test-and-set makes this
worse, not better.)

If you switch to remote, external RMWs, with combining (the GPUs call
this coalescing), then N processors can do N RMWs in O(1) time. As
opposed to the O(N^2) that we get nowadays with stupid protocols, that
might be improved to O(N) with better protocols.

Actually, O(1) on a bus, O(log(N)) on a big system, where the base of
the log is V, the valency of the switching nodes. Nevertheless, still
better than O(N) or O(N^2).

Terje Mathisen

unread,
Oct 2, 2010, 10:27:41 AM10/2/10
to
Andy Glew wrote:
> Anyway, back to contention:
>
> Local RMWs with test-and-test-and-set loops have O(N^2) bursts on a lock
> release. With present protocols.
>
> (Let's skip the discussion of why test-and-test0and-set makes this
> worse, not better.)
>
> If you switch to remote, external RMWs, with combining (the GPUs call
> this coalescing), then N processors can do N RMWs in O(1) time. As
> opposed to the O(N^2) that we get nowadays with stupid protocols, that
> might be improved to O(N) with better protocols.
>
> Actually, O(1) on a bus, O(log(N)) on a big system, where the base of
> the log is V, the valency of the switching nodes. Neverthelless, still
> better than O(N) or O(N^2)

Indeed.

I wrote about an obvious idea some months back: Any really
critical/highly contended lock would scale easily if you made it
hierarchical, something which you could even do dynamically.

The key idea is that if you have (largish) N threads/processes who want
to acquire the same lock, you should set up at least one additional
layer, i.e. M+1 locks where M = ceil(sqrt(N)), each located in a
separate cache line of course.

Each thread would pick one of the M first-level locks, most easily
with a hash of the thread ID or something similar & unique; then only
after getting this lock would it be allowed to try to get the
real/main/second-level lock.

Since those currently very bad locks scale as O(N^2), using just two
layers would reduce this to 2*O(M^2) => O(N).

Using more levels you get as close to O(log(N)) as you'd like, at
additional memory cost of course.

The main downside is that with zero contention, acquiring a lock takes
twice as long since you have to do it twice, and the same for the
release part. This is most easily fixed by starting with a single level
and only switching to the dual-layer approach when you've hit real
contention.
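
(A minimal C++11 sketch of the two-level scheme, assuming a fixed M and
a modulo hash of the thread ID; a sketch, not a tuned implementation:)

#include <atomic>
#include <cstddef>

const std::size_t M = 8;  // ~ceil(sqrt(N)) for N around 64 contenders

struct PaddedLock {
    std::atomic<int> f;
    char pad[64 - sizeof(std::atomic<int>)];  // one lock per cache line
};

PaddedLock first_level[M];   // hashed-into outer layer
PaddedLock main_lock;        // the real/second-level lock

void spin_acquire(PaddedLock& l) {
    while (l.f.exchange(1, std::memory_order_acquire) != 0)
        while (l.f.load(std::memory_order_relaxed) != 0)
            ;  // spin on plain loads until it looks free
}

void spin_release(PaddedLock& l) {
    l.f.store(0, std::memory_order_release);
}

void hierarchical_acquire(std::size_t thread_id) {
    spin_acquire(first_level[thread_id % M]);  // contend with ~N/M peers
    spin_acquire(main_lock);                   // contend with <= M peers
}

void hierarchical_release(std::size_t thread_id) {
    spin_release(main_lock);
    spin_release(first_level[thread_id % M]);
}

Release happens in reverse order of acquisition; the uncontended cost is
exactly the extra lock round-trip mentioned above.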

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

nm...@cam.ac.uk

unread,
Oct 2, 2010, 11:06:53 AM10/2/10
to
In article <04sjn7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>I wrote about an obvious idea some months back: Any really
>critical/highly contended lock would scale easily if you made it
>hierarchical, something which you could even do dynamically.

Actually, I disagree, but am not denying that your approach would
scale far better than most other approaches.

>The key idea is that if you have (largish) N threads/processes who want
>to acquire the same lock, you should set up at least one additional
>layer, i.e. M+1 locks where M = ceil(sqrt(N)), each located in a
>separate cache line of course.
>

> ...


>
>Using more levels you get as close to O(log(N)) as you'd like, at
>additional memory cost of course.

And there's the rub.

You end up with O(Nlog(N)) accesses to O(N) locations, instead of
O(N^2) to O(1). Still a winner, but not as much of one as appears
at first sight. Not too bad if you have only one such lock, but
not good news if you have lots of them. "Would scale more easily"
I am happy with, but not "would scale easily".

It's very closely related to a technology that keeps being reinvented,
and failing dismally every time - virtual shared memory - in that,
each location can be regarded as a lock, and that's Bad News. Well
isolated codes (in the memory sense) work fine, of course, but then
people try running ones that aren't ....

Regards,
Nick Maclaren.

Andy Glew

unread,
Oct 2, 2010, 1:04:58 PM10/2/10
to
On 9/30/2010 10:39 PM, Andy Glew wrote:
> On 9/28/2010 1:15 PM, Dmitriy Vyukov wrote:
>
>> As for code examples that will benefit from it. Is not atomic
>> reference counting more than enough?
>
> I think so. But you might get caught up in issues wrt RC versus GC.
>
> By the way, reference counting, if done naively, DOES require ordering.

Thinking about this, this morning:

In reference counting, I think that it is probably common to implicitly
assume location consistency.

The usual formulation of location consistency is that there is a
serializable order for accesses to a particular memory location, such as
the reference count. There may or may not be global ordering, but it is
darned convenient to talk about a single order for a memory location.

MESI protocols tend to imply location consistency, since there can be
only one writer to a cache line at any time.

However, as luck would have it, I have recently been pushing my bitmask
consistency, which allows multiple writers to a cache line. This
discussion caused me to think more about location consistency.
Certainly, with my bitmask consistency, there can be several different
values for the same location in different caches at different times -
i.e. it is not physically, instantaneously, location consistent.
However, for most of the interconnect topologies I want to consider, it
is eventually physically consistent. And because of the bitmask
coherency rules, it is instantaneously logically location consistent -
you can always construct a location consistent order. You just can't see
it in the hardware. (You can build systems that are eventually
physically consistent but not logically location consistent. I think
multiple paths in a mesh network get that. I'm interested in these, but
tend to steer away from them.)

But, I don't want to talk about bitmask consistency. I want to talk
about fenceless atomic RMWs used for reference counting, and how it
interacts with location consistency, or the lack thereof.

Reference counting tends to be used as follows:

creating a new pointer P to an object O
increment the reference count O.RefCnt
tmp := O.RefCnt
tmp += 1
O.RefCnt := tmp
distribute P

deleting a pointer P to an object O
decrement the reference count O.RefCnt
tmp := O.RefCnt
tmp -= 1
O.RefCnt := tmp
if O.RefCnt = 0 then free(O)

Now, the RMW, the reference count increment or decrement, must be atomic
wrt the reference count. It must be all done or all not done. It should
not be possible for two such RMWs to overlap, both reading the same
value --- or else one of the RMWs will be lost. The set of all such
reference count RMWs is location consistent. Stronger: the set of all
such RMWs and stores to the reference count must be location consistent.
Which I think amounts to saying that there is location consistency,
since reads can always be interleaved.

... OK, I can see how this simplifies now. But I'll try to record the
rest of the thought sequence ...

The thing is, we need some sort of atomicity between the decrement of
the reference count, and the test for a zero reference count:

deleting a pointer P to an object O
decrement the reference count O.RefCnt
regtmp := O.RefCnt
regtmp -= 1
O.RefCnt := regtmp
return regNewRefCnt := regtmp
if regNewRefCnt = 0 then free(O)

(It was more convenient here to use newRefCnt, but most systems return
the oldRefCnt, where "old" is in this hypothetical location order; in
which case the free test is slightly tweaked.)

I have added the local variable regNewRefCnt and changed local tmp into
regtmp, to emphasize that these are not global memory, but are private
memory such as registers.

If using MESI, we can do the entire decrement reference count operation
locally. Since MESI implies location consistency. Dmitriy's desire for
a cheap fenceless RMW for reference counting can easily be satisfied.
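
(In C++11 terms, the decrement-and-test above maps to the sketch below;
the memory order is my choice, not a quote from any implementation. On
a MESI machine with the line already Exclusive/Modified, the fetch_sub
is the cheap, purely local RMW being discussed:)

#include <atomic>

struct Object {
    std::atomic<long> RefCnt;
    // ... payload ...
};

// fetch_sub is the single atomic RMW; it returns the old value, so
// "== 1" is the "new value == 0" test in the location-consistent order.
void release_ref(Object* O) {
    if (O->RefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete O;  // i.e. free(O)
}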

If we are not using MESI, but are using some sort of external
synchronization:

The local processor does not need to wait until the atomic RMW is
performed before it goes on to the next instruction. I.e. it does not
need to fence. It can just shoot off an atomic RMW that goes into the
interconnect. Certainly this will work for incrementing a reference
count, where we don't really care about the reference count value
(assuming that we are not looking for integer overflow).

It'll even work for decrementing the reference count. So long as you
are not looking at the decremented value to decide to free the object.

However, if you are looking at the decremented value to decide to free
the object - that's another matter. If you assume location consistency,
i.e. if you assume that the first time you see the reference count as
zero it is freeable - then you have to enforce location consistency at
that point. Which may require slowing the processor down.

I.e. it is not the atomic RMW that might slow the processor down. It is
returning the value of the memory location that is being RMW'ed that may
be costly. As usual, it is not the RMW that is costly, it is the memory
ordering. Even just for location consistency.

Now, even that doesn't need to be THAT costly. E.g. we could use
dataflow: allow younger local processor ops that do not depend on the
value of the reference count to proceed, or speculate even if they do
depend. But I'm a dataflow guy; many people would just say "fence",
which might be a bigger performance cost (if they can figure out what
buffers need to be drained to establish the appropriate location
consistency, as I alluded to in a previous post where I talked about
receiver synchronization).

The more important point is, if you don't need to look at the values
involved in the RMW, you don't need to build hardware to make looking
at it fast. So I can imagine systems where
atomic increment memory location
that does not return the old value, or even a condition code, is faster than
atomic decrement memory location returning zero indication
(It isn't the inc/dec that makes the difference; it is returning an
indication of the true value.)

I.e. location-atomic RMWs can be "fire and forget". Whereas if you
need to act on the value, they can't be.

I can even imagine asymmetric location ordering: return quickly as
soon as you can tell that the value cannot be zero, but require
establishment of full location consistency if the value may be zero.

The question arises as to how this should be embedded in the instruction
set. Relatively more complex operations - one might even say CISCy -
that do the RMW and enforce the location consistency may have a
performance advantage. As compared to "RISCy" operations where
separate fences may be added by the user where necessary.

===


Although I have arrived at an understanding above that satisfies me for
now, I'd like to share a thought experiment that I explored while
thinking of the above:

Imagine that you are doing 100 location-atomic increments and 50
location-atomic decrements to a location. Let's say that the initial
value was 50.

Think of these as concurrent operations. Something like

Object Obj;
Object* p[200];
for(i=0;i<50;i++) p[i] = &Obj;
PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
PARDO for(i=0;i<100;i++) p[50+i] = &Obj;
END PAR

i.e. all of these reference and dereference operations, with
accompanying referencecounter increment and decrement, are being done in
parallel.

Obviously, at the end of all of these operations, the final reference
count should be 50+100-50 = 100. Nonzero, so the object should not be freed.

But many "PAR" semantics say that the code should be correct for any
arbitrary serialization. So, if we did all the decrements first, the
reference counter would go to 0 before increasing past zero again.

Now, you might say that the reference count provides an implicit
dependency between objects, so such code should never have been PAR'ed
in the first place. Maybe so. Maybe this is an argument for GC,
garbage collection that is never logically in parallel with other
operations. (Such GC may be actually concurrent, but logically not;
logically, each chunk of GC occurs between non-GC operations.)

I don't like this restriction. Something similar happens with malloc.
Operations that, from the point of view of the high level programmer,
are independent, are not because of "innards". Mutable members do the
same.

The mechanism for establishing location consistency for "fenceless"
atomic RMWs discussed above might provide a true serialization order,
but it doesn't solve the problem: hardware might have established a
location order, so that we can unambiguously see that RefCnt has become
0, but that order might still be arbitrary within the PAR, and the value
might climb back up past 0.

I think that I prefer that software take responsibility, by deferring
the free'ing until out of the PAR clause.

I've been toying with the idea of hardware knowing what atomic RMWs
might be in flight. That's almost what the location consistency
establishment is doing. But this is not enough: each of the PAR'ed
statements may not actually be in the RMW, but may be committed to doing
so. In some ways, a PAR'ed clause, where statements have implicit
atomic semantics. "I'm going to do this statement no matter what..."

Interestingly, think about the above when you have Myria's COW forking
semantics. No problem. But if you do
Object Obj;
Object* p[200];
for(i=0;i<50;i++) p[i] = &Obj;
PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
END PAR
the object has no references, but never gets freed.

===

Some people might throw up their hands and say that this is an argument
for IBM mainframe style sequential consistency. At least that's not
ambiguous.

I'm not so sure. I *like* saying things like

Object Obj;
Object* p[200];
for(i=0;i<50;i++) p[i] = &Obj;
PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
PARDO for(i=0;i<100;i++) p[50+i] = &Obj;
END PAR

because I find them easier to think about.


The thing is, we are trying to bridge the difference between

"I want to use PAR clauses, where I know that the initial states are
fine, and the final state is fine - and I want to guarantee that the
system will find a way to implement it so that none of any intermediate
states cause a problem"

and

"You are allowed to use PAR clauses, but only if you can guarantee that
there is no interference between the statements".

The former is programmer friendly - for the user programmer, although
perhaps not so much for the facilities inside the PAR. The latter is
implementer friendly, by placing the onus on the user programmer.

I keep thinking about trying to annotate things like

deleting a pointer P to an object O
decrement the reference count O.RefCnt
tmp := O.RefCnt
tmp -= 1
OUT-OF-PAR: O.RefCnt := tmp
OUT-OF-PAR: if O.RefCnt = 0 then free(O)


but I don't like 2 level, inside PAR and outside PAR, concepts.

Terje Mathisen

unread,
Oct 2, 2010, 3:49:02 PM10/2/10
to
nm...@cam.ac.uk wrote:
> In article<04sjn7-...@ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>
>> I wrote about an obvious idea some months back: Any really
>> critical/highly contended lock would scale easily if you made it
>> hierarchical, something which you could even do dynamically.
>
> Actually, I disagree, but am not denying that your approach would
> scale far better than most other approaches.
>
>> The key idea is that if you have (largish) N threads/processes who want
>> to acquire the same lock, you should set up at least one additional
>> layer, i.e. M+1 locks where M = ceil(sqrt(N)), each located in a
>> separate cache line of course.
>>
>> ...
>>
>> Using more levels you get as close to O(log(N)) as you'd like, at
>> additional memory cost of course.
>
> And there's the rub.

I think we agree: Using one additional layer is a reasonable quick fix
for a single bad lock, getting that one operation down from O(NxN) to
O(N), without wasting huge amounts of memory or incurring massive
amounts of memory traffic.

> You end up with O(Nlog(N)) accesses to O(N) locations, instead of
> O(N^2) to O(1). Still a winner, but not as much of one as appears
> at first sight. Not too bad if you have only one such lock, but
> not good news if you have lots of them. "Would scale more easily"
> I am happy with, but not "would scale easily".

Agreed. It is a workaround, not a proper solution.

Dmitriy Vyukov

unread,
Oct 4, 2010, 7:38:20 AM10/4/10
to
On Oct 1, 9:39 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> > As for code examples that will benefit from it. Is not atomic
> > reference counting more than enough?
>
> I think so.  But you might get caught up in issues wrt RC versus  GC.
>
> By the way, reference counting, if done naively, DOES  require ordering.

The acquire operation does not require any fences, because the thread
does not actually acquire anything - it already owns a reference to the
object. The release operation requires an acquire/release fence
(depending on the counter value), but does not require the expensive
store-load fence we are talking about here.


> Consider:
>
> P1:
>      creates another pointer to object Obj,
>      increments Obj.RefCnt++
>
> P2: deletes a pointer to Obj, Obj.RefCnt--
>      if( Obj.RefCnt == 0 ) then free Obj.
>
> I know, you have to be more careful with races between the decrement and
>   free.  But even forgetting this, there is a problem:
>
> If the P1 pointer creation has occurred,
> but the Obj.RefCnt++ has not been globally observed,
> then the P2 if( --Obj.RefCnt == 0 ) free(Obj)
> may trigger - and then only later would Obj.RefCnt++ occur.


It's incorrect usage.


> Basically, you need what I call a receiver fence before freeing an object:
>
> instead of saying "force all of my stores to be globally visible",
>
> you need
> "force any pending ...atomic relaxed RMWs... from other processorrs to
> be globally visible."


It won't help; the atomic RMW might not even start.
Anyway, it's a programming error: reference counting with basic thread
safety requires that a thread creating another reference already own at
least one reference. So the problem you described can't happen.
There is reference counting with strong thread safety that indeed
allows a thread to acquire a reference when it does not own any yet.
That requires a double-word atomic RMW operation, still without a
store-load fence.

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
Oct 4, 2010, 7:43:04 AM10/4/10
to
On Oct 2, 9:04 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> Obviously, at the end of all of these operations, the final reference
> count should be 50+100-50 = 50.  Nonzero, so the object should not be freed.
>
> But many "PAR" semantics  say that the code should be correct for any
> arbitrary serialization.  So, if we did all the decrements first, the
> reference counter would go to 0 before increasing past zero again.

> I don't like this restriction.  Something similar happens with malloc.


> Operations that, from the point of view of the high level programmer,
> are independent, are not because of "innards".   Mutable memobers do the
> same.


For me, it's quite reasonable, because it means that my program is
correct under serial single-threaded execution. That's a base point
from which I can start further reasoning.
And if my program is not correct under serial execution, yet should
become correct under parallel execution... well, I'm not sure how I
could possibly reason about it.


--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
Oct 4, 2010, 7:51:24 AM10/4/10
to
On Oct 2, 9:04 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> I.e. location-atomic RMWs can be "fire and  forget".  Whereas if you
> need to act on the value, they can't be.

Yes. But note that I always need to act on values from at least some
atomic RMW operations on a location. Otherwise, I would create per-CPU/
thread split counters, operate on them with plain loads/stores, and
periodically/episodically aggregate them.
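
(For illustration, a minimal sketch of such split counters, assuming
C++11, a fixed slot per thread, and a reader that tolerates a
momentarily stale sum:)

#include <atomic>

const int kSlots = 64;  // assumed upper bound on threads

// One counter per thread, each on its own cache line.
struct alignas(64) Slot { std::atomic<long> v; };
Slot counters[kSlots];

// Hot path: plain load + store on a thread-private line -
// no RMW, no store-load fence, no contention.
void inc(int tid) {
    long cur = counters[tid].v.load(std::memory_order_relaxed);
    counters[tid].v.store(cur + 1, std::memory_order_relaxed);
}

// Episodic aggregation: the sum is approximate while writers run.
long read_approx() {
    long sum = 0;
    for (int i = 0; i < kSlots; i++)
        sum += counters[i].v.load(std::memory_order_relaxed);
    return sum;
}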

Consequently, even "fire and forget" operations must be subject to
release ordering imposed by non "fire and forget" operations:

// thread already owns 1 reference to it
object_t* obj = ...;
...
// fire-and-forget
atomic_add(&obj->rc, 1, memory_order_relaxed);
...
// not fire-and-forget, because we need the value
if (1 == atomic_fetch_sub(&obj->rc, 1, memory_order_acq_rel))
    delete obj;

fire&forget atomic_add() must not sink below not-fire&forget
atomic_fetch_sub(). I.e. one can't completely offload atomic_add() to
some external device.


--
Dmitriy V'jukov

Andy Glew

unread,
Oct 4, 2010, 10:29:08 AM10/4/10
to

Here's an example of notations that have non-serial semantics:

Swapping two numbers

a,b := b,a;
or
a := b, b := a;
or
PAR BLOCK
PAR CLAUSE a := b END PAR CLAUSE
PAR CLAUSE b := a END PAR CLAUSE
END PARBLOCK


rather than
tmp := a
a := b
b := tmp


Can be extended to rotations, etc.

ptr->next := newnode,
newnode->next := ptr->next,
newnode->prev := ptr,
ptr->next->prev := newnode

Sure, you can always convert this into serial code, adding temporaries,
etc. But such conversion can be error prone, when done by a human. I
once saw a study that said that more than 10% of the bugs in code
written in a high level language were code ordering bugs.

And converting from the parallel form to the serial form is trivially
automatable.

---

I think that all I am doing is wondering if you can use locks and memory
in general in such parallel form.

Or, if you want to go functional / single assignment, you have to give
up memory and state and locks.

Andy Glew

unread,
Oct 4, 2010, 10:31:07 AM10/4/10
to

> Dmitriy V'jukov

More precisely: if we offload, we must enforce ordering externally.

Tim McCaffrey

unread,
Oct 6, 2010, 12:19:17 AM10/6/10
to
In article <Q4adnbjz7s-g-zrR...@giganews.com>,
"newsgroupatcomp-arch.net" says...

>
>On 9/30/2010 10:39 PM, Andy Glew wrote:

>Think of these as concurrent operations. Something like
>
> Object Obj;
> Object* p[200];
> for(i=0;i<50;i++) p[i] = &Obj;
> PAR
> PARDO for(i=0;i<50;i++) p[i] = 0;
> PARDO for(i=0;i<100;i++) p[50+i] = &Obj;
> END PAR
>
>i.e. all of these reference and dereference operations, with
>accompanying referencecounter increment and decrement, are being done in
>parallel.
>
>Obviously, at the end of all of these operations, the final reference
>count should be 50+100-50 = 100. Nonzero, so the object should not be freed.
>
>But many "PAR" semantics say that the code should be correct for any
>arbitrary serialization. So, if we did all the decrements first, the
>reference counter would go to 0 before increasing past zero again.

When the count goes to zero, then it really does have no references; if
somebody is incrementing the counter, that means somebody kept a
reference. You can't get another reference if there are none around.

What you are concerned about is interesting, but if the reference
counting (and pointer management) is done correctly, it should never happen.

- Tim

Robert Myers

unread,
Oct 6, 2010, 1:00:19 AM10/6/10
to
On Oct 2, 1:04 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:

>
> I don't like this restriction.  Something similar happens with malloc.
> Operations that, from the point of view of the high level programmer,
> are independent, are not because of "innards".

I'm going to stick my head up above the wall here, even though it may
get shot off as a result.

It's comments like this one that lead me to conclude that memory
semantics are broken and unfixable.

Even if I'm wrong and, high on some mountain, there might be a wizard
in a cave who could always get it right for an arbitrary c-like
language, compiler, and hardware, I conclude that no realistic system
of software production and management can hope to avoid subtle errors
using existing tools.

I await the obligatory, "You don't know what you're talking about"
comment, which I deserve, because I really don't. Every time I try to
follow one of these conversations, though, I come away with the same
conclusion.

Robert.

Andy Glew

unread,
Oct 6, 2010, 1:13:59 AM10/6/10
to


All that I'll say back is:

1) message passing has very similar issues. You have to worry about a
"message passing ordering model", just like you have to worry about a
"memory ordering models".

Oh, heck, I can't resist: a "message passing ordering model" is all
about when message passing messages can pass other message passing
messages. Passably. Possibly?

2) The complexities of weak memory ordering models - the confusion
resulting from and in posts such as mine that you were responding to -
is the sort of thing that led to the desire to enforce strong memory
ordering models, e.g. at IBM.

Andy Glew

unread,
Oct 6, 2010, 1:19:39 AM10/6/10
to
On 10/5/2010 10:00 PM, Robert Myers wrote:


By the way, the example where you snipped my comment has nothing to do
with memory semantics. It solely has to do with parallelism. The
"hidden state with side effects" could be implemented via message
passing, not memory.

Does it follow, since I corrected s/memory semantics/parallel semantics/
in the statement "memory semantics are broken and unfixable",

that after making this correction the statement should read
"parallel semantics are broken and unfixable"

?

I hope not. I *like* parallel semantics.

I keep mulling over this sort of issue because I want to find the
boundaries of what parallel semantics can and cannot express.

And I want to find automatic ways of detecting statements that should
not be expressed with parallel semantics.

Andy Glew

unread,
Oct 6, 2010, 1:30:21 AM10/6/10
to

The flaw in my example is that I was doing reference counting of
pointers to non-heap Object Obj.

If it was recast as:

Object* objptr = new Object();
Object* p[200];
for(i=0;i<50;i++) p[i] = objptr;

PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
PARDO for(i=0;i<100;i++) p[50+i] = objptr;
END PAR

then the error becomes more obvious: you should not ever let the
reference count become 0, since objptr is still in scope.

More precisely:

reference_counted_ptr<Object> objptr = new Object();
reference_counted_ptr<Object> p[200];
for(i=0;i<50;i++) p[i] = objptr;

PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
PARDO for(i=0;i<100;i++) p[50+i] = objptr;
END PAR

would not have the problem.

But

Object* objptr = new Object();
reference_counted_ptr<Object> p[200];
for(i=0;i<50;i++) p[i] = objptr;

PAR
PARDO for(i=0;i<50;i++) p[i] = 0;
PARDO for(i=0;i<100;i++) p[50+i] = objptr;
END PAR

would, since mixing non-RC'ed and RC'ed smart pointers is always dangerous.
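
(For concreteness, a minimal sketch of what such a reference_counted_ptr
might look like, with an intrusive atomic count; the layout, the
assumption that T carries a RefCnt starting at zero, and the memory
orders are all mine:)

#include <atomic>

template<class T>
class reference_counted_ptr {
    T* p;  // assumes T has a std::atomic<long> RefCnt member, initially 0
    void inc() { if (p) p->RefCnt.fetch_add(1, std::memory_order_relaxed); }
    void dec() {
        if (p && p->RefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete p;
    }
public:
    reference_counted_ptr(T* q = 0) : p(q) { inc(); }
    reference_counted_ptr(const reference_counted_ptr& o) : p(o.p) { inc(); }
    reference_counted_ptr& operator=(reference_counted_ptr o) {
        T* t = p; p = o.p; o.p = t;  // copy-and-swap; o's dtor drops old p
        return *this;
    }
    ~reference_counted_ptr() { dec(); }
    T* get() const { return p; }
};

Assigning through a raw Object* silently bypasses inc()/dec(), which is
exactly the danger noted above.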

---

Nevertheless, the observation that serializations of parallel code with
side effects are problematic still stands.

nm...@cam.ac.uk

unread,
Oct 6, 2010, 3:50:21 AM10/6/10
to
In article <gq-dnZ3jX45ymzHR...@giganews.com>,

Andy Glew <"newsgroup at comp-arch.net"> wrote:
>On 10/5/2010 10:00 PM, Robert Myers wrote:
>>
>> Even if I'm wrong and, high on some mountain, there might be a wizard
>> in a cave who could always get it right for an arbitrary c-like
>> language, compiler, and hardware, I conclude that no realistic system
>> of software production and management can hope to avoid subtle errors
>> using existing tools.
>
>By the way, the example where you snipped my comment has nothing to do
>with memory semantics. It solely has to do with parallelism. The
>"hidden state with side effects" could be implemented via message
>passing, not memory.

Agreed. But isn't memory access really message-passing, anyway? :-)

>Does it follow, since I corrected s/memory semantics/parallel semantics/
>in the statement "memory semantics are broken and unfixable",
>
>that after making this correction the statement should read
>"parallel semantics are broken and unfixable"

No. It means two things:

1) Starting from an extremely bad serial model, you can't get
to anywhere except broken and unfixable parallel semantics. The
relevant terms are "C-like" and "existing".

2) If you want decent parallel semantics, you have to both
design them in from scratch and limit the flexibility to a model
that you can design and use correctly.

I use Hoare's "Communicating Sequential Processes" to teach this
in some of my classes - i.e. I tell the kiddies that, if they
think that they can use arbitrary parallelism correctly, they
should read that first. It is 260-odd pages and not easy going
even for mathematicians ....


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Oct 6, 2010, 5:00:19 AM10/6/10
to
Robert Myers wrote:
> It's comments like this one that lead me to conclude that memory
> semantics are broken and unfixable.

Broken: Yes. Unfixable: No.


>
> Even if I'm wrong and, high on some mountain, there might be a wizard
> in a cave who could always get it right for an arbitrary c-like
> language, compiler, and hardware, I conclude that no realistic system
> of software production and management can hope to avoid subtle errors
> using existing tools.
>
> I await the obligatory, "You don't know what you're talking about"
> comment, which I deserve, because I really don't. Every time I try to
> follow one of these conversations, though, I come away with the same
> conclusion.

In the particular case of memory allocation you do know what you are
talking about:

Every single highly threaded application I've read about, particularly
in the context of optimization, has been forced to use some form of
custom allocation strategy:

Having a single global pool simply doesn't scale, and can perform really
badly if you have many tiny allocations (less than a cache line
including overhead).

The solution is quite obvious: Memory needs to be strongly clustered
close to the cpu that needs to use it.
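
(A minimal sketch of that clustering, assuming C++11 thread_local and
one fixed size class; real allocators add size classes and handling for
frees that arrive from other threads:)

#include <cstdlib>
#include <mutex>
#include <vector>

const std::size_t kBlock = 64;   // one size class, >= a cache line

std::mutex global_mtx;
std::vector<void*> global_pool;              // shared slow path

thread_local std::vector<void*> local_pool;  // private fast path

void* alloc_block() {
    if (!local_pool.empty()) {   // common case: purely local, no sharing
        void* p = local_pool.back();
        local_pool.pop_back();
        return p;
    }
    {   // rare case: refill from the shared pool under a lock
        std::lock_guard<std::mutex> g(global_mtx);
        if (!global_pool.empty()) {
            void* p = global_pool.back();
            global_pool.pop_back();
            return p;
        }
    }
    return std::malloc(kBlock);
}

void free_block(void* p) {
    // NB: frees arriving from *other* threads need extra machinery.
    local_pool.push_back(p);
}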

Communication between cores needs to be obviously separate from local
memory accesses.

Does this sound like I'm advocating something like clusters and MPI?

Yes. :-)

Andy Glew

unread,
Oct 6, 2010, 10:23:38 AM10/6/10
to
On 10/5/2010 10:30 PM, Andy Glew wrote:

> More precisely:
>
> reference_counted_ptr<Object> objptr = new Object();
> reference_counted_ptr<Object> p[200];
> for(i=0;i<50;i++) p[i] = objptr;
> PAR
> PARDO for(i=0;i<50;i++) p[i] = 0;
> PARDO for(i=0;i<100;i++) p[50+i] = objptr;
> END PAR
>
> would not have the problem.
>
> But
>
> Object* objptr = new Object();
> reference_counted_ptr<Object> p[200];
> for(i=0;i<50;i++) p[i] = objptr;
> PAR
> PARDO for(i=0;i<50;i++) p[i] = 0;
> PARDO for(i=0;i<100;i++) p[50+i] = objptr;
> END PAR
>
> would, since mixing non-RC'ed and RC'ed smart pointers is always dangerous.
>
> ---
>
> Nevertheless, the observation that serializations of parallel code with
> side effects are problematic still stands.

Strange:

Some data types, and mixed usages thereof in parallel code, have
acceptable semantics for parallelism. Like uniformly using
reference_counted_ptr<Object>, where we only do something special when
crossing to 0.

In other cases, such as mixing reference and non reference counted
pointers, parallel semantics get all boo-boo-ed up.

Q: can we define the rules for what is allowable, and what is not?

Perhaps what we need to say is that only "safely parallelizable"
datatypes, or combinations thereof, should be alive across the boundary
of a PAR...ENDPAR clause.

(By "alive across" I mean "alive into", or "alive out of". Unsafe
variables may be defined before and used after, but just should not be
used inside the parallel clause.)


Rob Warnock

unread,
Oct 7, 2010, 4:41:56 AM10/7/10
to
<nm...@cam.ac.uk> wrote:
+---------------

| 2) If you want decent parallel semantics, you have to both
| design them in from scratch and limit the flexibility to a model
| that you can design and use correctly.
+---------------

Hence Dijkstra's Turing Award lecture, "The Humble Programmer" (q.v.).

+---------------


| I use Hoare's "Communicating Sequential Processes" to teach this
| in some of my classes - i.e. I tell the kiddies that, if they
| think that they can use arbitrary parallelism correctly, they
| should read that first. It is 260-odd pages and not easy going
| even for mathematicians ....

+---------------

That's certainly a good book!

Dijkstra also had a number of good (often short) papers on taming
parallelism. One of my favorites [though I confess I may be unduly
influenced by the cute title!] has always been this one:

http://www.cs.utexas.edu/users/EWD/ewd04xx/EWD464.PDF
EWD464: A new elephant built from mosquitos humming in harmony
Edsger W. Dijkstra, pages 79-83 of "Selected Writings on
Computing: A Personal Perspective", Springer-Verlag, 1982.
ISBN 0-387-90652-5


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

nm...@cam.ac.uk

unread,
Oct 7, 2010, 7:38:39 AM10/7/10
to
In article <6vqdnasVb_dJGjDR...@speakeasy.net>,

Rob Warnock <rp...@rpw3.org> wrote:
>
>Hence Dijkstra's Turing Award lecture, "The Humble Programmer" (q.v.).

I regret that I never met him. His attitude to program design
and software engineering is one that I regard highly.

>Dijkstra also had a number of good (often short) papers on taming
>parallelism. One of my favorites [though I confess I may be unduly
>influenced by the cute title!] has always been this one:
>

> EWD464: A new elephant built from mosquitos humming in harmony

Thanks for the pointer. I wish that more computer scientists would
read Dijkstra. I am giving a very short course on software design,
and am NOT impressed by the methodology in books like Code Complete
and many computer science courses. They recommend including checks
until you have debugged the code, removing almost all of the checks,
and then shipping the result. There is an apocryphal Dijkstra
aphorism about car seat-belts that I cannot refind ....


Regards,
Nick Maclaren.

Jean-Marc Bourguet

unread,
Oct 7, 2010, 8:31:47 AM10/7/10
to
nm...@cam.ac.uk writes:

I remember something -- don't know if it is from Dijkstra or someone else
-- along the line of "Dropping tests from production code is like wearing
life jackets on shore but not at sea."

Yours,

--
Jean-Marc

Bakul Shah

unread,
Oct 7, 2010, 1:14:36 PM10/7/10
to
On 10/7/10 1:41 AM, Rob Warnock wrote:
> <nm...@cam.ac.uk> wrote:
> +---------------
> | 2) If you want decent parallel semantics, you have to both
> | design them in from scratch and limit the flexibility to a model
> | that you can design and use correctly.
> +---------------
>
> Hence Dijkstra's Turing Award lecture, "The Humble Programmer" (q.v.).

Though it doesn't say anything about parallel programming.

> +---------------
> | I use Hoare's "Communicating Sequential Processes" to teach this
> | in some of my classes - i.e. I tell the kiddies that, if they
> | think that they can use arbitrary parallelism correctly, they
> | should read that first. It is 260-odd pages and not easy going
> | even for mathematicians ....
> +---------------
>
> That's certainly a good book!

Has there been a better theory of parallel programming since then?

Andy Glew

unread,
Oct 7, 2010, 4:16:43 PM10/7/10
to


Dijkstra was my first big influence in computers.

"A Discipline of Programming"

and

"Primer of Algol 60 Programming"

Robert Myers

unread,
Oct 7, 2010, 7:40:38 PM10/7/10
to
On 10/6/2010 1:19 AM, Andy Glew wrote:
> On 10/5/2010 10:00 PM, Robert Myers wrote:
>> On Oct 2, 1:04 pm, Andy Glew<"newsgroup at comp-arch.net"> wrote:
>>
>>>
>>> I don't like this restriction. Something similar happens with malloc.
>>> Operations that, from the point of view of the high level programmer,
>>> are independent, are not because of "innards".
>>
>> I'm going to stick my head up above the wall here, even though it may
>> get shot off as a result.
>>
>> It's comments like this one that lead me to conclude that memory
>> semantics are broken and unfixable.
>>
>> Even if I'm wrong and, high on some mountain, there might be a wizard
>> in a cave who could always get it right for an arbitrary c-like
>> language, compiler, and hardware, I conclude that no realistic system
>> of software production and management can hope to avoid subtle errors
>> using existing tools.
>>
>> I await the obligatory, "You don't know what you're talking about"
>> comment, which I deserve, because I really don't. Every time I try to
>> follow one of these conversations, though, I come away with the same
>> conclusion.
>>
>
>
> By the way, the example where you snipped my comment has nothing to do
> with memory semantics. It solely has to do with parallelism. The "hidden
> state with side effects" could be implemented via message passing, not
> memory.
>
> Does it follow, since I corrected s/memory semantics/parallel semantics/
> in the statement "memory semantics are broken and unfixable",
>
> that after making this correction the statement should read
> "parallel semantics are broken and unfixable"
>

A generalization of the Turing machine that has made it (somewhat)
easier for me to reason about concurrency is f-nets, which I know in the
version advanced by David Dinucci, formerly a regular contributor here:

Tolerant (Parallel) Programming with F-Nets and Software Cabling

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.7159&rep=rep1&type=pdf

The formalism allows, at least in theory, for formal correctness
checking of concurrent programs.

Leslie Lamport's Temporal Logic of Actions provides a complete set of
tools for reasoning about and building concurrent systems:

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1994-001.pdf

I believe that Tony Hoare and Leslie Lamport are both at Microsoft.
^_^

All of these models reason about some idealization of the actual
hardware, and discussions about concurrency almost invariably confuse
and/or conceal two kinds of mischief: that the Turing tape and reasoning
about Turing machines cannot apply without modification to parallel
programming (the more obvious one), and that all idealizations make
assumptions about the behavior of hardware that violate the laws of
physics (the less obvious one). Actual hardware has all the problems of
simultaneity that the real world does.

I have the possibly irrational belief that concurrent programs cannot
really ever be reliable (with current tools) for the same reason that,
while there are theoretical advantages to designing asynchronous
processors, the theoretical advantages are swamped by the practical
difficulties of functioning in a world without a clock.

In concurrent software and asynchronous circuit design, all notions of
temporality have to be conveyed by something that comes down to the
equivalent of message-passing, and such message passing is inherently
unreliable because it has to be implemented with actual hardware with
finite imperfections and uncertainties.

The same is true, of course, of synchronous hardware, since there is no
such thing as a perfect clock, and the systems still have to rely on
message-passing across clock domains, but the ambiguities can even
theoretically arise at a much smaller number of places and are in
practice much more manageable.

One reasons about concurrent software in much the same way one reasons
about asynchronous hardware. Almost no one essays asynchronous
processors because they are simply too hard to reason about correctly,
and yet...

Robert.

Andy Glew

unread,
Oct 8, 2010, 11:28:16 AM10/8/10
to

David Dinucci is a regular at "The Portland Geek Dinner", also known as
"Portland Friends of Eugene". I am lookingforward to reading his book
when it is finished.


> All of these models reason about some idealization of the actual
> hardware, and discussions about concurrency almost invariably confuse
> and/or conceal two kinds of mischief: that the Turing tape and reasoning
> about Turing machines cannot apply without modification to parallel
> programming (the more obvious one), and that all idealizations make
> assumptions about the behavior of hardware that violate the laws of
> physics (the less obvious one). Actual hardware has all the problems of
> simultaneity that the real world does.

I just came across "The Resurgence of Parallelism", Peter J. Denning and
Jack B. Dennis, CACM June 2010.

I think that they discuss a more important issue than you mention:
composability. Espousing (this sounds like Jack) functional
programming. Which basically disallows the sort of side effects of the
examples I was talking about.

Unfortunately, pretty much all linked data structures (LDS) are
manipulated by code that exhibits such problems. I think that it would
be unfortunate if we had to give up LDS for parallelism.

>
> I have the possibly irrational belief that concurrent programs cannot
> really ever be reliable (with current tools) for the same reason that,
> while there are theoretical advantages to designing asynchronous
> processors, the theoretical advantages are swamped by by the practical
> difficulties of functioning in a world without a clock.

Hold on:

1) I'm an Ivan Sutherland fan. Asynch will rise again.

2) Asynch is all around us. The core and uncore of Nehalem is
essentially a GALS, Globally Asynchronous Locally Synchronous, system.

Asynch is just happening between units, not within them.

> The same is true, of course, of synchronous hardware, since there is no
> such thing as a perfect clock, and the systems still have to rely on
> message-passing across clock domains, but the ambiguities can even
> theoretically arise at a much smaller number of places and are in
> practice much more manageable.
>
> One reasons about concurrent software in much the same way one reasons
> about asynchronous hardware. Almost no one essays asynchronous
> processors because they are simply too hard to reason about correctly,
> and yet...


Or how about: Newtonian physics tends to work fine in local domains.

One only needs to start thinking relativistically when thinking about
galactic scales, or stuff close in to gravity wells, like the
precession of Mercury's perihelion.


MitchAlsup

unread,
Oct 8, 2010, 3:48:11 PM10/8/10
to
On Oct 8, 10:28 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> 1) I'm an Ivan Sutherland fan.  Asynch will rise  again.
>
> 2) Asynch is all around us. The core and uncore of Nehalem is
> essentially a GALS, Globally Asynchronous Locally Synchronous, system.
>
> Asynch is just happening between units, not within them.

I am also a fan of Sutherland.

However, unless there has been a major breakthrough in the concordance
problem:
SuperScalar Asynchronous pipelines are unlikely inside of the
Cache-Miss boundary.

Mitch

Robert Myers

unread,
Oct 8, 2010, 5:35:02 PM10/8/10
to
On 10/8/2010 11:28 AM, Andy Glew wrote:

>
>
> Or how about: Newtonian physics tends to work fine in local domains.
>
> One only needs to start thinking relativistically when thinking about
> galactic scales, or stuff close in to gravity wells, like the transit of
> Mercury.
>
>

There is wisdom in that, but it provides helpful practical advice
without addressing the core issue, which is that, so long as part of the
meaning of data is encoded in time-of-arrival, the enterprise is futile.
I claim that practically every system in which memory locations are
reused without a co-located indication of the state of re-use is open to
this challenge.

The problem is that what a memory location means depends on state that
is not co-located with the data. Thus, it is possible to change the
meaning of a memory location non-locally. I think that's what you are
referring to as hidden side-effects. If I'm way wrong, you're the ideal
instructor.

You can't know with certainty what a memory location means unless you
have consistent global knowledge of the state of the system. You can
have consistent global knowledge of the state of a system only when it
has been halted long enough to allow every part of the system to report
to a central location. Computation under such rules of engagement would
be reliable but *very* slow.

Functional programming prevents at least one class of problems by
enforcing causality.

Robert.

MitchAlsup

unread,
Oct 8, 2010, 7:08:37 PM10/8/10
to
> On 10/8/2010 11:28 AM, Andy Glew wrote:
>
> > Or how about: Newtonian physics tends to work fine in local domains.
>
> > One only needs to start thinking relativistically when thinking about
> > galactic scales, or stuff close in to gravity wells, like the
> > precession of Mercury's perihelion.

Newtonian Physics works on the local scale about as well as
synchronization works in the absence of contention.

Where both fail is when the times of events are approximately equal, to
the degree at which one can measure at the remote location {laboratory
or CPU pipeline}.

Mitch

Andrew Reilly

unread,
Oct 8, 2010, 7:12:02 PM10/8/10
to
On Fri, 08 Oct 2010 08:28:16 -0700, Andy Glew wrote:

> I think that they discuss a more important issue than you mention:
> composability. Espousing (this sounds like Jack) functional
> programming. Which basically disallows the sort of side effects of the
> examples I was talking about.
>
> Unfortunately, pretty much all linked data structures (LDS) are
> manipulated by code that exhibits such problems. I think that it would
> be unfortunate if we had to give up LDS for parallelism.

While there are certainly many algorithms and a lot of code that deal
destructively with linked data structures, there are plenty of other ways
to use them in purely functional (non-modifying) ways. There is no
obvious need to give them up.

One of the more controversial changes that the PLT group made to their
"native" dialect of Scheme at version 4.0 was to make the default "cons"
cell immutable, thus relegating all algorithms that operate by modifying
links to second-class citizens. Apparently very little of their own code
had to change, and several years later that is still the state of the
default language. I had to change my own code in some fairly serious
ways, but the result is, IMO, better structured.

Check Okasaki's "Purely Functional Data Structures" for an interesting
read.

One of the defining characteristics of the Clojure language is its
promotion of functional data structures as a tool to help write very-
parallel applications.

Cheers,

--
Andrew

Rob Warnock

unread,
Oct 10, 2010, 7:58:38 AM10/10/10
to
Andy Glew <"newsgroup at comp-arch.net"> wrote:
+---------------

| Or how about: Newtonian physics tends to work fine in local domains.
|
| One only needs to start thinking relativistically when thinking about
| galactic scales, or stuff close in to gravity wells, like the
| precession of Mercury's perihelion.
+---------------

Uh... *Much* closer to home than that!! GPS won't work without
General Relativity corrections to the rates of the clocks, which
vary according to their altitudes:

http://en.wikipedia.org/wiki/Global_Positioning_System
...
GPS positioning is one of the few everyday events in which relativistic
effects must be accounted for.[81] For example, satellite clocks are
tuned to 10.22999999543 MHz before launch, to compensate for the effects
of gravitational time dilation and achieve a frequency of precisely
10.23 MHz once in orbit.[82]

And not just the altitudes of the satellites, either!! Since 1977, the
*terrestrial* master clocks that are the basis of UTC, well, TAI, now get
relativistic corrections for *their* altitudes as well:

http://en.wikipedia.org/wiki/International_Atomic_Time
...
In the 1970s, it became clear that the clocks participating in TAI
were ticking at different rates due to gravitational time dilation,
and the combined TAI scale therefore corresponded to an average of the
altitudes of the various clocks. Starting from Julian Date 2443144.5
(1 January 1977T00:00:00), corrections were applied to the output of
all participating clocks, so that TAI would correspond to proper time
at mean sea level (the geoid). Because the clocks had been on average
well above sea level, this meant that TAI slowed down, by about 10^-12.


nm...@cam.ac.uk

unread,
Oct 10, 2010, 9:33:53 AM10/10/10
to
In article <SpOdnaAjtLzzNyzR...@speakeasy.net>,

Rob Warnock <rp...@rpw3.org> wrote:
>Andy Glew <"newsgroup at comp-arch.net"> wrote:
>+---------------
>| Or how about: Newtonian physics tends to work fine in local domains.
>|
>| One only needs to start thinking relativistically when thinking about
>| galactic scales, or stuff close in to gravity wells, like the
>| precession of Mercury's perihelion.
>+---------------
>
>Uh... *Much* closer to home than that!! GPS won't work without
>General Relativity corrections to the rates of the clocks, which
>vary according to their altitudes:
>
> http://en.wikipedia.org/wiki/Global_Positioning_System
> ...
> GPS positioning is one of the few everyday events in which relativistic
> effects must be accounted for.[81] For example, satellite clocks are
> tuned to 10.22999999543 MHz before launch, to compensate for the effects
> of gravitational time dilation and achieve a frequency of precisely
> 10.23 MHz once in orbit.[82]

That's bullshit. That may be what they do, but it isn't needed.

There is no technical difficulty in synchronisation to within a
few microseconds, and better than that is not too hard. All
that is needed is a consistent clock. If you want to see an
algorithm that would work perfectly well, look at msntp.


Regards,
Nick Maclaren.


Terje Mathisen

unread,
Oct 10, 2010, 3:16:58 PM10/10/10
to
Rob Warnock wrote:
> Andy Glew<"newsgroup at comp-arch.net"> wrote:
> +---------------
> | Or how about: Newtonian physics tends to work fine in local domains.
> |
> | One only needs to start thinking relativistically when thinking about
> | galactic scales, or stuff close in to gravity wells, like the
> | precession of Mercury's perihelion.

> Uh... *Much* closer to home than that!! GPS won't work without


> General Relativity corrections to the rates of the clocks, which
> vary according to their altitudes:
>
> http://en.wikipedia.org/wiki/Global_Positioning_System
> ...
> GPS positioning is one of the few everyday events in which relativistic
> effects must be accounted for.[81] For example, satellite clocks are
> tuned to 10.22999999543 MHz before launch, to compensate for the effects
> of gravitational time dilation and achieve a frequency of precisely
> 10.23 MHz once in orbit.[82]
>
> And not just the altitudes of the satellites, either!! Since 1977, the
> *terrestrial* master clocks that are the basis of UTC, well, TAI, now get
> relativistic corrections for *their* altitudes as well:

Even closer to home:

Every single GPS _receiver_, including the one you probably have in your
car, performs a correction step for each sat based on the current orbit
altitude: Since the orbits are very close to but not quite circular, the
onboard clocks run a little slow during the low half of the orbit and a
little fast during the other half.


Robert Myers

unread,
Oct 10, 2010, 5:37:52 PM10/10/10
to
On Oct 10, 3:16 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

With your interest in orienteering, maybe you are the ideal candidate
to think about concurrency. I just realized that your sig line about
cache is also about where things actually are and how much you
actually know based on the limited information available to you.

Maybe if everyone participated in one of your competitions, it would
be more obvious just how confusing things can be when you can't see
everything all at once and have to proceed sometimes by dead
reckoning.

Robert.


Terje Mathisen

unread,
Oct 11, 2010, 2:16:09 AM10/11/10
to
Robert Myers wrote:
>
> With your interest in orienteering, maybe you are the ideal candidate
> to think about concurrency. I just realized that your sig line about

Not really, but an O race is indeed a hard realtime scheduling/routing
problem, in the face of somewhat incomplete and very often difficult to
interpret information.

> cache is also about where things actually are and how much you
> actually know based on the limited information available to you.

I consider cache to be a much wider term than what's usually considered,
in that I believe algorithms are a way to cache insight about how to
solve a particular problem:

Finding a new O(n) or O(n*log(n)) algorithm for something that naively
requires O(n*n), and then implementing it in a library, is the most
effective form of caching there is.

> Maybe if everyone participated in one of your competitions, it would
> be more obvious just how confusing things can be when you can't see
> everything all at once and have to proceed sometimes by dead
> reckoning.

<BG>

Thanks, what you've just described is pretty much what a Night-O race
can feel like: You have whatever is inside the cone of your headlamp,
plus what you remember of the locations you've passed and the current
compass direction you're traveling.

Making mistakes, small & large, is the name of the game: What matters is
how quickly you can accept and then use any new information.

In fact, it reminds me a lot of our Austrian friend's sig, the one that
ends "most things must be believed to be seen": It is amazingly easy to
very quickly look at any new, slightly "off", piece of information and
still believe it fits your current misconception.

Terje Mathisen

unread,
Oct 11, 2010, 2:41:22 AM10/11/10
to
Morten Reistad wrote:
> In article <i8t85k$gd6$1...@gosset.csi.cam.ac.uk>, <nm...@cam.ac.uk> wrote:
>> The nanoseconds is irrelevant, because there are bigger errors in
>> the terrestrial ends. And drift is TRIVIAL to deal with.
>
> Well, the user segment of the GPS system uses the space segment
> to get synchronised, and to compensate for drift once synchronised.
> The latter can be done by a phase locked loop; the first can be done
> by a binary search, like the older GPSes did. It may take several
> tens of minutes to get an initial synchronisation with this method,
> though. So one relativistic correction is normally incorporated; that
> to find the doppler-corrected frequency. It is normally table-driven.
>
> If your GPS gets initial synch in less than a few minutes it definitely
> incorporates this feature. But it is most probably a table lookup for
> a static correction.

This is the only way Nick is correct here: _Most_ of the GPS receiver
calculations can be seen as regular engineering approximation, i.e. a
few poly terms to interpolate between table lookup values.

It is still a fact that what you approximate this way is really several
relativistic effects.

>> Precisely. And the gravitational relativistic effects are not an
>> issue, despite what Wikipedia says, because there are known, trivial
>> corrections that are quite good enough.
>
> You can improve the "pure" Newtonian 4-satellite correction by
> about one order of magnitude using static corrections and transformations.
> You can then double this precision using another pair of satellites, or
> a DGPS. This gives an effective precision of around 10 feet up to
> 55 degrees latitude (inclination of GPS satellites) and gradually getting
> worse from there.

This isn't quite true: According to the official GPS docs I've seen, the
minimum GPS precision occurs around 45-50 degrees latitude, it actually
gets better a bit further north (like here in Norway) due to how we
usually have a number of visible sats to the north, i.e. seen over the
North Pole.
>
> These are not the limits for consumer GPSes. The limits for these
> are clock stability, receiver stability and computation power. Getting
> precision beyond around 30-50 feet still requires hardware engineering
> well beyond the normal consumer hardware design. And that is even
> with aggressive synchronisation to the space segment.
>
> However, the clocks and positional data on the GPS Space segments are good
> enough for 1.5 - 2 feet of precision. To get there, the user
> segment needs clock and receiver stability near two orders of
> magnitude better than what is sold in your average electronic store.

Afaik Garmin handles this with a neat hack: Instead of having an
ultra-stable (i.e. temperature-stabilized) frequency base, they have a
small (very low-power) temperature sensor and a lookup table of
temp/frequency corrections. (The table is dynamically updated (using
exponential averaging?) whenever the GPS is locked on.)

> And, you need to tackle a dynamic correction of relativistic effects.
> You need to integrate the clock drift on the received signal
> over the gravitation seen along the path of the signal, and not just
> use a table correction from the appearant elevation of the satellite.

Yep, that's the only end-user relativistic engineering I know about.

The end result is pretty amazing though:

A few years ago, as EGNOS was coming online, I ran a series of GPS
stability/accuracy tests over a two-week summer period:

The GPS (Garmin 76S) had an external antenna, magnetically mounted to a
good ground plane (metal sheet roof cover) and 360-degree visibility to
within a degree or two of the horizon: I.e. pretty much an optimal
situation.

During those two weeks I had (from memory) about 67% of all measurements
within 1.1m, 95% within 2.0m, 99% within 2.65m.

In fact, every sample except for those during a single 30-second
excursion to a point ~30m away and then back again, was within 3.0m.

To me this is/was simply amazing!

nm...@cam.ac.uk

unread,
Oct 11, 2010, 4:07:07 AM10/11/10
to
In article <q5oao7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Morten Reistad wrote:
>
>>> The nanoseconds is irrelevant, because there are bigger errors in
>>> the terrestrial ends. And drift is TRIVIAL to deal with.
>>
>> Well, the user segment of the GPS system uses the space segment
>> to get synchronised, and to compensate for drift once synchronised.
>> The latter can be done by a phase locked loop; the first can be done
>> by a binary search, like the older GPSes did. It may take several
>> tens of minutes to get an initial synchronisation with this method,
>> though. So one relativistic correction is normally incorporated; that
>> to find the Doppler-corrected frequency. It is normally table-driven.

That's special relativity, not general.

>> If your GPS gets initial synch in less than a few minutes it definitely
>> incorporates this feature. But it is most probably a table lookup for
>> a static correction.
>
>This is the only way Nick is correct here: _Most_ of the GPS receiver
>calculations can be seen as regular engineering approximation, i.e. a
>few poly terms to interpolate between table lookup values.

Eh? I think that you have seriously misunderstood. What I was
saying was bullshit was that it was necessary to take the general
relativistic (i.e. gravitational) time-distortion into account.

That is a pretty minor effect - so much so that it was only shortly
before GPS arrived that it could even be measured - and standard
drift-compensation methods reduce it below the noise level. And,
no, phase-locked loops AREN'T needed, though they would probably
be used.
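
To be concrete: by "classical methods" I mean nothing more exotic than
fitting an offset and a rate to the observed errors, e.g. (illustrative
sketch only, not GPS code):

def fit_clock(times, offsets):
    """Least-squares fit of offset(t) = a + b*t; b is the drift rate.
    Subtracting the fitted line is the whole of the 'correction'."""
    n = len(times)
    mt = sum(times) / n
    mo = sum(offsets) / n
    b = (sum((t - mt) * (o - mo) for t, o in zip(times, offsets))
         / sum((t - mt) ** 2 for t in times))
    a = mo - b * mt
    return a, b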


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Oct 11, 2010, 5:23:26 AM10/11/10
to
nm...@cam.ac.uk wrote:
> In article<q5oao7-...@ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> This is the only way Nick is correct here: _Most_ of the GPS receiver
>> calculations can be seen as regular engineering approximation, i.e. a
>> few poly terms to interpolate between table lookup values.
>
> Eh? I think that you have seriously misunderstood. What I was
> saying was bullshit was that it was necessary to take the general
> relativistic (i.e. gravitational) time-distortion into account.

I think I understand what you're saying, and I beg to differ:

It is specifically the clock error terms that are caused by (varying)
gravitational time-distortion that I believe regular GPS receivers
correct for.

I've done some googling:

Here is an old paper showing exactly how and why a receiver has to make
two relativity-related corrections, but I did not find an explicit
mention of the integral of the clock error due to orbit eccentricity:

http://tycho.usno.navy.mil/ptti/1997/Vol 29_08.pdf

This paper shows how a small orbit change (for repositioning) caused a
measurable clock frequency offset, and that gravitational effects were
the likely cause:

http://tycho.usno.navy.mil/ptti/ptti2001/paper50.pdf


nm...@cam.ac.uk

unread,
Oct 11, 2010, 7:42:19 AM10/11/10
to
In article <hl1bo7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>>> This is the only way Nick is correct here: _Most_ of the GPS receiver
>>> calculations can be seen as regular engineering approximation, i.e. a
>>> few poly terms to interpolate between table lookup values.
>>
>> Eh? I think that you have seriously misunderstood. What I was
>> saying was bullshit was that it was necessary to take the general
>> relativistic (i.e. gravitational) time-distortion into account.
>
>I think I understand what you're saying, and I beg to differ:
>
>It is specifically the clock error terms that are caused by (varying)
>gravitational time-distortion that I believe regular GPS receivers
>correct for.

I stand by my remark. Classical corrections for drift are perfectly
good enough to solve this problem, and obviate the need for all of
the stations to have clocks that run at the same rate. If N stations
have consistent clocks and can send a signal with a non-deterministic
error e (i.e. ignoring the errors that can be calculated from the
mere speed of light and movement), then they can synchronise themselves
to within e*sqrt(N) without difficulty and probably down to e*log(N).
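
A toy simulation shows the sqrt(N) behaviour for the naive chain scheme
(i.i.d. Gaussian errors assumed - my assumptions, purely illustrative):

import math, random

def chain_sync_error(n, e, trials=2000):
    """Station i synchronises to station i-1 with error ~ N(0, e);
    return the average worst offset from station 0 (a random walk)."""
    total = 0.0
    for _ in range(trials):
        off, peak = 0.0, 0.0
        for _ in range(n):
            off += random.gauss(0.0, e)
            peak = max(peak, abs(off))
        total += peak
    return total / trials

for n in (4, 16, 64, 256):
    # the measured worst-case grows in proportion to e*sqrt(n)
    print(n, round(chain_sync_error(n, 1.0), 2), round(math.sqrt(n), 2))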

>This paper shows how a small orbit change (for repositioning) caused a
>measurable clock frequency offset, and that gravitational effects were
>the likely cause:
>
>http://tycho.usno.navy.mil/ptti/ptti2001/paper50.pdf

Unless I have made a stupid error, from the figures given in that
paper, the drift causes an error of about 4e-15 seconds per orbit
per kilometre of height (i.e. 43081*(1.5448e-15/900)/(26561-26543)).
I.e. 1 nanosecond for 250 kilometres difference.

Sorry, but the variation within the orbit is irrelevant, and the
drift can be corrected using classical methods without difficulty.
The use of general relativity is "because it's there" and not because
it is needed.


Regards,
Nick Maclaren.


nm...@cam.ac.uk

unread,
Oct 11, 2010, 9:15:04 AM10/11/10
to
In article <3edbo7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>>> If you want high enough precision you have to do this. You have to
>>> do the integration of clock drift over the signal path if you want to
>>> go below about ten feet in precision on a 4-satellite signal, or
>>> five feet on a 6-satellite signal.
>>
>> Not if you have corrected for it, you don't. See my other post.
>>
>> The general relativity fanatics claim the most amazing bullshit to
>> justify their religion (and that is what it is). In this case,
>> the claim was that GPS would not work without using the theory.
>> I am not denying that using the theory is a reasonable solution,
>> but it's NOT even important to the functioning of GPS, because
>> classical methods for correcting for differential clock rates
>> are quite good enough.
>
>So what you are really saying is that since consumer grade gear isn't
>accurate/stable enough to directly notice these effects, the fact that
>it really is a relativistic correction doesn't count?

Now, now, Terje - it isn't like you to post that sort of thing.

What I am saying is that NO equipment - consumer, military, or even
potential future developments - is enough to notice the effect, if
appropriate classical methods are used to correct drift.

You could equally well say that the theory of elementary particles
was required to do chemistry, or that of the strong nuclear force
to use surface tension. Those are obviously silly, but this is a
comparable claim - except that I have done the unspeakable, which
is to hit physicists in the dogmas :-)

General relativity is almost the only part of physics that is as
much a religion as a theory, and it was the false claim that only
True Believers could build GPS systems that I am denying.


Regards,
Nick Maclaren.

Robert Myers

unread,
Oct 11, 2010, 1:00:22 PM10/11/10
to
On Oct 11, 9:15 am, n...@cam.ac.uk wrote:
> In article <3edbo7-4nc....@ntp.tmsw.no>,

> Terje Mathisen  <"terje.mathisen at tmsw.no"> wrote:
>
> >>> If you want high enough precision you have to do this. You have to
> >>> do the integration of clock drift over the signal path if you want to
> >>> go below about ten feet in precision on a 4-satellite signal, or
> >>> five feet on a 6-satellite signal.
>
> >> Not if you have corrected for it, you don't.  See my other post.
>
> >> The general relativity fanatics claim the most amazing bullshit to
> >> justify their religion (and that is what it is).  In this case,
> >> the claim was that GPS would not work without using the theory.
> >> I am not denying that using the theory is a reasonable solution,
> >> but it's NOT even important to the functioning of GPS, because
> >> classical methods for correcting for differential clock rates
> >> are quite good enough.
>
> >So what you are really saying is that since consumer grade gear isn't
> >accurate/stable enough to directly notice these effects, the fact that
> >it really is a relativistic correction doesn't count?
>
> Now, now, Terje - it isn't like you to post that sort of thing.
>
> What I am saying is that NO equipment, consumer, military or even
> potential future developments is enough to notice the effect, if
> appropriate classical methods are used to correct drift.
>
I don't think you can draw conclusions about the design of the system
without knowing the full set of contingencies under which the system
is expected to continue operating, and I'll guess with some confidence
that those contingencies are classified.

If you can't tell absolute time accurately, you can't tell your
longitude, a not insignificant issue in the history of technology.
Leaving religion out of it, if the satellites are expected to be
autonomously reliable for some period of time, that sets a requirement
for being able to predict time accurately without synchronization.

If there is no nuclear war or similar apocalyptic event underway, you
can obtain a time model using measurements synchronized with clocks
safely on terra firma, and, so long as the effects are reproducible,
it doesn't matter, from an engineering point of view, whether they are
explicable by general relativity or not.

Real orbital dynamics calculations take account of an expansion of the
earth's gravitational field in spherical harmonics (the earth not
being a perfect sphere of uniform density), so I'd be suspicious, as
apparently you are, of any particular measurement having a
demonstrable connection to general relativity. I'd wait for an
article in Physical Review or Physical Review Letters, where I don't
think religious speculation would be permitted.

Robert.


Rob Warnock

unread,
Oct 11, 2010, 10:03:29 PM10/11/10
to
<nm...@cam.ac.uk> wrote:
+---------------

| Rob Warnock <rp...@rpw3.org> wrote:
| >Uh... *Much* closer to home than that!! GPS won't work without
| >General Relativity corrections to the rates of the clocks, which
| >vary according to their altitudes:
| >
| > http://en.wikipedia.org/wiki/Global_Positioning_System
| > ...
| > GPS positioning is one of the few everyday events in which relativistic
| > effects must be accounted for.[81] For example, satellite clocks are
| > tuned to 10.22999999543 MHz before launch, to compensate for the effects
| > of gravitational time dilation and achieve a frequency of precisely
| > 10.23 MHz once in orbit.[82]
|
| That's bullshit. That may be what they do, but it isn't needed.
|
| There is no technical difficulty in synchronisation to within a
| few microseconds...
+---------------

Microsecond? *MICROSECOND?!?* Nick, remember, a microsecond is a *MILE*
when it comes to GPS. No, the GPS constellation is synchronized to
within a few *nano*seconds.

+---------------
| ...and better than that is not too hard. All that is needed is a
| consistent clock.
+---------------

"Clocks", plural. Remember that each GPS satellite has its own atomic
clock, since it has to remain accurate between synchronization updates.

+---------------


| If you want to see an algorithm that would work perfectly well,
| look at msntp.

+---------------

I strongly doubt that MSNTP is within an order of magnitude of good
enough, maybe not even *two* orders of magnitude. For one thing, it
uses standard NICs, whose transmission jitters aren't anywhere *near*
close enough to zero. Fortunately, that's not what GPS uses:

The flight paths of the satellites are tracked by dedicated U.S. Air
Force monitoring stations in Hawaii, Kwajalein, Ascension Island,
Diego Garcia, Colorado Springs, Colorado and Cape Canaveral, along
with shared NGA monitor stations operated in England, Argentina,
Ecuador, Bahrain, Australia and Washington DC.[51] The tracking
information is sent to the Air Force Space Command's MCS at Schriever
Air Force Base 25 km (16 miles) ESE of Colorado Springs, which is
operated by the 2nd Space Operations Squadron (2 SOPS) of the United
States Air Force (USAF). Then 2 SOPS contacts each GPS satellite
regularly with a navigational update using dedicated or shared (AFSCN)
ground antennas (GPS dedicated ground antennas are located at
Kwajalein, Ascension Island, Diego Garcia, and Cape Canaveral).
These updates synchronize the atomic clocks on board the satellites
to within a few nanoseconds of each other, and adjust the ephemeris
of each satellite's internal orbital model. The updates are created
by a Kalman filter, which uses inputs from the ground monitoring
stations, space weather information, and various other inputs.[52]

As the article goes on to mention, the algorithm must also compensate
for the Sagnac effect:

The GPS time scale is defined in an inertial system but observations
are processed in an Earth-centered, Earth-fixed (ECEF) co-rotating
system, a system in which simultaneity is not uniquely defined. A
Lorentz transformation converts from the inertial system to the ECEF
system. The resulting correction has opposite algebraic signs for
satellites in the Eastern and Western celestial hemispheres. Ignoring
this effect produces an east-west error on the order of hundreds of
nanoseconds, or tens of meters in position.[88]

Fun stuff.
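
And for anyone who wants to check that 10.22999999543 MHz figure, here's
the back-of-the-envelope version (standard constants assumed; geoid
details ignored, which is why the last digit or two differ from the
canonical 4.465e-10):

GM = 3.986004418e14   # m^3/s^2, Earth's gravitational parameter
c  = 2.99792458e8     # m/s, speed of light
Re = 6.371e6          # m, mean Earth radius
r  = 2.6561e7         # m, GPS orbit semi-major axis

grav = GM / c**2 * (1.0 / Re - 1.0 / r)  # gravitational blueshift: clock runs fast
sr   = (GM / r) / (2 * c**2)             # velocity time dilation: clock runs slow
net  = grav - sr                         # ~4.46e-10 net fractional offset
print(net)
print(10.23e6 * net)      # ~4.6e-3 Hz, i.e. tune to ~10.22999999543 MHz on the ground
print(net * 86400 * 1e6)  # ~38.5 microseconds/day if left uncorrected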


nm...@cam.ac.uk

unread,
Oct 12, 2010, 4:36:14 AM10/12/10
to
In article <4uWdnSpkpOZsXC7R...@speakeasy.net>,

Rob Warnock <rp...@rpw3.org> wrote:
>|
>| There is no technical difficulty in synchronisation to within a
>| few microseconds...
>
>Microsecond? *MICROSECOND?!?* Nick, remember, a microsecond is a *MILE*
>when it comes to GPS. No, the GPS constellation is synchronized to
>within a few *nano*seconds.

I said that I wouldn't follow on, but I need to clarify. Sorry,
that was my fault. The few microseconds was referring to using
TCP/IP over Ethernet on commodity systems. I apologise for being
so confusing.

The synchronisation can be done to within a small factor (perhaps
even less than one) of the non-deterministic error caused by sending
a timestamp from one station to another. You need some extremely
fancy hardware to get down to nanoseconds, HOWEVER you do it.


Regards,
Nick Maclaren.

Rob Warnock

unread,
Oct 12, 2010, 6:09:09 AM10/12/10
to
<nm...@cam.ac.uk> wrote:
+---------------
| Rob Warnock <rp...@rpw3.org> wrote:
| > ...the GPS constellation is synchronized to within a few *nano*seconds.
...

| The synchronisation can be done to within a small factor (perhaps
| even less than one) of the non-deterministic error caused by sending
| a timestamp from one station to another.
+---------------

Hopefully you already know this, but the GPS satellites don't talk
to each other at all; they only talk to (a few) ground stations,
and not very frequently at that [a few times per day at most].
The satellite clocks are pure slaves; the ground stations *tell*
the satellites what corrections to apply to their clocks [a few
times per day at most]. It's not at all a peer-to-peer system like NTP.

[The system the ground stations use to sync between *themselves*
is a whole 'nother story!!]

Nor do satellites listen to each others' navigational data broadcasts.
Since all the satellites transmit the navigational data on the *same*
frequency [well, same two frequencies], even with the CDMA encoding
being unique for each satellite it would be very, very hard for one
satellite to hear another in the midst of its own transmission, given
the power disparity between its own transmitter and any received signal
from another satellite.

[It's much easier for the user GPS receivers on the ground, since
the received signals are much closer to the same strengths (still
different, but much less different).]

+---------------


| You need some extremely fancy hardware to get down to nanoseconds,
| HOWEVER you do it.

+---------------

Yup. And good continual observation of changes in satellite orbits
[for ephemeris updates], space weather [e.g., to calculate changes
in local atmospheric densities due to solar activity], etc.

Michael S

unread,
Oct 12, 2010, 7:47:03 AM10/12/10
to
On Oct 11, 3:15 pm, n...@cam.ac.uk wrote:
>
> General relativity is almost the only part of physics that is as
> much a religion as a theory...

Come on. General relativity is religion and M-theory is not?
Which of the two is easier to falsify in Popper's sense?
And M-theory is just an extreme example. I'd dare to say that pretty
much all modern developments in such fields as string theory and
extended supersymmetry are harder to falsify than good old general
relativity.

nm...@cam.ac.uk

unread,
Oct 12, 2010, 8:16:24 AM10/12/10
to
In article <678a2bc0-c463-492a...@i21g2000yqg.googlegroups.com>,

Michael S <already...@yahoo.com> wrote:
>>
>> General relativity is almost the only part of physics that is as
>> much a religion as a theory...
>
>Come on. General relativity is religion and M-theory is not?
>Which of the two is easier to falsify in Popper's sense?

Please do not misrepresent me. I stated that it was AS MUCH a
religion as a theory, not that it was purely a religion. And
I am not interested in extreme lunacies like M-theory. With 7
parameters, you can fit an elephant - with 11 dimensions, you
can ....

General relativity dates from 1915, but the first solid scientific
confirmations were quite recent. The precession of Mercury doesn't
count (it wasn't a prediction, and there are other very plausible
explanations), nor does the mere FACT of the bending of light (it
is predicted by special relativity, at half the value). That did
not stop them being claimed as proofs by its believers. There are
now at least two solid proofs.

My main point, however, was the way that the existence of actual
singularities is claimed to be a physical fact, based entirely on
proofs that themselves rely on the assumption that Einstein's formula
is PRECISELY correct. The simple fact is that we have no solid
independently derived evidence for it except at very low space-
time curvatures. I assert that any claim that singularities exist
in the universe we inhabit belongs to the realm of religion and
not science.

Let's ignore the more extreme weeblings about what happens inside
singularities as just plain fantasy by people who ought to know
better.

Anyway, this is very off-group.


Regards,
Nick Maclaren.


Terje Mathisen

unread,
Oct 12, 2010, 9:32:09 AM10/12/10
to
Rob Warnock wrote:
> Uh... At an average altitude of ~12,550 miles, the GPS satellites
> are *FAR* from being geostationary!!! Indeed, their periods are quite
> close to a half day.

The period is _exactly_ half of Earth's sidereal rotation period, so
that each satellite retraces the same ground track every day.

The Earth's sidereal period is of course about 23 hours and 56 min:

(24 * 365.2422/366.2422 is about 23:56:04.09)

so the GPS period must be about 11:58:02.05
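
(Or, in Python, for the skeptics:)

# sidereal day and the derived GPS orbital period, in seconds
sidereal = 24 * 3600 * 365.2422 / 366.2422
print(sidereal)      # ~86164.09 s = 23:56:04.09
print(sidereal / 2)  # ~43082.05 s = 11:58:02.05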

MitchAlsup

unread,
Oct 12, 2010, 2:40:24 PM10/12/10
to
On Oct 12, 7:16 am, n...@cam.ac.uk wrote:
> General relativity dates from 1915, but the first solid scientific
> confirmations were quite recent.  The precession of Mercury doesn't
> count (it wasn't a prediction, and there are other very plausible
> explanations), nor does the mere FACT of the bending of light (it
> is predicted by special relativity, at half the value).  

What about the Eddington expedition that measured the bending of light
near the sun's surface during an eclipse in 1919?
http://en.wikipedia.org/wiki/File:1919_eclipse_negative.jpg

Mitch

Bakul Shah

unread,
Oct 12, 2010, 2:46:56 PM10/12/10
to
On 10/11/10 7:03 PM, Rob Warnock wrote:
>
> Microsecond? *MICROSECOND?!?* Nick, remember, a microsecond is a *MILE*
> when it comes to GPS. No, the GPS constellation is synchronized to
> within a few *nano*seconds.

Rob, how do you map a microsecond to a mile? And a mile where?
I am being lazy (& dense) but I couldn't figure out how. Even
light travels 300 meters in a microsecond.

Andy "Krazy" Glew

unread,
Oct 12, 2010, 3:49:07 PM10/12/10
to
On 10/10/2010 4:58 AM, Rob Warnock wrote:
> Andy Glew<"newsgroup at comp-arch.net"> wrote:
> +---------------
> | Or how about: Newtonian physics tends to work fine in local domains.
> |
> | One only needs to start thinking relativistically when thinking about
> | galactic scales, or stuff close in to gravity wells, like the transit of
> | Mercury.
> +---------------

>
> Uh... *Much* closer to home than that!! GPS won't work without
> General Relativity corrections to the rates of the clocks

Guys, you've drifted, ratholed.

I was just trying to make an analogy between shared memory and message
passing and ordering models on the one hand, and Newtonian vs.
Einsteinian physics on the other.

Andy "Krazy" Glew

unread,
Oct 12, 2010, 3:52:50 PM10/12/10
to
On 10/10/2010 11:16 PM, Terje Mathisen wrote:
> Robert Myers wrote:
>>
>> With your interest in orienteering, maybe you are the ideal candidate
>> to think about concurrency. I just realized that your sig line about
>> cache is also about where things actually are and how much you
>> actually know based on the limited information available to you.
>
> I consider cache to be a much wider term than what's usually considered,
> in that I believe algorithms is a way to cache insight about how to
> solve a particular problem:
>
> Finding a new O(n) or O(n*log(n)) algorithm for something that naively
> requires O(n*n), and then implementing it in a library, is the most
> effective form of caching there is.

Wow! I've been reading your .sig for years, and never realized you
thought this way.

In fact, I have often had mental debates with my model of you, about
whether it is cheaper to store in a cache vs. recompute.
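
(In code, that debate collapses to a single knob - a toy sketch:)

import functools
import math

@functools.lru_cache(maxsize=1024)  # the store-vs-recompute dial; maxsize=0 always recomputes
def expensive(n):
    # stand-in for any costly pure computation
    return sum(math.sin(i) for i in range(n))

expensive(10**5)  # computed the hard way
expensive(10**5)  # cache hit: storage beat recomputation this time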

Terje Mathisen

unread,
Oct 12, 2010, 3:59:49 PM10/12/10
to

I think Rob was talking about light travel time: ~5 us is close enough
to a mile, at least for US engineers (and track & field runners;
hereabouts we prefer 1500m).

Anyway, the speed of light is a pretty good proof of the superiority of
the metric system (300,000 km/s) or even better, the metric A paper
formats: A regular A4 sheet of paper is extremely close to a
light-nanosecond in length. :-)

nm...@cam.ac.uk

unread,
Oct 12, 2010, 4:25:05 PM10/12/10
to
In article <nareo7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>Anyway, the speed of light is a pretty good proof of the superiority of
>the metric system (300,000 km/s) or even better, the metric A paper
>formats: A regular A4 sheet of paper is extremely close to a ns in
>length. :-)

Nah. A nanosecond is a foot :-)


Regards,
Nick Maclaren.

Robert Myers

unread,
Oct 12, 2010, 4:55:19 PM10/12/10
to
On Oct 12, 3:52 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

I'll continue to think of my two obsessions as separate:

1. Cognitive activity (including computation) would be nearly
impossible if we didn't do the same things, or nearly the same things,
over and over and over again.

2. We can't process stored (remembered) experience faster (or sooner)
than we can retrieve it.

Robert.
