
Some confusion about virtual cache


Abdullah Muzahid

Aug 22, 2008, 12:31:47 AM
Hi,
I am trying to understand how a virtually indexed, physically tagged
cache works, and I have some confusion.
Let's just assume that the page offset is 12 bits and virtual and physical
addresses are 32 bits. What will be the tag if the lower, say, 16 bits
of a virtual address are used to index the cache? Will it be the 20-bit
physical page number, or the upper (32-16)=16 bits of the physical page
number? Similarly, what will be the tag if the lower, say, 8 bits of a
virtual address are used to index the cache?

Thanks.
-Abdullah

MitchAlsup

Aug 22, 2008, 3:33:45 PM
On Aug 21, 11:31 pm, Abdullah Muzahid <prince.cs...@gmail.com> wrote:
> Let's just assume that the page offset is 12 bits and virtual and physical
> addresses are 32 bits. What will be the tag if the lower, say, 16 bits
> of a virtual address are used to index the cache?

You have to check all physical bits that were not present in the
indexing bits. Thus you need to check 32-12=20 bits. 16 of these are
what would have been required with a PIPT cache, and the lower 4 bits
are checked because these bits did not participate in the indexing for
the data (i.e. these bits change between virtual and physical
addresses.)
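
In C terms, the index/tag split works out like this (a sketch of the
arithmetic above, using the 32-bit/12-bit/16-bit numbers from the
question; the helper names are illustrative, not from any real design):

    #include <stdint.h>

    #define OFFSET_BITS 12                /* page offset                 */
    #define INDEX_BITS  16                /* cache index, from the VA    */

    uint32_t cache_index(uint32_t vaddr)  /* index uses virtual bits     */
    {
        return vaddr & ((1u << INDEX_BITS) - 1);
    }

    uint32_t cache_tag(uint32_t paddr)    /* tag is the full 20-bit      */
    {                                     /* physical page number,       */
        return paddr >> OFFSET_BITS;      /* bits <31:12>                */
    }

By the same accounting, in the 8-bit-index case from the question every
index bit lies inside the page offset, so the index is effectively
physical, the tag is all the remaining physical bits (32-8=24), and
there is no aliasing problem to begin with.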

Alternatively, software can guarantee that any shared page has the
property that virtual address bits <15:12> are the same as physical
address bits <15:12>, and make the aliasing problem "go away".

Otherwise you will get yourself "hosed" somewhere along the line.

Nick Maclaren

Aug 23, 2008, 8:28:02 AM

In article <048c1d7b-76db-4585...@x35g2000hsb.googlegroups.com>,
MitchAlsup <Mitch...@aol.com> writes:

|> On Aug 21, 11:31pm, Abdullah Muzahid <prince.cs...@gmail.com> wrote:
|>
|> > Let's just assume that the page offset is 12 bits and virtual and physical
|> > addresses are 32 bits. What will be the tag if the lower, say, 16 bits
|> > of a virtual address are used to index the cache?
|>
|> You have to check all physical bits that were not present in the
|> indexing bits. Thus you need to check 32-12=20 bits. 16 of these
|> are what would have been required with a PIPT cache, and the lower
|> 4 bits are checked because these bits did not participate in the
|> indexing for the data (i.e. these bits change between virtual and
|> physical addresses.)

That is, of course, one reason that I regard the current approaches
to shared memory as misguided, at best. There are no problems with
virtual caches and a cleaner, simpler shared memory model - but that
would force programmers to face up to the fact that demanding fully
cache coherent shared memory costs a hell of a lot.

And doing that is, of course, unthinkable.


Regards,
Nick Maclaren.

MitchAlsup

Aug 24, 2008, 1:21:24 PM
On Aug 23, 7:28 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
> That is, of course, one reason that I regard the current approaches
> to shared memory as misguided, at best.   There are no problems with
> virtual caches and a cleaner, simpler shared memory model - but that
> would force programmers to face up to the fact that demanding fully
> cache coherent shared memory costs a hell of a lot.
>
> And doing that is, of course, unthinkable.

It's not unthinkable, it just will never get past any sane
architectural review board.

Mitch

Nick Maclaren

Aug 25, 2008, 4:55:41 AM

In article <91298b2b-e59e-415d...@p25g2000hsf.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> writes:
|>
|> > That is, of course, one reason that I regard the current approaches
|> > to shared memory as misguided, at best. There are no problems with
|> > virtual caches and a cleaner, simpler shared memory model - but that
|> > would force programmers to face up to the fact that demanding fully
|> > cache coherent shared memory costs a hell of a lot.
|> >
|> > And doing that is, of course, unthinkable.
|>
|> It's not unthinkable, it just will never get past any sane
|> architectural review board.

I think that you have just confirmed my point!

What is unthinkable is the demonstrable fact that the current approaches
are not working, and cannot be made to work by any amount of tweaking.


Regards,
Nick Maclaren.

Piotr Wyderski

Aug 25, 2008, 8:50:59 AM
Nick Maclaren wrote:

> What is unthinkable is the demonstrable fact that the current approaches
> are not working, and cannot be made to work by any amount of tweaking.

But those not-working approaches do have a market share worth many
milliards, and every year they are not working even better and
better... :-)

Best regards
Piotr Wyderski

Nick Maclaren

Aug 25, 2008, 8:58:34 AM

In article <g8u5hn$f39$1...@node1.news.atman.pl>,

"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> writes:
|>
|> > What is unthinkable is the demonstrable fact that the current approaches
|> > are not working, and cannot be made to work by any amount of tweaking.
|>
|> But those not-working approaches do have a market share worth many
|> milliards, and every year they are not working even better and
|> better... :-)

Indeed! But I post here as an architect, not as a marketeer. I have
absolutely no idea when the current insanity is going to hit an issue
that cannot be 'resolved' by marketing, but I am certain it will
eventually happen.


Regards,
Nick Maclaren.

Piotr Wyderski

Aug 25, 2008, 10:44:56 AM
Nick Maclaren wrote:

> Indeed! But I post here as an architect, not as a marketeer.

Sure. Unfortunately the T business is all about making money, not T
(substitute T with anything you want). This constraint introduces a lot
of inertia and restricts the braver approaches. You don't like it, I
don't like it, but the market is always right. Having said that...
could you please provide a short description of your dream
architecture, Nick?

I opt for a restricted dataflow machine with a lot of single-core (but
SIMD-capable) nodes connected with a number of low-latency data paths.
It must allow simple and efficient transmission of _short_ messages (no
MPI, please) with message queues exposed to the user. It would be good
to have vector processing abilities with cache bypassing. A
thread-level DMA engine/blitter would also be great.

Best regards,
Piotr Wyderski

Nick Maclaren

Aug 25, 2008, 11:02:18 AM

In article <g8uc7e$h0f$1...@node1.news.atman.pl>,

"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> writes:
|>
|> > Indeed! But I post here as an architect, not as a marketeer.
|>
|> Sure. Unfortunately the T business is all about making money, not T
|> (substitute T with anything you want). This constraint introduces a
|> lot of inertia and restricts the braver approaches. You don't like
|> it, I don't like it, but the market is always right.

Not really. The problem about revisionism (in this context, marketing
rewriting technical analysis) is that, when circumstances change, most
alternative approaches are "known" not to work (despite not having been
tried, or even investigated in depth). Been there - been told that -
provided the evidence - failed to make any progress :-(

|> Having said that... could you please provide a short description of
|> your dream architecture, Nick?

Nope - I don't have one :-) My preferences vary according to whatever
issue I feel is most important at the time.

On the matter that started this thread, I believe that cache-coherent
shared memory between processes is a stupid idea, and am being proved
right. What is needed is shared memory with explicit synchronisation,
which is VASTLY more scalable and easier to get right; indeed, I would
go for a three-category segment table state:

a) The segment is being shared read-only.
b) The segment is owned by a single process.
c) The segment is used for communication, and is accessed
atomically and with sequential consistency, but ONLY through the
instructions provided for the purpose (others would fault SIGILL).

And a segment would also be the unit of protection, as now. That
would be easy to implement, scalable, and would deliver what 99% of
threaded programs actually need.
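
In rough C terms, such a segment-table entry might look like this (a
sketch of the scheme above; the type and field names are mine, not from
any real architecture):

    enum seg_state {
        SEG_SHARED_RO,   /* a) shared read-only                       */
        SEG_PRIVATE,     /* b) owned by a single process              */
        SEG_COMM         /* c) communication: atomic, sequentially    */
                         /*    consistent, accessed only through the  */
                         /*    special instructions (others SIGILL)   */
    };

    struct seg_entry {
        unsigned long  base, limit;   /* the segment is also the unit */
        enum seg_state state;         /* of protection                */
    };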

Genuinely lightweight threading is another matter entirely, but POSIX
and Microsoft threads are just processes with common addressing and
authentication.


Regards,
Nick Maclaren.

Piotr Wyderski

Aug 25, 2008, 11:43:11 AM
Nick Maclaren wrote:

> On the matter that started this thread, I believe that cache-coherent
> shared memory between processes is a stupid idea

Agreed.

> and am being proved right.

Well, that's rather obvious to anybody who has implemented a non-trivial
parallel program on a UMA in an imperative language. This experience can
(and should) cause serious paranoia in the programmer. Otherwise he is
evidently underinformed. :-)

> What is needed is shared memory with explicit synchronisation,
> which is VASTLY more scalable and easier to get right; indeed, I would
> go for a three-category segment table state:
>
> a) The segment is being shared read-only.
> b) The segment is owned by a single process.
> c) The segment is used for communication, and is accessed
> atomically and with sequential consistency, but ONLY through the
> instructions provided for the purpose (others would fault SIGILL).

But since the shared object doesn't behave as memory (is not randomly
addressable, to be exact), it's not memory. It's merely a memory-backed
communication channel which matches my requirements of a user-visible
low-latency queue. But it should be generaliz(able | ed) to more distributed
cases.

> Genuinely lightweight threading is another matter entirely

One of the most painful facts is that there already is a high-throughput
DMA engine, namely the decoupled load/store unit, but we are not
allowed to take control over it, i.e. to schedule a lightweight
_synchronous_ thread (relative to the main one) on it. The only tool we
have now is prefetching... :-(

Best regards,
Piotr Wyderski

Nick Maclaren

Aug 25, 2008, 12:09:32 PM

In article <g8ufkg$jh9$1...@node1.news.atman.pl>,

"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> writes:
|>
|> Well, that's rather obvious to anybody who has implemented a non-trivial
|> parallel program on a UMA in an imperative language. This experience can
|> (and should) cause serious paranoia in the programmer. Otherwise he is
|> evidently underinformed. :-)

Probably by not noticing that his program didn't work :-(

|> But since the shared object doesn't behave as memory (is not randomly
|> addressable, to be exact), it's not memory. It's merely a memory-backed
|> communication channel which matches my requirements of a user-visible
|> low-latency queue. But it should be generaliz(able | ed) to more distributed
|> cases.

Sorry - I wasn't clear. I didn't mean that the segment could be only
read or written, but that stores and fetches of parts of it (e.g.
words) would be atomic. It would have full memory semantics, but
would not be as fast as unshared memory, even in the absence of any
clashes.


Regards,
Nick Maclaren.

Anne & Lynn Wheeler

Aug 25, 2008, 1:30:52 PM

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> Sorry - I wasn't clear. I didn't mean that the segment could be only
> read or written, but that stores and fetches of parts of it (e.g.
> words) would be atomic. It would have full memory semantics, but
> would not be as fast as unshared memory, even in the absence of any
> clashes.

801/risc philosophy was solidly opposite of supporting cache consistency
and smp operation/configuration. in the late 80s, there was a four
processor "single chip rios" effort done that had a flag for segments that
would bypass cache (i.e. data in segments identified as "non-cached"
would have memory load&stores that would bypass caching). standard
application data ... either non-shared and/or r/o shared ... would have
segments identified as cache'able ... but data requiring multiprocessing
serialization ... would be positioned in segments identified as non-cached.

we had done something analogous for 370 16-way SMP more than a decade
earlier (that didn't ship as a product).

another example of restructuring data (for 801/risc rios) was the aix
journaled filesystem ... where all the unix filesystem metadata was
collected in a storage area that was flagged as "transaction" memory i.e.
allowed identifying changed/modified filesystem metadata for
logging/journaling ... w/o requiring explicit logging calls whenever
there was modification of transaction data.

misc. past posts mentioning 801, risc, romp, rios, power, power/pc, etc
http://www.garlic.com/~lynn/subtopic.html#801

misc. past posts mentioning "live oak" (four processor, single-chip
rios)
http://www.garlic.com/~lynn/2000c.html#21 Cache coherence [was Re: TF-1]
http://www.garlic.com/~lynn/2001j.html#17 I hate Compaq
http://www.garlic.com/~lynn/2003d.html#57 Another light on the map going out
http://www.garlic.com/~lynn/2004q.html#40 Tru64 and the DECSYSTEM 20
http://www.garlic.com/~lynn/2006w.html#40 Why so little parallelism?
http://www.garlic.com/~lynn/2006w.html#41 Why so little parallelism?

some of the above makes reference to the ("alternative") cluster
approach ... trying to heavily leverage commodity-priced components (w/o
cache consistency ... that was eventually announced as the corporate
supercomputer) ... misc. old email
http://www.garlic.com/~lynn/lhwemail.html#medusa

for other topic drift
http://www.garlic.com/~lynn/2008m.html#22 Future architectures

... reference more detailed 3090 cache description ... has a small
"fast" logical (aka virtual) index ... that was kept consistent with
real index
http://www.garlic.com/~lynn/2003j.html#email831118
in this post
http://www.garlic.com/~lynn/2003j.html#42 Flash 10208

the 370 16-way SMP effort in the mid-70s ... leveraged charlie's
invention of the compare&swap instruction ("CAS" was chosen because they
are charlie's initials) ... misc. past posts mentioning SMP and/or
compare&swap instruction
http://www.garlic.com/~lynn/subtopic.html#smp

--
40+yrs virtualization experience (since Jan68), online at home since Mar70

MitchAlsup

Aug 25, 2008, 2:23:08 PM
On Aug 25, 10:02 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
<snip>

> On the matter that started this thread, I believe that cache-coherent
> shared memory between processes is a stupid idea, and am being proved
> right.

It is my belief that the problem is not shared cache-coherent memory
(bear with me for a second), but the 'universality' of shared cache-
coherent memory. Everything that is cacheable is by mandate coherent,
and this, in turn, squashes out the parallelism at several hierarchies.
There should be some mechanism that allows data to reside in a cache
where that data is not coherent, and likewise, some mechanism that
explicitly states that this data IS shared and must be coherent. The
lack of mechanisms to subscribe-to and subscribe-against coherence
results in massive cache-coherence traffic in the system.

>  What is needed is shared memory with explicit synchronisation,
> which is VASTLY more scalable and easier to get right; indeed, I would
> go for a three-category segment table state:
>
>     a) The segment is being shared read-only.
>     b) The segment is owned by a single process.
>     c) The segment is used for communication, and is accessed
> atomically and with sequential consistency, but ONLY through the
> instructions provided for the purpose (others would fault SIGILL).

The problem, as I see it, is that when it comes to ATOMIC
interactions, architects layer atomicity on top of cache coherence,
simply making the problem worse!

What is needed in atomic transactions is a way to explicitly specify
those memory locations taking part in the atomic multi-memory-
modifying transaction, and to provide an atomic memory ordering model
for those explicitly annotated memory references, while keeping the
normal memory ordering for other, non-participating memory operations/
instructions. Thus, the participating memory locations are seen
either in the completely-before sense or in the completely-after sense,
with no visibility into intermediate transitions that may have to
take place to go from the before state to the after state.

Rather than relate all the nuances here, I refer the readers to the
USPTO and search the patent applications with "IN/Alsup AND AN/Advanced".
I am rather sorry that the C code in the patent applications was run
through a word processor and did not retain the C indentation
structure of the original. A C beautifier can straighten this out to near-
normal. This lack-of-pretty-printability happened after I left AMD.

The sub-architecture is sufficiently powerful to allow one to move an
element from a doubly linked list and insert it into another doubly
linked list in a single atomic transaction, even when the pointers
contain addresses, sequence numbers, and state that may get copied or
modified as the atomic transaction proceeds. It is also possible to
insert and remove a data element from a structure where 3 sets of
doubly linked pointers must always remain consistent.
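
In ASF-flavoured pseudocode, such a two-list move might look like this
(a sketch of my own, not code from the patents; watch/ACQUIRE/RELEASE
follow the usage later in this thread, and ACQUIRE(5) for five
participating lines is an assumption):

    retry:
        watch( e );          // the element being moved
        watch( e->prev );    // its old neighbours in list 1
        watch( e->next );
        watch( p );          // its new neighbours in list 2
        watch( s );
        if( ACQUIRE(5) != SUCCESS )
            goto retry;                 // back-off could be seeded here
        e->prev->next = e->next;        // unlink from list 1
        e->next->prev = e->prev;
        e->prev = p;  e->next = s;      // relink between p and s
        p->next = e;  s->prev = e;
        RELEASE();                      // all six stores become visible
                                        // as one atomic event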

This sub-architecture allows the construction of atomic primitives
anywhere in the spectrum between fully locked and fully non-blocking.

Used carefully, this can enable creating atomic primitives that have
bigO(log(n)) worst-case latency {instead of bigO(n**2)}; but of similar
importance, many of the activities that hit the bigO(n**2) problems
with current atomic instructions run in bigO(3) in this system. Things
such as: a timer goes off and all CPUs try to extract a task from the
front of a run queue.

I call this a sub-architecture because it is more than just new
instructions: it carries an atomic memory model with it that adds
negative acknowledgement on the responses to cacheable accesses, and
it explicitly enters and explicitly exits atomic activity.

Mitch

Nick Maclaren

Aug 25, 2008, 2:39:37 PM

In article <73422a43-849c-42b4...@79g2000hsk.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> writes:
|>
|> It is my belief that the problem is not shared cache-coherent memory
|> (bear with me for a second), but the 'universality' of shared cache-
|> coherent memory. Everything that is cacheable is by mandate coherent,
|> and this, in turn, squashes out the parallelism at several hierarchies.
|> There should be some mechanism that allows data to reside in a cache
|> where that data is not coherent, and likewise, some mechanism that
|> explicitly states that this data IS shared and must be coherent. The
|> lack of mechanisms to subscribe-to and subscribe-against coherence
|> results in massive cache-coherence traffic in the system.

Agreed - at the hardware level! But there are also software issues.

|> The problem, as I see it, is that when it comes to ATOMIC
|> interactions, architects layer atomicity on top of cache coherence,
|> simply making the problem worse!
|>
|> What is needed in atomic transactions is a way to explicitly specify
|> those memory locations taking part in the atomic multi-memory-
|> modifying transaction, and to provide an atomic memory ordering model
|> for those explicitly annotated memory references, while keeping the
|> normal memory ordering for other, non-participating memory operations/
|> instructions. Thus, the participating memory locations are seen
|> either in the completely-before sense or in the completely-after sense,
|> with no visibility into intermediate transitions that may have to
|> take place to go from the before state to the after state.

Yes, precisely. That is what I was trying to say - I leave the details
of how to implement such a scheme to you hardware guys :-)

|> Rather than relate all the nuances here, I refer the readers to the
|> USPTO and search the patent applications with "IN/Alsup AND AN/Advanced".
|> I am rather sorry that the C code in the patent applications was run
|> through a word processor and did not retain the C indentation
|> structure of the original. A C beautifier can straighten this out to near-
|> normal. This lack-of-pretty-printability happened after I left AMD.

That sounds most interesting!

|> Used carefully, this can enable creating atomic primitives that have
|> bigO(log(n)) worst-case latency {instead of bigO(n**2)}; but of similar
|> importance, many of the activities that hit the bigO(n**2) problems
|> with current atomic instructions run in bigO(3) in this system. Things
|> such as: a timer goes off and all CPUs try to extract a task from the
|> front of a run queue.

A strong argument for randomising such events - it's amazing how useful
a bit of added randomness can be :-)


Regards,
Nick Maclaren.

Terje Mathisen

Aug 25, 2008, 3:39:32 PM
MitchAlsup wrote:
> The problem, as I see it, is that when it comes to ATOMIC
> interactions, architects layer atomicity on top of cache coherence,
> simply making the problem worse!
[snip]

> This sub-architecture allows the construction of atomic primitives
> anywhere in the spectrum between fully locked and fully non-blocking.
>
> Used carefully, this can enable creating atomic primitives that have
> bigO(log(n)) worst-case latency {instead of bigO(n**2)}; but of similar
> importance, many of the activities that hit the bigO(n**2) problems
> with current atomic instructions run in bigO(3) in this system. Things
> such as: a timer goes off and all CPUs try to extract a task from the
> front of a run queue.
>
> I call this a sub-architecture because it is more than just new
> instructions: it carries an atomic memory model with it that adds
> negative acknowledgement on the responses to cacheable accesses, and
> it explicitly enters and explicitly exits atomic activity.

Very interesting!

It sounds like a hw engineer's solution to the "Transactional Memory"
question?

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Aug 26, 2008, 12:42:19 PM
On Aug 25, 2:39 pm, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:

> Very interesting!
>
> It sounds like a hw engineer's solution to the "Transactional Memory"
> question?

A few years ago, I was charged with answering the question "Why don't
we (AMD) give MS the CompareTwoSwapTwo instruction they say they want?"

I looked into the matter and found that there was no mechanism in the
CPU, hostBridge, or transport fabric that could be used to cause two
cache lines to become and remain co-resident in the cache for the
handful of CPU cycles that would be needed to move the data around.

So, I invented a mechanism to do this, and it turns out that the same
mechanism would naturally extend to at least 7 outstanding memory
transactions. I put this mechanism in the cache miss buffers rather
than in the cache itself (as did TM) and this ends up being the key
to making forward progress and avoiding all the cache coherency crap
that gets in the way of atomicity.

Secondarily, the mechanism works even when there are no caches in the
system whatsoever, and each memory update is delivered directly to
DRAM. Thus, it is architecturally clean (like the original caches were
architecturally clean -- invisible). The mechanism requires the
ability to NAK a coherent request for data, and dealing with the
fabric issues this creates.

Mitch

Chris M. Thomasson

Aug 29, 2008, 12:25:12 AM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:g8uhhq$s9l$1...@gemini.csx.cam.ac.uk...

>
> In article <g8uc7e$h0f$1...@node1.news.atman.pl>,
> "Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> writes:
> |>
> |> > Indeed! But I post here as an architect, not as a marketeer.
> |>
> |> Sure. Unfortunately the T business is all about making money, not T
> |> (substitute T with anything you want). This constraint introduces a
> |> lot of inertia and restricts the braver approaches. You don't like
> |> it, I don't like it, but the market is always right.
>
> Not really. The problem about revisionism (in this context, marketing
> rewriting technical analysis) is that, when circumstances change, most
> alternative approaches are "known" not to work (despite not having been
> tried, or even investigated in depth). Been there - been told that -
> provided the evidence - failed to make any progress :-(
>
> |> Having said that... could you please provide a short description of
> |> your dream architecture, Nick?
>
> Nope - I don't have one :-) My preferences vary according to whatever
> issue I feel is most important at the time.
>
> On the matter that started this thread, I believe that cache-coherent
> shared memory between processes is a stupid idea, and am being proved
> right. What is needed is shared memory with explicit synchronisation,
> which is VASTLY more scalable and easier to get right; indeed, I would
> go for a three-category segment table state:

I am in favor of an extremely weak fine-grain cache-coherency mechanism. A
memory model should not strive to be implicit because it won't be able to
scale.


> a) The segment is being shared read-only.
> b) The segment is owned by a single process.
> c) The segment is used for communication, and is accessed
> atomically and with sequential consistency, but ONLY through the
> instructions provided for the purpose (others would fault SIGILL).

You could even weaken the requirements of c. Sequential consistency should
not be implied. If the programmer wants seq_cst, well, then she/he would have
to make explicit calls into a fine-grain memory barrier instruction (think
SPARC membar instruction).
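
In C-ish pseudocode the idea is that ordering appears only where the
programmer asks for it (a sketch; MEMBAR_STORE_STORE is a hypothetical
macro standing in for something like SPARC's membar #StoreStore):

    node->payload = 42;      /* plain store: no implied ordering      */
    MEMBAR_STORE_STORE();    /* explicit, fine-grain barrier, paid    */
                             /* for only where it is wanted           */
    queue->head = node;      /* publish the node                      */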


> And a segment would also be the unit of protection, as now. That
> would be easy to implement, scalable, and would deliver what 99% of
> threaded programs actually need.

Agreed.


> Genuinely lightweight threading is another matter entirely, but POSIX
> and Microsoft threads are just processes with common addressing and
> authentication.

:^)

Chris M. Thomasson

Aug 29, 2008, 12:35:58 AM
"MitchAlsup" <Mitch...@aol.com> wrote in message
news:e679d004-1e4e-4ae9...@r35g2000prm.googlegroups.com...

I don't have time to read the patent right now, but I was wondering how it
compares with Sun's hardware implementation of limited HTM? Also, how does
your system attempt to prevent live-lock? Think along the lines of something
like:

http://groups.google.com/group/comp.programming.threads/msg/3a356c832812213b

A CAS can theoretically be subjected to live-lock, and the more compare
locations you add, well, the worse it can get. How do you attempt to
mitigate this issue?

MitchAlsup

Aug 29, 2008, 4:25:24 PM
On Aug 28, 11:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "MitchAlsup" <MitchAl...@aol.com> wrote in message

> I don't have time to read the patent right now, but I was wondering how it
> compares with Sun's hardware implementation of limited HTM?

You can use my mechanism to (help) implement Sun's STM stuff.

TM is for those applications that just want synchronization to work.
ASF is for those applications that want synchronization to work FAST.

>Also, how does
> your system attempt to prevent live-lock? Think along the lines of something
> like:
>

> http://groups.google.com/group/comp.programming.threads/msg/3a356c832...

To the issue raised at the given URL: ASCII-art extracted from same::

SP(0)-->NodeA(0)-->NodeB(0)-->NodeC(0)-->NodeD(0)-->NULL

With ASF one can 'monitor' the front of the queue SP(0) while the
application makes its way down the list, by making it look like
address[SP(0)] is participating in the atomic event at NodeD(0). Then
when NodeD(0) is found and the application attempts to do a
NodeD(0)->NodeD(1) transition, anyone that has attempted to modify
SP(0) will prevent the NodeD(0) transition. Thus the actual
modification at NodeD(0) will look like (in low level ASF form)::

done = FALSE;
while( !done )
{
    sp0 = &SP(0);
    watch( sp0 );               // anyone who modifies SP(0) will cause
                                // the subsequent ACQUIRE to fail
    for( i = p = sp0->next; p; p = p->next )
        if( IS_WHAT_IM_LOOKING_FOR( p->kind, kind ) )
        {
            watch( p );
            if( ACQUIRE(2) == SUCCESS )     // ATOMICITY -- note below
            {
                sp0->next = i;  // cause everyone else to fail, by writing
                                // the value it currently has back into it
                p->data = 1;
                RELEASE();
                done = TRUE;    // success: leave the retry loop
                break;
            }
        }
}

Note below:: we know at this point that this application has been
granted atomic access to both NodeD and SP, and we know that
no one has modified SP since we (started watch-ing it and) used it as
the starting point in our search for NodeD (even when the search takes
a lot of cycles). Neither SP nor any of the intermediate nodes need
to remain in the cache (nor does there need to be a cache in the
system), however if anyone else does access SP (or NodeD) this CPU
needs to see a snoop to that/those addresses (which is no burden to
modern Cache Coherent CPUs).

> A CAS can theoretically be subjected to live-lock, and the more compare
> locations you add, well, the worse it can get. How do you attempt to
> mitigate this issue?

I should just say:: read the patent applications; however, I will go on
to say::

In a CAS-based synchronization system, the application can see that
interference is happening (CAS failed) but the application has no
means to determine how much interference is happening, and has to
resort to things like exponential back-off to make forward progress.
Thus, software is measuring interference by trying again and again
and again.

In an ASF-based synchronization, when your effective-CAS event fails,
you get a number telling you where you are in the line of applications
trying to synchronize on the same cache lines (0 means success). Thus,
you can use this to seed the exponential back-off stuff, OR, you can use
this to access a different member of the data structure that others will
not be accessing. In order to determine which other member of the data
structure you might want, you use the failure code (which is unique
wrt other failure codes other applications received) to participate in
deciding which other element to access (for example: the N-th item
down the list). Since you got a unique failure code, the probability
of colliding with another atomic access at NodeN is remote--and
this is what drives ASF up from bigO(3) to bigO(log(n)) in total
latency.
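
As a sketch of that use of the failure code (my illustration, not code
from the patents; try_take(q, n) is a hypothetical helper that attempts
an ASF removal of the n-th front-most item and returns 0 on success or
the interference count on failure):

    for( ;; )
    {
        int where = try_take( q, 0 );    // try the front item first
        if( where == 0 )
            break;                       // got it
        if( try_take( q, where ) == 0 )  // else go after the where-th
            break;                       // item: the failure code was
                                         // unique, so collisions there
                                         // are unlikely
    }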

Note: I am well aware that some data structures do not allow the
access of elements other than at either front or rear. These kinds of
data structures are doomed to have bigO(n**2) latencies in any
synchronization "kind". But all data structures DO NOT have to have
this property. Code wisely.

Thus, for the typical worst-case synchronization event of "timer goes
off, and every CPU attempts to access the front of the run queue": in
the ASF system, everybody attempts the access the first time, all
but one fail, and all the failures are seeded with a unique failure
code (monotonically increasing at the point where interference is
detected). Then, using this failure code, they access the [n-th] item
on the run queue, find that nobody else is accessing it, remove it
from the queue, and presto :: N CPUs have removed N units of work from
a single queue in 3 trips (each) through the atomic section of code.
And N can be significantly large (16+).

If software does not use the failure code to spread interfering
accesses apart, the total latency will remain bigO(n**2).

Secondarily, one can do significant things that CAS, DCAS cannot do.
For example, one can remove an element from a doubly linked list and
insert it between items in another doubly linked list in a single
atomic event. Thus the concurrent data structure is never seen with
intermediate states, AND the total number of synchronizations goes
DOWN for the same amount of forward progress. In addition, it fully
supports the notions of time stamps and cursors as part of the pointer
being manipulated--and NOT just when the address, timestamp, and state
can be squished into 64-bit (Quad) or 128-bit (Double Quad) storage
items. ASF allows software to build whatever data structure it needs/
wants/requires and makes only the stipulation that the "pointer" has
to fit in a cache line {64-ish bytes -- 4 Double Quads}, and the second
limitation is that the total number of such pointers must be greater
than 0 and less than 8.

Mtich

Terje Mathisen

Aug 30, 2008, 3:35:00 AM
MitchAlsup wrote:
> On Aug 28, 11:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>> "MitchAlsup" <MitchAl...@aol.com> wrote in message
>> I don't have time to read the patent right now, but I was wondering how it
>> compares with Sun's hardware implementation of limited HTM?
>
> You can use my mechanism to (help) implement Sun's STM stuff.
>
> TM is for those applications that just want synchronization to work.
> ASF is for those applications that want synchronization to work FAST.

Nice! :-)


> In an ASF-based synchronization, when your effective-CAS event fails,
> you get a number telling you where you are in the line of applications
> trying to synchronize on the same cache lines (0 means success). Thus,
> you can use this to seed the exponential back-off stuff, OR, you can use
> this to access a different member of the data structure that others will
> not be accessing. In order to determine which other member of the data
> structure you might want, you use the failure code (which is unique
> wrt other failure codes other applications received) to participate in
> deciding which other element to access (for example: the N-th item
> down the list). Since you got a unique failure code, the probability
> of colliding with another atomic access at NodeN is remote--and
> this is what drives ASF up from bigO(3) to bigO(log(n)) in total
> latency.
>
> Note: I am well aware that some data structures do not allow the
> access of elements other than at either front or rear. These kinds of
> data structures are doomed to have bigO(n**2) latencies in any
> synchronization "kind". But all data structures DO NOT have to have
> this property. Code wisely.

This is very nice indeed.

My best pure sw idea has been to use XADD to allow any number of
producers/consumers to access the same queue, and as long as the hw can
guarantee forward progress, each client will receive a unique
number/pointer/index as the result of the update.
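
For concreteness, the XADD scheme might look like this in C (my sketch,
not Terje's code; the GCC __sync_fetch_and_add builtin stands in for a
LOCK XADD):

    #include <stdint.h>

    extern void *slots[];                 /* work items, one per cache line */
    static volatile uint32_t cursor = 0;  /* shared take-cursor             */

    void *take_item(void)
    {
        /* atomic fetch-and-add: returns the old value, so each */
        /* caller gets a unique index                           */
        uint32_t mine = __sync_fetch_and_add(&cursor, 1);
        return slots[mine];
    }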

If the work items pointed to are all located in separate cache lines,
there should be no subsequent collisions, right?

The problem is in the memory hw which is needed to make the XADD work
when a _lot_ of threads are trying to access the variable at the same time.

I really like your idea of letting the single update attempt return a
unique failure code, but I can't (easily) figure out how the hw
requirements differ from an XADD-based approach?

> Mtich

Ouch. You know that a guy is in a _real_ hurry when he misspells his own
name. :-(

Nick Maclaren

Aug 30, 2008, 4:04:04 AM

In article <b1c6a672-3d9d-46d4...@t54g2000hsg.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> writes:
|>
|> Note: I am well aware that some data structures do not allow the
|> access of elements other than at either front or rear. These kinds of
|> data structures are doomed to have bigO(n**2) latencies in any
|> synchronization "kind". But all data structures DO NOT have to have
|> this property. Code wisely.

A slight niggle: that's not true in general, even if it is under the
circumstances being considered. For example, all LIFOs and FIFOs are
like that, and they can be implemented a lot more efficiently (even
with parallel access).

However, I think that your point is that it IS true when the data
structures are being built on top of 'basic' (scalar) memory accesses,
which is all that modern Von Neumann ISAs deliver to the programmer
(and what almost all programming languages assume).

In a radical, probably capability-based, architecture, there might
be direct FIFO and LIFO support exported by the hardware to the ISA.
Then things might be different.


Regards,
Nick Maclaren.

MitchAlsup

Aug 30, 2008, 3:14:13 PM
On Aug 30, 2:35 am, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:

> MitchAlsup wrote:
> > On Aug 28, 11:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> >> "MitchAlsup" <MitchAl...@aol.com> wrote in message
> >> I don't have time to read the patent right now, but I was wondering how it
> >> compares with Sun's hardware implementation of limited HTM?
>
> > You can use my mechanism to (help) implement Sun's STM stuff.

>
> > TM is for those applications that just want synchronization to work.
> > ASF is for those applications that want synchronization to work FAST.
>
> Nice! :-)
>
Mark Moir from Sun came up with this one while we were presenting ASF
to them.
{snip}

> This is very nice indeed.
>
> My best pure sw idea has been to use XADD to allow any number of
> producers/consumers to access the same queue, and as long as the hw can
> guarantee forward progress, each client will receive a unique
> number/pointer/index as the result of the update.
>
> If the work items pointed to are all located in separate cache lines,
> there should be no subsequent collisions, right?

The general situation is that you want the element.next and
element.prev 'pointers' to be in different cache lines. Thus one can
be manipulating the element immediately before an element and
immediately after an element simultaneously without collision.

> The problem is in the memory hw which is needed to make the XADD work
> when a _lot_ of threads are trying to access the variable at the same time.
>
> I really like your idea of letting the single update attempt return a
> unique failure code, but I can't (easily) figure out how the hw
> requirements differ from an XADD-based approach?

There has to be somebody who does the interference counting, and this
is something that the CPU cannot do for itself; for the CPU is not in
a position to see all of the interference which may be in progress
before that CPU generates an address that will ultimately result in
interference.

> > Mtich
>
> Ouch. You know that a guy is in a _real_ hurry when he misspells his own
> name. :-(

You got me there.....

On Aug 30, 3:04 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
> A slight niggle: that's not true in general, even if it is under the
> circumstances being considered. For example, all LIFOs and FIFOs are
> like that, and they can be implemented a lot more efficiently (even
> with parallel access).

OK, let's look at the FIFO situation.

Consider a FIFO that has at least n items in it, and n CPUs attempting
to take the front item. The current situation is that n-1 beat each
other up while one (at random) CPU gets the front element (along with
some beating), and the cycle continues with n-1 players. In effect,
even though the FIFO is strictly accessed only at its "front", there
is a random relationship between the n CPUs and n FIFO items. (That
is, each CPU gets FIFO access to a random item in the FIFO.)

So, the question becomes: why can't the n CPUs access deeper into the
FIFO as long as the CPUs can only access/remove from the "front-most"
n items?

As seen by the CPUs, they still get a random item from the "front" of
the FIFO--so the view from the CPUs remains the same.
As seen by the FIFO, the same number of items from the "front" can be
accessed simultaneously.
So, there are many data structure access methods that give the
appearance of having to be (a lot) more ordered than they really have
to be.

Yet, all of the accesses do not actually take place at the "front" of
the FIFO. It is this "liberation" from order that enables the
reduction from bigO(n**2) to bigO(log(n)) latency.

Mitch

Nick Maclaren

Aug 30, 2008, 3:30:03 PM

In article <40c70b11-c9f9-455c...@26g2000hsk.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> writes:
|>
|> Consider a FIFO that has at least n items in it, and n CPUs attempting
|> to take the front item. The current situation is that n-1 beat each
|> other up while one (at random) CPU gets the front element (along with
|> some beating), and the cycle continues with n-1 players. In effect,
|> even though the FIFO is strictly accessed only at its "front", there
|> is a random relationship between the n CPUs and n FIFO items. (That
|> is, each CPU gets FIFO access to a random item in the FIFO.)

Yup. Not a problem :-)

|> Yet, all of the accesses do not actually take place at the "front" of
|> the FIFO. It is this "liberation" from order that enables the
|> reduction from bigO(n**2) to bigO(log(n)) latency.

Agreed. My niggle was that you can deliver much better than O(N^2)
if you constrain the ISA to say nothing more than the access to the
FIFO is sequentially consistent - i.e. each CPU would get 'its'
results in order, with no guarantees about interleaving. But that
would mean building the FIFO into the ISA.

It's not what ISAs currently do ....


Regards,
Nick Maclaren.

Nick Maclaren

Aug 30, 2008, 3:44:13 PM

This is a repost in an attempt to clarify. Sorry.

My niggle was that you can deliver much better than O(N^2) if you
constrain the ISA to say nothing more than the access to the FIFO is
sequentially consistent - i.e. each CPU would get 'its' results in
order, with no guarantees about interleaving[*]. But that would mean
building the FIFO into the ISA.

It's not what ISAs currently do ....

[*] It would still guarantee that, if CPU A put two items on in the
same order, and CPU B took them off, it would get them in the same
order. And there could be a (slow) synchronisation operation by
which CPUs C and D could ensure that all actions by CPU C up to that
point precede all those by CPU D after that point.

If I recollect, the most obvious algorithm is something like O(log^2(N))
for suitable random accesses. Ones that always involve a single CPU
are hopeless, of course, but that's no news.


Regards,
Nick Maclaren.

EricP

Aug 31, 2008, 12:39:39 PM

I assume you are referring to the "Method for proactive
synchronization within a computer system" patent and its friends.
I looked at these patents but it was not immediately obvious
to me why it is guaranteed to make forward progress.

I probably misunderstood the mechanism, but it sounded like the cache
queue contains monitor logic (e.g. a CAM plus some counters) that
track the number of requests for a collection of addresses.
The program issues multiple LOCK prefix instructions to modify a
set of addresses, and the addresses and new values are stored in
the monitor logic. The program issues an ACQUIRE instruction and
if no one else is also updating *any* of the addresses in the set
then the whole batch of new values is committed and ACQUIRE returns
success, otherwise failure and a counter of the contenders.

Why wouldn't that just livelock?

Eric


Chris M. Thomasson

Aug 31, 2008, 2:52:10 PM

"EricP" <ThatWould...@theisland.com> wrote in message
news:de1b0$48bac521$45c49ea8$91...@TEKSAVVY.COM...

I have only very briefly skimmed the patents because I am swamped with work
and have no time. However, it sure seems like it is indeed prone to livelock
if the programmer does not take advantage of the unique number returned from
a failed operation and properly integrate it into an exponential backoff
mechanism. Well, IMVHO, that sure sounds like some of the contention
managers in a transactional memory system. In fact, it sounds like ASF is
indeed a form of limited optimized HTM. The monitor logic would be analogous
to the operations (e.g., the LOCK'd instructions) within an atomic block of
the transaction. The ACQUIRE instruction would be analogous to the commit
function, only when the ACQUIRE fails it gives an important piece of
information, the number of contenders, which would be used to seed the
contention manager. I need to really CAREFULLY read the darn patent
applications! I know I am missing something important here.

;^(...

EricP

Aug 31, 2008, 5:52:39 PM
Chris M. Thomasson wrote:
>
> I have only very briefly skimmed the patents because I am swamped with
> work and have no time. However, it sure seems like it is indeed prone to
> livelock if the programmer does not take advantage of the unique number
> returned from a failed operation and properly integrate it into an
> exponential backoff mechanism. Well, IMVHO, that sure sounds like some
> of the contention managers in a transactional memory system. In fact, it
> sounds like ASF is indeed a form of limited optimized HTM. The monitor
> logic would be analogous to the operations (e.g., the LOCK'd
> instructions) within an atomic block of the transaction. The ACQUIRE
> instruction would be analogous to the commit function, only when the
> ACQUIRE fails it gives an important piece of information, the number of
> contenders, which would be used to seed the contention manager. I need to
> really CAREFULLY read the darn patent applications! I know I am missing
> something important here.
>
> ;^(...

There is a more understandable paper and slightly different
explanation of it which mentions COMMIT as distinct from ACQUIRE:

Hardware acceleration for lock-free data structures and
software-transactional memory
http://www.amd64.org/fileadmin/user_upload/pub/epham08-asf-eval.pdf

though I still don't see how it avoids, for example with two
cpus trying to pop adjacent items from a doubly linked list,
having them ping-pong failures between them as each Acquire
invalidates the others (my brain is locked into looking at this for
a mechanism that somehow impedes cache line mobility without causing
deadlock so as to collect the updated cache line set together).

Eric

MitchAlsup

Aug 31, 2008, 9:39:50 PM
On Aug 31, 11:39 am, EricP <ThatWouldBeTell...@theisland.com> wrote:

> I assume you are referring to the "Method for proactive
> synchronization within a computer system" patent and its friends.

Yep.

> I looked at these patents but it was not immediately obvious
> to me why it is guaranteed to make forward progress.

The guarantee of forward progress is massively stronger than in other
synchronization systems, but not absolute.

> I probably misunderstood the mechanism, but it sounded like the cache
> queue contains monitor logic (e.g. a CAM plus some counters) that
> track the number of requests for a collection of addresses.

It's the miss buffers that keep track of the lines participating in the
atomic event.

> The program issues multiple LOCK prefix instructions to modify a
> set of addresses, and the addresses and new values are stored in
> the monitor logic.

The LOCK prefix merely indicates that this address is potentially
participating in an atomic event.

> The program issues an ACQUIRE instruction and
> if no one else is also updating *any* of the addresses in the set
> then the whole batch of new values is committed and ACQUIRE returns
> success, otherwise failure and a counter of the contenders.

The way you have phrased your question is not a correct ASF
event sequence. ACQUIRE only marks the start of CDS modification while
RELEASE marks the end. Bad things happening between these two markers
cause the atomic event to fail, and control flow flows out of the
ACQUIRE instruction with a failure code. ACQUIRE returns success
BEFORE you have damaged anything. It is your key to go and do the
damage. RELEASE is the indication that you are done and want the rest
of the system to see the new state of the participating lines.

LOCKed memrefs announce participation
Once all the participants are announced you execute an ACQUIRE
instruction followed immediately by a jnz instruction
If ACQUIRE succeeds, then you start modifying the participating data;
otherwise control leaves the atomic event
When done modifying the participating data, you execute a RELEASE
instruction.

If an exception occurs after ACQUIRE and before RELEASE, the
participating cache lines are rolled back making it look like the
event never took place.

IF ACQUIRE returns success, certain pieces of HW operate differently
preventing the HW live lock situations.

Malformed ASF software may create its own SW livelock--and there is
nothing HW can do about it (nor should it).

But if you use the concurrent data structure (CDS) philosophy of: A)
start at the front of the CDS, B) traverse until you find the item of
merit, C) attempt to manipulate the CDS at the point of merit (ACQUIRE),
D) failure leads back to A, E) success leads through the CDS
modification code, F) announce you are done (RELEASE).
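
A skeleton of that philosophy in C-ish pseudocode (my sketch; lock_mov
stands in for a LOCK-prefixed memref announcing participation, and the
helper names are illustrative):

    retry:
        node = traverse_from_front( cds ); // A, B: find the item of merit
        lock_mov( &cds->head );            // announce the participants
        lock_mov( &node->link );
        if( ACQUIRE(2) != SUCCESS )        // C: attempt the atomic event
            goto retry;                    // D: failure leads back to A
        modify( cds, node );               // E: the CDS modification code
        RELEASE();                         // F: done -- the new state of
                                           // the participating lines
                                           // becomes visible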

I have reworded your question in a correct manner:

> The program issues an ACQUIRE instruction and is given the success indication and enters the CDS modification block. If no one else is also updating *any* of the addresses in the set then the whole batch of new values is made visible to the system at the conclusion of the RELEASE instruction.

> Why wouldn't that just livelock?

Can you digest what I have written and come back with a refined
question?

Mitch

Michael Hohmuth

Sep 1, 2008, 10:24:16 AM
EricP <ThatWould...@thevillage.com> writes:

> There is a more understandable paper and slightly different
> explanation of it which mentions COMMIT as distinct from ACQUIRE:
>
> Hardware acceleration for lock-free data structures and
> software-transactional memory
> http://www.amd64.org/fileadmin/user_upload/pub/epham08-asf-eval.pdf

Disclaimer:
I work for AMD and am one of the coauthors of this paper. AMD has
not announced support for ASF (Advanced Synchronization Facility) in
any future product. So please don't get too excited. :-)

In this paper we have described a slightly different incarnation of
ASF:

* We renamed RELEASE to COMMIT. (Yes, ACQUIRE is different from
COMMIT [nee RELEASE]: ACQUIRE marks the begin of the atomic phase,
and COMMIT denotes its end.)

* We added a VALIDATE instruction.

* The paper only describes and evaluates what the patent describes as
``optimistic synchronization mode.''

> though I still don't see how it avoids, for example with two cpus
> trying to pop adjacent items from a doubly linked list, having them
> ping-pong failures between them as each Acquire invalidates the

> others [...]

Optimistic synchronization mode (the one described in the paper) does
not do that in hardware; it leaves that as something software has to
deal with.

I believe that in his previous posts, Mitch has been referring to the
other operational mode described in the patent (but not the paper),
``deterministic synchronization mode.'' Thus, the paper cannot be
used to shed light on whether or not the mechanism described in the
patent can successfully avoid livelock. (But the paper can serve as
an illustration of how ASF can be used and what performance could be
expected from it for some workloads.)

That said, I'd be interested to learn what you (and the other readers
of this group) think about the paper!

Michael
--
Michael Hohmuth, AMD Operating System Research Center, Dresden, Germany
michael...@amd.com, www.amd64.org

EricP

Sep 1, 2008, 11:20:37 AM
MitchAlsup wrote:
> On Aug 31, 11:39 am, EricP <ThatWouldBeTell...@theisland.com> wrote:
>
>> I assume you are referring to the "Method for proactive
>> synchronization within a computer system" and friends patents.
>
> Yep.
> ...

> The LOCK prefix merely indicates that this address is potentially
> participating in an atomic event.
>
>> ...

>
> The way you have phrased your question is not a correct ASF
> event sequence. ACQUIRE only marks the start of CDS modification while
> RELEASE marks the end. Bad things happening between these two markers
> cause the atomic event to fail, and control flow flows out of the
> ACQUIRE instruction with a failure code. ACQUIRE returns success
> BEFORE you have damaged anything. It is your key to go and do the
> damage. RELEASE is the indication that you are done and want the rest
> of the system to see the new state of the participating lines.

Yep, my bad. I missed that in my initial quick document scan.
That was in the Detailed Description section and
I had only scanned the Claims sections.

> LOCKed memrefs announce participation
> Once all the participants are announced you execute an ACQUIRE
> instruction followed immediately by a jnz instruction
> If ACQUIRE succeeds, then you start modifying the participating data;
> otherwise control leaves the atomic event
> When done modifying the participating data, you execute a RELEASE
> instruction.
>
> If an exception occurs after ACQUIRE and before RELEASE, the
> participating cache lines are rolled back making it look like the
> event never took place.

Yes, and for interrupts too. It looks like it takes a replay trap
back to the ACQUIRE and then re-executes it but with a Fail status.

That would seem to set limits on the amount of code between
ACQUIRE and RELEASE. It would also seem to preclude non-OoO
CPUs from implementing it.

Would having Release also return a Pass/Fail condition code,
in addition to Acquire, allow non-OoO implementations?

> IF ACQUIRE returns success, certain pieces of HW operate differently
> preventing the HW live lock situations.
>
> Malformed ASF software may create its own SW livelock--and there is
> nothing HW can do about it (nor should it).
>
> But if you use the concurrent data structure (CDS) philosophy of: A)
> start at the front of the CDS, B) traverse until you find the item of
> merit, C) attempt to manipulate the CDS at the point of merit (ACQUIRE),
> D) failure leads back to A, E) success leads through the CDS
> modification code, F) announce you are done (RELEASE).
>
> I have reworded your question in a correct manner:
>
>> The program issues an ACQUIRE instruction and is given the success indication and enters the CDS modification block. If no one else is also updating *any* of the addresses in the set then the whole batch of new values is made visible to the system at the conclusion of the RELEASE instruction.
>
>> Why wouldn't that just livelock?
>
> Can you digest what I have written and come back with a refined
> question?

Digesting....

A key phrase in getting this to work is the clause

"[0055]... After a passed cache line arrives, processor core 18 may hold
onto that cache line and prevent other processor cores from stealing the
line by responding to coherent invalidation probes with
Fail-to-REQUESTOR responses"

which in turn causes others' Acquire attempts to fail.
(And then I see this was worthy of a whole patent itself:
"Method for denying probes during proactive synchronization within
a computer system".)

I also notice that there is a single per-system Synchronization Arbiter
to resolve disputes over sets of ACQUIRE addresses.

So that seems to answer my doubly linked list livelock question.
After one cpu has Acquired its address set, the Sync Arbiter prevents
the ping-ponging, and these NAKs allow a cpu to hold multiple cache
lines until it has finished the critical section.

Eric


MitchAlsup

Sep 1, 2008, 3:19:27 PM
First of all, I would like to thank Mike for stepping in with a new
paper and data that I had never seen before.

On Sep 1, 10:20 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> MitchAlsup wrote:
> > On Aug 31, 11:39 am, EricP <ThatWouldBeTell...@theisland.com> wrote:
> > LOCKed memrefs announce participation
> > Once all the participants are announced you execute an ACQUIRE
> > instruction followed immediately by a jnz instruction
> > If ACQUIRE succeeds, then you start modifying the participating data;
> > otherwise control leaves the atomic event
> > When done modifying the participating data, you execute a RELEASE
> > instruction.
>
> > If an exception occurs after ACQUIRE and before RELEASE, the
> > participating cache lines are rolled back making it look like the
> > event never took place.
>
> Yes, and for interrupts too. It looks like it takes a replay trap
> back to the ACQUIRE and then re-executes it but with a Fail status.

Interrupts are deferred from the first LOCKed memref for a number of
cycles under OS control. This gives the ASF event time to make forward
progress but not so long as to seriously affect interrupt performance.
When I left, the watch-dog counter was going to default to one
absolutely worst-case DRAM access across the fabric with a full memory
controller queue at that node (about 5-ish µs if I remember
correctly). The counter can be set to 0, and then interrupts are not deferred.

> That would seem to set limits on the amount of code between
> ACQUIRE and RELEASE. It would also seem to preclude non-OoO
> CPUs from implementing it.

Nothing prevents an in-order machine from doing ASF events. Nothing in
ASF even requires that a data cache be present. All that IS required
is at least 9 miss buffers {address and line of data}, so that TLB
misses and instruction fetches can continue while participating lines
are "watched". And there has been absolutely no relief from the general
concept that an atomic event should contain as LITTLE work as possible.
ASF just makes the amount of work possible bigger than DCAS. But ASF
does not anticipate the application ever needing more than a few
handfuls of instructions or a couple of branches to perform the atomic
event. ASF is NOT TM. If you need more than a few handfuls of
instructions, use TM, not ASF.

> Would having Release also return a Pass/Fail condition code,
> in addition to Acquire, allow non OoO implementations?

We had one proponent of this direction. Neither I nor the other lead
liked that direction. I don't know if the version Mike brings
to the table did this or not.

> A key phrase in getting this to work is the clause
>
> "[0055]... After a passed cache line arrives, processor core 18 may hold
> onto that cache line and prevent other processor cores from stealing the
> line by responding to coherent invalidation probes with
> Fail-to-REQUESTOR responses"

The NAK.

> which in turn causes others Acquire attempt to fail.
> (And then I see this was worthy of a whole patent itself:
> "Method for denying probes during proactive synchronization within
> a computer system".)
>
> I also notice that there is a single per-system Synchronization Arbiter
> to resolve disputes over sets of ACQUIRE addresses.

Nope, there is a single SO per node in the system (each chip has a
SO). OS software can decide to route all ASF messages to a single SO,
or to route different applications' ASF messages to different SOs.
Virtual machines that do not share pages, or applications that do not
share pages where atomic events are being performed can be routed to
different SOs. Thus, if you are not performing atomic activities on
pages in the disk cache, you can share these pages and still direct
the ASF messages to different SOs. There is a lot of SW thinking left
to be done, here. And none of it has yet been done (or made public).

It was anticipated that each SO would contain 64 entries so for the
kind of stuff DCAS is good for, a single SO could be allowing 32
independent atomic events to be simultaneously in progress. The
biggest ASF event we thought was "pretty cool" {That of moving a CDS-
element from one place to another in a single atomic event} takes 5
participating lines. So the single SO can manage as many as 12 of
these at one instant.

> So that seems to answer my double linked list livelock question.
> After one cpu has Acquired its address set, the Sync Arbiter prevents
> the ping-ponging, and these NAKs allow a cpu to hold multiple cache
> lines until it is finished the critical section.

Yep, that is why the guarantee of forward progress is greatly
strengthened.

But to be absolutely fair, the bigO(n**2) ping-ponging still takes
place, but it takes place with one 64-byte and one 64-bit incoherent
message rather than with dozens of coherent messages
and a handful of data movements in the fabric with all sorts of
ordering constraints. This reduces the amount of fabric traffic in a 4-
node system by around 8X (it's all addresses rather than cache lines).

Of similar importance is that none of these messages arrive at the
memory controllers' queues or interact with the states of the
DRAM controllers. One way of thinking about this is that this
synchronization method works on addresses rather than on the bit-
pattern of a unit of data associated with an address. Thus the
synchronizing agents are not impeding other benign processes while
attempting the atomic event.

Mitch

EricP

unread,
Sep 3, 2008, 3:46:26 PM9/3/08
to

My interest is primarily in the mechanism. The only comment I'd have
is that anything that allows 7 or 8 addresses to be updated atomically
is definitely an improvement and I'm sure it would find many uses.

WRT deterministic vs optimistic, in reading between the lines of the
descriptions, I got the impression the distinction was whether ACQUIRE
waited for a response or not. This led me to wonder about the latency
of the Acquire operation and if this was a synchronous vs async issue.

I'm also thinking that optimistic code needs to be more capable of
handling interference and performing a graceful back off and retry,
whereas deterministic code would be more straight forward.

The patent refers to deterministic vs optimistic being selected by the
BIOS or OS or virtual machine. I don't think that would work out well.
Whether optimistic is tolerable or not would be dependent on the way a
particular routine was written. If an OS was forced to choose, it
would always choose the most restrictive, which would be deterministic.
So as a system-wide option, this need not exist.

If it is a synchronous vs async issue, then it would be better to
have two Acquire instructions so that those routines that are written
to perform a more complex back off and retry without ping ponging can
select the async Acquire and gain some performance by being optimistic.

However, people like me who just want things to work would obviously
choose the deterministic version.

Eric


MitchAlsup

unread,
Sep 4, 2008, 1:34:44 PM9/4/08
to
On Sep 3, 2:46 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> My interest is primarily in the mechanism. The only comment I'd have
> is that anything that allows 7 or 8 addresses to be updated atomically
> is definitely an improvement and I'm sure it would find many uses.
>
> WRT deterministic vs optimistic, in reading between the lines of the
> descriptions, I got the impression the distinction was whether ACQUIRE
> waited for a response or not. This led me to wonder about the latency
> of the Acquire operation and if this was a synchronous vs async issue.

Basically, in optimistic mode, the CPU does the checks for
interference at COMMIT--if any interference has happened, undo any
damage and exit at ACQUIRE with a failure code.
While in deterministic mode, the CPU gets permission at ACQUIRE, and
will run to completion if exceptions are not generated.

But there is a subtlety that has escaped everyone: in optimistic mode,
you only get as many instructions in the atomic section as you have
room for in the execution window. So there are spurious failures in
optimistic mode; for example, if you use more than 24 instructions, or
overflow LS2 (in Opteron) with stores, then the atomic section will
fail back to ACQUIRE. Since different machines may have different
sized windows and store buffering, it would be unwise to expose these
to SW. This led us back to having a deterministic mode that can
handle an essentially-unbounded number of instructions and memrefs in
the atomic section.

> I'm also thinking that optimistic code needs to be more capable of
> handling interference and performing a graceful back off and retry,
> whereas deterministic code would be more straight forward.

Optimistic mode is almost as if there were no atomic checks in the code
at all. ACQUIRE takes 1 clock, COMMIT takes only a few (to ask the BIU
about interference).

Deterministic mode bundles up the participating cache line addresses
and ships these to a synchronization observer (no closer than 12
nanoseconds away, or 24 nanoseconds round trip--with a 4-node system
averaging 30-ish nanoseconds round trip as seen at a random CPU). So
this mode always incurs a fabric crossing delay, even if all the data
is sitting in its caches.

Thus, by having a predictor, the typical lightly loaded
synchronizations can run as fast as or faster than the simple LOCKs/CASs
that already exist. This gives ASF the ability to compete in performance
even at the simplest levels of synchronization--and allows microcode a
new avenue to make the already existing instructions faster. But when
interference IS detected, everyone is switched to the SO so that the
bigO(n**2) effects are ameliorated. But you don't want the software
model to change between optimistic and deterministic.

So, the expected general use pattern is to make a try in optimistic
mode, and then when this fails, make a second try in deterministic
mode. The failure codes at ACQUIRE are designed to support this
notion, and no code "looks or smells" different at the instruction
level, while "looking and smelling" considerably different in the HW
gyration needed to pull it all off.

> The patent refers to deterministic vs optimistic being selected by the
> BIOS or OS or virtual machine. I don't think that would work out well.
> Whether optimistic is tolerable or not would be dependent on the way a
> particular routine was written. If an OS was forced to choose, it
> would always choose the most restrictive, which would be deterministic.
> So as a system-wide option, this need not exist.

First, the BIOS/OS/VMM can select between always optimistic, always
deterministic, and a predictor (or turn the facility off), separately
for the privileged and for the unprivileged levels.

The OS may want to choose something for the application(s) and
something different for itself!

> If it is a synchronous vs async issue, then it would be better to
> have two Acquire instructions so that those routines that are written
> to perform a more complex back off and retry without ping ponging can
> select the async Acquire and gain some performance by being optimistic.

No, you want the software model (what coders code to) to be the same
in either mode.

But the thing here is that, for the typical data element within
CDS kinds of synchronizations, ASF does not require the software
writer to code back-offs (complex or not). ASF wants the SW code to
take a new look at the CDS (pointers) and use the failure code to
select a different element that has a much lower probability of
suffering from interference, and give that one a try.

> However, people like me who just want things to work would obviously
> choose the deterministic version.

In optimistic mode, with data sitting in the cache, one can perform a
CAS-like atomic event in a dozen cycles (4 nanoseconds).
In deterministic mode, with data sitting in the cache, one can perform
a CAS-like atomic event in 30 nanoseconds (best case--with fabric
overhead additional).

Mitch

P.S. Also note, the SO messages contain physical addresses. So if you
launch one program 18 times, the individual programs will not interfere
with each other unless they actually share memory and attempt atomic
accesses on that shared memory that happen to collide. Nor are these
physical addresses ever exposed to the running programs.

Nick Maclaren

unread,
Sep 4, 2008, 1:56:36 PM9/4/08
to

In article <81768b1d-bc68-4051...@r15g2000prd.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> writes:
|>
|> Deterministic mode bundles up the participating cache line addresses
|> and ships these to a synchronization observer (no closer than 12
|> nanoseconds away, or 24 nanoseconds round trip--with a 4-node system
|> averaging 30-ish nanoseconds round trip as seen at a random CPU). So
|> this mode always incurs a fabric crossing delay, even if all the data
|> is sitting in its caches.

That must be bounded, too, so there is SOME limit - but I assume that
it is fairly large.

|> > If it is a synchronous vs async issue, then it would be better to
|> > have two Acquire instructions so that those routines that are written
|> > to perform a more complex back off and retry without ping ponging can
|> > select the async Acquire and gain some performance by being optimistic.
|>
|> No, you want the software model (what coders code to) to be the same
|> in either mode.

Agreed. It's a nightmare to debug programs that can run in different
modes according to something that can't even be displayed in the
program output.

|> In optimistic mode, with data sitting in the cache, one can perform a
|> CAS-like atomic event in a dozen cycles (4 nanoseconds).
|> In deterministic mode, with data sitting in the cache, one can perform
|> a CAS-like atomic event in 30 nanoseconds (best case--with fabric
|> overhead additional).

That is impressive - even the latter is fast enough to use for a lot
of program designs that aren't viable today.

Unfortunately, the former will encourage some loons to use the
mechanism for the accumulation value in (say) a dot product. But
against stupidity the gods themselves contend in vain :-(


Regards,
Nick Maclaren.

Michael Hohmuth

unread,
Sep 4, 2008, 4:26:50 PM9/4/08
to
MitchAlsup <Mitch...@aol.com> writes:

> But there is a subtlety that has escaped everyone: in optimistic mode,
> you only get as many instructions in the atomic section as you have
> room for in the execution window. So there are spurious failures in
> optimistic mode; for example, if you use more than 24 instructions,
> or overflow LS2 (in Opteron) with stores, then the atomic section
> will fail back to ACQUIRE. Since different machines may have
> different sized windows and store buffering, it would be unwise to
> expose these to SW. This led us back to having a deterministic mode
> that can handle an essentially-unbounded number of instructions and
> memrefs in the atomic section.

What you are describing is the limitation of one possible
implementation of optimistic mode. The implementation we've simulated
for our paper does not have this limitation.

Nick Maclaren

unread,
Sep 4, 2008, 4:50:40 PM9/4/08
to

In article <3yzljy7...@sieglinde.amd.com>,

Michael Hohmuth <Michael...@amd.com> writes:
|> MitchAlsup <Mitch...@aol.com> writes:
|>
|> > But there is a subtlety that has escaped everyone: in optimistic mode,
|> > you only get as many instructions in the atomic section as you have
|> > room for in the execution window. So there are spurious failures in
|> > optimistic mode; for example, if you use more than 24 instructions,
|> > or overflow LS2 (in Opteron) with stores, then the atomic section
|> > will fail back to ACQUIRE. Since different machines may have
|> > different sized windows and store buffering, it would be unwise to
|> > expose these to SW. This led us back to having a deterministic mode
|> > that can handle an essentially-unbounded number of instructions and
|> > memrefs in the atomic section.
|>
|> What you are describing is the limitation of one possible
|> implementation of optimistic mode. The implementation we've simulated
|> for our paper does not have this limitation.

But I think that it still has the one I referred to - i.e. that
unprotected accesses by another CPU to protected locations between
an ACQUIRE and a COMMIT are essentially undefined behaviour.

Not that I think that such a constraint is unreasonable.


Regards,
Nick Maclaren.

Michael Hohmuth

unread,
Sep 4, 2008, 5:46:06 PM9/4/08
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> But I think that [ASF's optimistic mode] still has the [limitation]
> I referred to - i.e. that unprotected accesses by another CPU to
> protected locations between an ACQUIRE and a COMMIT are essentially
> undefined behaviour.

Architecturally, there is no undefined behavior. The execution flow
is rolled back to the ACQUIRE instruction, and modifications to
protected memory are undone and never become visible. In the
implementation we simulated, other modifications (to unprotected
memory or to most registers) are not rolled back, but that's hardly
undefined behavior. We could argue about nondeterminism, but that's
easy to hide in software (in the back-off code, do not rely on what
might have been modified in the critical section).

In your other post you talked about language standards, and I have to
admit I didn't quite follow. If ACQUIRE and COMMIT were exposed as
language features, then the language would have to define the
semantics of a rollback. I don't see how that would be more difficult
than defining, say, the semantics of a longjmp out of a signal
handler.

(Cross-posted to comp.programming.threads because that's where you
originally brought up the question.)

Michael

PS: Your news reader or news server truncates your news postings'
References header, which breaks threading. Could you please try
to fix this?

Nick Maclaren

unread,
Sep 5, 2008, 4:10:38 AM9/5/08
to

In article <3yz7i9r...@sieglinde.amd.com>,

Michael Hohmuth <Michael...@amd.com> writes:
|>
|> > But I think that [ASF's optimistic mode] still has the [limitation]
|> > I referred to - i.e. that unprotected accesses by another CPU to
|> > protected locations between an ACQUIRE and a COMMIT are essentially
|> > undefined behaviour.
|>
|> Architecturally, there is no undefined behavior. The execution flow
|> is rolled back to the ACQUIRE instruction, and modifications to
|> protected memory are undone and never become visible.

Let's consider the case where another CPU does an unprotected read to
a protected location after it has been modified. Which value does it
get? There are serious inconsistencies if it gets the modified one,
because it might be rolled back later.

In the cache-coherence-based implementation, I can see that it can
easily force a roll-back and block the read until after the roll-back,
but that isn't so easy with all approaches. And, what I am not sure
is whether it is required.

|> In the
|> implementation we simulated, other modifications (to unprotected
|> memory or to most registers) are not rolled back, but that's hardly
|> undefined behavior. We could argue about nondeterminism, but that's
|> easy to hide in software (in the back-off code, do not rely on what
|> might have been modified in the critical section).

Unless the possibility domain of non-determinism is defined, it is
undefined behaviour. For example, consider a single protected byte
in the middle of a doubleword, which is updated by a floating-point
store and the byte is then rolled back :-)

Yes, such examples are ridiculous, but I don't know how many extra
constraints you are assuming in order to exclude them.

|> In your other post you talked about language standards, and I have to
|> admit I didn't quite follow. If ACQUIRE and COMMIT were exposed as
|> language features, then the language would have to define the
|> semantics of a rollback. I don't see how that would be more difficult
|> than defining, say, the semantics of a longjmp out of a signal
|> handler.

Think preloading and store aggregation/scheduling. In order to give
composite objects or sections of code atomic properties, you need to
isolate their accesses from the surrounding ones. That is done by
ACQUIRE and COMMIT, but there is no equivalent for the CPUs that
access the same data non-atomically.

However, I am afraid that paragraph contains two mistakes.

Firstly, while you might like to think that a language would define
such semantics, you aren't being realistic. For example, POSIX and
C don't even define the semantics of either threading or signal
handling, nor do they require an implementation to do so.

And, given that I am one of the VERY few people who has ever even
attempted to define the semantics of a longjmp out of a signal
handler, I think that you are underestimating its difficulty!

|> PS: Your news reader or news server truncates your news postings'
|> References header, which breaks threading. Could you please try
|> to fix this?

Regrettably not. I am a mere user of this system, and it is due to
get the chop in a few weeks.


Regards,
Nick Maclaren.

Michael Hohmuth

unread,
Sep 5, 2008, 7:52:30 AM9/5/08
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> In article <3yz7i9r...@sieglinde.amd.com>,
> Michael Hohmuth <Michael...@amd.com> writes:

> |> Architecturally, there is no undefined behavior. The execution
> |> flow is rolled back to the ACQUIRE instruction, and modifications
> |> to protected memory are undone and never become visible.
>
> Let's consider the case where another CPU does an unprotected read
> to a protected location after it has been modified. Which value
> does it get? There are serious inconsistencies if it gets the
> modified one, because it might be rolled back later.

As I mentioned, in this case modifications to protected memory are
undone and never become visible.

In other words: If a CPU B reads a memory location that CPU A has
protected and modified, then CPU A's critical section will be aborted.
CPU B gets the old value.

> In the cache-coherence-based implementation, I can see that it can
> easily force a roll-back and block the read until after the
> roll-back, but that isn't so easy with all approaches. And, what I
> am not sure is whether it is required.

Yes, it is required -- otherwise the atomicity property would be
broken.

> |> In the implementation we simulated, other modifications (to
> |> unprotected memory or to most registers) are not rolled back, but
> |> that's hardly undefined behavior. We could argue about
> |> nondeterminism, but that's easy to hide in software (in the
> |> back-off code, do not rely on what might have been modified in
> |> the critical section).
>
> Unless the possibility domain of non-determinism is defined, it is
> undefined behaviour. [...]

I believe I defined that domain in the paragraph you quoted.

> |> In your other post you talked about language standards, and I
> |> have to admit I didn't quite follow. If ACQUIRE and COMMIT were
> |> exposed as language features, then the language would have to
> |> define the semantics of a rollback. I don't see how that would
> |> be more difficult than defining, say, the semantics of a longjmp
> |> out of a signal handler.
>
> Think preloading and store aggregation/scheduling. In order to give
> composite objects or sections of code atomic properties, you need to
> isolate their accesses from the surrounding ones. That is done by
> ACQUIRE and COMMIT, but there is no equivalent for the CPUs that
> access the same data non-atomically.

There are memory-access ordering constraints for x86, and ASF carries
those along in one form or another. From the perspective of a CPU B
observing modifications by an ASF critical section on CPU A, the only
change is that some modifications now happen atomically -- so that no
intermediate state can be observed. This is _better_ than what we
have today, not worse.

> However, I am afraid that paragraph contains two mistakes.
>
> Firstly, while you might like to think that a language would define
> such semantics, you aren't being realistic. For example, POSIX and
> C don't even define the semantics of either threading or signal
> handling, nor do they require an implementation to do so.

OK -- but there is a difference between what a standard mandates and
what implementations provide. Fortunately, most implementations do
provide some guarantees. ASF requires fewer guarantees, not more.

> And, given that I am one of the VERY few people who has ever even
> attempted to define the semantics of a longjmp out of a signal
> handler, I think that you are underestimating its difficulty!

Maybe. Nonetheless, there is siglongjmp(3) in POSIX, and people have
been able to apply it usefully and portably.
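
The pattern in question is roughly this (a minimal sketch; whether
it is strictly conforming is exactly the point in dispute):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sig(int sig)
{
    (void)sig;
    siglongjmp(env, 1);   /* unwind out of the handler; the signal
                             mask saved by sigsetjmp is restored */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sig;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    if (sigsetjmp(env, 1) == 0) {   /* nonzero 2nd arg: save the mask */
        for (;;)
            pause();                /* wait to be interrupted */
    }
    puts("recovered after siglongjmp out of the handler");
    return 0;
}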

In summary, I don't think that ASF introduces any (additional)
undefined behavior.

Michael

Nick Maclaren

unread,
Sep 5, 2008, 8:15:37 AM9/5/08
to

In article <3yzbpz2...@sieglinde.amd.com>,

Michael Hohmuth <Michael...@amd.com> writes:
|>
|> > In the cache-coherence-based implementation, I can see that it can
|> > easily force a roll-back and block the read until after the
|> > roll-back, but that isn't so easy with all approaches. And, what I
|> > am not sure is whether it is required.
|>
|> Yes, it is required -- otherwise the atomicity property would be
|> broken.

That is true, but your paper didn't make clear whether the atomicity
property applied to non-protected accesses as well as protected ones!
That sort of thing has bedevilled specifications for decades - you
assume one thing, someone else (often a decade later) assumes another,
and chaos ensues - usually followed by a screaming match where each
side claims the other is at fault! [*]

|> There are memory-access ordering constraints for x86, and ASF carries
|> those along in one form or another. From the perspective of a CPU B
|> observing modifications by an ASF critical section on CPU A, the only
|> change is that some modification now happen atomically -- so that no
|> intermediate state can be observed. This is _better_ than what we
|> have today, not worse.

I am not denying that - but do you seriously think that the current
situation is adequate? Because practical experience is that it isn't.

|> OK -- but there is a difference between what a standard mandates and
|> what implementations provide. Fortunately, most implementations do
|> provide some guarantees. ASF requires less guarantees, not more.

Actually, very few do, when you try to find out what they are. They
usually give a description of what they expect, and some examples of
what they will deliver, and you have to deduce what they guarantee
from that. That applies to all levels, incidentally, including ISAs.

Sorry, but many people call me on such topics when they get stuck,
and I have spent a LOT of time chasing down such details.

|> Maybe. Nonetheless, there is siglongjmp(3) in POSIX, and people have
|> been able to apply it usefully and portably.

Really? I doubt that you could provide an example of a use of it
that is genuinely portable. What I can guarantee is that, when I
demonstrate a POSIX-conforming system and circumstances on which it
doesn't work and/or explain why, the event marked [*] above will
ensue.

|> In summary, I don't think that ASF introduces any (additional)
|> undefined behavior.

I never said that it did - what I would hope for is that it would
REDUCE the amount of undefined behaviour to the point at which
writing reasonably correct shared memory programs with a lot of
data sharing in conventional languages becomes feasible. Currently,
it isn't.

Note that I know perfectly well that it is easy for an experienced
parallel programmer to write shared memory primitives, and for less
experienced programmers to use them, but the issue is for the latter
to do the shared accesses themselves. And get them right :-(


Regards,
Nick Maclaren.

Hallvard B Furuseth

unread,
Sep 5, 2008, 8:54:01 AM9/5/08
to
Michael Hohmuth writes:

> nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>> And, given that I am one of the VERY few people who has ever even
>> attempted to define the semantics of a longjmp out of a signal
>> handler, I think that you are underestimating its difficulty!
>
> Maybe. Nonetheless, there is siglongjmp(3) in POSIX, and people
> have been able to apply it usefully and portably.

POSIX doesn't say siglongjmp out of a signal handler is safe though.
On the contrary, it says sigsetjmp/siglongjmp are like setjmp/longjmp,
where jumping out of a signal handler gives undefined behaviour.
The difference is that siglongjmp can restore the signal mask, while
longjmp's effect on the signal mask is unspecified.

--
Hallvard

Michael Hohmuth

unread,
Sep 5, 2008, 10:37:45 AM9/5/08
to

Hrmpf... I should have chosen a better example. ;-)

I think you are basically right, except that IIRC POSIX does not say
that longjmp/siglongjmp out of a signal handler are undefined (at
least not for single-threaded programs). It just doesn't promise that
(sig)longjmp is reentrant, making it somewhat hard to write safe POSIX
code that jumps out of a signal handler.

But leaving the legalese aside for a moment, recent OSes support
longjmp out of signal handlers in a useful and predictable fashion.
This function is required for some debugging and emulation
applications.

The reason I brought up this example was the following similarity:

Like code that longjmps out of a signal handler, ASF critical sections
must avoid using unsafe (nonreentrant) code after the jump point
(setjmp and ACQUIRE, respectively).

Michael

[1] a regular longjmp, not one out of a signal handler

Nick Maclaren

unread,
Sep 5, 2008, 11:11:35 AM9/5/08
to

In article <3yzljy6...@sieglinde.amd.com>,

Michael Hohmuth <Michael...@amd.com> writes:
|>
|> Hrmpf... I should have chosen a better example. ;-)
|>
|> I think you are basically right, except that IIRC POSIX does not say
|> that longjmp/siglongjmp out of a signal handler are undefined (at
|> least not for single-threaded programs).

Oh, yes, it does! Look at the specification of signal() or sigaction:

If the signal occurs other than as the result of calling abort(),
raise(), kill(), pthread_kill(), or sigqueue(), the behavior is
undefined if the signal handler refers to any object with static
storage duration other than by assigning a value to an object
declared as volatile sig_atomic_t, ...

Now, how do you use longjmp() under those restrictions? Where do
you get the jump buffer from, for a start?

|> But leaving the legalese aside for a moment, recent OSes support
|> longjmp out of signal handlers in a useful and predictable fashion.
|> This function is required for some debugging and emulation
|> applications.

No, they don't. They are a hell of a lot better than they were, but
are still not up to the quality of the old mainframe systems. That
issue is precisely why those debugging and emulation applications
are unreliable to useless when applied to 'advanced' programs,
including many parallel ones.

There are at least two fundamental problems:

1) The very action of accepting a signal causes irreversible
effects. Consider interrupting an I/O transfer in flight.

2) There are generally no mechanisms for a library to provide
a callback to restore its state when it is interrupted, and a lot
use static or global state.

I could explain how I did a lot better under MVS, and how IBM did
with CEL, but neither approach would work for Unix or Microsoft
systems. You are more-or-less dead in the water with the nastier
signal-handling problems.

|> The reason I brought up this example was the following similarity:
|>
|> Like code that longjmps out of a signal handler, ASF critical sections
|> must avoid using unsafe (nonreentrant) code after the jump point
|> (setjmp and ACQUIRE, respectively).

Yes - I had taken that one as read - it's pretty obvious to anyone
who has been there. But you actually need more than reentrancy;
you need both idempotency and reversibility.


Regards,
Nick Maclaren.

Hallvard B Furuseth

unread,
Sep 5, 2008, 11:55:51 AM9/5/08
to
Nick Maclaren writes:
> In article <3yzljy6...@sieglinde.amd.com>,
> Michael Hohmuth <Michael...@amd.com> writes:
> |>
> |> Hrmpf... I should have chosen a better example. ;-)
> |>
> |> I think you are basically right, except that IIRC POSIX does not say
> |> that longjmp/siglongjmp out of a signal handler are undefined (at
> |> least not for single-threaded programs).
>
> Oh, yes, it does! Look at the specification of signal() or sigaction:
>
> If the signal occurs other than as the result of calling abort(),
> raise(), kill(), pthread_kill(), or sigqueue(), the behavior is
> undefined if the signal handler refers to any object with static
> storage duration other than by assigning a value to an object
> declared as volatile sig_atomic_t, ...
>
> Now, how do you use longjmp() under those restrictions? Where do
> you get the jump buffer from, for a start?

I was wrong though, I saw what I expected to see. It does say setjmp
should work in signal handlers - except in a *nested* signal handler:

<http://www.unix.org/version3/online.html>; functions/longjmp.html:
"As it bypasses the usual function call and return mechanisms,
longjmp() shall execute correctly in contexts of interrupts, signals,
and any of their associated functions. However, if longjmp() is
invoked from a nested signal handler (that is, from a function invoked
as a result of a signal raised during the handling of another signal),
the behavior is undefined."

I don't see how that's supposed to work in combination with the
paragraph you quoted though. Maybe it's just that signal handlers
may use setjmp/longjmp internally?

For that matter, I don't see how it would work without paying a
significant cost in performance and complexity. Just with "x = y;"
where the update of x is not atomic (like x is 64-bit on a 32-bit host):
If the signal interrupts in the middle of updating x and you longjmp out
past that update, how can anything about x have defined behavior
afterwards? Or if you are in the middle of updating a more complex data
structure (yours or e.g. malloc's) when the signal arrives, that data
structure can get broken.
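
For instance (a sketch for an ILP32 target; the point is only the
two-part store):

#include <stdint.h>

volatile uint64_t x;   /* no single-instruction 64-bit store here */

void update(uint64_t y)
{
    /* On a 32-bit host this assignment compiles to two 32-bit
       stores.  A signal delivered between them, whose handler
       longjmps past this function, leaves x half old, half new. */
    x = y;
}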

--
Hallvard

EricP

unread,
Sep 5, 2008, 12:25:42 PM9/5/08
to
MitchAlsup wrote:
> On Sep 3, 2:46 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>> My interest is primarily in the mechanism. The only comment I'd have
>> is that anything that allows 7 or 8 addresses to be updated atomically
>> is definitely an improvement and I'm sure it would find many uses.
>>
>> WRT deterministic vs optimistic, in reading between the lines of the
>> descriptions, I got the impression the distinction was whether ACQUIRE
>> waited for a response or not. This led me to wonder about the latency
>> of the Acquire operation and if this was a synchronous vs async issue.
>
> Basically, in optimistic mode, the CPU does the checks for
> interference at COMMIT--if any interference has happened, undo any
> damage and exit at ACQUIRE with a failure code.
> While in deterministic mode, the CPU gets permission at ACQUIRE, and
> will run to completion if exceptions are not generated.

It sounds like both Deterministic and Optimistic modes would be
interfered with if an Acquired cache line is accessed by another cpu
or DMA, because, for example, some data object not logically guarded
by that Acquire shared the cache line and it was either read or written.

(I include read access in the above because the mechanism is trying to
ensure globally coherent access to all cache lines in an update set.
So if any uncoordinated read occurs, the system must present a before
image of the entire update set, which presumably triggers an undo
and retry).

So one must ensure that guarded objects don't share cache
lines with any unrelated data objects.

Ok, if the mode switch-over is automatic that's even better.
I just didn't see how optimistic mode could guarantee forward progress.
I can see certain usage patterns on certain data structures
would cause optimistic mode to behave pathologically,
which means it can't be set at a system wide level (which it isn't).

> But the thing here is that, for the typical data element within
> CDS kinds of synchronizations, ASF does not require the software
> writer to code back-offs (complex or not). ASF wants the SW code to
> take a new look at the CDS (pointers) and use the failure code to
> select a different element that has a much lower probability of
> suffering from interference, and give that one a try.

This will sound ignorant, but what's a CDS?

Anyway, the kind of usage I might have for ASF would be making
lock-free list and tree management a realistic consideration
by lowering its complexity.
However, the operations are dictated by the app requirements that
it insert/remove/update a particular data object, and it can't
really go off and work on something else if there is a collision.
It just has to hang around and retry, hence my concern about
ping-pongs.

So in that context I'm not exactly sure how the failure code could
be used especially if fair, mostly-fifo access is desired.

>> However, people like me who just want things to work would obviously
>> choose the deterministic version.
>
> In optimistic mode, with data sitting in the cache, one can perform a
> CAS-like atomic event in a dozen cycles (4-nanoseconds).
> In determniistic mode, with data sitting in the cache, one can perform
> a CAS-like atomic event in 30-nanoseconds (best case--with fabric
> overhead additioinal).
>
> Mitch
>
> P.S. Also note, the SO messages contain physical addresses. So if you
> launch one program 18 times, the individual programs will not interfer
> with each other unless they actually share memory and attempt atomic
> accesses on that shared memory that happen to collide. Nor are these
> physical addresses ever exposed to the running programs.

Sounds great.

Eric


MitchAlsup

unread,
Sep 5, 2008, 1:02:37 PM9/5/08
to
On Sep 5, 11:25 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> MitchAlsup wrote:
> > Basically, in optimistic mode, the CPU does the checks for
> > interference at COMMIT--if any interference has happened, undo any
> > damage and exit at ACQUIRE with a failure code.
> > While in deterministic mode, the CPU gets permission at ACQUIRE, and
> > will run to completion if exceptions are not generated.
>
> It sounds like both Deterministic and Optimistic modes would be
> interfered with if an Acquired cache line is accessed by another cpu
> or DMA, because, for example, some data object not logically guarded
> by that Acquire shared the cache line and it was either read or written.

Define CPU A as the one doing the atomic event
Define CPU B as the one making the read
Define IO C as an innocent bystander making a read

With CPU A optimistically performing an atomic event on line L and
between ACQUIRE and COMMIT, CPU B makes an access to line L. In this
case, CPU B gets the old value of line L, and CPU A aborts the atomic
event and rolls back any protected cache lines and takes an error exit
through the ACQUIRE instruction, and enters deterministic mode.

With CPU A deterministically performing an atomic event on line L and
between ACQUIRE and COMMIT, CPU B makes an access to line L. In this
case, CPU B gets a NAK, while CPU A continues to plug along on the
atomic event.

IO is a special case, and the receiving CPU can see that the request
came from IO. IO is special because it is not coherent, so special
rules apply that don't apply to CPUs (and their caches). In this case
CPU A aborts the atomic event and exits through ACQUIRE with a failure
code; and the IO request sees the now-current bit pattern in cache/
memory, his request is fulfilled, and he goes away happy.

> (I include read access in the above because the mechanism is trying to
> ensure globally coherent access to all cache lines in an update set.
> So if any uncoordinated read occurs, the system must present a before
> image of the entire update set, which presumably triggers an undo
> and retry).
>
> So one must ensure that guarded objects don't share cache
> lines with any unrelated data objects.

I might ask why one has IO scheduled to a line that is subject to
atomic events....

(snip)

> >> If it is a synchronous vs async issue, then it would be better to
> >> have two Acquire instructions so that those routines that are written
> >> to perform a more complex back off and retry without ping ponging can
> >> select the async Acquire and gain some performance by being optimistic.
>
> > No, you want the software model (what coders code to) to be the same
> > in either mode.
>
> Ok, if the mode switch-over is automatic that's even better.
> I just didn't see how optimistic mode could guarantee forward progress.

It can't, and that is why CAS, DCAS, Locked Load/Store Conditional
based stuff cannot get the job done. (i.e. it doesn't scale)

> I can see certain usage patterns on certain data structures
> would cause optimistic mode to behave pathologically,
> which means it can't be set at a system wide level (which it isn't).
>
> > But the thing here is that, for the typical data element within
> > CDS kinds of synchronizations, ASF does not require the software
> > writer to code back-offs (complex or not). ASF wants the SW code to
> > take a new look at the CDS (pointers) and use the failure code to
> > select a different element that has a much lower probability of
> > suffering from interference, and give that one a try.
>
> This will sound ignorant, but what's a CDS?

Concurrent Data Structure

> Anyway, the kind of usage I might have for ASF would be making
> lock-free list and tree management a realistic consideration
> by lowering its complexity.
> However, the operations are dictated by the app requirements that
> it insert/remove/update a particular data object, and it can't
> really go off and work on something else if there is a collision.
> It just has to hang around and retry, hence my concern about
> ping-pongs.

So you get hit with interference, and by the time you can look at the
structure, it has changed under your nose. So you go back to a safe
point (like the top) and then traverse into the structure to re-find
where you want to do the insert/extract/update and try again.

Or you can do what CAS-based stuff has been doing all along--keep
pounding on the CDS until you get your increment of change performed.

Think of the failure code as something you can ignore (except for
atomic activities) or use to your benefit.

> So in that context I'm not exactly sure how the failure code could
> be used especially if fair, mostly-fifo access is desired.

You can use the failure code to seed a back-off timer. So rather than
accumulate the needed back-off time (ala CAS-based with exponential
back-off) one can use the failure code directly
(failure*BACK_OFF_LOOP_COUNT) or indirect via a table
(BackOffLoopTime[failure]) to determine the time/cycles to wait.
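
In C that might look like the following (a sketch, with made-up
helper names around the mechanism described above):

extern int  try_atomic_update(void);       /* 0, or a failure code    */
extern void spin_for(unsigned cycles);
extern const unsigned BackOffLoopTime[];   /* indexed by failure code */

void update_with_backoff(void)
{
    int failure;
    while ((failure = try_atomic_update()) != 0)
        spin_for(BackOffLoopTime[failure]);  /* seed the wait directly
                                                from the failure code */
}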

Mitch

Nick Maclaren

unread,
Sep 5, 2008, 1:15:07 PM9/5/08
to

In article <hbf.2008...@bombur.uio.no>,

Hallvard B Furuseth <h.b.fu...@usit.uio.no> writes:
|>
|> > Now, how do you use longjmp() under those restrictions? Where do
|> > you get the jump buffer from, for a start?
|>
|> I was wrong though, I saw what I expected to see. It does say setjmp
|> should work in signal handlers - except in a *nested* signal handler:

Yup.

|> I don't see how that's supposed to work in combination with the
|> paragraph you quoted though. Maybe it's just that signal handlers
|> may use setjmp/longjmp internally?

Nope. POSIX and C are seriously inconsistent, even within themselves
and ignoring mere ambiguities. I know of at least four incompatible
interpretations firmly believed to be The One True Meaning by actual
members of the relevant standards committees.

Your interpretation is one of them, incidentally.

|> For that matter, I don't see how it would work without paying a

|> significant cost in performance and complexity. ...

Yup.

The consequence of this mess is that most implementations seem to
work most of the time - but you have no idea whether they actually
do work or you just haven't observed the failures. If you then hit
a problem and report even an obvious error as a bug, the vendor's
support service will point at one of the sections of the standards
that say you are causing undefined behaviour. And so the bug will
not get reported, and will remain in the code.


Regards,
Nick Maclaren.

Chris M. Thomasson

unread,
Sep 5, 2008, 2:46:09 PM9/5/08
to

"EricP" <ThatWould...@thevillage.com> wrote in message
news:e4333$48c15db3$45c49ea8$25...@TEKSAVVY.COM...
[...]

You would need to ensure that objects are properly padded up to the size of
a cache-line, and properly aligned on cache-line boundaries. AFAICT,
false-sharing can destroy performance in ASF. This would be analogous to the
effect false-sharing has on LL/SC. If an unrelated data-structure is falsely
shared with a cache line within the reservation granule of an active LL/SC,
well, then any mutations to said data-structure in between the LL and the SC
will make the SC erroneously fail. That would be bad...

;^(
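
As a concrete illustration, a padded node might look like this
(a sketch; 64 is an assumed line size and must match the target):

#define CACHE_LINE 64   /* assumption: L2 line size of the target */

struct lifo_node {
    struct lifo_node *next;   /* the word the LL/SC actually hits */
    /* pad the object out to a full line so that no unrelated data
       can land in the same reservation granule and fail the SC */
    char pad[CACHE_LINE - sizeof(struct lifo_node *)];
};

Combined with line-aligned allocation, each node then owns its
line outright.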

EricP

unread,
Sep 5, 2008, 2:44:58 PM9/5/08
to
MitchAlsup wrote:
> On Sep 5, 11:25 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>> So one must ensure that guarded objects don't share cache
>> lines with any unrelated data objects.
>
> I might ask why one has IO scheduled to a line that is usbject to
> atomic events....

Because the acquired object "A" just happened to be located
in the same line as some different object "B", and someone was doing
IO on "B" totally unaware that it shared the cache line with "A".

E.g if the cache line was 64 or 128 bytes and the heap allocated
objects on 32 byte boundaries.

E.g the data structure being acquired is located in static program
memory and the linker locates some random variable just before
or after that structure, and the programmer has no knowledge
that this happened.

>> Anyway, in the kind of usage I might have for ASF would be making
>> lock free list and tree management a realistic consideration
>> by lowering its complexity.
>> However the operations are dictated by the app requirements that
>> it insert/remove/update a particular data object and it can't
>> really go off and work on something else if there is a collision.
>> It just has to hang around a retry and thus my concern about
>> ping pongs.
>
> So you get hit with interference. and by the time you can look at the
> structure, it has changed under your nose. So you go back to so safe
> point (like the top) and then traverse into the structure to re-find
> where you want to do the insert/extract/update and try again.
>
> Or you can do what CAS-based stuff ahs been doing all along--keep
> pounding on the CDS until you get your increment of change performed.
>
> Think of the failure code as somthing you can ignore (except for
> atomic activities) or use to your benefit.
>
>> So in that context I'm not exactly sure how the failure code could
>> be used especially if fair, mostly-fifo access is desired.
>
> You can use the failure code to seed a back-off timer. So rather than
> accumulate the needed back-off time (ala CAS-based with exponential
> back-off) one can use the failure code directly
> (failure*BACK_OFF_LOOP_COUNT) or indirect via a table
> (BackOffLoopTime[failure]) to determine the time/cycles to wait.

Yeah, that's what I was thinking.
But that gets you into the retry-loop approach,
and that causes the O(N**2) cost.

Hmmm... an MLH (Magnusson, Landin, Hagersten) spin queue is O(N) cost,
uses a linked list of cache lines to ensure FIFO ordering,
and is very efficient in that accessors perform a single
AtomicExchange on the spin queue head and thereafter each
spin-waits on an independent cache line if there is any conflict.
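
(For the record, a rough MCS-style sketch of that idea in C11 --
MLH differs in detail, but the shape is the same: one exchange on
the queue tail, then each waiter spins only on its own line.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool             locked;    /* each waiter spins on this */
} qnode;

typedef struct { _Atomic(qnode *) tail; } spinq;

void spinq_acquire(spinq *q, qnode *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    qnode *prev = atomic_exchange(&q->tail, me); /* the one atomic op */
    if (prev != NULL) {
        atomic_store(&prev->next, me);           /* link in, FIFO     */
        while (atomic_load(&me->locked))
            ;                                    /* spin on own line  */
    }
}

void spinq_release(spinq *q, qnode *me)
{
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expect = me;
        if (atomic_compare_exchange_strong(&q->tail, &expect, NULL))
            return;                              /* no one waiting    */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                    /* waiter mid-link   */
    }
    atomic_store(&succ->locked, false);          /* hand the lock over */
}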

I wonder if these could be combined so that normally you have
concurrent access but if a collision does occur,
you back off and wait in a fifo queue and the owner will let
you know when it is done. The waiter needs to leave a flag for
the owner letting him know there is someone waiting,
but do so without interfering with the locked object cache line
(and thereby triggering an abort and retry by the current owner).
Hmmmm... just thinking out loud...

Eric

EricP

unread,
Sep 5, 2008, 3:09:56 PM9/5/08
to
EricP wrote:
> Yeah, that's what I was thinking.
> But that gets you into the retry-loop approach,
> and that causes the O(N**2) cost.
>
> Hmmm... an MLH (Magnusson, Landin, Hagersten) spin queue is O(N) cost,
> uses a linked list of cache lines to ensure FIFO ordering,
> and is very efficient in that accessors perform a single
> AtomicExchange on the spin queue head and thereafter each
> spin-waits on an independent cache line if there is any conflict.
>
> I wonder if these could be combined so that normally you have
> concurrent access but if a collision does occur,
> you back off and wait in a fifo queue and the owner will let
> you know when it is done. The waiter needs to leave a flag for
> the owner letting him know there is someone waiting,
> but do so without interfering with the locked object cache line
> (and thereby triggering an abort and retry by the current owner).
> Hmmmm... just thinking out loud...

Or maybe use a memory write to intentionally trigger a replay trap
in the current owner, like a low cost interrupt.

Oh dear, I've corrupted its purpose already :-)

Eric

Jan Vorbrüggen

unread,
Sep 8, 2008, 2:28:27 AM9/8/08
to
>> I might ask why one has IO scheduled to a line that is subject to
>> atomic events....
>
> Because the acquired object "A" just happened to be located
> in the same line as some different object "B", and someone was doing
> IO on "B" totally unaware that it shared the cache line with "A".
>
> E.g if the cache line was 64 or 128 bytes and the heap allocated
> objects on 32 byte boundaries.

"Doctor, it hurts terribly when I do _that_!"

"Well, then don't do that!"

Seriously: Any code that doesn't guard against false sharing is broken
in the first place - missing guarantees in the language standards (hi,
Nick!) notwithstanding.

>> You can use the failure code to seed a back-off timer. So rather than
>> accumulate the needed back-off time (ala CAS-based with exponential
>> back-off) one can use the failure code directly
>> (failure*BACK_OFF_LOOP_COUNT) or indirect via a table
>> (BackOffLoopTime[failure]) to determine the time/cycles to wait.
>
> Yeah, thats what I was thinking.
> But that gets you into the retry loop approach
> and that cause the O(N**2) cost.

My understanding of the failure code was that you would use it as a tool
that leads, with a high probability, to the "dis-entanglement" of the
competing/conflicting threads by separating them in time and/or space.
This means that all the worst-case scenarios now have a much lower
probability; I don't know whether even stronger statements can be made.

Jan

Nick Maclaren

unread,
Sep 8, 2008, 4:59:56 AM9/8/08
to

In article <ga2brp$udk$1...@s1.news.oleane.net>,

=?ISO-8859-1?Q?Jan_Vorbr=FCggen?= <Jan.Vor...@not-thomson.net> writes:
|> >> I might ask why one has IO scheduled to a line that is subject to
|> >> atomic events....
|> >
|> > Because the acquired object "A" just happened to be located
|> > in the same line as some different object "B", and someone was doing
|> > IO on "B" totally unaware that it shared the cache line with "A".
|> >
|> > E.g if the cache line was 64 or 128 bytes and the heap allocated
|> > objects on 32 byte boundaries.
|>
|> "Doctor, it hurts terribly when I do _that_!"
|>
|> "Well, then don't do that!"
|>
|> Seriously: Any code that doesn't guard against false sharing is broken
|> in the first place - missing guarantees in the language standards (hi,
|> Nick!) notwithstanding.

Hi, there! Yes. One can argue whose business it is to avoid that,
but it has been a known disaster area for threaded code for 25 years
now. There is no excuse for ignoring it.

I can see some ways that it could be alleviated in a programming
language properly designed for parallelism, but they wouldn't be nice.
The current attempts to bolt such mechanisms onto existing languages
have been uniformly unsatisfactory. So it gets left to the programmer,
who doesn't have the tools to do the job right.

But that isn't really false sharing, anyway, which is when two widely
separated locations map to the same location in the cache. And the
only way that I can see how to alleviate that involves moving to using
virtual addresses for caching, which in turn means changing the current
dogma for what a shared memory model should be.


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Sep 8, 2008, 5:17:42 AM9/8/08
to
On Mon, 08 Sep 2008 08:28:27 +0200, Jan Vorbrüggen wrote:

>> E.g if the cache line was 64 or 128 bytes and the heap allocated
>> objects on 32 byte boundaries.
>
> "Doctor, it hurts terribly when I do _that_!"
>
> "Well, then don't do that!"
>
> Seriously: Any code that doesn't guard against false sharing is broken
> in the first place - missing guarantees in the language standards (hi,
> Nick!) notwithstanding.


Since when did "false sharing" refer to cache lines? Words are one
thing; since there are aligned "words" that are SSE vectors these days,
then you can perhaps argue that sharing on that granularity is a bad idea
too. But the cache is (or should be) architecturally invisible. If it's
up to software not to put more than one "object" in a cache line on pain
of breakage, then we're all in for a whole lot of breakage.

IMO.

Cheers,

--
Andrew

Nick Maclaren

unread,
Sep 8, 2008, 6:22:34 AM9/8/08
to

In article <6ik8tlF...@mid.individual.net>,

Andrew Reilly <andrew-...@areilly.bpc-users.org> writes:
|>
|> Since when did "false sharing" refer to cache lines? Words are one
|> thing; since there are aligned "words" that are SSE vectors these days,
|> then you can perhaps argue that sharing on that granularity is a bad idea
|> too. But the cache is (or should be) architecturally invisible. If it's
|> up to software not to put more than one "object" in a cache line on pain
|> of breakage, then we're all in for a whole lot of breakage.

For about the last 20 years. It also refers to sharing at the page
level on distributed memory systems with virtual shared memory.

I suggest that you read up about tuning of shared memory programs,
such as in:

docs.sun.com/source/819-0501/7_tuning.html


Regards,
Nick Maclaren.

Chris M. Thomasson

unread,
Sep 8, 2008, 9:06:57 AM9/8/08
to
"Andrew Reilly" <andrew-...@areilly.bpc-users.org> wrote in message
news:6ik8tlF...@mid.individual.net...

> On Mon, 08 Sep 2008 08:28:27 +0200, Jan Vorbrüggen wrote:
>
>>> E.g if the cache line was 64 or 128 bytes and the heap allocated
>>> objects on 32 byte boundaries.
>>
>> "Doctor, it hurts terribly when I do _that_!"
>>
>> "Well, then don't do that!"
>>
>> Seriously: Any code that doesn't guard against false sharing is broken
>> in the first place - missing guarantees in the language standards (hi,
>> Nick!) notwithstanding.
>
>
> Since when did "false sharing" refer to cache lines?

WRT instructions such as LL/SC, if you don't properly align and pad objects
on separate L2 cache-lines, and the reservation granule is big enough to
take an entire L2 line into account during the transaction, false-sharing
can cause problems with the SC failing erroneously; this can lead to
live-lock in certain scenarios.


> Words are one
> thing; since there are aligned "words" that are SSE vectors these days,
> then you can perhaps argue that sharing on that granularity is a bad idea
> too. But the cache is (or should be) architecturally invisible. If it's
> up to software not to put more than one "object" in a cache line on pain
> of breakage, then we're all in for a whole lot of breakage.

I believe that a PPC manual mentions that programmers should avoid
false-sharing of cache-lines when they use LL/SC instructions...

Chris M. Thomasson

unread,
Sep 8, 2008, 9:38:42 AM9/8/08
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:ga2pic$9q4$1...@gemini.csx.cam.ac.uk...

>
> In article <ga2brp$udk$1...@s1.news.oleane.net>,
> =?ISO-8859-1?Q?Jan_Vorbr=FCggen?= <Jan.Vor...@not-thomson.net>
> writes:
> |> >> I might ask why one has IO scheduled to a line that is usbject to
> |> >> atomic events....
> |> >
> |> > Because the acquired object "A" just happened to be located
> |> > in the same line as some different object "B", and someone was doing
> |> > IO on "B" totally unaware that it shared the cache line with "A".
> |> >
> |> > E.g if the cache line was 64 or 128 bytes and the heap allocated
> |> > objects on 32 byte boundaries.
> |>
> |> "Doctor, it hurts terribly when I do _that_!"
> |>
> |> "Well, then don't do that!"
> |>
> |> Seriously: Any code that doesn't guard against false sharing is broken
> |> in the first place - missing guarantees in the language standards (hi,
> |> Nick!) notwithstanding.
>
> Hi, there! Yes. One can argue whose business it is to avoid that,
> but it has been a known disaster area for threaded code for 25 years
> now. There is no excuse for ignoring it.
>
> I can see some ways that it could be alleviated in a programming
> language properly designed for parallelism, but they wouldn't be nice.
> The current attempts to bolt such mechanisms onto existing languages
> have been uniformly unsatisfactory.

IMHO, one must also take the following into account:

http://groups.google.com/group/comp.programming.threads/msg/1ee634e9788dc359

object access patterns are important when deciding how to place them in
memory...


> So it gets left to the programmer,
> who doesn't have the tools to do the job right.

You can hack something together in C:

http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59c8e98a3

http://groups.google.com/group/comp.lang.c/msg/be7e0d0e97c5e1d9

;^)

Nick Maclaren

unread,
Sep 8, 2008, 9:48:52 AM9/8/08
to

In article <oQ9xk.42715$9u1....@newsfe09.iad>,

"Chris M. Thomasson" <n...@spam.invalid> writes:
|>
|> IMHO, one must also take the following into account:
|>
|> object access patterns are important when deciding how to place them in
|> memory...

Well, that's been known for decades, too, but see my next point.

|> > So it gets left to the programmer,
|> > who doesn't have the tools to do the job right.
|>
|> You can hack something together in C:
|>

|> ;^)

Not really. Even ignoring the disgusting nature of that sort of trick,
and the fact that it will stop working as soon as you move to another
system, it doesn't help with the associativity ("true false sharing")
problem.

Now, that is much less of a problem with threaded codes, but is MUCH
fouler when it occurs. For example, Intel has systems where cores
share a cache with limited associativity. Tuning multi-threaded code
for that sort of structure is a gibbering horror.


Regards,
Nick Maclaren.

Chris M. Thomasson

unread,
Sep 8, 2008, 10:11:37 AM
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:ga3ag4$qu0$1...@gemini.csx.cam.ac.uk...

>
> In article <oQ9xk.42715$9u1....@newsfe09.iad>,
> "Chris M. Thomasson" <n...@spam.invalid> writes:
> |>
> |> IMHO, one must also take the following into account:
> |>
> |> object access patterns are important when deciding how to place them in
> |> memory...
>
> Well, that's been known for decades, too, but see my next point.
>
> |> > So it gets left to the programmer,
> |> > who doesn't have the tools to do the job right.
> |>
> |> You can hack something together in C:
> |>
> |> ;^)
>
> Not really. Even ignoring the disgusting nature of that sort of trick,

It's a damn abomination! Hmm, can you suggest a cleaner method in C?


> and the fact that it will stop working as soon as you move to another
> system,

You would need to know the cache-line size of the specific architectures
you're targeting. It involves an ifdef mess, though... One part of it could
look something like:

#if ! defined(L2_CACHESIZE)
# if defined(ARCH_A)
# define L2_CACHESIZE 128
# elif defined(ARCH_B)
# define L2_CACHESIZE 64
# else
# error Shit Happens!
# endif
#endif
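
Then you use it to pad things out to a line; e.g., a per-thread counter
could look something like the following sketch (MAX_THREADS and the
struct are made up for illustration):

#define MAX_THREADS 16 /* illustrative only */

/* One counter per thread, each padded to a whole (assumed) L2 line,
   so that updates by different threads never contend for one line. */
struct padded_counter {
    long count;
    char pad[L2_CACHESIZE - sizeof(long)];
};

static struct padded_counter counters[MAX_THREADS];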


> it doesn't help with the associativity ("true false sharing")
> problem.

Well, if you want a data-structure to be aligned on a given boundary, the C
hack can help. IMVHO, that can be useful...


> Now, that is much less of a problem with threaded codes, but is MUCH
> fouler when it occurs. For example, Intel has systems where cores
> share a cache with limited associativity. Tuning multi-threaded code
> for that sort of structure is a gibbering horror.

Yeah, good point.

Nick Maclaren

unread,
Sep 8, 2008, 10:14:39 AM
to

In article <gjaxk.43013$9u1....@newsfe09.iad>,

"Chris M. Thomasson" <n...@spam.invalid> writes:
|>
|> > |> > So it gets left to the programmer,
|> > |> > who doesn't have the tools to do the job right.
|> > |>
|> > |> You can hack something together in C:
|> > |>
|> > |> ;^)
|> >
|> > Not really. Even ignoring the disgusting nature of that sort of trick,
|>
|> It's a damn abomination! Hmm, can you suggest a cleaner method in C?

Agreed. No.


Regards,
Nick Maclaren.

Jan Vorbrüggen

unread,
Sep 9, 2008, 2:54:12 PM
to
> I can see some ways that it could be alleviated in a programming
> language properly designed for parallelism, but they wouldn't be nice.

Hmmm...why do you think it couldn't be nice? It's the programmer's job
to express the semantics he needs - "don't allow this variable to be
shared with anything else" - and the tool chain's responsibility to get
it right for the platform in use.

> But that isn't really false sharing, anyway, which is when two widely
> separated locations map to the same location in the cache.

Ugh? That's conflict misses, is it not? Page colouring helps you with
that. The false sharing we're talking about here is concerned with cache
lines and the coherency mechanisms.

Jan

Nick Maclaren

unread,
Sep 9, 2008, 3:16:09 PM
to

In article <ga6bs4$9sg$1...@s1.news.oleane.net>,

Jan Vorbrüggen <Jan.Vor...@not-thomson.net> writes:
|>
|> > I can see some ways that it could be alleviated in a programming
|> > language properly designed for parallelism, but they wouldn't be nice.
|>
|> Hmmm...why do you think it couldn't be nice? It's the programmer's job
|> to express the semantics he needs - "don't allow this variable to be
|> shared with anything else" - and the tool chain's responsibility to get
|> it right for the platform in use.

Because it's not as simple as that. There are two problems, of which
the first is the easier:

There are multiple levels needed, in general: one for each level
of cache, and one for the page size (for virtual shared memory). It
is unclean to require the program to handle so much system dependence.
But I think that I could solve that one. Anyway, you are talking
about 'unshared at level N', where N could be constrained to be a
static property of a declaration.

It gets in the way of clean slicing and sub-structuring, as was
supported by Algol 68 and is in Fortran (not coincidentally). For example,
taking a large matrix and slicing it one way or the other depending on
what you want to do to it. Ideally, you would like it to be easy to
pass each slice to a separate processor. Fortran allows that, but
HPF's attempts to specify layout were a dismal failure. And I don't
see how to make it static without constraining what reasonable people
want to do.

|> > But that isn't really false sharing, anyway, which is when two widely
|> > separated locations map to the same location in the cache.
|>
|> Ugh? That's conflict misses, is it not? Page colouring helps you with
|> that. The false sharing we're talking about here is concerned with cache
|> lines and the coherency mechanisms.

Page colouring helps, yes, but is a hack not a solution. We could
argue about terminology, but I can't be bothered! Whatever you call
it, it's a foul thing to handle when you have multi-core CPUs with a
common cache with limited associativity.


Regards,
Nick Maclaren.


Jan Vorbrüggen

unread,
Sep 10, 2008, 2:26:54 AM
to
> There are multiple levels needed, in general, one for each level
> of cache and one for the page size (for virtual shared memory) and
> it is unclean to require the program to handle so much system dependence.
> But I think that I could solve that one. Anyway, you are talking
> about 'unshared at level N', where N could be constrained to be a
> static property of a declaration.

Compilers have been handling that kind of thing for a decade now, at
least. For instance, I know the Sun Fortran compilers have had switches
to specify the blocking at all levels of the memory hierarchy, and when
you select a certain processor model, the proper set of these parameters
is also selected.

[page colouring]


> Whatever you call it, it's a foul thing to handle when you have
> multi-core CPUs with a common cache with limited associativity.

I agree the limited associativity is what might really get you. Is that
still a problem? Perhaps in L1 cache, but not in L2 and higher, I would
say. And nowadays L1 is more like an extended register set, anyway.

Jan

Nick Maclaren

unread,
Sep 10, 2008, 4:02:51 AM
to

In article <ga7kfl$nav$1...@s1.news.oleane.net>,

Jan Vorbrüggen <Jan.Vor...@not-thomson.net> writes:
|>
|> > There are multiple levels needed, in general, one for each level
|> > of cache and one for the page size (for virtual shared memory) and
|> > it is unclean to require the program to handle so much system dependence.
|> > But I think that I could solve that one. Anyway, you are talking
|> > about 'unshared at level N', where N could be constrained to be a
|> > static property of a declaration.
|>
|> Compilers have been handling that kind of thing for a decade now, at
|> least. For instance, I know the Sun Fortran compilers have had switches
|> to specify the blocking at all levels of the memory hierarchy, and when
|> you select a certain processor model, the proper set of these parameters
|> is also selected.

Very badly, but yes. The problem is how to ensure that the separation
rules are followed, without forcing the program to do disgusting and
system-specific explicit layout.

|> [page colouring]
|> > Whatever you call it, it's a foul thing to handle when you have
|> > multi-core CPUs with a common cache with limited associativity.
|>
|> I agree the limited associativity is what might really get you. Is that
|> still a problem? Perhaps in L1 cache, but not in L2 and higher, I would
|> say. And nowadays L1 is more like an extended register set, anyway.

Yes. If I recall, some of the current Intel systems have precisely the
property I mention, but I can't remember exactly which and at how many
levels.


Regards,
Nick Maclaren.

Hallvard B Furuseth

unread,
Sep 13, 2008, 11:11:11 AM
to
Chris M. Thomasson writes:
> You would need to know the cache-line size of the specific architectures
> you're targeting. It involves an ifdef mess, though... One part of it could
> look something like:
>
> #if ! defined(L2_CACHESIZE)
> # if defined(ARCH_A)
> # define L2_CACHESIZE 128
> # elif defined(ARCH_B)
> # define L2_CACHESIZE 64
> # else
> # error Shit Happens!
> # endif
> #endif

Does that usually need to be recompiled with different cache sizes when
one buys a new machine? How wide a range of architectures have the same
cache size nowadays?

--
Hallvard

Terje Mathisen

unread,
Sep 13, 2008, 3:08:38 PM
to

The above more or less doesn't work at all:

If you have code that actually depends on the cache (or cache line) size,
you pretty much have to query the size at runtime, and then act accordingly.
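
On Linux with glibc, for instance, something along these lines works
(only a sketch: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and
sysconf() can report 0 where the kernel doesn't know the value):

#include <unistd.h>

/* Query the L1 data cache line size at runtime (glibc-specific),
   falling back to a guess when the system doesn't report one. */
static long cache_line_size(void)
{
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return (sz > 0) ? sz : 64; /* 64 is only a fallback guess */
}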

Otherwise your best bet is to use algorithms that are cache-friendly but
size-agnostic; e.g., many recursive algorithms like quicksort use a
divide & conquer approach that maps close to optimally onto cached
architectures.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

unread,
Sep 13, 2008, 3:16:39 PM
to

In article <J5qdncIDIZUrklHV...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Hallvard B Furuseth wrote:
|> > Chris M. Thomasson writes:
|> >> You would need to know the cache-line size of the specific architectures
|> >> you're targeting. It involves an ifdef mess, though... One part of it could
|> >> look something like:
|> >>
|> >> #if ! defined(L2_CACHESIZE)
|> >> # if defined(ARCH_A)
|> >> # define L2_CACHESIZE 128
|> >> # elif defined(ARCH_B)
|> >> # define L2_CACHESIZE 64
|> >> # else
|> >> # error Shit Happens!
|> >> # endif
|> >> #endif
|> >
|> > Does that usually need to be recompiled with different cache sizes when
|> > one buys a new machine? How wide a range of architectures have the same
|> > cache size nowadays?
|>
|> The above more or less doesn't work at all:
|>
|> If you have code that actually depends on the cache (or cache line) size,
|> you pretty much have to query the size at runtime, and then act accordingly.

It's a LOT worse than that! You usually need to know at least the
associativity, and often some of the more arcane properties, both
of your hardware and software :-(

I gave up such lunacies decades ago - life is too short.

|> Otherwise your best bet is to use algorithms that are cache-friendly but
|> size-agnostic; e.g., many recursive algorithms like quicksort use a
|> divide & conquer approach that maps close to optimally onto cached
|> architectures.

Quicksort is near-optimal only if reading sequentially backwards is
fairly efficient (as well as forwards). If the cache is such that only
forward and repeat reading are fast, there are better approaches. But
it is not TOO bad even on such systems (which, in any case, are rare
nowadays).


Regards,
Nick Maclaren.

Chris M. Thomasson

unread,
Sep 13, 2008, 8:01:56 PM
to
"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:J5qdncIDIZUrklHV...@giganews.com...

> Hallvard B Furuseth wrote:
>> Chris M. Thomasson writes:
>>> You would need to know the cache-line size of the specific architectures
>>> you're targeting. It involves an ifdef mess, though... One part of it could
>>> look something like:
>>>
>>> #if ! defined(L2_CACHESIZE)
>>> # if defined(ARCH_A)
>>> # define L2_CACHESIZE 128
>>> # elif defined(ARCH_B)
>>> # define L2_CACHESIZE 64
>>> # else
>>> # error Shit Happens!
>>> # endif
>>> #endif
>>
>> Does that usually need to be recompiled with different cache sizes when
>> one buys a new machine?

Sometimes.


>> How wide a range of architectures have the same
>> cache size nowadays?

Not that many. For instance, an UltraSPARC T1 has 64-byte L2 cache lines.
Some Intels have 128-byte L2 lines which are split into two 64-byte regions.
AFAICT, 64 bytes seems to be "fairly" common...


> The above more or less doesn't work at all:

It works in a narrow context indeed. For instance: This library is compiled
for archX vendorV. If you use it on archX vendorY, you may or may not
experience a performance degradation...


> If you have code that actually depends on the cache (or cache line) size,

When you're creating synchronization primitives, knowing the L2 cache-line
size can be VERY important indeed. I would not say that the algorithms
depend on the size in the sense that they will not work if there is a
conflict; however, there may be performance issue(s)...


> you pretty much have to query the size at runtime, and then act
> accordingly.

That works, but pretty much rules out static initialization of sync
primitives...
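
With a compile-time constant you can still write something like this
sketch (a POSIX mutex and a 64-byte line are my assumptions here):

#include <pthread.h>

/* A mutex padded out to a multiple of an assumed 64-byte line; the
   static initializer below would be impossible if the line size were
   only discovered at runtime. */
struct padded_mutex {
    pthread_mutex_t m;
    char pad[64 - (sizeof(pthread_mutex_t) % 64)];
};

static struct padded_mutex g_lock = { PTHREAD_MUTEX_INITIALIZER };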


> Otherwise your best bet is to use algorithms that are cache-friendly but
> size-agnostic; e.g., many recursive algorithms like quicksort use a
> divide & conquer approach that maps close to optimally onto cached
> architectures.

Yeah, using algorithms with a high level of data locality is always a win;
IMHO, that is...

Chris M. Thomasson

unread,
Sep 13, 2008, 8:04:17 PM
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:gah3in$356$1...@gemini.csx.cam.ac.uk...
[...]

IMVHO, creating arch-dependent code is fun!

:^D

Andrew Reilly

unread,
Sep 14, 2008, 7:28:34 AM
to
On Sat, 13 Sep 2008 17:04:17 -0700, Chris M. Thomasson wrote:

>> I gave up such lunacies decades ago - life is too short.
> [...]
>
> IMVHO, creating arch-dependent code is fun!

Sometimes it's the only game in town, but I'm with Nick (I think):
despite common wisdom to the contrary, software turns over *much* more slowly
than hardware, so unless you have at least a "reference" version of your
important algorithms in as pure and machine-independent form as you can,
you'll wind up spending all your time chasing architecture variations,
rather than coming up with interesting new algorithms, or solving
interesting new problems.

Of course, nailing an algorithm down on any particular architecture is
going to need some arch-dependent code (probably), so go nuts. Have fun
where you find it!

Cheers,

--
Andrew

Nick Maclaren

unread,
Sep 14, 2008, 7:39:06 AM
to

In article <6j4ar2F...@mid.individual.net>,

Andrew Reilly <andrew-...@areilly.bpc-users.org> writes:
|> On Sat, 13 Sep 2008 17:04:17 -0700, Chris M. Thomasson wrote:
|>
|> >> I gave up such lunacies decades ago - life is too short.
|> > [...]
|> >
|> > IMVHO, creating arch-dependent code is fun!

Well, yes, but you get bored of that after a couple of decades - especially
if you have to do it when you don't want to. And I have had to do far too
much of that, especially to bypass design faults and/or investigate what
the documentation should have clarified but didn't.

|> Sometimes it's the only game in town, but I'm with Nick (I think):
|> despite common wisdom to the contrary, software turns *much* more slowly
|> than hardware, so unless you have at least a "reference" version of your
|> important algorithms in as pure and machine-independent form as you can,
|> you'll wind up spending all your time chasing architecture variations,
|> rather than coming up with interesting new algorithms, or solving
|> interesting new problems.

Yes.

|> Of course, nailing an algorithm down on any particular architecture is
|> going to need some arch-dependent code (probably), so go nuts. Have fun
|> where you find it!

There is a big difference between throw-away code and permanent code;
it is often necessary to do the most repulsive things in order to
investigate what is going on and/or expose a bug. But such code needs
to be deliberately deleted, in case it creeps into a real program.

Also, writing a simple algorithmic core should be done by providing
just such a reference algorithm - and having a program option to
calculate the result both ways and compare them! That is tricky when
the point of the tuned version is to get the exception handling right,
rather than simple performance, but it can usually be done.
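
In outline, something like the following; the summation and its 4-way
unroll are only stand-ins for whatever arch-dependent tuning was
actually done:

#include <math.h>
#include <stdio.h>

/* Reference version: the obvious loop. */
static double sum_ref(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* "Tuned" version: 4-way unrolled, standing in for real tuned code. */
static double sum_fast(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]; s1 += x[i+1]; s2 += x[i+2]; s3 += x[i+3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += x[i];
    return s;
}

/* The program option: compute both ways and compare.  Floating point
   needs a tolerance, since reassociation changes the rounding. */
static double sum_checked(const double *x, int n, int check)
{
    double fast = sum_fast(x, n);
    if (check) {
        double ref = sum_ref(x, n);
        if (fabs(fast - ref) > 1e-9 * (fabs(ref) + 1.0))
            fprintf(stderr, "mismatch: %.17g vs %.17g\n", fast, ref);
    }
    return fast;
}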


Regards,
Nick Maclaren.

John Dallman

unread,
Sep 14, 2008, 8:39:00 AM
to
In article <6j4ar2F...@mid.individual.net>,
andrew-...@areilly.bpc-users.org (Andrew Reilly) wrote:
> On Sat, 13 Sep 2008 17:04:17 -0700, Chris M. Thomasson wrote:
> > IMVHO, creating arch-dependent code is fun!
>
> Sometimes it's the only game in town, but I'm with Nick (I think):
> despite common wisdom to the contrary, software turns over *much* more
> slowly than hardware, so unless you have at least a "reference"
> version of your important algorithms in as pure and
> machine-independent form as you can, you'll wind up spending all your
> time chasing architecture variations, rather than coming up with
> interesting new algorithms, or solving interesting new problems.

Absolutely. For the product I work on, architecture-dependent code,
except in OS interfaces, is banned. The reason is simple. We have to
produce it working as similarly as possible on a bunch of different
architectures, and those get implicitly compared, daily, by customers.

How so? One branch of a customer's company runs the software on, say,
Windows, and another on, say, Suns. When their separate pieces of work
have to be merged together, if the two platforms haven't behaved
consistently, then they come and beat us up about it.

Plus, if you have architecture-dependent implementations of an
algorithm, it's far too easy for a maintainer to fix a bug for the
platform he works on, but fail to do so, or mess up the change, in the
version(s) for other architectures. Yes, you can have code reviews, and
so on, but making problems impossible is preferable. Correctness matters
more than speed.

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.
