
thread-safety of unrelated variables


Pankaj

Jun 19, 2003, 4:59:26 PM
Hi,

Consider the following:

// globals
short a = 10; // 16 bit qty
short b = 12; // 16 bit qty

void incr_a() {a++;}
void incr_b() {b++;}

// thread1
{
incr_a();
}

// thread2
{
incr_b();
}

Is this C/C++ program thread-safe (let's say on a 32 bit x86?)
1) If compiled with a non-pthreads compliant compiler?
2) If compiled on a pthreads compliant compiler?
3) If none of the above, what can be done to make it so?

(My concern is whether a compiler would be able to situate variables
'a' and 'b' "far" apart from each other so that a ReadModifyWrite on
'a' does not affect 'b'?)

Thanks,

Pankaj

David Schwartz

Jun 19, 2003, 5:16:26 PM

"Pankaj" <pank...@yahoo.com> wrote in message
news:7b5cbdbf.03061...@posting.google.com...

> Is this C/C++ program thread-safe (let's say on a 32 bit x86?)

No, assuming you mean the obvious.

> 1) If compiled with a non-pthreads compliant compiler?

Who knows. Knowing only that a compiler isn't pthreads-compliant tells
us nothing about what it would or might do.

> 2) If compiled on a pthreads compliant compiler?

Sadly, no.

> 3) If none of the above, what can be done to make it so?

You have three obvious choices:

1) Protect all global variables by the same mutex where memory visiblity
and word tearing can't be prevented any other way. (Yuck.)

2) Promote them from mere global variables to fully distinct objects.
For example, if you copy them into 'malloc'ed space, you should be able to
avoid word tearing. (Tolerable.)

3) Use your knowledge of the native word size and assume words will tear
for all objects smaller than that size. (Painful.)

DS
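For concreteness, choice (1) might look like the following minimal sketch. The pthread names are the real POSIX API; treating both shorts as one mutex-guarded unit is the point of the exercise:

```c
#include <pthread.h>

/* Sketch of choice (1): one mutex guards both shorts, so no
   read-modify-write on 'a' can tear 'b', whatever the compiler
   and hardware do with adjacent 16-bit objects. */
short a = 10; /* 16 bit qty */
short b = 12; /* 16 bit qty */
static pthread_mutex_t ab_lock = PTHREAD_MUTEX_INITIALIZER;

void incr_a(void)
{
    pthread_mutex_lock(&ab_lock);
    a++;
    pthread_mutex_unlock(&ab_lock);
}

void incr_b(void)
{
    pthread_mutex_lock(&ab_lock);
    b++;
    pthread_mutex_unlock(&ab_lock);
}
```

The "yuck" is that every touch of either variable, from any library, has to go through the same lock.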


SenderX

Jun 19, 2003, 6:08:10 PM
> Is this C/C++ program thread-safe (let's say on a 32 bit x86?)

If writing to 'a' doesn't change 'b', then yes.

If not, then you need to modify 'a' and 'b' at the same time, in one atomic
operation. The Interlocked APIs would do this for you.


> (My concern is whether a compiler would be able to situate variables
> 'a' and 'b' "far" apart from each other so that a ReadModifyWrite on
> 'a' does not affect 'b'?)

It better!

You could put padding between them, or use the Interlocked APIs to modify
two adjacent 16-bit values.


Some words on word-tearing:

http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6158

;)

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com


Steve Watt

Jun 19, 2003, 6:35:19 PM
In article <7b5cbdbf.03061...@posting.google.com>,
Pankaj <pank...@yahoo.com> wrote:

>Is this C/C++ program thread-safe (let's say on a 32 bit x86?)

On a 32 bit x86, yes, because it has 16 bit access opcodes, and can do
memory-to-memory arithmetic. On other architectures that don't have
such opcodes, no.

In an SMP system, the cache line will be rattled back and forth between
the processors if you access those variables a lot. This will have a,
shall we say, somewhat deleterious effect on your performance.

>1) If compiled with a non-pthreads compliant compiler?
>2) If compiled on a pthreads compliant compiler?
>3) If none of the above, what can be done to make it so?

Fixing this is surprisingly hard. Grabbing space out of the heap
(i.e. via malloc) will work, but at a rather substantial cost (at
least for your particular two-word example).

In general, the easiest way to avoid this is to group variables that
are going to be accessed by one thread, which at least decreases the
probability of Bad Things happening.

>(My concern is whether a compiler would be able to situate variables
>'a' and 'b' "far" apart from each other so that a ReadModifyWrite on
>'a' does not affect 'b'?)

The keyword you want for Googling is "false sharing".
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.8" / 37N 20' 14.9"
Internet: steve @ Watt.COM Whois: SW32
Free time? There's no such thing. It just comes in varying prices...

Steve Watt

Jun 19, 2003, 7:16:44 PM
In article <Xns93A0804131F...@130.133.1.4>,

Oliver S. <Foll...@gmx.net> wrote:
>> On a 32 bit x86, yes, because it has 16 bit access opcodes, and can do
>> memory-to-memory arithmetic. On other architectures that don't have
>> such opcodes, no.
>
> Even memory-to-memory operations aren't atomic on x86 SMP systems.
>Locked memory-to-memory operations are necessary here, and these
>operations aren't provided by any C compiler with C-like syntax.

I'm a little out of date on my x86 architecture details. What I was
imagining is that processor A would drag the cache line in, and keep
it busy until the increment was back into the cache line. Then the
other processor would get its turn.

But the falsely-shared increments don't work even on a uniprocessor if the
architecture doesn't have the 16-bit opcodes.

>> In an SMP system, the cache line will be rattled back and forth between
>> the processors if you access those variables a lot. This will have a,
>> shall we say, somewhat deleterious effect on your performance.
>

> If sharing a variable is necessary this discussion is superfluous.

The original question was about false sharing -- where, through an
accident of declarations, a memory location winds up having two
variables in it. Thus if thread1 is on CPU A and spends a lot of
time manipulating one half of the word at address X, and thread2 on
CPU B spends a lot of time manipulating the other half, the cache
line will spend a lot of time being invalidated and written back.

David Schwartz

Jun 19, 2003, 7:43:16 PM

"Steve Watt" <st...@nospam.Watt.COM> wrote in message
news:HGr4n...@Watt.COM...

> I'm a little out of date on my x86 architecture details. What I was
> imagining is that processor A would drag the cache line in, and keep
> it busy until the increment was back into the cache line. Then the
> other processor would get its turn.

Yes, and if this happens there's no problem. The thing is, nothing makes
sure this will happen. This is precisely what the 'lock' prefix is for.

> The original question was about false sharing -- where, through an
> accident of declarations, a memory location winds up having two
> variables in it. Thus if thread1 is on CPU A and spends a lot of
> time manipulating one half of the word at address X, and thread2 on
> CPU B spends a lot of time manipulating the other half, the cache
> line will spend a lot of time being invalidated and written back.

Or worse, the data will be corrupted. While one CPU is doing
read-modify-write of one variable, another CPU can do a read-modify-write of
the other variable. The cache line can be invalidated and written back
during the modify operations. (If not, what would the 'lock' prefix be for?)

DS
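As a concrete illustration of what the 'lock' prefix buys: GCC's __sync builtins (a real compiler API, though exactly which instruction gets emitted is implementation-dependent) turn the unlocked read-modify-write into one atomic operation, typically `lock xadd` on x86:

```c
/* The plain increment compiles to an ordinary read-modify-write
   (roughly 'add dword ptr [counter], 1'), which another CPU can
   interleave with mid-operation; the builtin adds the 'lock'
   prefix, making the whole read-modify-write atomic. */
static int counter = 0;

void incr_unlocked(void) { counter++; }                         /* not SMP-safe */
void incr_locked(void)   { __sync_fetch_and_add(&counter, 1); } /* atomic RMW   */
```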


David Schwartz

Jun 19, 2003, 8:14:36 PM

"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93A012C2CE70...@130.133.1.4...

> No, it's not like that with x86-CPUs - from the PPro to the current
> P4s/Xeons - because they're breaking up memory-to-memory operations
> into three distinct RISC-like operations called u-OPs or Macro-OPs.
> And the OPs corresponding to a single x86 mem-to-mem instruction
> unfortunately don't enforce the locking of a cache line, although this would be
> a cool feature because we could drop the lock-prefix in some cases and
> thereby omit the synchronizing behaviour of the mem-to-mem operation
> (which significantly slows down the whole instruction-stream) to yield
> a higher performance.

This really isn't possible to actually do, easy though it may be to
imagine. The problem is that if you omit the synchronizing behavior, you're
doing other things (things other than this one operation) while the cache
line is locked. Without the synchronizing behavior, there is no "when I
start this instruction" and "when I'm done with this instruction". If the
cache line is held while there are other operations to the same cache line,
things could get very scary.

Because of the synchronizing behavior, while the locked operation is
taking place, nothing else is. So it's not difficult to figure out when the
lock should be asserted and when it should be released and which operations
can take place while it's locked.

I suppose you could imagine that it would somehow detect another access
attempt to the locked cache line and stall the pipeline only if that
happened. One problem is that there's code that relies upon this behavior.

I know there's code that (perhaps foolishly) relies on locked operations
to provide memory visibility and ordering guarantees. See the long thread
about the Win32 Interlocked* API.

DS


David Butenhof

Jun 20, 2003, 8:57:33 AM
Steve Watt wrote:

> In article <7b5cbdbf.03061...@posting.google.com>,
> Pankaj <pank...@yahoo.com> wrote:
>
>>Is this C/C++ program thread-safe (let's say on a 32 bit x86?)
>
> On a 32 bit x86, yes, because it has 16 bit access opcodes, and can do
> memory-to-memory arithmetic. On other architectures that don't have
> such opcodes, no.

It's not that simple, on either axis.

First, you're assuming that because X86 supports 16-bit operands, and a
"short" happens to be 16 bits, that the compiler will necessarily always
reference that data using the 16-bit operands. Now, given a PARTICULAR X86
compiler, someone sufficiently versed in the code generator (or perhaps
even documentation, though such details are usually omitted) might be able
to say for sure that the compiler could never decide to use a 32-bit
access. Why would it? Well, that's a good question, but code generators and
optimizers can do a lot of things that surprise even the designers.

On the other hand, the fact that a machine lacks (or the compiler doesn't
choose to generate) 16-bit instructions doesn't necessarily mean that
16-bit data references can't be safe. For example, on Alpha (which does
actually have 16-bit operand support, but since the initial releases of the
architecture DIDN'T, compilers typically don't generate those instructions
by default), you can tell the compiler to make unaligned references safe;
it actually uses atomic update instructions (load-lock, store-conditional)
to protect the entire aligned "chunk" around the actual address. In that
mode, access to a and b from separate threads would be completely safe...
although somewhat slower than one might expect.

The point here is that people tend to look at these things strictly from the
perspective of hardware capabilities and restrictions, and you really need
to consider the whole tool chain and application environment. The ANSI C
standard does not require that a compiler always generate 16-bit
instructions (even where available) for 16-bit data. Even if you can
determine that the current version of the compiler you currently use
"always seems to" do it the way you'd like, that's no guarantee UNLESS SO
DOCUMENTED that the compiler will always do it that way.

So such assumptions carry a risk. You need to weigh that risk against your
other project constraints. If you're only building the code yourself, on
one compiler, on a single architecture, you may well decide that the risk
isn't significant. If you're distributing source code that many people will
build on many architectures and on different compilers, I'd suggest that
perhaps you should approach such risk somewhat more conservatively. ;-)

> In an SMP system, the cache line will be rattled back and forth between
> the processors if you access those variables a lot. This will have a,
> shall we say, somewhat deleterious effect on your performance.

Yes, but it won't affect the data, which seems to be the main concern here.
Nevertheless, cache thrashing can be really nasty -- it can easily dominate
all other aspects of application performance. We've dealt with several
critical high-profile customer "performance complaints" that turned out to
be merely because two commonly used application variables shared a cache
line. When the data were separated, application performance was increased
by a factor of 10.

>>(My concern is whether a compiler would be able to situate variables
>>'a' and 'b' "far" apart from each other so that a ReadModifyWrite on
>>'a' does not affect 'b'?)
>
> The keyword you want for Googling is "false sharing".

More is involved than just "false sharing" -- there's a potential for data
corruption depending on the code generator and optimizer.

In general, compilers will not allocate variables at greater than their
"natural" alignment. Compilers designed for machines that support unaligned
data access often don't do any alignment. This is all beyond the scope of
the ANSI C standard, so any dependencies in this area are on the particular
compiler you're using rather than on "C".

So, to speak a bit lightly, yes, the compiler "would be able to" separate a
and b far enough apart to avoid "crosstalk" effects -- but there's no
external requirement on any compiler that it do this.

Many compilers have pragmas to affect data alignment, but again that's
implementation dependent and may not port even to other compilers on the
same system.

Sometimes we just hack it. For example, in your case you could do this:

short a = 10; // 16 bit qty

int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];


short b = 12; // 16 bit qty

This will force 'b' into a separate cache line (presuming of course that
your CACHE_LINE_SIZE symbol is set large enough) from 'a'. That'll avoid
false sharing and make it highly improbable that any compiler could think
it advantageous to use any access technique that might affect both
variables.

And of course, technically, you also need fillers before 'a' and after 'b'.
Even if this particular file has no other extern declarations, they will
probably find themselves living next door to extern declarations from
separate object files when you link an application.

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Pankaj

Jun 20, 2003, 1:36:07 PM
David Butenhof <David.B...@hp.com> wrote in message news:<3ef3...@usenet01.boi.hp.com>...

> Sometimes we just hack it. For example, in your case you could do this:
>
> short a = 10; // 16 bit qty
> int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];
> short b = 12; // 16 bit qty
>
> This will force 'b' into a separate cache line (presuming of course that
> your CACHE_LINE_SIZE symbol is set large enough) from 'a'. That'll avoid
> false sharing and make it highly improbable that any compiler could think
> it advantageous to use any access technique that might affect both
> variables.
>
> And of course, technically, you also need fillers before 'a' and after 'b'.
> Even if this particular file has no other extern declarations, they will
> probably find themselves living next door to extern declarations from
> separate object files when you link an application.

I see. BTW, I believe you meant to say
char filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];

Couldn't this "filling" be done automatically and made a requirement
for a compiler's pthreads-compliance? In other words, couldn't a
pthreads-compliant compiler, by default, situate independent objects
on separate cache lines? (I am not mentioning other alternatives since
they could still cause sharing of the cache line and therefore affect
performance.)

Thanks,

Pankaj

SenderX

Jun 20, 2003, 3:26:04 PM
> __asm
> {
> mov eax, [ecx]
>
> xchgLoop:
> test eax, eax
> jz byebye
> mov edx, [eax]
>
> cmpxchg [ecx], edx
> jnz xchgLoop

Where is your ABA increment? Also, you would want to use a wide-CAS here?


If you intend to reuse nodes in a lock-free stack:

You would " need " to increment an " ABA count " on both pops and pushes.
SList API, IBM FreeList, Mike and Scott algo, ect... they all " increment
aba on pushes ", because they reuse nodes.


Here are some correct lifo stacks:

Emulates the SList API on IA-32 and PowerPC:

http://cvs.grame.fr/cgi-bin/midishare-cvs/src/common/Headers/lflifo.h?rev=1.5&co


Uses my lock-free system to implement the SList API using a " normal "
CAS!!:

http://AppCore.home.attbi.com/src/FreeStack.h

Pankaj

Jun 20, 2003, 7:00:37 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:<bct97b$lfa$1...@nntp.webmaster.com>...

> "Pankaj" <pank...@yahoo.com> wrote in message
> news:7b5cbdbf.03061...@posting.google.com...
>
> // globals
> short a = 10; // 16 bit qty
> short b = 12; // 16 bit qty
>
> void incr_a() {a++;}
> void incr_b() {b++;}
>
> // thread1
> {
> incr_a();
> }
>
> // thread2
> {
> incr_b();
> }
>
> > Is this C/C++ program thread-safe (let's say on a 32 bit x86?)
>
> No, assuming you mean the obvious.
> > 1) If compiled with a non-pthreads compliant compiler?
>
> Who knows. Knowing only that a compiler isn't pthreads-compliant tells
> us nothing about what it would or might do.
>
> > 2) If compiled on a pthreads compliant compiler?
>
> Sadly, no.
>
> > 3) If none of the above, what can be done to make it so?
>
> You have three obvious choices:
>
> 1) Protect all global variables by the same mutex where memory visibility
> and word tearing can't be prevented any other way. (Yuck.)

And not guaranteed to be doable in all situations. For example, what
if <short a, void incr_a()> was part of one library and <short b, void
incr_b()> part of another. Let us also assume that these libraries
were individually thread-safe (for example using individual mutexes
for the respective variable). But put them together...and...the world
shatters.

> 3) Use your knowledge of the native word size and assume words will tear
> for all objects smaller than that size. (Painful.)

And not portable. Even if all objects could be padded to N bits for a
N bit processor, we are back to square one for a 2N bit cpu.

> 2) Promote them from mere global variables to fully distinct objects.
> For example, if you copy them into 'malloc'ed space, you should be able to
> avoid word tearing. (Tolerable.)

This seems to be "theoretically" correct but leads to horrible
complexity. For example, C++ classes, especially template classes
would not be able to hold data members by value, since who knows what
the user may instantiate them with.

For example:

template <class T>
class Date
{
static Mutex mtx;
T d, m, y;

// the following ops lock 'mtx' in the usual way
void incr_d();
void incr_m();
void incr_y();
};

will lead to the same problem on a 64 bit processor if instantiated as

Date<int32> d1, d2;


Instead we would have to write:

template <class T>
class Date
{
static Mutex mtx;
T *d, *m, *y;
Date() {d = new T; m = new T; y = new T;}
// the following ops lock 'mtx' in the usual way
void incr_d();
void incr_m();
void incr_y();
};

Or test for the sizeof(T) and compare it to the native word size
before instantiating the earlier version.

I don't know if I am ready to live like this :-(...

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

SenderX

Jun 21, 2003, 3:25:18 PM
> Hey, there are hundreds of articles about lock-free synchronization
> on the web; you're that euphoric about your lock-free synch-primitives
> like you invented the whole thing.

I did invent my new lock-free system ( I did a brief write up on it in this
group ); it uses "normal" CAS, so that is a reason to be happy!

You don't see the importance in the ability to use a normal CAS for
lock-free LIFOs and FIFOs? It means it can port to any ( 32 or 64 bit )
processor that has LL/SC or CAS!!! It's a very portable lock-free algo
indeed.

Please, try to find and cite an algo that does what my lock-free system
does...


Remember, it must be able to handle hundreds of thousands of nodes and use a
normal CAS, and avoid ABA using only an 8-bit ABA counter. It must be able
to implement "scalable" lock-free FIFO and LIFO collections. Also, it
must break down lock-free access into sub-queues, another scheme I came up
with.

Have fun trying to find a paper that presents an algo that does what mine
does!


> I hoped I could do this simplification

You may give people the wrong idea. You can't use cmpxchg to do a lock-free
stack, unless you use my algo, or code your own that uses a normal CAS.


I needed to clarify for anybody reading your post who started thinking they
don't need an ABA count. They do.

SenderX

Jun 21, 2003, 4:42:06 PM
> You're really naive. Since Lamport's initial '77-paper there has been a
> lot of research done in this area and even a more libraries for lock-free
> synchronization. One very comprehensive library for lock-free synchroni-
> zation is NOBLE: www.noble-library.org

Again, try to find an algo that does what my lock-free system does. NOBLE
doesn't have it.

SenderX

Jun 21, 2003, 4:56:06 PM
> You're really naive.

Hmm, NOBLE is not too hot, man... Maybe you should look at it.

NOBLE uses some algos from John D. Valois ( FIFO / LIFO ). They're
non-scalable, and mostly broken.

If you want good linked lists, look at what Joe Seigh is doing using
collector objects.


My library simply provides a more robust assortment of efficient lock-free
algos.

Also, NOBLE doesn't seem to have ANY dynamic lock-free algos?

SenderX

Jun 21, 2003, 6:09:23 PM
> I'm sure you could find everything you did on the web.

Not so far... Please try!

;)


I am going to do a formal write up on it, focusing on its internals. Like
how it makes only an 8-bit ABA counter sufficient to halt ABA...


After it is done, if you can find another older paper that mimics it, I will
have to concede to you...


Would anybody even want to look at my upcoming write ups on my library's
lock-free system?

Chris Noonan

Jun 22, 2003, 10:44:09 PM
"SenderX" <x...@xxx.xxx> wrote in message news:<nI4Ja.1013402$OV.1113588@rwcrnsc54>...

> I am going to do a formal write up on it, focusing on its internals. Like
> how it makes only an 8-bit ABA counter sufficient to halt ABA...

When I design a lock-free algorithm, it always seems to happen
that the width of a counter has to cope with the maximum
number of threads that could be concurrently executing the
relevant code section. In turn, this generally seems to be
the full number of threads that are running in the program.

I am interested if your 8-bit counter can handle more than
256 concurrent threads, safely.

Chris

SenderX

Jun 23, 2003, 5:27:53 AM
> When I design a lock-free algorithm, it always seems to happen
> that the width of a counter has to cope with the maximum
> number of threads that could be concurrently executing the
> relevant code section.

Well, the ABA problem that the 8-bit counter helps to solve is in direct
relation to lock-free LIFO / FIFO collections...

The "real" reason behind ABA is simply "rapid" pointer reuse in lock-free
algos.


Due to the nature of CAS, it is possible for one thread to pop & push the
"same node" on the "same" ABA-protected pointer, "all inside another thread's
currently contending CAS critical section". Another thread's CAS can succeed,
when it should not! Of course this will simply break the collection, and
crash the application that is using it. Not good! ;(


So, the number of threads contending for the same ABA-protected pointer does
factor into the width of ABA counters, but it's a side-effect of the
real reason for ABA.

So, you have to do some tricks to get it to handle hundreds ( 256+ ) of
threads.

;)


The write up should clearly explain the simple but effective algo...

> I am interested in whether your 8-bit counter can handle more than
> 256 concurrent threads safely.

Yes it can.
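For readers who want the tag idea made concrete, here is the generic version-counter defence, not SenderX's algorithm (and the 32-bit tag below is far wider than his 8 bits): pack a counter next to the value, so that a stale compare-and-swap fails even when the value itself has returned to its old state. A C11 sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* head = (index << 32) | tag. Every successful swap bumps the tag,
   so a CAS computed from an older snapshot fails even if the same
   index has meanwhile been popped and pushed back (the ABA case). */
static _Atomic uint64_t head;

static uint64_t pack(uint32_t idx, uint32_t tag)
{
    return ((uint64_t)idx << 32) | tag;
}

/* Replace the current index with new_idx iff it is still expected_idx
   AND the tag has not moved since we looked; returns 1 on success. */
int tagged_swap(uint32_t expected_idx, uint32_t new_idx)
{
    uint64_t old = atomic_load(&head);
    if ((uint32_t)(old >> 32) != expected_idx)
        return 0;
    uint64_t desired = pack(new_idx, (uint32_t)old + 1); /* bump tag */
    return atomic_compare_exchange_strong(&head, &old, desired);
}
```

A real lock-free stack would loop on failure and store node pointers rather than indices; the point here is only how the tag turns an ABA success into a harmless CAS failure.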

Pankaj

Jun 23, 2003, 11:21:05 AM
David Butenhof <David.B...@hp.com> wrote in message news:<3ef3...@usenet01.boi.hp.com>...
> Steve Watt wrote:
>
> > In article <7b5cbdbf.03061...@posting.google.com>,
> > Pankaj <pank...@yahoo.com> wrote:
> >
> >>Is this C/C++ program thread-safe (let's say on a 32 bit x86?)
> >
> > On a 32 bit x86, yes, because it has 16 bit access opcodes, and can do
> > memory-to-memory arithmetic. On other architectures that don't have
> > such opcodes, no.
>
> Sometimes we just hack it. For example, in your case you could do this:
>
> short a = 10; // 16 bit qty
> int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];
> short b = 12; // 16 bit qty
>
> This will force 'b' into a separate cache line (presuming of course that
> your CACHE_LINE_SIZE symbol is set large enough) from 'a'. That'll avoid
> false sharing and make it highly improbable that any compiler could think
> it advantageous to use any access technique that might affect both
> variables.
>
> And of course, technically, you also need fillers before 'a' and after 'b'.
> Even if this particular file has no other extern declarations, they will
> probably find themselves living next door to extern declarations from
> separate object files when you link an application.

Just realized that this hack won't be able to handle the "fillers
before 'a' and after 'b'" since one does not know who the neighbors
of 'a' and 'b' are or can be in the future. For example, if another
file has a 'char' global, that could be the neighbor; if it or another
file has a 'short' global, that could be the neighbor. And also this
can change as more files are added to the project.

At this point, I think, mallocing and placing objects in the malloced
memory (which will ensure correct alignment for the particular cpu)
seems to be the only workable option but....it's so ugly..:-(.

Alexander Terekhov

Jun 23, 2003, 11:41:39 AM

Pankaj wrote:
[...]

> At this point, I think, mallocing and placing objects in the malloced
> memory (which will ensure correct alignment for the particular cpu)
> seems to be the only workable option but....it's so ugly..:-(.

http://groups.google.com/groups?threadm=3DB45F0D.FBE503A1%40web.de
(Subject: Re: Memory isolation)

regards,
alexander.

David Butenhof

Jun 23, 2003, 11:51:37 AM
Pankaj wrote:

> David Butenhof <David.B...@hp.com> wrote in message
> news:<3ef3...@usenet01.boi.hp.com>...
>
>> Sometimes we just hack it. For example, in your case you could do this:
>>
>> short a = 10; // 16 bit qty
>> int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];
>> short b = 12; // 16 bit qty
>>
>> This will force 'b' into a separate cache line (presuming of course that
>> your CACHE_LINE_SIZE symbol is set large enough) from 'a'. That'll avoid
>> false sharing and make it highly improbable that any compiler could think
>> it advantageous to use any access technique that might affect both
>> variables.
>>
>> And of course, technically, you also need fillers before 'a' and after
>> 'b'. Even if this particular file has no other extern declarations, they
>> will probably find themselves living next door to extern declarations
>> from separate object files when you link an application.
>
> I see. BTW, I believe you meant to say
> char filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];

No, actually, what I should have meant to say was more like

int filler [(CACHE_LINE_SIZE-sizeof(short))/sizeof(int)];

To compute the number of 'int's necessary to fill (at least) the remainder
of the cache line. (Depending on alignment, it may actually take more.)
However, you could indeed pack the cache line more exactly using a char
array, as:

char filler[CACHE_LINE_SIZE - sizeof(short)];

I had a reason at the time for using 'int' instead of 'char', but I can't
remember what it was, and I can't reconstruct it now. ;-)



> Couldn't this "filling" be done automatically and made a requirement
> for a compiler's pthreads-compliance? In other words, couldn't a
> pthreads-compliant compiler, by default, situate independent objects
> on separate cache lines? (I am not mentioning other alternatives since
> they could still cause sharing of the cache line and therefore affect
> performance.)

The problem is that you only WANT it when you NEED it. Otherwise you'd waste
tons of space -- which means more non-local references, more paging, bigger
images, and all sorts of other nasty secondary effects. I'm sure you'd also
break lots of crufty old code that's been working "forever".

So the crux is "situate INDEPENDENT objects". Yeah. Except the compiler has
no way to know when objects are independent. You'd have to tell it. That
would require new compiler syntax. And POSIX has no power to do that.

Charles Bryant

Jun 23, 2003, 6:17:59 PM
In article <3ef3...@usenet01.boi.hp.com>,
David Butenhof <David.B...@hp.com> wrote:
..

>Many compilers have pragmas to affect data alignment, but again that's
>implementation dependent and may not port even to other compilers on the
>same system.
>
>Sometimes we just hack it. For example, in your case you could do this:
>
> short a = 10; // 16 bit qty
> int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];
> short b = 12; // 16 bit qty
>
>This will force 'b' into a separate cache line (presuming of course that
>your CACHE_LINE_SIZE symbol is set large enough) from 'a'.

I don't know if the C standard prohibits it, but I would not be
confident with that method because I would be concerned that the
compiler (or linker) might re-order the objects. In particular, there
are two reasons why it might - one frivolous and one more sensible.
The frivolous one is that global objects might be in alphabetical
order ('a', 'b', 'filler'). More seriously, a compiler might order
objects by size, from largest to smallest, to save wasting space.

By using:
union {
short a;
char dummy[CACHE_LINE_SIZE];
} a;
union {
short b;
char dummy[CACHE_LINE_SIZE];
} b;
I think you're much more likely to guarantee they don't share a cache
line with each other (or anything else).
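For what it's worth, the union trick can at least be size-checked at compile time. A sketch, with 64 bytes assumed as the line size (CACHE_LINE_SIZE is a placeholder you must set for your target; note too that without an explicit alignment directive an object can still straddle a line boundary, which is why "much more likely" rather than "guaranteed"):

```c
#define CACHE_LINE_SIZE 64 /* assumed; check your target's line size */

union padded_short {
    short v;
    char dummy[CACHE_LINE_SIZE];
};

/* The two 'v' members are now at least a full line apart, so the
   shorts themselves cannot share a line even if the objects are
   placed back-to-back by the linker. */
union padded_short pa = { 10 };
union padded_short pb = { 12 };
```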

--
Eppur si muove

Michael Furman

Jun 23, 2003, 7:21:11 PM

"Charles Bryant" <n163636...@chch.demon.co.uk> wrote in message
news:2003-06-2...@chch.demon.co.uk...

> In article <3ef3...@usenet01.boi.hp.com>,
> David Butenhof <David.B...@hp.com> wrote:
> ..
> >Many compilers have pragmas to affect data alignment, but again that's
> >implementation dependent and may not port even to other compilers on the
> >same system.
> >
> >Sometimes we just hack it. For example, in your case you could do this:
> >
> > short a = 10; // 16 bit qty
> > int filler[CACHE_LINE_SIZE/sizeof(int)-sizeof(short)];
> > short b = 12; // 16 bit qty
> >
> >This will force 'b' into a separate cache line (presuming of course that
> >your CACHE_LINE_SIZE symbol is set large enough) from 'a'.
>
> I don't know if the C standard prohibits it, but I would not be
> confident with that method because I would be concerned that the
> compiler (or linker) might re-order the objects. In particular, there

It is enough to put them inside a structure to prevent reordering.

Michael


David Butenhof

Jun 24, 2003, 8:50:50 AM
Pankaj wrote:

> Just realized that this hack won't be able to handle the "fillers
> before 'a' and after 'b'" since one does not know who the neighbors
> of 'a' and 'b' are or can be in the future. For example, if another
> file has a 'char' global, that could be the neighbor; if it or another
> file has a 'short' global, that could be the neighbor. And also this
> can change as more files are added to the project.

You can put fillers anywhere, though I resisted the momentary temptation to
bracket fillers on both ends as well as in the middle.

However, as another poster points out, there's no language guarantee that a
compiler couldn't reorder declarations. I suspect most compilers don't
reorder, though, because there's still a lot of crufty old "seat of the
pants" pre-ANSI code that makes all sorts of assumptions based on the
limitations of ancient compilers; and the assumption that declared objects
appear IN ORDER was once common. Even if some compiler did start
reordering, it would almost certainly do so only in something like a strict
ANSI C mode, excluding older code that might "reasonably" have such
dependencies.

Still... this is all a matter of how safe you want to be and how much
inconvenience you're willing to accept for it. You can always use
structures, malloc, mmap, and other techniques that are "more safe".
Ultimately, the only real safety will be a standard and portable feature of
the language.



> At this point, I think, mallocing and placing objects in the malloced
> memory (which will ensure correct alignment for the particular cpu)
> seems to be the only workable option but....it's so ugly..:-(.

It's also NOT a perfect guarantee, because you're presuming that the
alignment and granularity of malloc is at least the alignment and
granularity required for physical isolation of all operations in the
machine. I suspect that most implementations of malloc have large enough
granularity and alignment to overcome the data corruption problems, but NOT
generally big enough to avoid false sharing. (Alignment is usually 4-byte,
8-byte, or maybe 16-byte, depending on the machine, while cache lines are
often 64 bytes or more.)

We don't live in a perfect world. There's no one practical and completely
general solution. Yet in most real code you either never see the problems
in the first place or can fairly easily work around the point cases where
they appear. You just need to be careful.
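Where malloc's alignment is too weak a guarantee, POSIX's posix_memalign (a real interface; the 64-byte line size below is an assumption) lets you buy both the alignment and the granularity explicitly:

```c
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_LINE 64 /* assumption: adjust to the target's cache line */

/* Returns a short living alone on its own cache line; caller frees.
   Rounding the allocation up to a whole line keeps neighbors off it. */
short *isolated_short(short init)
{
    void *p = NULL;
    if (posix_memalign(&p, ASSUMED_LINE, ASSUMED_LINE) != 0)
        return NULL;
    *(short *)p = init;
    return (short *)p;
}
```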

David Butenhof

Jun 24, 2003, 8:59:44 AM
Charles Bryant wrote:

> By using:
> union {
> short a;
> char dummy[CACHE_LINE_SIZE];
> } a;
> union {
> short b;
> char dummy[CACHE_LINE_SIZE];
> } b;
> I think you're much more likely to guarantee they don't share a cache
> line with each other (or anything else).

Yes, though that's ugly, inconvenient, and wasteful. This is essentially
what the original POSIX (draft) memory model suggested, even providing a
standard macro to encase any C declaration into a union of appropriate
size. It was of course removed along with the rest of the memory model
section.
