
Memory corruption due to word sharing


Jan Kara

Feb 1, 2012, 10:20:03 AM

Hello,

we've spotted the following mismatch between what kernel folks expect
from a compiler and what GCC really does, resulting in memory corruption on
some architectures. Consider the following structure:
struct x {
        long a;
        unsigned int b1;
        unsigned int b2:1;
};

We have two processes P1 and P2 where P1 updates field b1 and P2 updates
bitfield b2. The code GCC generates for b2 = 1 e.g. on ia64 is:
0: 09 00 21 40 00 21 [MMI] adds r32=8,r32
6: 00 00 00 02 00 e0 nop.m 0x0
c: 11 00 00 90 mov r15=1;;
10: 0b 70 00 40 18 10 [MMI] ld8 r14=[r32];;
16: 00 00 00 02 00 c0 nop.m 0x0
1c: f1 70 c0 47 dep r14=r15,r14,32,1;;
20: 11 00 38 40 98 11 [MIB] st8 [r32]=r14
26: 00 00 00 02 00 80 nop.i 0x0
2c: 08 00 84 00 br.ret.sptk.many b0;;

Note that gcc used a 64-bit read-modify-write cycle to update b2. Thus if P1
races with P2, an update of b1 can get lost. BTW: I've just checked on x86_64,
and there GCC uses an 8-bit bitop to modify the bitfield.
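
To make the interleaving concrete, here is a rough sketch of the lost update
(illustrative C only; p1()/p2() are made-up names and reuse struct x above):

/* P1, holding the lock that protects b1 */
void p1(struct x *p)
{
        p->b1 = 5;              /* a plain 32-bit store to b1 */
}

/* P2, holding a different lock that protects b2.  On ia64 gcc turns this
 * into a 64-bit read-modify-write of the doubleword that also holds b1:
 *   tmp = 64-bit load of { b1, b2 };
 *   set the b2 bit in tmp;
 *   64-bit store of tmp;
 * If P1's store lands between P2's load and store, it is overwritten with
 * the stale b1 value and silently lost.
 */
void p2(struct x *p)
{
        p->b2 = 1;
}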

We actually spotted this race in practice in btrfs, in the structure
fs/btrfs/ctree.h:struct btrfs_block_rsv, where spinlock content got
corrupted due to an update of the following bitfield, and there seem to be
other places in the kernel where this could happen.

I've raised the issue with our GCC guys and they said to me that: "C does
not provide such guarantee, nor can you reliably lock different
structure fields with different locks if they share naturally aligned
word-size memory regions. The C++11 memory model would guarantee this,
but that's not implemented nor do you build the kernel with a C++11
compiler."

So it seems what C/GCC promises does not quite match with what kernel
expects. I'm not really an expert in this area so I wanted to report it
here so that more knowledgeable people can decide how to solve the issue...

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Markus Trippelsdorf

Feb 1, 2012, 10:50:02 AM

FYI, the gcc bug can be found here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

--
Markus

Colin Walters

Feb 1, 2012, 11:40:02 AM

On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:

> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions. The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

That's interesting. So it seems like there are two solutions:

1) Use the same lock for a given bitfield
2) Split up the bitfield into different words
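
A rough sketch of what option 2 could look like (illustrative only; struct
x_split and the padding member are made up): move the bitfield into its own
naturally aligned word so a wide read-modify-write of its storage unit cannot
reach the independently locked field.

struct x_split {
        long a;
        unsigned int b1;        /* updated by P1 under its own lock */
        long pad;               /* keeps b2 out of b1's 8-byte unit */
        unsigned int b2:1;      /* updated by P2 under a different lock; even
                                 * a 64-bit RMW of this unit now only touches
                                 * b2 and trailing padding */
};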

Linus Torvalds

Feb 1, 2012, 11:50:02 AM

On Wed, Feb 1, 2012 at 7:19 AM, Jan Kara <ja...@suse.cz> wrote:
>
>  we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures.

This is sad.

We've had something like this before due to architectural reasons
(alpha's inability to do byte loads and stores left us unable to have
independently updated items smaller than a word sharing that word).

But at least then there was a *reason* for it, not "the compiler is
being difficult for no good reason".

Actually, the *sad* part is not that the compiler does something
unexpected - that's not new - but the usual "..and the gcc crowd
doesn't care, because they can point to paperwork saying it's not
defined". Even if that same paper is actually in the process of
getting updated exactly because it causes problems like ours.

That mentality is not new, of course.

> So it seems what C/GCC promises does not quite match with what kernel
> expects.

The paper C standard can *never* promise what a kernel expects. There
are tons of unspecified things that a compiler could do, including
moving memory around behind our back as long as it moves it back.
Because it's all "invisible" in the virtual C machine in the absence
of volatiles. The fact that the kernel has things like SMP coherency
requirements is simply not covered by the standard. There are tons of
other things not covered by the standard too that are just "that's
what we need".

So C/gcc has never "promised" anything in that sense, and we've always
had to make assumptions about what is reasonable code generation. Most
of the time, our assumptions are correct, simply because it would be
*stupid* for a C compiler to do anything but what we assume it does.

But sometimes compilers do stupid things. Using 8-byte accesses to a
4-byte entity is *stupid*, when it's not even faster, and when the
base type has been specified to be 4 bytes!

> I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

If the gcc people aren't willing to agree that this is actually a flaw
in the standard (one that is being addressed, no less) and try to fix
it, we just have to extend our assumptions to something like "a
compiler would be stupid to ever access anything bigger than the
aligned register-size area". It's still just an assumption, and
compiler people could be crazy, but we could just extend the current
alpha rules to cover not just "int", but "long" too.

Sure, the compiler people could use "load/store multiple" or something
like that, but it would be obviously crazy code, so if it happens past
a register size, at least you could argue that it's a performance
issue and maybe the gcc people would care.

HOWEVER. The problem with the alpha rules (which, btw, were huge, and
led to the CPU designers literally changing the CPU instruction set
because they admitted they made a huge mistake) was never so much the
occasional memory corruption, as the fact that the places where it
could happen were basically almost impossible to find.

So we probably have tons of random places in the kernel that violate
even the alpha rules - because the corruption is so rare, and the
architecture was so rare as to make the corruption even harder to
find.

I assume this code generation idiocy only happens with bitfields? The
problem is, we really don't want to make all bitfields take up 64 bits
just because we might have a lock or something else next to them. But
we could probably re-order things (and then *occasionally* wasting
memory) if we could just find the ones that are problematic.

It's finding the problematic ones that is the issue. Which is why the
compiler should just generate code that matches what we told it to do,
not try to be "clever" in ways that don't even help performance! The
compiler simply doesn't know enough about the requirements. Never has,
never will, and this is not about "standards".

Linus

Linus Torvalds

Feb 1, 2012, 12:00:02 PM

On Wed, Feb 1, 2012 at 8:37 AM, Colin Walters <wal...@verbum.org> wrote:
>
> 1) Use the same lock for a given bitfield

That's not the problem. The *bitfield* fields are all accessed
under the same word already.

> 2) Split up the bitfield into different words

Again, it's not the bitfield that is the problem.

The problem is that the compiler - when it writes to the word that
contains the bitfield - will also corrupt the word *NEXT* to the
bitfield (or before - it probably depends on alignment).

So we have two separate 32-bit words - one of which just happens to
contain a bitfield. We write to these *separate* words using different
locking rules (btw, they don't even need to be protected by a lock: we
may have other rules that protect the individual word contents).

But the compiler turns the access to the bitfield (in a 32-bit aligned
word) into a 64-bit access that accesses the word *next* to it.

That word next to it might *be* the lock, for example.

So we could literally have this kind of situation:

struct {
        atomic_t counter;
        unsigned int val:4, other:4, data:24;
};

and if we write code like this:

spin_lock(&somelock);
s->data++;
spin_unlock(&somelock);

and on another CPU we might do

atomic_inc(&counter);

and the access to the bitfield will *corrupt* the atomic counter, even
though both of them are perfectly fine!

Quite frankly, if the bug is simply because gcc doesn't actually know
or care about the underlying size of the bitmask, it is possible that
we can find a case where gcc clearly is buggy even according to the
original C rules.

Honza - since you have access to the compiler in question, try
compiling this trivial test-program:


struct example {
        volatile int a;
        int b:1;
};

..
s->b = 1;
..

and if that bitfield access actually does a 64-bit access that also
touches 's->a', then dammit, that's a clear violation of even the
*old* C standard, and the gcc people cannot just wave away their bugs
by saying "we've got standards, pttthththt".

And I suspect it really is a generic bug that can be shown even with
the above trivial example.

Linus

Torvald Riegel

Feb 1, 2012, 12:10:03 PM

On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:
> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions. The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

Indeed, it's memory models that specify which kind of behavior is
allowed, and you need them for both the hardware and the programming
language. C++11 and C11 have memory models, in contrast to previous
versions of these standards.
GCC 4.7 implements this memory model (C++11's and C11's models are very
similar), even though there might be some rough edges in this
implementation (bit fields, for example...).
http://gcc.gnu.org/wiki/Atomic/GCCMM

> So it seems what C/GCC promises does not quite match with what kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

There needs to be agreement about the memory model. The only time I
spoke to a kernel person about memory models, I got the reply that the
kernel would use its own model.

What do the kernel folks think about the C11 memory model? If you can
spot any issues in there, the GCC community would certainly like to
know.


Torvald

Jiri Kosina

Feb 1, 2012, 12:20:02 PM

I have actually tried exactly this earlier today (because while looking at
this, I had an idea that putting volatile in place could be a workaround,
causing gcc to generate a saner code), but it doesn't work either:

# cat x.c
struct x {
        long a;
        volatile unsigned int lock;
        unsigned int full:1;
};

void
wrong(struct x *ptr)
{
        ptr->full = 1;
}

int main()
{
        wrong(0);
}
# gcc -O2 x.c
# gdb -q ./a.out
Reading symbols from /root/a.out...done.
(gdb) disassemble wrong
Dump of assembler code for function wrong:
0x40000000000005c0 <+0>: [MMI] adds r32=8,r32
0x40000000000005c1 <+1>: nop.m 0x0
0x40000000000005c2 <+2>: mov r15=1;;
0x40000000000005d0 <+16>: [MMI] ld8 r14=[r32];;
0x40000000000005d1 <+17>: nop.m 0x0
0x40000000000005d2 <+18>: dep r14=r15,r14,32,1;;
0x40000000000005e0 <+32>: [MIB] st8 [r32]=r14
0x40000000000005e1 <+33>: nop.i 0x0
0x40000000000005e2 <+34>: br.ret.sptk.many b0;;

In my opinion, this is a clear bug in gcc (while the original problem,
without explicit volatile, is not a C spec violation per se, it's just
very inconvenient :) ).

--
Jiri Kosina
SUSE Labs

Linus Torvalds

Feb 1, 2012, 12:40:02 PM

On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel <tri...@redhat.com> wrote:
>
> What do the kernel folks think about the C11 memory model?  If you can
> spot any issues in there, the GCC community would certainly like to
> know.

I don't think this is about memory models except very tangentially.

Gcc currently accesses fields in a way that affects *independent*
fields, without checking their memory models at all.

Even original C already has one memory model: "volatile" data is seen
outside the virtual C machine. And Jiri reports that even that
*original* memory model is being violated. We're talking about the one
from about 40 years ago.

So no amount of memory model will ever matter, if gcc can't even get
the 40-year old one right.

Also, the C11 memory model really isn't going to make any difference
to this. Again, because this happens even if we have explicit locking
for the different fields. So they aren't "atomic" or anything like
that, because we do our own (and sufficient) locking. But once the
compiler starts to randomly touch data that just happens to be
adjacent to other data, any memory model about those individual pieces
is going to be entirely immaterial anyway.

Finally: I do not believe for a second that we can actually use the
C11 memory model in the kernel and say "that's sufficient for our
needs". We will continue to have to do things that are "outside the
specs" for the very simple reason that the kernel isn't just some
random application that happens to use pthreads. We do end up doing
much more aggressive threading, with models that C11 simply doesn't
cover.

_Atomic simply isn't going to make any difference to us. Sure, we
might use it in a couple of places (where we have sometimes tried to
use volatile casts, for example), but it would be totally irrelevant
in this case exactly because *neither* field is really atomic - they
are both protected, it's just that they are *separately* protected, so
accessing one does *not* mean that you can access the other, even if
they are next to each other.

Linus

Linus Torvalds

Feb 1, 2012, 12:40:02 PM

Yup, gcc is clearly just buggy here. I do not believe there is any
question what-so-ever about the above test-case showing a bug.

And the thing is, if they fix this bug, they'll fix our problem too,
unless they are going to write explicit code to *try* to screw us over
while fixing that 'volatile' bug.

Because the right thing to do with bitfields is really to take the
base type into account. If the bitfield was in an "int", you use an
"int" access for it, not a 64-bit access. That's the simple fix for
the volatile problem, and it just happens to fix our issue too.
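
Written out by hand, the access the kernel expects stays within the bitfield's
32-bit base type, roughly like this (a sketch with made-up names that models
the bitfield's storage unit as an explicit word):

#include <stdint.h>

struct x_by_hand {
        long a;
        uint32_t b1;            /* independently locked neighbour */
        uint32_t b2_word;       /* the "unsigned int" unit that would hold b2:1 */
};

static void set_b2(struct x_by_hand *p)
{
        uint32_t w = p->b2_word;        /* 32-bit load: only the bitfield's own unit */
        w |= 1u;                        /* set the b2 bit */
        p->b2_word = w;                 /* 32-bit store: b1 is never read or written */
}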

Trying to explicitly *look* for volatiles, and only doing the 32-bit
access when you see them is actually extra code, and extra effort, and
doesn't even *help* anything. It's not like the 64-bit access is
somehow "better".

I can see some vindictive programmer doing that, while thinking "I'll
show these people who pointed out this bug in my code, mhwhahahahaa!
I'll fix their test-case while still leaving the real problem
unaddressed", but I don't think compiler people are quite *that* evil.
Yes, they are evil people who are trying to trip us up, but still..

Linus

Michael Matz

Feb 1, 2012, 12:50:02 PM

Hi,

On Wed, 1 Feb 2012, Jiri Kosina wrote:

> # cat x.c
> struct x {
> long a;
> volatile unsigned int lock;
> unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
> ptr->full = 1;
> }
>
> In my opinion, this is a clear bug

Even that depends (sadly) on who you ask. Half-volatile objects (i.e.
structs where some members are volatile and others aren't) are terribly
underspecified. You can make the bitfield volatile, and for ia64 that
would at least result in ld8.acq/st8.rel pairs.

And Linus: don't be so hastily dismissive. People (even the compiler
ones) do agree that using an 8 byte access for a 4 byte entity is
problematic. Even if allowed by the standard it's a quality of
implementation problem.

One problem is that it's not a new problem, GCC emitted similar code since
about forever, and still they turned up only now (well, probably because
ia64 is dead, but sparc64 should have similar problems). The bitfield
handling code is _terribly_ complex and fixing it is quite involved. So
don't expect any quick fixes.

The other problem is specification. While you think that the code you
wrote is totally obvious it might not actually be so. For instance, what
about this struct:

{long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}

What are the obviously correct accesses for various writes into this
struct?

One rule may be to never write to a bitfield with accesses larger than
their underlying declared type. Well, but only if the bitfield fits into
that type (int:96 is quite okay to have, and what accesses should be
allowed there?). That might be one obvious rule. But it's not how
traditional bitfield layout works. It works based on underlying objects,
and for that e.g. the three fields i2, c, s2 are all in the same one, of int
type.

The rule above would at least work for code that most people do write,
i.e. sequence of same-typed bitfields mixed with normal members.

And then, there's the last problem: are you sure that if GCC would use 4
byte accesses for the case at hand, that the hardware wouldn't screw you
again? What about using atomic instructions for one half of the
cache-line (the spinlock) but non-atomic instructions on the other half
(the bitfield). Are you sure the latter won't interfere with the former?


Ciao,
Michael.

David Miller

Feb 1, 2012, 1:20:03 PM

From: Michael Matz <ma...@suse.de>
Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)

> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems).

Indeed, on sparc64 it does do the silly 64-bit access too:

wrong:
        ldx     [%o0+8], %g2
        sethi   %hi(2147483648), %g1
        or      %g2, %g1, %g1
        jmp     %o7+8
         stx    %g1, [%o0+8]

Personally I've avoided C bitfields like the plague in any code I've
written.

Jeff Law

Feb 1, 2012, 1:50:02 PM

On 02/01/2012 11:09 AM, David Miller wrote:
> From: Michael Matz<ma...@suse.de>
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
>
>> One problem is that it's not a new problem, GCC emitted similar code since
>> about forever, and still they turned up only now (well, probably because
>> ia64 is dead, but sparc64 should have similar problems).
>
> Indeed, on sparc64 it does do the silly 64-bit access too:
>
> wrong:
> ldx [%o0+8], %g2
> sethi %hi(2147483648), %g1
> or %g2, %g1, %g1
> jmp %o7+8
> stx %g1, [%o0+8]
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.
Torvald Riegel & I were told that was kernel policy when we brought up
the upcoming bitfield semantic changes with some of the linux kernel
folks last year.

Regardless of the kernel team's policy WRT bitfields, I believe fixing
the semantics to avoid creation of data races in bitfields is going to
be important. I'm hoping Aldy can return to that task soon and he and
Richi can come to some agreement on the implementation with gcc-4.8 as a
target.

jeff

Dennis Clarke

Feb 1, 2012, 2:00:02 PM


> I have actually tried exactly this earlier today (because while looking at
> this, I had an idea that putting volatile in place could be a workaround,
> causing gcc to generate a saner code), but it doesn't work either:
>
> # cat x.c
> struct x {
> long a;
> volatile unsigned int lock;
> unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
> ptr->full = 1;
> }
>
> int main()
> {
> wrong(0);
> }
<snip>
> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

As a data point, the exact same code on a Solaris 8 pentium III box:

$ gcc -S -o x.s x.c
$ cat x.s
.file "x.c"
.text
.globl wrong
.type wrong, @function
wrong:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movzbl 8(%eax), %edx
orl $1, %edx
movb %dl, 8(%eax)
popl %ebp
ret
.size wrong, .-wrong
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
movl $0, (%esp)
call wrong
leave
ret
.size main, .-main
.ident "GCC: (Blastwave.org Inc. Thu Dec 16 18:05:01 GMT 2010) 4.5.2"
$ gcc -o x x.c
$ file x
x: ELF 32-bit LSB executable 80386 Version 1, dynamically linked,
not stripped
$ ldd x
libc.so.1 => /usr/lib/libc.so.1
libdl.so.1 => /usr/lib/libdl.so.1
$ ./x
Segmentation Fault(coredump)

$ ls -l core
-rw------- 1 dclarke csw 71384 Feb 1 17:26 core

71384 bytes core is complete thus :

$ elfdump -p core | tail -6

Program Header[12]:
p_vaddr: 0xdfbf3000 p_flags: [ PF_X PF_W PF_R ]
p_paddr: 0 p_type: [ PT_LOAD ]
p_filesz: 0x1000 p_memsz: 0x1000
p_offset: 0x106d8 p_align: 0

$ /opt/studio/SOS11/SUNWspro/bin/dbx -c "print 0x1000 + 0x106d8; quit"
dbx: warning: unknown language, 'c' assumed
0x1000+0x106d8 = 71384

what caused the seg fault ?

$ /opt/studio/SOS11/SUNWspro/bin/dbx x core
Reading x
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x08050672: wrong+0x0006: movzbl 0x00000008(%eax),%edx
(dbx) where
=>[1] wrong(0x0, 0x8047b70, 0x805057d, 0x1, 0x8047b7c, 0x8047b84), at 0x8050672
[2] main(0x1, 0x8047b7c, 0x8047b84), at 0x8050690

However Sun Studio 5.8 does no better :

$ /opt/studio/SOS11/SUNWspro/bin/cc -Xc -o x_Sun_Studio_5.8 x.c
$ ./x_Sun_Studio_5.8
Segmentation Fault(coredump)
$ ls -l core
-rw------- 1 dclarke csw 71384 Feb 1 17:48 core

$ /opt/studio/SOS11/SUNWspro/bin/dbx x_Sun_Studio_5.8 core
dbx: warning: core object name "x_Sun_Studio_5." matches
object name "x_Sun_Studio_5.8" within the limit of 14. assuming they match
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x080506ae: wrong+0x000e: movl 0x00000008(%ecx),%eax
(dbx) where
=>[1] wrong(0x0), at 0x80506ae
[2] main(0x1, 0x8047b4c, 0x8047b54), at 0x80506ca
(dbx) quit
$ /opt/studio/SOS11/SUNWspro/bin/cc -V
cc: Sun C 5.8 Patch 121016-08 2009/04/20
usage: cc [ options] files. Use 'cc -flags' for details
$


dc


--
http://pgp.mit.edu:11371/pks/lookup?op=vindex&search=0x1D936C72FA35B44B
+-------------------------+-----------------------------------+
| Dennis Clarke           | Solaris and Linux and Open Source |
| dcl...@blastwave.org    | Respect for open standards.       |
+-------------------------+-----------------------------------+

Linus Torvalds

Feb 1, 2012, 2:00:01 PM

On Wed, Feb 1, 2012 at 10:09 AM, David Miller <da...@davemloft.net> wrote:
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

I do agree with that. The kernel largely tries to avoid bitfields,
usually because we have some really strict rules about different
bitfields, but also because initialization of bitfields tends to
result in gcc generating an incredible mess of code (while
"explicit bits" allows us to just set all the fields in one go, and
know what the result is).

So core kernel data structures tend to be things like "unsigned long",
together with our various bitop functions that have explicit atomicity
(or non-atomicity) guarantees on a bit-per-bit basis.
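
As a sketch of that idiom (kernel-style illustration; struct my_obj and the
MY_OBJ_* bit numbers are made up, while set_bit()/test_and_clear_bit() are the
usual arch-provided bitops):

#include <linux/bitops.h>

#define MY_OBJ_DIRTY    0       /* bit numbers, not masks */
#define MY_OBJ_BUSY     1

struct my_obj {
        unsigned long flags;    /* explicit bits instead of a C bitfield */
};

static void mark_dirty(struct my_obj *o)
{
        set_bit(MY_OBJ_DIRTY, &o->flags);       /* atomic RMW of exactly this bit */
}

static int reap_if_dirty(struct my_obj *o)
{
        return test_and_clear_bit(MY_OBJ_DIRTY, &o->flags);
}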

Sometimes bitfields are really convenient, though, and allow for much
more natural syntax.

I'm not surprised this issue came up in a filesystem, for example.

Linus

Linus Torvalds

Feb 1, 2012, 2:00:02 PM

On Wed, Feb 1, 2012 at 9:41 AM, Michael Matz <ma...@suse.de> wrote:
>
> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems).  The bitfield
> handling code is _terribly_ complex and fixing it is quite involved.  So
> don't expect any quick fixes.

I agree that bitfields are nasty, I've had to handle them myself in
sparse. And we have actually had problems with bitfields before, to
the point where we've replaced use of bitfields with masking and
setting bits by hand.

But I also think that gcc is simply *buggy*, and has made them much
nastier than they should be. What gcc *should* have done is to turn
bitfield accesses into shift-and-masking of the underlying field as
early as possible, and then do all optimizations at that level.

In fact, there is another gcc bug outstanding (48696) where I complain
about absolutely horrible code generation, and that one was actually
the exact same issue except in reverse: gcc wouldn't take the
underlying size of the bitfield into account, and use the wrong
(smaller) size for the access, causing absolutely horrendous code
generation that mixes byte and word accesses, and causes slowdowns by
orders of magnitude.

And it really is the same issue: gcc has forgotten what the base type
is, and tries to "compute" some type using the actual bits. Which is
simply *wrong*. Always has been.

It's stupid, it generates bad code (both from performance *and* from a
correctness angle), and it has actually resulted in gcc having *extra*
complexity because it keeps the bitfield data around much too late.

> The other problem is specification.  While you think that the code you
> wrote is totally obvious it might not actually be so.  For instance, what
> about this struct:
>
> {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}
>
> What are the obviously correct accesses for various writes into this
> struct?

I really don't think that example matters. You should instead turn the
question around, and look at the real code examples, make *those*
generate good and obviously correct code, and then *after* you've done
that, you start looking at the mixed case where people do odd things.

Quite frankly, the layout that makes *sense* for something like the
above is not to pack them. You'd just do

If somebody actually wants packing, they'd do the *sane* thing, which is

unsigned int l:32, i1:16, s:16, i2:1, c:7, s2:8, s3:16;

and now it's obvious what the packing rules and the access rules are.

That said, I just tried, and 'sparse' actually gets your crazy example
right (where "right" is defined as what I *think* you want): optimally
packed. And it turns the accesses into the ones that are based on the
base type. And I'm pretty sure the sparse bitfield rules are way
simpler than gcc, exactly because it tries to convert bitfields fairly
early into mask-and-set. So if I do

struct {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;} dummy;

int main(int argc, char **argv)
{
        dummy.l = 1;
        dummy.i1 = 2;
        dummy.s = 3;
        dummy.i2 = 4;
        dummy.c = 5;
        dummy.s2 = 6;
        dummy.s3 = 7;
}

and then do a test-linearize (with "-m64" to show the difference
between "long" and "int") I get

t.c:2:48: error: dubious one-bit signed bitfield
main:
.L0x7f928e378010:
<entry-point>
load.64 %r1 <- 0[dummy]
and.64 %r2 <- %r1, $0xffffffff00000000
or.64 %r3 <- %r2, $1
store.64 %r3 -> 0[dummy]
load.32 %r4 <- 4[dummy]
and.32 %r5 <- %r4, $0xffffffffffff0000
or.32 %r6 <- %r5, $2
store.32 %r6 -> 4[dummy]
store.16 $3 -> 6[dummy]
load.32 %r7 <- 8[dummy]
and.32 %r8 <- %r7, $-2
store.32 %r8 -> 8[dummy]
load.8 %r10 <- 8[dummy]
and.8 %r12 <- %r10, $-255
or.8 %r13 <- %r12, $10
store.8 %r13 -> 8[dummy]
load.16 %r14 <- 8[dummy]
and.16 %r16 <- %r14, $0xffffffffffff00ff
or.16 %r17 <- %r16, $0x600
store.16 %r17 -> 8[dummy]
store.16 $7 -> 10[dummy]
ret.32 $7

which I think is what you'd expect. Now, sparse doesn't do the smart
"combine shifts-and-masks to memory", but that's an optimization phase
that should be done long after bitfields *anyway*, so that's kind of
immaterial.

And yes, look at how you can see how sparse mixes different access
sizes for different fields. But that is damn well what the programmer
asked for.

If programmers write stupid code, the compiler might as well generate odd code.

But if a programmer writes *good* code, and the compiler messes it up
because it *thinks* the code is stupid, then the compiler is just bad.

> One rule may be to never write to a bitfield with accesses larger than
> their underlying declared type.  Well, but only if the bitfield fits into
> that type (int:96 is quite okay to have, and what accesses should be
> allowed there?).

Actually, "int:96" isn't ok last I saw. Not in original C, at least.

Yeah, I just checked. You get an error like

error: width of ‘a’ exceeds its type

so "int:96" is fine, but only if int is bigger than or equal to 96 bits.

> That might be one obvious rule.  But it's not how
> traditional bitfield layout works.  It works based on underlying objects,
> and for that e.g. the three fields i2, c, s2 are all in the same one, of int
> type.

I think that's actually because bitfields are really only defined for
'int' to begin with afaik. Anything else (well, "unsigned" is fine
too) is not really defined.

The 'long' thing is a later extension, and the shorter fields are just
random. They happen to be pretty portable.

> The rule above would at least work for code that most people do write,
> i.e. sequence of same-typed bitfields mixed with normal members.

Right. Where the same-type should generally be "unsigned int"
(bitfield signs are really debatable and nonportable too, so unsigned
is actually the only really sane type, although others do work).

> And then, there's the last problem: are you sure that if GCC would use 4
> byte accesses for the case at hand, that the hardware wouldn't screw you
> again?  What about using atomic instructions for one half of the
> cache-line (the spinlock) but non-atomic instructions on the other half
> (the bitfield).  Are you sure the latter won't interfere with the former?

Yes, I'm sure. Of course, if the architecture isn't cache-coherent,
all bets are off, but nobody cares. That's a hardware issue, not a
compiler problem.

If we told you "unsigned int bitfield:x", then you as the compiler
should just assume that you can access it as an "int" (unless you only
allocate a "char" for it, of course, in which case you need to access
it as a char - you can never access past the field).

Just out of morbid curiosity, what happens if you have totally
*separate* variables that just happen to link together? IOW, something
like

static struct { unsigned bit:1; } onebit;
static volatile int var;

and they just *happen* to link next to each other (because they were
declared next to each other) in the same 8-byte aligned block?

Linus

Peter Bergner

Feb 1, 2012, 2:10:02 PM

On Wed, 2012-02-01 at 13:09 -0500, David Miller wrote:
> From: Michael Matz <ma...@suse.de>
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
>
> > One problem is that it's not a new problem, GCC emitted similar code since
> > about forever, and still they turned up only now (well, probably because
> > ia64 is dead, but sparc64 should have similar problems).
>
> Indeed, on sparc64 it does do the silly 64-bit access too:
>
> wrong:
> ldx [%o0+8], %g2
> sethi %hi(2147483648), %g1
> or %g2, %g1, %g1
> jmp %o7+8
> stx %g1, [%o0+8]

Ditto for powerpc64-linux:

        ld 9,8(3)
        li 10,1
        rldimi 9,10,31,32
        std 9,8(3)
        blr


Peter

Linus Torvalds

Feb 1, 2012, 2:10:02 PM

On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law <l...@redhat.com> wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up the
> upcoming bitfield semantic changes with some of the linux kernel folks last
> year.

Btw, one reason this is true is that the bitfield ordering/packing is
so unspecified that using bitfields sometimes is just way more pain
than they're worth. Some people have tried to use bitfields for various IO
data structures (think laying out bits in memory to match some
particular layout of some disk controller result structure). It's
always a total disaster, because you have another whole level of
architecture-specific bit ordering (in *addition* to all the normal
byte order issues).

That's not a compiler issue, that's just the nature of the beast.
It's just another reason why the kernel often ends up then doing bit
masking by hand.

But *all* that said, we do have a metric buttload of bitfields in the
kernel. It's not some really unusual feature. It does get used a lot,
despite all the reasons why some particular code might not use
bitfields.

We have a lot of code, there's still a lot of situations left where
bitfields are just really convenient.

Linus

Torvald Riegel

Feb 1, 2012, 2:20:02 PM

On Wed, 2012-02-01 at 08:41 -0800, Linus Torvalds wrote:
> If the gcc people aren't willing to agree that this is actually a flaw
> in the standard (one that is being addressed, no less)

It has been addressed in the standards.

> and try to fix
> it,

Again, this is being worked on, see http://gcc.gnu.org/wiki/Atomic/GCCMM

> we just have to extend our assumptions to something like "a
> compiler would be stupid to ever access anything bigger than the
> aligned register-size area". It's still just an assumption, and
> compiler people could be crazy, but we could just extend the current
> alpha rules to cover not just "int", but "long" too.

So, let's ignore everyone's craziness (the kernel is not the only GCC
client...) and think about how we can improve the situation.

We need a proper memory model. No vague assumptions with lots of
hand-waving. If you think that this is simple stuff and can be
sufficiently described by "don't do anything stupid", then please have a
look at the issues that the Java memory model faced, and all the
iterations of the C++11/C11 model and discussions about it.

The only candidate that I see is the C++11/C11 model. Any other
suggestions?

Now, if we assume this model, what does the kernel community think about
it? It might be good to split this discussion into the following two
points, to avoid syntax flame wars:
1) The model itself (ie, the memory orders, ordering guarantees, (lack
of) progress guarantees, etc.).
2) The syntax/APIs to specify atomics.

If something else or more is needed, the compiler needs to have a formal
specification for that. This will help users too, because it avoids all
the ambiguities.
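
For concreteness, a small sketch of what the C11 atomics syntax (point 2
above) looks like; this is illustrative only, not a suggestion that the kernel
adopt it as-is:

#include <stdatomic.h>

_Atomic int ready;
int payload;

void producer(void)
{
        payload = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
        if (atomic_load_explicit(&ready, memory_order_acquire))
                return payload;         /* ordered after the flag load by acquire/release */
        return -1;
}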

> Sure, the compiler people could use "load/store multiple" or something
> like that, but it would be obviously crazy code, so if it happens past
> a register size, at least you could argue that it's a performance
> issue and maybe the gcc people would care.
>
> HOWEVER. The problem with the alpha rules (which, btw, were huge, and
> led to the CPU designers literally changing the CPU instruction set
> because they admitted they made a huge mistake) was never so much the
> occasional memory corruption, as the fact that the places where it
> could happen were basically almost impossible to find.
>
> So we probably have tons of random places in the kernel that violate
> even the alpha rules - because the corruption is so rare, and the
> architecture was so rare as to make the corruption even harder to
> find.

I assume you still would want a weak memory model, or not? (That is,
rely on data-race-freedom, only atomics do not contribute to data races,
and you need to mark data used for synchronization (or which is just
accessed concurrently) as atomics).

If so, you need to tell the compiler which variables etc. are atomics.

> I assume this code generation idiocy only happens with bitfields? The
> problem is, we really don't want to make all bitfields take up 64 bits
> just because we might have a lock or something else next to them. But
> we could probably re-order things (and then *occasionally* wasting
> memory) if we could just find the ones that are problematic.

If the compiler is aware what is a lock (or atomics, or similar), then a
correct implementation should leave the lock alone. But in a weak
memory model, it needs to know what is a lock.
(Same goes for volatile obviously, like in the other example posted in
the thread where the bitfields interfered with the volatile.)

>
> It's finding the problematic ones that is the issue. Which is why the
> compiler should just generate code that matches what we told it to do,

So, would it be okay to tell the compiler which part of the state is
accessed concurrently (ie, locks, atomics, ...)?


Torvald

Linus Torvalds

Feb 1, 2012, 2:50:01 PM

On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel <tri...@redhat.com> wrote:
>
> We need a proper memory model.

Not really.

The fact is, the kernel will happily take the memory model of the
underlying hardware. Trying to impose some compiler description of the
memory model is actually horribly bad, because it automatically also
involves compiler synchronization points - which will try to be
portable and easy to understand, and quite frankly, that will
automatically result in what is technically known as a "shitload of
crap".

Now, a strict memory model is fine for portability, and to make it
simple for programmers. I actually largely approve of the Java memory
model approach, even if it has been horribly problematic and has some
serious problems. But it's not something that is acceptable for the
kernel - we absolutely do *not* want the compiler to introduce some
memory model, because we are very much working together with whatever
the hardware memory model is, and we do our own serialization
primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

- if we don't touch a field, the compiler doesn't touch it.

This is the rule that gcc now violates with bitfields.

This is a gcc bug. End of story. The "volatile" example proves it -
anybody who argues otherwise is simply wrong, and is just trying to
make excuses.

- if we need to touch something only once, or care about something
that is touched only conditionally, and we actually need to make sure
that it is touched *only* under that condition, we already go to quite
extreme lengths to tell the compiler so.

We usually use an access with a "volatile" cast on it or tricks
like that. Or we do the whole "magic asm that says it modified memory
to let the compiler know not to try anything clever"

- The kernel IS NOT GOING TO MARK DATA STRUCTURES.

Marking data structures is seriously stupid and wrong. It's not the
*data* that has access rules, it's the *code* that has rules of
access.

The traditional C "volatile" is misdesigned and wrong. We don't
generally mark data volatile, we really mark *code* volatile - which
is why our "volatiles" are in the casts, not on the data structures.

Stuff that is "volatile" in one context is not volatile in another. If
you hold a RCU write lock, it may well be entirely stable, and marking
it volatile is *wrong*, and generating code as if it was volatile is
pure and utter shit.

On the other hand, if you are touching *the*very*same* field while you
are only read-locked for RCU, it may well be one of those "this has to
be read by accessing it exactly once".

And we do all this correctly in the kernel. Modulo bugs, of course,
but the fundamental rule really is: "atomicity or volatility is about
CODE, not DATA".
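
For reference, the "volatile in the cast, not in the declaration" idiom is
what the kernel's ACCESS_ONCE() macro expresses; the struct and functions
around it here are made-up illustration:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

struct foo {
        int state;                      /* plain int: not volatile, not _Atomic */
};

int rcu_reader(struct foo *f)
{
        /* read side: force exactly one load of the field, right here */
        return ACCESS_ONCE(f->state);
}

void locked_writer(struct foo *f, int v)
{
        /* under the write-side lock the field is stable, so a plain,
         * freely optimizable access is exactly what we want */
        f->state = v;
}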

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
exactly the same mistake as "volatile" was. It's wrong. Marking data
_Atomic is a sure sign that whoever designed it didn't understand
things.

> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly.

You might as well ignore it as far as the kernel is concerned. I'm
sure it helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does
everything to make it easy for the compiler.

When we do something that the compiler cannot re-order around, we do
an asm() and mark it as modifying memory so that the compiler won't
screw things up. In addition, we will do whatever that the CPU
requires for memory ordering, and quite frankly, the compiler will
never have sufficient locking primitives to satisfy us, and there is
no real point in even trying. If you try to do locking in the
compiler, you *will* do it wrong.

If you add random flags on data structures ("_Atomic" or "volatile" or
whatever), you *will* do it wrong. It's simply a fundamentally broken
model.

So the *only* memory model we want from the compiler really is: "don't
write to fields that the source code didn't write to".

It really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no.

See above: it's not the "state" that is accessed concurrently. It's
the code. If you ever try to mark state, you've already lost. The same
"state" can be atomic or not depending on context. It's not about the
state or the data structures, and it never will be.

Linus

Jeff Law

Feb 1, 2012, 3:00:02 PM

On 02/01/2012 12:44 PM, Boehm, Hans wrote:

> C11 is a published standard. Last I checked, gcc did not follow many of the above rules. It looks like major changes were recently merged into the gcc trunk, and I haven't had a chance to test those, so it may well be fixed. But more importantly, so far I haven't read anything here to dissuade me that they are the right target.

The bitfield changes didn't make it into gcc-4.7; we're still working
through implementation details. GCC's bitfield support is a horrid
nasty mess which has made implementation of the standard problematical.
It's on Aldy's plate for gcc-4.8.

jeff

Alan Cox

Feb 1, 2012, 3:00:02 PM

> So here's basically what the kernel needs:
>
> - if we don't touch a field, the compiler doesn't touch it.
>
> This is the rule that gcc now violates with bitfields.
>
> This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.

C historically didn't make this guarantee because a lot of processors
couldn't honour it, as they didn't have things like byte accessors (in
fact I suspect early ARM cannot make it, for example).

Not meeting it for types where you can do so is a bit rude, however, and
really ought to be an option (speed v sanity).

> See above: it's not the "state" that is accessed concurrently. It's
> the code. If you ever try to mark state, you've already lost. The same
> "state" can be atomic or not depending on context. It's not about the
> state or the data structures, and it never will be.

There are optimisation cases - where you can prove access properties are
safe (eg local variables some times) but they should be exactly that -
optimisations.

Alan

Linus Torvalds

Feb 1, 2012, 3:10:02 PM

On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek <ja...@redhat.com> wrote:
>
> Well, the C++11/C11 model doesn't allow to use the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
> struct T { long t1 : 16; unsigned int t2; };
>
> on e.g. x86_64-linux, S is 16 byte long, field s4 is packed together into
> the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> T is 8 bytes long, again, t2 is packed into the same 64 bits as t1,
> and the memory model doesn't allow t2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the bitfield.

Sure, we do have some expectations of packing, but basically we never
expect bitfields to be 'atomic'. You really don't need to worry about
that.

From a correctness standpoint, I really don't think the kernel cares
if gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield.
That's *WRONG*. It's completely wrong, even if you write back the same
value you read. It's *always* wrong. It's simply not acceptable.

If gcc does that kind of crap, then gcc needs to align bitfields
sufficiently that the accesses that are outside the bitfields never
touch any other fields.

So what I would suggest:

- use the *smallest* aligned access you can use to access the
bitfield. This will be 'correct', but it may perform badly.

This is apparently what x86 does. So on x86, if you have a one-bit
bitfield within a bitfield that is really 32 bits wide, it looks like
gcc will generally generate a byte load/store to set that bit. I don't
know exactly what the rules are, but this is fine from a *correctness*
point.

- However, while using the *smallest* possible access may generate
correct code, it often generates really *crappy* code. Which is
exactly the bug that I reported in

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield
itself is a bug.

And dammit, the people who continue to bring up C11 as if this was a
new issue, and only needs to be worried about if you support C11 (ie
gcc-4.7 and up) are just in denial.

The 'volatile' example makes it entirely clear that the bug has
nothing what-so-ever to do with C11.

Linus

Linus Torvalds

Feb 1, 2012, 3:20:02 PM

On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
>  - However, while using the *smallest* possible access may generate
> correct code, it often generates really *crappy* code. Which is
> exactly the bug that I reported in
>
>   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

Btw, as I also pointed out in that (old) bugzilla entry, gcc can do
pretty much whatever it damn well pleases with loads from a bitfield.

You can narrow them, you can make them wider, you can paint them pink.

Widening the loads can still be a performance problem, and technically
it's a really bad idea for any nearby "volatile"s too, but in practice
things will *work*.

Writes are much more critical. If you overwrite a field that is no
longer in the bitfield, you can no longer depend on "nobody cares
about bitfield accesses", because by definition you are clearly now
stepping on non-bitfield things.

You do realize that even in user space, and even before C11, people
have things like "sig_atomic_t" etc. So even if you don't like the
notion of "volatile", here's *another* example of this being a real gcc
bug:

struct mydata {
        sig_atomic_t seen_signal;
        unsigned flags:1;
};

and now do the same test-program, realizing that "sig_atomic_t" is
normally the exact same thing as "int".

Now, think about what happens if you have a signal handler that comes
in and does

mydata.seen_signal = 1;

and happens to interrupt the code that changes "mydata.flags" in just
the right spot.

That's right: the bitfield update will possibly *clear* that
"signal-safe" flag again, and gcc just created buggy asm code from
correct C code.
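
A complete user-space version of that scenario, as an illustration (volatile
is added so the wait loop itself is not optimized away; the corruption only
shows up if the compiler widens the bitfield store across seen_signal, as in
the ia64 code earlier in the thread):

#include <signal.h>

struct mydata {
        volatile sig_atomic_t seen_signal;
        unsigned flags:1;
};

static struct mydata mydata;

static void handler(int sig)
{
        (void)sig;
        mydata.seen_signal = 1;
}

int main(void)
{
        signal(SIGINT, handler);
        while (!mydata.seen_signal) {
                mydata.flags = 1;       /* if compiled as a wide RMW that covers
                                         * seen_signal, a signal delivered between
                                         * its load and its store is wiped out and
                                         * the loop never terminates */
        }
        return 0;
}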

Guys, if you don't admit that this is a bug, I don't know what to say.

IT IS A GCC BUG.

Don't try to make it into something bigger and related to C++11/C11.
Don't try to talk about "memory models". Just admit that it is a bug
to do a 64-bit write to a 32-bit bitfield, and fix it!

Jakub Jelinek

Feb 1, 2012, 3:20:03 PM

On Wed, Feb 01, 2012 at 12:01:46PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> > struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
> > struct T { long t1 : 16; unsigned int t2; };
> >
> > on e.g. x86_64-linux, S is 16 byte long, field s4 is packed together into
> > the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> > changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> > to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> > T is 8 bytes long, again, t2 is packed into the same 64 bits as t1,
> > and the memory model doesn't allow t2 to be modified.
> >
> > Not sure what the kernel would expect in such a case.
>
> So the kernel really doesn't care what you do to things *within* the bitfield.

But what is *within* the bitfield? Do you consider s4 or t2 fields
(non-bitfield fields that just the ABI wants to pack together with
the bitfield) above as *within*, or is modifying those already a problem?

> The 'volatile' example makes it entirely clear that the bug has
> nothing what-so-ever to do with C11.

The reason we mention the C11/C++11 memory model is that we plan to support
it, likely for GCC 4.8, when requested by some options (perhaps by default
when choosing those standards, or only when explicitly requested). This will
even disable some optimizations that are fine for single-threaded apps,
but aren't fine in those memory models, aren't fine in OpenMP or for many
other threaded programs. E.g. loop store motion if it isn't guaranteed the
variable is always stored in the loop:
int x;

void
foo (int j)
{
  int i;
  for (i = 0; i < 100000; i++)
    if (i > j)
      x = i;
}
can't be performed, because otherwise it introduces a tmp = x; ... x = tmp;
into code that wouldn't otherwise touch the variable, so if some other
thread modifies x, it might end up with an unexpected value.
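
Spelled out, the store motion in question would rewrite foo() along these
lines, introducing a load and an unconditional store of x even when i never
exceeds j (a sketch only, reusing the int x above):

void
foo_transformed (int j)
{
  int i;
  int tmp = x;          /* reads x even though foo() may never touch it */
  for (i = 0; i < 100000; i++)
    if (i > j)
      tmp = i;
  x = tmp;              /* unconditional store: can overwrite a concurrent
                           update of x made by another thread */
}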

Jakub

Jakub Jelinek

Feb 1, 2012, 3:40:02 PM

On Wed, Feb 01, 2012 at 06:42:54PM +0100, Torvald Riegel wrote:
> We need a proper memory model. No vague assumptions with lots of
> hand-waving. If you think that this is simple stuff and can be
> sufficiently described by "don't do anything stupid", then please have a
> look at the issues that the Java memory model faced, and all the
> iterations of the C++11/C11 model and discussions about it.
>
> The only candidate that I see is the C++11/C11 model. Any other
> suggestions?

Well, the C++11/C11 model doesn't allow to use the underlying type
for accesses, consider e.g.

struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
struct T { long t1 : 16; unsigned int t2; };

on e.g. x86_64-linux, S is 16 byte long, field s4 is packed together into
the same 32 bits as s2 and s3. While the memory model allows s2 to be
changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
T is 8 bytes long, again, t2 is packed into the same 64 bits as t1,
and the memory model doesn't allow t2 to be modified.

Not sure what the kernel would expect in such a case.

Jakub

Torvald Riegel

Feb 1, 2012, 3:50:02 PM

On Wed, 2012-02-01 at 11:47 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel <tri...@redhat.com> wrote:
> >
> > We need a proper memory model.
>
> Not really.
>
> The fact is, the kernel will happily take the memory model of the
> underlying hardware.

You do rely on the compiler to do common transformations I suppose:
hoist loads out of loops, CSE, etc. How do you expect the compiler to
know whether they are allowed for a particular piece of code or not?

An example, C++11 pseudo code:
atomic<int> shared_flags[123];
x = 0;
for (...)
  if (shared_flags[i].load(memory_order_acquire))
    x += shared_data;

Now, can we hoist the load of shared_data out of the loop? The
acquire tells the compiler that this is not allowed.
Now assume x86, and its memory model. The acquire in there isn't
visible because you don't need a barrier for that. So, what's the
compiler gonna do? Never hoist loads out of any loops because an acquire
might be intended?
So, you could add a compiler barrier in there (volatile asm...). But
those work in both directions, and the acquire only works in one
direction, allowing the first store to x to be moved later.
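
For reference, the kernel-style compiler barrier alluded to here is just an
empty asm with a memory clobber, and as noted it constrains the compiler in
both directions (sketch, gcc syntax; read_if_ready() is a made-up example):

#define barrier() __asm__ __volatile__("" : : : "memory")

int shared_flag;
int shared_data;

int read_if_ready(void)
{
        if (!shared_flag)
                return -1;
        barrier();      /* keeps the compiler from hoisting the shared_data
                         * load above the flag check; on x86 the hardware
                         * already keeps the two loads ordered */
        return shared_data;
}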

> Trying to impose some compiler description of the
> memory model is actually horribly bad, because it automatically also
> involves compiler synchronization points - which will try to be
> portable and easy to understand, and quite frankly, that will
> automatically result in what is technically known as a "shitload of
> crap".

What is a "compiler synchronization point"?
Have you actually read the C++11/C11 memory model?

> Now, a strict memory model is fine for portability, and to make it
> simple for programmers. I actually largely approve of the Java memory
> model approach, even if it has been horribly problematic and has some
> serious problems. But it's not something that is acceptable for the
> kernel - we absolutely do *not* want the compiler to introduce some
> memory model, because we are very much working together with whatever
> the hardware memory model is, and we do our own serialization
> primitives.

I think you underestimate the issues here. The memory model that the
kernel is programmed against is what tells the compiler which
transformations are allowed, and which aren't. You seem to claim that
this is just the hardware's memory model that counts, but it's clear
that you have an implicit language memory model in mind, because
otherwise the compiler could never hoist any loads, for example (to be
safe, conservatively).

> > No vague assumptions with lots of hand-waving.
>
> So here's basically what the kernel needs:
>
> - if we don't touch a field, the compiler doesn't touch it.

"we don't touch a field": what does that mean precisely? Never touch it
in the same thread? Same function? Same basic block? Between two
sequence points? When crossing synchronizing code? For example, in the
above code, can we touch it earlier/later?

I understand where this rule is heading (granularity races...), but
really, this isn't precise, and this all affects what optimizations are
allowed. In this context, "show me the code" means having something
more formal, unfortunately, to prevent ambiguities.

For this reason, I suggested to have a look at the C11 model, and then
complain about limitations in this model.

>
> This is the rule that gcc now violates with bitfields.

For the volatile case, that's a bug, yes. For the non-volatile, I'd
need to check.

> This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.

I'm trying to explain here why language memory models are a good thing,
and not the enemy.


> - if we need to touch something only once, or care about something
> that is touched only conditionally, and we actually need to make sure
> that it is touched *only* under that condition, we already go to quite
> extreme lengths to tell the compiler so.
>
> We usually use an access with a "volatile" cast on it or tricks
> like that. Or we do the whole "magic asm that says it modified memory
> to let the compiler know not to try anything clever"
>
> - The kernel IS NOT GOING TO MARK DATA STRUCTURES.
>
> Marking data structures is seriously stupid and wrong. It's not the
> *data* that has access rules, it's the *code* that has rules of
> access.

And types are usually a good way to enforce that code which accesses a piece
of state behaves similarly. Which is what atomic<foo> does, and what
I meant. From the overall C++/C perspective, it makes sense to require
types for atomics. For the kernel, I don't care.

> Stuff that is "volatile" in one context is not volatile in another. If
> you hold a RCU write lock, it may well be entirely stable, and marking
> it volatile is *wrong*, and generating code as if it was volatile is
> pure and utter shit.

Agreed. There is no full and portable way to do the weaker accesses
with atomics, but memory_order_relaxed is very close and essentially
just guarantees atomic access, which isn't too hard on most
architectures (ie, not the same overhead as volatile).
You could as well cast the atomic to a nonatomic, but this assumes that
atomics have the same bit representation as the same but nonatomic type.
Might be a reasonable assumption, but not good for general C.

> And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
> exactly the same mistake as "volatile" was. It's wrong. Marking data
> _Atomic is a sure sign that whoever designed it didn't understand
> things.

Like types are a bad idea too? I think you're exaggerating here. And
as a general language standard, this cannot be tailored towards just the
kernel hacker.

>
> > The only candidate that I see is the C++11/C11 model. Any other
> > suggestions?
>
> The C11 model addresses the wrong thing

What's the wrong thing?

> , and addresses it badly.

You mean to enforce an atomic type?

> You might as well ignore it as far as the kernel is concerned. I'm
> sure it helps some *other* problem, but it's not the kernel problem.
>
> The rules really are very simple, and the kernel already does
> everything to make it easy for the compiler.

What would be helpful is if you could look at the data-race / access
granularity rules in the C11 model and tell us whether they are
equivalent (or not) to the rules you summarized above (no, they were not
precise enough, sorry).

> When we do something that the compiler cannot re-order around, we do
> an asm() and mark it as modifying memory so that the compiler won't
> screw things up. In addition, we will do whatever that the CPU
> requires for memory ordering, and quite frankly, the compiler will
> never have sufficient locking primitives to satisfy us, and there is
> no real point in even trying. If you try to do locking in the
> compiler, you *will* do it wrong.

You don't need to use the stdlib locks.  That's not the memory model;
that's standard library threading support.  The compiler also does not
generate any locking primitives (except as library fallback code), only
atomic instructions.  They do seem okay to me in terms of coverage,
though; if you need others, we could provide them.

The memory orders that the C11 model provides generalize things
somewhat, depending on the architecture, I agree. But which further
ones would you need (e.g., some variant of consume?)?


Torvald

Linus Torvalds

unread,
Feb 1, 2012, 3:50:02 PM2/1/12
to
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>>
>> So the kernel really doesn't care what you do to things *within* the bitfield.
>
> But what is *within* the bitfield?  Do you consider the s4 or t2 fields
> (non-bitfield fields that the ABI just wants to pack together with
> the bitfield) above as *within*, or is modifying those already a problem?

I seriously think you should just do the obvious thing, and LOOK AT
THE BASE TYPE. That solves everything.

If the definition of the data is (assume "long" to be 64 bits)

long mybits:32;
int nonbitfield;

and you end up doing a 64-bit access when you read/write the "mybits"
bitfield, and it overlaps "nonbitfield", then I would also just say
"hey, whatever, the user asked for it". He really did. The
"allocation" for the bitfield comes from the base type, and so you
basically have a "long" to play with.

Let's take an even more extreme example:

int bits1:8;
char c;
int bits2:16;

where the "char c" is actually *embedded* in the "bitfield
allocation". But again, it's completely and totally unambiguous what
accesses you would at least start out with: sure, you *could* do an
8-bit access (and you probably should, if the ISA allows it), but
dangit, the base type is "int", so I don't think it's wrong to do a
32-bit load, mask, and a 32-bit write.

But if the user wrote

char bits1:8;
char c;
short bits2:16;

then I really think it would be a *bug* if modifying "bits1" could
possibly write to 'c'. Of course, this is again a possible ISA
problem: if the architecture doesn't have byte writes, you can't do
anything about it, but that is obviously true regardless of bitfields,
so that's really a totally irrelevant argument for this case. That
non-atomicity doesn't come from bitfields, it comes from the fact that
the BASE TYPE is again non-atomic.

Notice how it always ends up being about the base type. Simple,
straightforward, and unambiguous.

And similarly, to go to the kernel example, if you have

int mybits:2;
int nonbitfield;

then I think it's an obvious *BUG* to access the nonbitfield things
when you access "mybits". It clearly is not at all "within" the
bitfield allocation, and "int" isn't non-atomic on any sane
architecture, so you don't even have the "byte accesses aren't atomic"
issue.

So I really do think that the sane approach is to look at the base
type of the bitfield, like I suggested from the very beginning.

If the base type is an "int", you do an int access. It really solves
so many problems, and it is even entirely sane semantics that you can
*explain* to people. It makes sense, it avoids the clear bugs gcc has
now, and it also solves the performance bug I reported a long time
ago.
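
To make it concrete, here is roughly the open-coded equivalent of what a
store to a two-bit field with an "int" base type should turn into under this
rule (a sketch with made-up names, not what any particular compiler emits):

/* the "int" storage unit that the ABI allocates for the bitfield */
unsigned int bitfield_word;     /* mybits lives in the low 2 bits */
int nonbitfield;                /* a separate int, never touched  */

void set_mybits(int val)
{
        unsigned int tmp = bitfield_word;               /* 32-bit load  */
        tmp = (tmp & ~3u) | ((unsigned int)val & 3u);   /* mask in bits */
        bitfield_word = tmp;                            /* 32-bit store */
}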

Seriously - is there any real argument *against* just using the base
type as a hint for access size?

I realize that it may not be simple, and may not fit the current mess
of gcc bitfields, but I really do think it's the RightThing(tm) to do.
It has sensible semantics, and avoids problems.

In contrast, the *current* gcc semantics are clearly not sensible, and
have both performance and correctness issues.

Linus

Torvald Riegel

unread,
Feb 1, 2012, 4:00:02 PM2/1/12
to
On Wed, 2012-02-01 at 09:29 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel <tri...@redhat.com> wrote:
> >
> > What do the kernel folks think about the C11 memory model? If you can
> > spot any issues in there, the GCC community would certainly like to
> > know.
>
> I don't think this is about memory models except very tangentially.
>
> Gcc currently accesses fields in a way that affects *independent*
> fields, without checking their memory models at all.
>
> Even original C already has one memory model: "volatile" data is seen
> outside the virtual C machine. And Jiri reports that even that
> *original* memory model is being violated. We're taling about the one
> from about 40 years ago.

For volatile, I agree.

However, the original btrfs example was *without* a volatile, and that's
why I raised the memory model point. This triggered an error in a
concurrent execution, so that's memory model land, at least in C
language standard.

The example was a granularity-of-access violation, I agree.
Nonetheless, C11 has rules for that; they have been written down, and it
would be good to know whether these rules are sufficient for you.

> We do end up doing
> much more aggressive threading, with models that C11 simply doesn't
> cover.

Any specific examples for that would be interesting.

Linus Torvalds

unread,
Feb 1, 2012, 4:00:04 PM2/1/12
to
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel <tri...@redhat.com> wrote:
>
> You do rely on the compiler to do common transformations I suppose:
> hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> know whether they are allowed for a particular piece of code or not?

We have barriers.

Compiler barriers are cheap. They look like this:

include/linux/compiler-gcc.h:
#define barrier() __asm__ __volatile__("": : :"memory")

and if we don't allow hoisting, we'll often use something like that.

It's not the only thing we do. We have cases where it's not that you
can't hoist things outside of loops, it's that you have to read things
exactly *once*, and then use that particular value (ie the compiler
can't be allowed to reload the value because it may change). So any
*particular* value we read is valid, but we can't allow the compiler
to do anything but a single access.

So we have this beauty:

include/linux/compiler.h:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which does that for us (quite often, it's not used as-is: a more
common case is using the "rcu_access_pointer()" inline function that
not only does that ACCESS_ONCE() but also has the capability to enable
run-time dynamic verification that you are in a proper RCU read locked
section etc)

We also have tons of (very subtle) code that actually creates locks
and memory ordering etc. They usually just use the big "barrier()"
approach to telling the compiler not to combine or move memory
accesses around it.

And I realize that compiler people tend to think that loop hoisting
etc is absolutely critical for performance, and some big hammer like
"barrier()" makes a compiler person wince. You think it results in
horrible code generation problems.

It really doesn't. Loops are fairly unusual in the kernel to begin
with, and the compiler barriers are a total non-issue. We have much
more problems with the actual CPU barriers that can be *very*
expensive on some architectures, and we work a lot at avoiding those
and avoiding cacheline ping-pong issues etc.

>> > No vague assumptions with lots of hand-waving.
>>
>> So here's basically what the kernel needs:
>>
>>  - if we don't touch a field, the compiler doesn't touch it.
>
> "we don't touch a field": what does that mean precisely?  Never touch it
> in the same thread?  Same function?  Same basic block?  Between two
> sequence points?  When crossing synchronizing code?  For example, in the
> above code, can we touch it earlier/later?

Why are you trying to make it more complex than it is?

If the source code doesn't write to it, the assembly the compiler
generates doesn't write to it.

Don't try to make it anything more complicated. This has *nothing* to
do with threads or functions or anything else.

If you do massive inlining, and you don't see any barriers or
conditionals or other reasons not to write to it, just write to it.

Don't try to appear smart and make this into something it isn't.

Look at the damn five-line example of the bug. FIX THE BUG. Don't try
to make it anything bigger than a stupid compiler bug. Don't try to
make this into a problem it simply isn't.

Linus

Boehm, Hans

unread,
Feb 1, 2012, 4:20:01 PM2/1/12
to
> From: Linus Torvalds
> >
> > We need a proper memory model.
>
> Not really.
>
> The fact is, the kernel will happily take the memory model of the
> underlying hardware. Trying to impose some compiler description of the
> memory model is actually horribly bad, because it automatically also
> involves compiler synchronization points - which will try to be
> portable and easy to understand, and quite frankly, that will
> automatically result in what is technically known as a "shitload of
> crap".
The C11 memory model potentially adds overhead in only two cases:

1. When current code involves touching a field that wouldn't otherwise be touched. There are odd cases in which this measurably slows down code, but I think all agree that we need it. In addition to bitfields, it can affect speculatively promoting a value to a register in a loop, which at least older versions of gcc also do.

2. When you use atomic operations for racing accesses. And in that case, if you really want it, you get a lot of control over memory ordering. Atomic operations that specify "memory_order_relaxed" should only have three kinds of impact:
- They ensure the access is indivisible.
- They tell the compiler that the location may be concurrently modified and it should refrain from optimizations that will break if it is.
- They enforce cache coherence, i.e. single-variable ordering. This is implicit on most architectures, but requires ld.acq on Itanium. IIRC, Paul McKenney argued convincingly that this was essential for preserving programmer sanity.
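
A small sketch of what the "refrain from optimizations" point means in practice (made-up names, C11 syntax):

#include <stdatomic.h>

_Atomic int stop_requested;
void do_some_work(void);        /* placeholder */

void worker(void)
{
        /* each iteration performs one indivisible load of the whole int,
           and the compiler may not hoist the load out of the loop as it
           could for a plain int */
        while (!atomic_load_explicit(&stop_requested, memory_order_relaxed))
                do_some_work();
}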

>
> Now, a strict memory model is fine for portability, and to make it
> simple for programmers. I actually largely approve of the Java memory
> model approach, even if it has been horribly problematic and has some
> serious problems. But it's not something that is acceptable for the
> kernel - we absolutely do *not* want the compiler to introduce some
> memory model, because we are very much working together with whatever
> the hardware memory model is, and we do our own serialization
> primitives.
I think you are somewhat misinterpreting the C11 memory model. Aside from the minor cache coherence issues on one or two architectures, you can still do that, if you like. But I suspect you can also do better in places.

>
> > No vague assumptions with lots of hand-waving.
>
> So here's basically what the kernel needs:
>
> - if we don't touch a field, the compiler doesn't touch it.
>
> This is the rule that gcc now violates with bitfields.
And in other more subtle cases. At least until very recently.

>
> This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.
Agreed.

>
> - if we need to touch something only once, or care about something
> that is touched only conditionally, and we actually need to make sure
> that it is touched *only* under that condition, we already go to quite
> extreme lengths to tell the compiler so.
>
> We usually use an access with a "volatile" cast on it or tricks
> like that. Or we do the whole "magic asm that says it modified memory
> to let the compiler know not to try anything clever"
>
> - The kernel IS NOT GOING TO MARK DATA STRUCTURES.
>
> Marking data structures is seriously stupid and wrong. It's not the
> *data* that has access rules, it's the *code* that has rules of
> access.
>
> The traditional C "volatile" is misdesigned and wrong. We don't
> generally mark data volatile, we really mark *code* volatile - which
> is why our "volatiles" are in the casts, not on the data structures.
>
> Stuff that is "volatile" in one context is not volatile in another. If
> you hold a RCU write lock, it may well be entirely stable, and marking
> it volatile is *wrong*, and generating code as if it was volatile is
> pure and utter shit.
>
> On the other hand, if you are touching *the*very*same* field while you
> are only read-locked for RCU, it may well be one of those "this has to
> be read by accessing it exactly once".
>
> And we do all this correctly in the kernel. Modulo bugs, of course,
> but the fundamental rule really is: "atomicity or volatility is about
> CODE, not DATA".
The preferred C11 model is clearly to mark data that potentially participates in unprotected concurrent accesses, and then to annotate accesses that require less care, e.g. because we know they can't occur concurrently with other accesses. Maybe this is the wrong approach for the kernel, but to me that seems like a safer approach. We did consider the alternative model, and tried to ensure that casts to atomic could also work on as many platforms as possible, though we can't guarantee that. (There is currently no specification for atomic accesses that says "not concurrently accessed". We've thought about it. I would argue for adding it if there were important cases in which memory_order_relaxed is demonstrably not close enough.)

>
> And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
> exactly the same mistake as "volatile" was. It's wrong. Marking data
> _Atomic is a sure sign that whoever designed it didn't understand
> things.
>
> > The only candidate that I see is the C++11/C11 model.  Any other
> > suggestions?
>
> The C11 model addresses the wrong thing, and addresses it badly.
>
> You might as well ignore it as far as the kernel is concerned. I'm
> sure it helps some *other* problem, but it's not the kernel problem.
But it seems to me that the core of the C11 model consists largely of the rules you're asking for. Even if you continue to use volatile and ignore atomic, it seems to me that you're far ahead, though still on needlessly thin ice, with the C11 model.

>
> The rules really are very simple, and the kernel already does
> everything to make it easy for the compiler.
>
> When we do something that the compiler cannot re-order around, we do
> an asm() and mark it as modifying memory so that the compiler won't
> screw things up. In addition, we will do whatever that the CPU
> requires for memory ordering, and quite frankly, the compiler will
> never have sufficient locking primitives to satisfy us, and there is
> no real point in even trying. If you try to do locking in the
> compiler, you *will* do it wrong.
Which is one reason C11 gives you fine control over atomics that can be used to implement locks.

>
> If you add random flags on data structures ("_Atomic" or "volatile" or
> whatever), you *will* do it wrong. It's simply a fundamentally broken
> model.
>
> So the *only* memory model we want from the compiler really is: "don't
> write to fields that the source code didn't write to".
>
> It's really is that simple.
You clearly also need to say something about adjacent bitfields. Nor would it be usable outside the kernel, because the programmer needs to understand the esoteric fence semantics etc. on every architecture in order to really use it. (It also isn't exactly a language specification, but that's OK here.) But again I think the C11 spec implies exactly what you want here.

>
> > So, would it be okay to tell the compiler which part of the state is
> > accessed concurrently (ie, locks, atomics, ...)?
>
> Seriously, no.
>
> See above: it's not the "state" that is accessed concurrently. It's
> the code. If you ever try to mark state, you've already lost. The same
> "state" can be atomic or not depending on context. It's not about the
> state or the data structures, and it never will be.
I think we have no disagreement that in the end it's the access whose properties you need to specify. The question is what constitutes the most convenient and robust way to do so. Clearly we specify other attributes of accesses (length, alignment) using the type, and perhaps override it when necessary. C11 atomics arguably use the same approach.

Hans

>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ia64"

Torvald Riegel

unread,
Feb 1, 2012, 4:30:02 PM2/1/12
to
On Wed, 2012-02-01 at 12:59 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel <tri...@redhat.com> wrote:
> >
> > You do rely on the compiler to do common transformations I suppose:
> > hoist loads out of loops, CSE, etc. How do you expect the compiler to
> > know whether they are allowed for a particular piece of code or not?
>
> We have barriers.
>
> Compiler barriers are cheap. They look like this:
[snip]

I know, and I've programmed my chunk of concurrent code with those and
architecture-specific code too (or libatomic-ops for a bit more
ready-to-use portability).

> It's not the only thing we do. We have cases where it's not that you
> can't hoist things outside of loops, it's that you have to read things
> exactly *once*, and then use that particular value (ie the compiler
> can't be allowed to reload the value because it may change). So any
> *particular* value we read is valid, but we can't allow the compiler
> to do anything but a single access.

That's what an access to an atomics-typed var would give you as well.

> And I realize that compiler people tend to think that loop hoisting
> etc is absolutely critical for performance, and some big hammer like
> "barrier()" makes a compiler person wince. You think it results in
> horrible code generation problems.
>
> It really doesn't. Loops are fairly unusual in the kernel to begin
> with, and the compiler barriers are a total non-issue.

This is interesting, because GCC is currently not optimizing across
atomics, but we're considering working on this later.
Do you have real data about this, or is this based on analysis of
samples of compiled code?

> We have much
> more problems with the actual CPU barriers that can be *very*
> expensive on some architectures, and we work a lot at avoiding those
> and avoiding cacheline ping-pong issues etc.

Well, obviously...

>
> >> > No vague assumptions with lots of hand-waving.
> >>
> >> So here's basically what the kernel needs:
> >>
> >> - if we don't touch a field, the compiler doesn't touch it.
> >
> > "we don't touch a field": what does that mean precisely? Never touch it
> > in the same thread? Same function? Same basic block? Between two
> > sequence points? When crossing synchronizing code? For example, in the
> > above code, can we touch it earlier/later?
>
> Why are you trying to make it more complex than it is?
>
> If the source code doesn't write to it, the assembly the compiler
> generates doesn't write to it.

If it were that easy, then answering my questions would have been easy
too, right? Which piece of source code? The whole kernel?

>
> Don't try to make it anything more complicated. This has *nothing* to
> do with threads or functions or anything else.
>
> If you do massive inlining, and you don't see any barriers or
> conditionals or other reasons not to write to it, just write to it.

And a precise definition of these reasons is what we need to agree on.

> Don't try to appear smart and make this into something it isn't.

Even if I had done that, it wouldn't be as rude as trying to imply
that other people aren't smart.

Boehm, Hans

unread,
Feb 1, 2012, 4:30:03 PM2/1/12
to
> From: Linus Torvalds
> Don't try to make it anything more complicated. This has *nothing* to
> do with threads or functions or anything else.
>
> If you do massive inlining, and you don't see any barriers or
> conditionals or other reasons not to write to it, just write to it.
>
> Don't try to appear smart and make this into something it isn't.
>
> Look at the damn five-line example of the bug. FIX THE BUG. Don't try
> to make it anything bigger than a stupid compiler bug. Don't try to
> make this into a problem it simply isn't.
>
My impression is that all of us not on the hook to fix this are in violent agreement on this example.

Here are some more interesting ones that illustrate the issues (all declarations are non-local, unless stated otherwise):

struct { char a; int b:9; int c:7; char d; } x;

Is x.b = 1 allowed to overwrite x.a? C11 says no, essentially requiring two byte stores. Gcc currently does so. I'm not sure I understand Linus' position here.


int count;
/* p and q are local */

for (q = p; q != 0; q = q->next) if (q->data > 0) ++count;

Can count be promoted to a register, and thus written even if there are no positive elements?  C11 says no.  gcc at least used to do this.


for (q = p; q != 0; q = q->next) { ++count; if (rare_cond) f(); }

Same question, with count saved and restored around the call to f() (which might include a fence).  C11 says no.  I think Linus is also arguing for no.


for (i = 0; i < 1000; ++i) { if (i%2) a[i] = i; }

Can I vectorize the loop, writing back the original values at the even indices, and thus writing all entries of the array?  C11 and Linus both say no.


My impression is that we are generally in agreement.

Hans

Linus Torvalds

unread,
Feb 1, 2012, 4:30:03 PM2/1/12
to
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel <tri...@redhat.com> wrote:
>
> For volatile, I agree.
>
> However, the original btrfs example was *without* a volatile, and that's
> why I raised the memory model point.  This triggered an error in a
> concurrent execution, so that's memory model land, at least in C
> language standard.

Sure. The thing is, if you fix the volatile problem, you'll almost
certainly fix our problem too.

The whole "compilers actually do reasonable things" approach really
does work in reality. It in fact works a lot better than reading some
spec and trying to figure out if something is "valid" or not, and
having fifteen different compiler writers and users disagree about
what the meaning of the word "is" is in some part of it.

I'm not kidding. With specs, there really *are* people who spend years
discussing what the meaning of the word "access" is or similar.
Combine that with a big spec that is 500+ pages in size and then try
to apply that all to a project that is 15 million lines of code and
sometimes *knowingly* has to do things that it simply knows are
outside the spec, and the discussion about these kinds of details is
just mental masturbation.


>> We do end up doing
>> much more aggressive threading, with models that C11 simply doesn't
>> cover.
>
> Any specific examples for that would be interesting.

Oh, one of my favorite (NOT!) pieces of code in the kernel is the
implementation of the

smp_read_barrier_depends()

macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we
have something like five uses of it. One of them, I think, is
performance crticial, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two
different reads that we want to happen in order wrt two writes on the
writing side (the writing side also has to have a "smp_wmb()" to order
those writes). But the reason it isn't a simple read barrier is that
the reads are actually causally *dependent*, ie we have code like

first_read = read_pointer;
smp_read_barrier_depends();
second_read = *first_read;

and it turns out that on pretty much all architectures (except for
alpha), the *data*dependency* will already guarantee that the CPU
reads the thing in order. And because a read barrier can actually be
quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data
dependency doesn't actually guarantee cache access order. It's
strange, yes. No, it's not that alpha does some magic value prediction
and can do the second read without having even done the first read
first to get the address. What's actually going on is that the cache
itself is unordered, and without the read barrier, you may get a stale
version from the cache even if the writes were forced (by the write
barrier in the writer) to happen in the right order.
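
So the full pairing looks roughly like this (names made up, and the real
code would go through the RCU helpers rather than open-coding it):

/* writer */
item->data = compute();         /* 1: initialize the object       */
smp_wmb();                      /* order the init before publish  */
shared_ptr = item;              /* 2: publish the pointer         */

/* reader */
p = shared_ptr;                 /* first read: the pointer        */
smp_read_barrier_depends();     /* no-op everywhere except alpha  */
val = p->data;                  /* second read depends on p       */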

You really want to try to describe issues like this in your memory
consistency model? No you don't. Nobody will ever really care, except
for crazy kernel people. And quite frankly, not even kernel people
care: we have a fairly big kernel developer community, and the people
who actually talk about memory ordering issues can be counted on one
hand. There's the "RCU guy" who writes the RCU helper functions, and
hides the proper serializing code in those helpers, so that normal
mortal kernel people don't have to care, and don't even have to *know*
how ignorant they are about the things.

And that's also why the compiler shouldn't have to care. It's a really
small esoteric detail, and it can be hidden in a header file and a set
of library routines. Teaching the compiler about crazy memory ordering
would just not be worth it. 99.99% of all programmers will never need
to understand any of it, they'll use the locking primitives and follow
the rules, and the code that makes it all work is basically invisible
to them.

Linus

Torvald Riegel

unread,
Feb 1, 2012, 4:40:03 PM2/1/12
to
On Wed, 2012-02-01 at 13:20 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel <tri...@redhat.com> wrote:
> >
> > For volatile, I agree.
> >
> > However, the original btrfs example was *without* a volatile, and that's
> > why I raised the memory model point. This triggered an error in a
> > concurrent execution, so that's memory model land, at least in C
> > language standard.
>
> Sure. The thing is, if you fix the volatile problem, you'll almost
> certainly fix our problem too.
>
> The whole "compilers actually do reasonable things" approach really
> does work in reality. It in fact works a lot better than reading some
> spec and trying to figure out if something is "valid" or not, and
> having fifteen different compiler writers and users disagree about
> what the meaning of the word "is" is in some part of it.

That's why researchers have formalized the model in order to verify
whether it's sane. And they found bugs / ambiguities in it:
http://www.cl.cam.ac.uk/~pes20/cpp/

The standards text might still be a typical standards text for various
reasons. But it enables having a formal version of it too, and
discussions about this formal version. Personally, I find this more
formal model easier to work with, exactly because it is more precise
than prose.

>
> >> We do end up doing
> >> much more aggressive threading, with models that C11 simply doesn't
> >> cover.
> >
> > Any specific examples for that would be interesting.
>
> Oh, one of my favorite (NOT!) pieces of code in the kernel is the
> implementation of the
>
> smp_read_barrier_depends()
>
> macro, which on every single architecture except for one (alpha) is a no-op.
>
> We have basically 30 or so empty definitions for it, and I think we
> have something like five uses of it. One of them, I think, is
> performance crticial, and the reason for that macro existing.
>
> What does it do? The semantics is that it's a read barrier between two
> different reads that we want to happen in order wrt two writes on the
> writing side (the writing side also has to have a "smp_wmb()" to order
> those writes). But the reason it isn't a simple read barrier is that
> the reads are actually causally *dependent*, ie we have code like
>
> first_read = read_pointer;
> smp_read_barrier_depends();
> second_read = *first_read;
>
> and it turns out that on pretty much all architectures (except for
> alpha), the *data*dependency* will already guarantee that the CPU
> reads the thing in order. And because a read barrier can actually be
> quite expensive, we don't want to have a read barrier for this case.

I don't have time to look at this in detail right now, but it looks
roughly close to C++11's memory_order_consume to me, which is somewhat
like an acquire, but just for subsequent data-dependent loads.  Added
for performance reasons on some architectures AFAIR.
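
For comparison, the reader side with consume would be roughly (a sketch,
made-up names):

#include <stdatomic.h>

struct item { int data; };
_Atomic(struct item *) shared_ptr;

int reader(void)
{
        /* ordering is only guaranteed for accesses that carry a data
           dependency on p, which is all this pattern needs */
        struct item *p = atomic_load_explicit(&shared_ptr, memory_order_consume);
        return p ? p->data : -1;
}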

> You really want to try to describe issues like this in your memory
> consistency model? No you don't. Nobody will ever really care, except
> for crazy kernel people. And quite frankly, not even kernel people
> care: we have a fairly big kernel developer community, and the people
> who actually talk about memory ordering issues can be counted on one
> hand. There's the "RCU guy" who writes the RCU helper functions, and
> hides the proper serializing code into those helpers, so that normal
> mortal kernel people don't have to care, and don't even have to *know*
> how ignorant they are about the things.

I'm not a kernel person, and I do care about it. Userspace is
synchronizing too, and not just with pthread mutexes.

> And that's also why the compiler shouldn't have to care. It's a really
> small esoteric detail, and it can be hidden in a header file and a set
> of library routines. Teaching the compiler about crazy memory ordering
> would just not be worth it. 99.99% of all programmers will never need
> to understand any of it, they'll use the locking primitives and follow
> the rules, and the code that makes it all work is basically invisible
> to them.

I disagree (though it would be nice if it were that esoteric), but
that's off-topic...

Linus Torvalds

unread,
Feb 1, 2012, 5:00:06 PM2/1/12
to
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel <tri...@redhat.com> wrote:
>> It's not the only thing we do. We have cases where it's not that you
>> can't hoist things outside of loops, it's that you have to read things
>> exactly *once*, and then use that particular value (ie the compiler
>> can't be allowed to reload the value because it may change). So any
>> *particular* value we read is valid, but we can't allow the compiler
>> to do anything but a single access.
>
> That's what an access to an atomics-typed var would give you as well.

Right. The concept of "atomic" access is required.

The problem I have with treating this as a C11 issue is that people
aren't using C11 compilers, and won't for a long while.

So quite frankly, it won't be reasonable for the kernel people to say
"use a C11 version" for years to come.

And the thing is, "atomic" doesn't actually seem to buy us anything in
the kernel afaik. We're perfectly happy to use "volatile". We don't
want the data structure annotated, we want the code annotated, so the
semantic differences between 'volatile' and 'atomic' seem to be pretty
irrelevant.

Sure, there is probably some subtle difference, but on the whole, it
all boils down to "we can't depend on new rules for several years, and
even when we can, it doesn't look like they'd buy us a lot, because we
have to play so many games around them *anyway*".

And don't get me wrong. I don't think that means that C11 is *bad*.
It's just that the kernel is very different from most other projects.
We have to have those crazy architecture-specific header files and
random synchronization macros etc anyway.

C11 is not - I think - even meant to be geared towards the Linux
kernel kind of crazy use. We really do some odd things, adding
compiler features for them is mostly a mistake. asm() takes care of a
lot of the oddities for us, it's not like all of them are about memory
accesses or concurrency either.

I do think that the threading issues in C11 are going to help us
kernel people, because the more people think about issues with
concurrency, they really *will* be hitting some of the issues we've
been having. So it clearly forces clarification of just what the word
"access" implies, for example, which is certainly not going to be bad
for the kernel.

>> It really doesn't. Loops are fairly unusual in the kernel to begin
>> with, and the compiler barriers are a total non-issue.
>
> This is interesting, because GCC is currently not optimizing across
> atomics, but we're considering working on this later.
> Do you have real data about this, or is this based on analysis of
> samples of compiled code?

So the kernel literally doesn't tend to *have* loops.

Sure, there are exceptions: there are the obvious loops implied in
"memcpy()" and friends, but almost all kernel activity tends to be
about individual events. A really hot loop under normal loads would be
the loop that calculates a hash value over a pathname component. Or
the loop that loops over those components.

Loop count? Think about it. Most pathnames have one or two components.
And while "8.3" is clearly too short for a filename, it's still true
that a lot of filenames (especially directories, which is what you end
up traversing many times) are in a kind of awkward 5-8 character
range.

So the kernel loads tend to have loop counts in the single digits -
*maybe* double digits. We use hash-tables a lot, so looping to find
something is usually just an entry or two, and the cost is almost
never the loop, it's the cache miss from the pointer chasing.

Most "looping" behavior in the kernel is actually user space calling
the kernel in a loop. So the kernel may see "loops" in that sense, but
it's not loopy kernel code itself.

Even really hot kernel-only code is often not about loops. You might
get a million network packets a second, and with interrupt mitigation
you would not necessarily treat them each as individual events with
its own individual interrupt (which does happen too!), but you'd still
not see it as a loop over a million packets. You'd see them come in as
a few packets at a time, and looping until you've handled them.

And the loops we have tend to not be very amenable to nice compiler
optimizations either.

The kernel almost never works with arrays, for example. The string in
a pathname component is an array, yes, but *most* kernel data
structures loop over our doubly linked lists or over a hash-chain.

And it tends to be really hard to do much loop optimization over list
traversal like that.

>> If the source code doesn't write to it, the assembly the compiler
>> generates doesn't write to it.
>
> If it would be so easy, then answering my questions would have been easy
> too, right?  Which piece of source code?  The whole kernel?

So make the rule be: "in one single function, with all you can see
there", and I'll be happy. Do as much inlining as you want (although
the kernel does have "noinline" too).

With all the normal sequence point and memory invalidation event rules
etc. If you have two non-pure function calls (or barriers, or whatever
like that) that may change memory, and in between them nothing in the
source code writes to a structure member, then there had damn well better not
be any stores to that member in the resulting asm either!

IOW, it's all the normal memory access rules. You can't do random
read-write events to variables that aren't mentioned by the source
code, because other threads may be doing it without your knowledge.
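
A minimal made-up example of the kind of thing that is not allowed:

struct s { int a; int b; };

extern void lock(void);         /* non-pure calls the compiler */
extern void unlock(void);       /* cannot see through          */

void f(struct s *p)
{
        lock();
        p->a = 1;       /* the source only ever writes p->a */
        unlock();
}

/* The generated code must not load and re-store p->b (say, as part of
   a wider read-modify-write covering both fields): another thread may
   be updating p->b under a different lock, and the re-store would
   silently clobber that update. */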

None of this is subtle or even complicated.

The thing that brought it up was an obvious compiler bug. Well, it's
obviously a bug, but it was equally obviously very hard to find.

Fix the test-case. It's a small and simple test-case.

And fix it for the "sig_atomic_t" case, which doesn't even imply
"volatile". So you can't use volatile as some kind of magic marker for
"now we can't do it".

Anything else is irrelevant. You have a non-kernel test-case that
shows a bug. Don't worry about the kernel.

Linus

Boehm, Hans

unread,
Feb 1, 2012, 5:30:01 PM2/1/12
to
> From: Torvald Riegel
It's intended to address the same problem, though somewhat differently. (I suspect there was overlap in the people involved?) One reason that C11 took a slightly different path is that compilers can, and sometimes do, remove dependencies, making smp_read_barrier_depends brittle unless it also imposes compiler constraints.

Hans

Linus Torvalds

unread,
Feb 1, 2012, 5:30:02 PM2/1/12
to
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans <hans....@hp.com> wrote:
>
> Here are some more interesting ones that illustrate the issues (all declarations are non-local, unless stated otherwise):
>
> struct { char a; int b:9; int c:7; char d; } x;
>
> Is x.b = 1 allowed to overwrite x.a?  C11 says no, essentially requiring two byte stores.  Gcc currently does so.  I'm not sure I understand Linus' position here.

So I like the fact that the C11 stance seems very strict. But honestly
I also think it sounds like C11 is actually more strict than I would
necessarily be.

I really do think that bitfields imply "int", both historically and
technically. So I would not see the problem with treating the
bitfields as part of an 'int' and thus overwriting a (and d) when
writing to b. That's how bitfields work! They are fields of an int.

It would be good if it perhaps caused a *warning*, and gave a good way
to avoid it. For example, while I think using any other base type than
'int' is technically an extension of the C bitfield rules (but
whatever, I don't have any specs in front of me), I think a warning
together with allowing the user to rewrite it as

struct { char a; char d; short b:9; short c:7; } x;

would make it clear that now a write to 'b' cannot validly overwrite
'a' or 'd'.

But forcing the compiler to do two (and sometimes three!) byte
accesses sounds excessive.

The reason I think the

int flag:1;
int othervariable;

overwriting of "othervariable" is so *obviously* a bug is exactly that
bitfields are historically about 'int', and no 'long' was there
anywhere, so using a 64-bit access is clearly not sane in any way,
shape or form.

I dunno.

Linus

Paul E. McKenney

unread,
Feb 1, 2012, 5:50:04 PM2/1/12
to
On Wed, Feb 01, 2012 at 12:59:24PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel <tri...@redhat.com> wrote:
> >
> > You do rely on the compiler to do common transformations I suppose:
> > hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> > know whether they are allowed for a particular piece of code or not?
>
> We have barriers.
>
> Compiler barriers are cheap. They look like this:
>
> include/linux/compiler-gcc.h:
> #define barrier() __asm__ __volatile__("": : :"memory")
>
> and if we don't allow hoisting, we'll often use something like that.
>
> It's not the only thing we do. We have cases where it's not that you
> can't hoist things outside of loops, it's that you have to read things
> exactly *once*, and then use that particular value (ie the compiler
> can't be allowed to reload the value because it may change). So any
> *particular* value we read is valid, but we can't allow the compiler
> to do anything but a single access.
>
> So we have this beauty:
>
> include/linux/compiler.h:
> #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
access-once increments can generate a single increment-memory instruction
on x86.

> which does that for us (quite often, it's not used as-is: a more
> common case is using the "rcu_access_pointer()" inline function that
> not only does that ACCESS_ONCE() but also has the capability to enable
> run-time dynamic verification that you are in a proper RCU read locked
> section etc)

And I also hope that rcu_access_pointer(), rcu_dereference(), and
friends can eventually be written in terms of C++11 memory_order_consume
so that the some of the less-sane optimizations can actually be applied
without breaking the kernel.

Of course, this cannot happen immediately. First, gcc must fully support
the C++11 features of interest, the inevitable bugs must be found and
fixed, and the optimizer will no doubt need to be reworked a bit before
the kernel can make use of these features.

New architectures might eventually define things like atomic_inc()
in terms of C++11 atomics, but let's start with the straightforward stuff
if and as it makes sense.

Thanx, Paul

Linus Torvalds

unread,
Feb 1, 2012, 6:20:02 PM2/1/12
to
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
>
> My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> access-once increments can generate a single increment-memory instruction
> on x86.

I don't think that is a semantic issue.

gcc could do it *today* with volatile accesses. It doesn't, because
volatiles are scary and basically disable a lot of optimizations. Why
would memory ordering be substantially different just because it has a
different name?

> New architectures might eventually might define things like atomic_inc()
> in terms of C++11 atomics, but let's start with the straightforward stuff
> as and if it makes sense.

SMP-atomic or percpu atomic? Or both?

We need both variants in the kernel. If the compiler generates one of
them for us, that doesn't really help much.

Richard Guenther

unread,
Feb 2, 2012, 4:40:02 AM2/2/12
to
On Wed, 1 Feb 2012, Linus Torvalds wrote:

> Just out of morbid curiosity, what happens if you have totally
> *separate* variables that just happen to link together? IOW, something
> like
>
> static struct { unsigned bit:1; } onebit;
> static volatile int var;
>
> and they just *happen* to link next to each other (because they were
> declared next to each other) in the same 8-byte aligned block?

Just to make you run away screaming ...

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48124

where exactly such situation happens ...

Richard.

Bernd Petrovitsch

unread,
Feb 2, 2012, 4:40:02 AM2/2/12
to
On Mit, 2012-02-01 at 21:04 +0000, Boehm, Hans wrote:
[...]
> The C11 memory model potentially adds overhead in only two cases:
>
> 1. When current code involves touching a field that wouldn't otherwise
> be touched. There are odd cases in which this measurably slows down
> code, but I think all agree that we need it. In addition to
> bitfields, it can affect speculatively promoting a value to a register
> in a loop, which at least older versions of gcc also do.

What about just adding an -f option for this and/or activating it only at
-O5 (or whatever the highest level is), and - in case that feature is
activated - emitting warnings if bitfields (and/or any other data types
that might be affected) are involved?

Kind regards,
Bernd
--
Bernd Petrovitsch Email : be...@petrovitsch.priv.at
LUGA : http://www.luga.at

Richard Guenther

unread,
Feb 2, 2012, 4:40:03 AM2/2/12
to
On Wed, 1 Feb 2012, Linus Torvalds wrote:

> On Wed, Feb 1, 2012 at 9:41 AM, Michael Matz <ma...@suse.de> wrote:
> >
> > One problem is that it's not a new problem, GCC emitted similar code since
> > about forever, and still they turned up only now (well, probably because
> > ia64 is dead, but sparc64 should have similar problems).  The bitfield
> > handling code is _terribly_ complex and fixing it is quite involved.  So
> > don't expect any quick fixes.
>
> I agree that bitfields are nasty, I've had to handle them myself in
> sparse. And we have actually had problems with bitfields before, to
> the point where we've replaced use of bitfields with masking and
> setting bits by hand.
>
> But I also think that gcc is simply *buggy*, and has made them much
> nastier than they should be. What gcc *should* have done is to turn
> bitfield accesses into shift-and-masking of the underlying field as
> early as possible, and then do all optimizations at that level.
>
> In fact, there is another gcc bug outstanding (48696) where I complain
> about absolutely horrible code generation, and that one was actually
> the exact same issue except in reverse: gcc wouldn't take the
> underlying size of the bitfield into account, and use the wrong
> (smaller) size for the access, causing absolutely horrendous code
> generation that mixes byte and word accesses, and causes slowdowns by
> orders of magnitude.

Yeah, sorry for dropping the ball on this one (and missing my original
GCC 4.7 target ...)

> And it really is the same issue: gcc has forgotten what the base type
> is, and tries to "compute" some type using the actual bits. Which is
> simply *wrong*. Always has been.
>
> It's stupid, it generates bad code (both from performance *and* from a
> correctness angle), and it has actually resulted in gcc having *extra*
> complexity because it keeps the bitfield data around much too late.
>
> > The other problem is specification.  While you think that the code you
> > wrote is totally obvious it might not actually be so.  For instance, what
> > about this struct:
> >
> > {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}
> >
> > What are the obviously correct accesses for various writes into this
> > struct?
>
> I really don't think that example matters. You should instead turn the
> question around, and look at the real code examples, make *those*
> generate good and obviously correct code, and then *after* you've done
> that, you start looking at the mixed case where people do odd things.

Well, it matters because it shows that simply using the declared type
isn't going to work in the end. What we instead need to do is
remember the underlying objects the bitfield packing algorithm uses.
It then simply becomes obvious how to implement the bitfield accesses.
I hope to get back to this for GCC 4.8.

> Quite frankly, the layout that makes *sense* for something like the
> above is not to pack them. You'd just do

Heh, but of course the ABI specifies they are supposed to be packed ...

In the end we all agree GCC does something nasty (and I would call
it a bug even), but any solution we find in GCC won't be backportable
to earlier releases so you have to deal with the GCC bug for quite
some time and devise workarounds in the kernel. You'll hit the
bug for all structure fields that share the largest aligned
machine word with a bitfield (thus the size depends on the alignment
of the full object, not that of the struct containing the bitfield
in case that struct is nested inside another more aligned one).
This situation should be easily(?) detectable with sparse.

Thanks,
Richard.

Richard Guenther

unread,
Feb 2, 2012, 4:50:02 AM2/2/12
to
On Wed, 1 Feb 2012, Linus Torvalds wrote:

> IT IS A GCC BUG.

I think you are wrong when you assume that we think it is not a bug.
Repeating it repeatedly in all-caps doesn't make the fix appear
faster though.

Richard.

James Courtier-Dutton

unread,
Feb 2, 2012, 6:20:01 AM2/2/12
to
On 1 February 2012 15:19, Jan Kara <ja...@suse.cz> wrote:
>  Hello,
>
>  we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures. Consider the following structure:
> struct x {
>    long a;
>    unsigned int b1;
>    unsigned int b2:1;
> };
>
> We have two processes P1 and P2 where P1 updates field b1 and P2 updates
> bitfield b2. The code GCC generates for b2 = 1 e.g. on ia64 is:
>   0:   09 00 21 40 00 21       [MMI]       adds r32=8,r32
>   6:   00 00 00 02 00 e0                   nop.m 0x0
>   c:   11 00 00 90                         mov r15=1;;
>  10:   0b 70 00 40 18 10       [MMI]       ld8 r14=[r32];;
>  16:   00 00 00 02 00 c0                   nop.m 0x0
>  1c:   f1 70 c0 47                         dep r14=r15,r14,32,1;;
>  20:   11 00 38 40 98 11       [MIB]       st8 [r32]=r14
>  26:   00 00 00 02 00 80                   nop.i 0x0
>  2c:   08 00 84 00                         br.ret.sptk.many b0;;
>
> Note that gcc used 64-bit read-modify-write cycle to update b2. Thus if P1
> races with P2, update of b1 can get lost. BTW: I've just checked on x86_64
> and there GCC uses 8-bit bitop to modify the bitfield.
>
> We actually spotted this race in practice in btrfs on structure
> fs/btrfs/ctree.h:struct btrfs_block_rsv where spinlock content got
> corrupted due to update of following bitfield and there seem to be other
> places in kernel where this could happen.
>
> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions.  The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."
>
> So it seems what C/GCC promises does not quite match with what kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...
>
>                                                                Honza
> --
> Jan Kara <ja...@suse.cz>
> SUSE Labs, CR

What is the recommended work around for this problem?

David Sterba

unread,
Feb 2, 2012, 6:20:02 AM2/2/12
to
On Wed, Feb 01, 2012 at 04:19:18PM +0100, Jan Kara wrote:
> We actually spotted this race in practice in btrfs on structure
> fs/btrfs/ctree.h:struct btrfs_block_rsv where spinlock content got
> corrupted due to update of following bitfield and there seem to be other
> places in kernel where this could happen.

Here's the list of structures where a bitfield is shared with spinlock
or atomic/kref within an 8B word, generated from 3.3-rc2:

spinlock+bitfield:

Struct: struct ak4113; Field: init
Struct: struct ak4114; Field: init
Struct: struct ak4117; Field: init
Struct: struct btrfs_block_rsv; Field: full
Struct: struct cm109_dev; Field: buzzer_pending
Struct: struct pch_udc_dev; Field: active
Struct: struct rds_iw_device; Field: dma_local_lkey
Struct: struct sierra_intf_private; Field: suspended
Struct: struct sm501_gpio; Field: registered
Struct: struct unix_sock; Field: gc_candidate
Struct: struct usb_anchor; Field: poisoned
Struct: struct usb_wwan_intf_private; Field: suspended

atomic/kref+bitfield:

Struct: struct dlm_lock_resource; Field: migration_pending
Struct: struct extent_map; Field: in_tree
Struct: struct kobject; Field: state_initialized
Struct: struct page; Field: inuse
Struct: struct rds_ib_connection; Field: i_flowctl
Struct: struct rds_iw_connection; Field: i_flowctl
Struct: struct sctp_transport; Field: dead
Struct: struct transaction_s; Field: t_synchronous_commit
Struct: struct xfs_ioend; Field: io_isasync


Not all listed structs are necessarily subject to the bug. There may be
another mechanism preventing concurrent access to the bitfield and
spinlock/atomic, or the bitfield is modified from a single cpu, or is
not used. But all of them need to be reviewed of course.


david

Ingo Molnar

unread,
Feb 2, 2012, 6:30:02 AM2/2/12
to

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> [...]
>
> And I realize that compiler people tend to think that loop
> hoisting etc is absolutely critical for performance, and some
> big hammer like "barrier()" makes a compiler person wince. You
> think it results in horrible code generation problems.
>
> It really doesn't. Loops are fairly unusual in the kernel to
> begin with, and the compiler barriers are a total non-issue.
> We have much more problems with the actual CPU barriers that
> can be *very* expensive on some architectures, and we work a
> lot at avoiding those and avoiding cacheline ping-pong issues
> etc.

Just to underline this point, if barriers caused optimization
problems when GCC builds the kernel then we'd expect to see
various code generation problems: for example the compiler would
not be able to cache things well enough and reorder them to make
the code faster and (often) more compact.

So to test that effect of Linus's claim I picked up a fairly
bleeding edge version of GCC:

gcc version 4.7.0 20120112 (Red Hat 4.7.0-0.6) (GCC)

and performed a test build of the kernel with the majority of
optimization barriers removed (using the v3.2 kernel, x86
defconfig, 64-bit, -O2 optimization level): 1600 barriers were
removed (!) and GCC's hands were thus freed to create more
optimal code [and a very broken kernel], if it could.

I compared the resulting kernel image to an unmodified kernel
image:

   text    data      bss       dec     hex  filename
9781555  982328  1118208  11882091  b54e6b  vmlinux.vanilla
9780972  982328  1118208  11881508  b54c24  vmlinux.no-barriers

So the barriers are costing us only a 0.006% size increase - 583
bytes on an almost 10 MB kernel image.

To put that into perspectve: a *single* inline function inlining
decision by the compiler has a larger effect than that. Just a
couple of days ago we uninlined a function, which had an order
of magnitude larger effect than this.

The other possible dimension would be the ordering of
instructions.

To test for that effect I disassembled the two kernel images and
performed a function by function, instruction by instruction
comparison of instruction ordering. The summary is that GCC was
able to remove only 86 instructions (0.005%) and reordered
around 2400 instructions (0.15%) - out of about 1,570,000
instructions.

Or, put differently, for the 1600 barriers in this particular
kernel build, that is about 1.5 instructions reordered and 0.05
instructions removed per barrier.

I also inspected the type of reordering: the overwhelming
majority of reordering happened within a jump-free basic block
of instructions and did not affect any loops.

Thus much of the effect of barriers in the kernel is only the crucial
effect that we want them to have: to make the code follow a
specific program order sequence - and in the process the
barrier()s cause only very, very small optimization quality side
effects.

So the numbers support Linus's claim: the kernel incurs very
small optimization-cost side effects from barriers.

Thanks,

Ingo

Richard Guenther

unread,
Feb 2, 2012, 6:30:03 AM2/2/12
to
To hit this bug the containing objects also have to be at least
8-byte aligned.

Richard.

Richard Guenther

unread,
Feb 2, 2012, 6:30:03 AM2/2/12
to
> What is the recommended work around for this problem?

The recommended work around is to re-layout your structures.

Richard.

Michael Matz

unread,
Feb 2, 2012, 8:50:01 AM2/2/12
to
Hi,

On Wed, 1 Feb 2012, Linus Torvalds wrote:

> But I also think that gcc is simply *buggy*, and has made them much
> nastier than they should be. What gcc *should* have done is to turn
> bitfield accesses into shift-and-masking of the underlying field as
> early as possible, and then do all optimizations at that level.

Indeed. And there's even work going on towards that (stalled in the
moment), but you know how it is, should, could, would.

> In fact, there is another gcc bug outstanding (48696) where I complain
> about absolutely horrible code generation, and that one was actually
> the exact same issue except in reverse: gcc wouldn't take the
> underlying size of the bitfield into account, and use the wrong
> (smaller) size for the access,

That's actually one example of why not everything is so totally obvious. We
really don't want to store into more bytes than "allowed" (and while we are
at it, extend this even to reading for volatile members). For some
definition of "allowed".

For the ia64 problem at hand it has to surely exclude non-adjacent
bitfield members. For PR48124 (writing beyond end of decl) it has to be
constrained to the size of a decl of such struct type (seems obvious, but
it's surprising how long you get away with some less strict rule, namely
only excluding everything after the next alignment border). Well, those are
all sensible restrictions. But for your optimization problem you have to
extend the allowed range a bit again (to at least contain all adjacent
bitfields), but not so much that you break the next guy screaming "but here
you obviously do a wild write".

For instance, given these three structs:

struct s1 {short a; short b; unsigned c:1;};
struct s2 {short a:16; short b:16; unsigned c:1;};
struct s3 {unsigned a:16; unsigned b:16; unsigned c:1;};

Are the writes to x.b allowed to touch x.a? And writing x.c? I think I
know what you will claim (no crossing writes, except for s3.a and s3.b can
be combined), but let's say I claim that there's no difference between all
three structures, and therefore there should be no difference in what's
allowed to be touched and what not. In particular the declared type is
not what matters in allowing certain accesses. So, if you want to combine
s3.a and s3.b (which is implied by your PR48696), you'll have to give me
also combining s2.a and s2.b. (I'm not so nasty to also want combining
s1.a and s1.b, because there are other deeper reasons why s1 and the rest
differ).

The point is, there are simply different opinions and point of views.
Using exclamation marks, bold typography and hyperbole doesn't make anyone
more correct nor does it make the problem simpler than it is.

> struct {long l:32; int i1:16; short s; int i2:1; char c:7; short
> s2:8; short s3;} dummy;
>
> int main(int argc, char **argv)
> {
> dummy.l = 1;
> dummy.i1 = 2;
>
> and then do a test-linearize (with "-m64" to show the difference
> between "long" and "int") I get
>
> t.c:2:48: error: dubious one-bit signed bitfield
> main:
> .L0x7f928e378010:
> <entry-point>
> load.64 %r1 <- 0[dummy]
> and.64 %r2 <- %r1, $0xffffffff00000000
> or.64 %r3 <- %r2, $1
> store.64 %r3 -> 0[dummy]

Here you store 8 bytes, touching ...

> load.32 %r4 <- 4[dummy]

... this int. Both correspond to members "long l:32; int i1:16;". They
don't have the same base type, hence per your own rule they shouldn't be
allowed to touch each other (later on you also talk about the definedness
of bit-fields only with int, which also hints that you think one had
better always use int containers). Dang, still thinking this is easy?

> And yes, look at how you can see how sparse mixes different access sizes
> for different fields. But that is damn well what the programmer asked
> for.
>
> If programmers write stupid code, the compiler might as well generate
> odd code.

But what exactly constitutes "stupid" code? Are you really unable to see
that we're exactly struggling to find rules to differ between "stupid" and
supposedly "non-stupid" code? What if I'm saying that anyone using
bitfields writes "stupid" code, and hence the compiler can as well
generate odd code?

> Actually, "int:96" isn't ok last I saw. Not in original C, at least.

GCC supports multiple languages. For C++ it's valid (well, you get a
warning, and you can't do much with that member, but there we are).

> Just out of morbid curiosity, what happens if you have totally
> *separate* variables that just happen to link together? IOW, something
> like
>
> static struct { unsigned bit:1; } onebit;
> static volatile int var;
>
> and they just *happen* to link next to each other (because they were
> declared next to each other) in the same 8-byte aligned block?

non-struct decls always work. It's only the bit-field expander that
willy-nilly invents access modes (for some targets), and sometimes block
moves have problems. Until about 2008 we could do out-of-range writes
with struct copies (which usually works fine when they are naturally
aligned), fixed with PR31309. A related problem which I think is still
there is PR36043 (also out-of-range struct read, when passing stuff in
registers).

And as said, PR48124 is an out-of-range write, but again involving
bit-fields.

You see, we actually have much more serious problems than just clobbering
memory in under-specified situations ;-)


Ciao,
Michael.

Jeff Garzik

unread,
Feb 2, 2012, 11:00:01 AM2/2/12
to
On 02/01/2012 02:09 PM, Linus Torvalds wrote:
> We have a lot of code, there's still a lot of situations left where
> bitfields are just really convenient.

Or even just s/convenient/ingrained habit/. As much as I try to avoid
bitfields, engineers writing vendor drivers love to lay out their
hardware structures using bitfields, leading to such crapola as

#ifdef little endian
a bunch of bitfields, LE arrangement
#else
bitfields, now in BE arrangement
#endif

This crops up again and again in drivers :/
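
For illustration, the pattern usually looks roughly like this (a made-up
descriptor: only the endian-bitfield macros and u32 are real kernel names,
the struct and its fields are invented):

#include <linux/types.h>
#include <asm/byteorder.h>

struct hw_desc {                        /* hypothetical DMA descriptor */
#if defined(__LITTLE_ENDIAN_BITFIELD)
        u32 len   : 16;
        u32 flags : 15;
        u32 own   : 1;
#elif defined(__BIG_ENDIAN_BITFIELD)
        u32 own   : 1;
        u32 flags : 15;
        u32 len   : 16;
#else
#error "adjust your <asm/byteorder.h> includes"
#endif
};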

Jeff

Aldy Hernandez

unread,
Feb 2, 2012, 11:10:02 AM2/2/12
to
Linus Torvalds <torv...@linux-foundation.org> writes:

> Seriously - is there any real argument *against* just using the base
> type as a hint for access size?

If I'm on the hook for attempting to fix this again, I'd also like to
know if there are any arguments against using the base type.

Michael Matz

unread,
Feb 2, 2012, 11:30:02 AM2/2/12
to
Hi,

On Thu, 2 Feb 2012, Aldy Hernandez wrote:

> > Seriously - is there any real argument *against* just using the base
> > type as a hint for access size?
>
> If I'm on the hook for attempting to fix this again, I'd also like to
> know if there are any arguments against using the base type.

Sure. Simplest example: struct s {int i:24;} __attribute__((packed)).

You must access only three bytes, no matter what. The basetype (int) is
four bytes.
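
(To make that concrete - a sketch of my own, assuming GCC's usual packed
layout:)

struct s { int i : 24; } __attribute__((packed));

struct s arr[4];        /* sizeof(struct s) == 3, so 12 bytes total,
                           members packed back to back */

void set_last(void)
{
        /* If this store were widened to the 4-byte base type, it would
         * also read and rewrite the byte just past the end of arr.
         * So the access must be limited to the 3 allocated bytes,
         * whatever the declared base type says. */
        arr[3].i = 1;
}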


Ciao,
Michael.

Linus Torvalds

unread,
Feb 2, 2012, 1:00:03 PM2/2/12
to
On Thu, Feb 2, 2012 at 8:28 AM, Michael Matz <ma...@suse.de> wrote:
>
> Sure.  Simplest example: struct s {int i:24;} __attribute__((packed)).
>
> You must access only three bytes, no matter what.  The basetype (int) is
> four bytes.

Ok, so here's a really *stupid* (but also really really simple) patch attached.

What it does is to simply change the meaning of the SLOW_BYTE_ACCESS thing.

What SLOW_BYTE_ACCESS currently means is not just that byte
accesses are slow: it means that *everything* but full-word accesses
is slow!

Which is (a) not what the name really implies, (b) not even the
historical meaning of that #define (why? the meaning of "full word"
has changed since: it used to be 32 bits, now it is 64 bits) and (c)
stupid, because that's not how hardware with slow byte accesses really
work anyway.

Finally, it's what causes gcc to do 64-bit accesses to bitfields that
aren't 64 bits in size.

So because of this, I propose gcc just change the rules for what
SLOW_BYTE_ACCESS means.

I propose to just change it to mean "accesses smaller than 32 bits are
slow". That's actually much closer to what the hardware definition
tends to be.

It doesn't fix the generic issue, it doesn't fix any C++11/C11 issues,
it doesn't really change anything, but what it *does* do is make for a
hell of a lot saner model. And it avoids bugs in practice.

NOTE! On at least some machines with SLOW_BYTE_ACCESS, accesses
smaller than a word cannot possibly be atomic anyway (well, not
without jumping through hoops), so the fact that we still extend to 32
bits and the problem of touching too much still exists with 'char' and
'short' variables that are in the same 32-bit word as a bitfield is
kind of unavoidable.

So this suggested patch doesn't really "fix" anything fundamental, but
it is (a) simple, (b) totally untested and (c) at least fixes *some*
cases.

For example, it might fix the 'sig_atomic_t' shadowing, since that is
usually 'int'. It obviously can never fix the issue with volatile
char/short, but at least it works around it for 'long'.
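
(A sketch of that sig_atomic_t case, with invented field names:)

#include <signal.h>

struct {
        volatile sig_atomic_t pending;  /* updated from a signal handler */
        unsigned int flag : 1;          /* updated from normal code */
} state;

void set_flag(void)
{
        /* With the old SLOW_BYTE_ACCESS rule this store could become a
         * 64-bit read-modify-write of the whole aligned word, silently
         * rewriting 'pending' and losing a concurrent signal-handler
         * update.  Limiting the access to 32 bits keeps 'pending' out
         * of the store. */
        state.flag = 1;
}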

In other words - I'm not trying to say that this patch really "fixes"
anything (except sig_atomic_t interaction, which really does get fixed
for the normal 'int' case). But what it does do is to limit the damage
of a bad situation.

And it is small, and should hopefully be easy to accept even for
stable versions of gcc.

So can we please do something like this for maintenance releases, and
consider the much more complex C++11/C11 issues to be a separate much
bigger issue for the future?

Because the current SLOW_BYTE_ACCESS meaning really is crap. The
*only* thing that architecture define seems to do is literally the
bitfield extract semantics, and it does that horribly horribly badly
as shown by the bug on both 64-bit POWER and Sparc. For no good reason
- both of those have perfectly fine word operations afaik.

I did not look at other architectures that define SLOW_BYTE_ACCESS,
but if they have a 32-bit integer type, I'm pretty sure they support
fast loads/stores of it.

Hacky? Yes. Pretty? No. But better than the disaster that is there
now? Hell yes.

Linus

PS. The only reason I hardcoded "32" was that I didn't know how to ask
the question "is this mode at least as wide as 'int'" in the proper gcc
way. So I'm really not suggesting you apply this patch as-is, I'm just
sending it out as a "hey, this is a pretty obvious way to work around
part of the problem, somebody who really knows what they are doing
should probably improve on it".
patch.diff

Paul E. McKenney

unread,
Feb 2, 2012, 1:50:01 PM2/2/12
to
On Wed, Feb 01, 2012 at 03:11:00PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> >
> > My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> > will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> > access-once increments can generate a single increment-memory instruction
> > on x86.
>
> I don't think that is a semantic issue.
>
> gcc could do it *today* with volatile accesses. It doesn't, because
> volatiles are scary and basically disables a lot of optimizations. Why
> would memory ordering be substantially different just because it has a
> different name?

I too would much prefer that gcc volatile worked more sanely.

But several people, including me, pushed on that and consistently got back
"the standard doesn't say we have to do that".

So I got together with the standards people and now there is something
(memory_order_relaxed atomics) that is specified to work the way we want
it to. Of course, it will likely be quite some time before it appears
in usable form in gcc, but probably quite a bit less time than we have
been pushing on the gcc folks about volatile.

> > New architectures might eventually might define things like atomic_inc()
> > in terms of C++11 atomics, but let's start with the straightforward stuff
> > as and if it makes sense.
>
> SMP-atomic or percpu atomic? Or both?

Only SMP-atomic.

> We need both variants in the kernel. If the compiler generates one of
> them for us, that doesn't really much help.

I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".

Thanx, Paul

Linus Torvalds

unread,
Feb 2, 2012, 2:10:01 PM2/2/12
to
On Thu, Feb 2, 2012 at 10:42 AM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
>>
>> SMP-atomic or percpu atomic? Or both?
>
> Only SMP-atomic.

And I assume that since the compiler does them, that would now make it
impossible for us to gather a list of all the 'lock' prefixes so that
we can undo them if it turns out that we are running on a UP machine.

When we do SMP operations, we don't just add a "lock" prefix to it. We do this:

#define LOCK_PREFIX_HERE \
".section .smp_locks,\"a\"\n" \
".balign 4\n" \
".long 671f - .\n" /* offset */ \
".previous\n" \
"671:"

#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
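
(For reference, a typical user of that macro is the atomic ops - roughly:)

static inline void atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter));
}

Every 'lock' emitted through LOCK_PREFIX records its address in the
.smp_locks section, which is what lets the kernel patch the prefixes out
at boot when it turns out to be running on a UP machine.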

and I'm sure you know that, but I'm not sure the gcc people realize
the kinds of games we play to make things work better.

Sure, "everything will be SMP" some day, but running UP kernels is
likely still going to remain a good idea in virtualized environments,
for example. And having to make it a compile-time option is *not* a
good idea.

So compiler intrinsics are often *worse* than doing it by hand for us
for all these kinds of reasons. They aren't generally geared towards
the very specialized needs that a kernel has.

Of course, maybe even user space would want some kind of way to
automatically strip 'lock' prefixes from a binary, so maybe the
compiler would have some kind of support like this too.

(And no, disassembling the binary in order to find lock prefixes is
*not* the answer, at least not for the kernel)

>> We need both variants in the kernel. If the compiler generates one of
>> them for us, that doesn't really much help.
>
> I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".

Most non-x86 cpu's would probably be better off treating them the same
as smp-atomics (load-locked + store-conditional), but right now we
have this insane generic infrastructure for having versions that are
irq-safe by disabling interrupts etc. Ugh. Mainly because nobody
really is willing to work on and fix up the 25 architectures that
really don't matter.

Linus

Paul E. McKenney

unread,
Feb 2, 2012, 2:40:02 PM2/2/12
to
So if the gcc guys want the Linux kernel to use their atomics, they
must give it some useful way of finding all the lock prefixes on x86.
Should be a fun conversation. ;-)

> >> We need both variants in the kernel. If the compiler generates one of
> >> them for us, that doesn't really much help.
> >
> > I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".
>
> Most non-x86 cpu's would probably be better off treating them the same
> as smp-atomics (load-locked + store-conditional), but right now we
> have this insane generic infrastructure for having versions that are
> irq-safe by disabling interrupts etc. Ugh. Mainly because nobody
> really is willing to work on and fix up the 25 architectures that
> really don't matter.

Again, fair point!

Thanx, Paul

DJ Delorie

unread,
Feb 3, 2012, 1:50:02 AM2/3/12
to

Jan Kara <ja...@suse.cz> writes:
> we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures. Consider the following structure:
> struct x {
> long a;
> unsigned int b1;
> unsigned int b2:1;
> };

If this structure were volatile, you could try
-fstrict-volatile-bitfields, which forces GCC to use the C type to
define the access width, instead of doing whatever it thinks is optimal.

Note: that flag is enabled by default for some targets already, most
notably ARM.
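
(A rough sketch of what the flag changes, using the structure from the
original report; the exact code generation of course depends on the
target:)

volatile struct x {
        long a;
        unsigned int b1;
        unsigned int b2 : 1;
} vx;

void set_b2(void)
{
        /* With -fstrict-volatile-bitfields the store is done with the
         * declared type's width (unsigned int, 32 bits) instead of
         * whatever mode the bitfield expander would otherwise pick. */
        vx.b2 = 1;
}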

Richard Guenther

unread,
Feb 3, 2012, 4:40:03 AM2/3/12
to
On Fri, 3 Feb 2012, DJ Delorie wrote:

>
> Jan Kara <ja...@suse.cz> writes:
> > we've spotted the following mismatch between what kernel folks expect
> > from a compiler and what GCC really does, resulting in memory corruption on
> > some architectures. Consider the following structure:
> > struct x {
> > long a;
> > unsigned int b1;
> > unsigned int b2:1;
> > };
>
> If this structure were volatile, you could try
> -fstrict-volatile-bitfields, which forces GCC to use the C type to
> define the access width, instead of doing whatever it thinks is optimal.
>
> Note: that flag is enabled by default for some targets already, most
> notably ARM.

Note that -fstrict-volatile-bitfields does not work for

volatile struct S {
int i : 1;
char c;
} s;
int main()
{
s.i = 1;
s.c = 2;
}

where it accesses s.i using SImode. -fstrict-volatile-bitfields
falls foul of all the games bitfield layout plays and the
irrelevance of the declared bitfield type (but maybe the
ARM ABI exactly specifies it that way).

So no, I would not recommend -fstrict-volatile-bitfields.

Richard.

Matthew Gretton-Dann

unread,
Feb 3, 2012, 5:10:02 AM2/3/12
to
Indeed the ARM ABI does - see Section 7.1.7.5 of the PCS available at:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042-/

In fact the example above is pretty much the same as that given in the ABI
docs, and it says that accessing s.i will also cause an access
to s.c, but not vice-versa.

Thanks,

Matt

--
Matthew Gretton-Dann
Principal Engineer, PD Software, ARM Ltd.

Andrew MacLeod

unread,
Feb 3, 2012, 11:40:02 AM2/3/12
to

>> And I assume that since the compiler does them, that would now make it
>> impossible for us to gather a list of all the 'lock' prefixes so that
>> we can undo them if it turns out that we are running on a UP machine.
>>
>> When we do SMP operations, we don't just add a "lock" prefix to it. We do this:
>>
>> #define LOCK_PREFIX_HERE \
>> ".section .smp_locks,\"a\"\n" \
>> ".balign 4\n" \
>> ".long 671f - .\n" /* offset */ \
>> ".previous\n" \
>> "671:"
>>
>> #define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
>>

I don't see why we can't do something similar when the compiler issues
a lock on an atomic operation. I would guess we'd want to put it under
some sort of flag control (something like -fatomic-lock-list) since
most applications aren't going to want that section. It certainly seems
plausible to me anyway.
>> and I'm sure you know that, but I'm not sure the gcc people realize
>> the kinds of games we play to make things work better.
>>
No, but someone just needs to tell us :-)

>>>> We need both variants in the kernel. If the compiler generates one of
>>>> them for us, that doesn't really much help.
>>> I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".
>> Most non-x86 cpu's would probably be better off treating them the same
>> as smp-atomics (load-locked + store-conditional), but right now we
>> have this insane generic infrastructure for having versions that are
>> irq-safe by disabling interrupts etc. Ugh. Mainly because nobody
>> really is willing to work on and fix up the 25 architectures that
>> really don't matter.
>
The atomic intrinsics were created for c++11 memory model compliance,
but I am certainly open to enhancements that would make them more
useful. I am planning some enhancements for 4.8 now, and it sounds
like you may have some suggestions...

Andrew

Linus Torvalds

unread,
Feb 3, 2012, 12:20:02 PM2/3/12
to
On Fri, Feb 3, 2012 at 8:38 AM, Andrew MacLeod <amac...@redhat.com> wrote:
>
> The atomic intrinsics were created for c++11  memory model compliance, but I
> am certainly open to enhancements that would make them more useful.   I am
> planning some enhancements for 4.8 now, and it sounds like you may have some
> suggestions...

So we have several atomics we use in the kernel, with the more common being

- add (and subtract) and cmpchg of both 'int' and 'long'

- add_return (add and return new value)

- special cases of the above:
dec_and_test (decrement and test result for zero)
inc_and_test (increment and test result for zero)
add_negative (add and check if result is negative)

The special cases are because older x86 cannot do the generic
"add_return" efficiently - it needs xadd - but can do atomic versions
that test the end result and give zero or sign information.

- atomic_add_unless() - basically an optimized cmpxchg.

- atomic bit array operations (bit set, clear, set-and-test,
clear-and-test). We do them on "unsigned long" exclusively, and in
fact we do them on arrays of unsigned long, ie we have the whole "bts
reg,mem" semantics. I'm not sure we really care about the atomic
versions for the arrays, so it's possible we only really care about a
single long.

The only complication with the bit setting is that we have a
concept of "set/clear bit with memory barrier before or after the bit"
(for locking). We don't do the whole release/acquire thing, though.

- compare_xchg_double

We also do byte/word atomic increments and decrements, but that's in
the x86 spinlock implementation, so it's not a generic need.

We also do the add version in particular as CPU-local optimizations
that do not need to be SMP-safe, but do need to be interrupt-safe. On
x86, this is just an r-m-w op, on most other architectures it ends up
being the usual load-locked/store-conditional.

I think that's pretty much it, but maybe I'm missing something.

Of course, locking itself tends to be special cases of the above with
extra memory barriers, but it's usually hidden in asm for other
reasons (the bit-op + barrier being a special case).

Linus

Andrew MacLeod

unread,
Feb 3, 2012, 2:20:02 PM2/3/12
to
On 02/03/2012 12:16 PM, Linus Torvalds wrote:
>
> So we have several atomics we use in the kernel, with the more common being
>
> - add (and subtract) and cmpchg of both 'int' and 'long'
This would be __atomic_fetch_add, __atomic_fetch_sub, and
__atomic_compare_exchange.

For 4.8 __atomic_compare_exchange is planned to be better optimized than
it is now... ie, it currently uses the same form as c++ requires:

atomic_compare_exchange (&var, &expected, value, weak/strong,
memorymodel).

'expected' is updated in place with the current value if it doesn't
match. With the address of expected taken, we don't always do a good job
generating code for it... I plan to remedy that in 4.8 so that it is
efficient and doesn't impact optimization of 'expected' elsewhere.
> - add_return (add and return new value)
__atomic_add_fetch returns the new value. (__atomic_fetch_add returns
the old value). If it isn't as efficient as it needs to be, the RTL
pattern can be fixed. What sequence do you currently use for this?
The compiler currently generates the equivalent of

lock; xadd
add

ie, it performs the atomic add then re-adds the same value to the
previous value to get the atomic post-add value. If there is something
more efficient, we ought to be able to do the same.
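
(For readers following along, a minimal usage sketch of the two builtins:)

#include <stdio.h>

static int counter;

int main(void)
{
        int new_val = __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST); /* 1 */
        int old_val = __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST); /* 1 */

        printf("%d %d %d\n", new_val, old_val, counter);  /* prints "1 1 2" */
        return 0;
}
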
> - special cases of the above:
> dec_and_test (decrement and test result for zero)
> inc_and_test (decrement and test result for zero)
> add_negative (add and check if result is negative)
>
> The special cases are because older x86 cannot do the generic
> "add_return" efficiently - it needs xadd - but can do atomic versions
> that test the end result and give zero or sign information.
Since these are older x86 only, could you use add_return() always and
then have the compiler use new peephole optimizations to detect those
usage patterns and change the instruction sequence for x86 when
required? Would that be acceptable? Or maybe you don't trust the
compiler :-) Or maybe I can innocently ask if the performance impact
on older x86 matters enough any more? :-)
> - atomic_add_unless() - basically an optimized cmpxchg.

is this the reverse of a compare_exchange and add? Ie, add if the value
ISN'T expected? or some form of compare_exchange_and_add? This
might require a new atomic builtin.

What exactly does it do?

> - atomic bit array operations (bit set, clear, set-and-test,
> clear-and-test). We do them on "unsigned long" exclusively, and in
> fact we do them on arrays of unsigned long, ie we have the whole "bts
> reg,mem" semantics. I'm not sure we really care about the atomic
> versions for the arrays, so it's possible we only really care about a
> single long.
>
> The only complication with the bit setting is that we have a
> concept of "set/clear bit with memory barrier before or after the bit"
> (for locking). We don't do the whole release/acquire thing, though.
Are these functions wrappers to a tight load, mask, cmpxchg loop? or
something else? These could also require new built-ins if they can't be
constructed from the existing operations...
>
> - compare_xchg_double
>
> We also do byte/word atomic increments and decrements, but that's in
> the x86 spinlock implementation, so it's not a generic need.
The existing __atomic builtins will work on 1,2,4,8 or 16 byte values
regardless of type, as long as the hardware supports those sizes. so
x86-64 can do a 16 byte cmpxchg.

In theory, the add_fetch and sub_fetch are supposed to use INC/DEC if the
operand is 1/-1 and the result isn't used. If it isn't doing this right
now, I will fix it.
> We also do the add version in particular as CPU-local optimizations
> that do not need to be SMP-safe, but do need to be interrupt-safe. On
> x86, this is just an r-m-w op, on most other architectures it ends up
> being the usual load-locked/store-conditional.
>
It may be possible to add modifier extensions to the memory model
component for such a thing. ie

v = __atomic_add_fetch (&v, __ATOMIC_RELAXED | __ATOMIC_CPU_LOCAL)

which will allow fine tuning for something more specific like this.
Targets which dont care can ignore it, but x86 could have atomic_add
avoid the lock when the CPU_LOCAL modifier flag is present.

> I think that's pretty much it, but maybe I'm missing something.
>
> Of course, locking itself tends to be special cases of the above with
> extra memory barriers, but it's usually hidden in asm for other
> reasons (the bit-op + barrier being a special case).

All of the __atomic operations are currently optimization barriers in
both directions, the optimizers tend to treat them like function calls.
I hope to enable some sorts of optimizations eventually, especially
based on memory model... but for now we play it safe.

Synchronization barriers are inserted based on the memory model used.
If it can be determined that something additional is required that the
existing memory model doesn't cover, it could be possible to add
extensions beyond the c++11 memory model.. ( ie add new
__ATOMIC_OTHER_BARRIER_KIND models)

Andrew

Linus Torvalds

unread,
Feb 3, 2012, 3:10:01 PM2/3/12
to
On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod <amac...@redhat.com> wrote:
>>    The special cases are because older x86 cannot do the generic
>> "add_return" efficiently - it needs xadd - but can do atomic versions
>> that test the end result and give zero or sign information.
>
> Since these are older x86 only, could you use add_return() always and then
> have the compiler use new peephole optimizations to detect those usage
> patterns and change the instruction sequence for x86 when required?  would
> that be acceptable?    Or maybe you don't trust the compiler :-)   Or maybe
> I can innocently ask if the performance impact on older x86 matters enough
> any more? :-)

Since xadd is often slow even when it does exist, the peepholes would
definitely be required. A "dec_and_test" does need to be

decl %m (or "subl $1" for x86-64)
je

rather than

xadd %r,%m
cmp $1,%r
je

simply because the latter is not just larger, but often quite a bit
slower (newer CPU's seem to be better at xadd).

But with the peepholes, by the time we could depend on gcc doing the
intrinsics, we'd almost certainly be ready to let the old pre-xadd
machines just go.

We're already almost there, and by the time we can trust the compiler
to do it for us, we're almost certainly going to be at a point where
we really don't care any more about pre-xadd (but we will still care
about older machines where xadd is slow).

>>  - atomic_add_unless() - basically an optimized cmpxchg.
>
> is this the reverse of a compare_exchange and add?  Ie, add if the value
> ISN'T expected?   or some form of compare_exchange_and_add?    This might
> require a new atomic builltin.
>
> What exactly does it do?

It's basically

do {
load-link %r,%m
if (r == value)
return 0;
add
} while (store-conditional %r,%m)
return 1;

and it is used to implement two *very* common (and critical)
reference-counting use cases:

- decrement ref-count locklessly, and if it hits free, take a lock
atomically (ie "it would be a bug for anybody to ever see it decrement
to zero while holding the lock")

- increment ref-count locklessly, but if we race with the final free,
don't touch it, because it's gone (ie "if somebody decremented it to
zero while holding the lock, we absolutely cannot increment it again")

They may sound odd, but those two operations are absolutely critical
for most lock-less refcount management. And taking locks for reference
counting is absolutely death to performance, and is often impossible
anyway (the lock may be a local lock that is *inside* the structure
that is being reference-counted, so if the refcount is zero, then you
cannot even look at the lock!)

In the above example, the load-locked -> store-conditional would
obviously be a cmpxchg loop instead on architectures that do cmpxchg
instead of ll/sc.
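
(Roughly like this, as a sketch built on the kernel's atomic_read() and
atomic_cmpxchg() helpers:)

static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
        int c = atomic_read(v);

        while (c != u) {
                int old = atomic_cmpxchg(v, c, c + a);

                if (old == c)
                        return 1;       /* we did the add */
                c = old;                /* raced; retry with the fresh value */
        }
        return 0;                       /* value was 'u', nothing done */
}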

Of course, if you expose some intrinsic for the whole "ll/sc" model
(and you then turn it into cmpxchg on demand), we could literally
open-code it.

That is likely the most *flexible* approach for a compiler. I think
pretty much everything the kernel needs (except for cmpxchg_double)
can be very naturally written as a "ll/sc" sequence, and if the
compiler then just does the right thing with peephole optimizations,
that's fine.

IOW, we don't really need "atomic_add()" or anything like that. If we can do

do {
val = __load_linked(mem);
val++;
} while (__store_conditional(val, mem));

and the compiler just automagically turns that into "lock add" on x86,
that's perfectly fine.

It might not be too hard, because you really don't need to recognize
very many patterns, and any pattern you don't recognize could be
turned into a cmpxchg loop.

NOTE NOTE NOTE! The "turned into a cmpxchg loop" is not the true
correct translation of load-linked/store-conditional, since it allows
the memory to be modified as long as it's modified *back* before the
store-conditional, and that actually matters for a few algorithms. But
if you document the fact that it's not a "true" ll/sc (and maybe have
some compile-time way to detect when it isn't), it would be very
flexible.

Of course, the syntax could be something completely different. Maybe
you'd want to do it as

__builtin_ll_sc(&var, update-expression, return-expression,
failure-expression)

rather than an explicit loop.

But it doesn't sound like the internal gcc model is based on some
generic ll/sc model.

>>  - atomic bit array operations (bit set, clear, set-and-test,
>> clear-and-test). We do them on "unsigned long" exclusively, and in
>> fact we do them on arrays of unsigned long, ie we have the whole "bts
>> reg,mem" semantics. I'm not sure we really care about the atomic
>> versions for the arrays, so it's possible we only really care about a
>> single long.
>>
>>    The only complication with the bit setting is that we have a
>> concept of "set/clear bit with memory barrier before or after the bit"
>> (for locking). We don't do the whole release/acquire thing, though.
>
> Are these functions wrappers to a  tight load, mask, cmpxchg loop? or
> something else?  These could also require new built-ins if they can't be
> constructed from the existing operations...

Not usually, no. Those things tend to be used for locking or setting
state, so we'd commonly want them to result in

lock orl $128,(%mem)

or for the testing case they would become

lock btsl $0,(%mem)
jne .. failed ..

and a cmpxchg loop would generally be much too expensive.
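
(The wrappers are essentially one-instruction asm statements - a rough
sketch of set_bit(), assuming the LOCK_PREFIX macro from earlier in the
thread:)

static inline void set_bit(long nr, volatile unsigned long *addr)
{
        asm volatile(LOCK_PREFIX "bts %1,%0"
                     : "+m" (*addr)
                     : "Ir" (nr)
                     : "memory");
}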

I realize that people have bad memories of the x86 bit instructions,
but especially in their locked form, the fact that they take a few
extra cycles or decode in only one pipeline etc is *not* relevant.
They are small and "fast", because the true cost tends to be not the
instruction cost, but the locking overhead and the cache effects.

And a "cmpxchg" loop tends to be *very* expensive for two reasons:

- mispredicted backwards jump (most cpu's think it's a loop, so
without prediction information they do the wrong thing)

- load+cmpxchg does the wrong thing for a cache miss (the load will
result in the cacheline being fetched for read)

both of which tend to be totally invisible if you do the obvious
microbenchmark, making people sometimes mistakenly believe that
"cmpxchg" is a good way to implement things.

>>  - compare_xchg_double
>>
>> We also do byte/word atomic increments and decrements, but that's in
>> the x86 spinlock implementation, so it's not a generic need.
>
> The existing __atomic builtins will work on 1,2,4,8 or 16 byte values
> regardless of type, as long as the hardware supports those sizes.  so x86-64
> can do a 16 byte cmpxchg.

Ok.

> In theory, the add_fetch and sub_fetch are suppose to use INC/DEC if the
> operand is 1/-1 and the result isn't used.  If it isnt doing this right now,
> I will fix it.

Yeah, we definitely want that fixed to inc/dec or regular add/sub.
xadd really isn't good, even outside the "some CPU's don't support
it".

> It may be possible to add modifier extensions to the memory model component
> for such a thing. ie
>
> v = __atomic_add_fetch (&v, __ATOMIC_RELAXED | __ATOMIC_CPU_LOCAL)

Ok, that sounds clean.

> All of the __atomic operations are currently optimization barriers in both
> directions, the optimizers tend to treat them like function calls.  I hope
> to enable some sorts of optimizations eventually, especially based on memory
> model... but for now we play it safe.

That's fine. I really don't worry about compiler barriers, and the use
of __ATOMIC_ACQUIRE and __ATOMIC_RELEASE will actually allow us to do
better than we do now, if the compiler just implements them correctly
on all architectures.

Right now we have special (and very ugly) extra conditional
architecture-specific full memory barriers ("smp_mb__after_bitop()"
etc) that are complete hacks, but generate "good enough" code (which
on x86 means "generate nothing", since the atomic bitops are already
barriers on their own).
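
(The usage pattern, with placeholder bit/object names - the real helpers
are spelled smp_mb__before_clear_bit()/smp_mb__after_clear_bit() and the
like:)

clear_bit(SOME_FLAG_BIT, &obj->flags);
smp_mb__after_clear_bit();      /* expands to barrier() on x86, since the
                                   locked bitop already orders; a real
                                   mb() on weaker architectures */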

Having access to __ATOMIC_ACQUIRE would actually be an improvement -
it's just that the architectures that really care about things like
that simply don't matter enough for us to really worry about them
(ia64 being the prime example), and everybody else either doesn't need
extra barriers at all (x86), or might as well use a full barrier (most
everything else).

Linus

Paul E. McKenney

unread,
Feb 3, 2012, 3:20:01 PM2/3/12
to
On Fri, Feb 03, 2012 at 12:00:03PM -0800, Linus Torvalds wrote:
> On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod <amac...@redhat.com> wrote:

[ . . . ]

> Having access to __ATOMIC_ACQUIRE would actually be an improvement -
> it's just that the architectures that really care about things like
> that simply don't matter enough for us to really worry about them
> (ia64 being the prime example), and everybody else either doesn't need
> extra barriers at all (x86), or might as well use a full barrier (most
> everything else).

I believe that version 8 of the ARM architecture will provide load-acquire
and store-release instructions.

Thanx, Paul

Torvald Riegel

unread,
Feb 6, 2012, 10:40:02 AM2/6/12
to
No, and I don't think it's beneficial overall to do this. Sure, an
LL/SC or CAS loop is universal, but in turn programmers would have to
make sure that they hit the patterns that the compiler can actually
recognize and turn into the more efficient forms.

The custom atomic operations also provide different progress guarantees.
While a single CAS/cmpxchg is wait-free, the full loop isn't
necessarily. Same for the bit operations. So, I think it makes sense
to offer them separately. The split between weak- and strong-progress
compare-and-exchange in C++11 is related.

> I realize that people have bad memories of the x86 bit instructions,
> but especially in their locked form, the fact that they take a few
> extra cycles or decode in only one pipeline etc is *not* relevant.
> They are small and "fast", because the true cost tends to be not the
> instruction cost, but the locking overhead and the cache effects.

And the semantics of the operation is known immediately (without trying
to recover the actual atomic op from some surrounding cmpxchg loop).
That allows potential optimizations like combining (but I'm not a HW
expert, so I don't know whether HW actually does this internally).


Torvald

Richard Henderson

unread,
Feb 10, 2012, 2:40:02 PM2/10/12
to
We can't expose the ll/sc model directly, due to various target-specific
hardware constraints (e.g. no other memory references, i.e. no spills).
But the "weak" compare-exchange is sorta-kinda close.

A "weak" compare-exchange is a ll/sc pair, without the loop for restart
on sc failure. Thus a weak compare-exchange can fail for arbitrary reasons.
You do, however, get back the current value in the memory, so you can loop
back yourself and try again.

So that loop above becomes the familiar expression as for CAS:

r = *m;
do {
if (r == value)
return;
n = r + inc;
} while (!atomic_compare_exchange(m, &r, n, /*weak=*/true,
__ATOMIC_RELAXED, __ATOMIC_RELAXED));

which, unlike with the current __sync_val_compare_and_swap, does not
result in a nested loop for the ll/sc architectures.
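
(Spelled with the actual GCC builtin names, that is roughly:)

void add_unless(int *m, int inc, int value)
{
        int r = __atomic_load_n(m, __ATOMIC_RELAXED);

        do {
                if (r == value)
                        return;
                /* on failure, r is refreshed with the current *m */
        } while (!__atomic_compare_exchange_n(m, &r, r + inc, /*weak=*/1,
                                              __ATOMIC_RELAXED,
                                              __ATOMIC_RELAXED));
}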

Yes, the ll/sc architectures can technically do slightly better by writing
this as inline asm, but the gain is significantly less than before. Given
that the line must be in L1 cache already, the redundant load from M in
the ll insn should be nearly negligible.

Of course for architectures that implement cas, there is no difference
between "weak" and "strong".


r~