On Wednesday, 2 January 2019 16:42:30 -02 Florian Weimer wrote:
> * Thiago Macieira:
> > On Wednesday, 2 January 2019 14:10:00 -02 Florian Weimer wrote:
> >> We should really avoid any solution which makes __atomic built-ins or
> >> <stdatomic.h> unusable for performance reasons. It would force
> >> programmers to go back to inline assembly, and we absolutely and
> >> categorically do not want that.
> >
> > Agreed, but why would there be a performance reason for correct code?
> > The only case I can think of is when the compiler doesn't know that a
> > type will be properly aligned by external means. We ought to teach
> > people to add the necessary alignas() or __builtin_assume_aligned() so
> > the compiler will know the detail and thus generate the optimal code.
>
> On i386, it's unclear what kind of optimizations a compiler can perform
> for non-naturally-aligned types, particularly with -fno-strict-aliasing.
I don't see how that affects anything. If the library functions being called
can accept unaligned pointers for a given size, then the functions must have a
codepath that uses mutexes. Whether they implement an additional check for
lock-free atomics if the pointers align is a QoI question. Since it's just
three instructions, two of them macro-fusing (CMP+JBE), I expect it to be a
clear choice for implementations.
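
To make the "three instructions" concrete, here is a minimal sketch of the boundary check as I read it (AND to get the offset in the cacheline, then CMP+JBE against the last offset that still fits). The 64-byte line size and the names kCacheLineSize, fits_in_one_cacheline, LoadPath and choose_path are assumptions made for illustration, not anything a library actually exports:

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on current x86 processors.
constexpr std::uintptr_t kCacheLineSize = 64;

// Returns true when the object occupying [addr, addr + size) lies entirely
// within one cache line. Requires 0 < size <= kCacheLineSize.
constexpr bool fits_in_one_cacheline(std::uintptr_t addr, std::size_t size)
{
    return (addr & (kCacheLineSize - 1)) <= kCacheLineSize - size;
}

// Hypothetical dispatch a library function could perform on entry.
enum class LoadPath { LockFree, Mutex };

constexpr LoadPath choose_path(std::uintptr_t addr, std::size_t size)
{
    return fits_in_one_cacheline(addr, size) ? LoadPath::LockFree
                                             : LoadPath::Mutex;
}
```

A given address always takes the same path, so the lock-free and mutex paths never have to interoperate on the same object.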
Anyway, let's take this struct for the remainder of the discussion:
struct S { short s1, s2; };
// sizeof(S) == 4
// alignof(S) == 2
> >> Would the above rule require to call into the library for an atomic load
> >> of an int variable which is potentially not 4-byte-aligned?
> >
> > For a 4-byte sized variable that is *under-aligned*, yes. For it to be
> > properly atomic, even if straddling two cachelines, it requires an
> > external mutex. The compiler could inline the boundary detection if it
> > wanted to (it's three instructions), or not. It would be up to the
> > implementation.
This was an example of that struct S above: it's 4 bytes but under-aligned for
atomicity. The compiler can't guarantee that it is properly aligned to use MOV
loads and stores. That means it must generate a call to the unaligned, mutex-
locking library functions that load and store, with the optional alignment
detection code.
But if the code was:
alignas(int) S data;
Then the compiler does know it's sufficiently aligned and can use direct MOV.
*Provided* that the library implementation matches the behaviour if it got
called.
Similarly for
S *ptr;
auto p = __builtin_assume_aligned(ptr, 4);
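
Put together, the two ways of telling the compiler about the alignment could look like the sketch below. It assumes a GCC-compatible compiler (for __builtin_assume_aligned) and the usual x86 ABIs where short is 2 bytes and int is 4-byte aligned. The helper names as_aligned and is_aligned_to are made up for this example:

```cpp
#include <cstddef>
#include <cstdint>

struct S { short s1, s2; };

// Assumption: short is 2 bytes, as on the usual x86 ABIs.
static_assert(sizeof(S) == 4, "S is expected to be 4 bytes");
static_assert(alignof(S) == 2, "S is expected to be 2-byte aligned");

// Option 1: over-align the object itself.
alignas(int) S data;

// Option 2: promise the compiler that a pointer is 4-byte aligned.
// __builtin_assume_aligned returns void *, so it must be cast back.
// If the promise is false, the behaviour is undefined.
inline S *as_aligned(S *ptr)
{
    return static_cast<S *>(__builtin_assume_aligned(ptr, 4));
}

// Runtime check, only used here to verify the promise in a test.
inline bool is_aligned_to(const void *p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Note that neither option changes alignof(S) itself; the extra alignment is a property of that one object or pointer.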
> > Ditto for __attribute__((packed)) int: if someone went through the trouble
> > of writing the attribute, they meant for the compiler to use a mutex.
> >
> > For an *int*, it's UB to be under-aligned.
>
> So if you have a struct with __attribute__ ((packed)) and take an
> address of an int member, you expect it to get a type of int
> __attribute__ ((aligned (1))) *? I don't think this is how GCC works
> today. Instead, there is -Waddress-of-packed-member. See this bug:
>
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51628>
>
> There hasn't been much movement to change the GCC type system to track
> pointer alignment, if I read the bug correctly.
Yes, I'd expect either the compiler to be able to track, or the developer to
be forced to track, the alignment/packing of pointers extracted from packed
structures.
x86 may not need it, but other architectures do. I think it's a reasonable
thing to ask.
And if people don't, we have UB that would only trigger on crossing a
cacheline. Maybe the code in question knows that won't happen -- it may be a
32-bit value 3 bytes into a file, which was mmap()ed and therefore known to
never cross a cacheline boundary.
> I'm worried the run-time check is the only realistic way to implement
> what you want, given this constraint. And I really don't think we
> should emit a run-time check.
What's your suggestion? That we always lock? Or that we never do?
> > One thing comes to mind: complex atomics. Should their alignment be
> > doubled too? Like _Atomic _Complex short which is 4 bytes also getting
> > aligned to 4 bytes, thus avoiding the under-alignment I mentioned above.
> >
> > And if we go this way, should we also do the same for all custom
> > types, like std::pair<short,short>?
>
> We cannot do this for std::pair<short, short> because that would be an
> ABI event. I think it would be possible for
> std::atomic<std::pair<short, short>>, but at that point, we should start
> talking to C++ people. 8-)
I meant that. And for me, std::atomic<S> should behave the same as and have
the same size and alignment as _Atomic S. In fact, libc++ implements
std::atomic<S> by way of _Atomic S.
> > How about _Atomic _Complex double going to 16-byte alignment? That can be
> > atomically loaded even on 32-bit using MOVPD,
Note: I meant MOVAPD.
> I've been told that this is not true; there are SSE2 implementations
> which tear loads and stores. This is different from loading and storing
> 8-byte doubles with the FPU.
Well, then it's probably not worth it. We could runtime-detect the CPU and
match -march= settings, but that's probably not worth the cost.
One more thing: what about interoperability with a 64-bit application running
on the same CPU? For what primitive types and for what user types do we
guarantee true atomic (lock-free) operations?