Performance of unaligned memory-accesses

Bonita Montero

Aug 7, 2019, 1:13:33 AM

I just wrote a little test that checks the performance of unaligned
memory accesses on x86 / Win32. I ran this code on my Ryzen 1800X:

#pragma warning(disable: 6387)
#pragma warning(disable: 6001)
#pragma warning(disable: 4244)

#include <Windows.h>
#include <iostream>
#include <cstdint>
#include <cstring>
#include <algorithm>
#include <intrin.h>

using namespace std;

// A uint32_t stored as four single bytes (little-endian), so it may sit at
// any address without alignment requirements.
struct unaligned_uint32
{
    uint8_t u[sizeof(uint32_t)];
    operator uint32_t();
    unaligned_uint32 &operator = ( uint32_t ui );
};

inline
unaligned_uint32::operator uint32_t()
{
    return ((uint32_t)u[0] | (uint32_t)u[1] << 8) |
           ((uint32_t)u[2] << 16 | (uint32_t)u[3] << 24);
}

inline
unaligned_uint32 &unaligned_uint32::operator = ( uint32_t ui )
{
    // Shift right so that each byte of ui lands in the low 8 bits.
    u[0] = (uint8_t)ui;
    u[1] = (uint8_t)(ui >> 8);
    u[2] = (uint8_t)(ui >> 16);
    u[3] = (uint8_t)(ui >> 24);
    return *this;
}


template<typename TUI32>
void memkill( TUI32 *m, size_t elems );

int main()
{
    size_t const SIZE = (size_t)1024 * 1024;
    size_t const ELEMS = SIZE / sizeof(uint32_t);
    unsigned const ITERATIONS = 10'000;
    char *m;
    LONGLONG llFreq;
    double freq;
    LONGLONG start, end;
    double seconds;

    m = (char *)VirtualAlloc( nullptr, SIZE, MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE );
    memset( m, 0, SIZE );

    QueryPerformanceFrequency( &(LARGE_INTEGER &)llFreq );
    freq = llFreq;

    QueryPerformanceCounter( &(LARGE_INTEGER &)start );
    for( unsigned i = 0; i != ITERATIONS; i++ )
        memkill( (uint32_t *)m, ELEMS );
    QueryPerformanceCounter( &(LARGE_INTEGER &)end );
    seconds = (end - start) / freq;
    cout << "aligned native: " << seconds << endl;

    QueryPerformanceCounter( &(LARGE_INTEGER &)start );
    for( unsigned i = 0; i != ITERATIONS; i++ )
        memkill( (uint32_t *)(m + 1), ELEMS - 1 );
    QueryPerformanceCounter( &(LARGE_INTEGER &)end );
    seconds = (end - start) / freq;
    cout << "unaligned native: " << seconds << endl;

    QueryPerformanceCounter( &(LARGE_INTEGER &)start );
    for( unsigned i = 0; i != ITERATIONS; i++ )
        memkill( (unaligned_uint32 *)(m + 1), ELEMS - 1 );
    QueryPerformanceCounter( &(LARGE_INTEGER &)end );
    seconds = (end - start) / freq;
    cout << "unaligned wrapped: " << seconds << endl;
}

template<typename TUI32>
void memkill( TUI32 *m, size_t elems )
{
    for_each( m, m + elems, []( TUI32 &e ) { e = (uint32_t)e + 1; } );
}

As you can see, the code also tests accessing unaligned memory through a
wrapper that assembles the value by manually shifting the bytes.
Here are the results of my 1800X:

aligned native: 0.244328
unaligned native: 0.437457
unaligned wrapped: 2.12482

I was very surprised that unaligned memory-access is less than twice as
slow on my PC.
It would be nice to see results from Intel-CPUs here. Thanks in advance.
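
The "wrapped" variant builds each 32-bit value from single bytes. A minimal standalone sketch of that technique, assuming little-endian byte order in memory; the helper names load_le32 and store_le32 are hypothetical and not part of the test program. Note that the store has to shift right to bring each byte into the low 8 bits.

```cpp
#include <cstdint>

// Assemble a 32-bit value from four bytes at any address (little-endian).
inline std::uint32_t load_le32(const unsigned char *p)
{
    return  (std::uint32_t)p[0]
         | ((std::uint32_t)p[1] << 8)
         | ((std::uint32_t)p[2] << 16)
         | ((std::uint32_t)p[3] << 24);
}

// Split a 32-bit value into four bytes at any address (little-endian).
inline void store_le32(unsigned char *p, std::uint32_t v)
{
    p[0] = (unsigned char)(v);
    p[1] = (unsigned char)(v >> 8);
    p[2] = (unsigned char)(v >> 16);
    p[3] = (unsigned char)(v >> 24);
}
```

Because only byte accesses appear in the source, this has defined behaviour at any address; whether it turns into a single load or store is up to the compiler.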

Chris M. Thomasson

Aug 7, 2019, 2:20:12 AM

Try using unaligned addresses with several threads. Try doing a LOCK
XADD on a location that straddles two cache lines, and is not aligned on
a line, vs one that is aligned on a cache line, and properly padded.
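
A minimal sketch of the two layouts being compared, assuming 64-byte cache lines (an assumption, not something queried from the CPU); straddles_line and PaddedCounter are hypothetical names. It only shows how to tell the cases apart and how to get the aligned, padded one; it does not perform the misaligned locked access itself.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on current x86 desktop parts.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [addr, addr + size) crosses a cache-line boundary.
// size must be non-zero.
constexpr bool straddles_line(std::uintptr_t addr, std::size_t size)
{
    return addr / kCacheLine != (addr + size - 1) / kCacheLine;
}

// alignas rounds the struct up to a full line, so the counter starts on a
// line boundary and nothing else shares that line.
struct alignas(kCacheLine) PaddedCounter
{
    std::atomic<std::uint32_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "fills exactly one line");
static_assert(!straddles_line(0, sizeof(std::uint32_t)), "aligned case");
static_assert(straddles_line(62, sizeof(std::uint32_t)), "bytes 62..65 cross");
```

A fetch_add on PaddedCounter::value is the aligned case; a 4-byte operand at offset 62 of a line is the straddling case.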

Bonita Montero

Aug 7, 2019, 2:57:14 AM

> Try using unaligned addresses with several threads.

That's not relevant to me because I only wanted to measure the
cost of an unaligned access.

> Try doing a LOCK XADD on a location that straddles two cache
> lines, ..

That's also not relevant, and not only to me: no one would do that
in practice because it serves no purpose.

Chris M. Thomasson

Aug 7, 2019, 3:22:24 AM

Okay, I was just thinking out loud. Sorry. FWIW, using LOCK XADD on an
unaligned address that straddles a cache line can invoke a bus lock!

Chris M. Thomasson

Aug 7, 2019, 3:24:37 AM

On 8/6/2019 11:57 PM, Bonita Montero wrote:
Actually, it is of some use. It can trigger a bus lock. So, one can use
it for a forced quiescence period in an RCU algorithm. Actually, Windows
has a nice way to do this without totally abusing x86/64:

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

Bonita Montero

Aug 7, 2019, 3:26:08 AM

> Fwiw, using LOCK XADD on unaligned address that
> straddles a cache line can invoke a Bus lock!

It's logical that Intel implemented unaligned loads and stores
for backward compatibility. And these loads / stores are useful
for efficiently modifying data structures for persistence or
transmission over the network. So this is clearly a unique
advantage of the Intel architecture.
But what would an unaligned LOCK * be useful for? I can see
no sense in that.

Chris M. Thomasson

Aug 7, 2019, 3:33:34 AM

To trigger a full blown bus lock, to be used as a quiescence point in
user space RCU. There is an older write up on this, back in 2010:

https://blogs.oracle.com/dave/qpi-quiescence

I have implemented RCU in userspace before.

Chris M. Thomasson

Aug 7, 2019, 3:35:45 AM

On 8/7/2019 12:26 AM, Bonita Montero wrote:
There is a whole class of exotic asymmetric synchronization algorithms
that can use this. Although, FlushProcessWriteBuffers can work okay.

Bonita Montero

Aug 7, 2019, 3:52:41 AM

>> It's logical that Intel implemented unaligned loads and stores
>> for backward compatibility. And these loads / stores are useful
>> for efficiently modifying data structures for persistence or
>> transmission over the network. So this is clearly a unique
>> advantage of the Intel architecture.
>> But what would an unaligned LOCK * be useful for? I can see
>> no sense in that.

> To trigger a full blown bus lock, to be used as a quiescence point
> in user space RCU. There is an older write up on this, back in 2010:

OK, but that's extremely exotic.

> There is a whole class of exotic asymmetric synchronization
> algorithms that can use this. Although, FlushProcessWriteBuffers
> can work okay.

Do you have a link?

Paavo Helde

Aug 7, 2019, 4:00:53 AM

Results from x64 build on Intel Core i7-6600U:

aligned native: 0.30228
unaligned native: 0.31015
unaligned wrapped: 2.64527

Hmm, the unaligned case hardly seems slower at all... Trying a 32-bit build:

aligned native: 0.362762
unaligned native: 0.42812
unaligned wrapped: 2.63736


BTW, it looks to me your code has a buffer overrun bug. Probably won't
affect benchmarks, but still.


Bonita Montero

Aug 7, 2019, 4:03:33 AM

On 07.08.2019 at 10:00, Paavo Helde wrote:

> BTW, it looks to me your code has a buffer overrun bug.
> Probably won't affect benchmarks, but still.

No, look at the "- 1"!

memkill( (unaligned_uint32 *)(m + 1), ELEMS - 1 );

Without this, the code would touch the next page, which is very likely
not allocated, so it would crash.

Paavo Helde

Aug 7, 2019, 4:46:35 AM

I see. It seems I looked only at the first (aligned) call and assumed
the test set size would be the same always.

David Brown

Aug 7, 2019, 5:20:03 AM

Can MSVC not turn these into optimised unaligned accesses? gcc and
clang treat them exactly like accesses via a cast to a uint32_t pointer,
except that the behaviour is defined and portable. (On targets that
don't support unaligned access, gcc will access data by bytes.)


Bonita Montero

Aug 7, 2019, 7:13:42 AM

> Can MSVC not turn these into optimised unaligned accesses?

Do you really think there will ever be a compiler that "optimizes"
away the unaligned loads? By doing that, the compiler simply wouldn't
do what I told it to. I bet there won't be any C/C++ compiler that does
this until the earth is burnt by the sun.

David Brown

Aug 7, 2019, 7:29:37 AM

#include <stdint.h>

struct unaligned_uint32
{
    uint8_t u[sizeof(uint32_t)];
    operator uint32_t();
    unaligned_uint32 &operator = ( uint32_t ui );
};

unaligned_uint32::operator uint32_t()
{
    return ((uint32_t)u[0] | (uint32_t)u[1] << 8) |
           ((uint32_t)u[2] << 16 | (uint32_t)u[3] << 24);
}


gcc -O2:

unaligned_uint32::operator unsigned int():
movl (%rdi), %eax
ret


Bonita Montero

Aug 7, 2019, 8:02:28 AM

> gcc -O2:
> unaligned_uint32::operator unsigned int():
> movl (%rdi), %eax
> ret

The compiler isn't optimizing away an unaligned load here.
If rdi is unaligned the load is also.

Bonita Montero

Aug 7, 2019, 8:08:47 AM

Sorry, you mis-worded what you wanted to say.
> Can MSVC not turn these into optimised unaligned accesses?
... is not what you meant. You wanted to say that the
operator is compiled in a way that the shifts and loads are bundled
into a single load. So I misunderstood you.

Jorgen Grahn

Aug 7, 2019, 8:37:53 AM

On Wed, 2019-08-07, Bonita Montero wrote:
> I just wrote a little test that checks the performance of unaligned
> memory accesses on x86 / Win32. I ran this code on my Ryzen 1800X:

What's the point of the exercise, in a C++ context? Unaligned access
in portable code is always the result of a programming error.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Bonita Montero

Aug 7, 2019, 8:47:52 AM

> What's the point of the exercise, in a C++ context? Unaligned access
> in portable code is always the result of a programming error.

Pure theory. All platforms support unaligned access either directly
through the CPU or through trapping by the operating-system (very slow).

David Brown

Aug 7, 2019, 9:57:26 AM

I wrote what I intended to write, but you misunderstood. (That happens
sometimes, especially when you have to work in a second language.
It's no problem.)

The compiler turns the shift-and-or code into optimised code using an
unaligned access. I was surprised that MSVC could not do this
optimisation - that compiler is often quite good at optimisations.


David Brown

Aug 7, 2019, 10:18:56 AM

No, they don't. Some cpus support direct unaligned accesses. For
others, various things could happen. On big OS's, you are likely to get
a trap or exception causing the OS to kill your program with a fault - I
can't imagine why an OS would bother simulating the unaligned access.
On embedded systems, unaligned access may lead to a bus fault of some
sort, halting the system or causing a restart. And on some systems that
I have used, unaligned access will silently give you muddled reads and
corrupting writes.

Unaligned access is always an error. Use code with shifts and masks, if
you need it, or use memcpy. If your compiler isn't good enough to give
you efficient enough code for your needs, get a better compiler.
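
A minimal sketch of the memcpy approach; the helper names load_u32 and store_u32 are hypothetical. The value is read and written in host byte order, so this handles alignment only, not endianness.

```cpp
#include <cstdint>
#include <cstring>

// Read a uint32_t in host byte order from any address, aligned or not.
inline std::uint32_t load_u32(const void *p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Write a uint32_t in host byte order to any address, aligned or not.
inline void store_u32(void *p, std::uint32_t v)
{
    std::memcpy(p, &v, sizeof v);
}
```

The source never forms a misaligned uint32_t pointer, so the behaviour is defined everywhere. On targets with hardware support for unaligned access, an optimising compiler is free to reduce each call to a single load or store.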

Scott Lurndal

Aug 7, 2019, 10:46:46 AM

"Chris M. Thomasson" <invalid_chris_t...@invalid.com> writes:
>On 8/6/2019 10:13 PM, Bonita Montero wrote:
>> I just wrote a little test that checks the performance of unaligned
>> memory accesses on x86 / Win32. I ran this code on my Ryzen 1800X:

>>
>> I was very surprised that unaligned memory-access is less than twice as
>> slow on my PC.
>> It would be nice to see results from Intel-CPUs here. Thanks in advance.
>
>Try using unaligned addresses with several threads. Try doing a LOCK
>XADD on a location that straddles two cache lines, and is not aligned on
>a line, vs one that is aligned on a cache line, and properly padded.

Processor vendors work hard so that most unaligned accesses don't add
significant additional latencies to the instructions. Our ARM64 processor
generally has no perf difference between aligned and unaligned to DRAM
(unaligned isn't supported to device memory).

Locked transactions on intel systems that straddle cache lines need
to assert a system bus lock, which causes extreme performance degradation,
particularly in NUMA systems. Don't do that.

Scott Lurndal

Aug 7, 2019, 10:47:33 AM

Bonita Montero <Bonita....@gmail.com> writes:
>> Fwiw, using LOCK XADD on unaligned address that
>> straddles a cache line can invoke a Bus lock!
>
>It's logical that Intel implemented unaligned loads and stores
>for backward compatibility. And these loads / stores are useful
>for efficiently modifying data structures for persistence or
>transmission over the network. So this is clearly a unique
>advantage of the Intel architecture.

No, it's not unique. See AArch64.

Bonita Montero

Aug 7, 2019, 10:53:41 AM

For portability reasons almost every system tries to be compatible with
x86 systems for unaligned accesses. But on some exotic systems which
don't run a common operating system, or none at all, you might be right.
Unaligned accesses are simply useful for data-structures which are
sent over the network or persisted on disk to save the padding bytes.

Bonita Montero

Aug 7, 2019, 10:55:51 AM

OK, except for atomicity: i.e. loads / stores aren't atomic and
atomic RMW instructions will always fault.

Scott Lurndal

Aug 7, 2019, 11:11:50 AM

B2.2:

Atomicity is a feature of memory accesses, described as atomic accesses. The Arm architecture description refers to
two types of atomicity, single-copy atomicity and multi-copy atomicity. In the Armv8 architecture, the atomicity
requirements for memory accesses depend on the memory type, and whether the access is explicit or implicit. For
more information, see:

B2.2.1 Requirements for single-copy atomicity


If ARMv8.4-LSE is implemented, all loads and stores are single-copy atomic when the following conditions are
true:
· Accesses are unaligned to their data size but are aligned within a 16-byte quantity that is aligned to 16 bytes.
· Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

Otherwise it is IMPLEMENTATION DEFINED whether loads and stores are single-copy atomic.
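
The first condition is plain address arithmetic. A minimal sketch of it, with a hypothetical helper name; the memory-type condition cannot be derived from an address and is not checked here.

```cpp
#include <cstddef>
#include <cstdint>

// True if the byte range [addr, addr + size) lies entirely inside one
// 16-byte block that is itself aligned to 16 bytes.
constexpr bool within_16_byte_block(std::uintptr_t addr, std::size_t size)
{
    return size != 0 && size <= 16 && (addr % 16) + size <= 16;
}

static_assert(within_16_byte_block(12, 4), "bytes 12..15 stay inside");
static_assert(!within_16_byte_block(13, 4), "bytes 13..16 cross into the next block");
```

So a 4-byte access at offset 13 of a block falls outside the quoted guarantee, while one at offset 9, although unaligned to its size, is covered.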

David Brown

Aug 7, 2019, 11:30:29 AM

/Please/ learn to use Usenet properly! Keep attributions, and quote an
appropriate amount of context!

On 07/08/2019 16:53, Bonita Montero wrote:
> For portability reasons almost every system tries to be compatible with
> x86 systems for unaligned accesses.

Total nonsense.

There is rarely any reason for wanting unaligned access, and no
justification for using it in code that should be portable. (An
implementation can use it under the hood, when implementing memcpy, or
for the kind of optimisations I showed gcc doing. But you don't write
unaligned accesses in the source code.)

Other cpu designs do not attempt to copy the x86. The great majority of
code that is written that is reliant on a processor working like an x86
is written for Windows, and does not need to be portable to anything
other than x86.

The rest of the programming world is mostly either aimed at reasonable
portability, such as across different *nix systems, or targeted at
smaller embedded systems. For portable code, you don't care about
unaligned accesses because you don't use them in the source code - you
only care that the implementation handles your source code efficiently.

Many cpus implement unaligned accesses - because the designers think the
balance between use and cost makes it appropriate. It is /not/ for
compatibility with x86.

And for processors that don't support unaligned access in hardware, no
one would bother supporting it by software emulation except if it were
required for /binary/ compatibility with other processors in the same
family. I don't know of any systems where that applies.

> But on some exotic systems which
> don't run a common operating system or none at all you might be right.

These "exotic" systems far and away outnumber the PC's of this world.
The programming world does not revolve around x86.

> Unaligned accesses are simply useful for data-structures which are
> sent over the network or persisted on disk to save the padding bytes.

Nonsense. Proper programming is useful for data structures that are
sent over the network or are stored on files. Use portable coding, or
implementation dependent coding (like "packed" structs), and let the
compiler use whatever instructions are supported and most efficient for
the platform.

Bonita Montero

Aug 7, 2019, 11:39:17 AM

> There is rarely any reason for wanting unaligned access, and
> no justification for using it in code that should be portable.

I didn't say that it's always good style, but a lot of code relies
on it so that most platforms support it through the CPU or the OS.
So the code formerly written for x86 will also run without changes.

> Many cpus implement unaligned accesses - because the designers think
> the balance between use and cost makes it appropriate. It is /not/
> for compatibility with x86.

I'll bet no one would have implemented this on other CPUs if there
weren't a lot of code written on x86 machines that relies on this.

> These "exotic" systems far and away outnumber the PC's of this world.

Most of these trap unaligned accesses through the OS.

>> Unaligned accesses are simply useful for data-structures which are
>> sent over the network or persisted on disk to save the padding bytes.

> Nonsense. Proper programming is useful for data structures that are
> sent over the network or are stored on files. Use portable coding, or
> implementation dependent coding (like "packed" structs), and let the
> compiler use whatever instructions are supported and most efficient
> for the platform.

That's your taste of proper programming, but not the facts.

Bonita Montero

Aug 7, 2019, 11:41:24 AM

>> OK, except for atomicity: i.e. loads / stores aren't atomic and
>> atomic RMW instructions will always fault.
>>
>
> B2.2:
> ...

I mis-worded what I wanted to say: I simply meant that
_unaligned_ loads / stores are not atomic on ARM.

Scott Lurndal

Aug 7, 2019, 11:58:53 AM

But they are atomic on ARMv8.4. Which is what the text you
elided showed.

Scott Lurndal

Aug 7, 2019, 12:07:19 PM

David Brown <david...@hesbynett.no> writes:
>/Please/ learn to use Usenet properly! Keep attributions, and quote an
>appropriate amount of context!

Good luck with that, it's been tried before.

>
>On 07/08/2019 16:53, Bonita Montero wrote:
>> For portability reasons almost every system tries to be compatible with
>> x86 systems for unaligned accesses.
>

>Other cpu designs do not attempt to copy the x86.

Certainly not the instruction set, unless you consider AMD, Cyrix,
Nat Semi, Harris, IBM, TI or Transmeta :-)

On the other hand, any processor vendor attempting to make a competing
server processor will have to accommodate the standard programming
methods used on Intel processors if they want to gain any market share
since most of the software would be ported from X86. That means things
like unaligned accesses and providing something that looks like the
intel strongly (program) ordered memory model are high on the desirable feature list.

AArch64 was specifically designed to be a competing server processor and thus
supports unaligned accesses. The memory model is a bit weaker but generally
provides program ordering; a small percentage of software ported from x86
may require some changes (unless it uses the appropriate C11 or C++14
capabilities).

>
>Many cpus implement unaligned accesses - because the designers think the
>balance between use and cost makes it appropriate. It is /not/ for
>compatibility with x86.

For ARMv8 it was _specifically_ for compatibility with X86(_64).

David Brown

Aug 7, 2019, 12:12:26 PM

On 07/08/2019 17:39, Bonita Montero wrote:
>> There is rarely any reason for wanting unaligned access, and
>> no justification for using it in code that should be portable.
>
> I didn't say that it's always good style, but a lot of code relies
> on it so that most platforms support it through the CPU or the OS.

Any code that relies on this is very badly written. It may be that it
is common in the Windows world, where it is clear that many people have
quite a poor knowledge of legal C and C++, and little concept or
interest in writing clear, safe, and portable code. And it may be that
MSVC, knowing that its users are often unaware of the details of their
programming languages, is dumbed down to support such broken code.

Attempts to use unaligned access on other compilers may fail in
unexpected ways due to undefined behaviour.


> So the code formerly written for x86 will also run without changes.

Again, nonsense.

The x86 is quite a "programmer friendly" ISA. It supports unaligned
accesses, it has a strong memory model, it has support for many types of
atomic operations. People do write code that is dependent on these
features, and also dependent on compilers that have extra semantics to
support non-portable coding (such as guaranteeing wrapping on signed
integer overflow). Code that is written "assuming an x86 processor"
will often have strange breakages on other platforms and other compilers
- because the code is not portable C or C++.

Other processors do not copy these x86 features. These kinds of features
can often be very expensive to implement (in terms of die size, power
consumption, speed, etc.) and only exist in the x86 world because of
backwards compatibility with the kind of badly written code that exists
in the Windows world.

>
>> Many cpus implement unaligned accesses - because the designers think
>> the balance between use and cost makes it appropriate.  It is /not/
>> for compatibility with x86.
>
> I'll bet no one would have implemented this on other CPUs if there
> weren't a lot of code written on x86 machines that relies on this.

Bet whatever you like. But don't quit your day job.

>
>> These "exotic" systems far and away outnumber the PC's of this world.
>
> Most of these trap unaligned accesses through the OS.

Can you give any kind of a reference for even a single case where you
know the OS will trap unaligned accesses and emulate them in software?
If not, then I think we can dispense with the fantasy that OS's provide
support for unaligned access when the cpu does not.


>
>>> Unaligned accesses are simply useful for data-structures which are
>>> sent over the network or persisted on disk to save the padding bytes.
>
>> Nonsense.  Proper programming is useful for data structures that are
>> sent over the network or are stored on files. Use portable coding, or
>> implementation dependent coding (like "packed" structs), and let the
>> compiler use whatever instructions are supported and most efficient
>> for the platform.
>
> That's your taste of proper programming, but not the facts.
>

Quote the section in the C++ standards that says unaligned access is
allowed in C++, and I'll believe you.

Bonita Montero

Aug 7, 2019, 12:34:08 PM

> Any code that relies on this is very badly written.

Depends on which platforms you target.

> It may be that it is common in the Windows world, where it is clear
> that many people have quite a poor knowledge of legal C and C++,

Unaligned accesses don't arise from straightforward coding. You must code
with special alignment directives or with pointer casting. So the
developers who use unaligned accesses know what they're doing, and
they know the target platforms.

> The x86 is quite a "programmer friendly" ISA. It supports unaligned
> accesses, it has a strong memory model, it has support for many types
> of atomic operations. People do write code that is dependent on these
> features, and also dependent on compilers that have extra semantics to
> support non-portable coding (such as guaranteeing wrapping on signed
> integer overflow). Code that is written "assuming an x86 processor"
> will often have strange breakages on other platforms and other compilers
> - because the code is not portable C or C++.

Maybe it will break because of other features; but the unaligned
accesses themselves are mostly de facto portable across the target
platforms.

> Can you give any kind of a reference for even a single case where you
> know the OS will trap unaligned accesses and emulate them in software?

It's just a tiny task for an OS to support this, and it helps to run a lot
of old code; so this is very likely.

>> That's your taste of proper programming, but not the facts.

> Quote the section in the C++ standards that says unaligned access is
> allowed in C++, and I'll believe you.

My statement was related to your taste of proper programming and not
to the standard.

You're simply one of those compulsive and intolerant programmers.

Bonita Montero

Aug 7, 2019, 12:38:09 PM

> AArch64 was specifically designed to be a competing server processor and thus
> supports unaligned accesses. The memory model is a bit weaker but generally
> provides program ordering; a small percentage of software ported from x86
> may require some changes (unless it uses the appropriate C11 or C++14
> capabilities).

I don't think that ARM is considering AArch64 implementations as a
server competitor.
It's just convenient to have unaligned loads / stores for persistence
and network-transfers.

Scott Lurndal

Aug 7, 2019, 1:35:08 PM

Bonita Montero <Bonita....@gmail.com> writes:
>> AArch64 was specifically designed to be a competing server processor and thus
>> supports unaligned accesses. The memory model is a bit weaker but generally
>> provides program ordering; a small percentage of software ported from x86
>> may require some changes (unless it uses the appropriate C11 or C++14
>> capabilities).
>
>I don't think that ARM is considering AArch64 implementations as a
>server competitor.

It doesn't matter what you think. You can't even be troubled to
properly attribute your posts.

Aarch64 was specifically designed as a server-capable processor. I
was there.

Bonita Montero

Aug 7, 2019, 2:08:03 PM

> It doesn't matter what you think. You can't even be troubled to
> properly attribute your posts.

> Aarch64 was specifically designed as a server-capable processor.
> I was there.

All attempts to establish Aarch64-based machines as servers failed.
For example: https://en.wikipedia.org/wiki/Calxeda
People simply want x86 and in rare cases SPARC or POWER.

And I doubt that Aarch64 was designed mainly with servers in mind.
A 64 bit address-space has advantages even on smartphones with <=
4GB RAM.

David Brown

Aug 7, 2019, 3:32:18 PM

On 07/08/2019 18:33, Bonita Montero wrote:
>> Any code that relies on this is very badly written.
>
> Depends on which platforms you target.

No, it does not - unless you are programming for a single compiler and
single target for which the compiler documentation clearly says that
unaligned accesses are supported. Even then, you will usually need to
do something special (like using the "__unaligned" modifier with MSVC).

>
>> It may be that it is common in the Windows world, where it is clear
>> that many people have quite a poor knowledge of legal C and C++,
>
> Unaligned accesses don't arise from straightforward coding. You must code
> with special alignment directives or with pointer casting. So the
> developers who use unaligned accesses know what they're doing, and
> they know the target platforms.

If they use pointer casts to break alignment requirements (or access
objects through incompatible types), then clearly they /don't/ know what
they are doing as this is very specifically not allowed by the language.

>
>> The x86 is quite a "programmer friendly" ISA.  It supports unaligned
>> accesses, it has a strong memory model, it has support for many types
>> of atomic operations.  People do write code that is dependent on these
>> features, and also dependent on compilers that have extra semantics to
>> support non-portable coding (such as guaranteeing wrapping on signed
>> integer overflow).  Code that is written "assuming an x86 processor"
>> will often have strange breakages on other platforms and other compilers
>> - because the code is not portable C or C++.
>
> Maybe it will break because of other features; but the unaligned
> accesses themselves are mostly de facto portable across the target
> platforms.

No, they are not.

You are extrapolating from "it seemed to work when I tried it on one
compiler without optimisation" to "it works everywhere". That is
ridiculous. It is downright scary that someone who thinks of themselves
as a serious programmer would write such stuff.

>
>> Can you give any kind of a reference for even a single case where you
>> know the OS will trap unaligned accesses and emulate them in software?
>
> It's just a tiny task for an OS to support this, and it helps to run a lot
> of old code; so this is very likely.

You have /no/ concept of what you are talking about. And you have made
it perfectly clear that you have no reference or samples - you are
making up stuff as you go along.

>
>>> That's your taste of proper programming, but not the facts.
>
>> Quote the section in the C++ standards that says unaligned access is
>> allowed in C++, and I'll believe you.
>
> My statement was related to your taste of proper programming and not
> to the standard.
>
> You're simply one of those compulsive and intolerant programmers.

I have little tolerance for people who write code that they know is bad.
I have no tolerance for people who not only insist that it is good
code on their one platform, but think that this means it works everywhere.

You appear to have a very serious misunderstanding of how programming
and languages work. Understanding the language and writing code that is
correct and valid is not a matter of taste.

Jorgen Grahn

Aug 7, 2019, 3:48:18 PM

On Wed, 2019-08-07, David Brown wrote:
> On 07/08/2019 14:47, Bonita Montero wrote:
>>> What's the point of the exercise, in a C++ context?  Unaligned access
>>> in portable code is always the result of a programming error.
>>
>> Pure theory. All platforms support unaligned access either directly
>> through the CPU or through trapping by the operating-system (very slow).
>
> No, they don't. Some cpus support direct unaligned accesses. For
> others, various things could happen. On big OS's, you are likely to get
> a trap or exception causing the OS to kill your program with a fault - I
> can't imagine why an OS would bother simulating the unaligned access.

Somewhere in my career I saw some system do that simulation. I can't
remember where, but I think we disabled the feature after a while,
when we saw it caused a lot more problems than it solved (broken
programs would slow down to a crawl instead of crashing and thereby
telling us they were broken). I think it was an ARM of some kind.

Christopher Collins

Aug 7, 2019, 3:50:50 PM

On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
> There is rarely any reason for wanting unaligned access, and no
> justification for using it in code that should be portable.

There is one good reason for wanting it: reduced code size. If
you are willing to tie yourself to gcc or clang (and probably others),
you can annotate struct definitions with `__attribute__((packed))`.
Then you don't need to manually [un]marshal objects that get sent over
the network. If the processor does not support unaligned accesses, then
the compiler has to generate the marshalling code, and there is probably
no savings in code size. But for processors that support fast unaligned
access (e.g., ARM cortex M4), the compiler just pretends the data is
aligned. You don't pay the price in code size for [un]marshalling code,
and the C code is considerably simpler.
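
A minimal sketch of that approach, assuming GCC or Clang; the WireHeader layout and the parse_header helper are hypothetical. Fields are in host byte order here, so any byte swapping a real protocol needs is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wire header. The attribute is a GCC/Clang extension that
// removes all padding, so 'length' sits at the misaligned offset 1.
struct __attribute__((packed)) WireHeader
{
    std::uint8_t  type;
    std::uint32_t length;
    std::uint16_t checksum;
};

static_assert(sizeof(WireHeader) == 7, "no padding bytes");
static_assert(offsetof(WireHeader, length) == 1, "length follows type directly");

// Copy a received buffer into the struct. Field access afterwards is plain
// member access; the compiler chooses suitable load instructions.
inline WireHeader parse_header(const unsigned char *bytes)
{
    WireHeader h;
    std::memcpy(&h, bytes, sizeof h);
    return h;
}
```

Taking the address of, or binding a reference to, a packed member such as length is best avoided, since the resulting pointer no longer carries the reduced alignment.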

[...]

Chris

Jorgen Grahn

Aug 7, 2019, 3:58:35 PM

On Wed, 2019-08-07, Scott Lurndal wrote:
> David Brown <david...@hesbynett.no> writes:
...
>>On 07/08/2019 16:53, Bonita Montero wrote:
>>> For portability reasons almost any system tries to be compatible with
>>> x86 systems for unaligned accesses.
>>
>
>>Other cpu designs do not attempt to copy the x86.
>
> Certainly not the instruction set, unless you consider AMD, Cyrix,
> Nat Semi, Harris, IBM, TI or Transmeta :-)
>
> On the other hand, any processor vendor attempting to make a competing
> server processor will have to accommodate the standard programming
> methods used on Intel processors if they want to gain any market share
> since most of the software would be ported from X86. That means things
> like unaligned accesses

Why would they want that to work? With C or C++ code you can only
provoke an unaligned access by writing broken code. Also, the huge
body of Unix software is already free from such code (or it wouldn't
have worked on e.g. SPARC or PPC).

> and providing something that looks like the intel strongly (program)
> ordered memory model are high on the desirable feature list.

That I can believe, since most people didn't play with such things
until long after x86 won dominance. I've written (and been told to
write) application code which relied on the x86 model in that area.

Bonita Montero

unread,
Aug 7, 2019, 4:01:21 PM8/7/19
to
>> Depends on which platforms you target.

> No, it does not - unless you are programming for a single compiler and
> single target for which the compiler documentation clearly says that
> unaligned accesses are supported.  Even then, you will usually need to
> do something special (like using the "__unaligned" modifier with MSVC).

No, you don't have to program for a single computer / compiler; you
can do that for multiple platforms if you have their capabilities
in mind.

> If they use pointer casts to break alignment requirements (or access
> objects through incompatible types), then clearly they /don't/ know what
> they are doing as this is very specifically not allowed by the language.

They know what they do since they know their platform.

>> Maybe it will break because of other features; but the unaligned
>> accesses themselves are mostly de-facto-portable to the target
>> platforms.

> No, they are not.

You have to know the target platforms.

> You are extrapolating from "it seemed to work when I tried it on one
> compiler without optimisation" to "it works everywhere".

I never said that. I said that you usually have a set of target platforms
that allow unaligned accesses.

>> It's just a tiny task to support this by an OS and helps to run a lot
>> of old code; so this is very likely.

> You have /no/ concept of what you are talking about.  And you have made
> it perfectly clear that you have no reference or samples - you are
> making up stuff as you go along.

No argumentation against what I said above.

> I have little tolerance for people who write code that they know is bad.

Whether the code under discussion is bad depends on the target platform.
Usually it will work, because either the CPU supports unaligned accesses
or the operating system traps them. In the latter case it's very slow,
though; that's just for compatibility, and it's not recommended to rely
on it for code that has to run fast.

Bonita Montero

unread,
Aug 7, 2019, 4:08:19 PM8/7/19
to
> Why would they want that to work? With C or C++ code you can only
> provoke an unaligned access by writing broken code.

It's only broken in theory. In reality it will work on most platforms,
although it might not be recommended on those emulating unaligned loads
/ stores through trapping because that's slow.

> Also, the huge body of Unix software is already free from such code
> (or it wouldn't have worked on e.g. SPARC or PPC).

Linux on SPARC, for example, emulates unaligned memory accesses through
trapping. With Solaris it depends on how you compile your code.

David Brown

unread,
Aug 7, 2019, 4:13:54 PM8/7/19
to
On 07/08/2019 21:58, Jorgen Grahn wrote:
> On Wed, 2019-08-07, Scott Lurndal wrote:
>> David Brown <david...@hesbynett.no> writes:
> ...
>>> On 07/08/2019 16:53, Bonita Montero wrote:
>>>> For portability reasons almost any system tries to be compatible with
>>>> x86 systems for unaligned accesses.
>>>
>>
>>> Other cpu designs do not attempt to copy the x86.
>>
>> Certainly not the instruction set, unless you consider AMD, Cyrix,
>> Nat Semi, Harris, IBM, TI or Transmeta :-)
>>
>> On the other hand, any processor vendor attempting to make a competing
>> server processor will have to accommodate the standard programming
>> methods used on Intel processors if they want to gain any market share
>> since most of the software would be ported from X86. That means things
>> like unaligned accesses
>
> Why would they want that to work? With C or C++ code you can only
> provoke an unaligned access by writing broken code. Also, the huge
> body of Unix software is already free from such code (or it wouldn't
> have worked on e.g. SPARC or PPC).

I suppose there are a few things to consider here. One is that C and
C++ are not the only languages around - perhaps other languages allow
unaligned access as long as the target supports it. The other is that,
unfortunately, the programming world is full of people who don't know
what they are doing, and there is a lot of badly written code out there.
I guess the hardware folks would rather support this broken code than
limit their market to higher quality code.

>
>> and providing something that looks like the intel strongly (program)
>> ordered memory model are high on the desirable feature list.
>
> That I can believe, since most people didn't play with such things
> until long after x86 won dominance. I've written (and been told to
> write) application code which relied on the x86 model in that area.
>

Certainly strong memory models can be easier to program than weaker ones.

Scott Lurndal

unread,
Aug 7, 2019, 4:17:03 PM8/7/19
to
Bonita Montero <Bonita....@gmail.com> writes:
>> It doesn't matter what you think. You can't even be troubled to
>> properly attribute your posts.
>
>> Aarch64 was specifically designed as a server-capable processor.
>> I was there.
>
>All attempts to establish Aarch64-based machines as servers failed.

And again, you are incorrect.

https://www.cray.com/blog/inside-isambard-worlds-first-production-arm-supercomputer/
https://www.nextplatform.com/2017/11/13/cray-arms-highest-end-supercomputer-thunderx2/

There are also various cloud deployments.
Calxeda was using large arrays of A9's (32-bit processors).



>
>And I doubt that Aarch64 was designed mainly with servers in mind.

As I said, I was there.

David Brown

unread,
Aug 7, 2019, 4:18:33 PM8/7/19
to
On 07/08/2019 21:50, Christopher Collins wrote:
> On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
>> There is rarely any reason for wanting unaligned access, and no
>> justification for using it in code that should be portable.
>
> There is one good reason for wanting it: reduced code size.

Nope.

Write your code properly and safely (using shifts and masks, memcpy, or
compiler-specific features or intrinsics) and let the compiler turn it
into unaligned accesses. Writing safe, correct, and valid source code
is the programmer's responsibility. Turning it into small and fast
object code is the compiler's responsibility. Don't try to do the
compiler's job - work with it so that it can do the best job it can.
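
A minimal sketch of the memcpy variant, for illustration only. The names
load_u32 and store_u32 are hypothetical, and the value is kept in host byte
order. With optimisation enabled, current gcc and clang typically reduce each
helper to a single load or store on targets with hardware unaligned access,
but that is worth confirming in the generated assembly for your own target.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers (names are not from the thread): read or write a
// 32-bit value at an arbitrary, possibly misaligned, address.
inline std::uint32_t load_u32(const void *p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);   // no misaligned uint32_t lvalue is ever formed
    return v;
}

inline void store_u32(void *p, std::uint32_t v)
{
    std::memcpy(p, &v, sizeof v);   // copies the object representation byte by byte
}
```

Because the source never dereferences a misaligned uint32_t pointer, the
code stays valid even on targets where such an access would trap.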


> If
> you are willing to tie yourself to gcc or clang (and probably others),
> you can annotate struct definitions with `__attribute__((packed))`.
> Then you don't need to manually [un]marshal objects that get sent over
> the network. If the processor does not support unaligned accesses, then
> the compiler has to generate the marshalling code, and there is probably
> no savings in code size. But for processors that support fast unaligned
> access (e.g., ARM cortex M4), the compiler just pretends the data is
> aligned. You don't pay the price in code size for [un]marshalling code,
> and the C code is considerably simpler.
>

Compiler extensions like "packed" are fine in my book. Use them if they
make the code clearer and more efficient - assuming, of course, that the
lack of portability is not a problem. Compilers that don't support such
features will complain, so you don't get silent problems.

Chris M. Thomasson

unread,
Aug 7, 2019, 4:37:29 PM8/7/19
to
On 8/7/2019 12:52 AM, Bonita Montero wrote:
>>> It's logical that Intel implemented unaligned loads and stores
>>> for backward-compatibility. And these loads / stores are useful
>>> for efficiently modifying data structures for persistence or
>>> transmission over the network. So this is clearly a unique
>>> advantage of the Intel architecture.
>>> But for what should an unaligned LOCK * be useful? I can see
>>> no sense in that.
>
>> To trigger a full blown bus lock, to be used as a quiescence point
>> in user space RCU. There is an older write up on this, back in 2010:
>
> Ok, but that's extremely exotic.
>
> > There are a whole class of exotic asymmetric synchronization
> > algorithms that can use this. Although, FlushProcessWriteBuffers
> > can work okay.
>
> Do you have a link?

The following link mentions the asymmetric Dekker algorithm:

https://blogs.oracle.com/dave/qpi-quiescence

https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/1a6535a9-5fe4-418b-ab11-91f1668a5720/File/acfe7384038ac881c1a34cc0cb0b9315/asymmetric_dekker_synchronization_140215.txt

The short version:

https://preview.tinyurl.com/y4e5forx


There is biased locking:

https://blogs.oracle.com/dave/biased-locking-in-hotspot


One can remove the nasty #StoreLoad membar in hazard pointers:

https://patents.google.com/patent/US20040107227A1/en

This requires a store load relationship that even Intel does not have.
An MFENCE or LOCK RMW is required. Acquiring a hazard pointer basically
involves a load, store, load, compare, conditional loop. The store needs
to be _before_ the following load. A release membar is not good enough,
even on Intel.

Keep in mind that Intel can reorder a store followed by a load to
another location.

http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf

Bonita Montero

unread,
Aug 7, 2019, 4:42:35 PM8/7/19
to
We were talking about servers.

Bonita Montero

unread,
Aug 7, 2019, 4:48:37 PM8/7/19
to
Consider this code. It checks at compile time whether the target platform
supports unaligned loads and, if so, uses a direct access instead of
vaguely relying on the compiler to replace the assembly of a uint32_t
from shifts and ORs with a single load / store.

#include <cstdint>

#ifdef _MSC_VER
    // MSC only works on x86/x64 and ARMv8
    #define SUPPORTS_UNALIGNED
#elif(__GNUC__)
    #if defined(__x86_64__) || defined(__i386__)
        #define SUPPORTS_UNALIGNED
    #elif defined(__aarch64__)
        #define SUPPORTS_UNALIGNED
    #endif
#endif

struct unaligned_uint32
{
    union
    {
        uint8_t  ub[sizeof(uint32_t)];
        uint32_t u;
    } uu;
    operator uint32_t();
    unaligned_uint32 &operator = ( uint32_t ui );
};

inline
unaligned_uint32::operator uint32_t()
{
#if defined(SUPPORTS_UNALIGNED)
    return uu.u;
#else
    // assemble the value byte by byte, least significant byte first
    return ((uint32_t)uu.ub[0] | (uint32_t)uu.ub[1] << 8) |
           ((uint32_t)uu.ub[2] << 16 | (uint32_t)uu.ub[3] << 24);
#endif
}

inline
unaligned_uint32 &unaligned_uint32::operator = ( uint32_t ui )
{
#if defined(SUPPORTS_UNALIGNED)
    uu.u = ui;
    return *this;
#else
    // store byte by byte, least significant byte first
    // (right shifts, so that each byte of ui ends up in its slot)
    uu.ub[0] = (uint8_t)ui;
    uu.ub[1] = (uint8_t)(ui >> 8);
    uu.ub[2] = (uint8_t)(ui >> 16);
    uu.ub[3] = (uint8_t)(ui >> 24);
    return *this;
#endif
}

Chris M. Thomasson

unread,
Aug 7, 2019, 5:09:29 PM8/7/19
to
On 8/7/2019 7:46 AM, Scott Lurndal wrote:
> "Chris M. Thomasson" <invalid_chris_t...@invalid.com> writes:
>> On 8/6/2019 10:13 PM, Bonita Montero wrote:
>>> I just wrote a little test that checks the performance of unaligned
>>> memory-accesses on x86 / Win32. I've run this code on my Ryzen 1800X:
>
>>>
>>> I was very surprised that unaligned memory-access is less than twice as
>>> slow on my PC.
>>> It would be nice to see results from Intel-CPUs here. Thanks in advance.
>>
>> Try using unaligned addresses with several threads. Try doing a LOCK
>> XADD on a location that straddles two cache lines, and is not aligned on
>> a line, vs one that is aligned on a cache line, and properly padded.
>
> Processor vendors work hard so that most unaligned accesses don't add
> significant additional latencies to the instructions. Our ARM64 processor
> generally has no perf difference between aligned and unaligned to DRAM
> (unaligned isn't supported to device memory).
>
> Locked transactions on intel systems that straddle cache lines need
> to assert a system bus lock, which causes extreme performance degradation,
> particularly in NUMA systems. Don't do that.
>

It can be "abused" to achieve userspace asymmetric sync algorithms, or
even RCU. These hacks of the system bus lock on Intel are _not_
frequently called, it would make no sense if they were. They are there
to force a system wide memory barrier when, say the garbage in a
deferred reclamation system builds up to a limit. Now, Windows kindly
offers another way to do this. It's not system wide, but process wide:

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

Wrt, asymmetric, think of two things A and B that need to be done. A
gets called on a very rapid basis. However, B is not used all that much.
We call A the fastpath and B the slowpath. A should be as fast as
possible. In fact, we can do it without using any memory barriers at
all. When B decides to run, it can execute a system wide quiescence
using the Intel hack, or call into FlushProcessWriteBuffers to get all
the stores. Then it knows that anything A stored to up to this point is
guaranteed to be visible.

Think of A and B as being the two sides of a Dekker algorithm, where A
gets used more than B.

https://en.wikipedia.org/wiki/Dekker%27s_algorithm
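
For readers who want the shape of this in code, here is a minimal sketch of
the symmetric Dekker-style handshake that the asymmetric variants start from.
The type dekker_pair and its member names are hypothetical. Both sides issue
a full fence here; the asymmetric trick would drop the fence from side A and
have side B force a process-wide or system-wide barrier instead (for example
FlushProcessWriteBuffers on Windows), which this portable sketch does not
attempt.

```cpp
#include <atomic>

// Hypothetical illustration of a two-party Dekker-style "try enter" handshake.
struct dekker_pair
{
    std::atomic<bool> wants_a{false};
    std::atomic<bool> wants_b{false};

    // Side A: the frequently used fast path.
    bool try_enter_a()
    {
        wants_a.store(true, std::memory_order_relaxed);
        // Store-then-load ordering: this is the fence the asymmetric
        // variants try to remove from the fast path.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (wants_b.load(std::memory_order_acquire))
        {
            wants_a.store(false, std::memory_order_release);   // back off
            return false;
        }
        return true;
    }

    void leave_a() { wants_a.store(false, std::memory_order_release); }

    // Side B: the rarely used slow path.
    bool try_enter_b()
    {
        wants_b.store(true, std::memory_order_relaxed);
        // In an asymmetric variant, the heavyweight remote barrier would go here.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (wants_a.load(std::memory_order_acquire))
        {
            wants_b.store(false, std::memory_order_release);   // back off
            return false;
        }
        return true;
    }

    void leave_b() { wants_b.store(false, std::memory_order_release); }
};
```

With both fences in place, at most one side can see the other's flag clear
after raising its own, so the two sides never hold the section together.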

Scott Lurndal

unread,
Aug 7, 2019, 5:19:37 PM8/7/19
to
And if you think a supercomputer isn't a server (or more properly, a collection
of servers), you've not been paying attention.

Chris M. Thomasson

unread,
Aug 7, 2019, 5:21:02 PM8/7/19
to
On 8/7/2019 8:30 AM, David Brown wrote:
> /Please/ learn to use Usenet properly! Keep attributions, and quote an
> appropriate amount of context!
>
> On 07/08/2019 16:53, Bonita Montero wrote:
>> For portability reasons almost any system tries to be compatible with
>> x86 systems for unaligned accesses.
>
> Total nonsense.
>
> There is rarely any reason for wanting unaligned access, and no
> justification for using it in code that should be portable.


Abusing the bus lock on Intel via forced unaligned cache line straddling
hack can be useful in rare algorithms.

[...]

Bonita Montero

unread,
Aug 7, 2019, 5:40:34 PM8/7/19
to
>>> https://www.cray.com/blog/inside-isambard-worlds-first-production-arm-supercomputer/

>> https://www.nextplatform.com/2017/11/13/cray-arms-highest-end-supercomputer-thunderx2/
>> We were talking about servers.

> And if you think a supercomputer isn't a server (or more properly, a collection
> of servers), you've not been paying attention.

Not necessarily.

Bart

unread,
Aug 7, 2019, 5:53:20 PM8/7/19
to
On 07/08/2019 21:18, David Brown wrote:
> On 07/08/2019 21:50, Christopher Collins wrote:
>> On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
>>> There is rarely any reason for wanting unaligned access, and no
>>> justification for using it in code that should be portable.
>>
>> There is one good reason for wanting it: reduced code size.
>
> Nope.
>
> Write your code properly and safely (using shifts and masks, memcpy, or
> compiler-specific features or intrinsics) and let the compiler turn it
> into unaligned accesses.  Writing safe, correct, and valid source code
> is the programmer's responsibility.  Turning it into small and fast
> object code is the compiler's responsibility.  Don't try to do the
> compiler's job - work with it so that it can do the best job it can.

You have to keep an eye on what's going on, otherwise you may end up
with a struct that needs 65 bytes. Will the compiler say anything about
that, or just accept it, and try and create arrays with a stride of 65
bytes, or pad them to 128 bytes?

I doubt it will re-design the struct to keep it to a power-of-two.



> Compiler extensions like "packed" are fine in my book.  Use them if they
> make the code clearer and more efficient - assuming, of course, that the
> lack of portability is not a problem.  Compilers that don't support such
> features will complain, so you don't get silent problems.

(I've had silent problems with 'pack(1)', where int* accesses that I
knew were aligned, were assumed to be unaligned by gcc, and generated
byte-at-a-time accesses, which in that app slowed things down to 1/3 the
expected speed. Another reason to keep a check on what it is doing.)

Chris Vine

unread,
Aug 7, 2019, 6:02:35 PM8/7/19
to
Out of interest, how are you getting an unaligned_uint32 object which is
in fact unaligned for the target in question? A compiler should
construct a union so that it is correctly aligned for all its members,
so what is the actual usage case? (I have not been reading this thread
attentively so it is possible you have already explained this: if so,
I apologize.)

Keith Thompson

unread,
Aug 7, 2019, 6:51:17 PM8/7/19
to
David Brown <david...@hesbynett.no> writes:
> On 07/08/2019 14:47, Bonita Montero wrote:
>>> What's the point of the exercise, in a C++ context? Unaligned access
>>> in portable code is always the result of a programming error.
>>
>> Pure theory. All platforms support unaligned access either directly
>> through the CPU or through trapping by the operating-system (very slow).
>
> No, they don't. Some cpus support direct unaligned accesses. For
> others, various things could happen. On big OS's, you are likely to get
> a trap or exception causing the OS to kill your program with a fault - I
> can't imagine why an OS would bother simulating the unaligned access.
> On embedded systems, unaligned access may lead to a bus fault of some
> sort, halting the system or causing a restart. And on some systems that
> I have used, unaligned access will silently give you muddled reads and
> corrupting writes.

Agreed.

> Unaligned access is always an error. Use code with shifts and masks, if
> you need it, or use memcpy. If your compiler isn't good enough to give
> you efficient enough code for your needs, get a better compiler.

I suppose that depends on what you mean by "unaligned".

For example, as I understand it x86 and x86_64 support accessing
words at odd addresses in hardware, using the same CPU instructions
as an aligned access but executing more slowly. If you want
to extract a 32-bit value from an odd address, and you know
you're running on an x86 or x86_64 and you're not concerned with
portability, you might as well use a word move instruction in
assembly or a simple assignment in C++. You *could* break it down
into byte moves, but the resulting code would be slower.

For some CPUs, aligned access gives you better performance, but
unaligned access still works. For others, unaligned access just goes
kablooie.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson

unread,
Aug 7, 2019, 8:27:27 PM8/7/19
to
David Brown <david...@hesbynett.no> writes:
[...]
> Compiler extensions like "packed" are fine in my book. Use them if they
> make the code clearer and more efficient - assuming, of course, that the
> lack of portability is not a problem. Compilers that don't support such
> features will complain, so you don't get silent problems.

Compiler extensions like "packed" can cause problems. For example,
suppose you have something like this (using gcc syntax):

    struct foo {
        char c;
        int i;
    } __attribute__((packed));
    foo obj;
    some_func(&obj.i);

some_func() takes an argument of type int*, but there's no indication
that it's misaligned, so it can't take any special steps to avoid
blowing up when it dereferences its argument.

Recent versions of gcc and clang warn about taking the address of a
misaligned member of a packed structure.

https://stackoverflow.com/q/8568432/827263
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51628
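
For illustration, one way to avoid handing out a misaligned pointer is to
copy the member into an ordinary local first. This is only a sketch: it
assumes gcc or clang (for the packed attribute), some_func here is a
hypothetical stand-in that just dereferences its argument, and
call_some_func is a made-up wrapper name.

```cpp
// gcc/clang extension, as in the example above.
struct foo {
    char c;
    int i;
} __attribute__((packed));

// Hypothetical stand-in for some_func(): it assumes a properly aligned int.
inline int some_func(const int *p)
{
    return *p + 1;
}

// Hypothetical wrapper: read the packed member by name (the compiler knows
// it may be misaligned and generates a suitable access), then pass the
// address of the aligned copy instead of &obj.i.
inline int call_some_func(const foo &obj)
{
    int tmp = obj.i;
    return some_func(&tmp);
}
```

If some_func needs to modify the member, the copy has to be written back
to obj.i after the call.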

Keith Thompson

unread,
Aug 7, 2019, 8:27:53 PM8/7/19
to
David Brown <david...@hesbynett.no> writes:
> /Please/ learn to use Usenet properly! Keep attributions, and quote an
> appropriate amount of context!
>
> On 07/08/2019 16:53, Bonita Montero wrote:
[SNIP]

It's not a matter of learning. Bonita Montero's headers indicate
that she(?) is using Thunderbird, which I'm reasonably sure handles
attribution lines correctly. She must be removing them deliberately.
I asked her to keep attribution lines a while ago. She responded
with insults.

Do you use a killfile?

Christopher Collins

unread,
Aug 7, 2019, 8:30:04 PM8/7/19
to
On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
> On 07/08/2019 21:50, Christopher Collins wrote:
>> On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
>>> There is rarely any reason for wanting unaligned access, and no
>>> justification for using it in code that should be portable.
>>
>> There is one good reason for wanting it: reduced code size.
>
> Nope.
>
> Write your code properly and safely (using shifts and masks, memcpy, or
> compiler-specific features or intrinsics) and let the compiler turn it
> into unaligned accesses. Writing safe, correct, and valid source code
> is the programmer's responsibility. Turning it into small and fast
> object code is the compiler's responsibility. Don't try to do the
> compiler's job - work with it so that it can do the best job it can.

I did an experiment using godbolt (ARM gcc 8.2, compiler settings:
-mcpu=cortex-m4 -Os). In the below code, there are two functions,
`marshal1` and `marshal2`. Both functions serialize an object of type
`struct msg` into a byte array. `marshal1` does it member-by-member;
`marshal2` does it with a single memcpy as I described in my previous
post.

// Code:

#include <inttypes.h>
#include <string.h>

struct msg {
    uint8_t a;
    uint32_t b;
    uint16_t c;
    uint8_t d;
    uint16_t e[4];
} __attribute__((packed));

void marshal1(const struct msg *m, uint8_t *out) {
    out[0] = m->a;
    memcpy(&out[1], &m->b, 4);
    memcpy(&out[5], &m->c, 2);
    out[7] = m->d;
    memcpy(&out[8], m->e, 8);
}

void marshal2(const struct msg *m, uint8_t *out) {
    memcpy(out, m, sizeof *m);
}

// Result:

marshal1:
    ldrb r3, [r0] @ zero_extendqisi2
    strb r3, [r1]
    ldr r3, [r0, #1] @ unaligned
    str r3, [r1, #1] @ unaligned
    ldrh r3, [r0, #5] @ unaligned
    strh r3, [r1, #5] @ unaligned
    ldrb r3, [r0, #7] @ zero_extendqisi2
    strb r3, [r1, #7]
    ldr r3, [r0, #8]! @ unaligned
    str r3, [r1, #8] @ unaligned
    ldr r3, [r0, #4] @ unaligned
    str r3, [r1, #12] @ unaligned
    bx lr
marshal2:
    add r3, r0, #16
.L3:
    ldr r2, [r0], #4 @ unaligned
    str r2, [r1], #4 @ unaligned
    cmp r0, r3
    bne .L3
    bx lr

The output for `marshal2` is obviously smaller than `marshal1`. How can
I achieve the same results without relying on this technique?

Chris

Keith Thompson

unread,
Aug 7, 2019, 8:52:12 PM8/7/19
to
Bart <b...@freeuk.com> writes:
[...]
> You have to keep an eye on what's going on, otherwise you may end up
> with a struct that needs 65 bytes. Will the compiler say anything about
> that, or just accept it, and try and create arrays with a stride of 65
> bytes, or pad them to 128 bytes?
>
> I doubt it will re-design the struct to keep it to a power-of-two.
[...]

If the target imposes alignment constraints and you don't use any
compiler extensions to force packing, the compiler will do what it needs
to do to prevent unaligned access. There's unlikely to be any need to
pad a 65-byte structure to 128 bytes. It would probably just be padded
to 68 or perhaps 72 bytes, depending on the alignment requirements of
its members.

For example:

#include <stdio.h>
#include <stdint.h>
int main(void) {
    struct foo {
        uint32_t arr[16];
        char c;
    };
    struct bar {
        uint64_t arr[8];
        char c;
    };
    printf("struct foo is %zu bytes\n", sizeof (struct foo));
    printf("struct bar is %zu bytes\n", sizeof (struct bar));
}

On my system:

struct foo is 68 bytes
struct bar is 72 bytes

The stride of an array of FOO is always equal to sizeof (FOO); there is
no padding between array elements.

Chris M. Thomasson

unread,
Aug 7, 2019, 11:55:40 PM8/7/19
to
On 8/7/2019 1:07 AM, Bonita Montero wrote:
>> https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
>
>
> Why is this necessary? We have M(O)ESI.

The inherent acquire and release wrt loads and stores on Intel is simply
not strong enough to get hazard pointers up and running. Think about it.
Loading/Storing from memory into a local variable implies
acquire/release semantics wrt a MOV instruction. Iirc, it's in WB memory,
or something. So, this is not strong enough for SMR, or Safe Memory
Reclamation, aka: Hazard Pointers. Intel is NOT seq_cst... ;^)

SMR needs to load back a value _after_ the previous store obtained real
data. Intel does NOT allow for this without an explicit fence. A store
followed by a load to another location, can and will be reordered. Well,
this is not Kosher wrt SMR!

load from A
store A in B
load from A
does A == B?

the store of A into B needs to be _committed_ before the subsequent load
from A. This can use MFENCE or a LOCK prefix wrt a "dummy" RMW, or even
directly. So:

load from A
store A in B
MFENCE
load from A
does A == B


This is an explicit memory barrier that needs to be added, even on
Intel, believe it or not. It's due to the usage pattern of the algorithm
and the details of Intel.
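
As a rough C++ rendering of the load / store / fence / reload / compare
sequence above: the function name acquire_hazard and its parameters are
hypothetical, and this shows only the acquire side of a hazard pointer
scheme, not the reclamation side. The seq_cst fence stands in for the
MFENCE (or LOCKed RMW) mentioned above.

```cpp
#include <atomic>

// Hypothetical helper: publish a hazard pointer for the object currently
// referenced by `src`, retrying until the published value is still current.
template<typename T>
T *acquire_hazard(const std::atomic<T *> &src, std::atomic<T *> &hazard)
{
    T *p = src.load(std::memory_order_relaxed);               // load from A
    for (;;)
    {
        hazard.store(p, std::memory_order_relaxed);           // store A in B
        // StoreLoad barrier: the store above must be visible before the
        // reload below. Plain x86 loads and stores do not guarantee this.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        T *q = src.load(std::memory_order_acquire);           // load from A again
        if (q == p)                                           // does A == B?
            return p;
        p = q;                                                // A changed; retry
    }
}
```

The loop only returns once the pointer that was published is the one
still stored in src after the fence.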

Chris M. Thomasson

unread,
Aug 8, 2019, 1:42:59 AM8/8/19
to
On 8/6/2019 11:57 PM, Bonita Montero wrote:
>> Try using unaligned addresses with several threads.
>
> That's not relevant to me because I only wanted to measure the
> cost of an unaligned access.
>
>> Try doing a LOCK XADD on a location that straddles two cache
>> lines, ..
>
> That's also not relevant to me; and not only to me because no
> one would do that in reality because this isn't of any use.

Fair enough.

Bonita Montero

unread,
Aug 8, 2019, 1:43:24 AM8/8/19
to
> Out of interest, how are you getting an unaligned_uint32 object
> which is in fact unaligned for the target in question?

I also saw that, but I thought it was sufficient for an example and
David would be able to work out the rest. And even though a directive
to enforce unaligned placement is missing, the object might be placed
unaligned by casting a pointer; so the missing directive doesn't count.

Chris M. Thomasson

unread,
Aug 8, 2019, 1:54:37 AM8/8/19
to
Think of unaligned straddling a cache line!

Bonita Montero

unread,
Aug 8, 2019, 2:03:20 AM8/8/19
to
If you are doing lock-free synchronization and thereby calling
FlushProcessWriteBuffers, you've lost anyway. FlushProcessWriteBuffers
is a kernel call and slow. If you are doing it that way, you could
stick with usual locking and get better performance.

Bonita Montero

unread,
Aug 8, 2019, 2:04:31 AM8/8/19
to
That has nothing to do with Chris's or my statement.

Bonita Montero

unread,
Aug 8, 2019, 2:06:43 AM8/8/19
to
I don't think FPWB is suitable for lock-free programming because
it's a kernel call and thereby very slow.

Bonita Montero

unread,
Aug 8, 2019, 2:14:44 AM8/8/19
to
Consider this:

#include <Windows.h>
#include <iostream>
#include <cstdint>
#include <intrin.h>

using namespace std;

int main()
{
    unsigned const ROUNDS = 10'000'000;
    int64_t start, end;
    double ticksPerCall;

    // time ROUNDS back-to-back calls with the time-stamp counter
    start = (int64_t)__rdtsc();
    for( unsigned i = 0; i != ROUNDS; ++i )
        FlushProcessWriteBuffers();
    end = (int64_t)__rdtsc();

    ticksPerCall = (end - start) / (double)ROUNDS;
    cout << ticksPerCall << endl;
}

This gives about 1,100 clock cycles per call on my 1800X.
Even this is not accurate because the TSC might tick at the base clock;
that's horrible.

Chris M. Thomasson

unread,
Aug 8, 2019, 2:54:22 AM8/8/19
to
Calls to FlushProcessWriteBuffers are all on the slow side. The slow
side of the asymmetric sync, think about it for a moment.

Chris M. Thomasson

unread,
Aug 8, 2019, 2:59:46 AM8/8/19
to
On 8/7/2019 11:06 PM, Bonita Montero wrote:
> I don't think FPWB is suitable for lock-free programming because
> it's a kernel call and thereby very slow.

Abusing the bus lock can create very excellent performance. User space
RCU wrt read "mostly" work loads..

Chris M. Thomasson

unread,
Aug 8, 2019, 3:00:21 AM8/8/19
to
FlushProcessWriteBuffers() is called on the slowpath. Did you read my links?

Chris M. Thomasson

unread,
Aug 8, 2019, 3:16:16 AM8/8/19
to
Ahh shi%. Sorry again. I keep thinking of where unaligned access on x86
wrt LOCK can possibly be "useful".

Bonita Montero

unread,
Aug 8, 2019, 3:30:11 AM8/8/19
to
What you do doesn't make sense. When you have a packed data structure,
it's the thing you want to persist or send over the network. So what
you put into out should be directly what the data structure holds.

David Brown

unread,
Aug 8, 2019, 3:56:49 AM8/8/19
to
On 07/08/2019 22:48, Bonita Montero wrote:
> Consider this code. It compiles on whether certain platforms support
> unaligned loads with proper code and without vaguely relying on the
> compiler to strip away the assembly of a uint32_t with shifts and
> ORs with a single load / store.
>

I support the principle of using conditional compilation to let you use
known efficient methods on platforms that support it, and fall back to
generic but safe methods elsewhere. That is a good way to handle
getting maximal efficiency on platforms you view as important, while
keeping portability.

Unfortunately, you are completely wrong about where unaligned accesses
are actually supported by the compiler. You have identified some
/targets/ that support unaligned access, but the /compilers/ do not
support it.

Note that this does /not/ mean the compilers will not generate code that
does what you want. The code you write might work fine in testing. You
could look at the assembly, and it looks fine. And then one day you
make a minor change to another part of the code, and suddenly it does
/not/ work as you expect. Or you upgrade your compiler. Or you change
an optimisation flag.

C (C++ follows here) does not allow you to create an unaligned pointer
by conversions - see C11 6.3.2.3p7 "A pointer to an object type may be
converted to a pointer to a different object type. If the resulting
pointer is not correctly aligned for the referenced type, the
behavior is undefined." It does not, except for certain cases, allow
you to take a pointer to one type, convert it to a pointer to a
different type, and use that to access data (6.5p7). It does not allow
you to access data through an unaligned pointer (6.5.3.2p4).

Nothing in the gcc documentation, nor the MSVC documentation I have
read, nor the documentation for any other compiler I have ever used (and
that's a lot, on many embedded systems) gives the impression that the
compilers support unaligned access.


You are living on luck. That is a ridiculous attitude to take, when it
is not difficult to get code that is correct /and/ efficient.


> #include <cstdint>
>
> #ifdef _MSC_VER
>     // MSC only works on x86/x64 and ARMv8
>     #define SUPPORTS_UNALIGNED

MSVC does not, as far as I can tell, support unaligned accesses on any
platform unless you use the __unaligned keyword.

> #elif(__GNUC__)
>     #if defined(__x86_64__) || defined(__i386__)
>         #define SUPPORTS_UNALIGNED

gcc does not support unaligned accesses on any platforms.

>     #elif defined(__aarch64__)
>         #define SUPPORTS_UNALIGNED

Compilers for aarch64 do not support unaligned accesses unless they
specifically document that they do.
That would work (as I said above, the principle of conditional
compilation here is right), except for three things:

1. None of the implementations you test for support unaligned access,
even though the underlying hardware does. So SUPPORTS_UNALIGNED should
not be set for those implementations.

2. There is no legal way in standard C or C++ to create and use an
"unaligned_uint32" object or lvalue that is not correctly aligned.

3. If you are thinking of using casts to get unaligned_uint32 pointers
from char pointers, you fall foul of the type-based pointer aliasing
rules of C - you can't do that unless you originally started with
correctly aligned unaligned_uint32 objects.

Particular compilers, extensions, flags, etc., can of course allow these.

David Brown

unread,
Aug 8, 2019, 3:58:28 AM8/8/19
to
It is not the efficiency that is the main problem here - it is the
correctness of the code (or lack thereof).

Chris M. Thomasson

unread,
Aug 8, 2019, 4:20:39 AM8/8/19
to
The unaligned access with LOCK prefix is a total hack to force a bus lock.

David Brown

unread,
Aug 8, 2019, 4:40:08 AM8/8/19
to
On 07/08/2019 23:52, Bart wrote:
> On 07/08/2019 21:18, David Brown wrote:
>> On 07/08/2019 21:50, Christopher Collins wrote:
>>> On 2019-08-07, David Brown <david...@hesbynett.no> wrote:
>>>> There is rarely any reason for wanting unaligned access, and no
>>>> justification for using it in code that should be portable.
>>>
>>> There is one good reason for wanting it: reduced code size.
>>
>> Nope.
>>
>> Write your code properly and safely (using shifts and masks, memcpy,
>> or compiler-specific features or intrinsics) and let the compiler turn
>> it into unaligned accesses.  Writing safe, correct, and valid source
>> code is the programmer's responsibility.  Turning it into small and
>> fast object code is the compiler's responsibility.  Don't try to do
>> the compiler's job - work with it so that it can do the best job it can.
>
> You have to keep an eye on what's going on, otherwise you may end up
> with a struct that needs 65 bytes. Will the compiler say anything about
> that, or just accept it, and try and create arrays with a stride of 65
> bytes, or pad them to 128 bytes?
>
> I doubt it will re-design the struct to keep it to a power-of-two.

I don't see much connection between struct sizes and what we have
been discussing, but I will try to answer anyway.

Compilers will add padding in structs to keep alignments correct for the
fields. The alignment for the struct itself will be that of the largest
alignment of any field, and there will be padding at the end of the
struct if needed to make the whole thing a multiple of this alignment
(so that arrays of the struct work).

So you won't get a struct array that has a stride of 65 bytes unless the
struct itself has size 65 bytes. And you won't get that unless the
component fields have alignments 1, 5 or 13. (I don't know any
platforms with types of alignment 5 or 13, but it is hypothetically
possible.)
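The padding rules above can be checked directly. What follows is a minimal sketch, not anything from the original post: the struct `Sample` and the helper `sample_stride_demo` are hypothetical names, and the comments about where padding lands assume a typical ABI where `std::uint32_t` has alignment 4. The static assertions express the general rules (member offsets are aligned, total size is a multiple of the struct's alignment, array stride equals `sizeof`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical struct used only for illustration.
struct Sample {
    char          tag;    // alignment 1
    std::uint32_t value;  // usually alignment 4, so padding usually precedes it
    char          flag;   // tail padding usually follows this member
};

// The struct is at least as strictly aligned as its strictest member.
static_assert(alignof(Sample) >= alignof(std::uint32_t),
              "struct alignment covers its members");
// Each member sits at an offset that is correctly aligned for its type.
static_assert(offsetof(Sample, value) % alignof(std::uint32_t) == 0,
              "member offset is aligned");
// The total size is a multiple of the alignment, so arrays work.
static_assert(sizeof(Sample) % alignof(Sample) == 0,
              "size is a multiple of alignment");

// The stride between array elements is always sizeof(Sample).
inline std::size_t sample_stride_demo()
{
    Sample a[2];
    std::size_t stride = static_cast<std::size_t>(
        reinterpret_cast<char *>(&a[1]) - reinterpret_cast<char *>(&a[0]));
    assert(stride == sizeof(Sample));
    return stride;
}
```

On a typical x86-64 ABI this struct would be expected to come out at 12 bytes rather than 6, but that figure is an assumption about the target, which is why the sketch asserts only the general relationships.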

As for checking the size of structs, compilers might be able to give you
help such as gcc's "-Wpadded" warning that is issued if there is any
padding added to structs. Most people don't want such a warning, of
course, but I have sometimes found it useful.

The best method of checking struct sizes IMHO is with C11 static assertions:

_Static_assert(sizeof(struct X) == 65, "Struct X should be 65 bytes");

or, from C23 onwards (where the message is optional), you might prefer just:

_Static_assert(sizeof(struct X) == 65);


Prior to C11, you can force compile-time errors with declarations like this:

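#define STATIC_ASSERT_CAT_(a, b) a##b
#define STATIC_ASSERT_NAME_(line) STATIC_ASSERT_CAT_(static_assert_line_, line)
#define STATIC_ASSERT(claim) \
typedef struct { char STATIC_ASSERT_NAME_(__LINE__) [ \
(claim) ? 1 : -1]; } STATIC_ASSERT_NAME_(__LINE__)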

STATIC_ASSERT(sizeof(struct X) == 65);


Yes, the macro is a little ugly and the error message on failures
complains about array types of negative sizes, but it works nicely, can
be put at file scope or inside a function, and generates no extra code,
data space, or external symbols.

>
>> Compiler extensions like "packed" are fine in my book.  Use them if
>> they make the code clearer and more efficient - assuming, of course,
>> that the lack of portability is not a problem.  Compilers that don't
>> support such features will complain, so you don't get silent problems.
>
> (I've had silent problems with 'pack(1)', where int* accesses that I
> knew were aligned, were assumed to be unaligned by gcc, and generated
> byte-at-a-time accesses, which in that app slowed things down to 1/3 the
> expected speed. Another reason to keep a check on what it is doing.)
>

Don't use "packed" attributes or pragmas unless you have to. One
disadvantage is that you lose alignment information and, depending on
the compiler, flags and target, you might get inefficient code such as
the byte-at-a-time accesses you mentioned. But there are several ways
to avoid that in gcc:

1. Enable optimisation. That will coalesce the byte-at-a-time accesses
on targets that support unaligned accesses.

2. Add an attribute "aligned" to the type or variable to tell the
compiler about the struct's alignment. It will figure out the alignment
of the fields from there.

3. Use C11's _Alignas to get the same effect.

4. Wrap the packed struct in a union with a field that has the alignment
you want. Then use that outer union type rather than the struct type,
and the internal struct will have the biggest alignment of the union fields.

5. Use gcc's __builtin_assume_aligned() function when accessing members.
This would quickly make the code quite ugly, but it is the most
flexible as you can tell the compiler that a pointer is, for example, 3
bytes away from 8 byte alignment.
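As a rough illustration of option 2, here is a sketch that combines "packed" with "aligned". It assumes GCC or Clang (the `__attribute__` syntax is compiler-specific), and the names `Record` and `record_sum` are hypothetical. The idea is that the struct as a whole is declared 4-byte aligned, so the compiler can work out which members are really aligned and which are not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// GCC/Clang-specific: packed layout, but the struct itself is 4-byte aligned.
struct __attribute__((packed, aligned(4))) Record {
    std::uint32_t id;      // offset 0: aligned, because the struct is
    std::uint8_t  kind;    // offset 4
    std::uint16_t length;  // offset 5: genuinely misaligned
};

static_assert(alignof(Record) == 4, "aligned attribute took effect");
static_assert(offsetof(Record, length) == 5, "packed attribute took effect");
static_assert(sizeof(Record) == 8, "7 bytes of members rounded up to alignment");

// Members are accessed by name, so the compiler picks a safe access for each.
inline unsigned record_sum(const Record &r)
{
    return r.id + r.kind + r.length;
}

inline void record_demo()
{
    Record r{};
    r.id = 1;
    r.kind = 2;
    r.length = 3;
    assert(record_sum(r) == 6u);
}
```

Whether the access to `id` actually becomes a single word load depends on the compiler version, flags and target, so it is worth checking the generated code rather than assuming it.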

David Brown

unread,
Aug 8, 2019, 4:43:31 AM8/8/19
to
On 08/08/2019 02:27, Keith Thompson wrote:
> David Brown <david...@hesbynett.no> writes:
> [...]
>> Compiler extensions like "packed" are fine in my book. Use them if they
>> make the code clearer and more efficient - assuming, of course, that the
>> lack of portability is not a problem. Compilers that don't support such
>> features will complain, so you don't get silent problems.
>
> Compiler extensions like "packed" can cause problems. For example,
> suppose you have something like this (using gcc syntax):
>
> struct foo {
> char c;
> int i;
> } __attribute__((packed));
> foo obj;
> some_func(&obj.i);
>
> some_func() takes an argument of type int*, but there's no indication
> that it's misaligned, so it can't take any special steps to avoid
> blowing up when it dereferences its argument.
>
> Recent versions of gcc and clang warn about taking the address of a
> misaligned member of a packed structure.
>
> https://stackoverflow.com/q/8568432/827263
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51628
>

Agreed.

I did not mean to imply that "packed" is something you should use widely
- IMHO, it is usually a poor choice, and I rarely use it myself. I
merely mean that I think it is fine to use compiler-specific features
like that if they give you the code you need. They are defined
behaviour for the compiler you are using, and give hard compile-time
errors on compilers that don't support them. The code will not be as
portable, but you will avoid surprises with silent errors.
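One way to sidestep the problem in the quoted example is to never let the address of the packed member escape. The sketch below assumes GCC or Clang for the attribute syntax; `Foo` and `some_func` mirror the quoted snippet, while `read_i_safely` is a hypothetical helper. The member is copied by value into an ordinary `int`, and only that copy's address is passed on.

```cpp
#include <cassert>

// Same shape as the quoted example (GCC/Clang attribute syntax).
struct Foo {
    char c;
    int  i;   // at offset 1, so usually misaligned
} __attribute__((packed));

// Takes an ordinary int*, and is entitled to assume correct alignment.
inline int some_func(const int *p)
{
    return *p;
}

inline int read_i_safely(const Foo &obj)
{
    // Reading obj.i by name lets the compiler emit a load that is safe
    // for a misaligned member.
    int tmp = obj.i;
    // tmp is a normal, correctly aligned int, so its address is safe to pass.
    return some_func(&tmp);
}

inline void packed_member_demo()
{
    Foo obj;
    obj.c = 'x';
    obj.i = 12345;
    assert(read_i_safely(obj) == 12345);
}
```

If the callee needs to modify the member, the same pattern works in reverse: call it on the temporary and assign the result back to `obj.i`.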


Chris Vine

unread,
Aug 8, 2019, 5:30:03 AM8/8/19
to
Yes, it's a good example of how to access an unaligned object assuming
your conditionals are right (about which I defer to others): my
curiosity was how the union object got to be unaligned in the first
place.

If you are casting a pointer to get there you are breaching the strict
aliasing rules so you have undefined behaviour anyway (although of
course some compilers allow it with a suitable compiler flag such as
-fno-strict-aliasing). I suppose the other main way of getting
an unaligned unaligned_uint32 object would be with placement new, but
again placement new gives undefined behaviour if you construct an object
using it in incorrectly aligned memory. Of course, malloc() and
operator new() guarantee that the memory they provide is correctly
aligned for any type that could be constructed in it, so that isn't
normally a problem.

I guess using placement new with a stack allocated buffer is the problem
area you might fall into, but doing that kind of thing makes me really
nervous, so I would need to have a very unusual use case. C++ does
provide some help: you can I think safely construct an object T in a
stack allocated buffer constructed this way:

alignas(T) char buffer[sizeof(T)];

You probably know all this. My purpose in reciting it is to indicate
how nervous I would be of this kind of code.
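For concreteness, a minimal sketch of the aligned stack buffer with placement new might look like this. `Widget` and `build_in_buffer` are hypothetical names chosen for the illustration; the only part taken from the text above is the `alignas(T) char buffer[sizeof(T)]` declaration.

```cpp
#include <cassert>
#include <new>

// Hypothetical type standing in for T.
struct Widget {
    int    a;
    double b;
};

inline double build_in_buffer()
{
    // Stack storage with the size and alignment Widget requires.
    alignas(Widget) char buffer[sizeof(Widget)];

    // Construct a Widget in that storage.
    Widget *w = new (buffer) Widget{1, 2.5};
    double result = w->a + w->b;

    // Placement new needs an explicit destructor call, not delete.
    w->~Widget();
    return result;
}

inline void placement_demo()
{
    assert(build_in_buffer() == 3.5);
}
```

Without the `alignas`, a plain `char` array is only guaranteed alignment 1, which is exactly the incorrectly aligned memory case described above.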

David Brown

unread,
Aug 8, 2019, 5:40:08 AM8/8/19
to
On 08/08/2019 02:27, Keith Thompson wrote:
> David Brown <david...@hesbynett.no> writes:
>> /Please/ learn to use Usenet properly! Keep attributions, and quote an
>> appropriate amount of context!
>>
>> On 07/08/2019 16:53, Bonita Montero wrote:
> [SNIP]
>
> It's not a matter of learning. Bonita Montero's headers indicate
> that she(?) is using Thunderbird, which I'm reasonably sure handles
> attribution lines correctly. She must be removing them deliberately.
> I asked her to keep attribution lines a while ago. She responded
> with insults.
>
> Do you use a killfile?
>

I do have a killfile (or, to be precise, Thunderbird filter rules), but
don't put people in it very often - mostly it is for spambots. Usually
I just ignore most of their posts - as I have done with Bonita's in the
past, and probably will in the future. Sometimes I find a topic
interesting enough or important enough to override my "manual killfile".
I don't think there is any one good solution to all this - there is no
way to have a free and open Usenet and yet force people to follow simple
rules and common courtesy, and no way to block all rudeness without also
blocking things of interest. We just all have to make a balance that
seems good enough for ourselves.

David Brown

unread,
Aug 8, 2019, 6:51:12 AM8/8/19
to
The distinction here must be between what is allowed by the language
(combining the standards and any compiler-specific defined features) and
what is allowed by the hardware. Many processors support unaligned
accesses. For some, these give significant speed penalties or
limitations (such as being non-atomic), while others have more efficient
support.

But C (and C++) does not support unaligned accesses. Most ways of
creating pointers to unaligned objects are undefined behaviour even
before you try to use them for access.

So the correct way to code this is using either portable constructs
(such as memcpy) which are given as byte moves, or using some
implementation-specific feature. It is then up to the compiler to turn
this into optimal code, such as using an unaligned move instruction. It
is not necessarily slower - indeed, on gcc you will get a single word
move instruction for a variety of different byte move combinations.
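A minimal sketch of the portable memcpy approach is shown below. The helper names `load_u32` and `store_u32` are hypothetical. The code never forms a misaligned `uint32_t` lvalue; it only copies bytes, and it is then up to the optimiser whether that becomes a single move instruction on the target in question.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Read a 32-bit value from an address with no alignment guarantee.
inline std::uint32_t load_u32(const void *p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);   // byte copy, valid at any alignment
    return v;
}

// Write a 32-bit value to an address with no alignment guarantee.
inline void store_u32(void *p, std::uint32_t v)
{
    std::memcpy(p, &v, sizeof v);
}

// Round trip through an odd offset inside a byte buffer.
inline void unaligned_round_trip_demo()
{
    unsigned char buf[8] = {};
    store_u32(buf + 1, 0x11223344u);
    assert(load_u32(buf + 1) == 0x11223344u);
    assert(buf[0] == 0 && buf[5] == 0);   // neighbouring bytes untouched
}
```

The value is stored in the host's native byte order. If a fixed byte order is needed, for example for a file format, the shift-and-mask style is the portable choice instead.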

The nearest standard C has to saying "I want this to be a single word
move regardless of alignment" is to use volatile pointers. Compilers
rarely document what would happen in this case, but it would be an
extremely odd compiler that did not issue the access regardless of
alignment, invalid pointer casts, etc.


> For some CPUs, aligned access gives you better performance, but
> unaligned access still works. For others, unaligned access just goes
> kablooie.
>

Yes, that's a fine way to describe it!

Bart

unread,
Aug 8, 2019, 8:49:00 AM8/8/19
to
On 08/08/2019 09:39, David Brown wrote:
> On 07/08/2019 23:52, Bart wrote:

> I don't see much connection between struct sizes and what we have
> been discussing, but I will try to answer anyway.

I was picking up on the suggestion to just let a compiler get on with
its job. I'm saying that for things like this it will still need to be
checked, and if necessary to tweak a struct to get a more satisfactory
result.

That can include deliberately using some misalignment although usually
you will try and avoid that.

> Compilers will add padding in structs to keep alignments correct for the
> fields. The alignment for the struct itself will be that of the largest
> alignment of any field, and there will be padding at the end of the
> struct if needed to make the whole thing a multiple of this alignment
> (so that arrays of the struct work).
>
> So you won't get a struct array that has a stride of 65 bytes unless the
> struct itself has size 65 bytes.

The size might be 65 or more than 65, neither ideal if you were hoping
for 64.

(If I'm generating C, then I'd use pack(1) since the source language
does not do automatic padding, and then a size of 65 is possible for any
mix of element types. Although that would usually be picked up in the
original language before I start generating C versions.)

> As for checking the size of structs, compilers might be able to give you
> help such as gcc's "-Wpadded" warning that is issued if there is any
> padding added to structs. Most people don't want such a warning, of
> course, but I have sometimes found it useful.
>
> The best method of checking struct sizes IMHO is with C11 static assertions:
>
> _Static_assert(sizeof(struct X) == 65, "Struct X should be 65 bytes");

I would just use printf("size = %<whatever>\n",sizeof(struct X)) while
creating it. Just to ensure it is some sane value.

> But there are several ways
> to avoid that in gcc:
>
> 1. Enable optimisation. That will coalesce the byte-at-a-time accesses
> on targets that support unaligned accesses.

On that specific example (that would have been on gcc 4.x on RPi in
2012), it was optimised, and the ARM32 processor /was/ capable of
unaligned access, but gcc still generated byte-at-a-time code.

Jorgen Grahn

unread,
Aug 8, 2019, 8:51:36 AM8/8/19
to
On Wed, 2019-08-07, Bonita Montero wrote:

[David Brown, attribution missing as usual]

>> Can you give any kind of a reference for even a single case where you
>> know the OS will trap unaligned accesses and emulate them in software?
>
> It's just a tiny task to support this by an OS and helps to run a lot
> of old code; so this is very likely.

I take that as a 'no'.

(That aspect got lost in David Brown's frustrated response, which is
the only reason I bring it up.)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

David Brown

unread,
Aug 8, 2019, 9:53:14 AM8/8/19
to
On 08/08/2019 14:48, Bart wrote:
> On 08/08/2019 09:39, David Brown wrote:
>> On 07/08/2019 23:52, Bart wrote:
>
>> I don't see much connection between struct sizes and what we have
>> been discussing, but I will try to answer anyway.
>
> I was picking up on the suggestion to just let a compiler get on with
> its job. I'm saying that for things like this it will still need to be
> checked, and if necessary to tweak a struct to get a more satisfactory
> result.
>

Struct size and padding follow strict rules - they do not vary between
compilers (for the same target ABI), by version or by options. If you
know the rules, the results are always predictable.
(Of course they can still be hard to check manually if the struct is big
or the member types are complicated).

Occasionally you will want to tweak a struct for better results. For
example, if it is of size 60 bytes, you might want to add a 4-byte dummy
entry so that it is 64 bytes giving more efficient use in an array
(especially when you arrange for it to have the same alignment as a
cache line). But that sort of thing is not very common.
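A sketch of that 60-to-64 byte tweak, under the assumption that a cache line is 64 bytes (true of many current CPUs, but not guaranteed). `Slot` and `slots_are_line_aligned` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// 60 bytes of payload plus an explicit 4-byte dummy, aligned to an
// assumed 64-byte cache line.
struct alignas(64) Slot {
    std::uint32_t payload[15];   // 60 bytes of real data
    std::uint32_t reserved;      // dummy entry to make 64
};

static_assert(sizeof(Slot) == 64, "one Slot per assumed cache line");
static_assert(alignof(Slot) == 64, "Slot starts on an assumed cache line");

// Every element of an array of Slot then starts on a 64-byte boundary.
inline bool slots_are_line_aligned()
{
    static Slot slots[4];
    for (const Slot &s : slots) {
        if (reinterpret_cast<std::uintptr_t>(&s) % 64 != 0)
            return false;
    }
    return true;
}

inline void slot_demo()
{
    assert(slots_are_line_aligned());
}
```

With `alignas(64)` the compiler would round the size up to 64 anyway. Spelling out the dummy member simply makes the layout visible in the source.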

> That can include deliberately using some misalignment although usually
> you will try and avoid that.

I don't imagine there is much use for deliberate misalignment. But
sometimes additional alignment (typically to cache line size) can be
helpful.

>
>> Compilers will add padding in structs to keep alignments correct for the
>> fields.  The alignment for the struct itself will be that of the largest
>> alignment of any field, and there will be padding at the end of the
>> struct if needed to make the whole thing a multiple of this alignment
>> (so that arrays of the struct work).
>>
>> So you won't get a struct array that has a stride of 65 bytes unless the
>> struct itself has size 65 bytes.
>
> The size might be 65 or more than 65, neither ideal if you were hoping
> for 64.

True. But the size and padding follow set rules, which allows you to
re-arrange fields to minimise padding.
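As an illustration, here are two hypothetical structs with the same members in different orders. The sizes in the comments assume a typical ABI (4-byte alignment for `std::uint32_t`, 2-byte for `std::uint16_t`). The assertions check only that ordering from most to least strictly aligned does not make things worse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Members interleaved by declaration convenience: typically 12 bytes.
struct Loose {
    char          a;
    std::uint32_t b;   // usually preceded by 3 bytes of padding
    char          c;
    std::uint16_t d;   // usually preceded by 1 byte of padding
};

// Same members, largest alignment first: typically 8 bytes, no padding.
struct Tight {
    std::uint32_t b;
    std::uint16_t d;
    char          a;
    char          c;
};

static_assert(sizeof(Tight) <= sizeof(Loose),
              "reordering should not increase the size");

// Number of bytes saved per element by the reordering.
inline std::size_t padding_saved()
{
    return sizeof(Loose) - sizeof(Tight);
}

inline void reorder_demo()
{
    assert(sizeof(Tight) >= 8);   // the members themselves need 8 bytes
    assert(padding_saved() + sizeof(Tight) == sizeof(Loose));
}
```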

>
> (If I'm generating C, then I'd use pack(1) since the source language
> does not do automatic padding, and then a size of 65 is possible for any
> mix of element types. Although that would usually be picked up in the
> original language before I start generating C versions.)

I think if you want to translate directly from your source language into
C concepts, you should make your source language match the C rules here.

>
>> As for checking the size of structs, compilers might be able to give you
>> help such as gcc's "-Wpadded" warning that is issued if there is any
>> padding added to structs.  Most people don't want such a warning, of
>> course, but I have sometimes found it useful.
>>
>> The best method of checking struct sizes IMHO is with C11 static
>> assertions:
>>
>> _Static_assert(sizeof(struct X) == 65, "Struct X should be 65 bytes");
>
> I would just use printf("size = %<whatever>\n",sizeof(struct X)) while
> creating it. Just to ensure it is some sane value.
>

IMHO, it is a lot better to have a compile-time assertion that is always
checked whenever the code is compiled, than a run-time check that you
confirm manually when you remember to include it and when you remember
to look at it. Printing the size of the struct can be handy if the
static assertion fails and you can't immediately see why - the static
assertion will tell you the struct is not 65 bytes, but it won't tell
you what it actually is.


>> But there are several ways
>> to avoid that in gcc:
>>
>> 1. Enable optimisation.  That will coalesce the byte-at-a-time accesses
>> on targets that support unaligned accesses.
>
> On that specific example (that would have been on gcc 4.x on RPi in
> 2012), it was optimised, and the ARM32 processor /was/ capable of
> unaligned access, but gcc still generated byte-at-a-time code.
>

It's hard to guess details from that description. But it is fair to say
that "enable optimisation" has no guarantees - gcc can spot common
patterns and optimise them, but not all possible patterns. And results
can vary by compiler version and target, with improvements for each
generation. "gcc 4.x" could be nearly 15 years old, or just 5 years old.

Scott Lurndal

unread,
Aug 8, 2019, 10:46:17 AM8/8/19
to
David Brown <david...@hesbynett.no> writes:
>On 08/08/2019 00:51, Keith Thompson wrote:
>> David Brown <david...@hesbynett.no> writes:
>>> On 07/08/2019 14:47, Bonita Montero wrote:

>
>But C (and C++) does not support unaligned accesses.

Just a nit, but the _standard_ doesn't support unaligned accesses.

Every C and C++ compiler I've used supports them (both via some
compiler specific 'packed' attribute and via simple casts).

And while the behavior is undefined per the standard, it's quite
well defined per architecture if not portable.

Bonita Montero

unread,
Aug 8, 2019, 10:48:03 AM8/8/19
to
>> Think of unaligned straddling a cache line!

> It is not the efficiency that is the main problem here - it is the
> correctness of the code (or lack thereof).

Whether it works as expected depends on the platform.

Bonita Montero

unread,
Aug 8, 2019, 10:50:10 AM8/8/19
to
> If you are casting a pointer to get there you are breaching the strict
> aliasing rules so you have undefined behaviour anyway (although of
> course some compilers allow it with a suitable compiler flag such as
> -fno-strict-aliasing).

Doesn't matter! My code is written with conditional-compilation for
certain platforms where this does work. And it might be extended for
other platforms which also work.

Bonita Montero

unread,
Aug 8, 2019, 10:51:14 AM8/8/19
to
> Unfortunately, you are completely wrong about where unaligned accesses
> are actually supported by the compiler. You have identified some
> /targets/ that support unaligned access, but the /compilers/ do not
> support it.

The compilers I conditionally select support it.
And the list might be extended.

Bonita Montero

unread,
Aug 8, 2019, 10:55:10 AM8/8/19
to
>>> Can you give any kind of a reference for even a single case where you
>>> know the OS will trap unaligned accesses and emulate them in software?

>> It's just a tiny task to support this by an OS and helps to run a lot
>> of old code; so this is very likely.

> I take that as a 'no'.

No, that's a "mostly". But it's stupid to rely on that because it's
extremely inefficient - an article on the Web I found reported a slow-
down of about 300x over manually assembling the data-type like I did
it in the code I have shown here on Solaris / SPARC.
It's just a fallback for old code.

Bonita Montero

unread,
Aug 8, 2019, 10:57:50 AM8/8/19
to
Sorry, this wasn't correct. The benchmark was a comparison of
emulated unaligned accesses through trapping with aligned accesses.
But the difference was about 300x.

David Brown

unread,
Aug 8, 2019, 11:05:57 AM8/8/19
to
No - it is written with conditional compilation for platforms where you
/think/ it works. Speculation, guesswork and "it seemed okay when I
tried it" is not how you should be writing code.

David Brown

unread,
Aug 8, 2019, 11:10:53 AM8/8/19
to
Show me links to the documentation supporting this claim.

I don't mean post examples of it working - I mean /documentation/.
Written information, from the compiler writers, saying that this sort of
code is valid for their compiler.

Until you can do that, you are an example of the cowboy attitude to
programming that crashes cars, brings down planes, and causes innumerable
hours of wasted time and frustration when programs crash yet again.

It's one thing to have bugs in code by mistake, or because the
programming task is particularly hard, or because you don't know there
is a problem.

It is inexcusable to write code that you know is wrong and risky, just
because you are too lazy and arrogant to write it safely.

David Brown

unread,
Aug 8, 2019, 11:14:27 AM8/8/19
to
On 08/08/2019 16:46, Scott Lurndal wrote:
> David Brown <david...@hesbynett.no> writes:
>> On 08/08/2019 00:51, Keith Thompson wrote:
>>> David Brown <david...@hesbynett.no> writes:
>>>> On 07/08/2019 14:47, Bonita Montero wrote:
>
>>
>> But C (and C++) does not support unaligned accesses.
>
> Just a nit, but the _standard_ doesn't support unaligned accesses.
>

Unless you are talking about a specific compiler, the standard /is/ the
language.

> Every C and C++ compiler I've used supports them (both via some
> compiler specific 'packed' attribute and via simple casts).

A compiler supports features beyond the standards if it documents that
it supports them. Other than that, you are in the realms of "it worked
when I tested it" - i.e., luck and a problem waiting to happen.

So if your compiler says it works okay, that's fine - for that compiler.
If it supports "packed" attributes, or "__unaligned" keywords, use them
- that's fine.

>
> And while the behavior is undefined per the standard, it's quite
> well defined per architecture if not portable.
>

Sure, unaligned access is often defined on the underlying architecture.
But unless you are programming in assembly, that doesn't matter.

Bonita Montero

unread,
Aug 8, 2019, 11:33:48 AM8/8/19
to
>> Doesn't matter! My code is written with conditional-compilation for
>> certain platforms where this does work. And it might be extended for
>> other platforms which also work.

> No - it is written with conditional compilation for platforms where you
> /think/ it works. Speculation, guesswork and "it seemed okay when I
> tried it" is not how you should be writing code.

There's no speculation. It works.

Bonita Montero

unread,
Aug 8, 2019, 11:35:53 AM8/8/19
to
> Show me links to the documentation supporting this claim.

There need not be any documentation on that. The thing is simply
that most of the time the compiler won't see that an access is
unaligned, so it can't be incompatible with this. So the whole thing is
simply defined by the architecture and not by the compiler.

Bonita Montero

unread,
Aug 8, 2019, 11:38:54 AM8/8/19
to
And even more: compilers for those CPUs that support unaligned access
usually have special #pragma pack() or unaligned-directives. They won't
have these if they don't allow unaligned accesses.

David Brown

unread,
Aug 8, 2019, 3:43:37 PM8/8/19
to
So it is all speculation and guesswork, with no evidence, documentation
or recommendations to back up your position. And as compilers get
better, and in particular as link-time optimisation (or "omniscient
compilation") becomes more popular so that compilers /can/ see when
accesses are aligned or not, the risk of things breaking gets higher.

David Brown

unread,
Aug 8, 2019, 3:45:27 PM8/8/19
to
Compilers support unaligned accesses when using "pragma pack",
"__unaligned", or other such features. Those features are documented to
do the job. They exist /precisely/ because unaligned accesses are not
allowed without them.

The fact that compilers have this kind of extension shows that using
normal pointers for unaligned access is not supported!

Scott Lurndal

unread,
Aug 8, 2019, 4:01:07 PM8/8/19
to
David Brown <david...@hesbynett.no> writes:
>On 08/08/2019 17:38, Bonita Montero wrote:
>> And even more: compilers for those CPUs that support unaligned access
>> usually have special #pragma pack() or unaligned-directives. They won't
>> have these if they don't allow unaligned accesses.
>
>Compilers support unaligned accesses when using "pragma pack",
>"__unaligned", or other such features. Those features are documented to
>do the job. They exist /precisely/ because unaligned accesses are not
>allowed without them.

Or sometimes, they just support them:

int
main(void)
{
unsigned long a_variable = 0xaabbccddeeff0011ul;

unsigned short *b_variable = (unsigned short *)((unsigned char *)&a_variable + 1);

printf("addr=%p value=%x\n", b_variable, *b_variable);

return 0;
}

$ cc -o /tmp/a /tmp/a.c
/tmp/a.c: In function 'main':
/tmp/a.c:9: warning: incompatible implicit declaration of built-in function 'printf'
$ /tmp/a
addr=0x7fff9644d151 value=ff00

>
>The fact that compilers have this kind of extension shows that using
>normal pointers for unaligned access is not supported!
>

Well sort of.