Before I do so, any comments on the following?
Thanks,
Daniel
UNALIGNED MEMORY ACCESSES
=========================
Linux runs on a wide variety of architectures which have varying behaviour
when it comes to memory access. This document presents some details about
unaligned accesses, why you need to write code that doesn't cause them,
and how to write such code!
What's the definition of an unaligned access?
=============================================
Unaligned memory accesses occur when you try to read N bytes of data starting
from an address that is not evenly divisible by N (i.e. addr % N != 0).
For example, reading 4 bytes of data from address 0x10000004 is fine, but
reading 4 bytes of data from address 0x10000005 would be an unaligned memory
access.
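The divisibility rule above can be written down as a small helper for
experimenting in userspace. This is only a sketch: is_naturally_aligned() is
a hypothetical name rather than a kernel function, and it assumes N is
non-zero (in practice a power of two such as 1, 2, 4 or 8).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): report whether an access of
 * n bytes starting at addr satisfies addr % n == 0.
 * n is assumed to be non-zero.
 */
static inline bool is_naturally_aligned(uintptr_t addr, size_t n)
{
	return (addr % n) == 0;
}
```

With the addresses from the example, a 4-byte access at 0x10000004 passes the
check and one at 0x10000005 does not.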
Why unaligned access is bad
===========================
Most architectures are unable to perform unaligned memory accesses. Any
unaligned access causes a processor exception.
Some architectures have an exception handler implemented in the kernel which
corrects the memory access, but this is very expensive and is not true for
all architectures. You cannot rely on the exception handler to correct your
memory accesses.
In summary: if your code causes unaligned memory accesses to happen, your code
will not work on some platforms, and will perform *very* badly on others.
You may be wondering why you have never seen these problems on your own
architecture. Some architectures (such as i386 and x86_64) do not have this
limitation, but nevertheless it is important for you to write portable code
that works everywhere.
Natural alignment
=================
The rule we mentioned earlier forms what we refer to as natural alignment:
When accessing N bytes of memory, the base memory address must be evenly
divisible by N, i.e. addr % N == 0
When writing code, assume the target architecture has natural alignment
requirements.
Sidenote: in reality, only a few architectures require natural alignment
on all sizes of memory access. However, again we must consider ALL supported
architectures; natural alignment is the only way to achieve full portability.
Code that doesn't cause unaligned access
========================================
At first, the concepts above may seem a little hard to relate to actual
coding practice. After all, you don't have a great deal of control over
memory addresses of certain variables, etc.
Fortunately things are not too complex, as in most cases, the compiler
ensures that things will work for you. For example, take the following
structure:
struct foo {
u16 field1;
u32 field2;
u8 field3;
};
Let us assume that an instance of the above structure resides in memory
starting at address 0x10000000. With a basic level of understanding, it would
not be unreasonable to expect that accessing field2 would cause an unaligned
access. You'd be expecting field2 to be located at offset 2 bytes into the
structure, i.e. address 0x10000002, but that address is not evenly divisible
by 4 (remember, we're reading a 4 byte value here).
Fortunately, the compiler understands the alignment constraints, so in the
above case it would insert 2 bytes of padding inbetween field1 and field2.
Therefore, for standard structure types you can always rely on the compiler
to pad structures so that accesses to fields are suitably aligned (assuming
you do not cast the field to a type of different length).
Similarly, you can also rely on the compiler to align variables and function
parameters to a naturally aligned scheme, based on the size of the type of
the variable.
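One way to see the inserted padding is to ask the compiler for the offset of
field2. The sketch below is a userspace approximation: it uses the <stdint.h>
types in place of the kernel's u16/u32/u8, and foo_field2_offset() is a
hypothetical name introduced only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the example structure. */
struct foo {
	uint16_t field1;
	uint32_t field2;
	uint8_t field3;
};

/* Offset of field2 from the start of the structure, padding included. */
static inline size_t foo_field2_offset(void)
{
	return offsetof(struct foo, field2);
}
```

On ABIs where a 32-bit integer requires 4-byte alignment, this should return
4 rather than 2, which reflects the 2 bytes of padding after field1.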
Sidenote: in the above example, you may wish to reorder the fields in the
above structure so that the overall structure uses less memory. For example,
moving field3 to sit inbetween field1 and field2 (where the padding is
inserted) would shrink the overall structure by 1 byte:
struct foo {
u16 field1;
u8 field3;
u32 field2;
};
Sidenote: it should be obvious by now, but in case it is not, accessing a
single byte (u8 or char) can never cause an unaligned access, because all
memory addresses are evenly divisible by 1.
Code that causes unaligned access
=================================
With the above in mind, let's move on to a real life example of a function
that can cause an unaligned memory access. The following function adapted
from include/linux/etherdevice.h is an optimized routine to compare two
ethernet MAC addresses for equality.
unsigned int compare_ether_addr(const u8 *addr1, const u8 *addr2)
{
const u16 *a = (const u16 *) addr1;
const u16 *b = (const u16 *) addr2;
return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}
In the above function, the reference to a[0] causes 2 bytes (16 bits) to
be read from memory starting at address addr1. Think about what would happen
if addr1 was an odd address, such as 0x10000003. (Hint: it'd be an unaligned
access)
Despite the potential unaligned access problems with the above function, it
is included in the kernel anyway but is documented to only work on
16-bit-aligned addresses. It is up to the caller to ensure this alignment or
not use this function at all. This alignment-unsafe function is still useful
as it is a decent optimization for the cases when you can ensure alignment.
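For callers that cannot guarantee 16-bit alignment, an alignment-agnostic
comparison is one possible alternative. The following is a sketch only:
compare_ether_addr_bytewise() is a hypothetical name, not something taken
from etherdevice.h, and it trades the 16-bit optimization for safety.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6	/* length of an Ethernet MAC address in bytes */

/*
 * Hypothetical variant: returns 0 if the addresses are equal and 1
 * otherwise, like the function above, but makes no assumption about
 * the alignment of either pointer.
 */
static inline unsigned int compare_ether_addr_bytewise(const uint8_t *addr1,
						       const uint8_t *addr2)
{
	return memcmp(addr1, addr2, ETH_ALEN) != 0;
}
```

Because both arguments stay byte pointers, the compiler has no grounds to
assume any alignment beyond 1 byte.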
Here is another example of code that could cause unaligned accesses:
void myfunc(u8 *data, u32 value)
{
[...]
*((u32 *) data) = cpu_to_le32(value);
[...]
}
This code will cause unaligned accesses every time the data parameter points
to an address that is not evenly divisible by 4.
Consider the following structure:
struct foo {
u16 field1;
u32 field2;
u8 field3;
} __attribute__((packed));
It's the same structure as we looked at earlier, but the packed attribute has
been added. This attribute ensures that the compiler never inserts any padding
and the structure is laid out in memory exactly as is suggested above.
The packed attribute is useful when you want to use a C struct to represent
some data that comes in a fixed arrangement 'off the wire'.
It should be clear why accessing fields of an instance of that structure could
cause unaligned accesses in some situations. Even if the instance started at
an address such as 0x10000000 where accessing field1 would not cause an
unaligned access, accessing field2 would be reading 4 bytes from 0x10000002,
which is an unaligned access. The compiler didn't jump to your rescue and
insert padding because you asked it not to.
In summary, the 3 main scenarios where you may run into unaligned access
problems involve:
1. Recasting variables to types of different lengths
2. Pointer arithmetic followed by access to at least 2 bytes of data
3. Accessing elements of packed structures
Avoiding unaligned accesses
===========================
Going back to an earlier example:
void myfunc(u8 *data, u32 value)
{
[...]
*((u32 *) data) = cpu_to_le32(value);
[...]
}
To avoid the unaligned memory access, you could rewrite it as follows:
void myfunc(u8 *data, u32 value)
{
[...]
value = cpu_to_le32(value);
memcpy(data, &value, sizeof(value));
[...]
}
It's safe to assume that memcpy will always copy bytewise and hence will
never cause an unaligned access.
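A userspace sketch of the same idea is shown below. cpu_to_le32() is a kernel
helper, so the little-endian conversion is done by hand here, and store_le32()
is a hypothetical name. The sketch relies on the destination being a plain
byte pointer that is never cast to a wider type.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: store value at data in little-endian byte order.
 * data may have any alignment, because it is only ever handed to
 * memcpy() as a byte pointer.
 */
static inline void store_le32(uint8_t *data, uint32_t value)
{
	uint8_t bytes[4];

	/* Build the little-endian representation one byte at a time. */
	bytes[0] = value & 0xff;
	bytes[1] = (value >> 8) & 0xff;
	bytes[2] = (value >> 16) & 0xff;
	bytes[3] = (value >> 24) & 0xff;

	memcpy(data, bytes, sizeof(bytes));
}
```

Building the bytes with shifts also makes the result independent of the host
byte order.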
Recall an example packed structure from earlier:
struct foo {
u16 field1;
u32 field2;
u8 field3;
} __attribute__((packed));
The following code will potentially cause 2 unaligned accesses: writing to
field2, then reading from field2:
void myfunc2(u32 some_data)
{
struct foo myinstance;
u32 tmp;
myinstance.field2 = some_data;
tmp = myinstance.field2 * 2;
}
When writing this code, you should be aware that field2 accesses are
potentially unaligned and therefore the above will break on some systems. The
kernel provides two macros to simplify handling of situations such as the
above:
void myfunc2(u32 some_data)
{
struct foo myinstance;
u32 tmp;
put_unaligned(some_data, &myinstance.field2);
tmp = get_unaligned(&myinstance.field2) * 2;
}
These macros work from pointers to the unaligned data, and work for memory
accesses of any length (not just 32 bits as in the example above). You could
even use put_unaligned() rather than memcpy() in order to solve the bug in
the first example (myfunc()) given above.
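To illustrate what such helpers do, here is a userspace sketch for the 32-bit
case. The names sketch_get_unaligned32() and sketch_put_unaligned32() are
hypothetical, and the kernel's own get_unaligned()/put_unaligned() macros may
well be implemented differently; only the calling convention (value first,
then pointer, for the put variant) follows the example above.

```c
#include <stdint.h>
#include <string.h>

/* Read a native-endian 32-bit value from an address of any alignment. */
static inline uint32_t sketch_get_unaligned32(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* p is treated as a plain byte address */
	return v;
}

/* Write a native-endian 32-bit value to an address of any alignment. */
static inline void sketch_put_unaligned32(uint32_t v, void *p)
{
	memcpy(p, &v, sizeof(v));
}
```

A value written with sketch_put_unaligned32() at an odd address can be read
back with sketch_get_unaligned32() from the same address.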
--
Author: Daniel Drake <d...@gentoo.org>
With help from: Johannes Berg, Uli Kunitz.
< above case it would insert 2 bytes of padding inbetween field1 and field2.
> above case it would insert 2 bytes of padding in between field1 and field2.
< moving field3 to sit inbetween field1 and field2 (where the padding is
> moving field3 to sit in between field1 and field2 (where the padding is
--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
..
> You may be wondering why you have never seen these problems on your own
> architecture. Some architectures (such as i386 and x86_64) do not have this
> limitation, but nevertheless it is important for you to write portable code
> that works everywhere.
Also, x86 doesn't prohibit unaligned accesses, but I believe they have a
significant performance cost and are best avoided where possible.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/
Not all. Some simply produce the wrong answer - that's oh so much more
exciting.
> You may be wondering why you have never seen these problems on your own
> architecture. Some architectures (such as i386 and x86_64) do not have this
> limitation, but nevertheless it is important for you to write portable code
> that works everywhere.
It's usually faster if you don't misalign on x86 as well.
Alan
That depends, e.g. for SSE2 they can be forbidden.
> but I believe they have
> a significant performance cost and are best avoided where possible.
On Opteron the typical cost of a misaligned access is a single cycle
and some possible penalty to load-store forwarding.
On Intel it is a bit worse, but not all that much. Unless you do
a lot of accesses of it in a loop it's not really worth something
caring about too much.
-Andi
As one example, the MicroBlaze soft-core processor family designed
for use on Xilinx FPGAs will (by default) simply forcibly zero the
lower bits of the unaligned address, such that the following code
will fail mysteriously:
const char foo[] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };
printf("0x%08lx 0x%08lx 0x%08lx 0x%08lx\n",
*((u32 *)(foo+0)),
*((u32 *)(foo+1)),
*((u32 *)(foo+2)),
*((u32 *)(foo+3)));
Instead of outputting:
0x00010203 0x01020304 0x02030405 0x03040506
It will output:
0x00010203 0x00010203 0x00010203 0x00010203
Other embedded architectures have very similar problems. Some may
provide an "unaligned data access" exception, but offer insufficient
information to repair the damage and resume execution.
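The address-masking behaviour described above can be emulated on any machine
with a few lines of C. This is an emulation for illustration only, not code
for the processor in question: load32_low_bits_masked() is a hypothetical
name, the base pointer is assumed to be 4-byte aligned, and the bytes are
combined in big-endian order to match the quoted output.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Emulation only: mimic a CPU that zeroes the low two address bits of a
 * 32-bit load instead of faulting. base is assumed to be 4-byte aligned,
 * so masking the offset is equivalent to masking the address.
 */
static inline uint32_t load32_low_bits_masked(const uint8_t *base, size_t offset)
{
	const uint8_t *p = base + (offset & ~(size_t)3);	/* forced alignment */

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Offsets 0 to 3 all yield the word starting at offset 0, which is why the
four loads in the example print the same value.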
Cheers,
Kyle Moffett
> Its usually faster if you don't misalign on x86 as well.
i'm not sure if i agree with "usually"... but i know you (alan) are
probably aware of the exact requirements of the hw.
for everyone else:
on intel x86 processors an access is unaligned only if it crosses a
cacheline boundary (64 bytes). otherwise it's aligned. the penalty for
crossing a cacheline boundary varies from ~12 cycles (core2) to many
dozens of cycles (p4).
on AMD x86 pre-family 10h the boundary is 8 bytes, and on fam 10h it's 16
bytes. the penalty is a mere 3 cycles if an access crosses the specified
boundary.
if you're making <= 4 byte accesses i recommend not worrying about
alignment on x86. it's pretty hard to beat the hardware support.
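the boundary rule above can be checked with a short helper. this is a sketch
only: crosses_boundary() is a hypothetical name, and the boundary value (64
for the intel cacheline case, 8 or 16 for the AMD cases) has to be supplied
by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: does an access of size bytes starting at addr
 * span two blocks of boundary bytes (e.g. two 64-byte cachelines)?
 * size and boundary are assumed to be non-zero.
 */
static inline bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
	uintptr_t first = addr / boundary;		/* block of the first byte */
	uintptr_t last = (addr + size - 1) / boundary;	/* block of the last byte */

	return first != last;
}
```

a 4 byte access at offset 60 of a cacheline stays within the line, while the
same access at offset 62 spills into the next one.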
i curse all the RISC and embedded processor designers who pretend
unaligned accesses are something evil and to be avoided. in case you're
worried, MIPS patent 4,814,976 expired in december 2006 :)
-dean
Worth noting though, is that atomic accesses that cross cache lines on
an Opteron system is going to lock down the Hypertransport fabric for
you during the operation -- which is obviously not so nice.
--
Arne.
>Code that doesn't cause unaligned access
>========================================
In written style, not using n't contracted forms might be preferable.
>Sidenote: in the above example, you may wish to reorder the fields in the
>above structure so that the overall structure uses less memory. For example,
>moving field3 to sit inbetween field1 and field2 (where the padding is
>inserted) would shrink the overall structure by 1 byte:
>
> struct foo {
> u16 field1;
> u8 field3;
> u32 field2;
> };
>
>Sidenote: it should be obvious by now, but in case it is not, accessing a
>single byte (u8 or char) can never cause an unaligned access, because all
>memory addresses are evenly divisible by 1.
Sidenote: You would want an alignment like this:
struct foo {
uint32_t field2;
uint16_t field1;
uint8_t field3;
};
>Consider the following structure:
> struct foo {
> u16 field1;
> u32 field2;
> u8 field3;
> } __attribute__((packed));
>
>It's the same structure as we looked at earlier, but the packed attribute has
>been added. This attribute ensures that the compiler never inserts any padding
>and the structure is laid out in memory exactly as is suggested above.
>
>The packed attribute is useful when you want to use a C struct to represent
>some data that comes in a fixed arrangement 'off the wire'.
>
In the packed case, does not GCC automatically output extra instructions to not
run into unaligned access?
>To avoid the unaligned memory access, you could rewrite it as follows:
>
> void myfunc(u8 *data, u32 value)
> {
> [...]
> value = cpu_to_le32(value);
> memcpy(data, value, sizeof(value));
> [...]
> }
>
>It's safe to assume that memcpy will always copy bytewise and hence will
>never cause an unaligned access.
>
Usually it copies register-size-wise where possible and bytesize at the
left and right edges if they are unaligned. That's how glibc memcpy does it,
not sure how complete the kernel memcpy is in this regard.
"Some architectures are unable to perform unaligned memory accesses,
either an exception is generated, or the data
access is silently invalid. In architectures that allow unaligned
access, natural aligned accesses are usually faster than non-aligned."
> In summary: if your code causes unaligned memory accesses to happen, your code
> will not work on some platforms, and will perform *very* badly on others.
*very* -> *slower*
> Natural alignment
> =================
Please move this definition before "Why unaligned access is bad".
Also, it would be nice to have a table of ISAs:
ISA            Need natural    Need alignment
               alignment       by x
--------------------------------------------
m68k           No              2
powerpc/ppc    Yes             Word size
x86            No              No
x86_64         No              No
x86_64 No No
--
Heikki Orsila Barbie's law:
heikki...@iki.fi "Math is hard, let's go shopping!"
http://www.iki.fi/shd
It would also insert 3 bytes of padding after field3, in order to satisfy
alignment constraints for arrays of these structures.
> Sidenote: in the above example, you may wish to reorder the fields in the
> above structure so that the overall structure uses less memory. For
> example, moving field3 to sit inbetween field1 and field2 (where the
> padding is inserted) would shrink the overall structure by 1 byte:
>
> struct foo {
> u16 field1;
> u8 field3;
> u32 field2;
> };
It will actually shrink it by 4 bytes, for the very same reason.
-- Vadim Lobanov
From the viewpoint of yours truly (and I am a teacher of operating system classes), this is a long-expected document, which is going to be very useful especially for newbies. My students often make alignment mistakes in their code, and your article will definitely make my job much easier.
Thank you, Daniel, for your work.
Dmitri
> Being spoilt by the luxuries of i386/x86_64 I've never really had a good
> grasp on unaligned memory access problems on other architectures and decided
> it was time to figure it out. As a result I've written this documentation
> which I plan to submit for inclusion as
> Documentation/unaligned_memory_access.txt
>
> Before I do so, any comments on the following?
>
A very nice, and much needed document. I think you should include one thing though:
memcpy() is _only_ safe when one of the pointers is char* or void*. If it is anything more complex than that, gcc will assume alignment and optimise based on that. E.g. memcpy() of two long:s generates the same assembly as doing an assignment.
(Technically it is no different for char* and void*, but since they have byte alignment, gcc can't really do anything creative.)
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
Dumb memcpy (while (len--) { *d++ = *s++ }) will have alignment problems
in any case. Intelligent ones, like the one provided in glibc, first copy
bytes till output is aligned (C file) *or* size is a multiple (i686 asm file)
of word size, and then it copies word-by-word.
Linux's x86_64 memcpy does the opposite, copies 64bit words, and then
copies the last bytes.
So, in effect, as long as no packed structures are used, memcpy should
be safer on *int, etc., than *char, as the compiler ensures
word-alignment.
--
lfr
0/0
>
> Dumb memcpy (while (len--) { *d++ = *s++ }) will have alignment problems
> in any case. Intelligent ones, like the one provided in glibc, first copy
> bytes till output is aligned (C file) *or* size is a multiple (i686 asm file)
> of word size, and then it copies word-by-word.
>
> Linux's x86_64 memcpy does the opposite, copies 64bit words, and then
> copies the last bytes.
>
> So, in effect, as long as no packed structures are used, memcpy should
> be safer on *int, etc., than *char, as the compiler ensures
> word-alignment.
>
It most certainly does not. gcc will assume that an int* has int alignment. memcpy() is a builtin, which gcc can translate to pretty much anything. And C specifies that a pointer to foo, will point to a real object of type foo, so gcc can't be blamed for the unsafe typecasts. I have tested this the hard way, so this is not just speculation.
E.g., we have the following struct:
struct foo
{
u8 a[4];
u32 b;
};
This struct will have a size of 8 bytes and an alignment of 4 bytes (caused by the member b). Now take the following code:
void copy_foo(struct foo *dst, struct foo *src)
{
*dst = *src;
}
On a platform that supports 64-bit loads and stores (e.g. AVR32, where I got hit by this), this will generate:
LD r1, (src)
ST r1, (dst)
Now if I replace that with:
void copy_foo(struct foo *dst, struct foo *src)
{
memcpy(dst, src, sizeof(struct foo));
}
then it will generate the same code. So I cannot use copy_foo() to transfer a struct foo either out of, or into a packet buffer.
In other words, memcpy() does _not_ save you from alignment issues. If you cast from char* or void* to something else, you better be damn sure the alignment is correct because gcc will assume it is.
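A sketch of the safe pattern implied above: keep the packet pointer as a byte
pointer all the way into memcpy(), and copy into a properly aligned local
object. read_foo() is a hypothetical name and the <stdint.h> types stand in
for u8/u32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct foo {
	uint8_t a[4];
	uint32_t b;
};

/*
 * Hypothetical helper: read a struct foo from an arbitrary offset in a
 * packet buffer. The source stays a byte pointer and is never cast to
 * struct foo *, so the compiler cannot assume it is 4-byte aligned.
 */
static inline struct foo read_foo(const uint8_t *pkt, size_t offset)
{
	struct foo out;

	memcpy(&out, pkt + offset, sizeof(out));
	return out;
}
```

The returned copy is an ordinary, aligned struct foo, so its fields can be
accessed normally afterwards.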
Yes, on *int and other assumed aligned pointers, gcc uses its internal
version.
However, my point is that those pointers, unless speaking of packed
structures, can safely be assumed aligned, while char*/void* can't.
> In other words, memcpy() does _not_ save you from alignment issues. If you cast from char* or void* to something else, you better be damn sure the alignment is correct because gcc will assume it is.
Nothing does, even memcpy doesn't check alignment of the source, or
alignment at all in some assembly implementations (only word-copy,
without checking if at word-boundary).
--
lfr
0/0
> On Sat, Nov 24, 2007 at 05:19:31PM +0100, Pierre Ossman wrote:
> > It most certainly does not. gcc will assume that an int* has int alignment. memcpy() is a builtin, which gcc can translate to pretty much anything. And C specifies that a pointer to foo, will point to a real object of type foo, so gcc can't be blamed for the unsafe typecasts. I have tested this the hard way, so this is not just speculation.
>
> Yes, on *int and other assumed aligned pointers, gcc uses its internal
> version.
>
> However, my point is that those pointers, unless speaking of packed
> structures, can safely be assumed aligned, while char*/void* can't.
>
I get the sensation we're violently in agreement here, just misunderstanding each other. :)
_My_ point was that the documentation should mention that normal, unpacked C objects have alignments that influence the code generated by __builtin_memcpy(). As such, one should always make sure to have either src or dst be char*/void* when alignment cannot be guaranteed. The example in the documentation has this, but it isn't explicit that this is required.
An out-of-line implementation can only do that if the architecture
allows unaligned loads and stores. Since it has no clue about the types
involved, it must assume that both pointers as well as the length may be
misaligned.
gcc, on the other hand, knows exactly what types are involved, so when
it expands its own builtin-memcpy inline it can optimize it based on
the required alignment of those types. So when you cast between types
with different alignment requirements, you must make sure the result is
properly aligned, or you need to use get_unaligned()/put_unaligned()
to override gcc's assumptions.
Btw, some versions of avr32-gcc (I think it was 4.0.x) assumed packed
structs were properly aligned too, with disastrous results. gcc-4.1
handles packed structs correctly as far as I can tell.
Håvard
That's it. :)
Sorry for the noise,...
--
lfr
0/0
Although understanding alignment is important, there is another
extreme - what I call "sadistic alignment". It's when data is being
aligned even if it will definitely run on an arch which doesn't require
this (arch/x86/*), or data being aligned to ridiculously large boundary.
Like gcc aligning any char array bigger than 31 bytes to 32 bytes.
Bytes, not bits. Try to compile this with -O2:
static char s1[] = "12345678901234567890123456789012";
static char s2[] = "12345678901234567890123456789012";
void f(char*);
void g() {
f(s1);
f(s2);
}
$ hexdump -Cv t.o
00000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 01 00 03 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 38 01 00 00 00 00 00 00 34 00 00 00 00 00 28 00 |8.......4.....(.|
00000030 0a 00 07 00 55 89 e5 83 ec 08 c7 04 24 40 00 00 |....U.......$@..|
00000040 00 e8 fc ff ff ff c7 04 24 00 00 00 00 e8 fc ff |........$.......|
00000050 ff ff c9 c3 00 00 00 00 00 00 00 00 00 00 00 00 |................| <=== HERE
00000060 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 36 |1234567890123456|
00000070 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 |7890123456789012|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| <=== HERE
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| <=== HERE
000000a0 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 36 |1234567890123456|
000000b0 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 |7890123456789012|
000000c0 00 00 00 00 00 47 43 43 3a 20 28 47 4e 55 29 20 |.....GCC: (GNU) |
000000d0 34 2e 30 2e 33 20 28 55 62 75 6e 74 75 20 34 2e |4.0.3 (Ubuntu 4.|
000000e0 30 2e 33 2d 31 75 62 75 6e 74 75 35 29 00 00 2e |0.3-1ubuntu5)...|
000000f0 73 79 6d 74 61 62 00 2e 73 74 72 74 61 62 00 2e |symtab..strtab..|
43 bytes wasted!
Thankfully, it is fixed in later gcc versions.
Please do not succumb to "alignment scare" in your doc.
--
vda
`No' for >= 68020.
`Yes' for < 68020.
> powerpc/ppc Yes Word size
> x86 No No
> x86_64 No No
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
My bad, yes..
mc68020+ No No
(mc68000/010 No 2) (not for Linux)
--
Heikki Orsila Barbie's law:
heikki...@iki.fi "Math is hard, let's go shopping!"
http://www.iki.fi/shd
Should clarify that you mean "with power-of-two N" - even more
strictly this depends on the processor, but I'm pretty sure there is
none which supports aligned accesses of N==3...
Olaf
Actually uClinux has been persuaded to run on m68000.
I suppose you mean:
memcpy(data, &value, sizeof(value));
/DM
> dean gaudet <de...@arctic.org> writes:
> > on AMD x86 pre-family 10h the boundary is 8 bytes, and on fam 10h it's 16
> > bytes. the penalty is a mere 3 cycles if an access crosses the specified
> > boundary.
>
> Worth noting though, is that atomic accesses that cross cache lines on
> an Opteron system is going to lock down the Hypertransport fabric for
> you during the operation -- which is obviously not so nice.
ooh awesome, i hadn't measured that before.
on a 2 node sockF / revF with a random pointer chase running on cpu 0 /
node 0 i see the avg load-to-load cache miss latency jump from 77ns to
109ns when i add an unaligned lock-intensive workload on one core of node
1. the worst i can get the pointer chase latency to is 273ns when i add
two threads on node 1 fighting over an unaligned lock.
on a 4 node (square) the worst case i can get seems to be an increase from
98ns with no antagonist to 385ns with 6 antagonists fighting over an
unaligned lock on the other 3 nodes.
cool.
-dean
Note, if the unaligned handler is running, the alignment will be fixed
by the fault handler (at the cost of taking a fault). If the unaligned
handler is turned off, you get a "free" shift of the data instead.
--
Ben (b...@fluff.org, http://www.fluff.org/)
'a smiley only costs 4 bytes'
shameless plug:
https://ols2006.108.redhat.com/2007/Reprints/melo-Reprint.pdf
- Arnaldo
On ppc, whether misaligned data is fixed up or causes an exception varies
from processor to processor. However, natural alignment is highly
recommended. I'm not sure I follow what is meant by the second
column (need alignment by x).
- k
Changes:
- 'in between' spelling fix
- shortened example addresses for readability
- better summarised the common architectural differences in handling
unaligned access under "Why unaligned access is bad"
- expanded the notes on structure field ordering vs memory usage (trying not
to go too far off topic though)
- correction regarding packed attribute: compiler will generate extra
instructions, so accessing __attribute__((packed)) structures in standard
ways will never cause unaligned access
- natural alignment is defined earlier in the document
- memcpy is now the alternative suggestion, put_unaligned/get_unaligned is
the encouraged solution
There were some suggestions I didn't include. I'd like this document to
remain as a concise and general overview of the problems, not focusing on
too many details. In other words I'm trying to produce a document that *I*
would have found useful to write generic portable code. For example the fact
that mc68020+ has different alignment requirements from mc68000 isn't of
much value here, I just want to know the fundamentals of writing code that
works everywhere.
On the other hand I can see why such information would be useful for other
scenarios, so maybe someone with a good understanding should collect all the
fine details into an 'advanced unaligned memory access topics' document.
Here's a list of the suggestions/discussions I excluded:
- table of alignment requirements for architectures
- details of performance costs of unaligned accesses on different processors
- memcpy discussion
Assuming there aren't too many comments/suggestions on this revision, the
next version will be submitted for inclusion as
Documentation/unaligned_memory_access.txt
UNALIGNED MEMORY ACCESSES
=========================
Linux runs on a wide variety of architectures which have varying behaviour
when it comes to memory access. This document presents some details about
unaligned accesses, why you need to write code that doesn't cause them,
and how to write such code!
The definition of an unaligned access
=====================================
Unaligned memory accesses occur when you try to read N bytes of data starting
from an address that is not evenly divisible by N (i.e. addr % N != 0).
For example, reading 4 bytes of data from address 0x10004 is fine, but
reading 4 bytes of data from address 0x10005 would be an unaligned memory
access.
Natural alignment
=================
The rule mentioned above forms what we refer to as natural alignment:
When accessing N bytes of memory, the base memory address must be evenly
divisible by N, i.e. addr % N == 0
When writing code, assume the target architecture has natural alignment
requirements.
In reality, only a few architectures require natural alignment on all sizes
of memory access. However, we must consider ALL supported architectures;
writing code that satisfies natural alignment requirements is the easiest way
to achieve full portability.
Why unaligned access is bad
===========================
The effects of performing an unaligned memory access vary from architecture
to architecture. It would be easy to write a whole document on the differences
here; a summary of the common scenarios is presented below:
- Some architectures are able to transparently perform unaligned memory
accesses, but there is usually a significant performance cost.
- Some architectures raise processor exceptions when unaligned accesses
happen. The exception handler is able to correct the unaligned access,
at significant cost to performance.
- Some architectures raise processor exceptions when unaligned accesses
happen, but the exceptions do not contain enough information for the
unaligned access to be corrected.
- Some architectures are not capable of unaligned memory access, but will
silently perform a different memory access to the one that was requested,
resulting in a subtle code bug that is hard to detect!
It should be obvious from the above that if your code causes unaligned
memory accesses to happen, your code will not work correctly on certain
platforms and will cause performance problems on others.
Code that does not cause unaligned access
=========================================
At first, the concepts above may seem a little hard to relate to actual
coding practice. After all, you don't have a great deal of control over
memory addresses of certain variables, etc.
Fortunately things are not too complex, as in most cases, the compiler
ensures that things will work for you. For example, take the following
structure:
struct foo {
u16 field1;
u32 field2;
u8 field3;
};
Let us assume that an instance of the above structure resides in memory
starting at address 0x10000. With a basic level of understanding, it would
not be unreasonable to expect that accessing field2 would cause an unaligned
access. You'd be expecting field2 to be located at offset 2 bytes into the
structure, i.e. address 0x10002, but that address is not evenly divisible
by 4 (remember, we're reading a 4 byte value here).
Fortunately, the compiler understands the alignment constraints, so in the
above case it would insert 2 bytes of padding in between field1 and field2.
Therefore, for standard structure types you can always rely on the compiler
to pad structures so that accesses to fields are suitably aligned (assuming
you do not cast the field to a type of different length).
Similarly, you can also rely on the compiler to align variables and function
parameters to a naturally aligned scheme, based on the size of the type of
the variable.
At this point, it should be clear that accessing a single byte (u8 or char)
will never cause an unaligned access, because all memory addresses are evenly
divisible by one.
On a related topic, with the above considerations in mind you may observe
that you could reorder the fields in the structure in order to place fields
where padding would otherwise be inserted, and hence reduce the overall
resident memory size of structure instances. The optimal layout of the
above example is:
struct foo {
u32 field2;
u16 field1;
u8 field3;
};
For a natural alignment scheme, the compiler would only have to add a single
byte of padding at the end of the structure. This padding is added in order
to satisfy alignment constraints for arrays of these structures.
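To see the effect of the reordering, the two layouts can be compared side by side. This is a userspace sketch under stated assumptions: the names foo_original and foo_reordered are hypothetical (both layouts are called struct foo above), and the typedefs stand in for the kernel types.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

/* Layout from the first example: padding after field1 and after field3. */
struct foo_original {
	u16 field1;
	u32 field2;
	u8 field3;
};

/* Reordered layout: only tail padding should be needed. */
struct foo_reordered {
	u32 field2;
	u16 field1;
	u8 field3;
};

/* Reordering never makes the structure larger. */
static_assert(sizeof(struct foo_reordered) <= sizeof(struct foo_original),
	      "reordered layout is no larger");

/* Tail padding keeps every element of an array of these aligned. */
static_assert(sizeof(struct foo_reordered) % _Alignof(struct foo_reordered) == 0,
	      "array elements stay aligned");

/* Hypothetical helper: bytes saved per instance by reordering. */
static size_t foo_bytes_saved(void)
{
	return sizeof(struct foo_original) - sizeof(struct foo_reordered);
}
```

With natural alignment the sizes would be expected to be 12 and 8 bytes respectively; other ABIs may save less.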
Another point worth mentioning is the use of __attribute__((packed)) on a
structure type. This GCC-specific attribute tells the compiler never to
insert any padding within structures, useful when you want to use a C struct
to represent some data that comes in a fixed arrangement 'off the wire'.
You might be inclined to believe that usage of this attribute can easily
lead to unaligned accesses when accessing fields that do not satisfy
architectural alignment requirements. However, again, the compiler is aware
of the alignment constraints and will generate extra instructions to perform
the memory access in a way that does not cause unaligned access. Of course,
the extra instructions obviously cause a loss in performance compared to the
non-packed case, so the packed attribute should only be used when avoiding
structure padding is of importance.
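As an illustration of the packed attribute, here is a small userspace sketch. The wire_hdr structure and wire_hdr_length() are hypothetical names made up for this example, and the attribute syntax assumes GCC or a compatible compiler.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint8_t  u8;
typedef uint32_t u32;

/* Hypothetical wire format: a 1-byte type followed by a 4-byte length. */
struct wire_hdr {
	u8 type;
	u32 length;
} __attribute__((packed));

/* No padding is inserted, so length sits at an odd offset. */
static_assert(offsetof(struct wire_hdr, length) == 1, "no padding after type");
static_assert(sizeof(struct wire_hdr) == 5, "no tail padding either");

/*
 * Access the field through the packed struct so that the compiler knows
 * it may be unaligned and can pick a safe access sequence.
 */
static u32 wire_hdr_length(const struct wire_hdr *hdr)
{
	return hdr->length;
}
```

The field is read through the structure member rather than through a separate u32 pointer, since a plain u32 pointer would no longer carry the information that the address may be unaligned.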
Code that causes unaligned access
=================================
With the above in mind, let's move onto a real life example of a function
that can cause an unaligned memory access. The following function adapted
from include/linux/etherdevice.h is an optimized routine to compare two
ethernet MAC addresses for equality.

unsigned int compare_ether_addr(const u8 *addr1, const u8 *addr2)
{
	const u16 *a = (const u16 *) addr1;
	const u16 *b = (const u16 *) addr2;
	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}

In the above function, the reference to a[0] causes 2 bytes (16 bits) to
be read from memory starting at address addr1. Think about what would happen
if addr1 was an odd address such as 0x10003. (Hint: it'd be an unaligned
access)
Despite the potential unaligned access problems with the above function, it
is included in the kernel anyway but is understood to only work on
16-bit-aligned addresses. It is up to the caller to ensure this alignment or
not use this function at all. This alignment-unsafe function is still useful
as it is a decent optimization for the cases when you can ensure alignment,
which is true almost all of the time in ethernet networking context.
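For callers that cannot guarantee 16-bit alignment, one possible fallback is a byte-wise comparison. The sketch below is a userspace illustration only: compare_ether_addr_anyalign() and MAC_ADDR_LEN are hypothetical names, not existing kernel interfaces, and the u8 typedef stands in for the kernel type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcmp */

/* Userspace stand-in for the kernel's u8 type (assumption). */
typedef uint8_t u8;

/* Hypothetical constant: an ethernet MAC address is 6 bytes long. */
#define MAC_ADDR_LEN 6

/*
 * Hypothetical helper: returns 0 if the addresses are equal and 1 if they
 * differ, like the routine above, but places no alignment requirement on
 * either pointer because the comparison is defined in terms of bytes.
 */
static unsigned int compare_ether_addr_anyalign(const u8 *addr1, const u8 *addr2)
{
	return memcmp(addr1, addr2, MAC_ADDR_LEN) != 0;
}
```

This trades the speed of the 16-bit version for safety on odd addresses, so it would only make sense where alignment genuinely cannot be ensured.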
Here is another example of some code that could cause unaligned accesses:

	void myfunc(u8 *data, u32 value)
	{
		[...]
		*((u32 *) data) = cpu_to_le32(value);
		[...]
	}

This code will cause unaligned accesses every time the data parameter points
to an address that is not evenly divisible by 4.
In summary, the 2 main scenarios where you may run into unaligned access
problems involve:
1. Casting variables to types of different lengths
2. Pointer arithmetic followed by access to at least 2 bytes of data
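When reviewing code that falls into either scenario, it can help to test the resulting address against the rule addr % N == 0. The following is a minimal userspace sketch; ptr_is_aligned() is a hypothetical helper and assumes that the access size is a power of two.

```c
#include <assert.h>
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t */

/*
 * Hypothetical helper: returns non-zero if ptr is evenly divisible by
 * size. Assumes size is a power of two (1, 2, 4, 8, ...), so the
 * remainder can be computed with a mask.
 */
static int ptr_is_aligned(const void *ptr, size_t size)
{
	return ((uintptr_t)ptr & (size - 1)) == 0;
}
```

For example, given a buffer that starts on a 4-byte boundary, buf + 2 would pass the check for a 2-byte access but fail it for a 4-byte access, which is exactly the pointer arithmetic case listed above.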
Avoiding unaligned accesses
===========================
The easiest way to avoid unaligned access is to use the get_unaligned() and
put_unaligned() macros provided by the <asm/unaligned.h> header file.
Going back to an earlier example of code that potentially causes unaligned
access:

	void myfunc(u8 *data, u32 value)
	{
		[...]
		*((u32 *) data) = cpu_to_le32(value);
		[...]
	}

To avoid the unaligned memory access, you would rewrite it as follows:

	void myfunc(u8 *data, u32 value)
	{
		[...]
		value = cpu_to_le32(value);
		put_unaligned(value, data);
		[...]
	}

The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
memory and you wish to avoid unaligned access, its usage is as follows:

	u32 value = get_unaligned(data);

These macros work work for memory accesses of any length (not just 32 bits as
in the examples above). Be aware that when compared to standard access of
aligned memory, using these macros to access unaligned memory can be costy in
terms of performance.
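To give a feel for where the cost comes from, here is one way an alignment-safe 32-bit little-endian store and load could be written in portable C. This is a userspace sketch under stated assumptions, not a description of how get_unaligned() and put_unaligned() are implemented on any architecture; store_le32_anyalign() and load_le32_anyalign() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint8_t  u8;
typedef uint32_t u32;

/*
 * Hypothetical helper: store value in little-endian byte order at p.
 * Only single-byte stores are used, so p may have any alignment.
 */
static void store_le32_anyalign(u8 *p, u32 value)
{
	p[0] = (u8)(value);
	p[1] = (u8)(value >> 8);
	p[2] = (u8)(value >> 16);
	p[3] = (u8)(value >> 24);
}

/*
 * Hypothetical helper: load a little-endian 32-bit value from p using
 * only single-byte loads.
 */
static u32 load_le32_anyalign(const u8 *p)
{
	return (u32)p[0] |
	       ((u32)p[1] << 8) |
	       ((u32)p[2] << 16) |
	       ((u32)p[3] << 24);
}
```

Four byte accesses plus shifts replace what would be a single instruction for aligned data, which is the kind of overhead the paragraph above warns about.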
If use of such macros is not convenient, another option is to use memcpy(),
where the source or destination (or both) are of type u8* or unsigned char*.
Due to the byte-wise nature of this operation, unaligned accesses are avoided.
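The memcpy() approach might look like the following userspace sketch. The helper names read_u32_anyalign() and write_u32_anyalign() are hypothetical, and the typedefs stand in for the kernel types. Note that the value is copied in CPU byte order; any endianness conversion would be a separate step.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint8_t  u8;
typedef uint32_t u32;

/*
 * Hypothetical helper: read a u32 (in CPU byte order) from a buffer that
 * may start at any address. The local variable is properly aligned; only
 * the byte buffer is potentially unaligned.
 */
static u32 read_u32_anyalign(const u8 *src)
{
	u32 value;

	memcpy(&value, src, sizeof(value));
	return value;
}

/* Hypothetical helper: write a u32 (in CPU byte order) to any address. */
static void write_u32_anyalign(u8 *dst, u32 value)
{
	memcpy(dst, &value, sizeof(value));
}
```

The key point is that the unaligned side is only ever handled as a u8 pointer and is never cast to u32 *.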
--
Author: Daniel Drake <d...@gentoo.org>
With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
Johannes Berg, Kyle Moffett, Robert Hancock, Uli Kunitz, Vadim Lobanov
On Thu, Nov 29, 2007 at 04:15:23PM +0000, Daniel Drake wrote:
> Unaligned memory accesses occur when you try to read N bytes of data starting
> from an address that is not evenly divisible by N (i.e. addr % N != 0).
> For example, reading 4 bytes of data from address 0x10004 is fine, but
> reading 4 bytes of data from address 0x10005 would be an unaligned memory
> access.
>
This is rather ambiguous, while most people know what you mean,
clarifying it a bit might be nice. How about something like,
Unaligned memory accesses occur when trying to read more than a byte
(i.e. u16, u32, u64) in a single instruction from an address that is not
evenly divisible by the width of the type (i.e. addr % width != 0).
For example, if you had 4GB of virtual memory, picture it as an
array of bytes,
u8 memory[4096 * (1024 * 1024)]; /* 4G bytes */
Aligned accesses would be accessing this array in this manner,
u16 memory[(4096 * (1024 * 1024)) / sizeof(u16)] /* 2G bytes */
u32 memory[(4096 * (1024 * 1024)) / sizeof(u32)] /* 1G bytes */
u64 memory[(4096 * (1024 * 1024)) / sizeof(u64)] /* 512M bytes */
And an unaligned access would be accessing on a non-integer multiple
boundary.
Ok, that kind of sucked too. But you get the idea.
>
> Why unaligned access is bad
> ===========================
>
The rest of this looks good.
Acked-by: Kyle McMartin <ky...@parisc-linux.org>
cheers,
Kyle
> Assuming there aren't too many comments/suggestions on this revision, the
> next version will be submitted for inclusion as
> Documentation/unaligned_memory_access.txt
I just have a few typo/punctuation/grammar fixes. Otherwise it looks
good to me. Thanks.
Acked-by: Randy Dunlap <randy....@oracle.com>
> Natural alignment
> =================
>
> The rule mentioned above forms what we refer to as natural alignment:
> When accessing N bytes of memory, the base memory address must be evenly
> divisible by N, i.e. addr % N == 0
add ending '.'
> Why unaligned access is bad
> ===========================
>
> The effects of performing an unaligned memory access vary from architecture
> to architecture. It would be easy to write a whole document on the differences
> here; a summary of the common scenarios is presented below:
>
> - Some architectures are able to transparently perform unaligned memory
> accesses, but there is usually a significant performance cost.
(remove split infinitive:)
- Some architectures are able to perform unaligned memory accesses
transparently, but ...
> - Some architectures raise processor exceptions when unaligned accesses
> happen. The exception handler is able to correct the unaligned access,
> at significant cost to performance.
> - Some architectures raise processor exceptions when unaligned accesses
> happen, but the exceptions do not contain enough information for the
> unaligned access to be corrected.
> - Some architectures are not capable of unaligned memory access, but will
> silently perform a different memory access to the one that was requested,
> resulting a a subtle code bug that is hard to detect!
>
> It should be obvious from the above that if your code causes unaligned
> memory accesses to happen, your code will not work correctly on certain
> platforms and will cause performance problems on others.
> Code that causes unaligned access
> =================================
>
> With the above in mind, let's move onto a real life example of a function
> that can cause an unaligned memory access. The following function adapted
> from include/linux/etherdevice.h is an optimized routine to compare two
> ethernet MAC addresses for equality.
>
> unsigned int compare_ether_addr(const u8 *addr1, const u8 *addr2)
> {
> const u16 *a = (const u16 *) addr1;
> const u16 *b = (const u16 *) addr2;
> return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
> }
>
> In the above function, the reference to a[0] causes 2 bytes (16 bits) to
> be read from memory starting at address addr1. Think about what would happen
> if addr1 was an odd address such as 0x10003. (Hint: it'd be an unaligned
> access)
access.)
> Avoiding unaligned accesses
> ===========================
>
> The easiest way to avoid unaligned access is to use the get_unaligned() and
> put_unaligned() macros provided by the <asm/unaligned.h> header file.
>
> Going back to an earlier example of code that potentially causes unaligned
> access:
>
> void myfunc(u8 *data, u32 value)
> {
> [...]
> *((u32 *) data) = cpu_to_le32(value);
> [...]
> }
>
> To avoid the unaligned memory access, you would rewrite it as follows:
>
> void myfunc(u8 *data, u32 value)
> {
> [...]
> value = cpu_to_le32(value);
> put_unaligned(value, data);
> [...]
> }
>
> The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
> memory and you wish to avoid unaligned access, its usage is as follows:
>
> u32 value = get_unaligned(data);
>
> These macros work work for memory accesses of any length (not just 32 bits as
> in the examples above). Be aware that when compared to standard access of
> aligned memory, using these macros to access unaligned memory can be costy in
costly
> terms of performance.
---
~Randy
uint8_t memory[4096UL * 1024 * 1024];
>Aligned accesses would be accessing this array in this manner,
> u16 memory[(4096 * (1024 * 1024)) / sizeof(u16)] /* 2G bytes */
> u32 memory[(4096 * (1024 * 1024)) / sizeof(u32)] /* 1G bytes */
> u64 memory[(4096 * (1024 * 1024)) / sizeof(u64)] /* 512M bytes */
u64 memory[4096UL * 1024 * 1024 / sizeof(u64)] /* 4G too */
The get_unaligned call above will not do what you intended given the,
at least as I read it, implied context of myfunc. Since data is a u8*
it will only get one byte of data. To avoid misunderstandings the code
should probably read:
u32 value = get_unaligned((u32 *)data);
/DM
The wording could also apply to a DMA of 8k from a 4k-aligned address.
But I don't have a good idea how to improve it.
> It's safe to assume that memcpy will always copy bytewise and hence will
> never cause an unaligned access.
s/always copy/always behave as if copying/
memcpy usually copies at least wordwise, possibly even in bigger chunks.
But that is just the inner loop. Unaligned bytes at the beginning/end
receive special treatment.
Jörn
--
The rabbit runs faster than the fox, because the rabbit is running for
his life while the fox is only running for his dinner.
-- Aesop
Signed-off-by: Daniel Drake <d...@gentoo.org>
---
Changes since the v2 draft I posted last week:
- spelling/grammar fixes
- fixed get_unaligned/put_unaligned examples
- clarified the scope of unaligned access
--- /dev/null 2007-12-03 10:17:52.569007801 +0000
+++ linux/Documentation/unaligned-memory-access.txt 2007-12-03 16:04:55.000000000 +0000
@@ -0,0 +1,226 @@
+UNALIGNED MEMORY ACCESSES
+=========================
+
+Linux runs on a wide variety of architectures which have varying behaviour
+when it comes to memory access. This document presents some details about
+unaligned accesses, why you need to write code that doesn't cause them,
+and how to write such code!
+
+
+The definition of an unaligned access
+=====================================
+
+Unaligned memory accesses occur when you try to read N bytes of data starting
+from an address that is not evenly divisible by N (i.e. addr % N != 0).
+For example, reading 4 bytes of data from address 0x10004 is fine, but
+reading 4 bytes of data from address 0x10005 would be an unaligned memory
+access.
+
+The above may seem a little vague, as memory access can happen in different
+ways. The context here is at the machine code level: certain instructions read
+or write a number of bytes to or from memory (e.g. movb, movw, movl in x86
+assembly). As will become clear, it is relatively easy to spot C statements
+which will compile to multiple-byte memory access instructions, namely when
+dealing with types such as u16, u32 and u64.
+
+
+Natural alignment
+=================
+
+The rule mentioned above forms what we refer to as natural alignment:
+When accessing N bytes of memory, the base memory address must be evenly
+divisible by N, i.e. addr % N == 0.
+
+When writing code, assume the target architecture has natural alignment
+requirements.
+
+In reality, only a few architectures require natural alignment on all sizes
+of memory access. However, we must consider ALL supported architectures;
+writing code that satisfies natural alignment requirements is the easiest way
+to achieve full portability.
+
+
+Why unaligned access is bad
+===========================
+
+The effects of performing an unaligned memory access vary from architecture
+to architecture. It would be easy to write a whole document on the differences
+here; a summary of the common scenarios is presented below:
+
+ - Some architectures are able to perform unaligned memory accesses
+   transparently, but there is usually a significant performance cost.
+ - Some architectures raise processor exceptions when unaligned accesses
+   happen. The exception handler is able to correct the unaligned access,
+   at significant cost to performance.
+ - Some architectures raise processor exceptions when unaligned accesses
+   happen, but the exceptions do not contain enough information for the
+   unaligned access to be corrected.
+ - Some architectures are not capable of unaligned memory access, but will
+   silently perform a different memory access to the one that was requested,
+   resulting in a subtle code bug that is hard to detect!
+
+It should be obvious from the above that if your code causes unaligned
+memory accesses to happen, your code will not work correctly on certain
+platforms and will cause performance problems on others.
+
+
+Code that does not cause unaligned access
+=========================================
+
+At first, the concepts above may seem a little hard to relate to actual
+coding practice. After all, you don't have a great deal of control over
+memory addresses of certain variables, etc.
+
+Fortunately things are not too complex, as in most cases, the compiler
+ensures that things will work for you. For example, take the following
+structure:
+
+	struct foo {
+		u16 field1;
+		u32 field2;
+		u8 field3;
+	};
+
+Let us assume that an instance of the above structure resides in memory
+starting at address 0x10000. With a basic level of understanding, it would
+not be unreasonable to expect that accessing field2 would cause an unaligned
+access. You'd be expecting field2 to be located at offset 2 bytes into the
+structure, i.e. address 0x10002, but that address is not evenly divisible
+by 4 (remember, we're reading a 4 byte value here).
+
+Fortunately, the compiler understands the alignment constraints, so in the
+above case it would insert 2 bytes of padding in between field1 and field2.
+Therefore, for standard structure types you can always rely on the compiler
+to pad structures so that accesses to fields are suitably aligned (assuming
+you do not cast the field to a type of different length).
+
+Similarly, you can also rely on the compiler to align variables and function
+parameters to a naturally aligned scheme, based on the size of the type of
+the variable.
+
+At this point, it should be clear that accessing a single byte (u8 or char)
+will never cause an unaligned access, because all memory addresses are evenly
+divisible by one.
+
+On a related topic, with the above considerations in mind you may observe
+that you could reorder the fields in the structure in order to place fields
+where padding would otherwise be inserted, and hence reduce the overall
+resident memory size of structure instances. The optimal layout of the
+above example is:
+
+	struct foo {
+		u32 field2;
+		u16 field1;
+		u8 field3;
+	};
+
+For a natural alignment scheme, the compiler would only have to add a single
+byte of padding at the end of the structure. This padding is added in order
+to satisfy alignment constraints for arrays of these structures.
+
+Another point worth mentioning is the use of __attribute__((packed)) on a
+structure type. This GCC-specific attribute tells the compiler never to
+insert any padding within structures, useful when you want to use a C struct
+to represent some data that comes in a fixed arrangement 'off the wire'.
+
+You might be inclined to believe that usage of this attribute can easily
+lead to unaligned accesses when accessing fields that do not satisfy
+architectural alignment requirements. However, again, the compiler is aware
+of the alignment constraints and will generate extra instructions to perform
+the memory access in a way that does not cause unaligned access. Of course,
+the extra instructions obviously cause a loss in performance compared to the
+non-packed case, so the packed attribute should only be used when avoiding
+structure padding is of importance.
+
+
+Code that causes unaligned access
+=================================
+
+With the above in mind, let's move onto a real life example of a function
+that can cause an unaligned memory access. The following function adapted
+from include/linux/etherdevice.h is an optimized routine to compare two
+ethernet MAC addresses for equality.
+
+unsigned int compare_ether_addr(const u8 *addr1, const u8 *addr2)
+{
+	const u16 *a = (const u16 *) addr1;
+	const u16 *b = (const u16 *) addr2;
+	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
+}
+
+In the above function, the reference to a[0] causes 2 bytes (16 bits) to
+be read from memory starting at address addr1. Think about what would happen
+if addr1 was an odd address such as 0x10003. (Hint: it'd be an unaligned
+access.)
+
+Despite the potential unaligned access problems with the above function, it
+is included in the kernel anyway but is understood to only work on
+16-bit-aligned addresses. It is up to the caller to ensure this alignment or
+not use this function at all. This alignment-unsafe function is still useful
+as it is a decent optimization for the cases when you can ensure alignment,
+which is true almost all of the time in ethernet networking context.
+
+
+Here is another example of some code that could cause unaligned accesses:
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		*((u32 *) data) = cpu_to_le32(value);
+		[...]
+	}
+
+This code will cause unaligned accesses every time the data parameter points
+to an address that is not evenly divisible by 4.
+
+In summary, the 2 main scenarios where you may run into unaligned access
+problems involve:
+ 1. Casting variables to types of different lengths
+ 2. Pointer arithmetic followed by access to at least 2 bytes of data
+
+
+Avoiding unaligned accesses
+===========================
+
+The easiest way to avoid unaligned access is to use the get_unaligned() and
+put_unaligned() macros provided by the <asm/unaligned.h> header file.
+
+Going back to an earlier example of code that potentially causes unaligned
+access:
+
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		*((u32 *) data) = cpu_to_le32(value);
+		[...]
+	}
+
+To avoid the unaligned memory access, you would rewrite it as follows:
+
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		value = cpu_to_le32(value);
+		put_unaligned(value, (u32 *) data);
+		[...]
+	}
+
+The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
+memory and you wish to avoid unaligned access, its usage is as follows:
+
+	u32 value = get_unaligned((u32 *) data);
+
+These macros work for memory accesses of any length (not just 32 bits as
+in the examples above). Be aware that when compared to standard access of
+aligned memory, using these macros to access unaligned memory can be costly in
+terms of performance.
+
+If use of such macros is not convenient, another option is to use memcpy(),
+where the source or destination (or both) are of type u8* or unsigned char*.
+Due to the byte-wise nature of this operation, unaligned accesses are avoided.
+
+--
+Author: Daniel Drake <d...@gentoo.org>
+With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
+Johannes Berg, Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock,
+Uli Kunitz, Vadim Lobanov
+
--
..
> Avoiding unaligned accesses
> ===========================
>
> The easiest way to avoid unaligned access is to use the get_unaligned() and
> put_unaligned() macros provided by the <asm/unaligned.h> header file.
>
> Going back to an earlier example of code that potentially causes unaligned
> access:
>
> void myfunc(u8 *data, u32 value)
> {
> [...]
> *((u32 *) data) = cpu_to_le32(value);
> [...]
> }
>
> To avoid the unaligned memory access, you would rewrite it as follows:
>
> void myfunc(u8 *data, u32 value)
> {
> [...]
> value = cpu_to_le32(value);
> put_unaligned(value, data);
> [...]
> }
>
> The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
> memory and you wish to avoid unaligned access, its usage is as follows:
>
> u32 value = get_unaligned(data);
>
> These macros work work for memory accesses of any length (not just 32 bits as
^^^^^^^^^
remove one of them :)
Honza
--
Jan Kara <ja...@suse.cz>
SuSE CR Labs
--
Remove a "work" here.
> +in the examples above). Be aware that when compared to standard access of
> +aligned memory, using these macros to access unaligned memory can be costly in
> +terms of performance.
> +
> +If use of such macros is not convenient, another option is to use memcpy(),
> +where the source or destination (or both) are of type u8* or unsigned char*.
> +Due to the byte-wise nature of this operation, unaligned accesses are avoided.
Cheers,
Brandon