Does anyone know how much hardware support ARMv6 or ARMv7 [Cortex-A8,
OMAP3, Beagleboard] has for unaligned memory access [alignment trap
fault]?
I recently saw that there is a patch for it [1], but I am not sure how much
it affects performance when an unaligned memory access occurs.
I think this patch sets /proc/cpu/alignment to 2 [fixup] by default.
For ARMv6, I found some information in [2], section "4.2.5. Support for
unaligned data access in ARMv6 (U=1)". Does this apply if the U bit is set
in the control register?
Do ARMv6 and ARMv7 behave almost like x86 in terms of performance if the
U bit is set to 1?
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211h/Cdffhdje.html
Thanks for your help.
Thanks and Regards,
Shivdas Gujare
You already found a relevant section in ARM documentation (your link [2]),
you can get all the details there.
> I recently saw that there is a patch for it [1], but I am not sure how much
> it affects performance when an unaligned memory access occurs.
> I think this patch sets /proc/cpu/alignment to 2 [fixup] by default.
That's not a very wise default in my opinion. A better choice would be 4
(signal) or at least 3 (fixup+warn). But you can change this behavior at
runtime. I remember there was also a kernel patch submitted somewhere to make
the initial '/proc/cpu/alignment' setting configurable in the kernel config.
> For ARMv6, I found some information in [2], section "4.2.5. Support for
> unaligned data access in ARMv6 (U=1)". Does this apply if the U bit is set
> in the control register?
IIRC the U bit is always set by Linux on the ARM chips which support it. And
on ARMv7 (the Beagleboard uses ARMv7), unaligned access support can't even be
turned off (the CPU only supports U=1 mode).
> Do ARMv6 and ARMv7 behave almost like x86 in terms of performance if the
> U bit is set to 1?
Not quite, there are some tricky things. One of the pitfalls is that not all
instructions support unaligned accesses; some generate exceptions on unaligned
memory accesses. Only instructions that deal with data sizes up to 32 bits
handle unaligned addresses automagically, plus some NEON instructions. There
is a full table in the ARM documentation showing which combinations are
supported.
To make everything even more fun, if you are a C programmer, you can't freely
use unaligned memory accesses even if you deal with data types no larger than
int.
Let's have a look at the following example (bad code!):
/********************/
#include <stdio.h>

int __attribute__((noinline)) f(int *x)
{
    return x[0] + x[1];
}

int main()
{
    int buffer[3] = {0x12345678, 0x90ABCDEF, 0x12345678};
    printf("%08X\n", f((int *)((char *)buffer + 1)));
    return 0;
}
/********************/
If it is compiled with -Os optimization, gcc emits the following code for the
function 'f':
00000000 <f>:
   0:   e8900009    ldm   r0, {r0, r3}
   4:   e0830000    add   r0, r3, r0
   8:   e12fff1e    bx    lr
It uses the LDM (load multiple) instruction here to load 2 sequential ints
into a pair of ARM registers at once, so this is effectively a 64-bit load
operation. The LDM instruction does not support unaligned reads and will
generate an exception if the address is not properly aligned. Depending on the
value in /proc/cpu/alignment, this program will:
0: freeze, constantly triggering exceptions that the kernel does not handle
   properly, so it keeps bouncing between userspace and kernelspace with the
   CPU loaded at 100%
2: give you the same result as on x86, but silently spend a huge amount of
   time handling the exception and emulating the unaligned access in the
   kernel
4: die with SIGBUS
As I mentioned before, configuration 2 (fixup) is a bad choice in general.
Average Joe "x86 programmer" can insert lots of nonportable code (with
respect to alignment) into his programs. Even worse, as ARMv6 and ARMv7 are
supposed to support unaligned memory accesses according to the information
published here and there, he would probably even think that he is doing the
right thing :-)
Configuration 4 (signal) at least lets you find such bugs in the code and
fix them.
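A minimal C sketch of changing the mode from a program instead of from the
shell. It assumes the usual meaning of the values (0 ignored, 1 warn, 2 fixup,
3 fixup+warn, 4 signal, 5 signal+warn) and that writing to
/proc/cpu/alignment requires root. The function names alignment_mode_name and
set_alignment_mode are made up for this illustration.

```c
#include <stdio.h>

/* Human-readable name for a /proc/cpu/alignment mode value. */
const char *alignment_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "ignored";
    case 1:  return "warn";
    case 2:  return "fixup";
    case 3:  return "fixup+warn";
    case 4:  return "signal";
    case 5:  return "signal+warn";
    default: return "unknown";
    }
}

/* Write 'mode' to the control file at 'path' (normally
 * "/proc/cpu/alignment"; hypothetical helper, usually needs root).
 * Returns 0 on success and -1 on failure. */
int set_alignment_mode(const char *path, int mode)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%d\n", mode) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, set_alignment_mode("/proc/cpu/alignment", 4) would select the
signal mode before running a test suite, provided the process has permission
to write the file.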
As for gcc generating such code with -Os optimization in the first place: it
is doing the right thing. The code example is buggy and has undefined
behavior according to the C standard. It just happens to work seemingly right
on x86.
If you compile the example with the '-Wcast-align' option, gcc will even issue
a warning on the problematic line. Such warnings may be handy sometimes when
porting applications to platforms where alignment is stricter than on
x86.
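One common way to make the example above well defined is to go through memcpy
instead of dereferencing a misaligned int pointer. This is only a sketch: the
helper names load_u32_unaligned and sum_two_unaligned are hypothetical (not
from the original example), and uint32_t is used so that the addition cannot
overflow a signed int.

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from an address with no alignment guarantee.
 * memcpy makes no alignment assumption about its source. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Alignment-safe counterpart of f(): adds the two 32-bit values that
 * start at p and p + 4, wherever p happens to point. */
uint32_t sum_two_unaligned(const void *p)
{
    const unsigned char *bytes = p;
    return load_u32_unaligned(bytes) + load_u32_unaligned(bytes + 4);
}
```

Compilers usually turn a fixed-size memcpy like this into whatever load
sequence is safe for the target, so the cost is often small, but it is worth
checking the disassembly for your own compiler and flags.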
--
Best regards,
Siarhei Siamashka
The simplest way to debug these problems is to set /proc/cpu/alignment to 4
and run the program in gdb; it will break exactly at the right place. You can
also enable core dump generation and analyze the core dumps afterwards.
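When gdb is not available on the target, a rough alternative is to install a
SIGBUS handler that prints the faulting data address before exiting. This is
only a sketch: it assumes the kernel delivers SIGBUS with si_addr filled in
when /proc/cpu/alignment is set to 4, and the names format_hex_line, on_sigbus
and install_sigbus_reporter are invented here. It reports the data address,
not the offending instruction, so gdb or a core dump is still needed to find
the exact place in the code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Write "0x<hex digits>\n" for v into out and return the length.
 * out must hold at least 2 * sizeof(uintptr_t) + 3 bytes. */
static size_t format_hex_line(uintptr_t v, char *out)
{
    static const char digits[] = "0123456789abcdef";
    size_t n = 0;
    out[n++] = '0';
    out[n++] = 'x';
    for (int shift = (int)(sizeof v * 8) - 4; shift >= 0; shift -= 4)
        out[n++] = digits[(v >> shift) & 0xF];
    out[n++] = '\n';
    return n;
}

/* SIGBUS handler: print the faulting data address and exit.
 * Only async-signal-safe calls (write, _exit) are used. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    static const char msg[] = "SIGBUS at data address ";
    char line[2 * sizeof(uintptr_t) + 3];
    size_t len = format_hex_line((uintptr_t)info->si_addr, line);
    ssize_t r;

    (void)sig;
    (void)ctx;
    r = write(STDERR_FILENO, msg, sizeof msg - 1);
    r = write(STDERR_FILENO, line, len);
    (void)r;
    _exit(1);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigbus_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling install_sigbus_reporter() early in main() is enough to activate it.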
For large-scale, whole-system analysis, something more automated and
convenient could probably be created, so that you just run the system normally
but get preprocessed statistics with the names of the modules and functions
that perform unaligned memory accesses, sorted by frequency of occurrence.
The easiest way to achieve this would be to patch the kernel to report
unaligned memory accesses as special events to oprofile :-)
This is all a bit confusing because the GNU assembler uses its own syntax
flavour. You can generally find the needed information by checking:
info as
some existing ARM NEON assembly optimizations
binutils sources
Regarding your particular question, you need to use something like this:
VLD1.8 {d0, d1}, [r0, :128]
The ":128" part specifies the alignment in bits.
Good luck
Note that only VLD[1-4] take an alignment specifier and allow
unaligned addresses. VLDR and VLDM always require 4-byte aligned
data.
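Since the specifier is given in bits while pointers are usually reasoned about
in bytes, a small C check can help before handing a buffer to code that uses
an alignment hint such as :128 (that is, 16 bytes). A minimal sketch;
is_aligned_bits is a hypothetical helper and not part of any NEON API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if p is aligned to align_bits bits, for example 128 for a
 * ":128" specifier. align_bits must be a power of two and at least 8. */
static inline bool is_aligned_bits(const void *p, unsigned align_bits)
{
    uintptr_t align_bytes = align_bits / 8u;
    return ((uintptr_t)p & (align_bytes - 1u)) == 0;
}
```

A caller could fall back to a load without an alignment specifier whenever
is_aligned_bits(ptr, 128) returns false.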
--
Måns Rullgård
ma...@mansr.com
Some open source projects already have NEON optimizations. They are listed
here: http://elinux.org/BeagleBoard#ARM_NEON
> binutils sources
http://www.gnu.org/software/binutils/
In the open source world, the availability of sources can sometimes compensate
for the absence of (good) documentation. I just assume that anybody seriously
interested in assembly optimizations is already an experienced software
developer who is quite familiar with the C language. That's why I also
suggested this option. It may be the last resort if you don't find the
needed information in some easier way.
> to get info about optimization?
The Cortex-A8 TRM (the one that you already have) contains an "Instruction
Cycle Timing" section.
Additionally I suggest checking the following document. It has some nice
pictures and Cortex-A8 pipeline overview:
http://www.arm.com/miscPDFs/24588.pdf
One last thing: always try to benchmark everything yourself. The TRM has a
warning notice: "Detailed descriptions of all possible instruction
interactions and all possible events taking place in the processor is beyond
the scope of this document. Only a cycle-accurate model of the processor can
produce precise timings for a particular instruction sequence."
So the TRM describes a simplified model, which more or less correlates with
reality. But it can always happen that those tiny omitted details have a
major impact on your code if they manifest themselves in a
performance-critical tight loop.
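As a starting point for such measurements, here is a minimal timing sketch in
C. It assumes a POSIX system with clock_gettime(CLOCK_MONOTONIC). The names
now_ns, best_time_ns, sum_job and sum_u32_job are hypothetical, and the
workload (summing 32-bit values read through memcpy from a possibly unaligned
address) is only a placeholder for the loop you actually care about.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Run fn(arg) 'repeats' times and return the fastest run in nanoseconds.
 * Taking the minimum filters out some scheduler and cache noise. */
uint64_t best_time_ns(void (*fn)(void *), void *arg, int repeats)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < repeats; i++) {
        uint64_t start = now_ns();
        fn(arg);
        uint64_t elapsed = now_ns() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Placeholder workload: sum 'count' 32-bit values starting at 'data',
 * which may be unaligned. The result is stored so the work is observable. */
struct sum_job {
    const unsigned char *data;
    size_t count;
    uint32_t result;
};

static void sum_u32_job(void *arg)
{
    struct sum_job *job = arg;
    uint32_t sum = 0;
    for (size_t i = 0; i < job->count; i++) {
        uint32_t v;
        memcpy(&v, job->data + 4 * i, sizeof v);
        sum += v;
    }
    job->result = sum;
}
```

Calling best_time_ns(sum_u32_job, &job, 10) once with job.data aligned and
once with job.data offset by one byte gives a rough comparison on the actual
hardware. A real benchmark would use a much larger buffer and many more
repeats.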