> Hi guys, noob here - today I've banged my head against various pieces
> of furniture trying to understand how to write gcc inline assembly for
> neon in order to optimize memory copy speed. I saw some code on the
> arm infocenter website and decided to give it a try:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
>
> The result: the custom neon memcpy with preload achieves around 160MB/
> s, while the standard memcpy has a speed of 205MB/s. I suppose this is
> SDRAM speed, as the data was too large to fit in any cache.
A decent memcpy should easily achieve over 400MB/s (1GB/s on Panda ES).
There is a decent one in Android. I suggest using that one.
--
Måns Rullgård
ma...@mansr.com
> On Jan 17, 7:26 am, Måns Rullgård <m...@mansr.com> wrote:
>> John Salzdurg <swigi...@googlemail.com> writes:
>> [snip]
>
> I had a look here and took the whole package:
> https://launchpad.net/cortex-strings
>
> I couldn't compile it because of errors related to my 4.3.3 version of
> gcc, and it not supporting cortex-a9 (-mtune switch).
GCC 4.3 is ancient. You should use 4.5 or 4.6.
> Maybe with newer
> compiler versions I could achieve better speed? I'm running angstrom
> and opkg states that this gcc version is the newest. In the feed
> browser there is 4.5 available but how can I install that without
> recompiling everything?
>
> Anyway, I took the memcpy-hybrid.S function from the package and added
> it to my project. The result is similar with the neon memcpy without
> preload, about 160MB/s.
Use the one from Android instead. I can't find an official code browser
but here it is on github:
https://github.com/android/platform_bionic/blob/master/libc/arch-arm/bionic/memcpy.S
> Is there anything that I can do to make a difference in memory/overall
> speed?
You could avoid calling memcpy() in the first place. That's always the
best option.
--
Måns Rullgård
ma...@mansr.com
It depends. What you found was an almost three-year-old memcpy
implementation, tuned for the OMAP3530. And the results may vary for
different SoCs, which use different ARM cores and different memory
controllers. When copying buffers larger than the L2 cache, one would
optimistically expect any memcpy implementation to hit the memory
bandwidth limit (unless it does something really stupid like copying
just one byte at a time). But different SoCs may have different
quirks.
By the way, ARM Cortex-A9 core and/or L2 cache controller seems to be
poorly configured in your angstrom bootloader/kernel. That's why you
are getting bad performance.
> 2. Any thoughts on the errors?
> 3. Can the errors be a sign of tampering with the data? The benchmark
> is comparing a trivial bit by bit copy with the NEON one, so could it
> be that the compiler is 'optimizing' the program by copying only half
> the data or something (which would explain the 2x speedup)? Both
> Ubuntu and Linaro have gcc 4.6, which exhibits this error, as opposed
> to gcc 4.5 on angstrom which doesn't.
I tried to reproduce the problem with various combinations of
gcc/binutils and got something similar. When compiling the code for
thumb2 (extra -mthumb gcc option) and using binutils 2.22, the
compiled program segfaults. Using binutils 2.21.1 is fine. After a
quick look, this is apparently caused by the use of BL instruction
instead of BLX in 'run_correctness_test' function, which breaks
arm/thumb interworking. But when doing benchmarking, the functions are
called via pointers and it happens to work correctly. There is one
more binutils bug which can be triggered:
http://sourceware.org/bugzilla/show_bug.cgi?id=12931 (it can be
worked around by adding explicit alignment directives to the assembly
code).
This all seems to be somewhat similar to your symptoms (correctness
test fails, benchmarking works), except that I'm a bit surprised that
it misbehaves instead of crashing and this happens only after running
a number of loop iterations. You might be actually (un)lucky to have
encountered some other bug.
--
Best regards,
Siarhei Siamashka
> Fundamentally, an implementation of memcpy() loads data into registers and
> then saves it again. The more you can do at once, the better.
>
> The ARM instruction set has always had the STM and LDM instructions which
> are capable of loading and storing up to the entire integer register set in
> one go (and that includes the PC). This is a pretty unassailable
> instruction density - especially as you can post-increment the read and
> write pointers in the same instruction.
>
> You can do 32 bytes as:
>
> LDMIA r0!,{r2-r10}
> STMIA r1!,{r2-r10}
That's 36 bytes.
Those instructions need multiple cycles to execute, transferring 64 bits
per cycle at most.
> Optimised memcpy() implementations will probably have unrolled loops and
> attempt to cache-line align reads and writes.
>
> The vector loads and stores really don't bring much to the table unless they
> can load more data (in fact they can do 128 bits at a time?), have a wider
> bus to the L1 cache or can load in parallel with the integer instruction
> set.
NEON loads and stores can transfer 128 bits per cycle.
> In practice, you're going to hit the limits of memory bandwidth: by
> definition almost none of the data being copied in a big block is going to
> be in cache.
To maximise the bandwidth, it is necessary to use prefetching of data
into the caches. Some memcpy implementations, such as the Android one,
use PLD instructions with good results. The hardware also has some
support for automatic prefetching which helps if configured properly.
With proper prefetching, NEON and plain ARM implementations achieve
roughly the same speeds for large copies.
> The fact that the benchmarks are so similar suggests that this is
> happening.
The numbers are still way too low. The OMAP4 has dual 32-bit DDR2-800
memory interfaces providing a theoretical bandwidth of several GB per
second. Due to various other bottlenecks, the easily achievable rates
for copy operations are about 1GB/s. Even a trivial copy loop reaches
600MB/s. These figures are for the 4460 (Panda ES). The 4430 tops out
at just over 400MB/s.
--
Måns Rullgård
ma...@mansr.com
On Thu, Jan 26, 2012 at 7:51 PM, John Salzdurg <swig...@googlemail.com> wrote:
> 1. Why is NEON slower than the glibc implementation in some cases?
> Shouldn't it be faster?
> By the way, ARM Cortex-A9 core and/or L2 cache controller seems to be
> poorly configured in your angstrom bootloader/kernel. That's why you
> are getting bad performance.
The L2 cache controller is configured by the Linux kernel. Depending on
your workload, some tweaking can improve performance.
--
Måns Rullgård
ma...@mansr.com
so what are these magic kernel options then?
I've had good results from setting L2 write-through, enabling L2 double
linefill and data prefetch, and enabling SCU speculative linefill.
That, and not calling memcpy().
--
Måns Rullgård
ma...@mansr.com
Thanks.
--
Felipe Magno de Almeida
> Hi, unfortunately my project is closed-source, but I can tell you that
> I'm doing image processing on a VGA webcam feed. The memcpy routines
> accounted for 20% of the total time spent in one loop.
That suggests you are copying entire images multiple times. Why would
you do such a crazy thing?
--
Måns Rullgård
ma...@mansr.com
[snip]
> I've had good results from setting L2 write-through, enabling L2 double
> linefill and data prefetch, and enabling SCU speculative linefill.
>
> That, and not calling memcpy().
How does one set these? Any pointers to where I can find out more?
> --
> Måns Rullgård
> ma...@mansr.com
Regards,
Have you looked at page attributes (PAT)? PAT is complementary to the
MTRR settings and can make a big difference (see read-modify-write
abuse). As for malloc() and calloc(), you may find that letting a
general-purpose allocator recycle pages is the wrong strategy. A
stream of data like a webcam feed should quickly reach a steady state,
so you should be able to manage the buffers better than a
general-purpose tool can.
On Wed, Feb 1, 2012 at 5:05 AM, John Salzdurg <swig...@googlemail.com> wrote:
> Hi, unfortunately my project is closed-source, but I can tell you that
> I'm doing image processing on a VGA webcam feed. The memcpy routines
> accounted for 20% of the total time spent in one loop.
--
T o m M i t c h e l l
"My lifetime goal is to be the kind of person my dogs think I am."