Memcpy performance


John Salzdurg

Jan 16, 2012, 5:13:07 PM
to pandaboard
Hi guys, noob here - today I've banged my head against various pieces
of furniture trying to understand how to write gcc inline assembly for
neon in order to optimize memory copy speed. I saw some code on the
arm infocenter website and decided to give it a try:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

The result: the custom neon memcpy with preload achieves around
160MB/s, while the standard memcpy has a speed of 205MB/s. I suppose
this is SDRAM speed, as the data was too large to fit in any cache.

After looking at several threads on this mailing list and reading the
IRC archives, I have to ask: just what is the memory speed of this
board, and how do I go about achieving the best throughput? I've seen
various benchmarks with various settings but no definitive answer as
to what to use and how it affects performance.

Regards,
John.

Måns Rullgård

Jan 17, 2012, 1:26:30 AM
to panda...@googlegroups.com
John Salzdurg <swig...@googlemail.com> writes:

> Hi guys, noob here - today I've banged my head against various pieces
> of furniture trying to understand how to write gcc inline assembly for
> neon in order to optimize memory copy speed. I saw some code on the
> arm infocenter website and decided to give it a try:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
>
> The result: the custom neon memcpy with preload achieves around 160MB/
> s, while the standard memcpy has a speed of 205MB/s. I suppose this is
> SDRAM speed, as the data was too large to fit in any cache.

A decent memcpy should easily achieve over 400MB/s (1GB/s on Panda ES).
There is a decent one in Android. I suggest using that one.

--
Måns Rullgård
ma...@mansr.com

John Salzdurg

Jan 18, 2012, 2:54:06 PM
to pandaboard
On Jan 17, 7:26 am, Måns Rullgård <m...@mansr.com> wrote:
> John Salzdurg <swigi...@googlemail.com> writes:
> > Hi guys, noob here - today I've banged my head against various pieces
> > of furniture trying to understand how to write gcc inline assembly for
> > neon in order to optimize memory copy speed. I saw some code on the
> > arm infocenter website and decided to give it a try:
> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13...
>
> > The result: the custom neon memcpy with preload achieves around 160MB/
> > s, while the standard memcpy has a speed of 205MB/s. I suppose this is
> > SDRAM speed, as the data was too large to fit in any cache.
>
> A decent memcpy should easily achieve over 400MB/s (1GB/s on Panda ES).
> There is a decent one in Android.  I suggest using that one.
>
> --
> Måns Rullgård
> m...@mansr.com

I had a look here and took the whole package:
https://launchpad.net/cortex-strings

I couldn't compile it because of errors from my gcc 4.3.3 not
supporting cortex-a9 (the -mtune switch). Maybe with newer compiler
versions I could achieve better speed? I'm running angstrom and opkg
states that this gcc version is the newest. In the feed browser there
is 4.5 available, but how can I install that without recompiling
everything?

Anyway, I took the memcpy-hybrid.S function from the package and added
it to my project. The result is similar to the neon memcpy without
preload, about 160MB/s. Is there anything that I can do to make a
difference in memory/overall speed? Are other distros more up-to-date
than angstrom? I've mainly chosen this one because it is very
lightweight and I'm happy with it; ubuntu was extremely slow and
unresponsive last time I tried it.

Regards,
John

Sebastien Jan

Jan 19, 2012, 3:04:33 AM
to panda...@googlegroups.com
Hi John,

Ubuntu Oneiric ships with gcc 4.6. You can find Ubuntu server images without the UI stuff (http://cdimage.ubuntu.com/releases/11.10/release/), and Linaro also provides even smaller minimal images (see nano images): http://www.linaro.org/downloads/

regards,
Seb

Måns Rullgård

Jan 19, 2012, 7:44:04 AM
to panda...@googlegroups.com
John Salzdurg <swig...@googlemail.com> writes:

> On Jan 17, 7:26 am, Måns Rullgård <m...@mansr.com> wrote:
>> John Salzdurg <swigi...@googlemail.com> writes:
>> > Hi guys, noob here - today I've banged my head against various pieces
>> > of furniture trying to understand how to write gcc inline assembly for
>> > neon in order to optimize memory copy speed. I saw some code on the
>> > arm infocenter website and decided to give it a try:
>> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13...
>>
>> > The result: the custom neon memcpy with preload achieves around 160MB/
>> > s, while the standard memcpy has a speed of 205MB/s. I suppose this is
>> > SDRAM speed, as the data was too large to fit in any cache.
>>
>> A decent memcpy should easily achieve over 400MB/s (1GB/s on Panda ES).
>> There is a decent one in Android.  I suggest using that one.
>>
>> --
>> Måns Rullgård
>> m...@mansr.com
>
> I had a look here and took the whole package:
> https://launchpad.net/cortex-strings
>
> I couldn't compile it because of errors related to my 4.3.3 version of

GCC 4.3 is ancient. You should use 4.5 or 4.6.

> gcc, and it not supporting cortex-a9 (-mtune switch). Maybe with newer
> compiler versions I could achieve better speed? I'm running angstrom
> and opkg states that this gcc version is the newest. In the feed
> browser there is 4.5 available but how can I install that without
> recompiling everything?
>
> Anyway, I took the memcpy-hybrid.S function from the package and added
> it to my project. The result is similar with the neon memcpy without
> preload, about 160MB/s.

Use the one from Android instead. I can't find an official code browser
but here it is on github:
https://github.com/android/platform_bionic/blob/master/libc/arch-arm/bionic/memcpy.S

> Is there anything that I can do to make a difference in memory/overall
> speed?

You could avoid calling memcpy() in the first place. That's always the
best option.
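[Editorial aside: the zero-copy alternative being hinted at usually amounts to exchanging buffer pointers instead of copying pixels. A sketch, with hypothetical struct and field names:]

```c
/* Sketch of avoiding memcpy() altogether: exchange buffer pointers
   instead of copying one image into another.  The struct and field
   names here are hypothetical, not from any real project. */
#include <stdint.h>

struct frame { uint8_t *pixels; };

/* O(1) regardless of image size, versus O(n) for a memcpy. */
static inline void swap_frames(struct frame *a, struct frame *b) {
    uint8_t *tmp = a->pixels;
    a->pixels = b->pixels;
    b->pixels = tmp;
}
```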

--
Måns Rullgård
ma...@mansr.com

John Salzdurg

Jan 19, 2012, 11:27:28 AM
to pandaboard
Unfortunately the android memcpy didn't work much better; it tops out
at about 210MB/s. I'll try the distros you've suggested and upgrade
my gcc. Thanks for the suggestions - if I come up with anything better
I'll post it here.

John

John Salzdurg

Jan 26, 2012, 12:51:52 PM
to pandaboard
Hello, I have some more questions on the matter. I've currently
checked out three linux distros:

- angstrom for pandaboard (the next-eglibc gave me gcc 4.5 so I
assumed it was the latest)
- ubuntu 11.10 preinstalled omap
- linaro latest binary (Ubuntu desktop) for pandaboard

I compiled/ran the memcpy-neon benchmark I found here:
http://sourceware.org/ml/libc-ports/2009-07/msg00000.html

And got the following results:
On angstrom I got the usual:
L1 cached data:
memcpy_neon : (4096) = 1888.5 MB/s / 2011.3 MB/s
memcpy_arm : (4096) = 1854.4 MB/s / 3451.5 MB/s
memcpy_neon : (6144) = 1901.8 MB/s / 2011.5 MB/s
memcpy_arm : (6144) = 1874.5 MB/s / 3492.4 MB/s

L2 cached data:
memcpy_neon : (65536) = 1819.5 MB/s / 1901.7 MB/s
memcpy_arm : (65536) = 1700.4 MB/s / 1989.8 MB/s
memcpy_neon : (98304) = 1854.8 MB/s / 1941.0 MB/s
memcpy_arm : (98304) = 1706.2 MB/s / 1977.3 MB/s

SDRAM:
memcpy_neon : (2097152) = 201.4 MB/s / 224.2 MB/s
memcpy_arm : (2097152) = 204.5 MB/s / 224.1 MB/s
memcpy_neon : (3145728) = 199.9 MB/s / 223.0 MB/s
memcpy_arm : (3145728) = 203.4 MB/s / 223.7 MB/s

On ubuntu it throws an error while running the correctness tests, so I
didn't bother to run the benchmarks:
--- Running correctness tests (use '-benchonly' option to skip)
---
memcpy_neon: test failed, i=262, offs1=1159 offs2=727904, size=10

I don't know why it does this, but I wouldn't consider Ubuntu as a
good distro to run on this board, it is extremely slow, and after the
boot 'top' reports only 300MB RAM free.


On linaro: it throws the same error, at exactly the same index. I ran
the benchmark however, just to see the speed and it is double for
SDRAM access:
L1 cached data:
memcpy_neon : (4096) = 1881.3 MB/s / 2001.9 MB/s
memcpy_arm : (4096) = 1847.5 MB/s / 3435.5 MB/s
memcpy_neon : (6144) = 1884.6 MB/s / 1984.7 MB/s
memcpy_arm : (6144) = 1866.7 MB/s / 3474.6 MB/s

L2 cached data:
memcpy_neon : (65536) = 1846.4 MB/s / 1939.5 MB/s
memcpy_arm : (65536) = 1706.3 MB/s / 1985.9 MB/s
memcpy_neon : (98304) = 1872.4 MB/s / 1963.1 MB/s
memcpy_arm : (98304) = 1702.4 MB/s / 1957.7 MB/s

SDRAM:
memcpy_neon : (2097152) = 443.3 MB/s / 435.7 MB/s
memcpy_arm : (2097152) = 439.1 MB/s / 447.0 MB/s
memcpy_neon : (3145728) = 439.5 MB/s / 431.4 MB/s
memcpy_arm : (3145728) = 435.0 MB/s / 443.5 MB/s

My questions are:
1. Why is NEON slower than the glibc implementation in some cases?
Shouldn't it be faster?
2. Any thoughts on the errors?
3. Can the errors be a sign of tampering with the data? The benchmark
is comparing a trivial bit by bit copy with the NEON one, so could it
be that the compiler is 'optimizing' the program by copying only half
the data or something (which would explain the 2x speedup)? Both
Ubuntu and Linaro have gcc 4.6, which exhibits this error, as opposed
to gcc 4.5 on angstrom which doesn't.

John

Mark Olleson

Jan 26, 2012, 4:24:32 PM
to panda...@googlegroups.com
Fundamentally, an implementation of memcpy() loads data into registers and then saves it again.   The more you can do at once, the better.

The ARM instruction set has always had the STM and LDM instructions which are capable of loading and storing up to the entire integer register set in one go (and that includes the PC).   This is a pretty unassailable instruction density - especially as you can post-increment the read and write pointers in the same instruction. 

You can do 32 bytes as:

LDMIA r0!,{r2-r10}
STMIA r1!,{r2-r10}
...
...

Optimised memcpy() implementations will probably have unrolled loops and attempt to cache-line align reads and writes.

The vector loads and stores really don't bring much to the table unless they can load more data (in fact, can they do 128 bits at a time?), have a wider bus to the L1 cache, or can load in parallel with the integer instruction set.

In practice, you're going to hit the limits of memory bandwidth: by definition almost none of the data being copied in a big block is going to be in cache. 

The fact that the benchmarks are so similar suggests that this is happening. 

If you're interested in moving lots of memory around, the OMAP4 has a hardware blitter.

Siarhei Siamashka

Jan 26, 2012, 4:59:19 PM
to panda...@googlegroups.com
On Thu, Jan 26, 2012 at 7:51 PM, John Salzdurg <swig...@googlemail.com> wrote:
> 1. Why is NEON slower than the glibc implementation in some cases?
> Shouldn't it be faster?

It depends. What you found was an almost three-year-old memcpy
implementation, tuned for the OMAP3530. And the results may vary for
different SoCs, which use different ARM cores and different memory
controllers. When copying buffers which are larger than L2 cache size,
optimistically one would expect to hit memory bandwidth limit for any
memcpy implementation (unless it does something really stupid like
copying just one byte at a time). But different SoCs may have
different quirks.

By the way, ARM Cortex-A9 core and/or L2 cache controller seems to be
poorly configured in your angstrom bootloader/kernel. That's why you
are getting bad performance.

> 2. Any thoughts on the errors?
> 3. Can the errors be a sign of tampering with the data? The benchmark
> is comparing a trivial bit by bit copy with the NEON one, so could it
> be that the compiler is 'optimizing' the program by copying only half
> the data or something (which would explain the 2x speedup)? Both
> Ubuntu and Linaro have gcc 4.6, which exhibits this error, as opposed
> to gcc 4.5 on angstrom which doesn't.

I tried to reproduce the problem with various combinations of
gcc/binutils and got something similar. When compiling the code for
thumb2 (extra -mthumb gcc option) and using binutils 2.22, the
compiled program segfaults. Using binutils 2.21.1 is fine. After a
quick look, this is apparently caused by the use of BL instruction
instead of BLX in 'run_correctness_test' function, which breaks
arm/thumb interworking. But when doing benchmarking, the functions are
called via pointers and it happens to work correctly. There is one
more binutils bug which can be triggered:
http://sourceware.org/bugzilla/show_bug.cgi?id=12931 (it can be
worked around by adding explicit alignment directives to the assembly
code).

This all seems to be somewhat similar to your symptoms (correctness
test fails, benchmarking works), except that I'm a bit surprised that
it misbehaves instead of crashing and this happens only after running
a number of loop iterations. You might be actually (un)lucky to have
encountered some other bug.

--
Best regards,
Siarhei Siamashka

Måns Rullgård

Jan 26, 2012, 5:01:16 PM
to panda...@googlegroups.com
Mark Olleson <mark.o...@yamaha.co.uk> writes:

> Fundamentally, an implementation of memcpy() loads data into registers and
> then saves it again. The more you can do at once, the better.
>
> The ARM instruction set has always had the STM and LDM instructions which
> are capable of loading and storing up to the entire integer register set in
> one go (and that includes the PC). This is a pretty unassailable
> instruction density - especially as you can post-increment the read and
> write pointers in the same instruction.
>
> You can do 32 bytes as:
>
> LDMIA r0!,{r2-r10}
> STMIA r1!,{r2-r10}

That's 36 bytes.

Those instructions need multiple cycles to execute, transferring 64 bits
per cycle at most.

> Optimised memcpy() implementations will probably have unrolled loops and
> attempt to cache-line align reads and writes.
>
> The vector load and store really don't bring much to the table unless they
> can load more data (in fact they can do 128bits at a time?), have a wider
> bus to the L1 cache or can load in parallel with the integer instruction
> set.

NEON loads and stores can transfer 128 bits per cycle.

> In practice, you're going to hit the limits of memory bandwidth: by
> definition almost none of the data being copied in a big block is going to
> be in cache.

To maximise the bandwidth, it is necessary to use prefetching of data
into the caches. Some memcpy implementations, such as the Android one,
use PLD instructions with good results. The hardware also has some
support for automatic prefetching which helps if configured properly.

With proper prefetching, NEON and plain ARM implementations achieve
roughly the same speeds for large copies.
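[Editorial aside: a rough, portable C sketch of the prefetching technique described here. GCC's __builtin_prefetch lowers to PLD on ARM; the 256-byte prefetch distance and 32-byte inner block are illustrative guesses, not tuned values.]

```c
/* Copy loop with software prefetch, sketching the PLD technique
   discussed above.  __builtin_prefetch compiles to PLD on ARM.
   The 256-byte prefetch distance is an illustrative guess. */
#include <stddef.h>
#include <stdint.h>

void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __builtin_prefetch(src + i + 256);  /* pull data in ahead of use */
        for (size_t j = 0; j < 32; j++)
            dst[i + j] = src[i + j];
    }
    for (; i < n; i++)                       /* tail bytes */
        dst[i] = src[i];
}
```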

> The fact that the benchmarks are so similar suggests that this is
> happening.

The numbers are still way too low. The OMAP4 has dual 32-bit DDR2-800
memory interfaces providing a theoretical bandwidth of several GB per
second. Due to various other bottlenecks, the easily achievable rates
for copy operations are about 1GB/s. Even a trivial copy loop reaches
600MB/s. These figures are for the 4460 (Panda ES). The 4430 tops out
at just over 400MB/s.

--
Måns Rullgård
ma...@mansr.com

Mark Olleson

Jan 26, 2012, 5:17:39 PM
to panda...@googlegroups.com
On 26 January 2012 21:59, Siarhei Siamashka <siarhei....@gmail.com> wrote:
On Thu, Jan 26, 2012 at 7:51 PM, John Salzdurg <swig...@googlemail.com> wrote:
> 1. Why is NEON slower than the glibc implementation in some cases?
> Shouldn't it be faster?


By the way, ARM Cortex-A9 core and/or L2 cache controller seems to be
poorly configured in your angstrom bootloader/kernel. That's why you
are getting bad performance.

Any ideas whose u-boot tree has a better configuration? (I am, for various reasons, using BareBox - it is probably worth confirming that it has reasonable settings.)

Måns Rullgård

Jan 26, 2012, 5:21:28 PM
to panda...@googlegroups.com
Mark Olleson <mark.o...@yamaha.co.uk> writes:

The L2 cache controller is configured by the Linux kernel. Depending on
your workload, some tweaking can improve performance.

--
Måns Rullgård
ma...@mansr.com

John Salzdurg

Jan 27, 2012, 1:00:50 PM
to pandaboard
Thanks for the replies, I'm beginning to understand how all this
works. So what you are saying is that the cache parameters are not
configured properly in the angstrom kernel, thus slowing memory
operations, and I should use a newer distro. I'm currently using
Linaro developer based on Ubuntu 11.10, console without X11. It is
reasonably fast but it only reports 635M of RAM, I don't know what's
up with that. The monitor doesn't work on the latest Linaro so I guess
I'm stuck using this one. I'm really not sure which I should choose;
so far I've tried 6-7 distributions and each one of them either
failed to install properly, had no working monitor output, gave errors
during boot, or was broken somehow (like angstrom with the slow
memory access).

Finally, I've re-run the tests and now I have >400MB/s memcpy speed,
and as a bonus my program runs 50% faster when compiled with GCC 4.6.
So I guess that's an improvement :) Thanks again for the help!

Another question which isn't really related - apart from using the two
processors, is there any way of accessing other computational hardware
on the omap4430? I feel like there are untapped resources on the chip,
as the SGX is sitting around and doing nothing - can it be programmed
to do stuff like matrix transformations or image filtering?

John

John Salzdurg

Jan 30, 2012, 7:13:30 AM
to pandaboard
One quick update: I've compiled my project with the old toolchain from
angstrom (gcc 4.3), and the speed improvements are still there on the
linaro distribution, meaning that it does indeed have faster kernel
options/memory transfer. I would advise anyone using angstrom to try
this distribution as well, if they need speed improvements.

John

Vladimir Pantelic

Jan 30, 2012, 12:29:49 PM
to panda...@googlegroups.com

so what are these magic kernel options then?

Måns Rullgård

Jan 30, 2012, 12:49:01 PM
to panda...@googlegroups.com
Vladimir Pantelic <vlad...@gmail.com> writes:

I've had good results from setting L2 write-through, enabling L2 double
linefill and data prefetch, and enabling SCU speculative linefill.

That, and not calling memcpy().

--
Måns Rullgård
ma...@mansr.com

Felipe Magno de Almeida

Jan 30, 2012, 12:21:01 PM
to panda...@googlegroups.com
Could you make available the benchmark code you've used?

Thanks.

--
Felipe Magno de Almeida

John Salzdurg

Feb 1, 2012, 8:05:22 AM
to pandaboard
Hi, unfortunately my project is closed-source, but I can tell you that
I'm doing image processing on a VGA webcam feed. The memcpy routines
accounted for 20% of the total time spent in one loop. The 50%
increase was a conservative estimate: from about 10 frames per second
my application now runs at 17-18 frames per second, so the performance
is almost double what it was, and this simply by running it on the
linaro distribution (the compiler was the same). I am now trying to
get it to work in realtime, with some NEON optimizations for the rest
of the code, hence my question about squeezing extra processing power
from various bits of the chip.

One side-effect that I've noticed is that timing various bits of the
program with clock_gettime is no longer accurate. On the angstrom
distribution I was getting quite stable timings, +/- 10%, but now
they're all over the place, including timings that read '0' and some
that are 2-3 times what I was expecting.

Cheers,
John

Måns Rullgård

Feb 1, 2012, 1:47:59 PM
to panda...@googlegroups.com
John Salzdurg <swig...@googlemail.com> writes:

> Hi, unfortunately my project is closed-source, but I can tell you that
> I'm doing image processing on a VGA webcam feed. The memcpy routines
> accounted for 20% of the total time spent in one loop.

That suggests you are copying entire images multiple times. Why would
you do such a crazy thing?

--
Måns Rullgård
ma...@mansr.com

Felipe Magno de Almeida

Feb 3, 2012, 6:46:33 PM
to panda...@googlegroups.com
2012/1/30 Måns Rullgård <ma...@mansr.com>:
> Vladimir Pantelic <vlad...@gmail.com> writes:

[snip]

> I've had good results from setting L2 write-through, enabling L2 double
> linefill and data prefetch, and enabling SCU speculative linefill.
>
> That, and not calling memcpy().

How does someone set these? Have any pointers where I can find?

> --
> Måns Rullgård
> ma...@mansr.com


Regards,

Tom Mitchell

Feb 3, 2012, 8:23:11 PM
to panda...@googlegroups.com
Well closed source -- makes it difficult for others to help.
What is the VGA data size?
Does it begin and end inside the same page? Check both source and destination.
Are you triggering read modify write cache line abuse?
Are you triggering cache line aliasing?
Compression? Encoding?

PAT page attributes? PAT is complementary to the MTRR settings
and can make a big difference (see read modify write abuse).

With malloc() and calloc(), you may find that letting a general-purpose
tool recycle pages is the wrong strategy. A stream of data like a
webcam feed should quickly reach steady state, so you should be able
to manage the buffers better than a general-purpose tool can.
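[Editorial aside: the buffer-management idea can be sketched as a small preallocated ring of frame buffers that get reused round-robin instead of malloc'd per frame. The 3-buffer depth and VGA 16bpp frame size are illustrative assumptions.]

```c
/* Sketch of the buffer-recycling idea: preallocate a small ring of
   frame buffers once and reuse them, instead of malloc/free per frame.
   NBUF and FRAME_SIZE are illustrative assumptions. */
#include <stdlib.h>

#define NBUF 3
#define FRAME_SIZE (640 * 480 * 2)   /* VGA at 16 bits per pixel (assumed) */

static unsigned char *pool[NBUF];
static int next;

int pool_init(void) {                 /* one-time allocation */
    for (int i = 0; i < NBUF; i++)
        if (!(pool[i] = malloc(FRAME_SIZE)))
            return -1;
    return 0;
}

unsigned char *pool_next(void) {      /* round-robin reuse, no allocation */
    unsigned char *buf = pool[next];
    next = (next + 1) % NBUF;
    return buf;
}
```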

On Wed, Feb 1, 2012 at 5:05 AM, John Salzdurg <swig...@googlemail.com> wrote:
> Hi, unfortunately my project is closed-source, but I can tell you that
> I'm doing image processing on a VGA webcam feed. The memcpy routines
> accounted for 20% of the total time spent in one loop.

--

                      T o m   M i t c h e l l
"My lifetime goal is to be the kind of person my dogs think I am."

chardson

Feb 7, 2012, 8:33:10 PM
to pandaboard


On Feb 3, 6:46 pm, Felipe Magno de Almeida
<felipe.m.alme...@gmail.com> wrote:
> 2012/1/30 Måns Rullgård <m...@mansr.com>:
>
> > Vladimir Pantelic <vlado...@gmail.com> writes:
>
> [snip]
>
> > I've had good results from setting L2 write-through, enabling L2 double
> > linefill and data prefetch, and enabling SCU speculative linefill.
>
> > That, and not calling memcpy().
>
> How does someone set these? Have any pointers where I can find?
>

Yeah, I'm interested in that as well. I found the ARM references in
the TRM, but I don't know if I have to do something different under
Linux (e.g. tweak the kernel, or disable a power-saving mode)


Regarding the timing performance being all over the place, I noticed
that today and sent an email to linaro-dev. It seems to be accurate as
long as you're timing on the order of hundreds of milliseconds. Timing
1-2ms events yields unreliable numbers, including elapsed times (t1 -
t0) of 0 seconds and 0 nanoseconds.

See http://lists.linaro.org/pipermail/linaro-dev/2012-February/thread.html,
"Minimum timing resolution in Ubuntu/Linaro on the PandaBoard ES"

Andrew