I am using the following little program to check BB memory bandwidth. The
number I got is about 31MB/S for the C version, and 83MB/S for the SIMD
version. This seems to be too slow; I had expected something like 10x
faster. Any suggestions on where to look: x-loader, u-boot, the kernel, or
just my compile flags?
Thanks,
Guo
/*
In the omap host environment, compile the code as
arm-none-linux-gnueabi-gcc -O2 -o membench membench.c
*/
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* for memalign() */
#include <time.h>

int main(int argc, char** argv)
{
    int *pbuf1, *pbuf2;
    const int bufSize = 8*1024*1024;   /* ints per buffer, i.e. 32MB each */
    int i, j;
    clock_t t1, t2;
    double tdiff;
    const int ITER = 100;
    double rate;
    typedef int v4si __attribute__ ((vector_size(16)));
    v4si *p1, *p2;

    pbuf1 = (int*)memalign(16, bufSize*sizeof(int));
    pbuf2 = (int*)memalign(16, bufSize*sizeof(int));
    if (!pbuf1 || !pbuf2)
        return 1;

    for (i = 0; i < bufSize; i++)
        pbuf2[i] = i;

    /* plain C word-by-word copy */
    t1 = clock();
    for (j = 0; j < ITER; j++)
    {
        for (i = 0; i < bufSize; i++)
            pbuf1[i] = pbuf2[i];
    }
    t2 = clock();
    tdiff = (double)t2 - (double)t1;
    /* compute the byte count in double to avoid integer overflow */
    rate = (double)ITER*bufSize*sizeof(int)/(tdiff/CLOCKS_PER_SEC);
    rate /= 1024.0*1024.0;
    printf("rate(MB/S) = %.3f, clocks_per_sec %ld\n", rate,
           (long)CLOCKS_PER_SEC);
    /*
    for (i = 900; i < 910; i++)
        printf("%d\n", pbuf1[i]);
    */

    /* SIMD version of the memory benchmark: 16-byte vector copies */
    t1 = clock();
    for (j = 0; j < ITER; j++)
    {
        p1 = (v4si*)pbuf1;
        p2 = (v4si*)pbuf2;
        for (i = 0; i < bufSize/4; i++)
        {
            *p1 = *p2;
            p1++;
            p2++;
        }
    }
    t2 = clock();
    tdiff = (double)t2 - (double)t1;
    rate = (double)ITER*bufSize*sizeof(int)/(tdiff/CLOCKS_PER_SEC);
    rate /= 1024.0*1024.0;
    printf("SIMD rate(MB/S) = %.3f, clocks_per_sec %ld\n", rate,
           (long)CLOCKS_PER_SEC);

    free(pbuf1);
    free(pbuf2);
    return 0;
}
[root@beagleboard c]# uname -a
Linux beagleboard.org 2.6.22.18-omap3 #1 Thu Jul 24 15:29:36 IST 2008
armv7l unknown
host: arm-none-linux-gnueabi-gcc -v
Using built-in specs.
Target: arm-none-linux-gnueabi
Configured with: /scratch/paul/lite/linux/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu
--target=arm-none-linux-gnueabi --enable-threads --disable-libmudflap
--disable-libssp --disable-libgomp --disable-libstdcxx-pch --with-gnu-as
--with-gnu-ld --enable-languages=c,c++ --enable-shared
--enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion=CodeSourcery
Sourcery G++ Lite 2007q3-51
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/codesourcery
--with-sysroot=/opt/codesourcery/arm-none-linux-gnueabi/libc
--with-build-sysroot=/scratch/paul/lite/linux/install/arm-none-linux-gnueabi/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/paul/lite/linux/install/arm-none-linux-gnueabi/bin
Thread model: posix
gcc version 4.2.1 (CodeSourcery Sourcery G++ Lite 2007q3-51)
These numbers still sound almost too small to be true.
regards,
Guo
On 9 Sep 2008, at 06:19, guo tang wrote:
> I am using the following little program to check BB memory
> bandwidth. The number I got is about 31MB/S for the C version, and
> 83MB/S for the SIMD version. This seems to be too slow; I had
> expected something like 10x faster. Any suggestions on where to
> look: x-loader, u-boot, the kernel, or just my compile flags?
>
[snip]
> for (j = 0; j < ITER; j++)
> {
>     p1 = (v4si*)pbuf1;
>     p2 = (v4si*)pbuf2;
>     for (i = 0; i < bufSize/4; i++)
>     {
>         *p1 = *p2;
>         p1++;
>         p2++;
>     }
> }
Out of interest, what results do you see if your inner loop(s) above
do only reads, or only writes? (It strikes me that reading in bursts
and writing in bursts will be less harsh on write buffers, etc., than
read-one-write-one-read-one-write-one...)
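Something like this, in terms of your program (untested sketch on my part;
the volatile sink is only there so the compiler can't delete the reads):

/* read-only and write-only variants of the inner loop (untested sketch) */
volatile int sink;                 /* stops the compiler removing the reads */
int acc = 0;
for (i = 0; i < bufSize; i++)      /* reads only */
    acc += pbuf2[i];
sink = acc;
for (i = 0; i < bufSize; i++)      /* writes only */
    pbuf1[i] = i;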
Cheers,
Matt
If I replace *p1 with a local variable, the SIMD version speed increases
from 155MB/S to 168MB/S. With a similar change, the C version speed
increases from about 31MB/S to 190MB/S.
Disassembling the code shows that the C version's target local variable is
held in a register, while the SIMD version's target local variable is still
a stack variable (maybe in L1 cache). So I guess the slow speed might just
be due to the slow DDR->L2->L1->CPU read path. I don't know ARM assembly
programming, so I cannot force the SIMD version to load into a register.
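What I tried looks roughly like this (sketch; whether the v4si accumulator
really ends up in a register is up to the compiler):

/* replace the store target with a local accumulator (sketch) */
v4si acc = {0, 0, 0, 0};
p2 = (v4si*)pbuf2;
for (i = 0; i < bufSize/4; i++)
{
    acc += *p2;        /* read-only: accumulate instead of storing */
    p2++;
}
*(v4si*)pbuf1 = acc;   /* store once so the loop is not optimised away */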
I am new to the ARM architecture. Does ARM have explicit cache control
instructions? For example, the ability to overlap cache loads with other
computation?
regards,
Guo
> Now better with your flags.
> rate(MB/S) = 31.156, clocks_per_sec 1000000
> SIMD rate(MB/S) = 154.739, clocks_per_sec 1000000
>
> These numbers still sound almost too small to be true.
Are you running anything else on your Beagle? Using a cleaned-up
version of your test (attached), I get quite different numbers. I
added timings for plain memcpy() and a hand-written assembler
function, and the result looks like this:
memcpy 192566305 B/s
INT32 163817378 B/s
C SIMD 163537932 B/s
ASM SIMD 280814532 B/s
--
Måns Rullgård
ma...@mansr.com
The very different figures for the naive C loop prompted me to dig a
little deeper, and I found something strange. It appears that
addresses 0x2001000 (32M+4k) apart use the same cache line or
something similar, severely degrading the throughput of the copy.
Your test just happens to allocate the buffers with this magic
interval.
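A quick way to see it is to sweep the distance between the buffers and time
a fixed-size copy at each offset (sketch only, not my actual test harness):

#include <stdio.h>
#include <time.h>
#include <malloc.h>

int main(void)
{
    const int words = 1024*1024;               /* 4MB copied per timing */
    char *base = memalign(4096, 40*1024*1024);
    int off, i, j;
    for (off = 0; off < 8; off++)
    {
        int *dst = (int*)base;
        int *src = (int*)(base + 32*1024*1024 + off*4096);
        clock_t t1 = clock();
        for (j = 0; j < 20; j++)
            for (i = 0; i < words; i++)
                dst[i] = src[i];
        printf("distance 32M+%dk: %.1f MB/s\n", 4*off,
               20.0*words*4 / (1024.0*1024.0) /
               ((clock() - t1) / (double)CLOCKS_PER_SEC));
    }
    return 0;
}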
--
Måns Rullgård
ma...@mansr.com
> Nice finding. Could you elaborate more on the cache line problem? I
> haven't understood it.
>
> Is this what is happening? The two buffers are 32M+4K apart, so in the
> copy, target and source are using the same cache line. But then the copy
> operation would be from L1 cache to L1 cache, and the copy should be
> faster instead of slower, right?
If the cache is write-allocate, and the source and destination
addresses, for whatever reason, must use the same cache-line, only one
of them can be in cache at any time. Copying a word at a time under
such conditions will result in constant cache misses.
I'm a bit surprised that this is happening, since the Cortex-A8 L1
cache is 4-way set associative, and the L2 cache is 8-way set
associative.
> Is there any way to avoid this problem in a real application?
Profile carefully, looking for unexpected cache misses.
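If the problem is that the allocator happens to hand back buffers at that
exact distance, padding one of the allocations is a cheap workaround
(sketch; the 4k shift is an arbitrary choice):

/* over-allocate and shift one buffer so the two are no longer exactly
   32M+4k apart */
char *raw = (char*)memalign(16, bufSize*sizeof(int) + 2*4096);
pbuf2 = (int*)(raw + 4096);    /* still 16-byte aligned */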
--
Måns Rullgård
ma...@mansr.com
I made some tweaks to the code, and disabled the framebuffer. The
results in numbers, using 8MB buffers:
copy memcpy 225595776 B/s
copy ASM ARM 301156146 B/s
copy ASM NEON 343882833 B/s
copy ASM A+N 352340617 B/s
write memset 530244447 B/s
write ASM ARM 530860509 B/s
write ASM NEON 531750947 B/s
write ASM A+N 590044870 B/s
This is running on a rev C prototype with ES3.0 silicon, in case it
matters. The kernel is l-o head with some patches.
Here's the improved ARM+NEON memcpy:
memcpy_armneon:
        push    {r4-r11}
        mov     r3, r0                  @ keep r0 intact as the return value
1:      subs    r2, r2, #128            @ copy 128 bytes per iteration
        pld     [r1, #64]               @ prefetch well ahead of the loads
        pld     [r1, #256]
        pld     [r1, #320]
        ldm     r1!, {r4-r11}           @ 32 bytes via ARM registers
        vld1.64 {d0-d3},   [r1,:128]!   @ 3 x 32 bytes via NEON
        vld1.64 {d4-d7},   [r1,:128]!
        vld1.64 {d16-d19}, [r1,:128]!
        stm     r3!, {r4-r11}
        vst1.64 {d0-d3},   [r3,:128]!
        vst1.64 {d4-d7},   [r3,:128]!
        vst1.64 {d16-d19}, [r3,:128]!
        bgt     1b
        pop     {r4-r11}
        bx      lr
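A matching C declaration would look something like this (the constraints
are my reading of the code, not documented anywhere: the length must be a
positive multiple of 128, and both pointers 16-byte aligned):

/* assumed constraints: n a positive multiple of 128 bytes; dst and src
   16-byte aligned (the vld1/vst1 :128 alignment hints fault otherwise) */
void *memcpy_armneon(void *dst, const void *src, size_t n);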
The super-fast ARM+NEON memset looks like this:
memset_armneon:
        push    {r4-r11}
        mov     r3, r0                  @ keep r0 intact as the return value
        vdup.8  q0, r1                  @ replicate the fill byte into q0/q1
        vmov    q1, q0
        orr     r4, r1, r1, lsl #8      @ replicate the fill byte into r4
        orr     r4, r4, r4, lsl #16
        mov     r5, r4                  @ ... and into r5-r11
        mov     r6, r4
        mov     r7, r4
        mov     r8, r4
        mov     r9, r4
        mov     r10, r4
        mov     r11, r4
        add     r12, r3, r2, lsr #2     @ NEON starts 1/4 of the way in
1:      subs    r2, r2, #128            @ write 128 bytes per iteration
        pld     [r3, #64]
        stm     r3!, {r4-r11}           @ 32 bytes via ARM registers
        vst1.64 {d0-d3}, [r12,:128]!    @ 3 x 32 bytes via NEON
        vst1.64 {d0-d3}, [r12,:128]!
        vst1.64 {d0-d3}, [r12,:128]!
        bgt     1b
        pop     {r4-r11}
        bx      lr
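The C-side declaration would be the same shape (again my reading of the
code: n a positive multiple of 128 and a 16-byte-aligned buffer; the ARM
stores fill the first quarter and NEON the remaining three quarters, since
r12 starts at dst + n/4):

/* assumed constraints: n a positive multiple of 128 bytes; dst 16-byte
   aligned; the fill byte is taken from the low 8 bits of c */
void *memset_armneon(void *dst, int c, size_t n);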
--
Måns Rullgård
ma...@mansr.com
The Cortex A8 CPU has an L2 cache preload engine (PLE) [1], which can
be used to preload large blocks of data into the L2 cache. Using
this, I was able to push the copy throughput even higher:
copy PLE+NEON 415596200 B/s
I also coded up some pure read tests:
read ASM ARM 637178403 B/s
read ASM NEON 719075707 B/s
read PLE+NEON 741113693 B/s
The preload engine seems like it can be useful. As with everything,
however, it takes some fine-tuning of parameters to maximise
performance.
[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344b/Babjbfdb.html
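Without the kernel patch, plain software prefetch is the portable fallback:
GCC's __builtin_prefetch becomes a PLD instruction on ARM. This is not the
PLE, just a sketch, and the prefetch distance is a guess that needs tuning:

/* software prefetch (PLD), not the PLE; 16 vectors (256 bytes) ahead
   is a guess */
for (i = 0; i < bufSize/4; i++)
{
    __builtin_prefetch(&p2[i + 16]);
    p1[i] = p2[i];
}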
--
Måns Rullgård
ma...@mansr.com
Can someone tell me if the memory allocated for the omap frame buffer on
the kernel side is configured as cache-enabled?
I am interested in this for performance reasons.
Thanks,
Laurent
It should be mapped as non-cached, write-combining.
--
Måns Rullgård
ma...@mansr.com
> ok thanks Mans,
>
> Another question: with the armv7, when the data cache is disabled, does
> it mean, as on other CPUs, that memory accesses are made word by word
> instead of in "burst" mode (one data cache line per transfer)?
For single-word reads, I suppose that would be the case. For writes, you
can still have a write-combining buffer. I don't know how multi-word reads
are handled in this case.
> In this case, all frame buffer allocations, including offscreen
> surfaces, are non-cached; this may slow down graphics performance for
> any kind of pixmap operation.
>
> Two suggestions:
> - At least, only the final frame buffer surface could be non-cached, not
> the offscreen buffers.
> - Or the whole frame buffer partition could be cached, and a data cache
> flush could happen when the VBL occurs.
Flushing the cache also takes time. Which is quicker, writing to
uncached memory or writing to cache and flushing, depends on the
precise access patterns in each case. If mostly writing, as is
typically the case with a framebuffer, a write-allocate cache would
waste time reading from memory to fill the cache lines as they are
allocated. Without write-allocate, there will be no difference compared
to uncached.
The only way to know for sure is to benchmark specific cases.
> In this case, your NEON-accelerated memcpy/memset functions could be
> used to advantage in omap_fb to speed up blit copies.
Any kind of copy within the framebuffer is probably best done with the
DMA engine.
> I am not criticizing the current omap_fb implementation. I
> understand that this model is simple for cache coherency.
> I work for a company making a graphics engine for embedded
> devices, and all the frame buffer implementations (ST7109, Sigma
> Designs, TI DaVinci) look much the same.
I've been doing embedded graphics for a few years using various chips,
and I've had opportunities to experiment with various approaches.
> My strong feeling is that using cached memory with a more complex
> frame buffer module could speed up the graphics side and save some
> CPU bandwidth.
Working with uncached memory certainly requires a little extra
attention, or there will be consequences. Also keep in mind that
allowing the framebuffer to be cached means there will be less room
for other data in the cache. The increased thrashing can more than
cancel any gain from having the framebuffer cached. Again,
benchmarks are the only way to know for sure.
--
Måns Rullgård
ma...@mansr.com
Here's a patch to enable userspace access to the PLE:
http://git.mansr.com/?p=linux-omap;a=commitdiff;h=3e1afa3
Here's some code that uses it:
http://thrashbarg.mansr.com/~mru/mem.S
--
Måns Rullgård
ma...@mansr.com