Looking for NEON optimization example


Dirk Behme

Jul 28, 2008, 13:09:44
to beagl...@googlegroups.com

A friend uses the following code for some benchmarking. Does anybody
have a short (inline assembly?) example of how to optimize the
multiply/accumulate in the loop for Beagle/Cortex/NEON?

i = 0;
for (j = 0; j < 512; j++) {
    i += a[j] * b[j];
}

Initialization is done by

volatile float i;
float a[512], b[512];
int j;

for (j = 0; j < 512; j++) {
    a[j] = j * 0.1;
    b[j] = j * 0.1;
}

Thanks

Dirk


TK, Pratheesh Gangadhar

Jul 28, 2008, 13:41:43
to beagl...@googlegroups.com
You can try using "-O3 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize" as compile options and see whether gcc automatically generates this for you in this case.

arm-2007q3\lib\gcc\arm-none-linux-gnueabi\4.2.1\include\arm_neon.h defines the NEON intrinsics (http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html).
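As a sketch of what a hand-written intrinsics version of the original multiply/accumulate loop might look like (the function name `vmac_intrin` is made up here, it assumes `n` is a multiple of 4, and being ARM-target code it needs the -mfpu=neon options above):

```c
#include <arm_neon.h>

/* Dot product using NEON intrinsics (sketch; assumes n % 4 == 0). */
float vmac_intrin(const float *a, const float *b, unsigned n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);   /* four lane-wise accumulators */
    unsigned i;

    for (i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i); /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i); /* load 4 floats from b */
        acc = vmlaq_f32(acc, va, vb);      /* acc += va * vb, per lane */
    }

    /* Horizontal add of the four lanes down to one scalar. */
    float32x2_t sum2 = vpadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);
}
```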

I was using the 2007q3 version and it generated the following NEON code for the C function below:

float a[256], b[256], c[256];

void foo(int x)
{
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = x * b[i] + c[i];
    }
}


foo:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        fmsr    s14, r0 @ int
        fsitos  s15, s14
        stmfd   sp!, {r4, lr}
        ldr     ip, .L8
        ldr     r4, .L8+4
        ldr     lr, .L8+8
        sub     sp, sp, #8
        mov     r0, #0
        fsts    s15, [sp, #4]
        fsts    s15, [sp, #0]
        fldd    d5, [sp, #0]
.L2:
        add     r1, r0, ip
        add     r3, r0, r4
        add     r2, r0, lr
        add     r0, r0, #8
        cmp     r0, #1024
        fldd    d7, [r3, #0]
        vmul.f32 d7, d5, d7
        fldd    d6, [r2, #0]
        vadd.f32 d7, d7, d6
        fstd    d7, [r1, #0]
        bne     .L2
        add     sp, sp, #8
        ldmfd   sp!, {r4, pc}
.L9:
        .align  2
.L8:
        .word   a
        .word   b
        .word   c

Regards,
Pratheesh

Måns Rullgård

Jul 28, 2008, 15:44:59
to beagl...@googlegroups.com
Dirk Behme <dirk....@googlemail.com> writes:

Here's a simple NEON version, unrolled 8 times:

float
vmac_neon(const float *a, const float *b, unsigned n)
{
    float s = 0;

    asm ("vmov.f32  q8, #0.0             \n\t"
         "vmov.f32  q9, #0.0             \n\t"
         "1:                             \n\t"
         "subs      %3, %3, #8           \n\t"
         "vld1.32   {d0,d1,d2,d3}, [%1]! \n\t"
         "vld1.32   {d4,d5,d6,d7}, [%2]! \n\t"
         "vmla.f32  q8, q0, q2           \n\t"
         "vmla.f32  q9, q1, q3           \n\t"
         "bgt       1b                   \n\t"
         "vadd.f32  q8, q8, q9           \n\t"
         "vpadd.f32 d0, d16, d17         \n\t"
         "vadd.f32  %0, s0, s1           \n\t"
         : "=w"(s), "+r"(a), "+r"(b), "+r"(n)
         :: "q0", "q1", "q2", "q3", "q8", "q9");

    return s;
}

For comparison, I used this C function:

float
vmac_c(const float *a, const float *b, unsigned n)
{
    float s = 0;
    unsigned i;

    for (i = 0; i < n; i++) {
        s += a[i] * b[i];
    }

    return s;
}

Using gcc csl 2007q3 with flags -O3 -fomit-frame-pointer
-mfloat-abi=softfp -mfpu=neon -mcpu=cortex-a8 -ftree-vectorize
-ffast-math, the NEON version is about twice as fast as the C version.
Dropping -ffast-math makes the C version about 7 times slower, and
gives a slightly different result (differing in the 8th decimal
digit).

I was surprised to see that gcc actually managed to vectorise the code
a bit, even though hand-crafted assembler easily outperforms it.

--
Måns Rullgård
ma...@mansr.com
