Example of code to optimize: memcmp(x,y,160).
Should I use a variant of the movXXX instruction?
Does alignment matter?
/Nordlöw
SSE2 works well for comparing 16 bytes at once. pcmpeqb sets each result
byte to -1 (0xff) where the two source bytes are equal and to zero where
they differ. So you can compare inside a loop and AND every result
vector into an xmm accumulator initialized to all ones. Finally, fetch
the sign bits of the accumulated compare results with pmovmskb into a
general-purpose register and check whether all 16 low bits are set,
which means the buffers are equal. Alignment matters: movdqa is
recommended for performance, and unaligned leading and/or trailing
bytes require some additional code, e.g. using movdqu. You may also try
the SSE3 lddqu (Load Unaligned Double Quadword) for the whole loop.
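; input:  esi - source 1, aligned 16
;         edi - source 2, aligned 16
;         ecx - byte count, non-zero multiple of 16
; return: eax - 1 if the buffers are equal, 0 otherwise
    shr      ecx, 4          ; byte count / 16
    pcmpeqb  xmm0, xmm0      ; accumulator: 16 times 0xff
cmploop:
    movdqa   xmm1, [esi]
    movdqa   xmm2, [edi]
    add      esi, 16
    add      edi, 16
    pcmpeqb  xmm1, xmm2      ; 0xff where bytes match, 0 where they differ
    sub      ecx, 1          ; sets ZF for the jnz below
    pand     xmm0, xmm1      ; accumulate; pand leaves the flags untouched
    jnz      cmploop
cmpready:
    pmovmskb eax, xmm0       ; one sign bit per byte
    cmp      eax, 0xffff     ; all 16 bits set?
    je       equal
    xor      eax, eax        ; not equal: return 0
equal:
    and      eax, 1          ; 0xffff -> 1, 0 -> 0
    ret      0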
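For comparison, the same loop can be sketched in C with SSE2 intrinsics. This is only an illustrative transcription of the routine above, not a tuned implementation. The function name sse2_equal_aligned is made up for the example, and the contract (16-byte aligned pointers, byte count a multiple of 16) is carried over from the assembly and checked with assert.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Same contract as the assembly routine: both
   pointers 16-byte aligned, n a multiple of 16.
   Returns 1 if the buffers are equal, 0 otherwise. */
static int sse2_equal_aligned(const void *a, const void *b, size_t n)
{
    assert(n % 16 == 0);
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);

    const __m128i *p = (const __m128i *)a;
    const __m128i *q = (const __m128i *)b;
    __m128i acc = _mm_set1_epi8(-1);               /* 16 times 0xff */

    for (size_t i = 0; i < n / 16; i++) {
        __m128i x = _mm_load_si128(p + i);         /* movdqa */
        __m128i y = _mm_load_si128(q + i);         /* movdqa */
        /* pcmpeqb, then pand into the accumulator */
        acc = _mm_and_si128(acc, _mm_cmpeq_epi8(x, y));
    }
    /* pmovmskb: one sign bit per byte, all 16 set means equal */
    return _mm_movemask_epi8(acc) == 0xffff;
}
```

For the 160-byte case from the question, the call would be sse2_equal_aligned(x, y, 160), provided x and y are 16-byte aligned.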
If you really have huge streams to compare, you may aggressively unroll
the loop. You may also apply the inequality test periodically, to break
out early once bytes differ. I am not sure about C compilers and their
intrinsic memcmp; I guess it is more about rep cmps. You may use either
gcc inline assembly or SSE2 intrinsics. Intel C has vectorization
features, but I have no idea whether it can generate an SSE2 memcmp
from appropriate C source.
/Gerd
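The unrolling and periodic inequality test suggested above might look roughly like this with SSE2 intrinsics. It is a sketch under assumptions: the names eq16 and sse2_equal_early_exit are invented for the example, the unroll factor of four is arbitrary, and unaligned loads (movdqu) are used throughout so that no alignment is required. Leftover bytes are compared one at a time.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: byte-wise compare of one 16-byte block,
   using unaligned loads (movdqu). */
static inline __m128i eq16(const unsigned char *p, const unsigned char *q)
{
    __m128i x = _mm_loadu_si128((const __m128i *)p);
    __m128i y = _mm_loadu_si128((const __m128i *)q);
    return _mm_cmpeq_epi8(x, y);
}

/* Hypothetical name. Returns 1 if the n bytes at a and b are equal,
   0 otherwise. No alignment requirement. */
static int sse2_equal_early_exit(const void *a, const void *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    size_t i = 0;

    /* Unrolled by four: 64 bytes per iteration, one inequality test
       per iteration so a mismatch stops the scan early. */
    for (; i + 64 <= n; i += 64) {
        __m128i acc = eq16(p + i, q + i);
        acc = _mm_and_si128(acc, eq16(p + i + 16, q + i + 16));
        acc = _mm_and_si128(acc, eq16(p + i + 32, q + i + 32));
        acc = _mm_and_si128(acc, eq16(p + i + 48, q + i + 48));
        if (_mm_movemask_epi8(acc) != 0xffff)
            return 0;
    }
    /* Remaining whole 16-byte blocks. */
    for (; i + 16 <= n; i += 16) {
        if (_mm_movemask_epi8(eq16(p + i, q + i)) != 0xffff)
            return 0;
    }
    /* Trailing bytes, fewer than 16. */
    for (; i < n; i++) {
        if (p[i] != q[i])
            return 0;
    }
    return 1;
}
```

Testing once per 64 bytes is a trade-off: fewer branches on equal data, at the cost of reading up to 48 bytes past the first mismatch within a block.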
gcc should do this automatically, at least recent versions of gcc. To
write custom macros that do this kind of thing, you can use
__builtin_constant_p(), sizeof, and __alignof__.
-hpa
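One way such a macro could be put together is sketched below. It is an assumption about what is meant, not code from the thread: FAST_EQUAL, OBJ_EQUAL, equal_blocks_aligned and struct block160 are all hypothetical names, and __builtin_constant_p and __alignof__ are GCC extensions (also accepted by Clang). The macro takes the SSE2 path only when the size is a compile-time constant multiple of 16 and the pointed-to type is at least 16-byte aligned; otherwise it falls back to memcmp.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: aligned SSE2 compare, n a multiple of 16. */
static int equal_blocks_aligned(const void *a, const void *b, size_t n)
{
    assert(n % 16 == 0);
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);

    const __m128i *p = (const __m128i *)a;
    const __m128i *q = (const __m128i *)b;
    __m128i acc = _mm_set1_epi8(-1);
    for (size_t i = 0; i < n / 16; i++)
        acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_load_si128(p + i),
                                                _mm_load_si128(q + i)));
    return _mm_movemask_epi8(acc) == 0xffff;
}

/* Hypothetical macro: yields 1 if the n bytes at x and y are equal.
   The condition only involves compile-time information, so the
   compiler can discard the branch that is not taken. */
#define FAST_EQUAL(x, y, n)                                   \
    ((__builtin_constant_p(n) && (n) % 16 == 0 &&             \
      __alignof__(*(x)) >= 16 && __alignof__(*(y)) >= 16)     \
         ? equal_blocks_aligned((x), (y), (n))                \
         : (memcmp((x), (y), (n)) == 0))

/* Compare two objects of the same type; sizeof supplies the size. */
#define OBJ_EQUAL(x, y) FAST_EQUAL((x), (y), sizeof(*(x)))

/* Hypothetical 160-byte type with 16-byte alignment. */
struct block160 {
    unsigned char bytes[160];
} __attribute__((aligned(16)));
```

With two struct block160 objects a and b, OBJ_EQUAL(&a, &b) takes the SSE2 path, while FAST_EQUAL on plain unsigned char arrays goes through memcmp because their element alignment is 1.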