I've just recently gotten back into x86 assembly after a few years
away. (I think my last assembly project was circa 1994.) I've been
working mostly with string/buffer manipulations and now I'm dipping my
toes into the SSE2 pond to see if I can speed things up a bit.
My first question involves swapping the contents of two buffers.
Both buffers start on a multiple of 16 and are a multiple of 16 in
length. The main loop for swapping the buffers using GPRs is as
follows:
eax = pointer to buffer 1
edx = pointer to buffer 2
esi = number of times through loop
$$loop1:
mov ecx, [eax]
mov ebx, [edx]
mov [eax], ebx
mov [edx], ecx
add eax, 4
add edx, 4
dec esi
jne $$loop1
The SSE2 version is as follows:
(Same input parameters)
$$loop1:
movdqa xmm0, [eax]
movdqa xmm1, [edx]
movdqa [edx], xmm0
movdqa [eax], xmm1
add eax, 16
add edx, 16
dec esi
jne $$loop1
The SSE2 version is only 10%-20% faster than the GPR version. Now I
wasn't expecting a 4x improvement, but 20% at best seems kind of low.
So I assume there is something else at work here. My question is why
is the SSE2 version not faster than it is, and are there any
improvements I can make to speed it up?
The second question concerns the routine I've written to find the
length of a C string. Here's the routine, followed by my concerns:
edx = pointer to string.
length returned in eax.
pxor xmm0, xmm0          ; all-zero register for the compares
mov ecx, -16             ; offset counter (pre-incremented in the loop)
mov eax, edx
and eax, 15              ; is the string 16-byte aligned?
jne $$notaligned
$$aloop:
add ecx, 16
movdqa xmm1, [edx+ecx]   ; aligned 16-byte load
pcmpeqb xmm1, xmm0       ; 0FFh in each byte that was zero
pmovmskb eax, xmm1       ; gather the 16 byte results into a bitmask
test eax, eax
je $$aloop               ; no zero byte yet
jmp $$exit
$$notaligned:
$$naloop:
add ecx, 16
movdqu xmm1, [edx+ecx]   ; unaligned load
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
je $$naloop
$$exit:
bsf eax, eax             ; index of first zero byte within the block
add eax, ecx             ; plus the block offset = length
ret
No problem with speed here, this is *much* faster than strlen (about
400%). My concern here is that it is possible for this routine to
read up to 15 bytes past the end of the string. Is this a GP fault
just waiting to happen? I do know that this code is not exactly
original, and after the potential overrun occurred to me, I
did some research on the web and discovered several similar examples.
None of the articles covered overrun problems. Am I needlessly
concerned? And if so, why?
Variations of the above routine will be needed in almost every
string function I have/will write. So I need it to be a sound
foundation.
As a side note, I was very surprised to find that the speed penalty
for using the unaligned movdqu instruction is very slight. An
unaligned string takes only about 10% more time to parse for the
ending 0 than an aligned one.
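In case it helps anyone following along, here's the aligned scan as a
C sketch using SSE2 intrinsics (my own naming; like the asm, it assumes
a 16-byte aligned start and may read past the terminator):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

size_t xstrlen(const char *s)  /* s assumed 16-byte aligned */
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)                                     /* zero byte in block */
            return i + (size_t)__builtin_ctz(mask);   /* gcc; same as BSF */
        i += 16;
    }
}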
Thanks for any and all help,
DSF
It's an issue for the unaligned case. In the aligned case, the fetch
cannot cross a page boundary, so you can't get a page fault; and
unless you're actually using a system that sets the segment limits in
a meaningful way, you can't get a GP fault either. OTOH, it certainly
*can* happen in the unaligned case if your string ends in just the
wrong spot.
Don't forget cache or memory bandwidth. SSE2 and its ilk will see
greater improvements over older methods when performing calculations
than when moving memory.
It depends where you allocate your strings and what's waiting at up to
15 bytes beyond them. If you can be sure that it is always safe to
read (and that the strings always terminate) then you should be OK.
If you don't have things after your string space you should consider
the effect of reading into that unowned space. The effect will depend
on how the memory is laid out. On a modern OS you could get a page
fault that you don't want. That could lead to delays or even your
thread being aborted.
One way to avoid unwanted page faults is to scan in aligned blocks.
(Maybe you're already doing that.) Then if you go into an unowned page
the string lacked a terminating zero and you would have gone into that
page anyway.
James
First of all: How large are your blocks?
Do they fit in L1/L2/L3?
The obvious idea is to use more SSE registers: DRAM accesses are
significantly faster when handling a large burst than when you alternate
between reading and writing:
loop64:
movdqa xmm0,[eax] ; Load a full cache line of data:
movdqa xmm1,[eax+16]
movdqa xmm2,[eax+32]
movdqa xmm3,[eax+48]
movdqa xmm4,[edx] ; Load the other line
movdqa xmm5,[edx+16]
movdqa xmm6,[edx+32]
movdqa xmm7,[edx+48]
; Then write them back:
movdqa [eax+0],xmm4
movdqa [eax+16],xmm5
movdqa [eax+32],xmm6
movdqa [eax+48],xmm7
movdqa [edx+0],xmm0
movdqa [edx+16],xmm1
movdqa [edx+32],xmm2
movdqa [edx+48],xmm3
; Adjust, check for tail etc
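For reference, the same burst pattern as a C sketch with intrinsics
(assuming, as above, 16-byte aligned pointers and a length that is a
multiple of 64):

#include <emmintrin.h>
#include <stddef.h>

void swap64(char *a, char *b, size_t len)
{
    size_t i;
    for (i = 0; i < len; i += 64) {
        /* Load a full cache line from each buffer... */
        __m128i a0 = _mm_load_si128((__m128i *)(a + i));
        __m128i a1 = _mm_load_si128((__m128i *)(a + i + 16));
        __m128i a2 = _mm_load_si128((__m128i *)(a + i + 32));
        __m128i a3 = _mm_load_si128((__m128i *)(a + i + 48));
        __m128i b0 = _mm_load_si128((__m128i *)(b + i));
        __m128i b1 = _mm_load_si128((__m128i *)(b + i + 16));
        __m128i b2 = _mm_load_si128((__m128i *)(b + i + 32));
        __m128i b3 = _mm_load_si128((__m128i *)(b + i + 48));
        /* ...then write both lines back, swapped. */
        _mm_store_si128((__m128i *)(a + i), b0);
        _mm_store_si128((__m128i *)(a + i + 16), b1);
        _mm_store_si128((__m128i *)(a + i + 32), b2);
        _mm_store_si128((__m128i *)(a + i + 48), b3);
        _mm_store_si128((__m128i *)(b + i), a0);
        _mm_store_si128((__m128i *)(b + i + 16), a1);
        _mm_store_si128((__m128i *)(b + i + 32), a2);
        _mm_store_si128((__m128i *)(b + i + 48), a3);
    }
}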
> The second question concerns the routine I've written to find the
> length of a C string. Here's the routine, followed by my concerns:
[snip]
> No problem with speed here, this is *much* faster than strlen (about
> 400%). My concern here is that it is possible for this routine to
> read up to 15 bytes past the end of the string. Is this a GP fault
> just waiting to happen? I do know that this code is not exactly
No, not if you're always making sure your data accesses are aligned!
A memory page cannot end inside a cache line (page size is a multiple
of 16), so since all your loads are 16-byte aligned, each load falls
entirely within a single page and you'll be OK.
> original, and after the potential overrun occurred to me, I
> did some research on the web and discovered several similar examples.
> None of the articles covered overrun problems. Am I needlessly
> concerned? And if so, why?
>
> Variations of the above routine will be needed in almost every
> string function I have/will write. So I need it to be a sound
> foundation.
>
> As a side note, I was very surprised to find that the speed penalty
> for using the unaligned movdqu instruction is very slight. An
> unaligned string takes only about 10% more time to parse for the
> ending 0 than an aligned one.
An unaligned access can indeed cause a fault, so it is better to handle
misaligned starting offsets.
You should probably look into using an aligned access for the first 16
bytes as well, even with misaligned data: You can simply use a mask to
filter away any zero bytes before the real start of the string.
The DEC Alpha cpu had special opcodes to do these sorts of things for
byte data using 64-bit load/store operations.
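In C intrinsics, the first-block trick might look like this (a sketch;
lead_mask and first_chunk are just names I made up here):

#include <emmintrin.h>
#include <stdint.h>

static const unsigned char lead_mask[31] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,     /* cover bytes before the string */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0    /* leave the real data alone */
};

/* Return the aligned 16-byte block containing the start of s, with any
   bytes before s forced non-zero so they can't match the terminator. */
static __m128i first_chunk(const char *s)
{
    const char *base = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned lead = (unsigned)((uintptr_t)s & 15);   /* 0..15 */
    __m128i chunk = _mm_load_si128((const __m128i *)base);
    __m128i m = _mm_loadu_si128((const __m128i *)(lead_mask + 15 - lead));
    return _mm_or_si128(chunk, m);
}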
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
>
>First of all: How large are your blocks?
Currently 1920 bytes.
>Do they fit in L1/L2/L3?
Should be no problem.
>The obvious idea is to use more SSE registers: DRAM accesses are
>significantly faster when handling a large burst than when you alternate
>between reading and writing:
>
>loop64:
> movdqa xmm0,[eax] ; Load a full cache line of data:
> movdqa xmm1,[eax+16]
> movdqa xmm2,[eax+32]
> movdqa xmm3,[eax+48]
> movdqa xmm4,[edx] ; Load the other line
> movdqa xmm5,[edx+16]
> movdqa xmm6,[edx+32]
> movdqa xmm7,[edx+48]
>; Then write them back:
> movdqa [eax+0],xmm4
> movdqa [eax+16],xmm5
> movdqa [eax+32],xmm6
> movdqa [eax+48],xmm7
> movdqa [edx+0],xmm0
> movdqa [edx+16],xmm1
> movdqa [edx+32],xmm2
> movdqa [edx+48],xmm3
>
>; Adjust, check for tail etc
I tried the above. The difference between it and my original code
was negligible.
(Bulb lights over head.)
I think I know why the buffer swap code shows less speed improvement
than my SSE2 code for strlen and strcpy. I'm going to have to give
myself a big fat DUH! on this one if I'm right. Please let me know if
my logic is correct.
For the strlen and strcpy tests, I simply call the library and SSE2
functions in a loop of 100,000 and compare the times. In each case,
they are operating on the *same* string(s) each time through the loop.
The roughly 5K-long string(s) are large as far as strings go, but
small as far as caching goes. Everything is probably in a cache from
the second loop on.
The buffer swap is actually part of a sort routine that, for the
purposes of this test, sorts 100,000 items 1920 bytes in length. The
buffer swap routine is called 428,932 times. And each time it's
called, the two buffers are *different* memory locations. I know that
100,000 elements, each 1920 bytes in length, are not going to remain
in any modern cache.
So, as I see it, in the strXXX tests, the strings are cached and the
speed advantage of handling 16 bytes at a time over 4 is clearly
evident, while the buffer swap's need to refetch data swamps most of
the speed advantage SSE2 provides.
>A memory page cannot end inside a cache line, and since all your loads
>are 16-byte aligned, you'll be OK.
So if you are sitting at the start of an aligned 16-byte block, the
entire block is guaranteed to be owned by your process?
>An unaligned access can indeed cause a fault, so it is better to handle
>misaligned starting offsets.
>
>You should probably look into using an aligned access for the first 16
>bytes as well, even with misaligned data: You can simply use a mask to
>filter away any zero bytes before the real start of the string.
So by the same reasoning, if a string, buffer, whatever starts in
the middle of an aligned 16-byte block, memory from the start of that
block has to be owned by the process as well?
How does this updated strlen look?
(Note that the remarked lines are for an inline version to be used
in other ASM routines. They eliminate the need for EDX and keep the
used registers down to EAX, ECX, & EDI. This matches the REPNE SCASB
method I'm using now and will make it easier to plug in.)
PROCSTART fstrlen, $$str:DWORD
push edi
pxor xmm0, xmm0
mov edi, [$$str]
; push edi
mov edx, edi
mov ecx, edi
and ecx, 15
je $$aloop2
and edi, 0fffffff0h
movdqa xmm1, [edi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
shr eax, cl
jne $$exit
$$aloop:
add edi, 16
$$aloop2:
movdqa xmm1, [edi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
je $$aloop
jmp $$exit
$$exit:
bsf eax, eax
; sub edi, [esp]
sub edi, edx
add eax, edi
$$exit2:
; pop edi
pop edi
Return
fstrlen Endp
Thanks for your help!
DSF
Yes indeed: This is why I asked about L1/L2 etc: If your working set of
data is larger than those two cache levels, the running time will be
totally dominated by RAM fetches.
>
> For the strlen and strcpy tests, I simply call the library and SSE2
> functions in a loop of 100,000 and compare the times. In each case,
> they are operating on the *same* string(s) each time through the loop.
> The roughly 5K-long string(s) are large as far as strings go, but
> small as far as caching goes. Everything is probably in a cache from
> the second loop on.
Yes, obviously.
>
> The buffer swap is actually part of a sort routine that, for the
> purposes of this test, sorts 100,000 items 1920 bytes in length. The
> buffer swap routine is called 428,932 times. And each time it's
Ouch!
You really, really don't want to sort 200 MB by actually swapping data
items, at least not until you've first tried something more intelligent:
First of all, have the strings got many common & long prefixes, i.e. can
they all match on the first N bytes?
If not, i.e. more random data, then I would create a new data structure
with maybe 8 bytes in each record: The first 4 bytes of each string, in
BSWAP'ed order, followed by a 32-bit pointer to the original data position.
When sorting this 800K array (which will fit in a 1 MB L2!) you simply
compare 32-bit words directly and end up with a partly sorted array
where matching 4-byte prefixes will remain in the same order.
Next you repeat this process recursively for each sub-array that has
matched up to now.
Alternatively you can make the prefix part 3 times larger, for 16
bytes/item, and you can also let your comparison routine start with the
prefix part and, if they all match, go on to comparing the rest of the
original strings.
No matter how you do it, you use the pointers to do a final read-out of
the input array in sorted order.
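A sketch of the 8-byte record in C (the names are mine; note that you
need a stable sort, or the recursion on equal-key ranges described
above, to keep matching prefixes in their original order):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t key;   /* first 4 string bytes, BSWAP'ed (big-endian) */
    uint32_t idx;   /* position of the original record */
} SortKey;          /* 8 bytes; 100,000 of these is ~800K */

/* Load 4 bytes most-significant-first so that unsigned integer order
   matches memcmp order. */
static uint32_t make_key(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static int cmp_key(const void *a, const void *b)
{
    uint32_t ka = ((const SortKey *)a)->key;
    uint32_t kb = ((const SortKey *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* usage: fill keys[], then qsort(keys, n, sizeof(SortKey), cmp_key); */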
>> A memory page cannot end inside a cache line, and since all your loads
>> are 16-byte aligned, you'll be OK.
>
> So if you are sitting at the start of an aligned 16-byte block, the
> entire block is guaranteed to be owned by your process?
Yes.
>
>> An unaligned access can indeed cause a fault, so it is better to handle
>> misaligned starting offsets.
>>
>> You should probably look into using an aligned access for the first 16
>> bytes as well, even with misaligned data: You can simply use a mask to
>> filter away any zero bytes before the real start of the string.
>
> So by the same reasoning, if a string, buffer, whatever starts in
> the middle of an aligned 16-byte block, memory from the start of that
> block has to be owned by the process as well?
Yes.
>
> How does this updated strlen look?
>
> (Note that the remarked lines are for an inline version to be used
> in other ASM routines. They eliminate the need for EDX and keep the
> used registers down to EAX, ECX,& EDI. This matches the REPNE SCASB
> method I'm using now and will make it easier to plug in.)
>
> PROCSTART fstrlen, $$str:DWORD
>
> push edi
>
> pxor xmm0, xmm0
> mov edi, [$$str]
> ; push edi
> mov edx, edi
> mov ecx, edi
> and ecx, 15
> je $$aloop2
If most strings are 16-byte aligned, then you get a predicted skip of
this part; if the alignment is just 4-byte, then 3/4 of all strings will
need the prefix-alignment part.
>
> and edi, 0fffffff0h
> movdqa xmm1, [edi]
> pcmpeqb xmm1, xmm0
> pmovmskb eax, xmm1
> shr eax, cl
> jne $$exit
Would it be possible to always do the initial mask by por'ing in
non-null data for the unaligned leading part?
>
> $$aloop:
> add edi, 16
With my 486/Pentium background I really hate to see the pointer update
immediately in front of the load below, this will cause an AGI stall on
modern in-order cpus like the Atom or Larrabee!
> $$aloop2:
> movdqa xmm1, [edi]
> pcmpeqb xmm1, xmm0
> pmovmskb eax, xmm1
> test eax, eax
> je $$aloop
> jmp $$exit
Terje
>>
>> The buffer swap is actually part of a sort routine that, for the
>> purposes of this test, sorts 100,000 items 1920 bytes in length. The
>> buffer swap routine is called 428,932 times. And each time it's
>
>Ouch!
>
>You really, really don't want to sort 200 MB by actually swapping data
>items, at least not until you've first tried something more intelligent:
>
The whole buffer swap business started when I discovered I needed an
additional value sent to the user-defined compare routine of qsort. So
I looked up my compiler's source for qsort and modified it to pass
three parameters instead of two to the comparison routine. While I
was at it, I compiled the "entirely written in C" code to assembler to
have a look around and see if I could speed it up a bit. What I'm
doing is pretty much half learning exercise and half seeing what
improvements I can make. The function used to swap two items (written
to swap one byte at a time, even though the size of the items is
known) seemed like a good place to start. I was surprised that the
code was swapping the actual data until I realized that it can only
operate on what it's fed. The trick, of course, is to feed it an
array of pointers to the objects to be sorted. With that in mind, the
first thing my "improved" swap routine does is check to see if the
items are pointer-sized. If so, it does a quick four-byte swap and
exits. Else, it uses the code I posted here. Even the GPR code, at
four bytes a loop, was significantly faster than the original byte
swap. Then, since it only takes two parameters, I modified it so they
are passed via registers. Thus eliminating the entire stack frame.
With that said, I do have to admit that the original code was no
slouch, it sorted ~150,000 directory entry structures of 860 bytes
each in less than a second on my now ancient AMD Athlon 64 X2 4200+.
>
>>
>> How does this updated strlen look?
>>
I have posted an updated copy of my strlen routine because, oops,
there was a bug that produced an incorrect value if the string didn't
extend past the first 16-byte block. (Not to mention that jmp to the
following line of code.)
I have included your comments in the new code.
PROCSTART fstrlen, $$str:DWORD
push edi
pxor xmm0, xmm0
mov edi, [$$str]
mov edx, edi
mov ecx, edi
and ecx, 15
je $$aloop2
>If most strings are 16-byte aligned, then you get a predicted skip of
>this part; if the alignment is just 4-byte, then 3/4 of all strings will
>need the prefix-alignment part.
I'm sorry, but the above is so obvious that I fear I've missed the
point. (?)
My error. If the 0 was found here, the idea was that the code could
just jump to the end and calculate the length, since the current
16-byte block (edi) minus the start (edx) would zero out. But I forgot
about the ANDing of edi back to the start of the block. The subtraction
could even underflow, producing a string length of almost 4GB.
and edi, 0fffffff0h
movdqa xmm1, [edi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
shr eax, cl
jne $$foundshort
>Would it be possible to always do the initial mask by por'ing in
>non-null data for the unaligned leading part?
I don't see how that would help, unless you are suggesting an
entirely different method for searching for zero. The shr gets rid of
the unwanted bits and leaves the correct ones in the right position.
Even if there's something here I'm missing, I still see two potential
problems. 1. Getting the proper bit pattern into the leading part
might be more work than it's worth. 2. Since the bytes in the leading
part are unknown, what would you por xmm0 with to guarantee a
non-match?
>With my 486/Pentium background I really hate to see the pointer update
>immediately in front of the load below, this will cause an AGI stall on
>modern in-order cpus like the Atom or Larrabee!
Understood. But it's such a small loop there's really nothing that
could be rearranged and rewriting it to move the 'add' line to the
end has its own set of problems. And one needs to stop somewhere. If
one takes into account every processor's peculiarities, strlen might
wind up a few megabytes! :o)
By the way, is there a reference out there of all the bottlenecks
for most x86 CPUs? Or does one need to collect all of the
optimization manuals (if they're available) and put the pieces
together?
$$aloop:
add edi, 16
$$aloop2:
movdqa xmm1, [edi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
je $$aloop
$$exit:
bsf eax, eax
sub edi, edx
add eax, edi
pop edi
Return
$$foundshort:
mov edi, edx
jmp $$exit
fstrlen Endp
DSF
That is always worthwhile...
> The function used to swap two items (written
> to swap one byte at a time, even though the size of the items is
> known) seemed like a good place to start. I was surprised that the
> code was swapping the actual data until I realized that it can only
> operate on what it's fed. The trick, of course, is to feed it an
> array of pointers to the objects to be sorted. With that in mind, the
No, the real trick is to write a sort routine which is optimized for the
task at hand, at least if said sort is a significant part of the total
processing time.
Alternatively, you should write your own sort in order to (re-)discover
some of the tricks and gotcha's therein, like how to select pivot values
(for quicksort), why you should always do the shorter partition first,
recursively, and then simply loop back up to handle the other half.
> first thing my "improved" swap routine does is check to see if the
> items are pointer-sized. If so, it does a quick four-byte swap and
> exits. Else, it uses the code I posted here. Even the GPR code, at
> four bytes a loop, was significantly faster than the original byte
> swap. Then, since it only takes two parameters, I modified it so they
> are passed via registers. Thus eliminating the entire stack frame.
>
> With that said, I do have to admit that the original code was no
> slouch, it sorted ~150,000 directory entry structures of 860 bytes
> each in less than a second on my now ancient AMD Athlon 64 X2 4200+.
Please take a look at the sort benchmark results!
The current record is to use about 10K seconds to sort 1e12 records,
using just under 100 cpus in a cluster.
Sorting is of course not a linear process, i.e. it takes less than 1% of
the work to sort 1% of the items, but just scaling linearly I find that
they are handling at least 1e6 records/second per cpu (1e12 records /
1e4 seconds / 100 cpus), and in reality it is probably an order of
magnitude more.
Way back in 1986, running on a 6 MHz 286, I sorted 10K entries in 0.5
seconds. (This was part of a database update, where those 10K entries
were un-ordered updates to a DB with 10M records: By sorting them first
I could do all the updates sequentially, only touching the disk blocks
that needed it, and only once. The updates took another 0.5 seconds, so
the total was 1 s.)
>> If most strings are 16-byte aligned, then you get a predicted skip of
>> this part; if the alignment is just 4-byte, then 3/4 of all strings will
>> need the prefix-alignment part.
>
> I'm sorry, but the above is so obvious that I fear I've missed the
> point. (?)
The point was that it is possible to write the code in such a way that
you assume the first block is misaligned, and always run that part of
the code. This avoids the need to test/branch first.
>
> My error. If the 0 was found here, the idea was that the code could
> just jump to the end and calculate the length, since the current
> 16-byte block (edi) minus the start (edx) would zero out. But I forgot
> about the ANDing of edi back to the start of the block. The subtraction
> could even underflow, producing a string length of almost 4GB.
I almost asked about that, since this is a typical error cause in any
kind of SIMD processing. :-)
>> Would it be possible to always do the initial mask by por'ing in
>> non-null data for the unaligned leading part?
>
> I don't see how that would help, unless you are suggesting an
> entirely different method for searching for zero. The shr gets rid of
This was in relation to my previous question about not special-casing
aligned inputs.
> the unwanted bits and leaves the correct ones in the right position.
> Even if there's something here I'm missing, I still see two potential
> problems. 1. Getting the proper bit pattern into the leading part
> might be more work than it's worth. 2. Since the bytes in the leading
> part are unknown, what would you por xmm0 with to guarantee a
> non-match?
Anything except 0 bytes would work.
>
>
>> With my 486/Pentium background I really hate to see the pointer update
>> immediately in front of the load below, this will cause an AGI stall on
>> modern in-order cpus like the Atom or Larrabee!
>
> Understood. But it's such a small loop there's really nothing that
> could be rearranged and rewriting it to move the 'add' line to the
> end has its own set of problems.
Famous last words...
> And one needs to stop somewhere. If
> one takes into account every processor's peculiarities, strlen might
> wind up a few megabytes! :o)
In which case it would be slower, unless you employed
JIT/self-modification to only keep the optimal version for the current cpu.
>
> By the way, is there a reference out there of all the bottlenecks
> for most x86 CPUs? Or does one need to collect all of the
> optimization manuals (if they're available) and put the pieces
> together?
Agner Fog is probably the best.
>
>
> $$aloop:
> add edi, 16
> $$aloop2:
> movdqa xmm1, [edi]
> pcmpeqb xmm1, xmm0
> pmovmskb eax, xmm1
> test eax, eax
> je $$aloop
stringmask db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
mov eax,[string]
lea edx,[eax+16] ;; Compensate for extra EDI add
and eax,NOT 15
pxor xmm0,xmm0
mov edi,eax
sub eax,edx ;; Negative offset from string start
movdqu xmm2,stringmask[eax+15] ;; Mask for first block of data
loop:
movdqa xmm1,[edi]
add edi,16
por xmm1,xmm2 ;; Make sure any leading bytes are !0
pcmpeqb xmm1,xmm0
pxor xmm2,xmm2
pmovmskb eax,xmm1
test eax,eax
jz loop
bsf eax,eax
sub edi,edx
add eax,edi
This version is probably slower unless that initial branch in your code
is mostly mispredicted, but it does avoid both the branch and the
special case to handle a zero byte in the first block, something which
is extremely common since strlen() is mostly used on short strings.
Note: I snipped the sort portion. I understand qsort is not going to
be the fastest out there, but I have to remember that my main goal was
to add a needed extra parameter for the C program I'm writing. I'm
going to stick with the modified qsort because I got off on a tangent
with it and really need to get back to said C program.
>>> If most strings are 16-byte aligned, then you get a predicted skip of
>>> this part; if the alignment is just 4-byte, then 3/4 of all strings will
>>> need the prefix-alignment part.
>>
>> I'm sorry, but the above is so obvious that I fear I've missed the
>> point. (?)
>
>The point was that it is possible to write the code in such a way that
>you assume the first block is misaligned, and always run that part of
>the code. This avoids the need to test/branch first.
Understood.
>>
>> My error. If the 0 was found here, the idea was that the code could
>> just jump to the end and calculate the length, since the current
>> 16-byte block (edi) minus the start (edx) would zero out. But I forgot
>> about the ANDing of edi back to the start of the block. The subtraction
>> could even underflow, producing a string length of almost 4GB.
>
>I almost asked about that, since this is a typical error cause in any
>kind of SIMD processing. :-)
>
>>> Would it be possible to always do the initial mask by por'ing in
>>> non-null data for the unaligned leading part?
>>
>> I don't see how that would help, unless you are suggesting an
>> entirely different method for searching for zero. The shr gets rid of
>
>This was in relation to my previous question about not special-casing
>aligned inputs.
>
>> the unwanted bits and leaves the correct ones in the right position.
>> Even if there's something here I'm missing, I still see two potential
>> problems. 1. Getting the proper bit pattern into the leading part
>> might be more work than it's worth. 2. Since the bytes in the leading
>> part are unknown, what would you por xmm0 with to guarantee a
>> non-match?
>
>Anything except 0 bytes would work.
I misunderstood. I thought you were changing the 0 mask (xmm0), not
masking out the leading bytes of the data. Your code below cleared
that up.
>> By the way, is there a reference out there of all the bottlenecks
>> for most x86 CPUs? Or does one need to collect all of the
>> optimization manuals (if they're available) and put the pieces
>> together?
>
>Agner Fog is probably the best.
I wanted to discuss this further, but I just noticed how late it is.
I'll follow up on this later.
I tried the above code and it doesn't operate properly for me unless
the string is aligned.
Actually, it would not even assemble. The line:
movdqu xmm2,stringmask[eax+15]
needed a size qualifier for stringmask[eax+15]. I needed to change
it to
movdqu xmm2, oword ptr stringmask[eax+15]
After that, it would return huge values related to the alignment of
string regardless of string's length. The problem, as I worked it out,
is that eax+15 is always negative, causing xmm2 to read in bytes prior
to stringmask. The higher the alignment offset, the more bytes prior
to stringmask that are read into xmm2. Here's a little bit of my
reasoning "on paper" to show what I mean. To make things simpler, I
gave string an address of 1000 and stringmask an address of 2000.
I worked out the register values for each step of the code:
mov eax,[string]
lea edx,[eax+16] ;; Compensate for extra EDI add
and eax,NOT 15
pxor xmm0,xmm0
mov edi,eax
sub eax,edx ;; Negative offset from string start
movdqu xmm2,stringmask[eax+15] ;; Mask for first block of data
stringmask = 2000
aligned:
eax = 1000
edx = 1010
eax = 1000
edi = 1000
eax = fff0
ptr = stringmask+eax+15 = 1fff
x indicates bytes prior to stringmask
xmm2 = x111 1111 1111 1111 0000 0000 0000 0000
offset by 2:
eax = 1002
edx = 1012
eax = 1000
edi = 1000
eax = ffee
ptr = stringmask+eax+15 = 1ffd
x indicates bytes prior to stringmask
xmm2 = xxx1 1111 1111 1111 1100 0000 0000 0000
offset by 15:
eax = 100f
edx = 101f
eax = 1000
edi = 1000
eax = ffe1
ptr = stringmask+eax+15 = 1ff0
x indicates bytes prior to stringmask
xmm2 = xxxx xxxx xxxx xxxx 1111 1111 1111 1110
I confirmed my theory by placing a byte array of 1s before
stringmask and the code functioned properly. Of course, that's more
of a quick fix than a solution, so I came up with this version. How
does it look?
stringmask db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
....
mov eax,[$$str]
mov edi,eax
and edi, NOT 15 ; aligned edi
lea edx,[eax+16] ;; Compensate for extra EDI add
and eax,15 ; number of leading bytes
xor eax,15 ; inverted for index into stringmask
pxor xmm0,xmm0
movdqu xmm2, oword ptr stringmask[eax] ;; Mask first blk of data
sloop:
movdqa xmm1,[edi]
add edi,16
por xmm1,xmm2 ;; Make sure any leading bytes are !0
pcmpeqb xmm1,xmm0
pxor xmm2,xmm2
pmovmskb eax,xmm1
test eax,eax
jz sloop
bsf eax,eax
sub edi,edx
add eax,edi
DSF
That's what I get for just writing code, and not testing/running it. :-)
>
> Actually, it would not even assemble. The line:
> movdqu xmm2,stringmask[eax+15]
> needed a size qualifier for stringmask[eax+15]. I needed to change
> it to
> movdqu xmm2, oword ptr stringmask[eax+15]
Details, details...
>
> After that, it would return huge values related to the alignment of
> string regardless of string's length. The problem, as I worked it out,
> is that eax+15 is always negative, causing xmm2 to read in bytes prior
> to stringmask. The higher the alignment offset, the more bytes prior
> to stringmask that are read into xmm2. Here's a little bit of my
> reasoning "on paper" to show what I mean. To make things simpler, I
> gave string an address of 1000 and stringmask an address of 2000.
>
> I worked out the register values for each step of the code:
>
> mov eax,[string]
> lea edx,[eax+16] ;; Compensate for extra EDI add
The sub below was based on the original address, so it is off by 16.
> and eax,NOT 15
> pxor xmm0,xmm0
> mov edi,eax
> sub eax,edx ;; Negative offset from string start
> movdqu xmm2,stringmask[eax+15] ;; Mask for first block of data
stringmask[eax+15+16] might do the trick!
The key is simply to start at the 16 zero bytes at the end of
stringmask, and then index backwards from that point based on the number
of leading bytes from ALIGN 16 until the start of the string.
>DSF wrote:
>> On Thu, 15 Jul 2010 10:49:01 +0200, Terje Mathisen<"terje.mathisen at
>> tmsw.no"@giganews.com> wrote:
>> I tried the above code and it doesn't operate properly for me unless
>> the string is aligned.
>
>That's what I get for just writing code, and not testing/running it. :-)
I assume you know that I was in no way criticizing! :o)
In fact, working out what was wrong and a fix for it was a good
learning exercise. Especially since I have no debugger that will deal
with SSE2 instructions, let alone let me view xmm registers. I
literally had to work it out, step by step, on "paper." (An empty
text window in my compiler's IDE.)
>>
>> Actually, it would not even assemble. The line:
>> movdqu xmm2,stringmask[eax+15]
>> needed a size qualifier for stringmask[eax+15]. I needed to change
>> it to
>> movdqu xmm2, oword ptr stringmask[eax+15]
>
>Details, details...
Yup, but it took me a while to find out what MASM's term was for 128
bits. Everybody else seems to use "dqword". Leave it to MS to be
different. But, hey, that's how standards are forced upon us...er...I
mean born! ;o)
>The sub below was based on the original address, so it is off by 16.
>
>> and eax,NOT 15
>> pxor xmm0,xmm0
>> mov edi,eax
>> sub eax,edx ;; Negative offset from string start
>> movdqu xmm2,stringmask[eax+15] ;; Mask for first block of data
Yep.
>stringmask[eax+15+16] might do the trick!
I think it will. It's enduring my "torture test" right now. The
original failed on the first iteration.
It passed. So eax+31 it is! (I think I might've thought of that
had it not been 4:30AM.)
>The key is simply to start at the 16 zero bytes at the end of
>stringmask, and then index backwards from that point based on the number
>of leading bytes from ALIGN 16 until the start of the string.
This is the kind of logic I'm trying to learn. I understood the
concept pretty quickly. Just needed to work out the details.
I did some timing tests and, as you predicted, your routine is
slower, albeit slightly. The difference increases if the string is
less than 16 bytes. I made one more modification to your routine by
unrolling the first loop from the rest. True, that reintroduces a
branch, but it also eliminates the por with 0 for the remaining
loops, as well as eliminating the zeroing of xmm2 altogether. Plus it
retains the property of not modifying edi just before a load.
One rule of optimization seems to be standing the test of time:
The less in the loop, the better.
With this modification the two routines are about equal in speed,
with yours slightly ahead!
mov eax,[$$str]
lea edx,[eax+16] ;; Compensate for extra EDI add
and eax,NOT 15
pxor xmm0,xmm0
mov edi,eax
sub eax,edx ;; Negative offset from string start
movdqu xmm2, oword ptr stringmask[eax+31] ;; Mask...
movdqa xmm1,[edi]
add edi,16
por xmm1,xmm2 ;; Make sure any leading bytes are !0
pcmpeqb xmm1,xmm0
pmovmskb eax,xmm1
test eax,eax
jnz exit
sloop:
movdqa xmm1,[edi]
add edi,16
pcmpeqb xmm1,xmm0
pmovmskb eax,xmm1
test eax,eax
jz sloop
exit:
bsf eax,eax
sub edi,edx
add eax,edi
One more question (okay, two), if you don't mind. Is there any
problem with putting read-only data, such as stringmask, in the code
segment? Is there a possibility for speed improvement? (I'm thinking
along the lines of the data being in the cache with the code.)
>Terje
Thanks!
DSF
How did you time it?
With what kind of data?
Random lengths/same lengths/all above/all below 16 bytes/all aligned/all
misaligned/randomly aligned?
All of these are crucial for measuring the difference between two pieces
of code where one of them works hard to avoid branching!
If the branch pattern is easily predictable by the cpu, then the simple
code will always win, otherwise my more complicated code can do so!
> One more question (okay, two), if you don't mind. Is there any
> problem with putting read-only data, such as stringmask, in the code
> segment? Is there a possibility for speed improvement? (I'm thinking
> along the lines of the data being in the cache with the code.)
That is actually a useful idea, but you must remember that since you are
using it as data, not code, it cannot be loaded into any separate cache
levels, only L2/L3 which are usually shared between code and data.
>> I did some timing tests and, as you predicted, your routine is
>> slower, albeit slightly. The difference increases if the string is
>> less than 16 bytes. I made one more modification to your routine by
>
>How did you time it?
>
>With what kind of data?
>
>Random lengths/same lengths/all above/all below 16 bytes/all aligned/all
>misaligned/randomly aligned?
What I did is set up four versions of strlen: the library version,
my version, your version, and Agner Fog's. To try to avoid
interference from other processes, I closed every non-essential
background process, set the test program's priority to real time, and
its affinity to one processor. I timed a loop of 10,000 iterations
for each strlen using QueryPerformanceCounter, averaging these loops
within a larger loop of 30. I enclosed the above in two loops: one to
change the alignment offset from 0 to 15 and one to set the string
size from 0 to 4100. The program output the time taken for lengths of
0 to 20 then per length of 100 up to 4100 for each offset.
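The skeleton of the harness looks like this (a sketch; the Win32 calls
are the ones named above, and fstrlen stands in for whichever routine
is under test):

#include <windows.h>
#include <stdio.h>

/* Time 10,000 calls of one routine, result in milliseconds. */
double time_one(int (*fn)(const char *), const char *s)
{
    LARGE_INTEGER freq, t0, t1;
    volatile int len;   /* keep the calls from being optimized away */
    int i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (i = 0; i < 10000; i++)
        len = fn(s);
    QueryPerformanceCounter(&t1);
    (void)len;
    return (double)(t1.QuadPart - t0.QuadPart) * 1000.0
           / (double)freq.QuadPart;
}

/* once, before timing:
   SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
   SetThreadAffinityMask(GetCurrentThread(), 1);
*/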
I also ran the above tests, but used Fog's ReadTSC to obtain the
result in clock cycles. The results were similar to the timing test.
A few highlights. The entire file is too large to post.
strlen = library
fstrlen = mine
tmstrlen = yours
A_strlen = Fog's
Note that fstrlen is my original code posted here and that tmstrlen
is your original code, with eax+15 changed to eax+31.
Offset: 0 Length: 1
strlen time: 0.142ms
fstrlen time: 0.139ms
tmstrlen time: 0.189ms
A_strlen time: 0.292ms
Offset: 0 Length: 16
strlen time: 0.286ms
fstrlen time: 0.150ms
tmstrlen time: 0.198ms
A_strlen time: 0.362ms
Offset: 0 Length: 20
strlen time: 0.317ms
fstrlen time: 0.152ms
tmstrlen time: 0.196ms
A_strlen time: 0.360ms
Offset: 0 Length: 100
strlen time: 1.056ms
fstrlen time: 0.237ms
tmstrlen time: 0.301ms
A_strlen time: 0.690ms
Offset: 1 Length: 1
strlen time: 0.142ms
fstrlen time: 0.151ms
tmstrlen time: 0.189ms
A_strlen time: 0.137ms
Offset: 1 Length: 16
strlen time: 0.277ms
fstrlen time: 0.155ms
tmstrlen time: 0.196ms
A_strlen time: 0.201ms
Offset: 1 Length: 20
strlen time: 0.314ms
fstrlen time: 0.156ms
tmstrlen time: 0.198ms
A_strlen time: 0.209ms
Offset: 1 Length: 100
strlen time: 1.039ms
fstrlen time: 0.228ms
tmstrlen time: 0.300ms
A_strlen time: 0.524ms
Offset: 15 Length: 1
strlen time: 0.145ms
fstrlen time: 0.157ms
tmstrlen time: 0.180ms
A_strlen time: 0.204ms
Offset: 15 Length: 16
strlen time: 0.284ms
fstrlen time: 0.155ms
tmstrlen time: 0.179ms
A_strlen time: 0.203ms
Offset: 15 Length: 20
strlen time: 0.320ms
fstrlen time: 0.169ms
tmstrlen time: 0.202ms
A_strlen time: 0.264ms
Offset: 15 Length: 100
strlen time: 1.063ms
fstrlen time: 0.249ms
tmstrlen time: 0.314ms
A_strlen time: 0.611ms
If there's any combination(s) of offset & length you'd like to see,
let me know.
>
>All of these are crucial for measuring the difference between two pieces
>of code where one of them works hard to avoid branching!
>
>If the branch pattern is easily predictable by the cpu, then the simple
>code will always win, otherwise my more complicated code can do so!
Easy now. Put down the exclamation points and back away slowly. :o)
Seriously, if I said something to offend you, I apologize. The
previous paragraphs of yours come off slightly condescending. If I'm
misreading this, then sorry about that.
Please recall that you were the one who stated that your code might
run slower if the branches in mine were predicted correctly.
I do understand that branch prediction can vary greatly between
processors, even within the same family. So logic would indicate that
the fewer branches, the better overall performance across different
processors.
Since I'm trying to learn all aspects of optimization, I'm going to
put together a test program that does its best to fool branch
prediction in the four strlen routines. I'm curious to see how much I
can slow my code down. Should be interesting. I'll post some results
when I get it running.
>> One more question (okay, two), if you don't mind. Is there any
>> problem with putting read-only data, such as stringmask, in the code
>> segment? Is there a possibility for speed improvement? (I'm thinking
>> along the lines of the data being in the cache with the code.)
>
>That is actually a useful idea, but you must remember that since you are
>using it as data, not code, it cannot be loaded into any separate cache
>levels, only L2/L3 which are usually shared between code and data.
On second thought, I see a potential problem putting read-only data
in the code segment. The only way to guard against the possibility of
CS != DS (it hasn't been a problem under Windows XP, but I don't know
about other OSes) would be to use an override on any code segment
read. I don't recall if that is expensive or how much it may vary
from processor to processor. Leaving the data in the data segment is
probably the best option.
DSF
Not at all! I'm just very happy that someone is interested in discussing
things I like to work on. :-)
>
> Please recall, that you were the one who stated that your code might
> run slower if the branches in mine were predicted correctly.
>
> I do understand that branch prediction can vary greatly between
> processors, even within the same family. So logic would indicate that
> the fewer branches, the better overall performance across different
> processors.
>
> Since I'm trying to learn all aspects of optimization, I'm going to
> put together a test program that does its best to fool branch
> prediction in the four strlen routines. I'm curious to see how much I
> can slow my code down. Should be interesting. I'll post some results
> when I get it running.
That is a very good idea: Use rand() to setup an array of starting
offset (either zero or non-zero) and length (either spanning the next
16-byte boundary or not).
This gives 4 kinds of strings to test, so your timing tests could then
try maybe 20-100 such strlen() calls, with a random distribution among
the four categories.
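The setup could be as simple as this (a sketch; make_test_string and
the exact category split are mine, and buf is assumed to be a 16-byte
aligned scratch area of at least 48 bytes):

#include <stdlib.h>
#include <string.h>

/* Build one test string in buf: a random zero/non-zero offset, and a
   length that either stays inside the first 16-byte block or crosses
   into the next one. Returns the string's start. */
char *make_test_string(char *buf)
{
    int offset = (rand() & 1) ? 1 + rand() % 15 : 0;
    int crosses = rand() & 1;
    int len = crosses ? 16 + rand() % 16 : rand() % (16 - offset);
    char *s = buf + offset;
    memset(s, 'a', len);
    s[len] = '\0';
    return s;
}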
I suspect the simplest code (i.e. yours) will still be the fastest
though. :-)
>
> sloop:
> movdqa xmm1,[edi]
> add edi,16
> pcmpeqb xmm1,xmm0
> pmovmskb eax,xmm1
> test eax,eax
> jz sloop
>
At least on my machine, the following inner loop is faster:
sloop:
pcmpeqb [edi],xmm0
add edi,16
pmovmskb eax,xmm0
pxor xmm0,xmm0
test eax,eax
jz sloop
The pxor instruction, which replaces the load, pairs nicely with the
instructions around it, so it is "free".
Steven
>On Jul 16, 5:37 pm, DSF <notava...@address.here> wrote:
>
>>
>> sloop:
>> movdqa xmm1,[edi]
>> add edi,16
>> pcmpeqb xmm1,xmm0
>> pmovmskb eax,xmm1
>> test eax,eax
>> jz sloop
>>
>
>At least on my machine, the following inner loop is faster:
>
>sloop:
> pcmpeqb [edi],xmm0
Safe to assume you meant?
> pcmpeqb xmm0, [edi]
Ooops, Yes. Sorry, converting from gnu format to intel is error
prone. :-/
Steven
Not only that, but since the SSE execution stage is staggered, one cycle
behind the load stage, the compare with memory saves a cycle, while the
pxor between pmovmskb and dependent test hides the fact that the SSE
execute is delayed compared to the integer core, i.e. this loop is
likely to be faster on most x86s, particularly those that are in-order
like the Atom.
>
>That is a very good idea: Use rand() to setup an array of starting
>offset (either zero or non-zero) and length (either spanning the next
>16-byte boundary or not).
>
>This gives 4 kinds of strings to test, so your timing tests could then
>try maybe 20-100 such strlen() calls, with a random distribution among
>the four categories.
>
>I suspect the simplest code (i.e. yours) will still be the fastest
>though. :-)
It's not precisely what you suggested, but I'd already put it
together before you replied. :o)
I started by using a loop of 6 for the following tests:
Short string, 14 characters, fits within the first page.
1 String is always aligned.
2 String is never aligned.
3 String is aligned/unaligned randomly.
Long string, 4100 characters.
4 String is always aligned.
5 String is never aligned.
6 String is aligned/unaligned randomly.
The next loop is used to average the results. Like all the figures
here, I tried different values. The one used for the results below
was 10.
The inner loop runs 10,000 times. For each iteration, the start of
the string is either aligned to 0 (1, 4), unaligned by 1 (2, 5), or
randomly set to 0 or 1 (3, 6). Each strlen is timed using Agner Fog's
ReadTSC code, but not as a separate C function. Instead I enclosed
each strlen in the core ReadTSC code and returned the result in a
32-bit external variable defined in the C code. The result for each
strlen is totaled for the 10,000 times. The average is taken from 10
of the above loops and that is the result posted here.
The following results are from one run of the above.
strlen = library
fstrlen = my original
dstrlen = your original with the first block separate
tmstrlen = your original
A_strlen = Agner Fog's
Excuse the term "time". The values are in clocks.
Length: 14
100% Aligned string
strlen time: 1729234
fstrlen time: 1383238
dstrlen time: 1558379
tmstrlen time: 1464634
A_strlen time: 1825186
Length: 14
100% Unaligned string
strlen time: 1768086
fstrlen time: 1498181
dstrlen time: 1595460
tmstrlen time: 1573040
A_strlen time: 1554473
Length: 14
50/50 Aligned/Unaligned string
strlen time: 1790166
fstrlen time: 1535385
dstrlen time: 1606361
tmstrlen time: 1593026
A_strlen time: 1723263
Length: 4100
100% Aligned string
strlen time: 86767336
fstrlen time: 12192328
dstrlen time: 12235895
tmstrlen time: 14902195
A_strlen time: 40849104
Length: 4100
100% Unaligned string
strlen time: 95088824
fstrlen time: 13309268
dstrlen time: 13421709
tmstrlen time: 16346562
A_strlen time: 44201293
Length: 4100
50/50 Aligned/Unaligned string
strlen time: 95170604
fstrlen time: 13465087
dstrlen time: 13408829
tmstrlen time: 16387576
A_strlen time: 44242084
The above is only one example. There is enough variance between
runs to render the close numbers equivalent. I have made a few
observations from the various runs.
The library strlen is always the slowest, with the exception of very
small strings (0-2 bytes).
Agner Fog's code is its fastest on a short string that is not
aligned. (Unexpected.)
For short strings, f, d, tm, and A_strlen can be considered
equivalent. (Considering the variance.)
For longer strings, f & dstrlen lead the pack, with tmstrlen *very*
close behind. (I *really* don't understand why the random branching
introduced by the random alignment doesn't slow my code down. By all
that I understand, it *should*. I really expected tmstrlen to be
ahead on the 50/50 test.)
I was puzzled by Agner Fog's slow results for long strings, so I did
some research. I believe this relates to the fact that I'm running an
Athlon 64 X2.
Agner's code places the BSF in the main loop:
; Main loop, search 16 bytes at a time
A100: add eax, 10H ; increment pointer by 16
movdqa xmm1, [eax] ; read 16 bytes aligned
pcmpeqb xmm1, xmm0 ; compare 16 bytes with zero
pmovmskb edx, xmm1 ; get one bit for each byte result
bsf edx, edx ; find first 1-bit
jz A100 ; loop if not found
A200: ; Zero-byte found. Compute string length
sub eax, [esp+4] ; subtract start address
add eax, edx ; add byte index
While I used TEST:
$$aloop:
add edi, 16 ; Set start to next 16-byte boundary
$$aloop2:
movdqa xmm1, [edi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax ; If not all 0, then ending 0 found
je $$aloop ; Get next 16 bytes
$$exit:
bsf eax, eax
sub edi, edx
add eax, edi
According to Fog's timing PDF, the P4 executes BSF as 2 micro-ops with
a latency of 4, while the Athlon executes it as 21 ops with a latency
of 8! TEST is <= 1 on both CPUs.
I wanted to mention this because I respect Mr. Fog and the amount of
work he's put into compiling information on various CPUs. He
understands them better than I ever will. I believe he wrote his code
with the Pentium in mind and it does use the BSF to perform the action
of TEST as well.
As far as the variances of the timing, I blame it on the modern,
multi-tasking OS. Sometimes I miss the days when the programs I wrote
owned the machine. I do believe the tradeoffs have been more than
worth it...but still...sometimes...
DSF
>> >> sloop:
>> >> movdqa xmm1,[edi]
>> >> add edi,16
>> >> pcmpeqb xmm1,xmm0
>> >> pmovmskb eax,xmm1
>> >> test eax,eax
>> >> jz sloop
>>
>> >At least on my machine, the following inner loop is faster:
>>
>> >sloop:
>> > pcmpeqb [edi],xmm0
>>
>> Safe to assume you meant?
>>
>> > pcmpeqb xmm0, [edi]
>
>
>Ooops, Yes. Sorry, converting from gnu format to intel is error
>prone. :-/
I had a bet with myself that the error was AT&T syntax related...I
won! :o)
This may sound silly after all of my timing-related posts to this
thread, but how do you (or Terje, for that matter) go about measuring
code like you posted? I ask because I'm having some difficulty
achieving precision timing. That is, I can get a pretty good idea of
the overall speed, but I do not get the *exact* same results for each
timing run. For example, the results of timing your modification of the
strlen routine compared to the original is inconclusive for me. The
results are within the margin of error so either one leads the other
about 50% of the time.
Any special tools you use? Freeware (or cheapware), hopefully?
See my reply to Terje regarding random offsets for a more detailed
account of how I have been timing the code.
DSF
haha You guessed right. I cut+paste from the gas code into the reply
and hand-edited it to get it back into intel format. I just missed
one operand reversal... :-P
> This may sound silly after all of my timing-related posts to this
> thread, but how do you (or Terje, for that matter) go about measuring
> code like you posted? I ask because I'm having some difficulty
> achieving precision timing. That is, I can get a pretty good idea of
> the overall speed, but I do not get the *exact* same results for each
> timing run. For example, the results of timing your modification of the
> strlen routine compared to the original is inconclusive for me. The
> results are within the margin of error so either one leads the other
> about 50% of the time.
>
> Any special tools you use? Freeware (or cheapware), hopefully?
>
> See my reply to Terje regarding random offsets for a more detailed
> account of how I have been timing the code.
>
> DSF
There is no real trick. All I do is run the code in a loop:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int mystrlen(void *);
int main(void)
{
int size = 1 << 24;
char *a = malloc(size);
int i;
int tot = 0;
memset(a, 1, size);
a[size - 1] = 0;
for (i = 0; i < 256; i++)
{
tot += mystrlen(a);
}
printf("%d\n", tot);
return 0;
}
Note that "mystrlen" is really just the inner loop (not the whole
strlen function), so I didn't care about the exact result:
#if 0
.globl mystrlen
.type mystrlen,@function
mystrlen:
pxor %xmm0, %xmm0
sloop:
movdqa (%rdi), %xmm1
add $16, %rdi
pcmpeqb %xmm0, %xmm1
pmovmskb %xmm1, %eax
test %eax, %eax
jz sloop
rep; retq
.size mystrlen, .-mystrlen
#else
.globl mystrlen
.type mystrlen,@function
mystrlen:
pxor %xmm0, %xmm0
sloop:
pcmpeqb (%rdi), %xmm0
add $16, %rdi
pmovmskb %xmm0, %eax
pxor %xmm0, %xmm0
test %eax, %eax
jz sloop
rep; retq
.size mystrlen, .-mystrlen
#endif
I then use the "time" shell command to run the compiled code. By
choosing the number of iterations, you can alter how long it takes. I
usually choose a number that gives something on the order of a few
seconds. For this particular case, 256 iterations takes 1.7s for the
original loop, and 1.4 for the new version on my machine. The error
was about +-0.1s, by running multiple times. I don't see much point
in cycle-counting. The only thing that really matters is wallclock
time - so that's what I measure.
The major difference between what I did, and your measurement, is that
I just looked at the inner loop. I guess that the rest of the code is
providing enough variability to swamp the change. Another possibility
is that my test strings were quite a bit longer... 16MiB vs
4KiB. ;-)
Steven
Occam's Razor:
I don't care!
If the speed is close enough to make it hard to measure, then I simply
go for the simplest/shortest/easiest to understand code.
Under otherwise equal conditions I might try to select the version with
the most future-proof algorithm, i.e. as little memory access and as
little branching as possible, plus avoiding opcodes which compilers
don't know how to use.
{code snipped}
>
>
>I then use the "time" shell command to run the compiled code. By
>choosing the number of iterations, you can alter how long it takes. I
>usually choose a number that gives something on the order of a few
>seconds. For this particular case, 256 iterations takes 1.7s for the
>original loop, and 1.4 for the new version on my machine. The error
>was about +-0.1s, by running multiple times. I don't see much point
>in cycle-counting. The only thing that really matters is wallclock
>time - so that's what I measure.
I'm very glad to read this. I was concerned that I would be
chastised here for not providing *clock-cycle accurate* timing. The
people in this group are a lot nicer than most of the groups I
frequent! (That's probably why I thought Terje was upset when he
wasn't.)
That said, it would be nice if there were some way to do precision
testing. To see the difference that swapping a couple of lines of
code to break a dependency can make would be nice. It's one thing to
"know" something works, it's another to see it for yourself.
Speaking of timing, I discovered a bug in my timing code. It seems
I was not clearing the stored averages between the six tests. This
resulted in later test results being progressively influenced. I
found this out because I had a "blank" test slot: No code there, just
the before and after measurements. As the tests grew longer with
larger strings, the "blank" time grew longer as well! One good thing
came out of the error. I learned that a "blank" slot is good to have.
It gives me a rough indication of the "noise floor." (The overhead of
the timing itself.) From that I learned that all the short string
tests are very close to the noise floor, as they can measure as taking
less time than nothing!
>
>The major difference between what I did, and your measurement, is that
>I just looked at the inner loop. I guess that the rest of the code is
>providing enough variability to swamp the change. Another possibility
>is that my test strings were quite a bit longer... 16MiB vs
>4KiB. ;-)
That may be the only practical solution to comparing slight changes
in loop code: Have it run so many freakin' times that the little
differences show up. Of course the downside is that the longer it
runs, the more opportunities for the OS to change tasks. (Sigh.)
The good thing is that I'm just using this as a learning experience
and a way to reacquaint myself with x86 assembly writing. There isn't
any code that I must speed up *or else*. :o) It's only time critical
in the sense that I've been writing a string library for C and finding
the length or end of a string is used in every function. (Speed up
one, speed up them all.)
I've learned a lot and I'd like to thank everyone, with a nod to
Terje, for their help.
DSF
P.S. I already have a few more questions, but that's for another
thread.
This is a useful approach; unfortunately it breaks down when what you
are really trying to measure is the speed of just one to three-four
iterations, i.e. almost all strings which strlen() is ever called on. :-(
The main problem for these low-iteration count functions is that they
can be very dependent on just one or two branch instructions, ref. a P4
taking up to 20 cycles to recover from a single branch mispredict.
I do like using something like VTune when trying to get to grips with a
larger piece of code, particularly when it's something I didn't write
originally: Running for 20 seconds with hw sampling 1000 times/second or
more gives enough resolution that you can at least tell which parts of
the code to concentrate on.
Afaik there are open-source tools for Linux which do similar things.
> Any special tools you use? Freeware (or cheapware), hopefully?
>
> See my reply to Terje regarding random offsets for a more detailed
> account of how I have been timing the code.
I sometimes (when trying to figure out 1/3 throughput OoO behaviour) use the
CodeAnalyst v2.76 from AMD, which has a pipeline simulator (successive versions
removed the simulator). You can see every stage of the given code in there with
memory-load delay and etc.
It's quite neat.
Ciao
Niels
The main problem is that I can tell you, even without ever seeing it,
that it is inaccurate. That is probably one of the reasons they removed it.
Even being inaccurate, it can still help a lot, but to get a true view
you also need a true simulator for the entire machine, and those take
many orders of magnitude more time to run than the real hw.
Andy Glew, one of the main P6 architects, has told me that he has seen
code that had to run for many 1000's of iterations before it would
settle into a stable or semi-stable/alternating pattern.
The CodeAnalyst must simply attempt to give you a best-effort guess,
while not spending too much time. :-)
To start with, sorry for taking so long to answer, but I just
regained consciousness after seeing what Intel wants for VTune. :o)
>>
>> That may be the only practical solution to comparing slight changes
>> in loop code: Have it run so many freakin' times that the little
>> differences show up. Of course the downside is that the longer it
>> runs, the more opportunities for the OS to change tasks. (Sigh.)
>
>This is a useful approach; unfortunately it breaks down when what you
>are really trying to measure is the speed of just one to three-four
>iterations, i.e. almost all strings which strlen() is ever called on. :-(
I have found that I can get fairly accurate results if I run the
test routine with smaller-sized data and loop it 10 million times or
so. It's not perfect, but I have been able to see the difference
small tweaks have on performance.
>
>The main problem for these low-iteration count functions is that they
>can be very dependent on just one or two branch instructions, ref. a P4
>taking up to 20 cycles to recover from a single branch mispredict.
Actually, 20 cycles doesn't sound that bad to me. It's a lot if you
need to squeeze every last nanosecond out of some code, but is
probably not an issue for most tasks.
And remember, I'm just trying to see if I can improve my code's
performance somewhat...and to avoid any serious pitfalls. It's nice
to try to make my string library fast, but it doesn't *need* to be
fast!
>
>I do like using something like VTune when trying to get to grips with a
>larger piece of code, particularly when it's something I didn't write
>originally: Running for 20 seconds with hw sampling 1000 times/second or
>more gives enough resolution that you can at least tell which parts of
>the code to concentrate on.
1. I'm using AMD, so I doubt the Intel software would work even if
someone gave it to me and...
2. Refer to my first paragraph of this reply.
Seriously, IIRC, they used to *give* VTune away. I was also shocked
by the price for their C compiler. The C compiler I use included an
Intel compiler as an alternate, and the whole package cost less than a
third of what Intel wants now.
>
>Afaik there are open-source tools for Linux which do similar things.
Running Windows, but thanks anyway. The search for Windows-based
software led me to AMD's CodeAnalyst. Free! The only downside so far
is that it can do source level analysis, but only by using a database
produced by Microsoft's compiler. It won't even use the symbol table
in my exe, let alone access the source code. :o(
DSF
You should take a look. It doesn't give you a CSV with sample statistics. It
draws you a precise state-graph of the analysed block for every instruction
(where it waits for which pipeline stage, etc.). You can see pretty easily
where your dependent or low-throughput instructions (like pextrw) bog down
your OoO execution. Or where you need to try to hide an expensive store
behind arithmetic (it's rather pointless to analyse loops and see
write-combining effects, as only one run is made visible, the first).
The length of the block is rather limited, I think maybe 500 insns. And if
that's calculated in 500ms, it takes the orders of magnitude you suggest. :^)
Ciao
Niels