Fast string functions

Piotr Wyderski

unread,

Aug 11, 2007, 3:57:28 PM8/11/07

to

Many compilers for IA-32 generate "repne scasb" in order
tto find the length of a given C string. However, it is possible
to implement strlen (and many other string functions) using
the SSE2 instruction set: pcmpeqb + pmovmaskb until there
is a set bit then bsf to find its index. On Core2 it is roughly
9.3 times faster and about 6.5 times faster on Pentium 4.
The same technique could obviously be applied to wider
data types as well.

My question is: why isn't the "repne scas" idiom microcoded
internally to implement the above (or better!) sequence using
the inherent power of the on-chip SIMD execution engines?
The same applies to repe cmp (that is, strcmp, memcmp)...

Best regards
Piotr Wyderski

Terje Mathisen

unread,

Aug 12, 2007, 10:33:16 AM8/12/07

to

Piotr Wyderski wrote:
> Many compilers for IA-32 generate "repne scasb" in order
> tto find the length of a given C string. However, it is possible
> to implement strlen (and many other string functions) using the SSE2
> instruction set: pcmpeqb + pmovmaskb until there
> is a set bit then bsf to find its index. On Core2 it is roughly
> 9.3 times faster and about 6.5 times faster on Pentium 4.

I assume this is measured starting with 16-byte aligned strings?

To handle arbitrary starting alignment you have to specialcase the first
block, probably by POR'ing in a 16*16=256-byte lookup table value, or
you could use a single 31-byte table with a MOVUPS unaligned load?

However, if you're willing to look at misaligned data, then it is
probably much better to simply handle the first block with a MOVUPS
load, do the testing, then move on to aligned accesses from then on.

This is because (a) quite a few strings will be at least 8-byte aligned
anyway, and (b) most C-style zero-terminated strings are quite short,
needing only a single iteration of a 16-byte wide operation.

> The same technique could obviously be applied to wider
> data types as well.
>
> My question is: why isn't the "repne scas" idiom microcoded
> internally to implement the above (or better!) sequence using
> the inherent power of the on-chip SIMD execution engines?

According to discussions I've had with a well-know cpu architect,
hardware could indeed do a much better job, and afaik, the current 'Fast
String' support on some chips is a step in that direction.

The main problem is how to make it totally general, what with variously
cached/uncached memory buffers in particular.

> The same applies to repe cmp (that is, strcmp, memcmp)...

See above.

Terje
>
> Best regards
> Piotr Wyderski
>

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Piotr Wyderski

unread,

Aug 12, 2007, 11:12:01 AM8/12/07

to

Terje Mathisen wrote:

> I assume this is measured starting with 16-byte aligned strings?

Yes, as I can align all the dynamicly allocated strings to that
boundary with my custom memory allocator and enforce an
appropriate alignment of the static strings using language
extensions (#pragma, __attribute__ etc.).

> To handle arbitrary starting alignment you have to specialcase the first
> block, probably by POR'ing in a 16*16=256-byte lookup table value, or you
> could use a single 31-byte table with a MOVUPS unaligned load?

Why such a complex solution? :-) In case of strlen()/memchr() all you need
is to check whether the string is aligned (test with 0x0f + a statically
predicted "not taken" branch to the unaligned mode handler) and if it is
not, then align it down manually, perform the comparison operation, mask
out the invalid "prefix bytes" and stop if found/follow the aligned path if
not found. This way you'll check the overlapping part two times, but who
cares? :-) The movups-based solution also have its merits in case of
short strings, but perhaps a generic instruction-based while loop will
be enough, since the unaligned part requires at most 15 iterations?

> According to discussions I've had with a well-know cpu architect, hardware
> could indeed do a much better job, and afaik, the current 'Fast String'
> support on some chips is a step in that direction.

But is there any "fast string support"? Almost everywhere the rep family
is as slow as its direct replacement based on test/jnz. Only the block
copy/fill operations are optimized.

> The main problem is how to make it totally general

Have you seen the SSE4's pcmpestri instruction? It is pretty darn general.
:-)
Do they have a large array of 256 comparators to execute it in a single
cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).

Best regards
Piotr Wyderski

Terje Mathisen

unread,

Aug 13, 2007, 7:00:59 AM8/13/07

to

Piotr Wyderski wrote:

> Terje Mathisen wrote:
>> To handle arbitrary starting alignment you have to specialcase the
>> first block, probably by POR'ing in a 16*16=256-byte lookup table
>> value, or you could use a single 31-byte table with a MOVUPS unaligned
>> load?
>
> Why such a complex solution? :-) In case of strlen()/memchr() all you need
> is to check whether the string is aligned (test with 0x0f + a statically
> predicted "not taken" branch to the unaligned mode handler) and if it is
> not, then align it down manually, perform the comparison operation, mask
> out the invalid "prefix bytes" and stop if found/follow the aligned path if
> not found. This way you'll check the overlapping part two times, but who
> cares? :-) The movups-based solution also have its merits in case of
> short strings, but perhaps a generic instruction-based while loop will
> be enough, since the unaligned part requires at most 15 iterations?

There is one very important situation that Alpha did consider:

If you have a misaligned string of less than 8/16 (Alpha/SSE) bytes from
the end of the last allocated page, then you'll get a page fault by
trying to read past the end of the page.

Alpha handled this by having an instruction which could convert the
lower three bits of an address into a mask, said mask to be used
together with an aligned load (which it also had: Load from 0xnnnn3 and
you would get the quadword starting at 0xnnnn0). This made sure that all
load operations were aligned, and therefore couldn't cross a page boundary.

With SSE you don't have a similarly fast way of turning a potentially
misaligned load in an aligned load + mask, and since average C string
lengths are just a few bytes (single digits afair from reading some
paper 10+ years ago), the first block handling is _crucial_.

> Have you seen the SSE4's pcmpestri instruction? It is pretty darn
> general. :-)

Sure.

In fact, it is so general that there's just two possible ways to
implement it:

a) Microcode, which then will effectively expand to a _lot_ of code
handling alignment, locating the shortest (implicit) string length
input, do a bunch of 16-byte loads from the two source addresses,
masking the compare if the shortest string ends anywhere inside, do the
compare, then loop.

b) A _huge_ slice of hardware directly implementing all the logic of
(a), which would turn this into one of the CISC'iest non-OS instructions
ever.

The documentation I have doesn't specify enough detail, but my guess
would be that it simply returns the index of the first non-equal byte,
then it is up to the calling code to figure out which string is longer.

The C-style version is in some ways even more hairy, it requires two
sets of comparators (but not three):

One 16-byte wide set which compares pairs of input bytes and another
which checks for zero bytes. The second can of course re-use the first
hw with one of the inputs forced to zero, and what with all the other
stuff going on, I don't believe the actual XOR+merge 8/16 bits part is
going to be the bottleneck.

> Do they have a large array of 256 comparators to execute it in a single
> cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).

16 or 32 is enough, unless it actually works on 64-byte cache blocks, in
the latter case they also need one or two 64-byte barrel shifters to
align the strings with each other.

Terje

Terje Mathisen

unread,

Aug 13, 2007, 9:34:44 AM8/13/07

to

Piotr Wyderski wrote:
> Terje Mathisen wrote:

> But is there any "fast string support"? Almost everywhere the rep family
> is as slow as its direct replacement based on test/jnz. Only the block
> copy/fill operations are optimized.
>
>> The main problem is how to make it totally general
>
> Have you seen the SSE4's pcmpestri instruction? It is pretty darn
> general. :-)
> Do they have a large array of 256 comparators to execute it in a single
> cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).

OK, I've found some more docs about these soon-to-be opcodes, and you're
right, it is pretty darn general, but not what I'd think of as a
"string" operation, more like a building block.

They don't need anything like 256 comparators though:

The way they've probably implemented it is to use the same 16 byte-wide
comparators as needed by lots of existing opcodes, with one of the
immediate bits (which really should be considered part of the opcode,
resulting in 64 times as many opcodes I guess), used to determine if
pairs of byte operations should be joined into words.

Each of these comparators must be two-state (</=) or (=/>), so that the
two bits used for equal_any/range/equal_each/ordered can be used to
select the terms to merge.

What it does look like is an opcode that can only ever be used as part
of library code, or in the form of specific canned operations that are
output from a compiler.

I bet you could use this opcode plus a store and have a two-instruction
Turing-equivalent machine, using self-modification for all interesting
stuff. :-)

Terje

Piotr Wyderski

unread,

Aug 13, 2007, 7:32:07 PM8/13/07

to

Terje Mathisen wrote:

> There is one very important situation that Alpha did consider:
>
> If you have a misaligned string of less than 8/16 (Alpha/SSE) bytes from
> the end of the last allocated page, then you'll get a page fault by trying
> to read past the end of the page.
>
> Alpha handled this by having an instruction which could convert the lower
> three bits of an address into a mask, said mask to be used together with
> an aligned load (which it also had: Load from 0xnnnn3 and you would get
> the quadword starting at 0xnnnn0). This made sure that all load operations
> were aligned, and therefore couldn't cross a page boundary.
>
> With SSE you don't have a similarly fast way of turning a potentially
> misaligned load in an aligned load + mask, and since average C string
> lengths are just a few bytes (single digits afair from reading some paper
> 10+ years ago), the first block handling is _crucial_.

Yes, it is exactly the reason I convert all misaligned string base addresses
into down-aligned ones. This way you'll never cross a page boundary, as
the alignment process simply clears the four least significant bits of the
string address and can use the fast movaps instruction. All you need is
to mask out the "false positives" located before the actual string itself,
and the mask computation process can be performed using a simple
piece of general-purpose code which runs in parallel with the SSE2-based
comparator (namely, it is expected to slide under the pmovmskb latency).

Best regards
Piotr Wyderski

Piotr Wyderski

unread,

Aug 13, 2007, 7:59:19 PM8/13/07

to

Terje Mathisen wrote:

> They don't need anything like 256 comparators though:

How is that possible? The documentation explicitely says
that every SIMD vector A item is compared with every
SIMD vector B item and then the results are merged
accordingly. Since there are up to 16 entries in a vector,
it means that 256 byte-wide comparisons must be performed.
Complete unrolling in space requires 256 byte-wide comparators
(or 16 128-bit-wide, if you prefer), which is a considerable
piece of silicon. Complete unrolling in time requires just
one 128-bit comparator and 16 passes, which is SLOW.

> What it does look like is an opcode that can only ever be used as part of
> library code, or in the form of specific canned operations that are output
> from a compiler.

But if an instruction is slow, then it is useless (provided that
there is a faster replacement). I am pretty sure that a carefully
tuned variant of pcmpeqb + ptest will give you much faster
strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
No, thanks... :-)

> I bet you could use this opcode plus a store and have a two-instruction
> Turing-equivalent machine, using self-modification for all interesting
> stuff. :-)

Frankly, I had a very similar impression and the term "overdesigned"
was the first (well, the first non-insulting...) word that came to mind.
:-)))

Best regards
Piotr Wyderski

Terje Mathisen

unread,

Aug 14, 2007, 12:46:35 PM8/14/07

to

Piotr Wyderski wrote:
> Terje Mathisen wrote:
>
>> They don't need anything like 256 comparators though:
>
> How is that possible? The documentation explicitely says
> that every SIMD vector A item is compared with every
> SIMD vector B item and then the results are merged

I _really_ don't think that's what it means!

In fact, I'm 99% positive this instruction does NOT entail any form of
barrel shift/full crossbar to actually compare all bytes against all bytes.

What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] <
b[x]) parallel byte compares, with the immediate mask determining how
these intermediate results should be used/combined.

First the W bit enabled merging of pairs of initial results, trivial
except for the (< less than) which has to combine the byte == and < tests:

xw[n] < xw[n] = (xb[2n+1] < yb[2n+1]) ||
((xb[2n+1] == yb[2n+1]) && (xb[2n] < yb[2n]));

> accordingly. Since there are up to 16 entries in a vector,
> it means that 256 byte-wide comparisons must be performed.
> Complete unrolling in space requires 256 byte-wide comparators
> (or 16 128-bit-wide, if you prefer), which is a considerable
> piece of silicon. Complete unrolling in time requires just
> one 128-bit comparator and 16 passes, which is SLOW.

Forget about that, it simply cannot be the case.

>
>> What it does look like is an opcode that can only ever be used as part
>> of library code, or in the form of specific canned operations that are
>> output from a compiler.
>
> But if an instruction is slow, then it is useless (provided that
> there is a faster replacement). I am pretty sure that a carefully
> tuned variant of pcmpeqb + ptest will give you much faster
> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
> No, thanks... :-)

Assuming it works as I (now) expect, it should have latency comparable
with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.

John Mashey

unread,

Aug 14, 2007, 1:54:57 PM8/14/07

to

On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:

> Piotr Wyderski wrote:
> > Terje Mathisen wrote:
>
> >> They don't need anything like 256 comparators though:

BUT, the very FIRST thing to do is to profile a wide range of
programs, see how much time is consumed by str* functions, and decide
whether this is even worth arguing about [it might, or might not be;
the last time I did this was along ago, but at the time, bcopy/memcpy
was the only high-runner.]

The second thing, for code one owns, is to take a look and make sure
there aren't silly things like for (i = 0; i <= strlen(s); i++)
{stuff}, since many compilers won't move the strlen(s) out of the
loop.

Needless to say, profiling Dhrystone doesn't count :-)

Piotr Wyderski

unread,

Aug 14, 2007, 6:19:13 PM8/14/07

to

John Mashey wrote:

> BUT, the very FIRST thing to do is to profile a wide range of
> programs, see how much time is consumed by str* functions, and decide
> whether this is even worth arguing about [it might, or might not be;
> the last time I did this was along ago, but at the time, bcopy/memcpy
> was the only high-runner.]

In my case SIMD does help a lot, therefore I don't care about
"a wide range of programs". ;-) My programs should provide
the highest performance available, the remaining ones
can be slow (most of cases) or even should be slow
(our competitors)...

> The second thing, for code one owns, is to take a look and make sure
> there aren't silly things like for (i = 0; i <= strlen(s); i++)

Indeed, but this kind of code transformation it is the very first
optimization I do.

Best regards
Piotr Wyderski

Tim McCaffrey

unread,

Aug 14, 2007, 8:00:15 PM8/14/07

to

In article <gs29p4-...@osl016lin.hda.hydro.com>,
terje.m...@hda.hydro.com says...

>
>Piotr Wyderski wrote:
>> Terje Mathisen wrote:
>>
>>> They don't need anything like 256 comparators though:
>>
>> How is that possible? The documentation explicitely says
>> that every SIMD vector A item is compared with every
>> SIMD vector B item and then the results are merged
>
>I _really_ don't think that's what it means!
>
>In fact, I'm 99% positive this instruction does NOT entail any form of
>barrel shift/full crossbar to actually compare all bytes against all
bytes.
>
>What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] <
>b[x]) parallel byte compares, with the immediate mask determining how
>these intermediate results should be used/combined.

The Intel manual really does say something different.

http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Pro
gramming%20Reference-v1.002.pdf

(sorry about the wrap).

>
>First the W bit enabled merging of pairs of initial results, trivial
>except for the (< less than) which has to combine the byte == and <
tests:
>
> xw[n] < xw[n] = (xb[2n+1] < yb[2n+1]) ||
> ((xb[2n+1] == yb[2n+1]) && (xb[2n] < yb[2n]));
>
>> accordingly. Since there are up to 16 entries in a vector,
>> it means that 256 byte-wide comparisons must be performed.
>> Complete unrolling in space requires 256 byte-wide comparators
>> (or 16 128-bit-wide, if you prefer), which is a considerable
>> piece of silicon. Complete unrolling in time requires just
>> one 128-bit comparator and 16 passes, which is SLOW.
>
>Forget about that, it simply cannot be the case.
>>
>>> What it does look like is an opcode that can only ever be used as part
>>> of library code, or in the form of specific canned operations that are
>>> output from a compiler.
>>
>> But if an instruction is slow, then it is useless (provided that
>> there is a faster replacement). I am pretty sure that a carefully
>> tuned variant of pcmpeqb + ptest will give you much faster
>> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
>> No, thanks... :-)
>
>Assuming it works as I (now) expect, it should have latency comparable
>with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.
>

One way to implement the "equal any" mode would be using 256 byte compares
(the other way would be to implement the loop from the pseudo code).
However, you could also build a 256 bit wide lookup (1 bit wide) from one
xmm register, and index into it with each byte of the other xmm register.
Interesting instruction, I bet the range mode could also help with
audio/video compression. Also note that these instructions do not have an
alignment restriction.

- Tim

Terje Mathisen

unread,

Aug 15, 2007, 6:27:48 AM8/15/07

to

John Mashey wrote:
> On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...@hda.hydro.com>
> wrote:
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>> They don't need anything like 256 comparators though:
>
> BUT, the very FIRST thing to do is to profile a wide range of
> programs, see how much time is consumed by str* functions, and decide
> whether this is even worth arguing about [it might, or might not be;
> the last time I did this was along ago, but at the time, bcopy/memcpy
> was the only high-runner.]

This is of course the reason only REP MOVS has been targeted until now.

>
> The second thing, for code one owns, is to take a look and make sure
> there aren't silly things like for (i = 0; i <= strlen(s); i++)
> {stuff}, since many compilers won't move the strlen(s) out of the
> loop.

A bad programmer can write bad code in any language, the loop above
doesn't even come close. What about re-initializing an entire 3D
workspace, with all objects/lists/surfaces etc between every frame?

Even a "RealityEngine N+1" would have at least some problems at that
point. :-(

>
> Needless to say, profiling Dhrystone doesn't count :-)

No, there was no need to tell me that.

Anyway, with the extreme flexibility of those new opcodes, and assuming
they do run in "CMP reg,reg" time, it more or less becomes a
limited-functionality sub-processor. Parallel state machines?

Terje Mathisen

unread,

Aug 15, 2007, 6:44:51 AM8/15/07

to

Tim McCaffrey wrote:
> In article <gs29p4-...@osl016lin.hda.hydro.com>,
> terje.m...@hda.hydro.com says...
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>
>>>> They don't need anything like 256 comparators though:
>>> How is that possible? The documentation explicitely says
>>> that every SIMD vector A item is compared with every
>>> SIMD vector B item and then the results are merged
>> I _really_ don't think that's what it means!
>>
>> In fact, I'm 99% positive this instruction does NOT entail any form of
>> barrel shift/full crossbar to actually compare all bytes against all
> bytes.
>> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] <
>> b[x]) parallel byte compares, with the immediate mask determining how
>> these intermediate results should be used/combined.
>
> The Intel manual really does say something different.
>
> http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Pro
> gramming%20Reference-v1.002.pdf

That was the manual I was reading, but not the initial part (2.3.1)
where it does state "up to 256 compares".

WOW!

What can I say? Mea Culpa. :-(

This is such a break with all cpu architecture design I've witnessed
over the last 20+ years, this is even more of an
"everything_including_the_kitchen_sink" opcode.

It must use a really big hunk of transistors, it is a true coprocessor,
tightly integrated with implicit registers/flags for input/output.

The way they've remapped all the flag register fields to allow a single
PCMP?STR* instruction to return all sorts of information at once is
quite amazing, it really looks like an attempt to make 8 and 16-bit
searching _very_ fast. I.e. this looks like FAST's original idea, where
my old "Algorithms and Data Structures" worked for years to have
students construct and implement a 16-wide parallel search engine.

Tim McCaffrey

unread,

Aug 15, 2007, 3:44:53 PM8/15/07

to

In article <622bp4-...@osl016lin.hda.hydro.com>,
terje.m...@hda.hydro.com says...

>
>This is such a break with all cpu architecture design I've witnessed
>over the last 20+ years, this is even more of an
>"everything_including_the_kitchen_sink" opcode.
>
>It must use a really big hunk of transistors, it is a true coprocessor,
>tightly integrated with implicit registers/flags for input/output.
>
>The way they've remapped all the flag register fields to allow a single
>PCMP?STR* instruction to return all sorts of information at once is
>quite amazing, it really looks like an attempt to make 8 and 16-bit
>searching _very_ fast. I.e. this looks like FAST's original idea, where
>my old "Algorithms and Data Structures" worked for years to have
>students construct and implement a 16-wide parallel search engine.
>

I can't put my finger on why I feel this way, but I'm willing to make a
(small) bet that this is what Mr. Glew has been working on the last few
years. It just seems like his "style".

I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM,
one (unrestricted alignment) memory access (or another XMM), and a GPR
and a Flags result all in one instruction. Must cause all kinds of
havoc syncronizing the execution pipes.

- Tim

Terje Mathisen

unread,

Aug 16, 2007, 2:18:28 AM8/16/07

to

Tim McCaffrey wrote:
> I can't put my finger on why I feel this way, but I'm willing to make a
> (small) bet that this is what Mr. Glew has been working on the last few

I'd take that (very small) bet: I'll buy you a beer the next (first?)
time we meet if you're right. (If you're not, forget about it, I don't
like beer. :-)

> years. It just seems like his "style".

Andy, if this is indeed you, then I'd like (at some point in time) to
get the full story about what important workload could conceivably get
sufficient speedup from this.

> I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM,

Which is why I don't entirely like to think about Andy behind it.

> one (unrestricted alignment) memory access (or another XMM), and a GPR
> and a Flags result all in one instruction. Must cause all kinds of
> havoc syncronizing the execution pipes.

Right.

This cannot really be a part of the usual pipes at all, it seem like a
huge hunk of semi-independent hardware, but with a private pipe in/out
of the register array.

The main hiccup wouldn't be the need to read potentially unaligned input
data, but the way it has to return information to three different
locations, all implied:

XMM0 for the SIMD data, ECX for the index found and the flags for all
sorts of branchable indications.

OTOH we've always had a few special-case instructions with extra
outputs: MUL and DIV are the most obvious, returning two int regs + flags.

Writing to an SSE register in parallel with a normal int return might
actually be easier than writing to two integer registers at once?

MitchAlsup

unread,

Aug 17, 2007, 1:53:43 PM8/17/07

to

On Aug 15, 2:44 pm, timcaff...@aol.com (Tim McCaffrey) wrote:
> I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM,
> one (unrestricted alignment) memory access (or another XMM), and a GPR
> and a Flags result all in one instruction. Must cause all kinds of
> havoc syncronizing the execution pipes.

Once the FCMP*I instructions went in (FP comparison, int flags result)
the
pipeline guys built a dedicated bus from the FPU to EFLAGS to make
these
fast. The FPUs do both 80-bit and SSE, so having a EFLAGS result is
straightforward (in the sense of interlocks and pipelining).

Mitch

Patrick de Zeester

unread,

Aug 18, 2007, 3:10:24 PM8/18/07

to

Piotr Wyderski wrote:
> John Mashey wrote:
>
>> BUT, the very FIRST thing to do is to profile a wide range of
>> programs, see how much time is consumed by str* functions, and decide
>> whether this is even worth arguing about [it might, or might not be;
>> the last time I did this was along ago, but at the time, bcopy/memcpy
>> was the only high-runner.]
>
> In my case SIMD does help a lot, therefore I don't care about
> "a wide range of programs". ;-) My programs should provide
> the highest performance available, the remaining ones
> can be slow (most of cases) or even should be slow
> (our competitors)...

If performance is a concern why scan for the zero terminator to
determine the length of a string? You could just stored the length with
the string itself.

Piotr Wyderski

unread,

Aug 20, 2007, 10:29:39 AM8/20/07

to

Patrick de Zeester wrote:

> If performance is a concern why scan for the zero terminator to
> determine the length of a string? You could just stored the length with
> the string itself.

strlen() is just a simple and clean example of what can be
done the SIMD way, hence it is a toy function to play with.
But the field of possible applications is much, much wider.
Anyway, even in the case of memcmp()/strcmp() the cached
length field doesn't help.

Best regards
Piotr Wyderski