string performance funkiness

Ken Alverson

unread,

Mar 29, 2002, 5:48:44 AM3/29/02

to

I was trying to track down why a string routine I wrote was running slower
than I expected it to, and discovered some very odd (and undesirable)
behavior with respect to string comparisons. I wrote a simple benchmark to
compare the speed of comparing char*'s using strcmp and comparing strings
using the < operator.

I expected strcmp to have a /very/ slight advantage (I was expecting the
difference to be virtually negligible). What I found was that operator< was
always significantly slower (10-60%, depending on optimization options, on a
release build). Surprisingly, the more optimizations I turned on, the
*worse* operator< did (relative to strcmp).

Now the part that baffled me. On a whim, I tried strcmp-ing the results of
string::c_str, hoping to get performance equal to the strcmp on char* case.
What I got was actually 130% *FASTER* than the char* strcmp case (it took
less than half the time).

Digging into the disassembly, I managed to figure out the following.

- operator< is implemented in terms string::compare which is implemented in
terms of char_traits<char>::compare, which is implemented in terms of memcmp
- memcmp uses a (very) obsolete "optimization" of "repe cmps" instead of a
loop. "repe cmps" is very slow on modern cpus (as are most rep* commands)
- strcmp also uses "repe cmps" *IF* it knows the length of one of its
parameters. If it doesn't know the length of either parameter, it uses a
regular loop and runs more than twice as fast. The slow case is a very
common usage ("if (0 == strcmp(str,CONST_STRING)) {...}" triggers it)
- forcing strcmp to not know the lengths of either string ahead of time
brought its execution time down to 10% below the
strcmp(string::c_str(),string::c_str()) case
- declaring the string constants as "char name[] = "constant string";"
instead of "char* name = "constant string";" was enough to fool strcmp into
not knowing the lengths ahead of time
- strncmp also uses "repe cmps"
- these general points held for all the different optimization settings I
tried

So now I have execution times of:

strcmp(char*,char*): 100% baseline
strcmp(string::c_str(),string::c_str()): 110%
string < string: 285%

Surprisingly, changing char_traits<char>::compare to use strcmp instead of
memcmp *didn't* seem to significantly improve performance of the operator<
case. I understand the penalty for the string class (it has to determine
whether it's in the small string buffer or not...that info could be cached,
but not with the current class layout). I *can't* however come up with a
good excuse for operator< performing so much slower. It seems that
"compare(size_type,size_type,const Elem*,size_type)" isn't being inlined,
and it has a number of tests that are irrelevant for the string < string
case, such as comparing size < _Off, which is known to be zero, based on the
call from "compare(const Myt& _Right)".

I'd really rather not fall back on such archaic stuff as C string library
calls when we have a beautiful string class begging to be used, so please do
something about this. :) In the meantime, it looks like I'll be writing
predicate objects for my perf critical string sets, string keyed maps, finds
on vector<string>, etc, etc, etc...Or maybe I'll just replace
operator<(string,string).

Actually, doing just that (rewriting operator<(string,string) to be "return
(strcmp(_Left.c_str(),_Right.c_str())<0);") brought string<string to within
10% of the other two. But even though it is being inlined, the < case is
still consistently slower than the other two. Very odd.

Ken

P.J. Plauger

unread,

Mar 29, 2002, 8:00:53 AM3/29/02

to

"Ken Alverson" <K...@Alverson.com> wrote in message
news:#bCBd8w1BHA.2228@tkmsftngp07...

> I was trying to track down why a string routine I wrote was running slower
> than I expected it to, and discovered some very odd (and undesirable)
> behavior with respect to string comparisons. I wrote a simple benchmark to
> compare the speed of comparing char*'s using strcmp and comparing strings
> using the < operator.
>
> I expected strcmp to have a /very/ slight advantage (I was expecting the
> difference to be virtually negligible). What I found was that operator< was
> always significantly slower (10-60%, depending on optimization options, on a
> release build). Surprisingly, the more optimizations I turned on, the
> *worse* operator< did (relative to strcmp).

If you don't require a program to be correct, you can make it arbitrarily
fast. A string object can store embedded nul characters, so string("ab")
should compare less than string("ab\0c"). Using strcmp, they compare
equal.

> Now the part that baffled me. On a whim, I tried strcmp-ing the results of
> string::c_str, hoping to get performance equal to the strcmp on char* case.
> What I got was actually 130% *FASTER* than the char* strcmp case (it took
> less than half the time).
>
> Digging into the disassembly, I managed to figure out the following.
>
> - operator< is implemented in terms string::compare which is implemented in
> terms of char_traits<char>::compare, which is implemented in terms of memcmp
> - memcmp uses a (very) obsolete "optimization" of "repe cmps" instead of a
> loop. "repe cmps" is very slow on modern cpus (as are most rep* commands)

We use the memcmp supplied by the Microsoft C library, or inlined by VC++
if that happens. Ordinarily this is a safe bet if you want high performance.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Ken Alverson

unread,

Mar 29, 2002, 12:16:55 PM3/29/02

to

"P.J. Plauger" <p...@dinkumware.com> wrote in message
news:3ca46548$0$23481$4c41...@reader1.ash.ops.us.uu.net...

>
> If you don't require a program to be correct, you can make it arbitrarily
> fast. A string object can store embedded nul characters, so string("ab")
> should compare less than string("ab\0c"). Using strcmp, they compare
> equal.

You are correct, strcmp is *not* the correct solution here. I posted pretty
early in the morning, so I may have come out a little jumbled. My point was
that the correct solution should still be as fast as strcmp, especially
given strings that do not contain embedded nuls. Which is why I undid my
quick hack to <string> and went back to using a predicate, as I know the
strings I care about do not have embedded nul characters.

> > - operator< is implemented in terms string::compare which is implemented
in
> > terms of char_traits<char>::compare, which is implemented in terms of
memcmp
> > - memcmp uses a (very) obsolete "optimization" of "repe cmps" instead of
a
> > loop. "repe cmps" is very slow on modern cpus (as are most rep*
commands)
>
> We use the memcmp supplied by the Microsoft C library, or inlined by VC++
> if that happens. Ordinarily this is a safe bet if you want high
performance.

Don't get me wrong, I'm not placing blame on the library. Memcmp does seem
to be the appropriate solution. The maximum extent of blame I could/would
place on the library is for not having detected the discrepancy in testing.

This looks pretty squarely like an optimizer issue to me.

Ken

Ken Alverson

unread,

Apr 3, 2002, 6:44:26 PM4/3/02

to

Can I at least get confirmation that someone is looking into this? It
should be a very easy fix...(the "repe cmps" part at least)

Ken

"Ken Alverson" <K...@Alverson.com> wrote in message
news:#bCBd8w1BHA.2228@tkmsftngp07...

Brandon Bray [MSFT]

unread,

Apr 4, 2002, 3:10:23 PM4/4/02

to

Ken Alverson wrote:
> Can I at least get confirmation that someone is looking into this? It
> should be a very easy fix...(the "repe cmps" part at least)

Hi Ken,
I can confirm that not just someone but a number of people are looking
into this. We have looked at "repe cmps" but this has shown to have rather
poor performance on the Pentium 4. There will be a number of improvements
to these string functions and their associated intrinsics in an upcoming
compiler (but not the next one).

Hope that helps!

--
Cheerio!
Brandon Bray Program Manager in the Visual C++ Compiler Team

And now a word from the lawyers: This posting is provided "AS IS" with no
warranties, and confers no rights. You assume all risk for your use. © 2002
Microsoft Corporation. All rights reserved.

Ken Alverson

unread,

Apr 4, 2002, 3:19:59 PM4/4/02

to

"Brandon Bray [MSFT]" <branbra...@microsoft.com> wrote in message
news:#ggXSSB3BHA.1840@tkmsftngp03...

>
> Hi Ken,
> I can confirm that not just someone but a number of people are looking
> into this. We have looked at "repe cmps" but this has shown to have
rather
> poor performance on the Pentium 4. There will be a number of improvements
> to these string functions and their associated intrinsics in an upcoming
> compiler (but not the next one).

Great! One important thing to consider is that this problem is *not*
limited to the Pentium 4. On my Athlon, rep* is approximately half as fast
as spelling out the loop, and this is the case with all x86 processors since
at least the Pentium 1. I'll gladly trade ~15 bytes of machine code for a
doubling in performance ;)

Ken

Brandon Bray [MSFT]

unread,

Apr 4, 2002, 9:29:08 PM4/4/02

to

Ken Alverson wrote:
>
> Great! One important thing to consider is that this problem is *not*
> limited to the Pentium 4. On my Athlon, rep* is approximately half as
> fast as spelling out the loop, and this is the case with all x86
> processors since at least the Pentium 1. I'll gladly trade ~15 bytes of
> machine code for a doubling in performance ;)

You've hit the problem right on spot! We're going back and analyzing a lot
of seemingly small things like this and finding a whole slew of things to
improve. (Too bad it can't show up in the next release "sometimes called
7.1"). As for addressing Athlon performance, we're finding that almost
anything that helps a Pentium 4 also tends to help improve performance on an
Athlon. Overall, we're making sure anything we do benefits everyone.

Ken Alverson

unread,

Apr 4, 2002, 9:51:41 PM4/4/02

to

"Brandon Bray [MSFT]" <branbra...@microsoft.com> wrote in message

news:eSq26lE3BHA.2612@tkmsftngp03...

> Ken Alverson wrote:
> >
> > Great! One important thing to consider is that this problem is *not*
> > limited to the Pentium 4. On my Athlon, rep* is approximately half as
> > fast as spelling out the loop, and this is the case with all x86
> > processors since at least the Pentium 1. I'll gladly trade ~15 bytes of
> > machine code for a doubling in performance ;)
>
> You've hit the problem right on spot! We're going back and analyzing a
lot
> of seemingly small things like this and finding a whole slew of things to
> improve. (Too bad it can't show up in the next release "sometimes called
> 7.1"). As for addressing Athlon performance, we're finding that almost
> anything that helps a Pentium 4 also tends to help improve performance on
an
> Athlon. Overall, we're making sure anything we do benefits everyone.

I just mentioned the Athlon because it's what I'm currently on (and quoting
performance figures from), I actually have three pentium 3's nearby, so yes,
I care about performance on those as well :)

Another red flag to watch out for is the "loop" statement, which is also an
anti-optimization on recent cpus. I didn't see VC producing any of those,
but I did end up creating a patch for the Rotor JIT because it was using
"loop" in the function prolog - replacing it with "dec ecx/jnz label" sped
up overall perf by ~10%.

Ken