Anyone else seeing the same, or know how to improve performance? We tried
enabling small-block allocation and setting the security-checks define to 0,
but no luck; performance got even worse.
VS2005
=====
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character to empty string 100 million times: 6585 ms
Search long string for 3 different substrings 10 million times: 3272 ms
Assignment, search, replace 10 million times: 3537 ms
Pass and return by value function called 10 million times: 8288 ms
Pass by reference return by value function called 10 million times: 4186 ms
Short String(10 chars) - Pass and return by value function called 10 million times: 1464 ms
VS2005/STLPORT5.0
===============
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character to empty string 100 million times: 1841 ms
Search long string for 3 different substrings 10 million times: 3091 ms
Assignment, search, replace 10 million times: 2664 ms
Pass and return by value function called 10 million times: 3576 ms
Pass by reference return by value function called 10 million times: 1827 ms
Short String(10 chars) - Pass and return by value function called 10 million times: 820 ms
Without knowing exactly how you built the code, and exactly what the code is
doing, it's impossible to comment beyond the superficial.
Can you post your test program for others to inspect and verify your
results?
-cd
// string_compare.cpp : Defines the entry point for the console application.
//
#include <cstdio>
#include <string>
// The USE_STLPORT define is needed for VC6.
// In VC8, namespace std is used.
#ifdef USE_STLPORT
#pragma message ("USING STLPORT")
using namespace stlport;
#else
using namespace std;
#endif
string func( string par )
{
    string tmp( par );
    return tmp;
}

string func2( string& par )
{
    string tmp( par );
    return tmp;
}

int main( int, char* const* )
{
    // HighResPerformanceMeasurement: the poster's own high-resolution
    // timer class (definition not shown in the post).
    HighResPerformanceMeasurement usecTimer;
    {
        usecTimer.Start();
        string s;
        for ( int i = 0; i < 100000000; ++i ) {
            s += " ";
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Append one character to empty string 100 million times: %.0f ms\n", msecs);
    }
    {
        usecTimer.Start();
        string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#: JKHWJ:CND");
        for ( int i = 0; i < 10000000; ++i ) {
            s.find( "unfkHBUKGY" );
            s.find( "W^( @T@H!B" );
            s.find( "J:CND" );
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Search long string for 3 different substrings 10 million times: %.0f ms\n", msecs);
    }
    {
        usecTimer.Start();
        string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#: JKHWJ:CND");
        string::size_type p;
        string ss1( "unfkHBUKGY" );
        string ss2( "123456" );
        string sx;
        for ( int i = 0; i < 10000000; ++i ) {
            sx = s;
            p = sx.find( ss1 );
            sx.replace( p, ss1.size(), ss2 );
            sx += s;
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Assignment, search, replace 10 million times: %.0f ms\n", msecs);
    }
    {
        usecTimer.Start();
        string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#: JKHWJ:CND");
        for ( int i = 0; i < 10000000; ++i ) {
            string sx = func( s );
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Pass and return by value function called 10 million times: %.0f ms\n", msecs);
    }
    {
        usecTimer.Start();
        string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#: JKHWJ:CND");
        for ( int i = 0; i < 10000000; ++i ) {
            string sx = func2( s );
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Pass by reference return by value function called 10 million times: %.0f ms\n", msecs);
    }
    {
        usecTimer.Start();
        string s("1234567890");
        for ( int i = 0; i < 10000000; ++i ) {
            string sx = func( s );
        }
        usecTimer.End();
        double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
        printf("Short String - Pass and return by value function called 10 million times: %.0f ms\n", msecs);
    }
    return 0;
}
But the interesting stuff - how you compiled your code - is left out.
Debug builds are far slower than release builds, of course. And the
Microsoft std library does have a lot of additional functionality in
debug mode. This costs execution time but saves lots of time debugging.
/Peter
In his initial post he wrote:
| W:\test\string_tests\VC8\Release>string_tests.exe
[...test output...]
Note the 'Release' in the path; I'm guessing the default release
compilation.
Uli
I'll see if I can dig up the exact project file. But try it yourself under
whatever settings you like; in terms of relative performance it won't
matter, the Microsoft implementation is still significantly slower than the
freeware open-source one, both running in release mode.
for extra performance in Release mode use option /D_SECURE_SCL=0 (and
/D_SCL_SECURE_NO_DEPRECATE for nicer compiler output). See also
http://msdn2.microsoft.com/en-us/library/ms404587.aspx
B.
The first and most crucial question to ask of any benchmark is how
accurately this benchmark represents real-world workloads. Unless the
benchmark is quite close to real-world usage, it's meaningless -- and
even if it's fairly close, it can still be entirely meaningless (if the
differences between "fairly close" and "real world" make a big
difference in performance).
[ ... ]
> string s ;
> for ( int i = 0 ; i < 100000000; ++ i ) {
> s += " " ;
People have done quite a few studies on string lengths as they're
actually used. Every time I've seen such a study, it's concluded that
most strings are quite short -- most people find that over 90% of
strings are 80 characters or less, and the frequency of usage drops off
quickly with length -- even 1000 characters is extremely rare. A string
of 100 megabytes is so rare that your measurement means nothing about
real use.
[ ... ]
> string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> JKHWJ:CND");
> for (int i = 0 ; i < 10000000; ++i) {
> s.find( "unfkHBUKGY" ) ;
> s.find( "W^( @T@H!B" ) ;
> s.find( "J:CND" ) ;
Likewise, in the real world, most strings consist primarily (if not
exclusively) of a-z, A-Z, space, and a FEW punctuation characters. I
haven't checked, but my immediate guess is that a significant difference
in string searching speed likely results from using a Boyer-Moore (or
BMH) search algorithm. Restricting the alphabet in use also
significantly reduces the advantage gained from such an algorithm.
That's not to say that such an advanced algorithm can't be useful -- but
my immediate guess is that the strings chosen above significantly
exaggerate the advantage likely to be gained in most real use.
> string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> JKHWJ:CND");
[ ... ]
> printf("Assignment, search, replace 10 million times: %.0f ms\n", msecs);
The comments about searching apply here as well.
[ ... ]
> string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> JKHWJ:CND");
[ ... ]
> printf("Pass and return by value function called 10 million times: %.0f
> ms\n", msecs);
I can't imagine anybody passing strings by value except by accident.
Virtually every book on C++ teaches even rank beginners to pass class
objects by ref-to-const by about chapter 2. This benchmark seems to
apply only to thoroughly inept code.
> string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> JKHWJ:CND");
[ ... ]
> printf("Pass by reference return by value function called 10 million
> times: %.0f ms\n", msecs);
The first comments on string length apply here as well, though obviously
to a somewhat lesser degree -- this string is somewhat longer than
usual, but at least not nearly so outrageously so.
> string s("1234567890");
[ ... ]
> printf("Short String - Pass and return by value function called 10 million
> times: %.0f ms\n", msecs);
Passing by value again? If I didn't know better, I'd guess you carefully
avoided the two (by far) most common situations: 1) pass a short string
by reference and return a short string by value, 2) pass a short string
by reference, modify a string whose reference was passed.
Of course, the second of these doesn't really measure much about the
string implementation at all -- and if the compiler includes named
and/or anonymous return value optimization, the first probably doesn't
either. To make a long story short, this entire line of testing only
means something to 1) people who have no clue how to use C++, or 2)
people stuck using old compilers without any return value optimization
capability.
Given STLPort's target of supporting nearly all compilers, the latter is
probably meaningful for the library. Unless I'm mistaken, VS 2005 does
include RVO, so it means little in this context though.
Ultimately, I can't see anything here that seems to indicate much of
anything about whether (for example) I'd get better or worse performance
in my programs if I used STLPort's string class instead of Dinkum's. It
might happen, but these tests just don't show anything meaningful one
way or the other.
--
Later,
Jerry.
The universe is a figment of its own imagination.
This one is especially funny, as a small change into
s += ' ';
makes it run *a lot* faster.
What realistic application does a significant number of appends
with single-character string literals?
Bo Persson
[ ... ]
> >> for ( int i = 0 ; i < 100000000; ++ i ) {
> >> s += " " ;
>
> This one is especially funny, as a small change into
>
> s += ' ';
>
> makes it run *a lot* faster.
Yup.
> What realistic application is doing a significant number of append
> with single character string literals?
I can't say I remember having read or written too many of those...
I think that your assessment is reasonable and we should conduct a more real
world performance test with an existing application's code, and then if we
discover particular issues, break those scenarios out for a performance tester.
Thanks,
Dave
As Jerry pointed out as well some more real world scenarios are probably a
better way to test.
The differences between STLport and the VS STL in these particular test
scenarios are still a little surprising. It is possible these do map to
more real world scenarios, but until we verify we don't really know.
Certainly this does test specific cases where STLport is faster than the
VS STL, but you are right that the test cases might not be close enough to a
real world scenario to ensure the results would apply. We'll do some more
real world testing to see whether these numbers map to our real world app.
See last test in each, short string.
VS2005
==================================================================
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character to empty string 100 million times: 6653 ms
Search long string for 3 different substrings 10 million times: 3361 ms
Assignment, search, replace 10 million times: 3560 ms
Pass and return by value function called 10 million times: 8317 ms
Pass by reference return by value function called 10 million times: 4177 ms
Short String - Pass and return by value function called 10 million times: 1465 ms
Short String - Pass by reference return by value function called 10 million times: 743 ms
VS2005 & STLPORT 5
Append one character to empty string 100 million times: 1759 ms
Search long string for 3 different substrings 10 million times: 3114 ms
Assignment, search, replace 10 million times: 2667 ms
Pass and return by value function called 10 million times: 3597 ms
Pass by reference return by value function called 10 million times: 1816 ms
Short String - Pass and return by value function called 10 million times: 822 ms
Short String - Pass by reference return by value function called 10 million times: 427 ms
string func2( string& par )
{
    string tmp( par );
    return tmp;
}

{
    usecTimer.Start();
    string s("1234567890");
    for ( int i = 0; i < 10000000; ++i ) {
        string sx = func2( s );
    }
    usecTimer.End();
    double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
    printf("Short String - Pass by reference return by value function called 10 million times: %.0f ms\n", msecs);
}
Compiler options:
/O2 /I "W:\3rd Party Software\STLport\stlport" /I "v:\include" /D "WIN32" /D
"NDEBUG" /D "_CONSOLE" /FD /EHsc /MD /Fo"VC8\Release\\"
/Fd"VC8\Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt
Linker Options:
/OUT:"VC8\Release\string_tests.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"W:\3rd
Party Software\STLport\stlport" /MANIFEST
/MANIFESTFILE:"VC8\Release\string_tests.exe.intermediate.manifest" /DEBUG
/PDB:"w:\test\string_tests\VC8\Release\string_tests.pdb" /SUBSYSTEM:CONSOLE
/OPT:REF /OPT:ICF /MACHINE:X86 /ERRORREPORT:PROMPT stlport.5.0.lib
kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib
shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib
"Jerry Coffin" wrote:
> The first and most crucial question to ask of any benchmark is how
> accurately this benchmark represents real-world workloads. Unless the
> benchmark is quite close to real-world usage, it's meaningless -- and
> even if it's fairly close, it can still be entirely meaningless (if the
> differences between "fairly close" and "real world" make a big
> difference in performance).
Yes, but I am microbenchmarking a small subset of the STL string
implementation only. Most real world apps will have bigger bottlenecks to
deal with and these benchmarks will not matter much. The benchmarks do point
to a more efficient implementation in STLPort.
> > string s ;
> > for ( int i = 0 ; i < 100000000; ++ i ) {
> > s += " " ;
>
> People have done quite a few studies on string lengths as they're
> actually used. Every time I've seen such a study, it's concluded that
> most strings are quite short -- most people find that over 90% of
> strings are 80 characters or less, and the frequency of usage drops off
> quickly with length -- even 1000 characters is extremely rare. A string
> of 100 megabytes is so rare that your measurement means nothing about
> real use.
>
Test was meant to benchmark the performance when the string must do new
allocations. This was an easy way to get at that.
>
> > string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> > JKHWJ:CND");
> > for (int i = 0 ; i < 10000000; ++i) {
> > s.find( "unfkHBUKGY" ) ;
> > s.find( "W^( @T@H!B" ) ;
> > s.find( "J:CND" ) ;
>
> Likewise, in the real world, most strings consist primarily (if not
> exclusively) of a-z, A-Z, space, and a FEW punctuation characters. I
> haven't checked, but my immediate guess is that a significant difference
> in string searching speed likely results from using a Boyer-Moore (or
> BMH) search algorithm. Restricting the alphabet in use also
> significantly reduces the advantage gained from such an algorithm.
>
> That's not to say that such an advanced algorithm can't be useful -- but
> my immediate guess is that the strings chosen above significantly
> exaggerate the advantage likely to be gained in most real use.
>
Whatever. STLPort searched faster. Try it with other strings and see what
happens.
> > string s("qyweyuewunfkHBUKGYUGL ,wehbYGUW^( @T@H!BALWD:h^&@#*@(#:
> > JKHWJ:CND");
> [ ... ]
> > printf("Pass and return by value function called 10 million times: %.0f
> > ms\n", msecs);
>
> I can't imagine anybody passing strings by value except by accident.
> Virtually every book on C++ teaches even rank beginners to pass class-
> objects by ref-to-const by about chapter 2. This benchmark seems to
> apply only to thoroughly inept code.
>
Test was meant to benchmark the copy constructor, and return value
optimization. This was an easy way to test that.
> > string s("1234567890");
> [ ... ]
> > printf("Short String - Pass and return by value function called 10 million
> > times: %.0f ms\n", msecs);
>
> Passing by value again?
Again, to test the copy constructor on short strings.
> If I didn't know better, I'd guess you carefully
> avoided the two (by far) most common situations: 1) pass a short string
> by reference and return a short string by value, 2) pass a short string
> by reference, modify a string whose reference was passed.
OK. Here goes with these two tests included (code at end):
VS2005
==================================================================
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character to empty string 100 million times: 6791 ms
Search long string for 3 different substrings 10 million times: 3367 ms
Assignment, search, replace 10 million times: 3551 ms
Pass and return by value function called 10 million times: 8362 ms
Pass by reference return by value function called 10 million times: 4222 ms
Short String - Pass and return by value function called 10 million times: 1460 ms
Short String - Pass by reference return by value function called 10 million times: 741 ms
Short String - Pass by reference modify reference called 10 million times: 643 ms
VS2005 & STLPORT
===================================================================
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character to empty string 100 million times: 1804 ms
Search long string for 3 different substrings 10 million times: 3134 ms
Assignment, search, replace 10 million times: 2611 ms
Pass and return by value function called 10 million times: 3839 ms
Pass by reference return by value function called 10 million times: 1885 ms
Short String - Pass and return by value function called 10 million times: 829 ms
Short String - Pass by reference return by value function called 10 million times: 421 ms
Short String - Pass by reference modify reference called 10 million times: 245 ms
{
    usecTimer.Start();
    string s("1234567890");
    for ( int i = 0; i < 10000000; ++i ) {
        string sx = func2( s );
    }
    usecTimer.End();
    double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
    printf("Short String - Pass by reference return by value function called 10 million times: %.0f ms\n", msecs);
}

void func3( string& par )
{
    par = "1234123412";
}

{
    usecTimer.Start();
    string s("1234567890");
    for ( int i = 0; i < 10000000; ++i ) {
        func3( s );
    }
    usecTimer.End();
    double msecs = double((usecTimer.GetDifference()*1000)/usecTimer.GetFrequency());
    printf("Short String - Pass by reference modify reference called 10 million times: %.0f ms\n", msecs);
}
-ephi
> Test was meant to benchmark the performance when the string must do
> new
> allocations. This was an easy way to get at that.
>
But in practice it unfortunately doesn't measure that. The operator +=
probably maps to
append(String, traits_type::length(String));
In my implementation I got about a 10 times speedup by adding the
special case
basic_string& append(const value_type* _String, size_type _StringSize)
{
    if (_StringSize == 1)
        push_back(*_String);
    else
        ...
}
So, you are measuring one *very* specific optimization of the standard
library implementation. Adding a single space character could be
s += " ";
s += ' ';
s.push_back(' ');
s.append(" ");
s.append(1, ' ');
You are just measuring the first one.
Bo Persson
"Bo Persson" wrote:
[...]
> >> > string s ;
> >> > for ( int i = 0 ; i < 100000000; ++ i ) {
> >> > s += " " ;
> >>
>
> > Test was meant to benchmark the performance when the string must do
> > new
> > allocations. This was an easy way to get at that.
> >
>
> But in practice it unfortunately doesn't measure that. The operator +=
> probably maps to
>
> append(String, traits_type::length(String));
>
> In my implementation I got about a 10 times speedup by adding the
> special case
>
> basic_string& append(const value_type* _String, size_type _StringSize)
> {
> if (_StringSize == 1)
> push_back(*_String);
> else
> ...
> }
Again. I'm trying to factor out the append optimizations and just see which
implementation is better at the allocations.
I'm assuming all the append logic time will be minimal compared to the
number of allocations that need to be done after 100 million appends.
> So, you are measuring one *very* specific optimization of the standard
> library implementation. Adding a single space character could be
>
> s += " ";
> s += ' ';
> s.push_back(' ');
> s.append(" ");
> s.append(1, ' ');
>
> You are just measuring the first one.
OK. Here are the other tests (the push_back is especially scary):
For what it's worth I am not trying to dis VS2005 STL. It is probably more
compliant than STLPort but these tests clearly show a less efficient
implementation compared to STLPort. And for most projects this probably
doesn't matter that much (as larger bottlenecks usually exist). I'm looking
for suggestions, project property options or anything else I can use so I am
not forced into using STLPort.
VS2005
======
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character ( += " " ) to empty string 100 million times: 6663 ms
Append one character ( += ' ' ) to empty string 100 million times: 3156 ms
Append one character ( push_back(' ') ) to empty string 100 million times: 1497 ms
Append one character ( append(" ") ) to empty string 100 million times: 6610 ms
Append one character ( append(1,' ') ) to empty string 100 million times: 2870 ms
VS2005 w/STLPort
==============
W:\test\string_tests\VC8\Release>string_tests.exe
Append one character ( += " " ) to empty string 100 million times: 2076 ms
Append one character ( += ' ' ) to empty string 100 million times: 1496 ms
Append one character ( push_back(' ') ) to empty string 100 million times: 1065 ms
Append one character ( append(" ") ) to empty string 100 million times: 2009 ms
Append one character ( append(1,' ') ) to empty string 100 million times: 2390 ms
Short String - Pass by reference return by value function called 10 million times: 420 ms
Short String - Pass by reference modify reference called 10 million times: 239 ms
Do they really?
Imagine you are implementing list.
What will you do in the allocator?
1) Will you "remember" previous nodes on deallocation?
2) Will you free previous nodes on deallocation?
1) and 2) are mutually exclusive.
1) will give fast node reuse but heap memory use will increase
2) will give slower node reuse but heap memory use will be minimal as it is
recycled.
Your tests are too simplistic.
They don't measure how well the library designers have implemented their
STL. Instead they measure the default choice of allocator that comes
out-of-the-box with the compiler.
I am certain that accounts for your performance difference.
You're seeing the usual space-versus-time trade-offs that we all know.
Now I know that Dinkumware sells some add-on package that gives you various
different drop-in replacement allocators. And similarly STLPort also has
alternative allocators if it matters that much.
There is no right-or-wrong here. I think you will find that STLPort has a
bigger footprint in terms of memory consumed - nothing wrong with that, it
is the library choices made. In a 16-bit environment or an embedded ROM
environment that might be the wrong choice.
If there is any fault, I blame the C++ standard because, apart from limited
control over vector, deque and string, it is impossible for the programmer
to inform the library designers what their intention is with a container. If
your application uses large amounts of heap memory across several
containers, enough to bust OS limits, you don't want any container to
reserve spare memory for later use. And contrariwise, if you are constantly
reusing a container which grows and shrinks, for performance reasons you
want it to recycle the same memory again and again.
I would consider the default out-of-the-box allocator as part of the tests,
but it would not be the only thing I would look at.
I would _ALSO_ like to make some tests that are allocator-neutral and also
don't depend on the choice of growth factor for vector and string. You have
not done that.
And I could beat STLPort in terms of your tests with my own STL
implementation: I would make vector, string, deque and stringstreams grow by
a factor of 64 whenever capacity is reached. For string's small-string
optimisation I would ensure there is space for 1K right from the outset, so
at a bare minimum sizeof(std::string) would be 1024. The strings would be
obese. But the cost of this is astronomical: the memory footprint is huge.
By your criterion, your tests would likely reveal that my STL
implementation is the most "efficient" one out there.
But it would be a poor conclusion drawn from oversimple tests.
Stephen Howe
They don't "clearly" show that.
See my other post on allocators and container growth rates, which is what
you _ARE_ testing.
Did you check memory footprints halfway through? No? Perhaps it does not
matter for your application.
Here are some examples which do show something about efficiency.
Try std::sort() with many duplicates and an instrumented ctor, copy ctor,
assignment operator and dtor.
I think you will find Dinkumware comes out on top because last time I looked
STLPort does not implement fatpivot. Dinkumware does.
And it is possible to feed some nasty distribution to Dinkumware/STLPort
where std::nth_element() is not O(N) because neither implementation switches
to an alternative algorithm if quickselect goes bad.
And the latest STLPort's std::stable_partition() handles a few more cases
where auxiliary memory needs no allocation.
Both use a form of heapsort for std::partial_sort() which is what the STL
designers had in mind but AFAIK, a variation of quicksort can be used which
is more efficient.
All that I have seen, by inspection, proves to me that both implementations
are written extremely well.
But you are not measuring that.
> Whatever. STLPort searched faster. Try it with other strings and see what
> happens.
I will have a look to see how find is implemented and then feed a worst case
to STLPort.
There is nothing magical about std::string::find() - the library designer
usually has to make what they think is a good _general_ choice. Some
algorithms have a startup cost but afterwards run fast: simple
first-character scan then memcmp, Rabin-Karp, Knuth-Morris-Pratt,
Boyer-Moore... they all have good and bad cases.
Stephen Howe
Yes, right. And if you read the PDF file that Boris provided the link to,
when you get to section 6 you can see what role the allocator plays.
Stephen Howe