
ostringstream performance


Bonita Montero

Jul 25, 2022, 6:20:09 AM
I tried to evaluate the performance of ostringstream with MSVC,
clang++-12 under Linux and g++.

#include <iostream>
#include <sstream>
#include <vector>
#include <chrono>
#include <iomanip>
#include <array>
#include <concepts> // for std::same_as

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t N = 1'000'000;
    auto bench = []<typename StringGen>( StringGen stringGen ) -> double
        requires requires( StringGen stringGen )
            { { stringGen( (size_t)123 ) } -> std::same_as<string>; }
    {
        vector<string> vs;
        vs.reserve( N );
        auto start = high_resolution_clock::now();
        for( size_t i = N; i--; )
            vs.emplace_back( stringGen( i ) );
        return (double)(int64_t)duration_cast<nanoseconds>(
            high_resolution_clock::now() - start ).count() / N;
    };
    ostringstream oss;
    auto genOss = [&]( size_t x ) -> string
    {
        oss.str( "" );
        oss << setw( 8 ) << setfill( '0' ) << hex << x;
        return oss.str();
    };
    cout << bench( genOss ) << endl;
    auto genManual = []( size_t x ) -> string
    {
        using arr_t = array<char, 8>;
        using arr_it = typename arr_t::iterator;
        arr_t str;
        arr_it it = str.end();
        while( x && it > str.begin() )
            *--it = (x % 10) + '0',
            x /= 10;
        for( ; it > str.begin(); *--it = '0' );
        return string( str.begin(), str.end() );
    };
    cout << bench( genManual ) << endl;
}

The above program takes about 300 ns for each ostringstream-generated
string on my TR3990X under Windows with MSVC 2022, but only 17 ns for
my own generator; it's nearly the same with clang-cl 13 under Windows.
Under Ubuntu with g++ 11 on a 13-year-old Phenom it's about 110 and 36 ns.
clang++-12 under Linux (a different standard library than under Windows)
takes 125 vs. 43 ns on the same computer.
I wouldn't expect a stream to be as fast as my hand-optimized code,
but I think this could go even faster with ostringstream under Linux,
and under Windows for sure.

Öö Tiib

Jul 25, 2022, 6:51:15 AM
On Monday, 25 July 2022 at 13:20:09 UTC+3, Bonita Montero wrote:
> I tried to evaluate the performance of ostringstream with MSVC,
> clang++-12 under Linux and g++.
>
...

> oss << setw( 8 ) << setfill( '0' ) << hex << x;

...

> --it = (x % 10) + '0', x /= 10;

...

The output of those programs is likely different? One looks hex ... the other decimal.

A function hard-coded for a narrow case will likely always beat a widely,
dynamically configurable one (with all those setw, setfill and imbue).
For fairness, compare with std::to_chars(), which has narrowed down the
spec of configurability to make it easier to implement efficiently.

Wuns Haerst

Jul 25, 2022, 7:35:49 AM
On 25.07.2022 at 12:51, Öö Tiib wrote:
> On Monday, 25 July 2022 at 13:20:09 UTC+3, Bonita Montero wrote:
>> I tried to evaluate the performance of ostringstream with MSVC,
>> clang++-12 under Linux and g++.
>>
> ...
>
>> oss << setw( 8 ) << setfill( '0' ) << hex << x;
>
> ...
>
>> --it = (x % 10) + '0', x /= 10;
>
> ...
>
> The output of those programs is likely different? One looks hex ... the other decimal.

OOOOOOOOOOh, you're right, hex would be even faster.
And with decimal I won't get all digits in the 8 characters.

Bonita Montero

Jul 25, 2022, 7:41:44 AM
On 25.07.2022 at 12:51, Öö Tiib wrote:
It's obvious that this is faster, but I didn't expect
the streams to be so slow.

    auto genManual = []( size_t x ) -> string
    {
        using arr_t = array<char, 8>;
        using arr_it = typename arr_t::iterator;
        arr_t str;
        arr_it it = str.end();
        static char const digits[16] = { '0', '1', '2', '3', '4', '5', '6', '7',
                                         '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
        while( x && it > str.begin() )
            *--it = digits[x % 16],
            x /= 16;
        for( ; it > str.begin(); *--it = '0' );
        return string( str.begin(), str.end() );
    };


This is the right function, but it isn't faster
even though there are no slow divisions by 10.

Paavo Helde

Jul 25, 2022, 8:53:21 AM
On 25.07.2022 at 13:20, Bonita Montero wrote:
> I tried to evaluate the performance of ostringstream with MSVC,
> clang++-12 under Linux and g++.
[...]
> I wouldn't expect a stream to be as fast as my hand-optimized code,
> but I think this could go even faster with ostringstream under Linux,
> and under Windows for sure.

We have covered this before. C++ streams are slow because of

* lots of flexibility, i.e.
* lots of virtual function calls
* lots of dynamic allocations
* locale support
* they perform both formatting and file output

By using a stringstream you have avoided the file output part and
potentially also the locale support part (not sure about that, maybe on
Linux?). The virtual calls and memory allocations are still there. In
other words, you are using a trained carpenter with a chisel to prepare
firewood; of course it will be slow.

If you are concerned about performance, you should start with
std::to_chars() as others have already commented.

Bonita Montero

Jul 25, 2022, 9:11:31 AM
On 25.07.2022 at 14:53, Paavo Helde wrote:
> On 25.07.2022 at 13:20, Bonita Montero wrote:
>> I tried to evaluate the performance of ostringstream with MSVC,
>> clang++-12 under Linux and g++.
> [...]
>> I wouldn't expect a stream to be as fast as my hand-optimized code,
>> but I think this could go even faster with ostringstream under Linux,
>> and under Windows for sure.
>
> We have covered this before. C++ streams are slow because of
>
> * lots of flexibility, i.e.
>    * lots of virtual function calls
>    * lots of dynamic allocations
>    * locale support
> * perform both formatting and file output

Apart from the virtual function calls, this might hurt, but that's
still a matter of implementation. The formatting part wouldn't be
much different from the way I do it.

Frederick Virchanza Gotham

Jul 25, 2022, 5:23:33 PM
On Monday, July 25, 2022 at 2:11:31 PM UTC+1, Bonita Montero wrote:

> Apart from the virtual function calls, this might hurt, but that's
> still a matter of implementation. The formatting part wouldn't be
> much different from the way I do it.

You beat me to it. Implementing a virtual function call is just one or two dereferences of a pointer, which is less than 5 CPU instructions... we're talking in the nanoseconds range. I never hesitate to add virtual methods and virtual destructors.

Scott Lurndal

Jul 25, 2022, 5:50:12 PM
As the real-estate agents always say, locality, locality, locality.

It's not just the number of instructions executed, it's cache and TLB footprint
that matter, perhaps more.

Bonita Montero

Jul 25, 2022, 10:22:39 PM
On 25.07.2022 at 13:42, Bonita Montero wrote:

>     auto genManual = []( size_t x ) -> string
>     {
>         using arr_t = array<char, 8>;
>         using arr_it = typename arr_t::iterator;
>         arr_t str;
>         arr_it it = str.end();
>         static char const digits[16] = { '0', '1', '2', '3', '4', '5', '6', '7',
>                                          '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
>         while( x && it > str.begin() )
>             *--it = digits[x % 16],
>             x /= 16;
>         for( ; it > str.begin(); *--it = '0' );
>         return string( str.begin(), str.end() );
>     };

As you can see from my code, reverse iterators absolutely aren't
necessary. I never understood what they are good for, other than
causing confusion.

Malcolm McLean

Jul 26, 2022, 12:00:22 AM
However it's harder for the processor to do lookahead and branch prediction.
I don't know because it's ages since I've looked at a modern processor in
much detail, but the real cost could be much higher than the cost of the
extra instructions.

Bonita Montero

Jul 26, 2022, 12:48:41 AM
Think about what the alternative would be: switch case constructs.
It wouldn't be much more efficient either.

Öö Tiib

Jul 26, 2022, 2:17:40 AM
That all starts to matter to performance only when the count of objects in
control is large. If it is a sole object, like a stream, then the methods are
cached and the branches predicted correctly.

With a large object count, however, we are not talking about polymorphism as
such but about a polymorphic container. There we may choose something
that is straightforward for us to use but less easy for the compiler to optimize
and the processor to predict, like:

std::vector<std::unique_ptr<Base>> container;

Or something that is easier for the compiler, but for us means taking on a
dependency and learning the differences, like:

boost::base_collection<Base> container;

The effect must be worth that price in developer work. But it can
make a remarkable difference in both performance and storage
usage (with a large object count).

Juha Nieminen

Jul 26, 2022, 2:19:34 AM
A function being needlessly virtual might also hinder compiler
optimizations.

(There are some situations where a function call, which the compiler can't
optimize away, in the middle of heavy number-crunching code can severely
hinder compiler optimizations. Sometimes it can be surprising by how much.
As compilers are becoming better and better at vectorizing code, the
difference in speed may in some cases be even an order of magnitude,
as a non-optimizable function call may act as a severe barrier to
vectorization.)

Bonita Montero

Jul 26, 2022, 3:20:44 AM
TLB-thrashing is relevant mostly for data but almost never for code.

daniel...@gmail.com

Jul 26, 2022, 5:15:02 PM
On Monday, July 25, 2022 at 5:23:33 PM UTC-4, Frederick Virchanza Gotham wrote:

> Implementing a virtual function call is just one or two dereferences of a pointer, which is less than 5 CPU instructions... we're talking in the nanoseconds range. I never hesitate to add virtual methods and virtual destructors.

There are actually a number of virtual functions that get called for _every character_; see

https://en.cppreference.com/w/cpp/io/basic_streambuf

As the doctor said in Treasure Island, "one virtual function won’t kill you, but if you take one you’ll take another and another, and I stake my wig if you don’t break off short, you’ll die ..."

If you look at the source of a typical implementation, there are plenty of switches as well. I find it strange that something like this was designed in an era when hardware was slower and compilers less capable. Now that hardware is fast and compiler optimization is great, we're still left with something that is intrinsically slow.

Daniel

Manfred

Jul 26, 2022, 8:08:33 PM
The cost of /conditional/ branching is significant in modern architectures.
However, a virtual function call is not that case: it's a deterministic
indirection that can be correctly followed by the processor (it's not
really a prediction, since the destination is known in advance).

As Scott says, one relevant factor is locality.
Another relevant factor, when it comes to optimization, is that the
compiler cannot inline virtual functions (unless the concrete type is
actually known at the point of call).

Öö Tiib

Jul 26, 2022, 10:49:14 PM
That is the case with boost::base_collection<Base>, which keeps the
elements in sub-containers of concrete type. So, for example, when an element
is accessed through a local_iterator<Derived>, the compiler knows that the type
is Derived, even though the container is polymorphic.



Frederick Virchanza Gotham

Jul 27, 2022, 6:54:08 AM
On Wednesday, July 27, 2022 at 1:08:33 AM UTC+1, Manfred wrote:

> Another relevant factor, when it comes to optimization, is that the
> compiler cannot inline virtual functions (unless the concrete type is
> actually known at the point of call).


If the concrete type is known at the point of call, then there's no overhead.

If the concrete type is **not** known at the point of call, for example you're using a pointer to the Base class type, then the virtualness is actually needed, and it has been implemented the best way possible.

Actually I'm really convincing myself here that I should just mark every method virtual.

Paavo Helde

Jul 27, 2022, 8:01:17 AM
On 27.07.2022 at 13:54, Frederick Virchanza Gotham wrote:
> On Wednesday, July 27, 2022 at 1:08:33 AM UTC+1, Manfred wrote:
>
>> Another relevant factor, when it comes to optimization, is that the
>> compiler cannot inline virtual functions (unless the concrete type is
>> actually known at the point of call).
>
>
> If the concrete type is known at the point of call, then there's no overhead.

Only if all the code is visible to the optimizer, and the code is simple
enough for the optimizer, and the optimizer actually decides to optimize
it away.

With this iostreams example, none of that seems to happen. Even clearing
the stringstream buffer with oss.str("") takes more time than the
whole of Bonita's "manual" branch.

>
> If the concrete type is **not** known at the point of call, for example you're using a pointer to the Base class type, then the virtualness is actually needed, and it has been implemented the best way possible.
>
> Actually I'm really convincing myself here that I should just mark every method virtual.

That language has another name.

Scott Lurndal

Jul 27, 2022, 9:57:48 AM
Manfred <non...@add.invalid> writes:
>On 7/26/2022 6:00 AM, Malcolm McLean wrote:
>> On Monday, 25 July 2022 at 22:23:33 UTC+1, Frederick Virchanza Gotham wrote:
>>> On Monday, July 25, 2022 at 2:11:31 PM UTC+1, Bonita Montero wrote:
>>>
>>>> Apart from the virtual function calls, this might hurt, but that's
>>>> still a matter of implementation. The formatting part wouldn't be
>>>> much different from the way I do it.
>>> You beat me to it. Implementing a virtual function call is just one or two
>>> dereferences of a pointer, which is less than 5 CPU instructions... we're
>>> talking in the nanoseconds range. I never hesitate to add virtual methods
>>> and virtual destructors.
>>>
>> However it's harder for the processor to do lookahead and branch prediction.
>> I don't know because it's ages since I've looked at a modern processor in
>> much detail, but the real cost could be much higher than the cost of the
>> extra instructions.
>
>The cost of /conditional/ branching is significant in modern architectures.

Not generally. Branch history buffers, prediction hardware,
and other facilities in the hardware interact with the speculation
hardware to speculate down the likely path.

They work pretty well in general, albeit speculation has its own
shortcomings vis-a-vis security. Performance or security, but not
necessarily both.