
Dismal ostream performance


David Barrett-Lennard

Nov 18, 2010, 4:34:19 PM

On my target platform the performance of std::ostream is so poor I've
been compelled to roll my own simplified version. I can often achieve
an order of magnitude better performance. I've been using the
standard library provided by Microsoft. I have no information about
other implementations.

The measurements below involve timing the writing of the same value
one million times to an ostream like this:

for (int i=0 ; i < 1000000 ; ++i)
{
    os << value;
}

The rate at which values are written is expressed in MHz.

Write bool appearing as '1' in the output:
Microsoft: 1.4 MHz
Mine: 190 MHz

Write char:
Microsoft: 10 MHz
Mine: 260 MHz

Write C string with 1 character:
Microsoft: 8.0 MHz
Mine: 65 MHz

Write C string with 10 characters:
Microsoft: 7.3 MHz
Mine: 33 MHz

Write 32 bit signed int in decimal appearing as '10':
Microsoft: 1.4 MHz
Mine: 33 MHz

Write 32 bit unsigned int in decimal appearing as '1234567890':
Microsoft: 1.0 MHz
Mine: 16 MHz

Write 32 bit unsigned int in hex appearing as 'abcd1234':
Microsoft: .98 MHz
Mine: 20 MHz

Write unformatted block of 128 bytes:
Microsoft: 4.0 MHz
Mine: 8.3 MHz

Machine: Dual core E8400 3GHz
OS: Windows XP 32 bit professional
Compiler: MS Visual Studio 2008
Build: 32 bit release with /O2, _SECURE_SCL=0

QueryPerformanceCounter is used to measure the elapsed time. The
minimum over 100 runs is calculated. I've found this makes the
reported results repeatable to about 2 significant figures.
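For reference, the timing harness looks roughly like the sketch below
(illustrative only, not the actual benchmark code; runOnce stands for
the loop being measured and the constants are the ones described above):

#include <windows.h>
#include <algorithm>

// Time one run of the million-write loop, in seconds.
double TimeOnce(void (*runOnce)())
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    runOnce();
    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}

// Take the minimum over many runs and report writes per second in MHz.
double BestRateMHz(void (*runOnce)(), int runs = 100, double writes = 1e6)
{
    double best = 1e300;
    for (int i = 0; i < runs; ++i)
        best = std::min(best, TimeOnce(runOnce));
    return writes / best / 1e6;
}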

I have no comparisons for floats because I haven't implemented that
functionality.

For a fair comparison, in all cases the output is written to memory
using a linked list of 16 kByte buffers. This can be quite a lot
faster than using an ostringstream. The Microsoft ostream was able to
write about 500 MByte/s using unformatted output for 1 million x 128
Byte blocks. Although I regard this as unsatisfactory (it should
exceed 1GB/sec on the hardware) it shows that the underlying stream
buffer is not the bottleneck for writing formatted bools, chars and
ints where the implementation provided by Microsoft is extremely poor.
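For illustration only, a memory sink of roughly that shape can be
expressed as a custom streambuf along the lines of the sketch below
(this is not the code used for these measurements):

#include <cstddef>
#include <list>
#include <streambuf>
#include <vector>

class chain_buf : public std::streambuf
{
    static const std::size_t kBlock = 16 * 1024;   // 16 kByte blocks
    std::list< std::vector<char> > blocks_;

    void new_block()
    {
        blocks_.push_back(std::vector<char>(kBlock));
        char* p = &blocks_.back()[0];
        setp(p, p + kBlock);                       // fresh put area
    }

public:
    chain_buf() { new_block(); }

protected:
    int_type overflow(int_type ch)                 // put area is full
    {
        new_block();
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            sputc(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
};

An ostream can then be attached to an instance of it in the usual way,
e.g. chain_buf buf; std::ostream os(&buf);.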

Interestingly sputc() on the underlying streambuf goes around 160 MHz
which is almost tolerable, yet put() on the ostream only achieves 10
MHz. Does that provide anyone with clues about the problem?

I intentionally have only implemented a subset of the functionality of
an ostream. There are many aspects to the standard library streambuf
and ostream which I regard as "bottom heavy" or "more is less". I
have ignored the following:

* Tied streams so that I/O operations on one cause flushes on
another.

* Support for flushing after every output operation (and
corresponding manipulators unitbuf, nounitbuf)

* Error states good, eof, fail, bad, and for example needing a try-catch
on every output operation to set the bad bit and rethrow.

* The concept of a sentry to perform a whole lot of busy work before
every output operation.

* Exception masks

* Seeking

My ostream is otherwise modelled on the standard library, e.g.
allowing for code like this:

os << hex << uppercase << right << setw(10) << x
<< dec << showpos << 5
<< endl;

I assume the underlying buffered output stream throws an exception if
there is an error.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Martin Bonner

Nov 19, 2010, 3:12:58 PM

On Nov 18, 9:34 pm, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

> On my target platform the performance of std::ostream is so poor I've
> been compelled to roll my own simplified version. I can often achieve
> an order of magnitude better performance. I've been using the
> standard library provided by Microsoft. I have no information about
> other implementations.

Snip

> I intentionally have only implemented a subset of the functionality of
> an ostream. There are many aspects to the standard library streambuf
> and ostream which I regard as "bottom heavy" or "more is less". I
> have ignored the following:
>
> * Tied streams so that I/O operations on one cause flushes on
> another.
>
> * Support for flushing after every output operation (and
> corresponding manipulators unitbuf, nounitbuf)
>
> * Error states good,eof,fail,bad, and for example needing a try-catch
> on every output operation to set the bad bit and rethrow.
>
> * The concept of a sentry to perform a whole lot of busy work before
> every output operation.
>
> * Exception masks
>
> * Seeking

You appear to be saying "if the specification of ostream is
dramatically simplified, I can implement what's left much more efficiently
than Plauger can implement the full-fat spec".

That's not *that* much of a surprise (although it is sad because it
means you are paying for features you aren't using).

What are you trying to achieve with this post?
- someone to point out the gross inefficiency in Dinkumware's
implementation? (I wouldn't hold out much hope for that).
- someone to point out the gross inefficiency in Microsoft's compiler?
(slightly more plausible, but not very likely)
- to start a discussion on what could be dropped from the spec of
ostream? (I doubt that will get far, and I think the next standard of
C++ is feature-complete, so the change couldn't be standardized for >10
years).
- a pointer to what you should tweak in ostream to obtain (most of)
the speed up? (that could happen - but I can't help)

David Barrett-Lennard

Nov 20, 2010, 8:18:28 PM

On Nov 20, 4:12 am, Martin Bonner <martinfro...@yahoo.co.uk> wrote:

> You appear to be saying "if the specification of ostream is
> dramatically simplified, I can implement what's left much efficiently
> than Plauger can implement the full-fat spec".

> That's not *that* much of a surprise (although it is sad because it
> means you are paying for features you aren't using).
>
> What are you trying to achieve with this post?
> - someone to point out the gross inefficiency in Dinkumware's
> implementation? (I wouldn't hold out much hope for that).
> - someone to point out the gross inefficiency in Microsoft's compiler?
> (slightly more plausible, but not very likely)
> - to start a discussion on what could be dropped from the spec of
> ostream? (I doubt that will get far, and I think the next standard of C
> ++ is feature-complete, so the change couldn't be standardized for >10
> years).
> - a pointer to what you should tweak in ostream to obtain (most of)
> the speed up? (that could happen - but I can't help)

The purpose of the post is to get it fixed because the performance is
unacceptable.

I don't have any idea of what the problem is. As you suggest it might
be inevitable because of the spec, or it might be some issue either
with Dinkumware's implementation or the Microsoft compiler.

Does anyone have an idea of how to find out?

If the poor performance is inevitable with the spec then I suggest a
superior iostream library be developed for boost, it becomes a de
facto standard over time, and eventually is incorporated into the
standard library and the existing iostream classes are deprecated.

Mia B

Nov 21, 2010, 3:39:50 AM
On Nov 21, 2:18 am, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

> The purpose of the post is to get it fixed because the performance is
> unacceptable.

Well, it's unacceptable for __you__ ;-). I never cared about
performance of any streams implementation, but I don't remember people
complaining about MS's implementation being particularly slow compared
to other to-the-spec implementations. You're slightly comparing apples
and oranges, as you never know how big an impact "I don't need this/that
feature" can have on performance.

I don't much like your idea of having another streams library from e.g.
boost. A streams library is a big subject, and decisions on features
versus performance are difficult to make in a way that satisfies
everyone. My voice there would be "I hope the boost people have more
important things to do ;-)".

About the performance of sputc versus put: looking at the body of
ostream::put in the VC 2008 implementation, I can't see why it would be 16
times slower, but I guess you should focus on that. Take a look at the
generated code, and/or profile it.

And finally, you should try /Ox, not /O2.

Goran.

David Barrett-Lennard

Nov 21, 2010, 5:39:50 PM

On Nov 21, 4:39 pm, Mia B <berkopu...@gmail.com> wrote:
> On Nov 21, 2:18 am, David Barrett-Lennard <davi...@iinet.net.au>
> wrote:
>
>> The purpose of the post is to get it fixed because the performance is
>> unacceptable.
>
> Well, it's unacceptable for __you__ ;-). I never cared about
> performance of any streams implementation, but I don't remember people
> complaining about MS's implementation being particularly slow compared
> to other to-the-spec implementations. You're slightly comparing apples
> and oranges, as you never know how big impact on performance can "I
> don't need this/that feature" have.

I agree of course that some applications will be seriously handicapped
and others won’t, and I’m only speaking for the former.

There are plenty of applications that output significant amounts of
text (particularly given the trend towards textual versus binary
serialised data representations), and an order of magnitude difference
in performance may be very noticeable to end users.

It is well known that one of the primary design goals for C++ is to
achieve performance comparable to hand coded assembly. I suggest this
goal likewise extends to the design and implementation of the standard
library.

A quick google reveals many programmers complaining about iostream
performance. Here are some quotes I found googling "C++ iostream
slow":

"The stdio version runs in around 1second, the iostream version takes
8seconds. Is this just down to a poor iostream implementation?"

"As a general rule, naive code using C library i/o will be faster than
equally naive code using iostreams"

"A colleague of mine just told me that C++ iostream is typically an
order of magnitude slower than printf. His example shows that printing
out a string like “%s\t%d\tabc\t%s\t%s\n” with C++ iostream is 3 times
slower than printf in Perl! This observation agrees my experience"

"Please don't use stream i/o methods in C++ (ie. cin, cout) because
they tend to be extremely slow."

"For the Standard IO (in C++) we have two posibilities: - to use
<iostream> (cout, cin, etc) or old <cstdio> (printf, scanf).
<iostream> is slower than <cstdio>. is there any faster way to read
and write data in the Standard IO?"

"fprintf is DEFINITELY much faster than ofstream!"

"For my needs printf and sprintf is much more convenient and faster."

"I also have eschewed the use of the iostreams since about 1997,
because ... they're hideously slow ..."

"Streams are generally quite safe. Under some circumstances, they can
be slow and/or clumsy. Slow, mostly stems from the fact that they
impose a few extra layers between your code and the OS, and under the
wrong circumstance those layers can add overhead."

"...The same is true with istream and ostream, slow MANIFOLD compared
to fgets and fputs."

"istream and ostream are slow as heck, yes. I've done several
benchmarks on multiple compilers, and they are up to two orders of
magnitude worse in some cases. They can also bloat executables (if you
link statically), and I don't like the syntax, especially not having
to save/restore the state all the time."

"And yes, avoid the iostreams implementation at all costs."


> I don't like your idea to have another streams library from e.g. boost
> much. Streams library is a big subject, and decisions taken on
> features versus performance are difficult to satisfy all people. My
> voice there would be "I hope boost people more important things to
> do ;-)".

You’re welcome to your opinion. IMO a decent iostream library is
important. To some people C++ loses credibility without it.


> About performance of sputc versus put: by looking at the body of
> ostream::put in vc 2008 implementation, I can't see why would it be 16
> times slower, but I guess you should focus on that. Take a look at
> generated code, and/or profile it.

I only know how to profile using VS2008, and AFAIK that won't do it.


> And finally, you should try /Ox, not /O2.

Thanks I’ll give it a try.

Vaclav Haisman

Nov 21, 2010, 5:49:43 PM

David Barrett-Lennard wrote, On 21.11.2010 2:18:

>
> On Nov 20, 4:12 am, Martin Bonner wrote:
>
>> You appear to be saying "if the specification of ostream is dramatically
>> simplified, I can implement what's left much efficiently than Plauger
>> can implement the full-fat spec".
>
>> That's not *that* much of a surprise (although it is sad because it
>> means you are paying for features you aren't using).
>>
>> What are you trying to achieve with this post? - someone to point out
>> the gross inefficiency in Dinkumware's implementation? (I wouldn't hold
>> out much hope for that). - someone to point out the gross inefficiency
>> in Microsoft's compiler? (slightly more plausible, but not very
>> likely) - to start a discussion on what could be dropped from the spec
>> of ostream? (I doubt that will get far, and I think the next standard of
>> C++ is feature-complete, so the change couldn't be standardized for
>> >10 years). - a pointer to what you should tweak in ostream to obtain
>> (most of) the speed up? (that could happen - but I can't help)
>
> The purpose of the post is to get it fixed because the performance is
> unacceptable.
It is unacceptable for you. It is acceptable for most people.

>
> I don't have any idea of what the problem is. As you suggest it might be
> inevitable because of the spec, or it might be some issue either with
> Dinkumware's implementation or the Microsoft compiler.

I do not think that MS's implementation of the C++ IO streams is orders of
magnitude worse than any other conforming implementation.

>
> Does anyone have an idea of how to find out?

Code up your own _conforming_ implementation of C++ IO streams and benchmark
it against MS's. I do not think it will be orders of magnitude faster as you
seem to think.

>
> If the poor performance is inevitable with the spec then I suggest a
> superior iostream library be developed for boost, it becomes a de facto
> standard over time, and eventually is incorporated into the standard
> library and the existing iostream classes are deprecated.

I do not think this is going to happen. The problem is that your
implementation is simplistic (I assume). Does your implementation have
support for I18N, extendible locale facets, separation of the formatting and
the storage? The C++ IO streams do support all of it. But with abstraction
comes a performance penalty.

The standard standardises things that are either already widely used or that
would be beneficial for many. Simplistic performance oriented streams would
not be widely useful and as such are unlikely to ever make it into the
standard. However such simple and performance oriented streams could live and
be useful as a stand alone library or part of e.g. Boost.

Not that C++ IO streams are perfect, far from it. There are other things
besides performance that can be improved in the C++ IO streams. One is the
terrible, terrible names of streambuf's functions. Another thing it would
be nice to have is better support for batch IO that e.g. std::getline()
could use to gain more performance; with the streams as they are, it has
to read char by char. Yet another is that the member getline() functions
are bad/almost useless; they should be either removed or fixed.

--
VH

Ebenezer

Nov 21, 2010, 5:45:36 PM

On Nov 21, 2:39 am, Mia B <berkopu...@gmail.com> wrote:
> On Nov 21, 2:18 am, David Barrett-Lennard <davi...@iinet.net.au>
> wrote:
>
>> The purpose of the post is to get it fixed because the performance is
>> unacceptable.
>
> Well, it's unacceptable for __you__ ;-). I never cared about
> performance of any streams implementation, but I don't remember people
> complaining about MS's implementation being particularly slow compared
> to other to-the-spec implementations. You're slightly comparing apples
> and oranges, as you never know how big impact on performance can "I
> don't need this/that feature" have.

There are some tests here --
http://webEbenezer.net/comparison.html --
that indicate the Microsoft implementation may be
slower than the GNU implementation.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net

Martin B.

Nov 21, 2010, 5:47:47 PM

On 18.11.2010 22:34, David Barrett-Lennard wrote:
>
> On my target platform the performance of std::ostream is so poor I've
> been compelled to roll my own simplified version. I can often achieve
> an order of magnitude better performance. I've been using the
> standard library provided by Microsoft. I have no information about
> other implementations.
>
> The measurements below involve timing the writing of the same value
> one million times to an ostream like this:
>
> for (int i=0 ; i < 1000000 ; ++i)
> {
>     os << value;
> }
>

Full source code would have me just drop it into my compiler and check with my profiler what's going on. Significantly snipped-down code just leaves me wondering :-)

> The rate at which values are written is expressed in MHz.
>

> [...]


>
> Machine: Dual core E8400 3GHz
> OS: Windows XP 32 bit professional
> Compiler: MS Visual Studio 2008
> Build: 32 bit release with /O2, _SECURE_SCL=0
>
> QueryPerformanceCounter is used to measure the elapsed time. The
> minimum over 100 runs is calculated. I've found this makes the
> reported results repeatable to about 2 significant figures.

> [...]


> For a fair comparison, in all cases the output is written to memory
> using a linked list of 16 kByte buffers. This can be quite a lot
> faster that using an ostringstream. The Microsoft ostream was able to
> write about 500 MByte/s using unformatted output for 1 million x 128
> Byte blocks. Although I regard this as unsatisfactory (it should
> exceed 1GB/sec on the hardware) it shows that the underlying stream
> buffer is not the bottleneck for writing formatted bools, chars and
> ints where the implementation provided by Microsoft is extremely poor.
>

So this tells us that iostreams performance is crappy? This doesn't exactly come as a big surprise to me -- I would never touch any iostreams related tools with a 10 foot pole in performance critical code. (I can only speak for the Dinkumware implementation as used on Visual Studio.)

Two points that you might want to consider:
* Use a profiler to find out where the bottleneck is. (Pretty much any (commercial) tool (with a trial version) should get you there in a few hours, shouldn't it?)
* Does the performance hit actually matter in the context where you are using it? (I would assume it does, but I always find it helps to focus on this question more than once.)

If you care about something better, you might want to try and contribute to http://www.fastformat.org/ -- it ain't perfect but it is the most promising thing out there I know of.

cheers,
Martin

Miles Bader

Nov 22, 2010, 1:08:27 AM

David Barrett-Lennard <dav...@iinet.net.au> writes:
> It is well known that one of the primary design goals for C++ is to
> achieve performance comparable to hand coded assembly. I suggest this
> goal likewise extends to the design and implementation of the standard
> library.

Sure, but since it's an _implementation issue_, I'd guess the solution
is to complain to the implementors. Or are you saying there's some
fundamental problem with iostreams that make them impossible to
implement efficiently (I'd disagree; see below)?

My experience is that with gcc on linux, iostreams are maybe 5% slower
than stdio for typical mixed string/number output (read "almost
identical, just a tad slower"). That's close enough for me.

Two caveats:

* If you're writing to std::cout, you can call
std::ios_base::sync_with_stdio(false) to gain a bit of speed, at the
expense of losing synchronization with stdio (an issue maybe if you
use libraries that write to stdout when you're using std::cout).

Using std::cout but _not_ turning the synchronization off results in
about a 20% penalty or so for me.

* std::endl _flushes the buffer_, which may dramatically increase
the number of system calls if you are tending to write short
lines.

In the test case I just tried, it was doing writes of about 50
bytes, compared to 4096 byte writes with printf; the result was
that using std::endl made the test program about 20% slower than
when using "\n".

So use "\n" rather than std::endl if you want to avoid frequent
buffer flushing.
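A minimal sketch putting both caveats together (illustrative only):

#include <iostream>

int main()
{
    std::ios_base::sync_with_stdio(false);  // caveat 1: drop stdio syncing
    for (int i = 0; i < 1000000; ++i)
        std::cout << i << '\n';             // caveat 2: '\n', not std::endl
    return 0;                               // remaining output is flushed at exit
}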

-miles

--
The automobile has not merely taken over the street, it has dissolved the
living tissue of the city. Its appetite for space is absolutely insatiable;
moving and parked, it devours urban land, leaving the buildings as mere
islands of habitable space in a sea of dangerous and ugly traffic.
[James Marston Fitch, New York Times, 1 May 1960]

David Barrett-Lennard

Nov 22, 2010, 1:11:09 AM

On Nov 22, 6:49 am, Vaclav Haisman <v.hais...@sh.cvut.cz> wrote:
> David Barrett-Lennard wrote, On 21.11.2010 2:18:


>> Does anyone have an idea of how to find out?
>
> Code up your own _conforming_ implementation of C++ IO streams and benchmark
> it against MS's. I do not think it will be orders of magnitude faster as you
> seem to think.

as I seem to think??

No I think it's quite possible that the poor performance is inevitable
in a conforming implementation. It might be difficult to prove
though.


>> If the poor performance is inevitable with the spec then I suggest a
>> superior iostream library be developed for boost, it becomes a de facto
>> standard over time, and eventually is incorporated into the standard
>> library and the existing iostream classes are deprecated.
>
> I do not think this is going to happen. The problem is that your
> implementation is simplistic (I assume). Does your implementation have
> support for I18N, extendible locale facets, separation of the formatting and
> the storage? The C++ IO streams do support all of it. But with abstraction
> comes a performance penalty.

Depending on the abstraction that penalty can be made quite small. I
am convinced that in this case one can have one's cake and eat it too.

I already achieve separation of the formatting and storage through a
pure abstract base class for an output octet stream:

struct IOutputOctetStream
{
    virtual ~IOutputOctetStream() {}
    virtual void Write(const OCTET* buffer, size_t count) = 0;
    virtual void Flush() = 0;
};

This could just as easily be templatised on an element type T rather
than assume it's an octet.

My implementation of a buffered octet stream provides non-virtual
methods to write individual octets or arrays of octets. It stores a
pointer to an underlying IOutputOctetStream and only calls the above
virtual Write() method when the buffers are full or explicitly
flushed. I find that a buffer of only a few kilobytes is sufficient
to amortise away the overhead of the virtual calls.
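A rough sketch of that buffering scheme (my reading of the description
above, building on the IOutputOctetStream and OCTET declarations; the
names and the buffer size are invented, not the actual class):

#include <cstddef>
#include <cstring>

class BufferedOutputStream
{
public:
    explicit BufferedOutputStream(IOutputOctetStream* sink)
        : sink_(sink), pos_(0) {}
    ~BufferedOutputStream() { Flush(); }

    void Put(OCTET b)                          // non-virtual; usually one store
    {
        if (pos_ == kSize) Spill();
        buf_[pos_++] = b;
    }

    void Write(const OCTET* p, std::size_t n)  // non-virtual array write
    {
        while (n > 0)
        {
            if (pos_ == kSize) Spill();
            std::size_t room = kSize - pos_;
            std::size_t chunk = n < room ? n : room;
            std::memcpy(buf_ + pos_, p, chunk * sizeof(OCTET));
            pos_ += chunk; p += chunk; n -= chunk;
        }
    }

    void Flush() { Spill(); sink_->Flush(); }

private:
    void Spill()                               // the only virtual call site
    {
        if (pos_ != 0) { sink_->Write(buf_, pos_); pos_ = 0; }
    }

    static const std::size_t kSize = 4096;     // "a few kilobytes"
    IOutputOctetStream* sink_;
    OCTET buf_[kSize];
    std::size_t pos_;
};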

This is getting off-topic, but I will add that unlike the standard
library streambuf my implementation:
- Has better cohesion in the sense that it doesn't support both
reading and writing which IMO should be orthogonal.
- Cleanly separates output buffering from the pure abstraction of an
output stream, which are distinct concepts.
- Fully hides the buffering from clients so the interface is simple
and elegant. E.g. there is no counterpart to pubsetbuf.
- Follows the open/closed principle more directly because the
implementation of buffering is closed. My buffered stream class has
no virtual methods and there is never a need for clients to subclass
it.


I have investigated what could be done to support I18N. I think
polymorphism using pure abstract base classes is appropriate. For
example, the following approach allows for complete flexibility in how
an int is formatted:

struct IIntFormatter
{
    virtual void Write(my_ostream& os, int x) = 0;
};

class my_ostream
{
public:
    void Write(int x) { intFormatter_->Write(*this,x); }
    ...

private:
    IIntFormatter* intFormatter_;
    ...
};

inline my_ostream& operator<<(my_ostream& os, int x)
{
    os.Write(x);
    return os;
}

I have implemented this to measure the overhead of the indirection
(i.e. virtual call through a pointer). For writing 1 million integers
in decimal that appear as '10' in the output the result is:

Microsoft: 1.4 MHz
Mine (without indirection): 33 MHz
Mine (with indirection): 32 MHz

This is not unexpected - virtual calls aren't very expensive.

In the case of the indirection, the following:

int value = 10;

for (int i=0 ; i < 1000000 ; ++i)
{
    os << value;
}

was compiled as:

00402BFF mov esi,0F4240h
00402C04 mov ecx,dword ptr [esp+160h]
00402C0B mov eax,dword ptr [ecx]
00402C0D mov eax,dword ptr [eax]
00402C0F push 0Ah
00402C11 lea edx,[esp+150h]
00402C18 push edx
00402C19 call eax
00402C1B sub esi,1
00402C1E jne 00402C04

There is no doubt that a virtual method call was taken to write each
integer value.

Evidently supporting I18N with complete generality can be achieved
with minimal overhead.



dietma...@gmail.com

Nov 22, 2010, 8:58:10 AM

On Nov 19, 8:12 pm, Martin Bonner <martinfro...@yahoo.co.uk> wrote:
> You appear to be saying "if the specification of ostream is
> dramatically simplified, I can implement what's left much efficiently
> than Plauger can implement the full-fat spec".

A long time ago I said that I could implement a full-fledged version
of IOStreams according to the C++ spec which is a lot faster than e.g.
the Dinkumware library - and it was, at least back then. It has
somewhat fallen out of maintenance because there doesn't seem to be
much interest, and there is also one major caveat: I haven't
implemented basic_filebuf with all the conversions, seeking, etc.
(there is only a rather basic version which works for typical use
cases, though). The implementation is available from
<http://www.dietmar-kuehl.de/cxxrt/>. I guess there are a number of
things which need to be fixed to get it even through a recent
compiler: I haven't touched it in years. I know that a number of
library implementers have picked up on some of the ideas I used
(e.g. preinstantiation of classes for the two default character
types).

BTW, the list of simplifications given above doesn't affect the
speed! Yes, for some of the stuff it takes a bit of extra thought to
come up with an approach avoiding any overhead, but it is all doable:
at least when I last profiled my IOStreams against glibc's <stdio.h>
it had roughly the same performance (in some areas it was slightly
faster, in others it was slightly slower).

I think the key issue with the poor performance of IOStreams is
simply money: nobody seems to be willing to put their money where
their mouth is! You can demand a faster IOStream implementation but
nothing will happen unless you either do it yourself (my approach) or
you pay for it (in some way or another). Also, the Dinkumware
implementation you can buy from their site may be better than the one
shipping with VC++ (the version shipping with MSVC++ certainly has to
be compatible with earlier versions and there are certain aspects of
the object layout you'd need to change to get better performance),
but I don't know if this is an area P.J. has worked on ... and, also,
in case you find that my implementation of IOStreams is still too
slow: diligent use of algorithms and a few tweaks to use specialized
versions for the known traits made it substantially faster still.
However, this version is not on the net (nor is it as complete as I
would want it to be).

Seungbeom Kim

Nov 22, 2010, 10:51:32 AM

On 2010-11-21 14:49, Vaclav Haisman wrote:
>
> The problem is that your
> implementation is simplistic (I assume). Does your implementation have
> support for I18N, extendible locale facets, separation of the formatting
> and the storage? The C++ IO streams do support all of it. But with
> abstraction comes a performance penalty.

It should not, at least ideally. "What you don't use, you don't pay for"
(zero-overhead rule) has always been an aim of C++ design, and it is
certainly undesirable if the performance penalty is significant, no
matter how many features that are not used bring that penalty.

Things may of course be different for different people, but I admit that
I have never used locales with iostreams (explicitly). If I could choose
an I/O library without explicit locale support (i.e. that always works
as if under the "C" locale) but that could be (say) 20% faster, I would
go for it.

> The standard standardises things that are either already widely used or
> that would be beneficial for many. Simplistic performance oriented streams
> would not be widely useful and as such are unlikely to ever make it into
> the standard. However such simple and performance oriented streams could
> live and be useful as a stand alone library or part of e.g. Boost.

What keeps users from being able to choose from a few options, e.g.
a slow but full-fledged one and a lightweight but simple one?

--
Seungbeom Kim

Martin B.

Nov 22, 2010, 10:53:41 AM

On 22.11.2010 07:11, David Barrett-Lennard wrote:
>
> On Nov 22, 6:49 am, Vaclav Haisman<v.hais...@sh.cvut.cz> wrote:
>> David Barrett-Lennard wrote, On 21.11.2010 2:18:
>
>
>>> Does anyone have an idea of how to find out?
>>
>> Code up your own _conforming_ implementation of C++ IO streams and
>> benchmark it against MS's. I do not think it will be orders of
>> magnitude faster as you seem to think.
>
> as I seem to think??
>
> No I think it's quite possible that the poor performance is inevitable
> in a conforming implementation. It might be difficult to prove
> though.
>
>[...]

> I already achieve separation of the formatting and storage through a
> pure abstract base class for an output octet stream:
>
> struct IOutputOctetStream
> {
> virtual ~IOutputOctetStream() {}
> virtual void Write(const OCTET* buffer, size_t count) = 0;
> virtual void Flush() = 0;
> };
>
> This could just as easily be templatised on an element type T rather
> than assume it's an octet.
>
> My implementation [...]

> I have investigated what could be done to support I18N. I think
> polymorphism using pure abstract base classes is appropriate. For
> [...]

Evidently you think that iostreams is pretty broken, as things stand.
I have no clue if this is true in general, but I know of one usable,
decent-looking, type-safe alternative to iostreams, and this is called
FastFormat (http://www.fastformat.org/). I have tried it once; it seemed
OK, but IMHO it still needs some polishing. If you care for something
better than iostreams, have a look at it.

cheers,
Martin

David Barrett-Lennard

Nov 22, 2010, 11:02:52 AM

On Nov 22, 2:08 pm, Miles Bader <mi...@gnu.org> wrote:

> David Barrett-Lennard <davi...@iinet.net.au> writes:
> > It is well known that one of the primary design goals for C++ is to
> > achieve performance comparable to hand coded assembly. I suggest this
> > goal likewise extends to the design and implementation of the standard
> > library.
>
> Sure, but since it's an _implementation issue_, I'd guess the solution
> is to complain to the implementors. Or are you saying there's some
> fundamental problem with iostreams that make them impossible to
> implement efficiently (I'd disagree; see below)?
>
> My experience is that with gcc on linux, iostreams are maybe 5% slower
> than stdio for typical mixed string/number output (read "almost
> identical, just a tad slower"). That's close enough for me.


Can you be more specific please? For all I know you're comparing cout
and printf to a console window. My machine can only write 80k
characters/sec to a console.

In any case my ostream implementation achieves much better performance
than cstdio on my E8400 machine under Windows. For a fair comparison
I can associate my ostream implementation with a file in a similar
manner to an ofstream. Let 'os' refer to such an ostream, and 'fp' a
FILE* returned by fopen. The following are the rates I get when
repeating the given code in a loop 1 million times. To be fair I
include a final flush on my buffered stream in the elapsed time
measurement. In all cases opening/closing the file isn't included in
the time measurement.

fprintf(fp,"%c",'x') 7.0 MHz
fputc('x',fp) 15 MHz
os << 'x' 180 MHz

fwrite(buffer,1,10,fp) 10 MHz
os.write(buffer,10) 31 MHz

fprintf(fp,"%s","x") 6.9 MHz
os << "x" 59 MHz

fprintf(fp,"%s","0123456789") 4.6 MHz
os << "0123456789" 23 MHz

fprintf(fp,"%d",10) 5.4 MHz
os << 10 27 MHz

fprintf(fp,"%d",1234567890) 2.6 MHz
os << 1234567890 11 MHz

fprintf(fp,"%p",(void*)0) 3.0 MHz
os << (void*)0              23 MHz


Note that output to a small enough file can appear to well exceed the
write rate of the hard-disk due to file buffering by the Windows OS.
On my machine fwrite using kilobyte buffers into a 10 MB file supports
a write rate of about 400 MByte/sec. This is essentially measuring
the write rate into the Windows OS file cache.



Joshua Maurice

Nov 22, 2010, 4:58:58 PM

On Nov 21, 2:49 pm, Vaclav Haisman <v.hais...@sh.cvut.cz> wrote:
> I do not think this is going to happen. The problem is that your
> implementation is simplistic (I assume). Does your implementation have
> support for I18N, extendible locale facets, separation of the formatting and
> the storage? The C++ IO streams do support all of it. But with abstraction
> comes a performance penalty.

This is incorrect. The power of C++ templates, for example, is the
power of abstraction and genericity without the speed overhead of
dynamic dispatch at runtime. Another example is inline functions -
or, more specifically, allowing compilers to expand functions inline.
Sometimes abstraction does cost more, and sometimes it doesn't. That's
the beauty of the C++ implementation of the STL, as opposed to, say,
the Java containers standard library.

Also, just because it's a pet peeve of mine, I will mention that the
C++ iostream I18N support is a joke. Maybe you can contrive some
examples which use it, but it's not portable in practice, and for any
real Unicode handling, it's also near useless. In the end, you
basically have to use ICU with a lot of hand-rolling.



Goran

Nov 23, 2010, 2:19:15 AM

On Nov 21, 11:39 pm, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

>> And finally, you should try /Ox, not /O2.
>
> Thanks I’ll give it a try.

One other thing: what option for exception handling did you use? I
don't know the name of the option offhand, but for sputc versus put
performance with the MS compiler you really should turn off "combined"
SEH/C++ exception handling (it should be off by default if the
compiler is recent, but...)

Goran.

Miles Bader

Nov 23, 2010, 2:20:33 AM

David Barrett-Lennard <dav...@iinet.net.au> writes:
>> My experience is that with gcc on linux, iostreams are maybe 5%
>> slower than stdio for typical mixed string/number output (read
>> "almost identical, just a tad slower"). That's close enough for
>> me.
>
> Can you be more specific please? For all I know you're comparing
> cout and printf to a console window. My machine can only write 80k
> characters/sec to a console.

std::ofstream f ("/dev/null");

> In any case my ostream implementation achieves much better
> performance than cstdio on my E8400 machine under Windows.

Great.

It's not really clear to me what you're complaining about --
originally it seemed like you were implying that iostreams is somehow
fundamentally mis-specified in a way that makes implementations of it
inherently very slow, and you seemed to be denying that it was simply
a problem with your compiler's implementation. But so far I don't see
a lot of evidence for such a position... (certainly iostreams has lots
of ugly points of course, but nothing really seems fatal)

I find that it _is_ much faster to write line-sized blocks of text
using iostreams (e.g. "f << line_buffer") than it is to format lots of
individual numbers etc. making up the equivalent line (f << field1 <<
"\t" << field2 << ..) -- but the same seems true with fwrite vs
fprintf...
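Just to make the comparison concrete, the two styles might look
something like the sketch below (sprintf is only one way to build the
line; this is not benchmark code):

#include <cstdio>
#include <fstream>

void write_fields(std::ofstream& f, int a, int b)
{
    f << a << '\t' << b << '\n';        // each field formatted by the stream
}

void write_line(std::ofstream& f, int a, int b)
{
    char line[64];                      // format the whole line first...
    std::sprintf(line, "%d\t%d\n", a, b);
    f << line;                          // ...then hand one block to the stream
}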

-Miles

--
Love is the difficult realization that something other than oneself is real.
[Iris Murdoch]

David Barrett-Lennard

Nov 23, 2010, 4:28:30 AM

On Nov 23, 3:20 pm, Miles Bader <mi...@gnu.org> wrote:

> David Barrett-Lennard <davi...@iinet.net.au> writes:
> >> My experience is that with gcc on linux, iostreams are maybe 5%
> >> slower than stdio for typical mixed string/number output (read
> >> "almost identical, just a tad slower"). That's close enough for
> >> me.
>
> > Can you be more specific please? For all I know you're comparing
> > cout and printf to a console window. My machine can only write 80k
> > characters/sec to a console.
>
> std::ofstream f ("/dev/null");

Ok (I thought your caveats which mentioned cout and printf suggested
otherwise).

> > In any case my ostream implementation achieves much better
> > performance than cstdio on my E8400 machine under Windows.
>
> Great.
>
> It's not really clear to me what you're complaining about --
> originally it seemed like you were implying that iostreams is somehow
> fundamentally mis-specified in a way that makes implementations of it
> inherently very slow, and you seemed to be denying that it was simply
> a problem with your compiler's implementation. But so far I don't see
> a lot of evidence for such a position... (certainly iostreams has lots
> of ugly points of course, but nothing really seems fatal)

I am complaining about the dismal performance of the ostream provided
by Microsoft.

A comparison between std::ostream and fprintf is not particularly
interesting to me, because fprintf is seriously handicapped by the
need to parse a format string so it should be much slower. I find it
very surprising that formatting an int appearing as '10' is instead 4x
faster with fprintf.
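For illustration, hand-rolled decimal formatting of the kind a custom
ostream can use looks roughly like this (a sketch, not my actual
implementation): there is no format string to parse, just digit
extraction into a small local buffer.

#include <cstddef>

inline std::size_t format_uint(unsigned value, char* out)  // returns length
{
    char tmp[10];                        // enough digits for a 32 bit value
    std::size_t n = 0;
    do
    {
        tmp[n++] = char('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (std::size_t i = 0; i < n; ++i)  // digits come out in reverse order
        out[i] = tmp[n - 1 - i];
    return n;
}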

Contrary to your statement above I have not made a single claim about
where I think the problem lies other than saying that I have no idea
whether it is due to the spec, the library implementation or the
compiler.

Also I would say the iostream spec provides plenty of reason to wonder
whether it inevitably and seriously impacts performance. Just the
requirement to set the 'bad' bit when an exception is thrown by the
underlying streambuf is enough to make me wonder about the impact.


> I find that it _is_ much faster to write line-sized blocks of text
> using iostreams (e.g. "f << line_buffer") than it is to format lots of
> individual numbers etc making up the equivalent line (f << field1 <<
> "\t" << field1 << ..) -- but the same seems true with fwrite vs
> fprintf...

So how do you format a line?



David Barrett-Lennard

Nov 23, 2010, 12:07:01 PM

On Nov 23, 3:19 pm, Goran <goran.pu...@gmail.com> wrote:
> On Nov 21, 11:39 pm, David Barrett-Lennard <davi...@iinet.net.au>
> wrote:
>
> >> And finally, you should try /Ox, not /O2.
>
> > Thanks I’ll give it a try.
>
> One other thing, what option for exception handling did you use? I
> don't know the name of the command offhand, but for sputc versus put
> performance, with MS compiler, you really should turn off "combined"
> SEH/C++ exception handling (it should be off by default if the
> compiler is recent, but...)

I've been using /EHsc (C++ exceptions only).

David Barrett-Lennard

Nov 23, 2010, 12:13:26 PM

On Nov 22, 6:47 am, "Martin B." <0xCDCDC...@gmx.at> wrote:

> * Use a profiler to find out where the bottleneck is. (Pretty much any (commercial) tool (with a trial version) should get you there in a few hours, shouldn't it?)

I found that all the source code for the CRT and standard library
comes with VS 2008 and there is a batch file to build them. This
allowed me to run the VS 2008 profiler and see what's going on inside
msvcp90.dll.

I profiled the writing of 1 million bools appearing as '1' in the
output, where the implementation only runs at 1.4 MHz.

My conclusion is that there is no single bottleneck. There appear to
be many reasons why the performance is poor.


Some observations

- Since my implementation is 130x faster, any function that takes just
1% of the time represents a very significant amount of time.

- sputc on the streambuf took 1.5% of the time. Since this is doing
most of the real work I think the underlying streambuf performance is
tolerable.

- Over 33% of the time was spent locking/unlocking some kind of mutex.

- There was a large mismatch between inclusive and exclusive times for
locale::facet::_Incref() and locale::facet::_Decref(), making me think
these functions were locking and unlocking a mutex.

- There were two use_facet calls to bind to facets in the locale - for
num_put and numpunct. In total these took about 15% of the time.
Since both these facets are mandatory in all locales I would have
thought use_facet could have been specialized and inlined to bind
directly to member variables of the locale, in which case the binding
overhead could be zero. (A caller-side sketch of hoisting this binding
out of a loop appears after these observations.)

- A wostream::sentry took 6% inclusive of the time in its constructor,
and 7% inclusive of the time in its destructor. That's an enormous
overhead given that a sentry has no need to do anything in this case.
I'm not sure why a wostream sentry was used in an ostream.

- It appears that some exception handling prolog/epilog took about 7%
of the time.

- There were calls that IMO should never have been made. E.g.
numpunct<char>::grouping() took 4.6% of the time (inclusive), and
there were calls to numpunct<char>::do_thousands_sep().

- Apparently a std::string was employed, with two different overloads
of assign taking 5% inclusive of the time.

- Over 15% inclusive of the time was spent in _sprintf_s which is a
security enhanced version of sprintf. Amongst other things the
implementation needed to parse a format string.
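As an aside, a caller can hoist that facet binding out of a tight loop
themselves; a minimal sketch using num_put directly (my own example, not
part of the profiled code):

#include <cstddef>
#include <iterator>
#include <locale>
#include <ostream>

void write_longs(std::ostream& os, const long* values, std::size_t count)
{
    // Bind to the facet once, outside the loop.
    const std::num_put<char>& np =
        std::use_facet< std::num_put<char> >(os.getloc());
    std::ostreambuf_iterator<char> it(os);
    for (std::size_t i = 0; i < count; ++i)
        it = np.put(it, os, os.fill(), values[i]);   // no per-value lookup
}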


Profile results

Sampling every 1000000 clock cycles was used. I tried to simplify the
very long template names (e.g. removing the std qualifier, char traits
and allocator template parameters, replacing basic_ostream<char> with
ostream etc).

Immediately below each function name appears four numbers which from
left to right are:

Inclusive Samples, Exclusive Samples, Inclusive Samples %,
Exclusive Samples %

The functions are listed in decreasing order by exclusive samples.


[ntdll.dll]
524 519 25.79 25.54
__Mtxunlock
342 125 16.83 6.15
__output_s_l
274 123 13.48 6.05
num_put<char,ostreambuf_iterator<char>>::_Iput(ostreambuf_iterator<char>,ios_base
&,char,char *,unsigned int)const
741 101 36.47 4.97
ostream::operator<<(bool)
1,917 84 94.34 4.13
__EH_prolog3
58 58 2.85 2.85
__EH_epilog3
49 49 2.41 2.41
__aulldvrm
46 46 2.26 2.26
__Mtxlock
338 43 16.63 2.12
locale::facet::_Decref()
228 42 11.22 2.07
_Lockit::~_Lockit()
298 36 14.67 1.77
num_put<char,ostreambuf_iterator<char>>::do_put(ostreambuf_iterator<char>,ios_base
&,char,long)const
1,118 34 55.02 1.67
_Lockit::_Lockit(int)
311 33 15.31 1.62
streambuf::sputc(char)
34 31 1.67 1.53
ostream::_Osfx()
37 29 1.82 1.43
wostream::sentry::sentry(wostream&)
123 27 6.05 1.33
num_put<char,ostreambuf_iterator<char>>::_Putgrouped(ostreambuf_iterator<char>,char
const *,unsigned int,char)const
95 26 4.68 1.28
num_put<char,ostreambuf_iterator<char>>::_Rep(ostreambuf_iterator<char>,char,unsigned
int)const
26 26 1.28 1.28
locale::facet::_Incref()
245 25 12.06 1.23
_LocaleUpdate::_LocaleUpdate(localeinfo_*)
71 24 3.49 1.18
__vsnprintf_helper
292 23 14.37 1.13
ios_base::getloc()const
284 23 13.98 1.13
use_facet<numpunct<char>>(locale const &)
148 23 7.28 1.13
__EH_prolog3_GS
22 22 1.08 1.08
__EH_prolog3_catch
20 20 0.98 0.98
__getptd_noexit
67 20 3.30 0.98
ostream::_Sentry_base::_Sentry_base(ostream&)
77 19 3.79 0.94
num_put<char,ostreambuf_iterator<char>>::_Put(ostreambuf_iterator<char>,char
const *,unsigned int)const
58 17 2.85 0.84
numpunct<char>::grouping()const
94 17 4.63 0.84
wostream::sentry::~sentry()
142 16 6.99 0.79
locale::_Getfacet(unsigned int)const
16 16 0.79 0.79
num_put<wchar_t,ostreambuf_iterator<wchar_t>>::_Ifmt(char *,char const
*,int)const
16 16 0.79 0.79
write_char
16 16 0.79 0.79
[kernel32.dll]
15 15 0.74 0.74
string::assign(char const *,unsigned int)
41 15 2.02 0.74
num_put<char,ostreambuf_iterator<char>>::do_put(ostreambuf_iterator<char>,ios_base
&,char,bool)const
1,150 15 56.59 0.74
use_facet<num_put<char,ostreambuf_iterator<char>>>(locale const &)
163 15 8.02 0.74
__vsprintf_s_l
309 14 15.21 0.69
_sprintf_s
318 12 15.65 0.59
string::_Grow(unsigned int,bool)
17 12 0.84 0.59
locale::~locale()
240 12 11.81 0.59
locale::id::operator unsigned int()
12 12 0.59 0.59
write_string
28 12 1.38 0.59
_memchr
11 11 0.54 0.54
_strlen
11 11 0.54 0.54
string::_Eos(unsigned int)
11 11 0.54 0.54
string::_Tidy(bool,unsigned int)
13 10 0.64 0.49
___set_flsgetvalue
18 9 0.89 0.44
string::_Inside(char const *)
9 9 0.44 0.44
string::assign(char const *)
62 9 3.05 0.44
locale::locale(locale const &)
278 9 13.68 0.44
TimingTests()
1,948 9 95.87 0.44
@__security_check_cookie@4
8 8 0.39 0.39
string::string(char const *)
73 8 3.59 0.39
ostreambuf_iterator<char>::operator=(char)
41 7 2.02 0.34
__getptd
73 6 3.59 0.30
_Mutex::_Lock()
58 6 2.85 0.30
_Mutex::_Unlock()
49 6 2.41 0.30
numpunct<char>::do_thousands_sep()const
6 6 0.30 0.30
num_put<char,ostreambuf_iterator<char>>::put(ostreambuf_iterator<char>,ios_base
&,char,bool)const
1,149 5 56.55 0.25
write_multi_char
5 5 0.25 0.25
__EH_epilog3_GS
10 4 0.49 0.20
basic_ios<unsigned short>::setstate(int,bool)
4 4 0.20 0.20
basic_istream<char>::_Sentry_base::~_Sentry_base()
4 4 0.20 0.20
numpunct<unsigned short>::do_grouping()const
77 4 3.79 0.20
fastcopy_I
2 2 0.10 0.10
uncaught_exception()
2 2 0.10 0.10

Daniel James

Nov 23, 2010, 1:53:50 PM

In article <e46c37a8-1c2f-4790-87a7-c167b5...@r6g2000vbf.googlegroups.com>,
David Barrett-Lennard wrote:
> A comparison between std::ostream and fprintf is not particularly
> interesting to me, because fprintf is seriously handicapped by the
> need to parse a format string so it should be much slower. I find it
> very surprising that formatting an int appearing as '10' is instead 4x
> faster with fprintf.

The implementation of iostreams in the MS/Dinkumware C++ library uses
sprintf (actually sprintf_s in VS2008) to perform the format conversion,
and so suffers the same overhead.

Boost::format does the same, by the way.

Cheers,
Daniel.

David Barrett-Lennard

Nov 24, 2010, 1:09:42 AM

On Nov 24, 2:53 am, Daniel James <dan...@me.invalid> wrote:
> In article <e46c37a8-1c2f-4790-87a7-c167b5e95...@r6g2000vbf.googlegroups.com>,
> David Barrett-Lennard wrote:
> > A comparison between std::ostream and fprintf is not particularly
> > interesting to me, because fprintf is seriously handicapped by the
> > need to parse a format string so it should be much slower. I find it
> > very surprising that formatting an int appearing as '10' is instead 4x
> > faster with fprintf.
>
> The implementation of iostreams in the MS/Dinkumware C++ library uses
> sprintf (actually sprintf_s in VS2008) to perform the format conversion,
> and so suffers the same overhead.
>
> Boost::format does the same, by the way.

Yes, I found that out when I ran the profiler. Even so, according to
the profiler the underlying call to sprintf_s only represents about
15% of the total time to write a bool (with noboolalpha) to an
ostream!

dietma...@gmail.com

Nov 24, 2010, 1:09:01 AM
{ Your text is too wide and unwanted linebreaks make it hard to read.
Posters, please fit your text within 70 columns or so. -mod }

On Nov 23, 5:13 pm, David Barrett-Lennard <davi...@iinet.net.au>
wrote:


> On Nov 22, 6:47 am, "Martin B." <0xCDCDC...@gmx.at> wrote:
>
> > * Use a profiler to find out where the bottleneck is. (Pretty much any (commercial) tool (with a trial version) should get you there in a few hours, shouldn't it?)

There is pretty much no surprise about any of these findings. I
thought I reported on most of these issues more than a decade ago and
proposed (and implemented) ways to avoid most of them. For example,
there is actually no need to refer to the locale, modify any use
counts, etc., at least not if it is the default locale! (hm, I don't
know if I wrote how to do this publicly). The Dinkumware
implementation is pretty much following the wording in the standards
without making use of the as-if rule at all. Like I said before:
there doesn't seem to be any substantial commercial demand to warrant
a better implementation! There is, however, demand for other library
components or for improvements in these.

All that said, there is nothing in the C++ standard which prevents a
substantially faster implementation. It is just [quite a bit] more
effort to create and get right ... especially in a context where a
lot of casual users WILL demand that the IOStreams behave nicely even
in a multi-threaded context (of course, without them doing any
locking whatsoever).

David Barrett-Lennard

Nov 24, 2010, 1:10:57 AM

On Nov 23, 5:28 pm, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

> Also I would say the iostream spec provides plenty of reason to wonder
> whether it inevitably and seriously impacts performance. Just the
> requirement to set the 'bad' bit when an exception is thrown by the
> underlying streambuf is enough to make me wonder about the impact.

I just tried putting

try
{
    <implementation>
}
catch(...)
{
    throw;
}

around my code to write a formatted character, and found the
performance dropped from 260 MHz to 100 MHz. For a formatted bool the
performance dropped from 190 MHz to 74 MHz.

So it seems the spec can inevitably impose an unreasonable burden.

David Barrett-Lennard

Nov 24, 2010, 4:45:04 AM

On Nov 24, 2:09 pm, "dietmar_ku...@yahoo.com"
<dietmar.ku...@gmail.com> wrote:

Yes I agree that a conforming implementation can do much better than
the Dinkumware implementation.

Even so, on some platforms I now have doubts about it achieving the
performance of my simple, non-conforming implementation.

One issue is the overhead of entering try-catch blocks (or
equivalently, when there are objects declared on the frame with non-
empty destructors that must be run when the stack unwinds). There can
be significant overheads depending on the ABI.

In another post I described how this slows down my implementation on
x86 for writing formatted chars and bools by a factor of 2.5.

On x64 the overhead is instead only about 15%. This is not surprising,
Microsoft have gone to a lot of trouble to optimise the x64 ABI for
where no exceptions are thrown (and very much at the expense of what
happens when exceptions are thrown - on the premise that exceptions
really are exceptional).

Ulrich Eckhardt

Nov 24, 2010, 4:49:21 AM

David Barrett-Lennard wrote:
[profiling IOstreams]
> Some observations
[...]

> - sputc on the streambuf took 1.5% of the time. Since this is doing
> most of the real work I think the underlying streambuf performance is
> tolerable.

Agreed.

> - Over 33% of the time was spent locking/unlocking some kind of mutex.
>
> - There was a large mismatch between inclusive and exclusive times for
> locale::facet::_Incref() and locale::facet::_Decref(), making me think
> these functions were locking and unlocking a mutex.

Incrementing a refcount in a thread-safe way is not completely trivial.
Especially on the increasingly popular multi-core CPUs it requires some
synchronisation overhead. That said, I don't see a reason to increase or
decrease the reference count for any facet. In any case, this and the
mentioned mutex are probably useful for global objects like cout/cerr, even
if they aren't in your scenario.

> - There were two use_facet calls to bind to facets in the locale - for
> num_put, and num_punct. In total these took about 15% of the time.
> Since both these facets are mandatory in all locales I would have
> thought use_facet could have been specialied and inlined to bind
> directly to member variables of the locale, in which case the binding
> overhead could be zero.

I'd rather say that the stream should cache these when the locale is
set/changed. I know that e.g. STLport's filebuffer caches the codecvt
facet.

> - A wostream::sentry took 6% inclusive of the time in its constructor,
> and 7% inclusive of the time in its destructor. That's an enormous
> overhead given that a sentry has no need to do anything in this case.
> I'm not sure why a wostream sentry was used in an ostream.
>
> - It appears that some exception handling prolog/epilog took about 7%
> of the time.

This is probably boilerplate code that could have been replaced in just
this very special case. I think that it is necessary nonetheless, because
you could legally create a setup where writing a bool fails or throws.

> - There were calls that IMO should never have been made. E.g.
> numpunct<char>::grouping() took 4.6% of the time (inclusive), and
> there were calls to numpunct<char>::do_thousands_sep().

This could have been avoided.

> - Apparently a std::string was employed, with two different overloads
> of assign taking 5% inclusive of the time.

Using a var-sized buffer makes sense in many cases, and this one wasn't
special enough to optimize it.

> - Over 15% inclusive of the time was spent in _sprintf_s which is a
> security enhanced version of sprintf. Amongst other things the
> implementation needed to parse a format string.

*Yikes*. This is the easy way out, because stream formatting is defined in
terms of printf behaviour, but ugly still.


Summing up, 33% + 15% + 4.6% = 52.6% of the time could easily have been
avoided if the implementor cared about that performance. Skipping the call
to sprintf could have saved a bit more, and all that without much work.


There are two things the programmer can do to speed things up (a short
sketch follows below):
1. Don't sync with stdio, in case you use cin/cout.
2. Use the C locale explicitly for your stream. If you are writing for a
machine, you want a defined format anyway, not a localized one. Further, I
have seen some implementations take shortcuts for the C locale, which might
speed things up.
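A short sketch of both points (the file name is just an example):

#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    std::ios_base::sync_with_stdio(false);   // 1. only relevant for cin/cout
    std::ofstream out("data.txt");
    out.imbue(std::locale::classic());       // 2. write with the "C" locale
    out << 1234567890 << '\n';
    return 0;
}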

Uli

--
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Miles Bader

Nov 24, 2010, 4:45:30 AM

David Barrett-Lennard <dav...@iinet.net.au> writes:
> I just tried putting
> try
> {
> <implementation>
> }
> catch(...)
> {
> throw;
> }
>
> around my code to write a formatted character, and found the
> performance drop from 260MHz to 100MHz. For a formatted bool the
> performance dropped from 190MHz to 74MHz.
>
> So it seems the spec can inevitably impose an unreasonable burden.

... or is it another implementation issue (compiler implementation in
this case)?

Can you see what the compiler is doing that causes the speed drop?

With gcc, a try-catch generally adds no speed penalty unless an
exception actually occurs. Is your compiler doing something different?

-Miles

--
We live, as we dream -- alone....

cjhopman

Nov 24, 2010, 4:50:23 AM

On Nov 24, 12:10 am, David Barrett-Lennard <davi...@iinet.net.au>
wrote:
>

> I just tried putting
>
> try
> {
>     <implementation>
> }
> catch(...)
> {
>     throw;
> }
>
> around my code to write a formatted character, and found the
> performance drop from 260MHz to 100MHz. For a formatted bool the
> performance dropped from 190MHz to 74MHz.
>
> So it seems the spec can inevitably impose an unreasonable burden.

You have not shown that. You do not necessarily need a try-catch
to meet the requirement that the 'bad' bit is set when an
exception is thrown.

For example:

...
bad = true;
<implementation>;
bad = false;
...
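A concrete sketch of that idea (the type and names here are invented,
not from any real implementation): mark the stream bad up front, do the
work, and clear the flag only on the success path. If the work throws,
the flag simply stays set and no try/catch is needed.

#include <streambuf>

struct tiny_ostream
{
    std::streambuf* buf;
    unsigned state;                        // bit 0 plays the role of badbit

    tiny_ostream& write_bool(bool value)
    {
        state |= 1u;                       // assume failure
        buf->sputc(value ? '1' : '0');     // may throw
        state &= ~1u;                      // reached only if sputc returned
        return *this;
    }
};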

Bart van Ingen Schenau

Nov 24, 2010, 4:50:31 AM

On Nov 24, 7:10 am, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

> On Nov 23, 5:28 pm, David Barrett-Lennard <davi...@iinet.net.au>
> wrote:
>
> > Also I would say the iostream spec provides plenty of reason to wonder
> > whether it inevitably and seriously impacts performance. Just the
> > requirement to set the 'bad' bit when an exception is thrown by the
> > underlying streambuf is enough to make we wonder about the impact.
>
> I just tried putting
>
> try
> {
>     <implementation>
> }
> catch(...)
> {
>     throw;
> }
>
> around my code to write a formatted character, and found the
> performance drop from 260MHz to 100MHz. For a formatted bool the
> performance dropped from 190MHz to 74MHz.
>
> So it seems the spec can inevitably impose an unreasonable burden.

I don't agree with that conclusion.
There is nothing in the standard that requires a try-block to hurt the
performance of the code in the no-exception case. And there are
implementations about (e.g. GCC) where a try-block indeed has no
performance impact as long as no exceptions are thrown.

So, my conclusion would be that an implementation choice in one area
(how to implement exceptions) affects the performance in another area
(I/O).
This cannot be blamed on the specification, but only on the
implementation.

Bart v Ingen Schenau

dietma...@gmail.com

Nov 24, 2010, 1:32:02 PM
On Nov 24, 9:50 am, cjhopman <cjhop...@gmail.com> wrote:
> On Nov 24, 12:10 am, David Barrett-Lennard <davi...@iinet.net.au>
> wrote:
> > I just tried putting
> > try
> > {
> >     <implementation>
> > }
> > catch(...)
> > {
> >     throw;
> > }
>
> > around my code to write a formatted character, and found the
> > performance drop from 260MHz to 100MHz. For a formatted bool the
> > performance dropped from 190MHz to 74MHz.

This isn't about IOStreams performance, however: this is about a bad
implementation of exceptions! Almost certainly the compiler follows
a rather old ABI, predating the introduction of advanced optimization
techniques in the late 90s.

> > So it seems the spec can inevitably impose an unreasonable burden.

Nope. With some compilers there is no overhead in try/catch blocks
unless there is an exception actually being thrown.

> You have not shown that. You do not necessarily need a try-catch
> to meet the requirement that the 'bad' bit is set when an
> exception is thrown.
>
> For example:
>
> ...
> bad = true;
> <implementation>;
> bad = false;
> ...

Unfortunately, the specification of IOStreams isn't to rethrow the
exception unconditionally. Instead, it is to only throw an exception
if the corresponding flag is set in the exception mask after setting
the badbit. I actually think it always throws an exception of a
specific type (std::ios_base::failure). That said, the implementation
doesn't need to put a try/catch block around all operations! It
could deal with a bad exception handling implementation by putting a
try/catch block only around what is actually user provided. In the
given example this would effectively be the calls triggering
overflow() in the stream buffer. This would almost certainly use
special custom functions for much of the processing, which is a bit
tricky because the implementation needs to make sure it doesn't use
any of this stuff in cases where a user has specialized some of the
class templates in IOStreams.
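For reference, one reading of that requirement looks roughly like the
sketch below (heavily simplified: the sentry, width/fill handling and
the question of exactly which exception propagates are all glossed
over, and this is not the Dinkumware code):

#include <ostream>

template <class T>
std::ostream& insert(std::ostream& os, const T& value)
{
    (void)value;                     // placeholder; formatting is elided
    try
    {
        // ... do the actual formatting into os.rdbuf() here ...
    }
    catch (...)
    {
        bool rethrow = (os.exceptions() & std::ios_base::badbit) != 0;
        try { os.setstate(std::ios_base::badbit); }  // may itself throw failure
        catch (...) { }                              // swallow that one
        if (rethrow)
            throw;                                   // re-raise the original exception
    }
    return os;
}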

Joshua Maurice

Nov 24, 2010, 2:22:02 PM

On Nov 24, 1:45 am, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

> One issue is the overhead of entering try-catch blocks (or
> equivalently, when there are objects declared on the frame with non-
> empty destructors that must be run when the stack unwinds). There can
> be significant overheads depending on the ABI.
>
> In another post I described how this slows down my implementation on
> x86 for writing formatted chars and bools by a factor of 2.5.

Not to go off on too much of a tangent, but I think that this is
easily the worst thing that happened to C++. An integral language
feature which could carry zero cost when not used (i.e. when nothing
is thrown), apart from larger executable sizes (largely irrelevant
with virtual memory, though still relevant on embedded systems),
instead carries a cost even when not used. It's so frustrating that
so many implementers missed the memo on this, and because of that a
lot of us have to put up with very C-esque code because the use of
exceptions is frowned upon for performance-critical code.

> On x64 the overhead is instead only about 15%. This is not surprising,
> Microsoft have gone to a lot of trouble to optimise the x64 ABI for
> where no exceptions are thrown (and very much at the expense of what
> happens when exceptions are thrown - on the premise that exceptions
> really are exceptional).

That's not a very good job. In fact, that's a very bad job if they
really designed their ABI to keep any exception overhead minimal for
the not thrown code paths. They should have taken a look at even
marginally recent gcc versions and Linux ABIs which manage to get
about a 0% overhead for exception handling on the code paths where an
exception is not thrown. That's the way it should be for all
platforms, but again a lot of implementers really dropped the ball on
this one, and apparently they continue to drop the ball.

David Barrett-Lennard

Nov 25, 2010, 3:42:41 AM
On Nov 24, 5:45 pm, Miles Bader <mi...@gnu.org> wrote:

> David Barrett-Lennard <davi...@iinet.net.au> writes:
> > I just tried putting
> > try
> > {
> > <implementation>
> > }
> > catch(...)
> > {
> > throw;
> > }
>
> > around my code to write a formatted character, and found the
> > performance drop from 260MHz to 100MHz. For a formatted bool the
> > performance dropped from 190MHz to 74MHz.
>
> > So it seems the spec can inevitably impose an unreasonable burden.
>
> ... or is it another implementation issue (compiler implementation in
> this case)?
>
> Can you see what the compiler is doing that causes the speed drop?
>
> With gcc, a try-catch generally adds no speed penalty unless an
> exception actually occurs. Is your compiler doing something different?

Indeed it is. Some of the details are described in this 2006 video
where Kevin Frei discusses the assembly language cost of exception
handling on both x86, x64 Windows platforms:

http://video.google.com/videoplay?docid=9169999597330548749#

