How can I improve streams performance?

Roy Smith

unread,

Jul 16, 2003, 1:20:58 PM7/16/03

to

I've been running some benchmarks to compare streams and stdio
performance. I've always suspected stdio was faster, but was
astonished to discover how much faster. I timed the following running
into /dev/null, as well as the same loop using printf().

int main ()
{
for (int i = 0; i < 1000 * 1000; ++i) {
cout << "i = " << i << '\n';
}
}

I did two variations of the streams version (with '\n' and with endl),
and also two variations with printf, the difference being ading a call
to fflush() after each call to printf(). Here's the results using the
Sun Workshop compiler on a Sun E-250:

printf.noflush 2.80 user 0.01 system (2.81 total)
printf.flush 6.02 user 3.28 system (9.30 total)
stream.newline 33.02 user 18.29 system (51.31 total)
stream.endl 36.33 user 19.81 system (56.14 total)

Stdio looks like it's 5-20 times faster than streams. I got similar
ratios using gcc on a 800 MHz Pentium running Linux. Using gcc on a
400 MHz Macintosh G4 running OS-X, stdio was closer to 100 times
faster!

Are these ratios typical? Is there something I can do to make streams
faster? I like the type safety and extensibility, but 5-20 (or maybe
even 100) times slower is a big price to pay for applications which do
any significant amount of i/o.

Ron Natalie

unread,

Jul 16, 2003, 1:34:04 PM7/16/03

to

"Roy Smith" <r...@panix.com> wrote in message news:bf41hq$rst$1...@panix2.panix.com...

> Stdio looks like it's 5-20 times faster than streams. I got similar
> ratios using gcc on a 800 MHz Pentium running Linux. Using gcc on a
> 400 MHz Macintosh G4 running OS-X, stdio was closer to 100 times
> faster!

Are you sure you turned on the optimizer? The standard library relies heavily
on inline functions in most implementations. I only get a 3x difference on g++
here (printf is still faster). Of course, try feeding some class type to printf...

Raoul Gough

unread,

Jul 16, 2003, 1:47:55 PM7/16/03

to

"Ron Natalie" <r...@sensor.com> writes:

Another important point is to cancel sync_with_stdio. Dietmar Kuehl
pointed this out a while back in a clc++m thread on a similar
topic. ISTR it making iostreams twice as fast as printf (using STLport
with gcc on Windows, anyway). In theory, iostreams formatting should
be faster, since it doesn't have to parse a format string first.

// ...
int main () {
std::sync_with_stdio (false);
// ...
}

--
Raoul Gough
"Let there be one measure for wine throughout our kingdom, and one
measure for ale, and one measure for corn" - Magna Carta

Raoul Gough

unread,

Jul 16, 2003, 1:53:18 PM7/16/03

to

Raoul Gough <Raoul...@yahoo.co.uk> writes:

> // ...
> int main () {
> std::sync_with_stdio (false);

Correction - that should be:

std::cout.sync_with_stdio(false);

The thread I referred to (performance of printf(...) vs. cout << ...)
is at http://groups.google.com/groups?th=d75093a01bedc2f0

Roy Smith

unread,

Jul 16, 2003, 2:12:20 PM7/16/03

to

Raoul Gough <Raoul...@yahoo.co.uk> wrote:
> Another important point is to cancel sync_with_stdio.

Hmmm. The Sun Workshop compiler doesn't seem to know about that.
When I added:

std::sync_with_stdio (false);

I got:

"stream.cc", line 5: Error: sync_with_stdio is not a member of std.

Raoul Gough

unread,

Jul 16, 2003, 2:21:00 PM7/16/03

to

r...@panix.com (Roy Smith) writes:

No wait, I'll get it right eventually :-) sync_with_stdio is a static
member function of std::ios_base, so you should be using

std::ios_base::sync_with_stdio (false);

John Harrison

unread,

Jul 16, 2003, 2:56:59 PM7/16/03

to

"Raoul Gough" <Raoul...@yahoo.co.uk> wrote in message
news:ur84qu...@yahoo.co.uk...

> "Ron Natalie" <r...@sensor.com> writes:
>
> > "Roy Smith" <r...@panix.com> wrote in message
> > news:bf41hq$rst$1...@panix2.panix.com...
> >
> >> Stdio looks like it's 5-20 times faster than streams. I got
> >> similar ratios using gcc on a 800 MHz Pentium running Linux. Using
> >> gcc on a 400 MHz Macintosh G4 running OS-X, stdio was closer to 100
> >> times faster!
> >
> > Are you sure you turned on the optimizer? The standard library
> > relies heavily on inline functions in most implementations. I only
> > get a 3x difference on g++ here (printf is still faster). Of
> > course, try feeding some class type to printf...
>
> Another important point is to cancel sync_with_stdio. Dietmar Kuehl
> pointed this out a while back in a clc++m thread on a similar
> topic. ISTR it making iostreams twice as fast as printf (using STLport
> with gcc on Windows, anyway). In theory, iostreams formatting should
> be faster, since it doesn't have to parse a format string first.
>
> // ...
> int main () {
> std::sync_with_stdio (false);
> // ...
> }
>

Just looked at my implementation of STL. That flag is ignored. YMMV in other
words.

john

Alex Vinokur

unread,

Jul 16, 2003, 3:12:29 PM7/16/03

to

"Roy Smith" <r...@panix.com> wrote in message news:bf41hq$rst$1...@panix2.panix.com...

> I've been running some benchmarks to compare streams and stdio
> performance. I've always suspected stdio was faster, but was
> astonished to discover how much faster. I timed the following running
> into /dev/null, as well as the same loop using printf().
>

[snip]

C/C++ Performance Tests
=======================
Using C/C++ Program Perfometer
http://sourceforge.net/projects/cpp-perfometer
http://alexvn.freeservers.com/s1/perfometer.html

Environment
-----------
Windows 2000 Professional
CYGWIN_NT-5.0 1.3.22(0.78/3/2)
Intel(R) Celeron(R) CPU 1.70 GHz
GNU gcc/g++ version 3.2 20020927 (prerelease)
Compilation : No optimization

Summary test results
--------------------

#==================================================
# stdout, stderr,
# cout, cerr, clog, ostringstream,
# cout-to-file, cerr-to-file, clog-to-file
# - - - - - - - - - - - - - - - - - - - - -
# endl vs. "\n/" and '\n'
#--------------------------------------------------
# Resource Name : user time used (via rusage)
# Resource Cost Unit : milliseconds (unsigned long long)
# Resource State Unit : timeval
#==================================================
: -----------------------------------
: stdout "\n" -> 71
: stdout '\n' -> 81
: cout endl -> 420
: cout "\n" -> 90
: cout '\n' -> 79
: stderr "\n" -> 432
: stderr '\n' -> 421
: cerr endl -> 746
: cerr "\n" -> 764
: cerr '\n' -> 718
: clog endl -> 756
: clog "\n" -> 737
: clog '\n' -> 739
: ostringstream endl -> 30
: ostringstream "\n" -> 46
: ostringstream '\n' -> 33
: cout-to-file endl -> 400
: cout-to-file "\n" -> 52
: cout-to-file '\n' -> 41
: cerr-to-file endl -> 445
: cerr-to-file "\n" -> 53
: cerr-to-file '\n' -> 43
: clog-to-file endl -> 489
: clog-to-file "\n" -> 61
: clog-to-file '\n' -> 48
: -----------------------------------

==========================================
Alex Vinokur
mailto:ale...@connect.to
http://sourceforge.net/users/alexvn
http://www.simtel.net/search.php?action=search&authorName=Alex+Vinokur
==========================================

Thomas Matthews

unread,

Jul 16, 2003, 3:56:28 PM7/16/03

to

Roy Smith wrote:

Applications which perform any significant amount of I/O will first
format to a buffer, then output the buffer as one chunk. IOW, use
stringstream, then ostream::write(). In C, this translates to
ssprintf, then fwrite().

These applications would also either queue up the I/O requests or
reduce the number of requests. For example:
cout << "Menu choices:\n";
cout << "0) Exit\n";
cout << "1) Nothing\n";
cout << "2) Sleep\n";
cout << "3) Party\n";
could be reduced to:
const char Menu[] = "Menu choices:\n"
"0) Exit\n"
"1) Nothing\n"
"2) Sleep\n"
"3) Party\n";
cout << Menu;
or
cout.write(Menu, sizeof Menu);
Even a mixture of using write() for the constant text and the
formatted operator << for the variant part would still be
faster than using the stream insertion operator for everything.

If the application is truly I/O dependent, it would use some
assembly language to maximise the usage of platform I/O.
Another technique is to use double or more of buffers and
a thread for processing the buffers.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Roy Smith

unread,

Jul 17, 2003, 10:29:25 PM7/17/03

to

Wow. Thanks everybody for the many and varied ideas. This led me on a
bit of a performance measuring trip, and the results really surprised me.

In short, I ran three basic tests: using printf, using streams to cout,
and streams again, but calling sync_with_stdio(false).

For each of those, I did a buffered and a flushed version. Flushed
meant calling fflush(stdio) with printf, or using endl instead of \n
with streams.

The printf code looks like:

---------------------------------
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
int count = atoi (argv[1]);
for (int i = 0; i < count; ++i) {
printf ("i = %d\n", i);
#ifdef FLUSH
fflush (stdout);
#endif
}
}
---------------------------------

and the steams code is:

---------------------------------
#include <iostream>
#include <stdlib.h>

#ifdef FLUSH
#define END std::endl
#else
#define END '\n'
#endif

int main (int argc, char *argv[])
{
int count = atoi (argv[1]);
std::ios_base::sync_with_stdio (SYNC);
for (int i = 0; i < count; ++i) {
std::cout << "i = " << i << END;
}
}
---------------------------------

For each of those 6 cross-products, I compiled the code twice, once with
the default compiler settings, the other with -O3.

Then, entirely by accident, I noticed that running the output into a
file was a lot slower than redirecting stdout into /dev/null, so I did
everything twice. The timing numbers presented here were obtained on a
400 MHz PowerPC running Mac OS-X 10.2.6 with gcc-3.1.

bash-2.05a$ ./go 100000 /dev/null
printf.flush.opt0 user=0.3 sys=0.32 total=0.62
printf.flush.opt3 user=0.32 sys=0.3 total=0.62
printf.buffer.opt0 user=0.18 sys=0.01 total=0.19
printf.buffer.opt3 user=0.18 sys=0.01 total=0.19
stream.flush.opt0 user=3.65 sys=3.12 total=6.77
stream.flush.opt3 user=3.26 sys=3.46 total=6.72
stream.buffer.opt0 user=3.35 sys=3.26 total=6.61
stream.buffer.opt3 user=3.25 sys=3.44 total=6.69
unsync.flush.opt0 user=1.9 sys=0.77 total=2.67
unsync.flush.opt3 user=1.86 sys=0.76 total=2.62
unsync.buffer.opt0 user=1.31 sys=0.02 total=1.33
unsync.buffer.opt3 user=1.28 sys=0.02 total=1.3

bash-2.05a$ ./go 100000 out2
printf.flush.opt0 user=0.78 sys=5.66 total=6.44
printf.flush.opt3 user=0.49 sys=5.94 total=6.43
printf.buffer.opt0 user=0.19 sys=0.04 total=0.23
printf.buffer.opt3 user=0.18 sys=0.03 total=0.21
stream.flush.opt0 user=7.63 sys=58.23 total=65.86
stream.flush.opt3 user=7.92 sys=58.19 total=66.11
stream.buffer.opt0 user=7.92 sys=57.67 total=65.59
stream.buffer.opt3 user=7.91 sys=57.44 total=65.35
unsync.flush.opt0 user=3 sys=6.86 total=9.86
unsync.flush.opt3 user=3.09 sys=6.82 total=9.91
unsync.buffer.opt0 user=1.3 sys=0.12 total=1.42
unsync.buffer.opt3 user=1.3 sys=0.1 total=1.4

A few observations....

1) Stdio certainly does seem to be faster than streams. The closest
streams came was about a factor of two slower.

2) Using sync_with_stdio(false) gives a big improvement. In some cases,
it was only about a doubling of speed, but in the file/buffer/opt3
example, unsyncing got about a factor of 40 improvement!

3) I certainly expect output to a file to be slower than to /dev/null,
but the actual results are baffling.

3a) Why should user time change at all? It's still boiling down to
write() system calls; the user code shouldn't care what happens after
that. I expected all the difference to be in system time.

3b) It certainly makes sense that the flush versions use more system
time than the buffered versions (lots of small writes), but why should
unsyncing make any different in system time at all?

I havn't run the full suite of benchmarks on other platforms, so I have
no idea how much variation there will be on other boxes. One thing I
did notice is that -O3 makes very little difference here, but on my
Solaris (Sun Workshop compiler) box, it got me several-fold speedup.

Using the numbers above, and ignoring the /dev/null series as being
unrealistic for production code, I've got a range from 0.21 seconds
total (printf.buffer.opt3) to 65.35 seconds (stream.buffer.opt3). This
is a factor of about 300, which I find absolutely staggering!

Ah-ha, I traced the process, and now it's clear. Here's the process
writing it's first line of output:

3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
"i"
3125 stream.buffer.op RET write 1
3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
" "
3125 stream.buffer.op RET write 1
3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
"="
3125 stream.buffer.op RET write 1
3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
" "
3125 stream.buffer.op RET write 1
3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
"0"
3125 stream.buffer.op RET write 1
3125 stream.buffer.op CALL write(0x1,0xd8000,0x1)
3125 stream.buffer.op GIO fd 1 wrote 1 byte
"
"
3125 stream.buffer.op RET write 1

It's doing single-byte writes!

The unsync version does:

3132 unsync.buffer.op CALL write(0x1,0xd8000,0x6)
3132 unsync.buffer.op GIO fd 1 wrote 6 bytes
"i = 0
"
3132 unsync.buffer.op RET write 6

i.e. one write per line.

The printf version is also doing one write per line:

3135 printf.buffer.op CALL write(0x1,0x84000,0x6)
3135 printf.buffer.op GIO fd 1 wrote 6 bytes
"i = 0
"
3135 printf.buffer.op RET write 6

Redirecting stdout into /dev/null while tracing, shows (for the printf
version, with count=10):

3138 printf.buffer.op CALL write(0x1,0x84000,0x3c)
3138 printf.buffer.op GIO fd 1 wrote 60 bytes
"i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
"
3138 printf.buffer.op RET write 60/0x3c

I also see it did some ioctl calls on descriptor 1, which I assume was
trying to figure out if it's connected to a terminal and adjusting
buffering accordingly (normal stdio behavior). The unsync version is
also doing block buffering to /dev/null (and the normal streams version
is doing one-byte writes).

The moral of the story is really nothing surprising -- if performance is
important, don't waste your time trying to optimize anything until
you've got it all working and profiled, because you might be surprised
where the time is going. And when it comes to measurement, process
tracing can be your friend!

Michiel Salters

unread,

Jul 18, 2003, 4:16:04 AM7/18/03

to

Thomas Matthews <Thomas_Matth...@sbcglobal.net> wrote in message news:<M5iRa.10457$jM7.301...@newssvr10.news.prodigy.com>...

> Applications which perform any significant amount of I/O will first
> format to a buffer, then output the buffer as one chunk. IOW, use
> stringstream, then ostream::write().

The technique is correct, but the implementation youy suggest is not.
Te format to buffer should be done by ostream, the chunked output by
ostreambuf. Luckily, that's what happens already. Buffering twice
will often decrease performance.

Regards,
--
Michiel Salters