printf, pypy, etc.


Stefan Karpinski

Jan 24, 2012, 2:09:11 PM1/24/12
to juli...@googlegroups.com
If the Julia benchmark for printfd is reduced to this:

function printfd(n)
    f = open("/dev/null","w")
    for i = 1:n
        # printf(f, f"%d %d\n", i, i)
        _jl_int_dec(i)
        _jl_int_dec(i)
        write(f, pointer(_jl_digits), _jl_length[1])
    end
    close(f)
end

on my machine the benchmark clocks in at 22 ms, matching C. So at least in principle there's nothing fundamental stopping us from reaching C's speed here.

I've also investigated PyPy's printf performance a bit. Here's the original PyPy blog post that talked about being faster than C. My basic test harness for benchmarking python code:

from __future__ import print_function
import time

def test(n):
    pass  # code to be benchmarked

tmin = float('inf')
for i in xrange(5):
    t = time.time()
    l = test(100000)
    t = time.time()-t
    if t < tmin: tmin = t
print("Time: " + str(tmin*1000) + " ms")

PyPy's performance for just generating a formatted string is indeed impressive:

def test(n):
    for i in xrange(n):
        "%d %d" % (i, i)

Time: 10.8501911163 ms

This is half the time C takes and half of what Julia takes to even just decode an integer twice into a static buffer! How can that much speedup be possible? I got a little suspicious — the benchmark formats the same number twice. Is it possible that PyPy is only decoding the number once? What happens if we only format the number once?

def test(n):
    for i in xrange(n):
        "%d" % (i)

Time: 9.82618331909 ms

Hmm. Decoding only once doesn't take half the time — it benchmarks at only a millisecond less, which could be accounted for by writing half as many bytes into the string buffer. What if we format two different numbers?

def test(n):
    for i in xrange(n):
        "%d %d" % (i, 3*i)

Time: 20.9879875183 ms

Well, isn't that interesting. Formatting two different numbers takes about the same time as C or Julia for decoding two integers. (Just decoding an integer twice in Julia takes 18.6 ms; for some reason sprintf is slower than fprintf in C.)

So it looks like the PyPy JIT may recognize that it's going to do the same integer decoding work twice and does it only once. That's impressive, but not very helpful for realistic uses of printf: you usually want to print different numbers, not the same number twice.
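For intuition, here's a hypothetical Python rendering of that trick (my own illustration, not PyPy internals): when the JIT can prove both arguments are the same value, the expensive decode can be hoisted and done once.

```python
def format_pair(i, j):
    # general case: two independent integer-to-decimal decodes
    return str(i) + " " + str(j)

def format_pair_same(i):
    # what "%d %d" % (i, i) can collapse to when both arguments
    # are provably identical: decode once, reuse the digits
    s = str(i)
    return s + " " + s
```

The output is identical either way; only the amount of decoding work differs.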

What we've tested so far isn't quite the benchmark that we're using for C and Julia right now: the difference there is that we're actually printing the result to an output stream (/dev/null), rather than just formatting a string and then not doing anything with it. What happens if we try printing the formatted value to an output stream in PyPy?

def test(n):
    f = open("/dev/null","w")
    for i in xrange(n):
        print("%d %d" % (i, i), file=f)

Time: 181.813955307 ms

There goes the performance. So PyPy is clever about *generating* formatted string objects fast, but when it comes to actually printing them, both C and Julia handily beat PyPy: Julia takes 105 ms and C takes 25 ms for this benchmark. (I realized that I wasn't using longs in the C benchmark and wasn't printing newlines, both of which have a small but noticeable impact, raising the runtime from 22 ms.)

Therefore, for the problem of printing a series of pairs of generally distinct numbers to an output stream, and other realistic uses of printf, we can stop worrying about PyPy and focus on matching C's performance. PyPy's ability to format the same number twice in half the time C takes appears to be a matter of (cleverly) only doing it once. The fact that this trick just happens to be applicable to the format example used in the blog post makes me wonder whether that post was intentionally misleading: the performance gain doesn't come from specialization on the format string, but from specialization on the two values to be formatted being identical. Can the authors really not have known this?

All this leads me to conclude that while jitting specialized code for more complex printf formats might be a performance boost when the format string is complicated, there doesn't seem to be any gain for simple cases like "%d". This gives me cause to reconsider the entire strategy I've taken for implementing printf in Julia. Before scrapping it, however, I'd like to see if real gains can actually be had from printf format specialization or not. The evidence I thought this PyPy blog post provided for very impressive performance gains from specializing on even simple printf formats has completely evaporated.

Jeff Bezanson

Jan 24, 2012, 2:31:37 PM1/24/12
to juli...@googlegroups.com
Very interesting. Good detective work. A fine example of playing
benchmark games. So it turns out my I/O code is not so horrible and
slow after all :)

Looking at all the printf code, it turns out most of the work (for
both Stefan and the CPU) is decoding numbers and manipulating digits
(and doing I/O). Especially since our %d and %f both handle all kinds
of numbers. I didn't realize at first how much work needed to be done
dynamically on the generated digits.
The parsing portion of it is only about 100 lines. Indeed, maybe the
code generation is not worth it. I also worry that people might build
format strings at run time, in which case our performance would be
truly awful. And then there's the calling syntax issue.

It's worth doing the experiment of repeating the "%d" benchmark with
parsing done at run time, and seeing how much difference it makes.

Stefan Karpinski

Jan 24, 2012, 3:56:09 PM1/24/12
to juli...@googlegroups.com
> Very interesting. Good detective work. A fine example of playing
> benchmark games. So it turns out my I/O code is not so horrible and
> slow after all :)

Thanks. I never thought your io code was horrible or slow — I just wanted to make it so cheap to call write lots of times that it becomes a reasonable way to build output. I was even considering making write(io,byte) an intrinsic so that it *always* generates inline code. I still might want to do something like this, because we so commonly build up strings by printing them in very small pieces, which is not terribly efficient compared to writing bytes directly into a buffer. The appealing idea is to make many small write calls cost the same as writing bytes directly into a buffer.
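A minimal sketch of that buffering idea, written in Python for illustration (the class and names are mine, not Julia's io code): byte-level writes append to an in-memory buffer, and only a full buffer or an explicit flush touches the underlying stream.

```python
import io

class BufferedOut:
    def __init__(self, stream, size=8192):
        self.stream = stream
        self.size = size
        self.buf = bytearray()

    def write_byte(self, b):
        # common case: a plain append, no call into the stream per byte
        self.buf.append(b)
        if len(self.buf) >= self.size:
            self.flush()

    def flush(self):
        if self.buf:
            self.stream.write(bytes(self.buf))
            self.buf.clear()

out = io.BytesIO()
bo = BufferedOut(out)
for byte in b"123 456\n":
    bo.write_byte(byte)
bo.flush()
# out.getvalue() == b"123 456\n"
```

Making the append path inline code is exactly what an intrinsic would buy: the per-byte cost becomes a bounds check and a store.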
 
> Looking at all the printf code, it turns out most of the work (for both Stefan and the CPU) is decoding numbers and manipulating digits (and doing I/O). Especially since our %d and %f both handle all kinds of numbers. I didn't realize at first how much work needed to be done dynamically on the generated digits. The parsing portion of it is only about 100 lines.

Yeah, writing the bit that parses printf formats is definitely not the hard part. That took me all of an hour or two — I'm good at writing little C-style string parsing routines and the printf "language" is not terribly hard to parse.

> Indeed, maybe the code generation is not worth it. I also worry that people might build format strings at run time, in which case our performance would be truly awful. And then there's the calling syntax issue.

If we can deal with the calling syntax issue, I think the building format strings at runtime issue goes away: as long as the syntax doesn't tempt people to do things like this:

for i = 1:100000
  printf("%d %d\n", i, i)
end

where each loop iteration ends up doing the code gen over again, then it's not a problem. If someone actually has a use case for dynamically constructed format strings they should do something like this:

fmt = f"%$(n)d %$(n)d\n"
for i = 1:100000
  fmt(i,i)
end

I guess what I'm saying is that the former case happening accidentally is the real worry, not the latter case. There is no realistic use case for actually generating *lots* of printf formats dynamically; there is a realistic use case for generating a few printf formats dynamically and then using them lots of times.
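In Python terms the pattern would look something like this (a hypothetical sketch; `compile_fmt` is an invented name that here just wraps %-formatting, standing in for real code generation):

```python
def compile_fmt(fmt):
    # parse/specialize the format once; return a callable that
    # only does the per-call formatting work
    def apply(*args):
        return fmt % args
    return apply

fmt = compile_fmt("%d %d\n")
lines = [fmt(i, i) for i in range(3)]
# lines == ["0 0\n", "1 1\n", "2 2\n"]
```

The one-time cost lives in `compile_fmt`; the loop only pays for formatting.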

> It's worth doing the experiment of repeating the "%d" benchmark with
> parsing done at run time, and seeing how much difference it makes.

Yeah. I have to do some more investigation to figure out what's killing our performance right now.

I also have been mulling over the idea of using a "template buffer" for padded printf formats. The idea is that fields with width specifications almost always indicate that you *expect* the output to fit into the specified width. So you can pre-fill a template buffer, decode numbers directly into their fields, and call write once on the entire filled-in template at the end. If a value doesn't fit into its field, you have to do something slower, but that should be relatively uncommon.
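A rough Python sketch of the template-buffer idea (names and layout are mine, assuming right-justified decimal fields like "%6d %6d\n"): pre-fill the static characters and padding once, then overwrite each field in place.

```python
def make_template(widths, sep=" ", tail="\n"):
    # pre-fill: space-padded fields joined by sep, ending with tail
    buf = bytearray()
    ends = []
    for k, w in enumerate(widths):
        if k:
            buf += sep.encode()
        buf += b" " * w
        ends.append(len(buf))  # index one past the field's last byte
    buf += tail.encode()
    return buf, ends

def fill(buf, ends, widths, values):
    # overwrite each field in place, right-justified
    for end, w, v in zip(ends, widths, values):
        digits = b"%d" % v  # stand-in for the digit decoder
        if len(digits) > w:
            raise OverflowError("value wider than field: take the slow path")
        buf[end - w:end] = b" " * (w - len(digits)) + digits
    return buf

buf, ends = make_template([6, 6])
fill(buf, ends, [6, 6], (42, 12345))
# bytes(buf) == b"    42  12345\n"
```

The common case is then one write per line with no intermediate string allocation; only an overflowing value forces the slower general path.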

Jeff Bezanson

Jan 24, 2012, 4:32:59 PM1/24/12
to juli...@googlegroups.com
I would expect to see code like

fmt = "%$(n)d"
for i = 1:k
    printf(fmt, i)
end

Stefan Karpinski

Jan 24, 2012, 4:37:20 PM1/24/12
to juli...@googlegroups.com
I agree — I think that's reasonable code since the format function is generated once and used many times.