fastest string building?

Preston Landers

unread,

Oct 19, 1999, 3:00:00 AM10/19/99

to

I was recently wondering what the fastest way to work with strings
was. For instance, let's say I have a string, and I want to add some
things to the end, or possibly insert them in the middle. If I were to
repeat that thousands or hundreds of thousands of times, what would be
the most efficient syntax to do it?

To my knowledge, there are basically three ways to do it.

1: s = part1 + part2 + part3 + ... partn

2: s = string.join([part1, part2, partn], "")

3: s = "%s%s%s%s" % (part1, part2, part3, part4)

Am I missing anything?

My guess was that method #2 is fastest, followed by 3 and then #1 would
be (by far) the slowest.

I did a quick test (using the unix shell 'time' command) which more or
less confirmed my hypothesis.

Just now, however, I decided to do a more indepth test with the Python
profiler. The results were surprising: method #1 above was fastest,
followed closely by #3, and #2 lagged way back. So, according to this
test, s = part1 + part2 + partn seems to be the fastest way to build a
string out of parts.

Can someone look at my profiling script and see if I made any obvious
errors? Is behavior of strings already well-known?

By the way, the new_profile module below is just the same as the
regular Python profile with a new profiling constant calibrated for my
machine.

thanks,

---Preston

print "This program demonstrates the relative efficiency of different
types of string construction methods in Python."

print "Method 1: s = a + b"
print "Method 2: s = string.join([a, b], '')"
print "Method 3: s = '%s%s' % (a, b)"

import string, time

def method_1(a, b, c, d, e):
return a + b + c + d + e

def method_2(a, b, c, d, e):
return string.join([a, b, c, d, e], "")

def method_3(a, b, c, d, e):
return "%s%s%s%s%s" % (a, b, c, d, e)

def wrapper(method):
## a = str(time.time())
## b = "constant"
c_range = 50
d_range = 50
e_range = 50

for c in range(c_range):
for d in range(d_range):
for e in range(e_range):
# args = (a, b, str(c), str(d), str(e))
args = ("a", "b", "c", "d", "e")
apply(method, args)

import new_profile
import pstats

def __main__():
print "Profiling method_1:"
new_profile.run("wrapper(method_1)", "/tmp/m1.prof")
p = pstats.Stats("/tmp/m1.prof")
p.strip_dirs().sort_stats('time','cum', 'nfl').print_stats
().print_callees()

print "Profiling method_2:"
new_profile.run("wrapper(method_2)", "/tmp/m2.prof")
p = pstats.Stats("/tmp/m2.prof")
p.strip_dirs().sort_stats('time','cum', 'nfl').print_stats
().print_callees()

print "Profiling method_3:"
new_profile.run("wrapper(method_3)", "/tmp/m3.prof")
p = pstats.Stats("/tmp/m3.prof")
p.strip_dirs().sort_stats('time','cum', 'nfl').print_stats
().print_callees()

print "Finished!"

__main__()

--
|| Preston Landers <preston...@my-deja.com> ||

Sent via Deja.com http://www.deja.com/
Before you buy.

Aahz Maruch

unread,

Oct 19, 1999, 3:00:00 AM10/19/99

to

In article <7uieee$nsa$1...@nnrp1.deja.com>,

Preston Landers <preston...@my-deja.com> wrote:
>
> I was recently wondering what the fastest way to work with strings
>was. For instance, let's say I have a string, and I want to add some
>things to the end, or possibly insert them in the middle. If I were to
>repeat that thousands or hundreds of thousands of times, what would be
>the most efficient syntax to do it?
>
>To my knowledge, there are basically three ways to do it.
>
>1: s = part1 + part2 + part3 + ... partn
>
>2: s = string.join([part1, part2, partn], "")
>
>3: s = "%s%s%s%s" % (part1, part2, part3, part4)
>
>Am I missing anything?

cStringIO

>Just now, however, I decided to do a more indepth test with the Python
>profiler. The results were surprising: method #1 above was fastest,
>followed closely by #3, and #2 lagged way back. So, according to this
>test, s = part1 + part2 + partn seems to be the fastest way to build a
>string out of parts.

That's probably true. The tricky part is that "%" gives you a lot of
flexibility and simplicity for building strings, so it's almost always
better if you have a moderately complex string to build.

Problem is this: most of the time when speed becomes an issue, you're
building strings in loops. Once a loop is involved, the copying
overhead of "s=s+part" becomes non-trivial. At that point, either
string.join() or cStringIO become the best solution.
--
--- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)

Neil Schemenauer

unread,

Oct 19, 1999, 3:00:00 AM10/19/99

to

Preston Landers <preston...@my-deja.com> wrote:
>Can someone look at my profiling script and see if I made any obvious
>errors? Is behavior of strings already well-known?

I think you under-estimated the difficulty of benchmarking. :)

The speed of each method depends on the situation. Method 1 is
really bad if you continuously add characters on the end of a
long string:

spam = ''
for i in range(10000):
spam = spam + 'a'

This creates and destroys a string object each time around the
loop. In this case, method 2 or [c]StringIO is much better:

spam = []
for i in range(10000):
spam.append('a')
spam = string.join(spam, '')

I believe this is the primary use of the string.join idiom. I
would never write:

spam = string.join([a, b, c])

but rather:

spam = '%s%s%s%s' % (a, b, c)

or even:

spam = a + b + c

if I am lazy and the strings are not expected to be huge.

Neil

Duncan Booth

unread,

Oct 22, 1999, 3:00:00 AM10/22/99

to

preston...@my-deja.com (Preston Landers) wrote in
<7uieee$nsa$1...@nnrp1.deja.com>:

>Just now, however, I decided to do a more indepth test with the Python
>profiler. The results were surprising: method #1 above was fastest,
>followed closely by #3, and #2 lagged way back. So, according to this
>test, s = part1 + part2 + partn seems to be the fastest way to build a
>string out of parts.
>

>Can someone look at my profiling script and see if I made any obvious
>errors? Is behavior of strings already well-known?
>
>

Your strings are very short, so the actual copying takes an insignificant
proportion of the time. Your method 2 needs to do a lot of extra work,
looking up a global variable, calling a function etc.

Change the assignment to use longer strings in args:
args = ("a"*1000, "b"*1000, "c"*1000, "d"*1000, "e"*1000)
and now on my machine method_3 beats method_1, with method_2 still last.

Now change the multiplier to 10000 (and reduce the number of times round
the loop by a factor of 10) and method_1 comes last, method_3 still wins.

Finally change the definition of method_2 slightly to remove the global
lookup:
def method_2(a, b, c, d, e, join=string.join):
return join([a, b, c, d, e], "")

At last, with 10000 character strings the modified method_2 is the fastest.