String concatenation benchmarking weirdness

Rotwang

Jan 11, 2013, 2:03:41 PM
Hi all,

the other day I 2to3'ed some code and found it ran much slower in 3.3.0
than 2.7.2. I fixed the problem but in the process of trying to diagnose
it I've stumbled upon something weird that I hope someone here can
explain to me. In what follows I'm using Python 2.7.2 on 64-bit Windows
7. Suppose I do this:

from timeit import timeit

# find out how the time taken to append a character to the end of a byte
# string depends on the size of the string

results = []
for size in range(0, 10000001, 100000):
    results.append(timeit("y = x + 'a'",
                          setup = "x = 'a' * %i" % size, number = 1))

If I plot results against size, what I see is that the time taken
increases approximately linearly with the size of the string, with the
string of length 10000000 taking about 4 milliseconds. On the other
hand, if I replace the statement to be timed with "x = x + 'a'" instead
of "y = x + 'a'", the time taken seems to be pretty much independent of
size, apart from a few spikes; the string of length 10000000 takes about
4 microseconds.

I get similar results with strings (but not bytes) in 3.3.0. My guess is
that this is some kind of optimisation that treats strings as mutable
when carrying out operations that result in the original string being
discarded. If so it's jolly clever, since it knows when there are other
references to the same string:

timeit("x = x + 'a'", setup = "x = y = 'a' * %i" % size, number = 1)
# grows linearly with size

timeit("x = x + 'a'", setup = "x, y = 'a' * %i, 'a' * %i"
       % (size, size), number = 1)
# stays approximately constant

It also can see through some attempts to fool it:

timeit("x = ('' + x) + 'a'", setup = "x = 'a' * %i" % size, number = 1)
# stays approximately constant

timeit("x = x*1 + 'a'", setup = "x = 'a' * %i" % size, number = 1)
# stays approximately constant

Is my guess correct? If not, what is going on? If so, is it possible to
explain to a programming noob how the interpreter does this? And is
there a reason why it doesn't work with bytes in 3.3?


--
I have made a thing that superficially resembles music:

http://soundcloud.com/eroneity/we-berated-our-own-crapiness

Ian Kelly

Jan 11, 2013, 3:16:48 PM
to Python
Basically, yes. You can find the discussion behind that optimization at:

http://bugs.python.org/issue980695

It knows when there are other references to the string because all
objects in CPython are reference-counted. It also works despite your
attempts to "fool" it because after evaluating the first operation
(which is easily optimized to return the string itself in both cases),
the remaining part of the expression is essentially "x = TOS + 'a'",
where x and the top of the stack are the same string object, which is
the same state the original code reaches after evaluating just the x.
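
A minimal sketch of the reference-counting point, assuming CPython (the
numbers reported by sys.getrefcount are an implementation detail, and
other implementations may not reference-count at all). The function name
refcounts is made up for illustration. It only shows that binding two
names to one string gives a higher count than building two equal strings
separately, which is the kind of information the interpreter has
available when deciding whether the left operand is shared.

```python
import sys

def refcounts(size=1000):
    # One string object bound to two names, as in "x = y = 'a' * size".
    x = y = 'a' * size
    shared = sys.getrefcount(x)

    # Two equal but distinct string objects, one name each,
    # as in "x, y = 'a' * size, 'a' * size".
    p, q = 'a' * size, 'a' * size
    separate = sys.getrefcount(p)

    return shared, separate, x is y, p is q

shared, separate, same_object, also_same = refcounts()
# The absolute values vary between CPython versions; the difference
# between the two cases is what matters here.
print(shared - separate, same_object, also_same)
```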

The stated use case for this optimization is to make repeated
concatenation more efficient, but note that it is still generally
preferable to use the ''.join() construct, because the optimization is
specific to CPython and may not exist for other Python
implementations.
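
As a rough illustration of the two styles (the function names and the
sample data are made up for this sketch), both functions below build the
same string. The first relies on the CPython-specific optimization to
stay fast; the second does not depend on it.

```python
def build_with_concat(chunks):
    # Repeated +=: fast on CPython when the in-place optimization
    # applies, but each step may copy the whole string elsewhere.
    s = ''
    for chunk in chunks:
        s += chunk
    return s

def build_with_join(chunks):
    # Collect the pieces and join them in a single call; this does not
    # rely on the concatenation optimization.
    return ''.join(chunks)

chunks = ['ab'] * 1000
print(build_with_concat(chunks) == build_with_join(chunks))
```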

> And is there a reason why it doesn't work with bytes in 3.3?

No idea. Probably just never got implemented due to a lack of demand.

Rotwang

Jan 11, 2013, 3:51:27 PM
On 11/01/2013 20:16, Ian Kelly wrote:
> On Fri, Jan 11, 2013 at 12:03 PM, Rotwang <sg...@hotmail.co.uk> wrote:
>> Hi all,
>>
>> the other day I 2to3'ed some code and found it ran much slower in 3.3.0 than
>> 2.7.2. I fixed the problem but in the process of trying to diagnose it I've
>> stumbled upon something weird that I hope someone here can explain to me.
>>
>> [stuff about timings]
>>
>> Is my guess correct? If not, what is going on? If so, is it possible to
>> explain to a programming noob how the interpreter does this?
>
> Basically, yes. You can find the discussion behind that optimization at:
>
> http://bugs.python.org/issue980695
>
> It knows when there are other references to the string because all
> objects in CPython are reference-counted. It also works despite your
> attempts to "fool" it because after evaluating the first operation
> (which is easily optimized to return the string itself in both cases),
> the remaining part of the expression is essentially "x = TOS + 'a'",
> where x and the top of the stack are the same string object, which is
> the same state the original code reaches after evaluating just the x.

Nice, thanks.


> The stated use case for this optimization is to make repeated
> concatenation more efficient, but note that it is still generally
> preferable to use the ''.join() construct, because the optimization is
> specific to CPython and may not exist for other Python
> implementations.

The slowdown in my code was caused by a method that built up a string of
bytes by repeatedly using +=, before writing the result to a WAV file.
My fix was to replace the bytes string with a bytearray, which seems
about as fast as the rewrite I just tried with b''.join. Do you know
whether the bytearray method will still be fast on other implementations?
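
For reference, the two approaches look roughly like this. This is only a
sketch: the function names, the sample values and the 16-bit
little-endian packing are assumptions for illustration, not the actual
WAV-writing code.

```python
import struct

def frames_bytearray(samples):
    # bytearray is mutable, so += extends the same buffer in place.
    buf = bytearray()
    for sample in samples:
        buf += struct.pack('<h', sample)  # assumed: signed 16-bit, little-endian
    return bytes(buf)

def frames_join(samples):
    # Build all the pieces first, then join them in one call.
    return b''.join(struct.pack('<h', sample) for sample in samples)

samples = [0, 1000, -1000, 32767, -32768]
print(frames_bytearray(samples) == frames_join(samples))
```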

wxjm...@gmail.com

Jan 12, 2013, 3:38:32 AM
from timeit import timeit, repeat

size = 1000

r = repeat("y = x + 'a'", setup = "x = 'a' * %i" % size)
print('1:', r)
r = repeat("y = x + 'é'", setup = "x = 'a' * %i" % size)
print('2:', r)
r = repeat("y = x + 'œ'", setup = "x = 'a' * %i" % size)
print('3:', r)
r = repeat("y = x + '€'", setup = "x = 'a' * %i" % size)
print('4:', r)
r = repeat("y = x + '€'", setup = "x = '€' * %i" % size)
print('5:', r)
r = repeat("y = x + 'œ'", setup = "x = 'œ' * %i" % size)
print('6:', r)
r = repeat("y = é + 'œ'", setup = "é = 'œ' * %i" % size)
print('7:', r)
r = repeat("y = é + 'œ'", setup = "é = '€' * %i" % size)
print('8:', r)



>c:\python32\pythonw -u "vitesse3.py"
1: [0.3603178435286996, 0.42901157137281515, 0.35459694357592086]
2: [0.3576409223543202, 0.4272010951864649, 0.3590055732104662]
3: [0.3552022735516487, 0.4256544908828328, 0.35824546465278573]
4: [0.35488168890607774, 0.4271707696118834, 0.36109528098614074]
5: [0.3560675370237849, 0.4261538782668417, 0.36138160167082134]
6: [0.3570182634788317, 0.4270155971913008, 0.35770629956705324]
7: [0.3556977225493485, 0.4264969117143753, 0.3645634239700426]
8: [0.35511247834379844, 0.4259628665308437, 0.3580737510097034]
>Exit code: 0
>c:\Python33\pythonw -u "vitesse3.py"
1: [0.3053600256152646, 0.3306491917840535, 0.3044963374976518]
2: [0.36252767208680514, 0.36937298133086727, 0.3685573415262271]
3: [0.7666293438924097, 0.7653473991487574, 0.7630926729867262]
4: [0.7636680712265038, 0.7647586103955284, 0.7631395397838059]
5: [0.44721085450773934, 0.3863234021671369, 0.45664368355696094]
6: [0.44699700013114807, 0.3873974001136613, 0.45167383387335036]
7: [0.4465200615491014, 0.387050034441188, 0.45459690419205856]
8: [0.44760587465455437, 0.3875261853459726, 0.45421212384964704]
>Exit code: 0


The difference between a correct (coherent) unicode handling and ...

jmf

Terry Reedy

Jan 12, 2013, 6:31:09 AM
to pytho...@python.org
By 'correct' Jim means 'speedy', for a subset of string operations*,
rather than 'accurate'. In 3.2 and before, CPython does not handle
extended plane characters correctly on Windows and other narrow builds.
This is, by the way, true of many other languages. For instance, Tcl 8.5
and before (not sure about the new 8.6) does not handle them at all. The
same is true of Microsoft command windows.
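
A small sketch of what 'accurate' means here, assuming it is run on 3.3
or later (the variable names are made up). A character outside the Basic
Multilingual Plane is a single code point, but takes two UTF-16 code
units; a narrow 3.2 build stored strings as such code units, so len()
there reported 2 for the same string.

```python
s = '\U0001F600'  # one code point outside the BMP

# On 3.3 and later, len() counts code points.
code_points = len(s)

# Encoding to UTF-16 shows the surrogate pair that a narrow build
# would have exposed as two separate "characters".
code_units = len(s.encode('utf-16-le')) // 2

print(code_points, code_units)
```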

* Let's try another comparison:

from timeit import timeit
print(timeit("a.encode()", "a = 'a'*10000"))

3.2: 12.1 seconds
3.3: 0.7 seconds

3.3 is 15 times faster!!! (The factor increases with the length of a.)

A fairer comparison is the approximately 120 micro-benchmarks in
Tools/stringbench.py. Here are the results, uncensored, for 3.3.0 and
3.2.3. The script is in the Tools directory of some distributions but
not all (the Windows distribution does not include it). It can be
downloaded from
http://hg.python.org/cpython/file/6fe28afa6611/Tools/stringbench

In Firefox, right-click on the stringbench.py link and choose 'Save link
as...' to save it somewhere you can run it from.

>>>
stringbench v2.0
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)]
2013-01-12 06:17:51.685781
bytes unicode
(in ms) (in ms) % comment
========== case conversion -- dense
0.41 0.43 95.2 ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.42 0.43 95.8 ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.41 0.43 95.8 ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.42 0.43 96.3 ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.83 1.95 94.1 s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.10 0.10 98.7 "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.46 2.44 100.9 dna.count("AACT") (*10)
========== count newlines
0.77 0.75 103.6 ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.30 0.27 110.5 ("A"*1000).find("A") (*1000)
0.45 0.06 750.5 "A" in "A"*1000 (*1000)
0.30 0.27 110.4 ("A"*1000).index("A") (*1000)
0.24 0.22 107.2 ("A"*1000).partition("A") (*1000)
0.33 0.29 116.6 ("A"*1000).rfind("A") (*1000)
0.32 0.29 107.9 ("A"*1000).rindex("A") (*1000)
0.20 0.21 94.1 ("A"*1000).rpartition("A") (*1000)
0.42 0.45 93.4 ("A"*1000).rsplit("A", 1) (*1000)
0.39 0.41 95.9 ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.32 0.27 121.1 ("AB"*1000).find("AB") (*1000)
0.45 0.06 729.5 "AB" in "AB"*1000 (*1000)
0.30 0.27 111.2 ("AB"*1000).index("AB") (*1000)
0.23 0.28 85.0 ("AB"*1000).partition("AB") (*1000)
0.33 0.30 110.6 ("AB"*1000).rfind("AB") (*1000)
0.33 0.30 110.5 ("AB"*1000).rindex("AB") (*1000)
0.22 0.27 83.1 ("AB"*1000).rpartition("AB") (*1000)
0.46 0.47 96.7 ("AB"*1000).rsplit("AB", 1) (*1000)
0.44 0.48 90.9 ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.24 0.29 84.0 "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.26 0.28 92.9 "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.25 0.28 90.0 "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A 0.67 0.0 "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A 0.06 0.0 "A".join("") (*100)
========== join empty string, with 5 character sep
N/A 0.06 0.0 "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
0.87 1.27 68.8 "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.14 1.54 74.0 "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.27 0.37 72.0 "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.32 0.43 75.7 "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A 1.30 0.0 "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A 1.37 0.0 "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
3.25 3.23 100.5 s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
2.79 2.78 100.4 s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
1.98 1.94 102.3 s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
3.24 3.23 100.3 s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
4.26 3.62 117.7 s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
3.23 3.23 100.1 s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.32 2.32 100.1 s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
3.23 3.21 100.8 s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
3.58 3.57 100.4 s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
3.60 3.60 100.0 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
3.60 3.56 101.2 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
0.62 0.58 106.3 ("AB"*300+"C").find("BC") (*1000)
0.92 0.82 111.8 ("AB"*300+"CA").find("CA") (*1000)
0.73 0.33 218.8 "BC" in ("AB"*300+"C") (*1000)
0.61 0.60 101.0 ("AB"*300+"C").index("BC") (*1000)
0.54 0.82 66.4 ("AB"*300+"C").partition("BC") (*1000)
0.66 0.63 104.6 ("C"+"AB"*300).rfind("CA") (*1000)
0.91 0.88 102.3 ("BC"+"AB"*300).rfind("BC") (*1000)
0.65 0.62 105.1 ("C"+"AB"*300).rindex("CA") (*1000)
0.53 0.56 94.5 ("C"+"AB"*300).rpartition("CA") (*1000)
0.75 0.77 96.6 ("C"+"AB"*300).rsplit("CA", 1) (*1000)
0.65 0.67 97.0 ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.89 0.87 102.3 ("A"*1000).find("B") (*1000)
1.03 0.64 159.1 "B" in "A"*1000 (*1000)
0.67 0.68 98.7 ("A"*1000).partition("B") (*1000)
0.87 0.85 102.8 ("A"*1000).rfind("B") (*1000)
0.67 0.68 98.5 ("A"*1000).rpartition("B") (*1000)
0.87 0.87 99.2 ("A"*1000).rsplit("B", 1) (*1000)
0.86 0.85 101.5 ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
1.22 1.16 104.9 ("AB"*1000).find("BC") (*1000)
1.93 2.02 95.2 ("AB"*1000).find("CA") (*1000)
1.37 0.94 145.3 "BC" in "AB"*1000 (*1000)
1.39 2.14 65.1 ("AB"*1000).partition("BC") (*1000)
2.32 2.31 100.7 ("AB"*1000).rfind("BC") (*1000)
1.47 1.44 102.1 ("AB"*1000).rfind("CA") (*1000)
2.26 2.27 99.7 ("AB"*1000).rpartition("BC") (*1000)
2.46 2.45 100.2 ("AB"*1000).rsplit("BC", 1) (*1000)
1.15 1.16 99.1 ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.13 0.12 105.0 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.12 0.12 105.2 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.08 0.10 80.6 "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16 0.18 93.1 "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11 0.13 84.4 "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.39 0.41 94.8 "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
2.02 2.36 85.6 "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.12 3.23 96.6 dna.replace("ATC", "ATT") (*10)
========== replace single character
0.33 0.40 82.4 "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.75 0.86 87.4 "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.41 0.48 86.1 "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.14 0.18 79.3 ("Here are some words. "*2).partition(" ") (*1000)
0.11 0.14 75.1 ("Here are some words. "*2).rpartition(" ") (*1000)
0.35 0.39 90.3 ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.32 0.38 83.9 ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.74 2.02 86.3 "...text...".rsplit("\n") (*10)
1.69 1.97 85.5 "...text...".split("\n") (*10)
1.89 2.55 74.0 "...text...".splitlines() (*10)
========== split newlines
0.35 0.39 88.9 "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.34 0.40 86.4 "this\nis\na\ntest\n".split("\n") (*1000)
0.32 0.40 80.7 "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.28 2.30 99.1 dna.rsplit("ACTAT") (*10)
2.63 2.66 98.9 dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.55 0.69 79.0 "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.58 0.70 82.9 "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.51 2.12 71.4 human_text.rsplit() (*10)
1.51 2.05 73.6 human_text.split() (*10)
========== split whitespace (small)
0.48 0.68 70.1 ("Here are some words. "*2).rsplit() (*1000)
0.48 0.64 74.9 ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.24 0.25 95.9 "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.24 0.25 95.7 "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.23 0.25 95.4 "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.09 0.21 44.1 s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.09 0.12 74.0 "\nHello!".rstrip() (*1000)
0.09 0.12 74.0 "Hello!\n".rstrip() (*1000)
0.09 0.12 71.6 "\nHello!\n".strip() (*1000)
0.09 0.12 73.2 "\nHello!".strip() (*1000)
0.09 0.12 72.9 "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.09 0.13 69.6 "\t \tHello".rstrip() (*1000)
0.09 0.13 72.3 "Hello\t \t".rstrip() (*1000)
0.07 0.08 86.8 "Hello\t \t".strip() (*1000)
========== tab split
0.59 0.65 90.9 GFF3_example.rsplit("\t", 8) (*1000)
0.55 0.59 94.2 GFF3_example.rsplit("\t") (*1000)
0.52 0.57 90.7 GFF3_example.split("\t", 8) (*1000)
0.52 0.57 90.1 GFF3_example.split("\t") (*1000)
108.87 116.31 93.6 TOTAL
>>>
stringbench v2.0
3.2.3 (default, Apr 11 2012, 07:12:16) [MSC v.1500 64 bit (AMD64)]
2013-01-12 06:23:05.994000
bytes unicode
(in ms) (in ms) % comment
========== case conversion -- dense
0.63 3.01 21.0 ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.63 2.90 21.5 ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.84 2.83 29.8 ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.50 3.47 14.3 ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.82 1.75 103.9 s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.09 0.08 115.5 "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.40 2.64 91.1 dna.count("AACT") (*10)
========== count newlines
0.77 0.75 101.6 ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.19 0.18 101.9 ("A"*1000).find("A") (*1000)
0.39 0.05 824.7 "A" in "A"*1000 (*1000)
0.19 0.19 96.3 ("A"*1000).index("A") (*1000)
0.20 0.22 87.5 ("A"*1000).partition("A") (*1000)
0.20 0.20 101.8 ("A"*1000).rfind("A") (*1000)
0.20 0.20 101.2 ("A"*1000).rindex("A") (*1000)
0.18 0.22 82.5 ("A"*1000).rpartition("A") (*1000)
0.41 0.45 91.7 ("A"*1000).rsplit("A", 1) (*1000)
0.42 0.43 99.0 ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.19 0.19 102.3 ("AB"*1000).find("AB") (*1000)
0.39 0.05 781.6 "AB" in "AB"*1000 (*1000)
0.19 0.20 97.9 ("AB"*1000).index("AB") (*1000)
0.23 0.33 71.1 ("AB"*1000).partition("AB") (*1000)
0.20 0.20 101.6 ("AB"*1000).rfind("AB") (*1000)
0.20 0.20 100.1 ("AB"*1000).rindex("AB") (*1000)
0.22 0.31 70.4 ("AB"*1000).rpartition("AB") (*1000)
0.47 0.53 90.0 ("AB"*1000).rsplit("AB", 1) (*1000)
0.45 0.52 85.0 ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.18 0.18 97.6 "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.18 0.18 100.4 "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.18 0.18 97.1 "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A 0.53 0.0 "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A 0.05 0.0 "A".join("") (*100)
========== join empty string, with 5 character sep
N/A 0.05 0.0 "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.02 1.02 99.6 "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.25 1.48 84.4 "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.31 0.25 122.9 "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.36 0.41 88.4 "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A 1.06 0.0 "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A 1.22 0.0 "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
2.52 2.68 94.0 s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
2.35 3.06 76.9 s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
1.55 1.61 96.2 s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
2.51 2.68 94.0 s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
3.57 4.66 76.7 s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
3.23 3.24 99.8 s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.35 2.56 91.7 s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
3.23 3.24 99.8 s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
3.58 3.92 91.4 s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
3.62 3.96 91.4 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
2.89 3.38 85.4 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
0.52 0.52 99.5 ("AB"*300+"C").find("BC") (*1000)
0.69 0.90 76.5 ("AB"*300+"CA").find("CA") (*1000)
0.67 0.37 179.2 "BC" in ("AB"*300+"C") (*1000)
0.51 0.53 96.8 ("AB"*300+"C").index("BC") (*1000)
0.48 0.81 59.3 ("AB"*300+"C").partition("BC") (*1000)
0.55 0.55 101.5 ("C"+"AB"*300).rfind("CA") (*1000)
0.85 0.85 100.0 ("BC"+"AB"*300).rfind("BC") (*1000)
0.55 0.55 100.3 ("C"+"AB"*300).rindex("CA") (*1000)
0.52 0.60 87.1 ("C"+"AB"*300).rpartition("CA") (*1000)
0.78 0.82 95.4 ("C"+"AB"*300).rsplit("CA", 1) (*1000)
0.65 0.72 91.2 ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.77 0.77 100.6 ("A"*1000).find("B") (*1000)
0.98 0.63 155.1 "B" in "A"*1000 (*1000)
0.66 0.66 99.7 ("A"*1000).partition("B") (*1000)
0.77 0.77 100.4 ("A"*1000).rfind("B") (*1000)
0.66 0.66 99.7 ("A"*1000).rpartition("B") (*1000)
0.88 0.88 100.4 ("A"*1000).rsplit("B", 1) (*1000)
0.88 0.87 101.2 ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
1.19 1.21 98.1 ("AB"*1000).find("BC") (*1000)
1.79 2.51 71.2 ("AB"*1000).find("CA") (*1000)
1.28 1.08 119.1 "BC" in "AB"*1000 (*1000)
1.10 2.11 52.1 ("AB"*1000).partition("BC") (*1000)
2.37 2.37 100.0 ("AB"*1000).rfind("BC") (*1000)
1.36 1.36 100.5 ("AB"*1000).rfind("CA") (*1000)
2.25 2.26 99.9 ("AB"*1000).rpartition("BC") (*1000)
2.38 2.62 90.7 ("AB"*1000).rsplit("BC", 1) (*1000)
1.18 1.30 90.1 ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.12 0.32 37.1 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.12 0.30 37.9 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.08 0.09 90.3 "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16 0.19 82.2 "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11 0.12 98.3 "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.40 0.58 67.9 "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.95 2.13 91.7 "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
2.93 3.25 90.3 dna.replace("ATC", "ATT") (*10)
========== replace single character
0.25 0.26 96.6 "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.73 1.01 72.0 "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.30 0.34 89.0 "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.12 0.13 93.3 ("Here are some words. "*2).partition(" ") (*1000)
0.11 0.11 98.8 ("Here are some words. "*2).rpartition(" ") (*1000)
0.32 0.37 86.5 ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.32 0.33 96.9 ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.76 2.19 80.5 "...text...".rsplit("\n") (*10)
1.72 2.10 81.9 "...text...".split("\n") (*10)
1.87 2.58 72.4 "...text...".splitlines() (*10)
========== split newlines
0.36 0.34 103.9 "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.35 0.33 105.9 "this\nis\na\ntest\n".split("\n") (*1000)
0.31 0.34 89.7 "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.18 2.34 93.4 dna.rsplit("ACTAT") (*10)
2.50 2.64 94.5 dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.59 0.62 95.3 "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.55 0.59 93.1 "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.54 2.34 65.5 human_text.rsplit() (*10)
1.51 2.22 68.3 human_text.split() (*10)
========== split whitespace (small)
0.46 0.60 76.5 ("Here are some words. "*2).rsplit() (*1000)
0.45 0.51 87.6 ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.18 0.18 97.3 "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.18 0.18 100.1 "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.17 0.18 96.8 "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.11 0.21 52.0 s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.06 0.07 92.1 "\nHello!".rstrip() (*1000)
0.06 0.07 92.2 "Hello!\n".rstrip() (*1000)
0.06 0.07 91.2 "\nHello!\n".strip() (*1000)
0.06 0.07 91.1 "\nHello!".strip() (*1000)
0.06 0.07 91.1 "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.07 0.07 89.4 "\t \tHello".rstrip() (*1000)
0.07 0.07 91.4 "Hello\t \t".rstrip() (*1000)
0.04 0.05 88.7 "Hello\t \t".strip() (*1000)
========== tab split
0.57 0.56 100.8 GFF3_example.rsplit("\t", 8) (*1000)
0.53 0.53 100.7 GFF3_example.rsplit("\t") (*1000)
0.49 0.49 101.2 GFF3_example.split("\t", 8) (*1000)
0.51 0.49 103.5 GFF3_example.split("\t") (*1000)
102.13 125.57 81.3 TOTAL

--
Terry Jan Reedy


Ian Kelly

Jan 12, 2013, 12:34:12 PM
to Python
On Sat, Jan 12, 2013 at 1:38 AM, <wxjm...@gmail.com> wrote:
> The difference between a correct (coherent) unicode handling and ...

This thread was about byte string concatenation, not unicode, so your
rant is not even on-topic here.