unicode() vs. s.decode()

Michael Ströder

Aug 5, 2009, 10:43:09 AM

HI!

These both expressions are equivalent but which is faster or should be used
for any reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

Ciao, Michael.

Jason Tackaberry

Aug 5, 2009, 11:53:56 AM

to python-list

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:

>>> import dis
>>> dis.dis(lambda s: s.decode('utf-8'))
  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE
>>> dis.dis(lambda s: unicode(s, 'utf-8'))
  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE

The presence of LOAD_ATTR in the first form hints that this is probably
going to be slower. Next, actually try it:

>>> import timeit
>>> timeit.timeit('"foobarbaz".decode("utf-8")')
1.698289155960083
>>> timeit.timeit('unicode("foobarbaz", "utf-8")')
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because it's, as you pointed out, cleaner and more readable code.
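
For readers on Python 3, where unicode() and str.decode() no longer exist, the same inspection can be sketched with the nearest analogues, str(b, 'utf-8') and bytes.decode('utf-8'). This is only an illustration under that assumption; the helper name opnames is made up for the example, and opcode names and offsets differ between interpreter versions, so the listing will not match the Python 2 output above.

```python
import dis

def opnames(func):
    """Return the opcode names in a function's bytecode, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]

# Python 3 analogues of s.decode('utf-8') and unicode(s, 'utf-8')
method_form = lambda s: s.decode('utf-8')
builtin_form = lambda s: str(s, 'utf-8')

method_ops = opnames(method_form)
builtin_ops = opnames(builtin_form)

print("method form: ", method_ops)
print("builtin form:", builtin_ops)
```

Comparing the two lists shows which form needs a global name lookup and which needs an attribute (or method) lookup, but it says nothing about how long the call itself takes.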

Cheers,
Jason.

1x7y2z9

Aug 5, 2009, 2:12:45 PM

unicode() has LOAD_GLOBAL, which s.decode() does not. Is it generally
the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led
to your intuition that the former would probably be slower? Or was it
some other intuition?
Of course, the results from timeit are a different thing - I ask about
the intuition in the disassembler output.
Thanks.

John Machin

Aug 5, 2009, 9:31:56 PM

to pytho...@python.org

Jason Tackaberry <tack <at> urandom.ca> writes:
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:

> > These both expressions are equivalent but which is faster or should be used
> > for any reason?
> > u = unicode(s,'utf-8')
> > u = s.decode('utf-8') # looks nicer
>

> It is sometimes non-obvious which constructs are faster than others in
> Python. I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
>
> The first thing to try is to have a look at the bytecode for each:

[snip]

> The presence of LOAD_ATTR in the first form hints that this is probably
> going to be slower. Next, actually try it:
>

> >>> import timeit
> >>> timeit.timeit('"foobarbaz".decode("utf-8")')
> 1.698289155960083
> >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
> 0.53305888175964355
>
> So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

HTH,
John

Jason Tackaberry

Aug 6, 2009, 10:15:41 AM

to pytho...@python.org

On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> Faster by an enormous margin; attributing this to the cost of attribute lookup
> seems implausible.

Ok, fair point. I don't think the time difference fully registered when
I composed that message.

Testing a global access (LOAD_GLOBAL) versus an attribute access on a
global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about
40% slower than the former. So that certainly doesn't account for the
difference.
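
One way such a comparison might be run is sketched below, assuming Python 3.5 or later for the globals argument of timeit. The names namespace and obj are illustrative, and the ratio will differ between interpreter versions and machines, so the 40% figure should not be expected to reproduce exactly.

```python
import math
import timeit

# "obj" lives in the globals handed to timeit, so the timed statement
# performs a global name lookup rather than a local one.
namespace = {"obj": math}

# Bare global access versus global access followed by an attribute access.
t_global = min(timeit.repeat("obj", globals=namespace, number=200_000, repeat=3))
t_attr = min(timeit.repeat("obj.pi", globals=namespace, number=200_000, repeat=3))

print("global only:        %.4f s" % t_global)
print("global + attribute: %.4f s" % t_attr)
print("ratio:              %.2f" % (t_attr / t_global))
```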

> Suggested further avenues of investigation:
>
> (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
>
> (2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

Very pedagogical of you. :) Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.
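
The registry lookup that s.decode() goes through can be observed from Python code with codecs.lookup(). The sketch below assumes Python 3 and only shows how different spellings resolve to one codec; it does not reproduce the C-level shortcut in unicodeobject.c described above.

```python
import codecs

# Different spellings are resolved by the codec registry to one codec.
for spelling in ("utf-8", "utf8", "utf_8", "UTF-8", "cp1252"):
    info = codecs.lookup(spelling)
    print(spelling, "->", info.name)

# Canonical codec name for each UTF-8 spelling.
canonical = {s: codecs.lookup(s).name for s in ("utf-8", "utf8", "utf_8", "UTF-8")}
```

Every spelling that is not handled by a shortcut has to go through this kind of normalisation and alias resolution, which is where the extra time is spent.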

Cheers,
Jason.

Thorsten Kampe

Aug 6, 2009, 10:32:34 AM

* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)

"decode" was added in Python 2.2 for the sake of symmetry to encode().
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same. I don't think any measurable speed increase will be
noticeable between those two.

Thorsten

Michael Ströder

Aug 6, 2009, 12:26:09 PM

Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> These both expressions are equivalent but which is faster or should be
>> used for any reason?
>>
>> u = unicode(s,'utf-8')
>>
>> u = s.decode('utf-8') # looks nicer
>
> "decode" was added in Python 2.2 for the sake of symmetry to encode().

Yes, and I like the style. But...

> It's essentially the same as unicode() and I wouldn't be surprised if it
> is exactly the same.

Did you try?

> I don't think any measurable speed increase will be noticeable between
> those two.

Well, seems not to be true. Try yourself. I did (my console has UTF-8 as charset):

Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(1000000)
7.2721178531646729
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000000)
7.1302499771118164
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.3726329803466797
>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000000)
1.8622009754180908
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.651669979095459
>>>

Comparing again the two best combinations:

>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
17.23644495010376
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')
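
A comparable measurement on Python 3 might look like the sketch below, with str(b, enc) standing in for unicode(s, enc) and bytes.decode(enc) for s.decode(enc). The loop count and the variable names are arbitrary choices for the example, and the Python 2.6 numbers above should not be expected to carry over.

```python
import timeit

data = "äöüÄÖÜß".encode("utf-8")
spellings = ["utf-8", "utf8", "utf_8", "UTF-8"]

# Time both call forms for every spelling of the encoding name.
results = {}
for enc in spellings:
    ns = {"data": data, "enc": enc}
    results[enc] = {
        "bytes.decode": timeit.timeit("data.decode(enc)", globals=ns, number=100_000),
        "str()": timeit.timeit("str(data, enc)", globals=ns, number=100_000),
    }

for enc, timings in results.items():
    print("%-6s decode: %.4f s   str(): %.4f s"
          % (enc, timings["bytes.decode"], timings["str()"]))
```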

Ciao, Michael.

Thorsten Kampe

Aug 6, 2009, 2:05:52 PM

* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)

> Thorsten Kampe wrote:
> > * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)

> > I don't think any measurable speed increase will be noticeable
> > between those two.
>
> Well, seems not to be true. Try yourself. I did (my console has UTF-8 as charset):
>
> Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
> [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import timeit
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(1000000)
> 7.2721178531646729
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000000)
> 7.1302499771118164
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
> 8.3726329803466797
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000000)
> 1.8622009754180908
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
> 8.651669979095459
> >>>
>
> Comparing again the two best combinations:
>
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Thorsten

Steven D'Aprano

Aug 6, 2009, 3:17:30 PM

On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:

> > That is significant! So the winner is:
> >
> > unicode('äöüÄÖÜß','utf-8')
>
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?

>>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>>> t1.timeit(number=1)
5.6751680374145508
>>> t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

--
Steven

Michael Ströder

Aug 6, 2009, 9:25:03 PM

Thorsten Kampe wrote:
> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)

>>>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
>> 17.23644495010376
>>>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
>> 72.087096929550171
>>
>> That is significant! So the winner is:
>>
>> unicode('äöüÄÖÜß','utf-8')
>
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.

John Machin

Aug 6, 2009, 10:01:24 PM

to pytho...@python.org

Jason Tackaberry <tack <at> urandom.ca> writes:

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

Why does consulting the codec registry take so long,
and can this be improved?

Mark Lawrence

Aug 7, 2009, 3:04:51 AM

to pytho...@python.org

I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
    for c in 'äöüÄÖÜß':
        u = unicode(c, 'utf-8')
        ...
Yes?

--
Kindest regards.

Mark Lawrence.

Steven D'Aprano

Aug 7, 2009, 3:25:44 AM

On Fri, 07 Aug 2009 08:04:51 +0100, Mark Lawrence wrote:

> I believe that the comment "these benchmarks are meaningless" refers to
> the length of the strings being used in the tests. Surely something
> involving thousands or millions of characters is more meaningful? Or to
> go the other way, you are unlikely to write for c in 'äöüÄÖÜß':
> u = unicode(c, 'utf-8')
> ...
> Yes?

There are all sorts of potential use-cases. A day or two ago, somebody
posted a question involving tens of thousands of lines of tens of
thousands of characters each (don't quote me, I'm going by memory). On
the other hand, it doesn't require much imagination to think of a use-
case where there are millions of lines each of a dozen or so characters,
and you want to process it line by line:

noun: cat
noun: dog
verb: café
...

As always, before optimizing, you should profile to be sure you are
actually optimizing and not wasting your time.
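
One way to follow that advice with the standard library is cProfile. The sketch below assumes Python 3; the functions decode_lines and process and the sample data are invented for the illustration.

```python
import cProfile
import io
import pstats

def decode_lines(lines):
    """Decode every raw line from UTF-8 bytes to text."""
    return [line.decode("utf-8") for line in lines]

def process(lines):
    """Stand-in for a real program: decode, then do some work on the text."""
    words = decode_lines(lines)
    return sum(len(word) for word in words)

# Many short lines, similar to the word-list use case described above.
lines = [("noun: café %d" % n).encode("utf-8") for n in range(10_000)]

profiler = cProfile.Profile()
profiler.enable()
total = process(lines)
profiler.disable()

# Collect the report in a string so it can be inspected or printed.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The report shows how much of the total time is spent inside decode_lines, which is the number that decides whether changing the decoding call is worth the effort.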

--
Steven

Thorsten Kampe

Aug 7, 2009, 6:00:42 AM

* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

Thorsten

Thorsten Kampe

Aug 7, 2009, 6:12:32 AM

* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)

Again: if you think decoding "äöüÄÖÜß" one million times is a real world
use case for your module then go for unicode(). Otherwise the time you
spent benchmarking artificial cases like this is just wasted time. In
real life people won't even notice whether an application takes one or
two minutes to complete.

Use whatever you prefer (decode() or unicode()). If you experience
performance bottlenecks when you're done, test whether changing decode()
to unicode() makes a difference. /That/ is relevant.

Thorsten

garabik-ne...@kassiopeia.juls.savba.sk

Aug 7, 2009, 7:49:05 AM

Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)

>> What if you're writing a loop which takes one million different lines of
>> text and decodes them once each?
>>
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.6822888851165771
>>
>> Seems like a pretty meaningful difference to me.
>
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.
>

For a real-life example, I have often a file with one word per line, and
I run python scripts to apply some (sometimes fairly trivial)
transformation over it. REAL example, reading lines with word, lemma,
tag separated by tabs from stdin and writing word into stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache, I hope
the comments are self-explanatory)

no unicode
user 0m2.380s

decode('utf-8'), encode('utf-8')
user 0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user 0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user 0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user 0m2.880s

python3.1
user 0m1.560s

Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run
the transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) bad performance of the codecs wrapper (I expected it to be on par with
unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) good performance of python3.1 (utf-8 locale)

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

alex23

Aug 7, 2009, 9:53:22 AM

Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:

> I don't think any measurable speed increase will be
> noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.

Thorsten Kampe

Aug 7, 2009, 11:13:07 AM

* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))

Bollocks. Please note the word "noticeable". "noticeable" as in
recognisable as in reasonably experiencable or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when
benchmarking more or less randomly generated "one million different
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

Thorsten

garabik-ne...@kassiopeia.juls.savba.sk

Aug 7, 2009, 1:41:38 PM

Thorsten Kampe <thor...@thorstenkampe.de> wrote:

> lines". That *is* *exactly* nothing.
>
> Another guy claims he gets times between 2.9 and 6.2 seconds when
> running decode/unicode in various manifestations over "18 million

over a sample of 600000 words (sorry for not being able to explain
myself clearly enough so that everyone understands),
while my current project is 18e6 words; that is, the overall running time
will be 87 vs. 186 seconds, which is fairly noticeable.

> words" (or is it 600 million?) and says "the differences are pretty
> significant".

600 million is the size of the whole corpus, that translates to
48 minutes vs. 1h43min. That already is a huge difference (going to
lunch during noon or waiting another hour until it runs over - and
you can bet it is _very_ noticeable when I am hungry :-)).

With 9 different versions of the corpus (that is, what we are really
using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15
hours. Being able to re-run the whole corpus generation in one working
day (and then go on with the next issues) vs. working overtime or
delivering the corpus one day later is a huge difference. Like, being
one day behind the schedule.
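
The extrapolation behind these figures can be written out as a short sketch. It assumes the running time scales linearly with the number of words and uses the rounded sample timings of 2.9 and 6.2 seconds; the constant and function names are made up for the illustration.

```python
SAMPLE_WORDS = 600_000   # size of the sample that was timed
FAST = 2.9               # seconds, unicode(line, 'utf-8') variant (2.880 s, rounded)
SLOW = 6.2               # seconds, codecs reader/writer variant (6.180 s, rounded)

def extrapolate(sample_seconds, words, versions=1):
    """Scale a sample timing linearly to a larger corpus."""
    return sample_seconds * words / SAMPLE_WORDS * versions

project = (extrapolate(FAST, 18e6), extrapolate(SLOW, 18e6))
corpus = (extrapolate(FAST, 600e6), extrapolate(SLOW, 600e6))
all_versions = (extrapolate(FAST, 600e6, 9), extrapolate(SLOW, 600e6, 9))

print("project:    %.0f s vs. %.0f s" % project)
print("corpus:     %.0f min vs. %.0f min" % tuple(s / 60 for s in corpus))
print("9 versions: %.2f h vs. %.2f h" % tuple(s / 3600 for s in all_versions))
```

Under those assumptions the results come out at roughly 87 vs. 186 seconds, 48 vs. 103 minutes, and 7.25 vs. 15.5 hours, in line with the figures quoted above.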

> I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.

>
> If you increase the number of loops to one million or one billion or
> whatever even the slightest completely negligible difference will occur.
> The same thing will happen if you just increase the corpus of words to a
> million, trillion or whatever. The performance implications of that are
> exactly none.
>

I am not sure I understood that. Must be my English :-)

alex23

Aug 7, 2009, 1:45:29 PM

garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> I am not sure I understood that. Must be my English :-)

I just parsed it as "blah blah blah I won't admit I'm wrong" and
didn't miss anything substantive.

Steven D'Aprano

Aug 7, 2009, 10:04:58 PM

On Fri, 07 Aug 2009 12:00:42 +0200, Thorsten Kampe wrote:

> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

You think users won't notice a doubling of execution time? Well, that
explains some of the apps I'm forced to use...

A two-second running time for (say) a command-line tool is already
noticeable. A five-second one is *very* noticeable -- long enough to be a
drag, short enough that you aren't tempted to go off and do something
else while you're waiting for it to finish.

--
Steven

Steven D'Aprano

Aug 7, 2009, 11:29:43 PM

On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:

> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.

We agree that in the grand scheme of things, a difference of 2.7 seconds
versus 5.7 seconds is a trivial difference if your entire program takes
(say) 8 minutes to run. You won't even notice it.

But why assume that the program takes 8 minutes to run? Perhaps it takes
8 seconds to run, and 6 seconds of that is the decoding. Then halving
that reduces the total runtime from 8 seconds to 5, which is a noticeable
speed increase to the user, and significant if you then run that program
tens of thousands of times.
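
The arithmetic in that scenario, and in the eight-minute one before it, can be written out as a small sketch. The function name is made up; the numbers are the ones from this thread.

```python
def runtime_after(total, hot_before, hot_after):
    """Total runtime once the hot section changes from hot_before to hot_after seconds."""
    return total - hot_before + hot_after

# 8 s program, 6 s of it decoding, decoding time halved to 3 s
short_run = runtime_after(8.0, 6.0, 3.0)
print(short_run)  # 5.0

# 8 min program in which decoding drops from 5.7 s to 2.7 s
long_run = runtime_after(480.0, 5.7, 2.7)
print("%.1f s instead of 480.0 s" % long_run)
```

The same three seconds are a saving of well over a third in the first case and well under one percent in the second, which is why the share of time spent decoding matters more than the absolute difference.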

The Python dev team spend significant time and effort to get improvements
of the order of 10%, and you're pooh-poohing an improvement of the order
of 100%. By all means, remind people that premature optimization is a
waste of time, but it's possible to take that attitude too far to Planet
Bizarro. At the point that you start insisting, and emphasising, that a
three second time difference is "*exactly*" zero, it seems to me that
this is about you winning rather than you giving good advice.

--
Steven

Thorsten Kampe

Aug 8, 2009, 7:16:12 AM

* Steven D'Aprano (08 Aug 2009 03:29:43 GMT)

> On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> > One guy claims he has times between 2.7 and 5.7 seconds when
> > benchmarking more or less randomly generated "one million different
> > lines". That *is* *exactly* nothing.
>
> We agree that in the grand scheme of things, a difference of 2.7 seconds
> versus 5.7 seconds is a trivial difference if your entire program takes
> (say) 8 minutes to run. You won't even notice it.

Exactly.

> But why assume that the program takes 8 minutes to run? Perhaps it takes
> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
> that reduces the total runtime from 8 seconds to 5, which is a noticeable
> speed increase to the user, and significant if you then run that program
> tens of thousands of times.

Exactly. That's why it doesn't make sense to benchmark decode()/unicode()
in isolation - meaning out of the context of your actual program.

> By all means, reminding people that pre-mature optimization is a
> waste of time, but it's possible to take that attitude too far to Planet
> Bizarro. At the point that you start insisting, and emphasising, that a
> three second time difference is "*exactly*" zero,

Exactly. Because it was not generated in a real world use case but by
running a simple loop one million times. Why one million times? Because
by running it "only" one hundred thousand times the difference would
have seemed even less relevant.

> it seems to me that this is about you winning rather than you giving
> good advice.

I already gave good advice:
1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you benchmark then the whole application and not single commands

It's really easy: Michael has working code. With that he can easily
write two versions - one that uses decode() and one that uses unicode().
He can benchmark these with some real world input he often uses by
running it a hundred or a thousand times (even a million if he likes).
Then he can compare the results. I doubt that there will be any
noticeable difference.

Thorsten

Thorsten Kampe

Aug 8, 2009, 8:19:44 AM

* alex23 (Fri, 7 Aug 2009 10:45:29 -0700 (PDT))

Alex, there are still a number of performance optimizations that require
a thorough optimizer like you. Like using short identifiers instead of
long ones. I guess you could easily prove that by comparing "a = 0" to
"a_long_identifier = 0" and running it one hundred trillion times. The
performance gain could easily add up to *days*. Keep us updated.

Thorsten

Thorsten Kampe

Aug 8, 2009, 8:28:54 AM

* garabik-ne...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009
17:41:38 +0000 (UTC))

> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> > If you increase the number of loops to one million or one billion or
> > whatever even the slightest completely negligible difference will
> > occur. The same thing will happen if you just increase the corpus of
> > words to a million, trillion or whatever. The performance
> > implications of that are exactly none.
>
> I am not sure I understood that. Must be my English :-)

I guess you understand me very well and I understand you very well. If
the performance gain you want to prove doesn't show with 600,000 words,
you test again with 18,000,000 words and if that is not impressive
enough with 600,000,000 words. Great.

Or if a million repetitions of your "improved" code don't show the
expected "performance advantage" you run it a billion times. Even
greater. Keep on optimizing.

Thorsten

Michael Ströder

Aug 8, 2009, 9:09:23 AM

Thorsten Kampe wrote:
> * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)

>> But why assume that the program takes 8 minutes to run? Perhaps it takes
>> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
>> that reduces the total runtime from 8 seconds to 5, which is a noticeable
>> speed increase to the user, and significant if you then run that program
>> tens of thousands of times.
>
> Exactly. That's why it doesn't make sense to benchmark decode()/unicode
> () isolated - meaning out of the context of your actual program.

Thorsten, the point is you're too arrogant to admit that making such a general
statement like you did without knowing *anything* about the context is simply
false. So this is not a technical matter. It's mainly an issue with your attitude.

>> By all means, reminding people that pre-mature optimization is a
>> waste of time, but it's possible to take that attitude too far to Planet
>> Bizarro. At the point that you start insisting, and emphasising, that a
>> three second time difference is "*exactly*" zero,
>
> Exactly. Because it was not generated in a real world use case but by
> running a simple loop one million times. Why one million times? Because
> by running it "only" one hundred thousand times the difference would
> have seemed even less relevant.

I was running it one million times to mitigate influences on the timing by
other background processes which is a common technique when benchmarking. I
was mainly interested in the percentage which is indeed significant. The
absolute times also strongly depend on the hardware where the software is
running. So your comment about the absolute times is complete nonsense. I'm
eager that this software should also run with acceptable response times on
hardware much slower than my development machine.

> I already gave good advice:
> 1. don't benchmark
> 2. don't benchmark until you have an actual performance issue
> 3. if you benchmark then the whole application and not single commands

You don't know anything about what I'm doing and what my aim is. So your
general rules don't apply.

> It's really easy: Michael has working code. With that he can easily
> write two versions - one that uses decode() and one that uses unicode().

Yes, I have working code which was originally written before .decode() was
added in Python 2.2. Therefore I wondered whether it would be nice for
readability to replace unicode() by s.decode() since the software does not
support Python versions prior to 2.3 anymore anyway. But one aspect is also
performance and hence my question and testing.

Ciao, Michael.

Michael Fötsch

Aug 8, 2009, 12:02:48 PM

to pytho...@python.org

Michael Ströder wrote:
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)

unicode() becomes *slower* when you try "UTF-8" in uppercase, or an
entirely different codec, say "cp1252":

>>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(1000000)
2.5777881145477295
>>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(1000000)
1.8430399894714355
>>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(1000000)
2.3622498512268066
>>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(1000000)
1.7812771797180176

The reason seems to be that unicode() bypasses codecs.lookup() if the
encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". OTOH,
str.decode() always calls codecs.lookup().

If speed is your primary concern, this will give you even better
performance than unicode():

import codecs

decoder = codecs.lookup("utf-8").decode
for i in xrange(1000000):
    decoder("äöüÄÖÜß")[0]
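
On Python 3 the same idea, resolving the codec once and reusing its decode function, can be sketched as follows. The loop count is arbitrary, and whether this is actually faster than bytes.decode() on a given interpreter is something to measure rather than assume.

```python
import codecs
import timeit

data = "äöüÄÖÜß".encode("utf-8")

# Look the codec up once; its decode function returns (text, bytes_consumed).
decoder = codecs.lookup("utf-8").decode
text, consumed = decoder(data)

ns = {"data": data, "decoder": decoder}
t_cached = timeit.timeit("decoder(data)[0]", globals=ns, number=100_000)
t_method = timeit.timeit("data.decode('utf-8')", globals=ns, number=100_000)
print("cached decoder: %.4f s   bytes.decode: %.4f s" % (t_cached, t_method))
```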

However, there's also a functional difference between unicode() and
str.decode():

unicode() always raises an exception when you try to decode a unicode
object. str.decode() will first try to encode a unicode object using the
default encoding (usually "ascii"), which might or might not work.
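
The behaviour described here is specific to Python 2. As a hedged point of comparison, Python 3 removes the ambiguity: text objects have no decode() method, and asking str() to decode something that is already text raises TypeError. A small sketch, with illustrative variable names:

```python
data = "äöüÄÖÜß".encode("utf-8")
already_text = data.decode("utf-8")

# Decoding an object that is already text is rejected outright.
error = None
try:
    str(already_text, "utf-8")
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(hasattr(already_text, "decode"))
```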

Kind Regards,
M.F.

garabik-ne...@kassiopeia.juls.savba.sk

Aug 8, 2009, 12:16:49 PM

Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> * garabik-ne...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009
> 17:41:38 +0000 (UTC))
>> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>> > If you increase the number of loops to one million or one billion or
>> > whatever even the slightest completely negligible difference will
>> > occur. The same thing will happen if you just increase the corpus of
>> > words to a million, trillion or whatever. The performance
>> > implications of that are exactly none.
>>
>> I am not sure I understood that. Must be my English :-)
>
> I guess you understand me very well and I understand you very well. If

I did not. Really. But then it has been explained to me, so I think I do
now :-)

> the performance gain you want to prove doesn't show with 600,000 words,
> you test again with 18,000,000 words and if that is not impressive
> enough with 600,000,000 words. Great.
>

Huh?
18e6 words is what I am working with _now_. Most of the data is already
collected, there are going to be few more books, but that's all. And the
optimization I was talking about means going home from work one hour
later or earlier. Quite noticeable for me.
600e6 words is the main corpus. Data is already there and waiting to be
processed in some time, once we finish our current project. That is
real life, no thought experiment.

> Or if a million repetitions of your "improved" code don't show the
> expected "performance advantage" you run it a billion times. Even
> greater. Keep on optimizing.

No, we do not have one billion words (yet - I assume you are talking
about American billion - if you are talking about European billion, we
would be masters of the world with a billion word corpus!).
However, that might change once we start collecting www data (which is a
separate project, to be started in a year or two)
Then, we'll do some more optimization because the time differences will
be more noticeable. Easy as that.

Thorsten Kampe

Aug 8, 2009, 1:00:11 PM

* Michael Ströder (Sat, 08 Aug 2009 15:09:23 +0200)

> Thorsten Kampe wrote:
> > * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> >> But why assume that the program takes 8 minutes to run? Perhaps it takes
> >> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
> >> that reduces the total runtime from 8 seconds to 5, which is a noticeable
> >> speed increase to the user, and significant if you then run that program
> >> tens of thousands of times.
> >
> > Exactly. That's why it doesn't make sense to benchmark decode()/unicode
> > () isolated - meaning out of the context of your actual program.
>
> Thorsten, the point is you're too arrogant to admit that making such a general
> statement like you did without knowing *anything* about the context is simply
> false.

I made a general statement to a very general question ("These both
expressions are equivalent but which is faster or should be used for any
reason?"). If you have specific needs or reasons then you obviously
failed to provide that specific "context" in your question.

> >> By all means, reminding people that pre-mature optimization is a
> >> waste of time, but it's possible to take that attitude too far to Planet
> >> Bizarro. At the point that you start insisting, and emphasising, that a
> >> three second time difference is "*exactly*" zero,
> >
> > Exactly. Because it was not generated in a real world use case but by
> > running a simple loop one million times. Why one million times? Because
> > by running it "only" one hundred thousand times the difference would
> > have seemed even less relevant.
>
> I was running it one million times to mitigate influences on the timing by
> other background processes which is a common technique when benchmarking.

Err, no. That is what "repeat" is for and it defaults to 3 ("This means
that other processes running on the same computer may interfere with the
timing. The best thing to do when accurate timing is necessary is to
repeat the timing a few times and use the best time. [...] the default
of 3 repetitions is probably enough in most cases.")

Three times - not one million times. You choose one million times (for
the loop) when the thing you're testing is very fast (like decoding) and
you don't want results in the 0.00000n range. Which is what you asked
for and what you got.
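
The repeat-and-take-the-best approach quoted from the timeit documentation can be sketched like this. The statement being timed is a Python 3 stand-in (bytes.decode), and number=100_000 is an arbitrary choice for the example.

```python
import timeit

NUMBER = 100_000  # loop iterations per measurement
ns = {"data": "äöüÄÖÜß".encode("utf-8")}

# repeat=3 gives three independent measurements of NUMBER iterations each.
runs = timeit.repeat("data.decode('utf-8')", globals=ns, repeat=3, number=NUMBER)

# The minimum is the run least disturbed by other processes.
best = min(runs)
print("runs:", ["%.4f" % r for r in runs])
print("best: %.4f s, about %.3f microseconds per call" % (best, best / NUMBER * 1e6))
```

Here number controls how many times the statement runs inside one measurement, so that a very fast operation produces a readable figure, while repeat controls how many such measurements are taken.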

> > I already gave good advice:
> > 1. don't benchmark
> > 2. don't benchmark until you have an actual performance issue
> > 3. if you benchmark then the whole application and not single commands
>
> You don't know anything about what I'm doing and what my aim is. So your
> general rules don't apply.

See above. You asked a general question, you got a general answer.

> > It's really easy: Michael has working code. With that he can easily
> > write two versions - one that uses decode() and one that uses unicode().
>
> Yes, I have working code which was originally written before .decode() was
> added in Python 2.2. Therefore I wondered whether it would be nice for
> readability to replace unicode() by s.decode() since the software does not
> support Python versions prior to 2.3 anymore anyway. But one aspect is also
> performance and hence my question and testing.

You haven't done any testing yet. Running decode/unicode one million
times in a loop is not testing. If you don't believe me then read at
least Martelli's Optimization chapter in Python in a Nutshell (the
chapter is available via Google books).

Thorsten

Michael Ströder

Aug 8, 2009, 7:42:14 PM

Michael Fötsch wrote:
> If speed is your primary concern, this will give you even better
> performance than unicode():
>
> import codecs
>
> decoder = codecs.lookup("utf-8").decode
> for i in xrange(1000000):
>     decoder("äöüÄÖÜß")[0]

Hmm, that could be interesting. I will give it a try.
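
For anyone who wants to try the same comparison on a current interpreter, here is a rough sketch. It assumes Python 3, where unicode() is spelled str() and the input is a bytes object; the labels, the sample data and the loop count of 100000 are illustrative choices, not anything from the posts above. Each candidate is wrapped in a lambda, so the figures include call overhead and should only be read relative to each other.

```python
import codecs
import timeit

# Sample input: the UTF-8 encoded bytes of the string used in the quoted loop.
data = "äöüÄÖÜß".encode("utf-8")

# The codec's decode function returns a (text, bytes_consumed) tuple.
decoder = codecs.lookup("utf-8").decode

# Python 3 spellings of the three forms discussed in this thread.
candidates = {
    "data.decode('utf-8')": lambda: data.decode("utf-8"),
    "str(data, 'utf-8')": lambda: str(data, "utf-8"),
    "codecs decoder": lambda: decoder(data)[0],
}

# All three forms should produce the same text.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 timings, 100000 loops each (illustrative numbers).
timings = {
    name: min(timeit.repeat(fn, repeat=3, number=100000))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print("%-22s %.4f s" % (name, seconds))
```

The absolute numbers depend on the machine and the interpreter version, so none are quoted here.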

> However, there's also a functional difference between unicode() and
> str.decode():
>
> unicode() always raises an exception when you try to decode a unicode
> object. str.decode() will first try to encode a unicode object using the
> default encoding (usually "ascii"), which might or might not work.

Thanks for pointing that out. So in my case I'd consider that also a plus for
using unicode().
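
The difference can be approximated on Python 3, with loud caveats: Python 3 strings have no decode() method at all, so the helper py2_style_decode below is a hypothetical emulation of the implicit encode step described above, assuming "ascii" as the default encoding. The str(text, encoding) call stands in for unicode(text, encoding).

```python
def py2_style_decode(text, encoding="utf-8"):
    """Hypothetical emulation of Python 2's decode() on a unicode object:
    implicitly encode with the default codec (assumed "ascii"), then decode."""
    return text.encode("ascii").decode(encoding)


# The constructor form refuses an already-decoded string outright.
try:
    str("abc", "utf-8")
    strict_error = None
except TypeError as exc:
    strict_error = type(exc).__name__

# The implicit-encode form happens to work for pure ASCII text ...
ascii_result = py2_style_decode("abc")

# ... and fails only when non-ASCII characters show up.
try:
    py2_style_decode("äöü")
    implicit_error = None
except UnicodeEncodeError as exc:
    implicit_error = type(exc).__name__

print(strict_error, ascii_result, implicit_error)
```

The point of the sketch is that the constructor form fails consistently, while the method form fails only for some inputs, which is harder to catch in testing.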

Ciao, Michael.

Jeroen Ruigrok van der Werven

Aug 9, 2009, 4:41:11 AM

to pytho...@python.org

-On [20090808 20:07], Thorsten Kampe (thor...@thorstenkampe.de) wrote:
>In real life people won't even notice whether an application takes one or
>two minutes to complete.

I think you are quite wrong here.

I have worked with optical engineers who needed to calculate grating numbers
for their lenses. If they can have a calculation program that runs in 1
minute instead of 2 they can effectively double their output during the day
(since they run calculations hundreds to thousands of times a day to get the
best results with minor tweaks).

I think you are too quick to hand-wave that runtime differences of a mere
minute are not noticeable.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーンラウフロックヴァンデルウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
When we have not what we like, we must like what we have...

Steven D'Aprano

Aug 9, 2009, 6:01:25 AM

On Sat, 08 Aug 2009 19:00:11 +0200, Thorsten Kampe wrote:

>> I was running it one million times to mitigate influences on the timing
>> by other background processes which is a common technique when
>> benchmarking.
>
> Err, no. That is what "repeat" is for and it defaults to 3 ("This means
> that other processes running on the same computer may interfere with the
> timing. The best thing to do when accurate timing is necessary is to
> repeat the timing a few times and use the best time. [...] the default
> of 3 repetitions is probably enough in most cases.")

It's useful to look at the timeit module to see what the author(s) think.

Let's start with the repeat() method. In the Timer docstring:

"The repeat() method is a convenience to call timeit() multiple times and
return a list of results."

and the repeat() method's own docstring:

"This is a convenience function that calls the timeit() repeatedly,
returning a list of results. The first argument specifies how many times
to call timeit(), defaulting to 3; the second argument specifies the
timer argument, defaulting to one million."

So it's quite obvious that the module author(s), and possibly even Tim
Peters himself, consider repeat() to be a mere convenience method.
There's nothing you can do with repeat() that can't be done with the
timeit() method itself.

Notice that both repeat() and timeit() methods take an argument to
specify how many times to execute the code snippet. Why not just execute
it once? The module doesn't say, but the answer is a basic measurement
technique: if your clock is accurate to (say) a millisecond, and you
measure a single event as taking a millisecond, then your relative error
is roughly 100%. But if you time 1000 events, and measure the total time
as 1 second, the relative error is now 0.1%.
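
The arithmetic can be written out as a small sketch. The function name relative_error is made up for illustration, and the model is the simple one from the paragraph above: the uncertainty is one clock tick, whatever the length of the measurement.

```python
def relative_error(clock_resolution, measured_total):
    """Relative error when the measurement is uncertain by one clock tick."""
    return clock_resolution / measured_total


RESOLUTION = 0.001  # clock accurate to one millisecond (the example's assumption)

# One event measured as taking one millisecond.
single_event = relative_error(RESOLUTION, 0.001)

# 1000 events measured together as taking one second in total.
thousand_events = relative_error(RESOLUTION, 1.0)

print("single event:    %.1f%%" % (single_event * 100))
print("thousand events: %.1f%%" % (thousand_events * 100))
```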

The authors of the timeit module obviously considered this an important
factor: not only did they allow you to specify the number of times to
execute the code snippet (defaulting to one million, not to one) but they
had this to say:

[quote]
Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [statement]

Options:
-n/--number N: how many times to execute 'statement'
[...]

If -n is not given, a suitable number of loops is calculated by trying
successive powers of 10 until the total time is at least 0.2 seconds.
[end quote]

In other words, when calling the timeit module from the command line, by
default it will choose a value for n that gives a sufficiently small
relative error.
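
The search described in the quoted help text can be sketched in a few lines. This is not the module's own code: pick_loop_count is a hypothetical helper, written for Python 3, that follows the "successive powers of 10 until the total time is at least 0.2 seconds" rule literally.

```python
import timeit


def pick_loop_count(stmt, setup="pass", min_total=0.2):
    """Try 1, 10, 100, ... loops until one timing lasts at least min_total seconds."""
    timer = timeit.Timer(stmt, setup)
    number = 1
    while True:
        total = timer.timeit(number)
        if total >= min_total:
            return number, total
        number *= 10


# Statement under test: a Python 3 spelling of the decode call from this thread.
number, total = pick_loop_count('b"foobarbaz".decode("utf-8")')
print("%d loops took %.3f s" % (number, total))
```

Newer Python 3 releases expose a similar search as Timer.autorange(), so a helper like this is mainly useful for seeing what the command line does behind the scenes.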

It's not an accident that timeit gives you two "count" parameters: the
number of times to execute the code snippet per timing, and the number of
timings. They control (partly) for different sources of error.
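
A minimal sketch of the two counts working together, assuming Python 3 and illustrative values (3 timings of 100000 loops each). A large number keeps the clock's resolution small relative to each measurement, while repeat plus min() guards against a timing that was disturbed by other processes.

```python
import timeit

stmt = 'b"foobarbaz".decode("utf-8")'  # Python 3 spelling of the decode call
number = 100000  # loops per timing: controls clock-resolution error
repeat = 3       # independent timings: controls interference from other processes

# repeat() returns one total (in seconds) per timing, each covering `number` loops.
timings = timeit.repeat(stmt, repeat=repeat, number=number)

# As the quoted documentation suggests, use the best (smallest) time.
best = min(timings)
per_loop = best / number

print("best of %d: %.3f usec per loop" % (repeat, per_loop * 1e6))
```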

--
Steven