On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> These both expressions are equivalent but which is faster or should be used
> for any reason?
> u = unicode(s,'utf-8')
> u = s.decode('utf-8') # looks nicer
It is sometimes non-obvious which constructs are faster than others in Python. I also regularly have these questions, but it's pretty easy to run quick (albeit naive) benchmarks to see.
The first thing to try is to have a look at the bytecode for each:
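The disassembly itself did not survive in this copy of the message. As a rough sketch of the same check, assuming Python 3 (where unicode() no longer exists and str(raw, 'utf-8') is the nearest equivalent), the two spellings can be compared as below. The function names are illustrative only, and exact opcode names differ between interpreter versions.

```python
import dis


def via_constructor(raw):
    # Python 3 stand-in for unicode(s, 'utf-8'): the name 'str' is a global lookup.
    return str(raw, "utf-8")


def via_method(raw):
    # Same spelling as in the thread: 'decode' is looked up on the object.
    return raw.decode("utf-8")


def opnames(func):
    """Return the bytecode instruction names of func, in order."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


constructor_ops = opnames(via_constructor)
method_ops = opnames(via_method)

# Print the full listings for a side-by-side read.
dis.dis(via_constructor)
dis.dis(via_method)
```

The listings only show which lookups happen; they say nothing about how expensive the call behind each lookup is, so timing is still needed.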
So indeed, unicode(s, 'utf-8') is faster by a fair margin.
On the other hand, unless you need to do this in a tight loop several tens of thousands of times, I'd prefer the slower form s.decode('utf-8') because it's, as you pointed out, cleaner and more readable code.
unicode() has LOAD_GLOBAL which s.decode() does not. Is it generally the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led to your intuition that the former would probably be slower? Or was it some other intuition? Of course, the results from timeit are a different thing; I am asking about the intuition drawn from the disassembler output. Thanks.
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> > These both expressions are equivalent but which is faster or should be used
> > for any reason?
> > u = unicode(s,'utf-8')
> > u = s.decode('utf-8') # looks nicer
> It is sometimes non-obvious which constructs are faster than others in
> Python. I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
> The first thing to try is to have a look at the bytecode for each: [snip]
> The presence of LOAD_ATTR in the first form hints that this is probably
> going to be slower. Next, actually try it:
On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> Faster by an enormous margin; attributing this to the cost of attribute lookup
> seems implausible.
Ok, fair point. I don't think the time difference fully registered when I composed that message.
Testing a global access (LOAD_GLOBAL) versus an attribute access on a global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about 40% slower than the former. So that certainly doesn't account for the difference.
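A minimal sketch of that kind of test, assuming Python 3 and using hypothetical names (Namespace, holder, plain_value) that are not from the original message. The ratio it prints will vary by machine and interpreter, so no particular figure is implied.

```python
import timeit


class Namespace:
    """Hypothetical container, used only to have an attribute to look up."""
    value = 1


plain_value = 1
holder = Namespace()

# timeit compiles each statement inside a function, so plain_value and holder
# are resolved as global names there.
t_global = min(timeit.repeat("plain_value", globals=globals(), number=200_000, repeat=5))
t_attr = min(timeit.repeat("holder.value", globals=globals(), number=200_000, repeat=5))

print("global lookup:             %.4f s" % t_global)
print("global + attribute lookup: %.4f s" % t_attr)
print("ratio: %.2f" % (t_attr / t_global))
```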
> Suggested further avenues of investigation:
> (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
Very pedagogical of you. :) Indeed, it looks like the bigger player in the performance difference is the fact that the code path for unicode(s, enc) short-circuits the codec registry for common encodings (which include 'utf-8' specifically), whereas s.decode('utf-8') necessarily consults the codec registry.
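To see what "consulting the codec registry" means for the different spellings, here is a small sketch, assuming Python 3. It only shows that the spellings resolve to the same registry entry and times each one; it does not demonstrate the short-circuit in the Python 2 C code, and no ranking of the spellings is assumed.

```python
import codecs
import timeit

spellings = ["utf-8", "utf8", "utf_8", "UTF-8"]

# Every spelling is normalised by the registry to the same codec entry.
resolved = {spelling: codecs.lookup(spelling).name for spelling in spellings}
print(resolved)

# Time a decode with each spelling; results depend on the interpreter version.
payload = "äöüÄÖÜß".encode("utf-8")
for spelling in spellings:
    seconds = min(timeit.repeat(lambda: payload.decode(spelling), number=100_000, repeat=3))
    print("%-6s %.4f s" % (spelling, seconds))
```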
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
> These both expressions are equivalent but which is faster or should be
> used for any reason?
> u = unicode(s,'utf-8')
> u = s.decode('utf-8') # looks nicer
"decode" was added in Python 2.2 for the sake of symmetry to encode(). It's essentially the same as unicode() and I wouldn't be surprised if it is exactly the same. I don't think any measurable speed increase will be noticeable between those two.
Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> These both expressions are equivalent but which is faster or should be
>> used for any reason?
>> u = unicode(s,'utf-8')
>> u = s.decode('utf-8') # looks nicer
> "decode" was added in Python 2.2 for the sake of symmetry to encode().
Yes, and I like the style. But...
> It's essentially the same as unicode() and I wouldn't be surprised if it
> is exactly the same.
Did you try?
> I don't think any measurable speed increase will be noticeable between
> those two.
Well, seems not to be true. Try yourself. I did (my console has UTF-8 as charset):
Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
> Thorsten Kampe wrote:
> > * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
> > I don't think any measurable speed increase will be noticeable
> > between those two.
> Well, seems not to be true. Try yourself. I did (my console has UTF-8 as charset):
> Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
> [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import timeit
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(1000000)
> 7.2721178531646729
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000000)
> 7.1302499771118164
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
> 8.3726329803466797
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000000)
> 1.8622009754180908
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
> 8.651669979095459
On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:
> > That is significant! So the winner is:
> > unicode('äöüÄÖÜß','utf-8')
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.
What if you're writing a loop which takes one million different lines of text and decodes them once each?
>>> import timeit
>>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>>> t1.timeit(number=1)
5.6751680374145508
>>> t2.timeit(number=1)
2.6822888851165771
Seems like a pretty meaningful difference to me.
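For readers without a Python 2 interpreter, a rough Python 3 analogue of the same experiment is sketched here. It assumes str(chunk, 'utf-8') as the stand-in for unicode(line, 'utf-8') and uses fewer lines than the thread did so that it runs quickly; the function names are illustrative, and the timings it prints are not expected to match the Python 2 figures above.

```python
import timeit

# Smaller than the one million lines in the thread so the sketch runs quickly.
lines = [b"abc" * (n % 100) for n in range(100_000)]


def decode_with_method(chunks):
    """Decode every chunk with bytes.decode()."""
    return [chunk.decode("utf-8") for chunk in chunks]


def decode_with_constructor(chunks):
    """Decode every chunk with the str() constructor."""
    return [str(chunk, "utf-8") for chunk in chunks]


t_method = timeit.timeit(lambda: decode_with_method(lines), number=1)
t_constructor = timeit.timeit(lambda: decode_with_constructor(lines), number=1)
print("bytes.decode: %.4f s" % t_method)
print("str(...):     %.4f s" % t_constructor)
```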
Thorsten Kampe wrote:
> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
>> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
>> 17.23644495010376
>> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
>> 72.087096929550171
>> That is significant! So the winner is:
>> unicode('äöüÄÖÜß','utf-8')
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.
Well, I can tell you I would not have posted this here and checked it if it would be meaningless for me. You don't have to read and answer this thread if it's meaningless to you.
> Very pedagogical of you. :) Indeed, it looks like bigger player in the
> performance difference is the fact that the code path for unicode(s,
> enc) short-circuits the codec registry for common encodings (which
> includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
> consults the codec registry.
So the next question (the answer to which may benefit all users of .encode() and .decode()) is:
Why does consulting the codec registry take so long, and can this be improved?
Michael Ströder wrote:
> Thorsten Kampe wrote:
>> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
>>> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
>>> 17.23644495010376
>>> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
>>> 72.087096929550171
>>> That is significant! So the winner is:
>>> unicode('äöüÄÖÜß','utf-8')
>> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
>> million times, these benchmarks are meaningless.
> Well, I can tell you I would not have posted this here and checked it if it
> would be meaningless for me. You don't have to read and answer this thread if
> it's meaningless to you.
> Ciao, Michael.
I believe that the comment "these benchmarks are meaningless" refers to the length of the strings being used in the tests. Surely something involving thousands or millions of characters is more meaningful? Or to go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...
Yes?
On Fri, 07 Aug 2009 08:04:51 +0100, Mark Lawrence wrote:
> I believe that the comment "these benchmarks are meaningless" refers to
> the length of the strings being used in the tests. Surely something
> involving thousands or millions of characters is more meaningful? Or to
> go the other way, you are unlikely to write
> for c in 'äöüÄÖÜß':
>     u = unicode(c, 'utf-8')
>     ...
> Yes?
There are all sorts of potential use-cases. A day or two ago, somebody posted a question involving tens of thousands of lines of tens of thousands of characters each (don't quote me, I'm going by memory). On the other hand, it doesn't require much imagination to think of a use-case where there are millions of lines each of a dozen or so characters, and you want to process it line by line:
noun: cat
noun: dog
verb: café
...
As always, before optimizing, you should profile to be sure you are actually optimizing and not wasting your time.
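One way to do that profiling is sketched below, assuming Python 3 and the standard cProfile and pstats modules. The input is a hypothetical stand-in modelled on the 'noun: cat' lines, and parse_line and process are illustrative names, not code from the thread.

```python
import cProfile
import io
import pstats


def parse_line(raw):
    """Decode one 'tag: word' line and split it into its two fields."""
    tag, word = raw.decode("utf-8").split(": ", 1)
    return tag, word


def process(raw_lines):
    """Parse every line; this is the 'whole job' being profiled."""
    return [parse_line(raw) for raw in raw_lines]


# Hypothetical input: many short lines, as in the use-case described above.
sample = ["noun: cat", "noun: dog", "verb: café"]
raw_lines = [text.encode("utf-8") for text in sample] * 20_000

profiler = cProfile.Profile()
profiler.enable()
records = process(raw_lines)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(8)
print(report.getvalue())
```

If the decode call accounts for only a small share of the cumulative time in the report, changing how it is spelled is unlikely to matter for that workload.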
> Thorsten Kampe wrote:
> > * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> >> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> >> 17.23644495010376
> >> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> >> 72.087096929550171
> >> That is significant! So the winner is:
> >> unicode('äöüÄÖÜß','utf-8')
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> > million times, these benchmarks are meaningless.
> Well, I can tell you I would not have posted this here and checked it if it
> would be meaningless for me. You don't have to read and answer this thread if
> it's meaningless to you.
Again: if you think decoding "äöüÄÖÜß" one million times is a real world use case for your module then go for unicode(). Otherwise the time you spent benchmarking artificial cases like this is just wasted time. In real life people won't even notice whether an application takes one or two minutes to complete.
Use whatever you prefer (decode() or unicode()). If you experience performance bottlenecks when you're done, test whether changing decode() to unicode() makes a difference. /That/ is relevant.
Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
>> What if you're writing a loop which takes one million different lines of
>> text and decodes them once each?
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.6822888851165771
>> Seems like a pretty meaningful difference to me.
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.
For a real-life example, I often have a file with one word per line, and I run python scripts to apply some (sometimes fairly trivial) transformation over it. REAL example: reading lines with word, lemma, tag separated by tabs from stdin and writing the word to stdout, unless it starts with '<' (~6e5 lines, python2.5, user times, warm cache; I hope the comments are self-explanatory):
no unicode user 0m2.380s
decode('utf-8'), encode('utf-8') user 0m3.560s
sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin) user 0m6.180s
unicode(line, 'utf8'), encode('utf-8') user 0m3.820s
unicode(line, 'utf-8'), encode('utf-8') user 0m2.880s
python3.1 user 0m1.560s
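The script itself was not posted. The following is only a guess at the shape of such a filter, written for Python 3 and corresponding to the decode('utf-8') / encode('utf-8') variant; extract_words and the sample data are hypothetical.

```python
import io


def extract_words(source, sink, encoding="utf-8"):
    """Copy the first tab-separated field of each line to sink, skipping lines starting with '<'."""
    for raw in source:
        line = raw.decode(encoding)
        if line.startswith("<"):
            continue
        word = line.rstrip("\n").split("\t", 1)[0]
        sink.write((word + "\n").encode(encoding))


# Hypothetical sample in word<TAB>lemma<TAB>tag form, with markup lines to skip.
demo_text = "<doc>\ncafés\tcafé\tNOUN\ndogs\tdog\tNOUN\n</doc>\n"
demo_input = io.BytesIO(demo_text.encode("utf-8"))
demo_output = io.BytesIO()
extract_words(demo_input, demo_output)
print(demo_output.getvalue().decode("utf-8"))
```

To run it as a stdin-to-stdout filter, sys.stdin.buffer and sys.stdout.buffer can be passed as source and sink.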
Since I have something like 18 million words in my current project (and 600 million overall) and I often tweak some parameters and re-run the transformations, the differences are pretty significant.
Personally, I have been surprised by:
1) bad performance of the codecs wrapper (I expected it to be on par with unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) good performance of python3.1 (utf-8 locale)
--
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.
But that's not what you first claimed:
> I don't think any measurable speed increase will be
> noticeable between those two.
But please, keep changing your argument so you don't have to admit you were wrong.
> Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> > Bollocks. No one will even notice whether a code sequence runs 2.7 or
> > 5.7 seconds. That's completely artificial benchmarking.
> But that's not what you first claimed:
> > I don't think any measurable speed increase will be
> > noticeable between those two.
> But please, keep changing your argument so you don't have to admit you
> were wrong.
Bollocks. Please note the word "noticeable". "noticeable" as in recognisable as in reasonably experiencable or as in whatever.
One guy claims he has times between 2.7 and 5.7 seconds when benchmarking more or less randomly generated "one million different lines". That *is* *exactly* nothing.
Another guy claims he gets times between 2.9 and 6.2 seconds when running decode/unicode in various manifestations over "18 million words" (or is it 600 million?) and says "the differences are pretty significant". I think I don't have to comment on that.
If you increase the number of loops to one million or one billion or whatever even the slightest completely negligible difference will occur. The same thing will happen if you just increase the corpus of words to a million, trillion or whatever. The performance implications of that are exactly none.
Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> lines". That *is* *exactly* nothing.
> Another guy claims he gets times between 2.9 and 6.2 seconds when
> running decode/unicode in various manifestations over "18 million
over a sample of 600000 words (sorry for not being able to explain myself clearly enough so that everyone understands), while my current project is 18e6 words; that is, the overall running time will be 87 vs. 186 seconds, which is fairly noticeable.
> words" (or is it 600 million?) and says "the differences are pretty
> significant".
600 million is the size of the whole corpus, that translates to 48 minutes vs. 1h43min. That already is a huge difference (going to lunch during noon or waiting another hour until it runs over - and you can bet it is _very_ noticeable when I am hungry :-)).
With 9 different versions of the corpus (that is, what we are really using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15 hours. Being able to re-run the whole corpus generation in one working day (and then go on with the next issues) vs. working overtime or delivering the corpus one day later is a huge difference. Like, being one day behind the schedule.
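The extrapolation behind these figures can be checked with a few lines of arithmetic. This sketch assumes simple linear scaling from the 600000-word sample and takes 2.9 s and 6.2 s as the rounded sample timings (they appear to correspond to the unicode(line, 'utf-8') and codecs wrapper runs); the constant and function names are illustrative.

```python
SAMPLE_WORDS = 600_000   # size of the timed sample
FAST_SECONDS = 2.9       # rounded timing of the faster variant on the sample
SLOW_SECONDS = 6.2       # rounded timing of the slower variant on the sample


def scaled_seconds(sample_seconds, words, versions=1):
    """Linearly extrapolate a sample timing to a larger corpus."""
    return sample_seconds * (words / SAMPLE_WORDS) * versions


scenarios = [
    ("current project", 18_000_000, 1),
    ("whole corpus", 600_000_000, 1),
    ("nine corpus versions", 600_000_000, 9),
]

for label, words, versions in scenarios:
    fast = scaled_seconds(FAST_SECONDS, words, versions)
    slow = scaled_seconds(SLOW_SECONDS, words, versions)
    print("%-22s %7.0f s vs %7.0f s (%.2f h vs %.2f h)"
          % (label, fast, slow, fast / 3600, slow / 3600))
```

Under these assumptions the three rows come out at about 87 s vs. 186 s, 48 min vs. 1 h 43 min, and 7.25 h vs. 15.5 h, in line with the numbers quoted in the message.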
> I think I don't have to comment on that.
Indeed, the numbers are self-explanatory.
> If you increase the number of loops to one million or one billion or
> whatever even the slightest completely negligible difference will occur.
> The same thing will happen if you just increase the corpus of words to a
> million, trillion or whatever. The performance implications of that are
> exactly none.
I am not sure I understood that. Must be my English :-)
--
Radovan Garabík
On Fri, 07 Aug 2009 12:00:42 +0200, Thorsten Kampe wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.
You think users won't notice a doubling of execution time? Well, that explains some of the apps I'm forced to use...
A two-second running time for (say) a command-line tool is already noticeable. A five-second one is *very* noticeable -- long enough to be a drag, short enough that you aren't tempted to go off and do something else while you're waiting for it to finish.
On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.
We agree that in the grand scheme of things, a difference of 2.7 seconds versus 5.7 seconds is a trivial difference if your entire program takes (say) 8 minutes to run. You won't even notice it.
But why assume that the program takes 8 minutes to run? Perhaps it takes 8 seconds to run, and 6 seconds of that is the decoding. Then halving that reduces the total runtime from 8 seconds to 5, which is a noticeable speed increase to the user, and significant if you then run that program tens of thousands of times.
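The arithmetic in both scenarios is the same: only the decoding part of the runtime shrinks. A small sketch, with a hypothetical helper name, makes the contrast explicit.

```python
def runtime_after_speedup(total_seconds, affected_seconds, speedup):
    """Total runtime once the affected part runs `speedup` times faster."""
    unaffected = total_seconds - affected_seconds
    return unaffected + affected_seconds / speedup


# The 8-second program in which 6 seconds are spent decoding, with decoding halved.
short_program = runtime_after_speedup(8.0, 6.0, 2.0)

# An 8-minute program in which decoding drops from 5.7 s to 2.7 s.
long_program = runtime_after_speedup(8 * 60.0, 5.7, 5.7 / 2.7)

print("short program: 8.0 s -> %.1f s" % short_program)
print("long program:  480.0 s -> %.1f s" % long_program)
```

The same three seconds saved are more than a third of the short program's runtime and well under one percent of the long one's, which is why the share of time spent decoding matters more than the raw difference.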
The Python dev team spend significant time and effort to get improvements of the order of 10%, and you're pooh-poohing an improvement of the order of 100%. By all means remind people that premature optimization is a waste of time, but it's possible to take that attitude too far, all the way to Planet Bizarro. At the point that you start insisting, and emphasising, that a three second time difference is "*exactly*" zero, it seems to me that this is about you winning rather than you giving good advice.
> On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> > One guy claims he has times between 2.7 and 5.7 seconds when
> > benchmarking more or less randomly generated "one million different
> > lines". That *is* *exactly* nothing.
> We agree that in the grand scheme of things, a difference of 2.7 seconds
> versus 5.7 seconds is a trivial difference if your entire program takes
> (say) 8 minutes to run. You won't even notice it.
Exactly.
> But why assume that the program takes 8 minutes to run? Perhaps it takes
> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
> that reduces the total runtime from 8 seconds to 5, which is a noticeable
> speed increase to the user, and significant if you then run that program
> tens of thousands of times.
Exactly. That's why it doesn't make sense to benchmark decode()/unicode() in isolation, meaning out of the context of your actual program.
> By all means, reminding people that pre-mature optimization is a
> waste of time, but it's possible to take that attitude too far to Planet
> Bizarro. At the point that you start insisting, and emphasising, that a
> three second time difference is "*exactly*" zero,
Exactly. Because it was not generated in a real world use case but by running a simple loop one million times. Why one million times? Because by running it "only" one hundred thousand times the difference would have seemed even less relevant.
> it seems to me that this is about you winning rather than you giving
> good advice.
I already gave good advice:
1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you benchmark, then benchmark the whole application and not single commands
It's really easy: Michael has working code. With that he can easily write two versions - one that uses decode() and one that uses unicode(). He can benchmark these with some real world input he often uses by running it a hundred or a thousand times (even a million if he likes). Then he can compare the results. I doubt that there will be any noticeable difference.
> garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> > I am not sure I understood that. Must be my English :-)
> I just parsed it as "blah blah blah I won't admit I'm wrong" and
> didn't miss anything substantive.
Alex, there are still a number of performance optimizations that require a thorough optimizer like you. Like using short identifiers instead of long ones. I guess you could easily prove that by comparing "a = 0" to "a_long_identifier = 0" and running it one hundred trillion times. The performance gain could easily add up to *days*. Keep us updated.
* garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009 17:41:38 +0000 (UTC))
> Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> > If you increase the number of loops to one million or one billion or
> > whatever even the slightest completely negligible difference will
> > occur. The same thing will happen if you just increase the corpus of
> > words to a million, trillion or whatever. The performance
> > implications of that are exactly none.
> I am not sure I understood that. Must be my English :-)
I guess you understand me very well and I understand you very well. If the performance gain you want to prove doesn't show with 600,000 words, you test again with 18,000,000 words and if that is not impressive enough with 600,000,000 words. Great.
Or if a million repetitions of your "improved" code don't show the expected "performance advantage" you run it a billion times. Even greater. Keep on optimizing.