unicode() vs. s.decode()
Newsgroups: comp.lang.python
From: Michael Ströder <mich...@stroeder.com>
Date: Wed, 05 Aug 2009 16:43:09 +0200
Local: Wed, Aug 5 2009 10:43 am
Subject: unicode() vs. s.decode()
Hi!

Both of these expressions are equivalent, but which is faster, or is there
any reason to prefer one over the other?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

Ciao, Michael.


 
Newsgroups: comp.lang.python
From: Jason Tackaberry <t...@urandom.ca>
Date: Wed, 05 Aug 2009 11:53:56 -0400
Local: Wed, Aug 5 2009 11:53 am
Subject: Re: unicode() vs. s.decode()

On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> Both of these expressions are equivalent, but which is faster, or is there
> any reason to prefer one over the other?

> u = unicode(s,'utf-8')

> u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python.  I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:

        >>> import dis
        >>> dis.dis(lambda s: s.decode('utf-8'))
          1           0 LOAD_FAST                0 (s)
                      3 LOAD_ATTR                0 (decode)
                      6 LOAD_CONST               0 ('utf-8')
                      9 CALL_FUNCTION            1
                     12 RETURN_VALUE        
        >>> dis.dis(lambda s: unicode(s, 'utf-8'))
          1           0 LOAD_GLOBAL              0 (unicode)
                      3 LOAD_FAST                0 (s)
                      6 LOAD_CONST               0 ('utf-8')
                      9 CALL_FUNCTION            2
                     12 RETURN_VALUE      

The presence of LOAD_ATTR in the first form hints that this is probably
going to be slower.   Next, actually try it:

        >>> import timeit
        >>> timeit.timeit('"foobarbaz".decode("utf-8")')
        1.698289155960083
        >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
        0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because it's, as you pointed out, cleaner and more readable code.

Cheers,
Jason.
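
The session above is Python 2. For readers on Python 3, where unicode() no longer exists, a minimal sketch of the same comparison follows. It assumes str(b, "utf-8") as the closest analogue of unicode(s, 'utf-8'); the function name compare_decoders and the repeat counts are illustrative only, and the numbers will differ by interpreter version and machine.

```python
import timeit


def compare_decoders(number=100_000, repeat=5):
    """Time two spellings of UTF-8 decoding and return the best run of each.

    Taking the minimum over several repeats reduces noise from other
    processes; the results are only meaningful relative to each other.
    """
    statements = {
        "method": 'b"foobarbaz".decode("utf-8")',
        "constructor": 'str(b"foobarbaz", "utf-8")',
    }
    return {
        label: min(timeit.repeat(stmt, number=number, repeat=repeat))
        for label, stmt in statements.items()
    }


if __name__ == "__main__":
    for label, seconds in compare_decoders().items():
        print(f"{label:12s} {seconds:.4f} s")
```

Whatever the outcome on a given interpreter, a gap measured this way says nothing yet about why one form is faster.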


 
Newsgroups: comp.lang.python
From: 1x7y2z9 <1x7y...@gmail.com>
Date: Wed, 5 Aug 2009 11:12:45 -0700 (PDT)
Local: Wed, Aug 5 2009 2:12 pm
Subject: Re: unicode() vs. s.decode()
unicode() has LOAD_GLOBAL, which s.decode() does not.  Is it generally
the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led
to your intuition that the former would probably be slower?  Or was it
some other intuition?
Of course, the results from timeit are a different thing - I am asking about
the intuition drawn from the disassembler output.
Thanks.
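
One way to check that intuition directly is to time a bare global lookup against a global lookup followed by an attribute lookup. The sketch below is for Python 3; the function names are made up for illustration, math.pi merely stands in for any attribute access, and the opcode names assume the snippet runs at module level.

```python
import dis
import math
import timeit


def global_only():
    return math          # a global lookup only


def global_plus_attr():
    return math.pi       # a global lookup followed by an attribute lookup


def opnames(func):
    """Return the names of the bytecode instructions in func."""
    return [ins.opname for ins in dis.get_instructions(func)]


def time_both(number=1_000_000):
    """Time each function; both pay the same call overhead, so the gap
    approximates the cost of the extra attribute lookup."""
    return (
        timeit.timeit(global_only, number=number),
        timeit.timeit(global_plus_attr, number=number),
    )


if __name__ == "__main__":
    print(opnames(global_only))
    print(opnames(global_plus_attr))
    print(time_both())
```

The relative cost depends heavily on the interpreter version, so treat any figure from this as specific to the machine it ran on.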


 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Thu, 6 Aug 2009 01:31:56 +0000 (UTC)
Local: Wed, Aug 5 2009 9:31 pm
Subject: Re: unicode() vs. s.decode()
Jason Tackaberry <tack <at> urandom.ca> writes:

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

HTH,
John
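
Suggestion (1) can be sketched for Python 3 as follows. The helper names are hypothetical; codecs.lookup() is the standard registry entry point, and its .name attribute shows that different spellings resolve to the same codec even if they do not all take the same path to get there.

```python
import codecs
import timeit

# Several spellings of the same encoding, plus cp1252 for contrast.
NAMES = ["utf-8", "utf8", "utf_8", "UTF-8", "cp1252"]


def canonical_names(names=NAMES):
    """Map each spelling to the canonical codec name found by the registry."""
    return {name: codecs.lookup(name).name for name in names}


def time_decode(name, number=100_000):
    """Time decoding a short ASCII byte string using the given spelling."""
    return timeit.timeit(f'b"foobarbaz".decode({name!r})', number=number)


if __name__ == "__main__":
    print(canonical_names())
    for name in NAMES:
        print(f"{name:8s} {time_decode(name):.4f} s")
```

If one spelling is markedly faster than the others, that points at special-casing of the name rather than at the decoding work itself.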


 
Newsgroups: comp.lang.python
From: Jason Tackaberry <t...@urandom.ca>
Date: Thu, 06 Aug 2009 10:15:41 -0400
Local: Thurs, Aug 6 2009 10:15 am
Subject: Re: unicode() vs. s.decode()

On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> Faster by an enormous margin; attributing this to the cost of attribute lookup
> seems implausible.

Ok, fair point.  I don't think the time difference fully registered when
I composed that message.

Testing a global access (LOAD_GLOBAL) versus an attribute access on a
global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about
40% slower than the former.  So that certainly doesn't account for the
difference.

> Suggested further avenues of investigation:

> (1) Try the timing again with "cp1252" and "utf8" and "utf_8"

> (2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

Very pedagogical of you. :)  Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
include 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.

Cheers,
Jason.
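
To make the short-circuit idea concrete, here is an illustrative Python 3 sketch. It is not the CPython source: the table _FAST_DECODERS and the function decode_fast_path are hypothetical names, and the choice of which encodings get a fast path is an assumption. Only the codecs functions used are real.

```python
import codecs

# Hypothetical fast-path table: exact spellings mapped straight to decoders.
_FAST_DECODERS = {
    "utf-8": codecs.utf_8_decode,
    "latin-1": codecs.latin_1_decode,
    "ascii": codecs.ascii_decode,
}


def decode_fast_path(data, encoding):
    """Decode bytes, skipping the codec registry for exactly matching names."""
    decoder = _FAST_DECODERS.get(encoding)
    if decoder is None:
        # Slow path: the registry normalizes the name and finds the codec.
        decoder = codecs.lookup(encoding).decode
    text, _consumed = decoder(data)
    return text


if __name__ == "__main__":
    raw = "äöüÄÖÜß".encode("utf-8")
    print(decode_fast_path(raw, "utf-8"))  # exact match, fast path
    print(decode_fast_path(raw, "utf8"))   # same codec, but via the registry
```

An exact-match check of this kind treats 'utf8' as a miss even though it names the same codec, so only the precise spelling benefits.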


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Thu, 6 Aug 2009 16:32:34 +0200
Local: Thurs, Aug 6 2009 10:32 am
Subject: Re: unicode() vs. s.decode()
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)

> Both of these expressions are equivalent, but which is faster, or is there
> any reason to prefer one over the other?

> u = unicode(s,'utf-8')

> u = s.decode('utf-8') # looks nicer

"decode" was added in Python 2.2 for the sake of symmetry to encode().
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same. I don't think any measurable speed increase will be
noticeable between those two.

Thorsten


 
Newsgroups: comp.lang.python
From: Michael Ströder <mich...@stroeder.com>
Date: Thu, 06 Aug 2009 18:26:09 +0200
Local: Thurs, Aug 6 2009 12:26 pm
Subject: Re: unicode() vs. s.decode()

Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> Both of these expressions are equivalent, but which is faster, or is there
>> any reason to prefer one over the other?

>> u = unicode(s,'utf-8')

>> u = s.decode('utf-8') # looks nicer

> "decode" was added in Python 2.2 for the sake of symmetry to encode().

Yes, and I like the style. But...

> It's essentially the same as unicode() and I wouldn't be surprised if it
> is exactly the same.

Did you try?

> I don't think any measurable speed increase will be noticeable between
> those two.

Well, that seems not to be true. Try it yourself. I did (my console uses UTF-8 as its charset):

Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import timeit
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(1000000)
7.2721178531646729
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000000)
7.1302499771118164
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.3726329803466797
>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000000)
1.8622009754180908
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.651669979095459

Comparing again the two best combinations:

>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
17.23644495010376
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Ciao, Michael.
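
To put the two totals above on a per-call footing, the arithmetic can be sketched like this. The figures are copied from the session above; the variable and function names are illustrative, and the timeit totals include a small amount of loop overhead.

```python
# Totals reported above for 10,000,000 iterations each, in seconds.
CALLS = 10_000_000
unicode_total = 17.23644495010376   # unicode('äöüÄÖÜß', 'utf-8')
decode_total = 72.087096929550171   # 'äöüÄÖÜß'.decode('utf8')


def per_call_microseconds(total_seconds, calls=CALLS):
    """Convert a total runtime into an average cost per call."""
    return total_seconds / calls * 1e6


ratio = decode_total / unicode_total
saving_us = per_call_microseconds(decode_total) - per_call_microseconds(unicode_total)

print(f"unicode(): {per_call_microseconds(unicode_total):.2f} us per call")
print(f"decode():  {per_call_microseconds(decode_total):.2f} us per call")
print(f"ratio:     {ratio:.2f}x, saving {saving_us:.2f} us per call")
```

That works out to roughly 1.7 versus 7.2 microseconds per call, a factor of about 4.2. Whether a few microseconds per call matters depends entirely on how many calls a program makes.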


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Thu, 6 Aug 2009 20:05:52 +0200
Local: Thurs, Aug 6 2009 2:05 pm
Subject: Re: unicode() vs. s.decode()
* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Thorsten


 
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: 06 Aug 2009 19:17:30 GMT
Local: Thurs, Aug 6 2009 3:17 pm
Subject: Re: unicode() vs. s.decode()

On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:
> > That is significant! So the winner is:

> > unicode('äöüÄÖÜß','utf-8')

> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?

>>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>>> t1.timeit(number=1)
5.6751680374145508
>>> t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

--
Steven


 
Newsgroups: comp.lang.python
From: Michael Ströder <mich...@stroeder.com>
Date: Fri, 07 Aug 2009 03:25:03 +0200
Local: Thurs, Aug 6 2009 9:25 pm
Subject: Re: unicode() vs. s.decode()

Thorsten Kampe wrote:
> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
>>>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
>> 17.23644495010376
>>>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
>> 72.087096929550171

>> That is significant! So the winner is:

>> unicode('äöüÄÖÜß','utf-8')

> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.


 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Fri, 7 Aug 2009 02:01:24 +0000 (UTC)
Local: Thurs, Aug 6 2009 10:01 pm
Subject: Re: unicode() vs. s.decode()
Jason Tackaberry <tack <at> urandom.ca> writes:

> On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> > Suggested further avenues of investigation:

> > (1) Try the timing again with "cp1252" and "utf8" and "utf_8"

> > (2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

> Very pedagogical of you. :)  Indeed, it looks like the bigger player in the
> performance difference is the fact that the code path for unicode(s,
> enc) short-circuits the codec registry for common encodings (which
> include 'utf-8' specifically), whereas s.decode('utf-8') necessarily
> consults the codec registry.

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

    Why does consulting the codec registry take so long,
    and can this be improved?


 
Newsgroups: comp.lang.python
From: Mark Lawrence <breamore...@yahoo.co.uk>
Date: Fri, 07 Aug 2009 08:04:51 +0100
Local: Fri, Aug 7 2009 3:04 am
Subject: Re: unicode() vs. s.decode()

I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests.  Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
     u = unicode(c, 'utf-8')
     ...
Yes?

--
Kindest regards.

Mark Lawrence.


 
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: 07 Aug 2009 07:25:44 GMT
Local: Fri, Aug 7 2009 3:25 am
Subject: Re: unicode() vs. s.decode()

On Fri, 07 Aug 2009 08:04:51 +0100, Mark Lawrence wrote:
> I believe that the comment "these benchmarks are meaningless" refers to
> the length of the strings being used in the tests.  Surely something
> involving thousands or millions of characters is more meaningful? Or to
> go the other way, you are unlikely to write
> for c in 'äöüÄÖÜß':
>      u = unicode(c, 'utf-8')
>      ...
> Yes?

There are all sorts of potential use-cases. A day or two ago, somebody
posted a question involving tens of thousands of lines of tens of
thousands of characters each (don't quote me, I'm going by memory). On
the other hand, it doesn't require much imagination to think of a use-
case where there are millions of lines each of a dozen or so characters,
and you want to process it line by line:

noun: cat
noun: dog
verb: café
...

As always, before optimizing, you should profile to be sure you are
actually optimizing and not wasting your time.

--
Steven
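
As a minimal sketch of what "profile first" can look like in practice, the snippet below runs a small decoding workload under the standard cProfile module. It is written for Python 3; decode_lines is a hypothetical stand-in for a real program's hot path, and profile_call is an illustrative helper, not part of any library.

```python
import cProfile
import io
import pstats


def decode_lines(lines):
    """Decode each byte string in lines as UTF-8 (hypothetical workload)."""
    return [line.decode("utf-8") for line in lines]


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    sample = ["noun: cat".encode("utf-8"), "verb: café".encode("utf-8")] * 100_000
    _, report = profile_call(decode_lines, sample)
    print(report)
```

If decoding does not show up near the top of the report for a real workload, switching between the two spellings is unlikely to help.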


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Fri, 7 Aug 2009 12:00:42 +0200
Local: Fri, Aug 7 2009 6:00 am
Subject: Re: unicode() vs. s.decode()
* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

Thorsten


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Fri, 7 Aug 2009 12:12:32 +0200
Local: Fri, Aug 7 2009 6:12 am
Subject: Re: unicode() vs. s.decode()
* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)

Again: if you think decoding "äöüÄÖÜß" one million times is a real world
use case for your module then go for unicode(). Otherwise the time you
spent benchmarking artificial cases like this is just wasted time. In
real life people won't even notice whether an application takes one or
two minutes to complete.

Use whatever you prefer (decode() or unicode()). If you experience
performance bottlenecks when you're done, test whether changing decode()
to unicode() makes a difference. /That/ is relevant.

Thorsten


 
Newsgroups: comp.lang.python
From: garabik-news-2005...@kassiopeia.juls.savba.sk
Date: Fri, 7 Aug 2009 11:49:05 +0000 (UTC)
Local: Fri, Aug 7 2009 7:49 am
Subject: Re: unicode() vs. s.decode()

For a real-life example, I often have a file with one word per line, and
I run Python scripts to apply some (sometimes fairly trivial)
transformation over it. REAL example: reading lines with word, lemma,
tag separated by tabs from stdin and writing word into stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache, I hope
the comments are self-explanatory)

no unicode
user    0m2.380s

decode('utf-8'), encode('utf-8')
user    0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user    0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user    0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user    0m2.880s

python3.1
user    0m1.560s

Since I have something like 18 million words in my current project (and
> 600 million overall) and I often tweak some parameters and re-run the
transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) bad performance of the codecs wrapper (I expected it to be on par with
   unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) good performance of python3.1 (utf-8 locale)

--
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
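
The script itself is not shown in the post, so the following is only a guess at its shape, written for Python 3: read tab-separated word, lemma, tag lines and write the word unless it starts with '<'. The function names and the choice to pass the streams in explicitly are assumptions made so the logic can be exercised without real stdin.

```python
import io
import sys


def extract_words(lines):
    """Yield the first tab-separated field of each line unless it starts with '<'."""
    for line in lines:
        word = line.rstrip("\n").split("\t", 1)[0]
        if not word.startswith("<"):
            yield word


def main(stdin, stdout):
    """Copy the word column from stdin to stdout, one word per line."""
    for word in extract_words(stdin):
        stdout.write(word + "\n")


if __name__ == "__main__":
    # Wrap the raw byte streams so the encoding is UTF-8 regardless of locale.
    reader = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    writer = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    main(reader, writer)
    writer.flush()
```

In Python 3 the decoding happens inside the text stream, which may be part of why the python3.1 run compares differently from the explicit decode() and unicode() variants.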


 
Newsgroups: comp.lang.python
From: alex23 <wuwe...@gmail.com>
Date: Fri, 7 Aug 2009 06:53:22 -0700 (PDT)
Local: Fri, Aug 7 2009 9:53 am
Subject: Re: unicode() vs. s.decode()

Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:

> I don't think any measurable speed increase will be
> noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.

 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Fri, 7 Aug 2009 17:13:07 +0200
Local: Fri, Aug 7 2009 11:13 am
Subject: Re: unicode() vs. s.decode()
* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))

> Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> > Bollocks. No one will even notice whether a code sequence runs 2.7 or
> > 5.7 seconds. That's completely artificial benchmarking.

> But that's not what you first claimed:

> > I don't think any measurable speed increase will be
> > noticeable between those two.

> But please, keep changing your argument so you don't have to admit you
> were wrong.

Bollocks. Please note the word "noticeable". "noticeable" as in
recognisable as in reasonably experiencable or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when
benchmarking more or less randomly generated "one million different
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

Thorsten


 
Newsgroups: comp.lang.python
From: garabik-news-2005...@kassiopeia.juls.savba.sk
Date: Fri, 7 Aug 2009 17:41:38 +0000 (UTC)
Local: Fri, Aug 7 2009 1:41 pm
Subject: Re: unicode() vs. s.decode()

Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> lines". That *is* *exactly* nothing.

> Another guy claims he gets times between 2.9 and 6.2 seconds when
> running decode/unicode in various manifestations over "18 million

over a sample of 600000 words (sorry for not being able to explain
myself clearly enough so that everyone understands),
while my current project is 18e6 words; that is, the overall running time
will be 87 vs. 186 seconds, which is fairly noticeable.

> words" (or is it 600 million?) and says "the differences are pretty
> significant".

600 million is the size of the whole corpus, which translates to
48 minutes vs. 1h43min. That already is a huge difference (going to
lunch at noon or waiting another hour until the run is over - and
you can bet it is _very_ noticeable when I am hungry :-)).

With 9 different versions of the corpus (that is, what we are really
using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15
hours. Being able to re-run the whole corpus generation in one working
day (and then go on with the next issues) vs. working overtime or
delivering the corpus one day later is a huge difference. Like, being
one day behind the schedule.

> I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.

> If you increase the number of loops to one million or one billion or
> whatever even the slightest completely negligible difference will occur.
> The same thing will happen if you just increase the corpus of words to a
> million, trillion or whatever. The performance implications of that are
> exactly none.

I am not sure I understood that. Must be my English :-)

--
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
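
The extrapolation in this post can be reproduced with a few lines of arithmetic. The sketch assumes runtime scales linearly with the number of words, uses the rounded sample timings of 2.9 s and 6.2 s quoted in the thread, and introduces the name scaled_seconds purely for illustration.

```python
SAMPLE_WORDS = 600_000
FAST = 2.9   # seconds on the sample, rounded from the 2.880 s run
SLOW = 6.2   # seconds on the sample, rounded from the 6.180 s run


def scaled_seconds(sample_seconds, words, versions=1):
    """Scale a sample timing linearly to a larger corpus and several versions."""
    return sample_seconds * words / SAMPLE_WORDS * versions


project = (scaled_seconds(FAST, 18_000_000), scaled_seconds(SLOW, 18_000_000))
corpus = (scaled_seconds(FAST, 600_000_000), scaled_seconds(SLOW, 600_000_000))
nine = (scaled_seconds(FAST, 600_000_000, 9), scaled_seconds(SLOW, 600_000_000, 9))

print(f"project: {project[0]:.0f} s vs {project[1]:.0f} s")
print(f"corpus:  {corpus[0] / 60:.0f} min vs {corpus[1] / 60:.0f} min")
print(f"9 versions: {nine[0] / 3600:.2f} h vs {nine[1] / 3600:.2f} h")
```

Linear scaling is itself an assumption; I/O and memory effects on a 600 million word corpus could shift the real figures in either direction.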


 
Newsgroups: comp.lang.python
From: alex23 <wuwe...@gmail.com>
Date: Fri, 7 Aug 2009 10:45:29 -0700 (PDT)
Local: Fri, Aug 7 2009 1:45 pm
Subject: Re: unicode() vs. s.decode()

garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> I am not sure I understood that. Must be my English :-)

I just parsed it as "blah blah blah I won't admit I'm wrong" and
didn't miss anything substantive.

 
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: 08 Aug 2009 02:04:58 GMT
Local: Fri, Aug 7 2009 10:04 pm
Subject: Re: unicode() vs. s.decode()

On Fri, 07 Aug 2009 12:00:42 +0200, Thorsten Kampe wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

You think users won't notice a doubling of execution time? Well, that
explains some of the apps I'm forced to use...

A two-second running time for (say) a command-line tool is already
noticeable. A five-second one is *very* noticeable -- long enough to be a
drag, short enough that you aren't tempted to go off and do something
else while you're waiting for it to finish.

--
Steven


 
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: 08 Aug 2009 03:29:43 GMT
Local: Fri, Aug 7 2009 11:29 pm
Subject: Re: unicode() vs. s.decode()

On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.

We agree that in the grand scheme of things, a difference of 2.7 seconds
versus 5.7 seconds is a trivial difference if your entire program takes
(say) 8 minutes to run. You won't even notice it.

But why assume that the program takes 8 minutes to run? Perhaps it takes
8 seconds to run, and 6 seconds of that is the decoding. Then halving
that reduces the total runtime from 8 seconds to 5, which is a noticeable
speed increase to the user, and significant if you then run that program
tens of thousands of times.

The Python dev team spends significant time and effort to get improvements
of the order of 10%, and you're pooh-poohing an improvement of the order
of 100%. By all means, remind people that premature optimization is a
waste of time, but it's possible to take that attitude too far, to Planet
Bizarro. At the point that you start insisting, and emphasising, that a
three second time difference is "*exactly*" zero, it seems to me that
this is about you winning rather than you giving good advice.

--
Steven
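
The 8-minute versus 8-second comparison is the usual Amdahl's-law arithmetic: the benefit of speeding up one part depends on how much of the total that part accounts for. A minimal sketch, with the function name total_after_speedup chosen for illustration:

```python
def total_after_speedup(total, affected, speedup):
    """Return the new total runtime when `affected` seconds of a `total`
    second run become `speedup` times faster and the rest is unchanged."""
    if not 0 <= affected <= total:
        raise ValueError("affected time must lie between 0 and total")
    return (total - affected) + affected / speedup


# 8 s program with 6 s of decoding, decoding made twice as fast.
short_run = total_after_speedup(8, 6, 2)

# 8 min (480 s) program where decoding drops from 5.7 s to 2.7 s.
long_run = total_after_speedup(480, 5.7, 5.7 / 2.7)

print(f"short program: 8 s -> {short_run:.1f} s")
print(f"long program:  480 s -> {long_run:.1f} s")
```

The same three seconds saved are a 37.5% improvement in the first case and well under 1% in the second, which is why the answer depends on the program and not on the microbenchmark alone.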


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Sat, 8 Aug 2009 13:16:12 +0200
Local: Sat, Aug 8 2009 7:16 am
Subject: Re: unicode() vs. s.decode()
* Steven D'Aprano (08 Aug 2009 03:29:43 GMT)

> On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> > One guy claims he has times between 2.7 and 5.7 seconds when
> > benchmarking more or less randomly generated "one million different
> > lines". That *is* *exactly* nothing.

> We agree that in the grand scheme of things, a difference of 2.7 seconds
> versus 5.7 seconds is a trivial difference if your entire program takes
> (say) 8 minutes to run. You won't even notice it.

Exactly.

> But why assume that the program takes 8 minutes to run? Perhaps it takes
> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
> that reduces the total runtime from 8 seconds to 5, which is a noticeable
> speed increase to the user, and significant if you then run that program
> tens of thousands of times.

Exactly. That's why it doesn't make sense to benchmark decode()/unicode()
in isolation - meaning out of the context of your actual program.

> By all means, remind people that premature optimization is a
> waste of time, but it's possible to take that attitude too far, to Planet
> Bizarro. At the point that you start insisting, and emphasising, that a
> three second time difference is "*exactly*" zero,

Exactly. Because it was not generated in a real world use case but by
running a simple loop one million times. Why one million times? Because
by running it "only" one hundred thousand times the difference would
have seemed even less relevant.

> it seems to me that this is about you winning rather than you giving
> good advice.

I already gave good advice:
1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you benchmark, then benchmark the whole application and not single commands

It's really easy: Michael has working code. With that he can easily
write two versions - one that uses decode() and one that uses unicode().
He can benchmark these with some real world input he often uses by
running it a hundred or a thousand times (even a million if he likes).
Then he can compare the results. I doubt that there will be any
noticeable difference.

Thorsten


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Sat, 8 Aug 2009 14:19:44 +0200
Local: Sat, Aug 8 2009 8:19 am
Subject: Re: unicode() vs. s.decode()
* alex23 (Fri, 7 Aug 2009 10:45:29 -0700 (PDT))

> garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> > I am not sure I understood that. Must be my English :-)

> I just parsed it as "blah blah blah I won't admit I'm wrong" and
> didn't miss anything substantive.

Alex, there are still a number of performance optimizations that require
a thorough optimizer like you. Like using short identifiers instead of
long ones. I guess you could easily prove that by comparing "a = 0" to
"a_long_identifier = 0" and running it one hundred trillion times. The
performance gain could easily add up to *days*. Keep us updated.

Thorsten


 
Newsgroups: comp.lang.python
From: Thorsten Kampe <thors...@thorstenkampe.de>
Date: Sat, 8 Aug 2009 14:28:54 +0200
Local: Sat, Aug 8 2009 8:28 am
Subject: Re: unicode() vs. s.decode()
* garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009
17:41:38 +0000 (UTC))

> Thorsten Kampe <thors...@thorstenkampe.de> wrote:
> > If you increase the number of loops to one million or one billion or
> > whatever even the slightest completely negligible difference will
> > occur. The same thing will happen if you just increase the corpus of
> > words to a million, trillion or whatever. The performance
> > implications of that are exactly none.

> I am not sure I understood that. Must be my English :-)

I guess you understand me very well and I understand you very well. If
the performance gain you want to prove doesn't show with 600,000 words,
you test again with 18,000,000 words and if that is not impressive
enough with 600,000,000 words. Great.

Or if a million repetitions of your "improved" code don't show the
expected "performance advantage" you run it a billion times. Even
greater. Keep on optimizing.

Thorsten


 