Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
How do I display unicode value stored in a string variable using ord()
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 26 - 50 of 144 - Collapse all  -  Translate all to Translated (View all originals) < Older  Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Steven D'Aprano  
View profile  
 More options Aug 18 2012, 1:59 pm
Newsgroups: comp.lang.python
From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Date: 18 Aug 2012 17:59:18 GMT
Local: Sat, Aug 18 2012 1:59 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]

> I'm aware of this (and all the blah blah blah you are explaining). This
> always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

> Let me ask. Is Python an 'american" product for us-users or is it a tool
> for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.

> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))"  # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))"  # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode
available more widely.

> I recognize in practice the real impact is for many users closed to zero

Then what's the problem?

> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all.

You have shown untrustworthy, edited timing results that don't match what
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.

--
Steven


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wxjmfa...@gmail.com  
View profile  
 More options Aug 18 2012, 2:05 pm
Newsgroups: comp.lang.python
From: wxjmfa...@gmail.com
Date: Sat, 18 Aug 2012 11:05:07 -0700 (PDT)
Local: Sat, Aug 18 2012 2:05 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :

> Proof that is acceptable to everybody please, not just yourself.

I cann't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the undelying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type en "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
become an internal ucs-4 "string". The remove the "€" and so
on.

Intuitively I expect there is some kind slow down between
all these "strings" conversion.

When I tested this flexible representation, a few months
ago, at the first alpha release. This is precisely what,
I tested. String manipulations which are forcing this internal
change and I concluded the result is not brillant. Realy,
a factor 0.n up to 10.

This are simply my conclusions.

Related question.

Does any body know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and stuct, but without
success.

jmf


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wxjmfa...@gmail.com  
View profile  
 More options Aug 18 2012, 2:05 pm
Newsgroups: comp.lang.python
From: wxjmfa...@gmail.com
Date: Sat, 18 Aug 2012 11:05:07 -0700 (PDT)
Local: Sat, Aug 18 2012 2:05 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :

> Proof that is acceptable to everybody please, not just yourself.

I cann't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the undelying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type en "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
become an internal ucs-4 "string". The remove the "€" and so
on.

Intuitively I expect there is some kind slow down between
all these "strings" conversion.

When I tested this flexible representation, a few months
ago, at the first alpha release. This is precisely what,
I tested. String manipulations which are forcing this internal
change and I concluded the result is not brillant. Realy,
a factor 0.n up to 10.

This are simply my conclusions.

Related question.

Does any body know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and stuct, but without
success.

jmf


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Aug 18 2012, 2:26 pm
Newsgroups: comp.lang.python
From: Paul Rubin <no.em...@nospam.invalid>
Date: Sat, 18 Aug 2012 11:26:21 -0700
Local: Sat, Aug 18 2012 2:26 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two code points. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wxjmfa...@gmail.com  
View profile  
 More options Aug 18 2012, 2:30 pm
Newsgroups: comp.lang.python
From: wxjmfa...@gmail.com
Date: Sat, 18 Aug 2012 11:30:19 -0700 (PDT)
Local: Sat, Aug 18 2012 2:30 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :

No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
MRAB  
View profile  
 More options Aug 18 2012, 2:34 pm
Newsgroups: comp.lang.python
From: MRAB <pyt...@mrabarnett.plus.com>
Date: Sat, 18 Aug 2012 19:34:50 +0100
Local: Sat, Aug 18 2012 2:34 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:05, wxjmfa...@gmail.com wrote:

[snip]

"a" will be stored as 1 byte/codepoint.

Adding "�", it will still be stored as 1 byte/codepoint.

Adding "�", it will still be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
rusi  
View profile  
 More options Aug 18 2012, 2:40 pm
Newsgroups: comp.lang.python
From: rusi <rustompm...@gmail.com>
Date: Sat, 18 Aug 2012 11:40:23 -0700 (PDT)
Local: Sat, Aug 18 2012 2:40 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On Aug 18, 10:59 pm, Steven D'Aprano <steve

+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?

> Of course there is a reason.

> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.

> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-inc...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
MRAB  
View profile  
 More options Aug 18 2012, 2:59 pm
Newsgroups: comp.lang.python
From: MRAB <pyt...@mrabarnett.plus.com>
Date: Sat, 18 Aug 2012 19:59:32 +0100
Local: Sat, Aug 18 2012 2:59 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:26, Paul Rubin wrote:

On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Lawrence  
View profile  
 More options Aug 18 2012, 3:45 pm
Newsgroups: comp.lang.python
From: Mark Lawrence <breamore...@yahoo.co.uk>
Date: Sat, 18 Aug 2012 20:45:53 +0100
Local: Sat, Aug 18 2012 3:45 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:30, wxjmfa...@gmail.com wrote:

Thinking about it I entirely agree with you.  Steven D'Aprano strikes me
as typically American, in the same way that I'm typically Brazilian :)

--
Cheers.

Mark Lawrence.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Lawrence  
View profile  
 More options Aug 18 2012, 3:50 pm
Newsgroups: comp.lang.python
From: Mark Lawrence <breamore...@yahoo.co.uk>
Date: Sat, 18 Aug 2012 20:50:58 +0100
Local: Sat, Aug 18 2012 3:50 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:40, rusi wrote:

ROFLMAO doesn't adequately some up how much I laughed.

--
Cheers.

Mark Lawrence.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Terry Reedy  
View profile  
 More options Aug 18 2012, 4:09 pm
Newsgroups: comp.lang.python
From: Terry Reedy <tjre...@udel.edu>
Date: Sat, 18 Aug 2012 16:09:14 -0400
Local: Sat, Aug 18 2012 4:09 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 12:38 PM, wxjmfa...@gmail.com wrote:

> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
> always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*10000 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c  = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8  100 times slower! Increase len(a) and the ratio can be made as
high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3:  .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is
not a panacea; it has problems of time, space, and system compatibility
(Windows and others). Victor Stinner, whatever he may have once thought
and said, put a *lot* of effort into making the new implementation both
correct and fast.

On your replace example
 >>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
 > 61.919225272152346
 >>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
 > 1.2918679017971044

I do not see the point of changing both length and replacement. For me,
the time is about the same for either replacement. I do see about the
same slowdown ratio for 3.3 versus 3.2 I also see it for pure search
without replacement.

print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

--
Terry Jan Reedy


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
wxjmfa...@gmail.com  
View profile  
 More options Aug 18 2012, 4:22 pm
Newsgroups: comp.lang.python
From: wxjmfa...@gmail.com
Date: Sat, 18 Aug 2012 13:22:00 -0700 (PDT)
Subject: Re: How do I display unicode value stored in a string variable using ord()
Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :

I thing it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I'm condidering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizen on this planet in the same way.

jmf


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Lawrence  
View profile  
 More options Aug 18 2012, 5:37 pm
Newsgroups: comp.lang.python
From: Mark Lawrence <breamore...@yahoo.co.uk>
Date: Sat, 18 Aug 2012 22:37:13 +0100
Local: Sat, Aug 18 2012 5:37 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 21:22, wxjmfa...@gmail.com wrote:

In plain English, duck out cos I'm losing.

> You can take the problem the way you wish, Python 3.3 is "slower"
> than Python 3.2.

I'll ask for the second time.  Provide proof that is acceptable to
everybody and not just yourself.

> If you see the present status as an optimisation, I'm condidering
> this as a regression.

Considering does not equate to proof.  Where are the figures which back
up your claim?

> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
> the correct solution.

I look forward to seeing your patch on the bug tracker.  If and only if
you can find something that needs patching, which from the course of
this thread I think is highly unlikely.

> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizen on this planet in the same way.

> jmf

--
Cheers.

Mark Lawrence.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Angelico  
View profile  
 More options Aug 18 2012, 8:46 pm
Newsgroups: comp.lang.python
From: Chris Angelico <ros...@gmail.com>
Date: Sun, 19 Aug 2012 10:46:18 +1000
Local: Sat, Aug 18 2012 8:46 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.

ChrisA


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Aug 18 2012, 10:11 pm
Newsgroups: comp.lang.python
From: Paul Rubin <no.em...@nospam.invalid>
Date: Sat, 18 Aug 2012 19:11:38 -0700
Local: Sat, Aug 18 2012 10:11 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

Chris Angelico <ros...@gmail.com> writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character?

How often do you need to do that, as opposed to traversing the string by
iteration?  Anyway, you could use a rope-like implementation, or an
index structure over the string.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Angelico  
View profile  
 More options Aug 18 2012, 10:19 pm
Newsgroups: comp.lang.python
From: Chris Angelico <ros...@gmail.com>
Date: Sun, 19 Aug 2012 12:19:00 +1000
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?

> How often do you need to do that, as opposed to traversing the string by
> iteration?  Anyway, you could use a rope-like implementation, or an
> index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]

'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Aug 18 2012, 10:35 pm
Newsgroups: comp.lang.python
From: Paul Rubin <no.em...@nospam.invalid>
Date: Sat, 18 Aug 2012 19:35:44 -0700
Local: Sat, Aug 18 2012 10:35 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

Chris Angelico <ros...@gmail.com> writes:
>>>> "asdfqwer"[4:]
> 'qwer'

> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal.  It gets more expensive if you
want to index far more deeply into the string.  I'm asking how often
that is done in real code.  Obviously one can concoct hypothetical
examples that would suffer.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Angelico  
View profile  
 More options Aug 18 2012, 11:01 pm
Newsgroups: comp.lang.python
From: Chris Angelico <ros...@gmail.com>
Date: Sun, 19 Aug 2012 13:01:46 +1000
Local: Sat, Aug 18 2012 11:01 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>>>>> "asdfqwer"[4:]
>> 'qwer'

>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.

> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal.  It gets more expensive if you
> want to index far more deeply into the string.  I'm asking how often
> that is done in real code.  Obviously one can concoct hypothetical
> examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Aug 18 2012, 11:10 pm
Newsgroups: comp.lang.python
From: Paul Rubin <no.em...@nospam.invalid>
Date: Sat, 18 Aug 2012 20:10:35 -0700
Local: Sat, Aug 18 2012 11:10 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

Chris Angelico <ros...@gmail.com> writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.

> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Terry Reedy  
View profile  
 More options Aug 18 2012, 11:12 pm
Newsgroups: comp.lang.python
From: Terry Reedy <tjre...@udel.edu>
Date: Sat, 18 Aug 2012 23:12:50 -0400
Local: Sat, Aug 18 2012 11:12 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 4:09 PM, Terry Reedy wrote:

> print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
> # .6 in 3.2.3, 1.2 in 3.3.0

> This does not make sense to me and I will ask about it.

I did ask on pydef list and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic
real-world mixtures of operations.

People making improvements must consider performance on multiple systems
and multiple benchmarks. If someone wants to work on search speed, they
cannot just optimize that one operation on one system.

--
Terry Jan Reedy


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Angelico  
View profile  
 More options Aug 18 2012, 11:31 pm
Newsgroups: comp.lang.python
From: Chris Angelico <ros...@gmail.com>
Date: Sun, 19 Aug 2012 13:31:21 +1000
Local: Sat, Aug 18 2012 11:31 pm
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.

> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Aug 19 2012, 1:58 am
Newsgroups: comp.lang.python
From: Paul Rubin <no.em...@nospam.invalid>
Date: Sat, 18 Aug 2012 22:58:14 -0700
Local: Sun, Aug 19 2012 1:58 am
Subject: Re: How do I display unicode value stored in a string variable using ord()

Chris Angelico <ros...@gmail.com> writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.  

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work.  You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Steven D'Aprano  
View profile  
 More options Aug 19 2012, 2:09 am
Newsgroups: comp.lang.python
From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Date: 19 Aug 2012 06:09:49 GMT
Local: Sun, Aug 19 2012 2:09 am
Subject: Re: How do I display unicode value stored in a string variable using ord()
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".

On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:

> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used
for converting text as bytes for transmission over the wire or storage on
disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two-bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (roughly
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more that four-bytes
worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 used those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".

Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".

Except this comes at a big cost: you can no longer tell how long a string
is by counting the number of bytes, which is fast, because sometimes four
bytes is two characters and sometimes it's one and you can't tell which
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 10th characters in a string. The fast way using UCS-2 is to
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we
start counting at zero) and you're done. Fast and safe if you're willing
to give up the non-BMP characters.

It's also fast and safe if you use USC-4, but then everything takes twice
as much space, so you probably end up spending so much time copying null
bytes that you're probably slower anyway. Especially when your OS starts
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8
and 9 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does, it fails to
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'

I've just split a single valid Unicode character into two invalid
characters. Python3.2 will (probably) mindless process those two non-
characters, and the only sign I have that I did something wrong is that
my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair
of bytes in order to index a string, or work out it's length, or copy a
substring. It's not enough to just check if the last pair is a surrogate.

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4
is too because ASCII-only strings like identifiers end up being four
times as big as they need to be. 1-byte schemes like Latin-1 are
unspeakable because they only handle 256 characters, fewer if you don't
count the C0 and C1 control codes.

PEP 393 to the rescue! What if you could encode pure-ASCII strings like
"len" using one byte per character, and BMP strings using two bytes per
character (UCS-2), and fall back to four bytes (UCS-4) only when you
really need it?

The benefits are:

* Americans and English-Canadians and Australians and other barbarians of
that ilk who only use ASCII save a heap of memory;

* people who mostly use non-BMP characters only pay the cost of four-
bytes per character for strings that actually *need* four-bytes per
character;

* people who use lots of non-BMP characters are no worse off.

The costs are:

* string routines need to be smarter -- they have to handle three
different data structures (ASCII, UCS-2, UCS-4) instead of just one;

* there's a certain amount of overhead when creating a string -- you have
to work out which in-memory format to use, and that's not necessarily
trivial, but at least it's a once-off cost when you create the string;

* people who misunderstand what's going on get all upset over micro-
benchmarks.

> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

To recap:

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.

* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP.
That's better than it sounds: the BMP supports most character sets, but
not all. Still, there are people who need the supplementary planes, and
UCS-2 lets them down.

* Fixed width UCS-4 does handle the full Unicode range, without
surrogates, but at the cost of using 2-4 times more string memory for the
vast majority of users.

* PEP 393 doesn't use variable-width characters, but variable-width
strings. Instead of choosing between 1, 2 and 4 bytes per character, it
chooses *per string*. This keeps basic string operations O(1) instead of
O(N), saves memory where possible, while still supporting the full
Unicode range without a compile-time option.

--
Steven


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Steven D'Aprano  
View profile  
 More options Aug 19 2012, 2:13 am
Newsgroups: comp.lang.python
From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Date: 19 Aug 2012 06:13:29 GMT
Local: Sun, Aug 19 2012 2:13 am
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:
>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This always the same song. Memory.

>> Exactly. The reason it is always the same song is because it is an
>> important song.

> No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have
unlimited amounts of memory. You must be very lucky.

> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.

--
Steven


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Steven D'Aprano  
View profile  
 More options Aug 19 2012, 2:30 am
Newsgroups: comp.lang.python
From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Date: 19 Aug 2012 06:30:54 GMT
Local: Sun, Aug 19 2012 2:30 am
Subject: Re: How do I display unicode value stored in a string variable using ord()

On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:
> As I understand (I think) the undelying mechanism, I can only say, it is
> not a surprise that it happens.

> Imagine an editor, I type an "a", internally the text is saved as ascii,
> then I type en "é", the text can only be saved in at least latin-1. Then
> I enter an "€", the text become an internal ucs-4 "string". The remove
> the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are
Asian or a historian using some obscure ancient script. NONE of the
examples you have shown in your emails have included 4-byte characters,
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen.

In *both* Python 3.2 and 3.3, both é and € are represented by two bytes.
That will not change. There is a tiny amount of fixed overhead for
strings, and that overhead is slightly different between the versions,
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
will be slow slow SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.

> Intuitively I expect there is some kind slow down between all these
> "strings" conversion.

Your intuition is wrong. Strings are not converted from ASCII to USC-2 to
USC-4 on the fly, they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all
we do is create new strings. First we create a string 'ab…', then we
create another string 'ab…'*1000, then we create two new strings '…' and
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.

> When I tested this flexible representation, a few months ago, at the
> first alpha release. This is precisely what, I tested. String
> manipulations which are forcing this internal change and I concluded the
> result is not brillant. Realy, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.

> Does any body know a way to get the size of the internal "string" in
> bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10038

As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
(size.__name__))"
34

You can check the object overhead by looking at the size of the empty
string.

--
Steven


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 26 - 50 of 144 < Older  Newer >
« Back to Discussions « Newer topic     Older topic »