str() should convert ANY object to a string without EXCEPTIONS !

est

Sep 28, 2008, 1:37:09 AM

From python manual

str( [object])

Return a string containing a nicely printable representation of an
object. For strings, this returns the string itself. The difference
with repr(object) is that str(object) does not always attempt to
return a string that is acceptable to eval(); its goal is to return a
printable string. If no argument is given, returns the empty string,
''.

now we try this under windows:

>>> str(u'\ue863')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

FAIL.

also almighty Linux

Python 2.3.4 (#1, Feb 6 2006, 10:38:46)
[GCC 3.4.5 20051201 (Red Hat 3.4.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str(u'\ue863')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str(u'\ue863')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

Python 2.5 (release25-maint, Jul 20 2008, 20:47:25)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str(u'\ue863')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

The problem is, why the f**k set ASCII encoding to range(128) ????????
while str() is internally byte array it should be handled in
range(256) !!!!!!!!!!

http://bugs.python.org/issue3648

One possible solution(Windows Only)

>>> str(u'\ue863'.encode('mbcs'))
'\xfe\x9f'
>>> print u'\ue863'.encode('mbcs')
䶮

I now spending 60% of my developing time dealing with ASCII range(128)
errors. It was PAIN!!!!!!

Please fix this issue.

http://bugs.python.org/issue3648

Please.

Lawrence D'Oliveiro

Sep 28, 2008, 2:03:42 AM

In message
<9890864a-09f9-40d6...@q26g2000prq.googlegroups.com>, est
wrote:

> The problem is, why the f**k set ASCII encoding to range(128) ????????

Because that's how ASCII is defined.

> while str() is internally byte array it should be handled in
> range(256) !!!!!!!!!!

But that's for random bytes. How would you convert an arbitrary object to
random bytes?

Marc 'BlackJack' Rintsch

Sep 28, 2008, 2:05:11 AM

On Sat, 27 Sep 2008 22:37:09 -0700, est wrote:

> The problem is, why the f**k set ASCII encoding to range(128) ????????

Because that's how ASCII is defined. ASCII is a 7-bit code.

> while str() is internally byte array it should be handled in range(256)
> !!!!!!!!!!

Yes `str` can handle that, but that's not the point. The point is how to
translate the contents of a `unicode` object into that range. There are
many different possibilities and Python refuses to guess and tries the
lowest common denominator -- ASCII -- instead.

> I now spending 60% of my developing time dealing with ASCII range(128)
> errors. It was PAIN!!!!!!
>
> Please fix this issue.
>
> http://bugs.python.org/issue3648
>
> Please.

The issue was closed as 'invalid'. Dealing with Unicode can be painful
and frustrating, but that's not a Python problem; it's the subject itself
that needs some thought. If you think through the relationship
between characters, encodings, and bytes, and stop dreaming of a magic
solution that works without dealing with this stuff explicitly, the pain
will go away -- or ease at least.

Ciao,
Marc 'BlackJack' Rintsch

Terry Reedy

Sep 28, 2008, 3:55:46 AM

to pytho...@python.org

est wrote:
> From python manual
>
> str( [object])
>
> Return a string containing a nicely printable representation of an
> object. For strings, this returns the string itself. The difference
> with repr(object) is that str(object) does not always attempt to
> return a string that is acceptable to eval(); its goal is to return a
> printable string. If no argument is given, returns the empty string,
> ''.
>
>
> now we try this under windows:
>
> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

In 3.0 this is fixed:
>>> str('\ue863') # u prefix is gone
'\ue863'
>>> str(b'123') # b prefix is added
"b'123'"

Problems like this at least partly motivated the change to unicode
instead of bytes as the string type.

tjr

Steven D'Aprano

Sep 28, 2008, 3:56:49 AM

from random import randint
''.join(chr(randint(0, 255)) for i in xrange(len(input)))

of course. How else should you get random bytes? :)

--
Steven

Steven D'Aprano

Sep 28, 2008, 4:26:33 AM

I'm not sure that "fixed" is the right word. Isn't that more or less the
same as telling the OP to use unicode() instead of str()? It merely
avoids the problem of converting Unicode to ASCII by leaving your string
as Unicode, rather than fixing it. Perhaps that's the right thing to do,
but it's a bit like the old joke:

"Doctor, it hurts when I do this."
"Then don't do it!"

As for the second example you give:

>>> str(b'123') # b prefix is added
"b'123'"

Perhaps I'm misinterpreting it, but from here it looks to me that str()
is doing what repr() used to do, and I'm really not sure that's a good
thing. I would have expected that str(b'123') in Python 3 should do the
same thing as unicode('123') does now:

>>> unicode('123')
u'123'

(except without the u prefix).
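
For comparison, here is a minimal sketch of how this looks in Python 3 (an assumption: the code targets Python 3, not the 2.x interpreters used elsewhere in this thread). Calling str() on a bytes object with no encoding gives the repr-like form, while passing an encoding, or calling .decode(), gives the plain text. The variable names are only for illustration.

```python
# Python 3 sketch: two ways to turn bytes into text, plus the repr-like default.
data = b'123'

as_repr = str(data)               # no encoding given: repr-style result
as_text = str(data, 'ascii')      # encoding given: str() decodes the bytes
also_text = data.decode('ascii')  # the equivalent explicit method call

print(as_repr)    # b'123'
print(as_text)    # 123
print(also_text)  # 123
```

So the conversion expected above is available, but only when the caller names the encoding.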

--
Steven

est

Sep 28, 2008, 4:35:11 AM

> Because that's how ASCII is defined.

> Because that's how ASCII is defined. ASCII is a 7-bit code.

Then why can't python use another default encoding internally
range(256)?

> Python refuses to guess and tries the lowest common denominator -- ASCII -- instead.

That's the problem. ASCII is INCOMPLETE!

If Python choose another default encoding which handles range(256),
80% of python unicode encoding problems are gone.

It's not HARD to process unicode, it's just python & python community
refuse to correct it.

> stop dreaming of a magic solution

It's not 'magic' it's a BUG. Just print 0x7F to 0xFF to console,
what's wrong????

> Isn't that more or less the same as telling the OP to use unicode() instead of str()?

sockets could handle str() only. If you throw unicode objects to a
socket, it will automatically call str() and cause an error.

Steven D'Aprano

Sep 28, 2008, 4:38:00 AM

On Sat, 27 Sep 2008 22:37:09 -0700, est wrote:

> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)
>
> FAIL.

What result did you expect?

[...]

> The problem is, why the f**k set ASCII encoding to range(128) ????????
> while str() is internally byte array it should be handled in range(256)
> !!!!!!!!!!

To quote Terry Pratchett:

"What sort of person," said Salzella patiently, "sits down and
*writes* a maniacal laugh? And all those exclamation marks, you
notice? Five? A sure sign of someone who wears his underpants
on his head." -- (Terry Pratchett, Maskerade)

In any case, even if the ASCII encoding used all 256 possible bytes, you
still have a problem. Your unicode string is a single character with
ordinal value 59491:

>>> ord(u'\ue863')
59491

You can't fit 59491 (or more) characters into 256, so obviously some
unicode chars aren't going to fit into ASCII without some sort of
encoding. You show that yourself:

u'\ue863'.encode('mbcs') # Windows only

But of course 'mbcs' is only one possible encoding. There are others.
Python refuses to guess which encoding you want. Here's another:

u'\ue863'.encode('utf-8')
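
To make the "there are others" point concrete, here is a minimal sketch that encodes the same character with several codecs. It assumes Python 3 (so the u prefix is dropped), and the helper name try_encode is hypothetical, introduced only for this illustration. 'mbcs' is left out because it exists only on Windows.

```python
# Python 3 sketch: one code point, several codecs, different outcomes.
ch = '\ue863'  # the character from the original post (code point 59491)

def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

results = {codec: try_encode(ch, codec)
           for codec in ('ascii', 'latin-1', 'utf-8', 'utf-16-le')}

for codec, data in results.items():
    print(codec, data)
```

The two single-byte codecs cannot represent the character at all, and the two Unicode encodings produce different byte sequences, which is why the caller has to choose.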

--
Steven

est

Sep 28, 2008, 5:21:18 AM

On Sep 28, 4:38 pm, Steven D'Aprano <st...@REMOVE-THIS-

OK, I am tired of arguing these things since python 3.0 fixed it
somehow.

Can anyone tell me how to customize a default encoding, let's say
'ansi' which handles range(256) ?

Lie

Sep 28, 2008, 5:49:11 AM

On Sep 28, 12:37 pm, est <electronix...@gmail.com> wrote:
> From python manual
>
> str( [object])
>
> Return a string containing a nicely printable representation of an
> object. For strings, this returns the string itself. The difference
> with repr(object) is that str(object) does not always attempt to
> return a string that is acceptable to eval(); its goal is to return a
> printable string. If no argument is given, returns the empty string,
> ''.
>
> now we try this under windows:
>
> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)
>
> FAIL.

And it is correct to fail: ASCII is only defined within range(128);
the rest (i.e. range(128, 256)) is not defined in ASCII. The bytes in
range(128, 256) are extension slots, with many conflicting meanings.
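
A minimal sketch of those conflicting meanings, assuming Python 3 and an arbitrarily chosen byte (0xE9) and set of codecs: the same byte decodes to a different character under each single-byte encoding, and is rejected outright by ASCII.

```python
# Python 3 sketch: one byte above 127, decoded with several single-byte codecs.
raw = b'\xe9'
codecs_to_try = ('latin-1', 'cp1251', 'koi8-r', 'cp437')

decoded = {name: raw.decode(name) for name in codecs_to_try}
for name, char in decoded.items():
    print(name, repr(char))

# ASCII defines nothing for this byte, so decoding fails.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print('ascii accepts it:', ascii_ok)
```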

>
> also almighty Linux
>
> Python 2.3.4 (#1, Feb 6 2006, 10:38:46)
> [GCC 3.4.5 20051201 (Red Hat 3.4.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)
>
> Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
> [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)
>
> Python 2.5 (release25-maint, Jul 20 2008, 20:47:25)
> [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> str(u'\ue863')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ue863' in position 0: ordinal not in range(128)

If str() had returned anything but an error on this, I'd
file a bug report.

> The problem is, why the f**k set ASCII encoding to range(128) ????????
> while str() is internally byte array it should be handled in
> range(256) !!!!!!!!!!

A str is a byte array, but Unicode and ASCII are NOT. A Unicode string
is a character array with code points in range(0x110000). How many
bytes each character takes depends on the encoding (one to four in
UTF-8, two or four in UTF-16, four in UTF-32). An ASCII string is a
character array defined over range(128). Other than the Unicode
encodings (UTF-8, UTF-16, and UTF-32) and ASCII, there are many other
encodings ('EBCDIC', 'iso-8859-1', ..., 'iso-8859-16', 'KOI8',
'GB18030', 'Shift-JIS', etc.), each with conflicting byte-to-character
mappings. Fortunately, most of these encodings do share a common
ground: ASCII.

Actually, when a strictly stupid str() receives a Unicode string (i.e.
character array), it should return a <unicode s at
0x423549af813e4954>, but it doesn't. str() is smarter than that: it
tries to convert whatever fits into ASCII, i.e. characters lower than
128. Why ASCII? Because the meaning of characters in range(128, 256)
varies widely and it doesn't know which encoding you want to use, so if
you don't tell it what encoding to use, it will not guess (Python Zen:
In the face of ambiguity, refuse the temptation to guess).

If you're trying to convert a character array (Unicode) into a byte
string, it's done by specifying which codec you want to use. str()
tries to convert your character array (Unicode) to byte string using
ASCII codec. s.encode(codec) would convert a given character array
into byte string using codec.

> http://bugs.python.org/issue3648
>
> One possible solution(Windows Only)
>
> >>> str(u'\ue863'.encode('mbcs'))
> '\xfe\x9f'

actually str() is not needed, you need only: u'\ue863'.encode('mbcs')

> >>> print u'\ue863'.encode('mbcs')
> 䶮
>
> I now spending 60% of my developing time dealing with ASCII range(128)
> errors. It was PAIN!!!!!!

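Despair not, there is a quick hack:
# but only use it as a temporary solution; FIX YOUR CODE PROPERLY
str_ = str  # keep a reference to the built-in
# encode unicode objects with 'mbcs' (Windows only); defer everything else to the built-in
str = lambda s='': s.encode('mbcs') if isinstance(s, unicode) else str_(s)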

Olivier Lauzanne

Sep 28, 2008, 6:01:12 AM

On Sep 28, 11:21 am, est <electronix...@gmail.com> wrote:
> On Sep 28, 4:38 pm, Steven D'Aprano <st...@REMOVE-THIS-

> Can anyone tell me how to customize a default encoding, let's say
> 'ansi' which handles range(256) ?

I assume you are using python2.5
Edit the file /usr/lib/python2.5/site.py

There is a function called

def setencoding():
    [...]
    encoding = "ascii"
    [...]

Change encoding = "ascii" to encoding = "utf-8"

On Windows you may have to use "mbcs" or something like that. I have
no idea what Windows uses as its encoding.

As long as not all systems use the same encoding (let's say utf-8,
since it is becoming the standard on unixes and on the web), using
ascii as a default encoding makes sense.

Marc 'BlackJack' Rintsch

Sep 28, 2008, 6:15:00 AM

On Sun, 28 Sep 2008 01:35:11 -0700, est wrote:

>> Because that's how ASCII is defined.
>> Because that's how ASCII is defined. ASCII is a 7-bit code.
>
> Then why can't python use another default encoding internally
> range(256)?

Because that doesn't suffice. Unicode code points can be >255.

> If Python choose another default encoding which handles range(256), 80%
> of python unicode encoding problems are gone.

80% of *your* problems with it *seem* to be gone then.

> It's not HARD to process unicode, it's just python & python community
> refuse to correct it.

It is somewhat hard to deal with unicode because many don't want to think
about it or don't grasp the relationship between encodings, byte values,
and characters. Including you.

>> stop dreaming of a magic solution
>
> It's not 'magic' it's a BUG. Just print 0x7F to 0xFF to console, what's
> wrong????

What do you mean by "just print 0x7F to 0xFF"? For example if I have ``s
= u'Smørebrød™'`` what bytes should ``str(s)`` produce and why those and
not others?

>> Isn't that more or less the same as telling the OP to use unicode()
>> instead of str()?
>
> sockets could handle str() only. If you throw unicode objects to a
> socket, it will automatically call str() and cause an error.

Because *you* have to tell explicitly how the unicode object should be
encoded as bytes. Python can't do this automatically because it has *no
idea* what the process at the other end of the socket expects.
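
As a minimal sketch of what "telling explicitly" looks like, the following assumes Python 3 and assumes, purely for illustration, that both ends have agreed on UTF-8. It uses socket.socketpair() as a stand-in for a real connection. Note that Python 3 rejects a text string on a socket outright instead of silently trying ASCII.

```python
# Python 3 sketch: text must be encoded to bytes before it goes on a socket.
import socket

text = 'Smørebrød™'
sender, receiver = socket.socketpair()
try:
    # Sending text directly is refused: sockets carry bytes only.
    try:
        sender.send(text)
        rejected_str = False
    except TypeError:
        rejected_str = True

    # Assumption for this sketch: both ends agreed on UTF-8 beforehand.
    payload = text.encode('utf-8')
    sender.sendall(payload)
    sender.shutdown(socket.SHUT_WR)  # signal end of data to the reader

    chunks = []
    while True:
        chunk = receiver.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
    received = b''.join(chunks)
finally:
    sender.close()
    receiver.close()

decoded = received.decode('utf-8')
print(len(payload), decoded == text)
```

If the receiver decoded with a different codec than the sender used, the round trip would either fail or yield different characters, which is the reason the choice cannot be made automatically.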

Now you are complaining that Python chooses ASCII. If it is changed to
something else, like MBCS, others start complaining why it is MBCS and
not something different. See: No fix, just moving the problem to someone
else.

Ciao,
Marc 'BlackJack' Rintsch

Lie

Sep 28, 2008, 7:04:10 AM

I'm against saying that Python 3.0 fixed it. Python 3.0's string type is
Unicode and its default encoding is utf-8, and that is why your problem
magically disappears.

> Can anyone tell me how to customize a default encoding, let's say
> 'ansi' which handles range(256) ?

Python has sys.setdefaultencoding, but that feature was an
accident. sys.setdefaultencoding was intended to be used for testing
purposes while the developers had not yet decided what to use as the
default encoding (what use is a default when you can change it?).
sys.setdefaultencoding is deleted from the sys module at startup (by
site.py), so programmers should encode characters explicitly if they
want to use something other than the default encoding (ASCII).
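
For reference, a minimal sketch of how to inspect this setting, assuming Python 3 (where the default is utf-8 rather than the ASCII default of the 2.x interpreters discussed here):

```python
# Python 3 sketch: inspect the default encoding; there is no setter to call.
import sys

default = sys.getdefaultencoding()
can_change = hasattr(sys, 'setdefaultencoding')

print(default)     # utf-8
print(can_change)  # False
```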

est

Sep 28, 2008, 7:09:30 AM

Well, you succeeded in putting all the blame on me alone. Great.

When you guys are dealing with CJK characters in the future, you'll
find out what I mean.

In fact Boa Constructor keeps raising ASCII range(128) errors on
my Windows machine. That's pretty cool.

Lie

Sep 28, 2008, 7:12:45 AM

On Sep 28, 3:35 pm, est <electronix...@gmail.com> wrote:
> > Because that's how ASCII is defined.
> > Because that's how ASCII is defined. ASCII is a 7-bit code.
>
> Then why can't python use another default encoding internally
> range(256)?
>
> > Python refuses to guess and tries the lowest common denominator -- ASCII -- instead.
>
> That's the problem. ASCII is INCOMPLETE!

What do you propose? Use mbcs and knock out Linux computers? Use KOI
and drive non-Russians to suicide? Use GB and shoot non-Chinese dead?
Use latin-1 and make email servers scream?

> If Python choose another default encoding which handles range(256),
> 80% of python unicode encoding problems are gone.
>
> It's not HARD to process unicode, it's just python & python community
> refuse to correct it.

Python's unicode support is already correct. Only your brainwaves have
not been tuned to it yet.

est

Sep 28, 2008, 7:17:39 AM

Have you ever programmed with CJK characters before?

Christian Heimes

Sep 28, 2008, 7:34:44 AM

to pytho...@python.org

Steven D'Aprano wrote:
> >>> str(b'123') # b prefix is added
> "b'123'"
>
>
> Perhaps I'm misinterpreting it, but from here it looks to me that str()
> is doing what repr() used to do, and I'm really not sure that's a good
> thing. I would have expected that str(b'123') in Python 3 should do the
> same thing as unicode('123') does now:

No, you are getting it right and yes, it's problematic. Guido wanted
str(b'') to succeed. But the behavior can easily mask bugs in code.
Therefore a byte warning mode was implemented.

$ ./python -b
>>> str(b'123')
__main__:1: BytesWarning: str() on a bytes instance
"b'123'"

$ ./python -bb
>>> str(b'123')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BytesWarning: str() on a bytes instance
>>> b'' == ''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BytesWarning: Comparison between bytes and string
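
The same check can be scripted. This is a minimal sketch, assuming a CPython 3 interpreter (3.7 or later for capture_output) that still accepts the -bb option. The helper name run_with_flags is hypothetical. It starts a child interpreter with and without -bb and compares the results.

```python
# Python 3 sketch: run str(b'123') in child interpreters with and without -bb.
import subprocess
import sys

def run_with_flags(flags, code):
    """Run code in a fresh interpreter with the given command-line flags."""
    return subprocess.run([sys.executable, *flags, '-c', code],
                          capture_output=True, text=True)

snippet = "print(str(b'123'))"
plain = run_with_flags([], snippet)        # default: succeeds silently
strict = run_with_flags(['-bb'], snippet)  # -bb: BytesWarning becomes an error

print(plain.stdout.strip())
print(strict.returncode)
print(strict.stderr.strip().splitlines()[-1])
```

Running a test suite under -bb is one way to surface accidental str() calls on bytes that would otherwise pass unnoticed.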

Roy Smith

Sep 28, 2008, 9:25:56 AM

In article <00ef327d$0$20666$c3e...@news.astraweb.com>,

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> from random import randint
> ''.join(chr(randint(0, 255)) for i in xrange(len(input)))
>
>
> of course. How else should you get random bytes? :)

That's a UUOL (Useless Usage Of Len; by analogy to UUOC). This works just as
well:

''.join(chr(randint(0, 255)) for i in input)

Lawrence D'Oliveiro

Sep 28, 2008, 7:14:34 PM

In message
<913193c5-1722-45ae...@b38g2000prf.googlegroups.com>, est
wrote:

> Well, you succeeded in putting all the blame on me alone. Great.

Take it as a hint.

> When you guys are dealing with CJK characters in the future, you'll
> find out what I mean.

Speaking as somebody who HAS dealt with CJK characters in the past--see
above.

Gabriel Genellina

Sep 30, 2008, 3:15:45 AM

to pytho...@python.org

En Sun, 28 Sep 2008 07:01:12 -0300, Olivier Lauzanne
<nevare...@gmail.com> escribió:

> On Sep 28, 11:21 am, est <electronix...@gmail.com> wrote:

>> Can anyone tell me how to customize a default encoding, let's say
>> 'ansi' which handles range(256) ?
>
> I assume you are using python2.5
> Edit the file /usr/lib/python2.5/site.py
>
> There is a method called
> def setencoding():
> [...]
> encoding = "ascii"
> [...]
>
> Change "encoding = "ascii" to encoding = "utf-8"
>
> On Windows you may have to use "mbcs" or something like that. I have
> no idea what Windows uses as its encoding.

*Not* a good idea at all.
You're just masking errors, and making your programs incompatible with all
other Pythons installed around the world.

--
Gabriel Genellina