"%s" vs unicode

Robin Becker

Jan 7, 2003, 2:26:16 PM

A user reports that an old-style error message is being smashed because the
message is being turned into unicode.

The situation was of the form

raise "xxxx %s" % var

where var was a lump of unicode. We can easily correct the situation.

We didn't expect unicode, but what's surprising is that "%s" % u'A' -->
u'A'.

Doesn't %s mean take the str()?

I looked in vain in the docs for the definition of exactly what "%s" % x
is supposed to mean, but somehow this reminds me of the old /
operator dispute, i.e. int / int --> int and int / float --> float.

That was considered harmful (although I disagreed) so why is this
implicit conversion allowed?
--
Robin Becker

Martin v. Löwis

Jan 7, 2003, 3:25:41 PM

Robin Becker <ro...@jessikat.fsnet.co.uk> writes:

> That was considered harmful (although I disagreed) so why is this
> implicit conversion allowed?

It follows the general principle that, when combining byte strings and
Unicode strings, the byte string will be converted to Unicode, not
vice versa.
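
A minimal sketch of that principle, written for a current Python 3 interpreter rather than the Python 2.2 under discussion. Here `bytes` stands in for the old byte string and `str` for the old `unicode` type. The helper `combine` and its ASCII default are assumptions made for illustration only; Python 3 itself refuses to mix the two types implicitly.

```python
def combine(left, right, encoding="ascii"):
    """Concatenate two strings the way Python 2 mixed str and unicode.

    Hypothetical helper: if either operand is text (str), any bytes
    operand is decoded to text first. The text operand is never encoded.
    """
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right  # bytes with bytes stays bytes
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left + right


print(type(combine(b"abc", b"def")).__name__)       # bytes
print(type(combine(b"abc", "d\u00e9f")).__name__)   # str
```

The conversion only ever runs in one direction, so the result type depends on whether any text operand is present, not on operand order.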

Regards,
Martin

Terry Reedy

Jan 7, 2003, 5:46:53 PM

"Robin Becker" <ro...@jessikat.fsnet.co.uk> wrote in message
news:1GtMGLAY...@jessikat.demon.co.uk...

> A user reports that an old style error message is being smashed as the
> message was being turned into unicode.
>
> the situation was of the form
>
> raise "xxxx %s" % var
>
> where var was a lump of unicode. We can easily correct the situation.
>
> We didn't expect unicode, but what's surprising is that "%s" % u'A' -->
> u'A'.

u'1234' works just as well, with a corresponding output.

>Doesn't %s mean take the str()? I looked in vain in the docs
> for the definition of exactly what "%s" % x is supposed to mean,

The doc (LibRef 2.2.6.2 String Formatting Operations) is slightly
contradictory. In the table for % it does say

s String (converts any python object using str()).

but the first paragraph ends with "If format is a Unicode object, or
if any of the objects being converted using the %s conversion are
Unicode objects, the result will be a Unicode object as well." I
reported this as SF bug 664044.
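
The two statements can be reconciled in a small model. This is only a sketch under stated assumptions: it runs on Python 3, where `bytes` plays the role of the old byte string and `str` the role of `unicode`, and `percent_s` is a hypothetical helper, not anything from the library.

```python
def percent_s(fmt, value):
    """Model the documented rule for a single %s conversion.

    Hypothetical helper: bytes plays the byte string, str plays unicode.
    """
    if isinstance(fmt, str) or isinstance(value, str):
        # The paragraph in the docs: any Unicode participant makes
        # the whole result Unicode.
        if isinstance(fmt, bytes):
            fmt = fmt.decode("ascii")
        if isinstance(value, bytes):
            value = value.decode("ascii")
        return fmt % (value,)
    # The table entry: everything else goes through str().
    # latin-1 maps each byte to one character, so bytes survive intact.
    if isinstance(value, bytes):
        text = value.decode("latin-1")
    else:
        text = str(value)
    return (fmt.decode("latin-1") % (text,)).encode("latin-1")


print(repr(percent_s(b"%s", 42)))    # b'42'
print(repr(percent_s(b"%s", "A")))   # 'A'
```

Read this way, the table entry describes the byte-string case and the paragraph describes the exception to it.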

Your user's problem appears to arise from the unicode result.

>>> raise u'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: exceptions must be strings, classes, or instances, not unicode

Terry J. Reedy

Robin Becker

Jan 7, 2003, 6:46:14 PM

In article <m3u1gkw...@mira.informatik.hu-berlin.de>, Martin v.
Löwis <mar...@v.loewis.de> writes

But the general principle here is that %s converts things to strings.
What is general about breaking this?
--
Robin Becker

Steve Holden

Jan 7, 2003, 7:34:34 PM

"Robin Becker" <ro...@jessikat.fsnet.co.uk> wrote in message

news:JiqD$KAGb2...@jessikat.fsnet.co.uk...

Well, there's also a widening principle that you seem to be ignoring. In the
same way that int+float gives float, and float*complex gives complex, so any
string operation involving a Unicode operand gives a Unicode result.

Otherwise what would you have Python do with non-translatable Unicode
characters it has to handle in a %s substitution? Give an ordinal value
error?
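
The numeric half of that analogy is easy to check in an interpreter. A short sketch (the variable names are only illustrative):

```python
int_plus_float = 1 + 2.0          # int is widened to float
float_times_complex = 2.0 * 1j    # float is widened to complex

print(type(int_plus_float).__name__)       # float
print(type(float_times_complex).__name__)  # complex

# Narrowing never happens implicitly; it has to be requested,
# and it can lose information.
narrowed = int(2.9)
print(narrowed)  # 2
```

The widening direction is the one that cannot lose data, which is the same argument made here for promoting to Unicode.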

regards
-----------------------------------------------------------------------
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/pwp/
Bring your musical instrument to PyCon! http://www.python.org/pycon/
-----------------------------------------------------------------------

Robin Becker

Jan 7, 2003, 7:59:56 PM

In article <loKS9.175$467.90@fe08>, Steve Holden <sho...@holdenweb.com>
writes
......

>Well, there's also a widening principle that you seem to be ignoring. In the
>same way that int+float gives float, and float*complex gives complex, so any
>string operation involving a Unicode operand gives a Unicode result.
>
>Otherwise what would you have Python do with non-translatable Unicode
>characters it has to handle in a %s substitution? Give an ordinal value
>error?
>
>regards

......I'm not really disagreeing. The widening here just seems wrong. If
"x" op y --> unicode/string why can't "x" op z --> complex for some
specific values of x & z. Normally Python tries to be simple, but if
unicode is a widened string why doesn't raise u'A' work? Widening should
work reasonably if we are going to have a consensus on what is
reasonable.
--
Robin Becker

Neal Norwitz

Jan 7, 2003, 9:00:37 PM

On Tue, 07 Jan 2003 19:59:56 -0500, Robin Becker wrote:

> In article <loKS9.175$467.90@fe08>, Steve Holden <sho...@holdenweb.com>
> writes
> ......
>>Well, there's also a widening principle that you seem to be ignoring. In
>>the same way that int+float gives float, and float*complex gives
>>complex, so any string operation involving a Unicode operand gives a
>>Unicode result.
>>

> ......I'm not really disagreeing. The widening here just seems wrong. If
> "x" op y --> unicode/string why can't "x" op z --> complex for some
> specific values of x & z. Normally Python tries to be simple, but if
> unicode is a widened string why doesn't raise u'A' work? Widening should
> work reasonably if we are going to have a consensus on what is
> reasonable.

Raising strings as exceptions is deprecated. So there isn't much point
in allowing unicode exceptions.
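
One way to sidestep the problem is to raise an exception instance that carries the formatted message. The sketch below uses current Python 3 syntax, and `ReportError` and `fail` are hypothetical names standing in for whatever the real code uses.

```python
class ReportError(Exception):
    """Hypothetical exception class replacing the old string exception."""


def fail(var):
    # Format the message, then wrap it in an exception instance
    # instead of raising the formatted string itself.
    raise ReportError("xxxx %s" % var)


try:
    fail("\u00e9t\u00e9")
except ReportError as exc:
    print(exc)
```

Because the message is only an argument to the exception, its type no longer decides whether the `raise` statement is legal.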

Neal

Terry Reedy

Jan 7, 2003, 9:26:42 PM

> if unicode is a widened string why doesn't raise u'A' work?

Probably because raising a string is discouraged if not deprecated.
It may disappear in some future version.

TJR

Gerd Woetzel

Jan 8, 2003, 10:01:45 AM

mar...@v.loewis.de (Martin v. Loewis) writes:

>It follows the general principle that, when combining byte strings and
>Unicode strings, the byte string will be converted to Unicode, not
>vice versa.

<imho>
Unfortunately the "general principle" is wrong.
There is a canonical embedding of Unicode strings into byte strings (which
is UTF-8) but no canonical embedding of byte strings into Unicode strings.
Hence it should be vice versa.
</imho>
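
A rough illustration of that asymmetry, as a sketch for a current Python 3 interpreter where `str` is the Unicode type. The sample strings and the byte 0xA4 are picked only for the example.

```python
samples = ["A", "na\u00efve", "\u20ac 10", "\u65e5\u672c\u8a9e"]

# Text -> bytes: UTF-8 encodes each sample, and decoding restores it.
for text in samples:
    assert text.encode("utf-8").decode("utf-8") == text

# Bytes -> text: the same byte means different characters depending
# on which encoding the reader assumes.
raw = b"\xa4"
as_latin1 = raw.decode("iso-8859-1")    # U+00A4, currency sign
as_latin9 = raw.decode("iso-8859-15")   # U+20AC, euro sign
print(as_latin1 != as_latin9)           # True

# As UTF-8 the lone byte is not valid at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                          # False
```

Going from bytes to text always needs an extra piece of information, the encoding, that the bytes themselves do not carry.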

It's a real shame that I have no access to the bvd's time machine :-)

Regards,
Gerd

Robin Becker

Jan 8, 2003, 7:14:21 PM

In article <3e1c3d59$1...@news.fhg.de>, Gerd Woetzel <woe...@gmd.de>
writes
....

><imho>
>Unfortunately the "general principle" is wrong.
>There is a canonical embedding of Unicode strings into byte strings (which
>is UTF-8) but no canonical embedding of byte strings into Unicode strings.
>Hence it should be vice versa.
></imho>
>
>It's a real shame that I have no access to the bvd's time machine :-)
>
>Regards,
>Gerd

I'm fairly sure I agree with you, but time travel will make criminals of
us all
--
Robin Becker

Martin v. Löwis

Jan 9, 2003, 5:51:24 AM

Gerd Woetzel <woe...@gmd.de> writes:

> Unfortunately the "general principle" is wrong.
> There is a canonical embedding of Unicode strings into byte strings (which
> is UTF-8) but no canonical embedding of byte strings into Unicode strings.
> Hence it should be vice versa.

In the original Unicode proposal, there was no notion of a settable
default encoding (and this feature is still experimental); the default
encoding, at that time, was UTF-8.

Then people requested that byte-string-unicode-string conversion
should use other encodings, and it was pointed out that UTF-8 is maybe
confusing for existing applications. So the default encoding is now
administrator-settable, and defaults to ASCII.

With ASCII being the default encoding, there is *no* canonical
embedding of Unicode strings into byte strings: some Unicode strings
("most") cannot be converted to a byte string automatically.

The same is of course true in the other direction: not all byte strings
can be converted to Unicode strings. For practical purposes,
converting byte strings works "more often", since the byte string will
be an ASCII string (literal) in many cases when combining it with a
Unicode string.
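
Both directions can be tried out with the ASCII codec. This is a sketch for a current Python 3 interpreter (`str` is the Unicode type, `bytes` the byte string); `to_bytes` and `to_text` are hypothetical helpers that return None where an automatic conversion would have failed.

```python
def to_bytes(text):
    """Return the ASCII encoding of text, or None if it has none."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


def to_text(raw):
    """Return the ASCII decoding of raw, or None if it has none."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return None


print(to_bytes("abc"))      # b'abc'
print(to_bytes("\u20ac"))   # None
print(to_text(b"abc"))      # abc
print(to_text(b"\xe9"))     # None
```

In an expression such as a format literal applied to Unicode data, the byte string is usually the ASCII literal, so decoding it is the conversion that tends to succeed.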

Regards,
Martin

Gerd Woetzel

Jan 9, 2003, 6:26:51 AM

What I wrote after some beer:

>><imho>

>> Unfortunately the "general principle" is wrong.
>> There is a canonical embedding of Unicode strings into byte strings (which
>> is UTF-8) but no canonical embedding of byte strings into Unicode strings.
>> Hence it should be vice versa.

>></imho>
>>It's a real shame that I have no access to the bvd's time machine :-)

Robin Becker <ro...@jessikat.fsnet.co.uk> writes:

>I'm fairly sure I agree with you, but time travel will make criminals of
>us all

mar...@v.loewis.de (Martin v. Löwis) writes:

>[...]

>Then people requested that byte-string-unicode-string conversion
>should use other encodings, and it was pointed out that UTF-8 is maybe
>confusing for existing applications. So the default encoding is now
>administrator-settable, and defaults to ASCII.

>With ASCII being the default encoding, there is *no* canonical
>embedding of Unicode strings into byte strings: some Unicode strings
>("most") cannot be converted to a byte string automatically.

>[...]

I agree with you, but time travel will definitely make a criminal of me!

Cheers, Gerd