
PEP 263 comments


Martin v. Loewis

Feb 25, 2002, 12:20:20 AM
To make some progress on PEP 263, I suggest that some of the open issues
are resolved as follows:

- Comment syntax: I suggest to use the form
-*- coding: <coding name> -*-
Emacs already recognizes this syntax, as does patch #508973
on IDLEfork. The other proposed syntaxes should be removed from the
PEP.

- In addition, to simplify usage on Windows, Python recognizes the
UTF-8 file signature (e.g. as generated by notepad). Any file
starting with \xef\xbb\xbf is treated as being UTF-8; a coding
comment different from "utf-8" in such a file is an error.

- identifiers remain restricted to ASCII

- Implementation strategy: I believe the proposed strategy (change the
tokenizer) is overly complicated, and likely inefficient. Instead, I
suggest that the encoding directive applies only to Unicode literals.
It will still be formally an error if comments or string literals do
not follow the declared encoding, but the Python parser won't detect
this error.

For use in Unicode literals, the parser will continue to work as it
does now, except that it applies the declared coding in compile.c.
To do so, PyUnicode_DecodeRawUnicodeEscape and
PyUnicode_DecodeUnicodeEscape will expect an additional flag
indicating whether they operate on a char* or a Py_UNICODE*.

The only problem with this approach is that encodings where " or '
could be the second byte of a multi-byte character cannot be
supported as a source encoding. Python supports no such encoding
in the standard library at the moment, anyway, so this should not
be a problem.

- Backwards compatibility: I'm in favour of leaving mostly everything
as-is, i.e. if there is no declared encoding, it should be possible
to put arbitrary bytes in string literals and comments; the proposed
implementation strategy supports that. However, I think that Unicode
literals which use the Latin-1 fallback should be deprecated, and that
the implementation should raise a DeprecationWarning: Anybody relying
on that feature should declare that the encoding is Latin-1.

- Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
signature. If no UTF-8 signature is found, it shall look for the
coding comment. If none is found, it shall apply the locale's
coding, which is determined as follows:
- on windows, it is "mbcs"
- on Unix, it is the one returned by nl_langinfo(CODESET)
Otherwise, it is the system default encoding.

When saving a file, IDLE shall preserve the UTF-8 signature if there
was one. If not, and if there is a coding comment, that should be
used to encode the file. If there is none, the locale's encoding
should be used. If encoding fails (whether the coding was found in
the comment or in the locale), the file shall be UTF-8 encoded, and
an UTF-8 signature added.
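
  For illustration, a rough sketch of that lookup order (written against
  today's standard library; the helper name and the exact coding-comment
  regex are mine, not from the IDLE patch):

    import codecs, locale, re

    CODING_RE = re.compile(r'-\*-\s*coding:\s*([-\w.]+)\s*-\*-')

    def guess_source_encoding(path):
        with open(path, 'rb') as f:
            data = f.read()
        # 1. A UTF-8 signature wins unconditionally.
        if data.startswith(codecs.BOM_UTF8):
            return 'utf-8'
        # 2. Otherwise look for a coding comment near the top of the file.
        for line in data.splitlines()[:2]:
            match = CODING_RE.search(line.decode('latin-1'))
            if match:
                return match.group(1)
        # 3. Fall back to the locale's encoding (roughly "mbcs" on Windows,
        #    nl_langinfo(CODESET) on Unix).
        return locale.getpreferredencoding()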

Regards,
Martin

M.-A. Lemburg

Feb 26, 2002, 5:06:45 AM
"Martin v. Loewis" wrote:
>
> To make some progress on PEP 263, I suggest that some of the open issues
> are resolved as follows:

Thanks for the comments. I've updated the PEP at SourceForge...



> - Comment syntax: I suggest to use the form
> -*- coding: <coding name> -*-
> Emacs already recognizes this syntax, as does patch #508973
> on IDLEfork. The other proposed syntaxes should be removed from the
> PEP.

+1



> - In addition, to simplify usage on Windows, Python recognizes the
> UTF-8 file signature (e.g. as generated by notepad). Any file
> starting with \xef\xbb\xbf is treated as being UTF-8; a coding
> comment different from "utf-8" in such a file is an error.

+1



> - identifiers remain restricted to ASCII

+1



> - Implementation strategy: I believe the proposed strategy (change the
> tokenizer) is overly complicated, and likely inefficient. Instead, I
> suggest that the encoding directive applies only to Unicode literals.
> It will still be formally an error if comments or string literals do
> not follow the declared encoding, but the Python parser won't detect
> this error.
>
> For use in Unicode literals, the parser will continue to work as it
> does now, except that it applies the declared coding in compile.c.
> To do so, PyUnicode_DecodeRawUnicodeEscape and
> PyUnicode_DecodeUnicodeEscape will expect an additional flag
> indicating whether they operate on a char* or a Py_UNICODE*.
>
> The only problem with this approach is that encodings where " or '
> could be the second byte of a multi-byte character cannot be
> supported as a source encoding. Python supports no such encoding
> in the standard library at the moment, anyway, so this should not
> be a problem.

I've added a two phase approach to the PEP: first we only
handle Unicode literals, then we do the whole file in a later
step.



> - Backwards compatibility: I'm in favour of leaving mostly everything
> as-is, i.e. if there is no declared encoding, it should be possible
> to put arbitrary bytes in string literals and comments; the proposed
> implementation strategy supports that. However, I think that Unicode
> literals which use the Latin-1 fallback should be deprecated, and that
> the implementation should raise a DeprecationWarning: Anybody relying
> on that feature should declare that the encoding is Latin-1.

Python will have to use Latin-1 as fallback encoding anyway,
so I don't think it's worth the trouble...



> - Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
> signature. If no UTF-8 signature is found, it shall look for the
> coding comment. If none is found, it shall apply the locale's
> coding, which is determined as follows:
> - on windows, it is "mbcs"
> - on Unix, it is the one returned by nl_langinfo(CODESET)
> Otherwise, it is the system default encoding.
>
> When saving a file, IDLE shall preserve the UTF-8 signature if there
> was one. If not, and if there is a coding comment, that should be
> used to encode the file. If there is none, the locale's encoding
> should be used. If encoding fails (whether the coding was found in
> the comment or in the locale), the file shall be UTF-8 encoded, and
> an UTF-8 signature added.

I did not add the IDLE changes to the PEP. Please upload them
as feature request to SF.

Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

Martin v. Loewis

Feb 26, 2002, 5:17:33 AM
"M.-A. Lemburg" <m...@lemburg.com> writes:

> Thanks for the comments. I've updated the PEP at SourceForge...

Thanks! Is there anything that needs to be resolved before we can ask
for BDFL pronouncement? According to PEP 1, the PEP editor must be
informed that the PEP is ready for review; I suggest it is (after this
batch of changes).

> I've added a two phase approach to the PEP: first we only
> handle Unicode literals, then we do the whole file in a later
> step.

Sounds good.

> I did not add the IDLE changes to the PEP. Please upload them
> as feature request to SF.

Ok, I will.

Regards,
Martin

M.-A. Lemburg

Feb 26, 2002, 5:22:58 AM
"Martin v. Loewis" wrote:
>
> "M.-A. Lemburg" <m...@lemburg.com> writes:
>
> > Thanks for the comments. I've updated the PEP at SourceForge...
>
> Thanks! Is there anything that needs to be resolved before we can ask
> for BDFL pronouncement? According to PEP 1, the PEP editor must be
> informed that the PEP is ready for review; I suggest it is (after this
> batch of changes).

Agreed. I'll post something to python-dev...



> > I've added a two phase approach to the PEP: first we only
> > handle Unicode literals, then we do the whole file in a later
> > step.
>
> Sounds good.
>
> > I did not add the IDLE changes to the PEP. Please upload them
> > as feature request to SF.
>
> Ok, I will.

--

Jason Orendorff

Feb 26, 2002, 5:14:34 PM
Martin v. Loewis wrote:
> To make some progress on PEP 263, I suggest that some of the open issues
> are resolved as follows:

Counter-proposal:

- Comment syntax: none.
- UTF-8 file signature: not supported.
- Python source code encoding: must always be UTF-8.
- Implementation: within the parser, everything's just
ordinary UTF-8 bytes.
- IDLE: always save UTF-8 unless otherwise directed.

Advantage: simple, universal, easy, similar to what Java does.
No confusion about embedded 0x22 bytes in strings. Also,
stylistically I prefer not to have a document specify its own
encoding, or for comments to affect the meaning of a source
file.

I think this is better than PEP 263, but PEP 263 is better
than nothing.

## Jason Orendorff http://www.jorendorff.com/

Martin v. Loewis

Feb 27, 2002, 4:09:58 AM
"Jason Orendorff" <ja...@jorendorff.com> writes:

> Counter-proposal:
>
> - Comment syntax: none.
> - UTF-8 file signature: not supported.
> - Python source code encoding: must always be UTF-8.
> - Implementation: within the parser, everything's just
> ordinary UTF-8 bytes.
> - IDLE: always save UTF-8 unless otherwise directed.

Do you seriously want to pursue this route? If so, how do you want to
deal with backwards compatibility? Currently, you can put arbitrary
bytes in character strings, and people make use of this opportunity
(even though the documentation says this is undefined).

Will you reject a source module just because it contains a latin-1
comment?

Regards,
Martin

Piet van Oostrum

Feb 27, 2002, 5:53:08 AM
>>>>> "Jason Orendorff" <ja...@jorendorff.com> (JO) writes:

JO> Martin v. Loewis wrote:
>> To make some progress on PEP 263, I suggest that some of the open issues
>> are resolved as follows:

JO> Counter-proposal:

JO> - Comment syntax: none.
JO> - UTF-8 file signature: not supported.
JO> - Python source code encoding: must always be UTF-8.

There are still non-utf-8 files around, and not everyone has a utf-8
editor.

JO> - Implementation: within the parser, everything's just
JO> ordinary UTF-8 bytes.
JO> - IDLE: always save UTF-8 unless otherwise directed.

JO> Advantage: simple, universal, easy, similar to what Java does.

Java does accept iso-latin-1 files as input. In fact on my machine (Mac
OSX) it doesn't even accept utf-8 files with the utf-8 signature. And
strings containing utf-8 are interpreted as just 8-bit characters, meaning
every byte is a character.

JO> No confusion about embedded 0x22 bytes in strings. Also,
JO> stylistically I prefer not to have a document specify its own
JO> encoding, or for comments to affect the meaning of a source
JO> file.

Which 0x22 bytes?

To a certain extent it is possible to autodetect whether a file with 8-bit
characters (highest bit set) could be utf-8, but it is error-prone.

--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van....@hccnet.nl

Bengt Richter

Feb 27, 2002, 7:32:17 PM
On Mon, 25 Feb 2002 06:20:20 +0100, "Martin v. Loewis" <mar...@v.loewis.de> wrote:

>To make some progress on PEP 263, I suggest that some of the open issues
>are resolved as follows:
>
>- Comment syntax: I suggest to use the form
> -*- coding: <coding name> -*-
> Emacs already recognizes this syntax, as does patch #508973
> on IDLEfork. The other proposed syntaxes should be removed from the
> PEP.
>

Is this to be a *special purpose* (coding declaration) comment-context-limited
escape mechanism, or an *open ended* one? If special, I think Emacs can just
as well adapt to Python as vice versa. ISTM ad hoc special escapes in comments
are often the beginning of a convention for alternative out-of-band info, and
should be considered in that light. Cf. HTML ;-/

Perhaps it is time to (re?)consider a standard Python mechanism for embedding OOB
info at arbitrary places in Python source, analogous to XML's <![CDATA[ ... ]]> and
<? ... ?> (BTW, xml CDATA has a HUGE**N wart (IMO) in that it is not nestable.
They should have taken a clue from mime delimiting IMO).

I prefer orthogonal general purpose mechanisms to ad hoc syntactic escape warts,
so I would prefer <? ... ?> as a vehicle for defining encoding. <?py blah blah?>
could mean eval('blah blah') in some defined python environment and replacing
the <? ...?> with any string returned ('' if None), and then continue processing,
starting with the replaced text. This could be used for wholesale preprocessing
or simple mode flag control side effects, as in <?py set_special_flag(1)?>
(assuming that would get eval'd in a useful context).

In conjunction with a pythonic CDATA [1], it would permit wrapping a whole source
file (or starting at the second line):

#!/usr/bin/python
<?py recode(q'--unique delimiter[1]--'
... rest of source ...
--unique delimiter[1]--)?>

('recode' here has no special meaning. It would depend on the configured execution
context for <?py ...?>)

Thus the source could be arbitrarily reprocessed before being normally interpreted.
If you want to set a special effect flag, you could just say <?py set_special(1)?>.
If you want today's date embedded in place, write <?py now()?>, assuming that
function was appropriately defined in the execution context for <?py ...?>.
BTW, if we had an alternate assignment operator (perhaps ':=') that would make
flag=1 an expression when written as flag := 1, then we could write
<?py flag:=1?> instead of <?py eval(compile('flag=1','','exec'))?>

There are side effect and execution environment issues to think about, but I think
the general mechanism could be very powerful, and ways could be created to configure
its operation via site.py etc.

[1] I am proposing a special q'...' string to be very like an r'...' string except that
the string in quotes following the q is interpreted as a mime-style arbitrary delimiter
string, and the whole value of the q'delim'...delim string representation is exactly
just the characters between the delimiters. E.g., you could

assert q'<-=delim=->'content here<-=delim=-> == 'content here' # this would be true

without getting an error.

Note the lack of quotes around the final delimiter string, since it itself is
the final delimiter. This can also be used to solve the final unescaped
backslash problem for quoting windows paths:

q'|'c:\foo\bar\|

Also note nestability, assuming you guarantee unique delimiter strings:

q'<::outside::>'q'|'c:\foo\bar\|<::outside::>

A null q delimiter could be defined to imply delimiting by the end of the file or
other representation container. I.e., q''<-- content up to EOF -->

Escapes are recognized according to raw string rules inside the quotes of the
q delimiter string, so it has a final backslash problem, but that shouldn't be
too hard to live with.

q'delim' ... delim should be able to contain *anything*, including <? and """ and #... and
r'whatever', etc, but <? ?> should perhaps not take precedence over ordinary string
and comment contexts. You could argue both ways. Also, a different target than py in the
xxx slot of <?xxx ...?> should act per xml PI specs, probably.

A q string could theoretically allow putting unescaped arbitrary binary data in
a source file, though many editors would have problems dealing with it. Even so,
there might be some use for that.

Thinking about it, <? ..?> and q'delim'...delim might be worth separate PEPs
irrespective of PEP 263 relevance? Opinions?

(I decided to retitle this post, since that is the actual focus here).

Regards,
Bengt Richter

Martin von Loewis

Feb 28, 2002, 4:23:14 AM
bo...@oz.net (Bengt Richter) writes:

> Is this to be a *special purpose* (coding declaration)
> comment-context-limited escape mechanism, or an *open ended* one?

I'm not sure what an "open ended" escape mechanism is. Emacs allows
the definition of arbitrary variables in -*- lines; coding: is just
one of them.

> If special, I think Emacs can just as well adapt to Python as vice
> versa. ISTM ad hoc special escapes in comments are often the
> beginning of a convention for alternative out-of-band info, and
> should be considered in that light. Cf. HTML ;-/

I'm not sure I understand this reference, either. Do you consider the
encoding declaration in the META http-equiv tag as out-of-band, or the
one in the HTTP request? Or are you referring to something completely
different?


> Perhaps it is time to (re?)consider a standard Python mechanism for
> embedding OOB info at arbitrary places in Python source, analogous
> to XML's <![CDATA[ ... ]]> and <? ... ?> (BTW, xml CDATA has a
> HUGE**N wart (IMO) in that it is not nestable.

Processing instructions and CDATA sections are completely different
things. While I could see that processing instructions are some kind
of out-of-band information, CDATA sections are just a certain
well-defined way to present contents; you can easily transform them to
use the standard markup without losing "out-of-band" information -
they are in-band.

> I prefer orthogonal general purpose mechanisms to ad hoc syntactic
> escape warts, so I would prefer <? ... ?> as a vehicle for defining
> encoding. <?py blah blah?> could mean eval('blah blah') in some
> defined python environment

I had initially proposed the directive statement (PEP 244) as a
mechanism to also support encoding declarations; that was
rejected. Tim Peters objected to the introduction of a 'pragma'
statement at many occasions.

> In conjunction with a pythonic CDATA [1], it would permit wrapping a
> whole source file (or starting at the second line):

That sounds like a different technology. If you really want to propose
this, write a PEP.

Regards,
Martin

Stephen J. Turnbull

Feb 28, 2002, 6:39:06 AM
Hi, I'm Steve Turnbull, I do XEmacs. Mostly Mule. Barry asked me to
step up to bat on this.

Background: I've been doing Japanese Emacs ("nemacs") and Mule for 12
years now (it's where I live...). I've been watching Japanese open
source both as a wannabe hacker and a social scientist (that's my day
job) for about 10 years.

>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> "Jason Orendorff" <ja...@jorendorff.com> writes:

>> Counter-proposal:
>> - Comment syntax: none.
>> - UTF-8 file signature: not supported.
>> - Python source code encoding: must always be UTF-8.
>> - Implementation: within the parser, everything's just
>> ordinary UTF-8 bytes.
>> - IDLE: always save UTF-8 unless otherwise directed.

Martin> Do you seriously want to pursue this route?

Yes, I think you do. I've watched Japanese patches to apps languish,
never to be merged, for 2, 5, 10 years. I'm sure some go back farther
than that. All because the Japanese want to use their favorite
encodings internally and in sources, none of which (except
ISO-2022-JP) bother to announce that they are Japanese in any way.

You're really not doing anyone a favor by supporting "my favorite
encoding" anymore. Make EUC and ISO 2022 users, Shift JIS and Big 5
abusers, Windows 125x <l-word deleted> check their weapons at the
door: as soon as you get inside Python, it's all Unicode.

Martin> If so, how do you want to deal with backwards
Martin> compatibility?

You don't. From now on, anything that goes into the official Python
sources is in UTF-8. Convert any existing stuff at your leisure.
This is recommended practice for 3rd party projects, too. People can
do want they want with their own stuff, but they are on notice that if
it screws up it's their problem.

XEmacs actually did this (half-way) three years ago. I convinced
Steve Baur to convert everything in the XEmacs CVS repository that
wasn't ISO 8859/1 to ISO-2022-JP (basically, start in ASCII, all other
character sets must designate to G0 or G1, and get invoked to GL; at
newlines, return to ASCII by designation; the "JP" part is really a
misnomer, it's fully multilingual). Presto! no more accidental Mule
corruption in the repository.

NB: UTF-8 is much more tractable than ISO-2022-JP, precisely because
of the ASCII 0x22 issue. We've had problems with that (since we still
support non-Mule XEmacsen that don't understand ISO 2022 controls).
UTF-8 makes this a non-issue.

We left a lot of stuff that was ISO 8859/1 as is. But this is a bad
idea post-Euro, and causes the occasional embarrassment as we pick up
more Latin-N, N != 1, users. The Euro by itself wouldn't be a
problem, nobody uses the generic currency symbol except as a bullet in
lists. But (and again this is based on my Japanese experience) it's
the other characters in Latin-9 that are the most important characters
in the world---to those who use them: they're part of their names.
But people do occasionally use the accents in composing characters and
libraries, so deprecating Latin-1 in favor of Latin-9 probably is
going to annoy a few people who have always followed the rules. So
make a clean sweep, now. In two years, you'll have no regrets.

Oh, and does Python have message catalogs and stuff like that? Do you
really want people doing multilingual work like translation mucking
about with random coding systems and error-prone coding cookies?
UTF-8 detection is much easier than detecting that an iso-8859-1
cookie should really be iso-8859-15 (a reverse Turing test).

Martin> Currently, you can put arbitrary bytes in character
Martin> strings, and people make use of this opportunity (even

I have no sympathy for self-inflicted injuries anymore. The amount of
effort that has gone into maintaining Japanese patches for Pine and
ghostscript has been extremely painful to watch. Anyway, Python
itself provides the necessary tools for salvation.

Martin> though the documentation says this is undefined).

So much for the alleged "backward compatibility" non-issue. :-)
People are abusing implementation dependencies; Just Say No.

Martin> Will you reject a source module just because it contains a
Martin> latin-1 comment?

That depends. Somebody is going to run it through the converter; it's
just a question of whether it's me, or the submitter. In the case of
XEmacs, because everybody uses Emacs to develop, it's just not an
issue: somebody commits the change from (eg) EUC-JP to ISO-2022-JP,
and after that Mule does its thing---nobody even notices, unless they
do a diff. Even at the time we got very few complaints about spurious
diffs. Now, never.

This is true for Python, too. I don't care if people want to do their
editing in ISCII or KOI8-R or Windows-1252 even. _They have the tool
needed to convert, by definition: they're Python users._ Here's where
cookies come in.

GNU Emacs supports your coding system cookies. XEmacs currently
doesn't, but we will, I already figured out what the change is and
told Barry OK. And I plan to add cookie-checking to my latin-unity
package (which undoes the Mule screwage that says Latin-1 NO-BREAK
SPACE != Latin-2 NO-BREAK SPACE). Other editors can do something
similar.

So people who insist on using a national coded character set in their
editor use cookies. Then the python-dev crew prepares a couple of
trivial scripts which munge sources from UTF-8 to national codeset +
cookie, and back (note you have to strip the cookie on the way back),
for the sake of people whose editor's Python-mode doesn't grok cookies.

I expect that with that kind of support, what is left is just enough
pain to induce lots of people to switch to UTF-8-capable editors, and
just little enough that you can say "well, this really is for
everybody's benefit; we know it's inconvenient in the transition, and
we're doing the best we can to ease it" to the rest, and not be lynched.

This disposes of the "not everyone has a UTF-8 editor" issue. Also,
in my experience distinguishing UTF-8 from "all other coding systems"
is hardly error-prone at all, except for files with extremely low
non-ASCII content, like under 50 bytes of non-ASCII. Ben Wing has
already implemented statistical detection (ie, returns a degree of
likelihood, and could -- not implemented yet -- look at statistical
properties of the text) for XEmacs 22. I imagine I could persuade him
to donate code to Python.
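
For what it's worth, the crudest form of that detection is just a strict
decode; anything that isn't UTF-8 almost always fails (a throwaway sketch,
not Ben Wing's statistical detector):

    def looks_like_utf8(data):
        # data is a byte string; a strict decode rejects malformed sequences.
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    looks_like_utf8(b'\xc3\xa9')   # True  - UTF-8 for e-acute
    looks_like_utf8(b'\xe9')       # False - a bare Latin-1 e-acute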

My apologies for the flood. I've been thinking about exactly this
kind of transition for XEmacs for about 5 years now, this compresses
all of that into a few dozen lines....

--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.

Carel Fellinger

Feb 28, 2002, 8:21:33 AM
Stephen J. Turnbull <ste...@xemacs.org> wrote:
> Hi, I'm Steve Turnbull, I do XEmacs. Mostly Mule. Barry asked me to
> step up to bat on this.

... lots of interesting experiences snipped

Thanks for sharing, hope this helps us to do the right thing.

Now that we have your attention, what's your personal opinion on using
something other than ascii (in whatever encoding:) for the code itself
(things like keywords and variable names)?

My Western-centered view would be that it would make reading code
harder, but that's probably what the rest of the world already
experiences.

I personally use English even for my privat code and I even try to
stick to English in my comments, just in case some foreignor might have
to look at it:)
--
groetjes, carel

Martin von Loewis

Feb 28, 2002, 9:09:23 AM
to ste...@xemacs.org
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Hi, I'm Steve Turnbull, I do XEmacs. Mostly Mule. Barry asked me to
> step up to bat on this.

Thanks for your comments!

> You don't. From now on, anything that goes into the official Python
> sources is in UTF-8. Convert any existing stuff at your leisure.
> This is recommended practice for 3rd party projects, too. People can
> do what they want with their own stuff, but they are on notice that if
> it screws up it's their problem.

It's worse than this: under the proposed change, Python would refuse
to accept source code if it is not UTF-8 encoded. In turn, code that
has a euc-jp comment in it and is now happily accepted as source code
in the current Python programming language would be rejected.

This is like mandating that all Emacs-Lisp files are UTF-8, whether
they are part of the Emacs sources, or installed somewhere out there
in the wild.

> XEmacs actually did this (half-way) three years ago. I convinced
> Steve Baur to convert everything in the XEmacs CVS repository that
> wasn't ISO 8859/1 to ISO-2022-JP (basically, start in ASCII, all other
> character sets must designate to G0 or G1, and get invoked to GL; at
> newlines, return to ASCII by designation; the "JP" part is really a
> misnomer, it's fully multilingual). Presto! no more accidental Mule
> corruption in the repository.

This is a different issue: We are not discussing the encoding that the
Python sources use in the Python CVS tree, we are discussing the
encoding that Python source code uses.

> Oh, and does Python have message catalogs and stuff like that? Do you
> really want people doing multilingual work like translation mucking
> about with random coding systems and error-prone coding cookies?
> UTF-8 detection is much easier than detecting that an iso-8859-1
> cookie should really be iso-8859-15 (a reverse Turing test).

Python supports gettext, but this is still a different issue. The
Unicode type of Python is precisely that - it is not that Python would
support different wide character implementations internally. Again,
the issue is how source code is encoded.

> So much for the alleged "backward compatibility" non-issue. :-)
> People are abusing implementation dependencies; Just Say No.

A very radical opinion :-) but I get the feeling you might be missing
the point in question ...

> Martin> Will you reject a source module just because it contains a
> Martin> latin-1 comment?
>
> That depends. Somebody is going to run it through the converter; it's
> just a question of whether it's me, or the submitter.

'you' in this case isn't the maintainer of a software package; it is
the Python source code parser...

> GNU Emacs supports your coding system cookies. XEmacs currently
> doesn't, but we will, I already figured out what the change is and
> told Barry OK. And I plan to add cookie-checking to my latin-unity
> package (which undoes the Mule screwage that says Latin-1 NO-BREAK
> SPACE != Latin-2 NO-BREAK SPACE). Other editors can do something
> similar.

I assume you are talking about the -*- coding: foo -*- stuff here?
*This* is the issue in question. Should we allow it, or should we
mandate that all Python source code (not just the one in the Python
CVS) is UTF-8?

> So people who insist on using a national coded character set in their
> editor use cookies. Then the python-dev crew prepares a couple of
> trivial scripts which munge sources from UTF-8 to national codeset +
> cookie, and back (note you have to strip the cookie on the way back),
> for the sake of people whose editor's Python-mode doesn't grok cookies.

Again, not the issue: Most people run Python programs without ever
submitting them to python-dev :-)

You may wonder why Python (the programming language) needs to worry
about the encoding at all. The reason is that we allow Unicode
literals, in the form

u"text"

The question is what is the encoding of "text", on disk. In memory, it
will be 2-byte Unicode, so the interpreter needs to convert. To do
that, it must know what the encoding is, on disk. The choices are
using either UTF-8, or allowing encoding cookies.
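
To make that concrete, here is the same pair of bytes read under different
assumed encodings (an illustrative sketch using the codec names Python
ships today):

    raw = b'\xc3\xa4'        # the bytes of a literal as stored on disk
    raw.decode('utf-8')      # u'\xe4'     - one character (a-umlaut)
    raw.decode('latin-1')    # u'\xc3\xa4' - two characters
    raw.decode('koi8-r')     # two entirely different characters again

Without knowing the declared encoding, the interpreter cannot tell which
of these the author meant.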

> My apologies for the flood. I've been thinking about exactly this
> kind of transition for XEmacs for about 5 years now, this compresses
> all of that into a few dozen lines....

I'm not sure whether a similar issue exists in XEmacs: the encoding of
ELisp would be closest, but only if the Lisp interpreter ever needs to
worry about that.

Regards,
Martin

Jason Orendorff

Feb 28, 2002, 10:35:56 PM
Piet van Oostrum wrote:
> JO> Advantage: simple, universal, easy, similar to what Java does.
>
> Java does accept iso-latin-1 files as input. In fact on my machine (Mac
> OSX) it doesn't even accept utf-8 files with the utf-8 signature. And
> strings containing utf-8 are interpreted as just 8-bit characters, meaning
> every byte is a character.

Oh! Yes, it works this way on Windows, too. javac assumes source
files are latin-1, and System.out.println() encodes output in latin-1.
Odd, because Java assumes UTF-8 everywhere else. Somehow I missed
this before (I guess because UTF-8 data slips through intact).

Scratch that, then.

> JO> No confusion about embedded 0x22 bytes in strings. Also,
> JO> stylistically I prefer not to have a document specify its own
> JO> encoding, or for comments to affect the meaning of a source
> JO> file.
>
> Which 0x22 bytes?

I'm referring to this paragraph in Martin's original post:

The only problem with this approach is that encodings where " or '
could be the second byte of a multi-byte character cannot be
supported as a source encoding. Python supports no such encoding
in the standard library at the moment, anyway, so this should not
be a problem.

\x22 is a double-quote mark. Martin is a little off on the last
bit, though; UTF-16 can produce \x22 bytes.
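
A quick demonstration with today's codecs (the code point chosen is
arbitrary):

    u'\u0122'.encode('utf-16-le')   # b'"\x01' - the 0x22 byte is half of U+0122
    u'"'.encode('utf-16-le')        # b'"\x00' - an actual double quote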

Stephen J. Turnbull

Mar 1, 2002, 12:40:22 AM
>>>>> "Carel" == Carel Fellinger <cfel...@iae.nl> writes:

Carel> Stephen J. Turnbull <ste...@xemacs.org> wrote:

Carel> Now that we have your attention, what's your personal
Carel> opinion on using something else but ascii (in whatever
Carel> encoding:) for the code itself (things like keywords and
Carel> variable names).

My _very_ personal opinion is that identifiers can be run through an
online dictionary, and I'm fortunate to be pretty good at languages,
so that's fun. Nor do I do programming for a living, I can afford to
concentrate on "fun."

My opinion as a teacher is that I'd really like to be able to teach
programming to my freshmen with a language that uses Japanese
identifiers (including reserved words) and syntax (Japanese is a
reverse Polish language).

Carel> My Western centered view would be that it would make
Carel> reading code harder, but that's probably what the rest of
Carel> the world already experiences.

For "real-world" use, I mostly agree that sticking to English is
better, for code. None of the Japanese programmers I know feel that
use of English for identifiers and reserved words hinders them. In
fact, some think it may help somewhat, as it allows them to abstract
from any connotations that the word may have in natural language.
(Note that this implies that second courses in programming should use
an English-based language.)

There are exceptions, however. I maintain code that implements
Japanese input methods in XEmacs, and even native speakers of Japanese
sometimes get confused about what certain identifiers drawn from
English, or even romanized Japanese, are supposed to mean.

On the other hand, edict.el, which implements morphological
transformations for dictionary lookup, uses mixed English-Japanese
identifiers. The verbs (search, transform, etc) are English; the
parts of speech and technical names for transformations are Japanese
(ie, ideographs). Nobody (who is competent to understand what the
code does) ever makes a mistake.

This is a very special case, of course. But you can imagine many
cases where an organization has internal practices that would best be
described in native vocabulary. Then it's a tradeoff between
short-term maintainability and long-term internationalizability.

Carel> I even try to stick to English in my comments, just in case
Carel> some foreignor might have to look at it:)

If it weren't for the "groetjes, carel" in your .sig, I'd not be able
to say with confidence you're not a native English-speaker. Despite
the excellent English educational system, I cannot say the same for
Indians in general. Most are sufficiently fluent that your practice
would make sense, but many are not. And Japanese are legendarily bad
at English.

A correct comment in an illegible character set is harmless, although
perhaps frustrating; an incorrect one that you can read is dangerous.
So I'd have to say that except for the very worst cases, there are
benefits to having code in English, or near-English, vocabulary. But
comments should be written in a language you can use for teaching.

Stephen J. Turnbull

Mar 1, 2002, 1:39:42 AM
Please note, you are correct---I missed the main point. I'm not going
to address that point-by-point, unless you ask. However, it's easy
enough to paraphrase my comments to apply to "are we going to mandate
UTF-8 for all source code?". The following is a rough counter-
proposal to handle the main issues, framed correctly.

>>>>> "Martin" == Martin von Loewis <loe...@informatik.hu-berlin.de> writes:

Martin> [U]nder the proposed change, Python would refuse to accept
Martin> source code if it is not UTF-8 encoded. In turn, code that
Martin> has a euc-jp comment in it and is now happily accepted as
Martin> source code in the current Python programming language
Martin> would be rejected.

Fine with me for Elisp. We don't have satisfactory UTF-8 support yet,
but will soon. After that there will be no excuses for us, since the
editor is the interpreter.

This _is_ the direction we're heading. Emacs treats Lisp files the
same as any other on initial loading, so we already have hooks in
place that could be used to translate. The point is that if XEmacs
didn't accept the encoding of a Lisp file, it would be on the user-
provided codec to get it right. It's not our problem (except to the
extent that we would of course provide such codecs). We're currently
working to make codecs available to users in a convenient, consistent
way--and Python has the advantage that that part is done!

Such a hook probably would be something new in Python, but I don't see
that it would be terribly difficult to implement.

Martin> Will you reject a source module just because it contains a
Martin> latin-1 comment?

>> That depends. Somebody is going to run it through the
>> converter; it's just a question of whether it's me, or the
>> submitter.

Martin> 'you' in this case isn't the maintainer of a software
Martin> package; it is the Python source code parser...

Same thing either way, if you add the preprocessing hook.

IMO, the Python source code parser should never see any text data[1]
that is not UTF-8 encoded. If you want to submit Python programs to
the parser that are not UTF-8 encoded, then it is your responsibility
as the programmer to make sure they get translated into UTF-8 (eg, by
the preprocessing hook) before the interpreter proper ever sees them.

Note that the Python language doesn't need to specify at all what's
allowed on the preprocessing hook. It can be a Perl-to-Python
translator, for all I care. You simply say "if the parser doesn't
accept it when run `python --skip-preprocessing-hook', it's not valid
Python." No more problem. From the user's point of view, in everyday
operation it's basically the same as a Python which accepts his
favorite encoding. From your point of view, the interpreter is
invulnerable to coding issues. Even if you choose to support complex
(eg, autorecognizing[2]) codecs on the pre-processing hook as part of
the Python library, bugs are more easily localized.
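
A rough sketch of what such a hook might look like, just to fix ideas (the
function name and the way it would be wired into the interpreter are
entirely hypothetical):

    def preprocessing_hook(source_bytes, declared_encoding):
        # Decode from whatever the programmer's editor produced ...
        text = source_bytes.decode(declared_encoding)
        # (a real hook would also strip or rewrite any coding cookie here)
        # ... and hand the parser nothing but UTF-8.
        return text.encode('utf-8')

Everything downstream of that point, including error reporting, would then
see only UTF-8.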

Since the preprocessing-hook would be callable from Python, it would
be easy to run it as a separate program, and require the users to send
the output of that as the bug report.

The final benefit is that in multilingual environments it makes use of
UTF-8 a lot more attractive to the users. But those are exactly the
environments where coding cookies will be a massive pain for Python to
support them, because people will forever be copying the top matter
from German files into Polish files and forgetting to adjust the
cookie, etc.


Footnotes:
[1] Ie, Python language or character text. It might be convenient to
have an octet-string primitive data type, in which you could put
EUC-encoded Japanese or Java byte codes. However, the Python
interpreter would never do anything with them except (1) pass them
whole as arguments or variable values, (2) extract slices, and (3)
extract individual octets as an integral (but non-character) type.
(Roughly speaking. There might be other operations that "base Python"
should implement, like applying codecs.)

[2] But I recommend against this. Don't offer support for such; it's
a time and effort sink, for little return.

Martin v. Loewis

Mar 1, 2002, 2:29:43 AM
to ste...@xemacs.org
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> IMO, the Python source code parser should never see any text data[1]
> that is not UTF-8 encoded. If you want to submit Python programs to
> the parser that are not UTF-8 encoded, then it is your responsibility
> as the programmer to make sure they get translated into UTF-8 (eg, by
> the preprocessing hook) before the interpreter proper ever sees them.

That would cause surprises to users. They have a source program that says

# -*- coding: koi8-r -*-
print "some cyrillic text"

This currently works fine on their system; the text comes out on the
terminal just right. Now, Python would convert this text silently to
UTF-8 behind their backs, and the terminal would show just garbage.

In XEmacs, this is no problem: the "terminal" mostly is the *Messages*
buffer, and that would know that all text is UTF-8.

For Unicode strings, we indeed plan to make the transformation you
suggest (not to UTF-8, though): If you have a script that reads

# -*- coding: koi8-r -*-
print u"some cyrillic text"

then the string literal will be converted to the internal Unicode
type. How to print it is then another issue; you'll have to figure out
the encoding of the terminal - that is feasible in most cases.
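
In practice, "figure out the encoding of the terminal" boils down to
something like the following (a sketch; the details differ per platform and
per Python version):

    import sys, locale

    def terminal_encoding():
        # The encoding of an attached terminal, when Python can determine it ...
        enc = getattr(sys.stdout, 'encoding', None)
        if enc:
            return enc
        # ... otherwise fall back to the locale
        # (roughly nl_langinfo(CODESET) on Unix, "mbcs" on Windows).
        return locale.getpreferredencoding()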

It is not feasible to do the same for arbitrary byte strings: You (the
Python interpreter) could not know whether the string is supposedly
UTF-8 encoded, and that conversion to the terminal's encoding is
needed, or whether the string is an arbitrary byte sequence, which is
intended to appear on the terminal as-is. "All byte strings are UTF-8"
is not going to work, since Python is used to operate on binary data
as well, and the bytes that make up a GIF file just aren't UTF-8.

> [1] Ie, Python language or character text. It might be convenient to
> have an octet-string primitive data type, in which you could put
> EUC-encoded Japanese or Java byte codes.

The traditional "string" type is, in fact, a byte string type. Many
people use it still for character strings, since the Unicode type was
the later addition. Changing the string type to be a Unicode type was
not feasible since that would have broken many applications, in
particular C modules which expect that the internal representation of
the string type is char[].

> [2] But I recommend against this. Don't offer support for such; it's
> a time and effort sink, for little return.

There will be a simple form of auto-recognition: a UTF-8 signature
(i.e. a UTF-8-encoded BOM) at the beginning of a source file will be
treated as a clear indication that the file is UTF-8.

Regards,
Martin

Martin v. Loewis

Mar 1, 2002, 2:34:26 AM
"Jason Orendorff" <ja...@jorendorff.com> writes:

> > Java does accept iso-latin-1 files as input. In fact on my machine (Mac
> > OSX) it doesn't even accept utf-8 files with the utf-8 signature. And
> > strings containing utf-8 are interpreted as just 8-bit characters, meaning
> > every byte is a character.
>
> Oh! Yes, it works this way on Windows, too. javac assumes source
> files are latin-1, and System.out.println() encodes output in latin-1.

That is not completely true. javac has the -encoding command line
option, which allows you to specify the source encoding; this defaults
to the platform default encoding (which was probably latin-1 or
windows-1252 on your systems).

> I'm referring to this paragraph in Martin's original post:
>
> The only problem with this approach is that encodings where " or '
> could be the second byte of a multi-byte character cannot be
> supported as a source encoding. Python supports no such encoding
> in the standard library at the moment, anyway, so this should not
> be a problem.
>
> \x22 is a double-quote mark. Martin is a little off on the last
> bit, though; UTF-16 can produce \x22 bytes.

Right. Source encodings (at least under the initial implementation)
need to be an ASCII superset (in the sense that source code that uses
only ASCII characters is ASCII-encoded); I see no way to allow UTF-16
as a source encoding.
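
The reason is easy to see if you look at what UTF-16 does to a line of
plain ASCII source (illustrative only):

    u'print "x"'.encode('utf-16-le')
    # b'p\x00r\x00i\x00n\x00t\x00 \x00"\x00x\x00"\x00'
    # Every second byte is NUL, so a byte-oriented tokenizer no longer
    # recognizes keywords, quotes or newlines.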

Regards,
Martin

Bengt Richter

Mar 1, 2002, 4:30:13 AM
On 28 Feb 2002 15:09:23 +0100, Martin von Loewis <loe...@informatik.hu-berlin.de> wrote:
[...]

>You may wonder why Python (the programming language) needs to worry
>about the encoding at all. The reason is that we allow Unicode
>literals, in the form
>
> u"text"
>
>The question is what is the encoding of "text", on disk. In memory, it
>will be 2-byte Unicode, so the interpreter needs to convert. To do
>that, it must know what the encoding is, on disk. The choices are
>using either UTF-8, or allowing encoding cookies.
>
I'm not sure what you mean by 'encoding cookies' but I assume you
mean something analogous to browser cookies, where some data of
interest is stored separately but related to some other data and
processing, like HTML form submissions etc.

Well, forget the cookie associations, but I think keeping meta-data
separate from data is a Good Thing(tm).

Also keeping it out of the names of things (i.e., don't encode file types
in name extensions ;-)

<the main idea of this post>
Perhaps we could just use a file to contain extra file metadata,
letting a file of metadata govern other files it names in the same
directory as itself. Probably a dot file in *nix.

For PEP 263 purposes, it would only need to be a text file with file
names tab delimited from keyword=encoding-info, with the first line(s)
perhaps with a glob pattern for a compact way of specifying encoding
for a lot of files in a directory at once.

To provide international encoding for file-associated info, like
a local dialect/special characters name etc., in a system whose
native file naming is more restricted, perhaps this directory of
file attributes could be standardized to UTF-8 for its own encoding.

That way, you could have the first column represent the file name
the system sees and an optional uname= keyword could provide an
alternate utf-8 encoded name for the file that tools that knew of it
could display, and then encoding=whatever for the actual file data per se.

The nice thing is that you don't have to touch the original files
to describe them. By including a location= keyword you could even
have this work like a symbolic link to a network file or even
an URL-specified file, which could be read-only and burned in a CD,
or a please-mount-backup-tape-x location, etc.

The actual file data would not have to be in the same directory at all.
</the main idea of this post>
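
A minimal sketch of how a tool might consult such a file (the name
`.encodings`, the tab-delimited layout and the keyword names are all
hypothetical, per the idea above):

    import fnmatch, os

    def declared_encoding(path, default='ascii'):
        meta = os.path.join(os.path.dirname(path) or '.', '.encodings')
        if not os.path.exists(meta):
            return default
        with open(meta, encoding='utf-8') as f:   # the metadata file itself is always UTF-8
            for line in f:
                pattern, _, attrs = line.rstrip('\n').partition('\t')
                if fnmatch.fnmatch(os.path.basename(path), pattern):
                    for item in attrs.split('\t'):
                        key, _, value = item.partition('=')
                        if key == 'encoding':
                            return value
        return default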

I have more ideas, but I tend to overdo one post that way ;-)

Regards,
Bengt Richter

P.S. This discussion made me look for some more UTF info. For anyone
interested, I found a FAQ at

http://www.unicode.org/unicode/faq/utf_bom.html#2

and

http://www.unicode.org/unicode/reports/tr27/

has a nice table showing where bits go for UTF-8 and UTF-16 encoding
of unicode characters, and even 32-bit stuff.

Might make a refs links for the PEP.

There are some changes as to legality checks, apparently,
as of last May. I'm wondering if this affects PEP 263
and/or the unicode implementation in Python.

Martin von Loewis

Mar 1, 2002, 5:04:06 AM
bo...@oz.net (Bengt Richter) writes:

> I'm not sure what you mean by 'encoding cookies' but I assume you
> mean something analogous to browser cookies, where some data of
> interest is stored separately but related to some other data and
> processing, like HTML form sumbissions etc.

I took the term from Stephen Turnbull, and I understood him to mean
the coding: variable that Emacs recognizes, i.e. some declaration
inside the file - quite unlike a browser cookie.

> Perhaps we could just use a file to contain extra file metadata,
> letting a file of metadata govern other files it names in the same
> directory as itself. Probably a dot file in *nix.

Nice idea, different PEP.

> For PEP 263 purposes, it would only need to be a text file with file
> names tab delimited from keyword=encoding-info, with the first line(s)
> perhaps with a glob pattern for a compact way of specifying encoding
> for a lot of files in a directory at once.

I don't think the file encoding information should be stored in a
different file; the risk of the two files becoming disassociated is
just too big to be acceptable.

> To provide international encoding for file-associated info, like
> a local dialect/special characters name etc., in a system whose
> native file naming is more restricted, perhaps this directory of
> file attributes could be standardized to UTF-8 for its own encoding.

We are not talking about file names here, but about file contents.

> There are some changes as to legality checks, apparently,
> as of last May. I'm wondering if this affects PEP 263
> and/or the unicode implementation in Python.

That doesn't affect this PEP; as for the Unicode 3.1 conformance, I
believe the current CVS implements UTF-8 correctly.

Regards,
Martin

Stephen J. Turnbull

Mar 1, 2002, 6:34:59 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> "Stephen J. Turnbull" <ste...@xemacs.org> writes:
>> IMO, the Python source code parser should never see any text
>> data[1] that is not UTF-8 encoded. If you want to submit
>> Python programs to the parser that are not UTF-8 encoded, then
>> it is your responsibility as the programmer to make sure they
>> get translated into UTF-8 (eg, by the preprocessing hook)
>> before the interpreter proper ever sees them.

Martin> That would cause surprises to users. They have a source
Martin> program that says

Martin> # -*- coding: koi8-r -*-
Martin> print "some cyrillic text"

Martin> This currently works fine on their system; the text comes
Martin> out on the terminal just right. Now, Python would convert
Martin> this text silently to UTF-8 behind their backs, and the
Martin> terminal would show just garbage.

No, it shows "Error: non-UTF-8 data detected in string." Conversion
only takes place if a preprocessing hook function is defined, and the
same environment that provides an appropriate preprocessing hook will
also arrange to make sure that program I/O is done in KOI8-R, too.

But I take your point. It will take time to develop such
environments. In the interim, it will cause users who are currently
depending on undefined behavior pain.

You _can_ say "no" now, while things are undefined. Or you can change
the language definition to promise support. If you do that, you are
unlikely to be able to get rid of that support for decades, as legacy
software will depend on it.

Martin> How to print it is then another issue; you'll have to
Martin> figure out the encoding of the terminal - that is
Martin> feasible in most cases.

Why open up that Pandora's box? Push it out into user space. Support
them as much as you want to with libraries, give up when it gets too
hard (it will!). My experience is that users will not thank you for
anything less than perfect support for all coding systems yesterday,
if the language definition promises any support at all. If the
language definition says "UTF-8 or die", they will thank you for the
nice codecs you provide to ease the transition.

>> [1] Ie, Python language or character text. It might be
>> convenient to have an octet-string primitive data type, in
>> which you could put EUC-encoded Japanese or Java byte codes.

Martin> The traditional "string" type is, in fact, a byte string
Martin> type. Many people use it still for character strings,

Maybe you don't need a third type. I see it as a matter of a
transition strategy, to allow you to generate exactly the error I
suggest above.

Huaiyu Zhu

Mar 1, 2002, 2:41:54 PM
I've been following this discussion with quite some interest, but I do not
have the background to delimit the scope of various concepts. Is there a
gentle introduction for a unicode newbie?

On 01 Mar 2002 15:39:42 +0900, Stephen J. Turnbull <ste...@xemacs.org> wrote:
>
>IMO, the Python source code parser should never see any text data[1]
>that is not UTF-8 encoded.

Presumably this discussion only concerns unicode strings - I don't think I
want to lose the ability to read in arbitrary binary data as a raw string.
But then you mention

>[1] Ie, Python language or character text. It might be convenient to
>have an octet-string primitive data type, in which you could put
>EUC-encoded Japanese or Java byte codes.

What's the difference between this and a raw string (a byte sequence) that
you can translate into any other encoding?

Huaiyu

Carel Fellinger

Mar 1, 2002, 3:10:20 PM
Stephen J. Turnbull <ste...@xemacs.org> wrote:

---lots snipped---

Thanks for your well-balanced comments; they sure refined my thinking
on the subject.

> My _very_ personal opinion is that identifiers can be run through an
> online dictionary, and I'm fortunate to be pretty good at languages,
> so that's fun. Nor do I do programming for a living, I can afford to
> concentrate on "fun."

Yep, fun, but very hard, as single-word translation is among the
hardest translation problems. Wouldn't it be better if we had a kind
of inter-lingua-meaning-representation and each and every identifier
linked to one by the author of the code himself? The `translation'
process could then be automated, and everyone could read the source in
her/his own language. (I know, in practice next to impossible to
achieve such a inter-lingua complete with all the mappings to all the
languages out there, but he I'm porting an old project I was involved
in that tried this route for general texts, so why not dream on:)

> My opinion as a teacher is that I'd really like to be able to teach
> programming to my freshmen with a language that uses Japanese
> identifiers (including reserved words) and syntax (Japanese is a
> reverse Polish language).

We used to make fun of the French who had their own version of Algol,
but growing older, wiser(?) and more experienced and finally having
kids of my own that want to program, I've come to appreciate this point
more and more. And now with the world getting smaller and smaller
thanks to the internet, I tend to think it's a problem we should tackle.

...


> A correct comment in an illegible character set is harmless, although
> perhaps frustrating; an incorrect one that you can read is dangerous.

very true

> So I'd have to say that except for the very worst cases, there are
> benefits to having code in English, or near-English, vocabulary. But
> comments should be written in a language you can use for teaching.

Well, at least a language you are reasonably well versed in, so as to prevent
those `incorrect' comments.

--
groetjes, carel

Martin v. Loewis

Mar 1, 2002, 3:34:02 PM
hua...@gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> I've been following this discussion with quite some interest, but I do not
> have the background to delimit the scope of various concepts. Is there a
> gentle introduction to a unicode-newbie?

There are a number of introductions to Unicode; you may want to search
www.unicode.org, e.g.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

> >IMO, the Python source code parser should never see any text data[1]
> >that is not UTF-8 encoded.
>
> Presumably this discussion only concerns unicode strings - I don't think
> want to lose the ability to read in arbitrary binary data as a raw string.

First and foremost, the discussion is only about source code. A byte
string should certainly be able to store arbitrary bytes. Under
Stephen's proposal, it would indeed not be possible anymore to put
arbitrary binary data into source code.

> >[1] Ie, Python language or character text. It might be convenient to
> >have an octet-string primitive data type, in which you could put
> >EUC-encoded Japanese or Java byte codes.
>
> What's the difference between this and a raw string (a byte sequence) that
> you can translate into any other encoding?

Arbitrary binary data doesn't have a character set. If the data are
character data, they should be stored as a character string (which, in
Python, is a Unicode string).

Regards,
Martin

Skip Montanaro

Mar 1, 2002, 3:53:06 PM

Martin> A byte string should certainly be able to store arbitrary
Martin> bytes. Under Stephen's proposal, it would indeed not be possible
Martin> anymore to put arbitrary binary data into source code.

It's clear from discussions on python-dev that, like Barry Warsaw, I wear
American sunglasses when exposed to the Unicode sun.

Just to make sure I understand correctly, under Stephen's proposal would

s = "\xff"

be correct? I assume

s = "ÿ"

(a literal 0377 character) would be an error, yes? That is, when you saw
"arbitrary binary data" you are referring to non-printable octets in the
source file, right?

--
Skip Montanaro (sk...@pobox.com - http://www.mojam.com/)

Martin v. Loewis

Mar 1, 2002, 4:20:47 PM
Skip Montanaro <sk...@pobox.com> writes:

> Just to make sure I understand correctly, under Stephen's propsal would
>
> s = "\xff"
>
> be correct? I assume
>
> s = "ÿ"
>
> (a literal 0377 character) would be an error, yes?

Yes, on both accounts.

> That is, when you saw "arbitrary binary data" you are referring to
> non-printable octets in the source file, right?

Correct (except that whether something is printable is in the eye of
the beholder). On the source level, the four characters '\', 'x', 'f',
'f' are not arbitrary binary data - they follow a specific syntax.

I actually doubt anybody is putting "arbitrary binary" data into
source code. Instead, most such occurrences are likely "printable", if
viewed in the encoding of the author of that code. Those would be
outlawed, unless that encoding is UTF-8.
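
Concretely (with today's notation for byte strings): the escape form stays
pure ASCII on disk, while the "printable in my encoding" form puts a raw
non-ASCII byte into the file:

    's = "\\xff"'.encode('ascii')   # fine: backslash, x, f, f are ASCII characters
    's = "\xff"'.encode('ascii')    # UnicodeEncodeError - the literal 0xFF is not ASCII
    b'\xff'.decode('utf-8')         # UnicodeDecodeError - nor is it valid UTF-8 on its own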

Regards,
Martin

Bengt Richter

Mar 1, 2002, 9:26:34 PM
On 01 Mar 2002 11:04:06 +0100, Martin von Loewis <loe...@informatik.hu-berlin.de> wrote:

>bo...@oz.net (Bengt Richter) writes:
[...]


>
>> Perhaps we could just use a file to contain extra file metadata,
>> letting a file of metadata govern other files it names in the same
>> directory as itself. Probably a dot file in *nix.
>
>Nice idea, different PEP.
>
>> For PEP 263 purposes, it would only need to be a text file with file
>> names tab delimited from keyword=encoding-info, with the first line(s)
>> perhaps with a glob pattern for a compact way of specifying encoding
>> for a lot of files in a directory at once.
>
>I don't think the file encoding information should be stored in a
>different file; the risk of the two files becoming disassociated is
>just too big to be acceptable.
>

Well, there's a risk of a symlink becoming dissociated from its file too,
but if you are using the mechanism, it's quickly apparent when it breaks.

I agree there is a dissociation danger, but when an error pops up, it will
be easy to add the misbehaving file's name to the local meta-data directory file.
Encoding detection tools also could do a one-time scan of a directory and validate
the metadata, or at least warn.

If desired, you can hide the actual data file by renaming it to a hashed
or other alias name, using the metadata entry to show the original name
and its symlink-like location option to point to the renamed file. Thus you
force the tools either not to find file_orig_name, or to look in the directory
file, where they will find

file_orig_name encoding=UTF-8 location=./hashedname

or not find it at all, but meta-data would not silently be ignored.

But the best way to keep metadata is with actual file system support, as Paul
Prescod mentioned privately (and I was about to go into when I decided my post
was getting too long ;-)

What I want is a universal file typing metadata prefix with codes issued
through a registry system that assigns numbers in a way that provides for both
common de facto standards and private company proprietary file types.

The prefix would be copied with any copy of the data file, but it would be excluded
from the range of normal seek operations.

If the prefix contained a location symlink, that would be all that was copied by
default. Data-verifiable links could contain md5 hashes of the data they link to.
The reason for the location link option is to be able to wrap legacy files and
whole file systems without modifying them, while still being able to integrate
them into the new file system. I like the idea of doing the prefix in UTF-8 so that a local
system can wrap a foreign system with local file names in the "symlinks".

Think of it as meta-data-enhanced UTF-8-encoded super-symlinks, with the principal
purpose of carrying universal-file-type-code & encoding-id in the meta-data, and
the option of an absent location-link meaning data follows immediately in
the same container as the super-symlink.

Note that not touching the linked file's data would allow you to enhance
the meta-data to handle compounded encoding formats which would not allow
embedding -*- "cookies" -*-, such as zip, tgz, pgp, bin64, etc.

You could create a new file with the metadata as a prefix with data
immediately following (probably on a block boundary), but this would
be nicest with file system support.

>> To provide international encoding for file-associated info, like
>> a local dialect/special characters name etc., in a system whose
>> native file naming is more restricted, perhaps this directory of
>> file attributes could be standardized to UTF-8 for its own encoding.
>
>We are not talking about file names here, but about file contents.
>

Right, I was 'introducing an optional side benefit'[1] of the main idea.
IOW, you could e.g. have optional meta-data containing a Sanskrit string
for whatever purpose, irrespective of the type or encoding declared for
the file that the meta-data was associated with. If the meta-data file
were specified to always be UTF-8 encoded, there would be no restriction on *its*
content, though the associated data file might be EBCDIC or whatever.

Whether you wanted to support file annotations, or special-language display names
or whatever would be a design decision above the super-symlink general
infrastructure I am describing.

[1] I guess that's a bad habit when trying to communicate a main idea ;-/

>> There are some changes as to legality checks, apparently,
>> as of last May. I'm wondering if this affects PEP 263
>> and/or the unicode implementation in Python.
>
>That doesn't affect this PEP; as for the Unicode 3.1 conformance, I
>believe the current CVS implements UTF-8 correctly.
>

I'll take your word for it ;-)

BTW, if my display font is Lucida Console, will I be able to see the
infinity symbol like the 'A' in the following?
>>> u'\u0041'
u'A'
>>> u'\u221e'
u'\u221e'

Will the following work?

>>> print u'\u0041'
A
>>> print u'\u221e'

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>>

And if I redirect the output to a file, what will be in the file?

Regards,
Bengt Richter

Jason Orendorff

unread,
Mar 1, 2002, 10:18:27 PM3/1/02
to
I've been following this rather closely and thinking a lot about it,
and concluded that this is too big a headache to be believed.

Ideally, from the programmer's perspective:

* All my existing Python code should continue to run.
* I shouldn't have to understand what Unicode is, if all I want
is to bang out a quick script to say "hello world" in my native
language.
* I should be able to send Python files to other people in
other countries, and they should run fine there too.
* I should be able to use 'print' on strings and unicode strings
and get sensible output (I'll know it when I see it <wink>).
* Comments shouldn't affect the meaning of code.
* Random binary garbage in comments should be ignored, just like
it is today.

Unfortunately it's all impossible. I think MvL's proposal comes
about as close as anyone can today, but it's still yucky, and it's
*definitely* an abuse of Emacs's "-*- coding: -*-" magic (my main
objection).

I'm still holding out for an impossible ideal (or at least *clean*)
solution. But of course it's not my opinion that matters <wink>...

Martin v. Loewis

unread,
Mar 2, 2002, 3:25:38 AM3/2/02
to
bo...@oz.net (Bengt Richter) writes:

> Will the following work?
>
> >>> print u'\u0041'
> A
> >>> print u'\u221e'
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeError: ASCII encoding error: ordinal not in range(128)
> >>>

With the patch to print Unicode to the Windows console, the
UnicodeError will be gone. Whether or not Lucida Unicode supports that
character, I cannot say.

> And if I redirect the output to a file, what will be in the file?

Since the isatty test fails, the system default encoding will be used.
Unless you've changed that, you'll get a UnicodeError again.
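
The usual workaround is for the program to pick a codec explicitly when
output is not a terminal; a minimal sketch (the choice of UTF-8 below is
only an example, not anything Python would pick by itself):

  import sys

  if sys.stdout.isatty():
      print u'\u221e'                                      # let the console codec handle it
  else:
      sys.stdout.write(u'\u221e'.encode('utf-8') + '\n')   # redirected: encode explicitly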

Regards,
Martin

Martin v. Loewis

unread,
Mar 2, 2002, 3:36:04 AM3/2/02
to
"Jason Orendorff" <ja...@jorendorff.com> writes:

> Unfortunately it's all impossible.

Unfortunately, yes. Let me explain how this will work under PEP 263.

> * All my existing Python code should continue to run.

They will, but you might get a DeprecationWarning; eventually,
your code might stop being accepted.

> * I shouldn't have to understand what Unicode is, if all I want
> is to bang out a quick script to say "hello world" in my native
> language.

You won't need to understand what Unicode is. You will need to
understand what encodings are, though, i.e. you should know that "Grüß
Gott" requires Latin-1. With that knowledge, you can either change the
system default encoding, or put an encoding declaration in your file.
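
For example, a minimal script with such a declaration (just a sketch,
using the Emacs-style comment form):

  # -*- coding: latin-1 -*-
  print "Grüß Gott"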

> * I should be able to send Python files to other people in
> other countries, and they should run fine there too.

That will work for the encoding declaration case (either with the
Unicode marker, or the UTF-8 signature). If people have changed the
system default encoding, this property may not hold.

> * I should be able to use 'print' on strings and unicode strings
> and get sensible output (I'll know it when I see it <wink>).

That will work depending on the terminal. In IDLE, it does currently
work. For plain strings, it will also normally work, unless the file's
encoding differs from the system encoding (or, rather, the user's
terminal encoding). For Unicode strings, I'll write a PEP describing
the use of Unicode at system interfaces, which targets this issue.

> * Comments shouldn't affect the meaning of code.

They won't in phase 1 of the PEP (although you will get a warning if
you haven't declared an encoding and have left the system encoding at
ASCII). In phase 2, you'll get a warning if the comment is not
well-formed under the encoding.

> * Random binary garbage in comments should be ignored, just like
> it is today.

That property may go away in phase 2. E.g. in a UTF-8 encoded file,
Python will eventually verify that the comments actually are UTF-8.
Some people consider this a good thing, as editors can't really
round-trip files with different encodings at different offsets.

Regards,
Martin

Neil Hodgson

unread,
Mar 2, 2002, 7:30:26 AM3/2/02
to
Martin v. Loewis:

> bo...@oz.net (Bengt Richter) writes:
>
> > Will the following work?
> >
> > >>> print u'\u0041'
> > A
> > >>> print u'\u221e'
> >
> > Traceback (most recent call last):
> > File "<stdin>", line 1, in ?
> > UnicodeError: ASCII encoding error: ordinal not in range(128)
> > >>>
>
> With the patch to print Unicode to the Windows console, the
> UnicodeError will be gone. Whether or not Lucida Unicode supports that
> character, I cannot say.

It does; it is an infinity symbol and is available in most Windows
fonts - Verdana, Times New Roman, ...

> > And if I redirect the output to a file, what will be in the file?
>
> Since the isatty test fails, the system default encoding will be used.
> Unless you've changed that, you'll get a UnicodeError again.

How do you tell Python to treat stdout as a tty even though it is
actually a pipe? I'd like to run Unicode printing programs from within SciTE
with I/O redirected to SciTE's output pane. There is also an issue with
stdin where I'd like to tell Python to treat redirected input, fed through a
pipe, as a tty.

Neil

Martin v. Loewis

unread,
Mar 2, 2002, 8:12:20 AM3/2/02
to
"Neil Hodgson" <nhod...@bigpond.net.au> writes:

> How do you tell Python to treat stdout as a tty even though it is
> actually a pipe? I'd like to run Unicode printing programs from within SciTE
> with I/O redirected to SciTE's output pane. There is also an issue with
> stdin where I'd like to tell Python to treat redirected input, fed through a
> pipe, as a tty.

If stdout is a pipe, you cannot use the WriteConsoleW API for output,
I believe. So you *have* to perform some encoding, and unless you can
find a good reason to use a different one, the system default encoding
would be used.

In general, I see printing Unicode as useful only at the interpreter
prompt. In all other cases (including cases where output goes to a
pipe), the application should explicitly use some codec. That would
normally be the codec for the locale's encoding, which is roughly
determined as follows:

import sys, locale

if sys.platform == 'win32':
    terminal_encoding = "mbcs"                              # (1)
elif hasattr(locale, 'nl_langinfo') and hasattr(locale, 'CODESET'):
    terminal_encoding = locale.nl_langinfo(locale.CODESET)
else:
    terminal_encoding = sys.getdefaultencoding()

(1): This is actually incorrect. On the terminal, CP_OEMCP is used,
not CP_ACP (what "mbcs" uses). In the MS Console API, there are ways
to query and modify the code page of a console, which has no effect if
the raster font is used.

If Python output goes to a pipe, I believe there is *no* way of
finding out what the right encoding should be: If the receiving end
writes to a text file, ANSI would be right, if the receiving end
prints to the console, OEM would be right - unless somebody has
changed the console code page for this console.

People are so used to characters not showing up correctly in a Windows
console that they won't worry if the application picks (1).

Regards,
Martin

Stephen J. Turnbull

unread,
Mar 4, 2002, 11:36:29 AM3/4/02
to
>>>>> "Huaiyu" == Huaiyu Zhu <hua...@gauss.almadan.ibm.com> writes:

>> [1] Ie, Python language or character text. It might be
>> convenient to have an octet-string primitive data type, in
>> which you could put EUC-encoded Japanese or Java byte codes.

Huaiyu> What's the difference between this and a raw string (a
Huaiyu> byte sequence) that you can translate into any other
Huaiyu> encoding?

None, in representation.[1] However, it would not be a string in
the sense you know it in Python: you can't concatenate it, you can't
iterate over it, all you can do is read it, write it, access octets
(which are not characters), or turn it into something else that you
can do something with. Consider: does it make sense to concatenate a
string of Java byte codes with a string of English text?


Footnotes:
[1] It might make sense to have a special representation for it in
source as a subset of the Unicode "private space". But that's going
way afield.

Stephen J. Turnbull

unread,
Mar 4, 2002, 12:00:33 PM3/4/02
to
>>>>> "Skip" == Skip Montanaro <sk...@pobox.com> writes:

Martin> A byte string should certainly be able to store arbitrary
Martin> bytes. Under Stephen's proposal, it would indeed not be possible
Martin> anymore to put arbitrary binary data into source code.

This is not clear, but it is certainly the easiest way to deal with
the issue.

Skip> Just to make sure I understand correctly, under Stephen's
Skip> proposal would

Skip> s = "\xff"

Skip> be correct?

Correct in the sense that I'm sure it would be accepted. However, it
would not necessarily be the only way to do it (in fact, it surely
would not).

Skip> s = "ÿ"

Skip> (a literal 0377 character) would be an error, yes?

Assuming you mean that the source file contains eight bytes

0x73 0x20 0x3D 0x20 0x22 0xFF 0x22 0x0A

the parser should reject it, as 0xFF may not occur in a UTF-8-encoded
file.
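
A quick sketch of that check, using the codecs Python already ships
(purely illustrative, not actual parser code):

  source = 's = "\xff"\n'       # the eight bytes listed above
  try:
      unicode(source, 'utf-8')
  except UnicodeError:
      print "not valid UTF-8; the parser would reject the file"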

However, Python would surely provide an optional module to
transparently decode that to UTF-8, before the parser proper sees it.
There would be no reason to write throwaway code in UTF-8 for pretty
much any language that already has codecs.

I'd like to emphasize that the point of my proposal is not to make
life harder for people who are just doing what comes naturally, and it
shouldn't. AFAICT most of the things that people would like to do
will be just as easy to do under my proposal as under Martin's.

They just won't be sanctioned as part of "true" Python. It's kind of
like a contract: you can translate it into other languages, but for
the purposes of legal interpretation, there is an original language
which is that of the court with jurisdiction.

Bengt Richter

unread,
Mar 4, 2002, 4:50:37 PM3/4/02
to

However, people will want to put arbitrary data in strings, and
I think having 'xxx' define something other than 3 octets should be
separately controlled from whether the source is encoded in UTF-8 or
UTF-16 or whatever. Otherwise open('filexxx','wb').write('xxx') may
lose its meaning just because someone imported it into an editor
and wrote it out as UTF-8 for whatever reason (e.g., not intending to
change anything other than adding some comments in Japanese).

It may be convenient for some programs to have 'xxx' default to internal
UTF-8, but that means strings are strictly utf-8 sequences, and passing
'xxx' to .write() would probably imply passing the UTF-8 bytes.

So if you wrote open('fileyyy','wb').write('\xff') you would expect
to find the UTF-8 sequence, not one chr(0xff) -- oops, what is chr()
going to do? Shouldn't '#'+chr(0xff)+'#' be == '#\xff#' ?
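
To make the concern concrete, a small sketch (the file names are only
illustrative; the first pair is what happens today, the second is what
an explicit Unicode string does):

  data = '#' + chr(0xff) + '#'                    # today: the three bytes 23 ff 23
  open('fileyyy', 'wb').write(data)               # writes exactly those bytes

  u = u'#\xff#'                                   # a character string instead
  open('filezzz', 'wb').write(u.encode('utf-8'))  # writes 23 c3 bf 23 - different bytes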

I think changing the internal representation of unqualified-by-u quoted
strings to UTF-8 would be a very radical step, even as a controllable
option. To make it just blindly match source encoding would be more than radical.

With internal UTF-8 ordinary-string default encoding, I think there would
be a need for a plain old octet string as a different type (classical string?),
maybe corresponding to what we get now with latin-1 encoding and rendering,
for convenience. e.g., l'xxx' and maybe L'xxx' for the raw variant?

Regards,
Bengt Richter

Huaiyu Zhu

unread,
Mar 4, 2002, 7:30:09 PM3/4/02
to
On 05 Mar 2002 01:36:29 +0900, Stephen J. Turnbull <ste...@xemacs.org> wrote:

> Consider: does it make sense to concatenate a
>string of Java byte codes with a string of English text?

To produce a kind of tar file, perhaps? To send something through a socket?
There are certainly many cases where one wants to manipulate a byte sequence
without any interpretation of it as a character string.

If the discussion is only concerned with source code encoding, that's fine.
But if it in the end restricts the kind of operations that can be performed
on raw strings, it's simply untenable.


Huaiyu

Stephen J. Turnbull

unread,
Mar 4, 2002, 9:28:26 PM3/4/02
to
>>>>> "Bengt" == Bengt Richter <bo...@oz.net> writes:

Bengt> With internal UTF-8 ordinary-string default encoding, I
Bengt> think there would be a need for a plain old octet string as
Bengt> a different type

Already in my proposal, I think.

Stephen J. Turnbull

unread,
Mar 4, 2002, 10:42:30 PM3/4/02
to
>>>>> "Huaiyu" == Huaiyu Zhu <hua...@gauss.almadan.ibm.com> writes:

Huaiyu> To produce a kind of tar file, perhaps?

Those are tarbytes. Explicit is better than implicit.

Huaiyu> To send something through socket?

Verbatim low-level I/O would be permitted. I would want them to be
restricted from being displayed literally by character-oriented
high-level functions like print, though. If you want to use print on
them, convert them to something with an appropriate (trivial) __repr__
operation defined.
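
A minimal sketch of the kind of wrapper I have in mind (the class name
and the hex formatting are only illustrative):

  class Octets:
      def __init__(self, data):
          self.data = data      # raw byte sequence, not characters
      def __repr__(self):
          # show the octets as hex pairs instead of pretending they are text
          return '<octets %s>' % ' '.join(['%02x' % ord(b) for b in self.data])

  print Octets('\xff\x00A')     # prints: <octets ff 00 41>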

Huaiyu> But if it in the end restricts the kind of operations that
Huaiyu> can be performed on raw strings, it's simply untenable.

No. Perhaps I haven't presented this well, but I don't care what you
do with blocks of raw memory, as long as they are a different type
from character strings and cannot be used as though they were
character strings. _It's the use of character strings that I want to
restrict._ Confounding octets with characters is a horrible disease---
at XEmacs, we call it "Ebola." It crashes editors and MUAs, it
corrupts text and destroys data, it may even cause your daughter to
decide to become a dentist and marry a marketing VP.

In the transition we have to decide what to do with the current
undifferentiated notation. I suspect that people who use raw strings
to access data representation are far more aware of the issues than
people who think of them as "human readable text" that "just works."
Thus by the principle of "least surprise" I advocate that raw string
users use a new notation, while the (much larger and much more naive)
group of people who expect

print 'And now for something completely different'

to do something useful would notice nothing new.

print o'she turned me into a newt ... it got bettah'

would either error (my preference) or print something like a tuple of
integers.

print Code86(o'spam, spam, eggs, and spam').disassemble()

would do what you think, if class Code86 were defined appropriately.

Yes, for people who use raw strings for access to representation
extensively, this would be a one-time PITA for the conversion to the
o'' notation (or whatever it turns out to be).

And it's probable that the representation in source would not be
one-to-one, i.e. the bytes would be encoded in uninterpreted UTF-8.
A raw string would be turned into an array of 16-bit integers by
the lexer, and the parser would then convert it to an array of bytes
internally based on the o'' syntax.

Huaiyu Zhu

unread,
Mar 5, 2002, 2:41:26 PM3/5/02
to
On 05 Mar 2002 12:42:30 +0900, Stephen J. Turnbull <ste...@xemacs.org> wrote:
>
>Verbatim low-level I/O would be permitted. I would want them to be
>restricted from being displayed literally by character-oriented
>high-level functions like print, though. If you want to use print on
>them, convert them to something with an appropriate (trivial) __repr__
>operation defined.
...

>Thus by the principle of "least surprise" I advocate that raw string
>users use a new notation, while the (much larger and much more naive)
>group of people who expect
>
> print 'And now for something completely different'
>
>to do something useful would notice nothing new.
>
> print o'she turned me into a newt ... it got bettah'
>
>would either error (my preference) or print something like a tuple of
>integers.
>
> print Code86(o'spam, spam, eggs, and spam').disassemble()
>
>would do what you think, if class Code86 were defined appropriately.

Now I see what you are advocating. I agree completely.

Huaiyu
