
Python's 8-bit cleanness deprecated?


Roman Suzi

Feb 3, 2003, 1:33:00 PM

I've tried version 2.3a of Python and have been surprised by the following
warning:

1.py:6: DeprecationWarning: Non-ASCII character '\xf7', but no declared
encoding


Does it mean that all Python software which is not in ASCII will give such a
warning each time? (Thus probably filling up web-server logs, or just
surprising users, like Perl/C libs do when they don't know the current locale.)

I think it's madness... There must be other ways to deal with it. I could
agree that for correct operation IDLE demands a correct encoding setting
(and nonetheless works incorrectly!), but plain scripts should be
8-bit clean, without any conditions! (Luckily, it's an alpha version, so
nothing has really changed yet.)


Sincerely yours, Roman Suzi
--
r...@onego.ru =\= My AI powered by Linux RedHat 7.3


Just

Feb 3, 2003, 2:46:03 PM
In article <mailman.1044297249...@python.org>,
Roman Suzi <r...@onego.ru> wrote:

> I've tried version 2.3a of Python and have been surprised by the following
> warning:
>
> 1.py:6: DeprecationWarning: Non-ASCII character '\xf7', but no declared
> encoding

Note the "but no declared encoding"...

> Does it mean, that all that Python software which is not in ASCII will each
> time give such warning? (Thus probably filling up web-server logs or just
> surprising users (like Perl/C libs do when they don't know current locale).
>
> I think it's madness... There must be other ways to deal with it. I could
> agree that for correct operation IDLE is demanding correct encoding setting
> (and nonetheless works incorrectly!), but plain scripts should be
> 8-bit clean, without any conditions! (Luckily, it's alpha version, so
> nothing really changed yet.)

From Misc/NEWS:

- Encoding declarations (PEP 263, phase 1) have been implemented. A
comment of the form "# -*- coding: <encodingname> -*-" in the first
or second line of a Python source file indicates the encoding.
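As a rough illustration of what the declaration buys you (shown here with a modern interpreter, where a missing declaration is a hard error rather than a 2.3-style warning; the example names are illustrative):

```python
# PEP 263 phase 1 sketch: once an encoding is declared on line 1 or 2,
# non-ASCII bytes in the source are accepted.
declared = b"# -*- coding: latin-1 -*-\ns = 'caf\xe9'\n"
compile(declared, '<demo>', 'exec')  # accepted: 0xE9 decoded as Latin-1

undeclared = b"s = 'caf\xe9'\n"
try:
    compile(undeclared, '<demo>', 'exec')
except SyntaxError:
    pass  # rejected: 0xE9 is not valid in the default source encoding
```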


Just

Roman Suzi

Feb 3, 2003, 2:29:51 PM
On Mon, 3 Feb 2003, Brian Quinlan wrote:

>> I think it's madness... There must be other ways to deal with it. I could
>> agree that for correct operation IDLE is demanding correct encoding
>> setting (and nonetheless works incorrectly!), but plain scripts should
>> be 8-bit clean, without any conditions! (Luckily, it's alpha version,
>> so nothing really changed yet.)
>

>Just add:
># -*- coding: Latin-1 -*-
>
>to the top of your source files and you will be fine.

It's no problem with new scripts. But is there any reason to introduce this
useful feature by force? Requiring everyone to add one line to every script
they wrote?

It's not very pleasant...

Brian Quinlan

Feb 3, 2003, 2:16:45 PM
> Does it mean, that all that Python software which is not in ASCII will
> each time give such warning? (Thus probably filling up web-server logs
> or just surprising users (like Perl/C libs do when they don't know
> current locale).
>
> I think it's madness... There must be other ways to deal with it. I could
> agree that for correct operation IDLE is demanding correct encoding
> setting (and nonetheless works incorrectly!), but plain scripts should
> be 8-bit clean, without any conditions! (Luckily, it's alpha version,
> so nothing really changed yet.)

Just add:
# -*- coding: Latin-1 -*-

to the top of your source files and you will be fine.

Cheers,
Brian


Brian Quinlan

Feb 3, 2003, 3:04:11 PM
> It's no problem with new scripts. But is there any reason to introduce
> this useful feature by force? Requiring everyone to add one line to
> every script they wrote?

Without an explicit declaration, it is impossible to accurately
determine most encodings. Remember your Python Zen:

"In the face of ambiguity, refuse the temptation to guess."


The effort required to do that doesn't strike me as significant. How
many total source files do you have that have non-ASCII characters in
them?

Cheers,
Brian


Paul Rubin

Feb 3, 2003, 3:29:27 PM
Brian Quinlan <br...@sweetapp.com> writes:
> Just add:
> # -*- coding: Latin-1 -*-
>
> to the top of your source files and you will be fine.

What is this nonsense? The interpreter is reading comment text now?
Yucch!

Skip Montanaro

Feb 3, 2003, 3:03:53 PM
>>>>> "Brian" == Brian Quinlan <br...@sweetapp.com> writes:
Brian> The effort required to do that doesn't strike me as
Brian> significant.

Yeah, but you (and I) live in ASCII-land.

Brian> How many total source files do you have that have non-ASCII
Brian> characters in them?

Probably a fair number, since Roman lives in Russia.

Skip

Skip Montanaro

Feb 3, 2003, 3:51:52 PM

>> # -*- coding: Latin-1 -*-

Paul> What is this nonsense? The interpreter is reading comment text
Paul> now? Yucch!

Given that most operating systems don't have files with data forks and
resource forks, how would you tell the lexical analyzer what the encoding of
a particular file is?

Skip

Brian Quinlan

Feb 3, 2003, 3:21:44 PM
> >>>>> "Brian" == Brian Quinlan <br...@sweetapp.com> writes:
> Brian> The effort required to do that doesn't strike me as
> Brian> significant.
>
> Yeah, but you (and me) live in ASCII-land

I've written a lot of Python code but probably <1,000 source files.
Probably about 100 of those are still in use and in my care. I would
imagine that most Python users have generated less code than that.

> Brian> How many total source files do you have that have non-ASCII
> Brian> characters in them?
>
> Probably a fair number, since Roman lives in Russia.

Ah, then it should be easy. The encoding is probably the same for all of
his source files. He could probably write a simple script that inserts
the encoding (being careful to insert the encoding after the shebang
line, if present).

Cheers,
Brian


Paul Rubin

Feb 3, 2003, 4:25:32 PM
Skip Montanaro <sk...@pobox.com> writes:
> Paul> What is this nonsense? The interpreter is reading comment text
> Paul> now? Yucch!
>
> Given that most operating systems don't have files with data forks
> and resource forks, how would you tell the lexical analyzer what the
> encoding of a particular file is?

How about "from __encodings__ import latin1"?

I think we've already seen import statements used to control the
lexical analyzer, as in "from __future__ import division".

holger krekel

Feb 3, 2003, 4:13:10 PM
Skip Montanaro wrote:
>
> >> # -*- coding: Latin-1 -*-
>
> Paul> What is this nonsense? The interpreter is reading comment text
> Paul> now? Yucch!
>
> Given that most operating systems don't have files with data forks and
> resource forks, how would you tell the lexical analyzer what the encoding of
> a particular file is?

Well, that's easy, just do

from __future__ import encoding as utf8

:-)

only-10-percent-serious-ly y'rs,

holger

Jp Calderone

Feb 3, 2003, 4:36:08 PM
On Mon, Feb 03, 2003 at 10:29:51PM +0300, Roman Suzi wrote:
> On Mon, 3 Feb 2003, Brian Quinlan wrote:
>
> >> I think it's madness... There must be other ways to deal with it. I could
> >> agree that for correct operation IDLE is demanding correct encoding
> >> setting (and nonetheless works incorrectly!), but plain scripts should
> >> be 8-bit clean, without any conditions! (Luckily, it's alpha version,
> >> so nothing really changed yet.)
> >
> >Just add:
> ># -*- coding: Latin-1 -*-
> >
> >to the top of your source files and you will be fine.
>
> It's no problem with new scripts. But is there any reason to introduce this
> useful feature by force? Requiring everyone to add one line to every script
> they wrote?
>
> It's not very pleasant...

It's not, true, but that's a bit of an exaggeration. For example, I won't
have to add anything to any of my source files. :)

For people who do, a simple script should do the job (somewhat tested code):

#!/usr/bin/python

import os
CODING_DECL = '# -*- coding: Latin-1 -*-' + os.linesep

def main():
    processDirectory('.')

def processDirectory(directory):
    for f in os.listdir(directory):
        path = os.path.join(directory, f)
        if f.endswith('.py'):
            processFile(path)
        elif os.path.isdir(path):
            processDirectory(path)

def processFile(f):
    contents = file(f).read()
    for c in contents:
        if c > '\x7f':
            break
    else:
        # No non-ASCII bytes: nothing to declare.
        return

    print 'Adding coding declaration to', f
    output = file(f, 'w')
    contents = contents.splitlines()
    if contents[0].startswith('#!'):
        # Keep the shebang line first; the declaration goes on line 2.
        output.write(contents[0] + os.linesep)
        del contents[0]

    output.write(CODING_DECL)
    output.write(os.linesep.join(contents))
    output.close()

if __name__ == '__main__':
    main()


Customize to taste.

Jp

--
up 50 days, 1:50, 4 users, load average: 0.03, 0.07, 0.04

holger krekel

Feb 3, 2003, 4:46:38 PM

Yip, but __future__ is really such a special thing (nobody
knows what may come out of it) that you don't really want to duplicate
this specialness. I assume such ideas have been discussed
on Python-Dev, but i am too lazy to check (sorry).

holger

Skip Montanaro

Feb 3, 2003, 4:45:38 PM

Paul> How about "from __encodings__ import latin1"?

I believe the coding cookie scheme was used because some editors already
understand it (Emacs/XEmacs and perhaps vim). It makes sense to go along
with something others are already using, otherwise each file needing
something would require two strings to specify the same thing, one for the
editor, one for the Python interpreter.
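The cookie can be picked out mechanically; here is a minimal sketch (the pattern is a simplified form of what PEP 263 describes, and the function name is illustrative):

```python
import re

# Simplified form of the PEP 263 cookie pattern; it matches both the
# Emacs "-*- coding: ... -*-" style and vim's "fileencoding=..." style.
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

def detect_coding(lines):
    # PEP 263 only examines the first two lines of the file,
    # and the declaration must live in a comment.
    for line in lines[:2]:
        if line.lstrip().startswith('#'):
            m = CODING_RE.search(line)
            if m:
                return m.group(1)
    return None
```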

Skip

Greg Ewing (using news.cis.dfn.de)

Feb 3, 2003, 6:36:42 PM
holger krekel wrote:

>
> Yip, but __future__ is really such a special thing (nobody
> knows what may come out of it) you don't really want to duplicate
> this specialness.


Perhaps, then

from __future__ import alternate_reality_support
from __alternate_realities__ import wishful_thinking
from __feature_wishes__ import importable_source_file_encodings
from __encodings__ import latin1

Scott David Daniels

Feb 3, 2003, 6:50:00 PM
OK, the suggestion about the proper encoding format was mine.
As I am a vi-bigot, I chose the non-vi standard in part to
demonstrate that I was advising, "choose someone else's
standard, don't invent your own."

The reason we want it in the file is that we want your script,
when e-mailed to a Japanese user, to still run "correctly."
Remember that we want to check for multiple encodings: the
encoding will be source-file specific, so an "import" is pretty
much a bad idea to solve it (since the imported file may be in
another format). This should allow you to use modules built
in several different encodings in the same program, rather than
having to translate every source file you get to your encoding.

Imagine the problem of a "CPAN"-like code repository with people
editing and checking in code from different source-code
environments.

-Scott David Daniels
Scott....@Acm.Org

Carlos Ribeiro

Feb 3, 2003, 8:57:14 PM
On Monday 03 February 2003 08:04 pm, Brian Quinlan wrote:
> The effort required to do that doesn't strike me as significant. How
> many total source files do you have that have non-ASCII characters in
> them?

For me it's rather extreme - I doubt I have a single source file *without*
non-ASCII characters. I heavily document my code, mostly in Portuguese,
using the Latin-1 encoding most of the time.

BTW, it's interesting that I just replied to a related question regarding the
CSV API. As a Brazilian, I'm not only Latin-1 bound - we also use d/m/y dates
and commas as decimal separators. <sigh>


Carlos Ribeiro

Francois Pinard

Feb 3, 2003, 10:32:04 PM
[Brian Quinlan]

> How many total source files do you have that have non-ASCII characters
> in them?

It depends on the nationality and habits of the programmer. In my
own case, most of my Python files have plenty of non-ASCII characters,
either in comments or strings, and hopefully, one of these days, in Python
identifiers as well; it is a misery having to use mangled French for them...

And even for those Python files I write in English, to be available
to a wider community, there is usually my name somewhere in the source,
including a cedilla. And that cedilla is worth the `coding' clause!

I'm pretty sure there are still many people outside United States :-)

--
François Pinard http://www.iro.umontreal.ca/~pinard

Roman Suzi

Feb 4, 2003, 12:49:21 AM
On Mon, 3 Feb 2003, Brian Quinlan wrote:

>> It's no problem with new scripts. But is there any reason to introduce
>> this useful feature by force? Requiring everyone to add one line to
>> every script they wrote?
>

>Without an explicit declaration, it is impossible to accurately
>determine most encodings. Remember your Python Zen:
>
> "In the face of ambiguity, refuse the temptation to guess."

There is no ambiguity in raw 8-bit. What if I have no text at all,
just some bytes with values > 127?


Let's make -*- necessary for ASCII as well - and watch the
reaction of Python users ;)


>The effort required to do that doesn't strike me as significant. How


>many total source files do you have that have non-ASCII characters in
>them?

Probably 80% of them (counting only work-related ones).

Roman Suzi

Feb 4, 2003, 12:42:54 AM
On Mon, 3 Feb 2003, Scott David Daniels wrote:

>OK, the suggestion about the proper encoding format was mine.
>As I am a vi-bigot, I chose the non-vi standard in part to
>demonstrate that I was advising, "choose someone else's
>standard, don't invent your own."

And that is a good suggestion. I am all for the
-*- thingie. However, I do not like deprecation warnings!
It will be a nightmare for maintenance people, and
newbies will feel bad.

>The reason we want it in the file is that we want your script,
>when e-mailed to a Japanese user, to still run "correctly."

I do not believe my terminal can show even Latin-1, as it's
tuned for cyrillic.

And what if I include some raw 8-bit character missing in the
current encoding?

No, no. Encodings must not generate warnings! Especially
deprecation ones. Also, will it mean that the program will
refuse to run if the encoding is wrong?

If it will be the case, Python will be no better than Java
which was constantly missing some fonts on my computer
when I tried various programs.

I blame myself for not checking the PEP with the suggestion
in time...

I think raw 8-bit must be the default, without any warnings.

>Remember that we want to check for multiple encodings: the
>encoding will be source-file specific so an "import" is pretty
>much a bad idea to solve it(since the imported file may be in
>another format). This should allow you to use modules built
>in several different encodings in the same program, rather than
>having to translate every source code you get to your encoding.

>Imagine the problem of a "CPAN"-like code repository with people
>editting and checking in code from different source code
>environments.

This is a completely different matter. Python coding style
suggests writing such programs with English comments.

>-Scott David Daniels
>Scott....@Acm.Org

Alex Martelli

Feb 4, 2003, 4:54:55 AM
Roman Suzi wrote:
...

> There is no ambiguity in raw 8-bit. What if I have no text at all,
> just some bytes with value > 127?

Then you don't code those bytes directly as part of a string literal
(you may use escape sequences instead). There are several other
constraints on what you can put directly in a string literal, anyway.


Alex

Alex Martelli

Feb 4, 2003, 4:59:21 AM
Roman Suzi wrote:
...

> And what if I include some raw 8-bit character missing in the
> current encoding?

Then you use an escape sequence instead -- that's what they're
for. If the character is "missing in the current encoding"
it will be BETTER not to have it directly in the string
literal, anyway -- who knows what could happen to people
trying to display it, otherwise.
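As a tiny sketch of that advice: the 0xF7 byte from the original warning can be written as an escape sequence, so the source file itself stays pure ASCII and needs no declaration.

```python
# The divide sign (byte 0xF7) written as an escape, not as a raw byte;
# the source file containing this line is pure ASCII.
divide = '\xf7'
assert len(divide) == 1
assert ord(divide) == 0xF7
```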

> I blame myself for not checking the PEP with the suggestion
> in time...

I think the alpha stage IS still "in time", if you can
convince Guido to change the warning strategy. But --
couch your arguments well: he's not easy to convince.

> I think raw 8bit must be set by default without any warnings.

I disagree, but not hotly -- I'll be quite content with
whatever warning strategy ends up being adopted; say
I'm a +0 on the choice made for 2.3alpha. But be warned
that you'll have to argue against hotly +1 people --
check the python-dev archives to hone your arguments.
(Arguing here is not much use of course, since Guido
doesn't read c.l.py currently).


Alex

Anders J. Munch

Feb 4, 2003, 5:03:50 AM

I've had the same gut reaction as Paul ever since it was first
mentioned on c.l.p., but since I didn't have a better suggestion, I've
kept my mouth shut.

But of course, the very second 2.3a is out and it's probably too late
to change, the nature of the problem dawns on me. Aaargh, bad timing.

The thing is, this is just syntax. Plain and simple. Some call it a
hint, but it's not, it's a syntactical measure that affects the
interpretation of the code that follows. This is in exact analogy to
the r and u string prefixes.

A magic comment is syntax hidden in comments. No more, no less. And
hiding syntax in comments is bad for several reasons:
* Misleading the reader, who might think that comments can be ignored.
* Loss of syntax checking. Having once spent a week on a performance
problem in an Oracle database that turned out to be a misspelling in
a comment-embedded optimisation hint, I can assure you that this is
a very real problem.
* Forcing tools that read Python source code to mimic _exactly_ what
the interpreter does. This is really just a corollary to the
previous item.

Now, to define a not-in-comment syntax, that's trivial. Just remove the
in-comment part and write, as a statement:

-*- coding: Latin-1 -*-

- Anders


Paul Rubin

Feb 4, 2003, 5:33:21 AM
"Anders J. Munch" <ande...@dancontrol.dk> writes:
> Now to define a not-in-comment syntax, that trivial. Just remove the
> in-comment part and write, as a statement:
>
> -*- coding: Latin-1 -*-

That is daring. I like it.

Anders J. Munch

Feb 4, 2003, 6:07:27 AM

Actually I like _your_ proposal better <g>.

- Anders


Just

Feb 4, 2003, 7:36:04 AM
In article <Z9M%9.174650$AA2.6...@news2.tin.it>,
Alex Martelli <al...@aleax.it> wrote:

> > I think raw 8bit must be set by default without any warnings.
>
> I disagree, but not hotly -- I'll be quite content with
> whatever warning strategy ends up being adopted; say
> I'm a +0 on the choice made for 2.3alpha. But be warned
> that you'll have to argue against hotly +1 people --
> check the python-dev archives to hone your arguments.
> (Arguing here is not much use of course, since Guido
> doesn't read c.l.py currently).

Here's a possible compromise (which I'm not sure is implementable at
all): Python could only issue warnings if 8-bit chars are used in string
literals, and not if they only occur in comments.
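One conceivable way to implement that check, sketched with the modern tokenize module (nothing like this existed in 2.3; the function name is illustrative):

```python
import io
import token
import tokenize

def nonascii_in_strings(source):
    # Tokenize the source and report non-ASCII characters only when they
    # occur inside string literals; non-ASCII in comments is ignored.
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return any(t.type == token.STRING and
               any(ord(c) > 127 for c in t.string)
               for t in toks)
```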

Just

Michael Hudson

Feb 4, 2003, 7:51:33 AM
Roman Suzi <r...@onego.ru> writes:

> It's no problem with new scripts. But is there any reason to introduce this
> useful feature by force?

There's no way you'll persuade people to say what they mean unless you
"force" them.

Cheers,
M.

--
If design space weren't so vast, and the good solutions so small a
portion of it, programming would be a lot easier.
-- maney, comp.lang.python

Skip Montanaro

Feb 4, 2003, 8:15:30 AM
Roman> However, I do not like Deprecation warnings! It will be
Roman> nightmare for maintainance people and also newbies will feel
Roman> themselves bad.

You can modify site.py when you install new versions of Python to suppress
such warnings.

Skip

Francois Pinard

Feb 4, 2003, 9:51:51 AM
[Roman Suzi]

> Let's make -*- necessary for ASCII as well - and watch at the
> reaction of Python users ;)

Yes, I know. Similar examples abound; just one or two. UTF-8 has been
designed so that ASCII is wholly undisturbed, and even then, many people who
limit themselves to ASCII are reluctant to adopt UTF-8. Some people insist
on the American way of writing dates being ubiquitous. The planetary aspect of
computer communications is not yet fully granted for everybody! :-) Yet,
very admittedly, a lot of progress has been made in recent years.

Bengt Richter

Feb 4, 2003, 11:59:15 AM

Other files? Specify it in __init__.py files governing the associated directory
or specifically identified files? Look for matching files with a special extension,
like .pif files under Windows for old DOS executables? Or config files with
inference rules and/or info on specifically designated files or directories?
Inference rules keyed to file extensions, letting people tag specially encoded
files as they wish? Virtualize the Python file name space and have virtual mount
points for real directories, and then base encoding inferences on virtual locations?

One thing that bothers me about passing info in comments is that it implies
a grammar for part of the source (comments) which affects the result of
interpreting the source, but is not (AFAIK, as of 2.2.2) documented as part of the
language grammar. Of course the #! first line similarly uses comment text,
so we are basically already living with an OS file-system usage hack for carrying
non-data info associated with the data of a file. <idearrhea warning>I wonder
how long it will be before we have a portable file system with packet structure,
defaulting to an info packet followed by a data packet, with packets selected by an extra
seek parameter defaulting to data. Then you could have a convention of passing the data
encoding, expressed in UTF-8, in the info packet.</idearrhea warning>

Regards,
Bengt Richter

Bengt Richter

Feb 4, 2003, 12:04:01 PM

Wouldn't it be nicer just to collect them in a tree under a directory
with a __init__.py that specifies default encoding for the lot?

Regards,
Bengt Richter

Jeff Epler

Feb 4, 2003, 12:46:40 PM

What makes you believe that Python can tell what is a comment and what
is a string without knowing the encoding?

I think the only limitation of the source file encoding is that it must
be an ASCII superset. So for instance I could have a perverse encoding
where 0x81 decodes to u'\n', and 0x83 is another valid character in the
encoding. Then this byte string
'#\x81"\x83"\x81'
actually decodes to
u'#\n"\uXXXX"\n'
which means the file contains a string with high-bit-set chars used in
a string literal.

If there is also a requirement that the encoding be capable of doing a
round-trip unchanged (e.g. s.decode("perverse").encode("perverse") == s
with s = "".join([chr(x) for x in range(256)])) then perhaps your idea
is a "safe" one. In that case the encoding can't map two values both
onto \n, the key to my example.
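Jeff's round-trip condition can be sketched like this (a modern-Python sketch with an illustrative function name; 2003-era code would spell the byte string differently):

```python
def round_trips(encoding):
    # True if every possible byte value survives decode+encode unchanged,
    # the property suggested above as making the check "safe".
    s = bytes(range(256))
    try:
        return s.decode(encoding).encode(encoding) == s
    except UnicodeError:
        return False
```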

Jeff

Brian Quinlan

Feb 4, 2003, 12:59:20 PM
> >Ah, then it should be easy. The encoding is probably the same for all of
> >his source files. He could probably write a simple script that inserts
> >the encoding (being careful to insert the encoding after the shebang
> >line, if present).
> >
> Wouldn't it be nicer just to collect them in a tree under a directory
> with a __init__.py that specifies default encoding for the lot?

No:

1. That system would not interoperate with editors very well (the
current encoding system can be recognized by both VIM and Emacs)
2. It would make it more difficult to distribute single source files
3. It would require more work because the parser would need more context
when parsing source files

Cheers,
Brian


Just

Feb 4, 2003, 1:22:17 PM
In article <mailman.1044380830...@python.org>,
Jeff Epler <jep...@unpythonic.net> wrote:

> On Tue, Feb 04, 2003 at 01:36:04PM +0100, Just wrote:
> > Here's a possible compromise (which I'm not sure is implementable at
> > all): Python could only issue warnings if 8-bit chars are used in string
> > literals, and not if they only occur in comments.
>
> What makes you believe that Python can tell what is a comment and what
> is a string without knowing the encoding?

This is not about knowing the encoding, but about warning when an
encoding _should_ have been specified. Since, whatever the encoding is,
it must be a superset of ASCII, I don't see why my suggestion wouldn't
work (bar implementation limitations). That's not to say I'm completely
convinced of the idea myself.

> I think the only limitation of the source file encoding is that it must
> be an ASCII superset. So for instance I could have a perverse encoding
> where 0x81 decodes to u'\n', and 0x83 is another valid character in the
> encoding. Then this byte string
> '#\x81"\x83"\x81'
> actually decodes to
> u'#\n"\uXXXX"\n'
> which means the file contains a string with high-bit-set chars used in
> a string literal.

I don't see your point: my suggestion is about reducing the warning
irritation for people using 8-bit encodings in comments of code that
works *now* (in Python <= 2.2), not about bizarre things you _could_ do
with perverse encoding directives in 2.3.

Just

Scott David Daniels

Feb 4, 2003, 3:41:51 PM
Roman Suzi wrote:
> ...

> There is no ambiguity in raw 8-bit. What if I have no text at all,
> just some bytes with value > 127?

But raw 8-bit is about _bytes_, and the issue is characters. As
I imagine it, a raw 8-bit encoding would allow anything in "normal"
strings, but only ASCII in unicode strings. That is really the
worst of all possible worlds. If you've declared the encoding,
the compiler can "know" the value of the expression:
ord("?") + ord(u"?")
Otherwise, it does not have a chance.

> Let's make -*- necessary for ASCII as well - and watch at the
> reaction of Python users ;)

The trick is to allow a system in which you can read and interpret
the first few lines (I think we've settled on 2) in order to get
the -*- line understood. UTF-8 would be the default if it weren't
so western-european-centric (Talk to Chinese or Japanese programmers
about how efficient UTF-8 is).

-Scott David Daniels
-Scott....@Acm.Org

Roman Suzi

Feb 4, 2003, 3:07:06 PM
On Tue, 4 Feb 2003, Just wrote:

>In article <mailman.1044380830...@python.org>,
> Jeff Epler <jep...@unpythonic.net> wrote:
>
>> On Tue, Feb 04, 2003 at 01:36:04PM +0100, Just wrote:
>> > Here's a possible compromise (which I'm not sure is implementable at
>> > all): Python could only issue warnings if 8-bit chars are used in string
>> > literals, and not if they only occur in comments.
>>
>> What makes you believe that Python can tell what is a comment and what
>> is a string without knowing the encoding?
>
>This is not about knowing the encoding but about warning when an
>encoding _should_ have been specified. Since whatever the encoding is,
>it must be a superset of ASCII I don't see why my suggestion wouldn't
>work (bar implementation limitations). That's not so say I'm completely
>convinced of the idea myself.
>
>

>I don't see your point: my suggestion is about reducing the warning
>irritation for people using 8-bit encodings in comments of code that
>works *now* (in Python <= 2.2), not about bizarre things you _could_ do
>with perverse encoding directives in 2.3.
>
>Just

Well, an obligatory -*- will cause not-just-ASCII OS vendors (at least Linux
distros) to disable warnings in their packages. And I am afraid that it will
be done for all warnings, not just encoding ones! Packagers will
understand that a well-cyrillized (for example) Linux should not warn about
encodings at every corner. They make tweaks to everything from LaTeX to
Emacs to make them usable by people who use cyrillic. So, the Python developers'
decision to "warning-irritate" over encodings will be answered with
packagers' tweaks. Nobody will want to be blamed for ever-growing error logs
on a web server, or for exposing users to extra warnings from some package
which is old but still usable.

Saying this, I agree that only by forcing -*- can we achieve the discipline of
writing the encoding in every program (I already do that in my scripts because
I use two cyrillic encodings out of five ;-) and it's convenient to have Emacs
automagically understand me).

The problem we are discussing is not a technical one; it's about sociology and
the perception of people.

Another trouble I feel with this new feature is that I can never tell if my
program will run or not. Even working with the 'recode' program, I need the -f
option from time to time to let me do recoding in spite of some stray char which
does not (in recode's opinion) belong to a certain encoding.

Now I will have the same doubts with every Python program. Will it run if I
insert this backtick? What about the pseudographics I use in KOI8-R when they're
really from CP866? What if the standard on encodings changes and there is a new
"Asio" currency instead of the Euro? Will my program still run? Etc.

That is why I am asking for unconditional raw 8-bit cleanness in Python,
without any -*- things...

Just

Feb 4, 2003, 4:18:59 PM
In article <mailman.1044389414...@python.org>,
Roman Suzi <r...@onego.ru> wrote:

> That is why I am asking for unconditional raw 8-bit cleanness of Python
> without any -*- things...

And what should happen if an 8-bit char shows up in a unicode literal?

Just

Chris Liechti

Feb 4, 2003, 6:21:54 PM
Roman Suzi <r...@onego.ru> wrote in news:mailman.1044337448.21865.python-
li...@python.org:

> On Mon, 3 Feb 2003, Scott David Daniels wrote:
>
>>OK, the suggestion about the proper encoding format was mine.
>>As I am a vi-bigot, I chose the non-vi standard in part to
>>demonstrate that I was advising, "choose someone else's
>>standard, don't invent your own."
>
> And that is good suggestion. I am both hands for
> -*--thingie.

i don't like the "-*-" and especially not that it is a comment.
a comment is a comment is a comment. it gets very confusing when
a comment alters the behaviour of a program (yes, it's only a warning, but i
consider programs that issue warnings as unfinished).

"#-*- coding: latin1 -*-" may look nice to an all-time vi user, but most
users are NOT using vi or emacs. to me it does not look very pythonic,
it's not easy to remember, it's not helping my editor as it does not store
its preferences in each source file, and i can't even write a one-liner
"python -c ..." with latin1 chars.
we're not writing python _for_ vi but python _with_ vi, are we?

we're speaking german here and i will have to explain to every newbie why
he has to write that magic string in each and every source file, as he
will use his native language in his first scripts...

the "#!..." line is a completely different thing. it has NO effect on python. it
has an effect on the environment that loads a script. it does not change
the behaviour of a python script. it cannot be compared to the encoding
comment.

> However, I do not like Deprecation warnings!
> It will be nightmare for maintainance people and
> also newbies will feel themselves bad.

no warnings for me please... it makes older python programs
appear to be bad software, which is simply not true.
this means that i'll get a warning for a program that was perfectly valid
and ran error-free.

we're using python at work, and many programs use german comments
and strings; getting a warning for these is wrong in my opinion.

the worst thing i've seen so far is the perl warning about an unknown locale
on all the GNU/Linux Debian Woody boxes, for each and every perl script.
so i'm getting warnings for apt-get and lots of other tools.

luckily it seems, according to the PEP, that warnings are only issued
when a non-ascii character is found, and not for every script like perl;
but still, this means a lot of warnings...
the warning is not much help to the user; the developer should get it.
how about displaying the warning only with __debug__ == 1 and making
__debug__ == 0 the default for future python releases?

>>The reason we want it in the file is that we want your script,
>>when e-mailed to a Japanese user, to still run "correctly."
> I do not believe my terminal can show even latin-1 as it's
> tuned for cyrillic.

right, the source encoding won't change the fact that a (winblows) DOS box
won't display any non-ASCII character the way you want...



> I think raw 8bit must be set by default without any warnings.

I think it should work as in 2.2 without warnings, or that the warnings are
off by default.



>>Remember that we want to check for multiple encodings: the
>>encoding will be source-file specific so an "import" is pretty
>>much a bad idea to solve it(since the imported file may be in
>>another format). This should allow you to use modules built
>>in several different encodings in the same program, rather than
>>having to translate every source code you get to your encoding.

that's no argument against a "from __encoding__ import latin1", which i
think would be more pythonic. it's as easy to handle as the magic
comment, and we already have imports that do not really import (see
__future__ ;-)



>>Imagine the problem of a "CPAN"-like code repository with people
>>editing and checking in code from different source code
>>environments.
>
> This is completely another matter. Python coding style
> suggest writing such programs with English comments.

which is a good *suggestion*. but i don't like warnings for programs that
do not follow this advice... do you really care in which language the
comments of a working module are? do you want to see a warning even if you
don't ever look at the source?

my summary:
- comments with effect on code are a bad idea
- "#-*- coding: latin1 -*-" does not look pythonic/nice
- a warning gives a bad impression of (older) python programs
- the ability to specify the source encoding is a good idea, but it should
  not be enforced by filling the screen with warnings instead of useful
  program output.

chris

--
Chris <clie...@gmx.net>

Erik Max Francis

unread,
Feb 4, 2003, 7:45:33 PM2/4/03
to
Roman Suzi wrote:

> That is why I am asking for unconditional raw 8-bit cleanness of
> Python
> without any -*- things...

Personally, I think it's an extremely good idea. If high bits are set
anywhere in the file, knowing the encoding is essential, and so it's
good to require an explicit specification of the encoding to prevent any
ambiguity or misinterpretation.

The -*- convention is simply tipping its hat to a commonly-used
convention; it's already used, e.g., to indicate the indentation level
if desired.

Seems plusses all around; one might consider -*- in particular ugly, but
it's just following a common convention, and sometimes conformance and
recognizability are better than beauty.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ In principle I am against principles.
\__/ Tristan Tzara
Bosskey.net: Counter-Strike / http://www.bosskey.net/cs/
A personal guide to Counter-Strike.

Francois Pinard

unread,
Feb 4, 2003, 7:49:49 PM2/4/03
to
[Chris Liechti]

> "#-*- coding: latin1 -*-" may look nice to an all-time vi user but most
> users are NOT using vi or emacs.

I often read on this list that most people are using Emacs, or vi, or
something else. The truth might well be that we do not have any real
statistics on this. Caution suggests that we refrain from asserting things
like the above, as gratuitous statements undermine the best of argumentations! :-)

P.S. - But if you have dependable numbers to offer, that cover a population
much wider than your co-workers or friends and mine, I presume some of
us would be curious about them! :-)

Dale Strickland-Clark

unread,
Feb 4, 2003, 8:08:56 PM2/4/03
to
Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote:

>Brian Quinlan <br...@sweetapp.com> writes:
>> Just add:


>> # -*- coding: Latin-1 -*-
>>

>> to the top of your source files and you will be fine.
>
>What is this nonsense? The interpreter is reading comment text now?
>Yucch!

Absolutely. And what if you get the syntax wrong? How close to correct
does it have to be before you get told you've got it wrong?

Is everything bracketed by -*- on lines 1-2 now parsed?

Parsing comments is a very poor solution.

--
Dale Strickland-Clark
Riverhall Systems Ltd

Andrew Bennetts

unread,
Feb 5, 2003, 12:59:20 AM2/5/03
to
On Tue, Feb 04, 2003 at 11:03:50AM +0100, Anders J. Munch wrote:
>
> Now to define a not-in-comment syntax, that's trivial. Just remove the
> in-comment part and write, as a statement:
>
> -*- coding: Latin-1 -*-

What about:
from __encodings__ import -*- coding: Latin-1 -*-

<wink>

the-best-of-both-worlds-ly yrs, -Andrew.


Roman Suzi

unread,
Feb 5, 2003, 12:28:48 AM2/5/03
to

I do not necessarily agree. Comments are meta-information. And anyway
the first line is

#!/usr/bin/python

But having ASCII as default encoding is completely different matter.

Roman Suzi

unread,
Feb 5, 2003, 12:23:36 AM2/5/03
to
On Tue, 4 Feb 2003, Just wrote:

Then it is OK to issue a warning or even an error due to the
unknown 8-bit encoding.

>Just

Anders J. Munch

unread,
Feb 5, 2003, 4:50:05 AM2/5/03
to
"Roman Suzi" <r...@onego.ru> wrote:
> On Wed, 5 Feb 2003, Dale Strickland-Clark wrote:
> >Parsing comments is a very poor solution.
>
> I do not necessarily agree. Comments are meta-information.

Yes, but source file encoding is not meta-information. It has direct
consequences for program execution.

>And anyway
> first line is
>
> #!/usr/bin/python

Which, for comparison, has absolutely no effect on the Python
interpreter.

- Anders


Anders J. Munch

unread,
Feb 5, 2003, 4:54:53 AM2/5/03
to

He he. I'm not sure if you're laughing with me or at me. In any
case, you should know that I'm dead serious: If


# -*- coding: Latin-1 -*-

is a good idea then
-*- coding: Latin-1 -*-
is a better one.

I'm not saying that dash-star-dash is a good idea, I'm not saying that
it isn't. But whatever we do, the Python interpreter should never
execute the contents of comments.

can-an-encoding-comment-be-commented-out?-ly y'rs, Anders


Brian Quinlan

unread,
Feb 5, 2003, 4:32:42 AM2/5/03
to
> > And what should happen if an 8-bit char shows up in a unicode literal?
>
> Then it is OK to issue a warning or even an error due to the
> unknown 8-bit encoding.

The new Python parser expects to receive code in UTF-8 format. If the
code contains only ASCII then it is already valid UTF-8. If your code is not
valid UTF-8 then the parser will die.

To solve this problem, all source code is converted to UTF-8 when
loaded. Without knowing the encoding, how can the source be converted to
UTF-8?
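The load-time step described here can be sketched roughly as follows (purely illustrative; `to_parser_utf8` is a made-up helper, not CPython's actual code):

```python
# Rough sketch of the decode-then-reencode step at load time (not CPython's
# real implementation): the declared encoding decodes the raw bytes, and the
# parser then works on a UTF-8 re-encoding of the result.
def to_parser_utf8(raw_bytes, declared_encoding):
    text = raw_bytes.decode(declared_encoding)  # fails if the declaration is wrong
    return text.encode("utf-8")                 # what the parser consumes

latin1_src = b's = "caf\xe9"\n'                 # 0xe9 is e-acute in latin-1
utf8_src = to_parser_utf8(latin1_src, "latin-1")
print(utf8_src)                                 # the 0xe9 byte becomes 0xc3 0xa9
```

With no declared encoding, the first step has no codec to use, which is exactly the gap the warning points at.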

Cheers,
Brian


Erik Max Francis

unread,
Feb 5, 2003, 5:11:56 AM2/5/03
to
"Anders J. Munch" wrote:

> "Roman Suzi" <r...@onego.ru> wrote:
>
> > > I do not necessarily agree. Comments are meta-information.
>
> Yes, but source file encoding is not meta-information. It has direct
> consequences for program execution.

This seems to be a definitional issue. Since when is there a
prohibition against "metainformation" having consequences for program
execution?

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE

/ \ Never had very much to say / Laugh last, laugh longest
\__/ Des'ree
PyUID / http://www.alcyone.com/pyos/uid/
A module for generating "unique" IDs in Python.

Anders J. Munch

unread,
Feb 5, 2003, 5:55:12 AM2/5/03
to
"Erik Max Francis" <m...@alcyone.com> wrote:
> "Anders J. Munch" wrote:
>
> > "Roman Suzi" <r...@onego.ru> wrote:
> >
> > > I do not necessarily agree. Comments are meta-information.
> >
> > Yes, but source file encoding is not meta-information. It has direct
> > consequences for program execution.
>
> This seems to be a definitional issue. Since when is there a
> prohibition that "metainformation" having consequences for program
> execution?

Source file encoding has direct consequences for program execution.
The shebang and Emacs/vim encoding comments do not. I might describe
this difference by saying that the shebang or Emacs/vim encoding
comments are meta-information, and that for the interpreter to
recognise encoding comments makes them regular information, part of the
Python syntax. But it's not the word I care about, it's the
semantics.

Speaking of definitional issues, the term "encoding comment" is
misleading. It's not a comment: It's a syntactic construct that has a
remarkable similarity to a comment.

- Anders


Andrew Bennetts

unread,
Feb 5, 2003, 5:18:19 AM2/5/03
to
On Wed, Feb 05, 2003 at 10:54:53AM +0100, Anders J. Munch wrote:
> "Andrew Bennetts" <andrew-p...@puzzling.org> wrote:
> > On Tue, Feb 04, 2003 at 11:03:50AM +0100, Anders J. Munch wrote:
> > >
> > > Now to define a not-in-comment syntax, that's trivial. Just remove the
> > > in-comment part and write, as a statement:
> > >
> > > -*- coding: Latin-1 -*-
> >
> > What about:
> > from __encodings__ import -*- coding: Latin-1 -*-
> >
> > <wink>
^^^^^^
Note the wink :)

> > the-best-of-both-worlds-ly yrs, -Andrew.
>
> He he. I'm not sure if you're laughing with me or at me. In any

I'm laughing with you :)

> case, you should know that I'm dead serious: If
> # -*- coding: Latin-1 -*-
> is a good idea then
> -*- coding: Latin-1 -*-
> is a better one.
>
> I'm not saying that dash-star-dash is a good idea, I'm not saying that
> it isn't. But whatever we do, the Python interpreter should never
> execute the contents of comments.

I agree in principle, but I just realised a possibly strong argument in favour
of putting the encoding in a comment: with your proposal, how would you
write a unicode source file that worked in both Python 2.3 and older Pythons
(e.g. 1.5.2)?

Then again, it perhaps isn't even possible to write a unicode source file
that works for 1.5.2 no matter what you do (unless you stick to e.g. plain
ASCII, which is technically also correct UTF-8), so perhaps this isn't a
real problem. I don't know enough about unicode to know.

Anyway, being an English speaker, the end result isn't going to bother me
much whatever happens... the default encoding of UTF-8 is more than adequate
for my simple needs.

-Andrew.


Laura Creighton

unread,
Feb 5, 2003, 5:44:36 AM2/5/03
to
On more than one occasion I have removed all the comments from some
source file before blowing it into PROM. PROMs are expensive. I would
have been mad as hell if, after having sunk the budget, somebody told
me that 'oops, certain comments are special' and that I had to do
the job all over again.

Laura

Roman Suzi

unread,
Feb 5, 2003, 6:13:38 AM2/5/03
to
One more argument contra:

what if I have a program whose comments are in one encoding and string
literals in another one? (I have a project which uses cp1252 in literals
and koi8-r in comments (sometimes).)

Is there any decent editor supporting utf-8?

*

(I agree with the argument that the 2.3a Python parser uses utf-8
as an internal representation or whatever)

Sincerely yours, Roman A.Suzi
--
- Petrozavodsk - Karelia - Russia - mailto:r...@onego.ru -


Bengt Richter

unread,
Feb 5, 2003, 7:27:01 AM2/5/03
to

How about showing foo.py encoding by naming?
foo_x_latin1_x_.py
or
foo-x-_coding__Latin-1_-x-.py
;-)

Regards,
Bengt Richter

Neil Hodgson

unread,
Feb 5, 2003, 7:40:26 AM2/5/03
to
Dale Strickland-Clark:

> Is everything bracketed by -*- on line 1-2 now parsed?

The '-*-' isn't needed. PEP 263 states:
the first or second line must match the regular expression
"coding[:=]\s*([\w-_.]+)".
so a first line of
"coding=utf-8"
should work
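That is easy to try out. One caveat: the character class as quoted, `[\w-_.]`, is an invalid range in later Python `re` versions, so this sketch moves the `-` to the front, which matches the PEP's intent:

```python
import re

# PEP 263 coding-cookie pattern; the '-' is moved to the front of the class
# so the regex compiles on modern Python (the quoted form has a bad range).
CODING = re.compile(r"coding[:=]\s*([-\w.]+)")

for line in ('coding=utf-8', '# -*- coding: latin-1 -*-', 'x = 1'):
    m = CODING.search(line)
    print(line, '->', m.group(1) if m else None)
```

Both the bare `coding=utf-8` form and the Emacs-style comment match, which is Neil's point: the `-*-` decoration is optional.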

Neil


Anders J. Munch

unread,
Feb 5, 2003, 7:22:11 AM2/5/03
to
"Andrew Bennetts" <andrew-p...@puzzling.org> wrote:
>
> I agree in principle, but I just realised a possibly strong argument in
> favour of putting the encoding in a comment: with your proposal, how would you
> write a unicode source file that worked in both Python 2.3 and older Pythons
> (e.g. 1.5.2)?
>
> Then again, it perhaps isn't even possible to write a unicode source file
> that works for 1.5.2 no matter what you do (unless you stick to e.g. plain
> ASCII, which is technically also correct UTF-8), so perhaps this isn't a
> real problem. I don't know enough about unicode to know.

Exactly, quoting from the PEP: In Python 2.1, Unicode literals can
only be written using the Latin-1 based encoding "unicode-escape".

The compatibility issue has to do with Latin-1. And really there is
no compatibility problem, as new code written to work with old
interpreters can always use ascii encoding and escape sequences for
the rest. It's just a matter of convenience.

Being a Latin-1 and Emacs user, all the convenience features will
benefit me. However I would end up using the comment-like syntax in
all my source files also, so in the end it won't be convenient at all.

practicality-beats-purity-but-sometimes-purity-is-practical-ly y'rs,
Anders


Jeff Epler

unread,
Feb 5, 2003, 9:19:02 AM2/5/03
to
On Wed, Feb 05, 2003 at 02:13:38PM +0300, Roman Suzi wrote:
> One more argument contra:
>
> what if I have a program whose comments are in one encoding and string
> literals in another one? (I have a project which uses cp1252 in literals
> and koi8-r in comments (sometimes).)
>
> Is there any decent editor supporting utf-8?

depends what you think of as decent. vim can do it (not sure about bidi
support). emacs can do it (at least with mule, should include bidi).
You could probably make idle do it (no bidi support).

Jeff

Jeff Epler

unread,
Feb 5, 2003, 9:23:37 AM2/5/03
to
On Tue, Feb 04, 2003 at 12:41:51PM -0800, Scott David Daniels wrote:
> (Talk to Chinese or Japanese programmers
> about how efficient UTF-8 is).

And how do Hebrew, Greek, or Arabic speakers feel about the "efficiency"
of shift-jis or euc-jp?

Surely Europeans have more of a right to complain, since the non-ASCII
chars they use expand from 1 to 2 bytes when going from iso-8859-x to
utf-8 (a 100% expansion), while changing from shift-jis to utf-8
generally means an expansion from 2 to 3 bytes (a 50% expansion).
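Those expansion figures are easy to verify with the codecs that ship with Python (the particular characters here are just illustrative picks):

```python
# One latin-1 character: 1 byte in latin-1, 2 bytes in utf-8 (100% expansion).
divide = u"\u00f7"  # DIVISION SIGN
print(len(divide.encode("latin-1")), len(divide.encode("utf-8")))  # 1 2

# One katakana character: 2 bytes in shift-jis, 3 bytes in utf-8 (50% expansion).
kana = u"\u30a2"    # KATAKANA LETTER A
print(len(kana.encode("shift-jis")), len(kana.encode("utf-8")))    # 2 3
```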

I think everybody should just suck it up and have a two-character
alphabet. That would be easiest and most efficient. ("there are 10
kinds of alphabet. Those with the right number of symbols and those
with too many.")

Jeff

Michael Hudson

unread,
Feb 5, 2003, 10:12:32 AM2/5/03
to
Roman Suzi <r...@onego.ru> writes:

> what if I have a program whose comments are in one encoding and string
> literals in another one? (I have a project which uses cp1252 in literals
> and koi8-r in comments (sometimes).)

How can that work? How does your editor know which is which? Or do
you just flip from one charset to the other as needed?

> Is there any decent editor supporting utf-8?

GNU Emacs 21. I think WinXP's notepad.exe might. There are surely
others.

Cheers,
M.

--
Roll on a game of competetive offence-taking.
-- Dan Sheppard, ucam.chat

Michael Hudson

unread,
Feb 5, 2003, 10:10:42 AM2/5/03
to
Erik Max Francis <m...@alcyone.com> writes:

> Roman Suzi wrote:
>
> > That is why I am asking for unconditional raw 8-bit cleanness of
> > Python
> > without any -*- things...
>
> Personally, I think it's an extremely good idea.

Me too. I'll also observe that editing files in GNU Emacs 21 that
have characters with the high bit set and don't have a coding cookie
is exceptionally annoying so putting the coding in not only soothes
Python, but also potential contributors.

Cheers,
M.

--
Unfortunately, nigh the whole world is now duped into thinking that
silly fill-in forms on web pages is the way to do user interfaces.
-- Erik Naggum, comp.lang.lisp

Anders J. Munch

unread,
Feb 5, 2003, 10:22:46 AM2/5/03
to
"Jeff Epler" <jep...@unpythonic.net> wrote:
> On Tue, Feb 04, 2003 at 12:41:51PM -0800, Scott David Daniels wrote:
> > (Talk to Chinese or Japanese programmers
> > about how efficient UTF-8 is).
>
> And how do Hebrew, Greek, or Arabic speakers feel about the "efficiency"
> of shift-jis or euc-jp?
>
> Surely Europeans have more of a right to complain, since the non-ASCII
> chars they use expand from 1 to 2 bytes when going from iso-8859-x to
> utf-8 (a 100% expansion), while changing from shift-jis to utf-8
> generally means an expansion from 2 to 3 bytes (a 50% expansion).

You seem to assume a 100% frequency of non-ascii characters in Python
code. Not likely.

Still, I doubt that source size is really that important.

- Anders


Francois Pinard

unread,
Feb 5, 2003, 10:30:04 AM2/5/03
to
[Jeff Epler]

> > Is there any decent editor supporting utf-8?

> depends what you think of as decent. vim can do it (not sure about bidi
> support).

Someone recently told me that `vim' can do right to left. I do not know
about mixed bi-directionality, however, nor how `vim' interacts with
Unicode fonts. My guess would be that it does not, and rather relies on
console drivers in UTF-8 mode or such. The truth is that I do not know.

> emacs can do it (at least with mule, should include bidi).

Emacs has incomplete kludges for supporting UTF-8 with proper fonts; not
everything is supported. However, things like Latin-1 expressed as UTF-8
should work without many problems (yet a few might remain). All this is
in the long process of being rethought and rewritten, as far as I know.

> You could probably make idle do it (no bidi support).

Idle uses Tk, and probably much depends on Unicode support in Tk.

There is an ambitious and impressive project called Pango, which sounds
quite promising, bringing comprehensive and competent support for a
lot of language scripts to GTK. It might be the best long term bet.

Anders J. Munch

unread,
Feb 5, 2003, 11:16:37 AM2/5/03
to
"Erik Max Francis" <m...@alcyone.com> wrote:
>
> The -*- convention is simply tipping its hat to a commonly-used
> convention; it's already used, e.g., to indicate the indentation level
> if desired.
>
> Seems plusses all around; one might consider -*- in particular ugly, but
> it's just following a common convention, and sometimes conformance and
> recognizability is better than beauty.

Quite. But achieving both is not that hard. How about this:

A source encoding directive is a line containing three tokens: the
identifier "encoding", a colon and a string constant. The string
constant contains the name of the encoding, optionally surrounded by
"-*- coding:" and "-*-". That way you have a choice between the short and
sweet style:

encoding: "latin-1"

and 'sacrifice to the gods of prior art'-style:

encoding: "-*- coding: latin-1 -*-"

and

encoding: "-*- codíng: latin-1 -*-"

can be signalled as an error instead of silently ignored.
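A recognizer for this proposed directive might look like the sketch below (`find_encoding` and the pattern are my own illustration, not part of any implementation):

```python
import re

# Sketch of the proposed directive: the identifier "encoding", a colon, and
# a string constant, with the "-*- coding: ... -*-" wrapper optional.
DIRECTIVE = re.compile(
    r'^\s*encoding\s*:\s*["\'](?:-\*-\s*coding:\s*)?([-\w.]+)(?:\s*-\*-)?["\']\s*$')

def find_encoding(line):
    m = DIRECTIVE.match(line)
    return m.group(1) if m else None

print(find_encoding('encoding: "latin-1"'))                  # latin-1
print(find_encoding('encoding: "-*- coding: latin-1 -*-"'))  # latin-1
print(find_encoding('# -*- coding: latin-1 -*-'))            # None: still a comment
```

The last line makes the contrast with PEP 263 explicit: under this proposal a plain comment is never recognized.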

- Anders

Roman Suzi

unread,
Feb 5, 2003, 11:15:04 AM2/5/03
to
On Wed, 5 Feb 2003, Michael Hudson wrote:

>Erik Max Francis <m...@alcyone.com> writes:
>
>> Roman Suzi wrote:
>>
>> > That is why I am asking for unconditional raw 8-bit cleanness of
>> > Python
>> > without any -*- things...
>>
>> Personally, I think it's an extremely good idea.
>
>Me too. I'll also observe that editing files in GNU Emacs 21 that
>have characters with the high bit set and don't have a coding cookie
>is exceptionally annoying so putting the coding in not only soothes
>Python, but also potential contributors.

OK. I do agree that putting "coding" in is a good idea. I agree with the Emacsish style.
My disagreement is that I am forced to add yet another line! This means that,
for example, before starting to learn Python one needs to understand encodings!

And what about console Python? How do I tell it I am using koi8-r? Or
whatever? Why doesn't IDLE support koi8-r correctly even with the -*- things
set, given that Tcl/Tk supports Unicode?

All this shows that the DeprecationWarning is too early: the infrastructure
is not ready for this move.

Roman Suzi

unread,
Feb 5, 2003, 11:09:51 AM2/5/03
to
On Wed, 5 Feb 2003, Michael Hudson wrote:

>Roman Suzi <r...@onego.ru> writes:
>
>> what if I have a program whose comments are in one encoding and string
>> literals in another one? (I have a project which uses cp1252 in literals
>> and koi8-r in comments (sometimes).)
>
>How can that work? How does your editor know which is which? Or do
>you just flip from one charset to the other as needed?

No. I am working in koi8-r and I am accustomed to cp1252 letters as they look
in koi8-r. (I do not enter them - I only copy them from some other source.) The
editor (mcedit) just shows them as is. That is why I am using it for the task
(usually I am using Emacs).

>> Is there any decent editor supporting utf-8?
>
>GNU Emacs 21.

Not in my Emacs 21. Probably I need some additional tweaking?

>I think WinXP's notepad.exe might. There are surely
>others.
>
>Cheers,
>M.

Sincerely yours, Roman Suzi

Paul Rubin

unread,
Feb 5, 2003, 1:39:55 PM2/5/03
to
"Anders J. Munch" <ande...@dancontrol.dk> writes:
> A source encoding directive is a line containing three tokens: the
> identifier "encoding", a colon and a string constant. The string
> constant contains the name of the encoding, optionally surrounded by
> "-*- coding:" and "-*-". That way you have a choice between short and
> sweet style:
>
> encoding: "latin-1"

This isn't so great because if you insert an encoding statement, the
script will no longer run under older Pythons like 2.2.

I think it's more in the Python tradition (although this particular
tradition is one that I don't like) to use a variable:

__encoding__ = "latin-1"

or if necessary:

__encoding__ = "-*- latin-1 -*-"

Brian Quinlan

unread,
Feb 5, 2003, 1:59:50 PM2/5/03
to
> I think it's more in the Python tradition (although this particular
> tradition is one that I don't like) to use a variable:
>
> __encoding__ = "latin-1"

It has to be a bit more special than that because the encoding must be
detected before the grammar is parsed. Variable assignment would be
acceptable, I guess, except that the assignment:

1. would have to use a simplified grammar
2. would have to be near the top of the file

The encoding is really meta-information that shouldn't be in source
files. It definitely should not be part of the language. But we live in
the real world where there is nowhere else to put it.

Cheers,
Brian


Paul Rubin

unread,
Feb 5, 2003, 2:13:56 PM2/5/03
to
Brian Quinlan <br...@sweetapp.com> writes:
> > I think it's more in the Python tradition (although this particular
> > tradition is one that I don't like) to use a variable:
> >
> > __encoding__ = "latin-1"
>
> It has to be a bit more special than that because the encoding must be
> detected before the grammar is parsed. Variable assignment would be
> acceptable, I guess, except that the assignment:
>
> 1. would have to use a simplified grammar
> 2. it would have to be near the top of the file

Yes, there are similar constraints for "from __future__" declarations.
This would be similar.

Bernhard Herzog

unread,
Feb 5, 2003, 2:47:32 PM2/5/03
to
"Anders J. Munch" <ande...@dancontrol.dk> writes:

> Source file encoding has direct consequences for program execution.
> The shebang and Emacs/vim encoding comments do not.

Some of them already do! The following code won't produce a syntax error
in CPython even though one would expect it to, and it will display
correctly in Emacs:

# -*- tab-width:2 -*-

def g():
    a = 1 # four spaces
		b=2 # two tabs


IMO Python shouldn't do this, but for whatever reason it's there.

Bernhard

--
Intevation GmbH http://intevation.de/
Sketch http://sketch.sourceforge.net/
MapIt! http://www.mapit.de/

Anders J. Munch

unread,
Feb 5, 2003, 2:28:25 PM2/5/03
to
"Paul Rubin" <phr-n...@NOSPAMnightsong.com> wrote:
> "Anders J. Munch" <ande...@dancontrol.dk> writes:
> > A source encoding directive is a line containing three tokens: the
> > identifier "encoding", a colon and a string constant. The string
> > constant contains the name of the encoding, optionally surrounded by
> > "-*- coding:" and "-*-". That way you have a choice between short and
> > sweet style:
> >
> > encoding: "latin-1"
>
> This isn't so great because if you insert an encoding statement, the
> script will no longer run under older Pythons like 2.2.

Is that a good or a bad thing? I'm not so sure anymore.

First, this only affects newly written modules, and new modules
intended to be useful with old Python versions have the option of
using the ascii encoding and escape sequences for the rest. It should
be trivial to write a script that converts arbitrary source code with
an encoding declaration to this form.

Second, the backwards compatibility is a mirage. If the encoding
declaration is a no-op in the older version, then for all encodings
other than Latin-1, not only will it not work, but it will silently do
the wrong thing!

>
> I think it's more in the Python tradition (although this particular
> tradition is one that I don't like) to use a variable:
>
> __encoding__ = "latin-1"
>
> or if necessary:
>
> __encoding__ = "-*- latin-1 -*-"

Not bad at all. Same number of tokens, but has a very familiar look
and feel.

My only complaint is that it doesn't fail with older Python versions ;-)

- Anders


Anders J. Munch

unread,
Feb 5, 2003, 2:57:39 PM2/5/03
to
"Paul Rubin" <phr-n...@NOSPAMnightsong.com> wrote:
> Brian Quinlan <br...@sweetapp.com> writes:
> > > I think it's more in the Python tradition (although this particular
> > > tradition is one that I don't like) to use a variable:
> > >
> > > __encoding__ = "latin-1"
> >
> > It has to be a bit more special than that because the encoding must be
> > detected before the grammar is parsed. Variable assignment would be
> > acceptable, I guess, except that the assignment:
> >
> > 1. would have to use a simplified grammar
> > 2. it would have to be near the top of the file

You don't have to think about it as an assignment. Think about it as
three lexer tokens appearing on one of the first two lines of the
file.

>
> Yes, there are similar constraints for "from __future__" declarations.
> This would be similar.

Not quite. The point with "from __future__" is that if you try to
import a non-existent feature, you get a runtime (not a syntax) error,
which can be trapped from within the same file so you can default to
some other behaviour if the feature is missing. But you do get an
error! With a variable assignment, there's no error and the module
will silently do the wrong thing.

Which brings us back to "from __encodings__ import latin1", whose only
drawback is that it doesn't invoke any Emacs magic. (And so what?
Emacs is pliable. We'll think of something.)

Or perhaps something else that is a trappable error in Python<2.3:

__encoding__("utf-8")
__encoding__("-*- encoding: utf-8 -*-")

Again, this is not really a function call, it's just four lexer tokens
on a line at or near the top of the file. But to an old Python it's a
NameError.
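The failure mode on an old interpreter is easy to demonstrate (assuming only that no `__encoding__` builtin exists, which is true of every released Python):

```python
# With no __encoding__ builtin defined, the proposed line is just an ordinary
# expression, so an old interpreter fails loudly instead of silently
# mis-decoding the source.
try:
    __encoding__("utf-8")
except NameError:
    print("no __encoding__ support: a loud failure, not silent misbehaviour")
```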

- Anders

Chris Liechti

unread,
Feb 5, 2003, 3:01:15 PM2/5/03
to
Francois Pinard <pin...@iro.umontreal.ca> wrote in
news:mailman.1044406211...@python.org:

> [Chris Liechti]
>
>> "#-*- coding: latin1 -*-" may look nice to an all-time vi user but
>> most users are NOT using vi or emacs.
>
> I often read on this list that most people are using Emacs, or vi, or
> something else. The truth might well be that we do not have any real
> statistics on this. Caution suggests that we refrain asserting things
> like above, as gratuitous statements undermine the best of
> argumentations! :-)

he he.. yeah i have no numbers. but i'm pretty sure that less than 50% of
all programmers are using vi(m) or emacs... and less than 50% is
not "most" ;-)

just think of the many Windoze users that barely know what to do with a
console window. and i guess many newbies do not start with vi but rather a
more GUI oriented editor.

i know that vi and emacs are good editors; i use vim sometimes. but
despite that, i think it's still not very good to optimize a programming
language for an editor ;-)

however, the "-*-" seems to be optional and the regexp matches on "coding:
..." but i still don't like that a *comment* has an effect on how the program
is interpreted. that is IMHO a BadThing(TM).

chris

--
Chris <clie...@gmx.net>

Chris Liechti

unread,
Feb 5, 2003, 3:41:26 PM2/5/03
to
Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote in
news:7x1y2mo...@ruckus.brouhaha.com:

anyway, if the PEP proposes to search for a regexp in comments, then it can
do it equally well over the rest of the source. meaning that

regexp1 = r"__encoding__[\t ]*=[\t ]*[\"']+(\w+)[\"']+"
regexp2 = r"from[\t ]+__encoding__[\t ]+import[\t ]+[\"']+(\w+)[\"']+"

can be searched for before parsing the grammar; both are valid python code
that do not need any language extensions, and it still works after removing
all comments.
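trying regexp1 against a plausible first line shows it does pick out the name (illustrative only; this is the proposal above, not anything python implements — and note that `\w` will not match a hyphen, so it would find `latin1` but not `latin-1`):

```python
import re

# The proposed pattern for an __encoding__ assignment near the top of the
# file, applied before the real grammar is parsed.
regexp1 = r"__encoding__[\t ]*=[\t ]*[\"']+(\w+)[\"']+"

src = '__encoding__ = "latin1"\nprint("hello")\n'
m = re.match(regexp1, src)
print(m.group(1) if m else None)  # latin1
```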

actually i like the regexp1 way. you can even retrieve the encoding at
runtime, and if encoding matters during parsing and executing it also
matters during runtime, otherwise we would not need the PEP, right?

chris

a comment is a comment and should stay a comment...
--
Chris <clie...@gmx.net>

Skip Montanaro

unread,
Feb 5, 2003, 3:56:27 PM2/5/03
to

This is another thread that's wandered off into tit-for-tat hell. Can we
please just drop it and move on?

Skip


Brian Quinlan

unread,
Feb 5, 2003, 4:10:30 PM2/5/03
to
> anyway, if the PEP proposes to search for a regexp in comments, then it can
> do it equally well over the rest of the source. meaning that
>
> regexp1 = r"__encoding__[\t ]*=[\t ]*[\"']+(\w+)[\"']+"
> regexp2 = r"from[\t ]+__encoding__[\t ]+import[\t ]+[\"']+(\w+)[\"']+"
>
> can be searched for before parsing the grammar; both are valid python code
> that do not need any language extensions, and it still works after
> removing all comments.

Here are the downsides/observations:
1. you can't actually use a regular expression like that because the file
   might be using a multibyte encoding system or an encoding that is not
   an ASCII superset, i.e. searching for that pattern might be hard
2. no editors will understand the encoding meta-information that you are
   trying to provide (I don't get why people don't seem to understand the
   meta-informational aspect of encodings; the encoding isn't a property
   of your script, it is like the size or permissions of the source file,
   i.e. in an ideal world, Python shouldn't have to care because someone
   else would worry about it).
3. it has runtime effects which are not necessarily desirable

> actually i like the regexp1 way. you can even retrieve the encoding at
> runtime, and if encoding matters during parsing and executing it also
> matters during runtime, otherwise we would not need the PEP, right?

The encoding only matters at load time. You should be able to save your
source files using a different encoding, change the encoding declaration
(unless Emacs/VIM does it for you automatically) and run your script
without any change in behavior.

> a comment is a comment and should stay a comment...

Unless it is a shebang line?

Cheers,
Brian


Erik Max Francis

unread,
Feb 5, 2003, 5:53:23 PM2/5/03
to
"Anders J. Munch" wrote:

> Source file encoding has direct consequences for program execution.
> The shebang and Emacs/vim encoding comments do not.

That seems to me a distinction without a difference. The bangpath
determines (or at least can determine on some operating systems) which
interpreter gets run; if you require a certain interpreter version and
have the bangpath set wrong, your script will bomb. Certainly that has
direct consequences for program execution.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ It comes from inside, and that's what I consider to be soul music.
\__/ Sade Adu
Esperanto reference / http://www.alcyone.com/max/lang/esperanto/
An Esperanto reference for English speakers.

Chris Liechti

unread,
Feb 5, 2003, 6:12:15 PM2/5/03
to
Brian Quinlan <br...@sweetapp.com> wrote in
news:mailman.104447929...@python.org:

>> anyway if the PEP proposes to search for a regexp in comments, then it
>> can do it equally well over the rest of the source. meaning that
>>
>> regexp1 = r"__encoding__[\t ]*=[\t ]*[\"']+(\w+)[\"']+"
>> regexp2 = r"from[\t ]+__encoding__[\t ]+import[\t ]+[\"']+(\w+)[\"']+"
>>
>> can be searched before parsing the grammar, both are valid python code
>> that do not need any language extensions and it still works after
>> removing all comments.
>
> Here are the downsides/observations:
> 1. you can't actually use a regular expression like that because the
> file
> might be using a multibyte encoding system or an encoding that is not
> an ASCII superset i.e. searching for that pattern might be hard

but the PEP uses a regexp to describe the "# -*- coding" thing. it has the
same limitations, however it is implemented.
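(For what it's worth, the PEP's pattern is easy to apply; a simplified
sketch in modern notation, ignoring the comment-only restriction the PEP
also imposes:)

```python
import re

# Simplified sketch of PEP 263 cookie detection: the PEP's own regex,
# applied to only the first two lines of a source file.  (The real
# rule additionally requires the match to sit inside a comment.)
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def detect_encoding(lines):
    for line in lines[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return None

print(detect_encoding(["#!/usr/bin/env python",
                       "# -*- coding: latin-1 -*-"]))  # prints latin-1
```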

> 2. no editors will understand the encoding meta-information that you are
> trying to provide

yes. but the "# -*- coding.." line will not be understood by the gazillion
of editors out there, only by two, vi and emacs. so that is not a strong
argument. or are all python programmers supposed to use one of these
editors?!?

> (I don't get why people don't seem to understand the
> meta-informational aspect of encodings; the encoding isn't a property
> of your script, it is like the size or permissions of the source file
> i.e. in an ideal world, Python shouldn't have to care because someone
> else would worry about it).

right, would be nice if that was handled by the filesystem. but
unfortunately it's harder to change that...

> 3. it has runtime effects which are not necessarily desirable

i'd call emitting a warning a runtime effect too ;-)
the warning is intended for the eyes of a developer (and maybe in some
cases for a customer, but i doubt that it's of great significance); instead,
a lot of plain users are getting warnings for software that they have been
using for a long time, web logs are filled, etc.



>> actually i like the regexp1 way. you can even retrieve the encoding at
>> runtime and if encoding matters during parsing and executing it also
>> matters during runtime, otherwise we would not need the PEP, right?
>
> The encoding only matters at load time. You should be able to save your
> source files using a different encoding, change the encoding declaration
> (unless Emacs/VIM does it for you automatically) and run your script
> without any change in behavior.

ok, if i put a "ä"(latin1) in my script it will be printed as another
character in the DOS box, with or without the encoding line (as it is now).
so the entire encoding line did not improve anything, but caused a lot of
work for me, changing all old files to get rid of the warning...

i'm +1 for a way to specify the encoding, i just don't like it when my (old)
programs write out a warning on the client's PC.

the PEP says:
"""A warning will be issued if non-ASCII bytes are found in the
input, once per improperly encoded input file."""

so i'll get warnings because of non-english comments, maybe many warnings
for one run of a big program. no, i don't like that.

>> a comment is a comment and should stay a comment...
>
> Unless it is a shebang line?

which is a completely different thing, as it's NOT at all interpreted by
python. it's read by your OS/shell that wants to execute the file.

chris

--
Chris <clie...@gmx.net>

Chris Liechti

unread,
Feb 5, 2003, 6:35:24 PM2/5/03
to
Erik Max Francis <m...@alcyone.com> wrote in
news:3E4195E3...@alcyone.com:

> "Anders J. Munch" wrote:
>
>> Source file encoding has direct consequences for program execution.
>> The shebang and Emacs/vim encoding comments do not.
>
> That seems to me a distinction without a difference. The bangpath
> determines (or at least can determine on some operating systems) which

i see some differences...
now, i can strip all comments; the only thing i lose is the option to set
the X file attribute on Unix-like operating systems (affecting the
environment).
but if the encoding line is stripped, a warning is generated (for non-ASCII
files, affecting the interpreter).
it is the first comment that changes the interpretation of a source file.
what other magic comment will come in the future?

> interpreter gets run; if you require a certain interpreter version and
> have the bangpath set wrong, your script will bomb.

you can check sys.version and exit gracefully if you want. however, you
can't disable this warning within a script.
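(A graceful version check of that sort might look like this sketch:)

```python
import sys

# What a script *can* do: bail out politely on an interpreter that is
# too old.  What it cannot do is silence the tokenizer's encoding
# warning, which is issued before any of its own statements run.
if sys.version_info < (2, 3):
    sys.exit("sorry, this script needs Python 2.3 or newer")
print("interpreter version is new enough")
```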

well, it's always a problem to write forward-compatible programs, as one can
never know what the future brings...
some scripts with a division will break in some future release of python,
and this one just makes ugly output a bit earlier. in the end we all have
to live with the fact that progress has its costs.

chris

--
Chris <clie...@gmx.net>

Anders J. Munch

unread,
Feb 5, 2003, 6:45:24 PM2/5/03
to
"Erik Max Francis" <m...@alcyone.com> wrote:
> "Anders J. Munch" wrote:
>
> > Source file encoding has direct consequences for program execution.
> > The shebang and Emacs/vim encoding comments do not.
>
> That seems to me a distinction without a difference. The bangpath
> determines (or at least can determine on some operating systems) which
> interpreter gets run; if you require a certain interpreter version and
> have the bangpath set wrong, your script will bomb. Certainly that has
> direct consequences for program execution.

In the context of how a Python implementation executes a Python program
the shebang line has no effect whatsoever.

- Anders

Brian Quinlan

unread,
Feb 5, 2003, 6:57:29 PM2/5/03
to
> ok, if i put a "ä"(latin1) in my script it will be printed as an other
> character in the DOS box, with or without the encoding line (as it is
> now). so the entire encoding line did not improve anything, but caused

> a lot of work for me, changing all old files to get rid of the
warning...
>
> i'm +1 for way to specify the encoding, i just don't like it when my
(old)
> programs write out a warning at the clients PC.

I think I'm ready to leave this argument. But here is one more try: do
you think that it is desirable for people to be able to enter Unicode
characters directly into Python unicode literals?

If you do, I challenge you to modify the parser so that the encoding
need not be known OR that a warning is only generated when Unicode
characters are actually found in Unicode literals.

Cheers,
Brian


Erik Max Francis

unread,
Feb 5, 2003, 8:36:29 PM2/5/03
to
"Anders J. Munch" wrote:

> In the context of how a Python implementation executes a Python
> program
> the shebang line has no effect whatsoever.

Obviously, but the point here is that it makes a big difference in the
real world. The line in the sand (between "information" and
"metainformation") you've drawn seems arbitrary. Certainly the encoding
that the Python script has is extremely important to processing it.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE

/ \ A man can stand a lot as long as he can stand himself.
\__/ Axel Munthe
Blackgirl International / http://www.blackgirl.org/
The Internet resource for black women.

Terry Reedy

unread,
Feb 5, 2003, 9:42:11 PM2/5/03
to

"Erik Max Francis" <m...@alcyone.com> wrote in message
news:3E41BC1D...@alcyone.com...

> "Anders J. Munch" wrote:
>
> > In the context of how a Python implementation executes a Python
> > program the shebang line has no effect whatsoever.

If the purpose of the shebang line is only to say *where* to find the
one and only Python interpreter that is on a system, then that is
correct. But if the purpose is to say *which* interpreter to use (say
1.52 versus 2.2 for a real example), then it is not. It is a shame
that this 'trick' does not work on all systems for making such a
choice.

> Obviously, but the point here is that it makes a big difference in
the
> real world. The line in the sand (between "information" and
> "metainformation") you've drawn seems arbitrary. Certainly the
encoding
> that the Python script has is extremely important to processing it.

Yes. Specifying which input decoding subprocessor to use is a logical
next step after choosing which interpreter version to use.

Terry J. Reedy


Neil Hodgson

unread,
Feb 6, 2003, 2:28:09 AM2/6/03
to
Chris Liechti:

> yes. but the "# -*- coding.." line will not be understood by the gazillion
> of editors out there, only by two, vi and emacs. so that is not a strong
> argument. or are all python programmers supposed to use one of these
> editors?!?

I added recognition of the coding line to SciTE although it only works
for UTF-8 currently. I expect other Python oriented editors and IDEs will be
upgraded in the near future and they are welcome to incorporate SciTE's
code.

Neil


Alex Martelli

unread,
Feb 6, 2003, 4:23:46 AM2/6/03
to
Terry Reedy wrote:

>
> "Erik Max Francis" <m...@alcyone.com> wrote in message
> news:3E41BC1D...@alcyone.com...
>> "Anders J. Munch" wrote:
>>
>> > In the context of how a Python implementation executes a Python
>> > program the shebang line has no effect whatsoever.
>
> If the purpose of the shebang line is only to say *where* to find the
> one and only Python interpreter that is on a system, then that is
> correct. But if the purpose is to say *which* interpreter to use (say
> 1.52 versus 2.2 for a real example), then it is not. It is a shame

And since you can pass switches such as -u on the shebang
line to affect the way Python executes the program, I do
not think it matters much if multiple interpreters exist...


Alex

Anders J. Munch

unread,
Feb 6, 2003, 11:19:56 AM2/6/03
to
"Erik Max Francis" <m...@alcyone.com> wrote:
> "Anders J. Munch" wrote:
>
> > In the context of how a Python implementation executes a Python
> > program
> > the shebang line has no effect whatsoever.
>
> Obviously, but the point here is that it makes a big difference in the
> real world. The line in the sand (between "information" and
> "metainformation") you've drawn seems arbitrary. Certainly the encoding
> that the Python script has is extremely important to processing it.

Roman Suzi drew the line in the sand. I merely put comment-like
encoding syntax on the right side of it, namely the side of real,
solid, consequential information.

I have no particular need for the distinction myself. Let's skip the
subject of on which side the shebang line goes.

- Anders


Roman Suzi

unread,
Feb 6, 2003, 4:24:06 AM2/6/03
to

[Skip Montanaro]

>This is another thread that's wandered off into tit-for-tat hell. Can we
>please just drop it and move on?

As the OP of the thread, I feel obliged to summarize it.

We discussed PEP-0263
( http://python.org/peps/pep-0263.html )

1. Opinions were divided on the necessity for adding encoding
comment for non-ASCII encodings (in fact, non-utf-8 encodings).

There were many arguments pro and contra and (it seems) there are
more people who support PEP 263 than those who do not.
Arguments were:

+ explicit encoding disciplines the programmer
+ it is portable
+ it allows editors such as Emacs, vim and SciTE to be informed of the encoding

- it's annoying for beginners
- it irritates those who want to run old scripts with new Python
(phase 1 gives warning, phase 2 will give errors!)
- it makes one encoding per source file a must, disallowing
any non-standard de-facto usage of encoding mixtures
- it's ugly

2. The last "minus" was discussed further. Many syntactic
suggestions were made (read the thread)

* Further discussion is probably not constructive, as Skip noticed.

Encoding-cookie is bitter, but probably necessary. I have no other
arguments.

However, nobody answered how one would feel if
# -*- coding: ascii -*-
would be necessary for every program.

M.-A. Lemburg

unread,
Feb 6, 2003, 3:07:36 PM2/6/03
to
Roman Suzi wrote:
> [Skip Montanaro]
>
>>This is another thread that's wandered off into tit-for-tat hell. Can we
>>please just drop it and move on?
>
>
> As the OP of the thread, I feel obliged to summarize it.
>
> We discussed PEP-0263
> ( http://python.org/peps/pep-0263.html )
>
> 1. Opinions were divided on the necessity for adding encoding
> comment for non-ASCII encodings (in fact, non-utf-8 encodings).
>
> There were many arguments pro and contra and (it seems) there are
> more people who support PEP 263 than those who do not.
> Arguments were:
>
> + explicit encoding disciplines programmer
> + it is portable
> + it allows editors such as Emacs, vim and SciTe to be informed on encoding
>
> - it's annoying for beginners
> - it irritates those who want to run old scripts with new Python
> (phase 1 gives warning, phase 2 will give errors!)
> - it makes one encoding per source a must, disallowing
> any non-standart de-facto usages of encoding mixtures
> - it's ugly
>
> 2. The last "minus" was discussed further. Many syntactic
> suggestions were made (read the thread)
>
> * Further discussion is probably not constructive, as Skip noticed.

Indeed :-) Even less, since it is already implemented in Python 2.3.

> Encoding-cookie is bitter, but probably necessary. I have no other
> arguments.
>
> However, nobody answered how one would feel if
> # -*- coding: ascii -*-
> would be necessary for every program.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/


Paul Rubin

unread,
Feb 6, 2003, 4:53:26 PM2/6/03
to
"M.-A. Lemburg" <m...@lemburg.com> writes:
> Indeed :-) Even less, since it is already implemented in Python 2.3.

It is not. It's implemented in Python 2.3 alpha 1. Python 2.3 (the
real release) does not yet exist. There is still time to fix this bug.
That's the whole purpose of test releases.

François Pinard

unread,
Feb 6, 2003, 6:21:38 PM2/6/03
to
[Chris Liechti]

> [...] but i'm pretty sure that less than 50% of all programmers are
> neither using vi(m) nor emacs... and less than 50% is not "most" ;-)

"Being pretty sure" still needs to be substantiated with facts, or numbers.
I'm not confirming nor denying anything in this area, just stating that we
should refrain from turning mere impressions or intuitions into assertions.

--
François Pinard http://www.iro.umontreal.ca/~pinard

Skip Montanaro

unread,
Feb 6, 2003, 7:06:45 PM2/6/03
to

>> [...] but i'm pretty sure that less than 50% of all programmers are
>> neither using vi(m) nor emacs... and less than 50% is not "most" ;-)

François> "Being pretty sure" still needs to be substantiated with
François> facts, or numbers.

Even if they make up far less than half the programming editor market
available to Python programmers, vim, emacs, and apparently SciTE are doing
something about the encoding mess. I suspect that puts them ahead of other
editors in this regard.

Skip

Roman Suzi

unread,
Feb 7, 2003, 12:23:46 AM2/7/03
to
On Thu, 6 Feb 2003, M.-A. Lemburg wrote:

>Roman Suzi wrote:
>>
>> We discussed PEP-0263
>> ( http://python.org/peps/pep-0263.html )
>>

>> * Further discussion is probably not constructive, as Skip noticed.
>

>Indeed :-) Even less, since it is already implemented in Python 2.3.
>

>> Encoding-cookie is bitter, but probably necessary. I have no other
>> arguments.

Well, if encoding-cookie is here to stay, I have only one wish:

aaa.py:7: DeprecationWarning: Non-ASCII character '\xec', but no declared
encoding

- please, add some more of a hint about adding an encoding declaration
to the source. The URL of the PEP will do.

I still do not know what to do with users of Python programs.
Do we need to urge them to become Python programmers? ;-)
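(One stopgap worth noting — a sketch, not an endorsement: a small launcher
script could filter the warning with the standard `warnings` machinery
before importing the legacy code, so end users never see it:)

```python
import warnings

# Ignore DeprecationWarning before importing legacy modules whose
# source would otherwise trigger the "no declared encoding" warning.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Demonstration that the filter works; the import of the legacy
# module would go here instead of the explicit warn() call.
with warnings.catch_warnings(record=True) as caught:
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.warn("Non-ASCII character, but no declared encoding",
                  DeprecationWarning)
print(len(caught))  # 0: the warning was filtered out
```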

And one more point. The Style Guide needs to be updated accordingly,
banning multiple encodings in a source file and recommending how to add
the "coding:" hint.

M.-A. Lemburg

unread,
Feb 7, 2003, 3:47:21 AM2/7/03
to
Roman Suzi wrote:
> On Thu, 6 Feb 2003, M.-A. Lemburg wrote:
>
>
>>Roman Suzi wrote:
>>
>>>We discussed PEP-0263
>>>( http://python.org/peps/pep-0263.html )
>>>
>>>* Further discussion is probably not constructive, as Skip noticed.
>>
>>Indeed :-) Even less, since it is already implemented in Python 2.3.
>>
>>
>>>Encoding-cookie is bitter, but probably necessary. I have no other
>>>arguments.
>
>
> Well, if encoding-cookie is here to stay, I have only one wish:
>
> aaa.py:7: DeprecationWarning: Non-ASCII character '\xec', but no declared
> encoding
>
> - please, add some more hint about encoding addition to the source.
> URL of the PEP will do.

Good idea.

> I still do not know what to do with user's of Python programs.
> Do we need to urge them to become Python programmers ;-)

No, but they'll need to pay some lucky Python programmer to
get rid of the warning :-) Seriously, the warning and the trouble
are intended, as I already mentioned in the bug report Kirill
filed on SF: http://www.python.org/sf/681960/ :

Python's source code was originally never meant to contain
non-ASCII characters. The PEP implementation now officially
allows this provided that you use an encoding marker, e.g.

"""
# -*- coding: windows-1251 -*-
name = raw_input("Как тебя зовут ? ")
print "Привет %s" % name
"""
(If you open this in emacs, you'll see Russian text)

Note that this is also needed in order to support UTF-16
file formats which use two bytes per character. Python
will automatically detect these files, so if you really don't
like the coding marker, simply write the file using a UTF-16
aware editor which prepends a UTF-16 BOM mark to the
file.

BTW, if you absolutely want to use multiple encodings in a single
file and you're sure what you're doing, then you can "disable"
that warning and possible codec errors by telling Python
to interpret the file as latin-1:

"""
# Tell Python to read this file as-is: coding: latin-1
name = raw_input("Как тебя зовут ? ")
print "Привет %s" % name
"""

Note that Unicode literals then *have* to be in Latin-1,
otherwise, you'll lose big. By telling Python to read the
file using the Latin-1 codec you basically tell it to
work exactly like it does now (which is considered a bug).

This whole thing is one more step in the direction of
explicit is better than implicit and opens up Python
for many more languages such as, for example, Asian
scripts.

> And one more point. The Style Guide need to be upgraded accordingly,
> banning multiple encodings in the source and telling to add
> "coding: " hint the recommended way.

Good point. I'll add a comment there.

Kirill Simonov

unread,
Feb 7, 2003, 11:39:56 AM2/7/03
to
* M.-A. Lemburg <m...@lemburg.com>:

> No, but they'll need to pay some lucky Python programmer to
> get rid off the warning :-) Seriously, the warning and the trouble
> are intended as I already mentioned in the bug report Kirill
> filed on SF: http://www.python.org/sf/681960/ :

Sorry, but I'm not convinced. I hope you still have patience to
hear my objections.

I've inspected the current implementation. The file encoding does not
affect ordinary string literals. At first the tokenizer converts them
into UTF-8 from the file encoding. Then the compiler converts them back
from UTF-8 to the file encoding. Thus the result is the same regardless
of what encoding you use. The comments are tossed out by the tokenizer
too. Why do you want them to be in any particular encoding if their
encoding doesn't matter?
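(In modern notation, the round trip Kirill describes can be checked
directly; latin-1 is used here purely as an example of a single-byte
encoding:)

```python
# Tokenizer step: decode the source bytes from the file encoding into
# UTF-8; compiler step: convert the literal back again.  For a
# single-byte encoding like latin-1 nothing is lost either way.
raw = b"caf\xe9"                                  # latin-1 bytes
as_utf8 = raw.decode("latin-1").encode("utf-8")   # what the tokenizer holds
back = as_utf8.decode("utf-8").encode("latin-1")  # what the compiler emits
print(back == raw)  # True
```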

Well, I understand. The file encoding is defined for the whole file.
So comments and string literals must be in this encoding too.
And that way we can define Unicode literals using our favourite encoding.

But what is the price that we pay for this? The millions of Python
scripts that use 8-bit string literals or comments are broken now in
order to allow a feature that no one ever used! I think this is
extreme.

And I can propose a perfect solution. If there is no declared encoding
for a source file, assume that it uses a simple 8-bit encoding. Do not
convert the file into UTF-8 in the tokenizer. And do not convert string
literals in the compiler. Raise SyntaxError if a non-ASCII character is
contained in a Unicode literal. We will even save a few CPU cycles
for most Python source files using this approach.
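(A sketch of the proposed check — the helper name is hypothetical, and
the real patch would of course live in the C tokenizer:)

```python
def check_unicode_literal(text):
    # Kirill's proposed rule: with no declared source encoding, only
    # Unicode literals must be pure ASCII; plain 8-bit strings and
    # comments pass through untouched.
    for ch in text:
        if ord(ch) > 127:
            raise SyntaxError("non-ASCII character in Unicode literal, "
                              "but no declared encoding")
    return True

print(check_unicode_literal("hello"))  # True
```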

I will write a patch if you agree with this solution.

> This whole thing is one more step in the direction of
> explicit is better than implicit and opens up Python
> for many more languages such as, for example, Asian
> scripts.

If you need a pythonic quote, it is here
"Practicality beats purity"

--
xi

Mike C. Fletcher

unread,
Feb 7, 2003, 2:36:41 PM2/7/03
to
Kirill Simonov wrote:

>* M.-A. Lemburg <m...@lemburg.com>:
>
>
>>No, but they'll need to pay some lucky Python programmer to
>>get rid off the warning :-) Seriously, the warning and the trouble
>>are intended as I already mentioned in the bug report Kirill
>>filed on SF: http://www.python.org/sf/681960/ :
>>
>>

...

>And I can propose a perfect solution. If there are no defined encoding
>for a source file, assume that it uses a simple 8-bit encoding. Do not
>convert the file into UTF-8 in the tokenizer. And do not convert string
>literals in the compiler. Raise SyntaxError if a non-ASCII character is
>contained in a Unicode literal. We will even save a few CPU cycles
>for most Python source files using this approach.
>
>I will write a patch if you agree with this solution.
>

...

Of course, it means nothing for me to agree (I don't have a Python-dev
vote)... but this approach (assuming it's workable) does sound more
reasonable than breaking every old module that uses > 128 characters in
regular string literals. Sure, I'd love to be paid big bucks to update
old, unmaintained Python modules, but I'm guessing the headache and cost
of having to do that would, by souring users on Python as being
unstable, have a net-negative effect on total Python jobs in the end.
As a devil's advocate, however, doesn't it make the conversion of the
file more complex? I'm guessing the python-dev people are doing
something like "codec.convert(file)", whereas they will need to convert
solely unicode strings with the new approach.

BTW, am I the only one who has visions of eventually being passed a
module written in a Chinese or Japanese codec and being unable to even
see what Chinese/Japanese characters are used (lack of fonts for text
editors), so just facing a field of nulls something like:

???? ????????.?? ?????? *
???? ???? ?????? ??????

????? ?????????:
"""???? ????? ??? ????????? (?????-???????) ??????? ??? ????????

??? ????????? ????? ???????? ?????? ???????? ??? ????????
? ?????????? ?? ????????, ???????????? ????? ???????? ??????????
"""
????????? = 1
??? __????__(????):
"""?????????? ??? ????????????????'? ???????? ??????????
"""
????.__???????? = []
??? ?????????????? ( ???? ):
"""??? ??? ???? ?? ???-???????? ??? ??? ????????? ???????

???? ???? ?? ??? ???? ?? ?????????? ???-????????
????????? ?? ??? ????????? ???????.
"""
?????? ????.__????????[:]

which might make for a fun game, at least, I suppose ;) , but would be
seriously freaky to work with. Similar dreams for UTF-16-encoded files,
lots-and-lots of NULLs in the older editors.

Just my $0.03 CDN,
Mike

_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/


Simo Salminen

unread,
Feb 7, 2003, 4:00:48 PM2/7/03
to
* Kirill Simonov [Fri, 7 Feb 2003 18:39:56 +0200]

> * M.-A. Lemburg <m...@lemburg.com>:
>> No, but they'll need to pay some lucky Python programmer to
>> get rid off the warning :-) Seriously, the warning and the trouble
>> are intended as I already mentioned in the bug report Kirill
>> filed on SF: http://www.python.org/sf/681960/ :
>
> But what is the price that we pay for this? The millions of Python
> scripts that use 8-bit string literals or comments are broken now in
> order to allow the feature that no one ever used! I think that this is
> an extreme.
>

I second this.

This change only makes python hostile to the regular programmer, who
does not care about encodings and only wants to use simple 8-bit
characters in comments.

People (well, at least me) won't start to specify the encoding at the
start of the file, because it does not buy them anything. They will just
stop using high-bit ascii characters in comments, thus decreasing the
level of documentation.


> If you need a pythonic quote, it is here
> "Practicality beats purity"

Exactly. This change makes writing high-bit ASCII comments _very_
impractical, and breaks old code for no good reason.

Cheers,
--
Simo Salminen

John Roth

unread,
Feb 7, 2003, 4:25:54 PM2/7/03
to

"Roman Suzi" <r...@onego.ru> wrote in message
news:mailman.1044297249...@python.org...
>
> I've tried version 2.3a of Python and have been surprised by the
> following warning:
>
> 1.py:6: DeprecationWarning: Non-ASCII character '\xf7', but no
> declared encoding
>
> Does it mean that all the Python software which is not in ASCII will
> give such a warning each time? (Thus probably filling up web-server
> logs or just surprising users, like Perl/C libs do when they don't
> know the current locale.)
>
> I think it's madness... There must be other ways to deal with it. I
> could agree that for correct operation IDLE demands a correct encoding
> setting (and nonetheless works incorrectly!), but plain scripts should
> be 8-bit clean, without any conditions! (Luckily, it's an alpha
> version, so nothing has really changed yet.)
>
> Sincerely yours, Roman Suzi

After thinking about this for a few days, it suddenly occurred to me
that there may be a very obscure method in this madness. That is, by
restricting python source to 7-bit ascii unless otherwise declared,
it opens the way to migrate to UTF-8 input. This, in turn, would
solve most of the character set problems in one fell swoop.

John Roth

Kirill Simonov

unread,
Feb 7, 2003, 5:07:18 PM2/7/03
to
* John Roth <john...@ameritech.net>:

>
> After thinking about this for a few days, it suddenly occurred to me
> that there may be a very obscure method in this madness. That is, by
> restricting python source to 7-bit ascii unless otherwise declared,
> it opens the way to migrate to UTF-8 input. This, in turn, would
> solve most of the character set problems in one fell swoop.
>

Why do you think that UTF-8 is a panacea?

For example, my little script

print "Привет!"

will become

print u"Привет!".encode('koi8-r')

if I am forced to use UTF-8 for my source code. I don't see any
advantage here.
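(In modern notation, the point checks out: the literal needs an explicit
re-encode before a koi8-r terminal can show it, if the source itself has
to be UTF-8:)

```python
# The literal as the programmer wants it...
greeting = u"\u041f\u0440\u0438\u0432\u0435\u0442!"   # "Привет!"
# ...must be re-encoded by hand for a koi8-r terminal when the
# source code is forced to be UTF-8.
koi8_bytes = greeting.encode("koi8-r")
print(koi8_bytes.decode("koi8-r") == greeting)  # True
```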

Yes, I know that I should use gettext. Actually, I do. But what should
a 12-year-old student who writes her first script do?

I have another question. How could I define the input encoding
for the interactive interpreter?


--
xi

Jp Calderone

unread,
Feb 7, 2003, 5:19:46 PM2/7/03
to
On Fri, Feb 07, 2003 at 09:00:48PM +0000, Simo Salminen wrote:
> * Kirill Simonov [Fri, 7 Feb 2003 18:39:56 +0200]
> > * M.-A. Lemburg <m...@lemburg.com>:
> >> No, but they'll need to pay some lucky Python programmer to get rid off
> >> the warning :-) Seriously, the warning and the trouble are intended as
> >> I already mentioned in the bug report Kirill filed on SF:
> >> http://www.python.org/sf/681960/ :
> >
> > But what is the price that we pay for this? The millions of Python
> > scripts that use 8-bit string literals or comments are broken now in
> > order to allow the feature that no one ever used! I think that this is
> > an extreme.
> >
>
> I second this.
>

I don't. In fact, I'm not even sure it makes sense. Source files that
are using non-ASCII encodings are precisely the ones that this feature
benefits. It allows anyone to look at these files and actually *read* them.

While it's true the programs are now "broken" (they're not really; they
won't be broken until this becomes a SyntaxError, and only then if they're
run on the new version of the interpreter - they will always work on
previous versions, forever), they were "broken" before - Python source files
were previously supposed to contain *only* ASCII text.

> This change only makes python hostile to regular programmer, who does not
> care about encodings, and only wants to use simple 8-bit characters in
> comments.
>
> People (well, atleast me) won't start to specify encoding at the start of
> the file, because it does not buy anything. They will just stop using
> high-bit ascii characters in comments, thus decreasing the level of
> documentation.

If you need to regularly use an encoding other than ASCII, and you cannot
configure your editor to put the appropriate text at the top of newly
created .py files, I humbly suggest that you need to find a new editor.

>
>
> > If you need a pythonic quote, it is here
> > "Practicality beats purity"
>
> Exactly. This change makes writing high-bit ASCII comments _very_
> unpractical, and breaks old code for no good reason.

There is no such thing as high-bit ASCII. If you don't understand the
issue, why do you think you can comment relevantly upon it?

Jp

--
C/C++/Java/Perl/Python/Smalltalk/PHP/ASP/XML/Linux (User+Admin)
Genetic Algorithms/Genetic Programming/Neural Networks
Networking/Multithreading/Legacy Code Maintenance/OpenGL
See my complete resume at http://intarweb.us:8080/
--
up 54 days, 1:50, 7 users, load average: 0.00, 0.04, 0.14

Kirill Simonov

unread,
Feb 7, 2003, 5:48:46 PM2/7/03
to
* Jp Calderone <exa...@intarweb.us>:

> I don't. In fact, I'm not even sure it makes sense. Source files that
> are using non-ASCII encodings are precisely the ones that this feature
> benefits. It allows anyone to look at these files and actually *read* them.

The only editor that can read the encoding declaration is emacs. Do you
assume that *everyone* uses emacs?

The only benefit from encoding declarations is the ability to write
Unicode literals in the chosen encoding. That's all.

--
xi

Jp Calderone

unread,
Feb 7, 2003, 5:52:38 PM2/7/03
to

Who said anything about -reading- the encoding declarations? Any half-way
decent editor should be able to -write- them. If people are already
including non-ASCII in their source files, I assume their terminal/GUI
already knows how to display it properly.

My point is that this is not an undue burden on developers, and there is
no reason someone should decide it is a high enough barrier to not use
non-ASCII characters in their source, let alone not use Python.

Scott David Daniels

unread,
Feb 7, 2003, 6:26:17 PM2/7/03
to
Simo Salminen wrote:
> * Kirill Simonov [Fri, 7 Feb 2003 18:39:56 +0200]
>>...But what is the price that we pay for this? The millions of Python

>>scripts that use 8-bit string literals or comments are broken now in
>>order to allow the feature that no one ever used! I think that this is
>>an extreme.
> ...

> This change only makes python hostile to regular programmer, who
> does not care about encodings, and only wants to use simple 8-bit
> characters in comments.

I told myself to be quiet, but ....

This change is one step on the way to switching python source
from bytes to characters; from binary source to text source.

Unix users often think there is no difference between binary and
text files: the two are different, but on unix the representation
is the same. That is, the text file consists of characters (which
have particular meanings), while the binary files are simply byte
streams which can only be replicated. HTML files are not the same
as text files either, although they are represented as text files.
The difference is in what you know about the contents of the file.
If I know a file is html, I can display it with a browser and see
nifty effects like bold, italic, type size changes, .... Without
that information, I have less knowledge about how I might be able
to use that file.

Conceptually, source code is text, not bytes. Nobody really cares
how the characters in a line are encoded: the meaning is apparent
by looking at displayed characters. Unfortunately, we are now (as
we always were) in a world where there are multiple encodings for
the same characters. On any given computer system, for a particular
user, there is a text encoding they are most comfortable using.
This preference usually exists because their favorite text editor can
read and write that encoding, and it has all of the characters they
are likely to use.


There are various options for python source:

First, we could define the coding to be 'system local,' and
endure constant complaints when a file that works right on one
system (or even for one user) does not behave in the same way
for another. This is the "plain old 8-bit" option.

Second, we could (as I understand Python was conceived) restrict
python source to 7-bit printable ASCII plus space, horizontal tab, and
(? \n, \r, \r\n). By the by, if you think the last is nit-picking,
exactly which bytes are in the constant: """a
z""". The answer may depend on your operating system, or it may
not. You probably can run python programs shipped (as binary
files) from Mac OS X, Microsoft Windows, and Unix systems. The
results might differ. Pretty much anyone who uses more characters
than are available in ASCII is going to be infuriated by this
choice.

Third, we could declare a single encoding as "the blessed" encoding
for python source. This would be perfect for the winners and nasty
for the losers. One group would love "latin1" to be that blessed encoding. Well,
UTF-8 at least has the pleasant property of being able to represent
the vast majority of characters representable on computers. So
UTF-8 might be a good choice. However, with a few exceptions, text
editors on a system work well with a particular local encoding. Only
in a very few cases is this a variant of unicode. So on many systems
people who use python will be forced to use a different text editor
than they normally use.

Fourth, we could define python in terms of characters, and allow
locally-encoded text to be used as long as we know the mapping
from the local code to some standard (say, unicode). It is likely
that such python translators will consist of a thin sugary coating
of local-encoding-capable code over a chocolatey core of standard
python code to do the actual parsing, compiling, etc. The core, in
order to be most portable, is likely to munch on unicode. We'll
also need a sugary layer that knows how to determine which original
bytes of local encoding to use in such things as non-unicode string
constants (note the unicode strings will be just dandy as-is). This
looks a _lot_ like the first case, but allows local text encoding
that doesn't map ASCII to the ASCII subset of unicode. This is also
the first character-based option.
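The "thin sugary coating" of this fourth option can be sketched in a
few lines of present-day Python: decode the locally-encoded bytes into
unicode first, then hand the result to the ordinary compile machinery.
The latin-1 source bytes below are an invented example:

```python
# Source bytes as a latin-1 editor might have saved them;
# 0xE9 is e-acute in latin-1.
raw = b"name = '\xe9'\n"

# The sugary layer: map the local encoding onto unicode before
# the chocolatey core ever sees the text.
text = raw.decode("latin-1")

ns = {}
exec(compile(text, "<latin-1 module>", "exec"), ns)
print(ns["name"])  # the e-acute character, whatever the local code was
```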

Fifth, and I personally choose to take the fifth, we could use the
fourth option, except we could write the encoding at the front of
the file (oops, unix uses the first line to control certain program
behavior, so let's allow the first _or_ the second line). If this works,
not only does it work as well as the fourth option, but we can
actually use modules developed under another encoding on our system
without ever having to push them through some sort of "try to do
what they mean" translator to get it into our local format. This
option uses character-based source code with explicit encoding to
allow us to run python from anywhere locally. _But_, it requires
we be explicit about encodings. Our translator will cope properly
with a program built from modules in different encodings, _but_it_
_must_know_the_encodings_. This is delightful, since now we can
safely pull contributed code written in Brazil, Serbia, Kyoto, and
Thailand from a single repository. The sole cost is explicit
encoding. We could probably
even cope with EBCDIC, were someone lusting to use old character
codes, since we need to only look at the first two lines. If
we cannot find an encoding in the first two lines looking at
simple ASCII, we try as EBCDIC and look to see if we find it.
If not, we then try big5 and ....
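What eventually shipped as PEP 263 works much as this fifth option
describes: a coding cookie searched for within the first two lines.
Here is a simplified sketch of that lookup, using the cookie pattern
PEP 263 settled on (the real rules, available today as
`tokenize.detect_encoding`, additionally handle a UTF-8 BOM and a few
corner cases):

```python
import re

# The cookie pattern PEP 263 settled on (simplified): matches
# "# -*- coding: utf-8 -*-" as well as "# coding=latin-1".
COOKIE = re.compile(rb"coding[=:]\s*([-\w.]+)")

def sniff_encoding(source, default="ascii"):
    """Return the declared encoding of a source byte string, if any.

    A sketch of the fifth option, not the real tokenizer: it only
    checks that the candidate lines are comments.
    """
    for line in source.splitlines()[:2]:
        if line.lstrip().startswith(b"#"):
            match = COOKIE.search(line)
            if match:
                return match.group(1).decode("ascii")
    return default

print(sniff_encoding(b"#!/usr/bin/env python\n# -*- coding: koi8-r -*-\n"))
# koi8-r
```

Note that declarations from anywhere (a shebang line followed by a
cookie, an Emacs-style `-*-` marker, or a bare `# coding=...`) all
resolve through the one regular expression.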

Roman Suzi asked:

"how one would feel if '# -*- coding: ascii -*-' would be
necessary for every program?"

I replied:

"I would probably never use it. If I had to use an encoding,
I would probably use: '# -*- coding: UTF-8 -*-', since I could
encode other authors' names in comments (or credit strings)."


I really have no idea whether I am mentioning issues here that
people don't realize, or simply spouting off my opinions to a
group that finds them unconvincing. I, of course, hope to be doing
the former and will resume my silence for fear that I am doing the
latter.

-Scott David Daniels
