
Multibyte Character Support for Python


Wenshan Du

May 8, 2002, 8:21:45 AM
hi,
I like Python, but it is not very good at handling multibyte
characters. So I rebuilt the pythoncore source code and made a patch
for Python 2.2.1. Now you can name your variables, classes or
functions with multibyte characters, such as Chinese, Korean or
Japanese. Python will no longer display strings like
"\xc4\xe3\xba\xc3" when you print a string containing multibyte
characters or search a database such as Access with mxODBC.
I named it the Multi Byte Character Support Patch (MBCSP). Now I like
Python even better.
Download MBCSP for Python 2.2.1 from:
http://www.dohao.org/python/mbcsp/en/
Enjoy!
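For the curious: the escape sequence Wenshan quotes is what Python 2.2 printed as the repr() of a GB2312/GBK-encoded Chinese byte string. A quick sketch in present-day Python (where byte strings are explicit; the encoding name is my inference from the byte values) shows what those bytes actually say:

```python
# The bytes Python 2.2 printed as "\xc4\xe3\xba\xc3" are a
# GB2312/GBK-encoded Chinese greeting.
raw = b"\xc4\xe3\xba\xc3"
print(raw.decode("gbk"))  # 你好 ("hello")
```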

Martin v. Löwis

May 8, 2002, 9:55:12 AM
pyt...@dohao.org (Wenshan Du) writes:

> I like Python, but it is not very good at handling multibyte
> characters. So I rebuilt the pythoncore source code and made a patch
> for Python 2.2.1. Now you can name your variables, classes or
> functions with multibyte characters, such as Chinese, Korean or
> Japanese. Python will no longer display strings like
> "\xc4\xe3\xba\xc3" when you print a string containing multibyte
> characters or search a database such as Access with mxODBC.
> I named it the Multi Byte Character Support Patch (MBCSP). Now I like
> Python even better. Enjoy!

So far, it appeared that there is wide agreement that identifiers in
Python should be ASCII only. Do you disagree, i.e. do you *really*
want to use non-ASCII identifiers?

Allowing non-ASCII in strings is a different issue - work is in
progress to support that.

Regards,
Martin

Erno Kuusela

May 8, 2002, 10:10:06 AM
In article <j4y9euw...@informatik.hu-berlin.de>,

loe...@informatik.hu-berlin.de (Martin v. Löwis) writes:

| So far, it appeared that there is wide agreement that identifiers in
| Python should be ASCII only. Do you disagree, i.e. do you *really*
| want to use non-ASCII identifiers?

what would be the advantage in preventing non-english-speaking people
from using python?

-- erno

François Pinard

May 8, 2002, 12:06:47 PM
[Erno Kuusela]

The only reason I have ever heard is to prevent people from writing code
that cannot be universally exported.

People can understand two different, orthogonal things in this issue:
keywords and user identifiers. I'm not really asking that keywords be
translated, because Python keywords and syntax are modelled after the
English language. This may be debated of course, but it is a lower-priority
issue.

However, identifiers created by local programmers, and especially identifiers
naming functions or methods, should be writable in one's national language
without forcing us into orthographical mistakes all over (I usually
choose English identifiers over disgustingly mangled French identifiers).

You know, there is a background irritation at not being able to program
in my own language; this irritation is permanent and never fades out --
a bit like the fossil radiation after the Big Bang! :-) I surely like
Python a lot, but I would like it even more if it were on the side of
programmers of all nations, not forcing wide portability on everyone:
there are many cases where planetary portability is just not a concern.

--
François Pinard http://www.iro.umontreal.ca/~pinard


François Pinard

May 8, 2002, 11:54:46 AM
[Martin v. Löwis]

> So far, it appeared that there is wide agreement that identifiers in
> Python should be ASCII only. Do you disagree, i.e. do you *really*
> want to use non-ASCII identifiers?

For one, I would *really* like to use letters from my locale in identifiers.
Not everyone is writing with the whole planet as a goal, you know! :-)

There is a lot of in-house development, not meant to be exported, that
would be _so_ much more comfortable if we could use our own language
while programming. Many years ago, we experienced that university-wide,
by modifying the Pascal compiler so that we could use French identifiers
whenever we felt like it (as well as a lot of other software and even
hardware), and we kept modifying compilers as new Pascal versions were
released. Moving on to other sites and languages, my co-workers and I
did not try redoing such patches all the time, everywhere. Yet I would
dearly like Python to be on our side, favouring the restoration of our
Lost Paradise.

Martin v. Löwis

May 8, 2002, 12:41:56 PM
Erno Kuusela <erno...@erno.iki.fi> writes:

> | So far, it appeared that there is wide agreement that identifiers in
> | Python should be ASCII only. Do you disagree, i.e. do you *really*
> | want to use non-ASCII identifiers?
>
> what would be the advantage in preventing non-english-speaking people
> from using python?

There would be no advantage in doing so. However, restricting
identifiers to ASCII still allows non-English-speaking people to use
Python, provided they at least know the Latin alphabet.

If they don't know the Latin alphabet, they can't use Python even if
identifiers can be non-ASCII, since the keywords would still be
written with Latin letters.

Regards,
Martin

Martin v. Löwis

May 8, 2002, 1:04:32 PM
pin...@iro.umontreal.ca (François Pinard) writes:

> > what would be the advantage in preventing non-english-speaking people
> > from using python?
>
> The only reason I ever heard is preventing people to write code that cannot
> be universally exported.

Indeed, the code ought to run regardless of environment
settings. These days, it doesn't have that much to do with "exporting"
it: even in a single organization, multiple encodings for a single
language are used (e.g. in Germany, all of Latin-1, Latin-9, UTF-8,
and CP-1252 are common); people would expect that their code continues
to work when they move it from a Unix machine to a Windows machine.
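Martin's point about coexisting encodings is easy to make concrete: the same word has different byte representations under the encodings he lists, so code that handles raw bytes under one assumption breaks under another. A minimal sketch (the sample German word is my choice):

```python
# One word, three byte representations; code that assumes a single
# byte encoding breaks when files cross these environments.
word = "Größe"  # "size" in German
for enc in ("latin-1", "cp1252", "utf-8"):
    print(enc, word.encode(enc))
```

Note that latin-1 and cp1252 happen to agree for these particular characters, which is exactly what makes such mismatches easy to miss until a less common character appears.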

So in the end, the only acceptable strategy would be to allow
identifiers that contain letters (or letterlike symbols) in arbitrary
languages. For Python, that would mean that attributes must be Unicode
objects, which could cause code breakage.

Regards,
Martin

Martin v. Löwis

May 8, 2002, 12:58:47 PM
pin...@iro.umontreal.ca (François Pinard) writes:

> There is a lot of in-house development, not meant to be exported, that
> would be _so_ much more comfortable if we could use our own language
> while programming.

You can do that in comments. You cannot do that in the program, since
all keywords remain English-based.

> Many years ago, we experienced that university-wide, by modifying
> the Pascal compiler so we can use French identifiers whenever we
> feel like it (as well as a lot of other software and even hardware),
> and we kept modifying compilers while new Pascal versions were
> released. Moving on to other sites and languages, my co-workers and
> I did not try redoing such patches all the time, everywhere. Yet, I
> would deeply like that Python be on our side, and favourable at
> restoring our Lost Paradise.

Modifying the compiler so that it supports one language (with one
encoding) is one thing; modifying it so that it supports arbitrary
languages (with arbitrary encodings) is a different problem; existing
code may break if you make this kind of extension.

So "it would be nice" is not a strong-enough rationale for such a
change - "I really need it, and I accept breaking other people's code
to get it" would be, if enough people voiced that position.

Regards,
Martin


François Pinard

May 8, 2002, 3:02:19 PM
[Martin v. Löwis]

> pin...@iro.umontreal.ca (François Pinard) writes:

> > There is a lot of in-house development, not meant to be exported, that
> > would be _so_ much more comfortable if we could use our own language
> > while programming.

> You can do that in comments. You cannot do that in the program, since
> all keywords remain English-based.

The suggestion of repeating code into comments is just not practical.

> Modifying the compiler so that it supports one language (with one encoding)
> is one thing; modifying it so that it supports arbitrary languages (with
> arbitrary encodings) is a different problem; existing code may break if
> you make this kind of extension.

Existing code is not going to break, as long as English identifiers stay
a subset of nationally written identifiers. That is usually the case:
most character sets, Unicode among them, allow ASCII letters as a
subset of all letters.

> So "it would be nice" is not a strong-enough rationale for such a
> change - "I really need it, and I accept breaking other people's code
> to get it" would be, if enough people voiced that position.

A great deal of recent Python changes were made to make it nicer in various
ways. None were strictly unavoidable, the proof being that Python 1.5.2 has
been successfully used for many things, and could still be. We should not
merely vary the height of the "strong-enough rationale" bar depending on our
own tastes, as that merely lends a logical ring to relatively pure emotions.

The assertion that the capability of writing identifiers with national
letters is going to break other people's code looks a bit like gratuitous
FUD to me. Unless you are referring to probable transient implementation
bugs, which are a normal part of any release cycle? Python has undergone
changes much deeper and much more drastic than this one would be, and the
fear of transient bugs has not been a stopper.

If more people had experienced the pleasure of naming variables properly
in their national language while programming, I guess most of them would be
rather enthusiastic proponents of having this capability in Python today.
As very few people have experienced it, they can only imagine, without
really knowing, the comfort that results. Python is dynamic and interesting
enough, in my opinion, to open and lead a worthwhile trend in this area.

Martin v. Loewis

May 8, 2002, 4:05:18 PM
pin...@iro.umontreal.ca (François Pinard) writes:

> Existing code is not going to break, as long as English identifiers stay
> a subset of nationally written identifiers. Which is usually the case
> for most character sets, Unicode among them, allowing ASCII letters as a
> subset of all letters.

For Python, existing code, like inspect.py, *will* break: if
introspective code is suddenly confronted with non-ASCII identifiers,
it might break, e.g. if Unicode objects show up as keys in __dict__.
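Martin's worry can be made concrete. Python 3 did eventually allow non-ASCII identifiers (PEP 3131), and the situation he predicts is exactly what introspective code sees there; a small sketch with invented names:

```python
class Compte:            # hypothetical class ("account" in French)
    pass

obj = Compte()
setattr(obj, "été", 1)   # a non-ASCII attribute name ("summer")

# Introspective code walking __dict__ now encounters non-ASCII keys
# and must be prepared to display and process them.
print(list(obj.__dict__))
```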

> Having the capability of writing identifiers with national letters is
> not going to break other people's code, this assertion looks a bit like
> gratuitous FUD to me. Unless you are referring to probable transient
> implementation bugs which are normal part of any release cycle?

No. The implementation strategy would be to allow Unicode identifiers
at run-time, and all introspective code - whether within the Python
code base or third-party - would need revision.

> Python has undergone changes which were much deeper and much more
> drastic than this one would be, and the fear of transient bugs has
> not been a stopper.

PEP 263 will introduce the notion of source encodings - without this,
it wouldn't even be possible to parse the source code, anymore. The
PEP, over months, had a question in it asking whether non-ASCII
identifiers should be allowed (the follow-up question would then be:
which ones?), and nobody ever spoke up requesting such a feature.

It is a real surprise for me that suddenly people want this.

Regards,
Martin

John Machin

May 8, 2002, 5:37:20 PM
Erno Kuusela <erno...@erno.iki.fi> wrote in message news:<kuhelir...@lasipalatsi.fi>...

OK, here are some quick contributions to what promises to be a very
rational debate :-)

(1) You mean, like they are prevented from using FORTRAN, COBOL, C,
...?

(2) Perhaps you mean 'Merican-speaking ... I'd like to campaign for
programming languages to accept keywords, method names, etc. based on
the programmer's locale, for example 'centre' versus 'center'.

(3) And for folk who might prefer (say) verb-last order, we could base
the grammar on locale, so that instead of being forced unnaturally to
write

foo = 0

they could instead use something like this:

0 _(to) foo _(bind)

Alex Martelli

May 8, 2002, 6:47:16 PM
François Pinard wrote:
...

> If many people had experienced the pleasure of naming variables properly
> for their national language while programming, I guess most of them would
> be rather enthusiastic proponents on having this capability with Python,

This one person has had this dubious "pleasure" and loathes the idea
with a vengeance. The very *IDEA* of cutting off the huge majority of
programmers in the world, who don't understand Italian, from being
able to understand and work with my code, is utterly abhorrent to me.

Now THAT is one niche where I'm glad that Italians' tendency to
esterophily (fondness for all things foreign) has prevailed -- all
languages (Basic variants, etc.) that perpetrated such horrors have died
unmourned deaths. I may be (very mildly) sad that in Italy we don't say
"calcolatore" any more, but "computer", and so on; but if that was the
price to pay to kill the "programmate in italiano" languages, it was
well worth paying.

I now work for a Swedish firm and I'm *VERY* happy they fully
agree -- although it's Swedish, the official language of the firm is
English, all programs and docs &c. It's bad enough battling with
Swedish keyboards, documents (in Swedish) coming from outside
the firm, etc; at least with code and docs I'm OK!-)

I'm a citizen of the EU. How many different natural languages
should I learn to actually exercise my right to work anywhere in
the EU, if English (or ANY other SINGLE language) wasn't a de
facto standard? Let's see: Portuguese; at least four languages
for Spain (castillano, catalano, gallego, Euskadi -- what more?);
French; Flemish; Dutch; German; Italian; Irish; Welsh; Scots; Danish;
Swedish; Finnish. Hope I've covered them all (until Corsica,
Sardinia, etc, gain more linguistic status than they have now... or
until another 10 or so languages get added by EU's expansion...).

That's not how I want to spend my life. Long live a world where
ONE natural language (don't care which one: ONE, I can learn)
opens to me the doors of the (programming) world.


Alex

Chris Liechti

May 8, 2002, 7:45:33 PM
mar...@v.loewis.de (Martin v. Loewis) wrote in
news:m3vg9yj...@mira.informatik.hu-berlin.de:

> PEP 263 will introduce the notion of source encodings - without this,
> it wouldn't even be possible to parse the source code, anymore. The
> PEP, over months, had a question in it asking whether non-ASCII
> identifiers should be allowed (the follow-up question would then be:
> which ones?), and nobody ever spoke up requesting such a feature.

I wouldn't allow non-ASCII chars. Not because I don't like them - I write
German, so I need äöü - but think of someone in a foreign country who simply
does not have those keys on his keyboard. How is he supposed to enter a
variable with such characters? Or take Chinese symbols - I don't know what
they mean, let alone how to pronounce them. Should I enter variable names as
pictures, reaching for my digicam because I can't paint that well by hand?

Also note Alex's comment about natural language: how many languages must a
programmer learn to work on sources if English isn't sufficient?

Of course, that restriction on characters need not apply to strings and
comments. (Some comments aren't readable anyway, even if you know the
language the words are taken from ;-)

(The PEP restricts identifiers to ASCII only - good.)

And how many encodings will be allowed? Do I need a zillion code pages on my
machine to run modules I find on the net? OK, much of the Unicode machinery
can be reused, but what about smaller targets, startup time, etc.?

Regarding PEP 263:
- I don't think I like "coding"; it's not the obvious name for me. I'm more
used to "encoding", as used with HTML and MIME.

- Why use ASCII as the default encoding in the future, and not UTF-8 (or
Latin-1)? ASCII is a subset of UTF-8, and that would allow the rest of the
world to leave the default alone when using a Unicode-aware editor. I think
it will become very nasty if you must write the correct encoding in each
source file... or is it by intention that the smallest available encoding is
taken, to enforce more typing?

But basically I think the PEP is a good idea.
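For reference, the declaration PEP 263 specifies does use the word "coding" that Chris objects to: a comment in the first or second line of the file containing "coding: <name>". A minimal sketch of a source file using it:

```python
# -*- coding: utf-8 -*-
# The comment above is the PEP 263 declaration: it tells the parser
# how to decode the bytes of the rest of this source file.
salut = "café"  # without a declared (or default) encoding, the bytes
print(salut)    # of this literal would be ambiguous
```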

chris

--
Chris <clie...@gmx.net>

Neil Hodgson

May 8, 2002, 8:05:39 PM
Martin v. Loewis:

> PEP 263 will introduce the notion of source encodings - without this,
> it wouldn't even be possible to parse the source code, anymore. The
> PEP, over months, had a question in it asking whether non-ASCII
> identifiers should be allowed (the follow-up question would then be:
> which ones?), and nobody ever spoke up requesting such a feature.
>
> It is a real surprise for me that suddenly people want this.

I've said several times in the past that Python should support non-ASCII
identifiers, partly to encourage people whose native character set is not
Roman based and partly to facilitate interop with other environments such as
Java and .NET that allow non-ASCII characters in identifiers.

Neil

Mark Jackson

May 8, 2002, 12:53:49 PM
pin...@iro.umontreal.ca (François Pinard) writes:
> [Erno Kuusela]
>
> > In article <j4y9euw...@informatik.hu-berlin.de>,
> > loe...@informatik.hu-berlin.de (Martin v. Löwis) writes:
>
> > | So far, it appeared that there is wide agreement that identifiers in
> > | Python should be ASCII only. Do you disagree, i.e. do you *really*
> > | want to use non-ASCII identifiers?
>
> > what would be the advantage in preventing non-english-speaking people
> > from using python?

Restricting Python identifiers to the ASCII charset does not prevent
non-english-speaking people from using Python.

> You know, there is a background irritation at not being able to program
> in my own language, this irritation is permanent and never fades out --
> a bit like the fossile radiation after the big bang! :-) I surely like
> Python a lot, but I would like it even more if it was on the side of
> programmers of all nations, and not forcing everyone to wide portability:
> there are many cases where planetary portability is just not a concern.

Having lived in France and worked in French (doing physics, not
programming) some years ago, I believe I am not entirely unaware of the
difficulty you are having. Still, if you feel so strongly about this
that you are prepared to write code that would, in fact, be unusable
outside your own locale: whyever do you think the larger community
should undertake the task of enabling you to do this?

--
Mark Jackson - http://www.alumni.caltech.edu/~mjackson
The Enron scandal calls into question the integrity of the
entire capitalist system, which previously we assumed was based
on honest, straightforward greed. - Joel Achenbach


François Pinard

May 8, 2002, 7:16:45 PM
[Martin v. Loewis]

> PEP 263 will introduce the notion of source encodings - without this,
> it wouldn't even be possible to parse the source code, anymore. The
> PEP, over months, had a question in it asking whether non-ASCII
> identifiers should be allowed (the follow-up question would then be:
> which ones?), and nobody ever spoke up requesting such a feature.

I did speak about nationalised identifiers a few times already, in private
discussions, and once or twice with Guido, but not in the context of
PEP 263. Except for a very few of them, I do not follow PEPs very closely
once I have an overall idea of their subject, and I do not feel personally
guilty about what happens, or does not happen, in Python :-). There are
many people for that already! :-)

> It is a real surprise for me that suddenly people want this.

I have read you a few times in the past (I do read you!) expressing that
you are not favourable to supporting national letters in Python identifiers.
So, in a way, your surprise does not surprise me! :-) On the other hand,
I'm surely glad that we are breaking the ice on this topic!

> pin...@iro.umontreal.ca (François Pinard) writes:

> > Existing code is not going to break, as long as English identifiers stay
> > a subset of nationally written identifiers. Which is usually the case
> > for most character sets, Unicode among them, allowing ASCII letters as a
> > subset of all letters.

> For Python, existing code, like inspect.py, *will* break: if introspective
> code is suddenly confronted with non-ASCII identifiers, it might break,
> e.g. if Unicode objects show up as keys in __dict__.

Should I read this as meaning that one may not use Unicode strings as
dictionary keys? One would expect Python to support narrow and wide strings
equally well. In that precise case, `inspect.py' would need to be repaired,
indeed. A lot of things were "repaired" when Unicode was introduced into
Python; I see this as perfectly normal. It is part of the move.

> > Having the capability of writing identifiers with national letters is
> > not going to break other people's code, this assertion looks a bit like
> > gratuitous FUD to me. Unless you are referring to probable transient
> > implementation bugs which are normal part of any release cycle?

> No. The implementation strategy would be to allow Unicode identifiers
> at run-time, and all introspective code - either within the Python
> code base, or third-party, would need revision.

Most probably. If national identifiers were introduced through Latin-1 or
UTF-8, the problem would appear smaller. But I agree with you that, for the
sake of Python being useful in more countries, it is better to go the
Unicode way and afford both narrow and wide characters in identifiers.
This approach would also increase Python's self-consistency on charsets.

François Pinard

May 8, 2002, 7:44:53 PM
[Alex Martelli]

> François Pinard wrote:

> > If many people had experienced the pleasure of naming variables properly
> > for their national language while programming, I guess most of them would
> > be rather enthusiastic proponents on having this capability with Python,

> This one person has had this dubious "pleasure" and loathes the idea
> with a vengeance.

:-) :-) :-)

> The very *IDEA* of cutting off the huge majority of programmers in the
> world, who don't understand Italian, from being able to understand and
> work with my code, is utterly abhorrent to me.

Granted. You are a public figure, and quite visible in public fields.

But people are not all alike! I have worked in many areas and teams in my
life, and wrote a lot of code in English when it was meant to be widely
available. At other times and in other circumstances, this just does not
apply. Besides, I know and work with people doing humble jobs in closed
shops; some have done so for many years, and they are good, nice and
intelligent people. Many of them like their own language, and find English
a constant suffering -- much, much more so than in my own case. Oh, being
prevented from writing French identifiers correctly has always irritated me,
for dozens of years now: a mild, yet permanent irritation. I keep smiling
and sleep at night, so don't worry :-). Yet I'm very sympathetic to those
I know who are far less comfortable in English than I am.

> Long live a world where ONE natural language (don't care which one: ONE,
> I can learn) opens to me the doors of the (programming) world.

If you feel happy in the Borg collective, I'm glad you are happy. :-)
Seriously, however, many of us do not aspire to assimilation, and would
like to think that resistance is not wholly futile. When one lives a full
computer life in French, say, with no appetite for international visibility,
limitations coming from the English language are entirely artificial.

There are plenty of programs around here, I see many of them every week,
written with all-French identifiers, yet full of orthographical mistakes
because of the limitations of ASCII. These people are lost anyway to your
cause of ONE universal natural language. Let's accept that the difference
exists. I ask that, for those people (me included) who do not espouse
the idea of English being universal, we consider that Python could be more
friendly, and not merely use the example of other programming languages as
an excuse for dismissing the need. Python has its own distinctive marks and
ideas; nicely supporting national local teams could be one more.

Stephen J. Turnbull

May 9, 2002, 1:46:06 AM
>>>>> "Martin" == Martin v Löwis <loe...@informatik.hu-berlin.de> writes:

Martin> So in the end, the only acceptable strategy would be to
Martin> allow identifiers that contain letters (or letterlike
Martin> symbols) in arbitrary languages. For Python, that would
Martin> mean that attributes must be Unicode objects, which could
Martin> cause code breakage.

This would actually be rather simple if you just declare that Python
programs as submitted to the (internal) parser must be in UTF-8, and
ensure that PEP 263 codecs do this in a way transparent to the user.
These codecs would have to differ from ordinary codecs in one way:
they must know about ordinary Python strings (ie, not Unicode), and
must _not_ translate those bytes to Unicode values, but rather pass
them to the parser with their integral values unchanged but UTF-8-
encoded.

The parser would need to know what to do with ordinary strings (ie,
decode UTF-8 encoded octets back to "raw" form) and Unicode strings
(ie, transform from UTF-8 to UTF-16).

Then you just lift the restriction that identifier names must consist
of bytes < 128. A second stage would be to restrict identifier names
from containing non-ASCII punctuation, whitespace, and other special
characters, but this doesn't bother Python, only human readers (unless
you want to extend Python syntax to include some non-ASCII symbols as
reserved words).

I see no reason why this would cause code breakage, although I haven't
tried it yet. It would break debugging for people who abuse ordinary
strings to contain externally-encoded text, as they would be unable to
view their print'ed strings (externally encoded) in the same console
as error messages referring to identifiers (UTF-8). But I think
that's a GoodThang<0.1 wink>.
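The transparency property Stephen's scheme relies on is that UTF-8 leaves ASCII bytes untouched and round-trips everything else losslessly; a quick sketch (the identifier name is illustrative):

```python
ident = "prénom"  # an illustrative French identifier name
encoded = ident.encode("utf-8")

# The ASCII letters pass through as single, unchanged bytes; only the
# "é" expands to the two-byte sequence 0xC3 0xA9, so ASCII-only code
# never notices the encoding step.
print(encoded)
assert encoded.decode("utf-8") == ident  # lossless round-trip
```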

--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
My nostalgia for Icon makes me forget about any of the bad things. I don't
have much nostalgia for Perl, so its faults I remember. Scott Gilbert c.l.py

Martin v. Loewis

May 9, 2002, 4:00:07 AM
pin...@iro.umontreal.ca (François Pinard) writes:

> > For Python, existing code, like inspect.py, *will* break: if introspective
> > code is suddenly confronted with non-ASCII identifiers, it might break,
> > e.g. if Unicode objects show up as keys in __dict__.
>

> Should I read that one may not use Unicode strings as dictionary keys?

No, that is certainly possible. Moreover, a byte string and a Unicode
string have the same hash value and compare equal if the byte string
is the ASCII representation of the Unicode string, so you can use them
interchangeably inside a dictionary.

It's just that introspective code won't *expect* to find Unicode
objects as keys of an attribute dictionary, and will likely fail to
process it in a meaningful way.
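The interchangeability Martin describes follows from a general dictionary invariant: keys that compare equal and hash equal occupy a single slot. Python 2's u'abc' and 'abc' were one instance of this; a sketch using int/bool, another pair with equal value and hash, shows the same mechanics:

```python
d = {}
d[1] = "stored via int"
d[True] = "stored via bool"  # True == 1 and hash(True) == hash(1)

# Equal keys with equal hashes share one slot: the dict holds a
# single entry, reachable through either key.
print(d)
```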

> One would expect Python to support narrow and wide strings equally well.
> In that precise case, `inspect.py' would need to be repaired, indeed.
> A lot of things have been "repaired" when Unicode was introduced into
> Python, I see this as perfectly normal. It is part of the move.

If only inspect.py were affected, that would be fine. However, this
also affects tools from other people, like PythonWin, which "we" (as
Python contributors) could not fix as easily.

> Most probably. If national identifiers get introduced through Latin-1 or
> UTF-8, the problem appears smaller. But I agree with you that for the
> sake of Python being useful to more countries, it is better going the
> Unicode way and afford both narrow and wide characters for identifiers.
> This approach would also increase Python self-consistency on charsets.

Indeed, the OP probably would not be happier if Python allowed
Latin-1. Using UTF-8 might reduce the problems, but would be
inconsistent with the rest of the character set support in Python,
where Unicode objects are the data type for
text-with-specified-representation.

Regards,
Martin

Martin v. Loewis

May 9, 2002, 4:22:11 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Martin> So in the end, the only acceptable strategy would be to
> Martin> allow identifiers that contain letters (or letterlike
> Martin> symbols) in arbitrary languages. For Python, that would
> Martin> mean that attributes must be Unicode objects, which could
> Martin> cause code breakage.
>
> This would actually be rather simple if you just declare that Python
> programs as submitted to the (internal) parser must be in UTF-8, and
> ensure that PEP 263 codecs do this in a way transparent to the user.

Indeed, parsing it would not be an issue (although it *would* be an
issue to define the set of acceptable identifier characters).

> I see no reason why this would cause code breakage, although I
> haven't tried it yet.

It would break introspective tools that suddenly find Unicode objects
in attribute dictionaries.

Regards,
Martin

Karl Eichwalder

May 9, 2002, 3:27:35 AM
pin...@iro.umontreal.ca (François Pinard) writes:

> Oh, being prevented from writing French identifiers correctly always
> irritated me, for dozens of years now: this is a mild, yet permanent
> irritation.

You can try to make better use of Emacs: write proper 8-bit letters and,
at save time, let Emacs convert them to an ASCII representation; when
loading the file again, Emacs should be able to decode it (several examples
are available: TeX/LaTeX, PSGML). Not sure whether it's worth trying ;-)

--
k...@suse.de (work) / kei...@gmx.net (home): |
http://www.suse.de/~ke/ | ,__o
Free Translation Project: | _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

Stephen J. Turnbull

May 9, 2002, 6:05:37 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> It would break introspective tools who suddenly find
Martin> Unicode objects in attribute dictionaries.

What Unicode objects? They find ordinary strings that are mandated to
be encoded in UTF-8. The tools only need to be 8-bit clean, and not
do anything that involves the assumption that #characters == #octets.
And _that_ only affects people using non-ASCII identifiers, which
might be OK since it is an extension.

We do the migration to Unicode objects later, at the same time that
you would have done it anyway. In the meantime, this fits right in
with the kind of "backwards compatibility" that PEP 263 is all about.

Except that people who want to look at non-ASCII identifiers must
work in UTF-8 environments and not their local encoding. But that is
OK because non-ASCII identifiers are an extension.

Alex Martelli

May 9, 2002, 5:39:34 AM
On Thursday 09 May 2002 01:44 am, François Pinard wrote:
...

> But people are not all alike! I have worked in many areas and teams in my
> life, and wrote a lot of code in English when it was meant to be widely
> available. At other times and in other circumstances, this just does not
> apply. Besides, I know and work with people doing humble jobs in closed
> shops; some have done so for many years, and they are good, nice and
> intelligent people.

In 1989 I left IBM Research to join a small, obscure programming shop in
a tiny town close to my hometown, Bologna. I wanted to come back home
and maybe I liked the idea of being the relatively big fish in a small pond,
technically speaking, rather than working in the sort of places where good
parking spaces are marked "reserved for Nobel Prize Winners only":-).

One of the things I started nagging people about in my new position as
the senior software consultant was to *use English* in code and docs.
Some of those of my colleagues who saw themselves as "doing humble
jobs in a close shop" and had been doing this for a lot of years didn't like
that, and we had interminable debates. I just kept nagging.

Then one day we had a request for a job interview from a brilliant Chilean
guy who was still an exile from his country. His command of Italian was
close to mine of Spanish -- i.e., close to none, except what comes from
the two languages' similarities. Yet he WAS brilliant. He got in, to the
advantage both of himself and the firm, even though he was only able to
work on the parts of the software where enough English had been used
to let him follow his way around. I didn't fail to take advantage of this to
keep shaming those colleagues who wanted to stick to Italian -- "see,
what you're doing is, *deliberately keeping out of your software wonderful
people such as him, taking advantage of the fact he can't speak Italian*".

Where will the next brilliant political refugee come from? Do you want to
HELP his oppressive government keep him poor and on the run? (I was
of course playing also on the instinctive left-leaning sympathies that are
so prevalent in the professional classes in Bologna and environs:-).

Unfair, sure, but it helped, even though soon enough Chile moved back
towards freedom and the guy eventually decided to go back home (I was
glad for him, even though it was a loss for the firm). Gradually more of
our software moved to being almost-usable/maintainable/etc for people
knowing little or no Italian.

And you know what -- it was GOOD software. Good enough that a
few years after that we decided we HAD to go international, or perish.

Suddenly Alex's silly quirk of wanting stuff in English "to help refugees"
became one key issue in the firm's success. When I recently left (to
go full-time-Python) the firm's headquarters were in Santa Clara, CA;
a little but crucial development lab in Aix-en-Provence; the largest lab
after the main near-Bologna location, in Bangalore, India. None of that
would be possible if Italian had been a requirement to be able to work
on our software.


> > Long live a world where ONE natural language (don't care which one: ONE,
> > I can learn) opens to me the doors of the (programming) world.
>

> If you feel happy in the Borg collective, I'm glad you are happy. :-)

I'm proud of my cultural heritage and gladly use my language *when
appropriate*. But I don't try to impose it where it's NOT appropriate:
that would be as silly as people wanting to translate, say, "piano" or
"pizza" into their own languages because they can't stand Italian in
fields of endeavour where Italian is or was prominent enough, just as
English is today in computing.

> Seriously, however, many of us do not aspire to assimilation, and would

Neither do I. I chose to leave the US and come back to Italy, not
because of language issues, but of other cultural aspects that made
me happier to live here (even though I love many parts of the US and
many US characteristics, still, in the end, "there's no place like home").

But I still say "allegretto ma non troppo" in the field of music,
"ils sont fous ces Romains" when I read Asterix comics (in the original --
why use translations when I'm lucky enough to be able to enjoy the
original French), and "flip-flop" (rather than "multioscillatore bistabile")
in electronic circuits.

I understand many Francophones feel very differently about this. No
doubt historical accidents play a part. In Italy, the only serious attempt
to enforce (even by law) use of "pure Italian language" came from a
dictatorship, so such nationalistic urges feel highly dictatorial to Italians.

Even closer to home, the most prominent electric engineer born in my
town, Bologna, had an English mother -- so, his invention, "radio", has
forever an Italian name, but mostly-English technical terminology.

Or further back in history, Bologna's main claim to fame is the concept
of "Universitas" -- culture open to ALL (as long as they were able to
pay and had the prereq, specifically the "universal language of
learned men", which was then Latin). Had Wernerius, the Founder,
insisted on teaching in his native tongue, Lombard (a variation of
German), he'd have had few customers indeed. As it was, he set up
shop here, mandated Latin as THE *ONE* Language, and on that
basis was the Alma Mater launched (and later imitated endlessly).

You can still see some old-timers in Bologna saying something in the
Bolognese dialect and then at once the Italian translation of the same
thing -- that used to be a _Latin_ translation until less than 200 years
ago, and the motivation was ensuring all listeners understood, whether
they were uneducated locals (in which case they'd get the Bolognese
part) or students / scholars from anywhere in the world (in which case
they'd get the Latin part).

Sure, the students coming from Germany tended to party together
(we still have a quarter called "Alemanni" -- that's where they tended
to mostly live), so did those from Spain (we still have "Collegio di
Spagna"), and so on. But when it came to *WORK*, using just one
language was the crucial choice. Did Europe become "Assimilated"
because of that? Let's not be silly: a thousand years of history after
that show us how differentiated, both for good and for evil, our many
cultures have continued to grow. But for centuries, until the full
flower of Nationalist folly, the "One Language" served us well. My
father, a physician, barely 30 years ago was still able to get SOME
use out of some medical docs he had received from Yugoslavia about
one of his patients (in Croat, I believe) because the key aspects of
the diagnosis were in Latin.

It matters not a whit *WHICH* language it is, it does matter that it
be ONE language, not hundreds and thousands. In practice that
one language isn't Latin any more (in most fields of endeavour) but
English. Fine, whatever. As long as it's ONE.


> like to think that resistance is not wholly futile. When one lives a full

Indeed it's not: if there are enough of you, and you fight hard enough,
you may well be able to fragment the world back into incompatible
little warring pieces again. "Globalization" around 1905/1910 was
roughly the same as today, in terms of many measures such as
fraction of the economic flows being international, immigration and
emigration in the world, unity of ("high") culture. Yet resistance to
that proved anywhere BUT futile: the growing tide of nationalism
managed to lead right into the carnage of World War I, and almost
inevitably after that, further flag-waving, protectionism, nationalism
ever more extreme, dictatorships, further massacres, until the pinnacle
of World War II.

I'll do whatever is in my power (which is not much at all) against
such prospects, and in favour of the opposite prospect, that of one
world of which we're all citizens -- as culturally differentiated as, e.g.,
today are the various regions of, say, France, or Italy, each with its
own cherished dialects, traditions, cultures, and so on. But all able
to talk to each other, to work together, to move from one place to
the other without legal impediments -- a world where it's as absurd
to think of two nations going at war against each other as today it
would be to think that of, say, Cote d'Azur and Provence, or Emilia and
Toscana, or Massachusetts and New Hampshire. Being able to
communicate helps, and sharing a language helps, even though
Provencals are justly proud of THEIR own language (quite as different
from French as, say, from Italian), Bolognese of their own (you may
choose to call it a "dialect", but that will get you quite a few hostile
stares...), and so on.

> computer life in French, say, with no appetite for international
> visibility, limitations coming from the English languages are fully
> artificial.

Just like having to say "adagio maestoso" is artificial...?


Whatever I can do (which is not much at all) against anything
furthering the _fragmentation_ (as opposed to, cultural diversity
within helpful and peaceful cooperation, which is *great*!) of
humanity, I will. I am convinced that encouraging the use of
a zillion different natural languages in programs is a terrible idea
and I earnestly hope Python does nothing at all to _help_ it.


Alex


Martin v. Loewis

May 9, 2002, 6:32:58 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> What Unicode objects? They find ordinary strings that are mandated to
> be encoded in UTF-8.

That could be done, but I would discourage it. The Unicode type in
Python is the type to represent Unicode, and there is only one way to
do it.

> We do the migration to Unicode objects later, at the same time that
> you would have done it anyway. In the meantime, this fits right in
> with the kind of "backwards compatibility" that PEP 263 is all about.

You can't use UTF-8 to represent non-ASCII identifiers, and Unicode
objects later. Old byte code would not interoperate with new byte
code.

Regards,
Martin

Stephen J. Turnbull

May 9, 2002, 7:03:04 AM
>>>>> "Alex" == Alex Martelli <al...@aleax.it> writes:

Alex> But I still say "allegretto ma non troppo" in the field of
Alex> music, "ils sont fous ces Romains" when I read Asterix
Alex> comics (in the original -- why use translations when I'm
Alex> lucky enough to be able to enjoy the original French), and
Alex> "flip-flop" (rather than "multioscillatore bistabile") in
Alex> electronic circuits.

And so do my Japanese students, at least for "allegretto" and
"flip-flop"---because those are _Japanese words_.

My students are mostly what we call "bunka-kei" (what Americans would
call the "arts" side of "arts and sciences"); they hate math and the
idea that you tell the computer what to do rather than the other way
around is quite a shock.

But one thing they can all do is write Japanese in "Roma characters"
---and they do. In their programs. Universally (required course),
except for a very few who intend to go into either foreign companies
or foreign graduate schools. And the bunka-kei all hate programming.

Not to mention that when they _do_ use English words for identifiers,
they're often semantically incorrect (because Japanese has altered the
meaning since borrowing them)! And highly prone to typos, because
students often forget whether they wrote the English spelling or some
Japanese approximation when they introduced the identifier.

Is it really so wrong to give my students a language where they're
able to use words that encourage comfortable expression? That don't
forcibly remind them that this is an alien activity from outer space?
Given that most of them are not going to write programs again for many
years? And the ones that do are going to proceed to write programs
using Japanese identifiers _anyway_ (unless they're lucky enough to
have you or somebody like you around to enforce sane practices)?

Alex> It matters not a whit *WHICH* language it is, it does matter
Alex> that it be ONE language, not hundreds and thousands. In
Alex> practice that one language isn't Latin any more (in most
Alex> fields of endeavour) but English. Fine, whatever. As long
Alex> as it's ONE.

Agreed. Except that a decade from now Chinese might be the ONE. Then
we'll be glad we have hanzi identifiers, as Python sweeps the CJK
world.<0.9 wink>

Alex> Whatever I can do (which is not much at all) against
Alex> anything furthering the _fragmentation_ (as opposed to,
Alex> cultural diversity within helpful and peaceful cooperation,
Alex> which is *great*!) of humanity, I will. I am convinced that
Alex> encouraging the use of a zillion different natural languages
Alex> in programs is a terrible idea

Agreed, but ...

Alex> and I earnestly hope Python does nothing at all to _help_ it.

... is it really worth sacrificing the ability to introduce more
non-programmers to programming to avoid "helping fragmentation" by
25% over what those who want localized identifiers already can do?

David LeBlanc

May 9, 2002, 6:30:14 AM
Bravo! - oops, that's Italian.
Bon! - errr... French.
Hurrah! - hmmm... British.
Yippie! - aaah... 'merican :-)

Most humbly agree.

Python should be unicode imo and using ASCII or code page variants thereof
should be the special case. It's time that character code sets with language
biases went away. If PythonWin isn't fully unicode, then it should be (and
it must already be at least partly unicode since that's what COM and "real"
MS Windows OS variants are natively already). Ditto for other apps.

Yes, by all means, use english to describe programs; the built-in syntax,
variable names and program documentation, but enable and encourage the use
of national languages to communicate between the program and the user. While
knowing a second language (english) is a reasonable prerequisite for a
professional developer, "computer literate" shouldn't also mean "also knows
english". (OTOH, I think all english speakers should learn and become
_fluent_ in a second language, if only for the mental flexibility it
engenders. Most 2nd language education in American schools below
college/university is a joke imo. Mon français est terrible, en partie
parce que je ne l'utilise pas! ["My French is terrible, in part because
I don't use it!"])

Sorry if this sounds contradictory - it's not meant to be at all.

Regards,

David LeBlanc
Seattle, WA USA

> -----Original Message-----
> From: python-l...@python.org
> [mailto:python-l...@python.org]On Behalf Of Alex Martelli
> Sent: Thursday, May 09, 2002 2:40

<snip>
> Whatever I can do (which is not much at all) against anything
> furthering the _fragmentation_ (as opposed to, cultural diversity
> within helpful and peaceful cooperation, which is *great*!) of
> humanity, I will. I am convinced that encouraging the use of
> a zillion different natural languages in programs is a terrible idea


> and I earnestly hope Python does nothing at all to _help_ it.
>
>

> Alex
>

Stephen J. Turnbull

May 9, 2002, 7:32:30 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> "Stephen J. Turnbull" <ste...@xemacs.org> writes:

>> What Unicode objects? They find ordinary strings that are
>> mandated to be encoded in UTF-8.

Martin> That could be done,

Glad to have your confirmation. Now all I need to do is find the
time....

>> We do the migration to Unicode objects later, at the same time
>> that you would have done it anyway. In the meantime, this fits
>> right in with the kind of "backwards compatibility" that PEP
>> 263 is all about.

Martin> You can't use UTF-8 to represent non-ASCII identifiers,
Martin> and Unicode objects later. Old byte code would not
Martin> interoperate with new byte code.

_This_ is a serious objection. But if we're ever going to have
non-ASCII identifiers with some sanity, that transition will have to
be made. So I guess it never will happen in PSF Python?

Maybe that's for the best. Francois and my students can write in
their preferred languages, and "official" Python will support Alex's
"one world, one substrate for programming languages" campaign.

Alex Martelli

May 9, 2002, 8:16:48 AM
Stephen J. Turnbull wrote:
...

> Alex> It matters not a whit *WHICH* language it is, it does matter
> Alex> that it be ONE language, not hundreds and thousands. In
> Alex> practice that one language isn't Latin any more (in most
> Alex> fields of endeavour) but English. Fine, whatever. As long
> Alex> as it's ONE.
>
> Agreed. Except that a decade from now Chinese might be the ONE. Then
> we'll be glad we have hanzi identifiers, as Python sweeps the CJK
> world.<0.9 wink>

Fine, WHEN that happens. And IF, of course. Meanwhile, Ruby will
probably get there first, born in Japan and all, right? Hey, it IS
so close to Python it almost hurts. And AFAIK, it doesn't support
what you so intensely want, which hasn't stopped it from huge
success in Japan, though it may in the future.


> Alex> and I earnestly hope Python does nothing at all to _help_ it.
>
> ... is it really worth sacrificing the ability to introduce more
> non-programmers to programming to avoid "helping fragmentation" by
> 25% over what those who want localized identifiers already can do?

Yes. And it's NOT worth (IMHO, of course) helping the Japanese keep
ever more insular and separated from the rest of the world, a serious
aspect of their current predicaments (not just _my_ opinion, which
would be worth little given I'm no expert in this -- have a look at
the Economist's survey of Japan, it came out a month or so ago and
it's surely still on their site, www.economist.com).

Like "code to be run just once and then thrown away", similarly "code
that will never see the outside of this room" WILL over and over again
survive and spread to the four corners of the Earth, surprising all
involved, starting with the code's creator. Sure, it's bad enough if
said code has an identifier "principi" -- you don't know what it means
and must infer from context. (The comments and docstrings if any are
likely just as obscure). But it's STILL worse to let the code have
TWO identifiers "príncipi" and "princípi", where you have to notice
the subtle issue of which "i" has which kind of accent to be able to
tell the two identifiers apart. English minimizes this issue by
having just 26 glyphs (52, sigh, when you distinguish upper and lower
case, one issue where I wish Python was different), and no accents nor
other hard-to-tell-apart diacritics -- confusion is still possible but
much less likely than in a language WITH diacritics or thousands of
different glyphs. Non-programmers must learn ONE fundamental thing
about computer languages: that they're utterly different from natural
language, their nature, purpose and operation completely separate.

Whenever you try to blur the distinction, you do them all a signal
disservice. They may THINK they want to "speak their own language to
the computer", but they _don't_, really: if they think so, it's because
they still haven't grasped the key differences. Help them learn, rather
than "helping" them hide their ignorance from themselves.


Alex

Alex Martelli

May 9, 2002, 8:27:16 AM
David LeBlanc wrote:
...

> Yes, by all means, use english to describe programs; the built-in syntax,
> variable names and program documentation, but enable and encourage the use
> of national languages to communicate between the program and the user.
...

> Sorry if this sounds contradictory - it's not meant to be at all.

Doesn't sound contradictory to me. I want programs easy to localize for
operation in any of several locales (language is not the only issue...),
though that is nontrivial enough to be a typical characteristic of
"professional" programs. But the technical side of things should be
accessible to technically-trained personnel from anywhere in the world,
and it's reasonable (actually more or less inevitable) to include some
nodding acquaintance with English as part of the prereq's for being
"technically trained personnel" in programming. Adding the ability to
tell tens of thousands of glyphs apart from each other is NOT at all
reasonable -- and yet this will be indispensable to make heads or tails
out of programs in the "brave new world" dreamed of by people who want
non-ASCII letters in identifiers. I can't stop them; I can just hope
they'll get retribution one day, by needing to, and being unable to,
alter or maintain a program entirely using whatever huge set of ideograms
they find most impossibly hard to use. By then it will be too late to
do anything about it, of course, but maybe I'll get the bitter satisfaction
of being able to say "I told you so"...


Alex

Martin v. Loewis

May 9, 2002, 9:58:11 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Martin> You can't use UTF-8 to represent non-ASCII identifiers,
> Martin> and Unicode objects later. Old byte code would not
> Martin> interoperate with new byte code.
>
> _This_ is a serious objection. But if we're ever going to have
> non-ASCII identifiers with some sanity, that transition will have to
> be made. So I guess it never will happen in PSF Python?

No. It just means that if the feature is implemented, it should be
done using Unicode objects right from the start. Unicode objects
interact with byte string favourably if the byte strings are ASCII
only: they have the same hash values, and compare equal, hence you can
mix ASCII strings and Unicode strings freely as dictionary keys.
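Martin's claim about mixing the two key types can be sketched as follows (faithful under Python 2, where str and unicode were distinct types; in Python 3 the u prefix is a no-op, so the snippet still runs but no longer crosses types):

```python
# ASCII-only byte strings and Unicode strings hash equal and compare
# equal, so either form retrieves the same dictionary entry.
d = {}
d['spam'] = 1                          # stored under a byte-string key
assert hash('spam') == hash(u'spam')   # identical hash values
print(d[u'spam'])                      # Unicode key finds the same slot: 1
```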

> Maybe that's for the best. Francois and my students can write in
> their preferred languages, and "official" Python will support Alex's
> "one world, one substrate for programming languages" campaign.

I'd encourage you to develop a separate patch for it (perhaps after
the PEP 263 patch gets integrated), and distribute it to users - based
on user feedback, we get a clearer view whether people would use this
feature, and in what way.

I agree with Alex on the policy that every group of people developing
software should use (i.e. all code in English); that policy will
certainly apply to the source code of Python itself. I disagree that
the language should prevent violations of the policy - I just think
there will be additional problems if the feature is implemented.

Regards,
Martin

Stephen J. Turnbull

May 9, 2002, 10:08:01 AM
>>>>> "Alex" == Alex Martelli <al...@aleax.it> writes:

Alex> Stephen J. Turnbull wrote:

>> Agreed. Except that a decade from now Chinese might be the
>> ONE. Then we'll be glad we have hanzi identifiers, as Python
>> sweeps the CJK world.<0.9 wink>

Alex> Fine, WHEN that happens. And IF, of course. Meanwhile,
Alex> Ruby will probably get there first, born in Japan and all,

"Made in Japan" is not exactly a password to rapid acceptance in this
neighborhood. It's getting more so, but not there yet.

Alex> right? Hey, it IS so close to Python it almost hurts. And
Alex> AFAIK, it doesn't support what you so intensely want, which

It's not that intense. I just see the tradeoff as being much more
balanced than you do. CP4E is important, too.

Alex> hasn't stopped it from huge Japan success, though it may in
Alex> the future.

The insularity you mention is working for it here. But its success is
deserved.

>> ... is it really worth sacrificing the ability to introduce
>> more non-programmers to programming to avoid "helping
>> fragmentation" by 25% over what those who want localized
>> identifiers already can do?

Alex> Yes. And it's NOT worth (IMHO, of course) helping the
Alex> Japanese keep ever more insular and separated from the rest

The ones who are causing and aggravating the problems will _never_
write any code; in fact, more widespread computer literacy---even if
everything the people think they know is wrong!---would substantially
decrease the gerontocrats' power, IMO.

Alex> Sure, it's bad enough if said code has an identifier
Alex> "principi" -- you don't know what it means and must infer
Alex> from context. (The comments and docstrings if any are
Alex> likely just as obscure).

And what if the fact that the ONE language is Italian (I presume) forced me
to choose "principi" as an identifier, thus misdocumenting the variable
to those who _do_ understand Italian?

Alex> confusion is still possible but much less likely than in a
Alex> language WITH diacritics

Very true.

Alex> or thousands of different glyphs

This will make it unreadable to non-CJK-capable programmers, true.
But the likelihood of typos and confusion is at least as low as in
English, perhaps lower. It is also likely to permit similar levels of
expressiveness in less space (even with the 2:1 width ratio common for
ideographs vs. alphabetic characters).

Alex> They may THINK they want to "speak their own language to the
Alex> computer", but they _don't_, really: if they think so, it's
Alex> because they still haven't grasped the key differences.

Agreed. Unfortunately, my faculty won't permit the use of a LART,
which is the only technique I know of to get a reasonable share of
their attention to direct at key differences. Smacking them with
English just puts them to sleep.

Alex> Help them learn, rather than "helping" them hide their
Alex> ignorance from themselves.

MHO (based on the limited, very introductory courses I've taught in
programming) is that helping them to learn what programming really is
means removing as many of the incidental difficulties involved in getting
their first real (i.e., a task they choose) program working as possible.

Disciplines of good identifier choice, etc, come later. These have to
be enforced by The Management, anyway. Simply saying "no kanji" isn't
enough, as you well know (and the no-kanji rule is easily enforced
mechanically, which is something you can't say for "choose meaningful
identifiers").

Note that as an economics professor, I do have some experience with
the issue of weaning students from their milk language. There is
nearly zero published work of professional interest---except for
national economic policy---in any language but English. Not even
French or Russian. That doesn't stop there from being about 50
Japanese-language journals---but the students all know what's good for
them, and they don't _target_ the vernacular journals.

I can see all the reasons why that might not carry over to
programming. But on the other hand, it shows that there _is_
possibility that you can accomplish your goal with not very much extra
effort. You just need to convince the leaders. Even in a world where
one can choose identifiers written with ideographs.

In any case, Martin's point about bytecode compatibility once we
introduce Unicode identifiers is probably enough to make a real
multilingual Python impractical for the foreseeable future (maybe
Python 4?). I plan to experiment with a UTF-8 Python anyway. Keeping
your comments in mind, one thing I'll work on early is tools for
translating identifiers (presumably to English) and on metrics for
identifiers that are "too close" to one another. Even if Python never
needs them, some language will.
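As a rough illustration of the "too close" metric idea (everything here is a hypothetical sketch, not an existing tool): normalize each identifier, then flag pairs whose edit similarity is high.

```python
import difflib
import unicodedata

def too_close(a, b, threshold=0.85):
    """Flag identifier pairs that differ only subtly, e.g. by a diacritic."""
    # NFKD decomposes accented letters into base character + combining mark,
    # so "príncipi" and "princípi" are compared on a shared skeleton.
    na = unicodedata.normalize("NFKD", a)
    nb = unicodedata.normalize("NFKD", b)
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold

print(too_close("príncipi", "princípi"))  # True: only the accent moves
print(too_close("total", "counter"))      # False: clearly distinct names
```

The threshold is arbitrary; a real checker would need to tune it per script and alphabet.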

--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN

Erno Kuusela

May 9, 2002, 1:18:27 PM
In article <j4helis...@informatik.hu-berlin.de>,

loe...@informatik.hu-berlin.de (Martin v. Löwis) writes:

| If they don't know the latin alphabet, they can't use Python even if
| identifiers can be non-ASCII, since the keywords would still be
| written with Latin letters.

there are so few keywords that their meaning is easily learned.

granted, the error messages and such are still in english, but
they could be made localizable.

-- erno

Erno Kuusela

May 9, 2002, 1:19:53 PM
In article <c76ff6fc.02050...@posting.google.com>,
sjma...@lexicon.net (John Machin) writes:

| Erno Kuusela <erno...@erno.iki.fi> wrote in message
| news:<kuhelir...@lasipalatsi.fi>...
||

|| what would be the advantage in preventing non-english-speaking people
|| from using python?

| (1) You mean, like they are prevented from using FORTRAN, COBOL, C,
| ...?

yes (but not java).

-- erno

Erno Kuusela

May 9, 2002, 1:23:32 PM
In article <mailman.102088459...@python.org>,
pin...@iro.umontreal.ca (François Pinard) writes:

| If many people had experienced the pleasure of naming variables properly
| for their national language while programming, I guess most of them would be
| rather enthusiastic proponents on having this capability with Python, today.
| As very few people experienced it, they can only imagine, without really
| knowing, all the comfort that results.

maybe people are just using it and they don't know that it's not
supposed to work.

Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> pää = 4
>>> print pää
4


-- erno

Syver Enstad

May 9, 2002, 1:36:40 PM
Erno Kuusela <erno...@erno.iki.fi> writes:
> maybe people are just using it and they don't know that it's not
> supposed to work.
>
> Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
> [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
> Type "copyright", "credits" or "license" for more information.
> >>> pää = 4
> >>> print pää
> 4
>
>
> -- erno

What???

Python 2.2.1 (#34, Apr 9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> pää = 1
File "<stdin>", line 1
pää = 1
^
SyntaxError: invalid syntax

--

Vennlig hilsen

Syver Enstad

Martin v. Loewis

May 9, 2002, 1:39:26 PM
Erno Kuusela <erno...@erno.iki.fi> writes:

> | If they don't know the latin alphabet, they can't use Python even if
> | identifiers can be non-ASCII, since the keywords would still be
> | written with Latin letters.
>
> there are so few keywords that their meaning is easily learned.
>
> granted, the error messages and such are still in english, but
> they could be made localizable.

That still leaves the standard library. There are batteries included,
but they are all English - I hope you are not proposing that those
also get localized...

Regards,
Martin

Martin v. Loewis

May 9, 2002, 1:40:51 PM
Erno Kuusela <erno...@erno.iki.fi> writes:

> || what would be the advantage in preventing non-english-speaking people
> || from using python?
>
> | (1) You mean, like they are prevented from using FORTRAN, COBOL, C,
> | ...?
>
> yes (but not java).

You mean, non-english-speaking people are prevented from using FORTRAN
and C? Can you name someone specifically? I don't know any such person.

Regards,
Martin

Дамјан Г.

May 9, 2002, 2:04:34 PM
> What???

It works for me with:

LC_ALL=mk_MK python
Python 2.1.1 (#2, Sep 21 2001, 18:34:32)
[GCC 2.95.3 20010315 (release)] on linux2


Type "copyright", "credits" or "license" for more information.

>>> хало = "Hello"
>>> print хало
Hello

But doesn't with:
LC_ALL=C python


--
Дамјан

Jesus saves, but only Buddha makes incremental backups

Erno Kuusela

May 9, 2002, 2:51:48 PM
In article <m3helhp...@mira.informatik.hu-berlin.de>,

mar...@v.loewis.de (Martin v. Loewis) writes:

| You mean, non-english-speaking people are prevented from using FORTRAN
| and C? Can you name someone specifically? I don't know any such person.

i don't know such people either. but since many people only know
languages that aren't written in ascii, it seems fairly probable that
they exist.

-- erno

Erno Kuusela

May 9, 2002, 2:54:39 PM
In article <m3lmatp...@mira.informatik.hu-berlin.de>,

mar...@v.loewis.de (Martin v. Loewis) writes:

| That still leaves the standard library. There are batteries included,
| but they are all English - I hope you are not proposing that those
| also get localized...

it is a problem, but for teaching or embedding it may be reasonable
to not use it, or only use small parts of it. or write local language
wrappers for the standard library.
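The "local language wrappers" idea might look like this minimal sketch (the Finnish names are chosen purely for illustration; note that non-ASCII identifiers require Python 3's PEP 3131 -- precisely the feature this thread debates, and absent from the 2.2 Python discussed here):

```python
# A teaching-wrapper module: re-export standard-library functionality
# under local-language names.
import math

neliöjuuri = math.sqrt    # "square root"
pii = math.pi             # "pi"

def pyöristä(x, desimaalit=0):
    """Round ("pyöristä") x to the given number of decimals."""
    return round(x, desimaalit)

print(neliöjuuri(9.0))    # 3.0
print(pyöristä(pii, 2))   # 3.14
```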

-- erno

Huaiyu Zhu

May 9, 2002, 4:23:30 PM
Stephen J. Turnbull <ste...@xemacs.org> wrote:
>>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:
>
> Martin> It would break introspective tools who suddenly find
> Martin> Unicode objects in attribute dictionaries.
>
>What Unicode objects? They find ordinary strings that are mandated to
>be encoded in UTF-8. The tools only need to be 8-bit clean, and not
>do anything that involves the assumption that #characters == #octets.
>And _that_ only affects people using non-ASCII identifiers, which
>might be OK since it is an extension.

Out of curiosity: If a character is two bytes, what would len() report? If
s is a unicode string with wide characters, would list(s) be made of
characters or bytes? Would that be different under the current situation,
or the PEP 263, or under Stephen's proposal? Would it change depending on
how the unicode is encoded?

A list of such simple questions and answers for various proposals would help
many more people to understand the relevant PEPs.

Huaiyu

Chris Liechti

May 9, 2002, 5:03:55 PM
hua...@gauss.almadan.ibm.com (Huaiyu Zhu) wrote in
news:slrnadlmm2...@gauss.almadan.ibm.com:
> Out of curiosity: If a character is two bytes, what would len()
> report? If s is a unicode string with wide characters, would list(s)
> be made of characters or bytes? Would that be different under the
> current situation, or the PEP 263, or under Stephen's proposal? Would
> it change depending on how the unicode is encoded?

we have an interactive console:
>>> len(unicode("hello"))
5

len gives you the number of characters no matter how many bytes are needed
to represent them.

>>> list(unicode("hello"))
[u'h', u'e', u'l', u'l', u'o']

so you get a list of unicode characters.



> A list of such simple questions and answers for various proposals
> would help many more people to understand the relevant PEPs.

i think most of that gets clear when you play around with the current
python and its unicode handling, so it does not need a special mention.

chris

--
Chris <clie...@gmx.net>

Skip Montanaro

May 9, 2002, 6:14:23 PM

Huaiyu> If a character is two bytes, what would len() report?

Depends on the type of the argument. If it's a Unicode object, the number
of characters. If it's a plain string, the number of bytes:

>>> u"\N{GREEK CAPITAL LETTER ALPHA}"
u'\u0391'
>>> len(u"\N{GREEK CAPITAL LETTER ALPHA}")
1
>>> len(u"\N{GREEK CAPITAL LETTER ALPHA}".encode("utf-8"))
2

Huaiyu> Would it change depending on how the unicode is encoded?

Yes, depending on what you pass to len(). If it's a plain string it
definitely depends on the encoding:

>>> u"a"
u'a'
>>> u"a".encode("utf-16")
'\xff\xfea\x00'
>>> u"a".encode("utf-8")
'a'
>>> len(u"a".encode("utf-16"))
4
>>> len(u"a".encode("utf-8"))
1
>>> len(u"a")
1

--
Skip Montanaro (sk...@pobox.com - http://www.mojam.com/)
"Excellant Written and Communications Skills [required]" - post to chi.jobs


Martin v. Loewis

May 10, 2002, 2:30:40 AM
Erno Kuusela <erno...@erno.iki.fi> writes:

> | You mean, non-english-speaking people are prevented from using FORTRAN
> | and C? Can you name someone specifically? I don't know any such person.
>
> i don't know such people either. but since many people only know
> languages that aren't written in ascii, it seems fairly probable that
> they exist.

I really question this claim. Most people that develop software (or
would be interested in doing so) will learn the latin alphabet at
school - even if they don't learn to speak English well.

Regards,
Martin

Martin v. Loewis

May 10, 2002, 2:34:39 AM
hua...@gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> Out of curiosity: If a character is two bytes, what would len() report? If
> s is a unicode string with wide characters, would list(s) be made of
> characters or bytes?

Python already supports "wide characters", by means of the unicode
type. For this type, len() reports the number of characters, not the
number of bytes used for internal storage.

> Would that be different under the current situation, or the PEP 263,
> or under Stephen's proposal? Would it change depending on how the
> unicode is encoded?

For the Unicode type, nothing would change - Stephen did not propose
to change the Unicode type.

Instead, he proposed that non-ASCII identifiers are represented using
UTF-8 encoded byte strings (instead of being represented as Unicode
objects); in that case, and for those identifiers, len() would return
the number of UTF-8 bytes.

> A list of such simple questions and answers for various proposals
> would help many more people to understand the relevant PEPs.

I recommend you familiarize yourself with the Unicode support first
that was introduced in Python 2.0.

Regards,
Martin

Erno Kuusela

May 10, 2002, 7:29:17 AM
In article <m3wuucb...@mira.informatik.hu-berlin.de>,

mar...@v.loewis.de (Martin v. Loewis) writes:

|| i don't know such people either. but since many people only know
|| languages that aren't written in ascii, it seems fairly probable that
|| they exist.

| I really question this claim. Most people that develop software (or
| would be interested in doing so) will learn the latin alphabet at
| school - even if they don't learn to speak English well.

maybe someone with first hand experience will chime in. but
regardless, if you don't know english well, i would imagine it to be
quite uncomfortable to write programs when you cannot use your native
language.

-- erno

Lulu of the Lotus-Eaters

May 10, 2002, 2:21:31 PM
Skip Montanaro <sk...@pobox.com> wrote previously:

|Yes, depending on what you pass to len(). If it's a plain string it
|definitely depends on the encoding:
| >>> u"a"
| u'a'
| >>> u"a".encode("utf-16")
| '\xff\xfea\x00'
| >>> u"a".encode("utf-8")
| 'a'
| >>> len(u"a".encode("utf-16"))
| 4
| >>> len(u"a".encode("utf-8"))
| 1
| >>> len(u"a")
| 1

Skip knows this, but novices might not. UTF-16 encoding is kind of a
funny case in terms of length. Each UTF-16 string is prepended with a
two-byte "endian" header. So while Skip's example might suggest that
"a" takes 4 bytes to encode in UTF-16, it really only takes 2 bytes, but
has a 2-byte "overhead." Compare:

>>> u"aa".encode("utf-16")
'\xff\xfea\x00a\x00'
>>> len(u"aa".encode("utf-16"))
6
>>> len(u"aaa".encode("utf-16"))
8

Yours, Lulu...

--
mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis _/_/ Postmodern Enterprises _/_/ s r
.cx _/_/ MAKERS OF CHAOS.... _/_/ i u
_/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s
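The constant two-byte overhead Lulu describes is easy to check directly. A
short sketch (in Python 3 syntax, which postdates this thread): the
explicit-endian "utf-16-le" codec omits the BOM, so comparing it against the
generic "utf-16" codec isolates the header cost.

```python
# Each character costs 2 bytes in UTF-16; the BOM adds a constant 2 bytes.
for n in range(1, 4):
    s = u"a" * n
    with_bom = len(s.encode("utf-16"))        # BOM + 2 bytes per character
    without_bom = len(s.encode("utf-16-le"))  # explicit endianness, no BOM
    print(n, with_bom, without_bom)
    assert with_bom == 2 * n + 2
    assert without_bom == 2 * n
```

So Skip's 4-byte result for a one-character string is 2 bytes of data plus
the 2-byte header, matching Lulu's lengths of 6 and 8 for "aa" and "aaa".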

John Roth

May 10, 2002, 7:14:23 PM

"Martin v. Loewis" <mar...@v.loewis.de> wrote in message
news:m3wuucb...@mira.informatik.hu-berlin.de...

The trouble is that while almost all of the languages used in the
Americas, Australia and Western Europe are based on
the Latin alphabet, that isn't true in the rest of the world, and
even then, it gets uncomfortable if your particular language's
diacritical marks aren't supported. You can't do really good,
descriptive names.

And good, descriptive names are one of the bedrocks of
good software.

I'd very much prefer that this issue get faced head on and
solved cleanly, although I doubt that it will be solved before
Python 3.0.

The way I'd suggest it is quite simple:

1. In Python 3.0, the input character set is Unicode - either UTF-16
or UTF-8 (I'm not prepared to make a solid argument one way or the
other at this time.)

2. All identifiers MUST be expressed in the character set of
a single language (treating the various Latin-derived languages
as one for simplicity.) That doesn't mean that only one language
can be used for a module, only that a particular identifier must make
lexical sense in a specific language.

3. There must be a complete set of syntax words in each
supported language. That is, words such as 'and', 'or', 'if', 'else'
All such syntax words in a particular module must come from the
same language.

4. All syntax words are preceded by a special character, which
is not presented to the viewer by Python 3.0 aware tools. Instead,
the special character is used to pick them out and highlight them.
The reason for this is that the vocabulary of syntax words can then
be expanded without impacting existing programs - they are
effectively from a different name space.


>
> Regards,
> Martin


Neil Hodgson

May 10, 2002, 8:44:57 PM
John Roth:

> 2. All identifiers MUST be expressed in the character set of
> a single language (treating the various latin derived languages
> as one for simplicity.) That doesn't mean that only one language
> can be used for a module, only that a particular identifer must make
> lexical sense in a specific language.

Do you have a reason for this restriction? I see there being reasons for
using identifiers made from non-Roman (such as Japanese) and Roman letters
when applying naming conventions or when basing names on external entities
such as database identifiers. Say I have a database with a column called
[JapaneseWord] and want derived entities in a (possibly automatically
generated) form such as txt[JapaneseWord] and verified[JapaneseWord].

In mathematical English code I would quite like to use greek letters for
pi and sigma and so forth to make the code more similar to how I'd document
it.

Neil

Chris Liechti

May 10, 2002, 9:00:29 PM
"John Roth" <john...@ameritech.net> wrote in
news:udol0hp...@news.supernews.com:

> "Martin v. Loewis" <mar...@v.loewis.de> wrote in message
> news:m3wuucb...@mira.informatik.hu-berlin.de...
>> Erno Kuusela <erno...@erno.iki.fi> writes:
>>
>> > | You mean, non-english-speaking people are prevented from using
> FORTRAN
>> > | and C? Can you name someone specifically? I don't know any such
> person.
>> >
>> > i don't know such people either. but since many people only know
>> > languages that aren't written in ascii, it seems fairly probable
> that
>> > they exist.
>>
>> I really question this claim. Most people that develop software (or
>> would be interested in doing so) will learn the latin alphabet at
>> school - even if they don't learn to speak English well.
>
> The trouble is that while almost all of the languages used in the
> Americas, Australia and Western Europe are based on
> the Latin alphabet, that isn't true in the rest of the world, and
> even then, it gets uncomfortable if your particular language's
> diacritical marks aren't supported. You can't do really good,
> descriptive names.
>
> And good, descriptive names are one of the bedrocks of
> good software.

true, but how am i supposed to use the nice chinese module which uses class
names i can't even type on my keyboard?

[...]

> 3. There must be a complete set of syntax words in each
> supported language. That is, words such as 'and', 'or', 'if', 'else'
> All such syntax words in a particular module must come from the
> same language.

uff, this sounds evil to me. this means i could write "wenn" for an "if" in
german etc.? that would effectively downgrade python to a beginners-only
language, because the different addon modules you find on the net are just a
chaotic language mix, unusable for a commercial project.

many modules on the net would not work in your language, or if they would at
least execute, you would still be unable to look at the sourcecode, extend it,
understand it (ok it would solve the obfuscated code questions that show up
from time to time ;-).
we like open source, don't we? but with so many language
variants it would become very difficult to work together.

if you say now that if one intends to make a module public, one could always
choose to write it in english, i don't think that's a good argument. many
modules start as a private project, a quick hack etc. but then they're made
public. look at Alex's post for more good arguments...


> 4. All syntax words are preceeded by a special character, which
> is not presented to the viewer by Python 3.0 aware tools. Instead,
> the special character is used to pick them out and highlight them.
> The reason for this is that the vocabulary of syntax words can then
> be expanded without impacting existing programs - they are
> effectively from a different name space.

goodbye editing with a simple editor... of course you would also like to
introduce the possibility to write from the right to left and vertical.

i can see your good intention but i doubt that this leads to a better
programming language.

chris

--
Chris <clie...@gmx.net>

Neil Hodgson

May 10, 2002, 9:31:03 PM
Chris Liechti:

> > And good, descriptive names are one of the bedrocks of
> > good software.
>
> true, but how i'm supposed to use the nice chinese module which uses class
> names i can't even type on my keyboard?

You can type Chinese names on your keyboard using a Chinese Input Method
Editor. I run Windows 2000 in an Australian English locale, but when I want
to type Japanese I change to the Japanese IME, which is quite easy to use.

Neil


Huaiyu Zhu

May 10, 2002, 9:45:04 PM
Martin v. Loewis <mar...@v.loewis.de> wrote:
>
>For the Unicode type, nothing would change - Stephen did not propose
>to change the Unicode type.
>
>Instead, he proposed that non-ASCII identifiers are represented using
>UTF-8 encoded byte strings (instead of being represented as Unicode
>objects); in that case, and for those identifiers, len() would return
>the number of UTF-8 bytes.

But would that be different from the number of characters?

My confusion comes from his assertion that Python itself does not need to
care whether it's a raw string or unicode. Is there any need for the
interpreter to split an identifier into a sequence of characters? If the
answer is no, then I guess my question is moot.

>
>> A list of such simple questions and answers for various proposals
>> would help many more people to understand the relevant PEPs.
>
>I recommend you familiarize yourself with the Unicode support first
>that was introduced in Python 2.0.

My question was about what would be the case under the proposals. But I
guess I'm way out of my domain here.

Huaiyu

Oleg Broytmann

May 11, 2002, 1:45:05 AM
On Fri, May 10, 2002 at 07:14:23PM -0400, John Roth wrote:
> 3. There must be a complete set of syntax words in each
> supported language. That is, words such as 'and', 'or', 'if', 'else'
> All such syntax words in a particular module must come from the
> same language.

Who will maintain those "complete sets"? Core team? They have enough
other things to do.

> 4. All syntax words are preceded by a special character, which
> is not presented to the viewer by Python 3.0 aware tools. Instead,
> the special character is used to pick them out and highlight them.
> The reason for this is that the vocabulary of syntax words can then
> be expanded without impacting existing programs - they are
> effectively from a different name space.

Why do you want to make perl of python? If you want perl just go and use
perl, no problem.

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.


Martin v. Löwis

May 11, 2002, 3:21:31 AM
"John Roth" <john...@ameritech.net> writes:

> The trouble is that while almost all of the languages used in the
> Americas, Australia and Western Europe are based on
> the Latin alphabet, that isn't true in the rest of the world, and
> even then, it gets uncomfortable if your particular language's
> diacritical marks aren't supported. You can't do really good,
> descriptive names.

I personally can live without the diacritical marks in program source
code, except when it comes to spelling my name - and I usually put
this into strings and comments only.

I'm fully aware that many people in this world write their languages
without latin letters. I still doubt that this is an obstacle when
writing software.

> 1. In Python 3.0, the input character set is unicode - either UTF-16
> or UTF-8 (I'm not prepared to make a solid argument one way or the
> other at this time.)

Actually, PEP 263 gives a much wider choice; consider this aspect
solved.

> 2. All identifiers MUST be expressed in the character set of
> a single language (treating the various latin derived languages
> as one for simplicity.) That doesn't mean that only one language
> can be used for a module, only that a particular identifier must make
> lexical sense in a specific language.

That sounds terrible. Are you sure you can implement this? For
example, what about the Cyrillic-based languages? Are you also
treating them as one for simplicity? Can you produce a complete list
of languages, and for each one, a complete list of characters?

> 3. There must be a complete set of syntax words in each
> supported language. That is, words such as 'and', 'or', 'if', 'else'
> All such syntax words in a particular module must come from the
> same language.

That is even more terrible. So far, nobody has proposed to translate
Python keywords. How are you going to implement that: i.e. can you
produce a list of keywords for each language? How would I spell 'def'
in German?

Regards,
Martin

Martin v. Löwis

May 11, 2002, 3:27:01 AM
hua...@gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> >Instead, he proposed that non-ASCII identifiers are represented using
> >UTF-8 encoded byte strings (instead of being represented as Unicode
> >objects); in that case, and for those identifiers, len() would return
> >the number of UTF-8 bytes.
>
> But would that be different from the number of characters?

Yes. Watch this

>>> x=u"\N{EURO SIGN}"
>>> x
u'\u20ac'

This is a single character

>>> len(x.encode('utf-8'))
3

In UTF-8, this character has 3 bytes. Note that the number of bytes
in UTF-8 for a Unicode character varies between 1 and 4.

> My confusion comes from his assertion that Python itself does not need to
> care whether it's raw string or unicode. Is there any need for the
> interpreter to split an identifier into sequence of characters? If the
> answer is no, then I guess my question is moot.

The interpreter never does that, but still, a single identifier would
either be an ASCII byte string (the stress being on ASCII), or a
Unicode object:

>>> "x" == u"x"
1
>>> hash("x")
-1819822983
>>> hash(u"x")
-1819822983

Regards,
Martin

Stephen J. Turnbull

May 11, 2002, 3:29:12 AM
>>>>> "Huaiyu" == Huaiyu Zhu <hua...@gauss.almadan.ibm.com> writes:

Huaiyu> Martin v. Loewis <mar...@v.loewis.de> wrote:

>> For the Unicode type, nothing would change - Stephen did not
>> propose to change the Unicode type.

>> Instead, he proposed that non-ASCII identifiers are represented
>> using UTF-8 encoded byte strings (instead of being represented
>> as Unicode objects); in that case, and for those identifiers,
>> len() would return the number of UTF-8 bytes.

Huaiyu> But would that be different from the number of characters?

No, for all backward-compatible (== ASCII-only identifiers) code.
Yes, for code actually using the proposed extension to non-ASCII
identifiers.

Huaiyu> My confusion comes from his assertion that Python itself
Huaiyu> does not need to care whether it's raw string or unicode.

My assertion is that we can choose either, and Python itself will work
fine, not that Python itself doesn't need to care. Furthermore, if we
choose UTF-8 as the internal encoding for non-ASCII identifiers,
Python itself doesn't need to be changed at all, except for the code
that tests whether an identifier is legal.

What would care is introspective code. Examples:

(1) Code that constructs a distribution of lengths of identifiers
known to the interpreter would be biased toward long identifiers,
since in UTF-8 #octets >= #characters.

(2) Code that uses identifiers in eval constructs would need to do
some horrible thing like

exec "print x + y".decode('iso-8859-1').encode('utf-8')

Note that in this all-ASCII example it's redundant, but would work.
Also the PEP 263 mechanism could be extended to give the program an
"execution locale" and automatically do that conversion. (Horrible,
but in the spirit of that PEP.)
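The transcoding step in that exec example is just a decode from the source
encoding followed by an encode to UTF-8. A sketch (in Python 3 syntax, which
postdates this thread; the identifier "café" is a made-up example of a
non-ASCII name):

```python
# 'café' as Latin-1 source bytes, as it might appear in an iso-8859-1 file
source_bytes = b"caf\xe9"
as_text = source_bytes.decode("iso-8859-1")  # bytes -> characters
as_utf8 = as_text.encode("utf-8")            # characters -> UTF-8 bytes
print(as_utf8)
assert as_utf8.decode("utf-8") == as_text
assert len(as_utf8) == len(source_bytes) + 1  # the e-acute grows to two bytes
```

The length assertion shows the bias mentioned in point (1): the UTF-8 form
of a non-ASCII identifier has at least as many octets as characters.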

Huaiyu> Is there any need for the interpreter to split an
Huaiyu> identifier into sequence of characters? If the answer is
Huaiyu> no, then I guess my question is moot.

There's no need that I know of for the interpreter to do so. However
(one of) Martin's points is that there are (very useful!) tools that
do, and these would either be "broken by the extension" or "merely
unreliable" for code that uses the non-ASCII identifier extension,
depending on your point of view.

Obviously I prefer the latter interpretation. I suggest that projects
that require reliable operation of introspective tools hire someone
like the martellibot to do coding standard enforcement<wink>. But the
"broken" interpretation is also reasonable, and I assume that is the
one that MvL holds.

Huaiyu> My question was about what would be the case under the
Huaiyu> proposals. But I guess I'm way out of my domain here.

The basic fact is that Unicode support for strings is already decided.
I disagree with some implementation decisions (eg, the idea of
prepending ZERO-WIDTH NO-BREAK SPACE to strings intended to be
exported in UTF-16 encoding is just insane IMO, code must be written

    print list_of_strings[0]
    for s in list_of_strings[1:]:
        print s[2:]

Yuck!) But that's just something I can easily work around by defining
a codec to my liking---in fact the ease with which I can do this shows
the overall strength of the scheme adopted.

It is the interface of this well-ordered pythonic environment to the
disorderly world of natural language that is under discussion. PEP
263 provides standard support for people who wish to embed localized
(ie, non-ASCII) literal strings (both ordinary and Unicode) and
comments in their source files. Note that source code comes from
"outside of" Python; Python has no control over, nor even a way to
know, the encoding used.

Currently use of localized literal ordinary strings is possible, and
some projects depend on it, because of the specific Python
implementation. PEP 263 standardizes the situation in a way backward
compatible to these (very natural) "abuses" of the implementation, and
mandates its extension to literal Unicode strings.

My proposal goes farther and allows localized identifiers. AFAIK
Erno's use is just a curio; Alex's arguments for use of English if at
all possible, and certainly ASCII, in identifiers are strong and
natural. Even Japanese programmers rarely break this rule. So AFAIK
there is no body of code out there to be backward compatible with.

Martin v. Löwis

May 11, 2002, 3:57:21 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> (2) Code that uses identifiers in eval constructs would need to do
> some horrible thing like
>
> exec "print x + y".decode('iso-8859-1').encode('utf-8')

With PEP 263 implemented, the source encoding of identifiers and the
run-time encoding are two different issues. The source does not need
to be in UTF-8.

> Note that in this all-ASCII example it's redundant, but would work.
> Also the PEP 263 mechanism could be extended to give the program an
> "execution locale" and automatically do that conversion. (Horrible,
> but in the spirit of that PEP.)

Actually, the PEP requires that if a byte string is exec'ed, you need
a proper encoding declaration. The easiest one would be the UTF-8
signature, but I'd recommend to exec Unicode objects in the first
place.

> Obviously I prefer the latter interpretation. I suggest that projects
> that require reliable operation of introspective tools hire someone
> like the martellibot to do coding standard enforcement<wink>. But the
> "broken" interpretation is also reasonable, and I assume that is the
> one that MvL holds.

This is not an artificial objection: people already complained that
pydoc breaks when confronted with a Unicode doc string. I expect that
even dir() might stop "working", since its result would contain
Unicode objects which then cannot be printed at the interactive
console.

> The basic fact is that Unicode support for strings is already decided.
> I disagree with some implementation decisions (eg, the idea of
> prepending ZERO-WIDTH NO-BREAK SPACE to strings intended to be
> exported in UTF-16 encoding is just insane IMO

That's how UTF-16 is specified. If you don't want the BOM, use
UTF-16LE or UTF-16BE.

Regards,
Martin

John Roth

May 11, 2002, 7:50:28 AM

"Neil Hodgson" <nhod...@bigpond.net.au> wrote in message
news:dQZC8.115171$o66.3...@news-server.bigpond.net.au...

Some good points. I was mostly attempting to provide a safety net to
reduce the possibility of unreadable code.

John Roth
>
> Neil
>
>
>


John Roth

May 11, 2002, 7:54:34 AM

"Chris Liechti" <clie...@gmx.net> wrote in message
news:Xns920B1EE5D209...@62.2.16.82...

Not what I meant at all. The compiled byte code would be identical,
and presumably the compiler would recognize each of the sets, so you
could use any module you found anywhere.

> many modules on the net would not work in your language, or if they
> would at least execute, you would still be unable to look at the
> sourcecode, extend it, understand it (ok it would solve the obfuscated
> code questions that show up from time to time ;-).

Translating a module's syntax words from one language to
another is dead easy. If it's an issue (and I agree that it most
likely will be one), a syntax-aware editor should do it on the fly.

John Roth

May 11, 2002, 8:00:53 AM

"Oleg Broytmann" <p...@phd.pp.ru> wrote in message
news:mailman.102109598...@python.org...

> On Fri, May 10, 2002 at 07:14:23PM -0400, John Roth wrote:
>
> > 4. All syntax words are preceded by a special character, which
> > is not presented to the viewer by Python 3.0 aware tools. Instead,
> > the special character is used to pick them out and highlight them.
> > The reason for this is that the vocabulary of syntax words can then
> > be expanded without impacting existing programs - they are
> > effectively from a different name space.
>
> Why do you want to make perl of python? If you want perl just go
> and use perl, no problem.

I wasn't intending to do that. Perl's 'funny characters' solve one
significant problem that comes up every time someone suggests
adding a character syntax word to python: breaking existing code.

The only permanent solution to this problem is to take the character
syntax words from a different space than identifiers. Perl does it
(accidentally, I presume, although I don't know for certain) by
using special characters to mark (some aspects of) the type of
identifiers.
I actually took this idea from Color Forth!

As someone else noted, it would make simplistic editors much
less usable, but many (possibly most) of us use much more
capable editors. In any case, basic point 1 (that the source would
be in some variation of Unicode) breaks all simplistic editors that
exist today.

John Roth

John Roth

May 11, 2002, 8:09:44 AM

"Martin v. Löwis" <loe...@informatik.hu-berlin.de> wrote in message
news:j44rhf4...@informatik.hu-berlin.de...

> "John Roth" <john...@ameritech.net> writes:
>
> > The trouble is that while almost all of the languages used in the
> > Americas, Australia and Western Europe are based on
> > the Latin alphabet, that isn't true in the rest of the world, and
> > even then, it gets uncomfortable if your particular language's
> > diacritical marks aren't supported. You can't do really good,
> > descriptive names.
>
> I personally can live without the diacritical marks in program source
> code, except when it comes to spelling my name - and I usually put
> this into strings and comments only.
>
> I'm fully aware that many people in this world write their languages
> without latin letters. I still doubt that this is an obstacle when
> writing software.
>
> > 1. In Python 3.0, the input character set is unicode - either UTF-16
> > or UTF-8 (I'm not prepared to make a solid argument one way or the
> > other at this time.)
>
> Actually, PEP 263 gives a much wider choice; consider this aspect
> solved.

I just read that PEP. As far as I'm concerned, it's not solved; the
solution would be much worse than the disease. Python is noted
for simplicity and one way to do most things. PEP 263 (outside of
syntax issues) simply obfuscates the issue for quite minor returns.

> > 2. All identifiers MUST be expressed in the character set of
> > a single language (treating the various latin derived languages
> > as one for simplicity.) That doesn't mean that only one language
> > can be used for a module, only that a particular identifier must make
> > lexical sense in a specific language.
>
> That sounds terrible. Are you sure you can implement this? For
> example, what about the Cyrillic-based languages? Are you also
> treating them as one for simplicity? Can you produce a complete list
> of languages, and for each one, a complete list of characters?

I believe that the Unicode Consortium has already considered this.
After all, they didn't just add character encodings at random; they've
got specific support for many, many languages. I don't need to
repeat their work, and much more importantly, neither does the
core Python language team.

> > 3. There must be a complete set of syntax words in each
> > supported language. That is, words such as 'and', 'or', 'if', 'else'
> > All such syntax words in a particular module must come from the
> > same language.
>
> That is even more terrible. So far, nobody has proposed to translate
> Python keywords. How are you going to implement that: i.e. can you
> produce a list of keywords for each language? How would I spell 'def'
> in German?

AFAIC, spelling is up to people who want to code in a particular
language.
I haven't considered implementation, but it seems like it should be
incredibly simple, given that point 4 means that syntax words are
easily distinguishable by the lexer. Think in terms of a dictionary,
although performance considerations probably means that something
faster would be necessary.

John Roth

Stephen J. Turnbull

May 11, 2002, 8:21:31 AM
>>>>> "Martin" == Martin v Löwis <loe...@informatik.hu-berlin.de> writes:

[sjt]


>> I disagree with some implementation decisions (eg,
>> the idea of prepending ZERO-WIDTH NO-BREAK SPACE to strings
>> intended to be exported in UTF-16 encoding is just insane IMO

Martin> That's how UTF-16 is specified.

The Unicode standard permits, but does not require, a BOM.

Stephen J. Turnbull

May 11, 2002, 8:31:19 AM
>>>>> "Martin" == Martin v Löwis <loe...@informatik.hu-berlin.de> writes:

>> 1. In Python 3.0, the input character set is unicode - either
>> UTF-16 or UTF-8 (I'm not prepared to make a solid argument one
>> way or the other at this time.)

Martin> Actually, PEP 263 gives a much wider choice; consider this
Martin> aspect solved.

Some of us consider the wider choice to be a severe defect of PEP 263.

That doesn't mean we think that Python should prohibit writing
programs in arbitrary user-specified encodings. Only that the
facility for transforming a non-Unicode program into Unicode should be
provided as a standard library facility, rather than part of the
language. The lexical properties of the language would be specified
in terms of Unicode.

Martin v. Loewis

May 11, 2002, 9:22:29 AM
"John Roth" <john...@ameritech.net> writes:

> I just read that PEP. As far as I'm concerned, it's not solved, the
> solution would be much worse than the disease. Python is noted
> for simplicity and one way to do most things. PEP 263 (outside of
> syntax issues) simply obfuscates the issue for quite minor returns.

Any specific objection?

> > That sounds terrible. Are you sure you can implement this? For
> > example, what about the Cyrillic-based languages? Are you also
> > treating them as one for simplicity? Can you produce a complete list
> > of languages, and for each one, a complete list of characters?
>
> I believe that the Unicode Consortium has already considered this.
> After all, they didn't just add character encodings at random; they've
> got specific support for many, many languages. I don't need to
> repeat their work, and much more importantly, neither does the
> core Python language team.

Ok, can you then kindly direct me to the relevant database? To my
knowledge, the Unicode consortium does *not* maintain this very data
(although they do maintain data that, at a shallow glance, look
related).

> > That is even more terrible. So far, nobody has proposed to translate
> > Python keywords. How are you going to implement that: i.e. can you
> > produce a list of keywords for each language? How would I spell 'def'
> > in German?
>
> AFAIC, spelling is up to people who want to code in a particular
> language.

I'm telling you: I speak German, and I did a lot of software
localization work, but I couldn't find an acceptable translation for
any of the Python keywords which wouldn't sound outright silly.

> I haven't considered implementation, but it seems like it should be
> incredibly simple, given that point 4 means that syntax words are
> easily distinguishable by the lexer. Think in terms of a dictionary,
> although performance considerations probably mean that something
> faster would be necessary.

Indeed, implementing this would be the easier part - obtaining the
data is difficult.

Regards,
Martin

Martin v. Loewis

May 11, 2002, 9:29:04 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Martin> Actually, PEP 263 gives a much wider choice; consider this
> Martin> aspect solved.
>
> Some of us consider the wider choice to be a severe defect of PEP 263.

People have all kinds of opinions on this aspect of the PEP.

> That doesn't mean we think that Python should prohibit writing
> programs in arbitrary user-specified encodings. Only that the
> facility for transforming a non-Unicode program into Unicode should be
> provided as a standard library facility, rather than part of the
> language.

I believe that you are still the only one who voices this specific
position. More often, you find the position that Python source code
should be restricted to UTF-8, period. The counter-position to that
is: what about existing code, and what about people who don't have
UTF-8 editors?

Apart from you, nobody else agrees with the approach "let's make it
part of the library instead of part of the language". To most users,
the difference appears not to matter (including myself, except that I
think making it part of the language simplifies maintenance of the
feature).

I don't consider it evil to provide users with options: If UTF-8 is
technically superior (which I agree it is), it will become the default
text encoding of the future, anyway, with or without this PEP. Notice
that the PEP slightly favours UTF-8 over other encodings, due to
support of the UTF-8 signature.

Regards,
Martin

Martin v. Loewis

May 11, 2002, 9:34:47 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Martin> That's how UTF-16 is specified.
>
> The Unicode standard permits, but does not require, a BOM.

Factually, the Unicode standard does not recognize UTF-16 as a byte
encoding; it only recognizes it as a CEF, not as a CES (see TR#17).

UTF-16 as-a-CES is defined in RFC 2781, which, in section 3.3, says
that the BOM SHOULD be inserted if the CES UTF-16 is used.

Regards,
Martin

Stephen J. Turnbull

May 11, 2002, 9:46:14 AM
>>>>> "John" == John Roth <john...@ameritech.net> writes:

>> > 4. All syntax words are preceded by a special character,
>> > which is not presented to the viewer by Python 3.0 aware
>> > tools.

Or any Unicode-aware tools, for that matter, because you'll use
ZERO-WIDTH SPACE.<0.9 wink>

Chris Liechti

May 11, 2002, 10:56:03 AM
"Neil Hodgson" <nhod...@bigpond.net.au> wrote in
news:rv_C8.115301$o66.3...@news-server.bigpond.net.au:

I know I've played around with it, but that does not change the fact that
I'm still unable to type a specific character because I don't know any
Chinese at all. All I could do is copy & paste such names...

chris
--
Chris <clie...@gmx.net>

Stephen J. Turnbull

May 11, 2002, 10:55:23 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> UTF-16 as-a-CES is defined in RFC 2781, which, in section
Martin> 3.3, says that the BOM SHOULD be inserted if the CES
Martin> UTF-16 is used.

The content of what you wrote is identical to what I wrote. It's
optional, if you have good reason not to do so. The behavior of

u"a".encode("UTF-16") + u"b".encode("UTF-16")

versus

u"ab".encode("UTF-16")

is quite sufficient reason, to my mind.
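
[Editor's note: Stephen's two expressions are easy to check. On a modern CPython (where string literals are Unicode without the `u` prefix) the piecewise encoding really does embed a second BOM, which then survives decoding as U+FEFF:

```python
# "utf-16" prepends a BOM on every encode call, so concatenating two
# independently encoded fragments yields two BOMs ...
joined = "a".encode("utf-16") + "b".encode("utf-16")
whole = "ab".encode("utf-16")

def bom_count(data: bytes) -> int:
    # Count both byte orders; which one appears is platform-dependent.
    return data.count(b"\xff\xfe") + data.count(b"\xfe\xff")

print(bom_count(joined))  # 2
print(bom_count(whole))   # 1
# ... and the interior BOM survives round-tripping as U+FEFF
# (ZERO WIDTH NO-BREAK SPACE), invisibly corrupting the text:
print(joined.decode("utf-16"))  # 'a\ufeffb', not 'ab'
```

The invisibility of U+FEFF on a Unicode-capable console is what makes the bug hard to spot, as noted later in the thread.]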

It is, however, incorrect to cite RFC 2781 Section 3.3 "Choosing a
label for UTF-16 text" here. Python strings have no explicit charset
labels, which is the subject of that section. (At least I can't find
them in a string object using dir() etc.) It simply does not apply.

Not to mention that RFC 2781 is not intended to apply to the Python
interpreter's internal operation at all, except for "Widely
Distributed Python"<wink>. See section 1 "Introduction".

Oleg Broytmann

May 11, 2002, 12:06:45 PM
On Sat, May 11, 2002 at 05:49:00PM +0200, Laura Creighton wrote:
> I can provide any number of people who consider, as a matter of
> principle, that it is _always_ better to make it part of the
> library and not part of the language. Some of these people will

I think this way.

> also argue that it is bad to provide users with options. This is

To some extent.

> the 'lean and elegant' school of language design, and they are
> extremely consistent in liking tiny languages with large libraries.

Exactly! One of the best languages I've ever seen was Forth. Once I even
implemented a Forth interpreter. The core of the interpreter was 200 lines
of assembler, and after that I switched to Forth and implemented the rest
of the language and library in Forth itself.

Oleg Broytmann

May 11, 2002, 11:58:47 AM
On Sat, May 11, 2002 at 08:00:53AM -0400, John Roth wrote:
> > > 4. All syntax words are preceded by a special character, which
> > > is not presented to the viewer by Python 3.0 aware tools. Instead,
> > > the special character is used to pick them out and highlight them.
> > > The reason for this is that the vocabulary of syntax words can then
> > > be expanded without impacting existing programs - they are
> > > effectively from a different namespace.
> >
> > Why do you want to make a Perl of Python? If you want Perl, just go
> > and use Perl, no problem.
>
> I wasn't intending to do that. Perl's 'funny characters' solve one
> significant problem that comes up every time someone suggests
> adding a character syntax word to python: breaking existing code.

In my opinion, the cure is worse than the disease.

Laura Creighton

May 11, 2002, 11:49:00 AM
<snip>

>
> Apart from you, nobody else agrees with the approach "let's make it
> part of the library instead of part of the language". To most users,
> the difference appears not to matter (including myself, except that I
> think making it part of the language simplifies maintenance of the
> feature).
>
> I don't consider it evil to provide users with options: If UTF-8 is
> technically superior (which I agree it is), it will become the default
> text encoding of the future, anywith, with or without this PEP. Notice
> that the PEP slightly favours UTF-8 over other encodings, due to
> support of the UTF-8 signature.
>
> Regards,
> Martin

I can provide any number of people who consider, as a matter of
principle, that it is _always_ better to make it part of the
library and not part of the language. Some of these people will
also argue that it is bad to provide users with options. This is
the 'lean and elegant' school of language design, and they are
extremely consistent in liking tiny languages with large libraries.

Laura Creighton

Neil Hodgson

May 11, 2002, 7:56:21 PM
Chris Liechti:
...
> > Chris Liechti:
...

> >> True, but how am I supposed to use the nice Chinese module which uses
> >> class names I can't even type on my keyboard?
...

> I know I've played around with it, but that does not change the fact that
> I'm still unable to type a specific character because I don't know any
> Chinese at all. All I could do is copy & paste such names...

For the specific need of using a Chinese module, copy and paste seems a
reasonable method. Further, autocompletion should then make it easy to use
further identifiers, although it's my fault that the autocompletion in some
editors doesn't cope with Chinese (caused by wanting common code on
Windows 9x and NT, and Windows 9x doesn't have wide-character list boxes).

Neil

Gerhard Häring

May 11, 2002, 8:38:02 PM
Martin v. Loewis wrote in comp.lang.python:

> More often, you find the position that Python source code should be
> restricted to UTF-8, period.

That's what I'd prefer to see rather sooner than later.

> The counter-position to that is: what about existing code,

recode(1)

> and what about people who don't have UTF-8 editors?

http://www.vim.org/, http://www.xemacs.org/ And certainly the
commercial Python IDEs would support this very soon, too.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id AD24C930
public key fingerprint: 3FCC 8700 3012 0A9E B0C9 3667 814B 9CAA AD24 C930
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))

François Pinard

May 12, 2002, 10:05:51 AM
[Gerhard Häring]

> Martin v. Loewis wrote in comp.lang.python:
> > More often, you find the position that Python source code should be
> > restricted to UTF-8, period.

> That's what I'd prefer to see rather sooner than later.

> > The counter-position to that is: what about existing code,

> recode(1)

This is not an acceptable solution. This is a difficult and recurrent
problem for various languages, and Python is no exception, offering Unicode
support without feeding Unicode fanaticism.

The time has not yet come when everybody has embraced and uses Unicode on an
individual basis. The French and Germans still use ISO 8859-1 (or -15), Poles
still use ISO 8859-2, etc. Guess what, most Americans still use ASCII! [1]

When everybody is using Unicode, it will be meaningful for Python to
support UTF-8 only. Python 3.0, Python 4.0 and maybe even Python 5.0 will
be published before the world turns Unicode all over :-). Let's keep in
mind that Python is there to help programmers live a better life, today.
Python should take no part in Unicode religious proselytism, nor create
useless programmer suffering by prematurely limiting itself to Unicode only.

--------------------
[1] Let's be honest, here! If Unicode was not offering something like
UTF-8 which almost fully supports ASCII without the shadow of a change,
I guess that the average American programmer would vividly oppose Unicode.

Just ponder that slight fuzziness in the way people interpret ASCII
apostrophe compared to Unicode apostrophe: this smallish detail already
generated endless and sometimes heated debates. (And for those who care,
my position is that whenever fonts and Unicode contradict in the ASCII
area, fonts should merely be corrected and adapt to both ASCII and Unicode.
The complexity that was recently added in this area is pretty gratuitous,
and is only meant to salvage those who chose to deviate from ASCII.)

Martin v. Loewis

May 12, 2002, 11:04:04 AM
Gerhard Häring <ger...@bigfoot.de> writes:

> > The counter-position to that is: what about existing code,
>
> recode(1)

It's not as easy as that. If you have

print "Mäldung"

then, after recoding, this program likely won't work correctly anymore -
it will print garbage (or, as the Japanese say, mojibake).

Regards,
Martin
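
[Editor's note: Martin's point can be made concrete. If the file's bytes are recoded from Latin-1 to UTF-8 but the program still writes the literal's raw bytes to a Latin-1 terminal, each non-ASCII character turns into two garbage characters. A sketch of the round trip, runnable on a modern CPython:

```python
original = "Mäldung"                 # the string as its author sees it
recoded = original.encode("utf-8")   # file recoded to UTF-8: 'ä' -> 0xC3 0xA4
# A Latin-1 terminal then renders those two bytes as two characters:
garbled = recoded.decode("latin-1")
print(garbled)  # MÃ¤ldung  <- the classic mojibake
```

So `recode(1)` alone is not enough: the program's assumptions about its output encoding must change in step with its source bytes.]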

Martin v. Loewis

May 12, 2002, 11:08:06 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> The content of what you wrote is identical to what I wrote. It's
> optional, if you have good reason not to do so. The behavior of
>
> u"a".encode("UTF-16") + u"b".encode("UTF-16")
>
> versus
>
> u"ab".encode("UTF-16")
>
> is quite sufficient reason, to my mind.

No. To my mind, this is good reason to not use UTF-16 altogether, but
be specific about the endianness: explicit is better than implicit.

Why does it help to have "UTF-16" to be a synonym to either "UTF-16BE"
or "UTF-16LE", but not telling anybody what it is a synonym to?

Regards,
Martin

Martin v. Loewis

May 12, 2002, 10:53:03 AM
Laura Creighton <l...@strakt.com> writes:

> I can provide any number of people who consider, as a matter of
> principal, that it is _always_ better to make it part of the
> library and not part of the language. Some of these people will
> also argue that it is bad to provide users with options. This is
> the 'lean and elegant' school of language design, and they are
> extrmely consistent in liking tiny languages with large libraries.

On this specific question (source encodings): which of those people
specifically would favour Stephen's approach? (Which, I must admit, I
have not fully understood, since I don't know how he wants the hooks
to be invoked.)

Regards,
Martin

Laura Creighton

May 12, 2002, 1:05:29 PM

Write it up and post the question to comp.os.plan9. These people have
put Unicode into their whole operating system and have been thinking
about these issues for their languages for more than a decade. I
cannot begin to do it justice here -- and Rob will end up flaming
me anyway for getting his point of view wrong. We've ported Python
to plan 9, so it won't even be off topic or anything.

Laura Creighton

ps they are in a good mood now. 4th edition just came out. Cheers!


Erno Kuusela

May 12, 2002, 8:53:21 PM
In article <mailman.1021223185...@python.org>, Laura
Creighton <l...@strakt.com> writes:

| Write it up and post the question to comp.os.plan9. These people have
| put Unicode into their whole operating system and have been thinking
| about these issues for their languages for more than a decade.

It is a nice system.

On the other hand, UTF-8 is ASCII compatible and most of the users of
Plan 9 are American, so they might not have to address all troublesome
situations right away.

There are major correctness advantages in having strict typing of
"legacy" (1-byte, undefined character set) text versus Unicode text.

-- erno

Laura Creighton

May 12, 2002, 9:32:47 PM
> In article <mailman.1021223185...@python.org>, Laura
> Creighton <l...@strakt.com> writes:
>
> | Write it up and post the question to comp.os.plan9. These people have
> | put Unicode into their whole operating system and have been thinking
> | about these issues for their languages for more than a decade.
>
> It is a nice system.
>
> On the other hand, UTF-8 is ASCII compatible and most of the users of
> Plan 9 are American, so they might not have to address all troublesome
> situations right away.

Plan 9 may be more used outside of the USA than inside these days.
Some of the most active groups of users live in Japan. They've been
using Plan 9 for more than a decade. But the most recent thread on
the AZERTY keyboard indicates that all is not perfect in paradise ...

>
> There are major correctness advantages in having strict typing of
> "legacy" (1-byte, undefined character set) text versus Unicode text.
>
> -- erno


Laura


Stephen J. Turnbull

May 13, 2002, 1:01:57 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> Why does it help to have "UTF-16" to be a synonym to
Martin> either "UTF-16BE" or "UTF-16LE", but not telling anybody
Martin> what it is a synonym to?

Ask whoever implemented a UTF-16 codec for python, not me. Evidently
there's a good reason for it.

The fact is that the current implementation is just begging to produce
broken output that will be invisible to anyone who has a Unicode-
capable console. And that the only way to avoid it (without rewriting
all the APIs to pass Unicode objects instead of pre-encoded strings)
is really ugly code like the code I presented earlier.

Martin v. Loewis

May 13, 2002, 2:16:00 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> Martin> Why does it help to have "UTF-16" to be a synonym to
> Martin> either "UTF-16BE" or "UTF-16LE", but not telling anybody
> Martin> what it is a synonym to?
>
> Ask whoever implemented a UTF-16 codec for python, not me. Evidently
> there's a good reason for it.

In Python codecs, UTF-16 is *not* a synonym for UTF-16LE or BE;
instead, it adds the BOM (which the other two don't). You were
suggesting to omit the BOM, so I asked how that would help.

> The fact is that the current implementation is just begging to produce
> broken output that will be invisible to anyone who has a Unicode-
> capable console. And that the only way to avoid it (without rewriting
> all the APIs to pass Unicode objects instead of pre-encoded strings)
> is really ugly code like the code I presented earlier.

No, that is not the only way. Just use UTF-16BE, and all will be fine.

Regards,
Martin
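
[Editor's note: the difference Martin describes is easy to verify on a modern CPython. The endian-specific codecs never emit a BOM, so fragments can be concatenated freely:

```python
# utf-16-be writes raw big-endian code units with no BOM ...
assert "a".encode("utf-16-be") == b"\x00a"

# ... so piecewise and whole-string encoding agree, unlike plain "utf-16":
joined = "a".encode("utf-16-be") + "b".encode("utf-16-be")
assert joined == "ab".encode("utf-16-be")
print(joined.decode("utf-16-be"))  # ab
```

Being explicit about endianness trades away the self-describing BOM for clean concatenation semantics.]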

Stephen J. Turnbull

May 13, 2002, 5:26:02 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> No, that is not the only way. Just use UTF-16BE, and all
Martin> will be fine.

If that were feasible, then there would be no UTF-16LE, no UTF-16, and
no BOM. Or at the very least, Python could get away with aliasing
UTF-16 to UTF-16BE on _all_ platforms.

Here's some more codec fun:

bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import codecs
>>> dir(codecs)
['BOM', 'BOM32_BE', 'BOM32_LE', 'BOM64_BE', 'BOM64_LE', 'BOM_BE',
'BOM_LE', ...]
# BOM64* ??!? Hmm
>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>>
# And howcum no BOM8, which actually has some basis in the standard?
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r")
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'

Submitted as request #555360 on the tracker, including the request for
a BOM8 constant and the wrong sizes (or names?) of BOM64 and BOM32.

Kragen Sitaker

May 14, 2002, 3:52:15 AM
mar...@v.loewis.de (Martin v. Loewis) writes:
> "Stephen J. Turnbull" <ste...@xemacs.org> writes:
> > That doesn't mean we think that Python should prohibit writing
> > programs in arbitrary user-specified encodings. Only that the
> > facility for transforming a non-Unicode program into Unicode should be
> > provided as a standard library facility, rather than part of the
> > language.
>
> I believe that you are still the only one who voices this specific
> position. More often, you find the position that Python source code
> should be restricted to UTF-8, period. . . .

> Apart from you, nobody else agrees with the approach "let's make it
> part of the library instead of part of the language". To most users,
> the difference appears not to matter (including myself, except that I
> think making it part of the language simplifies maintenance of the
> feature).

I don't fully understand all the issues here, but I don't think that
pointing out that Stephen is the only person who holds a particular
opinion necessarily suggests that he is wrong. I believe Stephen is
the only person here who regularly writes in a language that is
written in a non-Latin character set --- Japanese, in his case. Also,
although I am not certain of this, I think he has worked on the
internationalization support in XEmacs.

> I don't consider it evil to provide users with options: If UTF-8 is
> technically superior (which I agree it is), it will become the default
> text encoding of the future, anyway, with or without this PEP. Notice
> that the PEP slightly favours UTF-8 over other encodings, due to
> support of the UTF-8 signature.

About providing users with options --- is it possible that these
options could mean I couldn't recompile your Python code if I don't
have code to support the particular encoding you wrote it in? How
about cutting and pasting code between modules written in different
encodings, either in an editor that didn't support Unicode or didn't
support one of the encodings correctly?

About using "recode" to support existing e.g. ISO-8859-15 code. If I
am not mistaken, that code can presently only contain ISO-8859-15
inside of byte strings and Unicode strings. Python 2.1 seems to
assume ISO-8859-1 for Unicode string contents. Would it be sufficient
to recode the contents of Unicode strings?

Kragen Sitaker

May 14, 2002, 4:09:17 AM
Alex Martelli <al...@aleax.it> writes:
> This one person has had this dubious "pleasure" and loathes the idea
> with a vengeance. The very *IDEA* of cutting off the huge majority of
> programmers in the world, who don't understand Italian, from being
> able to understand and work with my code, is utterly abhorrent to me.

For any natural language X, it is the case that the huge majority of
people in the world do not understand X. As the population of
programmers becomes less Americocentric and less educated (that is, as
programming becomes easier and more useful), it is likely to be the
case that the majority of programmers in the world will not understand
English.

It is unfortunate that this state is abhorrent to you, but the current
state of programming --- confined to the elite --- is a greater evil.

> Now THAT is one niche where I'm glad that Italians' tendency to
> esterophily has prevailed -- all languages (Basic variants, etc) who
> perpetrated such horrors have died unmourned deaths. I may be

I agree that programming language keywords should not be localized;
the notations for iteration, conditionals, math, abstraction,
application, and so forth, should not vary by language. It is
perfectly acceptable for a person who does not speak English to learn
"if", "for", "except", and so forth, in order to speak Python; the
vocabulary is quite small. It is no different from American musicians
having to learn "allegro", "D.C. al fine", and "tremolo" --- it simply
doesn't add significantly to the difficulty of the notation.

But variable and function names belong to the programmer and the
program's audience, not the notation, and should be written in the
language that affords these people the most expressive power.

Martin v. Löwis

May 14, 2002, 4:13:24 AM
Kragen Sitaker <kra...@pobox.com> writes:

> I don't fully understand all the issues here, but I don't think that
> pointing out that Stephen is the only person who holds a particular
> opinion necessarily suggests that he is wrong.

I'm not suggesting that he is 'wrong'; this specific question (how to
deal with source code encodings in programming languages) is not one
that has a single objective 'right' answer.

Instead, it is a matter of judgement, based on criteria which might
be both technical and political. I'm just suggesting that few people
seem to have the same criteria, or, at least when applying them to the
specific question, come to the same conclusion.

> I believe Stephen is the only person here who regularly writes in a
> language that is written in a non-Latin character set --- Japanese,
> in his case. Also, although I am not certain of this, I think he
> has worked on the internationalization support in XEmacs.

Yes, I appreciate all that.

> About providing users with options --- is it possible that these
> options could mean I couldn't recompile your Python code if I don't
> have code to support the particular encoding you wrote it in?

Yes, that is the case.

> How about cutting and pasting code between modules written in
> different encodings, either in an editor that didn't support Unicode
> or didn't support one of the encodings correctly?

That is completely a matter of your editor. If the editor doesn't
support one of your encodings, it cannot display the source code
correctly.

If so, there is a good chance that it couldn't display the source code
correctly even if it had a different encoding.

For IDLE, if the source is displayed correctly, you will certainly be
able to copy arbitrary text. You may not be able to save the file in
the specified encoding then, anymore, if you paste text that cannot be
represented in that encoding.

> About using "recode" to support existing e.g. ISO-8859-15 code. If I
> am not mistaken, that code can presently only contain ISO-8859-15
> inside of byte strings and Unicode strings. Python 2.1 seems to
> assume ISO-8859-1 for Unicode string contents. Would it be sufficient
> to recode the contents of Unicode strings?

I don't think I understand the question. Are you talking about the GNU
recode utility?

Python code can contain non-ASCII in byte strings literals, Unicode
string literals, and comments. For recoding, all of those places need
to be recoded, or else no editor in the world will be able to display
the file correctly.

Regards,
Martin

Alex Martelli

May 14, 2002, 5:30:59 AM
Kragen Sitaker wrote:

> Alex Martelli <al...@aleax.it> writes:
>> This one person has had this dubious "pleasure" and loathes the idea
>> with a vengeance. The very *IDEA* of cutting off the huge majority of
>> programmers in the world, who don't understand Italian, from being
>> able to understand and work with my code, is utterly abhorrent to me.
>
> For any natural language X, it is the case that the huge majority of
> people in the world do not understand X. As the population of

An interesting assertion, for which I'd like you to bring some supporting
statistics. What proportion of literate human beings does not understand
English (including as a 2nd and 3rd language)? I can't find convincing
statistics on the net, only suggestive information at the anecdotal level,
e.g., that English is used almost exclusively in India, the most populous
country in the world, for communication involving more than one of the
country's states/regions -- but I have no idea about what proportion of
Indian citizens are ever involved in such communications, versus those
who spend all of their life in or near their village and need never worry
about communication with connationals from elsewhere (and of the latter,
how many are literate? we can't really count illiterates as candidate
programmers, I think -- despite all of the "point and grunt" rhetoric).

It appears to me that, not just in India, but in other populous countries
with a huge variety of mother-tongues, English is preferred, as being more
politically and culturally neutral than the mother-tongue of the dominant
region or tribe, for inter-regional and inter-tribal communication by people
belonging to other tribes or regions (of course, for the same reason it may
be fought against by people who do belong to the dominant region or tribe).
But here, too, judging what proportion of a region's population is ever
involved with any interregional communication at all seems difficult.

And, how are the trends? Again I have no objective basis for judgment,
only guesses based on biased personal observation and anecdotes. When I
interact with recent immigrants and refugees into Italy, mostly from
Northern Africa, the Balcans, and the Near and Middle East, I seem to
have an easier time communicating with them in English than in Italian
(or French, which also has some use for that) -- almost as if those people
had striven to acquire some elements of English more than of Italian,
even though they were going to come here (of course, I speak no Arabic,
no Slavic tongue, no Albanian, no Turkish, nor Kurdish -- should I acquire
all of them, communication with immigrants and refugees would no doubt
be easier -- but, as most Western Europeans share my limitations on
linguistic proficiency and have even less motivation than I do to remedy
them, the role of English as common grounds for such communication would
overall remain, no matter what my own studies could be). Of course, one
could argue that the masses of immigrants and refugees *are* "elite", no
matter how hard it seems to view them this way -- e.g., average number
of years of schooling for (say) Tunisians who emigrate to Europe is
higher (in a statistically significant way) than for Tunisians who stay
in Tunisia (the "real elite", the tiny crust of a poor country's people
who have substantial power and wealth there, are modestly motivated to
flee their country; but the poorest of the poor may even lack the
resources to finance emigration at all) -- this, I have heard, is the
most substantial difference between the current waves of migration and
previous ones (e.g., the millions of Italians who migrated in the past
didn't tend to have statistically-significant differences in schooling
wrt those who stayed).

I do see a burning urge for some English education at all levels around
me. One of the promises who swept our current government into power was
to meet this demand (one of the pillars in their electoral campaign was
to push "le tre 'i': Inglese, Internet, Impresa" -- English, Internet,
Entrepreneurship -- however empty and populistic you may judge these
electoral promises, do notice that English was in FIRST place). Qualified
teachers of French, Spanish, German, etc, go begging -- well over 90%
of students in our schools demand English as the foreign tongue to be
taught. You can hardly read Italian newspapers any more without a
modicum of English vocabulary -- as my father, 80 years old and knowing
French and German (as well as Italian) but not English, often bitterly
remarks (he's _particularly_ bitter at the prevalence of English in his
profession, medicine, where he's still active as a university professor
and consultant). Most often, English words and phrases are introduced
to replace perfectly good, usable, and widely used Italian equivalents,
rather than to tag new concepts. You hardly hear anybody any more
talking about the "allenatore" of a football team, for example -- it's
invariably the "trainer". Ferrari dominates Formula 1 racing, but you
never hear anybody any more talking about them as a "squadra" -- it's
*always* "team". Youth in the street may protest peacefully against
globalization, or not so peacefully at all, but the key words in their
protest in either case are never "no al globale", but rather "no global",
never "niente marchi", but "no logo". More moderate left, defeated in
last year's elections and still considering how to rebuild a political
strategy, is not proposing a new "partito dei lavoratori" or even a
"partito laburista", but directly a "Labour Party". The growing tide
of English words is absolutely transversal across political nuances,
fields of endeavour, cultural levels, socio-economic classes.

Italy is, admittedly, peculiarly prone to esterophily. Even before
"team" entered the local linguistic arena, you were as likely to hear
"equipe" as "team". But in recent decades the focus has switched
entirely to English, and the pace accelerates, it appears to me. Sure,
considering pro-capita income, Italy, and all of Western Europe, is
no doubt "elite", in a world-wide sense. But so are cultures that
are as peculiarly insular as Italy is open to the world, such as Japan.
It seems to me that this fight is far from decided.

> programmers becomes less Americocentric and less educated (that is, as
> programming becomes easier and more useful), it is likely to be the
> case that the majority of programmers in the world will not understand
> English.

It does not seem to me that you've made a case for the trend being
in this direction. Again judging from local trends, a far higher
percentage of programmers have _some_ (often modest) grasp of English
today, than was the case a few decades ago, where foreign languages
were not taught in _all_ schools (now they're taught starting from
the first year in school) and what was taught was more often French
than English (German and Spanish were always marginal in our schools,
and this hasn't changed much with time). Today I think it would be
laughable, unthinkable to launch an "italianized" language, as many
local suppliers (mostly Olivetti, now out of the computer market)
repeatedly attempted (with signal failures) decades ago.

> It is unfortunate that this state is abhorrent to you, but the current
> state of programming --- confined to the elite --- is a greater evil.

We're not really talking of current states so much as trends towards
the future. I don't think there can be a greater evil than promoting
the division of the world into pieces unable to communicate with each
other, to understand each other at some level, and therefore to
cooperate peacefully and fruitfully. Programming is cool, but peace
and prosperity require mutual understanding and collaboration even
more than they require programming. I don't believe we have to choose,
but, if we did, I would have no hesitation choosing to promote an
effort to help people communicate with each other and share the
results of their efforts, over any effort pushing the other way, even
if the latter was touted as promoting even wider access to programming.


> But variable and function names belong to the programmer and the
> program's audience, not the notation, and should be written in the
> language that affords these people the most expressive power.

Anybody is fully entitled to make his or her own choices about which natural
language to use for each form of his or her expression. But
I hope that Python never actively *helps* people choose insularity and
isolation from the world in preference to openness and sharing.


Alex

Jacob Hallen

May 14, 2002, 7:10:12 AM
I am Swedish and English is not my first language.

My view is that Python source code should be UTF-8, so that you can represent
multilingual strings in a readable way. However, I still think that
identifiers should be limited to ASCII.
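This preference (UTF-8 source, ASCII identifiers) is close to what PEP 263 proposes: a coding declaration in the first or second line of the file tells the tokenizer how to decode it, so multilingual strings stay readable in the source while identifiers remain plain ASCII. A minimal sketch of such a source file:

```python
# -*- coding: utf-8 -*-
# The declaration above names the encoding of this source file, so the
# tokenizer can decode non-ASCII bytes in string literals and comments...
saluto = "perché"          # ...while the identifier itself stays ASCII
assert saluto == "perch\xe9"
```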

Just like music score is the common language for written music, English
based programming languages have become the common base for programming.
Just like you have to learn how to read music score (unless you have a
perfect memory for tunes) to perform other people's music, you need to learn
basic English in order to make your programs readable by others and be
able to read other people's code.

I understand the attraction of using your native language for identifiers
and comments, but it is really the dark side of the source.

Jacob Hallén


--

Alex Martelli

May 14, 2002, 9:20:03 AM
Jacob Hallen wrote:

> I am Swedish and English is not my first language.
>
> My view is that Python source code should be UTF-8, so that you can
> represent multilingual strings in a readable way. However, I still think
> that identifiers should be limited to ASCII.

I agree entirely with all of this (even though I'm Italian, not Swedish).

> I understand the attraction of using your native language for identifiers
> and comments, but it is really the dark side of the source.

I don't particularly mind about comments. By far most comments I've read
in programs throughout my life were bogus anyway -- redundant reminders
of what language rules mandate anyway, or obsolete and thus actually
misleading rather than helpful. I wouldn't mind a language-savvy editor's
option to hide or remove all comments - yes, I'd lose something when the
comments are actually up to date AND informative about design intentions,
but all in all I think I'd break even at worst. OTOH, I think identifiers
have a better track record. Yes, a fraction of them are unhelpful or (more
rarely) actively misleading -- more often, however, I find them quite
informative and helpful in understanding what's going on in code.


Alex

gbr...@cix.compulink.co.uk

May 14, 2002, 9:31:12 AM
Kragen Sitaker wrote:

> I agree that programming language keywords should not be localized;
> the notations for iteration, conditionals, math, abstraction,
> application, and so forth, should not vary by language. It is
> perfectly acceptable for a person who does not speak English to learn
> "if", "for", "except", and so forth, in order to speak Python; the
> vocabulary is quite small. It is no different from American musicians
> having to learn "allegro", "D.C. al fine", and "tremolo" --- it simply
> doesn't add significantly to the difficulty of the notation.

I disagree. I wouldn't object if a language used "si" or "weil" instead
of "if". But I sure as heck wouldn't want to use a Chinese character. No
matter how good a programming language is, if it requires the use of
Chinese characters I'm not touching it. I wouldn't expect a monolingual
Chinese speaker to feel any better about Python. Remember the subject is
"multibyte character support" not "alternative European code page
support".

> But variable and function names belong to the programmer and the
> program's audience, not the notation, and should be written in the
> language that affords these people the most expressive power.

Yes, but you can write any language using the roman alphabet. If you can
learn to use that alphabet for the keywords, you can translate variable
names as well. It's only a matter of convenience, or for speakers of
European languages that use accented characters.

Is it such a big problem to lose the accents? You still have to deal with
a standard library built around English. And there are all kinds of
problems that arise when you use arbitrary character sets. For example
(hoping these come out right), à and á can look similar from a distance, as can
"Latin Small Letter A With Macron". Would you feel confident
distinguishing ã and ä on a low resolution monitor? What happens if you
receive code that uses a character set you don't have a font for? If you
look through some Unicode tables you'll see characters that look
identical, in some cases are defined to be identical. Does the
interpreter have to keep a lookup table of equivalences? How does it know
what constitutes a "letter" in the first place?
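For what it's worth, later Python versions answered these questions concretely: PEP 3131 defines identifier characters by their Unicode properties and normalizes every identifier to NFKC, so canonically equivalent spellings name the same thing. The machinery is visible through the `unicodedata` module:

```python
import unicodedata

# Two renderings of "café" that look identical on screen: one with the
# precomposed letter U+00E9, one with 'e' plus a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

assert composed != decomposed          # distinct code point sequences
# NFKC normalization (the form PEP 3131 applies to identifiers) unifies them:
assert unicodedata.normalize("NFKC", decomposed) == composed
```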

I don't know if it's for English speakers to comment on, but I feel uneasy
about such a change. If the parser could recognise arbitrary characters,
the regular expressions knew what a letter was independent of locale and
Unicode strings could be reliably compared then at least the
implementation would be easy. But I can see people shooting themselves in
the foot as easily as they do with pointer arithmetic. Still, write a PEP
if you know exactly what you want. I could sleep much easier knowing such
a proposal had been definitively rejected.


Graham

Emile van Sebille

May 14, 2002, 9:32:36 AM
"Alex Martelli"

> > For any natural language X, it is the case that the huge majority of
> > people in the world do not understand X. As the population of
>
> An interesting assertion, for which I'd like you to bring some supporting
> statistics. What proportion of literate human beings does not understand
> English (including as a 2nd and 3rd language)?

Well, 0%, for some definitions of literate <wink>.

> I can't find convincing statistics on the net, only suggestive
> information on anecdotical level,

This looks like good information:

http://www.cis.org/articles/1996/English.html

[snip]


> The growing tide
> of English words is absolutely transversal across political nuances,
> fields of endeavour, cultural levels, socio-economic classes.

What goes around, comes around. ;-) English has more words than other
languages which makes it easier to share.

>
> Italy is, admittedly, peculiarly prone to esterophily.

Although this is a new one for me (and m-w.com and onelook.com). It's
probably a typo, but as I don't know/recognize the word, I'm not sure
how to look it up. ;-(

> Even before
> "team" entered the local linguistic arena, you were as likely to hear
> "equipe" as "team". But in recent decades the focus has switched
> entirely to English, and the pace accelerates, it appears to me.

Philip Williams, of ABC news, called English "... a language that
invades foreign lands quicker than foot-and-mouth" and the French
Ministry of Education says to its students "Choose another language --
Dutch, Italian, German, Icelandic, Congolese, anything but English."
Or, do like the French, and regularly expunge 'profanities' from the
official language, not that anybody regularly *speaks* the official
language. Or like Iceland, which rightly sets the blame at Bill Gates'
doorstep. ;-)
http://seattletimes.nwsource.com/news/technology/html98/icel_063098.html

--

Emile van Sebille
em...@fenx.com


A.Schmolck

May 14, 2002, 10:15:44 AM
ja...@boris.cd.chalmers.se.cd.chalmers.se (Jacob Hallen) writes:

> I am Swedish and English is not my first language.
>
> My view is that Python source code should be UTF-8, so that you can represent
> multilingual strings in a readable way. However, I still think that
> identifiers should be limited to ASCII.
>
> Just like music score is the common language for written music, English
> based programming languages have become the common base for programming.
> Just like you have to learn how to read music score (unless you have a
> perfect memory for tunes) to perform other peoples music, you need to learn
> basic English in order to make your programs readable by others and be
> able to read other peoples code.

I think this analogy limps a bit. People all over the world *do* perform other
people's music without having learned (western) musical scores. Indeed, it
would be a major cultural catastrophe, were it otherwise. Also, the amount of
effort that is required for someone with a sufficiently different language
background to learn English well enough to come up with good interface names
and documentation is in no way comparable to the amount of effort involved in
learning to read musical scores. And to say that those who haven't enough
English (or need to name things for which there are no English words) should
then at least stick to the 26 alphabetic characters is like telling English
users of FOO to use Chinese, or at least transliterate into Chinese
characters, because Chinese is *the* language to do FOO in.

People might be willing to do this if FOO is a really big thing in their lives
but not if they just think FOO sounds like an interesting thing and they want
to learn about it, or FOO might help them with some other problems.

Would you tell an American kid interested in learning FOO to go and learn
Chinese first? Even if FOO had nothing to do with China and Chinese culture as
such?

>
> I understand the attraction of using your native language for identifiers
> and comments, but it is really the dark side of the source.

Yes, there *are* big advantages associated with sticking with English for all
code, but you have to acknowledge that these advantages already come
comparatively free to you (and me, and Alex Martelli), since we happen to have
mastered English to a considerable extent (and it wasn't all that difficult,
given that Swedish, Italian, German and English are not all that
different). To many people outside Europe this doesn't apply, not even in a
rich country with an excellent education system such as Japan (one of the
reasons, I suspect, why Sun was clever enough to indulge the Japanese a bit
with Japanese-language docs and the like). If Python strives to bring
programming to the masses, and those masses are not all situated in Europe,
America (and a few select ex-colonies etc.), then Unicode strings might not be
enough, so I think it's necessary to think about which audience one wants to
accommodate and at whose expense.

alex

P.S.: I will also admit a slight fancy for greek identifiers in math code :)

Bengt Richter

May 14, 2002, 3:09:17 PM
On 14 May 2002 10:13:24 +0200, loe...@informatik.hu-berlin.de (Martin v. Löwis) wrote:
>
>Python code can contain non-ASCII in byte strings literals, Unicode
>string literals, and comments. For recoding, all of those places need
>to be recoded, or else no editor in the world will be able to display
>the file correctly.
>
From the above, a Python source seems to be what might be called a multi-encoded file.
ISTM a grammar defining the composition of a multi-encoded file would
make things a lot clearer. I know this doesn't represent current practice
(I'm just trying to prime the pump ;-) but, what if we had a grammar something like, e.g.,:

multi_encoded_file: encoded_string_packet* [bom_headed_unicode_string] ENDMARKER
encoded_string_packet: encoded_string_packet_header string_body
encoded_string_packet_header: '<' encoding_id ',' packet_body_length '>'
encoding_id: 'byte' | 'ascii' | 'unicode' | 'Latin-1' | etc...
packet_body_length: digit+
string_body: byte* | bom_headed_unicode_string
bom_headed_unicode_string: #<a byte sequence with any supported encoding and endianness>
etc...

The above effectively specifies a file as a raw byte sequence that can be interpreted
as variously encoded strings. This raw byte string file can have alternate various
representations for display and editing purposes. A smart editor could deal with the
various pieces like winword treats embedded graphics and tables, etc., but that is
another discussion. For the moment, I'd just like to elicit a grammar representing
current practice for Python.
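The header syntax above is purely hypothetical, of course, but a reader for such packets might be sketched as follows (the '<encoding,length>' header format and the encoding names are taken from the proposed grammar, not from anything Python actually supports):

```python
import re

# Matches the hypothetical packet header '<encoding_id,packet_body_length>'.
HEADER = re.compile(rb"<([A-Za-z0-9-]+),(\d+)>")

def read_packets(data):
    """Yield (encoding_id, text) pairs from a multi-encoded byte string."""
    pos = 0
    while pos < len(data):
        m = HEADER.match(data, pos)
        if m is None:
            raise ValueError("malformed packet header at offset %d" % pos)
        encoding = m.group(1).decode("ascii")
        length = int(m.group(2).decode("ascii"))
        body = data[m.end():m.end() + length]
        # 'byte' packets stay raw; anything else is decoded per its header
        yield encoding, body if encoding == "byte" else body.decode(encoding)
        pos = m.end() + length

packets = b"<ascii,5>hello<latin-1,4>caf\xe9"
decoded = list(read_packets(packets))  # [('ascii', 'hello'), ('latin-1', 'café')]
```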

This grammar is, or IMO should be, totally orthogonal to what the purpose of the text is,
whether Python source or a Chinese novel. Of course a particular use will imply constraints
on instance structure, but the grammar per se should not be affected.

For the case of Python, it would help me (and probably others) if you would sketch a version
of the above that reflects current design in the various representations of text used by Python.
I.e., source file, and in-memory string representations, etc.

I think it is good to remember that a Python program is (or at least I consider it as such)
an abstract entity first and variously represented second. Abstract token sequences and
visible glyph sequences and binary coded representations all have roles, but it is easy
to smear the distinctions when thinking about them. Localization should IMO not alter
abstract semantics. The possibility of dynamically generating source text and eval- or
exec-ing it is something to consider too.

Regards,
Bengt Richter

Martin v. Loewis

May 14, 2002, 4:51:12 PM
bo...@oz.net (Bengt Richter) writes:

> ISTM a grammar defining the composition of a multi-encoded file would
> make things a lot clearer.

What editor supports this kind of format?

> I think it is good to remember that a Python program is (or at least
> I consider it as such) an abstract entity first and variously
> represented second.

While this is true, a Python source code file is something very
specific, not something abstract.

> Abstract token sequences and visible glyph sequences and binary
> coded representations all have roles, but it is easy to smear the
> distinctions when thinking about them. Localization should IMO not
> alter abstract semantics.

And indeed, it doesn't - the byte code format is not at all affected
by the PEP.

> The possibility of dynamically generating source text and eval- or
> exec-ing it is something to consider too.

For that, I recommend using Unicode objects - those don't have any
encoding issues.
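In today's Python, where every string is a Unicode object, that recommendation amounts to:

```python
# Generating source text as a Unicode string and exec-ing it sidesteps byte
# encodings entirely; an encoding only matters once the source hits a file.
source = u'greeting = "\u4f60\u597d"\n'   # assigns the Chinese for "hello"
namespace = {}
exec(source, namespace)
assert namespace["greeting"] == u"\u4f60\u597d"
```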

Regards,
Martin

Greg Ewing

May 14, 2002, 10:35:27 PM
Alex Martelli wrote:
>
> we can't really count illiterates as candidate
> programmers, I think -- despite all of the "point and grunt" rhetoric).

But haven't you heard of the CP4P&G initiative (Computer
Programming for Pointers and Grunters)? And the working
subcommittee for establishing a point-and-grunt encoding
for Unicode is due to report any day now...

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
To get my email address, please visit my web page:
http://www.cosc.canterbury.ac.nz/~greg

Stephen J. Turnbull

May 15, 2002, 3:08:05 AM
>>>>> "Alex" == Alex Martelli <al...@aleax.it> writes:

Alex> OTOH, I think identifiers have a better track record. Yes,
Alex> a fraction of them are unhelpful or (more rarely) actively
Alex> misleading -- more often, however, i find them quite
Alex> informative and helpful in understanding what's going on in
Alex> code.

In fact, I suspect you're saying something stronger than "more often,"
more like "the great majority." However, this is less true in code
written by Japanese; it is fairly often true that English vocabulary
plus Japanese syntax leads to something actively misleading. I agree
with you about hiding comments, but occasionally such identifiers are
disentangled by seeing a more extensive comment written in the same
non-English syntax.

I suspect that your argument about usefulness of English identifiers
is not at all robust outside of native speakers of European languages.

Martin v. Loewis

May 15, 2002, 4:27:54 AM
"Stephen J. Turnbull" <ste...@xemacs.org> writes:

> In fact, I suspect you're saying something stronger than "more often,"
> more like "the great majority." However, this is less true in code
> written by Japanese; it is fairly often true that English vocabulary
> plus Japanese syntax leads to something actively misleading. I agree
> with you about hiding comments, but occasionally such identifiers are
> disentangled by seeing a more extensive comment written in the same
> non-English syntax.

Out of curiosity: can you point to source code written in this style
(not necessarily by your students)?

Regards,
Martin

Stephen J. Turnbull

May 15, 2002, 5:35:55 AM
>>>>> "Martin" == Martin v Loewis <mar...@v.loewis.de> writes:

Martin> Out of curiosity: can you point to source code written in
Martin> this style (not necessarily by your students)?

Note that it's not a "style." These are misunderstandings, they occur
more or less randomly. The point is that they occur frequently enough
to be troublesome.

The source is Mule Elisp add-ons in XEmacs. The core Mule code is
pretty clean, but that's mostly originally written by one man
(Ken'ichi Handa), and hacked on by lots of non-Japanese. The input
method code for Japanese, Wnn (the XEmacs egg-its package) and SJ3,
can be pretty horrible. The Canna people just gave up, most of the
variable and function names are at least partly romanized Japanese.

Now that you ask, I realize this is a pretty biased sample. A lot of
those names have to do with things that are unique to Japanese
grammar. So it's not surprising at all that they're thinking in
Japanese in those programs.

Sigvaldi Eggertsson

May 15, 2002, 7:23:22 AM
"Emile van Sebille" <em...@fenx.com> wrote in message news:<Ul8E8.20960$L76.1829@rwcrnsc53>...

> Philip Williams, of ABC news, called English "... a language that
> invades foreign lands quicker than foot-and-mouth" and the French
> Ministry of Education says to its students "Choose another language --
> Dutch, Italian, German, Icelandic, Congolese, anything but English."
> Or, do like the French, and regularly expunge 'profanities' from the
> official language, not that anybody regularly *speaks* the official
> language. Or like Iceland, which rightly sets the blame at Bill Gates'
> doorstep. ;-)
> http://seattletimes.nwsource.com/news/technology/html98/icel_063098.html

When the article was published it was already out of date, Microsoft
put out an Icelandic version of Windows 98 just after this was
written.
