To unicode or not to unicode

Ron Garret

Feb 19, 2009, 9:57:13 PM
I'm writing a little wiki that I call µWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki). It's working, except
that I can't actually enter the name of the wiki into the wiki itself
because the default unicode encoding on my Python installation is
"ascii". So I'm trying to decide on a course of action. There seem to
be three possibilities:

1. Change the code to properly support unicode. Preliminary
investigations indicate that this is going to be a colossal pain in the
ass.

2. Change the default encoding on my Python installation to be latin-1
or UTF8. The disadvantage to this is that no one else will be able to
run my code without making the same change to their installation, since
you can't change default encodings once Python has started.

3. Punt and spell it 'uwiki' instead.

I'm feeling indecisive so I thought I'd ask other people's opinion.
What should I do?

rg

Benjamin Peterson

Feb 19, 2009, 10:21:57 PM
to pytho...@python.org
Ron Garret <rNOSPAMon <at> flownet.com> writes:

>
> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
> mu at the beginning (it's pronounced micro-wiki). It's working, except
> that I can't actually enter the name of the wiki into the wiki itself
> because the default unicode encoding on my Python installation is
> "ascii". So I'm trying to decide on a course of action. There seem to
> be three possibilities:

You should never have to rely on the default encoding. You should explicitly
decode and encode data.
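
A minimal sketch of that explicit style, written for Python 3 (the thread dates from Python 2, where the same idea applies to `str` and `unicode`). The helper names `from_wire` and `to_wire` and the choice of UTF-8 are assumptions made for illustration; they do not come from the original posts.

```python
# Decode bytes at the boundary, work with text inside, encode on the way out.
# from_wire/to_wire are hypothetical helper names; UTF-8 is an assumed choice.

def from_wire(raw, encoding="utf-8"):
    """Turn bytes received from outside into text, naming the encoding."""
    return raw.decode(encoding)


def to_wire(text, encoding="utf-8"):
    """Turn text back into bytes for storage or transmission."""
    return text.encode(encoding)


raw_title = b"\xc2\xb5Wiki"          # UTF-8 bytes for MICRO SIGN + "Wiki"
title = from_wire(raw_title)         # text, independent of any default
round_trip = to_wire(title)          # same bytes again
```

Because every conversion names its encoding, the interpreter's default encoding never comes into play.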

>
> 1. Change the code to properly support unicode. Preliminary
> investigations indicate that this is going to be a colossal pain in the
> ass.

Properly handling unicode may be painful at first, but it will surely pay off in
the future.


Thorsten Kampe

Feb 20, 2009, 12:54:11 PM
* Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
> mu at the beginning (it's pronounced micro-wiki).

No, it's not. I suggest you start your Unicode adventure by configuring
your newsreader.

Thorsten

MRAB

Feb 20, 2009, 1:08:18 PM
to Python List
It looked like mu to me, but you're correct: it's "MICRO SIGN", not
"GREEK SMALL LETTER MU".
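
The distinction can be checked with the standard `unicodedata` module. This is a small sketch for Python 3; the variable names are only illustrative.

```python
import unicodedata

micro = "\u00b5"   # the character used in the wiki's name
mu = "\u03bc"      # the Greek letter it resembles

for ch in (micro, mu):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# They render alike but are different code points, so they compare unequal.
same = (micro == mu)
```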

Ron Garret

Feb 20, 2009, 2:21:02 PM
In article <mailman.373.12351532...@python.org>,
MRAB <goo...@mrabarnett.plus.com> wrote:

Heh, I didn't know that those two things were distinct. Learn something
new every day.

rg

"Martin v. Löwis"

Feb 20, 2009, 3:05:04 PM
to MRAB

I don't think that was the complaint. Instead, the complaint was
that the OP's original message did not have a Content-type header,
and that it was thus impossible to tell what the byte in front of
"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
MU in a usenet or email message, you really must use MIME. (As both
your article and Thorsten's did, by choosing UTF-8)

Regards,
Martin

P.S. The difference between MICRO SIGN and GREEK SMALL LETTER MU
is nit-picking, IMO:

py> import unicodedata
py> unicodedata.name(unicodedata.normalize("NFKC", u"\N{MICRO SIGN}"))
'GREEK SMALL LETTER MU'

Ron Garret

Feb 20, 2009, 3:19:13 PM
In article <499F0CF0...@v.loewis.de>,
"Martin v. Löwis" <mar...@v.loewis.de> wrote:

> MRAB wrote:
> > Thorsten Kampe wrote:
> >> * Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
> >>> I'm writing a little wiki that I call µWiki. That's a lowercase
> >>> Greek mu at the beginning (it's pronounced micro-wiki).
> >>
> >> No, it's not. I suggest you start your Unicode adventure by
> >> configuring your newsreader.
> >>
> > It looked like mu to me, but you're correct: it's "MICRO SIGN", not
> > "GREEK SMALL LETTER MU".
>
> I don't think that was the complaint. Instead, the complaint was
> that the OP's original message did not have a Content-type header,

I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
configured properly, but I guess I didn't. Under
Preferences->Languages->Send Messages with Encoding I had selected
latin-1. I didn't know I also needed to have MIME turned on for that to
work. I've turned it on now. Is this better?

This should be a micro sign: µ

rg

"Martin v. Löwis"

Feb 20, 2009, 3:41:18 PM
to Ron Garret
Ron Garret wrote:
> In article <499F0CF0...@v.loewis.de>,
> "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>
>
> I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
> configured properly, but I guess I didn't.

Probably you did. However, it then means that the newsreader is crap.

> Under
> Preferences->Languages->Send Messages with Encoding I had selected
> latin-1.

That sounds like early nineties, before the invention of MIME.

> I didn't know I also needed to have MIME turned on for that to
> work. I've turned it on now. Is this better?
>
> This should be a micro sign: µ

Not really (it's worse, from my point of view - but might be better
for others). You are now sending in UTF-8, but there is still no
MIME declaration in the news headers. As a consequence, my newsreader
continues to interpret it as Latin-1 (which it assumes as the default
encoding), and it comes out as mojibake (in responding, my reader
should declare the encoding properly, so you should see what I see,
namely A-circumflex, micro sign)
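
The effect described here can be reproduced in a few lines. A sketch for Python 3, with illustrative variable names: the sender encodes as UTF-8, the reader decodes as Latin-1.

```python
text = "\u00b5"                  # MICRO SIGN
sent = text.encode("utf-8")      # what a UTF-8 sender puts on the wire
seen = sent.decode("latin-1")    # what a Latin-1 reader makes of those bytes

print(sent)          # the two UTF-8 bytes
print(ascii(seen))   # two characters: A-circumflex followed by micro sign
```

Nothing raises an error along the way, which is why the problem only shows up on the reader's screen.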

If you look at the message headers / message source as sent e.g.
by MRAB, you'll notice lines like

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

These lines are missing from your posting.
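
For comparison, Python's standard `email` package produces the same three declarations when asked to. This is only a sketch of what declaring the charset looks like, assuming Python 3.6 or later; a newsreader builds its headers its own way, and the exact rendering of the `Content-Type` value (quoting, case) may differ from the lines above.

```python
from email.message import EmailMessage

msg = EmailMessage()
# charset and cte are chosen here to match the headers quoted above.
msg.set_content("This should be a micro sign: \u00b5",
                charset="utf-8", cte="8bit")

for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print("%s: %s" % (name, msg[name]))
```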

Assuming the newsreader is not crap, it might help to set the default
send encoding to ASCII. When sending micro sign, the newsreader might
infer that ASCII is not good enough, and use MIME - although it then
still needs to pick an encoding.

Regards,
Martin

Ross Ridge

Feb 21, 2009, 12:22:36 PM
=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?= <mar...@v.loewis.de> wrote:
>I don't think that was the complaint. Instead, the complaint was
>that the OP's original message did not have a Content-type header,
>and that it was thus impossible to tell what the byte in front of
>"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
>MU in a usenet or email message, you really must use MIME. (As both
>your article and Thorsten's did, by choosing UTF-8)

MIME only applies to Internet e-mail messages. RFC 1036 doesn't require
nor give a meaning to a Content-Type header in a Usenet message, so
there's nothing wrong with the original poster's newsreader.

In any case what the original poster really should do is come up with
a better name for his program.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rri...@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //

Thorsten Kampe

Feb 21, 2009, 1:20:12 PM
* Ross Ridge (Sat, 21 Feb 2009 12:22:36 -0500)

> =?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?= <mar...@v.loewis.de> wrote:
> >I don't think that was the complaint. Instead, the complaint was
> >that the OP's original message did not have a Content-type header,
> >and that it was thus impossible to tell what the byte in front of
> >"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
> >MU in a usenet or email message, you really must use MIME. (As both
> >your article and Thorsten's did, by choosing UTF-8)
>
> MIME only applies to Internet e-mail messages.

No, it doesn't: "MIME's use, however, has grown beyond describing the
content of e-mail to describing content type in general. [...]

The content types defined by MIME standards are also of importance
outside of e-mail, such as in communication protocols like HTTP [...]"

http://en.wikipedia.org/wiki/MIME

> RFC 1036 doesn't require nor give a meaning to a Content-Type header
> in a Usenet message

Well, /maybe/ the reason for that is that RFC 1036 was written in 1987
and the first MIME RFC in 1992...? The "Son of RFC 1036" mentions MIME
more often than you can count.

> so there's nothing wrong with the original poster's newsreader.

If you follow RFC 1036 (which was written before anyone even thought of
MIME) then all content has to be ASCII. The OP used non-ASCII letters.

It's all about declaring your charset. In Python as well as in your
newsreader. If you don't declare your charset it's ASCII for you - in
Python as well as in your newsreader.

Thorsten

Ross Ridge

Feb 21, 2009, 2:52:09 PM
Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>> RFC 1036 doesn't require nor give a meaning to a Content-Type header
>> in a Usenet message
>
>Well, /maybe/ the reason for that is that RFC 1036 was written in 1987
>and the first MIME RFC in 1992...?

Obviously.

>"Son of RFC 1036" mentions MIME more often than you can count.

Since it was never submitted and accepted, RFC 1036 remains current.

>> so there's nothing wrong with the original poster's newsreader.
>
>If you follow RFC 1036 (which was written before anyone even thought of
>MIME) then all content has to be ASCII. The OP used non-ASCII letters.

RFC 1036 doesn't place any restrictions on the content on the body of
an article. On the other hand "Son of RFC 1036" does have restrictions
on characters used in the body of message:

Articles MUST not contain any octet with value exceeding 127,
i.e. any octet that is not an ASCII character

Which means that merely adding a Content-Encoding header wouldn't
be enough to conform to "Son of RFC 1036", the original poster would
also have had to either switch to a 7-bit character set or use a 7-bit
compatible transfer encoding. If you're trying to claim that "Son of RFC
1036" is the new de facto standard, then that would mean your newsreader
is broken too.

>It's all about declaring your charset. In Python as well as in your
>newsreader. If you don't declare your charset it's ASCII for you - in
>Python as well as in your newsreader.

Except in practice unlike Python, many newsreaders don't assume ASCII.
The original article displayed fine for me. Google Groups displays it
correctly too:

http://groups.google.com/group/comp.lang.python/msg/828fefd7040238bc

I could just as easily argue that assuming ISO 8859-1 is the de facto
standard, and that it's your newsreader that's broken. The reality however
is that RFC 1036 is the only standard for Usenet messages, de facto or
otherwise, and so there's nothing wrong with anyone's newsreader.

Thorsten Kampe

Feb 21, 2009, 4:05:39 PM
* Ross Ridge (Sat, 21 Feb 2009 14:52:09 -0500)

> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>> It's all about declaring your charset. In Python as well as in your
>> newsreader. If you don't declare your charset it's ASCII for you - in
>> Python as well as in your newsreader.
>
> Except in practice unlike Python, many newsreaders don't assume ASCII.

They assume ASCII - unless you declare your charset (the exception being
Outlook Express and a few Windows newsreaders). Everything else is
"guessing".

> The original article displayed fine for me. Google Groups displays it
> correctly too:
>
> http://groups.google.com/group/comp.lang.python/msg/828fefd7040238bc

Your understanding of the principles of Unicode is at least as
nonexistent as the OP's.



> I could just as easily argue that assuming ISO 8859-1 is the de facto
> standard, and that it's your newsreader that's broken.

There is no "standard" in regard to guessing (this is what you call
"assuming"). The need for explicit declaration of an encoding is exactly
the same in Python as in any Usenet article.

> The reality however is that RFC 1036 is the only standard for Usenet
> messages, de facto or otherwise, and so there's nothing wrong with
> anyone's newsreader.

The reality is that all non-broken newsreaders use MIME headers to
declare and interpret the charset being used. I suggest you read at
least http://www.joelonsoftware.com/articles/Unicode.html to get an idea
of Unicode and associated topics.

Thorsten

Ross Ridge

Feb 21, 2009, 5:07:35 PM
Ross Ridge (Sat, 21 Feb 2009 14:52:09 -0500)
> Except in practice unlike Python, many newsreaders don't assume ASCII.

Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>They assume ASCII - unless you declare your charset (the exception being
>Outlook Express and a few Windows newsreaders). Everything else is
>"guessing".

No, it's an assumption like the way Python by default assumes ASCII.

>> The original article displayed fine for me. Google Groups displays it
>> correctly too:
>>
>> http://groups.google.com/group/comp.lang.python/msg/828fefd7040238bc
>
>Your understanding of the principles of Unicode is at least as
>nonexistent as the OP's.

The link demonstrates that Google Groups doesn't assume ASCII like
Python does. Since popular newsreaders like Google Groups and Outlook
Express can display the message correctly without the MIME headers,
but your obscure one can't, there's a much stronger case to be made that
it's your newsreader that's broken.

>> I could just as easily argue that assuming ISO 8859-1 is the de facto
>> standard, and that it's your newsreader that's broken.
>
>There is no "standard" in regard to guessing (this is what you call
>"assuming"). The need for explicit declaration of an encoding is exactly
>the same in Python as in any Usenet article.

No, many newsreaders don't assume ASCII by default like Python.

>> The reality however is that RFC 1036 is the only standard for Usenet
>> messages, de facto or otherwise, and so there's nothing wrong with
>> anyone's newsreader.
>
>The reality is that all non-broken newsreaders use MIME headers to
>declare and interpret the charset being used.

Since RFC 1036 doesn't require MIME headers a reader that doesn't generate
them is by definition not broken.

Carl Banks

Feb 21, 2009, 5:24:55 PM
On Feb 19, 6:57 pm, Ron Garret <rNOSPA...@flownet.com> wrote:
> I'm writing a little wiki that I call µWiki.  That's a lowercase Greek

Thorsten Kampe

Feb 21, 2009, 5:52:03 PM
* Ross Ridge (Sat, 21 Feb 2009 17:07:35 -0500)

> The link demonstrates that Google Groups doesn't assume ASCII like
> Python does. Since popular newsreaders like Google Groups and Outlook
> Express can display the message correctly without the MIME headers,
> but your obscure one can't, there's a much stronger case to be made that
> it's your newsreader that's broken.

*sigh* I give up on you. You didn't even read the "Joel on Software"
article. The whole "why" and "what for" of Unicode and MIME will always
be a complete mystery to you.

T.

Ross Ridge

Feb 21, 2009, 6:06:35 PM
Ross Ridge (Sat, 21 Feb 2009 17:07:35 -0500)
> The link demonstrates that Google Groups doesn't assume ASCII like
> Python does. Since popular newsreaders like Google Groups and Outlook
> Express can display the message correctly without the MIME headers,
> but your obscure one can't, there's a much stronger case to be made that
> it's your newsreader that's broken.

Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>*sigh* I give up on you. You didn't even read the "Joel on Software"
>article. The whole "why" and "what for" of Unicode and MIME will always
>be a complete mystery to you.

I understand what Unicode and MIME are for and why they exist. Neither
their merits nor your insults change the fact that the only current
standard governing the content of Usenet posts doesn't require their use.

Thorsten Kampe

Feb 21, 2009, 6:35:35 PM
* Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)

> > The link demonstrates that Google Groups doesn't assume ASCII like
> > Python does. Since popular newsreaders like Google Groups and Outlook
> > Express can display the message correctly without the MIME headers,
> > but your obscure one can't, there's a much stronger case to be made that
> > it's your newsreader that's broken.
>
> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> >*sigh* I give up on you. You didn't even read the "Joel on Software"
> >article. The whole "why" and "what for" of Unicode and MIME will always
> >be a complete mystery to you.
>
> I understand what Unicode and MIME are for and why they exist. Neither
> their merits nor your insults change the fact that the only current
> standard governing the content of Usenet posts doesn't require their
> use.

That's right. As long as you use pure ASCII you can skip this nasty step
of informing other people which charset you are using. If you do use non
ASCII then you have to do that. That's the way virtually all newsreaders
work. It has nothing to do with some 21+ year old RFC. Even your Google
Groups "newsreader" does that ('content="text/html; charset=UTF-8"').

Being explicit about your encoding is 99% of the whole Unicode magic in
Python and in any communication across the Internet (may it be NNTP,
SMTP or HTTP). Your Google Groups simply uses heuristics to guess the
encoding the OP probably used. Windows newsreaders simply use the locale
of the local host. That's guessing. You can call it assuming but it's
still guessing. There is no way you can be sure without any declaration.

And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
text doesn't fit that encoding it refuses to guess.
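
The refusal looks like this in practice. In the Python 2 of this thread, mixing byte strings with unicode triggered an implicit ASCII decode; the sketch below makes the same decode explicit so that it also runs on Python 3. The variable names are illustrative.

```python
raw = b"\xb5Wiki"   # Latin-1 bytes for MICRO SIGN + "Wiki"

try:
    raw.decode("ascii")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # ASCII has no meaning for byte 0xB5, so Python raises instead of guessing.
    outcome = "refused: %s" % exc.reason

print(outcome)
```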

T.

Ross Ridge

Feb 21, 2009, 7:39:42 PM
Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)
> I understand what Unicode and MIME are for and why they exist. Neither
> their merits nor your insults change the fact that the only current
> standard governing the content of Usenet posts doesn't require their
> use.

Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>That's right. As long as you use pure ASCII you can skip this nasty step
>of informing other people which charset you are using. If you do use non
>ASCII then you have to do that. That's the way virtually all newsreaders
>work. It has nothing to do with some 21+ year old RFC. Even your Google
>Groups "newsreader" does that ('content="text/html; charset=UTF-8"').

No, the original post demonstrates you don't have to include MIME headers for
ISO 8859-1 text to be properly displayed by many newsreaders. The fact
that your obscure newsreader didn't display it properly doesn't mean
that the original poster's newsreader is broken.

>Being explicit about your encoding is 99% of the whole Unicode magic in
>Python and in any communication across the Internet (may it be NNTP,
>SMTP or HTTP).

HTTP requires the assumption of ISO 8859-1 in the absence of any
specified encoding.

>Your Google Groups simply uses heuristics to guess the
>encoding the OP probably used. Windows newsreaders simply use the locale
>of the local host. That's guessing. You can call it assuming but it's
>still guessing. There is no way you can be sure without any declaration.

Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
It's just a different assumption, nor does making an assumption, ASCII
or ISO 8859-1, give you any certainty.

>And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
>text doesn't fit that encoding it refuses to guess.

Which is reasonable given that Python is a programming language where it's
better to have a more conservative assumption about encodings so errors
can be more quickly diagnosed. A newsreader however is a different
beast, where it's better to make a less conservative assumption that's
more likely to display messages correctly to the user. Assuming ISO
8859-1 in the absence of any specified encoding allows the message to be
correctly displayed if the character set is either ISO 8859-1 or ASCII.
Doing things the "pythonic" way and assuming ASCII only allows such
messages to be displayed if ASCII is used.
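
The trade-off can be seen directly. A sketch for Python 3 with made-up sample bytes: Latin-1 assigns a character to every byte value, so decoding with it never fails, but that also means it cannot signal when the guess is wrong.

```python
ascii_bytes = b"uWiki"
latin1_bytes = b"\xb5Wiki"
utf8_bytes = b"\xc2\xb5Wiki"

# Latin-1 maps every byte 0-255 to a code point, so decoding never raises.
as_latin1 = [b.decode("latin-1")
             for b in (ascii_bytes, latin1_bytes, utf8_bytes)]

# ASCII input is unchanged, Latin-1 input is right,
# UTF-8 input is silently wrong (an extra A-circumflex appears).
for s in as_latin1:
    print(ascii(s))
```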

Thorsten Kampe

Feb 21, 2009, 8:25:54 PM
* Ross Ridge (Sat, 21 Feb 2009 19:39:42 -0500)

> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> >That's right. As long as you use pure ASCII you can skip this nasty step
> >of informing other people which charset you are using. If you do use non
> >ASCII then you have to do that. That's the way virtually all newsreaders
> >work. It has nothing to do with some 21+ year old RFC. Even your Google
> >Groups "newsreader" does that ('content="text/html; charset=UTF-8"').
>
> No, the original post demonstrates you don't have to include MIME headers for
> ISO 8859-1 text to be properly displayed by many newsreaders.

*sigh* As you still refuse to read the article[1] I'm going to quote it
now here:

'The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember
one extremely important fact. It does not make sense to have a string
without knowing what encoding it uses.
[...]
If you have a string [...] in an email message, you have to know what
encoding it is in or you cannot interpret it or display it to users
correctly.

Almost every [...] "she can't read my emails when I use accents" problem
comes down to one naive programmer who didn't understand the simple fact
that if you don't tell me whether a particular string is encoded using
UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western
European), you simply cannot display it correctly [...]. There are over
a hundred encodings and above code point 127, all bets are off.'

Enough said.
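
To make "all bets are off" concrete, here is one byte above 127 read under three encodings. A sketch for Python 3; the three codecs are picked only as examples from Python's standard set.

```python
raw = b"\xb5"   # one byte above 127, with no declared encoding

candidates = ("latin-1", "iso8859_5", "iso8859_7")
readings = {name: raw.decode(name) for name in candidates}

# Every decode succeeds, and each yields a different character.
for name in candidates:
    print(name, "U+%04X" % ord(readings[name]))
```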

> The fact that your obscure newsreader didn't display it properly
> doesn't mean that original poster's newsreader is broken.

You don't even know if my "obscure newsreader" displayed it properly.
Non ASCII text without a declared encoding is just a bunch of bytes.
It's not even text.

T.

[1] http://www.joelonsoftware.com/articles/Unicode.html

Steve Holden

Feb 21, 2009, 10:04:02 PM
to pytho...@python.org
And I suggest you try to phrase your remarks in a way more respectful of
those you are discussing these matters with. I understand that
exasperation can lead to offensiveness, but if a lack of understanding
does exist then it's better to simply try and remove it without
commenting on its existence.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

"Martin v. Löwis"

Feb 21, 2009, 11:21:21 PM
to Dennis Lee Bieber
> Since when is "Google Groups" a newsreader? So far as I know, all
> the display/formatting is handled by my web browser and GG merely stuffs
> messages into an HTML wrapper...

It also transmits this HTML wrapper via HTTP, where it claims that the
charset of the HTML is UTF-8. To do that, it must have converted the
original message from Latin-1 to UTF-8, which must have required
interpreting it as Latin-1 in the first place.
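
That conversion is a decode followed by an encode. A sketch for Python 3; `latin1_to_utf8` is a hypothetical helper name, and nothing here claims to be how Google Groups actually implements it.

```python
def latin1_to_utf8(raw):
    """Re-encode bytes from Latin-1 to UTF-8 (hypothetical helper)."""
    # The decode step is where the source encoding has to be assumed.
    return raw.decode("latin-1").encode("utf-8")


original = b"\xb5Wiki"
converted = latin1_to_utf8(original)
print(converted)
```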

Regards,
Martin

Joshua Judson Rosen

Feb 22, 2009, 12:07:34 AM
Ross Ridge <rri...@csclub.uwaterloo.ca> writes:
>
> > It's all about declaring your charset. In Python as well as in your
> > newsreader. If you don't declare your charset it's ASCII for you - in
> > Python as well as in your newsreader.
>
> Except in practice unlike Python, many newsreaders don't assume ASCII.
> The original article displayed fine for me.

Right. Exactly.

Wasn't that exact issue a driving force behind unicode's creation in
the first place? :)

To avoid horrors like this:

http://en.wikipedia.org/wiki/File:Letter_to_Russia_with_krokozyabry.jpg

... and people getting into arguments on usenet and having to use
rebuttals like "Well, it looked fine to *me*--there's nothing wrong,
we're just using incompatible encodings!"?

But you're right--specifying in usenet-posts is like
turn-signals....

Can we get back to Python programming, now? :)

--
Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr)))).

dineshv

Feb 22, 2009, 8:17:39 AM
re: "You should never have to rely on the default encoding. You should
explicitly decode and encode data."

What is the best practice for 1) doing this in Python and 2) for
unicode support ?

I want to standardize on unicode and want to put into place best
Python practice so that we don't have to worry. Thanks!

Dinesh
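
One common pattern, sketched here for Python 3: keep text as Unicode inside the program and name the encoding at every file boundary. The constant `ENCODING`, the file name, and the choice of UTF-8 are assumptions for illustration, not a recommendation made in the thread.

```python
import os
import tempfile

ENCODING = "utf-8"   # assumed project-wide choice, declared once

path = os.path.join(tempfile.mkdtemp(), "page.txt")

# Encode on the way out: the encoding is named, not inherited from the locale.
with open(path, "w", encoding=ENCODING) as f:
    f.write("\u00b5Wiki")

# Decode on the way in, with the same explicit name.
with open(path, "r", encoding=ENCODING) as f:
    page = f.read()

# The raw bytes on disk, for comparison.
with open(path, "rb") as f:
    on_disk = f.read()
```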

Denis Kasak

Feb 22, 2009, 9:49:25 AM
to Ross Ridge, pytho...@python.org
On Sun, Feb 22, 2009 at 1:39 AM, Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)
>> I understand what Unicode and MIME are for and why they exist. Neither
>> their merits nor your insults change the fact that the only current
>> standard governing the content of Usenet posts doesn't require their
>> use.
>
> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>>That's right. As long as you use pure ASCII you can skip this nasty step
>>of informing other people which charset you are using. If you do use non
>>ASCII then you have to do that. That's the way virtually all newsreaders
>>work. It has nothing to do with some 21+ year old RFC. Even your Google
>>Groups "newsreader" does that ('content="text/html; charset=UTF-8"').
>
> No, the original post demonstrates you don't have to include MIME headers for
> ISO 8859-1 text to be properly displayed by many newsreaders. The fact
> that your obscure newsreader didn't display it properly doesn't mean
> that original poster's newsreader is broken.

And how is this kind of assuming better than clearly stating the used
encoding? Does the fact that the last official Usenet RFC doesn't
mandate content-type headers mean that all bets are off and that we
should rely on guesswork to determine the correct encoding of a
message? No, it means the RFC is outdated and no longer suitable for
current needs.

>>Being explicit about your encoding is 99% of the whole Unicode magic in
>>Python and in any communication across the Internet (may it be NNTP,
>>SMTP or HTTP).
>
> HTTP requires the assumption of ISO 8859-1 in the absence of any
> specified encoding.

Which is, of course, completely irrelevant for this discussion. Or are
you saying that this fact should somehow obliterate the need for
specifying encodings?

>>Your Google Groups simply uses heuristics to guess the
>>encoding the OP probably used. Windows newsreaders simply use the locale
>>of the local host. That's guessing. You can call it assuming but it's
>>still guessing. There is no way you can be sure without any declaration.
>
> Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
> It's just a different assumption, nor does making an assumption, ASCII
> or ISO 8859-1, give you any certainty.

Assuming is another way of saying "I don't know, so I'm using this
arbitrary default", which is not that different from a completely wild
guess. :-)

>>And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
>>text doesn't fit that encoding it refuses to guess.
>
> Which is reasonable given that Python is a programming language where it's
> better to have a more conservative assumption about encodings so errors
> can be more quickly diagnosed. A newsreader however is a different
> beast, where it's better to make a less conservative assumption that's
> more likely to display messages correctly to the user. Assuming ISO
> 8859-1 in the absence of any specified encoding allows the message to be
> correctly displayed if the character set is either ISO 8859-1 or ASCII.
> Doing things the "pythonic" way and assuming ASCII only allows such
> messages to be displayed if ASCII is used.

Reading this paragraph, I've begun thinking that we've misunderstood
each other. I agree that assuming ISO 8859-1 in the absence of
specification is a better guess than most (since it's more likely to
display the message correctly). However, not specifying the encoding
of a message is just asking for trouble and assuming anything is just
an attempt at cleaning up someone's mess. Unfortunately, it is impossible
to detect the encoding scheme just by heuristics and with hundreds of
encodings in existence today, the only real solution to the problem is
clearly stating your content-type. Since MIME is the most accepted way
of doing this, it should be the preferred way, RFC'ed or not.

--
Denis Kasak

Joshua Judson Rosen

Feb 22, 2009, 7:46:35 PM
Denis Kasak <denis...@gmail.com> writes:
>
> > > Python "assumes" ASCII and if the decoded/encoded text doesn't
> > > fit that encoding it refuses to guess.
> >
> > Which is reasonable given that Python is a programming language where it's
> > better to have a more conservative assumption about encodings so errors
> > can be more quickly diagnosed. A newsreader however is a different
> > beast, where it's better to make a less conservative assumption that's
> > more likely to display messages correctly to the user. Assuming ISO
> > 8859-1 in the absence of any specified encoding allows the message to be
> > correctly displayed if the character set is either ISO 8859-1 or ASCII.
> > Doing things the "pythonic" way and assuming ASCII only allows such
> > messages to be displayed if ASCII is used.
>
> Reading this paragraph, I've begun thinking that we've misunderstood
> each other. I agree that assuming ISO 8859-1 in the absence of
> specification is a better guess than most (since it's more likely to
> display the message correctly).

So, yeah--back on the subject of programming in Python and supporting
character sets beyond ASCII:

If you have to make an assumption, I'd really think that it'd be
better to use whatever the host OS's default is, if the host OS has
such a thing--using an assumption of ISO 8859-1 works only in select
regions on unix systems, and may fail even in those select regions on
Windows, Mac OS, and other systems; without the OS considerations,
just the regional constraints are likely to make an ISO-8859-1
assumption result in /incorrect/ results anywhere eastward of central
Europe. Is a user in Russia (or China, or Japan) *really* most likely
to be using ISO 8859-1?
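
Python can report what the host suggests through the standard `locale` module. A sketch for Python 3; the result depends entirely on the machine and its settings, so no particular value should be expected.

```python
import codecs
import locale

# Ask the OS/locale which encoding it would use.
# Passing False avoids calling setlocale() as a side effect.
preferred = locale.getpreferredencoding(False)

# Resolve the reported name to one of Python's codecs.
codec = codecs.lookup(preferred)

print("locale suggests:", preferred, "->", codec.name)
```

This is still an assumption about the data, only one taken from the user's environment rather than hard-coded.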

As a point of reference, here's what's in the man-pages that I have
installed (note the /complete/ and conspicuous lack of references to
even some notable eastern languages or character-sets, such as Chinese
and Japanese, in the /entire/ ISO-8859 spectrum):

"ISO 8859 Alphabets
The full set of ISO 8859 alphabets includes:

ISO 8859-1 West European languages (Latin-1)
ISO 8859-2 Central and East European languages (Latin-2)
ISO 8859-3 Southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 West European languages (Latin-9)
ISO 8859-16 Romanian (Latin-10)"

"ISO 8859-1 supports the following languages: Afrikaans, Basque,
Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
Galician, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Scottish, Spanish, and Swedish."

"ISO 8859-2 supports the following languages: Albanian, Bosnian,
Croatian, Czech, English, Finnish, German, Hungarian, Irish, Polish,
Slovak, Slovenian and Sorbian."

"ISO 8859-7 encodes the characters used in modern monotonic
Greek."

"ISO 8859-9, also known as the "Latin Alphabet No. 5", encodes
the characters used in Turkish."

"ISO 8859-15 supports the following languages: Albanian, Basque, Breton,
Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French,
Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic,
Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic,
Scottish Gaelic, Spanish, and Swedish."

"ISO 8859-16 supports the following languages: Albanian, Bosnian,
Croatian, English, Finnish, German, Hungarian, Irish, Polish, Romanian,
Slovenian and Serbian."

Ben Finney

Feb 22, 2009, 8:05:13 PM
Joshua Judson Rosen <roz...@geekspace.com> writes:

> If you have to make an assumption, I'd really think that it'd be
> better to use whatever the host OS's default is, if the host OS has
> such a thing--using an assumption of ISO 8859-1 works only in select
> regions on unix systems, and may fail even in those select regions
> on Windows, Mac OS, and other systems; without the OS
> considerations, just the regional constraints are likely to make an
> ISO-8859-1 assumption result in /incorrect/ results anywhere
> eastward of central Europe. Is a user in Russia (or China, or Japan)
> *really* most likely to be using ISO 8859-1?

The fallacy in the above is to assume that a given programmer will
only be opening files created in their current locale. I say that is a
fallacy, because programmers in fact open program files created all
over the world in different locales; and those files should, where
possible, be interpreted by Python the same everywhere.

Assuming a *single*, defined, encoding in the absence of an explicit
declaration at least makes all Python installations (of a given
version) read any program file the same in any locale.
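
For program files, the explicit declaration is the PEP 263 coding line. A sketch for Python 3, which reads undeclared source as UTF-8 regardless of locale (the Python 2 current at the time of this thread used ASCII instead). The sample source bytes are made up for illustration.

```python
# Source bytes carrying their own PEP 263 declaration (Latin-1 here).
declared = b"# -*- coding: latin-1 -*-\nname = '\xb5Wiki'\n"
# The same statement without a declaration falls back to the fixed default.
undeclared = b"name = '\xb5Wiki'\n"

namespace = {}
exec(compile(declared, "<declared>", "exec"), namespace)

try:
    compile(undeclared, "<undeclared>", "exec")
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    # Byte 0xB5 alone is not valid UTF-8, so the default rejects it
    # the same way on every installation.
    undeclared_ok = False

print(ascii(namespace["name"]), undeclared_ok)
```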

--
\ “If [a technology company] has confidence in their future |
`\ ability to innovate, the importance they place on protecting |
_o__) their past innovations really should decline.” —Gary Barnett |
Ben Finney

John Machin

Feb 22, 2009, 8:16:47 PM
On Feb 23, 11:46 am, Joshua Judson Rosen <roz...@geekspace.com> wrote:

1. As a point of reference for what?
2. The ISO 8859 character sets were deliberately restricted to scripts
that would fit in 8 bits. So Chinese, Japanese, Korean and Vietnamese
aren't included. Note that Chinese and Japanese already each had
*multiple* legacy (i.e. non-Unicode) character sets ... they (and the
rest of the world) don't want/need yet another character set for each
language and never did want/need one.
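
The 8-bit limit is easy to demonstrate. A sketch for Python 3 using three Japanese characters as sample text: Latin-1 has no room for them, while UTF-8 encodes each in several bytes.

```python
text = "\u65e5\u672c\u8a9e"   # three Japanese characters

try:
    text.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    # An 8-bit character set has no slot for these code points.
    fits_in_latin1 = False

utf8_len = len(text.encode("utf-8"))   # three bytes per character here

print(fits_in_latin1, utf8_len)
```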
