Google isn't _that_ smart

Andreas Prilop

Aug 28, 2001, 10:12:54 AM

to comm...@google.com

[[ This message was both posted and mailed: see
the "To," "Cc," and "Newsgroups" headers for details. ]]

Let's have a closer look at
<http://www.google.com/intl/vi/>

On that page we find:
charset=ISO-8859-1
font-family: arial
font face=arial
ế ớ

First, Netscape 4.x (and perhaps other browsers) will display ế
etc. as question marks because of "charset=ISO-8859-1".
With another browser you may still get question marks because _your_
version of Arial is unlikely to contain the special Vietnamese letters.
You need to disable the option "Use page-specified fonts" (which is
generally enabled by default) and use a font of your own that contains
Vietnamese characters.

After managing all that, the innocents may get the impression that they
_can_ search for special Vietnamese characters because they see such
letters on this page. But because of "charset=ISO-8859-1" you can search
for Latin-1 characters only.
See <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
for details.

--
Outlook Express is a fine news_reader_.
Its only problem: It allows you to post.

Andrew Glasgow

Aug 28, 2001, 3:19:46 PM

In article <280820011612553402%andreas...@altavista.net>,
Andreas Prilop <andreas...@altavista.net> wrote:

> [[ This message was both posted and mailed: see
> the "To," "Cc," and "Newsgroups" headers for details. ]]
>
>
> Let's have a closer look at
> <http://www.google.com/intl/vi/>
>
> On that page we find:
> charset=ISO-8859-1
> font-family: arial
> font face=arial
> ế ớ
>
> First, Netscape 4.x (and perhaps other browsers) will display ế
> etc. as question marks because of "charset=ISO-8859-1".

If so, only because they're b0rken. Unicode character codes are
supposed to be used in this way.

Andreas Prilop

Aug 28, 2001, 4:02:52 PM

In article <news:amg39.REMOVETHIS-89...@newsstand.cit.cornell.edu>,
Andrew Glasgow <amg39.RE...@cornell.edu.INVALID> wrote:

> > First, Netscape 4.x (and perhaps other browsers) will display ế
> > etc. as question marks because of "charset=ISO-8859-1".
>
> If so, only because they're b0rken. Unicode character codes are
> supposed to be used in this way.

(This was only my first remark. Did you read on?)

They *can be used* in this way but need not *be used* in this way.
Applying the old principle "Be conservative in what you send &c. &c.",
it would be no problem here to express *all* special letters as &#number;
and then set "charset=UTF-8".
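
A minimal sketch of that suggestion, assuming Python 3 (nothing in this thread uses Python, and the helper name to_ncr is made up for illustration). It rewrites every non-ASCII character as a &#number; reference, so the markup itself is 7-bit ASCII and reads the same whichever charset is declared:

```python
import html


def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with a decimal &#number; reference."""
    # "xmlcharrefreplace" is a standard codec error handler; it emits
    # &#NNNN; for every character the ASCII codec cannot encode.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The two letters quoted from the page, written as escapes to avoid
# any doubt about how this file itself is encoded.
sample = "\u1ebf \u1edb"
markup = to_ncr(sample)
print(markup)

# The result is pure ASCII, so decoding it as ISO-8859-1 or as UTF-8
# gives the same string.
raw = markup.encode("ascii")
assert raw.decode("iso-8859-1") == raw.decode("utf-8")

# html.unescape stands in here for what a conforming browser does
# when it resolves the references.
assert html.unescape(markup) == sample
```

This only covers the display side. It says nothing about what a browser sends back when a form on such a page is submitted.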

Alan J. Flavell

Aug 28, 2001, 4:20:54 PM

On Aug 28, Andrew Glasgow inscribed on the eternal scroll:

> > First, Netscape 4.x (and perhaps other browsers) will display ế
> > etc. as question marks because of "charset=ISO-8859-1".
>
> If so, only because they're b0rken. Unicode character codes are
> supposed to be used in this way.

That's been known since RFC2070, but the Netscape developers
weren't too good at reading specs, as we're surely aware by now.

So in practical terms, which do you want - to reach as many WWW
readers as feasible, or to prove to yourself what we already know,
that NN4.* isn't good for serious work?

Andrew Glasgow

Aug 28, 2001, 6:14:36 PM

In article <280820012202524038%andreas...@altavista.net>,
Andreas Prilop <andreas...@altavista.net> wrote:

> In article
> <news:amg39.REMOVETHIS-89...@newsstand.cit.cornell.edu>,
> Andrew Glasgow <amg39.RE...@cornell.edu.INVALID> wrote:
>
> > > First, Netscape 4.x (and perhaps other browsers) will display ế
> > > etc. as question marks because of "charset=ISO-8859-1".
> >
> > If so, only because they're b0rken. Unicode character codes are
> > supposed to be used in this way.
>
> (This was only my first remark. Did you read on?)

Err, yes. Forgot to mark the snip, sorry.

> They *can be used* in this way but need not *be used* in this way.
> Applying the old principle "Be conservative in what you send &c. &c.",
> it would be no problem here to express *all* special letters as &#number;
> and then set "charset=UTF-8".

If you're sending it as "charset=utf-8" you can send the special
characters without encoding them, provided that the client understands
unicode. Errr, I think. The point of the &#xxx; notation is to allow
one to use the characters while storing your text files as ordinary
8-bit text.

Alan J. Flavell

Aug 28, 2001, 6:44:57 PM

On Aug 28, Andrew Glasgow inscribed on the eternal scroll:

(addressing Andreas Prilop...)

> If you're sending it as "charset=utf-8" you can send the special
> characters without encoding them, provided that the client understands
> unicode. Errr, I think.

Andreas _knows_, so if you have a question, why not ask it? Posting
untested surmises helps no-one.

> The point of the &#xxx; notation is to allow
> one to use the characters while storing your text files as ordinary
> 8-bit text.

utf-8 could never be described as "ordinary 8-bit text". "Ordinary
7-bit text" would be more appropriate in this context. Or else
properly coded utf-8 bytestreams.
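
To make the distinction concrete, here is a small sketch, assuming Python 3 (the function names are illustrative only). It compares the bytes of one Vietnamese letter sent as raw UTF-8 with the same letter written as a &#number; reference, and shows what a reader that assumes ISO-8859-1 makes of the UTF-8 bytes:

```python
def utf8_bytes(ch: str) -> bytes:
    """Bytes of the character when the document is really sent as UTF-8."""
    return ch.encode("utf-8")


def misread_as_latin1(data: bytes) -> str:
    """What a client sees if it treats those bytes as ISO-8859-1."""
    return data.decode("iso-8859-1")


letter = "\u1ebf"  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE

raw = utf8_bytes(letter)
print(len(raw), raw.hex())

# Every byte of the UTF-8 sequence has the high bit set, so the
# sequence is neither 7-bit text nor one-byte-per-character 8-bit text.
assert all(b >= 0x80 for b in raw)
print(misread_as_latin1(raw))

# The reference form of the same letter stays within 7-bit ASCII.
reference = "&#7871;".encode("ascii")
assert all(b < 0x80 for b in reference)
```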

Rhetorical question, aimed at nobody in particular: why the heck do
people still rabbit on about "special characters"? Surely nearly all
of the Unicode repertoire falls under that description now?

Andreas Prilop

Aug 29, 2001, 7:58:39 AM

to comm...@google.com

[[ This message was both posted and mailed: see
the "To," "Cc," and "Newsgroups" headers for details. ]]

In article <news:amg39.REMOVETHIS-40...@newsstand.cit.cornell.edu>,
Andrew Glasgow <amg39.RE...@cornell.edu.INVALID> wrote:

> If you're sending it as "charset=utf-8" you can send the special
> characters without encoding them, provided that the client understands
> unicode. Errr, I think. The point of the &#xxx; notation is to allow
> one to use the characters while storing your text files as ordinary
> 8-bit text.

OK, OK, I know that. You are concentrating on my first remark, which is
not essential in any way. Let's ignore that and return to

charset=ISO-8859-1
font-family: arial
font face=arial

My point is that most (?) users' Arial will not contain special
Vietnamese characters and that
<http://www.google.com/intl/vi/>
will not allow you to search for special Vietnamese letters even if
*you see* them on that page.

Toby Speight

Aug 29, 2001, 9:50:01 AM

0> In <URL:news:amg39.REMOVETHIS-89...@newsstand.cit.cornell.edu>,
0> Andrew Glasgow <URL:mailto:amg39.RE...@cornell.edu.INVALID> ("Andrew") wrote:

Andrew> In article <280820011612553402%andreas...@altavista.net>,
Andrew> Andreas Prilop <andreas...@altavista.net> wrote:

>> Let's have a closer look at
>> <URL:http://www.google.com/intl/vi/>
>>
>> On that page we find:
>> charset=ISO-8859-1
>> font-family: arial
>> font face=arial
>> ế ớ
>>
>> First, Netscape 4.x (and perhaps other browsers) will display ế
>> etc. as question marks because of "charset=ISO-8859-1".

Andrew> If so, only because they're b0rken. Unicode character codes are
Andrew> supposed to be used in this way.

Indeed, and in better browsers there's no problem. But it's unwise to
exclude a large proportion of your readership when it's easy to avoid
without violating the standards.

And the form submission issue remains. For example, pasting "Môi
truờng sở thích" from elsewhere on that page results in

> Xâu - "M ,At (Bi tru ,16 (Bng s ,17 (B th ,Am (Bch" - không tìm thấy
> trong bất cứ tài liệu nào.

IOW, it's seeing that submission as

> "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"

Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to send
in this case?
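
The form submission problem can be sketched roughly as follows, assuming Python 3 and assuming a browser that encodes the query in the page's declared charset and substitutes "?" for anything it cannot encode. That substitution is an assumption for illustration; real browsers have done various other things, as the output above shows. The helper name form_value is hypothetical.

```python
from urllib.parse import quote_plus


def form_value(text: str, page_charset: str) -> str:
    """Percent-encode a form field as if submitted in the page's charset."""
    # errors="replace" makes the codec emit "?" for characters that the
    # charset cannot represent, instead of raising an error.
    return quote_plus(text, encoding=page_charset, errors="replace")


word_latin1_ok = "M\u00f4i"  # "o with circumflex" exists in ISO-8859-1
word_vietnamese = "s\u1edf"  # "o with horn and hook above" does not

# Latin-1 can carry the first word intact, but the second loses its letter.
print(form_value(word_latin1_ok, "iso-8859-1"))
print(form_value(word_vietnamese, "iso-8859-1"))

# With UTF-8 as the submission charset, both words survive.
print(form_value(word_latin1_ok, "utf-8"))
print(form_value(word_vietnamese, "utf-8"))
```

Under these assumptions the search engine never receives the Vietnamese letter at all when the page is labelled ISO-8859-1, which is the point being made about searching from that page.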

--
"You can ... sell your soul for complete control -
is that really what you need?" (Pink Floyd)

Andreas Prilop

Aug 29, 2001, 10:39:50 AM

In article <news:s8ae0jf...@suilven.cam.eu.citrix.com>,
Toby Speight <strea...@gmx.net> wrote:

> IOW, it's seeing that submission as
>
> > "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"
>
> Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to send
> in this case?

Such results are discussed on Alan's page
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

Toby Speight

Aug 31, 2001, 4:43:00 PM

to Alan Flavell

[posted to comp.infosystems.www.authoring.html and CCed to AJF]

0> In <URL:news:290820011639506236%25andrea...@altavista.net>,
0> Andreas Prilop <URL:mailto:andreas...@altavista.net> ("Andreas") wrote:

Andreas> In article <news:s8ae0jf...@suilven.cam.eu.citrix.com>,
Andreas> Toby Speight <strea...@gmx.net> wrote:

>> IOW, it's seeing that submission as
>>
>> > "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"
>>
>> Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to
>> send in this case?

Andreas> Such results are discussed on Alan's page
Andreas> <URL:http://ppewww.ph.gla.ac.uk/%7eflavell/charset/form-i18n.html>

Yes, though this is different from all the examples mentioned there.

UTF-8 document, Latin-1 browser, sending in EUC-KR.

Alan, this may be interesting to your readership.
