Let's have a closer look at
<http://www.google.com/intl/vi/>
On that page we find:
charset=ISO-8859-1
font-family: arial
font face=arial
&#7871; &#7899;
First, Netscape 4.x (and perhaps other browsers) will display &#7871;
etc. as question marks because of "charset=ISO-8859-1".
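
To make the point concrete, here is a minimal Python sketch (not taken from the page; the helper names fits_latin1 and to_ncr are made up for illustration). It shows that U+1EBF and U+1EDB have no place in ISO-8859-1, so a page labelled that way can carry them only as numeric character references:

```python
# Sketch: which characters fit into ISO-8859-1, and what the
# numeric character reference looks like for those that do not.

def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with a decimal &#number; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(fits_latin1("\u00f4"))      # True  (o with circumflex is in Latin-1)
print(fits_latin1("\u1ebf"))      # False (e with circumflex and acute is not)
print(to_ncr("\u1ebf \u1edb"))    # &#7871; &#7899;
```

Whether a given browser then resolves such a reference correctly is a separate matter, which is the complaint about Netscape 4.x.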
With another browser you may still get question marks because _your_
version of Arial probably does not contain the special Vietnamese letters.
You need to disable the option "Use page-specified fonts" (which is
generally enabled by default) and use a font of your own that contains
Vietnamese characters.
After managing all that, the innocent may get the impression that they
_can_ search for special Vietnamese characters because they see such
letters on this page. But you can search for Latin-1 characters only,
because the form data are sent according to "charset=ISO-8859-1".
See <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
for details.
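
A rough Python sketch of the form problem follows. It assumes a browser that encodes the query in the page's charset and substitutes "?" for anything unencodable; real browsers differ in what they do at that point, and submit_as_latin1 is a hypothetical stand-in, not anybody's actual code:

```python
# Sketch: what survives when a query is squeezed into ISO-8859-1.
# Assumption: unencodable characters are replaced by "?".

def submit_as_latin1(query: str) -> bytes:
    """Encode a form value as ISO-8859-1, replacing what does not fit."""
    return query.encode("iso-8859-1", errors="replace")


print(submit_as_latin1("M\u00f4i"))   # b'M\xf4i'  (o-circumflex is Latin-1)
print(submit_as_latin1("s\u1edf"))    # b's?'      (o-horn-hook is lost)
```

So a word that happens to use only Latin-1 letters gets through, and anything else arrives damaged in one way or another.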
--
Outlook Express is a fine news_reader_.
Its only problem: It allows you to post.
> [[ This message was both posted and mailed: see
> the "To," "Cc," and "Newsgroups" headers for details. ]]
>
>
> Let's have a closer look at
> <http://www.google.com/intl/vi/>
>
> On that page we find:
> charset=ISO-8859-1
> font-family: arial
> font face=arial
> ế ớ
>
> First, Netscape 4.x (and perhaps other browsers) will display ế
> etc. as question mark because of "charset=ISO-8859-1".
If so, only because they're b0rken. Unicode character codes are
supposed to be used in this way.
--
| Andrew Glasgow <amg39(at)cornell.edu> |
| SCSI is *NOT* magic. There are *fundamental technical |
| reasons* why it is necessary to sacrifice a young goat |
| to your SCSI chain now and then. -- John Woods |
> > First, Netscape 4.x (and perhaps other browsers) will display ế
> > etc. as question mark because of "charset=ISO-8859-1".
>
> If so, only because they're b0rken. Unicode character codes are
> supposed to be used in this way.
(This was only my first remark. Did you read on?)
They *can be used* in this way but need not *be used* in this way.
Applying the old principle "Be conservative in what you send &c. &c.",
it would be no problem here to express *all* special letters as &#number;
and then set "charset=UTF-8".
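
A small Python sketch of why that is harmless (the helper names as_ascii_ncr and render are invented for this example): once every non-ASCII letter is written as &#number;, the bytes on the wire are plain ASCII, and they decode to the same text whichever of the two charset labels is applied.

```python
# Sketch: an all-&#number; document is pure ASCII, so it reads the same
# under "charset=UTF-8" and under "charset=ISO-8859-1".
import html


def as_ascii_ncr(text: str) -> bytes:
    """Write every non-ASCII character as a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace")


def render(data: bytes, charset: str) -> str:
    """Decode with the given charset, then resolve character references."""
    return html.unescape(data.decode(charset))


original = "M\u00f4i tr\u01b0\u1eddng"
data = as_ascii_ncr(original)

print(data)                                   # b'M&#244;i tr&#432;&#7901;ng'
print(render(data, "utf-8") == original)      # True
print(render(data, "iso-8859-1") == original) # True
```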
> > First, Netscape 4.x (and perhaps other browsers) will display ế
> > etc. as question mark because of "charset=ISO-8859-1".
>
> If so, only because they're b0rken. Unicode character codes are
> supposed to be used in this way.
That's been known since RFC 2070, but the Netscape developers
weren't too good at reading specs, as we're surely aware by now.
So in practical terms, which do you want - to reach as many WWW
readers as feasible, or to prove to yourself what we already know,
that NN4.* isn't good for serious work?
> In article
> <news:amg39.REMOVETHIS-89...@newsstand.cit.cornell.edu>,
> Andrew Glasgow <amg39.RE...@cornell.edu.INVALID> wrote:
>
> > > First, Netscape 4.x (and perhaps other browsers) will display ế
> > > etc. as question mark because of "charset=ISO-8859-1".
> >
> > If so, only because they're b0rken. Unicode character codes are
> > supposed to be used in this way.
>
> (This was only my first remark. Did you read on?)
Err, yes. Forgot to mark the snip, sorry.
> They *can be used* in this way but need not *to be used* in this way.
> Applying the old principle "Be conservative in what you send &c. &c.",
> it would no problem here to express *all* special letters as &#number;
> and then set "charset=UTF-8".
If you're sending it as "charset=utf-8" you can send the special
characters without encoding them, provided that the client understands
Unicode. Errr, I think. The point of the &#xxx; notation is to allow
one to use the characters while storing one's text files as ordinary
8-bit text.
(addressing Andreas Prilop...)
> If you're sending it as "charset=utf-8" you can send the special
> characters without encoding them, provided that the client understands
> unicode. Errr, I think.
Andreas _knows_, so if you have a question, why not ask it? Posting
untested surmises helps no-one.
> The point of the &#xxx; notation is to allow
> one to use the characters while storing your text files as ordinary
> 8-bit text.
utf-8 could never be described as "ordinary 8-bit text". "Ordinary
7-bit text" would be more appropriate in this context. Or else
properly coded utf-8 bytestreams.
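
For anyone who wants to see the difference in bytes, a short Python sketch (is_seven_bit is a name made up here, nothing standard):

```python
# Sketch: the same letter as a raw UTF-8 byte sequence and as a
# character reference made of 7-bit ASCII characters only.

def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its high bit set."""
    return all(b < 0x80 for b in data)


raw = "\u1ebf".encode("utf-8")
ref = "\u1ebf".encode("ascii", "xmlcharrefreplace")

print(raw, is_seven_bit(raw))   # b'\xe1\xba\xbf' False
print(ref, is_seven_bit(ref))   # b'&#7871;' True
```

The first form is a three-byte sequence that only makes sense when read as utf-8; the second survives any ASCII-compatible storage or transport.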
Rhetorical question, aimed at nobody in particular: why the heck do
people still rabbit on about "special characters"? Surely nearly all
of the Unicode repertoire falls under that description now?
In article <news:amg39.REMOVETHIS-40...@newsstand.cit.cornell.edu>,
Andrew Glasgow <amg39.RE...@cornell.edu.INVALID> wrote:
> If you're sending it as "charset=utf-8" you can send the special
> characters without encoding them, provided that the client understands
> unicode. Errr, I think. The point of the &#xxx; notation is to allow
> one to use the characters while storing your text files as ordinary
> 8-bit text.
OK, OK, I know that. You concentrate on my first remark, which is not
essential in any way. Let's ignore that and return to
charset=ISO-8859-1
font-family: arial
font face=arial
My point is that most (?) users' Arial will not contain special
Vietnamese characters and that
<http://www.google.com/intl/vi/>
will not allow you to search for special Vietnamese letters even if
*you see* them on that page.
Andrew> In article <280820011612553402%andreas...@altavista.net>,
Andrew> Andreas Prilop <andreas...@altavista.net> wrote:
>> Let's have a closer look at
>> <URL:http://www.google.com/intl/vi/>
>>
>> On that page we find:
>> charset=ISO-8859-1
>> font-family: arial
>> font face=arial
>> ế ớ
>>
>> First, Netscape 4.x (and perhaps other browsers) will display ế
>> etc. as question mark because of "charset=ISO-8859-1".
Andrew> If so, only because they're b0rken. Unicode character codes are
Andrew> supposed to be used in this way.
Indeed, and in better browsers there's no problem. But it's unwise to
exclude a large proportion of your readership when it's easy to avoid
without violating the standards.
And the form submission issue remains. For example, pasting "Môi
truờng sở thích" from elsewhere on that page results in
> Xâu - "M ,At (Bi tru ,16 (Bng s ,17 (B th ,Am (Bch" - không tìm thấy
> trong bất cứ tài liệu nào.
IOW, it's seeing that submission as
> "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"
Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to send
in this case?
--
"You can ... sell your soul for complete control -
is that really what you need?" (Pink Floyd)
> IOW, it's seeing that submission as
>
> > "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"
>
> Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to send
> in this case?
Such results are discussed on Alan's page
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
0> In <URL:news:290820011639506236%25andrea...@altavista.net>,
0> Andreas Prilop <URL:mailto:andreas...@altavista.net> ("Andreas") wrote:
Andreas> In article <news:s8ae0jf...@suilven.cam.eu.citrix.com>,
Andreas> Toby Speight <strea...@gmx.net> wrote:
>> IOW, it's seeing that submission as
>>
>> > "M,At(Bi tru,16(Bng s,17(B th,Am(Bch"
>>
>> Hmm, looks like EUC-KR... I guess that's what emacs-w3 chooses to
>> send in this case?
Andreas> Such results are discussed on Alan's page
Andreas> <URL:http://ppewww.ph.gla.ac.uk/%7eflavell/charset/form-i18n.html>
Yes, though this is different from all the examples mentioned there.
UTF-8 document, Latin-1 browser, sending in EUC-KR.
Alan, this may be interesting to your readership.