
Trouble with the "lang" attribute

Bertil Wennergren

Feb 5, 2003, 11:04:44 AM
I've stumbled into a weird problem with using the "lang" attribute (or
"xml:lang" attribute in XHTML). A discussion can be found in this blog:

<http://weblog.delacour.net/index.php>

The problem occurs when e.g. a Japanese word is included in the text of an
otherwise English HTML document, but written in Latin characters ("Romaji"
as the Japanese call it). E.g.:

<p>I watch a lot of <span lang="ja">manga</span> movies.</p>

It could be argued that the word "manga" is now an English word, although
borrowed from Japanese, but let's forget about that, and decide that it's
still Japanese, and should be marked up as such. All fine. The above code
is the right way to do it. (A better element than "span" might be chosen,
but that's another matter.)

The problem is that on Windows many users that don't have the Japanese
language pack installed already, will get a nagging and completely useless
prompting to download and install the Japanese language support (a rather
large download, that will serve no purpose whatsoever since the text is
written in Roman characters). The prompting will be repeated for every page
load. This happens not only with Explorer, but also with Mozilla and
Phoenix (and perhaps other browsers too).

Users could turn off this prompting, but many won't know how to do that, and
they could get extremely annoyed, and just leave the page.

Are there any work-arounds? Any other suggestions? Should we just leave out
correct mark-up, just because some systems are completely brain-dead?

--
Bertil Wennergren <bert...@gmx.net> <http://www.bertilow.com>

Alan J. Flavell

Feb 5, 2003, 12:07:21 PM
On Feb 5, Bertil Wennergren inscribed on the eternal scroll:

> <p>I watch a lot of <span lang="ja">manga</span> movies.</p>
>
> It could be argued that the word "manga" is now an English word, although
> borrowed from Japanese, but let's forget about that, and decide that it's
> still Japanese, and should be marked up as such.

Correct...

> The problem is that on Windows many users that don't have the Japanese
> language pack installed already, will get a nagging and completely useless
> prompting to download and install the Japanese language support (a rather
> large download, that will serve no purpose whatsoever since the text is
> written in Roman characters).

Oh dear. How many times is it necessary to repeat that in HTML,
language and character representation are two quite different issues?

If they had needed to download a different _pronunciation_ module for
their speaking browser, then it would have been more than excusable.

> This happens not only with Explorer, but also with Mozilla and
> Phoenix (and perhaps other browsers too).

Oh dear **2.

> Are there any work-arounds? Any other suggestions? Should we just leave out
> correct mark-up, just because some systems are completely brain-dead?

.sig applies, I fear.

all the best

--
Manchmal denke ich, ich brauche die ganzen schicken, neuen
Brauser nur, um mich besser gegen den Autor zu wehren...
- Sybille Kahl on dciwam

Jukka K. Korpela

Feb 5, 2003, 1:27:06 PM
Bertil Wennergren <bert...@gmx.net> wrote:

> The problem occurs when e.g. a Japanese word is included in the text of
> an otherwise English HTML document, but written in Latin characters

- -


> The problem is that on Windows many users that don't have the Japanese
> language pack installed already, will get a nagging and completely
> useless prompting to download and install the Japanese language support

This is depressing, but thanks for pointing this out. I think many of us have
not met this problem yet, either because we have Japanese support installed
or because we haven't visited pages where Romanized Japanese has language
markup. The observation reminds us that we should not write language markup
in too much detail, until the definitions and implementations have matured.
(For an entire document, or for a block quotation, and for a book title, for
example, language markup is surely recommendable, and not much work. But
even for them, maybe it's better to suppress the lang markup, if the text is
transliterated or transcribed.)

I have some related bad news that points to the same direction.

I tested the following simple markup on Mozilla 1.2.1 (Win):

<span lang="ru">Dostojevski</span>
<span lang="ja">manga</span>

in a document otherwise in Finnish and ISO-8859-1 encoding. The result is
that the words Dostojevski and manga appear in different fonts. Different
from each other and from the rest of the text. I had not touched any related
settings, and my document had nothing for suggesting any specific fonts.
Actually I _then_ touched the settings, to make the effect on Dostoyevsky
more observable, i.e. greater difference in fonts. Mozilla really has
language-dependent settings for fonts, and the lang attribute is apparently
used to determine the language. There's a small demo at
http://www.cs.tut.fi/~jkorpela/kielimerkkaus/4.html#tr
(within textual explanations in Finnish, but there's a screenshot of what
Mozilla does).

Sometimes, of course, we would _like_ to have foreign words appear as
stylistically different. Like in italics, or something. But what Mozilla
does is totally uncalled-for.

We cannot say any more that popular browsers don't care about lang
attributes. I had expected to be happier at this moment. :-(

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Bertil Wennergren

Feb 5, 2003, 2:08:39 PM
Jukka K. Korpela:

> Bertil Wennergren <bert...@gmx.net> wrote:

>> The problem occurs when e.g. a Japanese word is included in the text of
>> an otherwise English HTML document, but written in Latin characters
> - -

>> The problem is that on Windows many users that don't have the Japanese
>> language pack installed already, will get a nagging and completely
>> useless prompting to download and install the Japanese language support

> This is depressing, but thanks for pointing this out. I think many of us have
> not met this problem yet, either because we have Japanese support
> installed or because we haven't visited pages where Romanized Japanese has
> language markup. The observation reminds us that we should not write
> language markup in too much detail, until the definitions and
> implementations have matured.

That's probably the right way of looking at this problem.

> (For an entire document, or for a block
> quotation, and for a book title, for example, language markup is surely
> recommendable, and not much work. But even for them, maybe it's better to
> suppress the lang markup, if the text is transliterated or transcribed.)

If it means that users will be prompted to download weird things (most will
not understand what's happening at all, nor why!), then we should surely
leave the "lang" attributes out. They're not that important, especially
since the software that might use them is probably next to non-existent
anyway.

> I tested the following simple markup on Mozilla 1.2.1 (Win):

> <span lang="ru">Dostojevski</span>
> <span lang="ja">manga</span>

> in a document otherwise in Finnish and ISO-8859-1 encoding. The result is
> that the words Dostojevski and manga appear in different fonts.

Yes, I've seen that too. That is also depressing, but not as bad a problem
as the download prompting. I could live with the funny fonts.

> We cannot say any more that popular browsers don't care about lang
> attributes. I had expected to be happier at this moment. :-(

Me too.

Daniel R. Tobias

Feb 5, 2003, 2:39:30 PM
Bertil Wennergren <bert...@gmx.net> wrote in message news:<b1rccu$8q6$01$1...@news.t-online.com>...

> The problem is that on Windows many users that don't have the Japanese
> language pack installed already, will get a nagging and completely useless
> prompting to download and install the Japanese language support (a rather
> large download, that will serve no purpose whatsoever since the text is
> written in Roman characters). The prompting will be repeated for every page
> load. This happens not only with Explorer, but also with Mozilla and
> Phoenix (and perhaps other browsers too).

I'd say the browsers are misbehaving; languages and character sets are
completely different issues, and it would be more sensible if the
browser prompted to install language support upon encountering
characters not in its current repertoire, without regard to the
language attributes.
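
To illustrate that suggestion, here is a minimal Python sketch of a prompt decision driven by the characters on the page rather than by "lang". It is not how any actual browser works; the function names and the use of Latin-1 as a stand-in for the "current repertoire" are assumptions made for the example.

```python
def outside_repertoire(text, repertoire_encoding="latin-1"):
    """Return the characters in text that the assumed repertoire cannot represent.

    Latin-1 is only a stand-in for "what this system can already display".
    """
    missing = []
    for ch in text:
        try:
            ch.encode(repertoire_encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


def should_prompt(text, lang=None):
    """Decide whether a language-support prompt makes sense.

    The lang argument is accepted but deliberately ignored: only the
    characters actually present in the text matter.
    """
    return bool(outside_repertoire(text))


romanized = should_prompt("manga", lang="ja")         # Latin letters only
native = should_prompt("\u6f2b\u753b", lang="ja")     # "manga" written in kanji
```

Under this rule the romanized word never triggers a prompt, whatever its "lang" value, while the same word written in kanji does.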

It's been a longstanding pet peeve of mine that the use of logical
tags and attributes (e.g., "lang") runs into chicken-or-egg dilemmas
where site developers see no need to include them because user agents
don't care about them, while user agent developers see no need to
support them because site developers aren't using them anyway.
However, an even worse situation than such nonsupport is the case
where somebody (either a browser developer or a site developer)
actually does start using or supporting these tags or attributes, but
in an incorrect way that actually causes somebody else's perfectly
logical uses to break in some ugly and annoying manner. That seems to
be what you're describing now.

--
Dan

Alan J. Flavell

Feb 5, 2003, 3:03:10 PM
On Feb 5, Bertil Wennergren inscribed on the eternal scroll:

> If it means that users will be prompted to download weird things (most will
> not understand what's happening at all, nor why!), then we should surely
> leave the "lang" attributes out. They're not that important, especially
> since the software that might use them is probably next to non-existent
> anyway.

On the other hand, my recollection from a free trial period was that
IBM HPR would "do the right thing" when reading out foreign text (in
one of its supported languages) provided the language was correctly
marked, but what it read out when the foreign language was not marked
(e.g. German read with a US American accent) was well nigh
incomprehensible. (ik bin kein Berliner... SCNR.)

So it would be a shame to toss this technically-correct and
potentially-useful feature aside.

[...]


> > in a document otherwise in Finnish and ISO-8859-1 encoding. The result is
> > that the words Dostojevski and manga appear in different fonts.

> Yes, I've seen that too. That is also depressing,

I don't know about that...? Maybe not a top winner in aesthetics, but
it does have a certain grim logic to it.

> but not as bad a problem
> as the download prompting. I could live with the funny fonts.

If the "funny" font also contains a more-complete character repertoire
for the "language group", then it's possible that better results will
be achieved overall. But it's confusion between language and writing
system that is the fundamental error here, it seems to me.

Bertil Wennergren

Feb 5, 2003, 3:47:24 PM
Alan J. Flavell:

> On Feb 5, Bertil Wennergren inscribed on the eternal scroll:

>> If it means that users will be prompted to download weird things (most
>> will not understand what's happening at all, nor why!), then we should
>> surely leave the "lang" attributes out. They're not that important,
>> especially since the software that might use them is probably next to
>> non-existent anyway.

> On the other hand, my recollection from a free trial period was that
> IBM HPR would "do the right thing" when reading out foreign text (in
> one of its supported languages) provided the language was correctly
> marked, but what it read out when the foreign language was not marked
> (e.g. German read with a US American accent) was well nigh
> incomprehensible. (ik bin kein Berliner... SCNR.)

German would not be an issue here (except if it's written in katakana in a
Japanese page, and the browser uselessly prompts the poor Japanese user to
download a _German_ language pack, in order to display the German text...).

The issue is with e.g. Japanese written in Latin transcription, or Arabic in
Latin transcription (or for that matter Japanese or Arabic in Cyrillic
transcription...), and similar cases. I strongly doubt there is any
software out there that will correctly apply any Japanese or Arabic or
Chinese or Russian pronunciation rules to _transcribed_ versions of those
languages. For Arabic that would be next to impossible since there are no
working standards for transcription. The situation for Japanese is similar.
And there is in any case no provision in the "lang" attribute (or
otherwise) to indicate which transcription system is being used.

So I do think the actual practical use of 'lang="ja"' for transcribed text
is very close to nil right now. Unfortunately.

> So it would be a shame to toss this technically-correct and
> potentially-useful feature aside.

A shame it is.



>> but not as bad a problem
>> as the download prompting. I could live with the funny fonts.

> If the "funny" font also contains a more-complete character repertoire
> for the "language group", then it's possible that better results will
> be achieved overall.

What benefits of that kind could there be for transcribed text? If we're
including Japanese text in actual Japanese writing, we should most surely
include the "lang" attribute. In that case the prompting will not be
useless at all. Quite the opposite.

> But it's confusion between language and writing
> system that is the fundamental error here, it seems to me.

Indeed. But I'm not sure if the mistake is in the browsers, or in the OS.

Alan J. Flavell

Feb 5, 2003, 4:53:27 PM
On Feb 5, Bertil Wennergren inscribed on the eternal scroll:

> The issue is with e.g. Japanese written in Latin transcription,

Yes, I know. I'm just making the point that this specific oddity
shouldn't be allowed to discredit the whole idea of correct markup.

> And there is in any case no provision in the "lang" attribute (or
> otherwise) to indicate which transcription system is being used.

Right, which means it's a sort-of pathological case, and it seems
the vendor(s) have made an unfortunate choice.

> >> but not as bad a problem
> >> as the download prompting. I could live with the funny fonts.
>
> > If the "funny" font also contains a more-complete character repertoire
> > for the "language group", then it's possible that better results will
> > be achieved overall.
>
> What benefits of that kind could there be for transcribed text?

Well, indeed. As this is Usenet I suppose it's understandable that
you would not credit me with the ability to reason that part out for
myself.

Again I make the point that _in theory_ in HTML there is NO
relationship between the language and the writing system. As usual
the browser developers have proved unable to put this distinction into
practice.

> > But it's confusion between language and writing
> > system that is the fundamental error here, it seems to me.
>
> Indeed. But I'm not sure if the mistake is in the browsers, or in the OS.

Jukka's posting seemed to make it clear that Mozilla does this
explicitly, i.e. that can't be blamed on the OS.

cheers

Bertil Wennergren

Feb 5, 2003, 6:10:33 PM
Alan J. Flavell:

> On Feb 5, Bertil Wennergren inscribed on the eternal scroll:

>> The issue is with e.g. Japanese written in Latin transcription,

> Yes, I know. I'm just making the point that this specific oddity
> shouldn't be allowed to discredit the whole idea of correct markup.

True. The only (sad) consequence would be a need to avoid indicating "lang"
in a few special cases, namely for foreign words that appear in the same
writing system as the main text although they're normally written with a
different script. For all other cases there is no reason to avoid the
"lang" attribute.
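
An author who wants to apply this rule to existing pages could search them for the affected cases. The following is a minimal Python sketch, not a finished tool: the class name, the hand-picked list NON_LATIN_LANGS, and the use of "ASCII only" as a rough proxy for "written in Latin characters" are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small hand-picked set of languages normally written in a
# non-Latin script. A real list would be much longer.
NON_LATIN_LANGS = {"ja", "ru", "ar", "zh", "el"}


class RomanizedLangFinder(HTMLParser):
    """Collect text marked with a non-Latin-script language but written in ASCII."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, lang or None) for each open element
        self.hits = []   # (lang, text) pairs worth reviewing by hand

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("lang")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags like <br>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # The innermost element that carries a lang value decides the language.
        lang = next((l for _, l in reversed(self.stack) if l), None)
        text = data.strip()
        if lang and text and text.isascii():
            primary = lang.split("-")[0].lower()
            if primary in NON_LATIN_LANGS:
                self.hits.append((lang, text))


finder = RomanizedLangFinder()
finder.feed('<p>I watch a lot of <span lang="ja">manga</span> movies.</p>')
finder.close()
```

After the run, finder.hits lists the spans an author might want to review. Whether to drop "lang" from them remains a judgment call, as discussed in this thread.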



> Again I make the point that _in theory_ in HTML there is NO
> relationship between the language and the writing system. As usual
> the browser developers have proved unable to put this distinction into
> practice.

Indeed.



>> But I'm not sure if the mistake is in the browsers, or in the OS.

> Jukka's posting seemed to make it clear that Mozilla does this
> explicitly, i.e. that can't be blamed on the OS.

That part is probably in the browsers. And it's probably essentially the same
thing that triggers the language pack download prompt too.

Now, if we could get Mozilla to behave better in cases like this (I have no
hope about Explorer...), then we could get around this at least in XHTML,
by using "xml:lang" instead of "lang", since Explorer has no idea what
"xml:lang" is (at least in XHTML files).

'<span xml:lang="ja">manga</span>' would _probably_ not trigger any stupid
downloads of language packs in Explorer. But it probably will in Mozilla
and Phoenix, as long as this bug remains.

Chris Hoess

Feb 5, 2003, 11:20:59 PM
In article <Pine.LNX.4.53.03...@lxplus083.cern.ch>,
Alan J. Flavell wrote:
>
> Again I make the point that _in theory_ in HTML there is NO
> relationship between the language and the writing system. As usual
> the browser developers have proved unable to put this distinction into
> practice.

Er, CJK ideographs? But I'll try to corner one of the font people and ask
about this.

--
Chris Hoess

Alan J. Flavell

Feb 6, 2003, 5:21:48 AM
On Feb 6, Chris Hoess inscribed on the eternal scroll:

> Alan J. Flavell wrote:
>
> > Again I make the point that _in theory_ in HTML there is NO
> > relationship between the language and the writing system. As usual
> > the browser developers have proved unable to put this distinction into
> > practice.
>
> Er, CJK ideographs?

Er, yes. I forgot that routine disclaimer about Han unification.
Mea culpa.

(I won't try to wriggle out on the excuse that _that_ distinction is a
matter for Unicode itself, rather than for HTML per se...)

> But I'll try to corner one of the font people and ask
> about this.

In fairness, Mozilla's preferences menu _does_ offer different fonts
per language (group), not per writing system, so one can't really deny
that it "does what it says on the tin". But the current thread does
seem to have shown up the logical confusion which that involves.

cheers

Andreas Prilop

Feb 6, 2003, 10:03:04 AM
Bertil Wennergren <bert...@gmx.net> wrote:

> It could be argued that the word "manga" is now an English word, although
> borrowed from Japanese, but let's forget about that, and decide that it's
> still Japanese, and should be marked up as such.

What is it good for? Is it useful at all to write

<P LANG="de">Dieser <SPAN LANG="en">Computer</SPAN> läuft unter
<SPAN LANG="en">Windows</SPAN> 2000.</P>

--
Top posting.
What's the most irritating thing on Usenet?

Andreas Prilop

Feb 6, 2003, 10:07:31 AM
d...@tobias.name (Daniel R. Tobias) wrote:

> I'd say the browsers are misbehaving; languages and character sets are
> completely different issues, and it would be more sensible if the
> browser prompted to install language support upon encountering
> characters not in its current repertoire, without regard to the
> language attributes.

Consider an English document with nothing but plain ASCII characters.
With [Cyrillic] "charset=ISO-8859-5", Windows Internet Explorer wants to
install a Cyrillic language pack. With [Cyrillic] "charset=Windows-1251",
it doesn't want to install -- IIRC.
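
Why the charset label alone is a poor trigger can be shown with a small Python sketch. The sample strings are chosen only for illustration. An ASCII-only page decodes to exactly the same text under either Cyrillic label, whereas real Cyrillic text is stored as different bytes in the two encodings.

```python
# An English page that contains nothing but ASCII bytes.
ascii_page = b"I watch a lot of manga movies."

# Decoding under either Cyrillic label gives the same characters,
# because both encodings agree with ASCII in the range 0x00-0x7F.
as_iso = ascii_page.decode("iso-8859-5")
as_win = ascii_page.decode("windows-1251")
same_text = as_iso == as_win

# Actual Cyrillic text, by contrast, is stored as different bytes.
cyrillic = "\u0414\u043e\u0441\u0442\u043e\u0435\u0432\u0441\u043a\u0438\u0439"
differs = cyrillic.encode("iso-8859-5") != cyrillic.encode("windows-1251")
```

So for the ASCII-only case nothing on the page depends on which of the two labels is used, and neither label by itself says that any Cyrillic glyph is needed.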

Andreas Prilop

Feb 6, 2003, 10:14:55 AM
"Jukka K. Korpela" <jkor...@cs.tut.fi> wrote:

> Mozilla really has
> language-dependent settings for fonts, and the lang attribute is apparently
> used to determine the language.

Can you set up different typefaces for Arabic and Persian? I think it
would be highly desirable to have Persian text displayed in different
[i.e. Nastaliq] typeface.
Example: <http://www.unics.uni-hannover.de/nhtcapri/arabic.html6>

Andreas Prilop

Feb 6, 2003, 10:21:15 AM
I wrote:

> Can you set up different typefaces for Arabic and Persian? I think it
> would be highly desirable to have Persian text displayed in different
> [i.e. Nastaliq] typeface.

Postscript: <http://www.arbornet.org/~tabish/u-font/>

Bertil Wennergren

Feb 6, 2003, 10:45:27 AM
Andreas Prilop:

> Consider an English document with nothing but plain ASCII characters.
> With [Cyrillic] "charset=ISO-8859-5", Windows Internet Explorer wants to
> install a Cyrillic language pack. With [Cyrillic] "charset=Windows-1251",
> it doesn't want to install -- IIRC.

That must be a bug within the bug, accidentally resulting in the right
behaviour.

Bertil Wennergren

Feb 6, 2003, 10:43:35 AM
Andreas Prilop:

> Bertil Wennergren <bert...@gmx.net> wrote:

>> It could be argued that the word "manga" is now an English word, although
>> borrowed from Japanese, but let's forget about that, and decide that it's
>> still Japanese, and should be marked up as such.

> What is it good for? Is it useful at all to write

> <P LANG="de">Dieser <SPAN LANG="en">Computer</SPAN> läuft unter
> <SPAN LANG="en">Windows</SPAN> 2000.</P>

A speech browser could use that info to switch to another set of
pronunciation rules for the foreign word.

Actually that can be a good rule for deciding if a "lang" attribute is
appropriate. If it would be desirable for the foreign word to be spoken
according to the rules of the foreign language (in the current context),
then "lang" should be used. If however a speech rendering according to the
surrounding language would be more appropriate, then "lang" should probably
not be used.

In your example the word "Computer" is now actually a German word, and it
would be somewhat ridiculous to speak it in pure English. The German
pronunciation would be the right choice. So, no "lang", I say.

But here the opposite would be true:

<P LANG="de">Die Amerikaner nennen das
"<SPAN LANG="en">nitpicking</SPAN>".
</P>

(No, I don't think "q" would be better than "span" there.)

Andreas Prilop

Feb 6, 2003, 11:11:21 AM
Bertil Wennergren <bert...@gmx.net> wrote:

>> Consider an English document with nothing but plain ASCII characters.
>> With [Cyrillic] "charset=ISO-8859-5", Windows Internet Explorer wants to
>> install a Cyrillic language pack. With [Cyrillic] "charset=Windows-1251",
>> it doesn't want to install -- IIRC.
>
> That must be a bug within the bug, accidentally resulting in the right
> behaviour.

Err, what's "right behaviour" here? I recall from memory the following
behaviour of Windows Internet Explorer and I hope I describe it
correctly:

For Central European, Turkish, Greek, Cyrillic character sets, IE
doesn't care which characters actually appear on a page; it only pays
attention to the specified encoding ("charset").
It wants to install a language pack only when "charset=ISO-8859-..."
but not when "charset=Windows-...".

I think it is an unfair advantage for Windows-specific encodings that
such pages will be displayed without installing any language packs.
I noticed that certain PCs with IE in public libraries here display
Windows-encoded Cyrillic pages, but not ISO-encoded Cyrillic pages.
Same for Greek and Central European. The user in such a public library
must get the impression that such pages are "broken".

Mark Tranchant

Feb 6, 2003, 10:58:01 AM
"Bertil Wennergren" <bert...@gmx.net> wrote in message
news:b1tvgl$180$04$1...@news.t-online.com...
> Andreas Prilop:

> > What is it good for? Is it useful at all to write
>
> > <P LANG="de">Dieser <SPAN LANG="en">Computer</SPAN> läuft unter
> > <SPAN LANG="en">Windows</SPAN> 2000.</P>
>
> A speech browser could use that info to switch to another set of
> pronunciation rules for the foreign word.
>
> Actually that can be a good rule for deciding if a "lang" attribute is
> appropriate. If it would be desirable for the foreign word to be spoken
> according to the rules of the foreign language (in the current context),
> then "lang" should be used. If however a speech rendering according to the
> surrounding language would be more appropriate, then "lang" should
> probably not be used.

Agreed. I'm in the process of "langing" the French place names in my page
at:

http://www.tranchant.freeserve.co.uk/cycling/french-trip.html
(changes not online, so don't point this out!)

but I have decided not to do so for Paris. English-speaking countries
pronounce this as "Pariss" rather than "Paree", which is how I would want it
spoken.

I debated this choice for some time, because the "lang" attribute is not
just for spoken pronunciation, but I think I've made the right decision.

--
Mark.


Bertil Wennergren

Feb 6, 2003, 1:01:50 PM
Andreas Prilop:

> Bertil Wennergren <bert...@gmx.net> wrote:

>>> Consider an English document with nothing but plain ASCII characters.
>>> With [Cyrillic] "charset=ISO-8859-5", Windows Internet Explorer wants to
>>> install a Cyrillic language pack. With [Cyrillic]
>>> "charset=Windows-1251", it doesn't want to install -- IIRC.

>> That must be a bug within the bug, accidentally resulting in the right
>> behaviour.

> Err, what's "right behaviour" here?

The correct behaviour is not bothering the user with useless download
prompts. The text uses ASCII characters only, so (providing the "ASCII
pack" is already installed) everything should go just fine.

Bertil Wennergren

Feb 6, 2003, 1:14:18 PM
Mark Tranchant:

> "Bertil Wennergren" <bert...@gmx.net> wrote in message
> news:b1tvgl$180$04$1...@news.t-online.com...

>> Actually that can be a good rule for deciding if a "lang" attribute is
>> appropriate. If it would be desirable for the foreign word to be spoken
>> according to the rules of the foreign language (in the current context),
>> then "lang" should be used. If however a speech rendering according to
>> the surrounding language would be more appropriate, then "lang" should
>> probably not be used.

> Agreed. I'm in the process of "langing" the French place names in my page
> at:

> http://www.tranchant.freeserve.co.uk/cycling/french-trip.html
> (changes not online, so don't point this out!)

> but I have decided not to do so for Paris. English-speaking countries
> pronounce this as "Pariss" rather than "Paree", which is how I would want
> it spoken.

> I debated this choice for some time, because the "lang" attribute is not
> just for spoken pronunciation, but I think I've made the right decision.

Indeed. From a visual point of view the test could be if it would make any
kind of sense to present the word in question in italics (for a foreign
word), as is often done in typesetting, or with quote marks (indicating a
foreign word). As for "Paris" occurring in sentences like "I went to Paris
last year", that would be ludicrous.

But for "I went to 'ad-Dar al-Bayda' last year" things would be different.
The normal English name is "Casablanca", and if someone would (for some
weird reason) use 'ad-Dar al-Bayda' instead, that would indeed call for
italics or quote marks - and also for 'lang="ar"'. But, unfortunately, that
would make brain-dead browsers (or operating systems) prompt some users to
download a couple of megabytes of Arabic fonts (from which the browser would
then proceed to use only the ASCII characters!). So, from a practical point
of view the "lang" attribute would probably have to be ditched.

On the other hand, if the text was a quotation from CNN, most any foreign
city name should probably have a "lang" attribute, since the CNN announcers
tend to tie their tongues into knots in order to make those names sound
"foreign" (the same quasi-Spanish pronunciation rules being applied to any
language from Sami to Swazi)...

Harlan Messinger

Feb 6, 2003, 1:57:10 PM

"Bertil Wennergren" <bert...@gmx.net> wrote in message
news:b1u7jq$tgs$01$2...@news.t-online.com...

What would the impact on efficiency be if the browser were to check every
single character against the list of available characters in the current
character set and font, rather than associating each writing system with a
different font and then performing the check only upon encountering a lang
attribute?


Alan J. Flavell

Feb 6, 2003, 2:08:44 PM
On Feb 6, Harlan Messinger inscribed on the eternal scroll:

> What would the impact on efficiency be if the browser were to check every
> single character against the list of available characters in the current
> character set and font,

This doesn't work in IE anyway. If you have an i18n document which
uses obscure Unicode characters in a particular writing system, and
you have a font which claims to support that writing system, then IE
is willing to use that font irrespective of whether it really does
contain each and every one of the particular characters that you need.
There's some descriptive observations at
http://ppewww.ph.gla.ac.uk/~flavell/charset/browsers-fonts.html
although I admit I don't know what's going on internally so it's
purely an heuristic account of what I found.

(Mozilla, on the other hand, seems willing to pick odd characters out
of different fonts if the primary choice doesn't contain them.)
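
The per-character fallback described in that observation can be sketched in a few lines of Python. This is only a toy model of the idea, not Mozilla's actual algorithm; the font names and coverage sets are invented for the example.

```python
def pick_fonts(text, fonts):
    """Choose a font for each character of text.

    fonts is an ordered list of (name, set_of_covered_characters); the first
    font that covers a character wins. None means no listed font covers it.
    """
    result = []
    for ch in text:
        chosen = next((name for name, covered in fonts if ch in covered), None)
        result.append((ch, chosen))
    return result


# Hypothetical fonts with tiny coverage sets, for illustration only.
fonts = [
    ("PrimarySerif", set("abcdefghijklmnopqrstuvwxyz ")),
    ("FallbackCJK", {"\u6f2b", "\u753b"}),
]
choices = pick_fonts("a\u6f2b", fonts)
```

The contrast with the IE behaviour described above is that the check here is made per character against real coverage, instead of trusting a font's claim to support a whole writing system.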

It's not just a matter of fonts: those language packs in general
contain more than just font coverage. You might already have an
installed font for the writing system in question, but it'll still
want to install the language pack for it if you haven't done so
before.

> rather than associating each writing system with a
> different font and then performing the check only upon encountering a lang
> attribute?

It's a trivial lookup operation to determine from the Unicode value
U+xxxx just which writing system group a character belongs to.
CPU cost can be ignored, compared with what's needed to implement
any kind of typographical rendering algorithm, IMHO.
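
A minimal Python sketch of such a lookup, assuming a small hand-picked excerpt of Unicode block boundaries (the table, labels, and function name are illustrative, and everything below U+0370 is lumped together as "Latin" for brevity):

```python
from bisect import bisect_right

# (first code point of a range, label). Each range runs up to the next entry.
# Only an excerpt: a real table would cover every Unicode block.
_RANGES = [
    (0x0000, "Latin"),          # Basic Latin, Latin-1, extensions, diacritics
    (0x0370, "Greek"),
    (0x0400, "Cyrillic"),
    (0x0530, "other"),
    (0x0600, "Arabic"),
    (0x0700, "other"),
    (0x3040, "Japanese kana"),  # Hiragana and Katakana
    (0x3100, "other"),
    (0x4E00, "CJK ideographs"),
    (0xA000, "other"),
]
_STARTS = [start for start, _ in _RANGES]


def writing_system(ch):
    """Return the label of the range that contains the character ch."""
    index = bisect_right(_STARTS, ord(ch)) - 1
    return _RANGES[index][1]


groups = [writing_system(c) for c in "m\u0414\u307e\u6f2b"]
```

The binary search costs a handful of comparisons per character, which fits the point that the lookup is negligible next to the rendering work itself.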

Chris Hoess

Feb 10, 2003, 9:20:54 PM
In article <Pine.LNX.4.53.03...@lxplus086.cern.ch>,
Alan J. Flavell wrote:
>
> In fairness, Mozilla's preferences menu _does_ offer different fonts
> per language (group), not per writing system, so one can't really deny
> that it "does what it says on the tin". But the current thread does
> seem to have shown up the logical confusion which that involves.

Yes, that does seem to be the case as far as "funny fonts" go; the only
way to avoid this I can see would be to set all one's fonts in the
different categories (serif, sans-serif, etc.) to a single, complete
Unicode font for each category. Ugh. In a very broad sense, though, I
think this is symptomatic of a problem that transcends petty details of
software and encoding: it's the problem of co-existence between our normal
methods of writing (which, in the English-speaking world, is likely to be
limited to the Latin alphabet and various mathematical symbols, and which
cannot cope with ideographs) and with our software, which is able to
process a much greater character range, in the form of Unicode, than most
of us are able to write or understand. To be more pithy, the technique of
"romanizing" foreign languages strikes me as not a little like the wetware
equivalent of <font face="symbol">a</font>; it's ambiguous and unpleasant,
but if your brain/software environment is incapable of handling Unicode,
what can one do?

In other news, I've filed bug 192636 on the original problem, so we'll see
how that plays out.

--
Chris Hoess

Alan J. Flavell

Feb 11, 2003, 12:16:10 PM
On Feb 11, Chris Hoess inscribed on the eternal scroll:

> Alan J. Flavell wrote:
> > In fairness, Mozilla's preferences menu _does_ offer different fonts
> > per language (group), not per writing system, so one can't really deny
> > that it "does what it says on the tin".

As a matter of interest, Google-Groups showed me your posting
accompanied by three "sponsored links" about Chinese language courses.
Hmmm.

> Yes, that does seem to be the case as far as "funny fonts" go; the only
> way to avoid this I can see would be to set all one's fonts in the
> different categories (serif, sans-serif, etc.) to a single, complete
> Unicode font for each category. Ugh.

It looks as if MSIE (or the operating system in which it is a
component, I never know which to attribute to what) goes some way
towards this, although again the results can be less than ideal, as I
discuss amongst the random selection of topics in
http://ppewww.ph.gla.ac.uk/~flavell/charset/browsers-fonts.html

These observations relate to default font selection in a Western-
language locale, anyway. If you configure for "Latin" scripts a font
which also claims[1] to contain Greek, Cyrillic etc., then IE will
prefer that font, and ignore what you configured for Greek, Cyrillic
etc even though they might have been better fonts for those writing
systems.

[1] i.e. in what the MS "font properties" extension calls the font's
"Unicode data".

> In a very broad sense, though, I
> think this is symptomatic of a problem that transcends petty details of
> software and encoding: it's the problem of co-existence between our normal
> methods of writing (which, in the English-speaking world, is likely to be
> limited to the Latin alphabet and various mathematical symbols, and which
> cannot cope with ideographs) and with our software,

You have a point; but although Western writing might be predominant with
us, I assume there are also arrangements for writing "Latin" languages
in other writing systems.

> To be more pithy, the technique of
> "romanizing" foreign languages strikes me as not a little like the wetware
> equivalent of <font face="symbol">a</font>; it's ambiguous and unpleasant,

As I remarked before, HTML's principle is that the language is the
language no matter how it is written. That's why the lang attribute
was meant to be orthogonal to the actual characters used.

It's only indirectly via Unicode's "Han unification" that it became
necessary to disambiguate CJK writing via choice of language; but that
isn't my field, so I'm not going to go pontificating about it.

> In other news, I've filed bug 192636 on the original problem,

Seems a fair statement of that part of the problem ;-)

cheers
