Chinese text in HTML page and Byte-Order Mark

Alfred Molon

May 27, 2013, 5:16:41 PM

I've noticed that some pages use <span lang="zh" xml:lang="zh"> to embed
Chinese text, but even simply embedding Chinese text in a UTF-8 HTML
page seems to work fine as well, for instance:
http://www.molon.de/galleries/Taiwan/Kaohsiung/Skyline/

Why then would this language declaration be necessary?

Another question, the above page validates without errors, but I get the
warning:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
cause problems for some text editors and older browsers. You may want to
consider avoiding its use until it is better supported.

How to remove the Byte-Order Mark?
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe

dorayme

May 27, 2013, 7:45:06 PM

In article <MPG.2c0deba11...@news.supernews.com>,
Alfred Molon <alfred...@yahoo.com> wrote:

> How to remove the Byte-Order Mark?

Look in your editor. In mine, you can choose:

<http://dorayme.netweaver.com.au/justPics/byteordersettings.png>

--
dorayme

Ben Bacarisse

May 27, 2013, 10:19:47 PM

Alfred Molon <alfred...@yahoo.com> writes:

> I've noticed that some pages use <span lang="zh" xml:lang="zh"> to embed
> Chinese text, but even simply embedding Chinese text in a UTF-8 HTML
> page seems to work fine as well, for instance:
> http://www.molon.de/galleries/Taiwan/Kaohsiung/Skyline/
>
> Why then would this language declaration be necessary?

It lets the reader (either human or mechanical) know what language a
piece of text is written in. For example, a voice-based browser needs
to pronounce

<h1 lang=en>Les Chats</h1>

and

<h1 lang=fr>Les Chats</h1>

quite differently.

--
Ben.

Alfred Molon

May 28, 2013, 1:54:03 AM

In article <dorayme-43D6B1...@news.albasani.net>, dorayme
says...

> > How to remove the Byte-Order Mark?
>
> Look in your editor. In mine, you can choose:
>
> <http://dorayme.netweaver.com.au/justPics/byteordersettings.png>

I see. I just noticed that my editor (Notepad++) also has that option.
What is the BOM needed for?

Jukka K. Korpela

May 28, 2013, 2:44:33 AM

2013-05-28 0:16, Alfred Molon wrote:

> I've noticed that some pages use <span lang="zh" xml:lang="zh"> to embed
> Chinese text, but even simply embedding Chinese text in a UTF-8 HTML
> page seems to work fine as well

Yes, pages work without the lang attribute, but using it may have some
effect.

> Why then would this language declaration be necessary?

According to accessibility guidelines, the language of text should be
declared, to help e.g. in speech synthesis. This applies to all texts,
including English-language texts. But this is largely just theory,
though it would apply especially strongly to Chinese texts, since the
way "Chinese" characters (characters of Chinese origin, used for writing
Chinese, Japanese, and other languages) are pronounced may depend
essentially on the language. In practice, speech synthesizers will guess
the language, use a fixed language, or use the language selected by the
user.

There are other reasons for declaring language, see
http://www.w3.org/International/questions/qa-lang-why.en
but I will just illustrate one of them:

When I view a page containing Chinese characters, on Firefox, those
characters appear in my system in the MS PGothic font, when the page
does not have any font settings. If the characters are inside an element
to which lang=zh applies, they appear in the SimSun font instead. And if
the attribute is lang=zh-TW or lang=zh-Hant, they appear in PMingLiU.
The reason is that the attribute makes the browser apply different
default fonts.

Nowadays, few authors leave fonts unspecified. The main reason is
probably that most browsers have Times New Roman as the default font,
and it is common knowledge, or prejudice, that it is unsuitable for web
pages. So authors declare Arial, because someone told them it's cool, or
Verdana, since someone said it's even cooler. And because those fonts
aren't really cool at all in normal font size, authors too often set
font size to something barely legible, but I digress.

On the page you mentioned, the font family declaration in CSS is
font-family: Verdana, Arial, Helvetica, sans-serif. Since none of the
specific font families listed contains Chinese characters, the browser
will use its definition for sans-serif and, if that font does not contain
them either, pick them up from one of the other fonts in the system, using
its own internal rules.

The moral is that when using Chinese characters, you should take them
into account when writing your font-family rule. This is not obligatory,
but it's the right way to ensure (as far as possible) that the font used
for them will be acceptable and will stylistically match the font used
in the text otherwise.

And when you do so, the lang attribute does not matter in font selection
- but it is advisable to use it for other reasons.

> Another question, the above page validates without errors, but I get the
> warning:
>
> Byte-Order Mark found in UTF-8 File.
>
> The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
> cause problems for some text editors and older browsers. You may want to
> consider avoiding its use until it is better supported.

That's grossly outdated information, probably retained just because some
people think there *might* be some browser in use that has problems with
BOM. There isn't. Hasn't been for many years. Except perhaps in a
museum, where Netscape 2 and IE 3 can be seen.

In the modern world, BOM is *good* even in UTF-8. It acts as a
practically certain way of indicating that the page is UTF-8 encoded,
even if HTTP headers are missing (e.g., because the page has been saved
locally).
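
To make the mechanism concrete: the UTF-8 BOM is just the three bytes
EF BB BF at the very start of the file. Here is a minimal Python sketch of
how a program could check for it. The function names has_utf8_bom and
decode_html_bytes are made up for illustration; only the codecs constant
and the "utf-8-sig" codec come from the Python standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_html_bytes(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    The "utf-8-sig" codec accepts input with or without a BOM.
    """
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    # Hypothetical page content: a BOM followed by some Chinese text.
    page = codecs.BOM_UTF8 + "高雄".encode("utf-8")
    print(has_utf8_bom(page))
    print(decode_html_bytes(page))
```

Decoding the same bytes with the plain "utf-8" codec keeps the BOM as a
leading U+FEFF character, which is the kind of thing that trips up tools
that do not expect it.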

You may have problems if you have a BOM at the start of a PHP file. But
that's something completely different.

> How to remove the Byte-Order Mark?

You could remove it by using an editor that can save in the "UTF-8, no
BOM" format. But there is no reason to remove it.
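
If you still want it gone and would rather script it than change an
editor setting, something along these lines should do. This is a rough
Python sketch, not a polished tool: strip_utf8_bom and
strip_bom_from_file are hypothetical names, and the file is rewritten in
place, so try it on a copy first.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other input is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file at path.

    Returns True if the file was rewritten, False if it had no BOM.
    """
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Calling strip_bom_from_file("index.html") (the file name is only an
example) returns True the first time if a BOM was there and False on any
later run.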

--
Yucca, http://www.cs.tut.fi/~jkorpela/