Guide to using special characters in HTML

Jukka K. Korpela

Jan 31, 2012, 12:14:00 PM

Years ago, I wrote a page about using national and special characters in
HTML. At that time, the main problems were with character encodings.
Things have changed a lot, and instead of updating the old document, I
wrote a new one that focuses on font problems:

http://www.cs.tut.fi/~jkorpela/html/characters.html

It's largely based on discussions in this group. Comments welcome.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

dorayme

Jan 31, 2012, 4:36:25 PM

In article <jg97gm$7rs$1...@dont-email.me>,
"Jukka K. Korpela" <jkor...@cs.tut.fi> wrote:

> ... I
> wrote a new one that focuses on font problems:
>
> http://www.cs.tut.fi/~jkorpela/html/characters.html
>
> It's largely based on discussions in this group. Comments welcome.

Really nice, JK.

Perhaps you could say just a little more about "Directly as a
character" to put the familiar idea of "typing" the character in from
the keyboard as an example for the reader.

--
dorayme

tlvp

Jan 31, 2012, 6:52:21 PM

On Tue, 31 Jan 2012 19:14:00 +0200, Jukka K. Korpela wrote:

> ...
> http://www.cs.tut.fi/~jkorpela/html/characters.html
> ... Comments welcome.

Just before the new section headed "Use UTF-8 ... " is an odd fragment

: character data to be added to the document rendering
: might use a different using a “backslash escape,”

My guess: something's missing between "different" and "using".
Alternatively, the "using a" part is superfluous (?) or "rendering" (?) .
But I'm no mind reader, Jukka, only *you* know what you meant here :-) .

HTH. Cheers, -- tlvp
--
Avant de repondre, jeter la poubelle, SVP.

tlvp

Jan 31, 2012, 7:07:04 PM

On Tue, 31 Jan 2012 19:14:00 +0200, Jukka K. Korpela wrote:

> ...
> http://www.cs.tut.fi/~jkorpela/html/characters.html

> ... Comments welcome. ...

Another tiny copy-edit, if I may: just before the very end, in the fragment

: ... important thing is the level out the differences ...

I'd bet the first "the" was meant to be a "to" :-) .

In all other respects, this is a very clear, nicely organized, and helpful
addition to Jukka's {IT and communications} site: Thank you!

Again, IHTH. Cheers, -- tlvp

Jukka K. Korpela

Feb 1, 2012, 1:52:04 AM

2012-01-31 23:36, dorayme wrote:

> Perhaps you could say just a little more about "Directly as a
> character" to put the familiar idea of "typing" the character in from
> the keyboard as an example for the reader.

In this context (of "special" characters), the odds are that the
character cannot be typed in directly. But the methods of entering
characters vary greatly, so there's little one can say about them
without writing quite a lot. But I added a link to
http://www.fileformat.info/tip/microsoft/enter_unicode.htm
which probably covers over 80% of the needs of over 80% of people in
writing "special" characters, in one way. (There's always a faster way,
but implementing it may take time.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Jukka K. Korpela

Feb 1, 2012, 2:16:37 AM

2012-02-01 1:52, tlvp wrote:

> Just before the new section headed "Use UTF-8 ... " is an odd fragment
>
> : character data to be added to the document rendering
> : might use a different using a “backslash escape,”
>
> My guess: something's missing between "different" and "using".
> Alternatively, the "using a" part is superfluous (?) or "rendering" (?) .

The “using a” part was superfluous. Thank you for this and the other
copyedit.

The word “rendering” is not superfluous, as in CSS, we can add
characters to the rendering, not to the document itself—e.g.,
blockquote:before { content: "\002605"; color: red } would add a red
U+2605 character at the start of each block quotation, _without_ adding
the character to the document as seen by search engines, client-side
JavaScript, etc.
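
For illustration only: a minimal JavaScript sketch of how a CSS backslash
escape such as \002605 maps to a character. The helper names
cssEscapeToChar and charToCssEscape are hypothetical and not part of any
library, and the sketch handles only the hex-digit form of the escape.

```javascript
// Hypothetical helper: decode a CSS hex escape (backslash + 1 to 6 hex digits)
// into the character it denotes. Other escape forms are not handled here.
function cssEscapeToChar(escape) {
  const match = /^\\([0-9a-fA-F]{1,6})$/.exec(escape);
  if (!match) {
    throw new Error("not a CSS hex escape: " + escape);
  }
  // String.fromCodePoint throws a RangeError for values above 0x10FFFF.
  return String.fromCodePoint(parseInt(match[1], 16));
}

// Hypothetical inverse: write the first character of a string as a
// six-digit CSS escape, which needs no terminating space.
function charToCssEscape(ch) {
  return "\\" + ch.codePointAt(0).toString(16).padStart(6, "0");
}

// In a JavaScript string literal the backslash itself must be doubled.
console.log(cssEscapeToChar("\\002605")); // U+2605 BLACK STAR
console.log(charToCssEscape("\u00e9"));   // \0000e9
```

Used this way, the escape is only a notation in the style sheet; the
character it produces exists in the rendering, not in the document.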

--
Yucca, http://www.cs.tut.fi/~jkorpela/

tlvp

Feb 1, 2012, 5:26:32 AM

On Wed, 01 Feb 2012 09:16:37 +0200, Jukka K. Korpela wrote:

> ... [snip] ...

> The “using a” part was superfluous. Thank you for this and the other
> copyedit.

You're very welcome: a modest repayment for all the help you've given me.

> The word “rendering” is not superfluous, as ...

I daresay. But faced only with text some of which you couldn't have
intended, I couldn't easily guess what you *did* intend :-) . Clear now,
though.

Gus Richter

Feb 1, 2012, 12:13:24 PM

On 1/31/2012 12:14 PM, Jukka K. Korpela wrote:
> Years ago, I wrote a page about using national and special characters in
> HTML. At that time, the main problems were with character encodings.
> Things have changed a lot, and instead of updating the old document, I
> wrote a new one that focuses on font problems:
>
> http://www.cs.tut.fi/~jkorpela/html/characters.html
>
> It's largely based on discussions in this group. Comments welcome.

Very nice and informative but I have a question regarding this document.
Why have you used ISO-8859-1 instead of UTF-8 in spite of you advising
to "Use UTF-8 if possible" and stating that UTF-8 is "Usually the best
option"?

Disregarding the obsolete element usage, the W3C Validator states that
you are "Using windows-1252 instead of the declared encoding iso-8859-1"?

--
Gus

Jukka K. Korpela

Feb 1, 2012, 2:02:29 PM

2012-02-01 19:13, Gus Richter wrote:

> Why have you used ISO-8859-1 instead of UTF-8 in spite of you advising
> to "Use UTF-8 if possible" and stating that UTF-8 is "Usually the best
> option"?

Good catch. "Possible" is a relative word. I _could_ use UTF-8, but it
would be inconvenient to the extent that there are risks of messing
things up. I maintain my pages on a Unix system, mostly using PuTTY and
Emacs, which aren't particularly UTF-8-friendly. _Usually_ people are
using tools that are more suitable for direct UTF-8 authoring.

> Disregarding the obsolete element usage, the W3C Validator states that
> you are "Using windows-1252 instead of the declared encoding iso-8859-1"?

I know. And I understand the reasoning behind favoring windows-1252, but
I still haven't accepted it. I think iso-8859-1 has a legal right to
exist on its own, and if I say charset=iso-8859-1, I _mean_ it. But this
might be pointless standardism.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

David Stone

Feb 1, 2012, 4:23:52 PM

In article <jg97gm$7rs$1...@dont-email.me>,
"Jukka K. Korpela" <jkor...@cs.tut.fi> wrote:

> Years ago, I wrote a page about using national and special characters in
> HTML. At that time, the main problems were with character encodings.
> Things have changed a lot, and instead of updating the old document, I
> wrote a new one that focuses on font problems:
>
> http://www.cs.tut.fi/~jkorpela/html/characters.html
>
> It's largely based on discussions in this group. Comments welcome.

Bookmarked!

Some suggested corrections:

"Beware that HTML5 drafts have an extended set of ..." - this should
probably be "Be aware that..." or even "Please note that..."

"Font support to CONTOUR INTEGRAL" - s/to/for/

dorayme

Feb 1, 2012, 5:33:52 PM

In article
<no.email-AF446B...@news.eternal-september.org>,

David Stone <no.e...@domain.invalid> wrote:

> In article <jg97gm$7rs$1...@dont-email.me>,
> "Jukka K. Korpela" <jkor...@cs.tut.fi> wrote:

...
> > http://www.cs.tut.fi/~jkorpela/html/characters.html
> >
...

>
> "Beware that HTML5 drafts have an extended set of ..." - this should
> probably be "Be aware that..." or even "Please note that..."

Maybe it sounds odd because of the "that" and also because the thing
to be wary of is the limited browser support rather than the
immediately following 'HTML5 drafts have an extended set of “named
character references,”'.

I think it is OK and communicates the meaning as is but perhaps

The HTML5 drafts have an extended set
of “named character references,” but
beware of poor support for the added names.

would make everyone happy?

The things we talk about! <g>

--
dorayme

Dr J R Stockton

Feb 1, 2012, 6:41:30 PM

In comp.infosystems.www.authoring.html message <jg97gm$7rs$1@dont-
email.me>, Tue, 31 Jan 2012 19:14:00, Jukka K. Korpela
<jkor...@cs.tut.fi> posted:

>Years ago, I wrote a page about using national and special characters
>in HTML. At that time, the main problems were with character encodings.
>Things have changed a lot, and instead of updating the old document, I
>wrote a new one that focuses on font problems:
>
>http://www.cs.tut.fi/~jkorpela/html/characters.html
>
>It's largely based on discussions in this group. Comments welcome.

cited.

"Entering characters"

It could be useful to state the range of character encodings. It is
fairly well-known that "ASCII or ANSI" goes up to "127 or 255", and that
Unicode goes up to 65535 = U+FFFF. But in fact Unicode goes higher, so
do browsers actually understand 𒍁 or $#x123412;, regardless of
the question of whether any installed font has them ?

JavaScript characters are restricted to \u0000 to \uFFFF; unless there's
something I've missed in ECMA 262 5; I doubt whether Unicode over U+FFFF
can be represented in current JavaScript, and I have no idea about
VBScript.

A link to the UTF-8 standard, and to a good description, might help.

You could add that if it is essential to use a character (whether or not
in Unicode) that is liable not to be displayed suitably as such by
readers' browsers, a small image styled for height in ex units can be
used.

I have a page which requires one Old French beta. That is in Unicode,
and below U+FFFF; and my present computer system displays it nicely. On
Old French Greek, it no doubt is seen as an Old French beta. But in an
English sentence, even in quotes, it looks too much like a 6, no actual
6 being nearby for comparison.

The printer of Euler's Collected Works had the same problem; his cases
held β but not ϐ. In E.327, the ϐ characters may
actually be individually hand-drawn; they lack sharpness of edge, and
seem to vary more than printed ones do. So I chose a nice one and, via
copy'n'paste and Windows Paste, made a little graphic of it. It looks
as much at home in the browser display as in the PDF image of the Works.
So styled, it matches its HTML neighbours in Zoom and Zoom Text Only.

IMHO, it's normally better to have a character which certainly looks to
be an instance of that character at about the right size than to have
one which may be perfect for style and size but is quite likely not to
show as that character at all.

"...such as \0000e9." - I know nothing of that format; but I only use
CSS1, since the CSS2 document is too big.

"Line spacing problems"

The first diagram needs more words. Generally, 29" is shown with a
normal line height, since ordinary people (British, Finns, etc.) get
those characters by pressing 2 9 Shift-2 and get ASCII 50 57 34, with
ordinary line height.

Evidently CSS font designation needs grammar for "anything but Cambria
Math".

An index at the top, linking to H2 elements, would help.

Since you wrote on ISO 8601, you may care to know that I have something
resembling MSDOS DIR but with options for yyyy-mm-dd, yyyy-Www-d, yyyy-
ddd, and time_t in milliseconds, days GMT, and days LCT. Google for
SEAKFYLE.

I write Web pages in 7-bit ASCII, characters HT, LF, CR, 20-126.

--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike 6.05 WinXP.
Web <http://www.merlyn.demon.co.uk/> - FAQ-type topics, acronyms, and links.
Command-prompt MiniTrue is useful for viewing/searching/altering files. Free,
DOS/Win/UNIX now 2.0.6; see <URL:http://www.merlyn.demon.co.uk/pc-links.htm>.

Jukka K. Korpela

Feb 2, 2012, 7:19:05 AM

2012-02-02 1:41, Dr J R Stockton wrote:

> It is fairly well-known that [...]
> Unicode goes up to 65535 = U+FFFF. But in fact Unicode goes higher, so
> do browsers actually understand 𒍁 or $#x123412;, regardless of
> the question of whether any installed font has them ?

I usually avoid referring to Unicode characters past U+FFFF because they
involve extra complexities and most people never saw any need for such
characters. In this context, however, there is nothing special about
them: they can be entered as such, or as character references (there are
no entity references defined for them). Browsers seem to handle them the
same way as the BMP characters (i.e., the range up to U+FFFF). Demo:
http://www.cs.tut.fi/~jkorpela/html/non-bmp.html8
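
As a rough sketch of the character reference notation for such characters
(the function names and the constant are made up for this example): a
numeric character reference is just the code point written in hexadecimal
or decimal, and the Unicode code space ends at U+10FFFF, so a value such
as x123412, taken literally, would fall outside it.

```javascript
// Upper limit of the Unicode code space.
const MAX_CODE_POINT = 0x10FFFF;

// Hypothetical helper: reject anything that is not a code point number.
function checkCodePoint(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > MAX_CODE_POINT) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
}

// Hexadecimal numeric character reference, e.g. &#x12341;
function toHexCharRef(codePoint) {
  checkCodePoint(codePoint);
  return "&#x" + codePoint.toString(16).toUpperCase() + ";";
}

// Decimal numeric character reference for the same character.
function toDecCharRef(codePoint) {
  checkCodePoint(codePoint);
  return "&#" + codePoint.toString(10) + ";";
}

console.log(toHexCharRef(0x12341)); // &#x12341;
console.log(toDecCharRef(0x12341)); // &#74561;
```

The same two functions work unchanged for BMP characters, which matches
the point that there is nothing special about the higher range here.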

> JavaScript characters are restricted to \u0000 to \uFFFF; unless there's
> something I've missed in ECMA 262 5; I doubt whether Unicode over U+FFFF
> can be represented in current JavaScript,

"JavaScript characters" are unsigned 16-bit integers which can be
interpreted as Unicode code units, so Unicode over U+FFFF needs to be
represented as two consecutive code units (so-called surrogates). In
modern browsers, you can use Unicode over U+FFFF as such in JavaScript
string literals, too, or you can write them using the \u notation for
each of the code units. So it's a bit complicated, but possible. There's
a discussion of the topic in my book "Going Global with JavaScript and
Globalize.js".
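
The surrogate arithmetic described above can be sketched as follows. This
is only an illustration, with hypothetical function names, of how a code
point above U+FFFF splits into two 16-bit code units and how those could
be written in \u notation.

```javascript
// Split a code point above U+FFFF into its high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Write the pair in \u notation, one escape per code unit.
function toUnicodeEscapes(codePoint) {
  return toSurrogatePair(codePoint)
    .map((unit) => "\\u" + unit.toString(16).toUpperCase().padStart(4, "0"))
    .join("");
}

console.log(toUnicodeEscapes(0x12341));                          // \uD808\uDF41
console.log("\uD808\uDF41" === String.fromCodePoint(0x12341));   // true
console.log("\uD808\uDF41".length);                              // 2 code units
```

Note that the string length counts code units, so a single character
above U+FFFF reports a length of 2.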

> You could add that if it is essential to use a character (whether or not
> in Unicode) that is liable not to be displayed suitably as such by
> readers' browsers, a small image styled for height in ex units can be
> used.

I have mixed feelings about this. I'm afraid it might make people use
such techniques for no reason, just as they use images even for simple
equations. But undoubtedly there are situations where an image is the
practical solution.

> "...such as \0000e9." - I know nothing of that format; but I only use
> CSS1, since the CSS2 document is too big.

It's not needed in CSS1, but it can be useful in CSS2 when using
generated content.

> "Line spacing problems"
>
> The first diagram needs more words. Generally, 29" is shown with a
> normal line height, since ordinary people (British, Finns, etc.) get
> those characters by pressing 2 9 Shift-2 and get ASCII 50 57 34, with
> ordinary line height.

The diameter sign may also cause the phenomenon. But I'll add a note
about the correct inch symbol.

> An index at the top, linking to H2 elements, would help.

I'll probably add it. I first thought the page is so short that it does
not need an index, but then it started growing...

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Roedy Green

Feb 19, 2012, 1:54:43 PM

On Thu, 02 Feb 2012 14:19:05 +0200, "Jukka K. Korpela"
<jkor...@cs.tut.fi> wrote, quoted or indirectly quoted someone who
said :

>> It is fairly well-known that [...]
>> Unicode goes up to 65535 = U+FFFF. But in fact Unicode goes higher, so
>> do browsers actually understand 𒍁 or $#x123412;, regardless of
>> the question of whether any installed font has them ?

I have prepared a number of tables showing you the entities, hex
entities, and how the characters render in your browser. I also
have English language descriptions of the characters so you can search
for them.

See http://mindprod.com/jgloss/htmlentities.html
and
http://mindprod.com/jgloss/html5.html
and
http://mindprod.com/jgloss/hexentities.html

--
Roedy Green Canadian Mind Products
http://mindprod.com
One of the most useful comments you can put in a program is
"If you change this, remember to change ?XXX? too".

Stan Brown

Feb 19, 2012, 5:29:17 PM

On Sun, 19 Feb 2012 10:54:43 -0800, Roedy Green wrote:
>
> I have prepared a number of tables showing you the entities, hex
> entities, and how the characters render in your browser. I also
> have English language descriptions of the characters so you can search
> for them.
>
> See http://mindprod.com/jgloss/htmlentities.html
> and
> http://mindprod.com/jgloss/html5.html
> and
> http://mindprod.com/jgloss/hexentities.html

Lots of such tables exist already. Can you say what makes yours
better?

You might want to take another look at the page title, "Java
Glossary".

And why is it "best viewed with Internet Explorer"? Demanding use of
a particular browser is always a bad idea.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you

Stan Brown

Feb 19, 2012, 5:30:18 PM

On Sun, 19 Feb 2012 10:54:43 -0800, Roedy Green wrote:

> http://mindprod.com/jgloss/htmlentities.html

You have listed the acute accent among "quoting characters".

Stan Brown

Feb 19, 2012, 5:32:38 PM

On Sun, 19 Feb 2012 10:54:43 -0800, Roedy Green wrote:
> http://mindprod.com/jgloss/htmlentities.html

You have described « and » as sea birds. The correct term is
"guillemets", not "guillemots".

I'll stop now.

tlvp

Feb 19, 2012, 9:16:59 PM

On Sun, 19 Feb 2012 10:54:43 -0800, Roedy Green wrote:

> ...
> I have prepared a number of tables showing you the entities, hex ...

As regards the table at

<http://mindprod.com/jgloss/htmlentities.html#POLISH> ,

I should point out that Polish uses no Ô -- it does use an ó,
and that's what's wanted where your table features O-circumflexes :-) .

That will buff out the only rough spot I could easily find on an otherwise
very smooth and useful page. Cheers,

tlvp

Feb 19, 2012, 9:21:21 PM

On Sun, 19 Feb 2012 17:32:38 -0500, Stan Brown wrote:

> On Sun, 19 Feb 2012 10:54:43 -0800, Roedy Green wrote:
>> http://mindprod.com/jgloss/htmlentities.html
>
> You have described « and » as sea birds. The correct term is
> "guillemets", not "guillemots".

Guillemots are already an endangered species -- do you want to make them
rarer still by banning their use in naming "French quotation marks" :-) ?

Jukka K. Korpela

Feb 20, 2012, 12:44:51 AM

2012-02-20 4:16, tlvp wrote:

> As regards the table at
>
> <http://mindprod.com/jgloss/htmlentities.html#POLISH> ,
>
> I should point out that Polish uses no Ô -- it does use an ó,
> and that's what's wanted where your table features O-circumflexes :-) .

And Ô surely isn't the lowercase equivalent to ó.

Stan Brown mentioned that lots of such tables exist and asked what makes
this better. We already know that this contains errors. I mentioned this
at least 4 years ago when Roedy Green advertized his "HTML Entities :
Java Glossary".

It also contains loads of named character references that are not
supported e.g. by IE, despite the bogus claim "best viewed with
Microsoft Internet Explorer". Just saying "Do not use the following
HTML5 entities in your web pages. Many browsers do not support them
yet." isn't enough. There is no reason to advertize such "entities"
(HTML5 does not call them entities) at all, and people are known to miss
disclaimers.

Just to mention some more bogosities found on a casual look, on the page on
"Ligatures":
1) It lists several characters that are not ligatures but normal letters
of the Latin alphabet (though they _originated_ in ligatures, just as
"w" did).
2) It lists "th" and "fj" as "not available", giving the false impression
that there is something special about these letter combinations.
4) It presents a reference to a Private Use Area character, which is
grossly misleading: it's not "unofficial", it's officially for private
use by agreements between interested parties, not to be used in any
information interchange outside such agreements.
5) It claims that the sharp s ß "has no equivalent upper case form",
which is just false. The capital letter sharp s was added in Unicode 5.1
(in 2007).
6) It says "Exchange documents should not contain ligatures", which is
grossly wrong for letters like "æ" that the page calls "ligatures" and
should make you wonder why the page then exists at all - what are HTML
documents that are not "exchange documents"?
7) "If your font does not have ligatures, you can fake ligatures with
kerning to squeeze letters closer together than normal" is just absurd.
Manual tuning of kerning is _not_ supposed to produce fake ligatures,
and there is no reason to try to fake ligatures.
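
On the sharp s point in item 5), a small JavaScript sketch (the names are
invented for this example) shows both halves of the situation: the capital
letter exists as its own character, U+1E9E, while default uppercasing of
the small letter still yields "SS".

```javascript
const SMALL_SHARP_S = "\u00DF";   // LATIN SMALL LETTER SHARP S
const CAPITAL_SHARP_S = "\u1E9E"; // LATIN CAPITAL LETTER SHARP S, added in Unicode 5.1

// Hypothetical helper that collects the relevant case mappings.
function sharpSCaseReport() {
  return {
    // Default full uppercasing maps the small letter to two letters.
    upperOfSmall: SMALL_SHARP_S.toUpperCase(),
    // The capital letter lowercases to the ordinary small sharp s.
    lowerOfCapital: CAPITAL_SHARP_S.toLowerCase(),
    // The capital form is a single, distinct character.
    capitalIsDistinct:
      CAPITAL_SHARP_S !== SMALL_SHARP_S && CAPITAL_SHARP_S.length === 1,
  };
}

console.log(sharpSCaseReport());
```

So an uppercase form is available, even though automatic case conversion
does not produce it by default.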

> That will buff out the only rough spot I could easily find on an otherwise
> very smooth and useful page. Cheers,

If it looks smooth and useful, it is dangerous.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

tlvp

Feb 20, 2012, 7:32:09 AM

On Mon, 20 Feb 2012 07:44:51 +0200, Jukka K. Korpela wrote:

> 2012-02-20 4:16, tlvp wrote:
>
>> As regards the table at
>>
>> <http://mindprod.com/jgloss/htmlentities.html#POLISH> ,
>>
>> I should point out that Polish uses no Ô -- it does use an ó,
>> and that's what's wanted where your table features O-circumflexes :-) .
>
> And Ô surely isn't the lowercase equivalent to ó.

No. More important to me was that it's just not a Polish character at all,
in any case (double-entendre intended :-) ).

> --- [big snip] ---

>
>> That will buff out the only rough spot I could easily find on an otherwise
>> very smooth and useful page. Cheers,
>
> If it looks smooth and useful, it is dangerous.

My only disclaimer, once again: "...only rough spot I could easily find..."
-- but I clearly didn't look very hard :-) . I suppose I might have done
better adding a "looking" between "useful" and "page", but it was late and
I was too tired to determine whether that was diplomatic enough or would
look like an arrogant slap-in-the-face, so I left it out.

Thanks for displaying all the wrinkles I missed :-) . And cheers, -- tlvp

Dr J R Stockton

Feb 20, 2012, 3:17:34 PM

In comp.infosystems.www.authoring.html message <m6h2k7p9h6n5rn8d8lfo1qqc
bjkkb...@4ax.com>, Sun, 19 Feb 2012 10:54:43, Roedy Green <see_website
@mindprod.com.invalid> posted:

>On Thu, 02 Feb 2012 14:19:05 +0200, "Jukka K. Korpela"
><jkor...@cs.tut.fi> wrote, quoted or indirectly quoted someone who
>said :
>
>>> It is fairly well-known that [...]
>>> Unicode goes up to 65535 = U+FFFF. But in fact Unicode goes higher, so
>>> do browsers actually understand 𒍁 or $#x123412;, regardless of
>>> the question of whether any installed font has them ?
>
>I have prepared a number of tables showing you the entities, hex
>entities, and how the characters render in your browser. I also
>have English language descriptions of the characters so you can search
>for them.
>
>See http://mindprod.com/jgloss/htmlentities.html
>and
>http://mindprod.com/jgloss/html5.html
>and
>http://mindprod.com/jgloss/hexentities.html

Good, though my words were intended as a suggestion for Jukka's page.

Some of your site needs a "Health Warning" for those who happen to be
using a slow or expensive-per-byte connection.

If you delete from your DoB the characters which, in octal, end in a
zero, you have the glyph patterns needed for my DoB.

--
(c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)