Q.: Character entity for ZeroWidthSpace character?

tlvp

Dec 18, 2016, 9:30:04 PM

Elsewhere in this group, James Moe recently wrote:

> In general it is safer to use character entities
> ... than numeric escape sequences.

In that spirit, I'd greatly welcome a character entity for the numeric
escape sequence &#8203; requisitioning a ZeroWidthSpace character.

Is there such a beastie? Thanks! Cheers, and Seasons' Greetings, -- tlvp
--
Before replying, please throw out the trash.

Jukka K. Korpela

Dec 19, 2016, 4:06:10 AM

19.12.2016, 4:30, tlvp wrote:

> Elsewhere in this group, James Moe recently wrote:
>
>> In general it is safer to use character entities
>> ... than numeric escape sequences.

That’s not true; it’s rather the opposite (though for &hellip;, only as
regards the possibility that the document might some day be processed
by an XHTML processor, which is not required to support named character
references except those defined in XML).

> In that spirit, I'd greatly welcome a character entity for the numeric
> escape sequence &#8203; requisitioning a ZeroWidthSpace character.

There is: &ZeroWidthSpace;. Reference:
https://www.w3.org/TR/html/syntax.html#named-character-references
But don’t expect browsers to support it. They may, or they may not,
depending on whether browser vendors have kept up with the fancy
additions to the list and whether people have updated their browsers.

You can alternatively use &NegativeMediumSpace; or &NegativeThickSpace;
or &NegativeThinSpace; or &NegativeVeryThinSpace; (I’m not even
considering asking for the reasons for this apparent insanity; I’m sure
I would get a long and convincing-looking explanation).
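
As a quick check on what these names resolve to (not part of the original post): Python's standard `html` module carries the HTML5 named character reference table, so it can be used to look them up without relying on any browser. A minimal sketch, assuming Python 3.6 or later; it says nothing about which browsers recognize the names.

```python
import html
import unicodedata

# Named references mentioned above; in the HTML5 table they all
# resolve to the same code point.
names = [
    "ZeroWidthSpace",
    "NegativeMediumSpace",
    "NegativeThickSpace",
    "NegativeThinSpace",
    "NegativeVeryThinSpace",
]

resolved = {name: html.unescape(f"&{name};") for name in names}

for name, char in resolved.items():
    # Print the code point and Unicode name rather than the invisible character.
    print(f"&{name}; -> U+{ord(char):04X} {unicodedata.name(char)}")

# The decimal and hexadecimal numeric forms give the same character.
numeric_forms_agree = html.unescape("&#8203;") == html.unescape("&#x200B;") == "\u200b"
print(numeric_forms_agree)
```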

--
Yucca, http://www.cs.tut.fi/~jkorpela/

tlvp

Dec 19, 2016, 4:42:28 AM

Thanks, Yucca. My explanation for looking is simple, and convincing to me
at least: my author colleague sometimes sets local explanatory material off
between em-dashes, and prefers these (a) abutting the words they separate,
but (b) prepared to break away from them should line-flow aesthetics demand
it. One resolution of the conflict those demands spawn is to use the trio
[ZeroWidthSpace][EmDash][ZeroWidthSpace] in place of simply EmDash, i.e.,
&#8203;&emdash;&#8203; , wherever an em-dash would be called for.

That's forced by the observation that "Browsers create soft linebreaks
after hyphens (see above), but not after en dashes or em dashes." (Source:
<https://www.w3.org/wiki/Common_HTML_entities_used_for_typography#HTML_entity_usage_notes>,
item 7.)
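
The trio described above can be applied mechanically rather than typed by hand. A rough sketch in Python, assuming the source text contains literal em dash characters (U+2014) and that the output should use the decimal reference `&#8203;` together with `&mdash;`, the name HTML defines for the em dash. The function name is made up for illustration.

```python
EM_DASH = "\u2014"

# Zero-width space, em dash, zero-width space, written as character references.
BREAKABLE_DASH = "&#8203;&mdash;&#8203;"


def make_dashes_breakable(text: str) -> str:
    """Replace each literal em dash with the ZWSP + em dash + ZWSP trio."""
    return text.replace(EM_DASH, BREAKABLE_DASH)


sample = "material\u2014set off between dashes\u2014continues"
print(make_dashes_breakable(sample))
```

Doing the replacement at build time keeps the invisible characters out of the text that has to be proofread.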

Well, &emdash; is memorable. I seek an equally memorable replacement for
&#8203; ... if there is one :-) ... preferably one that browsers support.
So if there's only the ill-supported &ZeroWidthSpace; I guess I'll throw in
the proverbial towel.

Negative spaces? I'm not sure I need such beasties, but if free ... :-) .

Thanks again. And cheers, -- tlvp

tlvp

Dec 19, 2016, 4:55:04 AM

On Mon, 19 Dec 2016 04:42:27 -0500, tlvp mistakenly wrote:

> ... &emdash; ...

Sorry, I know better: should have been just &mdash; in the middle there.
Sorry to raise such confusion. A thousand apologies! Cheers, -- tlvp

Stan Brown

Dec 19, 2016, 6:58:41 PM

On Mon, 19 Dec 2016 11:06:10 +0200, Jukka K. Korpela wrote:

> 19.12.2016, 4:30, tlvp wrote:
>
> > Elsewhere in this group, James Moe recently wrote:
> >
> >> In general it is safer to use character entities
> >> ... than numeric escape sequences.
>

> That’s not true; it’s rather the opposite

Thanks. That seemed wrong to me, but I don't have your depth of
knowledge. I appreciate the confirmation BEFORE I asked.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://BrownMath.com/
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You: http://preview.tinyurl.com/WhyWont

Stan Brown

Dec 19, 2016, 7:00:32 PM

On Mon, 19 Dec 2016 04:55:02 -0500, tlvp wrote:
> On Mon, 19 Dec 2016 04:42:27 -0500, tlvp mistakenly wrote:
>
> > ... &emdash; ...
>
> Sorry, I know better: should have been just &mdash; in the middle there.
> Sorry to raise such confusion. A thousand apologies! Cheers, -- tlvp
>

Of course we wouldn't need such filigree if browsers(*) didn't feel
free to break a line BEFORE an em dash. That seems like a pretty
basic function compared to all the much more complicated stuff they
implement correctly.

(*) Where "Firefox" is a member of "browsers". I haven't tested in
any others.

Jukka K. Korpela

Dec 20, 2016, 12:04:59 AM

20.12.2016, 2:00, Stan Brown wrote:

> On Mon, 19 Dec 2016 04:55:02 -0500, tlvp wrote:
>> On Mon, 19 Dec 2016 04:42:27 -0500, tlvp mistakenly wrote:
>>
>>> ... &emdash; ...
>>
>> Sorry, I know better: should have been just &mdash; in the middle there.
>> Sorry to raise such confusion. A thousand apologies! Cheers, -- tlvp
>>
>
> Of course we wouldn't need such filigree if browsers(*) didn't feel
> free to break a line BEFORE an em dash. That seems like a pretty
> basic function compared to all the much more complicated stuff they
> implement correctly.

I’m confused. According to Unicode Line Breaking rules, EM DASH has line
breaking class B2 [Break Opportunity Before and After], so the behavior
you have noticed appears to be correct. And if you want to prevent that,
you need explicit line breaking *prohibition*, whereas &#8203; is ZERO
WIDTH SPACE, which explicitly *allows* line breaking (so it is redundant
in &#8203;&mdash;&#8203;, though it can be useful since not all browsers
implement the Line Breaking rules correctly). (For years, browsers broke
only on spaces.)
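
To make the allow/prohibit distinction concrete, here is a small illustrative sketch in Python. The class table is copied by hand for just four characters, since Python's `unicodedata` module has no lookup for the Line_Break property, and the helper name is hypothetical. It uses WORD JOINER (U+2060) as one example of an explicit prohibition; whether a given browser honors it is a separate question.

```python
import unicodedata

# Line_Break classes for the characters discussed, copied by hand
# (the stdlib offers no lookup for this property).
LINE_BREAK_CLASS = {
    "\u2014": "B2",  # EM DASH: break opportunity before and after
    "\u200b": "ZW",  # ZERO WIDTH SPACE: explicit break opportunity
    "\u2060": "WJ",  # WORD JOINER: prohibits a break on either side
    "\u00a0": "GL",  # NO-BREAK SPACE: "glue", prohibits a break on either side
}


def keep_dash_with_previous_word(text: str) -> str:
    """Insert WORD JOINER before each em dash so no break is allowed before it."""
    return text.replace("\u2014", "\u2060\u2014")


for char, lb_class in LINE_BREAK_CLASS.items():
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {lb_class}")

joined = keep_dash_with_previous_word("word\u2014word")
print([f"U+{ord(c):04X}" for c in joined if not c.isalpha()])
```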

When EM DASH is used in its normal meaning in English texts, namely as a
punctuation character to set off a parenthetic remark (in US English
style), it seems appropriate that it has the Break Opportunity Before
and After property, though perhaps it is better to leave it at the end
of a line rather than at the start of a new line.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Jukka K. Korpela

Dec 20, 2016, 4:29:26 AM

19.12.2016, 11:42, tlvp wrote:

> [...] my author colleague sometimes sets local explanatory material off

> between em-dashes, and prefers these (a) abutting the words they separate,
> but (b) prepared to break away from them should line-flow aesthetics demand
> it. One resolution of the conflict those demands spawn is to use the trio
> [ZeroWidthSpace][EmDash][ZeroWidthSpace] in place of simply EmDash, i.e.,
> &#8203;&emdash;&#8203; , wherever an em-dash would be called for.

It looks messy, but I don’t think you can make it simpler (in HTML
source). Actually entering the characters involved is impractical even
if you can define e.g. a macro for it in an editor, since zero width
spaces are literally unnoticeable (unless an editor chooses to render it
in some special way). Using <wbr>—<wbr> would be a nicer option, but
people who define HTML standards have decided to treat the good old
<wbr> as Bad, Obsolete, and whatever, and the browser support, which was
excellent, isn’t quite that any more.

> That's forced by the observation that "Browsers create soft linebreaks
> after hyphens (see above), but not after en dashes or em dashes." (Source:
> <https://www.w3.org/wiki/Common_HTML_entities_used_for_typography#HTML_entity_usage_notes>,
> item 7.)

The information is outdated. Chrome implements Unicode line breaking
rules for EM DASH. Unfortunately other browsers misbehave.

Don’t treat that page as an authority of any kind. For one thing, it
claims that EN DASH is indistinguishable from MINUS SIGN. (They are two
distinct characters, and even though people may confuse them and even
though fonts may have identical or almost identical glyphs for them, a
well-designed font makes them different.)
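
That the two are distinct characters is easy to confirm from the Unicode character database, whatever a particular font draws for them. A short Python check using only the standard library:

```python
import unicodedata

en_dash = "\u2013"
minus_sign = "\u2212"

for char in (en_dash, minus_sign):
    # Name and general category come straight from the Unicode database.
    print(
        f"U+{ord(char):04X} {unicodedata.name(char)} "
        f"category={unicodedata.category(char)}"
    )
```

EN DASH is classed as dash punctuation (Pd) and MINUS SIGN as a math symbol (Sm).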

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Helmut Richter

Dec 20, 2016, 5:57:42 AM

On 20.12.2016 at 06:04, Jukka K. Korpela wrote:

> When EM DASH is used in its normal meaning in English texts, namely
> as a punctuation character to set off a parenthetic remark (in US
> English style),

In German, the em dash with thin spaces before and after was used in the
same sense until about 50 years ago, now an en dash with normal spaces
is used instead. I like that better than the US way as I find abutting
characters distracting at a place where phrases are separated rather
than connected but this may be mere habituation to local customs.

The question where to break lines arises with all these typesetting
customs in the same way.

> it seems appropriate that it has the Break Opportunity Before
> and After property, though perhaps it is better to leave it at the
> end of a line rather than at the start of a new line.

Some uses of such dashes – but not all – act like a pair of parentheses
while others more like a semicolon – not obvious to distinguish. In my
opinion, the line break should happen before the dash for the left
parenthesis and after the dash in the remaining cases. For some time I
have forced that behaviour with NBSP characters but now I consider this
too much effort.

Much more important are NBSP before or after numbers, e.g.
number&nbsp;17 but 17&nbsp;bottles. There does not seem to be an
automatic way to enforce that.

--
Helmut Richter

tlvp

Dec 20, 2016, 6:23:07 PM

On Tue, 20 Dec 2016 07:04:59 +0200, Jukka K. Korpela wrote:

> ... ZERO

> WIDTH SPACE, which explicitly *allows* line breaking (so it is redundant
> in &#8203;&mdash;&#8203;, though it can be useful since not all browsers

> implement the Line Breaking rules correctly. ...

Exactly. I've tested my HTML in older browsers, some of which need specific
"feel free to wrap the line here" instructions before and after each
em-dash. It's necessary also when converting an HTML file -- or a docx file
-- to the Kindle .MOBI format. Cheers, -- tlvp

tlvp

Dec 20, 2016, 6:28:24 PM

On Tue, 20 Dec 2016 11:58:15 +0100, Helmut Richter wrote:

> In German, the em dash with thin spaces before and after was used in the
> same sense until about 50 years ago, now an en dash with normal spaces
> is used instead.

At least one Polish publisher shares the new German aesthetic you describe.

> ... I like that better than the US way as I find abutting

> characters distracting at a place where phrases are separated rather
> than connected

Theoretically, I have to agree with this position. But as a matter of
practice, I've become accustomed to the US way on this, and now accept it
as standard.

Thanks for the added perspective. Cheers, -- tlvp

tlvp

Dec 20, 2016, 6:33:56 PM

On Tue, 20 Dec 2016 11:29:26 +0200, Jukka K. Korpela wrote:

>> &#8203;&mdash;&#8203; , wherever an em-dash would be called for.

>
> It looks messy, but I don’t think you can make it simpler

OK, wishes called back home again as impractical.

> ... Actually entering the characters involved is impractical ...

Absolutely! Proofing HTML with invisible ZWSpaces scattered about would be
a worse nightmare even than living in the USA will be, come Jan. 20 :-{ .

Cheers, -- tlvp

Frosted Flake

Dec 20, 2016, 9:34:07 PM

Better than it has been for the past 8 years under the racist-in-chief

--
Frosted Flake

tlvp

Dec 29, 2016, 2:51:19 PM

On Mon, 19 Dec 2016 11:06:10 +0200, Jukka K. Korpela wrote:

> ...

But that lets me interject, highly OT, that a recent {The Economist} issue
(that of December 24) features a survey of Finland's reindeer herders, the
first of whom to appear in the article, 2nd paragraph on p. 26, bears the
same surname as our esteemed Jukka.

Is it then a common surname? (The non-Finns among us have no idea.)
Alternatively, could Mme. Raissa K. actually be a Jukka-relative? Or is it
all just far-flung coincidence?

Apologies if I'm poking where even angels should fear to tread. But
Season's best greetings, best wishes for the New Year, and cheers, -- tlvp

Jukka K. Korpela

Dec 29, 2016, 3:38:56 PM

29.12.2016, 21:51, tlvp wrote:

> But that lets me interject, highly OT, that a recent {The Economist} issue
> (that of December 24) features a survey of Finland's reindeer herders, the
> first of whom to appear in the article, 2nd paragraph on p. 26, bears the
> same surname as our esteemed Jukka.
>
> Is it then a common surname?

This is indeed OT, but as the question was raised here, I’ll answer
briefly; if you wish to ask for clarifications, please use e-mail (I
have set Followup-To: poster for this):

Yes, it is a relatively common surname – about 5,000 people, in a
country with a population of about 5.4 million. So two people with this
surname are probably not related (or they are seventh cousins or
something). I just updated my small page about the name:
http://www.cs.tut.fi/~jkorpela/korpela.html

Somewhat relating to the topic of this group, I’d like to mention that
when checking the updated version of the page, I used the online
proofreading (spelling check) tool at
https://oikofix.com/web?lang=en
It can check both plain text, via copy and paste, and online HTML
documents (i.e. web pages), via URL. It handles US English, British
English, Finnish, and Northern Sámi (but only by menu selection for an
entire document; it does not recognize lang attributes).

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Molly Mockford

Dec 29, 2016, 5:25:42 PM

At 14:51:22 on Thu, 29 Dec 2016, tlvp <mPiOsUcB...@att.net> wrote
in <10ae1f9mznvtk$.au0814ugrfpg$.d...@40tude.net>:

>On Mon, 19 Dec 2016 11:06:10 +0200, Jukka K. Korpela wrote:
>
>> ...
>
>But that lets me interject, highly OT, that a recent {The Economist} issue
>(that of December 24) features a survey of Finland's reindeer herders, the
>first of whom to appear in the article, 2nd paragraph on p. 26, bears the
>same surname as our esteemed Jukka.
>
>Is it then a common surname? (The non-Finns among us have no idea.)
>Alternatively, could Mme. Raissa K. actually be a Jukka-relative? Or is it
>all just far-flung coincidence?

Way back in the mists of time when I ran the SETI screensaver to crunch
numbers in the hope of finding ET, I seem to remember that one of the
names on the SETI team was Erik Korpela. I have always assumed that he
was no relation to our Jukka.
--
Molly Mockford
Nature loves variety. Unfortunately, society hates it. (Milton Diamond Ph.D.)
(My Reply-To address *is* valid, though may not remain so for ever.)

Thomas 'PointedEars' Lahn

Jan 19, 2017, 3:00:03 PM

Helmut Richter wrote:

> Much more important are NBSP before or after numbers, e.g.
> number&nbsp;17 but 17&nbsp;bottles. There does not seem to be an
> automatic way to enforce that.

It can be scripted.

--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Helmut Richter

Jan 20, 2017, 7:01:45 AM

On 19.01.2017 at 21:00, Thomas 'PointedEars' Lahn wrote:

> Helmut Richter wrote:
>
>> Much more important are NBSP before or after numbers, e.g.
>> number&nbsp;17 but 17&nbsp;bottles. There does not seem to be an
>> automatic way to enforce that.
>
> It can be scripted.

By what criterion?

--
Helmut Richter

Thomas 'PointedEars' Lahn

Jan 20, 2017, 8:29:27 AM

The word or phrase preceding or following the number is in the list of words
or phrases to be considered for the corresponding formatting. In this
example, “number” would be in the first list, “bottles” in the second one.
“but” would not be in either list, or you could exclude “but” from being
considered because it follows a replacement. This can be done with regular
expression matching already.

In general, syntactical analysis has to be performed, and formatting has to
be performed based on the part-of-speech class of a word or phrase. For
example:

noun (singular)   number (> 1)   conjunction   number (> 1)   noun (plural)
“number           17             but           17             bottles”
`-------------.,-------------'                 `------------.,------------'

You could define the following rules to be used in a formatting algorithm:

- Use &nbsp; instead of a space between
  * a noun in the singular form and a number;
  * a number > 1 and a noun in the plural form.
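
The simpler word-list variant mentioned at the top of this post could look roughly like the following Python sketch. The two word lists are hypothetical and would have to be built per text; the part-of-speech analysis is not attempted here.

```python
import re

# Hypothetical word lists; a real text would need its own.
WORDS_BEFORE_NUMBER = ["number", "class", "theorem"]
WORDS_AFTER_NUMBER = ["bottles", "pages"]

NBSP = "&nbsp;"

# A listed word, one ordinary space, then a run of digits.
before_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, WORDS_BEFORE_NUMBER)) + r") (\d+)\b",
    re.IGNORECASE,
)
# A run of digits, one ordinary space, then a listed word.
after_pattern = re.compile(
    r"\b(\d+) (" + "|".join(map(re.escape, WORDS_AFTER_NUMBER)) + r")\b",
    re.IGNORECASE,
)


def glue_numbers(text: str) -> str:
    """Replace the space between a listed word and an adjacent number with NBSP."""
    text = before_pattern.sub(lambda m: m.group(1) + NBSP + m.group(2), text)
    text = after_pattern.sub(lambda m: m.group(1) + NBSP + m.group(2), text)
    return text


print(glue_numbers("number 17 but 17 bottles"))
```

For the example phrase this yields `number&nbsp;17 but 17&nbsp;bottles`; the space between the first 17 and "but" is left alone because "but" is in neither list.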

HTH

PointedEars
--
> If you get a bunch of authors […] that state the same "best practices"
> in any programming language, then you can bet who is wrong or right...
Not with javascript. Nonsense propagates like wildfire in this field.
-- Richard Cornford, comp.lang.javascript, 2011-11-14

Helmut Richter

Jan 20, 2017, 9:43:54 AM

On 20.01.2017 at 14:29, Thomas 'PointedEars' Lahn wrote:

> Helmut Richter wrote:
>
>> On 19.01.2017 at 21:00, Thomas 'PointedEars' Lahn wrote:
>>> Helmut Richter wrote:
>>>> Much more important are NBSP before or after numbers, e.g.
>>>> number&nbsp;17 but 17&nbsp;bottles. There does not seem to be an
>>>> automatic way to enforce that.
>>>
>>> It can be scripted.
>>
>> By what criterion?
>
> The word or phrase preceding or following the number is in the list of words
> or phrases to be considered for the corresponding formatting. In this
> example, “number” would be in the first list, “bottles” in the second one.
> “but” would not be in either list, or you could exclude “but” from being
> considered because it follows a replacement. This can be done with regular
> expression matching already.

For an individual text with few things that are numbered (e.g. "class x"
or "theorem y" where you know the upper bound of x and y) this may work
well, and this is how I do it. Moreover, numbers that are names tend to
be numerically small, and numbers that count something tend to be larger
because small numbers are often written as words. So, for each given
text, one can easily find a small set of regex that serves the purpose
in 95% of occurrences, and check the remainder manually.
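
The "check the remainder manually" step can be supported by a small helper that lists the numbers no rule has touched yet. A sketch in Python, assuming that glued numbers already carry `&nbsp;` (or a literal no-break space) on at least one side, so that a number with an ordinary space on both sides is still undecided. Numbers at the very start or end of the text are ignored for simplicity.

```python
import re

# A run of digits with an ordinary space on both sides: not glued to anything yet.
UNGLUED_NUMBER = re.compile(r"(?<= )\d+(?= )")


def numbers_to_review(text: str, context: int = 12) -> list:
    """Return short snippets around numbers that no rule has glued yet."""
    snippets = []
    for match in UNGLUED_NUMBER.finditer(text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        snippets.append(text[start:end])
    return snippets


sample = "class&nbsp;3 has 40 pupils and 2 teachers"
for snippet in numbers_to_review(sample):
    print(repr(snippet))
```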

When I wrote "no automatic way" I thought of a criterion that would work
for many texts about different topics, but there I assume that finding
such general patterns is much more work than treating each text
separately with the specific terms and phrases in it.

Identifying plurals may be easy in English but not always in other
languages.

--
Helmut Richter

Thomas 'PointedEars' Lahn

Jan 20, 2017, 11:02:00 AM

Helmut Richter wrote:

> On 20.01.2017 at 14:29, Thomas 'PointedEars' Lahn wrote:
>> The word or phrase preceding or following the number is in the list of
>> words or phrases to be considered for the corresponding formatting. In
>> this example, “number” would be in the first list, “bottles” in the
>> second one. “but” would not be in either list, or you could exclude “but”
>> from being considered because it follows a replacement. This can be done
>> with regular expression matching already.
>

> […] for each given text, one can easily find a small set of regex that

> serves the purpose in 95% of occurrences, and check the remainder
> manually.
>
> When I wrote "no automatic way" I thought of a criterion that would work
> for many texts about different topics, but there I assume that finding
> such general patterns is much more work than treating each text
> separately with the specific terms and phrases in it.

If you read my posting completely, you will see that syntactical analysis is
the key to a general, automated solution. You only need (the implementation
of) a dictionary.

> Identifying plurals may be easy in English but not always in other
> languages.

The dictionary needs to take care of that. But if you think about it, the
plural as such does not need to be identified, only that the word is a noun
(and maybe not even that).

tlhoy'Daq wa'maH HIqbal

leh mashek t'bihr fi'temok¹

;-)

_______
¹ “gra obggyrf bs orre ba gur jnyy” va Xyvatba naq Zbqrea Tbyvp Ihypna
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300...@news.demon.co.uk>

Jukka K. Korpela

Jan 20, 2017, 12:18:05 PM

Please don’t try to discuss with a troll.

Note that the troll posted an off-topic remark about a month after the
discussion. Scripts, whether client-side or server-side (the troll did
not bother making a distinction), are not HTML.

(Besides, there is no general reason to prevent a line break before or
after a “number”, which here apparently means a sequence of digits
standing for an integer. There are special cases, like a “number”
followed by a unit identifier, like “17 m”, where a line break is
disallowed according to applicable standards. But dealing with them is
not an HTML issue. The HTML side of the matter is simple: browsers honor
the no-break space.)
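
For the special case of a number followed by a unit identifier, the preprocessing step (which, as said, sits outside HTML itself) could be sketched like this in Python. The unit list is a short hypothetical sample, and the pattern only handles plain digits with an optional decimal part.

```python
import re

# Hypothetical, deliberately short list of unit symbols.
UNITS = ["m", "cm", "km", "kg", "g", "s"]

# Longer symbols first; the trailing \b keeps "m" from matching the start of "men".
UNIT_ALTERNATION = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
unit_pattern = re.compile(r"\b(\d+(?:[.,]\d+)?) (" + UNIT_ALTERNATION + r")\b")


def glue_units(text: str) -> str:
    """Replace the space between a number and a listed unit symbol with &nbsp;."""
    return unit_pattern.sub(lambda m: m.group(1) + "&nbsp;" + m.group(2), text)


print(glue_units("a 17 m rope and 2,5 kg of nails, 17 men"))
```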

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Thomas 'PointedEars' Lahn

Jan 20, 2017, 12:42:45 PM

Jukka K. Korpela wrote:

> 20.1.2017, 14:02, Helmut Richter wrote:
>> On 19.01.2017 at 21:00, Thomas 'PointedEars' Lahn wrote:
>>> Helmut Richter wrote:
>>>> Much more important are NBSP before or after numbers, e.g.
>>>> number&nbsp;17 but 17&nbsp;bottles. There does not seem to be an
>>>> automatic way to enforce that.
>>> It can be scripted.
>> By what criterion?
>
> Please don’t try to discuss with a troll.
>
> Note that troll posted an off-topic remark

Hardly. But if so, then Helmut’s original comment already was off-topic.

> about a month after

Usenet is not a real-life communications medium. I happened to stop by here
(because nothing interesting or new was going on elsewhere, really), saw a
posting that I found interesting, had a good idea about it, and replied.

It is not up to you to define the criteria by which it is appropriate to
post.

Get a life …

> [insult]

… or rather, FOAD. I am sick and tired of your obnoxious wannabe arrogance.

> did not bother making a distinction),

Because that should be left to the person implementing it. If it is
important that the content should be well-formatted when served, then do it
server-side. If it is a nice-to-have, and does not incur too much of a
performance penalty, do it client-side.

For example, currently I do the hyphenation in the first sections of the
ECMAScript Support Matrix client-side because it is a nice-to-have only.

> are not HTML.

So you are almost *three* years out of touch now.

<https://www.w3.org/TR/2014/REC-html5-20141028/webappapis.html#scripting>

<https://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#scripting-0>

<https://www.w3.org/TR/2014/REC-html5-20141028/dom.html#dom>

PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300...@news.demon.co.uk> (2004)

tlvp

Jan 21, 2017, 3:27:57 AM

On Fri, 20 Jan 2017 15:44:46 +0100, Helmut Richter wrote:

> Identifying plurals may be easy in English ...

... or rather it "may *perhaps often* be easy in English" -- counter-
examples come to mind like sheep, fish, deer, ... . Or mouse/mice,
goose/geese, etc. As the Google Translation from German or Slavic might put
it, "The English language is a heavy language." :-) . Cheers, -- tlvp