[Typography] Line breaks near em-dashes


Scott Ridley

Jan 25, 2022, 2:33:00 AM
to Standard Ebooks
Hi all,

I'm fairly new to Standard Ebooks, just a reader at the moment, although I've been collecting typos so I can put in a pull request for the first time.

My question/comment is about what happens when an em-dash occurs near the end of a line. The word-joiner character keeps the em-dash attached to the preceding word; however, my reader (Apple Books) will then sometimes auto-hyphenate the word before the em-dash to make it fit, which can be awkward, especially when it'd be better to break after the em-dash.

e.g. in "The Mysterious Affair at Styles" by Agatha Christie, a line is rendered:
My poor Emily. They're a lot of shark-
s—all of them.

I'm not sure if this occurs in other reader software, but I've experimented with adding a zero-width space after the em-dash, and that makes it break much more naturally:
My poor Emily. They're a lot of sharks—
all of them.

Can I suggest the style manual (and toolset) be altered to add a zero-width space where an em-dash sits between words (i.e. not at the end of a sentence)? Or does anyone else have a better solution to this issue?

Thanks and regards!
Scott.

Alex Cabal

Jan 25, 2022, 10:53:21 AM
to standar...@googlegroups.com
Hi Scott, thanks for looking into that.

The word joiner character is the semantically correct character we want
to use. See
<https://www.unicode.org/L2/L2019/19114-line-break-design.pdf> and in
particular this quote:

> but because it is more commonly used as byte order mark, the use of
> U+2060 WORD JOINER to indicate word joining is strongly preferred for
> any new text.

If you add a ZWJ *after* the em dash, you risk ereaders treating the
first word, plus the em dash, plus the second word all as a single
non-breaking word, which would look really bad too. iBooks appears not
to do that, but all bets are off for the many other ereaders out there;
and it might even actually do that given a different example, too,
depending on what else is before and after the line.

So, we don't want to be enshrining incorrect semantics in our ebook
sources just because some ereaders do things differently. Unfortunately
the state of ereaders is that each one has its own quirks and there's no
single file that can appease them all perfectly. We occasionally have to
make do with weird errors like this one.

> --
> You received this message because you are subscribed to the Google
> Groups "Standard Ebooks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to standardebook...@googlegroups.com
> <mailto:standardebook...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/standardebooks/5c954862-e495-4c25-9c5e-61569a49b33bn%40googlegroups.com
> <https://groups.google.com/d/msgid/standardebooks/5c954862-e495-4c25-9c5e-61569a49b33bn%40googlegroups.com?utm_medium=email&utm_source=footer>.

Vince

Jan 25, 2022, 11:14:15 AM
to Standard Ebooks
Right, but we already replace the word-joiner with a zero-width space in build, so the epub’s not actually using the word-joiner. So the question isn’t really about the source, it’s about the epub that’s built.

We already put a word-joiner before and after an en-dash, which build then turns into a ZWS before and after. Why couldn’t we do that in build for an em-dash?
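
A minimal sketch of the kind of build-time substitution described above. The function name `make_compatible` is hypothetical and is not taken from the SE toolset; the sketch also assumes the replacement character is U+FEFF, which is how later messages in this thread identify it.

```python
# Hypothetical illustration only; not the actual SE build code.
WORD_JOINER = "\u2060"                # U+2060 WORD JOINER, used in the source files
ZERO_WIDTH_NO_BREAK_SPACE = "\ufeff"  # U+FEFF, assumed to be what build emits


def make_compatible(text: str) -> str:
    """Swap every word joiner for the older zero width no-break space."""
    return text.replace(WORD_JOINER, ZERO_WIDTH_NO_BREAK_SPACE)


source = "a lot of sharks\u2060\u2014all of them"
built = make_compatible(source)
print(ascii(built))
```

The source keeps the semantically preferred character, and only the built output carries the substitute.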

Alex Cabal

Jan 25, 2022, 11:33:44 AM
to standar...@googlegroups.com
Because en dashes are smaller than em dashes, and most often (but of
course not always) connect numbers which are also short. If a line fails
to break there, it's not a big deal and the smallness of the unit allows
for a break opportunity elsewhere on the line.

Compare the break opportunities of an en dash in a typical use case:

> 12<wj>--<wj>3pm

to an em dash in its natural prose use case:

> Barleycorn<wj>---<wj>extravagant!

all of a sudden you have a huge unit with no break opportunity (since
many ereaders will consider the unit one "word" and not have an entry
for it in their hyphenation dictionary).

For what it's worth, I've never encountered the bug Scott mentioned and
I read on a Kobo, which like iBooks also uses WebKit. So this might just
be an iBooks bug (or quirk) and not even a WebKit bug.

Robin Whittleton

Jan 25, 2022, 11:37:41 AM
to standar...@googlegroups.com
Could we change the soft-hyphenation strategy to never insert in a word before a word-joiner / em-dash combo?
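
A rough sketch of what that rule could look like, assuming soft hyphens (U+00AD) have already been inserted into the text and that the em dash is preceded by a word joiner. The function name is hypothetical and this is not how the SE tools are known to work.

```python
import re

SOFT_HYPHEN = "\u00ad"
WORD_JOINER = "\u2060"
EM_DASH = "\u2014"

# A run of word characters and soft hyphens sitting directly before <wj> + em dash.
_WORD_BEFORE_DASH = re.compile(
    "[\\w" + SOFT_HYPHEN + "]+(?=" + WORD_JOINER + EM_DASH + ")"
)


def strip_soft_hyphens_before_dash(text: str) -> str:
    """Remove soft hyphens only from words that immediately precede <wj> + em dash."""
    return _WORD_BEFORE_DASH.sub(
        lambda match: match.group(0).replace(SOFT_HYPHEN, ""), text
    )


hyphenated = "a lot of sha\u00adrks\u2060\u2014all of them, ev\u00adery\u00adone"
print(ascii(strip_soft_hyphens_before_dash(hyphenated)))
```

Words elsewhere in the text keep their soft hyphens; only the word attached to the dash loses them.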

Alex Cabal

Jan 25, 2022, 11:40:59 AM
to standar...@googlegroups.com
Soft hyphens only apply to the Kindle build, because other ereaders are
all over the place with font and highlighting support for them. Scott is
referring to the compatible epub build since he's reading on iBooks.

Vince

Jan 25, 2022, 1:07:05 PM
to Standard Ebooks
It happens all the time on Books, both iPad and Mac, I’ve just never had (taken) time to look at it further. Like if Books would actually respect a word-joiner, etc.

Scott Ridley

Jan 25, 2022, 6:20:26 PM
to Standard Ebooks
I think I might have been a little unclear about which characters I was talking about.

There are three different characters being discussed:
<wj> - the word joiner character U+2060
<zw> - the zero width no-break space U+FEFF
<wbr> - the zero width space U+200B, which allows a break; equivalent to the HTML tag <wbr/>
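
For reference, the official Unicode names of these characters can be checked with Python's standard `unicodedata` module. The `<wj>`, `<zw>` and `<wbr>` labels are just the shorthand used in this message, and `describe` is a helper made up for this illustration.

```python
import unicodedata

# Shorthand labels from this message mapped to the characters they stand for.
CHARACTERS = {
    "<wj>": "\u2060",
    "<zw>": "\ufeff",
    "<wbr>": "\u200b",
    "em dash": "\u2014",
}


def describe(char: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for label, char in CHARACTERS.items():
    print(f"{label}: {describe(char)}")
# <wj>: U+2060 WORD JOINER
# <zw>: U+FEFF ZERO WIDTH NO-BREAK SPACE
# <wbr>: U+200B ZERO WIDTH SPACE
# em dash: U+2014 EM DASH
```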

With my example source text being: "My poor Emily. They're a lot of sharks---all of them"

I think Alex misunderstood my proposal/question as encoding it as: "My poor Emily. They're a lot of sharks<wj>---<zw>all of them" when in fact I meant "My poor Emily. They're a lot of sharks<wj>---<wbr>all of them", since that indicates that the line break should happen after the em-dash if required.

My understanding is that:
The current standard is to have this text encoded as "My poor Emily. They're a lot of sharks<wj>---all of them" in the source.
But for compatibility reasons, this is converted to "My poor Emily. They're a lot of sharks<zw>---all of them" at build time for the compatible epub.
I'm suggesting "My poor Emily. They're a lot of sharks<wj>---<wbr>all of them" where the em-dash is mid-sentence.

There's the other case, where the em-dash appears at the end of a line, especially inside speech: '"Your mother, you tell me, had a violent quarrel with someone yesterday afternoon---"'. Obviously here the <wbr> isn't appropriate since you want the end-quote to stay with the dash, so that's why I'm suggesting adding the <wbr> explicitly, rather than as part of the build process.
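
The two cases above could be told apart mechanically with something like the following sketch. It assumes "mid-sentence" can be approximated as "the em dash is immediately followed by a letter or digit"; the function name is hypothetical, and cases such as a dash followed by an opening quotation mark are deliberately left alone.

```python
import re

EM_DASH = "\u2014"
ZERO_WIDTH_SPACE = "\u200b"  # U+200B, the <wbr> character in the notation above

# An em dash immediately followed by a word character, i.e. sitting between words.
_MID_SENTENCE_DASH = re.compile(EM_DASH + r"(?=\w)")


def add_break_after_em_dash(text: str) -> str:
    """Insert U+200B after em dashes that are directly followed by a word."""
    return _MID_SENTENCE_DASH.sub(EM_DASH + ZERO_WIDTH_SPACE, text)


mid_sentence = "They're a lot of sharks\u2060\u2014all of them"
end_of_speech = '"...someone yesterday afternoon\u2060\u2014"'

print(ascii(add_break_after_em_dash(mid_sentence)))
print(ascii(add_break_after_em_dash(end_of_speech)))
```

Because U+200B is not a word character, running the function a second time leaves the text unchanged, and the dash before the closing quote never gains a break opportunity.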

I was unaware that the different builds already made adjustments to the actual text. Since adjustments are being made at build time, the <wbr> character can be removed automatically for those readers where it becomes a problem, but remain in where it's ignored or useful. (I don't have any evidence, but I do know that the <wbr/> HTML element has been supported by every browser forever, so hopefully they cope well with the Unicode U+200B.)

Thanks for being patient! As a newbie I'm always worried that I'm asking a dumb question.

Scott.

Vince

Jan 25, 2022, 6:41:02 PM
to Standard Ebooks
Interesting. I’ve always called U+FEFF a zero width space, but now that you’ve pointed this out, I see that everyone else, most importantly Unicode, calls that a zero-width no-break space, while the zero width space is the 200B. Thanks for the education!

(Which, as an aside, makes SE’s ZERO_WIDTH_SPACE variable name misleading as well.)

Alex will have to weigh in on whether that makes a difference, I just want to note that we only have three different outputs—epub, Kindle, and Kobo. So the epub (the format in question) has to work for all readers except Kindle/Kobo, not just for Apple Books. IOW, we can’t make a change just for Books.

Also, whether it’s added by typogrify (which is what’s adding the word-joiner in the first place) or by build, either way it’s code that’s going to have to determine whether it’s at the end of a sentence or not. Which I don’t believe would be too difficult (famous last words).


Alex Cabal

Jan 25, 2022, 6:48:55 PM
to standar...@googlegroups.com
Yes, but the em dash character already has a break opportunity after it.
See <https://www.unicode.org/reports/tr14/tr14-37.html>


The chart is difficult but basically it reveals that the em dash has a
break opportunity before and after it.
Also see
<https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%80%94> where
you'll see the Line_Break property is Break_Both.

<wbr> is useful in locations where there is no break opportunity, like
in the middle of a word. But I don't know what the benefit is of placing
it in a location where there is already a break opportunity. If it fixes
an iBooks bug, then I think that's a coincidence.

Vince

Jan 25, 2022, 7:05:55 PM
to Standard Ebooks
On Jan 25, 2022, at 5:48 PM, Alex Cabal <al...@standardebooks.org> wrote:

> <wbr> is useful in locations where there is no break opportunity, like in the middle of a word. But I don't know what the benefit is of placing it in a location where there is already a break opportunity. If it fixes an iBooks bug, then I think that's a coincidence.

That’s the benefit. :) Build already works around all kinds of issues (e.g. replacing word-joiner with zwnbsp, etc.). If there’s already a break opportunity after the em-dash, then adding a zws after it should have no detrimental effect. But if it fixes the display for iPad/Macs in the process, that seems like a pretty easy win with no downside.

Scott Ridley

Jan 25, 2022, 7:10:11 PM
to Standard Ebooks
You're right, Alex. Thanks for the Unicode stuff; it's very helpful to see what it 'should' be doing rather than what it actually is doing.
I'll raise a bug with Apple about it, since it does appear to be a render bug.

Vince, it's probably a long-settled question, but has any consideration been put into building a 'quirks' tool that makes 'incorrect' but helpful changes for the various readers? It's probably a can of worms that you don't want to open, but it might be useful for these sorts of bugs that will likely never get fixed by the e-reader.

Scott

Alex Cabal

Jan 25, 2022, 7:11:18 PM
to standar...@googlegroups.com
Hard to say about downsides. What if the default font on another reader
doesn't support it? Then there might be blank squares at the end of each
em dash. The compatible epubs go out to a huge variety of ereaders.

Alex Cabal

Jan 25, 2022, 7:12:16 PM
to standar...@googlegroups.com
Bad font support, by the way, is why we decompose <wj> to <zw> in the
first place. The more exotic the character, the more likely we're going
to get a blank square in lesser ereaders.

Alex Cabal

Jan 25, 2022, 7:14:25 PM
to standar...@googlegroups.com
That's what the "compatible" epub build is for, but it tries to target
as many ereaders as possible while still being reasonable. Doing a
custom build per ereader would be pretty wild and then when things go
wrong people would complain at us instead of their bad ereader.

Vince

Jan 25, 2022, 7:21:18 PM
to Standard Ebooks
I would argue that if a reader recognizes the zero-width no-break space, which is more “exotic”, then the odds of it not recognizing the zero-width space, which is less exotic, are extremely low. It’s certainly worth trying, IMO.

Scott Ridley

Jan 25, 2022, 7:25:03 PM
to Standard Ebooks
Vince - you answered my question before I asked it! Looks like the 'compatible' version is just such a quirks mode.

Alex - yes, I can see that having a million versions would spiral out of control; that's why I was thinking of a tool or build mode, rather than a straight-out download. The 'compatible' one is best-effort, but the option would stay open for people to build their own using the SE tools to target their own ereaders.

I had a quick look at the GitHub repository for the tools and very quickly got out of my depth!