Line Breaking

fantasai

Jul 26, 2007, 3:53:20 PM

Masayuki recently landed bug 255990 (Characters below U+0100 not subject to
line-breaking rules), which has improved our line breaking in some cases
(we break after slashes!) and made it worse in others (we break 's/he'!)
https://bugzilla.mozilla.org/show_bug.cgi?id=255990

There's a meta bug at
https://bugzilla.mozilla.org/show_bug.cgi?id=206152
but I think we need a little more centralized planning. Filing individual
bugs isn't going to give us a coherent picture of what we're doing and what
we want to do. There are quite a few good comments in various bugs, such as
those from Simo Kaupinmäki, that are on a level higher than the individual
bugs. I want to pull out that discussion onto the newsgroups here.

Specs and documents relevant to line breaking include
* JIS X 4051 (in Japanese; I can scan it if someone tells me which pages)
* UAX 14 <http://www.unicode.org/reports/tr14/>, but see the proposed update at
<http://www.unicode.org/reports/tr14/tr14-20.html>
* Masayuki's charts <https://bugzilla.mozilla.org/attachment.cgi?id=271380>
<http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html>
* Jukka's comments on UAX 14 <http://www.cs.tut.fi/~jkorpela/unicode/linebr.html>
* CSS3 Text <http://www.w3.org/TR/css3-text/#line-breaking>

We'll probably want to come up with long-term and short-term plans, so please
keep that in mind when discussing the merits of any suggested approaches. And
while breaking differently based on language context might be nice, the important
problem we want to solve is language-independent breaking. (Script-dependence
is ok.)

Right, so discussion here, record of important points in
http://wiki.mozilla.org/Gecko:Line_Breaking
and replies to this can be split into multiple messages; a 10-page treatise
could be useful but is not necessary. It is not clear to me e.g. what
jp-critical situations Masayuki was addressing in bug 255990, so I want
examples and use cases. Feel free to copy/paraphrase comments from bugzilla.
If anyone is planning more work in this area (roc, masayuki?) please summarize
your plans here. Thanks!

~fantasai

rocal...@gmail.com

Jul 30, 2007, 8:16:37 PM

On Jul 27, 7:53 am, fantasai <fantasai.li...@inkedblade.net> wrote:
> There's a meta bug at
> https://bugzilla.mozilla.org/show_bug.cgi?id=206152
> but I think we need a little more centralized planning. Filing individual
> bugs isn't going to give us a coherent picture of what we're doing and what
> we want to do. There are quite a few good comments in various bugs, such as
> those from Simo Kaupinmäki, that are on a level higher than the individual
> bugs. I want to pull out that discussion onto the newsgroups here.

I agree, this is a good idea.

I really want to find out what the jp-critical need for linebreaking
of Latin-1 text is.

Rob

Masayuki Nakano

Aug 2, 2007, 6:12:50 AM
to rocal...@gmail.com

rob...@ocallahan.org wrote:
> I really want to find out what the jp-critical need for linebreaking
> of Latin-1 text is.

We need to break URLs in most cases, so I think we should break
after '/' in the path part of URLs. We should also break after '\' for
Windows file paths, and after '&', ';' or '=' in the parameter part of
URLs. And '%' too, since it is used for %-encoding.

'/' and '\' may need context analysis. (Are they breakable only from
the second occurrence onward?)

I'm still thinking about other patterns, but I think this is enough.

*However*, if nobody has trouble with it, we should use a similar spec for
compatibility with WinIE, i.e., '!', '$', '?', '[', ']', '{', '}', '¢'
and '°'.

--
Masayuki Nakano <masa...@d-toybox.com>
Manager, Internationalization, Mozilla Japan.
Personal Web Site (Written in Japanese): http://www.d-toybox.com/studio/

fantasai

Aug 2, 2007, 1:19:39 PM

Masayuki Nakano wrote:
> rob...@ocallahan.org wrote:
>> I really want to find out what the jp-critical need for linebreaking
>> of Latin-1 text is.
>
> We need to break URLs in most cases, so I think we should break
> after '/' in the path part of URLs. We should also break after '\' for
> Windows file paths, and after '&', ';' or '=' in the parameter part of
> URLs. And '%' too, since it is used for %-encoding.

Why do we "need to" break URLs? Yes, it would be nice but can you give
some examples of where this is "jp-critical"? It would be better if we
all had the same understanding of the problem we're trying to solve here.

I'd rather not break at '\', because it is used in escapes, which could
be in the middle of a token. If we're breaking at '\', I want to see
examples of why it is necessary and some estimate of how common these
cases are vs. the cases where we want the word to stay together.

For %-encoding, you'd want to break *before* the %, not after it. But
for normal usage, e.g. '100%', you really don't want to break before.
Therefore I don't think % should be a breaking character.

I don't see a problem with breaking after ';'. I can't recall how they're
particularly relevant to URLs, but I also can't think of any cases where
that would break anything.

If we want to break at &, then we should prioritize spaces and semicolons
over &. We don't want 'x &nbsp; ' to break after either &.

> '/' and '\' may need context analysis. (Are they breakable only from
> the second occurrence onward?)

Yes, I believe this is necessary. 'c/o' should never break. Neither should
'\n'. If we're allowing breaks at slashes in 1.9, then this level of context
analysis is imho required.

> I'm still thinking about other patterns, but I think this is enough.
>
> *However*, if nobody has trouble with it, we should use a similar spec for
> compatibility with WinIE, i.e., '!', '$', '?', '[', ']', '{', '}', '¢'
> and '°'.

I think we should avoid introducing breaks until we have thought through
all the consequences, which I'm not convinced we have for most of these
breaks. Compatibility with WinIE is not "thought through all the
consequences".

~fantasai

Boris Zbarsky

Aug 2, 2007, 3:18:49 PM

fantasai wrote:
> I don't see a problem with breaking after ';'. I can't recall how they're
> particularly relevant to URLs

The same way that '?' is, basically. I agree that this is a reasonable place to
break, not only in URIs but in general.

-Boris

David E. Ross

Aug 2, 2007, 4:23:38 PM

Examples where the virgule (/) should not break a line:
n/a for "not applicable"
fractions (e.g., 2/3)
dates (e.g., 8/2/07, which I would usually write "2Aug07" to avoid
confusion with 8Feb07)
combined units of measure (e.g., mi/hr, cm/sec, kg/m2)
and/or (a stylistic abomination that I never use)

In text other than Web pages, I manually break URIs and file paths just
before (not after) the virgule.

--

David E. Ross
<http://www.rossde.com/>.

Anyone who thinks government owns a monopoly on inefficient, obstructive
bureaucracy has obviously never worked for a large corporation. © 1997

Boris Zbarsky

Aug 2, 2007, 4:42:03 PM

David E. Ross wrote:
> dates (e.g., 8/2/07, which I would usually write "2Aug07" to avoid
> confusion with 8Feb07)

Arguably, the '-' in 2007-08-02 should not break either. Nor should the '-' in
"Examples 1-5". Nor in 2007-Aug-02.

-Boris

rocal...@gmail.com

Aug 3, 2007, 7:44:42 PM

On Aug 2, 10:12 pm, Masayuki Nakano <masay...@d-toybox.com> wrote:
> rob...@ocallahan.org wrote:
> > I really want to find out what the jp-critical need for linebreaking
> > of Latin-1 text is.
>
> We need to break URLs in most cases, so I think we should break
> after '/' in the path part of URLs. We should also break after '\' for
> Windows file paths, and after '&', ';' or '=' in the parameter part of
> URLs. And '%' too, since it is used for %-encoding.
>
> '/' and '\' may need context analysis. (Are they breakable only from
> the second occurrence onward?)

Are these jp-critical but not critical for Western users? If so, why
the difference?

Rob

Masayuki Nakano

Aug 3, 2007, 8:51:31 PM
to rocal...@gmail.com

Right now I cannot list example sites, because many sites have changed
their styles, maybe for Fx. E.g., Mixi, which is the most popular SNS
site in Japan, uses a WBR hack for URLs. (Of course, this is not a good
thing.) And Slashdot Japan uses overflow:auto; for paragraphs; that is a
better approach, but it produces a horizontal scrollbar.

Japanese text can break anywhere, with a few exceptions. I.e., we
always break within words. So, for us, the old rule (break only around
SPACE) is strange, because some points (around punctuation and
parentheses) look like breakable points to us.
# Both designers and users in Japan complain about this issue.

## jp-critical has two meanings:
## 1. Really critical bugs, e.g., cannot use IME.
## 2. Marketing strategy, e.g., some bugs are important (only?) in
Japan, but not so in bugzilla.mozilla.org.

rocal...@gmail.com

Aug 3, 2007, 9:00:31 PM

On Aug 4, 12:51 pm, Masayuki Nakano <masay...@d-toybox.com> wrote:
> Japanese text can break anywhere, with a few exceptions. I.e., we
> always break within words. So, for us, the old rule (break only around
> SPACE) is strange, because some points (around punctuation and
> parentheses) look like breakable points to us.

OK, that makes some sense I guess ... I can see that if you're used to
breaking around punctuation in Japanese text then it's confusing when
it doesn't work in Latin text (but it does in IE).

Rob

Masayuki Nakano

Aug 3, 2007, 9:02:47 PM
to Boris Zbarsky

'-', '/' and '=' are not breakable if the next character is numeric.

But the last case (2007-Aug-02) is not saved by the rule. Should we also
check whether the previous character is numeric?

Note that if *authors* want the text not to break around a
hyphen, they should use the non-breaking hyphen (U+2011).
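
A minimal sketch of the rule described in this message, in Python. It is
not the actual Gecko code: the function name can_break_after and the
check_previous flag are hypothetical, and "numeric" is approximated with
str.isdigit(). The flag models the open question of whether the previous
character should be checked as well.

```python
# Characters named in this message; anything else is out of scope here.
BREAK_AFTER = {"-", "/", "="}


def can_break_after(text: str, i: int, check_previous: bool = False) -> bool:
    """Return True if a break is allowed between text[i] and text[i + 1]."""
    if i + 1 >= len(text) or text[i] not in BREAK_AFTER:
        return False
    # Rule from the message: no break if the next character is numeric.
    if text[i + 1].isdigit():
        return False
    # Hypothetical stricter variant: also refuse after a numeric character.
    if check_previous and i > 0 and text[i - 1].isdigit():
        return False
    return True
```

With the default settings the first hyphen of "2007-Aug-02" is still
breakable, which is the case the message says the rule does not save;
passing check_previous=True blocks it.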

Masayuki Nakano

Aug 3, 2007, 9:14:06 PM
to rocal...@gmail.com

Yeah, I don't think that IE's spec is the best approach for users.
However, many authors still use only WinIE for testing (only in Japan?).
Therefore, I want to keep compatibility with WinIE except for the points
that are not good.

Boris Zbarsky

Aug 3, 2007, 10:56:49 PM

Masayuki Nakano wrote:
> '-', '/' and '=' are not breakable if the next character is numeric.
>
> But the last case (2007-Aug-02) is not saved by the rule. Should we also
> check whether the previous character is numeric?

That could lead to issues with long organic chemical names. I'm not sure
there's a perfect solution there. ;(

> Note that if *authors* want the text not to break around a
> hyphen, they should use the non-breaking hyphen (U+2011).

There's plenty of plain ASCII text we end up rendering (e-mail comes to mind),
so I'd really like us to have default line breaking that's as good as we can
reasonably manage.

One other line breaking weirdness I ran into today: the string "Init()" broke
after the 't'. That looked _really_ odd. I wouldn't break before a '(' that's
preceded by a letter like that... Breaking after the '(' if we have to would
sometimes be reasonable, though in this case that would look pretty odd too.
And really, "sin(x)" should not be line breaking anywhere, imo... ;)

-Boris

lis...@uta.fi

Aug 4, 2007, 11:27:10 AM

I took a little break from the line break discussion, but now I'll try to
collect and extend my main points from the various bug comments. My
starting point is the approach suggested by Jukka Korpela in his
criticism of the Unicode Standard Annex (UAX) #14:
http://www.cs.tut.fi/~jkorpela/unicode/linebr.html

Basically, the generic (language-independent) line breaking rules
should be as simple as possible while at the same time trying to
respect the conventions of natural languages. Thus, each character
should default to the kind of line breaking that is most likely
expected of it in its natural context.

UAX 14 names three principal styles to determine line break
opportunities in different scripts:

- Western: spaces and hyphens are used to determine breaks
- East Asian: lines can break anywhere, unless prohibited
- South East Asian: line breaks require morphological analysis
<http://www.unicode.org/reports/tr14/tr14-20.html#BreakOpportunities>

According to UAX 14, the Western and East Asian styles can be unified
into a single set of specifications, whereas the South East Asian
style requires more complicated, language-dependent hyphenation
algorithms. Although, I suppose, the unified specification alone is
not enough to fully cater for the needs of any language, it should be
good enough for most cases in Western and East Asian languages. The
default behavior of each character could be redefined and refined at
the language-dependent level when necessary, but this should be
treated as a separate issue, since the language of a document is not
always easy to identify.

I'll concentrate on discussing the properties of the Western and
especially Latin scripts, since the Asian scripts are beyond my area
of expertise. I recognize that some compromises may be necessary in
order to make the line breaking system adequate for both the Western
and East Asian users, but I think we have to start by considering the
basis of each tradition independently.

CONVENTIONAL LINE BREAKS IN LATIN SCRIPTS

In Latin scripts, line break opportunities are basically marked with
spaces. Additional break opportunities may be marked with hyphens or
dashes. Breaking in any other place would generally be unconventional
and potentially confusing.

Technically, a break may usually occur only _after_ a character. In
some languages, a break may be allowed even before an em-dash, but
since this would be unexpected in other language contexts, it should
be defined as a language-dependent exception.

There are some special cases where a line break is not desirable even
after a space, a hyphen or a dash. However, in most everyday cases the
exceptions should be reasonably simple to specify:

A line break is allowed after a space, a hyphen or a dash, unless

(a) the space or hyphen is of the non-breaking type (reasoning: the
very idea of a non-breaking character is to prohibit line break)

(b) the hyphen or dash is adjacent to a space (reasoning: the basic
function of a space is to separate two words from each other, so it
seems apparent that a hyphen or dash _preceded_ by a space -- as in
the expression "suffix -ed" -- is supposed to be a fixed part of the
word it is directly connected to)

(c) the hyphen or dash is adjacent to any punctuation (reasoning:
combining a hyphen with other punctuation may imply many different
kinds of ordinary or exceptional usage -- such as ASCII art -- where
it is not desirable to break; however, since two or three adjacent
hyphens are often used as a substitute for a single dash, a double
hyphen might be considered equivalent to an en-dash and a triple
hyphen equivalent to an em-dash, generally allowing a line break after
the last hyphen)

(d) there is no more than one alphabetic or symbol character on either
side of the hyphen or dash (this would improve the typographical
appearance by preventing widowed and orphaned characters at the start
or end of a line; one might even consider preventing line breaks if
there were no more than _two_ characters on either side, or allowing
the user to define the best setting in the browser preferences).
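
A rough sketch of rules (a) to (d) in Python may make them easier to
test against examples. All names here are hypothetical, the
double/triple hyphen exception in (c) is deliberately left out, and
"alphabetic or symbol character" in (d) is approximated as any character
that is neither whitespace nor punctuation, which is an assumption.

```python
import unicodedata

NON_BREAKING = {"\u00a0", "\u2011"}  # no-break space, non-breaking hyphen
HYPHENS_AND_DASHES = {"-", "\u2010", "\u2013", "\u2014"}


def is_punctuation(ch: str) -> bool:
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def run_length(text: str, start: int, step: int) -> int:
    """Count consecutive non-space, non-punctuation characters from start."""
    count, j = 0, start
    while 0 <= j < len(text) and not text[j].isspace() and not is_punctuation(text[j]):
        count += 1
        j += step
    return count


def can_break_after(text: str, i: int) -> bool:
    """Return True if a line break is allowed between text[i] and text[i + 1]."""
    if i + 1 >= len(text):
        return False
    ch = text[i]
    if ch in NON_BREAKING:  # rule (a)
        return False
    if ch == " ":
        return True
    if ch not in HYPHENS_AND_DASHES:
        return False
    prev = text[i - 1] if i > 0 else " "
    nxt = text[i + 1]
    if prev.isspace() or nxt.isspace():  # rule (b)
        return False
    if is_punctuation(prev) or is_punctuation(nxt):  # rule (c), no "--" exception
        return False
    # Rule (d): at least two characters are required on each side.
    if run_length(text, i - 1, -1) <= 1 or run_length(text, i + 1, 1) <= 1:
        return False
    return True
```

Under these assumptions "well-known" may break after the hyphen, while
"e-mail" and the "-ed" in "suffix -ed" stay together.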

These minimal line breaking rules should cover the most important
cases at least for Latin scripts (although I have probably overlooked
something; please feel free to add to the list).

A somewhat more detailed set of rules may be needed for numerical
contexts, where a hyphen (or sometimes perhaps a dash) is often used
as a minus sign. Note that disallowing line breaks altogether adjacent
to a numeric character would not produce a desired effect for example
in long chemical names, such as "2-bromo-4,4-dichlorophenol".

Further exceptions could be specified at the language-dependent level,
or by special "emergency break" rules for very long strings.

Language-dependent additions

Although language-dependent rules go beyond the scope of this
discussion, it might be illustrative to consider briefly how the
generic rules could be extended. As long as the document declares the
language(s) used, it should be fairly easy to apply language-dependent
additional rules, for example:

- in English, a line break is allowed both before and after an
em-dash, and irrespective of how many alphabetic or numeric characters
there are on either side

- in French, a line break is not allowed after a space if it is
followed by an exclamation mark, a question mark, a colon, a semicolon
or a closing guillemet, nor if it is preceded by an opening guillemet
(as it is conventional to separate these characters with a space in
French typography)

- in Finnish, a line break is not allowed after a space if it is
preceded by a hyphen (as there may occur cases such as "koulu- ja
kirjastorakennus" -- referring to a combined school and library
building -- and the combination of the words "koulu-" and "ja" should
not be confused with the plural partitive form "kouluja" -- schools --
which could be hyphenated as "koulu-ja").
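
As an illustration only, the French and Finnish items above could be
layered over the generic "break after a space" rule roughly like this.
The function name, the use of plain language tags such as "fr" and "fi",
and the assumption that text[i] is an ordinary space are all
hypothetical.

```python
FRENCH_NO_BREAK_BEFORE = set("!?:;\u00bb")  # includes the closing guillemet
FRENCH_NO_BREAK_AFTER = {"\u00ab"}          # opening guillemet


def can_break_after_space(text: str, i: int, lang: str) -> bool:
    """text[i] is assumed to be an ordinary space; lang is a tag like 'fr'."""
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if lang == "fr":
        # French: keep the space attached to the punctuation it separates.
        if nxt in FRENCH_NO_BREAK_BEFORE or prev in FRENCH_NO_BREAK_AFTER:
            return False
    if lang == "fi" and prev == "-":
        # Finnish: "koulu- ja" must not be mistaken for "koulu-ja".
        return False
    # Generic rule: a plain space offers a break opportunity.
    return True
```

When the language is unknown, the function falls back to the generic
rule, which matches the idea that language-dependent rules only refine
the defaults.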

Of course one can come up with many more language-dependent rules, but they
can be added little by little, as native speakers start to point out
deficiencies. However, one should consider very carefully the positive
and negative effects and the necessity of each additional exception.
For example, in the French and Finnish examples above, the undesired
breaks can usually be prevented with a no-break space, so basically no
special rules should be needed. On the other hand, writing a Unicode
character or an HTML entity is often clumsy, and the result can be
unpredictable (for example, just a couple of days ago I tried to use
some HTML entities when commenting to a blog, but the entity codes
ended up showing as regular text), so a plain space may be a safer
choice after all.

Perhaps one day the rules may be extended to include even
language-specific hyphenation algorithms, but for now, I suppose that's
something we can only dream of.

Non-natural languages

Non-natural languages may require special consideration, but
basically, they should follow the conventions of natural languages. In
a technical notation, such as a URL or a sequence of programming
language code, an unconventional line break may actually be even more
confusing than in a natural language sentence. Natural languages
usually contain a lot of redundancy, in order to make sure that
occasional errors or distractions will not distort the whole message.
Non-natural languages, however, usually strive for efficiency and
depend on the data to be interpreted exactly as it is written. Thus,
it may be crucial to know whether there is a space between two
characters or not, but an unconventional line break would hide this
essential detail.

Misunderstanding UAX 14

Unfortunately, UAX 14 tends to obscure the basic line breaking
principles for Latin scripts by describing the behavior of various
characters in a very complicated way. It is easy to misunderstand UAX
14. For example, I was stunned when I read (in the third section of
Table 1)* that closing punctuation -- such as ')' -- prohibits line
breaks before, and that opening punctuation -- such as '(' --
prohibits line breaks after.
*<http://www.unicode.org/reports/tr14/tr14-20.html#Table1>

Since line breaks were not prohibited _after_ a closing parenthesis
and _before_ an opening parenthesis, this seemed to imply that they
should be allowed. However, it would be absurd to break as in the
following examples:

colo(u)ring

colo(u)
ring

colo
(u)ring

After some reasoning, and with the help of the explanations found in
the (rather long) Chapter 5.1, I realized that the idea is merely to
overrule the default behavior of the nearest enclosed character (which
in my examples is "u"), in the case that _it_ allows a line break
before or after. These rules say nothing about how to break
_outside_ the parentheses, only how not to break _inside_ them.

Perhaps it is exactly the confusing description in UAX 14 that has
tricked even the IE designers into allowing line breaks before and after
parentheses (as well as in many other strange situations), regardless
of whether there are spaces involved or not. This is definitely not
correct in a Latin context (whereas in an East Asian context it may
actually be preferable).

LINE BREAKING AT A SLASH

According to the conventional principles of Latin scripts, a slash
would not be considered to offer a line break opportunity. Actually, a
slash is rather rare in natural language contexts, but there are
special expressions that depend on the presupposition that a word
cannot be broken at a slash (for example, abbreviations such as "c/o" and
"s/he" would become more difficult to perceive if they were broken).

The typographical line breaking conventions have been developed over a
period of centuries, long before there were computers and URLs to
worry about. Neither, it seems, were file-paths and URLs designed to
take into account the typographical issue of how they should be
presented in a horizontally limited space. Thus, as computers and the
Web have become an important means of communication in our everyday
life, it seems that some modifications to the conventional line
breaking rules are needed.

When analyzing the structure of a file-path, the most logical line
break opportunity seems to be either immediately after or immediately
before a slash. However, allowing line breaks indiscriminately at any
slash would produce new problems. Thus, break opportunities should be
limited to the special cases where they are considered really
necessary, i.e., long file-paths and URLs.

Perhaps the most straightforward way to identify breakable file-paths
would be to count how many slashes there are in each string, since in
natural language expressions there's rarely more than one slash. Even
if there are two slashes in a file-path, the string as a whole is
often so short that breaking it does not offer any significant
typographical improvement. For example, it would be pointless to break
a file-path such as "/etc/apt". Therefore, it might be considered
reasonable to disallow breaks unless there are at least three slashes
in a string.

Even when there are three or more slashes and the string is broken,
the reader should be given a hint that something exceptional has happened
and that the broken string is actually supposed to be interpreted as
a single, continuous entity. Therefore, a break should not be allowed
after the first slash. Seeing that there is no space after the first
slash should give the reader a hint that perhaps there are no spaces
after the other slashes either (although this would be deceiving in
file-paths and URLs that _end_ with a slash).

Furthermore, if the last part of the string is also a regular word in
the context language (as "apt" is a word in English), it may not
always be clear whether the part separated by a line break belongs to
the string or to the context. Therefore, a break should not be allowed
after the last slash (nor after the first), but any other slash might
be considered to offer a break opportunity:

/etc/
foobar/apt

This way, the presence of slashes on both lines would give the reader
a hint that the parts probably belong to the same string even
though they are separated by an unconventional line break.
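
The slash-counting idea can be sketched in a few lines of Python. This
is only an illustration under the assumptions stated in the text (at
least three slashes, no break after the first or the last one); the
function names are hypothetical.

```python
def break_points_after_slashes(token: str, min_slashes: int = 3) -> list[int]:
    """Return offsets at which the token may be broken (break *after* a slash)."""
    positions = [i for i, ch in enumerate(token) if ch == "/"]
    if len(positions) < min_slashes:
        return []
    # Skip the first and the last slash; a break falls right after the others.
    return [p + 1 for p in positions[1:-1]]


def split_at(token: str, points: list[int]) -> list[str]:
    """Split the token at every given offset, mainly for inspecting results."""
    pieces, start = [], 0
    for p in points:
        pieces.append(token[start:p])
        start = p
    pieces.append(token[start:])
    return pieces
```

For "/etc/foobar/apt" this yields the two pieces "/etc/" and
"foobar/apt", while "/etc/apt" and "c/o" get no break opportunity at
all.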

However, even this solution leaves room for potential confusion.
Sometimes a word is wrapped in slashes as if in parentheses or quotes
-- like /this/ -- in order to simulate the appearance of italics.
Furthermore, according to the International Phonetic Alphabet, slashes
may be used in a similar fashion in order to describe the actual
pronunciation of a word. Thus, there may occur cases such as:

(1) /foobar/ and/or

Now, consider the following file-path:

(2) /foobar/and/or

If broken before "and", both examples will look exactly the same:

/foobar/
and/or

In the first example, a line break after the space that precedes "and"
would be perfectly conventional. In the second example, a line-break
after the second slash would be unconventional and a potential cause
for confusion.

Therefore, a possibly better solution (as suggested above by David E.
Ross) might be to allow breaks only _before_ a slash. In that case,
the second example could be broken in two ways:

/foobar
/and/or

/foobar/and
/or

This should prevent anybody from confusing a file-path with the special
usage of simulating italics or marking pronunciation with slashes.
Also, seeing a line beginning with a slash would warn the reader that
there is something exceptional in the string, and if there are
slashes on the previous line as well, it shouldn't be too hard to
conclude that the strings are somehow linked to each other.
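
A small Python sketch of the break-before-slash variant follows. It
carries over the three-slash threshold and the exclusion of the first
slash from the earlier reasoning; both are assumptions here, since the
text does not restate them for this variant, and the function name is
hypothetical.

```python
def break_points_before_slashes(token: str, min_slashes: int = 3) -> list[int]:
    """Return offsets at which the token may be broken (break *before* a slash)."""
    positions = [i for i, ch in enumerate(token) if ch == "/"]
    if len(positions) < min_slashes:
        return []
    # A break directly before every slash except the first one.
    return positions[1:]


token = "/foobar/and/or"
for point in break_points_before_slashes(token):
    # Each candidate break leaves a slash at the start of the second line.
    print(token[:point])
    print(token[point:])
    print()
```

The loop prints the two alternatives shown above: "/foobar" followed by
"/and/or", and "/foobar/and" followed by "/or".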

CONCLUSIONS

The default behavior for most Latin characters is to not allow line
breaks either before or after, whereas it seems that for most East
Asian characters the default behavior is to allow line breaks both
before and after. Obviously, if a Latin character is put adjacent to
an East Asian character, their default behaviors conflict. There
should be a consistent rule on how the conflict is solved.

Since restricting the line breaks appears to be a significant problem
in East Asian languages, perhaps it would be reasonable to allow an
East Asian character to overrule the default breaking behavior of a
Latin character if put adjacent to each other. However, if put into a
Latin context, even a non-Latin character should rather be treated as
a symbol character inherent in Latin scripts (and thus, line breaking
would not be allowed), but this can be specified at the
language-dependent level.

This approach would not solve the problem that East Asian users
expect even words written with Latin characters to break at any
punctuation, but I'm afraid that this issue cannot be helped without
violating the fundamental logic of Latin scripts.

I have tried to illustrate the conditions of line breaking in Latin
scripts and the potential problems caused by overlooking and adding
exceptions to the conventional rules. Each exception should be
considered very carefully because a relative improvement in
typographical appearance can hardly be justified if the required
adjustments can distort the actual message. The basic function of the
art of typography is to make it easier for the reader to absorb
information. If typographical solutions make the contents more
difficult to understand, it is bad typography.

--
Simo Kaupinmäki

lis...@uta.fi

Aug 4, 2007, 2:08:33 PM

On 2 elo, 20:19, fantasai <fantasai.li...@inkedblade.net> wrote:

> Masayuki Nakano wrote:
> > And we should also break after '&' and ';' or '='
>

> I don't see a problem with breaking after ';'. I can't recall how they're
> particularly relevant to URLs, but I also can't think of any cases where
> that would break anything.

Breaking after semicolon (;) would be unexpected, but I can't think of
any natural language example where semicolon is not followed by a
space (which, of course, offers a break opportunity anyway, unless it
is a no-break space). However, semicolon is often used in smileys, and
you wouldn't want to break there. ;-)

> If we want to break at &, then we should prioritize spaces and semicolons
> over &. We don't want 'x &nbsp; ' to break after either &.

Also, breaking abbreviations of names, such as AT&T, would generally
be undesirable.

> > *However*, if nobody has trouble with it, we should use a similar spec for
> > compatibility with WinIE, i.e., '!', '$', '?', '[', ']', '{', '}', '¢'
> > and '°'.

Square brackets '[]' could be used instead of parentheses in cases
such as 'colo[u]ring'. I'm not sure whether anybody would use curly
brackets '{}' in such a situation, but if somebody did, it would be
logical to expect them to behave in the same way as the other
brackets.

An exclamation mark and a question mark may occur adjacent to each
other when somebody wants to express that s/he has written something
surprising (!?). Also, some writers may try to emphasize their
sentences with a sequence of exclamation marks (!!!) or question marks
(???); this is bad style, but nevertheless, one would not expect such
a sequence to be broken.

In English, '$' is usually followed by a number ($20), while in some
other languages, the currency symbol comes after the number ('20$' or
'20 $' -- the latter example preferably with a no-break space). Also,
in some languages, case suffixes are connected to symbol characters
with a colon. In Finnish, for example, the genitive suffix -n could be
used when building strings such as '$:n', '¢:n' and '%:n', and these
should not be broken.

A degree symbol (U+00B0) is often written adjacent to the unit of
measurement (100 °C), but if the unit is not mentioned, the degree
symbol is written adjacent to the number (100°). Thus, the degree
symbol should not be breakable at all.

--
Simo Kaupinmäki

David E. Ross

Aug 4, 2007, 8:09:41 PM

On 8/4/2007 11:08 AM, lis...@uta.fi wrote [in part]:

>
> In English, '$' is usually followed by a number ($20), while in some
> other languages, the currency symbol comes after the number ('20$' or
> '20 $' -- the latter example preferably with a no-break space). Also,
> in some languages, case suffixes are connected to symbol characters
> with a colon. In Finnish, for example, the genitive suffix -n could be
> used when building strings such as '$:n', '¢:n' and '%:n', and these
> should not be broken.

In news reports about cross-border trade, I sometimes see $25US and
$26Cdn to distinguish between amounts of money expressed in U.S. dollars
and Canadian dollars. But then I sometimes see US$25 and Can$26. Thus,
this is not standardized.

Also, I see $ indicating the beginning and end of a string of
characters, often generated and dynamically inserted into a Web page.
Go to <http://www.w3.org/>, and scroll to the bottom of the page. You
will see the line:
Webmaster · Last modified: $Date: 2007/08/03 16:19:21 $ |

The $ symbol should not be used to cause a break.

> A degree symbol (U+00B0) is often written adjacent to the unit of
> measurement (100 °C), but if the unit is not mentioned, the degree
> symbol is written adjacent to the number (100°). Thus, the degree
> symbol should not be breakable at all.

Very often, I see the number, degree symbol, and units letter all run
together without any spaces. See, for example, the temperatures given
in the upper-left corner at <http://www.cbc.ca/news/>. I have also seen
the alternative form 90° F, with the space between the degree symbol and
the units letter. I think this might be more common than Kaupinmäki's
example with the space between the number and the degree symbol. The
degree symbol should not be used to cause a break.

Boris Zbarsky

Aug 5, 2007, 9:25:55 AM

lis...@uta.fi wrote:

[long mail snipped]

I pretty much agree with this e-mail. I wonder how hard some of this is to
implement... but it would be great if we can do it.

-Boris

Masayuki Nakano

Aug 5, 2007, 3:43:59 PM

Hi, all.

I posted new patch to bug 389056.
https://bugzilla.mozilla.org/show_bug.cgi?id=389056
https://bugzilla.mozilla.org/attachment.cgi?id=275308&action=diff

This patch includes many fixes based on feedback from this thread and other bugs.

> Masayuki Nakano (Mozilla Japan) 2007-08-05 02:31:20 PDT
>
> Created an attachment (id=275307) [details]
> Patch rv3.0
>
> This changes many character's behaviors by the feedbacks.
>
> 1. Added the new class which is not breakable around them even if there are
> breakable. But if there is SPACE, it is breakable. This behavior is defined in
> UAX#14.
>
> 2. [] is not breakable now if around characters are character/numeric class.
>
> 3. / and \ are not breakable now. But if a word has two or more /s or \s, they
> are breakable except the first one. And they are breakable *before* them at the
> time. Therefore, the path text always has / or \. I believe that this is best
> way. But / is not breakable everytime if next is numeric for date format.
> (2007/01/01)
>
> 4. DEGREE SIGN is not breakable if after character is character class. But it's
> still breakable if after character is numeric, for compatibility.
>
> 5. The hyphen is not breakable if the text doesn't have character class, or
> next is numeric class. Therefore, we cannot break 2007-Aug-07
>
> 6. Pound/Yen sign are not same behavior with IE7.

An example of how a Unix/URL path breaks:

foo/bar/foo2/bar2/

is rendered as:

foo/bar
/foo2
/bar2/

Each separated piece always keeps a '/', and no line consists of a '/' alone.
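
The path rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described in the patch comment, not the Gecko code: the function names are made up here, and the treatment of a trailing separator is inferred from the rendering shown above.

```python
def slash_break_points(word):
    """Return the indices i where a break may occur *before* word[i]."""
    seps = [i for i, ch in enumerate(word) if ch in "/\\"]
    if len(seps) < 2:
        return []  # a single separator never breaks (e.g. "s/he")
    points = []
    for i in seps[1:]:  # the first separator is never a break point
        if i == len(word) - 1:
            continue  # assumption: a trailing separator is not widowed
        if word[i + 1].isdigit():
            continue  # date-like text such as 2007/01/01 stays together
        points.append(i)
    return points


def split_at(word, points):
    """Cut word into the pieces that would land on separate lines."""
    pieces, start = [], 0
    for p in points:
        pieces.append(word[start:p])
        start = p
    pieces.append(word[start:])
    return pieces


path = "foo/bar/foo2/bar2/"
print(split_at(path, slash_break_points(path)))
```

For the path above this yields the three pieces foo/bar, /foo2 and /bar2/, while "s/he" and "2007/01/01" get no break points at all.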

David E. Ross

Aug 5, 2007, 9:04:42 PM

to

On 8/5/2007 12:43 PM, Masayuki Nakano wrote [in part]:
>>
>> 4. DEGREE SIGN is not breakable if after character is character class. But it's
>> still breakable if after character is numeric, for compatibility.

Thus,
xxx xxx xxx 126° 15' 32"
near the end of a line could break as
xxx xxx xxx 126
° 15' 32"

Or did I misunderstand? Maybe you need to indicate where the break
occurs, before or after the degree sign.

Masayuki Nakano

Aug 6, 2007, 11:11:01 AM

to

David E. Ross wrote:
> On 8/5/2007 12:43 PM, Masayuki Nakano wrote [in part]:
>>> 4. DEGREE SIGN is not breakable if after character is character class. But it's
>>> still breakable if after character is numeric, for compatibility.
>
> Thus,
> xxx xxx xxx 126° 15' 32"
> near the end of a line could break as
> xxx xxx xxx 126
> ° 15' 32"
>
> Or did I misunderstand? Maybe you need to indicate where the break
> occurs, before or after the degree sign.

er, no.

"126°" is not broken. And also "126°C" too.

"126°126" is broken after '°'. I think that this case is not in normal
context.
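
As a rough sketch of the three cases just listed (an illustration only; `degree_break_points` is a hypothetical helper, and breaks at spaces are deliberately left out):

```python
DEGREE = "\u00b0"  # DEGREE SIGN


def degree_break_points(text):
    """Return positions p such that a break is allowed between
    text[p - 1] (a degree sign) and text[p] (a digit)."""
    return [
        i + 1
        for i, ch in enumerate(text[:-1])
        if ch == DEGREE and text[i + 1].isdigit()
    ]


for sample in ("126\u00b0", "126\u00b0C", "126\u00b0126", "128\u00b012'22\""):
    print(repr(sample), degree_break_points(sample))
```

Under this reading only a degree sign directly followed by a digit offers a break, which also covers an angle written without spaces such as 128°12'22".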

lis...@uta.fi

Aug 6, 2007, 1:37:35 PM

to

On 5 elo, 22:43, Masayuki Nakano <masay...@d-toybox.com> wrote:

> > 4. DEGREE SIGN is not breakable if after character is character class. But it's
> > still breakable if after character is numeric, for compatibility.

Could you give an example where it was actually helpful to allow a
degree sign to break if followed by a numeric character? I'd rather
not allow any odd line breaks if it wasn't clear what the benefits
were, even if we weren't able to come by any examples where it could
cause confusion. Just that we haven't thought about it here doesn't
prove that it can't occur in some more or less specific situation in
some natural or non-natural language.

Actually, each example where an unconventional line break improves the
typographical appearance is also an example where it may cause
confusion, since for a Western reader it's kind of natural to assume a
space there. The same goes for allowing breaks at curly brackets,
semicolons etc. Thus, for each exception, we should consider carefully
whether the typographical benefits really outweigh the danger of
confusing some users.

We should also keep in mind that the consistent line breaking
principles in Latin scripts sometimes allowed characters to be used
even in unconventional ways. For example, occasionally the company
name Microsoft is written in sarcastic tone with '$' instead of 's':
Micro$oft. In a similar fashion, a Finnish magazine called Voima
('Force') likes to replace the 'i' in its logo with '!': Vo!ma. The
swapped characters look similar enough that the reader is assumed to
be able to recognize the word even though there is basically a
spelling error in it. Personally, I'm not a great fan of this kind of
character swapping, but it could be considered unfortunate if an
unexpected break destroyed the (supposedly clever) idea that the
writer tried to express.

> > 5. The hyphen is not breakable if the text doesn't have character class, or
> > next is numeric class. Therefore, we cannot break 2007-Aug-07

Do you mean that the string '2007-Aug-07' is not breakable at all, or
that it is breakable only before 'Aug'?

Not breaking dates at all may be a good idea in itself, but I would
consider it more important to allow at least some kind of breaks in
long chemical names, such as '2-bromo-4,4-dichlorophenol'. It may be
difficult to have it both ways.

> foo/bar
> /foo2
> /bar2/
>
> each separated words always have '/', and only '/' lines are not there.

Yes, this looks good (or at least as good a compromise as is
attainable). It is an important point that a break is not allowed
before the _ending_ slash so that it cannot end up widowed on the next
line.

--
Simo Kaupinmäki

Masayuki Nakano

Aug 6, 2007, 2:50:02 PM

to lis...@uta.fi

I think it is important that we use rules similar to IE's. A great many
websites with transitional structure use table cells whose width is auto.
If a table cell contains long words, the min-width of the cell depends on
the width of the longest word in it. So the layout of such pages depends
on the line-breaking rules.
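
To make the dependency concrete, here is a toy model (an assumption-laden sketch, not how Gecko or IE compute widths): the minimum width of a cell is taken to be the length, in characters, of its longest unbreakable piece, and `min_content_width` is a made-up helper that treats every listed character as a break-before opportunity.

```python
def min_content_width(text, extra_break_before=""):
    """Length of the longest piece that cannot be broken.

    Whitespace always breaks; each character in extra_break_before
    additionally allows a break in front of it (a simplification).
    """
    for ch in extra_break_before:
        text = text.replace(ch, " " + ch)
    return max((len(piece) for piece in text.split()), default=0)


cell = "see foo/bar/foo2/bar2/ now"
print(min_content_width(cell))       # whitespace-only breaking
print(min_content_width(cell, "/"))  # also breaking before slashes
```

With whitespace-only breaking the cell cannot become narrower than the 18-character path; once slashes offer break opportunities the longest piece is 5 characters, so two browsers with different rules can lay out the same auto-width table quite differently.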

If there are two or more cells whose widths are auto in the same row, the
compatibility depends on the line-breaking rules (not defined in the HTML/CSS
specs) *and* the table layout algorithm (not defined in the HTML/CSS specs).

So we need to keep compatibility with IE, which is very hard.

Changing the behavior away from IE's without an actual bad example is risky.

*However*, the non-breaking behavior is the same as in the current release
builds (e.g., gecko1.8.1). So there are no regressions even if I change
all the punctuation (except some punctuation which is used in
URLs/paths) to the *like-character* class.

It may be reasonable way for now.

I'll post new patch.

And I think that we should implement the prioritized line breaking in
Gecko2 or later. (I'm not able to work on it in Gecko1.9.)

>>> 5. The hyphen is not breakable if the text doesn't have character class, or
>>> next is numeric class. Therefore, we cannot break 2007-Aug-07
>
> Do you mean that the string '2007-Aug-07' is not breakable at all, or
> that it is breakable only before 'Aug'?

'2007-Aug-07' is not broken.

> Not breaking dates at all may be a good idea in itself, but I would
> consider it more important to allow at least some kind of breaks in
> long chemical names, such as '2-bromo-4,4-dichlorophenol'. It may be
> difficult to have it both ways.

yeah, we cannot analyze whether the current word is a date or a chemical
name. We cannot handle both cases...

I have a question: should the chemical name be broken after '-'? If so,
we need to give up on one of the two cases...

>> foo/bar
>> /foo2
>> /bar2/
>>
>> each separated words always have '/', and only '/' lines are not there.
>
> Yes, this looks good (or at least as good a compromise as is
> attainable). It is an important point that a break is not allowed
> before the _ending_ slash so that it cannot end up widowed on the next
> line.

Thanks. I don't want to change this behavior: because this rule matches
our code, the implementation is simple.

Justin Wood (Callek)

Aug 7, 2007, 1:38:35 AM

to

Masayuki Nakano wrote:
> David E. Ross wrote:
>> On 8/5/2007 12:43 PM, Masayuki Nakano wrote [in part]:
>>>> 4. DEGREE SIGN is not breakable if after character is character
>>>> class. But it's
>>>> still breakable if after character is numeric, for compatibility.
>>
>> Thus,
>> xxx xxx xxx 126° 15' 32"
>> near the end of a line could break as
>> xxx xxx xxx 126
>> ° 15' 32"
>>
>> Or did I misunderstand? Maybe you need to indicate where the break
>> occurs, before or after the degree sign.
>
> er, no.
>
> "126°" is not broken. And also "126°C" too.
>
> "126°126" is broken after '°'. I think that this case is not in normal
> context.
>

It can be, especially in engineering.

128°12'22" (assuming I didn't mix minutes and seconds notation up)

Such that it reads "128 degrees 12 minutes and 22 seconds" for angular
measurement. Useful especially in various engineering/mathematical fields.

~Justin Wood (Callek)

lis...@uta.fi

Aug 8, 2007, 3:56:59 PM

to

On 6 elo, 21:50, Masayuki Nakano <masay...@d-toybox.com> wrote:

> lis...@uta.fi wrote:
> > Not breaking dates at all may be a good idea in itself, but I would
> > consider it more important to allow at least some kind of breaks in
> > long chemical names, such as '2-bromo-4,4-dichlorophenol'. It may be
> > difficult to have it both ways.
>
> yeah, we cannot analyze whether the current word is date or chemical
> word. We cannot save both cases...
>
> I have a question, should the chemical name be broken after '-'? If so,
> we need to give up to save one case...

Well, I'm not an expert on chemistry, so I'll have to refer to Alan
Wood who originally raised the issue with chemical names:
https://bugzilla.mozilla.org/show_bug.cgi?id=95067#c78

Apparently, at least in this case, the preferred break point would be
after the second hyphen:

2-bromo-
4,4-dichlorophenol

However, if the optimal break point was too difficult to define, any
break would be better than no breaks at all.

Now, _that_ looks scary! (I suppose breaking at hyphens would help a
little, but I have no idea how on earth one could define a preferred
break point there.)

--
Simo Kaupinmäki

Boris Zbarsky

Aug 8, 2007, 4:30:30 PM

to

lis...@uta.fi wrote:
> Well, I'm not an expert on chemistry, so I'll have to refer to Alan
> Wood who originally raised the issue with chemical names:
> https://bugzilla.mozilla.org/show_bug.cgi?id=95067#c78

He raises an excellent suggestion: disallow breaks at '-' if this would result
in a 4 char or less orphan. Allow them otherwise. Prefer breaking before a
number to breaking before a letter (makes sense for chemical names, at least).

I suspect that similar rules for '/' would cover lots of the cases like c/o and
so forth, by the way.

-Boris
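
A minimal sketch of the suggestion above, under stated assumptions: breaks fall right after a hyphen, the "4 char or less" limit is applied to the pieces on both sides (hyphen included), and the preference for breaking before a number is expressed by ordering the candidates. The name `hyphen_break_points` and the `min_piece` parameter are invented for illustration.

```python
def hyphen_break_points(word, min_piece=5):
    """Return break positions after hyphens, most preferred first.

    A position is rejected if either resulting piece would have
    fewer than min_piece characters (i.e. 4 or fewer by default).
    """
    candidates = []
    for i, ch in enumerate(word):
        if ch != "-":
            continue
        pos = i + 1  # the break falls right after the hyphen
        if pos >= min_piece and len(word) - pos >= min_piece:
            candidates.append(pos)
    # Prefer breaking before a digit over breaking before a letter.
    return sorted(candidates, key=lambda pos: not word[pos].isdigit())


name = "2-bromo-4,4-dichlorophenol"
best = hyphen_break_points(name)[0]
print(name[:best])
print(name[best:])
```

For the chemical name discussed in this thread the preferred break comes out after "2-bromo-", while the break after the leading "2-" is rejected as too short.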

Masayuki Nakano

Aug 9, 2007, 11:41:28 AM

to Boris Zbarsky

I haven't adopted this idea in the latest patch. But the latest patch fixes
many problems; e.g., it passes all the testcases in:
https://bugzilla.mozilla.org/attachment.cgi?id=275972
# The chemical context will be broken.

I hope that the latest patch can land on trunk for testing by nightly
build users, while I continue to work on a new approach to this.

lis...@uta.fi

Aug 9, 2007, 2:27:18 PM

to

On 9 elo, 18:41, Masayuki Nakano <masay...@d-toybox.com> wrote:

> https://bugzilla.mozilla.org/attachment.cgi?id=275972

I think there is an error in the "Should break inside" list where the
second item says that the combination SPACE+NBSP is breakable.
According to UAX 14, a non-breaking character prohibits breaks both
before and after:
http://www.unicode.org/reports/tr14/tr14-20.html#Table1

Thus, in practice, SPACE+NBSP is equivalent to NBSP+NBSP (since SPACE
offers a break opportunity only after, and that opportunity is then
overruled by NBSP).

Another small remark: my understanding is that in PHP code there may
occur strings of the following type:

$_GET{'x'}

$_GET['x']

(These should basically mean the same thing, but curly brackets are
recommended over square brackets.)

I know next to nothing about PHP, but I'd assume breaking these
strings would be as unwanted as breaking "Init()" or "sin(x)". Perhaps
these cases too should be added to the "Don't break" list?

--
Simo Kaupinmäki

Masayuki Nakano

Aug 10, 2007, 1:15:03 AM

to lis...@uta.fi

No, See:

> GL: Non-breaking (“Glue”) (XB/XA) (Non-tailorable)

> Non-breaking characters prohibit breaks on either side, but that prohibition
> can be overridden by SP or ZW. In particular, when NBSP follows SPACE,
> there is a break opportunity after the SPACE and NBSP will go as visible
> space onto the next line. See also WJ. The following lists the characters
> of line break class GL with additional description.

We should break here.
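
The pair behaviour being debated here can be written down as a tiny decision function. It is a sketch that covers only the characters discussed in this thread (SPACE, NO-BREAK SPACE and WORD JOINER), follows the passages of UAX 14 quoted and cited in these messages, and is not a line breaker; `break_allowed` is a hypothetical name.

```python
SP = " "         # SPACE
NBSP = "\u00a0"  # NO-BREAK SPACE (class GL)
WJ = "\u2060"    # WORD JOINER


def break_allowed(before, after):
    """May a line break occur between the two adjacent characters?"""
    if WJ in (before, after):
        return False  # the word joiner takes precedence over the space
    if after == SP:
        return False  # never break in front of a space
    if before == SP:
        return True   # break opportunity after a space, even before NBSP
    if NBSP in (before, after):
        return False  # glue prohibits breaks on either side
    return False      # other pairs are not modelled in this toy


for pair in ((SP, NBSP), (NBSP, NBSP), ("a", NBSP), (SP, WJ)):
    print([hex(ord(c)) for c in pair], break_allowed(*pair))
```

Only the SPACE followed by NBSP pair reports a break opportunity, which is the exception quoted above.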

fantasai

Aug 13, 2007, 6:18:48 PM

to Masayuki Nakano

Masayuki Nakano wrote:
>
> I don't take this idea in latest patch. But the latest patch fixes many
> problems, e.g., the all testcases clears in:
> https://bugzilla.mozilla.org/attachment.cgi?id=275972
> # The chemical context will be broken.
>
> I hope that the latest patch is landed to trunk for testing by nightly
> build users, and I continue to work for new approach of this.

bug #?

~fantasai

Masayuki Nakano

Aug 16, 2007, 12:56:30 PM

to fantasai

I will file the new bug if the patch lands.
The current bug is:
https://bugzilla.mozilla.org/show_bug.cgi?id=389056

lis...@uta.fi

Aug 25, 2007, 2:41:24 PM

to

> GL: Non-breaking ("Glue") (XB/XA) (Non-tailorable)
> Non-breaking characters prohibit breaks on either side, but that
prohibition
> can be overridden by SP or ZW. In particular, when NBSP follows
SPACE,
> there is a break opportunity after the SPACE and NBSP will go as
visible
> space onto the next line. See also WJ. The following lists the
characters
> of line break class GL with additional description.

Oh, that's right. I had forgotten about this exception stated in UAX
14. I guess it makes sense, in a way, to have different character
combinations to behave differently. On the other hand, the exception
threatens to dilute the basic idea of non-breaking characters and make
things more complicated without offering any reasoning. Even
OpenOffice Writer (which usually tends to break lines rather sensibly)
appears confused by the inconsistency -- to the extent that it allows
breaking between a regular space and a _word_joiner_ (U+2060) although
UAX 14 specifically states that the word joiner takes precedence over
the space.*
*See: http://www.unicode.org/reports/tr14/tr14-20.html#WJ

I can't see any harm (apart from the confusing factor) caused by the
SPACE+NBSP exception in itself, but it would be interesting to hear
whether there was any real-life case where you actually wanted a
double space to be breakable. Remarkably, even the quote above doesn't
actually say that the prohibition _must_ (nor that it should) be
overridden but only that it _can_ be. So UAX 14 seems to leave the
final decision on this matter to implementors.

Another, seemingly more important case has been added to the proposed
update on UAX 14, considering a non-breaking _hyphen_ that follows a
space. This time there is even some reasoning that refers to words
with a hyphen as the first character (the special case described,
albeit on a very abstract level, is from Finnish orthography but
comparable cases -- such as "suffix -ed" -- may sometimes occur even
in English, as well as in other languages). In such a situation, UAX
14 recommends that authors insert a non-breaking hyphen instead of a
regular hyphen, and consequently a line break should be allowed
between the hyphen and the preceding space.*
*See: http://www.unicode.org/reports/tr14/tr14-20.html#Hyphen

Actually, as I have already suggested in this thread, it is more
logical to use a regular hyphen in this kind of a situation, since it
is apparent that the preferred break point is after the space and not
after the hyphen. It would be absurd to leave a word-starting hyphen
orphaned at the end of a line. (This is an example where OpenOffice
gets it right, while many other applications -- such as IE, Opera and
Word -- fail miserably; Word even tends to auto-replace the hyphen
with an en-dash, which is totally unacceptable in many -- if not most
-- cases.)

Another new exception added to UAX 14 is the broken double hyphen that
may occur in Polish and Portuguese orthographies. In this case, two
hyphens are shown only if there is a line break in between; normally
there is just a single hyphen visible. For example, the Polish
compound word "czerwono-niebieska" should be broken in the middle so
that there's a hyphen both at the end of the first line and at the
beginning of the second line:

czerwono-
-niebieska

In order to produce this effect, UAX 14 recommends that authors use
the combination of a soft hyphen and a non-breaking hyphen. Again it
might be considered more logical to use a regular hyphen instead of a
non-breaking hyphen, since a soft hyphen was supposed to show a
preferred break point anyway* -- but of course this would be a little
problematic if some browsers didn't recognize the preferred break
point and broke the word after the regular hyphen instead.
*See: http://www.unicode.org/reports/tr14/tr14-20.html#SoftHyphen

Thus, a line break should be allowed even between a soft hyphen and a
non-breaking hyphen.
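
A small illustration of the soft hyphen plus non-breaking hyphen combination, assuming a renderer that shows a hyphen for U+00AD only when it actually breaks there. The `render` helper is invented for this sketch and simply prints ASCII hyphens for display.

```python
SHY = "\u00ad"   # SOFT HYPHEN
NBHY = "\u2011"  # NON-BREAKING HYPHEN


def render(text, break_at_shy):
    """Return the displayed lines, with or without a break at the soft hyphen."""
    if break_at_shy and SHY in text:
        head, tail = text.split(SHY, 1)
        lines = [head + "-", tail]  # a used soft hyphen becomes visible
    else:
        lines = [text]
    # An unused soft hyphen is invisible; the non-breaking hyphen always shows.
    return [line.replace(SHY, "").replace(NBHY, "-") for line in lines]


word = "czerwono" + SHY + NBHY + "niebieska"
print(render(word, break_at_shy=False))
print(render(word, break_at_shy=True))
```

Unbroken, the word shows a single hyphen; broken at the soft hyphen, it shows one hyphen at the end of the first line and one at the start of the second.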

Nevertheless, generally I'd expect a non-breaking character to
prohibit breaks both before and after. For example, I'd certainly not
expect a line break between a hyphen-minus and a no-break space (this
is another example where OpenOffice gets it right). If the breaking
characters did by default take precedence over the non-breaking
characters, it would be pointless for UAX 14 to state that the non-
breaking characters prohibited breaks before as well as after.

It is interesting that, in addition to the ASCII hyphen-minus (U
+002D), Unicode specifies even a "regular" hyphen (U+2010). It is
rarely used in real life (since it is clumsy to produce with a typical
computer keyboard and downright harmful if the data stray into an
application that doesn't recognize it), and an extra hyphen character
seems redundant unless it is treated differently from the traditional
hyphen-minus. According to UAX 14, the main difference seems to be
that the hyphen-minus requires additional context analysis in order to
be able to distinguish its two usages as a hyphen and as a minus.

As the "regular" hyphen is actually quite a marginal character
nowadays, it may be considered tempting to allow it to break always,
irrespective of the context (like IE, Opera and Word seem to treat
even the hyphen-minus). This would offer authors a way to ignore some
of the traditional principles of Western typography if it was
considered beneficial in a specific case. However, this could cause
troubles if one day the "regular" hyphen did become the default hyphen
character used in computer applications. As the preferred hyphen
character for the future, its default breaking behavior should rather
be as optimal as possible. We have already seen more than enough
negligent solutions in digitalized typography.

--
Simo Kaupinmäki

lis...@uta.fi

Aug 25, 2007, 3:37:51 PM

to

On 25 elo, 21:41, lis...@uta.fi wrote:

> It is interesting that, in addition to the ASCII hyphen-minus (U
> +002D), Unicode specifies even a "regular" hyphen (U+2010). It is

Now, here is an interesting bug. I wrote my previous message with a
text editor and copy-pasted it into the Google Groups HTML form field
with Iceweasel 2.0.0.6 browser.* The text looked fine before I sent it
but now, for some reason, the string "U+002D" is broken before the
plus character in a way that really shouldn't happen. Apparently,
something went wrong when the lines were refitted into newsgroups size
(even the UAX 14 quote in the beginning of my message became mangled a
little).

I haven't noticed this kind of undesirable breaks in normal browsing
situations. Perhaps it's the Google Groups system that is to blame?

*(For those that haven't heard about it: Iceweasel is a rebranded
version of Firefox, distributed by Debian GNU/Linux.)

--
Simo Kaupinmäki

Masayuki Nakano

Sep 2, 2007, 3:09:11 PM

to

I posted the new approach patch to bug 389056.

The general breaking rules are unchanged from the previous patch. But in
Western-language context the patch does not break 'near' the start of a
word, the end of a word, or the previous break point.

'near' is currently defined as 6 characters. Therefore, a Western word is
never broken within its first 6 or last 6 characters; i.e., a word that gets
broken has at least 11 characters.

This way, no short word fragment (less than 6 characters) is created.
This helps readability in some cases, and I believe most Western-language
sentences will not break unnaturally.
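
One way to picture the 'near' filter is the sketch below. It assumes that every resulting piece must have at least `near` characters, which is one reading of the description above; the exact counting in the patch may differ by one, and `filter_near` is a name made up for this illustration.

```python
def filter_near(word, candidates, near=6):
    """Drop break positions that would leave a piece shorter than near.

    candidates are positions p meaning "break between word[p - 1] and word[p]".
    """
    kept, prev = [], 0
    for p in sorted(candidates):
        far_from_prev = p - prev >= near      # start of word or previous break
        far_from_end = len(word) - p >= near  # end of word
        if far_from_prev and far_from_end:
            kept.append(p)
            prev = p
    return kept


print(filter_near("foo/bar/foo2/bar2/", [7, 12]))
print(filter_near("left-wing", [5]))
print(filter_near("left-wing", [5], near=4))
```

With the default of 6 the path keeps only its first break and "left-wing" keeps none; lowering the value to 4 lets "left-wing" break after the hyphen.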

Boris Zbarsky

Sep 2, 2007, 3:33:04 PM

to

Masayuki Nakano wrote:
> 'near' is defined as 6 characters now. Therefore, in Western word, never
> breaking at first 6 characters and last 6 characters. i.e., if a word is
> broken, it has 11 characters at least.

Why 6? I actually think 4 would be a lot more reasonable...

-Boris

Masayuki Nakano

Sep 2, 2007, 4:46:35 PM

to Boris Zbarsky

We can remove the date format checking. If the value is less than 5, "/2007"
and "2007-" become breakable before/after them.

Boris Zbarsky

Sep 2, 2007, 4:50:34 PM

to

Masayuki Nakano wrote:
> We can remove date format checking. If it is less than 5, "/2007" and
> "2007-" is breakable at before/after them.

But I'd expect both "right-wing" and "left-wing" to be breakable.

Perhaps we should look not only at the number of chars but also at what the
chars are or something?

-Boris

Masayuki Nakano

Sep 2, 2007, 5:01:32 PM

to

Roc and I discussed. He hoped that if I need complicated patch for
fixing it, the ASCII string breaker should be backed out. But the
URL/File path breaker is important for Japanese
Marketing/Users/Designers. So, I cannot accept it.

We need simple way for 1.9. And we should implement prioritized
line-breaking in Mozilla2 (or later). Until then, we should take simple
and low risk way.

And I think that your examples doesn't grow up the table cell width in
actual web pages (and also doesn't overflow from fixed width box).
So, it is never broken by prioritized line-breaking even if we implement it.

Masayuki Nakano

Sep 2, 2007, 5:08:21 PM

to

Masayuki Nakano wrote:
> Boris Zbarsky wrote:
>> Masayuki Nakano wrote:
>>> We can remove date format checking. If it is less than 5, "/2007" and
>>> "2007-" is breakable at before/after them.
>>
>> But I'd expect both "right-wing" and "left-wing" to be breakable.
>>
>> Perhaps we should look not only at the number of chars but also at
>> what the chars are or something?
>
> Roc and I discussed. He hoped that if I need complicated patch for
> fixing it, the ASCII string breaker should be backed out. But the
> URL/File path breaker is important for Japanese
> Marketing/Users/Designers. So, I cannot accept it.
>
> We need simple way for 1.9. And we should implement prioritized
> line-breaking in Mozilla2 (or later). Until then, we should take simple
> and low risk way.
>
> And I think that your examples doesn't grow up the table cell width in
> actual web pages (and also doesn't overflow from fixed width box).
> So, it is never broken by prioritized line-breaking even if we implement
> it.
>

er, of course, if there is some simple, low-risk way to handle them, that is
fine. (Or if 4 is a good number for the hyphen when both sides are
alphabetic, I can change it to that.)

Boris Zbarsky

Sep 2, 2007, 7:32:50 PM

to

Masayuki Nakano wrote:
> And I think that your examples doesn't grow up the table cell width in
> actual web pages (and also doesn't overflow from fixed width box).

I'm not sure I follow. Is that a statement about typical table cell widths on
typical web pages, or a statement about its fundamental layout properties?

-Boris

Masayuki Nakano

Sep 3, 2007, 10:05:38 AM

to

Yes.

Masayuki Nakano

Sep 3, 2007, 10:55:16 AM

to

Masayuki Nakano wrote:
> Boris Zbarsky wrote:
>> Masayuki Nakano wrote:
>>> And I think that your examples doesn't grow up the table cell width
>>> in actual web pages (and also doesn't overflow from fixed width box).
>>
>> I'm not sure I follow. Is that a statement about typical table cell
>> widths on typical web pages, or a statement about its fundamental
>> layout properties?
>
> Yes.

er, I meant that "a statement about typical table cell widths on typical
web pages".

Boris Zbarsky

Sep 3, 2007, 12:20:56 PM

to

Masayuki Nakano wrote:
> er, I meant that "a statement about typical table cell widths on typical
> web pages".

I guess we're only talking about intra-word breaking, right? So we can
still break on whitespace at any time?

In that case, you might be right. Let's give it a shot.

-Boris

rocal...@gmail.com

Sep 3, 2007, 6:28:52 PM

to

On Sep 3, 8:50 am, Boris Zbarsky <bzbar...@mit.edu> wrote:
> Masayuki Nakano wrote:
> > We can remove date format checking. If it is less than 5, "/2007" and
> > "2007-" is breakable at before/after them.
>
> But I'd expect both "right-wing" and "left-wing" to be breakable.

Yes, but if we don't allow them to break, the impact is small because
they're quite short anyway. So I think it's good to be conservative
about this.

Rob

rocal...@gmail.com

Sep 19, 2007, 6:26:25 PM

to

Masayuki-san landed an update to the line breaking code last night.
The main impact is that we don't allow a break opportunity "too
close" (currently, less than 6 characters away) from another break
opportunity unless it's at a whitespace boundary. This should make
things a lot better.

Rob

Masayuki Nakano

Oct 8, 2007, 8:17:16 AM

to

I updated http://wiki.mozilla.org/Gecko:Line_Breaking

The document explains the rules of current trunk simply.
# Should be moved to MDC?

Thanks.

Masayuki Nakano

Oct 8, 2007, 2:59:38 PM

to Peter Weilbacher

Peter Weilbacher wrote:

> On Mon, 8 Oct 2007 12:17:16 UTC, Masayuki Nakano wrote:
>
>> I updated http://wiki.mozilla.org/Gecko:Line_Breaking
>>
>> The document explains the rules of current trunk simply.
>

> Is there a way that I as a user can switch off file path/URI breaking? I
> find that very annoying. Is it planned to implement that or should I
> file an RFE?

If you are the author, you can use |<span style="white-space:
nowrap;">/foo/bar</span>|. Users cannot control it.

Masayuki Nakano

Oct 8, 2007, 8:54:05 PM

to Peter Weilbacher

Peter Weilbacher wrote:
> On Mon, 8 Oct 2007 18:59:38 UTC, Masayuki Nakano wrote:

>
>> Peter Weilbacher wrote:
>>> Is there a way that I as a user can switch off file path/URI breaking? I
>>> find that very annoying. Is it planned to implement that or should I
>>> file an RFE?
>> If you are author, you can use |<span style="white-space:
>> nowrap;">/foo/bar</span>|. The users cannot control it.
>

> No, I'm not the author. I really like the new wrapping capabilities in
> normal text. But it drives me nuts when I paste URLs into the Bugzilla
> comment field that they are wrapped. I want to turn that off. (When
> looking at the comment after submission they are not wrapped any more,
> that makes it even more confusing.) Am I the only one annoyed by that?
> But Bugzilla is not the only place where that happens, although I have
> problems of coming up with something else right now...

hmm... I'm not sure it is a problem. I think there are two patterns when
you paste/write URIs.

1. The URI stands alone.

I.e., the URI is the only member of a paragraph. In Western context, we use
an empty line as the paragraph separator. So the wrapped URI doesn't cause
any confusion.

2. The URI is a word.

I.e., the URI is an inline word of a paragraph. In the old layout, the URI
stayed on a single line. Now the URI might be split across two or more lines.
But there are still word separators around the URI, so the URI is always an
independent word. (And other browsers have the same behavior, but I don't
know of objections to it.)

I cannot understand what the problem is. Note that Japanese text is
breakable at most points, so broken words are a natural thing for me...

Boris Zbarsky

Oct 8, 2007, 10:07:58 PM

to

Peter Weilbacher wrote:
> But it drives me nuts when I paste URLs into the Bugzilla
> comment field that they are wrapped. I want to turn that off. (When
> looking at the comment after submission they are not wrapped any more,
> that makes it even more confusing.) Am I the only one annoyed by that?

It's a little confusing, because I always think I failed to paste the full
URI... But maybe it just takes getting used to.

-Boris

Peter Weilbacher

Oct 9, 2007, 10:47:01 AM

to

On 10/09/07 02:54, Masayuki Nakano wrote:

> hmm... I'm not sure it is problem. I think there are two patterns when
> you paste/write the URIs.
>
> 1. URI is a independent.
>
> I.e., URI is an only member of a paragraph. In western context, we use
> empty line for paragraph separator. So, the wrapped URI doesn't make any
> confusing.
>
> 2. URI is a word.
>
> I.e., URI is an inline word of paragraph. In old layout, the URI was a
> line. Now, the URI might be separated two or more lines. But there are
> really word separators around URI. So, the URI is always an independent
> word. (And other browsers have same behavior, but I don't know the
> objections for it.)
>
> I cannot understand what is problem. Note that Japanese text is
> breakable in most points, so the broken words are natural things for me...

The difference between words (in western languages) and URLs is that if
you break a word you have a continuation character (the '-') that signifies
that this word continues on the next line. This special character is very
visible because it stands out in normal text where most of the characters
are alphabetic. For URLs you don't have that, at least not a unique one.
And because URLs (and file paths) can contain all kinds of special
characters and numbers in addition to alphabetic characters the eye is
already distracted enough and doesn't easily notice line breaks.

The worst examples are the long URLs that some people using Apple Mail send
to the .planning newsgroup[1]. Granted, Apple Mail wrapping is much worse, you
have done much better work. But let's say you want to send something about
fixed bugs https://bugzilla.mozilla.org/buglist.cgi?keywords=fixed1.8.1.8&
order=bugs.bug_id and you find some rubbish at the beginning of the line
where it's difficult to tell at first glance where that comes from. It gets even
worse when someone now quotes this part[2]. If the URL would not be broken,
perhaps like this https://bugzilla.mozilla.org/buglist.cgi?keywords=fixed1.8.1.8&order=bugs.bug_id
it would be clear immediately where it stopped, even though it might not be
as beautiful because something extends outside the width of the normal text.

Note that I am not arguing to switch the standard behavior. As Boris says
most people may get used to it eventually. But for myself and the people who
don't, I would like to be able to decide not to break URLs at all (perhaps
by some hidden pref). If I would get an indication that this would be taken
up I could try to make a patch for that.

Peter.

[1] e.g. Message-ID: <mailman.395.11869872...@lists.mozilla.org>
or <mailman.5674.1185818...@lists.mozilla.org>
[2] Message-ID: <mailman.2051.1188654...@lists.mozilla.org>
or <mailman.365.11900455...@lists.mozilla.org>

Jonas Sicking

Oct 9, 2007, 5:21:01 PM

to

I've never liked urls wrapping over several lines myself, and I think
it's even harder for people that don't know the url syntax to understand
what is going on.

Would it be at all possible to do different line wrapping for urls in
japanese (and similar) text, and in western languages?

/ Jonas

Masayuki Nakano

Oct 10, 2007, 9:57:19 AM

to Jonas Sicking, r...@ocallahan.org

Jonas Sicking wrote:
> Would it be at all possible to do different line wrapping for urls in
> japanese (and similar) text, and in western languages?

fm... It's interesting. But it seems that the hidden prefs are much
better. And should it be localizable pref? (We may be able to use lang
attribute, but it may confuse to the users.)

Roc:

How about you?

Masayuki Nakano

Oct 10, 2007, 11:42:14 AM

to

I filed bug 399321 for switchable URL breaker.
https://bugzilla.mozilla.org/show_bug.cgi?id=399321

Jonas Sicking

Oct 10, 2007, 5:34:33 PM

to

Masayuki Nakano wrote:
> Jonas Sicking wrote:
>> Would it be at all possible to do different line wrapping for urls in
>> japanese (and similar) text, and in western languages?
>
> fm... It's interesting. But it seems that the hidden prefs are much
> better. And should it be a localizable pref? (We may be able to use the lang
> attribute, but it may confuse the users.)

I don't think a pref of any sort is going to help very many users. The
majority of our user base are non-techy people that don't mess around
with prefs, much less hidden ones.

Using the lang-attribute won't really solve anything either as very
little content uses that.

/ Jonas

rocal...@gmail.com

Oct 10, 2007, 6:41:56 PM

On Oct 11, 2:57 am, Masayuki Nakano <masay...@d-toybox.com> wrote:
> Jonas Sicking wrote:
> > Would it be at all possible to do different line wrapping for urls in
> > japanese (and similar) text, and in western languages?
>
> fm... It's interesting. But it seems that the hidden prefs are much
> better. And should it be a localizable pref? (We may be able to use the lang
> attribute, but it may confuse the users.)

I don't think switchable behaviour is a good idea.

I think breaking before the punctuation would help.

I don't think this is a huge issue either way, so I don't think we
should do anything difficult or risky to solve it.
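
To make "breaking before the punctuation" concrete, here is a minimal
Python sketch. It is only an illustration, not Gecko's line breaker: the
names break_positions and wrap_url and the BREAK_BEFORE character set are
assumptions made up for this example.

```python
from typing import List

# Assumed set of URL punctuation; NOT taken from Gecko's break tables.
BREAK_BEFORE = set("/?&=.-_#")


def break_positions(url: str) -> List[int]:
    """Return indexes i where a line may break between url[i-1] and url[i].

    A break is allowed only *before* a punctuation character, so the
    punctuation starts the continuation line.
    """
    scheme_end = url.find("://")
    # Never break inside the "scheme://" prefix.
    start = scheme_end + 3 if scheme_end != -1 else 0
    positions = []
    for i in range(start + 1, len(url)):
        # For a run of punctuation, allow a break only before its first char.
        if url[i] in BREAK_BEFORE and url[i - 1] not in BREAK_BEFORE:
            positions.append(i)
    return positions


def wrap_url(url: str, width: int) -> List[str]:
    """Greedily split url into pieces of at most width characters."""
    breaks = break_positions(url)
    lines = []
    line_start = 0
    while len(url) - line_start > width:
        candidates = [p for p in breaks if line_start < p <= line_start + width]
        if not candidates:
            break  # no allowed break fits; let the remainder overflow
        cut = candidates[-1]
        lines.append(url[line_start:cut])
        line_start = cut
    lines.append(url[line_start:])
    return lines


if __name__ == "__main__":
    example = ("https://bugzilla.mozilla.org/buglist.cgi"
               "?keywords=fixed1.8.1.8&order=bugs.bug_id")
    for piece in wrap_url(example, 30):
        print(piece)
```

With this rule every continuation line starts with a punctuation
character, which gives the reader a visual hint that the line continues
the URL above it.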

Rob

David E. Ross

Oct 12, 2007, 11:12:44 AM

On 10/10/2007 6:57 AM, Masayuki Nakano wrote:
> Jonas Sicking wrote:
>> Would it be at all possible to do different line wrapping for urls in
>> japanese (and similar) text, and in western languages?
>
> fm... It's interesting. But it seems that the hidden prefs are much
> better. And should it be a localizable pref? (We may be able to use the lang
> attribute, but it may confuse the users.)
>
> Roc:
>
> How about you?
>

Why can't this be resolved by users adhering to Appendix C of RFC 3986
(as I do in my signature below and in the following URI)? See
<ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt>. Do a case-sensitive
search on "Delimiting".
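
As a rough illustration of why that delimiting helps a reader or a
program, here is a small Python sketch that recovers angle-bracketed
URIs from wrapped text. Appendix C says whitespace added to break a long
URI across lines should be ignored when the URI is extracted. The
function name and the regular expression are assumptions for this
example, not part of the RFC.

```python
import re

# "<" + optional "URL:" prefix + scheme ":" + anything up to ">".
# Requiring a scheme keeps things like <Message-ID@host> from matching.
ANGLE_URI = re.compile(r"<(?:URL:)?\s*([A-Za-z][A-Za-z0-9+.-]*:[^<>]*)>")


def extract_delimited_uris(text):
    """Return the URIs found between angle brackets in text.

    Whitespace inside the brackets (including line breaks) is dropped,
    following the advice in RFC 3986 Appendix C.
    """
    return [re.sub(r"\s+", "", m.group(1)) for m in ANGLE_URI.finditer(text)]


if __name__ == "__main__":
    wrapped = (
        "See <https://bugzilla.mozilla.org/buglist.cgi\n"
        "?keywords=fixed1.8.1.8&order=bugs.bug_id> for the list.\n"
    )
    print(extract_delimited_uris(wrapped))
```

This sketch does not handle a wrapped URI whose continuation lines carry
quote markers such as "> "; those would need to be stripped first.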

--
David E. Ross
<http://www.rossde.com/>

Natural foods can be harmful: Look at all the
people who die of natural causes.

Boris Zbarsky

Oct 12, 2007, 12:03:05 PM

David E. Ross wrote:
> Why can't this be resolved by users adhering to Appendix C of RFC 3986

Because in practice they don't.

-Boris