CJK line breaking

Robert O'Callahan

Jul 5, 2006, 7:26:45 PM

While working on some inline layout stuff, I've run into
nsJISx4051LineBreaker (which we use for all line breaking, actually).

Apparently it's intended to work like this: if a word (delimited by
whitespace) contains at least one CJK character, then we apply JISX4051
rules to break within that word, otherwise we just use the whole
whitespace-delimited word. JISX4051 changes behaviour even for non-CJK
text; for example, it allows breaking after commas in Latin text. So given
aaaaaa,bbbbbbbbbbbbb,ccccccccccccccccccc,dddddddddddddddd,<CJK>
we'll allow breaking after all those commas.
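
A rough sketch of that rule, in Python rather than the real C++ and with everything simplified: `is_cjk` is a crude range check, and the only "JISX4051-style" rules modelled are "break after a comma" and "break next to a CJK character". Both names and rules are assumptions for illustration, not what nsJISx4051LineBreaker actually implements.

```python
def is_cjk(ch: str) -> bool:
    """Crude CJK test (an assumption for this sketch, not a real class table)."""
    cp = ord(ch)
    return (0x3000 <= cp <= 0x30FF      # CJK punctuation, Hiragana, Katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
            or 0xFF00 <= cp <= 0xFFEF)  # fullwidth forms


def break_offsets(text: str) -> list[int]:
    """Return the offsets at which a line break would be allowed."""
    breaks = []
    pos = 0
    for word in text.split(" "):
        end = pos + len(word)
        if any(is_cjk(c) for c in word):
            # A single CJK character anywhere switches the whole
            # whitespace-delimited word over to the JIS-style rules.
            for i in range(pos + 1, end):
                prev, cur = text[i - 1], text[i]
                if prev == "," or is_cjk(prev) or is_cjk(cur):
                    breaks.append(i)
        if end + 1 < len(text):
            breaks.append(end + 1)  # ordinary break after the space
        pos = end + 1
    return breaks
```

With this model, `break_offsets("aaa,bbb,ccc")` is empty, but appending one CJK character to the same word makes every comma a break opportunity.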

This is nasty and actually has many bugs on trunk. For example, it means
that removing the CJK text from the end of the run --- which could be
several lines after the start of the run --- requires us to reflow the
entire run. This messes with incremental line reflow because normally
content much later in a paragraph can't affect the layout of previous
lines. Unfortunately there is no way around this in general; in
particular, Thai word breaking apparently requires dictionary-based
analysis of the entire paragraph, so a really good algorithm will adjust
breaks globally based on the contents of multiple lines.

Some of the trunk bugs are due to this multiline reflow issue. Other
trunk bugs are due to the fact that the linebreaker's CanBreakBetween
scans in both directions looking for CJK characters to trigger CJK
rules, but Next only scans forwards and Prev only scans backwards. So
CJK rules may or may not be triggered for a given chunk of text
depending on which API call you use.
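
To make the inconsistency concrete, here is a toy model of it. The function names mirror the API calls mentioned above, but the bodies are hypothetical: the "JIS rule" is reduced to "break after a comma or next to a CJK character", and the only thing being illustrated is the difference in scan direction.

```python
def is_cjk(ch):
    """Crude CJK test (an assumption for this sketch)."""
    cp = ord(ch)
    return (0x3000 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF
            or 0xFF00 <= cp <= 0xFFEF)


def cjk_ahead(text, pos):
    """True if a CJK char occurs at or after pos, before the next space."""
    for ch in text[pos:]:
        if ch == " ":
            return False
        if is_cjk(ch):
            return True
    return False


def cjk_behind(text, pos):
    """True if a CJK char occurs before pos, after the previous space."""
    for ch in reversed(text[:pos]):
        if ch == " ":
            return False
        if is_cjk(ch):
            return True
    return False


def jis_allows(text, pos):
    """Stand-in for the JIS rules: after a comma, or next to a CJK char."""
    prev, cur = text[pos - 1], text[pos]
    return prev == "," or is_cjk(prev) or is_cjk(cur)


def can_break_between(text, pos):
    """Scans in both directions to decide whether CJK rules apply."""
    if text[pos - 1] == " ":
        return text[pos] != " "
    if text[pos] == " ":
        return False
    use_jis = cjk_behind(text, pos) or cjk_ahead(text, pos)
    return use_jis and jis_allows(text, pos)


def next_break(text, pos):
    """Scans forwards only, so CJK text behind pos is never noticed."""
    for i in range(pos + 1, len(text)):
        if text[i - 1] == " ":
            if text[i] != " ":
                return i
            continue
        if text[i] == " ":
            continue
        if cjk_ahead(text, i) and jis_allows(text, i):
            return i
    return len(text)
```

For `text = "日,aaa,bbb"`, `can_break_between(text, 6)` says a break after the second comma is fine, while `next_break(text, 2)` skips straight past it and returns 9, the end of the text.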

I'm tempted to use the simplifying assumption that breaking between two
non-CJK chars should use non-CJK rules, but I'm not sure of all the
consequences of that. It basically means we'll not break in places where
maybe we should. For example given
<CJK><CJK>,300<CJK><CJK>
we really want to be able to break after the comma. Worse, sequences of
(<CJK>)(<CJK>)(<CJK>)(<CJK>)(<CJK>)(<CJK>)(<CJK>)
won't break anywhere. Even if that assumption is tenable, because of the
Thai issue we'll eventually have to do some nasty stuff, if not now.
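
A sketch of what the simplifying assumption does to those two examples. The CJK rules are reduced to two hypothetical ones (no break after an opening bracket, no break before a closing bracket or comma); the bracket sets and `is_cjk` are assumptions made up for this illustration.

```python
OPENING = "(["    # hypothetical: never break after these
CLOSING = ")],."  # hypothetical: never break before these


def is_cjk(ch):
    """Crude CJK test (an assumption for this sketch)."""
    cp = ord(ch)
    return (0x3000 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF
            or 0xFF00 <= cp <= 0xFFEF)


def can_break_local(prev, cur):
    """Decide a break using only the two characters around it."""
    if prev == " ":
        return cur != " "
    if cur == " ":
        return False
    if not (is_cjk(prev) or is_cjk(cur)):
        # The simplifying assumption: two non-CJK characters always
        # get the non-CJK (whitespace-only) rule.
        return False
    # At least one side is CJK: apply the (heavily simplified) CJK rules.
    return prev not in OPENING and cur not in CLOSING


def break_offsets(text):
    """All offsets where a break is allowed under the assumption."""
    return [i for i in range(1, len(text))
            if can_break_local(text[i - 1], text[i])]
```

Here `break_offsets("日本,300日本")` gives `[1, 6, 7]`, so there is no break after the comma at offset 3, and `break_offsets("(日)(本)(語)")` is empty.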

For now I'm going to go with it, because it won't cause breaks in bad
places and the alternative is to do a lot of work that I really hadn't
planned on (fixing linebreaker, fixing block reflow, and adding whatever
optimizations are necessary to make things not suck), but I'd appreciate
feedback on this.

Rob

L. David Baron

Jul 5, 2006, 7:29:20 PM

to dev-tec...@lists.mozilla.org

On Thursday 2006-07-06 11:26 +1200, Robert O'Callahan wrote:
> While working on some inline layout stuff, I've run into
> nsJISx4051LineBreaker (which we use for all line breaking, actually).

We do? nsTextTransformer only seems to invoke it in the *Unicode*
functions, and https://bugzilla.mozilla.org/show_bug.cgi?id=255990 says
we don't.

-David

--
L. David Baron <URL: http://dbaron.org/ >
Technical Lead, Layout & CSS, Mozilla Corporation

Robert O'Callahan

Jul 5, 2006, 11:46:46 PM

L. David Baron wrote:
> On Thursday 2006-07-06 11:26 +1200, Robert O'Callahan wrote:
>> While working on some inline layout stuff, I've run into
>> nsJISx4051LineBreaker (which we use for all line breaking, actually).
>
> We do? nsTextTransformer only seems to invoke it in the *Unicode*
> functions, and https://bugzilla.mozilla.org/show_bug.cgi?id=255990 says
> we don't.

Er right, I meant for all line breaking of Unicode text.

Rob

Robert O'Callahan

Jul 5, 2006, 11:48:13 PM

... because when I first encountered nsJISx4051LineBreaker, I assumed it
was used only for Japanese or CJK in general, selected by language code
or something, and was a bit surprised to see that it's always used for
Unicode.

Rob

Robert O'Callahan

Jul 6, 2006, 12:12:18 AM

L. David Baron wrote:
> We do? nsTextTransformer only seems to invoke it in the *Unicode*
> functions, and https://bugzilla.mozilla.org/show_bug.cgi?id=255990 says
> we don't.

Actually, I should have thought about that bug earlier. There, people
are advocating *always* using the JISX4051 rules. That is quite scary
because it means we'd start breaking ASCII text with punctuation but no
spaces, e.g., we'd break after the comma in "hello,kitty" and after the
plus in "a+b". But maybe that isn't so bad... It would eliminate the
locality problem I just complained about, and remove the need for
multi-line analysis --- except for Thai. It would reduce the context
required to at most two characters either side of the line break (again,
except for Thai), so we'd hardly ever have to look at more than one text
frame either side. When we do, we can afford to do something simple and
slow, i.e. traversing the frame tree looking for the required text.
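
A sketch of what "always apply the rules, with bounded context" could look like. The rules here are hypothetical stand-ins: break after a comma or a plus, except that a comma between two digits (as in "1,000") is kept together. That digit exception is my own assumption, included only to show why a break decision might want to look two characters back.

```python
def can_break_at(text, i):
    """Decide a break at offset i (0 < i < len(text)).

    Looks at no more than two characters before the break and one after.
    """
    before = text[max(0, i - 2):i]  # up to two characters of left context
    prev, cur = before[-1], text[i]
    if prev == " ":
        return cur != " "
    if cur == " ":
        return False
    if prev in ",+":
        before_prev = before[-2] if len(before) == 2 else ""
        # Hypothetical exception: keep "1,000" together.
        if prev == "," and before_prev.isdigit() and cur.isdigit():
            return False
        return True
    return False


def break_offsets(text):
    """All offsets where a break is allowed, decided purely locally."""
    return [i for i in range(1, len(text)) if can_break_at(text, i)]
```

Under these rules `break_offsets("hello,kitty")` is `[6]` and `break_offsets("a+b")` is `[2]`, with no dependence on what comes later in the paragraph.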

Rob