nsTextFrame redesign and impact on i18n

Jean-Marc Desperrier

Feb 28, 2006, 7:16:02 AM

I just noticed ROC's blog entry about the nsTextFrame redesign he
envisions for Cairo based builds and the impact it would have on i18n.
I think it's a must-read for people interested in i18n in Mozilla:
http://weblogs.mozillazine.org/roc/archives/2006/02/post_1.html

It includes an important point about Pango performance:
"Currently Red Hat ships Firefox builds configured to use Pango
underneath the existing nsTextFrame. These builds have notoriously poor
performance."

There's also an interesting comment about Thai :
"I believe Thai line-breaking is tough, and we'll still be doing
line-breaking in our frame code, so that will still be our problem. I
believe Thai also requires special glyph positioning, shaping and/or
clustering, and that will be handled by the underlying platform (Pango,
Uniscribe, ATSUI)."

"We will only passing single-style chunks of text to ATSUI, so if you
use inside your paragraph, ATSUI will only see one piece of the
paragraph at a time. Also, we won't be using ATSUI for line breaking,
ATSUI will just be laying it out as a really long line that we'll then
chop up."

So it would be a two-step process: first pre-processing by gfxTextRun,
which will rely on OS-specific i18n handling, and then line breaking by
nsTextFrame.

Jean-Marc Desperrier

Mar 6, 2006, 7:57:19 AM

Jean-Marc Desperrier wrote:
> There's also an interesting comment about Thai :
> "I believe Thai line-breaking is tough, and we'll still be doing
> line-breaking in our frame code, so that will still be our problem. I
> believe Thai also requires special glyph positioning, shaping and/or
> clustering, and that will be handled by the underlying platform (Pango,
> Uniscribe, ATSUI)."
>
> "We will only passing single-style chunks of text to ATSUI, so if you
> use inside your paragraph, ATSUI will only see one piece of the
> paragraph at a time. Also, we won't be using ATSUI for line breaking,
> ATSUI will just be laying it out as a really long line that we'll then
> chop up."
>
> So it would be a two step process, first pre-processing by gfxTextRun
> that will rely on OS specific i18n handling, and then line breaking by
> nsTextFrame.

I just realized this means the line breaking has to be done on glyphs,
not characters.

This might bring some problems.

In fact I realized this when reading an apparently completely unrelated
discussion here :
http://www.ntg.nl/pipermail/aleph/2005-December/000424.html
"OpenType relies on a very clear (and indeed justified) notion of
separating characters and glyphs. The input text is a character string,
it is then converted into a parallel glyph string so that glyph
positioning and substitution are done entirely on glyphs, not on
characters.
[... Omega ...] Original character codes (Unicode or other) are
eventually replaced by glyph codes from which point there is no turning
back,[...].
Let's suppose you [did] contextual substitution and even positioning,
everything is fine. Now comes paragraph building and you need
hyphenation. How are you going to do it if you do not
have your original characters anymore? Font-specific hyphenation patterns?"

That discussion is TeX-related, so they aim to do dynamic hyphenation,
which is much harder than what Mozilla needs to do. Still, some of the
same problems might come up.

Samphan Raruenrom

Mar 6, 2006, 8:53:10 AM

Jean-Marc Desperrier wrote:
> Jean-Marc Desperrier wrote:
>> There's also an interesting comment about Thai :
>> "I believe Thai line-breaking is tough, and we'll still be doing
>> line-breaking in our frame code, so that will still be our problem. I
>> believe Thai also requires special glyph positioning, shaping and/or
>> clustering, and that will be handled by the underlying platform
>> (Pango, Uniscribe, ATSUI)."
>> "We will only passing single-style chunks of text to ATSUI, so if you
>> use inside your paragraph, ATSUI will only see one piece of the
>> paragraph at a time. Also, we won't be using ATSUI for line breaking,
>> ATSUI will just be laying it out as a really long line that we'll then
>> chop up."
>> So it would be a two step process, first pre-processing by gfxTextRun
>> that will rely on OS specific i18n handling, and then line breaking by
>> nsTextFrame.

So they will use the same API for line breaking (called from
nsTextFrame) i.e. nsILineBreaker?

> I just realized this means the line breaking has to be done on glyphs,
> not characters.
> This might bring some problems.

Does that mean that the strings that line breaker will see are glyph
indexes? That'll make a real problem.

--
_/|\_ Samphan Raruenrom. http://www.osdev.co.th/

Jean-Marc Desperrier

Mar 7, 2006, 11:22:23 AM

Samphan Raruenrom wrote:

> Jean-Marc Desperrier wrote:
>>> So it would be a two step process, first pre-processing by gfxTextRun
>>> that will rely on OS specific i18n handling, and then line breaking
>>> by nsTextFrame.
>
> So they will use the same API for line breaking (called from
> nsTextFrame) i.e. nsILineBreaker?
>
>> I just realized this means the line breaking has to be done on glyphs,
>> not characters.
>> This might bring some problems.
>
> Does that mean that the strings that line breaker will see are glyph
> indexes? That'll make a real problem.

I got it wrong. Roc replied to my remark about it:
"Line break positions will be selected using our existing code, which
works on characters (well, UTF16...)"

But then I don't understand how the API layer (ATSUI/Pango/Uniscribe)
can be used first. I don't see how it can do the work it has to do
(glyph positioning, shaping and/or clustering) and still output
characters instead of glyphs.

Jean-Marc Desperrier

Mar 7, 2006, 2:46:56 PM

Jean-Marc Desperrier wrote:
> But I don't understand then how the API (for ATSUI/Pango/Uniscribe)
> layer will be used first. I don't see how it will be able to do the work
> it has to do (glyph positioning, shaping and/or clustering) and output
> characters instead of glyphs.

I checked the Uniscribe documentation to see how it works there.

The following page describes the whole process :
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_9t2d.asp

The application first calls ScriptItemize to identify items.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_1dm3.asp
Each item holds only one script and one rendering direction.

It then calls ScriptLayout to reorder the item runs within the line if
needed (bidi).
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_1msl.asp

It then calls ScriptShape on each item, which identifies clusters and
generates glyphs.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_7yhx.asp
This gives out a glyph array from the input character array, as well as
a cluster array.
The cluster array gives the correspondence between characters and glyphs,
given that one character may correspond to several glyphs and one glyph
may correspond to several characters (in which case the cluster array
maps several characters to the same glyph).
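
To make the cluster array concrete, here is a minimal Python sketch. It is
not Uniscribe code: the function name and the sample data are made up for
illustration. It assumes a left-to-right run and a cluster array with one
entry per character, each entry holding the index of the first glyph of
that character's cluster, which is how I read the ScriptShape doc.

```python
def glyph_ranges(log_clust, glyph_count):
    """For each character, return the half-open (start, end) glyph range
    of the cluster that character belongs to.

    log_clust[i] is the index of the first glyph of the cluster containing
    character i. A left-to-right run is assumed, so values never decrease.
    """
    ranges = []
    n = len(log_clust)
    for i, first in enumerate(log_clust):
        # Skip forward to the next character that starts a different cluster.
        j = i + 1
        while j < n and log_clust[j] == first:
            j += 1
        # That cluster's first glyph is where this cluster's glyphs end.
        end = log_clust[j] if j < n else glyph_count
        ranges.append((first, end))
    return ranges


# Hypothetical run of 4 characters shaped into 4 glyphs:
#   chars 0 and 1 form a ligature (one glyph),
#   char 2 maps to one glyph,
#   char 3 is rendered with two glyphs.
print(glyph_ranges([0, 0, 1, 2], 4))
```

Characters 0 and 1 report the same glyph range, which is the "several
characters map to the same glyph" case, while character 3 covers two glyphs.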

It then calls ScriptPlace to measure the text from the glyphs.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_1msl.asp

If it needs to know where to break, it runs ScriptBreak on the items,
and therefore on characters, not glyphs:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_1dm3.asp
This returns character boundaries, as well as WordStop, SoftBreak, and
WhiteSpace info (and also invalid-character info).
The doc says character boundaries can also be deduced from the cluster
array (presumably a character boundary is wherever a Unicode code point
boundary is also a glyph boundary?).
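
If my reading is right, deducing those boundaries from the cluster array
could look like the sketch below. This is only my interpretation, not
something taken from the doc; the function name is hypothetical and the
cluster array is assumed to hold, per character, the index of the first
glyph of its cluster.

```python
def cluster_starts(log_clust):
    """Return one boolean per character: True where the character begins
    a new cluster, i.e. where its first-glyph index differs from the
    previous character's."""
    return [i == 0 or log_clust[i] != log_clust[i - 1]
            for i in range(len(log_clust))]


# Hypothetical data: characters 0 and 1 share a glyph (a ligature),
# so position 1 is not a usable boundary.
print(cluster_starts([0, 0, 1, 2]))
```

A caret or a line break would then only be allowed at positions flagged
True.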

Finally, ScriptTextOut is used to display each item-run in the order
given by ScriptLayout :
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_6304.asp

In conclusion, word breaking is indeed done on characters, but the
necessary spacing info is available only once the glyphs have been
calculated, and to make it work you need very detailed info on the
correspondence between glyphs and character clusters.
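
As a rough illustration of how the two sides fit together, here is a
Python sketch of a greedy line breaker that picks break positions on
characters but measures with glyph advances through the cluster array. It
is a toy under stated assumptions, not Uniscribe or Mozilla code: the
function and parameter names are invented, the run is left-to-right, and
soft_break stands in for whatever a ScriptBreak-like call reports.

```python
def break_lines(log_clust, advances, soft_break, max_width):
    """Greedy line breaking over characters, measured with glyph advances.

    log_clust[i]  : index of the first glyph of character i's cluster
    advances[g]   : width of glyph g
    soft_break[i] : True if a new line may start at character i
    Returns the character indexes at which each line starts.
    """
    n = len(log_clust)
    # Charge the whole cluster width to the cluster's first character;
    # the other characters of the cluster contribute 0.
    char_width = [0] * n
    for i in range(n):
        if i == 0 or log_clust[i] != log_clust[i - 1]:
            j = i + 1
            while j < n and log_clust[j] == log_clust[i]:
                j += 1
            end = log_clust[j] if j < n else len(advances)
            char_width[i] = sum(advances[log_clust[i]:end])

    line_starts = [0]
    width = 0
    last_break = None  # most recent allowed break on the current line
    i = 0
    while i < n:
        if soft_break[i] and i > line_starts[-1]:
            last_break = i
        if width + char_width[i] > max_width and last_break is not None:
            # Overflow: restart measuring from the last allowed break.
            line_starts.append(last_break)
            i = last_break
            width = 0
            last_break = None
            continue
        width += char_width[i]
        i += 1
    return line_starts


# Hypothetical run: characters 0-1 are a ligature (glyph 0, width 12),
# then three single-glyph characters; a break is allowed before character 3.
print(break_lines([0, 0, 1, 2, 3], [12, 5, 10, 10],
                  [False, False, False, True, False], 20))
```

The break decision never looks at glyph indexes directly, but it cannot be
made without the advances and the character-to-glyph mapping.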

In that light, I'm still not sure it makes much sense to keep using
the old code for line breaking, given that the OS-specific code certainly
has the same functionality and we risk a discrepancy between the
OS-specific layering and the word breaking.