
Intent to implement and ship: Improved ruby parsing in HTML with new tag omission rules


Koji Ishii

Jul 1, 2014, 3:58:45 PM
to dev-pl...@lists.mozilla.org, 川幡 太一, Robin Berjon, Yuki Sekiguchi, Richard Ishida
Summary:
Two recent HTML changes improve ruby support:
1) Addition of the rb and rtc elements (but not rbc); and
2) Matching update to the tag omission rules to make ruby authoring easier.
By implementing these changes, Gecko supports the parsing side of all the ruby use cases required for the internationalization of HTML (see use cases document below for details). It also enables the implementation of the CSS Ruby Layout. The Japanese education market strongly requires this and a Mozilla developer has already started working on it.

The Japanese government is creating e-textbooks based on this feature for use in 2016, targeting one PC or tablet for every student by 2020.

This change landed in WebKit in April, and we are discussing this with other implementers as well.

Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=664104
Link to standard: http://www.w3.org/html/wg/drafts/html/CR/syntax.html#parsing-main-inbody
Use cases W3C Note: http://www.w3.org/TR/ruby-use-cases/
Platform coverage: all platforms (parsing only, layout will be in separate intents)
Estimated or target release: 33
Preference behind which this will be implemented: none. The changes are small and no compat issues are expected, so a preference switch is not required.

/koji

Henri Sivonen

Jul 2, 2014, 3:05:42 AM
to Koji Ishii, 川幡 太一, Robin Berjon, dev-pl...@lists.mozilla.org, Yuki Sekiguchi, Richard Ishida
On Tue, Jul 1, 2014 at 10:58 PM, Koji Ishii <koji...@gluesoft.co.jp> wrote:
> Platform coverage: all platforms (parsing only, layout will be in separate intents)

The parsing change is the easy part. Is there a plan to get the layout
part implemented?

My general take on this issue is:
1) As far as assigning the time of core developers goes, it seems
that there's always higher-priority stuff to work on instead of
complex ruby layout.
2) If someone else has different priorities, really values complex
ruby working and can develop an implementation that truly just takes
normal review time from the core developers, I think it makes sense
to let someone other than the core developers implement complex ruby.
3) I think the HTML parsing algorithm shouldn't be used as a way to
block point 2 from happening.

But is point 2 happening?

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Cameron McCormack

Jul 2, 2014, 8:12:06 AM
to Henri Sivonen, Koji Ishii, 川幡 太一, Robin Berjon, dev-pl...@lists.mozilla.org, Yuki Sekiguchi, Richard Ishida
Some work has begun on ruby layout by sgbowen in bug 1021952 recently.

L. David Baron

Jul 2, 2014, 6:32:24 PM
to Henri Sivonen, 川幡 太一, Robin Berjon, Koji Ishii, Yuki Sekiguchi, Richard Ishida, dev-pl...@lists.mozilla.org
We have an intern working on ruby this summer; I'm reasonably
optimistic that she'll get much of css-ruby implemented, although
maybe omitting some of the harder bits like 'ruby-position:
inter-character' (which really isn't so much hard as different from
the rest and therefore requiring separate code).

See the 7-digit dependencies of
https://bugzilla.mozilla.org/showdependencytree.cgi?id=256274&maxdepth=1&hide_resolved=0

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)

ian.h...@gmail.com

Jul 7, 2014, 1:34:50 PM
On Tuesday, July 1, 2014 12:58:45 PM UTC-7, Koji Ishii wrote:
> Summary:
>
> Two recent HTML changes improve ruby support:
>
> 1) Addition of the rb and rtc elements (but not rbc); and
>
> 2) Matching update to the tag omission rules to make ruby authoring easier.
>
> By implementing these changes, Gecko supports the parsing side of all the ruby use cases required for the internationalization of HTML (see use cases document below for details). It also enables the implementation of the CSS Ruby Layout. The Japanese education market strongly requires this and a Mozilla developer has already started working on it.

Could you elaborate on why we are using the more complicated W3C rules here instead of the simpler WHATWG rules, given that the WHATWG rules also address the same use cases?

See: https://bugzilla.mozilla.org/show_bug.cgi?id=33339#c110

--
Ian Hickson

Xidorn Quan

Dec 26, 2014, 7:41:20 AM
IMO, the main reason is that the W3C rules give authors more flexibility to make documents more semantic and stylable.

Please note that the inline form is not limited to providing compatibility. You can see an example in JLREQ Fig. 3.9, a use case that includes inline kana. If you want the word "明朝体" to be marked up as ruby in separate form, with the WHATWG rules you must write it as:

<ruby>明<rt>みん</rt>朝<rt>ちょう</rt>体<rt>たい</rt></ruby>

It is incompatible with the inline form, which means, if an author wants to switch between the inline form and ruby, there are only two options: 1. provide a different document for each form; 2. drop the separate form and use only the collapsed form for ruby. Neither of them perfectly matches the requirement. But with the W3C rules, it can be written as:

<ruby><rb>明<rb>朝<rb>体<rp>(<rt>みん<rt>ちょう<rt>たい<rp>)</ruby>

which is obviously compatible with the inline form.
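
To make the compatibility concrete, here is a minimal sketch, not the HTML tree-construction algorithm. Python's html.parser does not apply the tag omission rules, so the hypothetical RubySegmenter class below tracks the open element itself, assuming only that a new <rb>, <rt> or <rp> start tag implicitly closes the previous one. It treats "inline form" as the plain concatenation of the character data, which is one reading of the term.

```python
from html.parser import HTMLParser


class RubySegmenter(HTMLParser):
    """Toy segmenter: a new rb/rt/rp start tag implicitly closes the previous one."""

    def __init__(self):
        super().__init__()
        self.current = None    # which of rb/rt/rp is currently open, if any
        self.bases = []        # text of each <rb>
        self.annotations = []  # text of each <rt>
        self.text = []         # all character data in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("rb", "rt", "rp"):
            self.current = tag
            if tag == "rb":
                self.bases.append("")
            elif tag == "rt":
                self.annotations.append("")

    def handle_endtag(self, tag):
        if tag in ("rb", "rt", "rp", "ruby"):
            self.current = None

    def handle_data(self, data):
        self.text.append(data)
        if self.current == "rb":
            self.bases[-1] += data
        elif self.current == "rt":
            self.annotations[-1] += data


markup = "<ruby><rb>明<rb>朝<rb>体<rp>(<rt>みん<rt>ちょう<rt>たい<rp>)</ruby>"
seg = RubySegmenter()
seg.feed(markup)
seg.close()

inline_form = "".join(seg.text)
print(seg.bases)        # one entry per <rb>
print(seg.annotations)  # one entry per <rt>
print(inline_form)      # the text as it reads with no ruby support
```

The same markup yields three base/annotation pairs for the separate form and, read as plain text, the base word followed by its parenthesized reading.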

The difference in expressive power becomes more important for words that mix kanji and kana, such as "振り仮名". For this word, you won't even have the second option above, because I don't think people want to write something like

<ruby>振り仮名<rt>ふりがな</rt></ruby>

In conclusion, I think the WHATWG rules are not flexible enough for multi-pair rubies, which limits both the semantics and the stylability of documents. In other words, I don't think the two rule sets address the same use cases, especially from the perspective of semantics. The W3C rules are much more powerful, though also more complicated, than the WHATWG rules.

- Xidorn

Michael[tm] Smith

Dec 26, 2014, 8:24:52 AM
to Xidorn Quan, dev-pl...@lists.mozilla.org
Hi Xidorn,

Xidorn Quan <quanx...@gmail.com>, 2014-12-26 04:41 -0800:
...
> If you want the word "明朝体" to be marked in ruby in separate form, with
> the WHATWG rules, you must write it as:
>
> <ruby>明<rt>みん</rt>朝<rt>ちょう</rt>体<rt>たい</rt></ruby>
>
> It is incompatible with the inline form, which means, if an author wants
> to switch between the inline form and ruby, there are only two options:
> 1. provide a different document for each form; 2. drop the separate form
> and use only the collapsed form for ruby. Neither of them perfectly
> matches the requirement. But with the W3C rules, it can be written as:
>
> <ruby><rb>明<rb>朝<rb>体<rp>(<rt>みん<rt>ちょう<rt>たい<rp>)</ruby>
>
> which is obviously compatible with the inline form.
>
> The difference in expression ability becomes more important when there
> are words mixed with kanji and kana, such as "振り仮名". For this word,
> you won't even have the second option above, because I don't think people
> want to write something like
>
> <ruby>振り仮名<rt>ふりがな</rt></ruby>

What would be the right way to mark that up with <rb>? In particular, what
would be the right way if the author wants to switch between the inline
form and ruby?

--Mike

> In conclusion, I think the WHATWG rules are not flexible enough for
> multi-pair rubies, which limits both the semantization and the
> stylability of documents. In other words, I don't think the two rule sets
> address the same use cases, especially in perspective of semantics. The
> W3C rules are much more powerful, though also more complicated, than the
> WHATWG rules.

--
Michael[tm] Smith https://people.w3.org/mike

Xidorn Quan

Dec 26, 2014, 6:13:33 PM
to Michael[tm] Smith, dev-pl...@lists.mozilla.org
On Sat, Dec 27, 2014 at 12:23 AM, Michael[tm] Smith <mi...@w3.org> wrote:

> Hi Xidorn,
>
> Xidorn Quan <quanx...@gmail.com>, 2014-12-26 04:41 -0800:
> ...
> > If you want the word "明朝体" to be marked in ruby in separate form, with
> > the WHATWG rules, you must write it as:
> >
> > <ruby>明<rt>みん</rt>朝<rt>ちょう</rt>体<rt>たい</rt></ruby>
> >
> > It is incompatible with the inline form, which means, if an author wants
> > to switch between the inline form and ruby, there are only two options:
> > 1. provide a different document for each form; 2. drop the separate form
> > and use only the collapsed form for ruby. Neither of them perfectly
> > matches the requirement. But with the W3C rules, it can be written as:
> >
> > <ruby><rb>明<rb>朝<rb>体<rp>(<rt>みん<rt>ちょう<rt>たい<rp>)</ruby>
> >
> > which is obviously compatible with the inline form.
> >
> > The difference in expression ability becomes more important when there
> > are words mixed with kanji and kana, such as "振り仮名". For this word,
> > you won't even have the second option above, because I don't think people
> > want to write something like
> >
> > <ruby>振り仮名<rt>ふりがな</rt></ruby>
>
> What would be the right way to mark that up with <rb>? In particular, what
> would be the right way if the authors wants to switch between the inline
> form and ruby?
>

It would be

<ruby><rb>振<rb>り<rb>仮<rb>名<rt>ふ<rt>り<rt>が<rt>な</ruby>

The <rt> for "り" here could be individually hidden in ruby form by
stylesheets. In fact, CSS Ruby currently has an autohide rule, which
automatically hides the annotation when it is equal to the base.
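
As a rough illustration of that rule (a sketch only, not how any layout engine implements it), the hypothetical helper pair_and_autohide below pairs each base with the annotation at the same position and flags the annotation as hidden when the two strings are equal. It assumes the bases and annotations have already been extracted from the markup above.

```python
from typing import List, Tuple


def pair_and_autohide(bases: List[str], annotations: List[str]) -> List[Tuple[str, str, bool]]:
    """Pair bases with annotations by position; mark an annotation hidden if it equals its base."""
    return [(base, annotation, annotation == base)
            for base, annotation in zip(bases, annotations)]


# One list entry per <rb> and per <rt> in the markup above.
pairs = pair_and_autohide(list("振り仮名"), list("ふりがな"))
for base, annotation, hidden in pairs:
    print(base, annotation, "(hidden)" if hidden else "")
```

Because every base has its own annotation box, the comparison is a plain string equality per pair and needs no knowledge of Japanese.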

- Xidorn

Michael[tm] Smith

Dec 27, 2014, 1:05:40 PM
to Xidorn Quan, dev-pl...@lists.mozilla.org
Xidorn Quan <quanx...@gmail.com>, 2014-12-27 10:12 +1100:

> On Sat, Dec 27, 2014 at 12:23 AM, Michael[tm] Smith <mi...@w3.org> wrote:
...
> > Xidorn Quan <quanx...@gmail.com>, 2014-12-26 04:41 -0800:
> > ...
> > > The difference in expression ability becomes more important when there
> > > are words mixed with kanji and kana, such as "振り仮名". For this word,
> > > you won't even have the second option above, because I don't think people
> > > want to write something like
> > >
> > > <ruby>振り仮名<rt>ふりがな</rt></ruby>
> >
> > What would be the right way to mark that up with <rb>? In particular, what
> > would be the right way if the authors wants to switch between the inline
> > form and ruby?
>
> It would be
>
> <ruby><rb>振<rb>り<rb>仮<rb>名<rt>ふ<rt>り<rt>が<rt>な</ruby>
>
> The <rt> for "り" here could be individually hidden in ruby form by
> stylesheets. In fact, in CSS Ruby, we currently have autohide rule which
> automatically hide the the annotation when it is equal to the base.

Thanks, from looking at the current CSS Ruby draft, I see you must mean this:

http://drafts.csswg.org/css-ruby/#autohide

And maybe I'm missing something but from that I see this autohide thing seems
to be magic the UA does without exposing any means for Web content to cleanly
override it -- neither through CSS nor script. ("Future levels of CSS Ruby
may add controls for auto-hiding, but in this level it is always forced.")

If so, I think that kind of thing is something that a lot of web devs have
said they'd rather browsers quit doing -- and that most new specs these
days seem to try to avoid doing. But again, maybe I'm missing something.

But anyway it makes me wonder why it's specced this way to begin with.
Other than the case where a base is kana I don't know what other real-world
case there might be where an annotation might be equal to its base.

Further, I don't know of any typical case where, if a base character is
kana, you'd ever want to display furigana/yomigana for it.

So as long as the spec is going to require UAs to resort to magic behavior,
I think the magic could instead just be "autohide any ruby annotations for
kana characters". And then you could just have simpler markup like this:

<ruby>振り仮名<rt>ふりがな</rt></ruby>

...and UAs would display as expected -- with no annotation for the り.

It doesn't seem like that magic would be any more difficult for UAs to
implement and wouldn't be any worse than the "hide the annotation when it
is equal to the base" magic the CSS Ruby spec currently requires UAs to do.

So anyway, to get back to the "Could you elaborate on why we are using the
more complicated W3C rules here instead of the simpler WHATWG rules, given
that the WHATWG rules also address the same use cases?" question that Hixie
had originally asked and that you responded to in your earlier message at
https://lists.mozilla.org/pipermail/dev-platform/2014-December/008123.html

...from the above it seems the base-consisting-of-kanji-mixed-with-kana
case may not be such a compelling case for illustrating the need for <rb>
to be included in HTML. At least it's not as long as UAs are just doing
magic autohiding without exposing any way for Web content to override it.

--Mike

Masatoshi Kimura

Dec 27, 2014, 6:28:42 PM
to dev-pl...@lists.mozilla.org
On 2014/12/28 3:04, Michael[tm] Smith wrote:
> Further, I don't know of any typical case where if a base character
> is kana, why you'd ever want to display furigana/yomigana for it.

Ruby is not used only for furigana/yomigana. I know one example from a
very popular Japanese novel:
<ruby>赤眼の魔王<rt>ルビーアイ</rt></ruby>
This is not the only example. I'm confident we could find many cases in
Japanese novels.
Probably your next word is "It is not typical." or "Statistics,
please.". But unlike Xidorn Quan, I'm not interested in what WHATWG
people are doing because I know they are not serious about ruby at all.
Feel free to mess around with the ruby spec.

> So as long as the spec is going to require UAs to resort to magic
> behavior, I think the magic could instead just be "autohide any
> ruby annotations for kana characters".

How do you determine which ruby annotation corresponds to which base
character if the character counts do not match?
(Again, I'm not interested in your answer. I know whatever case the
WHATWG spec cannot deal with is not "typical".)

--
VYV0...@nifty.ne.jp

Xidorn Quan

Dec 27, 2014, 6:30:32 PM
to Michael[tm] Smith, dev-pl...@lists.mozilla.org
On Sun, Dec 28, 2014 at 5:04 AM, Michael[tm] Smith <mi...@w3.org> wrote:

> Xidorn Quan <quanx...@gmail.com>, 2014-12-27 10:12 +1100:
>
> > On Sat, Dec 27, 2014 at 12:23 AM, Michael[tm] Smith <mi...@w3.org> wrote:
> ...
> > > Xidorn Quan <quanx...@gmail.com>, 2014-12-26 04:41 -0800:
> > > ...
> > > > The difference in expression ability becomes more important when there
> > > > are words mixed with kanji and kana, such as "振り仮名". For this word,
> > > > you won't even have the second option above, because I don't think people
> > > > want to write something like
> > > >
> > > > <ruby>振り仮名<rt>ふりがな</rt></ruby>
> > >
> > > What would be the right way to mark that up with <rb>? In particular, what
> > > would be the right way if the authors wants to switch between the inline
> > > form and ruby?
> >
> > It would be
> >
> > <ruby><rb>振<rb>り<rb>仮<rb>名<rt>ふ<rt>り<rt>が<rt>な</ruby>
> >
> > The <rt> for "り" here could be individually hidden in ruby form by
> > stylesheets. In fact, in CSS Ruby, we currently have autohide rule which
> > automatically hide the the annotation when it is equal to the base.
>
> Thanks, from looking at the current CSS Ruby draft, I see you must mean this:
>
> http://drafts.csswg.org/css-ruby/#autohide
>
> And maybe I'm missing something but from that I see this autohide thing seems
> to be magic the UA does without exposing any means for Web content to cleanly
> override it -- neither through CSS nor script. ("Future levels of CSS Ruby
> may add controls for auto-hiding, but in this level it is always forced.")
>
> If so, I think that kind of thing is something that a lot of web devs has
> said they'd rather browsers quit doing -- and that most new specs these
> days seem to try to avoid doing. But again, maybe I'm missing something.
>
> But anyway it makes me wonder why it's specced this way to begin with.
> Other than the case where a base is kana I don't know what other real-world
> case there might be where an annotation might be equal to its base.
>
> Further, I don't know of any typical case where if a base character is
> kana, why you'd ever want to display furigana/yomigana for it.
>
> So as long as the spec is going to require UAs to resort to magic behavior,
> I think the magic could instead just be "autohide any ruby annotations for
> kana characters". And then you could just have simpler markup like this:
>
> <ruby>振り仮名<rt>ふりがな</rt></ruby>
>
> ...and UAs would display as expected -- with no annotation for the り.
>
> It doesn't seem like that magic would be any more difficult for UAs to
> implement and wouldn't be any worse than the "hide the annotation when it
> is equal to the base" magic the CSS Ruby spec currently requires UAs to do.
>

There are two problems if the UA wants to remove an individual kana from the
annotation. The first is: how do you display this ruby after removing that
character? There are three options:

(1) <ruby>振り仮名<rt>ふがな</rt></ruby>
(2) <ruby>振<rt>ふ</rt>り<rt></rt>仮名<rt>がな</rt></ruby>
(3) <ruby>振<rt>ふ</rt>り<rt></rt>仮<rt>が</rt>名<rt>な</rt></ruby>

(1) is completely wrong. Of the three options, (3) is preferable,
but it is hard, if not impossible, for UAs to decide between (2) and (3).
They can do so only if they have a Japanese dictionary integrated.

The second is: how do you know that the "り" in the annotation matches the "り" in
the base? In this case it might seem obvious, but Japanese also has
words like "言い訳 (いいわけ)" and "聞き手 (ききて)". There is also a more complex use case
that uses the inline form to mark up a novel title, such as "電波女と青春男
(でんぱおんなとせいしゅんおとこ)".

In addition, IIRC, CSS operates at the box level, not the character level,
right? It would be much harder for UAs to implement a style that affects
individual characters.

In conclusion, yes, I admit that everything is also possible under
the current WHATWG rules, provided UAs know all the magic of Japanese. But in
that case, the advantage of the WHATWG rules, that they are simpler, no
longer holds. The W3C rules are much simpler in handling these use cases.

> So anyway, to get back to the "Could you elaborate on why we are using the
> more complicated W3C rules here instead of the simpler WHATWG rules, given
> that the WHATWG rules also address the same use cases?" question that Hixie
> had originally asked at that you responded to in your earlier message at
> https://lists.mozilla.org/pipermail/dev-platform/2014-December/008123.html
>
> ...from the above it seems the base-consisting-of-kanji-mixed-with-kana
> case may not be such a compelling case for illustrating the need for <rb>
> to be included in HTML. At least it's not as long as UAs are just doing
> magic autohiding without exposing any way for Web content to override it.
>

Although I don't think authors would want to in most cases, it is still possible for
them to suppress this behavior by, for example, inserting a zero-width
space in that annotation. I know it is a bit awkward, but it is possible
if one really wants to. It would be impossible to do so in your model
with the WHATWG rules.

Anyway, the other case still makes the W3C rule preferable.

- Xidorn

L. David Baron

Dec 27, 2014, 8:51:33 PM
to Michael[tm] Smith, Xidorn Quan, dev-pl...@lists.mozilla.org
On Sunday 2014-12-28 03:04 +0900, Michael[tm] Smith wrote:
> So as long as the spec is going to require UAs to resort to magic behavior,
> I think the magic could instead just be "autohide any ruby annotations for
> kana characters". And then you could just have simpler markup like this:
>
> <ruby>振り仮名<rt>ふりがな</rt></ruby>
>
> ...and UAs would display as expected -- with no annotation for the り.

I don't see how UAs could determine which kana to eliminate.

What if the markup were instead:

<ruby>振り仮名<rt>ふりりがな</rt></ruby>

(After all, many characters need more than one kana for their ruby.)

How would the browser know whether to center ふり over 振, hide the
second り, and center がな over 仮名, or whether to center ふ over
振, hide the first り, and center りがな over 仮名?

The split between container and annotation is what gives the browser
the information to do that separation correctly.
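
The ambiguity can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (is_kana, runs, alignments) are hypothetical, kana is detected by the Hiragana and Katakana Unicode blocks, each kana run in the base must match itself in the reading, and each run of other characters takes any non-empty stretch of the reading.

```python
from itertools import groupby
from typing import List, Tuple


def is_kana(ch: str) -> bool:
    """True for characters in the Hiragana or Katakana Unicode blocks."""
    return "\u3040" <= ch <= "\u30ff"


def runs(base: str) -> List[Tuple[bool, str]]:
    """Split the base into maximal runs of kana and non-kana characters."""
    return [(kana, "".join(group)) for kana, group in groupby(base, key=is_kana)]


def alignments(base_runs: List[Tuple[bool, str]], reading: str) -> List[List[Tuple[str, str]]]:
    """Enumerate every way to distribute the reading over the base runs."""
    if not base_runs:
        return [[]] if reading == "" else []
    (kana, text), rest = base_runs[0], base_runs[1:]
    results = []
    if kana:
        # A kana run must appear literally at this point in the reading.
        if reading.startswith(text):
            for tail in alignments(rest, reading[len(text):]):
                results.append([(text, text)] + tail)
    else:
        # A non-kana run may take any non-empty prefix of the reading.
        for i in range(1, len(reading) + 1):
            for tail in alignments(rest, reading[i:]):
                results.append([(text, reading[:i])] + tail)
    return results


ambiguous = alignments(runs("振り仮名"), "ふりりがな")
unambiguous = alignments(runs("振り仮名"), "ふりがな")
for option in ambiguous:
    print(option)
```

With the reading ふりがな the sketch finds a single alignment, but with ふりりがな it finds the two readings described above, and nothing in the markup tells the browser which one the author meant.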

L. David Baron

Dec 30, 2014, 12:26:30 PM
to Eric Shepherd, Xidorn Quan, Michael[tm] Smith, dev-pl...@lists.mozilla.org
On Tuesday 2014-12-30 12:14 -0500, Eric Shepherd wrote:
> Is there a bug for the changes being discussed here, and is it marked with dev-doc-needed? Sounds like there will be, at a minimum, a few tweaks to the discussion about how this stuff works.

From the message at the start of the thread (six months ago):
https://bugzilla.mozilla.org/show_bug.cgi?id=664104

Eric Shepherd

Jan 2, 2015, 11:26:39 AM
to L. David Baron, Xidorn Quan, Michael[tm] Smith, dev-pl...@lists.mozilla.org
I feel a lot less embarrassed about not finding that bug number now that I know how long this thread has been running. :)

Eric Shepherd
Developer Documentation Lead
Mozilla
http://www.bitstampede.com/