ICU Message Format and Plurals...

jim

Apr 19, 2021, 3:11:31 PM

to Group: okapi-devel, Mihai Nita

More tools are supporting the ICU Message Format. I know Mihai has
posted several documents regarding the support of plurals, etc. Mihai,
can you reply with links so we can all have access to that information
in this thread?

I am interested in:

(1) General support for the ICU Message Format as a (sub)filter.
Options for parsing, etc.
(2) Design options for supporting message strings (including plurals) in
Okapi internal data structures.
(3) Ditto for XLIFF 1.2 and XLIFF 2.1.

Jim

Alessandro Falappa

Apr 20, 2021, 3:45:30 AM

to Group: okapi-devel

Hi Jim,

As far as parsing ICU message formatting patterns is concerned, I definitely suggest ICU4J, the reference Java implementation of ICU, which supports message formatting as well as other internationalization and localization tasks.

Some references I found useful:

ICU documentation: https://unicode-org.github.io/icu/
ICU4J readme: https://unicode-org.github.io/icu/userguide/icu4j/
ICU documentation on message formatting: https://unicode-org.github.io/icu/userguide/format_parse/messages/
ICU4J class for parsing ICU message patterns: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessagePattern.html
Online ICU Message Editor (useful to experiment with the syntax but based on a Javascript library): https://format-message.github.io/icu-message-format-for-translators/editor.html

I have already implemented a pair of Okapi steps dealing with ICU message patterns for internal use here at Translated, but the aim of these steps is to recognize ICU message patterns and transform them into a form that allows translators to edit them in our MateCat CAT tool without breaking the syntax. I don't think this is what you are after.

With regard to point 1, an implementation as an Okapi Filter, mainly to be used as a subfilter in a parent Filter, is IMHO the better way, even if it has a flaw: it breaks the context, because the plural/select pattern is likely to be treated as a subflow and moved into a different translation unit (see next paragraph).

With regard to point 3, as far as I know the XLIFF 1.2 and 2.1 specifications make no explicit provision for pluralization. I do not know if there are best practices in this regard out there. The Angular format recently brought to our attention on this mailing list is the first example I stumbled upon, and its approach is not very appealing to me: the pattern remains in the clear in a <source> tag and its syntax must be maintained in the corresponding <target> tag, so there is a high chance that the translator will break the syntax.

No opinions yet regarding point 2.

Regards,

Alessandro Falappa
Senior Java Developer

--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-devel/96a71c72-b6bd-bc1b-c7c7-f731c1dc8d58%40gmail.com.

jim

Apr 20, 2021, 12:05:19 PM

to okapi...@googlegroups.com

Thank you Alessandro for the useful information.

Mihai has posted a document in the past with some ideas for supporting plurals in XLIFF 2.0 (even a proposal for a new module). I'm not convinced at this point that hacking a solution for XLIFF 1.2 would be worth it; leaving it out would also provide more incentive to move to XLIFF 2.0.

The discussions we have had in the past, if I recall, basically fell into two possible implementations. Both would "explode" the number of permutations. (1) Store the permutations in the same TextUnit using specific metadata that a CAT tool can use to display the info (like the ICU Message Editor Alessandro linked). (2) Generate individual TextUnits for each permutation.

My feeling is that we need #1 with all the extra metadata, context, etc. A step (post filter) to provide #2 could be implemented to support XLIFF 1.2 or other less sophisticated CAT tools.
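
To make the "explosion" concrete, here is a minimal sketch of how the permutations could be enumerated. It is not Okapi code: the class and method names are hypothetical, and the selector cases are passed in by hand rather than derived from a parsed message or from CLDR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: enumerates every combination of selector cases for one message. */
final class SelectorPermutations {

    /**
     * @param selectors selector name -> case keys, in declaration order
     * @return one map per permutation (selector name -> chosen case key)
     */
    static List<Map<String, String>> expand(Map<String, List<String>> selectors) {
        List<Map<String, String>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>()); // start with a single empty combination
        for (Map.Entry<String, List<String>> selector : selectors.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : result) {
                for (String caseKey : selector.getValue()) {
                    Map<String, String> extended = new LinkedHashMap<>(partial);
                    extended.put(selector.getKey(), caseKey);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> selectors = new LinkedHashMap<>();
        selectors.put("gender", List.of("male", "female", "other"));
        selectors.put("count", List.of("=0", "=1", "other"));
        // Option (2) would emit one TextUnit per entry; option (1) would keep
        // the same entries as metadata on a single TextUnit.
        expand(selectors).forEach(System.out::println);
    }
}
```

With three gender cases and three plural cases this yields nine permutations, which is why the number of TextUnits grows quickly under option (2).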

I'm not sure the ICU4J parser would preserve whitespace for merge, etc. But we would definitely use CLDR and ICU4J to validate, provide metadata, and create the proper number of permutations per locale.

I know Mihai has given this a lot of thought - he can expand on this when he gets a chance.

Jim

Mihai Nita

Apr 20, 2021, 2:41:13 PM

to jim, Group: okapi-devel

Everything below is syntax independent.

It is not only addressing ICU plurals, but any kind of plurals (and selectors in general).

So it can deal with Android plurals, or gettext/po plurals, or ICU syntax used in other filters (.properties, or .xml, or json)

(ICU syntax parser as a subfilter)

Docs:

XLIFF 2.x module to support plural / gender / select (draft)

https://docs.google.com/document/d/1aXQS9HyzoqCktPUUNxeUverSY3l52qhipoKqf9ATyhg

Plural and gender support in XLIFF 2.x

https://docs.google.com/document/d/1aWb5Tw2Gj2c9Zqnn_SS1SwV08AYdO3ixv1FO5kgUW70

The first doc is the proposal for the XLIFF 2.x standard, so it is more focused on the xml representation part.

The second one is more about explaining the problem, examples, how to deal with it and why.

Code:

Branches with code prototyping:

mihai_plural_xliff : This is Okapi code that would deal with plurals. Most concepts are represented with annotations.

The xliff in this case is 1.2, with special namespaces, but the concepts are very similar with the 2.x official proposal.

mihai_stream_step : This is code that would allow the application of a sub-filter as a step, without being invoked by the main filter.

You can do something like excel_filter => step(filter_wrap(xml_filter)) => step(filter_wrap(html_filter)) ...

And the implementation can also do n:m mapping.

So you can get 1 TextUnit event that contains an ICU message and produce a group with 2 TextUnits for singular and plural.

Then you can get the 2 events in the "black box" and produce 4 (for instance, to take the singular and plural TextUnits from English and produce the 4 TextUnits needed for the Russian plurals), in a separate step.

Bottleneck:

Representing and processing "message level selectors" is easy:

{count, plural,
=1 {You deleted one file from folder {folderName}}
other {You deleted {count} files from folder {folderName}}
}

"Internal selectors" (where the selection is done on a substring) are another can of worms.

You deleted {count, plural, =1 {one file} other {{count} files}} from folder {folderName}

My current proposal is to not support the second form at all, and to add a conversion to the first form.

The second form is not very difficult to represent in xliff (I have a section for it in the 2.x proposal).

But it is difficult for translators (they would have to "drag inside the selector" words that are outside in English in order to deal with agreement in gender / number / grammatical case) and would completely mess up validation (the number of placeholders would change).

And it is bad i18n to begin with, it's technically concatenation.

For ICU that can be done almost 100% algorithmically, except for special crazy cases where you have two or more plurals with offset.

(multiple plurals are fine, but maximum one can have offset)

I can't think of a good example of a string using that, except something very unnatural.

I have code for that somewhere (but not in okapi branches, or anywhere public).

Other formats that I am aware of (Android XML, po files) don't support that kind of internal selection at all.
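
The conversion from an internal selector to a message-level selector can be illustrated with a small sketch. This is not the private code mentioned above: the class name is hypothetical, and it assumes a single plural/select argument, no offset, and no apostrophe quoting. It also does not guard against a literal # in the surrounding text, which would change meaning once moved inside a plural case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: hoists one internal plural/select to message level. */
final class SelectorHoister {
    // Matches the "name, plural," or "name, select," header of a complex argument.
    private static final Pattern HEADER =
            Pattern.compile("\\s*(\\w+)\\s*,\\s*(plural|select)\\s*,");

    /** Returns the index of the '}' that closes the '{' at position open. */
    static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '{') depth++;
            else if (ch == '}' && --depth == 0) return i;
        }
        throw new IllegalArgumentException("Unbalanced braces at " + open);
    }

    static String hoist(String message) {
        for (int open = message.indexOf('{'); open >= 0; ) {
            int close = matchingBrace(message, open);
            String arg = message.substring(open + 1, close);
            Matcher header = HEADER.matcher(arg);
            if (header.lookingAt()) {
                String prefix = message.substring(0, open);
                String suffix = message.substring(close + 1);
                if (prefix.isEmpty() && suffix.isEmpty()) return message; // already message level
                StringBuilder out = new StringBuilder("{")
                        .append(header.group(1)).append(", ").append(header.group(2)).append(',');
                int pos = header.end();
                while (true) {
                    int bodyOpen = arg.indexOf('{', pos);
                    if (bodyOpen < 0) break;
                    int bodyClose = matchingBrace(arg, bodyOpen);
                    String key = arg.substring(pos, bodyOpen).trim();
                    // Copy the outer text into every case so each case is a full sentence.
                    out.append(' ').append(key).append(" {")
                       .append(prefix).append(arg, bodyOpen + 1, bodyClose).append(suffix)
                       .append('}');
                    pos = bodyClose + 1;
                }
                return out.append('}').toString();
            }
            open = message.indexOf('{', close + 1); // simple argument, keep looking
        }
        return message; // no selector found
    }
}
```

Applied to the second form shown earlier, hoist() returns the first form on a single line; a message that is already at message level, or that has no selector, comes back unchanged.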

Cheers,

Mihai

Alessandro Falappa

Apr 21, 2021, 9:00:22 AM

to Group: okapi-devel

Hi Mihai,

I found the docs very informative. In order to simplify things a bit one could leave the gender topic out of the scope and deal only with plurals and choices. Gender selection would then be implemented with the choice construct.

I liked the "matrix" form to simplify nesting even if it does not help the combinatorial explosion of cases.

I now understand that having plurals or select constructs embedded into outer phrases is discouraged and could be normalized by copying the outer phrases into each case of the plural/select construct. Using subflows is technically valid but confusing from a translator's point of view.

Regards,

Alessandro Falappa
Senior Java Developer

jim

Apr 21, 2021, 1:51:44 PM

to okapi...@googlegroups.com

For the underlying Okapi data model we can use Mihai's annotations to create an enriched *single* TextUnit that will preserve context. This would contain all the metadata needed for proper translation and validation of all the forms. Depending on the CAT tool, we can have steps to "explode" the strings (simple XLIFF 1.2 translation) or keep the single TextUnit and let the CAT tool use it to the best of its ability. Several editors out there have special UI for these strings that helps preserve context and provides instant validation and rendering.

We will need a proper filter so we can support embedded message strings in various formats. I don't think the ICU4J parser preserves whitespace - which is normally needed to do an accurate merge. This will be especially true for container formats like YAML, which are very sensitive to whitespace. This may mean we write our own parser - but this doesn't look hard. See the attached fully working grammar.

Full validation can be done post merge with a special step - much like our current xml validation steps.

Here's an example translation UI: PO Editor

Jim

parser.pegjs

jim

Apr 21, 2021, 2:37:46 PM

to okapi...@googlegroups.com

I created a branch "icu_messages" based off the latest dev and merged Mihai's branches referenced above. However, I did not merge the changes to the XLIFF filter and writer. It was a difficult merge, and I'm also not sure we want to "pollute" XLIFF 1.2 with the enhancements needed for plurals. We can add that code later if wanted. For now I'm more interested in the plurals annotations and other core supporting code.

Jim

Alessandro Falappa

Apr 22, 2021, 4:24:52 AM

to Group: okapi-devel

Hi Jim,

Are you sure the ICU4J parser does not preserve whitespace? In my experience the ICU4J MessagePattern class (javadocs at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessagePattern.html) parses a pattern string and returns the tokens as char indexes without modifying the string in any way. The only annoying thing I found was imprecise, or entirely missing, index reporting in case of pattern syntax errors.

Regards,

Alessandro Falappa
Senior Java Developer

jim

Apr 22, 2021, 2:14:34 PM

to okapi...@googlegroups.com

I must have been thinking of a different or older parser that returned an Abstract Syntax Tree. Looks like this will work, but it is kind of cumbersome, as you have to keep track of a lot of state yourself and infer the in-between stuff (insignificant whitespace vs. actual strings, etc.).

@Test
public void parseIcuMessage() {
   String message = "    \t \n{gender_of_host, select,\n" + "        " +
         "female {\n"
         + "            {num_guests, plural, offset:1 \n" + "              =0 {{host} does not give a party.}\n"
         + "              =1 {{host} invites {guest} to her party.}\n"
         + "              =2 {{host} invites {guest} and one other person to her party.}\n"
         + "              other {{host} invites {guest} and # other people to her party.}}}\n"
         + "          male {\n" + "            {num_guests, plural, offset:1 \n"
         + "              =0 {{host} does not give a party.}\n"
         + "              =1 {{host} invites {guest} to his party.}\n"
         + "              =2 {{host} invites {guest} and one other person to his party.}\n"
         + "              other {{host} invites {guest} and # other people to his party.}}}\n"
         + "          other {\n" + "            {num_guests, plural, offset:1 \n"
         + "              =0 {{host} does not give a party.}\n"
         + "              =1 {{host} invites {guest} to their party.}\n"
         + "              =2 {{host} invites {guest} and one other person to their party.}\n"
         + "              other {{host} invites {guest} and # other people to their party.}}}}";

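   // Requires: import com.ibm.icu.text.MessagePattern;
   MessagePattern mp = new MessagePattern();
   // parse() returns the same instance; the input string itself is not modified.
   MessagePattern pmp = mp.parse(message);
   int c = pmp.countParts();
   for (int i = 0; i < c; i++) {
      MessagePattern.Part part = pmp.getPart(i);
      // Each part is an (index, length) view into the original string;
      // getLimit() is shorthand for getIndex() + getLength().
      String m = message.substring(part.getIndex(), part.getLimit());
      int t = part.getValue();
   }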
}

jim

Apr 22, 2021, 2:37:25 PM

to okapi...@googlegroups.com

Based on this I could start a proper filter. If nothing else we could test round trip with the raw message string.

Jim

Mihai Nita

Apr 25, 2021, 5:12:22 AM

to Group: okapi-devel

I've been quite split about gender. It is true that it can be seen as "syntactic sugar" on top of select.

But these are the reasons that made me include it:

Without a clear definition, different people / tools will come up with different ways to express it. For example male/female, masculine/feminine, masc/fem, m/f, ... I've seen it already.

Validation, processing, and in general being "smart" about it are also problematic.

Translators and tools that would have to deal with this have no way to know that a message selection is about gender.

If I am a translator and I see this:

{item_gender, select, other {I got a red {item}}}

then yes, I have a hint that the software knows about the fact that various items have (grammatical) gender in a lot of languages. But I have no clue what selectors I need to add.

With "select" becoming "gender" (yes, just syntactic sugar), I can now validate whether the translator added the proper genders for Romanian. Or the tooling can add them automatically, depending on the language.
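
As a rough illustration of the kind of validation an explicit "gender" type would enable, consider the sketch below. Everything in it is an assumption: the class name, the key names (masculine, feminine, neuter), and the per-language table are invented for the example and are not taken from CLDR or from the XLIFF proposal.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check that a translated gender selector covers the cases a language needs. */
final class GenderCaseValidator {
    // Illustrative table only; real data would have to come from an agreed definition.
    private static final Map<String, Set<String>> REQUIRED = Map.of(
            "en", Set.of("other"),
            "ro", Set.of("masculine", "feminine", "neuter", "other"));

    /** Returns the required gender cases that the translated message does not provide. */
    static Set<String> missingCases(String language, Set<String> translatedCases) {
        Set<String> missing = new TreeSet<>(REQUIRED.getOrDefault(language, Set.of("other")));
        missing.removeAll(translatedCases);
        return missing;
    }
}
```

A plain "select" gives a tool no basis for such a lookup, because it cannot tell that the selector is about gender at all.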

It is a bit like in the old days when C didn't have a boolean type.

People #defined bool (or boolean) and true/True/TRUE and false/False/FALSE, for increased readability.

I agree, just an opinion.

But I thought I would put it out there, and see what the others (including you, of course :-) have to say.

Thank you,

Mihai

Mihai Nita

Apr 25, 2021, 5:20:40 AM

to Group: okapi-devel

MessagePattern works with offsets.

MessagePatternUtil returns an AST.

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/MessagePatternUtil.html

But there is no need to preserve the spaces outside the message proper.

{count,plural,=1{foo}other{bar}}

and

{count, plural,
=1 {foo}
other {bar}
}

produce the exact same result.

And it is difficult to "preserve spaces" if you need to add cases.

You would in fact need to "generate" spaces, because for Russian in the second case you would need to return this:

{count, plural,
=1 {foo}
one {foo}
few {foo}
many {foo}
other {bar}
}

We do similar stuff for HTML, where newline becomes space, and several spaces get collapsed into one.
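
A minimal sketch of "generating" the layout rather than preserving it might look like this. The class and method names are hypothetical, and seeding each missing category with the source "other" text is an assumption of the sketch, not something the proposal prescribes. In a real step the category list would presumably come from CLDR via ICU4J (for example PluralRules.forLocale(...).getKeywords()) instead of being passed in by hand.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: builds the target plural cases and serializes them with a generated layout. */
final class PluralCaseGenerator {

    /**
     * Keeps explicit "=N" cases from the source, then adds every required category,
     * using the source text for that category when present and the "other" text otherwise.
     */
    static Map<String, String> targetCases(Map<String, String> sourceCases,
                                           List<String> requiredCategories) {
        Map<String, String> target = new LinkedHashMap<>();
        sourceCases.forEach((key, text) -> {
            if (key.startsWith("=")) target.put(key, text);
        });
        for (String category : requiredCategories) {
            target.put(category, sourceCases.getOrDefault(category, sourceCases.get("other")));
        }
        return target;
    }

    /** Writes one case per line; the whitespace is produced here, not carried over from the source. */
    static String format(String argName, Map<String, String> cases) {
        StringBuilder sb = new StringBuilder("{").append(argName).append(", plural,\n");
        cases.forEach((key, text) ->
                sb.append("  ").append(key).append(" {").append(text).append("}\n"));
        return sb.append('}').toString();
    }
}
```

Calling targetCases() with the two source cases above and the Russian categories one, few, many, other gives five cases, which format() then lays out one per line.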

Mihai

jim

Apr 26, 2021, 1:45:13 PM

to okapi...@googlegroups.com, Mihai Nita

Mihai - the problem is container formats like YAML - where whitespace does matter. But we may be able to preserve the surrounding space as a pre-process before sending the message content to the subfilter. Ideally we don't want special logic in the primary filter (YAML, JSON, XML) - this should all happen in the subfilter.
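
One way to picture that pre-process is the sketch below, which sets the leading and trailing whitespace aside, hands only the message proper to whatever does the real work, and then puts the whitespace back. The class name and the use of a UnaryOperator to stand in for the subfilter are assumptions made for illustration; this is not how Okapi subfilters are actually invoked.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper: keeps surrounding whitespace out of the message processing. */
final class WhitespaceEnvelope {

    /** Applies transform to the message without its outer whitespace, then restores that whitespace. */
    static String process(String raw, UnaryOperator<String> transform) {
        int start = 0;
        while (start < raw.length() && Character.isWhitespace(raw.charAt(start))) start++;
        int end = raw.length();
        while (end > start && Character.isWhitespace(raw.charAt(end - 1))) end--;
        return raw.substring(0, start)
                + transform.apply(raw.substring(start, end))
                + raw.substring(end);
    }
}
```

With this split the container filter would not need to know anything about ICU syntax, and whitespace inside the message could still be regenerated freely.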

Jim

jim

Apr 26, 2021, 1:46:26 PM

to okapi...@googlegroups.com, Mihai Nita

IMHO gender deserves to be explicit - for the reasons Mihai describes above.

Jim

Jack Cole

Jul 2, 2021, 3:40:57 PM

to okapi-devel

This is the XLIFF 2 output I've come up with when writing my own ICU filter. I've used a lot of the suggestions provided by you guys in your notes and in meetings, such as adding the "name" attribute so that the TMS can use that for hinting. I've written it in a way that requires as little effort as possible to combine it back together: you simply need to insert the subflows in the positions marked in the coded text, and then delete the units that are subflows.

I also included examples where examples weren't needed (the example is exactly the same as the source segment content), so that it's consistent when translators are working on the document. Could this be unnecessary spam? If there are no plural values, should there also be no examples? I guess it depends on how well the UI makes notes stand out.

It's also essential that I don't modify the original source, so I've made the original source ignorable so it won't be translated, but it is still saved for when the data is merged.

If you guys have any comments on how this might be improved, or point out if I am misusing XLIFF 2 in a way that can bite me in the ass at a future date, please let me know.
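
The merge step described above (inserting each translated subflow at its marked position) could be sketched as follows. The [[SUBFLOW:...]] marker syntax is taken from the data shown below; the class name and the idea of passing translations as a map keyed by subflow id are assumptions made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: puts translated subflows back into the coded skeleton. */
final class SubflowMerger {
    private static final Pattern MARKER = Pattern.compile("\\[\\[SUBFLOW:([0-9-]+)\\]\\]");

    /**
     * @param skeleton     coded text containing [[SUBFLOW:id]] markers
     * @param translations subflow id -> translated text
     */
    static String merge(String skeleton, Map<String, String> translations) {
        return MARKER.matcher(skeleton).replaceAll(match -> {
            String text = translations.get(match.group(1));
            if (text == null) {
                throw new IllegalStateException("No translation for subflow " + match.group(1));
            }
            // quoteReplacement keeps '$' and '\' in translations from being read as group references.
            return Matcher.quoteReplacement(text);
        });
    }
}
```

Failing loudly on a missing subflow is a design choice of the sketch; a real merge might prefer to fall back to the source text.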

Original Text:

{gender, select, male {He} female {She} other {They}} purchased {count, plural, =0 {no items} =1 {one item} other {{count} items}}.

<data id="d1">{gender, select, male {He} female {She} other {They}}</data>

<data id="d2">{count, plural, =0 {no items} =1 {one item} other {{count} items}}</data>

<data id="de1">{gender, select, male {count, plural, =0 {[[SUBFLOW:3-0-0]]} =1 {[[SUBFLOW:3-0-1]]} other {[[SUBFLOW:3-0-2]]}} female {count, plural, =0 {[[SUBFLOW:3-0-3]]} =1 {[[SUBFLOW:3-0-4]]} other {[[SUBFLOW:3-0-5]]}} other {count, plural, =0 {[[SUBFLOW:3-0-6]]}} =1 {[[SUBFLOW:3-0-7]]} other {[[SUBFLOW:3-0-8]]}}}</data>

</originalData>

</segment>

<source xml:space="preserve"><ph id="0" canCopy="no" disp="{gender}" dataRef="d1"/> purchased <ph id="1" canCopy="no" disp="{gender}" dataRef="d2"/>.</source>

</ignorable>

</unit>

<notes>

<note>Example: He purchased no items.</note>

</notes>

<source>He purchased no items.</source>

</segment>

</unit>

<notes>

<note>Example: He purchased one item.</note>

</notes>

<source>He purchased one item.</source>

</segment>

</unit>

<notes>

<note>Example: He purchased 2 items.</note>

<note>Example: He purchased 3 items.</note>

</notes>

<source>He purchased <ph id="1" type="fmt" subType="icu:simple" canCopy="no" canDelete="no" disp="{count}" dataRef="d1"/> items.</source>