1.46.0 release?...

jimbo

Aug 17, 2023, 11:40:57 AM

to Group: okapi-devel

We've gotten a few requests for a 1.46.0 release. This will be the last release before porting the project over to GitHub (barring any pushback).

One feature I would like to add before the 1.46.0 release is automated plural form generation for the new message format filter. I may need to make some other adjustments to the filter as I get feedback from our team.

Are there any other high-priority bugs that should be fixed?

Jim

Chase Tingley

Aug 17, 2023, 11:57:36 AM

to okapi...@googlegroups.com

We've got the initial TTML implementation landed; there may be a couple of bugs that shake out of that in the next couple of weeks. On the other hand, there's a lot of shared code with VTT, and that code has already been tested pretty well, so it may not be too bad.

So, I don't think there are any blockers here.

--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-devel/0288ace8-22b6-6f9f-76ab-06cdfdb54a02%40gmail.com.

Mihai Nita

Aug 30, 2023, 9:05:08 PM

to Group: okapi-devel

No blockers here either.

One note about plurals:

- I can share some of the experience I have with plural expansion

- An XLIFF 2.2 extension for plural/gender/select is happening. It uses translation units + segments instead of group + tu. There are pros and cons, but there was a strong push for that. Apparently most tools can't change to handle n:m.

Mihai

jimbo

Aug 31, 2023, 10:54:44 AM

to okapi...@googlegroups.com, Mihai Nita

Hi Mihai! Very timely post. I am designing the plural forms auto expansion now. I think I have a third alternative between producing more TUs and adding segments.

I plan to make plural expansion a filter option. When enabled, the filter will detect the plurals in the source, replace the source text if needed, then extract the expanded message string. From the perspective of downstream steps, the source content came from the original file, so everything proceeds normally (if used with the standard Okapi IPipeline).

I don't like the idea of adding extra segments to a TextUnit. This breaks with the "standard" that filters create TextUnits and the Segmenter creates segments. This has implications for split/merge in the workbench and other operations like merge.

In summary, in the current design the ICU message filter will produce Group and TextUnit events.

Not sure of the full implications for Okapi and XLIFF 2.2 support of plurals, but I'm sure we can convert back and forth with enough metadata.

Please let me know ASAP of any questions! Any sample code or other info would be appreciated.

BTW: Do you have any thoughts on handling Gender? I know ICU doesn't have built-in support, but maybe the CLDR has info on gender and we could do some type of "expansion"?

Also, if you have any code or ideas for ICU message string validation, I would be interested in adding it to the filter. Diagnostics are going to be important to detect badly internationalized strings.

cheers,

Jim

jimbo

Aug 31, 2023, 1:46:31 PM

to okapi...@googlegroups.com, Mihai Nita

BTW: I'm looking for the best protobuf file UI viewer. Give it a pb file and navigate it visually. Found a few but most are ugly and old. I thought I remembered seeing some kind of editor/debugger tool a while back.

A workaround would be to use standard JSON for design and debugging (lots of tools for JSON).

Jim

jimbo

Oct 4, 2023, 3:17:39 PM

to okapi...@googlegroups.com, Mihai Nita

Mihai - I'm coding up the plural expansion feature now, but I've run into an issue with strings that have multiple embedded PLURAL/SELECTORDINAL groups. The parser does a good job of pulling these out into an AST. But I'm wondering if you have a good algorithm you could share that would adjust the string with the new target plurals (aka plural expansion)?

The branch is: https://bitbucket.org/okapiframework/okapi/branch/plural_expansion

    @Test
    public void testWithEmbeddedPluralMessage() throws Exception {
        // Outer plural on {0}; each branch embeds a second plural on {1}.
        String message = "{0, plural,one {You have {1, plural, one {# apple} other {# apples}}} other {You and # others have {1, plural, one {# apple} other {# apples}}}}";
        try (MessageFormatParser p = new MessageFormatParser()) {
            p.parse(message);
            // Round trip: serializing the parsed message should reproduce the input.
            assertEquals(message, p.toString());
        }
    }

Mihai Nita

Oct 10, 2023, 3:50:10 AM

to jimbo, okapi...@googlegroups.com

I have some code doing a "message normalization":

https://mihai-nita.net/tmp/icu-msg-normalizer.zip

That should make expansion easier, and it also makes translation less difficult, as we have full messages in each branch of a decision.

This is something I've put together on my own time and computer, at home, so it is not a problem to share it.

I'll see what I can do about the expansion.

This is how the "normalization" works:

1. For plurals convert # to {count} (whatever the count variable is)
That works, except if there is a plural with offset.
2. For each selection (select / plural / plural ordinal) take the prefix and suffix and copy them inside each branch of the decision.

Simple selection example:

You deleted {fcount, plural, =1 {one file} other {# files}}!!!

Replace # with {fcount}:

You deleted {fcount, plural, =1 {one file} other {{fcount} files}}!!!

Marking the prefix ("You deleted ") and the suffix ("!!!"), which were highlighted in yellow and cyan in the original message:

You deleted {fcount, plural, =1 {one file} other {{fcount} files}}!!!

Distribute the prefix:

{fcount, plural, =1 {You deleted one file} other {You deleted {fcount} files}}!!!

Distribute the suffix:

{fcount, plural, =1 {You deleted one file!!!} other {You deleted {fcount} files!!!}}

Double selection example:

There {guestCount, plural, =1{was one guest} other{were # guests}} at {hostGender,select, feminine{her} masculine{his} other{their}} party.

Replace # with {guestCount}:

There {guestCount, plural, =1{was one guest} other{were {guestCount} guests}} at {hostGender,select, feminine{her} masculine{his} other{their}} party.

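Marking the prefix ("There ") and the suffix (everything after the plural, from " at " to " party."), originally highlighted in yellow and cyan, for the first selection (the plural):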

There {guestCount, plural, =1{was one guest} other{were {guestCount} guests}} at {hostGender,select, feminine{her} masculine{his} other{their}} party.

Distribute the prefix:

{guestCount, plural, =1{There was one guest} other{There were {guestCount} guests}} at {hostGender,select, feminine{her} masculine{his} other{their}} party.

Distribute the suffix:

{guestCount, plural, =1{There was one guest at {hostGender,select, feminine{her} masculine{his} other{their}} party.} other{There were {guestCount} guests at {hostGender,select, feminine{her} masculine{his} other{their}} party.}}

Wrap for readability:

{guestCount, plural,
=1{There was one guest at {hostGender,select, feminine{her} masculine{his} other{their}} party.}
other{There were {guestCount} guests at {hostGender,select, feminine{her} masculine{his} other{their}} party.}
}

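Around the second decision (the gender), marking the prefix (the text before {hostGender,select, ...} in each branch) and the suffix (" party."), originally highlighted in yellow and cyan: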

{guestCount, plural,
=1{There was one guest at {hostGender,select, feminine{her} masculine{his} other{their}} party.}
other{There were {guestCount} guests at {hostGender,select, feminine{her} masculine{his} other{their}} party.}
}

Distribute the prefixes:

{guestCount, plural,
=1{{hostGender,select, feminine{There was one guest at her} masculine{There was one guest at his} other{There was one guest at their}} party.}

other{{hostGender,select, feminine{There were {guestCount} guests at her} masculine{There were {guestCount} guests at his} other{There were {guestCount} guests at their}} party.}

}

Distribute the suffixes:

{guestCount, plural,
=1{{hostGender,select, feminine{There was one guest at her party.} masculine{There was one guest at his party.} other{There was one guest at their party.}}}

other{{hostGender,select, feminine{There were {guestCount} guests at her party.} masculine{There were {guestCount} guests at his party.} other{There were {guestCount} guests at their party.}}}

}

Wrap and indent for readability:

{guestCount, plural,
  =1 {
    {hostGender,select,
      feminine{There was one guest at her party.}
      masculine{There was one guest at his party.}
      other{There was one guest at their party.}
    }
  }
  other {
    {hostGender,select,
      feminine{There were {guestCount} guests at her party.}
      masculine{There were {guestCount} guests at his party.}
      other{There were {guestCount} guests at their party.}
    }
  }
}
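
The prefix/suffix distribution walked through in the two examples above can be sketched in a few lines of Java. This is only an illustration written for this thread, not the code from the normalizer zip: the class and method names are hypothetical, it uses plain brace matching instead of ICU's MessagePattern, it ignores ICU apostrophe quoting, and it assumes # has already been replaced with the count variable (step 1).

```java
final class MessageNormalizerSketch {

    /** Returns the index of the '}' that closes the '{' at position open. */
    static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unbalanced braces in: " + s);
    }

    static boolean isSelection(String type) {
        return type.equals("plural") || type.equals("select") || type.equals("selectordinal");
    }

    /** Moves the text around the first top-level selection into each of its branches, recursively. */
    static String normalize(String message) {
        int pos = 0;
        while (true) {
            int open = message.indexOf('{', pos);
            if (open < 0) {
                return message; // no selection left at this level
            }
            int close = matchingBrace(message, open);
            String[] parts = message.substring(open + 1, close).split(",", 3);
            if (parts.length == 3 && isSelection(parts[1].trim())) {
                String prefix = message.substring(0, open);
                String suffix = message.substring(close + 1);
                return "{" + parts[0] + "," + parts[1] + ","
                        + distribute(parts[2], prefix, suffix) + "}";
            }
            pos = close + 1; // plain placeholder such as {fcount}: skip it
        }
    }

    /** Rewrites "sel1 {msg1} sel2 {msg2}" so that every msg becomes prefix + msg + suffix. */
    private static String distribute(String branches, String prefix, String suffix) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int open = branches.indexOf('{', pos);
            if (open < 0) {
                return out.append(branches.substring(pos)).toString();
            }
            int close = matchingBrace(branches, open);
            String inner = branches.substring(open + 1, close);
            out.append(branches, pos, open) // selector text, kept verbatim
               .append('{').append(normalize(prefix + inner + suffix)).append('}');
            pos = close + 1;
        }
    }

    public static void main(String[] args) {
        // prints {fcount, plural, =1 {You deleted one file!!!} other {You deleted {fcount} files!!!}}
        System.out.println(normalize(
                "You deleted {fcount, plural, =1 {one file} other {{fcount} files}}!!!"));
    }
}
```

Because each branch is normalized again after the prefix and suffix are copied in, a second selection sitting in the suffix (the gender select in the double example) ends up nested inside every plural branch.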

Some explanations, and what's missing.

Why replace # with {count}?

This is only needed for more than one plural. Because # basically means {count}, but it matches the innermost plural count. If we have two plural counters ("You deleted # files in # folders") there is no way to use # for both fileCount and folderCount.

This is implemented.

But this replacement does not work for plural with offset, because the # really means the value of count - offset.

So for plural with offset # should stay as is.

This is a problem for multiple plurals, one of them with an offset, because that one should be the innermost one. So we might need to do reordering of the plurals.

That reordering is not implemented.

And it is impossible to "normalize" messages with more than one plural with an offset.

Happily enough these are extremely rare (I've never seen one in the wild :-)
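
As a rough illustration of the # replacement and the offset exception described above, here is a small Java sketch. It is hypothetical code written for this thread, not the implementation from the zip: it uses plain brace matching, ignores ICU apostrophe quoting, replaces # only in the direct branches of a plural or selectordinal (the handling of # inside a select nested in a plural is deliberately left out), and leaves any plural that starts with "offset:" untouched.

```java
final class HashReplacerSketch {

    /** Returns the index of the '}' that closes the '{' at position open. */
    static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unbalanced braces in: " + s);
    }

    /** Replaces # with {counter} in every plural branch that has no offset. */
    static String replaceHash(String message) {
        return rewrite(message, null);
    }

    /** counter is the variable that # stands for here, or null to leave # alone. */
    private static String rewrite(String msg, String counter) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < msg.length()) {
            char c = msg.charAt(i);
            if (c == '#' && counter != null) {
                out.append('{').append(counter).append('}');
                i++;
            } else if (c == '{') {
                int close = matchingBrace(msg, i);
                out.append('{').append(rewriteArgument(msg.substring(i + 1, close))).append('}');
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static String rewriteArgument(String body) {
        String[] parts = body.split(",", 3);
        if (parts.length < 3) {
            return body; // plain placeholder such as {fcount}
        }
        String type = parts[1].trim();
        boolean plural = type.equals("plural") || type.equals("selectordinal");
        if (!plural && !type.equals("select")) {
            return body; // formatted placeholder such as {0, number, integer}
        }
        String branches = parts[2];
        boolean hasOffset = branches.trim().startsWith("offset:");
        // With an offset, # means (count - offset), so it must stay as it is.
        String counter = (plural && !hasOffset) ? parts[0].trim() : null;
        StringBuilder out = new StringBuilder(parts[0]).append(',').append(parts[1]).append(',');
        int pos = 0;
        while (true) {
            int open = branches.indexOf('{', pos);
            if (open < 0) {
                return out.append(branches.substring(pos)).toString();
            }
            int close = matchingBrace(branches, open);
            out.append(branches, pos, open) // selector text, kept verbatim
               .append('{').append(rewrite(branches.substring(open + 1, close), counter)).append('}');
            pos = close + 1;
        }
    }

    public static void main(String[] args) {
        // prints You deleted {fcount, plural, =1 {one file} other {{fcount} files}}!!!
        System.out.println(replaceHash(
                "You deleted {fcount, plural, =1 {one file} other {# files}}!!!"));
    }
}
```

Each nested plural gets its own counter name, so two counters in one message no longer compete for the same #.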

===

I will describe the expansion part and check what code I can share.

But basically after normalization we have a map from select cases to messages.
(in reality we have switch inside switch, but these are equivalent)

The key is an array of cases, value is the message.

The message before (with gender and plural) is:

switch {guestCount,plural}
  case =1:
    switch {hostGender,select}
      case feminine {There was one guest at her party.}
      case masculine {There was one guest at his party.}
      case other {There was one guest at their party.}
  case other:
    switch {hostGender,select}
      case feminine {There were {guestCount} guests at her party.}
      case masculine {There were {guestCount} guests at his party.}
      case other {There were {guestCount} guests at their party.}

But "flattened" to the map I mentioned before it becomes:

switch [ {guestCount,plural} {hostGender,select} ]

case [ =1 feminine ] {There was one guest at her party.}

case [ =1 masculine ] {There was one guest at his party.}

case [ =1 other ] {There was one guest at their party.}

case [ other feminine ] {There were {guestCount} guests at her party.}

case [ other masculine ] {There were {guestCount} guests at his party.}

case [ other other ] {There were {guestCount} guests at their party.}
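
One way to build such a flattened table from an already normalized message is sketched below. The class name and the choice of a LinkedHashMap keyed by a list of selectors are assumptions made for illustration; the sketch expects every selection to span its whole (sub)message, as normalization guarantees, and it does not handle plural offsets or apostrophe quoting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CaseTableSketch {

    /** Returns the index of the '}' that closes the '{' at position open. */
    static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unbalanced braces in: " + s);
    }

    static boolean isSelection(String type) {
        return type.equals("plural") || type.equals("select") || type.equals("selectordinal");
    }

    /** Flattens a normalized message into "list of selectors -> message" rows. */
    static Map<List<String>, String> flatten(String normalized) {
        Map<List<String>, String> rows = new LinkedHashMap<>();
        collect(normalized, new ArrayList<>(), rows);
        return rows;
    }

    private static void collect(String message, List<String> path, Map<List<String>, String> rows) {
        String trimmed = message.trim();
        String[] parts = null;
        if (trimmed.startsWith("{") && matchingBrace(trimmed, 0) == trimmed.length() - 1) {
            parts = trimmed.substring(1, trimmed.length() - 1).split(",", 3);
        }
        if (parts == null || parts.length < 3 || !isSelection(parts[1].trim())) {
            rows.put(path, message); // a leaf: plain text, possibly with placeholders
            return;
        }
        String branches = parts[2];
        int pos = 0;
        int open;
        while ((open = branches.indexOf('{', pos)) >= 0) {
            int close = matchingBrace(branches, open);
            List<String> next = new ArrayList<>(path);
            next.add(branches.substring(pos, open).trim()); // selector such as "=1" or "other"
            collect(branches.substring(open + 1, close), next, rows);
            pos = close + 1;
        }
    }

    public static void main(String[] args) {
        String normalized = "{guestCount, plural, "
                + "=1{{hostGender,select, feminine{There was one guest at her party.} "
                + "other{There was one guest at their party.}}} "
                + "other{{hostGender,select, feminine{There were {guestCount} guests at her party.} "
                + "other{There were {guestCount} guests at their party.}}}}";
        flatten(normalized).forEach((key, message) ->
                System.out.println("case " + key + " {" + message + "}"));
    }
}
```

The insertion order of the map mirrors the order of the branches in the message, which keeps the table easy to compare with the nested form.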

I found that thinking and operating on this structure is easier than nested switches.

I add cases as needed, looking only at the plural column; the rest is copied as is:

case [ other feminine ] {There were {guestCount} guests at her party.}

copy to:

case [ few feminine ] {There were {guestCount} guests at her party.}

case [ many feminine ] {There were {guestCount} guests at her party.}

It gets a bit messier with several plurals, but only for a human :-)

For a machine we just create all combinations of plural keywords that are required for the target language.
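
The copy step on the flattened table might look like the following Java sketch. It is an assumption-laden illustration rather than shared code: the method name and signature are hypothetical, the list of required keywords is passed in by the caller (with ICU4J it could come from PluralRules.forLocale(locale).getKeywords()), and the sketch only adds rows copied from "other"; it never removes source rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PluralExpansionSketch {

    /**
     * For every row whose selector in pluralColumn is "other", adds a copy for each
     * required keyword that has no row yet. All other columns are copied as is.
     */
    static Map<List<String>, String> expand(Map<List<String>, String> rows,
                                            int pluralColumn,
                                            List<String> requiredKeywords) {
        Map<List<String>, String> out = new LinkedHashMap<>();
        for (Map.Entry<List<String>, String> row : rows.entrySet()) {
            if ("other".equals(row.getKey().get(pluralColumn))) {
                for (String keyword : requiredKeywords) {
                    List<String> key = new ArrayList<>(row.getKey());
                    key.set(pluralColumn, keyword);
                    if (!rows.containsKey(key)) {
                        out.put(key, row.getValue()); // new row copied from "other"
                    }
                }
            }
            out.put(row.getKey(), row.getValue()); // existing rows are always kept
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> rows = new LinkedHashMap<>();
        rows.put(Arrays.asList("=1", "feminine"), "There was one guest at her party.");
        rows.put(Arrays.asList("other", "feminine"), "There were {guestCount} guests at her party.");
        expand(rows, 0, Arrays.asList("few", "many", "other"))
                .forEach((key, message) -> System.out.println("case " + key + " {" + message + "}"));
    }
}
```

With several plurals, calling expand once per plural column produces all the keyword combinations, since each pass copies the rows created by the previous one.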

Mihai

jimbo

Oct 10, 2023, 11:27:52 AM

to Mihai Nita, okapi...@googlegroups.com

Excellent! Now that I've spent some time working with different message strings I can see this will be really helpful. I'll try to integrate this into the current parser and AST if possible.

Jim

jimbo

Oct 10, 2023, 12:08:32 PM

to Mihai Nita, okapi...@googlegroups.com

Ah, I wish I had seen MessagePatternUtil before. I'll refactor our Parser to use it instead of my custom Tokens. That will make it trivial to incorporate your code.

Jim

Chase Tingley

Oct 10, 2023, 9:14:16 PM

to okapi...@googlegroups.com, Mihai Nita

Going back to the original question in this thread -- do you two see these changes as blocking 1.46.0, or can we release any time?

From my POV I think we can release any time -- we run modified trunk code, so it doesn't really affect us. But it's been a while for users who depend on the pre-built artifacts.

jimbo

Oct 12, 2023, 11:36:11 AM

to okapi...@googlegroups.com, Chase Tingley, Mihai Nita

I'm going to create a PR for my Message Filter changes. I'd really like to incorporate Mihai's code and refactor to use MessagePatternUtil - but I can do that as a separate PR.

The only other feature I would like to get out is the EnumSet for Property - that will be another PR.

There are a few OpenXML bugs on the top of the issue list. If we can get some of those in 1.46.0 that would be great, but not essential.

I think we can release at the end of the month - is that OK for everybody?

Jim

Mihai Nita

Oct 25, 2023, 6:52:42 PM

to jimbo, Group: okapi-devel, Chase Tingley

Fine with me.

I'm crazy busy these days, both at work and at home (not much difference really :-)