Needed changes to TextUnitMerger code before release...


jim

Sep 14, 2021, 3:07:46 PM
to Group: okapi-devel
Jack's new XLIFF 2, based on ICU message format, caused some problems with
the TU merger. I think the merger is too lax in matching. I would like
to depend on code IDs only for the merge, but I have seen cases where the
code IDs change post-filter. There is also the leverage case, where
IDs may not even exist.

Not sure of the best way forward, but we may have to sacrifice some of
Kuro's use cases with leverage in order to make the pure
extract->translate->merge case stricter.

Here is a specific example:

Original:
<source xml:space="preserve"><ph id="1" type="fmt" subType="icu:select"
canCopy="no" disp="{taxableArea}" dataRef="d1"/></source>
<target xml:space="preserve"><ph id="2" canCopy="no"
disp="{taxableArea}" dataRef="d2"/></target>

Merged:
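<source xml:space="preserve"><ph id="1" type="fmt"
subType="icu:select" canCopy="no" disp="{taxableArea}"
dataRef="d1"/></source>
<target xml:space="preserve"><ph id="1" type="fmt"
subType="icu:select" canCopy="no" disp="{taxableArea}"
dataRef="d2"/></target>

Note that the merged target <ph> now has id="1" plus the source's type and
subType, although the original target code had id="2".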

krsk...@gmail.com

Sep 14, 2021, 3:45:55 PM
to okapi-devel
I don't understand this use case. What does it mean for a TU to have only a placeholder? What are we translating here? And what is this merged with? I also don't understand how the XLIFF(2) filter is used. Why would we extract an XLIFF file, producing another XLIFF file, translate that second XLIFF file, and then merge to make a third, merged XLIFF file? Can someone explain?

Anyway, it sounds like we have two opposing goals. For leveraging, we want more tolerant merge logic to maximize reuse of existing translations (though not so tolerant that the merge result stops making sense). For this use case (although I still don't understand it), we want a stricter merge. Maybe we need another boolean argument to satisfy both usages, or even a separate method?

jim

Sep 15, 2021, 1:28:29 PM
to okapi...@googlegroups.com, krsk...@gmail.com
The example is only meant to show that there are legitimate cases where source and target codes differ (I didn't include the full segment). This is going to be common once we start supporting ICU messages.

Yes, I thought about a boolean. But how do we know whether a target came from leverage and hasn't been edited by a human? We do have the "approved" property, but I don't know if it is used consistently.

Jim

jim

Sep 15, 2021, 1:34:47 PM
to okapi...@googlegroups.com, krsk...@gmail.com
Maybe a new boolean, "edited" or "final". It would be true by default and force the TextUnitMerger into a "strict" mode. For leveraged documents the value would be false and the current algorithm would be used.
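
A minimal sketch of how such a flag could select the matching mode. Every name here (MergeModeSketch, InlineCode, finalTarget, findSourceMatch) is hypothetical and is not Okapi's API. "Strict" is taken to mean matching by code id only, and the lenient fallback that matches on disp is just an assumed stand-in for the current algorithm, which is not shown in this thread. The data is the <ph> pair from the example at the top of the thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical merger skeleton; it only shows how a flag could pick the matching mode.
final class MergeModeSketch {

    // Hypothetical stand-in for an inline code (not Okapi's Code class).
    static final class InlineCode {
        final int id;
        final String disp; // display text, e.g. "{taxableArea}"

        InlineCode(int id, String disp) {
            this.id = id;
            this.disp = disp;
        }
    }

    private final boolean finalTarget; // "final" itself is a reserved word in Java

    MergeModeSketch() {
        this(true); // strict by default, as proposed
    }

    MergeModeSketch(boolean finalTarget) {
        this.finalTarget = finalTarget;
    }

    // Returns the source code to merge the target code with, or null if there is none.
    InlineCode findSourceMatch(InlineCode target, List<InlineCode> sourceCodes) {
        for (InlineCode s : sourceCodes) {
            if (s.id == target.id) {
                return s; // an id match is accepted in both modes
            }
        }
        if (finalTarget) {
            return null; // strict: no id match means no match at all
        }
        // ASSUMED lenient rule: fall back to codes that display the same text.
        for (InlineCode s : sourceCodes) {
            if (s.disp.equals(target.disp)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<InlineCode> source = Arrays.asList(new InlineCode(1, "{taxableArea}"));
        InlineCode target = new InlineCode(2, "{taxableArea}");

        InlineCode strict = new MergeModeSketch().findSourceMatch(target, source);
        InlineCode lenient = new MergeModeSketch(false).findSourceMatch(target, source);

        System.out.println("strict:  " + (strict == null ? "no match" : "id " + strict.id));   // strict:  no match
        System.out.println("lenient: " + (lenient == null ? "no match" : "id " + lenient.id)); // lenient: id 1
    }
}
```

With the flag left at its default, the target code with id="2" is not paired with the source code with id="1", so its id and attributes would be left alone. With the flag set to false, the assumed fallback pairs them, which is the kind of result shown under "Merged".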

Jim

jim

Sep 15, 2021, 2:55:36 PM
to okapi...@googlegroups.com, krsk...@gmail.com
The problem is that we have no control over any processing that happened to the target data before the merge, even for basic, non-leveraged translation. Code data could have been lost or transformed. For example, the intermediate XLIFF 1.2 could have used generic codes like <x1>, <g1></g1>, etc., which lose everything but the basic code type and ID. The code simplifier makes changes as well.

Theoretically we should be able to match on id and basic code type (OPEN, CLOSE, PLACEHOLDER). But some unit tests fail when we make the code matching this restrictive. I would need to track down each case and find out why.
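
A small sketch of what matching on id and basic code type could look like. The TagType enum, the InlineCode class and the matches method are hypothetical illustrations using the type names from this thread; they are not Okapi's classes. The point is that the code data is deliberately ignored, so a code whose data was replaced by a generic tag still matches.

```java
// Hypothetical sketch of a strict match rule: same id and same basic code type.
final class StrictCodeMatchSketch {

    // Basic code types as named in this thread (hypothetical enum).
    enum TagType { OPEN, CLOSE, PLACEHOLDER }

    // Hypothetical stand-in for an inline code.
    static final class InlineCode {
        final int id;
        final TagType tagType;
        final String data; // native markup; may be lost or rewritten downstream

        InlineCode(int id, TagType tagType, String data) {
            this.id = id;
            this.tagType = tagType;
            this.data = data;
        }
    }

    // Strict rule: ids and tag types must both be equal; data is not compared.
    static boolean matches(InlineCode a, InlineCode b) {
        return a.id == b.id && a.tagType == b.tagType;
    }

    public static void main(String[] args) {
        InlineCode original = new InlineCode(1, TagType.OPEN, "<b>");
        InlineCode generic = new InlineCode(1, TagType.OPEN, "<g1>");    // data replaced by a generic code
        InlineCode wrongType = new InlineCode(1, TagType.CLOSE, "</b>"); // same id, different type
        InlineCode renumbered = new InlineCode(2, TagType.OPEN, "<b>");  // same data, different id

        System.out.println(matches(original, generic));    // true
        System.out.println(matches(original, wrongType));  // false
        System.out.println(matches(original, renumbered)); // false
    }
}
```

The last case is why this rule only works if nothing in the pipeline renumbers codes: identical data does not rescue a changed id.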

Jim

jim

Sep 16, 2021, 1:10:20 PM
to okapi...@googlegroups.com, krsk...@gmail.com
We decided on the following in our Okapi meeting:

  1. Add a new boolean (edited, final, etc.) to TextUnitMerger that is true by default.
  2. The default logic will *only* match on id/originalId and tagType (OPEN, CLOSE, PLACEHOLDER).
  3. If final=false, the less strict code-matching logic will be used (to support leverage cases, etc.).
  4. Fix any filters (TMX, XLIFF, XLIFF2) that break the contract by changing code IDs.
  5. Go through all code and guarantee that code IDs are *not* changed at any point in the extraction/merge pipeline.
  6. Document this contract in a way that is clear to all developers.
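
One way the "code IDs are not changed" contract could be checked in a unit test is to compare the code ids of the same text unit before and after a pipeline step. This is only a sketch under that assumption: CodeIdContractSketch and violations are hypothetical names, not existing Okapi code, and ids are modelled as plain integers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical check that a pipeline step neither drops nor invents code ids.
final class CodeIdContractSketch {

    // Compares the ids seen before and after a step and describes every difference.
    static List<String> violations(List<Integer> before, List<Integer> after) {
        // Sets are used because an opening and a closing code may share one id.
        Set<Integer> missing = new LinkedHashSet<>(before);
        missing.removeAll(after);
        Set<Integer> unexpected = new LinkedHashSet<>(after);
        unexpected.removeAll(before);

        List<String> problems = new ArrayList<>();
        for (Integer id : missing) {
            problems.add("missing id " + id);
        }
        for (Integer id : unexpected) {
            problems.add("unexpected id " + id);
        }
        return problems;
    }

    public static void main(String[] args) {
        // A step that keeps ids intact, even if the code order changes.
        System.out.println(violations(Arrays.asList(1, 1, 2), Arrays.asList(2, 1, 1))); // []

        // A step that renumbers a code from 2 to 3.
        System.out.println(violations(Arrays.asList(1, 1, 2), Arrays.asList(1, 1, 3))); // [missing id 2, unexpected id 3]
    }
}
```

An empty result means the step honoured the contract for that text unit. Anything else points at the filter or step that needs fixing.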


Jim
