Including word count information in generated XLIFF 2 file


Martin Wunderlich

Oct 28, 2015, 11:45:32 AM
to okapi-devel
Hi all,

I was wondering whether there is a way to include word count information in an XLIFF 2 file that is created as part of an Okapi Rainbow pipeline. It seems the XLIFF2PackageWriter class does not support this, but I might be missing something. As a workaround, I could generate the scoping report with the respective step and add the word count information myself, but maybe there is a better way.
Thanks a lot.

Cheers,

Martin
 

Aaron Madlon-Kay

Oct 29, 2015, 3:26:43 AM
to okapi-devel
Our word/character count steps are based on GMX-V 2.0, which defines XML that can be embedded into XLIFF.
https://docbox.etsi.org/isg/open/isglis/gmx-v/gmx-v/gmx-v-2.0.html

However I don't believe this is actually implemented anywhere. Right now the counts are only stored as annotations to be consumed by other steps in the pipeline such as the Scoping Report Step.

-Aaron


Yves Savourel

Oct 29, 2015, 1:31:07 PM
to okapi...@googlegroups.com

Hi Martin,

As Aaron already answered, there is a way to get the word count as annotations, but none of the XLIFF writers currently map that to an output. I don't know of anyone planning to implement this at this time.

Cheers,

-ys


Chase Tingley

Oct 29, 2015, 1:33:22 PM
to okapi...@googlegroups.com
Slightly OT - I know the GMX-V spec talks quite a bit about how it should be applied to XLIFF 1.2 as a model, but is there any kind of documented best practice (from either GMX-V or the XLIFF TC) on how to best serialize this data?  (Also interested in the right way to do this for XLIFF 2.)

Yves Savourel

Oct 29, 2015, 3:39:12 PM
to okapi...@googlegroups.com
Not that I know of.
The only implementation of GMX-V I have seen is in the XLIFF 1.2 files produced by XTM (see example below). It includes quite a few custom types (the ones prefixed with "x-").

-ys

-------------------

<metrics:metrics source-language="en-US" target-language="fr-FR" tool-name="xml-intl_XLIFF_Extract" tool-version="1.0" version="1.0">

<metrics:stage date="20151017T122123Z" phase="initial" source-language="en-US">
<metrics:count-group name="verifiable">
<metrics:count type="TextUnitCount" value="6"/>
<metrics:count type="x-AlphanumericOnlyTextUnitCount" value="0"/>
<metrics:count type="x-MeasurementOnlyTextUnitCount" value="0"/>
<metrics:count type="x-NumericOnlyTextUnitCount" value="0"/>
<metrics:count type="x-PunctuationOnlyTextUnitCount" value="0"/>
<metrics:count type="x-ExactMatchedTextUnitCount" value="5"/>
<metrics:count type="x-RealTextUnitCount" value="0"/>
<metrics:count type="x-LeveragedMatchedTextUnitCount" value="1"/>
<metrics:count type="RepetitionMatchedTextUnitCount" value="0"/>
<metrics:count type="RepetitionMatchedTextUnitCount" category="95-99" value="0"/>
<metrics:count type="RepetitionMatchedTextUnitCount" category="85-94" value="0"/>
<metrics:count type="RepetitionMatchedTextUnitCount" category="75-84" value="0"/>
<metrics:count type="x-FuzzyMatchedTextUnitCount" category="95-99" value="0"/>
<metrics:count type="x-FuzzyMatchedTextUnitCount" category="85-94" value="0"/>
<metrics:count type="x-FuzzyMatchedTextUnitCount" category="75-84" value="0"/>
<metrics:count type="TranslatableInlineCount" value="4"/>
<metrics:count type="TotalWordCount" value="38"/>
<metrics:count type="AlphanumericOnlyTextUnitWordCount" value="0"/>
<metrics:count type="MeasurementOnlyTextUnitWordCount" value="0"/>
<metrics:count type="NumericOnlyTextUnitWordCount" value="0"/>
<metrics:count type="ExactMatchedWordCount" value="36"/>
<metrics:count type="x-RealWordCount" value="0"/>
<metrics:count type="LeveragedMatchedWordCount" value="2"/>
<metrics:count type="RepetitionMatchedWordCount" value="0"/>
<metrics:count type="RepetitionMatchedWordCount" category="95-99" value="0"/>
<metrics:count type="RepetitionMatchedWordCount" category="85-94" value="0"/>
<metrics:count type="RepetitionMatchedWordCount" category="75-84" value="0"/>
<metrics:count type="FuzzyMatchedWordCount" category="95-99" value="0"/>
<metrics:count type="FuzzyMatchedWordCount" category="85-94" value="0"/>
<metrics:count type="FuzzyMatchedWordCount" category="75-84" value="0"/>
<metrics:count type="TotalCharacterCount" value="171"/>
<metrics:count type="PunctuationCharacterCount" value="4"/>
<metrics:count type="WhiteSpaceCharacterCount" value="32"/>
<metrics:count type="AlphanumericOnlyTextUnitCharacterCount" value="0"/>
<metrics:count type="MeasurementOnlyTextUnitCharacterCount" value="0"/>
<metrics:count type="NumericOnlyTextUnitCharacterCount" value="0"/>
<metrics:count type="x-PunctuationOnlyTextUnitCharacterCount" value="0"/>
<metrics:count type="ExactMatchedCharacterCount" value="164"/>
<metrics:count type="x-RealCharacterCount" value="0"/>
<metrics:count type="LeveragedMatchedCharacterCount" value="7"/>
<metrics:count type="RepetitionMatchedCharacterCount" value="0"/>
<metrics:count type="RepetitionMatchedCharacterCount" category="95-99" value="0"/>
<metrics:count type="RepetitionMatchedCharacterCount" category="85-94" value="0"/>
<metrics:count type="RepetitionMatchedCharacterCount" category="75-84" value="0"/>
<metrics:count type="FuzzyMatchedCharacterCount" category="95-99" value="0"/>
<metrics:count type="FuzzyMatchedCharacterCount" category="85-94" value="0"/>
<metrics:count type="FuzzyMatchedCharacterCount" category="75-84" value="0"/>
</metrics:count-group>
</metrics:stage>

<metrics:stage date="20151017T122123Z" phase="translation" source-language="en-US">
<metrics:count-group name="verifiable">
<metrics:count type="x-UnitsDone" value="5"/>
<metrics:count type="x-WordsDone" value="36"/>
<metrics:count type="x-CharactersDone" value="164"/>
</metrics:count-group>
</metrics:stage>

</metrics:metrics>
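
For anyone who wants to consume a block like this, here is a minimal Python sketch that collects the counts into a dictionary keyed by phase, type and category. Note the assumptions: the excerpt does not show which namespace the "metrics" prefix is bound to, so the URI below is a made-up placeholder, and the function name collect_counts is purely illustrative. The sample is a shortened copy of the XTM block.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder URI. The excerpt does not show the namespace bound
# to the "metrics" prefix, so pass in whatever your own file declares.
METRICS_NS = "urn:example:gmx-v-placeholder"


def collect_counts(metrics_xml, ns_uri=METRICS_NS):
    """Return {(phase, type, category): value} for every count element."""
    ns = {"m": ns_uri}
    root = ET.fromstring(metrics_xml)
    counts = {}
    for stage in root.iterfind("m:stage", ns):
        phase = stage.get("phase")
        for count in stage.iterfind("m:count-group/m:count", ns):
            # category is only set on the banded entries (e.g. "95-99");
            # for all other entries it is None.
            key = (phase, count.get("type"), count.get("category"))
            counts[key] = int(count.get("value"))
    return counts


# Shortened copy of the example, with the placeholder namespace declared.
SAMPLE = """<metrics:metrics xmlns:metrics="urn:example:gmx-v-placeholder" source-language="en-US" target-language="fr-FR" tool-name="xml-intl_XLIFF_Extract" tool-version="1.0" version="1.0">
<metrics:stage date="20151017T122123Z" phase="initial" source-language="en-US">
<metrics:count-group name="verifiable">
<metrics:count type="TotalWordCount" value="38"/>
<metrics:count type="ExactMatchedWordCount" value="36"/>
<metrics:count type="LeveragedMatchedWordCount" value="2"/>
<metrics:count type="FuzzyMatchedWordCount" category="95-99" value="0"/>
</metrics:count-group>
</metrics:stage>
<metrics:stage date="20151017T122123Z" phase="translation" source-language="en-US">
<metrics:count-group name="verifiable">
<metrics:count type="x-WordsDone" value="36"/>
</metrics:count-group>
</metrics:stage>
</metrics:metrics>"""

counts = collect_counts(SAMPLE)
```

In this particular example the exact and leveraged word counts (36 + 2) add up to the TotalWordCount of 38, which makes a handy sanity check when reading such a block.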



Martin Wunderlich

Oct 29, 2015, 3:41:06 PM
to okapi-devel
I played around with this a bit today, following Aaron's suggestion, and thought about how the word count information could be embedded into an XLIFF 2 file. Since I wanted to have word counts on the segment level, and since there is no extension point for elements from other namespaces on that level, I think one could only add the information at the unit level. It might look like this, for example:

   
<file id="f1">
  <group id="g2">
    <unit id="ud7-1">
      <metrics:metrics version="2.0" source-language="" tool-name="censhare" tool-version="5.4">
        <metrics:stage phase="initial" date="" source-language="">
          <metrics:count-group name="verifiable">
            <!-- One count element per segment, matching the sequence -->
            <metrics:count type="LeveragedMatchedWordCount" value="123"/>
            <metrics:count type="TotalWordCount" value="42"/>
            ...
          </metrics:count-group>
        </metrics:stage>
      </metrics:metrics>
      <!-- Segments follow here -->


I am not sure if that would be the correct solution (if there is such a thing). This approach would add a lot of additional mark-up to the file, though, so currently I am creating an external custom structure, even though I would of course prefer to use the standard. It might be a nice extension for XLIFF 2.1 to allow a simple word count attribute on the segment level.

Cheers,

Martin

Patrice Ferrot

Feb 17, 2017, 12:00:49 PM
to okapi-devel
I realize that this thread is a bit old now, but I came across it while googling how people typically store word count information in XLIFF 2.0 files, and thought I would still share what we are planning to do. We plan to use the metadata module and go with something like what you can see below.

Best regards,
Patrice


<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
<file id="1" canResegment="no" original="withTitleAndAlt.html">
<mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=1/4</mda:meta>
<mda:meta type="word_count">4</mda:meta>
</mda:metaGroup>
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=2/7</mda:meta>
<mda:meta type="word_count">3</mda:meta>
</mda:metaGroup>
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=1/9</mda:meta>
<mda:meta type="word_count">9</mda:meta>
</mda:metaGroup>
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=4/10</mda:meta>
<mda:meta type="word_count">5</mda:meta>
</mda:metaGroup>
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=1/13</mda:meta>
<mda:meta type="word_count">0</mda:meta>
</mda:metaGroup>
</mda:metadata>
<unit id="2">
<segment id="7">
<source>Link to Google</source>
</segment>
</unit>
<unit id="4">
<segment id="10">
<source>This is the alt text</source>
</segment>
</unit>
<unit id="1">
<originalData>
<data id="d1">&lt;a href="http://www.google.com" title="[#f=1/u=2]"></data>
<data id="d2">&lt;/a></data>
<data id="d3">&lt;img src="images/myImage.gif" alt="[#f=1/u=4]"/></data>
</originalData>
<segment id="4">
<source>This is the title</source>
</segment>
<segment id="9">
<source>A basic HTML document with a link to <pc id="1__9_ph" dataRefEnd="d2" dataRefStart="d1">google.com</pc>.</source>
</segment>
<segment id="13">
<source><ph id="1__13_ph" dataRef="d3"/></source>
</segment>
</unit>
</file>
</xliff>
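
To illustrate how a consumer might read this layout back, here is a minimal Python sketch that maps each segment_ref to its word_count. The two namespace URIs and the "adsk:segment_details" category are taken from the example above; the function name read_word_counts is only illustrative, and the sample is a cut-down version of the file. It assumes every metaGroup carries exactly one segment_ref and one word_count entry.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the example file.
NS = {
    "x": "urn:oasis:names:tc:xliff:document:2.0",
    "mda": "urn:oasis:names:tc:xliff:metadata:2.0",
}


def read_word_counts(xliff_text, category="adsk:segment_details"):
    """Return {segment_ref: word_count} for every matching metaGroup."""
    root = ET.fromstring(xliff_text)
    counts = {}
    for group in root.iterfind(".//mda:metaGroup", NS):
        if group.get("category") != category:
            continue
        # Index the <mda:meta> children of this group by their type attribute.
        metas = {
            meta.get("type"): (meta.text or "").strip()
            for meta in group.iterfind("mda:meta", NS)
        }
        # ASSUMPTION: one segment_ref and one word_count per group;
        # groups missing either entry are skipped.
        if "segment_ref" in metas and "word_count" in metas:
            counts[metas["segment_ref"]] = int(metas["word_count"])
    return counts


# Cut-down version of the example file.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
<file id="1" canResegment="no" original="withTitleAndAlt.html">
<mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=2/7</mda:meta>
<mda:meta type="word_count">3</mda:meta>
</mda:metaGroup>
<mda:metaGroup category="adsk:segment_details">
<mda:meta type="segment_ref">#f=1/u=4/10</mda:meta>
<mda:meta type="word_count">5</mda:meta>
</mda:metaGroup>
</mda:metadata>
<unit id="2"><segment id="7"><source>Link to Google</source></segment></unit>
<unit id="4"><segment id="10"><source>This is the alt text</source></segment></unit>
</file>
</xliff>"""

word_counts = read_word_counts(SAMPLE)
total_words = sum(word_counts.values())
```

Keeping the counts in a file-level metadata block, keyed by fragment identifier, leaves the units and segments themselves untouched, at the cost of one lookup per segment when reading.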