Parsing XLIFF file to use g tag instead of bpt tag

Panji Wiramanik

Aug 22, 2023, 8:16:49 AM
to okapi-users
Hi, I want to extract an XLIFF file to get ordered numeric ids for the segments. Here is the example input:

<xliff
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd"
xmlns="urn:oasis:names:tc:xliff:document:1.2"
xmlns:xhtml="http://www.w3.org/1999/xhtml" version="1.2">
<file original="course" datatype="plaintext" source-language="en-US">
<body>
<trans-unit id="title">
<source>TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</source>
</trans-unit>
<trans-unit id="description">
<source>
<g id="xOkRhmcxknbMLo2w" ctype="x-html-P">
<g id="thIW-cpxPXvhHJhi" ctype="x-html-SPAN" xhtml:style="color: rgb(0, 16, 158);">It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?</g>
</g>
<g id="JHbDSAwSt8s2CnX1" ctype="x-html-P">
<g id="nytY4y568fnlSDwu" ctype="x-html-SPAN" xhtml:style="color: rgb(0, 16, 158);">Enter
<g id="cI_-uRgis1uce1o0" ctype="x-html-STRONG"> PREEN SCREEN™ SPF 50 REAPPLICATION MIST.</g> For SPF touch-ups on the go, without the fuss.
</g>
</g>
</source>
</trans-unit>
</body>
</file>
</xliff>

But in the result, the <g> tags have been changed to <bpt>/<ept> and the ids are not numbers:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2"
xmlns="urn:oasis:names:tc:xliff:document:1.2"
xmlns:okp="okapi-framework:xliff-extensions"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">
<file original="course" source-language="en" target-language="id" datatype="x-plaintext" okp:inputEncoding="UTF-8">
<body>
<trans-unit id="title">
<source xml:lang="en">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</source>
<seg-source>
<mrk mid="0" mtype="seg">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</mrk>
</seg-source>
<target xml:lang="id">
<mrk mid="0" mtype="seg">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</mrk>
</target>
</trans-unit>
<trans-unit id="description">
<source xml:lang="en">
<bpt id="xOkRhmcxknbMLo2w"></bpt>
<bpt id="thIW-cpxPXvhHJhi"></bpt>It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?
<ept id="thIW-cpxPXvhHJhi"></ept>
<ept id="xOkRhmcxknbMLo2w"></ept>
<it id="JHbDSAwSt8s2CnX1" pos="open"></it>
<it id="nytY4y568fnlSDwu" pos="open"></it>Enter
<bpt id="cI_-uRgis1uce1o0"></bpt> PREEN SCREEN™ SPF 50 REAPPLICATION MIST.
<ept id="cI_-uRgis1uce1o0"></ept> For SPF touch-ups on the go, without the fuss.
<it id="nytY4y568fnlSDwu" pos="close"></it>
<it id="JHbDSAwSt8s2CnX1" pos="close"></it>
</source>
<seg-source>
<mrk mid="0" mtype="seg">
<bpt id="xOkRhmcxknbMLo2w"></bpt>
<bpt id="thIW-cpxPXvhHJhi"></bpt>It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?
<ept id="thIW-cpxPXvhHJhi"></ept>
<ept id="xOkRhmcxknbMLo2w"></ept>
</mrk>
<mrk mid="1" mtype="seg">
<it id="JHbDSAwSt8s2CnX1" pos="open"></it>
<it id="nytY4y568fnlSDwu" pos="open"></it>Enter
<bpt id="cI_-uRgis1uce1o0"></bpt> PREEN SCREEN™ SPF 50 REAPPLICATION MIST.
<ept id="cI_-uRgis1uce1o0"></ept>
</mrk>
<mrk mid="2" mtype="seg">For SPF touch-ups on the go, without the fuss.
<it id="nytY4y568fnlSDwu" pos="close"></it>
</mrk>
<mrk mid="3" mtype="seg">
<it id="JHbDSAwSt8s2CnX1" pos="close"></it>
</mrk>
</seg-source>
<target xml:lang="id">
<mrk mid="0" mtype="seg">
<bpt id="xOkRhmcxknbMLo2w"></bpt>
<bpt id="thIW-cpxPXvhHJhi"></bpt>It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?
<ept id="thIW-cpxPXvhHJhi"></ept>
<ept id="xOkRhmcxknbMLo2w"></ept>
</mrk>
<mrk mid="1" mtype="seg">
<it id="JHbDSAwSt8s2CnX1" pos="open"></it>
<it id="nytY4y568fnlSDwu" pos="open"></it>Enter
<bpt id="cI_-uRgis1uce1o0"></bpt> PREEN SCREEN™ SPF 50 REAPPLICATION MIST.
<ept id="cI_-uRgis1uce1o0"></ept>
</mrk>
<mrk mid="2" mtype="seg">For SPF touch-ups on the go, without the fuss.
<it id="nytY4y568fnlSDwu" pos="close"></it>
</mrk>
<mrk mid="3" mtype="seg">
<it id="JHbDSAwSt8s2CnX1" pos="close"></it>
</mrk>
</target>
</trans-unit>
</body>
</file>
</xliff>

Here is the command line that I used:

tikal.sh -fc okf_autoxliff -ie UTF-8 -x test.xlf -seg languagetoolorg-srx.srx -sl en -tl id  

Is there any configuration I can use to change this? Thank you.

Chase Tingley

Aug 22, 2023, 1:10:37 PM
to Panji Wiramanik, okapi-users
Hi Panji,

Tikal currently always produces extended codes (bpt/ept) instead of <g> (this is a recent change).  It is hardcoded, but on the off chance you build the code yourself you can change "false" to "true" at this line and rebuild the applications:


You can also use Rainbow, which exposes this option as part of the "Generic XLIFF" options in the "Rainbow Translation Kit Creation" pipeline step.  You want to check "Use <g></g> and <x/> notation" like this:

Screenshot from 2023-08-22 10-09-29.png





Panji Wiramanik

Aug 22, 2023, 6:52:44 PM
to okapi-users
OK, thank you. I have rebuilt the app and it works fine now. I also have another question: can I add something like a regex filter, or maybe a subfilter, so that the sentences in a segment are split into several segments? For example, this is the current result:


<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:okp="okapi-framework:xliff-extensions" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">
<file original="course" source-language="en" target-language="id" datatype="x-plaintext" okp:inputEncoding="UTF-8" okp:configId="okf_autoxliff">

<body>
<trans-unit id="title">
<source xml:lang="en">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</source>
<target xml:lang="id">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</target>
</trans-unit>
<trans-unit id="description">
<source xml:lang="en">It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup? Enter  PREEN SCREEN™ SPF 50 REAPPLICATION MIST. For SPF touch-ups on the go, without the fuss.</source>
<target xml:lang="id">It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup? Enter  PREEN SCREEN™ SPF 50 REAPPLICATION MIST. For SPF touch-ups on the go, without the fuss.</target>
</trans-unit>
</body>
</file>
</xliff>

and I would like it to be like this:


<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:okp="okapi-framework:xliff-extensions" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">
<file original="course" source-language="en" target-language="id" datatype="x-plaintext" okp:inputEncoding="UTF-8" okp:configId="okf_autoxliff">
<body>
<trans-unit id="title">
<source xml:lang="en">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</source>
<target xml:lang="id">TH [SEA Product 101] PREEN SCREEN™ SPF 50 REAPPLICATION MIST</target>
</trans-unit>
<trans-unit id="description">
<source xml:lang="en">It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?</source>
<target xml:lang="id">It's the SPF dilemma we all face on a daily basis: How do I reapply my sunscreen effortlessly AND without distrupting my makeup?</target>
</trans-unit>
<trans-unit id="description">
<source xml:lang="en">Enter  PREEN SCREEN™ SPF 50 REAPPLICATION MIST.</source>
<target xml:lang="id">Enter  PREEN SCREEN™ SPF 50 REAPPLICATION MIST.</target>
</trans-unit>
<trans-unit id="description">
<source xml:lang="en">For SPF touch-ups on the go, without the fuss.</source>
<target xml:lang="id">For SPF touch-ups on the go, without the fuss.</target>
</trans-unit>
</body>
</file>
</xliff>

Chase Tingley

Aug 22, 2023, 7:40:19 PM
to Panji Wiramanik, okapi-users
Something like this will work:

tikal.sh -fc okf_xliff test.xlf -s -seg <your srx file>

The output will go to test.out.xlf.  However, two caveats:
  1. it seems like this may copy source to target, so you would need to remove that later.
  2. it encodes the segmented data in <seg-source>.
If you want it split into individual <trans-unit> instead, you will need to use a Rainbow pipeline like:
- Raw Document to Filter Events
- Segmentation Step 
- Segments to Text Units Converter 
- Rainbow Translation Kit Creation

Unfortunately, the segmentation step seems to have a bug where it overrides the "Use <g> notation" setting from my previous email, so if you go that route you will end up with bpt/ept.

ct
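
A note on the first caveat above: if the segmented output does contain copied <target> elements, one way to drop them afterwards is a small post-processing script. This is only a minimal sketch, not part of Okapi or Tikal. The function name strip_targets and the sample string are made up for illustration, and it assumes an XLIFF 1.2 file in which every <target> should be removed.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2, as declared in the files shown in this thread.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def strip_targets(xliff_text: str) -> str:
    """Return the XLIFF text with every <target> element removed.

    Hypothetical helper for illustration; it removes all targets,
    whether or not they are copies of the source.
    """
    # Keep XLIFF as the default namespace when serializing.
    # Other prefixes (okp, its, ...) would need their own
    # register_namespace calls to keep their original names.
    ET.register_namespace("", XLIFF_NS)
    root = ET.fromstring(xliff_text)
    # Materialize the list first so the tree is not modified while iterating.
    for unit in list(root.iter(f"{{{XLIFF_NS}}}trans-unit")):
        for target in unit.findall(f"{{{XLIFF_NS}}}target"):
            unit.remove(target)
    return ET.tostring(root, encoding="unicode")


# Tiny made-up sample in the shape of the Tikal output above.
sample = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="course" source-language="en" target-language="id" datatype="x-plaintext">
<body>
<trans-unit id="title">
<source xml:lang="en">Hello</source>
<target xml:lang="id">Hello</target>
</trans-unit>
</body>
</file>
</xliff>"""

cleaned = strip_targets(sample)
print(cleaned)
```

For a real file you would read test.out.xlf, pass its content to the function, and write the result back out.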


Panji Wiramanik

Aug 23, 2023, 12:13:10 AM
to okapi-users
Does the SRX file need to be in XML format? I ask because I have this regex:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

It contains the '<' symbol, which causes an error.
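
To see where this regex would break the sample text from earlier in the thread, here is a small sketch using Python's re module. Python is used only for illustration. The engine that evaluates SRX rules in Okapi may treat some constructs differently, so this shows the intent of the pattern rather than guaranteed Tikal behavior.

```python
import re

# The regex from the post: match a whitespace character that follows "." or "?",
# unless the preceding text looks like an abbreviation such as "e.g." or "Mr.".
BREAK_AFTER = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

text = (
    "It's the SPF dilemma we all face on a daily basis: How do I reapply my "
    "sunscreen effortlessly AND without distrupting my makeup? Enter  PREEN "
    "SCREEN™ SPF 50 REAPPLICATION MIST. For SPF touch-ups on the go, "
    "without the fuss."
)

# Splitting on the matched whitespace yields one string per sentence.
segments = BREAK_AFTER.split(text)
for number, segment in enumerate(segments, start=1):
    print(number, segment)
```

On this text the pattern splits after "makeup?" and after "MIST.", which gives the three sentences from the desired output.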

Panji Wiramanik

Aug 23, 2023, 12:19:15 AM
to okapi-users
Never mind, it needs to be in ASCII characters.

Thank you.
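
For anyone who hits the same error: SRX is an XML format, and a raw '<' is not allowed in XML text content, so the lookbehind syntax has to be written with &lt; inside the file. The sketch below shows the escaping with Python's standard library. The <pattern> wrapper element is a made-up name used only to check the round trip; it is not an SRX element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The regex from the earlier post, exactly as typed.
regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

# escape() replaces "&", "<" and ">" with their XML entities.
escaped = escape(regex)
print(escaped)

# Round trip: an XML parser turns the entities back into the original regex.
# "pattern" is a hypothetical wrapper element used only for this check.
parsed = ET.fromstring(f"<pattern>{escaped}</pattern>").text
print(parsed == regex)
```

The escaped form is what goes into the rule text of the SRX file. The parser hands the original regex back to the segmenter.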