Extract surrounding external XML tag as resname to XLIFF

Marc

Feb 2, 2022, 12:22:31 PM
to okapi-users
Hi all,

as far as I can see, there is currently no way to pass the name of a
surrounding XML tag that is declared as an external (structure) tag
into the generated XLIFF, for example in the resname attribute. Is
this correct?

As an example, the following XML document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
        <SITENODE>
        <label>Linear Motion Technology</label>
        </SITENODE>
</root>

should create an XLIFF like this (all tags in the XML are defined as
structure tags in the XML-to-XLIFF conversion):

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
xmlns:okp="okapi-framework:xliff-extensions"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">
<file original="test.xml" source-language="en" target-language="de"
datatype="xml">
<body>
<trans-unit id="1" resname="label">
<source xml:lang="en">Linear Motion Technology</source>
<seg-source><mrk mid="0" mtype="seg">Linear Motion Technology</mrk></seg-source>
<target xml:lang="de"></target>
</trans-unit>
</body>
</file>
</xliff>

Is something like this possible, with either XLIFF 1.2 or 2.0 as output?

best

Marc


yves.s...@gmail.com

Feb 2, 2022, 12:39:21 PM
to okapi-users
Hi Marc,

Actually you may (not sure) be able to do this.
The XML filter supports ITS 2 and there is a way to tell what the value of an ID will be based on an XPath.
See https://okapiframework.org/wiki/index.php/XML_Filter#idValue_and_xml:id for details.
I don't have time to try it right now, but it's the direction I would look into: somehow you may be able to tell the filter that the value should be the name of the parent element.
It's very unusual though and it means the text should always be extracted from elements that have different names.

-ys
--
You received this message because you are subscribed to the Google Groups "okapi-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-users/6a796b66-b2c3-ef0e-53b5-51fe17811951%40marcmittag.de.

Chase Tingley

Feb 2, 2022, 12:40:27 PM
to Yves Savourel, okapi-users
The HTML filter will generate a restype based on the surrounding element, but the XML Stream filter for some reason does not. I am not sure why this is.

Marc

Feb 2, 2022, 1:02:07 PM
to Chase Tingley, Yves Savourel, okapi-users

Hi Yves, hi Chase,

thank you very much for the quick answer!

On 02.02.22 18:40, Chase Tingley wrote:
> The HTML filter will generate a restype based on the surrounding element, but the XML Stream filter for some reason does not. I am not sure why this is.

Probably because in HTML you can define which element maps to which value of the restype attribute (the available values are predefined in the XLIFF spec).

With XML you cannot do that.

Therefore it would be good to have a way to set the resname accordingly.


On Wed, Feb 2, 2022 at 9:39 AM <yves.s...@gmail.com> wrote:

> Actually you may (not sure) be able to do this.
> The XML filter supports ITS 2 and there is a way to tell what the value of an ID will be based on an XPath.
> See https://okapiframework.org/wiki/index.php/XML_Filter#idValue_and_xml:id for details.

By "ID" you mean that the ID of the trans-unit would be set to the value of the XPath, right?

Unfortunately this is not a solution, since we need (as you pointed out) unique IDs, and many segments in the XML will have the same structure tags surrounding them.

The purpose of getting the external tag information is to pass it to translate5, so that we can show it to the translator as context information, as many other CAT tools do.

So it is not possible right now?

If not: I might be able to convince our clients to fund such a development. Do you know someone who could do it for funding?

best

Marc


Chase Tingley

Feb 2, 2022, 9:23:32 PM
to Marc, Yves Savourel, okapi-users
Aha, so the reason this works for HTML is that the okf_html filter defines the elementType attribute for the elements in its config.

So you can do this with the xmlstream filter if you know the name of the element.  For example, this config:

assumeWellformed: true
preserve_whitespace: true
attributes:
  xml:lang:
    ruleTypes: [ATTRIBUTE_WRITABLE]
  xml:id:
    ruleTypes: [ATTRIBUTE_ID]
  id:
    ruleTypes: [ATTRIBUTE_ID]
  xml:space:
    ruleTypes: [ATTRIBUTE_PRESERVE_WHITESPACE]
    preserve: ['xml:space', EQUALS, preserve]
    default: ['xml:space', EQUALS, default]
elements:
  label:
    ruleTypes: [TEXTUNIT]
    elementType: label

Would produce a restype of "label" when run on your sample.

However, there are two obvious caveats:
1. this is setting restype, not resname.  (I think this is more appropriate since it's based on element type, however.)
2. it's not a generic feature, you have to specify it for every element you want to have this behavior.

An option to enable generic behavior to solve #2 would be possible.

Marc

Feb 7, 2022, 1:53:32 PM
to Chase Tingley, okapi-users

Hi Chase,

thank you, this solution sounds very interesting.

I just tried it, but it did not work and I do not know why.

I tried it using Rainbow with Okapi 1.41 locally on my Ubuntu.

I have this fprm for the xml stream filter:

global_cdata_subfilter: okf_regex@translate5-exclude-cdata
assumeWellformed: true
preserve_whitespace: false


attributes:
  xml:lang:
    ruleTypes: [ATTRIBUTE_WRITABLE]
  xml:id:
    ruleTypes: [ATTRIBUTE_ID]
  id:
    ruleTypes: [ATTRIBUTE_ID]
  xml:space:
    ruleTypes: [ATTRIBUTE_PRESERVE_WHITESPACE]
    preserve: ['xml:space', EQUALS, preserve]
    default: ['xml:space', EQUALS, default]
elements:

  LINK:
    ruleTypes: [INLINE]
  FORMAT:
    ruleTypes: [TEXTUNIT]
    elementType: FORMAT


This is the XML I'm processing:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE FIRSTspirit_XML_EXPORT SYSTEM "TranslationXml.dtd">
<FIRSTspirit_XML_EXPORT sourceLanguage="EN" version="5.2.200508.79058">
    <PAGENODE revision="33558" uid="haegglunds" uidType="PAGESTORE">
        <PAGENODE revision="33558" uid="haegglunds_products" uidType="PAGESTORE">
            <PAGE revision="52459" uid="standard_page_1" uidType="PAGESTORE">
                <BODY name="main_content">
                    <SECTION id="307797" name="text_modules_1" templateid="41">
                        <CMS_VALUE name="st_text">                                                         
                            <FORMAT name="h2">Power for productivity</FORMAT>                                     
                         </CMS_VALUE>
                    </SECTION>
                </BODY>
            </PAGE>
        </PAGENODE>
    </PAGENODE>
</FIRSTspirit_XML_EXPORT>


And this is the xliff I'm getting:


<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:okp="okapi-framework:xliff-extensions" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">

<file original="xxxxx.xml" source-language="en-US" target-language="de-DE" datatype="xml">
<body>
<trans-unit id="tu1">
<source xml:lang="en-US">Power for productivity</source>
<seg-source><mrk mid="0" mtype="seg">Power for productivity</mrk></seg-source>
<target xml:lang="de-DE"></target>
</trans-unit>
</body>
</file>
</xliff>
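
For larger files, a quick way to confirm whether any trans-unit carries a restype is a small script. This is only a sketch using Python's standard library; the function name list_restypes and the file name in the usage comment are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace, as declared in the output above.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def list_restypes(root):
    """Return (id, restype) for every trans-unit; restype is None if absent."""
    return [
        (tu.get("id"), tu.get("restype"))
        for tu in root.iterfind(".//x:trans-unit", XLIFF_NS)
    ]


# Usage with a hypothetical output file name:
#   root = ET.parse("xxxxx.xml.xlf").getroot()
#   for unit_id, restype in list_restypes(root):
#       print(unit_id, restype)
```

For the XLIFF above this would report tu1 with no restype, which matches what I see by eye.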


What am I missing?


Thank you very much in advance

Marc

jim

Feb 7, 2022, 2:08:38 PM
to Marc, Chase Tingley, okapi-users
Marc,

We compare in lower case (I believe), so try "format".

Jim

Marc

Feb 7, 2022, 4:01:03 PM
to jim, Chase Tingley, okapi-users

Thank you Jim, thank you Chase,

that works! Lower case did the trick. It's a bit confusing that the tag has to be written in lowercase in the config even though it appears in uppercase in the XML.

Anyway, now I know.
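
Since every element has to be listed by hand, the entries could be generated. The following is a minimal sketch, assuming the lowercase-key behavior seen here with Okapi 1.41 and the ruleTypes/elementType layout from Chase's config; the helper name element_rules is hypothetical and the output still needs to be pasted into the fprm manually.

```python
def element_rules(names):
    """Build an 'elements:' section with one TEXTUNIT rule per element name.

    Keys are lowercased (assumption: the filter compares lowercase names);
    elementType keeps the name as given.
    """
    lines = ["elements:"]
    for name in sorted(set(names)):
        lines.append(f"  {name.lower()}:")
        lines.append("    ruleTypes: [TEXTUNIT]")
        lines.append(f"    elementType: {name}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(element_rules(["FORMAT", "label"]))
```

Names that differ only in case would collapse into the same key, so such a list should be checked before use.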

Thank you!

Unfortunately, something like this does not work in the fprm:

  format:
    ruleTypes: [TEXTUNIT]
    elementType: h1
    conditions:
    - name
    - EQUALS
    - [h1]
  format:
    ruleTypes: [TEXTUNIT]
    elementType: p
    conditions:
    - name
    - EQUALS
    - [p]

That is, to define a different elementType for the same tag depending on the value of an attribute.

It does not work because Rainbow silently deletes the first format definition.

Probably there is no solution for this at the moment, right?

This is a rare case, I must admit, and we could solve it by pre-processing the XML before we send it to Okapi and post-processing it after we get it back for the export.
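
Such a pre- and post-processing step might look roughly like the sketch below. It is not tested against Okapi: the separator "__", the function names specialize and restore, and the idea of listing the renamed elements (for example format__h2) in the fprm are all assumptions. Note also that Python's ElementTree drops the DOCTYPE and comments when re-serializing, so a real implementation would have to preserve those separately.

```python
import re
import xml.etree.ElementTree as ET

SEP = "__"  # hypothetical separator; must not occur in real element names
# Only attribute values that are safe inside an XML element name are used.
NAME_PART = re.compile(r"[A-Za-z0-9_.-]+\Z")


def specialize(root, tag="FORMAT", attr="name"):
    """Pre-processing: rename <FORMAT name="h2"> to <FORMAT__h2 name="h2">."""
    for el in list(root.iter(tag)):
        value = el.get(attr, "")
        if NAME_PART.match(value):
            el.tag = f"{tag}{SEP}{value}"
    return root


def restore(root, tag="FORMAT"):
    """Post-processing: rename FORMAT__* elements back to FORMAT."""
    prefix = tag + SEP
    for el in list(root.iter()):
        if isinstance(el.tag, str) and el.tag.startswith(prefix):
            el.tag = tag
    return root
```

The name attribute is left untouched, so restoring only has to strip the suffix from the element name.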

So now I have a solution and can talk to the client.

And yes, a config option in Okapi that (if active) sets the restype by default would be great. Maybe we can find a way to implement this or support its development sooner or later.

best

Marc

jim

Feb 8, 2022, 1:00:09 PM
to Marc Mittag (MittagQI), Marc, Chase Tingley, okapi-users
I'm still not sure why we need the lowercase in the config. I recently revamped the code and normalized the string compares. I must have missed one.

Logging a ticket is the best hope for changes. A code pull request would get much more attention. Either is appreciated.

Jim

Marc

Feb 8, 2022, 1:23:58 PM
to jim, Marc, Chase Tingley, okapi-users

Hi Jim,

I was working with Okapi Rainbow 1.41.

So maybe the behavior has changed with 1.42? I did not download that one yet.

best

Marc

Marc

Mar 19, 2022, 7:03:12 AM
to Chase Tingley, Yves Savourel, okapi-users

Hi Chase,

thank you again for the answer below!

Do I understand correctly that there is currently no way to do this with the XML-ITS filter? It would have to be developed, right?

best

Marc

On 03.02.22 03:23, Chase Tingley wrote: