Segmentation rules should respect content inside of things protected by codefinder

Marc Mittag

Jun 11, 2024, 2:48:44 AM
to okapi-users
Dear all,

I have the following XML snippet:

<p id="610">Dies ist ein Test. &lt;Produktname&gt; die physikalische
Einheit, in der &lt;Distance&gt; anzugeben ist, kann mit PUN? abgefragt
werden.</p>

I parse it with the XML-ITS filter, using the following codeFinder:

<okp:codeFinder useCodeFinder="yes">#v1
count.i=1
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
</okp:codeFinder>


Now I do not get a new segment after "Test. ", because
"&lt;Produktname&gt;" is protected as an inline tag by the codeFinder and
the next word is lowercase.
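
For illustration, here is a minimal Python sketch of what rule0 matches once the XML entities are resolved (&lt; and &gt; become < and >). It uses Python's re module, not Okapi's Java regex engine, on the assumption that both treat this pattern the same way; the variable names are made up for the example.

```python
import re

# rule0 from the codeFinder above, with &lt; / &gt; unescaped.
RULE0 = re.compile(r"<(/?)\w+[^>]*?>")

# Paragraph text after XML entity resolution (line breaks joined).
text = ("Dies ist ein Test. <Produktname> die physikalische Einheit, "
        "in der <Distance> anzugeben ist, kann mit PUN? abgefragt werden.")

# Spans the rule would protect as inline codes.
codes = [m.group(0) for m in RULE0.finditer(text)]
print(codes)

# Plain text before and right after the first protected span.
first = RULE0.search(text)
before = text[:first.start()]
after = text[first.end():first.end() + 5]
print(repr(before), repr(after))
```

With the code hidden, the text after "Test. " continues with the lowercase word "die", which would explain why no break is made there.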

Can I somehow get two segments in this case, while in a case
like

<p id="610">Dies ist ein Test. die physikalische Einheit, in der
&lt;Distance&gt; anzugeben ist, kann mit PUN? abgefragt werden.</p>

I would still get one?
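
To make the behavior I am after concrete, here is a toy Python sketch. It is not an Okapi or SRX configuration, and the names (CODE, BREAK, segment) are hypothetical; it only models the rule "break after sentence-final punctuation when the next thing is an uppercase letter or a protected code".

```python
import re

# Same shape as rule0, written without a capture group.
CODE = r"</?\w+[^>]*?>"

# Whitespace after . ! or ? is a break point only when it is followed
# by an uppercase letter or by something the code rule would protect.
BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ]|" + CODE + r")")

def segment(text):
    """Split text at the break points; whitespace stays with the left segment."""
    segments, start = [], 0
    for m in BREAK.finditer(text):
        segments.append(text[start:m.end()])
        start = m.end()
    segments.append(text[start:])
    return segments

with_code = ("Dies ist ein Test. <Produktname> die physikalische Einheit, "
             "in der <Distance> anzugeben ist, kann mit PUN? abgefragt werden.")
without_code = ("Dies ist ein Test. die physikalische Einheit, in der "
                "<Distance> anzugeben ist, kann mit PUN? abgefragt werden.")

print(len(segment(with_code)))
print(len(segment(without_code)))
```

In this model the first paragraph splits after "Test. " because a code follows, while the second stays whole because a lowercase word follows; "PUN? abgefragt" never triggers a break in either case.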

best

Marc

Álvaro Mira del Amo

Jun 17, 2024, 6:13:28 PM
to okapi-users
Hi Marc,

If possible, could you share the full XML file, your filter configuration, and the steps in your pipeline (especially the segmentation)?
I think those details would be useful for anybody who wants to try to help you.

I took your details and created a basic filter with the codeFinder rule, then ran the translation kit creation pipeline with segmentation enabled and the default segmentation configuration. This is the result: two segments for the first TU, one for the second TU.

Is that what you were trying to achieve?

Marc.png