Segmentation rules should respect content inside of things protected by codefinder

Marc Mittag

Jun 11, 2024, 2:48:44 AM

to okapi-users

Dear all,

I have the following XML snippet:

<p id="610">Dies ist ein Test. <Produktname> die physikalische
Einheit, in der <Distance> anzugeben ist, kann mit PUN? abgefragt
werden.</p>

I parse it with the xml-its filter, using the following codeFinder:

<okp:codeFinder useCodeFinder="yes">#v1
count.i=1
rule0=<(/?)\w+[^>]*?>
</okp:codeFinder>

Now I do not get a new segment after "Test. ", because
"<Produktname>" is protected as an inline tag by the code finder and
the next word is lowercase.

Can I somehow get two segments in this case, while in a case
like

<p id="610">Dies ist ein Test. die physikalische Einheit, in der
<Distance> anzugeben ist, kann mit PUN? abgefragt werden.</p>

I would still get one?
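
For anyone who wants to check what rule0 matches outside Okapi, here is a minimal Python sketch. It assumes that Python's `re` module handles this simple pattern the same way as the Java regex engine used by Okapi. The helper names `protected_spans` and `first_char_after` are hypothetical and not part of Okapi. The sketch only mirrors the situation described above: it does not run Okapi's segmenter or any SRX rules.

```python
import re

# Rule copied verbatim from the codeFinder configuration in the post.
RULE0 = re.compile(r"<(/?)\w+[^>]*?>")

# The two paragraph contents from the post, joined onto one line each.
FIRST = ("Dies ist ein Test. <Produktname> die physikalische Einheit, "
         "in der <Distance> anzugeben ist, kann mit PUN? abgefragt werden.")
SECOND = ("Dies ist ein Test. die physikalische Einheit, "
          "in der <Distance> anzugeben ist, kann mit PUN? abgefragt werden.")


def protected_spans(text):
    """Return every substring that the rule matches (hypothetical helper)."""
    return [m.group(0) for m in RULE0.finditer(text)]


def first_char_after(text, marker="Test. "):
    """Return the first non-space character after `marker`,
    ignoring anything the rule matches (hypothetical helper)."""
    rest = text.split(marker, 1)[1]
    return RULE0.sub("", rest).lstrip()[0]


if __name__ == "__main__":
    for sample in (FIRST, SECOND):
        print(protected_spans(sample), repr(first_char_after(sample)))
```

If the protected spans are ignored, the first character after "Test. " is a lowercase "d" in both samples. Under that assumption, a rule that looks only at the following letter cannot tell the two cases apart.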

best

Marc

Álvaro Mira del Amo

Jun 17, 2024, 6:13:28 PM

to okapi-users

Hi Marc,

If possible, could you share the full XML file, your filter configuration, and the steps included in your pipeline (especially the segmentation step)?

I think those details would be useful for anybody who wants to try to reproduce this and help you.

I took your details and created a basic filter with the code finder rule, then ran the translation kit creation pipeline with segmentation enabled and the default segmentation configuration. This is the result: two segments for the first TU, one for the second TU.

Is that what you were trying to achieve?
