Okapi HTML filter ignores OmegaT option "Remove leading/trailing tags"


Manuel Souto Pico

Feb 24, 2022, 7:06:10 PM
to okapi-users
Dear colleagues,

I would like to give feedback about the Okapi HTML filter, which seems to have a problem with leading/trailing tags. 

I am translating an HTML file in OmegaT; you can see a sample here: https://jsfiddle.net/a7kut1cm/1/

I would like to use the Okapi HTML filter because it creates an ID for every paragraph, which I need as unique context for alternative translations.

However, I can see the Okapi HTML filter does not remove leading/trailing paired tags.

For example, these tree nodes

<label for="answer">Strongly agree</label>
<br>
<a href="#">Click here to continue.</a>

become these two segments:
  • <g1>Strongly agree</g1>
  • <g1>Click here to continue.</g1>
In contrast, the default OmegaT "HTML and XHTML" filter produces a much cleaner result:
  • Strongly agree
  • Click here to continue.
The option "Remove leading and trailing tags" in Project Settings > File Filters is checked.

It seems the OmegaT filter observes that preference but the Okapi filter ignores it.
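
To make the expected behaviour concrete, here is a minimal Python sketch of what "Remove leading and trailing tags" amounts to for the segments above. It works on the `<g1>` shorthand shown in this thread; the function name `strip_outer_pair` and its flag are hypothetical and are not part of OmegaT or Okapi.

```python
import re

# One paired tag such as <g1>...</g1> that encloses the entire segment.
_OUTER_PAIR = re.compile(r"<g(\d+)>(.*)</g\1>", re.DOTALL)


def strip_outer_pair(segment: str, remove_leading_trailing: bool = True) -> str:
    """Return the segment without a tag pair that wraps all of its content."""
    if not remove_leading_trailing:
        # Option unchecked: show the segment exactly as extracted.
        return segment
    match = _OUTER_PAIR.fullmatch(segment)
    if match is None:
        return segment
    inner = match.group(2)
    # If the same closing tag also occurs inside, the opening tag is closed
    # early and the pair does not wrap the whole segment, so keep everything.
    if f"</g{match.group(1)}>" in inner:
        return segment
    return inner


print(strip_outer_pair("<g1>Strongly agree</g1>"))
print(strip_outer_pair("<g1>Click here to continue.</g1>"))
```

A segment where the tags sit inside the sentence, or where several pairs follow each other, is returned unchanged, so the translator can still move those tags.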

Normally the Okapi filter does a better job than the default OmegaT filter, but in this case it seems to be the opposite.

Shall I create a ticket for this?

I can provide a sample project if anyone wants to test it.

Cheers, Manuel

jim

Feb 24, 2022, 9:17:19 PM
to Manuel Souto Pico, okapi-users

This would be a feature request. The HTML filter has never done this type of cleanup. However, we do have steps like CodeSimplifier that would do this (not part of the actual filter). I'm not familiar with the OmegaT integration, but it should be possible to add something like this and use the OmegaT option you describe.

Jim

--
You received this message because you are subscribed to the Google Groups "okapi-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-users/CABm46baLQivcy-16mfGK4A%3DrWZQZ%3D_urd4HRojYxyC%3D%2B%3DipRFQ%40mail.gmail.com.

Manuel Souto Pico

Feb 25, 2022, 5:17:53 AM
to jim, okapi-users
Thank you for your prompt reply, Jim.

My understanding was that the filter hides leading and trailing tags depending on that option (which is generic for all filters). I thought that option would be understood by any filter but I've just had a look with the Okapi OpenXML filter and it seems the filter (using default options) hides leading/trailing tag pairs regardless of whether that option is checked or not.

Which means, I guess, that the option only applies to OmegaT default filters?

This is the option we're talking about:
image.png

This cleanup is something that the Okapi OpenXML filter already does (and better than the OmegaT OpenXML filter), so perhaps the Okapi HTML filter can "learn" from its OpenXML sibling? Perhaps that part of the filter is recyclable?

Just to clarify why this is necessary, just in case: leading and trailing tags will always need to appear in the same position in the translation, because they apply to the whole segment, therefore the translator doesn't need to see them and position them elsewhere.

I can create the RFE but I'll wait a few days to see whether any more relevant info is added in this thread.

Cheers, Manuel

Mihai Nita

Mar 6, 2022, 4:08:38 PM
to Manuel Souto Pico, jim, okapi-users
I am not even sure it is safe to break something like this into two text units, and remove the tags.

    <label for="answer">Strongly agree</label>
    <br>
    <a href="#">Click here to continue.</a>

All tags here (label, br, a) are internal tags.
They can be inside a sentence.
So a translator should be free to move them around.

Imagine I want this:
   Red
   T-Shirt

In Romanian that would be
   Tricou
   Roșu

So I have to be able to move the text around, including the tags.

I would really expect this to be extracted and presented to the translators like this:
    <g>Strongly agree</g><x2><g3>Click here to continue.</g>

Which is 



jim

Mar 7, 2022, 9:20:12 AM
to Mihai Nita, Manuel Souto Pico, okapi-users

The filter's job is to produce the most complete output - so that other subsequent steps have full access to all information. This is why most filters *do not* do cleanup. OpenXml and IDML are exceptions in that the codes they produce can be justified as needless noise. Or in any case a compromise is made in order that the segments can be translated.

The best solution is to use post-processing like the PostSegmentationSimplifierStep. It will do what you want but gives you much finer control over the kinds of things Mihai is warning about (very important!)

I don't know the OmegaT integration code but see no reason this couldn't be added as an option (barring resources and time).

Jim

jim

Mar 12, 2022, 7:26:19 AM
to Manuel Souto Pico, okapi-users, Group: okapi-devel

I see. The problem is Mihai and I are not familiar with OmegaT or its integration. If this used to work, file a bug ticket with the info below. This is something that has to be done post-filter as the HTML filter itself does not have this option.

One thing we might want to consider is spinning off the OmegaT integration code into an independent project. It would have its own issues, code base, etc. I think it would make things *much* easier for you guys. We could give everyone access. It doesn't even need to live in the okapiframework repository. We could create a new one.

What do you guys think?

On 3/11/22 16:04, Manuel Souto Pico wrote:
Hi Jim,

jim <jhargr...@gmail.com> wrote on Monday, 7/03/2022 at 15:20:

The filter's job is to produce the most complete output - so that other subsequent steps have full access to all information. This is why most filters *do not* do cleanup. OpenXml and IDML are exceptions in that the codes they produce can be justified as needless noise. Or in any case a compromise is made in order that the segments can be translated.


I think that's a different topic from the issue that I was raising. My feedback was not about tag noise clean-up in general, but about how the filter ignores a user-defined setting in OmegaT that allows the user to decide whether leading/trailing tags are displayed or not.

The purpose of inline tags is to replicate those codes in the translation while letting the translator insert them in the appropriate location in the translation. However, leading and trailing tags never have a different position in the target language: the two paired tags must simply enclose the full sentence in both languages. I have never seen an exception to that and I can't think of a reason for changing their position. Therefore, they don't need to be exposed in the segment, since the translator will have to insert them in the same position (at the beginning and at the end of the segment). Exposing them doesn't prevent translation; it just makes it more cumbersome when that happens often.

I'm not sure if my original example was misleading (at least I think I have managed to mislead Mihai..). Let me try with another example. I had used an example in HTML in my original email, but I've just seen now that this problem also happens with other file types such as XLIFF.

I can produce an OmegaT project in Rainbow from the HTML file of my original email, where the XLIFF file has something like this:

<trans-unit id="tu4" restype="x-input">
<source xml:lang="en-US"><g id="1" ctype="x-label" equiv-text="&lt;label for=&quot;answer&quot;>">Strongly agree</g></source>
<target xml:lang="fr-FR"><g id="1" ctype="x-label" equiv-text="&lt;label for=&quot;answer&quot;>">Strongly agree</g></target>
</trans-unit>
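
As a rough illustration of when a tag pair counts as leading/trailing in a trans-unit like the one above, this Python sketch checks whether a single `<g>` element encloses all the text in `<source>`. The helper `single_wrapping_g` is hypothetical, the unit is a trimmed copy of the example (source only), and the snippet uses only the standard library rather than any Okapi or OmegaT API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the trans-unit from this thread (target omitted).
XLIFF_UNIT = """<trans-unit id="tu4" restype="x-input">
<source xml:lang="en-US"><g id="1" ctype="x-label" equiv-text="&lt;label for=&quot;answer&quot;>">Strongly agree</g></source>
</trans-unit>"""


def single_wrapping_g(source):
    """Return the lone <g> child if it encloses all text in <source>, else None."""
    children = list(source)
    if len(children) != 1 or children[0].tag != "g":
        return None
    text_before = (source.text or "").strip()
    text_after = (children[0].tail or "").strip()
    if text_before or text_after:
        # Translatable text sits outside the pair, so the tags are inline.
        return None
    return children[0]


unit = ET.fromstring(XLIFF_UNIT)
wrapper = single_wrapping_g(unit.find("source"))
if wrapper is not None:
    # The pair is leading/trailing; only its content needs to be shown.
    print("".join(wrapper.itertext()))
```

For a source such as `Click <g id="1">here</g> to continue.` the helper returns `None`, which is the case where the translator does need to see and place the tags.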

If I use the default OmegaT XLIFF filter and have the "Remove leading and trailing tags" option unchecked, I get:

image.png

If I check the "Remove leading and trailing tags" option (with that same filter), I get:

image.png

So far so good. However, if I try to use the same preference with the Okapi XLIFF filter, it has no effect: I get the tags nonetheless.

image.png

In other words, the problem is independent from what the filter does or what I configure the filter to do (whether <br> is considered INLINE or TEXTUNIT or EXCLUDED, etc.). I think Mihai was talking about that.
 

The best solution is to use post-processing like the PostSegmentationSimplifierStep. It will do what you want but gives you much finer control over the kinds of things Mihai is warning about (very important!)

I have tried adding this step (precisely called "Post-segmentation Inline Codes Simplifier") to my pipeline, but I get the same result as above in the generated XLIFF...
image.png
I have added a segmentation step just in case, but with no rules...

I don't know the OmegaT integration code but see no reason this couldn't be added as an option (barring resources and time).

If it can do the same thing as the "Remove leading and trailing tags" option in OmegaT, and can be added to all Okapi filters in the plugin for OmegaT, that would be fine with me. However, I think it would be much clearer if the "Remove leading and trailing tags" option in OmegaT could also work with Okapi filters.

I don't know whether the problem is in the Okapi filter not being aware of that setting or in OmegaT not sending that info to the filter...

I hope this helps.

Cheers, Manuel

jim

Mar 12, 2022, 10:19:09 AM
to Manuel Souto Pico, okapi-users, Group: okapi-devel

Thinking about this more, it makes sense to actually move the Okapi plugin to OmegaT. That way you guys can address any configuration issues in the plugin, while we continue to address problems on the Okapi side.

If we don't do that, the probability of resolving bugs/enhancements in the OmegaT plugin drops. Some of the main contributors now simply don't have experience with it.

Okapi has stable Maven artifacts now, so you can just move to whatever versions you are comfortable with. Continue to post issues for direct Okapi bugs.

Does that make sense?

Jim

Mihai Nita

Mar 15, 2022, 5:45:57 PM
to Group: okapi-devel, Manuel Souto Pico, okapi-users
+1 that an OmegaT plugin probably belongs in OmegaT.
You know more about OmegaT, and you would also gain "independence" to add what you want, release when you want, etc.

Built on top of Okapi, using Okapi public APIs (the way it already does).
With feature requests when Okapi does not provide something you need, and so on.
We did some improvements in the last few releases: more modular, publishing official maven artifacts, nightly artifacts, etc.

So it might be easier than it used to be some years ago.
If you want we can split it in a standalone git repo, to "untangle it" and to make it easier to take.

Regards,
Mihai



Mihai Nita

Mar 15, 2022, 5:49:31 PM
to Group: okapi-devel, Manuel Souto Pico, okapi-users
Oops, sorry, it is already in a separate repository.
Mihai

Manuel Souto Pico

Mar 15, 2022, 5:51:24 PM
to jim, Group: okapi-devel, okapi-users
Hi Jim,

I'm not sure what "moving the okapi plugin to OmegaT" really means. However, my understanding is that the filters included in the plugin for OmegaT and the filters included in Rainbow are identical or have basically the same code, so from that perspective to me it would seem that the filters plugin for OmegaT should be maintained along with the rest of the Okapi project.

Probably it should be one of the OmegaT core developers who has a say here, not me. I believe Aaron Madlon-Kay (OmegaT project's PM and integration manager) is in this list (okapi-users), so he might want to have a say.

I would say that, in any case, no decision should be rushed in this regard.

Cheers, Manuel

PS: With regard to the lack of familiarity with OmegaT on the side of Okapi developers, you can count on support (explanations, testing, etc.) from OmegaT users in this list like myself, at least as regards behaviour and configuration from the perspective of the end user.

jim

Mar 15, 2022, 6:00:27 PM
to Manuel Souto Pico, Group: okapi-devel, okapi-users

The plugin code does not contain any true filters, only wrappers for the Okapi filters. The plugin acts as a light intermediary between OmegaT and Okapi. It's the code you want to change if there is a configuration problem, for example.

https://bitbucket.org/okapiframework/omegat-plugin/src/dev/

We could just give you guys full access to this repository. Make any changes you need. Several of your tickets could be addressed in the repository above rather than in the "real" Okapi filters here:

https://bitbucket.org/okapiframework/okapi/src/dev/

Agreed, let's wait for Aaron Madlon-Kay to reply.

Jim

Aaron Madlon-Kay

Mar 15, 2022, 9:41:53 PM
to jim, Manuel Souto Pico, Group: okapi-devel, okapi-users
Hi all.

I think *where* the code lives is not the main problem. The main problem is dev resources: there are very few contributors to OmegaT; when you take the intersection of OmegaT contributors and Okapi contributors, the result is probably just me. I already have access to the plugin code, so giving me more access will not help. What little time I have available for work on either OmegaT or Okapi is entirely consumed right now by “keeping the lights on”, and I’m barely managing that as it is.

I’m fine with OmegaT taking ownership of the plugin code, or continuing to share ownership. But I don’t think either choice solves the problem.

-Aaron



yves.s...@gmail.com

Mar 15, 2022, 11:34:56 PM
to Aaron Madlon-Kay, jim, Manuel Souto Pico, Group: okapi-devel, okapi-users

Hi all,

 

I tend to agree with Aaron.

The problem is mostly resources: never enough time to do half of what’s needed.

 

-yves

Jim Hargrave

Mar 15, 2022, 11:45:09 PM
to yves.s...@gmail.com, Aaron Madlon-Kay, Manuel Souto Pico, Group: okapi-devel, okapi-users
The resource issue is unfortunately the main one. But if we expose the code to the outside, it gives you the opportunity to hire a contractor and make any changes that you want without having to go through us. Also, just by having the code you may find that some of the changes are easy enough to make yourselves, and you can test them out more easily.

Jim Hargrave




Manuel Souto Pico

Mar 16, 2022, 12:06:25 PM
to Jim Hargrave, Yves Savourel, Aaron Madlon-Kay, Group: okapi-devel, okapi-users
Thank you for all the contributions.

It seems that hiring a contractor, as Jim mentions, is indeed my only chance to have some improvements implemented, and I'm willing to go down that path. If Jim's suggestion may help move forward faster with any changes I need (provided I can hire a contractor to implement them), I'm all for it.

However, let me go back to the original topic of this thread: how to modify what the Okapi filters (HTML, XLIFF, etc.) extract and expose through some general OmegaT filter options, in particular the one to "Remove leading/trailing tags". I think the first thing that would help me is to know whether this is something to be addressed in the plugin, in the filter, in OmegaT or in all of those. Knowing that, I can write a ticket in the appropriate repository. Could someone point me in the right direction?

Thank you so much.

Cheers, Manuel

jim

Mar 21, 2022, 12:35:50 PM
to Manuel Souto Pico, okapi-users

The removal of leading/trailing tags would have to be done in the omegat interface code (the repository I pointed to).  This isn't a direct option for the okapi HTML filter. You would have to post-process the text units coming out of okapi before they go to OmegaT.

I have never worked in that code, so this is my best guess, and it doesn't address how to get the trimmed tags back into Okapi. Anyway, hopefully this can get you started. Keep in mind those are *not* Okapi filters, only wrappers of Okapi filters for OmegaT.

I think this is where you intercept the content:

AbstractOkapiFilter.processFile()

Jim
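
One way to picture the post-processing Jim describes, including putting the trimmed tags back, is the sketch below. It is only an illustration in Python on the `<g1>` shorthand used in this thread: `hide_outer_pair` and `translate_units` are hypothetical names, the real plugin works on Okapi text units rather than strings, and nothing here reflects what `AbstractOkapiFilter.processFile()` actually does.

```python
import re
from typing import Callable, Dict, List, Tuple

# Groups: 1 = opening tag, 2 = tag number, 3 = enclosed text, 4 = closing tag.
_OUTER_PAIR = re.compile(r"(<g(\d+)>)(.*)(</g\2>)", re.DOTALL)


def hide_outer_pair(segment: str) -> Tuple[str, str, str]:
    """Split a segment into (leading tag, visible text, trailing tag)."""
    match = _OUTER_PAIR.fullmatch(segment)
    if match is None or match.group(4) in match.group(3):
        # No single pair wraps the whole segment: hide nothing.
        return "", segment, ""
    return match.group(1), match.group(3), match.group(4)


def translate_units(units: Dict[str, str],
                    translate: Callable[[str], str]) -> Dict[str, str]:
    """Translate each unit with its outer tags hidden, then restore them."""
    result = {}
    for unit_id, segment in units.items():
        lead, visible, trail = hide_outer_pair(segment)
        # The translator only ever sees `visible`; the tags are re-attached
        # in the same positions once the translation comes back.
        result[unit_id] = f"{lead}{translate(visible)}{trail}"
    return result


shown: List[str] = []


def fake_translate(text: str) -> str:
    """Stand-in for the translator: records what was shown, returns upper case."""
    shown.append(text)
    return text.upper()


units = {"tu4": "<g1>Strongly agree</g1>", "tu5": "Click here to continue."}
translated = translate_units(units, fake_translate)
print(shown)
print(translated)
```

Because the hidden pair is stored next to the unit it came from, nothing has to be guessed when the target is written back: the tags return to exactly the positions they were removed from.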
