Best practice for openxml inline tags...


Jim Hargrave

Apr 17, 2025, 4:25:52 PM
to Group: okapi-devel

The openxml filter sometimes presents a large number of tags to the translator. In some cases it is impossible to know where these tags should go in the translation.

Does anyone use this param by default: bPreferenceAggressiveCleanup.b=true

Are there other parameters you use to reduce tag clutter?

What about any of the code simplifier steps?

Chase Tingley

Apr 17, 2025, 5:46:18 PM
to okapi...@googlegroups.com
Yes, we've used bPreferenceAggressiveCleanup by default for several years. It is particularly good at dealing with DOCX that was generated via conversion from PDF, which often has thousands of tiny adjustments to kerning, vertical spacing, etc. I think it would probably be safe as a global default, but because it causes such a radical change in the behavior of some files, we left it as an option for reasons of backwards compatibility.

I would also recommend:
  • ignoreWhitespaceStyles.b=true - similarly, some files contain a lot of tags due to font/style shifts every time there's a whitespace character. This strips those style shifts.
  • bPreferenceAutomaticallyAcceptRevisions.b=true - strips revisions on import and automatically pretends they were accepted for the purposes of extraction. There isn't really a benefit to not doing this, in my opinion.
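
Taken together, the three settings named in this thread could be collected in one filter parameters file. This is only a sketch: the parameter names are copied as written above, while the `#v1` header line and saving this as a custom .fprm file are assumptions to check against your Okapi version.

```
#v1
# Assumption: '#v1' is the usual first line of an Okapi parameters file.
# Parameter names below are taken verbatim from this thread.
bPreferenceAggressiveCleanup.b=true
ignoreWhitespaceStyles.b=true
bPreferenceAutomaticallyAcceptRevisions.b=true
```

Since aggressive cleanup can change the extracted output of some files substantially, it is probably worth comparing extraction results on a few existing documents before switching it on everywhere.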


Jim Hargrave

Apr 18, 2025, 3:23:57 PM
to okapi...@googlegroups.com, Chase Tingley

Thank you. I will pass this on. My only concern is that we also use the PostSegmentationCodeSimplifier. Looking at the openxml merge code, the two might conflict, and this combination hasn't been given much coverage in our integration tests.

Jim
