Author's name is extracted

Manuel Souto Pico

Mar 11, 2021, 10:34:55 AM
to okapi-users
Dear all, 

I'm translating a Word file in OmegaT using the OpenXML filter with a configuration file (okf_o...@foo.fprm) containing the settings you can see below. 

The last segment in the project seems to be the name of the author, which is not visible when I open the file in LibreOffice (nor, I assume, in MS Word).

Am I overlooking some setting in the filter config? 

Thanks in advance.

#v1
bPreferenceTranslateDocProperties.b=false
bPreferenceTranslateComments.b=false
bPreferenceTranslatePowerpointNotes.b=true
bPreferenceTranslatePowerpointMasters.b=true
bPreferenceIgnorePlaceholdersInPowerpointMasters.b=false
bPreferenceTranslateWordHeadersFooters.b=false
bPreferenceTranslateWordHidden.b=false
bPreferenceTranslateWordExcludeGraphicMetaData.b=false
bPreferenceTranslatePowerpointHidden.b=false
bPreferenceTranslateExcelHidden.b=false
bPreferenceTranslateExcelExcludeColors.b=false
bPreferenceTranslateExcelExcludeColumns.b=false
bPreferenceTranslateExcelSheetNames.b=false
bPreferenceAddLineSeparatorAsCharacter.b=false
sPreferenceLineSeparatorReplacement=$0a$
bPreferenceReplaceNoBreakHyphenTag.b=false
bPreferenceIgnoreSoftHyphenTag.b=false
bPreferenceAddTabAsCharacter.b=false
bPreferenceAggressiveCleanup.b=true
bPreferenceAutomaticallyAcceptRevisions.b=true
bPreferencePowerpointIncludedSlideNumbersOnly.b=false
bPreferenceTranslateExcelDiagramData.b=false
bPreferenceTranslateExcelDrawings.b=false
subfilter=
bInExcludeMode.b=true
bInExcludeHighlightMode.b=true
bPreferenceTranslateWordExcludeColors.b=false
bReorderPowerpointNotesAndComments.b=false
tsComplexFieldDefinitionsToExtract.i=1
cfd0=HYPERLINK
tsExcelExcludedColors.i=0
tsExcelExcludedColumns.i=0
tsExcludeWordStyles.i=0
tsWordHighlightColors.i=0
tsWordExcludedColors.i=0
tsPowerpointIncludedSlideNumbers.i=0
bExtractExternalHyperlinks.b=false

Cheers, Manuel

Chase Tingley

Mar 11, 2021, 12:03:32 PM
to Manuel Souto Pico, okapi-users
The filter extracts a subset of the document properties (the ones shown under File > Properties).  There's a very old enhancement request to allow this to be customized, but it has never been implemented.
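
For anyone who wants to check which properties a given file actually carries: a .docx is a zip archive, and the core document properties (title, subject, creator, and so on) are stored in docProps/core.xml. Below is a minimal Python sketch that lists them. The helper name `core_properties` is made up for illustration, and the script only shows what is in the file; it says nothing about which of these properties the filter picks up.

```python
import zipfile
import xml.etree.ElementTree as ET


def core_properties(docx_path):
    """Return {local element name: text} for each entry in docProps/core.xml."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    props = {}
    for child in root:
        # ElementTree expands prefixes to '{namespace-uri}name'; keep only the name.
        local = child.tag.rsplit("}", 1)[-1]
        props[local] = (child.text or "").strip()
    return props
```

Calling `core_properties("sample.docx")` (hypothetical file name) returns a dictionary in which the `creator` key holds the author's name, if one is set.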


--
You received this message because you are subscribed to the Google Groups "okapi-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-users/CABm46bY7SJ29_xN%3DAmWk8DUTiMsfYdc7CygTQ%3DWOMKu5S5_NQA%40mail.gmail.com.

Manuel Souto Pico

Mar 15, 2021, 7:03:43 AM
to Chase Tingley, okapi-users
Thanks, Chase. 

I understand it's not possible to choose whether to extract the document's creator name or not. 

Would it be possible to make it the default not to extract that property? It's clearly out of scope for translation: it's a piece of information that is not visible in the document, so it should be considered metadata, not text.

Also, in terms of privacy, exposing that data could be a potential source of issues with the client.

Thanks.

Cheers, Manuel

Manuel Souto Pico

May 9, 2021, 7:29:56 AM
to Chase Tingley, okapi-users
In the meantime, in case it's helpful for anyone, here's a workaround to avoid having the document's author name extracted and displayed in OmegaT as a translatable string:

In MS Word, go to File > Info > Related people: Author > right click > Remove person

That fixes the issue for one document, but of course it would be a pain to have to do it every time one wants to translate a Word file using the Okapi OpenXML filter.
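
For batches of files, the same clean-up could in principle be scripted. The Python sketch below copies a .docx and empties the `dc:creator` element in docProps/core.xml. It rests on several assumptions: that the file uses the usual `dc:` prefix and UTF-8 encoding, that an empty element is equivalent to what Word writes after "Remove person", and that the filter then skips it. None of this has been verified against the Okapi filter, and `clear_author` is a made-up name.

```python
import re
import zipfile

# Matches the author element as Word normally writes it (assumes the 'dc:' prefix).
CREATOR = re.compile(r"(<dc:creator>).*?(</dc:creator>)", re.DOTALL)


def clear_author(src_docx, dst_docx):
    """Copy src_docx to dst_docx, emptying dc:creator in docProps/core.xml."""
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                text = data.decode("utf-8")  # assumption: UTF-8 encoded
                data = CREATOR.sub(r"\1\2", text).encode("utf-8")
            # Every other entry is copied through unchanged.
            dst.writestr(item, data)
```

The script writes a new file rather than modifying the original, so the untouched document is still available if the result does not open or merge correctly.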

Chase Tingley

May 10, 2021, 12:06:26 AM
to Manuel Souto Pico, okapi-users
I was looking at the code tonight, and I think it may be possible to tweak this configuration for those who are feeling *very* adventurous.  The properties extraction code is very old, and most of the rest of the filter has been rewritten around it, but this one part still embeds a configuration for Okapi's XML filter to select the properties to extract.  It should be possible to modify that file.

In your Okapi distribution, in the lib folder, there is a jar file called okapi-lib-[version].jar.  A jar is basically just a zip file; you can extract it and re-zip it with tools like WinZip.  (You may need to change the extension.)

If you extract this jar to a directory, you will find a file called net/sf/okapi/filters/openxml/wordDocPropertiesConfiguration.yml.  You will need to modify this file in a text editor and re-zip the jar.

The content of wordDocPropertiesConfiguration.yml looks like this:
elements:
  filetype:
    ruleTypes: [ATTRIBUTES_ONLY]
    elementType: MSWORDCHART
  'dc:title':
    ruleTypes: [TEXTUNIT]
  'dc:subject':
    ruleTypes: [TEXTUNIT]
  'dc:creator':
    ruleTypes: [TEXTUNIT]
 [....] 

If you remove a rule (the element name + the "ruleTypes: [TEXTUNIT]" line), save the file, and then re-zip the jar file, that element will no longer be extracted.  Similarly, you can add additional elements (using the namespace prefix) with a TEXTUNIT rule and they should show up.  Disclaimer: if you extract things this way, you will need to use a similarly hacked copy to merge the files, otherwise the merger will get very cross with you.  Second disclaimer: I haven't actually tried this, so it may not work.  Death, permanent injury, indigestion, etc. may result.
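
The extract, edit and re-zip steps described above could also be scripted. Here is a rough Python sketch, untested against a real Okapi jar and subject to the same disclaimers. It assumes the YAML file is laid out as in the excerpt (each rule is a quoted element name followed by more deeply indented lines); `drop_rule` and `patch_jar` are made-up helper names.

```python
import zipfile

# Path inside the jar, as given in the message above.
YML = "net/sf/okapi/filters/openxml/wordDocPropertiesConfiguration.yml"


def drop_rule(yaml_text, element):
    """Remove the line "'<element>':" and the more-indented lines below it."""
    out, skip_indent = [], None
    for line in yaml_text.splitlines(keepends=True):
        indent = len(line) - len(line.lstrip(" "))
        if skip_indent is not None:
            if line.strip() and indent <= skip_indent:
                skip_indent = None  # the body of the removed rule has ended
            else:
                continue  # still inside the removed rule
        if line.strip() == f"'{element}':":
            skip_indent = indent
            continue
        out.append(line)
    return "".join(out)


def patch_jar(src_jar, dst_jar, element="dc:creator"):
    """Copy src_jar to dst_jar with one rule removed from the YAML file."""
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == YML:
                data = drop_rule(data.decode("utf-8"), element).encode("utf-8")
            dst.writestr(item, data)
```

Writing to a separate output jar keeps the original distribution intact, which matters because extraction and merge need to use the same (patched or unpatched) copy.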



Manuel Souto Pico

May 10, 2021, 4:13:00 AM
to Chase Tingley, okapi-users
Hi Chase, 

I really appreciate your effort and your feedback. I don't really feel *very* adventurous; a bit, maybe, but not to the point of risking more indigestion than I already have ^^

In a case of dire need, I could give your suggestion a try, but removing the author's name from the document's properties, as I explained in my email yesterday, seems a safer bet (it's not something that I need to do myself, I just give advice about it). Also, I don't want to have to hack the plugin with every release that comes out.

The official current policy in OmegaT is to encourage development work on (and therefore usage of) Okapi filters to the detriment of the default OmegaT ones, at least as regards the OpenXML filter. I try to be consistent with that in my recommendations to users and partners, but it's not easy due to these, let's say, disadvantages of your filter.

In any case, the pros of your filter do outnumber the cons, imho. In my organization we have stopped officially using the default OpenXML filter in OmegaT, because of the issues it has with legacy reliance on prev/next context for alternative (ICE) translations and with leading/trailing tags, both solved in the Okapi filter. 

In a nutshell, the Okapi filter is better, but not free from minor inconveniences. One of them is the spartan way of configuring the filter through a filter configuration file (without a GUI dialog); another is that it doesn't expose all options (such as whether to extract the author's name). Hopefully these things will be ironed out in future versions. 

Cheers, Manuel 

Manuel Souto Pico

May 10, 2021, 1:28:13 PM
to Chase Tingley, okapi-users
For the record, another problem I have with the default OpenXML filter is this: https://bitbucket.org/okapiframework/okapi/issues/1055/unable-to-set-translation-in-omegat-with

Wei Jiang

May 11, 2021, 10:49:18 AM
to Manuel Souto Pico, okapi-users

It seems that the author’s name is the last segment to be extracted from a document. So you might just remove that segment, say by using a regular expression.

 


Manuel Souto Pico

May 14, 2021, 1:47:55 PM
to Wei Jiang, okapi-users
Thanks, Wei.

I suppose you are talking about removing the segment in an XLIFF file? Notice that my question was not about preparing files in Rainbow or Tikal, but about the Okapi plugin for OmegaT (to extract the text directly in OmegaT, so there's no intermediary XLIFF file).

If that's not what you meant, then I don't know. Could you please clarify?

But in any case, I'm not sure it would be a good idea to manipulate an XLIFF file like that, especially considering the impact on the merge. It would need to be tested. I'll bear your suggestion in mind if I ever need to create XLIFF from Word and still run into this problem.

I really think the clean solution would be to add an option to the filter configuration file to let the user decide whether that property should be extracted or not, with 'no' as the default.

Thanks and have a nice weekend.

Cheers, Manuel

Manuel Souto Pico

Dec 5, 2022, 12:36:56 PM
to Wei Jiang, okapi-users
Dear all,

I would like to touch base on this topic. More than a year after I first reported it, I have re-tested this with the latest version of the filters plugin and I still see the same behaviour.

I'm bringing this up again so that it may get some attention. I don't think I ever created a ticket for this. Should I?

Is there anything else I can do to have this enhancement implemented?

Thanks.

Cheers, Manuel