Improved column configuration options for the XLSX filter

8 views
Skip to first unread message

Chase Tingley

unread,
Sep 13, 2021, 5:07:04 PM9/13/21
to okapi...@googlegroups.com
Hi all,

Denis has a large changeset up in PR 552 which he has been working on for some time and I have been testing for the last several days.  The goal of this work is to enhance the ability to configure row and column extraction in XLSX in order to handle "data dump" XLSX files, which we are seeing with increasing frequency.  A typical XLSX file of this type would be one containing hundreds of thousands of rows, with the instructions "Translate columns B, C, and F; expose columns A and G as metadata fields on the cells from each row; exclude everything else". 

Although the OpenXML filter has some functionality for column extraction configs, this is limited in a couple ways:
  • There is no way to extract a non-translatable column as metadata (ie, context-group), which is an increasingly common request.
  • There is no way to skip one or more header rows, as can be done in the CSV/Table filter
  • The sheet configuration is sort of wonky (you select columns to exclude for Sheet 1, Sheet 2, and "Sheet 3 and higher" individually)
We have been working around this type of scenario by using JSON as a pivot format and using the JSON extract/metadata key options to configure the "column" behavior, but enough is enough.

An example of the new implementation looks like this:

worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=.*
worksheetConfigurations.0.excludedRows=
worksheetConfigurations.0.metadataRows=1
worksheetConfigurations.0.excludedColumns=B,G
worksheetConfigurations.0.metadataColumns=D,E,F

This allows one or more worksheet configurations (which match a sheet name by a pattern), each of which can specify:
  • excludedRows - a list of which rows, if any, should be completely excluded.
  • metadataRows - a list of which rows, if any, should be used to provide the names of metadata values extracted from certain columns. 
  • excludedColumns - A list of which columns should be completely skipped
  • metadataColumns - A list of columns that provide metadata values for the row
Rows are organized by XLIFF group, with all metadata cells from that row included in a <context-group> block for that row.

This new mode doesn't have a UI yet.  The old options are still supported, although they are marked as deprecated in the code.

ct

Jean-Christophe Helary

unread,
Sep 13, 2021, 7:24:06 PM9/13/21
to okapi...@googlegroups.com
There is an important development on Okapi.

Allow me to suggest that people who are involved in complex Excel sheet translation should take a look.

It is still a PR and there is some gitting/building required to actually test the thing, but I encourage you to do so.

Once the code makes it to Okapi, it will be available through Rainbow (I guess) for XLIFF package creations and thus translation in OmegaT.

Jean-Christophe
> --
> You received this message because you are subscribed to the Google Groups "okapi-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-devel/CAGRYq4jtSZL3NoAH0Ba-R2sDsPGOgpQ%3DnW8c9MKgz8R%3D9H9AvA%40mail.gmail.com.

--
Jean-Christophe Helary @brandelune
https://mac4translators.blogspot.com
https://sr.ht/~brandelune/omegat-as-a-book/

Alessandro Falappa

unread,
Sep 15, 2021, 3:27:45 AM9/15/21
to Group: okapi-devel
Hello Chase,
  extending the XLSX filter with those configurations is a welcome addition.
I can confirm from our side that excluding some header rows is an often requested feature.
Excluding some columns is definitely useful as well.
We do not need, at present, to extract some cells as metadata but can come handy in the future.

What we have encountered is clients sending a sheet and telling us "column A contains the source text, please put the translation in language X in column B, the translation in language Y in column C" (and so forth), thus what could be added is a configuration to set the column to extract from and the column to merge into. This does not completely solve the aforementioned use case but would allow us to produce two translated sheets that could be easily merged (translation do not overlap).

I honestly do not know the internals of the OpenXML filter enough to tell if this different merge destination is difficult to implement.

Nice work indeed.

Regards


Alessandro Falappa
Integrations Team Leader


--

Denis A. Konovalyenko

unread,
Sep 15, 2021, 8:51:45 AM9/15/21
to okapi...@googlegroups.com

Hi Alessandro,

Thank you for your feedback on the made OpenXML filter improvements!

I think your use case can be solved on the filter level with medium level of effort.

Best regards,

Denis

Reply all
Reply to author
Forward
0 new messages