Denis has a large changeset up in PR 552
which he has been working on for some time and I have been testing for the last several days. The goal of this work is to enhance the ability to configure row and column extraction in XLSX in order to handle "data dump" XLSX files, which we are seeing with increasing frequency. A typical XLSX file of this type would be one containing hundreds of thousands of rows, with the instructions "Translate columns B, C, and F; expose columns A and G as metadata fields on the cells from each row; exclude everything else".
Although the OpenXML filter has some functionality for column extraction configs, this is limited in a couple ways:
- There is no way to extract a non-translatable column as metadata (ie, context-group), which is an increasingly common request.
- There is no way to skip one or more header rows, as can be done in the CSV/Table filter
- The sheet configuration is sort of wonky (you select columns to exclude for Sheet 1, Sheet 2, and "Sheet 3 and higher" individually)
We have been working around this type of scenario by using JSON as a pivot format and using the JSON extract/metadata key options to configure the "column" behavior, but enough is enough.
An example of the new implementation looks like this:
This allows one or more worksheet configurations (which match a sheet name by a pattern), each of which can specify:
- excludedRows - a list of which rows, if any, should be completely excluded.
- metadataRows - a list of which rows, if any, should be used to provide the names of metadata values extracted from certain columns.
- excludedColumns - A list of which columns should be completely skipped
- metadataColumns - A list of columns that provide metadata values for the row
Rows are organized by XLIFF group, with all metadata cells from that row included in a <context-group> block for that row.
This new mode doesn't have a UI yet. The old options are still supported, although they are marked as deprecated in the code.