Feedback on bilingual XLSX extraction

Chase Tingley

Mar 20, 2025, 6:38:24 PM

to okapi-users

Hi all,

The latest nightly build contains a new feature for the XLSX filter that some of you might find useful. For some time it's been possible to specify a worksheet configuration that reads source from one column and writes the target to a different column. This has now been extended so that the target column support is fully read/write. That is, we can now read bilingual XLSX files and convert them to XLIFF, etc.

A sample demonstrating this is in Denis's ticket for the feature here:

https://gitlab.com/okapiframework/Okapi/-/issues/1399

This functionality is very new, but if you are interested in testing it out before the full release, I encourage you to give it a try in the latest nightly build and see how it works. Feedback is welcome as well :)

Jonathan Bourriquis

Mar 22, 2025, 12:25:59 PM

to Chase Tingley, okapi-users

Hey Chase (and all),

I have just tested the feature in depth using yesterday's nightly build.

It works great, thanks for adding this!

I was actually about to request that feature, great timing.

-----

As a related piece of feedback, here are a few features which (I think) are still missing from this OpenXML filter:

1. As Victor and others have pointed out, it would be *very* useful if the `maxwidth` parameter could be dynamically pulled from the value of a predefined column rather than it being assigned a static value. This is a very common scenario.

2. The ability to use a codefinder with Word/PowerPoint/Excel.

The Excel filter does already allow for subfiltering, which is great, especially when dealing with chunks of embedded HTML/MD, but I think it would be practical to also have a codefinder (for scenarios where the only prep required is for a bunch of patterns to be tagged).

3. Ability to control the cell flow in Excel extractions.

In some cases, the file content is organized by column rather than by row. As a result, the TUs in the XLF don't follow the actual logical order of the file, which makes the linguists' work more confusing. Phrase TMS, for example, has an option to control the flow of the extraction (by row / by column).

4. The ability to exclude Excel cells by cell style.

In short, the same option that already exists in the Word section (+ also allow users to manually set a custom style name).

5. (Excel filter) The extracted metadata are currently stored in the XLF under the `<context>` element.

The problem is that some CAT tools (such as XTM) expect comments/metadata to be stored in the `<note>` element instead. By default, XTM just ignores the `<context>` element altogether. It would be great if the user could define the XLF element name where the metadata should be stored, or perhaps have a choice between the 3-4 most common options, so that this metadata parameter could be used in conjunction with all TMS/CAT tools.

6. But to me the biggest limitation, and it's more to do with Rainbow itself, is the fact that we can't prep multi-target XLSX files (a very common scenario).

Of course it's tied to the fact that we can't create multilingual kits in Rainbow, and I understand it would necessitate a lot of work to implement that feature, but it's still worth mentioning as I'm definitely not the only one facing this limitation. My current workaround for multi-target projects, as you may recall, is to prep those with a Python script via Tikal (similar to what Victor mentioned a while ago), but it's not ideal.


Víctor Parra García

Mar 22, 2025, 1:33:12 PM

to Jonathan Bourriquis, Chase Tingley, okapi-users

Jonathan, you got all my money :D

For me, point 2 is just SO essential: it's super common to get Excel preps with some kind of code inside, and right now simply turning {{variable}} into a tag is impossible.
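To make the request concrete, here is a minimal Python sketch of what a code finder does conceptually: it matches patterns such as {{variable}} with a regular expression and swaps each match for a placeholder that can then be protected as an inline code. This is not Okapi's implementation; the function name `tag_codes`, the pattern, and the `<ph>` placeholder format are assumptions made up for illustration.

```python
import re

# Hypothetical pattern for illustration: double-brace variables such as {{name}}.
CODE_PATTERN = re.compile(r"\{\{\s*\w+\s*\}\}")


def tag_codes(text, pattern=CODE_PATTERN):
    """Replace each match with a numbered placeholder; return (tagged_text, codes)."""
    codes = []

    def _swap(match):
        # Remember the original code so it can be restored after translation.
        codes.append(match.group(0))
        return f'<ph id="{len(codes)}"/>'

    return pattern.sub(_swap, text), codes


tagged, codes = tag_codes("Hello {{user}}, you have {{count}} new messages.")
print(tagged)
print(codes)
```

The translatable text keeps only the placeholders, while the original codes are stored separately and can be put back on merge.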

Also, point 6 would make Okapi a unique tool, providing things Trados or memoQ just can't.

Of course point 1 is another of my essentials, it's a scenario I deal with 90% of my working time.

Cheers!

Chase Tingley

Mar 24, 2025, 2:28:37 PM

to Jonathan Bourriquis, okapi-users

Hi Jonathan, thanks for all this feedback. There are a lot of good ideas here (some of them easier than others).

On Sat, Mar 22, 2025 at 9:25 AM Jonathan Bourriquis <jonathan....@gmail.com> wrote:

1. As Victor and others have pointed out, it would be *very* useful if the `maxwidth` parameter could be dynamically pulled from the value of a predefined column rather than it being assigned a static value. This is a very common scenario.

2. The ability to use a codefinder with Word/PowerPoint/Excel.
The Excel filter does already allow for subfiltering, which is great, especially when dealing with chunks of embedded HTML/MD, but I think it would be practical to also have a codefinder (for scenarios where the only prep required is for a bunch of patterns to be tagged).

This seems like it should be pretty easy.

3. Ability to control the cell flow in Excel extractions.
In some cases, the file content is organized by column rather than by row. As a result, the TUs in the XLF don't follow the actual logical order of the file, which makes the linguists' work more confusing. Phrase TMS, for example, has an option to control the flow of the extraction (by row / by column).

I would guess this is quite a bit of work (and has some interesting implications for the config UI), although it is a nifty idea for a feature. I admit I don't think I've ever encountered this type of data.
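For anyone who hasn't run into this kind of data, the difference between the two flows can be sketched in a few lines of Python. This only illustrates the ordering question and says nothing about how the filter walks a worksheet internally; the `extraction_order` function and the `(row, column, text)` tuples are assumptions for the example.

```python
def extraction_order(cells, flow="row"):
    """Sort (row, column, text) tuples row by row or column by column."""
    if flow == "row":
        key = lambda cell: (cell[0], cell[1])
    elif flow == "column":
        key = lambda cell: (cell[1], cell[0])
    else:
        raise ValueError(f"unknown flow: {flow!r}")
    return [text for _, _, text in sorted(cells, key=key)]


# Two columns, each holding one self-contained block of text.
cells = [
    (1, 1, "Title A"), (1, 2, "Title B"),
    (2, 1, "Body A"), (2, 2, "Body B"),
]
by_row = extraction_order(cells, flow="row")
by_column = extraction_order(cells, flow="column")
print(by_row)
print(by_column)
```

With row flow, the two blocks are interleaved ("Title A", "Title B", "Body A", "Body B"); with column flow, each block stays together, which is the order a linguist would expect for column-organized content.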

4. The ability to exclude Excel cells by cell style.
In short, the same option that already exists in the Word section (+ also allow users to manually set a custom style name).

This is a good one. Also, in discussion with some of our internal users about this suggestion, it came up that the "Exclude color" feature in Excel now works on both foreground and background colors, but the wiki and option name hadn't been updated to account for this. (I've fixed the wiki and will open a PR for the UI.)

5. (Excel filter) The extracted metadata are currently stored in the XLF under the `<context>` element.
The problem is that some CAT tools (such as XTM) expect comments/metadata to be stored in the `<note>` element instead. By default, XTM just ignores the `<context>` element altogether. It would be great if the user could define the XLF element name where the metadata should be stored, or perhaps have a choice between the 3-4 most common options, so that this metadata parameter could be used in conjunction with all TMS/CAT tools.

Makes sense, we already do this for at least one other filter. I wish more systems made use of context-group/context, it's a better system for metadata, but what can you do.
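Until such an option exists, one possible stopgap is a post-processing step that copies each `<context>` value into a `<note>` before the XLIFF goes to the CAT tool. The Python sketch below assumes XLIFF 1.2 and uses only the standard library; the function name `contexts_to_notes` and the sample document (including its attribute values) are invented for illustration and may not match what the filter actually writes.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
# Serialize with XLIFF as the default namespace instead of an "ns0:" prefix.
ET.register_namespace("", XLIFF_NS)


def q(tag):
    """Return the namespace-qualified name of an XLIFF 1.2 element."""
    return f"{{{XLIFF_NS}}}{tag}"


def contexts_to_notes(xliff_text):
    """Append one <note> per <context> found in each trans-unit."""
    root = ET.fromstring(xliff_text)
    # Materialize the list first so the tree is not modified while iterating.
    for unit in list(root.iter(q("trans-unit"))):
        for ctx in unit.findall(f"{q('context-group')}/{q('context')}"):
            note = ET.SubElement(unit, q("note"))
            note.text = ctx.text
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; real extractions will look different.
sample = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="demo.xlsx" source-language="en" target-language="fr" datatype="x-undefined">
<body>
<trans-unit id="1">
<source>Save</source>
<context-group name="meta"><context context-type="x-comment">Button label</context></context-group>
</trans-unit>
</body>
</file>
</xliff>"""

converted = contexts_to_notes(sample)
print(converted)
```

The original `<context-group>` is left in place, so tools that do read it lose nothing.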

6. But to me the biggest limitation, and it's more to do with Rainbow itself, is the fact that we can't prep multi-target XLSX files (a very common scenario).
Of course it's tied to the fact that we can't create multilingual kits in Rainbow, and I understand it would necessitate a lot of work to implement that feature, but it's still worth mentioning as I'm definitely not the only one facing this limitation. My current workaround for multi-target projects, as you may recall, is to prep those with a Python script via Tikal (similar to what Victor mentioned a while ago), but it's not ideal.
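The core of that kind of workaround can be sketched as splitting one multi-target table into one bilingual table per language, each of which can then be processed separately. This is only a guess at the general shape, not the script mentioned above: the function `split_by_target`, the dictionary-based rows, and the column names are all assumptions, and the Tikal step itself is left out.

```python
def split_by_target(rows, source_column, target_columns):
    """Turn one multilingual table into one bilingual table per target language."""
    return {
        lang: [(row[source_column], row.get(lang, "")) for row in rows]
        for lang in target_columns
    }


# Hypothetical rows, keyed by column header.
rows = [
    {"en": "Save", "fr": "Enregistrer", "de": ""},
    {"en": "Cancel", "fr": "", "de": "Abbrechen"},
]
bilingual = split_by_target(rows, "en", ["fr", "de"])
for lang, pairs in bilingual.items():
    print(lang, pairs)
```

Each resulting list of (source, target) pairs corresponds to one bilingual file, with empty strings marking segments that still need translation.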