Using filter writer only when deleting content - not filter...

Jim Hargrave

Jul 10, 2025, 3:22:22 PM
to Group: okapi-devel

I have been discussing this with Denis in regard to removing embedded xlsx files and their related references in openxml, but it is a general architecture and usage point I would like to make.

Need to get more feedback.

My main point is that if we remove content, it should be done in the filter writer (not the filter itself). The main reason is that we want the jsonl (perhaps xliff too) to reflect the original document as imported. Tools may use the jsonl to provide context. In the case of embedded xlsx files, we would lose that context if those references are lost in the filtering process.

Does that make sense? I want to propose a general architectural rule: if content is deleted, it should be done at the latest possible stage and not be reflected in the extracted content (in our case, jsonl).

There may be exceptions like inline code simplification or some common cleanups that we do. But these normally don't impact the original, as the changes would be seen as "equivalent".

Specifically, this proposal applies to the delete_embedded_excel option for openxml. The PR as coded has this implemented in the filter writer.
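
To make the proposed rule concrete, here is a minimal sketch of the idea. It is not Okapi code: the Part class and the extract, to_jsonl and write_output functions are hypothetical stand-ins for the filter, the jsonl export and the filter writer, and the delete_embedded_excel flag is only named after the openxml option discussed in this thread. The point is that the extracted jsonl still contains the embedded xlsx reference, while the deletion happens only when the output is written.

```python
import json
from dataclasses import dataclass


@dataclass
class Part:
    """One extracted piece of a document (hypothetical, not an Okapi type)."""
    part_id: str
    kind: str      # e.g. "text" or "embedded_xlsx"
    content: str


def extract(parts):
    """Filter stage: pass everything through and delete nothing."""
    return list(parts)


def to_jsonl(parts):
    """Extracted view: one JSON object per line, mirroring the original."""
    return "\n".join(
        json.dumps({"id": p.part_id, "kind": p.kind, "content": p.content})
        for p in parts
    )


def write_output(parts, delete_embedded_excel=False):
    """Filter-writer stage: the only place where content is dropped."""
    kept = [
        p for p in parts
        if not (delete_embedded_excel and p.kind == "embedded_xlsx")
    ]
    return "".join(p.content for p in kept)


original = [
    Part("p1", "text", "Quarterly results. "),
    Part("p2", "embedded_xlsx", "[embedded: sheet1.xlsx] "),
    Part("p3", "text", "See the table above."),
]

extracted = extract(original)
jsonl = to_jsonl(extracted)          # still references the embedded xlsx
output = write_output(extracted, delete_embedded_excel=True)  # xlsx removed
```

With this split, a tool reading the jsonl keeps the full context of the imported document, and turning the option off in the writer leaves the output unchanged from the original.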

Jim

yves.s...@gmail.com

Jul 10, 2025, 10:49:49 PM
to okapi...@googlegroups.com

> if we remove content it should be done in the filter writer (not the filter itself). The main reason for this is that we want the jsonl (perhaps xliff too) to reflect the original document as imported.

This sounds reasonable.

The only objection I can think of is the case where the part to remove is part of the inline content, so the removal could not be reflected in the skeleton. But you mention those as exceptions.

-ys

Jim Hargrave

Jul 11, 2025, 2:50:08 PM
to okapi...@googlegroups.com, yves.s...@gmail.com

Thanks Yves!

openxml is a difficult case for inline codes, as there is so much cleaning, merging, etc. We can let that one go.

Jim
