openxml pptx chart craziness...


Jim Hargrave

Jun 11, 2025, 6:07:29 PM
to Group: okapi-devel

Denis helped us with one use case for charts, but fully localizing a chart in PPTX seems to be a deep rabbit hole. These charts are from the financial domain, so they are pretty complicated. I found several other cases of cached values, such as numCache, which can hold currency values that need conversion, etc.

What do you think about recursive extraction of embedded spreadsheets? Also, the formatCode is more of an internationalized format string, like ICU. We can extract it, but translators may not understand how to localize it.

From our internal discussion:

These charts normally reference spreadsheets (sometimes external, sometimes embedded). openxml does something clever: it caches the spreadsheet values in the PPTX. Denis and I decided it would be easier to extract the cached values than to try to find the original spreadsheets. The spreadsheets take priority over the cached values, though, so there will be a mismatch. If I understand correctly, if the localized PPTX file with translated cache values finds the original spreadsheet, the cached values will be overwritten. So we will need to localize the spreadsheet as well.

There is another, even more complicated case: sometimes I see PPTX files with embedded (not external) spreadsheets.

We have also found that chart cached values use a formatting string (like ICU) which can contain language-specific content (ex: <c:formatCode>#,##0"億円"</c:formatCode>)! On top of the difficulty of tracking down cached values vs. spreadsheets, we have an internationalization problem.

Attached are some example files and screenshots showing what is missing.
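To make the cached-value case concrete, here is a minimal Python sketch (not Okapi code) that walks a chart part and collects the strCache values, the numCache values, and any formatCode. The sample XML and the function name collect_cached_values are made up for illustration; only the element names and the chart namespace come from the format itself.

```python
import xml.etree.ElementTree as ET

C_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart"
NS = {"c": C_NS}

# Hypothetical, heavily trimmed chart part (e.g. ppt/charts/chart1.xml).
SAMPLE_CHART = """<c:chartSpace xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart">
  <c:ser>
    <c:tx><c:strRef><c:f>Sheet1!$B$1</c:f>
      <c:strCache><c:ptCount val="1"/><c:pt idx="0"><c:v>Revenue</c:v></c:pt></c:strCache>
    </c:strRef></c:tx>
    <c:val><c:numRef><c:f>Sheet1!$B$2:$B$3</c:f>
      <c:numCache>
        <c:formatCode>#,##0"億円"</c:formatCode>
        <c:ptCount val="2"/>
        <c:pt idx="0"><c:v>120</c:v></c:pt>
        <c:pt idx="1"><c:v>135</c:v></c:pt>
      </c:numCache>
    </c:numRef></c:val>
  </c:ser>
</c:chartSpace>"""


def collect_cached_values(chart_xml):
    """Collect cached strings, cached numbers, and format codes from a chart part."""
    root = ET.fromstring(chart_xml)
    found = {"strings": [], "numbers": [], "format_codes": []}
    # strCache holds cached text such as series names and category labels.
    for cache in root.iter(f"{{{C_NS}}}strCache"):
        found["strings"] += [v.text for v in cache.findall("c:pt/c:v", NS)]
    # numCache holds cached numbers plus the formatCode used to display them.
    for cache in root.iter(f"{{{C_NS}}}numCache"):
        found["numbers"] += [v.text for v in cache.findall("c:pt/c:v", NS)]
        found["format_codes"] += [f.text for f in cache.findall("c:formatCode", NS)]
    return found


if __name__ == "__main__":
    print(collect_cached_values(SAMPLE_CHART))
```

The c:f formulas (Sheet1!$B$1 and so on) are what tie each cache back to the source spreadsheet, which is why the cache and the spreadsheet can drift apart after translation.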

Thoughts?

Chase Tingley

Jun 11, 2025, 8:35:42 PM
to okapi...@googlegroups.com
I've always tried to stay away from embedded spreadsheets, just because the concept seemed so gross I didn't want to encourage it.  (Also, it at least used to be true that the embeds could be in the pre-2007 binary format, which was a problem.)  If the embeds are in the modern format, it seems like it would be pretty possible for the filter to call itself recursively.
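As a rough illustration of the format check that recursion would need, the Python sketch below lists the embedded parts in a PPTX package and classifies each one by its leading bytes: ZIP-based (modern OOXML) or OLE2 compound file (pre-2007 binary). It assumes the embeds live under ppt/embeddings/, which is the usual location but not guaranteed; a robust filter would follow the chart's relationships instead. The function name classify_embeddings is hypothetical.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # pre-2007 binary compound file
ZIP_MAGIC = b"PK\x03\x04"                        # OOXML packages are ZIP archives


def classify_embeddings(pptx_bytes, prefix="ppt/embeddings/"):
    """Map each embedded part to 'ooxml', 'ole2', or 'unknown' by its magic bytes."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as pkg:
        for name in pkg.namelist():
            # Assumption: embedded workbooks sit under the conventional prefix.
            if not name.startswith(prefix) or name.endswith("/"):
                continue
            head = pkg.read(name)[:8]
            if head.startswith(ZIP_MAGIC):
                result[name] = "ooxml"    # candidate for recursive extraction
            elif head.startswith(OLE2_MAGIC):
                result[name] = "ole2"     # legacy binary, needs different handling
            else:
                result[name] = "unknown"
    return result
```

Only the parts classified as "ooxml" would be candidates for handing back to the same filter.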

I'm torn about this, because extracting the embeds does seem cleaner in some ways from a localization perspective, and dealing with cached values can lead to a fair amount of user confusion. (This has come up a couple of times with Table of Contents values in Word, etc.)

Probably either approach is viable.


Jim Hargrave

Jun 11, 2025, 10:16:07 PM
to okapi...@googlegroups.com, Chase Tingley

Right, totally agree. But if I understand correctly, if the spreadsheets are embedded, they will overwrite the cache values when the document is loaded. We may have no choice. Hmm, could we remove the embedded spreadsheets as an option?

For documents without an embedded spreadsheet, we have no choice but to extract the cache values and hope the users don't use the original, external spreadsheet.

Then there is the formatCode (basically a format string with placeholders, like ICU). We must extract those as well, but perhaps the filter can add a comment with specific instructions.
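One way to make formatCode less intimidating for translators would be to expose only the quoted literal text and protect the pattern syntax. The Python sketch below is an illustration of that idea, not an existing filter feature: split_format_code and join_format_code are hypothetical names, and it only handles double-quoted literals (backslash escapes and bracketed sections such as currency or locale tags are ignored).

```python
import re

# Quoted runs in a number format code are literal text; the rest is pattern syntax.
_LITERAL = re.compile(r'"([^"]*)"')
_PLACEHOLDER = re.compile(r"\{(\d+)\}")


def split_format_code(code):
    """Return (template, literals), with each quoted literal replaced by {0}, {1}, ..."""
    literals = []

    def _swap(match):
        literals.append(match.group(1))
        return "{%d}" % (len(literals) - 1)

    return _LITERAL.sub(_swap, code), literals


def join_format_code(template, literals):
    """Rebuild a format code from the template and the (possibly translated) literals."""
    return _PLACEHOLDER.sub(lambda m: '"%s"' % literals[int(m.group(1))], template)


if __name__ == "__main__":
    template, literals = split_format_code('#,##0"億円"')
    print(template, literals)
    print(join_format_code(template, literals))
```

With the example from this thread, the translator would see only 億円 plus a note explaining that the surrounding #,##0 pattern is protected. Whether the unit itself needs a conversion is still a human decision.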

J

Jim Hargrave

Jun 12, 2025, 1:20:19 PM
to okapi...@googlegroups.com, Chase Tingley, yves.s...@gmail.com, Denis A. Konovalyenko

Since openxml is growing in importance for us, I will try to tackle this PR myself (with guidance from Denis) to get some experience with openxml. Here's what I propose. I will create an issue as well.

  1. *Fully* handle localization of PPTX charts.
  2. Option: Recursively search for embedded spreadsheets and remove them on merge (to prevent localized cache values from being overwritten).
  3. Extract all chart values (cache, etc.).
  4. Be mindful of formatting strings like formatCode: extract them and mark them as localizable (vs. the default, translatable). Add a Note with a warning to translators.
  5. Fully account for font/typeface changes in charts based on target locale.
  6. Find a way to annotate any values that need to be internationalized (currency conversions, etc.) to give more context to translators.
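For item 2, the core of "remove on merge" could look roughly like the Python sketch below, which copies the package while skipping everything under ppt/embeddings/. The function name strip_embeddings and the fixed prefix are assumptions for illustration. Note that this leaves the relationship entries and the chart's c:externalData reference dangling; a real implementation would probably need to clean those up too, and behavior in PowerPoint itself is untested here.

```python
import io
import zipfile


def strip_embeddings(pptx_bytes, prefix="ppt/embeddings/"):
    """Return a copy of the package without the parts under `prefix`."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Skip embedded workbooks so they cannot overwrite translated caches.
            if info.filename.startswith(prefix):
                continue
            # Copy every other part unchanged, keeping its original metadata.
            dst.writestr(info, src.read(info.filename))
    return out_buf.getvalue()
```

Making this an option, as proposed, keeps the default behavior non-destructive for users who still want the embedded data.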

Questions:

  1. Did I miss anything?
  2. I don't have PowerPoint, but deleting embedded spreadsheets doesn't seem to cause an issue in LibreOffice - it loads the cache values as expected.