Zip globalization support


Dioxide Oxygen

Apr 17, 2025, 10:45:43 PM
to OneCommander
Background
ZIP archives have historically stored filenames in the system's local encoding (e.g., CP936 for Chinese, Shift_JIS for Japanese) rather than UTF-8, and a legacy archive does not record which encoding was used. This causes cross-language compatibility issues: for example, filenames created by Japanese users (Shift_JIS) appear as garbled text when extracted by Chinese users (whose systems decode them as GBK/CP936), and vice versa. The problem is common in content-creation communities that rely heavily on international exchange, such as UTAU singing voice synthesis and MMD animation.
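To make the failure mode concrete, here is a minimal Python sketch of what happens to a single name. The filename `テスト.txt` is only an illustrative example, not one of the files from my screenshots.

```python
# A filename as a Japanese system would store it in a legacy zip.
name = "テスト.txt"  # hypothetical example name
raw = name.encode("shift_jis")

# A zh-CN system assumes the bytes are GBK/CP936 and decodes them that way.
garbled = raw.decode("gbk", errors="replace")

# Decoding with the encoding that was actually used recovers the name.
recovered = raw.decode("shift_jis")

print(garbled)
print(recovered)
```

The bytes themselves are never damaged; only the choice of decoder is wrong, which is why letting the user pick the encoding is enough to fix extraction.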


Example
I'm using Windows zh-CN. Here is a zip file from Japan in which all the filenames are Japanese. This is the result of extracting it with OneCommander (the filenames are incorrectly decoded as GB2312, resulting in garbled Chinese text):
Image
This is what the filenames are supposed to be (decoded as Shift-JIS):
Image


Things to do
  • Support extracting a zip file with a user-chosen encoding
  • Auto-detect the encoding
  • When creating a zip file, use UTF-8 encoding by default (so the filenames won't get garbled if I send it to someone on a different system locale)
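
The three items above can be sketched roughly as follows in Python, using only the standard `zipfile` module. This is an illustration of the idea, not the code from my PRs and not how OneCommander has to do it. All function names here are hypothetical, and the "detection" step only filters out encodings that cannot decode the bytes at all; a real detector would need a statistical library.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def decode_name(name: str, flag_bits: int, encoding: str) -> str:
    """Re-decode a name that zipfile decoded as CP437 (its default)."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded from UTF-8, nothing to fix
    # Undo the CP437 decoding to get the raw bytes, then apply the user's choice.
    return name.encode("cp437").decode(encoding)


def list_names(zip_bytes: bytes, encoding: str) -> list[str]:
    """List entry names, decoding legacy names with the chosen encoding."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [decode_name(i.filename, i.flag_bits, encoding)
                for i in zf.infolist()]


def plausible_encodings(raw: bytes,
                        candidates=("utf-8", "shift_jis", "gbk")) -> list[str]:
    """Crude filter: keep the candidates that decode without error."""
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


def make_utf8_zip(files: dict[str, bytes]) -> bytes:
    """Create a zip; zipfile marks non-ASCII names as UTF-8 automatically."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Note that several candidates can survive the filter for the same bytes (Shift_JIS names are often also valid GBK), so a manual encoding picker is still needed as a fallback even when auto-detection exists.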

Proposed technical implementation details
I've implemented the features above in Files, another third-party file manager for Windows written in C#. Here are my PRs:
https://github.com/files-community/Files/pull/17022 (using SharpZipLib, which supports extracting with a specified encoding)
https://github.com/files-community/Files/pull/17026 (add parameter "cu on" when adding files to an archive)
https://github.com/files-community/Files/pull/17045 (using UTF.Unknown to detect the encoding)

Why I'm working on this

I'm a developer of OpenUtau, an open-source C# implementation of UTAU, a singing voice synthesis system from Japan. Because of the zip encoding issue, when a non-Japanese user downloads a Japanese voicebank, its filenames come out garbled. So OpenUtau introduced a built-in voicebank extractor that lets the user choose the encoding, and a voicebank publisher that creates UTF-8 zip archives for global release. I'm now checking existing file management tools for their zip globalization support, and contributing to those written in a programming language I know.


References
Wikipedia: https://en.wikipedia.org/wiki/ZIP_(file_format)#Internationalization_issues
ZipUnicode, a tool for detecting the encoding of a zip file: https://pypi.org/project/ZipUnicode/