Use a Clean/Smudge Filter to display and process Windows UTF-16 files.

Ken Ismert

Apr 11, 2013, 8:27:07 PM
to msy...@googlegroups.com
Note: please ignore my earlier post "Script for handling UTF-16 files". Use this method instead.

Microsoft programs have mostly standardized on the UTF-16 (UCS-2) encoding for their Unicode files. Sadly, none of the common open source version control systems (Git, Mercurial and Bazaar) handles this encoding properly. They all treat UTF-16 files as binary, because the files usually contain a lot of NUL bytes.
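
To illustrate why the binary detection fires, here is a minimal sketch that counts the zero bytes in the same short text encoded both ways. It assumes iconv, tr and wc are on your PATH; count_nul is a hypothetical helper name, not part of Git or iconv.

```shell
# Hypothetical helper: count the NUL bytes arriving on stdin.
count_nul() {
    # Delete everything except NUL, then count what is left.
    tr -cd '\000' | wc -c | tr -d ' '
}

# The same six characters, once as UTF-8 and once as UTF-16 (little-endian, no BOM).
utf8_nuls=$(printf 'hello\n' | count_nul)
utf16_nuls=$(printf 'hello\n' | iconv -f utf-8 -t utf-16le | count_nul)

echo "UTF-8: $utf8_nuls NUL bytes, UTF-16LE: $utf16_nuls NUL bytes"
```

For plain ASCII text, every second byte of the UTF-16 version is zero, which is what makes the file look binary.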

My first attempt to solve this problem used a diff textconv conversion to display UTF-16 files in diffs. But that approach is limited to diffs only.

Fortunately, Git provides a much better fix: a clean & smudge filter. This allows Git to treat your UTF-16 files as text in most cases: merge, git-grep, gitattributes (eol-conversion, ident-replacement, built-in diff patterns...).

Here is how to set up this much nicer handling of UTF-16 files in msysGit for Windows:
  1. Get GNU libiconv and install it
  2. Ensure that the libiconv directory (usually "C:\Program Files\GnuWin32\bin") is in your %PATH%
  3. Add the following to ~\Git\etc\gitconfig:
    [filter "winutf16"]
        clean = iconv -f utf-16 -t utf-8
        smudge = iconv -f utf-8 -t utf-16
        required
  4. Add this line to your global ~/Git/etc/gitattributes or local ~/.gitattributes:
    *.txt filter=winutf16
That's it! UTF-16 files should look and work normally for most Git functions.
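
Before trusting the filter with real files, it may be worth checking by hand that the two iconv commands undo each other. The following is a rough sketch, assuming iconv and cmp are on your PATH; the scratch directory and file names are made up for the demonstration.

```shell
# Hypothetical scratch files for the demonstration.
work=$(mktemp -d)
printf 'line one\nline two\n' | iconv -f utf-8 -t utf-16 > "$work/original.txt"

# "clean": what Git would store in the repository.
iconv -f utf-16 -t utf-8 < "$work/original.txt" > "$work/stored.txt"

# "smudge": what Git would write back to the working tree.
iconv -f utf-8 -t utf-16 < "$work/stored.txt" > "$work/checkout.txt"

# Compare byte for byte; cmp -s is silent and only sets the exit status.
if cmp -s "$work/original.txt" "$work/checkout.txt"; then
    roundtrip=ok
else
    roundtrip=changed
fi
echo "round trip: $roundtrip"
```

Note that this only proves the round trip for files that your iconv build wrote itself. The byte order and byte order mark that iconv chooses for plain "utf-16" can differ between builds, so a file saved by a Windows program may not come back byte-identical; try it on one of your own files. To confirm that Git has picked up the attribute, `git check-attr filter -- yourfile.txt` should report winutf16.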

Thanks to Karsten Blees and Erik Faye-Lund for their help!

-Ken

Ken Ismert

Apr 29, 2013, 10:29:52 AM
to msy...@googlegroups.com
 
Here is a better replacement for the above method, which handles files of mixed types with the same extension:

1. Get GNU libiconv and file, and install both.
2. Ensure that the GnuWin32\bin directory (usually "C:\Program Files\GnuWin32\bin") is in your %PATH%
3. Add the following to ~\Git\etc\gitconfig:
[filter "mixedtext"]
    clean = iconv -sc -f $(file -b --mime-encoding %f) -t utf-8
    smudge = iconv -sc -f utf-8 -t $(file -b --mime-encoding %f)
    required
4. Add a line to your global ~/Git/etc/gitattributes or local ~/.gitattributes to handle mixed format text:
*.txt filter=mixedtext
 
This is the 20% effort that should cover 80% of all Windows text formatting problems.

Thanks to Johannes Sixt and Steffen Jaeckel for their help in getting this new clean/smudge running!

-Ken

Ken Ismert

May 1, 2013, 8:41:37 PM
to msy...@googlegroups.com, steven....@gmail.com

Thanks for this, Ken! It would seem you and I are the only ones on the planet trying to solve Git's lack of support for UCS-2/UTF-16.

One question: it doesn't seem to compensate for files that are already UTF-8, and it appears it can actually corrupt them. Have you seen any behavior that would corroborate this?

Steve
 
Steve,

Thanks for the heads up. This is a brand-new solution, so I haven't seen any corruption yet on the few UTF-8 files I have tried.

If you have a problem file that you are willing to share, you could send it to me and I will try to duplicate your result.

I would expect that if you told iconv to convert to and from the same format, it would just become a no-op pass-through of stdin.

But if that is not the case, perhaps a script is required that will substitute cat for iconv if the source format is UTF-8.
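
A minimal sketch of what such a script could look like for the clean direction, untested against real problem files. The function name clean_to_utf8 is hypothetical, and treating us-ascii the same as utf-8 is my assumption (ASCII is already valid UTF-8).

```shell
# Hypothetical helper: convert stdin to UTF-8, given the source encoding name
# (for example, the output of `file -b --mime-encoding <path>`).
clean_to_utf8() {
    case "$1" in
        utf-8|us-ascii)
            # Already UTF-8 compatible: pass the bytes through untouched.
            cat ;;
        *)
            # Anything else: let iconv do the conversion.
            iconv -f "$1" -t utf-8 ;;
    esac
}

# Possible use from a clean filter (path is a placeholder):
# clean_to_utf8 "$(file -b --mime-encoding notes.txt)" < notes.txt
```

A smudge counterpart would need the same kind of check in the other direction, and file can report values that iconv will not accept as an encoding name (for example "binary"), so those cases would need their own branches.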

-Ken