Script for handling UTF-16 files


Ken Ismert

Apr 9, 2013, 7:47:24 PM
to msy...@googlegroups.com

I bumped into the UTF-16 display problem with Git Extensions running on top of msysGit. After lots of searching and experimenting, I came up with a solution that works for me.

Note: Please see questions below.

This method is for msysGit 1.8.1 and was tested on Windows XP. I use Git Extensions 2.44, but since the changes are at the Git level, they should work for Git Gui as well. Steps:

1) Install GNU iconv: http://gnuwin32.sourceforge.net/packages/libiconv.htm

2) Create the following script, name it astextutf16, and place it in the /bin directory of your Git installation (this is based on the existing astextplain script):

    #!/bin/sh -e
    # converts Windows Unicode (UTF-16 / UCS-2) to Git-friendly UTF-8
    # notes:
    # * requires Gnu iconv:
    #        http://gnuwin32.sourceforge.net/packages/libiconv.htm
    # * this script must be placed in: ~/Git/bin
    # * modify global ~/Git/etc/gitconfig or local ~/.git/config:
    #        [diff "astextutf16"]
    #            textconv = astextutf16
    # * or, from command line:
    #        $ git config diff.astextutf16.textconv astextutf16
    # * modify global ~/Git/etc/gitattributes or local ~/.gitattributes:
    #        *.txt diff=astextutf16
    if test "$#" != 1 ; then
        echo "Usage: astextutf16 <file>" 1>&2
        exit 1
    fi
    # -f(rom) utf-16 -t(o) utf-8
    "\Program Files\GnuWin32\bin\iconv.exe" -f utf-16 -t utf-8 "$1"
    exit 0
   
3) Modify the global ~/Git/etc/gitconfig or your local ~/.git/config file, and add these lines:

    [diff "astextutf16"]
        textconv = astextutf16
       
4) Or, from command line:

    $ git config diff.astextutf16.textconv astextutf16

5) Modify the global ~/Git/etc/gitattributes or your local ~/.gitattributes file, and map your extensions to be converted:

    *.txt diff=astextutf16

6) Test. UTF-16 files should now be visible.
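
For step 6, a quick way to check the conversion outside of Git is to build a small UTF-16 file and run the same iconv call that the script makes. This is only a sketch: it assumes `iconv` is reachable on $PATH (the script above uses a hard-coded path instead), and `sample.txt` and the function name `utf16_to_utf8` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes iconv is on $PATH; utf16_to_utf8 and sample.txt are hypothetical names.

# Same conversion the astextutf16 script performs: UTF-16 in, UTF-8 out.
utf16_to_utf8() {
    iconv -f utf-16 -t utf-8 "$1"
}

workdir=$(mktemp -d)

# Build a UTF-16 test file from plain text
# (GNU iconv normally writes a byte-order mark first).
printf 'first line\nsecond line\n' | iconv -f utf-8 -t utf-16 > "$workdir/sample.txt"

# If this prints two readable lines, textconv should be able to show the file too.
utf16_to_utf8 "$workdir/sample.txt"

rm -rf "$workdir"
```

If the conversion works here but the diff in Git Extensions still shows binary content, the problem is more likely in the gitconfig or gitattributes entries than in iconv itself.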

==========

Questions:
a) This only works when the file is in the \Git\bin directory. Although many web examples suggested otherwise, I never could get this script to work in a stand-alone directory. Is there something magic about having it in \bin?

b) This is a 'band-aid' rather than a real 'fix', which I gather is hard to implement. Still, I would like to have functionality like this built-in, so I could at least manually configure which files in a project are UTF-16, and which are not. Is this unrealistic?

c) I had success with iconv, but is there any built-in UTF-16 to UTF-8 converter that ships with msysGit?

As a quick fix, how hard would it be to add a 'utf16' diff filter, similar to cpp or csharp? Or is this simply the wrong place to put in a work-around?

Thanks,
-Ken

Erik Faye-Lund

Apr 10, 2013, 3:44:48 AM
to Ken Ismert, msysGit
Nice. Thanks a lot for posting this, I'm sure others will find it useful too.

> Questions:
> a) This only works when the file is in the \Git\bin directory. Although many
> web examples suggested otherwise, I never could get this script to work in a
> stand-alone directory. Is there something magic about having it in \bin?
>

Did you try other directories that are in $PATH?
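
One way to check this from the Git Bash prompt is to ask the shell whether it can resolve the command names. A minimal sketch, where `on_path` is a hypothetical helper and `astextutf16` is the script from the original post:

```shell
#!/bin/sh
# Sketch only: on_path is a hypothetical helper name.

# Returns 0 if the shell can resolve the name as a command
# (for an external program, that means it was found via $PATH).
on_path() {
    command -v "$1" >/dev/null 2>&1
}

for tool in iconv astextutf16; do
    if on_path "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```

If a tool is reported as not found, adding its directory to $PATH for the session (for example `PATH="$PATH:/path/to/dir"`, with the real directory substituted) and running the check again shows whether that is the missing piece.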

> b) This is a 'band-aid' rather than a real 'fix', which I gather is hard to
> implement. Still, I would like to have functionality like this built-in, so
> I could at least manually configure which files in a project are UTF-16, and
> which are not. Is this unrealistic?

I'm not sure. Supporting multiple encodings built-in sounds like a bit
of a slippery slope to me; what about UTF-16BE, UCS2, Big5, Shift JIS,
EBCDIC? But perhaps it could be possible to have libiconv handle all
of this for us?

On the other hand, UTF-16LE is quite common on Windows. Perhaps it
would make an OK compromise for Git for Windows to distribute
something similar to what you have here out of the box?

> c) I had success with iconv, but is there any built-in UTF-16 to UTF-8
> converter that ships with msysGit?

I thought we already shipped with iconv...? At least my msysGit
install has /mingw/bin/iconv. Ah, you're thinking of Git for
Windows... I can't see that we copy our iconv-binary to the Git for
Windows installer, so perhaps starting to ship it could make sense.

Karsten Blees

Apr 10, 2013, 2:59:56 PM
to Ken Ismert, msy...@googlegroups.com, g...@vger.kernel.org
On 10.04.2013 01:47, Ken Ismert wrote:
>
> I bumped into the UTF-16 display problem with Git Extensions running on top of msysGit. After lots of searching and experimenting, I came up with a solution that works for me.
>
> Note: Please see questions below.
>
> This method is for MSysGit 1.8.1, and is tested on Windows XP. I use Git Extensions 2.44, but since the changes are at the Git level, they should work for Git Gui as well. Steps:

There has been a discussion about handling UTF-16 on the git ML a while back, see http://thread.gmane.org/gmane.comp.version-control.git/159708

As suggested there, I would try to use a clean/smudge filter (i.e. store UTF-16 files as UTF-8 in the repository and convert back to UTF-16 on checkout). That way git can treat your UTF-16 files as text in most cases (i.e. you can merge them, git-grep works, gitattributes work (eol-conversion, ident-replacement, built-in diff patterns...)).

If you use a textconv filter, UTF-16 content will be treated as binary by most git operations.

There's also an 'encoding' attribute and a 'gui.encoding' setting which in theory should solve your issue (i.e. specify encoding of files for display by GUI tools). I don't know if Git Extensions supports that, or whether it's supposed to work for binary files at all.

> 3) Modify the global ~/Git/etc/gitconfig or your local ~/.git/config file, and add these lines:
>
> [diff "astextutf16"]
> textconv = astextutf16

Why not simply "textconv = iconv -f utf-16 -t utf-8", without the extra script?
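
From the command line, that suggestion might look like the sketch below. It assumes `git` is installed and `iconv` is on $PATH, and it writes to a scratch config file (via `git config --file`) so that nothing in a real configuration is touched; drop `--file "$cfg"` to write to the current repository instead. `set_inline_textconv` is a hypothetical function name.

```shell
#!/bin/sh
# Sketch only: assumes git is installed; cfg is a throwaway file.
cfg=$(mktemp)

set_inline_textconv() {
    # The value contains spaces, so it must reach git as one quoted argument.
    git config --file "$1" diff.astextutf16.textconv "iconv -f utf-16 -t utf-8"
}

set_inline_textconv "$cfg"

# Show what was stored.
git config --file "$cfg" --get diff.astextutf16.textconv
```

The gitattributes line (`*.txt diff=astextutf16`) stays the same; only the command behind the driver name changes.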

> c) I had success with iconv, but is there any built-in UTF-16 to UTF-8 converter that ships with msysGit?

There are ready-to-use UTF-conversion functions in the codebase, but these are not accessible as a git command or built-in filter.

> As a quick fix, how hard would it be to add a 'utf16' diff filter, similar to cpp or csharp? Or is this simply the wrong place to put in a work-around?

As described above, I think a diff filter is not the right tool for the job. The only universal format for text content that works reasonably well with established text-based technologies (merge algorithms, regex etc.) is UTF-8. If we want to benefit from these technologies, git should store text files as UTF-8 and convert from / to platform-specific formats on checkin / checkout or for display.

Bye,
Karsten

Ken Ismert

Apr 10, 2013, 5:22:50 PM
to msy...@googlegroups.com

Erik, Karsten:

Thanks for your useful feedback. Please see my suggestions at the end.

Answers to your questions:

Erik:

> Did you try other directories that are in $PATH?

Ok, that helped a lot. Adding my directory to $PATH fixed the problem. I was trying to hard-code the path (per some Linux examples).

Karsten:

> Why not simply "textconv = iconv -f utf-16 -t utf-8", without the extra script?

Following Erik's $PATH suggestion, I got that working too.
Hard-coding the path doesn't work. With "C:\\Program Files\\GnuWin32\\bin\\iconv -f utf-16 -t utf-8" as the textconv value, Git returns:

    C:\Program Files\GnuWin32\bin\iconv -f utf-16 -t utf-8: C:Program: command not found
    fatal: unable to read files to diff

That looks like a command parsing error: the backslashes are dropped and the path is split at the space. Enclosing the path in quotes should fix that, but it doesn't. Not a big issue, since $PATH works.

Still, that removes the need for a separate script.
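
The `C:Program` in that message is what a POSIX shell produces when it parses the unquoted string: each backslash escapes the next character and disappears, and the space ends the first word. The sketch below reproduces just the parsing step, without running iconv. `first_word` is a hypothetical helper. The second call shows one quoting style (forward slashes inside double quotes) that keeps the path in one piece at the shell level; whether it survives Git's config file parsing depends on how the quotes are escaped there, which is not tested here.

```shell
#!/bin/sh
# Sketch only: first_word is a hypothetical helper; the paths are never executed.

# Re-parse a command string the way `sh -c` would and print the first word,
# i.e. the name the shell would try to run.
first_word() {
    eval "set -- $1"
    printf '%s\n' "$1"
}

# Unquoted Windows path: backslashes vanish and the space splits the path.
# prints: C:Program
first_word 'C:\Program Files\GnuWin32\bin\iconv -f utf-16 -t utf-8'

# Quoted path with forward slashes: stays one word.
# prints: C:/Program Files/GnuWin32/bin/iconv
first_word '"C:/Program Files/GnuWin32/bin/iconv" -f utf-16 -t utf-8'
```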

> ... I would try to use a clean/smudge filter
> (i.e. store UTF-16 files as UTF-8 in the repository ...

An even nicer solution! Now that $PATH is fixed, I put the following into ~\Git\etc\gitconfig:

    [filter "winutf16"]
        clean = iconv -f utf-16 -t utf-8
        smudge = iconv -f utf-8 -t utf-16
        required

And updated my local .gitattributes:

    *.txt filter=winutf16

Now, no need for the diff filter! Looks good so far in Git Extensions. I'll test further to confirm.
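
Before relying on a filter like this, it may be worth checking that smudge followed by clean gives back the same text, since otherwise files can show up as modified right after checkout. A rough sketch, assuming `iconv` is on $PATH; the function names are hypothetical wrappers around the two commands in the `winutf16` filter:

```shell
#!/bin/sh
# Sketch only: assumes iconv is on $PATH; all function names are hypothetical.

# clean: working tree (UTF-16) -> repository (UTF-8)
clean_winutf16() { iconv -f utf-16 -t utf-8; }

# smudge: repository (UTF-8) -> working tree (UTF-16)
smudge_winutf16() { iconv -f utf-8 -t utf-16; }

# Returns 0 if a UTF-8 string survives smudge followed by clean unchanged.
roundtrip_ok() {
    original=$1
    restored=$(printf '%s' "$original" | smudge_winutf16 | clean_winutf16)
    [ "$restored" = "$original" ]
}

# Sample text with one non-ASCII character (e-acute, written as UTF-8 octal bytes).
sample=$(printf 'caf\303\251 au lait')

if roundtrip_ok "$sample"; then
    echo "round trip ok"
else
    echo "round trip FAILED"
fi
```

Note that plain `utf-16` leaves the byte order of the output up to the iconv build. If the working-tree files must be little-endian, as Windows tools usually expect, the output of the smudge step is worth inspecting.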


> There's also an 'encoding' attribute and a 'gui.encoding' setting
> which in theory should solve your issue (i.e. specify encoding
> of files for display by GUI tools). I don't know if Git Extensions
> supports that, or whether it's supposed to work for binary files
> at all.

I will try that out soon.

==============================
Suggestions:

Unless the 'encoding' attribute works, could I suggest the following?

1) Include iconv in the Windows distribution of msysGit
2) Add something like the "winutf16" filter to ~\Git\etc\gitconfig

That, and posting a question/answer to Stack Overflow should largely cover this problem, until a comprehensive solution is implemented.

Thanks,
-Ken

Ken Ismert

Apr 11, 2013, 3:11:25 PM
to msy...@googlegroups.com, Ken Ismert, g...@vger.kernel.org

Karsten,

> There's also an 'encoding' attribute and a 'gui.encoding' setting which in theory should solve your issue (i.e. specify encoding of files for display by GUI tools). I don't know if Git Extensions supports that, or whether it's supposed to work for binary files at all.

I played around with this, and neither 'encoding' nor 'gui.encoding' solves my problem in Git Extensions or Git Gui.

According to the man page, I would put this entry in my .gitattributes:
*.txt encoding=utf-8

This sounds rather tautological -- "Why yes, I do want to display my UTF-16 file as text!"

More to the point, since Git *doesn't know* the original encoding of a 'non-standard' Unicode file, how can it possibly know how to convert it to UTF-8?

So, 'encoding' and 'gui.encoding' seem to be poorly conceived, and certainly aren't helpful for the Windows UTF-16 problem.

-Ken