[PATCH v12] Issue 80: Unicode support on Windows

karste...@dcon.de

Nov 26, 2011, 7:52:01 PM
to msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com, at...@chejz.com, pa...@paulbetts.org

Hi all,

the last version of the unicode patch series was on top of a tentative V1.7.7, so I thought an update might be in order. I haven't changed any code, just comments/commit messages/patch order as discussed on the mailing list (so I think I'll save the 25 followup mails). Here is the current version:
git: http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v12
less-444 (unchanged): https://github.com/kblees/msysgit/commits/kb/msys/less
Unicode msys.dll (unchanged): https://github.com/kblees/msysgit/commits/kb/msys/unicode

A git installer that bundles msys.dll, less-444 and git v1.7.7.1 with unicode-v12 patches can be found here:
https://docs.google.com/uc?id=0BxXUoUg2r8ftMzc4NmIyNzctZmZlYS00NDhjLTg3OTAtOTEzNWQ1ZTVmYzAx&export=download

I also managed to build a TortoiseGit that works with the Unicode-aware Git. I simply replaced all 'CP_ACP' in the sources with 'CP_UTF8', which is a dirty hack but seems to do the trick (no, I don't plan to delve much deeper into TortoiseGit :-). Installers can be found here, if anyone is interested:
32 bit: https://docs.google.com/uc?id=0BxXUoUg2r8ftYjFmMzg0MmItYjZhMy00MjM4LWFkYjktN2RiOTUxNDdiMzdk&export=download
64 bit: https://docs.google.com/uc?id=0BxXUoUg2r8ftNDQxMWEyMTgtM2FjMy00YzA5LTgxNGEtNTc2MjhkYWZjNDIy&export=download


Changes since kb/unicode-v11:
- removed doxygen tags
- improved some commit messages
- removed most MSVC patches (the 'MSVC: link dynamically to the CRT' patch is a prerequisite for using __wgetmainargs later)
- 'Win32: Thread-safe windows console output': removed dependency on utftowcsn so that it can be moved up and merged independently

Commit logs ('+': new/rewritten in v12, '-': removed, '!': modified)
---
- [00/25] MSVC: disable const qualifier compiler warnings (unrelated)
- [00/25] MSVC: don't use uninitialized variables (unrelated)
- [00/25] MSVC: include <io.h> for mktemp (already fixed)
- [00/25] MSVC: fix winansi.c compile errors (already merged)
  [01/25] MSVC: link dynamically to the CRT
  [02/25] git-gui: fix encoding in git-gui file browser
  [03/25] gitk: fix file name encoding in diff hunk headers
  [04/25] Revert "Disable test on MinGW that challenges its bash quoting"
! [05/25] Win32: Thread-safe windows console output
! [06/25] Win32: add Unicode conversion functions
  [07/25] Win32: Unicode file name support (except dirent)
  [08/25] Win32: Unicode file name support (dirent)
  [09/25] Unicode file name support (gitk and git-gui)
  [10/25] Win32: Unicode arguments (outgoing)
  [11/25] Win32: Unicode arguments (incoming)
+ [12/25] Win32: sync Unicode console output and file system
  [13/25] Win32: Unicode environment (outgoing)
  [14/25] Win32: Unicode environment (incoming)
  [15/25] MinGW: disable legacy encoding tests
! [16/25] Win32: fix environment memory leaks
  [17/25] Win32: unify environment case-sensitivity
! [18/25] Win32: simplify internal mingw_spawn* APIs
  [19/25] Win32: move environment functions
  [20/25] Win32: unify environment function names
  [21/25] Win32: move environment block creation to a helper method
  [22/25] Win32: don't copy the environment twice when spawning child processes
  [23/25] Win32: reduce environment array reallocations
  [24/25] Win32: keep the environment sorted
! [25/25] Win32: patch Windows environment on startup

 Makefile                |    8 +-
 compat/mingw.c          |  706 +++++++++++++++++++++++++++++++---------------
 compat/mingw.h          |  111 +++++++-
 compat/win32/dirent.c   |   31 +--
 compat/win32/dirent.h   |    2 +-
 compat/winansi.c        |  374 +++++++++++++++----------
 git-gui/git-gui.sh      |    6 +-
 git-gui/lib/browser.tcl |    2 +-
 git-gui/lib/index.tcl   |    6 +-
 gitk-git/gitk           |   15 +-
 run-command.c           |   10 +-
 t/t3901-i18n-patch.sh   |   19 +-
 t/t4201-shortlog.sh     |    6 +-
 t/t5505-remote.sh       |    5 +-
 t/t8005-blame-i18n.sh   |    8 +-
 15 files changed, 852 insertions(+), 457 deletions(-)


And the diff between v11 and v12...
---
$ git diff -p --stat kb/unicode-v11-99df39b2..kb/unicode-v12
 compat/mingw.h |   30 +++++++++++++++++-------------
 1 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/compat/mingw.h b/compat/mingw.h
index 0a12a94..675eb06 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -368,13 +368,15 @@ void mingw_mark_as_git_dir(const char *dir);
  * or even indefinite-byte sequences, the largest valid code point \u10ffff
  * encodes as only 4 UTF-8 bytes.
  *
- * @param wcs wide char target buffer
- * @param utf string to convert
- * @param wcslen size of target buffer (in wchar_t's)
- * @param utflen size of string to convert, or -1 if 0-terminated
- * @return length of converted string (_wcslen(wcs)), or -1 on failure (errno
- *         is set to EINVAL or ENAMETOOLONG)
- * @see mbstowcs, MultiByteToWideChar
+ * Parameters:
+ * wcs: wide char target buffer
+ * utf: string to convert
+ * wcslen: size of target buffer (in wchar_t's)
+ * utflen: size of string to convert, or -1 if 0-terminated
+ *
+ * Returns:
+ * length of converted string (_wcslen(wcs)), or -1 on failure (errno is set
+ * to EINVAL or ENAMETOOLONG)
  */
 int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
 static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
@@ -402,12 +404,14 @@ static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
  *
  * Note that invalid code points > 10ffff cannot be represented in UTF-16.
  *
- * @param utf target buffer
- * @param wcs wide string to convert
- * @param utflen size of target buffer
- * @return length of converted string, or -1 on failure (errno is set to EINVAL
- *         or ENAMETOOLONG)
- * @see wcstombs, WideCharToMultiByte
+ * Parameters:
+ * utf: target buffer
+ * wcs: wide string to convert
+ * utflen: size of target buffer
+ *
+ * Returns:
+ * length of converted string, or -1 on failure (errno is set to EINVAL or
+ * ENAMETOOLONG)
  */
 int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen);
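
To illustrate the contract documented in the comments above (return the converted length, or -1 with errno set), here is a minimal, portable sketch. It is not the utftowcsn() from the patch: the name sketch_utf8_to_utf16 is made up for this example, it writes uint16_t instead of wchar_t so that it behaves the same where wchar_t is 32 bits wide, and it rejects malformed input with EINVAL without checking for over-long encodings.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the documented contract, NOT the patch code.
 * Returns the number of UTF-16 code units written (excluding the
 * terminator), or -1 with errno set to EINVAL (bad argument or malformed
 * UTF-8) or ENAMETOOLONG (target buffer too small).
 */
static int sketch_utf8_to_utf16(uint16_t *wcs, const char *utf, size_t wcslen)
{
    const unsigned char *s = (const unsigned char *)utf;
    size_t wpos = 0;

    if (!wcs || !utf || wcslen < 1) {
        errno = EINVAL;
        return -1;
    }

    while (*s) {
        uint32_t cp;
        int trail, i;
        size_t need;

        /* decode the lead byte: payload bits and number of trail bytes */
        if (*s < 0x80) {
            cp = *s;
            trail = 0;
        } else if ((*s & 0xe0) == 0xc0) {
            cp = *s & 0x1f;
            trail = 1;
        } else if ((*s & 0xf0) == 0xe0) {
            cp = *s & 0x0f;
            trail = 2;
        } else if ((*s & 0xf8) == 0xf0) {
            cp = *s & 0x07;
            trail = 3;
        } else {
            errno = EINVAL;
            return -1;
        }
        s++;

        /* each trail byte must look like 10xxxxxx */
        for (i = 0; i < trail; i++, s++) {
            if ((*s & 0xc0) != 0x80) {
                errno = EINVAL;
                return -1;
            }
            cp = (cp << 6) | (*s & 0x3f);
        }

        /* surrogates and values above 0x10ffff are not valid code points */
        if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
            errno = EINVAL;
            return -1;
        }

        /* keep one code unit free for the terminator */
        need = cp >= 0x10000 ? 2 : 1;
        if (wpos + need >= wcslen) {
            wcs[wpos] = 0;
            errno = ENAMETOOLONG;
            return -1;
        }

        if (need == 2) {
            /* code points above 0xffff become a surrogate pair */
            cp -= 0x10000;
            wcs[wpos++] = (uint16_t)(0xd800 + (cp >> 10));
            wcs[wpos++] = (uint16_t)(0xdc00 + (cp & 0x3ff));
        } else {
            wcs[wpos++] = (uint16_t)cp;
        }
    }
    wcs[wpos] = 0;
    return (int)wpos;
}
```

A caller passes the buffer size in code units and checks for a negative result before using the buffer, the same pattern the mingw_* wrappers follow.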

Vitaly

Nov 28, 2011, 2:06:44 AM
to msy...@googlegroups.com
I've tested both git and tortoisegit in this way: create folder with Russian characters, init repository in this folder, add some files with Russian characters, commit, change, show diff, commit again. ("core.quotepath = false" specified in .gitconfig)
I found only one problem, in "git gui": it shows the error "couldn't change working directory to "C:/Users/Vitaly/Desktop/Тест": no such file or directory". It seems the encoding of the repository folder name is broken. All other tools (git, gitk, tortoisegit) work well and show correct Russian characters for folders, file names and commit messages.

karste...@dcon.de

Dec 1, 2011, 7:38:05 PM
to msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com, at...@chejz.com, pa...@paulbetts.org
Hi there,

here is v13 of the unicode patch series:
git: http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v13

A git installer that bundles the Unicode msys.dll, less-444 and git
v1.7.7.1 with unicode-v13 patches can be found here:
https://docs.google.com/uc?id=0BxXUoUg2r8ftNWZkY2U1NGQtZDVmNy00N2FhLWJhZTItYjRjYTg3NGJhNzQx&export=download


And TortoiseGit V1.7.5.0 built with CP_ACP replaced by CP_UTF8
(unchanged):
32 bit:
https://docs.google.com/uc?id=0BxXUoUg2r8ftYjFmMzg0MmItYjZhMy00MjM4LWFkYjktN2RiOTUxNDdiMzdk&export=download

64 bit:
https://docs.google.com/uc?id=0BxXUoUg2r8ftNDQxMWEyMTgtM2FjMy00YzA5LTgxNGEtNTc2MjhkYWZjNDIy&export=download


Changes since kb/unicode-v12:
- Win32: Thread-safe windows console output:
  - added handling of split UTF-8 sequences
  - minimized changes (old initialization code, didn't move write_console)
  - reformatted to fit into 80 cols with tab width 8
  - improved commit message
- Win32: add Unicode conversion functions (and subsequent patches):
  - renamed conversion functions to xutftowcs/xwcstoutf
  - changed ENAMETOOLONG to ERANGE
  - added filename-specific xutftowcs_path
Commit logs ('!': modified in v13)
---


  [01/25] MSVC: link dynamically to the CRT
  [02/25] git-gui: fix encoding in git-gui file browser
  [03/25] gitk: fix file name encoding in diff hunk headers
  [04/25] Revert "Disable test on MinGW that challenges its bash quoting"
! [05/25] Win32: Thread-safe windows console output
! [06/25] Win32: add Unicode conversion functions
! [07/25] Win32: Unicode file name support (except dirent)
! [08/25] Win32: Unicode file name support (dirent)
  [09/25] Unicode file name support (gitk and git-gui)
! [10/25] Win32: Unicode arguments (outgoing)
! [11/25] Win32: Unicode arguments (incoming)
! [12/25] Win32: sync Unicode console output and file system
! [13/25] Win32: Unicode environment (outgoing)
! [14/25] Win32: Unicode environment (incoming)
  [15/25] MinGW: disable legacy encoding tests
  [16/25] Win32: fix environment memory leaks
  [17/25] Win32: unify environment case-sensitivity
  [18/25] Win32: simplify internal mingw_spawn* APIs
  [19/25] Win32: move environment functions
  [20/25] Win32: unify environment function names
  [21/25] Win32: move environment block creation to a helper method
  [22/25] Win32: don't copy the environment twice when spawning child processes
  [23/25] Win32: reduce environment array reallocations
  [24/25] Win32: keep the environment sorted
  [25/25] Win32: patch Windows environment on startup

 Makefile                |    8 +-
 compat/mingw.c          |  703 +++++++++++++++++++++++++++++++----------------
 compat/mingw.h          |  134 ++++++++-
 compat/win32/dirent.c   |   32 +--
 compat/win32/dirent.h   |    2 +-
 compat/winansi.c        |  387 +++++++++++++++++---------
 git-gui/git-gui.sh      |    6 +-
 git-gui/lib/browser.tcl |    2 +-
 git-gui/lib/index.tcl   |    6 +-
 gitk-git/gitk           |   15 +-
 run-command.c           |   10 +-
 t/t3901-i18n-patch.sh   |   19 +-
 t/t4201-shortlog.sh     |    6 +-
 t/t5505-remote.sh       |    5 +-
 t/t8005-blame-i18n.sh   |    8 +-
 15 files changed, 900 insertions(+), 443 deletions(-)

--


Diff between v12 and v13...
---
$ git diff -p --stat kb/unicode-v12..kb/unicode-v13
 compat/mingw.c        |   65 +++++++++++++---------------
 compat/mingw.h        |   39 +++++++++++++----
 compat/win32/dirent.c |    9 +++-
 compat/winansi.c      |  113 +++++++++++++++++++++++++++++++++----------------
 4 files changed, 144 insertions(+), 82 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 15c1029..2104f25 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -205,7 +205,7 @@ int mingw_unlink(const char *pathname)
{
int ret, tries = 0;
wchar_t wpathname[MAX_PATH];
- if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ if (xutftowcs_path(wpathname, pathname) < 0)
return -1;

/* read-only files cannot be removed */
@@ -253,7 +253,7 @@ int mingw_rmdir(const char *pathname)
{
int ret, tries = 0;
wchar_t wpathname[MAX_PATH];
- if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ if (xutftowcs_path(wpathname, pathname) < 0)
return -1;

while ((ret = _wrmdir(wpathname)) == -1 && tries < ARRAY_SIZE(delay)) {
@@ -293,7 +293,7 @@ void mingw_mark_as_git_dir(const char *dir)
{
wchar_t wdir[MAX_PATH];
if (hide_dotfiles != HIDE_DOTFILES_FALSE && !is_bare_repository())
- if (utftowcs(wdir, dir, MAX_PATH) < 0 || make_hidden(wdir))
+ if (xutftowcs_path(wdir, dir) < 0 || make_hidden(wdir))
warning("Failed to make '%s' hidden", dir);
git_config_set("core.hideDotFiles",
hide_dotfiles == HIDE_DOTFILES_FALSE ? "false" :
@@ -305,7 +305,7 @@ int mingw_mkdir(const char *path, int mode)
{
int ret;
wchar_t wpath[MAX_PATH];
- if (utftowcs(wpath, path, MAX_PATH) < 0)
+ if (xutftowcs_path(wpath, path) < 0)
return -1;
ret = _wmkdir(wpath);
if (!ret && hide_dotfiles == HIDE_DOTFILES_TRUE) {
@@ -335,7 +335,7 @@ int mingw_open (const char *filename, int oflags, ...)
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";

- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
fd = _wopen(wfilename, oflags, mode);

@@ -385,8 +385,8 @@ FILE *mingw_fopen (const char *filename, const char *otype)
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
- utftowcs(wotype, otype, 10) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0 ||
+ xutftowcs(wotype, otype, 10) < 0)
return NULL;
file = _wfopen(wfilename, wotype);
if (file && hide && make_hidden(wfilename))
@@ -404,8 +404,8 @@ FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream)
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
- utftowcs(wotype, otype, 10) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0 ||
+ xutftowcs(wotype, otype, 10) < 0)
return NULL;
file = _wfreopen(wfilename, wotype, stream);
if (file && hide && make_hidden(wfilename))
@@ -416,7 +416,7 @@ FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream)
int mingw_access(const char *filename, int mode)
{
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
/* X_OK is not supported by the MSVCRT version */
return _waccess(wfilename, mode & ~X_OK);
@@ -425,7 +425,7 @@ int mingw_access(const char *filename, int mode)
int mingw_chdir(const char *dirname)
{
wchar_t wdirname[MAX_PATH];
- if (utftowcs(wdirname, dirname, MAX_PATH) < 0)
+ if (xutftowcs_path(wdirname, dirname) < 0)
return -1;
return _wchdir(wdirname);
}
@@ -433,7 +433,7 @@ int mingw_chdir(const char *dirname)
int mingw_chmod(const char *filename, int mode)
{
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
return _wchmod(wfilename, mode);
}
@@ -465,7 +465,7 @@ static int do_lstat(int follow, const char *file_name, struct stat *buf)
{
WIN32_FILE_ATTRIBUTE_DATA fdata;
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, file_name) < 0)
return -1;

if (GetFileAttributesExW(wfilename, GetFileExInfoStandard, &fdata)) {
@@ -608,7 +608,7 @@ int mingw_utime (const char *file_name, const struct utimbuf *times)
int fh, rc;
DWORD attrs;
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, file_name) < 0)
return -1;

/* must have write permission */
@@ -656,11 +656,11 @@ unsigned int sleep (unsigned int seconds)
char *mingw_mktemp(char *template)
{
wchar_t wtemplate[MAX_PATH];
- if (utftowcs(wtemplate, template, MAX_PATH) < 0)
+ if (xutftowcs_path(wtemplate, template) < 0)
return NULL;
if (!_wmktemp(wtemplate))
return NULL;
- if (wcstoutf(template, wtemplate, strlen(template) + 1) < 0)
+ if (xwcstoutf(template, wtemplate, strlen(template) + 1) < 0)
return NULL;
return template;
}
@@ -729,7 +729,7 @@ char *mingw_getcwd(char *pointer, int len)
wchar_t wpointer[MAX_PATH];
if (!_wgetcwd(wpointer, MAX_PATH))
return NULL;
- if (wcstoutf(pointer, wpointer, len) < 0)
+ if (xwcstoutf(pointer, wpointer, len) < 0)
return NULL;
for (i = 0; pointer[i]; i++)
if (pointer[i] == '\\')
@@ -1056,7 +1056,7 @@ static wchar_t *make_environment_block(char **deltaenv)
for (i = 0; tmpenv[i]; i++) {
size = 2 * strlen(tmpenv[i]) + 2;
ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
- envblkpos += utftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;
+ envblkpos += xutftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;
}
/* add final \0 terminator */
wenvblk[envblkpos] = 0;
@@ -1114,9 +1114,9 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaen
si.hStdOutput = winansi_get_osfhandle(fhout);
si.hStdError = winansi_get_osfhandle(fherr);

- if (utftowcs(wcmd, cmd, MAX_PATH) < 0)
+ if (xutftowcs_path(wcmd, cmd) < 0)
return -1;
- if (dir && utftowcs(wdir, dir, MAX_PATH) < 0)
+ if (dir && xutftowcs_path(wdir, dir) < 0)
return -1;

/* concatenate argv, quoting args as we go */
@@ -1137,7 +1137,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaen
}

wargs = xmalloc((2 * args.len + 1) * sizeof(wchar_t));
- utftowcs(wargs, args.buf, 2 * args.len + 1);
+ xutftowcs(wargs, args.buf, 2 * args.len + 1);
strbuf_release(&args);

wenvblk = make_environment_block(deltaenv);
@@ -1599,9 +1599,7 @@ int mingw_rename(const char *pold, const char *pnew)
DWORD attrs, gle;
int tries = 0;
wchar_t wpold[MAX_PATH], wpnew[MAX_PATH];
- if (utftowcs(wpold, pold, MAX_PATH) < 0)
- return -1;
- if (utftowcs(wpnew, pnew, MAX_PATH) < 0)
+ if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0)
return -1;

/*
@@ -1842,9 +1840,8 @@ int link(const char *oldpath, const char *newpath)
typedef BOOL (WINAPI *T)(LPCWSTR, LPCWSTR, LPSECURITY_ATTRIBUTES);
static T create_hard_link = NULL;
wchar_t woldpath[MAX_PATH], wnewpath[MAX_PATH];
- if (utftowcs(woldpath, oldpath, MAX_PATH) < 0)
- return -1;
- if (utftowcs(wnewpath, newpath, MAX_PATH) < 0)
+ if (xutftowcs_path(woldpath, oldpath) < 0 ||
+ xutftowcs_path(wnewpath, newpath) < 0)
return -1;

if (!create_hard_link) {
@@ -1973,7 +1970,7 @@ int mingw_offset_1st_component(const char *path)
return offset + is_dir_sep(path[offset]);
}

-int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
+int xutftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
{
int upos = 0, wpos = 0;
const unsigned char *utf = (const unsigned char*) utfs;
@@ -1993,7 +1990,7 @@ int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)

if (wpos >= wcslen) {
wcs[wpos] = 0;
- errno = ENAMETOOLONG;
+ errno = ERANGE;
return -1;
}

@@ -2045,7 +2042,7 @@ int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
return wpos;
}

-int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
+int xwcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
{
if (!wcs || !utf || utflen < 1) {
errno = EINVAL;
@@ -2054,7 +2051,7 @@ int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
utflen = WideCharToMultiByte(CP_UTF8, 0, wcs, -1, utf, utflen, NULL, NULL);
if (utflen)
return utflen - 1;
- errno = ENAMETOOLONG;
+ errno = ERANGE;
return -1;
}

@@ -2099,14 +2096,14 @@ void mingw_startup()
buffer = xmalloc(maxlen);

/* convert command line arguments and environment to UTF-8 */
- len = wcstoutf(buffer, _wpgmptr, maxlen);
+ len = xwcstoutf(buffer, _wpgmptr, maxlen);
__argv[0] = xmemdupz(buffer, len);
for (i = 1; i < argc; i++) {
- len = wcstoutf(buffer, wargv[i], maxlen);
+ len = xwcstoutf(buffer, wargv[i], maxlen);
__argv[i] = xmemdupz(buffer, len);
}
for (i = 0; wenv[i]; i++) {
- len = wcstoutf(buffer, wenv[i], maxlen);
+ len = xwcstoutf(buffer, wenv[i], maxlen);
environ[i] = xmemdupz(buffer, len);
}
environ[i] = NULL;
diff --git a/compat/mingw.h b/compat/mingw.h
index 675eb06..675d19a 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -375,13 +375,33 @@ void mingw_mark_as_git_dir(const char *dir);
* utflen: size of string to convert, or -1 if 0-terminated
*
* Returns:
- * length of converted string (_wcslen(wcs)), or -1 on failure (errno is set
- * to EINVAL or ENAMETOOLONG)
+ * length of converted string (_wcslen(wcs)), or -1 on failure
+ *
+ * Errors:
+ * EINVAL: one of the input parameters is invalid (e.g. NULL)
+ * ERANGE: the output buffer is too small
+ */
+int xutftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
+
+/**
+ * Simplified variant of xutftowcsn, assumes input string is \0-terminated.
+ */
+static inline int xutftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
+{
+ return xutftowcsn(wcs, utf, wcslen, -1);
+}
+
+/**
+ * Simplified file system specific variant of xutftowcsn, assumes output
+ * buffer size is MAX_PATH wide chars and input string is \0-terminated,
+ * fails with ENAMETOOLONG if input string is too long.
*/
-int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
-static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
+static inline int xutftowcs_path(wchar_t *wcs, const char *utf)
{
- return utftowcsn(wcs, utf, wcslen, -1);
+ int result = xutftowcsn(wcs, utf, MAX_PATH, -1);
+ if (result < 0 && errno == ERANGE)
+ errno = ENAMETOOLONG;
+ return result;
}

/**
@@ -410,10 +430,13 @@ static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
* utflen: size of target buffer
*
* Returns:
- * length of converted string, or -1 on failure (errno is set to EINVAL or
- * ENAMETOOLONG)
+ * length of converted string, or -1 on failure
+ *
+ * Errors:
+ * EINVAL: one of the input parameters is invalid (e.g. NULL)
+ * ERANGE: the output buffer is too small
*/
-int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen);
+int xwcstoutf(char *utf, const wchar_t *wcs, size_t utflen);

/*
* A replacement of main() that adds win32 specific initialization.
diff --git a/compat/win32/dirent.c b/compat/win32/dirent.c
index 37f56b7..c69a689 100644
--- a/compat/win32/dirent.c
+++ b/compat/win32/dirent.c
@@ -9,7 +9,7 @@ struct DIR {
static inline void finddata2dirent(struct dirent *ent, WIN32_FIND_DATAW *fdata)
{
/* convert UTF-16 name to UTF-8 */
- wcstoutf(ent->d_name, fdata->cFileName, sizeof(ent->d_name));
+ xwcstoutf(ent->d_name, fdata->cFileName, sizeof(ent->d_name));

/* Set file type, based on WIN32_FIND_DATA */
if (fdata->dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
@@ -27,9 +27,12 @@ DIR *opendir(const char *name)
DIR *dir;

/* convert name to UTF-16, check length (-2 for '/' '*') */
- len = utftowcs(pattern, name, MAX_PATH - 2);
- if (len < 0)
+ len = xutftowcs(pattern, name, MAX_PATH - 2);
+ if (len < 0) {
+ if (errno == ERANGE)
+ errno = ENAMETOOLONG;
return NULL;
+ }

/* append optional '/' and wildcard '*' */
if (len && !is_dir_sep(pattern[len - 1]))
diff --git a/compat/winansi.c b/compat/winansi.c
index d11b532..18f2cdf 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -53,7 +53,8 @@ static void warn_if_raster_font(void)

/* GetCurrentConsoleFontEx is available since Vista */
pGetCurrentConsoleFontEx = (PGETCURRENTCONSOLEFONTEX) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "GetCurrentConsoleFontEx");
+ GetModuleHandle("kernel32.dll"),
+ "GetCurrentConsoleFontEx");
if (pGetCurrentConsoleFontEx) {
CONSOLE_FONT_INFOEX cfi;
cfi.cbSize = sizeof(cfi);
@@ -62,8 +63,8 @@ static void warn_if_raster_font(void)
} else {
/* pre-Vista: check default console font in registry */
HKEY hkey;
- if (ERROR_SUCCESS == RegOpenKeyExA(HKEY_CURRENT_USER, "Console", 0,
- KEY_READ, &hkey)) {
+ if (ERROR_SUCCESS == RegOpenKeyExA(HKEY_CURRENT_USER, "Console",
+ 0, KEY_READ, &hkey)) {
DWORD size = sizeof(fontFamily);
RegQueryValueExA(hkey, "FontFamily", NULL, NULL,
(LPVOID) &fontFamily, &size);
@@ -72,10 +73,10 @@ static void warn_if_raster_font(void)
}

if (!(fontFamily & TMPF_TRUETYPE)) {
- const wchar_t *msg = L"\nWarning: Your console font probably doesn\'t "
- L"support Unicode. If you experience strange characters in the "
- L"output, consider switching to a TrueType font such as Lucida "
- L"Console!\n";
+ const wchar_t *msg = L"\nWarning: Your console font probably "
+ L"doesn\'t support Unicode. If you experience strange "
+ L"characters in the output, consider switching to a "
+ L"TrueType font such as Lucida Console!\n";
WriteConsoleW(console, msg, wcslen(msg), NULL, NULL);
}
}
@@ -85,6 +86,8 @@ static int is_console(int fd)
CONSOLE_SCREEN_BUFFER_INFO sbi;
HANDLE hcon;

+ static int initialized = 0;
+
/* get OS handle of the file descriptor */
hcon = (HANDLE) _get_osfhandle(fd);
if (hcon == INVALID_HANDLE_VALUE)
@@ -99,11 +102,34 @@ static int is_console(int fd)
return 0;

/* initialize attributes */
- attr = plain_attr = sbi.wAttributes;
- negative = 0;
+ if (!initialized) {
+ attr = plain_attr = sbi.wAttributes;
+ negative = 0;
+ initialized = 1;
+ }
+
return 1;
}

+#define BUFFER_SIZE 4096
+#define MAX_PARAMS 16
+
+static void write_console(unsigned char *str, size_t len)
+{
+ /* only called from console_thread, so a static buffer will do */
+ static wchar_t wbuf[2 * BUFFER_SIZE + 1];
+
+ /* convert utf-8 to utf-16 */
+ int wlen = xutftowcsn(wbuf, (char*) str, 2 * BUFFER_SIZE + 1, len);
+
+ /* write directly to console */
+ WriteConsoleW(console, wbuf, wlen, NULL, NULL);
+
+ /* remember if non-ascii characters are printed */
+ if (wlen != len)
+ non_ascii_used = 1;
+}
+
#define FOREGROUND_ALL (FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE)
#define BACKGROUND_ALL (BACKGROUND_RED | BACKGROUND_GREEN | BACKGROUND_BLUE)

@@ -288,39 +314,21 @@ static void set_attr(char func, const int *params, int paramlen)
}
}

-#define BUFFER_SIZE 4096
-#define MAX_PARAMS 16
-
-static void write_console(char *str, size_t len)
-{
- /* only called from console_thread, so a static buffer will do */
- static wchar_t wbuf[2 * BUFFER_SIZE + 1];
-
- /* convert utf-8 to utf-16 */
- int wlen = utftowcsn(wbuf, str, 2 * BUFFER_SIZE + 1, len);
-
- /* write directly to console */
- WriteConsoleW(console, wbuf, wlen, NULL, NULL);
-
- /* remember if non-ascii characters are printed */
- if (wlen != len)
- non_ascii_used = 1;
-}
-
enum {
TEXT = 0, ESCAPE = 033, BRACKET = '[', EXIT = -1
};

static DWORD WINAPI console_thread(LPVOID unused)
{
- char buffer[BUFFER_SIZE];
+ unsigned char buffer[BUFFER_SIZE];
DWORD bytes;
- int start, end, c, parampos = 0, state = TEXT;
+ int start, end = 0, c, parampos = 0, state = TEXT;
int params[MAX_PARAMS];

while (state != EXIT) {
/* read next chunk of bytes from the pipe */
- if (!ReadFile(hread, buffer, BUFFER_SIZE, &bytes, NULL)) {
+ if (!ReadFile(hread, buffer + end, BUFFER_SIZE - end, &bytes,
+ NULL)) {
/* exit if pipe has been closed */
if (GetLastError() == ERROR_BROKEN_PIPE)
break;
@@ -329,6 +337,7 @@ static DWORD WINAPI console_thread(LPVOID unused)
}

/* scan the bytes and handle ANSI control codes */
+ bytes += end;
start = end = 0;
while (end < bytes) {
c = buffer[end++];
@@ -337,7 +346,8 @@ static DWORD WINAPI console_thread(LPVOID unused)
if (c == ESCAPE) {
/* print text seen so far */
if (end - 1 > start)
- write_console(buffer + start, end - 1 - start);
+ write_console(buffer + start,
+ end - 1 - start);

/* then start parsing escape sequence */
start = end - 1;
@@ -358,7 +368,10 @@ static DWORD WINAPI console_thread(LPVOID unused)
params[parampos] *= 10;
params[parampos] += c - '0';
} else if (c == ';') {
- /* next parameter, bail out if out of bounds */
+ /*
+ * next parameter, bail out if out of
+ * bounds
+ */
parampos++;
if (parampos >= MAX_PARAMS)
state = TEXT;
@@ -366,7 +379,10 @@ static DWORD WINAPI console_thread(LPVOID unused)
/* "\033[q": terminate the thread */
state = EXIT;
} else {
- /* end of escape sequence, change console attributes */
+ /*
+ * end of escape sequence, change
+ * console attributes
+ */
set_attr(c, params, parampos + 1);
start = end;
state = TEXT;
@@ -375,9 +391,32 @@ static DWORD WINAPI console_thread(LPVOID unused)
}
}

- /* print remaining text unless we're parsing an escape sequence */
- if (state == TEXT && end > start)
- write_console(buffer + start, end - start);
+ /* print remaining text unless parsing an escape sequence */
+ if (state == TEXT && end > start) {
+ /* check for incomplete UTF-8 sequences and fix end */
+ if (buffer[end - 1] >= 0x80) {
+ if (buffer[end -1] >= 0xc0)
+ end--;
+ else if (end - 1 > start &&
+ buffer[end - 2] >= 0xe0)
+ end -= 2;
+ else if (end - 2 > start &&
+ buffer[end - 3] >= 0xf0)
+ end -= 3;
+ }
+
+ /* print remaining complete UTF-8 sequences */
+ if (end > start)
+ write_console(buffer + start, end - start);
+
+ /* move remaining bytes to the front */
+ if (end < bytes)
+ memmove(buffer, buffer + end, bytes - end);
+ end = bytes - end;
+ } else {
+ /* all data has been consumed, mark buffer empty */
+ end = 0;
+ }
}

/* check if the console font supports unicode */

Karsten Blees

Dec 2, 2011, 4:23:05 PM
to msysGit, vital...@gmail.com, bl...@dcon.de

Does this fix the problem for you?

---
[PATCH] git-gui: fix git work tree encoding

...if the work tree path contains non-ASCII characters.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---
git-gui/git-gui.sh | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/git-gui/git-gui.sh b/git-gui/git-gui.sh
index 5f1faeb..6b531e8 100755
--- a/git-gui/git-gui.sh
+++ b/git-gui/git-gui.sh
@@ -1228,6 +1228,7 @@ apply_config
# v1.7.0 introduced --show-toplevel to return the canonical work-tree
if {[package vsatisfies $_git_version 1.7.0]} {
set _gitworktree [git rev-parse --show-toplevel]
+ set _gitworktree [encoding convertfrom utf-8 $_gitworktree]
} else {
# try to set work tree from environment, core.worktree or use
# cdup to obtain a relative path to the top of the worktree. If
--
1.7.7.1.msysgit.2

Vitaly

Dec 5, 2011, 2:50:17 AM
to msy...@googlegroups.com, vital...@gmail.com, bl...@dcon.de
No, the problem is still the same, but it looks like the broken encoding got converted twice.

On Saturday, December 3, 2011 at 1:23:05 UTC+4, Karsten Blees wrote:

sh2ka

Dec 4, 2011, 11:11:31 PM
to msysGit
It seems that my previous message failed. Sending again:

I've tested the latest versions of git and tortoisegit with Unicode support
and found a bug: diff doesn't work with files that have Russian file
names - it says that the file in the temporary folder doesn't exist.

I tested it with simple text files and openoffice/libreoffice files -
the same error.

I think that tortoise extracts the file with the wrong codepage and the
file system prevents creation of a file with such a name.
Maybe this is a bug in git (the Unicode version), not in tortoisegit - I
don't know - but the error exists.

With regards, Michael!

sh2ka

Dec 4, 2011, 10:39:11 PM
to msysGit
Thanks for your work, Karsten!

Here I want to report a bug: it doesn't allow diffing two
openoffice/libreoffice files with Russian file names - it says that the
file in the temporary folder doesn't exist. I think tortoisegit tries to
extract the file with the wrong codepage and the file system can't create
the file.

With regards, Michael!

Johannes Schindelin

Dec 6, 2011, 3:31:26 PM
to karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com, at...@chejz.com, pa...@paulbetts.org
Hi Karsten,

On Fri, 2 Dec 2011, karste...@dcon.de wrote:

> [01/25] MSVC: link dynamically to the CRT

This is not really Unicode, but it is uncontroversial, I think, so I
merged it into devel.

> [02/25] git-gui: fix encoding in git-gui file browser
> [03/25] gitk: fix file name encoding in diff hunk headers
> [04/25] Revert "Disable test on MinGW that challenges its bash quoting"

What about this? Why is it in the middle of the series? Can't we put it to
the end?

> ! [05/25] Win32: Thread-safe windows console output
> ! [06/25] Win32: add Unicode conversion functions
> ! [07/25] Win32: Unicode file name support (except dirent)
> ! [08/25] Win32: Unicode file name support (dirent)
> [09/25] Unicode file name support (gitk and git-gui)
> ! [10/25] Win32: Unicode arguments (outgoing)
> ! [11/25] Win32: Unicode arguments (incoming)
> ! [12/25] Win32: sync Unicode console output and file system

This patch is the one that probably got the most comments. Since it is not
obvious how it relates to the Unicode series, can we please back it out
into its own branch?

> ! [13/25] Win32: Unicode environment (outgoing)
> ! [14/25] Win32: Unicode environment (incoming)
> [15/25] MinGW: disable legacy encoding tests
> [16/25] Win32: fix environment memory leaks
> [17/25] Win32: unify environment case-sensitivity
> [18/25] Win32: simplify internal mingw_spawn* APIs
> [19/25] Win32: move environment functions
> [20/25] Win32: unify environment function names
> [21/25] Win32: move environment block creation to a helper method
> [22/25] Win32: don't copy the environment twice when spawning child
> processes
> [23/25] Win32: reduce environment array reallocations
> [24/25] Win32: keep the environment sorted
> [25/25] Win32: patch Windows environment on startup

Probably the rest is good to go, no?

To show what I mean, I rebased your -v13 branch, backed out the console
syncing, moved the test patch to the end, and pushed the result to
rebased-unicode:

https://github.com/msysgit/git/commits/rebased-unicode

It builds fine, but please be patient with me; I really could not follow
the developments all that closely. So I might very well have missed
something really, really important.

Unfortunately, /share/msysgit/run-tests.sh stops at t7400 because it
cannot find libiconv-2.dll (and I do not have time to debug it,
aargh!) so I cannot easily run the performance comparison that I would
like to have.

Ciao,
Dscho

Per-Olof Hermansson

Dec 22, 2011, 4:38:09 PM
to msysGit
Hi

I have waited a long time for a solution, so today I downloaded the Git
installer for Windows v1.7.7.1 (see below) and TortoiseGit 1.7.5.0
(see below).

I added a file locally:
Epostadresser MedlemmarHärnösand.xls

I pushed it to the origin repository (at Assembla, which is on Linux). It
is displayed correctly in Assembla: Epostadresser MedlemmarHärnösand.xls

I can clone it correctly: Epostadresser MedlemmarHärnösand.xls

If I export the file (or download it from Assembla) again locally, I get
the wrong name: Epostadresser MedlemmarH+ñrn+Âsand.xls
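
One way such a name can come about, sketched in Python. This is an
assumption, not a diagnosis of the actual export path: the UTF-8 bytes of
the name are misread as an OEM code page (cp850 is assumed here), and the
resulting box-drawing character is then turned into '+' by a best-fit
conversion to an ANSI code page, which is emulated by hand below.

```python
# Hypothetical reconstruction of the garbled name. The code pages involved
# (cp850, followed by a best-fit mapping to '+') are assumptions.
name = "MedlemmarHärnösand.xls"

utf8_bytes = name.encode("utf-8")      # bytes as stored by a UTF-8-aware Git
as_oem = utf8_bytes.decode("cp850")    # the same bytes misread as cp850

# A best-fit conversion to an ANSI code page can replace the box-drawing
# character U+251C with '+'; Python has no such mapping, so emulate it.
garbled = as_oem.replace("\u251c", "+")

print(as_oem)
print(garbled)
```

If the assumption holds, the tool doing the export is treating the file
name bytes as a legacy code page instead of UTF-8.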

regards,

Per-Olof

On 2 Dec, 01:38, karsten.bl...@dcon.de wrote:
> Hi there,
>
> here is v13 of the unicode patch series:
> git:http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v13
> less-444 (unchanged):https://github.com/kblees/msysgit/commits/kb/msys/less
> Unicode msys.dll (unchanged):https://github.com/kblees/msysgit/commits/kb/msys/unicode
>
> A git installer that bundles the Unicode msys.dll, less-444 and git
> v1.7.7.1 with unicode-v13 patches can be found here:https://docs.google.com/uc?id=0BxXUoUg2r8ftNWZkY2U1NGQtZDVmNy00N2FhLW...
>
> And TortoiseGit V1.7.5.0 built with CP_ACP replaced by CP_UTF8
> (unchanged):
>
> 64 bit:https://docs.google.com/uc?id=0BxXUoUg2r8ftNDQxMWEyMTgtM2FjMy00YzA5LT...
>

karste...@dcon.de

Dec 29, 2011, 4:48:33 PM
to Johannes Schindelin, at...@chejz.com, bl...@dcon.de, ki...@mns.spb.ru, kusm...@gmail.com, mic...@wheelycreek.net, msy...@googlegroups.com, pa...@paulbetts.org, robin.r...@gmail.com

Johannes Schindelin <Johannes....@gmx.de> wrote on 06.12.2011 21:31:26:

> Hi Karsten,
>
> On Fri, 2 Dec 2011, karste...@dcon.de wrote:
>
> >   [01/25] MSVC: link dynamically to the CRT
>
> This is not really Unicode, but it is uncontroversial, I think, so I
> merged it into devel.
>


Thanks (as the German saying goes, the squirrel earns its food the hard way... :-)

> >   [02/25] git-gui: fix encoding in git-gui file browser
> >   [03/25] gitk: fix file name encoding in diff hunk headers
> >   [04/25] Revert "Disable test on MinGW that challenges its bash quoting"
>
> What about this? Why is it in the middle of the series? Can't we put it to
> the end?
>


[02/25] and [03/25] are bug fixes that could be sent upstream directly...I didn't want to slip those fixes in with "[09/25] Unicode file name support (gitk and git-gui)".

[04/25] re-enables a test that works fine with "MinGW: disable CRT command line globbing" (which has already been merged, currently at devel~15). So you could just merge this or drop "Disable test on MinGW that challenges its bash quoting" (devel~28) on the next rebase.

> > ! [05/25] Win32: Thread-safe windows console output
> > ! [06/25] Win32: add Unicode conversion functions
> > ! [07/25] Win32: Unicode file name support (except dirent)
> > ! [08/25] Win32: Unicode file name support (dirent)
> >   [09/25] Unicode file name support (gitk and git-gui)
> > ! [10/25] Win32: Unicode arguments (outgoing)
> > ! [11/25] Win32: Unicode arguments (incoming)
> > ! [12/25] Win32: sync Unicode console output and file system
>
> This patch is the one that probably got the most comments. Since it is not
> obvious how it relates to the Unicode series, can we please back it out
> into its own branch?
>


You probably mean "[05/25] Win32: Thread-safe windows console output"? This patch here is just a spinoff to remove the dependency on the Unicode conversion functions, so that the thread-safe console patch can be moved around and applied independently.

The relatedness to Unicode is historical rather than technical: I started the whole Unicode mess with a set of console patches in Aug 2010, when you complained that it wasn't thread safe, remember? :-)

> > ! [13/25] Win32: Unicode environment (outgoing)
> > ! [14/25] Win32: Unicode environment (incoming)
> >   [15/25] MinGW: disable legacy encoding tests
> >   [16/25] Win32: fix environment memory leaks
> >   [17/25] Win32: unify environment case-sensitivity
> >   [18/25] Win32: simplify internal mingw_spawn* APIs
> >   [19/25] Win32: move environment functions
> >   [20/25] Win32: unify environment function names
> >   [21/25] Win32: move environment block creation to a helper method
> >   [22/25] Win32: don't copy the environment twice when spawning child
> >           processes
> >   [23/25] Win32: reduce environment array reallocations
> >   [24/25] Win32: keep the environment sorted
> >   [25/25] Win32: patch Windows environment on startup
>
> Probably the rest is good to go, no?
>


There'll be some minor changes after Erik's last review cycle; I think the console thread termination mechanism is the biggest issue right now...
I'd be very interested in end-to-end performance comparisons; all I did was micro-benchmarks with QueryPerformanceCounter... although I don't expect (or rather: I hope not) to see any significant differences.

> Ciao,
> Dscho

Konstantin Khomoutov

Aug 22, 2012, 4:11:33 PM
to kon...@gmail.com, msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com, at...@chejz.com, pa...@paulbetts.org, karste...@dcon.de
On Wed, Aug 22, 2012 at 04:03:44AM -0700, kon...@gmail.com wrote:

> file names with encoding as cp1251 does not work on FreeBSD
Can you please elaborate on how this is related to Unicode support in
Git for Windows?

In any case, note that Git (on any platform) is made agnostic to
filesystem encoding; this is by design, and will not change.

The patch set by Karsten Blees and others that resolves issue #80
(the one being discussed here) makes Git for Windows use FooW-style
(wide-character) Windows API calls where applicable and *manage
filenames in UTF-8 internally.*
Basically this means that no matter which "code page" your system uses,
non-ASCII filenames will be encoded in UTF-8 in tree objects when you
add files with such names using Git for Windows.
It's then up to you to use a UTF-8-enabled locale and filesystem encoding
on non-Windows systems.

Read https://github.com/kblees/git/wiki for more info.
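
To illustrate the point about code pages, here is a small Python sketch
(not Git's actual C code; the file name is a made-up example). A name
obtained as UTF-16 from a FooW-style call converts to the same UTF-8
bytes whatever the system code page is, whereas the legacy cp1251 bytes
for the same name are different and are not even valid UTF-8.

```python
# Made-up example name; Git itself does this conversion in C.
name = "файл.txt"

# What a wide-character (FooW-style) API call would hand back: UTF-16.
utf16 = name.encode("utf-16-le")

# What ends up in the tree object: UTF-8, independent of the code page.
utf8 = utf16.decode("utf-16-le").encode("utf-8")

# What a legacy ANSI call would return on a system using cp1251.
legacy = name.encode("cp1251")

print(utf8.hex(" "))
print(legacy.hex(" "))

try:
    legacy.decode("utf-8")
except UnicodeDecodeError:
    print("cp1251 bytes are not valid UTF-8")
```

This is why a checkout on a system whose locale is cp1251 shows such
names garbled unless the locale is switched to UTF-8.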

Robin Rosenberg

Aug 24, 2012, 11:58:21 AM
to msy...@googlegroups.com
Konstantin Khomoutov wrote on 2012-08-22 22:11:
> On Wed, Aug 22, 2012 at 04:03:44AM -0700, kon...@gmail.com wrote:
>
>> file names with encoding as cp1251 does not work on FreeBSD
> Can you please elaborate on how this is related to Unicode support in
> Git for Windows?
>
> In any case, note that Git (on any platform) is made agnostic to
> filesystem encoding; this is by design, and will not change.

Don't be so sure about that. Linus played with the patches that fix
filenames on Macs and seems to think it would be a good idea to have
the option to do this in the general case (e.g. UTF-8 <> ISO-8859-1),
so there is hope.

-- robin


Robin Rosenberg

Aug 27, 2012, 4:31:19 PM
to karste...@dcon.de, msy...@googlegroups.com
karste...@dcon.de wrote on 2012-08-27 11:59:
> As I understand it, file names on Mac / HFS are stored in decomposed
> normal form (i.e. 'ä' is two characters \u0061\u0308, UTF-8: 61 cc 88),
> while on Linux and Windows file names are stored in composed normal form
> ('ä' = \u00e4, UTF-8: c3 a4). So it's not just about encoding, you'd have
> to translate between different Unicode normalization forms as well, and
> I'm not sure if iconv can do this.

To iconv it's just another encoding. The code is in Git master and speaks
for itself. See 76759c7dff53e8c84e975b88cb8245587c14c7ba. It converts
between UTF-8-MAC and UTF-8.
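
The two forms quoted above can be reproduced with a short Python sketch.
Note the assumption: Python's unicodedata NFD/NFC normalization stands in
here for the UTF-8-MAC conversion; the HFS decomposition is a variant of
NFD and is not identical in every case.

```python
import unicodedata

composed = "\u00e4"                                  # 'ä' as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # 'a' followed by U+0308

print(decomposed.encode("utf-8").hex(" "))  # 61 cc 88
print(composed.encode("utf-8").hex(" "))    # c3 a4

# Recomposing gives back the single-code-point form.
assert unicodedata.normalize("NFC", decomposed) == composed
```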

-- robin

