[PATCH 0/5] Issue 80: Unicode support on Windows

934 views
Skip to first unread message

Karsten Blees

unread,
Oct 24, 2010, 6:00:18 PM10/24/10
to msy...@googlegroups.com, Karsten Blees
This patch series tries to solve the Unicode problems of Git for Windows, at least for git, gitk and git-gui.

Credits go to Thorvald Natvig (tn/utf8), Nico Rieck (work/utf8) and Takuya Murakami (work/tm/utf8) for their previous work on the subject, from which I've shamelessly stolen ideas as seemed fit. Notable differences from their patches include:

* The UTF-8 to UTF-16 conversion function can handle invalid UTF-8 (similar to TCL), so that old git repositories with legacy-encoded file names can be checked out without error.

* The conversion functions don't use static buffers in order to be fully thread safe. This makes each usage a bit more complicated, but is necessary for the multi-threaded parts of git to work (e.g. git-grep, git-pack-objects, core.preloadindex).

I've included a revised version of the thread-safe console patch that I've posted a while back. This version eliminates the call to TerminateThread and uses the fault-tolerant UTF-8 conversion function from the file name patch (so that file system and console output are in sync).

Patch 3 breaks some of the i18n tests (t3901, t4201 and t8005), which I really don't know how to fix properly. MSys bash is completely broken when it comes to Unicode, as it reads test scripts in Windows system encoding (instead of UTF-8). Changing the encoding of the test scripts (e.g. to Cp1252) is not a solution as this will fail on Windows systems with different encodings. Currently, I've just changed the tests to expect failure if bash cannot pass unicode script content to other processes correctly.

$ grep "failed [^0]" t/test-results/*
t/test-results/t5407-post-rewrite-hook-6400.counts:failed 9
t/test-results/t7400-submodule-basic-7004.counts:failed 1
t/test-results/t7508-status-5936.counts:failed 18
t/test-results/t9001-send-email-6652.counts:failed 4
t/test-results/t9500-gitweb-standalone-no-errors-3620.counts:failed 83
t/test-results/t9501-gitweb-standalone-http-status-5620.counts:failed 11
t/test-results/t9502-gitweb-standalone-parse-output-3384.counts:failed 9

Karsten Blees (5):
Unicode file names on Windows
Unicode file names for gitk and git-gui on Windows
Unicode arguments and environment on Windows
Thread-safe windows console output
Mark unicode-related tests broken on msys

Makefile | 2 -
compat/mingw.c | 757 ++++++++++++++++++++++++++++++++++---------------
compat/mingw.h | 103 +++++--
compat/winansi.c | 363 +++++++++++++++---------
git-gui/git-gui.sh | 6 +-
git-gui/lib/index.tcl | 6 +-
gitk-git/gitk | 15 +-
run-command.c | 10 +-
t/t3901-i18n-patch.sh | 28 +-
t/t4201-shortlog.sh | 6 +-
t/t8005-blame-i18n.sh | 9 +-
t/test-lib.sh | 13 +
12 files changed, 871 insertions(+), 447 deletions(-)

--
1.7.3.1.msysgit.0.8.g31c0ec

Karsten Blees

unread,
Oct 24, 2010, 6:00:21 PM10/24/10
to msy...@googlegroups.com, Karsten Blees
Converts command line arguments and environment from UTF-16 to UTF-8 on
startup, and vice versa when creating other processes.

Provides replacements for getenv/putenv/unsetenv that operate directly on
the UTF-8-encoded "environ" array to save UTF-8/16 conversions on each
call (and save the trouble to keep track of conversion memory). As sorting
the environment is required for CreateProcess anyway, keep "environ"
sorted and use binary search for faster lookups.

Initialization code has been moved from the main macro to the new
mingw_startup function. This function also converts the command line and
environment to UTF-8, and fixes Windows specific environment settings
(TMPDIR and TERM, formerly handled in mingw_getenv).

Disables MSVCRT wildcard expansion of the command line to be able to pass
*/? to git grep and friends (fixes t5505 and t7810).

Altering the environment when creating new processes is now handled in
spawnv* internally, replacing make_augmented_environ and related APIs.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---
Makefile | 2 -
compat/mingw.c | 354 +++++++++++++++++++++++++++++++++++++-------------------
compat/mingw.h | 23 ++---
run-command.c | 10 +--
4 files changed, 246 insertions(+), 143 deletions(-)

diff --git a/Makefile b/Makefile
index d3dcfb1..dba3182 100644
--- a/Makefile
+++ b/Makefile
@@ -1045,7 +1045,6 @@ ifeq ($(uname_S),Windows)
NO_SYMLINK_HEAD = YesPlease
NO_IPV6 = YesPlease
NO_SETENV = YesPlease
- NO_UNSETENV = YesPlease
NO_STRCASESTR = YesPlease
NO_STRLCPY = YesPlease
NO_STRTOK_R = YesPlease
@@ -1100,7 +1099,6 @@ ifneq (,$(findstring MINGW,$(uname_S)))
NO_LIBGEN_H = YesPlease
NO_SYMLINK_HEAD = YesPlease
NO_SETENV = YesPlease
- NO_UNSETENV = YesPlease
NO_STRCASESTR = YesPlease
NO_STRLCPY = YesPlease
NO_STRTOK_R = YesPlease
diff --git a/compat/mingw.c b/compat/mingw.c
index 8f20b4b..1285d27 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -761,22 +761,110 @@ char *mingw_getcwd(char *pointer, int len)
return pointer;
}

-#undef getenv
+/*
+ * Compare environment entries by key (i.e. stopping at '=').
+ */
+static int compareenv(const void *v1, const void *v2)
+{
+ const char *e1 = *(const char**)v1;
+ const char *e2 = *(const char**)v2;
+
+ /* sort NULL entries at the end */
+ if (!e1 || !e2)
+ return (e1 == e2) ? 0 : e1 ? -1 : 1;
+
+ for (;;) {
+ int c1 = *e1++;
+ int c2 = *e2++;
+ c1 = (c1 == '=') ? 0 : tolower(c1);
+ c2 = (c2 == '=') ? 0 : tolower(c2);
+ if (c1 > c2)
+ return 1;
+ if (c1 < c2)
+ return -1;
+ if (c1 == 0)
+ return 0;
+ }
+}
+
+#define MIN_ENV_SIZE 16
+
+/*
+ * Resize and sort environment for O(log n) getenv / putenv
+ */
+static void initenv(char **env)
+{
+ int i, envsz = MIN_ENV_SIZE;
+ char **e;
+ for (e = env, i = 1; *e; e++, i++)
+ if (i == envsz)
+ envsz <<= 1;
+ environ = xcalloc(envsz, sizeof(char*));
+ memcpy(environ, env, i * sizeof(char*));
+ qsort(environ, i, sizeof(char*), compareenv);
+}
+
+/*
+ * Allocated space for environ is always a power of 2 (with unused entries
+ * filled with NULL), so we can determine the size more efficiently.
+ */
+static int getenvsize()
+{
+ int envsz = MIN_ENV_SIZE;
+ while (environ[envsz - 1])
+ envsz <<= 1;
+ return envsz;
+}
+
char *mingw_getenv(const char *name)
{
- char *result = getenv(name);
- if (!result) {
- if (!strcmp(name, "TMPDIR")) {
- /* on Windows it is TMP and TEMP */
- result = getenv("TMP");
- if (!result)
- result = getenv("TEMP");
- } else if (!strcmp(name, "TERM")) {
- /* simulate TERM to enable auto-color (see color.c) */
- result = "winansi";
+ char *value, **env;
+ env = bsearch(&name, environ, getenvsize(), sizeof(char*), compareenv);
+ if (!env)
+ return NULL;
+ value = strchr(*env, '=');
+ return value ? &value[1] : NULL;
+}
+
+int mingw_unsetenv(const char *name)
+{
+ int envsz;
+ char **env;
+ envsz = getenvsize();
+ env = bsearch(&name, environ, envsz, sizeof(char*), compareenv);
+ if (!env)
+ return 0;
+ free(*env);
+ memmove(env, env + 1, (envsz - 1 - (env - environ)) * sizeof(char*));
+ return 0;
+}
+
+int mingw_putenv(char *namevalue)
+{
+ int envsz;
+ char **env;
+ if (!namevalue || !strchr(namevalue, '=')) {
+ errno = EINVAL;
+ return -1;
+ }
+ envsz = getenvsize();
+ env = bsearch(&namevalue, environ, envsz, sizeof(char*), compareenv);
+
+ if (env) {
+ /* replace existing entry */
+ free(*env);
+ *env = namevalue;
+ } else {
+ /* add at the end (last slot is guaranteed to be free), then sort */
+ environ[envsz - 1] = namevalue;
+ qsort(environ, envsz, sizeof(char*), compareenv);
+ /* expand array if last slot is used */
+ if (environ[envsz - 1]) {
+ environ = xrealloc(environ, 2 * envsz * sizeof(char*));
+ memset(environ + envsz, 0, envsz * sizeof(char*));
}
}
- return result;
+ return 0;
}

/*
@@ -961,21 +1049,63 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
return prog;
}

-static int env_compare(const void *a, const void *b)
+/*
+ * Create environment block suitable for CreateProcess. Merges current working
+ * directories, current environment and the supplied environment changes.
+ */
+static wchar_t *make_environment_block(char **env)
{
- char *const *ea = a;
- char *const *eb = b;
- return strcasecmp(*ea, *eb);
+ char **e, **t, **tmpenv;
+ wchar_t *envblk = NULL;
+ int len, envsz, envblksz = 0, envblkpos = 0;
+
+ /* get current environment size + supplied changes */
+ envsz = getenvsize();
+ for (e = env; e && *e; e++)
+ envsz++;
+
+ /* allocate temporary environment array and copy current environment */
+ tmpenv = xcalloc(envsz, sizeof(char*));
+ for (e = environ, t = tmpenv; *e; e++, t++)
+ *t = *e;
+
+ /* merge supplied environment changes into the temporary environment */
+ if (env) {
+ int tmpsz = t - tmpenv;
+ for (e = env; *e; e++) {
+ /* replace / remove existing entries or append at the end */
+ char **f = bsearch(e, tmpenv, tmpsz, sizeof(char*), compareenv);
+ if (f)
+ *f = strchr(*e, '=') ? *e : NULL;
+ else
+ *t++ = strchr(*e, '=') ? *e : NULL;
+ }
+ /* finally sort the temporary environment */
+ qsort(tmpenv, envsz, sizeof(char*), compareenv);
+ }
+
+ /* add temporary environment to the environment block */
+ for (e = tmpenv; *e; e++) {
+ len = strlen(*e) + 1;
+ ALLOC_GROW(envblk, (envblkpos + len) * sizeof(wchar_t), envblksz);
+ len = utftowcs(&envblk[envblkpos], *e, len);
+ envblkpos += len + 1;
+ }
+ /* add final \0 terminator */
+ ALLOC_GROW(envblk, (envblkpos + 1) * sizeof(wchar_t), envblksz);
+ envblk[envblkpos] = 0;
+ return envblk;
}

static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
const char *dir,
int prepend_cmd, int fhin, int fhout, int fherr)
{
- STARTUPINFO si;
+ STARTUPINFOW si;
PROCESS_INFORMATION pi;
- struct strbuf envblk, args;
- unsigned flags;
+ struct strbuf args;
+ wchar_t *envblk, wcmd[MAX_PATH], wdir[MAX_PATH], *wargs;
+ unsigned flags = CREATE_UNICODE_ENVIRONMENT;
BOOL ret;

/* Determine whether or not we are associated to a console */
@@ -992,7 +1122,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
* instead of CREATE_NO_WINDOW to make ssh
* recognize that it has no console.
*/
- flags = DETACHED_PROCESS;
+ flags |= DETACHED_PROCESS;
} else {
/* There is already a console. If we specified
* DETACHED_PROCESS here, too, Windows would
@@ -1000,7 +1130,6 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
* The same is true for CREATE_NO_WINDOW.
* Go figure!
*/
- flags = 0;
CloseHandle(cons);
}
memset(&si, 0, sizeof(si));
@@ -1010,6 +1139,11 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
si.hStdOutput = (HANDLE) _get_osfhandle(fhout);
si.hStdError = (HANDLE) _get_osfhandle(fherr);

+ if (utftowcs(wcmd, cmd, MAX_PATH) < 0)
+ return -1;
+ if (dir && utftowcs(wdir, dir, MAX_PATH) < 0)
+ return -1;
+
/* concatenate argv, quoting args as we go */
strbuf_init(&args, 0);
if (prepend_cmd) {
@@ -1027,33 +1161,18 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
free(quoted);
}

- if (env) {
- int count = 0;
- char **e, **sorted_env;
-
- for (e = env; *e; e++)
- count++;
+ wargs = xmalloc((2 * args.len + 1) * sizeof(wchar_t));
+ utftowcs(wargs, args.buf, 2 * args.len + 1);
+ strbuf_release(&args);

- /* environment must be sorted */
- sorted_env = xmalloc(sizeof(*sorted_env) * (count + 1));
- memcpy(sorted_env, env, sizeof(*sorted_env) * (count + 1));
- qsort(sorted_env, count, sizeof(*sorted_env), env_compare);
-
- strbuf_init(&envblk, 0);
- for (e = sorted_env; *e; e++) {
- strbuf_addstr(&envblk, *e);
- strbuf_addch(&envblk, '\0');
- }
- free(sorted_env);
- }
+ envblk = make_environment_block(env);

memset(&pi, 0, sizeof(pi));
- ret = CreateProcess(cmd, args.buf, NULL, NULL, TRUE, flags,
- env ? envblk.buf : NULL, dir, &si, &pi);
+ ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags, envblk,
+ dir ? wdir : NULL, &si, &pi);

- if (env)
- strbuf_release(&envblk);
- strbuf_release(&args);
+ free(envblk);
+ free(wargs);

if (!ret) {
errno = ENOENT;
@@ -1175,80 +1294,6 @@ void mingw_execv(const char *cmd, char *const *argv)
mingw_execve(cmd, argv, environ);
}

-static char **copy_environ(void)
-{
- char **env;
- int i = 0;
- while (environ[i])
- i++;
- env = xmalloc((i+1)*sizeof(*env));
- for (i = 0; environ[i]; i++)
- env[i] = xstrdup(environ[i]);
- env[i] = NULL;
- return env;
-}
-
-void free_environ(char **env)
-{
- int i;
- for (i = 0; env[i]; i++)
- free(env[i]);
- free(env);
-}
-
-static int lookup_env(char **env, const char *name, size_t nmln)
-{
- int i;
-
- for (i = 0; env[i]; i++) {
- if (0 == strncmp(env[i], name, nmln)
- && '=' == env[i][nmln])
- /* matches */
- return i;
- }
- return -1;
-}
-
-/*
- * If name contains '=', then sets the variable, otherwise it unsets it
- */
-static char **env_setenv(char **env, const char *name)
-{
- char *eq = strchrnul(name, '=');
- int i = lookup_env(env, name, eq-name);
-
- if (i < 0) {
- if (*eq) {
- for (i = 0; env[i]; i++)
- ;
- env = xrealloc(env, (i+2)*sizeof(*env));
- env[i] = xstrdup(name);
- env[i+1] = NULL;
- }
- }
- else {
- free(env[i]);
- if (*eq)
- env[i] = xstrdup(name);
- else
- for (; env[i]; i++)
- env[i] = env[i+1];
- }
- return env;
-}
-
-/*
- * Copies global environ and adjusts variables as specified by vars.
- */
-char **make_augmented_environ(const char *const *vars)
-{
- char **env = copy_environ();
-
- while (*vars)
- env = env_setenv(env, *vars++);
- return env;
-}
-
/*
* Note, this isn't a complete replacement for getaddrinfo. It assumes
* that service contains a numerical port, or that it it is null. It
@@ -1947,3 +1992,76 @@ int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
errno = ENAMETOOLONG;
return -1;
}
+
+/*
+ * Disable MSVCRT command line wildcard expansion (__getmainargs called from
+ * mingw startup code, see init.c in mingw runtime).
+ */
+int _CRT_glob = 0;
+
+typedef struct {
+ int newmode;
+} _startupinfo;
+
+extern int __wgetmainargs(int *argc, wchar_t ***argv, wchar_t ***env, int glob,
+ _startupinfo *si);
+
+void mingw_startup()
+{
+ int i, maxlen, argc;
+ char *buffer;
+ wchar_t **e, **wenv, **wargv;
+ _startupinfo si;
+
+ /* get wide char arguments and environment */
+ __wgetmainargs(&argc, &wargv, &wenv, _CRT_glob, &si);
+
+ /* determine size of argv and environ conversion buffer */
+ maxlen = wcslen(_wpgmptr);
+ for (i = 1; i < argc; i++)
+ maxlen = max(maxlen, wcslen(wargv[i]));
+ for (e = wenv; *e; e++)
+ maxlen = max(maxlen, wcslen(*e));
+
+ /* allocate buffer (wchar_t encodes to max 3 UTF-8 bytes) */
+ maxlen = 3 * maxlen + 1;
+ buffer = xmalloc(maxlen);
+
+ /* convert command line arguments and environment to UTF-8 */
+ wcstoutf(buffer, _wpgmptr, maxlen);
+ __argv[0] = xstrdup(buffer);
+ for (i = 1; i < argc; i++) {
+ wcstoutf(buffer, wargv[i], maxlen);
+ __argv[i] = xstrdup(buffer);
+ }
+ for (e = wenv, i = 0; *e; e++, i++) {
+ wcstoutf(buffer, *e, maxlen);
+ environ[i] = xstrdup(buffer);
+ }
+ free(buffer);
+
+ /* initialize mingw replacement of environment functions */
+ initenv(environ);
+
+ /* fix Windows specific environment settings */
+
+ /* on Windows it is TMP and TEMP */
+ if (!getenv("TMPDIR")) {
+ const char *tmp = getenv("TMP");
+ if (!tmp)
+ tmp = getenv("TEMP");
+ if (tmp)
+ setenv("TMPDIR", tmp, 1);
+ }
+
+ /* simulate TERM to enable auto-color (see color.c) */
+ if (!getenv("TERM"))
+ setenv("TERM", "winansi", 1);
+
+ /* set up default file mode and file modes for stdin/out/err */
+ _fmode = _O_BINARY;
+ _setmode(_fileno(stdin), _O_BINARY);
+ _setmode(_fileno(stdout), _O_BINARY);
+ _setmode(_fileno(stderr), _O_BINARY);
+}
+
diff --git a/compat/mingw.h b/compat/mingw.h
index 41b0ad4..27f7510 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -208,6 +208,10 @@ char *mingw_getcwd(char *pointer, int len);

char *mingw_getenv(const char *name);
#define getenv mingw_getenv
+int mingw_putenv(char *namevalue);
+#define putenv mingw_putenv
+int mingw_unsetenv(const char *name);
+#define unsetenv mingw_unsetenv

struct hostent *mingw_gethostbyname(const char *host);
#define gethostbyname mingw_gethostbyname
@@ -301,13 +305,6 @@ void mingw_open_html(const char *path);
void mingw_mark_as_git_dir(const char *dir);
#define mark_as_git_dir mingw_mark_as_git_dir

-/*
- * helpers
- */
-
-char **make_augmented_environ(const char *const *vars);
-void free_environ(char **env);
-
/**
* Converts UTF-8 encoded string to UTF-16LE. To support legacy-encoded
* repositories, invalid UTF-8 bytes 0xa0 - 0xff are converted to corresponding
@@ -345,20 +342,16 @@ static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen);

/*
- * A replacement of main() that ensures that argv[0] has a path
- * and that default fmode and std(in|out|err) are in binary mode
+ * A replacement of main() that adds mingw specific initialization.
*/

+void mingw_startup();
#define main(c,v) dummy_decl_mingw_main(); \
static int mingw_main(); \
int main(int argc, const char **argv) \
{ \
- _fmode = _O_BINARY; \
- _setmode(_fileno(stdin), _O_BINARY); \
- _setmode(_fileno(stdout), _O_BINARY); \
- _setmode(_fileno(stderr), _O_BINARY); \
- argv[0] = xstrdup(_pgmptr); \
- return mingw_main(argc, argv); \
+ mingw_startup(); \
+ return mingw_main(__argc, __argv); \
} \
static int mingw_main(c,v)

diff --git a/run-command.c b/run-command.c
index 2a1041e..01d740b 100644
--- a/run-command.c
+++ b/run-command.c
@@ -317,7 +317,6 @@ fail_pipe:
{
int fhin = 0, fhout = 1, fherr = 2;
const char **sargv = cmd->argv;
- char **env = environ;

if (cmd->no_stdin)
fhin = open("/dev/null", O_RDWR);
@@ -342,23 +341,18 @@ fail_pipe:
else if (cmd->out > 1)
fhout = dup(cmd->out);

- if (cmd->env)
- env = make_augmented_environ(cmd->env);
-
if (cmd->git_cmd) {
cmd->argv = prepare_git_cmd(cmd->argv);
} else if (cmd->use_shell) {
cmd->argv = prepare_shell_cmd(cmd->argv);
}

- cmd->pid = mingw_spawnvpe(cmd->argv[0], cmd->argv, env, cmd->dir,
- fhin, fhout, fherr);
+ cmd->pid = mingw_spawnvpe(cmd->argv[0], cmd->argv, (char**) cmd->env,
+ cmd->dir, fhin, fhout, fherr);
failed_errno = errno;
if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
error("cannot spawn %s: %s", cmd->argv[0], strerror(errno));

- if (cmd->env)
- free_environ(env);
if (cmd->git_cmd)
free(cmd->argv);

--
1.7.3.1.msysgit.0.8.g31c0ec

Karsten Blees

unread,
Oct 24, 2010, 6:00:23 PM10/24/10
to msy...@googlegroups.com, Karsten Blees
MSys bash doesn't support unicode at all. Testing a unicode-enabled git
with an encoding-agnostic bash cannot work.

This patch adds a new test function test_expect_success_unicode that tests
whether the shell is capable of passing unicode strings to another process.
If that works, the test is expected to succeed, otherwise it's expected to
fail.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

t/t3901-i18n-patch.sh | 28 ++++++++++++++--------------
t/t4201-shortlog.sh | 6 +++---
t/t8005-blame-i18n.sh | 9 +++++----
t/test-lib.sh | 13 +++++++++++++
4 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/t/t3901-i18n-patch.sh b/t/t3901-i18n-patch.sh
index 31a5770..d19255d 100755
--- a/t/t3901-i18n-patch.sh
+++ b/t/t3901-i18n-patch.sh
@@ -67,7 +67,7 @@ test_expect_success setup '
git config i18n.commitencoding UTF-8
'

-test_expect_success 'format-patch output (ISO-8859-1)' '
+test_expect_success_unicode 'format-patch output (ISO-8859-1)' '
git config i18n.logoutputencoding ISO8859-1 &&

git format-patch --stdout master..HEAD^ >out-l1 &&
@@ -78,7 +78,7 @@ test_expect_success 'format-patch output (ISO-8859-1)' '
grep "^From: =?ISO8859-1?q?=C1=E9=ED=20=F3=FA?=" out-l2
'

-test_expect_success 'format-patch output (UTF-8)' '
+test_expect_success_unicode 'format-patch output (UTF-8)' '
git config i18n.logoutputencoding UTF-8 &&

git format-patch --stdout master..HEAD^ >out-u1 &&
@@ -89,7 +89,7 @@ test_expect_success 'format-patch output (UTF-8)' '
grep "^From: =?UTF-8?q?=C3=81=C3=A9=C3=AD=20=C3=B3=C3=BA?=" out-u2
'

-test_expect_success 'rebase (U/U)' '
+test_expect_success_unicode 'rebase (U/U)' '
# We want the result of rebase in UTF-8
git config i18n.commitencoding UTF-8 &&

@@ -108,7 +108,7 @@ test_expect_success 'rebase (U/U)' '
check_encoding 2
'

-test_expect_success 'rebase (U/L)' '
+test_expect_success_unicode 'rebase (U/L)' '
git config i18n.commitencoding UTF-8 &&
git config i18n.logoutputencoding ISO8859-1 &&
. "$TEST_DIRECTORY"/t3901-utf8.txt &&
@@ -119,7 +119,7 @@ test_expect_success 'rebase (U/L)' '
check_encoding 2
'

-test_expect_success 'rebase (L/L)' '
+test_expect_success_unicode 'rebase (L/L)' '
# In this test we want ISO-8859-1 encoded commits as the result
git config i18n.commitencoding ISO8859-1 &&
git config i18n.logoutputencoding ISO8859-1 &&
@@ -131,7 +131,7 @@ test_expect_success 'rebase (L/L)' '
check_encoding 2 8859
'

-test_expect_success 'rebase (L/U)' '
+test_expect_success_unicode 'rebase (L/U)' '
# This is pathological -- use UTF-8 as intermediate form
# to get ISO-8859-1 results.
git config i18n.commitencoding ISO8859-1 &&
@@ -144,7 +144,7 @@ test_expect_success 'rebase (L/U)' '
check_encoding 2 8859
'

-test_expect_success 'cherry-pick(U/U)' '
+test_expect_success_unicode 'cherry-pick(U/U)' '
# Both the commitencoding and logoutputencoding is set to UTF-8.

git config i18n.commitencoding UTF-8 &&
@@ -159,7 +159,7 @@ test_expect_success 'cherry-pick(U/U)' '
check_encoding 3
'

-test_expect_success 'cherry-pick(L/L)' '
+test_expect_success_unicode 'cherry-pick(L/L)' '
# Both the commitencoding and logoutputencoding is set to ISO-8859-1

git config i18n.commitencoding ISO8859-1 &&
@@ -174,7 +174,7 @@ test_expect_success 'cherry-pick(L/L)' '
check_encoding 3 8859
'

-test_expect_success 'cherry-pick(U/L)' '
+test_expect_success_unicode 'cherry-pick(U/L)' '
# Commitencoding is set to UTF-8 but logoutputencoding is ISO-8859-1

git config i18n.commitencoding UTF-8 &&
@@ -189,7 +189,7 @@ test_expect_success 'cherry-pick(U/L)' '
check_encoding 3
'

-test_expect_success 'cherry-pick(L/U)' '
+test_expect_success_unicode 'cherry-pick(L/U)' '
# Again, the commitencoding is set to ISO-8859-1 but
# logoutputencoding is set to UTF-8.

@@ -205,7 +205,7 @@ test_expect_success 'cherry-pick(L/U)' '
check_encoding 3 8859
'

-test_expect_success 'rebase --merge (U/U)' '
+test_expect_success_unicode 'rebase --merge (U/U)' '
git config i18n.commitencoding UTF-8 &&
git config i18n.logoutputencoding UTF-8 &&
. "$TEST_DIRECTORY"/t3901-utf8.txt &&
@@ -216,7 +216,7 @@ test_expect_success 'rebase --merge (U/U)' '
check_encoding 2
'

-test_expect_success 'rebase --merge (U/L)' '
+test_expect_success_unicode 'rebase --merge (U/L)' '
git config i18n.commitencoding UTF-8 &&
git config i18n.logoutputencoding ISO8859-1 &&
. "$TEST_DIRECTORY"/t3901-utf8.txt &&
@@ -227,7 +227,7 @@ test_expect_success 'rebase --merge (U/L)' '
check_encoding 2
'

-test_expect_success 'rebase --merge (L/L)' '
+test_expect_success_unicode 'rebase --merge (L/L)' '
# In this test we want ISO-8859-1 encoded commits as the result
git config i18n.commitencoding ISO8859-1 &&
git config i18n.logoutputencoding ISO8859-1 &&
@@ -239,7 +239,7 @@ test_expect_success 'rebase --merge (L/L)' '
check_encoding 2 8859
'

-test_expect_success 'rebase --merge (L/U)' '
+test_expect_success_unicode 'rebase --merge (L/U)' '
# This is pathological -- use UTF-8 as intermediate form
# to get ISO-8859-1 results.
git config i18n.commitencoding ISO8859-1 &&
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index cdb70b4..b672b5c 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -93,7 +93,7 @@ test_expect_success 'output from user-defined format is re-wrapped' '
test_cmp expect log.predictable
'

-test_expect_success 'shortlog wrapping' '
+test_expect_success_unicode 'shortlog wrapping' '
cat >expect <<\EOF &&
A U Thor (5):
Test
@@ -114,7 +114,7 @@ EOF
test_cmp expect out
'

-test_expect_success 'shortlog from non-git directory' '
+test_expect_success_unicode 'shortlog from non-git directory' '
git log HEAD >log &&
GIT_DIR=non-existing git shortlog -w <log >out &&
test_cmp expect out
@@ -135,7 +135,7 @@ $DSCHO (2):

EOF

-test_expect_success 'shortlog encoding' '
+test_expect_success_unicode 'shortlog encoding' '
git reset --hard "$commit" &&
git config --unset i18n.commitencoding &&
echo 2 > a1 &&
diff --git a/t/t8005-blame-i18n.sh b/t/t8005-blame-i18n.sh
index cb39055..6affb78 100755
--- a/t/t8005-blame-i18n.sh
+++ b/t/t8005-blame-i18n.sh
@@ -1,6 +1,7 @@
#!/bin/sh

test_description='git blame encoding conversion'
+
. ./test-lib.sh

. "$TEST_DIRECTORY"/t8005/utf8.txt
@@ -33,7 +34,7 @@ author $SJIS_NAME
summary $SJIS_MSG
EOF

-test_expect_success \
+test_expect_success_unicode \
'blame respects i18n.commitencoding' '
git blame --incremental file | \
egrep "^(author|summary) " > actual &&
@@ -49,7 +50,7 @@ author $EUC_JAPAN_NAME
summary $EUC_JAPAN_MSG
EOF

-test_expect_success \
+test_expect_success_unicode \
'blame respects i18n.logoutputencoding' '
git config i18n.logoutputencoding eucJP &&
git blame --incremental file | \
@@ -66,7 +67,7 @@ author $UTF8_NAME
summary $UTF8_MSG
EOF

-test_expect_success \
+test_expect_success_unicode \
'blame respects --encoding=UTF-8' '
git blame --incremental --encoding=UTF-8 file | \
egrep "^(author|summary) " > actual &&
@@ -82,7 +83,7 @@ author $UTF8_NAME
summary $UTF8_MSG
EOF

-test_expect_success \
+test_expect_success_unicode \
'blame respects --encoding=none' '
git blame --incremental --encoding=none file | \
egrep "^(author|summary) " > actual &&
diff --git a/t/test-lib.sh b/t/test-lib.sh
index b50d723..06edcbd 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -477,6 +477,19 @@ test_expect_success () {
echo >&3 ""
}

+# test if shell supports unicode, expect success if it does, otherwise failure
+test_expect_success_unicode () {
+tclsh -ÀÁÂ <<EOF
+ exit [expr {\$argv ne [lindex "-\u00c0\u00c1\u00c2" 0]}]
+EOF
+ if [ $? == 0 ]
+ then
+ test_expect_success "$@"
+ else
+ test_expect_failure "$@"
+ fi
+}
+
test_expect_code () {
test "$#" = 4 && { prereq=$1; shift; } || prereq=
test "$#" = 3 ||
--
1.7.3.1.msysgit.0.8.g31c0ec

Kirill Smelkov

unread,
Nov 26, 2010, 8:29:36 AM11/26/10
to Karsten Blees, Thorvald Natvig, Nico Rieck, Takuya Murakami, msy...@googlegroups.com
[ Cc'ing all the relevant people ]

Karsten, All,

On Sun, Oct 24, 2010 at 10:00:18PM +0000, Karsten Blees wrote:
> This patch series tries to solve the Unicode problems of Git for Windows, at least for git, gitk and git-gui.
>
> Credits go to Thorvald Natvig (tn/utf8), Nico Rieck (work/utf8) and Takuya Murakami (work/tm/utf8) for their previous work on the subject, from which I've shamelessly stolen ideas as seemed fit. Notable differences from their patches include:
>
> * The UTF-8 to UTF-16 conversion function can handle invalid UTF-8 (similar to TCL), so that old git repositories with legacy-encoded file names can be checked out without error.
>
> * The conversion functions don't use static buffers in order to be fully thread safe. This makes each usage a bit more complicated, but is necessary for the multi-threaded parts of git to work (e.g. git-grep, git-pack-objects, core.preloadindex).
>
> I've included a revised version of the thread-safe console patch that I've posted a while back. This version eliminates the call to TerminateThread and uses the fault-tolerant UTF-8 conversion function from the file name patch (so that file system and console output are in sync).
>
> Patch 3 breaks some of the i18n tests (t3901, t4201 and t8005), which I really don't know how to fix properly. MSys bash is completely broken when it comes to Unicode, as it reads test scripts in Windows system encoding (instead of UTF-8). Changing the encoding of the test scripts (e.g. to Cp1252) is not a solution as this will fail on Windows systems with different encodings. Currently, I've just changed the tests to expect failure if bash cannot pass unicode script content to other processes correctly.
>
> $ grep "failed [^0]" t/test-results/*
> t/test-results/t5407-post-rewrite-hook-6400.counts:failed 9
> t/test-results/t7400-submodule-basic-7004.counts:failed 1
> t/test-results/t7508-status-5936.counts:failed 18
> t/test-results/t9001-send-email-6652.counts:failed 4
> t/test-results/t9500-gitweb-standalone-no-errors-3620.counts:failed 83
> t/test-results/t9501-gitweb-standalone-http-status-5620.counts:failed 11
> t/test-results/t9502-gitweb-standalone-parse-output-3384.counts:failed 9
>
> Karsten Blees (5):
> Unicode file names on Windows
> Unicode file names for gitk and git-gui on Windows
> Unicode arguments and environment on Windows
> Thread-safe windows console output
> Mark unicode-related tests broken on msys

Big-Thanks-From: and Tested-By: Kirill Smelkov <ki...@mns.spb.ru>

Thanks a lot!

I've tested this together with my .pdf and .doc textconv patches and
this worked great both under Wine (my primary OS is Linux) and under
several real Windows XP.

The only glitch I've found so far is that file list in git-gui's
"Repository->Browse master's files" is not converted to UTF-16 and
displays as garbage.

Otherwise everything seems to work:

- patch 1: Looks sane although I can't review it in depth because of lack
of win32 knowledge.

- patch 2: There are more places in gitk where to reencode, e.g. "New
file ...", IIRC. Otherwise it works modulo already-mentioned garbage
in git-gui file list.

- patch 3: It works because I've been able to use it with my *.doc and
*.pdf textconv drivers on UTF-8 encoded filenames.

- patch 4: Works, as you've already said only with --no-pager. Would be
great to teach it to correctly activate UTF-8 -> UTF-16 conversion in
pager cases too, but what is now is already a very good start, imho.


Thanks again, and sorry I'm maybe not reviewing this stuff in-depth --
at least it works and let's hope we'll have something which addresses
issue80 in msysgit mainline someday.


Kirill

karste...@dcon.de

unread,
Nov 27, 2010, 2:51:31 PM11/27/10
to Kirill Smelkov, msy...@googlegroups.com, Nico Rieck, Thorvald Natvig, Takuya Murakami

Hi Kirill,

Kirill Smelkov <ki...@mns.spb.ru> wrote on 26.11.2010 14:29:36:

> [ Cc'ing all the relevant people ]
>


Sorry, I missed that...

>
> Big-Thanks-From: and Tested-By: Kirill Smelkov <ki...@mns.spb.ru>
>
> Thanks a lot!
>

Thanks for testing.

> I've tested this together with my .pdf and .doc textconv patches and
> this worked great both under Wine (my primary OS is Linux) and under
> several real Windows XP.
>
> The only glitch I've found so far is that file list in git-gui's
> "Repository->Browse master's files" is not converted to UTF-16 and
> displays as garbage.
>

I only touched code that obviously already dealt with file name encoding (i.e. 'encoding convertfrom...'), so I might have missed something. This should fix the repo browser:
---
diff --git a/git-gui/lib/browser.tcl b/git-gui/lib/browser.tcl
index c241572..575de43 100644
--- a/git-gui/lib/browser.tcl
+++ b/git-gui/lib/browser.tcl
@@ -191,7 +191,7 @@ method _ls {tree_id {name {}}} {
         $w conf -state disabled
 
         set fd [git_read ls-tree -z $tree_id]
-        fconfigure $fd -blocking 0 -translation binary -encoding binary
+        fconfigure $fd -blocking 0 -translation binary -encoding utf-8
         fileevent $fd readable [cb _read $fd]
 }
---

> Otherwise everything seems to work:
>
> - patch 1: Looks sane although I can't review it in depth because of lack
>   of win32 knowledge.
>
> - patch 2: There are more places in gitk where to reencode, e.g. "New
>   file ...", IIRC. Otherwise it works modulo already-mentioned garbage
>   in git-gui file list.
>


Where can you do a "New file..." in gitk?

> - patch 3: It works because I've been able to use it with my *.doc and
>   *.pdf textconv drivers on UTF-8 encoded filenames.
>
> - patch 4: Works, as you've already said only with --no-pager. Would be
>   great to teach it to correctly activate UTF-8 -> UTF-16 conversion in
>   pager cases too, but what is now is already a very good start, imho.
>

The only pager that I've found to work flawlessly is cygwin less.

MSys less always converts the input from the system's ANSI code page to the current console code page. The problem is the console handler in the MSys dll (see fhandler_console.cc in the msysCORE package, seems to be based on a 10-year-old cygwin fork). This also means that console output of all MSys programs (bash, cat, head, tail etc.) is affected. I've had limited success with piping git output through iconv first, but the bottom line is that MSys doesn't support Unicode.

The native less (windows binary from the author's home page http://www.greenwoodsoftware.com/less) has problems with colors and the cursor keys in utf-8 mode.

Here's the relevant section of my /etc/profile:
# msys less (v358)
# replace 1252 with your system ANSI codepage, i.e. GetACP() or
# HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP in the registry
#export LESSCHARSET=dos
#export GIT_PAGER="iconv -f utf-8 -t cp1252 -c | less"
# prevent additional msys conversion from ACP to OEMCP
#cmd /c "chcp 1252"

# cygwin less (v436)
export LESSCHARSET=utf-8
export GIT_PAGER=C:/cygwin/bin/less.exe

# native less (v436)
#export LESSCHARSET=utf-8
#export GIT_PAGER=C:/less/less.exe
#cmd /c "chcp 65001"

>
> Thanks again, and sorry I'm maybe not reviewing this stuff in-depth --
> at least it works and let's hope we'll have something which addresses
> issue80 in msysgit mainline someday.
>
>
> Kirill

Yes, I agree. After all, the Windows OS had full Unicode support from the start, so I think it's time that 'Git for Windows' earns that name (with the encoding issues, it feels more like a 'Git for DOS' right now).

Karsten

Kirill Smelkov

unread,
Nov 29, 2010, 7:15:50 AM11/29/10
to karste...@dcon.de, msy...@googlegroups.com, Nico Rieck, Thorvald Natvig, Takuya Murakami
On Sat, Nov 27, 2010 at 08:51:31PM +0100, karste...@dcon.de wrote:
> Hi Kirill,
>
> Kirill Smelkov <ki...@mns.spb.ru> wrote on 26.11.2010 14:29:36:

[snip]

> > I've tested this together with my .pdf and .doc textconv patches and
> > this worked great both under Wine (my primary OS is Linux) and under
> > several real Windows XP.
> >
> > The only glitch I've found so far is that file list in git-gui's
> > "Repository->Browse master's files" is not converted to UTF-16 and
> > displays as garbage.
> >
>
> I only touched code that obviously already dealt with file name encoding
> (i.e. 'encoding convertfrom...'), so I might have missed something. This
> should fix the repo browser:
> ---
> diff --git a/git-gui/lib/browser.tcl b/git-gui/lib/browser.tcl
> index c241572..575de43 100644
> --- a/git-gui/lib/browser.tcl
> +++ b/git-gui/lib/browser.tcl
> @@ -191,7 +191,7 @@ method _ls {tree_id {name {}}} {
> $w conf -state disabled
>
> set fd [git_read ls-tree -z $tree_id]
> - fconfigure $fd -blocking 0 -translation binary -encoding binary
> + fconfigure $fd -blocking 0 -translation binary -encoding utf-8
> fileevent $fd readable [cb _read $fd]
> }

Thanks for the fixup - now it works. Btw I'm keeping all the patches here

git://repo.or.cz/git/kirr.git win32/kirr


> > Otherwise everything seems to work:
> >
> > - patch 1: Looks sane although I can't review it in depth because of
> lack
> > of win32 knowledge.
> >
> > - patch 2: There are more places in gitk where to reencode, e.g. "New
> > file ...", IIRC. Otherwise it works modulo already-mentioned garbage
> > in git-gui file list.
> >
>
> Where can you do a "New file..." in gitk?

It's when there is a new file added by commit, which is binary and for
which there is no textconv. Then it reads like

diff --git a/<nice-unicode-name> b/<nice-unicode-name>
new file mode 100644
index 0000000..XXXXXXX
Binary files /dev/null and b/<garbaged-unicode-name> differ
^^^^^^^^^^^^^^^^^^^^^

here


Sorry for not being clear in the first place.

Also actually this and rename-to/rename-from problems are not win32
specific. I suggest to send them directly to gitk (only the encoding
should maybe not be hardcoded to utf-8, but use $diffencoding instead,
e.g. like here:
http://repo.or.cz/w/git.git/commitdiff/1f2cecfd53137b76d39b2dcd7bcf7e918cd745b3)

Thanks for the info.

> > Thanks again, and sorry I'm maybe not reviewing this stuff in-depth --
> > at least it works and let's hope we'll have something which addresses
> > issue80 in msysgit mainline someday.
> >
> >
> > Kirill
>
> Yes, I agree. After all, the Windows OS had full Unicode support from the
> start, so I think it's time that 'Git for Windows' earns that name (with
> the encoding issues, it feels more like a 'Git for DOS' right now).

Here it's not like 'Git for DOS' (we already have git-gui, gitk,
git-cheetah ...), but issue80 is really a show-stopper.


Thanks again,
Kirill

karste...@dcon.de

unread,
Dec 1, 2010, 12:15:35 PM12/1/10
to Kirill Smelkov, msy...@googlegroups.com, Nico Rieck, Thorvald Natvig, Takuya Murakami

Kirill Smelkov <ki...@mns.spb.ru> wrote on 29.11.2010 13:15:50:

> > > - patch 2: There are more places in gitk where to reencode, e.g. "New
> > >   file ...", IIRC. Otherwise it works modulo already-mentioned garbage
> > >   in git-gui file list.
> > >
> >
> > Where can you do a "New file..." in gitk?
>
> It's when there is a new file added by commit, which is binary and for
> which there is no textconv. Then it reads like
>
>     diff --git a/<nice-unicode-name> b/<nice-unicode-name>
>     new file mode 100644
>     index 0000000..XXXXXXX
>     Binary files /dev/null and b/<garbaged-unicode-name> differ
>                                   ^^^^^^^^^^^^^^^^^^^^^
>
>                                   here
>
>
> Sorry for not being clear in the first place.
>

I can't reproduce that. New binary files look like this for me:

------------------------------- ABC€ÄÖÜäöü߀.bin -------------------------------
new file mode 100644
index 0000000..e01bbc5
Binary files /dev/null and b/ABC€ÄÖÜäöü߀.bin differ

The "diff --git" line gets translated to "----...". I also don't see why gitk should convert the "Binary files..." line any different than "rename from" / "rename to". How did you get that output? Can you check the exact output of git diff-tree -p <commit>?

> Also actually this and rename-to/rename-from problems are not win32
> specific. I suggest to send them directly to gitk (only the encoding
> should maybe not be hardcoded to utf-8, but use $diffencoding instead,
> e.g. like here:
>
http://repo.or.cz/w/git.
> git/commitdiff/1f2cecfd53137b76d39b2dcd7bcf7e918cd745b3)
>


I don't think so. There are three different encodings to consider when parsing git output:

1. commits and tags (commit messages): git reencodes these to i18n.logoutputencoding or i18n.commitencoding, default utf-8. This is stored in $tclencoding in gitk.

2. blobs (file content): printed verbatim, gitk converts from gui.encoding (default system encoding), can be overridden per file in .gitattributes (must be explicitly enabled in gitk). This is handled by get_path_encoding and cached in $diffencoding (which can therefore change per file).

3. trees (file names): printed verbatim, no config setting, gitk converts from system encoding, which I changed to hardcoded utf-8

On most Unix systems today, the system encoding will be utf-8, so file name decoding in gitk will be fine. Only on win32 it is impossible to set the system encoding to utf-8 (e.g. via HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP=65001, don't try this unless you know how to use a rescue disk and reg.exe).

Currently, there is no git config setting or gitk global variable for file name encoding, and I don't think that this is necessary. Using $diffencoding would be wrong as $diffencoding is for file content and can change per file.

Regarding your patch, I also think that you should have used $tclencoding instead of $diffencoding, because submodule diffs consist of the summary lines of the commit messages.

> > Yes, I agree. After all, the Windows OS had full Unicode support from the
> > start, so I think it's time that 'Git for Windows' earns that name (with
> > the encoding issues, it feels more like a 'Git for DOS' right now).
>
> Here it's not like 'Git for DOS' (we already have git-gui, gitk,
> git-cheetah ...), but issue80 is really a show-stopper.

>

What I meant by 'Git for DOS' is, well, the last DOS-based Windows was taken off market about 8 years ago. It is sad that msysgit today is still based on the 'Windows for DOS' compatibility API (inheriting the code page limitations of Win3.1/95/98/ME), instead of the native Unicode API that has been around since the first release of Windows NT in 1993. For a distributed SCM desigend for projects of international scope, the 'for Windows' attribute just doesn't fit as long as issue80 is not resolved.
</rant> (sorry)

>
> Thanks again,
> Kirill

Ciao,
Karsten

Kirill Smelkov

unread,
Dec 18, 2010, 10:56:53 AM12/18/10
to karste...@dcon.de, msy...@googlegroups.com, Nico Rieck, Thorvald Natvig, Takuya Murakami
First of all, sorry for so long delay in replying...

I'm not telling it should handle it in different way, quite the opposite - it
should (but does not yet) handle it in the same way. IIRC gitk has built-in
line-matcher with logic which lines to encoding-convert and which not. I'd
change it to "convert all lines..."

> How did you get that output? Can you check the exact
> output of git diff-tree -p <commit>?

Here you are:

$ git diff-tree -p 34baf5
34baf57217ded50132e1f414ad76f6aab93bff47
diff --git a/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg b/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg
new file mode 100644
index 0000000..b0bf8a4
Binary files /dev/null and b/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg differ

This is ok, but in gitk under Linux I'm getting:

----------------- hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg -----------------
new file mode 100644
index 0000000..b0bf8a4
Binary files /dev/null and b/hardware/mns/УС-1/Ð¡ÐµÑ€Ð¸Ñ /УС-1_Э3_ИИ108-09.dwg differ

Here are steps how to reproduce this from scratch:

$ mkdir test
$ cd test/
$ git init
$ echo Hello > hello.txt
$ git add hello.txt
$ git commit -m init
$ dd if=/dev/zero of=привет.bin bs=1K count=4
$ git add привет.bin
$ git commit -m second
$ git diff-tree -p HEAD
71d7f6cacfa1364bd9428f4ae51d05356ec30e51
diff --git a/привет.bin b/привет.bin
new file mode 100644
index 0000000..08e7df1
Binary files /dev/null and b/привет.bin differ
# so it's ok

$ gitk # and there I'm seeing
---------------------------------- привет.bin ----------------------------------
new file mode 100644
index 0000000..08e7df1
Binary files /dev/null and b/привет.bin differ

<rant>
I see and agree that there are various encoding envolved. But from a
user perspective, wouldn't one expect output of _one_ command to be in
the same encoding?

If that one command is `git log -p` we have all three cases represented
here.
</rant>

Anyway thanks for clarification and remark about my patch. I have almost
zero Tcl skills - that's why originally I've just copied nearby lines...


> > > Yes, I agree. After all, the Windows OS had full Unicode support from
> the
> > > start, so I think it's time that 'Git for Windows' earns that name
> (with
> > > the encoding issues, it feels more like a 'Git for DOS' right now).
> >
> > Here it's not like 'Git for DOS' (we already have git-gui, gitk,
> > git-cheetah ...), but issue80 is really a show-stopper.
> >
>
> What I meant by 'Git for DOS' is, well, the last DOS-based Windows was
> taken off market about 8 years ago. It is sad that msysgit today is still
> based on the 'Windows for DOS' compatibility API (inheriting the code page
> limitations of Win3.1/95/98/ME), instead of the native Unicode API that
> has been around since the first release of Windows NT in 1993. For a
> distributed SCM desigend for projects of international scope, the 'for
> Windows' attribute just doesn't fit as long as issue80 is not resolved.
> </rant> (sorry)

:) let's hope it will be resolved one day.


Sorry again for the dalay,
Kirill

karste...@dcon.de

unread,
Dec 21, 2010, 5:33:28 PM12/21/10
to Kirill Smelkov, msy...@googlegroups.com, Nico Rieck, Thorvald Natvig, Takuya Murakami
Sorry for beeing unclear, what I meant was: from looking at the code, I don't understand why gitk converts these lines differently for you.

The "diff --git" line is handled here:

@@ -7621,7 +7621,7 @@ proc makediffhdr {fname ids} {
     global ctext curdiffstart treediffs diffencoding
     global ctext_file_names jump_to_here targetline diffline
 
-    set fname [encoding convertfrom $fname]
+    set fname [encoding convertfrom utf-8 $fname]
     set diffencoding [get_path_encoding $fname]
     set i [lsearch -exact $treediffs($ids) $fname]
     if {$i >= 0} {

...and the "Binary files" line is handled here:

@@ -7776,6 +7776,7 @@ proc getblobdiffline {bdf ids} {
                 set diffinhdr 0
                 continue
             }
+            set line [encoding convertfrom utf-8 $line]
             $ctext insert end "$line\n" filesep
 
         } else {

> > How did you get that output? Can you check the exact
> > output of git diff-tree -p <commit>?
>
> Here you are:
>
>     $ git diff-tree -p 34baf5
>     34baf57217ded50132e1f414ad76f6aab93bff47
>     diff --git a/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg
> b/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg
>     new file mode 100644
>     index 0000000..b0bf8a4
>     Binary files /dev/null and
> b/hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg differ
>


Looks good...

> This is ok, but in gitk under Linux I'm getting:
>
>     ----------------- hardware/mns/УС-1/Серия/УС-1_Э3_ИИ108-09.dwg
> -----------------
>     new file mode 100644
>     index 0000000..b0bf8a4
>     Binary files /dev/null and b/hardware/mns/УС-1/СериÑ
> /УС-1_Э3_ИИ108-09.dwg differ
>

That's exactly the output I would expect from gitk before my patch (i.e. without the @@ -7776,6 +7776,7 @@ hunk, e.g. У = \u0423, encoded to UTF-8 = d0a3, with fconfigure -encoding binary = \u00d0\u00a3 = У). Are you sure that the patch applied cleanly and you're actually executing the new version?

Ciao,
Karsten

Johannes Schindelin

unread,
May 21, 2011, 9:46:52 PM5/21/11
to Karsten Blees, msy...@googlegroups.com
Hi,

On Sun, 24 Oct 2010, Karsten Blees wrote:

> This patch series tries to solve the Unicode problems of Git for
> Windows, at least for git, gitk and git-gui.

Sorry, I did not really have time to follow this issue. So could some kind
soul update me as to the status?

Thanks!
Dscho

karste...@dcon.de

unread,
May 24, 2011, 2:42:01 PM5/24/11
to msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com

Hi all,

Johannes Schindelin <Johannes....@gmx.de> wrote on 22.05.2011 03:46:52:
>
> Sorry, I did not really have time to follow this issue. So could some kind
> soul update me as to the status?
>

certainly, I'm sorry for not getting back to you sooner (and thanks for the reminder...). I'm lately spending most of my spare time renovating my new home, so progress on the git-front is somewhat slow...

Anyway, I just pushed what I've been working on to http://repo.or.cz/w/git/mingw/4msysgit/kblees.git as 'v9'. This version is a bit unconventional in that it ends in several alternate branches, to be able to compare the different approaches to implementing the Unicode environment (i.e. convert on startup vs. wrapping getenv/putenv).

I hope that we can settle on one of these branches and get this patch series finally merged. I'd suggest the full version, of course, the wrapped variant is IMO just too complex and ugly (unless I've screwed that one up entirely...).

Cheers,
Karsten



kb/unicode-v8-a8770a16
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v8-a8770a16)
This is v8 rebased to current devel (for reference).

kb/unicode-v9-noenv
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v9-noenv)
Removes all v8 environment patches, common base for the rest.
Changes last test of t4014 to require a unicode-aware bash.

kb/unicode-v9-minimal
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v9-minimal)
Adds Unicode environment support by converting char **environ on startup and when creating child processes (my original approach, but just the Unicode 'feature', without any fixes or optimizations).

kb/unicode-v9-wrapped
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v9-wrapped)
Alternate Unicode environment implementation wrapping _wgetenv / _wputenv as suggested on the mailing list. This includes the getenv sanitizer (cure to Jonathan Nieder's getenv-poison).

kb/unicode-v9-full
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v9-full)
On top of the minimal version, restores all the other v8 features, but in smaller, hopefully more consise patches.
Changes to v8 (see diff below):
- environment size is tracked in global variables (instead of over-allocating the environment array and binary-searching the first NULL)
- mingw_spawnve_fd no longer merges char **environ with itself
- putenv(char*) changed to putenv(const char*) for compatibility with unsetenv



Here's a quick comparison of features, complexity and performance of the individual versions. The performance figures are for 'git log' with pager, merging 'LESS=FRSX' into the pager environment. Code-changes for the micro benchmarks can be found in *-perf branches, respectively (e.g. kb/unicode-v9-minimal-perf).

Feature matrix          |  minimal   |   wrapped     |   full    
------------------------+------------+---------------+------------
Unicode conversion in   | main/spawn | getenv/putenv | main/spawn
sorted environment      |     no     |      no       |    yes    
POSIX compliant putenv  |     no     |      no       |    yes    
POSIX compliant environ |    yes     |      no       |    yes    
- complexity -----------+------------+---------------+------------
changed files vs. noenv |      2     |       6       |      4    
inserted lines          |     31     |     271       |    158    
deleted lines           |     12     |      82       |    133    
- performance (µs) -----+------------+---------------+------------
startup                 |     22     |       0       |     36    
getenv                  |      3     |    3-21       |      0.06  
getenv total (35 calls) |    105     |     154       |      2    
spawn                   |     42     |      49       |     17    



Commit logs ('+': new/rewritten in v9, '-': removed, '!': modified)
---
$ git log --graph --oneline --decorate kb/unicode-v8-a8770a16 kb/unicode-v9-full kb/unicode-v9-wrapped ^devel~1 (+ some manual editing)

+ * 7d12dca (kb/unicode-v9-wrapped) Win32: Unicode environment (wrapped, outgoing)
+ * ea620e7 Win32: Unicode environment (wrapped, incoming)
+ * 71a975b Fix git's usage of getenv() on systems with exotic getenv implementations
  |
+ | * c757005 (kb/unicode-v9-full) Win32: unify environment function names
+ | * b87a2c8 Win32: move environment helper functions
+ | * 8e461d1 Win32: move environment block creation to a helper method
+ | * 1a6e505 Win32: patch Windows environment on startup
+ | * 4056c53 Win32: optimise env_setenv
+ | * 8dcd8ca Win32: keep the environment sorted
+ | * c886cea Win32: reduce environment array reallocations
+ | * 07ae80d Win32: don't copy the environment twice when spawning child processes
+ | * 101b840 Win32: fix environment sorting and memory leaks
  | |
+ | * ff2bf7c (kb/unicode-v9-minimal) Win32: Unicode environment (incoming)
+ | * 6521e17 Win32: Unicode environment (outgoing)
  |/  
! * a562e25 (kb/unicode-v9-noenv) Mark unicode-related tests broken on msys
  * 7a25b4b Win32: Thread-safe windows console output
  * a981716 Win32: Unicode arguments (incoming)
  * 56acc57 Win32: Unicode arguments (outgoing)
  * 16e9360 Unicode file name support (gitk and git-gui)
  * baf0f5d Win32: Unicode file name support (dirent)
  * 2bec7dc Win32: Unicode file name support (except dirent)
  * 2b888d8 Win32: add Unicode conversion functions
        |
  | * 6ceff8d (kb/unicode-v8-a8770a16) Mark unicode-related tests broken on msys
  | * b78e88c Win32: Thread-safe windows console output
- | * 7e2193f Win32: improve environment performance
- | * 58a58ee Win32: Unicode environment (incoming)
- | * 4095e08 Win32: Unicode environment (outgoing)
  | * 7b9f6eb Win32: Unicode arguments (incoming)
  | * 166ae45 Win32: Unicode arguments (outgoing)
  | * 6e893d1 Unicode file name support (gitk and git-gui)
  | * c086569 Win32: Unicode file name support (dirent)
  | * 7603d8f Win32: Unicode file name support (except dirent)
  | * d1f7ce9 Win32: add Unicode conversion functions
- | * fcf9db6 Win32: improve environment merging
  |/  
  * 87dcc11 MinGW: disable command line globbing (fixes t5505 and t7810)
  * 97a4c3a Win32: move main macro to a function
  * 0b40c49 gitk: fix file name encoding in diff hunk headers
  * c8f515b git-gui: fix encoding in git-gui file browser
  * 6bca2da Win32: fix potential multi-threading issue
  * dc50b47 Win32 dirent: improve dirent implementation
  * b599186 Win32 dirent: clarify #include directives
  * 36caf8c Win32 dirent: change FILENAME_MAX to MAX_PATH
  * 54d7ee9 Win32 dirent: remove unused dirent.d_reclen member
  * 5f26eb8 Win32 dirent: remove unused dirent.d_ino member
  * d79d001 MSVC: link dynamically to the CRT
  * 6bdbe8e MSVC: fix winansi.c compile errors
  * 585d81a MSVC: include <io.h> for mktemp
  * 1ca9bd4 MSVC: don't use uninitialized variables
  * 0eae8f2 MSVC: disable const qualifier compiler warnings
  * cc32418 Remove dummy S_I[RWX]USR definitions
  * a8770a1 (origin/devel, devel) Partly revert "Work around funny CR issue"



And the diff between v8 and v9-full...
---
$ git diff -p --stat kb/unicode-v8-a8770a16..kb/unicode-v9-full
 compat/mingw.c          |  110 +++++++++++++++++-----------------------------
 compat/mingw.h          |    2 +-
 run-command.c           |    2 +-
 t/t4014-format-patch.sh |    2 +-
 4 files changed, 44 insertions(+), 72 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index ea591de..62b8d12 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -745,10 +745,6 @@ static int compareenv(const void *v1, const void *v2)
         const char *e1 = *(const char**)v1;
         const char *e2 = *(const char**)v2;
 
-        /* sort NULL entries at the end */
-        if (!e1 || !e2)
-                return (e1 == e2) ? 0 : e1 ? -1 : 1;
-
         for (;;) {
                 int c1 = *e1++;
                 int c2 = *e2++;
@@ -763,7 +759,7 @@ static int compareenv(const void *v1, const void *v2)
         }
 }
 
-static inline int bsearchenv(char **env, const char *name, size_t size)
+static int bsearchenv(char **env, const char *name, size_t size)
 {
         unsigned low = 0, high = size - 1;
         while (low <= high) {
@@ -780,76 +776,53 @@ static inline int bsearchenv(char **env, const char *name, size_t size)
 }
 
 /*
- * Insert, replace (with 'key=value') or remove ('key=' or 'key') an entry from
- * a sorted environment array. Size must include the terminating NULL (room for
- * insert). Returns new size. Optionally frees removed or replaced entries.
+ * If name contains '=', then sets the variable, otherwise it unsets it
+ * Size includes the terminating NULL. Env must have room for size + 1 entries
+ * (in case of insert). Returns the new size. Optionally frees removed entries.
  */
-static int do_putenv(char **env, char *name, int size, int free_old)
+static int do_putenv(char **env, const char *name, int size, int free_old)
 {
-        /* lookup existing entry with the same key */
-        int i = bsearchenv(env, name, size);
-        char *eq = strchr(name, '=');
+        int i = bsearchenv(env, name, size - 1);
 
         /* optionally free removed / replaced entry */
         if (i >= 0 && free_old)
                 free(env[i]);
 
-        if (eq && eq[1]) {
+        if (strchr(name, '=')) {
                 /* if new value ('key=value') is specified, insert or replace entry */
                 if (i < 0) {
                         i = ~i;
-                        memmove(&env[i + 1], &env[i], (size - 1 - i) * sizeof(char*));
+                        memmove(&env[i + 1], &env[i], (size - i) * sizeof(char*));
                         size++;
                 }
-                env[i] = name;
+                env[i] = (char*) name;
         } else if (i >= 0) {
-                /* otherwise ('key=' or 'key') remove existing entry */
-                memmove(&env[i], &env[i + 1], (size - 1 - i) * sizeof(char*));
+                /* otherwise ('key') remove existing entry */
                 size--;
+                memmove(&env[i], &env[i + 1], (size - i) * sizeof(char*));
         }
         return size;
 }
 
-#define MIN_ENV_SIZE 16
-
-/*
- * Allocated space for environ is always a power of 2 (with unused entries
- * filled with NULL), so we can determine the size more efficiently.
- */
-static inline int getenvsize()
-{
-        int envsz = MIN_ENV_SIZE;
-        while (environ[envsz - 1])
-                envsz <<= 1;
-        return envsz;
-}
+/* used number of elements of environ array, including terminating NULL */
+static int environ_size = 0;
+/* allocated size of environ array, in bytes */
+static int environ_alloc = 0;
 
 char *mingw_getenv(const char *name)
 {
         char *value;
-        int pos = bsearchenv(environ, name, getenvsize());
+        int pos = bsearchenv(environ, name, environ_size - 1);
         if (pos < 0)
                 return NULL;
         value = strchr(environ[pos], '=');
         return value ? &value[1] : NULL;
 }
 
-int mingw_putenv(char *namevalue)
+int mingw_putenv(const char *namevalue)
 {
-        int envsz;
-        if (!namevalue) {
-                errno = EINVAL;
-                return -1;
-        }
-
-        envsz = getenvsize();
-        do_putenv(environ, namevalue, envsz, 1);
-
-        /* expand array if last slot is used */
-        if (environ[envsz - 1]) {
-                environ = xrealloc(environ, 2 * envsz * sizeof(char*));
-                memset(environ + envsz, 0, envsz * sizeof(char*));
-        }
+        ALLOC_GROW(environ, (environ_size + 1) * sizeof(char*), environ_alloc);
+        environ_size = do_putenv(environ, namevalue, environ_size, 1);
         return 0;
 }
 
@@ -1041,17 +1014,15 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
  */
 static wchar_t *make_environment_block(char **env)
 {
+        wchar_t *wenvblk = NULL;
         char **tmpenv;
-        wchar_t *envblk = NULL;
-        int i = 0, size = 0, envblksz = 0, envblkpos = 0;
+        int i = 0, size = environ_size, envblksz = 0, envblkpos = 0;
 
-        /* get size of current environment + supplied changes */
-        size = getenvsize();
         while (env && env[i])
                 i++;
 
-        /* copy (already sorted) current environment, leaving space for changes */
-        tmpenv = xcalloc(size + i, sizeof(char*));
+        /* copy the environment, leaving space for changes */
+        tmpenv = xmalloc((size + i) * sizeof(char*));
         memcpy(tmpenv, environ, size * sizeof(char*));
 
         /* merge supplied environment changes into the temporary environment */
@@ -1060,16 +1031,14 @@ static wchar_t *make_environment_block(char **env)
 
         /* create environment block from temporary environment */
         for (i = 0; tmpenv[i]; i++) {
-                size = 2 * strlen(tmpenv[i]) + 1;
-                ALLOC_GROW(envblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
-                size = utftowcs(&envblk[envblkpos], tmpenv[i], size);
-                envblkpos += size + 1;
+                size = 2 * strlen(tmpenv[i]) + 2;
+                ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
+                envblkpos += utftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;
         }
-        free(tmpenv);
         /* add final \0 terminator */
-        ALLOC_GROW(envblk, (envblkpos + 1) * sizeof(wchar_t), envblksz);
-        envblk[envblkpos] = 0;
-        return envblk;
+        wenvblk[envblkpos] = 0;
+        free(tmpenv);
+        return wenvblk;
 }
 
 struct pinfo_t {
@@ -1087,7 +1056,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
         STARTUPINFOW si;
         PROCESS_INFORMATION pi;
         struct strbuf args;
-        wchar_t *envblk, wcmd[MAX_PATH], wdir[MAX_PATH], *wargs;
+        wchar_t wcmd[MAX_PATH], wdir[MAX_PATH], *wargs, *wenvblk = NULL;
         unsigned flags = CREATE_UNICODE_ENVIRONMENT;
         BOOL ret;
 
@@ -1148,13 +1117,16 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
         utftowcs(wargs, args.buf, 2 * args.len + 1);
         strbuf_release(&args);
 
-        envblk = make_environment_block(env);
+        if (env == environ)
+                env = NULL;
+
+        wenvblk = make_environment_block(env);
 
         memset(&pi, 0, sizeof(pi));
-        ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags, envblk,
-                        dir ? wdir : NULL, &si, &pi);
+        ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,
+                wenvblk, dir ? wdir : NULL, &si, &pi);
 
-        free(envblk);
+        free(wenvblk);
         free(wargs);
 
         if (!ret) {
@@ -2088,11 +2060,10 @@ void mingw_startup()
         for (i = 0; wenv[i]; i++)
                 maxlen = max(maxlen, wcslen(wenv[i]));
 
-        /* ceil environment size to next power of 2 (see getenvsize()) */
-        for (len = MIN_ENV_SIZE; len <= i; )
-                len <<= 1;
         /* nedmalloc can't free CRT memory, allocate resizable environment list */
-        environ = xcalloc(len, sizeof(char*));
+        environ = NULL;
+        environ_size = i + 1;
+        ALLOC_GROW(environ, environ_size * sizeof(char*), environ_alloc);
 
         /* allocate buffer (wchar_t encodes to max 3 UTF-8 bytes) */
         maxlen = 3 * maxlen + 1;
@@ -2109,6 +2080,7 @@ void mingw_startup()
                 len = wcstoutf(buffer, wenv[i], maxlen);
                 environ[i] = xmemdupz(buffer, len);
         }
+        environ[i] = NULL;
         free(buffer);
 
         /* sort environment for O(log n) getenv / putenv */
diff --git a/compat/mingw.h b/compat/mingw.h
index 5cb51f1..2a30b57 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -199,7 +199,7 @@ char *mingw_getcwd(char *pointer, int len);
 
 char *mingw_getenv(const char *name);
 #define getenv mingw_getenv
-int mingw_putenv(char *namevalue);
+int mingw_putenv(const char *namevalue);
 #define putenv mingw_putenv
 #define unsetenv mingw_putenv
 
diff --git a/run-command.c b/run-command.c
index dce5b3d..21f8473 100644
--- a/run-command.c
+++ b/run-command.c
@@ -348,7 +348,7 @@ fail_pipe:
         }
 
         cmd->pid = mingw_spawnvpe(cmd->argv[0], cmd->argv, (char**) cmd->env,
-                                cmd->dir, fhin, fhout, fherr);
+                        cmd->dir, fhin, fhout, fherr);
         failed_errno = errno;
         if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
                 error("cannot spawn %s: %s", cmd->argv[0], strerror(errno));
diff --git a/t/t4014-format-patch.sh b/t/t4014-format-patch.sh
index 37a4109..7da4b15 100755
--- a/t/t4014-format-patch.sh
+++ b/t/t4014-format-patch.sh
@@ -783,7 +783,7 @@ Subject: [PATCH] =?UTF-8?q?f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
  =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
  =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar?=
 EOF
-test_expect_success 'format-patch wraps extremely long headers (rfc2047)' '
+test_expect_success_unicode 'format-patch wraps extremely long headers (rfc2047)' '
         rm -rf patches/ &&
         echo content >>file &&
         git add file &&

Erik Faye-Lund

unread,
May 24, 2011, 3:03:12 PM5/24/11
to karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
On Tue, May 24, 2011 at 8:42 PM, <karste...@dcon.de> wrote:
>
> Hi all,
>
> Johannes Schindelin <Johannes....@gmx.de> wrote on 22.05.2011
> 03:46:52:
>>
>> Sorry, I did not really have time to follow this issue. So could some kind
>> soul update me as to the status?
>>
>
> certainly, I'm sorry for not getting back to you sooner (and thanks for the
> reminder...). I'm lately spending most of my spare time renovating my new
> home, so progress on the git-front is somewhat slow...
>
> Anyway, I just pushed what I've been working on to
> http://repo.or.cz/w/git/mingw/4msysgit/kblees.git as 'v9'. This version is a
> bit unconventional in that it ends in several alternate branches, to be able
> to compare the different approaches to implementing the Unicode environment
> (i.e. convert on startup vs. wrapping getenv/putenv).
>
> I hope that we can settle on one of these branches and get this patch series
> finally merged.

Did you figure out a way to do this without breaking the pager? IIRC,
there were some reports that with your previous branch, the pager
would have to be disabled (due to some problems in our less, I seem to
recall)... At least I love my pager too much to want to lose it for a
rarely-enough feature as unicode filenames.

Erik Faye-Lund

unread,
May 24, 2011, 3:14:32 PM5/24/11
to karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
On Tue, May 24, 2011 at 8:42 PM, <karste...@dcon.de> wrote:
> Here's a quick comparison of features, complexity and performance of the
> individual versions. The performance figures are for 'git log' with pager,
> merging 'LESS=FRSX' into the pager environment. Code-changes for the micro
> benchmarks can be found in *-perf branches, respectively (e.g.
> kb/unicode-v9-minimal-perf).
>
> Feature matrix          |  minimal   |   wrapped     |   full
> ------------------------+------------+---------------+------------
> Unicode conversion in   | main/spawn | getenv/putenv | main/spawn
> sorted environment      |     no     |      no       |    yes
> POSIX compliant putenv  |     no     |      no       |    yes
> POSIX compliant environ |    yes     |      no       |    yes

This feature is currently required by compat/unsetenv.c, I believe.

> - complexity -----------+------------+---------------+------------
> changed files vs. noenv |      2     |       6       |      4
> inserted lines          |     31     |     271       |    158
> deleted lines           |     12     |      82       |    133
> - performance (µs) -----+------------+---------------+------------
> startup                 |     22     |       0       |     36
> getenv                  |      3     |    3-21       |      0.06
> getenv total (35 calls) |    105     |     154       |      2
> spawn                   |     42     |      49       |     17

Thanks, a lot, this settles all my issues with performance. If all we
lose is 36 µs on start-up per process for doing the full conversion,
then I'm not concerned.

The only thing I'd like to know is how large your environment was when
you did this test; it certainly affects performance.

Is this patch related to unicode? Sounds to me like if it fixes two
tests, perhaps it's worth having no matter what?

>   * 97a4c3a Win32: move main macro to a function
>   * 0b40c49 gitk: fix file name encoding in diff hunk headers
>   * c8f515b git-gui: fix encoding in git-gui file browser
>   * 6bca2da Win32: fix potential multi-threading issue
>   * dc50b47 Win32 dirent: improve dirent implementation
>   * b599186 Win32 dirent: clarify #include directives
>   * 36caf8c Win32 dirent: change FILENAME_MAX to MAX_PATH
>   * 54d7ee9 Win32 dirent: remove unused dirent.d_reclen member
>   * 5f26eb8 Win32 dirent: remove unused dirent.d_ino member
>   * d79d001 MSVC: link dynamically to the CRT
>   * 6bdbe8e MSVC: fix winansi.c compile errors
>   * 585d81a MSVC: include <io.h> for mktemp
>   * 1ca9bd4 MSVC: don't use uninitialized variables
>   * 0eae8f2 MSVC: disable const qualifier compiler warnings
>   * cc32418 Remove dummy S_I[RWX]USR definitions
>   * a8770a1 (origin/devel, devel) Partly revert "Work around funny CR issue"

As above, it sounds like a lot of these patches are simply
improvements no matter what. Perhaps we should weed out the ones that
are completely uncontroversial, so the ones that are controversial
gets more manageable?

Erik Faye-Lund

unread,
May 24, 2011, 3:35:29 PM5/24/11
to karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com

Here's a branch that are the patches that jumps out as uncontroversial
(there might be more, though):
http://repo.or.cz/w/git/kusma.git/shortlog/refs/heads/for-devel

Dscho, Karsten: would you be OK with me just pushing these directly to
'devel'? I believe they all have been discussed...

b688b9a MinGW: disable command line globbing (fixes t5505 an
605f727 Win32: move main macro to a function
8a2b741 Win32: fix potential multi-threading issue
5b4cdf7 Win32 dirent: improve dirent implementation
5eb0af2 Win32 dirent: clarify #include directives
1cb318d Win32 dirent: change FILENAME_MAX to MAX_PATH
5cbd51a Win32 dirent: remove unused dirent.d_reclen member
b52a1f5 Win32 dirent: remove unused dirent.d_ino member
b65a6c4 Remove dummy S_I[RWX]USR definitions

karste...@dcon.de

unread,
May 24, 2011, 3:38:35 PM5/24/11
to kusm...@gmail.com, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, mic...@wheelycreek.net, msy...@googlegroups.com, robin.r...@gmail.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 24.05.2011 21:03:12:
[...]

>
> Did you figure out a way to do this without breaking the pager? IIRC,
> there were some reports that with your previous branch, the pager
> would have to be disabled (due to some problems in our less, I seem to
> recall)... At least I love my pager too much to want to lose it for a
> rarely-enough feature as unicode filenames.

If you only use ASCII filenames, the current pager is fine.

If you wanna see unicode filenames in the console properly, I suggest cygwin less with this in /etc/profile (works flawlessly):

export LESSCHARSET=utf-8
export GIT_PAGER=C:/cygwin/bin/less.exe # or wherever cygwin is installed


MSys less works to a certain degree with this (has some issues with color and horizontal scrolling):

# replace cp1252/1252 with your Windows default codepage...
export LESSCHARSET=dos
export GIT_PAGER="iconv -f utf-8 -t cp1252 -c | less"
cmd /c "chcp 1252"


...and of course you need to "git config core.quotepath off"

Karsten

karste...@dcon.de

unread,
May 24, 2011, 4:27:51 PM5/24/11
to kusm...@gmail.com, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, mic...@wheelycreek.net, msy...@googlegroups.com, robin.r...@gmail.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 24.05.2011 21:14:32:

> On Tue, May 24, 2011 at 8:42 PM,  <karste...@dcon.de> wrote:
> > POSIX compliant environ |    yes     |      no       |    yes
>
> This feature is currently required by compat/unsetenv.c, I believe.
>


That's why the wrapped version needs it's own unsetenv...
compat/unsetenv.c doesn't free the removed entry, bzw...
 
> > - complexity -----------+------------+---------------+------------
> > changed files vs. noenv |      2     |       6       |      4
> > inserted lines          |     31     |     271       |    158
> > deleted lines           |     12     |      82       |    133
> > - performance (µs) -----+------------+---------------+------------
> > startup                 |     22     |       0       |     36
> > getenv                  |      3     |    3-21       |      0.06
> > getenv total (35 calls) |    105     |     154       |      2
> > spawn                   |     42     |      49       |     17
>
> Thanks, a lot, this settles all my issues with performance. If all we
> lose is 36 µs on start-up per process for doing the full conversion,
> then I'm not concerned.
>
> The only thing I'd like to know is how large your environment was when
> you did this test; it certainly affects performance.
>


$ env | wc -l
     61

With larger environment it's more in favor of the sorted version, due to linear vs. binary search in getenv...I added 500 more and got this:

            wrapped  full
startup          0    268
getenv total   376      5
spawn          300     71


[...]
> >   | * c086569 Win32: Unicode file name support (dirent)
> >   | * 7603d8f Win32: Unicode file name support (except dirent)
> >   | * d1f7ce9 Win32: add Unicode conversion functions
> > - | * fcf9db6 Win32: improve environment merging
> >   |/
> >   * 87dcc11 MinGW: disable command line globbing (fixes t5505 and t7810)
>
> Is this patch related to unicode? Sounds to me like if it fixes two
> tests, perhaps it's worth having no matter what?
>


Anything before "Win32: add Unicode conversion functions" is unrelated to Unicode...

> >   * 97a4c3a Win32: move main macro to a function
> >   * 0b40c49 gitk: fix file name encoding in diff hunk headers
> >   * c8f515b git-gui: fix encoding in git-gui file browser
> >   * 6bca2da Win32: fix potential multi-threading issue
> >   * dc50b47 Win32 dirent: improve dirent implementation
> >   * b599186 Win32 dirent: clarify #include directives
> >   * 36caf8c Win32 dirent: change FILENAME_MAX to MAX_PATH
> >   * 54d7ee9 Win32 dirent: remove unused dirent.d_reclen member
> >   * 5f26eb8 Win32 dirent: remove unused dirent.d_ino member
> >   * d79d001 MSVC: link dynamically to the CRT
> >   * 6bdbe8e MSVC: fix winansi.c compile errors
> >   * 585d81a MSVC: include <io.h> for mktemp
> >   * 1ca9bd4 MSVC: don't use uninitialized variables
> >   * 0eae8f2 MSVC: disable const qualifier compiler warnings
> >   * cc32418 Remove dummy S_I[RWX]USR definitions
> >   * a8770a1 (origin/devel, devel) Partly revert "Work around funnyCR issue"

>
> As above, it sounds like a lot of these patches are simply
> improvements no matter what. Perhaps we should weed out the ones that
> are completely uncontroversial, so the ones that are controversial
> gets more manageable?


Yeah, I'd love that :-)

Johannes Sixt

unread,
May 24, 2011, 4:54:05 PM5/24/11
to karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, Johannes....@gmx.de, kusm...@gmail.com, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
Am 24.05.2011 20:42, schrieb karste...@dcon.de:
>
> Hi all,
>
> Johannes Schindelin <Johannes....@gmx.de> wrote on 22.05.2011
> 03:46:52:
>>
>> Sorry, I did not really have time to follow this issue. So could some
> kind
>> soul update me as to the status?
>>
>
> certainly, I'm sorry for not getting back to you sooner (and thanks for
> the reminder...). I'm lately spending most of my spare time renovating
> my new home, so progress on the git-front is somewhat slow...
>
> Anyway, I just pushed what I've been working on to
> http://repo.or.cz/w/git/mingw/4msysgit/kblees.git as 'v9'. This version
> is a bit unconventional in that it ends in several alternate branches,
> to be able to compare the different approaches to implementing the
> Unicode environment (i.e. convert on startup vs. wrapping
> getenv/putenv). <http://repo.or.cz/w/git/mingw/4msysgit/kblees.git>

>
> I hope that we can settle on one of these branches and get this patch
> series finally merged. I'd suggest the full version, of course, the
> wrapped variant is IMO just too complex and ugly (unless I've screwed
> that one up entirely...).

I lost track of what the long-term goal of this series is.

One thing that I am worried about is that with these patches git is
designed to talk to its environment in UTF-8. Is this correct? I'd
prefer to have a program that works "the Windows way", which would mean
that it talks to its environment using the current codepage setting.
More precisely, git.exe should work in a CMD with standard settings
without any wrappers (like msysgit's git.cmd).

-- Hannes

karste...@dcon.de

unread,
May 24, 2011, 6:28:28 PM5/24/11
to Johannes Sixt, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, kusm...@gmail.com, mic...@wheelycreek.net, msy...@googlegroups.com, robin.r...@gmail.com

Johannes Sixt <j...@kdbg.org> wrote on 24.05.2011 22:54:05:

>
> I lost track of what the long-term goal of this series is.
>
> One thing that I am worried about is that with these patches git is
> designed to talk to its environment in UTF-8. Is this correct?


More precisely, the patches are designed so that git uses UTF-8 internally and talks to its environment in UTF-16 (which is the native encoding of all NT-based Windows versions).

> I'd
> prefer to have a program that works "the Windows way", which would mean
> that it talks to its environment using the current codepage setting.

> More precisely, git.exe should work in a CMD with standard settings
> without any wrappers (like msysgit's git.cmd).

If git requires some current codepage settings to work, that would mean that the same (binary identical) git repository won't work on a platform with different codepage settings, correct?

This is exactly the situation we're facing today: if non-ASCII filenames are used, Windows/msysgit repositories are incompatible with other-language-Windows repositories and Linux/Cygwin/JGit... repositories.

So, the goal here is to make git work *independant* of any codepage settings or wrappers by talking to Windows in it's native tonge (UTF-16).

Cheers,
Karsten

Johannes Schindelin

unread,
May 25, 2011, 2:43:18 AM5/25/11
to Erik Faye-Lund, karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
Hi,

On Tue, 24 May 2011, Erik Faye-Lund wrote:

> Here's a branch that are the patches that jumps out as uncontroversial
> (there might be more, though):
> http://repo.or.cz/w/git/kusma.git/shortlog/refs/heads/for-devel
>
> Dscho, Karsten: would you be OK with me just pushing these directly to
> 'devel'? I believe they all have been discussed...

Sure! Go ahead...

Ciao,
Dscho

Johannes Schindelin

unread,
May 25, 2011, 3:16:34 AM5/25/11
to Erik Faye-Lund, karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
Hi,

On Tue, 24 May 2011, Erik Faye-Lund wrote:

> On Tue, May 24, 2011 at 8:42 PM, <karste...@dcon.de> wrote:
> > - complexity -----------+------------+---------------+------------
> > changed files vs. noenv | � � �2 � � | � � � 6 � � � | � � �4
> > inserted lines � � � � �| � � 31 � � | � � 271 � � � | � �158
> > deleted lines � � � � � | � � 12 � � | � � �82 � � � | � �133

> > - performance (�s) -----+------------+---------------+------------


> > startup � � � � � � � � | � � 22 � � | � � � 0 � � � | � � 36
> > getenv � � � � � � � � �| � � �3 � � | � �3-21 � � � | � � �0.06
> > getenv total (35 calls) | � �105 � � | � � 154 � � � | � � �2
> > spawn � � � � � � � � � | � � 42 � � | � � �49 � � � | � � 17
>
> Thanks, a lot, this settles all my issues with performance. If all we

> lose is 36 �s on start-up per process for doing the full conversion,


> then I'm not concerned.

Is it really microseconds? I thought on Windows we measure milliseconds.

This makes a difference: 30 calls of something that takes 36 milliseconds,
and we're above 1 second, which is noticable.

> The only thing I'd like to know is how large your environment was when
> you did this test; it certainly affects performance.

I'm really worried about this, too. And not only the number of entries in
the environment, but also the length of the entries.

You know, a couple of years back, I had a quizzy feeling about linking
curl into the git executable. I had the impression that whatever overhead
(memory & time) it adds, it might be noticable in some setup. Guess what:
Linus himself worked on avoiding to link curl to the git executable. For
performance.

Ciao,
Dscho

Erik Faye-Lund

unread,
May 25, 2011, 3:54:35 AM5/25/11
to Johannes Schindelin, karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
On Wed, May 25, 2011 at 9:16 AM, Johannes Schindelin
<Johannes....@gmx.de> wrote:
> Hi,
>
> On Tue, 24 May 2011, Erik Faye-Lund wrote:
>
>> On Tue, May 24, 2011 at 8:42 PM,  <karste...@dcon.de> wrote:
>> > - complexity -----------+------------+---------------+------------
>> > changed files vs. noenv |      2     |       6       |      4
>> > inserted lines          |     31     |     271       |    158
>> > deleted lines           |     12     |      82       |    133
>> > - performance (µs) -----+------------+---------------+------------

>> > startup                 |     22     |       0       |     36
>> > getenv                  |      3     |    3-21       |      0.06
>> > getenv total (35 calls) |    105     |     154       |      2
>> > spawn                   |     42     |      49       |     17
>>
>> Thanks, a lot, this settles all my issues with performance. If all we
>> lose is 36 µs on start-up per process for doing the full conversion,

>> then I'm not concerned.
>
> Is it really microseconds? I thought on Windows we measure milliseconds.
>

That's the metric he gave. I haven't tried to verify the numbers myself, though.

> This makes a difference: 30 calls of something that takes 36 milliseconds,
> and we're above 1 second, which is noticable.

I agree.

>> The only thing I'd like to know is how large your environment was when
>> you did this test; it certainly affects performance.
>
> I'm really worried about this, too. And not only the number of entries in
> the environment, but also the length of the entries.

I'm not as worried about that as I was agains an artificially small
environment (which didn't seem to be the case). The total length of
the environment is limited to 32768 characters (in the best case:
maximum 65536 bytes, UTF-16 encoded) on Windows, so there's a limit to
how far the performance will escalate.

> You know, a couple of years back, I had a quizzy feeling about linking
> curl into the git executable. I had the impression that whatever overhead
> (memory & time) it adds, it might be noticable in some setup. Guess what:
> Linus himself worked on avoiding to link curl to the git executable. For
> performance.

Yeah, that's my main worry here too; is this something we will regret
and have to put in a lot of time to fix later?

But if it turns out the numbers really are in microseconds, then this
is too insignificant for me to bother worrying about ;)

Johannes Schindelin

unread,
May 25, 2011, 4:03:44 AM5/25/11
to Erik Faye-Lund, karste...@dcon.de, msy...@googlegroups.com, bl...@dcon.de, ki...@mns.spb.ru, mic...@wheelycreek.net, robin.r...@gmail.com
Hi,

On Wed, 25 May 2011, Erik Faye-Lund wrote:

> On Wed, May 25, 2011 at 9:16 AM, Johannes Schindelin
> <Johannes....@gmx.de> wrote:
>
> > On Tue, 24 May 2011, Erik Faye-Lund wrote:
> >
> >> On Tue, May 24, 2011 at 8:42 PM, �<karste...@dcon.de> wrote:
> >> > - complexity -----------+------------+---------------+------------
> >> > changed files vs. noenv | � � �2 � � | � � � 6 � � � | � � �4
> >> > inserted lines � � � � �| � � 31 � � | � � 271 � � � | � �158
> >> > deleted lines � � � � � | � � 12 � � | � � �82 � � � | � �133

> >> > - performance (�s) -----+------------+---------------+------------


> >> > startup � � � � � � � � | � � 22 � � | � � � 0 � � � | � � 36
> >> > getenv � � � � � � � � �| � � �3 � � | � �3-21 � � � | � � �0.06
> >> > getenv total (35 calls) | � �105 � � | � � 154 � � � | � � �2
> >> > spawn � � � � � � � � � | � � 42 � � | � � �49 � � � | � � 17
> >>
> >> Thanks, a lot, this settles all my issues with performance. If all we

> >> lose is 36 �s on start-up per process for doing the full conversion,

> >> then I'm not concerned.
> >
> > Is it really microseconds? I thought on Windows we measure
> > milliseconds.
>
> That's the metric he gave. I haven't tried to verify the numbers myself,
> though.
>
> > This makes a difference: 30 calls of something that takes 36
> > milliseconds, and we're above 1 second, which is noticable.
>
> I agree.
>
> >> The only thing I'd like to know is how large your environment was
> >> when you did this test; it certainly affects performance.
> >
> > I'm really worried about this, too. And not only the number of entries
> > in the environment, but also the length of the entries.
>
> I'm not as worried about that as I was agains an artificially small
> environment (which didn't seem to be the case). The total length of the
> environment is limited to 32768 characters (in the best case: maximum
> 65536 bytes, UTF-16 encoded) on Windows, so there's a limit to how far
> the performance will escalate.

I did not know that! Good to know!

> > You know, a couple of years back, I had a quizzy feeling about linking
> > curl into the git executable. I had the impression that whatever
> > overhead (memory & time) it adds, it might be noticable in some setup.
> > Guess what: Linus himself worked on avoiding to link curl to the git
> > executable. For performance.
>
> Yeah, that's my main worry here too; is this something we will regret
> and have to put in a lot of time to fix later?
>
> But if it turns out the numbers really are in microseconds, then this is
> too insignificant for me to bother worrying about ;)

Right. If it's microseconds, I'll retract my worries :-)

Ciao,
Dscho

karste...@dcon.de

unread,
May 25, 2011, 7:09:26 AM5/25/11
to Johannes Schindelin, bl...@dcon.de, ki...@mns.spb.ru, Erik Faye-Lund, mic...@wheelycreek.net, msy...@googlegroups.com, robin.r...@gmail.com
Johannes Schindelin <Johannes....@gmx.de> wrote on 25.05.2011 09:16:34:
> Hi,
>
> On Tue, 24 May 2011, Erik Faye-Lund wrote:
>
> > On Tue, May 24, 2011 at 8:42 PM,  <karste...@dcon.de> wrote:
> > > - complexity -----------+------------+---------------+------------
> > > changed files vs. noenv |      2     |       6       |      4
> > > inserted lines          |     31     |     271       |    158
> > > deleted lines           |     12     |      82       |    133
> > > - performance (µs) -----+------------+---------------+------------

> > > startup                 |     22     |       0       |     36
> > > getenv                  |      3     |    3-21       |      0.06
> > > getenv total (35 calls) |    105     |     154       |      2
> > > spawn                   |     42     |      49       |     17
> >
> > Thanks, a lot, this settles all my issues with performance. If all we
> > lose is 36 µs on start-up per process for doing the full conversion,

> > then I'm not concerned.
>
> Is it really microseconds? I thought on Windows we measure milliseconds.


Yes, it is definitely microseconds, see the 1,000,000 when calculating the frequency of the performance counter:
(from http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/commitdiff/fe6d39d9da46a693554b1ff4a618a78bd97923f6)

+static double perffreq = -1;
+
+double ticks()
+{
+       LARGE_INTEGER li;
+       if (perffreq < 0) {
+               QueryPerformanceFrequency(&li);
+               perffreq = ((double) li.QuadPart) / 1000000.0;
+       }
+       QueryPerformanceCounter(&li);
+       return ((double) li.QuadPart) / perffreq;
+}

>
> This makes a difference: 30 calls of something that takes 36 milliseconds,
> and we're above 1 second, which is noticable.
>


If it was milliseconds (or even seconds), that would be more reason to switch to a sorted environment, if you consider the total cost of that git call.
For a plumbing command we need: startup + ~25 * getenv, that's 0 + ~110 wrapped vs. 36 + 1.5 full.
Git as a launcher needs: startup + ~5 * getenv + spawn, that's 0 + ~22 + 49 wrapped vs. 36 + 0.3 + 17 full.

The full version is faster in any case, no matter what unit those numbers are in.

> > The only thing I'd like to know is how large your environment was when
> > you did this test; it certainly affects performance.
>
> I'm really worried about this, too. And not only the number of entries in
> the environment, but also the length of the entries.
>


The length of the entries mainly affects the wrapped version because of the multiple string copying involved in mingw_getenv and safe_getenv. The performance impact can be seen directly from the output of the kb/unicode-v9-wrapped-perf patch:
getenv PATH: 15.3122 µs
getenv GIT_EXEC_PATH: 3.28119 µs
getenv GIT_CEILING_DIRECTORIES: 4.01034 µs
getenv GIT_DIR: 3.64576 µs
[...]
getenv HOME: 3.28119 µs
[...]
getenv GIT_TRACE: 3.28119 µs
getenv PATH: 7.6561 µs

Note that the first getenv PATH needs to copy twice because it also adds the string to the safe_getenv string pool. Getenv HOME (short string) is indistinguishable from unset variables (e.g. GIT_TRACE, GIT_EXEC_PATH).

Michael Geddes

unread,
Jun 8, 2011, 12:09:56 AM6/8/11
to msy...@googlegroups.com, karste...@dcon.de, Johannes Sixt, bl...@dcon.de, Johannes....@gmx.de, ki...@mns.spb.ru, kusm...@gmail.com, robin.r...@gmail.com
Excellent work Karsten, and nice summary.

Can somebody remind me why the unicode-v9-full implementation has not been
integrated into our development branch?

Are we waiting on something? (Was there something about reversing the order of
some of the patches?)

I guess it would certainly be nice if msys similarly implemented UTF-8
encoding (and symlink support, but that's a seperate issue).

//.ichael G.

On Wed, 25 May 2011 06:28:28 AM karste...@dcon.de wrote:
> Johannes Sixt <j...@kdbg.org> wrote on 24.05.2011 22:54:05:
> > I lost track of what the long-term goal of this series is.
> >
> > One thing that I am worried about is that with these patches git is
> > designed to talk to its environment in UTF-8. Is this correct?
>
> More precisely, the patches are designed so that git uses UTF-8 internally
> and talks to its environment in UTF-16 (which is the native encoding of
> all NT-based Windows versions).
>

Johannes Schindelin

unread,
Jun 8, 2011, 2:54:29 PM6/8/11
to Michael Geddes, msy...@googlegroups.com, karste...@dcon.de, Johannes Sixt, bl...@dcon.de, ki...@mns.spb.ru, kusm...@gmail.com, robin.r...@gmail.com
Hi Michael,

On Wed, 8 Jun 2011, Michael Geddes wrote:

> Excellent work Karsten, and nice summary.
>
> Can somebody remind me why the unicode-v9-full implementation has not
> been integrated into our development branch?
>
> Are we waiting on something? (Was there something about reversing the
> order of some of the patches?)
>
> I guess it would certainly be nice if msys similarly implemented UTF-8
> encoding (and symlink support, but that's a seperate issue).

Well, Hannes was completely opposed until MSys learnt about UTF-8. At
least as far as I understood his mails.

There was a slight uncomfortable feeling on my side with the wholesale
conversion of the environment, but that has been resolved.

So I basically wanted to wait for violent disagreement, and absent that,
merge the series (once I have time, this week I was traveling to
Switzerland, next week I will be in France, so unless somebody else steps
up to the task, the merge will have to wait another ~2 weeks)...

Ciao,
Johannes

Michael Geddes

unread,
Jun 9, 2011, 5:01:11 AM6/9/11
to msy...@googlegroups.com, Johannes Schindelin, karste...@dcon.de, Johannes Sixt, bl...@dcon.de, ki...@mns.spb.ru, kusm...@gmail.com, robin.r...@gmail.com
On Thu, 9 Jun 2011 02:54:29 AM Johannes Schindelin wrote:
> Hi Michael,
>
> On Wed, 8 Jun 2011, Michael Geddes wrote:
> > Excellent work Karsten, and nice summary.
> >
> > Can somebody remind me why the unicode-v9-full implementation has not
> > been integrated into our development branch?
> >
> > Are we waiting on something? (Was there something about reversing the
> > order of some of the patches?)
> >
> > I guess it would certainly be nice if msys similarly implemented UTF-8
> > encoding (and symlink support, but that's a seperate issue).
>
> Well, Hannes was completely opposed until MSys learnt about UTF-8. At
> least as far as I understood his mails.
Certainly that's a valid point. Any hooks are going to be of less use until
msys plays ball too.
>
> There was a slight uncomfortable feeling on my side with the wholesale
> conversion of the environment, but that has been resolved.
I'd kept half an eye out on that thread, so glad to hear you are comfortable
with it now.

>
> So I basically wanted to wait for violent disagreement, and absent that,
> merge the series (once I have time, this week I was traveling to
> Switzerland, next week I will be in France, so unless somebody else steps
> up to the task, the merge will have to wait another ~2 weeks)...
I think the merge can wait another few weeks, and I've no real requirement for
it to be done, but as I've said before, I think it is the right way to go
about it, and thus am lending it my support.

I also don't want to get back onto the symlink issue until the unicode issue
is resolved.

I guess maybe it's time to make some waves in the msys pond now?
>
> Ciao,
> Johannes

thanks.

Michael.

Kirill Smelkov

unread,
Jun 12, 2011, 4:59:34 PM6/12/11
to Karsten Blees, karste...@dcon.de, Johannes Schindelin, Michael Geddes, msy...@googlegroups.com, karste...@dcon.de, Johannes Sixt, kusm...@gmail.com, robin.r...@gmail.com
Karsten, Johannes, everyone,

Karsten, first of all HUGE THANKS for getting unicode support into shape
again!

I'm late to the party, but here is a build of latest msysgit devel +
kb/unicode-v9-full for those who want to try it out:

http://repo.or.cz/w/msysgit/kirr.git/tree/refs/heads/x/built -> Git-kirr15.exe

(Only briefly tested)


And thanks Johannes for planning to merge it! Not only there is no
violent disagreement, but to me there is a strong wish from a lot of
people to try to resolve famous multilingual issue80, because that issue
blocks msysgit adoption in a lot of countries:

imho ascii-only filenames are like ascii only _contents_ in today's world.


In my company I had to hack on it and build several versions of utf-8
enabled msysgit installer for local use so we could use Git between both
linux and windows people. But I guess doing this work would be too much
for average case.

Thanks again and Hope we'll merge it,
Kirill

karste...@dcon.de

unread,
Jun 13, 2011, 9:56:34 PM6/13/11
to Johannes Schindelin, bl...@dcon.de, Johannes Sixt, ki...@mns.spb.ru, kusm...@gmail.com, Michael Geddes, msy...@googlegroups.com, robin.r...@gmail.com

Johannes Schindelin <Johannes....@gmx.de> wrote on 08.06.2011 20:54:29:
> >
> > I guess it would certainly be nice if msys similarly implemented UTF-8
> > encoding (and symlink support, but that's a seperate issue).
>
> Well, Hannes was completely opposed until MSys learnt about UTF-8. At
> least as far as I understood his mails.
>

Hi,

I found some time to work on msys-1.0.dll Unicode support, and while far from complete, the first results look promising.

The basic idea is, again, to provide wrappers for all Win32 *A functions referenced by the dll (~50). About half of these only have outgoing (LPCSTR) parameters, these I could handle with some simple macros. Most of the rest required just a few lines of code (not always 100% compliant with MSDN documentation, but good enough for a start, I think).
The console output was a bit more tricky, as the console handler tries to keep track of the cursor position, which must now be calcualted based on wide chars, not UTF-8 bytes.
I haven't looked at console input and dynamically loaded functions (LoadLibrary/GetProcAddress) yet.

Things that work on first glance:
- build git
- run the test suite
- msys less with Unicode (mostly, scrolling right/left seems to be broken) (may be due to the outdated less v358)

Things that don't work:
- some i18n tests are broken (those that play around with mixed encodings, e.g. t3901, this can no longer work if everything is UTF-8 only)
- bash starts in "/usr" instead of "/"
- git doesn't know my name and email :-(
The latter two may be because I used msysCore-1.0.16, 1.0.11 didn't seem to compile.

As I don't know when I'll be able to spend some more time on this, here's what I got so far (patch + compiled msys-1.0.dll to play around with).


> So I basically wanted to wait for violent disagreement, and absent that,
> merge the series (once I have time, this week I was traveling to
> Switzerland, next week I will be in France, so unless somebody else steps
> up to the task, the merge will have to wait another ~2 weeks)...
>
> Ciao,
> Johannes

As much as I'd like those ~30 patches merged...when I ported the unicode conversion functions from mingw.c, the msys build warned me of some broken comparisons in the more advanced checks for invalid UTF-8.

source/winsup/cygwin/unicode.c:397: warning: comparison is always true due to limited range of data type

This is easily fixed like so:

int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
{
        const unsigned char *utf = (const unsigned char*) utfs;

So I guess I'll have to do yet another reroll of the Unicode patch series...and perhaps this warning should be enabled in the git build, too?

Have fun,
Karsten


msys-unicode-v0.7z
Reply all
Reply to author
Forward
0 new messages