[PATCH v10 00/28] Issue 80: Unicode support on Windows

Karsten Blees

unread,

Oct 6, 2011, 1:40:52 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Hi there,

I just pushed v10 of the Unicode patches for msysgit to http://repo.or.cz/w/git/mingw/4msysgit/kblees.git.

A git installer that bundles the Unicode msys.dll, less-444 and git v1.7.6 with unicode-v10 patches is available here: https://docs.google.com/uc?id=0BxXUoUg2r8ftMjZlZDBjYjktNTE2Yi00NjExLTk0MzctMTMwZjQzNDkyNmZl&export=download

So far I haven't encountered any problems with the Unicode-aware msys.dll (http://groups.google.com/group/msysgit/browse_thread/thread/981da260b4fcb1df), and there was no feedback from the mailing list either (this could be good or bad :-).

Ciao,
Karsten

Branches of interest:

kb/unicode-v9-full-2cbf7794
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v9-full-2cbf7794)
This is v9 rebased to current devel, with all conflicts resolved in favor of v9 (for reference).

kb/unicode-v10-minimal
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v10-minimal)
Unicode support with minimal environment changes.
Changes since kb/unicode-v9-minimal:
- removed patches that have already been merged
- utftowcsn: fixed some advanced checks by using unsigned char * for the UTF-8 string
- reenabled t5505
- changed tests to work with Unicode-aware msys.dll / bash

kb/unicode-v10
(http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v10)
Additional environment features. The first two of these patches I consider important fixes (memory leaks and case-sensitivity), the rest is nice to have.
Changes since kb/unicode-v9-full:
- reordered the environment patches to cleanup first, then add features
- removed unnecessary char **env parameter in some spawn functions
- bsearchenv: added case-sensitive matching introduced in "Windows: teach getenv to do a case-sensitive search" (df599e96)

Karsten Blees (28) ('+': new/rewritten in v10, '!': modified):
MSVC: disable const qualifier compiler warnings
MSVC: don't use uninitialized variables
MSVC: include <io.h> for mktemp
MSVC: fix winansi.c compile errors
MSVC: link dynamically to the CRT
git-gui: fix encoding in git-gui file browser
gitk: fix file name encoding in diff hunk headers
+ Revert "Disable test on MinGW that challenges its bash quoting"
! Win32: add Unicode conversion functions
Win32: Unicode file name support (except dirent)
Win32: Unicode file name support (dirent)
Unicode file name support (gitk and git-gui)
Win32: Unicode arguments (outgoing)
Win32: Unicode arguments (incoming)
Win32: Thread-safe windows console output
Win32: Unicode environment (outgoing)
Win32: Unicode environment (incoming)
+ (kb/unicode-v10-minimal) MinGW: disable legacy encoding tests
+ Win32: fix environment memory leaks
+ Win32: unify environment case-sensitivity
+ Win32: simplify internal mingw_spawn* APIs
Win32: move environment functions
! Win32: unify environment function names
! Win32: move environment block creation to a helper method
! Win32: don't copy the environment twice when spawning child processes
Win32: reduce environment array reallocations
! Win32: keep the environment sorted
(kb/unicode-v10) Win32: patch Windows environment on startup

And the diff between v9-full and v10...
---
$ git diff -p --stat kb/unicode-v9-full-2cbf7794..kb/unicode-v10
compat/mingw.c | 83 +++++++++++++++++++++++++++++--------------------
t/t3901-i18n-patch.sh | 19 ++++++-----
t/t4201-shortlog.sh | 6 ++--
t/t5505-remote.sh | 5 +--
t/t8005-blame-i18n.sh | 8 ++--
5 files changed, 68 insertions(+), 53 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 51cde10..77ea5d9 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -181,7 +181,7 @@ static int ask_yes_no_if_possible(const char *format, ...)
vsnprintf(question, sizeof(question), format, args);
va_end(args);

- if ((retry_hook[0] = mingw_getenv("GIT_ASK_YESNO"))) {
+ if ((retry_hook[0] = getenv("GIT_ASK_YESNO"))) {
retry_hook[1] = question;
return !run_command_v_opt(retry_hook, 0);
}
@@ -761,18 +761,41 @@ static int compareenv(const void *v1, const void *v2)

static int bsearchenv(char **env, const char *name, size_t size)
{
- unsigned low = 0, high = size - 1;
+ unsigned low = 0, high = size - 1, mid, i, j, nmln;
+ int cmp;
while (low <= high) {
- unsigned mid = (low + high) >> 1;
- int cmp = compareenv(&env[mid], &name);
+ mid = (low + high) >> 1;
+ cmp = compareenv(&env[mid], &name);
if (cmp < 0)
low = mid + 1;
else if (cmp > 0)
high = mid - 1;
else
- return mid; /* found */
+ goto found;
}
return ~low; /* not found, return 1's complement of insert position */
+
+found:
+ /* look for duplicate entries before and after the matching entry (mid) */
+ i = j = mid;
+ while (i > low && !compareenv(&env[i - 1], &name))
+ i--;
+ while (j < high && !compareenv(&env[j + 1], &name))
+ j++;
+ if (i == j)
+ return mid; /* no duplicates */
+
+ /*
+ * try case-sensitive match first, this is necessary for sh-i18n--envsubst
+ * to work with environment variables that differ only in case (e.g. $PATH
+ * and $path)
+ */
+ nmln = strlen(name);
+ for (; i <= j; i++)
+ if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])
+ /* matches */
+ return i;
+ return mid;
}

/*
@@ -925,7 +948,7 @@ static const char *parse_interpreter(const char *cmd)
*/
static char **get_path_split(void)
{
- char *p, **path, *envpath = mingw_getenv("PATH");
+ char *p, **path, *envpath = getenv("PATH");
int i, n = 0;

if (!envpath || !*envpath)
@@ -1012,13 +1035,13 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
* Create environment block suitable for CreateProcess. Merges current
* process environment and the supplied environment changes.
*/
-static wchar_t *make_environment_block(char **env)
+static wchar_t *make_environment_block(char **deltaenv)
{
wchar_t *wenvblk = NULL;
char **tmpenv;
int i = 0, size = environ_size, envblksz = 0, envblkpos = 0;

- while (env && env[i])
+ while (deltaenv && deltaenv[i])
i++;

/* copy the environment, leaving space for changes */
@@ -1026,8 +1049,8 @@ static wchar_t *make_environment_block(char **env)
memcpy(tmpenv, environ, size * sizeof(char*));

/* merge supplied environment changes into the temporary environment */
- for (i = 0; env && env[i]; i++)
- size = do_putenv(tmpenv, env[i], size, 0);
+ for (i = 0; deltaenv && deltaenv[i]; i++)
+ size = do_putenv(tmpenv, deltaenv[i], size, 0);

/* create environment block from temporary environment */
for (i = 0; tmpenv[i]; i++) {
@@ -1049,7 +1072,7 @@ struct pinfo_t {
struct pinfo_t *pinfo = NULL;
CRITICAL_SECTION pinfo_cs;

-static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
+static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaenv,
const char *dir,
int prepend_cmd, int fhin, int fhout, int fherr)
{
@@ -1117,10 +1140,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
utftowcs(wargs, args.buf, 2 * args.len + 1);
strbuf_release(&args);

- if (env == environ)
- env = NULL;
-
- wenvblk = make_environment_block(env);
+ wenvblk = make_environment_block(deltaenv);

memset(&pi, 0, sizeof(pi));
ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,
@@ -1156,13 +1176,12 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
return (pid_t)pi.dwProcessId;
}

-static pid_t mingw_spawnve(const char *cmd, const char **argv, char **env,
- int prepend_cmd)
+static pid_t mingw_spawnv(const char *cmd, const char **argv, int prepend_cmd)
{
- return mingw_spawnve_fd(cmd, argv, env, NULL, prepend_cmd, 0, 1, 2);
+ return mingw_spawnve_fd(cmd, argv, NULL, NULL, prepend_cmd, 0, 1, 2);
}

-pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,
+pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **deltaenv,
const char *dir,
int fhin, int fhout, int fherr)
{
@@ -1186,14 +1205,14 @@ pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,
pid = -1;
}
else {
- pid = mingw_spawnve_fd(iprog, argv, env, dir, 1,
+ pid = mingw_spawnve_fd(iprog, argv, deltaenv, dir, 1,
fhin, fhout, fherr);
free(iprog);
}
argv[0] = argv0;
}
else
- pid = mingw_spawnve_fd(prog, argv, env, dir, 0,
+ pid = mingw_spawnve_fd(prog, argv, deltaenv, dir, 0,
fhin, fhout, fherr);
free(prog);
}
@@ -1201,7 +1220,7 @@ pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,
return pid;
}

-static int try_shell_exec(const char *cmd, char *const *argv, char **env)
+static int try_shell_exec(const char *cmd, char *const *argv)
{
const char *interpr = parse_interpreter(cmd);
char **path;
@@ -1219,7 +1238,7 @@ static int try_shell_exec(const char *cmd, char *const *argv, char **env)
argv2 = xmalloc(sizeof(*argv) * (argc+1));
argv2[0] = (char *)cmd; /* full path to the script file */
memcpy(&argv2[1], &argv[1], sizeof(*argv) * argc);
- pid = mingw_spawnve(prog, argv2, env, 1);
+ pid = mingw_spawnv(prog, argv2, 1);
if (pid >= 0) {
int status;
if (waitpid(pid, &status, 0) < 0)
@@ -1234,13 +1253,13 @@ static int try_shell_exec(const char *cmd, char *const *argv, char **env)
return pid;
}

-static void mingw_execve(const char *cmd, char *const *argv, char *const *env)
+void mingw_execv(const char *cmd, char *const *argv)
{
/* check if git_command is a shell script */
- if (!try_shell_exec(cmd, argv, (char **)env)) {
+ if (!try_shell_exec(cmd, argv)) {
int pid, status;

- pid = mingw_spawnve(cmd, (const char **)argv, (char **)env, 0);
+ pid = mingw_spawnv(cmd, (const char **)argv, 0);
if (pid < 0)
return;
if (waitpid(pid, &status, 0) < 0)
@@ -1255,7 +1274,7 @@ void mingw_execvp(const char *cmd, char *const *argv)
char *prog = path_lookup(cmd, path, 0);

if (prog) {
- mingw_execve(prog, argv, environ);
+ mingw_execv(prog, argv);
free(prog);
} else
errno = ENOENT;
@@ -1263,11 +1282,6 @@ void mingw_execvp(const char *cmd, char *const *argv)
free_path_split(path);
}

-void mingw_execv(const char *cmd, char *const *argv)
-{
- mingw_execve(cmd, argv, environ);
-}
-
int mingw_kill(pid_t pid, int sig)
{
if (pid > 0 && sig == SIGTERM) {
@@ -1952,9 +1966,10 @@ int mingw_offset_1st_component(const char *path)
return offset + is_dir_sep(path[offset]);
}

-int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen)
+int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
{
int upos = 0, wpos = 0;
+ const unsigned char *utf = (const unsigned char*) utfs;
if (!utf || !wcs || wcslen < 1) {
errno = EINVAL;
return -1;
@@ -1965,7 +1980,7 @@ int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen)
utflen = INT_MAX;

while (upos < utflen) {
- int c = ((int) utf[upos++]) & 0xff;
+ int c = utf[upos++] & 0xff;
if (utflen == INT_MAX && c == 0)
break;

diff --git a/t/t3901-i18n-patch.sh b/t/t3901-i18n-patch.sh
index 31a5770..55c8a2f 100755
--- a/t/t3901-i18n-patch.sh
+++ b/t/t3901-i18n-patch.sh
@@ -54,10 +54,13 @@ test_expect_success setup '
git add yours &&
git commit -s -m "Second on side" &&

- # the second one on the side branch is ISO-8859-1
- git config i18n.commitencoding ISO8859-1 &&
- # use author and committer name in ISO-8859-1 to match it.
- . "$TEST_DIRECTORY"/t3901-8859-1.txt &&
+ if test_have_prereq NOT_MINGW
+ then
+ # the second one on the side branch is ISO-8859-1
+ git config i18n.commitencoding ISO8859-1 &&
+ # use author and committer name in ISO-8859-1 to match it.
+ . "$TEST_DIRECTORY"/t3901-8859-1.txt
+ fi &&
test_tick &&
echo Yet another >theirs &&
git add theirs &&
@@ -119,7 +122,7 @@ test_expect_success 'rebase (U/L)' '
check_encoding 2
'

-test_expect_success 'rebase (L/L)' '
+test_expect_success NOT_MINGW 'rebase (L/L)' '
# In this test we want ISO-8859-1 encoded commits as the result
git config i18n.commitencoding ISO8859-1 &&
git config i18n.logoutputencoding ISO8859-1 &&
@@ -131,7 +134,7 @@ test_expect_success 'rebase (L/L)' '
check_encoding 2 8859
'

-test_expect_success 'rebase (L/U)' '
+test_expect_success NOT_MINGW 'rebase (L/U)' '
# This is pathological -- use UTF-8 as intermediate form
# to get ISO-8859-1 results.
git config i18n.commitencoding ISO8859-1 &&
@@ -159,7 +162,7 @@ test_expect_success 'cherry-pick(U/U)' '
check_encoding 3
'

-test_expect_success 'cherry-pick(L/L)' '
+test_expect_success NOT_MINGW 'cherry-pick(L/L)' '
# Both the commitencoding and logoutputencoding is set to ISO-8859-1

git config i18n.commitencoding ISO8859-1 &&
@@ -189,7 +192,7 @@ test_expect_success 'cherry-pick(U/L)' '
check_encoding 3
'

-test_expect_success 'cherry-pick(L/U)' '
+test_expect_success NOT_MINGW 'cherry-pick(L/U)' '
# Again, the commitencoding is set to ISO-8859-1 but
# logoutputencoding is set to UTF-8.

diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 6872ba1..4896381 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -93,7 +93,7 @@ test_expect_success 'output from user-defined format is re-wrapped' '
test_cmp expect log.predictable
'

-test_expect_success 'shortlog wrapping' '
+test_expect_success NOT_MINGW 'shortlog wrapping' '
cat >expect <<\EOF &&
A U Thor (5):
Test
@@ -114,7 +114,7 @@ EOF
test_cmp expect out
'

-test_expect_success 'shortlog from non-git directory' '
+test_expect_success NOT_MINGW 'shortlog from non-git directory' '
git log HEAD >log &&
GIT_DIR=non-existing git shortlog -w <log >out &&
test_cmp expect out
@@ -135,7 +135,7 @@ $DSCHO (2):

EOF

-test_expect_success 'shortlog encoding' '
+test_expect_success NOT_MINGW 'shortlog encoding' '
git reset --hard "$commit" &&
git config --unset i18n.commitencoding &&
echo 2 > a1 &&
diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index fae1200..0d0222e 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -915,10 +915,7 @@ test_expect_success 'remote set-url --add bbb' '
'

test_expect_success 'remote set-url --delete .*' '
- if test_have_prereq NOT_MINGW
- then
- test_must_fail git remote set-url --delete someremote .\*
- fi &&
+ test_must_fail git remote set-url --delete someremote .\* &&
echo "YYY" >expect &&
echo baz >>expect &&
echo bbb >>expect &&
diff --git a/t/t8005-blame-i18n.sh b/t/t8005-blame-i18n.sh
index cb39055..a6e73d0 100755
--- a/t/t8005-blame-i18n.sh
+++ b/t/t8005-blame-i18n.sh
@@ -33,7 +33,7 @@ author $SJIS_NAME
summary $SJIS_MSG
EOF

-test_expect_success \
+test_expect_success NOT_MINGW \
'blame respects i18n.commitencoding' '
git blame --incremental file | \
egrep "^(author|summary) " > actual &&
@@ -49,7 +49,7 @@ author $EUC_JAPAN_NAME
summary $EUC_JAPAN_MSG
EOF

-test_expect_success \
+test_expect_success NOT_MINGW \
'blame respects i18n.logoutputencoding' '
git config i18n.logoutputencoding eucJP &&
git blame --incremental file | \
@@ -66,7 +66,7 @@ author $UTF8_NAME
summary $UTF8_MSG
EOF

-test_expect_success \
+test_expect_success NOT_MINGW \
'blame respects --encoding=UTF-8' '
git blame --incremental --encoding=UTF-8 file | \
egrep "^(author|summary) " > actual &&
@@ -82,7 +82,7 @@ author $UTF8_NAME
summary $UTF8_MSG
EOF

-test_expect_success \
+test_expect_success NOT_MINGW \
'blame respects --encoding=none' '
git blame --incremental --encoding=none file | \
egrep "^(author|summary) " > actual &&
---

Karsten Blees

unread,

Oct 6, 2011, 1:40:54 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Git code sometimes uses uninitialized variables to initialize themselves
(i.e. 'type varname = varname;'). According to some commit messages, this
is supposedly to suppress the 'using uninitialized variable' warning
(which is quite paradox). GCC currently doesn't seem to catch this, but
MSVC does. Initialize with some arbitrary value instead to silence the
compilers.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

builtin/cat-file.c | 2 +-
builtin/fast-export.c | 2 +-
builtin/rev-list.c | 2 +-

transport.c | 2 +-
wt-status.c | 2 +-

10 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 07bd984..8a6bb24 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -168,7 +168,7 @@ static int batch_one_object(const char *obj_name, int print_contents)
unsigned char sha1[20];
enum object_type type = 0;
unsigned long size;
- void *contents = contents;
+ void *contents = NULL;

if (!obj_name)
return 1;
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9836e6b..c7ce01c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -481,7 +481,7 @@ static void get_tags_and_duplicates(struct object_array *pending,
for (i = 0; i < pending->nr; i++) {
struct object_array_entry *e = pending->objects + i;
unsigned char sha1[20];
- struct commit *commit = commit;
+ struct commit *commit = NULL;
char *full_name;

if (dwim_ref(e->name, strlen(e->name), sha1, &full_name) != 1)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 56727e8..9a6073f 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -394,7 +394,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
mark_edges_uninteresting(revs.commits, &revs, show_edge);

if (bisect_list) {
- int reaches = reaches, all = all;
+ int reaches = 0, all = 0;

revs.commits = find_bisection(revs.commits, &reaches, &all,
bisect_find_all);
diff --git a/fast-import.c b/fast-import.c
index 7cc2262..4406f1f 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2183,7 +2183,7 @@ static void file_change_m(struct branch *b)
const char *p = command_buf.buf + 2;
static struct strbuf uq = STRBUF_INIT;
const char *endp;
- struct object_entry *oe = oe;
+ struct object_entry *oe = NULL;
unsigned char sha1[20];
uint16_t mode, inline_data = 0;

@@ -2352,7 +2352,7 @@ static void note_change_n(struct branch *b, unsigned char old_fanout)
{
const char *p = command_buf.buf + 2;
static struct strbuf uq = STRBUF_INIT;
- struct object_entry *oe = oe;
+ struct object_entry *oe = NULL;
struct branch *s;
unsigned char sha1[20], commit_sha1[20];
char path[60];
@@ -2512,7 +2512,7 @@ static int parse_from(struct branch *b)

static struct hash_list *parse_merge(unsigned int *count)
{
- struct hash_list *list = NULL, *n, *e = e;
+ struct hash_list *list = NULL, *n, *e = NULL;
const char *from;
struct branch *s;

@@ -2806,7 +2806,7 @@ static void cat_blob(struct object_entry *oe, unsigned char sha1[20])
static void parse_cat_blob(void)
{
const char *p;
- struct object_entry *oe = oe;
+ struct object_entry *oe = NULL;
unsigned char sha1[20];

/* cat-blob SP <object> LF */
diff --git a/match-trees.c b/match-trees.c
index 26f7ed1..de050d0 100644
--- a/match-trees.c
+++ b/match-trees.c
@@ -72,12 +72,12 @@ static int score_trees(const unsigned char *hash1, const unsigned char *hash2)
die("%s is not a tree", sha1_to_hex(hash2));
init_tree_desc(&two, two_buf, size);
while (one.size | two.size) {
- const unsigned char *elem1 = elem1;
- const unsigned char *elem2 = elem2;
- const char *path1 = path1;
- const char *path2 = path2;
- unsigned mode1 = mode1;
- unsigned mode2 = mode2;
+ const unsigned char *elem1 = NULL;
+ const unsigned char *elem2 = NULL;
+ const char *path1 = NULL;
+ const char *path2 = NULL;
+ unsigned mode1 = 0;
+ unsigned mode2 = 0;
int cmp;

if (one.size)
diff --git a/merge-recursive.c b/merge-recursive.c
index 0cc1e6f..542f61d 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1579,7 +1579,7 @@ int merge_recursive(struct merge_options *o,
{
struct commit_list *iter;
struct commit *merged_common_ancestors;
- struct tree *mrtree = mrtree;
+ struct tree *mrtree = NULL;
int clean;

if (show(o, 4)) {
diff --git a/run-command.c b/run-command.c
index a2796c4..a8be1b1 100644
--- a/run-command.c
+++ b/run-command.c
@@ -137,7 +137,7 @@ int start_command(struct child_process *cmd)
{
int need_in, need_out, need_err;
int fdin[2], fdout[2], fderr[2];
- int failed_errno = failed_errno;
+ int failed_errno = 0;

/*
* In case of errors we must keep the promise to close FDs
diff --git a/submodule.c b/submodule.c
index 1ba9646..71a0508 100644
--- a/submodule.c
+++ b/submodule.c
@@ -253,7 +253,7 @@ void show_submodule_summary(FILE *f, const char *path,
const char *del, const char *add, const char *reset)
{
struct rev_info rev;
- struct commit *left = left, *right = right;
+ struct commit *left = NULL, *right = NULL;
const char *message = NULL;
struct strbuf sb = STRBUF_INIT;
int fast_forward = 0, fast_backward = 0;
diff --git a/transport.c b/transport.c
index 98c5778..be1538c 100644
--- a/transport.c
+++ b/transport.c
@@ -104,7 +104,7 @@ static void insert_packed_refs(const char *packed_refs, struct ref **list)
return;

for (;;) {
- int cmp = cmp, len;
+ int cmp = 0, len;

if (!fgets(buffer, sizeof(buffer), f)) {
fclose(f);
diff --git a/wt-status.c b/wt-status.c
index 0237772..eba4d31 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -227,7 +227,7 @@ static void wt_status_print_change_data(struct wt_status *s,
{
struct wt_status_change_data *d = it->util;
const char *c = color(change_type, s);
- int status = status;
+ int status = 0;
char *one_name;
char *two_name;
const char *one, *two;
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:55 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Include io.h from unistd.h, as in MinGW, so that mktemp is available.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/vcbuild/include/unistd.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/compat/vcbuild/include/unistd.h b/compat/vcbuild/include/unistd.h
index bde4d05..3ca5c57 100644
--- a/compat/vcbuild/include/unistd.h
+++ b/compat/vcbuild/include/unistd.h
@@ -3,6 +3,8 @@

/* Win32 define for porting git*/

+#include <io.h>
+
/* disable const** -> void* warning (~130 warnings for realloc etc.) */
#pragma warning(disable : 4090)

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:53 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

MSVC warns about implicit casts from const** to void* (which should be OK,
as void* indicates the content is treated as opaque). E.g.

const char *x;
realloc(x, ...); // warning C4090 due to const** -> void*

As there's not that much MSVC-only code, the 'real' const to non-const
casts that we want to know about will be caught by GCC builds anyway, so
disable this warning.

MSVC link.exe doesn't recognize the /wd option in CFLAGS in the Makefile,
so use #pragma warning in central, MSVC-specific unistd.h.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/vcbuild/include/unistd.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/compat/vcbuild/include/unistd.h b/compat/vcbuild/include/unistd.h
index b14fcf9..bde4d05 100644
--- a/compat/vcbuild/include/unistd.h
+++ b/compat/vcbuild/include/unistd.h
@@ -3,6 +3,9 @@

/* Win32 define for porting git*/

+/* disable const** -> void* warning (~130 warnings for realloc etc.) */
+#pragma warning(disable : 4090)
+
#ifndef _MODE_T_
#define _MODE_T_
typedef unsigned short _mode_t;
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:56 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Some constants (such as LF_FACESIZE) are undefined with -DNOGDI (set in the
Makefile), and CONSOLE_FONT_INFOEX is available in MSVC, but not in MinGW.

Cast FARPROC to PGETCURRENTCONSOLEFONTEX to suppress MSVC compiler warning.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/winansi.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/compat/winansi.c b/compat/winansi.c
index a5ca2d9..bf514f9 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -2,6 +2,7 @@
* Copyright 2008 Peter Harris <g...@peter.is-a-geek.org>
*/

+#undef NOGDI
#include "../git-compat-util.h"
#include <malloc.h>
#include <wingdi.h>
@@ -30,6 +31,7 @@ static int negative;
static FILE *last_stream = NULL;
static int non_ascii_used = 0;

+#ifdef __MINGW32__
typedef struct _CONSOLE_FONT_INFOEX {
ULONG cbSize;
DWORD nFont;
@@ -38,6 +40,7 @@ typedef struct _CONSOLE_FONT_INFOEX {
UINT FontWeight;
WCHAR FaceName[LF_FACESIZE];
} CONSOLE_FONT_INFOEX, *PCONSOLE_FONT_INFOEX;
+#endif

typedef BOOL (WINAPI *PGETCURRENTCONSOLEFONTEX)(HANDLE, BOOL,
PCONSOLE_FONT_INFOEX);
@@ -52,8 +55,8 @@ static void warn_if_raster_font(void)
return;

/* GetCurrentConsoleFontEx is available since Vista */
- pGetCurrentConsoleFontEx = GetProcAddress(GetModuleHandle("kernel32.dll"),
- "GetCurrentConsoleFontEx");
+ pGetCurrentConsoleFontEx = (PGETCURRENTCONSOLEFONTEX) GetProcAddress(
+ GetModuleHandle("kernel32.dll"), "GetCurrentConsoleFontEx");
if (pGetCurrentConsoleFontEx) {
CONSOLE_FONT_INFOEX cfi;
cfi.cbSize = sizeof(cfi);
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:01 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Add Unicode conversion functions to convert between Windows native UTF-16LE
encoding to UTF-8 and back.

To support repositories with legacy-encoded file names, the UTF-8 to UTF-16
conversion function tries to create valid, unique file names even for
invalid UTF-8 byte sequences, so that these repositories can be checked out
without error.

The current implementation leaves invalid UTF-8 bytes in range 0xa0 - 0xff
as is (producing printable Unicode chars \u00a0 - \u00ff, equivalent to
ISO-8859-1), and converts 0x80 - 0x9f to hex-code (\u0080 - \u009f are
control chars).

The Windows MultiByteToWideChar API was not used as it either drops invalid
UTF-8 sequences (on Win2k/XP; producing non-unique or even empty file
names) or converts them to the replacement char \ufffd (Vista/7; causing
ERROR_INVALID_NAME in subsequent calls to file system APIs).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
compat/mingw.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 162 insertions(+), 0 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 57d41dd..ba7ac27 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1869,6 +1869,91 @@ int mingw_offset_1st_component(const char *path)

return offset + is_dir_sep(path[offset]);
}

+int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)

+{
+ int upos = 0, wpos = 0;

+ const unsigned char *utf = (const unsigned char*) utfs;

+ if (!utf || !wcs || wcslen < 1) {
+ errno = EINVAL;
+ return -1;
+ }
+ /* reserve space for \0 */
+ wcslen--;
+ if (utflen < 0)
+ utflen = INT_MAX;
+
+ while (upos < utflen) {

+ int c = utf[upos++] & 0xff;

+ if (utflen == INT_MAX && c == 0)
+ break;
+
+ if (wpos >= wcslen) {
+ wcs[wpos] = 0;
+ errno = ENAMETOOLONG;
+ return -1;
+ }
+
+ if (c < 0x80) {
+ /* ASCII */
+ wcs[wpos++] = c;
+ } else if (c >= 0xc2 && c < 0xe0 && upos < utflen &&
+ (utf[upos] & 0xc0) == 0x80) {
+ /* 2-byte utf-8 */
+ c = ((c & 0x1f) << 6);
+ c |= (utf[upos++] & 0x3f);
+ wcs[wpos++] = c;
+ } else if (c >= 0xe0 && c < 0xf0 && upos + 1 < utflen &&
+ !(c == 0xe0 && utf[upos] < 0xa0) && /* over-long encoding */
+ (utf[upos] & 0xc0) == 0x80 &&
+ (utf[upos + 1] & 0xc0) == 0x80) {
+ /* 3-byte utf-8 */
+ c = ((c & 0x0f) << 12);
+ c |= ((utf[upos++] & 0x3f) << 6);
+ c |= (utf[upos++] & 0x3f);
+ wcs[wpos++] = c;
+ } else if (c >= 0xf0 && c < 0xf5 && upos + 2 < utflen &&
+ wpos + 1 < wcslen &&
+ !(c == 0xf0 && utf[upos] < 0x90) && /* over-long encoding */
+ !(c == 0xf4 && utf[upos] >= 0x90) && /* > \u10ffff */
+ (utf[upos] & 0xc0) == 0x80 &&
+ (utf[upos + 1] & 0xc0) == 0x80 &&
+ (utf[upos + 2] & 0xc0) == 0x80) {
+ /* 4-byte utf-8: convert to \ud8xx \udcxx surrogate pair */
+ c = ((c & 0x07) << 18);
+ c |= ((utf[upos++] & 0x3f) << 12);
+ c |= ((utf[upos++] & 0x3f) << 6);
+ c |= (utf[upos++] & 0x3f);
+ c -= 0x10000;
+ wcs[wpos++] = 0xd800 | (c >> 10);
+ wcs[wpos++] = 0xdc00 | (c & 0x3ff);
+ } else if (c >= 0xa0) {
+ /* invalid utf-8 byte, printable unicode char: convert 1:1 */
+ wcs[wpos++] = c;
+ } else {
+ /* invalid utf-8 byte, non-printable unicode: convert to hex */
+ static const char *hex = "0123456789abcdef";
+ wcs[wpos++] = hex[c >> 4];
+ if (wpos < wcslen)
+ wcs[wpos++] = hex[c & 0x0f];
+ }
+ }
+ wcs[wpos] = 0;
+ return wpos;
+}
+
+int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
+{
+ if (!wcs || !utf || utflen < 1) {
+ errno = EINVAL;
+ return -1;
+ }
+ utflen = WideCharToMultiByte(CP_UTF8, 0, wcs, -1, utf, utflen, NULL, NULL);
+ if (utflen)
+ return utflen - 1;
+ errno = ENAMETOOLONG;
+ return -1;
+}
+
/*
* Disable MSVCRT command line wildcard expansion (__getmainargs called from
* mingw startup code, see init.c in mingw runtime).
diff --git a/compat/mingw.h b/compat/mingw.h
index 3aa80bb..b4ff6cd 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -326,6 +326,83 @@ void mingw_mark_as_git_dir(const char *dir);
char **make_augmented_environ(const char *const *vars);
void free_environ(char **env);

+/**
+ * Converts UTF-8 encoded string to UTF-16LE.
+ *
+ * To support repositories with legacy-encoded file names, invalid UTF-8 bytes
+ * 0xa0 - 0xff are converted to corresponding printable Unicode chars \u00a0 -
+ * \u00ff, and invalid UTF-8 bytes 0x80 - 0x9f (which would make non-printable
+ * Unicode) are converted to hex-code.
+ *
+ * Lead-bytes not followed by an appropriate number of trail-bytes, over-long
+ * encodings and 4-byte encodings > \u10ffff are detected as invalid UTF-8.
+ *
+ * Maximum space requirement for the target buffer is two wide chars per UTF-8
+ * char (((strlen(utf) * 2) + 1) [* sizeof(wchar_t)]).
+ *
+ * The maximum space is needed only if the entire input string consists of
+ * invalid UTF-8 bytes in range 0x80-0x9f, as per the following table:
+ *
+ * | | UTF-8 | UTF-16 |
+ * Code point | UTF-8 sequence | bytes | words | ratio
+ * --------------+-------------------+-------+--------+-------
+ * 000000-00007f | 0-7f | 1 | 1 | 1
+ * 000080-0007ff | c2-df + 80-bf | 2 | 1 | 0.5
+ * 000800-00ffff | e0-ef + 2 * 80-bf | 3 | 1 | 0.33
+ * 010000-10ffff | f0-f4 + 3 * 80-bf | 4 | 2 (a) | 0.5
+ * invalid | 80-9f | 1 | 2 (b) | 2
+ * invalid | a0-ff | 1 | 1 | 1
+ *
+ * (a) encoded as UTF-16 surrogate pair
+ * (b) encoded as two hex digits
+ *
+ * Note that, while the UTF-8 encoding scheme can be extended to 5-byte, 6-byte
+ * or even indefinite-byte sequences, the largest valid code point \u10ffff
+ * encodes as only 4 UTF-8 bytes.
+ *
+ * @param wcs wide char target buffer
+ * @param utf string to convert
+ * @param wcslen size of target buffer (in wchar_t's)
+ * @param utflen size of string to convert, or -1 if 0-terminated
+ * @return length of converted string (_wcslen(wcs)), or -1 on failure (errno
+ * is set to EINVAL or ENAMETOOLONG)
+ * @see mbstowcs, MultiByteToWideChar
+ */
+int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
+static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
+{
+ return utftowcsn(wcs, utf, wcslen, -1);
+}
+
+/**
+ * Converts UTF-16LE encoded string to UTF-8.
+ *
+ * Maximum space requirement for the target buffer is three UTF-8 chars per
+ * wide char ((_wcslen(wcs) * 3) + 1).
+ *
+ * The maximum space is needed only if the entire input string consists of
+ * UTF-16 words in range 0x0800-0xd7ff or 0xe000-0xffff (i.e. \u0800-\uffff
+ * modulo surrogate pairs), as per the following table:
+ *
+ * | | UTF-16 | UTF-8 |
+ * Code point | UTF-16 sequence | words | bytes | ratio
+ * --------------+-----------------------+--------+-------+-------
+ * 000000-00007f | 0000-007f | 1 | 1 | 1
+ * 000080-0007ff | 0080-07ff | 1 | 2 | 2
+ * 000800-00ffff | 0800-d7ff / e000-ffff | 1 | 3 | 3
+ * 010000-10ffff | d800-dbff + dc00-dfff | 2 | 4 | 2
+ *
+ * Note that invalid code points > 10ffff cannot be represented in UTF-16.
+ *
+ * @param utf target buffer
+ * @param wcs wide string to convert
+ * @param utflen size of target buffer
+ * @return length of converted string, or -1 on failure (errno is set to EINVAL
+ * or ENAMETOOLONG)
+ * @see wcstombs, WideCharToMultiByte
+ */
+int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen);
+
/*
* A replacement of main() that adds win32 specific initialization.
*/
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:00 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

This reverts commit 8117ed6a2205c17823910749cf0391ad811bdfa9.

With "MinGW: disable CRT command line globbing" (a05e9a86), wildcard
expansion works as expected (handled by git, not the CRT).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

t/t5505-remote.sh | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index fae1200..0d0222e 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -915,10 +915,7 @@ test_expect_success 'remote set-url --add bbb' '
'

test_expect_success 'remote set-url --delete .*' '
- if test_have_prereq NOT_MINGW
- then
- test_must_fail git remote set-url --delete someremote .\*
- fi &&
+ test_must_fail git remote set-url --delete someremote .\* &&
echo "YYY" >expect &&
echo baz >>expect &&
echo bbb >>expect &&

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:57 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Dynamic linking is generally preferred over static linking, and MSVCRT.dll
has been integral part of Windows for a long time.

This also fixes linker warnings for _malloc and _free in zlib.lib, which
seems to be compiled for MSVCRT.dll already.

The DLL version also exports some of the CRT initialization functions,
which are hidden in the static libcmt.lib (e.g. __wgetmainargs, required by
subsequent Unicode patches).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

Makefile | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 234d0af..4594d81 100644
--- a/Makefile
+++ b/Makefile
@@ -1133,16 +1133,16 @@ ifeq ($(uname_S),Windows)
compat/win32/pthread.o compat/win32/syslog.o \
compat/win32/sys/poll.o compat/win32/dirent.o
COMPAT_CFLAGS = -D__USE_MINGW_ACCESS -DNOGDI -DHAVE_STRING_H -DHAVE_ALLOCA_H -Icompat -Icompat/regex -Icompat/win32 -DSTRIP_EXTENSION=\".exe\"
- BASIC_LDFLAGS = -IGNORE:4217 -IGNORE:4049 -NOLOGO -SUBSYSTEM:CONSOLE -NODEFAULTLIB:MSVCRT.lib
+ BASIC_LDFLAGS = -IGNORE:4217 -IGNORE:4049 -NOLOGO -SUBSYSTEM:CONSOLE
EXTLIBS = user32.lib advapi32.lib shell32.lib wininet.lib ws2_32.lib
PTHREAD_LIBS =
lib =
ifndef DEBUG
- BASIC_CFLAGS += -GL -Os -MT
+ BASIC_CFLAGS += -GL -Os -MD
BASIC_LDFLAGS += -LTCG
AR += -LTCG
else
- BASIC_CFLAGS += -Zi -MTd
+ BASIC_CFLAGS += -Zi -MDd
endif
X = .exe
endif
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:59 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Decode file names from system encoding in all diff hunk header lines, not
just the first (i.e. print nice file names in 'rename from' / 'rename to' /
'Binary files' lines, too).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

gitk-git/gitk | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gitk-git/gitk b/gitk-git/gitk
index 2a92e20..935ed5c 100755
--- a/gitk-git/gitk
+++ b/gitk-git/gitk
@@ -7778,6 +7778,7 @@ proc getblobdiffline {bdf ids} {
set diffinhdr 0
continue
}
+ set line [encoding convertfrom $line]
$ctext insert end "$line\n" filesep

} else {
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:40:58 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Assume git tree objects (i.e. output of git-ls-tree) are encoded in system
encoding, for display in the git-gui file browser.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

git-gui/lib/browser.tcl | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-gui/lib/browser.tcl b/git-gui/lib/browser.tcl
index a8c6223..d68b1f9 100644
--- a/git-gui/lib/browser.tcl
+++ b/git-gui/lib/browser.tcl
@@ -191,7 +191,7 @@ method _ls {tree_id {name {}}} {
$w conf -state disabled

set fd [git_read ls-tree -z $tree_id]
- fconfigure $fd -blocking 0 -translation binary -encoding binary
+ fconfigure $fd -blocking 0 -translation binary
fileevent $fd readable [cb _read $fd]
}

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:03 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Changes opendir/readdir to use Windows Unicode APIs and convert between
UTF-8/UTF-16.

Removes parameter checks that are already covered by utftowcs.

Increases the size of dirent.d_name to accommodate the full
WIN32_FIND_DATA.cFileName converted to UTF-8 (UTF-16 to UTF-8 conversion
may grow by factor three in the worst case).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/win32/dirent.c | 31 +++++++++++--------------------
compat/win32/dirent.h | 2 +-
2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/compat/win32/dirent.c b/compat/win32/dirent.c
index 82a515c..37f56b7 100644
--- a/compat/win32/dirent.c
+++ b/compat/win32/dirent.c
@@ -6,10 +6,10 @@ struct DIR {
int dd_stat; /* 0-based index */
};

-static inline void finddata2dirent(struct dirent *ent, WIN32_FIND_DATAA *fdata)
+static inline void finddata2dirent(struct dirent *ent, WIN32_FIND_DATAW *fdata)
{
- /* copy file name from WIN32_FIND_DATA to dirent */
- memcpy(ent->d_name, fdata->cFileName, sizeof(ent->d_name));
+ /* convert UTF-16 name to UTF-8 */
+ wcstoutf(ent->d_name, fdata->cFileName, sizeof(ent->d_name));

/* Set file type, based on WIN32_FIND_DATA */
if (fdata->dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
@@ -20,25 +20,16 @@ static inline void finddata2dirent(struct dirent *ent, WIN32_FIND_DATAA *fdata)

DIR *opendir(const char *name)
{
- char pattern[MAX_PATH];
- WIN32_FIND_DATAA fdata;
+ wchar_t pattern[MAX_PATH];
+ WIN32_FIND_DATAW fdata;
HANDLE h;
int len;
DIR *dir;

- /* check that name is not NULL */
- if (!name) {
- errno = EINVAL;
+ /* convert name to UTF-16, check length (-2 for '/' '*') */
+ len = utftowcs(pattern, name, MAX_PATH - 2);
+ if (len < 0)
return NULL;
- }
- /* check that the pattern won't be too long for FindFirstFileA */
- len = strlen(name);
- if (len + 2 >= MAX_PATH) {
- errno = ENAMETOOLONG;
- return NULL;
- }
- /* copy name to temp buffer */
- memcpy(pattern, name, len + 1);

/* append optional '/' and wildcard '*' */
if (len && !is_dir_sep(pattern[len - 1]))
@@ -47,7 +38,7 @@ DIR *opendir(const char *name)
pattern[len] = 0;

/* open find handle */
- h = FindFirstFileA(pattern, &fdata);
+ h = FindFirstFileW(pattern, &fdata);
if (h == INVALID_HANDLE_VALUE) {
DWORD err = GetLastError();
errno = (err == ERROR_DIRECTORY) ? ENOTDIR : err_win_to_posix(err);
@@ -72,8 +63,8 @@ struct dirent *readdir(DIR *dir)
/* if first entry, dirent has already been set up by opendir */
if (dir->dd_stat) {
/* get next entry and convert from WIN32_FIND_DATA to dirent */
- WIN32_FIND_DATAA fdata;
- if (FindNextFileA(dir->dd_handle, &fdata)) {
+ WIN32_FIND_DATAW fdata;
+ if (FindNextFileW(dir->dd_handle, &fdata)) {
finddata2dirent(&dir->dd_dir, &fdata);
} else {
DWORD lasterr = GetLastError();
diff --git a/compat/win32/dirent.h b/compat/win32/dirent.h
index 8838cd6..058207e 100644
--- a/compat/win32/dirent.h
+++ b/compat/win32/dirent.h
@@ -10,7 +10,7 @@ typedef struct DIR DIR;

struct dirent {
unsigned char d_type; /* file type to prevent lstat after readdir */
- char d_name[MAX_PATH]; /* file name */
+ char d_name[MAX_PATH * 3]; /* file name (* 3 for UTF-8 conversion) */
};

DIR *opendir(const char *dirname);
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:02 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Replaces Windows "ANSI" APIs dealing with file- or path names with their
Unicode equivalent, adding UTF-8/UTF-16LE conversion as necessary.

The dirent API (opendir/readdir/closedir) is updated in a separate commit.

Adds trivial wrappers for access, chmod and chdir.

Adds wrapper for mktemp (needed for both mkstemp and mkdtemp).

The simplest way to convert a repository with legacy-encoded (e.g. Cp1252)
file names to UTF-8 ist to checkout with an old msysgit version and
"git add --all & git commit" with the new version.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 213 +++++++++++++++++++++++++++++++++++++++-----------------
compat/mingw.h | 13 ++++
2 files changed, 161 insertions(+), 65 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index ba7ac27..fe759c0 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1,6 +1,7 @@
#include "../git-compat-util.h"
#include "win32.h"
#include <conio.h>
+#include <wchar.h>
#include "../strbuf.h"
#include "../run-command.h"
#include "../cache.h"
@@ -200,14 +201,16 @@ static int ask_yes_no_if_possible(const char *format, ...)
}
}

-#undef unlink
int mingw_unlink(const char *pathname)
{
int ret, tries = 0;
+ wchar_t wpathname[MAX_PATH];
+ if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ return -1;

/* read-only files cannot be removed */
- chmod(pathname, 0666);
- while ((ret = unlink(pathname)) == -1 && tries < ARRAY_SIZE(delay)) {
+ _wchmod(wpathname, 0666);
+ while ((ret = _wunlink(wpathname)) == -1 && tries < ARRAY_SIZE(delay)) {
if (!is_file_in_use_error(GetLastError()))
break;
/*
@@ -223,43 +226,40 @@ int mingw_unlink(const char *pathname)
while (ret == -1 && is_file_in_use_error(GetLastError()) &&
ask_yes_no_if_possible("Unlink of file '%s' failed. "
"Should I try again?", pathname))
- ret = unlink(pathname);
+ ret = _wunlink(wpathname);
return ret;
}

-static int is_dir_empty(const char *path)
+static int is_dir_empty(const wchar_t *wpath)
{
- struct strbuf buf = STRBUF_INIT;
- WIN32_FIND_DATAA findbuf;
+ WIN32_FIND_DATAW findbuf;
HANDLE handle;
-
- strbuf_addf(&buf, "%s\\*", path);
- handle = FindFirstFileA(buf.buf, &findbuf);
- if (handle == INVALID_HANDLE_VALUE) {
- strbuf_release(&buf);
+ wchar_t wbuf[MAX_PATH + 2];
+ wcscpy(wbuf, wpath);
+ wcscat(wbuf, L"\\*");
+ handle = FindFirstFileW(wbuf, &findbuf);
+ if (handle == INVALID_HANDLE_VALUE)
return GetLastError() == ERROR_NO_MORE_FILES;
- }

- while (!strcmp(findbuf.cFileName, ".") ||
- !strcmp(findbuf.cFileName, ".."))
- if (!FindNextFile(handle, &findbuf)) {
- strbuf_release(&buf);
+ while (!wcscmp(findbuf.cFileName, L".") ||
+ !wcscmp(findbuf.cFileName, L".."))
+ if (!FindNextFileW(handle, &findbuf))
return GetLastError() == ERROR_NO_MORE_FILES;
- }
FindClose(handle);
- strbuf_release(&buf);
return 0;
}

-#undef rmdir
int mingw_rmdir(const char *pathname)
{
int ret, tries = 0;
+ wchar_t wpathname[MAX_PATH];
+ if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ return -1;

- while ((ret = rmdir(pathname)) == -1 && tries < ARRAY_SIZE(delay)) {
+ while ((ret = _wrmdir(wpathname)) == -1 && tries < ARRAY_SIZE(delay)) {
if (!is_file_in_use_error(GetLastError()))
break;
- if (!is_dir_empty(pathname)) {
+ if (!is_dir_empty(wpathname)) {
errno = ENOTEMPTY;
break;
}
@@ -276,14 +276,14 @@ int mingw_rmdir(const char *pathname)
while (ret == -1 && is_file_in_use_error(GetLastError()) &&
ask_yes_no_if_possible("Deletion of directory '%s' failed. "
"Should I try again?", pathname))
- ret = rmdir(pathname);
+ ret = _wrmdir(wpathname);
return ret;
}

-static int make_hidden(const char *path)
+static int make_hidden(const wchar_t *path)
{
- DWORD attribs = GetFileAttributes(path);
- if (SetFileAttributes(path, FILE_ATTRIBUTE_HIDDEN | attribs))
+ DWORD attribs = GetFileAttributesW(path);
+ if (SetFileAttributesW(path, FILE_ATTRIBUTE_HIDDEN | attribs))
return 0;
errno = err_win_to_posix(GetLastError());
return -1;
@@ -291,19 +291,23 @@ static int make_hidden(const char *path)

void mingw_mark_as_git_dir(const char *dir)
{
- if (hide_dotfiles != HIDE_DOTFILES_FALSE && !is_bare_repository() &&
- make_hidden(dir))
- warning("Failed to make '%s' hidden", dir);
+ wchar_t wdir[MAX_PATH];
+ if (hide_dotfiles != HIDE_DOTFILES_FALSE && !is_bare_repository())
+ if (utftowcs(wdir, dir, MAX_PATH) < 0 || make_hidden(wdir))
+ warning("Failed to make '%s' hidden", dir);
git_config_set("core.hideDotFiles",
hide_dotfiles == HIDE_DOTFILES_FALSE ? "false" :
(hide_dotfiles == HIDE_DOTFILES_DOTGITONLY ?
"dotGitOnly" : "true"));
}

-#undef mkdir
int mingw_mkdir(const char *path, int mode)
{
- int ret = mkdir(path);
+ int ret;
+ wchar_t wpath[MAX_PATH];
+ if (utftowcs(wpath, path, MAX_PATH) < 0)
+ return -1;
+ ret = _wmkdir(wpath);
if (!ret && hide_dotfiles == HIDE_DOTFILES_TRUE) {
/*
* In Windows a file or dir starting with a dot is not
@@ -312,17 +316,17 @@ int mingw_mkdir(const char *path, int mode)
*/
const char *start = basename((char*)path);
if (*start == '.')
- return make_hidden(path);
+ return make_hidden(wpath);
}
return ret;
}

-#undef open
int mingw_open (const char *filename, int oflags, ...)
{
va_list args;
unsigned mode;
int fd;
+ wchar_t wfilename[MAX_PATH];

va_start(args, oflags);
mode = va_arg(args, int);
@@ -331,10 +335,12 @@ int mingw_open (const char *filename, int oflags, ...)
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";

- fd = open(filename, oflags, mode);
+ if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ return -1;
+ fd = _wopen(wfilename, oflags, mode);

if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
- DWORD attrs = GetFileAttributes(filename);
+ DWORD attrs = GetFileAttributesW(wfilename);
if (attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_DIRECTORY))
errno = EISDIR;
}
@@ -346,7 +352,7 @@ int mingw_open (const char *filename, int oflags, ...)
* such a file is created.
*/
const char *start = basename((char*)filename);
- if (*start == '.' && make_hidden(filename))
+ if (*start == '.' && make_hidden(wfilename))
warning("Could not mark '%s' as hidden.", filename);
}
return fd;
@@ -369,38 +375,69 @@ ssize_t mingw_write(int fd, const void *buf, size_t count)
return write(fd, buf, min(count, 31 * 1024 * 1024));
}

-#undef fopen
FILE *mingw_fopen (const char *filename, const char *otype)
{
int hide = 0;
FILE *file;
+ wchar_t wfilename[MAX_PATH], wotype[10];
if (hide_dotfiles == HIDE_DOTFILES_TRUE &&
basename((char*)filename)[0] == '.')
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- file = fopen(filename, otype);
- if (file && hide && make_hidden(filename))
+ if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
+ utftowcs(wotype, otype, 10) < 0)
+ return NULL;
+ file = _wfopen(wfilename, wotype);
+ if (file && hide && make_hidden(wfilename))
warning("Could not mark '%s' as hidden.", filename);
return file;
}

-#undef freopen
FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream)
{
int hide = 0;
FILE *file;
+ wchar_t wfilename[MAX_PATH], wotype[10];
if (hide_dotfiles == HIDE_DOTFILES_TRUE &&
basename((char*)filename)[0] == '.')
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- file = freopen(filename, otype, stream);
- if (file && hide && make_hidden(filename))
+ if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
+ utftowcs(wotype, otype, 10) < 0)
+ return NULL;
+ file = _wfreopen(wfilename, wotype, stream);
+ if (file && hide && make_hidden(wfilename))
warning("Could not mark '%s' as hidden.", filename);
return file;
}

+int mingw_access(const char *filename, int mode)
+{
+ wchar_t wfilename[MAX_PATH];
+ if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ return -1;
+ /* X_OK is not supported by the MSVCRT version */
+ return _waccess(wfilename, mode & ~X_OK);
+}
+
+int mingw_chdir(const char *dirname)
+{
+ wchar_t wdirname[MAX_PATH];
+ if (utftowcs(wdirname, dirname, MAX_PATH) < 0)
+ return -1;
+ return _wchdir(wdirname);
+}
+
+int mingw_chmod(const char *filename, int mode)
+{
+ wchar_t wfilename[MAX_PATH];
+ if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ return -1;
+ return _wchmod(wfilename, mode);
+}
+
/*
* The unit of FILETIME is 100-nanoseconds since January 1, 1601, UTC.
* Returns the 100-nanoseconds ("hekto nanoseconds") since the epoch.
@@ -426,10 +463,12 @@ static inline time_t filetime_to_time_t(const FILETIME *ft)
*/
static int do_lstat(int follow, const char *file_name, struct stat *buf)
{
- int err;
WIN32_FILE_ATTRIBUTE_DATA fdata;
+ wchar_t wfilename[MAX_PATH];
+ if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ return -1;

- if (!(err = get_file_attr(file_name, &fdata))) {
+ if (GetFileAttributesExW(wfilename, GetFileExInfoStandard, &fdata)) {
buf->st_ino = 0;
buf->st_gid = 0;
buf->st_uid = 0;
@@ -442,8 +481,8 @@ static int do_lstat(int follow, const char *file_name, struct stat *buf)
buf->st_mtime = filetime_to_time_t(&(fdata.ftLastWriteTime));
buf->st_ctime = filetime_to_time_t(&(fdata.ftCreationTime));
if (fdata.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
- WIN32_FIND_DATAA findbuf;
- HANDLE handle = FindFirstFileA(file_name, &findbuf);
+ WIN32_FIND_DATAW findbuf;
+ HANDLE handle = FindFirstFileW(wfilename, &findbuf);
if (handle != INVALID_HANDLE_VALUE) {
if ((findbuf.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
(findbuf.dwReserved0 == IO_REPARSE_TAG_SYMLINK)) {
@@ -462,7 +501,23 @@ static int do_lstat(int follow, const char *file_name, struct stat *buf)
}
return 0;
}
- errno = err;
+ switch (GetLastError()) {
+ case ERROR_ACCESS_DENIED:
+ case ERROR_SHARING_VIOLATION:
+ case ERROR_LOCK_VIOLATION:
+ case ERROR_SHARING_BUFFER_EXCEEDED:
+ errno = EACCES;
+ break;
+ case ERROR_BUFFER_OVERFLOW:
+ errno = ENAMETOOLONG;
+ break;
+ case ERROR_NOT_ENOUGH_MEMORY:
+ errno = ENOMEM;
+ break;
+ default:
+ errno = ENOENT;
+ break;
+ }
return -1;
}

@@ -551,16 +606,20 @@ int mingw_utime (const char *file_name, const struct utimbuf *times)
{
FILETIME mft, aft;
int fh, rc;
+ DWORD attrs;
+ wchar_t wfilename[MAX_PATH];
+ if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ return -1;

/* must have write permission */
- DWORD attrs = GetFileAttributes(file_name);
+ attrs = GetFileAttributesW(wfilename);
if (attrs != INVALID_FILE_ATTRIBUTES &&
(attrs & FILE_ATTRIBUTE_READONLY)) {
/* ignore errors here; open() will report them */
- SetFileAttributes(file_name, attrs & ~FILE_ATTRIBUTE_READONLY);
+ SetFileAttributesW(wfilename, attrs & ~FILE_ATTRIBUTE_READONLY);
}

- if ((fh = open(file_name, O_RDWR | O_BINARY)) < 0) {
+ if ((fh = _wopen(wfilename, O_RDWR | O_BINARY)) < 0) {
rc = -1;
goto revert_attrs;
}
@@ -583,7 +642,7 @@ revert_attrs:
if (attrs != INVALID_FILE_ATTRIBUTES &&
(attrs & FILE_ATTRIBUTE_READONLY)) {
/* ignore errors again */
- SetFileAttributes(file_name, attrs);
+ SetFileAttributesW(wfilename, attrs);
}
return rc;
}
@@ -594,6 +653,18 @@ unsigned int sleep (unsigned int seconds)
return 0;
}

+char *mingw_mktemp(char *template)
+{
+ wchar_t wtemplate[MAX_PATH];
+ if (utftowcs(wtemplate, template, MAX_PATH) < 0)
+ return NULL;
+ if (!_wmktemp(wtemplate))
+ return NULL;
+ if (wcstoutf(template, wtemplate, strlen(template) + 1) < 0)
+ return NULL;
+ return template;
+}
+
int mkstemp(char *template)
{
char *filename = mktemp(template);
@@ -652,17 +723,18 @@ struct tm *localtime_r(const time_t *timep, struct tm *result)
return result;
}

-#undef getcwd
char *mingw_getcwd(char *pointer, int len)
{
int i;
- char *ret = getcwd(pointer, len);
- if (!ret)
- return ret;
+ wchar_t wpointer[MAX_PATH];
+ if (!_wgetcwd(wpointer, MAX_PATH))
+ return NULL;
+ if (wcstoutf(pointer, wpointer, len) < 0)
+ return NULL;
for (i = 0; pointer[i]; i++)
if (pointer[i] == '\\')
pointer[i] = '/';
- return ret;
+ return pointer;
}

/*
@@ -1505,33 +1577,38 @@ int mingw_rename(const char *pold, const char *pnew)
{
DWORD attrs, gle;
int tries = 0;
+ wchar_t wpold[MAX_PATH], wpnew[MAX_PATH];
+ if (utftowcs(wpold, pold, MAX_PATH) < 0)
+ return -1;
+ if (utftowcs(wpnew, pnew, MAX_PATH) < 0)
+ return -1;

/*
* Try native rename() first to get errno right.
* It is based on MoveFile(), which cannot overwrite existing files.
*/
- if (!rename(pold, pnew))
+ if (!_wrename(wpold, wpnew))
return 0;
if (errno != EEXIST)
return -1;
repeat:
- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
+ if (MoveFileExW(wpold, wpnew, MOVEFILE_REPLACE_EXISTING))
return 0;
/* TODO: translate more errors */
gle = GetLastError();
if (gle == ERROR_ACCESS_DENIED &&
- (attrs = GetFileAttributes(pnew)) != INVALID_FILE_ATTRIBUTES) {
+ (attrs = GetFileAttributesW(wpnew)) != INVALID_FILE_ATTRIBUTES) {
if (attrs & FILE_ATTRIBUTE_DIRECTORY) {
errno = EISDIR;
return -1;
}
if ((attrs & FILE_ATTRIBUTE_READONLY) &&
- SetFileAttributes(pnew, attrs & ~FILE_ATTRIBUTE_READONLY)) {
- if (MoveFileEx(pold, pnew, MOVEFILE_REPLACE_EXISTING))
+ SetFileAttributesW(wpnew, attrs & ~FILE_ATTRIBUTE_READONLY)) {
+ if (MoveFileExW(wpold, wpnew, MOVEFILE_REPLACE_EXISTING))
return 0;
gle = GetLastError();
/* revert file attributes on failure */
- SetFileAttributes(pnew, attrs);
+ SetFileAttributesW(wpnew, attrs);
}
}
if (tries < ARRAY_SIZE(delay) && gle == ERROR_ACCESS_DENIED) {
@@ -1741,11 +1818,17 @@ void mingw_open_html(const char *unixpath)

int link(const char *oldpath, const char *newpath)
{
- typedef BOOL (WINAPI *T)(const char*, const char*, LPSECURITY_ATTRIBUTES);
+ typedef BOOL (WINAPI *T)(LPCWSTR, LPCWSTR, LPSECURITY_ATTRIBUTES);
static T create_hard_link = NULL;
+ wchar_t woldpath[MAX_PATH], wnewpath[MAX_PATH];
+ if (utftowcs(woldpath, oldpath, MAX_PATH) < 0)
+ return -1;
+ if (utftowcs(wnewpath, newpath, MAX_PATH) < 0)
+ return -1;
+
if (!create_hard_link) {
create_hard_link = (T) GetProcAddress(
- GetModuleHandle("kernel32.dll"), "CreateHardLinkA");
+ GetModuleHandle("kernel32.dll"), "CreateHardLinkW");
if (!create_hard_link)
create_hard_link = (T)-1;
}
@@ -1753,7 +1836,7 @@ int link(const char *oldpath, const char *newpath)
errno = ENOSYS;
return -1;
}
- if (!create_hard_link(newpath, oldpath, NULL)) {
+ if (!create_hard_link(wnewpath, woldpath, NULL)) {
errno = err_win_to_posix(GetLastError());
return -1;
}
diff --git a/compat/mingw.h b/compat/mingw.h
index b4ff6cd..4454978 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -181,6 +181,19 @@ FILE *mingw_fopen (const char *filename, const char *otype);
FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream);
#define freopen mingw_freopen

+int mingw_access(const char *filename, int mode);
+#undef access
+#define access mingw_access
+
+int mingw_chdir(const char *dirname);
+#define chdir mingw_chdir
+
+int mingw_chmod(const char *filename, int mode);
+#define chmod mingw_chmod
+
+char *mingw_mktemp(char *template);
+#define mktemp mingw_mktemp
+
char *mingw_getcwd(char *pointer, int len);
#define getcwd mingw_getcwd

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:05 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Convert command line arguments from UTF-8 to UTF-16 when creating other
processes.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index fe759c0..818a7f0 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -938,9 +938,10 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

const char *dir,
int prepend_cmd, int fhin, int fhout, int fherr)
{

- STARTUPINFO si;
+ STARTUPINFOW si;
PROCESS_INFORMATION pi;
struct strbuf envblk, args;
+ wchar_t wcmd[MAX_PATH], wdir[MAX_PATH], *wargs;
unsigned flags;
BOOL ret;

@@ -976,6 +977,11 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
si.hStdOutput = (HANDLE) _get_osfhandle(fhout);
si.hStdError = (HANDLE) _get_osfhandle(fherr);

+ if (utftowcs(wcmd, cmd, MAX_PATH) < 0)
+ return -1;
+ if (dir && utftowcs(wdir, dir, MAX_PATH) < 0)
+ return -1;
+
/* concatenate argv, quoting args as we go */
strbuf_init(&args, 0);
if (prepend_cmd) {
@@ -993,6 +999,10 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
free(quoted);
}

+ wargs = xmalloc((2 * args.len + 1) * sizeof(wchar_t));
+ utftowcs(wargs, args.buf, 2 * args.len + 1);
+ strbuf_release(&args);
+
if (env) {
int count = 0;
char **e, **sorted_env;
@@ -1014,12 +1024,12 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
}

memset(&pi, 0, sizeof(pi));
- ret = CreateProcess(cmd, args.buf, NULL, NULL, TRUE, flags,
- env ? envblk.buf : NULL, dir, &si, &pi);
+ ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,
+ env ? envblk.buf : NULL, dir ? wdir : NULL, &si, &pi);

if (env)
strbuf_release(&envblk);
- strbuf_release(&args);
+ free(wargs);

if (!ret) {
errno = ENOENT;
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:04 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Assumes file names in git tree objects are UTF-8 encoded.

On most unix systems, the system encoding (and thus the TCL system
encoding) will be UTF-8, so file names will be displayed correctly.

On Windows, it is impossible to set the system encoding to UTF-8. Changing
the TCL system encoding (via 'encoding system ...', e.g. in the startup
code) is explicitly discouraged by the TCL docs.

Change gitk and git-gui functions dealing with file names to always convert
from and to UTF-8.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

git-gui/git-gui.sh | 6 +++---
git-gui/lib/browser.tcl | 2 +-
git-gui/lib/index.tcl | 6 +++---
gitk-git/gitk | 16 ++++++++--------
4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/git-gui/git-gui.sh b/git-gui/git-gui.sh
index 0d5bd34..74f10bf 100755
--- a/git-gui/git-gui.sh
+++ b/git-gui/git-gui.sh
@@ -1605,7 +1605,7 @@ proc read_diff_index {fd after} {
set i [split [string range $buf_rdi $c [expr {$z1 - 2}]] { }]
set p [string range $buf_rdi $z1 [expr {$z2 - 1}]]
merge_state \
- [encoding convertfrom $p] \
+ [encoding convertfrom utf-8 $p] \
[lindex $i 4]? \
[list [lindex $i 0] [lindex $i 2]] \
[list]
@@ -1638,7 +1638,7 @@ proc read_diff_files {fd after} {
set i [split [string range $buf_rdf $c [expr {$z1 - 2}]] { }]
set p [string range $buf_rdf $z1 [expr {$z2 - 1}]]
merge_state \
- [encoding convertfrom $p] \
+ [encoding convertfrom utf-8 $p] \
?[lindex $i 4] \
[list] \
[list [lindex $i 0] [lindex $i 2]]
@@ -1661,7 +1661,7 @@ proc read_ls_others {fd after} {
set pck [split $buf_rlo "\0"]
set buf_rlo [lindex $pck end]
foreach p [lrange $pck 0 end-1] {
- set p [encoding convertfrom $p]
+ set p [encoding convertfrom utf-8 $p]
if {[string index $p end] eq {/}} {
set p [string range $p 0 end-1]
}
diff --git a/git-gui/lib/browser.tcl b/git-gui/lib/browser.tcl
index d68b1f9..f4ee01d 100644

--- a/git-gui/lib/browser.tcl
+++ b/git-gui/lib/browser.tcl
@@ -191,7 +191,7 @@ method _ls {tree_id {name {}}} {
$w conf -state disabled

set fd [git_read ls-tree -z $tree_id]
- fconfigure $fd -blocking 0 -translation binary

+ fconfigure $fd -blocking 0 -translation binary -encoding utf-8

fileevent $fd readable [cb _read $fd]
}

diff --git a/git-gui/lib/index.tcl b/git-gui/lib/index.tcl
index 5d7bbf2..3d1d973 100644
--- a/git-gui/lib/index.tcl
+++ b/git-gui/lib/index.tcl
@@ -115,7 +115,7 @@ proc write_update_indexinfo {fd pathList totalCnt batch after} {
set info [lindex $s 2]
if {$info eq {}} continue

- puts -nonewline $fd "$info\t[encoding convertto $path]\0"
+ puts -nonewline $fd "$info\t[encoding convertto utf-8 $path]\0"
display_file $path $new
}

@@ -186,7 +186,7 @@ proc write_update_index {fd pathList totalCnt batch after} {
?M {set new M_}
?? {continue}
}
- puts -nonewline $fd "[encoding convertto $path]\0"
+ puts -nonewline $fd "[encoding convertto utf-8 $path]\0"
display_file $path $new
}

@@ -247,7 +247,7 @@ proc write_checkout_index {fd pathList totalCnt batch after} {
?M -
?T -
?D {
- puts -nonewline $fd "[encoding convertto $path]\0"
+ puts -nonewline $fd "[encoding convertto utf-8 $path]\0"
display_file $path ?_
}
}
diff --git a/gitk-git/gitk b/gitk-git/gitk
index 935ed5c..9facdbf 100755
--- a/gitk-git/gitk
+++ b/gitk-git/gitk
@@ -7251,7 +7251,7 @@ proc gettreeline {gtf id} {
if {[string index $fname 0] eq "\""} {
set fname [lindex $fname 0]
}
- set fname [encoding convertfrom $fname]
+ set fname [encoding convertfrom utf-8 $fname]
lappend treefilelist($id) $fname
}
if {![eof $gtf]} {
@@ -7474,7 +7474,7 @@ proc gettreediffline {gdtf ids} {
if {[string index $file 0] eq "\""} {
set file [lindex $file 0]
}
- set file [encoding convertfrom $file]
+ set file [encoding convertfrom utf-8 $file]
if {$file ne [lindex $treediff end]} {
lappend treediff $file
lappend sublist $file
@@ -7623,7 +7623,7 @@ proc makediffhdr {fname ids} {
global ctext curdiffstart treediffs diffencoding
global ctext_file_names jump_to_here targetline diffline

- set fname [encoding convertfrom $fname]
+ set fname [encoding convertfrom utf-8 $fname]
set diffencoding [get_path_encoding $fname]
set i [lsearch -exact $treediffs($ids) $fname]
if {$i >= 0} {
@@ -7657,7 +7657,7 @@ proc getblobdiffline {bdf ids} {
}
if {![string compare -length 5 "diff " $line]} {
if {![regexp {^diff (--cc|--git) } $line m type]} {
- set line [encoding convertfrom $line]
+ set line [encoding convertfrom utf-8 $line]
$ctext insert end "$line\n" hunksep
continue
}
@@ -7704,7 +7704,7 @@ proc getblobdiffline {bdf ids} {
makediffhdr $fname $ids

} elseif {![string compare -length 16 "* Unmerged path " $line]} {
- set fname [encoding convertfrom [string range $line 16 end]]
+ set fname [encoding convertfrom utf-8 [string range $line 16 end]]
$ctext insert end "\n"
set curdiffstart [$ctext index "end - 1c"]
lappend ctext_file_names $fname
@@ -7759,7 +7759,7 @@ proc getblobdiffline {bdf ids} {
if {[string index $fname 0] eq "\""} {
set fname [lindex $fname 0]
}
- set fname [encoding convertfrom $fname]
+ set fname [encoding convertfrom utf-8 $fname]
set i [lsearch -exact $treediffs($ids) $fname]
if {$i >= 0} {
setinlist difffilestart $i $curdiffstart
@@ -7778,7 +7778,7 @@ proc getblobdiffline {bdf ids} {
set diffinhdr 0
continue
}
- set line [encoding convertfrom $line]
+ set line [encoding convertfrom utf-8 $line]

$ctext insert end "$line\n" filesep

} else {

@@ -11361,7 +11361,7 @@ proc cache_gitattr {attr pathlist} {
foreach row [split $rlist "\n"] {
if {[regexp "(.*): $attr: (.*)" $row m path value]} {
if {[string index $path 0] eq "\""} {
- set path [encoding convertfrom [lindex $path 0]]
+ set path [encoding convertfrom utf-8 [lindex $path 0]]
}
set path_attr_cache($attr,$path) $value
}
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:06 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Convert command line arguments from UTF-16 to UTF-8 on startup.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 35 +++++++++++++++++++++++++++++++++--
1 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 818a7f0..e3cd129 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -2053,10 +2053,41 @@ int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
*/
int _CRT_glob = 0;

+typedef struct {
+ int newmode;
+} _startupinfo;
+
+extern int __wgetmainargs(int *argc, wchar_t ***argv, wchar_t ***env, int glob,
+ _startupinfo *si);
+
void mingw_startup()
{
- /* copy executable name to argv[0] */
- __argv[0] = xstrdup(_pgmptr);
+ int i, len, maxlen, argc;
+ char *buffer;
+ wchar_t **wenv, **wargv;
+ _startupinfo si;
+
+ /* get wide char arguments and environment */
+ si.newmode = 0;
+ __wgetmainargs(&argc, &wargv, &wenv, _CRT_glob, &si);
+
+ /* determine size of argv and environ conversion buffer */
+ maxlen = wcslen(_wpgmptr);
+ for (i = 1; i < argc; i++)
+ maxlen = max(maxlen, wcslen(wargv[i]));
+
+ /* allocate buffer (wchar_t encodes to max 3 UTF-8 bytes) */
+ maxlen = 3 * maxlen + 1;
+ buffer = xmalloc(maxlen);
+
+ /* convert command line arguments and environment to UTF-8 */
+ len = wcstoutf(buffer, _wpgmptr, maxlen);
+ __argv[0] = xmemdupz(buffer, len);
+ for (i = 1; i < argc; i++) {
+ len = wcstoutf(buffer, wargv[i], maxlen);
+ __argv[i] = xmemdupz(buffer, len);
+ }
+ free(buffer);

/* initialize critical section for waitpid pinfo_t list */
InitializeCriticalSection(&pinfo_cs);
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:10 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

On Windows, all native APIs are Unicode-based. It is impossible to pass
legacy encoded byte arrays to a process via command line or environment
variables. Disable the tests that try to do so.

In t3901, most tests still work if we don't mess up the repository encoding
in setup, so don't switch to ISO-8859-1 on MinGW.

Note that i18n tests that do their encoding tricks via encoded files (such
as t3900) are not affected by this.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

t/t3901-i18n-patch.sh | 19 +++++++++++--------
t/t4201-shortlog.sh | 6 +++---
t/t8005-blame-i18n.sh | 8 ++++----
3 files changed, 18 insertions(+), 15 deletions(-)

1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:12 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

The environment on Windows is case-insensitive. Some environment functions
(such as unsetenv and make_augmented_environment) have always used case-
sensitive comparisons instead, while others (getenv, putenv) were case-
insensitive. Recently, getenv has been changed to do a case-sensitive
before case-insensitive lookup to fix a problem in git's envsubst
implementation.

Prevent potential inconsistencies by using the same algorithm (case-
sensitive before case-insensitive) for all environment functions. This can
be implemented in the now central lookup_env function.

The original MSVCRT getenv (and thus, #undef getenv) is no longer needed,
direct references to mingw_getenv are no longer necessary.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 37 ++++++++++++++++++-------------------
1 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 93e7a58..7c8a1bc 100644

--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -181,7 +181,7 @@ static int ask_yes_no_if_possible(const char *format, ...)
vsnprintf(question, sizeof(question), format, args);
va_end(args);

- if ((retry_hook[0] = mingw_getenv("GIT_ASK_YESNO"))) {
+ if ((retry_hook[0] = getenv("GIT_ASK_YESNO"))) {
retry_hook[1] = question;
return !run_command_v_opt(retry_hook, 0);
}

@@ -836,7 +836,7 @@ static const char *parse_interpreter(const char *cmd)

*/
static char **get_path_split(void)
{
- char *p, **path, *envpath = mingw_getenv("PATH");
+ char *p, **path, *envpath = getenv("PATH");
int i, n = 0;

if (!envpath || !*envpath)

@@ -1215,13 +1215,21 @@ void free_environ(char **env)
static int lookup_env(char **env, const char *name, size_t nmln)
{
int i;
-

+ /*
+ * try case-sensitive match first, this is necessary for sh-i18n--envsubst
+ * to work with environment variables that differ only in case (e.g. $PATH
+ * and $path)
+ */

for (i = 0; env[i]; i++) {
- if (0 == strncmp(env[i], name, nmln)
- && '=' == env[i][nmln])

+ if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])

/* matches */
return i;
}
+ /* if there's no case-sensitive match, try case-insensitive instead */
+ for (i = 0; env[i]; i++) {
+ if (!strncasecmp(env[i], name, nmln) && '=' == env[i][nmln])
+ return i;
+ }
return -1;
}

@@ -1267,32 +1275,23 @@ char **make_augmented_environ(const char *const *vars)
return env;
}

-#undef getenv
-
-/*
- * The system's getenv looks up the name in a case-insensitive manner.
- * This version tries a case-sensitive lookup and falls back to
- * case-insensitive if nothing was found. This is necessary because,
- * as a prominent example, CMD sets 'Path', but not 'PATH'.
- * Warning: not thread-safe.
- */
-static char *getenv_cs(const char *name)
+static char *do_getenv(const char *name)
{
size_t len = strlen(name);
int i = lookup_env(environ, name, len);
if (i >= 0)
return environ[i] + len + 1; /* skip past name and '=' */
- return getenv(name);
+ return NULL;
}

char *mingw_getenv(const char *name)
{
- char *result = getenv_cs(name);
+ char *result = do_getenv(name);
if (!result && !strcmp(name, "TMPDIR")) {
/* on Windows it is TMP and TEMP */
- result = getenv_cs("TMP");
+ result = do_getenv("TMP");
if (!result)
- result = getenv_cs("TEMP");
+ result = do_getenv("TEMP");
}
else if (!result && !strcmp(name, "TERM")) {
/* simulate TERM to enable auto-color (see color.c) */
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:09 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Convert environment from UTF-16 to UTF-8 on startup.

No changes to getenv() are necessary, as the MSVCRT version is implemented
on top of char **environ.

However, putenv / _wputenv from MSVCRT no longer work, for two reasons:
1. they try to keep environ, _wenviron and the Win32 process environment
in sync, using the default system encoding instead of UTF-8 to convert
between charsets
2. msysgit and MSVCRT use different allocators, memory allocated in git
cannot be freed by the CRT and vice versa

Implement mingw_putenv using the env_setenv helper function from the
environment merge code.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 15 +++++++++++++++
compat/mingw.h | 2 ++
2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index b38a7b0..3702b06 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1299,6 +1299,12 @@ char *mingw_getenv(const char *name)
return result;
}

+int mingw_putenv(const char *namevalue)
+{
+ environ = env_setenv(environ, namevalue);
+ return 0;
+}
+
/*
* Note, this isn't a complete replacement for getaddrinfo. It assumes
* that service contains a numerical port, or that it is null. It
@@ -2077,6 +2083,11 @@ void mingw_startup()
maxlen = wcslen(_wpgmptr);

for (i = 1; i < argc; i++)

maxlen = max(maxlen, wcslen(wargv[i]));

+ for (i = 0; wenv[i]; i++)
+ maxlen = max(maxlen, wcslen(wenv[i]));
+
+ /* nedmalloc can't free CRT memory, allocate resizable environment list */
+ environ = xcalloc(i + 1, sizeof(char*));

/* allocate buffer (wchar_t encodes to max 3 UTF-8 bytes) */

maxlen = 3 * maxlen + 1;

@@ -2089,6 +2100,10 @@ void mingw_startup()

len = wcstoutf(buffer, wargv[i], maxlen);

__argv[i] = xmemdupz(buffer, len);
}

+ for (i = 0; wenv[i]; i++) {
+ len = wcstoutf(buffer, wenv[i], maxlen);
+ environ[i] = xmemdupz(buffer, len);
+ }

free(buffer);

/* initialize critical section for waitpid pinfo_t list */

diff --git a/compat/mingw.h b/compat/mingw.h
index fac4634..29cbef7 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -199,6 +199,8 @@ char *mingw_getcwd(char *pointer, int len);

char *mingw_getenv(const char *name);
#define getenv mingw_getenv
+int mingw_putenv(const char *namevalue);
+#define putenv mingw_putenv

struct hostent *mingw_gethostbyname(const char *host);
#define gethostbyname mingw_gethostbyname
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:07 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Replace stdout and stderr with a pipe if they point to the console. A
background thread reads from the pipe, handles ANSI escape sequences and
UTF-8 to UTF-16 conversion, then writes to the console.

Global variables are either initialized on startup (single threaded) or
exclusively modified by the background thread. Threads communicate through
the pipe, no further synchronization is necessary.

Due to the byte-oriented pipe, ANSI escape sequences can no longer be
expected to arrive in one piece. Replace the string-based ansi_emulate()
with a simple stateful parser (this also fixes colored diff hunk headers,
which were broken as of commit 2efcc977).

Override isatty to return true for the pipes redirecting to the console.

Exec/spawn obtain the original console handle to pass to the next process
via winansi_get_osfhandle().

All other overrides are gone, the default stdio implementations work as
expected with the piped stdout/stderr descriptors.

Limitations: doesn't track reopened or duped file descriptors, i.e.:
- fdopen(1/2) returns fully buffered streams
- dup(1/2), dup2(1/2) returns normal pipe descriptors (i.e. isatty() =
false, winansi_get_osfhandle won't return the original console handle)

Currently, only the git-format-patch command uses xfdopen(xdup(1)) (see
"realstdout" in builtin/log.c), but works well with these limitations.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 9 +-
compat/mingw.h | 12 +--
compat/winansi.c | 359 +++++++++++++++++++++++++++++++++---------------------
3 files changed, 232 insertions(+), 148 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index e3cd129..2f346dc 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -973,9 +973,9 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
memset(&si, 0, sizeof(si));
si.cb = sizeof(si);
si.dwFlags = STARTF_USESTDHANDLES;
- si.hStdInput = (HANDLE) _get_osfhandle(fhin);
- si.hStdOutput = (HANDLE) _get_osfhandle(fhout);
- si.hStdError = (HANDLE) _get_osfhandle(fherr);
+ si.hStdInput = winansi_get_osfhandle(fhin);
+ si.hStdOutput = winansi_get_osfhandle(fhout);
+ si.hStdError = winansi_get_osfhandle(fherr);

if (utftowcs(wcmd, cmd, MAX_PATH) < 0)

return -1;
@@ -2097,4 +2097,7 @@ void mingw_startup()
_setmode(_fileno(stdin), _O_BINARY);
_setmode(_fileno(stdout), _O_BINARY);
_setmode(_fileno(stderr), _O_BINARY);
+
+ /* initialize Unicode console */
+ winansi_init();
}
diff --git a/compat/mingw.h b/compat/mingw.h
index 4454978..fac4634 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -297,14 +297,10 @@ sig_handler_t mingw_signal(int sig, sig_handler_t handler);
* ANSI emulation wrappers
*/

-int winansi_fputs(const char *str, FILE *stream);
-int winansi_printf(const char *format, ...) __attribute__((format (printf, 1, 2)));
-int winansi_fprintf(FILE *stream, const char *format, ...) __attribute__((format (printf, 2, 3)));
-int winansi_vfprintf(FILE *stream, const char *format, va_list list);
-#define fputs winansi_fputs
-#define printf(...) winansi_printf(__VA_ARGS__)
-#define fprintf(...) winansi_fprintf(__VA_ARGS__)
-#define vfprintf winansi_vfprintf
+void winansi_init(void);
+int winansi_isatty(int fd);
+HANDLE winansi_get_osfhandle(int fd);
+#define isatty winansi_isatty

/*
* git specific compatibility
diff --git a/compat/winansi.c b/compat/winansi.c
index bf514f9..d11b532 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -4,18 +4,13 @@

#undef NOGDI
#include "../git-compat-util.h"
-#include <malloc.h>
#include <wingdi.h>
#include <winreg.h>

/*
Functions to be wrapped:
*/
-#undef printf
-#undef fprintf
-#undef fputs
-#undef vfprintf
-/* TODO: write */
+#undef isatty

/*
ANSI codes used by git: m, K
@@ -28,8 +23,10 @@ static HANDLE console;
static WORD plain_attr;
static WORD attr;
static int negative;
-static FILE *last_stream = NULL;

static int non_ascii_used = 0;

+static HANDLE hthread, hread, hwrite;
+static HANDLE hwrite1 = INVALID_HANDLE_VALUE, hwrite2 = INVALID_HANDLE_VALUE;
+static HANDLE hconsole1, hconsole2;

#ifdef __MINGW32__
typedef struct _CONSOLE_FONT_INFOEX {
@@ -74,64 +71,39 @@ static void warn_if_raster_font(void)
}
}

- if (!(fontFamily & TMPF_TRUETYPE))
- warning("Your console font probably doesn\'t support "
- "Unicode. If you experience strange characters in the output, "
- "consider switching to a TrueType font such as Lucida Console!");
+ if (!(fontFamily & TMPF_TRUETYPE)) {
+ const wchar_t *msg = L"\nWarning: Your console font probably doesn\'t "
+ L"support Unicode. If you experience strange characters in the "
+ L"output, consider switching to a TrueType font such as Lucida "
+ L"Console!\n";
+ WriteConsoleW(console, msg, wcslen(msg), NULL, NULL);
+ }
}

-static int is_console(FILE *stream)
+static int is_console(int fd)
{
CONSOLE_SCREEN_BUFFER_INFO sbi;
HANDLE hcon;

- static int initialized = 0;
-
- /* use cached value if stream hasn't changed */
- if (stream == last_stream)
- return console != NULL;
-
- last_stream = stream;
- console = NULL;
-
- /* get OS handle of the stream */
- hcon = (HANDLE) _get_osfhandle(_fileno(stream));
+ /* get OS handle of the file descriptor */
+ hcon = (HANDLE) _get_osfhandle(fd);
if (hcon == INVALID_HANDLE_VALUE)
return 0;

+ /* check if its a device (i.e. console, printer, serial port) */
+ if (GetFileType(hcon) != FILE_TYPE_CHAR)
+ return 0;
+
/* check if its a handle to a console output screen buffer */
if (!GetConsoleScreenBufferInfo(hcon, &sbi))
return 0;

- if (!initialized) {
- attr = plain_attr = sbi.wAttributes;
- negative = 0;
- initialized = 1;
- /* check console font on exit */
- atexit(warn_if_raster_font);
- }
-
- console = hcon;
+ /* initialize attributes */
+ attr = plain_attr = sbi.wAttributes;
+ negative = 0;
return 1;
}

-static int write_console(const char *str, size_t len)
-{
- /* convert utf-8 to utf-16, write directly to console */
- int wlen = MultiByteToWideChar(CP_UTF8, 0, str, len, NULL, 0);
- wchar_t *wbuf = (wchar_t *) alloca(wlen * sizeof(wchar_t));
- MultiByteToWideChar(CP_UTF8, 0, str, len, wbuf, wlen);
-
- WriteConsoleW(console, wbuf, wlen, NULL, NULL);
-
- /* remember if non-ascii characters are printed */
- if (wlen != len)
- non_ascii_used = 1;
-
- /* return original (utf-8 encoded) length */
- return len;
-}
-
#define FOREGROUND_ALL (FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE)
#define BACKGROUND_ALL (BACKGROUND_RED | BACKGROUND_GREEN | BACKGROUND_BLUE)

@@ -175,18 +147,13 @@ static void erase_in_line(void)
&dummy);
}

-
-static const char *set_attr(const char *str)
+static void set_attr(char func, const int *params, int paramlen)
{
- const char *func;
- size_t len = strspn(str, "0123456789;");
- func = str + len;
-
- switch (*func) {
+ int i;
+ switch (func) {
case 'm':
- do {
- long val = strtol(str, (char **)&str, 10);
- switch (val) {
+ for (i = 0; i < paramlen; i++) {
+ switch (params[i]) {
case 0: /* reset */
attr = plain_attr;
negative = 0;
@@ -309,9 +276,7 @@ static const char *set_attr(const char *str)
/* Unsupported code */
break;
}
- str++;
- } while (*(str-1) == ';');
-
+ }
set_console_attr();
break;
case 'K':
@@ -321,112 +286,232 @@ static const char *set_attr(const char *str)
/* Unsupported code */
break;
}
-
- return func + 1;
}

-static int ansi_emulate(const char *str, FILE *stream)
+#define BUFFER_SIZE 4096
+#define MAX_PARAMS 16
+
+static void write_console(char *str, size_t len)
{
- int rv = 0;
- const char *pos = str;
+ /* only called from console_thread, so a static buffer will do */
+ static wchar_t wbuf[2 * BUFFER_SIZE + 1];

- fflush(stream);
+ /* convert utf-8 to utf-16 */
+ int wlen = utftowcsn(wbuf, str, 2 * BUFFER_SIZE + 1, len);

- while (*pos) {
- pos = strstr(str, "\033[");
- if (pos) {
- size_t len = pos - str;
+ /* write directly to console */
+ WriteConsoleW(console, wbuf, wlen, NULL, NULL);

- if (len) {
- size_t out_len = write_console(str, len);
- rv += out_len;
- if (out_len < len)
- return rv;
- }
+ /* remember if non-ascii characters are printed */
+ if (wlen != len)
+ non_ascii_used = 1;
+}

- str = pos + 2;
- rv += 2;
+enum {
+ TEXT = 0, ESCAPE = 033, BRACKET = '[', EXIT = -1
+};

- pos = set_attr(str);
- rv += pos - str;
- str = pos;
- } else {
- size_t len = strlen(str);
- rv += write_console(str, len);
- return rv;
+static DWORD WINAPI console_thread(LPVOID unused)
+{
+ char buffer[BUFFER_SIZE];
+ DWORD bytes;
+ int start, end, c, parampos = 0, state = TEXT;
+ int params[MAX_PARAMS];
+
+ while (state != EXIT) {
+ /* read next chunk of bytes from the pipe */
+ if (!ReadFile(hread, buffer, BUFFER_SIZE, &bytes, NULL)) {
+ /* exit if pipe has been closed */
+ if (GetLastError() == ERROR_BROKEN_PIPE)
+ break;
+ /* ignore other errors */
+ continue;
}
+
+ /* scan the bytes and handle ANSI control codes */
+ start = end = 0;
+ while (end < bytes) {
+ c = buffer[end++];
+ switch (state) {
+ case TEXT:
+ if (c == ESCAPE) {
+ /* print text seen so far */
+ if (end - 1 > start)
+ write_console(buffer + start, end - 1 - start);
+
+ /* then start parsing escape sequence */
+ start = end - 1;
+ memset(params, 0, sizeof(params));
+ parampos = 0;
+ state = ESCAPE;
+ }
+ break;
+
+ case ESCAPE:
+ /* continue if "\033[", otherwise bail out */
+ state = (c == BRACKET) ? BRACKET : TEXT;
+ break;
+
+ case BRACKET:
+ /* parse [0-9;]* into array of parameters */
+ if (c >= '0' && c <= '9') {
+ params[parampos] *= 10;
+ params[parampos] += c - '0';
+ } else if (c == ';') {
+ /* next parameter, bail out if out of bounds */
+ parampos++;
+ if (parampos >= MAX_PARAMS)
+ state = TEXT;
+ } else if (c == 'q') {
+ /* "\033[q": terminate the thread */
+ state = EXIT;
+ } else {
+ /* end of escape sequence, change console attributes */
+ set_attr(c, params, parampos + 1);
+ start = end;
+ state = TEXT;
+ }
+ break;
+ }
+ }
+
+ /* print remaining text unless we're parsing an escape sequence */
+ if (state == TEXT && end > start)
+ write_console(buffer + start, end - start);
}
- return rv;
+
+ /* check if the console font supports unicode */
+ warn_if_raster_font();
+
+ CloseHandle(hread);
+ return 0;
}

-int winansi_fputs(const char *str, FILE *stream)
+static void winansi_exit(void)
{
- int rv;
-
- if (!is_console(stream))
- return fputs(str, stream);
+ DWORD dummy;
+ /* flush all streams */
+ _flushall();
+
+ /* close the write ends of the pipes */
+ if (hwrite1 != INVALID_HANDLE_VALUE)
+ CloseHandle(hwrite1);
+ if (hwrite2 != INVALID_HANDLE_VALUE)
+ CloseHandle(hwrite2);
+
+ /* send termination sequence to the thread */
+ WriteFile(hwrite, "\033[q", 3, &dummy, NULL);
+ CloseHandle(hwrite);
+
+ /* allow the thread to copy remaining data to the console */
+ WaitForSingleObject(hthread, 1000);
+ CloseHandle(hthread);
+}

- rv = ansi_emulate(str, stream);
+static void die_lasterr(const char *fmt, ...)
+{
+ va_list params;
+ va_start(params, fmt);
+ errno = err_win_to_posix(GetLastError());
+ die_errno(fmt, params);
+ va_end(params);
+}

- if (rv >= 0)
- return 0;
- else
- return EOF;
+static HANDLE duplicate_handle(HANDLE hnd)
+{
+ HANDLE hresult, hproc = GetCurrentProcess();
+ if (!DuplicateHandle(hproc, hnd, hproc, &hresult, 0, TRUE,
+ DUPLICATE_SAME_ACCESS))
+ die_lasterr("DuplicateHandle(%li) failed", (long) hnd);
+ return hresult;
}

-int winansi_vfprintf(FILE *stream, const char *format, va_list list)
+static HANDLE redirect_console(FILE *stream, HANDLE *phcon, int new_fd)
{
- int len, rv;
- char small_buf[256];
- char *buf = small_buf;
- va_list cp;
+ /* get original console handle */
+ int fd = _fileno(stream);
+ HANDLE hcon = (HANDLE) _get_osfhandle(fd);
+ if (hcon == INVALID_HANDLE_VALUE)
+ die_errno("_get_osfhandle(%i) failed", fd);

- if (!is_console(stream))
- goto abort;
+ /* save a copy to phcon and console (used by the background thread) */
+ console = *phcon = duplicate_handle(hcon);

- va_copy(cp, list);
- len = vsnprintf(small_buf, sizeof(small_buf), format, cp);
- va_end(cp);
+ /* duplicate new_fd over fd (closes fd and associated handle (hcon)) */
+ if (_dup2(new_fd, fd))
+ die_errno("_dup2(%i, %i) failed", new_fd, fd);

- if (len > sizeof(small_buf) - 1) {
- buf = malloc(len + 1);
- if (!buf)
- goto abort;
+ /* no buffering, or stdout / stderr will be out of sync */
+ setbuf(stream, NULL);
+ return (HANDLE) _get_osfhandle(fd);
+}

- len = vsnprintf(buf, len + 1, format, list);
- }
+void winansi_init(void)
+{
+ int con1, con2, hwrite_fd;

- rv = ansi_emulate(buf, stream);
+ /* check if either stdout or stderr is a console output screen buffer */
+ con1 = is_console(1);
+ con2 = is_console(2);
+ if (!con1 && !con2)
+ return;

- if (buf != small_buf)
- free(buf);
- return rv;
+ /* create an anonymous pipe */
+ if (!CreatePipe(&hread, &hwrite, NULL, BUFFER_SIZE))
+ die_lasterr("CreatePipe failed");

-abort:
- rv = vfprintf(stream, format, list);
- return rv;
-}
+ /* start console spool thread on the pipe's read end */
+ hthread = CreateThread(NULL, 0, console_thread, NULL, 0, NULL);
+ if (hthread == INVALID_HANDLE_VALUE)
+ die_lasterr("CreateThread(console_thread) failed");

-int winansi_fprintf(FILE *stream, const char *format, ...)
-{
- va_list list;
- int rv;
+ /* schedule cleanup routine */
+ if (atexit(winansi_exit))
+ die_errno("atexit(winansi_exit) failed");

- va_start(list, format);
- rv = winansi_vfprintf(stream, format, list);
- va_end(list);
+ /* create a file descriptor for the write end of the pipe */
+ hwrite_fd = _open_osfhandle((long) duplicate_handle(hwrite), _O_BINARY);
+ if (hwrite_fd == -1)
+ die_errno("_open_osfhandle(%li) failed", (long) hwrite);

- return rv;
+ /* redirect stdout / stderr to the pipe */
+ if (con1)
+ hwrite1 = redirect_console(stdout, &hconsole1, hwrite_fd);
+ if (con2)
+ hwrite2 = redirect_console(stderr, &hconsole2, hwrite_fd);
+
+ /* close pipe file descriptor (also closes the duped hwrite) */
+ close(hwrite_fd);
}

-int winansi_printf(const char *format, ...)
+static int is_same_handle(HANDLE hnd, int fd)
{
- va_list list;
- int rv;
+ return hnd != INVALID_HANDLE_VALUE && hnd == (HANDLE) _get_osfhandle(fd);
+}

- va_start(list, format);
- rv = winansi_vfprintf(stdout, format, list);
- va_end(list);
+/*
+ * Return true if stdout / stderr is a pipe redirecting to the console.
+ */
+int winansi_isatty(int fd)
+{
+ if (fd == 1 && is_same_handle(hwrite1, 1))
+ return 1;
+ else if (fd == 2 && is_same_handle(hwrite2, 2))
+ return 1;
+ else
+ return isatty(fd);
+}

- return rv;
+/*
+ * Returns the real console handle if stdout / stderr is a pipe redirecting
+ * to the console. Allows spawn / exec to pass the console to the next process.
+ */
+HANDLE winansi_get_osfhandle(int fd)
+{
+ if (fd == 1 && is_same_handle(hwrite1, 1))
+ return hconsole1;
+ else if (fd == 2 && is_same_handle(hwrite2, 2))
+ return hconsole2;
+ else
+ return (HANDLE) _get_osfhandle(fd);
}
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:11 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

All functions that modify the environment have memory leaks.

Disable gitunsetenv in the Makefile and use env_setenv (via mingw_putenv)
instead (this frees removed environment entries).

Move xstrdup from env_setenv to make_augmented_environ, so that
mingw_putenv no longer copies the environment entries (according to POSIX
[1], "the string [...] shall become part of the environment"). This also
fixes the memory leak in gitsetenv, which expects a POSIX compliant putenv.

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/putenv.html

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

Makefile | 2 --
compat/mingw.c | 10 ++++++----
compat/mingw.h | 1 +
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/Makefile b/Makefile
index 4594d81..71bc9a7 100644
--- a/Makefile
+++ b/Makefile
@@ -1095,7 +1095,6 @@ ifeq ($(uname_S),Windows)
NO_SYMLINK_HEAD = YesPlease
NO_IPV6 = YesPlease
NO_SETENV = YesPlease
- NO_UNSETENV = YesPlease
NO_STRCASESTR = YesPlease
NO_STRLCPY = YesPlease
NO_STRTOK_R = YesPlease
@@ -1188,7 +1187,6 @@ ifneq (,$(findstring MINGW,$(uname_S)))
NO_LIBGEN_H = YesPlease
NO_SYMLINK_HEAD = YesPlease
NO_SETENV = YesPlease
- NO_UNSETENV = YesPlease
NO_STRCASESTR = YesPlease
NO_STRLCPY = YesPlease
NO_STRTOK_R = YesPlease
diff --git a/compat/mingw.c b/compat/mingw.c
index 3702b06..93e7a58 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1238,14 +1238,14 @@ static char **env_setenv(char **env, const char *name)
for (i = 0; env[i]; i++)
;
env = xrealloc(env, (i+2)*sizeof(*env));
- env[i] = xstrdup(name);
+ env[i] = (char*) name;
env[i+1] = NULL;
}
}
else {
free(env[i]);
if (*eq)
- env[i] = xstrdup(name);
+ env[i] = (char*) name;
else
for (; env[i]; i++)
env[i] = env[i+1];
@@ -1260,8 +1260,10 @@ char **make_augmented_environ(const char *const *vars)
{
char **env = copy_environ();

- while (*vars)
- env = env_setenv(env, *vars++);
+ while (*vars) {
+ const char *v = *vars++;
+ env = env_setenv(env, strchr(v, '=') ? xstrdup(v) : v);
+ }
return env;
}

diff --git a/compat/mingw.h b/compat/mingw.h
index 29cbef7..131d2ec 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -201,6 +201,7 @@ char *mingw_getenv(const char *name);
#define getenv mingw_getenv
int mingw_putenv(const char *namevalue);
#define putenv mingw_putenv
+#define unsetenv mingw_putenv

Karsten Blees

unread,

Oct 6, 2011, 1:41:08 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Convert environment from UTF-8 to UTF-16 when creating other processes.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 2f346dc..b38a7b0 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -940,9 +940,9 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
{
STARTUPINFOW si;
PROCESS_INFORMATION pi;
- struct strbuf envblk, args;
- wchar_t wcmd[MAX_PATH], wdir[MAX_PATH], *wargs;
- unsigned flags;
+ struct strbuf args;
+ wchar_t wcmd[MAX_PATH], wdir[MAX_PATH], *wargs, *wenvblk = NULL;
+ unsigned flags = CREATE_UNICODE_ENVIRONMENT;
BOOL ret;

/* Determine whether or not we are associated to a console */
@@ -959,7 +959,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
* instead of CREATE_NO_WINDOW to make ssh
* recognize that it has no console.
*/
- flags = DETACHED_PROCESS;
+ flags |= DETACHED_PROCESS;
} else {
/* There is already a console. If we specified
* DETACHED_PROCESS here, too, Windows would
@@ -967,7 +967,6 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
* The same is true for CREATE_NO_WINDOW.
* Go figure!
*/
- flags = 0;
CloseHandle(cons);
}
memset(&si, 0, sizeof(si));
@@ -1006,6 +1005,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

if (env) {
int count = 0;
char **e, **sorted_env;

+ int i = 0, size = 0, envblksz = 0, envblkpos = 0;

for (e = env; *e; e++)
count++;
@@ -1015,20 +1015,22 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
memcpy(sorted_env, env, sizeof(*sorted_env) * (count + 1));
qsort(sorted_env, count, sizeof(*sorted_env), env_compare);

- strbuf_init(&envblk, 0);
- for (e = sorted_env; *e; e++) {
- strbuf_addstr(&envblk, *e);
- strbuf_addch(&envblk, '\0');
+ /* create environment block from temporary environment */
+ for (i = 0; sorted_env[i]; i++) {
+ size = 2 * strlen(sorted_env[i]) + 2;
+ ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
+ envblkpos += utftowcs(&wenvblk[envblkpos], sorted_env[i], size) + 1;
}
+ /* add final \0 terminator */
+ wenvblk[envblkpos] = 0;
free(sorted_env);

}

memset(&pi, 0, sizeof(pi));
ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,

- env ? envblk.buf : NULL, dir ? wdir : NULL, &si, &pi);
+ wenvblk, dir ? wdir : NULL, &si, &pi);

- if (env)
- strbuf_release(&envblk);
+ free(wenvblk);
free(wargs);

if (!ret) {
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:18 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Move environment array reallocation from do_putenv to the respective
callers. Keep track of the environment size in a global variable. Use
ALLOC_GROW in mingw_putenv to reduce reallocations. Allocate a
sufficiently sized environment array in make_environment_block to prevent
reallocations.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 50 +++++++++++++++++++++++++++++---------------------
1 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index aeae266..0f85f12 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -767,19 +767,19 @@ static int lookupenv(char **env, const char *name, size_t nmln)

/*
* If name contains '=', then sets the variable, otherwise it unsets it
+ * Size includes the terminating NULL. Env must have room for size + 1 entries
+ * (in case of insert). Returns the new size. Optionally frees removed entries.
*/
-static char **do_putenv(char **env, const char *name, int free_old)
+static int do_putenv(char **env, const char *name, int size, int free_old)
{
char *eq = strchrnul(name, '=');
int i = lookupenv(env, name, eq-name);

if (i < 0) {
if (*eq) {
- for (i = 0; env[i]; i++)
- ;
- env = xrealloc(env, (i+2)*sizeof(*env));
- env[i] = (char*) name;
- env[i+1] = NULL;
+ env[size - 1] = (char*) name;
+ env[size] = NULL;
+ size++;
}
}
else {
@@ -787,13 +787,20 @@ static char **do_putenv(char **env, const char *name, int free_old)

free(env[i]);
if (*eq)

env[i] = (char*) name;

- else
+ else {

for (; env[i]; i++)
env[i] = env[i+1];

+ size--;
+ }
}
- return env;
+ return size;
}

+/* used number of elements of environ array, including terminating NULL */
+static int environ_size = 0;
+/* allocated size of environ array, in bytes */
+static int environ_alloc = 0;
+

static char *do_getenv(const char *name)
{
size_t len = strlen(name);

@@ -821,7 +828,8 @@ char *mingw_getenv(const char *name)

int mingw_putenv(const char *namevalue)
{
- environ = do_putenv(environ, namevalue, 1);
+ ALLOC_GROW(environ, (environ_size + 1) * sizeof(char*), environ_alloc);
+ environ_size = do_putenv(environ, namevalue, environ_size, 1);
return 0;
}

@@ -1014,25 +1022,22 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)

static wchar_t *make_environment_block(char **deltaenv)
{
wchar_t *wenvblk = NULL;

- int count = 0;
char **tmpenv;
- int i = 0, size = 0, envblksz = 0, envblkpos = 0;
+ int i = 0, size = environ_size, envblksz = 0, envblkpos = 0;

- while (environ[count])
- count++;

+ while (deltaenv && deltaenv[i])

+ i++;

- /* copy the environment */
- tmpenv = xmalloc(sizeof(*tmpenv) * (count + 1));
- memcpy(tmpenv, environ, sizeof(*tmpenv) * (count + 1));
+ /* copy the environment, leaving space for changes */
+ tmpenv = xmalloc((size + i) * sizeof(char*));
+ memcpy(tmpenv, environ, size * sizeof(char*));

/* merge supplied environment changes into the temporary environment */

for (i = 0; deltaenv && deltaenv[i]; i++)

- tmpenv = do_putenv(tmpenv, deltaenv[i], 0);

+ size = do_putenv(tmpenv, deltaenv[i], size, 0);

/* environment must be sorted */
- for (count = 0; tmpenv[count]; count++)
- ;
- qsort(tmpenv, count, sizeof(*tmpenv), compareenv);
+ qsort(tmpenv, size - 1, sizeof(char*), compareenv);

/* create environment block from temporary environment */
for (i = 0; tmpenv[i]; i++) {

@@ -2065,7 +2070,9 @@ void mingw_startup()
maxlen = max(maxlen, wcslen(wenv[i]));

/* nedmalloc can't free CRT memory, allocate resizable environment list */

- environ = xcalloc(i + 1, sizeof(char*));
+ environ = NULL;
+ environ_size = i + 1;
+ ALLOC_GROW(environ, environ_size * sizeof(char*), environ_alloc);

/* allocate buffer (wchar_t encodes to max 3 UTF-8 bytes) */
maxlen = 3 * maxlen + 1;

@@ -2082,6 +2089,7 @@ void mingw_startup()
len = wcstoutf(buffer, wenv[i], maxlen);

environ[i] = xmemdupz(buffer, len);
}

+ environ[i] = NULL;

free(buffer);

/* initialize critical section for waitpid pinfo_t list */

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:14 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

...so that they can be reused by mingw_spawnve_fd.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 174 ++++++++++++++++++++++++++++----------------------------
1 files changed, 87 insertions(+), 87 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index dadabeb..425fac8 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -737,6 +737,93 @@ char *mingw_getcwd(char *pointer, int len)
return pointer;
}

+static int env_compare(const void *a, const void *b)
+{
+ char *const *ea = a;
+ char *const *eb = b;
+ return strcasecmp(*ea, *eb);
+}
+
+static int lookup_env(char **env, const char *name, size_t nmln)
+{
+ int i;

+ /*
+ * try case-sensitive match first, this is necessary for sh-i18n--envsubst
+ * to work with environment variables that differ only in case (e.g. $PATH
+ * and $path)
+ */

+ for (i = 0; env[i]; i++) {

+ if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])
+ /* matches */
+ return i;
+ }

+ /* if there's no case-sensitive match, try case-insensitive instead */
+ for (i = 0; env[i]; i++) {

+ if (!strncasecmp(env[i], name, nmln) && '=' == env[i][nmln])
+ return i;
+ }

+ return -1;
+}
+

+/*
+ * If name contains '=', then sets the variable, otherwise it unsets it
+ */
+static char **env_setenv(char **env, const char *name)
+{
+ char *eq = strchrnul(name, '=');
+ int i = lookup_env(env, name, eq-name);
+
+ if (i < 0) {
+ if (*eq) {
+ for (i = 0; env[i]; i++)
+ ;
+ env = xrealloc(env, (i+2)*sizeof(*env));
+ env[i] = (char*) name;
+ env[i+1] = NULL;
+ }
+ }
+ else {
+ free(env[i]);
+ if (*eq)
+ env[i] = (char*) name;
+ else
+ for (; env[i]; i++)
+ env[i] = env[i+1];
+ }
+ return env;
+}
+

+static char *do_getenv(const char *name)

+{
+ size_t len = strlen(name);
+ int i = lookup_env(environ, name, len);
+ if (i >= 0)
+ return environ[i] + len + 1; /* skip past name and '=' */
+ return NULL;
+}
+
+char *mingw_getenv(const char *name)
+{

+ char *result = do_getenv(name);

+ if (!result && !strcmp(name, "TMPDIR")) {
+ /* on Windows it is TMP and TEMP */

+ result = do_getenv("TMP");

+ if (!result)

+ result = do_getenv("TEMP");

+ }
+ else if (!result && !strcmp(name, "TERM")) {
+ /* simulate TERM to enable auto-color (see color.c) */
+ result = "winansi";
+ }
+ return result;
+}
+

+int mingw_putenv(const char *namevalue)
+{
+ environ = env_setenv(environ, namevalue);
+ return 0;
+}
+
/*

* See http://msdn2.microsoft.com/en-us/library/17w5ykft(vs.71).aspx
* (Parsing C++ Command-Line Arguments)
@@ -919,13 +1006,6 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
return prog;
}

-static int env_compare(const void *a, const void *b)
-{
- char *const *ea = a;
- char *const *eb = b;
- return strcasecmp(*ea, *eb);
-}
-
struct pinfo_t {
struct pinfo_t *next;
pid_t pid;
@@ -1206,55 +1286,6 @@ void free_environ(char **env)
free(env);
}

-static int lookup_env(char **env, const char *name, size_t nmln)
-{
- int i;
- /*
- * try case-sensitive match first, this is necessary for sh-i18n--envsubst
- * to work with environment variables that differ only in case (e.g. $PATH
- * and $path)
- */
- for (i = 0; env[i]; i++) {
- if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])
- /* matches */
- return i;
- }
- /* if there's no case-sensitive match, try case-insensitive instead */
- for (i = 0; env[i]; i++) {
- if (!strncasecmp(env[i], name, nmln) && '=' == env[i][nmln])
- return i;
- }
- return -1;
-}
-
-/*
- * If name contains '=', then sets the variable, otherwise it unsets it
- */
-static char **env_setenv(char **env, const char *name)
-{
- char *eq = strchrnul(name, '=');
- int i = lookup_env(env, name, eq-name);
-
- if (i < 0) {
- if (*eq) {

- for (i = 0; env[i]; i++)
- ;
- env = xrealloc(env, (i+2)*sizeof(*env));
- env[i] = (char*) name;
- env[i+1] = NULL;

- }
- }
- else {
- free(env[i]);
- if (*eq)

- env[i] = (char*) name;

- else
- for (; env[i]; i++)
- env[i] = env[i+1];
- }
- return env;
-}
-
/*
* Copies global environ and adjusts variables as specified by vars.
*/
@@ -1269,37 +1300,6 @@ char **make_augmented_environ(const char *const *vars)
return env;
}

-static char *do_getenv(const char *name)
-{
- size_t len = strlen(name);
- int i = lookup_env(environ, name, len);
- if (i >= 0)
- return environ[i] + len + 1; /* skip past name and '=' */
- return NULL;
-}
-
-char *mingw_getenv(const char *name)
-{
- char *result = do_getenv(name);
- if (!result && !strcmp(name, "TMPDIR")) {
- /* on Windows it is TMP and TEMP */
- result = do_getenv("TMP");
- if (!result)
- result = do_getenv("TEMP");
- }
- else if (!result && !strcmp(name, "TERM")) {
- /* simulate TERM to enable auto-color (see color.c) */
- result = "winansi";
- }
- return result;
-}
-
-int mingw_putenv(const char *namevalue)
-{
- environ = env_setenv(environ, namevalue);
- return 0;
-}
-

/*
* Note, this isn't a complete replacement for getaddrinfo. It assumes
* that service contains a numerical port, or that it is null. It

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:15 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Environment helper functions use random naming ('env' prefix or suffix or
both, with or without '_'). Change to POSIX naming scheme ('env' suffix,
no '_').

Env_setenv has more in common with putenv than setenv. Change to do_putenv.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 425fac8..8ea5fe0 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -737,14 +737,14 @@ char *mingw_getcwd(char *pointer, int len)
return pointer;
}

-static int env_compare(const void *a, const void *b)

+static int compareenv(const void *a, const void *b)
{

char *const *ea = a;

char *const *eb = b;

return strcasecmp(*ea, *eb);
}

-static int lookup_env(char **env, const char *name, size_t nmln)
+static int lookupenv(char **env, const char *name, size_t nmln)
{
int i;
/*
@@ -768,10 +768,10 @@ static int lookup_env(char **env, const char *name, size_t nmln)
/*

* If name contains '=', then sets the variable, otherwise it unsets it

*/
-static char **env_setenv(char **env, const char *name)

+static char **do_putenv(char **env, const char *name)
{

char *eq = strchrnul(name, '=');
- int i = lookup_env(env, name, eq-name);

+ int i = lookupenv(env, name, eq-name);

if (i < 0) {
if (*eq) {

@@ -796,7 +796,7 @@ static char **env_setenv(char **env, const char *name)

static char *do_getenv(const char *name)
{

size_t len = strlen(name);
- int i = lookup_env(environ, name, len);

+ int i = lookupenv(environ, name, len);
if (i >= 0)

return environ[i] + len + 1; /* skip past name and '=' */

return NULL;
@@ -820,7 +820,7 @@ char *mingw_getenv(const char *name)

int mingw_putenv(const char *namevalue)
{

- environ = env_setenv(environ, namevalue);

+ environ = do_putenv(environ, namevalue);
return 0;
}

@@ -1093,7 +1093,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

/* environment must be sorted */

sorted_env = xmalloc(sizeof(*sorted_env) * (count + 1));
memcpy(sorted_env, env, sizeof(*sorted_env) * (count + 1));
- qsort(sorted_env, count, sizeof(*sorted_env), env_compare);
+ qsort(sorted_env, count, sizeof(*sorted_env), compareenv);

/* create environment block from temporary environment */

for (i = 0; sorted_env[i]; i++) {
@@ -1295,7 +1295,7 @@ char **make_augmented_environ(const char *const *vars)

while (*vars) {

const char *v = *vars++;

- env = env_setenv(env, strchr(v, '=') ? xstrdup(v) : v);
+ env = do_putenv(env, strchr(v, '=') ? xstrdup(v) : v);
}
return env;
}
--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:16 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 55 ++++++++++++++++++++++++++++++++-----------------------
1 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 8ea5fe0..303b661 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1006,6 +1006,36 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
return prog;
}

+/*
+ * Create environment block suitable for CreateProcess.
+ */
+static wchar_t *make_environment_block(char **env)
+{
+ wchar_t *wenvblk = NULL;
+ int count = 0;
+ char **e, **tmpenv;
+ int i = 0, size = 0, envblksz = 0, envblkpos = 0;
+
+ for (e = env; *e; e++)
+ count++;
+
+ /* environment must be sorted */
+ tmpenv = xmalloc(sizeof(*tmpenv) * (count + 1));
+ memcpy(tmpenv, env, sizeof(*tmpenv) * (count + 1));
+ qsort(tmpenv, count, sizeof(*tmpenv), compareenv);
+
+ /* create environment block from temporary environment */
+ for (i = 0; tmpenv[i]; i++) {
+ size = 2 * strlen(tmpenv[i]) + 2;

+ ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);

+ envblkpos += utftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;

+ }
+ /* add final \0 terminator */
+ wenvblk[envblkpos] = 0;

+ free(tmpenv);
+ return wenvblk;
+}
+

struct pinfo_t {
struct pinfo_t *next;
pid_t pid;

@@ -1082,29 +1112,8 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

utftowcs(wargs, args.buf, 2 * args.len + 1);
strbuf_release(&args);

- if (env) {

- int count = 0;

- char **e, **sorted_env;
- int i = 0, size = 0, envblksz = 0, envblkpos = 0;
-
- for (e = env; *e; e++)
- count++;
-
- /* environment must be sorted */
- sorted_env = xmalloc(sizeof(*sorted_env) * (count + 1));
- memcpy(sorted_env, env, sizeof(*sorted_env) * (count + 1));
- qsort(sorted_env, count, sizeof(*sorted_env), compareenv);
-
- /* create environment block from temporary environment */
- for (i = 0; sorted_env[i]; i++) {
- size = 2 * strlen(sorted_env[i]) + 2;
- ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
- envblkpos += utftowcs(&wenvblk[envblkpos], sorted_env[i], size) + 1;
- }
- /* add final \0 terminator */
- wenvblk[envblkpos] = 0;
- free(sorted_env);
- }
+ if (env)
+ wenvblk = make_environment_block(env);

memset(&pi, 0, sizeof(pi));
ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:13 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

The only public spawn function that needs to tweak the environment is
mingw_spawnvpe (called from start_command). Nevertheless, all internal
spawn* functions take an env parameter and needlessly pass the global
char **environ around. Remove the env parameter where it's not needed.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 22 ++++++++--------------
1 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 7c8a1bc..dadabeb 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -1060,10 +1060,9 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

return (pid_t)pi.dwProcessId;
}

-static pid_t mingw_spawnve(const char *cmd, const char **argv, char **env,
- int prepend_cmd)
+static pid_t mingw_spawnv(const char *cmd, const char **argv, int prepend_cmd)
{
- return mingw_spawnve_fd(cmd, argv, env, NULL, prepend_cmd, 0, 1, 2);

+ return mingw_spawnve_fd(cmd, argv, environ, NULL, prepend_cmd, 0, 1, 2);

}

pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,

@@ -1105,7 +1104,7 @@ pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,

return pid;
}

-static int try_shell_exec(const char *cmd, char *const *argv, char **env)
+static int try_shell_exec(const char *cmd, char *const *argv)
{
const char *interpr = parse_interpreter(cmd);
char **path;

@@ -1123,7 +1122,7 @@ static int try_shell_exec(const char *cmd, char *const *argv, char **env)

argv2 = xmalloc(sizeof(*argv) * (argc+1));
argv2[0] = (char *)cmd; /* full path to the script file */
memcpy(&argv2[1], &argv[1], sizeof(*argv) * argc);
- pid = mingw_spawnve(prog, argv2, env, 1);
+ pid = mingw_spawnv(prog, argv2, 1);
if (pid >= 0) {
int status;
if (waitpid(pid, &status, 0) < 0)

@@ -1138,13 +1137,13 @@ static int try_shell_exec(const char *cmd, char *const *argv, char **env)

return pid;
}

-static void mingw_execve(const char *cmd, char *const *argv, char *const *env)
+void mingw_execv(const char *cmd, char *const *argv)
{
/* check if git_command is a shell script */
- if (!try_shell_exec(cmd, argv, (char **)env)) {
+ if (!try_shell_exec(cmd, argv)) {
int pid, status;

- pid = mingw_spawnve(cmd, (const char **)argv, (char **)env, 0);
+ pid = mingw_spawnv(cmd, (const char **)argv, 0);
if (pid < 0)
return;
if (waitpid(pid, &status, 0) < 0)

@@ -1159,7 +1158,7 @@ void mingw_execvp(const char *cmd, char *const *argv)

char *prog = path_lookup(cmd, path, 0);

if (prog) {
- mingw_execve(prog, argv, environ);
+ mingw_execv(prog, argv);
free(prog);
} else
errno = ENOENT;

@@ -1167,11 +1166,6 @@ void mingw_execvp(const char *cmd, char *const *argv)

free_path_split(path);
}

-void mingw_execv(const char *cmd, char *const *argv)
-{
- mingw_execve(cmd, argv, environ);
-}
-
int mingw_kill(pid_t pid, int sig)
{
if (pid > 0 && sig == SIGTERM) {

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:17 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

When spawning child processes via start_command(), the environment and all
environment entries are copied twice. First by make_augmented_environ /
copy_environ to merge with child_process.env. Then a second time by
make_environment_block to create a sorted environment block string as
required by CreateProcess.

Move the merge logic to make_environment_block so that we only need to copy
the environment once. This changes semantics of the env parameter: it now
expects a delta (such as child_process.env) rather than a full environment.
This is not a problem as the parameter is only used by start_command()
(all other callers previously passed char **environ, and now pass NULL).

The merge logic no longer xstrdup()s the environment strings, so do_putenv
must not free them. Add a parameter to distinguish this from normal putenv.

Remove the now unused make_augmented_environ / free_environ API.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 76 ++++++++++++++++++-------------------------------------
compat/mingw.h | 6 ----
run-command.c | 10 +------
3 files changed, 27 insertions(+), 65 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 303b661..aeae266 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -768,7 +768,7 @@ static int lookupenv(char **env, const char *name, size_t nmln)

/*
* If name contains '=', then sets the variable, otherwise it unsets it
*/

-static char **do_putenv(char **env, const char *name)
+static char **do_putenv(char **env, const char *name, int free_old)

{
char *eq = strchrnul(name, '=');

int i = lookupenv(env, name, eq-name);

@@ -783,7 +783,8 @@ static char **do_putenv(char **env, const char *name)

}
}
else {
- free(env[i]);

+ if (free_old)
+ free(env[i]);
if (*eq)

env[i] = (char*) name;

else
@@ -820,7 +821,7 @@ char *mingw_getenv(const char *name)

int mingw_putenv(const char *namevalue)
{
- environ = do_putenv(environ, namevalue);
+ environ = do_putenv(environ, namevalue, 1);
return 0;
}

@@ -1007,21 +1008,30 @@ static char *path_lookup(const char *cmd, char **path, int exe_only)
}

/*
- * Create environment block suitable for CreateProcess.
+ * Create environment block suitable for CreateProcess. Merges current
+ * process environment and the supplied environment changes.
*/
-static wchar_t *make_environment_block(char **env)
+static wchar_t *make_environment_block(char **deltaenv)
{
wchar_t *wenvblk = NULL;
int count = 0;
- char **e, **tmpenv;
+ char **tmpenv;

int i = 0, size = 0, envblksz = 0, envblkpos = 0;

- for (e = env; *e; e++)
+ while (environ[count])
count++;

- /* environment must be sorted */

+ /* copy the environment */

tmpenv = xmalloc(sizeof(*tmpenv) * (count + 1));

- memcpy(tmpenv, env, sizeof(*tmpenv) * (count + 1));
+ memcpy(tmpenv, environ, sizeof(*tmpenv) * (count + 1));
+
+ /* merge supplied environment changes into the temporary environment */
+ for (i = 0; deltaenv && deltaenv[i]; i++)
+ tmpenv = do_putenv(tmpenv, deltaenv[i], 0);

+
+ /* environment must be sorted */

+ for (count = 0; tmpenv[count]; count++)

+ ;
qsort(tmpenv, count, sizeof(*tmpenv), compareenv);

/* create environment block from temporary environment */

@@ -1044,7 +1054,7 @@ struct pinfo_t {
struct pinfo_t *pinfo = NULL;
CRITICAL_SECTION pinfo_cs;

-static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,
+static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaenv,

const char *dir,
int prepend_cmd, int fhin, int fhout, int fherr)
{

@@ -1112,8 +1122,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

utftowcs(wargs, args.buf, 2 * args.len + 1);
strbuf_release(&args);

- if (env)

- wenvblk = make_environment_block(env);
+ wenvblk = make_environment_block(deltaenv);

memset(&pi, 0, sizeof(pi));
ret = CreateProcessW(wcmd, wargs, NULL, NULL, TRUE, flags,

@@ -1151,10 +1160,10 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **env,

static pid_t mingw_spawnv(const char *cmd, const char **argv, int prepend_cmd)
{

- return mingw_spawnve_fd(cmd, argv, environ, NULL, prepend_cmd, 0, 1, 2);
+ return mingw_spawnve_fd(cmd, argv, NULL, NULL, prepend_cmd, 0, 1, 2);
}

-pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,
+pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **deltaenv,
const char *dir,

int fhin, int fhout, int fherr)
{

@@ -1178,14 +1187,14 @@ pid_t mingw_spawnvpe(const char *cmd, const char **argv, char **env,
pid = -1;
}
else {
- pid = mingw_spawnve_fd(iprog, argv, env, dir, 1,
+ pid = mingw_spawnve_fd(iprog, argv, deltaenv, dir, 1,
fhin, fhout, fherr);
free(iprog);
}
argv[0] = argv0;
}
else
- pid = mingw_spawnve_fd(prog, argv, env, dir, 0,
+ pid = mingw_spawnve_fd(prog, argv, deltaenv, dir, 0,
fhin, fhout, fherr);
free(prog);
}
@@ -1274,41 +1283,6 @@ int mingw_kill(pid_t pid, int sig)
return -1;
}

-static char **copy_environ(void)
-{
- char **env;
- int i = 0;
- while (environ[i])
- i++;
- env = xmalloc((i+1)*sizeof(*env));
- for (i = 0; environ[i]; i++)
- env[i] = xstrdup(environ[i]);
- env[i] = NULL;

- return env;
-}
-

-void free_environ(char **env)
-{
- int i;
- for (i = 0; env[i]; i++)
- free(env[i]);
- free(env);
-}
-
-/*
- * Copies global environ and adjusts variables as specified by vars.
- */
-char **make_augmented_environ(const char *const *vars)
-{
- char **env = copy_environ();
-
- while (*vars) {
- const char *v = *vars++;
- env = do_putenv(env, strchr(v, '=') ? xstrdup(v) : v);

- }
- return env;
-}
-
/*

* Note, this isn't a complete replacement for getaddrinfo. It assumes
* that service contains a numerical port, or that it is null. It

diff --git a/compat/mingw.h b/compat/mingw.h
index 131d2ec..3913b34 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -331,12 +331,6 @@ void mingw_open_html(const char *path);
void mingw_mark_as_git_dir(const char *dir);
#define mark_as_git_dir mingw_mark_as_git_dir

-/*
- * helpers
- */
-
-char **make_augmented_environ(const char *const *vars);
-void free_environ(char **env);

/**

* Converts UTF-8 encoded string to UTF-16LE.

diff --git a/run-command.c b/run-command.c
index a8be1b1..112489e 100644
--- a/run-command.c
+++ b/run-command.c
@@ -316,7 +316,6 @@ fail_pipe:
{
int fhin = 0, fhout = 1, fherr = 2;
const char **sargv = cmd->argv;
- char **env = environ;

if (cmd->no_stdin)
fhin = open("/dev/null", O_RDWR);
@@ -341,23 +340,18 @@ fail_pipe:
else if (cmd->out > 1)
fhout = dup(cmd->out);

- if (cmd->env)
- env = make_augmented_environ(cmd->env);
-
if (cmd->git_cmd) {
cmd->argv = prepare_git_cmd(cmd->argv);
} else if (cmd->use_shell) {
cmd->argv = prepare_shell_cmd(cmd->argv);
}

- cmd->pid = mingw_spawnvpe(cmd->argv[0], cmd->argv, env, cmd->dir,
- fhin, fhout, fherr);
+ cmd->pid = mingw_spawnvpe(cmd->argv[0], cmd->argv, (char**) cmd->env,
+ cmd->dir, fhin, fhout, fherr);
failed_errno = errno;
if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
error("cannot spawn %s: %s", cmd->argv[0], strerror(errno));

- if (cmd->env)
- free_environ(env);
if (cmd->git_cmd)
free(cmd->argv);

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:19 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

The Windows environment is sorted, keep it that way for O(log n)
environment access.

Change compareenv to compare only the keys, so that it can be used to
find an entry irrespective of the value.

Change lookupenv to binary seach for an entry. Return one's complement of
the insert position if not found (libc's bsearch return's NULL). If a
case-insensitive match is found, try to find a better case-sensitive match
around that entry.

Change do_putenv to insert new entries at the correct position. Simplify
the function by swapping if conditions and using memmove instead of for
loops.

Move qsort from make_environment_block to mingw_startup. We still need to
sort on startup to make sure that the environment is sorted according to
our compareenv function (while Win32 / CreateProcess requires the
environment block to be sorted case-insensitively, CreateProcess currently
doesn't enforce this, and some applications such as bash just don't care).

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 110 ++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 0f85f12..dc905bb 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -737,32 +737,65 @@ char *mingw_getcwd(char *pointer, int len)
return pointer;
}

-static int compareenv(const void *a, const void *b)
+/*
+ * Compare environment entries by key (i.e. stopping at '=' or '\0').
+ */
+static int compareenv(const void *v1, const void *v2)
{
- char *const *ea = a;
- char *const *eb = b;
- return strcasecmp(*ea, *eb);
+ const char *e1 = *(const char**)v1;
+ const char *e2 = *(const char**)v2;
+
+ for (;;) {
+ int c1 = *e1++;
+ int c2 = *e2++;
+ c1 = (c1 == '=') ? 0 : tolower(c1);
+ c2 = (c2 == '=') ? 0 : tolower(c2);
+ if (c1 > c2)
+ return 1;
+ if (c1 < c2)
+ return -1;
+ if (c1 == 0)
+ return 0;
+ }
}

-static int lookupenv(char **env, const char *name, size_t nmln)
+static int bsearchenv(char **env, const char *name, size_t size)
{
- int i;
+ unsigned low = 0, high = size - 1, mid, i, j, nmln;
+ int cmp;
+ while (low <= high) {
+ mid = (low + high) >> 1;
+ cmp = compareenv(&env[mid], &name);
+ if (cmp < 0)
+ low = mid + 1;
+ else if (cmp > 0)
+ high = mid - 1;
+ else
+ goto found;
+ }
+ return ~low; /* not found, return 1's complement of insert position */
+
+found:
+ /* look for duplicate entries before and after the matching entry (mid) */
+ i = j = mid;
+ while (i > low && !compareenv(&env[i - 1], &name))
+ i--;
+ while (j < high && !compareenv(&env[j + 1], &name))
+ j++;
+ if (i == j)
+ return mid; /* no duplicates */
+
/*

* try case-sensitive match first, this is necessary for sh-i18n--envsubst

* to work with environment variables that differ only in case (e.g. $PATH

* and $path)
*/
- for (i = 0; env[i]; i++) {
+ nmln = strlen(name);
+ for (; i <= j; i++)

if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])

/* matches */

return i;
- }
- /* if there's no case-sensitive match, try case-insensitive instead */

- for (i = 0; env[i]; i++) {

- if (!strncasecmp(env[i], name, nmln) && '=' == env[i][nmln])
- return i;
- }
- return -1;

+ return mid;
}

/*
@@ -772,26 +805,24 @@ static int lookupenv(char **env, const char *name, size_t nmln)
*/
static int do_putenv(char **env, const char *name, int size, int free_old)
{
- char *eq = strchrnul(name, '=');
- int i = lookupenv(env, name, eq-name);
+ int i = bsearchenv(env, name, size - 1);
+
+ /* optionally free removed / replaced entry */
+ if (i >= 0 && free_old)
+ free(env[i]);

- if (i < 0) {
- if (*eq) {

- env[size - 1] = (char*) name;
- env[size] = NULL;
+ if (strchr(name, '=')) {
+ /* if new value ('key=value') is specified, insert or replace entry */
+ if (i < 0) {
+ i = ~i;
+ memmove(&env[i + 1], &env[i], (size - i) * sizeof(char*));
size++;
}
- }
- else {
- if (free_old)
- free(env[i]);

- if (*eq)
- env[i] = (char*) name;
- else {

- for (; env[i]; i++)
- env[i] = env[i+1];
- size--;
- }
+ env[i] = (char*) name;
+ } else if (i >= 0) {
+ /* otherwise ('key') remove existing entry */
+ size--;
+ memmove(&env[i], &env[i + 1], (size - i) * sizeof(char*));
}
return size;
}
@@ -803,11 +834,12 @@ static int environ_alloc = 0;

static char *do_getenv(const char *name)
{

- size_t len = strlen(name);
- int i = lookupenv(environ, name, len);

- if (i >= 0)

- return environ[i] + len + 1; /* skip past name and '=' */
- return NULL;
+ char *value;
+ int pos = bsearchenv(environ, name, environ_size - 1);
+ if (pos < 0)
+ return NULL;
+ value = strchr(environ[pos], '=');
+ return value ? &value[1] : NULL;
}

char *mingw_getenv(const char *name)
@@ -1036,9 +1068,6 @@ static wchar_t *make_environment_block(char **deltaenv)

for (i = 0; deltaenv && deltaenv[i]; i++)

size = do_putenv(tmpenv, deltaenv[i], size, 0);

- /* environment must be sorted */

- qsort(tmpenv, size - 1, sizeof(char*), compareenv);
-

/* create environment block from temporary environment */

for (i = 0; tmpenv[i]; i++) {

size = 2 * strlen(tmpenv[i]) + 2;

@@ -2092,6 +2121,9 @@ void mingw_startup()

environ[i] = NULL;
free(buffer);

+ /* sort environment for O(log n) getenv / putenv */
+ qsort(environ, i, sizeof(char*), compareenv);
+

/* initialize critical section for waitpid pinfo_t list */

InitializeCriticalSection(&pinfo_cs);

--
1.7.6.msysgit.1

Karsten Blees

unread,

Oct 6, 2011, 1:41:20 PM10/6/11

to bl...@dcon.de, msy...@googlegroups.com

...rather than checking for special values on every getenv call.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 33 ++++++++++++++++-----------------
1 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index dc905bb..77ea5d9 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -832,7 +832,7 @@ static int environ_size = 0;

/* allocated size of environ array, in bytes */

static int environ_alloc = 0;

-static char *do_getenv(const char *name)

+char *mingw_getenv(const char *name)
{
char *value;

int pos = bsearchenv(environ, name, environ_size - 1);

@@ -842,22 +842,6 @@ static char *do_getenv(const char *name)

return value ? &value[1] : NULL;
}

-char *mingw_getenv(const char *name)
-{
- char *result = do_getenv(name);
- if (!result && !strcmp(name, "TMPDIR")) {
- /* on Windows it is TMP and TEMP */
- result = do_getenv("TMP");
- if (!result)
- result = do_getenv("TEMP");
- }
- else if (!result && !strcmp(name, "TERM")) {
- /* simulate TERM to enable auto-color (see color.c) */
- result = "winansi";
- }
- return result;
-}
-

int mingw_putenv(const char *namevalue)
{

ALLOC_GROW(environ, (environ_size + 1) * sizeof(char*), environ_alloc);

@@ -2124,6 +2108,21 @@ void mingw_startup()

/* sort environment for O(log n) getenv / putenv */

qsort(environ, i, sizeof(char*), compareenv);

+ /* fix Windows specific environment settings */
+

+ /* on Windows it is TMP and TEMP */

+ if (!getenv("TMPDIR")) {
+ const char *tmp = getenv("TMP");
+ if (!tmp)
+ tmp = getenv("TEMP");
+ if (tmp)
+ setenv("TMPDIR", tmp, 1);
+ }
+

+ /* simulate TERM to enable auto-color (see color.c) */

+ if (!getenv("TERM"))
+ setenv("TERM", "winansi", 1);

Kirill

unread,

Oct 6, 2011, 11:51:10 PM10/6/11

to Karsten Blees, msy...@googlegroups.com

Karsten,

Frankly, I only briefly looked at your series, but...
NO MATTER what others say, this is EXTREMELY IMPRESSIVE! Although, I'm not
affected by international home dir names and all-languages-mixed file names
in my source trees, I GREATLY appreciate your effort.

Wish you BEST OF LUCK with the series!

--
Kirill.

Jörg Rosenkranz

unread,

Oct 7, 2011, 2:42:21 AM10/7/11

to Karsten Blees, msy...@googlegroups.com

Sorry, this should have gone to the list too:

2011/10/6 Karsten Blees <bl...@dcon.de>:
> So far I haven't encountered any problems with the Unicode-aware msys.dll (http://groups.google.com/group/msysgit/browse_thread/thread/981da260b4fcb1df), and there was no feedback from the mailing list either (this could be good or bad :-).

To give you some positive feedback: I'm using your last Unicode
installer (Git-1.7.5-unicode-20110720) for some weeks now and haven't
encountered any problems yet.

Thanks for the effort!
Joerg.

Johannes Schindelin

unread,

Oct 7, 2011, 4:35:08 PM10/7/11

to Karsten Blees, msy...@googlegroups.com

Hi Karsten,

On Thu, 6 Oct 2011, Karsten Blees wrote:

> I just pushed v10 of the Unicode patches for msysgit to
> http://repo.or.cz/w/git/mingw/4msysgit/kblees.git.

Thanks for being so persistent with this issue.

For starters, however, I would like to integrate your work on "less"
first. Could you make that a branch in your repository, too?

Thanks!
Dscho

karste...@dcon.de

unread,

Oct 7, 2011, 9:02:15 PM10/7/11

to Johannes Schindelin, Karsten Blees, msy...@googlegroups.com

http://repo.or.cz/w/git/mingw/4msysgit/kblees.git.

Not exactly, as it's a git fork...just gave birth to kblees/msysgit on github instead:

less-444: https://github.com/kblees/msysgit/commits/kb/msys/less
Unicode-aware msys.dll: https://github.com/kblees/msysgit/commits/kb/msys/unicode

I haven't tested this in isolation, though. I believe less and the msys.dll console handler need some common understanding of the current cursor position for scrolling to work correctly, and teaching UTF-8 to the msys.dll console handler is a major part of that patch...

However, updating the msys.dll without applying the git unicode patches as well will break some tests, starting with basics such as t0050-filesystem.

You could probably try the new less without the LESSCHARSET=utf-8 setting.

Ciao,
Karsten

OT: I'm still a bit confused about branches in the msysgit repo...I think "devel" is the development environment for git and other mingw-based stuff, while "msys" is for customized msys tools, with binaries cherry-picked to devel, is that basically correct? It gave me quite a headache that some msys stuff (such as src/rt) is also in the devel branch, but won't build there at all...

Konstantin Khomoutov

unread,

Oct 9, 2011, 11:32:36 AM10/9/11

to msy...@googlegroups.com, Karsten Blees

Hi!

I have just tested your 1.7.6 installer with integrated Unicode patches
on Russian Windows XP Pro SP3 and would like to report on my experience.

What's presented below, were done in a regular cmd.exe window.

I first created a test repo with a simple all-ASCII pathname
(C:\tmp\foo) and created a file with simple Cyrillic name "тест.txt"
("тест" stands for "test" in Russian).
`git add .` went OK, but display presented with `git status` and
`git ls-files` showed \xxx escapes for Cyrillic letters in the file
listed (I think that were escapes for UTF-8 bytes comprising the file
name in fact). So I googled the core.quotepath configuration option and
set it to false, now the filename was shown in some garbled encoding but
Git hinted me to set a Unicode font on the console, so I set Lucida.
That finally fixed displaying but for some reason did not prevented Git
from nagging me about my console supposedly not supporting Unicode
(setting Lucida "for all windows of the same type", that is, globally,
did not fix the problem).

Committing the added file went OK, but `git show` afterwards barked at me:

C:\tmp\foo>git show
WARNING: terminal is not fully functional
- (press RETURN)

Hitting return presented the diff where the filename appeared OK.
I have a feeling this might be a problem with less, not Git.

git-gui and gitk showed no problems displaying the added file name.
I then played with the repository a bit, renaming the file to another
Cyrillic name etc, verifying each stage with git-gui and gitk--all went
OK except for that non-unicode console and not fully-functional
terminal nagging.

I then logged in with a user with cyrillic username ("Администратор")
and created another repo--this time with the repo directory itself named
using Cyrillic letters.
Git had no apparent problems setting the user.email and user.name
configuration variables globally (via `git config --add --global ...`)
which are stored somewhere under %USERPROFILE% as I understand this.

Then I created a file in a subdirectory of the repo, with both the
subdirectory and the file named using Cyrillic letters.
Again, no problems beyond those described above with the git itself.
But then I tried to fire git-gui from inside the repo directory and it
failed to start with this error message:

---------------------------
Отсутствует рабочий каталог C:/tmp/РґРёСЂРµРєС‚РѕСЂРёСЏ:

couldn't change working directory to "C:/tmp/РґРёСЂРµРєС‚РѕСЂРёСЏ": no such file or directory
---------------------------

The first error, translated from Russian, means
"Missing working directory C:/tmp/...:"

The name of the directory looks rather typical to me: it's an UTF-8
string interpreted as containing characters in Windows-1251.
I did not look at the code yet but supposedly git-gui somehow asks git
about its working directory and it returns an UTF-8 encoded pathname.
If this pathname is read using a pipe from git's stdout, it may well be
interpreted as being in Windows-1251 as this is the standard Tcl's
behaviour on Windows--to assume the standard handles of spawned
processes operate with the currently active codepage.

gitk worked without any apparent problems.

The only remaining issue which manifested itself in all my experiments
is that dreaded commit encoding problem. I have gvim.exe (version 7.3)
as my $EDITOR and it assumes all files it opens on Windows are encoded
using the currently active codepage (cp1251, again).
Hence two problems:
* Filenames in the commit message file are displayed botched.
* If nothing is done to fix whis, like forced setting of the output
file's encoding to be UTF-8, commit message will be composed in the
system's codepage and will appear unreadable in both git's output and
its GUI tools.

I failed to verify the behaviour of notepad.exe with regard to commit
encodings because the commit message is generated with sole LFs as
line ending markers and so notepad shows it as one long string.

Some sort of summary follows.

So, here are slight technical glitches and some conceptual issues.

Techincal glitches are:
* Reporting about non-Unicode console while it has Unicode font set.
* Reporting something about broken terminal (less problem?).
* git-gui failing to run from within a directory with non-ASCII
characters in its name.
I beleive these can be fixed.

Now onto conceptual issues.
I take Subversion as being the gold standard in the field of supporting
system's native language on Windows among free VCS tools--with it,
filenames and commit messages "just work" in the eyes of a typical
inexperienced developer, both in untweaked console window and "in the GUI"
(everyone uses TortoiseSVN). Please note that's not too subjective:
every Windows developer I discussed the issue of supporting native
language in [D]VCS tools on Windows with agrees on Subversion as being
the only one doing it right (well, may be it's actually the only thing
it does right).

Compared to Subversion, Git for Windows with Unicode patch has these
issues:
* Console window has to be tweaked.
* core.quotepath config variable has to be set to false.
* The user has to somehow force their text editor to interpret and save
the commit message as being composed in UTF-8 or resort to using
i18n.*encoding variables.

I don't know if these issues can be solved at all to match the
experience developers get with Subversion but I thought they are
worth summarising here anyway.

All in all, this patch series considerably improves the current
situation, thanks Karsten and everyone involved!

(And I'll try to do more extensive testing between Windows and Linux
machines.)

karste...@dcon.de

unread,

Oct 9, 2011, 2:26:37 PM10/9/11

to Konstantin Khomoutov, Karsten Blees, msy...@googlegroups.com

Konstantin Khomoutov <flat...@users.sourceforge.net> wrote on 09.10.2011 17:32:36: > Hi! > > I have just tested your 1.7.6 installer with integrated Unicode patches > on Russian Windows XP Pro SP3 and would like to report on my experience. >

Thanks for testing and the detailed feedback!

> C:\tmp\foo>git show > WARNING: terminal is not fully functional > - (press RETURN) >

That's a less warning if the terminal is lacking some capability, set TERM=cygwin or msys in ControlPanel. I had the same problem after installing StrawberryPerl, this added a global TERM=dumb setting.

> But then I tried to fire git-gui from inside the repo directory and it > failed to start with this error message: > > --------------------------- > Отсутствует рабочий каталог C:/tmp/РґРёСЂРµРєС‚РѕСЂРёСЏ: > > couldn't change working directory to "C:/tmp/РґРёСЂРµРєС‚РѕСЂРёСЏ": > no such file or directory > --------------------------- >

I cannot reproduce that, git-gui works fine for me with non-ASCII user names. Perhaps that user's TEMP setting is broken?

> The only remaining issue which manifested itself in all my experiments > is that dreaded commit encoding problem. I have gvim.exe (version 7.3) > as my $EDITOR and it assumes all files it opens on Windows are encoded > using the currently active codepage (cp1251, again). > Hence two problems: > * Filenames in the commit message file are displayed botched. > * If nothing is done to fix whis, like forced setting of the output > file's encoding to be UTF-8, commit message will be composed in the > system's codepage and will appear unreadable in both git's output and > its GUI tools. >

I don't use [g]vim if I can help it, but the vim wiki suggests ":e ++enc=utf-8"...there's probably a .[g]vimrc to set the default encoding, I don't know.

> I failed to verify the behaviour of notepad.exe with regard to commit > encodings because the commit message is generated with sole LFs as > line ending markers and so notepad shows it as one long string. >

Notepad.exe saves UTF-8 with BOM (and of course CRLF), so it's not really an option. I'm using Notepad++ as general purpose editor, that supports UTF-8 without BOM and LF-only line breaks (never tried it with git though).

> Techincal glitches are: > * Reporting about non-Unicode console while it has Unicode font set.

On Windows XP, you need to set the console font of the default console window to a TrueType font (not the Git Bash window). That's because on XP there is no API to check the font, we have to rely on the registry (HKEY_CURRENT_USER/Console/FontFamily).

> Compared to Subversion, Git for Windows with Unicode patch has these > issues: > * Console window has to be tweaked.

I bet SVN has the exact same problems with fonts, as it's inherent to the Windows console. This may not be apparent as long as you stay within your OEM charset, but trying to list e.g. a russian filename on my german Windows box will definitely not work with a raster font.

> * core.quotepath config variable has to be set to false.

Core.quotepath=on is the default even on Unix, probably for good reason. So maybe this is an issue for the git mailing list?

> * The user has to somehow force their text editor to interpret and save > the commit message as being composed in UTF-8 or resort to using > i18n.*encoding variables. >

Setting i18n.commitencoding will prevent specifying commit messages on the command line, and probably has other issues (e.g. mixing utf-8 user name with cp125x commit message). This should be fixed on the editor side I think.

karste...@dcon.de

unread,

Oct 9, 2011, 3:04:43 PM10/9/11

to Karsten Blees, bl...@dcon.de, msy...@googlegroups.com

I found a bug in the new case-sensitive bsearchenv code: 'name' can be a "key=value" pair from putenv, but we only want to compare the key, so it should be:
@@ -790,7 +790,7 @@ found:

* to work with environment variables that differ only in case (e.g. $PATH
* and $path)
*/

- nmln = strlen(name);
+ nmln = strchrnul(name, '=') - name;

for (; i <= j; i++)
if (!strncmp(env[i], name, nmln) && '=' == env[i][nmln])
/* matches */

The Unicode patches with this fix, rebased to tentative (i.e. v1.7.7) are here:
http://repo.or.cz/w/git/mingw/4msysgit/kblees.git/kb/unicode-v11

And the git installer for this version (Unicode msys.dll, less-444, git v1.7.7 with unicode-v11 patches):
https://docs.google.com/uc?id=0BxXUoUg2r8ftYWQ2NGY2MTItY2E2ZC00NTk5LThiNTEtZDMzOTA1YmFmN2Fl&export=download

$ /share/msysGit/run-tests.sh
These tests failed:
t1020-subdirectory
t1402-check-ref-format
t5510-fetch
t5701-clone-local
t5704-bundle
t9001-send-email
t9300-fast-import
t9901-git-web--browse
Unfinished tests:
t9200-git-cvsexportcommit

I've also put together a small Git for Windows Unicode Howto. As the Wiki is still down, I'll post it here...

= Git for Windows Unicode Howto =

== Migrating old Git for Windows repositories ==

This is only relevant if you used non-ASCII file names with previous Git for Windows versions (V1.7.6 and earlier).

Previous Git for Windows versions stored file names in the default encoding of the originating Windows system, making these repositories incompatible with other Windows language-versions and other Git versions (including Cygwin-git and JGit / EGit on Windows).

The Unicode-enabled Git for Windows stores file names UTF-8 encoded.

=== Migration with previous Git for Windows version available ===

The simplest way to convert old repositories is by keeping an old Git for Windows version around:

# With the old Git for Windows version: check out a completely clean state of the working copy (so <tt>git status</tt> reports nothing, not even untracked files). This can be achieved by manually deleting everything except the .git directory, and then doing a <tt>git reset --hard</tt>.
# With the new Git for Windows version: <tt>git status</tt> with the new version should now report non-ASCII file names as deleted (with mangled file name) as well as untracked (with correct file name). Add the changes to the repository with <tt>git add --all & git commit -m "UTF-8 conversion"</tt>.
# Repeat for every branch of interest

=== Migration without previous Git for Windows available ===

This requires renaming all non-ASCII file names manually.

# <tt>git status</tt> will report non-ASCII file names as deleted (with mangled names) as well as untracked (also with mangled names). Fix the file names in the working copy.
# Add the changes to the repository, e.g. with <tt>git add --all & git commit -m "UTF-8 conversion"</tt>.
# Repeat for every branch of interest

=== Convert config files ===

Git config files with non-ASCII content need to be converted to UTF-8, for example your name in %HOME%/.gitconfig, or .gitattributes / .gitignore / .gitmodules files pointing to non-ASCII paths.

== Settings ==

=== Windows settings ===

==== Console font (per user) ====

The default console font does not support Unicode. Change the console font to a TrueType font such as Lucida Console or Consolas. The setup program does this automatically, but only for the installing user.

=== Git settings ===

These can be set per user (with the --global option) or per repository, the repository settings take precedence.

==== Disable quoted file names ====

By default, git will print non-ASCII file names in quoted octal notation, i.e. "\nnn\nnn...". This can be disabled with

git config [--global] core.quotepath off

==== Disable commit message transcoding ====

Previous Git for Windows required to set the <tt>i18n.logoutputencoding</tt> to your Windows system's default OEM encoding for proper console output of non-ASCII commit messages. This is no longer necessary. Remove this or set it to 'utf-8':

git config [--global] i18n.logoutputencoding utf-8

The i18n.commitencoding setting should also be removed or set to 'utf-8' to support commit messages on the command line (<tt>git commit -m "..."</tt> from cmd.exe, MSYS bash won't let you enter non-ASCII characters):

git config [--global] i18n.commitencoding utf-8

If you want to use a custom text editor to enter commit messages, find one that supports Unix line breaks (LF only) and can save UTF-8 without BOM (so Windows notepad.exe is a bad choice). The default MSYS vim won't let you enter non-ASCII characters.

Erik Faye-Lund

unread,

Oct 10, 2011, 10:10:21 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:
> MSVC warns about implicit casts from const** to void* (which should be OK,
> as void* indicates the content is treated as opaque). E.g.
>
> const char *x;
> realloc(x, ...); // warning C4090 due to const** -> void*
>
> As there's not that much MSVC-only code, the 'real' const to non-const
> casts that we want to know about will be caught by GCC builds anyway, so
> disable this warning.
>
> MSVC link.exe doesn't recognize the /wd option in CFLAGS in the Makefile,
> so use #pragma warning in central, MSVC-specific unistd.h.
>

I think compat/msvc.h is the correct place to put this, not in our
placement unistd.h.

Erik Faye-Lund

unread,

Oct 10, 2011, 10:15:52 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:

> Git code sometimes uses uninitialized variables to initialize themselves
> (i.e. 'type varname = varname;'). According to some commit messages, this
> is supposedly to suppress the 'using uninitialized variable' warning
> (which is quite paradox). GCC currently doesn't seem to catch this, but
> MSVC does. Initialize with some arbitrary value instead to silence the
> compilers.

While I'm in favor of doing these kind of changes, patches like this
have caused some controversy upstream earlier. I'm not entirely sure
what the preferred solution upstream is anymore, though.

The problem with doing changes like this is that it increase the delta
against the upstream without much gain. Can we do something like what
the previous patch did instead, i.e "#pragma warning(disable : XXXX)"?

Erik Faye-Lund

unread,

Oct 10, 2011, 10:20:52 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:

> Include io.h from unistd.h, as in MinGW, so that mktemp is available.

mktemp should be defined by stdlib.h, not unistd.h:
http://pubs.opengroup.org/onlinepubs/7908799/xsh/mktemp.html

But perhaps we don't support overriding headers that exist in the
Windows SDK? If that is the case, I think this would be better to
place in compat/msvc.h...

Erik Faye-Lund

unread,

Oct 10, 2011, 10:26:51 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:

> Dynamic linking is generally preferred over static linking, and MSVCRT.dll
> has been integral part of Windows for a long time.
>
> This also fixes linker warnings for _malloc and _free in zlib.lib, which
> seems to be compiled for MSVCRT.dll already.
>
> The DLL version also exports some of the CRT initialization functions,
> which are hidden in the static libcmt.lib (e.g. __wgetmainargs, required by
> subsequent Unicode patches).

>
> Signed-off-by: Karsten Blees <bl...@dcon.de>
> ---

> Makefile | 6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 234d0af..4594d81 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1133,16 +1133,16 @@ ifeq ($(uname_S),Windows)
> compat/win32/pthread.o compat/win32/syslog.o \
> compat/win32/sys/poll.o compat/win32/dirent.o
> COMPAT_CFLAGS = -D__USE_MINGW_ACCESS -DNOGDI -DHAVE_STRING_H -DHAVE_ALLOCA_H -Icompat -Icompat/regex -Icompat/win32 -DSTRIP_EXTENSION=\".exe\"
> - BASIC_LDFLAGS = -IGNORE:4217 -IGNORE:4049 -NOLOGO -SUBSYSTEM:CONSOLE -NODEFAULTLIB:MSVCRT.lib
> + BASIC_LDFLAGS = -IGNORE:4217 -IGNORE:4049 -NOLOGO -SUBSYSTEM:CONSOLE
> EXTLIBS = user32.lib advapi32.lib shell32.lib wininet.lib ws2_32.lib
> PTHREAD_LIBS =
> lib =
> ifndef DEBUG
> - BASIC_CFLAGS += -GL -Os -MT
> + BASIC_CFLAGS += -GL -Os -MD
> BASIC_LDFLAGS += -LTCG
> AR += -LTCG
> else
> - BASIC_CFLAGS += -Zi -MTd
> + BASIC_CFLAGS += -Zi -MDd
> endif
> X = .exe
> endif
> --
> 1.7.6.msysgit.1
>
>

Looks good to me. But if we're splitting out an MSVC topic, then this
should probably be included in that ;)

Erik Faye-Lund

unread,

Oct 10, 2011, 10:31:49 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:
> Replaces Windows "ANSI" APIs dealing with file- or path names with their
> Unicode equivalent, adding UTF-8/UTF-16LE conversion as necessary.
>
> The dirent API (opendir/readdir/closedir) is updated in a separate commit.
>

Does splitting this step in two produce a Git for Windows that works?
If not, it'll be unfortunate for bisection later on...

Johannes Schindelin

unread,

Oct 10, 2011, 10:46:54 AM10/10/11

to Erik Faye-Lund, Karsten Blees, msy...@googlegroups.com

Hi,

I'd prefer the #pragma approach, too. It is my hope that at least someone
regularly uses the valgrind mode I worked so hard on, so we should be
reasonably sure that we're more clever than the compiler regarding those
"uninitialized" values.

And I guess that the reasoning is that initializing values to rogue values
does not buy us anything except maybe a performance hit.

Ciao,
Dscho

Erik Faye-Lund

unread,

Oct 10, 2011, 10:49:43 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:
> Add Unicode conversion functions to convert between Windows native UTF-16LE
> encoding to UTF-8 and back.
>
> To support repositories with legacy-encoded file names, the UTF-8 to UTF-16
> conversion function tries to create valid, unique file names even for
> invalid UTF-8 byte sequences, so that these repositories can be checked out
> without error.
>
> The current implementation leaves invalid UTF-8 bytes in range 0xa0 - 0xff
> as is (producing printable Unicode chars \u00a0 - \u00ff, equivalent to
> ISO-8859-1), and converts 0x80 - 0x9f to hex-code (\u0080 - \u009f are
> control chars).
>
> The Windows MultiByteToWideChar API was not used as it either drops invalid
> UTF-8 sequences (on Win2k/XP; producing non-unique or even empty file
> names) or converts them to the replacement char \ufffd (Vista/7; causing
> ERROR_INVALID_NAME in subsequent calls to file system APIs).

>
> Signed-off-by: Karsten Blees <bl...@dcon.de>
> ---

> compat/mingw.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> compat/mingw.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 162 insertions(+), 0 deletions(-)
>
> diff --git a/compat/mingw.c b/compat/mingw.c
> index 57d41dd..ba7ac27 100644
> --- a/compat/mingw.c
> +++ b/compat/mingw.c
> @@ -1869,6 +1869,91 @@ int mingw_offset_1st_component(const char *path)
> return offset + is_dir_sep(path[offset]);
> }
>
> +int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
> +{
> + int upos = 0, wpos = 0;
> + const unsigned char *utf = (const unsigned char*) utfs;
> + if (!utf || !wcs || wcslen < 1) {
> + errno = EINVAL;
> + return -1;
> + }
> + /* reserve space for \0 */
> + wcslen--;
> + if (utflen < 0)
> + utflen = INT_MAX;
> +
> + while (upos < utflen) {
> + int c = utf[upos++] & 0xff;
> + if (utflen == INT_MAX && c == 0)
> + break;
> +
> + if (wpos >= wcslen) {
> + wcs[wpos] = 0;
> + errno = ENAMETOOLONG;
> + return -1;

You have earlier[1] said that this function has the semantics of
wcstombs. But this part isn't. wcstombs does not apply filename
semantics to buffer sizes. In fact[2] it does not even error if the
string is too long (but it does return the same value as passed in as
'n', which can be used to tell us that we overflowed the buffer). So I
think it's better to modify this so that it behaves like wcstombs.

[1]: http://markmail.org/message/f7ckzoopqafywj3k
[2]: http://pubs.opengroup.org/onlinepubs/009604599/functions/wcstombs.html

Erik Faye-Lund

unread,

Oct 10, 2011, 10:55:17 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> Assumes file names in git tree objects are UTF-8 encoded.
>
> On most unix systems, the system encoding (and thus the TCL system
> encoding) will be UTF-8, so file names will be displayed correctly.
>
> On Windows, it is impossible to set the system encoding to UTF-8. Changing
> the TCL system encoding (via 'encoding system ...', e.g. in the startup
> code) is explicitly discouraged by the TCL docs.
>
> Change gitk and git-gui functions dealing with file names to always convert
> from and to UTF-8.
>
> Signed-off-by: Karsten Blees <bl...@dcon.de>

This reminds me of "[PATCH v10 07/28] gitk: fix file name encoding in
diff hunk headers"... is there a reason why this is split in two
patches (or not coming directly after each other, at least)?

Erik Faye-Lund

unread,

Oct 10, 2011, 11:02:14 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> Convert command line arguments from UTF-16 to UTF-8 on startup.

When doing input and output in two different patches, don't we end up
breaking locale-dependent repos for a little while (i.e where one git
tool gets a UTF-8 filename when it expects a Windows 1252 filename
from some other git tool)? It's probably not a big deal, but it'd be
nice if people who have issues with Windows 1252 repos could safely
bisect their issues past these commits also...

Erik Faye-Lund

unread,

Oct 10, 2011, 11:05:54 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> Replace stdout and stderr with a pipe if they point to the console. A
> background thread reads from the pipe, handles ANSI escape sequences and
> UTF-8 to UTF-16 conversion, then writes to the console.
>
> Global variables are either initialized on startup (single threaded) or
> exclusively modified by the background thread. Threads communicate through
> the pipe, no further synchronization is necessary.
>
> Due to the byte-oriented pipe, ANSI escape sequences can no longer be
> expected to arrive in one piece. Replace the string-based ansi_emulate()
> with a simple stateful parser (this also fixes colored diff hunk headers,
> which were broken as of commit 2efcc977).
>
> Override isatty to return true for the pipes redirecting to the console.
>
> Exec/spawn obtain the original console handle to pass to the next process
> via winansi_get_osfhandle().
>
> All other overrides are gone, the default stdio implementations work as
> expected with the piped stdout/stderr descriptors.
>
> Limitations: doesn't track reopened or duped file descriptors, i.e.:
> - fdopen(1/2) returns fully buffered streams
> - dup(1/2), dup2(1/2) returns normal pipe descriptors (i.e. isatty() =
> false, winansi_get_osfhandle won't return the original console handle)
>
> Currently, only the git-format-patch command uses xfdopen(xdup(1)) (see
> "realstdout" in builtin/log.c), but works well with these limitations.
>
> Signed-off-by: Karsten Blees <bl...@dcon.de>

This commit message doesn't quite work for me; it describes WHAT the
patch does, but not WHY. The what-part is already covered pretty well
by the code; can you please describe the problem it solves instead of
how it solves some problem I don't know what is? :)

Erik Faye-Lund

unread,

Oct 10, 2011, 11:09:57 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> Convert environment from UTF-16 to UTF-8 on startup.
>
> No changes to getenv() are necessary, as the MSVCRT version is implemented
> on top of char **environ.
>
> However, putenv / _wputenv from MSVCRT no longer work, for two reasons:
> 1. they try to keep environ, _wenviron and the Win32 process environment
> in sync, using the default system encoding instead of UTF-8 to convert
> between charsets
> 2. msysgit and MSVCRT use different allocators, memory allocated in git
> cannot be freed by the CRT and vice versa
>
> Implement mingw_putenv using the env_setenv helper function from the
> environment merge code.
>
> Signed-off-by: Karsten Blees <bl...@dcon.de>

Again, I suspect that we might end up breaking Windows 1252 encoded
repos here for a single revision. Squash the incoming/outgoing
patches, perhaps?

Erik Faye-Lund

unread,

Oct 10, 2011, 11:26:16 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> On Windows, all native APIs are Unicode-based. It is impossible to pass
> legacy encoded byte arrays to a process via command line or environment
> variables. Disable the tests that try to do so.

So, the earlier patches didn't update the test suite as they went
along and broke stuff that used to work? If so, that's not how we do
it: this makes it impossible to bisect breakages in the test-suite.

But I don't get why this is needed. Surely it was possible to pass the
environment variables as 8859-1 earlier, since these tests used to
work. What changed? Is this change needed after MSYS was patched to do
Unicode? In that case, I doubt the MSYS patch is correct; if people on
unix can pass 8859-1 encoded environment variables from a
shell-script, we should be able to do so too, no?

If I understand the tests correctly it's about respecting
i18n.commitencoding, which is something I expected this series to help
us do BETTER, not worse...

Erik Faye-Lund

unread,

Oct 10, 2011, 11:30:59 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> All functions that modify the environment have memory leaks.
>
> Disable gitunsetenv in the Makefile and use env_setenv (via mingw_putenv)
> instead (this frees removed environment entries).
>
> Move xstrdup from env_setenv to make_augmented_environ, so that
> mingw_putenv no longer copies the environment entries (according to POSIX
> [1], "the string [...] shall become part of the environment"). This also
> fixes the memory leak in gitsetenv, which expects a POSIX compliant putenv.
>
> [1] http://pubs.opengroup.org/onlinepubs/009695399/functions/putenv.html
>
> Signed-off-by: Karsten Blees <bl...@dcon.de>

This is a bugfix for 'devel', and not strictly speaking Unicode
related, no? If so, I would have expected it to be earlier in the
series...

Erik Faye-Lund

unread,

Oct 10, 2011, 11:40:26 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> The only public spawn function that needs to tweak the environment is
> mingw_spawnvpe (called from start_command). Nevertheless, all internal
> spawn* functions take an env parameter and needlessly pass the global
> char **environ around. Remove the env parameter where it's not needed.
>
> Signed-off-by: Karsten Blees <bl...@dcon.de>

I think the commit message is missing the important(?) detail that you
are removing our unused (and unexposed) execve-emulation in favor of
execv (which is used by shell.c). I'm not saying that's a bad thing;
just that perhaps it should be pointed out.

Erik Faye-Lund

unread,

Oct 10, 2011, 11:59:05 AM10/10/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:
>
> Hi there,

>
> I just pushed v10 of the Unicode patches for msysgit to http://repo.or.cz/w/git/mingw/4msysgit/kblees.git.
>

Let me reiterate what others have said: impressive :)

I'll quickly summarize my thoughts here, in case someone wants a simple rundown:

> Karsten Blees (28) ('+': new/rewritten in v10, '!': modified):
> MSVC: disable const qualifier compiler warnings
> MSVC: don't use uninitialized variables
> MSVC: include <io.h> for mktemp
> MSVC: fix winansi.c compile errors
> MSVC: link dynamically to the CRT

As I commented on these before, I believe all these are not strictly
speaking related, and should probably be fast-tracked to 'devel'
(after the discussion settles, which I'm sure it will very soon).

> Win32: Thread-safe windows console output

I don't quite understand what this has to do with Unicode.. Isn't it
just a bug-fix for the existing code?

> + (kb/unicode-v10-minimal) MinGW: disable legacy encoding tests

This one worries me a bit. Why can't we respect i18n.commitencoding
after going unicode?

> + Win32: fix environment memory leaks
> + Win32: unify environment case-sensitivity
> + Win32: simplify internal mingw_spawn* APIs
> Win32: move environment functions
> ! Win32: unify environment function names
> ! Win32: move environment block creation to a helper method
> ! Win32: don't copy the environment twice when spawning child processes
> Win32: reduce environment array reallocations
> ! Win32: keep the environment sorted

Some of these looks like improvements even without the Unicode-stuff.
So they should probably be moved earlier in the series, no?

> (kb/unicode-v10) Win32: patch Windows environment on startup

I think I should say this again: very impressive work. Thanks for
hanging in there so far, and I hope you're willing to stay in a little
bit longer so we can finally merge this very nice improvement! :)

Johannes Schindelin

unread,

Oct 10, 2011, 4:51:41 PM10/10/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Hi Karsten,

On Sat, 8 Oct 2011, karste...@dcon.de wrote:

> Johannes Schindelin <Johannes....@gmx.de> wrote on 07.10.2011
> 22:35:08:
>

> > On Thu, 6 Oct 2011, Karsten Blees wrote:
> >
> > > I just pushed v10 of the Unicode patches for msysgit to
> > > http://repo.or.cz/w/git/mingw/4msysgit/kblees.git.
> >
> > Thanks for being so persistent with this issue.
> >
> > For starters, however, I would like to integrate your work on "less"
> > first. Could you make that a branch in your repository, too?
>
> Not exactly, as it's a git fork...just gave birth to kblees/msysgit on
> github instead:
>
> less-444: https://github.com/kblees/msysgit/commits/kb/msys/less
> Unicode-aware msys.dll:
> https://github.com/kblees/msysgit/commits/kb/msys/unicode

Cool, thanks! I will try to shoehorn that into my schedule sometime this
week.

> OT: I'm still a bit confused about branches in the msysgit repo...I
> think "devel" is the development environment for git and other
> mingw-based stuff, while "msys" is for customized msys tools, with
> binaries cherry-picked to devel, is that basically correct?

Yep.

> It gave me quite a headache that some msys stuff (such as src/rt) is
> also in the devel branch, but won't build there at all...

The /src/rt/ in 'devel' looks different from the one in 'msys', and
we can call that experiment "failed", I guess. Unless someone objects
loudly, will remove it from 'devel'.

It would be fantastic if we could go with Sebastian's approach of making
Git for Windows an update site for MinGW at some stage, especially since
the MSys people want to make an MSys-based Git (which would be as slow as
Cygwin Git, I guess).

But I simply lack the time, and unfortunately it seems no other people are
interested.

It would need a little bit of committment, though, since we do have a few
changes in msys.dll which would most likely need to be refined a number of
times before upstream takes them.

Ciao,
Johannes

Johannes Schindelin

unread,

Oct 10, 2011, 6:41:07 PM10/10/11

to Erik Faye-Lund, Karsten Blees, msy...@googlegroups.com

Hi,

On Mon, 10 Oct 2011, Erik Faye-Lund wrote:

If it looks like a bug fix to you, please apply directly to 'devel'.

Thanks!
Dscho

karste...@dcon.de

unread,

Oct 11, 2011, 7:04:01 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Incidentally, my first solution up to v6 was "#pragma warning(disable : 4700)", and I changed it as a result of a discussion on the mailing list in January (http://groups.google.com/group/msysgit/msg/25c6c47f95f4e742).

I see your point (delta against upstream), however...according to C99,
- the value of an uninitialized local variable is "indeterminate" (6.7.8),
- i.e. it can be a "trap representation" (3.17.2),
- i.e. reading that value results in "undefined behavior" (6.2.6.1),
- i.e. the compiler may generate code that crashes the program (3.4.3).

So, writing

int foo = foo;

is most definitely a programming error and should be fixed. On the other hand,

int foo = 0;

is perfectly fine. A smart optimizing compiler might even detect that the initial value is never read before beeing reassigned and remove the initialization entirely.

Erik Faye-Lund

unread,

Oct 11, 2011, 7:08:00 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Tue, Oct 11, 2011 at 1:01 PM, <karste...@dcon.de> wrote:
>
> First of all, some words on the MSVC patches in general:
>
> The last upstream version that produced an MSVC-built git.exe at all (I
> haven't checked if it works) is v1.7.3, released over a year ago.
>
> My five MSVC patches fixed the MSVC build for the MinGW fork in January, but
> it broke again in April (with the addition of #include <resource.h>). As of
> today, it is completely broken due to the use of unix domain sockets in the
> credential daemon.

Just to clarify: The unix domain sockets was only in Junio's 'next',
and has already been kicked out of it already. It's usage also broke
the MinGW build, btw.

> There's obviously no interest in keeping this platform up to date (neither
> in the git nor the msysgit community), and as there is practically no
> supporting infrastructure (not even a working test suite), I don't see much
> value in holding on to this. So probably we should just move these patches
> to the attic and forget about them...
>

The MSVC port is by no means top priority, and leaving fixing it up to
whoever use MSVC is fine by me. In other words; I agree, we can leave
them out.

> As I understand it, msvc.[ch] is to adapt the MSVC build to the existing
> mingw.[ch] code, while compat/vcbuild/include is meant to fix general
> "non-posix"-ness of MSVC. For example, unistd.h also has all the S_*
> definitions that are missing from MSVC's <sys/stat.h>.
>

Exactly, and warning-disabling fits in the first category, not the
second one. This patch was about disabling a compiler-specific
warning, not providing POSIX functionality...

karste...@dcon.de

unread,

Oct 11, 2011, 7:06:55 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 16:20:52: > On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote: > > Include io.h from unistd.h, as in MinGW, so that mktemp is available. > > mktemp should be defined by stdlib.h, not unistd.h: >

http://pubs.opengroup.org/onlinepubs/7908799/xsh/mktemp.html >

MinGW declares _mktemp in io.h and includes io.h from unistd.h. IMO the simplest way to support multiple platforms with one source is to just mirror that.

> But perhaps we don't support overriding headers that exist in the > Windows SDK? If that is the case, I think this would be better to > place in compat/msvc.h...

I think copying and modifying the MSVC stdlib.h would be a copyright infringement, and writing those 46k from scratch (in a way that fits nicely into the rest of MSVC "not-strictly-posix" headers) is not a viable option IMO.

karste...@dcon.de

unread,

Oct 11, 2011, 7:10:42 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

If we drop MSVC support, that's fine. Otherwise this patch is prerequisite to using __wgetmainargs in #14/28 "Win32: Unicode arguments (incoming)".

Erik Faye-Lund

unread,

Oct 11, 2011, 7:16:40 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Tue, Oct 11, 2011 at 1:04 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 16:15:52:
>
>> On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:
>> > Git code sometimes uses uninitialized variables to initialize themselves
>> > (i.e. 'type varname = varname;'). According to some commit messages,
>> > this
>> > is supposedly to suppress the 'using uninitialized variable' warning
>> > (which is quite paradox). GCC currently doesn't seem to catch this, but
>> > MSVC does. Initialize with some arbitrary value instead to silence the
>> > compilers.
>>
>> While I'm in favor of doing these kind of changes, patches like this
>> have caused some controversy upstream earlier. I'm not entirely sure
>> what the preferred solution upstream is anymore, though.
>>
>> The problem with doing changes like this is that it increase the delta
>> against the upstream without much gain. Can we do something like what
>> the previous patch did instead, i.e "#pragma warning(disable : XXXX)"?
>
> Incidentally, my first solution up to v6 was "#pragma warning(disable :
> 4700)", and I changed it as a result of a discussion on the mailing list in
> January (http://groups.google.com/group/msysgit/msg/25c6c47f95f4e742).

That discussion seems to conclude that disabling the warning IS the
right approach, because there's little MSVC only specific code and GCC
will still pick up the errors.

> I see your point (delta against upstream), however...according to C99,
> - the value of an uninitialized local variable is "indeterminate" (6.7.8),
> - i.e. it can be a "trap representation" (3.17.2),
> - i.e. reading that value results in "undefined behavior" (6.2.6.1),
> - i.e. the compiler may generate code that crashes the program (3.4.3).
>
> So, writing
>
> int foo = foo;
>
> is most definitely a programming error and should be fixed. On the other
> hand,
>
> int foo = 0;
>
> is perfectly fine. A smart optimizing compiler might even detect that the
> initial value is never read before beeing reassigned and remove the
> initialization entirely.
>

I agree with all of this, but the people who makes the calls upstream
doesn't seem to. And for a reason; changes like this degrades
performance a bit. I tend to value code-confidence slightly higher
than performance, though.

karste...@dcon.de

unread,

Oct 11, 2011, 7:01:13 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

First of all, some words on the MSVC patches in general:

The last upstream version that produced an MSVC-built git.exe at all (I haven't checked if it works) is v1.7.3, released over a year ago.

My five MSVC patches fixed the MSVC build for the MinGW fork in January, but it broke again in April (with the addition of #include <resource.h>). As of today, it is completely broken due to the use of unix domain sockets in the credential daemon.

There's obviously no interest in keeping this platform up to date (neither in the git nor the msysgit community), and as there is practically no supporting infrastructure (not even a working test suite), I don't see much value in holding on to this. So probably we should just move these patches to the attic and forget about them...

Erik Faye-Lund

unread,

Oct 11, 2011, 7:23:29 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Tue, Oct 11, 2011 at 1:06 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 16:20:52:
>
>> On Thu, Oct 6, 2011 at 7:40 PM, Karsten Blees <bl...@dcon.de> wrote:
>> > Include io.h from unistd.h, as in MinGW, so that mktemp is available.
>>
>> mktemp should be defined by stdlib.h, not unistd.h:
>>
> http://pubs.opengroup.org/onlinepubs/7908799/xsh/mktemp.html
>>
>
> MinGW declares _mktemp in io.h and includes io.h from unistd.h. IMO the
> simplest way to support multiple platforms with one source is to just mirror
> that.
>

But as you said yourself, the point of compat/vcbuild/include is to
provide lacking POSIX interfaces, not to emulate what MinGW-runtime
does.

>> But perhaps we don't support overriding headers that exist in the
>> Windows SDK? If that is the case, I think this would be better to
>> place in compat/msvc.h...
>
> I think copying and modifying the MSVC stdlib.h would be a copyright
> infringement, and writing those 46k from scratch (in a way that fits nicely
> into the rest of MSVC "not-strictly-posix" headers) is not a viable option
> IMO.

That's not at all what I meant. I meant to add something like this as
compat/vcbuild/include/stdlib.h:
#include <some-way-of-including-msvcs-copy-of-stdlib.h>
#include <io.h>

But it might not be possible to do this without hard-coding the path
to the PlatformSDK, so I think the second best thing is to put this in
compat/msvc.h which is where we usually make minor corrections for
MSVCRT oddities.

karste...@dcon.de

unread,

Oct 11, 2011, 8:09:45 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

http://markmail.org/message/f7ckzoopqafywj3k > [2]:http://pubs.opengroup.org/onlinepubs/009604599/functions/wcstombs.html
POSIX "semantics" (as in "meaning of parameters") is usually:

result do_something(destination, source, destination_size)

A lot of functions work that way, including strncpy, memcpy, memmove, wcstombs and mbstowcs. In the Win32 API, its usually the other way around, e.g.

result MultiByteToWideChar(codepage, flags, source, source_size, destination, destination_size)

In contrast to wcstombs/mbstowcs, my helper functions either return a 0-terminated string or fail with an errno appropriate for file system APIs. This allows most of the ~30 uses to require only 3 lines of additional code, e.g.:

wchar_t wname[MAX_PATH];
if (utftowcs(wname, name, MAX_PATH) < 0)
return -1;

If the helper functions worked exactly like wcstombs/mbstowcs, a lot more, largely redundant code would be required.

karste...@dcon.de

unread,

Oct 11, 2011, 8:15:02 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

It all works for ASCII-only file names. For flawless Unicode support, you'll need most of the patches along with the Unicode-aware msys.dll/bash. E.g. Unicode file name support is of little use with cp125x command line and environment.

Squashing one or two patches of the series might make sense (e.g. the diren/non-dirent file name patches and outgoing/incoming pairs), but unless you squash all of them into a single large patch, you'll have bisection problems anyway (with the few i18n tests, that is).

Seems that digestible patch sizes and bisection are contrary goals for a patch series this large, and I'd prefer to stick with the former.

karste...@dcon.de

unread,

Oct 11, 2011, 8:18:48 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

#7/28 fixes a bug (not Unicode or even Windows related, IIRC Kirill confirmed this one on Linux).

#12/28 changes the gitk/git-gui file name interfaces from system encoding to UTF-8 (and is only useful after having a git that stores file names in UTF-8...).

Erik Faye-Lund

unread,

Oct 11, 2011, 8:19:08 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

With the normal wcstombs/mbstowcs operation you can do the same with:

wchar_t wname[MAX_PATH];
if (utftowcs(wname, name, MAX_PATH) == MAX_PATH)
return -1;

In the cases where errno should actually be set to ENAMETOOLONG, it's a simple:

if (utftowcs(wname, name, MAX_PATH) == MAX_PATH) {
errno = ENAMETOOLONG;
return -1;
}

So I don't think mixing in filename errno-semantics into a function
that is also used to convert non-filename data makes any sense.

>
> If the helper functions worked exactly like wcstombs/mbstowcs, a lot more,
> largely redundant code would be required.
>

No, just the snippets above. And we'd get the nice bonus of not
risking polluting errno (which POSIX is very strict about when is
allowed to be set).

karste...@dcon.de

unread,

Oct 11, 2011, 8:21:12 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

The WHY is in the subject: "Thread-safe windows console output".
I didn't think it would be necessary to explain why a multi-threaded program should be thread-safe.

Erik Faye-Lund

unread,

Oct 11, 2011, 8:28:09 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Tue, Oct 11, 2011 at 2:15 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 16:31:49:
>
>> On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:
>> > Replaces Windows "ANSI" APIs dealing with file- or path names with their
>> > Unicode equivalent, adding UTF-8/UTF-16LE conversion as necessary.
>> >
>> > The dirent API (opendir/readdir/closedir) is updated in a separate
>> > commit.
>> >
>>
>> Does splitting this step in two produce a Git for Windows that works?
>> If not, it'll be unfortunate for bisection later on...
>
> It all works for ASCII-only file names. For flawless Unicode support, you'll
> need most of the patches along with the Unicode-aware msys.dll/bash. E.g.
> Unicode file name support is of little use with cp125x command line and
> environment.

That's basically what I feared.

> Squashing one or two patches of the series might make sense (e.g. the
> diren/non-dirent file name patches and outgoing/incoming pairs), but unless
> you squash all of them into a single large patch, you'll have bisection
> problems anyway (with the few i18n tests, that is).

Well, that's a problem, then. Documentation/SubmittingPatches clearly
states "make sure that the test suite passes after your commit".

Bisection is such a useful tool, and it performed all the time to find
the reason for a breakage, that I doubt breaking the test-suite in
itself will make these patches indigestible.

> Seems that digestible patch sizes and bisection are contrary goals for a
> patch series this large, and I'd prefer to stick with the former.

In my experience, it's usually possible to come up with a way of
ordering things so you never end up breaking the test-suite. But it
takes some careful planning.

Having small patches isn't the only goal; there's lots of large
patches in git.git. Having logical patches that solve one problem at
the time is really the goal here.

So, I think some squashing will be needed.

Erik Faye-Lund

unread,

Oct 11, 2011, 8:29:11 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Thanks for clearing this up. Perhaps #7/28 should be sent directly
upstream, then?

Erik Faye-Lund

unread,

Oct 11, 2011, 8:33:33 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

The commit message doesn't say that the old code wasn't thread safe,
and under what circumstances it can lead to incorrect operation as-is.
I can make some educated guesses, but the commit message isn't as much
for the people who already know the details, as it is for all the
other people who ends up seeing the patch.

Anyway, if this isn't strictly speaking unicode-related (which the
commit message gives me no reason to think), I think it's candidate
for fast-tracking to 'devel'. But I'd like an improved commit message
first.

karste...@dcon.de

unread,

Oct 11, 2011, 10:59:56 AM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 11.10.2011 14:19:08: > On Tue, Oct 11, 2011 at 2:09 PM, <karste...@dcon.de> wrote: > >

[snip]

> > > > In contrast to wcstombs/mbstowcs, my helper functions either return a > > 0-terminated string or fail with an errno appropriate for file system APIs. > > This allows most of the ~30 uses to require only 3 lines of additional code, > > e.g.: > > > > wchar_t wname[MAX_PATH]; > > if (utftowcs(wname, name, MAX_PATH) < 0) > > return -1; > > With the normal wcstombs/mbstowcs operation you can do the same with: > > wchar_t wname[MAX_PATH]; > if (utftowcs(wname, name, MAX_PATH) == MAX_PATH) > return -1; > > In the cases where errno should actually be set to ENAMETOOLONG, > it's a simple: > > if (utftowcs(wname, name, MAX_PATH) == MAX_PATH) { > errno = ENAMETOOLONG; > return -1; > } >

No, that's not enough. mbstowcs may fail with -1, or name may be NULL (mbstowcs behaviour is undefined for this). We better handle these cases:

wchar_t wname[MAX_PATH];
int result;
if (!name) {
errno = EINVAL;
return -1;
}
result = utftowcs(wname, name, MAX_PATH); if (result < 0) return -1; /* errno set by utftowcs */
if (result == MAX_PATH) {

errno = ENAMETOOLONG; return -1; }

I actually tried a lot of variants (including some macro tricks) before I settled on the current API, which IMO is simple, flexible, and minimizes duplicate code.

> So I don't think mixing in filename errno-semantics into a function > that is also used to convert non-filename data makes any sense. > > > > > If the helper functions worked exactly like wcstombs/mbstowcs, a lot more, > > largely redundant code would be required. > > > > No, just the snippets above. And we'd get the nice bonus of not > risking polluting errno (which POSIX is very strict about when is > allowed to be set).

The situations in which these functions will fail and "pollute" errno are well documented and very easy to avoid by passing a large enough buffer, so it's not really an issue.

Erik Faye-Lund

unread,

Oct 11, 2011, 11:37:54 AM10/11/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

It might be the simplest for you, but it's certainly a can of worms
for future users. And this special casing of the routine renders your
earlier argument that this had the same semantics as wcstombs/mbstowcs
kind of moot.

Setting errno to EINVAL on NULL-input is a responsibility of a given
set of library functions, and to the best of my knowledge we already
do this where it's needed. If you want to refactor this part to reduce
code-reuse, you are free to do that. But please do not do it as some
cryptic side-effect in a conversion function named to be similar of
some other function that does not do this. If the function was called
something like convert_filename_or_set_errno (could be implemented as
a wrapper around this), I would not have been so worried.

I can't say I see your point about it being the most flexible, though.

>> > If the helper functions worked exactly like wcstombs/mbstowcs, a lot
>> > more,
>> > largely redundant code would be required.
>> >
>>
>> No, just the snippets above. And we'd get the nice bonus of not
>> risking polluting errno (which POSIX is very strict about when is
>> allowed to be set).
>
> The situations in which these functions will fail and "pollute" errno are
> well documented and very easy to avoid by passing a large enough buffer, so
> it's not really an issue.

Documentation isn't an excuse for a misleading interface. Besides, who
reads the documentation anyway? ;)

Speaking of documentation, javadoc/doxygen-style tags is not the way
we document in Git. We tend to write API-documentation in
Documentation/technical (or simply as normal comments).

Johannes Schindelin

unread,

Oct 11, 2011, 12:44:19 PM10/11/11

to Erik Faye-Lund, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Hi,

On Tue, 11 Oct 2011, Erik Faye-Lund wrote:

> On Tue, Oct 11, 2011 at 1:01 PM, <karste...@dcon.de> wrote:
>
> > There's obviously no interest in keeping this platform up to date
> > (neither in the git nor the msysgit community), and as there is
> > practically no supporting infrastructure (not even a working test
> > suite), I don't see much value in holding on to this. So probably we
> > should just move these patches to the attic and forget about them...
>
> The MSVC port is by no means top priority, and leaving fixing it up to
> whoever use MSVC is fine by me. In other words; I agree, we can leave
> them out.

I actually like the idea of having it in our repository, but you're right,
it takes somebody who wants to keep the support up-to-date. And quite
honestly, I will take any patches anybody claims fix MSVC unless they
break other setups.

So can you please make an MSVC-only branch? You might want to rebase it
onto 'tentative' which I will merge into 'devel' as soon as I found out
what is causing the bundle-related breakage. As kusma said, the
unix-sockets stuff has been backed out from current 'next' which I
rebase-merged onto.

Ciao,
Dscho

Johannes Schindelin

unread,

Oct 11, 2011, 12:55:07 PM10/11/11

to Erik Faye-Lund, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Hi,

On Tue, 11 Oct 2011, Erik Faye-Lund wrote:

> Speaking of documentation, javadoc/doxygen-style tags is not the way we
> document in Git. We tend to write API-documentation in
> Documentation/technical (or simply as normal comments).

Ach, to have Javadoc-style documentation... Then we could use doxygen...
Oh well.

Seriously again, if I understand the discussion correctly, then the
semantics of the function to convert strings from one encoding to another
is adjusted in a manner that is targeted to filenames?

Ciao,
Dscho

Johannes Schindelin

unread,

Oct 11, 2011, 12:58:38 PM10/11/11

to Erik Faye-Lund, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Hi,

On Tue, 11 Oct 2011, Erik Faye-Lund wrote:

> In my experience, it's usually possible to come up with a way of
> ordering things so you never end up breaking the test-suite. But it
> takes some careful planning.
>
> Having small patches isn't the only goal; there's lots of large patches
> in git.git. Having logical patches that solve one problem at the time is
> really the goal here.
>
> So, I think some squashing will be needed.

Wait. The way I do such changes usually goes like this: I add all the
logically-separate functions in single commits, then switch over by
including the header/source files in one commit (very bisectable, but of
course, nobody helps you to identify the issue in the really big change)
and then another commit to remove all now-obsolete code.

Would this be an option here?

Ciao,
Dscho

karste...@dcon.de

unread,

Oct 11, 2011, 9:09:57 PM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 17:26:16: > On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> > On Windows, all native APIs are Unicode-based. It is impossible to pass > > legacy encoded byte arrays to a process via command line or environment > > variables. Disable the tests that try to do so. > > So, the earlier patches didn't update the test suite as they went > along and broke stuff that used to work? If so, that's not how we do > it: this makes it impossible to bisect breakages in the test-suite. >

These tests (and a few more) break by updating the msys.dll. Applying the Unicode patches will eventually fix most of them. This final patch only disables the tests that cannot possibly work on Windows.
> But I don't get why this is needed. Surely it was possible to pass the > environment variables as 8859-1 earlier, since these tests used to > work. What changed? Is this change needed after MSYS was patched to do > Unicode? In that case, I doubt the MSYS patch is correct; if people on > unix can pass 8859-1 encoded environment variables from a > shell-script, we should be able to do so too, no? >

No, this was not possible and will probably never be on Windows. All native APIs are based on Unicode strings, you cannot store a however encoded byte array in a Windows environment variable, or pass a byte array on the command line, or have a byte array as a file name.

In order to use byte arrays in C programs on Windows (instead of wchar_t), we need to encode/decode these from/to native Unicode strings, using some implied encoding. The Windows *A APIs use the system's default encoding, and my patches change that to UTF-8.

The reason these tests currently seem to work is that there is usually a one-to-one translation between legacy single-byte character sets and UTF-16. Converting an arbitrary byte array to UTF-16 and back using a single-byte encoding such as Cp1252 will result in the exact same byte array.

With multi-byte encodings such as UTF-8, it is common to have certain byte sequences that are invalid. We have to deal with this somehow, either by dropping invalid bytes, or by using some replacement character. In either case, converting back and forth between UTF-8 and UTF-16 will only result in the same byte array if the original is valid UTF-8 (which is not the case for the tests in question).
> If I understand the tests correctly it's about respecting > i18n.commitencoding, which is something I expected this series to help > us do BETTER, not worse...
No, this patch series doesn't improve the LEGACY encoding support, quite the contrary: it enables the use of the default value "utf-8". As Unicode is a universal international character set that supersedes all language-specific legacy encodings, the i18n options are basically obsolete with this.

karste...@dcon.de

unread,

Oct 11, 2011, 9:30:18 PM10/11/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 17:30:59: > On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> > All functions that modify the environment have memory leaks. > > > > Disable gitunsetenv in the Makefile and use env_setenv (via mingw_putenv) > > instead (this frees removed environment entries). > > > > Move xstrdup from env_setenv to make_augmented_environ, so that > > mingw_putenv no longer copies the environment entries (according to POSIX > > [1], "the string [...] shall become part of the environment"). This also > > fixes the memory leak in gitsetenv, which expects a POSIX compliant putenv. > > > > [1]

http://pubs.opengroup.org/onlinepubs/009695399/functions/putenv.html > > > > Signed-off-by: Karsten Blees <bl...@dcon.de> >

> This is a bugfix for 'devel', and not strictly speaking Unicode > related, no? If so, I would have expected it to be earlier in the > series...

I've moved the environment stuff to the end of the patch series due to the lively discussion a few months ago. This patch depends on taking control of char **environ (which was heavily debated :-)) and having our own mingw_putenv (both introduced in #17/28 "Win32: Unicode environment (incoming)").

karste...@dcon.de

unread,

Oct 11, 2011, 9:47:35 PM10/11/11

to Johannes Schindelin, Karsten Blees, Erik Faye-Lund, msy...@googlegroups.com

I ran the test suite for all intermediate stages, starting with #9/28 "Win32: add Unicode conversion functions". None of the patches breaks a test, so I think no squashing is required.

#0 update msys.dll breaks the following tests:
t0050-filesystem
t3901-i18n-patch
t4014-format-patch
t4036-format-patch-signer-mime
t4201-shortlog
t7003-filter-branch
t8005-blame-i18n

#10/28 Win32: Unicode file name support (except dirent): no change
#11/28 Win32: Unicode file name support (dirent): fixes t7003-filter-branch
#12/28 Unicode file name support (gitk and git-gui): no change
#13/28 Win32: Unicode arguments (outgoing): no change
#14/28 Win32: Unicode arguments (incoming): fixes t0050-filesystem
#15/28 Win32: Thread-safe windows console output: no change
#16/28 Win32: Unicode environment (outgoing): no change
#17/28 Win32: Unicode environment (incoming): fixes t4014-format-patch and t4036-format-patch-signer-mime
#18/28 MinGW: disable legacy encoding tests: fixes t3901-i18n-patch, t4201-shortlog and t8005-blame-i18n

Jonathan Nieder

unread,

Oct 12, 2011, 6:02:24 AM10/12/11

to Erik Faye-Lund, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund wrote:
> On Tue, Oct 11, 2011 at 1:04 PM, <karste...@dcon.de> wrote:

>> So, writing
>>
>> int foo = foo;
>>
>> is most definitely a programming error and should be fixed. On the other
>> hand,
>>
>> int foo = 0;
>>
>> is perfectly fine.

[...]

> I agree with all of this, but the people who makes the calls upstream
> doesn't seem to. And for a reason; changes like this degrades
> performance a bit. I tend to value code-confidence slightly higher
> than performance, though.

I don't make the calls upstream, but I think a much better reason to
dislike use of "int foo = 0;" in place of "int foo = foo;" or "int
foo;" is that it prevents valgrind from doing its job.

What ever happened to the following series?

http://thread.gmane.org/gmane.comp.version-control.git/169136

Erik Faye-Lund

unread,

Oct 12, 2011, 7:43:39 AM10/12/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Wed, Oct 12, 2011 at 3:09 AM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 17:26:16:
>
>> On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:
>> > On Windows, all native APIs are Unicode-based. It is impossible to pass
>> > legacy encoded byte arrays to a process via command line or environment
>> > variables. Disable the tests that try to do so.
>>
>> So, the earlier patches didn't update the test suite as they went
>> along and broke stuff that used to work? If so, that's not how we do
>> it: this makes it impossible to bisect breakages in the test-suite.
>>
>
> These tests (and a few more) break by updating the msys.dll. Applying the
> Unicode patches will eventually fix most of them.

Thanks for clarifying this :)

No, they are not. There might be existing repos out there with other
commit-encodings, and we need to continue to support them. And to
allow people to collaborate with people who use older Git for Windows
versions (without these patches), we might want to support creating
them also. But then we will probably have to start recoding stuff when
we write the object, because it looks to me like Git makes no such
attempts now.

I'm not suggesting you change this, I'm just thinking out loud here.

Anyway, what doesn't make sense to allow at all, is setting
i18n.logOutputEncoding to something other than "utf-8" when stdout
goes to console. But it might make sense when writing to a file (and
might be required for format-patch to work correctly).

The problem that our environment-strings from t3901-8859-1.txt gets
reencoded to UTF-16 when loaded into the environment in the
'setup'-test can probably be worked around as-is with some plumbing.
But I'm not sure it's worth the hassle, so perhaps your disabling of
these are the best approach. I'm on the fence here. Rewriting the
tests to use plumbing for generating the commit can be done at a later
stage (if at all), I guess.

So, if I understand it correctly:
- The reason these tests fail, is that the patched msys.dll
interprets all environment variables as UTF-8, and re-encodes them to
UTF-16.
- After these patches (and patching msys.dll), we still support
reading existing repos with commits with non-UTF-8 commit-encoding.
- After these patches (and patching msys.dll), we cannot reliably
generate commits with non-UTF-8 commit-encodings from porcelain
commands.

Am I missing something?

Erik Faye-Lund

unread,

Oct 12, 2011, 7:50:17 AM10/12/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Quick note, you tend to send out e-mails (like this) where you have a
broken HTML version, and a nice and clean plain-text version. What
happens is that you have a link at the beginning, and the rest of the
e-mail gets wrapped inside that <a>-tag.

The GMail-interface barfs pretty badly on these, as it won't let me
poen the "- Show quoted text -"-part of the e-mail without opening the
link instead. So I'm forced to read these e-mails in raw-mode.

Would you mind trying to find out why this happens, and if possible
try to avoid it in the future?

Thanks, that makes perfect sense to me. :)

Erik Faye-Lund

unread,

Oct 12, 2011, 8:04:51 AM10/12/11

to Johannes Schindelin, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Yes. I would prefer to see that done as a filename-specific wrapper
over the string-conversion routine instead.

karste...@dcon.de

unread,

Oct 12, 2011, 10:16:54 AM10/12/11

to Jonathan Nieder, Karsten Blees, Erik Faye-Lund, msy...@googlegroups.com

I wasn't aware of the upstream discussion, thanks for the pointer.

IMO "int foo;" is the cleanest solution, however, some "int foo;" have
explicitly been changed to "int foo = foo;" to silence compiler warnings:

10e8d688 Correct compiler warnings in fast-import.
75b9a8a6 submodule.c: Squelch a "use before assignment" warning

With -Werror -Wuninitialized some compiler versions will barf, and
disabling -Wuninitialized will affect the entire codebase. That's why I
went for "int foo = 0;".

Perhaps this is just a stupid idea, but we could use macros to suppress
the warnings on a case-by-case basis, something like this:

/* replace with compiler version check */
#if COMPILER_SUPPORTS_PROPER_UNINITIALIZED_WARNINGS
# define UNINITIALIZED(default)
#else
# define UNINITIALIZED(default) = (default)
#endif

Then we could write:

int foo UNINITIALIZED(0);

karste...@dcon.de

unread,

Oct 12, 2011, 12:38:49 PM10/12/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 12.10.2011 13:43:39:

> On Wed, Oct 12, 2011 at 3:09 AM, <karste...@dcon.de> wrote:
> >
> > Erik Faye-Lund <kusm...@gmail.com> wrote on 10.10.2011 17:26:16:
> >

[snip]

> >> If I understand the tests correctly it's about respecting
> >> i18n.commitencoding, which is something I expected this series to
help
> >> us do BETTER, not worse...
> >
> > No, this patch series doesn't improve the LEGACY encoding

support,quite the

> > contrary: it enables the use of the default value "utf-8". As Unicode
is a
> > universal international character set that supersedes all
language-specific
> > legacy encodings, the i18n options are basically obsolete with this.
>
> No, they are not. There might be existing repos out there with other
> commit-encodings, and we need to continue to support them.

This is not a problem, as the commit encoding is stored in each commit
object and git transcodes commits to i18n.logoutputencoding automatically.

> And to
> allow people to collaborate with people who use older Git for Windows
> versions (without these patches), we might want to support creating
> them also.

Not necessarily, as older Git for Windows versions also support
transcoding commit objects. If i18n.logoutputencoding and the console code
page are set to the Windows system encoding, displaying UTF-8 encoded
commit objects will work fine...except of course for UTF-8 characters that
are missing in the system encoding.

The only reason to support legacy i18n.commitencoding is probably for
editors that don't support UTF-8.

> But then we will probably have to start recoding stuff when
> we write the object, because it looks to me like Git makes no such
> attempts now.
>

The problem with recoding is that author/committer-name/email and commit
message may be specified in many different ways and the encoding depends
on the source:
- command line (-m, --author...): always UTF-8
- environment (GIT_AUTHOR_NAME...): always UTF-8
- ~/.gitconfig (user.name...): probably UTF-8?
- .git/config (user.name...): probably UTF-8, but probably different from
~/.gitconfig
- other commit (-C): as specified in other commit
- file (-F): anything

When I looked at the code that composes commit messages a while back, I
quickly decided that there were more important things to do :-)

> So, if I understand it correctly:
> - The reason these tests fail, is that the patched msys.dll
> interprets all environment variables as UTF-8, and re-encodes them to
> UTF-16.

Precisely.

> - After these patches (and patching msys.dll), we still support
> reading existing repos with commits with non-UTF-8 commit-encoding.

Yes.

> - After these patches (and patching msys.dll), we cannot reliably
> generate commits with non-UTF-8 commit-encodings from porcelain
> commands.

This is not necessary, as older Git versions support transcoding UTF-8
commits.
It is still possible to set i18n.commitencoding and use 'git-commit -F',
as t3900 demonstrates. However, if user.name is non-ASCII, it gets tricky:
you'd have to edit .git/config manually (not via git-config) and make sure
its stored in the target encoding.

Michael Geddes

unread,

Oct 15, 2011, 7:41:24 AM10/15/11

to msy...@googlegroups.com

Hi
> I cannot reproduce that, git-gui works fine for me with non-ASCII user
> names. Perhaps that user's TEMP setting is broken?
>
> > The only remaining issue which manifested itself in all my experiments
> > is that dreaded commit encoding problem. I have gvim.exe (version 7.3)
> > as my $EDITOR and it assumes all files it opens on Windows are encoded
> > using the currently active codepage (cp1251, again).
> > Hence two problems:
> > * Filenames in the commit message file are displayed botched.
> > * If nothing is done to fix whis, like forced setting of the output
> >
> > file's encoding to be UTF-8, commit message will be composed in the
> > system's codepage and will appear unreadable in both git's output and
> > its GUI tools.
>
> I don't use [g]vim if I can help it, but the vim wiki suggests ":e
> ++enc=utf-8"...there's probably a .[g]vimrc to set the default encoding, I
> don't know.
>

I've not followed through this thread, but I am a gvim/vim user, and can be
found lurking on #vim.

There are three settings that are important for this kind of stuff vim:

The first is 'encoding' which is the encoding of the terminal (or the way gvim
deals with it). This should be set to utf-8. I'm not sure what it currently
does under msys; I will investigate once I get back to work on Monday.

The second is 'fileencoding' which is the encoding that the current file is set
to - this is usually detected automatically, set explicitly ( ++enc=utf-8 ),
or falls to the default. Changing this and writing, will convert it from one
encoding to another.. whereas using ++enc will read it using that encoding.

The third is 'fileencodings', which is a list of encodings to search through to
detect a file's encoding.

A lot can be worked out from typing
:set fenc? enc? fencs?
in vim.

Let me know if I can help work this out. It might be that some different
defaults in the vimrc are warrented. I will be back on my work machine ina
couple of days.

//.ichael G.

Konstantin Khomoutov

unread,

Oct 15, 2011, 9:34:08 AM10/15/11

to Michael Geddes, msy...@googlegroups.com

On Sat, 15 Oct 2011 19:41:24 +0800
Michael Geddes <mic...@wheelycreek.net> wrote:

[...]

> > I don't use [g]vim if I can help it, but the vim wiki suggests ":e
> > ++enc=utf-8"...there's probably a .[g]vimrc to set the default
> > encoding, I don't know.
> I've not followed through this thread, but I am a gvim/vim user, and
> can be found lurking on #vim.
>
> There are three settings that are important for this kind of stuff
> vim:
>
> The first is 'encoding' which is the encoding of the terminal (or
> the way gvim deals with it). This should be set to utf-8. I'm not
> sure what it currently does under msys; I will investigate once I get
> back to work on Monday.

This is a bit more involved.
First, 'encoding' controls the encoding of Vim's internal buffers.
Second, terminal encoding is set using the 'termencoding' (abbreviated
'tenc') option. But the Vim manual on this option states that for GUI
Vim tenc does only control what's produced by the keyboard, and
displaying is anyway controlled via the 'encoding' option.

To demonstrate, if you run native vim.exe (that one from the vim.sf.net
installer) in a console window, it'll have enc set to the system "ANSI"
code page and tenc set to the system "OEM" codepage. In Windows XP,
setting Unicode font on a console window does not alter this behaviour
in any way. You can set enc=utf-8 in it, and that won't change the
termencoding from its default value. I hence suspect that the console
version of Vim for Windows is also not really Unicode-enabled, at least
not on Windows XP.

> The second is 'fileencoding' which is the encoding that the current
> file is set to - this is usually detected automatically, set
> explicitly ( ++enc=utf-8 ), or falls to the default. Changing this
> and writing, will convert it from one encoding to another.. whereas
> using ++enc will read it using that encoding.
>
> The third is 'fileencodings', which is a list of encodings to search
> through to detect a file's encoding.
>
> A lot can be worked out from typing
> :set fenc? enc? fencs?
> in vim.
>
> Let me know if I can help work this out. It might be that some
> different defaults in the vimrc are warrented. I will be back on my
> work machine ina couple of days.

I think that currently the only way to reliably get UTF-8 in commit
messages when using Vim is to set core.editor (or the GIT_EDITOR
environment variable) to
vim.exe -c "set enc=utf-8"
or
vim.exe -c "set fenc=utf-8"
(I think the former is slightly more logical these days).

Messing with defaults is dangerous because on Windows you usually
expect test editors to write out files in ANSI code page.

karste...@dcon.de

unread,

Oct 15, 2011, 12:39:14 PM10/15/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Setting and checking errno is how error handling works in C. It is neither reserved for libraries, nor is the use of EINVAL (or any other error code) restricted to specific functions. POSIX implementations are expressly allowed to use additional error codes as long they don't conflict with the errors section of the function's specification (http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html).

E.g. the BSD, OS X and MSVC versions of mbstowcs and wcstombs can all fail with EINVAL. The same is true for most POSIX wrappers in mingw.c, either by wrapping the MSVC version of the function or by using err_win_to_posix (which btw. also maps the generic ERROR_BUFFER_OVERFLOW to ENAMETOOLONG).

> and to the best of my knowledge we already > do this where it's needed. If you want to refactor this part to reduce > code-reuse, you are free to do that. But please do not do it as some > cryptic side-effect in a conversion function named to be similar of > some other function that does not do this. If the function was called > something like convert_filename_or_set_errno (could be implemented as > a wrapper around this), I would not have been so worried. >

The "or_set_errno" part is understood for C functions that can fail. And just because the conversion functions are largely used by file API wrappers, or that the error codes happen to be OK for these APIs, doesn't make those functions specific to filenames. In fact, the same functions are also used for command-line, environment and console conversions.

> I can't say I see your point about it being the most flexible, though. > > >> > If the helper functions worked exactly like wcstombs/mbstowcs, a lot > >> > more, > >> > largely redundant code would be required. > >> > > >> > >> No, just the snippets above. And we'd get the nice bonus of not > >> risking polluting errno (which POSIX is very strict about when is > >> allowed to be set). > > > > The situations in which these functions will fail and "pollute" errno are > > well documented and very easy to avoid by passing a large enough buffer, so > > it's not really an issue. > > Documentation isn't an excuse for a misleading interface. Besides, who > reads the documentation anyway? ;) > > Speaking of documentation, javadoc/doxygen-style tags is not the way > we document in Git. We tend to write API-documentation in > Documentation/technical (or simply as normal comments).

Yeah, I noticed that...manually documenting APIs apart from the source is tedious work and the results tend to be out of date and incompl

Talk about advantages of API documentation tools, things like
* complete, up-to-date method signatures
* cross referenced and indexed
* IDE integration

(KB, last updated 1997 while installing doxygen...)

karste...@dcon.de

unread,

Oct 15, 2011, 1:21:07 PM10/15/11

to kusm...@gmail.com, Karsten Blees, Johannes Schindelin, msy...@googlegroups.com

The conversion functions are designed to require minimal additional code in API wrappers in general, they are not specifically targeted at (or exclusively used for) filename conversions.

Erik Faye-Lund

unread,

Oct 18, 2011, 2:27:07 PM10/18/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Sat, Oct 15, 2011 at 6:39 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 11.10.2011 17:37:54:
>
>>On Tue, Oct 11, 2011 at 4:59 PM, <karste...@dcon.de> wrote:
>>>
>>> Erik Faye-Lund <kusm...@gmail.com> wrote on 11.10.2011 14:19:08:
>>>
>>> On Tue, Oct 11, 2011 at 2:09 PM, <karste...@dcon.de> wrote:
>>> I actually tried a lot of variants (including some macro tricks) before
>>> I
>>> settled on the current API, which IMO is simple, flexible, and minimizes
>>> duplicate code.
>>>
>>
>> It might be the simplest for you, but it's certainly a can of worms
>> for future users. And this special casing of the routine renders your
>> earlier argument that this had the same semantics as wcstombs/mbstowcs
>> kind of moot.
>>
>> Setting errno to EINVAL on NULL-input is a responsibility of a given
>> set of library functions,
>
> Setting and checking errno is how error handling works in C. It is neither
> reserved for libraries, nor is the use of EINVAL (or any other error code)
> restricted to specific functions. POSIX implementations are expressly
> allowed to use additional error codes as long they don't conflict with the
> errors section of the function's specification
> (http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html).
>
> E.g. the BSD, OS X and MSVC versions of mbstowcs and wcstombs can all fail
> with EINVAL.

This does not really contradict my point, though. Being allowed
doesn't make it a bad idea. I prefer having clear and understandable
code over having a few lines less of code.

But my concern isn't with EINVAL so much as it is with ENAMETOOLONG,
which is applying filename semantics to any buffer overflow. And the C
way of handling general buffer overflows tends to be capping on write
and returning the bytes written.

> The same is true for most POSIX wrappers in mingw.c, either by
> wrapping the MSVC version of the function or by using err_win_to_posix
> (which btw. also maps the generic ERROR_BUFFER_OVERFLOW to ENAMETOOLONG).

err_win_to_posix converting ERROR_BUFFER_OVERFLOW to ENAMETOOLONG
looks like a bug to me. It should probably only have converted
ERROR_FILENAME_EXCED_RANGE to that.

The problem is that there isn't an ERROR_BUFFER_OVERFLOW-equivalent
errno-value in POSIX, so we can't really convert it. But very few
Win32-functions can generate ERROR_BUFFER_OVERFLOW:
GetAdaptersAddresses, GetAdaptersInfo, GetNetworkParams,
GetComputerName, SendARP, DsUnquoteRdnValue, DsQuoteRdnValue,
PrepareLogArchive, GetPerAdapterInfo, SLSetGenuineInformation and
MprConfigGetGuidName. We don't call any of these functions, so we can
safely remove it from the conversion table.

>> and to the best of my knowledge we already
>> do this where it's needed. If you want to refactor this part to reduce
>> code-reuse, you are free to do that. But please do not do it as some
>> cryptic side-effect in a conversion function named to be similar of
>> some other function that does not do this. If the function was called
>> something like convert_filename_or_set_errno (could be implemented as
>> a wrapper around this), I would not have been so worried.
>>
>
> The "or_set_errno" part is understood for C functions that can fail.

Yes, but it helps the reader of the code understand why it's not being
set elsewhere in the function call. Basically, when looking over the
code for a function like opendir (where POSIX require errno to be set
to ENAMETOOLONG if the passed path is above PATH_MAX), it is quite
obvious where errno is set if it calls a function called
"convert_path_or_set_errno". A comment can be just as clear thought,
so I have no problem just going with "convert_filename" plus a
comment. The name was just suggestion.

> And
> just because the conversion functions are largely used by file API wrappers,
> or that the error codes happen to be OK for these APIs, doesn't make those
> functions specific to filenames. In fact, the same functions are also used
> for command-line, environment and console conversions.
>

Exactly, this is a good reason why we do NOT want a filename specific
errno-assignment in these non-path-related code-paths. Even though
these code-paths might take care not to trigger it, it's just not very
easy for people reading the code to to understand. And the code can
change at some point, where a cryptic error message might start
appearing instead of one that makes sense.

>> Speaking of documentation, javadoc/doxygen-style tags is not the way
>> we document in Git. We tend to write API-documentation in
>> Documentation/technical (or simply as normal comments).
>

> Yeah, I noticed that... manually documenting APIs apart from the source is

> tedious work and the results tend to be out of date and incompl

:D

>
> Talk about advantages of API documentation tools, things like
> * complete, up-to-date method signatures
> * cross referenced and indexed
> * IDE integration
>
> (KB, last updated 1997 while installing doxygen...)

I'm not objecting to Doxygen being used in git, I'm objecting to
starting using Doxygen documentation in a commit that is not about
just that. Starting to use a new tool in an unrelated commit isn't the
way we usually do it :)

And just for the sake of the argument: Personally, I'm not a big
Doxygen-fan (because programmers tend to have the crazy idea that
listings of functions and structures is useful documentation, and be
lazy with proper documentation as a result), but I'm not in violent
disagree with it either. Besides, I don't think Doxygen does a very
good job on C-code.

Erik Faye-Lund

unread,

Oct 18, 2011, 4:31:59 PM10/18/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> ...rather than checking for special values on every getenv call.

>
> Signed-off-by: Karsten Blees <bl...@dcon.de>
> ---

> compat/mingw.c | 33 ++++++++++++++++-----------------
> 1 files changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/compat/mingw.c b/compat/mingw.c
> index dc905bb..77ea5d9 100644
> --- a/compat/mingw.c
> +++ b/compat/mingw.c
> @@ -832,7 +832,7 @@ static int environ_size = 0;
> /* allocated size of environ array, in bytes */
> static int environ_alloc = 0;
>
> -static char *do_getenv(const char *name)
> +char *mingw_getenv(const char *name)
> {
> char *value;
> int pos = bsearchenv(environ, name, environ_size - 1);
> @@ -842,22 +842,6 @@ static char *do_getenv(const char *name)
> return value ? &value[1] : NULL;
> }
>
> -char *mingw_getenv(const char *name)
> -{
> - char *result = do_getenv(name);
> - if (!result && !strcmp(name, "TMPDIR")) {
> - /* on Windows it is TMP and TEMP */
> - result = do_getenv("TMP");
> - if (!result)
> - result = do_getenv("TEMP");
> - }
> - else if (!result && !strcmp(name, "TERM")) {
> - /* simulate TERM to enable auto-color (see color.c) */
> - result = "winansi";
> - }
> - return result;
> -}
> -
> int mingw_putenv(const char *namevalue)
> {
> ALLOC_GROW(environ, (environ_size + 1) * sizeof(char*), environ_alloc);
> @@ -2124,6 +2108,21 @@ void mingw_startup()
> /* sort environment for O(log n) getenv / putenv */
> qsort(environ, i, sizeof(char*), compareenv);
>
> + /* fix Windows specific environment settings */
> +
> + /* on Windows it is TMP and TEMP */
> + if (!getenv("TMPDIR")) {
> + const char *tmp = getenv("TMP");
> + if (!tmp)
> + tmp = getenv("TEMP");
> + if (tmp)
> + setenv("TMPDIR", tmp, 1);
> + }
> +
> + /* simulate TERM to enable auto-color (see color.c) */
> + if (!getenv("TERM"))
> + setenv("TERM", "winansi", 1);
> +

This has the side-effect that these environment variables now gets
exported to child-processes, right? I'm not saying it's a bad thing,
but perhaps it's worth pointing out in the commit message?

I actually think exporting TMPDIR to child-processes fixes a small
issue in "git archimport" (i.e when called through git.exe)... :)

karste...@dcon.de

unread,

Nov 25, 2011, 1:02:48 PM11/25/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

First of all, sorry for not getting back to you sooner, but another project took up most of my time for the last few weeks.
Erik Faye-Lund <kusm...@gmail.com> wrote on 18.10.2011 20:27:07:
[...]

> But my concern isn't with EINVAL so much as it is with ENAMETOOLONG, > which is applying filename semantics to any buffer overflow.

ENAMETOOLONG is also used by sem_open/unlink, mq_open/unlink, posix_typed_mem_open/unlink and gethostname (in GLIBC), so its not exactly filename specific. Likewise I wouldn't call getcwd, write and strerror_r "math-specific" just because they can fail with ERANGE (which usually indicates arithmetic over/underflow in things like strtod, pow, exp, tan etc.).

With a limited set of general purpose error codes reused by different subsystems, it is inevitable that these error codes (and associated strerror messages) are not always accurate. ENAMETOOLONG just seems to be the best match for 'input data too long', but I could easily change it to ERANGE, E2BIG or even EOVERFLOW if you think that's better.

> And the C > way of handling general buffer overflows tends to be capping on write > and returning the bytes written.

[v]snprintf: returns required size, buffer capped and 0-terminated
strncpy: returns bytes written, buffer capped and NOT 0-terminated
mbstowcs/wcstombs: dito, additionally returns required size if buffer is NULL
strxfrm: returns required size, buffer unspecified
strerror_r: fails with ERANGE (errno must have been reset to 0 beforehand)
iconv: fails with E2BIG

Except for iconv (which follows C error handling conventions), the only clear pattern I can see here is that all these functions require extra code on the client side to even _detect_ buffer overflows.

> > The same is true for most POSIX wrappers in mingw.c, either by > > wrapping the MSVC version of the function or by using err_win_to_posix > > (which btw. also maps the generic ERROR_BUFFER_OVERFLOW to ENAMETOOLONG). > > err_win_to_posix converting ERROR_BUFFER_OVERFLOW to ENAMETOOLONG > looks like a bug to me. It should probably only have converted > ERROR_FILENAME_EXCED_RANGE to that.

The message text for ERROR_BUFFER_OVERFLOW (what you get from FormatMessageA/W) is "The file name is too long.", translating this to ENAMETOOLONG is actually spot on.

> The problem is that there isn't an ERROR_BUFFER_OVERFLOW-equivalent > errno-value in POSIX, so we can't really convert it. But very few > Win32-functions can generate ERROR_BUFFER_OVERFLOW: > GetAdaptersAddresses, GetAdaptersInfo, GetNetworkParams, > GetComputerName, SendARP, DsUnquoteRdnValue, DsQuoteRdnValue, > PrepareLogArchive, GetPerAdapterInfo, SLSetGenuineInformation and > MprConfigGetGuidName. We don't call any of these functions, so we can > safely remove it from the conversion table.

That's just the functions that explicitly document it...check the remarks section of GetLastError:

"The error codes returned by a function are not part of the Windows API specification and can vary by operating system or device driver. For this reason, we cannot provide the complete list of error codes that can be returned by each function. There are also many functions whose documentation does not include even a partial list of error codes that can be returned."

For this reason, I think it is impossible on Windows to provide an API layer that is fully POSIX comliant with respect to error codes.

For example, regarding overlong file names: CreateFileA only fails with ERROR_FILENAME_EXCED_RANGE if the filename exceeds 64K (well beyond MAX_PATH). Below that you get ERROR_PATH_NOT_FOUND (< 32K) or ERROR_INVALID_PARAMETER (> 32K), which _[w]open / _[w]creat translate to ENOENT or EINVAL, respectively. None of the MSVCRT stdio functions will _ever_ fail with ENAMETOOLONG, there's is no such mapping in the sources (dosmap.c). And I don't think cygwin does a better job with respect to error codes either.

But my issue in this patch series isn't with error codes. I just want to add proper Unicode support, and a single set of conversion functions should suffice for this. Adding wrappers that only differ in error semantics / error codes would IMO be quite deceptive as it suggests a "precision" in error codes that simply cannot be achieved on Windows.

> I'm not objecting to Doxygen being used in git, I'm objecting to > starting using Doxygen documentation in a commit that is not about > just that. Starting to use a new tool in an unrelated commit isn't the > way we usually do it :)

I'm not trying to introduce Doxygen as a new tool, I just find the doxygen/JavaDoc tags convenient for structured API documentation. If doxygen format is objectionable, though, I can invent my own format (it seems there's no API in git yet that's documented in this level of detail).

Ciao,
Karsten

karste...@dcon.de

unread,

Nov 28, 2011, 11:08:31 AM11/28/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com, Johannes....@gmx.de

Erik Faye-Lund <kusm...@gmail.com> wrote on 18.10.2011 20:27:07: > On Sat, Oct 15, 2011 at 6:39 PM, <karste...@dcon.de> wrote: > > Erik Faye-Lund <kusm...@gmail.com> wrote on 11.10.2011 17:37:54:

[...]

> >> If the function was called > >> something like convert_filename_or_set_errno (could be implemented as > >> a wrapper around this), I would not have been so worried. > >>

Would this resolve your issues with the conversion functions? I'd like your blessings before I start breaking this down into the individual patches.

Ciao,
Karsten

---
* Renamed the helper functions to xutftowcs / xwcstoutf (following git
practice of having x* overrides of prominent library functions, usually
with modified error semantics). That should make it clear that the helper
functions are NOT identical to mbstowcs / wcstombs.

* Changed ENAMETOOLONG to ERANGE in both conversion functions.

* Added a filename-specific wrapper xutftowcs_path that implies MAX_PATH
length for the output buffer and fails with ENAMETOOLONG instead of ERANGE.

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/mingw.c | 65 +++++++++++++++++++++++-------------------------
compat/mingw.h | 39 +++++++++++++++++++++++------
compat/win32/dirent.c | 9 ++++--
compat/winansi.c | 2 +-
4 files changed, 69 insertions(+), 46 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 15c1029..2104f25 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -205,7 +205,7 @@ int mingw_unlink(const char *pathname)
{
int ret, tries = 0;
wchar_t wpathname[MAX_PATH];
- if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ if (xutftowcs_path(wpathname, pathname) < 0)
return -1;

/* read-only files cannot be removed */
@@ -253,7 +253,7 @@ int mingw_rmdir(const char *pathname)
{
int ret, tries = 0;
wchar_t wpathname[MAX_PATH];
- if (utftowcs(wpathname, pathname, MAX_PATH) < 0)
+ if (xutftowcs_path(wpathname, pathname) < 0)
return -1;

while ((ret = _wrmdir(wpathname)) == -1 && tries < ARRAY_SIZE(delay)) {
@@ -293,7 +293,7 @@ void mingw_mark_as_git_dir(const char *dir)
{
wchar_t wdir[MAX_PATH];
if (hide_dotfiles != HIDE_DOTFILES_FALSE && !is_bare_repository())
- if (utftowcs(wdir, dir, MAX_PATH) < 0 || make_hidden(wdir))
+ if (xutftowcs_path(wdir, dir) < 0 || make_hidden(wdir))
warning("Failed to make '%s' hidden", dir);
git_config_set("core.hideDotFiles",
hide_dotfiles == HIDE_DOTFILES_FALSE ? "false" :
@@ -305,7 +305,7 @@ int mingw_mkdir(const char *path, int mode)
{
int ret;
wchar_t wpath[MAX_PATH];
- if (utftowcs(wpath, path, MAX_PATH) < 0)
+ if (xutftowcs_path(wpath, path) < 0)
return -1;
ret = _wmkdir(wpath);
if (!ret && hide_dotfiles == HIDE_DOTFILES_TRUE) {
@@ -335,7 +335,7 @@ int mingw_open (const char *filename, int oflags, ...)
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";

- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
fd = _wopen(wfilename, oflags, mode);

@@ -385,8 +385,8 @@ FILE *mingw_fopen (const char *filename, const char *otype)
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
- utftowcs(wotype, otype, 10) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0 ||
+ xutftowcs(wotype, otype, 10) < 0)
return NULL;
file = _wfopen(wfilename, wotype);
if (file && hide && make_hidden(wfilename))
@@ -404,8 +404,8 @@ FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream)
hide = access(filename, F_OK);
if (filename && !strcmp(filename, "/dev/null"))
filename = "nul";
- if (utftowcs(wfilename, filename, MAX_PATH) < 0 ||
- utftowcs(wotype, otype, 10) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0 ||
+ xutftowcs(wotype, otype, 10) < 0)
return NULL;
file = _wfreopen(wfilename, wotype, stream);
if (file && hide && make_hidden(wfilename))
@@ -416,7 +416,7 @@ FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream)
int mingw_access(const char *filename, int mode)
{
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
/* X_OK is not supported by the MSVCRT version */
return _waccess(wfilename, mode & ~X_OK);
@@ -425,7 +425,7 @@ int mingw_access(const char *filename, int mode)
int mingw_chdir(const char *dirname)
{
wchar_t wdirname[MAX_PATH];
- if (utftowcs(wdirname, dirname, MAX_PATH) < 0)
+ if (xutftowcs_path(wdirname, dirname) < 0)
return -1;
return _wchdir(wdirname);
}
@@ -433,7 +433,7 @@ int mingw_chdir(const char *dirname)
int mingw_chmod(const char *filename, int mode)
{
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, filename, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, filename) < 0)
return -1;
return _wchmod(wfilename, mode);
}
@@ -465,7 +465,7 @@ static int do_lstat(int follow, const char *file_name, struct stat *buf)
{
WIN32_FILE_ATTRIBUTE_DATA fdata;
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, file_name) < 0)
return -1;

if (GetFileAttributesExW(wfilename, GetFileExInfoStandard, &fdata)) {
@@ -608,7 +608,7 @@ int mingw_utime (const char *file_name, const struct utimbuf *times)
int fh, rc;
DWORD attrs;
wchar_t wfilename[MAX_PATH];
- if (utftowcs(wfilename, file_name, MAX_PATH) < 0)
+ if (xutftowcs_path(wfilename, file_name) < 0)
return -1;

/* must have write permission */
@@ -656,11 +656,11 @@ unsigned int sleep (unsigned int seconds)
char *mingw_mktemp(char *template)
{
wchar_t wtemplate[MAX_PATH];
- if (utftowcs(wtemplate, template, MAX_PATH) < 0)
+ if (xutftowcs_path(wtemplate, template) < 0)
return NULL;
if (!_wmktemp(wtemplate))
return NULL;
- if (wcstoutf(template, wtemplate, strlen(template) + 1) < 0)
+ if (xwcstoutf(template, wtemplate, strlen(template) + 1) < 0)
return NULL;
return template;
}
@@ -729,7 +729,7 @@ char *mingw_getcwd(char *pointer, int len)
wchar_t wpointer[MAX_PATH];
if (!_wgetcwd(wpointer, MAX_PATH))
return NULL;
- if (wcstoutf(pointer, wpointer, len) < 0)
+ if (xwcstoutf(pointer, wpointer, len) < 0)
return NULL;
for (i = 0; pointer[i]; i++)
if (pointer[i] == '\\')
@@ -1056,7 +1056,7 @@ static wchar_t *make_environment_block(char **deltaenv)
for (i = 0; tmpenv[i]; i++) {
size = 2 * strlen(tmpenv[i]) + 2;
ALLOC_GROW(wenvblk, (envblkpos + size) * sizeof(wchar_t), envblksz);
- envblkpos += utftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;
+ envblkpos += xutftowcs(&wenvblk[envblkpos], tmpenv[i], size) + 1;
}
/* add final \0 terminator */
wenvblk[envblkpos] = 0;
@@ -1114,9 +1114,9 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaen
si.hStdOutput = winansi_get_osfhandle(fhout);
si.hStdError = winansi_get_osfhandle(fherr);

- if (utftowcs(wcmd, cmd, MAX_PATH) < 0)
+ if (xutftowcs_path(wcmd, cmd) < 0)
return -1;
- if (dir && utftowcs(wdir, dir, MAX_PATH) < 0)
+ if (dir && xutftowcs_path(wdir, dir) < 0)
return -1;

/* concatenate argv, quoting args as we go */
@@ -1137,7 +1137,7 @@ static pid_t mingw_spawnve_fd(const char *cmd, const char **argv, char **deltaen
}

wargs = xmalloc((2 * args.len + 1) * sizeof(wchar_t));
- utftowcs(wargs, args.buf, 2 * args.len + 1);
+ xutftowcs(wargs, args.buf, 2 * args.len + 1);
strbuf_release(&args);

wenvblk = make_environment_block(deltaenv);
@@ -1599,9 +1599,7 @@ int mingw_rename(const char *pold, const char *pnew)
DWORD attrs, gle;
int tries = 0;
wchar_t wpold[MAX_PATH], wpnew[MAX_PATH];
- if (utftowcs(wpold, pold, MAX_PATH) < 0)
- return -1;
- if (utftowcs(wpnew, pnew, MAX_PATH) < 0)
+ if (xutftowcs_path(wpold, pold) < 0 || xutftowcs_path(wpnew, pnew) < 0)
return -1;

/*
@@ -1842,9 +1840,8 @@ int link(const char *oldpath, const char *newpath)
typedef BOOL (WINAPI *T)(LPCWSTR, LPCWSTR, LPSECURITY_ATTRIBUTES);
static T create_hard_link = NULL;
wchar_t woldpath[MAX_PATH], wnewpath[MAX_PATH];
- if (utftowcs(woldpath, oldpath, MAX_PATH) < 0)
- return -1;
- if (utftowcs(wnewpath, newpath, MAX_PATH) < 0)
+ if (xutftowcs_path(woldpath, oldpath) < 0 ||
+ xutftowcs_path(wnewpath, newpath) < 0)
return -1;

if (!create_hard_link) {
@@ -1973,7 +1970,7 @@ int mingw_offset_1st_component(const char *path)
return offset + is_dir_sep(path[offset]);
}

-int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
+int xutftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
{
int upos = 0, wpos = 0;
const unsigned char *utf = (const unsigned char*) utfs;
@@ -1993,7 +1990,7 @@ int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)

if (wpos >= wcslen) {
wcs[wpos] = 0;
- errno = ENAMETOOLONG;
+ errno = ERANGE;
return -1;
}

@@ -2045,7 +2042,7 @@ int utftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
return wpos;
}

-int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
+int xwcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
{
if (!wcs || !utf || utflen < 1) {
errno = EINVAL;
@@ -2054,7 +2051,7 @@ int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen)
utflen = WideCharToMultiByte(CP_UTF8, 0, wcs, -1, utf, utflen, NULL, NULL);
if (utflen)
return utflen - 1;
- errno = ENAMETOOLONG;
+ errno = ERANGE;
return -1;
}

@@ -2099,14 +2096,14 @@ void mingw_startup()
buffer = xmalloc(maxlen);

/* convert command line arguments and environment to UTF-8 */
- len = wcstoutf(buffer, _wpgmptr, maxlen);
+ len = xwcstoutf(buffer, _wpgmptr, maxlen);
__argv[0] = xmemdupz(buffer, len);
for (i = 1; i < argc; i++) {
- len = wcstoutf(buffer, wargv[i], maxlen);
+ len = xwcstoutf(buffer, wargv[i], maxlen);
__argv[i] = xmemdupz(buffer, len);
}
for (i = 0; wenv[i]; i++) {
- len = wcstoutf(buffer, wenv[i], maxlen);
+ len = xwcstoutf(buffer, wenv[i], maxlen);
environ[i] = xmemdupz(buffer, len);
}
environ[i] = NULL;
diff --git a/compat/mingw.h b/compat/mingw.h
index 675eb06..675d19a 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -375,13 +375,33 @@ void mingw_mark_as_git_dir(const char *dir);
* utflen: size of string to convert, or -1 if 0-terminated
*
* Returns:
- * length of converted string (_wcslen(wcs)), or -1 on failure (errno is set
- * to EINVAL or ENAMETOOLONG)
+ * length of converted string (_wcslen(wcs)), or -1 on failure
+ *
+ * Errors:
+ * EINVAL: one of the input parameters is invalid (e.g. NULL)
+ * ERANGE: the output buffer is too small
+ */
+int xutftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
+
+/**
+ * Simplified variant of xutftowcsn, assumes input string is \0-terminated.
+ */
+static inline int xutftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
+{
+ return xutftowcsn(wcs, utf, wcslen, -1);
+}
+
+/**
+ * Simplified file system specific variant of xutftowcsn, assumes output
+ * buffer size is MAX_PATH wide chars and input string is \0-terminated,
+ * fails with ENAMETOOLONG if input string is too long.
*/
-int utftowcsn(wchar_t *wcs, const char *utf, size_t wcslen, int utflen);
-static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
+static inline int xutftowcs_path(wchar_t *wcs, const char *utf)
{
- return utftowcsn(wcs, utf, wcslen, -1);
+ int result = xutftowcsn(wcs, utf, MAX_PATH, -1);
+ if (result < 0 && errno == ERANGE)
+ errno = ENAMETOOLONG;
+ return result;
}

/**
@@ -410,10 +430,13 @@ static inline int utftowcs(wchar_t *wcs, const char *utf, size_t wcslen)
* utflen: size of target buffer
*
* Returns:
- * length of converted string, or -1 on failure (errno is set to EINVAL or
- * ENAMETOOLONG)
+ * length of converted string, or -1 on failure
+ *
+ * Errors:
+ * EINVAL: one of the input parameters is invalid (e.g. NULL)
+ * ERANGE: the output buffer is too small
*/
-int wcstoutf(char *utf, const wchar_t *wcs, size_t utflen);
+int xwcstoutf(char *utf, const wchar_t *wcs, size_t utflen);

/*
* A replacement of main() that adds win32 specific initialization.
diff --git a/compat/win32/dirent.c b/compat/win32/dirent.c
index 37f56b7..c69a689 100644
--- a/compat/win32/dirent.c
+++ b/compat/win32/dirent.c
@@ -9,7 +9,7 @@ struct DIR {
static inline void finddata2dirent(struct dirent *ent, WIN32_FIND_DATAW *fdata)
{
/* convert UTF-16 name to UTF-8 */
- wcstoutf(ent->d_name, fdata->cFileName, sizeof(ent->d_name));
+ xwcstoutf(ent->d_name, fdata->cFileName, sizeof(ent->d_name));

/* Set file type, based on WIN32_FIND_DATA */
if (fdata->dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
@@ -27,9 +27,12 @@ DIR *opendir(const char *name)
DIR *dir;

/* convert name to UTF-16, check length (-2 for '/' '*') */
- len = utftowcs(pattern, name, MAX_PATH - 2);
- if (len < 0)
+ len = xutftowcs(pattern, name, MAX_PATH - 2);
+ if (len < 0) {
+ if (errno == ERANGE)
+ errno = ENAMETOOLONG;
return NULL;
+ }

/* append optional '/' and wildcard '*' */
if (len && !is_dir_sep(pattern[len - 1]))
diff --git a/compat/winansi.c b/compat/winansi.c
index d11b532..a0da0d3 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -297,7 +297,7 @@ static void write_console(char *str, size_t len)
static wchar_t wbuf[2 * BUFFER_SIZE + 1];

/* convert utf-8 to utf-16 */
- int wlen = utftowcsn(wbuf, str, 2 * BUFFER_SIZE + 1, len);
+ int wlen = xutftowcsn(wbuf, str, 2 * BUFFER_SIZE + 1, len);

/* write directly to console */
WriteConsoleW(console, wbuf, wlen, NULL, NULL);
--
1.7.7.1.msysgit.1.dirty

Erik Faye-Lund

unread,

Nov 28, 2011, 11:10:42 AM11/28/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com, Johannes....@gmx.de

On Mon, Nov 28, 2011 at 5:08 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 18.10.2011 20:27:07:
>> On Sat, Oct 15, 2011 at 6:39 PM, <karste...@dcon.de> wrote:
>> > Erik Faye-Lund <kusm...@gmail.com> wrote on 11.10.2011 17:37:54:
> [...]
>> >> If the function was called
>> >> something like convert_filename_or_set_errno (could be implemented as
>> >> a wrapper around this), I would not have been so worried.
>> >>
>
> Would this resolve your issues with the conversion functions? I'd like your
> blessings before I start breaking this down into the individual patches.
>

Yes, it would. Thanks! :)

Erik Faye-Lund

unread,

Nov 28, 2011, 12:51:24 PM11/28/11

to Karsten Blees, msy...@googlegroups.com

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> Replace stdout and stderr with a pipe if they point to the console. A
> background thread reads from the pipe, handles ANSI escape sequences and
> UTF-8 to UTF-16 conversion, then writes to the console.
>

How does this make the code any more thread-safe? I'm assuming the
previous problem was that one thread could set the console attributes,
and then another thread could come in and change it again, while the
other one is depending on it. Is that so?

If it is, won't pipes be just as problematic? Since pipes are of
finite length, writes can end up being interleaved, leading to the
same problem. But now it's even worse; writes can be split in the
middle of UTF-8 sequences, leading to not only errors in the console
attributes, but also in the text emitted!

Or is there something I'm missing?

Wouldn't it be better to put a critical section around ansi_emulate,
keep the console attribute in a TLS, and re-set it right after taking
the mutex? That way, every thread that prints to the console gets it's
own console state...

But I'm really wondering, why is this needed? What part of the git
code prints to the console from multiple threads? I can understand if
one thread writes to stdout while one writes to stderr, as stdout is
expected to be buffered while stderr is not... is this what cause
problems?

Johannes Schindelin

unread,

Nov 28, 2011, 1:36:52 PM11/28/11

to Erik Faye-Lund, karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

Hi,

Absolutely. I appreciate especially the renaming to *_path() -- even if it
might sound trivial to you -- as it makes it absolutely clear what this
function is doing.

Ciao,
Dscho

karste...@dcon.de

unread,

Nov 28, 2011, 5:53:05 PM11/28/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 28.11.2011 18:51:24: > On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote: > > Replace stdout and stderr with a pipe if they point to the console. A > > background thread reads from the pipe, handles ANSI escape sequences and > > UTF-8 to UTF-16 conversion, then writes to the console. > > > > How does this make the code any more thread-safe? I'm assuming the > previous problem was that one thread could set the console attributes, > and then another thread could come in and change it again, while the > other one is depending on it. Is that so? >

More critical than console attributes is unsynchronized access to the "console" and "last_stream" variables in is_console(). With two threads concurrently writing to a file and the console, the file thread may accidentally be redirected to the console. I think the git send/receive-pack routines work that way, although they use write() for the file part, which isn't overridden in winansi.c yet (the TODO is already there, though).

The only command that I'm aware of that prints to the console from multiple threads is git-grep (but this also uses write(), so console attributes currently don't work at all). But you're correct that multiple console threads _would_ overwrite each other's console attributes with the current, unsynchronized ansi_emulate function.

The new version doesn't have this problem because the Win32 and MSVCRT IO APIs are already thread safe, so a single call to an IO function is guaranteed to be uninterrupted by control codes from other threads. And if multi threaded console threads take care to reset their attributes with each call, everything is fine.

> If it is, won't pipes be just as problematic? Since pipes are of > finite length, writes can end up being interleaved, leading to the > same problem. But now it's even worse; writes can be split in the > middle of UTF-8 sequences, leading to not only errors in the console > attributes, but also in the text emitted! >

You're correct that UTF-8 sequences may be split (could happen every 1,024 characters at the worst with a 4k buffer). I've already tackled this problem in the msys console handler, and I'll try to look into it here when I find the time.

> Or is there something I'm missing? > > Wouldn't it be better to put a critical section around ansi_emulate, > keep the console attribute in a TLS, and re-set it right after taking > the mutex? That way, every thread that prints to the console gets it's > own console state... >

Multi threaded console apps will have to take care to print in reasonable chunks anyway (at least complete lines, git-grep prints one file at a time), and to reset their attributes in every single chunk. I don't think there's a per-thread console state on Linux either, so that's the only way it could possibly work.

Erik Faye-Lund

unread,

Nov 30, 2011, 6:19:03 AM11/30/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Mon, Nov 28, 2011 at 11:53 PM, <karste...@dcon.de> wrote:
>
> Erik Faye-Lund <kusm...@gmail.com> wrote on 28.11.2011 18:51:24:
>
>> On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:
>> > Replace stdout and stderr with a pipe if they point to the console. A
>> > background thread reads from the pipe, handles ANSI escape sequences and
>> > UTF-8 to UTF-16 conversion, then writes to the console.
>> >
>>
>> How does this make the code any more thread-safe? I'm assuming the
>> previous problem was that one thread could set the console attributes,
>> and then another thread could come in and change it again, while the
>> other one is depending on it. Is that so?
>>
>
> More critical than console attributes is unsynchronized access to the
> "console" and "last_stream" variables in is_console(). With two threads
> concurrently writing to a file and the console, the file thread may
> accidentally be redirected to the console.

If is_console can be called from multiple threads, then "initialized"
is also problematic. And you fix that in this patch, which is good.

> I think the git send/receive-pack
> routines work that way, although they use write() for the file part, which
> isn't overridden in winansi.c yet (the TODO is already there, though).
>
> The only command that I'm aware of that prints to the console from multiple
> threads is git-grep (but this also uses write(), so console attributes
> currently don't work at all).

Are you saying that the patch is purely theoretical? I'm not against
protecting ourselves from being stupid in the future, but I think it
should be reflected in the commit message (if my understanding is
correct)...

> But you're correct that multiple console
> threads _would_ overwrite each other's console attributes with the current,
> unsynchronized ansi_emulate function.
>
> The new version doesn't have this problem because the Win32 and MSVCRT IO
> APIs are already thread safe, so a single call to an IO function is
> guaranteed to be uninterrupted by control codes from other threads. And if
> multi threaded console threads take care to reset their attributes with each
> call, everything is fine.
>
>> If it is, won't pipes be just as problematic? Since pipes are of
>> finite length, writes can end up being interleaved, leading to the
>> same problem. But now it's even worse; writes can be split in the
>> middle of UTF-8 sequences, leading to not only errors in the console
>> attributes, but also in the text emitted!
>>
>
> You're correct that UTF-8 sequences may be split (could happen every 1,024
> characters at the worst with a 4k buffer). I've already tackled this problem
> in the msys console handler, and I'll try to look into it here when I find
> the time.
>

I'm looking forward to have a look at it :)

>> Or is there something I'm missing?
>>
>> Wouldn't it be better to put a critical section around ansi_emulate,
>> keep the console attribute in a TLS, and re-set it right after taking
>> the mutex? That way, every thread that prints to the console gets it's
>> own console state...
>>
>
> Multi threaded console apps will have to take care to print in reasonable
> chunks anyway (at least complete lines, git-grep prints one file at a time),
> and to reset their attributes in every single chunk. I don't think there's a
> per-thread console state on Linux either, so that's the only way it could
> possibly work.
>

In this case it doesn't help to take such care if your file can be
larger than the 4k-pipe you've created. Multiple files might be
interleaved in 4k blocks.

>> But I'm really wondering, why is this needed? What part of the git
>> code prints to the console from multiple threads? I can understand if
>> one thread writes to stdout while one writes to stderr, as stdout is
>> expected to be buffered while stderr is not... is this what cause
>> problems?

I'd really love an answer to this question. Is this a theoretical fix,
or does it fix observable misbehavior?

After another (quick) review of the code, I have another nit on the
implementation:

On Thu, Oct 6, 2011 at 7:41 PM, Karsten Blees <bl...@dcon.de> wrote:

> @@ -28,8 +23,10 @@ static HANDLE console;
> static WORD plain_attr;
> static WORD attr;
> static int negative;
> -static FILE *last_stream = NULL;
> static int non_ascii_used = 0;
> +static HANDLE hthread, hread, hwrite;
> +static HANDLE hwrite1 = INVALID_HANDLE_VALUE, hwrite2 = INVALID_HANDLE_VALUE;
> +static HANDLE hconsole1, hconsole2;
>

'attr', 'plain_attr' and 'negative' are global variables, and...

> -static int is_console(FILE *stream)
> +static int is_console(int fd)
> {
> CONSOLE_SCREEN_BUFFER_INFO sbi;
> HANDLE hcon;
>
> - static int initialized = 0;
<snip>
> - if (!initialized) {
> - attr = plain_attr = sbi.wAttributes;
> - negative = 0;
> - initialized = 1;
> - /* check console font on exit */
> - atexit(warn_if_raster_font);
> - }
> -
> - console = hcon;
> + /* initialize attributes */
> + attr = plain_attr = sbi.wAttributes;
> + negative = 0;
> return 1;
> }
>

...is_console() initialize these from it's input fd's console attributes. But...

> +void winansi_init(void)
> +{
> + int con1, con2, hwrite_fd;
>
> - rv = ansi_emulate(buf, stream);
> + /* check if either stdout or stderr is a console output screen buffer */
> + con1 = is_console(1);
> + con2 = is_console(2);
> + if (!con1 && !con2)
> + return;
>

you call is_console(1) followed by is_console(2). This means that
"attr" and "negative" gets initialized twice. I guess your reasoning
is that there's only one console, so they'll be initialized to the
same value. I think this deserves a comment. Something like:

+ con1 = is_console(1);
+ con2 = is_console(2); /* may re-initialize state */

...should be sufficient. But perhaps it's even better to simply keep
the old initialized-logic, since it's already there?

karste...@dcon.de

unread,

Nov 30, 2011, 8:53:10 AM11/30/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

The thread safety part may be theoretical...at least I currently can't give you an example that reliably triggers the threading issue. I haven't analyzed all the pthread_create code paths though, so there may well be some git --no-pager ... command that does (probably with debug/trace output turned on).

The patch fixes ANSI emulation in at least git-grep and git-diff and Unicode commit messages in git-show and git-log, so it has some practical applications (and if its only performance by not having to check isatty / GetConsoleScreenBufferInfo on each IO call).

My local commit message currently starts like so:
"Winansi.c has many static variables that are accessed and modified from
the [v][f]printf / fputs functions overridden in the file. This may cause
multi threaded git commands that print to the console to produce corrupted
output or even crash.

Additionally, winansi.c doesn't override all functions that can be used to
print to the console (e.g. fwrite, write, fputc are missing), so that ANSI
escapes don't work properly for some git commands (e.g. git-grep).

Instead of doing ANSI emulation in just a few wrapped functions on top of
the IO API, let's plug into the IO system and take advantage of the thread
safety inherent to the IO system.
..."

> > > > You're correct that UTF-8 sequences may be split (could happen every 1,024 > > characters at the worst with a 4k buffer). I've already tackled this problem > > in the msys console handler, and I'll try to look into it here when I find > > the time. > > > > I'm looking forward to have a look at it :) >

See the patch below, I'll integrate this in the next version.

> >> Or is there something I'm missing? > >> > >> Wouldn't it be better to put a critical section around ansi_emulate, > >> keep the console attribute in a TLS, and re-set it right after taking > >> the mutex? That way, every thread that prints to the console gets it's > >> own console state... > >> > > > > Multi threaded console apps will have to take care to print in reasonable > > chunks anyway (at least complete lines, git-grep prints one file at a time),

> > and to reset their attributes in every single chunk. I don't thinkthere's a

> > per-thread console state on Linux either, so that's the only way it could > > possibly work. > > > > In this case it doesn't help to take such care if your file can be > larger than the 4k-pipe you've created. Multiple files might be > interleaved in 4k blocks. >

Only if the calling thread splits the files into 4k blocks and calls WriteFile/write/fwrite for each block. If you print the whole file in a single call, that call is uninterruptible by other threads (this is what git-grep does). The buffer size of the pipe doesn't affect the tread safety of stdio and Win32 APIs at all.

> >> But I'm really wondering, why is this needed? What part of the git > >> code prints to the console from multiple threads? I can understand if > >> one thread writes to stdout while one writes to stderr, as stdout is > >> expected to be buffered while stderr is not... is this what cause > >> problems? > > I'd really love an answer to this question. Is this a theoretical fix, > or does it fix observable misbehavior? >

The thread-safety part is theoretical (IMO that's no reason not to fix it though).

> After another (quick) review of the code, I have another nit on the > implementation: >

[...]

> > you call is_console(1) followed by is_console(2). This means that > "attr" and "negative" gets initialized twice. I guess your reasoning > is that there's only one console, so they'll be initialized to the > same value. I think this deserves a comment. Something like: > > + con1 = is_console(1); > + con2 = is_console(2); /* may re-initialize state */ > > ...should be sufficient. But perhaps it's even better to simply keep > the old initialized-logic, since it's already there?

Thanks, you're right, I didn't account for stderr / stdout going to different consoles :-)
is_console is called only twice during initialization, so spending a static 'initialized' variable to prevent reinitialization seems overkill. I guess I'll go for the comment.

---
Subject: [PATCH] Win32: fix broken UTF-8 console output

Signed-off-by: Karsten Blees <bl...@dcon.de>
---

compat/winansi.c | 36 +++++++++++++++++++++++++++++-------
1 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/compat/winansi.c b/compat/winansi.c
index a0da0d3..2b50dfb 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -291,13 +291,13 @@ static void set_attr(char func, const int *params, int paramlen)
#define BUFFER_SIZE 4096
#define MAX_PARAMS 16

-static void write_console(char *str, size_t len)
+static void write_console(unsigned char *str, size_t len)
{
/* only called from console_thread, so a static buffer will do */

static wchar_t wbuf[2 * BUFFER_SIZE + 1];

/* convert utf-8 to utf-16 */

- int wlen = xutftowcsn(wbuf, str, 2 * BUFFER_SIZE + 1, len);
+ int wlen = xutftowcsn(wbuf, (char*) str, 2 * BUFFER_SIZE + 1, len);

/* write directly to console */
WriteConsoleW(console, wbuf, wlen, NULL, NULL);

@@ -313,14 +313,14 @@ enum {

static DWORD WINAPI console_thread(LPVOID unused)
{
- char buffer[BUFFER_SIZE];
+ unsigned char buffer[BUFFER_SIZE];
DWORD bytes;
- int start, end, c, parampos = 0, state = TEXT;
+ int start, end = 0, c, parampos = 0, state = TEXT;
int params[MAX_PARAMS];

while (state != EXIT) {
/* read next chunk of bytes from the pipe */
- if (!ReadFile(hread, buffer, BUFFER_SIZE, &bytes, NULL)) {
+ if (!ReadFile(hread, buffer + end, BUFFER_SIZE - end, &bytes, NULL)) {
/* exit if pipe has been closed */
if (GetLastError() == ERROR_BROKEN_PIPE)
break;
@@ -329,6 +329,7 @@ static DWORD WINAPI console_thread(LPVOID unused)
}

/* scan the bytes and handle ANSI control codes */
+ bytes += end;
start = end = 0;
while (end < bytes) {
c = buffer[end++];
@@ -376,8 +377,29 @@ static DWORD WINAPI console_thread(LPVOID unused)
}

/* print remaining text unless we're parsing an escape sequence */
- if (state == TEXT && end > start)
- write_console(buffer + start, end - start);
+ if (state == TEXT && end > start) {
+ /* check for incomplete UTF-8 sequences and fix end accordingly */
+ if (buffer[end - 1] >= 0x80) {
+ if (buffer[end -1] >= 0xc0)
+ end--;
+ else if (end - 1 > start && buffer[end - 2] >= 0xe0)
+ end -= 2;
+ else if (end - 2 > start && buffer[end - 3] >= 0xf0)
+ end -= 3;
+ }
+
+ /* print remaining complete UTF-8 sequences */
+ if (end > start)
+ write_console(buffer + start, end - start);
+
+ /* move remaining bytes to the front */
+ if (end < bytes)
+ memmove(buffer, buffer + end, bytes - end);
+ end = bytes - end;
+ } else {
+ /* all data has been consumed, mark buffer empty */
+ end = 0;
+ }
}

/* check if the console font supports unicode */
--
1.7.7.1.msysgit.0.298.gce7c7

Erik Faye-Lund

unread,

Nov 30, 2011, 10:04:50 AM11/30/11

to karste...@dcon.de, Karsten Blees, msy...@googlegroups.com

On Wed, Nov 30, 2011 at 2:53 PM, <karste...@dcon.de> wrote:
> Erik Faye-Lund <kusm...@gmail.com> wrote on 30.11.2011 12:19:03:
>> On Mon, Nov 28, 2011 at 11:53 PM, <karste...@dcon.de> wrote:
>> > The only command that I'm aware of that prints to the console from
>> > multiple
>> > threads is git-grep (but this also uses write(), so console attributes
>> > currently don't work at all).
>>
>> Are you saying that the patch is purely theoretical? I'm not against
>> protecting ourselves from being stupid in the future, but I think it
>> should be reflected in the commit message (if my understanding is
>> correct)...
>>
>
> The thread safety part may be theoretical...at least I currently can't give
> you an example that reliably triggers the threading issue. I haven't
> analyzed all the pthread_create code paths though, so there may well be some
> git --no-pager ... command that does (probably with debug/trace output
> turned on).
>
> The patch fixes ANSI emulation in at least git-grep and git-diff and Unicode
> commit messages in git-show and git-log, so it has some practical
> applications (and if its only performance by not having to check isatty /
> GetConsoleScreenBufferInfo on each IO call).
>

Thanks a lot for the explanation. Based on this, I agree that this is
worth fixing.

> My local commit message currently starts like so:
> "Winansi.c has many static variables that are accessed and modified from
> the [v][f]printf / fputs functions overridden in the file. This may cause
> multi threaded git commands that print to the console to produce corrupted
> output or even crash.
>
> Additionally, winansi.c doesn't override all functions that can be used to
> print to the console (e.g. fwrite, write, fputc are missing), so that ANSI
> escapes don't work properly for some git commands (e.g. git-grep).
>
> Instead of doing ANSI emulation in just a few wrapped functions on top of
> the IO API, let's plug into the IO system and take advantage of the thread
> safety inherent to the IO system.
> ..."
>

Hmm, these last two paragraphs seems to suggest you are are changing
write, fwrite and fputc to go through the ansi-emulation also... But I
cant spot it in the code. Am I blind, or just misunderstanding the
text?

Apart from that, I like it.

>> > You're correct that UTF-8 sequences may be split (could happen every
>> > 1,024
>> > characters at the worst with a 4k buffer). I've already tackled this
>> > problem
>> > in the msys console handler, and I'll try to look into it here when I
>> > find
>> > the time.
>> >
>>
>> I'm looking forward to have a look at it :)
>>
>
> See the patch below, I'll integrate this in the next version.
>

After a quick glance at the code: looks good to me :)

>> In this case it doesn't help to take such care if your file can be
>> larger than the 4k-pipe you've created. Multiple files might be
>> interleaved in 4k blocks.
>>
>
> Only if the calling thread splits the files into 4k blocks and calls
> WriteFile/write/fwrite for each block. If you print the whole file in a
> single call, that call is uninterruptible by other threads (this is what
> git-grep does). The buffer size of the pipe doesn't affect the tread safety
> of stdio and Win32 APIs at all.
>

Not quite, but yes ;)

POSIX allows write to do an incomplete operation(in the case of more
than {PIPE_BUF} bytes, which is 4096 in our case). This is part of the
reason we have write_in_full in wrapper.c...
http://pubs.opengroup.org/onlinepubs/007904975/functions/write.html

"A write request for more than {PIPE_BUF} bytes shall cause one of the
following:

When at least one byte can be written, transfer what it can and return
the number of bytes written. When all data previously written to the
pipe is read, it shall transfer at least {PIPE_BUF} bytes.

When no data can be written, transfer no data, and return -1 with
errno set to [EAGAIN]."

So let's see what MSDN has to say:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365141(v=vs.85).aspx

"To write to the pipe, use the pipe's write handle in a call to the
WriteFile function. The WriteFile call does not return until it has
written the specified number of bytes to the pipe or an error occurs.
If the pipe buffer is full and there are more bytes to be written,
WriteFile does not return until another process reads from the pipe,
making more buffer space available."

Aha, so Windows is doing the sane thing. Good to know, and you are
indeed right. We're in the clear :)

A bit of a tangent, just to end of this part; git tries to protect
from outputting in the wrong order by locking a mutex around writing
it's output. So at this level it doesn't matter much anyway, I'm being
paranoid.

>> you call is_console(1) followed by is_console(2). This means that
>> "attr" and "negative" gets initialized twice. I guess your reasoning
>> is that there's only one console, so they'll be initialized to the
>> same value. I think this deserves a comment. Something like:
>>
>> + con1 = is_console(1);
>> + con2 = is_console(2); /* may re-initialize state */
>>
>> ...should be sufficient. But perhaps it's even better to simply keep
>> the old initialized-logic, since it's already there?
>
> Thanks, you're right, I didn't account for stderr / stdout going to
> different consoles :-)
> is_console is called only twice during initialization, so spending a static
> 'initialized' variable to prevent reinitialization seems overkill. I guess
> I'll go for the comment.

Works for me. But surely, your reasoning about "overkill" is
backwards: It's REMOVING the initialization-check that would be
overkill (as in code-churn), no?

> @@ -376,8 +377,29 @@ static DWORD WINAPI console_thread(LPVOID unused)
> }
>
> /* print remaining text unless we're parsing an escape
> sequence */
> - if (state == TEXT && end > start)
> - write_console(buffer + start, end - start);
> + if (state == TEXT && end > start) {
> + /* check for incomplete UTF-8 sequences and fix end
> accordingly */
> + if (buffer[end - 1] >= 0x80) {
> + if (buffer[end -1] >= 0xc0)
> + end--;
> + else if (end - 1 > start && buffer[end - 2]
>>= 0xe0)
> + end -= 2;
> + else if (end - 2 > start && buffer[end - 3]
>>= 0xf0)
> + end -= 3;
> + }

Minor nit: As you can see from gmail wrapping my lines here, you are
above 80 lines here and should wrap some of these lines.

Good read, I think we are VERY close to having a version of this
series that we can merge. Great work!

(Heh, Gmail scared me a bit with that ">>= 0xe0)"-line. But that's not
your fault, of course)

karste...@dcon.de

unread,

Nov 30, 2011, 7:47:31 PM11/30/11

to kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Erik Faye-Lund <kusm...@gmail.com> wrote on 30.11.2011 16:04:50: > On Wed, Nov 30, 2011 at 2:53 PM, <karste...@dcon.de> wrote: > > Erik Faye-Lund <kusm...@gmail.com> wrote on 30.11.2011 12:19:03: > >> On Mon, Nov 28, 2011 at 11:53 PM, <karste...@dcon.de> wrote: > >> > The only command that I'm aware of that prints to the console from > >> > multiple > >> > threads is git-grep (but this also uses write(), so console attributes > >> > currently don't work at all). > >> > >> Are you saying that the patch is purely theoretical? I'm not against > >> protecting ourselves from being stupid in the future, but I think it > >> should be reflected in the commit message (if my understanding is > >> correct)... > >> > > > > The thread safety part may be theoretical...at least I currently can't give > > you an example that reliably triggers the threading issue. I haven't > > analyzed all the pthread_create code paths though, so there may well be some > > git --no-pager ... command that does (probably with debug/trace output > > turned on). > > > > The patch fixes ANSI emulation in at least git-grep and git-diff and Unicode > > commit messages in git-show and git-log, so it has some practical > > applications (and if its only performance by not having to check isatty / > > GetConsoleScreenBufferInfo on each IO call). > > > > Thanks a lot for the explanation. Based on this, I agree that this is > worth fixing. >

Good :-)

> > My local commit message currently starts like so: > > "Winansi.c has many static variables that are accessed and modified from > > the [v][f]printf / fputs functions overridden in the file. This may cause > > multi threaded git commands that print to the console to produce corrupted > > output or even crash. > > > > Additionally, winansi.c doesn't override all functions that can be used to > > print to the console (e.g. fwrite, write, fputc are missing), so that ANSI > > escapes don't work properly for some git commands (e.g. git-grep). > > > > Instead of doing ANSI emulation in just a few wrapped functions on top of > > the IO API, let's plug into the IO system and take advantage of the thread > > safety inherent to the IO system. > > ..." > > > > Hmm, these last two paragraphs seems to suggest you are are changing > write, fwrite and fputc to go through the ansi-emulation also...

Yes, they do.

> But I > cant spot it in the code. Am I blind, or just misunderstanding the > text? >

Instead of overriding these functions (in fact I'm removing overrides for [v]fprintf and fputs), you could say I'm "underriding" them by replacing the stdout / stderr file descriptors with the pipe to the console thread. The only function I need to override is isatty to trick git into thinking its actually writing to a tty and activate ANSI emulation.
[...]

> > >> you call is_console(1) followed by is_console(2). This means that > >> "attr" and "negative" gets initialized twice. I guess your reasoning > >> is that there's only one console, so they'll be initialized to the > >> same value. I think this deserves a comment. Something like: > >> > >> + con1 = is_console(1); > >> + con2 = is_console(2); /* may re-initialize state */ > >> > >> ...should be sufficient. But perhaps it's even better to simply keep > >> the old initialized-logic, since it's already there? > > > > Thanks, you're right, I didn't account for stderr / stdout going to > > different consoles :-) > > is_console is called only twice during initialization, so spending a static > > 'initialized' variable to prevent reinitialization seems overkill. I guess > > I'll go for the comment. > > Works for me. But surely, your reasoning about "overkill" is > backwards: It's REMOVING the initialization-check that would be > overkill (as in code-churn), no? >

Depends on whether you prefer minimal changes or minimal code. Future developers will look at the current state of the code, not the changes that led there, and every line less to read and understand saves them time, right?

> > @@ -376,8 +377,29 @@ static DWORD WINAPI console_thread(LPVOID unused) > > } > > > > /* print remaining text unless we're parsing an escape > > sequence */ > > - if (state == TEXT && end > start) > > - write_console(buffer + start, end - start); > > + if (state == TEXT && end > start) { > > + /* check for incomplete UTF-8 sequences and fix end > > accordingly */ > > + if (buffer[end - 1] >= 0x80) { > > + if (buffer[end -1] >= 0xc0) > > + end--; > > + else if (end - 1 > start && buffer[end - 2] > >>= 0xe0) > > + end -= 2; > > + else if (end - 2 > start && buffer[end - 3] > >>= 0xf0) > > + end -= 3; > > + } > > Minor nit: As you can see from gmail wrapping my lines here, you are > above 80 lines here and should wrap some of these lines. >

Preferred tab width for git development is 8 then? I've used 4 so far (default for Eclipse's BSD style).

> Good read, I think we are VERY close to having a version of this > series that we can merge. Great work! >

OK, I'll roll out v13 then.
Thanks

> (Heh, Gmail scared me a bit with that ">>= 0xe0)"-line. But that's not > your fault, of course)

Yeah, that part got shifted right to the next line, literally :-)

Johannes Schindelin

unread,

Nov 30, 2011, 8:38:46 PM11/30/11

to karste...@dcon.de, kusm...@gmail.com, Karsten Blees, msy...@googlegroups.com

Hi,

On Thu, 1 Dec 2011, karste...@dcon.de wrote:

> Instead of overriding these functions (in fact I'm removing overrides
> for [v]fprintf and fputs), you could say I'm "underriding" them by
> replacing the stdout / stderr file descriptors with the pipe to the
> console thread.

Clever.

> Depends on whether you prefer minimal changes or minimal code. Future
> developers will look at the current state of the code, not the changes
> that led there, and every line less to read and understand saves them
> time, right?

At the moment the problem is getting past the maintainer(s). And I
seriously like minimal changes to well-exercized code much more than
maximal changes for a future benefit that will not benefit me ;-)

So I'd really appreciate minimal changes. Really.

Thanks!
Dscho

Atsushi Nakagawa

unread,

Dec 6, 2011, 9:42:43 PM12/6/11

to Konstantin Khomoutov, Michael Geddes, msy...@googlegroups.com

On Sat, 15 Oct 2011 17:34:08 +0400
Konstantin Khomoutov <flat...@users.sourceforge.net> wrote:
> Michael Geddes <mic...@wheelycreek.net> wrote:
> > [...]

> I think that currently the only way to reliably get UTF-8 in commit
> messages when using Vim is to set core.editor (or the GIT_EDITOR
> environment variable) to
> vim.exe -c "set enc=utf-8"
> or
> vim.exe -c "set fenc=utf-8"
> (I think the former is slightly more logical these days).

I think the former would also require a "set tenc=ANSI" (where ANSI is
the specific ANSI codepage (ACP), as per Microsoft's definition, of the
system).

The discussion here looks at the encoding of messages in new
commits, but we should also look at what happens when vim reloads
previous commit messages (e.g. for `rebase -i' and `commit --amend').

In order to correctly handle those situations as well, the options will
probably become:

vim --cmd "set enc=utf-8 tenc=ANSI"
or
vim --cmd "set fencs=utf-8"

(i.e. `--cmd' rather than `-c' and `fencs' instead of `fenc')

<aside>
The brief explanation is: (so that other readers aren't forced to find a
manual)
* `--cmd' runs before vim loads the file, while `-c' runs after.
* the longer forms of `enc', `tenc', `fenc', `fencs' are `encoding',
`termencoding', `fileencoding' and `fileencodings'.
* `fencs' is a list encodings vim will chooses from when loading
a file. `fenc' will become the encoding vim has chosen. Changing `fenc'
changes the encoding vim uses when saving.
* Setting blanks (either explicitly, or by having no defaults) for `tenc',
`fenc' and `fencs' results in their respective values being the same as
`enc'.
</aside>

So it's a toss-up between these defaults:
set enc=utf-8 fencs= tenc=ANSI
or
set enc=ANSI fencs=utf-8 tenc=

Maybe the former is conceptually better, but practically I don't think
the latter is less so.

The former presents a challenge in that I don't know how to set "ANSI"
(the system dependent value, which is different from what vim means by
"ansi", and is "cp932" here). `enc' starts off with the ANSI setting
and I think it's done internally by vim.

Another consideration is that the former doesn't work properly here (I
can't input native characters with that setting because vim goes out of
INSERT mode when I try). A short discussion at [1] says that vim is
flaky in unicode mode so maybe it's related to that.

[1] http://groups.google.com/group/vim_dev/browse_frm/thread/a09c482b063f818b/3dd77f3014a60e43

Anyways, I propose the latter approach and will put together a patch as
soon as I've overcome some technical and time constraint hurdles. Maybe
I'll post an RFC of what I'm thinking of doing before that because I'm
on an network where I can't checkout external repositories.

> Messing with defaults is dangerous because on Windows you usually
> expect test editors to write out files in ANSI code page.

It's could be best if it can be limited to COMMIT_EDITMSG, but since
that's a technical challenge in itself, I'm thinking it might be okay if
the defaults were changed via "C:\Program Files\Git\share\vim\vimrc"
because that's under "Git for Windows"s install path.

--
Atsushi Nakagawa
<at...@chejz.com>
Changes are made when there is inconvenience.